Troubleshooting¶
Common issues and solutions for your homelab deployment. This guide covers the most frequently encountered problems across different service categories.
Table of Contents¶
- OCFS2 Cluster - IP Address Changes
- Service Deployment Failures
- SSL Certificate Issues
- DNS Resolution Problems
- Secondary DNS and Pi-hole Sync
- Docker Swarm Networking
- Authentik SSO Integration
- Storage Mount Failures
- Database Performance Issues
OCFS2 Cluster - IP Address Changes¶
Issue: After a network change, nodes with new IP addresses fail to mount OCFS2 filesystems with error -107 (ENOTCONN) in ocfs2_dlm_init.
Symptoms:
o2net: Connection to node <name> shutdown, state 7
o2net: No connection established with node X after 30.0 seconds
o2cb: This node could not connect to nodes
(mount.ocfs2): ERROR: status = -107
Root Cause: O2CB cluster caches node IP addresses in kernel state (/sys/kernel/config/cluster/homelab/node/*/ipv4_address). When node IP addresses change, the kernel state becomes stale and O2NET connections fail during handshake.
Solution:

1. Remove the affected nodes from the cluster.
2. Stop O2CB on all nodes.
3. Re-add the nodes with their new IPs.
4. Restart O2CB.
5. If kernel state persists: reboot any nodes that still show old IPs in /sys/kernel/config/cluster/homelab/node/*/ipv4_address.
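The recovery steps above can be sketched with the `o2cb` tool from ocfs2-tools. The cluster name `homelab` comes from this deployment; node names and IP addresses are placeholders for your environment.

```shell
# 1. Remove the affected node from the cluster definition
sudo o2cb remove-node homelab node1

# 2. Stop O2CB on all nodes
sudo systemctl stop o2cb

# 3. Re-add the node with its new IP
sudo o2cb add-node --ip 192.168.1.21 homelab node1

# 4. Restart O2CB so the cluster re-registers
sudo systemctl start o2cb

# 5. Check for stale kernel state; reboot the node if an old IP persists
cat /sys/kernel/config/cluster/homelab/node/*/ipv4_address
```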
Prevention: After IP address changes, reboot all cluster nodes to ensure clean O2CB registration with new IPs.
Service Deployment Failures¶
Issue: Docker stack fails to deploy, or services restart repeatedly after deployment.
Stack Fails to Deploy¶
Symptoms:
Creating service myapp_service
failed to create service myapp_service: Error response from daemon: ...
Common Causes:
- Missing environment variables in .env file
- Invalid Docker Compose syntax
- Missing Docker secrets
- Network not created
- Node constraints not met (labels missing)
Solution:

1. Verify environment variables are present in the .env file.
2. Validate the Docker Compose syntax.
3. Check that the required Docker secrets exist.
4. Verify the network exists.
5. Check node labels match the placement constraints.
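The checks above can be sketched as follows; the stack name, network name, and node hostname are placeholders for your deployment.

```shell
# 1. Verify required environment variables are set (comments stripped)
grep -v '^#' .env

# 2. Validate compose syntax — prints the resolved config or an error
docker compose -f docker-compose.yml config

# 3. Check that secrets referenced by the stack exist
docker secret ls

# 4. Verify the overlay network exists
docker network ls --filter name=traefik-public

# 5. Check node labels used in placement constraints
docker node inspect <node-hostname> --format '{{json .Spec.Labels}}'
```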
Container Restarts Repeatedly¶
Symptoms:
Common Causes:

- Application configuration errors
- Missing volume mounts
- Database connection failures
- Port conflicts
- Health check failures
Solution:

1. Check the service logs.
2. Inspect the service tasks.
3. Verify volume mounts.
4. Check for port conflicts.
5. Test database connectivity.
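A sketch of the checks above; service names, ports, and the database host are placeholders.

```shell
# 1. Check service logs for the crash reason
docker service logs myapp_service --tail 100

# 2. Inspect tasks; --no-trunc shows the full error message
docker service ps myapp_service --no-trunc

# 3. Verify volume mounts resolved as expected
docker service inspect myapp_service \
  --format '{{json .Spec.TaskTemplate.ContainerSpec.Mounts}}'

# 4. Check for port conflicts on the host
sudo ss -tlnp | grep 8080

# 5. Test database connectivity from inside the app container
docker exec -it <container-id> nc -zv db-host 5432
```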
Health Check Failures¶
Symptoms:
Service shows as running but marked unhealthy in docker service ps.
Solution:

1. Check the health check configuration in docker-compose.yml.
2. Test the health check command manually.
3. Increase the timeout or retries if the service is slow to start.
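A minimal sketch of steps 1 and 2, assuming an HTTP health endpoint; adjust the command and port to match your service's healthcheck definition.

```shell
# 1. See the configured health check and recent probe results
docker inspect <container-id> --format '{{json .State.Health}}'

# 2. Run the health check command manually inside the container
docker exec -it <container-id> wget -qO- http://localhost:8080/health
```

If the manual check passes but Docker still marks the service unhealthy, raising `start_period`, `timeout`, or `retries` in the compose healthcheck usually resolves it.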
Prevention:
- Always validate compose files with docker compose config before deployment
- Test services locally before deploying to swarm
- Use docker service logs immediately after deployment to catch early errors
SSL Certificate Issues¶
Issue: HTTPS not working, certificate not issued, or Cloudflare DNS challenges failing.
Certificate Not Issued¶
Symptoms:

- Service accessible via HTTP but not HTTPS
- Browser shows "connection not secure"
- Traefik dashboard shows no certificate

Common Causes:

- Cloudflare API token invalid or expired
- DNS records not pointing to the server
- Let's Encrypt rate limits hit
- Traefik not configured for the certificate resolver
Solution:

1. Check the Traefik logs.
2. Verify the Cloudflare API token.
3. Verify the DNS records.
4. Check the acme.json file.
5. Check the Traefik labels on the service.
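A sketch of the checks above; the domain, token variable, and acme.json path are placeholders.

```shell
# 1. Check Traefik logs for ACME errors
docker service logs reverse-proxy_traefik --tail 200 | grep -i acme

# 2. Verify the Cloudflare API token is still valid
curl -s -H "Authorization: Bearer $CF_DNS_API_TOKEN" \
  https://api.cloudflare.com/client/v4/user/tokens/verify

# 3. Verify the DNS record resolves to your server
dig +short myapp.yourdomain.com

# 4. acme.json must exist with mode 600, or Traefik refuses to use it
ls -l /path/to/acme.json

# 5. Confirm the service requests a certificate via its labels
docker service inspect myapp_service --format '{{json .Spec.Labels}}' | grep -i tls
```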
Certificate Expired or Invalid¶
Symptoms:

- Browser shows certificate error
- Certificate expiry warning
Solution:

1. Force certificate renewal.
2. Check Let's Encrypt rate limits:
   - Limit: 50 certificates per week per registered domain
   - Use the staging environment for testing
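One common way to force renewal with Traefik's ACME storage is to remove acme.json and restart Traefik, which makes it request fresh certificates; the file path here is a placeholder, and taking a backup first is strongly advised.

```shell
# 1. Back up, then remove the stored certificates
cp /path/to/acme.json /path/to/acme.json.bak
rm /path/to/acme.json

# 2. Restart Traefik so it re-requests certificates from Let's Encrypt
docker service update --force reverse-proxy_traefik
```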
Prevention:

- Monitor certificate expiry with Uptime Kuma
- Ensure the Cloudflare API token doesn't expire
- Use wildcard certificates to reduce certificate count
- Test with the Let's Encrypt staging environment first
DNS Resolution Problems¶
Issue: Services not accessible by domain name, or DNS server not responding.
Services Not Accessible by Domain¶
Symptoms:
Common Causes:

- Technitium DNS not running
- DNS records not configured
- Client not using the correct DNS server
- Firewall blocking DNS port 53 (or the Technitium web console on port 5380)
Solution:

1. Check the Technitium DNS status.
2. Verify the DNS records in Technitium:
   - Access the Technitium UI at http://<server-ip>:5380
   - Check A records for your domain
   - Verify wildcard records (*.yourdomain.com)
3. Test DNS resolution.
4. Check the client DNS configuration.
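Steps 1, 3, and 4 can be sketched as follows; the Technitium service name is an assumption about how the stack is named in your swarm, and the IPs are placeholders.

```shell
# 1. Check the Technitium DNS service is running
docker service ps technitium_dns-server

# 3. Test resolution directly against the DNS server
dig @<server-ip> myapp.yourdomain.com

# 4. Check which DNS server the client actually uses (systemd-resolved)
resolvectl status | grep 'DNS Servers'
```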
DNS Server Not Responding¶
Symptoms:
Solution:

1. Check the DNS service is running.
2. Verify port 53 is open.
3. Check the firewall rules.
4. Restart the DNS service.
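A sketch of the checks above, assuming ufw as the firewall and the same Technitium service name as earlier; adjust both to your environment.

```shell
# 2. Verify something is listening on port 53 (UDP and TCP)
sudo ss -ulnp 'sport = :53'
sudo ss -tlnp 'sport = :53'

# 3. Check firewall rules allow DNS traffic
sudo ufw status | grep 53

# 4. Restart the DNS service
docker service update --force technitium_dns-server
```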
Prevention:

- Configure DNS records before deploying services
- Use wildcard DNS records (*.yourdomain.com) for easier management
- Monitor the DNS service with Uptime Kuma
- Document all DNS records
Secondary DNS and Pi-hole Sync¶
Issues specific to the optional Pi-hole secondary DNS feature (SECONDARY_DNS_ENABLED=true).
Prerequisites before enabling:
- Pi-hole v6.3+ (sudo pihole -up to upgrade)
- API writes enabled: sudo pihole-FTL --config webserver.api.app_sudo true
Verify sync after registration:
# Query Pi-hole directly
dig @<pihole-ip> grafana.yourdomain.com
# Or check Pi-hole Admin UI → Local DNS → CNAME Records
Pi-hole API Returns 403 Forbidden¶
Symptoms:
fatal: [localhost]: FAILED! => {"status": 403, "json": {"error": {"key": "app_sudo_disabled", ...}}}
Cause: Pi-hole's API write permission (app_sudo) is disabled.
Fix:
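The fix is the prerequisite command listed above: enable API writes on the Pi-hole host, then re-run the registration.

```shell
# Enable app_sudo so the API can write local DNS records (Pi-hole v6)
sudo pihole-FTL --config webserver.api.app_sudo true
```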
Connection Refused During DNS Registration¶
Symptoms:
A few CNAMEs register successfully, then subsequent ones fail with connection refused.

Cause: Pi-hole below v6.3 restarts its FTL resolver after every CNAME change made via the API. Each restart takes ~5 seconds, during which the API is unreachable. The playbooks use ?restart=false to prevent this, which requires v6.3+.
Fix: Upgrade Pi-hole, then re-run DNS registration:
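The upgrade itself can be sketched as follows (the registration playbook invocation depends on your setup and is not shown):

```shell
# Upgrade Pi-hole in place, then confirm the version is v6.3+
sudo pihole -up
pihole -v
```

Once on v6.3+, re-run your DNS registration and verify with the `dig` check shown in the prerequisites section.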
Pi-hole Has A Records Instead of CNAMEs for Services¶
Symptoms: Pi-hole's custom DNS list shows service hostnames (e.g. grafana.yourdomain.com) with IP addresses instead of CNAME targets.
Fix: Run the cleanup task, then re-register:
Pi-hole Not Resolving Services When Primary is Down¶
Symptoms: Technitium is unreachable, but Pi-hole also fails to resolve service hostnames.
Checklist:

1. Check Pi-hole Admin UI → Local DNS → CNAME Records — service hostnames should be listed
2. Confirm your router falls back to Pi-hole when Technitium is unreachable
3. Re-run DNS registration if records are missing
Docker Swarm Networking¶
Issue: Overlay network issues, services can't communicate, or node connectivity problems.
Services Can't Communicate¶
Symptoms:

- Service A cannot reach Service B
- Connection timeouts between containers
- Services on different nodes can't communicate

Common Causes:

- Services not on the same overlay network
- Firewall blocking overlay network ports (4789/udp)
- Network encryption issues
- MTU size mismatch
Solution:

1. Verify the services are on the same overlay network.
2. Check overlay network connectivity.
3. Verify the firewall rules allow overlay traffic.
4. Check the MTU size.
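A sketch of the checks above, assuming ufw; service and container names are placeholders.

```shell
# 1. List which networks the service is attached to
docker service inspect myapp_service \
  --format '{{json .Spec.TaskTemplate.Networks}}'

# 2. Test container-to-container reachability across the overlay
docker exec -it <container-a> ping -c 3 <service-b-name>

# 3. Overlay traffic needs 4789/udp (VXLAN) and 7946/tcp+udp (gossip)
sudo ufw allow 4789/udp
sudo ufw allow 7946

# 4. Overlay interfaces use a 1450-byte MTU; the underlay must accommodate it
ip link show | grep mtu
```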
Node Connectivity Problems¶
Symptoms:
Solution:

1. Check node availability.
2. Verify the Swarm ports are open.
3. Check node-to-node connectivity.
4. Rejoin the node to the swarm if needed.
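A sketch of the steps above; hostnames and IPs are placeholders.

```shell
# 1. Check node availability and status
docker node ls

# 2. Swarm needs 2377/tcp (management), 7946/tcp+udp (gossip), 4789/udp (overlay)
nc -zv <manager-ip> 2377

# 3. Basic connectivity between nodes
ping -c 3 <other-node-ip>

# 4. Rejoin: get a fresh token on a manager, then rejoin from the broken node
docker swarm join-token worker                    # on a manager
docker swarm leave                                # on the broken node
docker swarm join --token <token> <manager-ip>:2377
```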
Traefik 502 Bad Gateway — Overlay ARP FAILED (Stale Entries)¶
Symptoms:
- Most or all services return 502 Bad Gateway
- Traefik logs show repeated "no route to host" errors
- docker service ls still shows all replicas as running
Root Cause: A known Docker bug (moby #50232) where ARP entries in the overlay network namespace are never garbage collected. After containers restart or get rescheduled (getting new IPs), their old IPs remain in the ARP table and become FAILED. Traefik keeps routing to these dead IPs.
Identify:
# Check for FAILED ARP entries inside Traefik's network namespace
docker exec $(docker ps -q --filter name=reverse-proxy_traefik) ip neigh show | grep FAILED
# Example output showing stale entries:
# 10.0.1.32 dev eth2 used 0/0/0 probes 6 FAILED
# 10.0.1.7 dev eth2 used 0/0/0 probes 6 FAILED
# Cross-reference with which services own those IPs
docker network inspect traefik-public --format '{{json .Containers}}' | \
python3 -c "import json,sys; d=json.load(sys.stdin); [print(v['IPv4Address'], v['Name']) for v in d.values()]"
Fix:

1. Force-update the affected backend services (not Traefik) to get fresh IPs. Run this for each service whose IP shows as FAILED: Docker tears down and recreates the container with a new IP, clearing the stale ARP state.
2. If only a few services are affected, identify them by matching the FAILED IPs against the network inspect output above, then force-update only those services.
3. If all services are affected, force-update the most critical ones first (e.g. authentik_server, since it handles SSO for everything else).
4. Verify recovery.
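The fix and verification steps above can be sketched as:

```shell
# 1. Force-update each service whose overlay IP shows as FAILED
docker service update --force authentik_server

# 4. Verify recovery — no FAILED entries should remain in Traefik's namespace
docker exec $(docker ps -q --filter name=reverse-proxy_traefik) \
  ip neigh show | grep -c FAILED
```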
Deeper Fix — Overlay Sandbox Veth Repair (no downtime):
If force-updates don't fix it, the overlay sandbox bridge may have veth pairs stuck in the host
namespace (never attached to the sandbox). Symptoms: container's eth2 shows NO-CARRIER/DOWN
even in a freshly started container. Fix without restarting Docker:
# 1. Identify the overlay sandbox namespace for traefik-public
# (first 12 chars of network ID, prefixed with "4-")
docker network inspect traefik-public --format '{{.Id}}'
# e.g. v8mwtol6eu6h... → sandbox is "4-v8mwtol6eu"
# 2. Attach all broken overlay veths to the sandbox bridge
docker run --rm --privileged --pid=host --net=host -v /run/docker/netns:/netns alpine sh -c "
apk add -q iproute2 &&
for veth in \$(ip link | grep 'mtu 1450' | grep 'state DOWN' | grep -v 'M-DOWN' | awk -F': ' '{print \$2}' | awk -F'@' '{print \$1}'); do
ip link set \$veth netns /netns/4-v8mwtol6eu &&
nsenter --net=/netns/4-v8mwtol6eu -- ip link set \$veth master br0 &&
nsenter --net=/netns/4-v8mwtol6eu -- ip link set \$veth up &&
echo \"OK: \$veth\" || echo \"FAILED: \$veth\"
done
"
# 3. Verify — should show 0
docker exec \$(docker ps -q --filter name=reverse-proxy_traefik) ip neigh show | grep -c FAILED
Last Resort — Docker Daemon Restart:
If the sandbox veth repair doesn't work, restart Docker on the affected node:
# 1. Drain the node first to avoid split-brain
docker node update --availability drain <node-hostname>
# 2. Wait for tasks to migrate, then restart Docker
sudo systemctl restart docker
# 3. Restore node to active
docker node update --availability active <node-hostname>
# 4. Force-update Traefik to reconnect to overlay
docker service update --force reverse-proxy_traefik
Prevention: This is an unfixed upstream Docker bug. To reduce its frequency:

- Avoid frequent service restarts/updates during peak hours
- Consider pinning services to specific nodes with placement constraints to reduce IP churn across the overlay
Network Not Found¶
Symptoms:
Solution:

1. Create the missing network.
2. Verify the network exists.
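A sketch of the two steps, using the `traefik-public` network referenced earlier in this guide:

```shell
# 1. Create the missing overlay network (--attachable lets standalone
#    containers join it, not just swarm services)
docker network create --driver overlay --attachable traefik-public

# 2. Verify it exists
docker network ls --filter name=traefik-public
```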
Prevention:
- Document required firewall ports
- Use network monitoring to detect connectivity issues
- Keep Docker Engine updated on all nodes
- Regularly check node health with docker node ls
Authentik SSO Integration¶
Issue: Forward auth not working, OAuth redirect errors, or LDAP authentication failures.
Forward Auth Not Working¶
Symptoms:

- Accessing the service shows "502 Bad Gateway"
- No SSO login prompt appears
- Traefik returns authentication errors

Common Causes:

- Authentik proxy outpost not running
- Middleware not configured correctly
- Authentik host URL incorrect
- Service not configured to use the middleware
Solution:

1. Verify the Authentik services are running.
2. Check the Authentik proxy outpost logs.
3. Verify the Traefik middleware configuration.
4. Check the service uses the middleware.
5. Verify the Authentik host URL.
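A sketch of the checks above; the Authentik service names are assumptions about how the stack is named in your swarm.

```shell
# 1. Verify the Authentik services are running
docker service ls --filter name=authentik

# 2. Check the proxy outpost logs for connection errors
docker service logs authentik_proxy --tail 100

# 3/4. Confirm the service references the forward-auth middleware
docker service inspect myapp_service \
  --format '{{json .Spec.Labels}}' | grep -i middleware

# 5. The outpost must be able to reach Authentik's health endpoint
docker exec -it <outpost-container> \
  wget -qO- https://auth.yourdomain.com/-/health/live/
```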
OAuth Redirect Errors¶
Symptoms:

- "Invalid redirect URI" error
- "OAuth callback failed"
- User redirected to the wrong URL after login
Solution:

1. Check the OAuth provider configuration in Authentik:
   - Log in to Authentik at https://auth.yourdomain.com
   - Go to Applications → Providers
   - Verify the redirect URIs match the service callback URLs
   - Common format: https://myapp.yourdomain.com/oauth/callback
2. Verify the service OAuth configuration.
3. Check the environment variables are set.
4. Test the OAuth flow:
   - Clear browser cache and cookies
   - Try the SSO login
   - Check the browser developer console for errors
LDAP Authentication Fails¶
Symptoms:

- Service shows "LDAP bind failed"
- Cannot authenticate with LDAP credentials
- Connection timeouts to the LDAP server
Solution:

1. Verify the Authentik LDAP outpost is running.
2. Check LDAP port accessibility.
3. Verify the service LDAP configuration.
4. Test an LDAP bind.
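A sketch of steps 1, 2, and 4; the outpost service name, bind DN, and base DN are assumptions (the base DN shown is Authentik's default) — adjust them to your LDAP provider settings.

```shell
# 1. Verify the LDAP outpost is running
docker service ls --filter name=ldap

# 2. Check the LDAP port is reachable (389 plain, 636 TLS)
nc -zv <ldap-host> 389

# 4. Test a bind with ldapsearch (from the ldap-utils package)
ldapsearch -x -H ldap://<ldap-host>:389 \
  -D "cn=ldap-bind,ou=users,dc=ldap,dc=goauthentik,dc=io" \
  -w "<bind-password>" \
  -b "dc=ldap,dc=goauthentik,dc=io" "(objectClass=user)"
```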
Prevention:

- Document all OAuth redirect URIs
- Test SSO integration before deploying to production
- Use Authentik's application wizard for consistent configuration
- Monitor Authentik service health
- Keep Authentik outpost tokens secure and don't let them expire
Storage Mount Failures¶
Issue: CIFS/SMB mounts not working, iSCSI mount issues, or permission denied errors.
CIFS/SMB Mounts Not Working¶
Symptoms:
Or:

Common Causes:

- NAS server not reachable
- Incorrect SMB credentials
- Mount point not created
- Network connectivity issues
- SMB version mismatch
Solution:

1. Verify the NAS is reachable.
2. Check the mount configuration in docker-compose.yml.
3. Verify the CIFS mount on the host.
4. Check the SMB credentials.
5. Create the mount points if missing.
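A sketch of the host-side checks; the NAS address, share name, credentials file path, and mount points are placeholders.

```shell
# 1. Verify the NAS is reachable
ping -c 3 <nas-ip>

# 3. Verify the CIFS mount exists on the host
mount | grep cifs

# 4. Test the credentials with a manual mount (vers= must match the NAS)
sudo mount -t cifs //<nas-ip>/share /mnt/test \
  -o credentials=/etc/samba/credentials,vers=3.0,uid=1000,gid=1000

# 5. Create missing mount points
sudo mkdir -p /mnt/nas/share
```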
iSCSI Mount Issues¶
Symptoms:
ls /mnt/iscsi/app-data
# ls: cannot access '/mnt/iscsi/app-data': Transport endpoint is not connected
Solution:

1. Check the iSCSI session.
2. Verify the OCFS2 cluster status.
3. Check the mount status.
4. Restart iSCSI and OCFS2.
5. Re-mount if needed.
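A sketch of the steps above, using open-iscsi and ocfs2-tools; the block device path is a placeholder for your iSCSI LUN.

```shell
# 1. Check active iSCSI sessions
sudo iscsiadm -m session

# 2. Verify the OCFS2 cluster service is up
sudo systemctl status o2cb

# 3. Check mount status
mount | grep ocfs2

# 4. Restart iSCSI and O2CB
sudo systemctl restart iscsid o2cb

# 5. Re-mount the filesystem
sudo mount -t ocfs2 /dev/sdb1 /mnt/iscsi/app-data
```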
Permission Denied Errors¶
Symptoms:
Solution:

1. Check the file permissions on the host.
2. Verify the UID/GID match between the host and the container.
3. For CIFS mounts, check the mount options.
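A sketch of the checks above; the data path is a placeholder. This deployment standardizes on 1000:1000.

```shell
# 1. Show numeric owner/group of the data directory
ls -ln /mnt/data/myapp

# 2. Compare with the UID/GID the container actually runs as
docker exec -it <container-id> id

# 3. CIFS mounts need explicit uid/gid options to match
mount | grep cifs     # look for uid=1000,gid=1000 in the options
```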
Prevention:

- Use Ansible to configure mounts consistently
- Document mount points and credentials
- Test mounts before deploying services
- Monitor mount health with automated checks
- Use a consistent UID/GID across services (1000:1000)
Database Performance Issues¶
Issue: Slow queries, connection timeouts, or performance degradation for database-heavy services (Immich, LibreChat).
Slow Queries or Timeouts¶
Symptoms:

- Application shows "Database connection timeout"
- Web UI is extremely slow
- Services restart due to health check failures
- High CPU usage on the database container

Common Causes:

- Database running on network storage (CIFS/SMB) instead of local storage
- Insufficient resources allocated to the database
- Database not placed on the correct node
- Database needs vacuuming or optimization
Solution:

1. CRITICAL: Verify the database is on local storage.
2. Verify the database volume is local, not network-backed.
3. Check the database logs.
4. Monitor the database resource usage.
5. For PostgreSQL, vacuum the database.
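A sketch of the checks above; volume, service, and container names are placeholders.

```shell
# 1/2. Confirm the volume path is local, not a CIFS/SMB mount
docker volume inspect myapp_db-data --format '{{.Mountpoint}}'
df -hT /var/lib/docker/volumes    # a "cifs" type here means network storage

# 3. Check the database logs
docker service logs myapp_postgres --tail 100

# 4. Monitor resource usage
docker stats --no-stream

# 5. Vacuum and refresh planner statistics for PostgreSQL
docker exec -it <postgres-container> psql -U postgres -c "VACUUM ANALYZE;"
```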
Connection Errors¶
Symptoms:
Solution:

1. Verify the database service is running.
2. Check the database credentials.
3. Test the database connection from the application container.
4. Check the database is ready.
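A sketch of the checks above; service, container, and user names are placeholders.

```shell
# 1. Verify the database service is running
docker service ps myapp_postgres

# 3. Test connectivity from the application container
docker exec -it <app-container> nc -zv postgres 5432

# 4. pg_isready reports whether PostgreSQL is accepting connections
docker exec -it <postgres-container> pg_isready -U postgres
```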
Performance Degradation Over Time¶
Symptoms:

- Application was fast but is now slow
- Database size growing rapidly
- Disk I/O very high
Solution:

1. Check the database size.
2. Optimize PostgreSQL.
3. Check the disk space on the database node.
4. Increase the database resources if needed.
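A sketch of the steps above; database, container, and service names plus the resource limits are placeholders.

```shell
# 1. Check the database size
docker exec -it <postgres-container> \
  psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('myapp'));"

# 2. Reclaim space and refresh statistics
#    (VACUUM FULL takes exclusive locks — run during a maintenance window)
docker exec -it <postgres-container> psql -U postgres -c "VACUUM FULL ANALYZE;"

# 3. Check disk space on the database node
df -h /var/lib/docker

# 4. Raise the service's resource limits
docker service update --limit-memory 2G --limit-cpu 2 myapp_postgres
```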
Prevention:
- ALWAYS run PostgreSQL and MongoDB on local storage, never network storage
- Set node labels correctly: docker node update --label-add database=true <node>
- Use node with fast SSD for database workloads
- Monitor database size and performance with Prometheus/Grafana
- Schedule regular maintenance (VACUUM for PostgreSQL)
- Allocate sufficient resources for database containers
- Keep database versions updated
Database Node Label Missing¶
Symptoms:
Solution:

1. Add the database node label.
2. Redeploy the service.
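The two steps above can be sketched with the label from the Prevention notes; the node hostname and stack name are placeholders.

```shell
# 1. Add the database node label so placement constraints can be satisfied
docker node update --label-add database=true <node-hostname>

# 2. Redeploy the stack
task ansible:deploy:stack -- -e "stack_name=myapp"
```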
Critical Note: For services like Immich and LibreChat with PostgreSQL or MongoDB databases, local storage is mandatory for acceptable performance. Network storage (CIFS/SMB) will result in extremely slow performance, connection timeouts, and health check failures.
General Troubleshooting Tips¶
Check Service Logs¶
# Follow logs in real-time
docker service logs <service-name> --tail 100 --follow
# Search logs for errors
docker service logs <service-name> --tail 1000 | grep -i error
Inspect Service Configuration¶
# View service details
docker service inspect <service-name>
# View service tasks (replicas)
docker service ps <service-name> --no-trunc
# View service placement constraints
docker service inspect <service-name> --format '{{.Spec.TaskTemplate.Placement}}'
Restart Service¶
# Restart single service
docker service update --force <service-name>
# Redeploy entire stack
task ansible:deploy:stack -- -e "stack_name=myapp"
Check System Resources¶
# Check disk space
df -h
# Check memory usage
free -h
# Check CPU usage
top
# Check Docker resources
docker system df
Get Shell Access to Container¶
# Find container
docker ps | grep myapp
# Execute shell
docker exec -it <container-id> sh
# Or bash if available
docker exec -it <container-id> bash
Need More Help?¶
If you're still experiencing issues:
- Check the service-specific documentation in /stacks/apps/<service>/README.md
- Review recent changes with git log
- Search existing issues on GitHub
- Create new issue with detailed logs and configuration