Troubleshooting¶
Common issues and solutions for your homelab deployment. This guide covers the most frequently encountered problems across different service categories.
Table of Contents¶
- OCFS2 Cluster - IP Address Changes
- Service Deployment Failures
- SSL Certificate Issues
- DNS Resolution Problems
- Secondary DNS and Pi-hole Sync
- Docker Swarm Networking
- Authentik SSO Integration
- Storage Mount Failures
- Database Performance Issues
OCFS2 Cluster - IP Address Changes¶
Issue: After a network change, nodes with new IP addresses fail to mount OCFS2 filesystems with error -107 (ENOTCONN) in ocfs2_dlm_init.
Symptoms:
o2net: Connection to node <name> shutdown, state 7
o2net: No connection established with node X after 30.0 seconds
o2cb: This node could not connect to nodes
(mount.ocfs2): ERROR: status = -107
Root Cause: O2CB cluster caches node IP addresses in kernel state (/sys/kernel/config/cluster/homelab/node/*/ipv4_address). When node IP addresses change, the kernel state becomes stale and O2NET connections fail during handshake.
Solution:

1. Remove the affected nodes from the cluster.
2. Stop O2CB on all nodes.
3. Re-add the nodes with their new IPs.
4. Restart O2CB.
5. If kernel state persists: reboot any nodes that still show old IPs in /sys/kernel/config/cluster/homelab/node/*/ipv4_address.
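The recovery steps above can be sketched with the `o2cb` tool from ocfs2-tools. The cluster name `homelab` comes from this deployment; node names and IP addresses are placeholders for your environment.

```shell
# 1. Remove the affected node from the cluster definition
sudo o2cb remove-node homelab node1

# 2. Stop O2CB on all nodes
sudo systemctl stop o2cb

# 3. Re-add the node with its new IP
sudo o2cb add-node --ip 192.168.1.21 homelab node1

# 4. Restart O2CB so the cluster re-registers
sudo systemctl start o2cb

# 5. Check for stale kernel state; reboot the node if an old IP persists
cat /sys/kernel/config/cluster/homelab/node/*/ipv4_address
```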
Prevention: After IP address changes, reboot all cluster nodes to ensure clean O2CB registration with new IPs.
Service Deployment Failures¶
Issue: Docker stack fails to deploy, or services restart repeatedly after deployment.
Stack Fails to Deploy¶
Symptoms:
Creating service myapp_service
failed to create service myapp_service: Error response from daemon: ...
Common Causes:
- Missing environment variables in .env file
- Invalid Docker Compose syntax
- Missing Docker secrets
- Network not created
- Node constraints not met (labels missing)
Solution:

1. Verify environment variables are present in the .env file.
2. Validate the Docker Compose syntax.
3. Check that the required Docker secrets exist.
4. Verify the network exists.
5. Check node labels match the placement constraints.
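The checks above can be sketched as follows; the stack name, network name, and node hostname are placeholders for your deployment.

```shell
# 1. Verify required environment variables are set (comments stripped)
grep -v '^#' .env

# 2. Validate compose syntax — prints the resolved config or an error
docker compose -f docker-compose.yml config

# 3. Check that secrets referenced by the stack exist
docker secret ls

# 4. Verify the overlay network exists
docker network ls --filter name=traefik-public

# 5. Check node labels used in placement constraints
docker node inspect <node-hostname> --format '{{json .Spec.Labels}}'
```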
Container Restarts Repeatedly¶
Symptoms:
Common Causes:

- Application configuration errors
- Missing volume mounts
- Database connection failures
- Port conflicts
- Health check failures
Solution:

1. Check the service logs.
2. Inspect the service tasks.
3. Verify volume mounts.
4. Check for port conflicts.
5. Test database connectivity.
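A sketch of the checks above; service names, ports, and the database host are placeholders.

```shell
# 1. Check service logs for the crash reason
docker service logs myapp_service --tail 100

# 2. Inspect tasks; --no-trunc shows the full error message
docker service ps myapp_service --no-trunc

# 3. Verify volume mounts resolved as expected
docker service inspect myapp_service \
  --format '{{json .Spec.TaskTemplate.ContainerSpec.Mounts}}'

# 4. Check for port conflicts on the host
sudo ss -tlnp | grep 8080

# 5. Test database connectivity from inside the app container
docker exec -it <container-id> nc -zv db-host 5432
```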
Health Check Failures¶
Symptoms:
Service shows as running but marked unhealthy in docker service ps.
Solution:

1. Check the health check configuration in docker-compose.yml.
2. Test the health check command manually.
3. Increase the timeout or retries if the service is slow to start.
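A minimal sketch of steps 1 and 2, assuming an HTTP health endpoint; adjust the command and port to match your service's healthcheck definition.

```shell
# 1. See the configured health check and recent probe results
docker inspect <container-id> --format '{{json .State.Health}}'

# 2. Run the health check command manually inside the container
docker exec -it <container-id> wget -qO- http://localhost:8080/health
```

If the manual check passes but Docker still marks the service unhealthy, raising `start_period`, `timeout`, or `retries` in the compose healthcheck usually resolves it.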
Prevention:
- Always validate compose files with docker compose config before deployment
- Test services locally before deploying to swarm
- Use docker service logs immediately after deployment to catch early errors
SSL Certificate Issues¶
Issue: HTTPS not working, certificate not issued, or Cloudflare DNS challenges failing.
Certificate Not Issued¶
Symptoms:

- Service accessible via HTTP but not HTTPS
- Browser shows "connection not secure"
- Traefik dashboard shows no certificate

Common Causes:

- Cloudflare API token invalid or expired
- DNS records not pointing to the server
- Let's Encrypt rate limits hit
- Traefik not configured for the certificate resolver
Solution:

1. Check the Traefik logs.
2. Verify the Cloudflare API token.
3. Verify the DNS records.
4. Check the acme.json file.
5. Check the Traefik labels on the service.
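A sketch of the checks above; the domain, token variable, and acme.json path are placeholders.

```shell
# 1. Check Traefik logs for ACME errors
docker service logs reverse-proxy_traefik --tail 200 | grep -i acme

# 2. Verify the Cloudflare API token is still valid
curl -s -H "Authorization: Bearer $CF_DNS_API_TOKEN" \
  https://api.cloudflare.com/client/v4/user/tokens/verify

# 3. Verify the DNS record resolves to your server
dig +short myapp.yourdomain.com

# 4. acme.json must exist with mode 600, or Traefik refuses to use it
ls -l /path/to/acme.json

# 5. Confirm the service requests a certificate via its labels
docker service inspect myapp_service --format '{{json .Spec.Labels}}' | grep -i tls
```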
Certificate Expired or Invalid¶
Symptoms:

- Browser shows certificate error
- Certificate expiry warning
Solution:

1. Force certificate renewal.
2. Check Let's Encrypt rate limits:
   - Limit: 50 certificates per week per registered domain
   - Use the staging environment for testing
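One common way to force renewal with Traefik's ACME storage is to remove acme.json and restart Traefik, which makes it request fresh certificates; the file path here is a placeholder, and taking a backup first is strongly advised.

```shell
# 1. Back up, then remove the stored certificates
cp /path/to/acme.json /path/to/acme.json.bak
rm /path/to/acme.json

# 2. Restart Traefik so it re-requests certificates from Let's Encrypt
docker service update --force reverse-proxy_traefik
```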
Prevention:

- Monitor certificate expiry with Uptime Kuma
- Ensure the Cloudflare API token doesn't expire
- Use wildcard certificates to reduce certificate count
- Test with the Let's Encrypt staging environment first
DNS Resolution Problems¶
Issue: Services not accessible by domain name, or DNS server not responding.
Services Not Accessible by Domain¶
Symptoms:
Common Causes:

- Technitium DNS not running
- DNS records not configured
- Client not using the correct DNS server
- Firewall blocking DNS port 53 (or the Technitium web console on port 5380)
Solution:

1. Check the Technitium DNS status.
2. Verify the DNS records in Technitium:
   - Access the Technitium UI at http://<server-ip>:5380
   - Check A records for your domain
   - Verify wildcard records (*.yourdomain.com)
3. Test DNS resolution.
4. Check the client DNS configuration.
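Steps 1, 3, and 4 can be sketched as follows; the Technitium service name is an assumption about how the stack is named in your swarm, and the IPs are placeholders.

```shell
# 1. Check the Technitium DNS service is running
docker service ps technitium_dns-server

# 3. Test resolution directly against the DNS server
dig @<server-ip> myapp.yourdomain.com

# 4. Check which DNS server the client actually uses (systemd-resolved)
resolvectl status | grep 'DNS Servers'
```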
DNS Server Not Responding¶
Symptoms:
Solution:

1. Check the DNS service is running.
2. Verify port 53 is open.
3. Check the firewall rules.
4. Restart the DNS service.
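A sketch of the checks above, assuming ufw as the firewall and the same Technitium service name as earlier; adjust both to your environment.

```shell
# 2. Verify something is listening on port 53 (UDP and TCP)
sudo ss -ulnp 'sport = :53'
sudo ss -tlnp 'sport = :53'

# 3. Check firewall rules allow DNS traffic
sudo ufw status | grep 53

# 4. Restart the DNS service
docker service update --force technitium_dns-server
```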
Prevention:

- Configure DNS records before deploying services
- Use wildcard DNS records (*.yourdomain.com) for easier management
- Monitor the DNS service with Uptime Kuma
- Document all DNS records
Secondary DNS and Pi-hole Sync¶
Issues specific to the optional Pi-hole secondary DNS feature (SECONDARY_DNS_ENABLED=true).
Prerequisites before enabling:
- Pi-hole v6.3+ (sudo pihole -up to upgrade)
- API writes enabled: sudo pihole-FTL --config webserver.api.app_sudo true
Verify sync after registration:
# Query Pi-hole directly
dig @<pihole-ip> grafana.yourdomain.com
# Or check Pi-hole Admin UI → Local DNS → CNAME Records
Pi-hole API Returns 403 Forbidden¶
Symptoms:
fatal: [localhost]: FAILED! => {"status": 403, "json": {"error": {"key": "app_sudo_disabled", ...}}}
Cause: Pi-hole's API write permission (app_sudo) is disabled.
Fix:
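The fix is the prerequisite command listed above: enable API writes on the Pi-hole host, then re-run the registration.

```shell
# Enable app_sudo so the API can write local DNS records (Pi-hole v6)
sudo pihole-FTL --config webserver.api.app_sudo true
```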
Connection Refused During DNS Registration¶
Symptoms:
A few CNAMEs register successfully, then subsequent ones fail with connection refused.

Cause: Pi-hole below v6.3 restarts its FTL resolver after every CNAME change made via the API. Each restart takes ~5 seconds, during which the API is unreachable. The playbooks use ?restart=false to prevent this, which requires v6.3+.
Fix: Upgrade Pi-hole, then re-run DNS registration:
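The upgrade itself can be sketched as follows (the registration playbook invocation depends on your setup and is not shown):

```shell
# Upgrade Pi-hole in place, then confirm the version is v6.3+
sudo pihole -up
pihole -v
```

Once on v6.3+, re-run your DNS registration and verify with the `dig` check shown in the prerequisites section.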
Pi-hole Has A Records Instead of CNAMEs for Services¶
Symptoms: Pi-hole's custom DNS list shows service hostnames (e.g. grafana.yourdomain.com) with IP addresses instead of CNAME targets.
Fix: Run the cleanup task, then re-register:
Pi-hole Not Resolving Services When Primary is Down¶
Symptoms: Technitium is unreachable, but Pi-hole also fails to resolve service hostnames.
Checklist:

1. Check Pi-hole Admin UI → Local DNS → CNAME Records — service hostnames should be listed
2. Confirm your router falls back to Pi-hole when Technitium is unreachable
3. Re-run DNS registration if records are missing
Docker Swarm Networking¶
Issue: Overlay network issues, services can't communicate, or node connectivity problems.
Services Can't Communicate¶
Symptoms:

- Service A cannot reach Service B
- Connection timeouts between containers
- Services on different nodes can't communicate

Common Causes:

- Services not on the same overlay network
- Firewall blocking overlay network ports (4789/udp)
- Network encryption issues
- MTU size mismatch
Solution:

1. Verify the services are on the same overlay network.
2. Check overlay network connectivity.
3. Verify the firewall rules allow overlay traffic.
4. Check the MTU size.
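A sketch of the checks above, assuming ufw; service and container names are placeholders.

```shell
# 1. List which networks the service is attached to
docker service inspect myapp_service \
  --format '{{json .Spec.TaskTemplate.Networks}}'

# 2. Test container-to-container reachability across the overlay
docker exec -it <container-a> ping -c 3 <service-b-name>

# 3. Overlay traffic needs 4789/udp (VXLAN) and 7946/tcp+udp (gossip)
sudo ufw allow 4789/udp
sudo ufw allow 7946

# 4. Overlay interfaces use a 1450-byte MTU; the underlay must accommodate it
ip link show | grep mtu
```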
Node Connectivity Problems¶
Symptoms:
Solution:

1. Check node availability.
2. Verify the Swarm ports are open.
3. Check node-to-node connectivity.
4. Rejoin the node to the swarm if needed.
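A sketch of the steps above; hostnames and IPs are placeholders.

```shell
# 1. Check node availability and status
docker node ls

# 2. Swarm needs 2377/tcp (management), 7946/tcp+udp (gossip), 4789/udp (overlay)
nc -zv <manager-ip> 2377

# 3. Basic connectivity between nodes
ping -c 3 <other-node-ip>

# 4. Rejoin: get a fresh token on a manager, then rejoin from the broken node
docker swarm join-token worker                    # on a manager
docker swarm leave                                # on the broken node
docker swarm join --token <token> <manager-ip>:2377
```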
Traefik 502 Bad Gateway — Overlay ARP FAILED (Stale Entries)¶
Symptoms:
- Most or all services return 502 Bad Gateway
- Traefik logs show repeated "no route to host" errors
- docker service ls still shows all replicas as running
Root Cause: A known Docker bug (moby #50232) where ARP entries in the overlay network namespace are never garbage collected. After containers restart or get rescheduled (getting new IPs), their old IPs remain in the ARP table and become FAILED. Traefik keeps routing to these dead IPs.
Identify:
# Check for FAILED ARP entries inside Traefik's network namespace
docker exec $(docker ps -q --filter name=reverse-proxy_traefik) ip neigh show | grep FAILED
# Example output showing stale entries:
# 10.0.1.32 dev eth2 used 0/0/0 probes 6 FAILED
# 10.0.1.7 dev eth2 used 0/0/0 probes 6 FAILED
# Cross-reference with which services own those IPs
docker network inspect traefik-public --format '{{json .Containers}}' | \
python3 -c "import json,sys; d=json.load(sys.stdin); [print(v['IPv4Address'], v['Name']) for v in d.values()]"
Fix:

1. Force-update the affected backend services (not Traefik) to get fresh IPs. Run this for each service whose IP shows as FAILED: Docker tears down and recreates the container with a new IP, clearing the stale ARP state.
2. If only a few services are affected, identify them by matching the FAILED IPs against the network inspect output above, then force-update only those services.
3. If all services are affected, force-update the most critical ones first (e.g. authentik_server, since it handles SSO for everything else).
4. Verify recovery.
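The fix and verification steps above can be sketched as:

```shell
# 1. Force-update each service whose overlay IP shows as FAILED
docker service update --force authentik_server

# 4. Verify recovery — no FAILED entries should remain in Traefik's namespace
docker exec $(docker ps -q --filter name=reverse-proxy_traefik) \
  ip neigh show | grep -c FAILED
```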
Deeper Fix — Overlay Sandbox Veth Repair (no downtime):
If force-updates don't fix it, the overlay sandbox bridge may have veth pairs stuck in the host
namespace (never attached to the sandbox). Symptoms: container's eth2 shows NO-CARRIER/DOWN
even in a freshly started container. Fix without restarting Docker:
# 1. Identify the overlay sandbox namespace for traefik-public
# (first 12 chars of network ID, prefixed with "4-")
docker network inspect traefik-public --format '{{.Id}}'
# e.g. v8mwtol6eu6h... → sandbox is "4-v8mwtol6eu"
# 2. Attach all broken overlay veths to the sandbox bridge
docker run --rm --privileged --pid=host --net=host -v /run/docker/netns:/netns alpine sh -c "
apk add -q iproute2 &&
for veth in \$(ip link | grep 'mtu 1450' | grep 'state DOWN' | grep -v 'M-DOWN' | awk -F': ' '{print \$2}' | awk -F'@' '{print \$1}'); do
ip link set \$veth netns /netns/4-v8mwtol6eu &&
nsenter --net=/netns/4-v8mwtol6eu -- ip link set \$veth master br0 &&
nsenter --net=/netns/4-v8mwtol6eu -- ip link set \$veth up &&
echo \"OK: \$veth\" || echo \"FAILED: \$veth\"
done
"
# 3. Verify — should show 0
docker exec \$(docker ps -q --filter name=reverse-proxy_traefik) ip neigh show | grep -c FAILED
Last Resort — Docker Daemon Restart:
If the sandbox veth repair doesn't work, restart Docker on the affected node:
# 1. Drain the node first to avoid split-brain
docker node update --availability drain <node-hostname>
# 2. Wait for tasks to migrate, then restart Docker
sudo systemctl restart docker
# 3. Restore node to active
docker node update --availability active <node-hostname>
# 4. Force-update Traefik to reconnect to overlay
docker service update --force reverse-proxy_traefik
Prevention: This is an unfixed upstream Docker bug. To reduce its frequency:

- Avoid frequent service restarts/updates during peak hours
- Consider pinning services to specific nodes with placement constraints to reduce IP churn across the overlay
Network Not Found¶
Symptoms:
Solution:

1. Create the missing network.
2. Verify the network exists.
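A sketch of the two steps, using the `traefik-public` network referenced earlier in this guide:

```shell
# 1. Create the missing overlay network (--attachable lets standalone
#    containers join it, not just swarm services)
docker network create --driver overlay --attachable traefik-public

# 2. Verify it exists
docker network ls --filter name=traefik-public
```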
Prevention:
- Document required firewall ports
- Use network monitoring to detect connectivity issues
- Keep Docker Engine updated on all nodes
- Regularly check node health with docker node ls
Authentik SSO Integration¶
Issue: Forward auth not working, OAuth redirect errors, or LDAP authentication failures.
Forward Auth Not Working¶
Symptoms:

- Accessing the service shows "502 Bad Gateway"
- No SSO login prompt appears
- Traefik returns authentication errors

Common Causes:

- Authentik proxy outpost not running
- Middleware not configured correctly
- Authentik host URL incorrect
- Service not configured to use the middleware
Solution:

1. Verify the Authentik services are running.
2. Check the Authentik proxy outpost logs.
3. Verify the Traefik middleware configuration.
4. Check the service uses the middleware.
5. Verify the Authentik host URL.
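A sketch of the checks above; the Authentik service names are assumptions about how the stack is named in your swarm.

```shell
# 1. Verify the Authentik services are running
docker service ls --filter name=authentik

# 2. Check the proxy outpost logs for connection errors
docker service logs authentik_proxy --tail 100

# 3/4. Confirm the service references the forward-auth middleware
docker service inspect myapp_service \
  --format '{{json .Spec.Labels}}' | grep -i middleware

# 5. The outpost must be able to reach Authentik's health endpoint
docker exec -it <outpost-container> \
  wget -qO- https://auth.yourdomain.com/-/health/live/
```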
OAuth Redirect Errors¶
Symptoms:

- "Invalid redirect URI" error
- "OAuth callback failed"
- User redirected to the wrong URL after login
Solution:

1. Check the OAuth provider configuration in Authentik:
   - Log in to Authentik at https://auth.yourdomain.com
   - Go to Applications → Providers
   - Verify the redirect URIs match the service callback URLs
   - Common format: https://myapp.yourdomain.com/oauth/callback
2. Verify the service OAuth configuration.
3. Check the environment variables are set.
4. Test the OAuth flow:
   - Clear browser cache and cookies
   - Try the SSO login
   - Check the browser developer console for errors
LDAP Authentication Fails¶
Symptoms:

- Service shows "LDAP bind failed"
- Cannot authenticate with LDAP credentials
- Connection timeouts to the LDAP server
Solution:

1. Verify the Authentik LDAP outpost is running.
2. Check LDAP port accessibility.
3. Verify the service LDAP configuration.
4. Test an LDAP bind.
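A sketch of steps 1, 2, and 4; the outpost service name, bind DN, and base DN are assumptions (the base DN shown is Authentik's default) — adjust them to your LDAP provider settings.

```shell
# 1. Verify the LDAP outpost is running
docker service ls --filter name=ldap

# 2. Check the LDAP port is reachable (389 plain, 636 TLS)
nc -zv <ldap-host> 389

# 4. Test a bind with ldapsearch (from the ldap-utils package)
ldapsearch -x -H ldap://<ldap-host>:389 \
  -D "cn=ldap-bind,ou=users,dc=ldap,dc=goauthentik,dc=io" \
  -w "<bind-password>" \
  -b "dc=ldap,dc=goauthentik,dc=io" "(objectClass=user)"
```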
Prevention:

- Document all OAuth redirect URIs
- Test SSO integration before deploying to production
- Use Authentik's application wizard for consistent configuration
- Monitor Authentik service health
- Keep Authentik outpost tokens secure and don't let them expire
Storage Mount Failures¶
Issue: CIFS/SMB mounts not working, iSCSI mount issues, or permission denied errors.
CIFS/SMB Mounts Not Working¶
Symptoms:
Or:

Common Causes:

- NAS server not reachable
- Incorrect SMB credentials
- Mount point not created
- Network connectivity issues
- SMB version mismatch
Solution:

1. Verify the NAS is reachable.
2. Check the mount configuration in docker-compose.yml.
3. Verify the CIFS mount on the host.
4. Check the SMB credentials.
5. Create the mount points if missing.
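A sketch of the host-side checks; the NAS address, share name, credentials file path, and mount points are placeholders.

```shell
# 1. Verify the NAS is reachable
ping -c 3 <nas-ip>

# 3. Verify the CIFS mount exists on the host
mount | grep cifs

# 4. Test the credentials with a manual mount (vers= must match the NAS)
sudo mount -t cifs //<nas-ip>/share /mnt/test \
  -o credentials=/etc/samba/credentials,vers=3.0,uid=1000,gid=1000

# 5. Create missing mount points
sudo mkdir -p /mnt/nas/share
```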
iSCSI Mount Issues¶
Symptoms:
ls /mnt/iscsi/app-data
# ls: cannot access '/mnt/iscsi/app-data': Transport endpoint is not connected
Solution:

1. Check the iSCSI session.
2. Verify the OCFS2 cluster status.
3. Check the mount status.
4. Restart iSCSI and OCFS2.
5. Re-mount if needed.
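A sketch of the steps above, using open-iscsi and ocfs2-tools; the block device path is a placeholder for your iSCSI LUN.

```shell
# 1. Check active iSCSI sessions
sudo iscsiadm -m session

# 2. Verify the OCFS2 cluster service is up
sudo systemctl status o2cb

# 3. Check mount status
mount | grep ocfs2

# 4. Restart iSCSI and O2CB
sudo systemctl restart iscsid o2cb

# 5. Re-mount the filesystem
sudo mount -t ocfs2 /dev/sdb1 /mnt/iscsi/app-data
```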
Permission Denied Errors¶
Symptoms:
Solution:

1. Check the file permissions on the host.
2. Verify the UID/GID match between the host and the container.
3. For CIFS mounts, check the mount options.
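A sketch of the checks above; the data path is a placeholder. This deployment standardizes on 1000:1000.

```shell
# 1. Show numeric owner/group of the data directory
ls -ln /mnt/data/myapp

# 2. Compare with the UID/GID the container actually runs as
docker exec -it <container-id> id

# 3. CIFS mounts need explicit uid/gid options to match
mount | grep cifs     # look for uid=1000,gid=1000 in the options
```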
Prevention:

- Use Ansible to configure mounts consistently
- Document mount points and credentials
- Test mounts before deploying services
- Monitor mount health with automated checks
- Use a consistent UID/GID across services (1000:1000)
Database Performance Issues¶
Issue: Slow queries, connection timeouts, or performance degradation for database-heavy services (Immich, LibreChat).
Slow Queries or Timeouts¶
Symptoms:

- Application shows "Database connection timeout"
- Web UI is extremely slow
- Services restart due to health check failures
- High CPU usage on the database container

Common Causes:

- Database running on network storage (CIFS/SMB) instead of local storage
- Insufficient resources allocated to the database
- Database not placed on the correct node
- Database needs vacuuming or optimization
Solution:

1. CRITICAL: Verify the database is on local storage.
2. Verify the database volume is local, not network-backed.
3. Check the database logs.
4. Monitor the database resource usage.
5. For PostgreSQL, vacuum the database.
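A sketch of the checks above; volume, service, and container names are placeholders.

```shell
# 1/2. Confirm the volume path is local, not a CIFS/SMB mount
docker volume inspect myapp_db-data --format '{{.Mountpoint}}'
df -hT /var/lib/docker/volumes    # a "cifs" type here means network storage

# 3. Check the database logs
docker service logs myapp_postgres --tail 100

# 4. Monitor resource usage
docker stats --no-stream

# 5. Vacuum and refresh planner statistics for PostgreSQL
docker exec -it <postgres-container> psql -U postgres -c "VACUUM ANALYZE;"
```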
Connection Errors¶
Symptoms:
Solution:

1. Verify the database service is running.
2. Check the database credentials.
3. Test the database connection from the application container.
4. Check the database is ready.
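A sketch of the checks above; service, container, and user names are placeholders.

```shell
# 1. Verify the database service is running
docker service ps myapp_postgres

# 3. Test connectivity from the application container
docker exec -it <app-container> nc -zv postgres 5432

# 4. pg_isready reports whether PostgreSQL is accepting connections
docker exec -it <postgres-container> pg_isready -U postgres
```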
Performance Degradation Over Time¶
Symptoms:

- Application was fast but is now slow
- Database size growing rapidly
- Disk I/O very high
Solution:

1. Check the database size.
2. Optimize PostgreSQL.
3. Check the disk space on the database node.
4. Increase the database resources if needed.
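A sketch of the steps above; database, container, and service names plus the resource limits are placeholders.

```shell
# 1. Check the database size
docker exec -it <postgres-container> \
  psql -U postgres -c "SELECT pg_size_pretty(pg_database_size('myapp'));"

# 2. Reclaim space and refresh statistics
#    (VACUUM FULL takes exclusive locks — run during a maintenance window)
docker exec -it <postgres-container> psql -U postgres -c "VACUUM FULL ANALYZE;"

# 3. Check disk space on the database node
df -h /var/lib/docker

# 4. Raise the service's resource limits
docker service update --limit-memory 2G --limit-cpu 2 myapp_postgres
```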
Prevention:
- ALWAYS run PostgreSQL and MongoDB on local storage, never network storage
- Set node labels correctly: docker node update --label-add database=true <node>
- Use node with fast SSD for database workloads
- Monitor database size and performance with Prometheus/Grafana
- Schedule regular maintenance (VACUUM for PostgreSQL)
- Allocate sufficient resources for database containers
- Keep database versions updated
Database Node Label Missing¶
Symptoms:
Solution:

1. Add the database node label.
2. Redeploy the service.
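The two steps above can be sketched with the label from the Prevention notes; the node hostname and stack name are placeholders.

```shell
# 1. Add the database node label so placement constraints can be satisfied
docker node update --label-add database=true <node-hostname>

# 2. Redeploy the stack
task ansible:deploy:stack -- -e "stack_name=myapp"
```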
Critical Note: For services like Immich and LibreChat with PostgreSQL or MongoDB databases, local storage is mandatory for acceptable performance. Network storage (CIFS/SMB) will result in extremely slow performance, connection timeouts, and health check failures.
General Troubleshooting Tips¶
Check Service Logs¶
# Follow logs in real-time
docker service logs <service-name> --tail 100 --follow
# Search logs for errors
docker service logs <service-name> --tail 1000 | grep -i error
Inspect Service Configuration¶
# View service details
docker service inspect <service-name>
# View service tasks (replicas)
docker service ps <service-name> --no-trunc
# View service placement constraints
docker service inspect <service-name> --format '{{.Spec.TaskTemplate.Placement}}'
Restart Service¶
# Restart single service
docker service update --force <service-name>
# Redeploy entire stack
task ansible:deploy:stack -- -e "stack_name=myapp"
Check System Resources¶
# Check disk space
df -h
# Check memory usage
free -h
# Check CPU usage
top
# Check Docker resources
docker system df
Get Shell Access to Container¶
# Find container
docker ps | grep myapp
# Execute shell
docker exec -it <container-id> sh
# Or bash if available
docker exec -it <container-id> bash
Need More Help?¶
If you're still experiencing issues:
- Check the service-specific documentation in /stacks/apps/<service>/README.md
- Review recent changes with git log
- Search existing issues on GitHub
- Create new issue with detailed logs and configuration