Troubleshooting¶
Common issues and solutions for your homelab deployment. This guide covers the most frequently encountered problems across different service categories.
Table of Contents¶
- OCFS2 Cluster - IP Address Changes
- Service Deployment Failures
- SSL Certificate Issues
- DNS Resolution Problems
- Secondary DNS and Pi-hole Sync
- Docker Swarm Networking
- Authentik SSO Integration
- Storage Mount Failures
- Database Performance Issues
- DNS Cascade Failure — All Workers Go Down Together
- Cluster Health Monitoring
- Docker Swarm Port 53 DNAT Hijacking Worker DNS
OCFS2 Cluster - IP Address Changes¶
Issue: After a network change, nodes with new IP addresses fail to mount OCFS2 filesystems with error -107 (ENOTCONN) in ocfs2_dlm_init.
Symptoms:
o2net: Connection to node <name> shutdown, state 7
o2net: No connection established with node X after 30.0 seconds
o2cb: This node could not connect to nodes
(mount.ocfs2): ERROR: status = -107
Root Cause: The O2CB cluster stack caches node IP addresses in kernel state (/sys/kernel/config/cluster/homelab/node/*/ipv4_address). When a node's IP address changes, that kernel state goes stale and O2NET connections fail during the handshake.
Solution:
-
Remove affected nodes from cluster:
-
Stop O2CB on all nodes:
-
Re-add nodes with new IPs:
-
Restart O2CB:
-
If kernel state persists: Reboot nodes that still show old IPs in /sys/kernel/config/cluster/homelab/node/*/ipv4_address.
Prevention: After IP address changes, reboot all cluster nodes to ensure clean O2CB registration with new IPs.
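A minimal sketch for spotting stale kernel state before and after the steps above. The helper name and the "name ip" list format are hypothetical. The commented commands assume the ocfs2-tools o2cb CLI and a systemd unit named o2cb, both of which may differ on your distribution.

```shell
# Print nodes whose kernel-cached IP differs from the IP you expect.
# Usage: find_stale_nodes <configfs-node-dir> <expected-file>
#   <configfs-node-dir>  e.g. /sys/kernel/config/cluster/homelab/node
#   <expected-file>      lines of "<node-name> <expected-ip>" (hypothetical format)
find_stale_nodes() {
  node_dir=$1
  expected_file=$2
  while read -r name ip; do
    [ -n "$name" ] || continue
    cached_file="$node_dir/$name/ipv4_address"
    if [ ! -r "$cached_file" ]; then
      echo "$name: not registered in kernel state"
      continue
    fi
    cached=$(cat "$cached_file")
    [ "$cached" = "$ip" ] || echo "$name: kernel has $cached, expected $ip"
  done < "$expected_file"
}

# Steps 1-4 for one stale node (assumed commands, adjust to your setup):
#   o2cb remove-node homelab <node-name>
#   systemctl stop o2cb                          # on all nodes
#   o2cb add-node --ip <new-ip> homelab <node-name>
#   systemctl start o2cb
```

Any node the helper still reports after O2CB restarts is a candidate for the reboot described in the last step.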
Service Deployment Failures¶
Issue: Docker stack fails to deploy, or services restart repeatedly after deployment.
Stack Fails to Deploy¶
Symptoms:
Creating service myapp_service
failed to create service myapp_service: Error response from daemon: ...
Common Causes:
- Missing environment variables in .env file
- Invalid Docker Compose syntax
- Missing Docker secrets
- Network not created
- Node constraints not met (labels missing)
Solution:
-
Verify environment variables:
-
Validate Docker Compose syntax:
-
Check Docker secrets:
-
Verify network exists:
-
Check node labels:
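The checks above can be sketched as two small helpers. This is an illustration, not the project's own tooling: the function names are hypothetical, and the compose file name and required variable names are assumptions you would replace with your stack's values.

```shell
# Print each required variable that is absent or empty in an env file.
# Usage: missing_env_vars <env-file> VAR1 [VAR2 ...]
missing_env_vars() {
  env_file=$1
  shift
  for var in "$@"; do
    grep -Eq "^${var}=.+" "$env_file" || echo "$var"
  done
}

# Hypothetical wrapper around the remaining checks; nothing runs until called.
predeploy_checks() {
  docker compose -f docker-compose.yml config --quiet        # compose syntax
  docker secret ls --format '{{.Name}}'                      # secrets present?
  docker network ls --filter driver=overlay                  # overlay networks
  docker node inspect self --format '{{json .Spec.Labels}}'  # labels on this manager
}
```

For example, `missing_env_vars .env DOMAIN TZ` prints nothing when both variables are set to non-empty values.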
Container Restarts Repeatedly¶
Symptoms:
Common Causes:
- Application configuration errors
- Missing volume mounts
- Database connection failures
- Port conflicts
- Health check failures
Solution:
-
Check service logs:
-
Inspect service tasks:
-
Verify volume mounts:
-
Check port conflicts:
-
Test database connectivity:
Health Check Failures¶
Symptoms:
Service shows as running but marked unhealthy in docker service ps.
Solution:
-
Check health check configuration in docker-compose.yml:
-
Test health check manually:
-
Increase timeout or retries if service is slow to start
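When testing a health check by hand, it helps to mimic the retry behaviour Docker applies (the compose healthcheck keys interval, timeout, retries, and start_period). The helper below is a generic sketch; the endpoint in the usage comment is hypothetical.

```shell
# Run a command until it succeeds, up to <attempts> times, sleeping <delay>
# seconds between tries. Returns 0 on the first success, 1 if all attempts fail.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  attempts=$1
  delay=$2
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      echo "succeeded on attempt $n"
      return 0
    fi
    n=$((n + 1))
    [ "$n" -le "$attempts" ] && sleep "$delay"
  done
  echo "failed after $attempts attempts"
  return 1
}

# Example (hypothetical health endpoint inside the container):
#   retry 5 10 docker exec <container-id> wget -q --spider http://localhost:8080/health
```

If the command only succeeds on a late attempt, that suggests the service needs a longer start period or more retries rather than a different check.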
Prevention:
- Always validate compose files with docker compose config before deployment
- Test services locally before deploying to swarm
- Use docker service logs immediately after deployment to catch early errors
SSL Certificate Issues¶
Issue: HTTPS not working, certificate not issued, or Cloudflare DNS challenges failing.
Certificate Not Issued¶
Symptoms:
- Service accessible via HTTP but not HTTPS
- Browser shows "connection not secure"
- Traefik dashboard shows no certificate
Common Causes:
- Cloudflare API token invalid or expired
- DNS records not pointing to server
- Let's Encrypt rate limits hit
- Traefik not configured for certificate resolver
Solution:
-
Check Traefik logs:
-
Verify Cloudflare API token:
-
Verify DNS records:
-
Check acme.json file:
-
Check Traefik labels on service:
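Two of these checks are easy to script. The sketch below assumes acme.json lives wherever your Traefik stack stores it (the path is yours to supply) and that the token is a Cloudflare API token. Traefik refuses to use an acme.json whose permissions are more open than 600, so that is the mode the first helper looks for. The function names are hypothetical.

```shell
# Return 0 if the file is a regular file with mode 600 (-rw-------).
# Uses "ls -l" rather than "stat" so it behaves the same on GNU and BSD userlands.
acme_perms_ok() {
  perms=$(ls -l "$1" | cut -c1-10)
  [ "$perms" = "-rw-------" ]
}

# Ask Cloudflare whether an API token is still valid (token passed as $1).
# The JSON response reports the token status.
verify_cloudflare_token() {
  curl -s -H "Authorization: Bearer $1" \
    https://api.cloudflare.com/client/v4/user/tokens/verify
}

# Example:
#   acme_perms_ok /path/to/acme.json || chmod 600 /path/to/acme.json
```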
Certificate Expired or Invalid¶
Symptoms:
- Browser shows certificate error
- Certificate expiry warning
Solution:
-
Force certificate renewal:
-
Check Let's Encrypt rate limits:
- Limit: 50 certificates per registered domain per week
- Use staging environment for testing
Prevention:
- Monitor certificate expiry with Uptime Kuma
- Ensure Cloudflare API token doesn't expire
- Use wildcard certificates to reduce certificate count
- Test with Let's Encrypt staging environment first
DNS Resolution Problems¶
Issue: Services not accessible by domain name, or DNS server not responding.
Services Not Accessible by Domain¶
Symptoms:
Common Causes:
- Technitium DNS not running
- DNS records not configured
- Client not using correct DNS server
- Firewall blocking DNS port 53 (port 5380 serves only the Technitium web UI)
Solution:
-
Check Technitium DNS status:
-
Verify DNS records in Technitium:
- Access Technitium UI: http://<server-ip>:5380
- Check A records for your domain
- Verify wildcard records (*.yourdomain.com)
-
Test DNS resolution:
-
Check client DNS configuration:
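A small sketch for the client-side check, assuming a resolv.conf-style file. On hosts running systemd-resolved this file usually lists only the 127.0.0.53 stub, in which case `resolvectl status` shows the real upstream servers. The helper names are hypothetical.

```shell
# Print the nameserver addresses from a resolv.conf-style file.
# Usage: list_nameservers [file]   (defaults to /etc/resolv.conf)
list_nameservers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Return 0 if the given DNS server IP is among the configured nameservers.
# Usage: uses_dns_server <dns-ip> [file]
uses_dns_server() {
  list_nameservers "$2" | grep -qxF "$1"
}

# To test resolution against the DNS server directly:
#   dig @<server-ip> grafana.yourdomain.com +short
```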
DNS Server Not Responding¶
Symptoms:
Solution:
-
Check DNS service is running:
-
Verify port 53 is open:
-
Check firewall rules:
-
Restart DNS service:
Prevention:
- Configure DNS records before deploying services
- Use wildcard DNS records (*.yourdomain.com) for easier management
- Monitor DNS service with Uptime Kuma
- Document all DNS records
Secondary DNS and Pi-hole Sync¶
Issues specific to the optional Pi-hole secondary DNS feature (SECONDARY_DNS_ENABLED=true).
Prerequisites before enabling:
- Pi-hole v6.3+ (sudo pihole -up to upgrade)
- API writes enabled: sudo pihole-FTL --config webserver.api.app_sudo true
Verify sync after registration:
# Query Pi-hole directly
dig @<pihole-ip> grafana.yourdomain.com
# Or check Pi-hole Admin UI → Local DNS → CNAME Records
Pi-hole API Returns 403 Forbidden¶
Symptoms:
fatal: [localhost]: FAILED! => {"status": 403, "json": {"error": {"key": "app_sudo_disabled", ...}}}
Cause: Pi-hole's API write permission (app_sudo) is disabled.
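Fix: Enable API writes on the Pi-hole host with the command from the prerequisites above: sudo pihole-FTL --config webserver.api.app_sudo true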
Connection Refused During DNS Registration¶
Symptoms:
A few CNAMEs register successfully, then subsequent ones fail with connection refused.
Cause: Pi-hole below v6.3 restarts its FTL resolver after every CNAME change via the API. Each restart takes ~5 seconds, during which the API is unreachable. The playbooks use ?restart=false to prevent this, which requires v6.3+.
Fix: Upgrade Pi-hole (sudo pihole -up), then re-run DNS registration:
Pi-hole Has A Records Instead of CNAMEs for Services¶
Symptoms: Pi-hole's custom DNS list shows service hostnames (e.g. grafana.yourdomain.com) with IP addresses instead of CNAME targets.
Fix: Run the cleanup task, then re-register:
Pi-hole Not Resolving Services When Primary is Down¶
Symptoms: Technitium is unreachable, but Pi-hole also fails to resolve service hostnames.
Checklist:
1. Check Pi-hole Admin UI → Local DNS → CNAME Records: service hostnames should be listed
2. Confirm your router falls back to Pi-hole when Technitium is unreachable
3. Re-run DNS registration if records are missing:
Docker Swarm Networking¶
Issue: Overlay network issues, services can't communicate, or node connectivity problems.
Services Can't Communicate¶
Symptoms:
- Service A cannot reach Service B
- Connection timeouts between containers
- Services on different nodes can't communicate
Common Causes:
- Services not on same overlay network
- Firewall blocking overlay network ports (4789/udp)
- Network encryption issues
- MTU size mismatch
Solution:
-
Verify services are on same network:
-
Check overlay network connectivity:
-
Verify firewall rules allow overlay network:
-
Check MTU size:
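For the MTU check, one common approach is to ping across the path with the don't-fragment bit set and a payload sized to the MTU under test. The sketch below assumes Linux iputils ping syntax and the 1450-byte overlay MTU that appears elsewhere in this guide. The function names are hypothetical.

```shell
# Largest ICMP payload that fits in one unfragmented IPv4 packet for an MTU:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
max_ping_payload() {
  echo $(( $1 - 28 ))
}

# Probe a path with the don't-fragment bit set.
# Usage: probe_mtu <target-ip> <mtu>
probe_mtu() {
  ping -c 3 -M do -s "$(max_ping_payload "$2")" "$1"
}

# Related checks:
#   docker network inspect <network> --format '{{range .Containers}}{{.Name}} {{end}}'
#   Swarm needs 2377/tcp, 7946/tcp+udp and 4789/udp open between nodes.
```

If `probe_mtu <peer-ip> 1450` fails from inside a container while a smaller value succeeds, the path MTU is lower than the overlay expects.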
Node Connectivity Problems¶
Symptoms:
Solution:
-
Check node availability:
-
Verify Swarm ports are open:
-
Check node connectivity:
-
Rejoin node to swarm if needed:
If all nodes went down at exactly the same time and self-recovered, this is almost always a DNS cascade failure rather than an individual connectivity issue. See DNS Cascade Failure.
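The node checks above can be condensed into a filter over `docker node ls`. This is a sketch with a hypothetical helper name. The rejoin commands in the comments are the standard Swarm ones, with placeholders for the token and manager address.

```shell
# Read "hostname status availability" lines on stdin and print nodes that are
# not Ready/Active.
unhealthy_nodes() {
  awk '$2 != "Ready" || $3 != "Active" { print $1 ": " $2 "/" $3 }'
}

# Intended use on a manager:
#   docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | unhealthy_nodes
#
# Rejoining a worker, if needed:
#   docker swarm leave                          # on the affected node
#   docker swarm join-token worker              # on a manager, prints the join command
#   docker swarm join --token <token> <manager-ip>:2377
```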
Traefik 502 Bad Gateway — Overlay ARP FAILED (Stale Entries)¶
Symptoms:
- Most or all services return 502 Bad Gateway
- Traefik logs show repeated "no route to host" errors:
docker service ls
Root Cause: A known Docker bug (moby #50232) where ARP entries in the overlay network namespace are never garbage collected. After containers restart or get rescheduled (getting new IPs), their old IPs remain in the ARP table and become FAILED. Traefik keeps routing to these dead IPs.
Identify:
# Check for FAILED ARP entries inside Traefik's network namespace
docker exec $(docker ps -q --filter name=reverse-proxy_traefik) ip neigh show | grep FAILED
# Example output showing stale entries:
# 10.0.1.32 dev eth2 used 0/0/0 probes 6 FAILED
# 10.0.1.7 dev eth2 used 0/0/0 probes 6 FAILED
# Cross-reference with which services own those IPs
docker network inspect traefik-public --format '{{json .Containers}}' | \
python3 -c "import json,sys; d=json.load(sys.stdin); [print(v['IPv4Address'], v['Name']) for v in d.values()]"
Fix:
-
Force-update the affected backend services (not Traefik) to get fresh IPs:
Run this for each service whose IP shows as FAILED. Docker tears down and recreates the container with a new IP, clearing the stale ARP state.
-
If only a few services are affected, you can identify them by matching FAILED IPs to the network inspect output above, then force-update only those services.
-
If all services are affected, force-update the most critical ones first (e.g. authentik_server, since it handles SSO for everything else).
-
Verify recovery:
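The matching step can be automated. The sketch below is an illustration under two assumptions: the map file holds the "IP/prefix container-name" lines printed by the network inspect command in the Identify section, and container names follow the usual Swarm pattern of service name, slot, and task ID separated by dots. The function name is hypothetical.

```shell
# stdin: output of "ip neigh show" from Traefik's network namespace
# $1:    file with "IP/prefix container-name" lines
# Prints the unique service names that own a FAILED IP.
services_with_failed_arp() {
  map_file=$1
  awk '/FAILED/ { print $1 }' | while read -r ip; do
    awk -v ip="$ip" '{ split($1, a, "/"); if (a[1] == ip) print $2 }' "$map_file"
  done | sed 's/\.[^.]*\.[^.]*$//' | sort -u
}

# Intended use:
#   docker exec <traefik-container> ip neigh show | services_with_failed_arp map.txt |
#     while read -r svc; do docker service update --force "$svc"; done
#
# Verify afterwards that this prints 0:
#   docker exec <traefik-container> ip neigh show | grep -c FAILED
```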
Deeper Fix — Overlay Sandbox Veth Repair (no downtime):
If force-updates don't fix it, the overlay sandbox bridge may have veth pairs stuck in the host
namespace (never attached to the sandbox). Symptoms: container's eth2 shows NO-CARRIER/DOWN
even in a freshly started container. Fix without restarting Docker:
# 1. Identify the overlay sandbox namespace for traefik-public
# (first 12 chars of network ID, prefixed with "4-")
docker network inspect traefik-public --format '{{.Id}}'
# e.g. v8mwtol6eu6h... → sandbox is "4-v8mwtol6eu"
# 2. Attach all broken overlay veths to the sandbox bridge
docker run --rm --privileged --pid=host --net=host -v /run/docker/netns:/netns alpine sh -c "
apk add -q iproute2 &&
for veth in \$(ip link | grep 'mtu 1450' | grep 'state DOWN' | grep -v 'M-DOWN' | awk -F': ' '{print \$2}' | awk -F'@' '{print \$1}'); do
ip link set \$veth netns /netns/4-v8mwtol6eu &&
nsenter --net=/netns/4-v8mwtol6eu -- ip link set \$veth master br0 &&
nsenter --net=/netns/4-v8mwtol6eu -- ip link set \$veth up &&
echo \"OK: \$veth\" || echo \"FAILED: \$veth\"
done
"
# 3. Verify — should show 0
docker exec $(docker ps -q --filter name=reverse-proxy_traefik) ip neigh show | grep -c FAILED
Last Resort — Docker Daemon Restart:
If the sandbox veth repair doesn't work, restart Docker on the affected node:
# 1. Drain the node first to avoid split-brain
docker node update --availability drain <node-hostname>
# 2. Wait for tasks to migrate, then restart Docker
sudo systemctl restart docker
# 3. Restore node to active
docker node update --availability active <node-hostname>
# 4. Force-update Traefik to reconnect to overlay
docker service update --force reverse-proxy_traefik
Prevention: This is an unfixed upstream Docker bug. To reduce frequency:
- Avoid frequent service restarts/updates during peak hours
- Consider pinning services to specific nodes with placement constraints to reduce IP churn across the overlay
Network Not Found¶
Symptoms:
Solution:
-
Create missing network:
-
Verify network exists:
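A minimal idempotent sketch of the two steps above. The helper name is hypothetical. The `--attachable` flag is an assumption that suits networks shared across stacks, so drop it if your setup does not need it.

```shell
# Create an overlay network only if it does not already exist.
# Usage: ensure_overlay_network <network-name>
ensure_overlay_network() {
  name=$1
  if docker network ls --format '{{.Name}}' | grep -qxF "$name"; then
    echo "exists: $name"
  else
    docker network create --driver overlay --attachable "$name" >/dev/null &&
      echo "created: $name"
  fi
}

# Example, run on a manager node:
#   ensure_overlay_network traefik-public
```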
Prevention:
- Document required firewall ports
- Use network monitoring to detect connectivity issues
- Keep Docker Engine updated on all nodes
- Regularly check node health with docker node ls
Authentik SSO Integration¶
Issue: Forward auth not working, OAuth redirect errors, or LDAP authentication failures.
Forward Auth Not Working¶
Symptoms:
- Accessing service shows "502 Bad Gateway"
- No SSO login prompt appears
- Traefik returns authentication errors
Common Causes:
- Authentik proxy outpost not running
- Middleware not configured correctly
- Authentik host URL incorrect
- Service not configured to use middleware
Solution:
-
Verify Authentik services are running:
-
Check Authentik proxy outpost logs:
-
Verify Traefik middleware configuration:
-
Check service uses middleware:
-
Verify Authentik host URL:
OAuth Redirect Errors¶
Symptoms:
- "Invalid redirect URI" error
- "OAuth callback failed"
- User redirected to wrong URL after login
Solution:
- Check OAuth provider configuration in Authentik:
- Login to Authentik: https://auth.yourdomain.com
- Go to Applications → Providers
- Verify redirect URIs match service callback URLs
- Common format: https://myapp.yourdomain.com/oauth/callback
-
Verify service OAuth configuration:
-
Check environment variables are set:
-
Test OAuth flow:
- Clear browser cache and cookies
- Try SSO login
- Check browser developer console for errors
LDAP Authentication Fails¶
Symptoms:
- Service shows "LDAP bind failed"
- Cannot authenticate with LDAP credentials
- Connection timeouts to LDAP server
Solution:
-
Verify Authentik LDAP outpost is running:
-
Check LDAP port accessibility:
-
Verify service LDAP configuration:
-
Test LDAP bind:
Prevention:
- Document all OAuth redirect URIs
- Test SSO integration before deploying to production
- Use Authentik's application wizard for consistent configuration
- Monitor Authentik service health
- Keep Authentik outpost tokens secure and make sure they do not expire
Storage Mount Failures¶
Issue: CIFS/SMB mounts not working, iSCSI mount issues, or permission denied errors.
CIFS/SMB Mounts Not Working¶
Symptoms:
Or:
Common Causes:
- NAS server not reachable
- Incorrect SMB credentials
- Mount point not created
- Network connectivity issues
- SMB version mismatch
Solution:
-
Verify NAS is reachable:
-
Check mount configuration in docker-compose.yml:
-
Verify CIFS mount on host:
-
Check SMB credentials:
-
Create mount points if missing:
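To confirm a mount from a script rather than by eye, a sketch like the following reads the kernel's mount table. The helper name and the /mnt/media path are hypothetical, and the commented commands use placeholders for your NAS address and user.

```shell
# Return 0 if <mount-point> appears in the mounts table with filesystem <fstype>.
# Usage: is_mounted_as <mount-point> <fstype> [mounts-file]
is_mounted_as() {
  awk -v mp="$1" -v fs="$2" '$2 == mp && $3 == fs { found = 1 } END { exit !found }' \
    "${3:-/proc/mounts}"
}

# Related checks:
#   ping -c 3 <nas-ip>                     # NAS reachable?
#   smbclient -L //<nas-ip> -U <user>      # credentials accepted, shares visible?
#   sudo mkdir -p /mnt/media               # create a missing mount point
#   is_mounted_as /mnt/media cifs || echo "CIFS share is not mounted"
```

The same helper works for the OCFS2 mounts described later, for example `is_mounted_as /mnt/iscsi/app-data ocfs2`.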
iSCSI Mount Issues¶
Symptoms:
ls /mnt/iscsi/app-data
# ls: cannot access '/mnt/iscsi/app-data': Transport endpoint is not connected
"Ghost" Local Directory - Mount Race Condition¶
Issue: Services appear "wiped" (starting a setup wizard) or report missing configuration/snapshots, even though the data exists on the cluster storage.
Symptoms:
- Emby shows the "Welcome" setup wizard instead of your library.
- Kopia reports "Repository not configured" even though repository.config exists on the NAS.
- Files appear correctly on the host at /mnt/iscsi/app-data, but the container sees an empty directory.
- docker service logs show the application initializing a fresh database.
Root Cause:
This is a race condition where the Docker container starts before the iSCSI/OCFS2 mount is fully ready on the host. If the bind-mount source path (e.g., /mnt/iscsi/app-data/emby) does not exist when the container starts, Docker automatically creates an empty "ghost" directory on the node's local root disk and binds the container to it. When the iSCSI storage finally mounts, it "covers up" the ghost folder on the host, but the running container remains locked to the empty local version.
Identify: Check if the inode of the directory inside the container matches the host:
# On the host
ls -id /mnt/iscsi/app-data/emby
# Inside the container
docker exec $(docker ps -q -f name=emby) ls -id /config
Solution: Force a service update to re-bind the volumes now that the mount is active:
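The inode comparison and the fix can be combined into one helper. This is a sketch with a hypothetical function name, and the container and path arguments are whatever your service uses.

```shell
# Compare the inode of a host directory with the inode the container sees.
# Usage: ghost_mount_check <host-path> <container> <container-path>
ghost_mount_check() {
  host_path=$1
  container=$2
  container_path=$3
  host_inode=$(ls -id "$host_path" | awk '{ print $1 }')
  ctr_inode=$(docker exec "$container" ls -id "$container_path" | awk '{ print $1 }')
  if [ "$host_inode" = "$ctr_inode" ]; then
    echo "ok: inodes match"
  else
    echo "ghost: container sees a different directory"
  fi
}

# Example:
#   ghost_mount_check /mnt/iscsi/app-data/emby "$(docker ps -q -f name=emby)" /config
# If it reports a ghost directory, re-bind the volume:
#   docker service update --force <service-name>
```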
Prevention:
- Ensure the iscsi-wait.conf systemd drop-in is present (managed by Ansible).
- Use task ansible:storage:recover after a node crash to ensure mounts are healthy before starting services.
Solution for the Transport endpoint is not connected error shown under iSCSI Mount Issues:
-
Check iSCSI session:
-
Verify OCFS2 cluster status:
-
Check mount status:
-
Restart iSCSI and OCFS2:
-
Re-mount if needed:
Permission Denied Errors¶
Symptoms:
Solution:
-
Check file permissions on host:
-
Verify UID/GID match:
-
For CIFS mounts, check mount options:
Prevention:
- Use Ansible to configure mounts consistently
- Document mount points and credentials
- Test mounts before deploying services
- Monitor mount health with automated checks
- Use consistent UID/GID across services (1000:1000)
Database Performance Issues¶
Issue: Slow queries, connection timeouts, or performance degradation for database-heavy services (Immich, LibreChat).
Slow Queries or Timeouts¶
Symptoms:
- Application shows "Database connection timeout"
- Web UI is extremely slow
- Services restart due to health check failures
- High CPU usage on database container
Common Causes:
- Database running on network storage (CIFS/SMB) instead of local storage
- Insufficient resources allocated to database
- Database not placed on correct node
- Database needs vacuuming or optimization
Solution:
-
CRITICAL: Verify database is on local storage:
-
Verify database volume is local, not network:
-
Check database logs:
-
Monitor database resource usage:
-
For PostgreSQL, vacuum database:
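A sketch for the storage check. The function names are hypothetical and the lists of filesystem types are illustrative rather than exhaustive, so anything unrecognised is reported as unknown instead of being guessed.

```shell
# Classify a filesystem type as network-backed, local, or unknown.
storage_class() {
  case "$1" in
    cifs|smb3|nfs|nfs4) echo "network" ;;
    ext4|xfs|btrfs|zfs) echo "local" ;;
    *) echo "unknown" ;;
  esac
}

# Report the filesystem type backing a host path (findmnt is part of util-linux).
fs_type_of() {
  findmnt -n -o FSTYPE -T "$1"
}

# Example, run on the node hosting the database volume:
#   storage_class "$(fs_type_of /path/to/db-data)"
#
# Related commands from the steps above (placeholders in angle brackets):
#   docker service inspect <db-service> --format '{{json .Spec.TaskTemplate.ContainerSpec.Mounts}}'
#   docker stats --no-stream
#   docker exec <db-container> psql -U <user> -d <db> -c 'VACUUM (VERBOSE, ANALYZE);'
```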
Connection Errors¶
Symptoms:
Solution:
-
Verify database service is running:
-
Check database credentials:
-
Test database connection from application container:
-
Check database is ready:
Performance Degradation Over Time¶
Symptoms:
- Application was fast but now slow
- Database size growing rapidly
- Disk I/O very high
Solution:
-
Check database size:
-
Optimize PostgreSQL:
-
Check disk space on database node:
-
Increase database resources if needed:
Prevention:
- ALWAYS run PostgreSQL and MongoDB on local storage, never network storage
- Set node labels correctly: docker node update --label-add database=true <node>
- Use node with fast SSD for database workloads
- Monitor database size and performance with Prometheus/Grafana
- Schedule regular maintenance (VACUUM for PostgreSQL)
- Allocate sufficient resources for database containers
- Keep database versions updated
Database Node Label Missing¶
Symptoms:
Solution:
-
Add database node label:
-
Redeploy service:
Critical Note: For services like Immich and LibreChat with PostgreSQL or MongoDB databases, local storage is mandatory for acceptable performance. Network storage (CIFS/SMB) will result in extremely slow performance, connection timeouts, and health check failures.
DNS Cascade Failure¶
Issue: All three worker nodes go down simultaneously, services self-recover within seconds to minutes, but the event is unexplained.
Symptoms:
- Services appear healthy, then all fail at once, then recover without intervention
- docker node ls briefly showed mini, imac, and giant all as Down at the same time
- DNS stopped responding during the outage window
- Docker clients on the nodes got context deadline exceeded resolving external hostnames (e.g. ghcr.io)
Root Cause:
The Technitium DNS container crashed, which broke Docker's embedded resolver. The overlay gossip network (port 7946) depends on name resolution to stay healthy. When DNS goes down, all nodes lose gossip connectivity within seconds and the manager marks them all Down simultaneously. Swarm's restart policy brings DNS back, nodes reconnect, and services resume — the whole event typically resolves in under 30 seconds.
Identify:
Check the DNS crash history first:
Look for exit codes and errors:
- non-zero exit (139) — SIGSEGV (segmentation fault), typically .NET heap corruption
- non-zero exit (137) — SIGKILL (OOM kill or Docker watchdog)
- Error message double free or corruption (!prev) in logs confirms .NET heap corruption
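The exit codes listed above follow the usual convention that a value above 128 means the process was killed by signal number (code minus 128). A small decoder, with a hypothetical function name, makes task history easier to read.

```shell
# Decode a container exit code.
explain_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  echo "exit $code: SIGKILL (signal 9)" ;;
      11) echo "exit $code: SIGSEGV (signal 11)" ;;
      15) echo "exit $code: SIGTERM (signal 15)" ;;
      *)  echo "exit $code: signal $sig" ;;
    esac
  else
    echo "exit $code: application exit status"
  fi
}

# Task history for the DNS service (service name as used elsewhere in this guide):
#   docker service ps dns_dns-server --no-trunc --format '{{.CurrentState}} {{.Error}}'
```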
Check the DNS container log for the crash message:
Confirm gossip instability in the Docker daemon at the same timestamp:
journalctl -u docker --since "2 hours ago" --no-pager | grep -E "memberlist|connectivity issues" | head -20
You will see a burst of memberlist: Suspect ... has failed entries followed immediately by all workers rejoining.
Confirm with the health check:
The DNS crash history section and gossip instability count will highlight the problem.
Fix:
If the cluster has already self-recovered, no immediate action is required. The restart policy handles it automatically. To reduce recurrence:
-
Check whether a newer Technitium version fixes the heap corruption bug:
# Current version
docker service inspect dns_dns-server --format '{{.Spec.TaskTemplate.ContainerSpec.Image}}'
Check Technitium releases for changelogs mentioning crash or memory fixes.
-
Enable the Pi-hole secondary DNS so clients have a fallback during the crash window:
-
If crashes are frequent, run the health check regularly to track the pattern:
Cluster Health Monitoring¶
task ansible:cluster:health runs a local shell script against the manager's Docker socket — no Ansible overhead, results are immediate. Use it after any unexpected service degradation, before a maintenance window, or when investigating why things "just started working again."
The Grafana Cluster Health dashboard covers the same signals but via Loki and Prometheus, giving you historical time-series and searchable log streams. Use the shell script for a quick live check; use the dashboard to understand when an issue started and how often it recurs.
What it reports:
| Section | What to look for |
|---|---|
| Nodes | All nodes should show Ready / Active. Any Down or Drain is a problem. |
| Service Replicas | All services should show N/N. A 0/1 means a service is completely down. |
| DNS Crash History | More than 1–2 failures in the history is a signal to investigate Technitium stability. |
| Gossip Stability | Any count > 0 in the last hour indicates the overlay network had connectivity issues, likely triggered by a DNS crash. |
Related tasks:
| Task | When to use |
|---|---|
| task ansible:cluster:status | Quick node/service/network overview |
| task ansible:cluster:health | Deeper check: adds DNS history and gossip instability |
| task ansible:cluster:sync | A node shows Down and hasn't recovered: detects IP changes and rejoins it |
| task ansible:node:reboot -- -e "target_node=<node>" -K | Safely reboot a single node with clean storage detach/reattach (see Rebooting a Single Node) |
| task ansible:storage:recover | Node was hard-rebooted or crashed: OCFS2 journals may be dirty |
Docker Swarm Port 53 DNAT Hijacking Worker DNS¶
Issue: One or more worker nodes lose all DNS resolution. docker login and docker pull to external registries time out. The affected node cannot reach external hostnames even though its network interface is up and other nodes work fine.
Symptoms:
- dig @127.0.0.53 ghcr.io hangs indefinitely on the affected worker
- docker login ghcr.io times out after ~30 seconds
- curl https://ghcr.io also times out, but curl https://1.1.1.1 succeeds (routing is fine)
- nslookup ghcr.io 192.168.86.1 works (router DNS works), but nslookup ghcr.io 192.168.86.227 (Technitium) times out
- tcpdump -i lo udp port 53 shows zero packets — queries never reach the loopback interface
- systemctl restart systemd-resolved does not fix it
- Restarting Docker daemon does not fix it
Root Cause:
When the DNS service (stacks/dns/) publishes port 53 in Docker Swarm's default ingress mode, Swarm installs a DNAT rule on every node in the cluster — including workers that don't run the DNS container. The rule intercepts all UDP/TCP port 53 traffic and redirects it into the Swarm ingress network load balancer (172.18.0.2:53).
Under normal conditions this is invisible — the ingress LB proxies through to Technitium on whichever node is running it. But if the overlay network on a worker becomes corrupted (e.g., from a VXLAN failure during a cluster event), the ingress proxy can no longer reach Technitium. All port 53 traffic on that worker is silently black-holed by the DNAT rule before it ever reaches the host's resolver (127.0.0.53).
How to confirm:
# 1. On the affected worker, check nftables for the DNAT rule
sudo nft list ruleset | grep -A5 "DOCKER-INGRESS"
# You will see something like:
# udp dport 53 dnat to 172.18.0.2:53
# tcp dport 53 dnat to 172.18.0.2:53
# 2. Confirm the ingress proxy is unreachable
dig @172.18.0.2 ghcr.io
# Should also time out — confirms the proxy is broken, not just slow
# 3. Confirm packets aren't reaching the stub resolver
sudo tcpdump -i lo udp port 53 &
dig @127.0.0.53 ghcr.io
# tcpdump should show 0 packets captured — the DNAT rule intercepts before loopback
Fix:
Change port 53 in stacks/dns/docker-compose.yml from ingress mode (the default) to host mode. Host mode binds directly on the node running the container and does NOT install DNAT rules on other nodes.
# stacks/dns/docker-compose.yml
ports:
- target: 53
published: 53
protocol: udp
mode: host # was ingress (the default)
- target: 53
published: 53
protocol: tcp
mode: host # was ingress (the default)
- "5380:5380" # web UI — leave as ingress
Then redeploy the DNS stack:
Swarm will remove the DNAT rules from all worker nodes. After the redeploy, confirm the DNAT rules are gone:
# On any worker node — should return nothing
sudo nft list ruleset | grep "dport 53"
# Confirm DNS works again
dig @127.0.0.53 ghcr.io
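To run the same check across several nodes, the rule match can be wrapped in a filter. This is a sketch with a hypothetical function name. It matches "dport 53" followed by a space, so rules for port 5380 (the web UI) are not reported.

```shell
# Read an nftables ruleset on stdin and print rules that DNAT port 53.
port53_dnat_rules() {
  grep -E 'dport 53 .*dnat to' | sed 's/^[[:space:]]*//'
}

# Intended use on each worker, where no output means the rules are gone:
#   sudo nft list ruleset | port53_dnat_rules
```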
Trade-off of mode: host:
| | Ingress (default) | Host mode |
|---|---|---|
| Port 53 DNAT on all nodes | Yes — every node | No — only the node running DNS |
| Swarm failover routing | Automatic | Not automatic (DNS reschedules to a labeled node; clients must re-point or use DHCP) |
| Worker DNS breakage if overlay corrupts | Yes | No |
With mode: host, the DNS service can still fail over to another dns=true labeled node if the running node goes down — Swarm will reschedule it. However, clients that have a hard-coded DNS IP pointing at the old node's IP will temporarily lose DNS until they re-request via DHCP. For this homelab that is not a concern since DNS assignments come from the router.
Prevention:
Always publish DNS (port 53) services with mode: host in Swarm. The ingress routing mesh is designed for stateless HTTP/HTTPS load balancing — using it for UDP-based DNS creates a hidden single point of failure on every worker node.
General Troubleshooting Tips¶
Check Service Logs¶
# Follow logs in real-time
docker service logs <service-name> --tail 100 --follow
# Search logs for errors
docker service logs <service-name> --tail 1000 | grep -i error
Inspect Service Configuration¶
# View service details
docker service inspect <service-name>
# View service tasks (replicas)
docker service ps <service-name> --no-trunc
# View service placement constraints
docker service inspect <service-name> --format '{{.Spec.TaskTemplate.Placement}}'
Restart Service¶
# Restart single service
docker service update --force <service-name>
# Redeploy entire stack
task ansible:deploy:stack -- -e "stack_name=myapp"
Check System Resources¶
# Check disk space
df -h
# Check memory usage
free -h
# Check CPU usage
top
# Check Docker resources
docker system df
Get Shell Access to Container¶
# Find container
docker ps | grep myapp
# Execute shell
docker exec -it <container-id> sh
# Or bash if available
docker exec -it <container-id> bash
Need More Help?¶
If you're still experiencing issues:
- Check service-specific documentation in /stacks/apps/<service>/README.md
- Review recent changes with git log
- Search existing issues on GitHub
- Create new issue with detailed logs and configuration