Error Medic

How to Fix ArgoCD Connection Refused, CrashLoopBackOff, and Timeout Errors

Resolve ArgoCD connection refused, CrashLoopBackOff, and timeout errors with our complete troubleshooting guide. Learn root causes, diagnostic commands, and qui

Key Takeaways
  • Connection Refused is often caused by aggressively restrictive NetworkPolicies, mismatched Service selectors, or unready argocd-server pods.
  • CrashLoopBackOff and timeouts typically stem from OOMKilled events on the repo-server, where large Git repositories or complex Helm charts push memory use past the configured limit.
  • Permission Denied errors during app sync mean the argocd-application-controller ServiceAccount lacks the required RBAC ClusterRoles.
  • ImagePullBackOff usually indicates Docker Hub rate limits or missing imagePullSecrets for private enterprise registries.
  • Quick Fix: Check pod statuses (`kubectl get pods -n argocd`), review events for OOM kills, verify RBAC bindings, and increase CPU/Memory limits on the repo-server.
Fix Approaches Compared
Method | When to Use | Time | Risk
Restart Failed Pods | Transient Redis cache issues or temporary network drops | < 2 mins | Low
Increase Resource Limits | Pods stuck in CrashLoopBackOff (OOMKilled) or consistent timeouts | 5 mins | Low
Modify RBAC / ClusterRoles | ArgoCD permission denied errors during Application Sync phases | 10 mins | High (Security)
Update NetworkPolicies | ArgoCD connection refused errors between internal components | 15 mins | Medium

Understanding ArgoCD Connection and Lifecycle Errors

When managing Kubernetes clusters using GitOps, ArgoCD is often the beating heart of your continuous delivery pipeline. However, encountering errors like dial tcp: lookup argocd-server: connection refused, CrashLoopBackOff, or timeout can bring your deployments to a grinding halt. This guide, written from the trenches of site reliability engineering, covers the diagnosis and remediation of the most common ArgoCD failure states.

Symptom 1: ArgoCD Connection Refused

The connection refused error typically manifests in two scenarios: when the ArgoCD CLI cannot reach the API server, or when internal ArgoCD components (like the Application Controller) cannot communicate with the Repo Server or Redis.

Common Error Messages:

  • FATA[0000] rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 10.96.x.x:443: connect: connection refused"
  • dial tcp [::1]:8080: connect: connection refused

Root Causes:

  1. Pod Readiness: The argocd-server pod is not in a Ready state.
  2. Network Policies: Aggressive default-deny network policies are blocking intra-namespace communication or ingress traffic.
  3. Service Misconfiguration: The Kubernetes Service pointing to the ArgoCD server has mismatched selectors or ports.
  4. TLS/Certificate Issues: Ingress controllers failing to terminate TLS properly, causing backend connection drops.

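Resolution: Verify the service endpoints using kubectl get endpoints -n argocd. If the endpoints list for argocd-server is empty, the Service is not selecting any Ready pods, so compare the pod labels with the Service selector. If network policies are in play, ensure a policy (for example, one named allow-argocd-server) permits ingress to the argocd-server pods. NetworkPolicy rules match the pod's container port, which is 8080 in the standard manifests, rather than the Service ports 80 and 443. For CLI port-forwarding issues, ensure the forward is still running and bound to the local interface and port the CLI is targeting.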
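The endpoint check above can be wrapped in a small helper. This is a minimal sketch, assuming kubectl is already pointed at the right cluster; check_service_endpoints is a hypothetical function name, not an ArgoCD or kubectl command.

```shell
#!/bin/sh
# Sketch: report whether a Service currently has ready endpoints.
# check_service_endpoints is a hypothetical helper name (an assumption).
check_service_endpoints() {
  svc="$1"
  ns="${2:-argocd}"
  # Ready pod IPs behind the Service; empty when no Ready pod matches the selector.
  ips="$(kubectl get endpoints "$svc" -n "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}')" || return 2
  if [ -z "$ips" ]; then
    echo "$svc: no ready endpoints"
    # Print the selector so it can be compared with the pod labels.
    echo "  service selector: $(kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.selector}')"
    echo "  compare with: kubectl get pods -n $ns --show-labels"
    return 1
  fi
  echo "$svc: ready endpoints: $ips"
}
```

Calling check_service_endpoints argocd-server prints the ready pod IPs, or the Service selector when nothing matches. A non-zero return code makes it usable in scripts.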

Symptom 2: CrashLoopBackOff and OOMKilled

A component entering CrashLoopBackOff means the container is repeatedly starting and crashing. In ArgoCD, this most frequently affects the argocd-repo-server or argocd-application-controller.

Common Error Messages:

  • Reason: OOMKilled
  • Exit Code: 137
  • Reason: CrashLoopBackOff

Root Causes:

  1. Out of Memory (OOM): The argocd-repo-server processes Git clones and Helm templating in memory. Large repositories or complex Helm charts can easily breach default resource limits.
  2. Corrupt Redis Cache: If the argocd-redis component crashes, dependent services may fail to initialize.
  3. Misconfigured ConfigMaps: Syntax errors in argocd-cm or argocd-rbac-cm can cause the server to crash on startup.

Resolution: Increase resource requests and limits. Edit the deployment: kubectl edit deploy argocd-repo-server -n argocd. Bump the memory limit to 1Gi or 2Gi depending on your repository size. If Redis is corrupted, a simple kubectl delete pod -l app.kubernetes.io/name=argocd-redis -n argocd will force a recreation and often clear the cache-related crashes.
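As a non-interactive alternative to kubectl edit, the sketch below raises the repo-server memory limit only when the last container termination was an OOM kill. The function names and the 1Gi default are illustrative assumptions. If ArgoCD is installed through Helm or manages itself, a change made this way may be reverted on the next sync, so the same value should also go into the source manifests.

```shell
#!/bin/sh
# Print the last termination reason of each repo-server container (e.g. OOMKilled).
repo_server_last_reason() {
  ns="${1:-argocd}"
  kubectl get pods -n "$ns" -l app.kubernetes.io/name=argocd-repo-server \
    -o jsonpath='{.items[*].status.containerStatuses[*].lastState.terminated.reason}'
}

# Raise the memory limit only if an OOM kill was recorded.
# bump_repo_server_memory is a hypothetical helper name (an assumption).
bump_repo_server_memory() {
  limit="${1:-1Gi}"
  ns="${2:-argocd}"
  case "$(repo_server_last_reason "$ns")" in
    *OOMKilled*)
      kubectl set resources deployment argocd-repo-server -n "$ns" --limits="memory=$limit" &&
        kubectl rollout status deployment argocd-repo-server -n "$ns" --timeout=120s
      ;;
    *)
      echo "no OOMKilled termination recorded; leaving limits unchanged"
      ;;
  esac
}
```

For example, bump_repo_server_memory 2Gi applies a 2Gi limit and waits for the rollout to finish.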

Symptom 3: ImagePullBackOff

ImagePullBackOff or ErrImagePull occurs when the Kubelet cannot fetch the container image required for an ArgoCD component.

Root Causes:

  1. Rate Limiting: Hitting Docker Hub rate limits if pulling public images without authentication.
  2. Private Registries: Missing imagePullSecrets for custom/enterprise ArgoCD images.
  3. Network Egress: The worker node lacks outbound internet access to reach image registries like quay.io or ghcr.io.

Resolution: Inspect the exact failure using kubectl describe pod <pod-name> -n argocd. Look at the events at the bottom. If it's a rate limit issue, consider mirroring the images to an internal registry like Harbor or AWS ECR, and update your ArgoCD manifests (or Helm values) to point to the internal registry.
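To triage the event text quickly, the following sketch maps a pull-failure message to one of the three root causes listed above. The matched substrings are common registry and kubelet messages, but the matching is a heuristic assumption, and classify_pull_error is a hypothetical helper name.

```shell
#!/bin/sh
# Heuristic sketch: read image-pull event text on stdin and name the likely root cause.
classify_pull_error() {
  msg="$(cat)"
  case "$msg" in
    *toomanyrequests*|*"rate limit"*)
      echo "rate-limit" ;;
    *unauthorized*|*"pull access denied"*|*"authentication required"*)
      echo "missing-credentials" ;;
    *"i/o timeout"*|*"no such host"*|*"connection refused"*)
      echo "network-egress" ;;
    *)
      echo "unknown" ;;
  esac
}

# Usage against a live pod (<pod-name> is a placeholder):
# kubectl get events -n argocd \
#   --field-selector involvedObject.name=<pod-name>,reason=Failed \
#   -o jsonpath='{.items[*].message}' | classify_pull_error
```

An "unknown" result means the message matched none of the patterns, so read the full kubectl describe pod output.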

Symptom 4: ArgoCD Permission Denied

Permission errors often occur during the sync phase when ArgoCD attempts to apply resources to the target cluster.

Common Error Messages:

  • Failed to sync application: permission denied: roles.rbac.authorization.k8s.io "my-role" is forbidden
  • User "system:serviceaccount:argocd:argocd-application-controller" cannot create resource

Root Causes: ArgoCD uses a ServiceAccount (usually argocd-application-controller) to interact with the Kubernetes API. If you are deploying resources across different namespaces or utilizing cluster-scoped resources (like CustomResourceDefinitions or ClusterRoles), the ServiceAccount needs elevated permissions.

Resolution: Ensure the application controller has the correct ClusterRoleBinding. For full cluster admin (common in dedicated GitOps clusters), verify the binding: kubectl describe clusterrolebinding argocd-application-controller. If restricting access, ensure you have explicitly granted permissions to the target namespace in the ArgoCD cluster configuration and updated your destination RBAC appropriately.
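Before changing any bindings, it can help to ask the API server what the controller's ServiceAccount is actually allowed to do. The sketch below uses kubectl auth can-i with impersonation. It assumes your own user may impersonate ServiceAccounts and that the target is the cluster ArgoCD runs in (remote clusters are typically accessed through a different account). controller_can and ARGOCD_NAMESPACE are hypothetical names.

```shell
#!/bin/sh
# Sketch: check one verb/resource pair as the application controller's ServiceAccount.
# controller_can and the ARGOCD_NAMESPACE variable are assumptions, not ArgoCD features.
controller_can() {
  verb="$1"
  resource="$2"
  target_ns="${3:-default}"
  sa="system:serviceaccount:${ARGOCD_NAMESPACE:-argocd}:argocd-application-controller"
  # kubectl auth can-i prints "yes" or "no"; errors are reported as "error" below.
  answer="$(kubectl auth can-i "$verb" "$resource" -n "$target_ns" --as="$sa" 2>/dev/null)"
  echo "$verb $resource in $target_ns: ${answer:-error}"
  [ "$answer" = "yes" ]
}
```

For the first error message above, controller_can create roles.rbac.authorization.k8s.io my-namespace would print "no" until the missing permission is granted.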

Symptom 5: ArgoCD Timeout Errors

Timeouts generally occur when generating manifests takes longer than the configured threshold, or when Git operations stall over the network.

Common Error Messages:

  • rpc error: code = DeadlineExceeded desc = context deadline exceeded
  • ComparisonError: rpc error: code = Unavailable desc = transport is closing

Root Causes:

  1. Slow Helm Rendering: Helm charts with multiple dependencies or complex templates.
  2. Large Git Repositories: Cloning monolithic repositories takes too long.
  3. Resource Starvation: CPU throttling on the argocd-repo-server slows down manifest generation.

Resolution: Increase the repo-server timeout settings. In the argocd-cmd-params-cm ConfigMap, set server.repo.server.timeout.seconds: "120" (default is 60) and, for timeouts reported by the application controller, controller.repo.server.timeout.seconds as well, then restart the affected components so they pick up the change. Additionally, configure webhooks in your Git provider (GitHub/GitLab) so ArgoCD refreshes as soon as a commit lands instead of relying on polling, and ensure the argocd-repo-server has sufficient CPU allocated to avoid throttling.
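The timeout change can be applied as a merge patch. This sketch assumes a standard install in which the repo-server timeout keys are read from the argocd-cmd-params-cm ConfigMap and the application controller runs as a StatefulSet. The function names are hypothetical.

```shell
#!/bin/sh
# Build the JSON merge patch for both repo-server timeout keys.
# build_timeout_patch is a hypothetical helper name (an assumption).
build_timeout_patch() {
  seconds="$1"
  case "$seconds" in
    ''|*[!0-9]*)
      echo "timeout must be a whole number of seconds" >&2
      return 1
      ;;
  esac
  printf '{"data":{"server.repo.server.timeout.seconds":"%s","controller.repo.server.timeout.seconds":"%s"}}' \
    "$seconds" "$seconds"
}

# Apply the patch, then restart the components that read the setting at startup.
set_repo_server_timeout() {
  ns="${2:-argocd}"
  patch="$(build_timeout_patch "$1")" || return 1
  kubectl patch configmap argocd-cmd-params-cm -n "$ns" --type merge -p "$patch" &&
    kubectl rollout restart deployment argocd-server -n "$ns" &&
    kubectl rollout restart statefulset argocd-application-controller -n "$ns"
}
```

Running set_repo_server_timeout 120 doubles the default. As with any live patch, mirror the value in your manifests or Helm values so it survives the next sync.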

Step-by-Step Diagnostic Workflow

  1. Check the Control Plane Health: Run kubectl get pods -n argocd -o wide. Identify any pods not in Running state.
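  2. Examine Events: Run kubectl get events -n argocd --sort-by='.metadata.creationTimestamp'. Look for scheduling failures, back-off restarts, or readiness probe failures. OOM kills are usually recorded in the container's last state rather than as events, so confirm them with kubectl describe pod <pod-name> -n argocd (Reason: OOMKilled).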
  3. Inspect the Logs: For connection issues, start with the API server: kubectl logs -l app.kubernetes.io/name=argocd-server -n argocd --tail=100. For sync timeouts or permission errors, look at the controller: kubectl logs -l app.kubernetes.io/name=argocd-application-controller -n argocd --tail=100.
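  4. Validate Network Connectivity: Exec into the application controller (a StatefulSet in the standard manifests) and test the connection to the repo server: kubectl exec -it statefulset/argocd-application-controller -n argocd -- sh, then run nc -zv argocd-repo-server 8081. If nc is not present in the image, run the same check from a temporary debug pod in the namespace.
  5. Review Configuration: Inspect argocd-cm and argocd-rbac-cm with kubectl describe cm argocd-cm -n argocd (and likewise for argocd-rbac-cm). argocd-secret is a Secret, not a ConfigMap, so use kubectl describe secret argocd-secret -n argocd, which lists its keys without printing the values.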
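For step 1, a pod can report Running while its containers are not Ready, which is enough to cause connection refused errors. The sketch below filters kubectl output for both conditions. unready_pods is a hypothetical helper name, and it assumes the default column order of kubectl get pods (NAME, READY, STATUS).

```shell
#!/bin/sh
# Sketch: read `kubectl get pods --no-headers` output on stdin and print pods that
# are not Running or whose ready container count is below the total.
unready_pods() {
  awk '{
    split($2, ready, "/")
    if ($3 != "Running" || ready[1] != ready[2]) print $1, $2, $3
  }'
}

# Usage:
# kubectl get pods -n argocd --no-headers | unready_pods
```

Empty output means every pod is Running with all containers Ready. Completed Job pods, if any exist in the namespace, will also be listed.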

Frequently Asked Questions

bash
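#!/bin/bash
# ArgoCD automated diagnostic script.
# Checks are independent and best-effort; kubectl must point at the right cluster.
NAMESPACE="${NAMESPACE:-argocd}"

printf '=== Checking Pod Health (pods not Running) ===\n'
kubectl get pods -n "$NAMESPACE" -o wide | grep -v "Running"

# OOM kills are recorded in each container's last termination state.
printf '\n=== Checking for OOMKilled Containers ===\n'
kubectl get pods -n "$NAMESPACE" \
  -o custom-columns='POD:.metadata.name,LAST_REASON:.status.containerStatuses[*].lastState.terminated.reason' \
  | grep -i "OOMKilled"

printf '\n=== Checking ArgoCD Server Logs for Connection Errors ===\n'
kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/name=argocd-server --tail=50 | grep -i -E "error|refused|timeout"

printf '\n=== Checking Application Controller Logs for Permission Denied ===\n'
kubectl logs -n "$NAMESPACE" -l app.kubernetes.io/name=argocd-application-controller --tail=50 | grep -i -E "permission denied|forbidden"

# kubectl top needs metrics-server installed in the cluster.
printf '\n=== Checking Repo Server Resource Usage ===\n'
kubectl top pods -n "$NAMESPACE" -l app.kubernetes.io/name=argocd-repo-server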

Error Medic Editorial

Error Medic Editorial is managed by a team of Senior DevOps and Site Reliability Engineers dedicated to demystifying cloud-native tooling, Kubernetes troubleshooting, and GitOps best practices.
