Error Medic

Troubleshooting 'CircleCI Build Failed': Resolving Exit Code 137, Permission Denied, and Timeouts

Fix common CircleCI build failures including OOM (Exit Code 137), permission denied (403), and timeouts. Actionable steps to debug and optimize CI/CD pipelines.

Key Takeaways
  • Out of memory (OOM) errors, identifiable by 'Exited with code 137', require upgrading your resource_class or tuning language-specific memory limits like max-old-space-size.
  • Permission denied errors often originate from missing SSH keys for private repository access or expired cloud provider credentials in CircleCI Contexts.
  • CircleCI terminates jobs with 'Too long with no output (exceeded 10m0s)' if tasks run silently; fix this by adding a no_output_timeout declaration or implementing keep-alive logging.
  • Always utilize the 'Rerun job with SSH' feature to interactively debug failing containers, check system logs, and validate environment variables before modifying the pipeline configuration.
Fix Approaches Compared
Error Type                    | Common Root Cause                                  | Quick Fix                                        | Diagnostic Effort
Out of Memory (Code 137)      | Container RAM exhaustion during build/test         | Increase resource_class to 'large' or 'xlarge'   | Medium
Permission Denied (publickey) | Missing deploy keys for git submodules             | Inject SSH keys using the 'add_ssh_keys' step    | Low
Timeout (no output)           | Silent, long-running processes (e.g., DB seeding)  | Override with 'no_output_timeout: 30m'           | Low
IAM Permission Denied (403)   | Expired AWS/GCP credentials in Contexts            | Rotate context variables or migrate to OIDC      | High

Understanding the Error

When a developer encounters a generic 'CircleCI build failed' notification, it is rarely descriptive enough to resolve the issue directly. CircleCI acts as an orchestration engine, and a failure could originate from application code, infrastructure constraints, network flakes, or authentication hurdles. As a DevOps or SRE professional, your goal is to immediately identify the specific failure vector. The most common infrastructure-level failures manifest as Out of Memory (OOM) terminations, Permission Denied blocks, and silent Timeouts.

By examining the job steps in the CircleCI dashboard, you can isolate the exact command that failed and its corresponding exit code. This guide provides a deep dive into diagnosing and resolving the three most prevalent CircleCI infrastructure errors.

Step 1: Diagnosing 'CircleCI Out of Memory' (Exit Code 137)

One of the most notorious and confusing errors in CircleCI is the silent killer: Exited with code 137. This exit code indicates that the container was abruptly terminated by the Linux Out of Memory (OOM) killer. It often happens without a stack trace, leaving developers puzzled as to why their build stopped mid-execution.

Identifying the issue: If you inspect the CircleCI UI, the build simply halts. This typically occurs during resource-intensive tasks such as Webpack compilation, Docker image builds, or running comprehensive test suites in memory-heavy languages like Java or Node.js.

The Fix: There are two primary ways to resolve OOM errors: vertical scaling or application-level memory tuning.

  1. Increase the Resource Class: By default, CircleCI uses a medium resource class for many executors. You can vertically scale the container by requesting a larger executor in your .circleci/config.yml.
jobs:
  build:
    docker:
      - image: cimg/node:18.0
    resource_class: large # Upgraded from medium (gives 4 vCPUs and 8GB RAM)
    steps:
      - checkout
      - run: npm run build
  2. Optimize Garbage Collection and Memory Limits: If upgrading the resource class is not an option due to billing constraints, you must constrain your application's memory usage.
  • For Node.js: V8's garbage collector can be lazy about reclaiming memory. Cap the heap explicitly: add NODE_OPTIONS=--max-old-space-size=4096 to your environment variables to restrict the heap to 4GB, which forces collection before the container's limit is hit.
  • For Java: Tune the JVM heap size. Setting _JAVA_OPTIONS="-Xmx3200m" ensures the JVM leaves enough RAM for the operating system and other background processes, preventing the OOM killer from targeting the container.
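
As a sketch, both limits can be set directly in the job's environment block. The values below are examples: tune them to sit comfortably below the RAM your resource_class provides, leaving headroom for the OS.

environment-limits example (.circleci/config.yml):
jobs:
  build:
    docker:
      - image: cimg/node:18.0
    resource_class: medium
    environment:
      # Example values: keep the heap below the ~4GB a 'medium'
      # class provides, leaving headroom for the OS and tooling.
      NODE_OPTIONS: --max-old-space-size=3584
      _JAVA_OPTIONS: -Xmx3200m
    steps:
      - checkout
      - run: npm run build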

Step 2: Resolving 'CircleCI Permission Denied'

Permission denied errors are explicit roadblocks that generally manifest in two distinct areas: Git operations and Cloud Provider deployments.

Diagnosing Git and SSH Issues: If your build fails during the checkout step, or when attempting to clone a private submodule, you will likely see: Permission denied (publickey). fatal: Could not read from remote repository. This means the build container lacks the authorized SSH keys to authenticate with GitHub or Bitbucket.

The Fix: Ensure you have added the appropriate deploy keys or user SSH keys in the project settings under 'SSH Keys'. Then, you must explicitly inject them into the job steps using the add_ssh_keys directive before checkout:

steps:
  - add_ssh_keys:
      fingerprints:
        - "SO:ME:FI:NG:ER:PR:IN:T"
  - checkout
  - run: git submodule update --init --recursive

Diagnosing Cloud Provider Permissions (AWS/GCP): Another variation of 'permission denied' occurs during deployment steps, resulting in HTTP 403 Forbidden errors. This usually indicates that the AWS IAM keys or GCP Service Account credentials stored in your CircleCI Contexts have expired or lack the required attached policies.

The Fix: Navigate to Organization Settings -> Contexts. Verify that your deployment context contains valid credentials. For enhanced security and to avoid hardcoded credential expiration issues, migrate to OpenID Connect (OIDC). OIDC allows CircleCI to assume an IAM role dynamically without storing long-lived secret keys.
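
As a sketch of the OIDC approach: when a job uses a context, CircleCI exposes the job's OIDC token as the CIRCLE_OIDC_TOKEN environment variable, which the AWS CLI can exchange for temporary credentials. The role ARN and image tag below are placeholders; substitute the IAM role that trusts CircleCI's OIDC provider for your organization.

jobs:
  deploy:
    docker:
      - image: cimg/aws:2023.09   # image tag is an example
    steps:
      - run:
          name: Assume role via OIDC (no stored keys)
          command: |
            # Placeholder ARN: replace with the role that trusts
            # CircleCI's OIDC identity provider in your AWS account.
            aws sts assume-role-with-web-identity \
              --role-arn "arn:aws:iam::123456789012:role/circleci-deploy" \
              --role-session-name "circleci-${CIRCLE_WORKFLOW_ID}" \
              --web-identity-token "${CIRCLE_OIDC_TOKEN}"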

Step 3: Fixing 'CircleCI Timeout' Errors

CircleCI enforces a strict default timeout of 10 minutes for any task that does not produce output to stdout or stderr. The exact error logged is: Too long with no output (exceeded 10m0s): context deadline exceeded.

Identifying the issue: This often happens during silent, long-running processes. Common culprits include downloading massive machine learning datasets, restoring gigabytes of cache, seeding large databases, or compiling complex binaries.

The Fix: If your task genuinely requires more than 10 minutes of silent processing time, you can override the default timeout on a per-step basis using the no_output_timeout key.

steps:
  - run:
      name: Run heavy database migration
      command: ./run-migrations.sh
      no_output_timeout: 30m

Alternatively, improve observability by refactoring your scripts to emit keep-alive logs. Using curl -v instead of silent mode, or adding an echo statement inside long loops, prevents the orchestrator from assuming the job has hung indefinitely.
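
If you cannot raise the timeout, a heartbeat wrapper like the minimal sketch below keeps stdout active while the silent work runs. long_task here is a stand-in for your real command (such as the migration script above); replace it with the actual invocation.

```shell
#!/bin/bash
# Heartbeat wrapper: run a silent long task in the background and
# emit periodic output so CircleCI's no-output watchdog never fires.
long_task() { sleep 2; }   # stand-in for the real silent command

long_task &
TASK_PID=$!

# Print a keep-alive line every second while the task is running.
while kill -0 "$TASK_PID" 2>/dev/null; do
  echo "keep-alive: task still running ($(date +%T))"
  sleep 1
done

# Propagate the task's real exit status to the CI step.
wait "$TASK_PID"
```

The `kill -0` check probes whether the process is still alive without signalling it, and the final `wait` ensures the step fails if the background task failed.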

Step 4: Utilizing the 'Rerun Job with SSH' Feature

When static logs are insufficient to diagnose bizarre failures or random network drops, your most powerful tool is the 'Rerun job with SSH' feature.

Initiating this creates a fresh container and pauses the job, providing an SSH connection string in the UI. Once connected, you can impersonate the CI environment. You can check network egress using curl, validate memory constraints by running free -m, examine the system log for OOM events using dmesg, and manually execute the failing steps to observe real-time behavior. This interactive debugging drastically reduces the cycle time for resolving complex CI/CD pipeline breakages.

Diagnostic Script: Putting It All Together

The script below consolidates the memory, OOM, SSH, and credential checks from the previous steps. Run it after connecting with 'Rerun job with SSH':
#!/bin/bash
# Diagnostic script to run when SSHing into a failed CircleCI container

echo "=== Checking Memory Limits and Usage ==="
free -m

echo -e "\n=== Checking for OOM Kills in Syslog ==="
dmesg -T | grep -i oom || echo "No OOM events found in dmesg."

echo -e "\n=== Validating SSH Keys for Git Access ==="
ssh -T -o StrictHostKeyChecking=no git@github.com || true

echo -e "\n=== Checking Environment Variable Accessibility ==="
if [ -z "$AWS_ACCESS_KEY_ID" ]; then
    echo "WARNING: AWS credentials are not set in this context."
else
    echo "AWS credentials found. Validating caller identity..."
    aws sts get-caller-identity || echo "AWS token validation failed. Check IAM roles or Context settings."
fi

echo -e "\n=== Checking Running Processes ==="
ps aux --sort=-%mem | head -n 10

Error Medic Editorial

Our editorial team consists of Senior Site Reliability Engineers and DevOps architects specializing in CI/CD pipeline optimization, infrastructure as code, and robust deployment automation.
