Error Medic

Troubleshooting 'CircleCI Build Failed': Resolving Exit Code 137, Permission Denied, and Timeouts

Fix common CircleCI build failures including OOM (Exit Code 137), permission denied (403), and timeouts. Actionable steps to debug and optimize CI/CD pipelines.

Key Takeaways
  • Out of memory (OOM) errors, identifiable by 'Exited with code 137', require upgrading your resource_class or tuning language-specific memory limits like max-old-space-size.
  • Permission denied errors often originate from missing SSH keys for private repository access or expired cloud provider credentials in CircleCI Contexts.
  • CircleCI terminates jobs with 'Too long with no output (exceeded 10m0s)' if tasks run silently; fix this by adding a no_output_timeout declaration or implementing keep-alive logging.
  • Always utilize the 'Rerun job with SSH' feature to interactively debug failing containers, check system logs, and validate environment variables before modifying the pipeline configuration.
Fix Approaches Compared
Error Type                    | Common Root Cause                                  | Quick Fix                                        | Diagnostic Effort
Out of Memory (Code 137)      | Container RAM exhaustion during build/test         | Increase resource_class to 'large' or 'xlarge'   | Medium
Permission Denied (publickey) | Missing deploy keys for git submodules             | Inject SSH keys using the 'add_ssh_keys' step    | Low
Timeout (no output)           | Silent, long-running processes (e.g., DB seeding)  | Override with 'no_output_timeout: 30m'           | Low
IAM Permission Denied (403)   | Expired AWS/GCP credentials in Contexts            | Rotate context variables or migrate to OIDC      | High

Understanding the Error

When a developer encounters a generic 'CircleCI build failed' notification, it is rarely descriptive enough to resolve the issue directly. CircleCI acts as an orchestration engine, and a failure could originate from application code, infrastructure constraints, network flakes, or authentication hurdles. As a DevOps or SRE professional, your goal is to immediately identify the specific failure vector. The most common infrastructure-level failures manifest as Out of Memory (OOM) terminations, Permission Denied blocks, and silent Timeouts.

By examining the job steps in the CircleCI dashboard, you can isolate the exact command that failed and its corresponding exit code. This guide provides a deep dive into diagnosing and resolving the three most prevalent CircleCI infrastructure errors.

Step 1: Diagnosing 'CircleCI Out of Memory' (Exit Code 137)

One of the most notorious and confusing errors in CircleCI is the silent killer: Exited with code 137. This exit code indicates that the container was abruptly terminated by the Linux Out of Memory (OOM) killer. It often happens without a stack trace, leaving developers puzzled as to why their build stopped mid-execution.

Identifying the issue: If you inspect the CircleCI UI, the build simply halts. This typically occurs during resource-intensive tasks such as Webpack compilation, Docker image builds, or running comprehensive test suites in memory-heavy languages like Java or Node.js.

The Fix: There are two primary ways to resolve OOM errors: vertical scaling or application-level memory tuning.

  1. Increase the Resource Class: By default, CircleCI uses a medium resource class for many executors. You can vertically scale the container by requesting a larger executor in your .circleci/config.yml.
jobs:
  build:
    docker:
      - image: cimg/node:18.0
    resource_class: large # Upgraded from medium (gives 4 vCPUs and 8GB RAM)
    steps:
      - checkout
      - run: npm run build
  2. Optimize Garbage Collection and Memory Limits: If upgrading the resource class is not an option due to billing constraints, you must constrain your application's memory usage.
  • For Node.js: V8's garbage collector can be lazy about reclaiming memory. Cap the heap explicitly: add NODE_OPTIONS=--max-old-space-size=4096 to your environment variables to restrict the heap to 4GB, which forces collection before the container's limit is hit.
  • For Java: Tune the JVM heap size. Setting _JAVA_OPTIONS="-Xmx3200m" ensures the JVM leaves enough RAM for the operating system and other background processes, preventing the OOM killer from targeting the container.
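
As a sketch, both limits can be set directly in the job's environment block. The values below are examples: tune them to sit comfortably below the RAM your resource_class provides, leaving headroom for the OS.

environment-limits example (.circleci/config.yml):
jobs:
  build:
    docker:
      - image: cimg/node:18.0
    resource_class: medium
    environment:
      # Example values: keep the heap below the ~4GB a 'medium'
      # class provides, leaving headroom for the OS and tooling.
      NODE_OPTIONS: --max-old-space-size=3584
      _JAVA_OPTIONS: -Xmx3200m
    steps:
      - checkout
      - run: npm run build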

Step 2: Resolving 'CircleCI Permission Denied'

Permission denied errors are explicit roadblocks that generally manifest in two distinct areas: Git operations and Cloud Provider deployments.

Diagnosing Git and SSH Issues: If your build fails during the checkout step, or when attempting to clone a private submodule, you will likely see: Permission denied (publickey). fatal: Could not read from remote repository. This means the build container lacks the authorized SSH keys to authenticate with GitHub or Bitbucket.

The Fix: Ensure you have added the appropriate deploy keys or user SSH keys in the project settings under 'SSH Keys'. Then, you must explicitly inject them into the job steps using the add_ssh_keys directive before checkout:

steps:
  - add_ssh_keys:
      fingerprints:
        - "SO:ME:FI:NG:ER:PR:IN:T"
  - checkout
  - run: git submodule update --init --recursive

Diagnosing Cloud Provider Permissions (AWS/GCP): Another variation of 'permission denied' occurs during deployment steps, resulting in HTTP 403 Forbidden errors. This usually indicates that the AWS IAM keys or GCP Service Account credentials stored in your CircleCI Contexts have expired or lack the required attached policies.

The Fix: Navigate to Organization Settings -> Contexts. Verify that your deployment context contains valid credentials. For enhanced security and to avoid hardcoded credential expiration issues, migrate to OpenID Connect (OIDC). OIDC allows CircleCI to assume an IAM role dynamically without storing long-lived secret keys.
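
As a sketch of the OIDC approach: when a job uses a context, CircleCI exposes the job's OIDC token as the CIRCLE_OIDC_TOKEN environment variable, which the AWS CLI can exchange for temporary credentials. The role ARN and image tag below are placeholders; substitute the IAM role that trusts CircleCI's OIDC provider for your organization.

jobs:
  deploy:
    docker:
      - image: cimg/aws:2023.09   # image tag is an example
    steps:
      - run:
          name: Assume role via OIDC (no stored keys)
          command: |
            # Placeholder ARN: replace with the role that trusts
            # CircleCI's OIDC identity provider in your AWS account.
            aws sts assume-role-with-web-identity \
              --role-arn "arn:aws:iam::123456789012:role/circleci-deploy" \
              --role-session-name "circleci-${CIRCLE_WORKFLOW_ID}" \
              --web-identity-token "${CIRCLE_OIDC_TOKEN}"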

Step 3: Fixing 'CircleCI Timeout' Errors

CircleCI enforces a strict default timeout of 10 minutes for any task that does not produce output to stdout or stderr. The exact error logged is: Too long with no output (exceeded 10m0s): context deadline exceeded.

Identifying the issue: This often happens during silent, long-running processes. Common culprits include downloading massive machine learning datasets, restoring gigabytes of cache, seeding large databases, or compiling complex binaries.

The Fix: If your task genuinely requires more than 10 minutes of silent processing time, you can override the default timeout on a per-step basis using the no_output_timeout key.

steps:
  - run:
      name: Run heavy database migration
      command: ./run-migrations.sh
      no_output_timeout: 30m

Alternatively, improve observability by refactoring your scripts to emit keep-alive logs. Using curl -v instead of silent mode, or adding an echo statement inside long loops, prevents the orchestrator from assuming the job has hung indefinitely.
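
If you cannot raise the timeout, a heartbeat wrapper like the minimal sketch below keeps stdout active while the silent work runs. long_task here is a stand-in for your real command (such as the migration script above); replace it with the actual invocation.

```shell
#!/bin/bash
# Heartbeat wrapper: run a silent long task in the background and
# emit periodic output so CircleCI's no-output watchdog never fires.
long_task() { sleep 2; }   # stand-in for the real silent command

long_task &
TASK_PID=$!

# Print a keep-alive line every second while the task is running.
while kill -0 "$TASK_PID" 2>/dev/null; do
  echo "keep-alive: task still running ($(date +%T))"
  sleep 1
done

# Propagate the task's real exit status to the CI step.
wait "$TASK_PID"
```

The `kill -0` check probes whether the process is still alive without signalling it, and the final `wait` ensures the step fails if the background task failed.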

Step 4: Utilizing the 'Rerun Job with SSH' Feature

When static logs are insufficient to diagnose bizarre failures or random network drops, your most powerful tool is the 'Rerun job with SSH' feature.

Initiating this creates a fresh container and pauses the job, providing an SSH connection string in the UI. Once connected, you can impersonate the CI environment. You can check network egress using curl, validate memory constraints by running free -m, examine the system log for OOM events using dmesg, and manually execute the failing steps to observe real-time behavior. This interactive debugging drastically reduces the cycle time for resolving complex CI/CD pipeline breakages.

Diagnostic Script: Putting It All Together

The script below consolidates the memory, OOM, SSH, and credential checks from the previous steps. Run it after connecting with 'Rerun job with SSH':
#!/bin/bash
# Diagnostic script to run when SSHing into a failed CircleCI container

echo "=== Checking Memory Limits and Usage ==="
free -m

echo -e "\n=== Checking for OOM Kills in Syslog ==="
dmesg -T | grep -i oom || echo "No OOM events found in dmesg."

echo -e "\n=== Validating SSH Keys for Git Access ==="
ssh -T -o StrictHostKeyChecking=no git@github.com || true

echo -e "\n=== Checking Environment Variable Accessibility ==="
if [ -z "$AWS_ACCESS_KEY_ID" ]; then
    echo "WARNING: AWS credentials are not set in this context."
else
    echo "AWS credentials found. Validating caller identity..."
    aws sts get-caller-identity || echo "AWS token validation failed. Check IAM roles or Context settings."
fi

echo -e "\n=== Checking Running Processes ==="
ps aux --sort=-%mem | head -n 10

Error Medic Editorial

Our editorial team consists of Senior Site Reliability Engineers and DevOps architects specializing in CI/CD pipeline optimization, infrastructure as code, and robust deployment automation.
