Error Medic

Datadog Not Working: Troubleshooting Agent Status, APM Drops, and Connectivity Errors

Resolve "Datadog not working" issues. Learn to diagnose stopped agents, invalid API keys, port 8126 APM conflicts, and blocked outbound network traffic.

Key Takeaways
  • Verify the Datadog Agent is actually running and API keys are valid by executing 'sudo datadog-agent status'.
  • Ensure outbound network traffic to Datadog endpoints (port 443) is not blocked by firewalls, VPC endpoints, or security groups.
  • For missing APM traces, check for port 8126 conflicts and verify the trace-agent service is active and receiving data.
  • Examine '/var/log/datadog/agent.log' for specific errors like 'Connection refused' or 'API Key is invalid'.
  • Use the 'datadog-agent flare' command to securely bundle logs and configuration files for Datadog Support if immediate fixes fail.
Fix Approaches Compared
| Method | When to Use | Time | Risk |
| --- | --- | --- | --- |
| Agent Restart | Initial triage for frozen metrics or unresponsive agent | 1 min | Low |
| Verify API Keys | Agent status explicitly reports 'Invalid API Key' | 2 mins | Low |
| Network / Firewall Audit | Agent logs show 'dial tcp: i/o timeout' or 'no such host' | 15 mins | Medium |
| Port Conflict Resolution | APM is missing and port 8126 is bound by another service | 10 mins | Medium |
| Agent Reinstallation | Corrupted binaries, botched upgrades, or missing config files | 10 mins | Medium |

Understanding the Error

When engineers report that "Datadog is not working," the issue typically manifests in one of three distinct ways: missing infrastructure metrics, dropped APM traces, or silent log forwarders. Because the Datadog Agent is a sophisticated daemon running multiple concurrent subprocesses (core agent, trace-agent, process-agent, security-agent), a failure in any single component can create dangerous blind spots in your observability pipeline.

The most critical first step is identifying which part of Datadog is failing. Is the entire host offline in the Datadog UI, or are you just missing specific custom metrics? Are your logs flowing, but your distributed traces breaking?

Common Error Messages

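Before making changes, check /var/log/datadog/agent.log (Linux) or C:\ProgramData\Datadog\logs\agent.log (Windows) for messages like the following. The exact wording varies between agent versions, so search for the key phrase rather than the full line: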

  • Error: API Key is invalid
  • dial tcp: lookup intake.logs.datadoghq.com: no such host
  • Agent is not running
  • Failed to send traces: payload too large
  • connection refused: port 8126
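To triage these messages quickly, you can map each log line to a likely failure category. The sketch below assumes the log lines contain substrings like the ones listed above. The function name classify_agent_error and the category labels are hypothetical, chosen only for illustration.

```bash
# Map a single agent log line to a rough failure category.
# The substrings mirror the messages listed above; category names are illustrative.
classify_agent_error() {
  case "$1" in
    *"API Key is invalid"*)   echo "api-key" ;;
    *"no such host"*)         echo "dns" ;;
    *"i/o timeout"*)          echo "network-timeout" ;;
    *"payload too large"*)    echo "apm-payload" ;;
    *"connection refused"*)   echo "port-or-service" ;;
    *"Agent is not running"*) echo "agent-stopped" ;;
    *)                        echo "unknown" ;;
  esac
}

# Example (not run here): summarize the last 200 log lines by category.
#   sudo tail -n 200 /var/log/datadog/agent.log |
#     while IFS= read -r line; do classify_agent_error "$line"; done | sort | uniq -c
```

A category that dominates the summary tells you which of the steps below to start with.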

Step 1: Diagnose with the Status Command

The Datadog Agent comes with a built-in diagnostic tool. Running the status command provides a comprehensive overview of the agent's health, collector processes, and forwarder status.

Run the following command on your host: sudo datadog-agent status

Pay close attention to the Forwarder section. If you see multiple retries or dropped payloads, you are likely dealing with a network issue or an invalid API key.

API Key Validation: If the status output shows API Key is invalid, verify your datadog.yaml file. Ensure that api_key matches the key provided in your Datadog Organization Settings. Remember that Datadog has multiple sites (e.g., US1, US3, EU). If your account is in EU, but your agent is defaulting to US1, your API key will register as invalid. Ensure site: datadoghq.eu (or your specific site) is correctly set in datadog.yaml.
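A site and key mismatch can also be checked outside the agent. The sketch below reads api_key and site from a datadog.yaml-style file and builds the matching key-validation URL. The helper names yaml_top_level and validate_url_for_config are hypothetical, the parser only handles simple top-level "key: value" lines, and /etc/datadog-agent/datadog.yaml is assumed to be the config path (the Linux default).

```bash
# Print the value of a top-level scalar key from a datadog.yaml-style file.
# $1 = key name, $2 = path to the file. Handles simple "key: value" lines only.
yaml_top_level() {
  sed -n "s/^$1:[[:space:]]*//p" "$2" | head -n 1 | tr -d "\"'" |
    sed 's/[[:space:]]*#.*$//; s/[[:space:]]*$//'
}

# Build the key-validation URL for the site configured in the file,
# falling back to datadoghq.com (US1) when no site is set.
validate_url_for_config() {
  site=$(yaml_top_level site "$1")
  echo "https://api.${site:-datadoghq.com}/api/v1/validate"
}

# Example (not run here; assumes the default Linux config path):
#   cfg=/etc/datadog-agent/datadog.yaml
#   curl -s -H "DD-API-KEY: $(yaml_top_level api_key "$cfg")" "$(validate_url_for_config "$cfg")"
```

If the same key validates against one site's API but not the site the agent is configured for, the site setting is the problem rather than the key.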


Step 2: Fix Network and Connectivity Issues

The Datadog Agent pushes data outwards via HTTPS on port 443. It does not require any inbound ports to be opened. If metrics aren't showing up, a firewall, proxy, or security group is likely blocking outbound traffic.

To isolate DNS and network routing issues, try connecting to Datadog's endpoints directly from the affected host. These examples use the US1 domain; substitute your own site's domain if your account lives elsewhere:

curl -v https://app.datadoghq.com
curl -v https://intake.logs.datadoghq.com
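When a curl test fails, its exit status narrows down the cause. The sketch below translates the most relevant curl exit codes into a hint. The codes themselves come from the curl manual, while the function name explain_curl_exit and the hint wording are illustrative.

```bash
# Translate a curl exit status into a likely cause.
# Exit codes are from the curl manual; the wording of each hint is illustrative.
explain_curl_exit() {
  case "$1" in
    0)     echo "reachable" ;;
    6)     echo "dns: could not resolve host" ;;
    7)     echo "connect: refused or blocked" ;;
    28)    echo "timeout: likely firewall or routing" ;;
    35|60) echo "tls: handshake or certificate problem" ;;
    *)     echo "other: see curl exit code $1" ;;
  esac
}

# Example (not run here):
#   curl -sS -o /dev/null --max-time 10 https://app.datadoghq.com; explain_curl_exit $?
```

A DNS failure points at resolver configuration, a timeout points at firewalls or routing, and a TLS failure often points at a proxy that intercepts HTTPS.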

If the connection times out (the agent log reports this as dial tcp: i/o timeout), verify your AWS Security Groups, Azure NSGs, or local iptables rules. If you are operating in a strict enterprise network where outbound traffic must pass through a corporate proxy, configure the Datadog Agent to route its traffic through that proxy by adding your proxy settings under the proxy block in datadog.yaml.
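As a rough illustration of the proxy block, the sketch below prints a datadog.yaml fragment for a given proxy URL. The function name print_proxy_block and the proxy address are hypothetical. Review the output and merge it into your config by hand rather than appending it blindly.

```bash
# Print a datadog.yaml "proxy" fragment that sends both HTTP and HTTPS
# traffic through the proxy URL given as $1.
print_proxy_block() {
  printf 'proxy:\n  https: "%s"\n  http: "%s"\n' "$1" "$1"
}

# Example with a placeholder proxy address:
#   print_proxy_block http://proxy.example.internal:3128
```

Restart the agent after editing datadog.yaml so the forwarder picks up the new settings.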


Step 3: Troubleshoot Missing APM Traces

If infrastructure metrics are visible but APM traces are missing, the issue is isolated to the trace-agent.

Applications send traces to the local trace-agent over port 8126 (TCP). Custom metrics travel separately, to DogStatsD on port 8125 (UDP), which the core agent serves. If another application on your host is already listening on port 8126, the trace-agent cannot bind to it, and traces sent to that port never reach Datadog.

Check for port conflicts: sudo netstat -tulpn | grep 8126 (on hosts without netstat, sudo ss -tulpn | grep 8126 shows the same information)
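A plain grep for 8126 also matches unrelated lines, such as a process whose PID is 8126 or a service on port 18126. The sketch below filters netstat -tulpn output on the local-address column instead. The function name port_owner is hypothetical, and it assumes the netstat column layout in which the fourth field is the local address.

```bash
# Read `netstat -tulpn` output on stdin and print the PID/program bound to a port.
# Matches the end of the local-address column, so port 18126 or PID 8126 will not match.
port_owner() {
  awk -v suffix=":$1" '$4 ~ (suffix "$") { print $NF }'
}

# Example (not run here):
#   sudo netstat -tulpn | port_owner 8126
```

If the program it prints is anything other than trace-agent, that process is holding the port.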

If the port is available but traces are still failing, verify that your application's Datadog tracing library (e.g., dd-trace-js, dd-trace-py) is configured with the correct environment variables. Specifically, ensure DD_AGENT_HOST is set correctly. In containerized environments like Kubernetes or ECS, DD_AGENT_HOST must often point to the host's IP address or the dedicated DaemonSet service rather than localhost.
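To see where a tracer will actually send its data, print the address it resolves from the environment. The sketch below assumes the common tracer defaults of localhost and port 8126, and it reads DD_TRACE_AGENT_PORT, a variable the text above does not mention. The function name trace_agent_url is hypothetical.

```bash
# Print the trace-agent URL implied by the current environment.
# Assumes localhost when DD_AGENT_HOST is unset and 8126 when DD_TRACE_AGENT_PORT is unset.
trace_agent_url() {
  echo "http://${DD_AGENT_HOST:-localhost}:${DD_TRACE_AGENT_PORT:-8126}"
}

# Example (not run here): test reachability from inside the application container.
# The /info path is an assumption about the trace-agent; any HTTP response at all
# shows that something is listening at that address.
#   curl -s -o /dev/null -w '%{http_code}\n' "$(trace_agent_url)/info"
```

Run this inside the application's container or pod, not on the node, because localhost means something different in each.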


Step 4: Resolve Log Collection Failures

If logs are not appearing in Datadog, confirm that log collection is explicitly enabled. By default, log collection is disabled in the Datadog Agent to prevent unexpected billing spikes.

In your datadog.yaml, verify:

logs_enabled: true
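A commented-out setting is easy to miss when skimming the file. The sketch below checks that an uncommented logs_enabled: true line is present. The function name logs_enabled_in is hypothetical, and the check only looks at the top-level key in the file you pass in.

```bash
# Succeed only if the file contains an active (uncommented) "logs_enabled: true" line.
# $1 = path to a datadog.yaml-style file.
logs_enabled_in() {
  grep -Eq '^logs_enabled:[[:space:]]*true([[:space:]]|#|$)' "$1"
}

# Example (not run here; assumes the default Linux config path):
#   logs_enabled_in /etc/datadog-agent/datadog.yaml && echo "enabled" || echo "disabled"
```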

Next, verify the Datadog Agent user (dd-agent) has the necessary read permissions for your log files. If your application writes logs to /var/log/myapp/app.log, but those files are owned by root with 600 permissions, the Datadog Agent cannot read them, and no logs are forwarded even though the rest of the agent looks healthy. Grant read access to the dd-agent user or add dd-agent to the appropriate user group.
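The permission problem can be spotted from the file mode alone. The sketch below tests whether an octal mode grants read access to users who are neither the owner nor in the file's group, which is the position dd-agent is in by default. The function name others_can_read is hypothetical, and the check deliberately ignores group membership, ACLs, and permissions on parent directories.

```bash
# Succeed if an octal mode (e.g. 600, 644) grants read access to "other" users.
# Ignores group membership, ACLs, and directory traversal permissions.
others_can_read() {
  [ $(( 0$1 & 4 )) -ne 0 ]
}

# Example (not run here; `stat -c %a` is the GNU coreutils form):
#   others_can_read "$(stat -c %a /var/log/myapp/app.log)" || echo "dd-agent likely cannot read this file"
```

For a definitive answer on the host itself, run sudo -u dd-agent test -r /var/log/myapp/app.log, which exercises the real permission check for that user.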


Step 5: The Nuclear Option (Flare)

If you have verified API keys, network connectivity, port availability, and file permissions, and the agent still isn't working, it's time to gather the diagnostic bundle known as a "flare".

Running sudo datadog-agent flare compresses the agent logs, configuration, and internal state into a single archive and, once you confirm, uploads it to Datadog Support. If you already have a support case open, pass its ID as an argument so the flare is attached to that case. The flare scrubs known secrets such as API keys and passwords, and gives the support engineers what they need to debug complex state corruption or edge-case bugs.

```bash
# 1. Check the complete status of the Datadog Agent
sudo datadog-agent status

# 2. Check for trace-agent port conflicts (APM issues)
sudo netstat -tulpn | grep 8126

# 3. Test outbound network connectivity to Datadog API
curl -v https://app.datadoghq.com

# 4. Tail the agent logs for real-time error messages
sudo tail -f /var/log/datadog/agent.log

# 5. Generate a support flare (if all else fails)
sudo datadog-agent flare
```

Error Medic Editorial

Senior SRE and DevOps engineering team specializing in deep-dive troubleshooting for enterprise observability platforms, cloud infrastructure, and distributed systems.
