Error Medic

Datadog Agent Not Reporting: Troubleshooting 'Agent is not sending metrics' and Connection Errors

Fix Datadog agent not reporting metrics. Learn to troubleshoot API key errors, site misconfigurations, NTP time drift, and network connectivity issues.

Key Takeaways
  • Verify the API key and 'site' parameter in datadog.yaml to prevent 403 Forbidden errors.
  • Check outbound network connectivity on port 443 to the specific Datadog intake servers for your region.
  • Ensure NTP time synchronization is accurate; time skew causes Datadog to drop metric payloads.
  • Inspect /var/log/datadog/agent.log for specific forwarder or intake connection errors.

Diagnostic Approaches Compared

Method | When to Use | Time | Risk
Agent Status Command | Initial triage to see component health and forwarder queues | 1 min | Low
Flare Creation | When escalating to Datadog support or doing deep offline analysis | 3 mins | Low
Network Curl Test | Agent logs show 'Connection refused' or 'Timeout' | 2 mins | Low
NTP Sync Verification | Metrics are missing but logs show successful HTTP 200 posts | 5 mins | Medium

Understanding the Error

When a Datadog Agent is not reporting, data from the host is no longer reaching the Datadog backend intake servers, or is being rejected or discarded on arrival. This manifests in the Datadog UI as gaps in metric graphs, hosts appearing as '???' or completely disappearing from the infrastructure list, and missing APM traces or logs. Because the Datadog Agent is a complex, multi-process daemon (core agent, trace-agent, process-agent), a failure in reporting can be systemic or localized to a specific telemetry type.

Typical error messages you might encounter in the logs include:

  • ERROR | (pkg/forwarder/worker.go) | Error while processing transaction: error: HTTP 403 Forbidden
  • x509: certificate signed by unknown authority
  • Error: Post "https://5-0-0-app.agent.datadoghq.com/api/v1/validate": dial tcp: lookup 5-0-0-app.agent.datadoghq.com: no such host
  • context deadline exceeded (Client.Timeout exceeded while awaiting headers)
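To make triage of these messages quicker, the sketch below maps a log line to a likely cause. The category names (`credentials-or-site`, `tls-trust`, and so on) are illustrative labels chosen for this example, not Datadog terminology, and the matched substrings are taken from the sample messages above.

```bash
# Map a single agent log line to a likely cause category.
# Category names are hypothetical labels used only in this sketch.
classify_agent_error() {
  case "$1" in
    *"403 Forbidden"*)                       echo "credentials-or-site" ;;
    *"x509: certificate signed by unknown"*) echo "tls-trust" ;;
    *"no such host"*)                        echo "dns" ;;
    *"context deadline exceeded"*)           echo "network-timeout" ;;
    *"onnection refused"*)                   echo "network-refused" ;;
    *)                                       echo "unknown" ;;
  esac
}

# Read log lines on stdin and print a count per category, most frequent first.
summarize_agent_errors() {
  while IFS= read -r line; do
    classify_agent_error "$line"
  done | sort | uniq -c | sort -rn
}
```

One possible use is `sudo grep ERROR /var/log/datadog/agent.log | summarize_agent_errors`, which shows at a glance whether the failures are dominated by credentials, DNS, or timeouts.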

Root Causes

  1. Network & Firewall Restrictions: The agent relies on outbound HTTPS (port 443) to send data. If a firewall, security group, or egress proxy blocks this traffic, the agent will queue metrics until memory limits are reached, then drop them.
  2. Misconfigured Credentials or Region: Datadog operates across multiple isolated regions (US1, US3, US5, EU1, AP1), and an API key is only valid in the region where the account lives. If your account is in EU1 (datadoghq.eu) but the site parameter in datadog.yaml is left at its default of datadoghq.com (US1), the agent sends payloads to the wrong region and the intake rejects them with a 403 Forbidden error.
  3. Time Synchronization (NTP) Drift: Datadog's intake servers validate the timestamps of incoming metrics. If your host's clock drifts significantly (typically > 10 minutes) from UTC, the payload will be successfully transmitted but silently dropped by the backend.
  4. Resource Starvation (OOMKilled): If the agent exceeds its memory limit, which is common in Kubernetes environments where the container's memory limit is set too low for the workload, the OS out-of-memory killer terminates the process and reporting stops until the agent restarts.
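For the last cause in the list above, one way to confirm an OOM kill on a Linux host is to search the kernel log. The following is a minimal sketch: the kernel's wording varies between versions, and the process names `agent`, `trace-agent`, and `process-agent` are assumptions about how the binaries appear in that log.

```bash
# Filter kernel log text on stdin for OOM-killer events that name an agent process.
# Assumption: the killed process appears in parentheses, e.g. "(agent)".
filter_agent_oom_kills() {
  grep -Ei 'out of memory|oom-kill' \
    | grep -E '\((agent|trace-agent|process-agent)\)' || true
}
```

A typical invocation would be `sudo dmesg | filter_agent_oom_kills` or `sudo journalctl -k | filter_agent_oom_kills`. Empty output only means no matching line was found in the log that was searched.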

Step 1: Diagnose the Agent Status

The most critical first step is running the agent status command. This provides a comprehensive overview of the agent's health, configuration, and recent errors.

On Linux, run: sudo datadog-agent status

Scroll down to the Forwarder section. This section tells you if the agent is successfully sending data to Datadog. Look for:

  • Transactions: A high number of dropped or retried transactions indicates network issues.
  • API Keys status: The entry for your key should report that the API key is valid. If it reports invalid, check the api_key and site values in your datadog.yaml.

Next, check the logs for real-time errors: sudo tail -f /var/log/datadog/agent.log
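Before tailing the log, it can help to see how noisy it is. The sketch below counts ERROR and WARN lines, assuming the level appears as its own pipe-delimited field (as in the `ERROR | (pkg/forwarder/worker.go) | ...` sample earlier in this guide). The function name is made up for this example.

```bash
# Print how many ERROR and WARN lines appear in agent log text read from stdin.
# Assumes the level is a separate pipe-delimited field on each line.
count_agent_log_levels() {
  awk -F'|' '
    {
      for (i = 1; i <= NF; i++) {
        field = $i
        gsub(/^[ \t]+|[ \t]+$/, "", field)   # trim surrounding whitespace
        if (field == "ERROR") { errors++; break }
        if (field == "WARN")  { warns++;  break }
      }
    }
    END { printf "ERROR=%d WARN=%d\n", errors, warns }'
}
```

For example, `sudo tail -n 500 /var/log/datadog/agent.log | count_agent_log_levels` summarizes the most recent 500 lines. A count that keeps growing between runs suggests the problem is ongoing rather than historical.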

Step 2: Validate Network Connectivity

If the forwarder is failing, simulate the agent's network traffic to isolate DNS or firewall issues. The agent connects to several endpoints (e.g., <VERSION>-app.agent.datadoghq.com). You can test general connectivity to the Datadog API endpoints using curl.

For US1 (default): curl -v https://api.datadoghq.com
For EU1: curl -v https://api.datadoghq.eu

If the curl command hangs or returns a connection timeout, your host is lacking outbound internet access on port 443. If you use a proxy, ensure the agent is configured to use it by setting the proxy block in datadog.yaml.
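Because a plain `curl -v` can hang for a long time, a bounded probe is sometimes more convenient. The sketch below assumes the API endpoint follows the `api.<site>` pattern seen in the two URLs above. The function names are hypothetical, and the 10-second timeout is an arbitrary choice.

```bash
# Build the API base URL for a Datadog site value such as "datadoghq.com".
# Assumption: the endpoint follows the api.<site> pattern used in this guide.
api_url_for_site() {
  echo "https://api.${1:-datadoghq.com}"
}

# Probe the endpoint with a bounded timeout and report how far the request got.
# Prints "reachable (HTTP <code>)" or "unreachable (curl exit <n>)".
probe_site() {
  url="$(api_url_for_site "$1")"
  if code="$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$url")"; then
    echo "reachable (HTTP $code)"
  else
    echo "unreachable (curl exit $?)"
  fi
}
```

Running `probe_site datadoghq.eu` on the affected host gives a one-line answer. If it fails, the curl exit code narrows the cause: 6 indicates a DNS resolution failure, 7 a refused or failed connection, and 28 a timeout.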

Step 3: Check Time Synchronization (NTP)

If the API key is valid, the network connects perfectly, and logs show HTTP 200 OK responses, but metrics still aren't appearing, check your host's clock.

Run date -u and compare it to a reliable time source. To verify your NTP sync status: chronyc tracking or timedatectl status

If the system clock is inaccurate, restart your NTP service (chronyd or systemd-timesyncd) and force a synchronization; with chrony, for example, sudo chronyc makestep steps the clock immediately.
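The comparison in this step can be reduced to simple arithmetic on Unix timestamps. The sketch below uses a default threshold of 600 seconds, mirroring the roughly 10-minute figure mentioned under Root Causes. Both function names and the output wording are made up for illustration.

```bash
# Print the absolute difference in seconds between two Unix epoch timestamps.
clock_skew_seconds() {
  diff=$(( $1 - $2 ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( -diff ))
  fi
  echo "$diff"
}

# Compare local and reference epochs; the third argument is the limit in seconds.
# Returns non-zero when the skew exceeds the limit (default 600 s).
check_clock_skew() {
  skew="$(clock_skew_seconds "$1" "$2")"
  limit="${3:-600}"
  if [ "$skew" -gt "$limit" ]; then
    echo "DRIFT ${skew}s exceeds ${limit}s"
    return 1
  fi
  echo "OK ${skew}s"
}
```

The local value comes from `date -u +%s`. The reference value has to come from a source you trust, for instance another host known to be synchronized. A call then looks like `check_clock_skew "$(date -u +%s)" "$reference_epoch"`, where `reference_epoch` is a placeholder variable.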

Step 4: Fix Configuration and Restart

Most configuration issues stem from an incorrect datadog.yaml. Open /etc/datadog-agent/datadog.yaml (Linux) and verify:

  1. api_key: <YOUR_API_KEY>
  2. site: <YOUR_DATADOG_SITE> (e.g., datadoghq.com, us3.datadoghq.com, datadoghq.eu)

If you modify datadog.yaml, you must restart the agent for the changes to take effect: sudo systemctl restart datadog-agent

Wait two minutes, then run sudo datadog-agent status again to verify the forwarder is successfully transmitting payloads.
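The two settings from this step can also be checked without opening an editor. The following is a line-based sketch rather than a real YAML parser: it only sees top-level `key: value` lines, and it assumes API keys are 32 hexadecimal characters. The function names are hypothetical, and the key itself is never printed.

```bash
# Print the value of a top-level "key: value" line from a YAML file.
# Line-based sketch, not a YAML parser; commented-out lines are ignored.
yaml_top_level_value() {
  sed -n "s/^$2:[[:space:]]*//p" "$1" \
    | head -n 1 \
    | sed -e 's/[[:space:]]*#.*$//' -e 's/[[:space:]]*$//' \
    | tr -d "\"'"
}

# Summarize api_key and site without echoing the key itself.
check_datadog_config() {
  file="${1:-/etc/datadog-agent/datadog.yaml}"
  key="$(yaml_top_level_value "$file" api_key)"
  site="$(yaml_top_level_value "$file" site)"
  # Assumption: a well-formed API key is 32 hexadecimal characters.
  if printf '%s' "$key" | grep -Eq '^[0-9a-fA-F]{32}$'; then
    echo "api_key: looks well-formed"
  else
    echo "api_key: missing or malformed"
  fi
  if [ -n "$site" ]; then
    echo "site: $site"
  else
    echo "site: unset, so the default datadoghq.com applies"
  fi
}
```

Calling `check_datadog_config` with no argument reads the default Linux path. The file is usually not world-readable, so the function may need to run in a root shell. A well-formed key can still belong to a different region, so the agent status output remains the authoritative check.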

```bash
# --- Datadog Agent Troubleshooting Script ---

# 1. Check the overall status of the agent
sudo datadog-agent status

# 2. Check the logs for ERROR or WARN messages (specifically looking for API/Forwarder issues)
sudo grep -E "ERROR|WARN" /var/log/datadog/agent.log | tail -n 20

# 3. Test outbound network connectivity to the default US site (Change URL based on your region)
curl -v https://api.datadoghq.com

# 4. Check system time synchronization (NTP)
timedatectl status

# 5. Restart the Datadog Agent after applying any fixes in datadog.yaml
sudo systemctl restart datadog-agent
```

Error Medic Editorial

Error Medic Editorial comprises senior DevOps, SRE, and platform engineering experts dedicated to providing actionable, reliable troubleshooting guides for modern cloud infrastructure.
