
Troubleshooting Grafana Alertmanager: Fixing Sync Failures, Connection Drops, and Mimir/Loki Integration Errors

Comprehensive guide to fixing Grafana Alertmanager failures, including external Alertmanager sync issues, Grafana Cloud auth errors, and Mimir/Loki tenant misconfigurations.

Key Takeaways
  • Root Cause 1: Network isolation or DNS resolution failures preventing the Grafana server from reaching the external Alertmanager API (e.g., `no such host`).
  • Root Cause 2: Misconfigured or missing `X-Scope-OrgID` headers causing 401/403 errors when routing alerts to Grafana Mimir or Loki Alertmanagers.
  • Root Cause 3: Legacy alerting configurations conflicting with Grafana Unified Alerting (GUA), leading to duplicate or silently dropped alert payloads.
  • Quick Fix: Validate API reachability from within the Grafana container using `curl`, check the Contact Points provisioning YAML for syntax errors, and ensure the Alertmanager URL does not include a trailing slash.
Alertmanager Configuration & Fix Approaches Compared
| Method | When to Use | Time to Implement | Risk Level |
| --- | --- | --- | --- |
| Grafana UI (Contact Points) | Initial setup, quick debugging, or isolated testing of payload formats. | 5 mins | High (configuration drift, manual errors) |
| File Provisioning (YAML) | Production environments, GitOps workflows, Kubernetes deployments. | 15 mins | Low (version controlled, reproducible) |
| mimirtool (CLI) | Configuring the distributed Alertmanager in Grafana Mimir or Grafana Cloud. | 10 mins | Medium (requires API key management) |
| Direct API POST | Verifying Alertmanager receiver functionality independently of Grafana. | 5 mins | Low (read/test only) |

Understanding Grafana and Alertmanager Architecture

With the move to Grafana 8+ and Grafana Unified Alerting (GUA), the way alerts are processed shifted significantly. Grafana now includes a built-in Alertmanager, but many enterprise architectures rely on a Grafana external Alertmanager, or on distributed Alertmanagers backed by Grafana Mimir or Grafana Loki.

Troubleshooting issues in this ecosystem requires understanding the data flow: Grafana evaluates alert rules (or queries backend data sources like Prometheus/Loki) -> The state changes to Firing -> Grafana constructs a payload -> Grafana pushes this payload via HTTP POST to the Alertmanager API (/api/v2/alerts) -> Alertmanager routes, groups, and deduplicates the alert -> Alertmanager sends the notification to the receiver (Slack, PagerDuty, Webhook).
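To make the "Grafana constructs a payload" step concrete: the body Grafana POSTs to /api/v2/alerts is a JSON array of alert objects carrying labels, annotations, and timestamps. Here is a minimal sketch of such a payload (the label values, timestamp, and URL are illustrative, not captured from a real instance) that you can validate locally before involving any network hop:

```shell
#!/bin/sh
# Build and locally validate a minimal /api/v2/alerts payload.
# All field values below are illustrative placeholders.
cat > /tmp/am-payload.json <<'EOF'
[
  {
    "labels": {
      "alertname": "HighErrorRate",
      "severity": "critical"
    },
    "annotations": {
      "summary": "5xx rate above threshold"
    },
    "startsAt": "2024-01-01T00:00:00Z",
    "generatorURL": "http://grafana:3000/alerting"
  }
]
EOF

# Syntax-check the JSON offline (python3 assumed available).
if command -v python3 >/dev/null 2>&1; then
  python3 -m json.tool /tmp/am-payload.json > /dev/null && echo "payload: valid JSON"
fi
```

A malformed version of exactly this structure is what produces the `400 Bad Request` sync errors discussed below.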

Failures can occur at any boundary in this pipeline. This guide focuses on the critical boundary between Grafana and the Alertmanager.

Common Error Messages

You are likely reading this guide because you have encountered one of the following exact error messages in your Grafana server logs (/var/log/grafana/grafana.log or kubectl logs deployment/grafana):

  • Failed to send alert to Alertmanager: Post "http://alertmanager:9093/api/v2/alerts": dial tcp: lookup alertmanager on 10.96.0.10:53: no such host
  • level=error msg="unable to sync alertmanager configuration" err="bad response status 400 Bad Request"
  • level=error msg="Failed to send alert notifications" err="context deadline exceeded"
  • level=error msg="failed to send alerts to all alertmanagers" err="1 errors: Post \"https://alertmanager-us-central1.grafana.net/api/prom/push\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
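When hunting for these lines, it helps to filter delivery-related errors out of the log stream first. A small sketch (the grep pattern is our assumption; adjust it to your log format) with sample log lines inlined so the filter can be demonstrated offline:

```shell
#!/bin/sh
# Filter Alertmanager-delivery errors out of a Grafana log stream.
filter_am_errors() {
  grep -iE "alertmanager|failed to send alert"
}

# Sample log lines stand in for a real log file here.
cat <<'EOF' | filter_am_errors
level=info msg="HTTP Server Listen" address=0.0.0.0:3000
level=error msg="Failed to send alert to Alertmanager" url=http://alertmanager:9093
level=error msg="unable to sync alertmanager configuration" err="bad response status 400"
level=info msg="Request Completed" method=GET path=/api/health
EOF
```

In a live environment you would pipe `kubectl logs deployment/grafana` or `tail -f /var/log/grafana/grafana.log` into the same filter.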

Step 1: Diagnosing Network and DNS Issues

The most frequent cause of an external Alertmanager failing to receive alerts from Grafana is network reachability. Grafana must be able to resolve the hostname and establish a TCP connection to the Alertmanager port (usually 9093).

Validation from the Grafana Container

Do not test connectivity from your local workstation. You must test from the environment where the Grafana process is running.

  1. Exec into the Grafana container/server:

    kubectl exec -it deploy/grafana -n monitoring -- /bin/sh
    # or
    docker exec -it grafana /bin/sh
    
  2. Test DNS Resolution:

    nslookup alertmanager.monitoring.svc.cluster.local
    

    If this fails, your CoreDNS or equivalent DNS service is failing to resolve the service name. Check your Kubernetes service definitions.

  3. Test API Reachability with Curl:

    curl -v http://alertmanager.monitoring.svc.cluster.local:9093/-/ready
    

    You should receive a 200 OK response. If you get Connection refused, the Alertmanager process is not binding to 0.0.0.0 or the port is mismatched. If you get context deadline exceeded or it hangs, check your NetworkPolicies, security groups, or firewalls blocking traffic between the Grafana and Alertmanager nodes.
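The failure modes above map to distinct curl exit codes, which can save a round of guessing. A minimal sketch (the mapping follows curl's documented exit codes; the helper function is our own):

```shell
#!/bin/sh
# Translate common curl exit codes into Alertmanager-specific next steps.
explain_curl_exit() {
  case "$1" in
    0)  echo "OK: endpoint reachable" ;;
    6)  echo "DNS failure: check CoreDNS / the Kubernetes Service name" ;;
    7)  echo "Connection refused: wrong port, or Alertmanager not bound to 0.0.0.0" ;;
    28) echo "Timeout: suspect NetworkPolicies, security groups, or firewalls" ;;
    *)  echo "curl exit $1: rerun with curl -v for details" ;;
  esac
}

# Demonstrate the mapping with the codes you are most likely to see:
for code in 0 6 7 28; do
  printf '%2s -> ' "$code"
  explain_curl_exit "$code"
done
```

In practice, run `curl -sf -o /dev/null --max-time 5 "$AM_URL/-/ready"; explain_curl_exit $?` from inside the Grafana container.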


Step 2: Fixing Grafana External Alertmanager Configuration

If the network is healthy, the issue usually lies in how Grafana is configured to talk to the external Alertmanager.

Legacy grafana.ini vs. Provisioning UI

In older versions of Grafana, you might have configured the external Alertmanager in grafana.ini under the [alerting] or [unified_alerting] blocks.

Warning: Relying solely on grafana.ini for contact points can lead to silent failures if the API schema changes. The modern approach is to use the UI or Provisioning YAML.

If you are using Provisioning YAML (/etc/grafana/provisioning/alerting/alertmanager.yaml), verify the syntax carefully:

apiVersion: 1
contactPoints:
  - orgId: 1
    name: 'External Alertmanager'
    receivers:
      - uid: ext-am-1
        type: prometheus-alertmanager
        settings:
          url: http://alertmanager.monitoring.svc.cluster.local:9093

Crucial Fix: Ensure the url does NOT contain a trailing slash or the /api/v2/alerts path; Grafana appends the correct API path itself. A trailing slash (http://alertmanager:9093/) produces a malformed double-slash URL, and including the path (http://alertmanager:9093/api/v2/alerts) makes Grafana call http://alertmanager:9093/api/v2/alerts/api/v2/alerts, yielding a 404 Not Found.
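If the URL comes from a template (Helm values, CI variables), it is worth normalizing it before it reaches the provisioning file. A defensive sketch using POSIX parameter expansion (the function name is ours):

```shell
#!/bin/sh
# Strip a trailing slash and an accidentally appended API path from an
# Alertmanager base URL, since Grafana appends /api/v2/alerts itself.
normalize_am_url() {
  url="${1%/}"                 # drop one trailing slash, if present
  url="${url%/api/v2/alerts}"  # drop the API path, if someone included it
  printf '%s\n' "$url"
}

normalize_am_url "http://alertmanager:9093/"
normalize_am_url "http://alertmanager:9093/api/v2/alerts"
```

Both calls print the clean base URL `http://alertmanager:9093`, which is the only form Grafana should be given.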


Step 3: Grafana Cloud Alertmanager & Mimir Integrations

Integrating Grafana with Grafana Cloud Alertmanager or a self-hosted Grafana Mimir Alertmanager introduces multi-tenancy. Multi-tenancy requires strict authentication, usually handled via HTTP headers.

The Missing Tenant ID Error (401 Unauthorized or 400 Bad Request)

If you see authorization errors when syncing configurations or firing alerts to Mimir/Loki, you are likely missing the X-Scope-OrgID header. Mimir requires this header to know which tenant's Alertmanager configuration to apply the alert to.
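Tenant IDs themselves have constraints: Mimir's documentation limits them to 150 bytes, alphanumerics plus a small set of special characters, and forbids the literal values `.` and `..`. A rough pre-flight check (our own helper, not a Mimir tool; treat it as a sanity check only):

```shell
#!/bin/sh
# Rough validation of an X-Scope-OrgID value, based on Mimir's documented
# tenant-ID restrictions (<=150 bytes; alphanumerics plus ! - _ . * ' ( );
# never "." or "..").
valid_tenant_id() {
  id="$1"
  [ -n "$id" ] || return 1
  [ "${#id}" -le 150 ] || return 1
  case "$id" in
    .|..) return 1 ;;
  esac
  printf '%s' "$id" | grep -Eq "^[A-Za-z0-9!_.*'()-]+$"
}

valid_tenant_id "tenant-a" && echo "tenant-a: ok"
valid_tenant_id "bad tenant" || echo "bad tenant: rejected (contains a space)"
```

A tenant ID that fails this check will typically be rejected by Mimir with a 4xx response before routing is even attempted.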

Fixing Mimir Alertmanager Integration

If configuring via the Grafana UI (Alerting -> Alertmanagers -> Add Alertmanager):

  1. Set the URL to your Mimir Alertmanager endpoint (e.g., http://mimir-gateway/alertmanager).
  2. Under Custom HTTP headers, add:
    • Header: X-Scope-OrgID
    • Value: <your-tenant-id> (e.g., tenant-a or anonymous if auth is disabled but multitenancy is enabled).

If configuring via Provisioning:

apiVersion: 1
contactPoints:
  - orgId: 1
    name: 'Mimir Alertmanager'
    receivers:
      - uid: mimir-am
        type: prometheus-alertmanager
        settings:
          url: http://mimir-gateway.mimir.svc.cluster.local/alertmanager
          httpHeaderName1: 'X-Scope-OrgID'
          httpHeaderValue1: 'tenant-a'

Grafana Cloud Specifics

For the Grafana Cloud Alertmanager, the endpoint URL is region-specific and authentication is handled via Basic Auth. Ensure you are using a Cloud Access Policy token with alerts:write permissions.

  • URL: https://alertmanager-<region>.grafana.net
  • Basic Auth User: <your-cloud-username/instance-id>
  • Basic Auth Password: <your-access-policy-token>
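Basic Auth is just a base64-encoded `user:password` pair sent in the Authorization header, so you can construct it by hand when testing with curl. A sketch with placeholder credentials (the instance ID and token below are invented; never commit real tokens):

```shell
#!/bin/sh
# Build the Authorization header Grafana sends for Basic Auth.
# INSTANCE_ID and TOKEN are placeholders, not real credentials.
INSTANCE_ID="123456"
TOKEN="glc_placeholder_token"
AUTH=$(printf '%s:%s' "$INSTANCE_ID" "$TOKEN" | base64 | tr -d '\n')
echo "Authorization: Basic $AUTH"

# Equivalent curl shorthand for a manual test:
#   curl -u "$INSTANCE_ID:$TOKEN" https://alertmanager-<region>.grafana.net/...
```

If a hand-built request with this header succeeds while Grafana's does not, the problem is in the contact point configuration rather than the token.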

Step 4: Troubleshooting High Availability (HA) Alertmanager Gossip

If you are running multiple instances of Alertmanager (HA mode) and users complain about receiving duplicate alert notifications (e.g., two Slack messages for the same incident), your external Alertmanager cluster is suffering from a "split-brain" scenario.

Alertmanager instances use a gossip protocol over TCP/UDP port 9094 to synchronize silence states and notification logs. If Grafana pushes an alert to an external AM cluster, and the cluster members cannot communicate, both members will independently evaluate the routing tree and send the notification.
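Because every replica must list its peers, the `--cluster.peer` flag set grows with the replica count, and hand-editing it invites typos that cause exactly this split-brain. A small generator sketch (the pod and headless-service names assume the conventional `<statefulset>-<ordinal>.<headless-svc>` form):

```shell
#!/bin/sh
# Generate --cluster.peer flags for an N-replica Alertmanager StatefulSet.
# Naming assumes pods alertmanager-0..N-1 behind a headless service.
REPLICAS=3
SVC="alertmanager-headless.monitoring.svc.cluster.local"

i=0
while [ "$i" -lt "$REPLICAS" ]; do
  echo "--cluster.peer=alertmanager-${i}.${SVC}:9094"
  i=$((i + 1))
done
```

Paste the emitted flags into the container args; regenerating them whenever the replica count changes keeps the gossip ring consistent.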

Diagnosing Gossip Failures

Check the Alertmanager logs for gossip errors:

kubectl logs -l app=alertmanager -c alertmanager | grep gossip

Look for: msg="Failed to join cluster" err="1 error occurred: Failed to resolve alertmanager-0.alertmanager-headless:9094: no such host"

Fixing Gossip Synchronization

  1. Ensure you have a headless service in Kubernetes specifically for the gossip ring.
  2. Pass the --cluster.peer flag correctly to the Alertmanager binary arguments.
# Kubernetes deployment args snippet for Alertmanager
args:
  - "--config.file=/etc/alertmanager/config.yml"
  - "--storage.path=/alertmanager"
  - "--cluster.peer=alertmanager-0.alertmanager-headless.monitoring.svc.cluster.local:9094"
  - "--cluster.peer=alertmanager-1.alertmanager-headless.monitoring.svc.cluster.local:9094"
  3. Ensure your Kubernetes NetworkPolicy allows TCP and UDP traffic on port 9094 between the Alertmanager pods.
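The NetworkPolicy requirement can be expressed as a manifest like the following sketch (the namespace and pod labels are assumptions; match them to your deployment):

```yaml
# Illustrative policy allowing gossip between Alertmanager pods on port 9094.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: alertmanager-gossip
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: alertmanager
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: alertmanager
      ports:
        - protocol: TCP
          port: 9094
        - protocol: UDP
          port: 9094
```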

Step 5: Advanced Debugging with amtool and Payload Inspection

Sometimes Grafana successfully sends the alert, but Alertmanager drops it or routes it to a "null" receiver. To isolate if the problem is Grafana generating a bad payload or Alertmanager misrouting it, bypass Grafana entirely.

Manually Pushing an Alert Payload

Use the Alertmanager v2 API to push a dummy alert (the legacy /api/v1/alerts endpoint is deprecated and has been removed in recent Alertmanager releases). If this succeeds and routes correctly, the issue is Grafana's configuration. If it fails to route, the issue is in your alertmanager.yml routing tree.

curl -XPOST -H "Content-Type: application/json" http://alertmanager.monitoring.svc.cluster.local:9093/api/v2/alerts -d '[
  {
    "labels": {
      "alertname": "TestManualAlert",
      "severity": "critical",
      "service": "web"
    },
    "annotations": {
      "summary": "This is a test alert bypassing Grafana."
    }
  }
]'

Check the Alertmanager UI (http://<alertmanager-ip>:9093/#/alerts). If "TestManualAlert" appears, your Alertmanager is healthy and accepting external payloads. Go back to the Grafana contact point logs and confirm that the labels your Alertmanager route matches on are actually being generated by your Grafana alert rule. You can also test routing offline with amtool, e.g. amtool config routes test --config.file=alertmanager.yml severity=critical, which prints the receiver a given label set would match.

Conclusion

Troubleshooting the Grafana to Alertmanager pipeline requires systematically verifying network boundaries, authentication headers (especially for Mimir/Loki/Cloud), and payload structures. By utilizing local curl tests, verifying provisioning syntax, and inspecting HA gossip logs, you can rapidly isolate and resolve alert delivery failures.

Appendix: End-to-End Diagnostic Script

#!/bin/bash
# Diagnostic script to test Grafana to Alertmanager connectivity and API health

AM_URL="http://alertmanager.monitoring.svc.cluster.local:9093"
TENANT_HEADER="X-Scope-OrgID: my-tenant" # Optional: Only for Mimir/Loki

echo "1. Testing basic network connectivity..."
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "${AM_URL}/-/ready")
if [ $? -ne 0 ] || [ "${HTTP_CODE}" != "200" ]; then
  echo "ERROR: Cannot reach ${AM_URL} (HTTP status: ${HTTP_CODE}). Check DNS and firewalls."
  exit 1
fi
echo "OK: readiness endpoint returned HTTP 200."

echo -e "\n2. Sending a test alert payload to Alertmanager API..."
curl -X POST -H "Content-Type: application/json" -H "${TENANT_HEADER}" ${AM_URL}/api/v2/alerts -d '[
  {
    "labels": {
      "alertname": "DiagnosticTestAlert",
      "severity": "info",
      "source": "troubleshooting-script"
    },
    "annotations": {
      "summary": "Validating Alertmanager API ingestion pipeline."
    }
  }
]'

echo -e "\n\n3. Checking Alertmanager logs for recent errors (Kubernetes)..."
kubectl logs -l app=alertmanager -n monitoring --tail=20 | grep -i -E "error|warn|failed|gossip"

Error Medic Editorial

Error Medic Editorial comprises senior Site Reliability Engineers and DevOps practitioners dedicated to solving complex infrastructure, observability, and cloud-native integration challenges.
