Troubleshooting Grafana Alertmanager: Fixing Sync Failures, Connection Drops, and Mimir/Loki Integration Errors
Comprehensive guide to fixing Grafana Alertmanager failures, including external Alertmanager sync issues, Grafana Cloud auth errors, and Mimir/Loki tenant misconfigurations.
- Root Cause 1: Network isolation or DNS resolution failures preventing the Grafana server from reaching the external Alertmanager API (e.g., `no such host`).
- Root Cause 2: Misconfigured or missing `X-Scope-OrgID` headers causing 401/403 errors when routing alerts to Grafana Mimir or Loki Alertmanagers.
- Root Cause 3: Legacy alerting configurations conflicting with Grafana Unified Alerting (GUA), leading to duplicate or silently dropped alert payloads.
- Quick Fix: Validate API reachability from within the Grafana container using `curl`, check the Contact Points provisioning YAML for syntax errors, and ensure the Alertmanager URL does not include a trailing slash.
| Method | When to Use | Time to Implement | Risk Level |
|---|---|---|---|
| Grafana UI (Contact Points) | Initial setup, quick debugging, or isolated testing of payload formats. | 5 mins | High (Configuration drift, manual errors) |
| File Provisioning (YAML) | Production environments, GitOps workflows, Kubernetes deployments. | 15 mins | Low (Version controlled, reproducible) |
| mimirtool (CLI) | Configuring distributed Alertmanager in Grafana Mimir or Grafana Cloud. | 10 mins | Medium (Requires API key management) |
| Direct API POST | Verifying Alertmanager receiver functionality independently of Grafana. | 5 mins | Low (Read/Test only) |
Understanding Grafana and Alertmanager Architecture
With the move to Grafana 8+ and Grafana Unified Alerting (GUA), the way alerts are processed shifted significantly. Grafana now includes a built-in Alertmanager, but many enterprise architectures rely on an external Alertmanager, or on distributed Alertmanagers backed by Grafana Mimir or Grafana Loki.
Troubleshooting issues in this ecosystem requires understanding the data flow: Grafana evaluates alert rules (or queries backend data sources like Prometheus/Loki) -> The state changes to Firing -> Grafana constructs a payload -> Grafana pushes this payload via HTTP POST to the Alertmanager API (/api/v2/alerts) -> Alertmanager routes, groups, and deduplicates the alert -> Alertmanager sends the notification to the receiver (Slack, PagerDuty, Webhook).
Failures can occur at any boundary in this pipeline. This guide focuses on the critical boundary between Grafana and the Alertmanager.
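To make the pipeline concrete, the body Grafana POSTs to /api/v2/alerts is a JSON array of alert objects. A minimal hand-written version, useful for inspecting the shape before debugging (label and annotation values are illustrative, not from a real rule):

```shell
# minimal body for POST /api/v2/alerts, written to a file for inspection
# (label and annotation values are illustrative, not from a real rule)
cat > /tmp/am_payload.json <<'EOF'
[
  {
    "labels": {"alertname": "HighCPU", "severity": "critical"},
    "annotations": {"summary": "CPU usage above 90% for 5 minutes"},
    "generatorURL": "http://grafana:3000/alerting/list"
  }
]
EOF
echo "Payload written to /tmp/am_payload.json"
```

The labels drive Alertmanager's routing and deduplication; the annotations only decorate the notification.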
Common Error Messages
You are likely reading this guide because you have encountered one of the following exact error messages in your Grafana server logs (/var/log/grafana/grafana.log or kubectl logs deployment/grafana):
Failed to send alert to Alertmanager: Post "http://alertmanager:9093/api/v2/alerts": dial tcp: lookup alertmanager on 10.96.0.10:53: no such host

level=error msg="unable to sync alertmanager configuration" err="bad response status 400 Bad Request"

level=error msg="Failed to send alert notifications" err="context deadline exceeded"

level=error msg="failed to send alerts to all alertmanagers" err="1 errors: Post \"https://alertmanager-us-central1.grafana.net/api/prom/push\": net/http: request canceled (Client.Timeout exceeded while awaiting headers)"
Step 1: Diagnosing Network and DNS Issues
The most frequent cause of an external Alertmanager failing to receive alerts from Grafana is network reachability. Grafana must be able to resolve the hostname and establish a TCP connection to the Alertmanager port (usually 9093).
Validation from the Grafana Container
Do not test connectivity from your local workstation. You must test from the environment where the Grafana process is running.
Exec into the Grafana container/server:
kubectl exec -it deploy/grafana -n monitoring -- /bin/sh
# or: docker exec -it grafana /bin/sh

Test DNS Resolution:

nslookup alertmanager.monitoring.svc.cluster.local

If this fails, your CoreDNS or equivalent DNS service is failing to resolve the service name. Check your Kubernetes service definitions.
Test API Reachability with Curl:
curl -v http://alertmanager.monitoring.svc.cluster.local:9093/-/ready

You should receive a 200 OK response. If you get Connection refused, the Alertmanager process is not binding to 0.0.0.0 or the port is mismatched. If the request hangs or returns context deadline exceeded, check your NetworkPolicies, security groups, or firewalls blocking traffic between the Grafana and Alertmanager nodes.
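The three checks above can be wrapped in a small probe function for repeated use. A sketch, assuming curl is available inside the container; the URL is your Alertmanager base address:

```shell
# probe an Alertmanager base URL and report a likely cause on failure (sketch)
check_am() {
  local url="$1"
  local code
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${url}/-/ready" 2>/dev/null)
  if [ "$code" = "200" ]; then
    echo "OK: ${url} is ready"
  else
    # code 000 means no TCP connection at all: DNS, NetworkPolicy, or listen address
    echo "FAIL: ${url} returned '${code}' - check DNS, NetworkPolicies, and the listen address"
    return 1
  fi
}
# usage: check_am "http://alertmanager.monitoring.svc.cluster.local:9093"
```

An HTTP code of 000 distinguishes "never connected" (network problem) from a non-200 code, which means the process answered but is not ready.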
Step 2: Fixing Grafana External Alertmanager Configuration
If the network is healthy, the issue usually lies in how Grafana is configured to talk to the external Alertmanager.
Legacy grafana.ini vs. Provisioning UI
In older versions of Grafana, you might have configured the external Alertmanager in grafana.ini under the [alerting] or [unified_alerting] blocks.
Warning: Relying solely on grafana.ini for contact points can lead to silent failures if the API schema changes. The modern approach is to use the UI or Provisioning YAML.
If you are using Provisioning YAML (/etc/grafana/provisioning/alerting/alertmanager.yaml), verify the syntax carefully:
apiVersion: 1
contactPoints:
  - orgId: 1
    name: 'External Alertmanager'
    receivers:
      - uid: ext-am-1
        type: prometheus-alertmanager
        settings:
          url: http://alertmanager.monitoring.svc.cluster.local:9093
Crucial Fix: Ensure the url does NOT include the /api/v2/alerts path or a trailing slash. Grafana automatically appends the correct API path. Providing http://alertmanager:9093/api/v2/alerts results in Grafana calling http://alertmanager:9093/api/v2/alerts/api/v2/alerts, yielding a 404 Not Found; a trailing slash can similarly produce a malformed double-slash URL.
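A quick way to catch this in provisioning files is to lint for URLs that end in a slash or already include the API path. A sketch; adjust the file glob to your own provisioning directory:

```shell
# flag contact point URLs that include a trailing slash or the /api/v2/alerts path
lint_am_urls() {
  grep -nE 'url:[[:space:]]*[^[:space:]]*(/api/v2/alerts|/)[[:space:]]*$' "$@" \
    && echo "Fix the URLs above: drop the trailing slash / API path" \
    || echo "URLs look clean"
}
# usage: lint_am_urls /etc/grafana/provisioning/alerting/*.yaml
```

Running this in CI before deploying provisioning changes catches the 404-generating URLs before Grafana ever loads them.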
Step 3: Grafana Cloud Alertmanager & Mimir Integrations
Integrating Grafana with Grafana Cloud Alertmanager or a self-hosted Grafana Mimir Alertmanager introduces multi-tenancy. Multi-tenancy requires strict authentication, usually handled via HTTP headers.
The Missing Tenant ID Error (401 Unauthorized or 400 Bad Request)
If you see authorization errors when syncing configurations or firing alerts to Mimir/Loki, you are likely missing the X-Scope-OrgID header. Mimir requires this header to know which tenant's Alertmanager configuration to apply the alert to.
Fixing Mimir Alertmanager Integration
If configuring via the Grafana UI (Alerting -> Alertmanagers -> Add Alertmanager):
- Set the URL to your Mimir Alertmanager endpoint (e.g., http://mimir-gateway/alertmanager).
- Under Custom HTTP headers, add:
  - Header: X-Scope-OrgID
  - Value: <your-tenant-id> (e.g., tenant-a, or anonymous if auth is disabled but multitenancy is enabled).
If configuring via Provisioning:
apiVersion: 1
contactPoints:
  - orgId: 1
    name: 'Mimir Alertmanager'
    receivers:
      - uid: mimir-am
        type: prometheus-alertmanager
        settings:
          url: http://mimir-gateway.mimir.svc.cluster.local/alertmanager
          httpHeaderName1: 'X-Scope-OrgID'
          httpHeaderValue1: 'tenant-a'
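To confirm the tenant header is accepted before wiring it into Grafana, you can probe the Mimir Alertmanager status endpoint directly. A sketch; the gateway URL, the /alertmanager path prefix, and the tenant ID are all assumptions to replace with your own values:

```shell
# compose a tenant-scoped status probe; URL, path prefix, and tenant are assumptions
AM_BASE="http://mimir-gateway.mimir.svc.cluster.local/alertmanager"
TENANT="tenant-a"
PROBE=(curl -s -o /dev/null -w '%{http_code}' -H "X-Scope-OrgID: ${TENANT}" "${AM_BASE}/api/v2/status")
echo "Run inside the cluster: ${PROBE[*]}"
# 200 = header accepted; 401 = header missing/wrong; 404 = wrong path prefix
```

Interpreting the code tells you which layer to fix: a 401 points at the header, a 404 at the gateway routing.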
Grafana Cloud Specifics
For Grafana Cloud Alertmanager, the URL and authentication are provided via Basic Auth. Ensure you are using the correct Cloud Access Policy token with alerts:write permissions.
- URL: https://alertmanager-<region>.grafana.net
- Basic Auth User: <your-cloud-username/instance-id>
- Basic Auth Password: <your-access-policy-token>
Step 4: Troubleshooting High Availability (HA) Alertmanager Gossip
If you are running multiple instances of Alertmanager (HA mode) and users complain about receiving duplicate alert notifications (e.g., two Slack messages for the same incident), your external Alertmanager cluster is suffering from a "split-brain" scenario.
Alertmanager instances use a gossip protocol over TCP/UDP port 9094 to synchronize silence states and notification logs. If Grafana pushes an alert to an external AM cluster, and the cluster members cannot communicate, both members will independently evaluate the routing tree and send the notification.
Diagnosing Gossip Failures
Check the Alertmanager logs for gossip errors:
kubectl logs -l app=alertmanager -c alertmanager | grep gossip
Look for: msg="Failed to join cluster" err="1 error occurred: Failed to resolve alertmanager-0.alertmanager-headless:9094: no such host"
Fixing Gossip Synchronization
- Ensure you have a headless service in Kubernetes specifically for the gossip ring.
- Pass the --cluster.peer flag correctly to the Alertmanager binary arguments.
# Kubernetes deployment args snippet for Alertmanager
args:
  - "--config.file=/etc/alertmanager/config.yml"
  - "--storage.path=/alertmanager"
  - "--cluster.peer=alertmanager-0.alertmanager-headless.monitoring.svc.cluster.local:9094"
  - "--cluster.peer=alertmanager-1.alertmanager-headless.monitoring.svc.cluster.local:9094"
- Ensure your Kubernetes NetworkPolicy allows TCP/UDP traffic on port 9094 between the Alertmanager pods.
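For StatefulSets, the peer flags can be generated instead of hard-coded, which avoids a stale peer list when you scale. A sketch; the pod name prefix, headless service, and namespace are assumptions matching the snippet above:

```shell
# print --cluster.peer flags for an N-replica Alertmanager StatefulSet
peer_flags() {
  local replicas="$1" svc="$2" ns="$3" i
  for ((i = 0; i < replicas; i++)); do
    echo "--cluster.peer=alertmanager-${i}.${svc}.${ns}.svc.cluster.local:9094"
  done
}
# usage: peer_flags 2 alertmanager-headless monitoring
```

This is handy in Helm templates or init scripts where the replica count is a single variable.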
Step 5: Advanced Debugging with amtool and Payload Inspection
Sometimes Grafana successfully sends the alert, but Alertmanager drops it or routes it to a "null" receiver. To isolate if the problem is Grafana generating a bad payload or Alertmanager misrouting it, bypass Grafana entirely.
Manually Pushing an Alert Payload
Use the Alertmanager v2 API to push a dummy alert (the older /api/v1/alerts endpoint is deprecated and removed in recent Alertmanager releases). If this succeeds and routes correctly, the issue is Grafana's configuration. If this fails to route, the issue is in your alertmanager.yml routing tree.
curl -XPOST -H "Content-Type: application/json" http://alertmanager.monitoring.svc.cluster.local:9093/api/v2/alerts -d '[
{
"labels": {
"alertname": "TestManualAlert",
"severity": "critical",
"service": "web"
},
"annotations": {
"summary": "This is a test alert bypassing Grafana."
}
}
]'
Check the Alertmanager UI (http://<alertmanager-ip>:9093/#/alerts). If "TestManualAlert" appears, your Alertmanager is healthy and accepting external payloads. Go back and check the Grafana contact point logs and ensure the labels matching your Alertmanager route are actually being generated by your Grafana alert rule.
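Since this step's title mentions amtool: amtool can validate the routing tree entirely offline with `amtool check-config` and `amtool config routes test`, which shows which receiver a given label set would match. A small wrapper as a sketch; it assumes amtool is on the PATH, which it often is not inside minimal containers:

```shell
# validate an Alertmanager config, then show which receiver a label set routes to
route_test() {
  local cfg="$1"; shift
  if ! command -v amtool >/dev/null 2>&1; then
    echo "amtool not found: download it from the Alertmanager release tarball"
    return 1
  fi
  amtool check-config "$cfg" && amtool config routes test --config.file="$cfg" "$@"
}
# usage: route_test /etc/alertmanager/config.yml alertname=TestManualAlert severity=critical
```

If the printed receiver is "null" or not the one you expect, the routing tree, not Grafana, is dropping the alert.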
Conclusion
Troubleshooting the Grafana to Alertmanager pipeline requires systematically verifying network boundaries, authentication headers (especially for Mimir/Loki/Cloud), and payload structures. By utilizing local curl tests, verifying provisioning syntax, and inspecting HA gossip logs, you can rapidly isolate and resolve alert delivery failures.
Complete Diagnostic Script
#!/bin/bash
# Diagnostic script to test Grafana to Alertmanager connectivity and API health
AM_URL="http://alertmanager.monitoring.svc.cluster.local:9093"
TENANT_HEADER="X-Scope-OrgID: my-tenant" # Optional: Only for Mimir/Loki
echo "1. Testing basic network connectivity..."
HTTP_CODE=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "${AM_URL}/-/ready")
if [ "$HTTP_CODE" != "200" ]; then
  echo "ERROR: ${AM_URL} is not ready (HTTP code: ${HTTP_CODE}). Check DNS and firewalls."
  exit 1
fi
echo -e "\n2. Sending a test alert payload to Alertmanager API..."
curl -X POST -H "Content-Type: application/json" -H "${TENANT_HEADER}" ${AM_URL}/api/v2/alerts -d '[
{
"labels": {
"alertname": "DiagnosticTestAlert",
"severity": "info",
"source": "troubleshooting-script"
},
"annotations": {
"summary": "Validating Alertmanager API ingestion pipeline."
}
}
]'
echo -e "\n\n3. Checking Alertmanager logs for recent errors (Kubernetes)..."
kubectl logs -l app=alertmanager -n monitoring --tail=20 | grep -i -E "error|warn|failed|gossip"

Error Medic Editorial
Error Medic Editorial comprises senior Site Reliability Engineers and DevOps practitioners dedicated to solving complex infrastructure, observability, and cloud-native integration challenges.