Error Medic

Troubleshooting InfluxDB Connection Refused and Out of Memory Errors

Resolve 'InfluxDB connection refused', OOM crashes, and slow query timeouts. Learn to diagnose bind-address issues, optimize TSI indexes, and fix retention policies.

Key Takeaways
  • Connection refused is often a symptom of the InfluxDB process crashing due to Out of Memory (OOM) kills by the Linux kernel.
  • Network misconfigurations, such as binding to 127.0.0.1 instead of 0.0.0.0 or firewall blocks, are common causes for external connection failures.
  • High series cardinality and unbounded queries lead to InfluxDB slow queries, timeouts, and eventual memory exhaustion.
  • Quick Fix: Check dmesg for OOM kills, verify bind-address in influxdb.conf, and enable TSI (Time Series Index) to reduce RAM usage.
Fix Approaches Compared
Method | When to Use | Time | Risk
Change bind-address | Service runs but external clients get Connection Refused | 5 mins | Low
Increase ulimit (file descriptors) | Logs show 'too many open files' before crashing | 10 mins | Low
Enable TSI (Time Series Index) | High series cardinality causing OOM kills | 1-2 hours | Medium (requires downtime and index rebuild)
Implement downsampling & RPs | Database size growing unbounded, causing slow queries | Days | High (data deletion involved)

Understanding the Error: InfluxDB Connection Refused

When you encounter curl: (7) Failed to connect to localhost port 8086: Connection refused or dial tcp 127.0.0.1:8086: connect: connection refused, it usually means the InfluxDB service is either not running, crashing in a loop, or bound to the wrong network interface.

However, in production environments, a "Connection Refused" error is rarely just a simple networking mistake. Frequently, it is the downstream symptom of a more severe underlying issue, such as an InfluxDB Out of Memory (OOM) kill by the Linux kernel, or the database becoming completely unresponsive due to an InfluxDB Slow Query or InfluxDB Timeout.

When InfluxDB runs out of memory, the Linux OOM killer terminates the influxd process. Subsequently, any client attempting to write or query data receives a "Connection Refused" error because the daemon is dead. Similarly, if the database is locked up processing a massive, unoptimized query, the HTTP API may fail to respond within the reverse proxy's timeout window, or the TCP backlog might fill up, leading to rejected connections.

Common Root Causes

  1. Service Stopped or Crash Looping (OOM Kills): The most common reason for unexpected connection refusals. The influxd process consumes memory proportional to series cardinality and query complexity.
  2. Bind Address Configuration: The bind-address or http-bind-address in influxdb.conf is set to 127.0.0.1 but clients are trying to connect via a public or Docker network IP.
  3. Firewall and Security Groups: Port 8086 (HTTP API) or 8088 (RPC) is blocked by iptables, ufw, or cloud provider security groups.
  4. File Descriptor Exhaustion: InfluxDB holds many file handles open for TSM data files and TSI index files. If ulimit -n is too low, it stops accepting new connections.
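For root cause #3, a quick way to rule firewalls in or out is to check host-level rules and then probe the port from a client machine. A sketch; `<influxdb-ip>` is a placeholder for your server's address, and cloud security groups must be checked in your provider's console:

```shell
# Host firewall: is anything filtering port 8086? (run on the InfluxDB server)
sudo ufw status | grep 8086
sudo iptables -L -n | grep 8086

# Client side: "Connection refused" means the port answered but nothing is
# listening (dead or misbound service); a hang/timeout suggests a dropped packet.
nc -zv <influxdb-ip> 8086
```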

Step 1: Diagnose the Connection Refused Error

First, determine if the process is actually running or if it was recently killed.

Check the Service Status: Look for Active: failed or Active: inactive. If it recently restarted, check the uptime. If the service is not running, clients cannot connect.

Check for OOM Kills in the Kernel Log: If you see Out of memory: Killed process <PID> (influxd), your connection refused error is definitively an OOM issue. InfluxDB requires substantial RAM when handling high cardinality or complex GROUP BY time intervals on large datasets.

Verify Network Binding: If the service is running perfectly fine but external clients cannot connect, check the listening ports. You should see it listening on :::8086 or 0.0.0.0:8086. If it only shows 127.0.0.1:8086, external traffic will be refused.
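The three checks above can be run in sequence. A sketch assuming a systemd-based Linux install; `ss` is used in place of the older `netstat`:

```shell
# 1. Is the service up? Look for "Active: failed" or a very recent start time.
sudo systemctl status influxdb

# 2. Was influxd OOM-killed? This matches both kernel message formats.
dmesg -T | grep -iE "out of memory|oom-killer"

# 3. Which address is InfluxDB bound to? 127.0.0.1:8086 refuses external clients.
sudo ss -tlnp | grep influxd
```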


Step 2: Fixing "Connection Refused" due to Network/Config

If the issue is purely networking or configuration:

1. Update the Bind Address: Edit your /etc/influxdb/influxdb.conf (or the respective Docker environment variables) and set bind-address = ":8086" under the [http] section. Restart the service.
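A minimal fragment of what that section of influxdb.conf (InfluxDB 1.x) should look like; the empty host portion in ":8086" binds all interfaces:

```toml
# /etc/influxdb/influxdb.conf
[http]
  # ":8086" listens on all interfaces; "127.0.0.1:8086" refuses external clients
  bind-address = ":8086"
```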

2. Adjust File Descriptors: If you see too many open files in the logs leading to dropped connections, edit /lib/systemd/system/influxdb.service and add LimitNOFILE=65536. Then run sudo systemctl daemon-reload && sudo systemctl restart influxdb.
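Rather than editing the packaged unit file in place (a package upgrade can overwrite it), a systemd drop-in achieves the same result. A sketch, assuming a systemd-based install:

```shell
# Create a drop-in override so upgrades don't clobber the limit
sudo mkdir -p /etc/systemd/system/influxdb.service.d
sudo tee /etc/systemd/system/influxdb.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65536
EOF

sudo systemctl daemon-reload
sudo systemctl restart influxdb

# Confirm the running process picked up the new limit
grep "open files" /proc/"$(pgrep -x influxd)"/limits
```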


Step 3: Resolving OOM Kills and Memory Issues

If the "Connection Refused" is caused by OOM kills, you must address the memory consumption. InfluxDB memory usage is driven by series cardinality and query payload.

1. Switch to TSI (Time Series Index): By default, older versions of InfluxDB (1.x) keep the entire series index in memory (in-memory index). If you have millions of unique series (e.g., generating unique tags per request like UUIDs), you will run out of RAM. Enable TSI to move the index to disk. In influxdb.conf, under [data], set index-version = "tsi1". Note: If you have existing data, you must rebuild the index using the influx_inspect buildtsi tool before restarting.
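The migration sequence looks roughly like this. A sketch for a 1.x package install with default data paths; back up /var/lib/influxdb first, and expect the rebuild to take a while on large datasets:

```shell
# The TSI rebuild requires the daemon to be stopped (downtime)
sudo systemctl stop influxdb

# Switch the index engine to TSI in influxdb.conf ([data] section)
sudo sed -i 's/^\(\s*#\s*\)\?index-version = .*/  index-version = "tsi1"/' /etc/influxdb/influxdb.conf

# Rebuild the on-disk index for existing shards as the influxdb user
sudo -u influxdb influx_inspect buildtsi \
  -datadir /var/lib/influxdb/data \
  -waldir /var/lib/influxdb/wal

sudo systemctl start influxdb
```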

2. Implement Proper Retention Policies (RPs): Keeping raw, high-resolution data forever guarantees eventual OOMs and slow queries. Create a Retention Policy to drop high-res data after 30 days.
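In InfluxQL that might look like the following; "telegraf" and "raw_30d" are hypothetical database and policy names, so adjust to your schema:

```sql
-- Keep raw data for 30 days, then let InfluxDB drop it automatically
CREATE RETENTION POLICY "raw_30d" ON "telegraf" DURATION 30d REPLICATION 1 DEFAULT

-- Verify the policy exists
SHOW RETENTION POLICIES ON "telegraf"
```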

3. Use Continuous Queries (CQs) or Tasks: Downsample your data. Instead of querying 6 months of 1-second resolution data, query 1-hour resolution data using CQs.
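A sketch of such a Continuous Query; the database "telegraf", retention policies "raw_30d" and "forever" (which must already exist), and the measurements "cpu" and "cpu_hourly" are all hypothetical names:

```sql
-- Roll 1-second samples up to hourly means into a long-lived retention policy
CREATE CONTINUOUS QUERY "cq_cpu_hourly" ON "telegraf"
BEGIN
  SELECT mean("usage_percent") AS "usage_percent"
  INTO "telegraf"."forever"."cpu_hourly"
  FROM "telegraf"."raw_30d"."cpu"
  GROUP BY time(1h), *
END
```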


Step 4: Mitigating InfluxDB Slow Queries and Timeouts

Sometimes, connection refused or 502 Bad Gateway errors happen because the HTTP request times out while InfluxDB is churning through a massive query.

1. Query Timeout Configuration: Prevent rogue queries from locking up the database. In influxdb.conf under [coordinator], configure query-timeout = "60s" and log-queries-after = "10s". Setting log-queries-after allows you to identify which queries are causing the bottlenecks in your logs.
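The corresponding fragment of influxdb.conf; the values shown are illustrative starting points, not universal recommendations:

```toml
# /etc/influxdb/influxdb.conf
[coordinator]
  # Abort any query that runs longer than 60 seconds
  query-timeout = "60s"
  # Log the text of queries exceeding 10 seconds to identify offenders
  log-queries-after = "10s"
```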

2. Analyze Slow Queries: Run SHOW QUERIES in the InfluxDB CLI. If you see queries executing for hundreds of seconds, you may need to kill them using KILL QUERY <qid>.
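In the CLI that looks like this; the qid 37 is a placeholder for whatever SHOW QUERIES actually reports:

```sql
-- List in-flight queries with their elapsed durations
SHOW QUERIES

-- Kill a runaway query by the qid from the first column
KILL QUERY 37
```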

3. Optimize Your Queries:

  • Avoid leading wildcards in regex tag matching (=~ /.*value/).
  • Limit the time range: Always include a WHERE time > now() - 1h clause. Querying without a time bound scans the entire database.
  • Reduce GROUP BY cardinality: Grouping by a tag that has 100,000 unique values will create 100,000 buckets in memory, instantly causing an OOM or massive timeout.
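Putting those three rules together, with a hypothetical "requests" measurement and "request_id"/"host" tags:

```sql
-- Anti-pattern: no time bound, grouped by a unique-per-request tag
SELECT mean("duration_ms") FROM "requests" GROUP BY "request_id"

-- Better: bounded window, time-bucketed, low-cardinality tag
SELECT mean("duration_ms") FROM "requests"
WHERE time > now() - 1h
GROUP BY time(1m), "host"
```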

Final Verification

After applying fixes (especially memory limits and TSI):

  1. Monitor memory usage with top or an external monitoring tool like Telegraf + Grafana.
  2. Watch the InfluxDB logs: tail -f /var/log/influxdb/influxd.log for any level=error or level=warn.
  3. Test external connections from a different host: curl -I http://<influxdb-ip>:8086/ping. If it returns a 204 No Content HTTP status, the connection is successfully established and InfluxDB is healthy.

Quick Command Reference

```bash
# Check service status
sudo systemctl status influxdb

# Check for OOM kills in the kernel log
dmesg -T | grep -iE "out of memory|oom-killer"

# Check active listening ports for InfluxDB
sudo netstat -tulpn | grep influxd

# Edit influxdb.conf to fix bind-address or enable TSI
sudo nano /etc/influxdb/influxdb.conf

# Restart the service after making config changes
sudo systemctl restart influxdb

# Verify the HTTP API responds (expect HTTP/1.1 204 No Content)
curl -I http://localhost:8086/ping
```

Error Medic Editorial

A collective of senior SREs, DevOps engineers, and database administrators dedicated to untangling complex infrastructure issues and providing clear, actionable troubleshooting guides.
