MariaDB Crash Recovery: Diagnosing OOM, Disk Full, and InnoDB Corruption
Comprehensive SRE guide to MariaDB crash recovery. Learn how to fix out-of-memory (OOM) kills, disk-full errors, deadlocks, and 'too many connections' to restore service quickly.
- Always check the Linux kernel logs (dmesg) for Out of Memory (OOM) killer events when MariaDB stops unexpectedly without error log entries.
- Disk full errors (Errcode: 28) can corrupt binlogs or prevent crash recovery; safely purge old binlogs to restore service.
- Resolve 'ERROR 1040: Too many connections' by aggressively identifying idle sessions, tuning max_connections, and implementing connection pooling.
- Use innodb_force_recovery sequentially (levels 1 through 6) ONLY as a last resort when the database refuses to start due to severe page corruption.
| Method | When to Use | Recovery Time | Data Loss Risk |
|---|---|---|---|
| Standard Crash Recovery | Unexpected restart or power loss (InnoDB auto-replays redo logs) | Minutes (depends on redo log size) | None |
| InnoDB Force Recovery (Levels 1-3) | Minor page corruption; database loops/crashes on startup | Hours (requires mysqldump and restore) | Low |
| InnoDB Force Recovery (Levels 4-6) | Severe corruption, redo log corruption, data dictionary mismatch | Hours to Days | High |
| Point-in-Time Restore (Binlogs) | Catastrophic storage failure or accidental DROP TABLE | Varies by database size | Minimal (depends on binlog sync interval) |
Understanding MariaDB Crash Recovery
When a MariaDB instance crashes, the mysqld process terminates abnormally. Under normal operating conditions, MariaDB relies on the robust ACID compliance of the InnoDB storage engine. Upon restarting, InnoDB automatically performs MariaDB crash recovery by reading the redo logs (ib_logfile0, ib_logfile1) and applying committed transactions that were not yet flushed to the data files, while rolling back uncommitted transactions.
However, automatic crash recovery can fail. As an SRE, diagnosing why the process died and why it cannot recover is your first priority. Common culprits span the operating system, storage layers, and application misconfigurations.
Scenario 1: MariaDB Out of Memory (OOM Killer)
One of the most insidious crashes occurs when MariaDB abruptly disappears, leaving truncated or non-existent entries in /var/log/mysql/error.log. This usually points to a MariaDB out of memory event triggered by the Linux kernel's OOM Killer.
The Symptom:
The database goes offline. Trying to check the status yields: Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2).
The Diagnosis: Check the kernel message buffer:
dmesg -T | grep -i oom
You will likely see an output like: Out of memory: Killed process 12345 (mysqld) total-vm:8543200kB, anon-rss:6291456kB.
The Fix:
The OOM killer targets mysqld because it is consuming too much RAM. The primary culprit is an oversized InnoDB Buffer Pool.
- Open /etc/mysql/mariadb.conf.d/50-server.cnf.
- Adjust innodb_buffer_pool_size. A rule of thumb is 60-70% of total system RAM on a dedicated database server, but less if other services are co-located.
- Check for memory leaks or excessive per-connection buffers (e.g., join_buffer_size, sort_buffer_size).
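To turn the 60-70% rule of thumb into a concrete config line, here is a minimal sketch assuming a Linux host dedicated to the database (the 70% ratio is the guideline above, not a MariaDB default):

```shell
# Compute a 70% buffer-pool target from total RAM and print a config line.
total_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
pool_mb=$(( total_kb * 70 / 100 / 1024 ))
echo "innodb_buffer_pool_size = ${pool_mb}M"
```

Paste the printed line into the [mysqld] section and restart; on a shared host, scale the ratio down to leave headroom for the other services.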
Scenario 2: MariaDB Disk Full Error
Running out of disk space brings MariaDB to an immediate halt.
The Symptom:
You find the following exact error in the logs:
[ERROR] mysqld: Disk is full writing './mysql-bin.0000X' (Errcode: 28 "No space left on device")
or
InnoDB: Operating system error number 28 in a file operation.
The Diagnosis:
Run df -h to verify available storage and df -i to check for inode exhaustion. A MariaDB disk full situation often happens because binary logs (binlogs) have grown out of control, or temporary tables filled the /tmp partition.
The Fix:
Do NOT delete active ibdata1 or InnoDB redo logs. Instead, safely clean up old binary logs. If the server cannot start, you can manually remove the oldest binlogs and update the mysql-bin.index file. Once started, properly configure binlog expiration:
SET GLOBAL expire_logs_days = 7;
PURGE BINARY LOGS BEFORE '2023-10-01 00:00:00';
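If the server is down and you must free space by hand, the binlog files and the mysql-bin.index file have to stay in sync. This sketch demonstrates the procedure against a scratch directory (on a real host the files live in the data directory, e.g. /var/lib/mysql):

```shell
# Simulated data directory standing in for /var/lib/mysql.
datadir=$(mktemp -d)
touch "$datadir"/mysql-bin.000001 "$datadir"/mysql-bin.000002 "$datadir"/mysql-bin.000003
printf '%s\n' ./mysql-bin.000001 ./mysql-bin.000002 ./mysql-bin.000003 > "$datadir/mysql-bin.index"

# Remove the oldest binlog AND drop its entry from the index file --
# if the index references a missing file, mysqld may refuse to start.
oldest=$(head -n 1 "$datadir/mysql-bin.index")
rm -f "$datadir/${oldest#./}"
tail -n +2 "$datadir/mysql-bin.index" > "$datadir/mysql-bin.index.tmp" \
  && mv "$datadir/mysql-bin.index.tmp" "$datadir/mysql-bin.index"
```

Never remove the binlog the replica is still reading; check SHOW SLAVE STATUS on all replicas first.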
Scenario 3: MariaDB Too Many Connections
While not strictly a hard crash, exhausting the connection pool acts like a Denial of Service (DoS), halting application functionality.
The Symptom:
Applications and CLI attempts return: ERROR 1040 (HY000): Too many connections.
The Diagnosis & MariaDB Too Many Connections Fix:
If you cannot log in normally, connect via the extra_port (if configured); MariaDB also admits one extra connection for a user with the SUPER privilege. Once in, note that max_connections is a dynamic variable, so you can raise it immediately with SET GLOBAL max_connections = 1000; without a restart.
To apply the MariaDB too many connections fix, increase the limit in your configuration file:
[mysqld]
max_connections = 1000
However, simply raising the limit often masks the root cause. Investigate why connections are piling up. It is frequently caused by a MariaDB slow query holding locks, or applications failing to close connections. Tune wait_timeout and interactive_timeout down (e.g., to 300-600 seconds) to reap sleeping connections automatically.
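To find the sessions worth reaping, filter the processlist for long-idle connections. The sketch below runs against a captured sample (the printf line stands in for `mysql -N -e "SHOW FULL PROCESSLIST"` on a live server; hosts and ids are made up):

```shell
# Sample processlist rows, tab-separated: Id User Host db Command Time State
processlist=$(printf '42\tapp\thost1\tshop\tSleep\t1800\t\n43\tapp\thost2\tshop\tQuery\t2\texecuting\n44\tapp\thost3\tshop\tSleep\t45\t\n')

# Flag sessions sleeping longer than 600 seconds as KILL candidates.
idle_ids=$(printf '%s\n' "$processlist" | awk -F'\t' '$5=="Sleep" && $6>600 {print $1}')
echo "$idle_ids"   # prints 42
```

Each flagged id can then be terminated with `mysql -e "KILL <id>"` while you fix the application's connection handling.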
Scenario 4: MariaDB Deadlock and Timeout Issues
Heavy concurrent write traffic can cause transactions to block each other.
The Symptom:
Applications report: ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction or ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction.
The Diagnosis:
A MariaDB deadlock occurs when two transactions wait on each other to release row locks. A MariaDB timeout (Error 1205) happens when a query waits for a lock longer than the innodb_lock_wait_timeout (default 50s).
Enable deadlock logging to catch the culprits:
SET GLOBAL innodb_print_all_deadlocks = 1;
The Fix: Deadlocks must be resolved at the application layer by ensuring transactions access tables and rows in a consistent order, keeping transactions small, and optimizing queries with proper indexes to reduce the footprint of locked rows.
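As a sketch of what "consistent order" means in practice, assume a hypothetical accounts table: if every transaction that touches rows 1 and 2 locks them in ascending primary-key order, neither session can hold a lock the other is waiting on, so the cycle never forms.

```sql
-- Both transfer directions lock id = 1 before id = 2.
START TRANSACTION;
SELECT balance FROM accounts WHERE id = 1 FOR UPDATE;
SELECT balance FROM accounts WHERE id = 2 FOR UPDATE;
UPDATE accounts SET balance = balance - 100 WHERE id = 1;
UPDATE accounts SET balance = balance + 100 WHERE id = 2;
COMMIT;
```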
Scenario 5: Managing MariaDB Replication Lag
If the master crashes and recovers, or if it handles heavy bulk loads, read replicas can fall significantly behind.
The Symptom:
SHOW SLAVE STATUS\G (or SHOW REPLICA STATUS\G in newer versions) shows Seconds_Behind_Master consistently rising.
The Diagnosis & Fix:
MariaDB replication lag is often caused by the single-threaded nature of the default SQL thread applying events on the replica. If the replica's disk I/O is saturated, upgrade the storage. Otherwise, enable parallel replication by setting slave_parallel_threads (MariaDB's name for the worker count) and slave_parallel_mode = optimistic.
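A minimal replica config sketch for enabling parallel apply (the thread count is an illustrative starting point, not a recommendation; tune it to core count and workload):

```ini
[mysqld]
# Apply binlog events with several worker threads instead of one
slave_parallel_threads = 8
# Optimistic mode parallelizes aggressively and retries on conflict
slave_parallel_mode = optimistic
```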
Scenario 6: MariaDB Permission Denied Errors
Often seen after migrating a database directory or restoring from backups, incorrect filesystem permissions will completely block crash recovery.
The Symptom:
InnoDB: Operating system error number 13 in a file operation. (Permission denied)
The Fix:
A MariaDB permission denied error means the mysql user cannot read/write the data directory.
- Fix standard Linux permissions:
  chown -R mysql:mysql /var/lib/mysql
  chmod -R 700 /var/lib/mysql
- If SELinux is enforcing (RHEL/CentOS), restore the security context:
  restorecon -Rv /var/lib/mysql
- If AppArmor is enforcing (Ubuntu/Debian), ensure the new data directory is whitelisted in /etc/apparmor.d/tunables/alias.
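After fixing permissions, a quick sanity check confirms the mode and owner before restarting. The sketch below uses a scratch directory and the current user; on a real host substitute /var/lib/mysql and expect owner mysql:

```shell
# Stand-in for the real data directory and the mysql user.
datadir=$(mktemp -d)
chmod 700 "$datadir"

perms=$(stat -c '%a' "$datadir")   # GNU stat (Linux)
owner=$(stat -c '%U' "$datadir")
echo "owner=$owner perms=$perms"
```

If perms is not 700 (or owner is not mysql on a real host), mysqld will hit error 13 again on startup.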
Advanced: Forcing InnoDB Recovery
If the crash resulted in corrupted InnoDB pages and standard recovery loops infinitely (crashing immediately upon startup), you must resort to innodb_force_recovery.
Add this to your configuration under [mysqld]:
innodb_force_recovery = 1
Start the service. If it fails, increment the number up to 6.
Warning: Levels 4-6 can cause permanent data loss, and at level 4 or above InnoDB runs read-only. Once the database starts in recovery mode, immediately mysqldump all databases, wipe the /var/lib/mysql directory, re-initialize the instance, remove the innodb_force_recovery setting, start the empty database, and restore your dump.
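The "increment until it starts" loop can be scripted. This sketch edits a temp copy of the config; on a real host the file would be /etc/mysql/mariadb.conf.d/50-server.cnf (assumption), with a `systemctl restart mariadb` and an error-log check between bumps:

```shell
# Temp config standing in for 50-server.cnf.
cnf=$(mktemp)
printf '[mysqld]\ninnodb_force_recovery = 1\n' > "$cnf"

# Raise innodb_force_recovery by one level, refusing to go past 6.
bump_recovery_level() {
  local current next
  current=$(awk -F' = ' '/innodb_force_recovery/ {print $2}' "$cnf")
  next=$((current + 1))
  [ "$next" -le 6 ] || { echo "level 6 reached: restore from backup" >&2; return 1; }
  sed -i "s/innodb_force_recovery = .*/innodb_force_recovery = $next/" "$cnf"
  echo "$next"
}

bump_recovery_level   # prints 2
```

Stop at the first level that lets mysqld stay up long enough to dump; never run production traffic at any non-zero level.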
Appendix: SRE Diagnostic Triage Script
#!/bin/bash
# MariaDB SRE Diagnostic Triage Script
echo "=== 1. Checking for OOM Killer Events ==="
dmesg -T | grep -i -E 'killed process.*mysqld|oom-killer'
echo -e "\n=== 2. Checking Disk Space on Data Directory ==="
DATA_DIR=$(mysql -N -s -e "select @@datadir;" 2>/dev/null || echo "/var/lib/mysql")
df -h "$DATA_DIR"
echo -e "\n=== 3. Extracting Recent Fatal Errors from Logs ==="
ERROR_LOG=$(mysql -N -s -e "select @@log_error;" 2>/dev/null || echo "/var/log/mysql/error.log")
if [ -f "$ERROR_LOG" ]; then
tail -n 200 "$ERROR_LOG" | grep -i -E "crash|fatal|error|corrupt|denied|timeout"
else
echo "Error log not found at $ERROR_LOG"
fi
echo -e "\n=== 4. Checking Connection Saturation ==="
mysql -e "SHOW GLOBAL STATUS LIKE 'Threads_connected'; SHOW GLOBAL VARIABLES LIKE 'max_connections';"

Error Medic Editorial
Written by our senior Site Reliability Engineering team. We specialize in high-availability database architecture, incident response, and performance tuning for mission-critical MySQL and MariaDB deployments.