Fixing systemd OOM (Out of Memory) Kills and Service Failures: A Complete Guide
Resolve systemd OOM kills, core dumps, and high CPU issues. Learn how to diagnose memory leaks, configure OOMPolicy, and fix systemd services that fail to start.
- systemd-oomd or the kernel OOM killer terminates services exceeding memory limits, resulting in a status=9/KILL error.
- Core dumps and 'systemd failed' states often result from unhandled exceptions, resource starvation, or misconfigured limits.
- Use `journalctl -xe` and `dmesg | grep -i oom` to identify the exact trigger for service termination.
- Fix by adjusting `MemoryMax=`, configuring `OOMScoreAdjust=`, or identifying memory leaks in the application.
- Permission denied (203/EXEC) errors usually stem from SELinux/AppArmor profiles or incorrect file ownership.
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Increase MemoryMax in Unit | Service legitimately needs more memory for workloads | 5 mins | Low |
| Adjust OOMScoreAdjust | Critical service (e.g., DB) shouldn't be killed first | 5 mins | Medium (May kill other services) |
| Configure OOMPolicy=continue | Service can recover itself or should be left to kernel OOM | 10 mins | Low |
| Analyze Core Dump via coredumpctl | Service crashes unexpectedly before hitting OOM limits | 30+ mins | Low |
Understanding systemd OOM Kills and Service Failures
When managing Linux servers, encountering a systemd failed state is a rite of passage for any DevOps or SRE engineer. One of the most disruptive and confusing scenarios is the systemd OOM (Out of Memory) kill. On modern Linux distributions (like Ubuntu 22.04+ and Fedora), low-memory conditions are handled not just by the kernel OOM killer but also by the userspace daemon systemd-oomd. When a service consumes too much memory, it can terminate without an obvious error, leading to cascading application failures.
This guide walks you through diagnosing and fixing systemd OOM events, analyzing core dumps, troubleshooting high CPU usage, resolving permission denied errors, and fixing services that won't start.
Step 1: Diagnosing the systemd Failure
Before changing configurations, you must confirm why systemd is not working as expected. Is it a kernel-level kill, a systemd-oomd intervention, or a crash resulting in a core dump?
Checking for OOM Kills
If a service suddenly stops, check its status:
systemctl status my-app.service
If you see the exact error message: Main process exited, code=killed, status=9/KILL, it was forcefully terminated. To confirm if it was an OOM kill, check the kernel ring buffer and the systemd journal:
# Check kernel OOM logs
dmesg -T | grep -i -E 'killed process|oom'
# Check systemd-oomd logs
journalctl -u systemd-oomd | tail -n 50
You might see an output like: systemd-oomd[678]: Killed /system.slice/my-app.service due to memory pressure.
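As a rough sketch of how the two sources can be told apart in one pass, the function below reads log lines on standard input and labels each OOM-related line. The name `classify_oom_lines` is hypothetical. The patterns assume the usual kernel wording (`Out of memory: Killed process ...`, or `Memory cgroup out of memory: ...` when a cgroup limit such as `MemoryMax=` was hit) and the systemd-oomd message format shown above; the exact wording can differ between kernel and systemd versions.

```shell
# Sketch: label OOM-related log lines by which killer produced them.
# classify_oom_lines is a hypothetical helper name, not a systemd tool.
classify_oom_lines() {
    local line
    while IFS= read -r line; do
        case "$line" in
            *"Memory cgroup out of memory"*)
                # Kernel killed a process because its cgroup hit a hard limit.
                printf 'cgroup-limit\t%s\n' "$line" ;;
            *"Out of memory: Killed process"*)
                # Kernel killed a process because the whole system ran out.
                printf 'kernel-global\t%s\n' "$line" ;;
            *systemd-oomd*Killed*)
                # Userspace systemd-oomd acted on memory pressure or swap use.
                printf 'systemd-oomd\t%s\n' "$line" ;;
        esac
    done
}

# Usage:
#   journalctl -k -b --no-pager | classify_oom_lines
#   journalctl -u systemd-oomd --no-pager | classify_oom_lines
```

Lines that match none of the patterns are dropped, so an empty result suggests the kill was not OOM-related.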
Checking for Core Dumps
If the service crashed due to a segmentation fault (often shown as code=dumped, status=11/SEGV), systemd-coredump captures a core dump, provided it is installed and enabled. You can view captured dumps using:
coredumpctl list
coredumpctl info <PID>
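systemd also records why a unit last stopped in its `Result` property, which can save some log searching for the common cases. The sketch below maps a few `Result` values to the next diagnostic step. The function name `next_step_for_result` and the advice strings are illustrative only, and `oom-kill` is reported only on systemd versions that track OOM events for the unit.

```shell
# Sketch: turn a unit's Result value into a suggested next step.
# next_step_for_result is a hypothetical helper; the messages are illustrative.
next_step_for_result() {
    case "$1" in
        oom-kill)  echo "OOM kill: check dmesg and systemd-oomd logs, then review MemoryMax=" ;;
        core-dump) echo "Crash with core dump: inspect it with coredumpctl info" ;;
        signal)    echo "Terminated by a signal without a core dump: look for OOM log entries or a manual kill" ;;
        exit-code) echo "Exited with a non-zero status: read the service's own log output" ;;
        success)   echo "Last run ended cleanly" ;;
        *)         echo "Other result ($1): see the SERVICE_RESULT table in systemd.exec(5)" ;;
    esac
}

# Usage:
#   next_step_for_result "$(systemctl show -p Result --value my-app.service)"
```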
Step 2: Fixing systemd OOM (Out of Memory) Issues
When a service is killed by systemd-oomd or the kernel OOM killer, it means the system or the cgroup ran out of memory. You have several ways to address this.
1. Adjusting Memory Limits (MemoryHigh and MemoryMax)
Often, a service is artificially constrained by systemd unit file limits. You can override these limits without modifying the package-provided unit file by using drop-in files.
systemctl edit my-app.service
Add the following lines to increase the memory limit:
[Service]
# Set a soft limit: above this the kernel throttles the service and reclaims its memory aggressively
MemoryHigh=2G
# Set the absolute hard limit before the OOM killer is invoked
MemoryMax=3G
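To judge whether the new limits are high enough, it can help to look at the counters the kernel keeps for the unit's cgroup. The sketch below summarizes a cgroup v2 `memory.events` file read from standard input. The function name `summarize_memory_events` is hypothetical, and the path in the usage comment assumes cgroup v2 mounted at /sys/fs/cgroup with the unit placed in system.slice.

```shell
# Sketch: summarize a cgroup v2 memory.events file read from stdin.
# summarize_memory_events is a hypothetical helper name.
summarize_memory_events() {
    awk '
        $1 == "high"     { high = $2 }    # times the cgroup was throttled at MemoryHigh=
        $1 == "max"      { max = $2 }     # times usage was about to exceed MemoryMax=
        $1 == "oom_kill" { kills = $2 }   # processes in this cgroup killed by an OOM killer
        END {
            printf "throttled_at_high=%d hit_max=%d oom_kills=%d\n", high, max, kills
        }'
}

# Usage (path is an assumption, see above):
#   summarize_memory_events < /sys/fs/cgroup/system.slice/my-app.service/memory.events
```

The counters belong to the unit's current cgroup, so read them while the service is running. A steadily growing `hit_max` value suggests the limit is still too tight or the application is leaking.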
2. Protecting Critical Services with OOMScoreAdjust
If you have a critical service (like PostgreSQL or MySQL) that absolutely must not be killed during memory pressure, you can adjust its OOM score. The kernel OOM killer looks for the process with the highest score. A score of -1000 completely disables OOM killing for that process.
systemctl edit postgresql.service
[Service]
OOMScoreAdjust=-900
Note: Use this cautiously. If your database consumes all RAM and cannot be killed, the entire server may become unresponsive.
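After restarting the service, it is worth confirming that the adjustment actually reached the main process. A minimal sketch, assuming the value is read from `/proc/<pid>/oom_score_adj`: the function name `show_oom_score_adj` is hypothetical, and its optional second argument (an alternative proc root) exists only so the function can be tried against a scratch directory.

```shell
# Sketch: print the effective oom_score_adj of a PID.
# show_oom_score_adj is a hypothetical helper; the second argument
# (proc root, default /proc) is an assumption added for testing.
show_oom_score_adj() {
    local pid="$1"
    local proc_root="${2:-/proc}"
    local file="$proc_root/$pid/oom_score_adj"
    if [ -r "$file" ]; then
        printf 'pid=%s oom_score_adj=%s\n' "$pid" "$(cat "$file")"
    else
        printf 'pid=%s not found under %s\n' "$pid" "$proc_root" >&2
        return 1
    fi
}

# Usage, after `systemctl restart postgresql.service`:
#   show_oom_score_adj "$(systemctl show -p MainPID --value postgresql.service)"
```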
3. Managing systemd-oomd OOMPolicy
In systemd version 243+, you can define how systemd reacts to an OOM event within the cgroup using OOMPolicy=.
[Service]
# Options: continue, stop, kill
OOMPolicy=continue
Setting this to continue means systemd won't terminate the entire cgroup if one child process goes OOM, which is highly useful for worker-based applications like Gunicorn or PHP-FPM.
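On hosts managed by scripts rather than by hand, the same drop-in can be written without opening an editor. The sketch below is one way to do that under stated assumptions: the function name `write_oom_policy_dropin` and the file name `10-oom-policy.conf` are hypothetical, and the optional third argument (a root prefix) exists only so the function can be tried against a scratch directory.

```shell
# Sketch: write an OOMPolicy= drop-in non-interactively.
# write_oom_policy_dropin and 10-oom-policy.conf are hypothetical names.
write_oom_policy_dropin() {
    local unit="$1"
    local policy="$2"
    local root="${3:-}"
    # Accept only the three values the option supports.
    case "$policy" in
        continue|stop|kill) ;;
        *) echo "invalid OOMPolicy: $policy" >&2; return 1 ;;
    esac
    local dir="$root/etc/systemd/system/$unit.d"
    mkdir -p "$dir" || return 1
    printf '[Service]\nOOMPolicy=%s\n' "$policy" > "$dir/10-oom-policy.conf"
    echo "$dir/10-oom-policy.conf"
}

# Usage (as root), followed by a reload and restart:
#   write_oom_policy_dropin my-app.service continue
#   systemctl daemon-reload && systemctl restart my-app.service
```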
Step 3: Resolving systemd High CPU Usage
Sometimes, the issue isn't memory, but systemd high cpu usage. If systemd-journald or the main systemd process (PID 1) is pinned at 100% CPU, it usually indicates an I/O bottleneck or an application spamming logs.
- Identify the culprit: Run `journalctl -f` to see if an application is writing thousands of lines per second.
- Rate Limit Journald: Edit `/etc/systemd/journald.conf` to throttle aggressive logging:
[Journal]
RateLimitIntervalSec=30s
RateLimitBurst=1000
- Restart journald:
systemctl restart systemd-journald
Additionally, you can use systemd-cgtop to view resource usage per cgroup, which is often more useful than the standard top command for containerized or systemd-managed environments.
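Watching a live log stream makes it hard to tell which application is the noisy one. A rough sketch of a counting approach follows: it tallies lines per syslog identifier from `journalctl -o short` output on standard input. The function name `top_log_sources` is hypothetical, and the parsing assumes the default short format (`Mon DD HH:MM:SS host ident[pid]: message`), so continuation lines of multi-line messages may be miscounted.

```shell
# Sketch: count journal lines per syslog identifier (short output format on stdin).
# top_log_sources is a hypothetical helper; the argument is how many rows to show.
top_log_sources() {
    local limit="${1:-5}"
    awk '
        /^-- / { next }                    # skip journalctl banner lines
        NF >= 5 {
            id = $5
            sub(/\[[0-9]+\]:$/, "", id)    # strip a trailing "[pid]:"
            sub(/:$/, "", id)              # or a bare trailing ":"
            count[id]++
        }
        END { for (id in count) printf "%d %s\n", count[id], id }
    ' | sort -rn | head -n "$limit"
}

# Usage:
#   journalctl --since "1 minute ago" -o short --no-pager | top_log_sources
```

An identifier with thousands of lines in a one-minute window is a good candidate for rate limiting or a log-level change in the application itself.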
Step 4: Fixing systemd Service Not Starting and Permission Denied
If you are dealing with a systemd service not starting, the status will usually show code=exited, status=....
Status 203/EXEC: systemd permission denied
The exact error Main process exited, code=exited, status=203/EXEC is incredibly common. It explicitly means systemd could not execute the binary.
Root Causes & Fixes:
- Missing Executable Flag: The file isn't executable. Fix: `chmod +x /path/to/binary`
- Wrong Architecture: You are trying to run an ARM binary on x86_64.
- SELinux/AppArmor: The security module blocked execution. Check audit logs: `ausearch -m avc -ts recent`. If SELinux is the culprit, restore the context: `restorecon -Rv /path/to/binary`.
- Missing Shebang: If it's a script, ensure `#!/bin/bash` or `#!/usr/bin/env python3` is at the very top of the file.
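Several of the checks in the list above can be run in one go before touching the unit file. The sketch below covers the existence, executable-flag, and shebang checks. The function name `check_exec_target` is hypothetical, and the permission test reflects the user running the script, not the `User=` configured in the unit. It does not inspect SELinux or AppArmor.

```shell
# Sketch: run basic 203/EXEC checks against one path.
# check_exec_target is a hypothetical helper name.
check_exec_target() {
    local path="$1"
    local magic
    if [ ! -e "$path" ]; then
        echo "missing: $path does not exist"
        return 1
    fi
    if [ ! -x "$path" ]; then
        echo "not-executable: run chmod +x $path"
        return 1
    fi
    # First four bytes as hex: ELF binaries start with 7f 45 4c 46,
    # scripts with 23 21 ("#!").
    magic=$(head -c 4 "$path" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        7f454c46) echo "ok: ELF binary (compare its architecture with uname -m)" ;;
        2321*)    echo "ok: script with shebang" ;;
        *)        echo "no-shebang: add an interpreter line such as #!/bin/bash"
                  return 1 ;;
    esac
}

# Usage:
#   check_exec_target /path/to/binary
```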
Dependency Failures
If a service simply won't start, check for dependency failures. If Service A declares Requires= on Service B (and is ordered After= it) and Service B fails to start, systemd will not start Service A.
journalctl -xeu my-app.service
Look for errors like Dependency failed for My Application.
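To find which dependency is at fault, one option is to walk the unit's `Requires=` list and ask systemd whether each entry is in the failed state. This is only a sketch: the function name `failed_requirements` is hypothetical, and it looks at direct `Requires=` dependencies only, not `Wants=`, `BindsTo=`, or transitive ones.

```shell
# Sketch: list direct Requires= dependencies of a unit that are currently failed.
# failed_requirements is a hypothetical helper name.
failed_requirements() {
    local unit="$1"
    local dep
    # `systemctl show -p Requires --value` prints the unit names separated by spaces.
    for dep in $(systemctl show -p Requires --value "$unit"); do
        if systemctl is-failed --quiet "$dep"; then
            echo "$dep"
        fi
    done
}

# Usage:
#   failed_requirements my-app.service
```

Each unit it prints can then be inspected with `journalctl -xeu` in the same way as the service itself.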
Conclusion
Troubleshooting systemd OOM kills, core dumps, and failure states requires a systematic approach. By using journalctl, coredumpctl, and systemd drop-in configurations (systemctl edit), you can stabilize your Linux infrastructure, tame the OOM killer, and keep your critical services highly available.
#!/bin/bash
# Diagnostic script: Check for systemd OOM kills, core dumps, and failing services
echo "=== Checking Kernel OOM Kills (Last 10) ==="
dmesg -T | grep -i -E 'killed process|oom' | tail -n 10
echo -e "\n=== Checking systemd-oomd interventions ==="
journalctl -u systemd-oomd --since "2 days ago" | grep "Killed" | tail -n 10
echo -e "\n=== Checking Recent Core Dumps ==="
coredumpctl list --reverse | head -n 10
echo -e "\n=== Finding Failed systemd Services ==="
systemctl --failed
echo -e "\n=== Top Memory Consuming Cgroups ==="
systemd-cgtop -b -n 1 -m | head -n 10
Error Medic Editorial
Error Medic Editorial is a team of senior SREs and DevOps engineers dedicated to solving complex Linux, Kubernetes, and infrastructure challenges. With decades of combined experience in high-availability environments, they provide actionable, battle-tested solutions for production outages.