Error Medic

Fixing systemd OOM (Out of Memory) Kills and Service Failures: A Complete Guide

Resolve systemd OOM kills, core dumps, and high CPU issues. Learn how to diagnose memory leaks, configure OOMPolicy, and fix systemd service not starting.

Key Takeaways
  • systemd-oomd or the kernel OOM killer terminates services exceeding memory limits, resulting in a status=9/KILL error.
  • Core dumps and 'systemd failed' states often result from unhandled exceptions, resource starvation, or misconfigured limits.
  • Use `journalctl -xe` and `dmesg | grep -i oom` to identify the exact trigger for service termination.
  • Fix by adjusting `MemoryMax=`, configuring `OOMScoreAdjust=`, or identifying memory leaks in the application.
  • Permission denied (203/EXEC) errors usually stem from a missing executable bit, a missing shebang, a binary built for another architecture, or an SELinux/AppArmor denial.
systemd OOM Mitigation Approaches Compared
| Method | When to Use | Time | Risk |
| --- | --- | --- | --- |
| Increase MemoryMax in Unit | Service legitimately needs more memory for workloads | 5 mins | Low |
| Adjust OOMScoreAdjust | Critical service (e.g., DB) shouldn't be killed first | 5 mins | Medium (May kill other services) |
| Configure OOMPolicy=continue | Service can recover itself or should be left to kernel OOM | 10 mins | Low |
| Analyze Core Dump via coredumpctl | Service crashes unexpectedly before hitting OOM limits | 30+ mins | Low |

Understanding systemd OOM Kills and Service Failures

When managing Linux servers, encountering a systemd failed state is a rite of passage for any DevOps or SRE engineer. One of the most disruptive and confusing scenarios is the systemd OOM (Out of Memory) kill. On modern Linux distributions (like Ubuntu 22.04+ and Fedora), out-of-memory handling is no longer left to the kernel OOM killer alone: systemd-oomd, a userspace daemon, watches memory pressure and kills cgroups before the kernel has to step in. When a service consumes too much memory, it may terminate with little explanation, leading to cascading application failures.

This guide walks through diagnosing and fixing systemd OOM events, analyzing a systemd core dump, troubleshooting high CPU usage in systemd and journald, resolving permission denied errors, and fixing a service that will not start.

Step 1: Diagnosing the systemd Failure

Before changing configurations, confirm why the service failed. Was it a kernel-level kill, a systemd-oomd intervention, or a crash resulting in a core dump?

Checking for OOM Kills

If a service suddenly stops, check its status:

systemctl status my-app.service

If you see Main process exited, code=killed, status=9/KILL, the process received SIGKILL. That alone does not prove an OOM kill, since anything that sends SIGKILL produces the same line, so confirm it in the kernel ring buffer and the systemd journal:

# Check kernel OOM logs
dmesg -T | grep -i -E 'killed process|oom'

# Check systemd-oomd logs
journalctl -u systemd-oomd | tail -n 50

You might see output like: systemd-oomd[678]: Killed /system.slice/my-app.service due to memory pressure.
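Besides the logs, systemd records why a unit stopped in its Result property, which reads oom-kill when it detected an OOM kill in the unit's cgroup. The sketch below maps a few Result values to a short hint. The function name classify_result and the hint strings are illustrative only and are not part of systemd.

```shell
# classify_result (hypothetical helper): turn a unit's Result value into a hint.
classify_result() {
    case "$1" in
        oom-kill)  echo "oom" ;;      # a process in the unit was OOM-killed
        core-dump) echo "crash" ;;    # main process dumped core (e.g. SEGV)
        signal)    echo "signal" ;;   # killed by a signal, no core dump
        exit-code) echo "exit" ;;     # exited with a non-zero status
        success)   echo "ok" ;;
        *)         echo "other" ;;    # timeout, watchdog, start-limit-hit, ...
    esac
}

# On a live host, feed it the real property (my-app.service is a placeholder):
#   classify_result "$(systemctl show -p Result --value my-app.service)"
```

A result of signal together with status=9/KILL suggests something other than the OOM killer sent the SIGKILL.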

Checking for Core Dumps

If the service crashed due to a segmentation fault (often shown as code=dumped, status=11/SEGV), systemd-coredump captures a core dump, provided it is installed and registered as the kernel's core dump handler. You can list and inspect the dumps with:

coredumpctl list
coredumpctl info <PID>
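Which tool to reach for depends on the code= and status= fields in the status line: code=dumped points to coredumpctl, while code=killed points back to the OOM checks. As a rough sketch, the helper below pulls those fields out of a log line. The name parse_exit is an assumption made for this example.

```shell
# parse_exit (hypothetical helper): extract "code status-number status-name"
# from a systemd "Main process exited" line; prints nothing if no match.
parse_exit() {
    printf '%s\n' "$1" |
        sed -n 's|.*code=\([a-z]*\), status=\([0-9]*\)/\([A-Za-z0-9_-]*\).*|\1 \2 \3|p'
}

parse_exit "Main process exited, code=dumped, status=11/SEGV"
# prints: dumped 11 SEGV
```

When the first field is dumped, coredumpctl info followed by the PID or the executable name shows the matching dump.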

Step 2: Fixing systemd OOM (Out of Memory) Issues

When a service is killed by systemd-oomd or the kernel OOM killer, it means the system or the cgroup ran out of memory. You have several ways to address this.

1. Adjusting Memory Limits (MemoryHigh and MemoryMax)

Often, a service is artificially constrained by systemd unit file limits. You can override these limits without modifying the package-provided unit file by using drop-in files.

systemctl edit my-app.service

Add the following lines to increase the memory limit:

[Service]
# Soft limit: above this the kernel throttles the cgroup and reclaims its memory aggressively
MemoryHigh=2G
# Set the absolute hard limit before the OOM killer is invoked
MemoryMax=3G
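To check that the new limit is in effect and how close the service runs to it, you can compare the MemoryCurrent and MemoryMax properties. A minimal sketch, assuming byte values as printed by systemctl show; the helper name mem_usage_pct is hypothetical.

```shell
# mem_usage_pct (hypothetical helper): percent of MemoryMax currently in use.
# $1 = MemoryCurrent, $2 = MemoryMax, both as printed by `systemctl show`.
# MemoryMax is the literal string "infinity" when no limit is set.
mem_usage_pct() {
    case "$2" in
        infinity) echo "unlimited"; return 0 ;;
    esac
    for v in "$1" "$2"; do
        case "$v" in
            ''|*[!0-9]*) echo "unknown"; return 0 ;;   # e.g. "[not set]"
        esac
    done
    [ "$2" -gt 0 ] || { echo "unknown"; return 0; }
    echo $(( $1 * 100 / $2 ))
}

# On a live host (my-app.service is a placeholder):
#   mem_usage_pct "$(systemctl show -p MemoryCurrent --value my-app.service)" \
#                 "$(systemctl show -p MemoryMax --value my-app.service)"
```

A service that sits near 100 most of the time either needs a higher limit or has a leak worth investigating.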

2. Protecting Critical Services with OOMScoreAdjust

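If you have a critical service (like PostgreSQL or MySQL) that absolutely must not be killed during memory pressure, you can adjust its OOM score. The kernel OOM killer looks for the process with the highest score. A score of -1000 completely disables OOM killing for that process. Note that this score steers only the kernel OOM killer; systemd-oomd selects whole cgroups and is influenced by ManagedOOMPreference= instead.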

systemctl edit postgresql.service

[Service]
OOMScoreAdjust=-900

Note: Use this cautiously. If your database consumes all RAM and cannot be killed, the entire server may become unresponsive.
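Because a typo here can leave a service unprotected or unkillable, it can help to validate the value before writing it. The sketch below checks only that the value is an integer in the kernel's accepted range of -1000 to 1000; the function name valid_oom_adjust is made up for this example.

```shell
# valid_oom_adjust (hypothetical helper): succeed only for integers in
# -1000..1000, the range the kernel accepts for oom_score_adj.
valid_oom_adjust() {
    case "$1" in
        ''|-|*[!0-9-]*|?*-*) return 1 ;;   # empty, lone "-", non-digits, misplaced "-"
    esac
    [ "$1" -ge -1000 ] && [ "$1" -le 1000 ]
}

# After restarting the service, the applied value can be read back with:
#   cat "/proc/$(systemctl show -p MainPID --value postgresql.service)/oom_score_adj"
```

Reading the value back from /proc confirms that the drop-in was picked up by the running process.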

3. Managing systemd-oomd OOMPolicy

In systemd version 243+, you can define how systemd reacts to an OOM event within the cgroup using OOMPolicy=.

[Service]
# Options: continue, stop, kill
OOMPolicy=continue

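Setting this to continue means systemd logs the kill but leaves the unit running when one of its processes is OOM-killed, rather than stopping the whole service (stop, the usual default). This is useful for worker-based applications like Gunicorn or PHP-FPM, where the master process can respawn the lost worker.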
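systemctl edit opens an editor, which is awkward in provisioning scripts. The drop-in it creates is a file named override.conf in a directory named after the unit with a .d suffix under /etc/systemd/system, so the same result can be sketched non-interactively. The helper name write_dropin and the scratch-directory dry run are assumptions for illustration.

```shell
# write_dropin (hypothetical helper): create <root>/<unit>.d/override.conf with
# a single [Service] setting. Use /etc/systemd/system as <root> on a real host.
write_dropin() {
    mkdir -p "$1/$2.d" || return 1
    printf '[Service]\n%s\n' "$3" > "$1/$2.d/override.conf"
}

# Dry run into a scratch directory so nothing on the host is touched:
scratch=$(mktemp -d)
write_dropin "$scratch" my-app.service "OOMPolicy=continue"
cat "$scratch/my-app.service.d/override.conf"

# On a real host, follow up with:
#   systemctl daemon-reload && systemctl restart my-app.service
```

Unlike systemctl edit, this approach does not reload the manager for you, hence the explicit daemon-reload.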

Step 3: Resolving systemd High CPU Usage

Sometimes, the issue isn't memory, but systemd high cpu usage. If systemd-journald or the main systemd process (PID 1) is pinned at 100% CPU, it usually indicates an I/O bottleneck or an application spamming logs.

  1. Identify the culprit: Run journalctl -f to see if an application is writing thousands of lines per second.
  2. Rate Limit Journald: Edit /etc/systemd/journald.conf to throttle aggressive logging:
    [Journal]
    RateLimitIntervalSec=30s
    RateLimitBurst=1000
    
  3. Restart journald: systemctl restart systemd-journald
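Watching journalctl -f works for an obvious flood, but counting lines per source is more reliable when several services log at once. A minimal sketch, assuming journalctl's default short output format where the fifth field is the identifier followed by the PID in brackets; the helper name top_loggers is hypothetical.

```shell
# top_loggers (hypothetical helper): read journal lines in the default "short"
# format on stdin and print "count identifier", busiest source first.
top_loggers() {
    awk '{ id = $5
           sub(/\[[0-9]+\]:$/, "", id)   # strip "[pid]:"
           sub(/:$/, "", id)             # strip a bare trailing ":"
           n[id]++ }
         END { for (k in n) print n[k], k }' | sort -rn
}

# On a live host:
#   journalctl -q --since "5 minutes ago" --no-pager | top_loggers | head -n 5
```

Continuation lines of multi-line messages are counted under whatever their fifth word happens to be, so treat the tail of the list as noise.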

Additionally, you can use systemd-cgtop to view resource usage per cgroup, which is often more useful than the standard top command for containerized or systemd-managed environments.

Step 4: Fixing systemd Service Not Starting and Permission Denied

If you are dealing with a systemd service not starting, the status will usually show code=exited, status=....

Status 203/EXEC: systemd permission denied

The exact error Main process exited, code=exited, status=203/EXEC is incredibly common. It explicitly means systemd could not execute the binary.

Root Causes & Fixes:

  • Missing Executable Flag: The file isn't executable. Fix: chmod +x /path/to/binary
  • Wrong Architecture: You are trying to run an ARM binary on x86_64.
  • SELinux/AppArmor: The security module blocked execution. For SELinux, check the audit log with ausearch -m avc -ts recent and, if the file context is wrong, restore it with restorecon -Rv /path/to/binary. For AppArmor, look for DENIED entries in the kernel log.
  • Missing Shebang: If it's a script, ensure #!/bin/bash or #!/usr/bin/env python3 is at the very top of the file.
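The first and last causes in the list above can be checked mechanically. The following sketch reports the first obvious problem it finds; it does not look at SELinux, AppArmor, or CPU architecture, and the name check_exec is an assumption for this example.

```shell
# check_exec (hypothetical helper): report the first obvious reason a file
# could fail to execute. Prints ok, missing, not-executable, or no-shebang.
check_exec() {
    if [ ! -e "$1" ]; then echo "missing"; return 1; fi
    if [ ! -x "$1" ]; then echo "not-executable"; return 1; fi
    # Scripts must start with "#!"; ELF binaries start with byte 0x7f then "ELF".
    magic=$(head -c 4 "$1" | od -An -c | tr -d ' \n')
    case "$magic" in
        '#!'*|177ELF) echo "ok" ;;
        *) echo "no-shebang"; return 1 ;;
    esac
}

# Example (the path is a placeholder, as in the list above):
#   check_exec /path/to/binary
```

An ok result with a persisting 203/EXEC points toward the security module or an architecture mismatch.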

Dependency Failures

If a service simply won't start, check for dependency failures. If Service A declares Requires= and After= on Service B, and Service B fails to start, Service A will not be started either.

journalctl -xeu my-app.service

Look for errors like Dependency failed for My Application.
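To see which units a service depends on hard enough to block its start, you can read the Requires= and BindsTo= lines from its unit file. This is a rough sketch with a hypothetical helper name, hard_deps; it reads a file argument or stdin and ignores line continuations.

```shell
# hard_deps (hypothetical helper): print the units listed in Requires= and
# BindsTo= lines, one per line. Reads the given file, or stdin if none.
hard_deps() {
    sed -n -e 's/^Requires=//p' -e 's/^BindsTo=//p' "$@" |
        tr ' ' '\n' | sed '/^$/d'
}

# On a live host (my-app.service is a placeholder):
#   for u in $(systemctl cat my-app.service | hard_deps); do
#       printf '%s: %s\n' "$u" "$(systemctl is-active "$u")"
#   done
```

For the full tree, including dependencies pulled in indirectly, systemctl list-dependencies my-app.service gives a more complete picture.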

Conclusion

Troubleshooting systemd oom, systemd core dump, and failure states requires a systematic approach. By utilizing journalctl, coredumpctl, and understanding systemd drop-in configurations (systemctl edit), you can stabilize your Linux infrastructure, tame the OOM killer, and ensure your critical services remain highly available.

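Quick Diagnostic Script

The script below runs the main checks from this guide in one pass. Run it as root, since dmesg and the system journal are often restricted for unprivileged users.
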
#!/bin/bash
# Diagnostic script: Check for systemd OOM kills, core dumps, and failing services

echo "=== Checking Kernel OOM Kills (Last 10) ==="
dmesg -T | grep -i -E 'killed process|oom' | tail -n 10

echo -e "\n=== Checking systemd-oomd interventions ==="
journalctl -u systemd-oomd --since "2 days ago" | grep "Killed" | tail -n 10

echo -e "\n=== Checking Recent Core Dumps ==="
coredumpctl list --reverse | head -n 10

echo -e "\n=== Finding Failed systemd Services ==="
systemctl --failed

echo -e "\n=== Top Memory Consuming Cgroups (first 10 lines) ==="
systemd-cgtop -b -n 1 -m | head -n 10

Error Medic Editorial

Error Medic Editorial is a team of senior SREs and DevOps engineers dedicated to solving complex Linux, Kubernetes, and infrastructure challenges. With decades of combined experience in high-availability environments, they provide actionable, battle-tested solutions for production outages.
