Error Medic

Troubleshooting 'systemd OOM killed process': Fix High CPU and Service Crashes

Fix 'systemd OOM killed process' and related core dumps. Learn how to diagnose high CPU, adjust MemoryLimit, and resolve systemd permission denied errors.

Key Takeaways
  • Application memory leaks triggering the cgroup OOM killer are the most common root cause.
  • Misconfigured MemoryMax or MemoryLimit directives in unit files restrict resources unnecessarily.
  • Quick Fix: Inspect kernel logs (dmesg) and journalctl, adjust MemoryMax via systemctl edit, or add system swap.
Fix Approaches Compared
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Increase MemoryMax | Service legitimately requires more memory to function | 5 mins | Low |
| Configure System Swap | System needs a buffer for idle memory pages | 10 mins | Low |
| Fix App Memory Leak | Memory consumption grows unbounded over time | Hours/Days | High |
| Set OOMScoreAdjust | Protect critical infrastructure from being killed first | 5 mins | Medium |

Understanding the Error: systemd OOM Killed Process

When a Linux system runs out of memory, or a specific control group (cgroup) exceeds its memory limit, the Out-Of-Memory (OOM) killer steps in to terminate processes and free up RAM. In modern Linux distributions managed by systemd, you will often encounter this as a systemd oom event, sometimes accompanied by a systemd core dump or a status of systemd failed.

The exact error message in your logs might look like this:

systemd[1]: my-app.service: A process of this unit has been killed by the OOM killer.
kernel: Out of memory: Killed process 14523 (node) total-vm:1523456kB, anon-rss:854321kB, file-rss:0kB

If you are using systemd-oomd (the user-space OOM killer introduced in newer systemd versions), the message might be:

systemd-oomd[678]: Killed /system.slice/my-app.service due to memory pressure

This guide will walk you through diagnosing and fixing systemd OOM errors, addressing related issues like systemd high cpu usage, systemd not working properly, and resolving systemd permission denied errors that prevent services from starting.

Step 1: Diagnose the OOM Event

The first step in troubleshooting a systemd service not starting or crashing abruptly is to confirm if the OOM killer was involved. We need to check both the kernel ring buffer and the systemd journal.

Check Kernel Logs: Run the following command to search the kernel logs for OOM killer invocations:

dmesg -T | grep -i -E 'oom|out of memory'

Check systemd Journal: Inspect the specific service's logs:

journalctl -u your-service-name.service -n 100 --no-pager

Look for lines indicating that the process was terminated by a signal (often SIGKILL, signal 9) or explicit OOM messages.
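The two checks above can be combined into a small helper that classifies which OOM mechanism fired based on the log text. This is a minimal sketch; the patterns simply match the message formats shown earlier in this guide:

```shell
#!/bin/sh
# classify_oom: read log text on stdin and report which OOM mechanism fired.
# The patterns match the kernel and systemd-oomd message formats shown above.
classify_oom() {
  log=$(cat)
  case "$log" in
    *systemd-oomd*'due to memory pressure'*)
      echo "systemd-oomd" ;;
    *'Out of memory: Killed process'*|*'killed by the OOM killer'*)
      echo "kernel-oom" ;;
    *)
      echo "no-oom-found" ;;
  esac
}

# Usage (uncomment to run against the live system):
# dmesg -T | classify_oom
# journalctl -u my-app.service -n 100 --no-pager | classify_oom
```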

Step 2: Differentiating Between Kernel OOM and systemd-oomd

It is crucial to understand which component killed your process:

  1. Kernel OOM Killer: Triggered when the entire system is out of memory and swap, or when a cgroup hits its MemoryMax limit. The kernel selects a process based on its oom_score.
  2. systemd-oomd: A user-space daemon that monitors memory pressure (using PSI - Pressure Stall Information). It acts proactively, killing process groups before the kernel OOM killer is invoked, which helps prevent system lockups but can sometimes seem overly aggressive, leading to unexpected systemd failed states.
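To see which component is in play on a given host, you can check whether systemd-oomd is active and inspect the PSI metrics it watches. A hedged sketch that degrades gracefully on kernels without PSI or hosts without systemctl:

```shell
#!/bin/sh
# show_oom_context: print memory PSI (what systemd-oomd acts on) and whether
# systemd-oomd is active. Guarded so it still runs where either is missing.
show_oom_context() {
  if [ -r /proc/pressure/memory ]; then
    echo "memory pressure (PSI):"
    cat /proc/pressure/memory
  else
    echo "PSI not exposed by this kernel"
  fi
  if command -v systemctl >/dev/null 2>&1; then
    state=$(systemctl is-active systemd-oomd 2>/dev/null)
    echo "systemd-oomd: ${state:-unknown}"
  fi
}
```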

Step 3: Fixing 'systemd oom' for Specific Services

If your service is legitimately using more memory than allowed by its unit file, you need to adjust its resource limits.

Adjusting Memory Limits in systemd: Create a drop-in override for the unit (systemctl edit writes an override file instead of modifying the packaged unit file directly):

systemctl edit your-service-name.service

Add or modify the [Service] section to increase the MemoryMax (cgroup v2) or MemoryLimit (cgroup v1):

[Service]
# Set a hard limit
MemoryMax=2G
# Set a soft limit to encourage reclaiming memory
MemoryHigh=1.5G
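After a daemon-reload, `systemctl show -p MemoryMax -p MemoryHigh your-service-name.service` reports the effective limits in raw bytes. A small hedged helper to convert the human-readable sizes from your unit file for comparison (systemd accepts several suffixes; this sketch covers the common K/M/G/T ones):

```shell
#!/bin/sh
# to_bytes: convert a systemd-style size string (2G, 1.5G, 512M, ...) to bytes
# so it can be compared against the raw values printed by `systemctl show`.
to_bytes() {
  awk -v v="$1" 'BEGIN {
    n = v + 0                      # numeric prefix
    u = substr(v, length(v), 1)    # trailing unit letter, if any
    if      (u == "K") n *= 1024
    else if (u == "M") n *= 1048576
    else if (u == "G") n *= 1073741824
    else if (u == "T") n *= 1099511627776
    printf "%.0f\n", n
  }'
}

# Example: compare the unit file value against the live setting
# (service name is a placeholder):
# [ "$(to_bytes 2G)" = "$(systemctl show -p MemoryMax --value my-app.service)" ]
```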

Protecting Critical Services: If a service is critical and should be the last thing the OOM killer targets, you can adjust its OOM score. A highly negative score makes it less likely to be killed.

[Service]
OOMScoreAdjust=-1000
OOMPolicy=continue

Note: Setting OOMPolicy=continue tells systemd not to stop the whole service if only one of its child processes is killed, which can be useful for worker-pool architectures.
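You can confirm the adjustment actually reached the running process by reading its oom_score_adj from /proc. A minimal sketch; the service name in the usage line is a placeholder:

```shell
#!/bin/sh
# oom_score_of: print the OOM score adjustment the kernel holds for a PID.
oom_score_of() {
  pid=$1
  if [ -r /proc/"$pid"/oom_score_adj ]; then
    cat /proc/"$pid"/oom_score_adj
  else
    echo "unknown"
  fi
}

# Usage against a running service's main process (service name is a placeholder):
# oom_score_of "$(systemctl show -p MainPID --value my-app.service)"
```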

Step 4: Troubleshooting 'systemd high cpu' and 'systemd core dump'

Sometimes, a service doesn't just hit an OOM limit; it spins out of control, causing systemd high cpu usage before ultimately crashing and generating a systemd core dump.

Analyzing Core Dumps: If your service fails and dumps core, you can retrieve the core dump using coredumpctl:

coredumpctl list
coredumpctl info <PID>
coredumpctl gdb <PID>

Analyzing the core dump with GDB will show you the exact C/C++ or runtime stack trace where the application crashed. Often, high CPU followed by a crash points to infinite loops allocating memory or race conditions.
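Note that coredumpctl can only retrieve dumps that systemd-coredump actually stored. If the list comes back empty, check the storage settings in /etc/systemd/coredump.conf; the values below are illustrative defaults, not a prescription:

```ini
[Coredump]
Storage=external
ProcessSizeMax=2G
ExternalSizeMax=2G
```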

systemd-journald High CPU: If systemd-journald itself is consuming high CPU, it's usually because an application is spamming logs. Identify the noisy service:

journalctl -f

Then configure rate limiting in /etc/systemd/journald.conf:

[Journal]
RateLimitIntervalSec=30s
RateLimitBurst=1000

Apply the change with systemctl restart systemd-journald.
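Before throttling, it helps to know exactly which unit is doing the spamming. A hedged sketch that tallies lines per syslog identifier; the field position assumes journalctl's default short output format:

```shell
#!/bin/sh
# top_talkers: count log lines per identifier to find what is flooding journald.
# Feed it `journalctl` output in the default short format.
top_talkers() {
  awk '{
    split($5, a, "[")      # field 5 is "identifier[pid]:"; strip "[pid]:"
    sub(/:$/, "", a[1])    # handle identifiers logged without a [pid]
    counts[a[1]]++
  }
  END { for (u in counts) print counts[u], u }' | sort -rn | head -5
}

# Usage (uncomment to run against the live journal):
# journalctl -n 10000 --no-pager -o short | top_talkers
```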

Step 5: Resolving 'systemd permission denied' and 'service not starting'

A common reason for a systemd service not starting is a systemd permission denied error. This happens when the User= or Group= specified in the systemd unit file lacks read/execute access to the binary or read/write access to necessary configuration and data directories.

Diagnosis:

journalctl -xeu failing-service.service

Look for code=exited, status=203/EXEC or Permission denied.

Fixes:

  1. Verify File Permissions: Ensure the executable is owned by the correct user and has the execute bit set (chmod +x /path/to/binary).
  2. Check SELinux/AppArmor: Security modules often block systemd services. Check SELinux audit logs (ausearch -m avc -ts recent) and adjust boolean values or file contexts (chcon, semanage fcontext).
  3. Systemd Sandboxing: Modern systemd unit files use security features like ProtectSystem=strict or ProtectHome=yes. If your service needs to write to /home or /var, you may need to override these settings:
[Service]
ProtectHome=false
ReadWritePaths=/var/lib/my-app
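A 203/EXEC failure can often be reproduced outside systemd by walking the path the way the unit's user would. A rough first-pass sketch that checks the execute bit on the binary and search permission on every parent directory; it deliberately ignores ACLs, capabilities, and SELinux/AppArmor policy:

```shell
#!/bin/sh
# check_exec_path: report the first obvious permission problem on the way to
# an executable. Ignores ACLs, SELinux/AppArmor and capabilities.
check_exec_path() {
  target=$1
  path=""
  oldIFS=$IFS; IFS=/
  for part in $(dirname "$target"); do
    [ -n "$part" ] || continue
    path="$path/$part"
    if [ ! -x "$path" ]; then
      IFS=$oldIFS
      echo "no search permission on $path"
      return 1
    fi
  done
  IFS=$oldIFS
  if [ ! -x "$target" ]; then
    echo "not executable: $target"
    return 1
  fi
  echo "ok"
}

# To mirror what systemd sees, also try running the binary as the unit's
# User= (both the user and the path below are placeholders):
# sudo -u appuser /path/to/binary --version
```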

Step 6: Addressing 'systemd not working' System-Wide

If 'systemd not working' refers to systemd freezing, timing out during boot, or failing to communicate over D-Bus, the cause is often severe resource exhaustion (such as OOM) or a broken D-Bus connection.

  • Try reloading the daemon: systemctl daemon-reload
  • Check systemd status: systemctl status
  • Look for failed units system-wide: systemctl --failed

Best Practices for Preventing systemd OOM Issues

  1. Right-Size Your Instances: Ensure your VMs or bare-metal servers have adequate RAM for the workloads.
  2. Configure Swap: Even on SSD-backed cloud instances, a small swap file (e.g., 2GB) provides a buffer that allows the kernel to page out idle memory, giving active applications more breathing room and preventing sudden OOM kills.
  3. Monitor Memory Pressure: Implement monitoring using tools like Prometheus and node_exporter to track memory usage and cgroup limits, alerting you before the OOM killer strikes.
  4. Audit Application Leaks: Continuously profile your applications for memory leaks. A well-configured systemd unit will mitigate the blast radius of a leak (by isolating the OOM to the specific cgroup), but it won't fix the underlying software bug.
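Best practice 2 above can be sketched as follows. Run it as root, and treat the path and size as adjustable defaults rather than a prescription:

```shell
#!/bin/sh
# setup_swap: create and enable a swap file (run as root).
setup_swap() {
  size=${1:-2G}
  # fallocate is fast; fall back to dd where the filesystem doesn't support it
  # (note: the dd fallback assumes the 2G default size)
  fallocate -l "$size" /swapfile 2>/dev/null ||
    dd if=/dev/zero of=/swapfile bs=1M count=2048
  chmod 600 /swapfile      # swap files must not be world-readable
  mkswap /swapfile
  swapon /swapfile
  # persist across reboots
  echo '/swapfile none swap sw 0 0' >> /etc/fstab
}
```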

Frequently Asked Questions

# Check for OOM kills in kernel logs
dmesg -T | grep -i -E 'oom|out of memory'

# Check specific service logs for memory pressure kills
journalctl -u my-app.service | grep -i killed

# Safely increase MemoryMax for a service via drop-in file
mkdir -p /etc/systemd/system/my-app.service.d/
cat <<EOF > /etc/systemd/system/my-app.service.d/override.conf
[Service]
MemoryMax=2G
OOMScoreAdjust=-500
EOF

# Reload systemd configuration and restart the service
systemctl daemon-reload
systemctl restart my-app.service

Error Medic Editorial

A collective of senior SREs and Linux administrators dedicated to demystifying complex system infrastructure, kernel panics, and deployment bottlenecks.
