Error Medic

Troubleshooting 'systemd OOM killed process': Fix High CPU and Service Crashes

Fix 'systemd OOM killed process' and related core dumps. Learn how to diagnose high CPU, adjust MemoryLimit, and resolve systemd permission denied errors.

Key Takeaways
  • Application memory leaks triggering the cgroup OOM killer are the most common root cause.
  • Misconfigured MemoryMax or MemoryLimit directives in unit files restrict resources unnecessarily.
  • Quick Fix: Inspect kernel logs (dmesg) and journalctl, adjust MemoryMax via systemctl edit, or add system swap.
Fix Approaches Compared
| Method | When to Use | Time | Risk |
|---|---|---|---|
| Increase MemoryMax | Service legitimately requires more memory to function | 5 mins | Low |
| Configure System Swap | System needs a buffer for idle memory pages | 10 mins | Low |
| Fix App Memory Leak | Memory consumption grows unbounded over time | Hours/Days | High |
| Set OOMScoreAdjust | Protect critical infrastructure from being killed first | 5 mins | Medium |

Understanding the Error: systemd OOM Killed Process

When a Linux system runs out of memory, or a specific control group (cgroup) exceeds its memory limit, the Out-Of-Memory (OOM) killer steps in to terminate processes and free up RAM. In modern Linux distributions managed by systemd, you will often encounter this as a systemd oom event, sometimes accompanied by a systemd core dump or a status of systemd failed.

The exact error message in your logs might look like this:

systemd[1]: my-app.service: A process of this unit has been killed by the OOM killer.
kernel: Out of memory: Killed process 14523 (node) total-vm:1523456kB, anon-rss:854321kB, file-rss:0kB

If you are using systemd-oomd (the user-space OOM killer introduced in newer systemd versions), the message might be:

systemd-oomd[678]: Killed /system.slice/my-app.service due to memory pressure

This guide will walk you through diagnosing and fixing systemd OOM errors, addressing related issues like systemd high cpu usage, systemd not working properly, and resolving systemd permission denied errors that prevent services from starting.

Step 1: Diagnose the OOM Event

The first step in troubleshooting a systemd service not starting or crashing abruptly is to confirm if the OOM killer was involved. We need to check both the kernel ring buffer and the systemd journal.

Check Kernel Logs: Run the following command to search the kernel logs for OOM killer invocations:

dmesg -T | grep -i -E 'oom|out of memory'

Check systemd Journal: Inspect the specific service's logs:

journalctl -u your-service-name.service -n 100 --no-pager

Look for lines indicating that the process was terminated by a signal (often SIGKILL, signal 9) or explicit OOM messages.
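The two checks above can be combined into a small helper that classifies which OOM mechanism fired based on the log text. This is a minimal sketch; the patterns simply match the message formats shown earlier in this guide:

```shell
#!/bin/sh
# classify_oom: read log text on stdin and report which OOM mechanism fired.
# The patterns match the kernel and systemd-oomd message formats shown above.
classify_oom() {
  log=$(cat)
  case "$log" in
    *systemd-oomd*'due to memory pressure'*)
      echo "systemd-oomd" ;;
    *'Out of memory: Killed process'*|*'killed by the OOM killer'*)
      echo "kernel-oom" ;;
    *)
      echo "no-oom-found" ;;
  esac
}

# Usage (uncomment to run against the live system):
# dmesg -T | classify_oom
# journalctl -u my-app.service -n 100 --no-pager | classify_oom
```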

Step 2: Differentiating Between Kernel OOM and systemd-oomd

It is crucial to understand which component killed your process:

  1. Kernel OOM Killer: Triggered when the entire system is out of memory and swap, or when a cgroup hits its MemoryMax limit. The kernel selects a process based on its oom_score.
  2. systemd-oomd: A user-space daemon that monitors memory pressure (using PSI - Pressure Stall Information). It acts proactively, killing process groups before the kernel OOM killer is invoked, which helps prevent system lockups but can sometimes seem overly aggressive, leading to unexpected systemd failed states.
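To see which component is in play on a given host, you can check whether systemd-oomd is active and inspect the PSI metrics it watches. A hedged sketch that degrades gracefully on kernels without PSI or hosts without systemctl:

```shell
#!/bin/sh
# show_oom_context: print memory PSI (what systemd-oomd acts on) and whether
# systemd-oomd is active. Guarded so it still runs where either is missing.
show_oom_context() {
  if [ -r /proc/pressure/memory ]; then
    echo "memory pressure (PSI):"
    cat /proc/pressure/memory
  else
    echo "PSI not exposed by this kernel"
  fi
  if command -v systemctl >/dev/null 2>&1; then
    state=$(systemctl is-active systemd-oomd 2>/dev/null)
    echo "systemd-oomd: ${state:-unknown}"
  fi
}
```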

Step 3: Fixing 'systemd oom' for Specific Services

If your service is legitimately using more memory than allowed by its unit file, you need to adjust its resource limits.

Adjusting Memory Limits in systemd: Create a drop-in override for the unit (systemctl edit writes an override file instead of modifying the packaged unit file directly):

systemctl edit your-service-name.service

Add or modify the [Service] section to increase the MemoryMax (cgroup v2) or MemoryLimit (cgroup v1):

[Service]
# Set a hard limit
MemoryMax=2G
# Set a soft limit to encourage reclaiming memory
MemoryHigh=1.5G
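After a daemon-reload, `systemctl show -p MemoryMax -p MemoryHigh your-service-name.service` reports the effective limits in raw bytes. A small hedged helper to convert the human-readable sizes from your unit file for comparison (systemd accepts several suffixes; this sketch covers the common K/M/G/T ones):

```shell
#!/bin/sh
# to_bytes: convert a systemd-style size string (2G, 1.5G, 512M, ...) to bytes
# so it can be compared against the raw values printed by `systemctl show`.
to_bytes() {
  awk -v v="$1" 'BEGIN {
    n = v + 0                      # numeric prefix
    u = substr(v, length(v), 1)    # trailing unit letter, if any
    if      (u == "K") n *= 1024
    else if (u == "M") n *= 1048576
    else if (u == "G") n *= 1073741824
    else if (u == "T") n *= 1099511627776
    printf "%.0f\n", n
  }'
}

# Example: compare the unit file value against the live setting
# (service name is a placeholder):
# [ "$(to_bytes 2G)" = "$(systemctl show -p MemoryMax --value my-app.service)" ]
```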

Protecting Critical Services: If a service is critical and should be the last thing the OOM killer targets, you can adjust its OOM score. A highly negative score makes it less likely to be killed.

[Service]
OOMScoreAdjust=-1000
OOMPolicy=continue

Note: Setting OOMPolicy=continue tells systemd not to stop the whole service if only one of its child processes is killed, which can be useful for worker-pool architectures.
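You can confirm the adjustment actually reached the running process by reading its oom_score_adj from /proc. A minimal sketch; the service name in the usage line is a placeholder:

```shell
#!/bin/sh
# oom_score_of: print the OOM score adjustment the kernel holds for a PID.
oom_score_of() {
  pid=$1
  if [ -r /proc/"$pid"/oom_score_adj ]; then
    cat /proc/"$pid"/oom_score_adj
  else
    echo "unknown"
  fi
}

# Usage against a running service's main process (service name is a placeholder):
# oom_score_of "$(systemctl show -p MainPID --value my-app.service)"
```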

Step 4: Troubleshooting 'systemd high cpu' and 'systemd core dump'

Sometimes, a service doesn't just hit an OOM limit; it spins out of control, causing systemd high cpu usage before ultimately crashing and generating a systemd core dump.

Analyzing Core Dumps: If your service fails and dumps core, you can retrieve the core dump using coredumpctl:

coredumpctl list
coredumpctl info <PID>
coredumpctl gdb <PID>

Analyzing the core dump with GDB will show you the exact C/C++ or runtime stack trace where the application crashed. Often, high CPU followed by a crash points to infinite loops allocating memory or race conditions.
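Note that coredumpctl can only retrieve dumps that systemd-coredump actually stored. If the list comes back empty, check the storage settings in /etc/systemd/coredump.conf; the values below are illustrative defaults, not a prescription:

```ini
[Coredump]
Storage=external
ProcessSizeMax=2G
ExternalSizeMax=2G
```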

systemd-journald High CPU: If systemd-journald itself is consuming high CPU, it's usually because an application is spamming logs. Identify the noisy service:

journalctl -f

Then configure rate limiting in /etc/systemd/journald.conf:

[Journal]
RateLimitIntervalSec=30s
RateLimitBurst=1000

Apply the change with systemctl restart systemd-journald.
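Before throttling, it helps to know exactly which unit is doing the spamming. A hedged sketch that tallies lines per syslog identifier; the field position assumes journalctl's default short output format:

```shell
#!/bin/sh
# top_talkers: count log lines per identifier to find what is flooding journald.
# Feed it `journalctl` output in the default short format.
top_talkers() {
  awk '{
    split($5, a, "[")      # field 5 is "identifier[pid]:"; strip "[pid]:"
    sub(/:$/, "", a[1])    # handle identifiers logged without a [pid]
    counts[a[1]]++
  }
  END { for (u in counts) print counts[u], u }' | sort -rn | head -5
}

# Usage (uncomment to run against the live journal):
# journalctl -n 10000 --no-pager -o short | top_talkers
```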

Step 5: Resolving 'systemd permission denied' and 'service not starting'

A common reason for a systemd service not starting is a systemd permission denied error. This happens when the User= or Group= specified in the systemd unit file lacks read/execute access to the binary or read/write access to necessary configuration and data directories.

Diagnosis:

journalctl -xeu failing-service.service

Look for code=exited, status=203/EXEC or Permission denied.

Fixes:

  1. Verify File Permissions: Ensure the executable is owned by the correct user and has the execute bit set (chmod +x /path/to/binary).
  2. Check SELinux/AppArmor: Security modules often block systemd services. Check SELinux audit logs (ausearch -m avc -ts recent) and adjust boolean values or file contexts (chcon, semanage fcontext).
  3. Systemd Sandboxing: Modern systemd unit files use security features like ProtectSystem=strict or ProtectHome=yes. If your service needs to write to /home or /var, you may need to override these settings:
[Service]
ProtectHome=false
ReadWritePaths=/var/lib/my-app
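A 203/EXEC failure can often be reproduced outside systemd by walking the path the way the unit's user would. A rough first-pass sketch that checks the execute bit on the binary and search permission on every parent directory; it deliberately ignores ACLs, capabilities, and SELinux/AppArmor policy:

```shell
#!/bin/sh
# check_exec_path: report the first obvious permission problem on the way to
# an executable. Ignores ACLs, SELinux/AppArmor and capabilities.
check_exec_path() {
  target=$1
  path=""
  oldIFS=$IFS; IFS=/
  for part in $(dirname "$target"); do
    [ -n "$part" ] || continue
    path="$path/$part"
    if [ ! -x "$path" ]; then
      IFS=$oldIFS
      echo "no search permission on $path"
      return 1
    fi
  done
  IFS=$oldIFS
  if [ ! -x "$target" ]; then
    echo "not executable: $target"
    return 1
  fi
  echo "ok"
}

# To mirror what systemd sees, also try running the binary as the unit's
# User= (both the user and the path below are placeholders):
# sudo -u appuser /path/to/binary --version
```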

Step 6: Addressing 'systemd not working' System-Wide

If 'systemd not working' refers to systemd freezing, timing out during boot, or failing to communicate over D-Bus, the cause is often severe resource exhaustion (such as OOM) or a broken D-Bus connection.

  • Try reloading the daemon: systemctl daemon-reload
  • Check systemd status: systemctl status
  • Look for failed units system-wide: systemctl --failed

Best Practices for Preventing systemd OOM Issues

  1. Right-Size Your Instances: Ensure your VMs or bare-metal servers have adequate RAM for the workloads.
  2. Configure Swap: Even on SSD-backed cloud instances, a small swap file (e.g., 2GB) provides a buffer that allows the kernel to page out idle memory, giving active applications more breathing room and preventing sudden OOM kills.
  3. Monitor Memory Pressure: Implement monitoring using tools like Prometheus and node_exporter to track memory usage and cgroup limits, alerting you before the OOM killer strikes.
  4. Audit Application Leaks: Continuously profile your applications for memory leaks. A well-configured systemd unit will mitigate the blast radius of a leak (by isolating the OOM to the specific cgroup), but it won't fix the underlying software bug.
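Best practice 2 above can be sketched as follows. Run it as root, and treat the path and size as adjustable defaults rather than a prescription:

```shell
#!/bin/sh
# setup_swap: create and enable a swap file (run as root).
setup_swap() {
  size=${1:-2G}
  # fallocate is fast; fall back to dd where the filesystem doesn't support it
  # (note: the dd fallback assumes the 2G default size)
  fallocate -l "$size" /swapfile 2>/dev/null ||
    dd if=/dev/zero of=/swapfile bs=1M count=2048
  chmod 600 /swapfile      # swap files must not be world-readable
  mkswap /swapfile
  swapon /swapfile
  # persist across reboots
  echo '/swapfile none swap sw 0 0' >> /etc/fstab
}
```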

Frequently Asked Questions

# Check for OOM kills in kernel logs
dmesg -T | grep -i -E 'oom|out of memory'

# Check specific service logs for memory pressure kills
journalctl -u my-app.service | grep -i killed

# Safely increase MemoryMax for a service via drop-in file
mkdir -p /etc/systemd/system/my-app.service.d/
cat <<EOF > /etc/systemd/system/my-app.service.d/override.conf
[Service]
MemoryMax=2G
OOMScoreAdjust=-500
EOF

# Reload systemd configuration and restart the service
systemctl daemon-reload
systemctl restart my-app.service

Error Medic Editorial

A collective of senior SREs and Linux administrators dedicated to demystifying complex system infrastructure, kernel panics, and deployment bottlenecks.
