2026-01-22 18:46:43 -05:00


# Runaway Process Killer
Automatic protection against runaway CPU and RAM processes on Ubuntu servers. Prevents hosting provider throttling by killing processes that pin CPU to 95%+ for extended periods or exhaust available memory.
## Overview
| Protection | Tool | Trigger | Action |
|------------|------|---------|--------|
| CPU | monit | 95%+ CPU for 5 minutes | Kill all processes matching top CPU consumer |
| RAM | earlyoom | Available memory and free swap both below 5% | Kill highest memory consumer |
| Orphan Claude | monit | Every 15 minutes | Kill Claude processes with no TTY or PPID=1 |
## Requirements
- Ubuntu 20.04+ (tested on 24.04 LTS)
- Root access
- ~6MB RAM overhead total
## Quick Install
```bash
curl -fsSL https://git.upfrontops.cloud/upfrontops/runaway-process-killer/raw/branch/main/install.sh | sudo bash
```
Or clone and run:
```bash
git clone https://git.upfrontops.cloud/upfrontops/runaway-process-killer.git
cd runaway-process-killer
sudo ./install.sh
```
## Manual Installation
### 1. Install earlyoom (RAM Protection)
```bash
sudo apt update && sudo apt install -y earlyoom
```
Edit `/etc/default/earlyoom`:
```bash
EARLYOOM_ARGS="-m 5 -s 5 --avoid '(^|/)(init|systemd|sshd)$' -r 60"
```
Enable and start:
```bash
sudo systemctl enable earlyoom && sudo systemctl restart earlyoom
```
### 2. Install monit (CPU Protection)
```bash
sudo apt install -y monit
```
Edit `/etc/monit/monitrc`, change daemon interval:
```
set daemon 60
```
Enable HTTP interface (required for `monit status`):
```
set httpd port 2812 and
    use address localhost
    allow localhost
    allow admin:monit
```
Create `/etc/monit/conf.d/cpu-killer`:
```
check system $HOST
    if cpu usage > 95% for 5 cycles then exec "/usr/local/bin/kill-top-cpu.sh"
```
Create `/usr/local/bin/kill-top-cpu.sh`:
```bash
#!/bin/bash
# Kill every process that shares a command name with the top CPU consumer,
# skipping protected and transient processes.

# List processes sorted by CPU (highest first), then drop the header row and any
# line containing a protected name. grep matches substrings of the whole line.
TOP_LINE=$(ps -eo pid,comm,%cpu --sort=-%cpu | grep -v -E '(PID|systemd|sshd|monit|earlyoom|bash|ps|awk|grep|head)' | head -n 1)
TARGET_PID=$(echo "$TOP_LINE" | awk '{print $1}')
COMM=$(echo "$TOP_LINE" | awk '{print $2}')

if [ -n "$TARGET_PID" ] && [ -n "$COMM" ]; then
    logger "monit cpu-killer: Killing all '$COMM' processes (detected high CPU on PID $TARGET_PID)"
    # -x requires an exact command-name match; -9 sends SIGKILL
    pkill -9 -x "$COMM"
fi
```
Make executable and enable:
```bash
sudo chmod +x /usr/local/bin/kill-top-cpu.sh
sudo systemctl enable monit && sudo systemctl restart monit
```
## Configuration
### CPU Threshold Timing
Edit `/etc/monit/conf.d/cpu-killer` to change timing:
| Cycles | Time (at 60s interval) |
|--------|------------------------|
| 2 | 2 minutes |
| 5 | 5 minutes (default) |
| 10 | 10 minutes |
| 30 | 30 minutes |
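Changing the timing is a one-line edit to the rule. The following is a minimal sketch, assuming the rule uses the `for N cycles` form shown in the manual installation steps. `set_cpu_cycles` and `trigger_seconds` are hypothetical helpers for illustration, not part of this repository.

```bash
#!/bin/bash
# Hypothetical helper: rewrite the "for N cycles" clause in a monit rule file.
# Usage: set_cpu_cycles FILE CYCLES
set_cpu_cycles() {
    local file="$1" cycles="$2"
    # Refuse anything that is not a positive integer.
    [[ "$cycles" =~ ^[1-9][0-9]*$ ]] || { echo "cycles must be a positive integer" >&2; return 1; }
    # -E enables extended regex; -i edits the file in place (GNU sed, as shipped on Ubuntu).
    sed -E -i "s/for [0-9]+ cycles/for ${cycles} cycles/" "$file"
}

# Hypothetical helper: seconds before the rule fires.
# Usage: trigger_seconds CYCLES [DAEMON_INTERVAL_SECONDS]   (interval defaults to 60)
trigger_seconds() {
    echo $(( $1 * ${2:-60} ))
}
```

After changing the file, validate it with `sudo monit -t` and apply it with `sudo monit reload`.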
### RAM Threshold
Edit `/etc/default/earlyoom`:
| Setting | Meaning |
|---------|---------|
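| `-m 5` | Memory condition met when available RAM < 5% of total |
| `-m 10` | Memory condition met when available RAM < 10% of total |
| `-s 5` | Swap condition met when free swap < 5% of total |

earlyoom kills only when the `-m` and `-s` conditions hold at the same time.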
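The two quantities that `-m` and `-s` refer to can be read from `/proc/meminfo`. This is a minimal sketch for checking how close a host is to the thresholds. The function names are hypothetical, and treating a host without swap as 0% free swap is an assumption about how earlyoom behaves.

```bash
#!/bin/bash
# Print MemAvailable as a whole-number percentage of MemTotal.
# Reads /proc/meminfo by default; pass another path to test against a sample file.
mem_available_percent() {
    awk '/^MemTotal:/     {total = $2}
         /^MemAvailable:/ {avail = $2}
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Print SwapFree as a whole-number percentage of SwapTotal.
# Prints 0 when no swap is configured (assumption, see the note above).
swap_free_percent() {
    awk '/^SwapTotal:/ {total = $2}
         /^SwapFree:/  {free = $2}
         END { if (total > 0) printf "%d\n", free * 100 / total; else print 0 }' "${1:-/proc/meminfo}"
}
```

For example, `echo "available: $(mem_available_percent)%, swap free: $(swap_free_percent)%"` prints both values for the current host.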
### Protected Processes
**earlyoom** protects (via `--avoid`):
- init, systemd, sshd
**kill-top-cpu.sh** protects (via grep exclusion):
- systemd, sshd, monit, earlyoom, bash, ps, awk, grep, head
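To add more protected processes, edit the grep pattern in `/usr/local/bin/kill-top-cpu.sh`. The pattern is matched as a substring of each `ps` output line, so an entry such as `ps` also shields any command whose name contains it (for example `cupsd`).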
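Before editing the live script, a changed pattern can be tried against sample input. This is a minimal dry-run sketch: `first_unprotected` is a hypothetical helper, `postgres` is only an example addition, and the sample lines stand in for real `ps` output. Nothing is killed.

```bash
#!/bin/bash
# Hypothetical dry-run helper: read "PID COMM %CPU" lines (already sorted by CPU,
# highest first) on stdin and print the first line the exclusion pattern lets through.
first_unprotected() {
    local pattern="$1"
    grep -v -E "$pattern" | head -n 1
}

# The pattern from kill-top-cpu.sh with "postgres" added as an example.
PATTERN='(PID|systemd|sshd|monit|earlyoom|bash|ps|awk|grep|head|postgres)'

# Sample input standing in for: ps -eo pid,comm,%cpu --sort=-%cpu
printf '%s\n' \
    '    PID COMMAND         %CPU' \
    '   4242 postgres        98.0' \
    '   5151 stress          97.5' \
    '      1 systemd          0.1' | first_unprotected "$PATTERN"
```

With this sample the `postgres` line is skipped and the `stress` line is selected.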
## Monitoring
### Check Status
```bash
# earlyoom status
sudo systemctl status earlyoom
# monit status
sudo monit status
# Combined check
sudo ./scripts/status.sh
```
### View Logs
```bash
# earlyoom logs
journalctl -u earlyoom -f
# monit logs
tail -f /var/log/monit.log
# Kill events
journalctl | grep -i "cpu-killer\|earlyoom\|orphan-claude"
```
## Testing
### Test CPU Killer
```bash
# Install stress tool
sudo apt install -y stress
# For quick testing, temporarily set to 2 cycles in /etc/monit/conf.d/cpu-killer
# then reload: sudo monit reload
# Start CPU stress on every core so system-wide CPU exceeds 95% (will be killed after threshold)
stress --cpu "$(nproc)" --timeout 300
```
### Test RAM Killer
```bash
# Allocates 4 workers x 4G; earlyoom kills it once available memory and free swap fall below the thresholds
stress --vm 4 --vm-bytes 4G --vm-keep --timeout 120
```
### Test Orphan Claude Killer
```bash
# Run the detection script manually to see what it would find
sudo /usr/local/bin/kill-orphan-claude.sh
# Check logs for any kills
journalctl | grep orphan-claude-killer
```
## Uninstall
```bash
sudo ./uninstall.sh
```
Or manually:
```bash
sudo systemctl stop earlyoom monit
sudo systemctl disable earlyoom monit
sudo apt remove -y earlyoom monit
sudo rm -f /usr/local/bin/kill-top-cpu.sh
sudo rm -f /usr/local/bin/kill-orphan-claude.sh
sudo rm -f /etc/monit/conf.d/cpu-killer
sudo rm -f /etc/monit/conf.d/orphan-claude-killer
```
## Resource Overhead
| Component | RAM | CPU | Disk |
|-----------|-----|-----|------|
| earlyoom | ~2MB | Negligible (adaptive polling) | 77KB |
| monit | ~3-4MB | ~28ms per 60s cycle | 1MB |
| kill script | 0 (runs only when triggered) | Milliseconds | <1KB |
**Total: ~6MB RAM, essentially 0% CPU during normal operation**
## How It Works
### Orphan Claude Detection (monit)
A Claude process is considered orphaned if:
- Its controlling TTY is `?` (no terminal attached), OR
- Its parent PID is 1 (adopted by init)
1. monit runs the orphan detection script every 15 cycles (15 minutes with 60s daemon interval)
2. Script finds all `claude` processes via `pgrep -x claude`
3. For each process, checks TTY (`ps -o tty=`) and PPID (`ps -o ppid=`)
4. If orphaned, kills the process tree (children first, then parent)
5. Logs details to syslog including PID, reason, start time, CPU%, and memory%
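The orphan test in steps 2 and 3 can be sketched as follows. This is an illustrative dry run under the rules stated above, not the contents of `kill-orphan-claude.sh`. The function names are hypothetical, and the sketch only reports matches instead of killing them.

```bash
#!/bin/bash
# Return success (0) when a process counts as orphaned:
# no controlling terminal (TTY shown as "?") or re-parented to PID 1.
is_orphan() {
    local tty="$1" ppid="$2"
    [ "$tty" = "?" ] || [ "$ppid" -eq 1 ]
}

# Dry run: report orphaned claude processes without killing anything.
list_orphan_claude() {
    local pid tty ppid
    for pid in $(pgrep -x claude); do
        # A trailing "=" suppresses the ps header; xargs trims the padding.
        tty=$(ps -o tty= -p "$pid" | xargs)
        ppid=$(ps -o ppid= -p "$pid" | xargs)
        # The process may have exited between pgrep and ps.
        [ -n "$ppid" ] || continue
        if is_orphan "$tty" "$ppid"; then
            echo "orphan: PID $pid (tty=$tty ppid=$ppid)"
        fi
    done
}
```

Calling `list_orphan_claude` prints one line per match and prints nothing when no `claude` process qualifies.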
### CPU Protection (monit)
1. monit checks system CPU every 60 seconds
2. If CPU > 95% for 5 consecutive checks (5 minutes), executes kill script
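3. Kill script identifies the unprotected process with the highest `%cpu` as reported by `ps` (CPU time used divided by the time the process has been running)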
4. Kills ALL processes with that command name (handles multi-worker processes)
5. Logs the action to syslog
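The "5 consecutive checks" rule in step 2 means a single reading at or below the threshold resets the count. This is a small model of that behavior, written as an illustration rather than as monit's actual implementation. `first_trigger` is a hypothetical name, and readings are whole-number percentages.

```bash
#!/bin/bash
# Hypothetical model of "above THRESHOLD for CYCLES consecutive checks".
# Usage: first_trigger THRESHOLD CYCLES READING...
# Prints the 1-based index of the check at which the rule would fire, or "never".
first_trigger() {
    local threshold="$1" cycles="$2" streak=0 i=0 reading
    shift 2
    for reading in "$@"; do
        i=$((i + 1))
        if [ "$reading" -gt "$threshold" ]; then
            streak=$((streak + 1))
        else
            streak=0   # any reading at or below the threshold resets the count
        fi
        if [ "$streak" -ge "$cycles" ]; then
            echo "$i"
            return 0
        fi
    done
    echo "never"
}

first_trigger 95 5 96 97 50 96 97 98 99 100   # prints 8
first_trigger 95 5 96 97 98 99 50 96          # prints never
```

In the first call the dip to 50 at the third check restarts the streak, so the rule fires at the eighth check instead of the fifth.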
### RAM Protection (earlyoom)
1. earlyoom monitors available memory and free swap (adaptive polling: more frequent when memory is low)
2. When both drop below 5%, it sends SIGTERM to the process with the highest `oom_score`, normally the largest memory consumer
3. If memory keeps falling, it sends SIGKILL once both drop below 2.5% (half the SIGTERM threshold)
4. Protected processes (init, systemd, sshd) are never killed
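The two-stage escalation can be modeled as a small decision function. This is a sketch under assumptions: `earlyoom_action` is a hypothetical name, the SIGKILL threshold is taken as half the SIGTERM threshold, and the use of `<=` rather than `<` for the comparisons is a guess.

```bash
#!/bin/bash
# Hypothetical model of earlyoom's decision for one polling step.
# Usage: earlyoom_action MEM_AVAILABLE_PERCENT SWAP_FREE_PERCENT [TERM_PERCENT]
# TERM_PERCENT defaults to 5, matching "-m 5 -s 5".
earlyoom_action() {
    local avail="$1" swap="$2" term="${3:-5}"
    awk -v a="$avail" -v s="$swap" -v t="$term" 'BEGIN {
        k = t / 2                                   # SIGKILL threshold: half of SIGTERM
        if (a <= k && s <= k)      print "SIGKILL"
        else if (a <= t && s <= t) print "SIGTERM"
        else                       print "none"     # both conditions must hold to act
    }'
}

earlyoom_action 4 0    # prints SIGTERM
earlyoom_action 2 0    # prints SIGKILL
earlyoom_action 4 50   # prints none (swap is still plentiful)
```

The last call shows why a host with plenty of free swap is not touched even when available RAM is low.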
## Troubleshooting
### monit not triggering
```bash
# Check monit is running
sudo systemctl status monit
# Check config syntax
sudo monit -t
# Check monit log
tail -f /var/log/monit.log
# Verify CPU threshold is being detected
sudo monit status | grep cpu
```
### earlyoom not killing
```bash
# Check earlyoom is running
sudo systemctl status earlyoom
# Check configuration
cat /etc/default/earlyoom
# Watch real-time
journalctl -u earlyoom -f
```
### Kill script not working
```bash
# Test manually
sudo /usr/local/bin/kill-top-cpu.sh
# Check script is executable
ls -la /usr/local/bin/kill-top-cpu.sh
# Check for errors
bash -x /usr/local/bin/kill-top-cpu.sh
```
## License
MIT License - Use freely, no warranty.
## Author
Created for UpfrontOps infrastructure management.