# Runaway Process Killer

Automatic protection against runaway CPU and RAM processes on Ubuntu servers. Prevents hosting-provider throttling by killing processes that pin the CPU at 95% or more for extended periods or that exhaust available memory.

## Overview

| Protection | Tool | Trigger | Action |
|------------|------|---------|--------|
| CPU | monit | System CPU above 95% for 5 minutes | Kill all processes that share the top CPU consumer's command name |
| RAM | earlyoom | Available memory and free swap both below 5% | Kill the process with the highest `oom_score` (usually the largest memory consumer) |
| Orphan Claude | monit | Every 15 minutes | Kill Claude processes with no TTY or PPID=1 |

## Requirements

- Ubuntu 20.04+ (tested on 24.04 LTS)
- Root access
- ~6MB RAM overhead total

## Quick Install

```bash
curl -fsSL https://git.upfrontops.cloud/upfrontops/runaway-process-killer/raw/branch/main/install.sh | sudo bash
```

Or clone and run:

```bash
git clone https://git.upfrontops.cloud/upfrontops/runaway-process-killer.git
cd runaway-process-killer
sudo ./install.sh
```

## Manual Installation

### 1. Install earlyoom (RAM Protection)

```bash
sudo apt update && sudo apt install -y earlyoom
```

Edit `/etc/default/earlyoom`:
```bash
EARLYOOM_ARGS="-m 5 -s 5 --avoid '(^|/)(init|systemd|sshd)$' -r 60"
```

Enable and start:
```bash
sudo systemctl enable earlyoom && sudo systemctl restart earlyoom
```

### 2. Install monit (CPU Protection)

```bash
sudo apt install -y monit
```

Edit `/etc/monit/monitrc` and set the daemon interval to 60 seconds:
```
set daemon 60
```

Enable the HTTP interface (required for `monit status`):
```
set httpd port 2812 and
    use address localhost
    allow localhost
    allow admin:monit
```

Create `/etc/monit/conf.d/cpu-killer`:
```
check system $HOST
    if cpu usage > 95% for 5 cycles then exec "/usr/local/bin/kill-top-cpu.sh"
```

Create `/usr/local/bin/kill-top-cpu.sh`:
```bash
#!/bin/bash
# Kill every process that shares a command name with the top CPU consumer
# (excluding critical ones).

# Find the top CPU consumer, skipping protected and transient processes.
# -w matches whole words only, so "ps" does not also exclude e.g. "cupsd".
# Note: ps reports %cpu averaged over each process's lifetime.
TOP_LINE=$(ps -eo pid,comm,%cpu --sort=-%cpu | grep -v -w -E '(PID|systemd|sshd|monit|earlyoom|bash|ps|awk|grep|head)' | head -1)
TARGET_PID=$(echo "$TOP_LINE" | awk '{print $1}')
COMM=$(echo "$TOP_LINE" | awk '{print $2}')

if [ -n "$TARGET_PID" ] && [ -n "$COMM" ]; then
    logger "monit cpu-killer: Killing all '$COMM' processes (detected high CPU on PID $TARGET_PID)"
    # Kill all processes with this exact command name
    pkill -9 -x "$COMM"
fi
```
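
Before letting monit run the script for real, it can help to see which process it would pick. The following is a minimal dry-run sketch, not part of the installed tooling: `pick_target` is a hypothetical helper that applies the script's exclusion list (matched as whole words here) to `pid comm %cpu` lines read from stdin and prints the first survivor without killing anything. The sample process list is made up for illustration.

```bash
#!/bin/bash
# Dry-run sketch: show which process kill-top-cpu.sh would target.
# pick_target is a hypothetical helper name, not part of the installed scripts.

PROTECTED='(PID|systemd|sshd|monit|earlyoom|bash|ps|awk|grep|head)'

# Reads "pid comm %cpu" lines (sorted by CPU, highest first) from stdin and
# prints "pid comm" for the first line that is not protected.
pick_target() {
    grep -v -w -E "$PROTECTED" | head -n 1 | awk '{print $1, $2}'
}

# Illustrative input only; these PIDs and values are invented.
pick_target <<'EOF'
    1 systemd   0.1
 4242 stress   99.0
 4243 stress   98.7
  812 sshd      0.0
EOF
# prints: 4242 stress
```

Against the live system, the same helper can be fed with `ps -eo pid,comm,%cpu --sort=-%cpu | pick_target`.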

Make executable and enable:
```bash
sudo chmod +x /usr/local/bin/kill-top-cpu.sh
sudo systemctl enable monit && sudo systemctl restart monit
```

## Configuration

### CPU Threshold Timing

Edit `/etc/monit/conf.d/cpu-killer` to change timing:

| Cycles | Time (at 60s interval) |
|--------|------------------------|
| 2 | 2 minutes |
| 5 | 5 minutes (default) |
| 10 | 10 minutes |
| 30 | 30 minutes |
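
Changing the window comes down to editing the number in the `for N cycles` clause; the resulting delay is cycles multiplied by the daemon interval. A minimal sketch, assuming GNU `sed` (the Ubuntu default) and a hypothetical helper named `set_cpu_cycles`, demonstrated on a scratch file rather than the live config:

```bash
#!/bin/bash
# Sketch: rewrite the "for N cycles" clause in a monit check file.
# set_cpu_cycles is a hypothetical helper; it assumes GNU sed (-i, -E).

# Usage: set_cpu_cycles <config-file> <cycles> [daemon-interval-seconds]
set_cpu_cycles() {
    local file=$1 cycles=$2 interval=${3:-60}
    sed -E -i "s/for [0-9]+ cycles/for ${cycles} cycles/" "$file"
    echo "Trigger window: $(( cycles * interval )) seconds"
}

# Demonstrate on a scratch copy of the check definition.
demo=$(mktemp)
printf 'check system $HOST\n    if cpu usage > 95%% for 5 cycles then exec "/usr/local/bin/kill-top-cpu.sh"\n' > "$demo"
set_cpu_cycles "$demo" 10   # prints: Trigger window: 600 seconds
cat "$demo"
rm -f "$demo"
```

To apply a change for real, run the same substitution as root against `/etc/monit/conf.d/cpu-killer` and then `sudo monit reload`.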

### RAM Threshold

Edit `/etc/default/earlyoom`. earlyoom acts only when available memory and free swap are both below their thresholds:

| Setting | Meaning |
|---------|---------|
| `-m 5` | Available RAM threshold: below 5% |
| `-m 10` | Available RAM threshold: below 10% |
| `-s 5` | Free swap threshold: below 5% |
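
To see how close a server is to these thresholds, the two percentages can be derived from `/proc/meminfo`. The sketch below is an approximation of that arithmetic, not earlyoom's own code: `mem_percentages` is a hypothetical helper, the sample figures are invented, and treating a machine without swap as 0% free swap is an assumption about earlyoom's behavior.

```bash
#!/bin/bash
# Sketch: compute the percentages that the -m and -s thresholds refer to,
# from /proc/meminfo-style input on stdin.
# mem_percentages is a hypothetical helper, not an earlyoom command.

# Prints "<available-memory-%> <free-swap-%>" as integers (rounded down).
mem_percentages() {
    awk '
        /^MemTotal:/     { mem_total  = $2 }
        /^MemAvailable:/ { mem_avail  = $2 }
        /^SwapTotal:/    { swap_total = $2 }
        /^SwapFree:/     { swap_free  = $2 }
        END {
            mem_pct  = (mem_total  > 0) ? int(100 * mem_avail / mem_total)  : 0
            # Assumption: a machine without swap counts as 0% free swap.
            swap_pct = (swap_total > 0) ? int(100 * swap_free / swap_total) : 0
            print mem_pct, swap_pct
        }
    '
}

# Illustrative figures only (values in kB).
mem_percentages <<'EOF'
MemTotal:        8000000 kB
MemAvailable:     320000 kB
SwapTotal:       2000000 kB
SwapFree:          60000 kB
EOF
# prints: 4 3
```

On a live system, run `mem_percentages < /proc/meminfo`.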

### Protected Processes

**earlyoom** protects (via `--avoid`):
- init, systemd, sshd

**kill-top-cpu.sh** protects (via grep exclusion):
- systemd, sshd, monit, earlyoom, bash, ps, awk, grep, head

To add more protected processes, edit the grep pattern in `/usr/local/bin/kill-top-cpu.sh`.
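
Hand-editing a long alternation is easy to get wrong, so one option is to generate the pattern from a plain list of names. This is a sketch only: `build_exclude_pattern` is a hypothetical helper, and `postgres` and `nginx` are example additions rather than names the project protects by default.

```bash
#!/bin/bash
# Sketch: build the grep exclusion regex from a list of process names.
# build_exclude_pattern is a hypothetical helper.

# Usage: build_exclude_pattern name1 name2 ...
# Prints an extended regex such as (PID|systemd|sshd).
build_exclude_pattern() {
    local IFS='|'
    # "$*" joins the arguments with the first character of IFS.
    printf '(%s)\n' "$*"
}

# postgres and nginx are illustrative additions to the default list.
PROTECTED=$(build_exclude_pattern PID systemd sshd monit earlyoom bash ps awk grep head postgres nginx)
echo "$PROTECTED"
# prints: (PID|systemd|sshd|monit|earlyoom|bash|ps|awk|grep|head|postgres|nginx)
```

The resulting string can be pasted into the `grep -E` call in the kill script.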

## Monitoring

### Check Status

```bash
# earlyoom status
sudo systemctl status earlyoom

# monit status
sudo monit status

# Combined check
sudo ./scripts/status.sh
```

### View Logs

```bash
# earlyoom logs
journalctl -u earlyoom -f

# monit logs
tail -f /var/log/monit.log

# Kill events
journalctl | grep -i "cpu-killer\|earlyoom\|orphan-claude"
```

## Testing

### Test CPU Killer

```bash
# Install stress tool
sudo apt install -y stress

# For quick testing, temporarily set to 2 cycles in /etc/monit/conf.d/cpu-killer
# then reload: sudo monit reload

# Start CPU stress (will be killed after threshold)
stress --cpu 4 --timeout 300
```

### Test RAM Killer

```bash
# This will be killed quickly by earlyoom
stress --vm 4 --vm-bytes 4G --vm-keep --timeout 120
```

### Test Orphan Claude Killer

```bash
# Run the detection script manually (it kills any orphaned Claude processes it finds)
sudo /usr/local/bin/kill-orphan-claude.sh

# Check logs for any kills
journalctl | grep orphan-claude-killer
```

## Uninstall

```bash
sudo ./uninstall.sh
```

Or manually:

```bash
sudo systemctl stop earlyoom monit
sudo systemctl disable earlyoom monit
sudo apt remove -y earlyoom monit
sudo rm -f /usr/local/bin/kill-top-cpu.sh
sudo rm -f /usr/local/bin/kill-orphan-claude.sh
sudo rm -f /etc/monit/conf.d/cpu-killer
sudo rm -f /etc/monit/conf.d/orphan-claude-killer
```

## Resource Overhead

| Component | RAM | CPU | Disk |
|-----------|-----|-----|------|
| earlyoom | ~2MB | Negligible (adaptive polling) | 77KB |
| monit | ~3-4MB | ~28ms per 60s cycle | 1MB |
| kill script | 0 (runs only when triggered) | Milliseconds | <1KB |

**Total: ~6MB RAM, essentially 0% CPU during normal operation**

## How It Works

### Orphan Claude Detection (monit)

A Claude process is considered orphaned if:
- Its controlling TTY is `?` (no terminal attached), OR
- Its parent PID is 1 (adopted by init)

1. monit runs the orphan detection script every 15 cycles (15 minutes with a 60s daemon interval)
2. The script finds all `claude` processes via `pgrep -x claude`
3. For each process, it checks the TTY (`ps -o tty=`) and PPID (`ps -o ppid=`)
4. If orphaned, it kills the process tree (children first, then parent)
5. It logs details to syslog, including PID, reason, start time, CPU%, and memory%
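
The orphan test in the steps above can be sketched as follows. The real `kill-orphan-claude.sh` is not reproduced in this README and may differ; `orphan_reason` and `report_orphans` are hypothetical helpers, and this version only reports what it finds instead of killing anything.

```bash
#!/bin/bash
# Sketch of the orphan test; not the installed kill-orphan-claude.sh.
# orphan_reason and report_orphans are hypothetical helper names.

# Usage: orphan_reason <tty> <ppid>
# Prints the reason and returns 0 if the process counts as orphaned,
# returns 1 (printing nothing) otherwise.
orphan_reason() {
    local tty=$1 ppid=$2
    if [ "$tty" = "?" ]; then
        echo "no controlling TTY"
    elif [ "$ppid" = "1" ]; then
        echo "parent is PID 1"
    else
        return 1
    fi
}

# Dry run over live processes: report only, never kill.
report_orphans() {
    local pid tty ppid reason
    for pid in $(pgrep -x claude); do
        tty=$(ps -o tty= -p "$pid" | tr -d ' ')
        ppid=$(ps -o ppid= -p "$pid" | tr -d ' ')
        if reason=$(orphan_reason "$tty" "$ppid"); then
            echo "PID $pid would be killed: $reason"
        fi
    done
}

orphan_reason "?" 4321      # prints: no controlling TTY
orphan_reason "pts/0" 1     # prints: parent is PID 1
```

Calling `report_orphans` on a server lists the candidates without touching them, which is a low-risk way to check the criteria before enabling the monit job.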

### CPU Protection (monit)

1. monit checks system CPU every 60 seconds
2. If CPU stays above 95% for 5 consecutive checks (5 minutes), it executes the kill script
3. The kill script identifies the process using the most CPU
4. It kills ALL processes with that command name (handles multi-worker processes)
5. It logs the action to syslog
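
The "consecutive checks" rule means a single calm cycle resets the count. The sketch below imitates that rule in plain bash to make the timing concrete; it is not monit's implementation, `first_trigger_cycle` is a hypothetical helper, and the sample readings are invented.

```bash
#!/bin/bash
# Sketch of "above the threshold for N consecutive cycles".
# Imitates the rule only; this is not monit's implementation.
# first_trigger_cycle is a hypothetical helper.

# Usage: first_trigger_cycle <threshold> <cycles> <sample>...
# Samples are integer CPU percentages, one per daemon cycle.
# Prints the 1-based cycle at which the rule first fires, or "never".
first_trigger_cycle() {
    local threshold=$1 needed=$2
    shift 2
    local streak=0 cycle=0 sample
    for sample in "$@"; do
        cycle=$(( cycle + 1 ))
        if (( sample > threshold )); then
            streak=$(( streak + 1 ))
        else
            streak=0   # one calm cycle resets the count
        fi
        if (( streak >= needed )); then
            echo "$cycle"
            return 0
        fi
    done
    echo "never"
}

# The dip to 40% at cycle 3 restarts the count, so the rule fires at cycle 8.
first_trigger_cycle 95 5 99 98 40 97 99 100 98 96
# prints: 8
```

With a 60-second daemon interval, cycle 8 corresponds to roughly eight minutes after the first high reading.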

### RAM Protection (earlyoom)

1. earlyoom monitors available memory and free swap (adaptive polling: more frequent when memory is low)
2. When both drop below 5%, it sends SIGTERM to the process with the highest `oom_score` (usually the largest memory consumer)
3. If the process doesn't exit and both values fall below 2.5%, it sends SIGKILL
4. Protected processes (init, systemd, sshd) are never killed
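
The two-stage escalation can be written out as a small decision function. This is a sketch of the documented behavior for `-m 5 -s 5`, not earlyoom's source: `earlyoom_action` is a hypothetical helper, and it assumes that memory and swap must both be below the limit, that the SIGKILL limit is half the SIGTERM limit, and that "below" means strictly less than.

```bash
#!/bin/bash
# Sketch of the two-stage decision for "-m 5 -s 5".
# Not earlyoom's source code; earlyoom_action is a hypothetical helper.
# Assumptions: memory AND swap must both be below the limit, the SIGKILL
# limit is half the SIGTERM limit, and the comparison is strict.

# Usage: earlyoom_action <available-mem-%> <free-swap-%> [term-limit-%]
earlyoom_action() {
    awk -v mem="$1" -v swap="$2" -v term="${3:-5}" 'BEGIN {
        kill_limit = term / 2
        if (mem < kill_limit && swap < kill_limit) print "SIGKILL"
        else if (mem < term && swap < term)        print "SIGTERM"
        else                                        print "none"
    }'
}

earlyoom_action 4 3     # prints: SIGTERM (both below 5%, not both below 2.5%)
earlyoom_action 2 1     # prints: SIGKILL
earlyoom_action 4 50    # prints: none (plenty of free swap left)
```

The last call illustrates why a server with ample free swap may not see earlyoom act even when RAM is nearly full.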

## Troubleshooting

### monit not triggering

```bash
# Check monit is running
sudo systemctl status monit

# Check config syntax
sudo monit -t

# Check monit log
tail -f /var/log/monit.log

# Verify CPU threshold is being detected
sudo monit status | grep cpu
```

### earlyoom not killing

```bash
# Check earlyoom is running
sudo systemctl status earlyoom

# Check configuration
cat /etc/default/earlyoom

# Watch real-time
journalctl -u earlyoom -f
```

### Kill script not working

```bash
# Test manually
sudo /usr/local/bin/kill-top-cpu.sh

# Check script is executable
ls -la /usr/local/bin/kill-top-cpu.sh

# Check for errors
bash -x /usr/local/bin/kill-top-cpu.sh
```

## License

MIT License - Use freely, no warranty.

## Author

Created for UpfrontOps infrastructure management.