# OPS.md — Sentinel Operations Playbook

---

## Boot Sequence

When I start a new session:

1. Read `SOUL.md`, `USER.md`, `IDENTITY.md`
2. Read `memory/YYYY-MM-DD.md` (today + yesterday)
3. If main session: read `MEMORY.md`
4. Check inbox for unread messages
5. Run fleet briefing (`fleet-cli.js briefing sentinel --brief`)
6. Quick health check: `uptime`, `free -h`, `df -h /`, all 8 services active, RateRight app responding
7. Check queue for pending tasks
8. Update status.json

Total boot: one heartbeat cycle. No wasted turns.

---

## Heartbeat Cycle (Every 3 Minutes)

### Priority Order
1. **Inbox** → process unread messages, ack completed
2. **Fleet briefing** → check for alerts, blocked agents, my tasks
3. **Queue** → claim and execute highest-priority pending task
4. **Health check** → system resources + agent services + critical services
5. **Buddy check** → is Cog alive? Wake if stalled >30min
6. **Status update** → fleet-update.js + status.json

### What I Check
- CPU load (via `uptime`)
- RAM usage (via `free -h`)
- Disk usage (via `df -h /`)
- All 8 agent systemd services (via `systemctl is-active`)
- RateRight app (via `curl localhost:3000`)
- Anything flagged in inbox or queue
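
The resource and service checks above could be scripted in one pass. This is a minimal sketch, not the script I actually run: `disk_used_pct` and `check_services` are hypothetical helper names, it uses `df -P` instead of `df -h` because the POSIX format is easier to parse, and the unit names are passed in as arguments since this playbook only names the `clawdbot-<name>` pattern.

```shell
# Print the used percentage of a filesystem as a bare integer, e.g. "42".
# Reads `df -P` output on stdin so the parsing can be tested without a real disk.
disk_used_pct() (
    awk 'NR == 2 { gsub("%", "", $5); print $5 }'
)

# Print every unit from the argument list that is not active.
# Prints nothing when all units are active.
check_services() (
    for unit in "$@"; do
        if ! systemctl is-active --quiet "$unit"; then
            echo "$unit"
        fi
    done
)
```

Usage would look like `df -P / | disk_used_pct` for the disk figure and `check_services clawdbot-<name> ...` with the real unit names filled in; any output from the second call means at least one service needs attention.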

### What I Skip
- Detailed per-directory disk breakdown (only if disk >60%)
- SSL cert check (once per day is enough)
- Restart counts (only when investigating issues)
- External URL checks (only during daily audit)

---

## Idle Behavior

When inbox is empty, queue is empty, and no alerts:

1. **Rotate through daily checks** (one per idle heartbeat):
   - Journal log size — rotate if >2GB
   - `/tmp` cleanup — remove stale pip/build artifacts
   - Port audit — verify only expected ports are externally bound
   - SSL cert expiry check
   - Per-agent restart count in last 24h
2. **Check buddy (Cog)** — heartbeat freshness, wake if needed
3. **Review fleet-state.json** — any stalled agents I should wake?
4. **Memory maintenance** — every few days, distill daily logs into MEMORY.md

If truly nothing to do: `HEARTBEAT_OK`.
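
The "one per idle heartbeat" rotation in step 1 amounts to a round-robin over five checks. A minimal sketch, assuming I keep a counter of idle heartbeats somewhere; the function name `idle_check_for` and the short check labels are hypothetical.

```shell
# Map an idle-heartbeat counter to one of the five daily checks (round-robin).
idle_check_for() (
    case $(( $1 % 5 )) in
        0) echo "journal-size" ;;
        1) echo "tmp-cleanup" ;;
        2) echo "port-audit" ;;
        3) echo "ssl-expiry" ;;
        4) echo "restart-counts" ;;
    esac
)
```

Calling `idle_check_for 7` selects the port audit, and the cycle wraps back to the journal check on every fifth idle heartbeat.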

---

## Health Signal

The fleet knows I'm alive through:

1. **status.json** — written at END of every heartbeat
   - Path: `/home/ccuser/sentinel/status.json`
   - Contains: status, current_task, last_updated, last_heartbeat
2. **fleet-state.json** — updated via `fleet-update.js sentinel`
   - Path: `/home/ccuser/shared/fleet-state.json`
   - Rivet and Cog monitor this for stalls
3. **Heartbeat cadence** — 3-minute intervals
   - >6 minutes without an update (two missed heartbeats) = something's wrong with me
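
A minimal sketch of the status write and the 6-minute stall rule. The four field names come from the list above; the ISO-8601 UTC timestamp format, the helper names `write_status` and `is_stalled`, and the write-then-rename approach are my assumptions, not a description of the existing tooling.

```shell
# Write status.json atomically: write a temp file, then rename it over the
# target, so a reader never sees a half-written file.
# Assumes the status and task values contain no quotes or backslashes.
write_status() (
    path=$1 status=$2 task=$3
    now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
    tmp="$path.tmp.$$"
    printf '{"status":"%s","current_task":"%s","last_updated":"%s","last_heartbeat":"%s"}\n' \
        "$status" "$task" "$now" "$now" > "$tmp" && mv "$tmp" "$path"
)

# Succeed (exit 0) when the last update is more than 360 seconds old.
# Both arguments are epoch seconds: last update time, then current time.
is_stalled() (
    [ $(( $2 - $1 )) -gt 360 ]
)
```

For example, `write_status /home/ccuser/sentinel/status.json ok idle` would run at the end of a heartbeat, and a monitor could call `is_stalled` with the last update time and `$(date +%s)`.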

---

## Self-Recovery

### My Own Stall Detection
If I notice my own context is growing large or responses are slow:
1. Write critical state to `memory/YYYY-MM-DD.md`
2. Update status.json with `"status": "recovering"`
3. Keep heartbeat responses minimal to conserve context
4. If at 85%+ context: summarize and prepare for session reset

### Session Reset
If my session gets cleared/restarted:
1. Boot sequence kicks in automatically (see Boot Sequence above)
2. Daily log tells me what happened recently
3. MEMORY.md tells me long-term lessons
4. status.json tells other agents I'm back

---

## Escalation Path

| Severity | Action |
|----------|--------|
| **INFO** | Log to daily memory file. No notification. |
| **WARNING** | Log + update fleet-state. Cog will see it. |
| **ERROR** | Log + alert fleet (`fleet-cli.js alert`). Rivet investigates. |
| **CRITICAL** | Log + alert fleet + message Michael directly. Systems at risk. |
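
The table can be read as a lookup from severity to a list of actions. A minimal sketch that only prints the planned actions; the function name `actions_for` and the action labels are hypothetical, and wiring them to `fleet-cli.js alert` or a direct message is left out because this playbook does not spell out those invocations.

```shell
# Print the actions for a severity level, one per line, following the table above.
# Unknown levels print an error to stderr and return status 2.
actions_for() (
    case $1 in
        INFO)     echo "log" ;;
        WARNING)  printf 'log\nupdate-fleet-state\n' ;;
        ERROR)    printf 'log\nalert-fleet\n' ;;
        CRITICAL) printf 'log\nalert-fleet\nmessage-michael\n' ;;
        *)        echo "unknown severity: $1" >&2; exit 2 ;;
    esac
)
```

Each level keeps every action of the level before it except WARNING's fleet-state update, which ERROR replaces with an explicit alert.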

### When to Escalate to Michael
- Multiple agents down simultaneously and auto-recovery failing
- Disk <10% free after cleanup attempts
- RAM <5% available (OOM imminent)
- RateRight app down >15 minutes and I can't fix it
- Security incident (exposed secrets, unauthorized access)
- SSL cert expiring <7 days and auto-renewal isn't working

### When NOT to Escalate
- Single agent restart (I handle it)
- Temp file cleanup (I handle it)
- Journal rotation (I handle it)
- Brief resource spikes that self-resolve
- Anything I've already fixed

---

## Failure Modes

### 1. Agent Crash Loop
**Symptoms:** Agent restarting every few seconds, journalctl shows repeated errors.

**Automatic Response:**
1. Check `journalctl -u clawdbot-<name> --since "5 min ago"` for error pattern
2. If "Unknown model" or "mapOptionsForApi" → config/session problem, NOT a restart fix
3. If config issue → check model exists in agent's config providers
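4. If session issue → archive old sessions (create the archive directory first if it is missing): `mkdir -p /root/.clawdbot-<name>/agents/main/sessions/archive && mv /root/.clawdbot-<name>/agents/main/sessions/*.jsonl /root/.clawdbot-<name>/agents/main/sessions/archive/`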
5. Then restart service

**Lesson:** Never restart without checking logs first (Feb 17 incident: 181 crash loops from blind restarts).
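
The log check and the session archive step can be sketched as two small helpers. These are illustrative only: `classify_crash` and `archive_sessions` are hypothetical names, the two patterns are the ones named in step 2, and anything else is reported as unknown and still needs a human read of the logs.

```shell
# Read journal text on stdin and print a coarse diagnosis.
# Matches only the two patterns this playbook names; anything else is "unknown".
classify_crash() (
    if grep -q -e 'Unknown model' -e 'mapOptionsForApi'; then
        echo "config-or-session"
    else
        echo "unknown"
    fi
)

# Move every *.jsonl file in a sessions directory into its archive/ subdirectory,
# creating archive/ if it does not exist. Other files are left alone.
archive_sessions() (
    dir=$1
    mkdir -p "$dir/archive" || exit 1
    for f in "$dir"/*.jsonl; do
        [ -e "$f" ] || continue   # glob matched nothing
        mv "$f" "$dir/archive/"
    done
)
```

A typical call would be `journalctl -u clawdbot-<name> --since "5 min ago" --no-pager | classify_crash`, with a restart considered only after the result has been looked at.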

### 2. Disk Space Exhaustion
**Symptoms:** Disk >80%, services may fail to write.

**Automatic Response:**
1. Check `/tmp/` first — stale pip-unpack dirs are the #1 offender (seen 16GB+)
2. `rm -rf /tmp/pip-unpack-* /tmp/pip-build-env-* /tmp/pip-install-*`
3. Check journal size: `journalctl --disk-usage` → `sudo journalctl --vacuum-size=1G` if >2GB
4. Check `/home/ccuser/*/` for unexpected growth
5. If still >80% after cleanup → escalate to Michael with breakdown
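
Step 2 removes every matching pip directory regardless of age. A more conservative variant, shown here as a sketch and not as what I currently run, lists only directories older than a given number of minutes so they can be reviewed before deletion. The helper name `stale_pip_dirs` is hypothetical, and `-mindepth`, `-maxdepth`, and `-mmin` are `find` extensions available in GNU and BSD find but not guaranteed by POSIX.

```shell
# List pip scratch directories directly under $1 that were last modified
# more than $2 minutes ago. Prints one path per line; deletes nothing.
stale_pip_dirs() (
    find "$1" -mindepth 1 -maxdepth 1 -type d \
        \( -name 'pip-unpack-*' -o -name 'pip-build-env-*' -o -name 'pip-install-*' \) \
        -mmin "+$2"
)
```

For example, `stale_pip_dirs /tmp 1440` lists candidates older than a day, which avoids deleting a directory that a pip install is still using.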

### 3. Memory Pressure (OOM Risk)
**Symptoms:** RAM >85%, swap usage climbing, processes getting killed.

**Automatic Response:**
1. `ps aux --sort=-%mem | head -10` — identify the hog
2. If a single agent is using >2GB → it's likely context overflow. Archive its sessions (see Agent Crash Loop, step 4), then restart it.
3. If general pressure → check for runaway processes, zombie children
4. If swap >50% used → something is wrong. Identify and kill or restart.
5. If <500MB available → CRITICAL escalation.
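
The 500MB rule in step 5 can be checked from `/proc/meminfo` without parsing the human-readable `free -h` output. A minimal sketch under that assumption; `mem_available_mb` and `mem_severity` are hypothetical names, and the second function covers only the step 5 threshold, not the swap or per-agent rules.

```shell
# Print available memory in whole MB from /proc/meminfo-formatted text on stdin.
# The MemAvailable value in /proc/meminfo is reported in kB.
mem_available_mb() (
    awk '$1 == "MemAvailable:" { print int($2 / 1024) }'
)

# Print "CRITICAL" when fewer than 500 MB are available, otherwise "OK".
# "OK" only means this one rule did not fire.
mem_severity() (
    if [ "$1" -lt 500 ]; then echo "CRITICAL"; else echo "OK"; fi
)
```

On a live system this would be `mem_severity "$(mem_available_mb < /proc/meminfo)"`.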

### 4. RateRight App Down
**Symptoms:** `curl localhost:3000` returns non-200 or times out.

**Automatic Response:**
1. Check `systemctl status rateright-app`
2. If stopped → `sudo systemctl restart rateright-app`
3. If running but unresponsive → check logs: `journalctl -u rateright-app --since "5 min ago"`
4. If port conflict → `ss -tlnp | grep ':3000 '` to see which process holds the port
5. If still down after restart → escalate to Builder (code issue) and Michael (customer impact)
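
A bare `curl localhost:3000` prints the response body and has no time limit, so the "non-200 or times out" symptom needs a few flags to detect reliably. A minimal sketch, assuming a 5-second default timeout is acceptable; `http_status` and `is_healthy_code` are hypothetical helper names.

```shell
# Print the HTTP status code for URL $1, discarding the body.
# curl prints "000" when the connection fails or the timeout ($2, default 5s) expires.
http_status() (
    curl -s -o /dev/null -w '%{http_code}' --max-time "${2:-5}" "$1"
)

# Succeed only for a 200 response, matching the symptom definition above.
is_healthy_code() (
    [ "$1" = "200" ]
)
```

A heartbeat could then run `is_healthy_code "$(http_status http://localhost:3000)" || echo "RateRight down"` and continue with step 1 when the check fails.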

---

## What's Not Built Yet

Being honest about gaps:

- **Automated backup system** — No VPS snapshots, no DB backup verification, no git push automation. Queue task sentinel-004 covers discovery.
- **Proactive alerting** — I check on heartbeat intervals. No push-based monitoring (no Prometheus, no uptime service). A problem between heartbeats goes unnoticed for up to 3 minutes.
- **Log aggregation** — Each agent's logs are in journald. No centralized view. I grep manually.
- **SSL auto-renewal verification** — Certs are Let's Encrypt (presumably auto-renew). I haven't verified the renewal mechanism works.
- **Capacity planning** — No trending of resource usage over time. I check point-in-time snapshots.

These are real gaps. I'll address them in priority order as I get bandwidth.
