# LESSONS.md — Shared Lessons Learned
*Read this before doing anything risky. Updated by all agents.*
*Last updated: 2026-02-17 05:35 AEDT*

---

## 🚨 CRITICAL — Never Repeat These

### 1. NEVER use Kimi/cheap models for code changes
**Date:** 2026-02-07
**What happened:** Kimi agents wrote TypeScript, React components, API routes. Five build-breaking bugs in one session.
**Rule:** ALL code goes through Builder (Claude Code on port 18790). Kimi = research/specs/plans ONLY.

### 2. NEVER overwrite inbox files
**Date:** 2026-02-16
**What happened:** Rivet overwrote `RIVET-INBOX.md` while "processing" it, destroying Builder's updates. Builder had completed 12 tasks that went unacknowledged.
**Rule:** Append-only to inbox files. Never overwrite. Never edit or delete entries another agent wrote.

### 3. NEVER send external comms without Michael's approval
**Rule:** Draft emails, SMS, tweets, LinkedIn posts — but NEVER send without explicit "yes" from Michael.

### 4. NEVER trust cached data for morning briefs
**Date:** 2026-02-16
**What happened:** Morning brief reported completed tasks as still pending. It was built from data cached hours earlier.
**Rule:** Always verify live (API check, file read with timestamp) before reporting status.
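
One way to enforce the timestamp part of this rule is a freshness check before any file is used for a status report. A minimal sketch, assuming GNU `stat` (Linux); the function name `is_fresh` and the 10-minute threshold are illustrative, not part of the fleet tooling:

```bash
# is_fresh FILE MAX_AGE_SECONDS
# Exits 0 only if FILE exists and was modified within MAX_AGE_SECONDS.
is_fresh() {
  local file="$1" max_age="$2" now mtime
  [ -f "$file" ] || return 1
  now=$(date +%s)
  mtime=$(stat -c %Y "$file") || return 1   # GNU stat; BSD/macOS uses `stat -f %m`
  [ $(( now - mtime )) -le "$max_age" ]
}

# Example (hypothetical file name): refuse to report from anything older than 10 minutes.
# is_fresh status-cache.json 600 || echo "STALE: re-check live before reporting"
```

A fresh mtime only shows the file was written recently. For task status, a live API check is still the stronger signal.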

### 5. API keys NEVER in committed files
**Date:** 2026-02-15
**What happened:** API keys exposed in transcript files. Had to scrub logs.
**Rule:** Keys go in `.env` files or `/root/.clawdbot/secrets.json`. Never in markdown, never in git.
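
A rough scan can catch obvious leaks before a file is committed or shared. A minimal sketch: the function name `scan_for_keys` and both patterns are assumptions for illustration and will miss many key formats, so a clean result is not proof that a file is safe.

```bash
# scan_for_keys FILE...
# Prints file:line:text for anything that looks like a secret.
# Exits 0 if something was found, 1 if nothing matched.
scan_for_keys() {
  grep -nHiE \
    -e 'sk-[A-Za-z0-9_-]{20,}' \
    -e '(api[_-]?key|secret|token)[^A-Za-z0-9]{1,4}[A-Za-z0-9_-]{16,}' \
    -- "$@"
}

# Example (hypothetical file names):
# scan_for_keys notes.md transcript.md && echo "Possible key found. Do not commit."
```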

---

## ⚠️ IMPORTANT — Keep These in Mind

### Model Selection Matters
- **Opus/Sonnet** for dirty data (external content, web scraping, email processing) — most resistant to prompt injection
- **Kimi** for routine tasks (research, summaries, formatting) — cheap, good enough
- **Builder (Claude Code)** for ALL code — has full codebase context, catches what others miss

### File-Based Communication Protocol
- `BUILDER-INBOX.md` — Rivet writes TO Builder
- `RIVET-INBOX.md` — Builder writes TO Rivet
- Append-only, timestamp everything, ACK in your own file
- Gateway bridge = notification, not delivery mechanism
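
The append-only and timestamp rules can be wrapped in one helper so nobody writes to an inbox with `>` by accident. A minimal sketch: the function name `inbox_append` and the entry format are assumptions, not an agreed protocol format.

```bash
# inbox_append INBOX_FILE SENDER MESSAGE
# Appends one timestamped line. Uses >> only, so existing entries are never touched.
inbox_append() {
  local inbox="$1" sender="$2" message="$3"
  printf '[%s] %s: %s\n' "$(date '+%Y-%m-%d %H:%M %Z')" "$sender" "$message" >> "$inbox"
}

# Rivet sends work to Builder:
# inbox_append BUILDER-INBOX.md Rivet "New task: <description>"
# Rivet acknowledges in its own file (reading "your own file" as your own inbox):
# inbox_append RIVET-INBOX.md Rivet "ACK: <entry being acknowledged>"
```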

### Agent Bridge (HTTP, not WebSocket)
- Old WebSocket bridge was fragile and complex
- New HTTP bridge: `node /home/ccuser/the-50-dollar-app/scripts/agent-bridge.js <agent> <command>`
- All 8 agents in registry with correct ports and tokens

### Michael's Schedule — Respect It
- 5:30 AM - 6 PM: On site. Can read on breaks. Don't spam.
- 7 PM - 8:30 PM: THE WINDOW. Every message saves him time.
- After 8:30 PM: Winding down. Don't message. Queue for morning.
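
The schedule can be turned into a gate that agents call before messaging. A minimal sketch: the function name `message_policy` and its three output words are illustrative. The schedule above does not cover 6 PM to 7 PM or before 5:30 AM, so this sketch assumes those gaps also mean "queue".

```bash
# message_policy HOUR MINUTE   (24-hour local time, plain integers, no leading zeros)
# Prints one of: send | sparingly | queue
message_policy() {
  local hour="$1" minute="$2" now
  now=$(( hour * 60 + minute ))        # minutes since midnight
  if [ "$now" -ge 1140 ] && [ "$now" -lt 1230 ]; then
    echo send                          # 7 PM - 8:30 PM: the window
  elif [ "$now" -ge 330 ] && [ "$now" -lt 1080 ]; then
    echo sparingly                     # 5:30 AM - 6 PM: on site, reads on breaks
  else
    echo queue                         # after 8:30 PM, plus the uncovered gaps (assumption)
  fi
}

# Example with GNU date (%-H and %-M drop the leading zero):
# message_policy "$(date +%-H)" "$(date +%-M)"
```

The host clock must be in Michael's timezone for the result to mean anything.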

### Token Costs
- Track spending. Kimi is nearly free. Sonnet is cheap. Opus is expensive.
- Batch operations when possible
- Don't waste tokens on HEARTBEAT_OK when there's work to do

---

## 💡 What Works Well

- **Overnight autonomous work** — Builder + Rivet can ship features while Michael sleeps
- **Voice notes for Michael** — He listens while driving, can't read text on site
- **Sub-agents for parallel research** — Spin up 3-5 Kimi agents simultaneously
- **Fleet health check script** — `bash /home/ccuser/the-50-dollar-app/scripts/fleet-health-check.sh`
- **Pre-call intel briefs** — Growth Engine data + AI analysis before outreach calls

### Fleet Coordination ≠ Fleet Monitoring
**Date:** 2026-02-17
**What happened:** Rivet treated the Chief of Staff role as a watchman: checking whether agents were alive and restarting stalled ones. All 6 agent task queues were EMPTY for hours. Agents ran heartbeat routines in isolation with zero cross-communication. 4 agents stalled for 1-3 hours with nobody driving real work.
**Rule:** Chief of Staff actively drives work. Push tasks, connect agents' outputs, force cross-referencing. Empty queues = failure. Watching ≠ leading.

### Model Override Without Provider = Silent Death Loop
**Date:** 2026-02-17
**What happened:** Susan and Harper sessions were overridden to use `deepseek/deepseek-chat`, but their configs only had the Moonshot provider. Every heartbeat/wake attempt failed with "Unknown model: deepseek/deepseek-chat": 21 errors in one hour. The agent appears "active" in systemd but can't process anything. Restarts don't help because the session override persists.
**Rule:** Before setting a session model override, verify the target model's provider exists in that agent's config. When adding a new model to the fleet, add the provider to ALL agent configs, not just some.
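
A pre-flight check along these lines can run before any override is set. A minimal sketch: the function name `provider_in_config` is illustrative, and because the config schema is not documented here it only does a crude text match for the quoted provider name. Treat a pass as "probably present", not as validation.

```bash
# provider_in_config CONFIG_FILE MODEL
# MODEL looks like "deepseek/deepseek-chat"; the part before the slash is
# taken as the provider. Exits 0 only if that name appears as a quoted
# string somewhere in CONFIG_FILE (case-insensitive).
provider_in_config() {
  local config="$1" model="$2" provider
  provider="${model%%/*}"
  grep -qiF "\"$provider\"" "$config"
}

# Example (config path is a placeholder):
# provider_in_config /path/to/agent/clawdbot.json deepseek/deepseek-chat \
#   || echo "Provider missing: do NOT set this override"
```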

### Restart Alone Doesn't Fix Stalls
**Date:** 2026-02-17
**What happened:** Multiple `systemctl restart` and wake commands were sent to stalled agents (Susan, Radar, Cog, Sentinel). None recovered. Susan stayed stuck at 108% context despite restarts.
**Rule:** Diagnose BEFORE restarting blindly. Check context %, model config, error logs. Context overflow needs session reset, not service restart.

---

*When something goes wrong, add it here. Future agents will thank you.*

### Model Change = Config + Session Archive (ALWAYS)
**Date:** 2026-02-17 (caused 181 crashes in one day)
**What happened:** Changed an agent's primary model in `clawdbot.json` but left old sessions intact. Sessions contain `model_change` events that OVERRIDE the config. Old session says "use Sonnet", new config says "use MiniMax" → API mismatch → crash loop. Happened 3 separate times in one day (Susan 43 crashes, Sentinel/Radar/Cog 42 each).
**Rule:** When changing an agent's model:
```bash
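# ALWAYS do both steps together, right after editing the model in clawdbot.json:
# 1. Archive old sessions so their model_change events stop overriding the config
mv /root/.clawdbot-{agent}/agents/main/sessions/*.jsonl .../archive/
# 2. Restart the service so it picks up the new config
systemctl restart clawdbot-{agent}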
```
Never change the model config without archiving the session. Never.

### Blind Restarts Don't Fix Config Problems
**Date:** 2026-02-17
**What happened:** Stall detector and buddy check kept restarting crashed agents. Agents kept crashing on startup with the same error. Restart loops for hours.
**Rule:** Before restarting, check `journalctl -u clawdbot-{agent}` for the actual error. "Unknown model", "mapOptionsForApi", "auth profile" errors need config/session fixes, not restarts.
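
The check can be scripted so a stall detector looks at the log before it restarts anything. A minimal sketch: the function name `needs_config_fix` is illustrative, and the three patterns are the error strings listed above. Other config errors may exist that it won't catch.

```bash
# needs_config_fix
# Reads log text on stdin. Prints any lines matching the known
# config/session errors and exits 0; exits 1 if none are found.
needs_config_fix() {
  grep -E 'Unknown model|mapOptionsForApi|auth profile'
}

# Example (unit name follows the clawdbot-{agent} pattern):
# if journalctl -u clawdbot-susan -n 200 --no-pager | needs_config_fix; then
#   echo "Config/session problem: fix it first, restarting will not help"
# fi
```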
