# MEMORY.md — Cog

## Structure
- Daily logs: `memory/YYYY-MM-DD.md`
- Archive: `memory/archive/` (old daily logs, rotated weekly)
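
The layout above could be handled with a small helper like the sketch below. The function names and the 7-day cutoff are assumptions for illustration, not part of the existing tooling; only the `memory/YYYY-MM-DD.md` pattern and the `memory/archive/` directory come from these notes.

```python
from datetime import date, timedelta
from pathlib import Path

MEMORY_DIR = Path("memory")


def daily_log_path(day: date, root: Path = MEMORY_DIR) -> Path:
    """Return the daily log path for a day, e.g. memory/2026-02-25.md."""
    return root / f"{day.isoformat()}.md"


def logs_to_archive(root: Path, today: date, keep_days: int = 7) -> list[Path]:
    """List daily logs older than keep_days (the cutoff is an assumed value)."""
    cutoff = today - timedelta(days=keep_days)
    stale = []
    for path in sorted(root.glob("*.md")):  # non-recursive, so archive/ is skipped
        try:
            day = date.fromisoformat(path.stem)
        except ValueError:
            continue  # not a daily log, leave it alone
        if day < cutoff:
            stale.append(path)
    return stale


def rotate(root: Path, today: date, keep_days: int = 7) -> list[Path]:
    """Move stale daily logs into root/archive and return their new paths."""
    archive = root / "archive"
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    for path in logs_to_archive(root, today, keep_days):
        target = archive / path.name
        path.rename(target)
        moved.append(target)
    return moved
```

Running `rotate(Path("memory"), date.today())` once a week would match the rotation cadence noted above.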

## What to Log
- **Every heartbeat:** Fleet snapshot (who's active, who's stalled, unread counts)
- **Escalations sent:** What, to whom, why, timestamp
- **Archives performed:** How many messages, which agents
- **Productivity observations:** What each agent produced (or didn't)
- **Communication patterns:** Who's talking, who's silent
- **Failure modes:** What went wrong, how it was resolved
- **OpsMan metrics:** Tasks handled autonomously, human interventions avoided
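
A per-heartbeat fleet snapshot might be rendered into the daily log roughly as follows. This is a sketch only: the `AgentStatus` fields, the state labels, and the entry layout are hypothetical, chosen to cover the three items listed for each heartbeat (active, stalled, unread counts).

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AgentStatus:
    name: str
    state: str   # "active" or "stalled" (assumed labels)
    unread: int  # unread message count


def format_snapshot(ts: datetime, agents: list[AgentStatus]) -> str:
    """Render one heartbeat snapshot as a Markdown block for the daily log."""
    active = [a.name for a in agents if a.state == "active"]
    stalled = [a.name for a in agents if a.state == "stalled"]
    unread = ", ".join(f"{a.name}={a.unread}" for a in agents)
    return "\n".join([
        f"### Heartbeat {ts:%H:%M}",
        f"- Active: {', '.join(active) or 'none'}",
        f"- Stalled: {', '.join(stalled) or 'none'}",
        f"- Unread: {unread}",
    ])
```

Keeping every snapshot in the same shape makes later entries easy to diff against earlier ones when looking for communication patterns.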

## Performance Metrics to Track
- **Uptime:** How many consecutive heartbeats completed without stalling
- **Delivery rate:** Messages acked as a share of messages sent (fleet-wide)
- **Escalation accuracy:** Share of escalations sent that resulted in action
- **Archive throughput:** Messages archived per day
- **Stale detection time:** Average time between a message going stale and Cog flagging it
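
The metrics above reduce to a few simple calculations. The sketch below assumes each rate is a ratio (acked over sent, acted-on over sent) and that heartbeat history is available as a list of booleans; all function names and input shapes are illustrative.

```python
from datetime import datetime
from typing import Optional


def ratio(numerator: int, denominator: int) -> Optional[float]:
    """Return numerator/denominator, or None when there is nothing to measure."""
    return numerator / denominator if denominator else None


def delivery_rate(sent: int, acked: int) -> Optional[float]:
    return ratio(acked, sent)


def escalation_accuracy(sent: int, acted_on: int) -> Optional[float]:
    return ratio(acted_on, sent)


def uptime_streak(heartbeats: list[bool]) -> int:
    """Count consecutive completed heartbeats at the end of the history."""
    streak = 0
    for completed in reversed(heartbeats):
        if not completed:
            break
        streak += 1
    return streak


def mean_detection_minutes(
    events: list[tuple[datetime, datetime]],
) -> Optional[float]:
    """Average minutes between (went_stale, flagged) timestamp pairs."""
    if not events:
        return None
    total = sum((flagged - stale).total_seconds() for stale, flagged in events)
    return total / len(events) / 60
```

Returning `None` for an empty denominator keeps a quiet day from being logged as a 0% rate.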

## Fleet Baselines (Update Weekly)
- Normal fleet unread count: ~0-5 across all agents (Rivet can accumulate 15-20 — chronic, not actionable unless high/crit)
- Normal message volume: ~20-40 messages/day fleet-wide
- Expected heartbeat frequency: every 10-30 min per agent
- Typical stall recovery time: 15-60 min (longer if model crash)
- Fleet doctor runs every 5 min; it catches and auto-restarts downed services
- Rivet's last heartbeat often 5-6h old during business hours — generates morning brief ~05:10, then idle until evening window
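
The unread baseline and the Rivet exception could be checked with something like the sketch below. It assumes the 0-5 figure is a fleet-wide total that excludes Rivet, and that message priorities are available as strings such as `"high"` and `"crit"`; both are interpretations of the notes above, and the names are hypothetical.

```python
NORMAL_FLEET_UNREAD_MAX = 5            # upper end of the ~0-5 baseline
CHRONIC_BACKLOG_AGENTS = {"Rivet"}     # backlog tolerated unless urgent
URGENT_PRIORITIES = {"high", "crit"}   # assumed priority labels


def unread_alerts(unread: dict[str, list[str]]) -> list[str]:
    """unread maps agent name to the priorities of its unread messages."""
    alerts = []
    fleet_total = 0
    for agent, priorities in unread.items():
        if agent in CHRONIC_BACKLOG_AGENTS:
            # A chronic backlog only matters when it hides urgent messages.
            urgent = sum(1 for p in priorities if p in URGENT_PRIORITIES)
            if urgent:
                alerts.append(
                    f"{agent}: {urgent} urgent of {len(priorities)} unread"
                )
        else:
            fleet_total += len(priorities)
    if fleet_total > NORMAL_FLEET_UNREAD_MAX:
        alerts.append(
            f"fleet: {fleet_total} unread (baseline 0-{NORMAL_FLEET_UNREAD_MAX})"
        )
    return alerts
```

Because the baselines are reviewed weekly, keeping them as named constants makes each update a one-line change.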

## Known Failure Modes
- **MiniMax crashes:** Can take down multiple agents simultaneously (happened 2026-02-17, 4 agents stalled 8-14h)
- **Restart loops:** Agent status shows "recovered-from-restart-loop" — check if actually producing post-recovery
- **Context overflow:** Agent hits context limit, starts dropping tasks — context-monitor.js tracks this
- **Silent stalls:** Agent heartbeat is fine but producing zero output — hardest to detect, check deliverables not just status
- **Herald transient outages:** Seen Feb 25 09:00 — fleet doctor caught, auto-restarted, recovered within 5min
- **Sentinel brief drops:** Seen Feb 25 11:10 — service stopped, fleet doctor restarted it. Monitor for pattern.
- **Fleet doctor script bugs:** Line 421 uses `local` outside a function; line 476 raises an integer-expression error. Non-critical, doctor still functions.
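
Silent stalls are the case where status alone misleads, so a check needs both the heartbeat and the last deliverable. The sketch below is one possible shape: the 30-minute heartbeat limit mirrors the expected heartbeat frequency in the baselines, while the 2-hour output window and the label names are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional

HEARTBEAT_MAX_AGE = timedelta(minutes=30)  # upper end of expected heartbeat frequency
OUTPUT_MAX_AGE = timedelta(hours=2)        # assumed window for "producing output"


def classify_agent(
    now: datetime,
    last_heartbeat: datetime,
    last_deliverable: Optional[datetime],
) -> str:
    """Return "stalled", "silent-stall", or "ok" for one agent."""
    if now - last_heartbeat > HEARTBEAT_MAX_AGE:
        return "stalled"
    # Heartbeat is fresh, so check deliverables rather than trusting status.
    if last_deliverable is None or now - last_deliverable > OUTPUT_MAX_AGE:
        return "silent-stall"
    return "ok"
```

The same check applies after a restart loop: a "recovered" status with no new deliverable still classifies as a silent stall. Agents with long planned idle windows, such as Rivet, would need their own thresholds.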

## Lessons Learned
- **voice-brief-data.json schema mismatch (Feb 24):** GE was never down — wrong URL in freshness check. Fixed. Always verify the check before trusting the result.
- **BAS is NOT a task:** No employees, not GST-registered. Never flag as blocker.
- **Stripe webhook cold starts:** Not broken, not a blocker. Resolved.
- **Challenge culture (Feb 24):** Push back when something seems wrong. Agreeing while holding doubts is a failure.
- **All agents run Opus 4.6:** Primary model is anthropic/claude-opus-4-6, fallbacks are sonnet+deepseek. Corrected in IDENTITY.md Feb 25.

---

*Updated: 2026-02-25*
