# Optimal Multi-Agent Coordination Architecture
*Analysis by Sentinel using MiniMax M2.5*
*Date: 2026-02-17 06:03 AEDT*

## The Core Problem

Eight AI agents with:
- Zero session memory
- 15-minute heartbeat cycles
- File-only communication
- One coordinator, seven specialists
- Must scale to 20+ agents
- Minimal token cost

**Key insight:** This is a distributed systems problem masquerading as an AI coordination challenge.

## Optimal Architecture: State-Machine Coordination

### 1. Central State Registry (`/shared/state/`)

**Single source of truth for all agent states:**

```
/shared/state/
├── WORLD.json          # Global system state snapshot
├── agents/
│   ├── rivet.json      # {status, task, heartbeat, context_hash, dependencies}
│   ├── builder.json
│   └── ...
├── tasks/
│   ├── active/         # JSON files for each active task
│   ├── pending/        # Task queue with priorities
│   └── completed/      # Archived with results
└── locks/
    ├── task-123.lock   # Atomic task claiming
    └── resource-db.lock # Resource contention prevention
```

**WORLD.json structure:**
```json
{
  "timestamp": 1703275200,
  "sequence": 42,
  "critical_alerts": [],
  "system_health": {"cpu": 2.1, "memory": 0.5, "disk": 0.35},
  "active_tasks": ["task-123", "task-456"],
  "blocked_agents": [],
  "last_coordination": 1703275180
}
```

### 2. Atomic Task Management

**Task lifecycle with file-based locking:**

```
task-123.json:
{
  "id": "task-123",
  "title": "Deploy RateRight to production",
  "owner": "builder",
  "status": "in_progress",
  "priority": 1,
  "dependencies": [],
  "blockers": [],
  "heartbeat_deadline": 1703275500,
  "context_hash": "abc123",
  "progress_checkpoints": [
    {"step": "backup_db", "status": "complete", "timestamp": 1703275200},
    {"step": "deploy_code", "status": "in_progress", "timestamp": 1703275300}
  ],
  "token_budget": 50000,
  "token_used": 12000
}
```

**Lock-based claiming prevents duplication:**
```bash
# Agent attempts to claim task
touch /shared/state/locks/task-123.lock.claiming
if [ $? -eq 0 ]; then
  mv task-123.lock.claiming task-123.lock.rivet
  # Task claimed by rivet
fi
```

### 3. Instant Situational Awareness (5-File Bootstrap)

**Every agent reads exactly 5 files on startup:**

1. **WORLD.json** (500 bytes) - Global state snapshot
2. **agents/self.json** (300 bytes) - Own previous state
3. **PRIORITIES.md** (1KB) - Current mission-critical tasks 
4. **BLOCKERS.md** (500 bytes) - Known system blockers
5. **CONTEXT-{hash}.md** (2KB) - Relevant context for current work

**Total: ~4KB = ~1K tokens per agent startup**

### 4. Heartbeat Protocol

**Three-phase heartbeat cycle:**

**Phase 1: Health Check (30 seconds)**
```bash
# Update own state
echo '{"status":"alive","task":"monitoring","heartbeat":'$(date +%s)'}' > agents/sentinel.json

# Check for stalled agents
find agents/ -name "*.json" -mmin +16 -exec basename {} .json \; > stalled_agents.tmp
```

**Phase 2: Work Detection (60 seconds)**
```bash
# Scan for new tasks
ls tasks/pending/ | head -5 > available_tasks.tmp

# Check for blockers to clear
grep "blocked_by.*$(hostname)" tasks/active/*.json > can_unblock.tmp
```

**Phase 3: Coordination Update (30 seconds)**
```bash
# Update WORLD.json atomically
jq '.timestamp='$(date +%s)' | .sequence+=1' WORLD.json > WORLD.tmp && mv WORLD.tmp WORLD.json

# Clean up stale locks (>20 minutes)
find locks/ -name "*.lock.*" -mmin +20 -delete
```

### 5. Blocked Work Detection

**Automatic stall detection:**

```bash
# In coordinator heartbeat
for task in tasks/active/*.json; do
  deadline=$(jq -r '.heartbeat_deadline' "$task")
  now=$(date +%s)
  if [ $now -gt $deadline ]; then
    agent=$(jq -r '.owner' "$task")
    echo "STALL: $agent on $(basename $task .json)" >> BLOCKERS.md
  fi
done
```

**Progressive escalation:**
- 1 missed heartbeat: Log warning
- 2 missed heartbeats: Mark task as stalled
- 3 missed heartbeats: Auto-reassign to coordinator

### 6. Context Optimization

**Content-addressed context sharing:**
```bash
# Generate context hash
context_hash=$(echo "$relevant_data" | sha256sum | cut -c1-8)

# Write context file if doesn't exist
if [ ! -f "context/CONTEXT-$context_hash.md" ]; then
  echo "$relevant_data" > "context/CONTEXT-$context_hash.md"
fi

# Reference in task
jq '.context_hash="'$context_hash'"' task-123.json > task-123.tmp && mv task-123.tmp task-123.json
```

**Benefits:**
- Deduplicates identical context across agents
- Only load context when hash changes
- Automatic garbage collection of unused contexts

### 7. Scaling Architecture

**Hierarchical coordination for 20+ agents:**

```
/shared/state/
├── WORLD.json                 # Global state
├── domains/
│   ├── infrastructure/        # Sentinel + monitoring agents
│   ├── development/          # Builder + QA agents  
│   ├── sales/                # Susan + outreach agents
│   └── operations/           # Rivet + support agents
└── cross_domain/
    ├── dependencies.json     # Inter-domain task dependencies
    └── resource_conflicts.json # Shared resource coordination
```

**Domain coordinators handle local coordination, main coordinator handles cross-domain conflicts.**

### 8. Token Cost Optimization

**Aggressive caching and differentials:**

```bash
# Only read changed files
find /shared/state -newer /tmp/last_read_$(whoami) -type f > changed_files.txt
while read file; do
  process_file "$file"
done < changed_files.txt
touch /tmp/last_read_$(whoami)
```

**Structured data reduces parsing costs:**
- JSON for machine data (state, metrics)
- Markdown only for human-readable summaries
- Binary deltas for large datasets

### 9. Beyond Kanban: Event Sourcing

**Append-only event log captures all state changes:**

```
/shared/events/
├── 2026-02-17-06.jsonl      # Hourly event logs
├── 2026-02-17-05.jsonl
└── 2026-02-16-23.jsonl
```

**Event format:**
```json
{"timestamp":1703275200,"agent":"rivet","event":"task_claimed","task":"task-123","data":{}}
{"timestamp":1703275260,"agent":"builder","event":"task_progress","task":"task-123","data":{"checkpoint":"deploy_code"}}
{"timestamp":1703275320,"agent":"sentinel","event":"system_alert","data":{"cpu":0.95,"type":"high_load"}}
```

**Benefits:**
- Complete audit trail
- State reconstruction from events
- Automatic conflict resolution
- Historical analysis for optimization

### 10. Implementation Priority

**Phase 1: Core State System**
1. WORLD.json + agent state files
2. Task claiming with file locks
3. Basic heartbeat protocol

**Phase 2: Coordination Intelligence**
4. Stall detection and auto-reassignment
5. Context optimization with content addressing
6. Token cost monitoring

**Phase 3: Advanced Features**
7. Event sourcing
8. Hierarchical scaling
9. Predictive task scheduling
10. Performance optimization

## Key Technical Decisions

**Why file-based over database?**
- Zero external dependencies
- Atomic operations via filesystem
- Human-readable for debugging
- Natural backup via filesystem snapshots

**Why JSON + Markdown hybrid?**
- JSON for machine parsing (fast, structured)
- Markdown for human context (readable, diffable)
- Clear separation of concerns

**Why event sourcing?**
- Immutable history prevents coordination bugs
- Natural disaster recovery mechanism
- Enables sophisticated analytics
- Conflict resolution through replay

**Why content-addressed context?**
- Massive deduplication savings
- Cache-friendly architecture
- Automatic cleanup of unused data
- Hash-based change detection

This architecture handles coordination at scale while maintaining the constraints of file-only communication and stateless agents. Token costs remain minimal through aggressive caching and structured data formats.