# Self-Healing Ops Brain - Implementation Plan

> Problem detected → Diagnosed → Fixed → Verified → Learned

**Created:** Jan 21, 2026
**Status:** Plan Ready
**Estimated Effort:** 45-65 hours (4-week phased approach)
**Risk Level:** Low (starts diagnosis-only, then gradual auto-fix enablement)
**Ongoing AI Cost:** ~$20-40/month

---

## The Dream

Problem happens → System detects it → Diagnoses root cause → Fixes itself → Verifies fix → Reports what it did

You wake up to: "3 issues overnight. All fixed. Here's what happened."

---

## AI Architecture

### Which AI for What?

| Component | AI Model | Why | Cost |
|-----------|----------|-----|------|
| **Detection** | None (rules) | frequentAudit.js already works | $0 |
| **SOP Matching** | GPT-4o-mini | Fast pattern matching, cheap | ~$5/mo |
| **Diagnosis** | GPT-4o-mini | Analyze symptoms, find root cause | ~$10/mo |
| **Decision** | Rules + GPT-4o-mini | 80% rules, 20% AI for edge cases | ~$3/mo |
| **Execution** | None (code) | Deterministic, no AI unpredictability | $0 |
| **Verification** | None (rules) | Re-run health check, compare | $0 |
| **Learning** | GPT-4o | Weekly SOP improvement analysis | ~$5/mo |
| **Reporting** | GPT-4o-mini | Daily summary generation | ~$2/mo |

### Why NOT AI for Execution?

For automated ops, you want:
- **Deterministic** - Same input = same fix every time
- **Fast** - No API latency in the critical fix path
- **Cheap** - Runs 24/7, can't afford GPT-4 per operation
- **Auditable** - Predefined operations, not AI-generated commands
- **Safe** - AI could hallucinate dangerous commands

### AI Flow

```
┌─────────────────────────────────────────────────────────────────┐
│                      AI USAGE MAP                                │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  Alert Detected                                                  │
│       │                                                          │
│       ▼                                                          │
│  ┌─────────────────┐                                            │
│  │ SOP MATCHER     │ ← GPT-4o-mini                              │
│  │ "Which SOP fits │   "Given this alert about SMS delivery,    │
│  │  this alert?"   │    which SOP should we use?"               │
│  └────────┬────────┘                                            │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐                                            │
│  │ DIAGNOSIS       │ ← GPT-4o-mini                              │
│  │ "What's the     │   "Analyze these symptoms and identify     │
│  │  root cause?"   │    the most likely root cause"             │
│  └────────┬────────┘                                            │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐                                            │
│  │ DECISION        │ ← Rules (80%) + GPT-4o-mini (20%)          │
│  │ "Auto-fix or    │   Rules: history, cooldowns, risk level    │
│  │  escalate?"     │   AI: ambiguous cases only                 │
│  └────────┬────────┘                                            │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐                                            │
│  │ EXECUTION       │ ← NO AI - Pure code                        │
│  │ "Run the fix"   │   Predefined safe operations only          │
│  └────────┬────────┘                                            │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐                                            │
│  │ VERIFICATION    │ ← NO AI - Rules                            │
│  │ "Did it work?"  │   Re-run health check, compare metrics     │
│  └────────┬────────┘                                            │
│           │                                                      │
│           ▼                                                      │
│  ┌─────────────────┐                                            │
│  │ LEARNING        │ ← GPT-4o (weekly batch)                    │
│  │ "How to improve │   "Analyze failed fixes and suggest        │
│  │  the SOP?"      │    improvements to the SOP"                │
│  └─────────────────┘                                            │
│                                                                  │
│  ┌─────────────────┐                                            │
│  │ DAILY REPORT    │ ← GPT-4o-mini (once daily)                 │
│  │ "Summarize      │   "Generate human-readable summary         │
│  │  overnight"     │    of overnight ops activity"              │
│  └─────────────────┘                                            │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```

### Prompt Templates

**SOP Matcher Prompt:**
```
You are an ops SOP matcher. Given an alert, identify the best matching SOP.

ALERT:
- Type: {alert_type}
- Title: {title}
- Severity: {severity}
- Evidence: {evidence}

AVAILABLE SOPs:
{sop_list}

Return JSON: { "sop_name": "...", "confidence": 0-100, "reasoning": "..." }
```
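
A minimal sketch of how the SOP matcher template could be filled in and how the model's reply could be validated before anything downstream trusts it. `renderPrompt` and `parseMatcherReply` are hypothetical helper names, not existing code, and the model call itself is left out.

```javascript
// Hypothetical helpers for the SOP matcher; the API call is intentionally omitted.
const SOP_MATCHER_TEMPLATE = [
  "You are an ops SOP matcher. Given an alert, identify the best matching SOP.",
  "",
  "ALERT:",
  "- Type: {alert_type}",
  "- Title: {title}",
  "- Severity: {severity}",
  "- Evidence: {evidence}",
  "",
  "AVAILABLE SOPs:",
  "{sop_list}",
  "",
  'Return JSON: { "sop_name": "...", "confidence": 0-100, "reasoning": "..." }',
].join("\n");

// Replace every {placeholder}. Throws on a missing value so a half-filled
// prompt is never sent to the model.
function renderPrompt(template, values) {
  return template.replace(/\{(\w+)\}/g, (match, key) => {
    if (!Object.prototype.hasOwnProperty.call(values, key)) {
      throw new Error(`Missing prompt value: ${key}`);
    }
    return String(values[key]);
  });
}

// Parse the reply and reject anything that is not a known SOP with a
// confidence in 0-100. Returns null on failure so the caller can escalate.
function parseMatcherReply(replyText, knownSopNames) {
  let reply;
  try {
    reply = JSON.parse(replyText);
  } catch {
    return null;
  }
  if (reply === null || typeof reply !== "object") return null;
  const { sop_name, confidence, reasoning } = reply;
  if (!knownSopNames.includes(sop_name)) return null;
  if (!Number.isFinite(confidence) || confidence < 0 || confidence > 100) {
    return null;
  }
  return { sop_name, confidence, reasoning: String(reasoning ?? "") };
}
```

Checking `sop_name` against the list of known SOPs means a hallucinated name falls through to escalation instead of reaching the executor.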

**Diagnosis Prompt:**
```
You are an ops diagnostician. Analyze these symptoms and identify root cause.

ALERT: {alert_details}
SOP: {sop_name}
DIAGNOSIS STEPS RESULTS: {diagnosis_results}

Return JSON: {
  "root_cause": "...",
  "confidence": 0-100,
  "evidence": ["...", "..."],
  "recommended_fix": "..."
}
```

**Daily Report Prompt:**
```
Generate a concise ops report for the last 24 hours.

DATA:
- Issues detected: {count}
- Auto-fixed: {fixed_count}
- Escalated: {escalated_count}
- Fix details: {fix_list}
- Cost metrics: {costs}

Format as a Slack message with emojis. Keep it scannable.
```

### Cost Breakdown

| Usage | Tokens/Day | Monthly Cost |
|-------|------------|--------------|
| SOP Matching (~50 alerts/day) | ~25K | ~$3 |
| Diagnosis (~30 diagnoses/day) | ~30K | ~$4 |
| Decision edge cases (~5/day) | ~5K | ~$1 |
| Daily report (1/day) | ~2K | ~$0.50 |
| Weekly learning (1/week) | ~20K | ~$5 |
| **Total** | | **~$15-25/mo** |

Buffer for spikes: **~$20-40/mo total**

---

## Executive Summary

Build a top-0.1% self-healing operations system that:
1. **Detects** problems via existing monitoring (frequentAudit.js, criticalAlerts.js)
2. **Diagnoses** root cause using machine-executable SOP steps
3. **Decides** auto-fix vs escalate based on confidence scoring
4. **Executes** safe fixes with sandboxed operations
5. **Verifies** the fix worked by re-running health checks
6. **Learns** and improves SOPs based on outcomes

**Safety First:** The system NEVER auto-fixes billing, security, schema, or user-data issues, and it starts in diagnosis-only mode.

---

## Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                     OPS BRAIN (24/7/365)                            │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  ┌─────────────┐   ┌─────────────┐   ┌─────────────┐               │
│  │  MONITORS   │   │   AGENTS    │   │    SOPs     │               │
│  ├─────────────┤   ├─────────────┤   ├─────────────┤               │
│  │ Health      │   │ Infra Agent │   │ 50+ runbooks│               │
│  │ Performance │   │ Data Agent  │   │ Per problem │               │
│  │ Business    │   │ API Agent   │   │ Auto-updated│               │
│  │ Security    │   │ UX Agent    │   │ From fixes  │               │
│  │ Cost        │   │ Cost Agent  │   │             │               │
│  └──────┬──────┘   └──────┬──────┘   └──────┬──────┘               │
│         │                 │                 │                       │
│         └────────────┬────┴────────────────┘                       │
│                      ▼                                              │
│         ┌─────────────────────────┐                                │
│         │    DECISION ENGINE      │                                │
│         │  ┌───────────────────┐  │                                │
│         │  │ Can I fix this?   │  │                                │
│         │  │ YES → Execute SOP │  │                                │
│         │  │ NO → Escalate     │  │                                │
│         │  │ MAYBE → Try safe  │  │                                │
│         │  └───────────────────┘  │                                │
│         └───────────┬─────────────┘                                │
│                     ▼                                              │
│         ┌─────────────────────────┐                                │
│         │   EXECUTION ENGINE      │                                │
│         │  Predefined ops run fix │                                │
│         │  Sandboxed, logged      │                                │
│         │  Rollback ready         │                                │
│         └───────────┬─────────────┘                                │
│                     ▼                                              │
│         ┌─────────────────────────┐                                │
│         │   VERIFICATION          │                                │
│         │  Re-run health check    │                                │
│         │  Compare before/after   │                                │
│         │  Mark resolved or retry │                                │
│         └───────────┬─────────────┘                                │
│                     ▼                                              │
│         ┌─────────────────────────┐                                │
│         │   LEARNING SYSTEM       │                                │
│         │  What worked?           │                                │
│         │  Update SOP if better   │                                │
│         │  Track fix success rate │                                │
│         └─────────────────────────┘                                │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```

---

## The Agents

| Agent | Monitors | Auto-Fixes |
|-------|----------|------------|
| **Infra Agent** | Server health, uptime, memory, CPU | Restart services, scale resources; flag deploy rollbacks for approval |
| **Data Agent** | Database connections, sync jobs, data integrity | Reconnect, re-run syncs, fix duplicates |
| **API Agent** | Endpoint health, response times, error rates | Restart endpoints, clear caches, fix rate limits |
| **UX Agent** | Frontend errors, load times, user friction | Rebuild assets, clear CDN, flag for dev |
| **Cost Agent** | API spend, Twilio costs, token usage | Pause runaway jobs, alert on overspend |
| **Security Agent** | Auth failures, suspicious patterns, rate limit abuse | None (escalate only): alert with a recommended action, e.g. block IPs or revoke tokens |

---

## SOP Library Structure

```
/SOPs/
├── infrastructure/
│   ├── railway-deploy-failed.md
│   ├── server-out-of-memory.md
│   ├── ssl-cert-expiring.md
│   ├── database-connection-pool-exhausted.md
│   └── service-unresponsive.md
│
├── integrations/
│   ├── twilio-auth-failed.md
│   ├── twilio-sms-delivery-low.md
│   ├── openai-rate-limited.md
│   ├── openai-quota-exceeded.md
│   ├── deepgram-transcription-failed.md
│   ├── supabase-connection-dropped.md
│   └── slack-webhook-failed.md
│
├── data/
│   ├── lead-sync-stuck.md
│   ├── duplicate-leads-detected.md
│   ├── orphaned-communications.md
│   ├── sequence-enrollment-stuck.md
│   └── conversion-not-tracked.md
│
├── api/
│   ├── endpoint-500-error.md
│   ├── endpoint-timeout.md
│   ├── rate-limit-exceeded.md
│   ├── auth-token-expired.md
│   └── cors-error.md
│
├── frontend/
│   ├── build-failed.md
│   ├── asset-404.md
│   ├── white-screen-of-death.md
│   └── slow-page-load.md
│
├── business/
│   ├── conversion-rate-dropped.md
│   ├── sms-delivery-rate-low.md
│   ├── no-calls-in-2-hours.md
│   └── lead-response-time-high.md
│
└── cost/
    ├── openai-spend-spike.md
    ├── twilio-spend-spike.md
    └── database-storage-growing.md
```

---

## SOP Template

```markdown
# twilio-sms-delivery-low.md

## Trigger
SMS delivery rate < 85% in last hour

## Severity
HIGH

## Auto-Fix Allowed
YES

## Diagnosis Steps
1. Check Twilio dashboard for errors
2. Check for invalid phone numbers in recent sends
3. Check for carrier blocks
4. Check account balance

## Fix Steps
1. If invalid numbers → quarantine leads with bad numbers
2. If carrier block → switch to alternate Twilio number
3. If balance low → alert Michael (don't auto-fix billing)
4. If Twilio outage → pause SMS sequences, notify team

## Verification
- Send test SMS to known good number
- Check delivery rate recovers in next 15 min

## Escalate If
- Fix doesn't work after 2 attempts
- Twilio account suspended
- Delivery rate < 50%

## Notification
Slack: #ops-alerts
Pushover: Only if escalated
```
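
The trigger, escalation, and verification rules in this template can be made machine-checkable. The sketch below is one possible encoding of the 85% and 50% thresholds; the function names and the `sent`/`delivered` counters are assumptions for illustration.

```javascript
// Thresholds taken from the SOP template above.
const TRIGGER_BELOW = 0.85; // "SMS delivery rate < 85% in last hour"
const ESCALATE_BELOW = 0.5; // "Delivery rate < 50%"

// Classify the last hour of sends. `sent` and `delivered` are assumed to be
// counts pulled from the communications log.
function classifyDeliveryRate(sent, delivered) {
  if (sent === 0) return { rate: null, state: "no_data" };
  const rate = delivered / sent;
  if (rate < ESCALATE_BELOW) return { rate, state: "escalate" };
  if (rate < TRIGGER_BELOW) return { rate, state: "trigger" };
  return { rate, state: "ok" };
}

// Verification: the fix counts as effective only if the rate measured after
// the fix is back at or above the trigger threshold.
function hasRecovered(afterFix) {
  return afterFix.state === "ok";
}
```

Treating zero sends as `no_data` rather than as a 0% rate avoids firing the SOP during quiet hours.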

---

## Database Schema

### Table: `self_healing_sops`

```sql
CREATE TABLE IF NOT EXISTS self_healing_sops (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),

  -- Identification
  sop_name VARCHAR(100) UNIQUE NOT NULL,
  category VARCHAR(50) NOT NULL,        -- infrastructure, integrations, data, api, frontend, business, cost
  description TEXT NOT NULL,
  version INTEGER DEFAULT 1,

  -- Trigger Conditions (when this SOP applies)
  trigger_conditions JSONB NOT NULL,
  -- { "alert_types": ["quality_audit_critical"],
  --   "title_patterns": ["Database Connection"],
  --   "severity_min": "high" }

  -- Diagnosis Steps (what to check)
  diagnosis_steps JSONB NOT NULL,
  -- [{ "step": 1, "name": "Check DB", "type": "db_query",
  --    "action": { "query": "SELECT 1" }, "success_condition": "no_error" }]

  -- Fix Operations (what actions to take)
  fix_operations JSONB NOT NULL,
  -- [{ "step": 1, "name": "Reset pool", "type": "function_call",
  --    "action": { "function": "resetConnectionPool" } }]

  -- Verification (how to confirm fix worked)
  verification_steps JSONB NOT NULL,
  -- [{ "step": 1, "name": "Re-run test", "type": "rerun_audit",
  --    "wait_before_ms": 5000 }]

  -- Safety Controls
  risk_level VARCHAR(20) NOT NULL,      -- low, medium, high, critical
  auto_fix_enabled BOOLEAN DEFAULT FALSE,
  requires_approval BOOLEAN DEFAULT TRUE,
  max_executions_per_hour INTEGER DEFAULT 3,
  cooldown_minutes INTEGER DEFAULT 30,

  -- Learning Stats
  times_executed INTEGER DEFAULT 0,
  times_successful INTEGER DEFAULT 0,
  avg_resolution_time_ms INTEGER,
  last_executed_at TIMESTAMPTZ,

  created_at TIMESTAMPTZ DEFAULT NOW(),
  updated_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS idx_sops_category ON self_healing_sops(category);
CREATE INDEX IF NOT EXISTS idx_sops_risk ON self_healing_sops(risk_level);
```
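
Before any AI call, `trigger_conditions` can be used as a cheap rule-based pre-filter. A minimal sketch, assuming alerts expose `alert_type`, `title`, and `severity`, that `title_patterns` are case-insensitive substrings, and that severities are ordered low < medium < high < critical (the schema comment only shows `"high"`):

```javascript
// Assumed severity ordering; adjust to the values intelligence_alerts uses.
const SEVERITY_ORDER = ["low", "medium", "high", "critical"];

// True when the alert satisfies every condition present in trigger_conditions.
function matchesTrigger(alert, conditions) {
  const { alert_types, title_patterns, severity_min } = conditions;

  if (Array.isArray(alert_types) && !alert_types.includes(alert.alert_type)) {
    return false;
  }
  if (Array.isArray(title_patterns)) {
    const title = String(alert.title || "").toLowerCase();
    const hit = title_patterns.some((p) => title.includes(p.toLowerCase()));
    if (!hit) return false;
  }
  if (severity_min) {
    const min = SEVERITY_ORDER.indexOf(severity_min);
    const actual = SEVERITY_ORDER.indexOf(alert.severity);
    // Unknown severities never match, so the filter fails closed.
    if (min === -1 || actual === -1 || actual < min) return false;
  }
  return true;
}

// Narrow the SOP list to rule-level candidates for the SOP matcher.
function candidateSops(alert, sops) {
  return sops.filter((sop) => matchesTrigger(alert, sop.trigger_conditions));
}
```

Passing only the candidates into `{sop_list}` keeps the matcher prompt short and limits what the model can pick.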

### Table: `self_healing_executions`

```sql
CREATE TABLE IF NOT EXISTS self_healing_executions (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  sop_id UUID REFERENCES self_healing_sops(id),
  alert_id UUID REFERENCES intelligence_alerts(id),

  -- Results
  diagnosis_results JSONB,
  fix_actions_taken JSONB,
  verification_results JSONB,

  -- Status: diagnosing → awaiting_approval → fixing → verifying → success/failed/escalated
  status VARCHAR(30) NOT NULL,
  confidence_score INTEGER,

  -- Timing
  started_at TIMESTAMPTZ DEFAULT NOW(),
  completed_at TIMESTAMPTZ,
  total_resolution_time_ms INTEGER,

  -- Human Interaction
  approved_by VARCHAR(100),
  escalation_reason TEXT,

  -- Learning
  was_effective_fix BOOLEAN,
  human_notes TEXT
);

CREATE INDEX IF NOT EXISTS idx_executions_sop ON self_healing_executions(sop_id);
CREATE INDEX IF NOT EXISTS idx_executions_alert ON self_healing_executions(alert_id);
CREATE INDEX IF NOT EXISTS idx_executions_status ON self_healing_executions(status);
```
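
The `status` column describes a small state machine. The sketch below enforces it in code; the main path follows the schema comment, while the shortcut from `diagnosing` straight to `fixing` (auto-fix without approval) and the extra `escalated` edges are assumptions.

```javascript
// Allowed status changes. Terminal states have no outgoing edges.
const TRANSITIONS = {
  diagnosing: ["awaiting_approval", "fixing", "escalated"],
  awaiting_approval: ["fixing", "escalated"],
  fixing: ["verifying", "failed", "escalated"],
  verifying: ["success", "failed", "escalated"],
  success: [],
  failed: [],
  escalated: [],
};

function hasStatus(status) {
  return Object.prototype.hasOwnProperty.call(TRANSITIONS, status);
}

function canTransition(from, to) {
  return hasStatus(from) && TRANSITIONS[from].includes(to);
}

function isTerminal(status) {
  return hasStatus(status) && TRANSITIONS[status].length === 0;
}

// Returns a new execution record; fills the timing columns on terminal states.
function advance(execution, to, now = new Date()) {
  if (!canTransition(execution.status, to)) {
    throw new Error(`Illegal status change: ${execution.status} -> ${to}`);
  }
  const next = { ...execution, status: to };
  if (isTerminal(to)) {
    next.completed_at = now;
    next.total_resolution_time_ms =
      now.getTime() - new Date(execution.started_at).getTime();
  }
  return next;
}
```

Rejecting illegal transitions in one place keeps a buggy caller from, for example, marking an execution `success` without it ever passing through `verifying`.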

---

## Decision Engine

### Decision Matrix

```
┌─────────────────────────────────────────────────────────┐
│                 DECISION MATRIX                         │
├─────────────────┬───────────────┬───────────────────────┤
│ Problem Type    │ Confidence    │ Action                │
├─────────────────┼───────────────┼───────────────────────┤
│ Known + SOP     │ HIGH (>90%)   │ Auto-fix immediately  │
│ Known + SOP     │ MED (70-90%)  │ Auto-fix, verify hard │
│ Known + SOP     │ LOW (<70%)    │ Try safe fix, escalate│
│ Unknown         │ Any           │ Diagnose, escalate    │
│ Dangerous       │ Any           │ Never auto-fix        │
└─────────────────┴───────────────┴───────────────────────┘

NEVER AUTO-FIX:
- Billing/payment issues
- User data deletion
- Security incidents
- Database schema changes
- Production deployments
```

### Confidence Scoring (0-100)

| Factor | Weight | Description |
|--------|--------|-------------|
| SOP Success Rate | 40% | Historical success rate (needs >5 executions) |
| Diagnosis Clarity | 30% | How clearly diagnosis matched expected patterns |
| Pattern Match | 20% | How well alert matches SOP trigger conditions |
| Environmental | 10% | Time of day, concurrent alerts, recent success |
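
The weights above combine into a single 0-100 score. A minimal sketch, assuming each factor is supplied as a fraction between 0 and 1 and that an SOP with 5 or fewer executions contributes nothing for its success-rate factor (a conservative reading of "needs >5 executions"):

```javascript
// Weights from the table above.
const WEIGHTS = {
  sopSuccessRate: 0.4,
  diagnosisClarity: 0.3,
  patternMatch: 0.2,
  environmental: 0.1,
};
const MIN_EXECUTIONS = 5; // history counts only with more than this many runs

function clamp01(x) {
  return Math.min(1, Math.max(0, Number(x) || 0));
}

// Returns an integer confidence score in 0-100.
function confidenceScore(input) {
  const { timesExecuted, timesSuccessful } = input;
  // Assumption: too little history contributes 0 instead of a neutral value.
  const successRate =
    timesExecuted > MIN_EXECUTIONS ? timesSuccessful / timesExecuted : 0;

  const score =
    WEIGHTS.sopSuccessRate * clamp01(successRate) +
    WEIGHTS.diagnosisClarity * clamp01(input.diagnosisClarity) +
    WEIGHTS.patternMatch * clamp01(input.patternMatch) +
    WEIGHTS.environmental * clamp01(input.environmental);

  return Math.round(score * 100);
}
```

With this reading, a brand-new SOP can score at most 60, so it can never clear the medium or high thresholds until it has a track record.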

### Thresholds by Risk Level

| Risk Level | Auto-Fix Threshold | Example SOPs |
|------------|-------------------|--------------|
| Low | ≥60% confidence | Cache clear, job retry, log rotation |
| Medium | ≥75% confidence | Connection pool reset, rate limit adjust |
| High | ≥90% confidence | Service restart, credential refresh |
| Critical | Never auto-fix | Schema changes, security, billing |
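
The threshold table maps directly onto a small rule function. This sketch uses the `risk_level`, `auto_fix_enabled`, and `requires_approval` columns from the schema; the three action names and the order of the checks are assumptions.

```javascript
// Auto-fix thresholds from the table above. "critical" is deliberately
// absent, so it can never be auto-fixed.
const AUTO_FIX_THRESHOLD = new Map([
  ["low", 60],
  ["medium", 75],
  ["high", 90],
]);

// Returns { action, reason } where action is one of:
// "auto_fix", "request_approval", "escalate".
function decide(sop, confidence) {
  const threshold = AUTO_FIX_THRESHOLD.get(sop.risk_level);
  if (threshold === undefined) {
    return { action: "escalate", reason: `risk level ${sop.risk_level} is never auto-fixed` };
  }
  if (confidence < threshold) {
    return { action: "escalate", reason: `confidence ${confidence} is below ${threshold}` };
  }
  // Phase 2 behaviour: confident enough, but a human still has to approve.
  if (sop.requires_approval || !sop.auto_fix_enabled) {
    return { action: "request_approval", reason: "human approval required" };
  }
  return { action: "auto_fix", reason: `confidence ${confidence} meets ${threshold}` };
}
```

Because the schema defaults are `auto_fix_enabled = FALSE` and `requires_approval = TRUE`, a freshly seeded SOP can only ever reach `request_approval` under this logic.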

---

## Safe Auto-Executable Operations

| Operation | Risk | Description |
|-----------|------|-------------|
| `cache_clear` | Low | Clear specific cache keys |
| `queue_retry` | Low | Retry failed queue items |
| `job_restart` | Low | Restart stuck scheduled jobs |
| `connection_refresh` | Low | Refresh external connections |
| `log_rotation` | Low | Archive old logs |
| `temp_cleanup` | Low | Clean temp files |
| `metrics_backfill` | Low | Re-aggregate missing metrics |
| `token_refresh` | Medium | Refresh API tokens |
| `pool_reset` | Medium | Reset DB connection pool |
| `rate_limit_reset` | Medium | Clear rate limit blocks |

---

## NEVER Auto-Fix (Always Escalate)

- Schema changes (CREATE/ALTER/DROP)
- User data deletion
- Billing/payment operations
- Security/authentication changes
- Environment variable changes
- API credential rotation
- Production data modifications
- Twilio number configuration
- Supabase RLS policy changes

---

## Learning System

After every fix:

```
┌─────────────────────────────────────────────────────────┐
│                 FIX OUTCOME TRACKING                    │
├─────────────────┬───────────────────────────────────────┤
│ Problem         │ twilio-sms-delivery-low               │
│ SOP Used        │ v2.3                                  │
│ Fix Applied     │ Quarantined 12 invalid numbers        │
│ Time to Fix     │ 2m 34s                                │
│ Verified        │ ✅ Delivery rate back to 94%          │
│ Outcome         │ SUCCESS                               │
├─────────────────┴───────────────────────────────────────┤
│ LEARNING:                                               │
│ → This SOP has 94% success rate (47/50 fixes)           │
│ → Average fix time: 2m 12s                              │
│ → Last failure: Twilio outage (not our fault)           │
│ → Suggested improvement: Check Twilio status first      │
└─────────────────────────────────────────────────────────┘
```
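
The learning stats in the schema (`times_executed`, `times_successful`, `avg_resolution_time_ms`) can be updated incrementally after each run. A minimal sketch with hypothetical function names, using a running mean so past executions never need to be re-read:

```javascript
// Returns an updated copy of the SOP row after one execution.
function recordOutcome(sop, { success, resolutionTimeMs, at = new Date() }) {
  const timesExecuted = sop.times_executed + 1;
  // First run: there is no previous average, so start from this sample.
  const prevAvg = sop.avg_resolution_time_ms ?? resolutionTimeMs;

  return {
    ...sop,
    times_executed: timesExecuted,
    times_successful: sop.times_successful + (success ? 1 : 0),
    // Running mean: avg_n = avg_(n-1) + (x_n - avg_(n-1)) / n
    avg_resolution_time_ms: Math.round(
      prevAvg + (resolutionTimeMs - prevAvg) / timesExecuted
    ),
    last_executed_at: at,
  };
}

// Success rate as a fraction, or null when the SOP has never run.
function successRate(sop) {
  return sop.times_executed === 0
    ? null
    : sop.times_successful / sop.times_executed;
}
```

Rounding to whole milliseconds on every update introduces a small drift over time, which is acceptable for a dashboard figure but worth knowing about.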

---

## Daily Ops Report (7am Slack)

```
🤖 OPS BRAIN DAILY REPORT
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

✅ SYSTEM HEALTH: 98/100

📊 LAST 24 HOURS
┌────────────────┬───────┐
│ Issues detected│ 7     │
│ Auto-fixed     │ 6     │
│ Escalated      │ 1     │
│ Fix success    │ 100%  │
│ Avg fix time   │ 1m 48s│
└────────────────┴───────┘

🔧 FIXES APPLIED
1. 02:34 — Database connection dropped
   → Reconnected, verified ✅

2. 04:12 — OpenAI rate limited
   → Switched to backup key ✅

3. 06:45 — Lead sync stuck (3 leads)
   → Cleared queue, re-synced ✅

⚠️ ESCALATED (Needs Human)
1. 05:20 — Twilio balance < $50
   → Can't auto-fix billing
   → ACTION: Top up Twilio account

💰 COST WATCH
- OpenAI: $12.34 (normal)
- Twilio: $8.21 (normal)
- Deepgram: $2.10 (normal)

🔮 PREDICTED ISSUES
- SSL cert expires in 12 days → Auto-renew scheduled
- Database approaching 80% storage → Alert at 90%

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
1 item needs your attention (see ESCALATED above). Have a good day! ☕
```
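
The numbers in the "LAST 24 HOURS" box can be computed with plain code before the report prompt is filled in, so the model only formats them. A sketch over rows shaped like `self_healing_executions`; the function names and the `fix_success_pct` / `avg_fix_time_ms` keys are assumptions, while `count`, `fixed_count`, and `escalated_count` match the prompt placeholders.

```javascript
// Aggregate the last 24 hours of execution rows into report figures.
function summarizeExecutions(executions) {
  const byStatus = (status) => executions.filter((e) => e.status === status);
  const fixed = byStatus("success");
  const failed = byStatus("failed");
  const escalated = byStatus("escalated");

  // Success rate is measured over attempted fixes only; escalations that
  // never ran a fix do not count against it.
  const attempted = fixed.length + failed.length;
  const times = fixed
    .map((e) => e.total_resolution_time_ms)
    .filter((ms) => Number.isFinite(ms));
  const avgMs = times.length
    ? Math.round(times.reduce((sum, ms) => sum + ms, 0) / times.length)
    : null;

  return {
    count: executions.length,
    fixed_count: fixed.length,
    escalated_count: escalated.length,
    fix_success_pct: attempted
      ? Math.round((100 * fixed.length) / attempted)
      : null,
    avg_fix_time_ms: avgMs,
  };
}

// Render milliseconds in the "1m 48s" style used in the report.
function formatDuration(ms) {
  const totalSeconds = Math.round(ms / 1000);
  const seconds = String(totalSeconds % 60).padStart(2, "0");
  return `${Math.floor(totalSeconds / 60)}m ${seconds}s`;
}
```

Counting in code rather than asking the model to count keeps the figures in the report reproducible from the database.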

---

## Files to Create

| File | Purpose |
|------|---------|
| `supabase/migrations/YYYYMMDD_self_healing_ops.sql` | Database schema |
| `supabase/sop-seed.sql` | Initial 20 SOPs |
| `src/services/selfHealing/sopMatcher.js` | Match alerts to SOPs |
| `src/services/selfHealing/decisionEngine.js` | Confidence scoring, decisions |
| `src/services/selfHealing/executor.js` | Safe fix execution |
| `src/services/selfHealing/verifier.js` | Verify fixes worked |
| `src/services/selfHealing/opsBrain.js` | Main orchestrator |
| `src/jobs/dailyOpsReport.js` | 7am daily ops summary |
| `src/routes/ops.js` | API for ops dashboard |

---

## Integration Points

### 1. frequentAudit.js (Primary Trigger)
After creating the `intelligence_alerts` row, call `opsBrain.evaluate(alertId)`.

### 2. criticalAlerts.js (Secondary Trigger)
After sending the Slack alert, call `opsBrain.evaluate(alertId)`.

### 3. slack.js (Notifications)
Add `sendSelfHealingNotification()` with approval buttons

### 4. learning.js (Pattern Extension)
Track SOP execution outcomes for improvement suggestions
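
When wiring the triggers, a failure inside the ops brain should never break the audit or alert job that called it. A minimal sketch of such a hook, assuming `opsBrain.evaluate(alertId)` may be synchronous or return a promise; `triggerOpsBrain` is a hypothetical wrapper name.

```javascript
// Fire-and-forget wrapper: the caller never sees an exception or rejection.
function triggerOpsBrain(opsBrain, alertId, logError = console.error) {
  try {
    // Promise.resolve() accepts both plain values and promises.
    return Promise.resolve(opsBrain.evaluate(alertId)).catch((err) => {
      logError(`opsBrain.evaluate(${alertId}) rejected:`, err);
    });
  } catch (err) {
    // evaluate() threw synchronously.
    logError(`opsBrain.evaluate(${alertId}) threw:`, err);
    return Promise.resolve();
  }
}
```

In `frequentAudit.js` the call would sit right after the alert insert, without `await`, so the audit loop keeps its current timing.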

---

## Phased Implementation

### Phase 1: Foundation (Week 1) - DIAGNOSIS ONLY
- [ ] Create database schema
- [ ] Create SOP matcher service
- [ ] Create diagnosis engine
- [ ] Seed 10 initial SOPs (diagnosis only)
- [ ] Hook into frequentAudit.js
- [ ] Slack notifications for diagnosis results
- **NO auto-fixes** - just detection and reporting

### Phase 2: Approval Workflow (Week 2)
- [ ] Create decision engine with confidence scoring
- [ ] Create Slack approval buttons
- [ ] Create executor service
- [ ] Create verifier service
- [ ] Audit logging
- **Human approval required for ALL fixes**

### Phase 3: Cautious Auto-Fix (Week 3)
- [ ] Enable auto-fix for 5 lowest-risk SOPs
- [ ] Implement cooldowns and rate limits
- [ ] Rollback capability
- [ ] Learning stats tracking
- [ ] Ops dashboard UI

### Phase 4: Full Self-Healing (Week 4+)
- [ ] Enable auto-fix for remaining low-risk SOPs
- [ ] SOP improvement from failures
- [ ] Daily ops report job
- [ ] Predictive issue detection

---

## Initial SOP Seed List (20 Runbooks)

### Infrastructure (5)
1. `database-connection-dropped` - DB connectivity issues
2. `server-memory-high` - Memory pressure
3. `ssl-cert-expiring` - Certificate renewal
4. `build-failed` - Deployment issues
5. `asset-404` - Missing frontend assets

### Integrations (5)
6. `twilio-sms-delivery-low` - SMS delivery problems
7. `openai-rate-limited` - GPT rate limiting
8. `deepgram-transcription-failed` - Transcription issues
9. `supabase-connection-dropped` - DB pool exhaustion
10. `slack-webhook-failed` - Notification failures

### Data (5)
11. `lead-sync-stuck` - Platform sync issues
12. `sequence-enrollment-stuck` - Sequence processor
13. `duplicate-leads-detected` - Data integrity
14. `conversion-not-tracked` - Conversion tracking
15. `orphaned-communications` - Data cleanup

### API/Business (5)
16. `endpoint-500-error` - API errors
17. `rate-limit-exceeded` - User rate limits
18. `no-calls-in-2-hours` - Business anomaly
19. `openai-spend-spike` - Cost alert
20. `twilio-spend-spike` - Cost alert

---

## Build Effort

| Phase | What | Hours |
|-------|------|-------|
| 1 | Monitor agents + basic alerts | 8-12 |
| 2 | SOP library (20 runbooks) | 6-10 |
| 3 | Decision engine + execution | 12-16 |
| 4 | Verification + learning | 8-12 |
| 5 | Full 50+ runbooks | 10-15 |
| **Total** | **Full 0.1% system** | **45-65 hrs** |

---

## What Makes It 0.1%

| Basic Monitoring | 0.1% Self-Healing |
|------------------|-------------------|
| Alert when broken | Detect + diagnose + fix + verify |
| Manual intervention | Auto-fix 80%+ of issues |
| Static runbooks | SOPs that improve from every fix |
| One alert channel | Smart escalation (Slack vs Pushover) |
| React to problems | Predict problems before they happen |
| Human fixes at 3am | System fixes, you sleep |

---

## Success Metrics

| Metric | Target |
|--------|--------|
| Issues auto-fixed | 80%+ without human |
| Fix time | <5 minutes average |
| Fix success rate | 95%+ |
| False escalations | <5% |
| SOP coverage | 90% of common issues |

---

## Risk Mitigation

1. **Start diagnosis-only** - Build confidence before enabling fixes
2. **Human approval first** - All fixes need approval in Phase 2
3. **Gradual enablement** - Only enable auto-fix after proven success
4. **Cooldowns** - Prevent rapid repeated fixes
5. **Audit trail** - Every action logged
6. **Rollback** - Revert when possible
7. **Never-auto-fix list** - Critical operations always escalate
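
The cooldown item above can be enforced with the `cooldown_minutes` and `max_executions_per_hour` columns from the schema. A minimal sketch, assuming the caller passes in the start times of this SOP's recent executions; `canExecuteNow` is a hypothetical name.

```javascript
const HOUR_MS = 60 * 60 * 1000;

// Decide whether an SOP may run right now.
// pastRuns: Date objects or ISO strings for this SOP's previous executions.
function canExecuteNow(sop, pastRuns, now = new Date()) {
  const nowMs = now.getTime();
  const runMs = pastRuns.map((t) => new Date(t).getTime());

  // Cooldown: block if any run started less than cooldown_minutes ago.
  const cooldownMs = sop.cooldown_minutes * 60 * 1000;
  if (runMs.some((t) => nowMs - t < cooldownMs)) {
    return { allowed: false, reason: "cooldown" };
  }

  // Rate limit: block if the last hour already holds the maximum.
  const runsLastHour = runMs.filter((t) => nowMs - t < HOUR_MS).length;
  if (runsLastHour >= sop.max_executions_per_hour) {
    return { allowed: false, reason: "rate_limit" };
  }

  return { allowed: true, reason: null };
}
```

With the schema defaults (30-minute cooldown, 3 per hour) the cooldown is always the binding limit; the hourly cap only matters for SOPs configured with a shorter cooldown.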

---

## When to Build

| Priority | When |
|----------|------|
| Lead assignment | NOW |
| Hire callers | NOW |
| Get revenue | NOW |
| **Self-healing ops** | **Month 2-3** |

This is a strategic investment for scale. Build it when you have revenue and need to reduce ops burden.