# LAUNCH-MONITORING.md — Weekend Launch Ops Protocol

*Cog's operational monitoring for RateRight launch, Feb 22-23 2026*

**Author:** Cog (Fleet Ops)  
**Created:** 2026-02-20  
**Launch window:** Saturday Feb 22 morning → Sunday Feb 23 evening  
**Builder's checklist:** `/home/ccuser/the-50-dollar-app/LAUNCH-CHECKLIST.md` (60+ points)

This document covers what **Cog monitors during launch** — not what Builder tests or what Harper tracks financially. This is the ops observability layer.

---

## Pre-Launch Issues (Flagged Now)

### 🔴 CRITICAL: PM2 + Systemd Conflict on Port 3000

**Finding:** PM2 has 411,854 restarts with EADDRINUSE errors. Both `rateright-app.service` (systemd) and PM2 are configured to manage the app on port 3000. Systemd currently holds the port; PM2 keeps crash-looping trying to bind it.

**Impact on launch:** If the systemd service restarts and PM2 grabs port 3000 first, the reverse proxy will route traffic to PM2's unstable instance instead of the stable systemd one. Whichever manager binds first wins, so every restart is a race.

**Fix before launch (Sentinel/Builder):**
```bash
# Option A: Disable PM2 (recommended — systemd is working fine)
pm2 stop rateright-app
pm2 delete rateright-app
pm2 save

# Option B: Disable systemd (if PM2 is preferred)
sudo systemctl stop rateright-app
sudo systemctl disable rateright-app
```

**Status:** MUST FIX before Saturday. Escalated to Sentinel.
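
Once either option has been applied, a quick check along these lines could confirm that only one process manager still owns the app. This is a minimal sketch: the helper name `exactly_one_manager` is made up for illustration, and it assumes the PM2 process name contains `rateright`, as in the commands above.

```bash
#!/bin/bash
# Sketch only: exactly_one_manager is a hypothetical helper, not an existing script.

# Returns 0 when exactly one of systemd / PM2 is managing the app,
# non-zero when both are (the conflict) or neither is (app not managed).
exactly_one_manager() {
  local systemd_active=0 pm2_present=0
  if systemctl is-active --quiet rateright-app; then
    systemd_active=1
  fi
  # "pm2 list" prints a table of managed processes; any rateright row counts.
  if pm2 list 2>/dev/null | grep -q rateright; then
    pm2_present=1
  fi
  [ $((systemd_active + pm2_present)) -eq 1 ]
}

# Example:
#   exactly_one_manager || echo "PM2/systemd conflict, or app not managed at all"
```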

### 🟡 WARNING: Next.js Server Action Errors

App logs show: `Error: Failed to find Server Action "x". This request might be from an older or newer deployment.`

**Impact:** Stale browser tabs after a deploy will get this error. Not critical for fresh visitors on launch day, but Michael testing pre-deploy could hit it.

**Mitigation:** Hard-refresh (or close and reopen) any tab that was open before the deploy, clearing the browser cache if the error persists. No code fix needed.

### 🟡 WARNING: Growth Engine Cold Start (Railway Free Tier)

The Railway app sleeps when idle, so the first request after a quiet period returns a 404 or takes 3+ seconds. Sentinel confirmed this is expected behaviour.

**Impact on launch:** GE is CRM, not customer-facing. No launch impact. But Cog's data freshness checks may intermittently report it as DOWN.

**Mitigation:** Cog treats first GE failure as "cold start" — re-check after 10s before alerting.
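
The re-check rule could be sketched as a small wrapper around whatever probe Cog uses. The names `check_with_retry`, `RETRY_DELAY` and `GE_URL` are assumptions for illustration; the real freshness script may be structured differently.

```bash
#!/bin/bash
# Sketch only: check_with_retry, RETRY_DELAY and GE_URL are hypothetical names.

RETRY_DELAY="${RETRY_DELAY:-10}"   # seconds to wait before the second attempt

# Runs the given command once. If it fails, waits RETRY_DELAY seconds and
# runs it a second time, returning that second attempt's exit status.
check_with_retry() {
  if "$@"; then
    return 0
  fi
  sleep "$RETRY_DELAY"
  "$@"
}

# Example (GE_URL is a placeholder for the Growth Engine endpoint):
#   check_with_retry curl -sf -o /dev/null --max-time 10 "$GE_URL" \
#     || echo "GE still failing after cold-start grace period"
```

Only a probe that fails twice in a row, 10 seconds apart, would then be reported as DOWN.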

---

## Launch Day Monitoring Checklist

### Tier 1: CRITICAL (Check Every 5 Minutes During First Hour, Then Every 15)

| # | What to Monitor | How | Alert Threshold | Escalation |
|---|----------------|-----|-----------------|------------|
| 1 | **App responding** | `curl -s -o /dev/null -w "%{http_code}" https://rateright.com.au` | Non-200 for 2 consecutive checks | → Sentinel immediately |
| 2 | **App response time** | `curl -s -o /dev/null -w "%{time_total}" https://rateright.com.au` | >5s for 3 consecutive checks | → Sentinel |
| 3 | **Stripe webhook endpoint** | `curl -s -o /dev/null -w "%{http_code}" -X GET https://rivet.rateright.com.au/api/webhooks/stripe` | Non-405 (405 = correct) | → Builder + Michael |
| 4 | **Supabase reachable** | `curl -s -o /dev/null -w "%{http_code}" https://eciepjpcyfurbkfzekok.supabase.co/rest/v1/ -H "apikey: <anon_key>"` | Non-200 | → Sentinel + Michael |
| 5 | **Process running** | `systemctl is-active rateright-app` | Not "active" | → Sentinel (auto-restart) |
| 6 | **Error log spikes** | `journalctl -u rateright-app --since "5 min ago" \| grep -ci "error\|fatal\|unhandled"` | >10 errors in 5 min | → Builder |
| 7 | **Memory usage** | `free -m \| awk '/Mem:/{print $3/$2*100}'` | >85% RAM | → Sentinel |
| 8 | **Disk space** | `df -h / \| awk 'NR==2{print $5}'` | >80% | → Sentinel |

### Tier 2: IMPORTANT (Check Every 30 Minutes)

| # | What to Monitor | How | Alert Threshold | Escalation |
|---|----------------|-----|-----------------|------------|
| 9 | **SSL certificate** | `echo \| openssl s_client -connect rateright.com.au:443 2>/dev/null \| openssl x509 -noout -dates` | <7 days to expiry | → Sentinel |
| 10 | **Nginx status** | `systemctl is-active nginx` | Not "active" | → Sentinel |
| 11 | **Fleet agent health** | `node /home/ccuser/shared/scripts/fleet-cli.js status` | Any agent RED during launch | → Cog handles (Rivet backup protocol) |
| 12 | **VPS load average** | `uptime` | Load >4.0 (2 vCPU box) | → Sentinel |
| 13 | **PM2 not interfering** | `pm2 list \| grep rateright` | Still exists / running | → Sentinel (must be deleted pre-launch) |
| 14 | **OpenAI API** | Check Builder logs for 429/503 | Rate limit or quota hit | → Builder (degrade AI gracefully) |
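
The SSL command in row 9 prints the certificate dates but does not compare them with the 7-day threshold. A sketch of that comparison follows, assuming GNU `date` (needed for `-d`); the helper name `days_until_expiry` is hypothetical.

```bash
#!/bin/bash
# Sketch only: days_until_expiry is a hypothetical helper; requires GNU date.

# days_until_expiry NOT_AFTER [NOW_EPOCH]
# NOT_AFTER is the date part of openssl's "notAfter=..." line.
# Prints the number of whole days left (negative once expired).
days_until_expiry() {
  local not_after="$1" now="${2:-$(date +%s)}"
  local expiry
  expiry=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}

# Example:
#   not_after=$(echo | openssl s_client -servername rateright.com.au \
#     -connect rateright.com.au:443 2>/dev/null \
#     | openssl x509 -noout -enddate | cut -d= -f2)
#   [ "$(days_until_expiry "$not_after")" -lt 7 ] && echo "cert expires in under 7 days"
```

As an alternative, `openssl x509 -checkend 604800` exits non-zero when the certificate expires within the next 604800 seconds (7 days), which avoids the date arithmetic entirely.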

### Tier 3: BUSINESS METRICS (Check Hourly)

| # | What to Monitor | How | What It Tells Us |
|---|----------------|-----|-----------------|
| 15 | **New signups** | Supabase `profiles` count | User acquisition rate |
| 16 | **Worker vs contractor ratio** | `profiles` by role | Marketplace balance |
| 17 | **Jobs posted** | Supabase `jobs` count | Contractor engagement |
| 18 | **Matches created** | Supabase `matches` count | Matching working |
| 19 | **Payments completed** | Supabase `payments` WHERE status='charged' | REVENUE! 🎉 |
| 20 | **Payment failures** | Stripe dashboard | Payment flow broken? |
| 21 | **SMS delivery** | Twilio dashboard | Phone verification working? |
| 22 | **Error rate vs traffic** | Error count / request count | Is the error rate acceptable? |

---

## Automated Monitoring Script

Run this via cron every 5 minutes during launch:

```bash
#!/bin/bash
# launch-monitor.sh — Cog's launch day watchdog
# Usage: */5 * * * * bash /home/ccuser/cog/scripts/launch-monitor.sh

set -o pipefail

LOGFILE="/home/ccuser/cog/memory/launch-monitor.log"
ALERT_FILE="/home/ccuser/cog/memory/launch-alerts.jsonl"
NOW=$(TZ=Australia/Sydney date '+%Y-%m-%d %H:%M:%S AEDT')
FAILURES=0
ALERTS=""

log() { echo "[$NOW] $1" >> "$LOGFILE"; }
alert() {
  local severity="$1" target="$2" msg="$3"
  echo "{\"ts\":\"$NOW\",\"severity\":\"$severity\",\"target\":\"$target\",\"msg\":\"$msg\"}" >> "$ALERT_FILE"
  ALERTS="${ALERTS}\n${severity}: ${msg} → ${target}"
  log "ALERT [$severity] $msg → $target"
}

# --- Tier 1 Checks ---

# 1. App responding
# A single request supplies both values, so status and timing describe the same response
read -r HTTP_CODE RESPONSE_TIME < <(curl -s -o /dev/null -w "%{http_code} %{time_total}\n" --max-time 10 https://rateright.com.au 2>/dev/null)
if [ "$HTTP_CODE" != "200" ]; then
  alert "CRITICAL" "sentinel" "App returned HTTP $HTTP_CODE (expected 200)"
  ((FAILURES++))
else
  log "App OK: HTTP $HTTP_CODE in ${RESPONSE_TIME}s"
fi

# 2. Response time
RT_MS=$(echo "$RESPONSE_TIME * 1000" | bc 2>/dev/null | cut -d. -f1)
if [ "${RT_MS:-0}" -gt 5000 ]; then
  alert "WARNING" "sentinel" "App response time ${RESPONSE_TIME}s (threshold: 5s)"
fi

# 3. Process running
APP_STATUS=$(systemctl is-active rateright-app 2>/dev/null)
if [ "$APP_STATUS" != "active" ]; then
  alert "CRITICAL" "sentinel" "rateright-app service is $APP_STATUS"
  ((FAILURES++))
fi

# 4. Error log spike
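# grep -c already prints 0 (and exits 1) when nothing matches, so "|| echo 0" would emit a second line
ERROR_COUNT=$(journalctl -u rateright-app --since "5 min ago" 2>/dev/null | grep -ci "error\|fatal\|unhandled" || true)
if [ "${ERROR_COUNT:-0}" -gt 10 ]; then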
  alert "HIGH" "builder" "Error spike: $ERROR_COUNT errors in last 5 min"
fi

# 5. Memory
MEM_PCT=$(free | awk '/Mem:/{printf "%.0f", $3/$2*100}')
if [ "${MEM_PCT:-0}" -gt 85 ]; then
  alert "HIGH" "sentinel" "RAM at ${MEM_PCT}% (threshold: 85%)"
fi

# 6. Disk
DISK_PCT=$(df / | awk 'NR==2{print $5}' | tr -d '%')
if [ "${DISK_PCT:-0}" -gt 80 ]; then
  alert "HIGH" "sentinel" "Disk at ${DISK_PCT}% (threshold: 80%)"
fi

# 7. Load (compared as a float; integer truncation would let 4.9 pass a >4.0 threshold)
LOAD=$(uptime | awk -F'load average: ' '{print $2}' | cut -d, -f1 | tr -d ' ')
if awk -v load="${LOAD:-0}" 'BEGIN { exit !(load > 4.0) }'; then
  alert "WARNING" "sentinel" "Load average $LOAD (threshold: 4.0 on 2-vCPU)"
fi

# --- Summary ---
log "Check complete: $FAILURES critical failures, $ERROR_COUNT app errors, RAM ${MEM_PCT}%, disk ${DISK_PCT}%, load $LOAD"

if [ "$FAILURES" -gt 0 ] && [ -n "$ALERTS" ]; then
  # Write to Cog's inbox to trigger escalation on next heartbeat
  echo "{\"alert_time\":\"$NOW\",\"failures\":$FAILURES,\"alerts\":\"$ALERTS\"}" >> /home/ccuser/cog/memory/launch-pending-alerts.jsonl
fi
```
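
The script above does not yet cover Tier 1 checks 3 (Stripe webhook) and 4 (Supabase). A generic status-code helper could be one way to add them. This is a sketch: `http_status_is` and `SUPABASE_ANON_KEY` are hypothetical names, and the `alert` calls in the example refer to the function defined in the script.

```bash
#!/bin/bash
# Sketch only: http_status_is and SUPABASE_ANON_KEY are hypothetical names.

# http_status_is EXPECTED URL [extra curl args...]
# Returns 0 when URL answers with exactly the EXPECTED HTTP status code.
http_status_is() {
  local expected="$1" url="$2" code
  shift 2
  code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 10 "$@" "$url" 2>/dev/null)
  [ "$code" = "$expected" ]
}

# Example additions to launch-monitor.sh:
#   http_status_is 405 https://rivet.rateright.com.au/api/webhooks/stripe \
#     || alert "CRITICAL" "builder" "Stripe webhook endpoint did not return 405"
#   http_status_is 200 https://eciepjpcyfurbkfzekok.supabase.co/rest/v1/ \
#     -H "apikey: $SUPABASE_ANON_KEY" \
#     || alert "CRITICAL" "sentinel" "Supabase REST endpoint not returning 200"
```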

---

## Michael Escalation Thresholds

**Alert Michael IMMEDIATELY if:**

| Condition | Why | Message Format |
|-----------|-----|---------------|
| App down >5 min | Customers can't access platform | "🔴 rateright.com.au is DOWN. Sentinel investigating. ETA: [X]" |
| Payment flow broken | Revenue stream blocked | "🔴 Payments failing. [X] attempts failed. Stripe webhook [status]." |
| Database unreachable | All features broken | "🔴 Supabase unreachable. Entire app non-functional. Sentinel investigating." |
| >50 errors in 15 min | Something fundamentally broken | "🔴 Error storm: [N] errors in 15 min. Most common: [error]. Builder investigating." |
| First payment received | CELEBRATION 🎉 | "🎉 FIRST PAYMENT! $50 from [company]. Platform working. Revenue: $1." |

**Do NOT alert Michael for:**
- Brief cold-start delays (<10s)
- Single transient errors
- AI features degraded (not customer-blocking)
- Fleet agent issues (Cog handles)
- GE CRM intermittent (internal tool)
- Anything Sentinel or Builder can fix autonomously
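
The "App down >5 min" condition depends on how long the outage has lasted, not on a single failed check. A sketch of one way to track that between cron runs is shown here; `down_longer_than` and `DOWN_SINCE_FILE` are hypothetical names and the `/tmp` path is a placeholder.

```bash
#!/bin/bash
# Sketch only: down_longer_than and DOWN_SINCE_FILE are hypothetical names.

DOWN_SINCE_FILE="${DOWN_SINCE_FILE:-/tmp/rateright-down-since}"   # placeholder path

# down_longer_than STATE NOW_EPOCH [LIMIT_SECONDS]
# Call on every check with STATE "up" or "down".
# Returns 0 once the app has been continuously down for more than LIMIT_SECONDS (default 300).
down_longer_than() {
  local state="$1" now="$2" limit="${3:-300}" since
  if [ "$state" = "up" ]; then
    rm -f "$DOWN_SINCE_FILE"       # outage over, forget the start time
    return 1
  fi
  if [ ! -f "$DOWN_SINCE_FILE" ]; then
    echo "$now" > "$DOWN_SINCE_FILE"   # first failed check marks the outage start
  fi
  since=$(cat "$DOWN_SINCE_FILE")
  [ $((now - since)) -gt "$limit" ]
}

# Example:
#   down_longer_than down "$(date +%s)" && echo "page Michael: app down >5 min"
```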

---

## Launch Day Agent Roles

| Agent | Launch Role | What Cog Expects From Them |
|-------|------------|---------------------------|
| **Builder** | On-call for hotfixes | Respond to code issues <15 min |
| **Sentinel** | Infrastructure watch | Respond to infra alerts <5 min |
| **Susan** | Lead monitoring | Track any inbound leads from launch |
| **Harper** | Revenue tracking | Log first payments, track Stripe |
| **Herald** | Comms standby | Ready to draft announcements |
| **Radar** | Market monitoring | Watch for competitor reactions |
| **Rivet** | Strategic coordination | Triage decisions if priorities conflict |
| **Cog** | Ops monitoring (THIS DOC) | Run monitoring loop, escalate, coordinate |

---

## Launch Day Timeline

### Friday Feb 21 (Pre-Launch)

| Time | Action | Owner |
|------|--------|-------|
| Morning | Fix PM2/systemd conflict | Sentinel + Builder |
| Afternoon | Run Builder's full 60-point checklist | Builder |
| Evening | **Michael's window (7-8:30 PM):** Test real signup + payment | Michael + Builder |
| After test | Record last-known-good commit hash | Builder |
| 10 PM | Deploy launch-monitor.sh cron | Cog |
| 10 PM | Verify all fleet agents healthy | Cog |
| 10 PM | Ensure voice-brief-data.json fresh | Cog |

### Saturday Feb 22 (Launch Day)

| Time | Action | Owner |
|------|--------|-------|
| 5:15 AM | Morning voice brief includes launch status | Cog (data freshness) |
| 6:00 AM | Cog verifies all Tier 1 checks pass | Cog |
| Launch time | Michael shares link / starts soft outreach | Michael |
| First hour | Tier 1 checks every 5 min | Cog (via cron) |
| First hour | Watch error logs live (`journalctl -u rateright-app -f`) | Sentinel |
| First hour | Watch Stripe dashboard | Harper |
| After hour 1 | Reduce to 15-min checks if stable | Cog |
| 6:00 PM | Evening voice brief with launch metrics | Cog |
| 8:00 PM | First-day summary to Michael | Rivet (via Cog data) |

### Sunday Feb 23 (Day 2)

| Time | Action | Owner |
|------|--------|-------|
| 5:15 AM | Morning brief with Day 1 stats | Cog |
| All day | Standard 15-min monitoring | Cog |
| 6:00 PM | Weekend launch summary | Rivet |
| Evening | Reduce to normal monitoring cadence | Cog |
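
The cadence in the timelines above (every 5 minutes in the first hour, every 15 afterwards) could be handled without editing the crontab mid-launch: keep the 5-minute cron entry and let the script skip the ticks it does not need. This is a sketch; `should_run_now` and `LAUNCH_EPOCH` are hypothetical names, and the launch timestamp would have to be recorded when Michael shares the link.

```bash
#!/bin/bash
# Sketch only: should_run_now and LAUNCH_EPOCH are hypothetical names.

# should_run_now LAUNCH_EPOCH NOW_EPOCH
# Returns 0 when a check should run: on every tick until one hour after launch,
# and afterwards only on ticks whose minute is a multiple of 15.
should_run_now() {
  local launch_epoch="$1" now="$2"
  local elapsed=$(( now - launch_epoch ))
  if [ "$elapsed" -lt 3600 ]; then
    return 0
  fi
  local minute=$(( (now / 60) % 60 ))   # minute of the hour (UTC; same in Sydney)
  [ $(( minute % 15 )) -eq 0 ]
}

# Example at the top of launch-monitor.sh:
#   should_run_now "$LAUNCH_EPOCH" "$(date +%s)" || exit 0
```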

---

## Voice Brief Enhancement for Launch

During launch weekend, `voice-brief-data.json` should include:

```json
{
  "launch_status": "LIVE",
  "launch_metrics": {
    "signups_total": 0,
    "workers": 0,
    "contractors": 0,
    "jobs_posted": 0,
    "matches": 0,
    "payments": 0,
    "revenue_total": "$0",
    "first_payment": null,
    "uptime_pct": "100%",
    "errors_24h": 0
  }
}
```

Cog will add these fields to the data freshness script on Friday.

---

## Post-Launch Metrics Dashboard (For Monday)

| Metric | Source | Target |
|--------|--------|--------|
| Total signups | Supabase `profiles` | Any >0 is a win |
| Worker:Contractor ratio | Supabase `profiles` by role | Ideally 3:1+ |
| Jobs posted | Supabase `jobs` | ≥1 |
| Matches made | Supabase `matches` | ≥1 |
| Payments collected | Stripe | First $50 = product-market signal |
| Uptime | launch-monitor.log | >99.5% |
| Error rate | launch-monitor.log | <1% of requests |
| Avg response time | launch-monitor.log | <2s |
| SMS delivered | Twilio | >95% delivery rate |

---

## Rollback Triggers (Ops Perspective)

| Condition | Action |
|-----------|--------|
| App crashes >3 times in 30 min | Rollback to last-known-good commit |
| Payment flow broken, no fix in 30 min | Put up a maintenance page, fix offline |
| Database corruption detected | STOP EVERYTHING, notify Michael immediately |
| SSL certificate expired | Emergency renewal via certbot |
| VPS unresponsive | DigitalOcean console access, hard reboot |

---

*Launch day is a product demo. Every minute of uptime, every payment that processes, every error we catch before Michael sees it — that's OpsMan proving itself.*
