self-hosting
Runbook
On call response procedures, diagnostic commands, and rollback playbooks for operators running Pilot in production. 1. Check /health — curl https://<your domain /health. Anything non 200 is a hard incident. 2. Check SentPublicSource-ownedMarkdown export
On-call response procedures, diagnostic commands, and rollback playbooks for operators running Pilot in production.
0. Start Here (Every Incident)
- Check
/health—curl https://<your-domain>/health. Anything non-200 is a hard incident. - Check Sentry — are there recent unhandled errors spiking?
- Check Grafana dashboards —
api.json,orchestrator.json,infrastructure.json. - Check DigitalOcean Droplet and Compose status —
bash infra/digitalocean/deploy.sh status. - Declare severity and communicate on your internal channel.
1. Common Incidents
1A. Auth broken — users cannot log in
Symptoms: users report not receiving email magic link, or verify returns 401 despite correct code.
- Check
/health→checks.dbtrue? - Check
EMAIL_PROVIDERconfig — if it reverted tonoop, emails are not sent; only local development auth responses expose the code.ssh root@$DO_DROPLET_IP 'cd /opt/pilot/current && docker compose -f infra/digitalocean/docker-compose.yml exec pilot env | grep EMAIL' - Check Resend/SMTP dashboard for bounces or rate limiting.
- Check the
sessionstable — are rows being written?SELECT COUNT(*), MAX(created_at) FROM sessions WHERE channel = 'email_pending'; - If Resend API is down: temporarily switch to a backup SMTP provider in
.env.production.pilot, redeploy withbash infra/digitalocean/deploy.sh deploy, and verify login.
1B. Database down
Symptoms: /health returns 503 with checks.db: false.
- DigitalOcean Compose Postgres:
docker compose --env-file .env.production.shared -f infra/digitalocean/docker-compose.yml exec postgres psql -U helm -d pilot -c '\l'. - Check DB logs:
docker compose --env-file .env.production.shared -f infra/digitalocean/docker-compose.yml logs --tail=200 postgres. - Connection pool exhausted? Query
SELECT count(*) FROM pg_stat_activity. If >80 increaseDB_POOL_MAXor investigate slow queries. - If the DB is truly down, restore the latest verified backup on a replacement Droplet, then point DNS at the new Droplet IP.
1C. HELM sidecar or LLM provider outage
Symptoms: /health returns degraded with checks.helm: "unreachable", or agent tasks fail with HELM unreachability / upstream provider errors.
- Check HELM first:
docker compose --env-file .env.production.shared -f infra/digitalocean/docker-compose.yml logs --tail=200 helmanddocker compose --env-file .env.production.shared -f infra/digitalocean/docker-compose.yml exec helm wget -qO- http://localhost:8081/healthz. - Check provider status pages for the upstream configured on HELM (
HELM_UPSTREAM_URL). In production, provider keys belong on the HELM sidecar, not on the Pilot app. - If HELM is healthy but the upstream is down, rotate the sidecar upstream/key in
.env.production.helmand redeploy. - Enable kill switch temporarily: set
policy.killSwitch=truevia a workspace-settings update. Tasks will be blocked rather than retry-storming.
1D. Rate-limiting users unexpectedly
Symptoms: users report 429 on normal usage.
- Check Postgres token buckets in
ratelimit_bucketsfor the reported subject and route class. - If a single bad IP is hammering, block it with Cloudflare, Caddy, or the DigitalOcean Cloud Firewall before changing app limits.
- Bump limits temporarily via code — update
services/gateway/src/index.tsrate-limit configs and ship.
1E. Disk full / storage exhausted
- Check Droplet and Docker volume usage with
df -handdocker system df. - Rotate old Pino logs (if writing to disk):
find /app/logs -mtime +7 -delete. - Clean old S3 backups via
scripts/backup.shretention (default 30 days). - If
STORAGE_PROVIDER=local, migrate to S3.
1F. Task stuck in 'running' status
Symptoms: a task shows status='running' for hours.
- Expected: the reaper job should move it to 'failed' after 10 minutes (
tasks.reap_stuckscheduled every 5 min). - If not happening, check pg-boss health:
SELECT * FROM pgboss.schedule WHERE name = 'tasks.reap_stuck'. - Manual reap:
UPDATE tasks SET status='failed' WHERE id='<uuid>' AND status='running'.
1G. YC private sync / Scrapling runtime broken
Symptoms: YC connector validates in UI but private sync jobs fail, or operator fetches return browser/runtime errors.
- Check the configured Python runtime:
PYTHON_BIN=${PYTHON_BIN:-./.venv-pipelines/bin/python} $PYTHON_BIN scripts/verify-python-runtime.py - If browser binaries are missing, re-run:
bash scripts/install-python-runtime.sh - Verify
ENCRYPTION_KEYdid not rotate without a connector/session migration plan. - Confirm the
ycconnector still shows a validated session in the workspace settings UI. - If validation fails after a YC auth change, capture and save a fresh session snapshot.
2. Diagnostic Commands
Health
curl https://<host>/health | jq # full health JSON
curl https://<host>/metrics | head -50 # Prometheus metrics sample
Logs (DigitalOcean Compose)
ssh root@$DO_DROPLET_IP
cd /opt/pilot/current
docker compose -f infra/digitalocean/docker-compose.yml logs --tail=200 pilot
docker compose -f infra/digitalocean/docker-compose.yml logs pilot | grep ERROR
docker compose -f infra/digitalocean/docker-compose.yml logs pilot | grep requestId=XXX
Database (DigitalOcean Compose Postgres)
docker compose -f infra/digitalocean/docker-compose.yml exec postgres psql -U helm -d pilot
# Most active tables
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
ORDER BY n_live_tup DESC LIMIT 10;
# Pool saturation
SELECT count(*), state FROM pg_stat_activity GROUP BY state;
# Long-running queries
SELECT pid, now() - pg_stat_activity.query_start AS duration, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - pg_stat_activity.query_start > interval '30 seconds';
pg-boss queue
SELECT name, state, COUNT(*) FROM pgboss.job GROUP BY name, state;
SELECT * FROM pgboss.archive ORDER BY completed_on DESC LIMIT 20; -- recently completed
Python / Scrapling
PYTHON_BIN=${PYTHON_BIN:-./.venv-pipelines/bin/python} $PYTHON_BIN scripts/verify-python-runtime.py
ls -la ${PLAYWRIGHT_BROWSERS_PATH:-./.cache/ms-playwright}
ls -la ${PATCHRIGHT_BROWSERS_PATH:-./.cache/ms-patchright}
3. Rollback Procedure
3A. Application rollback
- Find the last known-good release:
git log --oneline -20 - Roll back to the previous release:
DO_DROPLET_IP=<ip> bash infra/digitalocean/deploy.sh rollback - Watch health:
while true; do curl -s https://<host>/health | jq -c; sleep 2; done
3B. Database rollback (destructive — last resort)
- Stop the gateway:
docker compose -f infra/digitalocean/docker-compose.yml stop pilot. - Take a snapshot of current DB state.
- Restore from last known-good backup:
bash scripts/backup.sh restore <backup-file.sql.gz.gpg> - Start gateway:
docker compose -f infra/digitalocean/docker-compose.yml up -d pilot. - Warning: connector tokens encrypted with a since-rotated ENCRYPTION_KEY will be unreadable. Plan carefully.
3C. Migration rollback
Drizzle migrations are forward-only by default. To roll a migration back:
- Write a reverse migration SQL file manually (e.g.,
0005_revert_xyz.sqlthat drops the column added in0004). - Apply it normally via
npm run db:push. - Redeploy the app pinned to the pre-
0004schema version.
4. Escalation
- SEV-1 (data loss, security breach): Wake the on-call immediately. Preserve state (snapshots, logs) before attempting fixes.
- SEV-2 (service down for all users): Respond within 15 min. Start incident channel.
- SEV-3 (degraded, subset of users): Respond within 1h. File ticket, fix during business hours.
5. Post-Incident Review Template
# Incident YYYY-MM-DD — <one-line summary>
## Timeline
- HH:MM UTC — Detected (how?)
- HH:MM UTC — First responder on call
- HH:MM UTC — Mitigated (what was done?)
- HH:MM UTC — Fully resolved
## Impact
- Users affected: <count> / <total>
- Features affected: <list>
- Data loss: <yes/no + scope>
## Root Cause
<what actually caused it>
## Contributing Factors
<what made it worse or prevented faster recovery>
## Action Items
- [ ] <fix root cause permanently>
- [ ] <detect earlier next time>
- [ ] <respond faster next time>
- [ ] <prevent this class of issue>
## Lessons Learned
<what we knew vs. didn't know; what we'd do differently>