Operator runbook
This document describes how to operate the hosted runtime safely: health checks, pricing modes, queues, and incident triage.
Daily checks
- Open `/dashboard/control-plane` after signing in with an allowed operator email (`LEAD_OS_OPERATOR_EMAILS`).
- Confirm `SYSTEM_ENABLED` matches the intended posture (off during maintenance).
- Confirm ENABLE_LIVE_PRICING is false unless you are deliberately applying real price mutations.
- Review persisted dead-letter count and the recent dead-letter list for recurring failures.
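The flag checks above can be sketched as a small function. This is an illustration only: the flag names come from this runbook, while `checkPosture`, `IntendedPosture`, and the rule that only the literal string `"true"` enables a flag are assumptions.

```typescript
// Sketch only: flag names come from this runbook; the parsing rule is an assumption.
type Env = Record<string, string | undefined>;

interface IntendedPosture {
  systemEnabled: boolean;       // false during maintenance
  livePricingIntended: boolean; // true only when deliberately mutating real prices
}

// Assumption: a flag counts as enabled only when it is the literal string "true".
function isTrue(value: string | undefined): boolean {
  return value === "true";
}

// Returns one human-readable warning per mismatch; an empty array means no findings.
function checkPosture(env: Env, intended: IntendedPosture): string[] {
  const warnings: string[] = [];
  if (isTrue(env["SYSTEM_ENABLED"]) !== intended.systemEnabled) {
    warnings.push("SYSTEM_ENABLED does not match the intended posture");
  }
  if (isTrue(env["ENABLE_LIVE_PRICING"]) && !intended.livePricingIntended) {
    warnings.push("ENABLE_LIVE_PRICING is true but live pricing is not intended");
  }
  return warnings;
}

const postureWarnings = checkPosture(
  { SYSTEM_ENABLED: "true", ENABLE_LIVE_PRICING: "true" },
  { systemEnabled: true, livePricingIntended: false }
);
console.log(postureWarnings);
```

In a real deployment the first argument would likely be the process environment; the hard-coded object here only keeps the sketch self-contained.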
HTTP endpoints (read-only diagnostics)
| Endpoint | Auth | Purpose |
|---|---|---|
| `GET /api/health` | Public | Liveness; includes pricing worker tick summary when available. |
| `GET /api/health/deep` | Public | Postgres, Redis, DLQ row count, pricing runtime snapshot. |
| `GET /api/system` | Public | High-level flags and integration hints (no secrets). |
| `GET /docs` | Public | In-app documentation hub (links to OpenAPI JSON, SLA summary, repo Markdown). |
| `GET /docs/api` | Public | Human-readable API entry + link to `/api/docs/openapi.json`. |
| `GET /api/queue` | `CRON_SECRET` / `LEAD_OS_AUTH_SECRET` or operator session / API identity | |
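A minimal sketch of how a monitoring script might use the public diagnostic endpoints listed above. The paths come from the table; `diagnosticUrls`, `failingProbes`, the `ProbeResult` shape, and the choice to treat only HTTP 200 as healthy are assumptions made for illustration, since this runbook does not specify response codes or bodies.

```typescript
// Paths are taken from the endpoint table; everything else is illustrative.
const PUBLIC_DIAGNOSTIC_PATHS = ["/api/health", "/api/health/deep", "/api/system"];

// Builds absolute URLs for the public diagnostic endpoints.
function diagnosticUrls(baseUrl: string): string[] {
  const trimmed = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  return PUBLIC_DIAGNOSTIC_PATHS.map((path) => trimmed + path);
}

// Hypothetical shape for the outcome of one HTTP probe.
interface ProbeResult {
  url: string;
  status: number;
}

// Assumption: only HTTP 200 counts as healthy.
function failingProbes(results: ProbeResult[]): string[] {
  return results.filter((r) => r.status !== 200).map((r) => r.url);
}

const probeUrls = diagnosticUrls("https://example.invalid/");
const probeFailures = failingProbes([
  { url: probeUrls[0], status: 200 },
  { url: probeUrls[1], status: 503 },
  { url: probeUrls[2], status: 200 },
]);
console.log(probeFailures);
```

The statuses are hard-coded so the example runs without network access; an actual script would issue a GET to each URL and record the status it receives.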
Shadow vs live pricing
- Shadow (default): Recommendations are written and the structural simulation runs; live SKU mutations require `ENABLE_LIVE_PRICING=true` plus other safety checks in `src/lib/pricing/safety-policy.ts`.
- Live: Only enable after migrations `005`–`006` are applied, Postgres and optional Redis are healthy, and you accept the risk of writing real prices.
- Billing: Migration `007` adds `billing_plans`, `billing_subscriptions`, and `operator_audit_log`. Set `LEAD_OS_BILLING_ENFORCE=true` only after seeding subscription rows for each tenant; otherwise pricing ticks will stop.
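The preconditions for live mode can be written down as a pre-flight checklist. The sketch below is not the logic in `src/lib/pricing/safety-policy.ts`, which this runbook does not reproduce; `liveBlockers`, `LiveReadinessInput`, and the migration ID format are hypothetical names chosen for illustration.

```typescript
// Hypothetical input describing the current deployment state.
interface LiveReadinessInput {
  appliedMigrations: string[]; // e.g. ["005", "006"]; the ID format is an assumption
  postgresHealthy: boolean;
  redisConfigured: boolean;    // Redis is optional
  redisHealthy: boolean;
  enableLivePricing: boolean;  // true when ENABLE_LIVE_PRICING=true
}

const REQUIRED_LIVE_MIGRATIONS = ["005", "006"];

// Returns the reasons live pricing should stay off; an empty array means no blockers found.
function liveBlockers(input: LiveReadinessInput): string[] {
  const blockers: string[] = [];
  for (const id of REQUIRED_LIVE_MIGRATIONS) {
    if (input.appliedMigrations.indexOf(id) === -1) {
      blockers.push("migration " + id + " is not applied");
    }
  }
  if (!input.postgresHealthy) {
    blockers.push("Postgres is unhealthy");
  }
  // Redis only matters when it is configured.
  if (input.redisConfigured && !input.redisHealthy) {
    blockers.push("Redis is configured but unhealthy");
  }
  if (!input.enableLivePricing) {
    blockers.push("ENABLE_LIVE_PRICING is not true (shadow mode)");
  }
  return blockers;
}

const liveCheck = liveBlockers({
  appliedMigrations: ["005"],
  postgresHealthy: true,
  redisConfigured: false,
  redisHealthy: false,
  enableLivePricing: true,
});
console.log(liveCheck);
```

An empty result only means these listed preconditions hold; the additional safety checks in the policy module still apply.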
Redis and workers
- With `REDIS_URL` set, the Next.js web process no longer starts BullMQ workers. Run `npm run worker` (or the Docker Compose `worker` service) so consumers and the distributed scheduler are not duplicated inside the web server.
- Without Redis, the web app may start the memory fallback scheduler in non-production only; production web without Redis will not simulate pricing ticks.
- Worker-only escape hatch: `LEAD_OS_WORKER_ALLOW_MEMORY=true` (development) allows a memory fallback inside the worker if Redis is absent; avoid in production.
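The rules above amount to a small decision table, sketched here as a function. `schedulerPlan` and its input type are hypothetical names; the `"none"` result for a worker that has neither Redis nor the escape hatch is an assumption, since this runbook does not state what the worker does in that case.

```typescript
type ProcessRole = "web" | "worker";
type SchedulerPlan = "bullmq" | "memory-fallback" | "none";

interface RuntimeInput {
  role: ProcessRole;
  redisUrl?: string;          // REDIS_URL
  production: boolean;        // assumption: derived from the Node environment
  workerAllowMemory: boolean; // true when LEAD_OS_WORKER_ALLOW_MEMORY=true
}

// Decides which scheduler, if any, a process should start.
function schedulerPlan(input: RuntimeInput): SchedulerPlan {
  const hasRedis = Boolean(input.redisUrl);
  if (input.role === "web") {
    // With Redis, workers live in a separate process, so the web process starts nothing.
    if (hasRedis) return "none";
    // Without Redis, the memory fallback is allowed outside production only.
    return input.production ? "none" : "memory-fallback";
  }
  // Worker process: BullMQ consumers and the distributed scheduler when Redis is present.
  if (hasRedis) return "bullmq";
  // Assumption: without Redis and without the escape hatch, the worker starts nothing.
  return input.workerAllowMemory ? "memory-fallback" : "none";
}

console.log(schedulerPlan({ role: "web", production: true, workerAllowMemory: false }));
```

Writing the rules this way makes the production gap explicit: a production web process without Redis runs no pricing ticks at all.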
Control plane actions
Authenticated operators may `POST /api/operator/actions` with a JSON body (discriminated on `type`). All successful mutations append a row to `operator_audit_log` when the table exists.
| `type` | Payload | Effect |
|---|---|---|
| `dlq_replay` | `deadLetterId` | Re-enqueue persisted BullMQ payload to main or measure queue. |
| `dlq_delete` | `deadLetterId` | Remove persisted DLQ row. |
| `node_pause` / `node_resume` | `nodeKey` | Toggle `nodes.status`. |
| `pricing_force_tick` | optional `tenantId` (must match deployment tenant) | Run `runPricingTickJob` once (billing + live gates apply). |
| `pricing_override` | `recommendationId`, decision: \ | |
The dashboard /dashboard/control-plane surfaces the same actions with browser confirmation prompts.
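Because the request body is discriminated on `type`, it maps naturally onto a TypeScript union. The sketch below covers the actions whose payloads are fully listed in the table above and omits `pricing_override`, whose payload is incomplete in this section. Typing the IDs as strings, and the helper names `describeAction` and `actionRequest`, are assumptions.

```typescript
// Assumption: IDs and keys are strings; the runbook does not state their types.
type OperatorAction =
  | { type: "dlq_replay"; deadLetterId: string }
  | { type: "dlq_delete"; deadLetterId: string }
  | { type: "node_pause"; nodeKey: string }
  | { type: "node_resume"; nodeKey: string }
  | { type: "pricing_force_tick"; tenantId?: string };

// Human-readable summary, useful for confirmation prompts; narrowing happens per case.
function describeAction(action: OperatorAction): string {
  switch (action.type) {
    case "dlq_replay":
      return "re-enqueue dead letter " + action.deadLetterId;
    case "dlq_delete":
      return "delete dead letter " + action.deadLetterId;
    case "node_pause":
      return "pause node " + action.nodeKey;
    case "node_resume":
      return "resume node " + action.nodeKey;
    case "pricing_force_tick":
      return "force one pricing tick";
  }
}

// Builds the pieces of the HTTP request without sending it.
function actionRequest(action: OperatorAction): { method: string; path: string; body: string } {
  return { method: "POST", path: "/api/operator/actions", body: JSON.stringify(action) };
}

const replayRequest = actionRequest({ type: "dlq_replay", deadLetterId: "42" });
console.log(describeAction({ type: "node_pause", nodeKey: "node-a" }), replayRequest.body);
```

Sending the request still requires an authenticated operator session; the sketch stops at constructing the payload.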
Go-to-market execution
- UI: `/dashboard/gtm` lists all revenue plays with launch checklist items, env keys, and links derived from `technicalAnchors` (intake, control plane, repo paths). Erie play #1 is highlighted as the recommended first path.
- CLI: `npm run gtm:print` for a terminal-friendly dump or `--json` for automation.
- Persistence: Status and notes are stored per deployment tenant in Postgres; apply migration `011_gtm_use_case_statuses.sql` (via the normal migration runner) before expecting PATCH to succeed in production.
Intake
- Successful `POST /api/intake` responses are preceded by a structured log line `intake.persisted` with `tenantId`, `leadKey`, `existing`, `dryRun`, and `source`.
- Failures log `POST /api/intake failed` with the error message.
Dead-letter queue
- BullMQ failures are mirrored into Postgres `dead_letter_jobs` for durable inspection.
- Use the control plane or `/api/queue` / `/api/health/deep` to inspect counts; re-drive work only after fixing root cause (schema drift, external API, etc.).
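When deciding whether a root cause has been fixed, it can help to group dead-letter rows by failure reason before replaying anything. The sketch below assumes a hypothetical row shape (`id`, `queue`, `reason`); the actual columns of `dead_letter_jobs` are not described in this runbook.

```typescript
// Hypothetical row shape; the real dead_letter_jobs columns may differ.
interface DeadLetterRow {
  id: string;
  queue: string;
  reason: string;
}

interface FailureGroup {
  reason: string;
  count: number;
}

// Groups rows by reason and keeps reasons seen at least minCount times, most frequent first.
function recurringFailures(rows: DeadLetterRow[], minCount: number): FailureGroup[] {
  const counts: Record<string, number> = {};
  for (const row of rows) {
    counts[row.reason] = (counts[row.reason] || 0) + 1;
  }
  const groups: FailureGroup[] = [];
  for (const reason of Object.keys(counts)) {
    const count = counts[reason] || 0;
    if (count >= minCount) {
      groups.push({ reason: reason, count: count });
    }
  }
  return groups.sort((a, b) => b.count - a.count);
}

const dlqGroups = recurringFailures(
  [
    { id: "1", queue: "main", reason: "schema drift" },
    { id: "2", queue: "main", reason: "external API timeout" },
    { id: "3", queue: "measure", reason: "schema drift" },
  ],
  2
);
console.log(dlqGroups);
```

A reason that keeps reappearing after a replay suggests the root cause is still present, so further replays would only refill the queue.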