# ZeroClaw Operations Runbook
This runbook is for operators who maintain availability, security posture, and incident response.
Last verified: **February 18, 2026**.
## Scope
Use this document for day-2 operations:
- starting and supervising the runtime
- health checks and diagnostics
- safe rollout and rollback
- incident triage and recovery
For first-time installation, start from [one-click-bootstrap.md](one-click-bootstrap.md).
## Runtime Modes
| Mode | Command | When to use |
|---|---|---|
| Foreground runtime | `zeroclaw daemon` | local debugging, short-lived sessions |
| Foreground gateway only | `zeroclaw gateway` | webhook endpoint testing |
| User service | `zeroclaw service install && zeroclaw service start` | persistent operator-managed runtime |
## Baseline Operator Checklist
1. Validate configuration (confirm the expected provider, model, and channels):
```bash
zeroclaw status
```
2. Verify diagnostics:
```bash
zeroclaw doctor
zeroclaw channel doctor
```
3. Start runtime:
```bash
zeroclaw daemon
```
4. For a persistent user-session service, install and start it, then confirm its status:
```bash
zeroclaw service install
zeroclaw service start
zeroclaw service status
```
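The first two checklist steps can be chained so that a failed check stops the sequence before the runtime is started. The following is a minimal sketch: the function name `preflight` is hypothetical, and it assumes each `zeroclaw` subcommand exits non-zero when it finds a problem, which this runbook does not confirm.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run the baseline checks in order, stop at the first failure.
# Assumption: each zeroclaw subcommand exits non-zero when it detects a problem.
preflight() {
  local cmd
  for cmd in "zeroclaw status" "zeroclaw doctor" "zeroclaw channel doctor"; do
    echo "==> $cmd"
    # Intentionally unquoted so the string splits into command and arguments.
    if ! $cmd; then
      echo "FAILED: $cmd" >&2
      return 1
    fi
  done
  echo "All baseline checks passed."
}

# Usage (not executed here): preflight && zeroclaw daemon
```

If the exit-code assumption does not hold in your version, read the command output manually instead of relying on the return status.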
## Health and State Signals
| Signal | Command / File | Expected |
|---|---|---|
| Config validity | `zeroclaw doctor` | no critical errors |
| Channel connectivity | `zeroclaw channel doctor` | configured channels healthy |
| Runtime summary | `zeroclaw status` | expected provider/model/channels |
| Daemon heartbeat/state | `~/.zeroclaw/daemon_state.json` | file updates periodically |
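The "file updates periodically" signal can be checked by comparing the state file's modification time against a threshold. The sketch below uses `find -mmin` for that. The function name `heartbeat_is_fresh` and the 5-minute default are assumptions; this runbook does not state the daemon's actual update interval.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the file was modified within the last N minutes.
# The default of 5 minutes is an assumption, not a documented ZeroClaw interval.
heartbeat_is_fresh() {
  local file="$1"
  local max_age_min="${2:-5}"
  # A missing state file counts as stale.
  [ -f "$file" ] || return 1
  # find prints the path only if it was modified less than max_age_min minutes ago.
  [ -n "$(find "$file" -mmin "-${max_age_min}" 2>/dev/null)" ]
}

# Usage (not executed here):
#   heartbeat_is_fresh "$HOME/.zeroclaw/daemon_state.json" 5 || echo "heartbeat is stale" >&2
```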
## Logs and Diagnostics
### macOS / Windows (service wrapper logs)
- `~/.zeroclaw/logs/daemon.stdout.log`
- `~/.zeroclaw/logs/daemon.stderr.log`
### Linux (systemd user service)
```bash
journalctl --user -u zeroclaw.service -f
```
## Incident Triage Flow (Fast Path)
1. Snapshot system state:
```bash
zeroclaw status
zeroclaw doctor
zeroclaw channel doctor
```
2. Check service state:
```bash
zeroclaw service status
```
3. If service is unhealthy, restart cleanly:
```bash
zeroclaw service stop
zeroclaw service start
```
4. If channels still fail, verify allowlists and credentials in `~/.zeroclaw/config.toml`.
5. If gateway is involved, verify bind/auth settings (`[gateway]`) and local reachability.
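During an incident it can help to save the output of steps 1 and 2 to files before restarting anything, so the pre-restart state is available for the later write-up. The following is a minimal sketch; the helper names `capture` and `snapshot_zeroclaw` and the default output directory are hypothetical and not part of ZeroClaw.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command, save its combined output and exit code.
capture() {
  local outdir="$1" name="$2"
  shift 2
  mkdir -p "$outdir" || return 1
  local rc=0
  # Record stdout and stderr; a failing command must not abort the snapshot.
  "$@" >"$outdir/$name.log" 2>&1 || rc=$?
  printf 'exit_code=%s\n' "$rc" >>"$outdir/$name.log"
  return 0
}

# Hypothetical wrapper: capture the triage commands from this runbook.
# The default directory name is an assumption chosen for illustration.
snapshot_zeroclaw() {
  local outdir="${1:-$HOME/zeroclaw-triage-$(date +%Y%m%d-%H%M%S)}"
  capture "$outdir" status zeroclaw status
  capture "$outdir" doctor zeroclaw doctor
  capture "$outdir" channel-doctor zeroclaw channel doctor
  capture "$outdir" service-status zeroclaw service status
  printf '%s\n' "$outdir"
}
```

Calling `snapshot_zeroclaw` prints the directory it wrote to. Review the files before sharing them, since diagnostic output may include sensitive configuration details.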
## Safe Change Procedure
When applying config changes:
1. back up `~/.zeroclaw/config.toml`
2. apply one logical change at a time
3. run `zeroclaw doctor`
4. restart daemon/service
5. verify with `zeroclaw status` and `zeroclaw channel doctor`
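The backup in step 1 can be a timestamped copy, so that several changes in a row each keep their own restore point. The following is a minimal sketch; the function name `backup_config` and the `.bak.<timestamp>` suffix are assumptions, not a ZeroClaw convention.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy the config to a timestamped backup and print its path.
# The ".bak.<timestamp>" naming is an assumption chosen for illustration.
backup_config() {
  local config="${1:-$HOME/.zeroclaw/config.toml}"
  local backup
  backup="${config}.bak.$(date +%Y%m%d-%H%M%S)"
  if [ ! -f "$config" ]; then
    echo "config not found: $config" >&2
    return 1
  fi
  # -p preserves mode and timestamps of the original file.
  cp -p "$config" "$backup" && printf '%s\n' "$backup"
}

# Usage (not executed here): backup_config   # prints the path of the new backup
```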
## Rollback Procedure
If a rollout regresses behavior:
1. restore previous `config.toml`
2. restart runtime (`daemon` or `service`)
3. confirm recovery via `doctor` and channel health checks
4. document incident root cause and mitigation
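Step 1 can be scripted so that the regressed config is kept for the root-cause write-up in step 4 rather than overwritten. The following is a minimal sketch; the function name `restore_config` and the `.failed.<timestamp>` suffix are hypothetical, and the backup path is whatever copy of the previous `config.toml` you kept.

```bash
#!/usr/bin/env bash
# Hypothetical helper: restore a previous config and keep the regressed one.
# The ".failed.<timestamp>" naming is an assumption chosen for illustration.
restore_config() {
  local backup="$1"
  local config="${2:-$HOME/.zeroclaw/config.toml}"
  if [ ! -f "$backup" ]; then
    echo "backup not found: $backup" >&2
    return 1
  fi
  # Preserve the regressed config for later root-cause analysis.
  if [ -f "$config" ]; then
    cp -p "$config" "${config}.failed.$(date +%Y%m%d-%H%M%S)" || return 1
  fi
  cp -p "$backup" "$config"
}

# Usage (not executed here):
#   restore_config /path/to/previous/config.toml && zeroclaw service stop && zeroclaw service start
```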
## Related Docs
- [one-click-bootstrap.md](one-click-bootstrap.md)
- [troubleshooting.md](troubleshooting.md)
- [config-reference.md](config-reference.md)
- [commands-reference.md](commands-reference.md)