feat(observability): implement Prometheus metrics backend with /metrics endpoint

- Adds PrometheusObserver backend with counters, histograms, and gauges
- Tracks agent starts/duration, tool calls, channel messages, heartbeat ticks, errors, request latency, tokens, sessions, queue depth
- Adds GET /metrics endpoint to gateway for Prometheus scraping
- Adds provider/model labels to AgentStart and AgentEnd events for better observability
- Adds as_any() method to Observer trait for backend-specific downcast
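
The as_any() downcast mentioned above can be sketched roughly as follows. This is a minimal, std-only illustration and not the code from this commit: the trimmed-down Observer trait, the `rendered` field, the `encode` method, `NoopObserver`, and `metrics_body` are all hypothetical stand-ins.

```rust
use std::any::Any;

// Hypothetical, trimmed-down stand-in for the real Observer trait.
trait Observer {
    fn name(&self) -> &str;
    // Exposes the concrete type so callers can downcast to a specific backend.
    fn as_any(&self) -> &dyn Any;
}

// Assumed shape: the real backend holds a metrics registry, not a String.
struct PrometheusObserver {
    rendered: String,
}

impl PrometheusObserver {
    // Hypothetical method returning the text a scrape would receive.
    fn encode(&self) -> String {
        self.rendered.clone()
    }
}

impl Observer for PrometheusObserver {
    fn name(&self) -> &str {
        "prometheus"
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// A second backend that has nothing to expose on /metrics.
struct NoopObserver;

impl Observer for NoopObserver {
    fn name(&self) -> &str {
        "noop"
    }
    fn as_any(&self) -> &dyn Any {
        self
    }
}

// What a /metrics handler might do: return a body only when the
// configured observer is the Prometheus backend.
fn metrics_body(observer: &dyn Observer) -> Option<String> {
    observer
        .as_any()
        .downcast_ref::<PrometheusObserver>()
        .map(|prom| prom.encode())
}
```

The downcast keeps backend-specific behaviour out of the shared trait: only the gateway needs to know that one backend can render a scrape body.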

Metrics exposed:
- zeroclaw_agent_starts_total (Counter) with provider/model labels
- zeroclaw_agent_duration_seconds (Histogram) with provider/model labels
- zeroclaw_tool_calls_total (Counter) with tool/success labels
- zeroclaw_tool_duration_seconds (Histogram) with tool label
- zeroclaw_channel_messages_total (Counter) with channel/direction labels
- zeroclaw_heartbeat_ticks_total (Counter)
- zeroclaw_errors_total (Counter) with component label
- zeroclaw_request_latency_seconds (Histogram)
- zeroclaw_tokens_used_last (Gauge)
- zeroclaw_active_sessions (Gauge)
- zeroclaw_queue_depth (Gauge)
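
For orientation, a labelled counter from the list above would reach a scraper in the Prometheus text exposition format roughly as rendered by the sketch below. This is a std-only illustration, not the backend added in this commit (which presumably uses a metrics library). The `LabeledCounter` type, the help text, and the label values are assumptions, and label-value escaping is omitted.

```rust
use std::collections::BTreeMap;

// Hypothetical two-label counter, e.g. zeroclaw_tool_calls_total{tool, success}.
struct LabeledCounter {
    name: &'static str,
    help: &'static str,
    label_names: [&'static str; 2],
    // One running total per distinct pair of label values.
    values: BTreeMap<(String, String), u64>,
}

impl LabeledCounter {
    fn new(name: &'static str, help: &'static str, label_names: [&'static str; 2]) -> Self {
        LabeledCounter {
            name,
            help,
            label_names,
            values: BTreeMap::new(),
        }
    }

    // Increment the series identified by the two label values.
    fn inc(&mut self, first: &str, second: &str) {
        *self
            .values
            .entry((first.to_string(), second.to_string()))
            .or_insert(0) += 1;
    }

    // Render HELP/TYPE headers followed by one sample line per series.
    // Note: escaping of backslashes, quotes, and newlines in label values is skipped here.
    fn render(&self) -> String {
        let mut out = format!(
            "# HELP {} {}\n# TYPE {} counter\n",
            self.name, self.help, self.name
        );
        for ((first, second), value) in &self.values {
            out.push_str(&format!(
                "{}{{{}=\"{}\",{}=\"{}\"}} {}\n",
                self.name, self.label_names[0], first, self.label_names[1], second, value
            ));
        }
        out
    }
}
```

Each distinct label combination becomes its own time series, so labels such as tool or provider should stay low-cardinality.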

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
argenis de la rosa 2026-02-17 14:01:37 -05:00 committed by Chummy
parent c04f2855e4
commit eba544dbd4
11 changed files with 575 additions and 228 deletions


@@ -193,8 +193,18 @@ impl Provider for ReliableProvider {
 } else {
     "retryable"
 };
+// For custom providers, strip the URL from the provider name
+// to avoid confusion. The format "custom:https://..." in error
+// logs makes it look like the model is being appended to the URL.
+let display_provider = if provider_name.starts_with("custom:") {
+    "custom"
+} else if provider_name.starts_with("anthropic-custom:") {
+    "anthropic-custom"
+} else {
+    provider_name
+};
 failures.push(format!(
-    "provider={provider_name} model={current_model} attempt {}/{}: {failure_reason}",
+    "{display_provider}/{current_model} attempt {}/{}: {failure_reason}",
     attempt + 1,
     self.max_retries + 1
 ));
@@ -298,8 +308,18 @@ impl Provider for ReliableProvider {
 } else {
     "retryable"
 };
+// For custom providers, strip the URL from the provider name
+// to avoid confusion. The format "custom:https://..." in error
+// logs makes it look like the model is being appended to the URL.
+let display_provider = if provider_name.starts_with("custom:") {
+    "custom"
+} else if provider_name.starts_with("anthropic-custom:") {
+    "anthropic-custom"
+} else {
+    provider_name
+};
 failures.push(format!(
-    "provider={provider_name} model={current_model} attempt {}/{}: {failure_reason}",
+    "{display_provider}/{current_model} attempt {}/{}: {failure_reason}",
     attempt + 1,
     self.max_retries + 1
 ));
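
Both hunks apply the same prefix check, so the logic could plausibly live in one helper. The sketch below is only an illustration of that idea and is not part of this commit: `display_provider` as a free function and `failure_line` are hypothetical names, and the example URLs and model name used with them are made up.

```rust
// Hypothetical helper: collapse "custom:<url>" style provider names to a
// short label so failure messages do not embed the endpoint URL.
fn display_provider(provider_name: &str) -> &str {
    if provider_name.starts_with("custom:") {
        "custom"
    } else if provider_name.starts_with("anthropic-custom:") {
        "anthropic-custom"
    } else {
        provider_name
    }
}

// Hypothetical wrapper that builds the failure entry in the new
// "{provider}/{model} attempt n/m: reason" format from the diff.
// `attempt` is zero-based, matching the `attempt + 1` in the original code.
fn failure_line(
    provider_name: &str,
    current_model: &str,
    attempt: u32,
    max_retries: u32,
    failure_reason: &str,
) -> String {
    format!(
        "{}/{} attempt {}/{}: {}",
        display_provider(provider_name),
        current_model,
        attempt + 1,
        max_retries + 1,
        failure_reason
    )
}
```

With this shape, `failure_line("custom:https://example.com/v1", "some-model", 0, 2, "retryable")` yields `custom/some-model attempt 1/3: retryable`, and ordinary provider names pass through unchanged.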