How to diagnose a latency spike with Prometheus + Claude
When to use: A service p99 alert fires and you need context without memorizing PromQL.
Prerequisites
- Prometheus URL reachable: set PROMETHEUS_URL in the MCP config; add auth if the endpoint is protected
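Before starting, it can help to confirm the URL actually answers. The sketch below is one way to do that from Python, using the `/-/healthy` endpoint of Prometheus's management API. The `is_reachable` helper, the bearer-token option, and the `http://localhost:9090` fallback are illustrative assumptions, not part of the MCP server.

```python
import os
import urllib.request
from typing import Optional


def health_url(base_url: str) -> str:
    """Return the URL of Prometheus's built-in health endpoint."""
    return base_url.rstrip("/") + "/-/healthy"


def is_reachable(base_url: str, token: Optional[str] = None, timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    request = urllib.request.Request(health_url(base_url))
    if token:
        # Assumption: the endpoint sits behind a proxy that accepts bearer tokens.
        request.add_header("Authorization", f"Bearer {token}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


# Example call (performs a network request, so it is left commented out):
# is_reachable(os.environ.get("PROMETHEUS_URL", "http://localhost:9090"))
```

If this returns False while the URL opens in a browser, the difference is usually authentication or a network boundary between the MCP host and Prometheus.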
Flow
- Scope the spike: Query HTTP request p99 latency for service X in the last hour, 30-second resolution. Compare to the last 7 days baseline. → Range query result showing the spike
- Find correlated metrics: For the spike window, what other metrics for service X moved more than 2 sigma? CPU, memory, GC, queue depth? → Candidate culprit metrics
- Narrow by label: Break down the spike by pod/host labels. Is it one pod or fleet-wide? → Per-label decomposition
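For reference, the first step roughly corresponds to a call to Prometheus's `/api/v1/query_range` endpoint. The sketch below only builds the URL. The metric name `http_request_duration_seconds_bucket`, the `service` label, and the helper names are assumptions, so substitute whatever your exporter actually emits.

```python
import time
from typing import Optional
from urllib.parse import urlencode

# Assumption: latency is exported as a histogram with this name and a
# `service` label. Real exporters may use different names.
LATENCY_METRIC = "http_request_duration_seconds_bucket"


def p99_query(service: str, window: str = "5m") -> str:
    """Build a PromQL expression for the p99 latency of one service."""
    return (
        "histogram_quantile(0.99, "
        f'sum by (le) (rate({LATENCY_METRIC}{{service="{service}"}}[{window}])))'
    )


def range_query_url(
    base_url: str,
    query: str,
    end: Optional[float] = None,
    lookback_s: int = 3600,
    step_s: int = 30,
) -> str:
    """Build a query_range URL covering `lookback_s` seconds up to `end`."""
    end = time.time() if end is None else end
    params = {
        "query": query,
        "start": f"{end - lookback_s:.0f}",
        "end": f"{end:.0f}",
        "step": f"{step_s}s",
    }
    return f"{base_url.rstrip('/')}/api/v1/query_range?{urlencode(params)}"
```

The 7-day baseline could reuse the same helper with `lookback_s=7 * 86400` and a coarser step such as `step_s=3600`, which keeps the number of returned points small.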
Result: A hypothesis tied to specific metrics in under 5 minutes.
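To make the "2 sigma" and "one pod or fleet-wide" checks concrete, here is a minimal sketch of the arithmetic, assuming the baseline and spike-window values have already been fetched. The function names, the choice of population standard deviation, and the "2x the fleet median" rule are illustrative assumptions, not the method Claude or Prometheus uses.

```python
import math
from statistics import mean, median, pstdev
from typing import Dict, List


def z_score(baseline: List[float], value: float) -> float:
    """Distance of `value` from the baseline mean, in standard deviations."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # Flat baseline: any change at all counts as an unbounded deviation.
        return 0.0 if value == mu else math.copysign(math.inf, value - mu)
    return (value - mu) / sigma


def moved_metrics(
    baselines: Dict[str, List[float]],
    spike: Dict[str, float],
    threshold: float = 2.0,
) -> List[str]:
    """Names of metrics whose spike-window value is more than `threshold` sigma away."""
    scores = {name: abs(z_score(baselines[name], spike[name])) for name in spike}
    flagged = [name for name, score in scores.items() if score > threshold]
    return sorted(flagged, key=lambda name: scores[name], reverse=True)


def outlier_pods(p99_by_pod: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Pods whose p99 exceeds `factor` times the fleet median."""
    fleet_median = median(p99_by_pod.values())
    return sorted(pod for pod, p99 in p99_by_pod.items() if p99 > factor * fleet_median)
```

An empty result from `outlier_pods` while latency is elevated overall suggests a fleet-wide cause; a short list points at specific pods.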
Pitfalls
- Query returns no data: check label names with list_metrics; label casing and delimiters vary between exporters
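When a name almost matches, comparing names with case and delimiters stripped can surface the variant the exporter actually uses. The sketch below is a hypothetical helper; it assumes you already have a list of names, for example from list_metrics or from Prometheus's `/api/v1/labels` endpoint.

```python
import re
from typing import Iterable, List


def normalize(name: str) -> str:
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def near_matches(wanted: str, available: Iterable[str]) -> List[str]:
    """Return available names equal to `wanted` once case and delimiters are ignored."""
    key = normalize(wanted)
    return [name for name in available if normalize(name) == key]


# Example: a query used `statusCode`, but the exporter emits `status_code`.
candidates = near_matches("statusCode", ["status_code", "method", "le"])
```

An empty list means the name is missing altogether rather than just spelled differently.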