SRE Agent · tickets · TRV-1138 triaged by Fidian
TRV-1138

RCA regression: agent stops at symptom, misses root

● user-reported 👎 · severity: medium · source: production monitoring · eval: Multi-Hop Causal Analysis
Autonomous triage complete — ready for developer review
Fidian analyzed the trace, reproduced the multi-hop traversal regression across 8 sandboxes, identified the prompt diff, and built a reproducing eval. Everything is attached below.

💬 Incident context

🧑
What's causing the Redis memory pressure and payment errors?
🤖
The payment-gateway service is experiencing elevated error rates, likely due to upstream network congestion.
⚠ wrong root cause — stopped at symptom service, missed infrastructure root

🛠 Execution trace

query_metrics("payment-gateway") → P99 2,340ms · 480ms
trace_causal_chain(…) never called past depth=1

The agent identified the symptom service but did not traverse the causal chain — no Redis or Postgres layers were examined.
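
For reviewers, the traversal the agent skipped can be sketched roughly as below. This is a minimal illustration, not the real `trace_causal_chain` tool: the `DEPENDS_ON` map, the `trace_to_root` helper, and the `max_hops` parameter are all hypothetical, and the edge order (payment-gateway, then Redis, then Postgres) is assumed from the wording of this ticket.

```python
from typing import Dict, List

# Hypothetical dependency map (assumption): each component lists the
# upstream components it depends on. Edge order is inferred from the ticket.
DEPENDS_ON: Dict[str, List[str]] = {
    "payment-gateway": ["redis"],
    "redis": ["postgres"],
    "postgres": [],
}


def trace_to_root(start: str, graph: Dict[str, List[str]], max_hops: int = 10) -> List[str]:
    """Follow the first unvisited upstream dependency at each hop.

    Stops when a node has no unvisited upstream, or after max_hops hops.
    Returns the visited path, starting with the symptom service.
    """
    path = [start]
    seen = {start}
    node = start
    for _ in range(max_hops):
        upstream = [n for n in graph.get(node, []) if n not in seen]
        if not upstream:
            break  # nothing further upstream: treat this node as the root
        node = upstream[0]
        seen.add(node)
        path.append(node)
    return path


if __name__ == "__main__":
    # Full traversal walks past the symptom service.
    print(trace_to_root("payment-gateway", DEPENDS_ON))
    # A traversal that takes no hops returns only the symptom service.
    print(trace_to_root("payment-gateway", DEPENDS_ON, max_hops=0))
```

In this sketch, a run that takes no hops returns only the symptom service, which is the shape of the failure seen in the trace above.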

🔭 Linked observability

trace_id: 7c3f9a1e-4d2b-49c8-a0f1-2e6b8d44e190
model: claude-opus-4-7 · temp 0.1
latency: 2.34s · 1 tool turn

🧭 Root cause

attached by Fidian

The multi-hop traversal instruction was removed from the agent system prompt in v2.4.1.

- follow causal chain to infrastructure root # removed
- call trace_causal_chain before concluding RCA # removed

The agent stops at the first-order symptom service instead of tracing through Redis and Postgres layers to the infrastructure root.
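
One possible guard against this kind of prompt regression is a check that the two removed instructions are still present before a prompt version ships. The sketch below assumes the system prompt is available as a plain string; `REQUIRED_INSTRUCTIONS` and `missing_instructions` are hypothetical names, and a case-insensitive substring match is only one way to do the comparison.

```python
from typing import List

# The two instructions quoted in the prompt diff on this ticket.
REQUIRED_INSTRUCTIONS: List[str] = [
    "follow causal chain to infrastructure root",
    "call trace_causal_chain before concluding RCA",
]


def missing_instructions(system_prompt: str, required: List[str] = REQUIRED_INSTRUCTIONS) -> List[str]:
    """Return the required instructions that do not appear in the prompt.

    Matching is case-insensitive and collapses runs of whitespace, so
    line wrapping in the prompt does not cause false alarms.
    """
    normalized = " ".join(system_prompt.lower().split())
    return [r for r in required if r.lower() not in normalized]


if __name__ == "__main__":
    # Hypothetical prompt text that lacks both instructions.
    regressed_prompt = "You are an SRE agent. Report the failing service."
    for instruction in missing_instructions(regressed_prompt):
        print(f"missing: {instruction}")
```

A check like this only catches literal removal; a reworded instruction would need a looser match or a behavioral eval.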

reached infrastructure root: 0/8 · confidence 0.94

🧫 Reproduction

Original incident + 7 counterfactuals, each in an isolated sandbox — none reached the infrastructure root.

payment-gw ◆ Redis cascade
order-svc DB timeout
auth-svc JWT latency
api-gw 503 errors
notif-svc backpressure
search-svc cache storm
billing-svc timeout
cdn-edge origin fail

🧪 Eval built

Multi-Hop Causal Analysis
10 tasks × 3 attempts
0 / 30
Checks trace_causal_chain reaches infrastructure root
Verifies root cause is not a symptom service
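
The two checks could be expressed as a grader along the following lines. This is a sketch under assumptions, not the attached eval: the `Attempt` record, its fields, and the `grade` and `pass_rate` helpers are hypothetical, as are the example node sets.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Attempt:
    """Hypothetical record of one eval attempt."""
    tool_calls: List[str]   # tool names in call order
    chain: List[str]        # nodes visited by trace_causal_chain, if any
    reported_root: str      # root cause named in the final answer


def grade(attempt: Attempt, infrastructure_nodes: Set[str], symptom_services: Set[str]) -> bool:
    """Pass only if the chain ends at an infrastructure node and the
    reported root cause is not a symptom service."""
    called = "trace_causal_chain" in attempt.tool_calls
    reached_root = bool(attempt.chain) and attempt.chain[-1] in infrastructure_nodes
    not_symptom = attempt.reported_root not in symptom_services
    return called and reached_root and not_symptom


def pass_rate(results: List[bool]) -> str:
    """Format results the way the ticket reports them, e.g. '0 / 30'."""
    return f"{sum(results)} / {len(results)}"


if __name__ == "__main__":
    # Shaped like the incident: one metrics query, no chain traversal.
    incident = Attempt(
        tool_calls=["query_metrics"],
        chain=[],
        reported_root="payment-gateway",
    )
    result = grade(incident, {"postgres"}, {"payment-gateway"})
    print(result, pass_rate([result]))
```

Requiring both conditions keeps an attempt from passing by naming the right component without actually traversing the chain.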