SRE Agent · eval session TRV-1138
Multi-Hop Causal Analysis — 1 Run
Run 1
View Policy
View Analysis
0% pass (0/30) · 2.2k tok/task
TasksAttempt
123

Evaluation Policy — multi_hop_rca

Purpose. Grade whether the agent calls trace_causal_chain and correctly identifies the infrastructure root cause (not a symptom service) for multi-hop incidents.

SUCCESS if

FAILURE if

Verification protocol: grader checks (a) tool was called, (b) root cause component is infrastructure not application, (c) ≥2 hops traced in response.

Failure analysis · Run 1 0%

Dominant failure mode (30/30): the agent calls query_metrics on the symptom service but never invokes trace_causal_chain — it returns the first-order service as the root cause without traversing the dependency graph.

Evidence

  • 0 / 30 trajectories contain a trace_causal_chain call.
  • Every task returns a symptom service (application layer) as root cause.
  • Consistent across incident types ⇒ structural (prompt), not incident-specific.

Next steps

  • Restore the traversal instruction. system_prompt.txt:12 — restore "follow the full causal chain to the infrastructure root" (removed in the v2.4.1 prompt update, so the agent never calls trace_causal_chain).
  • Rerun this scenario to confirm the tool is invoked before tuning prompts or models.
scene 7 · eval session (before)