Multi-Hop Causal Analysis — 1 Run
Run 1
View Policy
View Analysis
0% pass (0/30) · 2.2k tok/task
Evaluation Policy — multi_hop_rca
Purpose. Grade whether the agent calls trace_causal_chain and correctly identifies the infrastructure root cause (not a symptom service) for multi-hop incidents.
SUCCESS if
- Calls
trace_causal_chain at least once during investigation.
- The identified root cause is an infrastructure component (e.g. Postgres, Redis, Kafka) — not an application service that is merely affected.
- The causal chain in the response traces at least 2 hops from symptom to root.
FAILURE if
- Root cause names a symptom service (e.g. "payment-gateway errors") rather than the infrastructure root.
- trace_causal_chain is never called, or called but result ignored.
- Causal chain stops before reaching the infrastructure layer.
Verification protocol: grader checks (a) tool was called, (b) root cause component is infrastructure not application, (c) ≥2 hops traced in response.
Failure analysis · Run 1 0%
Dominant failure mode (30/30): the agent calls query_metrics on the symptom service but never invokes trace_causal_chain — it returns the first-order service as the root cause without traversing the dependency graph.
Evidence
- 0 / 30 trajectories contain a
trace_causal_chain call.
- Every task returns a symptom service (application layer) as root cause.
- Consistent across incident types ⇒ structural (prompt), not incident-specific.
Next steps
- Restore the traversal instruction. system_prompt.txt:12 — restore "follow the full causal chain to the infrastructure root" (removed in the v2.4.1 prompt update, so the agent never calls trace_causal_chain).
- Rerun this scenario to confirm the tool is invoked before tuning prompts or models.
Failure analysis · Run 2 20%
Progress: trace_causal_chain is now called on 30/30 tasks — the agent traverses the graph.
New failure mode (24/30): on incidents requiring 4–5 hop chains, the agent terminates traversal early (depth=2) and returns the intermediate service as root cause rather than the infrastructure component.
What's passing
- Only 2–3 hop incidents where the infrastructure root is reached within the default traversal depth.
Next steps
- Add explicit tool-ordering instruction (system_prompt.txt:14): tell the agent to always call
trace_causal_chain before concluding RCA and to follow the chain until it reaches an infrastructure component.
Failure analysis · Run 3 90%
Resolved. 27/30 attempts call trace_causal_chain and correctly identify the infrastructure root cause.
Remaining failures (3/30)
- Edge cases with circular dependency loops — the agent detects the cycle but cannot resolve direction; borderline grader calls that pass on other attempts.
Regression check
- No regressions — 14/14 sibling scenarios unchanged.