Eval Session — Fidian

Multi-Hop Causal Analysis — 1 Run

Run 1

0% pass (0/30) · 2.2k tok/task

Evaluation Policy — multi_hop_rca

Purpose. Grade whether the agent calls trace_causal_chain and correctly identifies the infrastructure root cause (not a symptom service) for multi-hop incidents.

SUCCESS if

Calls trace_causal_chain at least once during investigation.
The identified root cause is an infrastructure component (e.g. Postgres, Redis, Kafka) — not an application service that is merely affected.
The causal chain in the response traces at least 2 hops from symptom to root.

FAILURE if

Root cause names a symptom service (e.g. "payment-gateway errors") rather than the infrastructure root.
trace_causal_chain is never called, or is called but its result is ignored.
Causal chain stops before reaching the infrastructure layer.

Verification protocol. The grader checks that (a) trace_causal_chain was called, (b) the root-cause component is infrastructure rather than application, and (c) at least 2 hops are traced in the response.
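
The three checks above can be sketched as a small grading function. This is an illustration only, not the grader's actual implementation: the Trajectory shape, the INFRASTRUCTURE set, the function name grade_trajectory, and the service names in the usage lines are all assumptions made for the example.

```python
from __future__ import annotations
from dataclasses import dataclass, field

# Assumed allow-list; the policy names Postgres, Redis and Kafka only as examples.
INFRASTRUCTURE = {"postgres", "redis", "kafka"}

@dataclass
class Trajectory:
    # Hypothetical record of one agent run.
    tool_calls: list[str]
    root_cause: str
    causal_chain: list[str] = field(default_factory=list)  # symptom first, root last

def grade_trajectory(t: Trajectory) -> tuple[bool, list[str]]:
    """Return (passed, failure_reasons) for checks (a), (b) and (c)."""
    reasons = []
    # (a) the tool must be called at least once
    if "trace_causal_chain" not in t.tool_calls:
        reasons.append("trace_causal_chain never called")
    # (b) the root cause must be an infrastructure component
    if t.root_cause.lower() not in INFRASTRUCTURE:
        reasons.append("root cause is not an infrastructure component")
    # (c) a chain of n nodes covers n - 1 hops
    hops = max(len(t.causal_chain) - 1, 0)
    if hops < 2:
        reasons.append(f"only {hops} hop(s) traced")
    return (not reasons, reasons)

# Usage with made-up service names.
failing = Trajectory(["query_metrics"], "payment-gateway", ["payment-gateway"])
passing = Trajectory(
    ["query_metrics", "trace_causal_chain"],
    "Postgres",
    ["payment-gateway", "orders-service", "Postgres"],
)
print(grade_trajectory(failing))
print(grade_trajectory(passing))
```

The sketch does not cover the "called but its result is ignored" failure condition, which would need a comparison between the tool output and the final answer.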

Failure analysis · Run 1 0%

Dominant failure mode (30/30): the agent calls query_metrics on the symptom service but never invokes trace_causal_chain. It returns the first-order service as the root cause without traversing the dependency graph.

Evidence

0 / 30 trajectories contain a trace_causal_chain call.
Every task returns a symptom service (application layer) as root cause.
The failure is consistent across incident types, which points to a structural cause (the prompt) rather than an incident-specific one.
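
Evidence of this kind can be tallied directly from the trajectories. The following is a minimal sketch, assuming each trajectory is a dict with hypothetical keys incident_type and tool_calls. The sample records are invented for illustration and are not data from this run.

```python
from collections import Counter

def summarize(trajectories):
    """Count trajectories that call trace_causal_chain, and group the misses by incident type."""
    total = len(trajectories)
    with_trace = sum("trace_causal_chain" in t["tool_calls"] for t in trajectories)
    missing_by_type = Counter(
        t["incident_type"]
        for t in trajectories
        if "trace_causal_chain" not in t["tool_calls"]
    )
    return {
        "total": total,
        "with_trace": with_trace,
        "missing_by_type": dict(missing_by_type),
    }

# Invented sample, only to show the shape of the output.
sample = [
    {"incident_type": "latency", "tool_calls": ["query_metrics"]},
    {"incident_type": "errors", "tool_calls": ["query_metrics"]},
    {"incident_type": "latency", "tool_calls": ["query_metrics", "trace_causal_chain"]},
]
summary = summarize(sample)
print(summary)
```

If the misses are spread evenly across every incident type, that supports a structural cause; a concentration in one type would suggest an incident-specific one.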

Next steps

Restore the traversal instruction. At system_prompt.txt:12, restore "follow the full causal chain to the infrastructure root". It was removed in the v2.4.1 prompt update, so the agent never calls trace_causal_chain.
Rerun this scenario to confirm the tool is invoked before tuning prompts or models.
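
Before rerunning, a small guard can confirm the instruction is back in the prompt. This is a sketch under assumptions: the function name is hypothetical, the prompt text is passed in as a string, and the check looks for the phrase anywhere in the prompt rather than on line 12 specifically.

```python
REQUIRED_INSTRUCTION = "follow the full causal chain to the infrastructure root"

def has_traversal_instruction(prompt_text: str) -> bool:
    """True if the traversal instruction appears, ignoring case and line wrapping."""
    normalized = " ".join(prompt_text.lower().split())
    return REQUIRED_INSTRUCTION in normalized

# Usage with made-up prompt snippets.
print(has_traversal_instruction("Always follow the full causal\nchain to the infrastructure root."))
print(has_traversal_instruction("Report the failing service."))
```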

Failure analysis · Run 2 20%

What's passing

Next steps

Failure analysis · Run 3 90%

Remaining failures (3/30)

Regression check