Trajectory evaluation

AI & LLM Testing

// Definition

Evaluating an agent on the sequence of steps it took, not just the final outcome. End-to-end evaluation ("did the agent eventually complete the task?") misses a large class of failures: agents that arrived at the right answer via the wrong tool, that took ten steps when two would have done, that corrupted state mid-flow but recovered, that retried successfully past a permission boundary they shouldn't have crossed. Trajectory evaluation scores the steps themselves: were tool-call arguments correct, was state propagation clean, did the agent refuse when it should have refused. Research from 2023 onward shows agents pass 20–40 percent more end-to-end evaluations than they pass trajectory ones — the gap is the work hidden by single-shot scoring.

// Related terms

Eval harness
Software that runs an LLM-backed system against a dataset of inputs, scores the outputs against criteria (exact match, similarity, LLM-as-judge, custom rubric), and tracks how scores change across model versions, prompts, or code changes. Eval harnesses are to AI features what test runners are to deterministic code: the place CI calls into, the place regressions get caught, the place quality is measured rather than asserted. The 2026 ecosystem has fragmented rather than consolidated — Braintrust is eval-first, Langfuse is prompt-first (acquired by Clickhouse in January), Laminar is built for agent debugging, Arize Phoenix is OpenTelemetry-native. Most teams pick one platform per workflow rather than expecting one tool to cover everything.
Agent observability
Instrumentation and tooling that makes the behaviour of an AI agent debuggable in production. A multi-step agent that fails mid-flow leaves a different kind of evidence than a crashed service: there's a tool-call trace, an LLM reasoning chain, a sequence of page snapshots, a token-and-cost ledger. Agent observability platforms — Laminar, Langfuse, Arize Phoenix, LangSmith, Braintrust — capture this and make it queryable. The distinction from regular APM is the unit of analysis: traditional observability shows you the request that failed, agent observability shows you the decision that was wrong. The hardest signal to capture cleanly is whether a failure was application flakiness or LLM context failure — those look identical in a trace but require different fixes.