Agent Evaluation Readiness Checklist

Agent Evaluation Readiness Checklist

A practical checklist for agent evaluation: error analysis, dataset construction, grader design, offline & online evals, and production readiness.

17 min read

By Victor Moreira, Deployed Engineer @ LangChain

This checklist is a practical companion to "Agent Observability Powers Agent Evaluation", which covers why agent evaluation is different from traditional software testing, introduces the core observability primitives (runs, traces, threads), and explains how they map to evaluation levels. Read that post first if you're new to agent evaluation.

This post focuses on the how, a step-by-step checklist for building, running, and shipping agent evals.

Start with the simplest eval that gives you signal. A few end-to-end evals that test whether your agent completes its core tasks will give you a baseline immediately, even if your architecture is still changing. Only add complexity when you have evidence that simpler approaches are missing real failures.

👉
Don’t care for the deep dive? Skip to the full checklist.

Before you build evals

0:00
/0:42

Use LangSmith to go from traces to the annotation queue to datasets & experiments

☑️ Manually review 20-50 real agent traces before building any eval infrastructure

☑️ Define unambiguous success criteria for a single task

☑️ Separate capability evals from regression evals

☑️ Ensure you can identify and articulate why each failure occurs

☑️ Assign eval ownership to a single domain expert

☑️ Rule out infrastructure and data pipeline issues before blaming the agent

Deep dive

Manually review 20-50 real agent traces before building any eval infrastructure

Use LangSmith to go from traces to the annotation queue to datasets & experiments.

Before building any infrastructure, spend 30 minutes reading through real agent traces. You'll learn more about failure patterns from this than from any automated system. LangSmith's traces and annotation queues are excellent for this.

Define unambiguous success criteria for a single task

If two experts can't agree on pass/fail, the task needs refinement:

  • Unclear success: "Summarize this document well."
  • Clear success: "Extract the 3 main action items from this meeting transcript. Each should be < 20 words and include an owner if mentioned."

Separate capability evals from regression evals

You need both because they serve different purposes. Capability evals push your agent forward by measuring progress on hard tasks, while regression evals protect what already works. Without the separation, you'll either stop improving because you're only guarding existing behavior, or you'll ship regressions because you're only chasing new capabilities.

  • Capability evals answer "what can it do?"
    • Start with a low pass rate and give you a hill to climb.
  • Regression evals answer "does it still work?"
    • Should have ~100% pass rate and catch backsliding.

Ensure you can identify and articulate why each failure occurs

If you can't articulate why something failed, you need more error analysis before building automated evals. This is where you should spend 60-80% of your eval effort. Follow this process:

  1. Gather traces: Collect representative failures from production or testing
  2. Open coding: Review traces with a domain expert, noting every issue you see without pre-categorizing (or use our annotation queue to have subject matter experts review traces on their own)
  3. Categorize: Group issues into a failure taxonomy (prompt problems, tool design problems, model limitations, tool failures, data gaps, etc.)
  4. Iterate: Keep reviewing until you stop discovering new failure categories

Once you've categorized, the fix depends on the root cause:

  • Prompt problem: The agent misunderstood because your instructions were unclear → fix the prompt
  • Tool design problem: The tool interface made it easy for the agent to make mistakes → redesign parameters, add examples, clarify boundaries
  • Model limitation: Instructions were clear but the LLM doesn't generalize to edge cases → add examples, try a different architecture, or use a different model
  • Don't know yet: You haven't looked at enough failures to see the pattern → do more error analysis first

Assign eval ownership to a single domain expert

Someone needs to own the eval process: maintaining datasets, recalibrating judges, triaging new failure modes, and deciding what "good enough" means. Ideally one domain expert acts as the quality arbiter for ambiguous cases rather than designing by committee.

Rule out infrastructure and data pipeline issues before blaming the agent

The Witan Labs team found that a single extraction bug moved their benchmark from 50% to 73%. Infrastructure issues (timeouts, malformed API responses, stale caches) frequently masquerade as reasoning failures. Check the data pipeline first.

Choose your evaluation level

Single-step vs. Full-turn vs. Multi-turn evals

Not all evals test the same thing. Match your evaluation to the right level of agent behavior. For a deep dive on each level, see "Agent Observability Powers Agent Evaluation".

Single-step vs. Full-turn vs. Multi-turn evals

☑️ Understand the three evaluation levels: single-step (run), full-turn (trace), and multi-turn (thread)

☑️ Start with trace-level (full-turn) evals, then layer in run-level and thread-level as needed

Deep dive

Single-step evals

These answer: "Did the agent choose the right tool?" "Did it generate a valid API call?" They're the easiest to automate but require stable agent architecture; if you're still changing your tool definitions, run-level evals may break.

Full-turn evals

This is where most teams should start. Grade a full trace across three dimensions:

  • Final response: Is the output correct and useful?
  • Trajectory: Did the agent take a reasonable path? (Not necessarily the exact path you expected, just a valid one)
  • State changes: Did the agent create the right artifacts? (files written, database updated, meeting scheduled, etc.)

State change evaluation is often overlooked but critical for agents that do things, not just say things. For example, if your agent schedules meetings, don't just check that it said "Meeting scheduled!" Verify the calendar event actually exists with the right time, attendees, and description. If it writes code, run the code. If it updates a database, query the rows. The final response can say "Done!" while the actual state is wrong.

Multi-turn evals

The hardest level to implement, layer them in after your trace-level evals are solid.

💡
Practical tip: Use N-1 testing. Take real conversation prefixes from production (the first N-1 turns) and let the agent generate only the final turn. This avoids the compounding error problem of fully synthetic multi-turn simulations.

Start with trace-level (full-turn) evals, then layer in run-level and thread-level as needed

Trace-level gives you the most signal per eval. Run-level is useful for debugging specific steps. Thread-level matters when your agent has multi-turn conversations.

Dataset construction

☑️ Ensure every task is unambiguous, with a reference solution that proves it's solvable

☑️ Test both positive cases (behavior should occur) and negative cases (behavior should not occur)

☑️ Ensure dataset structure matches your chosen evaluation level

☑️ Tailor datasets to your agent type (coding, conversational, research)

☑️ Generate seed examples if you lack production data

☑️ Source from dogfooding errors, adapted external benchmarks, and hand-written behavior tests

☑️ Set up a trace-to-dataset flywheel for continuous improvement

Deep dive

Ensure every task is unambiguous, with a reference solution that proves it's solvable

  • Ambiguous: "Find me good flights to NYC."
  • Unambiguous: "Find roundtrip flights from SFO to JFK, departing Dec 15-17, returning Dec 22, under $400, economy class."

If the agent can't possibly succeed (missing info, impossible constraints), the task is broken, not the agent. Include a reference solution for every task so you can prove it's solvable and have a baseline to grade against.

Test both positive cases (behavior should occur) and negative cases (behavior should not occur)

If you only test "did it search when it should?", you'll optimize for an agent that searches everything. Test the negative cases too. Include examples designed to falsify your assumptions, not just confirm expected behavior.

Ensure dataset structure matches your chosen evaluation level

  • Run-level (single-step) evals need reference tool calls or decisions
  • Trace-level (full-turn) evals need expected final outputs and/or state changes
  • Thread-level (multi-turn) evals need multi-turn conversation sequences with expected context retention

Tailor datasets to your agent type (coding, conversational, research)

  • Coding agents: Include deterministic test suites (unit tests that pass/fail) alongside quality rubrics
  • Conversational agents: Include multi-dimensional criteria, task completion and interaction quality (empathy, clarity)
  • Research agents: Include groundedness checks (are claims supported by sources?) and coverage checks (are key facts included?)

Generate seed examples if you lack production data

Define the key dimensions of variation for your task (query complexity, topic, edge case type). Manually create ~20 example inputs covering those dimensions, run them through your existing agent, review and modify them to store as reliable ground truths.

💡
Practical tip: 20-50 hand-reviewed examples you're confident in will outperform hundreds of synthetic examples you haven't verified. Quality beats quantity here!

Source from dogfooding errors, adapted external benchmarks, and hand-written behavior tests

Once you're past the cold start, you need an ongoing pipeline for discovering new evals. Three strategies work well together:

  1. Dogfood your agent daily and turn every error into an eval. This is different from production monitoring; it's your team intentionally stress-testing the agent across real workflows.
  2. Pull and adapt tasks from external benchmarks like Terminal Bench or BFCL. Don't run full benchmarks in aggregate; cherry-pick tasks that test capabilities you care about and adapt them for your agent.
  3. Write focused tests by hand for specific behaviors you think are important, like "does the agent parallelize tool calls?" or "does it ask clarifying questions for vague requests?"

See "How we build evals for Deep Agents" for a concrete example of this approach.

Grader design

☑️ Select specialized graders per evaluation dimension: default to code-based for objective checks, LLM-as-judge for subjective assessments, human for ambiguous cases, and pairwise for version comparison

☑️ Distinguish guardrails (inline, runtime) from evaluators (async, quality assessment)

☑️ Prefer binary pass/fail over numeric scales

☑️ Calibrate LLM-as-a-Judge graders to human preferences

☑️ Grade the outcome, not the exact path, and build in partial credit for incremental progress

☑️ Use custom evaluators derived from your error analysis, not generic off-the-shelf metrics

Deep dive

Select specialized graders per evaluation dimension

Grader Type Best For Watch Out For
Code-based Deterministic checks, tool call verification, output format, execution results Can false-fail on valid but unexpected formats
LLM-as-judge Nuanced quality, rubric-based scoring, open-ended tasks Requires calibration with humans (see Align Evals)
Human Calibration, subjective criteria, edge cases Expensive, slow, hard to scale

Default to code-based evaluators when there's an objectively correct answer. LLM-as-judge grading for objective tasks can be unreliable, inconsistent judgments can mask real regressions. Switching to deterministic comparison can often eliminate inconsistency and provide better signal. Reserve LLM-as-judge for genuinely subjective assessments.

💡
Practical Tip: Rather than trying to create a correctness evaluator, decompose evaluation into specialized graders per dimension rather than one monolithic grader.

For example: the Witan Labs team built 5 specialized evaluators (content accuracy, structure, visual formatting, formula scenarios, text quality), each with dimension-appropriate thresholds. This gives you clearer signal about what's actually failing!

Distinguish guardrails from evaluators

Guardrails Evaluators
When During execution, before user sees output After generation, asynchronously
Speed Milliseconds (must be fast) Seconds to minutes (can be expensive)
Purpose Block dangerous or malformed outputs Measure quality and catch regressions
Examples PII detection, format validation, safety filters LLM-as-judge scoring, trajectory analysis

Safety checks and format validation are guardrails, they should run inline. Quality assessment and regression testing are evaluators, they run async. Don't confuse the two.

Prefer binary pass/fail over numeric scales

A 1-5 scale introduces subjective differences between adjacent scores and requires larger sample sizes for statistical significance. Binary forces clearer thinking: either the agent succeeded or it didn't. You can always decompose a complex task into multiple binary checks.

Note: recent research suggests short scales (0-5) may yield stronger human-LLM alignment when using LLM-as-judge specifically, but binary remains simpler for human reviewers and faster iteration.

Calibrate LLM-as-a-Judge graders to human preferences

Grade the outcome, not the exact path, and build in partial credit for incremental progress

Agents find creative solutions. As Anthropic puts it in Demystifying Evals: "Don't grade the path the agent took, grade what it produced." If you require "must call tool A → B → C in that order," you'll fail agents that found a smarter route. Better: "Did the meeting get scheduled correctly?" not "Did it call check_availability before create_event?"

An agent that correctly identifies the problem but fails at the final step is better than one that fails immediately. Build in partial credit so your metrics reflect incremental progress.

Use custom evaluators derived from your error analysis, not generic off-the-shelf metrics

Off-the-shelf metrics like "helpfulness" or "coherence" create false confidence. The evaluators that matter are the ones that catch your specific failure modes, discovered through the error analysis process above.

Running & iterating

☑️ Distinguish between offline, online, and ad-hoc evaluation and use all three

☑️ Run multiple trials per task to account for non-determinism

☑️ Manually review traces for failed evaluations to verify grader fairness

☑️ Ensure each trial runs in a clean, isolated environment with no shared state

☑️ Tag evals by capability category, document what each measures, and track efficiency metrics (step count, tool calls, latency) alongside quality

☑️ Recognize when pass rates plateau and evolve your test suite accordingly

☑️ Only keep evals that directly measure a production behavior you care about

☑️ Invest in tool interface design and testing, not just prompt optimization

☑️ Distinguish between task failures (agent got it wrong) and evaluation failures (grader got it wrong)

Deep dive

Distinguish between offline, online, and ad-hoc evaluation and use all three

Most of this checklist focuses on offline evaluation, and that's intentional. Offline evals are where you improve with: curated datasets, controlled experiments, iterating before you ship. You'll also need online and ad-hoc evaluation once your agent hits production.

Timing What It Is When to Use
Offline Curated datasets, run pre-deployment Testing changes before they ship
Online Continuous evaluation on production traces Catching failures in real traffic
Ad-hoc Exploratory analysis of ingested traces Discovering patterns you didn't anticipate (see Insights)

The Production readiness section below covers setting up online evaluations and scheduling ad-hoc trace exploration in detail.

Run multiple trials per task to account for non-determinism

Model outputs vary between runs. Use multiple repetitions if not cost prohibitive. When running multiple trials, compute confidence intervals before declaring improvement—single-run benchmarks are noisy. For non-deterministic agents, consider using pass@k (at least one of k attempts succeeds) or pass^k (all k attempts succeed) metrics depending on your product requirements.

Track operational metrics alongside quality: turns taken, token usage, latency, cost per task. An agent that's 95% accurate but 10x slower might not be an improvement.

Tag evals by capability category, document what each measures, and track efficiency metrics alongside quality

Group evals by what they test, not where they come from. Categories like file_operations, retrieval, tool_use, memory, and conversation give you a "middle view" of performance between a single aggregate score and individual test results. Add a docstring to each eval explaining how it measures an agent capability. This keeps intent clear as the suite grows and lets you run targeted subsets (e.g., only tool_use evals after changing a tool definition).

Attach metadata to every experiment so you can filter, group, and compare runs across dimensions that matter. This makes it easy to answer questions like "did switching from GPT-4.1 to Claude Sonnet improve accuracy?" or "which prompt version regressed on this dataset?" without digging through logs. LangSmith automatically captures git info when available, but explicitly tagging model and prompt metadata pays off quickly as your experiment volume grows.

Once quality is established, compare models on efficiency. An agent that's 95% accurate but 10x slower might not be an improvement. Track ratios like observed steps / ideal steps, observed tool calls / ideal tool calls, and observed latency / ideal latency. This doesn't conflict with "grade the outcome, not the exact path": ideal trajectories measure efficiency, not correctness. You still pass an agent that found a creative route, but you can see if it took longer to get there. See the metrics framework in How we build evals for Deep Agents for a worked example.

Manually review traces for failed evaluations to verify grader fairness

A "failed" task might actually be a creative valid solution your grader didn't anticipate. Reading traces is how you know if your graders are being fair.

Ensure each trial runs in a clean, isolated environment with no shared state

If trial 2 can see artifacts from trial 1, your results aren't independent. What this means in practice:

  • Coding agents: Fresh containers or VMs per trial
  • API-calling agents: Staging environments or mock services
  • Database agents: Snapshot and restore between trials

Recognize when pass rates plateau and evolve your test suite accordingly

When your pass rate plateaus and adding more tasks of the same type stops revealing new failure modes, it's time to evolve: add harder tasks, test new capabilities, or shift to different dimensions. Grinding on a saturated eval set wastes effort.

Only keep evals that directly measure a production behavior you care about

Every eval applies pressure on your system over time. It's tempting to blindly add hundreds of tests, but this creates an illusion of progress. You end up optimizing for an eval suite that doesn't reflect what matters in production. More evals does not equal better agents. Build targeted evals, and periodically prune the ones that no longer give you signal. For a concrete example of this approach, see How we build evals for Deep Agents.

Invest in tool interface design and testing, not just prompt optimization

Tool design eliminates entire classes of agent errors. Anthropic's team noted they spent more time optimizing tools than prompts when building their SWE-bench agent. Test how the model actually uses your tools: try different parameter formats (diffs vs full rewrites, JSON vs. markdown), redesign interfaces to make mistakes harder, and invest in clear documentation with examples. The goal is to make mistakes structurally impossible, not just unlikely. For example, requiring absolute file paths eliminates an entire class of navigation errors.

Distinguish between task failures (agent got it wrong) and evaluation failures (grader got it wrong)

Track run status explicitly (complete, error, timeout). A grader that marks a timeout as "incorrect reasoning" pollutes your signal. Separate task failures from evaluation failures to keep your metrics clean.

Production readiness

☑️ Promote capability evals with consistently high pass rates into your regression suite

☑️ Integrate regression evals into your CI/CD pipeline with automated quality gates

☑️ Capture user feedback

☑️ Set up online evaluations for production traffic

☑️ Schedule regular manual exploration of production traces beyond automated checks

☑️ Version your prompts and tool definitions alongside your code

☑️ Ensure production failures feed back into datasets, error analysis, and eval improvements

Deep dive

Promote capability evals with consistently high pass rates into your regression suite

Once you've climbed the hill, protect it. Tasks that used to test "can we do this?" become "can we still do this?"

Integrate regression evals into your CI/CD pipeline with automated quality gates

A typical flow:

  1. Code or prompt change triggers the pipeline (via git push, PromptHub update, or manual trigger)
  2. Offline evals run unit tests, integration tests, and evaluation against curated datasets using cheap, fast graders
  3. Preview deployment goes up if offline evals pass
  4. Online evals run against the preview with live data using LLM-as-judge graders
  5. Promote to production only if all quality gates pass, otherwise route failing traces to annotation queues and alert the team

Use cheap code-based graders in CI for every commit. Reserve expensive LLM-as-judge evaluations for preview/production evaluation. See LangSmith's CI/CD pipeline guide for a full implementation example with GitHub Actions.

Set up online evaluations for production traffic

Safety checks, format validation, quality heuristics. You'll find failure modes in production you never anticipated (see You don't know what your agent will do until it's in production)

Capture user feedback

Once your agent is in production, user feedback becomes one of your most valuable signals. Automated evals can only catch the failure modes you already know about. Users will surface the ones you don't: edge cases your dataset missed, outputs that are technically correct but unhelpful, and workflows that break in ways you never anticipated.

Capturing this feedback in a structured way lets you feed it back into your datasets, calibrate your graders against real-world expectations, and prioritize the improvements that actually matter to the people using your agent.

Schedule regular manual exploration of production traces beyond automated checks

Don't rely solely on automated pass/fail. Periodically explore production traces for unexpected patterns or failure modes your graders don't cover, surprising user behaviors, or opportunities to improve. Our Insights Agent is a great way to do this!

Version your prompts and tool definitions

LangSmith makes it easy to version your prompts. Without this, you can't correlate eval results with specific changes or know which edit caused a regression.

Ensure production failures feed back into datasets, error analysis, and eval improvements

Production successes and failures should feed back into your datasets, error analysis, and eval improvements. This is the flywheel that makes your agent better over time!

You don't need all of these items on day one. Pick the section that matches where you are right now, nail those items, and expand from there. The teams that ship reliable agents aren't the ones with the most sophisticated eval infrastructure - they're the ones who started evaluating early and never stopped iterating.


The full checklist

Before you build evals

⬜️ Manually review 20-50 real agent traces before building any eval infrastructure

⬜️ Define unambiguous success criteria for a single task

⬜️ Separate capability evals from regression evals

⬜️ Ensure you can identify and articulate why each failure occurs

⬜️ Assign eval ownership to a single domain expert

⬜️ Rule out infrastructure and data pipeline issues before blaming the agent

Choose your evaluation level

⬜️ Understand the three evaluation levels: single-step (run), full-turn (trace), and multi-turn (thread)

⬜️ Start with trace-level (full-turn) evals, then layer in run-level and thread-level as needed

Dataset construction

⬜️ Ensure every task is unambiguous, with a reference solution that proves it's solvable

⬜️ Test both positive cases (behavior should occur) and negative cases (behavior should not occur)

⬜️ Ensure dataset structure matches your chosen evaluation level

⬜️ Tailor datasets to your agent type (coding, conversational, research)

⬜️ Generate seed examples if you lack production data

⬜️ Source from dogfooding errors, adapted external benchmarks, and hand-written behavior tests

⬜️ Set up a trace-to-dataset flywheel for continuous improvement

Grader design

⬜️ Select specialized graders per evaluation dimension: default to code-based for objective checks, LLM-as-judge for subjective assessments, human for ambiguous cases, and pairwise for version comparison

⬜️ Distinguish guardrails (inline, runtime) from evaluators (async, quality assessment)

⬜️ Prefer binary pass/fail over numeric scales

⬜️ Calibrate LLM-as-a-Judge graders to human preferences

⬜️ Grade the outcome, not the exact path, and build in partial credit for incremental progress

⬜️ Use custom evaluators derived from your error analysis, not generic off-the-shelf metrics

Running & iterating

⬜️ Distinguish between offline, online, and ad-hoc evaluation and use all three

⬜️ Run multiple trials per task to account for non-determinism

⬜️ Manually review traces for failed evaluations to verify grader fairness

⬜️ Ensure each trial runs in a clean, isolated environment with no shared state

⬜️ Tag evals by capability category, document what each measures, and track efficiency metrics (step count, tool calls, latency) alongside quality

⬜️ Recognize when pass rates plateau and evolve your test suite accordingly

⬜️ Only keep evals that directly measure a production behavior you care about

⬜️ Invest in tool interface design and testing, not just prompt optimization

⬜️ Distinguish between task failures (agent got it wrong) and evaluation failures (grader got it wrong)

Production readiness

⬜️ Promote capability evals with consistently high pass rates into your regression suite

⬜️ Integrate regression evals into your CI/CD pipeline with automated quality gates

⬜️ Capture user feedback

⬜️ Set up online evaluations for production traffic

⬜️ Schedule regular manual exploration of production traces beyond automated checks

⬜️ Version your prompts and tool definitions alongside your code

⬜️ Ensure production failures feed back into datasets, error analysis, and eval improvements


Further reading

LangChain:

Witan Labs:

External benchmarks (for sourcing eval tasks):

Anthropic:

OpenAI:

Hamel Husain:

arXiv papers:

LangSmith Docs: