How do I write evals for AI data agents?

An AI data agent without evals is shipping vibes. Evals turn "the demo looked right" into accuracy you can prove, regression-check, and trust in production — they're the difference between a pilot that stalls and one that scales.

The short answer

Evals for AI data agents are test cases that verify the agent produces correct answers on real production data. A good eval framework has three jobs: generate test cases that represent how the business actually asks questions, score responses against validated expected answers, and catch regressions before they reach production.

Most teams don't have real evals today. They have manual spot-checks — a senior analyst tries a few questions, eyeballs the answers, says "looks right." That works for a demo and breaks the moment operators, CEOs, and business leaders are asking thousands of questions a week. Building a proper framework means deciding where test cases come from, how scoring works, how often evals run, and how failures route back to the system that produced them. This piece walks through each decision — and how Delphina builds the loop into the company brain.

What are evals for AI data agents (and why are they different from model benchmarks)?

Evals for AI data agents test whether the agent produces the correct answer to a specific business question on your data — not whether the underlying model can write SQL in general.

Model benchmarks like Spider and BIRD test text-to-SQL on standardized academic datasets. Strong models push 80%+ on Spider; the same models pointed at real enterprise warehouses routinely drop below 50%. Spider tests whether the model can write a syntactically correct query against a clean schema. Your eval needs to test whether the agent picks the right one of three revenue columns, joins on the canonical customer ID, applies the CRO's definition of "active customer" instead of marketing's, and refuses to answer when an underlying table broke three days ago.

A useful frame: a benchmark tests the model. An eval tests the agent — which means it's really testing the context the agent has access to. For more on why context is the load-bearing piece, see why AI data agents hallucinate and how a context layer fixes it.

Why most teams don't have real evals (and ship vibes instead)

Most teams "evaluate" their AI data agents the same way: a senior analyst asks a handful of questions they already know the answers to, eyeballs what comes back, and decides whether it feels right. If five out of seven look right, the team ships.

This works for a demo. It breaks in production for three reasons.

Spot-checks don't catch the long tail. The questions a senior analyst tests are usually the well-documented ones the agent is most likely to get right. The long tail — finance's edge-case period comparisons, the CRO's definition of pipeline velocity, the operations team's three-table join for inventory reconciliation — never makes it into the spot check. That's where the agent silently fails.
"Looks right" isn't reproducible. Two analysts looking at the same answer will disagree about whether it's correct. There's no audit trail of what was tested, what passed, or whether last week's pass would still pass today.
There's no regression signal. A new dbt model lands. A column gets renamed. A pipeline fix subtly changes how a metric aggregates. Without a continuous eval suite, there's no way to detect that accuracy on the same set of questions has degraded — until a wrong number reaches a board deck.

The well-known failure mode: an agent ships with strong demo accuracy, the warehouse evolves over a quarter, accuracy degrades a few points each month, nobody notices, and a wrong number eventually lands in a board deck. The team loses trust and pulls the agent back.

The fix isn't more spot-checks. It's the shift from ad-hoc verification to a continuous eval framework with three properties: test cases that cover the long tail, scoring that's reproducible, and a regression signal that catches drift early.

The three layers of an eval framework

Any real eval framework has three layers, with a design decision at each.

Test case generation. Hand-curated by senior analysts (high quality, doesn't scale), extracted from query history (representative, but echoes the biases of past queries), or auto-generated by an LLM (scales, but risks circularity — covered below). The mature answer is hybrid: hand-curate a baseline, mine query history for coverage, use LLM generation to fill gaps — with humans validating anything LLM-generated before it becomes ground truth.

Scoring. Deterministic checks are precise but brittle — two SQL queries can produce the same answer through different syntax. LLM-as-judge handles semantic equivalence well and scales, but introduces a new variable: the judge model itself. Again, the mature answer is hybrid.

Monitoring. Evals that run once at deployment are a snapshot, not a system. Production-grade frameworks run on a schedule (weekly is common, daily for higher-stakes deployments), trigger on warehouse change events, and route failures back to the team responsible for the underlying knowledge.
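The trigger logic in the monitoring layer can be sketched in a few lines. This is a minimal illustration rather than any product's scheduler: `ChangeEvent`, `should_run_evals`, and the seven-day default are hypothetical names and values, chosen only to mirror the weekly cadence and change-event triggers described above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ChangeEvent:
    """Hypothetical warehouse change notification (e.g. a renamed column)."""
    kind: str
    occurred_at: datetime


def should_run_evals(last_run, now, events, max_age=timedelta(days=7)):
    """Return (run?, reason): run on any change since last_run, or when the schedule is due."""
    changes = [e for e in events if e.occurred_at > last_run]
    if changes:
        kinds = ", ".join(sorted({e.kind for e in changes}))
        return True, f"warehouse changed: {kinds}"
    if now - last_run >= max_age:
        return True, "scheduled run is due"
    return False, "no changes and schedule not due"


last_run = datetime(2024, 1, 1)
events = [ChangeEvent("column_renamed", datetime(2024, 1, 3))]
print(should_run_evals(last_run, datetime(2024, 1, 4), events))
```

For a higher-stakes deployment, passing `max_age=timedelta(days=1)` would give the daily cadence instead.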

How to write your first evals

Start with what you have. Pull the 20–30 questions your business asks most often from three sources:

Dashboard usage logs — which dashboards get opened most, by whom, and what filters they cut by.
Slack channels where people ask the data team for help — the questions that come in as DMs and threads instead of waiting for a dashboard.
Query history of your senior analysts — the patterns they reach for instinctively when an exec asks something time-sensitive.

These are the questions the agent will face in week one — where being wrong does the most damage.

For each, get the canonical correct answer from your senior analysts. Not "the agent's answer looks right" — the actual answer a senior analyst would give, with the SQL or metric definition behind it. This is the part most teams skip, and it's what makes the difference between an eval suite and a wish list.

These pairs are your eval seed set. Hand-writing 20 evals doesn't scale to the thousands of questions your business will eventually ask, but it's still the right starting point:

It forces the data team to confront which definitions are actually canonical. Most teams discover they don't fully agree internally.
It produces a fixed ground truth, which makes the regression signal possible.
It's small enough to re-run manually, letting you ship the loop before you ship the automation.

But canonical questions aren't enough on their own. A strong eval set also tests the edge cases — the questions your senior analyst would labor over, not the ones they'd dispatch without thinking. Cross-functional questions where finance is asking about marketing's data using marketing's definition. Time-comparisons that cross a fiscal year transition. Metric questions during the week of a known migration. Questions with two legitimately correct answers depending on the asker's role. These are where unaided agents fail most confidently — and they're exactly what your business expects the agent to handle, because humans across teams ask hard questions and expect accurate answers. Aim for at least 20% of your seed set to be edge cases that test the limits.
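One way to hold the seed set is as a small list of records, with a flag marking edge cases so the 20% target can be checked mechanically. The sketch below is an assumption about shape, not a required schema: `EvalCase`, its field names, and the sample questions and values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected_answer: str   # canonical answer signed off by a senior analyst
    expected_sql: str      # the SQL or metric definition behind that answer
    source: str            # "dashboard", "slack", or "query_history"
    is_edge_case: bool = False


def edge_case_share(cases):
    """Fraction of the seed set that is flagged as an edge case."""
    if not cases:
        return 0.0
    return sum(case.is_edge_case for case in cases) / len(cases)


# Illustrative entries only; questions and values are made up.
PLACEHOLDER_SQL = "-- analyst-approved SQL goes here"
SEED_SET = [
    EvalCase("What was net revenue last month?", "1250000", PLACEHOLDER_SQL, "dashboard"),
    EvalCase("How many active customers do we have?", "4310", PLACEHOLDER_SQL, "slack"),
    EvalCase("What is current MRR?", "310000", PLACEHOLDER_SQL, "dashboard"),
    EvalCase("What was churn last quarter?", "2.1%", PLACEHOLDER_SQL, "query_history"),
    EvalCase("How does Q1 revenue compare across the fiscal year boundary?",
             "+8%", PLACEHOLDER_SQL, "query_history", is_edge_case=True),
]

share = edge_case_share(SEED_SET)
print(f"edge cases: {share:.0%}")
if share < 0.20:
    print("seed set is light on edge cases")
```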

Then automate: send each eval question to the agent, capture the response, compare to the expected answer. Run weekly. Track pass rate. When a previously passing eval fails, you have a concrete regression signal to investigate.
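The loop above fits in a few functions. The following is a minimal sketch under stated assumptions: `ask_agent` stands in for whatever call reaches your agent, the comparison is a naive normalized string match (a real suite would use the scoring methods discussed later), and the questions and numbers are invented.

```python
from typing import Callable, Dict, List


def run_suite(cases: Dict[str, str], ask_agent: Callable[[str], str]) -> Dict[str, bool]:
    """Send each question to the agent and record pass/fail per question."""
    results = {}
    for question, expected in cases.items():
        answer = ask_agent(question)
        # Naive comparison: exact match after trimming and lowercasing.
        results[question] = answer.strip().lower() == expected.strip().lower()
    return results


def pass_rate(results: Dict[str, bool]) -> float:
    return sum(results.values()) / len(results) if results else 0.0


def regressions(previous: Dict[str, bool], current: Dict[str, bool]) -> List[str]:
    """Questions that passed in the previous run and fail in the current one."""
    return sorted(q for q, passed in current.items() if previous.get(q) and not passed)


cases = {
    "What was net revenue last month?": "1250000",
    "How many active customers do we have?": "4310",
}


def stub_agent(question: str) -> str:
    # Stand-in for the real agent; the second answer has drifted.
    canned = {
        "What was net revenue last month?": "1250000",
        "How many active customers do we have?": "4400",
    }
    return canned[question]


last_week = {question: True for question in cases}
this_week = run_suite(cases, stub_agent)
print(f"pass rate: {pass_rate(this_week):.0%}")
print("regressions:", regressions(last_week, this_week))
```

Storing each run's results keyed by question is what makes the regression check possible: the signal is the change between runs, not the pass rate alone.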

Where this lives matters. Most teams put the eval suite in a dedicated repo or alongside their dbt project, with expected answers in versioned YAML or seeds. Wire the eval run into CI so the suite runs on every change to the agent's prompt, retrieval, or context source, and into a scheduled job so it runs weekly to catch drift in the underlying data. When an eval fails, the failure should produce both a notification and a structured artifact — the question, the expected answer, the agent's response, the diff, the suspected cause — that a reviewer can act on in minutes.
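The structured artifact can be as simple as a JSON document built at failure time. This sketch assumes text answers and uses the standard library's `difflib` for the diff; `build_failure_artifact` and its field names are hypothetical, and the suspected cause is supplied by the caller rather than inferred.

```python
import difflib
import json
from datetime import datetime, timezone


def build_failure_artifact(question, expected, actual, suspected_cause="unknown"):
    """Bundle everything a reviewer needs to triage one failed eval."""
    diff = list(difflib.unified_diff(
        expected.splitlines(),
        actual.splitlines(),
        fromfile="expected",
        tofile="agent",
        lineterm="",
    ))
    return {
        "question": question,
        "expected_answer": expected,
        "agent_response": actual,
        "diff": diff,
        "suspected_cause": suspected_cause,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


artifact = build_failure_artifact(
    question="How many active customers do we have?",
    expected="4310",
    actual="4400",
    suspected_cause="column renamed in upstream model",  # made-up example
)
print(json.dumps(artifact, indent=2))
```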

Expanding coverage is the next step. AI-assisted eval generation, which uses the agent's own knowledge base, query history, and tribal-knowledge sources to propose candidates, is how this scales past the first thousand evals. We use this approach at Delphina and recommend it to teams building their own framework. The critical edict is comprehensiveness: the AI's instruction shouldn't be "generate evals," it should be "be comprehensive: cover the obvious canonical questions AND probe the edge cases where the agent is most likely to fail." Without that edict, AI-generated evals drift toward the questions that are easy to grade and miss the hard ones that actually matter in production.

The setup-time validation pattern (how Delphina does it)

The hardest problem in eval generation is circularity. If your evals are auto-generated from the same knowledge base your agent learned from, you're not testing the agent against the business — you're testing it against its own assumptions. The suite passes; the agent ships and fails on questions the knowledge base got wrong.

Delphina handles this through structured human validation at setup.

Delphina's AI Jobs auto-generate candidate test questions from inputs the company brain has already ingested — warehouse and dbt models, query history, dashboards, strategy docs, and tribal knowledge from Slack and wikis. The instruction is comprehensiveness: candidates have to cover the obvious metrics (revenue, active customers, MRR, churn), the less obvious patterns query history surfaces (finance's quarterly period comparisons, the CRO's segmentation cuts, the operational metrics in weekly business reviews), AND the edge cases where unaided agents fail most confidently — cross-team metric ambiguities, time-comparisons that span a fiscal year boundary, questions where two roles have legitimately different correct answers, periods around known data incidents.

Then comes the part that breaks the circularity. The Delphina team runs a side-by-side review with the customer's data leaders — typically the Knowledge Lead plus one or two senior analysts. Each candidate gets two questions:

Is this a question the business actually asks (or should be able to ask)?
Is the expected answer the right one — the metric definition and SQL your senior analysts would write?

When both sides sign off, the question becomes part of the eval source of truth. The expected answer came from your domain experts confirming or correcting Delphina's candidate, not from the source the agent learned from. The eval suite is anchored in human judgment.

Validation happens before the eval becomes ground truth, not after a failure. Data leaders stamp the expected answer once, and the agent gets scored against that stamp from then on.
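The gate described above can be expressed as a promotion rule: a candidate becomes ground truth only when reviewers confirm both that the business asks the question and that the expected answer is right. The sketch below illustrates that rule only and is not Delphina's implementation; `Candidate`, `promote`, and the field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    question: str
    proposed_answer: str                      # generated, not yet trusted
    business_asks_it: Optional[bool] = None   # reviewer answer to question 1
    answer_confirmed: Optional[bool] = None   # reviewer answer to question 2
    corrected_answer: Optional[str] = None    # set when reviewers fix the answer


def promote(candidate: Candidate) -> Optional[Tuple[str, str]]:
    """Return (question, expected_answer) once validated, otherwise None."""
    if candidate.business_asks_it is not True:
        return None
    if candidate.corrected_answer is not None:
        # The human correction wins over the generated answer.
        return candidate.question, candidate.corrected_answer
    if candidate.answer_confirmed is True:
        return candidate.question, candidate.proposed_answer
    return None


unreviewed = Candidate("What is current MRR?", "310000")
confirmed = Candidate("What is current MRR?", "310000",
                      business_asks_it=True, answer_confirmed=True)
corrected = Candidate("What was churn last quarter?", "2.4%",
                      business_asks_it=True, corrected_answer="2.1%")

for candidate in (unreviewed, confirmed, corrected):
    print(promote(candidate))
```

An unreviewed candidate never reaches the suite, which is the property that breaks the circularity.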

How do you choose between LLM-as-judge and deterministic checks?

Once you have test cases with validated expected answers, the next decision is how to score agent responses. Two dominant patterns sit at opposite ends of the trade-off curve.

| Method | How it scores | Strengths | Trade-offs | Best for |
| --- | --- | --- | --- | --- |
| Deterministic checks | Execution-based eval (run both the agent's SQL and the validated expected SQL, compare result sets); schema-level validation; row-count plausibility; value-by-value comparison on high-stakes metrics. | Precise, fast, cheap, binary pass/fail, fully auditable, no AI dependency. | Naive SQL diff is brittle because different SQL can produce identical answers; execution-based eval solves that but adds warehouse cost per eval run. | Narrow regression assertions, high-stakes metrics, regulated environments where AI-grading isn't accepted as sole methodology. |
| LLM-as-judge | LLM scores whether the agent's response is semantically equivalent to the validated expected answer. | Handles "different SQL, same answer" naturally; scales to thousands of evals; gives nuanced feedback on partial correctness. | The judge is itself an LLM, so scores have to be calibrated and audited; not accepted as sole methodology in some regulated environments. | Bulk evals where semantic equivalence matters more than syntactic exactness. |

The right answer is almost always hybrid: LLM-as-judge for the bulk of evals where semantic equivalence matters, deterministic checks layered on for high-stakes evals where exact equivalence is required (board-deck metrics, regulatory reports, customer-facing numbers).
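The deterministic half of that hybrid, execution-based comparison, can be sketched with an in-memory SQLite database standing in for the warehouse. The table, data, and `result_sets_match` helper are assumptions for illustration; the LLM-as-judge half is not shown.

```python
import sqlite3
from collections import Counter


def result_sets_match(conn, expected_sql, agent_sql):
    """Run both queries and compare their rows as multisets (row order ignored)."""
    expected_rows = Counter(conn.execute(expected_sql).fetchall())
    agent_rows = Counter(conn.execute(agent_sql).fetchall())
    return expected_rows == agent_rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (1, 20), (2, 5)])

expected_sql = "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"

# Different syntax, same result: a naive SQL diff would fail this query.
equivalent_sql = (
    "SELECT DISTINCT o.customer_id, "
    "(SELECT SUM(i.amount) FROM orders i WHERE i.customer_id = o.customer_id) "
    "FROM orders o"
)

# Plausible-looking but wrong: counts orders instead of summing amounts.
wrong_sql = "SELECT customer_id, COUNT(*) FROM orders GROUP BY customer_id"

print(result_sets_match(conn, expected_sql, equivalent_sql))  # True
print(result_sets_match(conn, expected_sql, wrong_sql))       # False
```

This version is strict about column order and exact values. A production check would likely need a tolerance for floating-point metrics and a policy for column ordering, both left out here.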

How the eval loop stays inside your security boundary. For customers on customer-managed VPC deployments, the eval loop runs inside the same security boundary as the rest of Delphina — the agent, the knowledge base, and the eval scoring all stay inside the VPC. This matters for HIPAA deployments where third-party LLM API calls aren't acceptable for the validation loop. For regulated industries, the hybrid pattern is often the architectural answer: LLM-as-judge for scale, deterministic checks for the metrics the auditor will actually inspect, and an additional human review track where neither layer is sufficient on its own.

How this relates to existing eval frameworks. The eval ecosystem for LLM applications is well-developed: RAGAS for retrieval-augmented generation, Phoenix (Arize) for observability and LLM-as-judge tooling, Braintrust for eval orchestration, LangSmith for trace-based eval, DeepEval for component-level metrics. These are strong tools for teams building their own eval infrastructure, and most of the mechanics — test cases, judges, scoring rubrics, drift monitoring — are common across them. The difference with Delphina isn't the eval framework itself; it's that the eval framework is integrated into the context layer rather than running as a separate observability layer alongside the agent. When the eval flags an issue, the fix routes back to the same knowledge base the agent reasons from, so the next eval round actually closes the gap rather than just measuring it. Teams that prefer to compose their own stack can absolutely use these frameworks; teams that want the eval loop built into the system that produces the answers will find Delphina's approach more direct.

The closed loop: from failed eval to fixed knowledge

An eval framework that detects failures but doesn't route them anywhere is a smoke detector with no fire department. The point is what happens after a failure.

In Delphina's model, three signals feed the same issue queue.

Failed evals. When the scheduled suite catches a regression, the failure routes to the Knowledge Lead with the SQL, expected answer, actual answer, and diff.
User feedback. Any user in Slack or the Delphina app can branch a conversation in real time to flag an answer as wrong or contribute corrected knowledge.
Critic Agent flags. The Critic Agent reviews every agent response in real time, looking for four failure modes that lead to confidently wrong answers: missing knowledge (entities or metrics the agent assumed without grounding in the knowledge base), unjustified assumptions (joins or filters the agent inferred without canonical rules), ambiguous questions (where the same business term has multiple valid definitions), and incorrect reasoning (logical errors in a multi-step plan). When any of those gaps surface, the Critic flags the answer with the specific gap and routes it as an Issue. The user sees the uncertainty rather than a confident wrong answer; the gap becomes work for the Knowledge Lead; the next eval round confirms the gap closed.

The Knowledge Lead reviews issues and approves fixes, usually a clarification to a metric definition, a correction to a relationship, or an annotation that a particular join is unsafe. Approved fixes are written as versioned changes to the knowledge base, with each change linked to the originating issue, the approver's identity, the diff applied, and a verification record from the next eval run. The change log is exportable, and customers can wire it into their existing GRC tooling for SOX or HIPAA review. The principle is "Fix a gap once → every future answer improves."

That principle is why continuous evals matter more than one-time validation. A pass at deployment proves accuracy on a snapshot. A continuous loop proves the agent stays accurate as the business evolves, because every detected gap becomes a permanent improvement to the brain.
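The versioned change record described above has a natural shape: one entry per approved fix, linked to its issue, approver, diff, and the eval run that later verified it. The sketch below is an assumed structure for illustration, not Delphina's schema; `KnowledgeChange`, `ChangeLog`, and all identifiers in the example are hypothetical.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class KnowledgeChange:
    version: int
    issue_id: str
    signal: str                 # "failed_eval", "user_feedback", or "critic_flag"
    approver: str
    diff: str
    verified_by_eval_run: Optional[str] = None


class ChangeLog:
    def __init__(self):
        self._entries = []

    def record(self, issue_id, signal, approver, diff):
        """Append an approved fix as the next version."""
        entry = KnowledgeChange(len(self._entries) + 1, issue_id, signal, approver, diff)
        self._entries.append(entry)
        return entry

    def mark_verified(self, version, eval_run_id):
        """Attach the eval run that confirmed the gap closed."""
        index = version - 1
        self._entries[index] = replace(self._entries[index], verified_by_eval_run=eval_run_id)

    def export_json(self):
        return json.dumps([asdict(entry) for entry in self._entries], indent=2)


log = ChangeLog()
change = log.record(
    issue_id="ISSUE-17",
    signal="failed_eval",
    approver="knowledge.lead@example.com",
    diff="active_customer: require an order in the trailing 90 days",
)
log.mark_verified(change.version, eval_run_id="run-2024-01-08")
print(log.export_json())
```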

Build vs buy: who should write evals themselves?

Some teams should build their own eval framework. The honest filter is whether three things are true at once:

You have 2–3 spare engineers for 6+ months. Not "we could reallocate someone" — actually free. Eval infrastructure spanning test generation, scoring (LLM-as-judge plus deterministic), scheduling, drift detection, and issue routing is a real platform project.
You're treating evals as a product with a roadmap and an owner. Eval frameworks decay the moment the team that built them moves on. Without a named owner, the framework will be three months stale within a year.
You've already solved the political problem of which definitions are canonical. Writing evals forces every disagreement to the surface: finance and the CRO have legitimately different definitions of pipeline; product and growth have different definitions of activation. If the org hasn't settled which wins for which audience, the suite stalls in committee. (One thing Delphina's setup-time validation accelerates here: the side-by-side review with data leaders is often the first time two teams explicitly compare their definitions, which surfaces the political work earlier, whether you ultimately build or buy.)

If any one condition doesn't hold, the predictable outcome is what Delphina's founders call the data context trap — the same seven-stage pattern that traps teams trying to build the broader context layer in-house.

For most enterprises, buying is the pragmatic answer — not because building is impossible, but because spare engineering time on context infrastructure is rare, and the opportunity cost usually tilts the math.

Where this approach is harder

The pattern above is the architectural answer for most teams, but it has real limitations.

Cold-start environments without substantial query history. The setup-time validation pattern leans on existing artifacts — query logs, dashboards, dbt models, Slack history — to generate strong candidate evals. A brand-new warehouse requires a higher-effort bootstrap, usually with the data team hand-curating the seed set rather than reviewing AI-generated candidates.
Strict regulated environments where LLM-as-judge isn't accepted. Some regulatory regimes don't accept an LLM grading another LLM as sufficient verification. The hybrid mix shifts: deterministic checks carry more weight, an additional human review track gets added for high-stakes evals, and the LLM-as-judge layer becomes a triage signal rather than the system of record.
Small teams without a Knowledge Lead candidate. The closed loop depends on someone with both domain authority and the bandwidth to approve fixes. Where the most senior analyst is already at capacity, the role can become a bottleneck — shaping the first deployment toward a narrower scope.

None of these are reasons not to build an eval framework. They're reasons to scope the first deployment realistically.

Delphina: evals built into the company brain

Delphina is your company brain, anchored in data — the self-improving context layer, AI-powered. The eval framework isn't bolted on; it's how the brain stays current and trustworthy.

The loop runs end-to-end across the five-stage pipeline:

Sources. Warehouse, dbt models, query history, dashboards, strategy docs, Slack, Notion, and tribal knowledge feed the brain continuously.
AI Jobs. Candidate evals are auto-proposed from the knowledge base, alongside candidate metric definitions, relationships, and business rules.
Knowledge. Validated definitions, business rules, and data nuances live in the knowledge base, organized into namespaces and versioned for rollback.
Evals. At setup, the Delphina team runs a side-by-side review with your data leaders to validate candidate test cases. The LLM judge runs weekly and on demand, scoring agent responses against validated expected answers; deterministic checks layer on for high-stakes metrics. For customer-managed VPC deployments, the eval loop stays inside your security boundary.
Issues. The Critic Agent flags what Delphina doesn't know rather than letting the agent hallucinate. Failed evals, user feedback, and Critic flags route to the Knowledge Lead. Approved fixes land as versioned, auditable changes to the knowledge base — change log exportable to GRC tooling — and the next eval round confirms the gap closed.

Analysts have visibility, but don't need to be in the loop for every answer. The Knowledge Lead is the primary reviewer, usually signing off on patterns, not individual queries, in about an hour. Operators, CEOs, and business leaders ask questions through Agents, Slack, Data Apps, and MCP, all backed by the same validated brain and all scored by the same eval suite.

Delphina is used and trusted by data teams, CEOs, and business leaders at companies like Substack, LATAM Airlines, Medely, and BaseCamp Franchising. Substack reached 95%+ accuracy with Delphina — with the eval loop as the proof. Co-founders Jeremy Hermann (architect of Michelangelo at Uber) and Duncan Gilchrist (Director of Data Science at Uber) lead a team building the eval and context infrastructure that makes that accuracy provable and durable.

For more on the company brain architecture that evals plug into, see "What is a company brain?" For the data quality problems that make evals especially important in production, see "What data quality problems break AI data agents?"

Delphina is your company brain, anchored in data. Book a demo with your data and see the eval loop run against all your data, not just your warehouse — let the difference speak for itself.

Frequently asked questions

How do I write evals for AI data agents?

Evals for AI data agents are test cases that verify the agent produces correct answers on real production data. To write your first ones, pull the 20–30 questions your business asks most often from dashboard usage, Slack channels, and senior analysts' query history. For each, get the canonical correct answer from your senior analysts — written down, with the SQL or metric definition behind it. Those pairs become the eval seed set. Then automate the scoring loop and run it weekly, expanding coverage through query-history mining and validated LLM-generated candidates.

Are evals just for AI models, or for AI data agents?

Evals exist for both, but measure different things. Model evals (Spider, BIRD, MMLU) test the LLM's general capability. AI data agent evals test whether the agent produces correct answers to specific business questions on your data — which depends on the context the agent has access to. The same model can score 80%+ on Spider and below 50% on your warehouse.

How is an eval different from a model benchmark like Spider?

Spider tests text-to-SQL generation on a standardized academic dataset where the schema is clean and questions are constructed to be answerable. An eval for an AI data agent tests the agent on your data — your specific metric definitions, business rules, relationship graph, and edge cases. Spider measures SQL fluency. An eval measures whether the agent's understanding of your business is correct.

What makes a good test case for an AI data agent?

A good test case has three properties: it represents a question the business actually asks, it has a canonical expected answer that senior analysts agree on, and it exercises a piece of business context the agent needs to get right. The strongest eval sets include both the canonical questions that test baseline accuracy AND the edge cases that test the limits — cross-functional questions, time-comparisons that cross a fiscal year, metric questions during the week of a known migration, questions where two roles have legitimately different correct answers. Humans across teams ask hard questions and expect accurate answers; the eval set should reflect that.

How do I prevent eval circularity?

Eval circularity happens when evals are generated from the same knowledge base the agent learned from — the suite passes because both sides are reading from the same source. The fix is to validate expected answers with humans before they become ground truth. In Delphina's model, AI Jobs auto-generate candidate test questions, but candidates only become evals after the Delphina team runs a side-by-side review with the customer's data leaders to confirm each expected answer. The validated questions then become the eval source of truth — anchored in human judgment.

What's the difference between LLM-as-judge and deterministic checks?

LLM-as-judge uses an LLM to score whether the agent's response is semantically equivalent to the expected answer, which handles cases where two different SQL queries produce the same correct result. It scales and gives nuanced feedback, but the judge is itself an LLM that needs calibration. Deterministic checks compare outputs through exact rules (result-set comparison, row-count match, SQL diff). They are precise and auditable, but a naive SQL diff is brittle because different SQL can produce identical answers. The mature approach is hybrid: LLM-as-judge for the bulk of evals where semantic equivalence matters, deterministic checks layered on for high-stakes metrics where exact equivalence is required.

How often should evals run?

Evals should run on a schedule and on warehouse change events. Weekly is a common cadence; daily is appropriate for higher-stakes environments where a regression could reach a board deck quickly. Beyond the schedule, evals should trigger on events that change agent behavior — a new dbt model, a schema migration, a metric redefinition. Running evals only at deployment is a snapshot, not a system.

What does the Critic Agent do?

The Critic Agent is a real-time agent that exposes the SQL and reasoning behind every answer and flags what Delphina doesn't know rather than letting the agent hallucinate. When the knowledge base is thin or ambiguous, the Critic surfaces the gap — the missing definition, the unverified relationship — instead of letting the agent guess. Those flags route to the same issue queue as failed evals and user feedback. Evals catch regressions on known questions; the Critic catches unknowns the moment they surface.

How does Delphina handle evals in regulated environments where LLM-as-judge isn't accepted?

In regulated environments where LLM-as-judge isn't accepted as sufficient verification, Delphina shifts the hybrid mix. Deterministic checks carry more weight as the system of record for metrics the auditor will inspect. An additional human review track is added for high-stakes evals, with the Knowledge Lead signing off explicitly. The LLM-as-judge layer becomes a triage signal that surfaces likely failures for human review. For customer-managed VPC deployments, the eval loop stays inside the customer's security boundary regardless of how the scoring layers are weighted.
