AI and the Judgment Problem in Data Science
Dawn Woodward (LinkedIn), Andrés Bucchi (LATAM Airlines), and Jeremy Hermann (Delphina) join High Signal to examine how data science architecture is transforming in the AI era. The panel discusses the shift from static dashboards to conversational interfaces, emphasizing that foundational data practices such as strict cataloging, verifiable outputs, and unified data sources have become critical bottlenecks rather than optional governance measures. We dig into semantic ambiguity, upstream validation, the limits of A/B testing platforms, security architectures for AI agents, the shifting role of the analyst toward verification, and where LLMs still fall short on causal reasoning.
Key Takeaways
Semantic ambiguity kills AI utility.
Dawn highlights that at Uber, multiple conflicting definitions of a single concept like "user sessions" made AI-driven analysis impossible. Rigorous data annotation and a single source-of-truth catalog are now more critical for AI than they ever were for human analysts.
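A minimal sketch of what a single source-of-truth catalog can look like in code. The `MetricDefinition` shape, the session definition, and the SQL are illustrative assumptions, not Uber's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricDefinition:
    """One canonical definition an AI layer must resolve against."""
    name: str
    description: str
    sql: str  # the one blessed computation for this metric

# Hypothetical catalog: exactly one entry per concept, no competing definitions.
CATALOG = {
    "user_sessions": MetricDefinition(
        name="user_sessions",
        description="Distinct app sessions with >= 1 screen view, 30-min inactivity timeout.",
        sql="SELECT COUNT(DISTINCT session_id) FROM events WHERE screen_views >= 1",
    ),
}

def resolve_metric(term: str) -> MetricDefinition:
    """Fail loudly instead of letting an agent guess among conflicting definitions."""
    try:
        return CATALOG[term]
    except KeyError:
        raise KeyError(f"{term!r} is not a cataloged metric; refusing to improvise.")
```

The point of the hard failure is that an agent asked about "sessions" either gets the one blessed definition or gets nothing, rather than silently picking among variants.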
Upstream validation replaces downstream cleaning.
Andrés argues that the traditional data engineering workflow is being flipped: by implementing "strong validation first" at the point of ingestion, organizations can use AI to maintain downstream data quality that was previously too expensive or complex to manage manually.
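One way to read "strong validation first" is schema enforcement at the ingestion boundary. A minimal sketch using Pydantic; the `BookingEvent` fields and the rejection hook are illustrative assumptions, not LATAM's actual pipeline:

```python
from datetime import datetime
from pydantic import BaseModel, Field, ValidationError

def log_rejection(raw: dict, err: ValidationError) -> None:
    """Stand-in for a dead-letter queue or quarantine table."""
    print(f"rejected {raw.get('booking_id', '<unknown>')}: {err}")

class BookingEvent(BaseModel):
    """Schema enforced at the ingestion boundary."""
    booking_id: str = Field(min_length=6)
    fare_usd: float = Field(gt=0)
    departure: datetime

def ingest(raw: dict) -> BookingEvent | None:
    """Records that fail validation never enter the pipeline."""
    try:
        return BookingEvent(**raw)
    except ValidationError as err:
        log_rejection(raw, err)  # reject at the boundary, don't clean downstream
        return None
```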
A/B testing platforms are the new bottleneck.
While AI lets teams build features and variants at record speed, legacy experimentation platforms are hitting their limits. Dawn observed that the sheer volume of AI-generated experiments is creating statistical bias issues and "babysitting" requirements that internal tools were never designed to handle.
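One concrete face of that bias problem: when agents spin up hundreds of variants, per-experiment significance thresholds inflate false positives. A sketch of batch-level correction with statsmodels, using made-up p-values in place of real experiment results:

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

# Simulated p-values for 200 AI-generated variants, all true nulls here.
rng = np.random.default_rng(0)
p_values = rng.uniform(size=200)

# Naive thresholding: ~5% "winners" expected even when nothing works.
naive_wins = (p_values < 0.05).sum()

# Benjamini-Hochberg controls the false discovery rate across the batch.
rejected, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
print(f"naive winners: {naive_wins}, BH-corrected winners: {rejected.sum()}")
```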
AI requires "headless" security architectures.
Standard hierarchical permission models break when agents are introduced. Andrés advocates for building "headless" AI services that live in the middleware, allowing agents to inherit and enforce the specific identity tokens and access rights of the human user they represent.
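A minimal sketch of that pass-through pattern; `UserIdentity`, `run_llm_over_data`, and the scope check are hypothetical names for illustration, not Andrés's actual middleware:

```python
from dataclasses import dataclass

def run_llm_over_data(dataset: str, question: str, auth: str) -> str:
    """Stand-in for the real model + warehouse call (hypothetical)."""
    return f"answer to {question!r} over {dataset}"

@dataclass(frozen=True)
class UserIdentity:
    token: str               # the human caller's own credential
    scopes: frozenset[str]   # datasets this user may touch

def agent_query(identity: UserIdentity, dataset: str, question: str) -> str:
    """Headless middleware: the agent acts as the user, never as a superuser."""
    if dataset not in identity.scopes:
        raise PermissionError(f"user token lacks access to {dataset!r}")
    # Downstream calls carry the user's own token, so warehouse-level
    # row and column policies apply to the agent automatically.
    return run_llm_over_data(dataset, question, auth=identity.token)
```

Because the agent never holds credentials of its own, revoking the human's access revokes the agent's too.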
The "Analyst" is now a "Verifier."
As non-technical leaders use GenAI to self-serve data queries, the role of the data expert is shifting. Instead of wrangling data to answer a question, technical teams must now focus on "verifiable outputs"—auditing the AI's chain of thought to ensure the analysis isn't based on a hallucination or a biased dataset.
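A minimal sketch of one form verification can take: recomputing an AI-reported number from a structured trace instead of trusting its prose. The `claim` schema here is an assumption for illustration:

```python
import pandas as pd

def audit_claim(df: pd.DataFrame, claim: dict) -> bool:
    """Recompute an AI-reported aggregate from the data it cites.

    `claim` is a hypothetical structured trace the agent emits alongside
    its prose answer, e.g.:
    {"filter": ("region", "EU"), "column": "revenue", "agg": "sum", "value": 1.2e6}
    """
    col, val = claim["filter"]
    subset = df[df[col] == val]
    recomputed = getattr(subset[claim["column"]], claim["agg"])()
    tolerance = 1e-6 * max(1.0, abs(claim["value"]))
    return abs(recomputed - claim["value"]) <= tolerance
```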
AI imitates causal workflows without causal reasoning.
For all their next-token prediction power, LLMs are not inherently causal or probabilistic reasoners. Dawn notes that while an agent can be prompted to imitate the steps of a causal analysis, true inference still requires the AI to invoke specialized statistical frameworks (like PyMC) rather than relying on its own reasoning.
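A toy example of that division of labor: the LLM can scaffold the steps, but the inference itself runs in PyMC. The data and model below are simulated assumptions, not anything from the episode:

```python
import numpy as np
import pymc as pm

# Simulated randomized experiment: binary treatment with a true effect of 2.0.
rng = np.random.default_rng(42)
treated = rng.integers(0, 2, size=500)
outcome = 1.0 + 2.0 * treated + rng.normal(0, 1, size=500)

with pm.Model():
    baseline = pm.Normal("baseline", mu=0, sigma=5)
    effect = pm.Normal("effect", mu=0, sigma=5)  # the causal quantity of interest
    sigma = pm.HalfNormal("sigma", sigma=2)
    pm.Normal("obs", mu=baseline + effect * treated, sigma=sigma, observed=outcome)
    trace = pm.sample(1000, tune=1000, progressbar=False)

# The agent can be asked to *run* this model; the inference lives in PyMC, not the LLM.
print(trace.posterior["effect"].mean().item())
```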
Brownfield codebases remain the final frontier.
There is a massive gap between "vibe coding" a greenfield application and integrating AI into a "brownfield" enterprise codebase. Andrés points out that while AI can generate snippets, it cannot yet verify whether those changes will scale or break existing complex systems, so high-level human architectural judgment is still required.
Qualitative evals are replacing traditional metrics.
In the era of conversational interfaces, traditional ranking metrics like AUC or precision are becoming insufficient. The panel suggests that evaluation is shifting toward "AI-on-AI" qualitative assessments, where models are used to grade the helpfulness and nuance of a conversational experience.
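A minimal sketch of the LLM-as-judge pattern, assuming you already have some `call_model` client that takes a prompt string and returns text; the rubric and JSON shape are illustrative, not a specific vendor's API:

```python
import json

RUBRIC = """Rate the assistant's answer from 1 to 5 on each axis:
- faithfulness: does it match the retrieved data?
- helpfulness: does it answer the actual question?
Return JSON like {"faithfulness": 4, "helpfulness": 5}."""

def grade(question: str, answer: str, context: str, call_model) -> dict:
    """AI-on-AI eval: a judge model scores one conversational turn.

    `call_model` is whatever LLM client you already have (hypothetical
    here); it takes a prompt string and returns the model's text reply.
    """
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nContext: {context}\nAnswer: {answer}"
    return json.loads(call_model(prompt))
```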
Data engineering is shrinking to expand.
LATAM Airlines reduced its data engineering headcount by 20% by automating routine pipeline tasks. This wasn't a cost-cutting measure, but a strategic reallocation: moving those engineers to high-value areas where AI can bridge "tech gaps" that were previously prohibitively expensive to close.
You can read the full transcript here.
