The Retrieval Backbone of Modern AI

Databricks’ Adam Gurary on retrieval, metadata, and why most AI work happens before the model

Dec 16, 2025

If you peel back most successful AI applications today, you’ll find retrieval at the core. Modern AI systems are only as effective as their ability to get the right data into context at the right time. That retrieval layer underpins everything from enterprise RAG systems to reconciliation pipelines to personalization engines.

This week I sat down with Adam Gurary, a product manager on the AI team at Databricks who focuses entirely on search and retrieval. Adam didn’t begin on the retrieval side of AI. His earlier work at C3.ai focused on model serving, which gave him a close-up view of how data scientists build pipelines that rely on LLM inference and retrieval.

Now at Databricks, Adam sees retrieval patterns and pain points across hundreds of customers, giving him a rare vantage point on what’s actually happening behind the scenes of AI adoption.

During our conversation, three themes became clear:

Data engineering and ingestion are a critical bottleneck.
Eval debt is why so many AI projects stall before production.
Metadata (and batch systems) are massively under-discussed levers for quality and differentiation.

Below I unpack these insights and share Adam’s guidance for founders building on top of retrieval-heavy stacks.

Where Retrieval Shows Up in the Wild

Retrieval is almost everywhere. As Adam puts it, retrieval is relevant for “anyone that has internal knowledge bases and might need to match things or has a search bar.”

Across industries, retrieval workloads tend to fall into one of three buckets:

RAG and Knowledge Systems: These include chat and Q&A over internal or semi-structured data, knowledge assistants, and internal support or sales-enablement tools. This could be any scenario where employees need fast, high-quality access to institutional knowledge.
Matching Engines: Everything from aligning applicants with job postings to reconciling financial transactions with entities to powering recommendation and personalization pipelines. Any workflow that requires “find the closest match” falls into this camp.
Search and Discovery: Internal knowledge search, content and ecommerce search bars, and type-ahead/autocomplete experiences all rely heavily on vector search for relevance and speed.

In all three cases, retrieval determines what information is available to the system in the first place.

The Hardest Part Is Getting Data Into the Index

Adam notes that everyone talks about which vector database to use, but almost no one talks about the ingestion pipeline, which is where most teams get stuck. “Let’s say you want to expose search over some knowledge base… you have to actually get data into that index, right? And that’s not a trivial task,” Adam said.

An ingestion pipeline typically requires:

Detecting changes/additions/deletions in the source systems.
Pulling data from the source systems, like SharePoint, S3, and other sources.
Incrementally parsing, chunking, cleaning, and preprocessing.
Embedding content.
Inserting into or deleting from the index.
Keeping everything consistent with the source system over time.
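
The steps above can be sketched as a small incremental sync loop. This is a minimal illustration under stated assumptions, not how any particular vendor implements it: the `Index` class, the `sync` function, content-hash change detection, fixed-width chunking, and the placeholder `embed` function are all hypothetical stand-ins for a real source connector, parser, and embedding model.

```python
import hashlib
from dataclasses import dataclass, field


def fingerprint(text: str) -> str:
    """Content hash used to detect whether a document changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunk(text: str, size: int = 200) -> list[str]:
    """Naive fixed-width chunking; real pipelines usually split on structure."""
    return [text[i:i + size] for i in range(0, len(text), size)] or [""]


def embed(piece: str) -> list[float]:
    """Placeholder embedding (hypothetical); swap in a real embedding model."""
    digest = hashlib.sha256(piece.encode("utf-8")).digest()
    return [b / 255 for b in digest[:8]]


@dataclass
class Index:
    # doc_id -> content hash recorded at the last sync
    hashes: dict[str, str] = field(default_factory=dict)
    # doc_id -> one vector per chunk
    vectors: dict[str, list[list[float]]] = field(default_factory=dict)


def sync(source: dict[str, str], index: Index) -> dict[str, list[str]]:
    """Bring the index in line with the source, touching only what changed."""
    report = {"added": [], "updated": [], "deleted": []}
    for doc_id, text in source.items():
        new_hash = fingerprint(text)
        old_hash = index.hashes.get(doc_id)
        if old_hash == new_hash:
            continue  # unchanged: skip chunking and embedding entirely
        report["added" if old_hash is None else "updated"].append(doc_id)
        index.vectors[doc_id] = [embed(c) for c in chunk(text)]
        index.hashes[doc_id] = new_hash
    # Remove entries whose source document no longer exists.
    for doc_id in list(index.hashes):
        if doc_id not in source:
            del index.hashes[doc_id]
            del index.vectors[doc_id]
            report["deleted"].append(doc_id)
    return report
```

Calling `sync` repeatedly with the current state of the source re-embeds only new or modified documents and drops deleted ones. Even this toy version has to carry state between runs (the stored hashes), which hints at why the production version needs real engineering.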

This requires data engineering, and most AI teams dramatically underestimate its importance at the start. As Adam put it, “just to make SharePoint searchable, it’s basically impossible for most customers because you need a team of data engineers to do this.”

What makes it hard isn’t the first index, but everything that comes after. Real systems have to detect changes in near real time and incrementally parse, chunk, embed, and index only the documents that are new or modified, while correctly removing entries for documents that have been deleted. On top of that, they need to handle versioning and enforce user- and group-specific access controls so people only retrieve what they’re allowed to see. Every one of these requirements adds state, coordination, and failure modes.

Adam sees this across customer after customer. Many teams start by debating Milvus vs. OpenSearch vs. other self-managed options, but they quickly realize that “they end up spending all of their time on data engineering because they have to build these pipelines to keep source systems and indices in sync over time.” That operational burden is why managed, source-aware approaches can be so compelling. As Adam explained, “One of the major reasons we win search workloads on lakehouse data is because of Delta Sync. You give us a table, and we give you an index on that table and manage all of the pipelines to compute embeddings and keep it in sync.”

Adam recommends treating vendor support for your source systems (and how much data engineering that support saves your team) as a major consideration in the selection process: “Really think through what sources you’re going to be searching over, how you’re going to process them, and which vendors make that easier because you don’t necessarily want to be spending all of your time on data engineering.”

For example, if your data is in DynamoDB, Amazon OpenSearch will save you pipeline work, whereas if your data is in SharePoint, Azure AI Search and Databricks have native integrations. If your data is already on the Databricks lakehouse, then it’s a safe default to use Databricks Vector Search to avoid heavy data engineering.

Source-system-first thinking starts with where your data lives rather than with which database to use. It then chooses tooling that minimizes the ingestion and synchronization work required to make that data usable, so you can focus on your applications instead of data pipelines.

Evaluation Debt

The second pattern Adam often observes is that teams can’t ship because they can’t measure quality. “Being able to evaluate your system seems basic,” he said. “In classical ML, it was always out-of-the-box. No one has really had to think about evals seriously until now.”

Without evals, teams don’t know whether changes actually help. “Anytime they do anything, they can’t prove it’s better than it was before,” Adam told me. That uncertainty is most painful before a system ever reaches production, when teams haven’t yet delivered business impact and can’t demonstrate that they’re getting closer to doing so. Without a way to show progress, projects stall before they even have a chance to prove value.

Part of the challenge is that retrieval quality is mostly about retrieving the right documents, but what “right” means varies dramatically by business context. And the software engineers building these systems usually don’t know what a relevant document looks like, because they’re not the subject-matter experts (SMEs) they’re building the apps for.

In some use cases, recency matters more than semantic closeness. In others, diversity across the retrieved set is critical. Sometimes business logic overrides relevance entirely – for example, boosting certain results because of contractual or commercial considerations. These signals rarely live in embeddings, and they’re often invisible unless you talk directly to the SMEs.

This is where many retrieval systems break down. Adam argues that the problem is largely a persona mismatch: “Software engineers aren’t used to evals being part of the development cycle. It’s not in their muscles.” Engineers optimize for generic notions of similarity, but the real determinants of quality live in metadata, ranking rules, and domain constraints that have nothing to do with vector distance. Getting retrieval “right” is about encoding business logic into the retrieval layer so the system surfaces what actually matters for that specific workflow.

To do this well, Adam says it’s important to:

Talk to SMEs to understand what “good” means.
Build a seed dataset of real Q&A pairs.
Use synthetic data generation to expand coverage. “One of the things we spent a lot of time on at Databricks was making it easy to generate synthetic eval data out of the box,” Adam told me. “Most teams know they need broader coverage, but they don’t have the time or tooling to hand-curate large evaluation sets.”
Use LLM judges to assess grounding. “We’ve put a lot of effort into making judging workflows usable out of the box, because while they’re incredibly powerful for measuring progress, most teams don’t want to build that infrastructure from scratch.” Teams often evaluate their systems using different models than the ones running in production, which helps avoid the model “grading its own homework.” While LLM judges may not tell you with high confidence that a system is exactly 60% accurate, they’re extremely powerful for measuring relative improvement over time. If you make a change and your eval score jumps from 60% to 78%, that’s a strong signal you’re moving in the right direction.
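
To make the idea of measuring relative improvement concrete, here is a minimal sketch that scores two versions of a retriever against a small seed set. It deliberately uses recall@k over SME-labelled documents rather than an LLM judge, since that needs no model call. The seed set, document IDs, and both toy retrievers are hypothetical.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall(retriever, seed_set: list[tuple[str, set[str]]], k: int = 3) -> float:
    """Average recall@k of a retriever over (query, relevant_ids) pairs."""
    scores = [recall_at_k(retriever(query), relevant, k) for query, relevant in seed_set]
    return sum(scores) / len(scores)


# Hypothetical seed set: queries labelled with the documents SMEs consider relevant.
SEED_SET = [
    ("refund policy", {"doc-1", "doc-4"}),
    ("reset password", {"doc-2"}),
]


# Two toy retrievers standing in for the system before and after a change.
def baseline(query: str) -> list[str]:
    return {
        "refund policy": ["doc-3", "doc-1", "doc-5"],
        "reset password": ["doc-5", "doc-3", "doc-1"],
    }[query]


def candidate(query: str) -> list[str]:
    return {
        "refund policy": ["doc-1", "doc-4", "doc-3"],
        "reset password": ["doc-2", "doc-5", "doc-3"],
    }[query]


before = mean_recall(baseline, SEED_SET)
after = mean_recall(candidate, SEED_SET)
print(f"recall@3 before={before:.2f} after={after:.2f} delta={after - before:+.2f}")
```

The absolute numbers matter less than the delta: with a fixed seed set, any change to chunking, filtering, or ranking can be checked against the same yardstick before it ships.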

Architecting the Retrieval Stack: Performance, Metadata, and Realistic Constraints

Different UX surfaces impose different retrieval budgets. For example, type-ahead search (aka autocomplete) is extremely latency-sensitive and can’t afford heavy retrieval or LLM re-ranking, since responses must return in tens of milliseconds. Conversational agents, by contrast, tolerate 1-3 seconds of latency, making multi-stage retrieval and LLM re-ranking feasible.

As Adam puts it, “You have to think about latency and performance considerations first before you start making decisions.”

If your latency budget is larger, you can stack on:

Hybrid search: Combines lexical search (keywords, filters, metadata) with vector search (embeddings) to balance precision, recall, and semantic understanding.
Re-ranking: Takes an initial set of retrieved candidates and uses a more expensive model (often an LLM or cross-encoder) to reorder them by relevance.
Query rewriting and expansion: Transforms a user’s original query into clearer, more complete, or multiple related queries to improve retrieval quality.
Larger embedding models: Higher-capacity embedding models capture richer semantic nuance and domain context, typically improving recall at the cost of higher latency and compute.
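
As one illustration of the hybrid search item above, the sketch below merges a lexical ranking and a vector ranking with reciprocal rank fusion (RRF). RRF is one common fusion method, not necessarily what any given product uses; the function name, the document IDs, and the conventional default of `k = 60` are assumptions for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists; each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the same query from two retrievers.
lexical = ["doc-a", "doc-b", "doc-c"]
vector = ["doc-c", "doc-a", "doc-d"]

fused = reciprocal_rank_fusion([lexical, vector])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever still make the candidate set, which is the precision and recall balance hybrid search is after. A re-ranker would then reorder the top of `fused` with a more expensive model.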

Across all of these approaches, Adam believes that the most underused lever is still metadata. As he put it, “If you have metadata, it can be the biggest lever for retrieval quality.” Metadata can come from product catalogs or PDF repositories, or it can be as simple as recency, which is one of the simplest but most powerful filters.

He gives a simple example: if a customer searches for “blue Toyota” and your dataset has structured metadata fields like make, model, and color, you can first filter the corpus to only rows where make = Toyota before running any semantic search. Instead of embedding-searching across millions of vehicles, you’ve constrained the search space to a much smaller, highly relevant subset, which improves accuracy. In practice, this kind of metadata-first filtering often matters more than upgrading to a larger embedding model or adding a complex re-ranker.
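
The “blue Toyota” example can be sketched in a few lines: filter on structured fields first, then rank only the survivors by vector similarity. This is a toy illustration under stated assumptions. The `Vehicle` record, the two-dimensional embeddings, and the `filtered_search` helper are hypothetical, and a real system would push the filter down into the index rather than scan rows in Python.

```python
import math
from dataclasses import dataclass


@dataclass
class Vehicle:
    vehicle_id: str
    make: str
    color: str
    embedding: list[float]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(rows: list[Vehicle], query_vec: list[float],
                    filters: dict[str, str], top_k: int = 2) -> list[str]:
    """Apply exact-match metadata filters, then rank the remainder by similarity."""
    candidates = [
        row for row in rows
        if all(getattr(row, name) == value for name, value in filters.items())
    ]
    ranked = sorted(candidates, key=lambda row: cosine(row.embedding, query_vec), reverse=True)
    return [row.vehicle_id for row in ranked[:top_k]]


# Hypothetical rows; the embeddings are made up for illustration.
ROWS = [
    Vehicle("v1", "Toyota", "blue", [0.9, 0.1]),
    Vehicle("v2", "Toyota", "red", [0.2, 0.8]),
    Vehicle("v3", "Honda", "blue", [0.95, 0.05]),
]

results = filtered_search(ROWS, [1.0, 0.0], {"make": "Toyota"})
print(results)
```

Note that `v3` is the nearest neighbour by embedding alone, yet it never appears in the results because the `make` filter removes it before any similarity is computed.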

For teams choosing retrieval infrastructure, Adam suggests choosing vendors based on the type of metadata you have: geospatial constraints, recency, access controls, hierarchical taxonomies, or other domain-specific attributes that materially define relevance. “Oftentimes you don’t need a fancy embedding model. You don’t need hybrid search,” Adam stressed. “You literally just filter on metadata in advance of the query.” That insight comes directly from what Databricks has seen in production, and it’s why they’ve invested so heavily in robust metadata filtering, a direction the rest of the market is now converging on as well.

The Next Frontier: Offline Batch LLM Systems

When I asked Adam about what he’s most excited about in the world of AI, he said that it’s the rise of batch AI systems that don’t return answers in real time.

“There’s been a lot of hype around real-time gen AI,” he said, “but something that should really be on people’s radar is offline batch systems.”

Adam described a pattern he’s seeing more frequently among sophisticated teams: you submit a question and wait an hour or longer, but the output is dramatically better. “I see customers building these crazy batch pipelines,” he explained. “They pull in massive amounts of data, run a cheap LLM to extract and enrich metadata, filter and structure it, and then pass it to another model to generate a report. It might take eight hours, but what comes out is executive-quality.”

In practice, these systems look a lot like traditional ETL jobs, except with LLMs woven throughout the pipeline as flexible, semantic transformation layers.
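
The shape of such a pipeline can be sketched as three stages: enrich, filter, and summarize. This is only a structural illustration. `run_batch` is a hypothetical helper, and `fake_enrich` and `fake_summarize` are plain-Python stand-ins for the cheap extraction model and the stronger report-writing model, so no real LLM is called.

```python
from typing import Callable

Record = dict[str, str]


def run_batch(
    documents: list[str],
    enrich: Callable[[str], Record],
    keep: Callable[[Record], bool],
    summarize: Callable[[list[Record]], str],
) -> str:
    """ETL-style batch job: enrich every document, filter, then write one report."""
    enriched = [enrich(doc) for doc in documents]      # cheap model, one call per document
    selected = [rec for rec in enriched if keep(rec)]  # ordinary filtering on extracted metadata
    return summarize(selected)                         # stronger model, one call over the subset


def fake_enrich(doc: str) -> Record:
    """Stand-in for an LLM that extracts metadata (hypothetical)."""
    topic = "revenue" if "revenue" in doc.lower() else "other"
    return {"text": doc, "topic": topic}


def fake_summarize(records: list[Record]) -> str:
    """Stand-in for an LLM that writes the final report (hypothetical)."""
    return f"{len(records)} revenue items: " + "; ".join(rec["text"] for rec in records)


report = run_batch(
    ["Q3 revenue grew", "Office party on Friday", "Revenue forecast raised"],
    enrich=fake_enrich,
    keep=lambda rec: rec["topic"] == "revenue",
    summarize=fake_summarize,
)
print(report)
```

Passing the model calls in as functions keeps the orchestration testable without a model, and it lets the expensive model see only the records that survive the metadata filter.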

There’s more technical risk with these systems because they can’t be assembled from off-the-shelf frameworks. You can’t just ask LangChain to ingest a billion documents, extract charts from PDFs, and synthesize an executive report. “There’s nothing out of the box,” Adam told me. “It’s harder to build, takes more time, and comes with real technical risk.”

Batch systems make explicit where a lot of the real work in enterprise AI lives. The throughline in Adam’s perspective is that the hardest and most valuable work for enterprise AI deployments, especially in legacy verticals, happens before and around the models, not inside them. Retrieval forces teams to grapple with things like messy data, latency budgets, and evals, which is where most projects succeed or fail.

Mixture of Experts
