4 models · 18 datasets · 468K question evaluations
Where retrieval breaks and what we can do about it
Miguel Cardoso · AI Search · March 2026
What we measured and how
Full ablation across chunk sizes (256/512/1024), search modes (vector/hybrid/hybrid+fuzzy), 4 embedding models, and multiple pipeline versions
| Model | Type | Params | Status |
|---|---|---|---|
| Qwen3 | CF-served | 0.6B | Default — wins 5/12 (PubMedQA, CourtListener, TechQA, Legal CR, CUAD) |
| Gemini2 | 3rd-party (Google) | N/A | Neck and neck — wins 5/12 (emanual, ExpertQA, FinQA, NewsQA, QASPER) |
| BGE-M3 | CF-served | 0.6B | Strong MRR on short docs, but R@10 lags top two |
| Gemma | CF-served | 0.3B | Weakest across the board |
| Dataset | Domain | Docs | Questions | Avg Doc Size | Challenge |
|---|---|---|---|---|---|
| MS MARCO | Web passages | 3,481 | 423 | ~105 tok | Short docs, lexically similar |
| HotpotQA | Wikipedia | 1,550 | 390 | ~123 tok | Multi-hop reasoning |
| ExpertQA | Expert knowledge | 808 | 203 | ~654 tok | Long-form expert answers |
| FinQA | Financial tables | 1,097 | 2,294 | ~346 tok | Tabular + numerical |
| TechQA | Technical docs | 769 | 314 | ~832 tok | Technical terminology |
| PubMedQA | Biomedical | 5,932 | 2,450 | ~96 tok | Domain-specific vocabulary |
| CUAD | Legal contracts | 102 | 510 | ~10.6K tok | Very long docs, few templates |
| emanual | Product manuals | 102 | 132 | ~214 tok | Short consumer content |
| NewsQA | News articles | 638 | 4,212 | ~756 tok | Journalism, broad topics |
| CourtListener | Court opinions | 1,979 | 2,000 | ~12.1K tok | Long legal, citations |
| LegalCaseReports | Case reports | 770 | 770 | ~10.2K tok | Long legal |
| QASPER | Academic papers | 416 | 1,372 | ~5.5K tok | Long, multi-section |
| CRAG | Mixed domains | 5,090 | 1,335 | ~5.2K tok | Hallucination, answer quality |
| code-eval | Source code | 30 | 142 | ~1.3K tok | Structured, non-prose |
| json-eval | JSON documents | 30 | 159 | ~758 tok | Structured, key-value |
| BEIR-SciFact | Scientific claims | 5,183 | 300 | ~390 tok | Fact verification (0.85 R@10) |
| BEIR-NFCorpus | Medical/nutrition | 3,593 | 323 | ~392 tok | Multi-relevance levels |
| BEIR-FiQA | Finance Q&A | 57,599 | 648 | ~198 tok | Large corpus, opinion-heavy |
| BEIR-ArguAna | Argument mining | 8,626 | 1,401 | ~266 tok | Counterargument retrieval |
| BEIR-SciDocs | Scientific papers | 25,656 | 1,000 | ~304 tok | Citation prediction |
| MIRACL Arabic | Multilingual (AR) | 2,061 | 2,723 | ~150 tok | Non-Latin script (0.94 R@10) |
| MIRACL Korean | Multilingual (KO) | 1,486 | 199 | ~200 tok | Non-Latin script (0.86 R@10) |
Red = long documents (>2K tokens) or structured data — where retrieval degrades most. MIRACL (Arabic 0.94, Korean 0.86) = strong multilingual. BEIR-SciFact 0.85. Other BEIR in progress.
Where we're strong and where we break
| Dataset | Model | R@3 | R@5 | R@10 |
|---|---|---|---|---|
| emanual | Gemini2 | 0.851 | 0.962 | 1.000 |
| HotpotQA | Qwen3 | 0.982 | 0.997 | 0.999 |
| MS MARCO | Qwen3 | 0.510 | 0.746 | 0.990 |
| ExpertQA | Gemini2 | 0.773 | 0.891 | 0.954 |
| PubMedQA | Qwen3 | 0.549 | 0.800 | 0.925 |
| TechQA | Qwen3 | 0.655 | 0.766 | 0.897 |
| CourtListener | Qwen3 | 0.851 | 0.875 | 0.889 |
| FinQA | Gemini2 | 0.700 | 0.780 | 0.862 |
| NewsQA | Gemini2 | 0.753 | 0.806 | 0.855 |
| MIRACL Arabic | Qwen3 | 0.850 | 0.902 | 0.935 |
| MIRACL Korean | Qwen3 | 0.760 | 0.824 | 0.863 |
| BEIR-SciFact | Qwen3 | 0.745 | 0.797 | 0.847 |
| QASPER | Gemini2 | 0.422 | 0.460 | 0.500 |
| json-eval | Gemini2 | 0.487 | 0.487 | 0.487 |
| code-eval | Gemini2 | 0.323 | 0.330 | 0.330 |
| CUAD | Qwen3 | 0.035 | 0.059 | 0.102 |
| Model | Best Config | Accuracy | Hallucination | Missing | Composite |
|---|---|---|---|---|---|
| Qwen3 | 256+h+f | 31.6% | 27.9% | 40.6% | +0.04 |
| Gemini2 | 512+h | 35.9% | 34.8% | 29.3% | +0.01 |
| Gemma | 512 | 27.2% | 63.5% | 9.3% | -0.36 |
| BGE-M3 | 256 | 26.6% | 64.7% | 8.7% | -0.38 |
BGE-M3 and Gemma rarely say "I don't know" (low missing) but are wrong 2/3 of the time. Qwen3 achieves the lowest hallucination rate (28%) by abstaining more aggressively (41% missing). Both top models achieve positive composites.
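The Composite column is consistent with accuracy minus hallucination rate (an inferred formula, not confirmed in the source — it reproduces all four table values, with the missing rate left out of the score):

```python
# Accuracy and hallucination rates from the CRAG table above.
# Assumption: composite = accuracy - hallucination (matches all four rows).
results = {
    "Qwen3":   (0.316, 0.279),
    "Gemini2": (0.359, 0.348),
    "Gemma":   (0.272, 0.635),
    "BGE-M3":  (0.266, 0.647),
}
for model, (acc, halluc) in results.items():
    print(f"{model}: {acc - halluc:+.2f}")  # Qwen3 +0.04 ... BGE-M3 -0.38
```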
A taxonomy of why retrieval fails
Semantic search retrieves content that is about a topic.
RAG needs to deliver specific facts, claims, and statements.
— The fundamental gap
| # | Failure Mode | Core Issue |
|---|---|---|
| 1 | Dilution | Fact is 10% of a chunk's embedding signal |
| 2 | Aboutness vs. Answerness | "About X" does not mean "Answers X" |
| 3 | Shared Facts | Same fact in N docs, no linking or authority |
| 4 | No Connective Structure | Chunks are flat — can't compose facts across docs |
| 5 | Query-Document Asymmetry | Short specific query vs. long general passage |
| 6 | Unembeddable Information | Negation, comparison, absence — not representable |
Problem: A chunk has 10 sentences. The answer is in one sentence. The embedding represents the average — the fact contributes ~10%.
Result: A chunk that's entirely about the topic ranks higher than the one with the answer.
Worse with: Larger chunks, dense config refs, mixed-content pages
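Dilution can be sketched with a toy model (assumption: the chunk embedding behaves like a mean-pool of sentence embeddings; all vectors are synthetic, not from a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

fact = unit(rng.normal(size=384))                        # the one answer-bearing sentence
filler = [unit(rng.normal(size=384)) for _ in range(9)]  # 9 on-topic but non-answering sentences
query = unit(fact + 0.2 * unit(rng.normal(size=384)))    # query paraphrasing the fact

chunk_10 = unit(np.mean([fact] + filler, axis=0))  # 10-sentence chunk: fact is ~10% of the signal
chunk_1 = fact                                     # fact isolated (a "proposition")

print("query vs 10-sentence chunk:", float(query @ chunk_10))
print("query vs fact alone:       ", float(query @ chunk_1))
```

In high dimensions the filler sentences are near-orthogonal to the query, so the fact's contribution to the pooled vector collapses: similarity to the full chunk lands far below similarity to the isolated fact, which is the case for proposition-level indexing.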
Problem: Embedding models are trained on similarity, not Q&A alignment. "Discusses OAuth" and "States Service X uses OAuth" look identical.
Result: Top-K is full of explanatory content, the one-liner answer is pushed out.
Worse with: Large corpora with deep topic coverage, well-written docs
| Relationship | Useful? |
|---|---|
| Defines the concept | Low |
| Discusses the concept | Low |
| References the concept | Very Low |
| States a specific claim | High |
| Provides evidence | High |
Example: the same limit appears as rate_limit_rpm, requests_per_minute, and alert_threshold_rpm — embedding similarity can't deduplicate across schemas.
Problem: Facts live in separate chunks with no relationships. Multi-hop questions fail.

Math: If single-hop recall = 80%, then 2-hop = 64%, 3-hop = 51%
Especially hard for structured data: Join keys (owner_team → team_id → rotation) are opaque identifiers with no semantic meaning
Example: "Who is on-call for the payments service?"
services.json → svc-payments → team-billing
teams.json → team-billing → billing-primary
oncall.csv → billing-primary → carol@example.com
3 hops across 3 files. Search finds "payments" but never reaches the on-call CSV.
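The join chain is trivial once the hops are explicit — the point is that similarity search never composes them. A sketch with illustrative data mirroring the three files above:

```python
# Illustrative data shaped like services.json / teams.json / oncall.csv.
# The join keys (owner_team, rotation) are opaque IDs with no semantic meaning.
services = {"svc-payments": {"name": "payments", "owner_team": "team-billing"}}
teams = {"team-billing": {"rotation": "billing-primary"}}
oncall = {"billing-primary": "carol@example.com"}

def who_is_on_call(service_name: str) -> str:
    svc = next(s for s in services.values() if s["name"] == service_name)  # hop 1
    rotation = teams[svc["owner_team"]]["rotation"]                        # hop 2
    return oncall[rotation]                                                # hop 3

print(who_is_on_call("payments"))  # carol@example.com

# Why retrieval fails here: per-hop recall compounds.
for hops in (1, 2, 3):
    print(hops, "hop(s):", round(0.80 ** hops, 2))  # 0.8, 0.64, 0.51
```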
Problem: 10-token query vs. 300-token passage. Config blocks, JSON, CSV embed poorly compared to prose.
The format gap:
| Format | Embed Quality |
|---|---|
| Prose sentence | Good |
| Table row | Moderate |
| YAML / JSON key | Poor |
| CSV cell | Very Poor |
Our benchmark evidence:
Structured content consistently retrieves worse than prose.
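One common mitigation for the format gap is to verbalize structured rows into prose before embedding, so the embedder sees a sentence instead of a key-value fragment. A minimal sketch (field names and data are invented for illustration, not from our pipeline):

```python
import csv, io

raw = """service,owner_team,rate_limit_rpm
payments,team-billing,1200
search,team-core,300
"""

def verbalize(row: dict) -> str:
    # Turn a CSV row into the prose sentence we would embed instead.
    return (f"The {row['service']} service is owned by {row['owner_team']} "
            f"and has a rate limit of {row['rate_limit_rpm']} requests per minute.")

for row in csv.DictReader(io.StringIO(raw)):
    print(verbalize(row))
```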
Examples: 29.99 and 189.00 embed identically; category = 'electronics' AND price < 50 has no vector-space equivalent.
Key distinction: Prose negation is a solvable model-quality problem. Structured operations over JSON/CSV are fundamentally not similarity — no embedding model will make price < 50 work via cosine similarity.
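Redirecting FM6b starts with recognizing structured intent at query time. A toy router (the regex is purely illustrative — a real classifier would be an LLM or trained model, and "sql" here stands in for a D1 / R2 SQL backend):

```python
import re

# Illustrative heuristic: comparison/filter phrasing followed by a number
# signals a structured operation that cosine similarity cannot express.
STRUCTURED = re.compile(
    r"(<=|>=|<|>|under|over|between|at least|at most)\s*\$?\d", re.I
)

def route(query: str) -> str:
    return "sql" if STRUCTURED.search(query) else "vector"

print(route("electronics under $50"))              # sql
print(route("how does OAuth token refresh work"))  # vector
```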
Mapping our data to failure modes
| Benchmark Signal | Primary FM | Evidence |
|---|---|---|
| CUAD 0.10 R@10 | FM1 (Dilution) | 41 templates × 100+ identical contracts. ~10.6K tok docs collapse |
| QASPER 0.50 R@10 ceiling | FM1 + FM4 | Long academic papers (~5.5K tok), answers scattered across sections |
| code-eval 0.33 / json-eval 0.49 | FM5 (Asymmetry) | Structured formats embed poorly. Gemini2 no better (0pp gap) |
| Fuzzy -33pp on PubMedQA (Qwen3) | FM3 + FM1 | Citation-heavy short docs — fuzzy BM25 overwhelms vector signal |
| CRAG 28-65% hallucination | FM2 + FM6 | Retrieved "about" but not "answers"; unembeddable query types |
| HotpotQA v6→v8: 0.79→0.999 | FM4 (solved) | Pipeline improvements resolved multi-hop on this dataset |
With Qwen3 and Gemini2, hybrid is safe almost everywhere. Most gains come from fuzzy matching (next slide).
Pattern: weaker embeddings can't absorb BM25 noise. Model-dependent, not dataset-dependent.
Pattern: diverse vocabulary, technical terms, academic content
Pattern: citation-heavy, short lexically similar docs where fuzzy BM25 overwhelms the vector signal
Same model (Qwen3), same configs — only the pipeline changed (table parsing, boundary detection, overlap). Chunking quality matters more than model or config tuning. Infrastructure investment has the highest ROI.
Product strategy per failure mode
| Stance | Meaning | Implication |
|---|---|---|
| Fix | This is our problem. We invest in solving it. | Roadmap commitment, measurable improvement |
| Mitigate | Can't fully solve, but reduce the damage. | Partial solutions, honest limits in docs |
| Redirect | Not a search problem. Point elsewhere. | Guidance toward D1, R2 SQL, Agents SDK |
| # | Failure Mode | Stance | Reasoning |
|---|---|---|---|
| 1 | Dilution | Fix | Core retrieval quality. Better chunking + reranking helps everyone |
| 2 | Aboutness | Fix | Reranking helps. Proposition indexing would help more |
| 3 | Shared Facts | Mitigate | Full dedup is hard. Diversity + metadata hooks are tractable |
| 4 | No Structure | Redirect? | Multi-hop = graph problem. No CF graph DB. Agents SDK? |
| 5 | Asymmetry | Fix | Format gap is our pipeline's problem. Contextual retrieval helps |
| 6a | Prose negation | Mitigate | Weak but real signal. Better models + query decomposition help |
| 6b | Structured ops | Redirect | Comparison, filtering, aggregation = database ops. D1 / R2 SQL |
| Approach | FM1 Dilute | FM2 About | FM3 Shared | FM4 Struct | FM5 Asym | FM6 Unemb |
|---|---|---|---|---|---|---|
| Better Reranking | | YES | | | | |
| Proposition Indexing | YES | YES | partial | | YES | |
| Contextual Retrieval | YES | | | | YES | |
| Content-Aware Chunking | YES | | | | YES | |
| Graph RAG / Entity Index | | | YES | YES | | partial |
| Diversity / MMR | | | YES | | | |
| Query Classification + Routing | | | | YES | | YES |
| D1 / R2 SQL (dual-write) | | | | | | YES |
Options on the table
| Approach | Failure Modes | Effort | What it does |
|---|---|---|---|
| Better reranker | FM2 | Low | Current BGE-reranker-base is 0.3B/512 tok. Newer models (zerank-1-small: 1.7B, 32K) exist |
| Adaptive hybrid defaults | FM1, FM3 | Low | Auto-disable hybrid/fuzzy on lexically similar content. Prevents -26pp failures (BGE-M3) |
| Full contextual retrieval | FM1, FM5 | Medium | LLM-generated context per chunk (beyond RAG-958 headers). Anthropic: -67% retrieval failure |
| Result diversity (MMR) | FM3 | Low | Penalize redundant results. Stops 4/5 top results from stating the same fact |
| Proposition indexing | FM1, FM2, FM5 | High | Extract atomic facts, embed individually. 3-5x storage. Eliminates dilution + aboutness |
| Query planner | FM4, FM6 | Medium | Classify intent, extract filters, decompose multi-hop in /chat/completions |
| Entity-relationship index | FM3, FM4 | Very High | Lightweight graph on D1 + Vectorize. Multi-hop + dedup |
| D1 dual-write | FM6b | High | Detect tabular content, dual-write to D1, route structured queries to SQL |
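Of these, result diversity is the smallest lift. A minimal MMR sketch (synthetic 2-d vectors; assumption: embeddings are L2-normalized so dot product equals cosine similarity):

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=5, lam=0.5):
    """Maximal Marginal Relevance: trade query relevance against
    redundancy with already-selected results. lam=1.0 is pure relevance."""
    selected, candidates = [], list(range(len(doc_vecs)))
    rel = doc_vecs @ query_vec
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((doc_vecs[i] @ doc_vecs[j] for j in selected),
                             default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

q = np.array([1.0, 0.0])
docs = np.array([
    [1.0, 0.0],    # most relevant
    [0.96, 0.28],  # near-duplicate of the first
    [0.6, 0.8],    # less relevant but diverse
])
print(mmr(q, docs, k=2, lam=0.4))  # [0, 2]: the near-duplicate is skipped
```

Pure top-k would return the first two rows — the same fact twice. MMR's redundancy penalty is what stops 4/5 results from restating one statement.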
Pipeline quality > model choice > config tuning.
Fix FM1 & FM2 first (dilution + aboutness) — they affect every customer.
Redirect FM6 (unembeddable) — it's not a search problem.
— The ordering principle
We're strong on prose retrieval (0.86–1.00 R@10)
Qwen3 and Gemini2 are neck-and-neck (5/12 wins each)
Pipeline quality matters more than model or config (+53pp MS MARCO)
FM1 + FM2 affect every customer — fix those first
Benchmark report: tools/benchmark-eval/docs/benchmark-report.html
Research: docs/r&d/FACT_BASED_RETRIEVAL.md
Miguel Cardoso · AI Search · March 2026