[Title-slide diagram: vector space with query Q — retrieved chunks R1–R3 are "about" the topic; the chunk containing the (diluted) answer is missed as irrelevant-looking]
Cloudflare

AI Search
Failure Mode Analysis

4 models · 18 datasets · 468K question evaluations
Where retrieval breaks and what we can do about it

Miguel Cardoso · AI Search · March 2026

Agenda

  1. Benchmark Overview — what we measured
  2. Scorecard — where we're good vs. bad
  3. Six Failure Modes — taxonomy of why retrieval fails
  4. Evidence from Benchmarks — mapping data to failure modes
  5. Fix / Mitigate / Redirect — product strategy per failure
  6. What We Could Do — options on the table

1. Benchmark Overview

What we measured and how

Evaluation Scale

4 Embedding Models
18 Datasets
423 Configs Tested
468K Question Evaluations

Full ablation across chunk sizes (256/512/1024), search modes (vector/hybrid/hybrid+fuzzy), 4 embedding models, and multiple pipeline versions

Models Tested

Model   | Type               | Params | Status
Qwen3   | CF-served          | 0.6B   | Default — wins 5/12 (PubMedQA, CourtListener, TechQA, Legal CR, CUAD)
Gemini2 | 3rd-party (Google) | N/A    | Neck and neck — wins 5/12 (emanual, ExpertQA, FinQA, NewsQA, QASPER)
BGE-M3  | CF-served          | 0.6B   | Strong MRR on short docs, but R@10 lags top two
Gemma   | CF-served          | 0.3B   | Weakest across the board

Dataset Coverage

Dataset          | Domain            | Docs   | Questions | Avg Doc Size | Challenge
MS MARCO         | Web passages      | 3,481  | 423       | ~105 tok     | Short docs, lexically similar
HotpotQA         | Wikipedia         | 1,550  | 390       | ~123 tok     | Multi-hop reasoning
ExpertQA         | Expert knowledge  | 808    | 203       | ~654 tok     | Long-form expert answers
FinQA            | Financial tables  | 1,097  | 2,294     | ~346 tok     | Tabular + numerical
TechQA           | Technical docs    | 769    | 314       | ~832 tok     | Technical terminology
PubMedQA         | Biomedical        | 5,932  | 2,450     | ~96 tok      | Domain-specific vocabulary
CUAD             | Legal contracts   | 102    | 510       | ~10.6K tok   | Very long docs, few templates
emanual          | Product manuals   | 102    | 132       | ~214 tok     | Short consumer content
NewsQA           | News articles     | 638    | 4,212     | ~756 tok     | Journalism, broad topics
CourtListener    | Court opinions    | 1,979  | 2,000     | ~12.1K tok   | Long legal, citations
LegalCaseReports | Case reports      | 770    | 770       | ~10.2K tok   | Long legal
QASPER           | Academic papers   | 416    | 1,372     | ~5.5K tok    | Long, multi-section
CRAG             | Mixed domains     | 5,090  | 1,335     | ~5.2K tok    | Hallucination, answer quality
code-eval        | Source code       | 30     | 142       | ~1.3K tok    | Structured, non-prose
json-eval        | JSON documents    | 30     | 159       | ~758 tok     | Structured, key-value
BEIR-SciFact     | Scientific claims | 5,183  | 300       | ~390 tok     | Fact verification (0.85 R@10)
BEIR-NFCorpus    | Medical/nutrition | 3,593  | 323       | ~392 tok     | Multi-relevance levels
BEIR-FiQA        | Finance Q&A       | 57,599 | 648       | ~198 tok     | Large corpus, opinion-heavy
BEIR-ArguAna     | Argument mining   | 8,626  | 1,401     | ~266 tok     | Counterargument retrieval
BEIR-SciDocs     | Scientific papers | 25,656 | 1,000     | ~304 tok     | Citation prediction
MIRACL Arabic    | Multilingual (AR) | 2,061  | 2,723     | ~150 tok     | Non-Latin script (0.94 R@10)
MIRACL Korean    | Multilingual (KO) | 1,486  | 199       | ~200 tok     | Non-Latin script (0.86 R@10)

Long-document rows (>2K tokens) and structured data (marked red in the original slide) are where retrieval degrades most. MIRACL (Arabic 0.94, Korean 0.86 R@10) shows strong multilingual retrieval; BEIR-SciFact reaches 0.85. The remaining BEIR datasets are still in progress.

2. Scorecard

Where we're strong and where we break

Retrieval Scorecard (Best R@10)

Dataset       | Model   | R@3   | R@5   | R@10
emanual       | Gemini2 | 0.851 | 0.962 | 1.000
HotpotQA      | Qwen3   | 0.982 | 0.997 | 0.999
MS MARCO      | Qwen3   | 0.510 | 0.746 | 0.990
ExpertQA      | Gemini2 | 0.773 | 0.891 | 0.954
PubMedQA      | Qwen3   | 0.549 | 0.800 | 0.925
TechQA        | Qwen3   | 0.655 | 0.766 | 0.897
CourtListener | Qwen3   | 0.851 | 0.875 | 0.889
FinQA         | Gemini2 | 0.700 | 0.780 | 0.862
NewsQA        | Gemini2 | 0.753 | 0.806 | 0.855
MIRACL Arabic | Qwen3   | 0.850 | 0.902 | 0.935
MIRACL Korean | Qwen3   | 0.760 | 0.824 | 0.863
BEIR-SciFact  | Qwen3   | 0.745 | 0.797 | 0.847
QASPER        | Gemini2 | 0.422 | 0.460 | 0.500
json-eval     | Gemini2 | 0.487 | 0.487 | 0.487
code-eval     | Gemini2 | 0.323 | 0.330 | 0.330
CUAD          | Qwen3   | 0.035 | 0.059 | 0.102

Where We're Strong

  • Short-to-medium prose documents — HotpotQA (0.999), MS MARCO (0.99), emanual (1.00 Gemini2), PubMedQA (0.93)
  • Expert & technical retrieval — ExpertQA 0.95 (Gemini2), TechQA 0.90 (Qwen3), both near-saturated
  • Qwen3 leads on domain content — PubMedQA: +30pp vs Gemini2, CourtListener: Gemini2 untested, CUAD: Qwen3 best despite low absolute score
  • Gemini2 leads on breadth — emanual 1.00, ExpertQA 0.95, FinQA 0.86, NewsQA 0.86 — consistent advantage on well-structured docs
  • Multilingual retrieval works — MIRACL Arabic 0.94, Korean 0.86 R@10 (Qwen3, non-Latin scripts)
  • Pipeline quality gains — v8 pipeline lifted MS MARCO fuzzy +53pp, FinQA +29pp on same model

Where We're Weak

  • Academic papers — QASPER at 0.50 R@10 (Gemini2), all models clustered at 0.47–0.50
  • Very long documents (>5K tokens) — CUAD collapses to 0.10 R@10
  • Structured/tabular data — code-eval 0.33, json-eval 0.49 R@10 (Gemini2 no better)
  • Hallucination rate — 28% Qwen3, 35% Gemini2, 63-65% BGE-M3/Gemma (CRAG)
  • Qwen3 lags Gemini2 on some content types — NewsQA -20pp, emanual -11pp, FinQA -4pp

CRAG: End-to-End Answer Quality

Model   | Best Config | Accuracy | Hallucination | Missing | Composite
Qwen3   | 256+h+f     | 31.6%    | 27.9%         | 40.6%   | +0.04
Gemini2 | 512+h       | 35.9%    | 34.8%         | 29.3%   | +0.01
Gemma   | 512         | 27.2%    | 63.5%         | 9.3%    | -0.36
BGE-M3  | 256         | 26.6%    | 64.7%         | 8.7%    | -0.38

BGE-M3 and Gemma rarely say "I don't know" (low missing) but are wrong 2/3 of the time. Qwen3 achieves the lowest hallucination rate (28%) by abstaining more aggressively (41% missing). Both top models achieve positive composites.
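
The composite column is consistent with CRAG-style scoring — a correct answer scores +1, a hallucination -1, and an abstention ("missing") 0 — so the composite reduces to accuracy minus hallucination rate. A sketch, assuming that scoring rule:

```python
def crag_composite(accuracy: float, hallucination: float) -> float:
    """Correct answers score +1, hallucinations -1, abstentions 0, so with
    rates as fractions the composite is accuracy minus hallucination."""
    return round(accuracy - hallucination, 2)

# Rates from the table above, as fractions of questions
print(crag_composite(0.316, 0.279))  # 0.04  (Qwen3)
print(crag_composite(0.359, 0.348))  # 0.01  (Gemini2)
print(crag_composite(0.272, 0.635))  # -0.36 (Gemma)
print(crag_composite(0.266, 0.647))  # -0.38 (BGE-M3)
```

This makes the abstention trade-off explicit: saying "I don't know" is free, while guessing wrong is penalized, which is why Qwen3's aggressive abstaining still yields the best composite.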

3. Six Failure Modes

A taxonomy of why retrieval fails

Semantic search retrieves content that is about a topic.
RAG needs to deliver specific facts, claims, and statements.

— The fundamental gap

The Six Failure Modes

# | Failure Mode             | Core Issue
1 | Dilution                 | Fact is ~10% of a chunk's embedding signal
2 | Aboutness vs. Answerness | "About X" does not mean "Answers X"
3 | Shared Facts             | Same fact in N docs, no linking or authority
4 | No Connective Structure  | Chunks are flat — can't compose facts across docs
5 | Query-Document Asymmetry | Short specific query vs. long general passage
6 | Unembeddable Information | Negation, comparison, absence — not representable

FM1: Dilution

Problem: A chunk has 10 sentences. The answer is in one sentence. The embedding represents the average — the fact contributes ~10%.

Result: A chunk that's entirely about the topic ranks higher than the one with the answer.

Worse with: Larger chunks, dense config refs, mixed-content pages

[Diagram: query "Memory limit for Workers?" — Chunk A contains the answer ("128MB memory limit") but only ~10% of its embedding signal relates to it, so it ranks 3rd; Chunk B is entirely about the topic (~100% signal) and ranks 1st. B wins because all its sentences are on-topic.]
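
The dilution effect can be reproduced with toy vectors. A minimal sketch using one-hot "sentence embeddings" (purely illustrative, not a real embedding model), where a chunk embedding is the normalized sum of its sentence embeddings:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v)

dim = 16
topic = np.eye(dim)[0]                           # "Workers" topic direction
answer = np.eye(dim)[1]                          # "128MB limit" fact direction
noise = [np.eye(dim)[i] for i in range(2, 11)]   # 9 unrelated sentence directions

query = normalize(topic + 2 * answer)            # asks for the specific fact

# Chunk A: one answer sentence buried among 9 off-topic sentences (mean pooling)
chunk_a = normalize(normalize(topic + 2 * answer) + sum(noise))
# Chunk B: every sentence squarely on-topic, none stating the fact
chunk_b = topic

sim_a = float(query @ chunk_a)   # the fact is ~1/10 of chunk A's signal
sim_b = float(query @ chunk_b)
print(f"chunk A (has answer):  {sim_a:.3f}")   # 0.316
print(f"chunk B (about topic): {sim_b:.3f}")   # 0.447 — B outranks A
```

Averaging ten sentences shrinks the answer's contribution to roughly 1/√10 of the cosine score, so the uniformly on-topic chunk wins despite lacking the answer.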

FM2: Aboutness vs. Answerness

Problem: Embedding models are trained on similarity, not Q&A alignment. "Discusses OAuth" and "States Service X uses OAuth" look identical.

Result: Top-K is full of explanatory content, the one-liner answer is pushed out.

Worse with: Large corpora with deep topic coverage, well-written docs

Relationship            | Useful for answering?
Defines the concept     | Low
Discusses the concept   | Low
References the concept  | Very Low
States a specific claim | High
Provides evidence       | High

FM3: Shared Facts Without Awareness

  • Problem: Same fact in 5 documents. System returns 4 copies of the same info (wasting retrieval budget) and no source authority
  • Stale data risk: An outdated doc with strong semantic match outranks the canonical source
  • Dimensions: Deduplication, authority ranking, conflict detection, budget waste
  • Structured data: Same fact in different schemas (rate_limit_rpm, requests_per_minute, alert_threshold_rpm) — embedding similarity can't deduplicate across schemas
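
The "diversity" hook above is typically Maximal Marginal Relevance: penalize candidates similar to results already selected, so the retrieval budget isn't spent on copies of one fact. A sketch with toy unit vectors (illustrative, not the production ranker):

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=2, lam=0.5):
    """Maximal Marginal Relevance: trade relevance to the query against
    redundancy with already-selected results (vectors ~unit-norm)."""
    selected, candidates = [], list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = float(query_vec @ doc_vecs[i])
            redundancy = max((float(doc_vecs[i] @ doc_vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

e = np.eye(4)
near_dup = 0.99 * e[0] + 0.141 * e[1]   # the same fact, stated three times
distinct = 0.8 * e[0] + 0.6 * e[3]      # a different, still-relevant fact
docs = [near_dup, near_dup, near_dup, distinct]
query = e[0]

top2 = sorted(range(4), key=lambda i: -float(query @ docs[i]))[:2]
print("plain top-2:", top2)              # [0, 1] — two copies of the same fact
print("MMR top-2:  ", mmr(query, docs))  # [0, 3] — duplicate suppressed
```

Plain similarity ranking returns the duplicate twice; MMR's redundancy penalty swaps the second copy for the distinct fact.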

FM4: No Connective Structure

Problem: Facts live in separate chunks with no relationships. Multi-hop questions fail.

Math: If single-hop recall = 80%, then 2-hop = 64%, 3-hop = 51%

Especially hard for structured data: Join keys (owner_team → team_id → rotation) are opaque identifiers with no semantic meaning

Example: "Who is on-call for the payments service?"

services.json → svc-payments → team-billing
teams.json → team-billing → billing-primary
oncall.csv → billing-primary → carol@example.com

3 hops across 3 files. Search finds "payments" but never reaches the on-call CSV.
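
The chain above is trivial as an explicit join but invisible to similarity search. A sketch using hypothetical records mirroring the example's three files — every key (svc-payments, team-billing, billing-primary) is an opaque ID carrying no semantic signal for an embedding model:

```python
# Hypothetical contents of the three files in the example
services = {"svc-payments": {"name": "payments", "owner_team": "team-billing"}}
teams = {"team-billing": {"rotation": "billing-primary"}}
oncall = {"billing-primary": "carol@example.com"}

def oncall_for(service_name: str) -> str:
    svc_id = next(s for s, v in services.items() if v["name"] == service_name)
    team_id = services[svc_id]["owner_team"]   # hop 1: services.json
    rotation = teams[team_id]["rotation"]      # hop 2: teams.json
    return oncall[rotation]                    # hop 3: oncall.csv

print(oncall_for("payments"))  # carol@example.com
```

Each hop here is a deterministic key lookup; done with retrieval instead, each hop multiplies in its own recall (0.8³ ≈ 0.51 for three hops), which is the compounding failure FM4 describes.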

FM5: Query-Document Asymmetry

Problem: 10-token query vs. 300-token passage. Config blocks, JSON, CSV embed poorly compared to prose.

The format gap:

Format          | Embed Quality
Prose sentence  | Good
Table row       | Moderate
YAML / JSON key | Poor
CSV cell        | Very Poor

Our benchmark evidence:

  • code-eval: 0.33 R@10
  • json-eval: 0.49 R@10
  • FinQA (tabular): 0.86 R@10

Structured content consistently retrieves worse than prose.

FM6: Unembeddable Information

Prose negation: Partial

  • Negation in text: "regions that do NOT support X" — modern embeddings have weak but real negation sensitivity
  • Negation-aware training helps (Jina, DEO) but signal is still overwhelmed when many topically similar chunks compete
  • Absence: "What features are missing?" — weak signal for things that aren't there

Structured ops: Impossible

  • Comparison: "Products under 50 euros" — 29.99 and 189.00 embed identically
  • Filtering: category = 'electronics' AND price < 50
  • Aggregation: "How many active users on enterprise?"
  • Sorting: "Three most recent deployments"

Key distinction: Prose negation is a solvable model quality problem. Structured operations over JSON/CSV are fundamentally not similarity — no embedding model will make price < 50 work via cosine similarity.
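
The "not similarity" point can be made concrete: the same question is trivial for a relational store. A sketch of the Redirect stance, with sqlite3 standing in for a D1-style database and made-up product rows:

```python
import sqlite3

# "Products under 50 euros" — 29.99 and 189.00 are just tokens to an
# embedding model, but directly comparable in SQL.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE products (name TEXT, price REAL)")
db.executemany("INSERT INTO products VALUES (?, ?)",
               [("usb hub", 29.99), ("monitor", 189.00), ("mouse", 24.50)])

rows = db.execute(
    "SELECT name FROM products WHERE price < ? ORDER BY price", (50,)
).fetchall()
print([name for (name,) in rows])  # ['mouse', 'usb hub']
```

Comparison, filtering, aggregation, and sorting are all one-line clauses here; no amount of cosine similarity recovers them from embedded text.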

4. Evidence from Benchmarks

Mapping our data to failure modes

Failure Modes in Our Benchmarks

Benchmark Signal                | Primary FM     | Evidence
CUAD 0.10 R@10                  | FM1 (Dilution) | 41 templates × 100+ identical contracts; ~10.6K tok docs collapse
QASPER 0.50 R@10 ceiling        | FM1 + FM4      | Long academic papers (~5.5K tok), answers scattered across sections
code-eval 0.33 / json-eval 0.49 | FM5 (Asymmetry)| Structured formats embed poorly; Gemini2 no better (0pp gap)
Fuzzy -33pp on PubMedQA (Qwen3) | FM3 + FM1      | Citation-heavy short docs — fuzzy BM25 overwhelms vector signal
CRAG 28-65% hallucination       | FM2 + FM6      | Retrieved "about" but not "answers"; unembeddable query types
HotpotQA v6→v8: 0.79→0.999      | FM4 (solved)   | Pipeline improvements resolved multi-hop on this dataset

Hybrid Search: Double-Edged Sword

Helps (up to +28pp)

  • CourtListener (Qwen3): +28pp (citations need exact match)
  • emanual (Gemma): +13pp
  • NewsQA (Gemini2): +8pp

Qwen3 and Gemini2 hybrid is safe almost everywhere. Most gains come from fuzzy match (next slide).

Hurts (BGE-M3 catastrophic: -26pp)

  • MS MARCO (BGE-M3): -26pp
  • PubMedQA (Gemma): -19pp
  • Qwen3: worst -4pp (emanual)
  • Gemini2: worst -0.5pp (safe everywhere)

Pattern: weaker embeddings can't absorb BM25 noise. Model-dependent, not dataset-dependent.
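
The "rank-based scoring" used for hybrid tuning can be sketched as Reciprocal Rank Fusion — one common rank-based scheme, not necessarily the production formula. Fusing by rank rather than raw score keeps one scorer's scale from drowning out the other:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc,
    so agreement across rankers outweighs any single high raw score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector = ["doc-a", "doc-b", "doc-c"]   # semantic ranking
bm25 = ["doc-c", "doc-a", "doc-d"]     # exact-match hits (e.g. citations)
print(rrf([vector, bm25]))             # ['doc-a', 'doc-c', 'doc-b', 'doc-d']
```

doc-a, ranked well by both lists, fuses ahead of doc-c; documents seen by only one ranker trail behind, which is why weak BM25 noise damages a weak embedding model less under rank fusion than under score mixing.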

Fuzzy Match: High Reward, Conditional Risk

Wins (Qwen3, best pipeline)

  • TechQA: +10.7pp
  • FinQA: +5.4pp
  • QASPER: +5.0pp
  • emanual: +4.5pp
  • beir-scifact: +2.4pp

Pattern: diverse vocabulary, technical terms, academic content

Losses (Qwen3, best pipeline)

  • PubMedQA: -32.6pp
  • Legal CR: -12.3pp
  • CourtListener: -10.6pp
  • MS MARCO: -3.0pp (was -49pp before pipeline fix)

Pattern: citation-heavy, short lexically similar docs where fuzzy BM25 overwhelms the vector signal

Pipeline Quality > Model > Config

+53pp MS MARCO fuzzy v6→v8
+29pp FinQA v6→v8
+20pp HotpotQA v6→v8

Same model (Qwen3), same configs — only the pipeline changed (table parsing, boundary detection, overlap). Chunking quality matters more than model or config tuning. Infrastructure investment has the highest ROI.
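
One of the pipeline levers named above is overlap. A sketch of fixed-size chunking with overlap (illustrative only, not the actual v8 pipeline), showing why boundary handling matters: a fact straddling a chunk boundary still appears whole in at least one chunk.

```python
def chunk(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Slide a fixed window with overlap so facts near a boundary
    are not split across two chunks in every copy."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

doc = [f"tok{i}" for i in range(1000)]
chunks = chunk(doc)
print(len(chunks))                  # 3 windows cover 1000 tokens
print(chunks[0][-1], chunks[1][0])  # tok511 tok448 — 64 tokens repeated
```

Overlap trades a little index size for boundary robustness; table parsing and boundary detection are further refinements on the same idea of keeping semantic units intact.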

5. Fix / Mitigate / Redirect

Product strategy per failure mode

Strategy Framework

Stance   | Meaning                                      | Implication
Fix      | This is our problem. We invest in solving it. | Roadmap commitment, measurable improvement
Mitigate | Can't fully solve, but reduce the damage.     | Partial solutions, honest limits in docs
Redirect | Not a search problem. Point elsewhere.        | Guidance toward D1, R2 SQL, Agents SDK

Per-Failure-Mode Strategy

#  | Failure Mode   | Stance    | Reasoning
1  | Dilution       | Fix       | Core retrieval quality. Better chunking + reranking helps everyone
2  | Aboutness      | Fix       | Reranking helps. Proposition indexing would help more
3  | Shared Facts   | Mitigate  | Full dedup is hard. Diversity + metadata hooks are tractable
4  | No Structure   | Redirect? | Multi-hop = graph problem. No CF graph DB. Agents SDK?
5  | Asymmetry      | Fix       | Format gap is our pipeline's problem. Contextual retrieval helps
6a | Prose negation | Mitigate  | Weak but real signal. Better models + query decomposition help
6b | Structured ops | Redirect  | Comparison, filtering, aggregation = database ops. D1 / R2 SQL

What Fixes What?

Approach                       | FM1 Dilute | FM2 About | FM3 Shared | FM4 Struct | FM5 Asym | FM6 Unemb
Better Reranking               |            | YES       |            |            |          |
Proposition Indexing           | YES        | YES       | partial    |            | YES      |
Contextual Retrieval           | YES        |           |            |            | YES      |
Content-Aware Chunking         | YES        |           |            |            | YES      |
Graph RAG / Entity Index       |            |           | YES        | YES        |          | partial
Diversity / MMR                |            |           | YES        |            |          |
Query Classification + Routing |            |           |            |            |          | YES
D1 / R2 SQL (dual-write)       |            |           |            |            |          | YES

6. What We Could Do

Options on the table

Current Trajectory

Where we've been investing

  • Chunking pipeline v2 — table parsing, improved boundary detection, overlap fixes. Drove the v6→v8 lift (+20pp HotpotQA)
  • Reranking — BGE-reranker-base shipped. Addresses FM2 (aboutness)
  • Hybrid search tuning — rank-based scoring, chunk correctness validation
  • Benchmark infrastructure — 18 datasets, 4 models, automated eval pipeline

Where we're headed

  • Header injection — simplified contextual retrieval (FM1, FM5)
  • Content categories — detect content type, adapt chunking per file (FM1, FM5)
  • Code processing — format-aware chunking for source code (FM5)
  • Chunk diversity — exploring similar chunk detection (FM3)
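
The header-injection item above can be sketched as prepending document and section context to each chunk before embedding — a simplified, hypothetical form (the actual RAG-958 header format isn't shown here):

```python
def inject_header(doc_title: str, section: str, chunk_text: str) -> str:
    """Simplified contextual retrieval: prepend lightweight provenance so the
    chunk's embedding encodes where it came from, not just its own words."""
    return f"Document: {doc_title}\nSection: {section}\n\n{chunk_text}"

augmented = inject_header("Workers Limits", "Memory",
                          "Each isolate is limited to 128MB.")
print(augmented.splitlines()[0])  # Document: Workers Limits
```

The augmented text, not the raw chunk, is what gets embedded; the added context counteracts both dilution (FM1) and format asymmetry (FM5) at the cost of a few extra tokens per chunk.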

What We Could Do Next

Approach                  | Failure Modes | Effort    | What it does
Better reranker           | FM2           | Low       | Current BGE-reranker-base is 0.3B/512 tok. Newer models (zerank-1-small: 1.7B, 32K) exist
Adaptive hybrid defaults  | FM1, FM3      | Low       | Auto-disable hybrid/fuzzy on lexically similar content. Prevents -26pp failures (BGE-M3)
Full contextual retrieval | FM1, FM5      | Medium    | LLM-generated context per chunk (beyond RAG-958 headers). Anthropic: -67% retrieval failure
Result diversity (MMR)    | FM3           | Low       | Penalize redundant results. Stops 4/5 top results from stating the same fact
Proposition indexing      | FM1, FM2, FM5 | High      | Extract atomic facts, embed individually. 3-5x storage. Eliminates dilution + aboutness
Query planner             | FM4, FM6      | Medium    | Classify intent, extract filters, decompose multi-hop in /chat/completions
Entity-relationship index | FM3, FM4      | Very High | Lightweight graph on D1 + Vectorize. Multi-hop + dedup
D1 dual-write             | FM6b          | High      | Detect tabular content, dual-write to D1, route structured queries to SQL

Pipeline quality > model choice > config tuning.

Fix FM1 & FM2 first (dilution + aboutness) — they affect every customer.
Redirect FM6 (unembeddable) — it's not a search problem.

— The ordering principle

Open Questions for Discussion

  1. Where's the line between "search" and "not search"? — Is multi-hop in scope? Is filtering?
  2. Structured data responsibility? — If customers upload CSV/JSON, what do we owe them when search fails?
  3. Agents SDK as the answer for FM4 + FM6? — If an agent can call search + D1/R2 SQL, does AI Search need to solve these internally?
  4. Do we need graph infrastructure? — Entity-relationship indexing on D1 + Vectorize, or accept FM3/FM4 as limits?
  5. How do we communicate limits? — "Search + D1 + Agents SDK" is stronger than "search can't do X"

Key Takeaways

  • We're strong on prose retrieval (0.86–1.00 R@10)
  • Qwen3 and Gemini2 are neck and neck (5/12 wins each)
  • Pipeline quality matters more than model or config (+53pp MS MARCO)
  • FM1 + FM2 affect every customer — fix those first

Benchmark report: tools/benchmark-eval/docs/benchmark-report.html
Research: docs/r&d/FACT_BASED_RETRIEVAL.md
