4 models · 18 datasets · 468K question evaluations
Where retrieval breaks and what we can do about it
Miguel Cardoso · AI Search · March 2026
What we measured and how
Full ablation across chunk sizes (256/512/1024), search modes (vector/hybrid/hybrid+fuzzy), 4 embedding models, and multiple pipeline versions
| Model | Type | Params | Status |
|---|---|---|---|
| Qwen3 | CF-served | 0.6B | Default — wins 5/12 (PubMedQA, CourtListener, TechQA, Legal CR, CUAD) |
| Gemini2 | 3rd-party (Google) | N/A | Neck and neck — wins 5/12 (emanual, ExpertQA, FinQA, NewsQA, QASPER) |
| BGE-M3 | CF-served | 0.6B | Strong MRR on short docs, but R@10 lags top two |
| Gemma | CF-served | 0.3B | Weakest across the board |
| Dataset | Domain | Docs | Questions | Avg Doc Size | Challenge |
|---|---|---|---|---|---|
| MS MARCO | Web passages | 3,481 | 423 | ~105 tok | Short docs, lexically similar |
| HotpotQA | Wikipedia | 1,550 | 390 | ~123 tok | Multi-hop reasoning |
| ExpertQA | Expert knowledge | 808 | 203 | ~654 tok | Long-form expert answers |
| FinQA | Financial tables | 1,097 | 2,294 | ~346 tok | Tabular + numerical |
| TechQA | Technical docs | 769 | 314 | ~832 tok | Technical terminology |
| PubMedQA | Biomedical | 5,932 | 2,450 | ~96 tok | Domain-specific vocabulary |
| CUAD | Legal contracts | 102 | 510 | ~10.6K tok | Very long docs, few templates |
| emanual | Product manuals | 102 | 132 | ~214 tok | Short consumer content |
| NewsQA | News articles | 638 | 4,212 | ~756 tok | Journalism, broad topics |
| CourtListener | Court opinions | 1,979 | 2,000 | ~12.1K tok | Long legal, citations |
| LegalCaseReports | Case reports | 770 | 770 | ~10.2K tok | Long legal |
| QASPER | Academic papers | 416 | 1,372 | ~5.5K tok | Long, multi-section |
| CRAG | Mixed domains | 5,090 | 1,335 | ~5.2K tok | Hallucination, answer quality |
| code-eval | Source code | 30 | 142 | ~1.3K tok | Structured, non-prose |
| json-eval | JSON documents | 30 | 159 | ~758 tok | Structured, key-value |
| BEIR-SciFact | Scientific claims | 5,183 | 300 | ~390 tok | Fact verification (0.85 R@10) |
| BEIR-NFCorpus | Medical/nutrition | 3,593 | 323 | ~392 tok | Multi-relevance levels |
| BEIR-FiQA | Finance Q&A | 57,599 | 648 | ~198 tok | Large corpus, opinion-heavy |
| BEIR-ArguAna | Argument mining | 8,626 | 1,401 | ~266 tok | Counterargument retrieval |
| BEIR-SciDocs | Scientific papers | 25,656 | 1,000 | ~304 tok | Citation prediction |
| MIRACL Arabic | Multilingual (AR) | 2,061 | 2,723 | ~150 tok | Non-Latin script (0.94 R@10) |
| MIRACL Korean | Multilingual (KO) | 1,486 | 199 | ~200 tok | Non-Latin script (0.86 R@10) |
Red = long documents (>2K tokens) or structured data — where retrieval degrades most. MIRACL (Arabic 0.94, Korean 0.86) = strong multilingual. BEIR-SciFact 0.85. Other BEIR in progress.
Where we're strong and where we break
| Dataset | Model | R@3 | R@5 | R@10 |
|---|---|---|---|---|
| emanual | Gemini2 | 0.851 | 0.962 | 1.000 |
| HotpotQA | Qwen3 | 0.982 | 0.997 | 0.999 |
| MS MARCO | Qwen3 | 0.510 | 0.746 | 0.990 |
| ExpertQA | Gemini2 | 0.773 | 0.891 | 0.954 |
| PubMedQA | Qwen3 | 0.549 | 0.800 | 0.925 |
| TechQA | Qwen3 | 0.655 | 0.766 | 0.897 |
| CourtListener | Qwen3 | 0.851 | 0.875 | 0.889 |
| FinQA | Gemini2 | 0.700 | 0.780 | 0.862 |
| NewsQA | Gemini2 | 0.753 | 0.806 | 0.855 |
| MIRACL Arabic | Qwen3 | 0.850 | 0.902 | 0.935 |
| MIRACL Korean | Qwen3 | 0.760 | 0.824 | 0.863 |
| BEIR-SciFact | Qwen3 | 0.745 | 0.797 | 0.847 |
| QASPER | Gemini2 | 0.422 | 0.460 | 0.500 |
| json-eval | Gemini2 | 0.487 | 0.487 | 0.487 |
| code-eval | Gemini2 | 0.323 | 0.330 | 0.330 |
| CUAD | Qwen3 | 0.035 | 0.059 | 0.102 |
| Model | Best Config | Accuracy | Hallucination | Missing | Composite |
|---|---|---|---|---|---|
| Qwen3 | 256+h+f | 31.6% | 27.9% | 40.6% | +0.04 |
| Gemini2 | 512+h | 35.9% | 34.8% | 29.3% | +0.01 |
| Gemma | 512 | 27.2% | 63.5% | 9.3% | -0.36 |
| BGE-M3 | 256 | 26.6% | 64.7% | 8.7% | -0.38 |
BGE-M3 and Gemma rarely say "I don't know" (low missing) but are wrong 2/3 of the time. Qwen3 achieves the lowest hallucination rate (28%) by abstaining more aggressively (41% missing). Both top models achieve positive composites.
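The Composite column is consistent with accuracy minus hallucination rate (an inferred formula, not confirmed in the source — it reproduces all four table values, with the missing rate left out of the score):

```python
# Accuracy and hallucination rates from the CRAG table above.
# Assumption: composite = accuracy - hallucination (matches all four rows).
results = {
    "Qwen3":   (0.316, 0.279),
    "Gemini2": (0.359, 0.348),
    "Gemma":   (0.272, 0.635),
    "BGE-M3":  (0.266, 0.647),
}
for model, (acc, halluc) in results.items():
    print(f"{model}: {acc - halluc:+.2f}")  # Qwen3 +0.04 ... BGE-M3 -0.38
```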
A taxonomy of why retrieval fails
Semantic search retrieves content that is about a topic.
RAG needs to deliver specific facts, claims, and statements.
— The fundamental gap
| # | Failure Mode | Core Issue |
|---|---|---|
| 1 | Dilution | Fact is 10% of a chunk's embedding signal |
| 2 | Aboutness vs. Answerness | "About X" does not mean "Answers X" |
| 3 | Shared Facts | Same fact in N docs, no linking or authority |
| 4 | No Connective Structure | Chunks are flat — can't compose facts across docs |
| 5 | Query-Document Asymmetry | Short specific query vs. long general passage |
| 6 | Unembeddable Information | Negation, comparison, absence — not representable |
Problem: A chunk has 10 sentences. The answer is in one sentence. The embedding represents the average — the fact contributes ~10%.
Result: A chunk that's entirely about the topic ranks higher than the one with the answer.
Worse with: Larger chunks, dense config refs, mixed-content pages
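Dilution can be sketched with a toy model (assumption: the chunk embedding behaves like a mean-pool of sentence embeddings; all vectors are synthetic, not from a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

def unit(v):
    return v / np.linalg.norm(v)

fact = unit(rng.normal(size=384))                        # the one answer-bearing sentence
filler = [unit(rng.normal(size=384)) for _ in range(9)]  # 9 on-topic but non-answering sentences
query = unit(fact + 0.2 * unit(rng.normal(size=384)))    # query paraphrasing the fact

chunk_10 = unit(np.mean([fact] + filler, axis=0))  # 10-sentence chunk: fact is ~10% of the signal
chunk_1 = fact                                     # fact isolated (a "proposition")

print("query vs 10-sentence chunk:", float(query @ chunk_10))
print("query vs fact alone:       ", float(query @ chunk_1))
```

In high dimensions the filler sentences are near-orthogonal to the query, so the fact's contribution to the pooled vector collapses: similarity to the full chunk lands far below similarity to the isolated fact, which is the case for proposition-level indexing.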
Problem: Embedding models are trained on similarity, not Q&A alignment. "Discusses OAuth" and "States Service X uses OAuth" look identical.
Result: Top-K is full of explanatory content, the one-liner answer is pushed out.
Worse with: Large corpora with deep topic coverage, well-written docs
| Relationship | Useful? |
|---|---|
| Defines the concept | Low |
| Discusses the concept | Low |
| References the concept | Very Low |
| States a specific claim | High |
| Provides evidence | High |
Example: the same limit appears as rate_limit_rpm, requests_per_minute, and alert_threshold_rpm — embedding similarity can't deduplicate across schemas.
Problem: Facts live in separate chunks with no relationships. Multi-hop questions fail.

Math: If single-hop recall = 80%, then 2-hop = 64%, 3-hop = 51%
Especially hard for structured data: Join keys (owner_team → team_id → rotation) are opaque identifiers with no semantic meaning
Example: "Who is on-call for the payments service?"
services.json → svc-payments → team-billing
teams.json → team-billing → billing-primary
oncall.csv → billing-primary → carol@example.com
3 hops across 3 files. Search finds "payments" but never reaches the on-call CSV.
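The join chain is trivial once the hops are explicit — the point is that similarity search never composes them. A sketch with illustrative data mirroring the three files above:

```python
# Illustrative data shaped like services.json / teams.json / oncall.csv.
# The join keys (owner_team, rotation) are opaque IDs with no semantic meaning.
services = {"svc-payments": {"name": "payments", "owner_team": "team-billing"}}
teams = {"team-billing": {"rotation": "billing-primary"}}
oncall = {"billing-primary": "carol@example.com"}

def who_is_on_call(service_name: str) -> str:
    svc = next(s for s in services.values() if s["name"] == service_name)  # hop 1
    rotation = teams[svc["owner_team"]]["rotation"]                        # hop 2
    return oncall[rotation]                                                # hop 3

print(who_is_on_call("payments"))  # carol@example.com

# Why retrieval fails here: per-hop recall compounds.
for hops in (1, 2, 3):
    print(hops, "hop(s):", round(0.80 ** hops, 2))  # 0.8, 0.64, 0.51
```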
Problem: 10-token query vs. 300-token passage. Config blocks, JSON, CSV embed poorly compared to prose.
The format gap:
| Format | Embed Quality |
|---|---|
| Prose sentence | Good |
| Table row | Moderate |
| YAML / JSON key | Poor |
| CSV cell | Very Poor |
Our benchmark evidence:
Structured content consistently retrieves worse than prose.
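One common mitigation for the format gap is to verbalize structured rows into prose before embedding, so the embedder sees a sentence instead of a key-value fragment. A minimal sketch (field names and data are invented for illustration, not from our pipeline):

```python
import csv, io

raw = """service,owner_team,rate_limit_rpm
payments,team-billing,1200
search,team-core,300
"""

def verbalize(row: dict) -> str:
    # Turn a CSV row into the prose sentence we would embed instead.
    return (f"The {row['service']} service is owned by {row['owner_team']} "
            f"and has a rate limit of {row['rate_limit_rpm']} requests per minute.")

for row in csv.DictReader(io.StringIO(raw)):
    print(verbalize(row))
```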
Examples: 29.99 and 189.00 embed identically; category = 'electronics' AND price < 50 has no vector-space equivalent.
Key distinction: Prose negation is a solvable model-quality problem. Structured operations over JSON/CSV are fundamentally not similarity — no embedding model will make price < 50 work via cosine similarity.
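Redirecting FM6b starts with recognizing structured intent at query time. A toy router (the regex is purely illustrative — a real classifier would be an LLM or trained model, and "sql" here stands in for a D1 / R2 SQL backend):

```python
import re

# Illustrative heuristic: comparison/filter phrasing followed by a number
# signals a structured operation that cosine similarity cannot express.
STRUCTURED = re.compile(
    r"(<=|>=|<|>|under|over|between|at least|at most)\s*\$?\d", re.I
)

def route(query: str) -> str:
    return "sql" if STRUCTURED.search(query) else "vector"

print(route("electronics under $50"))              # sql
print(route("how does OAuth token refresh work"))  # vector
```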
Mapping our data to failure modes
| Benchmark Signal | Primary FM | Evidence |
|---|---|---|
| CUAD 0.10 R@10 | FM1 (Dilution) | 41 templates × 100+ identical contracts. ~10.6K tok docs collapse |
| QASPER 0.50 R@10 ceiling | FM1 + FM4 | Long academic papers (~5.5K tok), answers scattered across sections |
| code-eval 0.33 / json-eval 0.49 | FM5 (Asymmetry) | Structured formats embed poorly. Gemini2 no better (0pp gap) |
| Fuzzy -33pp on PubMedQA (Qwen3) | FM3 + FM1 | Citation-heavy short docs — fuzzy BM25 overwhelms vector signal |
| CRAG 28-65% hallucination | FM2 + FM6 | Retrieved "about" but not "answers"; unembeddable query types |
| HotpotQA v6→v8: 0.79→0.999 | FM4 (solved) | Pipeline improvements resolved multi-hop on this dataset |
With Qwen3 and Gemini2, hybrid is safe almost everywhere. Most gains come from fuzzy matching (next slide).
Pattern: weaker embeddings can't absorb BM25 noise. Model-dependent, not dataset-dependent.
Pattern: diverse vocabulary, technical terms, academic content
Pattern: citation-heavy, short lexically similar docs where fuzzy BM25 overwhelms the vector signal
Same model (Qwen3), same configs — only the pipeline changed (table parsing, boundary detection, overlap). Chunking quality matters more than model or config tuning. Infrastructure investment has the highest ROI.
Product strategy per failure mode
| Stance | Meaning | Implication |
|---|---|---|
| Fix | This is our problem. We invest in solving it. | Roadmap commitment, measurable improvement |
| Mitigate | Can't fully solve, but reduce the damage. | Partial solutions, honest limits in docs |
| Redirect | Not a search problem. Point elsewhere. | Guidance toward D1, R2 SQL, Agents SDK |
| # | Failure Mode | Stance | Reasoning |
|---|---|---|---|
| 1 | Dilution | Fix | Core retrieval quality. Better chunking + reranking helps everyone |
| 2 | Aboutness | Fix | Reranking helps. Proposition indexing would help more |
| 3 | Shared Facts | Mitigate | Full dedup is hard. Diversity + metadata hooks are tractable |
| 4 | No Structure | Redirect? | Multi-hop = graph problem. No CF graph DB. Agents SDK? |
| 5 | Asymmetry | Fix | Format gap is our pipeline's problem. Contextual retrieval helps |
| 6a | Prose negation | Mitigate | Weak but real signal. Better models + query decomposition help |
| 6b | Structured ops | Redirect | Comparison, filtering, aggregation = database ops. D1 / R2 SQL |
| Approach | FM1 Dilute | FM2 About | FM3 Shared | FM4 Struct | FM5 Asym | FM6 Unemb |
|---|---|---|---|---|---|---|
| Better Reranking | | YES | | | | |
| Proposition Indexing | YES | YES | partial | | YES | |
| Contextual Retrieval | YES | | | | YES | |
| Content-Aware Chunking | YES | | | | YES | |
| Graph RAG / Entity Index | | | YES | YES | | partial |
| Diversity / MMR | | | YES | | | |
| Query Classification + Routing | | | | YES | | YES |
| D1 / R2 SQL (dual-write) | | | | | | YES |
Options on the table
| Approach | Failure Modes | Effort | What it does |
|---|---|---|---|
| Better reranker | FM2 | Low | Current BGE-reranker-base is 0.3B/512 tok. Newer models (zerank-1-small: 1.7B, 32K) exist |
| Adaptive hybrid defaults | FM1, FM3 | Low | Auto-disable hybrid/fuzzy on lexically similar content. Prevents -26pp failures (BGE-M3) |
| Full contextual retrieval | FM1, FM5 | Medium | LLM-generated context per chunk (beyond RAG-958 headers). Anthropic: -67% retrieval failure |
| Result diversity (MMR) | FM3 | Low | Penalize redundant results. Stops 4/5 top results from stating the same fact |
| Proposition indexing | FM1, FM2, FM5 | High | Extract atomic facts, embed individually. 3-5x storage. Eliminates dilution + aboutness |
| Query planner | FM4, FM6 | Medium | Classify intent, extract filters, decompose multi-hop in /chat/completions |
| Entity-relationship index | FM3, FM4 | Very High | Lightweight graph on D1 + Vectorize. Multi-hop + dedup |
| D1 dual-write | FM6b | High | Detect tabular content, dual-write to D1, route structured queries to SQL |
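Of these, result diversity is the smallest lift. A minimal MMR sketch (synthetic 2-d vectors; assumption: embeddings are L2-normalized so dot product equals cosine similarity):

```python
import numpy as np

def mmr(query_vec, doc_vecs, k=5, lam=0.5):
    """Maximal Marginal Relevance: trade query relevance against
    redundancy with already-selected results. lam=1.0 is pure relevance."""
    selected, candidates = [], list(range(len(doc_vecs)))
    rel = doc_vecs @ query_vec
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((doc_vecs[i] @ doc_vecs[j] for j in selected),
                             default=0.0)
            return lam * rel[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

q = np.array([1.0, 0.0])
docs = np.array([
    [1.0, 0.0],    # most relevant
    [0.96, 0.28],  # near-duplicate of the first
    [0.6, 0.8],    # less relevant but diverse
])
print(mmr(q, docs, k=2, lam=0.4))  # [0, 2]: the near-duplicate is skipped
```

Pure top-k would return the first two rows — the same fact twice. MMR's redundancy penalty is what stops 4/5 results from restating one statement.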
Pipeline quality > model choice > config tuning.
Fix FM1 & FM2 first (dilution + aboutness) — they affect every customer.
Redirect FM6 (unembeddable) — it's not a search problem.
— The ordering principle
We're strong on prose retrieval (0.86–1.00 R@10)
Qwen3 and Gemini2 are neck-and-neck (5/12 wins each)
Pipeline quality matters more than model or config (+53pp MS MARCO)
FM1 + FM2 affect every customer — fix those first
Benchmark report: tools/benchmark-eval/docs/benchmark-report.html
Research: docs/r&d/FACT_BASED_RETRIEVAL.md
Miguel Cardoso · AI Search · March 2026