AI Search — Benchmark Evaluation
Internal Evaluation — Last updated March 18, 2026

How well does Cloudflare AI Search find the right documents and generate accurate answers? We tested 4 embedding models across 18 public datasets (from the RAGBench, BEIR, MIRACL, CRAG, QASPER, LoCoV1, and NewsQA collections) with 423 configuration variants — covering retrieval quality, answer accuracy, and hallucination rates.

423 configurations · 468K question evaluations · 18 public datasets · 4 embedding models

Key Findings

  1. Qwen3 and Gemini2 are neck and neck — each wins 5 of 12 retrieval datasets. HotpotQA and MS MARCO are effectively tied at R@10 ≈ 0.99. Qwen3 leads on PubMedQA, CourtListener, Legal CR, TechQA, and CUAD; Gemini2 leads on emanual, ExpertQA, FinQA, NewsQA, and QASPER.
  2. CRAG: both top models achieve positive composites — Qwen3 leads with +0.04 composite by abstaining more (41% missing), while Gemini2 is more accurate (36%) but hallucinates more (35%). BGE-M3 and Gemma remain unsolved at 63–65% hallucination.
  3. Pipeline improvements are massive — indexing pipeline updates lifted MS MARCO fuzzy by +53pp and FinQA by +29pp, without changing the embedding model.
  4. Fuzzy match is the highest-leverage config knob — up to +11pp on TechQA across all models, but −33pp on PubMedQA. Dataset-dependent, not model-dependent.
  5. QASPER (academic papers) remains hard — best R@10 is 0.50 (Gemini2), with all models clustered at 0.47–0.50. Fuzzy match is critical (+6–8pp boost).
  6. Hybrid search is model-dependent — safe for Gemini2, mostly safe for Qwen3, but degrades BGE-M3 and Gemma on short-doc datasets.

How to read this report

The core question is simple: when a user asks a question, does AI Search find the right documents? And if it does, is the generated answer any good? We measure this with standard information retrieval and answer quality metrics, evaluated against public datasets with known ground-truth answers.

Retrieval — did we find the right documents?
Recall@K
Of all the relevant documents that exist, what fraction did we return in the top K results? R@10 = 0.90 means we found 90% of the relevant documents in the top 10.
MRR
Mean Reciprocal Rank. How high up does the first relevant result appear? MRR = 1.0 means the first result is always relevant; 0.5 means it’s typically second.
NDCG@K
Normalized Discounted Cumulative Gain. Are the most relevant results ranked higher than less relevant ones? Rewards good ordering, not just presence.
Hit Rate@K
Did we find at least one relevant document in the top K? A binary yes/no per query, averaged across all queries.
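The retrieval metrics above can be computed in a few lines. A minimal sketch with binary relevance over document IDs (the example documents and rankings are invented for illustration):

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def hit_rate_at_k(retrieved, relevant, k):
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG@k: rewards ranking relevant docs earlier."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in relevant)
    # Ideal DCG: all relevant docs packed at the top, truncated at k
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# Example: 2 of 3 relevant docs in the top 5, first hit at rank 1
retrieved = ["d1", "d9", "d3", "d7", "d5"]
relevant = ["d1", "d3", "d4"]
print(round(recall_at_k(retrieved, relevant, 5), 3))  # 0.667
print(mrr(retrieved, relevant))                       # 1.0
```

Per-query scores are averaged across all questions in a dataset to produce the numbers reported below.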
Answer quality — is the generated answer correct?
Token F1
Word-level overlap between the generated answer and the ground truth. Balances precision (no extra words) with recall (no missing words). The primary answer quality metric.
ROUGE-L
Longest common subsequence between generated and expected answer. Captures whether the answer preserves the right sequence of information.
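Both answer-quality metrics are straightforward to implement. A minimal sketch (whitespace tokenization is an assumption here; the production eval may normalize case and punctuation differently):

```python
from collections import Counter

def token_f1(prediction, truth):
    """Word-level F1: harmonic mean of precision and recall over tokens."""
    pred, gold = prediction.split(), truth.split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def rouge_l_f1(prediction, truth):
    """ROUGE-L: F1 over the longest common subsequence of tokens."""
    pred, gold = prediction.split(), truth.split()
    # Classic LCS dynamic-programming table
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, g in enumerate(gold, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if p == g else max(dp[i-1][j], dp[i][j-1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(gold)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the cat sat on the mat", "the cat lay on the mat"), 3))  # 0.833
```

Token F1 ignores word order entirely; ROUGE-L adds an ordering constraint via the longest common subsequence, so a shuffled answer scores lower on ROUGE-L even when F1 is unchanged.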
CRAG-specific — does the system know what it doesn’t know?
CRAG Composite
Scores correct answers +1, abstentions ("I don't know") 0, and hallucinations −1. Range [−1, +1]. Penalizes confident wrong answers more than silence.
Hallucination Rate
What percentage of answers are confidently wrong? The single most important safety metric — lower is better.
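The CRAG scoring rule is simple to express in code. A minimal sketch (the judgment labels `correct`/`missing`/`incorrect` are an assumed encoding of the per-question results):

```python
def crag_scores(judgments):
    """judgments: list of 'correct', 'missing', or 'incorrect' per question.

    Returns (composite, hallucination_rate). Composite scores correct +1,
    missing (abstention) 0, incorrect (hallucination) -1, averaged over
    all questions, so it falls in [-1, +1].
    """
    n = len(judgments)
    correct = judgments.count("correct")
    incorrect = judgments.count("incorrect")
    composite = (correct - incorrect) / n
    hallucination_rate = incorrect / n  # lower is better
    return composite, hallucination_rate

# Example: 4 correct, 3 abstentions, 3 hallucinations out of 10
j = ["correct"] * 4 + ["missing"] * 3 + ["incorrect"] * 3
print(crag_scores(j))  # (0.1, 0.3)
```

Because a hallucination cancels a correct answer, a model can raise its composite by abstaining more, even at the cost of raw accuracy — exactly the Qwen3 vs Gemini2 trade-off reported below.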

Datasets

All datasets are public and commonly used in information retrieval and question-answering research. Each provides a corpus of documents and a set of questions with known ground-truth answers or relevant passages.

RAGBench — galileo-ai/ragbench

| Dataset | Domain | Questions | Best R@3 | Best R@5 | Best R@10 | Source |
|---|---|---|---|---|---|---|
| MS MARCO | Web search | 423 | 0.510 | 0.746 | 0.990 | HuggingFace |
| HotpotQA | Multi-hop reasoning | 390 | 0.982 | 0.997 | 0.999 | HuggingFace |
| ExpertQA | Expert-level Q&A | 203 | 0.773 | 0.891 | 0.954 | HuggingFace |
| FinQA | Financial documents | 2,294 | 0.700 | 0.780 | 0.862 | HuggingFace |
| TechQA | Technical support | 314 | 0.655 | 0.766 | 0.897 | HuggingFace |
| PubMedQA | Biomedical literature | 2,450 | 0.549 | 0.800 | 0.925 | HuggingFace |
| CUAD | Legal contracts | 510 | 0.035 | 0.059 | 0.102 | HuggingFace |
| emanual | Product manuals | 132 | 0.851 | 0.962 | 1.000 | HuggingFace |

BEIR — beir-cellar/beir

| Dataset | Domain | Questions | Best R@3 | Best R@5 | Best R@10 | Source |
|---|---|---|---|---|---|---|
| SciFact | Scientific claims | 300 | 0.745 | 0.797 | 0.847 | HuggingFace |
| NFCorpus | Nutrition & health | 323 | — | — | — | HuggingFace |
| FiQA | Financial opinion Q&A | 648 | — | — | — | HuggingFace |
| ArguAna | Argument retrieval | 1,401 | — | — | — | HuggingFace |
| SciDocs | Scientific documents | 1,000 | — | — | — | HuggingFace |

Other benchmarks

| Dataset | Domain | Questions | Best R@3 | Best R@5 | Best R@10 | Source |
|---|---|---|---|---|---|---|
| CRAG | End-to-end QA (web) | 1,335 | — | — | — | GitHub |
| QASPER | Academic papers | 1,372 | 0.422 | 0.460 | 0.500 | HuggingFace |
| NewsQA | News articles | 4,212 | 0.753 | 0.806 | 0.855 | HuggingFace |
| CourtListener | Legal opinions | 2,000 | 0.851 | 0.875 | 0.889 | HuggingFace |
| Legal CR | Legal case reports | 770 | 0.534 | 0.609 | 0.656 | HuggingFace |

Internal benchmarks

| Dataset | Domain | Questions | Best R@3 | Best R@5 | Best R@10 | Source |
|---|---|---|---|---|---|---|
| Code Eval | Structured data | 142 | 0.323 | 0.330 | 0.330 | Internal |
| JSON Eval | Structured data | 159 | 0.487 | 0.487 | 0.487 | Internal |

MIRACL — miracl/miracl

| Dataset | Domain | Questions | Best R@3 | Best R@5 | Best R@10 | Source |
|---|---|---|---|---|---|---|
| Arabic | Multilingual (Arabic) | 2,723 | 0.850 | 0.902 | 0.935 | HuggingFace |
| Korean | Multilingual (Korean) | 199 | 0.760 | 0.824 | 0.863 | HuggingFace |

Cross-Model Leaderboard

The best score achieved by any configuration of each model across all evaluation runs. Dashes indicate the model was not tested on that dataset.

Best Recall@10 by Dataset and Model
Grouped bar chart — higher is better — best config per model

| Dataset | Qwen3 | Gemini2 | BGE-M3 | Gemma | Winner |
|---|---|---|---|---|---|
| emanual | 0.886 | 1.000 | 0.907 | 0.649 | Gemini2 |
| MS MARCO | 0.990 | 0.990 | 0.840 | 0.770 | (tie) |
| ExpertQA | 0.930 | 0.954 | 0.568 | 0.434 | Gemini2 |
| PubMedQA | 0.925 | 0.625 | 0.916 | 0.881 | Qwen3 |
| CourtListener | 0.889 | — | 0.779 | 0.636 | Qwen3 |
| HotpotQA | 0.999 | 0.999 | 0.535 | 0.383 | (tie) |
| FinQA | 0.824 | 0.862 | 0.804 | 0.334 | Gemini2 |
| TechQA | 0.897 | 0.864 | 0.846 | 0.401 | Qwen3 |
| NewsQA | 0.653 | 0.855 | 0.789 | 0.510 | Gemini2 |
| Legal CR | 0.656 | — | — | 0.329 | Qwen3 |
| CUAD | 0.102 | 0.072 | 0.080 | 0.069 | Qwen3 |
| QASPER | 0.471 | 0.500 | 0.472 | 0.474 | Gemini2 |
Best Recall@3 by Dataset and Model

| Dataset | Qwen3 | Gemini2 | BGE-M3 | Gemma | Winner |
|---|---|---|---|---|---|
| emanual | 0.745 | 0.881 | 0.664 | 0.581 | Gemini2 |
| MS MARCO | 0.510 | 0.510 | 0.373 | 0.366 | Gemini2 |
| ExpertQA | 0.808 | 0.808 | 0.501 | 0.412 | (tie) |
| PubMedQA | 0.549 | 0.553 | 0.550 | 0.530 | Gemini2 |
| CourtListener | 0.851 | — | 0.699 | 0.549 | Qwen3 |
| HotpotQA | 0.984 | 0.984 | 0.470 | 0.366 | (tie) |
| FinQA | 0.684 | 0.720 | 0.671 | 0.242 | Gemini2 |
| TechQA | 0.655 | 0.683 | 0.641 | 0.288 | Gemini2 |
| NewsQA | 0.577 | 0.753 | 0.684 | 0.490 | Gemini2 |
| Legal CR | 0.534 | — | — | 0.295 | Qwen3 |
| CUAD | 0.041 | 0.033 | 0.033 | 0.037 | Qwen3 |
| QASPER | 0.398 | 0.422 | 0.398 | 0.405 | Gemini2 |
Best MRR by Dataset and Model

| Dataset | Qwen3 | Gemini2 | BGE-M3 | Gemma | Winner |
|---|---|---|---|---|---|
| emanual | 0.868 | 0.859 | 0.951 | 0.874 | BGE-M3 |
| MS MARCO | 0.989 | 0.922 | 0.995 | 0.991 | BGE-M3 |
| ExpertQA | 0.854 | 0.876 | 0.852 | 0.760 | Gemini2 |
| PubMedQA | 0.988 | 0.780 | 0.994 | 0.987 | BGE-M3 |
| CourtListener | 0.705 | — | 0.641 | 0.487 | Qwen3 |
| HotpotQA | 0.996 | 0.996 | 0.943 | 0.858 | (tie) |
| FinQA | 0.646 | 0.668 | 0.628 | 0.507 | Gemini2 |
| TechQA | 0.770 | 0.754 | 0.731 | 0.751 | Qwen3 |
| NewsQA | 0.526 | 0.686 | 0.628 | 0.456 | Gemini2 |
| Legal CR | 0.460 | — | — | 0.262 | Qwen3 |
| CUAD | 0.035 | 0.030 | 0.028 | 0.031 | Qwen3 |
| QASPER | 0.348 | 0.372 | 0.348 | 0.350 | Gemini2 |

Scorecard: Qwen3 wins 5 (PubMedQA, CourtListener, TechQA, Legal CR, CUAD). Gemini2 wins 5 (emanual, ExpertQA, FinQA, NewsQA, QASPER). Dashes indicate the model was not tested on that dataset.

1. Indexing Pipeline Quality Has a Large Impact

Improvements to the AI Search indexing pipeline produced significant retrieval gains on several datasets — without changing the embedding model or search configuration. This demonstrates that the backend infrastructure matters as much as the model choice.

Impact of Indexing Pipeline Improvements (Qwen3, same config)
R@10 change after indexing pipeline update
| Dataset | Config | Before | After | Delta |
|---|---|---|---|---|
| MS MARCO | 512+hybrid+fuzzy | 0.430 | 0.960 | +52.9pp |
| MS MARCO | 1024+hybrid+fuzzy | 0.445 | 0.960 | +51.6pp |
| MS MARCO | 256+hybrid+fuzzy | 0.445 | 0.956 | +51.1pp |
| FinQA | 256 | 0.481 | 0.775 | +29.3pp |
| FinQA | 256+hybrid | 0.481 | 0.770 | +28.9pp |
Takeaway: The indexing pipeline is a significant lever for retrieval quality. Some regressions exist on short-doc datasets, suggesting the improvements favor longer documents. This is an active area of optimization.

2. QASPER: Academic Papers Are Hard

QASPER tests retrieval over academic NLP papers with 1,372 questions. The best R@10 across all models is 0.50 (Gemini2) — meaning half the relevant passages are missed. All models cluster between 0.47–0.50, with fuzzy match providing a consistent boost of +6–8pp.

QASPER: Recall@10 by Config and Model
Fuzzy match provides a consistent +6–8pp boost across all models

| Config | Qwen3 R@10 | Gemini2 R@10 | BGE-M3 R@10 | Gemma R@10 |
|---|---|---|---|---|
| 256 | 0.407 | 0.446 | 0.410 | 0.377 |
| 256+h | 0.407 | 0.437 | 0.401 | 0.391 |
| 256+h+f | 0.471 | 0.500 | 0.472 | 0.473 |
| 512 | 0.425 | 0.441 | 0.423 | 0.358 |
| 512+h | 0.421 | 0.443 | 0.423 | 0.368 |
| 512+h+f | 0.470 | 0.494 | 0.472 | 0.474 |
| 1024 | 0.393 | 0.382 | — | — |
| 1024+h | 0.408 | 0.407 | — | — |
| 1024+h+f | 0.466 | 0.468 | — | — |
Takeaway: QASPER is the hardest retrieval dataset in the benchmark — a ~50% ceiling that no model breaks through. Fuzzy match is critical here, providing a consistent +6–8pp boost regardless of model. The difficulty comes from questions that require reasoning across multiple paper sections, not just embedding quality.

3. CRAG: Hallucination Rates Vary Wildly by Model

CRAG (Comprehensive RAG Benchmark) evaluates end-to-end answer quality on 1,335 questions across 5 domains. Answers are scored as correct (+1), missing (0), or incorrect/hallucinated (−1). Qwen3 (latest pipeline) slightly edges Gemini2 on composite score (+0.04 vs +0.01), while Gemini2 has higher raw accuracy (36% vs 32%). Both achieve positive composites with hallucination rates of 28–35%. BGE-M3 and Gemma hallucinate on 63–65% of questions.

CRAG: Answer Quality Breakdown by Model
Best config per model — Qwen3 and Gemini2 achieve positive composites, BGE-M3 and Gemma are negative
Best composite: +0.04 · Min hallucination rate: 28% · Best accuracy: 36% · Avg abstention rate: 22%
| Model | Best Config | Composite | Accuracy | Halluc. Rate | Missing Rate |
|---|---|---|---|---|---|
| Qwen3 | 256+hybrid+fuzzy | +0.04 | 31.6% | 27.9% | 40.6% |
| Gemini2 | 512+hybrid | +0.01 | 35.9% | 34.8% | 29.3% |
| BGE-M3 | 256 | −0.38 | 26.6% | 64.7% | 8.7% |
| Gemma | 512 | −0.36 | 27.2% | 63.5% | 9.3% |

CRAG has no retrieval ground truth (retrieval metrics are all 0.0 by design), so this benchmarks the full pipeline: retrieval + generation. The embedding model and pipeline version both have significant impact — Qwen3 (28% hallucination) and Gemini2 (35% hallucination) are in a different league from BGE-M3 and Gemma (63–65%). Models with lower hallucination rates also abstain more often (29–41%), which is the right behavior when the answer isn't in context.

Takeaway: Qwen3 leads on composite (+0.04) by abstaining more aggressively (41% missing), while Gemini2 is more accurate (36%) but hallucinates more (35%). Both are positive — very different from BGE-M3/Gemma at 63%+ hallucination. Answer calibration (knowing when to say "I don't know") matters as much as retrieval quality for factual QA.

4. Fuzzy Match: High-Risk, High-Reward

Fuzzy match (OR-mode BM25) remains the single most impactful configuration option. The effect is dataset-dependent, not model-dependent — all models benefit equally on technical/specialized content, and all suffer equally on short-document or precision-sensitive corpora.

Fuzzy Match Impact: Best Hybrid+Fuzzy vs Best Hybrid
R@10 change in percentage points — green = improvement, red = degradation
| Dataset | Model | Without Fuzzy | With Fuzzy | Delta |
|---|---|---|---|---|
| TechQA | Gemini2 | 0.756 | 0.864 | +10.8pp |
| TechQA | Qwen3 | 0.791 | 0.897 | +10.7pp |
| TechQA | BGE-M3 | 0.742 | 0.846 | +10.5pp |
| QASPER | Gemma | 0.391 | 0.474 | +8.3pp |
| QASPER | Gemini2 | 0.443 | 0.500 | +5.6pp |
| PubMedQA | Qwen3 | 0.925 | 0.599 | −32.6pp |
| Legal CR | Qwen3 | 0.656 | 0.533 | −12.3pp |
| CourtListener | Qwen3 | 0.889 | 0.783 | −10.6pp |
Takeaway: Fuzzy match should be opt-in per use case. Enable for: technical docs, financial data, expert content, academic papers (QASPER). Avoid for: short-document collections, citation-heavy content, web search.

5. Model Selection Guide

Each model has a distinct niche. The gap between best and worst model on any dataset ranges from ~3pp (QASPER) to 53pp (FinQA).

Model Strengths: Where Each Model Wins
Best R@10 per model on datasets where all were tested

Based on the closest benchmark dataset for each use case. These are proxies — actual performance depends on your corpus size, document structure, and question distribution.

| Use Case | Best Model | Config | Benchmark Proxy | R@10 |
|---|---|---|---|---|
| Product docs, manuals | Gemini2 | 512+h | emanual | 1.00 |
| Web search, short docs | Qwen3 | 1024+h | MS MARCO | 0.99 |
| Financial / tabular | Gemini2 | 512+h+fuzzy | FinQA | 0.86 |
| Technical docs (long) | Qwen3 | 1024+h+fuzzy | TechQA | 0.90 |
| Expert knowledge | Gemini2 | 1024+h+fuzzy | ExpertQA | 0.95 |
| Biomedical | Qwen3 | 1024 | PubMedQA | 0.93 |
| Multi-hop reasoning | Qwen3 | 256 | HotpotQA | 1.00 |
| Legal citations | Qwen3 | 1024+h | CourtListener | 0.89 |
| News, journalism | Gemini2 | 512+h+fuzzy | NewsQA | 0.85 |
| Academic papers | Gemini2 | 256+h+fuzzy | QASPER | 0.50 |

6. Hybrid Search Is Model-Dependent

Hybrid search (vector + BM25 exact match) is safe for Qwen3 and Gemini2, but BGE-M3 and Gemma are vulnerable to significant degradation on short-document datasets. The risk profile is model-dependent.

Hybrid Search Delta by Model (Avg R@10 change)
Positive = hybrid helps, negative = hybrid hurts

Safe defaults by model

  • Qwen3 + hybrid: Mostly safe. Worst: emanual (-4.1pp).
  • Gemini2 + hybrid: Safe everywhere. Worst delta is -0.5pp.
  • BGE-M3 + hybrid: Catastrophic on MS MARCO (-26.1pp).
  • Gemma + hybrid: Catastrophic on PubMedQA (-18.9pp).

7. Code & JSON Retrieval

Internal benchmarks for non-prose content: 30 source code files (TypeScript, Python, Go, Rust, SQL) and 30 structured JSON files (API responses, config files, Terraform state, GeoJSON, logs). Documents include deliberate distractors — similar files with overlapping themes that force the model to discriminate (e.g. two Hono middleware guides, two database schemas, two ETL pipelines).

| Dataset | Model | Best Config | R@3 | R@10 | MRR |
|---|---|---|---|---|---|
| Code Eval | Qwen3 | 256+hybrid+fuzzy | 0.320 | 0.326 | 0.338 |
| Code Eval | Gemini2 | 1024 | 0.323 | 0.330 | 0.354 |
| JSON Eval | Qwen3 | 256+hybrid+fuzzy | 0.450 | 0.472 | 0.428 |
| JSON Eval | Gemini2 | 256 | 0.487 | 0.487 | 0.474 |
Takeaway: Code and JSON retrieval is significantly harder than prose. With 30 distractor documents per dataset, best R@10 drops to ~0.33 for code and ~0.49 for JSON. Both models perform similarly, and no config provides a major advantage — structured content remains a genuine retrieval challenge.

8. Answer Quality vs Retrieval Quality

Better retrieval does not guarantee better answers. The correlation between R@10 and Token F1 is weak (r ≈ 0.27), and Token F1 varies less than 2pp across configs within any dataset.

Retrieval Quality vs Answer Quality
Each point = one dataset × model × config — the weak correlation suggests the LLM compensates for retrieval gaps
Takeaway: Once you clear a retrieval quality floor (~0.5 R@10), additional retrieval improvements have diminishing returns for answer quality. Focus optimization effort on model selection and corpus preparation, not config tuning.

Coverage Matrix

Current evaluation coverage across all result sets.

| Dataset | Qwen3 | Gemini2 | BGE-M3 | Gemma | Status |
|---|---|---|---|---|---|
| beir-scifact | 9 | 0 | 0 | 0 | Gemini2, BGE-M3, Gemma missing |
| Code Eval | 9 | 9 | 0 | 0 | BGE-M3, Gemma missing |
| CRAG | 9 | 4 | 6 | 6 | Complete |
| JSON Eval | 8 | 8 | 0 | 0 | BGE-M3, Gemma missing |
| CourtListener | 9 | 0 | 4 | 4 | Gemini2 missing |
| Legal CR | 9 | 1 | 6 | 6 | Complete |
| MIRACL Arabic | 9 | 0 | 0 | 0 | Gemini2, BGE-M3, Gemma missing |
| MIRACL Korean | 3 | 0 | 0 | 0 | Gemini2, BGE-M3, Gemma missing |
| NewsQA | 9 | 4 | 6 | 6 | Complete |
| QASPER | 9 | 9 | 6 | 6 | Complete |
| CUAD | 9 | 9 | 6 | 6 | Complete |
| emanual | 9 | 9 | 6 | 6 | Complete |
| ExpertQA | 10 | 9 | 6 | 6 | Complete |
| FinQA | 9 | 9 | 6 | 6 | Complete |
| HotpotQA | 10 | 9 | 6 | 6 | Complete |
| MS MARCO | 9 | 10 | 6 | 6 | Complete |
| PubMedQA | 9 | 9 | 6 | 6 | Complete |
| TechQA | 15 | 9 | 6 | 6 | Complete |

Question Explorer

A sample of questions where models disagree most on retrieval (R@10 diff > 0.5). R@10 measures whether the search engine found the right source documents — not whether the final answer is correct. You'll often see models give similar answers despite very different R@10 scores. This can happen because the LLM already knows the answer from its training data, or because it infers it from partial context. The value of high retrieval is grounding: a model with R@10 = 1.0 can cite the actual source documents, while a model with R@10 = 0.0 may have given the right answer from memory — which won't work for private data the LLM has never seen.


Methodology

Evaluation pipeline: Questions are sent to AI Search's /chat/completions endpoint across multiple indexing pipeline versions. Each question returns retrieved chunks and a generated answer. We compute Recall@K, Precision@K, NDCG@K, Hit Rate@K, and MRR for retrieval quality, plus Token F1 and ROUGE-L for answer quality. CRAG uses rule-based correct/missing/incorrect scoring.

Ground truth: RAGBench and LoCoV1 provide annotated relevant document IDs. QASPER provides paragraph-level evidence annotations. CRAG has no retrieval ground truth (answer-only evaluation). BEIR uses standard passage-level qrels.

Concurrency: 50–100 concurrent API calls per eval run.
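The bounded-concurrency pattern used by the eval runs can be sketched with an asyncio semaphore. This is a toy illustration, not the actual harness; `ask` stands in for the call to the AI Search endpoint:

```python
import asyncio

async def run_eval(questions, ask, max_concurrency=50):
    """Send questions with bounded concurrency, as in the eval runs (50-100).

    `ask` is an async callable question -> answer; the real runs call the
    AI Search /chat/completions endpoint (request details are assumptions).
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(q):
        async with semaphore:  # at most max_concurrency calls in flight
            return await ask(q)

    # gather preserves input order, so answers line up with questions
    return await asyncio.gather(*(one(q) for q in questions))

# Toy stand-in for the API call
async def fake_ask(q):
    await asyncio.sleep(0)
    return f"answer to {q}"

answers = asyncio.run(run_eval(["q1", "q2", "q3"], fake_ask))
print(answers)  # ['answer to q1', 'answer to q2', 'answer to q3']
```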

AI Search instances: Each dataset × model × chunk_size × hybrid_enabled combination has its own AI Search instance. Fuzzy match is a query-time parameter.
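That combination grid can be enumerated directly; a toy sketch (the dataset subset is illustrative, not the full 18):

```python
from itertools import product

datasets = ["MS MARCO", "QASPER"]  # illustrative subset of the benchmark
models = ["Qwen3", "Gemini2", "BGE-M3", "Gemma"]
chunk_sizes = [256, 512, 1024]
hybrid = [False, True]

# One AI Search instance per (dataset, model, chunk_size, hybrid) combination;
# fuzzy match is toggled per query, so it multiplies eval runs, not instances.
instances = list(product(datasets, models, chunk_sizes, hybrid))
print(len(instances))  # 48 instances for this subset
```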

Result sources: Production (v6) and staging (v8) indexing pipelines. Four embedding models tested: Qwen3, Gemini2, BGE-M3, Gemma. Staging results are used where the production pipeline had no data for a dataset.

Generated from 423 eval configurations · March 18, 2026