AI Search — Benchmark Evaluation
Internal Evaluation — Last updated March 18, 2026

How well does Cloudflare AI Search find the right documents and generate accurate answers? We tested 4 embedding models across 18 public datasets (from the RAGBench, BEIR, MIRACL, CRAG, QASPER, LoCoV1, and NewsQA collections) with 423 configuration variants — covering retrieval quality, answer accuracy, and hallucination rates.

423 configurations · 468K question evaluations · 18 public datasets · 4 embedding models

Key Findings

  1. Qwen3 and Gemini2 are neck and neck — each wins 5 of 12 retrieval datasets. HotpotQA and MS MARCO are effectively tied at R@10 ≈ 0.99. Qwen3 leads on PubMedQA, CourtListener, Legal CR, TechQA, and CUAD; Gemini2 leads on emanual, ExpertQA, FinQA, NewsQA, and QASPER.
  2. CRAG: both top models achieve positive composites — Qwen3 leads with +0.04 composite by abstaining more (41% missing), while Gemini2 is more accurate (36%) but hallucinates more (35%). BGE-M3 and Gemma remain unsolved at 63–65% hallucination.
  3. Pipeline improvements are massive — indexing pipeline updates lifted MS MARCO fuzzy by +53pp and FinQA by +29pp, without changing the embedding model.
  4. Fuzzy match is the highest-leverage config knob — up to +11pp on TechQA across all models, but −33pp on PubMedQA. Dataset-dependent, not model-dependent.
  5. QASPER (academic papers) remains hard — best R@10 is 0.50 (Gemini2), with all models clustered at 0.47–0.50. Fuzzy match is critical (+6–8pp boost).
  6. Hybrid search is model-dependent — safe for Gemini2, mostly safe for Qwen3, but degrades BGE-M3 and Gemma on short-doc datasets.

How to read this report

The core question is simple: when a user asks a question, does AI Search find the right documents? And if it does, is the generated answer any good? We measure this with standard information retrieval and answer quality metrics, evaluated against public datasets with known ground-truth answers.

Retrieval — did we find the right documents?
Recall@K
Of all the relevant documents that exist, what fraction did we return in the top K results? R@10 = 0.90 means we found 90% of the relevant documents in the top 10.
MRR
Mean Reciprocal Rank. How high up does the first relevant result appear? MRR = 1.0 means the first result is always relevant; 0.5 means it’s typically second.
NDCG@K
Normalized Discounted Cumulative Gain. Are the most relevant results ranked higher than less relevant ones? Rewards good ordering, not just presence.
Hit Rate@K
Did we find at least one relevant document in the top K? A binary yes/no per query, averaged across all queries.
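The retrieval metrics above can be computed in a few lines. A minimal sketch with binary relevance over document IDs (the example documents and rankings are invented for illustration):

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

def mrr(retrieved, relevant):
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def hit_rate_at_k(retrieved, relevant, k):
    """1.0 if any relevant doc appears in the top k, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(relevant) else 0.0

def ndcg_at_k(retrieved, relevant, k):
    """Binary-relevance NDCG@k: rewards ranking relevant docs earlier."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(retrieved[:k], start=1)
              if doc in relevant)
    # Ideal DCG: all relevant docs packed at the top, truncated at k
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# Example: 2 of 3 relevant docs in the top 5, first hit at rank 1
retrieved = ["d1", "d9", "d3", "d7", "d5"]
relevant = ["d1", "d3", "d4"]
print(round(recall_at_k(retrieved, relevant, 5), 3))  # 0.667
print(mrr(retrieved, relevant))                       # 1.0
```

Per-query scores are averaged across all questions in a dataset to produce the numbers reported below.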
Answer quality — is the generated answer correct?
Token F1
Word-level overlap between the generated answer and the ground truth. Balances precision (no extra words) with recall (no missing words). The primary answer quality metric.
ROUGE-L
Longest common subsequence between generated and expected answer. Captures whether the answer preserves the right sequence of information.
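Both answer-quality metrics are straightforward to implement. A minimal sketch (whitespace tokenization is an assumption here; the production eval may normalize case and punctuation differently):

```python
from collections import Counter

def token_f1(prediction, truth):
    """Word-level F1: harmonic mean of precision and recall over tokens."""
    pred, gold = prediction.split(), truth.split()
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

def rouge_l_f1(prediction, truth):
    """ROUGE-L: F1 over the longest common subsequence of tokens."""
    pred, gold = prediction.split(), truth.split()
    # Classic LCS dynamic-programming table
    dp = [[0] * (len(gold) + 1) for _ in range(len(pred) + 1)]
    for i, p in enumerate(pred, 1):
        for j, g in enumerate(gold, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if p == g else max(dp[i-1][j], dp[i][j-1])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(gold)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the cat sat on the mat", "the cat lay on the mat"), 3))  # 0.833
```

Token F1 ignores word order entirely; ROUGE-L adds an ordering constraint via the longest common subsequence, so a shuffled answer scores lower on ROUGE-L even when F1 is unchanged.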
CRAG-specific — does the system know what it doesn’t know?
CRAG Composite
Scores correct answers +1, abstentions ("I don't know") 0, and hallucinations −1. Range [−1, +1]. Penalizes confident wrong answers more than silence.
Hallucination Rate
What percentage of answers are confidently wrong? The single most important safety metric — lower is better.
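The CRAG scoring rule is simple to express in code. A minimal sketch (the judgment labels `correct`/`missing`/`incorrect` are an assumed encoding of the per-question results):

```python
def crag_scores(judgments):
    """judgments: list of 'correct', 'missing', or 'incorrect' per question.

    Returns (composite, hallucination_rate). Composite scores correct +1,
    missing (abstention) 0, incorrect (hallucination) -1, averaged over
    all questions, so it falls in [-1, +1].
    """
    n = len(judgments)
    correct = judgments.count("correct")
    incorrect = judgments.count("incorrect")
    composite = (correct - incorrect) / n
    hallucination_rate = incorrect / n  # lower is better
    return composite, hallucination_rate

# Example: 4 correct, 3 abstentions, 3 hallucinations out of 10
j = ["correct"] * 4 + ["missing"] * 3 + ["incorrect"] * 3
print(crag_scores(j))  # (0.1, 0.3)
```

Because a hallucination cancels a correct answer, a model can raise its composite by abstaining more, even at the cost of raw accuracy — exactly the Qwen3 vs Gemini2 trade-off reported below.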

Datasets

All datasets are public and commonly used in information retrieval and question-answering research. Each provides a corpus of documents and a set of questions with known ground-truth answers or relevant passages.

RAGBench — galileo-ai/ragbench

| Dataset | Domain | Questions | Best R@3 | Best R@5 | Best R@10 | Source |
|---|---|---|---|---|---|---|
| MS MARCO | Web search | 423 | 0.510 | 0.746 | 0.990 | HuggingFace |
| HotpotQA | Multi-hop reasoning | 390 | 0.982 | 0.997 | 0.999 | HuggingFace |
| ExpertQA | Expert-level Q&A | 203 | 0.773 | 0.891 | 0.954 | HuggingFace |
| FinQA | Financial documents | 2,294 | 0.700 | 0.780 | 0.862 | HuggingFace |
| TechQA | Technical support | 314 | 0.655 | 0.766 | 0.897 | HuggingFace |
| PubMedQA | Biomedical literature | 2,450 | 0.549 | 0.800 | 0.925 | HuggingFace |
| CUAD | Legal contracts | 510 | 0.035 | 0.059 | 0.102 | HuggingFace |
| emanual | Product manuals | 132 | 0.851 | 0.962 | 1.000 | HuggingFace |

BEIR — beir-cellar/beir

| Dataset | Domain | Questions | Best R@3 | Best R@5 | Best R@10 | Source |
|---|---|---|---|---|---|---|
| SciFact | Scientific claims | 300 | 0.745 | 0.797 | 0.847 | HuggingFace |
| NFCorpus | Nutrition & health | 323 | — | — | — | HuggingFace |
| FiQA | Financial opinion Q&A | 648 | — | — | — | HuggingFace |
| ArguAna | Argument retrieval | 1,401 | — | — | — | HuggingFace |
| SciDocs | Scientific documents | 1,000 | — | — | — | HuggingFace |

Other benchmarks

| Dataset | Domain | Questions | Best R@3 | Best R@5 | Best R@10 | Source |
|---|---|---|---|---|---|---|
| CRAG | End-to-end QA (web) | 1,335 | — | — | — | GitHub |
| QASPER | Academic papers | 1,372 | 0.422 | 0.460 | 0.500 | HuggingFace |
| NewsQA | News articles | 4,212 | 0.753 | 0.806 | 0.855 | HuggingFace |
| CourtListener | Legal opinions | 2,000 | 0.851 | 0.875 | 0.889 | HuggingFace |
| Legal CR | Legal case reports | 770 | 0.534 | 0.609 | 0.656 | HuggingFace |

Internal benchmarks

| Dataset | Domain | Questions | Best R@3 | Best R@5 | Best R@10 | Source |
|---|---|---|---|---|---|---|
| Code Eval | Structured data | 142 | 0.323 | 0.330 | 0.330 | Internal |
| JSON Eval | Structured data | 159 | 0.487 | 0.487 | 0.487 | Internal |

MIRACL — miracl/miracl

| Dataset | Domain | Questions | Best R@3 | Best R@5 | Best R@10 | Source |
|---|---|---|---|---|---|---|
| Arabic | Multilingual (Arabic) | 2,723 | 0.850 | 0.902 | 0.935 | HuggingFace |
| Korean | Multilingual (Korean) | 199 | 0.760 | 0.824 | 0.863 | HuggingFace |

Cross-Model Leaderboard

The best score achieved by any configuration of each model across all evaluation runs. Dashes indicate the model was not tested on that dataset.

Best Recall@10 by Dataset and Model
Grouped bar chart — higher is better — best config per model

| Dataset | Qwen3 | Gemini2 | BGE-M3 | Gemma | Winner |
|---|---|---|---|---|---|
| emanual | 0.886 | 1.000 | 0.907 | 0.649 | Gemini2 |
| MS MARCO | 0.990 | 0.990 | 0.840 | 0.770 | (tie) |
| ExpertQA | 0.930 | 0.954 | 0.568 | 0.434 | Gemini2 |
| PubMedQA | 0.925 | 0.625 | 0.916 | 0.881 | Qwen3 |
| CourtListener | 0.889 | — | 0.779 | 0.636 | Qwen3 |
| HotpotQA | 0.999 | 0.999 | 0.535 | 0.383 | (tie) |
| FinQA | 0.824 | 0.862 | 0.804 | 0.334 | Gemini2 |
| TechQA | 0.897 | 0.864 | 0.846 | 0.401 | Qwen3 |
| NewsQA | 0.653 | 0.855 | 0.789 | 0.510 | Gemini2 |
| Legal CR | 0.656 | — | — | 0.329 | Qwen3 |
| CUAD | 0.102 | 0.072 | 0.080 | 0.069 | Qwen3 |
| QASPER | 0.471 | 0.500 | 0.472 | 0.474 | Gemini2 |
Best Recall@3 by Dataset and Model

| Dataset | Qwen3 | Gemini2 | BGE-M3 | Gemma | Winner |
|---|---|---|---|---|---|
| emanual | 0.745 | 0.881 | 0.664 | 0.581 | Gemini2 |
| MS MARCO | 0.510 | 0.510 | 0.373 | 0.366 | Gemini2 |
| ExpertQA | 0.808 | 0.808 | 0.501 | 0.412 | (tie) |
| PubMedQA | 0.549 | 0.553 | 0.550 | 0.530 | Gemini2 |
| CourtListener | 0.851 | — | 0.699 | 0.549 | Qwen3 |
| HotpotQA | 0.984 | 0.984 | 0.470 | 0.366 | (tie) |
| FinQA | 0.684 | 0.720 | 0.671 | 0.242 | Gemini2 |
| TechQA | 0.655 | 0.683 | 0.641 | 0.288 | Gemini2 |
| NewsQA | 0.577 | 0.753 | 0.684 | 0.490 | Gemini2 |
| Legal CR | 0.534 | — | — | 0.295 | Qwen3 |
| CUAD | 0.041 | 0.033 | 0.033 | 0.037 | Qwen3 |
| QASPER | 0.398 | 0.422 | 0.398 | 0.405 | Gemini2 |
Best MRR by Dataset and Model

| Dataset | Qwen3 | Gemini2 | BGE-M3 | Gemma | Winner |
|---|---|---|---|---|---|
| emanual | 0.868 | 0.859 | 0.951 | 0.874 | BGE-M3 |
| MS MARCO | 0.989 | 0.922 | 0.995 | 0.991 | BGE-M3 |
| ExpertQA | 0.854 | 0.876 | 0.852 | 0.760 | Gemini2 |
| PubMedQA | 0.988 | 0.780 | 0.994 | 0.987 | BGE-M3 |
| CourtListener | 0.705 | — | 0.641 | 0.487 | Qwen3 |
| HotpotQA | 0.996 | 0.996 | 0.943 | 0.858 | (tie) |
| FinQA | 0.646 | 0.668 | 0.628 | 0.507 | Gemini2 |
| TechQA | 0.770 | 0.754 | 0.731 | 0.751 | Qwen3 |
| NewsQA | 0.526 | 0.686 | 0.628 | 0.456 | Gemini2 |
| Legal CR | 0.460 | — | — | 0.262 | Qwen3 |
| CUAD | 0.035 | 0.030 | 0.028 | 0.031 | Qwen3 |
| QASPER | 0.348 | 0.372 | 0.348 | 0.350 | Gemini2 |

Scorecard: Qwen3 wins 5 (PubMedQA, CourtListener, TechQA, Legal CR, CUAD). Gemini2 wins 5 (emanual, ExpertQA, FinQA, NewsQA, QASPER). Dashes indicate the model was not tested on that dataset.

1. Indexing Pipeline Quality Has a Large Impact

Improvements to the AI Search indexing pipeline produced significant retrieval gains on several datasets — without changing the embedding model or search configuration. This demonstrates that the backend infrastructure matters as much as the model choice.

Impact of Indexing Pipeline Improvements (Qwen3, same config)
R@10 change after indexing pipeline update
| Dataset | Config | Before | After | Delta |
|---|---|---|---|---|
| MS MARCO | 512+hybrid+fuzzy | 0.430 | 0.960 | +52.9pp |
| MS MARCO | 1024+hybrid+fuzzy | 0.445 | 0.960 | +51.6pp |
| MS MARCO | 256+hybrid+fuzzy | 0.445 | 0.956 | +51.1pp |
| FinQA | 256 | 0.481 | 0.775 | +29.3pp |
| FinQA | 256+hybrid | 0.481 | 0.770 | +28.9pp |
Takeaway: The indexing pipeline is a significant lever for retrieval quality. Some regressions exist on short-doc datasets, suggesting the improvements favor longer documents. This is an active area of optimization.

2. QASPER: Academic Papers Are Hard

QASPER tests retrieval over academic NLP papers with 1,372 questions. The best R@10 across all models is 0.50 (Gemini2) — meaning half the relevant passages are missed. All models cluster between 0.47–0.50, with fuzzy match providing a consistent boost of +6–8pp.

QASPER: Recall@10 by Config and Model
Fuzzy match provides a consistent +6–8pp boost across all models

| Config | Qwen3 R@10 | Gemini2 R@10 | BGE-M3 R@10 | Gemma R@10 |
|---|---|---|---|---|
| 256 | 0.407 | 0.446 | 0.410 | 0.377 |
| 256+h | 0.407 | 0.437 | 0.401 | 0.391 |
| 256+h+f | 0.471 | 0.500 | 0.472 | 0.473 |
| 512 | 0.425 | 0.441 | 0.423 | 0.358 |
| 512+h | 0.421 | 0.443 | 0.423 | 0.368 |
| 512+h+f | 0.470 | 0.494 | 0.472 | 0.474 |
| 1024 | 0.393 | 0.382 | — | — |
| 1024+h | 0.408 | 0.407 | — | — |
| 1024+h+f | 0.466 | 0.468 | — | — |
Takeaway: QASPER is the hardest retrieval dataset in the benchmark — a ~50% ceiling that no model breaks through. Fuzzy match is critical here, providing a consistent +6–8pp boost regardless of model. The difficulty comes from questions that require reasoning across multiple paper sections, not just embedding quality.

3. CRAG: Hallucination Rates Vary Wildly by Model

CRAG (Comprehensive RAG Benchmark) evaluates end-to-end answer quality on 1,335 questions across 5 domains. Answers are scored as correct (+1), missing (0), or incorrect/hallucinated (−1). Qwen3 (latest pipeline) slightly edges Gemini2 on composite score (+0.04 vs +0.01), while Gemini2 has higher raw accuracy (36% vs 32%). Both achieve positive composites with hallucination rates of 28–35%. BGE-M3 and Gemma hallucinate on 63–65% of questions.

CRAG: Answer Quality Breakdown by Model
Best config per model — Qwen3 and Gemini2 achieve positive composites, BGE-M3 and Gemma are negative
Best composite: +0.04 · Min hallucination rate: 28% · Best accuracy: 36% · Avg abstention rate: 22%
| Model | Best Config | Composite | Accuracy | Halluc. Rate | Missing Rate |
|---|---|---|---|---|---|
| Qwen3 | 256+hybrid+fuzzy | +0.04 | 31.6% | 27.9% | 40.6% |
| Gemini2 | 512+hybrid | +0.01 | 35.9% | 34.8% | 29.3% |
| BGE-M3 | 256 | −0.38 | 26.6% | 64.7% | 8.7% |
| Gemma | 512 | −0.36 | 27.2% | 63.5% | 9.3% |

CRAG has no retrieval ground truth (retrieval metrics are all 0.0 by design), so this benchmarks the full pipeline: retrieval + generation. The embedding model and pipeline version both have significant impact — Qwen3 (28% hallucination) and Gemini2 (35% hallucination) are in a different league from BGE-M3 and Gemma (63–65%). Models with lower hallucination rates also abstain more often (29–41%), which is the right behavior when the answer isn't in context.

Takeaway: Qwen3 leads on composite (+0.04) by abstaining more aggressively (41% missing), while Gemini2 is more accurate (36%) but hallucinates more (35%). Both are positive — very different from BGE-M3/Gemma at 63%+ hallucination. Answer calibration (knowing when to say "I don't know") matters as much as retrieval quality for factual QA.

4. Fuzzy Match: High-Risk, High-Reward

Fuzzy match (OR-mode BM25) remains the single most impactful configuration option. The effect is dataset-dependent, not model-dependent — all models benefit equally on technical/specialized content, and all suffer equally on short-document or precision-sensitive corpora.

Fuzzy Match Impact: Best Hybrid+Fuzzy vs Best Hybrid
R@10 change in percentage points — green = improvement, red = degradation
| Dataset | Model | Without Fuzzy | With Fuzzy | Delta |
|---|---|---|---|---|
| TechQA | Gemini2 | 0.756 | 0.864 | +10.8pp |
| TechQA | Qwen3 | 0.791 | 0.897 | +10.7pp |
| TechQA | BGE-M3 | 0.742 | 0.846 | +10.5pp |
| QASPER | Gemma | 0.391 | 0.474 | +8.3pp |
| QASPER | Gemini2 | 0.443 | 0.500 | +5.6pp |
| PubMedQA | Qwen3 | 0.925 | 0.599 | −32.6pp |
| Legal CR | Qwen3 | 0.656 | 0.533 | −12.3pp |
| CourtListener | Qwen3 | 0.889 | 0.783 | −10.6pp |
Takeaway: Fuzzy match should be opt-in per use case. Enable for: technical docs, financial data, expert content, academic papers (QASPER). Avoid for: short-document collections, citation-heavy content, web search.

5. Model Selection Guide

Each model has a distinct niche. The gap between best and worst model on any dataset ranges from ~3pp (QASPER) to 53pp (FinQA).

Model Strengths: Where Each Model Wins
Best R@10 per model on datasets where all were tested

Based on the closest benchmark dataset for each use case. These are proxies — actual performance depends on your corpus size, document structure, and question distribution.

| Use Case | Best Model | Config | Benchmark Proxy | R@10 |
|---|---|---|---|---|
| Product docs, manuals | Gemini2 | 512+h | emanual | 1.00 |
| Web search, short docs | Qwen3 | 1024+h | MS MARCO | 0.99 |
| Financial / tabular | Gemini2 | 512+h+fuzzy | FinQA | 0.86 |
| Technical docs (long) | Qwen3 | 1024+h+fuzzy | TechQA | 0.90 |
| Expert knowledge | Gemini2 | 1024+h+fuzzy | ExpertQA | 0.95 |
| Biomedical | Qwen3 | 1024 | PubMedQA | 0.93 |
| Multi-hop reasoning | Qwen3 | 256 | HotpotQA | 1.00 |
| Legal citations | Qwen3 | 1024+h | CourtListener | 0.89 |
| News, journalism | Gemini2 | 512+h+fuzzy | NewsQA | 0.85 |
| Academic papers | Gemini2 | 256+h+fuzzy | QASPER | 0.50 |

6. Hybrid Search Is Model-Dependent

Hybrid search (vector + BM25 exact match) is safe for Qwen3 and Gemini2, but BGE-M3 and Gemma are vulnerable to significant degradation on short-document datasets. The risk profile is model-dependent.

Hybrid Search Delta by Model (Avg R@10 change)
Positive = hybrid helps, negative = hybrid hurts

Safe defaults by model

  • Qwen3 + hybrid: Mostly safe. Worst: emanual (-4.1pp).
  • Gemini2 + hybrid: Safe everywhere. Worst delta is -0.5pp.
  • BGE-M3 + hybrid: Catastrophic on MS MARCO (-26.1pp).
  • Gemma + hybrid: Catastrophic on PubMedQA (-18.9pp).

7. Code & JSON Retrieval

Internal benchmarks for non-prose content: 30 source code files (TypeScript, Python, Go, Rust, SQL) and 30 structured JSON files (API responses, config files, Terraform state, GeoJSON, logs). Documents include deliberate distractors — similar files with overlapping themes that force the model to discriminate (e.g. two Hono middleware guides, two database schemas, two ETL pipelines).

| Dataset | Model | Best Config | R@3 | R@10 | MRR |
|---|---|---|---|---|---|
| Code Eval | Qwen3 | 256+hybrid+fuzzy | 0.320 | 0.326 | 0.338 |
| Code Eval | Gemini2 | 1024 | 0.323 | 0.330 | 0.354 |
| JSON Eval | Qwen3 | 256+hybrid+fuzzy | 0.450 | 0.472 | 0.428 |
| JSON Eval | Gemini2 | 256 | 0.487 | 0.487 | 0.474 |
Takeaway: Code and JSON retrieval is significantly harder than prose. With 30 distractor documents per dataset, best R@10 drops to ~0.33 for code and ~0.49 for JSON. Both models perform similarly, and no config provides a major advantage — structured content remains a genuine retrieval challenge.

8. Answer Quality vs Retrieval Quality

Better retrieval does not guarantee better answers. The correlation between R@10 and Token F1 is weak (r ≈ 0.27), and Token F1 varies less than 2pp across configs within any dataset.

Retrieval Quality vs Answer Quality
Each point = one dataset × model × config — the weak correlation suggests the LLM compensates for retrieval gaps
Takeaway: Once you clear a retrieval quality floor (~0.5 R@10), additional retrieval improvements have diminishing returns for answer quality. Focus optimization effort on model selection and corpus preparation, not config tuning.

Coverage Matrix

Current evaluation coverage across all result sets.

| Dataset | Qwen3 | Gemini2 | BGE-M3 | Gemma | Status |
|---|---|---|---|---|---|
| beir-scifact | 9 | 0 | 0 | 0 | Gemini2, BGE-M3, Gemma missing |
| Code Eval | 9 | 9 | 0 | 0 | BGE-M3, Gemma missing |
| CRAG | 9 | 4 | 6 | 6 | Complete |
| JSON Eval | 8 | 8 | 0 | 0 | BGE-M3, Gemma missing |
| CourtListener | 9 | 0 | 4 | 4 | Gemini2 missing |
| Legal CR | 9 | 1 | 6 | 6 | Complete |
| MIRACL Arabic | 9 | 0 | 0 | 0 | Gemini2, BGE-M3, Gemma missing |
| MIRACL Korean | 3 | 0 | 0 | 0 | Gemini2, BGE-M3, Gemma missing |
| NewsQA | 9 | 4 | 6 | 6 | Complete |
| QASPER | 9 | 9 | 6 | 6 | Complete |
| CUAD | 9 | 9 | 6 | 6 | Complete |
| emanual | 9 | 9 | 6 | 6 | Complete |
| ExpertQA | 10 | 9 | 6 | 6 | Complete |
| FinQA | 9 | 9 | 6 | 6 | Complete |
| HotpotQA | 10 | 9 | 6 | 6 | Complete |
| MS MARCO | 9 | 10 | 6 | 6 | Complete |
| PubMedQA | 9 | 9 | 6 | 6 | Complete |
| TechQA | 15 | 9 | 6 | 6 | Complete |

Question Explorer

A sample of questions where models disagree most on retrieval (R@10 diff > 0.5). R@10 measures whether the search engine found the right source documents — not whether the final answer is correct. You'll often see models give similar answers despite very different R@10 scores. This can happen because the LLM already knows the answer from its training data, or because it infers it from partial context. The value of high retrieval is grounding: a model with R@10 = 1.0 can cite the actual source documents, while a model with R@10 = 0.0 may have given the right answer from memory — which won't work for private data the LLM has never seen.


Methodology

Evaluation pipeline: Questions are sent to AI Search's /chat/completions endpoint across multiple indexing pipeline versions. Each question returns retrieved chunks and a generated answer. We compute Recall@K, Precision@K, NDCG@K, Hit Rate@K, and MRR for retrieval quality, plus Token F1 and ROUGE-L for answer quality. CRAG uses rule-based correct/missing/incorrect scoring.

Ground truth: RAGBench and LoCoV1 provide annotated relevant document IDs. QASPER provides paragraph-level evidence annotations. CRAG has no retrieval ground truth (answer-only evaluation). BEIR uses standard passage-level qrels.

Concurrency: 50–100 concurrent API calls per eval run.
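The bounded-concurrency pattern used by the eval runs can be sketched with an asyncio semaphore. This is a toy illustration, not the actual harness; `ask` stands in for the call to the AI Search endpoint:

```python
import asyncio

async def run_eval(questions, ask, max_concurrency=50):
    """Send questions with bounded concurrency, as in the eval runs (50-100).

    `ask` is an async callable question -> answer; the real runs call the
    AI Search /chat/completions endpoint (request details are assumptions).
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def one(q):
        async with semaphore:  # at most max_concurrency calls in flight
            return await ask(q)

    # gather preserves input order, so answers line up with questions
    return await asyncio.gather(*(one(q) for q in questions))

# Toy stand-in for the API call
async def fake_ask(q):
    await asyncio.sleep(0)
    return f"answer to {q}"

answers = asyncio.run(run_eval(["q1", "q2", "q3"], fake_ask))
print(answers)  # ['answer to q1', 'answer to q2', 'answer to q3']
```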

AI Search instances: Each dataset × model × chunk_size × hybrid_enabled combination has its own AI Search instance. Fuzzy match is a query-time parameter.
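That combination grid can be enumerated directly; a toy sketch (the dataset subset is illustrative, not the full 18):

```python
from itertools import product

datasets = ["MS MARCO", "QASPER"]  # illustrative subset of the benchmark
models = ["Qwen3", "Gemini2", "BGE-M3", "Gemma"]
chunk_sizes = [256, 512, 1024]
hybrid = [False, True]

# One AI Search instance per (dataset, model, chunk_size, hybrid) combination;
# fuzzy match is toggled per query, so it multiplies eval runs, not instances.
instances = list(product(datasets, models, chunk_sizes, hybrid))
print(len(instances))  # 48 instances for this subset
```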

Result sources: Production (v6) and staging (v8) indexing pipelines. Four embedding models tested: Qwen3, Gemini2, BGE-M3, Gemma. Staging results are used where the production pipeline had no data for a dataset.

Generated from 423 eval configurations · March 18, 2026