Key Findings
- Qwen3 and Gemini2 are neck and neck — each wins 5 of 12 retrieval datasets. HotpotQA and MS MARCO are effectively tied at R@10 ≈ 0.99. Qwen3 leads on PubMedQA, CourtListener, Legal CR, TechQA, and CUAD; Gemini2 leads on emanual, ExpertQA, FinQA, NewsQA, and QASPER.
- CRAG: both top models achieve positive composites — Qwen3 leads with a +0.04 composite by abstaining more (41% missing), while Gemini2 is more accurate (36%) but hallucinates more (35%). BGE-M3 and Gemma are far behind, hallucinating on 63–65% of questions.
- Pipeline improvements are massive — indexing pipeline updates lifted MS MARCO fuzzy by +53pp and FinQA by +29pp, without changing the embedding model.
- Fuzzy match is the highest-leverage config knob — up to +11pp on TechQA across all models, but −33pp on PubMedQA. Dataset-dependent, not model-dependent.
- QASPER (academic papers) remains hard — best R@10 is 0.50 (Gemini2), with all models clustered at 0.47–0.50. Fuzzy match is critical (+6–8pp boost).
- Hybrid search is model-dependent — safe for Gemini2, mostly safe for Qwen3, but degrades BGE-M3 and Gemma on short-doc datasets.
How to read this report
The core question is simple: when a user asks a question, does AI Search find the right documents? And if it does, is the generated answer any good? We measure this with standard information retrieval and answer quality metrics, evaluated against public datasets with known ground-truth answers.
Datasets
All datasets are public and commonly used in information retrieval and question-answering research. Each provides a corpus of documents and a set of questions with known ground-truth answers or relevant passages.
| Dataset | Domain | Questions | Best R@3 | Best R@5 | Best R@10 | Source |
|---|---|---|---|---|---|---|
| RAGBench — galileo-ai/ragbench | ||||||
| MS MARCO | Web search | 423 | 0.510 | 0.746 | 0.990 | HuggingFace |
| HotpotQA | Multi-hop reasoning | 390 | 0.982 | 0.997 | 0.999 | HuggingFace |
| ExpertQA | Expert-level Q&A | 203 | 0.773 | 0.891 | 0.954 | HuggingFace |
| FinQA | Financial documents | 2,294 | 0.700 | 0.780 | 0.862 | HuggingFace |
| TechQA | Technical support | 314 | 0.655 | 0.766 | 0.897 | HuggingFace |
| PubMedQA | Biomedical literature | 2,450 | 0.549 | 0.800 | 0.925 | HuggingFace |
| CUAD | Legal contracts | 510 | 0.035 | 0.059 | 0.102 | HuggingFace |
| emanual | Product manuals | 132 | 0.851 | 0.962 | 1.000 | HuggingFace |
| BEIR — beir-cellar/beir | ||||||
| SciFact | Scientific claims | 300 | 0.745 | 0.797 | 0.847 | HuggingFace |
| NFCorpus | Nutrition & health | 323 | — | — | — | HuggingFace |
| FiQA | Financial opinion Q&A | 648 | — | — | — | HuggingFace |
| ArguAna | Argument retrieval | 1,401 | — | — | — | HuggingFace |
| SciDocs | Scientific documents | 1,000 | — | — | — | HuggingFace |
| Other benchmarks | ||||||
| CRAG | End-to-end QA (web) | 1,335 | — | — | — | GitHub |
| QASPER | Academic papers | 1,372 | 0.422 | 0.460 | 0.500 | HuggingFace |
| NewsQA | News articles | 4,212 | 0.753 | 0.806 | 0.855 | HuggingFace |
| CourtListener | Legal opinions | 2,000 | 0.851 | 0.875 | 0.889 | HuggingFace |
| Legal CR | Legal case reports | 770 | 0.534 | 0.609 | 0.656 | HuggingFace |
| Internal benchmarks | ||||||
| Code Eval | Source code | 142 | 0.323 | 0.330 | 0.330 | Internal |
| JSON Eval | Structured data | 159 | 0.487 | 0.487 | 0.487 | Internal |
| MIRACL — miracl/miracl | ||||||
| Arabic | Multilingual (Arabic) | 2,723 | 0.850 | 0.902 | 0.935 | HuggingFace |
| Korean | Multilingual (Korean) | 199 | 0.760 | 0.824 | 0.863 | HuggingFace |
Cross-Model Leaderboard
The best score achieved by any configuration of each model across all evaluation runs, reported in three metric blocks per dataset. Dashes indicate the model was not tested on that dataset.
| Dataset | Qwen3 | Gemini2 | BGE-M3 | Gemma | Winner |
|---|---|---|---|---|---|
| R@10 | |||||
| emanual | 0.886 | 1.000 | 0.907 | 0.649 | Gemini2 |
| MS MARCO | 0.990 | 0.990 | 0.840 | 0.770 | — |
| ExpertQA | 0.930 | 0.954 | 0.568 | 0.434 | Gemini2 |
| PubMedQA | 0.925 | 0.625 | 0.916 | 0.881 | Qwen3 |
| CourtListener | 0.889 | — | 0.779 | 0.636 | Qwen3 |
| HotpotQA | 0.999 | 0.999 | 0.535 | 0.383 | — |
| FinQA | 0.824 | 0.862 | 0.804 | 0.334 | Gemini2 |
| TechQA | 0.897 | 0.864 | 0.846 | 0.401 | Qwen3 |
| NewsQA | 0.653 | 0.855 | 0.789 | 0.510 | Gemini2 |
| Legal CR | 0.656 | 0.329 | — | — | Qwen3 |
| CUAD | 0.102 | 0.072 | 0.080 | 0.069 | Qwen3 |
| QASPER New | 0.471 | 0.500 | 0.472 | 0.474 | Gemini2 |
| R@3 | |||||
| emanual | 0.745 | 0.881 | 0.664 | 0.581 | Gemini2 |
| MS MARCO | 0.510 | 0.510 | 0.373 | 0.366 | — |
| ExpertQA | 0.808 | 0.808 | 0.501 | 0.412 | — |
| PubMedQA | 0.549 | 0.553 | 0.550 | 0.530 | Gemini2 |
| CourtListener | 0.851 | — | 0.699 | 0.549 | Qwen3 |
| HotpotQA | 0.984 | 0.984 | 0.470 | 0.366 | — |
| FinQA | 0.684 | 0.720 | 0.671 | 0.242 | Gemini2 |
| TechQA | 0.655 | 0.683 | 0.641 | 0.288 | Gemini2 |
| NewsQA | 0.577 | 0.753 | 0.684 | 0.490 | Gemini2 |
| Legal CR | 0.534 | 0.295 | — | — | Qwen3 |
| CUAD | 0.041 | 0.033 | 0.033 | 0.037 | Qwen3 |
| QASPER New | 0.398 | 0.422 | 0.398 | 0.405 | Gemini2 |
| MRR | |||||
| emanual | 0.868 | 0.859 | 0.951 | 0.874 | BGE-M3 |
| MS MARCO | 0.989 | 0.922 | 0.995 | 0.991 | BGE-M3 |
| ExpertQA | 0.854 | 0.876 | 0.852 | 0.760 | Gemini2 |
| PubMedQA | 0.988 | 0.780 | 0.994 | 0.987 | BGE-M3 |
| CourtListener | 0.705 | — | 0.641 | 0.487 | Qwen3 |
| HotpotQA | 0.996 | 0.996 | 0.943 | 0.858 | — |
| FinQA | 0.646 | 0.668 | 0.628 | 0.507 | Gemini2 |
| TechQA | 0.770 | 0.754 | 0.731 | 0.751 | Qwen3 |
| NewsQA | 0.526 | 0.686 | 0.628 | 0.456 | Gemini2 |
| Legal CR | 0.460 | 0.262 | — | — | Qwen3 |
| CUAD | 0.035 | 0.030 | 0.028 | 0.031 | Qwen3 |
| QASPER New | 0.348 | 0.372 | 0.348 | 0.350 | Gemini2 |
Scorecard: Qwen3 wins 5 (PubMedQA, CourtListener, TechQA, Legal CR, CUAD). Gemini2 wins 5 (emanual, ExpertQA, FinQA, NewsQA, QASPER).
Indexing Pipeline Quality Has a Large Impact
Improvements to the AI Search indexing pipeline produced significant retrieval gains on several datasets — without changing the embedding model or search configuration. This demonstrates that the backend infrastructure matters as much as the model choice.
| Dataset | Config | Before | After | Delta |
|---|---|---|---|---|
| MS MARCO | 512+hybrid+fuzzy | 0.430 | 0.960 | +52.9pp |
| MS MARCO | 1024+hybrid+fuzzy | 0.445 | 0.960 | +51.6pp |
| MS MARCO | 256+hybrid+fuzzy | 0.445 | 0.956 | +51.1pp |
| FinQA | 256 | 0.481 | 0.775 | +29.3pp |
| FinQA | 256+hybrid | 0.481 | 0.770 | +28.9pp |
QASPER: Academic Papers Are Hard
QASPER tests retrieval over academic NLP papers with 1,372 questions. The best R@10 across all models is 0.50 (Gemini2) — meaning half the relevant passages are missed. All models cluster between 0.47–0.50, with fuzzy match providing a consistent boost of +6–8pp.
| Config | Qwen3 R@10 | Gemini2 R@10 | BGE-M3 R@10 | Gemma R@10 |
|---|---|---|---|---|
| 256 | 0.407 | 0.446 | 0.410 | 0.377 |
| 256+h | 0.407 | 0.437 | 0.401 | 0.391 |
| 256+h+f | 0.471 | 0.500 | 0.472 | 0.473 |
| 512 | 0.425 | 0.441 | 0.423 | 0.358 |
| 512+h | 0.421 | 0.443 | 0.423 | 0.368 |
| 512+h+f | 0.470 | 0.494 | 0.472 | 0.474 |
| 1024 | 0.393 | 0.382 | — | — |
| 1024+h | 0.408 | 0.407 | — | — |
| 1024+h+f | 0.466 | 0.468 | — | — |
CRAG: Hallucination Rates Vary Wildly by Model
CRAG (Comprehensive RAG Benchmark) evaluates end-to-end answer quality on 1,335 questions across 5 domains. Answers are scored as correct (+1), missing (0), or incorrect/hallucinated (−1). Qwen3 (latest pipeline) slightly edges Gemini2 on composite score (+0.04 vs +0.01), while Gemini2 has higher raw accuracy (36% vs 32%). Both achieve positive composites with hallucination rates of 28–35%. BGE-M3 and Gemma hallucinate on 63–65% of questions.
| Model | Best Config | Composite | Accuracy | Halluc. Rate | Missing Rate |
|---|---|---|---|---|---|
| Qwen3 | 256+hybrid+fuzzy | 0.04 | 31.6% | 27.9% | 40.6% |
| Gemini2 | 512+hybrid | 0.01 | 35.9% | 34.8% | 29.3% |
| BGE-M3 | 256 | -0.38 | 26.6% | 64.7% | 8.7% |
| Gemma | 512 | -0.36 | 27.2% | 63.5% | 9.3% |
CRAG has no retrieval ground truth (retrieval metrics are all 0.0 by design), so this benchmarks the full pipeline: retrieval + generation. The embedding model and pipeline version both have significant impact — Qwen3 (28% hallucination) and Gemini2 (35% hallucination) are in a different league from BGE-M3 and Gemma (63–65%). Models with lower hallucination rates also abstain more often (29–41%), which is the right behavior when the answer isn't in context.
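The composite score is simply accuracy minus hallucination rate, since correct answers score +1, abstentions 0, and hallucinations −1. A minimal sketch of the aggregation (the rule-based answer grader itself is not shown; the counts below are back-derived from the Qwen3 rates in the table and are illustrative):

```python
def crag_composite(n_correct: int, n_missing: int, n_incorrect: int) -> float:
    """CRAG scoring: correct = +1, missing = 0, incorrect/hallucinated = -1.
    Composite = mean score = accuracy - hallucination rate."""
    n = n_correct + n_missing + n_incorrect
    return (n_correct - n_incorrect) / n

# Qwen3 over 1,335 questions: ~31.6% correct, ~40.6% missing, ~27.9% hallucinated
print(round(crag_composite(422, 542, 371), 2))  # → 0.04
```

This is why abstaining is rewarded relative to guessing: a missing answer costs 0, while a hallucination costs a full point.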
Fuzzy Match: High-Risk, High-Reward
Fuzzy match (OR-mode BM25) remains the single most impactful configuration option. The effect is dataset-dependent, not model-dependent — all models benefit equally on technical/specialized content, and all suffer equally on short-document or precision-sensitive corpora.
| Dataset | Model | Without Fuzzy | With Fuzzy | Delta |
|---|---|---|---|---|
| TechQA | Gemini2 | 0.756 | 0.864 | +10.8pp |
| TechQA | Qwen3 | 0.791 | 0.897 | +10.7pp |
| TechQA | BGE-M3 | 0.742 | 0.846 | +10.5pp |
| QASPER | Gemma | 0.391 | 0.474 | +8.3pp |
| QASPER | Gemini2 | 0.443 | 0.500 | +5.6pp |
| PubMedQA | Qwen3 | 0.925 | 0.599 | -32.6pp |
| Legal CR | Qwen3 | 0.656 | 0.533 | -12.3pp |
| CourtListener | Qwen3 | 0.889 | 0.783 | -10.6pp |
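The mechanism behind these swings can be sketched as a boolean candidate filter: AND-mode requires every query term to appear in a document, while OR-mode (fuzzy) accepts any term, widening the pool that BM25 then ranks. This is a hypothetical illustration of the general technique, not AI Search's actual implementation; the example strings are made up:

```python
def candidates(query: str, docs: list[str], mode: str = "and") -> list[int]:
    """Return indices of docs passing the boolean term filter.
    'and': all query terms must appear; 'or' (fuzzy): any term suffices.
    A BM25 scorer would then rank the surviving candidate set."""
    terms = set(query.lower().split())
    hits = []
    for i, doc in enumerate(docs):
        words = set(doc.lower().split())
        ok = terms <= words if mode == "and" else bool(terms & words)
        if ok:
            hits.append(i)
    return hits

docs = [
    "websphere liberty profile truststore configuration",
    "configure the liberty server truststore and keystore",
]
q = "websphere truststore error"
print(candidates(q, docs, "and"))  # → [] — no doc contains all three terms
print(candidates(q, docs, "or"))   # → [0, 1] — both survive for BM25 to rank
```

On jargon-heavy corpora (TechQA, QASPER) the extra candidates contain the right documents; on precision-sensitive corpora (PubMedQA, legal) they dilute the ranking.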
Model Selection Guide
Each model has a distinct niche. The gap between the best and worst model on a dataset ranges from about 3pp (QASPER) to over 60pp (HotpotQA).
Based on the closest benchmark dataset for each use case. These are proxies — actual performance depends on your corpus size, document structure, and question distribution.
| Use Case | Best Model | Config | Benchmark Proxy | R@10 |
|---|---|---|---|---|
| Product docs, manuals | Gemini2 | 512+hybrid | emanual | 1.00 |
| Web search, short docs | Qwen3 | 1024+hybrid | MS MARCO | 0.99 |
| Financial / tabular | Gemini2 | 512+hybrid+fuzzy | FinQA | 0.86 |
| Technical docs (long) | Qwen3 | 1024+hybrid+fuzzy | TechQA | 0.90 |
| Expert knowledge | Gemini2 | 1024+hybrid+fuzzy | ExpertQA | 0.95 |
| Biomedical | Qwen3 | 1024 | PubMedQA | 0.93 |
| Multi-hop reasoning | Qwen3 | 256 | HotpotQA | 1.00 |
| Legal citations | Qwen3 | 1024+hybrid | CourtListener | 0.89 |
| News, journalism | Gemini2 | 512+hybrid+fuzzy | NewsQA | 0.85 |
| Academic papers | Gemini2 | 256+hybrid+fuzzy | QASPER | 0.50 |
Hybrid Search: Essential for Some, Catastrophic for Others
Hybrid search (vector + BM25 exact match) is safe for Qwen3 and Gemini2, but BGE-M3 and Gemma are vulnerable to significant degradation on short-document datasets. The risk profile is model-dependent.
Safe defaults by model
- Qwen3 + hybrid: Mostly safe. Worst: emanual (-4.1pp).
- Gemini2 + hybrid: Safe everywhere. Worst delta is -0.5pp.
- BGE-M3 + hybrid: Catastrophic on MS MARCO (-26.1pp).
- Gemma + hybrid: Catastrophic on PubMedQA (-18.9pp).
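How the vector and BM25 result lists are combined is not detailed in this report; Reciprocal Rank Fusion is one common approach, sketched here with hypothetical document IDs. Its appeal is that it uses only ranks, so the two scorers' incompatible score scales never need calibrating:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1/(k + rank) per doc.
    k=60 is the conventional damping constant."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["d3", "d1", "d7"]  # dense (semantic) ranking
bm25_hits = ["d1", "d9", "d3"]    # sparse (exact-match) ranking
print(rrf([vector_hits, bm25_hits]))  # → ['d1', 'd3', 'd9', 'd7']
```

Documents favored by both retrievers (d1, d3) rise to the top; rank fusion like this is also where a weak BM25 signal can drag down a strong vector ranking, which is consistent with the degradation seen for BGE-M3 and Gemma.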
Code & JSON Retrieval
Internal benchmarks for non-prose content: 30 source code files (TypeScript, Python, Go, Rust, SQL) and 30 structured JSON files (API responses, config files, Terraform state, GeoJSON, logs). Documents include deliberate distractors — similar files with overlapping themes that force the model to discriminate (e.g. two Hono middleware guides, two database schemas, two ETL pipelines).
| Dataset | Model | Best Config | R@3 | R@10 | MRR |
|---|---|---|---|---|---|
| Code Eval | Qwen3 | 256+hybrid+fuzzy | 0.320 | 0.326 | 0.338 |
| Code Eval | Gemini2 | 1024 | 0.323 | 0.330 | 0.354 |
| JSON Eval | Qwen3 | 256+hybrid+fuzzy | 0.450 | 0.472 | 0.428 |
| JSON Eval | Gemini2 | 256 | 0.487 | 0.487 | 0.474 |
Answer Quality vs Retrieval Quality
Better retrieval does not guarantee better answers. The correlation between R@10 and Token F1 is weak (r ≈ 0.27), and Token F1 varies less than 2pp across configs within any dataset.
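Part of the explanation is the metric itself: Token F1 treats prediction and reference as bags of tokens, so it saturates once the answer's key terms appear, regardless of which chunks were retrieved. A standard SQuAD-style implementation (minus answer normalization), with a made-up example pair:

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall (multiset overlap)."""
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

# 3 of 4 predicted tokens overlap all 3 reference tokens:
# P = 0.75, R = 1.0, F1 = 6/7 ≈ 0.857
print(token_f1("the liberty truststore path", "liberty truststore path"))
```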
Coverage Matrix
Current evaluation coverage across all result sets.
| Dataset | Qwen3 | Gemini2 | BGE-M3 | Gemma | Status |
|---|---|---|---|---|---|
| beir-scifact New | 9 | 0 | 0 | 0 | Gemini2, BGE-M3, Gemma missing |
| Code Eval | 9 | 9 | 0 | 0 | BGE-M3, Gemma missing |
| CRAG New | 9 | 4 | 6 | 6 | Complete |
| JSON Eval | 8 | 8 | 0 | 0 | BGE-M3, Gemma missing |
| CourtListener | 9 | 0 | 4 | 4 | Gemini2 missing |
| Legal CR | 9 | 1 | 6 | 6 | Complete |
| MIRACL Arabic | 9 | 0 | 0 | 0 | Gemini2, BGE-M3, Gemma missing |
| MIRACL Korean | 3 | 0 | 0 | 0 | Gemini2, BGE-M3, Gemma missing |
| NewsQA | 9 | 4 | 6 | 6 | Complete |
| QASPER New | 9 | 9 | 6 | 6 | Complete |
| CUAD | 9 | 9 | 6 | 6 | Complete |
| emanual | 9 | 9 | 6 | 6 | Complete |
| ExpertQA | 10 | 9 | 6 | 6 | Complete |
| FinQA | 9 | 9 | 6 | 6 | Complete |
| HotpotQA | 10 | 9 | 6 | 6 | Complete |
| MS MARCO | 9 | 10 | 6 | 6 | Complete |
| PubMedQA | 9 | 9 | 6 | 6 | Complete |
| TechQA | 15 | 9 | 6 | 6 | Complete |
Question Explorer
A sample of questions where models disagree most on retrieval (R@10 diff > 0.5). R@10 measures whether the search engine found the right source documents — not whether the final answer is correct. You'll often see models give similar answers despite very different R@10 scores. This can happen because the LLM already knows the answer from its training data, or because it infers it from partial context. The value of high retrieval is grounding: a model with R@10 = 1.0 can cite the actual source documents, while a model with R@10 = 0.0 may have given the right answer from memory — which won't work for private data the LLM has never seen.
Select a dataset above to view questions.
Methodology
Evaluation pipeline: Questions are sent to AI Search's /chat/completions endpoint across multiple indexing pipeline versions. Each question returns retrieved chunks and a generated answer. We compute Recall@K, Precision@K, NDCG@K, Hit Rate@K, and MRR for retrieval quality, plus Token F1 and ROUGE-L for answer quality. CRAG uses rule-based correct/missing/incorrect scoring.
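For reference, the two headline retrieval metrics are computed per question over the retrieved chunk IDs and averaged across questions. A minimal sketch; the document IDs are hypothetical:

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d4", "d2", "d9", "d1"]
relevant = {"d1", "d2"}
print(recall_at_k(retrieved, relevant, 3))  # → 0.5: only d2 is in the top 3
print(mrr(retrieved, relevant))             # → 0.5: first relevant at rank 2
```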
Ground truth: RAGBench and LoCoV1 provide annotated relevant document IDs. QASPER provides paragraph-level evidence annotations. CRAG has no retrieval ground truth (answer-only evaluation). BEIR uses standard passage-level qrels.
Concurrency: 50–100 concurrent API calls per eval run.
AI Search instances: Each dataset × model × chunk_size × hybrid_enabled combination has its own AI Search instance. Fuzzy match is a query-time parameter.
Result sources: Production (v6) and staging (v8) indexing pipelines. Four embedding models tested: Qwen3, Gemini2, BGE-M3, Gemma. Staging results are used where the production pipeline had no data for a dataset.
Generated from 423 eval configurations · March 18, 2026