Furukama's Blog

Ben Koehler - Founder, Speaker, Coder

Fu — Benchmark of Benchmarks

Fu-Benchmark is a meta-benchmark of the most influential evaluation suites used to measure and rank large language models. Use the search box to filter by name, topic, or model.
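
The filter is effectively a case-insensitive substring match over the name, topic, and leader columns. A minimal sketch of that logic in Python, assuming the table were exported as a list of dicts; `BENCHMARKS`, its field names, and the `filter_benchmarks` helper are hypothetical stand-ins, not part of the site:

```python
# Minimal filtering sketch; data and field names are illustrative only.
BENCHMARKS = [
    {"name": "GPQA", "topic": "Graduate-level QA", "leader": "GPT-5.2 Thinking"},
    {"name": "GSM8K", "topic": "Math (grade-school)", "leader": "Kimi K2 Instruct"},
]

def filter_benchmarks(rows: list[dict], query: str) -> list[dict]:
    """Case-insensitive substring match over name, topic, and leader model."""
    q = query.lower()
    return [
        row for row in rows
        if q in row["name"].lower()
        or q in row["topic"].lower()
        or q in row["leader"].lower()
    ]

print(filter_benchmarks(BENCHMARKS, "math"))  # matches the GSM8K row
```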

Benchmarks
| # | Name | Topic | Description | Relevance | GitHub ★ | Leader | Top score |
|---|------|-------|-------------|-----------|----------|--------|-----------|
| 1 | AA-Index | Multi-domain QA | Comprehensive QA index across diverse domains. | ★★★★ | | 🇺🇸 Grok 4 | 73.2% |
| 2 | AA-LCR | Long-context reasoning | A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens. | ★★★★ | | 🇺🇸 GPT-5 High | 76.0% |
| 3 | AA-Omniscience | Knowledge and hallucination | Benchmark measuring factual recall and hallucination across economically relevant domains. | ★★★★ | | 🇺🇸 Gemini 3 Pro Preview | 13.0% |
| 4 | AceBench | Industry QA | Industry-focused benchmark assessing domain QA and reasoning. | ★★★★ | | 🇨🇳 Kimi-K2 | 82.2% |
| 5 | ACP-Bench Bool | Safety evaluation (boolean) | Safety and behavior evaluation with yes/no questions. | ★★★★ | | 🇨🇳 Qwen3-32B | 85.1% |
| 6 | ACP-Bench MCQ | Safety evaluation (MCQ) | Safety and behavior evaluation with multiple-choice questions. | ★★★★ | | 🇺🇸 Llama 3.3 70B | 82.1% |
| 7 | AetherCode | Code generation | Code generation benchmark for diverse coding tasks. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 56.7% |
| 8 | AgentCompany | Agent reasoning | Company-level agent reasoning and decision-making benchmark. | ★★★★ | | 🇺🇸 Claude Sonnet 4.5 | 41.0% |
| 9 | AgentDojo | Agent evaluation | Interactive evaluation suite for autonomous agents across tools and tasks. | ★★★★ | | 🇺🇸 Claude 3.7 Sonnet | 88.7% |
| 10 | Agentic Coding | Agentic coding | Agentic coding benchmark for autonomous software tasks. | ★★★★ | | 🇺🇸 Gemini 3 Flash Preview | 53.8% |
| 11 | AGIEval (English) | Exams | English subset of AGIEval; academic and professional exam questions. | ★★★★ | | 🇨🇳 Qwen3-VL 32B Thinking | 92.2% |
| 12 | AGIEval LSAT-AR | Law exam reasoning | LSAT Analytical Reasoning subset of the AGIEval benchmark. | ★★★★ | | 🇨🇳 Qwen2.5 32B Base | 30.4% |
| 13 | AI2D | Diagram understanding (VQA) | Visual question answering over science and diagram images. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 98.7% |
| 14 | Aider Code Editing | Code editing | Measures interactive code editing quality within the Aider assistant workflow. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 89.8% |
| 15 | Aider-Polyglot | Code assistant eval | Aider polyglot coding leaderboard. | ★★★★★ | | 🇺🇸 Gemini 3 Pro Preview | 92.9% |
| 16 | Aider-Polyglot (Diff) | Code assistant eval | Aider polyglot leaderboard using diff mode (pass@2; see the pass@k sketch below the table). | ★★★★ | | 🇺🇸 Gemini 3 Pro Preview | 91.9% |
| 17 | AIME 2024 | Math (competition) | American Invitational Mathematics Examination 2024 problems. | ★★★★★ | | 🇺🇸 GPT-OSS 120B | 96.6% |
| 18 | AIME 2024-Ko | Math (competition, Korean) | Korean translation of AIME 2024 problems. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Thinking-2507 | 80.3% |
| 19 | AIME 2025 | Math (competition) | American Invitational Mathematics Examination 2025 problems. | ★★★★★ | | 🇺🇸 Claude Sonnet 4.5 | 100.0% |
| 20 | AInstein-SWE-Bench | Agentic coding | AInstein agent coding benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 42.8% |
| 21 | All-Angles Bench | Spatial perception | All-Angles benchmark for spatial recognition and 3D perception. | ★★★★ | | G Step3-VL-10B | 57.2% |
| 22 | AlpacaEval | Instruction following | Automatic evaluation using GPT-4 as a judge. | ★★★★★ | 1849 | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 99.4% |
| 23 | AlpacaEval 2.0 | Instruction following | Updated AlpacaEval with improved prompts and judging. | ★★★★ | | 🇨🇳 DeepSeek R1 | 87.6% |
| 24 | AMC-23 | Math (competition) | American Mathematics Competition 2023 evaluation. | ★★★★ | | G QwQ-32B | 98.5% |
| 25 | AMO-Bench | Math (competition) | Advanced math olympiad-style benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 72.5% |
| 26 | AMO-Bench CH | Math (competition) | Chinese subset of AMO-Bench. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 74.9% |
| 27 | AndroidWorld | Mobile agents | Benchmark for agents operating Android apps via UI automation. | ★★★★ | | G Seed1.8 | 70.7% |
| 28 | API-Bank | Tool use | API-Bank tool-use benchmark. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 92.0% |
| 29 | ARC-AGI-1 | General reasoning | ARC-AGI Phase 1 aggregate accuracy. | ★★★★ | | 🇺🇸 GPT-5.2 Thinking | 86.2% |
| 30 | ARC-AGI-2 | General reasoning | ARC-AGI Phase 2 aggregate accuracy. | ★★★★ | | 🇺🇸 GPT-5.2 Thinking | 52.9% |
| 31 | ARC Average | Science QA (average) | Average accuracy across ARC-Easy and ARC-Challenge. | ★★★★ | | 🇺🇸 SmolLM2 1.7B Pretrained | 60.5% |
| 32 | ARC-Challenge | Science QA | Hard subset of AI2 Reasoning Challenge; grade-school science. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 96.9% |
| 33 | ARC-Challenge (DE) | Science QA (German) | German translation of the ARC Challenge benchmark. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.7% |
| 34 | ARC-Easy | Science QA | Easier subset of AI2 Reasoning Challenge. | ★★★★ | | 🇺🇸 Gemma 3 PT 27B | 89.0% |
| 35 | ARC-Easy (DE) | Science QA (German) | German translation of the ARC Easy science QA benchmark. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.8% |
| 36 | Arena-Hard | Chat ability | Hard prompts on Chatbot Arena. | ★★★★★ | 920 | 🇫🇷 Mistral Medium 3 | 97.1% |
| 37 | Arena-Hard V2 | Chat ability | Updated Arena-Hard v2 prompts on Chatbot Arena. | ★★★★★ | 920 | 🇨🇳 Qwen3 Max Thinking | 90.2% |
| 38 | Arena-Hard V2 Creative Writing | Creative writing | Chatbot Arena Hard V2 creative writing win-rate subset. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 93.6% |
| 39 | Arena-Hard V2 Hard Prompt | Chat ability | Chatbot Arena Hard V2 benchmark using the hard prompt win-rate subset. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 72.6% |
| 40 | ARKitScenes | 3D scene understanding | ARKitScenes benchmark for assessing 3D scene reconstruction and understanding from mixed reality captures. | ★★★★ | | 🇨🇳 Qwen2.5-VL 72B Instruct | 61.5% |
| 41 | ART Agent Red Teaming | Agent robustness | Evaluation suite for adversarial red-teaming of autonomous AI agents. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | ↓ 33.6% |
| 42 | ArtifactsBench | Agentic coding | Artifacts-focused coding and tool-use benchmark evaluating generated code artifacts. | ★★★★ | | 🇺🇸 GPT-5 Thinking | 73.0% |
| 43 | ASR AMI | ASR | Automatic speech recognition benchmark on AMI meeting speech. | ★★★★ | | 🇨🇳 Qwen2.5-Omni-3B | ↓ 15.1% |
| 44 | ASR Earnings22 | ASR | Automatic speech recognition benchmark on Earnings22 financial calls. | ★★★★ | | G Whisper-large-V3 | ↓ 11.3% |
| 45 | ASR GigaSpeech | ASR | Automatic speech recognition benchmark on GigaSpeech. | ★★★★ | | G Whisper-large-V3 | ↓ 10.0% |
| 46 | ASR LibriSpeech Clean | ASR | Automatic speech recognition benchmark on the LibriSpeech clean split. | ★★★★ | | 🇺🇸 LFM2.5-Audio-1.5B | ↓ 1.9% |
| 47 | ASR LibriSpeech Other | ASR | Automatic speech recognition benchmark on the LibriSpeech other split. | ★★★★ | | G Whisper-large-V3 | ↓ 3.9% |
| 48 | ASR SPGISpeech | ASR | Automatic speech recognition benchmark on SPGISpeech. | ★★★★ | | 🇺🇸 LFM2.5-Audio-1.5B | ↓ 2.8% |
| 49 | ASR TED-LIUM | ASR | Automatic speech recognition benchmark on TED-LIUM. | ★★★★ | | 🇺🇸 LFM2.5-Audio-1.5B | ↓ 3.5% |
| 50 | ASR VoxPopuli | ASR | Automatic speech recognition benchmark on VoxPopuli. | ★★★★ | | 🇨🇳 Qwen2.5-Omni-3B | ↓ 5.6% |
| 51 | AstaBench | Agent evaluation | Evaluates science agents across literature understanding, data analysis, planning, tool use, coding, and search. | ★★★★ | | 🇺🇸 Claude Sonnet 4 | 53.0% |
| 52 | AttaQ | Safety / jailbreak | Adversarial jailbreak suite measuring refusal robustness against targeted attack prompts. | ★★★★ | | G Granite 3.3 8B Instruct | 88.5% |
| 53 | AutoCodeBench | Autonomous coding | End-to-end autonomous coding benchmark with unit-test-based execution across diverse repositories and tasks. | ★★★★ | | 🇺🇸 Claude Opus 4 (Thinking) | 52.4% |
| 54 | AutoCodeBench-Lite | Autonomous coding | Lite version of AutoCodeBench focusing on smaller tasks with the same end-to-end, unit-test-based evaluation. | ★★★★ | | 🇺🇸 Claude Opus 4 | 64.5% |
| 55 | AutoLogi | Logical reasoning | AutoLogi benchmark evaluating automated logical reasoning accuracy. | ★★★★ | | 🇺🇸 Claude Sonnet 4 | 89.8% |
| 56 | BALROG | Agent robustness | Benchmark for assessing LLM agents under adversarial and out-of-distribution tool-use scenarios. | ★★★★ | | 🇺🇸 Grok 4 | 43.6% |
| 57 | BBH | Multi-task reasoning | Hard subset of BIG-bench with diverse reasoning tasks. | ★★★★★ | 510 | 🇨🇳 ERNIE 4.5 424B A47B | 94.3% |
| 58 | BBQ | Bias evaluation | Bias Benchmark for Question Answering evaluating social biases across contexts. | ★★★★ | | 🇫🇷 Mixtral 8x 7B | 56.0% |
| 59 | BeaverTails | Safety / harmfulness | Safety benchmark evaluating harmfulness in model responses. | ★★★★ | | G IQuest-Coder-V1-40B-Thinking | 76.7% |
| 60 | BeyondAIME | Math (beyond AIME) | Advanced math problems exceeding AIME difficulty. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 83.0% |
| 61 | BFCL | Tool calling | Berkeley Function-Calling Leaderboard evaluating function- and tool-call generation. | ★★★★ | | 🇨🇳 Qwen3-4B | 95.0% |
| 62 | BFCL Live v2 | Tool calling | Live, user-contributed subset (v2) of the Berkeley Function-Calling Leaderboard. | ★★★★ | | 🇺🇸 o1 Mini | 81.0% |
| 63 | BFCL v2 | Tool calling | Second release of the Berkeley Function-Calling Leaderboard. | ★★★★ | | G MobileLLM P1 | 29.4% |
| 64 | BFCL v3 | Tool calling | Berkeley Function-Calling Leaderboard v3. | ★★★★ | | 🇨🇳 GLM 4.5 | 77.8% |
| 65 | BFCL v3 (Live) | Tool calling | BFCL v3 Live subset for real-time tool calling evaluation. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Thinking-2507 | 82.9% |
| 66 | BFCL v3 (Multi-Turn) | Tool calling | BFCL v3 Multi-Turn subset for multi-turn tool calling evaluation. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Thinking-2507 | 53.6% |
| 67 | BFCL v4 | Tool calling | Berkeley Function-Calling Leaderboard v4. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 77.5% |
| 68 | BIG-Bench | Multi-task reasoning | BIG-bench overall performance (original). | ★★★★★ | 3110 | 🇺🇸 Gemma 2 7B | 55.1% |
| 69 | BIG-Bench Extra Hard | Multi-task reasoning | Extra hard subset of BIG-bench tasks. | ★★★★ | | G Ling 1T | 47.3% |
| 70 | BigCodeBench | Code generation | BigCodeBench evaluates large language models on practical code generation tasks with unit-test verification. | ★★★★ | | G MiMo V2 Flash Base | 70.1% |
| 71 | BigCodeBench Hard | Code generation (hard) | Harder variant of BigCodeBench testing complex programming and library tasks with function-level code generation. | ★★★★ | | 🇺🇸 Claude 3.7 Sonnet (2025-02-19) | 35.8% |
| 72 | BIOBench | Biology reasoning | Biology knowledge and reasoning benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 51.9% |
| 73 | BioLP-Bench | Biomedical NLP | Comprehensive biomedical language processing benchmark evaluating LLMs across tasks like NER, relation extraction, and QA. | ★★★★ | | 🇺🇸 Grok 4 | 47.0% |
| 74 | Bird-SQL | Text-to-SQL | Natural language to SQL generation benchmark. | ★★★★ | | 🇺🇸 Gemini 2.0 Pro | 59.3% |
| 75 | BLINK | Multimodal grounding | Evaluates visual-language grounding and reference resolution to reduce hallucinations. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 87.4% |
| 76 | BoB-HVR | Composite capability index | Hard, Versatile, and Relevant composite score across eight capability buckets. | ★★★★ | | 🇺🇸 Llama 3 70B | 9.0% |
| 77 | BOLD | Bias evaluation | Bias in Open-ended Language Dataset probing demographic biases in text generation. | ★★★★ | | 🇫🇷 Mixtral 8x 7B | ↓ 0.1% |
| 78 | BoolQ | Reading comprehension | Yes/no QA from naturally occurring questions. | ★★★★★ | 171 | G Marin-32B-Mantis | 89.4% |
| 79 | Borda Count (Multilingual) | Aggregate ranking | Borda count aggregate ranking across multilingual benchmarks; lower is better (see the Borda sketch below the table). | ★★★★ | | 🇨🇳 Qwen3-32B | ↓ 2.9% |
| 80 | BrowseComp | Web browsing | Web browsing comprehension and competence benchmark. | ★★★★ | | G MiroThinker-v1.5-235B | 69.8% |
| 81 | BrowseComp (With Content Manager) | Web browsing | BrowseComp benchmark evaluated with content manager assistance. | ★★★★ | | 🇨🇳 LongCat-Flash-Thinking-2601 | 73.1% |
| 82 | BrowseComp_zh | Web browsing (Chinese) | Chinese variant of the BrowseComp web browsing benchmark. | ★★★★ | | G Seed1.8 | 81.3% |
| 83 | BRuMo25 | Math competition | BRuMo 2025 olympiad-style mathematics benchmark. | ★★★★ | | 🇺🇸 QuestA Nemotron 1.5B | 69.5% |
| 84 | BuzzBench | Humor analysis | A humour analysis benchmark. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 71.1% |
| 85 | C-Eval | Chinese exams | Chinese college-level exam benchmark. | ★★★★★ | 1768 | 🇨🇳 Qwen3 Max Thinking | 93.7% |
| 86 | C3-Bench | Reasoning (Chinese) | Comprehensive Chinese reasoning capability benchmark. | ★★★★ | 35 | 🇨🇳 GLM-4.5 Base | 83.1% |
| 87 | CaseLaw v2 | Legal reasoning | U.S. case law benchmark evaluating legal reasoning and judgment over court opinions. | ★★★★ | | 🇺🇸 GPT-4.1 | 78.1% |
| 88 | CC-OCR | OCR (cross-lingual) | Cross-lingual OCR benchmark evaluating character recognition across mixed-language documents. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 81.5% |
| 89 | CFEval | Coding Elo / contest eval | Contest-style coding evaluation with Elo-like scoring. | ★★★★ | | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 2134 |
| 90 | CGBench | Long video QA | Cartoon/CG long video question answering benchmark. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 64.6% |
| 91 | Charades-STA | Video grounding | Charades-STA temporal grounding (mIoU). | ★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 64.0% |
| 92 | ChartMuseum | Chart understanding | Large-scale curated collection of charts for evaluating parsing, grounding, and reasoning. | ★★★★ | | 🇺🇸 GPT-5 mini | 63.3% |
| 93 | ChartQA | Chart understanding (VQA) | Visual question answering over charts and plots. | ★★★★ | | 🇨🇳 Keye-VL-1.5-8B | 94.1% |
| 94 | ChartQA-Pro | Chart understanding (VQA) | Professional-grade chart question answering with diverse chart types and complex reasoning. | ★★★★ | | 🇨🇳 GLM-4.6V | 65.5% |
| 95 | CharXiv (DQ) | Chart description (PDF) | Scientific chart/table descriptive questions from arXiv PDFs. | ★★★★ | | 🇺🇸 o3-high | 95.0% |
| 96 | CharXiv (RQ) | Chart reasoning (PDF) | Scientific chart/table reasoning questions from arXiv PDFs. | ★★★★ | | 🇺🇸 GPT-5.2 Thinking | 82.1% |
| 97 | Chinese SimpleQA | QA (Chinese) | Chinese variant of the SimpleQA benchmark. | ★★★★ | | 🇨🇳 Kimi-K2 Base | 77.6% |
| 98 | CLIcK | Korean language understanding | Benchmark of cultural and linguistic knowledge in Korean. | ★★★★ | | 🇨🇳 DeepSeek V3.2-Thinking | 86.3% |
| 99 | CloningScenarios | Biosecurity refusal | Safety benchmark that red-teams models with cloning-related misuse scenarios to measure compliance and refusal rates. | ★★★★ | | 🇺🇸 Grok 4 | ↓ 45.0% |
| 100 | CLUEWSC | Coreference reasoning (Chinese) | Chinese Winograd Schema-style coreference benchmark from CLUE. | ★★★★ | | 🇨🇳 DeepSeek R1 | 92.8% |
| 101 | CMath | Math (Chinese) | Chinese mathematics benchmark. | ★★★★ | | 🇨🇳 ERNIE 4.5 424B A47B | 96.7% |
| 102 | CMMLU | Chinese multi-domain | Chinese counterpart to MMLU. | ★★★★★ | 781 | 🇨🇳 Qwen2.5 Max | 91.9% |
| 103 | CNMO 2024 | Math (competition) | China National Mathematical Olympiad 2024 evaluation set. | ★★★★ | | G openPangu-R-72B-2512 Slow Thinking | 82.8% |
| 104 | Codeforces | Competitive programming | Competitive programming performance on Codeforces problems (Elo; see the Elo sketch below the table). | ★★★★ | | 🇺🇸 o4 mini | 2719 |
| 105 | COLLIE | Instruction following | Comprehensive instruction-following evaluation suite. | ★★★★ | 55 | 🇺🇸 GPT-5 | 99.0% |
| 106 | Collie-Hard | Instruction following | Hard subset of COLLIE instruction-following tasks. | ★★★★ | | 🇺🇸 GPT-5 High | 99.0% |
| 107 | CommonsenseQA | Commonsense QA | Multiple-choice QA requiring commonsense knowledge. | ★★★★ | | 🇨🇳 Qwen2.5 32B Base | 88.5% |
| 108 | Complex Workflow | Complex workflows | Complex workflow benchmark for economically valuable tasks. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 58.2% |
| 109 | COPA | Causal reasoning | Choice of Plausible Alternatives. | ★★★★ | | G Marin-32B-Bison | 94.0% |
| 110 | CorpusQA | Long-context QA | Question answering over large text corpora. | ★★★★ | | 🇺🇸 GPT-5 | 81.6% |
| 111 | CountBench | Visual counting | Object counting and numeracy benchmark for visual-language models across varied scenes. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 97.3% |
| 112 | CountBenchQA | Visual counting QA | Visual question answering benchmark focused on counting objects across varied scenes. | ★★★★ | | G Moondream-9B-A2B | 93.2% |
| 113 | Countdown | Planning and reasoning | Countdown-style reasoning and planning benchmark. | ★★★★ | | G K2-V2 | 75.6% |
| 114 | Countix | Video counting | Video-based counting benchmark for multiple objects. | ★★★★ | | G Seed1.8 | 31.0% |
| 115 | CRAG | Retrieval QA | Complex Retrieval-Augmented Generation benchmark for grounded question answering. | ★★★★ | | G Jamba Mini 1.6 | 76.2% |
| 116 | Creative Story-Writing Benchmark V3 | Creative writing | Story writing benchmark evaluating creativity, coherence, and style (v3). | ★★★★★ | 291 | 🇨🇳 Kimi-K2-Instruct-0905 | 8.7% |
| 117 | Longform Creative Writing | Creative writing | Longform creative writing evaluation (EQ-Bench). | ★★★★ | 20 | 🇺🇸 Claude Sonnet 4.5 (Thinking) | 79.8% |
| 118 | Creative Writing v3 | Creative writing | An LLM-judged creative writing benchmark. | ★★★★ | 54 | 🇺🇸 o3 | 1661 |
| 119 | CritPt | Reasoning | CritPt (Complex Research using Integrated Thinking – Physics Test) benchmark. | ★★★★ | | 🇺🇸 GPT-5 (High, Code & Web) | 12.6% |
| 120 | CRUX-I | Code reasoning | Code Reasoning, Understanding, and eXecution evaluation; input prediction. | ★★★★ | | 🇺🇸 Gemini 3 Pro Preview | 98.8% |
| 121 | CRUX-O | Code reasoning | Code Reasoning, Understanding, and eXecution evaluation; output prediction. | ★★★★ | | G IQuest-Coder-V1-40B-Loop-Thinking | 99.4% |
| 122 | CruxEval | Code reasoning | Code reasoning challenge set from the CruxEval benchmark (predicting function inputs and outputs). | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Instruct-2507 | 86.8% |
| 123 | CSimpleQA | QA | Chinese SimpleQA benchmark variant (short factual questions). | ★★★★ | | 🇨🇳 Kimi-K2 Base | 77.6% |
| 124 | Customer Support Q&A | Customer support QA | Customer support question answering benchmark. | ★★★★ | | G Seed1.8 | 69.0% |
| 125 | CUTE | English characters | CUTE aggregate capability score. | ★★★★ | | 🇺🇸 Bolmo 7B | 78.6% |
| 126 | CV-Bench | Computer vision QA | Diverse computer-vision tasks for VLMs. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 92.0% |
| 127 | CVTG-2K CLIPScore | Text rendering | CVTG-2K CLIPScore for text rendering in image generation. | ★★★★ | | G Seedream 4.5 | 0.8% |
| 128 | CVTG-2K NED | Text rendering | CVTG-2K normalized edit distance (NED) for text rendering. | ★★★★ | | 🇨🇳 GLM-Image | 1.0% |
| 129 | CVTG-2K Word Accuracy | Text rendering | CVTG-2K word accuracy for text rendering in images. | ★★★★ | | 🇨🇳 GLM-Image | 0.9% |
| 130 | CyBench | Cybersecurity CTF | Framework with 40 professional-level CTF tasks evaluating LLMs' practical cybersecurity capabilities. | ★★★★ | | 🇺🇸 o3 mini | ↓ 22.5% |
| 131 | CyberGym | Cybersecurity tasks | Benchmark for cybersecurity-related coding and reasoning tasks. | ★★★★ | | 🇺🇸 Claude Opus 4.5 Thinking | 50.6% |
| 132 | DA-2K | Spatial reasoning | 2D/3D spatial reasoning benchmark. | ★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 85.3% |
| 133 | Deep Planning | Planning and reasoning | Benchmark evaluating deep planning and multi-step reasoning capabilities. | ★★★★ | | 🇺🇸 GPT-5.2 Thinking | 44.6% |
| 134 | DeepConsult | Agentic writing | Agentic consulting and writing benchmark. | ★★★★ | | 🇺🇸 GPT-5 High | 57.2% |
| 135 | DeepMind Mathematics | Math reasoning | Synthetic math problem sets from DeepMind covering arithmetic, algebra, calculus, and more. | ★★★★ | | G Granite-4.0-H-Small | 59.3% |
| 136 | DeepResearchBench | Agentic research writing | Research-oriented agentic writing and planning benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 49.6% |
| 137 | DeepSearchQA | Deep web search QA | Multi-step web search and question answering benchmark. | ★★★★ | | 🇨🇳 Kimi-K2.5 Thinking | 77.1% |
| 138 | Design2Code | Coding (UI) | Translating UI designs into code. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 93.4% |
| 139 | DesignArena | Generative design | Leaderboard tracking generative design systems across layout, branding, and marketing tasks. | ★★★★ | | 🇺🇸 Claude Sonnet 4.5 (Thinking) | 1410 |
| 140 | DetailBench | Spot small mistakes | Evaluates whether LLMs can notice subtle errors and minor inconsistencies in text. | ★★★★ | | 🇺🇸 Llama 4 Maverick | 8.7% |
| 141 | DiscoX | Agentic writing | DiscoX benchmark for agentic writing and reasoning. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 75.8% |
| 142 | Do-Anything-Now | Safety / jailbreak | Resistance to Do Anything Now (DAN) style jailbreak prompts. | ★★★★ | | G IQuest-Coder-V1-40B-Thinking | 97.7% |
| 143 | Do-Not-Answer | Safety / refusal | Evaluates a model's ability to refuse unsafe or disallowed requests. | ★★★★ | | G K2-THINK | 88.0% |
| 144 | DocMath | Document math | Math reasoning on document-based problems. | ★★★★ | | 🇺🇸 GPT-5 | 67.6% |
| 145 | DocVQA | Document understanding (VQA) | Visual question answering over scanned documents. | ★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 96.9% |
| 146 | Dolphin-Page | Document OCR | Dolphin Page benchmark measuring OCR fidelity and structured extraction on multi-layout documents. | ★★★★ | | 🇺🇸 Dolphin 1.5 | ↓ 7.4% |
| 147 | DPG-Bench | Text rendering | DPG-Bench score for text rendering in image generation. | ★★★★ | | G Seedream 4.5 | 88.6% |
| 148 | DROP | Reading + reasoning | Discrete reasoning over paragraphs (addition, counting, comparisons). | ★★★★★ | | 🇨🇳 Kimi K2 Instruct | 93.5% |
| 149 | DUDE | Multimodal long-context | Long-context multimodal understanding benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 70.1% |
| 150 | DynaMath | Math reasoning (video) | Dynamic/video-based mathematical reasoning evaluating temporal and visual understanding. | ★★★★ | | 🇺🇸 GPT-4o | 63.7% |
| 151 | Economically important tasks | Industry QA (cross-domain) | Evaluation suite of real-world, economically impactful tasks across key industries and workflows. | ★★★★ | | 🇺🇸 GPT-5 | 47.1% |
| 152 | Education | Economics/education | Education field evaluation (economically valuable tasks). | ★★★★ | | G Seed1.8 | 60.8% |
| 153 | EgoSchema | Egocentric video QA | EgoSchema validation accuracy. | ★★★★ | | 🇨🇳 Qwen2-VL 72B Instruct | 77.9% |
| 154 | EgoTempo | Egocentric temporal reasoning | Egocentric video temporal reasoning benchmark. | ★★★★ | | G Seed1.8 | 67.0% |
| 155 | EIFBench | Instruction following | Complex instruction-following benchmark. | ★★★★ | | 🇺🇸 GPT-5 High | 66.7% |
| 156 | EmbSpatialBench | Spatial understanding | Embodied spatial understanding benchmark evaluating navigation and localization. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 84.3% |
| 157 | EMMA | Multimodal reasoning | EMMA benchmark for multimodal reasoning. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 66.5% |
| 158 | Enamel | Composite capability | Composite capability benchmark capturing broad model performance (Enamel score). | ★★★★ | | G Rnj-1 | 49.0% |
| 159 | EnConda-Bench | Code editing | English code editing benchmark for applying conditional modifications. | ★★★★ | | G Youtu-LLM-2B | 21.5% |
| 160 | EnigmaEval | Challenging puzzles | Challenging puzzle benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 17.8% |
| 161 | Enterprise RAG | Retrieval-augmented generation | Enterprise retrieval-augmented generation evaluation covering internal knowledge bases. | ★★★★ | | 🇺🇸 Apriel Nemotron 15B Thinker | 69.2% |
| 162 | EQ-Bench | Emotional intelligence | Emotional intelligence benchmark for LLMs. | ★★★★★ | 352 | G Jan v1 2509 | 85.0% |
| 163 | EQ-Bench 3 | Emotional intelligence (roleplay) | A benchmark measuring emotional intelligence in challenging roleplays, judged by Sonnet 3.7. | ★★★★ | 21 | 🇨🇳 Kimi K2 Instruct | 1555 |
| 164 | ERQA | Spatial reasoning | Spatial recognition and reasoning QA benchmark (ERQA). | ★★★★ | | 🇺🇸 Gemini 3 Flash | 71.0% |
| 165 | EvalPerf | Code efficiency | Measures the efficiency of LLM-generated code, including runtime and memory metrics. | ★★★★ | | 🇺🇸 GPT-4o (2024-08-06) | 100.0% |
| 166 | EvalPlus | Code generation | Aggregated code evaluation suite from EvalPlus. | ★★★★★ | 1577 | 🇺🇸 o1 Mini | 89.0% |
| 167 | EVG | Document OCR | EVG document OCR benchmark evaluating recognition accuracy and layout extraction. | ★★★★ | | 🇺🇸 Dolphin 1.5 | ↓ 3.0% |
| 168 | EXECUTE | Multilingual character tasks | Multilingual character-level evaluation benchmark. | ★★★★ | | 🇺🇸 Bolmo 7B | 71.6% |
| 169 | FACTS Benchmark Suite | Factuality (suite) | Comprehensive factuality benchmark suite covering held-out internal grounding, parametric knowledge, multimodal understanding, and search retrieval benchmarks. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 70.5% |
| 170 | FACTS Grounding | Grounding / factuality | Grounded factuality benchmark evaluating model alignment with source facts. | ★★★★ | | 🇨🇳 Kimi K2 Instruct | 88.5% |
| 171 | FActScore | Hallucination rate | Measures hallucination rate on an open-source prompt suite; lower is better. | ★★★★ | | 🇺🇸 GPT-5 | ↓ 1.0% |
| 172 | FaithJudge (1-Hallu.) | Hallucination detection | FaithJudge hallucination rate with 1-hallucination metric (lower is better). | ★★★★ | | G Moonlight-Instruct | ↓ 56.0% |
| 173 | Meta Score Agent | Composite capability index | | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 100.0% |
| 174 | Meta Score Code | Composite capability index | | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 100.0% |
| 175 | Meta Score Math | Composite capability index | | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Instruct | 100.0% |
| 176 | Meta Score OCR | Composite capability index | | ★★★★ | | 🇺🇸 o3 (Low) | 80.0% |
| 177 | Meta Score Safety | Composite safety index | | ★★★★ | | G Granite 3.3 8B Instruct | 70.0% |
| 178 | Meta Score STEM | Composite capability index | | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Instruct | 100.0% |
| 179 | Meta Score Text | Composite capability index | | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 85.7% |
| 180 | Meta Score Visual | Composite capability index | | ★★★★ | | 🇺🇸 Gemini 3 Pro | 100.0% |
| 181 | Meta Score Writing | Composite capability index | | ★★★★ | | 🇨🇳 Qwen3 235B A22B Instruct 2507 | 60.0% |
| 182 | FigQA | Figure understanding and QA | Figure question answering benchmark evaluating visual reasoning over scientific figures and diagrams. | ★★★★ | | 🇺🇸 Grok 4.1 (Thinking) | 34.0% |
| 183 | FinanceReasoning | Financial reasoning | Financial reasoning benchmark evaluating quantitative and qualitative finance problem solving. | ★★★★ | | G Ling 1T | 87.5% |
| 184 | FinanceAgent | Agentic finance tasks | Interactive financial agent benchmark requiring multi-step tool use. | ★★★★ | | 🇺🇸 Claude Sonnet 4.5 | 55.3% |
| 185 | FinanceBench (FullDoc) | Finance QA | FinanceBench full-document question answering benchmark requiring long-context financial understanding. | ★★★★ | | G Jamba Mini 1.6 | 45.4% |
| 186 | FinSearchComp | Financial retrieval | Financial search and comprehension benchmark measuring retrieval-grounded reasoning over financial content. | ★★★★ | | 🇺🇸 Grok 4 | 68.9% |
| 187 | FinSearchComp-CN | Financial retrieval (Chinese) | Chinese financial search and comprehension benchmark measuring retrieval-grounded reasoning over regional financial content. | ★★★★ | | G doubao-1-5-vision-pro | 54.2% |
| 188 | FinSearchComp (T2&T3) | Finance search | Finance search competition tasks (tracks T2 and T3). | ★★★★ | | 🇺🇸 GPT-5 High | 64.5% |
| 189 | Flame-React-Eval | Frontend coding | Front-end React coding tasks and evaluation. | ★★★★ | | 🇨🇳 GLM-4.6V | 86.3% |
| 190 | Flores | Machine translation (multilingual) | FLORES multilingual translation benchmark. | ★★★★ | | G EuroLLM-22B | 88.9% |
| 191 | Fox-Page-cn | Document OCR (Chinese) | Fox Page benchmark evaluating OCR accuracy and layout understanding on Chinese document pages. | ★★★★ | | 🇺🇸 Dolphin 1.5 | ↓ 0.8% |
| 192 | Fox-Page-en | Document OCR (English) | Fox Page benchmark evaluating OCR accuracy and layout understanding on English document pages. | ★★★★ | | 🇺🇸 Dolphin 1.5 | ↓ 0.7% |
| 193 | FRAMES | Retrieval-grounded QA | Factuality, retrieval, and reasoning measurement set requiring multi-hop, retrieval-grounded answers. | ★★★★ | | 🇨🇳 Tongyi DeepResearch | 90.6% |
| 194 | FreshQA | Recency QA | Question answering benchmark emphasizing up-to-date knowledge and recency. | ★★★★ | | 🇨🇳 Qwen3-4B Thinking 2507 | 66.9% |
| 195 | FrontierScience | Science reasoning | Frontier-level scientific reasoning and QA benchmark. | ★★★★ | | 🇺🇸 GPT-5.2 | 25.2% |
| 196 | FSC-147 | Few-shot counting | Few-shot counting benchmark across 147 categories. | ★★★★ | | G Seed1.8 | 33.8% |
| 197 | FullStackBench | Full-stack development | End-to-end web/app development tasks and evaluation. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 72.3% |
| 198 | GAIA | General AI tasks | Comprehensive benchmark for agentic tasks. | ★★★★ | | G Seed1.8 | 87.4% |
| 199 | GAIA 2 | General agent tasks | Grounded agentic intelligence benchmark version 2 covering multi-tool tasks. | ★★★★ | | 🇺🇸 GPT-5 High | 42.1% |
| 200 | GAOKAO-Bench | Chinese exams | GAOKAO benchmark measuring Chinese college entrance exam performance. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Instruct-2507 | 94.5% |
| 201 | GDPVal | Economically valuable tasks | GDPVal benchmark evaluating model performance on real-world, economically valuable tasks across occupations. | ★★★★ | | 🇺🇸 GPT-5.2 Thinking | 70.9% |
| 202 | General Tool Use | Tool use | General tool-use benchmark covering web and API tasks. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 78.9% |
| 203 | GeoBench1 | Geospatial reasoning | Geospatial visual QA and reasoning (set 1). | ★★★★ | | 🇨🇳 GLM-4.5V | 79.7% |
| 204 | Global-MMLU | Multi-domain knowledge (global) | Full Global-MMLU evaluation across diverse languages and regions. | ★★★★ | | 🇨🇳 DeepSeek V3.2-Exp | 82.0% |
| 205 | Global-MMLU-Lite | Multi-domain knowledge (global) | Lightweight global variant of MMLU covering diverse languages and regions. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 89.2% |
| 206 | Global PIQA | Commonsense reasoning (multilingual) | Physical commonsense reasoning benchmark spanning 100 languages and diverse cultural contexts. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 93.4% |
| 207 | Gorilla Benchmark API Bench | Tool use | Gorilla API Bench tool-use evaluation. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 35.3% |
| 208 | GPQA | Graduate-level QA | Graduate-level question answering evaluating advanced reasoning. | ★★★★★ | 406 | 🇺🇸 GPT-5.2 Thinking | 92.4% |
| 209 | GPQA-diamond | Graduate-level QA | Hard subset of GPQA (diamond level). | ★★★★★ | | 🇺🇸 GPT-5.2 Thinking XHigh | 92.9% |
| 210 | GRE Math maj@16 | Math (standardized tests) | GRE quantitative section evaluated via majority voting over 16 samples. | ★★★★ | | 🇨🇳 Qwen2 7B | 58.5% |
| 211 | Ground-UI-1K | GUI grounding | Accuracy on the Ground-UI-1K grounding benchmark. | ★★★★ | | 🇨🇳 Qwen2.5-VL 72B | 85.4% |
| 212 | GSM-Infinite Hard (128K) | Math reasoning | GSM-Infinite Hard benchmark at 128K context. | ★★★★ | | G MiMo V2 Flash Base | 29.0% |
| 213 | GSM-Infinite Hard (16K) | Math reasoning | GSM-Infinite Hard benchmark at 16K context. | ★★★★ | | 🇨🇳 DeepSeek V3.2-Exp | 50.4% |
| 214 | GSM-Infinite Hard (32K) | Math reasoning | GSM-Infinite Hard benchmark at 32K context. | ★★★★ | | 🇨🇳 DeepSeek V3.2-Exp | 45.2% |
| 215 | GSM-Infinite Hard (64K) | Math reasoning | GSM-Infinite Hard benchmark at 64K context. | ★★★★ | | 🇨🇳 DeepSeek V3.1 | 34.7% |
| 216 | GSM-Plus | Math (grade-school, enhanced) | Enhanced GSM-style grade-school math benchmark variant. | ★★★★ | | 🇨🇳 Qwen3-4B | 82.1% |
| 217 | GSM-Symbolic | Math reasoning | Symbolic reasoning variant of GSM that tests algebraic manipulation and arithmetic with structured problems. | ★★★★ | | G Granite-4.0-H-Small | 87.4% |
| 218 | GSM8K | Math (grade-school) | Grade-school math word problems requiring multi-step reasoning. | ★★★★★ | 1322 | 🇨🇳 Kimi K2 Instruct | 97.3% |
| 219 | GSM8K (DE) | Math (grade-school, German) | German translation of the GSM8K grade-school math word problems. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.6% |
| 220 | GSM8K-Ko | Math (grade-school, Korean) | Korean translation of the GSM8K grade-school math word problems. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B | 88.1% |
| 221 | GSM8K Platinum | Math (grade-school, cleaned) | Cleaned, error-corrected revision of the GSM8K test set. | ★★★★ | | 🇨🇳 Kimi-Linear-Base | 89.6% |
| 222 | GSO Benchmark | Code generation | LiveCodeBench GSO benchmark. | ★★★★ | | 🇺🇸 o3-high | 8.8% |
| 223 | HAE-RAE Bench | Korean language understanding | Korean language understanding benchmark evaluating knowledge and reasoning. | ★★★★ | | G Kanana-1.5-32.5B-Base | 90.7% |
| 224 | HallusionBench | Multimodal hallucination | Benchmark for evaluating hallucination tendencies in multimodal LLMs. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 69.9% |
| 225 | HarmBench | Safety | Harmfulness and safety compliance benchmark across a variety of risky prompts. | ★★★★ | | G IQuest-Coder-V1-40B-Thinking | 94.8% |
| 226 | HarmfulQA | Safety | Harmful question set testing models' ability to avoid unsafe answers. | ★★★★★ | 104 | G K2-THINK | 99.0% |
| 227 | HealthBench | Medical QA | Comprehensive medical knowledge and clinical reasoning benchmark across specialties and tasks. | ★★★★ | | 🇺🇸 GPT-5 | 67.2% |
| 228 | HealthBench-Hard | Medical QA (hard) | Challenging subset of HealthBench focusing on complex, ambiguous clinical cases. | ★★★★ | | 🇺🇸 GPT-5 | 46.2% |
| 229 | HealthBench-Hard Hallucinations | Medical hallucination safety | Measures hallucination and unsafe medical advice under hard clinical scenarios. | ★★★★ | | 🇺🇸 GPT-5 | ↓ 1.6% |
| 230 | HellaSwag | Commonsense reasoning | Adversarial commonsense sentence completion. | ★★★★★ | 220 | 🇨🇳 DeepSeek V3 Base | 96.4% |
| 231 | HellaSwag (DE) | Commonsense reasoning (German) | German translation of the HellaSwag commonsense benchmark. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.7% |
| 232 | HELMET LongQA | Long-context QA | Long-context subset of the HELMET benchmark focusing on grounded question answering. | ★★★★ | | G Jamba Mini 1.6 | 46.9% |
| 233 | HeroBench | Long-horizon planning | Benchmark for long-horizon planning and structured reasoning in virtual worlds. | ★★★★ | | 🇺🇸 Grok 4 | 91.7% |
| 234 | HHEM v2.1 | Hallucination detection | Hughes Hallucination Evaluation Model (Vectara); lower is better. | ★★★★ | | G AntGroup Finix_S1_32b | ↓ 0.6% |
| 235 | HiddenMath | Math reasoning | Mathematical reasoning benchmark referenced in recent model cards. | ★★★★ | | 🇺🇸 Gemini 2.0 Pro | 65.2% |
| 236 | HLE | Multi-domain reasoning | Humanity's Last Exam; challenging LLMs at the frontier of human knowledge. | ★★★★★ | 1085 | 🇺🇸 Gemini 3 Pro | 45.8% |
| 237 | HLE Overconfidence | Overconfidence / safety | Overconfidence rate derived from Humanity's Last Exam evaluations. | ★★★★ | | 🇺🇸 GPT-5.2 | ↓ 43.7% |
| 238 | HLE (Text Only) | Advanced reasoning | Humanity's Last Exam benchmark restricted to text-only inputs. | ★★★★★ | 1085 | 🇺🇸 Gemini 3 Pro | 45.8% |
| 239 | HLE-VL | Multimodal reasoning | Vision-language variant of the Humanity's Last Exam benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 36.0% |
| 240 | HLE (With Tools) | Tool-augmented reasoning | Humanity's Last Exam benchmark evaluated with tool access. | ★★★★★ | 1085 | 🇨🇳 Kimi-K2.5 Thinking | 50.2% |
| 241 | HMMT | Math (competition) | Harvard–MIT Mathematics Tournament problems. | ★★★★ | | 🇺🇸 GPT-5 pro | 100.0% |
| 242 | HMMT 2025 | Math (competition) | Harvard–MIT Mathematics Tournament 2025 problems. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 99.8% |
| 243 | HMMT Feb 2025 | Math (competition) | Harvard–MIT Mathematics Tournament February 2025 problems. | ★★★★ | | 🇺🇸 GPT-5.2 Thinking | 99.4% |
| 244 | HMMT Nov 2025 | Math (competition) | Harvard–MIT Mathematics Tournament November 2025 problems. | ★★★★ | | 🇨🇳 Qwen3 Max Thinking | 94.7% |
| 245 | HotpotQA | Multi-hop QA | Explainable multi-hop QA with supporting facts. | ★★★★ | | 🇨🇳 Qwen 3 0.6B | 64.0% |
| 246 | HRBench 4K | High-resolution perception | High-resolution image perception benchmark built on 4K-resolution images. | ★★★★ | | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 89.5% |
| 247 | HRBench 8K | High-resolution perception | High-resolution image perception benchmark built on 8K-resolution images. | ★★★★ | | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 82.5% |
| 248 | HRM8K | Korean reasoning | 8k-question Korean reasoning and knowledge benchmark. | ★★★★ | | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 92.0% |
| 249 | HumanEval | Code generation | Python synthesis problems evaluated by unit tests. | ★★★★★ | 2916 | 🇺🇸 Gemini 3 Pro Preview | 100.0% |
| 250 | HumanEval+ | Code generation | Extended HumanEval with more tests. | ★★★★★ | 1577 | 🇺🇸 Claude Sonnet 4 | 94.5% |
| 251 | HumanEval-V | Code generation (vision) | HumanEval variant with visual programming prompts. | ★★★★ | | G Step3-VL-10B | 66.0% |
| 252 | HumanEval-X | Code generation (multilingual) | Multilingual code generation benchmark extending HumanEval to multiple programming languages. | ★★★★ | | G TeleChat3-36B-Thinking | 92.7% |
| 253 | Hypersim | 3D scene understanding | Hypersim benchmark for synthetic indoor scene understanding and reconstruction. | ★★★★ | | 🇺🇸 GPT-5 Mini Minimal | 39.3% |
| 254 | IFBench | Instruction following | Instruction-following benchmark measuring compliance and adherence. | ★★★★ | 70 | 🇫🇷 Mistral Small 3.2 24B Instruct | 84.8% |
| 255 | IFEval | Instruction following | Instruction-following capability evaluation for LLMs. | ★★★★★ | 36312 | 🇺🇸 o3 mini-high | 93.9% |
| 256 | IFEval-Code | Instruction following (code) | Instruction-following evaluation for code generation tasks. | ★★★★ | | 🇨🇳 Qwen3-32B | 28.0% |
| 257 | Image QA Average | Image QA (aggregate) | Average of single-image visual question answering benchmarks. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 86.2% |
| 258 | IMO AnswerBench | Math (competition) | Evaluates free-form solutions to International Mathematical Olympiad problems using expert-style grading rubrics. | ★★★★ | | 🇨🇳 LongCat-Flash-Thinking-2601 | 86.8% |
| 259 | INCLUDE | Multilingual knowledge | Multilingual benchmark built from regional exam questions across dozens of languages. | ★★★★ | | 🇺🇸 Gemini-2.5-Flash Thinking | 83.9% |
| 260 | InfoQA | Information-seeking QA | Information retrieval question answering benchmark evaluating factual responses. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 86.9% |
| 261 | Information Extraction | Information extraction | Information extraction benchmark for economically valuable fields. | ★★★★ | | 🇺🇸 Claude Sonnet 4.5 | 46.9% |
| 262 | Information Processing | Information processing | Information processing benchmark for economically valuable tasks. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 56.5% |
| 263 | InfoVQA | Infographic VQA | Visual question answering over infographics requiring reading, counting, and reasoning. | ★★★★ | | 🇨🇳 Kimi-K2.5 Thinking | 92.6% |
| 264 | Intention Recognition | Intent recognition | Intent recognition benchmark for practical applications. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 65.3% |
| 265 | IntPhys 2 | Intuitive physics | Intuitive physics reasoning benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Flash | 63.4% |
| 266 | Inverse IFEval | Instruction following (inverse) | Inverse instruction-following evaluation. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 80.6% |
| 267 | ISL/OSL 8k/16k | Throughput | Relative throughput on 8k/16k input/output sequence length (ISL/OSL) workloads. | ★★★★ | | 🇺🇸 Nemotron-3-Nano-30B-A3B | 3.3% |
| 268 | JudgeMark v2.1 | LLM judging ability | A benchmark measuring LLM judging ability. | ★★★★ | | 🇺🇸 Claude Sonnet 4 | 82.0% |
| 269 | KGC-Safety | Safety (Korean) | Korean safety benchmark evaluating harmfulness and compliance. | ★★★★ | | G K-EXAONE | 96.1% |
| 270 | KK-4 People | Working memory (4 people) | Keep/kill working-memory benchmark with 4 person entities. | ★★★★ | | G K2-V2 | 92.9% |
| 271 | KK-8 People | Working memory (8 people) | Keep/kill working-memory benchmark with 8 person entities. | ★★★★ | | G K2-V2 | 82.8% |
| 272 | KMMLU | Korean knowledge | Korean Massive Multitask Language Understanding benchmark. | ★★★★ | | 🇨🇳 DeepSeek V3.1 | 78.7% |
| 273 | KMMLU-Pro | Korean knowledge | Pro variant of the Korean Massive Multitask Language Understanding benchmark. | ★★★★ | | 🇺🇸 o1 | 77.5% |
| 274 | KMMLU-Redux | Korean knowledge | Redux variant of the KMMLU benchmark. | ★★★★ | | 🇺🇸 o1 | 81.1% |
| 275 | Ko-LongBench | Korean long-context | Long-context understanding benchmark in Korean. | ★★★★ | | 🇨🇳 DeepSeek V3.2-Thinking | 87.9% |
| 276 | KoBALT | Korean knowledge | Korean benchmark for knowledge and language understanding. | ★★★★ | | 🇨🇳 DeepSeek V3.2-Thinking | 62.7% |
| 277 | KoMT-Bench | Korean chat ability | Korean multi-turn chat evaluation benchmark. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Instruct-2507 | 8.5% |
| 278 | KOR-Bench | Reasoning | Comprehensive reasoning benchmark spanning diverse domains and cognitive skills. | ★★★★ | | 🇺🇸 GPT-5 High | 77.4% |
| 279 | KoSimpleQA | Korean QA | Korean simple question answering benchmark. | ★★★★ | | G Kanana-2-30B-A3B-Mid-2601 | 49.7% |
| 280 | KSM | Math (Korean) | Korean STEM and math benchmark. | ★★★★ | | G EXAONE Deep 2.4B | 60.9% |
| 281 | LAMBADA | Language modeling | Word prediction requiring broad context understanding. | ★★★★ | | 🇺🇸 GPT-3 | 86.4% |
| 282 | LatentJailbreak | Safety / jailbreak | Robustness to latent jailbreak adversarial techniques. | ★★★★ | 39 | 🇺🇸 GPT-3.5-turbo | 77.4% |
| 283 | LBV1-QA | Vision-language | Vision-language QA benchmark v1. | ★★★★ | | 🇺🇸 GPT-5 | 73.7% |
| 284 | LBV2 | Vision-language | Vision-language benchmark v2. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 65.7% |
| 285 | LiveBench | General capability | Continually updated capability benchmark across diverse tasks. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 82.4% |
| 286 | LiveCodeBench | Code generation | Live coding and execution-based evaluation benchmark (v6 dataset). | ★★★★★ | | 🇺🇸 Gemini 3 Pro | 92.0% |
| 287 | LiveCodeBench-Ko | Code generation (Korean) | Korean translation of LiveCodeBench. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Thinking-2507 | 66.3% |
| 288 | LiveCodeBench Pro | Competitive programming | Evaluates competitive programming performance across Codeforces, ICPC, and IOI contests; Elo rating, higher is better. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 2439 |
| 289 | LCB Pro 25Q2 (Easy) | Code generation | LiveCodeBench Pro 2025 Q2 easy subset. | ★★★★ | | 🇺🇸 Nemotron-Cascade-14B-Thinking | 68.9% |
| 290 | LCB Pro 25Q2 (Med) | Code generation | LiveCodeBench Pro 2025 Q2 medium subset. | ★★★★ | | 🇺🇸 GPT-OSS 120B (High) | 35.4% |
| 291 | LiveCodeBench v3 | Code generation | LiveCodeBench v3 snapshot measuring pass rates on streaming coding tasks. | ★★★★ | | 🇨🇳 Qwen3 32B | 90.2% |
| 292 | LiveCodeBench v5 (2024.10-2025.02) | Code generation | LiveCodeBench v5 snapshot covering Oct 2024 to Feb 2025. | ★★★★ | | G IQuest-Coder-V1-40B-Loop-Thinking | 86.2% |
| 293 | LiveMCP-101 | Agent real-time eval | Real-time evaluation framework and benchmark that stress-tests agents on complex, real-world tasks. | ★★★★ | | 🇺🇸 GPT-5 | 58.4% |
| 294 | LiveSports-3K | Sports video | Live sports video understanding benchmark (3K). | ★★★★ | | G Seed1.8 | 77.5% |
| 295 | LMArena Text | Crowd eval (text) | Chatbot Arena text leaderboard (Elo ratings). | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 1455 |
| 296 | LMArena Vision | Crowd eval (vision) | Chatbot Arena vision evaluation leaderboard (Elo ratings). | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 1242 |
| 297 | LogicVista | Visual logical reasoning | Visual logic and pattern reasoning tasks requiring compositional and spatial understanding. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 80.8% |
| 298 | LogiQA | Logical reasoning | Reading comprehension with logical reasoning. | ★★★★★ | 138 | G Pythia 70M | 23.5% |
| 299 | LongBench | Long-context eval | Long-context understanding across tasks. | ★★★★★ | 957 | G Jamba Mini 1.6 | 32.0% |
| 300 | LongBench v2 | Long-context eval | Next-generation LongBench v2 long-context evaluation benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 68.2% |
| 301 | LongFact-Concepts | Hallucination rate | Long-form factuality eval focused on conceptual statements; lower is better. | ★★★★ | | 🇺🇸 GPT-5 | ↓ 0.7% |
| 302 | LongFact-Objects | Hallucination rate | Long-form factuality eval focused on object/entity references; lower is better. | ★★★★ | | 🇺🇸 GPT-5 | ↓ 0.8% |
| 303 | LongText-Bench EN | Text rendering | LongText-Bench English subset score for text rendering. | ★★★★ | | G Seedream 4.5 | 1.0% |
| 304 | LongText-Bench ZH | Text rendering | LongText-Bench Chinese subset score for text rendering. | ★★★★ | | G Seedream 4.5 | 1.0% |
| 305 | LongVideoBench | Long video QA | Long video understanding and QA benchmark. | ★★★★ | | 🇨🇳 Kimi-K2.5 Thinking | 79.8% |
| 306 | LPFQA | Finance QA | Long-form financial question answering benchmark. | ★★★★ | | 🇺🇸 GPT-5 High | 54.4% |
| 307 | LVBench | Video understanding | Long video understanding benchmark (LVBench). | ★★★★ | | 🇨🇳 Kimi-K2.5 Thinking | 75.9% |
| 308 | M3GIA (CN) | Chinese multimodal QA | Chinese-language M3GIA benchmark covering grounded multimodal question answering. | ★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 91.2% |
| 309 | Machiavelli | Deception / safety | Benchmark for deceptive or manipulative behavior in social interactions. | ★★★★ | | 🇺🇸 Claude Haiku 4.5 | ↓ 52.2% |
| 310 | MakeMeSay | Adversarial robustness | Adversarial benchmark testing model robustness against manipulation attempts; lower is better. | ★★★★ | | 🇺🇸 Grok 4.1 (Thinking) | |
| 311 | Mantis | Multimodal reasoning | Multimodal reasoning and instruction following benchmark (Mantis). | ★★★★ | | G dots.vlm1 | 86.2% |
| 312 | MARS-Bench | Instruction following | Instruction-following benchmark with complex tasks. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 80.8% |
| 313 | MASK | Safety / red teaming | Model behavior safety assessment via red-teaming scenarios. | ★★★★ | | 🇺🇸 Claude Sonnet 4 (t) | 95.3% |
| 314 | MATH | Math (competition) | Competition-level mathematics across algebra, geometry, number theory, combinatorics. | ★★★★★ | 1185 | 🇺🇸 o3 mini | 97.9% |
| 315 | MATH-Ko | Math (Korean) | Korean translation of the MATH competition benchmark. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B | 58.2% |
| 316 | MATH Level 5 | Math (competition) | Level 5 subset of the MATH benchmark emphasizing the hardest competition-style problems. | ★★★★ | | 🇨🇳 Qwen3-4B-Instruct-2507 | 73.6% |
| 317 | MATH500 | Math reasoning | 500-problem slice of the MATH benchmark for challenging math reasoning. | ★★★★★ | | G Motif-2-12.7B-Reasoning | 99.3% |
| 318 | MATH500 (ES) | Math (multilingual) | Spanish MATH500 benchmark. | ★★★★ | | G EXAONE 4.0 1.2B | 88.8% |
| 319 | MathArena Apex | Math (contest, hard) | Challenging math contest problems from the MathArena Apex benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 23.4% |
| 320 | MathVerse-mini | Math reasoning (multimodal) | Compact MathVerse split focusing on single-image math puzzles and visual reasoning. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 85.0% |
| 321 | MathVerse-Vision | Math reasoning (multimodal) | Multi-image visual mathematical reasoning tasks from the MathVerse ecosystem. | ★★★★ | | 🇺🇸 GPT-5 High | 84.1% |
| 322 | MathVision | Math reasoning (multimodal) | Visual math reasoning benchmark with problems that combine images (charts, diagrams) and text. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 86.1% |
| 323 | MathVista | Multimodal math reasoning | Visual math reasoning across diverse tasks. | ★★★★ | | 🇨🇳 Kimi-K2.5 Thinking | 90.1% |
| 324 | MathVista-Mini | Math reasoning (multimodal) | Lightweight subset of MathVista for quick evaluation of visual mathematical reasoning. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 85.8% |
| 325 | MBPP | Code generation | Short Python problems with hidden tests. | ★★★★★ | 36312 | 🇨🇳 Kimi-K2 Thinking | 97.4% |
| 326 | MBPP-Ko | Code generation (Korean) | Korean translation of the MBPP code generation benchmark. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B | 66.8% |
| 327 | MBPP+ | Code generation | Extended MBPP with more tests and stricter evaluation. | ★★★★ | | 🇨🇳 GLM 4.6 | 94.2% |
| 328 | MCP-Atlas | Agent evaluation | Aggregate MCP agent benchmark covering tool-use and planning tasks. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 62.3% |
| 329 | MCP Universe | Agent evaluation | Benchmarks multi-step tool-use agents across diverse task suites with a unified overall success metric. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 50.7% |
| 330 | MCPMark | Agent tool-use (MCP) | Benchmark for Model Context Protocol (MCP) agent tool-use. | ★★★★★ | 127 | 🇺🇸 GPT-5 High | 50.9% |
| 331 | METR | Long task benchmark | METR evaluates AI agents on long-horizon coding and agentic tasks, measuring autonomous task completion time. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 4.8% |
| 332 | MGSM | Math (multilingual) | Multilingual grade-school math word problems. | ★★★★ | | 🇺🇸 Claude Opus 4.1 (2025-08-05) Thinking | 94.4% |
| 333 | MIABench | Multimodal instruction following | Multimodal instruction-following benchmark evaluating accuracy on complex image-text tasks. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 92.7% |
| 334 | NIAH-Multi 128K | Long-context QA | Needle-in-a-haystack multi-query benchmark at 128K context. | ★★★★ | | 🇨🇳 Kimi-K2 Base | 99.5% |
| 335 | NIAH-Multi 32K | Long-context QA | Needle-in-a-haystack multi-query benchmark at 32K context. | ★★★★ | | 🇨🇳 Kimi-K2 Base | 99.8% |
| 336 | NIAH-Multi 64K | Long-context QA | Needle-in-a-haystack multi-query benchmark at 64K context. | ★★★★ | | 🇨🇳 Kimi-K2 Base | 100.0% |
| 337 | MindCube | Spatial navigation | Spatial navigation benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Flash | 78.3% |
| 338 | Minerva Math | University-level math | Advanced quantitative reasoning set inspired by the Minerva benchmark for STEM problem solving. | ★★★★ | | 🇨🇳 Qwen3 235B A22B Thinking | 98.0% |
| 339 | MiniF2F pass@1 | Math competition | MiniF2F competition benchmark pass@1 accuracy. | ★★★★ | | 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 50.0% |
| 340 | MiniF2F pass@32 | Math competition | MiniF2F competition benchmark pass@32 accuracy (see the pass@k sketch below the table). | ★★★★ | | 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 79.9% |
| 341 | MiniF2F (Test) | Math competition | MiniF2F competition benchmark (test split). | ★★★★ | | 🇨🇳 LongCat-Flash-Thinking | 81.6% |
| 342 | MixEval | Multi-task reasoning | Mixed-subject benchmark covering knowledge and reasoning tasks across domains. | ★★★★ | | 🇺🇸 o1 Mini | 82.9% |
| 343 | MixEval Hard | Multi-task reasoning (hard) | Hard subset of MixEval covering diverse reasoning tasks. | ★★★★ | | 🇨🇳 Qwen3-4B | 31.6% |
| 344 | MLVU | Large video understanding | MLVU: large-scale multi-task benchmark for video understanding. | ★★★★ | | 🇺🇸 GPT-5 | 86.2% |
| 345 | MM-BrowseComp | Multimodal browsing | Multimodal browsing comprehension benchmark. | ★★★★ | | G Seed1.8 | 46.3% |
| 346 | MM-IFEval | Multimodal instruction following | Instruction-following benchmark assessing multimodal obedience to complex prompts. | ★★★★ | | 🇺🇸 LFM2.5-VL-1.6B | 52.3% |
| 347 | MM-MT-Bench | Multimodal instruction following | Multi-turn multimodal instruction following benchmark evaluating dialogue quality and helpfulness. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 8.5% |
| 348 | MMBench v1.1 (CN) | Multimodal understanding (Chinese) | MMBench v1.1 Chinese subset for evaluating multimodal LLMs. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 91.3% |
| 349 | MMBench v1.1 (EN) | Multimodal understanding (English) | MMBench v1.1 English subset for evaluating multimodal LLMs. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 93.3% |
| 350 | MMBench v1.1 (EN dev) | General VQA | English dev split of MMBench v1.1 measuring multimodal question answering. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 90.6% |
| 351 | MME-CC | Multimodal evaluation | MME-CC multimodal evaluation suite. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 56.9% |
| 352 | MME Elo | Multimodal perception | Elo-style scoring for the MME multimodal evaluation benchmark. | ★★★★ | | 🇨🇳 InternVL3-2B | 2186.4 |
| 353 | MME-RealWorld (cn) | Real-world perception (CN) | MME-RealWorld Chinese split. | ★★★★ | | 🇺🇸 GPT-4o | 58.5% |
| 354 | MME-RealWorld (en) | Real-world perception (EN) | MME-RealWorld English split. | ★★★★ | | G MiMo-VL 7B-RL | 59.1% |
| 355 | MMIU | Multi-image understanding | Multi-image understanding benchmark evaluating cross-image reasoning. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 72.1% |
| 356 | MMLB-NIAH (128k) | Multimodal long-context | MMLB-NIAH 128k long-context multimodal benchmark. | ★★★★ | | G Seed1.8 | 72.2% |
| 357 | MMLB-VRAG (128k) | Multimodal long-context | MMLB-VRAG 128k long-context multimodal benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 88.9% |
| 358 | MMLongBench-128K | Long-context multimodal | 128K-context variant of MMLongBench evaluating multimodal long-context understanding. | ★★★★ | | 🇨🇳 GLM-4.6V | 64.1% |
| 359 | MMLongBench-Doc | Long-context multimodal documents | Evaluates long-context document understanding with mixed text, tables, and figures across multiple pages. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 56.2% |
| 360 | MMLU | Multi-domain knowledge | 57 tasks spanning STEM, humanities, and social sciences; broad knowledge and reasoning. | ★★★★★ | 1488 | 🇺🇸 GPT-5 High | 93.8% |
| 361 | MMLU Arabic | Arabic knowledge and reasoning | Arabic-language variant of MMLU evaluating knowledge and reasoning. | ★★★★ | | 🇨🇳 Qwen 2.5 72B | 74.1% |
| 362 | MMLU (cloze) | Multi-domain knowledge (cloze) | Cloze-form MMLU evaluation variant. | ★★★★ | | 🇺🇸 SmolLM2 135M Base | 31.5% |
| 363 | Full Text MMLU | Multi-domain knowledge (long-form) | Full-context MMLU variant evaluating reasoning over long passages. | ★★★★ | | 🇺🇸 Llama 3.3 70B Instruct | 83.0% |
| 364 | MMLU-Pro | Multi-domain knowledge | Harder successor to MMLU with more challenging questions. | ★★★★ | 286 | 🇺🇸 Gemini 3 Pro | 90.1% |
| 365 | MMLU Pro MCF | Multi-domain knowledge (few-shot) | MMLU-Pro multiple-choice formulation (MCF) few-shot evaluation. | ★★★★ | | 🇨🇳 Qwen3-4B-Base | 41.1% |
| 366 | MMLU-ProX | Multi-domain knowledge | Cross-lingual and robust variant of MMLU-Pro. | ★★★★ | | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 81.0% |
| 367 | MMLU-Redux | Multi-domain knowledge | Updated MMLU-style evaluation with revised questions and scoring. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 95.9% |
| 368 | MMLU-STEM | STEM knowledge | STEM subset of MMLU. | ★★★★★ | 1488 | G Falcon-H1-34B-Instruct | 83.6% |
| 369 | MMMB | Multilingual MMBench | Multilingual Multimodal Benchmark (MMMB) average score. | ★★★★ | | 🇺🇸 LFM2.5-VL-1.6B | 77.0% |
| 370 | MMMLU | Multi-domain knowledge (multilingual) | Massively multilingual MMLU-style evaluation across many languages. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 91.8% |
| 371 | MMMLU (ES) | Multilingual knowledge | Spanish MMMLU benchmark. | ★★★★ | | 🇺🇸 SmolLM 3 3B | 64.7% |
| 372 | MMMU | Multimodal understanding | Multi-discipline multimodal understanding benchmark. | ★★★★★ | | 🇺🇸 Gemini 3 Pro | 87.0% |
| 373 | MMMU PRO | Multimodal understanding (hard) | Professional/advanced subset of MMMU for multimodal reasoning. | ★★★★ | | 🇺🇸 Gemini 3 Flash | 81.2% |
| 374 | MMMU-Pro (vision) | Multimodal understanding (vision) | MMMU-Pro vision-only setting. | ★★★★ | | 🇺🇸 Claude 3.7 Sonnet | 45.8% |
| 375 | MMSIBench (circular) | Spatial understanding | MMSIBench circular subset for spatial reasoning. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 25.4% |
| 376 | MMStar | Multimodal reasoning | Broad evaluation of multimodal LLMs across diverse tasks. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 83.1% |
| 377 | MMVP | Multimodal video perception | Benchmark for multimodal video understanding and perception. | ★★★★ | | G Seed1.8 | 91.6% |
| 378 | MMVU | Video understanding | Multimodal video understanding benchmark (MMVU). | ★★★★ | | 🇺🇸 GPT-5.2 Thinking XHigh | 80.8% |
| 379 | MotionBench | Video motion understanding | Video motion and temporal reasoning benchmark. | ★★★★ | | G Seed1.8 | 70.6% |
| 380 | OpenAI-MRCR (128k) | Long-context reasoning | OpenAI multi-round co-reference resolution (MRCR) benchmark with a 128k context window. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 89.7% |
| 381 | OpenAI-MRCR (1M) | Long-context reasoning | OpenAI multi-round co-reference resolution (MRCR) benchmark with a 1M context window. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 58.8% |
| 382 | MRCR v2 | Long-context reasoning | Multi-round co-reference resolution evaluation (v2). | ★★★★ | | 🇺🇸 Gemini 2.5 Flash | 81.7% |
| 383 | MT-Bench | Chat ability | Multi-turn chat evaluation via GPT-4 grading. | ★★★★★ | 39074 | 🇺🇸 Apriel Nemotron 15B Thinker | 85.7% |
| 384 | MTOB (full book) | Long-context translation | Machine Translation from One Book; translating a low-resource language from a single grammar book (full-book setting). | ★★★★ | | 🇺🇸 Llama 4 Maverick | 50.8% |
| 385 | MTOB (half book) | Long-context translation | Machine Translation from One Book (half-book setting). | ★★★★ | | 🇺🇸 Llama 4 Maverick | 54.0% |
| 386 | MUIRBENCH | Multimodal robustness | Evaluates multimodal understanding robustness and reliability. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 86.1% |
| 387 | Multi-IF | Instruction following (multi-task) | Composite instruction-following evaluation across multiple tasks. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B | 81.0% |
| 388 | Multi-IFEval | Instruction following (multi-task) | Multi-task variant of instruction-following evaluation. | ★★★★ | | 🇺🇸 Llama 3.3 70B | 88.7% |
| 389 | Multi-SWE-Bench | Code repair (multi-repo) | Multi-repository SWE-Bench variant. | ★★★★★ | 246 | G MiniMax M2.1 | 49.4% |
| 390 | MultiChallenge | Instruction following | Multi-domain instruction-following benchmark. | ★★★★ | | 🇺🇸 GPT-5 | 69.6% |
| 391 | Multi-Image QA Average | Multi-image QA (aggregate) | Aggregate score over multi-image visual question answering tasks. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 81.9% |
| 392 | Multilingual MMBench | Multilingual vision benchmark | Multilingual MMBench average score across languages. | ★★★★ | | 🇺🇸 LFM2.5-VL-1.6B | 65.9% |
| 393 | Multilingual MMLU | Multi-domain knowledge (multilingual) | Multilingual variant of MMLU across many languages. | ★★★★ | | 🇺🇸 GPT-4.1 | 87.3% |
| 394 | MultiPL-E | Code generation (multilingual) | Multilingual code generation and execution benchmark across many programming languages. | ★★★★★ | 269 | 🇺🇸 Claude Opus 4 | 89.6% |
| 395 | MultiPL-E HumanEval | Code generation (multilingual) | MultiPL-E variant of HumanEval tasks. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 75.2% |
| 396 | MultiPL-E MBPP | Code generation (multilingual) | MultiPL-E variant of MBPP tasks. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 65.7% |
| 397 | MuSR | Reasoning | Multistep Soft Reasoning. | ★★★★ | | 🇺🇸 Hermes 4.3 70B | 70.4% |
| 398 | MVBench | Video QA | Multi-view or multi-video QA benchmark (MVBench). | ★★★★ | | 🇨🇳 GLM-4.5V | 73.0% |
| 399 | Natural2Code | Code generation | Natural language to code benchmark for instruction-following synthesis. | ★★★★ | | 🇺🇸 Gemini 2.0 Flash | 92.9% |
| 400 | NaturalQuestions | Open-domain QA | Google NQ; real user questions with long/short answers. | ★★★★ | | 🇫🇷 Mixtral 8x22B | 40.1% |
| 401 | Nexus (0-shot) | Tool use | Nexus tool-use benchmark, zero-shot setting. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 58.7% |
| 402 | Needle In A Haystack | Long-context retrieval | Needle In A Haystack test for locating hidden facts in long contexts. | ★★★★ | | G MobileLLM P1 Base | 100.0% |
| 403 | Objectron | Object detection | Objectron benchmark for 3D object detection in video captures. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 71.2% |
| 404 | OBQA | Open-book QA | OpenBookQA science question answering benchmark. | ★★★★ | | 🇨🇳 Qwen2.5-Omni-3B | 76.3% |
| 405 | OCRBench V2 | OCR (vision text extraction) | OCRBench v2 evaluating text extraction from images and documents. | ★★★★ | | 🇨🇳 Qwen3-VL 2B Instruct | 858.0 |
| 406 | OCRBench-ELO | OCR (Elo ranking) | OCR benchmark using an Elo rating system to rank model performance on text extraction tasks. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 866 |
| 407 | OCRBenchV2 (CN) | OCR (Chinese) | OCRBenchV2 Chinese subset assessing OCR performance on Chinese-language documents. | ★★★★ | | 🇨🇳 Qwen2.5-VL 72B Instruct | 63.7% |
| 408 | OCRBenchV2 (EN) | OCR (English) | OCRBenchV2 English subset evaluating OCR accuracy on English documents and layouts. | ★★★★ | | 🇨🇳 Qwen3-VL 32B Instruct | 67.4% |
| 409 | OCRReasoning | OCR reasoning | OCR reasoning benchmark combining text extraction with multi-step reasoning over documents. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 70.8% |
| 410 | OctoCodingBench | Code generation | Coding benchmark across multi-language programming tasks. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 36.2% |
| 411 | ODinW-13 | Object detection (in the wild) | Object Detection in the Wild benchmark covering 13 real-world domains. | ★★★★ | | 🇨🇳 Qwen3-VL-4B-Instruct | 48.2% |
| 412 | Odyssey Math | Math reasoning | Odyssey multi-step math benchmark. | ★★★★ | | G Mathstral 7B | 37.2% |
| 413 | OIBench EN | Code generation | English subset of OIBench for code generation. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 58.2% |
| 414 | OJBench | Code generation (online judge) | Programming problems evaluated via online judge-style execution. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 68.5% |
| 415 | olmOCR-Bench | Document OCR | olmOCR benchmark assessing OCR fidelity and structured extraction on complex document pages. | ★★★★ | | G Chandra OCR 0.1.0 | 83.1% |
| 416 | OlympiadBench | Math (olympiad) | Advanced mathematics olympiad-style problem benchmark. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Instruct-2507 | 77.6% |
| 417 | OlympicArena | Math (competition) | Olympiad-style mathematics reasoning benchmark. | ★★★★ | | 🇨🇳 DeepSeek V3 | 76.2% |
| 418 | OMEGA | Math (advanced) | OMEGA olympiad-grade mathematics reasoning benchmark. | ★★★★ | | 🇺🇸 OLMo-3-Think-32B | 50.8% |
| 419 | Omni-MATH | Math reasoning | Omni-MATH benchmark covering diverse math reasoning tasks across difficulty levels. | ★★★★ | | G Ling 1T | 74.5% |
| 420 | Omni-MATH-HARD | Math | Challenging math benchmark (Omni-MATH-HARD). | ★★★★ | | 🇺🇸 GPT-5 High | 73.6% |
| 421 | OmniDocBench | Document understanding | Document understanding benchmark covering multi-page layouts, tables, and charts for robust question answering. | ★★★★ | | G Gundam-M | ↓ 12.3% |
| 422 | OmniDocBench 1.5 | OCR | Document understanding benchmark v1.5 with OCR evaluation; overall edit distance metric, lower is better. | ★★★★ | | 🇺🇸 Dolphin V2 | ↓ 0.1% |
| 423 | OmniDocBench-CN | Document understanding (Chinese) | Chinese subset of OmniDocBench focusing on OCR-grounded document comprehension and reasoning. | ★★★★ | | G PPStructure v3 | ↓ 13.6% |
| 424 | OmniMMI | Multimodal interaction | OmniMMI benchmark for multimodal interaction across video streams. | ★★★★ | | G Seed1.8 | 53.0% |
| 425 | OmniSpatial | Spatial reasoning | Spatial understanding and reasoning benchmark (OmniSpatial). | ★★★★ | | 🇨🇳 GLM-4.6V | 52.0% |
| 426 | OneIG-Bench EN | Text-to-image | OneIG-Bench English subset score for text-to-image generation. | ★★★★ | | G Nano Banana 2.0 | 0.6% |
| 427 | OneIG-Bench ZH | Text-to-image | OneIG-Bench Chinese subset score for text-to-image generation. | ★★★★ | | G Nano Banana 2.0 | 0.6% |
| 428 | Online-Mind2web | Web automation | Online web automation and task execution benchmark. | ★★★★ | | G Seed1.8 | 85.9% |
| 429 | Open Rewrite | Instruction following | Rewrite benchmark assessing open-ended editing and directive-following quality. | ★★★★ | | G MobileLLM P1 | 51.0% |
| 430 | OpenBookQA | Science QA | Open-book multiple-choice science questions with supporting facts. | ★★★★★ | 128 | 🇺🇸 Hermes 4.3 36B Pyche | 96.6% |
| 431 | OpenRewrite-Eval | Rewrite quality | OpenRewrite evaluation; micro-averaged ROUGE-L. | ★★★★ | | 🇨🇳 Qwen2.5 1.5B Instruct | 46.9% |
| 432 | OptMATH | Math optimization reasoning | OptMATH benchmark targeting challenging math optimization and problem-solving tasks. | ★★★★ | | G Ling 1T | 57.7% |
| 433 | Order 15 Items | List ordering | Ordering benchmark requiring models to sequence 15 items correctly. | ★★★★ | | G K2-V2 | 87.6% |
| 434 | Order 30 Items | List ordering (long) | Ordering benchmark requiring models to sequence 30 items correctly. | ★★★★ | | G K2-V2 | 40.3% |
| 435 | OSWorld | GUI agents | Agentic GUI task completion and grounding on desktop environments. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 66.3% |
| 436 | OSWorld-G | GUI agents | OSWorld-G center accuracy (no_refusal). | ★★★★ | | 🇺🇸 Holo1.5-72B | 71.8% |
| 437 | OSWorld2 | GUI agents | Second-generation OSWorld GUI agent benchmark. | ★★★★ | | 🇨🇳 GLM-4.5V | 35.8% |
| 438 | OVBench | Open-vocabulary streaming | Open-vocabulary benchmark for streaming video understanding. | ★★★★ | | G Seed1.8 | 65.1% |
| 439 | OVOBench | Streaming video QA | Streaming video QA benchmark with open-vocabulary queries. | ★★★★ | | G Seed1.8 | 72.6% |
| 440 | PaperBench Code-Dev | Code understanding | PaperBench developer subset measuring code reasoning accuracy. | ★★★★ | | 🇺🇸 Claude Sonnet 4 | 43.3% |
| 441 | PaperBench | Research paper understanding | Benchmark for understanding and reasoning over research papers. | ★★★★ | | 🇺🇸 Claude Opus 4.5 Thinking | 72.9% |
| 442 | PHYBench | Physics reasoning | Physics reasoning and calculation benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 59.0% |
| 443 | PhyX | Physics reasoning (multimodal) | Multimodal physics reasoning benchmark (PhyX). | ★★★★ | | G Step3-VL-10B | 59.5% |
| 444 | PIQA | Physical commonsense | Physical commonsense about everyday tasks and object affordances. | ★★★★ | | 🇨🇳 GLM-4.5 Base | 87.1% |
| 445 | PixmoCount | Visual counting | Counting objects/instances in images (PixmoCount). | ★★★★ | | G Eagle2.5-8B | 90.2% |
| 446 | Point-Bench | Pointing and counting | Benchmark for pointing and counting objects in images. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 85.5% |
| 447 | PolyMATH | Math reasoning | Polyglot mathematics benchmark assessing cross-topic math reasoning. | ★★★★ | | 🇨🇳 Kimi K2 Instruct | 65.1% |
| 448 | POPE | Hallucination detection | Vision-language hallucination benchmark focusing on object existence verification. | ★★★★ | | 🇨🇳 InternVL3-2B | 90.1% |
| 449 | PopQA | Knowledge / QA | Open-domain popular culture question answering benchmark testing long-tail factual recall. | ★★★★ | | 🇺🇸 Llama 3.1 Tulu 3 405B SFT | 55.7% |
| 450 | PostTrainBench | Post-training automation | Measures how well AI agents can post-train base LLMs under fixed compute/time constraints; average score across AIME 2025, BFCL, GPQA Main, GSM8K, and HumanEval. | ★★★★ | | 🇺🇸 GPT-5.1 Codex-Max | 34.9% |
| 451 | ProofBench Advanced | Mathematical proofs (advanced) | Advanced mathematical proof benchmark covering complex theorem proving tasks. | ★★★★ | | 🇺🇸 Gemini Deep Think (IMO Gold) | 65.7% |
| 452 | ProofBench Basic | Mathematical proofs | Entry-level mathematical proof benchmarking set. | ★★★★ | | 🇨🇳 DeepSeekMath-V2-Heavy | 99.0% |
| 453 | ProtocolQA | Protocol understanding and QA | Protocol question answering benchmark evaluating understanding of scientific protocols and procedures. | ★★★★ | | 🇺🇸 Grok 4.1 (Thinking) | 79.0% |
| 454 | QuAC | Conversational QA | Question answering in context. | ★★★★ | | 🇺🇸 Llama 3.1 405B Base | 53.6% |
| 455 | QuALITY | Long-context reading comprehension | Long-document multiple-choice reading comprehension benchmark. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 48.8% |
| 456 | RACE | Reading comprehension | English exams for middle and high school. | ★★★★ | | 🇺🇸 Nemotron-3-Nano-30B-A3B-Base | 88.0% |
| 457 | Random Complex Tasks | Agentic tasks (random) | Randomly constructed complex task environments for agent generalization. | ★★★★ | | 🇨🇳 LongCat-Flash-Thinking-2601 | 35.8% |
| 458 | Realbench | Web browsing | Real-world browsing and QA benchmark. | ★★★★ | | G Seed1.8 | 49.1% |
| 459 | RealWorldQA | Real-world visual QA | Visual question answering with real-world images and scenarios. | ★★★★ | | 🇺🇸 GPT-5 | 82.8% |
| 460 | Ref-L4 (test) | Referring expressions | Ref-L4 referring expression comprehension on the test split. | ★★★★ | | 🇨🇳 GLM-4.6V | 88.9% |
| 461 | RefCOCO | Referring expressions | RefCOCO average accuracy at IoU 0.5 (val). | ★★★★ | | 🇨🇳 InternVL3.5-4B | 92.4% |
| 462 | RefCOCOg | Referring expressions | RefCOCOg average accuracy at IoU 0.5 (val). | ★★★★ | | G Moondream-9B-A2B | 88.6% |
| 463 | RefCOCO+ | Referring expressions | RefCOCO+ accuracy at IoU 0.5 on the val split. | ★★★★ | | G Moondream-9B-A2B | 81.8% |
| 464 | RefSpatialBench | Spatial reasoning | Reference spatial understanding benchmark covering spatial grounding tasks. | ★★★★ | | 🇨🇳 Qwen2.5-VL 72B Instruct | 72.1% |
| 465 | RefusalBench | Safety / refusal | Safety-oriented refusal and policy adherence benchmark. | ★★★★ | | 🇺🇸 Hermes 4.3 36B Pyche | 72.3% |
| 466 | ReMI | Multimodal reasoning | Reasoning over multimodal inputs (ReMI). | ★★★★ | | G Step3-VL-10B | 67.3% |
| 467 | RepoBench | Code understanding | Repository-level code comprehension and reasoning benchmark. | ★★★★ | | 🇺🇸 Claude Sonnet 4.5 | 83.8% |
| 468 | RoboSpatialHome | Embodied spatial understanding | RoboSpatialHome benchmark for embodied spatial reasoning in domestic environments. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 73.9% |
| 469 | Roo Code Evals | Code assistant eval | Community-maintained coding evals and leaderboard by Roo Code. | ★★★★ | | 🇺🇸 GPT-5 mini | 99.0% |
| 470 | RULER-100 @1M | Long-context eval | RULER-100 evaluation at a 1M context window. | ★★★★ | | 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 86.3% |
| 471 | RULER-100 @256k | Long-context eval | RULER-100 evaluation at a 256k context window. | ★★★★ | | 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 92.9% |
| 472 | RULER-100 @512k | Long-context eval | RULER-100 evaluation at a 512k context window. | ★★★★ | | 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 | 91.3% |
| 473 | Ruler 128k | Long-context eval | RULER benchmark at a 128k context window. | ★★★★ | | 🇨🇳 Kimi-Linear-Instruct | 95.4% |
| 474 | Ruler 16k | Long-context eval | RULER benchmark at a 16k context window. | ★★★★ | | 🇨🇳 Qwen2.5 7.6B | 92.2% |
| 475 | Ruler 1M | Long-context eval | RULER benchmark at a 1M context window. | ★★★★ | | 🇨🇳 Kimi-Linear-Instruct | 94.8% |
| 476 | Ruler 32k | Long-context eval | RULER benchmark at a 32k context window. | ★★★★ | | 🇫🇷 Mistral Medium 3 | 96.0% |
| 477 | Ruler 4k | Long-context eval | RULER benchmark at a 4k context window. | ★★★★ | | 🇫🇷 Ministral 8B | 96.0% |
| 478 | Ruler 64k | Long-context eval | RULER benchmark at a 64k context window. | ★★★★ | | 🇺🇸 Nemotron-3-Nano-30B-A3B-Base | 87.5% |
| 479 | Ruler 8k | Long-context eval | RULER benchmark at an 8k context window. | ★★★★ | | 🇺🇸 Llama 3.1 8B Base | 93.8% |
| 480 | RW Search | Agentic search | Real-world search benchmark evaluating retrieval and reasoning. | ★★★★ | | 🇺🇸 GPT-5.2 Thinking XHigh | 82.0% |
481SALAD-Bench o Safety alignmentHierarchical safety benchmark evaluating harmful assistance and refusal consistency.★★★★G Granite-4.0-H-Micro↓ 96.8%
482Scale AI Multi Challenge oChat & instruction followingScale AI Multi Challenge crowd-evaluated instruction following benchmark.★★★★ 🇨🇳 Qwen3-30B-A3B-Thinking-250744.8%
483SciCode (sub) OCodeSciCode subset score (sub).★★★★ 🇺🇸 Gemini 3 Pro56.1%
484SciCode (main) OCodeSciCode main score.★★★★ 🇺🇸 Gemini 2.5 Pro15.4%
485ScienceQA OScience QA (multimodal)Multiple-choice science questions with images, diagrams, and text context.★★★★G FastVLM-7B96.7%
486SciQ o Science QAMultiple choice science questions.★★★★G Pythia 12B92.9%
487SciRes FrontierMath Tier 1-3 oMath (frontier)SciRes FrontierMath benchmark covering tiers 1-3.★★★★ 🇺🇸 GPT-5.2 Thinking40.3%
488SciRes FrontierMath Tier 4 oMath (frontier)SciRes FrontierMath benchmark covering tier 4.★★★★ 🇺🇸 Gemini 3 Pro18.8%
489ScreenQA Complex OGUI QAComplex ScreenQA benchmark accuracy.★★★★ 🇺🇸 Holo1.5-72B87.1%
490ScreenQA Short OGUI QAShort-form ScreenQA benchmark accuracy.★★★★ 🇺🇸 Holo1.5-72B91.9%
491ScreenSpot OScreen UI locatorsCenter accuracy on ScreenSpot.★★★★ 🇨🇳 Qwen3-VL 32B Instruct95.8%
492ScreenSpot-Pro O Screen UI locatorsAverage center accuracy on ScreenSpot-Pro.★★★★ 🇺🇸 GPT-5.2 Extra High86.3%
493ScreenSpot-v2 OScreen UI locatorsCenter accuracy on ScreenSpot-v2.★★★★G UI-Venus 72B95.3%
494SEAL-0 OAgentic web searchEvaluation of multi-step browsing agents on search, evidence gathering, and synthesis tasks.★★★★ 🇨🇳 Kimi-K2.5 Thinking57.4%
495SEED-Bench-2-Plus O Multimodal evaluationSEED-Bench-2-Plus overall accuracy.★★★★ 🇺🇸 Claude 3.7 Sonnet72.9%
496SEED-Bench-Img OMultimodal image understandingSEED-Bench image-only subset (SEED-Bench-Img).★★★★G Bagel 14B78.5%
497SEED-Bench o Multimodal evaluationSEED-Bench comprehensive multimodal understanding benchmark evaluating generative comprehension across multiple dimensions.★★★★ 🇺🇸 LFM2-VL-3B76.5%
498SFE oMultimodal reasoningStructured factual evaluation for multimodal models.★★★★ 🇺🇸 Gemini 3 Pro61.9%
499Showdown OGUI agentsSuccess rate on the Showdown UI interaction benchmark.★★★★ 🇺🇸 Holo1.5-72B76.8%
500SIFO oInstruction followingSingle-turn instruction following benchmark.★★★★ 🇨🇳 Qwen3-VL-30B-A3B Thinking66.9%
501SIFO Multiturn oInstruction followingMulti-turn SIFO benchmark for sustained instruction adherence.★★★★ 🇨🇳 Qwen3-VL-30B-A3B Thinking60.3%
502SimpleQA OQASimple question answering benchmark.★★★★★ 🇨🇳 DeepSeek V3.2-Exp97.1%
503SimpleQA Verified oQAVerified SimpleQA variant for parametric knowledge accuracy.★★★★ 🇺🇸 Gemini 3 Pro72.1%
504SimpleVQA OGeneral VQALightweight visual question answering set with everyday scenes.★★★★ 🇨🇳 Kimi-K2.5 Thinking71.2%
505SimpleVQA-DS oGeneral VQASimpleVQA variant curated by DeepSeek with everyday image question answering tasks.★★★★ 🇨🇳 Seed1.5-VL-Thinking61.3%
506Social Interaction QA (SIQA) oSocial commonsense QASocial Interaction QA benchmark evaluating social commonsense and situational reasoning.★★★★ 🇺🇸 Gemma 3 27B54.9%
507SocialIQA o Social commonsenseSocial interaction commonsense QA.★★★★ 🇺🇸 Gemma 3 PT 27B54.9%
508SpatialViz OMental visualizationMental visualization benchmark.★★★★ 🇺🇸 GPT-5.265.8%
509Spider O Text-to-SQLComplex text-to-SQL benchmark over cross-domain databases.★★★★G LLaDA2.0 Flash82.5%
510Spiral-Bench O Safety / sycophancyAn LLM-judged benchmark measuring sycophancy and delusion reinforcement.★★★★ 🇺🇸 GPT-587.0%
511SQuAD v1.1 o Reading comprehensionExtractive QA from Wikipedia articles.★★★★★566 🇺🇸 Llama 3.1 405B Base89.3%
512SQuAD v2.0 o Reading comprehensionLike v1.1 with unanswerable questions.★★★★★566G LLaDA2.0 Flash90.0%
513StreamingBench oStreaming videoStreaming video understanding benchmark.★★★★G Seed1.884.4%
514SUNRGBD O3D scene understandingSUN RGB-D benchmark for indoor scene understanding from RGB-D imagery.★★★★ 🇺🇸 GPT-5 Mini Minimal45.8%
515SuperGPQA OGraduate-level QAHarder GPQA variant assessing advanced graduate-level reasoning.★★★★ 🇺🇸 Gemini 3 Pro75.3%
516SWE-Bench O Code repairSoftware engineering benchmark of real GitHub issues across many repos, scored by whether generated patches resolve each issue.★★★★★3442 🇺🇸 GPT-5 Codex74.5%
517SWE-Bench Multilingual OCode repair (multilingual)Multilingual variant of SWE-Bench for issue fixing.★★★★ 🇺🇸 Claude Opus 4.5 Thinking77.5%
518SWE-Bench (OpenHands) o Code repairSWE-Bench results using the OpenHands autonomous coding agent.★★★★★3442 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF1638.8%
519SWE-Bench Pro o Software engineeringFull SWE-Bench Pro benchmark for software-engineering agents.★★★★ 🇺🇸 GPT-5.2 Thinking55.6%
520SWE-Bench Pro (Public) oSoftware engineeringPublic subset of the SWE-Bench Pro benchmark for software-engineering agents.★★★★ 🇺🇸 GPT-523.3%
521SWE-Bench Verified O Code repairVerified subset of SWE-Bench for issue fixing.★★★★★ 🇺🇸 Claude Opus 4.580.9%
522SWE-Dev oCode repairSoftware engineering development and bug fixing benchmark.★★★★ 🇺🇸 Claude Sonnet 467.1%
523SWE-Lancer o Code repair (freelance tasks)Software engineering benchmark using real freelance-style issues.★★★★ 🇺🇸 GPT-5.1 Codex-Max79.9%
524SWE-Lancer Diamond oCode repair (freelance)Diamond subset of SWE-Lancer focusing on the hardest freelance-style issues.★★★★ 🇺🇸 GPT-4.532.6%
525SWE-Perf oCode repairSoftware engineering benchmark focused on performance-oriented fixes.★★★★ 🇺🇸 Gemini 3 Pro6.5%
526SWE-Review oCode reviewSoftware engineering review benchmark for assessing code review quality.★★★★ 🇺🇸 Claude Opus 4.516.2%
527SWT-Bench oTest generationBenchmark for generating tests that reproduce reported software issues.★★★★ 🇺🇸 GPT-5.2 Thinking80.7%
528SysBench oSystem promptsSystem prompt understanding and adherence benchmark.★★★★ 🇺🇸 GPT-4.174.1%
529TAU1-Airline OAgent tasks (airline)Tool-augmented agent evaluation in airline scenarios (TAU1).★★★★G openPangu-R-72B-2512 Slow Thinking56.0%
530TAU1-Retail OAgent tasks (retail)Tool-augmented agent evaluation in retail scenarios (TAU1).★★★★G openPangu-R-72B-2512 Slow Thinking73.0%
531TAU2-Airline OAgent tasks (airline)Tool-augmented agent evaluation in airline scenarios (TAU2).★★★★ 🇨🇳 LongCat-Flash-Thinking-260176.5%
532TAU2-Retail OAgent tasks (retail)Tool-augmented agent evaluation in retail scenarios (TAU2).★★★★ 🇺🇸 Claude Opus 4.588.9%
533TAU2-Telecom OAgent tasks (telecom)Tool-augmented agent evaluation in telecom scenarios (TAU2).★★★★ 🇨🇳 LongCat-Flash-Thinking-260199.3%
534TempCompass oTemporal reasoningTemporal reasoning benchmark evaluating understanding of time-related concepts in videos and images.★★★★ 🇺🇸 Gemini 3 Pro88.0%
535Terminal-Bench O Agent terminal tasksCommand-line task completion benchmark for agents.★★★★★637 🇺🇸 Claude Sonnet 4.5 (Thinking)61.3%
536Terminal-Bench 2.0 O Agent terminal tasksSecond-generation Terminal-Bench leaderboard for end-to-end terminal agents.★★★★G IQuest-Coder-V1-40B-Loop-Instruct81.4%
537Terminal-Bench Hard O Agent terminal tasksHard subset of Terminal-Bench command-line agent tasks.★★★★ 🇺🇸 GPT-5.1 High43.0%
538Terminal-Bench Terminus o Agent terminal tasksTerminal-Bench Terminus track assessing end-to-end terminal tool use.★★★★ 🇺🇸 GPT-4.130.3%
539TextQuests OText-based video gamesText-based video game benchmark.★★★★ 🇺🇸 Gemini 3 Pro41.0%
540TextQuests Harm OHarmful propensitiesHarmfulness evaluation on TextQuests scenarios.★★★★ 🇺🇸 Grok 4.1 Fast↓ 9.1%
541TextVQA O Text-based VQAVisual question answering that requires reading text in images.★★★★G PLM-8B86.5%
542TIIF-Bench Long OText-to-imageTIIF-Bench long prompt score for text-to-image generation.★★★★G Seedream 4.588.5%
543TIIF-Bench Short OText-to-imageTIIF-Bench short prompt score for text-to-image generation.★★★★G Nano Banana 2.091.0%
544TLDR9+ oSummarizationLarge-scale Reddit TL;DR summarization benchmark with over nine million post-summary pairs.★★★★G MobileLLM P116.8%
545TOMATO oTemporal understandingTemporal ordering and motion analysis benchmark (TOMATO).★★★★G Seed1.860.8%
546Tool-Decathlon oAgent tool-useComposite tool-use suite measuring multi-domain tool invocation success (Pass@1; see the pass@k sketch after the table).★★★★ 🇺🇸 Claude Sonnet 4.538.6%
547Toolathlon OAgentic software tasksLong-horizon, real-world software tool-use tasks.★★★★ 🇺🇸 Gemini 3 Flash49.4%
548TreeBench o Reasoning with tree structuresEvaluates hierarchical/tree-structured reasoning and planning capabilities in LLMs/VLMs.★★★★ 🇨🇳 GLM-4.6V51.4%
549TriQA oKnowledge QATriadic question answering benchmark evaluating world knowledge and reasoning.★★★★ 🇫🇷 Mixtral 8x22B82.2%
550TriviaQA O Open-domain QAOpen-domain question answering benchmark built from trivia and web evidence.★★★★ 🇺🇸 Gemma 3 PT 27B85.5%
551TriviaQA-Wiki o Open-domain QATriviaQA subset answering using Wikipedia evidence.★★★★ 🇺🇸 Llama 3.1 405B Base91.8%
552TrustLLM oSafety / reliabilityTrustLLM benchmark for trustworthiness and safety behaviors.★★★★ 🇨🇳 Qwen3-Coder-480B-A35B-Instruct88.4%
553TruthfulQA O Truthfulness / hallucinationMeasures whether a model imitates human falsehoods (truthfulness).★★★★G SOLAR-10.7B-Instruct-v1.071.4%
554TruthfulQA (DE) oTruthfulness / hallucination (German)German translation of the TruthfulQA benchmark.★★★★ 🇺🇸 Llama 3.3 70B Instruct0.2%
555TVBench oTV comprehensionBenchmark for TV show video comprehension and QA.★★★★G Seed1.871.5%
556TydiQA o Cross-lingual QATypologically diverse QA across languages.★★★★★313 🇺🇸 Llama 3.1 405B Base34.3%
557U-Artifacts oAgentic coding artifactsBenchmark focusing on generated code artifacts quality.★★★★ 🇺🇸 Gemini 3 Pro57.8%
558V* OMultimodal reasoningV* benchmark accuracy.★★★★ 🇨🇳 Qwen3-VL-8B-Instruct86.4%
559VCRBench oVisual commonsense reasoningVisual commonsense reasoning benchmark.★★★★G Seed1.859.8%
560VCT O Virology capability (protocol troubleshooting)Virology Capabilities Test: a benchmark that measures an LLM's ability to troubleshoot complex virology laboratory protocols.★★★★ 🇺🇸 Gemini 2.5 Pro100.0%
561Vending-Bench 2 OLong-horizon agentic tasksLong-horizon agentic task benchmark evaluating sustained goal completion; scored by final net worth in dollars, not a percentage.★★★★ 🇺🇸 Gemini 3 Pro$5,478.2
562Vibe Android oVibe evaluation (Android)Vibe evaluation on Android tasks.★★★★ 🇺🇸 Claude Opus 4.592.2%
563Vibe Average oVibe evaluationAggregate Vibe evaluation score.★★★★G MiniMax M2.188.6%
564Vibe Backend oVibe evaluation (backend)Vibe evaluation on backend tasks.★★★★ 🇺🇸 Claude Opus 4.598.0%
565Vibe iOS oVibe evaluation (iOS)Vibe evaluation on iOS tasks.★★★★ 🇺🇸 Claude Opus 4.590.0%
566Vibe Simulation oVibe evaluation (simulation)Vibe evaluation on simulation tasks.★★★★ 🇺🇸 Gemini 3 Pro89.2%
567Vibe Web oVibe evaluation (web)Vibe evaluation on web tasks.★★★★G MiniMax M2.191.5%
568VibeEval OAesthetic/visual qualityVLM aesthetic evaluation with GPT scores.★★★★ 🇺🇸 Gemini 2.5 Pro76.4%
569Video-MME O Video understanding (multimodal)Multimodal evaluation of video understanding and reasoning.★★★★ 🇨🇳 Qwen3-VL 32B Instruct76.6%
570VideoHolmes oVideo QAVideo question answering benchmark focused on detective-style clues.★★★★G Seed1.865.5%
571VideoMME oMultimodal video evaluationVideo multimodal evaluation suite (VideoMME).★★★★ 🇺🇸 Gemini 3 Pro88.4%
572VideoMME (w/o sub) OVideo understandingVideo understanding benchmark without subtitles.★★★★ 🇺🇸 Gemini 2.5 Pro85.1%
573VideoMME (w/sub) oVideo understandingVideo understanding benchmark with subtitles.★★★★ 🇨🇳 GLM-4.5V80.7%
574VideoMMMU OMultimodal video understandingVideo-based extension of MMMU evaluating temporal multimodal reasoning and perception across disciplines.★★★★ 🇺🇸 Gemini 3 Pro87.6%
575VideoReasonBench oVideo reasoningVideo reasoning benchmark assessing temporal and causal understanding.★★★★ 🇺🇸 Gemini 2.5 Pro59.7%
576VideoSimpleQA oVideo QASimple question answering over short videos.★★★★ 🇺🇸 Gemini 3 Pro71.9%
577ViSpeak oVideo dialogueVideo-grounded dialogue and description benchmark.★★★★ 🇺🇸 Gemini 3 Pro89.0%
578VisualPuzzle oVisual reasoningVisual puzzle solving benchmark evaluating reasoning and pattern recognition capabilities.★★★★ 🇺🇸 GPT-5 High57.8%
579VisualWebBench O Web UI understandingAverage accuracy on VisualWebBench.★★★★ 🇺🇸 Holo1.5-72B83.8%
580VisuLogic O Visual logical reasoningLogical reasoning and compositionality benchmark for visual-language models.★★★★ 🇨🇳 ERNIE-4.5-VL-28B-A3B-Thinking52.5%
581VitaBench OIndustry QAIndustry-focused benchmark evaluating domain QA performance.★★★★ 🇺🇸 Claude Opus 4.556.3%
582VL-RewardBench oReward modeling (VL)Reward alignment benchmark for VLMs.★★★★ 🇺🇸 Claude 3.7 Sonnet67.4%
583VLMs are Biased o Multimodal biasEvaluates whether VLMs truly 'see' vs. relying on memorized knowledge; measures bias toward non-visual priors.★★★★90 🇺🇸 o4 mini20.2%
584VLMs are Blind O Visual grounding robustnessEvaluates failure modes of VLMs in grounding and perception tasks.★★★★G MiMo-VL 7B-RL79.4%
585VLMsAreBiased oMultimodal biasBenchmark evaluating biases in vision-language models.★★★★G Seed1.862.0%
586VLMsAreBlind oMultimodal robustnessBenchmark probing robustness of vision-language models to visual perturbations.★★★★ 🇺🇸 Gemini 3 Pro97.5%
587VoiceBench AdvBench OVoiceBenchVoiceBench adversarial safety evaluation.★★★★ 🇨🇳 Qwen3-Omni-30B-A3B-Thinking99.4%
588VoiceBench AlpacaEval oVoiceBenchVoiceBench evaluation on AlpacaEval instructions.★★★★ 🇨🇳 Qwen3-Omni-Flash-Thinking96.8%
589VoiceBench BBH oVoiceBenchVoiceBench evaluation on Big-Bench Hard prompts.★★★★ 🇺🇸 Gemini 2.5 Pro92.6%
590VoiceBench CommonEval OVoiceBenchVoiceBench evaluation on CommonEval.★★★★ 🇨🇳 Qwen3-Omni-Flash-Instruct91.0%
591VoiceBench IFEval oVoiceBenchVoiceBench instruction-following evaluation (IFEval).★★★★ 🇺🇸 Gemini 2.5 Pro85.7%
592VoiceBench MMAU v05.15.25 oAudio reasoningVoiceBench audio reasoning results on MMAU v05.15.25.★★★★ 🇨🇳 Qwen3-Omni-Flash-Instruct77.6%
593VoiceBench MMSU OVoiceBenchVoiceBench MMSU benchmark (voice modality).★★★★ 🇨🇳 Qwen3-Omni-Flash-Thinking84.3%
594VoiceBench MMSU (Audio) oAudio reasoningAudio reasoning MMSU results.★★★★ 🇺🇸 Gemini 2.5 Pro77.7%
595VoiceBench OpenBookQA oVoiceBenchVoiceBench results on OpenBookQA prompts.★★★★ 🇨🇳 Qwen3-Omni-Flash-Thinking95.0%
596VoiceBench SD-QA OVoiceBenchVoiceBench Spoken Dialogue QA results.★★★★ 🇺🇸 Gemini 2.5 Pro90.1%
597VoiceBench WildVoice OVoiceBenchVoiceBench evaluation on WildVoice dataset.★★★★ 🇺🇸 Gemini 2.5 Pro93.4%
598VPCT oMultimodal reasoningVisual perception and comprehension test.★★★★ 🇺🇸 Gemini 3 Pro90.0%
599VQAv2 O Visual question answeringStandard Visual Question Answering v2 benchmark on natural images.★★★★ 🇺🇸 Molmo2-8B87.0%
600VSI-Bench OSpatial intelligenceVisual spatial intelligence benchmark covering 3D reasoning and spatial inference tasks.★★★★ 🇨🇳 Qwen3-VL-30B-A3B Instruct63.2%
601WebClick OGUI agentsTask success on the WebClick UI agent benchmark.★★★★ 🇺🇸 Claude Sonnet 493.0%
602WebDev Arena O Web development agentsArena evaluation for autonomous web development agents; scores are Elo ratings (see the Elo sketch after the table).★★★★ 🇺🇸 GPT-5 Elo 1483
603WebQuest-MultiQA oWeb agentsMulti-question web search and interaction tasks.★★★★ 🇨🇳 GLM-4.5V60.6%
604WebQuest-SingleQA oWeb agentsSingle-question web search and interaction tasks.★★★★ 🇨🇳 GLM-4.6V79.5%
605WebSrc OWeb QAWebpage question answering (SQuAD F1).★★★★ 🇺🇸 Holo1.5-72B97.2%
606WebVoyager oWeb agentsWeb navigation and interaction tasks for LLM agents.★★★★ 🇨🇳 GLM-4.6V81.0%
607WebVoyager2 oWeb agentsWeb navigation and interaction tasks for LLM agents (v2).★★★★ 🇨🇳 GLM-4.5V84.4%
608WebWalkerQA oWeb agentsWebWalker tasks evaluating autonomous browsing question answering performance.★★★★ 🇨🇳 Tongyi DeepResearch72.2%
609WeMath oMath reasoningMath reasoning benchmark spanning diverse curricula and difficulty levels.★★★★ 🇨🇳 GLM-4.6V69.8%
610WideSearch oWeb searchWide web search and QA benchmark.★★★★ 🇺🇸 Claude Opus 4.5 Thinking76.2%
611Wild-Jailbreak oSafety / jailbreakAdversarial jailbreak benchmark evaluating refusal robustness.★★★★ 🇺🇸 GPT-OSS 120B (High)98.2%
612WildBench V2 oInstruction followingWildBench V2 human preference benchmark for instruction following and helpfulness.★★★★ 🇫🇷 Mistral Small 3.2 24B Instruct65.3%
613WildGuardTest oSafetyWildGuardTest safety benchmark.★★★★G IQuest-Coder-V1-40B-Thinking86.8%
614Winogender o Gender bias (coreference)Coreference resolution dataset for measuring gender bias.★★★★ 🇺🇸 Llama 3.3 70B Instruct84.3%
615WinoGrande O Coreference reasoningLarge-scale adversarial Winograd Schema-style pronoun resolution.★★★★99 🇺🇸 OLMo-3-Think-32B90.3%
616WinoGrande (DE) oCoreference reasoning (German)German translation of the WinoGrande pronoun resolution benchmark.★★★★ 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT0.8%
617WMDP Bio o Biosecurity knowledgeWeapons of Mass Destruction Proxy benchmark for biosecurity, measuring hazardous biological knowledge without info hazards.★★★★ 🇺🇸 Zephyr 7B↓ 63.7%
618WMDP Chem o Chemical security knowledgeWMDP benchmark for chemical security, evaluating knowledge relevant to chemical weapons development.★★★★ 🇺🇸 Zephyr 7B↓ 45.8%
619WMDP Cyber o Cybersecurity knowledgeWMDP benchmark for cybersecurity, assessing knowledge that could aid in cyber weapons development.★★★★ 🇺🇸 Zephyr 7B↓ 44.0%
620WMT16 En–De o Machine translationWMT16 English–German translation benchmark (news).★★★★ 🇺🇸 Llama 3.3 70B Instruct38.8%
621WMT16 En–De (Instruct) oMachine translationInstruction-tuned evaluation on the WMT16 English–German translation set.★★★★ 🇺🇸 Llama 3.3 70B Instruct37.9%
622WMT24++ O Machine translationExtended WMT 2024 evaluation across multiple language pairs.★★★★ 🇨🇳 Qwen3-235B-A22B-Thinking-250794.7%
623WorldTravel2 (multi-modal) oTravel planning (multimodal)WorldTravel2 benchmark multimodal track.★★★★ 🇺🇸 Gemini 3 Pro47.2%
624WorldTravel2 (text) oTravel planning (text)WorldTravel2 benchmark text-only track.★★★★ 🇺🇸 GPT-5 High56.4%
625WorldVQA oWorld knowledge VQAVisual question answering requiring world knowledge and commonsense reasoning.★★★★ 🇺🇸 Gemini 3 Pro47.4%
626WritingBench OWriting qualityGeneral-purpose writing quality benchmark.★★★★ 🇨🇳 Qwen3-235B-A22B-Thinking-250788.3%
627WSC O Coreference reasoningClassic Winograd Schema Challenge measuring commonsense coreference.★★★★ 🇺🇸 Gemma 3 PT 27B91.9%
628xBench-DeepSearch OAgentic researchEvaluates multi-hop deep research workflows on xBench DeepSearch tasks.★★★★ 🇺🇸 GPT-5 High77.9%
629XpertBench (Edu) oEconomics/educationXpertBench education domain subset.★★★★ 🇺🇸 GPT-5 High56.9%
630XpertBench (Fin) oEconomics/financeXpertBench finance domain subset.★★★★ 🇺🇸 GPT-5 High64.5%
631XpertBench (Humanities) oEconomics/humanitiesXpertBench humanities domain subset.★★★★ 🇺🇸 GPT-5 High68.5%
632XpertBench (Law) oEconomics/legalXpertBench legal domain subset.★★★★ 🇺🇸 Claude Sonnet 4.558.7%
633XpertBench (Research) oEconomics/researchXpertBench research domain subset.★★★★ 🇺🇸 GPT-5 High48.2%
634XSTest oSafetyXSTest safety benchmark.★★★★G IQuest-Coder-V1-40B-Thinking94.3%
635ZebraLogic O Logical reasoningLogical reasoning benchmark assessing complex pattern and rule inference.★★★★ 🇨🇳 Qwen3-VL 32B Thinking96.1%
636ZeroBench OZero-shot generalizationEvaluates zero-shot performance across diverse tasks without task-specific finetuning.★★★★ 🇨🇳 GLM-4.5V23.4%
637ZeroBench (sub) OZero-shot generalizationSubset of ZeroBench targeting harder zero-shot reasoning cases.★★★★ 🇺🇸 Gemini 2.5 Pro33.8%
638ZeroSCROLLS BookSumSort o Long-context summarizationZeroSCROLLS split based on BookSumSort long-form summarization.★★★★ 🇺🇸 GPT-460.5%
639ZeroSCROLLS GovReport o Long-context summarizationZeroSCROLLS split based on the GovReport summarization benchmark.★★★★G CoLT541.0%
640ZeroSCROLLS MuSiQue o Long-context reasoningZeroSCROLLS split derived from MuSiQue multi-hop QA.★★★★ 🇺🇸 Llama 3.3 70B Instruct52.2%
641ZeroSCROLLS NarrativeQA o Long-context QAZeroSCROLLS split based on the NarrativeQA reading comprehension benchmark.★★★★ 🇺🇸 Claude v1.332.6%
642ZeroSCROLLS Qasper o Long-context QAZeroSCROLLS split based on the Qasper paper QA benchmark.★★★★G FLAN-UL256.9%
643ZeroSCROLLS QMSum o Long-context summarizationZeroSCROLLS split based on the QMSum meeting summarization benchmark.★★★★G CoLT522.5%
644ZeroSCROLLS QuALITY o Long-context QAZeroSCROLLS split based on the QuALITY reading comprehension benchmark.★★★★ 🇺🇸 GPT-489.2%
645ZeroSCROLLS SpaceDigest o Long-context summarizationZeroSCROLLS SpaceDigest extractive summarization task.★★★★ 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT77.9%
646ZeroSCROLLS SQuALITY o Long-context summarizationZeroSCROLLS split based on the SQuALITY long-form summarization benchmark.★★★★ 🇺🇸 GPT-422.6%
647ZeroSCROLLS SummScreenFD o Long-context summarizationZeroSCROLLS split based on the SummScreenFD summarization benchmark.★★★★G CoLT520.0%
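A few of the metrics above deserve a closer look. OmniDocBench's "lower is better" column is an edit-distance score; below is a minimal Python sketch of the normalized Levenshtein distance that such OCR scores are typically built on. The function names are illustrative, not the benchmark's official harness.

    # Normalized Levenshtein edit distance: the kind of "lower is better"
    # score reported by OCR benchmarks such as OmniDocBench.
    # Names are illustrative, not the official evaluation harness.
    def edit_distance(pred: str, ref: str) -> int:
        # Classic single-row dynamic-programming Levenshtein distance.
        m, n = len(pred), len(ref)
        dp = list(range(n + 1))
        for i in range(1, m + 1):
            prev, dp[0] = dp[0], i
            for j in range(1, n + 1):
                cur = dp[j]
                dp[j] = min(dp[j] + 1,                           # deletion
                            dp[j - 1] + 1,                       # insertion
                            prev + (pred[i - 1] != ref[j - 1]))  # substitution
                prev = cur
        return dp[n]

    def normalized_edit_distance(pred: str, ref: str) -> float:
        # Normalize by the longer string so the score lies in [0, 1].
        return edit_distance(pred, ref) / max(len(pred), len(ref), 1)

    print(normalized_edit_distance("OmniDocBench", "OmniDocBnch"))  # ~0.083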
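Pass@1-style columns (e.g. Tool-Decathlon) are usually computed with the unbiased pass@k estimator introduced with HumanEval (Chen et al., 2021): draw n samples per task, count the c that pass, and estimate the chance that at least one of k randomly chosen samples succeeds. A minimal sketch:

    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        # n = samples drawn per task, c = samples that passed, k = budget.
        # Unbiased estimate of P(at least one of k random samples passes).
        if n - c < k:
            return 1.0  # fewer failures than the budget: success guaranteed
        return 1.0 - comb(n - c, k) / comb(n, k)

    print(pass_at_k(10, 3, 1))  # 10 samples, 3 passing -> pass@1 = 0.3

Averaging pass_at_k over all tasks gives the headline number.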
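Finally, arena-style rows (WebDev Arena) report Elo ratings rather than percentages. Under the standard Elo model, a rating gap maps to an expected pairwise win rate via a 400-point logistic curve; the sketch below shows the conversion, with the opponent's rating chosen purely for illustration:

    def elo_expected_score(r_a: float, r_b: float) -> float:
        # Standard Elo expectation: probability that A beats B,
        # on the conventional 400-point logistic scale.
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

    # A 1483-rated model is expected to win ~64% of votes
    # against a hypothetical 1383-rated opponent.
    print(elo_expected_score(1483, 1383))  # ~0.640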
