Furukama's Blog

Ben Koehler - Founder, Speaker, Coder

Fu — Benchmark of Benchmarks

Fu-Benchmark is a meta-benchmark of the most influential evaluation suites used to measure and rank large language models. The table below lists each benchmark with its topic, a short description, a relevance rating, GitHub stars (where tracked), and the current leading model with its score; entries can be filtered by name, topic, or model.
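The filtering described above can be sketched in a few lines of Python. The rows and field names here are an illustrative excerpt, not the site's actual data model:

```python
# Minimal sketch of name/topic/leader filtering over benchmark rows.
# BENCHMARKS holds a tiny excerpt of the table purely for illustration.
BENCHMARKS = [
    {"name": "GSM8K", "topic": "Math (grade-school)", "leader": "Kimi K2 Instruct"},
    {"name": "HumanEval", "topic": "Code generation", "leader": "o1-preview"},
    {"name": "MMLU", "topic": "Multi-domain knowledge", "leader": "GPT-5"},
]

def filter_benchmarks(query, rows=BENCHMARKS):
    """Case-insensitive substring match against name, topic, or leading model."""
    q = query.lower()
    return [
        r for r in rows
        if q in r["name"].lower()
        or q in r["topic"].lower()
        or q in r["leader"].lower()
    ]
```

For example, `filter_benchmarks("math")` matches rows whose topic mentions math, while `filter_benchmarks("gpt")` matches rows led by a GPT model.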

Benchmarks
Scores marked ↓ are rates where lower is better; scores without a percent sign are rating-style values (e.g., Elo).

| # | Name | Topic | Description | Relevance | GitHub ★ | Leader | Top % |
|---|------|-------|-------------|-----------|----------|--------|-------|
| 1 | A12D | Diagram reasoning | A12D diagram reasoning benchmark for measuring multimodal understanding of annotated diagrams. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 90.9% |
| 2 | AA-Index | Multi-domain QA | Comprehensive QA index across diverse domains. | ★★★★ | | 🇺🇸 Grok 4 | 73.2% |
| 3 | AA-LCR | Long-context reasoning | A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens. | ★★★★★ | | 🇺🇸 GPT-5 | 75.6% |
| 4 | ACP-Bench Bool | Safety evaluation (boolean) | Safety and behavior evaluation with yes/no questions. | ★★★★★ | | 🇨🇳 Qwen3-32B | 85.1% |
| 5 | ACP-Bench MCQ | Safety evaluation (MCQ) | Safety and behavior evaluation with multiple-choice questions. | ★★★★★ | | 🇺🇸 Llama 3.3 70B | 82.1% |
| 6 | AgentDojo | Agent evaluation | Interactive evaluation suite for autonomous agents across tools and tasks. | ★★★★ | | 🇺🇸 Claude 3.7 Sonnet | 88.7% |
| 7 | AGIEval | Exams | Academic and professional exam benchmark. | ★★★★★ | | 🇺🇸 Llama 3.1 405B Base | 71.6% |
| 8 | AI2D | Diagram understanding (VQA) | Visual question answering over science and diagram images. | ★★★★★ | | 🇺🇸 Molmo-72B | 96.3% |
| 9 | Aider Code Editing | Code editing | Measures interactive code editing quality within the Aider assistant workflow. | ★★★★★ | | 🇺🇸 Gemini 2.5 Pro | 89.8% |
| 10 | Aider-Polyglot | Code assistant eval | Aider polyglot coding leaderboard. | ★★★★★ | | 🇺🇸 GPT-5 | 88.0% |
| 11 | AIME 2024 | Math (competition) | American Invitational Mathematics Examination 2024 problems. | ★★★★★ | | 🇺🇸 GPT-OSS 120B | 96.6% |
| 12 | AIME 2025 | Math (competition) | American Invitational Mathematics Examination 2025 problems. | ★★★★ | | 🇺🇸 GPT-5 pro | 96.7% |
| 13 | All-Angles Bench | Spatial perception | All-Angles benchmark for spatial recognition and 3D perception. | ★★★★ | | 🇨🇳 GLM-4.5V | 56.9% |
| 14 | AlpacaEval | Instruction following | Automatic eval using GPT-4 as a judge. | ★★★★★ | 1849 | 🇨🇳 Qwen3-32B | 64.2% |
| 15 | AlpacaEval 2.0 | Instruction following | Updated AlpacaEval with improved prompts and judging. | ★★★★★ | | 🇨🇳 DeepSeek R1 | 87.6% |
| 16 | AMC 2023 maj@16 | Math (competition) | AMC 2023 problems evaluated via majority voting with 16 samples. | ★★★★ | | G Mathstral 7B | 42.4% |
| 17 | AMC-23 | Math (competition) | American Mathematics Competition 2023 evaluation. | ★★★★ | | G QwQ-32B | 98.5% |
| 18 | AndroidWorld | Mobile agents | Benchmark for agents operating Android apps via UI automation. | ★★★★★ | | 🇨🇳 GLM-4.5V | 57.0% |
| 19 | API-Bank | Tool use | API-Bank tool-use benchmark. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 92.0% |
| 20 | ARC-AGI-1 | General reasoning | ARC-AGI Phase 1 aggregate accuracy. | ★★★★★ | | 🇺🇸 o3 | 75.7% |
| 21 | ARC-AGI-2 | General reasoning | ARC-AGI Phase 2 aggregate accuracy. | ★★★★★ | | 🇺🇸 GPT-5 pro | 18.3% |
| 22 | ARC Average | Science QA (average) | Average accuracy across ARC-Easy and ARC-Challenge. | ★★★★ | | 🇺🇸 SmolLM2 1.7B Pretrained | 60.5% |
| 23 | ARC-Challenge | Science QA | Hard subset of AI2 Reasoning Challenge; grade-school science. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 96.9% |
| 24 | ARC-Challenge (DE) | Science QA (German) | German translation of the ARC Challenge benchmark. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.7% |
| 25 | ARC-Easy | Science QA | Easier subset of AI2 Reasoning Challenge. | ★★★★ | | 🇺🇸 Gemma 3 PT 27B | 89.0% |
| 26 | ARC-Easy (DE) | Science QA (German) | German translation of the ARC Easy science QA benchmark. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.8% |
| 27 | Arena-Hard | Chat ability | Hard prompts on Chatbot Arena. | ★★★★★ | 920 | 🇫🇷 Mistral Medium 3 | 97.1% |
| 28 | Arena-Hard V2 | Chat ability | Updated Arena-Hard v2 prompts on Chatbot Arena. | ★★★★★ | 920 | 🇨🇳 Qwen3 MoE-2507 | 88.2% |
| 29 | ARKitScenes | 3D scene understanding | ARKitScenes benchmark for assessing 3D scene reconstruction and understanding from mixed reality captures. | ★★★★★ | | 🇨🇳 Qwen2.5-VL 72B Instruct | 61.5% |
| 30 | ART Agent Red Teaming | Agent robustness | Evaluation suite for adversarial red-teaming of autonomous AI agents. | ★★★★ | | 🇺🇸 Claude Sonnet 4.5 (Thinking) | ↓ 40.0% |
| 31 | ArtifactsBench | Agentic coding | Artifacts-focused coding and tool-use benchmark evaluating generated code artifacts. | ★★★★★ | | 🇺🇸 GPT-5 | 72.5% |
| 32 | AstaBench | Agent evaluation | Evaluates science agents across literature understanding, data analysis, planning, tool use, coding, and search. | ★★★★★ | | 🇺🇸 Claude Sonnet 4 | 53.0% |
| 33 | AttaQ | Safety / jailbreak | Adversarial jailbreak suite measuring refusal robustness against targeted attack prompts. | ★★★★★ | | G Granite 3.3 8B Instruct | 88.5% |
| 34 | AutoCodeBench | Autonomous coding | End-to-end autonomous coding benchmark with unit-test-based execution across diverse repositories and tasks. | ★★★★ | | 🇺🇸 Claude Opus 4 (Thinking) | 52.4% |
| 35 | AutoCodeBench-Lite | Autonomous coding | Lite version of AutoCodeBench focusing on smaller tasks with the same end-to-end, unit-test-based evaluation. | ★★★★ | | 🇺🇸 Claude Opus 4 | 64.5% |
| 36 | BALROG | Agent robustness | Benchmark for assessing LLM agents under adversarial and out-of-distribution tool-use scenarios. | ★★★★★ | | 🇺🇸 Grok 4 | 43.6% |
| 37 | BBH | Multi-task reasoning | Hard subset of BIG-bench with diverse reasoning tasks. | ★★★★ | 510 | 🇨🇳 ERNIE 4.5 424B A47B | 94.3% |
| 38 | BBQ | Bias evaluation | Bias Benchmark for Question Answering evaluating social biases across contexts. | ★★★★ | | 🇫🇷 Mixtral 8x 7B | 56.0% |
| 39 | BFCL | Code reasoning | Benchmark for functional code correctness and logic. | ★★★★ | | 🇨🇳 Qwen3-4B | 95.0% |
| 40 | BFCL Live v2 | Finance QA | Financial compliance and literacy questions from the BFCL Live v2 benchmark. | ★★★★ | | 🇺🇸 o1 Mini | 81.0% |
| 41 | BFCL v2 | Code reasoning | Second release of the BFCL benchmark focusing on functional code correctness and logic. | ★★★★ | | G MobileLLM P1 | 29.4% |
| 42 | BFCL v3 | Code reasoning | Benchmark for functional code correctness and logic (v3). | ★★★★★ | | 🇨🇳 GLM 4.5 | 77.8% |
| 43 | BIG-Bench | Multi-task reasoning | BIG-bench overall performance (original). | ★★★★★ | 3110 | 🇺🇸 Gemma 2 7B | 55.1% |
| 44 | BIG-Bench Extra Hard | Multi-task reasoning | Extra hard subset of BIG-bench tasks. | ★★★★★ | | G Ling 1T | 47.3% |
| 45 | BigCodeBench | Code generation | BigCodeBench evaluates large language models on practical code generation tasks with unit-test verification. | ★★★★★ | | 🇺🇸 GPT-4o-2024-05-13 | 56.1% |
| 46 | BigCodeBench Hard | Code generation (hard) | Harder variant of BigCodeBench testing complex programming and library tasks with function-level code generation. | ★★★★★ | | 🇺🇸 Claude 3.7 Sonnet (2025-02-19) | 35.8% |
| 47 | Bird-SQL | Text-to-SQL | Natural language to SQL generation benchmark. | ★★★★ | | 🇺🇸 Gemini 2.0 Pro | 59.3% |
| 48 | BLINK | Multimodal grounding | Evaluates visual-language grounding and reference resolution to reduce hallucinations. | ★★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 72.4% |
| 49 | BoB-HVR | Composite capability index | Hard, Versatile, and Relevant composite score across eight capability buckets. | ★★★★★ | | 🇺🇸 Llama 3 70B | 9.0% |
| 50 | BOLD | Bias evaluation | Bias in Open-ended Language Dataset probing demographic biases in text generation. | ★★★★ | | 🇫🇷 Mixtral 8x 7B | ↓ 0.1% |
| 51 | BoolQ | Reading comprehension | Yes/no QA from naturally occurring questions. | ★★★★★ | 171 | 🇺🇸 Gemma 2 27B | 84.8% |
| 52 | BrowseComp | Web browsing | Web browsing comprehension and competence benchmark. | ★★★★ | | 🇺🇸 GPT-5 | 54.9% |
| 53 | BrowseComp_zh | Web browsing (Chinese) | Chinese variant of the BrowseComp web browsing benchmark. | ★★★★ | | 🇺🇸 o3 | 58.1% |
| 54 | BRuMo25 | Math competition | BRuMo 2025 olympiad-style mathematics benchmark. | ★★★★ | | 🇺🇸 QuestA Nemotron 1.5B | 69.5% |
| 55 | BuzzBench | Humor analysis | A humour analysis benchmark. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 71.1% |
| 56 | C-Eval | Chinese exams | Comprehensive Chinese exam benchmark across multiple subjects. | ★★★★ | | 🇨🇳 Kimi-K2 Base | 92.5% |
| 57 | C3-Bench | Reasoning (Chinese) | Comprehensive Chinese reasoning capability benchmark. | ★★★★ | 35 | 🇨🇳 GLM-4.5 Base | 83.1% |
| 58 | CaseLaw v2 | Legal reasoning | U.S. case law benchmark evaluating legal reasoning and judgment over court opinions. | ★★★★★ | | 🇺🇸 GPT-4.1 | 78.1% |
| 59 | CC-OCR | OCR (cross-lingual) | Cross-lingual OCR benchmark evaluating character recognition across mixed-language documents. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 81.5% |
| 60 | CFEval | Coding Elo / contest eval | Contest-style coding evaluation with Elo-like scoring. | ★★★★ | | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 2134 |
| 61 | Charades-STA | Video grounding | Charades-STA temporal grounding (mIoU). | ★★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 64.0% |
| 62 | ChartMuseum | Chart understanding | Large-scale curated collection of charts for evaluating parsing, grounding, and reasoning. | ★★★★ | | 🇺🇸 GPT-5 mini | 63.3% |
| 63 | ChartQA | Chart understanding (VQA) | Visual question answering over charts and plots. | ★★★★★ | | G MiMo-VL 7B-SFT | 92.9% |
| 64 | ChartQA-Pro | Chart understanding (VQA) | Professional-grade chart question answering with diverse chart types and complex reasoning. | ★★★★ | | 🇨🇳 GLM-4.5V | 64.0% |
| 65 | CharXiv (DQ) | Chart description (PDF) | Scientific chart/table descriptive questions from arXiv PDFs. | ★★★★★ | | 🇺🇸 o3-high | 95.0% |
| 66 | CharXiv (RQ) | Chart reasoning (PDF) | Scientific chart/table reasoning questions from arXiv PDFs. | ★★★★★ | | 🇺🇸 GPT-5 | 81.1% |
| 67 | Chinese SimpleQA | QA (Chinese) | Chinese variant of the SimpleQA benchmark. | ★★★★ | | 🇨🇳 Kimi-K2 Base | 77.6% |
| 68 | CLUEWSC | Coreference reasoning (Chinese) | Chinese Winograd Schema-style coreference benchmark from CLUE. | ★★★★ | | 🇨🇳 DeepSeek R1 | 92.8% |
| 69 | CMath | Math (Chinese) | Chinese mathematics benchmark. | ★★★★ | | 🇨🇳 ERNIE 4.5 424B A47B | 96.7% |
| 70 | CMMLU | Chinese multi-domain | Chinese counterpart to MMLU. | ★★★★★ | 781 | 🇨🇳 Qwen2.5 Max | 91.9% |
| 71 | Codeforces | Competitive programming | Competitive programming performance on Codeforces problems (Elo). | ★★★★★ | | 🇺🇸 o4 mini | 2719 |
| 72 | COLLIE | Instruction following | Comprehensive instruction-following evaluation suite. | ★★★★ | 55 | 🇺🇸 GPT-5 | 99.0% |
| 73 | CommonsenseQA | Commonsense QA | Multiple-choice QA requiring commonsense knowledge. | ★★★★ | | 🇺🇸 Llama 3.1 405B Base | 85.8% |
| 74 | CountBench | Visual counting | Object counting and numeracy benchmark for visual-language models across varied scenes. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 93.7% |
| 75 | CountBenchQA | Visual counting QA | Visual question answering benchmark focused on counting objects across varied scenes. | ★★★★ | | G Moondream-9B-A2B | 93.2% |
| 76 | CRAG | Retrieval QA | Complex Retrieval-Augmented Generation benchmark for grounded question answering. | ★★★★ | | G Jamba Mini 1.6 | 76.2% |
| 77 | Creative Story-Writing Benchmark V3 | Creative writing | Story writing benchmark evaluating creativity, coherence, and style (v3). | ★★★★★ | 291 | 🇨🇳 Kimi-K2-Instruct-0905 | 8.7% |
| 78 | Longform Creative Writing | Creative writing | Longform creative writing evaluation (EQ-Bench). | ★★★★★ | 20 | 🇨🇳 DeepSeek V3 0528 | 78.9% |
| 79 | Creative Writing v3 | Creative writing | An LLM-judged creative writing benchmark. | ★★★★ | 54 | 🇺🇸 o3 | 1661 |
| 80 | CRUX-I | Code reasoning | Code Reasoning and Understanding eXam, interactive setting. | ★★★★ | | 🇺🇸 GPT-4 Turbo (2024-04-09) CoT | 75.7% |
| 81 | CRUX-O | Code reasoning | Code Reasoning and Understanding eXam, offline setting. | ★★★★★ | | 🇺🇸 GPT-4 0613 CoT | 88.2% |
| 82 | CruxEval | Code reasoning | Mathematical coding challenge set from the CruxEval benchmark. | ★★★★ | | 🇨🇳 Qwen3-32B | 78.5% |
| 83 | CV-Bench | Computer vision QA | Diverse CV tasks for VLMs. | ★★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 89.7% |
| 84 | DeepMind Mathematics | Math reasoning | Synthetic math problem sets from DeepMind covering arithmetic, algebra, calculus, and more. | ★★★★★ | | G Granite-4.0-H-Small | 59.3% |
| 85 | Design2Code | Coding (UI) | Translating UI designs into code. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 93.4% |
| 86 | DesignArena | Generative design | Leaderboard tracking generative design systems across layout, branding, and marketing tasks. | ★★★★ | | 🇺🇸 Claude Sonnet 4.5 (Thinking) | 1410 |
| 87 | DetailBench | Spot small mistakes | Evaluates whether LLMs can notice subtle errors and minor inconsistencies in text. | ★★★★ | | 🇺🇸 Llama 4 Maverick | 8.7% |
| 88 | DocVQA | Document understanding (VQA) | Visual question answering over scanned documents. | ★★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 96.9% |
| 89 | DROP | Reading + reasoning | Discrete reasoning over paragraphs (addition, counting, comparisons). | ★★★★★ | | 🇨🇳 DeepSeek R1 | 92.2% |
| 90 | DynaMath | Math reasoning (video) | Dynamic/video-based mathematical reasoning evaluating temporal and visual understanding. | ★★★★ | | 🇺🇸 GPT-4o | 63.7% |
| 91 | Economically important tasks | Industry QA (cross-domain) | Evaluation suite of real-world, economically impactful tasks across key industries and workflows. | ★★★★ | | 🇺🇸 GPT-5 | 47.1% |
| 92 | EgoSchema | Egocentric video QA | EgoSchema validation accuracy. | ★★★★ | | 🇨🇳 Qwen2-VL 72B Instruct | 77.9% |
| 93 | EmbSpatialBench | Spatial understanding | Embodied spatial understanding benchmark evaluating navigation and localization. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 84.3% |
| 94 | Enterprise RAG | Retrieval-augmented generation | Enterprise retrieval-augmented generation evaluation covering internal knowledge bases. | ★★★★ | | 🇺🇸 Apriel Nemotron 15B Thinker | 69.2% |
| 95 | EQ-Bench | Reasoning | General reasoning benchmark assessing equation/logic capabilities. | ★★★★★ | 352 | G Jan v1 2509 | 85.0% |
| 96 | EQ-Bench 3 | Emotional intelligence (roleplay) | A benchmark measuring emotional intelligence in challenging roleplays, judged by Sonnet 3.7. | ★★★★ | 21 | 🇨🇳 Kimi K2 Instruct | 1555 |
| 97 | ERQA | Spatial reasoning | Spatial recognition and reasoning QA benchmark (ERQA). | ★★★★★ | | 🇺🇸 GPT-5 | 65.7% |
| 98 | EvalPerf | Code evaluation performance | Measures performance of LLM code evaluation, including runtime, memory, and efficiency metrics. | ★★★★ | | 🇺🇸 GPT-4o (2024-08-06) | 100.0% |
| 99 | EvalPlus | Code generation | Aggregated code evaluation suite from EvalPlus. | ★★★★★ | 1577 | 🇺🇸 o1 Mini | 89.0% |
| 100 | FACTS Grounding | Grounding / factuality | Grounded factuality benchmark evaluating model alignment with source facts. | ★★★★★ | | 🇺🇸 Gemini 2.5 Pro | 87.8% |
| 101 | FActScore | Hallucination rate | Measures hallucination rate on an open-source prompt suite; lower is better. | ★★★★ | | 🇺🇸 GPT-5 | ↓ 1.0% |
| 102 | FAIX Agent | Composite capability index | | ★★★★★ | | 🇺🇸 Holo1.5-72B | 90.7% |
| 103 | FAIX Code | Composite capability index | | ★★★★★ | | 🇺🇸 Claude Sonnet 4.5 | 74.4% |
| 104 | FAIX Math | Composite capability index | | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Instruct | 100.0% |
| 105 | FAIX OCR | Composite capability index | | ★★★★★ | | 🇺🇸 o3 (Low) | 86.0% |
| 106 | FAIX Safety | Composite safety index | | ★★★★★ | | 🇺🇸 Claude Sonnet 4.5 | 74.0% |
| 107 | FAIX STEM | Composite capability index | | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Instruct | 100.0% |
| 108 | FAIX Text | Composite capability index | | ★★★★★ | | 🇨🇳 GLM-4.5V | 70.0% |
| 109 | FAIX Visual | Composite capability index | | ★★★★★ | | 🇺🇸 Holo1.5-72B | 91.6% |
| 110 | FAIX Writing | Composite capability index | | ★★★★★ | | 🇨🇳 Qwen3 235B A22B Instruct 2507 | 69.3% |
| 111 | FinanceReasoning | Financial reasoning | Financial reasoning benchmark evaluating quantitative and qualitative finance problem solving. | ★★★★★ | | G Ling 1T | 87.5% |
| 112 | FinanceAgent | Agentic finance tasks | Interactive financial agent benchmark requiring multi-step tool use. | ★★★★ | | 🇺🇸 Claude Sonnet 4.5 | 55.3% |
| 113 | FinanceBench (FullDoc) | Finance QA | FinanceBench full-document question answering benchmark requiring long-context financial understanding. | ★★★★ | | G Jamba Mini 1.6 | 45.4% |
| 114 | FinSearchComp | Financial retrieval | Financial search and comprehension benchmark measuring retrieval-grounded reasoning over financial content. | ★★★★ | | 🇺🇸 Grok 4 | 68.9% |
| 115 | FinSearchComp-CN | Financial retrieval (Chinese) | Chinese financial search and comprehension benchmark measuring retrieval-grounded reasoning over regional financial content. | ★★★★ | | G doubao-1-5-vision-pro | 54.2% |
| 116 | Flame-React-Eval | Frontend coding | Front-end React coding tasks and evaluation. | ★★★★ | | 🇨🇳 GLM-4.5V | 82.5% |
| 117 | FRAMES | Interactive reasoning | Frame-based interactive reasoning and dialogue benchmark. | ★★★★ | | 🇨🇳 Tongyi DeepResearch | 90.6% |
| 118 | FreshQA | Recency QA | Question answering benchmark emphasizing up-to-date knowledge and recency. | ★★★★ | | 🇨🇳 Qwen3-4B Thinking 2507 | 66.9% |
| 119 | FullStackBench | Full-stack development | End-to-end web/app development tasks and evaluation. | ★★★★★ | | 🇺🇸 GPT-4.1 | 68.5% |
| 120 | GAIA | General AI tasks | Comprehensive benchmark for agentic tasks. | ★★★★ | | 🇨🇳 Tongyi DeepResearch | 70.9% |
| 121 | GAIA 2 | General agent tasks | Grounded agentic intelligence benchmark version 2 covering multi-tool tasks. | ★★★★ | | 🇺🇸 GPT-5 High | 42.1% |
| 122 | GDPVal | General capability | GDPVal benchmark evaluating broad general capabilities of LLMs across diverse tasks. | ★★★★ | | 🇺🇸 Claude Opus 4.1 | 47.6% |
| 123 | GeoBench1 | Geospatial reasoning | Geospatial visual QA and reasoning (set 1). | ★★★★ | | 🇨🇳 GLM-4.5V | 79.7% |
| 124 | Global-MMLU | Multi-domain knowledge (global) | Full Global-MMLU evaluation across diverse languages and regions. | ★★★★★ | | 🇺🇸 Llama 3.3 70B | 77.8% |
| 125 | Global-MMLU-Lite | Multi-domain knowledge (global) | Lightweight global variant of MMLU covering diverse languages and regions. | ★★★★★ | | 🇺🇸 Gemini 2.5 Pro | 89.2% |
| 126 | Gorilla Benchmark API Bench | Tool use | Gorilla API Bench tool-use evaluation. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 35.3% |
| 127 | GPQA | Graduate-level QA | Graduate-level question answering evaluating advanced reasoning. | ★★★★ | 406 | 🇺🇸 Grok 4 | 88.4% |
| 128 | GPQA-diamond | Graduate-level QA | Hard subset of GPQA (diamond level). | ★★★★ | | 🇺🇸 GPT-5 pro | 89.4% |
| 129 | GRE Math maj@16 | Math (standardized tests) | GRE quantitative section evaluated via majority voting over 16 samples. | ★★★★ | | 🇨🇳 Qwen2 7B | 58.5% |
| 130 | Ground-UI-1K | GUI grounding | Accuracy on the Ground-UI-1K grounding benchmark. | ★★★★ | | 🇨🇳 Qwen2.5-VL 72B | 85.4% |
| 131 | GSM-Plus | Math (grade-school, enhanced) | Enhanced GSM-style grade-school math benchmark variant. | ★★★★ | | 🇨🇳 Qwen3-4B | 82.1% |
| 132 | GSM-Symbolic | Math reasoning | Symbolic reasoning variant of GSM that tests algebraic manipulation and arithmetic with structured problems. | ★★★★★ | | G Granite-4.0-H-Small | 87.4% |
| 133 | GSM8K | Math (grade-school) | Grade-school math word problems requiring multi-step reasoning. | ★★★★ | 1322 | 🇨🇳 Kimi K2 Instruct | 97.3% |
| 134 | GSM8K (DE) | Math (grade-school, German) | German translation of the GSM8K grade-school math word problems. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.6% |
| 135 | GSO Benchmark | Code generation | LiveCodeBench GSO benchmark. | ★★★★★ | | 🇺🇸 o3-high | 8.8% |
| 136 | HallusionBench | Multimodal hallucination | Benchmark for evaluating hallucination tendencies in multimodal LLMs. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 66.7% |
| 137 | HarmfulQA | Safety | Harmful question set testing models' ability to avoid unsafe answers. | ★★★★★ | 104 | G K2-THINK | 99.0% |
| 138 | HealthBench | Medical QA | Comprehensive medical knowledge and clinical reasoning benchmark across specialties and tasks. | ★★★★ | | 🇺🇸 GPT-5 | 67.2% |
| 139 | HealthBench-Hard | Medical QA (hard) | Challenging subset of HealthBench focusing on complex, ambiguous clinical cases. | ★★★★ | | 🇺🇸 GPT-5 | 46.2% |
| 140 | HealthBench-Hard Hallucinations | Medical hallucination safety | Measures hallucination and unsafe medical advice under hard clinical scenarios. | ★★★★ | | 🇺🇸 GPT-5 | ↓ 1.6% |
| 141 | HellaSwag | Commonsense reasoning | Adversarial commonsense sentence completion. | ★★★★★ | 220 | 🇨🇳 DeepSeek V3 Base | 96.4% |
| 142 | HellaSwag (DE) | Commonsense reasoning (German) | German translation of the HellaSwag commonsense benchmark. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.7% |
| 143 | HELMET LongQA | Long-context QA | Long-context subset of the HELMET benchmark focusing on grounded question answering. | ★★★★ | | G Jamba Mini 1.6 | 46.9% |
| 144 | HeroBench | Long-horizon planning | Benchmark for long-horizon planning and structured reasoning in virtual worlds. | ★★★★ | | 🇺🇸 Grok 4 | 91.7% |
| 145 | HHEM v2.1 | Hallucination detection | Hughes Hallucination Evaluation Model (Vectara); lower is better. | ★★★★★ | | G AntGroup Finix_S1_32b | ↓ 0.6% |
| 146 | HiddenMath | Math reasoning | Mathematical reasoning benchmark referenced in recent model cards. | ★★★★★ | | 🇺🇸 Gemini 2.0 Pro | 65.2% |
| 147 | HLE | Multi-domain reasoning | Challenging LLMs at the frontier of human knowledge. | ★★★★★ | 1085 | 🇨🇳 Tongyi DeepResearch | 32.9% |
| 148 | HMMT | Math (competition) | Harvard–MIT Mathematics Tournament problems. | ★★★★ | | 🇺🇸 GPT-5 pro | 100.0% |
| 149 | HMMT 2025 | Math (competition) | Harvard–MIT Mathematics Tournament 2025 problems. | ★★★★★ | | 🇺🇸 Grok 4 Fast | 93.3% |
| 150 | HRBench 4K | Hallucination robustness | Hallucination robustness benchmark with 4K-token contexts. | ★★★★★ | | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 89.5% |
| 151 | HRBench 8K | Hallucination robustness | Hallucination robustness benchmark with 8K-token contexts. | ★★★★★ | | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 82.5% |
| 152 | HumanEval | Code generation | Python synthesis problems evaluated by unit tests. | ★★★★ | 2916 | 🇺🇸 o1-preview | 96.3% |
| 153 | HumanEval+ | Code generation | Extended HumanEval with more tests. | ★★★★★ | 1577 | 🇺🇸 Claude Sonnet 4 | 94.5% |
| 154 | Hypersim | 3D scene understanding | Hypersim benchmark for synthetic indoor scene understanding and reconstruction. | ★★★★★ | | 🇺🇸 GPT-5 Mini Minimal | 39.3% |
| 155 | IFBench | Instruction following | Instruction-following benchmark measuring compliance and adherence. | ★★★★★ | 70 | 🇫🇷 Mistral Small 3.2 24B Instruct | 84.8% |
| 156 | IFEval | Instruction following | Instruction-following capability evaluation for LLMs. | ★★★★ | 36312 | 🇺🇸 o3 mini-high | 93.9% |
| 157 | INCLUDE | Inclusiveness / bias | Evaluates inclusive language use and bias mitigation in model outputs. | ★★★★★ | | 🇺🇸 Gemini-2.5-Flash Thinking | 83.9% |
| 158 | InfoQA | Information-seeking QA | Information retrieval question answering benchmark evaluating factual responses. | ★★★★ | | 🇨🇳 Qwen2-VL-72B | 84.5% |
| 159 | InfoVQA | Infographic VQA | Visual question answering over infographics requiring reading, counting, and reasoning. | ★★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 91.2% |
| 160 | JudgeMark v2.1 | LLM judging ability | A benchmark measuring LLM judging ability. | ★★★★ | | 🇺🇸 Claude Sonnet 4 | 82.0% |
| 161 | KMMLU-Pro | Multilingual knowledge | Korean Massive Multitask Language Understanding, Pro variant. | ★★★★ | | 🇺🇸 o1 | 77.5% |
| 162 | KMMLU-Redux | Multilingual knowledge | Redux variant of the KMMLU benchmark. | ★★★★ | | 🇺🇸 o1 | 81.1% |
| 163 | KOR-Bench | Reasoning | Comprehensive reasoning benchmark spanning diverse domains and cognitive skills. | ★★★★★ | | G Ling 1T | 76.0% |
| 164 | KSM | Multilingual math | Korean STEM and math benchmark. | ★★★★★ | | G EXAONE Deep 2.4B | 60.9% |
| 165 | LAMBADA | Language modeling | Word prediction requiring broad context understanding. | ★★★★★ | | 🇺🇸 GPT-3 | 86.4% |
| 166 | LatentJailbreak | Safety / jailbreak | Robustness to latent jailbreak adversarial techniques. | ★★★★★ | 39 | 🇺🇸 GPT-3.5-turbo | 77.4% |
| 167 | LiveBench | General capability | Continually updated capability benchmark across diverse tasks. | ★★★★★ | | 🇺🇸 Gemini 2.5 Pro | 82.4% |
| 168 | LiveBench 20241125 | General capability | LiveBench snapshot (2024-11-25) tracking mixed-task evals. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 82.4% |
| 169 | LiveCodeBench | Code generation | Live coding and execution-based evaluation benchmark (v6 dataset). | ★★★★ | | 🇺🇸 GPT-5 mini | 86.6% |
| 170 | LiveCodeBench v5 (2024.10–2025.02) | Code generation | LiveCodeBench v5 snapshot covering Oct 2024 to Feb 2025. | ★★★★★ | | 🇨🇳 Qwen3-235B-A22B | 70.7% |
| 171 | LiveMCP-101 | Agent real-time eval | Real-time evaluation framework and benchmark that stress-tests agents on complex, real-world tasks. | ★★★★ | | 🇺🇸 GPT-5 | 58.4% |
| 172 | LMArena Text | Crowd eval (text) | Chatbot Arena text evaluation (average win rate). | ★★★★★ | | 🇺🇸 Gemini 2.5 Pro | 1455 |
| 173 | LMArena Vision | Crowd eval (vision) | Chatbot Arena vision evaluation leaderboard (Elo ratings). | ★★★★★ | | 🇺🇸 Gemini 2.5 Pro | 1242 |
| 174 | LogicVista | Visual logical reasoning | Visual logic and pattern reasoning tasks requiring compositional and spatial understanding. | ★★★★ | | 🇨🇳 GLM-4.5V | 62.4% |
| 175 | LogiQA | Logical reasoning | Reading comprehension with logical reasoning. | ★★★★★ | 138 | G Pythia 70M | 23.5% |
| 176 | LongBench | Long-context eval | Long-context understanding across tasks. | ★★★★★ | 957 | G Jamba Mini 1.6 | 32.0% |
| 177 | LongFact-Concepts | Hallucination rate | Long-context factuality eval focused on conceptual statements; lower is better. | ★★★★ | | 🇺🇸 GPT-5 | ↓ 0.7% |
| 178 | LongFact-Objects | Hallucination rate | Long-context factuality eval focused on object/entity references; lower is better. | ★★★★ | | 🇺🇸 GPT-5 | ↓ 0.8% |
| 179 | LVBench | Video understanding | Long video understanding benchmark (LVBench). | ★★★★★ | | 🇺🇸 Gemini 2.5 Pro | 73.0% |
| 180 | M3GIA (CN) | Chinese multimodal QA | Chinese-language M3GIA benchmark covering grounded multimodal question answering. | ★★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 91.2% |
| 181 | Mantis | Multimodal reasoning | Multimodal reasoning and instruction-following benchmark (Mantis). | ★★★★★ | | G dots.vlm1 | 86.2% |
| 182 | MASK | Safety / red teaming | Model behavior safety assessment via red-teaming scenarios. | ★★★★ | | 🇺🇸 Claude Sonnet 4 (t) | 95.3% |
| 183 | MATH | Math (competition) | Competition-level mathematics across algebra, geometry, number theory, and combinatorics. | ★★★★★ | 1185 | 🇺🇸 o3 mini | 97.9% |
| 184 | MATH Level 5 | Math (competition) | Level 5 subset of the MATH benchmark emphasizing the hardest competition-style problems. | ★★★★ | | 🇨🇳 Qwen3-4B-Instruct-2507 | 73.6% |
| 185 | MATH500 | Math (competition) | 500 curated math problems for evaluating high-level reasoning. | ★★★★★ | | 🇺🇸 GPT-5 | 99.2% |
| 186 | MATH500 (ES) | Math (multilingual) | Spanish MATH500 benchmark. | ★★★★★ | | G EXAONE 4.0 1.2B | 88.8% |
| 187 | MathVerse-mini | Math reasoning (multimodal) | Compact MathVerse split focusing on single-image math puzzles and visual reasoning. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 85.0% |
| 188 | MathVerse-Vision | Math reasoning (multimodal) | Multi-image visual mathematical reasoning tasks from the MathVerse ecosystem. | ★★★★ | | 🇨🇳 GLM-4.5V | 72.1% |
| 189 | MathVision | Math reasoning (multimodal) | Visual math reasoning benchmark with problems that combine images (charts, diagrams) and text. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 74.6% |
| 190 | MathVista | Multimodal math reasoning | Visual math reasoning across diverse tasks. | ★★★★ | | 🇺🇸 o3 | 86.8% |
| 191 | MathVista-Mini | Math reasoning (multimodal) | Lightweight subset of MathVista for quick evaluation of visual mathematical reasoning. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 85.8% |
| 192 | MBPP | Code generation | Short Python problems with hidden tests. | ★★★★ | 36312 | 🇺🇸 o1-preview | 95.5% |
| 193 | MBPP+ | Code generation | Extended MBPP with more tests and stricter evaluation. | ★★★★★ | | 🇺🇸 Llama 3.1 405B | 88.6% |
| 194 | MCP Universe | Agent evaluation | Benchmarks multi-step tool-use agents across diverse task suites with a unified overall success metric. | ★★★★ | | 🇺🇸 GPT-5 High | 44.2% |
| 195 | MCPMark | Agent tool use (MCP) | Benchmark for Model Context Protocol (MCP) agent tool use. | ★★★★★ | 127 | 🇺🇸 GPT-5 | 46.9% |
| 196 | MGSM | Math (multilingual) | Multilingual grade-school math word problems. | ★★★★★ | | 🇺🇸 Claude Opus 4.1 (2025-08-05) Thinking | 94.4% |
| 197 | MIABench | Multimodal instruction following | Multimodal instruction-following benchmark evaluating accuracy on complex image-text tasks. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 92.7% |
| 198 | Minerva Math | University-level math | Advanced quantitative reasoning set inspired by the Minerva benchmark for STEM problem solving. | ★★★★★ | | G Granite-4.0-H-Small | 74.0% |
| 199 | MiniF2F (Test) | Math competition | MiniF2F competition benchmark (test split). | ★★★★ | | 🇨🇳 LongCat-Flash-Thinking | 81.6% |
| 200 | MixEval | Multi-task reasoning | Mixed-subject benchmark covering knowledge and reasoning tasks across domains. | ★★★★★ | | 🇺🇸 o1 Mini | 82.9% |
| 201 | MixEval Hard | Multi-task reasoning (hard) | Hard subset of MixEval covering diverse reasoning tasks. | ★★★★ | | 🇨🇳 Qwen3-4B | 31.6% |
| 202 | MLVU | Large video understanding | MLVU: large-scale multi-task benchmark for video understanding. | ★★★★★ | | 🇺🇸 GPT-5 | 86.2% |
| 203 | MM-MT-Bench | Multimodal instruction following | Multi-turn multimodal instruction-following benchmark evaluating dialogue quality and helpfulness. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 8.5% |
| 204 | MMBench v1.1 (EN dev) | General VQA | English dev split of MMBench v1.1 measuring multimodal question answering. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 90.6% |
| 205 | MMBench v1.1 (CN) | Multimodal understanding (Chinese) | MMBench v1.1 Chinese subset for evaluating multimodal LLMs. | ★★★★★ | | 🇨🇳 Keye-VL 8B | 89.8% |
| 206 | MMBench v1.1 (EN) | Multimodal understanding (English) | MMBench v1.1 English subset for evaluating multimodal LLMs. | ★★★★★ | | 🇨🇳 Keye-VL 8B | 89.7% |
| 207 | MME-RealWorld (CN) | Real-world perception (CN) | MME-RealWorld Chinese split. | ★★★★ | | 🇺🇸 GPT-4o | 58.5% |
| 208 | MME-RealWorld (EN) | Real-world perception (EN) | MME-RealWorld English split. | ★★★★ | | G MiMo-VL 7B-RL | 59.1% |
| 209 | MMLongBench-Doc | Long-context multimodal documents | Evaluates long-context document understanding with mixed text, tables, and figures across multiple pages. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 56.2% |
| 210 | MMLU | Multi-domain knowledge | 57 tasks spanning STEM, humanities, and social sciences; broad knowledge and reasoning. | ★★★★★ | 1488 | 🇺🇸 GPT-5 | 93.5% |
| 211 | MMLU (cloze) | Multi-domain knowledge (cloze) | Cloze-form MMLU evaluation variant. | ★★★★ | | 🇺🇸 SmolLM2 135M Base | 31.5% |
| 212 | Full Text MMLU | Multi-domain knowledge (long-form) | Full-context MMLU variant evaluating reasoning over long passages. | ★★★★ | | 🇺🇸 Llama 3.3 70B Instruct | 0.8% |
| 213 | MMLU-Pro | Multi-domain knowledge | Harder successor to MMLU with more challenging questions. | ★★★★★ | 286 | 🇺🇸 o1 | 89.3% |
| 214 | MMLU Pro MCF | Multi-domain knowledge (few-shot) | MMLU-Pro common format (MCF) few-shot evaluation. | ★★★★ | | 🇨🇳 Qwen3-4B-Base | 41.1% |
| 215 | MMLU-ProX | Multi-domain knowledge | Cross-lingual and robust variant of MMLU-Pro. | ★★★★★ | | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 81.0% |
| 216 | MMLU-Redux | Multi-domain knowledge | Updated MMLU-style evaluation with revised questions and scoring. | ★★★★★ | | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 93.8% |
| 217 | MMLU-STEM | STEM knowledge | STEM subset of MMLU. | ★★★★★ | 1488 | G Falcon-H1-34B-Instruct | 83.6% |
| 218 | MMMLU | Multi-domain knowledge (multilingual) | Massively multilingual MMLU-style evaluation across many languages. | ★★★★★ | | 🇺🇸 Claude Opus 4.1 | 89.5% |
| 219 | MMMLU (ES) | Multilingual knowledge | Spanish MMMLU benchmark. | ★★★★★ | | 🇺🇸 SmolLM 3 3B | 64.7% |
| 220 | MMMU | Multimodal understanding | Multi-discipline multimodal understanding benchmark. | ★★★★★ | | 🇺🇸 Gemini 2.5 Pro | 84.2% |
| 221 | MMMU-Pro | Multimodal understanding (hard) | Professional/advanced subset of MMMU for multimodal reasoning. | ★★★★★ | | 🇺🇸 GPT-5 | 78.4% |
| 222 | MMMU-Pro (vision) | Multimodal understanding (vision) | MMMU-Pro vision-only setting. | ★★★★ | | 🇺🇸 Claude 3.7 Sonnet | 45.8% |
| 223 | MMStar | Multimodal reasoning | Broad evaluation of multimodal LLMs across diverse tasks. | ★★★★★ | | 🇺🇸 Gemini 2.5 Pro | 78.7% |
| 224 | MMVP | Multimodal video perception | Benchmark for multimodal video understanding and perception. | ★★★★★ | | 🇨🇳 R-4B-RL | 80.7% |
| 225 | MMVU | Video understanding | Multimodal video understanding benchmark (MMVU). | ★★★★ | | 🇨🇳 GLM-4.5V | 68.7% |
| 226 | MotionBench | Video motion understanding | Video motion and temporal reasoning benchmark. | ★★★★ | | 🇨🇳 GLM-4.5V | 62.4% |
| 227 | MT-Bench | Chat ability | Multi-turn chat evaluation via GPT-4 grading. | ★★★★★ | 39074 | 🇺🇸 Apriel Nemotron 15B Thinker | 85.7% |
| 228 | MTOB (full book) | Long-form reasoning | Long-context book understanding benchmark (full-book setting). | ★★★★★ | | 🇺🇸 Llama 4 Maverick | 50.8% |
| 229 | MTOB (half book) | Long-form reasoning | Long-context book understanding benchmark (half-book setting). | ★★★★★ | | 🇺🇸 Llama 4 Maverick | 54.0% |
| 230 | MUIRBENCH | Multimodal robustness | Evaluates multimodal understanding robustness and reliability. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 80.1% |
| 231 | Multi-IF | Instruction following (multi-task) | Composite instruction-following evaluation across multiple tasks. | ★★★★ | | 🇺🇸 o3 mini-high | 79.5% |
| 232 | Multi-IFEval | Instruction following (multi-task) | Multi-task variant of instruction-following evaluation. | ★★★★★ | | 🇺🇸 Llama 3.3 70B | 88.7% |
| 233 | Multi-SWE-Bench | Code repair (multi-repo) | Multi-repository SWE-Bench variant. | ★★★★★ | 246 | 🇺🇸 Claude Sonnet 4 | 35.7% |
| 234 | MultiChallenge | Multi-task reasoning | Composite benchmark across diverse challenges by Scale AI. | ★★★★ | | 🇺🇸 GPT-5 | 69.6% |
| 235 | MultiPL-E | Code generation (multilingual) | Multilingual code generation and execution benchmark across many programming languages. | ★★★★★ | 269 | 🇨🇳 Qwen3-235B-A22B | 87.9% |
| 236 | MultiPL-E HumanEval | Code generation (multilingual) | MultiPL-E variant of HumanEval tasks. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 75.2% |
| 237 | MultiPL-E MBPP | Code generation (multilingual) | MultiPL-E variant of MBPP tasks. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 65.7% |
| 238 | MuSR | Reasoning | Multistep soft reasoning. | ★★★★★ | | 🇨🇳 ERNIE 4.5 424B A47B | 69.9% |
| 239 | MVBench | Video QA | Multi-view or multi-video QA benchmark (MVBench). | ★★★★★ | | 🇨🇳 GLM-4.5V | 73.0% |
| 240 | Natural2Code | Code generation | Natural-language-to-code benchmark for instruction-following synthesis. | ★★★★ | | 🇺🇸 Gemini 2.0 Flash | 92.9% |
| 241 | NaturalQuestions | Open-domain QA | Google NQ; real user questions with long/short answers. | ★★★★ | | 🇫🇷 Mixtral 8x22B | 40.1% |
| 242 | Nexus (0-shot) | Tool use | Nexus tool-use benchmark, zero-shot setting. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 58.7% |
| 243 | Needle In A Haystack | Long-context retrieval | Needle-in-a-haystack test for locating hidden facts in long contexts. | ★★★★ | | G MobileLLM P1 Base | 100.0% |
| 244 | Objectron | Object detection | Objectron benchmark for 3D object detection in video captures. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 71.2% |
| 245 | OCRBench | OCR (vision text extraction) | Optical character recognition benchmark evaluating text extraction from images, documents, and complex layouts. | ★★★★★ | | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 90.3% |
| 246 | OCRBenchV2 (CN) | OCR (Chinese) | OCRBenchV2 Chinese subset assessing OCR performance on Chinese-language documents. | ★★★★★ | | 🇨🇳 Qwen2.5-VL 72B Instruct | 63.7% |
| 247 | OCRBenchV2 (EN) | OCR (English) | OCRBenchV2 English subset evaluating OCR accuracy on English documents and layouts. | ★★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 66.8% |
| 248 | OCRReasoning | OCR reasoning | OCR reasoning benchmark combining text extraction with multi-step reasoning over documents. | ★★★★★ | | 🇺🇸 Gemini 2.5 Pro | 70.8% |
| 249 | ODinW-13 | Object detection (in the wild) | Object Detection in the Wild benchmark covering 13 real-world domains. | ★★★★★ | | 🇨🇳 Qwen3-VL-4B-Instruct | 48.2% |
| 250 | Odyssey Math | Math reasoning | Odyssey multi-step math benchmark. | ★★★★ | | G Mathstral 7B | 37.2% |
| 251 | OJBench | Code generation (online judge) | Programming problems evaluated via online-judge-style execution. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 41.6% |
| 252 | OlympiadBench | Math (olympiad) | Advanced mathematics olympiad-style problem benchmark. | ★★★★ | | 🇨🇳 Hunyuan-7B-Instruct | 76.5% |
| 253 | OlympicArena | Math (competition) | Olympiad-style mathematics reasoning benchmark. | ★★★★ | | 🇨🇳 DeepSeek V3 | 76.2% |
| 254 | Omni-MATH | Math reasoning | Omni-MATH benchmark covering diverse math reasoning tasks across difficulty levels. | ★★★★★ | | G Ling 1T | 74.5% |
| 255 | Omni-MATH-HARD | Math | Challenging math benchmark (Omni-MATH-HARD). | ★★★★★ | | 🇺🇸 GPT-5 High | 73.6% |
| 256 | OmniSpatial | Spatial reasoning | Spatial understanding and reasoning benchmark (OmniSpatial). | ★★★★ | | 🇨🇳 GLM-4.5V | 51.0% |
| 257 | Open Rewrite | Instruction following | Rewrite benchmark assessing open-ended editing and directive-following quality. | ★★★★ | | G MobileLLM P1 | 51.0% |
| 258 | OpenBookQA | Science QA | Open-book multiple-choice science questions with supporting facts. | ★★★★★ | 128 | 🇨🇳 Qwen3 | 96.4% |
| 259 | OpenRewrite-Eval | Rewrite quality | OpenRewrite evaluation; micro-averaged ROUGE-L. | ★★★★ | | 🇨🇳 Qwen2.5 1.5B Instruct | 46.9% |
| 260 | OptMATH | Math optimization reasoning | OptMATH benchmark targeting challenging math optimization and problem-solving tasks. | ★★★★★ | | G Ling 1T | 57.7% |
| 261 | OSWorld | GUI agents | Agentic GUI task completion and grounding on desktop environments. | ★★★★★ | | 🇺🇸 Claude Sonnet 4.5 | 61.4% |
| 262 | OSWorld-G | GUI agents | OSWorld-G center accuracy (no_refusal). | ★★★★★ | | 🇺🇸 Holo1.5-72B | 71.8% |
| 263 | OSWorld2 | GUI agents | Second-generation OSWorld GUI agent benchmark. | ★★★★ | | 🇨🇳 GLM-4.5V | 35.8% |
| 264 | PIQA | Physical commonsense | Physical commonsense about everyday tasks and object affordances. | ★★★★ | | 🇨🇳 GLM-4.5 Base | 87.1% |
| 265 | PixmoCount | Visual counting | Counting objects/instances in images (PixmoCount). | ★★★★ | | 🇺🇸 Molmo-72B | 85.2% |
| 266 | PolyMATH | Math reasoning | Polyglot mathematics benchmark assessing cross-topic math reasoning. | ★★★★★ | | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 60.1% |
| 267 | POPE | Hallucination detection | Vision-language hallucination benchmark focusing on object existence verification. | ★★★★ | | G Moondream-9B-A2B | 89.0% |
| 268 | PopQA | Knowledge / QA | Open-domain popular-culture question answering benchmark testing long-tail factual recall. | ★★★★★ | | 🇺🇸 Llama 3.1 8B Instruct | 28.8% |
269QuAC o Conversational QAQuestion answering in context.★★★★ 🇺🇸 Llama 3.1 405B Base53.6%
270QuALITY o Long-context reading comprehensionLong-document multiple-choice reading comprehension benchmark.★★★★ 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT0.5%
271RACE o Reading comprehensionEnglish exams for middle and high school.★★★★G RND1-Base-091057.6%
272RealWorldQA O Real-world visual QAVisual question answering with real-world images and scenarios.★★★★★ 🇺🇸 GPT-582.8%
273RefCOCO O Referring expressionsRefCOCO average accuracy at IoU 0.5 (val).★★★★★ 🇨🇳 InternVL3.5-4B92.4%
274RefCOCOg o Referring expressionsRefCOCOg average accuracy at IoU 0.5 (val).★★★★G Moondream-9B-A2B88.6%
275RefCOCO+ o Referring expressionsRefCOCO+ accuracy at IoU 0.5 on the val split.★★★★G Moondream-9B-A2B81.8%
276RefSpatialBench OSpatial reasoningReference spatial understanding benchmark covering spatial grounding tasks.★★★★★ 🇨🇳 Qwen2.5-VL 72B Instruct72.1%
277RepoBench OCode understandingRepository-level code comprehension and reasoning benchmark.★★★★ 🇺🇸 Claude Sonnet 4.583.8%
278RoboSpatialHome OEmbodied spatial understandingRoboSpatialHome benchmark for embodied spatial reasoning in domestic environments.★★★★★ 🇨🇳 Qwen3-VL-235B-A22B Thinking73.9%
279Roo Code Evals O Code assistant evalCommunity-maintained coding evals and leaderboard by Roo Code.★★★★★ 🇺🇸 GPT-5 mini99.0%
280Ruler 128k o Long-context evalRULER benchmark at 128k context window.★★★★ 🇫🇷 Mistral Medium 390.2%
281Ruler 32k o Long-context evalRULER benchmark at 32k context window.★★★★★ 🇫🇷 Mistral Medium 396.0%
282SALAD-Bench o Safety alignmentSafety Alignment and Dangerous-behavior benchmark evaluating harmful assistance and refusal consistency.★★★★★G Granite-4.0-H-Micro↓ 96.8%
283SciCode (sub) OCodeSciCode subset score (sub).★★★★★ 🇺🇸 Grok 445.7%
284SciCode (main) OCodeSciCode main score.★★★★★ 🇺🇸 Gemini 2.5 Pro15.4%
285ScienceQA OScience QA (multimodal)Multiple-choice science questions with images, diagrams, and text context.★★★★G FastVLM-7B96.7%
286SciQ o Science QAMultiple choice science questions.★★★★G Pythia 12B92.9%
287ScreenQA Complex OGUI QAComplex ScreenQA benchmark accuracy.★★★★ 🇺🇸 Holo1.5-72B87.1%
288ScreenQA Short OGUI QAShort-form ScreenQA benchmark accuracy.★★★★ 🇺🇸 Holo1.5-72B91.9%
289ScreenSpot OScreen UI locatorsCenter accuracy on ScreenSpot.★★★★★ 🇨🇳 Qwen3-VL-235B-A22B Thinking95.4%
290ScreenSpot-Pro O Screen UI locatorsAverage center accuracy on ScreenSpot-Pro.★★★★★ 🇺🇸 Holo1.5-72B63.2%
291ScreenSpot-v2 OScreen UI locatorsCenter accuracy on ScreenSpot-v2.★★★★G UI-Venus 72B95.3%
292SEED-Bench-2-Plus o Multimodal evaluationSEED-Bench-2-Plus overall accuracy.★★★★ 🇺🇸 Claude 3.7 Sonnet72.9%
293SEED-Bench-Img OMultimodal image understandingSEED-Bench image-only subset (SEED-Bench-Img).★★★★G Bagel 14B78.5%
294Showdown OGUI agentsSuccess rate on the Showdown UI interaction benchmark.★★★★ 🇺🇸 Holo1.5-72B76.8%
295SIFO oInstruction followingSingle-turn instruction following benchmark.★★★★★ 🇨🇳 Qwen3-VL-30B-A3B Thinking66.9%
296SIFO Multiturn oInstruction followingMulti-turn SIFO benchmark for sustained instruction adherence.★★★★★ 🇨🇳 Qwen3-VL-30B-A3B Thinking60.3%
297SimpleQA OQASimple question answering benchmark.★★★★★ 🇨🇳 DeepSeek V3.2-Exp97.1%
298SimpleVQA OGeneral VQALightweight visual question answering set with everyday scenes.★★★★★ 🇺🇸 Gemini 2.5 Pro65.4%
299SimpleVQA-DS oGeneral VQASimpleVQA variant curated by DeepSeek with everyday image question answering tasks.★★★★★ 🇨🇳 Seed1.5-VL-Thinking61.3%
300SocialIQA o Social commonsenseSocial interaction commonsense QA.★★★★ 🇺🇸 Gemma 3 PT 27B54.9%
301Spider O Text-to-SQLComplex text-to-SQL benchmark over cross-domain databases.★★★★ 🇺🇸 Llama 3 70B Base67.1%
302Spiral-Bench O Safety / sycophancyA LLM-judged benchmark measuring sycophancy and delusion reinforcement.★★★★ 🇺🇸 GPT-587.0%
303SQuAD v1.1 o Reading comprehensionExtractive QA from Wikipedia articles.★★★★★566 🇺🇸 Llama 3.1 405B Base89.3%
304SUNRGBD o3D scene understandingSUN RGB-D benchmark for indoor scene understanding from RGB-D imagery.★★★★★ 🇺🇸 GPT-5 Mini Minimal45.8%
305SuperGPQA OGraduate-level QAHarder GPQA variant assessing advanced graduate-level reasoning.★★★★★ 🇨🇳 Qwen3-235B-A22B-Thinking-250764.9%
306SWE-Bench o Code repairSupervised software engineering benchmark across many repos and issues.★★★★★3442 🇺🇸 GPT-5 Codex74.5%
307SWE-Bench Multilingual oCode repair (multilingual)Multilingual variant of SWE-Bench for issue fixing.★★★★★ 🇨🇳 DeepSeek V3.2-Exp57.9%
308SWE-Bench Pro (Public) oSoftware engineeringPublic subset of the SWE-Bench Pro benchmark for software-engineering agents.★★★★ 🇺🇸 GPT-523.3%
309SWE-Bench Verified O Code repairVerified subset of SWE-Bench for issue fixing.★★★★ 🇺🇸 Claude Sonnet 4.577.2%
310SWE-Dev oCode repairSoftware engineering development and bug fixing benchmark.★★★★★ 🇺🇸 Claude Sonnet 467.1%
311SysBench oSystem promptsSystem prompt understanding and adherence benchmark.★★★★ 🇺🇸 GPT-4.174.1%
312τ²-Bench (airline) o Industry QA (airline)τ²-Bench airline domain evaluation.★★★★★ 🇺🇸 o364.8%
313τ²-Bench (retail) oIndustry QA (retail)τ²-Bench retail domain evaluation.★★★★★ 🇺🇸 Claude Opus 4.182.4%
314τ²-Bench (telecom) O Industry QA (telecom)τ²-Bench telecom domain evaluation.★★★★ 🇺🇸 GPT-596.7%
315TAU1-Airline oAgent tasks (airline)Tool-augmented agent evaluation in airline scenarios (TAU1).★★★★ 🇺🇸 Gemini-2.5-Flash Thinking54.0%
316TAU1-Retail oAgent tasks (retail)Tool-augmented agent evaluation in retail scenarios (TAU1).★★★★ 🇨🇳 Qwen3-235B-A22B71.3%
317TAU2-Airline OAgent tasks (airline)Tool-augmented agent evaluation in airline scenarios (TAU2).★★★★★ 🇺🇸 Claude Sonnet 4.570.0%
318TAU2-Retail OAgent tasks (retail)Tool-augmented agent evaluation in retail scenarios (TAU2).★★★★★ 🇺🇸 Claude Opus 4.186.8%
319TAU2-Telecom OAgent tasks (telecom)Tool-augmented agent evaluation in telecom scenarios (TAU2).★★★★★ 🇺🇸 Claude Sonnet 4.598.0%
320Terminal-Bench O Agent terminal tasksCommand-line task completion benchmark for agents.★★★★637 🇺🇸 Claude Sonnet 4.5 (Thinking)61.3%
321Terminal-Bench Hard O Agent terminal tasksHard subset of Terminal-Bench command-line agent tasks.★★★★★ 🇺🇸 Grok 437.6%
322TextVQA O Text-based VQAVisual question answering that requires reading text in images.★★★★ 🇨🇳 Qwen2-VL-72B85.5%
323TLDR9+ oSummarizationLong-form summarization benchmark with nine-domain TLDR prompts plus extended variations.★★★★G MobileLLM P116.8%
324TreeBench o Reasoning with tree structuresEvaluates hierarchical/tree-structured reasoning and planning capabilities in LLMs/VLMs.★★★★ 🇨🇳 GLM-4.5V50.1%
325TriQA oKnowledge QATriadic question answering benchmark evaluating world knowledge and reasoning.★★★★ 🇫🇷 Mixtral 8x22B82.2%
326TriviaQA O Open-domain QAOpen-domain question answering benchmark built from trivia and web evidence.★★★★★ 🇺🇸 Gemma 3 PT 27B85.5%
327TriviaQA-Wiki o Open-domain QATriviaQA subset answering using Wikipedia evidence.★★★★ 🇺🇸 Llama 3.1 405B Base91.8%
328TruthfulQA O Truthfulness / hallucinationMeasures whether a model imitates human falsehoods (truthfulness).★★★★★ 🇨🇳 Qwen2.5 32B Instruct70.3%
329TruthfulQA (DE) oTruthfulness / hallucination (German)German translation of the TruthfulQA benchmark.★★★★ 🇺🇸 Llama 3.3 70B Instruct0.2%
330TydiQA o Cross-lingual QATypologically diverse QA across languages.★★★★★313 🇺🇸 Llama 3.1 405B Base34.3%
331V* oMultimodal reasoningV* benchmark accuracy.★★★★★ 🇨🇳 Qwen3-VL-8B-Instruct86.4%
332VCT O Virology capability (protocol troubleshooting)Virology Capabilities Test: a benchmark that measures an LLM's ability to troubleshoot complex virology laboratory protocols.★★★★ 🇺🇸 o343.8%
333VibeEval OAesthetic/visual qualityVLM aesthetic evaluation with GPT scores.★★★★★ 🇺🇸 Gemini 2.5 Pro76.4%
334Video-MME o Video understanding (multimodal)Multimodal evaluation of video understanding and reasoning.★★★★ 🇨🇳 Qwen3-VL-30B-A3B Instruct74.5%
335VideoMME (w/o sub) OVideo understandingVideo understanding benchmark without subtitles.★★★★★ 🇺🇸 Gemini 2.5 Pro85.1%
336VideoMME (w/sub) oVideo understandingVideo understanding benchmark with subtitles.★★★★ 🇨🇳 GLM-4.5V80.7%
337VideoMMMU OMultimodal video understandingVideo-based extension of MMMU evaluating temporal multimodal reasoning and perception across disciplines.★★★★★ 🇺🇸 GPT-584.6%
338VisualWebBench O Web UI understandingAverage accuracy on VisualWebBench.★★★★ 🇺🇸 Holo1.5-72B83.8%
339VisuLogic O Visual logical reasoningLogical reasoning and compositionality benchmark for visual-language models.★★★★★ 🇨🇳 Seed1.5-VL-Thinking35.9%
340VitaBench oIndustry QAIndustry-focused benchmark evaluating domain QA performance.★★★★ 🇺🇸 o335.3%
341VL-RewardBench oReward modeling (VL)Reward alignment benchmark for VLMs.★★★★ 🇺🇸 Claude 3.7 Sonnet67.4%
342VLMs are Biased o Multimodal biasEvaluates whether VLMs truly 'see' vs. relying on memorized knowledge; measures bias toward non-visual priors.★★★★90 🇺🇸 o4 mini20.2%
343VLMs are Blind O Visual grounding robustnessEvaluates failure modes of VLMs in grounding and perception tasks.★★★★G MiMo-VL 7B-RL79.4%
344VoiceBench AdvBench oVoiceBenchVoiceBench adversarial safety evaluation.★★★★ 🇨🇳 Qwen3-Omni-30B-A3B-Thinking99.4%
345VoiceBench AlpacaEval oVoiceBenchVoiceBench evaluation on AlpacaEval instructions.★★★★ 🇨🇳 Qwen3-Omni-Flash-Thinking96.8%
346VoiceBench BBH oVoiceBenchVoiceBench evaluation on Big-Bench Hard prompts.★★★★ 🇺🇸 Gemini 2.5 Pro92.6%
347VoiceBench CommonEval oVoiceBenchVoiceBench evaluation on CommonEval.★★★★ 🇨🇳 Qwen3-Omni-Flash-Instruct91.0%
348VoiceBench IFEval oVoiceBenchVoiceBench instruction-following evaluation (IFEval).★★★★ 🇺🇸 Gemini 2.5 Pro85.7%
349MMAU v05.15.25 oAudio reasoningAudio reasoning benchmark MMAU v05.15.25.★★★★ 🇨🇳 Qwen3-Omni-Flash-Instruct77.6%
350VoiceBench MMSU oVoiceBenchVoiceBench MMSU benchmark (voice modality).★★★★ 🇨🇳 Qwen3-Omni-Flash-Thinking84.3%
351VoiceBench MMSU (Audio) oAudio reasoningAudio reasoning MMSU results.★★★★ 🇺🇸 Gemini 2.5 Pro77.7%
352VoiceBench OpenBookQA oVoiceBenchVoiceBench results on OpenBookQA prompts.★★★★ 🇨🇳 Qwen3-Omni-Flash-Thinking95.0%
353VoiceBench Overall oVoiceBenchOverall VoiceBench aggregate score.★★★★ 🇺🇸 Gemini 2.5 Pro89.6%
354VoiceBench SD-QA oVoiceBenchVoiceBench Spoken Dialogue QA results.★★★★ 🇺🇸 Gemini 2.5 Pro90.1%
355VoiceBench WildVoice oVoiceBenchVoiceBench evaluation on WildVoice dataset.★★★★ 🇺🇸 Gemini 2.5 Pro93.4%
356VQAv2 O Visual question answeringStandard Visual Question Answering v2 benchmark on natural images.★★★★ 🇺🇸 Molmo-72B86.5%
357VSI-Bench oSpatial intelligenceVisual spatial intelligence benchmark covering 3D reasoning and spatial inference tasks.★★★★★ 🇨🇳 Qwen3-VL-30B-A3B Instruct63.2%
358WebClick OGUI agentsTask success on the WebClick UI agent benchmark.★★★★ 🇺🇸 Claude Sonnet 493.0%
359WebDev Arena O Web development agentsArena evaluation for autonomous web development agents.★★★★★ 🇺🇸 GPT-51483
360WebQuest-MultiQA oWeb agentsMulti-question web search and interaction tasks.★★★★ 🇨🇳 GLM-4.5V60.6%
361WebQuest-SingleQA oWeb agentsSingle-question web search and interaction tasks.★★★★ 🇨🇳 GLM-4.5V76.9%
362WebSrc OWeb QAWebpage question answering (SQuAD F1).★★★★ 🇺🇸 Holo1.5-72B97.2%
363WebVoyager2 oWeb agentsWeb navigation and interaction tasks for LLM agents (v2).★★★★ 🇨🇳 GLM-4.5V84.4%
364WebWalkerQA oWeb agentsWebWalker tasks evaluating autonomous browsing question answering performance.★★★★ 🇨🇳 Tongyi DeepResearch72.2%
365WeMath oMath reasoningMath reasoning benchmark spanning diverse curricula and difficulty levels.★★★★ 🇨🇳 GLM-4.5V68.8%
366WildBench V2 oInstruction followingWildBench V2 human preference benchmark for instruction following and helpfulness.★★★★ 🇫🇷 Mistral Small 3.2 24B Instruct65.3%
367Winogender o Gender bias (coreference)Coreference resolution dataset for measuring gender bias.★★★★ 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT67.9%
368WinoGrande O Coreference reasoningLarge-scale adversarial Winograd Schema-style pronoun resolution.★★★★99 🇺🇸 Llama 3.1 405B Base86.7%
369WinoGrande (DE) oCoreference reasoning (German)German translation of the WinoGrande pronoun resolution benchmark.★★★★ 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT0.8%
370WMT16 En–De o Machine translationWMT16 English–German translation benchmark (news).★★★★ 🇺🇸 Llama 3.3 70B Instruct38.8%
371WMT16 En–De (Instruct) oMachine translationInstruction-tuned evaluation on the WMT16 English–German translation set.★★★★ 🇺🇸 Llama 3.3 70B Instruct37.9%
372WritingBench OWriting qualityGeneral-purpose writing quality benchmark.★★★★★ 🇨🇳 Qwen3-235B-A22B-Thinking-250788.3%
373WSC o Coreference reasoningClassic Winograd Schema Challenge measuring commonsense coreference.★★★★G Pythia 410M47.1%
374xBench-DeepSearch oAgentic researchEvaluates multi-hop deep research workflows on xBench DeepSearch tasks.★★★★ 🇨🇳 Tongyi DeepResearch75.0%
375ZebraLogic O Logical reasoningLogical reasoning benchmark assessing complex pattern and rule inference.★★★★ 🇨🇳 Qwen3 MoE-250794.2%
376ZeroBench OZero-shot generalizationEvaluates zero-shot performance across diverse tasks without task-specific finetuning.★★★★★ 🇨🇳 GLM-4.5V23.4%
377ZeroBench (sub) OZero-shot generalizationSubset of ZeroBench targeting harder zero-shot reasoning cases.★★★★★ 🇺🇸 Gemini 2.5 Pro33.8%
378ZeroSCROLLS MuSiQue o Long-context reasoningZeroSCROLLS split derived from MuSiQue multi-hop QA.★★★★ 🇺🇸 Llama 3.3 70B Instruct0.5%
379ZeroSCROLLS SpaceDigest o Long-context summarizationZeroSCROLLS SpaceDigest extractive summarization task.★★★★ 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT0.8%
380ZeroSCROLLS SQuALITY o Long-context summarizationZeroSCROLLS split based on the SQuALITY long-form summarization benchmark.★★★★ 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT0.2%
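Rows like the ones above lend themselves to the same kind of filtering the site's search box performs. Here is a minimal sketch in Python: the `Benchmark` dataclass and the `search` helper are illustrative names (not part of Fu-Benchmark's actual code), and the three rows are a small hand-copied subset of the table.

```python
# Minimal sketch of search-box-style filtering over benchmark rows.
# `Benchmark` and `search` are hypothetical names for illustration;
# the rows are a tiny subset hand-copied from the table above.
from dataclasses import dataclass

@dataclass
class Benchmark:
    name: str
    topic: str
    leader: str
    top_score: float  # Top % column, as a number

ROWS = [
    Benchmark("SWE-Bench Verified", "Code repair", "Claude Sonnet 4.5", 77.2),
    Benchmark("TruthfulQA", "Truthfulness / hallucination", "Qwen2.5 32B Instruct", 70.3),
    Benchmark("OCRBench", "OCR (vision text extraction)", "Qwen3-VL-30B-A3B Instruct", 90.3),
]

def search(rows, query):
    """Case-insensitive substring match against name, topic, or leader."""
    q = query.lower()
    return [r for r in rows
            if q in r.name.lower() or q in r.topic.lower() or q in r.leader.lower()]

hits = search(ROWS, "qwen")
print([r.name for r in hits])  # → ['TruthfulQA', 'OCRBench']
```

The same pattern extends to filtering by relevance stars or sorting by the Top % column once those fields are added to the dataclass.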