
Furukama's Blog

Ben Koehler - Founder, Speaker, Coder

Fu — Benchmark of Benchmarks

Fu-Benchmark is a meta-benchmark of the most influential evaluation suites used to measure and rank large language models. Use the search box to filter by name, topic, or model.
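For readers who want to slice the table offline rather than through the page's search box, here is a minimal sketch of the same name/topic/model filter in Python. The record layout and the `search` helper are illustrative assumptions, not an official Fu-Benchmark export format; the two sample rows are copied from the table below.

```python
# Minimal sketch: filter benchmark rows by name, topic, or leading model.
# Field names are illustrative assumptions; sample values come from the
# table below, not from any published Fu-Benchmark API.
benchmarks = [
    {"name": "GSM8K", "topic": "Math (grade-school)",
     "leader": "Kimi K2 Instruct", "top_score": "97.3%"},
    {"name": "MMLU", "topic": "Multi-domain knowledge",
     "leader": "GPT-5", "top_score": "93.5%"},
]

def search(rows, query):
    """Return rows whose name, topic, or leader contains the query (case-insensitive)."""
    q = query.lower()
    return [r for r in rows
            if q in r["name"].lower()
            or q in r["topic"].lower()
            or q in r["leader"].lower()]

print(search(benchmarks, "math"))  # matches the GSM8K row via its topic
```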

Benchmarks
Scores marked ↓ come from metrics where lower is better; entries without a percent sign (e.g. Codeforces, LMArena) are Elo-style ratings rather than percentages.

| # | Name | Topic | Description | Relevance | GitHub ★ | Leader | Top score |
|---|---|---|---|---|---|---|---|
| 1 | A12D | Diagram reasoning | A12D diagram reasoning benchmark for measuring multimodal understanding of annotated diagrams. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 90.9% |
| 2 | AA-Index | Multi-domain QA | Comprehensive QA index across diverse domains. | ★★★★ |  | 🇺🇸 Grok 4 | 73.2% |
| 3 | AA-LCR | Long-context reasoning | A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens. | ★★★★★ |  | 🇺🇸 GPT-5 | 75.6% |
| 4 | ACP-Bench Bool | Safety evaluation (boolean) | Safety and behavior evaluation with yes/no questions. | ★★★★★ |  | 🇨🇳 Qwen3-32B | 85.1% |
| 5 | ACP-Bench MCQ | Safety evaluation (MCQ) | Safety and behavior evaluation with multiple-choice questions. | ★★★★★ |  | 🇺🇸 Llama 3.3 70B | 82.1% |
| 6 | AgentDojo | Agent evaluation | Interactive evaluation suite for autonomous agents across tools and tasks. | ★★★★ |  | 🇺🇸 Claude 3.7 Sonnet | 88.7% |
| 7 | AGIEval | Exams | Academic and professional exam benchmark. | ★★★★★ |  | 🇺🇸 Llama 3.1 405B Base | 71.6% |
| 8 | AI2D | Diagram understanding (VQA) | Visual question answering over science and diagram images. | ★★★★★ |  | 🇺🇸 Molmo-72B | 96.3% |
| 9 | Aider Code Editing | Code editing | Measures interactive code editing quality within the Aider assistant workflow. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 89.8% |
| 10 | Aider-Polyglot | Code assistant eval | Aider polyglot coding leaderboard. | ★★★★★ |  | 🇺🇸 GPT-5 | 88.0% |
| 11 | AIME 2024 | Math (competition) | American Invitational Mathematics Examination 2024 problems. | ★★★★★ |  | 🇺🇸 GPT-OSS 120B | 96.6% |
| 12 | AIME 2025 | Math (competition) | American Invitational Mathematics Examination 2025 problems. | ★★★★ |  | 🇺🇸 GPT-5 pro | 96.7% |
| 13 | AIME25 | Math (competition) | American Invitational Mathematics Examination 2025 benchmark (set AIME25). | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Thinking | 83.1% |
| 14 | All-Angles Bench | Spatial perception | All-Angles benchmark for spatial recognition and 3D perception. | ★★★★ |  | 🇨🇳 GLM-4.5V | 56.9% |
| 15 | AlpacaEval | Instruction following | Automatic eval using GPT-4 as a judge. | ★★★★★ | 1849 | 🇨🇳 Qwen3-32B | 64.2% |
| 16 | AlpacaEval 2.0 | Instruction following | Updated AlpacaEval with improved prompts and judging. | ★★★★★ |  | 🇨🇳 DeepSeek R1 | 87.6% |
| 17 | AMC-23 | Math (competition) | American Mathematics Competition 2023 evaluation. | ★★★★ |  | G QwQ-32B | 98.5% |
| 18 | AndroidWorld | Mobile agents | Benchmark for agents operating Android apps via UI automation. | ★★★★ |  | 🇨🇳 GLM-4.5V | 57.0% |
| 19 | API-Bank | Tool use | API-Bank tool-use benchmark. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 92.0% |
| 20 | ARC-AGI-1 | General reasoning | ARC-AGI Phase 1 aggregate accuracy. | ★★★★★ |  | 🇺🇸 o3 | 75.7% |
| 21 | ARC-AGI-2 | General reasoning | ARC-AGI Phase 2 aggregate accuracy. | ★★★★★ |  | 🇺🇸 GPT-5 pro | 18.3% |
| 22 | ARC Average | Science QA (average) | Average accuracy across ARC-Easy and ARC-Challenge. | ★★★★ |  | 🇺🇸 SmolLM2 1.7B Pretrained | 60.5% |
| 23 | ARC-Challenge | Science QA | Hard subset of AI2 Reasoning Challenge; grade-school science. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 96.9% |
| 24 | ARC-Challenge (DE) | Science QA (German) | German translation of the ARC Challenge benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.7% |
| 25 | ARC-Easy | Science QA | Easier subset of AI2 Reasoning Challenge. | ★★★★ |  | 🇺🇸 Gemma 3 PT 27B | 89.0% |
| 26 | ARC-Easy (DE) | Science QA (German) | German translation of the ARC Easy science QA benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.8% |
| 27 | Arena-Hard | Chat ability | Hard prompts on Chatbot Arena. | ★★★★★ | 920 | 🇫🇷 Mistral Medium 3 | 97.1% |
| 28 | Arena-Hard V2 | Chat ability | Updated Arena-Hard v2 prompts on Chatbot Arena. | ★★★★★ | 920 | 🇨🇳 Qwen3 MoE-2507 | 88.2% |
| 29 | ARKitScenes | 3D scene understanding | ARKitScenes benchmark for assessing 3D scene reconstruction and understanding from mixed reality captures. | ★★★★ |  | 🇨🇳 Qwen2.5-VL 72B Instruct | 61.5% |
| 30 | ART Agent Red Teaming | Agent robustness | Evaluation suite for adversarial red-teaming of autonomous AI agents. | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 (Thinking) | ↓ 40.0% |
| 31 | ArtifactsBench | Agentic coding | Artifacts-focused coding and tool-use benchmark evaluating generated code artifacts. | ★★★★★ |  | 🇺🇸 GPT-5 | 72.5% |
| 32 | AstaBench | Agent evaluation | Evaluates science agents across literature understanding, data analysis, planning, tool use, coding, and search. | ★★★★★ |  | 🇺🇸 Claude Sonnet 4 | 53.0% |
| 33 | AttaQ | Safety / jailbreak | Adversarial jailbreak suite measuring refusal robustness against targeted attack prompts. | ★★★★★ |  | G Granite 3.3 8B Instruct | 88.5% |
| 34 | AutoCodeBench | Autonomous coding | End-to-end autonomous coding benchmark with unit-test based execution across diverse repositories and tasks. | ★★★★ |  | 🇺🇸 Claude Opus 4 (Thinking) | 52.4% |
| 35 | AutoCodeBench-Lite | Autonomous coding | Lite version of AutoCodeBench focusing on smaller tasks with the same end-to-end, unit-test-based evaluation. | ★★★★ |  | 🇺🇸 Claude Opus 4 | 64.5% |
| 36 | BALROG | Agent evaluation (games) | Benchmark for agentic LLM/VLM reasoning in long-horizon game environments. | ★★★★★ |  | 🇺🇸 Grok 4 | 43.6% |
| 37 | BBH | Multi-task reasoning | Hard subset of BIG-bench with diverse reasoning tasks. | ★★★★ | 510 | 🇨🇳 ERNIE 4.5 424B A47B | 94.3% |
| 38 | BBQ | Bias evaluation | Bias Benchmark for Question Answering evaluating social biases across contexts. | ★★★★ |  | 🇫🇷 Mixtral 8x7B | 56.0% |
| 39 | BFCL | Function calling | Berkeley Function-Calling Leaderboard measuring tool- and function-call accuracy. | ★★★★ |  | 🇨🇳 Qwen3-4B | 95.0% |
| 40 | BFCL Live v2 | Function calling (live) | Live, user-contributed function-calling tasks from the Berkeley Function-Calling Leaderboard (Live v2). | ★★★★ |  | 🇺🇸 o1 Mini | 81.0% |
| 41 | BFCL v3 | Function calling | Berkeley Function-Calling Leaderboard v3, adding multi-turn and multi-step function calling. | ★★★★★ |  | 🇨🇳 GLM 4.5 | 77.8% |
| 42 | BIG-Bench | Multi-task reasoning | BIG-bench overall performance (original). | ★★★★★ | 3110 | 🇺🇸 Gemma 2 7B | 55.1% |
| 43 | BIG-Bench Extra Hard | Multi-task reasoning | Extra hard subset of BIG-bench tasks. | ★★★★★ |  | G Ling 1T | 47.3% |
| 44 | BigCodeBench | Code generation | BigCodeBench evaluates large language models on practical code generation tasks with unit-test verification. | ★★★★★ |  | 🇺🇸 GPT-4o-2024-05-13 | 56.1% |
| 45 | BigCodeBench Hard | Code generation (hard) | Harder variant of BigCodeBench testing complex programming and library tasks with function-level code generation. | ★★★★★ |  | 🇺🇸 Claude 3.7 Sonnet (2025-02-19) | 35.8% |
| 46 | BLINK | Multimodal grounding | Evaluates visual-language grounding and reference resolution to reduce hallucinations. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 72.4% |
| 47 | BoB-HVR | Composite capability index | Hard, Versatile, and Relevant composite score across eight capability buckets. | ★★★★★ |  | 🇺🇸 Llama 3 70B | 9.0% |
| 48 | BOLD | Bias evaluation | Bias in Open-ended Language Dataset probing demographic biases in text generation. | ★★★★ |  | 🇫🇷 Mixtral 8x7B | ↓ 0.1% |
| 49 | BoolQ | Reading comprehension | Yes/no QA from naturally occurring questions. | ★★★★★ | 171 | 🇺🇸 Gemma 2 27B | 84.8% |
| 50 | BrowseComp | Web browsing | Web browsing comprehension and competence benchmark. | ★★★★ |  | 🇺🇸 GPT-5 | 54.9% |
| 51 | BrowseComp_zh | Web browsing (Chinese) | Chinese variant of the BrowseComp web browsing benchmark. | ★★★★ |  | 🇺🇸 o3 | 58.1% |
| 52 | BRuMo25 | Math (competition) | BRuMo 2025 olympiad-style mathematics benchmark. | ★★★★ |  | 🇺🇸 QuestA Nemotron 1.5B | 69.5% |
| 53 | BuzzBench | Humor analysis | A humour analysis benchmark. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 71.1% |
| 54 | C-Eval | Chinese exams | Comprehensive Chinese exam benchmark across multiple subjects. | ★★★★ |  | 🇨🇳 Kimi-K2 Base | 92.5% |
| 55 | C3-Bench | Reasoning (Chinese) | Comprehensive Chinese reasoning capability benchmark. | ★★★★ | 35 | 🇨🇳 GLM-4.5 Base | 83.1% |
| 56 | CaseLaw v2 | Legal reasoning | U.S. case law benchmark evaluating legal reasoning and judgment over court opinions. | ★★★★★ |  | 🇺🇸 GPT-4.1 | 78.1% |
| 57 | CC-OCR | OCR (cross-lingual) | Cross-lingual OCR benchmark evaluating character recognition across mixed-language documents. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 81.5% |
| 58 | CFEval | Coding ELO / contest eval | Contest-style coding evaluation with ELO-like scoring. | ★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 2134 |
| 59 | Charades-STA | Video grounding | Charades-STA temporal grounding (mIoU). | ★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 64.0% |
| 60 | ChartMuseum | Chart understanding | Large-scale curated collection of charts for evaluating parsing, grounding, and reasoning. | ★★★★ |  | 🇺🇸 GPT-5 mini | 63.3% |
| 61 | ChartQA | Chart understanding (VQA) | Visual question answering over charts and plots. | ★★★★★ |  | G MiMo-VL 7B-SFT | 92.9% |
| 62 | ChartQA-Pro | Chart understanding (VQA) | Professional-grade chart question answering with diverse chart types and complex reasoning. | ★★★★ |  | 🇨🇳 GLM-4.5V | 64.0% |
| 63 | CharXiv (DQ) | Chart description (PDF) | Scientific chart/table descriptive questions from arXiv PDFs. | ★★★★★ |  | 🇺🇸 o3-high | 95.0% |
| 64 | CharXiv (RQ) | Chart reasoning (PDF) | Scientific chart/table reasoning questions from arXiv PDFs. | ★★★★★ |  | 🇺🇸 GPT-5 | 81.1% |
| 65 | Chinese SimpleQA | QA (Chinese) | Chinese variant of the SimpleQA benchmark. | ★★★★ |  | 🇨🇳 Kimi-K2 Base | 77.6% |
| 66 | CLUEWSC | Coreference reasoning (Chinese) | Chinese Winograd Schema-style coreference benchmark from CLUE. | ★★★★ |  | 🇨🇳 DeepSeek R1 | 92.8% |
| 67 | CMath | Math (Chinese) | Chinese mathematics benchmark. | ★★★★ |  | 🇨🇳 ERNIE 4.5 424B A47B | 96.7% |
| 68 | CMMLU | Chinese multi-domain | Chinese counterpart to MMLU. | ★★★★★ | 781 | 🇨🇳 Qwen2.5 Max | 91.9% |
| 69 | Codeforces | Competitive programming | Competitive programming performance on Codeforces problems (ELO). | ★★★★★ |  | 🇺🇸 o4 mini | 2719 |
| 70 | COLLIE | Instruction following | Comprehensive instruction-following evaluation suite. | ★★★★ | 55 | 🇺🇸 GPT-5 | 99.0% |
| 71 | CommonsenseQA | Commonsense QA | Multiple-choice QA requiring commonsense knowledge. | ★★★★ |  | 🇺🇸 Llama 3.1 405B Base | 85.8% |
| 72 | CountBench | Visual counting | Object counting and numeracy benchmark for visual-language models across varied scenes. | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 93.7% |
| 73 | CountBenchQA | Visual counting QA | Visual question answering benchmark focused on counting objects across varied scenes. | ★★★★ |  | G Moondream-9B-A2B | 93.2% |
| 74 | CRAG | Retrieval QA | Complex Retrieval-Augmented Generation benchmark for grounded question answering. | ★★★★ |  | G Jamba Mini 1.6 | 76.2% |
| 75 | Creative Story-Writing Benchmark V3 | Creative writing | Story writing benchmark evaluating creativity, coherence, and style (v3). | ★★★★★ | 291 | 🇨🇳 Kimi-K2-Instruct-0905 | 8.7% |
| 76 | Longform Creative Writing | Creative writing | Longform creative writing evaluation (EQ-Bench). | ★★★★★ | 20 | 🇨🇳 DeepSeek V3 0528 | 78.9% |
| 77 | Creative Writing v3 | Creative writing | An LLM-judged creative writing benchmark. | ★★★★ | 54 | 🇺🇸 o3 | 1661 |
| 78 | CRUX-I | Code reasoning | CRUXEval input-prediction split: infer a function's input from its output. | ★★★★ |  | 🇺🇸 GPT-4 Turbo (2024-04-09) CoT | 75.7% |
| 79 | CRUX-O | Code reasoning | CRUXEval output-prediction split: infer a function's output from its input. | ★★★★★ |  | 🇺🇸 GPT-4 0613 CoT | 88.2% |
| 80 | CruxEval | Code reasoning | Code reasoning challenge set from the CruxEval benchmark (predicting function inputs and outputs). | ★★★★ |  | 🇨🇳 Qwen3-32B | 78.5% |
| 81 | CV-Bench | Computer vision QA | Diverse CV tasks for VLMs. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 89.7% |
| 82 | DeepMind Mathematics | Math reasoning | Synthetic math problem sets from DeepMind covering arithmetic, algebra, calculus, and more. | ★★★★★ |  | G Granite-4.0-H-Small | 59.3% |
| 83 | Design2Code | Coding (UI) | Translating UI designs into code. | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 93.4% |
| 84 | DesignArena | Generative design | Leaderboard tracking generative design systems across layout, branding, and marketing tasks. | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 (Thinking) | 1410 |
| 85 | DetailBench | Spot small mistakes | Evaluates whether LLMs can notice subtle errors and minor inconsistencies in text. | ★★★★ |  | 🇺🇸 Llama 4 Maverick | 8.7% |
| 86 | DocVQA | Document understanding (VQA) | Visual question answering over scanned documents. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 96.9% |
| 87 | DROP | Reading + reasoning | Discrete reasoning over paragraphs (addition, counting, comparisons). | ★★★★★ |  | 🇨🇳 DeepSeek R1 | 92.2% |
| 88 | DynaMath | Math reasoning (video) | Dynamic/video-based mathematical reasoning evaluating temporal and visual understanding. | ★★★★ |  | 🇺🇸 GPT-4o | 63.7% |
| 89 | Economically important tasks | Industry QA (cross-domain) | Evaluation suite of real-world, economically impactful tasks across key industries and workflows. | ★★★★ |  | 🇺🇸 GPT-5 | 47.1% |
| 90 | EgoSchema | Egocentric video QA | EgoSchema validation accuracy. | ★★★★ |  | 🇨🇳 Qwen2-VL 72B Instruct | 77.9% |
| 91 | EmbSpatialBench | Spatial understanding | Embodied spatial understanding benchmark evaluating navigation and localization. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 84.3% |
| 92 | Enterprise RAG | Retrieval-augmented generation | Enterprise retrieval-augmented generation evaluation covering internal knowledge bases. | ★★★★ |  | 🇺🇸 Apriel Nemotron 15B Thinker | 69.2% |
| 93 | EQ-Bench | Emotional intelligence | Benchmark assessing emotional intelligence in dialogue (EQ-Bench). | ★★★★★ | 352 | G Jan v1 2509 | 85.0% |
| 94 | EQ-Bench 3 | Emotional intelligence (roleplay) | A benchmark measuring emotional intelligence in challenging roleplays, judged by Sonnet 3.7. | ★★★★ | 21 | 🇨🇳 Kimi K2 Instruct | 1555 |
| 95 | ERQA | Spatial reasoning | Spatial recognition and reasoning QA benchmark (ERQA). | ★★★★ |  | 🇺🇸 GPT-5 | 65.7% |
| 96 | EvalPerf | Code evaluation performance | Measures performance of LLM code evaluation, including runtime, memory, and efficiency metrics. | ★★★★ |  | 🇺🇸 GPT-4o (2024-08-06) | 100.0% |
| 97 | EvalPlus | Code generation | Aggregated code evaluation suite from EvalPlus. | ★★★★★ | 1577 | 🇺🇸 o1 Mini | 89.0% |
| 98 | FACTS Grounding | Grounding / factuality | Grounded factuality benchmark evaluating model alignment with source facts. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 87.8% |
| 99 | FActScore | Hallucination rate (open-source prompts) | Measures hallucination rate on an open-source prompt suite; lower is better. | ★★★★ |  | 🇺🇸 GPT-5 | ↓ 1.0% |
| 100 | FAIX Agent | Composite capability index |  | ★★★★★ |  | 🇺🇸 Holo1.5-72B | 90.7% |
| 101 | FAIX Code | Composite capability index |  | ★★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 74.4% |
| 102 | FAIX Math | Composite capability index |  | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Instruct | 100.0% |
| 103 | FAIX OCR | Composite capability index |  | ★★★★★ |  | 🇺🇸 o3 (Low) | 86.0% |
| 104 | FAIX Safety | Composite safety index |  | ★★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 74.0% |
| 105 | FAIX STEM | Composite capability index |  | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Instruct | 100.0% |
| 106 | FAIX Text | Composite capability index |  | ★★★★★ |  | 🇨🇳 GLM-4.5V | 70.0% |
| 107 | FAIX Visual | Composite capability index |  | ★★★★★ |  | 🇺🇸 Holo1.5-72B | 91.6% |
| 108 | FAIX Writing | Composite capability index |  | ★★★★★ |  | 🇨🇳 Qwen3 235B A22B Instruct 2507 | 69.3% |
| 109 | FinanceReasoning | Financial reasoning | Financial reasoning benchmark evaluating quantitative and qualitative finance problem solving. | ★★★★★ |  | G Ling 1T | 87.5% |
| 110 | FinanceAgent | Agentic finance tasks | Interactive financial agent benchmark requiring multi-step tool use. | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 55.3% |
| 111 | FinanceBench (FullDoc) | Finance QA | FinanceBench full-document question answering benchmark requiring long-context financial understanding. | ★★★★ |  | G Jamba Mini 1.6 | 45.4% |
| 112 | FinSearchComp | Financial retrieval | Financial search and comprehension benchmark measuring retrieval-grounded reasoning over financial content. | ★★★★ |  | 🇺🇸 Grok 4 | 68.9% |
| 113 | FinSearchComp-CN | Financial retrieval (Chinese) | Chinese financial search and comprehension benchmark measuring retrieval-grounded reasoning over regional financial content. | ★★★★ |  | G doubao-1-5-vision-pro | 54.2% |
| 114 | Flame-React-Eval | Frontend coding | Front-end React coding tasks and evaluation. | ★★★★ |  | 🇨🇳 GLM-4.5V | 82.5% |
| 115 | FRAMES | Retrieval QA | Multi-hop factuality and retrieval benchmark (Factuality, Retrieval, And reasoning MEasurement Set). | ★★★★ |  | 🇨🇳 Tongyi DeepResearch | 90.6% |
| 116 | FreshQA | Recency QA | Question answering benchmark emphasizing up-to-date knowledge and recency. | ★★★★ |  | 🇨🇳 Qwen3-4B Thinking 2507 | 66.9% |
| 117 | FullStackBench | Full-stack development | End-to-end web/app development tasks and evaluation. | ★★★★★ |  | 🇺🇸 GPT-4.1 | 68.5% |
| 118 | GAIA | General AI tasks | Comprehensive benchmark for agentic tasks. | ★★★★ |  | 🇨🇳 Tongyi DeepResearch | 70.9% |
| 119 | GAIA 2 | General agent tasks | Grounded agentic intelligence benchmark version 2 covering multi-tool tasks. | ★★★★ |  | 🇺🇸 GPT-5 High | 42.1% |
| 120 | GDPVal | General capability | GDPVal benchmark evaluating broad general capabilities of LLMs across diverse tasks. | ★★★★ |  | 🇺🇸 Claude Opus 4.1 | 47.6% |
| 121 | GeoBench1 | Geospatial reasoning | Geospatial visual QA and reasoning (set 1). | ★★★★ |  | 🇨🇳 GLM-4.5V | 79.7% |
| 122 | Global-MMLU | Multi-domain knowledge (global) | Full Global-MMLU evaluation across diverse languages and regions. | ★★★★★ |  | 🇺🇸 Llama 3.3 70B | 77.8% |
| 123 | Gorilla Benchmark API Bench | Tool use | Gorilla API Bench tool-use evaluation. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 35.3% |
| 124 | GPQA | Graduate-level QA | Graduate-level question answering evaluating advanced reasoning. | ★★★★ | 406 | 🇺🇸 Grok 4 | 88.4% |
| 125 | GPQA-diamond | Graduate-level QA | Hard subset of GPQA (diamond level). | ★★★★ |  | 🇺🇸 GPT-5 pro | 89.4% |
| 126 | Ground-UI-1K | GUI grounding | Accuracy on the Ground-UI-1K grounding benchmark. | ★★★★ |  | 🇨🇳 Qwen2.5-VL 72B | 85.4% |
| 127 | GSM-Plus | Math (grade-school, enhanced) | Enhanced GSM-style grade-school math benchmark variant. | ★★★★ |  | 🇨🇳 Qwen3-4B | 82.1% |
| 128 | GSM-Symbolic | Math reasoning | Symbolic reasoning variant of GSM that tests algebraic manipulation and arithmetic with structured problems. | ★★★★★ |  | G Granite-4.0-H-Small | 87.4% |
| 129 | GSM8K | Math (grade-school) | Grade-school math word problems requiring multi-step reasoning. | ★★★★ | 1322 | 🇨🇳 Kimi K2 Instruct | 97.3% |
| 130 | GSM8K (DE) | Math (grade-school, German) | German translation of the GSM8K grade-school math word problems. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.6% |
| 131 | GSO Benchmark | Code generation | LiveCodeBench GSO benchmark. | ★★★★★ |  | 🇺🇸 o3-high | 8.8% |
| 132 | HallusionBench | Multimodal hallucination | Benchmark for evaluating hallucination tendencies in multimodal LLMs. | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 66.7% |
| 133 | HarmfulQA | Safety | Harmful question set testing models' ability to avoid unsafe answers. | ★★★★★ | 104 | G K2-THINK | 99.0% |
| 134 | HealthBench | Medical QA | Comprehensive medical knowledge and clinical reasoning benchmark across specialties and tasks. | ★★★★ |  | 🇺🇸 GPT-5 | 67.2% |
| 135 | HealthBench-Hard | Medical QA (hard) | Challenging subset of HealthBench focusing on complex, ambiguous clinical cases. | ★★★★ |  | 🇺🇸 GPT-5 | 46.2% |
| 136 | HealthBench-Hard Hallucinations | Medical hallucination safety | Measures hallucination and unsafe medical advice under hard clinical scenarios. | ★★★★ |  | 🇺🇸 GPT-5 | ↓ 1.6% |
| 137 | HellaSwag | Commonsense reasoning | Adversarial commonsense sentence completion. | ★★★★★ | 220 | 🇨🇳 DeepSeek V3 Base | 96.4% |
| 138 | HellaSwag (DE) | Commonsense reasoning (German) | German translation of the HellaSwag commonsense benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.7% |
| 139 | HELMET LongQA | Long-context QA | Long-context subset of the HELMET benchmark focusing on grounded question answering. | ★★★★ |  | G Jamba Mini 1.6 | 46.9% |
| 140 | HeroBench | Long-horizon planning | Benchmark for long-horizon planning and structured reasoning in virtual worlds. | ★★★★ |  | 🇺🇸 Grok 4 | 91.7% |
| 141 | HHEM v2.1 | Hallucination detection | Hughes Hallucination Evaluation Model (Vectara); lower is better. | ★★★★★ |  | G AntGroup Finix_S1_32b | ↓ 0.6% |
| 142 | HLE | Multi-domain reasoning | Challenging LLMs at the frontier of human knowledge. | ★★★★★ | 1085 | 🇨🇳 Tongyi DeepResearch | 32.9% |
| 143 | HMMT | Math (competition) | Harvard–MIT Mathematics Tournament problems. | ★★★★ |  | 🇺🇸 GPT-5 pro | 100.0% |
| 144 | HMMT 2025 | Math (competition) | Harvard–MIT Mathematics Tournament 2025 problems. | ★★★★★ |  | 🇺🇸 Grok 4 Fast | 93.3% |
| 145 | HMMT25 | Math (competition) | Harvard–MIT Mathematics Tournament 2025 benchmark. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Thinking | 67.6% |
| 146 | HRBench 4K | High-resolution perception | High-resolution image perception benchmark on 4K-resolution images (HR-Bench). | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 89.5% |
| 147 | HRBench 8K | High-resolution perception | High-resolution image perception benchmark on 8K-resolution images (HR-Bench). | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 82.5% |
| 148 | HumanEval | Code generation | Python synthesis problems evaluated by unit tests. | ★★★★ | 2916 | 🇺🇸 o1-preview | 96.3% |
| 149 | HumanEval+ | Code generation | Extended HumanEval with more tests. | ★★★★★ | 1577 | 🇺🇸 Claude Sonnet 4 | 94.5% |
| 150 | Hypersim | 3D scene understanding | Hypersim benchmark for synthetic indoor scene understanding and reconstruction. | ★★★★ |  | 🇺🇸 GPT-5 Mini Minimal | 39.3% |
| 151 | IFBench | Instruction following | Instruction-following benchmark measuring compliance and adherence. | ★★★★★ | 70 | 🇫🇷 Mistral Small 3.2 24B Instruct | 84.8% |
| 152 | IFEval | Instruction following | Instruction-following capability evaluation for LLMs. | ★★★★ | 36312 | 🇺🇸 o3 mini-high | 93.9% |
| 153 | INCLUDE | Inclusiveness / bias | Evaluates inclusive language use and bias mitigation in model outputs. | ★★★★★ |  | 🇺🇸 Gemini-2.5-Flash Thinking | 83.9% |
| 154 | InfoQA | Information-seeking QA | Information retrieval question answering benchmark evaluating factual responses. | ★★★★ |  | 🇨🇳 Qwen2-VL-72B | 84.5% |
| 155 | InfoVQA | Infographic VQA | Visual question answering over infographics requiring reading, counting, and reasoning. | ★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 91.2% |
| 156 | JudgeMark v2.1 | LLM judging ability | A benchmark measuring LLM judging ability. | ★★★★ |  | 🇺🇸 Claude Sonnet 4 | 82.0% |
| 157 | KMMLU-Pro | Multilingual knowledge | Korean Massive Multitask Language Understanding, Pro variant. | ★★★★★ |  | 🇺🇸 o1 | 77.5% |
| 158 | KMMLU-Redux | Multilingual knowledge | Redux variant of the KMMLU benchmark. | ★★★★★ |  | 🇺🇸 o1 | 81.1% |
| 159 | KOR-Bench | Reasoning | Comprehensive reasoning benchmark spanning diverse domains and cognitive skills. | ★★★★★ |  | G Ling 1T | 76.0% |
| 160 | KSM | Multilingual math | Korean STEM and math benchmark. | ★★★★★ |  | G EXAONE Deep 2.4B | 60.9% |
| 161 | LAMBADA | Language modeling | Word prediction requiring broad context understanding. | ★★★★★ |  | 🇺🇸 GPT-3 | 86.4% |
| 162 | LatentJailbreak | Safety / jailbreak | Robustness to latent jailbreak adversarial techniques. | ★★★★★ | 39 | 🇺🇸 GPT-3.5-turbo | 77.4% |
| 163 | LiveBench | General capability | Continually updated capability benchmark across diverse tasks. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 82.4% |
| 164 | LiveBench 20241125 | General capability | LiveBench snapshot (2024-11-25) tracking mixed-task evals. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 82.4% |
| 165 | LiveCodeBench | Code generation | Live coding and execution-based evaluation benchmark (v6 dataset). | ★★★★ |  | 🇺🇸 GPT-5 mini | 86.6% |
| 166 | LiveCodeBench v5 (2024.10-2025.02) | Code generation | LiveCodeBench v5 snapshot covering Oct 2024 to Feb 2025. | ★★★★★ |  | 🇨🇳 Qwen3-235B-A22B | 70.7% |
| 167 | LiveMCP-101 | Agent real-time eval | A novel real-time evaluation framework and benchmark to stress-test agents on complex, real-world tasks. | ★★★★ |  | 🇺🇸 GPT-5 | 58.4% |
| 168 | LMArena Text | Crowd eval (text) | Chatbot Arena text leaderboard (Elo ratings). | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 1455 |
| 169 | LMArena Vision | Crowd eval (vision) | Chatbot Arena vision evaluation leaderboard (ELO ratings). | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 1242 |
| 170 | LogicVista | Visual logical reasoning | Visual logic and pattern reasoning tasks requiring compositional and spatial understanding. | ★★★★ |  | 🇨🇳 GLM-4.5V | 62.4% |
| 171 | LogiQA | Logical reasoning | Reading comprehension with logical reasoning. | ★★★★★ | 138 | G Pythia 70M | 23.5% |
| 172 | LongBench | Long-context eval | Long-context understanding across tasks. | ★★★★★ | 957 | G Jamba Mini 1.6 | 32.0% |
| 173 | LongFact-Concepts | Hallucination rate (open-source prompts) | Long-form factuality eval focused on conceptual statements; lower is better. | ★★★★ |  | 🇺🇸 GPT-5 | ↓ 0.7% |
| 174 | LongFact-Objects | Hallucination rate (open-source prompts) | Long-form factuality eval focused on object/entity references; lower is better. | ★★★★ |  | 🇺🇸 GPT-5 | ↓ 0.8% |
| 175 | LVBench | Video understanding | Long video understanding benchmark (LVBench). | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 73.0% |
| 176 | M3GIA (CN) | Chinese multimodal QA | Chinese-language M3GIA benchmark covering grounded multimodal question answering. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 91.2% |
| 177 | Mantis | Multimodal reasoning | Multimodal reasoning and instruction following benchmark (Mantis). | ★★★★★ |  | G dots.vlm1 | 86.2% |
| 178 | MASK | Safety / red teaming | Model behavior safety assessment via red-teaming scenarios. | ★★★★ |  | 🇺🇸 Claude Sonnet 4 (t) | 95.3% |
| 179 | MATH | Math (competition) | Competition-level mathematics across algebra, geometry, number theory, combinatorics. | ★★★★★ | 1185 | 🇺🇸 o3 mini | 97.9% |
| 180 | MATH Level 5 | Math (competition) | Level 5 subset of the MATH benchmark emphasizing the hardest competition-style problems. | ★★★★ |  | 🇨🇳 Qwen3-4B-Instruct-2507 | 73.6% |
| 181 | MATH500 | Math (competition) | 500 curated math problems for evaluating high-level reasoning. | ★★★★★ |  | 🇺🇸 GPT-5 | 99.2% |
| 182 | MATH500 (ES) | Math (multilingual) | Spanish MATH500 benchmark. | ★★★★★ |  | G EXAONE 4.0 1.2B | 88.8% |
| 183 | MathVerse-mini | Math reasoning (multimodal) | Compact MathVerse split focusing on single-image math puzzles and visual reasoning. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 85.0% |
| 184 | MathVerse-Vision | Math reasoning (multimodal) | Multi-image visual mathematical reasoning tasks from the MathVerse ecosystem. | ★★★★ |  | 🇨🇳 GLM-4.5V | 72.1% |
| 185 | MathVision | Math reasoning (multimodal) | Visual math reasoning benchmark with problems that combine images (charts, diagrams) and text. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 74.6% |
| 186 | MathVista | Multimodal math reasoning | Visual math reasoning across diverse tasks. | ★★★★ |  | 🇺🇸 o3 | 86.8% |
| 187 | MathVista-Mini | Math reasoning (multimodal) | Lightweight subset of MathVista for quick evaluation of visual mathematical reasoning. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 85.8% |
| 188 | MBPP | Code generation | Short Python problems with hidden tests. | ★★★★ | 36312 | 🇺🇸 o1-preview | 95.5% |
| 189 | MBPP+ | Code generation | Extended MBPP with more tests and stricter evaluation. | ★★★★★ |  | 🇺🇸 Llama 3.1 405B | 88.6% |
| 190 | MCP Universe | Agent evaluation | Benchmarks multi-step tool-use agents across diverse task suites with a unified overall success metric. | ★★★★ |  | 🇺🇸 GPT-5 High | 44.2% |
| 191 | MCPMark | Agent tool-use (MCP) | Benchmark for Model Context Protocol (MCP) agent tool-use. | ★★★★★ | 127 | 🇺🇸 GPT-5 | 46.9% |
| 192 | MGSM | Math (multilingual) | Multilingual grade-school math word problems. | ★★★★★ |  | 🇺🇸 Claude Opus 4.1 (2025-08-05) Thinking | 94.4% |
| 193 | MIABench | Multimodal instruction following | Multimodal instruction-following benchmark evaluating accuracy on complex image-text tasks. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 92.7% |
| 194 | Minerva Math | University-level math | Advanced quantitative reasoning set inspired by the Minerva benchmark for STEM problem solving. | ★★★★★ |  | G Granite-4.0-H-Small | 74.0% |
| 195 | MiniF2F (Test) | Formal math (olympiad) | Formal theorem-proving benchmark of olympiad-style problems (test split). | ★★★★ |  | 🇨🇳 LongCat-Flash-Thinking | 81.6% |
| 196 | MixEval | Multi-task reasoning | Mixed-subject benchmark covering knowledge and reasoning tasks across domains. | ★★★★★ |  | 🇺🇸 o1 Mini | 82.9% |
| 197 | MixEval Hard | Multi-task reasoning (hard) | Hard subset of MixEval covering diverse reasoning tasks. | ★★★★ |  | 🇨🇳 Qwen3-4B | 31.6% |
| 198 | MLVU | Large video understanding | MLVU: large-scale multi-task benchmark for video understanding. | ★★★★ |  | 🇺🇸 GPT-5 | 86.2% |
| 199 | MM-MT-Bench | Multimodal instruction following | Multi-turn multimodal instruction following benchmark evaluating dialogue quality and helpfulness. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 8.5% |
| 200 | MMBench v1.1 (EN dev) | General VQA | English dev split of MMBench v1.1 measuring multimodal question answering. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 90.6% |
| 201 | MMBench v1.1 (CN) | Multimodal understanding (Chinese) | MMBench v1.1 Chinese subset for evaluating multimodal LLMs. | ★★★★★ |  | 🇨🇳 Keye-VL 8B | 89.8% |
| 202 | MMBench v1.1 (EN) | Multimodal understanding (English) | MMBench v1.1 English subset for evaluating multimodal LLMs. | ★★★★★ |  | 🇨🇳 Keye-VL 8B | 89.7% |
| 203 | MME-RealWorld (cn) | Real-world perception (CN) | MME-RealWorld Chinese split. | ★★★★ |  | 🇺🇸 GPT-4o | 58.5% |
| 204 | MME-RealWorld (en) | Real-world perception (EN) | MME-RealWorld English split. | ★★★★ |  | G MiMo-VL 7B-RL | 59.1% |
| 205 | MMLongBench-Doc | Long-context multimodal documents | Evaluates long-context document understanding with mixed text, tables, and figures across multiple pages. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 56.2% |
| 206 | MMLU | Multi-domain knowledge | 57 tasks spanning STEM, humanities, social sciences; broad knowledge and reasoning. | ★★★★★ | 1488 | 🇺🇸 GPT-5 | 93.5% |
| 207 | MMLU (cloze) | Multi-domain knowledge (cloze) | Cloze-form MMLU evaluation variant. | ★★★★ |  | 🇺🇸 SmolLM2 135M Base | 31.5% |
| 208 | Full Text MMLU | Multi-domain knowledge (long-form) | Full-context MMLU variant evaluating reasoning over long passages. | ★★★★ |  | 🇺🇸 Llama 3.3 70B Instruct | 0.8% |
| 209 | MMLU-Pro | Multi-domain knowledge | Harder successor to MMLU with more challenging questions. | ★★★★★ | 286 | 🇺🇸 o1 | 89.3% |
| 210 | MMLU Pro MCF | Multi-domain knowledge (few-shot) | MMLU-Pro in multiple-choice formulation (MCF), few-shot evaluation. | ★★★★ |  | 🇨🇳 Qwen3-4B-Base | 41.1% |
| 211 | MMLU-ProX | Multi-domain knowledge | Cross-lingual and robust variant of MMLU-Pro. | ★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 81.0% |
| 212 | MMLU-Redux | Multi-domain knowledge | Updated MMLU-style evaluation with revised questions and scoring. | ★★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 93.8% |
| 213 | MMLU-STEM | STEM knowledge | STEM subset of MMLU. | ★★★★★ | 1488 | G Falcon-H1-34B-Instruct | 83.6% |
| 214 | MMMLU | Multi-domain knowledge (multilingual) | Massively multilingual MMLU-style evaluation across many languages. | ★★★★★ |  | 🇺🇸 Claude Opus 4.1 | 89.5% |
| 215 | MMMLU (ES) | Multilingual knowledge | Spanish MMMLU benchmark. | ★★★★★ |  | 🇺🇸 SmolLM 3 3B | 64.7% |
| 216 | MMMU | Multimodal understanding | Multi-discipline multimodal understanding benchmark. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 84.2% |
| 217 | MMMU PRO | Multimodal understanding (hard) | Professional/advanced subset of MMMU for multimodal reasoning. | ★★★★ |  | 🇺🇸 GPT-5 | 78.4% |
| 218 | MMMU-Pro (vision) | Multimodal understanding (vision) | MMMU-Pro vision-only setting. | ★★★★ |  | 🇺🇸 Claude 3.7 Sonnet | 45.8% |
| 219 | MMStar | Multimodal reasoning | Broad evaluation of multimodal LLMs across diverse tasks. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 78.7% |
| 220 | MMVP | Multimodal perception | Benchmark of visual patterns that vision-language models systematically misperceive (MMVP). | ★★★★★ |  | 🇨🇳 R-4B-RL | 80.7% |
| 221 | MMVU | Video understanding | Multimodal video understanding benchmark (MMVU). | ★★★★ |  | 🇨🇳 GLM-4.5V | 68.7% |
| 222 | MotionBench | Video motion understanding | Video motion and temporal reasoning benchmark. | ★★★★ |  | 🇨🇳 GLM-4.5V | 62.4% |
| 223 | MT-Bench | Chat ability | Multi-turn chat evaluation via GPT-4 grading. | ★★★★★ | 39074 | 🇺🇸 Apriel Nemotron 15B Thinker | 85.7% |
| 224 | MTOB (full book) | Long-context translation | Machine Translation from One Book: translating a low-resource language from a grammar book given in context (full-book setting). | ★★★★★ |  | 🇺🇸 Llama 4 Maverick | 50.8% |
| 225 | MTOB (half book) | Long-context translation | Machine Translation from One Book, half-book setting. | ★★★★★ |  | 🇺🇸 Llama 4 Maverick | 54.0% |
| 226 | MUIRBENCH | Multimodal robustness | Evaluates multimodal understanding robustness and reliability. | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 80.1% |
| 227 | Multi-IF | Instruction following (multi-task) | Composite instruction-following evaluation across multiple tasks. | ★★★★ |  | 🇺🇸 o3 mini-high | 79.5% |
| 228 | Multi-IFEval | Instruction following (multi-task) | Multi-task variant of instruction-following evaluation. | ★★★★★ |  | 🇺🇸 Llama 3.3 70B | 88.7% |
| 229 | Multi-SWE-Bench | Code repair (multi-repo) | Multi-repository SWE-Bench variant. | ★★★★★ | 246 | 🇺🇸 Claude Sonnet 4 | 35.7% |
| 230 | MultiChallenge | Multi-task reasoning | Composite benchmark across diverse challenges by Scale AI. | ★★★★ |  | 🇺🇸 GPT-5 | 69.6% |
| 231 | MultiPL-E | Code generation (multilingual) | Multilingual code generation and execution benchmark across many programming languages. | ★★★★★ | 269 | 🇨🇳 Qwen3-235B-A22B | 87.9% |
| 232 | MultiPL-E HumanEval | Code generation (multilingual) | MultiPL-E variant of HumanEval tasks. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 75.2% |
| 233 | MultiPL-E MBPP | Code generation (multilingual) | MultiPL-E variant of MBPP tasks. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 65.7% |
| 234 | MuSR | Reasoning | Multistep Soft Reasoning. | ★★★★★ |  | 🇨🇳 ERNIE 4.5 424B A47B | 69.9% |
| 235 | MVBench | Video QA | Multi-view or multi-video QA benchmark (MVBench). | ★★★★ |  | 🇨🇳 GLM-4.5V | 73.0% |
| 236 | Natural2Code | Code generation | Natural language to code benchmark for instruction-following synthesis. | ★★★★ |  | 🇺🇸 Gemini 2.0 Flash | 92.9% |
| 237 | NaturalQuestions | Open-domain QA | Google NQ; real user questions with long/short answers. | ★★★★ |  | 🇫🇷 Mixtral 8x22B | 40.1% |
| 238 | Nexus (0-shot) | Tool use | Nexus tool-use benchmark, zero-shot setting. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 58.7% |
| 239 | Objectron | Object detection | Objectron benchmark for 3D object detection in video captures. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 71.2% |
| 240 | OCRBench | OCR (vision text extraction) | Optical character recognition benchmark evaluating text extraction from images, documents, and complex layouts. | ★★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 90.3% |
| 241 | OCRBenchV2 (CN) | OCR (Chinese) | OCRBenchV2 Chinese subset assessing OCR performance on Chinese-language documents. | ★★★★ |  | 🇨🇳 Qwen2.5-VL 72B Instruct | 63.7% |
| 242 | OCRBenchV2 (EN) | OCR (English) | OCRBenchV2 English subset evaluating OCR accuracy on English documents and layouts. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 66.8% |
| 243 | OCRReasoning | OCR reasoning | OCR reasoning benchmark combining text extraction with multi-step reasoning over documents. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 70.8% |
| 244 | ODinW-13 | Object detection (in the wild) | Object Detection in the Wild benchmark covering 13 real-world domains. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 47.5% |
| 245 | OJBench | Code generation (online judge) | Programming problems evaluated via online judge-style execution. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 41.6% |
| 246 | OlympiadBench | Math (olympiad) | Advanced mathematics olympiad-style problem benchmark. | ★★★★ |  | 🇨🇳 Hunyuan-7B-Instruct | 76.5% |
| 247 | OlympicArena | Math (competition) | Olympiad-style mathematics reasoning benchmark. | ★★★★ |  | 🇨🇳 DeepSeek V3 | 76.2% |
| 248 | Omni-MATH | Math reasoning | Omni-MATH benchmark covering diverse math reasoning tasks across difficulty levels. | ★★★★★ |  | G Ling 1T | 74.5% |
| 249 | Omni-MATH-HARD | Math | Challenging math benchmark (Omni-MATH-HARD). | ★★★★★ |  | 🇺🇸 GPT-5 High | 73.6% |
| 250 | OmniSpatial | Spatial reasoning | Spatial understanding and reasoning benchmark (OmniSpatial). | ★★★★ |  | 🇨🇳 GLM-4.5V | 51.0% |
| 251 | OpenBookQA | Science QA | Open-book multiple-choice science questions with supporting facts. | ★★★★★ | 128 | 🇨🇳 Qwen3 | 96.4% |
| 252 | OpenRewrite-Eval | Rewrite quality | OpenRewrite evaluation; micro-averaged ROUGE-L. | ★★★★ |  | 🇨🇳 Qwen2.5 1.5B Instruct | 46.9% |
| 253 | OptMATH | Math optimization reasoning | OptMATH benchmark targeting challenging math optimization and problem-solving tasks. | ★★★★★ |  | G Ling 1T | 57.7% |
| 254 | OSWorld | GUI agents | Agentic GUI task completion and grounding on desktop environments. | ★★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 61.4% |
| 255 | OSWorld-G | GUI agents | OSWorld-G center accuracy (no_refusal). | ★★★★ |  | 🇺🇸 Holo1.5-72B | 71.8% |
| 256 | OSWorld2 | GUI agents | Second-generation OSWorld GUI agent benchmark. | ★★★★ |  | 🇨🇳 GLM-4.5V | 35.8% |
| 257 | PIQA | Physical commonsense | Physical commonsense about everyday tasks and object affordances. | ★★★★ |  | 🇨🇳 GLM-4.5 Base | 87.1% |
| 258 | PixmoCount | Visual counting | Counting objects/instances in images (PixmoCount). | ★★★★ |  | 🇺🇸 Molmo-72B | 85.2% |
| 259 | PolyMATH | Math reasoning | Polyglot mathematics benchmark assessing cross-topic math reasoning. | ★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 60.1% |
| 260 | POPE | Hallucination detection | Vision-language hallucination benchmark focusing on object existence verification. | ★★★★ |  | G Moondream-9B-A2B | 89.0% |
| 261 | PopQA | Knowledge / QA | Open-domain QA over long-tail entities, testing factual recall across popularity levels. | ★★★★★ |  | 🇺🇸 Llama 3.1 8B Instruct | 28.8% |
| 262 | QuAC | Conversational QA | Question answering in context. | ★★★★ |  | 🇺🇸 Llama 3.1 405B Base | 53.6% |
| 263 | QuALITY | Long-context reading comprehension | Long-document multiple-choice reading comprehension benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.5% |
| 264 | RACE | Reading comprehension | English exams for middle and high school. | ★★★★ |  | G RND1-Base-0910 | 57.6% |
| 265 | RealWorldQA | Real-world visual QA | Visual question answering with real-world images and scenarios. | ★★★★★ |  | 🇺🇸 GPT-5 | 82.8% |
| 266 | RefCOCO | Referring expressions | RefCOCO average accuracy at IoU 0.5 (val). | ★★★★★ |  | 🇨🇳 InternVL3.5-4B | 92.4% |
| 267 | RefCOCOg | Referring expressions | RefCOCOg average accuracy at IoU 0.5 (val). | ★★★★ |  | G Moondream-9B-A2B | 88.6% |
| 268 | RefCOCO+ | Referring expressions | RefCOCO+ accuracy at IoU 0.5 on the val split. | ★★★★ |  | G Moondream-9B-A2B | 81.8% |
| 269 | RefSpatialBench | Spatial reasoning | Reference spatial understanding benchmark covering spatial grounding tasks. | ★★★★ |  | 🇨🇳 Qwen2.5-VL 72B Instruct | 72.1% |
| 270 | RepoBench | Code understanding | Repository-level code comprehension and reasoning benchmark. | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 83.8% |
| 271 | RoboSpatialHome | Embodied spatial understanding | RoboSpatialHome benchmark for embodied spatial reasoning in domestic environments. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 73.9% |
| 272 | Roo Code Evals | Code assistant eval | Community-maintained coding evals and leaderboard by Roo Code. | ★★★★★ |  | 🇺🇸 GPT-5 mini | 99.0% |
| 273 | Ruler 128k | Long-context eval | RULER benchmark at 128k context window. | ★★★★ |  | 🇫🇷 Mistral Medium 3 | 90.2% |
| 274 | Ruler 32k | Long-context eval | RULER benchmark at 32k context window. | ★★★★★ |  | 🇫🇷 Mistral Medium 3 | 96.0% |
| 275 | SALAD-Bench | Safety alignment | Safety Alignment and Dangerous-behavior benchmark evaluating harmful assistance and refusal consistency. | ★★★★★ |  | G Granite-4.0-H-Micro | ↓ 96.8% |
| 276 | SciCode (sub) | Code | SciCode subset score (sub). | ★★★★★ |  | 🇺🇸 Grok 4 | 45.7% |
| 277 | SciCode (main) | Code | SciCode main score. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 15.4% |
| 278 | ScienceQA | Science QA (multimodal) | Multiple-choice science questions with images, diagrams, and text context. | ★★★★ |  | G FastVLM-7B | 96.7% |
| 279 | SciQ | Science QA | Multiple-choice science questions. | ★★★★ |  | G Pythia 12B | 92.9% |
| 280 | ScreenQA Complex | GUI QA | Complex ScreenQA benchmark accuracy. | ★★★★ |  | 🇺🇸 Holo1.5-72B | 87.1% |
| 281 | ScreenQA Short | GUI QA | Short-form ScreenQA benchmark accuracy. | ★★★★ |  | 🇺🇸 Holo1.5-72B | 91.9% |
| 282 | ScreenSpot | Screen UI locators | Center accuracy on ScreenSpot. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 95.4% |
| 283 | ScreenSpot-Pro | Screen UI locators | Average center accuracy on ScreenSpot-Pro. | ★★★★ |  | 🇺🇸 Holo1.5-72B | 63.2% |
| 284 | ScreenSpot-v2 | Screen UI locators | Center accuracy on ScreenSpot-v2. | ★★★★ |  | G UI-Venus 72B | 95.3% |
| 285 | SEED-Bench-2-Plus | Multimodal evaluation | SEED-Bench-2-Plus overall accuracy. | ★★★★ |  | 🇺🇸 Claude 3.7 Sonnet | 72.9% |
| 286 | SEED-Bench-Img | Multimodal image understanding | SEED-Bench image-only subset (SEED-Bench-Img). | ★★★★ |  | G Bagel 14B | 78.5% |
| 287 | Showdown | GUI agents | Success rate on the Showdown UI interaction benchmark. | ★★★★ |  | 🇺🇸 Holo1.5-72B | 76.8% |
| 288 | SIFO | Instruction following | Single-turn instruction following benchmark. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Thinking | 66.9% |
| 289 | SIFO Multiturn | Instruction following | Multi-turn SIFO benchmark for sustained instruction adherence. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Thinking | 60.3% |
| 290 | SimpleQA | QA | Simple question answering benchmark. | ★★★★★ |  | 🇨🇳 DeepSeek V3.2-Exp | 97.1% |
| 291 | SimpleVQA | General VQA | Lightweight visual question answering set with everyday scenes. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 65.4% |
| 292 | SimpleVQA-DS | General VQA | SimpleVQA variant curated by DeepSeek with everyday image question answering tasks. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 61.3% |
| 293 | SocialIQA | Social commonsense | Social interaction commonsense QA. | ★★★★ |  | 🇺🇸 Gemma 3 PT 27B | 54.9% |
| 294 | Spider | Text-to-SQL | Complex text-to-SQL benchmark over cross-domain databases. | ★★★★ |  | 🇺🇸 Llama 3 70B Base | 67.1% |
| 295 | Spiral-Bench | Safety / sycophancy | An LLM-judged benchmark measuring sycophancy and delusion reinforcement. | ★★★★ |  | 🇺🇸 GPT-5 | 87.0% |
| 296 | SQuAD v1.1 | Reading comprehension | Extractive QA from Wikipedia articles. | ★★★★★ | 566 | 🇺🇸 Llama 3.1 405B Base | 89.3% |
| 297 | SUNRGBD | 3D scene understanding | SUN RGB-D benchmark for indoor scene understanding from RGB-D imagery. | ★★★★ |  | 🇺🇸 GPT-5 Mini Minimal | 45.8% |
| 298 | SuperGPQA | Graduate-level QA | Harder GPQA variant assessing advanced graduate-level reasoning. | ★★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 64.9% |
| 299 | SWE-Bench | Code repair | Software engineering benchmark across many repos and issues. | ★★★★★ | 3442 | 🇺🇸 GPT-5 Codex | 74.5% |
| 300 | SWE-Bench Multilingual | Code repair (multilingual) | Multilingual variant of SWE-Bench for issue fixing. | ★★★★★ |  | 🇨🇳 DeepSeek V3.2-Exp | 57.9% |
| 301 | SWE-Bench Pro (Public) | Software engineering | Public subset of the SWE-Bench Pro benchmark for software-engineering agents. | ★★★★ |  | 🇺🇸 GPT-5 | 23.3% |
| 302 | SWE-Bench Verified | Code repair | Verified subset of SWE-Bench for issue fixing. | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 77.2% |
| 303 | SWE-Dev | Code repair | Software engineering development and bug fixing benchmark. | ★★★★★ |  | 🇺🇸 Claude Sonnet 4 | 67.1% |
| 304 | SysBench | System prompts | System prompt understanding and adherence benchmark. | ★★★★ |  | 🇺🇸 GPT-4.1 | 74.1% |
| 305 | τ²-Bench (airline) | Industry QA (airline) | τ²-Bench airline domain evaluation. | ★★★★★ |  | 🇺🇸 o3 | 64.8% |
| 306 | τ²-Bench (retail) | Industry QA (retail) | τ²-Bench retail domain evaluation. | ★★★★★ |  | 🇺🇸 Claude Opus 4.1 | 82.4% |
| 307 | τ²-Bench (telecom) | Industry QA (telecom) | τ²-Bench telecom domain evaluation. | ★★★★ |  | 🇺🇸 GPT-5 | 96.7% |
| 308 | TAU1-Airline | Agent tasks (airline) | Tool-augmented agent evaluation in airline scenarios (TAU1). | ★★★★ |  | 🇺🇸 Gemini-2.5-Flash Thinking | 54.0% |
| 309 | TAU1-Retail | Agent tasks (retail) | Tool-augmented agent evaluation in retail scenarios (TAU1). | ★★★★ |  | 🇨🇳 Qwen3-235B-A22B | 71.3% |
| 310 | TAU2-Airline | Agent tasks (airline) | Tool-augmented agent evaluation in airline scenarios (TAU2). | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 70.0% |
| 311 | TAU2-Retail | Agent tasks (retail) | Tool-augmented agent evaluation in retail scenarios (TAU2). | ★★★★ |  | 🇺🇸 Claude Opus 4.1 | 86.8% |
| 312 | TAU2-Telecom | Agent tasks (telecom) | Tool-augmented agent evaluation in telecom scenarios (TAU2). | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 98.0% |
| 313 | Terminal-Bench | Agent terminal tasks | Command-line task completion benchmark for agents. | ★★★★ | 637 | 🇺🇸 Claude Sonnet 4.5 (Thinking) | 61.3% |
| 314 | Terminal-Bench Hard | Agent terminal tasks | Hard subset of Terminal-Bench command-line agent tasks. | ★★★★★ |  | 🇺🇸 Grok 4 | 37.6% |
| 315 | TextVQA | Text-based VQA | Visual question answering that requires reading text in images. | ★★★★ |  | 🇨🇳 Qwen2-VL-72B | 85.5% |
| 316 | TreeBench | Reasoning with tree structures | Evaluates hierarchical/tree-structured reasoning and planning capabilities in LLMs/VLMs. | ★★★★ |  | 🇨🇳 GLM-4.5V | 50.1% |
| 317 | TriQA | Knowledge QA | Triadic question answering benchmark evaluating world knowledge and reasoning. | ★★★★ |  | 🇫🇷 Mixtral 8x22B | 82.2% |
| 318 | TriviaQA | Open-domain QA | Open-domain question answering benchmark built from trivia and web evidence. | ★★★★★ |  | 🇺🇸 Gemma 3 PT 27B | 85.5% |
| 319 | TriviaQA-Wiki | Open-domain QA | TriviaQA subset answering using Wikipedia evidence. | ★★★★ |  | 🇺🇸 Llama 3.1 405B Base | 91.8% |
| 320 | TruthfulQA | Truthfulness / hallucination | Measures whether a model imitates human falsehoods (truthfulness). | ★★★★★ |  | 🇨🇳 Qwen2.5 32B Instruct | 70.3% |
| 321 | TruthfulQA (DE) | Truthfulness / hallucination (German) | German translation of the TruthfulQA benchmark. | ★★★★ |  | 🇺🇸 Llama 3.3 70B Instruct | 0.2% |
| 322 | TydiQA | Cross-lingual QA | Typologically diverse QA across languages. | ★★★★★ | 313 | 🇺🇸 Llama 3.1 405B Base | 34.3% |
| 323 | V* | Multimodal reasoning | V* benchmark accuracy. | ★★★★ |  | G MiMo-VL 7B-RL | 81.7% |
| 324 | VCT | Virology capability (protocol troubleshooting) | Virology Capabilities Test: a benchmark that measures an LLM's ability to troubleshoot complex virology laboratory protocols. | ★★★★ |  | 🇺🇸 o3 | 43.8% |
| 325 | VibeEval | Aesthetic/visual quality | VLM aesthetic evaluation with GPT scores. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 76.4% |
| 326 | Video-MME | Video understanding (multimodal) | Multimodal evaluation of video understanding and reasoning. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 74.5% |
| 327 | VideoMME (w/o sub) | Video understanding | Video understanding benchmark without subtitles. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 85.1% |
| 328 | VideoMME (w/sub) | Video understanding | Video understanding benchmark with subtitles. | ★★★★ |  | 🇨🇳 GLM-4.5V | 80.7% |
| 329 | VideoMMMU | Multimodal video understanding | Video-based extension of MMMU evaluating temporal multimodal reasoning and perception across disciplines. | ★★★★★ |  | 🇺🇸 GPT-5 | 84.6% |
| 330 | VisualWebBench | Web UI understanding | Average accuracy on VisualWebBench. | ★★★★ |  | 🇺🇸 Holo1.5-72B | 83.8% |
| 331 | VisuLogic | Visual logical reasoning | Logical reasoning and compositionality benchmark for visual-language models. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 35.9% |
| 332 | VitaBench | Industry QA | Industry-focused benchmark evaluating domain QA performance. | ★★★★ |  | 🇺🇸 o3 | 35.3% |
| 333 | VL-RewardBench | Reward modeling (VL) | Reward alignment benchmark for VLMs. | ★★★★ |  | 🇺🇸 Claude 3.7 Sonnet | 67.4% |
| 334 | VLMs are Biased | Multimodal bias | Evaluates whether VLMs truly 'see' vs. relying on memorized knowledge; measures bias toward non-visual priors. | ★★★★ | 90 | 🇺🇸 o4 mini | 20.2% |
| 335 | VLMs are Blind | Visual grounding robustness | Evaluates failure modes of VLMs in grounding and perception tasks. | ★★★★ |  | G MiMo-VL 7B-RL | 79.4% |
| 336 | VoiceBench AdvBench | VoiceBench | VoiceBench adversarial safety evaluation. | ★★★★ |  | 🇨🇳 Qwen3-Omni-30B-A3B-Thinking | 99.4% |
| 337 | VoiceBench AlpacaEval | VoiceBench | VoiceBench evaluation on AlpacaEval instructions. | ★★★★ |  | 🇨🇳 Qwen3-Omni-Flash-Thinking | 96.8% |
| 338 | VoiceBench BBH | VoiceBench | VoiceBench evaluation on BIG-Bench Hard prompts. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 92.6% |
| 339 | VoiceBench CommonEval | VoiceBench | VoiceBench evaluation on CommonEval. | ★★★★ |  | 🇨🇳 Qwen3-Omni-Flash-Instruct | 91.0% |
| 340 | VoiceBench IFEval | VoiceBench | VoiceBench instruction-following evaluation (IFEval). | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 85.7% |
| 341 | MMAU v05.15.25 | Audio reasoning | Audio reasoning benchmark MMAU v05.15.25. | ★★★★ |  | 🇨🇳 Qwen3-Omni-Flash-Instruct | 77.6% |
| 342 | VoiceBench MMSU | VoiceBench | VoiceBench MMSU benchmark (voice modality). | ★★★★ |  | 🇨🇳 Qwen3-Omni-Flash-Thinking | 84.3% |
| 343 | VoiceBench MMSU (Audio) | Audio reasoning | Audio reasoning MMSU results. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 77.7% |
| 344 | VoiceBench OpenBookQA | VoiceBench | VoiceBench results on OpenBookQA prompts. | ★★★★ |  | 🇨🇳 Qwen3-Omni-Flash-Thinking | 95.0% |
| 345 | VoiceBench Overall | VoiceBench | Overall VoiceBench aggregate score. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 89.6% |
| 346 | VoiceBench SD-QA | VoiceBench | VoiceBench Spoken Dialogue QA results. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 90.1% |
| 347 | VoiceBench WildVoice | VoiceBench | VoiceBench evaluation on the WildVoice dataset. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 93.4% |
| 348 | VQAv2 | Visual question answering | Standard Visual Question Answering v2 benchmark on natural images. | ★★★★ |  | 🇺🇸 Molmo-72B | 86.5% |
| 349 | VSI-Bench | Spatial intelligence | Visual spatial intelligence benchmark covering 3D reasoning and spatial inference tasks. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 63.2% |
| 350 | WebClick | GUI agents | Task success on the WebClick UI agent benchmark. | ★★★★ |  | 🇺🇸 Claude Sonnet 4 | 93.0% |
| 351 | WebDev Arena | Web development agents | Arena evaluation for autonomous web development agents. | ★★★★★ |  | 🇺🇸 GPT-5 | 1483 |
| 352 | WebQuest-MultiQA | Web agents | Multi-question web search and interaction tasks. | ★★★★ |  | 🇨🇳 GLM-4.5V | 60.6% |
| 353 | WebQuest-SingleQA | Web agents | Single-question web search and interaction tasks. | ★★★★ |  | 🇨🇳 GLM-4.5V | 76.9% |
| 354 | WebSrc | Web QA | Webpage question answering (SQuAD F1). | ★★★★ |  | 🇺🇸 Holo1.5-72B | 97.2% |
| 355 | WebVoyager2 | Web agents | Web navigation and interaction tasks for LLM agents (v2). | ★★★★ |  | 🇨🇳 GLM-4.5V | 84.4% |
| 356 | WebWalkerQA | Web agents | WebWalker tasks evaluating autonomous browsing question answering performance. | ★★★★ |  | 🇨🇳 Tongyi DeepResearch | 72.2% |
| 357 | WeMath | Math reasoning | Math reasoning benchmark spanning diverse curricula and difficulty levels. | ★★★★ |  | 🇨🇳 GLM-4.5V | 68.8% |
| 358 | WildBench V2 | Instruction following | WildBench V2 human preference benchmark for instruction following and helpfulness. | ★★★★ |  | 🇫🇷 Mistral Small 3.2 24B Instruct | 65.3% |
| 359 | Winogender | Gender bias (coreference) | Coreference resolution dataset for measuring gender bias. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 67.9% |
| 360 | WinoGrande | Coreference reasoning | Large-scale adversarial Winograd Schema-style pronoun resolution. | ★★★★ | 99 | 🇺🇸 Llama 3.1 405B Base | 86.7% |
| 361 | WinoGrande (DE) | Coreference reasoning (German) | German translation of the WinoGrande pronoun resolution benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.8% |
| 362 | WMT16 En–De | Machine translation | WMT16 English–German translation benchmark (news). | ★★★★ |  | 🇺🇸 Llama 3.3 70B Instruct | 38.8% |
| 363 | WMT16 En–De (Instruct) | Machine translation | Instruction-tuned evaluation on the WMT16 English–German translation set. | ★★★★ |  | 🇺🇸 Llama 3.3 70B Instruct | 37.9% |
| 364 | WritingBench | Writing quality | General-purpose writing quality benchmark. | ★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 88.3% |
| 365 | WSC | Coreference reasoning | Classic Winograd Schema Challenge measuring commonsense coreference. | ★★★★ |  | G Pythia 410M | 47.1% |
| 366 | xBench-DeepSearch | Agentic research | Evaluates multi-hop deep research workflows on xBench DeepSearch tasks. | ★★★★ |  | 🇨🇳 Tongyi DeepResearch | 75.0% |
| 367 | ZebraLogic | Logical reasoning | Logical reasoning benchmark assessing complex pattern and rule inference. | ★★★★ |  | 🇨🇳 Qwen3 MoE-2507 | 94.2% |
| 368 | ZeroBench | Zero-shot generalization | Evaluates zero-shot performance across diverse tasks without task-specific finetuning. | ★★★★★ |  | 🇨🇳 GLM-4.5V | 23.4% |
| 369 | ZeroBench (sub) | Zero-shot generalization | Subset of ZeroBench targeting harder zero-shot reasoning cases. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 33.8% |
| 370 | ZeroSCROLLS MuSiQue | Long-context reasoning | ZeroSCROLLS split derived from MuSiQue multi-hop QA. | ★★★★ |  | 🇺🇸 Llama 3.3 70B Instruct | 0.5% |
| 371 | ZeroSCROLLS SpaceDigest | Long-context summarization | ZeroSCROLLS SpaceDigest extractive summarization task. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.8% |
| 372 | ZeroSCROLLS SQuALITY | Long-context summarization | ZeroSCROLLS split based on the SQuALITY long-form summarization benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.2% |