Fu — Benchmark of Benchmarks
Fu-Benchmark is a meta-benchmark of the most influential evaluation suites used to measure and rank large language models. The table below can be filtered by benchmark name, topic, or leading model.
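For readers working from an exported copy of this catalog rather than the live page, here is a minimal sketch of how such a filter might look. It is illustrative only: the CSV file name, column names, and function names are assumptions, not part of Fu-Benchmark.

```python
# Minimal sketch: filtering an exported copy of the Fu-Benchmark catalog.
# Assumptions: a CSV export named "fu_benchmarks.csv" whose columns mirror the
# table below ("Name", "Topic", "Leader", "Top %"). All names are illustrative.
import csv


def load_catalog(path: str) -> list[dict[str, str]]:
    """Read the exported table into a list of row dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def filter_catalog(rows: list[dict[str, str]], *, name: str = "",
                   topic: str = "", model: str = "") -> list[dict[str, str]]:
    """Case-insensitive substring match on the Name, Topic, and Leader columns."""
    def matches(row: dict[str, str]) -> bool:
        return (name.lower() in row.get("Name", "").lower()
                and topic.lower() in row.get("Topic", "").lower()
                and model.lower() in row.get("Leader", "").lower())
    return [row for row in rows if matches(row)]


if __name__ == "__main__":
    catalog = load_catalog("fu_benchmarks.csv")  # hypothetical export
    for row in filter_catalog(catalog, topic="math"):
        print(row["Name"], "-", row.get("Top %", ""))
```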
# | Name | Topic | Description | Relevance | GitHub ★ | Leader | Top % |
---|---|---|---|---|---|---|---|
1 | AI2D o | Diagram reasoning | AI2D diagram reasoning benchmark for measuring multimodal understanding of annotated diagrams. | ★★★★★ | 90.9% | ||
2 | AA-Index o | Multi-domain QA | Comprehensive QA index across diverse domains. | ★★★★★ | 73.2% | ||
3 | AA-LCR O | Long-context reasoning | A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens. | ★★★★★ | 75.6% | ||
4 | ACP-Bench Bool O | Safety evaluation (boolean) | Safety and behavior evaluation with yes/no questions. | ★★★★★ | 85.1% | ||
5 | ACP-Bench MCQ O | Safety evaluation (MCQ) | Safety and behavior evaluation with multiple-choice questions. | ★★★★★ | 82.1% | ||
6 | AgentDojo O | Agent evaluation | Interactive evaluation suite for autonomous agents across tools and tasks. | ★★★★★ | 88.7% | ||
7 | AGIEval O | Exams | Academic and professional exam benchmark. | ★★★★★ | 71.6% | ||
8 | AI2D O | Diagram understanding (VQA) | Visual question answering over science and diagram images. | ★★★★★ | 96.3% | ||
9 | Aider Code Editing o | Code editing | Measures interactive code editing quality within the Aider assistant workflow. | ★★★★★ | 89.8% | ||
10 | Aider-Polyglot O | Code assistant eval | Aider polyglot coding leaderboard. | ★★★★★ | 88.0% | ||
11 | AIME 2024 O | Math (competition) | American Invitational Mathematics Examination 2024 problems. | ★★★★★ | 96.6% | ||
12 | AIME 2025 O | Math (competition) | American Invitational Mathematics Examination 2025 problems. | ★★★★★ | 96.7% | ||
13 | AIME25 o | Math (competition) | American Invitational Mathematics Examination 2025 benchmark (set AIME25). | ★★★★★ | 83.1% | ||
14 | All-Angles Bench o | Spatial perception | All-Angles benchmark for spatial recognition and 3D perception. | ★★★★★ | 56.9% | ||
15 | AlpacaEval O | Instruction following | Automatic eval using GPT-4 as a judge. | ★★★★★ | 1849 | 64.2% | |
16 | AlpacaEval 2.0 O | Instruction following | Updated AlpacaEval with improved prompts and judging. | ★★★★★ | 87.6% | ||
17 | AMC-23 O | Math (competition) | American Mathematics Competition 2023 evaluation. | ★★★★★ | G QwQ-32B | 98.5% | |
18 | AndroidWorld o | Mobile agents | Benchmark for agents operating Android apps via UI automation. | ★★★★★ | 57.0% | ||
19 | API-Bank o | Tool use | API-Bank tool-use benchmark. | ★★★★★ | 92.0% | ||
20 | ARC-AGI-1 O | General reasoning | ARC-AGI Phase 1 aggregate accuracy. | ★★★★★ | 75.7% | ||
21 | ARC-AGI-2 O | General reasoning | ARC-AGI Phase 2 aggregate accuracy. | ★★★★★ | 18.3% | ||
22 | ARC Average O | Science QA (average) | Average accuracy across ARC-Easy and ARC-Challenge. | ★★★★★ | 60.5% | ||
23 | ARC-Challenge O | Science QA | Hard subset of AI2 Reasoning Challenge; grade-school science. | ★★★★★ | 96.9% | ||
24 | ARC-Challenge (DE) o | Science QA (German) | German translation of the ARC Challenge benchmark. | ★★★★★ | 0.7% | ||
25 | ARC-Easy O | Science QA | Easier subset of AI2 Reasoning Challenge. | ★★★★★ | 89.0% | ||
26 | ARC-Easy (DE) o | Science QA (German) | German translation of the ARC Easy science QA benchmark. | ★★★★★ | 0.8% | ||
27 | Arena-Hard O | Chat ability | Hard prompts on Chatbot Arena. | ★★★★★ | 920 | 97.1% | |
28 | Arena-Hard V2 O | Chat ability | Updated Arena-Hard v2 prompts on Chatbot Arena. | ★★★★★ | 920 | 88.2% | |
29 | ARKitScenes o | 3D scene understanding | ARKitScenes benchmark for assessing 3D scene reconstruction and understanding from mixed reality captures. | ★★★★★ | 61.5% | ||
30 | ART Agent Red Teaming O | Agent robustness | Evaluation suite for adversarial red-teaming of autonomous AI agents. | ★★★★★ | ↓ 40.0% | ||
31 | ArtifactsBench O | Agentic coding | Artifacts-focused coding and tool-use benchmark evaluating generated code artifacts. | ★★★★★ | 72.5% | ||
32 | AstaBench O | Agent evaluation | Evaluates science agents across literature understanding, data analysis, planning, tool use, coding, and search. | ★★★★★ | 53.0% | ||
33 | AttaQ O | Safety / jailbreak | Adversarial jailbreak suite measuring refusal robustness against targeted attack prompts. | ★★★★★ | G Granite 3.3 8B Instruct | 88.5% | |
34 | AutoCodeBench O | Autonomous coding | End-to-end autonomous coding benchmark with unit-test based execution across diverse repositories and tasks. | ★★★★★ | 52.4% | ||
35 | AutoCodeBench-Lite O | Autonomous coding | Lite version of AutoCodeBench focusing on smaller tasks with the same end-to-end, unit-test-based evaluation. | ★★★★★ | 64.5% | ||
36 | BALROG O | Agent robustness | Benchmark for assessing LLM agents under adversarial and out-of-distribution tool-use scenarios. | ★★★★★ | 43.6% | ||
37 | BBH O | Multi-task reasoning | Hard subset of BIG-bench with diverse reasoning tasks. | ★★★★★ | 510 | 94.3% | |
38 | BBQ O | Bias evaluation | Bias Benchmark for Question Answering evaluating social biases across contexts. | ★★★★★ | 56.0% | ||
39 | BFCL o | Tool use / function calling | Berkeley Function-Calling Leaderboard measuring the accuracy of tool and function calls. | ★★★★★ | 95.0% | ||
40 | BFCL Live v2 o | Tool use / function calling | Live v2 split of the Berkeley Function-Calling Leaderboard, built from real user-contributed function-calling queries. | ★★★★★ | 81.0% | ||
41 | BFCL v3 O | Tool use / function calling | Berkeley Function-Calling Leaderboard v3, adding multi-turn and multi-step function-calling evaluation. | ★★★★★ | 77.8% | ||
42 | BIG-Bench o | Multi-task reasoning | BIG-bench overall performance (original). | ★★★★★ | 3110 | 55.1% | |
43 | BIG-Bench Extra Hard o | Multi-task reasoning | Extra hard subset of BIG-bench tasks. | ★★★★★ | G Ling 1T | 47.3% | |
44 | BigCodeBench O | Code Generation | BigCodeBench evaluates large language models on practical code generation tasks with unit-test verification. | ★★★★★ | 56.1% | ||
45 | BigCodeBench Hard O | Code generation (hard) | Harder variant of BigCodeBench testing complex programming and library tasks with function-level code generation. | ★★★★★ | 35.8% | ||
46 | BLINK O | Visual perception | Core visual perception tasks (depth, correspondence, visual similarity) that humans solve at a glance. | ★★★★★ | 72.4% | ||
47 | BoB-HVR O | Composite capability index | Hard, Versatile, and Relevant composite score across eight capability buckets. | ★★★★★ | 9.0% | ||
48 | BOLD o | Bias evaluation | Bias in Open-ended Language Dataset probing demographic biases in text generation. | ★★★★★ | ↓ 0.1% | ||
49 | BoolQ O | Reading comprehension | Yes/no QA from naturally occurring questions. | ★★★★★ | 171 | 84.8% | |
50 | BrowseComp O | Web browsing | Web browsing comprehension and competence benchmark. | ★★★★★ | 54.9% | ||
51 | BrowseComp_zh O | Web browsing (Chinese) | Chinese variant of the BrowseComp web browsing benchmark. | ★★★★★ | 58.1% | ||
52 | BRuMo25 o | Math competition | BruMo 2025 olympiad-style mathematics benchmark. | ★★★★★ | 69.5% | ||
53 | BuzzBench O | Humor analysis | A humour analysis benchmark. | ★★★★★ | 71.1% | ||
54 | C-Eval O | Chinese exams | Comprehensive Chinese exam benchmark across multiple subjects. | ★★★★★ | 92.5% | ||
55 | C3-Bench o | Reasoning (Chinese) | Comprehensive Chinese reasoning capability benchmark. | ★★★★★ | 35 | 83.1% | |
56 | CaseLaw v2 O | Legal reasoning | U.S. case law benchmark evaluating legal reasoning and judgment over court opinions. | ★★★★★ | 78.1% | ||
57 | CC-OCR o | OCR (cross-lingual) | Cross-lingual OCR benchmark evaluating character recognition across mixed-language documents. | ★★★★★ | 81.5% | ||
58 | CFEval o | Coding ELO / contest eval | Contest-style coding evaluation with ELO-like scoring. | ★★★★★ | 2134 | ||
59 | Charades-STA O | Video grounding | Charades-STA temporal grounding (mIoU). | ★★★★★ | 64.0% | ||
60 | ChartMuseum o | Chart understanding | Large-scale curated collection of charts for evaluating parsing, grounding, and reasoning. | ★★★★★ | 63.3% | ||
61 | ChartQA O | Chart understanding (VQA) | Visual question answering over charts and plots. | ★★★★★ | G MiMo-VL 7B-SFT | 92.9% | |
62 | ChartQA-Pro o | Chart understanding (VQA) | Professional-grade chart question answering with diverse chart types and complex reasoning. | ★★★★★ | 64.0% | ||
63 | CharXiv (DQ) O | Chart description (PDF) | Scientific chart/table descriptive questions from arXiv PDFs. | ★★★★★ | 95.0% | ||
64 | CharXiv (RQ) O | Chart reasoning (PDF) | Scientific chart/table reasoning questions from arXiv PDFs. | ★★★★★ | 81.1% | ||
65 | Chinese SimpleQA o | QA (Chinese) | Chinese variant of the SimpleQA benchmark. | ★★★★★ | 77.6% | ||
66 | CLUEWSC o | Coreference reasoning (Chinese) | Chinese Winograd Schema-style coreference benchmark from CLUE. | ★★★★★ | 92.8% | ||
67 | CMath o | Math (Chinese) | Chinese mathematics benchmark. | ★★★★★ | 96.7% | ||
68 | CMMLU o | Chinese multi-domain | Chinese counterpart to MMLU. | ★★★★★ | 781 | 91.9% | |
69 | Codeforces O | Competitive programming | Competitive programming performance on Codeforces problems (ELO). | ★★★★★ | 2719 | ||
70 | COLLIE o | Instruction following | Comprehensive instruction-following evaluation suite. | ★★★★★ | 55 | 99.0% | |
71 | CommonsenseQA O | Commonsense QA | Multiple-choice QA requiring commonsense knowledge. | ★★★★★ | 85.8% | ||
72 | CountBench O | Visual counting | Object counting and numeracy benchmark for visual-language models across varied scenes. | ★★★★★ | 93.7% | ||
73 | CountBenchQA O | Visual counting QA | Visual question answering benchmark focused on counting objects across varied scenes. | ★★★★★ | G Moondream-9B-A2B | 93.2% | |
74 | CRAG o | Retrieval QA | Complex Retrieval-Augmented Generation benchmark for grounded question answering. | ★★★★★ | G Jamba Mini 1.6 | 76.2% | |
75 | Creative Story‑Writing Benchmark V3 O | Creative writing | Story writing benchmark evaluating creativity, coherence, and style (v3). | ★★★★★ | 291 | 8.7% | |
76 | Longform Creative Writing O | Creative writing | Longform creative writing evaluation (EQ-Bench). | ★★★★★ | 20 | 78.9% | |
77 | Creative Writing v3 O | Creative writing | An LLM-judged creative writing benchmark. | ★★★★★ | 54 | 1661 |
78 | CRUX-I O | Code reasoning | CRUXEval input prediction: given a Python function and its output, predict a consistent input. | ★★★★★ | 75.7% | ||
79 | CRUX-O O | Code reasoning | CRUXEval output prediction: given a Python function and an input, predict the output. | ★★★★★ | 88.2% | ||
80 | CruxEval O | Code reasoning | Code reasoning and execution benchmark over short Python functions. | ★★★★★ | 78.5% | ||
81 | CV-Bench O | Computer vision QA | Diverse CV tasks for VLMs. | ★★★★★ | 89.7% | ||
82 | DeepMind Mathematics o | Math reasoning | Synthetic math problem sets from DeepMind covering arithmetic, algebra, calculus, and more. | ★★★★★ | G Granite-4.0-H-Small | 59.3% | |
83 | Design2Code O | Coding (UI) | Translating UI designs into code. | ★★★★★ | 93.4% | ||
84 | DesignArena O | Generative design | Leaderboard tracking generative design systems across layout, branding, and marketing tasks. | ★★★★★ | 1410 | ||
85 | DetailBench o | Spot small mistakes | Evaluates whether LLMs can notice subtle errors and minor inconsistencies in text. | ★★★★★ | 8.7% | ||
86 | DocVQA O | Document understanding (VQA) | Visual question answering over scanned documents. | ★★★★★ | 96.9% | ||
87 | DROP O | Reading + reasoning | Discrete reasoning over paragraphs (addition, counting, comparisons). | ★★★★★ | 92.2% | ||
88 | DynaMath O | Math reasoning (multimodal) | Dynamic visual math benchmark that programmatically generates question variants to test reasoning robustness. | ★★★★★ | 63.7% | ||
89 | Economically important tasks o | Industry QA (cross-domain) | Evaluation suite of real-world, economically impactful tasks across key industries and workflows. | ★★★★★ | 47.1% | ||
90 | EgoSchema O | Egocentric video QA | EgoSchema validation accuracy. | ★★★★★ | 77.9% | ||
91 | EmbSpatialBench o | Spatial understanding | Embodied spatial understanding benchmark evaluating navigation and localization. | ★★★★★ | 84.3% | ||
92 | Enterprise RAG o | Retrieval-augmented generation | Enterprise retrieval-augmented generation evaluation covering internal knowledge bases. | ★★★★★ | 69.2% | ||
93 | EQ-Bench O | Emotional intelligence | Benchmark assessing emotional intelligence in language models. | ★★★★★ | 352 | G Jan v1 2509 | 85.0% |
94 | EQ-Bench 3 O | Emotional intelligence (roleplay) | A benchmark measuring emotional intelligence in challenging roleplays, judged by Sonnet 3.7. | ★★★★★ | 21 | 1555 | |
95 | ERQA O | Spatial reasoning | Spatial recognition and reasoning QA benchmark (ERQA). | ★★★★★ | 65.7% | ||
96 | EvalPerf O | Code efficiency | Evaluates the runtime efficiency of LLM-generated code. | ★★★★★ | 100.0% | ||
97 | EvalPlus O | Code generation | Aggregated code evaluation suite from EvalPlus. | ★★★★★ | 1577 | 89.0% | |
98 | FACTS Grounding o | Grounding / factuality | Grounded factuality benchmark evaluating model alignment with source facts. | ★★★★★ | 87.8% | ||
99 | FActScore o | Hallucination rate on open-source prompts | Measures hallucination rate on an open-source prompt suite; lower is better. | ★★★★★ | ↓ 1.0% | ||
100 | FAIX Agent O | Composite capability index | ★★★★★ | 90.7% | |||
101 | FAIX Code O | Composite capability index | ★★★★★ | 74.4% | |||
102 | FAIX Math O | Composite capability index | ★★★★★ | 100.0% | |||
103 | FAIX OCR O | Composite capability index | ★★★★★ | 86.0% | |||
104 | FAIX Safety O | Composite safety index | ★★★★★ | 74.0% | |||
105 | FAIX STEM O | Composite capability index | ★★★★★ | 100.0% | |||
106 | FAIX Text O | Composite capability index | ★★★★★ | 70.0% | |||
107 | FAIX Visual O | Composite capability index | ★★★★★ | 91.6% | |||
108 | FAIX Writing O | Composite capability index | ★★★★★ | 69.3% | |||
109 | FinanceReasoning o | Financial reasoning | Financial reasoning benchmark evaluating quantitative and qualitative finance problem solving. | ★★★★★ | G Ling 1T | 87.5% | |
110 | FinanceAgent o | Agentic finance tasks | Interactive financial agent benchmark requiring multi-step tool use. | ★★★★★ | 55.3% | ||
111 | FinanceBench (FullDoc) o | Finance QA | FinanceBench full-document question answering benchmark requiring long-context financial understanding. | ★★★★★ | G Jamba Mini 1.6 | 45.4% | |
112 | FinSearchComp O | Financial retrieval | Financial search and comprehension benchmark measuring retrieval grounded reasoning over financial content. | ★★★★★ | 68.9% | ||
113 | FinSearchComp-CN O | Financial retrieval (Chinese) | Chinese financial search and comprehension benchmark measuring retrieval-grounded reasoning over regional financial content. | ★★★★★ | G doubao-1-5-vision-pro | 54.2% | |
114 | Flame-React-Eval o | Frontend coding | Front-end React coding tasks and evaluation. | ★★★★★ | 82.5% | ||
115 | FRAMES o | Retrieval-augmented QA | Factuality, Retrieval, And reasoning MEasurement Set: multi-hop questions for evaluating retrieval-augmented generation. | ★★★★★ | 90.6% | ||
116 | FreshQA o | Recency QA | Question answering benchmark emphasizing up-to-date knowledge and recency. | ★★★★★ | 66.9% | ||
117 | FullStackBench O | Full-stack development | End-to-end web/app development tasks and evaluation. | ★★★★★ | 68.5% | ||
118 | GAIA o | General AI tasks | Comprehensive benchmark for agentic tasks. | ★★★★★ | 70.9% | ||
119 | GAIA 2 O | General agent tasks | Grounded agentic intelligence benchmark version 2 covering multi-tool tasks. | ★★★★★ | 42.1% | ||
120 | GDPVal o | Economically valuable tasks | Evaluates model performance on real-world, economically valuable tasks drawn from a range of occupations. | ★★★★★ | 47.6% | ||
121 | GeoBench1 o | Geospatial reasoning | Geospatial visual QA and reasoning (set 1). | ★★★★★ | 79.7% | ||
122 | Global-MMLU o | Multi-domain knowledge (global) | Full Global-MMLU evaluation across diverse languages and regions. | ★★★★★ | 77.8% | ||
123 | Gorilla Benchmark API Bench o | Tool use | Gorilla API Bench tool-use evaluation. | ★★★★★ | 35.3% | ||
124 | GPQA O | Graduate-level QA | Graduate-level question answering evaluating advanced reasoning. | ★★★★★ | 406 | 88.4% | |
125 | GPQA-diamond O | Graduate-level QA | Hard subset of GPQA (diamond level). | ★★★★★ | 89.4% | ||
126 | Ground-UI-1K O | GUI grounding | Accuracy on the Ground-UI-1K grounding benchmark. | ★★★★★ | 85.4% | ||
127 | GSM-Plus O | Math (grade-school, enhanced) | Enhanced GSM-style grade-school math benchmark variant. | ★★★★★ | 82.1% | ||
128 | GSM-Symbolic o | Math reasoning | Symbolic reasoning variant of GSM that tests algebraic manipulation and arithmetic with structured problems. | ★★★★★ | G Granite-4.0-H-Small | 87.4% | |
129 | GSM8K O | Math (grade-school) | Grade-school math word problems requiring multi-step reasoning. | ★★★★★ | 1322 | 97.3% | |
130 | GSM8K (DE) o | Math (grade-school, German) | German translation of the GSM8K grade-school math word problems. | ★★★★★ | 0.6% | ||
131 | GSO Benchmark O | Code generation | LiveCodeBench GSO benchmark. | ★★★★★ | 8.8% | ||
132 | HallusionBench O | Multimodal hallucination | Benchmark for evaluating hallucination tendencies in multimodal LLMs. | ★★★★★ | 66.7% | ||
133 | HarmfulQA o | Safety | Harmful question set testing models' ability to avoid unsafe answers. | ★★★★★ | 104 | G K2-THINK | 99.0% |
134 | HealthBench O | Medical QA | Comprehensive medical knowledge and clinical reasoning benchmark across specialties and tasks. | ★★★★★ | 67.2% | ||
135 | HealthBench-Hard o | Medical QA (hard) | Challenging subset of HealthBench focusing on complex, ambiguous clinical cases. | ★★★★★ | 46.2% | ||
136 | HealthBench-Hard Hallucinations o | Medical hallucination safety | Measures hallucination and unsafe medical advice under hard clinical scenarios. | ★★★★★ | ↓ 1.6% | ||
137 | HellaSwag O | Commonsense reasoning | Adversarial commonsense sentence completion. | ★★★★★ | 220 | 96.4% | |
138 | HellaSwag (DE) o | Commonsense reasoning (German) | German translation of the HellaSwag commonsense benchmark. | ★★★★★ | 0.7% | ||
139 | HELMET LongQA o | Long-context QA | Long-context subset of the HELMET benchmark focusing on grounded question answering. | ★★★★★ | G Jamba Mini 1.6 | 46.9% | |
140 | HeroBench O | Long-horizon planning | Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds. | ★★★★★ | 91.7% | ||
141 | HHEM v2.1 O | Hallucination detection | Hughes Hallucination Evaluation Model (Vectara) — lower is better. | ★★★★★ | G AntGroup Finix_S1_32b | ↓ 0.6% | |
142 | HLE O | Multi-domain reasoning | Challenging LLMs at the frontier of human knowledge. | ★★★★★ | 1085 | 32.9% | |
143 | HMMT o | Math (competition) | Harvard–MIT Mathematics Tournament problems. | ★★★★★ | 100.0% | ||
144 | HMMT 2025 O | Math (competition) | Harvard–MIT Mathematics Tournament 2025 problems. | ★★★★★ | 93.3% | ||
145 | HMMT25 o | Math (competition) | Harvard-MIT Mathematics Tournament 2025 benchmark. | ★★★★★ | 67.6% | ||
146 | HRBench 4K o | High-resolution image understanding | High-resolution image perception benchmark built on 4K-resolution images. | ★★★★★ | 89.5% | ||
147 | HRBench 8K o | High-resolution image understanding | High-resolution image perception benchmark built on 8K-resolution images. | ★★★★★ | 82.5% | ||
148 | HumanEval O | Code generation | Python synthesis problems evaluated by unit tests. | ★★★★★ | 2916 | 96.3% | |
149 | HumanEval+ O | Code generation | Extended HumanEval with more tests. | ★★★★★ | 1577 | 94.5% | |
150 | Hypersim o | 3D scene understanding | Hypersim benchmark for synthetic indoor scene understanding and reconstruction. | ★★★★★ | 39.3% | ||
151 | IFBench O | Instruction following | Instruction-following benchmark measuring compliance and adherence. | ★★★★★ | 70 | 84.8% | |
152 | IFEval O | Instruction following | Instruction following capability evaluation for LLMs. | ★★★★★ | 36312 | 93.9% | |
153 | INCLUDE O | Multilingual knowledge | Multilingual evaluation built from regional and professional exams across dozens of languages. | ★★★★★ | 83.9% | ||
154 | InfoQA O | Information-seeking QA | Information retrieval question answering benchmark evaluating factual responses. | ★★★★★ | 84.5% | ||
155 | InfoVQA O | Infographic VQA | Visual question answering over infographics requiring reading, counting, and reasoning. | ★★★★★ | 91.2% | ||
156 | JudgeMark v2.1 O | LLM judging ability | A benchmark measuring LLM judging ability. | ★★★★★ | 82.0% | ||
157 | KMMLU-Pro O | Korean knowledge | Korean Massive Multitask Language Understanding (KMMLU), Pro variant drawn from professional exams. | ★★★★★ | 77.5% | ||
158 | KMMLU-Redux O | Korean knowledge | Revised (Redux) variant of the KMMLU benchmark. | ★★★★★ | 81.1% | ||
159 | KOR-Bench o | Reasoning | Comprehensive reasoning benchmark spanning diverse domains and cognitive skills. | ★★★★★ | G Ling 1T | 76.0% | |
160 | KSM o | Math (Korean) | Korean STEM and math benchmark. | ★★★★★ | G EXAONE Deep 2.4B | 60.9% |
161 | LAMBADA o | Language modeling | Word prediction requiring broad context understanding. | ★★★★★ | 86.4% | ||
162 | LatentJailbreak o | Safety / jailbreak | Robustness to latent jailbreak adversarial techniques. | ★★★★★ | 39 | 77.4% | |
163 | LiveBench O | General capability | Continually updated capability benchmark across diverse tasks. | ★★★★★ | 82.4% | ||
164 | LiveBench 20241125 O | General capability | LiveBench snapshot (2024-11-25) tracking mixed-task evals. | ★★★★★ | 82.4% | ||
165 | LiveCodeBench O | Code generation | Live coding and execution-based evaluation benchmark (v6 dataset). | ★★★★★ | 86.6% | ||
166 | LiveCodeBench v5 (2024.10-2025.02) O | Code generation | LiveCodeBench v5 snapshot covering Oct 2024-Feb 2025. | ★★★★★ | 70.7% | ||
167 | LiveMCP-101 O | Agent real-time eval | A novel real-time evaluation framework and benchmark to stress‑test agents on complex, real‑world tasks. | ★★★★★ | 58.4% | ||
168 | LMArena Text O | Crowd eval (text) | Chatbot Arena text evaluation (average win rate). | ★★★★★ | 1455 | ||
169 | LMArena Vision O | Crowd eval (vision) | Chatbot Arena vision evaluation leaderboard (ELO ratings). | ★★★★★ | 1242 | ||
170 | LogicVista O | Visual logical reasoning | Visual logic and pattern reasoning tasks requiring compositional and spatial understanding. | ★★★★★ | 62.4% | ||
171 | LogiQA o | Logical reasoning | Reading comprehension with logical reasoning. | ★★★★★ | 138 | G Pythia 70M | 23.5% |
172 | LongBench o | Long-context eval | Long-context understanding across tasks. | ★★★★★ | 957 | G Jamba Mini 1.6 | 32.0% |
173 | LongFact-Concepts o | Hallucination rate on open-source prompts | Long-context factuality eval focused on conceptual statements; lower is better. | ★★★★★ | ↓ 0.7% | ||
174 | LongFact-Objects o | Hallucination rate on open-source prompts | Long-context factuality eval focused on object/entity references; lower is better. | ★★★★★ | ↓ 0.8% | ||
175 | LVBench O | Video understanding | Long video understanding benchmark (LVBench). | ★★★★★ | 73.0% | ||
176 | M3GIA (CN) o | Chinese multimodal QA | Chinese-language M3GIA benchmark covering grounded multimodal question answering. | ★★★★★ | 91.2% | ||
177 | Mantis O | Multimodal reasoning | Multimodal reasoning and instruction following benchmark (Mantis). | ★★★★★ | G dots.vlm1 | 86.2% | |
178 | MASK O | Safety / red teaming | Model behavior safety assessment via red-teaming scenarios. | ★★★★★ | 95.3% | ||
179 | MATH O | Math (competition) | Competition-level mathematics across algebra, geometry, number theory, combinatorics. | ★★★★★ | 1185 | 97.9% | |
180 | MATH Level 5 o | Math (competition) | Level 5 subset of the MATH benchmark emphasizing the hardest competition-style problems. | ★★★★★ | 73.6% | ||
181 | MATH500 O | Math (competition) | 500 curated math problems for evaluating high-level reasoning. | ★★★★★ | 99.2% | ||
182 | MATH500 (ES) o | Math (multilingual) | Spanish translation of the MATH500 benchmark. | ★★★★★ | G EXAONE 4.0 1.2B | 88.8% |
183 | MathVerse-mini o | Math reasoning (multimodal) | Compact MathVerse split focusing on single-image math puzzles and visual reasoning. | ★★★★★ | 85.0% | ||
184 | MathVerse-Vision O | Math reasoning (multimodal) | Multi-image visual mathematical reasoning tasks from the MathVerse ecosystem. | ★★★★★ | 72.1% | ||
185 | MathVision O | Math reasoning (multimodal) | Visual math reasoning benchmark with problems that combine images (charts, diagrams) and text. | ★★★★★ | 74.6% | ||
186 | MathVista O | Multimodal math reasoning | Visual math reasoning across diverse tasks. | ★★★★★ | 86.8% | ||
187 | MathVista-Mini O | Math reasoning (multimodal) | Lightweight subset of MathVista for quick evaluation of visual mathematical reasoning. | ★★★★★ | 85.8% | ||
188 | MBPP O | Code generation | Short Python problems with hidden tests. | ★★★★★ | 36312 | 95.5% | |
189 | MBPP+ O | Code generation | Extended MBPP with more tests and stricter evaluation. | ★★★★★ | 88.6% | ||
190 | MCP Universe O | Agent evaluation | Benchmarks multi-step tool-use agents across diverse task suites with a unified overall success metric. | ★★★★★ | 44.2% | ||
191 | MCPMark O | Agent tool-use (MCP) | Benchmark for Model Context Protocol (MCP) agent tool-use. | ★★★★★ | 127 | 46.9% | |
192 | MGSM O | Math (multilingual) | Multilingual grade school math word problems. | ★★★★★ | 94.4% | ||
193 | MIABench o | Multimodal instruction following | Multimodal instruction-following benchmark evaluating accuracy on complex image-text tasks. | ★★★★★ | 92.7% | ||
194 | Minerva Math o | University-level math | Advanced quantitative reasoning set inspired by the Minerva benchmark for STEM problem solving. | ★★★★★ | G Granite-4.0-H-Small | 74.0% | |
195 | MiniF2F (Test) o | Formal math (theorem proving) | Formal mathematics benchmark of olympiad-level statements for automated theorem proving (test split). | ★★★★★ | 81.6% | ||
196 | MixEval o | Multi-task reasoning | Mixed-subject benchmark covering knowledge and reasoning tasks across domains. | ★★★★★ | 82.9% | ||
197 | MixEval Hard o | Multi-task reasoning (hard) | Hard subset of MixEval covering diverse reasoning tasks. | ★★★★★ | 31.6% | ||
198 | MLVU O | Large video understanding | MLVU: Large-scale multi-task benchmark for video understanding. | ★★★★★ | 86.2% | ||
199 | MM-MT-Bench O | Multimodal instruction following | Multi-turn multimodal instruction following benchmark evaluating dialogue quality and helpfulness. | ★★★★★ | 8.5% | ||
200 | MMBench v1.1 (EN dev) O | General VQA | English dev split of MMBench v1.1 measuring multimodal question answering. | ★★★★★ | 90.6% | ||
201 | MMBench v1.1 (CN) O | Multimodal understanding (Chinese) | MMBench v1.1 Chinese subset for evaluating multimodal LLMs. | ★★★★★ | 89.8% | ||
202 | MMBench v1.1 (EN) O | Multimodal understanding (English) | MMBench v1.1 English subset for evaluating multimodal LLMs. | ★★★★★ | 89.7% | ||
203 | MME-RealWorld (cn) o | Real-world perception (CN) | MME-RealWorld Chinese split. | ★★★★★ | 58.5% | ||
204 | MME-RealWorld (en) o | Real-world perception (EN) | MME-RealWorld English split. | ★★★★★ | G MiMo-VL 7B-RL | 59.1% | |
205 | MMLongBench-Doc O | Long-context multimodal documents | Evaluates long-context document understanding with mixed text, tables, and figures across multiple pages. | ★★★★★ | 56.2% | ||
206 | MMLU O | Multi-domain knowledge | 57 tasks spanning STEM, humanities, social sciences; broad knowledge and reasoning. | ★★★★★ | 1488 | 93.5% | |
207 | MMLU (cloze) o | Multi-domain knowledge (cloze) | Cloze-form MMLU evaluation variant. | ★★★★★ | 31.5% | ||
208 | Full Text MMLU o | Multi-domain knowledge (long-form) | Full-context MMLU variant evaluating reasoning over long passages. | ★★★★★ | 0.8% | ||
209 | MMLU-Pro O | Multi-domain knowledge | Harder successor to MMLU with more challenging questions. | ★★★★★ | 286 | 89.3% | |
210 | MMLU Pro MCF o | Multi-domain knowledge (few-shot) | MMLU-Pro common format (MCF) few-shot evaluation. | ★★★★★ | 41.1% | ||
211 | MMLU-ProX O | Multi-domain knowledge | Cross-lingual and robust variant of MMLU-Pro. | ★★★★★ | 81.0% | ||
212 | MMLU-Redux O | Multi-domain knowledge | Updated MMLU-style evaluation with revised questions and scoring. | ★★★★★ | 93.8% | ||
213 | MMLU-STEM O | STEM knowledge | STEM subset of MMLU. | ★★★★★ | 1488 | G Falcon-H1-34B-Instruct | 83.6% |
214 | MMMLU O | Multi-domain knowledge (multilingual) | Massively multilingual MMLU-style evaluation across many languages. | ★★★★★ | 89.5% | ||
215 | MMMLU (ES) o | Multilingual knowledge | Spanish subset of the MMMLU benchmark. | ★★★★★ | 64.7% | ||
216 | MMMU O | Multimodal understanding | Multi-discipline multimodal understanding benchmark. | ★★★★★ | 84.2% | ||
217 | MMMU PRO O | Multimodal understanding (hard) | Professional/advanced subset of MMMU for multimodal reasoning. | ★★★★★ | 78.4% | ||
218 | MMMU-Pro (vision) o | Multimodal understanding (vision) | MMMU-Pro vision-only setting. | ★★★★★ | 45.8% | ||
219 | MMStar O | Multimodal reasoning | Broad evaluation of multimodal LLMs across diverse tasks. | ★★★★★ | 78.7% | ||
220 | MMVP o | Visual perception | Multimodal Visual Patterns benchmark probing fine-grained visual distinctions that vision-language models often miss. | ★★★★★ | 80.7% | ||
221 | MMVU o | Video understanding | Multimodal video understanding benchmark (MMVU). | ★★★★★ | 68.7% | ||
222 | MotionBench o | Video motion understanding | Video motion and temporal reasoning benchmark. | ★★★★★ | 62.4% | ||
223 | MT-Bench O | Chat ability | Multi-turn chat evaluation via GPT-4 grading. | ★★★★★ | 39074 | 85.7% | |
224 | MTOB (full book) o | Long-form reasoning | Long-context book understanding benchmark (full-book setting). | ★★★★★ | 50.8% | ||
225 | MTOB (half book) o | Long-form reasoning | Long-context book understanding benchmark (half-book setting). | ★★★★★ | 54.0% | ||
226 | MUIRBENCH O | Multi-image understanding | Comprehensive benchmark for robust multi-image understanding across diverse tasks. | ★★★★★ | 80.1% | ||
227 | Multi-IF O | Instruction following (multi-task) | Composite instruction-following evaluation across multiple tasks. | ★★★★★ | 79.5% | ||
228 | Multi-IFEval O | Instruction following (multi-task) | Multi-task variant of instruction-following evaluation. | ★★★★★ | 88.7% | ||
229 | Multi-SWE-Bench o | Code repair (multi-repo) | Multi-repository SWE-Bench variant. | ★★★★★ | 246 | 35.7% | |
230 | MultiChallenge o | Multi-task reasoning | Composite benchmark across diverse challenges by Scale AI. | ★★★★★ | 69.6% | ||
231 | MultiPL-E O | Code generation (multilingual) | Multilingual code generation and execution benchmark across many programming languages. | ★★★★★ | 269 | 87.9% | |
232 | MultiPL-E HumanEval o | Code generation (multilingual) | MultiPL-E variant of HumanEval tasks. | ★★★★★ | 75.2% | ||
233 | MultiPL-E MBPP o | Code generation (multilingual) | MultiPL-E variant of MBPP tasks. | ★★★★★ | 65.7% | ||
234 | MuSR O | Reasoning | Multistep Soft Reasoning. | ★★★★★ | 69.9% | ||
235 | MVBench O | Video QA | Multi-view or multi-video QA benchmark (MVBench). | ★★★★★ | 73.0% | ||
236 | Natural2Code o | Code generation | Natural language to code benchmark for instruction-following synthesis. | ★★★★★ | 92.9% | ||
237 | NaturalQuestions O | Open-domain QA | Google NQ; real user questions with long/short answers. | ★★★★★ | 40.1% | ||
238 | Nexus (0-shot) o | Tool use | Nexus tool-use benchmark, zero-shot setting. | ★★★★★ | 58.7% | ||
239 | Objectron o | Object detection | Objectron benchmark for 3D object detection in video captures. | ★★★★★ | 71.2% | ||
240 | OCRBench O | OCR (vision text extraction) | Optical character recognition benchmark evaluating text extraction from images, documents, and complex layouts. | ★★★★★ | 90.3% | ||
241 | OCRBenchV2 (CN) o | OCR (Chinese) | OCRBenchV2 Chinese subset assessing OCR performance on Chinese-language documents. | ★★★★★ | 63.7% | ||
242 | OCRBenchV2 (EN) o | OCR (English) | OCRBenchV2 English subset evaluating OCR accuracy on English documents and layouts. | ★★★★★ | 66.8% | ||
243 | OCRReasoning o | OCR reasoning | OCR reasoning benchmark combining text extraction with multi-step reasoning over documents. | ★★★★★ | 70.8% | ||
244 | ODinW-13 o | Object detection (in the wild) | Object Detection in the Wild benchmark covering 13 real-world domains. | ★★★★★ | 47.5% | ||
245 | OJBench O | Code generation (online judge) | Programming problems evaluated via online judge-style execution. | ★★★★★ | 41.6% | ||
246 | OlympiadBench o | Math (olympiad) | Advanced mathematics olympiad-style problem benchmark. | ★★★★★ | 76.5% | ||
247 | OlympicArena o | Math (competition) | Olympiad-style mathematics reasoning benchmark. | ★★★★★ | 76.2% | ||
248 | Omni-MATH O | Math reasoning | Omni-MATH benchmark covering diverse math reasoning tasks across difficulty levels. | ★★★★★ | G Ling 1T | 74.5% | |
249 | Omni-MATH-HARD O | Math | Challenging math benchmark (Omni-MATH-HARD). | ★★★★★ | 73.6% | ||
250 | OmniSpatial o | Spatial reasoning | Spatial understanding and reasoning benchmark (OmniSpatial). | ★★★★★ | 51.0% | ||
251 | OpenBookQA O | Science QA | Open-book multiple choice science questions with supporting facts. | ★★★★★ | 128 | 96.4% | |
252 | OpenRewrite-Eval o | Rewrite quality | OpenRewrite evaluation; micro-averaged ROUGE-L. | ★★★★★ | 46.9% | ||
253 | OptMATH o | Math optimization reasoning | OptMATH benchmark targeting challenging math optimization and problem-solving tasks. | ★★★★★ | G Ling 1T | 57.7% | |
254 | OSWorld o | GUI agents | Agentic GUI task completion and grounding on desktop environments. | ★★★★★ | 61.4% | ||
255 | OSWorld-G O | GUI agents | OSWorld-G center accuracy (no_refusal). | ★★★★★ | 71.8% | ||
256 | OSWorld2 o | GUI agents | Second-generation OSWorld GUI agent benchmark. | ★★★★★ | 35.8% | ||
257 | PIQA O | Physical commonsense | Physical commonsense about everyday tasks and object affordances. | ★★★★★ | 87.1% | ||
258 | PixmoCount O | Visual counting | Counting objects/instances in images (PixmoCount). | ★★★★★ | 85.2% | ||
259 | PolyMATH O | Math reasoning | Polyglot mathematics benchmark assessing cross-topic math reasoning. | ★★★★★ | 60.1% | ||
260 | POPE o | Hallucination detection | Vision-language hallucination benchmark focusing on object existence verification. | ★★★★★ | G Moondream-9B-A2B | 89.0% | |
261 | PopQA O | Knowledge / QA | Open-domain popular culture question answering benchmark testing long-tail factual recall. | ★★★★★ | 28.8% | ||
262 | QuAC o | Conversational QA | Question answering in context. | ★★★★★ | 53.6% | ||
263 | QuALITY o | Long-context reading comprehension | Long-document multiple-choice reading comprehension benchmark. | ★★★★★ | 0.5% | ||
264 | RACE o | Reading comprehension | English exams for middle and high school. | ★★★★★ | G RND1-Base-0910 | 57.6% | |
265 | RealWorldQA O | Real-world visual QA | Visual question answering with real-world images and scenarios. | ★★★★★ | 82.8% | ||
266 | RefCOCO O | Referring expressions | RefCOCO average accuracy at IoU 0.5 (val). | ★★★★★ | | 92.4% |
267 | RefCOCOg o | Referring expressions | RefCOCOg average accuracy at IoU 0.5 (val). | ★★★★★ | G Moondream-9B-A2B | 88.6% | |
268 | RefCOCO+ o | Referring expressions | RefCOCO+ accuracy at IoU 0.5 on the val split. | ★★★★★ | G Moondream-9B-A2B | 81.8% | |
269 | RefSpatialBench o | Spatial reasoning | Reference spatial understanding benchmark covering spatial grounding tasks. | ★★★★★ | 72.1% | ||
270 | RepoBench O | Code understanding | Repository-level code comprehension and reasoning benchmark. | ★★★★★ | 83.8% | ||
271 | RoboSpatialHome o | Embodied spatial understanding | RoboSpatialHome benchmark for embodied spatial reasoning in domestic environments. | ★★★★★ | 73.9% | ||
272 | Roo Code Evals O | Code assistant eval | Community-maintained coding evals and leaderboard by Roo Code. | ★★★★★ | 99.0% | ||
273 | Ruler 128k o | Long-context eval | RULER benchmark at 128k context window. | ★★★★★ | 90.2% | ||
274 | Ruler 32k o | Long-context eval | RULER benchmark at 32k context window. | ★★★★★ | 96.0% | ||
275 | SALAD-Bench o | Safety alignment | Safety Alignment and Dangerous-behavior benchmark evaluating harmful assistance and refusal consistency. | ★★★★★ | G Granite-4.0-H-Micro | ↓ 96.8% | |
276 | SciCode (sub) O | Code | SciCode subset score (sub). | ★★★★★ | 45.7% | ||
277 | SciCode (main) O | Code | SciCode main score. | ★★★★★ | 15.4% | ||
278 | ScienceQA O | Science QA (multimodal) | Multiple-choice science questions with images, diagrams, and text context. | ★★★★★ | G FastVLM-7B | 96.7% | |
279 | SciQ o | Science QA | Multiple choice science questions. | ★★★★★ | G Pythia 12B | 92.9% | |
280 | ScreenQA Complex O | GUI QA | Complex ScreenQA benchmark accuracy. | ★★★★★ | 87.1% | ||
281 | ScreenQA Short O | GUI QA | Short-form ScreenQA benchmark accuracy. | ★★★★★ | 91.9% | ||
282 | ScreenSpot O | Screen UI locators | Center accuracy on ScreenSpot. | ★★★★★ | 95.4% | ||
283 | ScreenSpot-Pro O | Screen UI locators | Average center accuracy on ScreenSpot-Pro. | ★★★★★ | 63.2% | ||
284 | ScreenSpot-v2 O | Screen UI locators | Center accuracy on ScreenSpot-v2. | ★★★★★ | G UI-Venus 72B | 95.3% | |
285 | SEED-Bench-2-Plus o | Multimodal evaluation | SEED-Bench-2-Plus overall accuracy. | ★★★★★ | 72.9% | ||
286 | SEED-Bench-Img O | Multimodal image understanding | SEED-Bench image-only subset (SEED-Bench-Img). | ★★★★★ | G Bagel 14B | 78.5% | |
287 | Showdown O | GUI agents | Success rate on the Showdown UI interaction benchmark. | ★★★★★ | 76.8% | ||
288 | SIFO o | Instruction following | Single-turn instruction following benchmark. | ★★★★★ | 66.9% | ||
289 | SIFO Multiturn o | Instruction following | Multi-turn SIFO benchmark for sustained instruction adherence. | ★★★★★ | 60.3% | ||
290 | SimpleQA O | QA (factuality) | Short-form factuality benchmark of simple, fact-seeking questions. | ★★★★★ | 97.1% | ||
291 | SimpleVQA o | General VQA | Lightweight visual question answering set with everyday scenes. | ★★★★★ | 65.4% | ||
292 | SimpleVQA-DS o | General VQA | SimpleVQA variant curated by DeepSeek with everyday image question answering tasks. | ★★★★★ | 61.3% | ||
293 | SocialIQA o | Social commonsense | Social interaction commonsense QA. | ★★★★★ | 54.9% | ||
294 | Spider o | Text-to-SQL | Complex text-to-SQL benchmark over cross-domain databases. | ★★★★★ | 67.1% | ||
295 | Spiral-Bench O | Safety / sycophancy | An LLM-judged benchmark measuring sycophancy and delusion reinforcement. | ★★★★★ | 87.0% | ||
296 | SQuAD v1.1 o | Reading comprehension | Extractive QA from Wikipedia articles. | ★★★★★ | 566 | 89.3% | |
297 | SUNRGBD o | 3D scene understanding | SUN RGB-D benchmark for indoor scene understanding from RGB-D imagery. | ★★★★★ | 45.8% | ||
298 | SuperGPQA O | Graduate-level QA | Harder GPQA variant assessing advanced graduate-level reasoning. | ★★★★★ | 64.9% | ||
299 | SWE-Bench o | Code repair | Software engineering benchmark built from real GitHub issues across many repositories. | ★★★★★ | 3442 | 74.5% | |
300 | SWE-Bench Multilingual o | Code repair (multilingual) | Multilingual variant of SWE-Bench for issue fixing. | ★★★★★ | 57.9% | ||
301 | SWE-Bench Pro (Public) o | Software engineering | Public subset of the SWE-Bench Pro benchmark for software-engineering agents. | ★★★★★ | 23.3% | ||
302 | SWE-Bench Verified O | Code repair | Verified subset of SWE-Bench for issue fixing. | ★★★★★ | 77.2% | ||
303 | SWE-Dev o | Code repair | Software engineering development and bug fixing benchmark. | ★★★★★ | 67.1% | ||
304 | SysBench o | System prompts | System prompt understanding and adherence benchmark. | ★★★★★ | 74.1% | ||
305 | τ²-Bench (airline) o | Agent tasks (airline) | τ²-Bench airline domain evaluation. | ★★★★★ | 64.8% | ||
306 | τ²-Bench (retail) o | Agent tasks (retail) | τ²-Bench retail domain evaluation. | ★★★★★ | 82.4% | ||
307 | τ²-Bench (telecom) O | Agent tasks (telecom) | τ²-Bench telecom domain evaluation. | ★★★★★ | 96.7% | ||
308 | TAU1-Airline o | Agent tasks (airline) | Tool-augmented agent evaluation in airline scenarios (TAU1). | ★★★★★ | 54.0% | ||
309 | TAU1-Retail o | Agent tasks (retail) | Tool-augmented agent evaluation in retail scenarios (TAU1). | ★★★★★ | 71.3% | ||
310 | TAU2-Airline O | Agent tasks (airline) | Tool-augmented agent evaluation in airline scenarios (TAU2). | ★★★★★ | 70.0% | ||
311 | TAU2-Retail O | Agent tasks (retail) | Tool-augmented agent evaluation in retail scenarios (TAU2). | ★★★★★ | 86.8% | ||
312 | TAU2-Telecom O | Agent tasks (telecom) | Tool-augmented agent evaluation in telecom scenarios (TAU2). | ★★★★★ | 98.0% | ||
313 | Terminal-Bench O | Agent terminal tasks | Command-line task completion benchmark for agents. | ★★★★★ | 637 | 61.3% | |
314 | Terminal-Bench Hard O | Agent terminal tasks | Hard subset of Terminal-Bench command-line agent tasks. | ★★★★★ | 37.6% | ||
315 | TextVQA O | Text-based VQA | Visual question answering that requires reading text in images. | ★★★★★ | 85.5% | ||
316 | TreeBench o | Reasoning with tree structures | Evaluates hierarchical/tree-structured reasoning and planning capabilities in LLMs/VLMs. | ★★★★★ | 50.1% | ||
317 | TriQA o | Knowledge QA | Triadic question answering benchmark evaluating world knowledge and reasoning. | ★★★★★ | 82.2% | ||
318 | TriviaQA O | Open-domain QA | Open-domain question answering benchmark built from trivia and web evidence. | ★★★★★ | 85.5% | ||
319 | TriviaQA-Wiki o | Open-domain QA | TriviaQA subset answering using Wikipedia evidence. | ★★★★★ | 91.8% | ||
320 | TruthfulQA O | Truthfulness / hallucination | Measures whether a model imitates human falsehoods (truthfulness). | ★★★★★ | 70.3% | ||
321 | TruthfulQA (DE) o | Truthfulness / hallucination (German) | German translation of the TruthfulQA benchmark. | ★★★★★ | 0.2% | ||
322 | TydiQA o | Cross-lingual QA | Typologically diverse QA across languages. | ★★★★★ | 313 | 34.3% | |
323 | V* o | Multimodal reasoning | V* benchmark accuracy. | ★★★★★ | G MiMo-VL 7B-RL | 81.7% | |
324 | VCT O | Virology capability (protocol troubleshooting) | Virology Capabilities Test: a benchmark that measures an LLM's ability to troubleshoot complex virology laboratory protocols. | ★★★★★ | 43.8% | ||
325 | VibeEval O | Multimodal chat eval | Challenging image-and-prompt pairs for evaluating multimodal chat models, scored by a model judge. | ★★★★★ | 76.4% | ||
326 | Video-MME o | Video understanding (multimodal) | Multimodal evaluation of video understanding and reasoning. | ★★★★★ | 74.5% | ||
327 | VideoMME (w/o sub) O | Video understanding | Video understanding benchmark without subtitles. | ★★★★★ | 85.1% | ||
328 | VideoMME (w/sub) o | Video understanding | Video understanding benchmark with subtitles. | ★★★★★ | 80.7% | ||
329 | VideoMMMU O | Multimodal video understanding | Video-based extension of MMMU evaluating temporal multimodal reasoning and perception across disciplines. | ★★★★★ | 84.6% | ||
330 | VisualWebBench O | Web UI understanding | Average accuracy on VisualWebBench. | ★★★★★ | 83.8% | ||
331 | VisuLogic O | Visual logical reasoning | Logical reasoning and compositionality benchmark for visual-language models. | ★★★★★ | 35.9% | ||
332 | VitaBench o | Industry QA | Industry-focused benchmark evaluating domain QA performance. | ★★★★★ | 35.3% | ||
333 | VL-RewardBench o | Reward modeling (VL) | Reward alignment benchmark for VLMs. | ★★★★★ | 67.4% | ||
334 | VLMs are Biased o | Multimodal bias | Evaluates whether VLMs truly 'see' vs. relying on memorized knowledge; measures bias toward non-visual priors. | ★★★★★ | 90 | 20.2% | |
335 | VLMs are Blind O | Visual grounding robustness | Evaluates failure modes of VLMs in grounding and perception tasks. | ★★★★★ | G MiMo-VL 7B-RL | 79.4% | |
336 | VoiceBench AdvBench o | VoiceBench | VoiceBench adversarial safety evaluation. | ★★★★★ | 99.4% | ||
337 | VoiceBench AlpacaEval o | VoiceBench | VoiceBench evaluation on AlpacaEval instructions. | ★★★★★ | 96.8% | ||
338 | VoiceBench BBH o | VoiceBench | VoiceBench evaluation on Big-Bench Hard prompts. | ★★★★★ | 92.6% | ||
339 | VoiceBench CommonEval o | VoiceBench | VoiceBench evaluation on CommonEval. | ★★★★★ | 91.0% | ||
340 | VoiceBench IFEval o | VoiceBench | VoiceBench instruction-following evaluation (IFEval). | ★★★★★ | 85.7% | ||
341 | MMAU v05.15.25 o | Audio reasoning | Audio reasoning benchmark MMAU v05.15.25. | ★★★★★ | 77.6% | ||
342 | VoiceBench MMSU o | VoiceBench | VoiceBench MMSU benchmark (voice modality). | ★★★★★ | 84.3% | ||
343 | VoiceBench MMSU (Audio) o | Audio reasoning | Audio reasoning MMSU results. | ★★★★★ | 77.7% | ||
344 | VoiceBench OpenBookQA o | VoiceBench | VoiceBench results on OpenBookQA prompts. | ★★★★★ | 95.0% | ||
345 | VoiceBench Overall o | VoiceBench | Overall VoiceBench aggregate score. | ★★★★★ | 89.6% | ||
346 | VoiceBench SD-QA o | VoiceBench | VoiceBench Spoken Dialogue QA results. | ★★★★★ | 90.1% | ||
347 | VoiceBench WildVoice o | VoiceBench | VoiceBench evaluation on WildVoice dataset. | ★★★★★ | 93.4% | ||
348 | VQAv2 O | Visual question answering | Standard Visual Question Answering v2 benchmark on natural images. | ★★★★★ | 86.5% | ||
349 | VSI-Bench o | Spatial intelligence | Visual spatial intelligence benchmark covering 3D reasoning and spatial inference tasks. | ★★★★★ | 63.2% | ||
350 | WebClick O | GUI agents | Task success on the WebClick UI agent benchmark. | ★★★★★ | 93.0% | ||
351 | WebDev Arena O | Web development agents | Arena evaluation for autonomous web development agents. | ★★★★★ | 1483 | ||
352 | WebQuest-MultiQA o | Web agents | Multi-question web search and interaction tasks. | ★★★★★ | 60.6% | ||
353 | WebQuest-SingleQA o | Web agents | Single-question web search and interaction tasks. | ★★★★★ | 76.9% | ||
354 | WebSrc O | Web QA | Webpage question answering (SQuAD F1). | ★★★★★ | 97.2% | ||
355 | WebVoyager2 o | Web agents | Web navigation and interaction tasks for LLM agents (v2). | ★★★★★ | 84.4% | ||
356 | WebWalkerQA o | Web agents | WebWalker tasks evaluating autonomous browsing question answering performance. | ★★★★★ | 72.2% | ||
357 | WeMath o | Math reasoning | Math reasoning benchmark spanning diverse curricula and difficulty levels. | ★★★★★ | 68.8% | ||
358 | WildBench V2 o | Instruction following | WildBench V2 human preference benchmark for instruction following and helpfulness. | ★★★★★ | 65.3% | ||
359 | Winogender o | Gender bias (coreference) | Coreference resolution dataset for measuring gender bias. | ★★★★★ | 67.9% | ||
360 | WinoGrande O | Coreference reasoning | Large-scale adversarial Winograd Schema-style pronoun resolution. | ★★★★★ | 99 | 86.7% | |
361 | WinoGrande (DE) o | Coreference reasoning (German) | German translation of the WinoGrande pronoun resolution benchmark. | ★★★★★ | 0.8% | ||
362 | WMT16 En–De o | Machine translation | WMT16 English–German translation benchmark (news). | ★★★★★ | 38.8% | ||
363 | WMT16 En–De (Instruct) o | Machine translation | Instruction-tuned evaluation on the WMT16 English–German translation set. | ★★★★★ | 37.9% | ||
364 | WritingBench O | Writing quality | General-purpose writing quality benchmark. | ★★★★★ | 88.3% | ||
365 | WSC o | Coreference reasoning | Classic Winograd Schema Challenge measuring commonsense coreference. | ★★★★★ | G Pythia 410M | 47.1% | |
366 | xBench-DeepSearch o | Agentic research | Evaluates multi-hop deep research workflows on xBench DeepSearch tasks. | ★★★★★ | 75.0% | ||
367 | ZebraLogic O | Logical reasoning | Logical reasoning benchmark assessing complex pattern and rule inference. | ★★★★★ | 94.2% | ||
368 | ZeroBench O | Zero-shot generalization | Evaluates zero-shot performance across diverse tasks without task-specific finetuning. | ★★★★★ | 23.4% | ||
369 | ZeroBench (sub) O | Zero-shot generalization | Subset of ZeroBench targeting harder zero-shot reasoning cases. | ★★★★★ | 33.8% | ||
370 | ZeroSCROLLS MuSiQue o | Long-context reasoning | ZeroSCROLLS split derived from MuSiQue multi-hop QA. | ★★★★★ | 0.5% | ||
371 | ZeroSCROLLS SpaceDigest o | Long-context summarization | ZeroSCROLLS SpaceDigest extractive summarization task. | ★★★★★ | 0.8% | ||
372 | ZeroSCROLLS SQuALITY o | Long-context summarization | ZeroSCROLLS split based on the SQuALITY long-form summarization benchmark. | ★★★★★ | 0.2% |