
Furukama's Blog

Ben Koehler - Founder, Speaker, Coder

Fu — Benchmark of Benchmarks

Fu-Benchmark is a meta-benchmark of the most influential evaluation suites used to measure and rank large language models. Use the search box to filter by name, topic, or model.
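The search-box filtering described above amounts to a case-insensitive substring match over each row's name, topic, and leading model. A minimal sketch in Python, assuming a simplified row layout (the field names and the three sample rows are illustrative, not the site's actual data model or code):

```python
# Hypothetical sketch of the table's search filter: case-insensitive
# substring matching over a row's name, topic, and leader fields.
# Field names and sample rows are illustrative only.

BENCHMARKS = [
    {"name": "AIME 2025", "topic": "Math (competition)", "leader": "Claude Sonnet 4.5"},
    {"name": "GPQA", "topic": "Graduate-level QA", "leader": "GPT-5.2 Thinking"},
    {"name": "BrowseComp", "topic": "Web browsing", "leader": "Gemini 3.1 Pro"},
]


def filter_benchmarks(rows, query):
    """Return the rows whose name, topic, or leader contains the query."""
    q = query.lower()
    return [
        row
        for row in rows
        if any(q in row[field].lower() for field in ("name", "topic", "leader"))
    ]


# Filtering by topic keyword:
print([r["name"] for r in filter_benchmarks(BENCHMARKS, "math")])  # → ['AIME 2025']
# Filtering by model name:
print([r["name"] for r in filter_benchmarks(BENCHMARKS, "gemini")])  # → ['BrowseComp']
```

A real implementation would filter the full table client-side the same way; the sketch only shows the matching rule.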

Benchmarks
Scores prefixed with ↓ come from lower-is-better metrics (e.g. error or hallucination rates).

| # | Name | Topic | Description | Relevance | GitHub ★ | Leader | Top % |
|---|------|-------|-------------|-----------|----------|--------|-------|
| 1 | AA-Index | Multi-domain QA | Comprehensive QA index across diverse domains. | ★★★★ | | 🇺🇸 Grok 4 | 73.2% |
| 2 | AA-LCR | Long-context reasoning | A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens. | ★★★★ | | 🇺🇸 GPT-5 High | 76.0% |
| 3 | AA-Omniscience | Knowledge and hallucination | Benchmark measuring factual recall and hallucination across economically relevant domains. | ★★★★ | | 🇺🇸 Gemini 3 Pro Preview | 13.0% |
| 4 | AceBench | Industry QA | Industry-focused benchmark assessing domain QA and reasoning. | ★★★★ | | 🇨🇳 Kimi-K2 | 82.2% |
| 5 | ACP-Bench Bool | Safety evaluation (boolean) | Safety and behavior evaluation with yes/no questions. | ★★★★ | | 🇨🇳 Qwen3-32B | 85.1% |
| 6 | ACP-Bench MCQ | Safety evaluation (MCQ) | Safety and behavior evaluation with multiple-choice questions. | ★★★★ | | 🇺🇸 Llama 3.3 70B | 82.1% |
| 7 | AetherCode | Code generation | Code generation benchmark for diverse coding tasks. | ★★★★ | | 🇺🇸 GPT-5.2 High | 73.8% |
| 8 | AgentCompany | Agent reasoning | Company-level agent reasoning and decision-making benchmark. | ★★★★ | | 🇺🇸 Claude Sonnet 4.5 | 41.0% |
| 9 | AgentDojo | Agent evaluation | Interactive evaluation suite for autonomous agents across tools and tasks. | ★★★★ | | 🇺🇸 Claude 3.7 Sonnet | 88.7% |
| 10 | Agentic Coding | Agentic coding | Agentic coding benchmark for autonomous software tasks. | ★★★★ | | 🇺🇸 Gemini 3 Flash Preview | 53.8% |
| 11 | AGIEval (English) | Exams | English subset of AGIEval; academic and professional exam questions. | ★★★★ | | 🇨🇳 Qwen3-VL 32B Thinking | 92.2% |
| 12 | AGIEval LSAT-AR | Law exam reasoning | LSAT Analytical Reasoning subset of the AGIEval benchmark. | ★★★★ | | 🇨🇳 Qwen2.5 32B Base | 30.4% |
| 13 | AI2D | Diagram understanding (VQA) | Visual question answering over science and diagram images. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 98.7% |
| 14 | AICodeKing Non-Agentic | Code generation (non-agentic) | Non-agentic code generation benchmark from AICodeKing. | ★★★★ | | 🇺🇸 Claude Opus 4.6 | 100.0% |
| 15 | Aider Code Editing | Code editing | Measures interactive code editing quality within the Aider assistant workflow. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 89.8% |
| 16 | Aider-Polyglot | Code assistant eval | Aider polyglot coding leaderboard. | ★★★★★ | | 🇺🇸 Gemini 3 Pro Preview | 92.9% |
| 17 | Aider-Polyglot (Diff) | Code assistant eval | Aider polyglot leaderboard using diff mode (pass@2). | ★★★★ | | 🇺🇸 Gemini 3 Pro Preview | 91.9% |
| 18 | AIME 2024 | Math (competition) | American Invitational Mathematics Examination 2024 problems. | ★★★★★ | | 🇺🇸 GPT-OSS 120B | 96.6% |
| 19 | AIME 2024-Ko | Math (competition, Korean) | Korean translation of AIME 2024 problems. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Thinking-2507 | 80.3% |
| 20 | AIME 2025 | Math (competition) | American Invitational Mathematics Examination 2025 problems. | ★★★★★ | | 🇺🇸 Claude Sonnet 4.5 | 100.0% |
| 21 | AIME 2026 I | Math (competition) | American Invitational Mathematics Examination 2026 I problems. | ★★★★ | | 🇺🇸 GPT-5.2 High | 97.5% |
| 22 | AInstein-SWE-Bench | Agentic coding | AInstein agent coding benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 42.8% |
| 23 | AlignBench | Alignment and instruction following | Benchmark for instruction-following quality and alignment behavior. | ★★★★ | | G JoyAI-LLM Flash | 8.2% |
| 24 | All-Angles Bench | Spatial perception | All-Angles benchmark for spatial recognition and 3D perception. | ★★★★ | | G Step3-VL-10B | 57.2% |
| 25 | AlpacaEval | Instruction following | Automatic eval using GPT-4 as a judge. | ★★★★★ | 1849 | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 99.4% |
| 26 | AlpacaEval 2.0 | Instruction following | Updated AlpacaEval with improved prompts and judging. | ★★★★ | | 🇨🇳 DeepSeek R1 | 87.6% |
| 27 | AMC-23 | Math (competition) | American Mathematics Competition 2023 evaluation. | ★★★★ | | G QwQ-32B | 98.5% |
| 28 | AMO-Bench | Math (competition) | Advanced math olympiad-style benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 72.5% |
| 29 | AMO-Bench CH | Math (competition) | Chinese subset of AMO-Bench. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 74.9% |
| 30 | AndroidWorld | Mobile agents | Benchmark for agents operating Android apps via UI automation. | ★★★★ | | 🇨🇳 Qwen3.5-35B-A3B | 71.1% |
| 31 | APEX-Agents | Long-horizon professional tasks | APEX benchmark evaluating agents on long-horizon professional tasks. | ★★★★ | | 🇺🇸 Gemini 3.1 Pro | 33.5% |
| 32 | API-Bank | Tool use | API-Bank tool-use benchmark. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 92.0% |
| 33 | ARC-AGI-1 | General reasoning | ARC-AGI Phase 1 aggregate accuracy. | ★★★★ | | 🇺🇸 GPT-5.2 High | 89.9% |
| 34 | ARC-AGI-2 | General reasoning | ARC-AGI Phase 2 aggregate accuracy. | ★★★★ | | 🇺🇸 Gemini 3 Deep Think | 84.6% |
| 35 | ARC Average | Science QA (average) | Average accuracy across ARC-Easy and ARC-Challenge. | ★★★★ | | 🇺🇸 SmolLM2 1.7B Pretrained | 60.5% |
| 36 | ARC-Challenge | Science QA | Hard subset of the AI2 Reasoning Challenge; grade-school science. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 96.9% |
| 37 | ARC-Challenge (DE) | Science QA (German) | German translation of the ARC-Challenge benchmark. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.7% |
| 38 | ARC-Easy | Science QA | Easier subset of the AI2 Reasoning Challenge. | ★★★★ | | 🇺🇸 Gemma 3 PT 27B | 89.0% |
| 39 | ARC-Easy (DE) | Science QA (German) | German translation of the ARC-Easy science QA benchmark. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.8% |
| 40 | Arena-Hard | Chat ability | Hard prompts on Chatbot Arena. | ★★★★★ | 920 | 🇫🇷 Mistral Medium 3 | 97.1% |
| 41 | Arena-Hard V2 | Chat ability | Updated Arena-Hard v2 prompts on Chatbot Arena. | ★★★★★ | 920 | 🇨🇳 Qwen3 Max Thinking | 90.2% |
| 42 | Arena-Hard V2 Creative Writing | Creative writing | Chatbot Arena Hard V2 creative writing win-rate subset. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 93.6% |
| 43 | Arena-Hard V2 Hard Prompt | Chat ability | Chatbot Arena Hard V2 benchmark using the hard-prompt win-rate subset. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 72.6% |
| 44 | ARKitScenes | 3D scene understanding | ARKitScenes benchmark for assessing 3D scene reconstruction and understanding from mixed-reality captures. | ★★★★ | | 🇨🇳 Qwen2.5-VL 72B Instruct | 61.5% |
| 45 | ART Agent Red Teaming | Agent robustness | Evaluation suite for adversarial red-teaming of autonomous AI agents. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | ↓ 33.6% |
| 46 | ArtifactsBench | Agentic coding | Artifacts-focused coding and tool-use benchmark evaluating generated code artifacts. | ★★★★ | | 🇺🇸 GPT-5 Thinking | 73.0% |
| 47 | ASR AMI | ASR | Automatic speech recognition benchmark on AMI meeting speech. | ★★★★ | | 🇨🇳 Qwen2.5-Omni-3B | ↓ 15.1% |
| 48 | ASR Earnings22 | ASR | Automatic speech recognition benchmark on Earnings22 financial calls. | ★★★★ | | G Whisper-large-V3 | ↓ 11.3% |
| 49 | ASR GigaSpeech | ASR | Automatic speech recognition benchmark on GigaSpeech. | ★★★★ | | G Whisper-large-V3 | ↓ 10.0% |
| 50 | ASR LibriSpeech Clean | ASR | Automatic speech recognition benchmark on the LibriSpeech clean split. | ★★★★ | | 🇺🇸 LFM2.5-Audio-1.5B | ↓ 1.9% |
| 51 | ASR LibriSpeech Other | ASR | Automatic speech recognition benchmark on the LibriSpeech other split. | ★★★★ | | G Whisper-large-V3 | ↓ 3.9% |
| 52 | ASR SPGISpeech | ASR | Automatic speech recognition benchmark on SPGISpeech. | ★★★★ | | 🇺🇸 LFM2.5-Audio-1.5B | ↓ 2.8% |
| 53 | ASR TED-LIUM | ASR | Automatic speech recognition benchmark on TED-LIUM. | ★★★★ | | 🇺🇸 LFM2.5-Audio-1.5B | ↓ 3.5% |
| 54 | ASR VoxPopuli | ASR | Automatic speech recognition benchmark on VoxPopuli. | ★★★★ | | 🇨🇳 Qwen2.5-Omni-3B | ↓ 5.6% |
| 55 | AstaBench | Agent evaluation | Evaluates science agents across literature understanding, data analysis, planning, tool use, coding, and search. | ★★★★ | | 🇺🇸 Claude Sonnet 4 | 53.0% |
| 56 | AttaQ | Safety / jailbreak | Adversarial jailbreak suite measuring refusal robustness against targeted attack prompts. | ★★★★ | | G Granite 3.3 8B Instruct | 88.5% |
| 57 | AutoCodeBench | Autonomous coding | End-to-end autonomous coding benchmark with unit-test-based execution across diverse repositories and tasks. | ★★★★ | | 🇺🇸 Claude Opus 4 (Thinking) | 52.4% |
| 58 | AutoCodeBench-Lite | Autonomous coding | Lite version of AutoCodeBench focusing on smaller tasks with the same end-to-end, unit-test-based evaluation. | ★★★★ | | 🇺🇸 Claude Opus 4 | 64.5% |
| 59 | AutoLogi | Logical reasoning | AutoLogi benchmark evaluating automated logical reasoning accuracy. | ★★★★ | | 🇺🇸 Claude Sonnet 4 | 89.8% |
| 60 | BABE | STEM reasoning | STEM reasoning benchmark evaluating broad applied and basic engineering knowledge. | ★★★★ | | 🇺🇸 GPT-5.2 High | 58.1% |
| 61 | BabyVision | Visual reasoning | Visual reasoning benchmark testing basic visual perception and understanding. | ★★★★ | | 🇨🇳 Qwen3.5-397B-A17B | 52.3% |
| 62 | BALROG | Agent robustness | Benchmark for assessing LLM agents under adversarial and out-of-distribution tool-use scenarios. | ★★★★ | | 🇺🇸 Grok 4 | 43.6% |
| 63 | BBH | Multi-task reasoning | Hard subset of BIG-bench with diverse reasoning tasks. | ★★★★★ | 510 | 🇨🇳 ERNIE 4.5 424B A47B | 94.3% |
| 64 | BBH-ZH | Multi-task reasoning (Chinese) | Chinese translation of BIG-Bench Hard reasoning tasks. | ★★★★ | | G LLaDA2.0 Flash | 87.5% |
| 65 | BBQ | Bias evaluation | Bias Benchmark for Question Answering evaluating social biases across contexts. | ★★★★ | | 🇫🇷 Mixtral 8x7B | 56.0% |
| 66 | BeaverTails | Safety / harmfulness | Safety benchmark evaluating harmfulness in model responses. | ★★★★ | | G IQuest-Coder-V1-40B-Thinking | 76.7% |
| 67 | BeyondAIME | Math (beyond AIME) | Advanced math problems exceeding AIME difficulty. | ★★★★ | | G Seed2.0 Pro | 86.5% |
| 68 | BFCL | Code reasoning | Benchmark for functional code correctness and logic. | ★★★★ | | 🇨🇳 Qwen3-4B | 95.0% |
| 69 | BFCL Live v2 | Finance QA | Financial compliance and literacy questions from the BFCL Live v2 benchmark. | ★★★★ | | 🇺🇸 o1 Mini | 81.0% |
| 70 | BFCL v2 | Code reasoning | Second release of the BFCL benchmark focusing on functional code correctness and logic. | ★★★★ | | G MobileLLM P1 | 29.4% |
| 71 | BFCL v3 | Code reasoning | Benchmark for functional code correctness and logic (v3). | ★★★★ | | 🇨🇳 GLM 4.5 | 77.8% |
| 72 | BFCL v3 (Live) | Tool calling | BFCL v3 Live subset for real-time tool-calling evaluation. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Thinking-2507 | 82.9% |
| 73 | BFCL v3 (Multi-Turn) | Tool calling | BFCL v3 Multi-Turn subset for multi-turn tool-calling evaluation. | ★★★★ | | G MiniMax M2.5 | 76.8% |
| 74 | BFCL v4 | Code reasoning | BFCL v4 benchmark for functional code correctness and logic. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 77.5% |
| 75 | BIG-Bench | Multi-task reasoning | BIG-bench overall performance (original). | ★★★★★ | 3110 | 🇺🇸 Gemma 2 7B | 55.1% |
| 76 | BIG-Bench Extra Hard | Multi-task reasoning | Extra-hard subset of BIG-bench tasks. | ★★★★ | | G Ling 2.5 1T | 52.0% |
| 77 | BigCodeBench | Code generation | BigCodeBench evaluates large language models on practical code generation tasks with unit-test verification. | ★★★★ | | G MiMo V2 Flash Base | 70.1% |
| 78 | BigCodeBench Hard | Code generation (hard) | Harder variant of BigCodeBench testing complex programming and library tasks with function-level code generation. | ★★★★ | | 🇺🇸 Claude 3.7 Sonnet (2025-02-19) | 35.8% |
| 79 | BIOBench | Biology reasoning | Biology knowledge and reasoning benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 51.9% |
| 80 | Biology-Instruction | Biology multi-omics | Multi-omics sequence reasoning benchmark for biological data understanding. | ★★★★ | | G Intern-S1-Pro | 52.5% |
| 81 | BioLP-Bench | Biomedical NLP | Comprehensive biomedical language processing benchmark evaluating LLMs across tasks like NER, relation extraction, and QA. | ★★★★ | | 🇺🇸 Grok 4 | 47.0% |
| 82 | Bird-SQL | Text-to-SQL | Natural language to SQL generation benchmark. | ★★★★ | | 🇺🇸 Gemini 2.0 Pro | 59.3% |
| 83 | BLINK | Multimodal grounding | Evaluates visual-language grounding and reference resolution to reduce hallucinations. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 87.4% |
| 84 | BoB-HVR | Composite capability index | Hard, Versatile, and Relevant composite score across eight capability buckets. | ★★★★ | | 🇺🇸 Llama 3 70B | 9.0% |
| 85 | BOLD | Bias evaluation | Bias in Open-ended Language Dataset probing demographic biases in text generation. | ★★★★ | | 🇫🇷 Mixtral 8x7B | ↓ 0.1% |
| 86 | BoolQ | Reading comprehension | Yes/no QA from naturally occurring questions. | ★★★★★ | 171 | G Marin-32B-Mantis | 89.4% |
| 87 | Borda Count (Multilingual) | Aggregate ranking | Borda count aggregate ranking across multilingual benchmarks; lower is better. | ★★★★ | | 🇨🇳 Qwen3-32B | ↓ 2.9% |
| 88 | BridgeBench | Reasoning | BridgeBench evaluation benchmark. | ★★★★ | | 🇺🇸 Claude Opus 4.6 | 60.1% |
| 89 | BrowseComp | Web browsing | Web browsing comprehension and competence benchmark. | ★★★★ | | 🇺🇸 Gemini 3.1 Pro | 85.9% |
| 90 | BrowseComp (With Content Manager) | Web browsing | BrowseComp benchmark evaluated with content-manager assistance. | ★★★★ | | 🇺🇸 Claude Opus 4.6 | 84.0% |
| 91 | BrowseComp_zh | Web browsing (Chinese) | Chinese variant of the BrowseComp web browsing benchmark. | ★★★★ | | G Seed1.8 | 81.3% |
| 92 | BRuMo25 | Math (competition) | BRuMo 2025 olympiad-style mathematics benchmark. | ★★★★ | | 🇺🇸 QuestA Nemotron 1.5B | 69.5% |
| 93 | BuzzBench | Humor analysis | A humor analysis benchmark. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 71.1% |
| 94 | C-Eval | Chinese exams | Chinese college-level exam benchmark. | ★★★★★ | 1768 | 🇨🇳 Kimi-K2.5 | 94.0% |
| 95 | C3-Bench | Reasoning (Chinese) | Comprehensive Chinese reasoning capability benchmark. | ★★★★ | 35 | 🇨🇳 GLM-4.5 Base | 83.1% |
| 96 | CaseLaw v2 | Legal reasoning | U.S. case law benchmark evaluating legal reasoning and judgment over court opinions. | ★★★★ | | 🇺🇸 GPT-4.1 | 78.1% |
| 97 | CC-OCR | OCR (cross-lingual) | Cross-lingual OCR benchmark evaluating character recognition across mixed-language documents. | ★★★★ | | 🇨🇳 Qwen3.5-397B-A17B | 82.0% |
| 98 | CFEval | Coding Elo / contest eval | Contest-style coding evaluation with Elo-like scoring. | ★★★★ | | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 2134 |
| 99 | CGBench | Long video QA | Cartoon/CG long-video question answering benchmark. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 64.6% |
| 100 | Charades-STA | Video grounding | Charades-STA temporal grounding (mIoU). | ★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 64.0% |
| 101 | ChartMuseum | Chart understanding | Large-scale curated collection of charts for evaluating parsing, grounding, and reasoning. | ★★★★ | | 🇺🇸 GPT-5 mini | 63.3% |
| 102 | ChartQA | Chart understanding (VQA) | Visual question answering over charts and plots. | ★★★★ | | 🇨🇳 Keye-VL-1.5-8B | 94.1% |
| 103 | ChartQA-Pro | Chart understanding (VQA) | Professional-grade chart question answering with diverse chart types and complex reasoning. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 69.5% |
| 104 | CharXiv (DQ) | Chart description (PDF) | Scientific chart/table descriptive questions from arXiv PDFs. | ★★★★ | | 🇺🇸 o3-high | 95.0% |
| 105 | CharXiv (RQ) | Chart reasoning (PDF) | Scientific chart/table reasoning questions from arXiv PDFs. | ★★★★ | | 🇺🇸 GPT-5.2 Thinking | 82.1% |
| 106 | Chinese SimpleQA | QA (Chinese) | Chinese variant of the SimpleQA benchmark. | ★★★★ | | 🇨🇳 Kimi-K2 Base | 77.6% |
| 107 | CL-Bench | Long-context reasoning | Comprehensive long-context benchmark evaluating reasoning over extended contexts. | ★★★★ | | 🇺🇸 GPT-5 Mini High | 25.2% |
| 108 | CLIcK | Korean instruction following | Korean long-form instruction-following benchmark. | ★★★★ | | 🇨🇳 DeepSeek V3.2-Thinking | 86.3% |
| 109 | CloningScenarios | Biosecurity refusal | Safety benchmark that red-teams models with cloning-related misuse scenarios to measure compliance and refusal rates. | ★★★★ | | 🇺🇸 Grok 4 | ↓ 45.0% |
| 110 | CLUEWSC | Coreference reasoning (Chinese) | Chinese Winograd Schema-style coreference benchmark from CLUE. | ★★★★ | | 🇨🇳 DeepSeek R1 | 92.8% |
| 111 | CMath | Math (Chinese) | Chinese mathematics benchmark. | ★★★★ | | G LLaDA2.0 Flash | 96.9% |
| 112 | CMMLU | Chinese multi-domain | Chinese counterpart to MMLU. | ★★★★★ | 781 | 🇨🇳 Qwen2.5 Max | 91.9% |
| 113 | CNMO 2024 | Math (competition) | China National Mathematical Olympiad 2024 evaluation set. | ★★★★ | | G openPangu-R-72B-2512 Slow Thinking | 82.8% |
| 114 | Codeforces | Competitive programming | Competitive programming performance on Codeforces problems (Elo). | ★★★★ | | 🇺🇸 Gemini 3 Deep Think | 3455 |
| 115 | CodeIF-Bench | Code instruction following | Code-focused instruction-following benchmark. | ★★★★ | | 🇨🇳 Qwen3-8B Non-Thinking | 50.0% |
| 116 | COLLIE | Instruction following | Comprehensive instruction-following evaluation suite. | ★★★★ | 55 | 🇺🇸 GPT-5 | 99.0% |
| 117 | Collie-Hard | Instruction following | Hard subset of COLLIE instruction-following tasks. | ★★★★ | | 🇺🇸 GPT-5 High | 99.0% |
| 118 | CommonsenseQA | Commonsense QA | Multiple-choice QA requiring commonsense knowledge. | ★★★★ | | 🇨🇳 Qwen2.5 32B Base | 88.5% |
| 119 | Complex Workflow | Complex workflows | Complex workflow benchmark for economically valuable tasks. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 58.2% |
| 120 | COPA | Causal reasoning | Choice of Plausible Alternatives. | ★★★★ | | G Marin-32B-Bison | 94.0% |
| 121 | CORE | Ontological reasoning | Comprehensive Ontological Relation Evaluation for large language models. | ★★★★ | | G Nanbeige4.1-3B | 53.5% |
| 122 | CorpusQA | Long-context QA | Question answering over large text corpora. | ★★★★ | | 🇺🇸 GPT-5 | 81.6% |
| 123 | CountBench | Visual counting | Object counting and numeracy benchmark for visual-language models across varied scenes. | ★★★★ | | 🇨🇳 Qwen3.5-27B | 97.8% |
| 124 | CountBenchQA | Visual counting QA | Visual question answering benchmark focused on counting objects across varied scenes. | ★★★★ | | G Moondream-9B-A2B | 93.2% |
| 125 | Countdown | Planning and reasoning | Countdown-style reasoning and planning benchmark. | ★★★★ | | G K2-V2 | 75.6% |
| 126 | Countix | Video counting | Video-based counting benchmark for multiple objects. | ★★★★ | | G Seed1.8 | 31.0% |
| 127 | CRAG | Retrieval QA | Complex Retrieval-Augmented Generation benchmark for grounded question answering. | ★★★★ | | G Jamba Mini 1.6 | 76.2% |
| 128 | Creative Story-Writing Benchmark V3 | Creative writing | Story writing benchmark evaluating creativity, coherence, and style (v3). | ★★★★★ | 291 | 🇨🇳 Kimi-K2-Instruct-0905 | 8.7% |
| 129 | Longform Creative Writing | Creative writing | Longform creative writing evaluation (EQ-Bench). | ★★★★ | 20 | 🇺🇸 Claude Sonnet 4.5 (Thinking) | 79.8% |
| 130 | Creative Writing v3 | Creative writing | An LLM-judged creative writing benchmark. | ★★★★ | 54 | 🇺🇸 o3 | 1661 |
| 131 | Complex Research using Integrated Thinking – Physics Test | Reasoning | CritPt (Complex Research using Integrated Thinking – Physics Test) benchmark. | ★★★★ | | 🇺🇸 GPT-5 (High, Code & Web) | 12.6% |
| 132 | CRUX-I | Code reasoning | Code Reasoning and Understanding eXam – Interactive. | ★★★★ | | 🇺🇸 Gemini 3 Pro Preview | 98.8% |
| 133 | CRUX-O | Code reasoning | Code Reasoning and Understanding eXam – Offline. | ★★★★ | | G IQuest-Coder-V1-40B-Loop-Thinking | 99.4% |
| 134 | CruxEval | Code reasoning | Mathematical coding challenge set from the CruxEval benchmark. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Instruct-2507 | 86.8% |
| 135 | CSimpleQA | QA | Chinese SimpleQA benchmark variant (short factual questions). | ★★★★ | | G Ling 2.5 1T | 79.0% |
| 136 | Customer Support Q&A | Customer support QA | Customer support question answering benchmark. | ★★★★ | | G Seed1.8 | 69.0% |
| 137 | CUTE | English characters | CUTE aggregate capability score. | ★★★★ | | 🇺🇸 Bolmo 7B | 78.6% |
| 138 | CV-Bench | Computer vision QA | Diverse CV tasks for VLMs. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 92.0% |
| 139 | CVTG-2K CLIPScore | Text rendering | CVTG-2K CLIPScore for text rendering in image generation. | ★★★★ | | G Seedream 4.5 | 0.8% |
| 140 | CVTG-2K NED | Text rendering | CVTG-2K normalized edit distance (NED) for text rendering. | ★★★★ | | 🇨🇳 GLM-Image | 1.0% |
| 141 | CVTG-2K Word Accuracy | Text rendering | CVTG-2K word accuracy for text rendering in images. | ★★★★ | | 🇨🇳 GLM-Image | 0.9% |
| 142 | CyBench | Cybersecurity CTF | Framework with 40 professional-level CTF tasks evaluating LLMs' practical cybersecurity capabilities. | ★★★★ | | 🇺🇸 o3 mini | ↓ 22.5% |
| 143 | CyberGym | Cybersecurity tasks | Benchmark for cybersecurity-related coding and reasoning tasks. | ★★★★ | | 🇺🇸 Claude Opus 4.5 Thinking | 50.6% |
| 144 | Cybersecurity Capture The Flag Challenges | Cybersecurity CTF | Capture-the-flag challenge benchmark evaluating cybersecurity problem-solving skills. | ★★★★ | | 🇺🇸 GPT-5.3 Codex | 77.6% |
| 145 | Cybersecurity CTF | Cybersecurity CTF | Cybersecurity Capture The Flag challenges benchmark. | ★★★★ | | 🇺🇸 GPT-5.3 Codex | 77.6% |
| 146 | DA-2K | Spatial reasoning | 2D/3D spatial reasoning benchmark. | ★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 85.3% |
| 147 | Deep Planning | Planning and reasoning | Benchmark evaluating deep planning and multi-step reasoning capabilities. | ★★★★ | | 🇺🇸 GPT-5.2 Thinking | 44.6% |
| 148 | DeepConsult | Agentic writing | Agentic consulting and writing benchmark. | ★★★★ | | 🇺🇸 GPT-5 High | 57.2% |
| 149 | DeepMind Mathematics | Math reasoning | Synthetic math problem sets from DeepMind covering arithmetic, algebra, calculus, and more. | ★★★★ | | G Granite-4.0-H-Small | 59.3% |
| 150 | DeepResearchBench | Agentic research writing | Research-oriented agentic writing and planning benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 49.6% |
| 151 | DeepSearchQA | Deep web search QA | Multi-step web search and question answering benchmark. | ★★★★ | | 🇨🇳 Kimi-K2.5 Thinking | 77.1% |
| 152 | DeR2 Bench | Long-context reasoning | Dense retrieval and reasoning benchmark for long-context evaluation. | ★★★★ | | 🇺🇸 GPT-5.2 High | 69.0% |
| 153 | Design2Code | Coding (UI) | Translating UI designs into code. | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 93.4% |
| 154 | DesignArena | Generative design | Leaderboard tracking generative design systems across layout, branding, and marketing tasks. | ★★★★ | | 🇺🇸 Claude Sonnet 4.5 (Thinking) | 1410 |
| 155 | DetailBench | Spot small mistakes | Evaluates whether LLMs can notice subtle errors and minor inconsistencies in text. | ★★★★ | | 🇺🇸 Llama 4 Maverick | 8.7% |
| 156 | DiscoX | Agentic writing | DiscoX benchmark for agentic writing and reasoning. | ★★★★ | | G Seed2.0 Pro | 82.0% |
| 157 | Do-Anything-Now | Safety / jailbreak | Resistance to Do Anything Now (DAN) style jailbreak prompts. | ★★★★ | | G IQuest-Coder-V1-40B-Thinking | 97.7% |
| 158 | Do-Not-Answer | Safety / refusal | Evaluates a model's ability to refuse unsafe or disallowed requests. | ★★★★ | | G K2-THINK | 88.0% |
| 159 | DocMath | Document math | Math reasoning on document-based problems. | ★★★★ | | 🇺🇸 GPT-5 | 67.6% |
| 160 | DocVQA | Document understanding (VQA) | Visual question answering over scanned documents. | ★★★★ | | 🇨🇳 Seed1.5-VL-Thinking | 96.9% |
| 161 | Dolphin-Page | Document OCR | Dolphin Page benchmark measuring OCR fidelity and structured extraction on multi-layout documents. | ★★★★ | | 🇺🇸 Dolphin 1.5 | ↓ 7.4% |
| 162 | DPG-Bench | Text rendering | DPG-Bench score for text rendering in image generation. | ★★★★ | | G Seedream 4.5 | 88.6% |
| 163 | DROP | Reading + reasoning | Discrete reasoning over paragraphs (addition, counting, comparisons). | ★★★★★ | | 🇨🇳 Kimi K2 Instruct | 93.5% |
| 164 | DUDE | Multimodal long-context | Long-context multimodal understanding benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 70.1% |
| 165 | DynaMath | Math reasoning (video) | Dynamic/video-based mathematical reasoning evaluating temporal and visual understanding. | ★★★★ | | 🇨🇳 Qwen3.5-27B | 87.7% |
| 166 | Economically important tasks | Industry QA (cross-domain) | Evaluation suite of real-world, economically impactful tasks across key industries and workflows. | ★★★★ | | 🇺🇸 GPT-5 | 47.1% |
| 167 | Education | Economics/education | Education-field evaluation (economically valuable tasks). | ★★★★ | | G Seed1.8 | 60.8% |
| 168 | EgoSchema | Egocentric video QA | EgoSchema validation accuracy. | ★★★★ | | 🇨🇳 Qwen2-VL 72B Instruct | 77.9% |
| 169 | EgoTempo | Egocentric temporal reasoning | Egocentric video temporal reasoning benchmark. | ★★★★ | | G Seed1.8 | 67.0% |
| 170 | EIFBench | Instruction following | Complex instruction-following benchmark. | ★★★★ | | 🇺🇸 GPT-5 High | 66.7% |
| 171 | EmbSpatialBench | Spatial understanding | Embodied spatial understanding benchmark evaluating navigation and localization. | ★★★★ | | 🇨🇳 Qwen3.5-397B-A17B | 84.5% |
| 172 | EMMA | Multimodal reasoning | EMMA benchmark for multimodal reasoning. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 66.5% |
| 173 | Enamel | Composite capability | Composite capability benchmark capturing broad model performance (Enamel score). | ★★★★ | | G Rnj-1 | 49.0% |
| 174 | EnConda-Bench | Code editing | English code editing benchmark for applying conditional modifications. | ★★★★ | | G Youtu-LLM-2B | 21.5% |
| 175 | Encyclo-K | Encyclopedic knowledge | Encyclopedic knowledge evaluation benchmark. | ★★★★ | | G Seed2.0 Pro | 65.7% |
| 176 | EnigmaEval | Challenging puzzles | Challenging puzzle benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 17.8% |
| 177 | Enterprise RAG | Retrieval-augmented generation | Enterprise retrieval-augmented generation evaluation covering internal knowledge bases. | ★★★★ | | 🇺🇸 Apriel Nemotron 15B Thinker | 69.2% |
| 178 | EQ-Bench | Reasoning | General reasoning benchmark assessing equation/logic capabilities. | ★★★★★ | 352 | G Jan v1 2509 | 85.0% |
| 179 | EQ-Bench 3 | Emotional intelligence (roleplay) | A benchmark measuring emotional intelligence in challenging roleplays, judged by Sonnet 3.7. | ★★★★ | 21 | 🇨🇳 Kimi K2 Instruct | 1555 |
| 180 | ERQA | Spatial reasoning | Spatial recognition and reasoning QA benchmark (ERQA). | ★★★★ | | 🇺🇸 Gemini 3 Flash | 71.0% |
| 181 | EvalPerf | Code evaluation performance | Measures performance of LLM code evaluation, including runtime, memory, and efficiency metrics. | ★★★★ | | 🇺🇸 GPT-4o (2024-08-06) | 100.0% |
| 182 | EvalPlus | Code generation | Aggregated code evaluation suite from EvalPlus. | ★★★★★ | 1577 | 🇺🇸 o1 Mini | 89.0% |
| 183 | EVG | Document OCR | EVG document OCR benchmark evaluating recognition accuracy and layout extraction. | ★★★★ | | 🇺🇸 Dolphin 1.5 | ↓ 3.0% |
| 184 | EXECUTE | Multilingual character tasks | Multilingual character-level evaluation benchmark. | ★★★★ | | 🇺🇸 Bolmo 7B | 71.6% |
| 185 | FACTS Benchmark Suite | Factuality (suite) | Comprehensive factuality benchmark suite covering held-out internal grounding, parametric knowledge, multimodal understanding, and search retrieval benchmarks. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 70.5% |
| 186 | FACTS Grounding | Grounding / factuality | Grounded factuality benchmark evaluating model alignment with source facts. | ★★★★ | | 🇨🇳 Kimi K2 Instruct | 88.5% |
| 187 | FActScore | Hallucination rate | Measures hallucination rate on an open-source prompt suite; lower is better. | ★★★★ | | 🇺🇸 GPT-5 | ↓ 1.0% |
| 188 | FaithJudge (1-Hallu.) | Hallucination detection | FaithJudge hallucination rate with 1-hallucination metric (lower is better). | ★★★★ | | G Moonlight-Instruct | ↓ 56.0% |
| 189 | Meta Score Agent | Composite capability index | | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 100.0% |
| 190 | Meta Score Code | Composite capability index | | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 100.0% |
| 191 | Meta Score Math | Composite capability index | | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Instruct | 100.0% |
| 192 | Meta Score OCR | Composite capability index | | ★★★★ | | 🇺🇸 o3 (Low) | 80.0% |
| 193 | Meta Score Safety | Composite safety index | | ★★★★ | | G Granite 3.3 8B Instruct | 70.0% |
| 194 | Meta Score STEM | Composite capability index | | ★★★★ | | 🇨🇳 Qwen3-VL-235B-A22B Instruct | 100.0% |
| 195 | Meta Score Text | Composite capability index | | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 85.7% |
| 196 | Meta Score Visual | Composite capability index | | ★★★★ | | 🇺🇸 Gemini 3 Pro | 100.0% |
| 197 | Meta Score Writing | Composite capability index | | ★★★★ | | 🇨🇳 Qwen3 235B A22B Instruct 2507 | 60.0% |
| 198 | FigQA | Figure understanding and QA | Figure question answering benchmark evaluating visual reasoning over scientific figures and diagrams. | ★★★★ | | 🇺🇸 Grok 4.1 (Thinking) | 34.0% |
| 199 | FinanceReasoning | Financial reasoning | Financial reasoning benchmark evaluating quantitative and qualitative finance problem solving. | ★★★★ | | G Ling 1T | 87.5% |
| 200 | FinanceAgent | Agentic finance tasks | Interactive financial agent benchmark requiring multi-step tool use. | ★★★★ | | 🇺🇸 Claude Opus 4.6 | 60.7% |
| 201 | FinanceAgent v1.1 | Agentic finance tasks | FinanceAgent v1.1 benchmark for interactive financial agent evaluation. | ★★★★ | | 🇺🇸 Claude Sonnet 4.6 | 63.3% |
| 202 | FinanceBench (FullDoc) | Finance QA | FinanceBench full-document question answering benchmark requiring long-context financial understanding. | ★★★★ | | G Jamba Mini 1.6 | 45.4% |
| 203 | FinSearchComp | Financial retrieval | Financial search and comprehension benchmark measuring retrieval-grounded reasoning over financial content. | ★★★★ | | 🇺🇸 Grok 4 | 68.9% |
| 204 | FinSearchComp-CN | Financial retrieval (Chinese) | Chinese financial search and comprehension benchmark measuring retrieval-grounded reasoning over regional financial content. | ★★★★ | | G doubao-1-5-vision-pro | 54.2% |
| 205 | FinSearchComp (T2&T3) | Finance search | Finance search competition tasks (tracks T2 and T3). | ★★★★ | | 🇺🇸 GPT-5 High | 64.5% |
| 206 | Flame-React-Eval | Frontend coding | Front-end React coding tasks and evaluation. | ★★★★ | | 🇨🇳 GLM-4.6V | 86.3% |
| 207 | Flores | Machine translation (multilingual) | FLORES multilingual translation benchmark. | ★★★★ | | G EuroLLM-22B | 88.9% |
| 208 | Fox-Page-cn | Document OCR (Chinese) | Fox Page benchmark evaluating OCR accuracy and layout understanding on Chinese document pages. | ★★★★ | | 🇺🇸 Dolphin 1.5 | ↓ 0.8% |
| 209 | Fox-Page-en | Document OCR (English) | Fox Page benchmark evaluating OCR accuracy and layout understanding on English document pages. | ★★★★ | | 🇺🇸 Dolphin 1.5 | ↓ 0.7% |
| 210 | FRAMES | Interactive reasoning | Frame-based interactive reasoning and dialogue benchmark. | ★★★★ | | 🇨🇳 Tongyi DeepResearch | 90.6% |
| 211 | FreshQA | Recency QA | Question answering benchmark emphasizing up-to-date knowledge and recency. | ★★★★ | | 🇨🇳 Qwen3-4B Thinking 2507 | 66.9% |
| 212 | FrontierScience | Science reasoning | Frontier-level scientific reasoning and QA benchmark. | ★★★★ | | 🇺🇸 GPT-5.2 | 25.2% |
| 213 | FrontierScience Olympiad | Science reasoning (olympiad) | Olympiad-level problems from the FrontierScience benchmark. | ★★★★ | | 🇺🇸 GPT-5.2 High | 75.0% |
| 214 | FrontierScience Research | Science reasoning (research) | Research-level problems from the FrontierScience benchmark. | ★★★★ | | 🇺🇸 GPT-5.2 High | 25.0% |
| 215 | FSC-147 | Few-shot counting | Few-shot counting benchmark across 147 categories. | ★★★★ | | G Seed1.8 | 33.8% |
| 216 | FullStackBench | Full-stack development | End-to-end web/app development tasks and evaluation. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 72.3% |
| 217 | FullStackBench (zh) | Full-stack development | Chinese-language full-stack development tasks and evaluation. | ★★★★ | | 🇨🇳 Qwen3-235B-A22B | 63.1% |
| 218 | GAIA | General AI tasks | Comprehensive benchmark for agentic tasks. | ★★★★ | | G Seed1.8 | 87.4% |
| 219 | GAIA (no file) | General AI tasks | GAIA benchmark subset without file inputs. | ★★★★ | | G Step-3.5 Flash 20260204 | 84.5% |
| 220 | GAIA 2 | General agent tasks | Grounded agentic intelligence benchmark version 2 covering multi-tool tasks. | ★★★★ | | G Ring-1T-2.5 | 75.0% |
| 221 | GAOKAO-Bench | Chinese exams | GAOKAO benchmark measuring Chinese college entrance exam performance. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B-Instruct-2507 | 94.5% |
| 222 | GDPVal | General capability | GDPVal benchmark evaluating broad general capabilities of LLMs across diverse tasks. | ★★★★ | | 🇺🇸 Claude Opus 4.6 | 73.5% |
| 223 | GDPVal-AA Elo | Office tasks | GDPVal Artificial Analysis Elo rating for office-style tasks. | ★★★★ | | 🇺🇸 Claude Sonnet 4.6 | 1633 |
| 224 | General Tool Use | Tool use | General tool-use benchmark covering web and API tasks. | ★★★★ | | 🇺🇸 Claude Opus 4.5 | 78.9% |
| 225 | GeoBench1 | Geospatial reasoning | Geospatial visual QA and reasoning (set 1). | ★★★★ | | 🇨🇳 GLM-4.5V | 79.7% |
| 226 | GlobalMGSM | Math (multilingual) | Global multilingual grade-school math word problems. | ★★★★ | | 🇨🇳 Qwen3-4B | 60.9% |
| 227 | Global-MMLU | Multi-domain knowledge (global) | Full Global-MMLU evaluation across diverse languages and regions. | ★★★★ | | 🇨🇳 DeepSeek V3.2-Exp | 82.0% |
| 228 | Global-MMLU-Lite | Multi-domain knowledge (global) | Lightweight global variant of MMLU covering diverse languages and regions. | ★★★★ | | 🇺🇸 Gemini 2.5 Pro | 89.2% |
| 229 | Global PIQA | Commonsense reasoning (100 languages) | Physical commonsense reasoning benchmark spanning 100 languages and diverse cultural contexts. | ★★★★ | | 🇺🇸 Gemini 3 Flash | 95.6% |
| 230 | Gorilla Benchmark API Bench | Tool use | Gorilla API Bench tool-use evaluation. | ★★★★ | | 🇺🇸 Llama 3.1 405B | 35.3% |
| 231 | GPQA | Graduate-level QA | Graduate-level question answering evaluating advanced reasoning. | ★★★★★ | 406 | 🇺🇸 GPT-5.2 Thinking | 92.4% |
| 232 | GPQA-diamond | Graduate-level QA | Hard subset of GPQA (diamond level). | ★★★★★ | | 🇺🇸 Gemini 3.1 Pro | 94.3% |
| 233 | GraphWalks BFS | Long-context reasoning | Graph traversal/GraphWalks benchmark (BFS variant) for long-context reasoning. | ★★★★ | | 🇺🇸 GPT-5.2 High | 98.0% |
| 234 | GraphWalks Parents | Long-context reasoning | Graph traversal/GraphWalks benchmark (Parents variant) for long-context reasoning. | ★★★★ | | G Seed2.0 Lite | 100.0% |
| 235 | GRE Math maj@16 | Math (standardized tests) | GRE quantitative section evaluated via majority voting over 16 samples. | ★★★★ | | 🇨🇳 Qwen2 7B | 58.5% |
| 236 | Ground-UI-1K | GUI grounding | Accuracy on the Ground-UI-1K grounding benchmark. | ★★★★ | | 🇨🇳 Qwen2.5-VL 72B | 85.4% |
| 237 | GSM-Infinite Hard (128K) | Math reasoning | GSM-Infinite Hard benchmark at 128K context. | ★★★★ | | G MiMo V2 Flash Base | 29.0% |
| 238 | GSM-Infinite Hard (16K) | Math reasoning | GSM-Infinite Hard benchmark at 16K context. | ★★★★ | | 🇨🇳 DeepSeek V3.2-Exp | 50.4% |
| 239 | GSM-Infinite Hard (32K) | Math reasoning | GSM-Infinite Hard benchmark at 32K context. | ★★★★ | | 🇨🇳 DeepSeek V3.2-Exp | 45.2% |
| 240 | GSM-Infinite Hard (64K) | Math reasoning | GSM-Infinite Hard benchmark at 64K context. | ★★★★ | | 🇨🇳 DeepSeek V3.1 | 34.7% |
| 241 | GSM-Plus | Math (grade-school, enhanced) | Enhanced GSM-style grade-school math benchmark variant. | ★★★★ | | G LLaDA2.0 Flash | 89.7% |
| 242 | GSM-Symbolic | Math reasoning | Symbolic reasoning variant of GSM that tests algebraic manipulation and arithmetic with structured problems. | ★★★★ | | G Granite-4.0-H-Small | 87.4% |
| 243 | GSM8K | Math (grade-school) | Grade-school math word problems requiring multi-step reasoning. | ★★★★★ | 1322 | 🇨🇳 Kimi K2 Instruct | 97.3% |
| 244 | GSM8K (DE) | Math (grade-school, German) | German translation of the GSM8K grade-school math word problems. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.6% |
| 245 | GSM8K-Ko | Math (grade-school, Korean) | Korean translation of the GSM8K grade-school math word problems. | ★★★★ | | 🇨🇳 Qwen3-30B-A3B | 88.1% |
| 246 | GSM8K Platinum | Math (grade-school, hard) | Harder subset/setting of GSM8K grade-school math problems. | ★★★★ | | 🇨🇳 Kimi-Linear-Base | 89.6% |
| 247 | GSO Benchmark | Code generation | LiveCodeBench GSO benchmark. | ★★★★ | | 🇺🇸 o3-high | 8.8% |
| 248 | HAE-RAE Bench | Korean language understanding | Korean language understanding benchmark evaluating knowledge and reasoning. | ★★★★ | | G Kanana-1.5-32.5B-Base | 90.7% |
| 249 | HallusionBench | Multimodal hallucination | Benchmark for evaluating hallucination tendencies in multimodal LLMs. | ★★★★ | | 🇨🇳 Qwen3.5-397B-A17B | 71.4% |
| 250 | HarmBench | Safety | Harmfulness and safety compliance benchmark across a variety of risky prompts. | ★★★★ | | G IQuest-Coder-V1-40B-Thinking | 94.8% |
| 251 | HarmfulQA | Safety | Harmful question set testing models' ability to avoid unsafe answers. | ★★★★★ | 104 | G K2-THINK | 99.0% |
| 252 | HealthBench | Medical QA | Comprehensive medical knowledge and clinical reasoning benchmark across specialties and tasks. | ★★★★ | | 🇺🇸 GPT-5 | 67.2% |
| 253 | HealthBench-Hard | Medical QA (hard) | Challenging subset of HealthBench focusing on complex, ambiguous clinical cases. | ★★★★ | | 🇺🇸 GPT-5 | 46.2% |
| 254 | HealthBench-Hard Hallucinations | Medical hallucination safety | Measures hallucination and unsafe medical advice under hard clinical scenarios. | ★★★★ | | 🇺🇸 GPT-5 | ↓ 1.6% |
| 255 | HellaSwag | Commonsense reasoning | Adversarial commonsense sentence completion. | ★★★★★ | 220 | 🇨🇳 DeepSeek V3 Base | 96.4% |
| 256 | HellaSwag (DE) | Commonsense reasoning (German) | German translation of the HellaSwag commonsense benchmark. | ★★★★ | | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.7% |
| 257 | HELMET LongQA | Long-context QA | Long-context subset of the HELMET benchmark focusing on grounded question answering. | ★★★★ | | G Jamba Mini 1.6 | 46.9% |
| 258 | HeroBench | Long-horizon planning | Benchmark for long-horizon planning and structured reasoning in virtual worlds. | ★★★★ | | 🇺🇸 Grok 4 | 91.7% |
| 259 | HHEM v2.1 | Hallucination detection | Hughes Hallucination Evaluation Model (Vectara); lower is better. | ★★★★ | | G AntGroup Finix_S1_32b | ↓ 0.6% |
| 260 | HiddenMath | Math reasoning | Mathematical reasoning benchmark referenced in recent model cards. | ★★★★ | | 🇺🇸 Gemini 2.0 Pro | 65.2% |
| 261 | HLE | Multi-domain reasoning | Challenging LLMs at the frontier of human knowledge. | ★★★★★ | 1085 | 🇺🇸 Gemini 3 Deep Think | 48.4% |
| 262 | HLE Overconfidence | Overconfidence / safety | Overconfidence rate derived from Humanity's Last Exam evaluations. | ★★★★ | | 🇺🇸 GPT-5.2 | ↓ 43.7% |
| 263 | HLE (Text Only) | Advanced reasoning | Humanity's Last Exam benchmark restricted to text-only inputs. | ★★★★★ | 1085 | 🇺🇸 Gemini 3 Pro | 45.8% |
| 264 | HLE-Verified | Multi-domain reasoning | Verified and revised version of Humanity's Last Exam (HLE) with a component-wise verification protocol. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 48.0% |
| 265 | HLE-VL | Holistic language evaluation (vision-language) | Vision-language HLE benchmark. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 36.0% |
| 266 | HLE (With Tools) | Tool-augmented reasoning | Humanity's Last Exam benchmark evaluated with tool access. | ★★★★★ | 1085 | 🇺🇸 Claude Opus 4.6 | 53.1% |
| 267 | HMMT | Math (competition) | Harvard–MIT Mathematics Tournament problems. | ★★★★ | | 🇺🇸 GPT-5 pro | 100.0% |
| 268 | HMMT 2025 | Math (competition) | Harvard–MIT Mathematics Tournament 2025 problems. | ★★★★ | | 🇺🇸 Gemini 3 Pro | 99.8% |
| 269 | HMMT Feb 2025 | Math (competition) | Harvard–MIT Mathematics Tournament February 2025 problems. | ★★★★ | | 🇺🇸 GPT-5.2 High | 100.0% |
270HMMT Nov 2025 O Math (competition)Harvard–MIT Mathematics Tournament November 2025 problems.★★★★ 🇺🇸 GPT-5.2 High100.0%
271HotpotQA o Multi-hop QAExplainable multi-hop QA with supporting facts.★★★★ 🇨🇳 Qwen 3 0.6B64.0%
272HRBench 4K OHigh-resolution perceptionHigh-resolution image understanding benchmark with 4K-resolution images.★★★★ 🇨🇳 Qwen3-VL-30B-A3B Instruct89.5%
273HRBench 8K OHigh-resolution perceptionHigh-resolution image understanding benchmark with 8K-resolution images.★★★★ 🇺🇸 Gemini 2.5 Pro84.0%
274HRM8K oMath (Korean)Korean math reasoning benchmark with roughly 8k problems.★★★★ 🇨🇳 Qwen3-235B-A22B-Thinking-250792.0%
275HumanEval O Code generationPython synthesis problems evaluated by unit tests.★★★★★2916 🇺🇸 Gemini 3 Pro Preview100.0%
276HumanEval+ O Code generationExtended HumanEval with more tests.★★★★★1577 🇺🇸 Claude Sonnet 494.5%
277HumanEval-V oCode generation (vision)HumanEval variant with visual programming prompts.★★★★G Step3-VL-10B66.0%
278HumanEval-X o Code generation (multilingual)Multilingual code generation benchmark extending HumanEval to multiple programming languages.★★★★G TeleChat3-36B-Thinking92.7%
279Hypersim O3D scene understandingHypersim benchmark for synthetic indoor scene understanding and reconstruction.★★★★ 🇺🇸 GPT-5 Mini Minimal39.3%
280IFBench O Instruction followingInstruction-following benchmark measuring compliance and adherence.★★★★70 🇫🇷 Mistral Small 3.2 24B Instruct84.8%
281IFEval O Instruction followingInstruction following capability evaluation for LLMs.★★★★★36312 🇨🇳 Qwen3.5-27B95.0%
282IFEval-Code oInstruction following (code)Instruction following evaluation for code generation tasks.★★★★ 🇨🇳 Qwen3-32B28.0%
283IFEval (Strict Prompt) OInstruction followingIFEval strict prompt-level accuracy.★★★★ 🇨🇳 Qwen3-8B Non-Thinking84.3%
284Image QA Average OImage QA (aggregate)Average of single-image visual question answering benchmarks.★★★★ 🇺🇸 Gemini 3 Pro86.2%
285IMO AnswerBench O Math (competition)Evaluates free-form solutions to International Mathematical Olympiad problems using expert-style grading rubrics.★★★★G Ring-1T-2.5-heavy-thinking90.0%
286INCLUDE OMultilingual knowledgeMultilingual benchmark built from regional and professional exams, evaluating knowledge and reasoning in dozens of languages.★★★★ 🇺🇸 Gemini 3 Pro90.5%
287InfoQA OInformation-seeking QAInformation retrieval question answering benchmark evaluating factual responses.★★★★ 🇺🇸 Gemini 3 Pro86.9%
288Information Extraction oInformation extractionInformation extraction benchmark for economically valuable fields.★★★★ 🇺🇸 Claude Sonnet 4.546.9%
289Information Processing oInformation processingInformation processing benchmark for economically valuable tasks.★★★★ 🇺🇸 Gemini 3 Pro56.5%
290InfoVQA OInfographic VQAVisual question answering over infographics requiring reading, counting, and reasoning.★★★★ 🇨🇳 Kimi-K2.5 Thinking92.6%
291Intention Recognition oIntent recognitionIntent recognition benchmark for practical applications.★★★★ 🇺🇸 Gemini 3 Pro65.3%
292IntPhys 2 OIntuitive physicsIntuitive physics reasoning benchmark.★★★★ 🇺🇸 Gemini 3 Flash63.4%
293Inverse IFEval OInstruction following (inverse)Inverse instruction-following evaluation.★★★★ 🇺🇸 Gemini 3 Flash80.9%
294ISL/OSL 8k/16k oThroughputRelative throughput on workloads with 8k/16k input/output sequence lengths (ISL/OSL).★★★★ 🇺🇸 Nemotron-3-Nano-30B-A3B3.3%
295JudgeMark v2.1 O LLM judging abilityMeasures how reliably a model can score other models' outputs when acting as an LLM judge.★★★★ 🇺🇸 Claude Sonnet 482.0%
296KGC-Safety oSafety (Korean)Korean safety benchmark evaluating harmfulness and compliance.★★★★G K-EXAONE96.1%
297KK-4 People oWorking memory (4 people)Keep/kill working-memory benchmark with 4 people entities.★★★★G K2-V292.9%
298KK-8 People oWorking memory (8 people)Keep/kill working-memory benchmark with 8 people entities.★★★★G K2-V282.8%
299KMMLU O Korean knowledgeKorean Massive Multitask Language Understanding benchmark.★★★★ 🇨🇳 DeepSeek V3.178.7%
300KMMLU-Pro O Korean knowledgeProfessional-level variant of the Korean MMLU (KMMLU) benchmark.★★★★ 🇺🇸 o177.5%
301KMMLU-Redux O Korean knowledgeRedux variant of the KMMLU benchmark with revised questions.★★★★ 🇺🇸 o181.1%
302Ko-LongBench oKorean long-contextLong-context understanding benchmark in Korean.★★★★ 🇨🇳 DeepSeek V3.2-Thinking87.9%
303KoBALT oKorean knowledgeKorean benchmark for knowledge and language understanding.★★★★ 🇨🇳 DeepSeek V3.2-Thinking62.7%
304KoMT-Bench oKorean chat abilityKorean multi-turn chat evaluation benchmark.★★★★ 🇨🇳 Qwen3-30B-A3B-Instruct-25078.5%
305KOR-Bench OReasoningComprehensive reasoning benchmark spanning diverse domains and cognitive skills.★★★★ 🇺🇸 GPT-5 High77.4%
306KORBench oGeneral reasoningKnowledge-Orthogonal Reasoning benchmark evaluating rule-based reasoning independent of memorized knowledge.★★★★ 🇺🇸 GPT-5.2 High79.2%
307KoSimpleQA oKorean QAKorean simple question answering benchmark.★★★★G Kanana-2-30B-A3B-Mid-260149.7%
308KSM oMath (Korean)Korean STEM and math benchmark.★★★★G EXAONE Deep 2.4B60.9%
309LAMBADA O Language modelingWord prediction requiring broad context understanding.★★★★ 🇺🇸 GPT-386.4%
310LatentJailbreak o Safety / jailbreakRobustness to latent jailbreak adversarial techniques.★★★★39 🇺🇸 GPT-3.5-turbo77.4%
311LBV1-QA OVision-languageVision-language QA benchmark v1.★★★★ 🇺🇸 GPT-573.7%
312LBV2 OVision-languageVision-language benchmark v2.★★★★ 🇺🇸 Gemini 2.5 Pro65.7%
313LIFEBench oInstruction followingLength-based instruction-following evaluation benchmark.★★★★ 🇺🇸 GPT-5.261.7%
314LingoQA ODriving scene QAQuestion answering benchmark for autonomous driving scene understanding.★★★★ 🇨🇳 Qwen3.5-27B82.0%
315LiveBench OGeneral capabilityContinually updated capability benchmark across diverse tasks.★★★★ 🇺🇸 Gemini 2.5 Pro82.4%
316LiveCodeBench O Code generationLive coding and execution-based evaluation benchmark (v6 dataset).★★★★★ 🇺🇸 Gemini 3 Pro92.0%
317LiveCodeBench-Ko oCode generation (Korean)Korean translation of LiveCodeBench.★★★★ 🇨🇳 Qwen3-30B-A3B-Thinking-250766.3%
318LiveCodeBench Pro O Competitive programmingEvaluates competitive programming performance on problems from Codeforces, ICPC, and IOI contests. Elo rating; higher is better.★★★★ 🇺🇸 Gemini 3.1 Pro2887
319LCB Pro 25Q2 (Easy) O Code generationLiveCodeBench Pro 2025 Q2 easy subset.★★★★G Nanbeige4.1-3B81.4%
320LCB Pro 25Q2 (Med) O Code generationLiveCodeBench Pro 2025 Q2 medium subset.★★★★ 🇺🇸 GPT-OSS 120B (High)35.4%
321LiveCodeBench v3 O Code generationLiveCodeBench v3 snapshot measuring pass rates on streaming coding tasks.★★★★ 🇨🇳 Qwen3 32B90.2%
322LiveCodeBench v5 (2024.10-2025.02) O Code generationLiveCodeBench v5 snapshot covering Oct 2024-Feb 2025.★★★★G IQuest-Coder-V1-40B-Loop-Thinking86.2%
323LiveMCP-101 O Agent real-time evalReal-time evaluation framework stress-testing agents on complex, real-world tasks.★★★★ 🇺🇸 GPT-558.4%
324LiveSports-3K oSports videoLive sports video understanding benchmark (3K).★★★★G Seed1.877.5%
325LMArena Text O Crowd eval (text)Chatbot Arena text leaderboard (Elo ratings).★★★★ 🇺🇸 Gemini 2.5 Pro1455
326LMArena Vision O Crowd eval (vision)Chatbot Arena vision leaderboard (Elo ratings).★★★★ 🇺🇸 Gemini 2.5 Pro1242
327Local Agent Bench O Tool calling judgmentTests whether small open-weight models can reliably decide when to call tools and when not to. Agent Score = (Action x 0.4) + (Restraint x 0.3) + (Wrong-Tool-Avoidance x 0.3).★★★★ 🇺🇸 LFM2.5 1.2B Instruct88.0%
328LogicVista OVisual logical reasoningVisual logic and pattern reasoning tasks requiring compositional and spatial understanding.★★★★ 🇺🇸 Gemini 3 Pro80.8%
329LogiQA o Logical reasoningReading comprehension with logical reasoning.★★★★★138G Pythia 70M23.5%
330LongBench o Long-context evalLong-context understanding across tasks.★★★★★957G Jamba Mini 1.632.0%
331LongBench v2 O Long-context evalNext-generation LongBench v2 long-context evaluation benchmark.★★★★ 🇺🇸 Gemini 3 Pro68.2%
332LongFact-Concepts OHallucination rate on open-source promptsLong-context factuality eval focused on conceptual statements; lower is better.★★★★ 🇺🇸 GPT-5↓ 0.7%
333LongFact-Objects OHallucination rate on open-source promptsLong-context factuality eval focused on object/entity references; lower is better.★★★★ 🇺🇸 GPT-5↓ 0.8%
334LongText-Bench EN oText renderingLongText-Bench English subset score for text rendering.★★★★G Seedream 4.51.0%
335LongText-Bench ZH oText renderingLongText-Bench Chinese subset score for text rendering.★★★★G Seedream 4.51.0%
336LongVideoBench oLong video QALong video understanding and QA benchmark.★★★★ 🇨🇳 Kimi-K2.5 Thinking79.8%
337LPFQA OFinance QALong-form financial question answering benchmark.★★★★ 🇺🇸 Claude Sonnet 4.554.9%
338LVBench OVideo understandingLong video understanding benchmark (LVBench).★★★★ 🇺🇸 Gemini 3 Pro76.2%
339M3GIA (CN) oChinese multimodal QAChinese-language M3GIA benchmark covering grounded multimodal question answering.★★★★ 🇨🇳 Seed1.5-VL-Thinking91.2%
340Machiavelli ODeception / safetyBenchmark for deceptive or manipulative behavior in social interactions.★★★★ 🇺🇸 Claude Haiku 4.5↓ 52.2%
341MakeMeSay oAdversarial robustnessAdversarial benchmark testing model robustness against manipulation attempts. Lower is better.★★★★ 🇺🇸 Grok 4.1 (Thinking)
342Mantis OMultimodal reasoningMultimodal reasoning and instruction following benchmark (Mantis).★★★★G dots.vlm186.2%
343mArenaHard oChat ability (multilingual)Multilingual variant of Arena-Hard evaluating chat quality across languages.★★★★ 🇨🇳 Qwen3-4B70.1%
344MARS-Bench OInstruction followingInstruction-following benchmark with complex tasks.★★★★ 🇺🇸 GPT-5.2 High87.9%
345MASK O Safety / red teamingModel behavior safety assessment via red-teaming scenarios.★★★★ 🇺🇸 Claude Sonnet 4 (t)95.3%
346MatBench oMaterials property predictionMaterials property prediction benchmark for scientific AI models.★★★★G Intern-S1-Pro72.8%
347MATH O Math (competition)Competition-level mathematics across algebra, geometry, number theory, combinatorics.★★★★★1185 🇺🇸 o3 mini97.9%
348MATH-Ko oMath (Korean)Korean translation of the MATH competition benchmark.★★★★ 🇨🇳 Qwen3-30B-A3B58.2%
349MATH Level 5 o Math (competition)Level 5 subset of the MATH benchmark emphasizing the hardest competition-style problems.★★★★ 🇨🇳 Qwen3-4B-Instruct-250773.6%
350MATH500 OMath reasoning500-problem slice of the MATH benchmark for challenging math reasoning.★★★★★G Motif-2-12.7B-Reasoning99.3%
351MATH500 (ES) oMath (multilingual)Spanish MATH500 benchmark.★★★★G EXAONE 4.0 1.2B88.8%
352MathArena Apex OMath (competition)Challenging math contest problems from the MathArena Apex benchmark.★★★★G Seed2.0 Pro82.1%
353MathVerse oMath reasoning (multimodal)Visual math reasoning benchmark combining images and text across diverse mathematical tasks.★★★★ 🇺🇸 Gemini 2.5 Pro82.9%
354MathVerse-mini OMath reasoning (multimodal)Compact MathVerse split focusing on single-image math puzzles and visual reasoning.★★★★ 🇨🇳 Qwen3-VL-235B-A22B Thinking85.0%
355MathVerse-Vision OMath reasoning (multimodal)Multi-image visual mathematical reasoning tasks from the MathVerse ecosystem.★★★★ 🇺🇸 GPT-5 High84.1%
356MathVision O Math reasoning (multimodal)Visual math reasoning benchmark with problems that combine images (charts, diagrams) and text.★★★★ 🇨🇳 Qwen3.5-397B-A17B88.6%
357MathVista O Multimodal math reasoningVisual math reasoning across diverse tasks.★★★★ 🇨🇳 Kimi-K2.5 Thinking90.1%
358MathVista-Mini OMath reasoning (multimodal)Lightweight subset of MathVista for quick evaluation of visual mathematical reasoning.★★★★ 🇨🇳 Qwen3.5-397B-A17B90.3%
359MAXIFE OInstruction following (multilingual)Multilingual instruction-following evaluation across English and multilingual original prompts.★★★★ 🇺🇸 GPT-5.288.4%
360MBPP O Code generationShort Python problems with hidden tests.★★★★★36312 🇨🇳 Kimi-K2 Thinking97.4%
361MBPP-Ko oCode generation (Korean)Korean translation of MBPP code generation benchmark.★★★★ 🇨🇳 Qwen3-30B-A3B66.8%
362MBPP+ OCode generationExtended MBPP with more tests and stricter evaluation.★★★★ 🇨🇳 GLM 4.694.2%
363MCP-Atlas OAgent evaluationAggregate MCP agent benchmark covering tool-use and planning tasks.★★★★ 🇺🇸 Gemini 3.1 Pro69.2%
364MCP Universe O Agent evaluationBenchmarks multi-step tool-use agents across diverse task suites with a unified overall success metric.★★★★ 🇺🇸 Gemini 3 Pro50.7%
365MCPMark O Agent tool-use (MCP)Benchmark for Model Context Protocol (MCP) agent tool-use.★★★★★127 🇺🇸 GPT-5.257.5%
366mDolly oInstruction following (multilingual)Multilingual variant of the Dolly instruction-following benchmark.★★★★G Tiny Aya Global86.9%
367MedXpertQA-MM OMedical VQAMultimodal medical expert question answering benchmark.★★★★ 🇺🇸 Gemini 3 Pro76.0%
368METR O Long task benchmarkMETR evaluates AI agents on long-horizon coding and agentic tasks, measuring autonomous task completion time.★★★★ 🇺🇸 Claude Opus 4.54.8%
369MEWC oWeb comprehensionMulti-page End-to-end Web Comprehension benchmark.★★★★ 🇺🇸 Claude Opus 4.689.8%
370MGSM OMath (multilingual)Multilingual grade school math word problems.★★★★ 🇺🇸 Claude Opus 4.1 (2025-08-05) Thinking94.4%
371MIABench OMultimodal instruction followingMultimodal instruction-following benchmark evaluating accuracy on complex image-text tasks.★★★★ 🇺🇸 Gemini 2.5 Pro96.0%
372MicroVQA oBiological microscopyVisual question answering benchmark for biological microscopy images.★★★★ 🇺🇸 Gemini 3 Pro69.0%
373NIAH-Multi 128K oLong-context QANeedle-in-a-haystack multi-query benchmark at 128K context.★★★★ 🇨🇳 Kimi-K2 Base99.5%
374NIAH-Multi 32K oLong-context QANeedle-in-a-haystack multi-query benchmark at 32K context.★★★★ 🇨🇳 Kimi-K2 Base99.8%
375NIAH-Multi 64K oLong-context QANeedle-in-a-haystack multi-query benchmark at 64K context.★★★★ 🇨🇳 Kimi-K2 Base100.0%
376MindCube OSpatial navigationSpatial navigation benchmark.★★★★ 🇺🇸 Gemini 3 Flash78.3%
377Minerva Math O University-level mathAdvanced quantitative reasoning set inspired by the Minerva benchmark for STEM problem solving.★★★★ 🇨🇳 Qwen3 235B A22B Thinking98.0%
378MiniF2F pass@1 o Math competitionMiniF2F competition benchmark pass@1 accuracy.★★★★ 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF1650.0%
379MiniF2F pass@32 o Math competitionMiniF2F competition benchmark pass@32 accuracy.★★★★ 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF1679.9%
380MiniF2F (Test) o Math competitionMiniF2F competition benchmark (test split).★★★★ 🇨🇳 LongCat-Flash-Thinking81.6%
381MixEval oMulti-task reasoningMixed-subject benchmark covering knowledge and reasoning tasks across domains.★★★★ 🇺🇸 o1 Mini82.9%
382MixEval Hard oMulti-task reasoning (hard)Hard subset of MixEval covering diverse reasoning tasks.★★★★ 🇨🇳 Qwen3-4B31.6%
383MLVU OLarge video understandingMLVU: Large-scale multi-task benchmark for video understanding.★★★★ 🇨🇳 Qwen3.5-122B-A10B87.3%
384MM-BrowseComp oMultimodal browsingMultimodal browsing comprehension benchmark.★★★★G Seed1.846.3%
385MM-IFEval o Multimodal instruction followingInstruction-following benchmark assessing multimodal obedience to complex prompts.★★★★ 🇺🇸 LFM2.5-VL-1.6B52.3%
386MM-MT-Bench OMultimodal instruction followingMulti-turn multimodal instruction following benchmark evaluating dialogue quality and helpfulness.★★★★ 🇨🇳 Qwen3-VL-235B-A22B Thinking8.5%
387MMBench v1.1 (CN) O Multimodal understanding (Chinese)MMBench v1.1 Chinese subset for evaluating multimodal LLMs.★★★★ 🇺🇸 Gemini 3 Pro91.3%
388MMBench v1.1 (EN) O Multimodal understanding (English)MMBench v1.1 English subset for evaluating multimodal LLMs.★★★★ 🇺🇸 Gemini 3 Pro93.3%
389MMBench v1.1 (EN dev) O General VQAEnglish dev split of MMBench v1.1 measuring multimodal question answering.★★★★ 🇨🇳 Kimi-K2.594.2%
390MME-CC oMultimodal evaluationMME-CC multimodal evaluation suite.★★★★ 🇺🇸 Gemini 3 Pro56.9%
391MME Elo oMultimodal perceptionElo-style scoring for the MME multimodal evaluation benchmark.★★★★ 🇨🇳 InternVL3-2B2186.4
392MME-RealWorld (cn) oReal-world perception (CN)MME-RealWorld Chinese split.★★★★ 🇺🇸 GPT-4o58.5%
393MME-RealWorld (en) oReal-world perception (EN)MME-RealWorld English split.★★★★G MiMo-VL 7B-RL59.1%
394MMIU OMulti-image understandingMulti-image understanding benchmark evaluating cross-image reasoning.★★★★ 🇺🇸 Gemini 3 Pro72.1%
395MMLB-NIAH (128k) oMultimodal long-contextMMLB-NIAH 128k long-context multimodal benchmark.★★★★G Seed1.872.2%
396MMLB-VRAG (128k) oMultimodal long-contextMMLB-VRAG 128k long-context multimodal benchmark.★★★★ 🇺🇸 Gemini 3 Pro88.9%
397MMLongBench-128K oLong-context multimodal128K-context variant of MMLongBench evaluating multimodal long-context understanding.★★★★ 🇨🇳 GLM-4.6V64.1%
398MMLongBench-Doc OLong-context multimodal documentsEvaluates long-context document understanding with mixed text, tables, and figures across multiple pages.★★★★ 🇺🇸 Claude Opus 4.561.9%
399MMLU O Multi-domain knowledge57 tasks spanning STEM, humanities, social sciences; broad knowledge and reasoning.★★★★★1488 🇺🇸 GPT-5 High93.8%
400MMLU Arabic oArabic knowledge and reasoningArabic-language variant of MMLU evaluating knowledge and reasoning.★★★★ 🇨🇳 Qwen 2.5 72B74.1%
401MMLU (cloze) o Multi-domain knowledge (cloze)Cloze-form MMLU evaluation variant.★★★★ 🇺🇸 SmolLM2 135M Base31.5%
402Full Text MMLU oMulti-domain knowledge (long-form)Full-context MMLU variant evaluating reasoning over long passages.★★★★ 🇺🇸 Llama 3.3 70B Instruct83.0%
403MMLU-Pro O Multi-domain knowledgeHarder successor to MMLU with more challenging questions.★★★★286 🇺🇸 Gemini 3 Pro90.1%
404MMLU Pro MCF oMulti-domain knowledge (few-shot)MMLU-Pro multiple-choice formulation (MCF) few-shot evaluation.★★★★ 🇨🇳 Qwen3-4B-Base41.1%
405MMLU-ProX OMulti-domain knowledgeCross-lingual and robust variant of MMLU-Pro.★★★★ 🇺🇸 Gemini 3 Pro87.7%
406MMLU-Redux O Multi-domain knowledgeUpdated MMLU-style evaluation with revised questions and scoring.★★★★ 🇺🇸 Gemini 3 Pro95.9%
407MMLU-STEM O STEM knowledgeSTEM subset of MMLU.★★★★★1488G Falcon-H1-34B-Instruct83.6%
408MMMB oMultilingual MMBenchMultilingual Multimodal Benchmark (MMMB) average score.★★★★ 🇺🇸 LFM2.5-VL-1.6B77.0%
409MMMLU O Multi-domain knowledge (multilingual)Massively multilingual MMLU-style evaluation across many languages.★★★★ 🇺🇸 Gemini 3.1 Pro92.6%
410MMMLU (ES) oMultilingual knowledgeSpanish MMMLU benchmark.★★★★ 🇺🇸 SmolLM 3 3B64.7%
411MMMU O Multimodal understandingMulti-discipline multimodal understanding benchmark.★★★★★ 🇺🇸 Gemini 3 Pro87.2%
412MMMU PRO O Multimodal understanding (hard)Professional/advanced subset of MMMU for multimodal reasoning.★★★★ 🇺🇸 Gemini 3 Deep Think81.5%
413MMMU-Pro (vision) o Multimodal understanding (vision)MMMU-Pro vision-only setting.★★★★ 🇺🇸 Claude 3.7 Sonnet45.8%
414MMMU Pro (with tools) O Multimodal understanding (with tools)MMMU-Pro benchmark evaluated with tool access.★★★★ 🇺🇸 GPT-5.280.4%
415MMSIBench (circular) oSpatial understandingMMSIBench circular subset for spatial reasoning.★★★★ 🇺🇸 Gemini 3 Pro25.4%
416MMStar O Multimodal reasoningBroad evaluation of multimodal LLMs across diverse tasks.★★★★ 🇨🇳 Qwen3.5-397B-A17B83.8%
417MMVet o Multimodal evaluationComprehensive evaluation suite for assessing multimodal LLM capabilities.★★★★ 🇨🇳 R-4B-Base85.9%
418MMVP O Visual perceptionMultimodal Visual Patterns benchmark probing fine-grained visual details that CLIP-style encoders often miss.★★★★G Seed1.891.6%
419MMVU OVideo understandingMultimodal video understanding benchmark (MMVU).★★★★ 🇺🇸 GPT-5.2 Thinking XHigh80.8%
420Mol-Instructions oBio-molecular instruction followingInstruction-following benchmark for bio-molecular understanding and generation.★★★★G Intern-S1-Pro48.8%
421MotionBench OVideo motion understandingVideo motion and temporal reasoning benchmark.★★★★G Seed1.870.6%
422OpenAI-MRCR (128k) OLong-context reasoningOpenAI Multi-Round Co-reference Resolution benchmark with a 128k context window.★★★★ 🇺🇸 Gemini 3 Pro89.7%
423MRCR 128K-2N oLong-context reasoningMulti-Round Coreference Resolution benchmark at 128k context with 2 needles.★★★★ 🇫🇷 Ministral-3-R 8B50.3%
424MRCR 128K-4N oLong-context reasoningMulti-Round Coreference Resolution benchmark at 128k context with 4 needles.★★★★ 🇫🇷 Ministral-3-R 8B22.7%
425MRCR 128K-8N OLong-context reasoningMulti-Round Coreference Resolution benchmark at 128k context with 8 needles.★★★★ 🇺🇸 Gemini 3.1 Pro84.9%
426OpenAI-MRCR (1M) OLong-context reasoningOpenAI Multi-Round Co-reference Resolution benchmark with a 1M context window.★★★★ 🇺🇸 Gemini 2.5 Pro58.8%
427MRCR 64K-2N oLong-context reasoningMulti-Round Coreference Resolution benchmark at 64k context with 2 needles.★★★★ 🇫🇷 Ministral-3-R 8B44.0%
428MRCR 64K-4N oLong-context reasoningMulti-Round Coreference Resolution benchmark at 64k context with 4 needles.★★★★ 🇫🇷 Ministral-3-R 8B35.8%
429MRCR 64K-8N oLong-context reasoningMulti-Round Coreference Resolution benchmark at 64k context with 8 needles.★★★★ 🇨🇳 Qwen3-8B17.8%
430MRCR v2 OLong-context reasoningMulti-Round Co-reference Resolution benchmark, version 2.★★★★ 🇺🇸 GPT-5.2 High89.4%
431MSEarth-MCQ oEarth scienceEarth science multiple-choice question benchmark for scientific AI models.★★★★ 🇺🇸 Gemini 3 Pro65.8%
432MT-Bench O Chat abilityMulti-turn chat evaluation via GPT-4 grading.★★★★★39074 🇺🇸 Apriel Nemotron 15B Thinker85.7%
433MTOB (full book) oLong-context translationMachine Translation from One Book: translating a low-resource language using a grammar book provided in context (full-book setting).★★★★ 🇺🇸 Llama 4 Maverick50.8%
434MTOB (half book) oLong-context translationMachine Translation from One Book with only half of the grammar book in context.★★★★ 🇺🇸 Llama 4 Maverick54.0%
435MUIRBENCH OMultimodal robustnessEvaluates multimodal understanding robustness and reliability.★★★★ 🇺🇸 Gemini 3 Pro86.1%
436Multi-IF OInstruction following (multi-turn)Multi-turn, multilingual instruction-following evaluation benchmark.★★★★ 🇨🇳 Qwen3-30B-A3B81.0%
437Multi-IFEval OInstruction following (multi-task)Multi-task variant of instruction-following evaluation.★★★★ 🇺🇸 Llama 3.3 70B88.7%
438Multi-SWE-Bench O Code repair (multi-repo)Multi-repository SWE-Bench variant.★★★★★246G MiniMax M2.551.3%
439MultiChallenge OMulti-turn conversationBenchmark of realistic multi-turn conversation challenges testing instruction retention and context tracking.★★★★ 🇺🇸 GPT-569.6%
440Multi-Image QA Average OMulti-image QA (aggregate)Aggregate score over multi-image visual question answering tasks.★★★★ 🇺🇸 Gemini 3 Pro81.9%
441Multilingual MMBench oMultilingual vision benchmarkMultilingual MMBench average score across languages.★★★★ 🇺🇸 LFM2.5-VL-1.6B65.9%
442Multilingual MMLU OMulti-domain knowledge (multilingual)Multilingual variant of MMLU across many languages.★★★★ 🇺🇸 GPT-4.187.3%
443MultiPL-E O Code generation (multilingual)Multilingual code generation and execution benchmark across many programming languages.★★★★★269 🇺🇸 Claude Opus 489.6%
444MultiPL-E HumanEval o Code generation (multilingual)MultiPL-E variant of HumanEval tasks.★★★★ 🇺🇸 Llama 3.1 405B75.2%
445MultiPL-E MBPP o Code generation (multilingual)MultiPL-E variant of MBPP tasks.★★★★ 🇺🇸 Llama 3.1 405B65.7%
446MuSR O ReasoningMultistep Soft Reasoning.★★★★G Ling Flash 2.082.7%
447MVBench OVideo QAComprehensive multimodal video understanding benchmark with temporally grounded QA tasks.★★★★ 🇺🇸 GPT-5.278.1%
448Natural2Code oCode generationNatural language to code benchmark for instruction-following synthesis.★★★★ 🇺🇸 Gemini 2.0 Flash92.9%
449NaturalQuestions O Open-domain QAGoogle NQ; real user questions with long/short answers.★★★★ 🇫🇷 Mixtral 8x22B40.1%
450Nexus (0-shot) OTool useNexus tool-use benchmark, zero-shot setting.★★★★ 🇺🇸 Llama 3.1 405B58.7%
451Needle In A Haystack o Long-context retrievalNeedle In A Haystack test for locating hidden facts in long contexts.★★★★G MobileLLM P1 Base100.0%
452NoLiMa 128K oLong-context evalNoLiMa (No Literal Match) long-context benchmark at 128k context window.★★★★ 🇨🇳 MiniCPM-SALA23.9%
453NoLiMa 32K oLong-context evalNoLiMa (No Literal Match) long-context benchmark at 32k context window.★★★★ 🇨🇳 MiniCPM-SALA54.5%
454NoLiMa 64K oLong-context evalNoLiMa (No Literal Match) long-context benchmark at 64k context window.★★★★ 🇨🇳 MiniCPM-SALA43.0%
455NOVA-63 OMultilingual evaluationMultilingual evaluation benchmark covering 63 languages.★★★★ 🇨🇳 Qwen3.5-397B-A17B59.1%
456NuScenes o3D scene understanding3D scene understanding and perception benchmark for autonomous driving.★★★★ 🇨🇳 Qwen3.5-397B-A17B16.0%
457Objectron oObject detectionObjectron benchmark for 3D object detection in video captures.★★★★ 🇨🇳 Qwen3-VL-235B-A22B Thinking71.2%
458OBQA oOpen book QAOpenBookQA science question answering benchmark.★★★★ 🇨🇳 Qwen2.5-Omni-3B76.3%
459OCNLI ONatural language inference (Chinese)Original Chinese Natural Language Inference benchmark.★★★★G LLaDA2.1-Flash (Q Mode)72.8%
460OCRBench V2 OOCR (vision text extraction)OCRBench v2 evaluating text extraction from images and documents.★★★★ 🇨🇳 Qwen3-VL 2B Instruct858.0
461OCRBench-ELO oOCR (ELO ranking)OCR benchmark using ELO rating system to rank model performance on text extraction tasks.★★★★ 🇺🇸 Gemini 2.5 Pro866
462OCRBenchV2 (CN) OOCR (Chinese)OCRBenchV2 Chinese subset assessing OCR performance on Chinese-language documents.★★★★G Ovis2.6-30B-A3B67.1%
463OCRBenchV2 (EN) OOCR (English)OCRBenchV2 English subset evaluating OCR accuracy on English documents and layouts.★★★★G Ovis2.6-30B-A3B72.6%
464OCRReasoning oOCR reasoningOCR reasoning benchmark combining text extraction with multi-step reasoning over documents.★★★★ 🇺🇸 Gemini 2.5 Pro70.8%
465OctoCodingBench oCode generationCoding benchmark across multi-language programming tasks.★★★★ 🇺🇸 Claude Opus 4.536.2%
466ODinW-13 OObject detection (in the wild)Object Detection in the Wild benchmark covering 13 real-world domains.★★★★ 🇨🇳 Qwen3-VL-4B-Instruct48.2%
467Odyssey Math oMath reasoningOdyssey multi-step math benchmark.★★★★G Mathstral 7B37.2%
468OIBench EN oCode generationEnglish subset of OIBench for code generation.★★★★ 🇺🇸 Gemini 3 Pro58.2%
469OJBench OCode generation (online judge)Programming problems evaluated via online judge-style execution.★★★★ 🇺🇸 Gemini 3 Pro68.5%
470olmOCR-Bench ODocument OCRolmOCR benchmark assessing OCR fidelity and structured extraction on complex document pages.★★★★G Chandra OCR 0.1.083.1%
471OlympiadBench OMath (olympiad)Advanced mathematics olympiad-style problem benchmark.★★★★ 🇨🇳 Qwen3-30B-A3B-Instruct-250777.6%
472OlympicArena oMath (competition)Olympiad-style mathematics reasoning benchmark.★★★★ 🇨🇳 DeepSeek V376.2%
473OMEGA OMath (advanced)OMEGA olympiad-grade mathematics reasoning benchmark.★★★★ 🇺🇸 OLMo-3-Think-32B50.8%
474Omni-MATH OMath reasoningOmni-MATH benchmark covering diverse math reasoning tasks across difficulty levels.★★★★G Ling 1T74.5%
475Omni-MATH-HARD OMathChallenging math benchmark (Omni-MATH-HARD).★★★★ 🇺🇸 GPT-5 High73.6%
476OmniDocBench ODocument understandingDocument parsing benchmark covering diverse layouts, tables, and charts; edit-distance-based score, lower is better.★★★★G Gundam-M↓ 12.3%
477OmniDocBench 1.5 OOCRDocument understanding benchmark v1.5 with OCR evaluation. Overall Edit Distance metric, lower is better.★★★★ 🇺🇸 Dolphin V2↓ 0.1%
478OmniDocBench-CN ODocument understanding (Chinese)Chinese subset of OmniDocBench focusing on OCR-grounded document comprehension and reasoning.★★★★G PPStructure v3↓ 13.6%
479OmniMMI oMultimodal interactionOmniMMI benchmark for multimodal interaction across video streams.★★★★G Seed1.853.0%
480OmniSpatial oSpatial reasoningSpatial understanding and reasoning benchmark (OmniSpatial).★★★★ 🇨🇳 GLM-4.6V52.0%
481OneIG-Bench EN OText-to-imageOneIG-Bench English subset score for text-to-image generation.★★★★G Nano Banana 2.00.6%
482OneIG-Bench ZH OText-to-imageOneIG-Bench Chinese subset score for text-to-image generation.★★★★G Nano Banana 2.00.6%
483Online-Mind2web oWeb automationOnline web automation and task execution benchmark.★★★★G Seed1.885.9%
484Open Rewrite oInstruction followingRewrite benchmark assessing open-ended editing and directive-following quality.★★★★G MobileLLM P151.0%
485OpenBookQA O Science QAOpen-book multiple choice science questions with supporting facts.★★★★★128 🇺🇸 Hermes 4.3 36B Pyche96.6%
486OpenRewrite-Eval oRewrite qualityOpenRewrite evaluation; micro-averaged RougeL.★★★★ 🇨🇳 Qwen2.5 1.5B Instruct46.9%
487OptMATH oMath optimization reasoningOptMATH benchmark targeting challenging math optimization and problem-solving tasks.★★★★G Ling 1T57.7%
488Order 15 Items oList orderingOrdering benchmark requiring models to sequence 15 items correctly.★★★★G K2-V287.6%
489Order 30 Items oList ordering (long)Ordering benchmark requiring models to sequence 30 items correctly.★★★★G K2-V240.3%
490OSWorld OGUI agentsAgentic GUI task completion and grounding on desktop environments.★★★★ 🇺🇸 Claude Opus 4.672.7%
491OSWorld-G OGUI agentsOSWorld-G center accuracy (no_refusal).★★★★ 🇺🇸 Holo1.5-72B71.8%
492OSWorld Verified OGUI agentsVerified subset of OSWorld GUI agent benchmark.★★★★ 🇺🇸 Claude Opus 4.672.7%
493OSWorld2 oGUI agentsSecond-generation OSWorld GUI agent benchmark.★★★★ 🇨🇳 GLM-4.5V35.8%
494OVBench oOpen-vocabulary streamingOpen-vocabulary benchmark for streaming video understanding.★★★★G Seed1.865.1%
495OVOBench oStreaming video QAStreaming video QA benchmark with open-vocabulary queries.★★★★G Seed1.872.6%
496PaperBench Code-Dev oCode understandingPaperBench developer subset measuring code reasoning accuracy.★★★★ 🇺🇸 Claude Sonnet 443.3%
497PaperBench oResearch paper understandingBenchmark for understanding and reasoning over research papers.★★★★ 🇺🇸 Claude Opus 4.5 Thinking72.9%
498PHYBench OPhysics reasoningPhysics reasoning and calculation benchmark.★★★★ 🇺🇸 Gemini 3 Pro80.0%
499PhyX oPhysics reasoning (multimodal)Multimodal physics reasoning benchmark (PhyX).★★★★G Step3-VL-10B59.5%
500PIQA O Physical commonsensePhysical commonsense about everyday tasks and object affordances.★★★★G LLaDA2.0 Flash96.5%
501PixmoCount O Visual countingCounting objects/instances in images (PixmoCount).★★★★G Eagle2.5-8B90.2%
502PMC-VQA OMedical VQAPubMed Central visual question answering benchmark for biomedical images.★★★★ 🇨🇳 Qwen3.5-397B-A17B64.2%
503Point-Bench oPointing and countingBenchmark for pointing and counting objects in images.★★★★ 🇺🇸 Gemini 2.5 Pro85.5%
504PolyMATH OMath reasoningPolyglot mathematics benchmark assessing cross-topic math reasoning.★★★★ 🇺🇸 Gemini 3 Pro81.6%
505POPE o Hallucination detectionVision-language hallucination benchmark focusing on object existence verification.★★★★ 🇨🇳 InternVL3-2B90.1%
506PopQA O Knowledge / QAOpen-domain popular culture question answering benchmark testing long-tail factual recall.★★★★ 🇺🇸 Llama 3.1 Tulu 3 405B SFT55.7%
507PostTrainBench o Post-training automationMeasures how well AI agents can post-train base LLMs under fixed compute/time constraints; average score across AIME 2025, BFCL, GPQA Main, GSM8K, and HumanEval.★★★★ 🇺🇸 GPT-5.1 Codex-Max34.9%
508PRDBench oAgentic codingProduct Requirements Document benchmark for evaluating agentic coding capabilities.★★★★ 🇨🇳 LongCat-Flash-Lite39.6%
509ProcBench oProcedural reasoningProcedural reasoning benchmark evaluating step-by-step logical reasoning.★★★★G Seed2.0 Pro96.6%
510PrOntoQA OLogical reasoningProbing ontological reasoning via question answering.★★★★G Ling Flash 2.097.9%
511ProofBench Advanced oMathematical proofs (advanced)Advanced mathematical proof benchmark covering complex theorem proving tasks.★★★★ 🇺🇸 Gemini Deep Think (IMO Gold)65.7%
512ProofBench Basic oMathematical proofsEntry-level mathematical proof benchmarking set.★★★★ 🇨🇳 DeepSeekMath-V2-Heavy99.0%
513ProtocolQA oProtocol understanding and QAProtocol question answering benchmark evaluating understanding of scientific protocols and procedures.★★★★ 🇺🇸 Grok 4.1 (Thinking)79.0%
514QuAC o Conversational QAQuestion answering in context.★★★★ 🇺🇸 Llama 3.1 405B Base53.6%
515QuALITY o Long-context reading comprehensionLong-document multiple-choice reading comprehension benchmark.★★★★ 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT48.8%
516RACE o Reading comprehensionEnglish exams for middle and high school.★★★★ 🇺🇸 Nemotron-3-Nano-30B-A3B-Base88.0%
517Random Complex Tasks oAgentic tasks (random)Randomly constructed complex task environments for agent generalization.★★★★ 🇨🇳 LongCat-Flash-Thinking-260135.8%
518Realbench oWeb browsingReal-world browsing and QA benchmark.★★★★G Seed1.849.1%
519RealWorldQA O Real-world visual QAVisual question answering with real-world images and scenarios.★★★★ 🇨🇳 Qwen3.5-122B-A10B85.1%
520Ref-L4 (test) o Referring expressionsRef-L4 referring expression comprehension on the test split.★★★★ 🇨🇳 GLM-4.6V88.9%
521RefCOCO O Referring expressionsRefCOCO average accuracy at IoU 0.5 (val).★★★★ 🇨🇳 InternVL3.5-4B92.4%
522RefCOCOg o Referring expressionsRefCOCOg average accuracy at IoU 0.5 (val).★★★★G Moondream-9B-A2B88.6%
523RefCOCO+ o Referring expressionsRefCOCO+ accuracy at IoU 0.5 on the val split.★★★★G Moondream-9B-A2B81.8%
524RefSpatialBench OSpatial reasoningReference spatial understanding benchmark covering spatial grounding tasks.★★★★ 🇨🇳 Qwen3.5-397B-A17B73.6%
525RefusalBench oSafety / refusalSafety-oriented refusal and policy adherence benchmark.★★★★ 🇺🇸 Hermes 4.3 36B Pyche72.3%
526ReMI oMultimodal reasoningReasoning over multimodal inputs (ReMI).★★★★G Step3-VL-10B67.3%
527RepoBench OCode understandingRepository-level code comprehension and reasoning benchmark.★★★★ 🇺🇸 Claude Sonnet 4.583.8%
528ResearchRubrics oResearch evaluationBenchmark evaluating model ability to conduct research and synthesize findings.★★★★G Step-3.5 Flash 2026020465.3%
529RoboSpatialHome OEmbodied spatial understandingRoboSpatialHome benchmark for embodied spatial reasoning in domestic environments.★★★★ 🇨🇳 Qwen3-VL-235B-A22B Thinking73.9%
530Roo Code Evals O Code assistant evalCommunity-maintained coding evals and leaderboard by Roo Code.★★★★ 🇺🇸 GPT-5 mini99.0%
531RULER-100 @1M o Long-context evalRULER-100 evaluation at a 1M context window.★★★★ 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF1686.3%
532RULER-100 @256k o Long-context evalRULER-100 evaluation at a 256k context window.★★★★ 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF1692.9%
533RULER-100 @512k o Long-context evalRULER-100 evaluation at a 512k context window.★★★★ 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF1691.3%
534Ruler 128k O Long-context evalRULER benchmark at 128k context window.★★★★ 🇨🇳 Qwen3-Next-80B-A3B-Instruct96.0%
535Ruler 16k o Long-context evalRULER benchmark at 16k context window.★★★★ 🇨🇳 Qwen2.5 7.6B92.2%
536Ruler 1M o Long-context evalRULER benchmark at 1M context window.★★★★ 🇨🇳 Kimi-Linear-Instruct94.8%
537Ruler 32k o Long-context evalRULER benchmark at 32k context window.★★★★ 🇫🇷 Mistral Medium 396.0%
538Ruler 4k o Long-context evalRULER benchmark at 4k context window.★★★★ 🇫🇷 Ministral 8B96.0%
539Ruler 512k o Long-context evalRULER benchmark at 512k context window.★★★★ 🇨🇳 Qwen3-235B-A22B-Instruct-250790.9%
540Ruler 64k o Long-context evalRULER benchmark at 64k context window.★★★★ 🇨🇳 MiniCPM-SALA92.7%
541Ruler 8k o Long-context evalRULER benchmark at 8k context window.★★★★ 🇺🇸 Llama 3.1 8B Base93.8%
542RW Search oAgentic searchReal-world search benchmark evaluating retrieval and reasoning.★★★★ 🇺🇸 GPT-5.2 Thinking XHigh82.0%
543SALAD-Bench o Safety alignmentSafety Alignment and Dangerous-behavior benchmark evaluating harmful assistance and refusal consistency.★★★★G Granite-4.0-H-Micro↓ 96.8%
544SArena (Icon) oSVG generationSVG Arena benchmark for icon generation evaluation.★★★★G Intern-S1-Pro83.5%
545Scale AI Multi Challenge oChat & instruction followingScale AI Multi Challenge crowd-evaluated instruction following benchmark.★★★★ 🇨🇳 Qwen3-30B-A3B-Thinking-250744.8%
546SciCode (sub) OCodeSciCode subset score (sub).★★★★ 🇺🇸 Gemini 3.1 Pro59.0%
547SciCode (main) OCodeSciCode main score.★★★★ 🇺🇸 Gemini 2.5 Pro15.4%
548ScienceQA OScience QA (multimodal)Multiple-choice science questions with images, diagrams, and text context.★★★★G FastVLM-7B96.7%
549SciQ o Science QAMultiple choice science questions.★★★★G Pythia 12B92.9%
550SciReasoner oScientific reasoningScientific reasoning benchmark evaluating multimodal AI models on scientific tasks.★★★★G Intern-S1-Pro55.5%
551SciRes FrontierMath Tier 1-3 oMath (frontier)SciRes FrontierMath benchmark covering tiers 1-3.★★★★ 🇺🇸 GPT-5.2 Thinking40.3%
552SciRes FrontierMath Tier 4 oMath (frontier)SciRes FrontierMath benchmark covering tier 4.★★★★ 🇺🇸 Gemini 3 Pro18.8%
553ScreenQA Complex OGUI QAComplex ScreenQA benchmark accuracy.★★★★ 🇺🇸 Holo1.5-72B87.1%
554ScreenQA Short OGUI QAShort-form ScreenQA benchmark accuracy.★★★★ 🇺🇸 Holo1.5-72B91.9%
555ScreenSpot OScreen UI locatorsCenter accuracy on ScreenSpot.★★★★ 🇨🇳 Qwen3-VL 32B Instruct95.8%
556ScreenSpot-Pro O Screen UI locatorsAverage center accuracy on ScreenSpot-Pro.★★★★ 🇺🇸 GPT-5.2 Extra High86.3%
557ScreenSpot-v2 OScreen UI locatorsCenter accuracy on ScreenSpot-v2.★★★★G UI-Venus 72B95.3%
558SEAL-0 OAgentic web searchEvaluation of multi-step browsing agents on search, evidence gathering, and synthesis tasks.★★★★ 🇨🇳 Kimi-K2.5 Thinking57.4%
559SecCodeBench oSecure code generationBenchmark evaluating secure code generation capabilities.★★★★ 🇺🇸 GPT-5.268.7%
560SEED-Bench-2-Plus O Multimodal evaluationSEED-Bench-2-Plus overall accuracy.★★★★ 🇺🇸 Claude 3.7 Sonnet72.9%
561SEED-Bench-Img OMultimodal image understandingSEED-Bench image-only subset (SEED-Bench-Img).★★★★G Bagel 14B78.5%
562SEED-Bench o Multimodal evaluationSEED-Bench comprehensive multimodal understanding benchmark evaluating generative comprehension across multiple dimensions.★★★★ 🇺🇸 LFM2-VL-3B76.5%
563SFE OMultimodal reasoningStructured factual evaluation for multimodal models.★★★★ 🇺🇸 Gemini 3 Pro61.9%
564Showdown OGUI agentsSuccess rate on the Showdown UI interaction benchmark.★★★★ 🇺🇸 Holo1.5-72B76.8%
565SIFO oInstruction followingSingle-turn instruction following benchmark.★★★★ 🇨🇳 Qwen3-VL-30B-A3B Thinking66.9%
566SIFO Multiturn oInstruction followingMulti-turn SIFO benchmark for sustained instruction adherence.★★★★ 🇨🇳 Qwen3-VL-30B-A3B Thinking60.3%
567SimpleQA OQASimple question answering benchmark.★★★★★ 🇨🇳 DeepSeek V3.2-Exp97.1%
568SimpleQA Verified OQAVerified SimpleQA variant for parametric knowledge accuracy.★★★★ 🇺🇸 Gemini 3 Pro72.1%
569SimpleVQA OGeneral VQALightweight visual question answering set with everyday scenes.★★★★ 🇺🇸 Gemini 3 Pro73.2%
570SimpleVQA-DS oGeneral VQASimpleVQA variant curated by DeepSeek with everyday image question answering tasks.★★★★ 🇨🇳 Seed1.5-VL-Thinking61.3%
571Social Interaction QA (SIQA) oSocial commonsense QASocial Interaction QA benchmark evaluating social commonsense and situational reasoning.★★★★ 🇺🇸 Gemma 3 27B54.9%
572SLAKE OMedical VQASemantically-Labeled Knowledge-Enhanced medical visual question answering benchmark.★★★★ 🇨🇳 Kimi-K2.581.6%
573SmolInstruct oSmall molecule understandingSmall molecule instruction-following and understanding benchmark.★★★★G Intern-S1-Pro74.8%
574SocialIQA o Social commonsenseSocial interaction commonsense QA.★★★★ 🇺🇸 Gemma 3 PT 27B54.9%
575SpatialViz OMental visualizationMental visualization benchmark.★★★★ 🇺🇸 GPT-5.265.8%
576Spider O Text-to-SQLComplex text-to-SQL benchmark over cross-domain databases.★★★★G LLaDA2.0 Flash82.5%
577Spiral-Bench O Safety / sycophancyA LLM-judged benchmark measuring sycophancy and delusion reinforcement.★★★★ 🇺🇸 GPT-587.0%
578SQuAD v1.1 o Reading comprehensionExtractive QA from Wikipedia articles.★★★★★566 🇺🇸 Llama 3.1 405B Base89.3%
579SQuAD v2.0 O Reading comprehensionLike v1.1 with unanswerable questions.★★★★★566G LLaDA2.1-Flash (Q Mode)90.8%
580StreamingBench oStreaming videoStreaming video understanding benchmark.★★★★G Seed1.884.4%
581SUNRGBD O3D scene understandingSUN RGB-D benchmark for indoor scene understanding from RGB-D imagery.★★★★ 🇺🇸 GPT-5 Mini Minimal45.8%
582SuperChem oChemistry reasoningChemistry reasoning benchmark evaluating text-based chemical knowledge and problem solving.★★★★ 🇺🇸 Gemini 3 Pro63.2%
583SuperGPQA OGraduate-level QAHarder GPQA variant assessing advanced graduate-level reasoning.★★★★ 🇺🇸 Gemini 3 Pro75.3%
584SWE-Bench O Code repairSoftware engineering benchmark built from real GitHub repositories and issues.★★★★★3442 🇺🇸 GPT-5 Codex74.5%
585SWE-Bench Multilingual OCode repair (multilingual)Multilingual variant of SWE-Bench for issue fixing.★★★★ 🇺🇸 Claude Opus 4.5 Thinking77.5%
586SWE-Bench (OpenHands) o Code repairSWE-Bench results using the OpenHands autonomous coding agent.★★★★★3442 🇺🇸 NVIDIA-Nemotron-3-Nano-30B-A3B-BF1638.8%
587SWE-Bench Pro O Software engineeringFull SWE-Bench Pro benchmark for software-engineering agents.★★★★ 🇺🇸 Claude Opus 4.556.9%
588SWE-Bench Pro (Public) OSoftware engineeringPublic subset of the SWE-Bench Pro benchmark for software-engineering agents.★★★★ 🇺🇸 GPT-5.3 Codex56.8%
589SWE-Bench Verified O Code repairVerified subset of SWE-Bench for issue fixing.★★★★★ 🇺🇸 Claude Opus 4.580.9%
590SWE-Dev oCode repairSoftware engineering development and bug fixing benchmark.★★★★ 🇺🇸 Claude Sonnet 467.1%
591SWE-Lancer o Code repair (freelance tasks)Software engineering benchmark using real freelance-style issues.★★★★ 🇺🇸 GPT-5.1 Codex-Max79.9%
592SWE-Lancer Diamond oCode repair (freelance)Diamond subset of SWE-Lancer focusing on the hardest freelance-style issues.★★★★ 🇺🇸 GPT-5.3 Codex81.4%
593SWE-Lancer IC Diamond oCode repair (freelance)Individual Contributor Diamond subset of SWE-Lancer.★★★★ 🇺🇸 GPT-5.3 Codex81.4%
594SWE-Perf oCode repairSoftware engineering benchmark focused on performance-oriented fixes.★★★★ 🇺🇸 Gemini 3 Pro6.5%
595SWE-Review oCode reviewSoftware engineering review benchmark for assessing code review quality.★★★★ 🇺🇸 Claude Opus 4.516.2%
596SWT-Bench oCode repairSoftware testing benchmark evaluating generation of tests that reproduce real-world code issues.★★★★ 🇺🇸 GPT-5.2 Thinking80.7%
597SysBench oSystem promptsSystem prompt understanding and adherence benchmark.★★★★ 🇺🇸 GPT-4.174.1%
598TAU1-Airline OAgent tasks (airline)Tool-augmented agent evaluation in airline scenarios (TAU1).★★★★G openPangu-R-72B-2512 Slow Thinking56.0%
599TAU1-Retail OAgent tasks (retail)Tool-augmented agent evaluation in retail scenarios (TAU1).★★★★G openPangu-R-72B-2512 Slow Thinking73.0%
600TAU2-Airline OAgent tasks (airline)Tool-augmented agent evaluation in airline scenarios (TAU2).★★★★ 🇨🇳 LongCat-Flash-Thinking-260176.5%
601TAU2-Bench oAgent tasksAggregate tool-augmented agent evaluation across airline, retail, and telecom scenarios (TAU2).★★★★ 🇺🇸 Claude Opus 4.591.6%
602TAU2-Retail OAgent tasks (retail)Tool-augmented agent evaluation in retail scenarios (TAU2).★★★★ 🇺🇸 Claude Opus 4.691.9%
603TAU2-Telecom OAgent tasks (telecom)Tool-augmented agent evaluation in telecom scenarios (TAU2).★★★★ 🇨🇳 LongCat-Flash-Thinking-260199.3%
604TempCompass oTemporal reasoningTemporal reasoning benchmark evaluating understanding of time-related concepts in videos and images.★★★★ 🇺🇸 Gemini 3 Pro88.0%
605Terminal-Bench O Agent terminal tasksCommand-line task completion benchmark for agents.★★★★★637 🇺🇸 Claude Sonnet 4.5 (Thinking)61.3%
606Terminal-Bench 2.0 O Agent terminal tasksSecond-generation Terminal-Bench leaderboard for end-to-end terminal agents.★★★★G IQuest-Coder-V1-40B-Loop-Instruct81.4%
607Terminal-Bench Hard O Agent terminal tasksHard subset of Terminal-Bench command-line agent tasks.★★★★ 🇺🇸 GPT-5.1 High43.0%
608Terminal-Bench Terminus O Agent terminal tasksTerminal-Bench Terminus track assessing end-to-end terminal tool use.★★★★ 🇺🇸 Gemini 3.1 Pro68.5%
609TextQuests OText-based video gamesText-based video game benchmark.★★★★ 🇺🇸 Gemini 3 Pro41.0%
610TextQuests Harm OHarmful propensitiesHarmfulness evaluation on TextQuests scenarios.★★★★ 🇺🇸 Grok 4.1 Fast↓ 9.1%
611TextVQA O Text-based VQAVisual question answering that requires reading text in images.★★★★G Ovis2.6-30B-A3B90.7%
612TIIF-Bench Long OText-to-imageTIIF-Bench long prompt score for text-to-image generation.★★★★G Seedream 4.588.5%
613TIIF-Bench Short OText-to-imageTIIF-Bench short prompt score for text-to-image generation.★★★★G Nano Banana 2.091.0%
614TIR-Bench oTool-integrated reasoningBenchmark for tool-integrated reasoning with visual models.★★★★ 🇨🇳 Qwen3.5-27B59.8%
615TLDR9+ oSummarizationLong-form summarization benchmark with nine-domain TLDR prompts plus extended variations.★★★★G MobileLLM P116.8%
616TOMATO oTemporal understandingTemporal ordering and motion analysis benchmark (TOMATO).★★★★G Seed1.860.8%
617Tool-Decathlon OAgent tool-useComposite tool-use suite measuring multi-domain tool invocation success (Pass@1).★★★★ 🇺🇸 GPT-5.243.8%
618Toolathlon OAgentic software tasksLong-horizon, real-world software tool-use tasks.★★★★ 🇺🇸 Gemini 3 Flash49.4%
619TreeBench o Reasoning with tree structuresEvaluates hierarchical/tree-structured reasoning and planning capabilities in LLMs/VLMs.★★★★ 🇨🇳 GLM-4.6V51.4%
620TriQA oKnowledge QATriadic question answering benchmark evaluating world knowledge and reasoning.★★★★ 🇫🇷 Mixtral 8x22B82.2%
621TriviaQA O Open-domain QAOpen-domain question answering benchmark built from trivia and web evidence.★★★★ 🇺🇸 Gemma 3 PT 27B85.5%
622TriviaQA-Wiki o Open-domain QATriviaQA subset answering using Wikipedia evidence.★★★★ 🇺🇸 Llama 3.1 405B Base91.8%
623TrustLLM oSafety / reliabilityTrustLLM benchmark for trustworthiness and safety behaviors.★★★★ 🇨🇳 Qwen3-Coder-480B-A35B-Instruct88.4%
624TruthfulQA O Truthfulness / hallucinationMeasures whether a model imitates human falsehoods (truthfulness).★★★★G SOLAR-10.7B-Instruct-v1.071.4%
625TruthfulQA (DE) oTruthfulness / hallucination (German)German translation of the TruthfulQA benchmark.★★★★ 🇺🇸 Llama 3.3 70B Instruct0.2%
626TVBench oTV comprehensionBenchmark for TV show video comprehension and QA.★★★★G Seed1.871.5%
627TydiQA o Cross-lingual QATypologically diverse QA across languages.★★★★★313 🇺🇸 Llama 3.1 405B Base34.3%
628U-Artifacts oAgentic coding artifactsBenchmark focusing on generated code artifacts quality.★★★★ 🇺🇸 Gemini 3 Pro57.8%
629V* OMultimodal reasoningV* benchmark accuracy.★★★★ 🇨🇳 Qwen3.5-397B-A17B95.8%
630VCRBench oVisual commonsense reasoningVisual commonsense reasoning benchmark.★★★★G Seed1.859.8%
631VCT O Virology capability (protocol troubleshooting)Virology Capabilities Test: a benchmark that measures an LLM's ability to troubleshoot complex virology laboratory protocols.★★★★ 🇺🇸 Gemini 2.5 Pro100.0%
632Vending-Bench 2 OLong-horizon agentic tasksLong-horizon agentic task benchmark evaluating sustained goal completion; scored by final net worth in dollars rather than a percentage.★★★★ 🇺🇸 Gemini 3 Pro$5,478.2
633Vibe Android oVibe evaluation (Android)Vibe evaluation on Android tasks.★★★★ 🇺🇸 Claude Opus 4.592.2%
634Vibe Average oVibe evaluationAggregate Vibe evaluation score.★★★★G MiniMax M2.188.6%
635Vibe Backend oVibe evaluation (backend)Vibe evaluation on backend tasks.★★★★ 🇺🇸 Claude Opus 4.598.0%
636Vibe iOS oVibe evaluation (iOS)Vibe evaluation on iOS tasks.★★★★ 🇺🇸 Claude Opus 4.590.0%
637Vibe Simulation oVibe evaluation (simulation)Vibe evaluation on simulation tasks.★★★★ 🇺🇸 Gemini 3 Pro89.2%
638Vibe Web oVibe evaluation (web)Vibe evaluation on web tasks.★★★★G MiniMax M2.191.5%
639VibeEval OAesthetic/visual qualityVLM aesthetic evaluation with GPT scores.★★★★ 🇺🇸 Gemini 2.5 Pro76.4%
640Video-MME O Video understanding (multimodal)Multimodal evaluation of video understanding and reasoning.★★★★ 🇨🇳 Qwen3-VL 32B Instruct76.6%
641VideoHolmes oVideo QAVideo question answering benchmark focused on detective-style clues.★★★★G Seed1.865.5%
642VideoMME oMultimodal video evaluationVideo multimodal evaluation suite (VideoMME).★★★★ 🇺🇸 Gemini 3 Pro88.4%
643VideoMME (w/o sub) OVideo understandingVideo understanding benchmark without subtitles.★★★★ 🇺🇸 Gemini 3 Pro87.7%
644VideoMME (w/sub) OVideo understandingVideo understanding benchmark with subtitles.★★★★ 🇺🇸 Gemini 3 Pro88.4%
645VideoMMMU OMultimodal video understandingVideo-based extension of MMMU evaluating temporal multimodal reasoning and perception across disciplines.★★★★ 🇺🇸 Gemini 3 Pro87.6%
646VideoReasonBench oVideo reasoningVideo reasoning benchmark assessing temporal and causal understanding.★★★★ 🇺🇸 Gemini 2.5 Pro59.7%
647VideoSimpleQA oVideo QASimple question answering over short videos.★★★★ 🇺🇸 Gemini 3 Pro71.9%
648ViSpeak oVideo dialogueVideo-grounded dialogue and description benchmark.★★★★ 🇺🇸 Gemini 3 Pro89.0%
649VisualPuzzle oVisual reasoningVisual puzzle solving benchmark evaluating reasoning and pattern recognition capabilities.★★★★ 🇺🇸 GPT-5 High57.8%
650VisualWebBench O Web UI understandingAverage accuracy on VisualWebBench.★★★★ 🇺🇸 Holo1.5-72B83.8%
651VisuLogic O Visual logical reasoningLogical reasoning and compositionality benchmark for visual-language models.★★★★ 🇨🇳 ERNIE-4.5-VL-28B-A3B-Thinking52.5%
652VitaBench OIndustry QAIndustry-focused benchmark evaluating domain QA performance.★★★★ 🇺🇸 Claude Opus 4.556.3%
653VL-RewardBench oReward modeling (VL)Reward alignment benchmark for VLMs.★★★★ 🇺🇸 Claude 3.7 Sonnet67.4%
654VLMs are Biased o Multimodal biasEvaluates whether VLMs truly 'see' vs. relying on memorized knowledge; measures bias toward non-visual priors.★★★★90 🇺🇸 o4 mini20.2%
655VLMs are Blind O Visual grounding robustnessEvaluates failure modes of VLMs in grounding and perception tasks.★★★★G MiMo-VL 7B-RL79.4%
656VLMsAreBiased oMultimodal biasBenchmark evaluating biases in vision-language models.★★★★G Seed1.862.0%
657VLMsAreBlind OMultimodal robustnessBenchmark probing robustness of vision-language models to visual perturbations.★★★★ 🇺🇸 Gemini 3 Pro97.5%
658VoiceBench AdvBench OVoiceBenchVoiceBench adversarial safety evaluation.★★★★ 🇨🇳 Qwen3-Omni-30B-A3B-Thinking99.4%
659VoiceBench AlpacaEval oVoiceBenchVoiceBench evaluation on AlpacaEval instructions.★★★★ 🇨🇳 Qwen3-Omni-Flash-Thinking96.8%
660VoiceBench BBH oVoiceBenchVoiceBench evaluation on Big-Bench Hard prompts.★★★★ 🇺🇸 Gemini 2.5 Pro92.6%
661VoiceBench CommonEval OVoiceBenchVoiceBench evaluation on CommonEval.★★★★ 🇨🇳 Qwen3-Omni-Flash-Instruct91.0%
662VoiceBench IFEval oVoiceBenchVoiceBench instruction-following evaluation (IFEval).★★★★ 🇺🇸 Gemini 2.5 Pro85.7%
663MMAU v05.15.25 oAudio reasoningAudio reasoning benchmark MMAU v05.15.25.★★★★ 🇨🇳 Qwen3-Omni-Flash-Instruct77.6%
664VoiceBench MMSU OVoiceBenchVoiceBench MMSU benchmark (voice modality).★★★★ 🇨🇳 Qwen3-Omni-Flash-Thinking84.3%
665VoiceBench MMSU (Audio) oAudio reasoningAudio reasoning MMSU results.★★★★ 🇺🇸 Gemini 2.5 Pro77.7%
666VoiceBench OpenBookQA oVoiceBenchVoiceBench results on OpenBookQA prompts.★★★★ 🇨🇳 Qwen3-Omni-Flash-Thinking95.0%
667VoiceBench SD-QA OVoiceBenchVoiceBench Spoken Dialogue QA results.★★★★ 🇺🇸 Gemini 2.5 Pro90.1%
668VoiceBench WildVoice OVoiceBenchVoiceBench evaluation on WildVoice dataset.★★★★ 🇺🇸 Gemini 2.5 Pro93.4%
669VPCT oMultimodal reasoningVisual perception and comprehension test.★★★★ 🇺🇸 Gemini 3 Pro90.0%
670VQAv2 O Visual question answeringStandard Visual Question Answering v2 benchmark on natural images.★★★★ 🇺🇸 Molmo2-8B87.0%
671VSI-Bench OSpatial intelligenceVisual spatial intelligence benchmark covering 3D reasoning and spatial inference tasks.★★★★ 🇨🇳 Qwen3-VL-30B-A3B Instruct63.2%
672WebClick OGUI agentsTask success on the WebClick UI agent benchmark.★★★★ 🇺🇸 Claude Sonnet 493.0%
673WebDev Arena O Web development agentsArena evaluation for autonomous web development agents.★★★★ 🇺🇸 GPT-51483
674WebQuest-MultiQA oWeb agentsMulti-question web search and interaction tasks.★★★★ 🇨🇳 GLM-4.5V60.6%
675WebQuest-SingleQA oWeb agentsSingle-question web search and interaction tasks.★★★★ 🇨🇳 GLM-4.6V79.5%
676WebSrc OWeb QAWebpage question answering (SQuAD F1).★★★★ 🇺🇸 Holo1.5-72B97.2%
677WebVoyager oWeb agentsWeb navigation and interaction tasks for LLM agents.★★★★ 🇨🇳 GLM-4.6V81.0%
678WebVoyager2 oWeb agentsWeb navigation and interaction tasks for LLM agents (v2).★★★★ 🇨🇳 GLM-4.5V84.4%
679WebWalkerQA oWeb agentsWebWalker tasks evaluating autonomous browsing question answering performance.★★★★ 🇨🇳 Tongyi DeepResearch72.2%
680WeMath OMath reasoningMath reasoning benchmark spanning diverse curricula and difficulty levels.★★★★ 🇨🇳 Qwen3.5-397B-A17B87.9%
681WideSearch OWeb searchWide web search and QA benchmark.★★★★ 🇺🇸 GPT-5.276.8%
682Wild-Jailbreak oSafety / jailbreakAdversarial jailbreak benchmark evaluating refusal robustness.★★★★ 🇺🇸 GPT-OSS 120B (High)98.2%
683WildBench V2 oInstruction followingWildBench V2 human preference benchmark for instruction following and helpfulness.★★★★ 🇫🇷 Mistral Small 3.2 24B Instruct65.3%
684WildGuardTest oSafetyWildGuardTest safety benchmark.★★★★G IQuest-Coder-V1-40B-Thinking86.8%
685Winogender o Gender bias (coreference)Coreference resolution dataset for measuring gender bias.★★★★ 🇺🇸 Llama 3.3 70B Instruct84.3%
686WinoGrande O Coreference reasoningLarge-scale adversarial Winograd Schema-style pronoun resolution.★★★★99 🇺🇸 OLMo-3-Think-32B90.3%
687WinoGrande (DE) oCoreference reasoning (German)German translation of the WinoGrande pronoun resolution benchmark.★★★★ 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT0.8%
688WMDP Bio o Biosecurity knowledgeWeapons of Mass Destruction Proxy benchmark for biosecurity, measuring hazardous biological knowledge without info hazards.★★★★ 🇺🇸 Zephyr 7B↓ 63.7%
689WMDP Chem o Chemical security knowledgeWMDP benchmark for chemical security, evaluating knowledge relevant to chemical weapons development.★★★★ 🇺🇸 Zephyr 7B↓ 45.8%
690WMDP Cyber o Cybersecurity knowledgeWMDP benchmark for cybersecurity, assessing knowledge that could aid in cyber weapons development.★★★★ 🇺🇸 Zephyr 7B↓ 44.0%
691WMT16 En–De o Machine translationWMT16 English–German translation benchmark (news).★★★★ 🇺🇸 Llama 3.3 70B Instruct38.8%
692WMT16 En–De (Instruct) oMachine translationInstruction-tuned evaluation on the WMT16 English–German translation set.★★★★ 🇺🇸 Llama 3.3 70B Instruct37.9%
693WMT24++ O Machine translationExtended WMT 2024 evaluation across multiple language pairs.★★★★ 🇨🇳 Qwen3-235B-A22B-Thinking-250794.7%
694WorldTravel2 (multi-modal) oTravel planning (multimodal)WorldTravel2 benchmark multimodal track.★★★★ 🇺🇸 Gemini 3 Pro47.2%
695WorldTravel2 (text) oTravel planning (text)WorldTravel2 benchmark text-only track.★★★★ 🇺🇸 GPT-5 High56.4%
696WorldVQA oWorld knowledge VQAVisual question answering requiring world knowledge and commonsense reasoning.★★★★ 🇺🇸 Gemini 3 Pro47.4%
697WritingBench OWriting qualityGeneral-purpose writing quality benchmark.★★★★ 🇨🇳 Qwen3-235B-A22B-Thinking-250788.3%
698WSC O Coreference reasoningClassic Winograd Schema Challenge measuring commonsense coreference.★★★★ 🇺🇸 Gemma 3 PT 27B91.9%
699xBench-DeepSearch OAgentic researchEvaluates multi-hop deep research workflows on xBench DeepSearch tasks.★★★★ 🇺🇸 GPT-5 High77.9%
700xBench-DeepSearch (2025.05) oAgentic researchxBench DeepSearch benchmark May 2025 snapshot.★★★★G Step-3.5 Flash 2026020483.7%
701xBench-DeepSearch (2025.10) oAgentic researchxBench DeepSearch benchmark October 2025 snapshot.★★★★G Step-3.5 Flash 2026020456.3%
702XLRS-Bench oRemote sensingRemote sensing benchmark for evaluating multimodal AI on satellite and aerial imagery.★★★★G Intern-S1-Pro52.8%
703XpertBench (Edu) oEconomics/educationXpertBench education domain subset.★★★★ 🇺🇸 GPT-5 High56.9%
704XpertBench (Fin) oEconomics/financeXpertBench finance domain subset.★★★★ 🇺🇸 GPT-5 High64.5%
705XpertBench (Humanities) oEconomics/humanitiesXpertBench humanities domain subset.★★★★ 🇺🇸 GPT-5 High68.5%
706XpertBench (Law) oEconomics/legalXpertBench legal domain subset.★★★★ 🇺🇸 Claude Sonnet 4.558.7%
707XpertBench (Research) oEconomics/researchXpertBench research domain subset.★★★★ 🇺🇸 GPT-5 High48.2%
708XSTest oSafetyXSTest safety benchmark.★★★★G IQuest-Coder-V1-40B-Thinking94.3%
709ZebraLogic O Logical reasoningLogical reasoning benchmark assessing complex pattern and rule inference.★★★★ 🇨🇳 Qwen3-VL 32B Thinking96.1%
710ZeroBench OZero-shot generalizationEvaluates zero-shot performance across diverse tasks without task-specific finetuning.★★★★ 🇨🇳 GLM-4.5V23.4%
711ZeroBench (sub) OZero-shot generalizationSubset of ZeroBench targeting harder zero-shot reasoning cases.★★★★ 🇨🇳 Qwen3.5-397B-A17B41.0%
712ZeroSCROLLS BookSumSort o Long-context summarizationZeroSCROLLS split based on BookSumSort long-form summarization.★★★★ 🇺🇸 GPT-460.5%
713ZeroSCROLLS GovReport o Long-context summarizationZeroSCROLLS split based on the GovReport summarization benchmark.★★★★G CoLT541.0%
714ZeroSCROLLS MuSiQue o Long-context reasoningZeroSCROLLS split derived from MuSiQue multi-hop QA.★★★★ 🇺🇸 Llama 3.3 70B Instruct52.2%
715ZeroSCROLLS NarrativeQA o Long-context QAZeroSCROLLS split based on the NarrativeQA reading comprehension benchmark.★★★★ 🇺🇸 Claude v1.332.6%
716ZeroSCROLLS Qasper o Long-context QAZeroSCROLLS split based on the Qasper paper QA benchmark.★★★★G FLAN-UL256.9%
717ZeroSCROLLS QMSum o Long-context summarizationZeroSCROLLS split based on the QMSum meeting summarization benchmark.★★★★G CoLT522.5%
718ZeroSCROLLS QuALITY o Long-context QAZeroSCROLLS split based on the QuALITY reading comprehension benchmark.★★★★ 🇺🇸 GPT-489.2%
719ZeroSCROLLS SpaceDigest o Long-context summarizationZeroSCROLLS SpaceDigest extractive summarization task.★★★★ 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT77.9%
720ZeroSCROLLS SQuALITY o Long-context summarizationZeroSCROLLS split based on the SQuALITY long-form summarization benchmark.★★★★ 🇺🇸 GPT-422.6%
721ZeroSCROLLS SummScreenFD o Long-context summarizationZeroSCROLLS split based on the SummScreenFD summarization benchmark.★★★★G CoLT520.0%
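Several rows above (the RefCOCO family, Ref-L4) report "accuracy at IoU 0.5": a prediction counts as correct when its bounding box overlaps the ground-truth box with intersection-over-union of at least 0.5. A minimal sketch of that metric, assuming boxes are given as (x1, y1, x2, y2) tuples (the function names here are illustrative, not from any benchmark's official harness):

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])  # intersection top-left
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])  # intersection bottom-right
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def acc_at_iou(preds, gts, thr=0.5):
    """Fraction of predicted boxes matching ground truth at IoU >= thr."""
    hits = sum(iou(p, g) >= thr for p, g in zip(preds, gts))
    return hits / len(gts)
```

For example, two unit-offset 2x2 boxes overlap in a 1x1 square, giving IoU 1/7, which falls below the 0.5 threshold and would count as a miss.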
