Fu — Benchmark of Benchmarks
Fu-Benchmark is a meta-benchmark of the most influential evaluation suites used to measure and rank large language models. Use the search box to filter by name, topic, or model. A ↓ next to a score marks metrics where lower is better.
| # | Name | Topic | Description | Relevance | GitHub ★ | Leader | Top % |
|---|---|---|---|---|---|---|---|
| 1 | AA-Index o | Multi-domain QA | Comprehensive QA index across diverse domains. | ★★★★★ | 73.2% | ||
| 2 | AA-LCR O | Long-context reasoning | A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens. | ★★★★★ | 76.0% | ||
| 3 | AA-Omniscience O | Knowledge and hallucination | Benchmark measuring factual recall and hallucination across economically relevant domains. | ★★★★★ | 13.0% | ||
| 4 | AceBench O | Industry QA | Industry-focused benchmark assessing domain QA and reasoning. | ★★★★★ | 82.2% | ||
| 5 | ACP-Bench Bool O | Safety evaluation (boolean) | Safety and behavior evaluation with yes/no questions. | ★★★★★ | 85.1% | ||
| 6 | ACP-Bench MCQ O | Safety evaluation (MCQ) | Safety and behavior evaluation with multiple-choice questions. | ★★★★★ | 82.1% | ||
| 7 | AetherCode o | Code generation | Code generation benchmark for diverse coding tasks. | ★★★★★ | 56.7% | ||
| 8 | AgentCompany o | Agent reasoning | Company-level agent reasoning and decision-making benchmark. | ★★★★★ | 41.0% | ||
| 9 | AgentDojo O | Agent evaluation | Interactive evaluation suite for autonomous agents across tools and tasks. | ★★★★★ | 88.7% | ||
| 10 | Agentic Coding O | Agentic coding | Agentic coding benchmark for autonomous software tasks. | ★★★★★ | 53.8% | ||
| 11 | AGIEval (English) O | Exams | English subset of AGIEval; academic and professional exam questions. | ★★★★★ | 92.2% | ||
| 12 | AGIEval LSAT-AR o | Law exam reasoning | LSAT Analytical Reasoning subset from AGIEval benchmark. | ★★★★★ | 30.4% | ||
| 13 | AI2D O | Diagram understanding (VQA) | Visual question answering over science and diagram images. | ★★★★★ | 98.7% | ||
| 14 | Aider Code Editing o | Code editing | Measures interactive code editing quality within the Aider assistant workflow. | ★★★★★ | 89.8% | ||
| 15 | Aider-Polyglot O | Code assistant eval | Aider polyglot coding leaderboard. | ★★★★★ | 92.9% | ||
| 16 | Aider-Polyglot (Diff) O | Code assistant eval | Aider polyglot leaderboard using diff mode (pass@2). | ★★★★★ | 91.9% | ||
| 17 | AIME 2024 O | Math (competition) | American Invitational Mathematics Examination 2024 problems. | ★★★★★ | 96.6% | ||
| 18 | AIME 2024-Ko o | Math (competition, Korean) | Korean translation of AIME 2024 problems. | ★★★★★ | 80.3% | ||
| 19 | AIME 2025 O | Math (competition) | American Invitational Mathematics Examination 2025 problems. | ★★★★★ | 100.0% | ||
| 20 | AInstein-SWE-Bench o | Agentic coding | AInstein agent coding benchmark. | ★★★★★ | 42.8% | ||
| 21 | All-Angles Bench O | Spatial perception | All-Angles benchmark for spatial recognition and 3D perception. | ★★★★★ | G Step3-VL-10B | 57.2% | |
| 22 | AlpacaEval O | Instruction following | Automatic eval using GPT-4 as a judge. | ★★★★★ | 1849 | 99.4% | |
| 23 | AlpacaEval 2.0 O | Instruction following | Updated AlpacaEval with improved prompts and judging. | ★★★★★ | 87.6% | ||
| 24 | AMC-23 O | Math (competition) | American Mathematics Competition 2023 evaluation. | ★★★★★ | G QwQ-32B | 98.5% | |
| 25 | AMO-Bench O | Math (competition) | Advanced math olympiad-style benchmark. | ★★★★★ | 72.5% | ||
| 26 | AMO-Bench CH o | Math (competition) | Chinese subset of AMO-Bench. | ★★★★★ | 74.9% | ||
| 27 | AndroidWorld O | Mobile agents | Benchmark for agents operating Android apps via UI automation. | ★★★★★ | G Seed1.8 | 70.7% | |
| 28 | API-Bank o | Tool use | API-Bank tool-use benchmark. | ★★★★★ | 92.0% | ||
| 29 | ARC-AGI-1 O | General reasoning | ARC-AGI Phase 1 aggregate accuracy. | ★★★★★ | 86.2% | ||
| 30 | ARC-AGI-2 O | General reasoning | ARC-AGI Phase 2 aggregate accuracy. | ★★★★★ | 52.9% | ||
| 31 | ARC Average O | Science QA (average) | Average accuracy across ARC-Easy and ARC-Challenge. | ★★★★★ | 60.5% | ||
| 32 | ARC-Challenge O | Science QA | Hard subset of AI2 Reasoning Challenge; grade-school science. | ★★★★★ | 96.9% | ||
| 33 | ARC-Challenge (DE) o | Science QA (German) | German translation of the ARC Challenge benchmark. | ★★★★★ | 0.7% | ||
| 34 | ARC-Easy O | Science QA | Easier subset of AI2 Reasoning Challenge. | ★★★★★ | 89.0% | ||
| 35 | ARC-Easy (DE) o | Science QA (German) | German translation of the ARC Easy science QA benchmark. | ★★★★★ | 0.8% | ||
| 36 | Arena-Hard O | Chat ability | Hard prompts on Chatbot Arena. | ★★★★★ | 920 | 97.1% | |
| 37 | Arena-Hard V2 O | Chat ability | Updated Arena-Hard v2 prompts on Chatbot Arena. | ★★★★★ | 920 | 90.2% | |
| 38 | Arena-Hard V2 Creative Writing O | Creative writing | Chatbot Arena Hard V2 creative writing win-rate subset. | ★★★★★ | 93.6% | ||
| 39 | Arena-Hard V2 Hard Prompt O | Chat ability | Chatbot Arena Hard V2 benchmark using the hard prompt win-rate subset. | ★★★★★ | 72.6% | ||
| 40 | ARKitScenes O | 3D scene understanding | ARKitScenes benchmark for assessing 3D scene reconstruction and understanding from mixed reality captures. | ★★★★★ | 61.5% | ||
| 41 | ART Agent Red Teaming O | Agent robustness | Evaluation suite for adversarial red-teaming of autonomous AI agents. | ★★★★★ | ↓ 33.6% | ||
| 42 | ArtifactsBench O | Agentic coding | Artifacts-focused coding and tool-use benchmark evaluating generated code artifacts. | ★★★★★ | 73.0% | ||
| 43 | ASR AMI o | ASR | Automatic speech recognition benchmark on AMI meeting speech. | ★★★★★ | ↓ 15.1% | ||
| 44 | ASR Earnings22 o | ASR | Automatic speech recognition benchmark on Earnings22 financial calls. | ★★★★★ | G Whisper-large-V3 | ↓ 11.3% | |
| 45 | ASR GigaSpeech o | ASR | Automatic speech recognition benchmark on GigaSpeech. | ★★★★★ | G Whisper-large-V3 | ↓ 10.0% | |
| 46 | ASR LibriSpeech Clean o | ASR | Automatic speech recognition benchmark on LibriSpeech clean split. | ★★★★★ | ↓ 1.9% | ||
| 47 | ASR LibriSpeech Other o | ASR | Automatic speech recognition benchmark on LibriSpeech other split. | ★★★★★ | G Whisper-large-V3 | ↓ 3.9% | |
| 48 | ASR SPGISpeech o | ASR | Automatic speech recognition benchmark on SPGISpeech. | ★★★★★ | ↓ 2.8% | ||
| 49 | ASR TED-LIUM o | ASR | Automatic speech recognition benchmark on TED-LIUM. | ★★★★★ | ↓ 3.5% | ||
| 50 | ASR VoxPopuli o | ASR | Automatic speech recognition benchmark on VoxPopuli. | ★★★★★ | ↓ 5.6% | ||
| 51 | AstaBench O | Agent evaluation | Evaluates science agents across literature understanding, data analysis, planning, tool use, coding, and search. | ★★★★★ | 53.0% | ||
| 52 | AttaQ O | Safety / jailbreak | Adversarial jailbreak suite measuring refusal robustness against targeted attack prompts. | ★★★★★ | G Granite 3.3 8B Instruct | 88.5% | |
| 53 | AutoCodeBench O | Autonomous coding | End-to-end autonomous coding benchmark with unit-test based execution across diverse repositories and tasks. | ★★★★★ | 52.4% | ||
| 54 | AutoCodeBench-Lite O | Autonomous coding | Lite version of AutoCodeBench focusing on smaller tasks with the same end-to-end, unit-test-based evaluation. | ★★★★★ | 64.5% | ||
| 55 | AutoLogi o | Logical reasoning | AutoLogi benchmark evaluating automated logical reasoning accuracy. | ★★★★★ | 89.8% | ||
| 56 | BALROG O | Agent robustness | Benchmark for assessing LLM agents under adversarial and out-of-distribution tool-use scenarios. | ★★★★★ | 43.6% | ||
| 57 | BBH O | Multi-task reasoning | Hard subset of BIG-bench with diverse reasoning tasks. | ★★★★★ | 510 | 94.3% | |
| 58 | BBQ O | Bias evaluation | Bias Benchmark for Question Answering evaluating social biases across contexts. | ★★★★★ | 56.0% | ||
| 59 | BeaverTails o | Safety / harmfulness | Safety benchmark evaluating harmfulness in model responses. | ★★★★★ | G IQuest-Coder-V1-40B-Thinking | 76.7% | |
| 60 | BeyondAIME o | Math (beyond AIME) | Advanced math problems exceeding AIME difficulty. | ★★★★★ | 83.0% | ||
| 61 | BFCL O | Tool calling | Berkeley Function-Calling Leaderboard evaluating function/tool-call accuracy. | ★★★★★ | 95.0% | ||
| 62 | BFCL Live v2 O | Tool calling | Live v2 subset of the Berkeley Function-Calling Leaderboard built from real-world function-calling queries. | ★★★★★ | 81.0% | ||
| 63 | BFCL v2 o | Tool calling | Second release of the Berkeley Function-Calling Leaderboard. | ★★★★★ | G MobileLLM P1 | 29.4% | |
| 64 | BFCL v3 O | Tool calling | Third release of the Berkeley Function-Calling Leaderboard, adding multi-turn function calling. | ★★★★★ | 77.8% | ||
| 65 | BFCL v3 (Live) o | Tool calling | BFCL v3 Live subset for real-time tool calling evaluation. | ★★★★★ | 82.9% | ||
| 66 | BFCL v3 (Multi-Turn) o | Tool calling | BFCL v3 Multi-Turn subset for multi-turn tool calling evaluation. | ★★★★★ | 53.6% | ||
| 67 | BFCL v4 O | Tool calling | Fourth release of the Berkeley Function-Calling Leaderboard. | ★★★★★ | 77.5% | ||
| 68 | BIG-Bench o | Multi-task reasoning | BIG-bench overall performance (original). | ★★★★★ | 3110 | 55.1% | |
| 69 | BIG-Bench Extra Hard o | Multi-task reasoning | Extra hard subset of BIG-bench tasks. | ★★★★★ | G Ling 1T | 47.3% | |
| 70 | BigCodeBench O | Code Generation | BigCodeBench evaluates large language models on practical code generation tasks with unit-test verification. | ★★★★★ | G MiMo V2 Flash Base | 70.1% | |
| 71 | BigCodeBench Hard O | Code generation (hard) | Harder variant of BigCodeBench testing complex programming and library tasks with function-level code generation. | ★★★★★ | 35.8% | ||
| 72 | BIOBench o | Biology reasoning | Biology knowledge and reasoning benchmark. | ★★★★★ | 51.9% | ||
| 73 | BioLP-Bench o | Biomedical NLP | Comprehensive biomedical language processing benchmark evaluating LLMs across tasks like NER, relation extraction, and QA. | ★★★★★ | 47.0% | ||
| 74 | Bird-SQL O | Text-to-SQL | Natural language to SQL generation benchmark. | ★★★★★ | 59.3% | ||
| 75 | BLINK O | Multimodal grounding | Evaluates visual-language grounding and reference resolution to reduce hallucinations. | ★★★★★ | 87.4% | ||
| 76 | BoB-HVR O | Composite capability index | Hard, Versatile, and Relevant composite score across eight capability buckets. | ★★★★★ | 9.0% | ||
| 77 | BOLD o | Bias evaluation | Bias in Open-ended Language Dataset probing demographic biases in text generation. | ★★★★★ | ↓ 0.1% | ||
| 78 | BoolQ O | Reading comprehension | Yes/no QA from naturally occurring questions. | ★★★★★ | 171 | G Marin-32B-Mantis | 89.4% |
| 79 | Borda Count (Multilingual) o | Aggregate ranking | Borda count aggregate ranking across multilingual benchmarks; lower is better. | ★★★★★ | ↓ 2.9% | ||
| 80 | BrowseComp O | Web browsing | Web browsing comprehension and competence benchmark. | ★★★★★ | G MiroThinker-v1.5-235B | 69.8% | |
| 81 | BrowseComp (With Content Manager) O | Web browsing | BrowseComp benchmark evaluated with content manager assistance. | ★★★★★ | 73.1% | ||
| 82 | BrowseComp_zh O | Web browsing (Chinese) | Chinese variant of the BrowseComp web browsing benchmark. | ★★★★★ | G Seed1.8 | 81.3% | |
| 83 | BRuMo25 o | Math competition | BruMo 2025 olympiad-style mathematics benchmark. | ★★★★★ | 69.5% | ||
| 84 | BuzzBench O | Humor analysis | Humor analysis benchmark. | ★★★★★ | 71.1% | ||
| 85 | C-Eval O | Chinese exams | Chinese college-level exam benchmark. | ★★★★★ | 1768 | 93.7% | |
| 86 | C3-Bench o | Reasoning (Chinese) | Comprehensive Chinese reasoning capability benchmark. | ★★★★★ | 35 | 83.1% | |
| 87 | CaseLaw v2 O | Legal reasoning | U.S. case law benchmark evaluating legal reasoning and judgment over court opinions. | ★★★★★ | 78.1% | ||
| 88 | CC-OCR O | OCR (cross-lingual) | Cross-lingual OCR benchmark evaluating character recognition across mixed-language documents. | ★★★★★ | 81.5% | ||
| 89 | CFEval o | Coding ELO / contest eval | Contest-style coding evaluation with ELO-like scoring. | ★★★★★ | 2134 | ||
| 90 | CGBench o | Long video QA | Cartoon/CG long video question answering benchmark. | ★★★★★ | 64.6% | ||
| 91 | Charades-STA O | Video grounding | Charades-STA temporal grounding (mIoU). | ★★★★★ | 64.0% | ||
| 92 | ChartMuseum o | Chart understanding | Large-scale curated collection of charts for evaluating parsing, grounding, and reasoning. | ★★★★★ | 63.3% | ||
| 93 | ChartQA O | Chart understanding (VQA) | Visual question answering over charts and plots. | ★★★★★ | 94.1% | ||
| 94 | ChartQA-Pro o | Chart understanding (VQA) | Professional-grade chart question answering with diverse chart types and complex reasoning. | ★★★★★ | 65.5% | ||
| 95 | CharXiv (DQ) O | Chart description (PDF) | Scientific chart/table descriptive questions from arXiv PDFs. | ★★★★★ | 95.0% | ||
| 96 | CharXiv (RQ) O | Chart reasoning (PDF) | Scientific chart/table reasoning questions from arXiv PDFs. | ★★★★★ | 82.1% | ||
| 97 | Chinese SimpleQA o | QA (Chinese) | Chinese variant of the SimpleQA benchmark. | ★★★★★ | 77.6% | ||
| 98 | CLIcK o | Korean knowledge | Benchmark of cultural and linguistic intelligence in Korean (CLIcK). | ★★★★★ | 86.3% | ||
| 99 | CloningScenarios o | Biosecurity refusal | Safety benchmark that red-teams models with cloning-related misuse scenarios to measure compliance and refusal rates. | ★★★★★ | ↓ 45.0% | ||
| 100 | CLUEWSC o | Coreference reasoning (Chinese) | Chinese Winograd Schema-style coreference benchmark from CLUE. | ★★★★★ | 92.8% | ||
| 101 | CMath o | Math (Chinese) | Chinese mathematics benchmark. | ★★★★★ | 96.7% | ||
| 102 | CMMLU O | Chinese multi-domain | Chinese counterpart to MMLU. | ★★★★★ | 781 | 91.9% | |
| 103 | CNMO 2024 o | Math (competition) | China National Mathematical Olympiad 2024 evaluation set. | ★★★★★ | G openPangu-R-72B-2512 Slow Thinking | 82.8% | |
| 104 | Codeforces O | Competitive programming | Competitive programming performance on Codeforces problems (ELO). | ★★★★★ | 2719 | ||
| 105 | COLLIE o | Instruction following | Comprehensive instruction-following evaluation suite. | ★★★★★ | 55 | 99.0% | |
| 106 | Collie-Hard o | Instruction following | Hard subset of Collie instruction-following tasks. | ★★★★★ | 99.0% | ||
| 107 | CommonsenseQA O | Commonsense QA | Multiple-choice QA requiring commonsense knowledge. | ★★★★★ | 88.5% | ||
| 108 | Complex Workflow o | Complex workflows | Complex workflow benchmark for economically valuable tasks. | ★★★★★ | 58.2% | ||
| 109 | COPA o | Causal reasoning | Choice of Plausible Alternatives. | ★★★★★ | G Marin-32B-Bison | 94.0% | |
| 110 | CorpusQA O | Long-context QA | Question answering over large text corpora. | ★★★★★ | 81.6% | ||
| 111 | CountBench O | Visual counting | Object counting and numeracy benchmark for visual-language models across varied scenes. | ★★★★★ | 97.3% | ||
| 112 | CountBenchQA O | Visual counting QA | Visual question answering benchmark focused on counting objects across varied scenes. | ★★★★★ | G Moondream-9B-A2B | 93.2% | |
| 113 | Countdown o | Planning and reasoning | Countdown-style reasoning and planning benchmark. | ★★★★★ | G K2-V2 | 75.6% | |
| 114 | Countix o | Video counting | Video-based counting benchmark for multiple objects. | ★★★★★ | G Seed1.8 | 31.0% | |
| 115 | CRAG o | Retrieval QA | Complex Retrieval-Augmented Generation benchmark for grounded question answering. | ★★★★★ | G Jamba Mini 1.6 | 76.2% | |
| 116 | Creative Story‑Writing Benchmark V3 O | Creative writing | Story writing benchmark evaluating creativity, coherence, and style (v3). | ★★★★★ | 291 | 8.7% | |
| 117 | Longform Creative Writing O | Creative writing | Longform creative writing evaluation (EQ-Bench). | ★★★★★ | 20 | 79.8% | |
| 118 | Creative Writing v3 O | Creative writing | An LLM-judged creative writing benchmark. | ★★★★★ | 54 | 1661 | |
| 119 | Complex Research using Integrated Thinking – Physics Test O | Reasoning | CritPt (Complex Research using Integrated Thinking – Physics Test) benchmark. | ★★★★★ | 12.6% | ||
| 120 | CRUX-I O | Code reasoning | CRUXEval input-prediction split: infer a function input that produces a given output. | ★★★★★ | 98.8% | ||
| 121 | CRUX-O O | Code reasoning | CRUXEval output-prediction split: predict a function's output for a given input. | ★★★★★ | G IQuest-Coder-V1-40B-Loop-Thinking | 99.4% | |
| 122 | CruxEval O | Code reasoning | Code reasoning benchmark measuring input/output prediction for short Python functions. | ★★★★★ | 86.8% | ||
| 123 | CSimpleQA o | QA | Chinese SimpleQA benchmark variant (short factual questions). | ★★★★★ | 77.6% | ||
| 124 | Customer Support Q&A o | Customer support QA | Customer support question answering benchmark. | ★★★★★ | G Seed1.8 | 69.0% | |
| 125 | CUTE o | English characters | Character-level benchmark testing knowledge of the spelling and composition of English tokens. | ★★★★★ | 78.6% | ||
| 126 | CV-Bench O | Computer vision QA | Diverse CV tasks for VLMs. | ★★★★★ | 92.0% | ||
| 127 | CVTG-2K CLIPScore o | Text rendering | CVTG-2K CLIPScore for text rendering in image generation. | ★★★★★ | G Seedream 4.5 | 0.8% | |
| 128 | CVTG-2K NED o | Text rendering | CVTG-2K normalized edit distance (NED) for text rendering. | ★★★★★ | 1.0% | ||
| 129 | CVTG-2K Word Accuracy o | Text rendering | CVTG-2K word accuracy for text rendering in images. | ★★★★★ | 0.9% | ||
| 130 | CyBench o | Cybersecurity CTF | Framework with 40 professional-level CTF tasks evaluating LLMs' practical cybersecurity capabilities. | ★★★★★ | ↓ 22.5% | ||
| 131 | CyberGym o | Cybersecurity tasks | Benchmark for cybersecurity-related coding and reasoning tasks. | ★★★★★ | 50.6% | ||
| 132 | DA-2K o | Spatial reasoning | 2D/3D spatial reasoning benchmark. | ★★★★★ | 85.3% | ||
| 133 | Deep Planning o | Planning and reasoning | Benchmark evaluating deep planning and multi-step reasoning capabilities. | ★★★★★ | 44.6% | ||
| 134 | DeepConsult o | Agentic writing | Agentic consulting and writing benchmark. | ★★★★★ | 57.2% | ||
| 135 | DeepMind Mathematics o | Math reasoning | Synthetic math problem sets from DeepMind covering arithmetic, algebra, calculus, and more. | ★★★★★ | G Granite-4.0-H-Small | 59.3% | |
| 136 | DeepResearchBench o | Agentic research writing | Research-oriented agentic writing and planning benchmark. | ★★★★★ | 49.6% | ||
| 137 | DeepSearchQA o | Deep web search QA | Multi-step web search and question answering benchmark. | ★★★★★ | 77.1% | ||
| 138 | Design2Code O | Coding (UI) | Translating UI designs into code. | ★★★★★ | 93.4% | ||
| 139 | DesignArena O | Generative design | Leaderboard tracking generative design systems across layout, branding, and marketing tasks. | ★★★★★ | 1410 | ||
| 140 | DetailBench o | Spot small mistakes | Evaluates whether LLMs can notice subtle errors and minor inconsistencies in text. | ★★★★★ | 8.7% | ||
| 141 | DiscoX o | Agentic writing | DiscoX benchmark for agentic writing and reasoning. | ★★★★★ | 75.8% | ||
| 142 | Do-Anything-Now o | Safety / jailbreak | Resistance to Do Anything Now (DAN) style jailbreak prompts. | ★★★★★ | G IQuest-Coder-V1-40B-Thinking | 97.7% | |
| 143 | Do-Not-Answer o | Safety / refusal | Evaluates a model's ability to refuse unsafe or disallowed requests. | ★★★★★ | G K2-THINK | 88.0% | |
| 144 | DocMath O | Document math | Math reasoning on document-based problems. | ★★★★★ | 67.6% | ||
| 145 | DocVQA O | Document understanding (VQA) | Visual question answering over scanned documents. | ★★★★★ | 96.9% | ||
| 146 | Dolphin-Page o | Document OCR | Dolphin Page benchmark measuring OCR fidelity and structured extraction on multi-layout documents. | ★★★★★ | ↓ 7.4% | ||
| 147 | DPG-Bench O | Text-to-image prompt following | DPG-Bench score measuring dense prompt following in image generation. | ★★★★★ | G Seedream 4.5 | 88.6% | |
| 148 | DROP O | Reading + reasoning | Discrete reasoning over paragraphs (addition, counting, comparisons). | ★★★★★ | 93.5% | ||
| 149 | DUDE o | Multimodal long-context | Long-context multimodal understanding benchmark. | ★★★★★ | 70.1% | ||
| 150 | DynaMath O | Math reasoning (video) | Dynamic/video-based mathematical reasoning evaluating temporal and visual understanding. | ★★★★★ | 63.7% | ||
| 151 | Economically important tasks o | Industry QA (cross-domain) | Evaluation suite of real-world, economically impactful tasks across key industries and workflows. | ★★★★★ | 47.1% | ||
| 152 | Education o | Economics/education | Education field evaluation (economically valuable tasks). | ★★★★★ | G Seed1.8 | 60.8% | |
| 153 | EgoSchema O | Egocentric video QA | EgoSchema validation accuracy. | ★★★★★ | 77.9% | ||
| 154 | EgoTempo o | Egocentric temporal reasoning | Egocentric video temporal reasoning benchmark. | ★★★★★ | G Seed1.8 | 67.0% | |
| 155 | EIFBench o | Instruction following | Complex instruction-following benchmark. | ★★★★★ | 66.7% | ||
| 156 | EmbSpatialBench O | Spatial understanding | Embodied spatial understanding benchmark evaluating navigation and localization. | ★★★★★ | 84.3% | ||
| 157 | EMMA o | Multimodal reasoning | EMMA benchmark for multimodal reasoning. | ★★★★★ | 66.5% | ||
| 158 | Enamel o | Composite capability | Composite capability benchmark capturing broad model performance (Enamel score). | ★★★★★ | G Rnj-1 | 49.0% | |
| 159 | EnConda-Bench o | Code editing | English code editing benchmark for applying conditional modifications. | ★★★★★ | G Youtu-LLM-2B | 21.5% | |
| 160 | EnigmaEval O | Challenging puzzles | Challenging puzzle benchmark. | ★★★★★ | 17.8% | ||
| 161 | Enterprise RAG o | Retrieval-augmented generation | Enterprise retrieval-augmented generation evaluation covering internal knowledge bases. | ★★★★★ | 69.2% | ||
| 162 | EQ-Bench O | Reasoning | General reasoning benchmark assessing equation/logic capabilities. | ★★★★★ | 352 | G Jan v1 2509 | 85.0% |
| 163 | EQ-Bench 3 O | Emotional intelligence (roleplay) | A benchmark measuring emotional intelligence in challenging roleplays, judged by Sonnet 3.7. | ★★★★★ | 21 | 1555 | |
| 164 | ERQA O | Spatial reasoning | Spatial recognition and reasoning QA benchmark (ERQA). | ★★★★★ | 71.0% | ||
| 165 | EvalPerf O | Code efficiency | Measures the efficiency of LLM-generated code, including runtime and memory metrics. | ★★★★★ | 100.0% | ||
| 166 | EvalPlus O | Code generation | Aggregated code evaluation suite from EvalPlus. | ★★★★★ | 1577 | 89.0% | |
| 167 | EVG o | Document OCR | EVG document OCR benchmark evaluating recognition accuracy and layout extraction. | ★★★★★ | ↓ 3.0% | ||
| 168 | EXECUTE o | Multilingual character tasks | Multilingual character-level evaluation benchmark. | ★★★★★ | 71.6% | ||
| 169 | FACTS Benchmark Suite o | Factuality (composite) | Comprehensive factuality benchmark suite covering held-out internal grounding, parametric knowledge, multimodal understanding, and search retrieval benchmarks. | ★★★★★ | 70.5% | ||
| 170 | FACTS Grounding O | Grounding / factuality | Grounded factuality benchmark evaluating model alignment with source facts. | ★★★★★ | 88.5% | ||
| 171 | FActScore o | Factuality / hallucination | Measures hallucination rate on an open-source prompt suite; lower is better. | ★★★★★ | ↓ 1.0% | ||
| 172 | FaithJudge (1-Hallu.) o | Hallucination detection | FaithJudge hallucination rate with 1-hallucination metric (lower is better). | ★★★★★ | G Moonlight-Instruct | ↓ 56.0% | |
| 173 | Meta Score Agent O | Composite capability index | Composite meta score aggregating agent benchmarks. | ★★★★★ | 100.0% | ||
| 174 | Meta Score Code O | Composite capability index | Composite meta score aggregating coding benchmarks. | ★★★★★ | 100.0% | ||
| 175 | Meta Score Math O | Composite capability index | Composite meta score aggregating math benchmarks. | ★★★★★ | 100.0% | ||
| 176 | Meta Score OCR O | Composite capability index | Composite meta score aggregating OCR benchmarks. | ★★★★★ | 80.0% | ||
| 177 | Meta Score Safety O | Composite safety index | Composite meta score aggregating safety benchmarks. | ★★★★★ | G Granite 3.3 8B Instruct | 70.0% | |
| 178 | Meta Score STEM O | Composite capability index | Composite meta score aggregating STEM benchmarks. | ★★★★★ | 100.0% | ||
| 179 | Meta Score Text O | Composite capability index | Composite meta score aggregating text benchmarks. | ★★★★★ | 85.7% | ||
| 180 | Meta Score Visual O | Composite capability index | Composite meta score aggregating visual benchmarks. | ★★★★★ | 100.0% | ||
| 181 | Meta Score Writing O | Composite capability index | Composite meta score aggregating writing benchmarks. | ★★★★★ | 60.0% | ||
| 182 | FigQA o | Figure understanding and QA | Figure question answering benchmark evaluating visual reasoning over scientific figures and diagrams. | ★★★★★ | 34.0% | ||
| 183 | FinanceReasoning o | Financial reasoning | Financial reasoning benchmark evaluating quantitative and qualitative finance problem solving. | ★★★★★ | G Ling 1T | 87.5% | |
| 184 | FinanceAgent o | Agentic finance tasks | Interactive financial agent benchmark requiring multi-step tool use. | ★★★★★ | 55.3% | ||
| 185 | FinanceBench (FullDoc) o | Finance QA | FinanceBench full-document question answering benchmark requiring long-context financial understanding. | ★★★★★ | G Jamba Mini 1.6 | 45.4% | |
| 186 | FinSearchComp O | Financial retrieval | Financial search and comprehension benchmark measuring retrieval grounded reasoning over financial content. | ★★★★★ | 68.9% | ||
| 187 | FinSearchComp-CN O | Financial retrieval (Chinese) | Chinese financial search and comprehension benchmark measuring retrieval-grounded reasoning over regional financial content. | ★★★★★ | G doubao-1-5-vision-pro | 54.2% | |
| 188 | FinSearchComp (T2&T3) o | Finance search | Finance search competition tasks (tracks T2 and T3). | ★★★★★ | 64.5% | ||
| 189 | Flame-React-Eval o | Frontend coding | Front-end React coding tasks and evaluation. | ★★★★★ | 86.3% | ||
| 190 | Flores o | Machine translation (multilingual) | FLORES multilingual translation benchmark. | ★★★★★ | G EuroLLM-22B | 88.9% | |
| 191 | Fox-Page-cn o | Document OCR (Chinese) | Fox Page benchmark evaluating OCR accuracy and layout understanding on Chinese document pages. | ★★★★★ | ↓ 0.8% | ||
| 192 | Fox-Page-en o | Document OCR (English) | Fox Page benchmark evaluating OCR accuracy and layout understanding on English document pages. | ★★★★★ | ↓ 0.7% | ||
| 193 | FRAMES O | Retrieval-grounded reasoning | Factuality, retrieval, and reasoning measurement set for multi-hop, retrieval-grounded question answering. | ★★★★★ | 90.6% | ||
| 194 | FreshQA o | Recency QA | Question answering benchmark emphasizing up-to-date knowledge and recency. | ★★★★★ | 66.9% | ||
| 195 | FrontierScience o | Science reasoning | Frontier-level scientific reasoning and QA benchmark. | ★★★★★ | 25.2% | ||
| 196 | FSC-147 o | Few-shot counting | Few-shot counting benchmark across 147 categories. | ★★★★★ | G Seed1.8 | 33.8% | |
| 197 | FullStackBench O | Full-stack development | End-to-end web/app development tasks and evaluation. | ★★★★★ | 72.3% | ||
| 198 | GAIA O | General AI tasks | Comprehensive benchmark for agentic tasks. | ★★★★★ | G Seed1.8 | 87.4% | |
| 199 | GAIA 2 O | General agent tasks | Grounded agentic intelligence benchmark version 2 covering multi-tool tasks. | ★★★★★ | 42.1% | ||
| 200 | GAOKAO-Bench o | Chinese exams | GAOKAO benchmark measuring Chinese college entrance exam performance. | ★★★★★ | 94.5% | ||
| 201 | GDPVal O | General capability | GDPVal benchmark evaluating broad general capabilities of LLMs across diverse tasks. | ★★★★★ | 70.9% | ||
| 202 | General Tool Use O | Tool use | General tool-use benchmark covering web and API tasks. | ★★★★★ | 78.9% | ||
| 203 | GeoBench1 o | Geospatial reasoning | Geospatial visual QA and reasoning (set 1). | ★★★★★ | 79.7% | ||
| 204 | Global-MMLU O | Multi-domain knowledge (global) | Full Global-MMLU evaluation across diverse languages and regions. | ★★★★★ | 82.0% | ||
| 205 | Global-MMLU-Lite O | Multi-domain knowledge (global) | Lightweight global variant of MMLU covering diverse languages and regions. | ★★★★★ | 89.2% | ||
| 206 | Global PIQA o | Commonsense reasoning (multilingual) | Physical commonsense reasoning benchmark spanning 100 languages and diverse cultural contexts. | ★★★★★ | 93.4% | ||
| 207 | Gorilla Benchmark API Bench o | Tool use | Gorilla API Bench tool-use evaluation. | ★★★★★ | 35.3% | ||
| 208 | GPQA O | Graduate-level QA | Graduate-level question answering evaluating advanced reasoning. | ★★★★★ | 406 | 92.4% | |
| 209 | GPQA-diamond O | Graduate-level QA | Hard subset of GPQA (diamond level). | ★★★★★ | 92.9% | ||
| 210 | GRE Math maj@16 o | Math (standardized tests) | GRE quantitative section evaluated via majority voting over 16 samples. | ★★★★★ | 58.5% | ||
| 211 | Ground-UI-1K O | GUI grounding | Accuracy on the Ground-UI-1K grounding benchmark. | ★★★★★ | 85.4% | ||
| 212 | GSM-Infinite Hard (128K) o | Math reasoning | GSM-Infinite Hard benchmark at 128K context. | ★★★★★ | G MiMo V2 Flash Base | 29.0% | |
| 213 | GSM-Infinite Hard (16K) o | Math reasoning | GSM-Infinite Hard benchmark at 16K context. | ★★★★★ | 50.4% | ||
| 214 | GSM-Infinite Hard (32K) o | Math reasoning | GSM-Infinite Hard benchmark at 32K context. | ★★★★★ | 45.2% | ||
| 215 | GSM-Infinite Hard (64K) o | Math reasoning | GSM-Infinite Hard benchmark at 64K context. | ★★★★★ | 34.7% | ||
| 216 | GSM-Plus O | Math (grade-school, enhanced) | Enhanced GSM-style grade-school math benchmark variant. | ★★★★★ | 82.1% | ||
| 217 | GSM-Symbolic o | Math reasoning | Symbolic reasoning variant of GSM that tests algebraic manipulation and arithmetic with structured problems. | ★★★★★ | G Granite-4.0-H-Small | 87.4% | |
| 218 | GSM8K O | Math (grade-school) | Grade-school math word problems requiring multi-step reasoning. | ★★★★★ | 1322 | 97.3% | |
| 219 | GSM8K (DE) o | Math (grade-school, German) | German translation of the GSM8K grade-school math word problems. | ★★★★★ | 0.6% | ||
| 220 | GSM8K-Ko o | Math (grade-school, Korean) | Korean translation of the GSM8K grade-school math word problems. | ★★★★★ | 88.1% | ||
| 221 | GSM8K Platinum o | Math (grade-school, hard) | Harder subset/setting of GSM8K grade-school math problems. | ★★★★★ | 89.6% | ||
| 222 | GSO Benchmark O | Code generation | LiveCodeBench GSO benchmark. | ★★★★★ | 8.8% | ||
| 223 | HAE-RAE Bench o | Korean language understanding | Korean language understanding benchmark evaluating knowledge and reasoning. | ★★★★★ | G Kanana-1.5-32.5B-Base | 90.7% | |
| 224 | HallusionBench O | Multimodal hallucination | Benchmark for evaluating hallucination tendencies in multimodal LLMs. | ★★★★★ | 69.9% | ||
| 225 | HarmBench o | Safety | Harmfulness and safety compliance benchmark across a variety of risky prompts. | ★★★★★ | G IQuest-Coder-V1-40B-Thinking | 94.8% | |
| 226 | HarmfulQA o | Safety | Harmful question set testing models' ability to avoid unsafe answers. | ★★★★★ | 104 | G K2-THINK | 99.0% |
| 227 | HealthBench O | Medical QA | Comprehensive medical knowledge and clinical reasoning benchmark across specialties and tasks. | ★★★★★ | 67.2% | ||
| 228 | HealthBench-Hard o | Medical QA (hard) | Challenging subset of HealthBench focusing on complex, ambiguous clinical cases. | ★★★★★ | 46.2% | ||
| 229 | HealthBench-Hard Hallucinations o | Medical hallucination safety | Measures hallucination and unsafe medical advice under hard clinical scenarios. | ★★★★★ | ↓ 1.6% | ||
| 230 | HellaSwag O | Commonsense reasoning | Adversarial commonsense sentence completion. | ★★★★★ | 220 | 96.4% | |
| 231 | HellaSwag (DE) o | Commonsense reasoning (German) | German translation of the HellaSwag commonsense benchmark. | ★★★★★ | 0.7% | ||
| 232 | HELMET LongQA o | Long-context QA | Long-context subset of the HELMET benchmark focusing on grounded question answering. | ★★★★★ | G Jamba Mini 1.6 | 46.9% | |
| 233 | HeroBench O | Long-horizon planning | Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds. | ★★★★★ | 91.7% | ||
| 234 | HHEM v2.1 O | Hallucination detection | Hughes Hallucination Evaluation Model (Vectara); lower is better. | ★★★★★ | G AntGroup Finix_S1_32b | ↓ 0.6% | |
| 235 | HiddenMath O | Math reasoning | Mathematical reasoning benchmark referenced in recent model cards. | ★★★★★ | 65.2% | ||
| 236 | HLE O | Multi-domain reasoning | Challenging LLMs at the frontier of human knowledge. | ★★★★★ | 1085 | 45.8% | |
| 237 | HLE Overconfidence O | Overconfidence / safety | Overconfidence rate derived from Humanity's Last Exam evaluations. | ★★★★★ | ↓ 43.7% | ||
| 238 | HLE (Text Only) O | Advanced reasoning | Humanity's Last Exam benchmark restricted to text-only inputs. | ★★★★★ | 1085 | 45.8% | |
| 239 | HLE-VL o | Advanced reasoning (vision-language) | Vision-language variant of the Humanity's Last Exam (HLE) benchmark. | ★★★★★ | 36.0% | ||
| 240 | HLE (With Tools) O | Tool-augmented reasoning | Humanity's Last Exam benchmark evaluated with tool access. | ★★★★★ | 1085 | 50.2% | |
| 241 | HMMT o | Math (competition) | Harvard–MIT Mathematics Tournament problems. | ★★★★★ | 100.0% | ||
| 242 | HMMT 2025 O | Math (competition) | Harvard–MIT Mathematics Tournament 2025 problems. | ★★★★★ | 99.8% | ||
| 243 | HMMT Feb 2025 O | Math (competition) | Harvard–MIT Mathematics Tournament February 2025 problems. | ★★★★★ | 99.4% | ||
| 244 | HMMT Nov 2025 O | Math (competition) | Harvard–MIT Mathematics Tournament November 2025 problems. | ★★★★★ | 94.7% | ||
| 245 | HotpotQA o | Multi-hop QA | Explainable multi-hop QA with supporting facts. | ★★★★★ | 64.0% | ||
| 246 | HRBench 4K O | High-resolution image perception | High-resolution visual perception benchmark on 4K-resolution images. | ★★★★★ | 89.5% | ||
| 247 | HRBench 8K O | High-resolution image perception | High-resolution visual perception benchmark on 8K-resolution images. | ★★★★★ | 82.5% | ||
| 248 | HRM8K o | Korean reasoning | 8k-question Korean reasoning and knowledge benchmark. | ★★★★★ | 92.0% | ||
| 249 | HumanEval O | Code generation | Python synthesis problems evaluated by unit tests. | ★★★★★ | 2916 | 100.0% | |
| 250 | HumanEval+ O | Code generation | Extended HumanEval with more tests. | ★★★★★ | 1577 | 94.5% | |
| 251 | HumanEval-V o | Code generation (vision) | HumanEval variant with visual programming prompts. | ★★★★★ | G Step3-VL-10B | 66.0% | |
| 252 | HumanEval-X o | Code generation (multilingual) | Multilingual code generation benchmark extending HumanEval to multiple programming languages. | ★★★★★ | G TeleChat3-36B-Thinking | 92.7% | |
| 253 | Hypersim O | 3D scene understanding | Hypersim benchmark for synthetic indoor scene understanding and reconstruction. | ★★★★★ | 39.3% | ||
| 254 | IFBench O | Instruction following | Instruction-following benchmark measuring compliance and adherence. | ★★★★★ | 70 | 84.8% | |
| 255 | IFEval O | Instruction following | Instruction following capability evaluation for LLMs. | ★★★★★ | 36312 | 93.9% | |
| 256 | IFEval-Code o | Instruction following (code) | Instruction following evaluation for code generation tasks. | ★★★★★ | 28.0% | ||
| 257 | Image QA Average O | Image QA (aggregate) | Average of single-image visual question answering benchmarks. | ★★★★★ | 86.2% | ||
| 258 | IMO AnswerBench O | Math (competition) | Evaluates free-form solutions to International Mathematical Olympiad problems using expert-style grading rubrics. | ★★★★★ | 86.8% | ||
| 259 | INCLUDE O | Multilingual knowledge | Multilingual understanding benchmark built from regional exam questions across dozens of languages. | ★★★★★ | 83.9% | ||
| 260 | InfoQA O | Information-seeking QA | Information retrieval question answering benchmark evaluating factual responses. | ★★★★★ | 86.9% | ||
| 261 | Information Extraction o | Information extraction | Information extraction benchmark for economically valuable fields. | ★★★★★ | 46.9% | ||
| 262 | Information Processing o | Information processing | Information processing benchmark for economically valuable tasks. | ★★★★★ | 56.5% | ||
| 263 | InfoVQA O | Infographic VQA | Visual question answering over infographics requiring reading, counting, and reasoning. | ★★★★★ | 92.6% | ||
| 264 | Intention Recognition o | Intent recognition | Intent recognition benchmark for practical applications. | ★★★★★ | 65.3% | ||
| 265 | IntPhys 2 O | Intuitive physics | Intuitive physics reasoning benchmark. | ★★★★★ | 63.4% | ||
| 266 | Inverse IFEval o | Instruction following (inverse) | Inverse instruction-following evaluation. | ★★★★★ | 80.6% | ||
| 267 | ISL/OSL 8k/16k o | Throughput | Relative throughput on workloads with 8k input / 16k output sequence lengths (ISL/OSL). | ★★★★★ | 3.3% | ||
| 268 | JudgeMark v2.1 O | LLM judging ability | A benchmark measuring LLM judging ability. | ★★★★★ | 82.0% | ||
| 269 | KGC-Safety o | Safety (Korean) | Korean safety benchmark evaluating harmfulness and compliance. | ★★★★★ | G K-EXAONE | 96.1% | |
| 270 | KK-4 People o | Working memory (4 people) | Keep/kill working-memory benchmark with 4 people entities. | ★★★★★ | G K2-V2 | 92.9% | |
| 271 | KK-8 People o | Working memory (8 people) | Keep/kill working-memory benchmark with 8 people entities. | ★★★★★ | G K2-V2 | 82.8% | |
| 272 | KMMLU O | Korean knowledge | Korean Massive Multitask Language Understanding benchmark. | ★★★★★ | 78.7% | ||
| 273 | KMMLU-Pro O | Korean knowledge | Pro variant of the Korean Massive Multitask Language Understanding benchmark. | ★★★★★ | 77.5% | ||
| 274 | KMMLU-Redux O | Korean knowledge | Redux variant of the KMMLU benchmark. | ★★★★★ | 81.1% | ||
| 275 | Ko-LongBench o | Korean long-context | Long-context understanding benchmark in Korean. | ★★★★★ | 87.9% | ||
| 276 | KoBALT o | Korean knowledge | Korean benchmark for knowledge and language understanding. | ★★★★★ | 62.7% | ||
| 277 | KoMT-Bench o | Korean chat ability | Korean multi-turn chat evaluation benchmark. | ★★★★★ | 8.5% | ||
| 278 | KOR-Bench O | Reasoning | Comprehensive reasoning benchmark spanning diverse domains and cognitive skills. | ★★★★★ | 77.4% | ||
| 279 | KoSimpleQA o | Korean QA | Korean simple question answering benchmark. | ★★★★★ | G Kanana-2-30B-A3B-Mid-2601 | 49.7% | |
| 280 | KSM o | Korean math | Korean STEM and math benchmark. | ★★★★★ | G EXAONE Deep 2.4B | 60.9% | |
| 281 | LAMBADA O | Language modeling | Word prediction requiring broad context understanding. | ★★★★★ | 86.4% | ||
| 282 | LatentJailbreak o | Safety / jailbreak | Robustness to latent jailbreak adversarial techniques. | ★★★★★ | 39 | 77.4% | |
| 283 | LBV1-QA O | Vision-language | Vision-language QA benchmark v1. | ★★★★★ | 73.7% | ||
| 284 | LBV2 O | Vision-language | Vision-language benchmark v2. | ★★★★★ | 65.7% | ||
| 285 | LiveBench O | General capability | Continually updated capability benchmark across diverse tasks. | ★★★★★ | 82.4% | ||
| 286 | LiveCodeBench O | Code generation | Live coding and execution-based evaluation benchmark (v6 dataset). | ★★★★★ | 92.0% | ||
| 287 | LiveCodeBench-Ko o | Code generation (Korean) | Korean translation of LiveCodeBench. | ★★★★★ | 66.3% | ||
| 288 | LiveCodeBench Pro O | Competitive programming | LiveCodeBench Pro evaluates competitive programming performance across Codeforces, ICPC, and IOI contests. Elo rating, higher is better. | ★★★★★ | 2439 | ||
| 289 | LCB Pro 25Q2 (Easy) o | Code generation | LiveCodeBench Pro 2025 Q2 easy subset. | ★★★★★ | 68.9% | ||
| 290 | LCB Pro 25Q2 (Med) O | Code generation | LiveCodeBench Pro 2025 Q2 medium subset. | ★★★★★ | 35.4% | ||
| 291 | LiveCodeBench v3 O | Code generation | LiveCodeBench v3 snapshot measuring pass rates on streaming coding tasks. | ★★★★★ | 90.2% | ||
| 292 | LiveCodeBench v5 (2024.10-2025.02) O | Code generation | LiveCodeBench v5 snapshot covering Oct 2024-Feb 2025. | ★★★★★ | G IQuest-Coder-V1-40B-Loop-Thinking | 86.2% | |
| 293 | LiveMCP-101 O | Agent real-time eval | Real-time evaluation framework and benchmark that stress-tests agents on complex, real-world tasks. | ★★★★★ | 58.4% | ||
| 294 | LiveSports-3K o | Sports video | Live sports video understanding benchmark (3K). | ★★★★★ | G Seed1.8 | 77.5% | |
| 295 | LMArena Text O | Crowd eval (text) | Chatbot Arena text evaluation (average win rate). | ★★★★★ | 1455 | ||
| 296 | LMArena Vision O | Crowd eval (vision) | Chatbot Arena vision evaluation leaderboard (ELO ratings). | ★★★★★ | 1242 | ||
| 297 | LogicVista O | Visual logical reasoning | Visual logic and pattern reasoning tasks requiring compositional and spatial understanding. | ★★★★★ | 80.8% | ||
| 298 | LogiQA o | Logical reasoning | Reading comprehension with logical reasoning. | ★★★★★ | 138 | G Pythia 70M | 23.5% |
| 299 | LongBench o | Long-context eval | Long-context understanding across tasks. | ★★★★★ | 957 | G Jamba Mini 1.6 | 32.0% |
| 300 | longbench-v2 O | Long-context eval | Next-generation LongBench v2 long-context evaluation benchmark. | ★★★★★ | 68.2% | ||
| 301 | LongFact-Concepts o | Factuality / hallucination | Long-form factuality eval focused on conceptual statements; hallucination rate, lower is better. | ★★★★★ | ↓ 0.7% | ||
| 302 | LongFact-Objects o | Factuality / hallucination | Long-form factuality eval focused on object/entity references; hallucination rate, lower is better. | ★★★★★ | ↓ 0.8% | ||
| 303 | LongText-Bench EN o | Text rendering | LongText-Bench English subset score for text rendering. | ★★★★★ | G Seedream 4.5 | 1.0% | |
| 304 | LongText-Bench ZH o | Text rendering | LongText-Bench Chinese subset score for text rendering. | ★★★★★ | G Seedream 4.5 | 1.0% | |
| 305 | LongVideoBench o | Long video QA | Long video understanding and QA benchmark. | ★★★★★ | 79.8% | ||
| 306 | LPFQA o | Finance QA | Long-form financial question answering benchmark. | ★★★★★ | 54.4% | ||
| 307 | LVBench O | Video understanding | Long video understanding benchmark (LVBench). | ★★★★★ | 75.9% | ||
| 308 | M3GIA (CN) o | Chinese multimodal QA | Chinese-language M3GIA benchmark covering grounded multimodal question answering. | ★★★★★ | 91.2% | ||
| 309 | Machiavelli O | Deception / safety | Benchmark for deceptive or manipulative behavior in social interactions. | ★★★★★ | ↓ 52.2% | ||
| 310 | MakeMeSay o | Adversarial robustness | Adversarial benchmark testing model robustness against manipulation attempts. Lower is better. | ★★★★★ | |||
| 311 | Mantis O | Multimodal reasoning | Multimodal reasoning and instruction following benchmark (Mantis). | ★★★★★ | G dots.vlm1 | 86.2% | |
| 312 | MARS-Bench o | Instruction following | Instruction-following benchmark with complex tasks. | ★★★★★ | 80.8% | ||
| 313 | MASK O | Safety / red teaming | Model behavior safety assessment via red-teaming scenarios. | ★★★★★ | 95.3% | ||
| 314 | MATH O | Math (competition) | Competition-level mathematics across algebra, geometry, number theory, combinatorics. | ★★★★★ | 1185 | 97.9% | |
| 315 | MATH-Ko o | Math (Korean) | Korean translation of the MATH competition benchmark. | ★★★★★ | 58.2% | ||
| 316 | MATH Level 5 o | Math (competition) | Level 5 subset of the MATH benchmark emphasizing the hardest competition-style problems. | ★★★★★ | 73.6% | ||
| 317 | MATH500 O | Math reasoning | 500-problem slice of the MATH benchmark for challenging math reasoning. | ★★★★★ | G Motif-2-12.7B-Reasoning | 99.3% | |
| 318 | MATH500 (ES) o | Math (Spanish) | Spanish version of the MATH500 benchmark. | ★★★★★ | G EXAONE 4.0 1.2B | 88.8% | |
| 319 | MathArena Apex o | Challenging Math Contest problems | Challenging math contest problems from MathArena Apex benchmark. | ★★★★★ | 23.4% | ||
| 320 | MathVerse-mini O | Math reasoning (multimodal) | Compact MathVerse split focusing on single-image math puzzles and visual reasoning. | ★★★★★ | 85.0% | ||
| 321 | MathVerse-Vision O | Math reasoning (multimodal) | Multi-image visual mathematical reasoning tasks from the MathVerse ecosystem. | ★★★★★ | 84.1% | ||
| 322 | MathVision O | Math reasoning (multimodal) | Visual math reasoning benchmark with problems that combine images (charts, diagrams) and text. | ★★★★★ | 86.1% | ||
| 323 | MathVista O | Multimodal math reasoning | Visual math reasoning across diverse tasks. | ★★★★★ | 90.1% | ||
| 324 | MathVista-Mini O | Math reasoning (multimodal) | Lightweight subset of MathVista for quick evaluation of visual mathematical reasoning. | ★★★★★ | 85.8% | ||
| 325 | MBPP O | Code generation | Short Python problems with hidden tests. | ★★★★★ | 36312 | 97.4% | |
| 326 | MBPP-Ko o | Code generation (Korean) | Korean translation of MBPP code generation benchmark. | ★★★★★ | 66.8% | ||
| 327 | MBPP+ O | Code generation | Extended MBPP with more tests and stricter evaluation. | ★★★★★ | 94.2% | ||
| 328 | MCP-Atlas O | Agent evaluation | Aggregate MCP agent benchmark covering tool-use and planning tasks. | ★★★★★ | 62.3% | ||
| 329 | MCP Universe O | Agent evaluation | Benchmarks multi-step tool-use agents across diverse task suites with a unified overall success metric. | ★★★★★ | 50.7% | ||
| 330 | MCPMark O | Agent tool-use (MCP) | Benchmark for Model Context Protocol (MCP) agent tool-use. | ★★★★★ | 127 | 50.9% | |
| 331 | METR O | Long task benchmark | METR evaluates AI agents on long-horizon coding and agentic tasks, measuring autonomous task completion time. | ★★★★★ | 4.8% | ||
| 332 | MGSM O | Math (multilingual) | Multilingual grade school math word problems. | ★★★★★ | 94.4% | ||
| 333 | MIABench O | Multimodal instruction following | Multimodal instruction-following benchmark evaluating accuracy on complex image-text tasks. | ★★★★★ | 92.7% | ||
| 334 | NIAH-Multi 128K o | Long-context QA | Needle-in-a-haystack multi-query benchmark at 128K context. | ★★★★★ | 99.5% | ||
| 335 | NIAH-Multi 32K o | Long-context QA | Needle-in-a-haystack multi-query benchmark at 32K context. | ★★★★★ | 99.8% | ||
| 336 | NIAH-Multi 64K o | Long-context QA | Needle-in-a-haystack multi-query benchmark at 64K context. | ★★★★★ | 100.0% | ||
| 337 | MindCube O | Spatial navigation | Spatial navigation benchmark. | ★★★★★ | 78.3% | ||
| 338 | Minerva Math O | University-level math | Advanced quantitative reasoning set inspired by the Minerva benchmark for STEM problem solving. | ★★★★★ | 98.0% | ||
| 339 | MiniF2F pass@1 o | Math competition | MiniF2F competition benchmark pass@1 accuracy. | ★★★★★ | 50.0% | ||
| 340 | MiniF2F pass@32 o | Math competition | MiniF2F competition benchmark pass@32 accuracy. | ★★★★★ | 79.9% | ||
| 341 | MiniF2F (Test) o | Math competition | MiniF2F competition benchmark (test split). | ★★★★★ | 81.6% | ||
| 342 | MixEval o | Multi-task reasoning | Mixed-subject benchmark covering knowledge and reasoning tasks across domains. | ★★★★★ | 82.9% | ||
| 343 | MixEval Hard o | Multi-task reasoning (hard) | Hard subset of MixEval covering diverse reasoning tasks. | ★★★★★ | 31.6% | ||
| 344 | MLVU O | Large video understanding | MLVU: Large-scale multi-task benchmark for video understanding. | ★★★★★ | 86.2% | ||
| 345 | MM-BrowseComp o | Multimodal browsing | Multimodal browsing comprehension benchmark. | ★★★★★ | G Seed1.8 | 46.3% | |
| 346 | MM-IFEval o | Multimodal instruction following | Instruction-following benchmark assessing multimodal obedience to complex prompts. | ★★★★★ | 52.3% | ||
| 347 | MM-MT-Bench O | Multimodal instruction following | Multi-turn multimodal instruction following benchmark evaluating dialogue quality and helpfulness. | ★★★★★ | 8.5% | ||
| 348 | MMBench v1.1 (CN) O | Multimodal understanding (Chinese) | MMBench v1.1 Chinese subset for evaluating multimodal LLMs. | ★★★★★ | 91.3% | ||
| 349 | MMBench v1.1 (EN) O | Multimodal understanding (English) | MMBench v1.1 English subset for evaluating multimodal LLMs. | ★★★★★ | 93.3% | ||
| 350 | MMBench v1.1 (EN dev) O | General VQA | English dev split of MMBench v1.1 measuring multimodal question answering. | ★★★★★ | 90.6% | ||
| 351 | MME-CC o | Multimodal evaluation | MME-CC multimodal evaluation suite. | ★★★★★ | 56.9% | ||
| 352 | MME Elo o | Multimodal perception | Elo-style scoring for the MME multimodal evaluation benchmark. | ★★★★★ | InternVL3-2B | 2186.4 | |
| 353 | MME-RealWorld (cn) o | Real-world perception (CN) | MME-RealWorld Chinese split. | ★★★★★ | 58.5% | ||
| 354 | MME-RealWorld (en) o | Real-world perception (EN) | MME-RealWorld English split. | ★★★★★ | G MiMo-VL 7B-RL | 59.1% | |
| 355 | MMIU O | Multi-image understanding | Multi-image understanding benchmark evaluating cross-image reasoning. | ★★★★★ | 72.1% | ||
| 356 | MMLB-NIAH (128k) o | Multimodal long-context | MMLB-NIAH 128k long-context multimodal benchmark. | ★★★★★ | G Seed1.8 | 72.2% | |
| 357 | MMLB-VRAG (128k) o | Multimodal long-context | MMLB-VRAG 128k long-context multimodal benchmark. | ★★★★★ | 88.9% | ||
| 358 | MMLongBench-128K o | Long-context multimodal | 128K-context variant of MMLongBench evaluating multimodal long-context understanding. | ★★★★★ | 64.1% | ||
| 359 | MMLongBench-Doc O | Long-context multimodal documents | Evaluates long-context document understanding with mixed text, tables, and figures across multiple pages. | ★★★★★ | 56.2% | ||
| 360 | MMLU O | Multi-domain knowledge | 57 tasks spanning STEM, humanities, social sciences; broad knowledge and reasoning. | ★★★★★ | 1488 | 93.8% | |
| 361 | MMLU Arabic o | Arabic knowledge and reasoning | Arabic-language variant of MMLU evaluating knowledge and reasoning. | ★★★★★ | 74.1% | ||
| 362 | MMLU (cloze) o | Multi-domain knowledge (cloze) | Cloze-form MMLU evaluation variant. | ★★★★★ | 31.5% | ||
| 363 | Full Text MMLU o | Multi-domain knowledge (long-form) | Full-context MMLU variant evaluating reasoning over long passages. | ★★★★★ | 83.0% | ||
| 364 | MMLU-Pro O | Multi-domain knowledge | Harder successor to MMLU with more challenging questions. | ★★★★★ | 286 | 90.1% | |
| 365 | MMLU Pro MCF o | Multi-domain knowledge (few-shot) | MMLU-Pro common format (MCF) few-shot evaluation. | ★★★★★ | 41.1% | ||
| 366 | MMLU-ProX O | Multi-domain knowledge | Cross-lingual and robust variant of MMLU-Pro. | ★★★★★ | 81.0% | ||
| 367 | MMLU-Redux O | Multi-domain knowledge | Updated MMLU-style evaluation with revised questions and scoring. | ★★★★★ | 95.9% | ||
| 368 | MMLU-STEM O | STEM knowledge | STEM subset of MMLU. | ★★★★★ | 1488 | G Falcon-H1-34B-Instruct | 83.6% |
| 369 | MMMB o | Multilingual MMBench | Multilingual Multimodal Benchmark (MMMB) average score. | ★★★★★ | 77.0% | ||
| 370 | MMMLU O | Multi-domain knowledge (multilingual) | Massively multilingual MMLU-style evaluation across many languages. | ★★★★★ | 91.8% | ||
| 371 | MMMLU (ES) o | Multi-domain knowledge (Spanish) | Spanish subset of the MMMLU benchmark. | ★★★★★ | 64.7% | ||
| 372 | MMMU O | Multimodal understanding | Multi-discipline multimodal understanding benchmark. | ★★★★★ | 87.0% | ||
| 373 | MMMU PRO O | Multimodal understanding (hard) | Professional/advanced subset of MMMU for multimodal reasoning. | ★★★★★ | 81.2% | ||
| 374 | MMMU-Pro (vision) o | Multimodal understanding (vision) | MMMU-Pro vision-only setting. | ★★★★★ | 45.8% | ||
| 375 | MMSIBench (circular) o | Spatial understanding | MMSIBench circular subset for spatial reasoning. | ★★★★★ | 25.4% | ||
| 376 | MMStar O | Multimodal reasoning | Broad evaluation of multimodal LLMs across diverse tasks. | ★★★★★ | 83.1% | ||
| 377 | MMVP O | Multimodal video perception | Benchmark for multimodal video understanding and perception. | ★★★★★ | G Seed1.8 | 91.6% | |
| 378 | MMVU O | Video understanding | Multimodal video understanding benchmark (MMVU). | ★★★★★ | 80.8% | ||
| 379 | MotionBench O | Video motion understanding | Video motion and temporal reasoning benchmark. | ★★★★★ | G Seed1.8 | 70.6% | |
| 380 | OpenAI-MRCR (128k) O | Long-context reasoning | OpenAI MRCR (multi-round co-reference resolution) long-context benchmark at a 128k context window. | ★★★★★ | 89.7% | ||
| 381 | OpenAI-MRCR (1M) o | Long-context reasoning | OpenAI MRCR (multi-round co-reference resolution) long-context benchmark at a 1M context window. | ★★★★★ | 58.8% | ||
| 382 | MRCR v2 o | Multimodal reasoning | Multi-round multimodal chain-of-reasoning evaluation (v2). | ★★★★★ | 81.7% | ||
| 383 | MT-Bench O | Chat ability | Multi-turn chat evaluation via GPT-4 grading. | ★★★★★ | 39074 | 85.7% | |
| 384 | MTOB (full book) o | Long-form reasoning | Long-context book understanding benchmark (full-book setting). | ★★★★★ | 50.8% | ||
| 385 | MTOB (half book) o | Long-form reasoning | Long-context book understanding benchmark (half-book setting). | ★★★★★ | 54.0% | ||
| 386 | MUIRBENCH O | Multimodal robustness | Evaluates multimodal understanding robustness and reliability. | ★★★★★ | 86.1% | ||
| 387 | Multi-IF O | Instruction following (multi-task) | Composite instruction-following evaluation across multiple tasks. | ★★★★★ | 81.0% | ||
| 388 | Multi-IFEval O | Instruction following (multi-task) | Multi-task variant of instruction-following evaluation. | ★★★★★ | 88.7% | ||
| 389 | Multi-SWE-Bench O | Code repair (multi-repo) | Multi-repository SWE-Bench variant. | ★★★★★ | 246 | G MiniMax M2.1 | 49.4% |
| 390 | MultiChallenge O | Instruction following | Multi-domain instruction-following benchmark. | ★★★★★ | 69.6% | ||
| 391 | Multi-Image QA Average O | Multi-image QA (aggregate) | Aggregate score over multi-image visual question answering tasks. | ★★★★★ | 81.9% | ||
| 392 | Multilingual MMBench o | Multilingual vision benchmark | Multilingual MMBench average score across languages. | ★★★★★ | 65.9% | ||
| 393 | Multilingual MMLU O | Multi-domain knowledge (multilingual) | Multilingual variant of MMLU across many languages. | ★★★★★ | 87.3% | ||
| 394 | MultiPL-E O | Code generation (multilingual) | Multilingual code generation and execution benchmark across many programming languages. | ★★★★★ | 269 | 89.6% | |
| 395 | MultiPL-E HumanEval o | Code generation (multilingual) | MultiPL-E variant of HumanEval tasks. | ★★★★★ | 75.2% | ||
| 396 | MultiPL-E MBPP o | Code generation (multilingual) | MultiPL-E variant of MBPP tasks. | ★★★★★ | 65.7% | ||
| 397 | MuSR O | Reasoning | Multistep Soft Reasoning. | ★★★★★ | 70.4% | ||
| 398 | MVBench O | Video QA | Comprehensive multi-task video understanding and QA benchmark (MVBench). | ★★★★★ | 73.0% | ||
| 399 | Natural2Code o | Code generation | Natural language to code benchmark for instruction-following synthesis. | ★★★★★ | 92.9% | ||
| 400 | NaturalQuestions O | Open-domain QA | Google NQ; real user questions with long/short answers. | ★★★★★ | 40.1% | ||
| 401 | Nexus (0-shot) o | Tool use | Nexus tool-use benchmark, zero-shot setting. | ★★★★★ | 58.7% | ||
| 402 | Needle In A Haystack o | Long-context retrieval | Needle In A Haystack test for locating hidden facts in long contexts. | ★★★★★ | G MobileLLM P1 Base | 100.0% | |
| 403 | Objectron o | Object detection | Objectron benchmark for 3D object detection in video captures. | ★★★★★ | 71.2% | ||
| 404 | OBQA o | Open book QA | OpenBookQA science question answering benchmark. | ★★★★★ | 76.3% | ||
| 405 | OCRBench V2 O | OCR (vision text extraction) | OCRBench v2 evaluating text extraction from images and documents. | ★★★★★ | 858.0 | ||
| 406 | OCRBench-ELO o | OCR (ELO ranking) | OCR benchmark using ELO rating system to rank model performance on text extraction tasks. | ★★★★★ | 866 | ||
| 407 | OCRBenchV2 (CN) O | OCR (Chinese) | OCRBenchV2 Chinese subset assessing OCR performance on Chinese-language documents. | ★★★★★ | 63.7% | ||
| 408 | OCRBenchV2 (EN) O | OCR (English) | OCRBenchV2 English subset evaluating OCR accuracy on English documents and layouts. | ★★★★★ | 67.4% | ||
| 409 | OCRReasoning o | OCR reasoning | OCR reasoning benchmark combining text extraction with multi-step reasoning over documents. | ★★★★★ | 70.8% | ||
| 410 | OctoCodingBench o | Code generation | Coding benchmark across multi-language programming tasks. | ★★★★★ | 36.2% | ||
| 411 | ODinW-13 O | Object detection (in the wild) | Object Detection in the Wild benchmark covering 13 real-world domains. | ★★★★★ | 48.2% | ||
| 412 | Odyssey Math o | Math reasoning | Odyssey multi-step math benchmark. | ★★★★★ | G Mathstral 7B | 37.2% | |
| 413 | OIBench EN o | Code generation | English subset of OIBench for code generation. | ★★★★★ | 58.2% | ||
| 414 | OJBench O | Code generation (online judge) | Programming problems evaluated via online judge-style execution. | ★★★★★ | 68.5% | ||
| 415 | olmOCR-Bench O | Document OCR | olmOCR benchmark assessing OCR fidelity and structured extraction on complex document pages. | ★★★★★ | G Chandra OCR 0.1.0 | 83.1% | |
| 416 | OlympiadBench o | Math (olympiad) | Advanced mathematics olympiad-style problem benchmark. | ★★★★★ | 77.6% | ||
| 417 | OlympicArena o | Math (competition) | Olympiad-style mathematics reasoning benchmark. | ★★★★★ | 76.2% | ||
| 418 | OMEGA O | Math (advanced) | OMEGA olympiad-grade mathematics reasoning benchmark. | ★★★★★ | 50.8% | ||
| 419 | Omni-MATH O | Math reasoning | Omni-MATH benchmark covering diverse math reasoning tasks across difficulty levels. | ★★★★★ | G Ling 1T | 74.5% | |
| 420 | Omni-MATH-HARD O | Math | Challenging math benchmark (Omni-MATH-HARD). | ★★★★★ | 73.6% | ||
| 421 | OmniDocBench O | Document understanding | Document understanding benchmark covering multi-page layouts, tables, and charts for robust question answering. | ★★★★★ | G Gundam-M | ↓ 12.3% | |
| 422 | OmniDocBench 1.5 O | OCR | Document understanding benchmark v1.5 with OCR evaluation. Overall Edit Distance metric, lower is better. | ★★★★★ | ↓ 0.1% | ||
| 423 | OmniDocBench-CN O | Document understanding (Chinese) | Chinese subset of OmniDocBench focusing on OCR-grounded document comprehension and reasoning. | ★★★★★ | G PPStructure v3 | ↓ 13.6% | |
| 424 | OmniMMI o | Multimodal interaction | OmniMMI benchmark for multimodal interaction across video streams. | ★★★★★ | G Seed1.8 | 53.0% | |
| 425 | OmniSpatial o | Spatial reasoning | Spatial understanding and reasoning benchmark (OmniSpatial). | ★★★★★ | 52.0% | ||
| 426 | OneIG-Bench EN O | Text-to-image | OneIG-Bench English subset score for text-to-image generation (0–1 scale). | ★★★★★ | G Nano Banana 2.0 | 0.6 | |
| 427 | OneIG-Bench ZH O | Text-to-image | OneIG-Bench Chinese subset score for text-to-image generation (0–1 scale). | ★★★★★ | G Nano Banana 2.0 | 0.6 | |
| 428 | Online-Mind2web o | Web automation | Online web automation and task execution benchmark. | ★★★★★ | G Seed1.8 | 85.9% | |
| 429 | Open Rewrite o | Instruction following | Rewrite benchmark assessing open-ended editing and directive-following quality. | ★★★★★ | G MobileLLM P1 | 51.0% | |
| 430 | OpenBookQA O | Science QA | Open-book multiple choice science questions with supporting facts. | ★★★★★ | 128 | 96.6% | |
| 431 | OpenRewrite-Eval o | Rewrite quality | OpenRewrite evaluation scored with micro-averaged ROUGE-L (see the ROUGE-L sketch after the table). | ★★★★★ | 46.9% | ||
| 432 | OptMATH o | Math optimization reasoning | OptMATH benchmark targeting challenging math optimization and problem-solving tasks. | ★★★★★ | G Ling 1T | 57.7% | |
| 433 | Order 15 Items o | List ordering | Ordering benchmark requiring models to sequence 15 items correctly. | ★★★★★ | G K2-V2 | 87.6% | |
| 434 | Order 30 Items o | List ordering (long) | Ordering benchmark requiring models to sequence 30 items correctly. | ★★★★★ | G K2-V2 | 40.3% | |
| 435 | OSWorld O | GUI agents | Agentic GUI task completion and grounding on desktop environments. | ★★★★★ | 66.3% | ||
| 436 | OSWorld-G O | GUI agents | OSWorld-G center accuracy (no_refusal). | ★★★★★ | 71.8% | ||
| 437 | OSWorld2 o | GUI agents | Second-generation OSWorld GUI agent benchmark. | ★★★★★ | 35.8% | ||
| 438 | OVBench o | Open-vocabulary streaming | Open-vocabulary benchmark for streaming video understanding. | ★★★★★ | G Seed1.8 | 65.1% | |
| 439 | OVOBench o | Streaming video QA | Streaming video QA benchmark with open-vocabulary queries. | ★★★★★ | G Seed1.8 | 72.6% | |
| 440 | PaperBench Code-Dev o | Code understanding | Code-development variant of PaperBench that grades implementation code without requiring full experiment reproduction. | ★★★★★ | 43.3% | ||
| 441 | PaperBench o | Research paper replication | Benchmark evaluating whether agents can replicate AI research papers end to end. | ★★★★★ | 72.9% | ||
| 442 | PHYBench o | Physics reasoning | Physics reasoning and calculation benchmark. | ★★★★★ | 59.0% | ||
| 443 | PhyX o | Physics reasoning (multimodal) | Multimodal physics reasoning benchmark (PhyX). | ★★★★★ | G Step3-VL-10B | 59.5% | |
| 444 | PIQA O | Physical commonsense | Physical commonsense about everyday tasks and object affordances. | ★★★★★ | 87.1% | ||
| 445 | PixmoCount O | Visual counting | Counting objects/instances in images (PixmoCount). | ★★★★★ | G Eagle2.5-8B | 90.2% | |
| 446 | Point-Bench o | Pointing and counting | Benchmark for pointing and counting objects in images. | ★★★★★ | 85.5% | ||
| 447 | PolyMATH O | Math reasoning | Polyglot mathematics benchmark assessing cross-topic math reasoning. | ★★★★★ | 65.1% | ||
| 448 | POPE o | Hallucination detection | Vision-language hallucination benchmark focusing on object existence verification. | ★★★★★ | 🇨🇳 InternVL3-2B | 90.1% | |
| 449 | PopQA O | Knowledge / QA | Open-domain popular culture question answering benchmark testing long-tail factual recall. | ★★★★★ | 55.7% | ||
| 450 | PostTrainBench o | Post-training automation | Measures how well AI agents can post-train base LLMs under fixed compute/time constraints; average score across AIME 2025, BFCL, GPQA Main, GSM8K, and HumanEval. | ★★★★★ | 34.9% | ||
| 451 | ProofBench Advanced o | Mathematical proofs (advanced) | Advanced mathematical proof benchmark covering complex theorem proving tasks. | ★★★★★ | 65.7% | ||
| 452 | ProofBench Basic o | Mathematical proofs | Entry-level mathematical proof benchmarking set. | ★★★★★ | 99.0% | ||
| 453 | ProtocolQA o | Protocol understanding and QA | Protocol question answering benchmark evaluating understanding of scientific protocols and procedures. | ★★★★★ | 79.0% | ||
| 454 | QuAC o | Conversational QA | Question answering in context. | ★★★★★ | 53.6% | ||
| 455 | QuALITY o | Long-context reading comprehension | Long-document multiple-choice reading comprehension benchmark. | ★★★★★ | 48.8% | ||
| 456 | RACE o | Reading comprehension | English exams for middle and high school. | ★★★★★ | 88.0% | ||
| 457 | Random Complex Tasks o | Agentic tasks (random) | Randomly constructed complex task environments for agent generalization. | ★★★★★ | 35.8% | ||
| 458 | Realbench o | Web browsing | Real-world browsing and QA benchmark. | ★★★★★ | G Seed1.8 | 49.1% | |
| 459 | RealWorldQA O | Real-world visual QA | Visual question answering with real-world images and scenarios. | ★★★★★ | 82.8% | ||
| 460 | Ref-L4 (test) o | Referring expressions | Ref-L4 referring expression comprehension on the test split. | ★★★★★ | 88.9% | ||
| 461 | RefCOCO O | Referring expressions | RefCOCO average accuracy at IoU 0.5 (val). | ★★★★★ | 🇨🇳 InternVL3.5-4B | 92.4% | |
| 462 | RefCOCOg o | Referring expressions | RefCOCOg average accuracy at IoU 0.5 (val). | ★★★★★ | G Moondream-9B-A2B | 88.6% | |
| 463 | RefCOCO+ o | Referring expressions | RefCOCO+ accuracy at IoU 0.5 on the val split (see the IoU sketch after the table). | ★★★★★ | G Moondream-9B-A2B | 81.8% | |
| 464 | RefSpatialBench O | Spatial reasoning | Reference spatial understanding benchmark covering spatial grounding tasks. | ★★★★★ | 72.1% | ||
| 465 | RefusalBench o | Safety / refusal | Safety-oriented refusal and policy adherence benchmark. | ★★★★★ | 72.3% | ||
| 466 | ReMI o | Multimodal reasoning | Reasoning over multimodal inputs (ReMI). | ★★★★★ | G Step3-VL-10B | 67.3% | |
| 467 | RepoBench O | Code understanding | Repository-level code comprehension and reasoning benchmark. | ★★★★★ | 83.8% | ||
| 468 | RoboSpatialHome O | Embodied spatial understanding | RoboSpatialHome benchmark for embodied spatial reasoning in domestic environments. | ★★★★★ | 73.9% | ||
| 469 | Roo Code Evals O | Code assistant eval | Community-maintained coding evals and leaderboard by Roo Code. | ★★★★★ | 99.0% | ||
| 470 | RULER-100 @1M o | Long-context eval | RULER-100 evaluation at a 1M context window. | ★★★★★ | 86.3% | ||
| 471 | RULER-100 @256k o | Long-context eval | RULER-100 evaluation at a 256k context window. | ★★★★★ | 92.9% | ||
| 472 | RULER-100 @512k o | Long-context eval | RULER-100 evaluation at a 512k context window. | ★★★★★ | 91.3% | ||
| 473 | Ruler 128k o | Long-context eval | RULER benchmark at 128k context window. | ★★★★★ | 95.4% | ||
| 474 | Ruler 16k o | Long-context eval | RULER benchmark at 16k context window. | ★★★★★ | 92.2% | ||
| 475 | Ruler 1M o | Long-context eval | RULER benchmark at 1M context window. | ★★★★★ | 94.8% | ||
| 476 | Ruler 32k o | Long-context eval | RULER benchmark at 32k context window. | ★★★★★ | 96.0% | ||
| 477 | Ruler 4k o | Long-context eval | RULER benchmark at 4k context window. | ★★★★★ | 96.0% | ||
| 478 | Ruler 64k o | Long-context eval | RULER benchmark at 64k context window. | ★★★★★ | 87.5% | ||
| 479 | Ruler 8k o | Long-context eval | RULER benchmark at 8k context window. | ★★★★★ | 93.8% | ||
| 480 | RW Search o | Agentic search | Real-world search benchmark evaluating retrieval and reasoning. | ★★★★★ | 82.0% | ||
| 481 | SALAD-Bench o | Safety alignment | Safety Alignment and Dangerous-behavior benchmark evaluating harmful assistance and refusal consistency. | ★★★★★ | G Granite-4.0-H-Micro | ↓ 96.8% | |
| 482 | Scale AI Multi Challenge o | Chat & instruction following | Scale AI Multi Challenge crowd-evaluated instruction following benchmark. | ★★★★★ | 44.8% | ||
| 483 | SciCode (sub) O | Code | SciCode subset score (sub). | ★★★★★ | 56.1% | ||
| 484 | SciCode (main) O | Code | SciCode main score. | ★★★★★ | 15.4% | ||
| 485 | ScienceQA O | Science QA (multimodal) | Multiple-choice science questions with images, diagrams, and text context. | ★★★★★ | G FastVLM-7B | 96.7% | |
| 486 | SciQ o | Science QA | Multiple choice science questions. | ★★★★★ | G Pythia 12B | 92.9% | |
| 487 | SciRes FrontierMath Tier 1-3 o | Math (frontier) | SciRes FrontierMath benchmark covering tiers 1-3. | ★★★★★ | 40.3% | ||
| 488 | SciRes FrontierMath Tier 4 o | Math (frontier) | SciRes FrontierMath benchmark covering tier 4. | ★★★★★ | 18.8% | ||
| 489 | ScreenQA Complex O | GUI QA | Complex ScreenQA benchmark accuracy. | ★★★★★ | 87.1% | ||
| 490 | ScreenQA Short O | GUI QA | Short-form ScreenQA benchmark accuracy. | ★★★★★ | 91.9% | ||
| 491 | ScreenSpot O | Screen UI locators | Center accuracy on ScreenSpot. | ★★★★★ | 95.8% | ||
| 492 | ScreenSpot-Pro O | Screen UI locators | Average center accuracy on ScreenSpot-Pro. | ★★★★★ | 86.3% | ||
| 493 | ScreenSpot-v2 O | Screen UI locators | Center accuracy on ScreenSpot-v2. | ★★★★★ | G UI-Venus 72B | 95.3% | |
| 494 | SEAL-0 O | Agentic web search | Evaluation of multi-step browsing agents on search, evidence gathering, and synthesis tasks. | ★★★★★ | 57.4% | ||
| 495 | SEED-Bench-2-Plus O | Multimodal evaluation | SEED-Bench-2-Plus overall accuracy. | ★★★★★ | 72.9% | ||
| 496 | SEED-Bench-Img O | Multimodal image understanding | SEED-Bench image-only subset (SEED-Bench-Img). | ★★★★★ | G Bagel 14B | 78.5% | |
| 497 | SEED-Bench o | Multimodal evaluation | SEED-Bench comprehensive multimodal understanding benchmark evaluating generative comprehension across multiple dimensions. | ★★★★★ | 76.5% | ||
| 498 | SFE o | Multimodal reasoning | Structured factual evaluation for multimodal models. | ★★★★★ | 61.9% | ||
| 499 | Showdown O | GUI agents | Success rate on the Showdown UI interaction benchmark. | ★★★★★ | 76.8% | ||
| 500 | SIFO o | Instruction following | Single-turn instruction following benchmark. | ★★★★★ | 66.9% | ||
| 501 | SIFO Multiturn o | Instruction following | Multi-turn SIFO benchmark for sustained instruction adherence. | ★★★★★ | 60.3% | ||
| 502 | SimpleQA O | QA | Simple question answering benchmark. | ★★★★★ | 97.1% | ||
| 503 | SimpleQA Verified o | QA | Verified SimpleQA variant for parametric knowledge accuracy. | ★★★★★ | 72.1% | ||
| 504 | SimpleVQA O | General VQA | Lightweight visual question answering set with everyday scenes. | ★★★★★ | 71.2% | ||
| 505 | SimpleVQA-DS o | General VQA | SimpleVQA variant curated by DeepSeek with everyday image question answering tasks. | ★★★★★ | 61.3% | ||
| 506 | Social Interaction QA (SIQA) o | Social commonsense QA | Social Interaction QA benchmark evaluating social commonsense and situational reasoning. | ★★★★★ | 54.9% | ||
| 507 | SocialIQA o | Social commonsense | Social interaction commonsense QA. | ★★★★★ | 54.9% | ||
| 508 | SpatialViz O | Mental visualization | Mental visualization benchmark. | ★★★★★ | 65.8% | ||
| 509 | Spider O | Text-to-SQL | Complex text-to-SQL benchmark over cross-domain databases. | ★★★★★ | G LLaDA2.0 Flash | 82.5% | |
| 510 | Spiral-Bench O | Safety / sycophancy | A LLM-judged benchmark measuring sycophancy and delusion reinforcement. | ★★★★★ | 87.0% | ||
| 511 | SQuAD v1.1 o | Reading comprehension | Extractive QA from Wikipedia articles. | ★★★★★ | 566 | 89.3% | |
| 512 | SQuAD v2.0 o | Reading comprehension | Like v1.1 with unanswerable questions. | ★★★★★ | 566 | G LLaDA2.0 Flash | 90.0% |
| 513 | StreamingBench o | Streaming video | Streaming video understanding benchmark. | ★★★★★ | G Seed1.8 | 84.4% | |
| 514 | SUNRGBD O | 3D scene understanding | SUN RGB-D benchmark for indoor scene understanding from RGB-D imagery. | ★★★★★ | 45.8% | ||
| 515 | SuperGPQA O | Graduate-level QA | Harder GPQA variant assessing advanced graduate-level reasoning. | ★★★★★ | 75.3% | ||
| 516 | SWE-Bench O | Code repair | Software engineering benchmark of real GitHub issues drawn from many repositories. | ★★★★★ | 3442 | 74.5% | |
| 517 | SWE-Bench Multilingual O | Code repair (multilingual) | Multilingual variant of SWE-Bench for issue fixing. | ★★★★★ | 77.5% | ||
| 518 | SWE-Bench (OpenHands) o | Code repair | SWE-Bench results using the OpenHands autonomous coding agent. | ★★★★★ | 3442 | 38.8% | |
| 519 | SWE-Bench Pro o | Software engineering | Full SWE-Bench Pro benchmark for software-engineering agents. | ★★★★★ | 55.6% | ||
| 520 | SWE-Bench Pro (Public) o | Software engineering | Public subset of the SWE-Bench Pro benchmark for software-engineering agents. | ★★★★★ | 23.3% | ||
| 521 | SWE-Bench Verified O | Code repair | Verified subset of SWE-Bench for issue fixing. | ★★★★★ | 80.9% | ||
| 522 | SWE-Dev o | Code repair | Software engineering development and bug fixing benchmark. | ★★★★★ | 67.1% | ||
| 523 | SWE-Lancer o | Code repair (freelance tasks) | Software engineering benchmark using real freelance-style issues. | ★★★★★ | 79.9% | ||
| 524 | SWE-Lancer Diamond o | Code repair (freelance) | Diamond subset of SWE-Lancer focusing on the hardest freelance-style issues. | ★★★★★ | 32.6% | ||
| 525 | SWE-Perf o | Code repair | Software engineering benchmark focused on performance-oriented fixes. | ★★★★★ | 6.5% | ||
| 526 | SWE-Review o | Code review | Software engineering review benchmark for assessing code review quality. | ★★★★★ | 16.2% | ||
| 527 | SWT-Bench o | Test generation | Benchmark for generating tests that reproduce real GitHub issues (software testing counterpart to SWE-Bench). | ★★★★★ | 80.7% | ||
| 528 | SysBench o | System prompts | System prompt understanding and adherence benchmark. | ★★★★★ | 74.1% | ||
| 529 | TAU1-Airline O | Agent tasks (airline) | Tool-augmented agent evaluation in airline scenarios (TAU1). | ★★★★★ | G openPangu-R-72B-2512 Slow Thinking | 56.0% | |
| 530 | TAU1-Retail O | Agent tasks (retail) | Tool-augmented agent evaluation in retail scenarios (TAU1). | ★★★★★ | G openPangu-R-72B-2512 Slow Thinking | 73.0% | |
| 531 | TAU2-Airline O | Agent tasks (airline) | Tool-augmented agent evaluation in airline scenarios (TAU2). | ★★★★★ | 76.5% | ||
| 532 | TAU2-Retail O | Agent tasks (retail) | Tool-augmented agent evaluation in retail scenarios (TAU2). | ★★★★★ | 88.9% | ||
| 533 | TAU2-Telecom O | Agent tasks (telecom) | Tool-augmented agent evaluation in telecom scenarios (TAU2). | ★★★★★ | 99.3% | ||
| 534 | TempCompass o | Temporal reasoning | Temporal reasoning benchmark evaluating understanding of time-related concepts in videos and images. | ★★★★★ | 88.0% | ||
| 535 | Terminal-Bench O | Agent terminal tasks | Command-line task completion benchmark for agents. | ★★★★★ | 637 | 61.3% | |
| 536 | Terminal-Bench 2.0 O | Agent terminal tasks | Second-generation Terminal-Bench leaderboard for end-to-end terminal agents. | ★★★★★ | G IQuest-Coder-V1-40B-Loop-Instruct | 81.4% | |
| 537 | Terminal-Bench Hard O | Agent terminal tasks | Hard subset of Terminal-Bench command-line agent tasks. | ★★★★★ | 43.0% | ||
| 538 | Terminal-Bench Terminus o | Agent terminal tasks | Terminal-Bench Terminus track assessing end-to-end terminal tool use. | ★★★★★ | 30.3% | ||
| 539 | TextQuests O | Text-based video games | Text-based video game benchmark. | ★★★★★ | 41.0% | ||
| 540 | TextQuests Harm O | Harmful propensities | Harmfulness evaluation on TextQuests scenarios. | ★★★★★ | ↓ 9.1% | ||
| 541 | TextVQA O | Text-based VQA | Visual question answering that requires reading text in images. | ★★★★★ | G PLM-8B | 86.5% | |
| 542 | TIIF-Bench Long O | Text-to-image | TIIF-Bench long prompt score for text-to-image generation. | ★★★★★ | G Seedream 4.5 | 88.5% | |
| 543 | TIIF-Bench Short O | Text-to-image | TIIF-Bench short prompt score for text-to-image generation. | ★★★★★ | G Nano Banana 2.0 | 91.0% | |
| 544 | TLDR9+ o | Summarization | Extreme summarization benchmark built from Reddit posts paired with short TLDR summaries. | ★★★★★ | G MobileLLM P1 | 16.8% | |
| 545 | TOMATO o | Temporal understanding | Temporal ordering and motion analysis benchmark (TOMATO). | ★★★★★ | G Seed1.8 | 60.8% | |
| 546 | Tool-Decathlon o | Agent tool-use | Composite tool-use suite measuring multi-domain tool invocation success (Pass@1; see the pass@k sketch after the table). | ★★★★★ | 38.6% | ||
| 547 | Toolathlon O | Agentic software tasks | Long-horizon, real-world software tool-use tasks. | ★★★★★ | 49.4% | ||
| 548 | TreeBench o | Reasoning with tree structures | Evaluates hierarchical/tree-structured reasoning and planning capabilities in LLMs/VLMs. | ★★★★★ | 51.4% | ||
| 549 | TriQA o | Knowledge QA | Triadic question answering benchmark evaluating world knowledge and reasoning. | ★★★★★ | 82.2% | ||
| 550 | TriviaQA O | Open-domain QA | Open-domain question answering benchmark built from trivia and web evidence. | ★★★★★ | 85.5% | ||
| 551 | TriviaQA-Wiki o | Open-domain QA | TriviaQA subset answering using Wikipedia evidence. | ★★★★★ | 91.8% | ||
| 552 | TrustLLM o | Safety / reliability | TrustLLM benchmark for trustworthiness and safety behaviors. | ★★★★★ | 88.4% | ||
| 553 | TruthfulQA O | Truthfulness / hallucination | Measures whether a model imitates human falsehoods (truthfulness). | ★★★★★ | G SOLAR-10.7B-Instruct-v1.0 | 71.4% | |
| 554 | TruthfulQA (DE) o | Truthfulness / hallucination (German) | German translation of the TruthfulQA benchmark (scored on a 0–1 scale). | ★★★★★ | 0.2 | ||
| 555 | TVBench o | TV comprehension | Benchmark for TV show video comprehension and QA. | ★★★★★ | G Seed1.8 | 71.5% | |
| 556 | TydiQA o | Cross-lingual QA | Typologically diverse QA across languages. | ★★★★★ | 313 | 34.3% | |
| 557 | U-Artifacts o | Agentic coding artifacts | Benchmark focusing on generated code artifacts quality. | ★★★★★ | 57.8% | ||
| 558 | V* O | Multimodal reasoning | V* benchmark accuracy. | ★★★★★ | 86.4% | ||
| 559 | VCRBench o | Visual commonsense reasoning | Visual commonsense reasoning benchmark. | ★★★★★ | G Seed1.8 | 59.8% | |
| 560 | VCT O | Virology capability (protocol troubleshooting) | Virology Capabilities Test: a benchmark that measures an LLM's ability to troubleshoot complex virology laboratory protocols. | ★★★★★ | 100.0% | ||
| 561 | Vending-Bench 2 O | Long-horizon agentic tasks | Long-horizon agentic benchmark simulating management of a vending-machine business; scored by final net worth rather than a percentage. | ★★★★★ | 5478.2 | ||
| 562 | Vibe Android o | Vibe evaluation (Android) | Vibe evaluation on Android tasks. | ★★★★★ | 92.2% | ||
| 563 | Vibe Average o | Vibe evaluation | Aggregate Vibe evaluation score. | ★★★★★ | G MiniMax M2.1 | 88.6% | |
| 564 | Vibe Backend o | Vibe evaluation (backend) | Vibe evaluation on backend tasks. | ★★★★★ | 98.0% | ||
| 565 | Vibe iOS o | Vibe evaluation (iOS) | Vibe evaluation on iOS tasks. | ★★★★★ | 90.0% | ||
| 566 | Vibe Simulation o | Vibe evaluation (simulation) | Vibe evaluation on simulation tasks. | ★★★★★ | 89.2% | ||
| 567 | Vibe Web o | Vibe evaluation (web) | Vibe evaluation on web tasks. | ★★★★★ | G MiniMax M2.1 | 91.5% | |
| 568 | VibeEval O | Aesthetic/visual quality | VLM aesthetic evaluation with GPT scores. | ★★★★★ | 76.4% | ||
| 569 | Video-MME O | Video understanding (multimodal) | Multimodal evaluation of video understanding and reasoning. | ★★★★★ | 76.6% | ||
| 570 | VideoHolmes o | Video QA | Video question answering benchmark focused on detective-style clues. | ★★★★★ | G Seed1.8 | 65.5% | |
| 571 | VideoMME o | Multimodal video evaluation | Video multimodal evaluation suite (VideoMME). | ★★★★★ | 88.4% | ||
| 572 | VideoMME (w/o sub) O | Video understanding | Video understanding benchmark without subtitles. | ★★★★★ | 85.1% | ||
| 573 | VideoMME (w/sub) o | Video understanding | Video understanding benchmark with subtitles. | ★★★★★ | 80.7% | ||
| 574 | VideoMMMU O | Multimodal video understanding | Video-based extension of MMMU evaluating temporal multimodal reasoning and perception across disciplines. | ★★★★★ | 87.6% | ||
| 575 | VideoReasonBench o | Video reasoning | Video reasoning benchmark assessing temporal and causal understanding. | ★★★★★ | 59.7% | ||
| 576 | VideoSimpleQA o | Video QA | Simple question answering over short videos. | ★★★★★ | 71.9% | ||
| 577 | ViSpeak o | Video dialogue | Video-grounded dialogue and description benchmark. | ★★★★★ | 89.0% | ||
| 578 | VisualPuzzle o | Visual reasoning | Visual puzzle solving benchmark evaluating reasoning and pattern recognition capabilities. | ★★★★★ | 57.8% | ||
| 579 | VisualWebBench O | Web UI understanding | Average accuracy on VisualWebBench. | ★★★★★ | 83.8% | ||
| 580 | VisuLogic O | Visual logical reasoning | Logical reasoning and compositionality benchmark for visual-language models. | ★★★★★ | 52.5% | ||
| 581 | VitaBench O | Industry QA | Industry-focused benchmark evaluating domain QA performance. | ★★★★★ | 56.3% | ||
| 582 | VL-RewardBench o | Reward modeling (VL) | Reward alignment benchmark for VLMs. | ★★★★★ | 67.4% | ||
| 583 | VLMs are Biased o | Multimodal bias | Evaluates whether VLMs truly 'see' vs. relying on memorized knowledge; measures bias toward non-visual priors. | ★★★★★ | 90 | 20.2% | |
| 584 | VLMs are Blind O | Visual grounding robustness | Evaluates failure modes of VLMs in grounding and perception tasks. | ★★★★★ | G MiMo-VL 7B-RL | 79.4% | |
| 585 | VLMsAreBiased o | Multimodal bias | Benchmark evaluating biases in vision-language models. | ★★★★★ | G Seed1.8 | 62.0% | |
| 586 | VLMsAreBlind o | Multimodal robustness | Benchmark probing robustness of vision-language models to visual perturbations. | ★★★★★ | 97.5% | ||
| 587 | VoiceBench AdvBench O | VoiceBench | VoiceBench adversarial safety evaluation. | ★★★★★ | 99.4% | ||
| 588 | VoiceBench AlpacaEval o | VoiceBench | VoiceBench evaluation on AlpacaEval instructions. | ★★★★★ | 96.8% | ||
| 589 | VoiceBench BBH o | VoiceBench | VoiceBench evaluation on Big-Bench Hard prompts. | ★★★★★ | 92.6% | ||
| 590 | VoiceBench CommonEval O | VoiceBench | VoiceBench evaluation on CommonEval. | ★★★★★ | 91.0% | ||
| 591 | VoiceBench IFEval o | VoiceBench | VoiceBench instruction-following evaluation (IFEval). | ★★★★★ | 85.7% | ||
| 592 | MMAU v05.15.25 o | Audio reasoning | Audio reasoning benchmark MMAU v05.15.25. | ★★★★★ | 77.6% | ||
| 593 | VoiceBench MMSU O | VoiceBench | VoiceBench MMSU benchmark (voice modality). | ★★★★★ | 84.3% | ||
| 594 | VoiceBench MMSU (Audio) o | Audio reasoning | Audio reasoning MMSU results. | ★★★★★ | 77.7% | ||
| 595 | VoiceBench OpenBookQA o | VoiceBench | VoiceBench results on OpenBookQA prompts. | ★★★★★ | 95.0% | ||
| 596 | VoiceBench SD-QA O | VoiceBench | VoiceBench Spoken Dialogue QA results. | ★★★★★ | 90.1% | ||
| 597 | VoiceBench WildVoice O | VoiceBench | VoiceBench evaluation on WildVoice dataset. | ★★★★★ | 93.4% | ||
| 598 | VPCT o | Multimodal reasoning | Visual perception and comprehension test. | ★★★★★ | 90.0% | ||
| 599 | VQAv2 O | Visual question answering | Standard Visual Question Answering v2 benchmark on natural images. | ★★★★★ | 87.0% | ||
| 600 | VSI-Bench O | Spatial intelligence | Visual spatial intelligence benchmark covering 3D reasoning and spatial inference tasks. | ★★★★★ | 63.2% | ||
| 601 | WebClick O | GUI agents | Task success on the WebClick UI agent benchmark. | ★★★★★ | 93.0% | ||
| 602 | WebDev Arena O | Web development agents | Arena evaluation for autonomous web development agents. | ★★★★★ | 1483 | ||
| 603 | WebQuest-MultiQA o | Web agents | Multi-question web search and interaction tasks. | ★★★★★ | 60.6% | ||
| 604 | WebQuest-SingleQA o | Web agents | Single-question web search and interaction tasks. | ★★★★★ | 79.5% | ||
| 605 | WebSrc O | Web QA | Webpage question answering scored with SQuAD-style token F1 (see the F1 sketch after the table). | ★★★★★ | 97.2% | ||
| 606 | WebVoyager o | Web agents | Web navigation and interaction tasks for LLM agents. | ★★★★★ | 81.0% | ||
| 607 | WebVoyager2 o | Web agents | Web navigation and interaction tasks for LLM agents (v2). | ★★★★★ | 84.4% | ||
| 608 | WebWalkerQA o | Web agents | WebWalker tasks evaluating autonomous browsing question answering performance. | ★★★★★ | 72.2% | ||
| 609 | WeMath o | Math reasoning | Math reasoning benchmark spanning diverse curricula and difficulty levels. | ★★★★★ | 69.8% | ||
| 610 | WideSearch o | Web search | Wide web search and QA benchmark. | ★★★★★ | 76.2% | ||
| 611 | Wild-Jailbreak o | Safety / jailbreak | Adversarial jailbreak benchmark evaluating refusal robustness. | ★★★★★ | 98.2% | ||
| 612 | WildBench V2 o | Instruction following | WildBench V2 human preference benchmark for instruction following and helpfulness. | ★★★★★ | 65.3% | ||
| 613 | WildGuardTest o | Safety | WildGuardTest safety benchmark. | ★★★★★ | G IQuest-Coder-V1-40B-Thinking | 86.8% | |
| 614 | Winogender o | Gender bias (coreference) | Coreference resolution dataset for measuring gender bias. | ★★★★★ | 84.3% | ||
| 615 | WinoGrande O | Coreference reasoning | Large-scale adversarial Winograd Schema-style pronoun resolution. | ★★★★★ | 99 | 90.3% | |
| 616 | WinoGrande (DE) o | Coreference reasoning (German) | German translation of the WinoGrande pronoun resolution benchmark (scored on a 0–1 scale). | ★★★★★ | 0.8 | ||
| 617 | WMDP Bio o | Biosecurity knowledge | Weapons of Mass Destruction Proxy benchmark for biosecurity, measuring hazardous biological knowledge without info hazards. | ★★★★★ | ↓ 63.7% | ||
| 618 | WMDP Chem o | Chemical security knowledge | WMDP benchmark for chemical security, evaluating knowledge relevant to chemical weapons development. | ★★★★★ | ↓ 45.8% | ||
| 619 | WMDP Cyber o | Cybersecurity knowledge | WMDP benchmark for cybersecurity, assessing knowledge that could aid in cyber weapons development. | ★★★★★ | ↓ 44.0% | ||
| 620 | WMT16 En–De o | Machine translation | WMT16 English–German translation benchmark (news). | ★★★★★ | 38.8% | ||
| 621 | WMT16 En–De (Instruct) o | Machine translation | Instruction-tuned evaluation on the WMT16 English–German translation set. | ★★★★★ | 37.9% | ||
| 622 | WMT24++ O | Machine translation | Extended WMT 2024 evaluation across multiple language pairs. | ★★★★★ | 94.7% | ||
| 623 | WorldTravel2 (multi-modal) o | Travel planning (multimodal) | WorldTravel2 benchmark multimodal track. | ★★★★★ | 47.2% | ||
| 624 | WorldTravel2 (text) o | Travel planning (text) | WorldTravel2 benchmark text-only track. | ★★★★★ | 56.4% | ||
| 625 | WorldVQA o | World knowledge VQA | Visual question answering requiring world knowledge and commonsense reasoning. | ★★★★★ | 47.4% | ||
| 626 | WritingBench O | Writing quality | General-purpose writing quality benchmark. | ★★★★★ | 88.3% | ||
| 627 | WSC O | Coreference reasoning | Classic Winograd Schema Challenge measuring commonsense coreference. | ★★★★★ | 91.9% | ||
| 628 | xBench-DeepSearch O | Agentic research | Evaluates multi-hop deep research workflows on xBench DeepSearch tasks. | ★★★★★ | 77.9% | ||
| 629 | XpertBench (Edu) o | Economics/education | XpertBench education domain subset. | ★★★★★ | 56.9% | ||
| 630 | XpertBench (Fin) o | Economics/finance | XpertBench finance domain subset. | ★★★★★ | 64.5% | ||
| 631 | XpertBench (Humanities) o | Economics/humanities | XpertBench humanities domain subset. | ★★★★★ | 68.5% | ||
| 632 | XpertBench (Law) o | Economics/legal | XpertBench legal domain subset. | ★★★★★ | 58.7% | ||
| 633 | XpertBench (Research) o | Economics/research | XpertBench research domain subset. | ★★★★★ | 48.2% | ||
| 634 | XSTest o | Safety | XSTest safety benchmark. | ★★★★★ | G IQuest-Coder-V1-40B-Thinking | 94.3% | |
| 635 | ZebraLogic O | Logical reasoning | Logical reasoning benchmark assessing complex pattern and rule inference. | ★★★★★ | 96.1% | ||
| 636 | ZeroBench O | Zero-shot generalization | Evaluates zero-shot performance across diverse tasks without task-specific finetuning. | ★★★★★ | 23.4% | ||
| 637 | ZeroBench (sub) O | Zero-shot generalization | Subset of ZeroBench targeting harder zero-shot reasoning cases. | ★★★★★ | 33.8% | ||
| 638 | ZeroSCROLLS BookSumSort o | Long-context summarization | ZeroSCROLLS split based on BookSumSort long-form summarization. | ★★★★★ | 60.5% | ||
| 639 | ZeroSCROLLS GovReport o | Long-context summarization | ZeroSCROLLS split based on the GovReport summarization benchmark. | ★★★★★ | G CoLT5 | 41.0% | |
| 640 | ZeroSCROLLS MuSiQue o | Long-context reasoning | ZeroSCROLLS split derived from MuSiQue multi-hop QA. | ★★★★★ | 52.2% | ||
| 641 | ZeroSCROLLS NarrativeQA o | Long-context QA | ZeroSCROLLS split based on the NarrativeQA reading comprehension benchmark. | ★★★★★ | 32.6% | ||
| 642 | ZeroSCROLLS Qasper o | Long-context QA | ZeroSCROLLS split based on the Qasper paper QA benchmark. | ★★★★★ | G FLAN-UL2 | 56.9% | |
| 643 | ZeroSCROLLS QMSum o | Long-context summarization | ZeroSCROLLS split based on the QMSum meeting summarization benchmark. | ★★★★★ | G CoLT5 | 22.5% | |
| 644 | ZeroSCROLLS QuALITY o | Long-context QA | ZeroSCROLLS split based on the QuALITY reading comprehension benchmark. | ★★★★★ | 89.2% | ||
| 645 | ZeroSCROLLS SpaceDigest o | Long-context summarization | ZeroSCROLLS SpaceDigest extractive summarization task. | ★★★★★ | 77.9% | ||
| 646 | ZeroSCROLLS SQuALITY o | Long-context summarization | ZeroSCROLLS split based on the SQuALITY long-form summarization benchmark. | ★★★★★ | 22.6% | ||
| 647 | ZeroSCROLLS SummScreenFD o | Long-context summarization | ZeroSCROLLS split based on the SummScreenFD summarization benchmark. | ★★★★★ | G CoLT5 | 20.0% |
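
Several entries above report rating-style rather than percentage scores (e.g. OCRBench-ELO and WebDev Arena). As a rough illustration of how such leaderboards are built, here is a minimal Python sketch of Elo updates from pairwise comparisons; the K-factor of 32, the 1000-point starting rating, and the match list are illustrative assumptions, not values taken from either benchmark.

```python
# Minimal Elo-style rating update from pairwise model comparisons (illustrative only).
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def update_elo(ratings, winner, loser, k=32.0):
    """Apply one pairwise result: winner beat loser."""
    e_w = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e_w)
    ratings[loser] -= k * (1.0 - e_w)

if __name__ == "__main__":
    ratings = defaultdict(lambda: 1000.0)  # assumed starting rating
    matches = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]
    for winner, loser in matches:
        update_elo(ratings, winner, loser)
    for name, r in sorted(ratings.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {r:.1f}")
```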
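OmniDocBench 1.5 and the other ↓ rows use lower-is-better metrics. For the edit-distance case, here is a minimal sketch of a normalized Levenshtein distance; the exact normalization and text preparation the benchmark uses are assumptions on my part.

```python
# Normalized Levenshtein edit distance: 0.0 is a perfect match, lower is better.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer string's length (assumed normalization)."""
    if not prediction and not reference:
        return 0.0
    return levenshtein(prediction, reference) / max(len(prediction), len(reference))

print(normalized_edit_distance("OmniDocBench v1.5", "0mniDocBench v1.5"))
```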
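OpenRewrite-Eval reports micro-averaged ROUGE-L. The sketch below shows LCS-based precision and recall pooled over a corpus; whitespace tokenization and this particular pooling are assumptions about the setup, not the benchmark's official scoring script.

```python
# Micro-averaged ROUGE-L: pool LCS and token counts over the corpus before computing F1.
def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            curr.append(prev[j - 1] + 1 if xi == yj else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def micro_rouge_l(predictions, references):
    """Micro-averaged ROUGE-L F1 over paired prediction/reference strings."""
    lcs_sum = pred_tokens = ref_tokens = 0
    for pred, ref in zip(predictions, references):
        p, r = pred.split(), ref.split()
        lcs_sum += lcs_length(p, r)
        pred_tokens += len(p)
        ref_tokens += len(r)
    precision = lcs_sum / pred_tokens if pred_tokens else 0.0
    recall = lcs_sum / ref_tokens if ref_tokens else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(micro_rouge_l(["the cat sat on the mat"], ["the cat lay on the mat"]))
```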
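The RefCOCO family reports accuracy at IoU 0.5: a prediction counts as correct when its box overlaps the reference box with an intersection-over-union of at least 0.5. A minimal sketch, assuming (x1, y1, x2, y2) boxes and illustrative example coordinates:

```python
# Accuracy at IoU 0.5 for box grounding (referring-expression style evaluation).
def box_iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of predicted boxes whose IoU with the reference meets the threshold."""
    hits = sum(1 for p, g in zip(predictions, ground_truths) if box_iou(p, g) >= threshold)
    return hits / len(ground_truths) if ground_truths else 0.0

preds = [(10, 10, 50, 50), (0, 0, 20, 20)]   # illustrative predictions
gts   = [(12, 12, 52, 52), (30, 30, 60, 60)]  # illustrative references
print(accuracy_at_iou(preds, gts))  # 0.5: one hit, one miss
```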
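Tool-Decathlon and several of the coding rows report Pass@1. A common way to compute such scores is the unbiased pass@k estimator popularized by the HumanEval paper; the per-task sample counts below are illustrative assumptions, not figures from any benchmark run.

```python
# Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k), averaged over tasks.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k given n samples per task with c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(per_task_results, k=1):
    """Average pass@k over tasks; each entry is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in per_task_results) / len(per_task_results)

# Three tasks, 5 samples each, with 5, 2, and 0 correct completions (assumed data).
print(mean_pass_at_k([(5, 5), (5, 2), (5, 0)], k=1))  # (1.0 + 0.4 + 0.0) / 3
```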
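WebSrc is scored with SQuAD-style token F1. A simplified sketch follows; the official SQuAD script also strips punctuation and articles before comparing token bags, which is omitted here.

```python
# SQuAD-style token F1 between a predicted answer span and a gold answer (simplified).
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between prediction and gold answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Golden Gate Bridge", "Golden Gate Bridge"))  # ≈ 0.857
```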