
Furukama's Blog

Ben Koehler - Founder, Speaker, Coder

Fu — Benchmark of Benchmarks

Fu-Benchmark is a meta-benchmark of the most influential evaluation suites used to measure and rank large language models. Use the search box to filter by name, topic, or model.
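For readers who want to slice the table offline rather than through the page's search box, here is a minimal sketch of the same name/topic/model filter in Python. The record layout and the `search` helper are illustrative assumptions, not an official Fu-Benchmark export format; the two sample rows are copied from the table below.

```python
# Minimal sketch: filter benchmark rows by name, topic, or leading model.
# Field names are illustrative assumptions; sample values come from the
# table below, not from any published Fu-Benchmark API.
benchmarks = [
    {"name": "GSM8K", "topic": "Math (grade-school)",
     "leader": "Kimi K2 Instruct", "top_score": "97.3%"},
    {"name": "MMLU", "topic": "Multi-domain knowledge",
     "leader": "GPT-5", "top_score": "93.5%"},
]

def search(rows, query):
    """Return rows whose name, topic, or leader contains the query (case-insensitive)."""
    q = query.lower()
    return [r for r in rows
            if q in r["name"].lower()
            or q in r["topic"].lower()
            or q in r["leader"].lower()]

print(search(benchmarks, "math"))  # matches the GSM8K row via its topic
```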

Benchmarks
Scores marked ↓ come from metrics where lower is better; entries without a percent sign (e.g. Codeforces, LMArena) are Elo-style ratings rather than percentages.

| # | Name | Topic | Description | Relevance | GitHub ★ | Leader | Top score |
|---|---|---|---|---|---|---|---|
| 1 | A12D | Diagram reasoning | A12D diagram reasoning benchmark for measuring multimodal understanding of annotated diagrams. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 90.9% |
| 2 | AA-Index | Multi-domain QA | Comprehensive QA index across diverse domains. | ★★★★ |  | 🇺🇸 Grok 4 | 73.2% |
| 3 | AA-LCR | Long-context reasoning | A challenging benchmark measuring language models' ability to extract, reason about, and synthesize information from long-form documents ranging from 10k to 100k tokens. | ★★★★★ |  | 🇺🇸 GPT-5 | 75.6% |
| 4 | ACP-Bench Bool | Safety evaluation (boolean) | Safety and behavior evaluation with yes/no questions. | ★★★★★ |  | 🇨🇳 Qwen3-32B | 85.1% |
| 5 | ACP-Bench MCQ | Safety evaluation (MCQ) | Safety and behavior evaluation with multiple-choice questions. | ★★★★★ |  | 🇺🇸 Llama 3.3 70B | 82.1% |
| 6 | AgentDojo | Agent evaluation | Interactive evaluation suite for autonomous agents across tools and tasks. | ★★★★ |  | 🇺🇸 Claude 3.7 Sonnet | 88.7% |
| 7 | AGIEval | Exams | Academic and professional exam benchmark. | ★★★★★ |  | 🇺🇸 Llama 3.1 405B Base | 71.6% |
| 8 | AI2D | Diagram understanding (VQA) | Visual question answering over science and diagram images. | ★★★★★ |  | 🇺🇸 Molmo-72B | 96.3% |
| 9 | Aider Code Editing | Code editing | Measures interactive code editing quality within the Aider assistant workflow. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 89.8% |
| 10 | Aider-Polyglot | Code assistant eval | Aider polyglot coding leaderboard. | ★★★★★ |  | 🇺🇸 GPT-5 | 88.0% |
| 11 | AIME 2024 | Math (competition) | American Invitational Mathematics Examination 2024 problems. | ★★★★★ |  | 🇺🇸 GPT-OSS 120B | 96.6% |
| 12 | AIME 2025 | Math (competition) | American Invitational Mathematics Examination 2025 problems. | ★★★★ |  | 🇺🇸 GPT-5 pro | 96.7% |
| 13 | AIME25 | Math (competition) | American Invitational Mathematics Examination 2025 benchmark (set AIME25). | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Thinking | 83.1% |
| 14 | All-Angles Bench | Spatial perception | All-Angles benchmark for spatial recognition and 3D perception. | ★★★★ |  | 🇨🇳 GLM-4.5V | 56.9% |
| 15 | AlpacaEval | Instruction following | Automatic eval using GPT-4 as a judge. | ★★★★★ | 1849 | 🇨🇳 Qwen3-32B | 64.2% |
| 16 | AlpacaEval 2.0 | Instruction following | Updated AlpacaEval with improved prompts and judging. | ★★★★★ |  | 🇨🇳 DeepSeek R1 | 87.6% |
| 17 | AMC-23 | Math (competition) | American Mathematics Competition 2023 evaluation. | ★★★★ |  | G QwQ-32B | 98.5% |
| 18 | AndroidWorld | Mobile agents | Benchmark for agents operating Android apps via UI automation. | ★★★★ |  | 🇨🇳 GLM-4.5V | 57.0% |
| 19 | API-Bank | Tool use | API-Bank tool-use benchmark. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 92.0% |
| 20 | ARC-AGI-1 | General reasoning | ARC-AGI Phase 1 aggregate accuracy. | ★★★★★ |  | 🇺🇸 o3 | 75.7% |
| 21 | ARC-AGI-2 | General reasoning | ARC-AGI Phase 2 aggregate accuracy. | ★★★★★ |  | 🇺🇸 GPT-5 pro | 18.3% |
| 22 | ARC Average | Science QA (average) | Average accuracy across ARC-Easy and ARC-Challenge. | ★★★★ |  | 🇺🇸 SmolLM2 1.7B Pretrained | 60.5% |
| 23 | ARC-Challenge | Science QA | Hard subset of AI2 Reasoning Challenge; grade-school science. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 96.9% |
| 24 | ARC-Challenge (DE) | Science QA (German) | German translation of the ARC Challenge benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.7% |
| 25 | ARC-Easy | Science QA | Easier subset of AI2 Reasoning Challenge. | ★★★★ |  | 🇺🇸 Gemma 3 PT 27B | 89.0% |
| 26 | ARC-Easy (DE) | Science QA (German) | German translation of the ARC Easy science QA benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.8% |
| 27 | Arena-Hard | Chat ability | Hard prompts on Chatbot Arena. | ★★★★★ | 920 | 🇫🇷 Mistral Medium 3 | 97.1% |
| 28 | Arena-Hard V2 | Chat ability | Updated Arena-Hard v2 prompts on Chatbot Arena. | ★★★★★ | 920 | 🇨🇳 Qwen3 MoE-2507 | 88.2% |
| 29 | ARKitScenes | 3D scene understanding | ARKitScenes benchmark for assessing 3D scene reconstruction and understanding from mixed reality captures. | ★★★★ |  | 🇨🇳 Qwen2.5-VL 72B Instruct | 61.5% |
| 30 | ART Agent Red Teaming | Agent robustness | Evaluation suite for adversarial red-teaming of autonomous AI agents. | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 (Thinking) | ↓ 40.0% |
| 31 | ArtifactsBench | Agentic coding | Artifacts-focused coding and tool-use benchmark evaluating generated code artifacts. | ★★★★★ |  | 🇺🇸 GPT-5 | 72.5% |
| 32 | AstaBench | Agent evaluation | Evaluates science agents across literature understanding, data analysis, planning, tool use, coding, and search. | ★★★★★ |  | 🇺🇸 Claude Sonnet 4 | 53.0% |
| 33 | AttaQ | Safety / jailbreak | Adversarial jailbreak suite measuring refusal robustness against targeted attack prompts. | ★★★★★ |  | G Granite 3.3 8B Instruct | 88.5% |
| 34 | AutoCodeBench | Autonomous coding | End-to-end autonomous coding benchmark with unit-test based execution across diverse repositories and tasks. | ★★★★ |  | 🇺🇸 Claude Opus 4 (Thinking) | 52.4% |
| 35 | AutoCodeBench-Lite | Autonomous coding | Lite version of AutoCodeBench focusing on smaller tasks with the same end-to-end, unit-test-based evaluation. | ★★★★ |  | 🇺🇸 Claude Opus 4 | 64.5% |
| 36 | BALROG | Agent evaluation (games) | Benchmark for agentic LLM/VLM reasoning in long-horizon game environments. | ★★★★★ |  | 🇺🇸 Grok 4 | 43.6% |
| 37 | BBH | Multi-task reasoning | Hard subset of BIG-bench with diverse reasoning tasks. | ★★★★ | 510 | 🇨🇳 ERNIE 4.5 424B A47B | 94.3% |
| 38 | BBQ | Bias evaluation | Bias Benchmark for Question Answering evaluating social biases across contexts. | ★★★★ |  | 🇫🇷 Mixtral 8x7B | 56.0% |
| 39 | BFCL | Function calling | Berkeley Function-Calling Leaderboard measuring tool- and function-call accuracy. | ★★★★ |  | 🇨🇳 Qwen3-4B | 95.0% |
| 40 | BFCL Live v2 | Function calling (live) | Live, user-contributed function-calling tasks from the Berkeley Function-Calling Leaderboard (Live v2). | ★★★★ |  | 🇺🇸 o1 Mini | 81.0% |
| 41 | BFCL v3 | Function calling | Berkeley Function-Calling Leaderboard v3, adding multi-turn and multi-step function calling. | ★★★★★ |  | 🇨🇳 GLM 4.5 | 77.8% |
| 42 | BIG-Bench | Multi-task reasoning | BIG-bench overall performance (original). | ★★★★★ | 3110 | 🇺🇸 Gemma 2 7B | 55.1% |
| 43 | BIG-Bench Extra Hard | Multi-task reasoning | Extra hard subset of BIG-bench tasks. | ★★★★★ |  | G Ling 1T | 47.3% |
| 44 | BigCodeBench | Code generation | BigCodeBench evaluates large language models on practical code generation tasks with unit-test verification. | ★★★★★ |  | 🇺🇸 GPT-4o-2024-05-13 | 56.1% |
| 45 | BigCodeBench Hard | Code generation (hard) | Harder variant of BigCodeBench testing complex programming and library tasks with function-level code generation. | ★★★★★ |  | 🇺🇸 Claude 3.7 Sonnet (2025-02-19) | 35.8% |
| 46 | BLINK | Multimodal grounding | Evaluates visual-language grounding and reference resolution to reduce hallucinations. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 72.4% |
| 47 | BoB-HVR | Composite capability index | Hard, Versatile, and Relevant composite score across eight capability buckets. | ★★★★★ |  | 🇺🇸 Llama 3 70B | 9.0% |
| 48 | BOLD | Bias evaluation | Bias in Open-ended Language Dataset probing demographic biases in text generation. | ★★★★ |  | 🇫🇷 Mixtral 8x7B | ↓ 0.1% |
| 49 | BoolQ | Reading comprehension | Yes/no QA from naturally occurring questions. | ★★★★★ | 171 | 🇺🇸 Gemma 2 27B | 84.8% |
| 50 | BrowseComp | Web browsing | Web browsing comprehension and competence benchmark. | ★★★★ |  | 🇺🇸 GPT-5 | 54.9% |
| 51 | BrowseComp_zh | Web browsing (Chinese) | Chinese variant of the BrowseComp web browsing benchmark. | ★★★★ |  | 🇺🇸 o3 | 58.1% |
| 52 | BRuMo25 | Math (competition) | BRuMo 2025 olympiad-style mathematics benchmark. | ★★★★ |  | 🇺🇸 QuestA Nemotron 1.5B | 69.5% |
| 53 | BuzzBench | Humor analysis | A humour analysis benchmark. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 71.1% |
| 54 | C-Eval | Chinese exams | Comprehensive Chinese exam benchmark across multiple subjects. | ★★★★ |  | 🇨🇳 Kimi-K2 Base | 92.5% |
| 55 | C3-Bench | Reasoning (Chinese) | Comprehensive Chinese reasoning capability benchmark. | ★★★★ | 35 | 🇨🇳 GLM-4.5 Base | 83.1% |
| 56 | CaseLaw v2 | Legal reasoning | U.S. case law benchmark evaluating legal reasoning and judgment over court opinions. | ★★★★★ |  | 🇺🇸 GPT-4.1 | 78.1% |
| 57 | CC-OCR | OCR (cross-lingual) | Cross-lingual OCR benchmark evaluating character recognition across mixed-language documents. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 81.5% |
| 58 | CFEval | Coding ELO / contest eval | Contest-style coding evaluation with ELO-like scoring. | ★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 2134 |
| 59 | Charades-STA | Video grounding | Charades-STA temporal grounding (mIoU). | ★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 64.0% |
| 60 | ChartMuseum | Chart understanding | Large-scale curated collection of charts for evaluating parsing, grounding, and reasoning. | ★★★★ |  | 🇺🇸 GPT-5 mini | 63.3% |
| 61 | ChartQA | Chart understanding (VQA) | Visual question answering over charts and plots. | ★★★★★ |  | G MiMo-VL 7B-SFT | 92.9% |
| 62 | ChartQA-Pro | Chart understanding (VQA) | Professional-grade chart question answering with diverse chart types and complex reasoning. | ★★★★ |  | 🇨🇳 GLM-4.5V | 64.0% |
| 63 | CharXiv (DQ) | Chart description (PDF) | Scientific chart/table descriptive questions from arXiv PDFs. | ★★★★★ |  | 🇺🇸 o3-high | 95.0% |
| 64 | CharXiv (RQ) | Chart reasoning (PDF) | Scientific chart/table reasoning questions from arXiv PDFs. | ★★★★★ |  | 🇺🇸 GPT-5 | 81.1% |
| 65 | Chinese SimpleQA | QA (Chinese) | Chinese variant of the SimpleQA benchmark. | ★★★★ |  | 🇨🇳 Kimi-K2 Base | 77.6% |
| 66 | CLUEWSC | Coreference reasoning (Chinese) | Chinese Winograd Schema-style coreference benchmark from CLUE. | ★★★★ |  | 🇨🇳 DeepSeek R1 | 92.8% |
| 67 | CMath | Math (Chinese) | Chinese mathematics benchmark. | ★★★★ |  | 🇨🇳 ERNIE 4.5 424B A47B | 96.7% |
| 68 | CMMLU | Chinese multi-domain | Chinese counterpart to MMLU. | ★★★★★ | 781 | 🇨🇳 Qwen2.5 Max | 91.9% |
| 69 | Codeforces | Competitive programming | Competitive programming performance on Codeforces problems (ELO). | ★★★★★ |  | 🇺🇸 o4 mini | 2719 |
| 70 | COLLIE | Instruction following | Comprehensive instruction-following evaluation suite. | ★★★★ | 55 | 🇺🇸 GPT-5 | 99.0% |
| 71 | CommonsenseQA | Commonsense QA | Multiple-choice QA requiring commonsense knowledge. | ★★★★ |  | 🇺🇸 Llama 3.1 405B Base | 85.8% |
| 72 | CountBench | Visual counting | Object counting and numeracy benchmark for visual-language models across varied scenes. | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 93.7% |
| 73 | CountBenchQA | Visual counting QA | Visual question answering benchmark focused on counting objects across varied scenes. | ★★★★ |  | G Moondream-9B-A2B | 93.2% |
| 74 | CRAG | Retrieval QA | Complex Retrieval-Augmented Generation benchmark for grounded question answering. | ★★★★ |  | G Jamba Mini 1.6 | 76.2% |
| 75 | Creative Story-Writing Benchmark V3 | Creative writing | Story writing benchmark evaluating creativity, coherence, and style (v3). | ★★★★★ | 291 | 🇨🇳 Kimi-K2-Instruct-0905 | 8.7% |
| 76 | Longform Creative Writing | Creative writing | Longform creative writing evaluation (EQ-Bench). | ★★★★★ | 20 | 🇨🇳 DeepSeek V3 0528 | 78.9% |
| 77 | Creative Writing v3 | Creative writing | An LLM-judged creative writing benchmark. | ★★★★ | 54 | 🇺🇸 o3 | 1661 |
| 78 | CRUX-I | Code reasoning | CRUXEval input-prediction split: infer a function's input from its output. | ★★★★ |  | 🇺🇸 GPT-4 Turbo (2024-04-09) CoT | 75.7% |
| 79 | CRUX-O | Code reasoning | CRUXEval output-prediction split: infer a function's output from its input. | ★★★★★ |  | 🇺🇸 GPT-4 0613 CoT | 88.2% |
| 80 | CruxEval | Code reasoning | Code reasoning challenge set from the CruxEval benchmark (predicting function inputs and outputs). | ★★★★ |  | 🇨🇳 Qwen3-32B | 78.5% |
| 81 | CV-Bench | Computer vision QA | Diverse CV tasks for VLMs. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 89.7% |
| 82 | DeepMind Mathematics | Math reasoning | Synthetic math problem sets from DeepMind covering arithmetic, algebra, calculus, and more. | ★★★★★ |  | G Granite-4.0-H-Small | 59.3% |
| 83 | Design2Code | Coding (UI) | Translating UI designs into code. | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 93.4% |
| 84 | DesignArena | Generative design | Leaderboard tracking generative design systems across layout, branding, and marketing tasks. | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 (Thinking) | 1410 |
| 85 | DetailBench | Spot small mistakes | Evaluates whether LLMs can notice subtle errors and minor inconsistencies in text. | ★★★★ |  | 🇺🇸 Llama 4 Maverick | 8.7% |
| 86 | DocVQA | Document understanding (VQA) | Visual question answering over scanned documents. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 96.9% |
| 87 | DROP | Reading + reasoning | Discrete reasoning over paragraphs (addition, counting, comparisons). | ★★★★★ |  | 🇨🇳 DeepSeek R1 | 92.2% |
| 88 | DynaMath | Math reasoning (video) | Dynamic/video-based mathematical reasoning evaluating temporal and visual understanding. | ★★★★ |  | 🇺🇸 GPT-4o | 63.7% |
| 89 | Economically important tasks | Industry QA (cross-domain) | Evaluation suite of real-world, economically impactful tasks across key industries and workflows. | ★★★★ |  | 🇺🇸 GPT-5 | 47.1% |
| 90 | EgoSchema | Egocentric video QA | EgoSchema validation accuracy. | ★★★★ |  | 🇨🇳 Qwen2-VL 72B Instruct | 77.9% |
| 91 | EmbSpatialBench | Spatial understanding | Embodied spatial understanding benchmark evaluating navigation and localization. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 84.3% |
| 92 | Enterprise RAG | Retrieval-augmented generation | Enterprise retrieval-augmented generation evaluation covering internal knowledge bases. | ★★★★ |  | 🇺🇸 Apriel Nemotron 15B Thinker | 69.2% |
| 93 | EQ-Bench | Emotional intelligence | Benchmark assessing emotional intelligence in dialogue (EQ-Bench). | ★★★★★ | 352 | G Jan v1 2509 | 85.0% |
| 94 | EQ-Bench 3 | Emotional intelligence (roleplay) | A benchmark measuring emotional intelligence in challenging roleplays, judged by Sonnet 3.7. | ★★★★ | 21 | 🇨🇳 Kimi K2 Instruct | 1555 |
| 95 | ERQA | Spatial reasoning | Spatial recognition and reasoning QA benchmark (ERQA). | ★★★★ |  | 🇺🇸 GPT-5 | 65.7% |
| 96 | EvalPerf | Code evaluation performance | Measures performance of LLM code evaluation, including runtime, memory, and efficiency metrics. | ★★★★ |  | 🇺🇸 GPT-4o (2024-08-06) | 100.0% |
| 97 | EvalPlus | Code generation | Aggregated code evaluation suite from EvalPlus. | ★★★★★ | 1577 | 🇺🇸 o1 Mini | 89.0% |
| 98 | FACTS Grounding | Grounding / factuality | Grounded factuality benchmark evaluating model alignment with source facts. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 87.8% |
| 99 | FActScore | Hallucination rate (open-source prompts) | Measures hallucination rate on an open-source prompt suite; lower is better. | ★★★★ |  | 🇺🇸 GPT-5 | ↓ 1.0% |
| 100 | FAIX Agent | Composite capability index |  | ★★★★★ |  | 🇺🇸 Holo1.5-72B | 90.7% |
| 101 | FAIX Code | Composite capability index |  | ★★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 74.4% |
| 102 | FAIX Math | Composite capability index |  | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Instruct | 100.0% |
| 103 | FAIX OCR | Composite capability index |  | ★★★★★ |  | 🇺🇸 o3 (Low) | 86.0% |
| 104 | FAIX Safety | Composite safety index |  | ★★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 74.0% |
| 105 | FAIX STEM | Composite capability index |  | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Instruct | 100.0% |
| 106 | FAIX Text | Composite capability index |  | ★★★★★ |  | 🇨🇳 GLM-4.5V | 70.0% |
| 107 | FAIX Visual | Composite capability index |  | ★★★★★ |  | 🇺🇸 Holo1.5-72B | 91.6% |
| 108 | FAIX Writing | Composite capability index |  | ★★★★★ |  | 🇨🇳 Qwen3 235B A22B Instruct 2507 | 69.3% |
| 109 | FinanceReasoning | Financial reasoning | Financial reasoning benchmark evaluating quantitative and qualitative finance problem solving. | ★★★★★ |  | G Ling 1T | 87.5% |
| 110 | FinanceAgent | Agentic finance tasks | Interactive financial agent benchmark requiring multi-step tool use. | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 55.3% |
| 111 | FinanceBench (FullDoc) | Finance QA | FinanceBench full-document question answering benchmark requiring long-context financial understanding. | ★★★★ |  | G Jamba Mini 1.6 | 45.4% |
| 112 | FinSearchComp | Financial retrieval | Financial search and comprehension benchmark measuring retrieval-grounded reasoning over financial content. | ★★★★ |  | 🇺🇸 Grok 4 | 68.9% |
| 113 | FinSearchComp-CN | Financial retrieval (Chinese) | Chinese financial search and comprehension benchmark measuring retrieval-grounded reasoning over regional financial content. | ★★★★ |  | G doubao-1-5-vision-pro | 54.2% |
| 114 | Flame-React-Eval | Frontend coding | Front-end React coding tasks and evaluation. | ★★★★ |  | 🇨🇳 GLM-4.5V | 82.5% |
| 115 | FRAMES | Retrieval QA | Multi-hop factuality and retrieval benchmark (Factuality, Retrieval, And reasoning MEasurement Set). | ★★★★ |  | 🇨🇳 Tongyi DeepResearch | 90.6% |
| 116 | FreshQA | Recency QA | Question answering benchmark emphasizing up-to-date knowledge and recency. | ★★★★ |  | 🇨🇳 Qwen3-4B Thinking 2507 | 66.9% |
| 117 | FullStackBench | Full-stack development | End-to-end web/app development tasks and evaluation. | ★★★★★ |  | 🇺🇸 GPT-4.1 | 68.5% |
| 118 | GAIA | General AI tasks | Comprehensive benchmark for agentic tasks. | ★★★★ |  | 🇨🇳 Tongyi DeepResearch | 70.9% |
| 119 | GAIA 2 | General agent tasks | Grounded agentic intelligence benchmark version 2 covering multi-tool tasks. | ★★★★ |  | 🇺🇸 GPT-5 High | 42.1% |
| 120 | GDPVal | General capability | GDPVal benchmark evaluating broad general capabilities of LLMs across diverse tasks. | ★★★★ |  | 🇺🇸 Claude Opus 4.1 | 47.6% |
| 121 | GeoBench1 | Geospatial reasoning | Geospatial visual QA and reasoning (set 1). | ★★★★ |  | 🇨🇳 GLM-4.5V | 79.7% |
| 122 | Global-MMLU | Multi-domain knowledge (global) | Full Global-MMLU evaluation across diverse languages and regions. | ★★★★★ |  | 🇺🇸 Llama 3.3 70B | 77.8% |
| 123 | Gorilla Benchmark API Bench | Tool use | Gorilla API Bench tool-use evaluation. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 35.3% |
| 124 | GPQA | Graduate-level QA | Graduate-level question answering evaluating advanced reasoning. | ★★★★ | 406 | 🇺🇸 Grok 4 | 88.4% |
| 125 | GPQA-diamond | Graduate-level QA | Hard subset of GPQA (diamond level). | ★★★★ |  | 🇺🇸 GPT-5 pro | 89.4% |
| 126 | Ground-UI-1K | GUI grounding | Accuracy on the Ground-UI-1K grounding benchmark. | ★★★★ |  | 🇨🇳 Qwen2.5-VL 72B | 85.4% |
| 127 | GSM-Plus | Math (grade-school, enhanced) | Enhanced GSM-style grade-school math benchmark variant. | ★★★★ |  | 🇨🇳 Qwen3-4B | 82.1% |
| 128 | GSM-Symbolic | Math reasoning | Symbolic reasoning variant of GSM that tests algebraic manipulation and arithmetic with structured problems. | ★★★★★ |  | G Granite-4.0-H-Small | 87.4% |
| 129 | GSM8K | Math (grade-school) | Grade-school math word problems requiring multi-step reasoning. | ★★★★ | 1322 | 🇨🇳 Kimi K2 Instruct | 97.3% |
| 130 | GSM8K (DE) | Math (grade-school, German) | German translation of the GSM8K grade-school math word problems. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.6% |
| 131 | GSO Benchmark | Code generation | LiveCodeBench GSO benchmark. | ★★★★★ |  | 🇺🇸 o3-high | 8.8% |
| 132 | HallusionBench | Multimodal hallucination | Benchmark for evaluating hallucination tendencies in multimodal LLMs. | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 66.7% |
| 133 | HarmfulQA | Safety | Harmful question set testing models' ability to avoid unsafe answers. | ★★★★★ | 104 | G K2-THINK | 99.0% |
| 134 | HealthBench | Medical QA | Comprehensive medical knowledge and clinical reasoning benchmark across specialties and tasks. | ★★★★ |  | 🇺🇸 GPT-5 | 67.2% |
| 135 | HealthBench-Hard | Medical QA (hard) | Challenging subset of HealthBench focusing on complex, ambiguous clinical cases. | ★★★★ |  | 🇺🇸 GPT-5 | 46.2% |
| 136 | HealthBench-Hard Hallucinations | Medical hallucination safety | Measures hallucination and unsafe medical advice under hard clinical scenarios. | ★★★★ |  | 🇺🇸 GPT-5 | ↓ 1.6% |
| 137 | HellaSwag | Commonsense reasoning | Adversarial commonsense sentence completion. | ★★★★★ | 220 | 🇨🇳 DeepSeek V3 Base | 96.4% |
| 138 | HellaSwag (DE) | Commonsense reasoning (German) | German translation of the HellaSwag commonsense benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.7% |
| 139 | HELMET LongQA | Long-context QA | Long-context subset of the HELMET benchmark focusing on grounded question answering. | ★★★★ |  | G Jamba Mini 1.6 | 46.9% |
| 140 | HeroBench | Long-horizon planning | Benchmark for long-horizon planning and structured reasoning in virtual worlds. | ★★★★ |  | 🇺🇸 Grok 4 | 91.7% |
| 141 | HHEM v2.1 | Hallucination detection | Hughes Hallucination Evaluation Model (Vectara); lower is better. | ★★★★★ |  | G AntGroup Finix_S1_32b | ↓ 0.6% |
| 142 | HLE | Multi-domain reasoning | Challenging LLMs at the frontier of human knowledge. | ★★★★★ | 1085 | 🇨🇳 Tongyi DeepResearch | 32.9% |
| 143 | HMMT | Math (competition) | Harvard–MIT Mathematics Tournament problems. | ★★★★ |  | 🇺🇸 GPT-5 pro | 100.0% |
| 144 | HMMT 2025 | Math (competition) | Harvard–MIT Mathematics Tournament 2025 problems. | ★★★★★ |  | 🇺🇸 Grok 4 Fast | 93.3% |
| 145 | HMMT25 | Math (competition) | Harvard–MIT Mathematics Tournament 2025 benchmark. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Thinking | 67.6% |
| 146 | HRBench 4K | High-resolution perception | High-resolution image perception benchmark on 4K-resolution images (HR-Bench). | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 89.5% |
| 147 | HRBench 8K | High-resolution perception | High-resolution image perception benchmark on 8K-resolution images (HR-Bench). | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 82.5% |
| 148 | HumanEval | Code generation | Python synthesis problems evaluated by unit tests. | ★★★★ | 2916 | 🇺🇸 o1-preview | 96.3% |
| 149 | HumanEval+ | Code generation | Extended HumanEval with more tests. | ★★★★★ | 1577 | 🇺🇸 Claude Sonnet 4 | 94.5% |
| 150 | Hypersim | 3D scene understanding | Hypersim benchmark for synthetic indoor scene understanding and reconstruction. | ★★★★ |  | 🇺🇸 GPT-5 Mini Minimal | 39.3% |
| 151 | IFBench | Instruction following | Instruction-following benchmark measuring compliance and adherence. | ★★★★★ | 70 | 🇫🇷 Mistral Small 3.2 24B Instruct | 84.8% |
| 152 | IFEval | Instruction following | Instruction-following capability evaluation for LLMs. | ★★★★ | 36312 | 🇺🇸 o3 mini-high | 93.9% |
| 153 | INCLUDE | Inclusiveness / bias | Evaluates inclusive language use and bias mitigation in model outputs. | ★★★★★ |  | 🇺🇸 Gemini-2.5-Flash Thinking | 83.9% |
| 154 | InfoQA | Information-seeking QA | Information retrieval question answering benchmark evaluating factual responses. | ★★★★ |  | 🇨🇳 Qwen2-VL-72B | 84.5% |
| 155 | InfoVQA | Infographic VQA | Visual question answering over infographics requiring reading, counting, and reasoning. | ★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 91.2% |
| 156 | JudgeMark v2.1 | LLM judging ability | A benchmark measuring LLM judging ability. | ★★★★ |  | 🇺🇸 Claude Sonnet 4 | 82.0% |
| 157 | KMMLU-Pro | Multilingual knowledge | Korean Massive Multitask Language Understanding, Pro variant. | ★★★★★ |  | 🇺🇸 o1 | 77.5% |
| 158 | KMMLU-Redux | Multilingual knowledge | Redux variant of the KMMLU benchmark. | ★★★★★ |  | 🇺🇸 o1 | 81.1% |
| 159 | KOR-Bench | Reasoning | Comprehensive reasoning benchmark spanning diverse domains and cognitive skills. | ★★★★★ |  | G Ling 1T | 76.0% |
| 160 | KSM | Multilingual math | Korean STEM and math benchmark. | ★★★★★ |  | G EXAONE Deep 2.4B | 60.9% |
| 161 | LAMBADA | Language modeling | Word prediction requiring broad context understanding. | ★★★★★ |  | 🇺🇸 GPT-3 | 86.4% |
| 162 | LatentJailbreak | Safety / jailbreak | Robustness to latent jailbreak adversarial techniques. | ★★★★★ | 39 | 🇺🇸 GPT-3.5-turbo | 77.4% |
| 163 | LiveBench | General capability | Continually updated capability benchmark across diverse tasks. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 82.4% |
| 164 | LiveBench 20241125 | General capability | LiveBench snapshot (2024-11-25) tracking mixed-task evals. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 82.4% |
| 165 | LiveCodeBench | Code generation | Live coding and execution-based evaluation benchmark (v6 dataset). | ★★★★ |  | 🇺🇸 GPT-5 mini | 86.6% |
| 166 | LiveCodeBench v5 (2024.10-2025.02) | Code generation | LiveCodeBench v5 snapshot covering Oct 2024 to Feb 2025. | ★★★★★ |  | 🇨🇳 Qwen3-235B-A22B | 70.7% |
| 167 | LiveMCP-101 | Agent real-time eval | A novel real-time evaluation framework and benchmark to stress-test agents on complex, real-world tasks. | ★★★★ |  | 🇺🇸 GPT-5 | 58.4% |
| 168 | LMArena Text | Crowd eval (text) | Chatbot Arena text leaderboard (Elo ratings). | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 1455 |
| 169 | LMArena Vision | Crowd eval (vision) | Chatbot Arena vision evaluation leaderboard (ELO ratings). | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 1242 |
| 170 | LogicVista | Visual logical reasoning | Visual logic and pattern reasoning tasks requiring compositional and spatial understanding. | ★★★★ |  | 🇨🇳 GLM-4.5V | 62.4% |
| 171 | LogiQA | Logical reasoning | Reading comprehension with logical reasoning. | ★★★★★ | 138 | G Pythia 70M | 23.5% |
| 172 | LongBench | Long-context eval | Long-context understanding across tasks. | ★★★★★ | 957 | G Jamba Mini 1.6 | 32.0% |
| 173 | LongFact-Concepts | Hallucination rate (open-source prompts) | Long-form factuality eval focused on conceptual statements; lower is better. | ★★★★ |  | 🇺🇸 GPT-5 | ↓ 0.7% |
| 174 | LongFact-Objects | Hallucination rate (open-source prompts) | Long-form factuality eval focused on object/entity references; lower is better. | ★★★★ |  | 🇺🇸 GPT-5 | ↓ 0.8% |
| 175 | LVBench | Video understanding | Long video understanding benchmark (LVBench). | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 73.0% |
| 176 | M3GIA (CN) | Chinese multimodal QA | Chinese-language M3GIA benchmark covering grounded multimodal question answering. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 91.2% |
| 177 | Mantis | Multimodal reasoning | Multimodal reasoning and instruction following benchmark (Mantis). | ★★★★★ |  | G dots.vlm1 | 86.2% |
| 178 | MASK | Safety / red teaming | Model behavior safety assessment via red-teaming scenarios. | ★★★★ |  | 🇺🇸 Claude Sonnet 4 (t) | 95.3% |
| 179 | MATH | Math (competition) | Competition-level mathematics across algebra, geometry, number theory, combinatorics. | ★★★★★ | 1185 | 🇺🇸 o3 mini | 97.9% |
| 180 | MATH Level 5 | Math (competition) | Level 5 subset of the MATH benchmark emphasizing the hardest competition-style problems. | ★★★★ |  | 🇨🇳 Qwen3-4B-Instruct-2507 | 73.6% |
| 181 | MATH500 | Math (competition) | 500 curated math problems for evaluating high-level reasoning. | ★★★★★ |  | 🇺🇸 GPT-5 | 99.2% |
| 182 | MATH500 (ES) | Math (multilingual) | Spanish MATH500 benchmark. | ★★★★★ |  | G EXAONE 4.0 1.2B | 88.8% |
| 183 | MathVerse-mini | Math reasoning (multimodal) | Compact MathVerse split focusing on single-image math puzzles and visual reasoning. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 85.0% |
| 184 | MathVerse-Vision | Math reasoning (multimodal) | Multi-image visual mathematical reasoning tasks from the MathVerse ecosystem. | ★★★★ |  | 🇨🇳 GLM-4.5V | 72.1% |
| 185 | MathVision | Math reasoning (multimodal) | Visual math reasoning benchmark with problems that combine images (charts, diagrams) and text. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 74.6% |
| 186 | MathVista | Multimodal math reasoning | Visual math reasoning across diverse tasks. | ★★★★ |  | 🇺🇸 o3 | 86.8% |
| 187 | MathVista-Mini | Math reasoning (multimodal) | Lightweight subset of MathVista for quick evaluation of visual mathematical reasoning. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 85.8% |
| 188 | MBPP | Code generation | Short Python problems with hidden tests. | ★★★★ | 36312 | 🇺🇸 o1-preview | 95.5% |
| 189 | MBPP+ | Code generation | Extended MBPP with more tests and stricter evaluation. | ★★★★★ |  | 🇺🇸 Llama 3.1 405B | 88.6% |
| 190 | MCP Universe | Agent evaluation | Benchmarks multi-step tool-use agents across diverse task suites with a unified overall success metric. | ★★★★ |  | 🇺🇸 GPT-5 High | 44.2% |
| 191 | MCPMark | Agent tool-use (MCP) | Benchmark for Model Context Protocol (MCP) agent tool-use. | ★★★★★ | 127 | 🇺🇸 GPT-5 | 46.9% |
| 192 | MGSM | Math (multilingual) | Multilingual grade-school math word problems. | ★★★★★ |  | 🇺🇸 Claude Opus 4.1 (2025-08-05) Thinking | 94.4% |
| 193 | MIABench | Multimodal instruction following | Multimodal instruction-following benchmark evaluating accuracy on complex image-text tasks. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 92.7% |
| 194 | Minerva Math | University-level math | Advanced quantitative reasoning set inspired by the Minerva benchmark for STEM problem solving. | ★★★★★ |  | G Granite-4.0-H-Small | 74.0% |
| 195 | MiniF2F (Test) | Formal math (olympiad) | Formal theorem-proving benchmark of olympiad-style problems (test split). | ★★★★ |  | 🇨🇳 LongCat-Flash-Thinking | 81.6% |
| 196 | MixEval | Multi-task reasoning | Mixed-subject benchmark covering knowledge and reasoning tasks across domains. | ★★★★★ |  | 🇺🇸 o1 Mini | 82.9% |
| 197 | MixEval Hard | Multi-task reasoning (hard) | Hard subset of MixEval covering diverse reasoning tasks. | ★★★★ |  | 🇨🇳 Qwen3-4B | 31.6% |
| 198 | MLVU | Large video understanding | MLVU: large-scale multi-task benchmark for video understanding. | ★★★★ |  | 🇺🇸 GPT-5 | 86.2% |
| 199 | MM-MT-Bench | Multimodal instruction following | Multi-turn multimodal instruction following benchmark evaluating dialogue quality and helpfulness. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 8.5% |
| 200 | MMBench v1.1 (EN dev) | General VQA | English dev split of MMBench v1.1 measuring multimodal question answering. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 90.6% |
| 201 | MMBench v1.1 (CN) | Multimodal understanding (Chinese) | MMBench v1.1 Chinese subset for evaluating multimodal LLMs. | ★★★★★ |  | 🇨🇳 Keye-VL 8B | 89.8% |
| 202 | MMBench v1.1 (EN) | Multimodal understanding (English) | MMBench v1.1 English subset for evaluating multimodal LLMs. | ★★★★★ |  | 🇨🇳 Keye-VL 8B | 89.7% |
| 203 | MME-RealWorld (cn) | Real-world perception (CN) | MME-RealWorld Chinese split. | ★★★★ |  | 🇺🇸 GPT-4o | 58.5% |
| 204 | MME-RealWorld (en) | Real-world perception (EN) | MME-RealWorld English split. | ★★★★ |  | G MiMo-VL 7B-RL | 59.1% |
| 205 | MMLongBench-Doc | Long-context multimodal documents | Evaluates long-context document understanding with mixed text, tables, and figures across multiple pages. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 56.2% |
| 206 | MMLU | Multi-domain knowledge | 57 tasks spanning STEM, humanities, social sciences; broad knowledge and reasoning. | ★★★★★ | 1488 | 🇺🇸 GPT-5 | 93.5% |
| 207 | MMLU (cloze) | Multi-domain knowledge (cloze) | Cloze-form MMLU evaluation variant. | ★★★★ |  | 🇺🇸 SmolLM2 135M Base | 31.5% |
| 208 | Full Text MMLU | Multi-domain knowledge (long-form) | Full-context MMLU variant evaluating reasoning over long passages. | ★★★★ |  | 🇺🇸 Llama 3.3 70B Instruct | 0.8% |
| 209 | MMLU-Pro | Multi-domain knowledge | Harder successor to MMLU with more challenging questions. | ★★★★★ | 286 | 🇺🇸 o1 | 89.3% |
| 210 | MMLU Pro MCF | Multi-domain knowledge (few-shot) | MMLU-Pro in multiple-choice formulation (MCF), few-shot evaluation. | ★★★★ |  | 🇨🇳 Qwen3-4B-Base | 41.1% |
| 211 | MMLU-ProX | Multi-domain knowledge | Cross-lingual and robust variant of MMLU-Pro. | ★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 81.0% |
| 212 | MMLU-Redux | Multi-domain knowledge | Updated MMLU-style evaluation with revised questions and scoring. | ★★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 93.8% |
| 213 | MMLU-STEM | STEM knowledge | STEM subset of MMLU. | ★★★★★ | 1488 | G Falcon-H1-34B-Instruct | 83.6% |
| 214 | MMMLU | Multi-domain knowledge (multilingual) | Massively multilingual MMLU-style evaluation across many languages. | ★★★★★ |  | 🇺🇸 Claude Opus 4.1 | 89.5% |
| 215 | MMMLU (ES) | Multilingual knowledge | Spanish MMMLU benchmark. | ★★★★★ |  | 🇺🇸 SmolLM 3 3B | 64.7% |
| 216 | MMMU | Multimodal understanding | Multi-discipline multimodal understanding benchmark. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 84.2% |
| 217 | MMMU PRO | Multimodal understanding (hard) | Professional/advanced subset of MMMU for multimodal reasoning. | ★★★★ |  | 🇺🇸 GPT-5 | 78.4% |
| 218 | MMMU-Pro (vision) | Multimodal understanding (vision) | MMMU-Pro vision-only setting. | ★★★★ |  | 🇺🇸 Claude 3.7 Sonnet | 45.8% |
| 219 | MMStar | Multimodal reasoning | Broad evaluation of multimodal LLMs across diverse tasks. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 78.7% |
| 220 | MMVP | Multimodal perception | Benchmark of visual patterns that vision-language models systematically misperceive (MMVP). | ★★★★★ |  | 🇨🇳 R-4B-RL | 80.7% |
| 221 | MMVU | Video understanding | Multimodal video understanding benchmark (MMVU). | ★★★★ |  | 🇨🇳 GLM-4.5V | 68.7% |
| 222 | MotionBench | Video motion understanding | Video motion and temporal reasoning benchmark. | ★★★★ |  | 🇨🇳 GLM-4.5V | 62.4% |
| 223 | MT-Bench | Chat ability | Multi-turn chat evaluation via GPT-4 grading. | ★★★★★ | 39074 | 🇺🇸 Apriel Nemotron 15B Thinker | 85.7% |
| 224 | MTOB (full book) | Long-context translation | Machine Translation from One Book: translating a low-resource language from a grammar book given in context (full-book setting). | ★★★★★ |  | 🇺🇸 Llama 4 Maverick | 50.8% |
| 225 | MTOB (half book) | Long-context translation | Machine Translation from One Book, half-book setting. | ★★★★★ |  | 🇺🇸 Llama 4 Maverick | 54.0% |
| 226 | MUIRBENCH | Multimodal robustness | Evaluates multimodal understanding robustness and reliability. | ★★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 80.1% |
| 227 | Multi-IF | Instruction following (multi-task) | Composite instruction-following evaluation across multiple tasks. | ★★★★ |  | 🇺🇸 o3 mini-high | 79.5% |
| 228 | Multi-IFEval | Instruction following (multi-task) | Multi-task variant of instruction-following evaluation. | ★★★★★ |  | 🇺🇸 Llama 3.3 70B | 88.7% |
| 229 | Multi-SWE-Bench | Code repair (multi-repo) | Multi-repository SWE-Bench variant. | ★★★★★ | 246 | 🇺🇸 Claude Sonnet 4 | 35.7% |
| 230 | MultiChallenge | Multi-task reasoning | Composite benchmark across diverse challenges by Scale AI. | ★★★★ |  | 🇺🇸 GPT-5 | 69.6% |
| 231 | MultiPL-E | Code generation (multilingual) | Multilingual code generation and execution benchmark across many programming languages. | ★★★★★ | 269 | 🇨🇳 Qwen3-235B-A22B | 87.9% |
| 232 | MultiPL-E HumanEval | Code generation (multilingual) | MultiPL-E variant of HumanEval tasks. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 75.2% |
| 233 | MultiPL-E MBPP | Code generation (multilingual) | MultiPL-E variant of MBPP tasks. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 65.7% |
| 234 | MuSR | Reasoning | Multistep Soft Reasoning. | ★★★★★ |  | 🇨🇳 ERNIE 4.5 424B A47B | 69.9% |
| 235 | MVBench | Video QA | Multi-view or multi-video QA benchmark (MVBench). | ★★★★ |  | 🇨🇳 GLM-4.5V | 73.0% |
| 236 | Natural2Code | Code generation | Natural language to code benchmark for instruction-following synthesis. | ★★★★ |  | 🇺🇸 Gemini 2.0 Flash | 92.9% |
| 237 | NaturalQuestions | Open-domain QA | Google NQ; real user questions with long/short answers. | ★★★★ |  | 🇫🇷 Mixtral 8x22B | 40.1% |
| 238 | Nexus (0-shot) | Tool use | Nexus tool-use benchmark, zero-shot setting. | ★★★★ |  | 🇺🇸 Llama 3.1 405B | 58.7% |
| 239 | Objectron | Object detection | Objectron benchmark for 3D object detection in video captures. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 71.2% |
| 240 | OCRBench | OCR (vision text extraction) | Optical character recognition benchmark evaluating text extraction from images, documents, and complex layouts. | ★★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 90.3% |
| 241 | OCRBenchV2 (CN) | OCR (Chinese) | OCRBenchV2 Chinese subset assessing OCR performance on Chinese-language documents. | ★★★★ |  | 🇨🇳 Qwen2.5-VL 72B Instruct | 63.7% |
| 242 | OCRBenchV2 (EN) | OCR (English) | OCRBenchV2 English subset evaluating OCR accuracy on English documents and layouts. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 66.8% |
| 243 | OCRReasoning | OCR reasoning | OCR reasoning benchmark combining text extraction with multi-step reasoning over documents. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 70.8% |
| 244 | ODinW-13 | Object detection (in the wild) | Object Detection in the Wild benchmark covering 13 real-world domains. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 47.5% |
| 245 | OJBench | Code generation (online judge) | Programming problems evaluated via online judge-style execution. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 41.6% |
| 246 | OlympiadBench | Math (olympiad) | Advanced mathematics olympiad-style problem benchmark. | ★★★★ |  | 🇨🇳 Hunyuan-7B-Instruct | 76.5% |
| 247 | OlympicArena | Math (competition) | Olympiad-style mathematics reasoning benchmark. | ★★★★ |  | 🇨🇳 DeepSeek V3 | 76.2% |
| 248 | Omni-MATH | Math reasoning | Omni-MATH benchmark covering diverse math reasoning tasks across difficulty levels. | ★★★★★ |  | G Ling 1T | 74.5% |
| 249 | Omni-MATH-HARD | Math | Challenging math benchmark (Omni-MATH-HARD). | ★★★★★ |  | 🇺🇸 GPT-5 High | 73.6% |
| 250 | OmniSpatial | Spatial reasoning | Spatial understanding and reasoning benchmark (OmniSpatial). | ★★★★ |  | 🇨🇳 GLM-4.5V | 51.0% |
| 251 | OpenBookQA | Science QA | Open-book multiple-choice science questions with supporting facts. | ★★★★★ | 128 | 🇨🇳 Qwen3 | 96.4% |
| 252 | OpenRewrite-Eval | Rewrite quality | OpenRewrite evaluation; micro-averaged ROUGE-L. | ★★★★ |  | 🇨🇳 Qwen2.5 1.5B Instruct | 46.9% |
| 253 | OptMATH | Math optimization reasoning | OptMATH benchmark targeting challenging math optimization and problem-solving tasks. | ★★★★★ |  | G Ling 1T | 57.7% |
| 254 | OSWorld | GUI agents | Agentic GUI task completion and grounding on desktop environments. | ★★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 61.4% |
| 255 | OSWorld-G | GUI agents | OSWorld-G center accuracy (no_refusal). | ★★★★ |  | 🇺🇸 Holo1.5-72B | 71.8% |
| 256 | OSWorld2 | GUI agents | Second-generation OSWorld GUI agent benchmark. | ★★★★ |  | 🇨🇳 GLM-4.5V | 35.8% |
| 257 | PIQA | Physical commonsense | Physical commonsense about everyday tasks and object affordances. | ★★★★ |  | 🇨🇳 GLM-4.5 Base | 87.1% |
| 258 | PixmoCount | Visual counting | Counting objects/instances in images (PixmoCount). | ★★★★ |  | 🇺🇸 Molmo-72B | 85.2% |
| 259 | PolyMATH | Math reasoning | Polyglot mathematics benchmark assessing cross-topic math reasoning. | ★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 60.1% |
| 260 | POPE | Hallucination detection | Vision-language hallucination benchmark focusing on object existence verification. | ★★★★ |  | G Moondream-9B-A2B | 89.0% |
| 261 | PopQA | Knowledge / QA | Open-domain QA over long-tail entities, testing factual recall across popularity levels. | ★★★★★ |  | 🇺🇸 Llama 3.1 8B Instruct | 28.8% |
| 262 | QuAC | Conversational QA | Question answering in context. | ★★★★ |  | 🇺🇸 Llama 3.1 405B Base | 53.6% |
| 263 | QuALITY | Long-context reading comprehension | Long-document multiple-choice reading comprehension benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.5% |
| 264 | RACE | Reading comprehension | English exams for middle and high school. | ★★★★ |  | G RND1-Base-0910 | 57.6% |
| 265 | RealWorldQA | Real-world visual QA | Visual question answering with real-world images and scenarios. | ★★★★★ |  | 🇺🇸 GPT-5 | 82.8% |
| 266 | RefCOCO | Referring expressions | RefCOCO average accuracy at IoU 0.5 (val). | ★★★★★ |  | 🇨🇳 InternVL3.5-4B | 92.4% |
| 267 | RefCOCOg | Referring expressions | RefCOCOg average accuracy at IoU 0.5 (val). | ★★★★ |  | G Moondream-9B-A2B | 88.6% |
| 268 | RefCOCO+ | Referring expressions | RefCOCO+ accuracy at IoU 0.5 on the val split. | ★★★★ |  | G Moondream-9B-A2B | 81.8% |
| 269 | RefSpatialBench | Spatial reasoning | Reference spatial understanding benchmark covering spatial grounding tasks. | ★★★★ |  | 🇨🇳 Qwen2.5-VL 72B Instruct | 72.1% |
| 270 | RepoBench | Code understanding | Repository-level code comprehension and reasoning benchmark. | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 83.8% |
| 271 | RoboSpatialHome | Embodied spatial understanding | RoboSpatialHome benchmark for embodied spatial reasoning in domestic environments. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 73.9% |
| 272 | Roo Code Evals | Code assistant eval | Community-maintained coding evals and leaderboard by Roo Code. | ★★★★★ |  | 🇺🇸 GPT-5 mini | 99.0% |
| 273 | Ruler 128k | Long-context eval | RULER benchmark at 128k context window. | ★★★★ |  | 🇫🇷 Mistral Medium 3 | 90.2% |
| 274 | Ruler 32k | Long-context eval | RULER benchmark at 32k context window. | ★★★★★ |  | 🇫🇷 Mistral Medium 3 | 96.0% |
| 275 | SALAD-Bench | Safety alignment | Safety Alignment and Dangerous-behavior benchmark evaluating harmful assistance and refusal consistency. | ★★★★★ |  | G Granite-4.0-H-Micro | ↓ 96.8% |
| 276 | SciCode (sub) | Code | SciCode subset score (sub). | ★★★★★ |  | 🇺🇸 Grok 4 | 45.7% |
| 277 | SciCode (main) | Code | SciCode main score. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 15.4% |
| 278 | ScienceQA | Science QA (multimodal) | Multiple-choice science questions with images, diagrams, and text context. | ★★★★ |  | G FastVLM-7B | 96.7% |
| 279 | SciQ | Science QA | Multiple-choice science questions. | ★★★★ |  | G Pythia 12B | 92.9% |
| 280 | ScreenQA Complex | GUI QA | Complex ScreenQA benchmark accuracy. | ★★★★ |  | 🇺🇸 Holo1.5-72B | 87.1% |
| 281 | ScreenQA Short | GUI QA | Short-form ScreenQA benchmark accuracy. | ★★★★ |  | 🇺🇸 Holo1.5-72B | 91.9% |
| 282 | ScreenSpot | Screen UI locators | Center accuracy on ScreenSpot. | ★★★★ |  | 🇨🇳 Qwen3-VL-235B-A22B Thinking | 95.4% |
| 283 | ScreenSpot-Pro | Screen UI locators | Average center accuracy on ScreenSpot-Pro. | ★★★★ |  | 🇺🇸 Holo1.5-72B | 63.2% |
| 284 | ScreenSpot-v2 | Screen UI locators | Center accuracy on ScreenSpot-v2. | ★★★★ |  | G UI-Venus 72B | 95.3% |
| 285 | SEED-Bench-2-Plus | Multimodal evaluation | SEED-Bench-2-Plus overall accuracy. | ★★★★ |  | 🇺🇸 Claude 3.7 Sonnet | 72.9% |
| 286 | SEED-Bench-Img | Multimodal image understanding | SEED-Bench image-only subset (SEED-Bench-Img). | ★★★★ |  | G Bagel 14B | 78.5% |
| 287 | Showdown | GUI agents | Success rate on the Showdown UI interaction benchmark. | ★★★★ |  | 🇺🇸 Holo1.5-72B | 76.8% |
| 288 | SIFO | Instruction following | Single-turn instruction following benchmark. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Thinking | 66.9% |
| 289 | SIFO Multiturn | Instruction following | Multi-turn SIFO benchmark for sustained instruction adherence. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Thinking | 60.3% |
| 290 | SimpleQA | QA | Simple question answering benchmark. | ★★★★★ |  | 🇨🇳 DeepSeek V3.2-Exp | 97.1% |
| 291 | SimpleVQA | General VQA | Lightweight visual question answering set with everyday scenes. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 65.4% |
| 292 | SimpleVQA-DS | General VQA | SimpleVQA variant curated by DeepSeek with everyday image question answering tasks. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 61.3% |
| 293 | SocialIQA | Social commonsense | Social interaction commonsense QA. | ★★★★ |  | 🇺🇸 Gemma 3 PT 27B | 54.9% |
| 294 | Spider | Text-to-SQL | Complex text-to-SQL benchmark over cross-domain databases. | ★★★★ |  | 🇺🇸 Llama 3 70B Base | 67.1% |
| 295 | Spiral-Bench | Safety / sycophancy | An LLM-judged benchmark measuring sycophancy and delusion reinforcement. | ★★★★ |  | 🇺🇸 GPT-5 | 87.0% |
| 296 | SQuAD v1.1 | Reading comprehension | Extractive QA from Wikipedia articles. | ★★★★★ | 566 | 🇺🇸 Llama 3.1 405B Base | 89.3% |
| 297 | SUNRGBD | 3D scene understanding | SUN RGB-D benchmark for indoor scene understanding from RGB-D imagery. | ★★★★ |  | 🇺🇸 GPT-5 Mini Minimal | 45.8% |
| 298 | SuperGPQA | Graduate-level QA | Harder GPQA variant assessing advanced graduate-level reasoning. | ★★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 64.9% |
| 299 | SWE-Bench | Code repair | Software engineering benchmark across many repos and issues. | ★★★★★ | 3442 | 🇺🇸 GPT-5 Codex | 74.5% |
| 300 | SWE-Bench Multilingual | Code repair (multilingual) | Multilingual variant of SWE-Bench for issue fixing. | ★★★★★ |  | 🇨🇳 DeepSeek V3.2-Exp | 57.9% |
| 301 | SWE-Bench Pro (Public) | Software engineering | Public subset of the SWE-Bench Pro benchmark for software-engineering agents. | ★★★★ |  | 🇺🇸 GPT-5 | 23.3% |
| 302 | SWE-Bench Verified | Code repair | Verified subset of SWE-Bench for issue fixing. | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 77.2% |
| 303 | SWE-Dev | Code repair | Software engineering development and bug fixing benchmark. | ★★★★★ |  | 🇺🇸 Claude Sonnet 4 | 67.1% |
| 304 | SysBench | System prompts | System prompt understanding and adherence benchmark. | ★★★★ |  | 🇺🇸 GPT-4.1 | 74.1% |
| 305 | τ²-Bench (airline) | Industry QA (airline) | τ²-Bench airline domain evaluation. | ★★★★★ |  | 🇺🇸 o3 | 64.8% |
| 306 | τ²-Bench (retail) | Industry QA (retail) | τ²-Bench retail domain evaluation. | ★★★★★ |  | 🇺🇸 Claude Opus 4.1 | 82.4% |
| 307 | τ²-Bench (telecom) | Industry QA (telecom) | τ²-Bench telecom domain evaluation. | ★★★★ |  | 🇺🇸 GPT-5 | 96.7% |
| 308 | TAU1-Airline | Agent tasks (airline) | Tool-augmented agent evaluation in airline scenarios (TAU1). | ★★★★ |  | 🇺🇸 Gemini-2.5-Flash Thinking | 54.0% |
| 309 | TAU1-Retail | Agent tasks (retail) | Tool-augmented agent evaluation in retail scenarios (TAU1). | ★★★★ |  | 🇨🇳 Qwen3-235B-A22B | 71.3% |
| 310 | TAU2-Airline | Agent tasks (airline) | Tool-augmented agent evaluation in airline scenarios (TAU2). | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 70.0% |
| 311 | TAU2-Retail | Agent tasks (retail) | Tool-augmented agent evaluation in retail scenarios (TAU2). | ★★★★ |  | 🇺🇸 Claude Opus 4.1 | 86.8% |
| 312 | TAU2-Telecom | Agent tasks (telecom) | Tool-augmented agent evaluation in telecom scenarios (TAU2). | ★★★★ |  | 🇺🇸 Claude Sonnet 4.5 | 98.0% |
| 313 | Terminal-Bench | Agent terminal tasks | Command-line task completion benchmark for agents. | ★★★★ | 637 | 🇺🇸 Claude Sonnet 4.5 (Thinking) | 61.3% |
| 314 | Terminal-Bench Hard | Agent terminal tasks | Hard subset of Terminal-Bench command-line agent tasks. | ★★★★★ |  | 🇺🇸 Grok 4 | 37.6% |
| 315 | TextVQA | Text-based VQA | Visual question answering that requires reading text in images. | ★★★★ |  | 🇨🇳 Qwen2-VL-72B | 85.5% |
| 316 | TreeBench | Reasoning with tree structures | Evaluates hierarchical/tree-structured reasoning and planning capabilities in LLMs/VLMs. | ★★★★ |  | 🇨🇳 GLM-4.5V | 50.1% |
| 317 | TriQA | Knowledge QA | Triadic question answering benchmark evaluating world knowledge and reasoning. | ★★★★ |  | 🇫🇷 Mixtral 8x22B | 82.2% |
| 318 | TriviaQA | Open-domain QA | Open-domain question answering benchmark built from trivia and web evidence. | ★★★★★ |  | 🇺🇸 Gemma 3 PT 27B | 85.5% |
| 319 | TriviaQA-Wiki | Open-domain QA | TriviaQA subset answering using Wikipedia evidence. | ★★★★ |  | 🇺🇸 Llama 3.1 405B Base | 91.8% |
| 320 | TruthfulQA | Truthfulness / hallucination | Measures whether a model imitates human falsehoods (truthfulness). | ★★★★★ |  | 🇨🇳 Qwen2.5 32B Instruct | 70.3% |
| 321 | TruthfulQA (DE) | Truthfulness / hallucination (German) | German translation of the TruthfulQA benchmark. | ★★★★ |  | 🇺🇸 Llama 3.3 70B Instruct | 0.2% |
| 322 | TydiQA | Cross-lingual QA | Typologically diverse QA across languages. | ★★★★★ | 313 | 🇺🇸 Llama 3.1 405B Base | 34.3% |
| 323 | V* | Multimodal reasoning | V* benchmark accuracy. | ★★★★ |  | G MiMo-VL 7B-RL | 81.7% |
| 324 | VCT | Virology capability (protocol troubleshooting) | Virology Capabilities Test: a benchmark that measures an LLM's ability to troubleshoot complex virology laboratory protocols. | ★★★★ |  | 🇺🇸 o3 | 43.8% |
| 325 | VibeEval | Aesthetic/visual quality | VLM aesthetic evaluation with GPT scores. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 76.4% |
| 326 | Video-MME | Video understanding (multimodal) | Multimodal evaluation of video understanding and reasoning. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 74.5% |
| 327 | VideoMME (w/o sub) | Video understanding | Video understanding benchmark without subtitles. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 85.1% |
| 328 | VideoMME (w/sub) | Video understanding | Video understanding benchmark with subtitles. | ★★★★ |  | 🇨🇳 GLM-4.5V | 80.7% |
| 329 | VideoMMMU | Multimodal video understanding | Video-based extension of MMMU evaluating temporal multimodal reasoning and perception across disciplines. | ★★★★★ |  | 🇺🇸 GPT-5 | 84.6% |
| 330 | VisualWebBench | Web UI understanding | Average accuracy on VisualWebBench. | ★★★★ |  | 🇺🇸 Holo1.5-72B | 83.8% |
| 331 | VisuLogic | Visual logical reasoning | Logical reasoning and compositionality benchmark for visual-language models. | ★★★★★ |  | 🇨🇳 Seed1.5-VL-Thinking | 35.9% |
| 332 | VitaBench | Industry QA | Industry-focused benchmark evaluating domain QA performance. | ★★★★ |  | 🇺🇸 o3 | 35.3% |
| 333 | VL-RewardBench | Reward modeling (VL) | Reward alignment benchmark for VLMs. | ★★★★ |  | 🇺🇸 Claude 3.7 Sonnet | 67.4% |
| 334 | VLMs are Biased | Multimodal bias | Evaluates whether VLMs truly 'see' vs. relying on memorized knowledge; measures bias toward non-visual priors. | ★★★★ | 90 | 🇺🇸 o4 mini | 20.2% |
| 335 | VLMs are Blind | Visual grounding robustness | Evaluates failure modes of VLMs in grounding and perception tasks. | ★★★★ |  | G MiMo-VL 7B-RL | 79.4% |
| 336 | VoiceBench AdvBench | VoiceBench | VoiceBench adversarial safety evaluation. | ★★★★ |  | 🇨🇳 Qwen3-Omni-30B-A3B-Thinking | 99.4% |
| 337 | VoiceBench AlpacaEval | VoiceBench | VoiceBench evaluation on AlpacaEval instructions. | ★★★★ |  | 🇨🇳 Qwen3-Omni-Flash-Thinking | 96.8% |
| 338 | VoiceBench BBH | VoiceBench | VoiceBench evaluation on BIG-Bench Hard prompts. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 92.6% |
| 339 | VoiceBench CommonEval | VoiceBench | VoiceBench evaluation on CommonEval. | ★★★★ |  | 🇨🇳 Qwen3-Omni-Flash-Instruct | 91.0% |
| 340 | VoiceBench IFEval | VoiceBench | VoiceBench instruction-following evaluation (IFEval). | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 85.7% |
| 341 | MMAU v05.15.25 | Audio reasoning | Audio reasoning benchmark MMAU v05.15.25. | ★★★★ |  | 🇨🇳 Qwen3-Omni-Flash-Instruct | 77.6% |
| 342 | VoiceBench MMSU | VoiceBench | VoiceBench MMSU benchmark (voice modality). | ★★★★ |  | 🇨🇳 Qwen3-Omni-Flash-Thinking | 84.3% |
| 343 | VoiceBench MMSU (Audio) | Audio reasoning | Audio reasoning MMSU results. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 77.7% |
| 344 | VoiceBench OpenBookQA | VoiceBench | VoiceBench results on OpenBookQA prompts. | ★★★★ |  | 🇨🇳 Qwen3-Omni-Flash-Thinking | 95.0% |
| 345 | VoiceBench Overall | VoiceBench | Overall VoiceBench aggregate score. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 89.6% |
| 346 | VoiceBench SD-QA | VoiceBench | VoiceBench Spoken Dialogue QA results. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 90.1% |
| 347 | VoiceBench WildVoice | VoiceBench | VoiceBench evaluation on the WildVoice dataset. | ★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 93.4% |
| 348 | VQAv2 | Visual question answering | Standard Visual Question Answering v2 benchmark on natural images. | ★★★★ |  | 🇺🇸 Molmo-72B | 86.5% |
| 349 | VSI-Bench | Spatial intelligence | Visual spatial intelligence benchmark covering 3D reasoning and spatial inference tasks. | ★★★★ |  | 🇨🇳 Qwen3-VL-30B-A3B Instruct | 63.2% |
| 350 | WebClick | GUI agents | Task success on the WebClick UI agent benchmark. | ★★★★ |  | 🇺🇸 Claude Sonnet 4 | 93.0% |
| 351 | WebDev Arena | Web development agents | Arena evaluation for autonomous web development agents. | ★★★★★ |  | 🇺🇸 GPT-5 | 1483 |
| 352 | WebQuest-MultiQA | Web agents | Multi-question web search and interaction tasks. | ★★★★ |  | 🇨🇳 GLM-4.5V | 60.6% |
| 353 | WebQuest-SingleQA | Web agents | Single-question web search and interaction tasks. | ★★★★ |  | 🇨🇳 GLM-4.5V | 76.9% |
| 354 | WebSrc | Web QA | Webpage question answering (SQuAD F1). | ★★★★ |  | 🇺🇸 Holo1.5-72B | 97.2% |
| 355 | WebVoyager2 | Web agents | Web navigation and interaction tasks for LLM agents (v2). | ★★★★ |  | 🇨🇳 GLM-4.5V | 84.4% |
| 356 | WebWalkerQA | Web agents | WebWalker tasks evaluating autonomous browsing question answering performance. | ★★★★ |  | 🇨🇳 Tongyi DeepResearch | 72.2% |
| 357 | WeMath | Math reasoning | Math reasoning benchmark spanning diverse curricula and difficulty levels. | ★★★★ |  | 🇨🇳 GLM-4.5V | 68.8% |
| 358 | WildBench V2 | Instruction following | WildBench V2 human preference benchmark for instruction following and helpfulness. | ★★★★ |  | 🇫🇷 Mistral Small 3.2 24B Instruct | 65.3% |
| 359 | Winogender | Gender bias (coreference) | Coreference resolution dataset for measuring gender bias. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 67.9% |
| 360 | WinoGrande | Coreference reasoning | Large-scale adversarial Winograd Schema-style pronoun resolution. | ★★★★ | 99 | 🇺🇸 Llama 3.1 405B Base | 86.7% |
| 361 | WinoGrande (DE) | Coreference reasoning (German) | German translation of the WinoGrande pronoun resolution benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.8% |
| 362 | WMT16 En–De | Machine translation | WMT16 English–German translation benchmark (news). | ★★★★ |  | 🇺🇸 Llama 3.3 70B Instruct | 38.8% |
| 363 | WMT16 En–De (Instruct) | Machine translation | Instruction-tuned evaluation on the WMT16 English–German translation set. | ★★★★ |  | 🇺🇸 Llama 3.3 70B Instruct | 37.9% |
| 364 | WritingBench | Writing quality | General-purpose writing quality benchmark. | ★★★★ |  | 🇨🇳 Qwen3-235B-A22B-Thinking-2507 | 88.3% |
| 365 | WSC | Coreference reasoning | Classic Winograd Schema Challenge measuring commonsense coreference. | ★★★★ |  | G Pythia 410M | 47.1% |
| 366 | xBench-DeepSearch | Agentic research | Evaluates multi-hop deep research workflows on xBench DeepSearch tasks. | ★★★★ |  | 🇨🇳 Tongyi DeepResearch | 75.0% |
| 367 | ZebraLogic | Logical reasoning | Logical reasoning benchmark assessing complex pattern and rule inference. | ★★★★ |  | 🇨🇳 Qwen3 MoE-2507 | 94.2% |
| 368 | ZeroBench | Zero-shot generalization | Evaluates zero-shot performance across diverse tasks without task-specific finetuning. | ★★★★★ |  | 🇨🇳 GLM-4.5V | 23.4% |
| 369 | ZeroBench (sub) | Zero-shot generalization | Subset of ZeroBench targeting harder zero-shot reasoning cases. | ★★★★★ |  | 🇺🇸 Gemini 2.5 Pro | 33.8% |
| 370 | ZeroSCROLLS MuSiQue | Long-context reasoning | ZeroSCROLLS split derived from MuSiQue multi-hop QA. | ★★★★ |  | 🇺🇸 Llama 3.3 70B Instruct | 0.5% |
| 371 | ZeroSCROLLS SpaceDigest | Long-context summarization | ZeroSCROLLS SpaceDigest extractive summarization task. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.8% |
| 372 | ZeroSCROLLS SQuALITY | Long-context summarization | ZeroSCROLLS split based on the SQuALITY long-form summarization benchmark. | ★★★★ |  | 🇺🇸 Llama-3_1-70B-TFree-HAT-SFT | 0.2% |