AI War Tracker

τ²-bench

Sierra Lab's next-generation tool-use benchmark: multi-turn customer-service workflows under realistic constraints.

Models: 296
Top score: 99.1
Median: 35.1
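
The three headline figures can be recomputed from the ranking scores. A minimal sketch follows, using a small illustrative subset of scores rather than the full list of 296. One observation from the listed scores: the reported median of 35.1 equals the score at rank 149, whereas averaging the two middle ranks (148 at 36.5 and 149 at 35.1) would give 35.8. The page therefore appears to report the lower of the two middle values; that is an inference, not something the page states. The function name `summarize` is hypothetical.

```python
import statistics

# Illustrative subset of scores taken from the ranking (NOT the full list).
# It is chosen so that the two middle values are the scores at ranks
# 148 and 149 of the real ranking: 36.5 and 35.1.
scores = [99.1, 98.8, 36.5, 35.1, 0.0, 0.0]


def summarize(scores):
    """Return model count, top score, and two median conventions."""
    return {
        "models": len(scores),
        "top": max(scores),
        # Lower of the two middle values when the count is even.
        "median_low": statistics.median_low(scores),
        # Conventional median: mean of the two middle values.
        "median_mean": statistics.median(scores),
    }


summary = summarize(scores)
print(summary)
```

On an even-sized list the two conventions differ whenever the middle scores differ, which is the case at ranks 148 and 149 here.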

State of the art over time

Each point is a model at its release date; the line traces the best score to date.

[Chart: τ²-bench score (y-axis, 0 to 100) against model release date (x-axis, 2023 to 2026), one point per model.]

Models that set a new best score, in release order:

Mistral 7B Instruct: 0 (2023-09-27)
Solar Mini: 20.2 (2024-01-25)
Claude 3 Haiku: 21.1 (2024-03-13)
GPT-4o: 28.9 (2024-05-13)
Mistral Large 2: 33 (2024-07-24)
Qwen2.5 72B Instruct: 34.5 (2024-09-19)
Pixtral Large: 36.5 (2024-11-19)
o1: 62.6 (2024-12-05)
Grok 3 mini Reasoning: 90.4 (2025-02-19)
Kimi K2 Thinking: 93 (2025-11-06)
Grok 4.1 Fast: 93.3 (2025-11-19)
MiMo-V2-Flash: 95 (2025-12-14)
GLM 4.7: 95.9 (2025-12-22)
GLM 4.7 Flash: 98.8 (2026-01-19)
JT-35B-Flash: 99.1 (2026-05-14)
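
The best-to-date line described in the caption can be read as a running maximum over release dates. The sketch below is one way to compute it, assuming a point counts as a new best only when it strictly exceeds every earlier score (how the page breaks ties is not stated). The sample points are taken from the chart data; the function name `best_to_date` is hypothetical.

```python
from datetime import date

# (model, score, release date): a small sample of points from the chart,
# deliberately listed out of date order.
points = [
    ("Llama 3 8B Instruct", 0.0, date(2024, 4, 18)),
    ("Mistral 7B Instruct", 0.0, date(2023, 9, 27)),
    ("GPT-4o", 28.9, date(2024, 5, 13)),
    ("Solar Mini", 20.2, date(2024, 1, 25)),
    ("Llama 3.1 405B Instruct", 19.0, date(2024, 7, 23)),
    ("Claude 3 Haiku", 21.1, date(2024, 3, 13)),
    ("Mistral Large 2", 33.0, date(2024, 7, 24)),
]


def best_to_date(points):
    """Return the points that strictly beat every earlier score, by date."""
    records = []
    best = float("-inf")  # so the earliest point always counts as a record
    for model, score, released in sorted(points, key=lambda p: p[2]):
        if score > best:  # assumption: ties do not set a new best
            best = score
            records.append((model, score, released))
    return records


frontier = best_to_date(points)
for model, score, released in frontier:
    print(f"{released.isoformat()}  {model}: {score}")
```

Points that do not beat the running best, such as Llama 3 8B Instruct and Llama 3.1 405B Instruct in this sample, still appear on the chart as individual dots but do not move the line.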

Ranking

#1 JT-35B-Flash: 99.1
#2 GLM 4.7 Flash: 98.8
#3 GLM 5 Turbo: 98.5
#4 GLM 5V Turbo: 98.5
#5 GLM-5: 98.2
#6 Qwen3.6 Plus: 97.7
#7 GLM 5.1: 97.7
#8 Grok 4.3: 97.7
#9 Grok 4.20 0309: 96.5
#10 DeepSeek-V4-Pro: 96.2
#11 GLM 4.7: 95.9
#12 Qwen3.6 Max: 95.9
#13 Kimi K2.6: 95.9
#14 Kimi K2.5: 95.9
#15 Qwen3.5 397B A17B: 95.6
#16 DeepSeek-V4-Flash: 95.6
#17 Gemini 3.5 Flash: 95.6
#18 Gemini 3.1 Pro: 95.6
#19 MiniMax M2.5: 95.3
#20 Qwen3.6 35B A3B: 95.3
#21 MiMo-V2-Flash: 95
#22 MiMo-V2-Pro: 95
#23 Qwen3.7 Max: 94.7
#24 Claude Opus 4.8: 94.4
#25 Step 3.5 Flash: 94.4
#26 MiMo-V2.5-Pro: 94.2
#27 Qwen3.6 27B: 94.2
#28 Mistral Medium 3.5: 94.2
#29 Qwen3.5-27B: 93.9
#30 GPT-5.5: 93.9
#31 Qwen3.5-122B-A10B: 93.6
#32 Tri-21B-Think: 93.3
#33 Grok 4.1 Fast: 93.3
#34 JT-MINI: 93
#35 Grok 4.20 0309 v2: 93
#36 Kimi K2 Thinking: 93
#37 Nova 2.0 Pro: 92.7
#38 Hy3: 92.7
#39 Ring-2.6-1T: 92.4
#40 Qwen3.5 4B: 92.1
#41 GPT-5.2-Codex: 92.1
#42 Claude Opus 4.6: 92.1
#43 Muse Spark: 91.5
#44 MiMo-V2-Omni: 91.2
#45 MiMo-V2.5: 90.6
#46 DeepSeek-V3.2: 90.6
#47 Grok 3 mini Reasoning: 90.4
#48 Trinity Large Thinking: 90.1
#49 Ling-2.6-1T: 89.8
#50 KAT-Coder-Pro V2: 89.5
#51 Claude Opus 4.5: 89.5
#52 Qwen3.5-35B-A3B: 89.2
#53 KAT-Coder-Pro V1: 88.6
#54 Claude Opus 4.7: 88.6
#55 Qwen3.5 Omni Plus: 88.3
#56 MiMo-V2-Omni-0327: 88
#57 MiniCPM-V 4.6 1.3B: 87.7
#58 HyperCLOVA X SEED Think: 87.4
#59 Step 3.5 Flash 2603: 87.4
#60 Gemini 3 Pro: 87.1
#61 GPT-5.4: 87.1
#62 GPT-5 Codex: 86.8
#63 Qwen3.5-9B: 86.8
#64 MiniMax-M2: 86.8
#65 Mi:dm K 2.5 Pro: 86.5
#66 GPT-5: 86.5
#67 Solar Pro 3: 86.3
#68 GPT-5.3-Codex: 86
#69 Ling-2.6-flash: 86
#70 MiniMax M2.1: 85.4
#71 MiniMax M2.7: 84.8
#72 GPT-5.2: 84.8
#73 Qwen3.5 Omni Flash: 84.5
#74 ERNIE 5.0 Thinking: 83.9
#75 Qwen3 Max Thinking: 83.6
#76 GPT-5.4 mini: 83.3
#77 GPT-5.1-Codex: 83
#78 MiniCPM5-1B: 82.5
#79 GPT-5.1: 81.9
#80 Qwen3.5 2B: 81.6
#81 Command A: 80.7
#82 o3: 80.7
#83 Nova 2.0 Omni: 80.4
#84 Gemini 3 Flash: 80.4
#85 LongCat Flash Lite: 79.5
#86 Qwen3 Coder Next: 79.5
#87 Claude Sonnet 4.6: 79.5
#88 EXAONE 4.5 33B: 78.1
#89 Claude Sonnet 4.5: 78.1
#90 GLM-4.6: 76.9
#91 GPT-5.4 nano: 76
#92 Grok Code Fast 1: 75.7
#93 Nova 2 Lite: 75.7
#94 Grok 4: 74.9
#95 K-EXAONE: 74.3
#96 Qwen3 Max: 74.3
#97 Kimi K2 0905: 73.4
#98 Claude Opus 4: 73.4
#99 Claude Opus 4.1: 71.4
#100 GPT-5 mini: 71.1
#101 Mercury 2: 70.8
#102 Apriel-v1.6-15B-Thinker: 69.3
#103 Apriel-v1.5-15B-Thinker: 68.4
#104 NVIDIA Nemotron 3 Super 120B A12B: 67.8
#105 gpt-oss-120b: 65.8
#106 Grok 4 Fast: 65.8
#107 Gemma 4 31B: 65.5
#108 Qwen3.5 0.8B: 65.2
#109 Claude Sonnet 4: 64.6
#110 GPT-5.1-Codex-Mini: 62.9
#111 o1: 62.6
#112 Kimi K2: 61.1
#113 gpt-oss-20b: 60.2
#114 Doubao Seed Code: 58.2
#115 o4-mini: 55.6
#116 Claude Haiku 4.5: 54.7
#117 Claude 3.7 Sonnet: 54.7
#118 Qwen3 VL 235B A22B: 54.1
#119 Gemini 2.5 Pro: 54.1
#120 Qwen3 235B A22B 2507: 53.2
#121 Nemotron Cascade 2 30B A3B: 53.2
#122 GPT-4.1 Mini: 52.9
#123 Magistral Medium 1.2: 52
#124 Seed-OSS-36B-Instruct: 49.4
#125 Grok-3: 48.8
#126 Solar Open 100B: 48.2
#127 DeepSeek-V3 0324: 47.1
#128 GPT-4.1: 47.1
#129 Sarvam 105B: 46.8
#130 Motif-2-12.7B-Reasoning: 46.5
#131 GLM 4.5 Air: 46.5
#132 Qwen3 VL 32B: 45.6
#133 Nemotron 3 Nano Omni 30B A3B Reasoning: 45.3
#134 Qwen3 Coder 480B A35B Instruct: 43.6
#135 Gemma 4 26B A4B: 43.6
#136 GLM-4.5: 43
#137 Granite 4.1 30B: 42.1
#138 Qwen3-Next-80B-A3B: 41.5
#139 Mistral Small 4: 41.2
#140 NVIDIA Nemotron 3 Nano 30B A3B: 40.9
#141 Mistral Medium 3.1: 40.6
#142 Nova Premier: 38.3
#143 Devstral Small: 38
#144 DeepSeek-V3.1: 37.4
#145 DeepSeek V3.1 Terminus: 37.1
#146 DeepSeek-R1-0528: 36.5
#147 Pixtral Large: 36.5
#148 GPT-5 nano: 36.5
#149 Qwen3 VL 235B A22B Instruct: 35.1
#150 Sarvam 30B: 34.5
#151 Qwen2.5 72B Instruct: 34.5
#152 Qwen3 14B: 34.5
#153 Qwen3 Coder 30B A3B Instruct: 34.5
#154 MiniMax M1 80k: 34.2
#155 DeepSeek V3.2 Exp: 33.9
#156 Qwen3-235B-A22B-Instruct-2507: 33.3
#157 Mistral Large 2: 33
#158 Ling-1T: 32.7
#159 Solar Pro 2: 31.9
#160 MiniMax M1 40k: 31.6
#161 GLM 4.6V: 31.6
#162 Gemini 2.5 Flash: 31.6
#163 Gemini 3.1 Flash Lite: 31.3
#164 o3-mini: 31.3
#165 Qwen3 32B: 29.8
#166 Mistral Small 3.2: 29.5
#167 Gemini 2.0 Flash: 29.5
#168 Qwen3 VL 8B Instruct: 29.2
#169 Qwen3 VL 32B Instruct: 29.2
#170 GPT-4o: 28.9
#171 Qwen3 30B A3B 2507: 28.1
#172 NVIDIA Nemotron 3 Nano 4B: 28.1
#173 Llama Nemotron Super 49B v1.5: 28.1
#174 K2-V2: 27.8
#175 Falcon-H1R-7B: 27.8
#176 Magistral Small 1.2: 27.8
#177 Qwen3 8B: 27.8
#178 Granite 4.1 8B: 27.8
#179 Qwen3 235B A22B: 27.2
#180 Ministral 3 14B: 27.2
#181 Llama-3.3 Nemotron Super 49B v1: 26.9
#182 Qwen3 4B 2507 Instruct: 26.6
#183 Magistral Small 1: 26.6
#184 Hermes 4 - Llama-3.1 405B: 26.6
#185 INTELLECT-3: 26.6
#186 Llama 3.3 70B Instruct: 26.6
#187 Ministral 3 8B: 26.6
#188 Ring-1T: 26.3
#189 Qwen3 1.7B: 26
#190 Gemma 4 E4B: 26
#191 Qwen3 30B A3B: 26
#192 Qwen3 4B 2507: 25.4
#193 K2 Think V2: 25.4
#194 Mistral Small 3.1: 25.1
#195 Ministral 3 3B: 24.9
#196 Devstral 2: 24.9
#197 Mistral Large 3: 24.6
#198 Claude 3.5 Haiku: 24.6
#199 Mistral Medium 3: 24.3
#200 Qwen3 VL 4B Instruct: 23.4
#201 NVIDIA Nemotron Nano 9B V2: 23.4
#202 Devstral Small 2: 23.4
#203 Magistral Medium 1: 23.1
#204 Llama 3.1 Nemotron 70B Instruct: 23.1
#205 Granite 4.0 1B: 22.8
#206 DeepSeek-V3: 22.8
#207 Qwen3 VL 8B: 22.5
#208 Hermes 4 - Llama-3.1 70B: 22.5
#209 GLM 4.5V: 22.5
#210 Gemma 4 E2B: 22.2
#211 DeepSeek R1 Distill Llama 70B: 21.9
#212 Nanbeige4.1-3B: 21.6
#213 Qwen3 Next 80B A3B Instruct: 21.6
#214 Qwen3 Omni 30B A3B: 21.3
#215 Olmo 3.1 32B Instruct: 21.3
#216 NVIDIA Nemotron Nano 12B v2 VL: 21.3
#217 Qwen3 0.6B: 21.1
#218 Claude 3 Haiku: 21.1
#219 Llama 3.2 3B Instruct: 21.1
#220 Ling-flash-2.0: 20.8
#221 Exaone 4.0 1.2B: 20.5
#222 Solar Mini: 20.2
#223 Qwen3 VL 30B A3B: 19.9
#224 Devstral Medium: 19.9
#225 Granite 4.0 H 1B: 19.6
#226 Granite 4.1 3B: 19.6
#227 LFM2.5-1.2B-Thinking: 19.6
#228 Mistral Small 3: 19.6
#229 Qwen3 4B: 19
#230 Llama 3.1 405B Instruct: 19
#231 Gemini 2.5 Flash Lite: 19
#232 Qwen3 VL 30B A3B Instruct: 19
#233 Llama 4 Maverick: 17.8
#234 Nova Lite: 17.5
#235 EXAONE 4.0 32B: 17.3
#236 Granite 4.0 H Small: 17.3
#237 GPT-4.1 Nano: 17.3
#238 Qwen3 Omni 30B A3B Instruct: 16.4
#239 Llama 3.1 8B Instruct: 16.4
#240 Step3 VL 10B: 16.1
#241 Jamba Reasoning 3B: 15.8
#242 Qwen3 VL 4B: 15.5
#243 Llama 4 Scout: 15.5
#244 Llama 3.1 70B Instruct: 15.2
#245 Granite 4.0 H 350M: 14.6
#246 Llama 3.2 11B Instruct: 14.6
#247 Nova Pro: 14
#248 Nova Micro: 14
#249 LFM2 2.6B: 13.5
#250 Jamba Large 1.7: 13.5
#251 Ling-mini-2.0: 13.2
#252 Granite 4.0 350M: 13.2
#253 Apertus 70B Instruct: 12.9
#254 LFM2 1.2B: 12.6
#255 Jamba 1.7 Mini: 12.6
#256 Olmo 3 7B Instruct: 12.6
#257 Granite 4.0 Micro: 12.6
#258 Llama 3.1 Nemotron Nano 4B v1.1: 11.7
#259 Apertus 8B Instruct: 11.4
#260 Llama 3.1 Nemotron Ultra 253B v1: 11.4
#261 DeepSeek-R1: 11.4
#262 LFM2-24B-A2B: 11.1
#263 Gemma 3 12B Instruct: 10.8
#264 LFM2.5-1.2B-Instruct: 10.8
#265 Granite 3.3 8B: 10.5
#266 Gemma 3 1B Instruct: 10.5
#267 Gemma 3 27B Instruct: 10.5
#268 LFM2 8B A1B: 10.5
#269 Qwen3 30B A3B 2507 Instruct: 10.2
#270 Gemma 3 270M: 9.1
#271 LFM2.5-VL-1.6B: 8.5
#272 Phi 4 Mini Instruct: 8.2
#273 Gemma 3n E4B Instruct: 5
#274 Gemma 3 4B Instruct: 5
#275 Sarvam M: 0
#276 OLMo 2 7B: 0
#277 OLMo 2 32B: 0
#278 Phi-3 Mini Instruct 3.8B: 0
#279 DeepSeek R1 0528 Qwen3 8B: 0
#280 Mistral 7B Instruct: 0
#281 Gemma 3n E2B Instruct: 0
#282 Ring-flash-2.0: 0
#283 Tiny Aya Global: 0
#284 Olmo 3.1 32B Think: 0
#285 Molmo2-8B: 0
#286 Olmo 3 7B Think: 0
#287 Molmo 7B-D: 0
#288 Kimi Linear 48B A3B Instruct: 0
#289 Reka Flash 3: 0
#290 Llama 3 8B Instruct: 0
#291 Llama 3 70B Instruct: 0
#292 Llama 3.2 1B Instruct: 0
#293 Phi 4: 0
#294 ERNIE 4.5 300B A47B: 0
#295 Olmo 3 32B Think: 0
#296 DeepSeek V3.2 Speciale: 0
