ProductClank Agents

Model rankings for agentic tasks.

Ranked by real-world agentic performance, not marketing claims. Models are scored on SWE-bench Verified (autonomous software engineering) and tool use reliability (function calling accuracy). Use the table below to find your best match.

SWE-bench %: autonomous coding score
★★★★★: BFCL tool use reliability
Context: max conversation length
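For anyone consuming these rankings programmatically, here is a minimal TypeScript sketch of one row and the default sort. The ModelRow type and its field names are illustrative assumptions, not a schema ProductClank or OpenClaw publishes.

```ts
// Illustrative row shape mirroring the table columns (not an official schema).
interface ModelRow {
  provider: string;
  model: string;
  sweBench: number | null;     // SWE-bench Verified %, null if unpublished
  toolStars: number;           // BFCL tool use reliability, 1-5 stars
  contextTokens: number;       // max conversation length, in tokens
  inputPerMTok: number | null; // USD per 1M input tokens, null if free/varies
}

// Default ranking: highest SWE-bench score first, unscored rows last.
function rankBySweBench(rows: ModelRow[]): ModelRow[] {
  return [...rows].sort((a, b) => (b.sweBench ?? -1) - (a.sweBench ?? -1));
}
```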

Top 5 by SWE-bench score

#1 Claude Opus 4.6 (Anthropic) - 80.8%
#2 MiniMax M2.5 (MiniMax) - 80.2%
#3 DeepSeek R1 (Together AI) - 79.8%
#4 Gemini 3 Flash (Google Gemini) - 78%
#5 Gemini 3.1 Pro (Google Gemini) - 77%

All 30 models

#  | Provider | Model | Tag | SWE-bench | Context | Speed | Input $/1M
1  | Anthropic | Claude Haiku 4.5 | Fast | n/a | 200K | Very Fast | $0.80
2  | MiniMax | MiniMax M2.5 Lightning | Fast | n/a | 1M | Very Fast | $0.20
3  | OpenAI | GPT-4.1 | Rec. | n/a | 1M | Fast | $2.00
4  | OpenAI | o3 | Power | n/a | 200K | Medium | $10.00
5  | OpenAI | GPT-4.1 Mini | Fast | n/a | 1M | Very Fast | $0.40
6  | Moonshot (Kimi) | Kimi K2.5 | Rec. | n/a | 200K | Fast | $0.25
7  | Moonshot (Kimi) | Kimi K2 | Fast | n/a | 200K | Fast | $0.15
8  | xAI (Grok) | Grok 4 | Rec. | n/a | 2M | Fast | $5.00
9  | xAI (Grok) | Grok 4.1 Fast | Fast | n/a | 2M | Fast | $2.00
10 | Groq | Llama 3.3 70B | Rec. | n/a | 128K | Ultra Fast | $0.59
11 | Groq | Llama 3.1 8B Instant | Fast | n/a | 128K | Ultra Fast | $0.05
12 | Mistral AI | Mistral Large | Rec. | n/a | 128K | Fast | $2.00
13 | Mistral AI | Mistral Small | Fast | n/a | 32K | Fast | $0.20
14 | Together AI | Llama 3.3 70B Turbo | Rec. | n/a | 128K | Fast | $0.88
15 | Together AI | Llama 3.1 8B Turbo | Fast | n/a | 128K | Very Fast | $0.18
16 | Hugging Face | Llama 3.3 70B | Rec. | n/a | 128K | Fast | $0.40
17 | Hugging Face | Llama 3.1 8B | Fast | n/a | 128K | Fast | $0.10
18 | Hugging Face | Qwen 2.5 72B | Power | n/a | 128K | Fast | $0.40
19 | Venice AI | Llama 3.3 70B (Private) | Rec. | n/a | 128K | Fast | Varies
20 | Cloudflare Workers AI | Llama 3.3 70B | Rec. | n/a | 128K | Fast | Free
21 | Cloudflare Workers AI | Llama 3.1 8B | Fast | n/a | 128K | Very Fast | Free
22 | Zhipu AI | GLM-4 Flash | Fast | n/a | 128K | Very Fast | Free
23 | OpenRouter | Auto (Smart Routing) | Rec. | n/a | 2M | Fast | Varies
24 | Anthropic | Claude Opus 4.6 | Power | 80.8% | 1M | Medium | $15.00
25 | MiniMax | MiniMax M2.5 | Rec. | 80.2% | 1M | Fast | $0.20
26 | Together AI | DeepSeek R1 | Power | 79.8% | 128K | Medium | $3.00
27 | Google (Gemini) | Gemini 3 Flash | Fast | 78% | 1M | Very Fast | $0.15
28 | Google (Gemini) | Gemini 3.1 Pro | Rec. | 77% | 2M | Medium | $2.50
29 | Zhipu AI | GLM-4.7 | Rec. | 73.8% | 128K | Fast | $0.10
30 | Anthropic | Claude Sonnet 4.6 | Rec. | 72% | 1M | Fast | $3.00

SWE-bench Verified (%) = agentic coding performance; higher is better; n/a = no verified score listed. Tool use stars = Berkeley Function Calling Leaderboard (BFCL) reliability score. Tags: Rec. = recommended, Fast = speed-optimized, Power = maximum capability. Input prices in USD per 1M tokens. Last updated March 2026; verify current numbers at provider sites.
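Because prices are quoted per 1M input tokens, estimating prompt cost is a single multiplication. A quick sketch using Claude Sonnet 4.6's $3.00/1M input price from the table:

```ts
// Input cost in USD = (tokens / 1,000,000) * price per 1M input tokens.
const inputCostUSD = (tokens: number, pricePerMTok: number): number =>
  (tokens / 1_000_000) * pricePerMTok;

console.log(inputCostUSD(50_000, 3.0)); // 50K-token prompt at $3.00/1M -> 0.15
```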

How we score models

SWE-bench Verified (%)

The gold standard for agentic coding performance. Models are given 500 real GitHub issues and must autonomously write code to solve them. A score of 70%+ means the model can handle 7 in 10 real-world engineering tasks without human help; at Claude Opus 4.6's 80.8%, that works out to 404 of the 500 issues resolved end to end. OpenClaw's agents rely heavily on this capability for file operations, cron jobs, and automated workflows.

Tool Use Reliability (★★★★★)

Based on the Berkeley Function Calling Leaderboard (BFCL) — the standard test for how reliably a model calls functions/tools correctly. OpenClaw uses tools for every core feature: scheduled messages, file operations, heartbeats, and more. Models scoring below ★★★ are excluded from our presets to prevent user-facing failures.
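The preset rule is simple to state in code. A minimal sketch, reusing the illustrative ModelRow shape from earlier; only the three-star cutoff comes from this page, everything else is assumed:

```ts
// Keep only models at or above ★★★ on BFCL, then rank the survivors by
// SWE-bench Verified score. ModelRow is the illustrative shape sketched
// above, not an official OpenClaw type.
function presetCandidates(rows: ModelRow[]): ModelRow[] {
  return rows
    .filter((row) => row.toolStars >= 3) // below ★★★ -> excluded from presets
    .sort((a, b) => (b.sweBench ?? -1) - (a.sweBench ?? -1));
}
```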

Pick any model, switch anytime

OpenClaw lets you choose your AI provider and model in your dashboard settings. Test a few — your agents, integrations, and history stay intact when you switch.
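As a rough illustration of what a switch amounts to, here is a hypothetical settings object; the field names and model identifiers are assumptions, not OpenClaw's actual settings schema, so treat the dashboard UI as the source of truth:

```ts
// Hypothetical shape of a provider/model selection; OpenClaw's real
// settings schema may differ.
interface AgentModelSettings {
  provider: string; // e.g. "anthropic", "minimax", "openrouter"
  model: string;    // provider-specific model identifier
}

// Switching from a recommended all-rounder to a cheaper, faster pick;
// agents, integrations, and history are unaffected by the change.
const before: AgentModelSettings = { provider: "anthropic", model: "claude-sonnet-4.6" };
const after: AgentModelSettings = { provider: "minimax", model: "minimax-m2.5" };
```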