Sorted by real-world agentic performance — not marketing claims. Every model is ranked by SWE-bench Verified (autonomous software engineering) and tool use reliability (function calling accuracy). Sort and filter to find your best match.
SWE-bench Verified (%) = agentic coding performance. Higher is better. Tool use stars = Berkeley Function Calling Leaderboard (BFCL) reliability score. Input prices in USD per 1M tokens. Last updated March 2026; verify current pricing and scores on provider sites.
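If you want to reproduce the table's ranking programmatically, it reduces to a filter on the tool-use threshold followed by a sort on SWE-bench score. A minimal TypeScript sketch, assuming a hypothetical `ModelRow` shape; the model names and numbers below are placeholders, not real leaderboard values:

```typescript
// Hypothetical row shape mirroring the table's columns.
// Values below are placeholders, not real leaderboard scores.
interface ModelRow {
  name: string;
  sweBenchVerified: number; // % of the 500 tasks resolved (higher is better)
  bfclStars: number;        // 1-5 tool-use reliability rating
  inputPricePer1M: number;  // USD per 1M input tokens
}

const rows: ModelRow[] = [
  { name: "model-a", sweBenchVerified: 72.1, bfclStars: 4, inputPricePer1M: 3.0 },
  { name: "model-b", sweBenchVerified: 65.4, bfclStars: 5, inputPricePer1M: 1.1 },
  { name: "model-c", sweBenchVerified: 48.9, bfclStars: 2, inputPricePer1M: 0.4 },
];

// Keep only models reliable enough for tool use, then rank by agentic coding score.
const ranked = rows
  .filter((r) => r.bfclStars >= 3)
  .sort((a, b) => b.sweBenchVerified - a.sweBenchVerified);

console.log(ranked.map((r) => `${r.name}: ${r.sweBenchVerified}%`).join("\n"));
```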
The gold standard for agentic coding performance. Models are given 500 real GitHub issues and must autonomously write code to resolve them. A score of 70%+ means the model resolved at least 7 in 10 of those issues without human help. OpenClaw's agents rely heavily on this capability for file operations, cron jobs, and automated workflows.
Based on the Berkeley Function Calling Leaderboard (BFCL), the standard benchmark for how reliably a model produces correct function/tool calls. OpenClaw uses tools for every core feature: scheduled messages, file operations, heartbeats, and more. Models scoring below ★★★ are excluded from our presets to prevent user-facing failures.
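To make "calls functions/tools correctly" concrete: the agent exposes a JSON-schema tool definition, and the model must respond with the right tool name and well-typed arguments. A minimal, provider-agnostic sketch; the `schedule_message` tool and its fields are hypothetical illustrations, not OpenClaw's actual API:

```typescript
// Hypothetical tool definition in the JSON-schema style most providers accept.
const scheduleMessageTool = {
  name: "schedule_message",
  description: "Schedule a message to be sent at a future time",
  parameters: {
    type: "object",
    properties: {
      recipient: { type: "string", description: "Who receives the message" },
      body: { type: "string", description: "Message text" },
      sendAt: { type: "string", description: "ISO 8601 timestamp" },
    },
    required: ["recipient", "body", "sendAt"],
  },
};

// What a correct tool call from the model looks like: the right tool name,
// with every required argument present and well-typed.
const modelCall = {
  tool: "schedule_message",
  arguments: {
    recipient: "team-standup",
    body: "Daily summary is ready.",
    sendAt: "2026-03-02T09:00:00Z",
  },
};

// BFCL-style failures: calling a tool that doesn't exist, omitting a required
// field, or returning free text instead of a structured call.
const missing = scheduleMessageTool.parameters.required.filter(
  (key) => !(key in modelCall.arguments)
);
console.log(missing.length === 0 ? "valid call" : `missing: ${missing.join(", ")}`);
```

A model that reliably produces calls like `modelCall` earns a high BFCL rating; one that frequently fails in the ways noted above would fall below the ★★★ cutoff.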
OpenClaw lets you choose your AI provider and model in your dashboard settings. Test a few — your agents, integrations, and history stay intact when you switch.