Sorted by real-world agentic performance — not marketing claims. Every model is ranked by SWE-bench Verified (autonomous software engineering) and tool use reliability (function calling accuracy). Sort and filter to find your best match.
SWE-bench Verified (%) = agentic coding performance. Higher is better. Tool use stars = Berkeley Function Calling Leaderboard (BFCL) reliability score. Input prices in USD per 1M tokens. Last updated March 2026; verify current pricing and scores on provider sites.
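If you want to reproduce the table's ranking programmatically, it reduces to a filter on the tool-use threshold followed by a sort on SWE-bench score. A minimal TypeScript sketch, assuming a hypothetical `ModelRow` shape; the model names and numbers below are placeholders, not real leaderboard values:

```typescript
// Hypothetical row shape mirroring the table's columns.
// Values below are placeholders, not real leaderboard scores.
interface ModelRow {
  name: string;
  sweBenchVerified: number; // % of the 500 tasks resolved (higher is better)
  bfclStars: number;        // 1-5 tool-use reliability rating
  inputPricePer1M: number;  // USD per 1M input tokens
}

const rows: ModelRow[] = [
  { name: "model-a", sweBenchVerified: 72.1, bfclStars: 4, inputPricePer1M: 3.0 },
  { name: "model-b", sweBenchVerified: 65.4, bfclStars: 5, inputPricePer1M: 1.1 },
  { name: "model-c", sweBenchVerified: 48.9, bfclStars: 2, inputPricePer1M: 0.4 },
];

// Keep only models reliable enough for tool use, then rank by agentic coding score.
const ranked = rows
  .filter((r) => r.bfclStars >= 3)
  .sort((a, b) => b.sweBenchVerified - a.sweBenchVerified);

console.log(ranked.map((r) => `${r.name}: ${r.sweBenchVerified}%`).join("\n"));
```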
The gold standard for agentic coding performance. Models are given 500 real GitHub issues and must autonomously write code to resolve them. A score of 70%+ means the model resolved at least 7 in 10 of those issues without human help. OpenClaw's agents rely heavily on this capability for file operations, cron jobs, and automated workflows.
Based on the Berkeley Function Calling Leaderboard (BFCL), the standard benchmark for how reliably a model produces correct function/tool calls. OpenClaw uses tools for every core feature: scheduled messages, file operations, heartbeats, and more. Models scoring below ★★★ are excluded from our presets to prevent user-facing failures.
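To make "calls functions/tools correctly" concrete: the agent exposes a JSON-schema tool definition, and the model must respond with the right tool name and well-typed arguments. A minimal, provider-agnostic sketch; the `schedule_message` tool and its fields are hypothetical illustrations, not OpenClaw's actual API:

```typescript
// Hypothetical tool definition in the JSON-schema style most providers accept.
const scheduleMessageTool = {
  name: "schedule_message",
  description: "Schedule a message to be sent at a future time",
  parameters: {
    type: "object",
    properties: {
      recipient: { type: "string", description: "Who receives the message" },
      body: { type: "string", description: "Message text" },
      sendAt: { type: "string", description: "ISO 8601 timestamp" },
    },
    required: ["recipient", "body", "sendAt"],
  },
};

// What a correct tool call from the model looks like: the right tool name,
// with every required argument present and well-typed.
const modelCall = {
  tool: "schedule_message",
  arguments: {
    recipient: "team-standup",
    body: "Daily summary is ready.",
    sendAt: "2026-03-02T09:00:00Z",
  },
};

// BFCL-style failures: calling a tool that doesn't exist, omitting a required
// field, or returning free text instead of a structured call.
const missing = scheduleMessageTool.parameters.required.filter(
  (key) => !(key in modelCall.arguments)
);
console.log(missing.length === 0 ? "valid call" : `missing: ${missing.join(", ")}`);
```

A model that reliably produces calls like `modelCall` earns a high BFCL rating; one that frequently fails in the ways noted above would fall below the ★★★ cutoff.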
OpenClaw lets you choose your AI provider and model in your dashboard settings. Test a few — your agents, integrations, and history stay intact when you switch.