@getboski

Your agent framework doesn't matter if the model underneath it can't call tools reliably. An agent that hallucinates function names, misparses JSON arguments, or forgets its context mid-chain will fail regardless of how clever your orchestration logic is. The model is the engine. Everything else is plumbing.

Open-weight models have reached a turning point for agent workloads in 2026. Multiple families now support native function calling, structured output, and multi-turn tool use out of the box. Some outperform proprietary alternatives on the Berkeley Function Calling Leaderboard (BFCL) while costing a fraction as much per token. The trade-off used to be capability versus control. That gap has mostly closed.

We tested and ranked eight open-weight models specifically for agent use cases: tool calling accuracy, multi-step reasoning, context retention, hosting economics, and licensing terms. This isn't a general benchmark roundup. It's a buying guide for teams building agents that need to work in production.

How We Ranked

Five criteria, weighted by what actually breaks agents in the field:

  1. Tool-calling accuracy. Performance on BFCL V4 (which now tests agentic memory, error recovery, and multi-hop reasoning), plus real-world function calling reliability. This is the single highest-weighted factor.
  2. Reasoning depth. Scores on SWE-bench Verified, AIME, and LiveCodeBench. Agents that can't reason through multi-step problems will stall on anything beyond simple API wrappers.
  3. Context window and retention. Raw context length matters less than how well the model uses it. A 128K window with strong retrieval beats a 1M window with degraded recall at 200K tokens.
  4. Hosting cost. Self-hosted inference costs (GPU requirements, quantization support) and API pricing through providers like Together, Fireworks, and Groq. MoE models with low active parameters have a structural advantage here.
  5. License. Apache 2.0 and MIT allow commercial use without restrictions. Meta's Llama Community License has a 700M monthly active user threshold. Some enterprise teams treat anything short of Apache 2.0 as a procurement blocker.

Benchmark numbers come from official model cards, the BFCL V4 leaderboard, SWE-bench Verified, and Artificial Analysis. Where models have multiple versions (DeepSeek V3 vs. V3.1 vs. V3.2), we used the latest stable release available as of March 2026.

At a Glance

| Rank | Model | Best Agent Use Case | BFCL V4 | Context | License |
|------|-------|---------------------|---------|---------|---------|
| 1 | Qwen3-235B-A22B | General-purpose agents, MCP | ~68% | 131K | Apache 2.0 |
| 2 | DeepSeek-V3.1 | Coding agents, cost-efficient | ~65% | 128K | MIT |
| 3 | Llama 4 Maverick | Long-context, multimodal agents | ~63% | 1M | Llama 4 Community |
| 4 | Mistral Large 2 | Multilingual tool calling | ~66% | 128K | Mistral Research |
| 5 | Command R+ | RAG-heavy agents | ~60% | 128K | CC-BY-NC-4.0 |
| 6 | Gemma 3 27B | Edge deployment, mobile agents | ~55% | 128K | Gemma Terms |
| 7 | Phi-4-mini | On-device agents, low compute | ~50% | 128K | MIT |
| 8 | Yi-Large | Multilingual, knowledge retrieval | ~52% | 32K | Apache 2.0 |

BFCL V4 scores are the overall composite (Agentic 40% + Multi-Turn 30% + Live 10% + Non-Live 10% + Hallucination 10%) and are approximate, based on the latest available leaderboard data.

1. Qwen3-235B-A22B

Developer: Alibaba (Qwen Team) · Parameters: 235B total, 22B active · Architecture: MoE with MLA

Qwen3 is the model to beat for general agent workloads in 2026. It supports both thinking and non-thinking modes, switching between slow deliberative reasoning and fast tool execution depending on the task. The Qwen team specifically optimized Qwen3 for Model Context Protocol (MCP) compatibility, making it one of the first open-weight models with first-class MCP support.

On benchmarks that matter for agents, Qwen3 scores 95.6 on ArenaHard and 77.1 on LiveBench, putting it within striking distance of Gemini 2.5 Pro on instruction following. Its BFCL performance leads among open-weight models for multi-step tool use. The MoE architecture keeps inference costs manageable: only 22B parameters activate per token, so you get flagship-class reasoning without flagship-class GPU bills.

The practical advantage is the Qwen-Agent library, which bundles tool-calling templates and argument parsers that handle the messy parts of function calling. If you're building with LangGraph or CrewAI, Qwen3 slots in cleanly. Apache 2.0 licensing means no commercial restrictions. For teams building multi-tool agents that need to plan, reason, and execute reliably, Qwen3 is the default starting point. (For a deeper comparison with other open-weight families, see our open-weight models comparison.)
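As a concrete starting point, here is a minimal sketch of the OpenAI-style tool definition most Qwen3 hosts accept. The `get_weather` tool, its parameters, and the model id string are illustrative assumptions, not part of any real API; check your provider's docs for the exact model identifier.

```python
# Sketch of an OpenAI-format tool definition for a Qwen3 endpoint.
# The weather tool and its schema are made up for illustration.

def make_tool(name, description, properties, required):
    """Wrap a JSON-Schema parameter spec in the OpenAI tool format."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }

weather_tool = make_tool(
    "get_weather",
    "Look up current weather for a city.",
    {"city": {"type": "string", "description": "City name"}},
    ["city"],
)

# The request body you would POST to a chat-completions endpoint:
request = {
    "model": "Qwen/Qwen3-235B-A22B",  # provider-specific model id (assumption)
    "messages": [{"role": "user", "content": "Weather in Lisbon?"}],
    "tools": [weather_tool],
    "tool_choice": "auto",
}
```

The same payload shape works across Together, Fireworks, and other OpenAI-compatible hosts, which is what makes framework integration straightforward.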

Hosting: all 235B parameters must be resident in memory even though only 22B activate per token, so expect roughly 6-8x A100 80GB at FP16 or 2x A100 80GB at INT4. API pricing: $0.20/1M input tokens via Together AI.

2. DeepSeek-V3.1

Developer: DeepSeek · Parameters: 671B total, 37B active · Architecture: MoE with 256 experts + MLA

DeepSeek's V3 line has evolved rapidly. V3.1 merged the base V3 model with R1's reasoning capabilities into a single hybrid that switches between thinking and non-thinking modes. The result is a coding agent powerhouse: 66.0% on SWE-bench Verified, compared to R1's 44.6%, representing a 48% improvement on real-world GitHub issue resolution.

For agent builders, the key improvement in V3.1 is post-training optimization specifically targeting tool usage and multi-step task execution. Function calls that V3 would occasionally malform, V3.1 handles cleanly. The model also excels at error recovery, identifying when a tool call returns unexpected output and adjusting its approach rather than repeating the same failing request.
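The error-recovery behavior described above amounts to a simple control loop on the agent side: feed each tool result back to the model and let it re-plan instead of blindly retrying. A minimal sketch, where `call_model` is a stub standing in for a real DeepSeek-V3.1 chat call and both tools are invented for illustration:

```python
# Sketch of a bounded error-recovery loop. `call_model` stubs the model:
# a real implementation would send `history` to the API and parse the
# next proposed tool call from the response.

def call_model(history):
    # Stubbed policy: after seeing an error result, switch to a fallback
    # tool rather than repeating the same failing request.
    if any(m["role"] == "tool" and "error" in m["content"] for m in history):
        return {"tool": "search_fallback", "args": {"q": "lisbon weather"}}
    return {"tool": "weather_api", "args": {"city": "Lisbonn"}}  # typo'd city

def run_tool(call):
    if call["tool"] == "weather_api" and call["args"]["city"] != "Lisbon":
        return "error: unknown city"
    return "ok: 21C"

history = [{"role": "user", "content": "Weather in Lisbon?"}]
for _ in range(3):  # bounded retries so a confused model can't loop forever
    call = call_model(history)
    result = run_tool(call)
    history.append({"role": "tool", "content": result})
    if result.startswith("ok"):
        break
```

The bound on retries matters in production: a model that keeps malforming the same call should fail fast, not burn tokens.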

Cost is DeepSeek's structural advantage. At $0.15 per million input tokens and $0.75 per million output tokens, it's the cheapest frontier-class model available. The MIT license removes all commercial barriers. The main drawback is hosting complexity: 671B total parameters mean you'll need significant infrastructure for self-hosted deployment, even with quantization. Most teams will use DeepSeek through API providers. (For background on DeepSeek's architecture, see our DeepSeek explainer.)

Hosting: 4-8x A100 80GB (FP16, depends on parallelism strategy). API: $0.15/1M input, $0.75/1M output via DeepSeek API.

3. Llama 4 Maverick

Developer: Meta · Parameters: 400B total, 17B active · Architecture: MoE with 128 routed experts + 1 shared expert

Llama 4 Maverick crossed 1,400 on the LMArena benchmark, beating GPT-4o, DeepSeek V3, and Gemini 2.0 Flash in head-to-head evaluations. Its defining feature for agent use is the 1M token context window combined with only 17B active parameters per inference pass. That's an unusual combination: massive context at relatively modest compute cost.

For agent workloads, Maverick supports native function calling through an OpenAI-compatible tool interface. Fireworks AI and other providers expose structured function_call objects, so integration with existing agent frameworks requires minimal adaptation. The 128-expert MoE architecture routes each token to exactly one routed expert plus the shared expert, keeping latency predictable even on long contexts.
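Those structured `function_call` objects follow the standard chat-completions shape, so the consuming code is the same as for any OpenAI-compatible endpoint. A sketch of the parsing step, using an illustrative response fragment (not a live API result):

```python
import json

# Handling the structured tool-call objects an OpenAI-compatible
# Maverick endpoint returns. Field names follow the standard
# chat-completions response format.

response = {  # illustrative fragment, hand-written for this example
    "choices": [{
        "message": {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_0",
                "type": "function",
                "function": {
                    "name": "search_docs",
                    "arguments": '{"query": "refund policy"}',
                },
            }],
        }
    }]
}

for call in response["choices"][0]["message"]["tool_calls"]:
    name = call["function"]["name"]
    # Arguments arrive as a JSON string, not a parsed object:
    args = json.loads(call["function"]["arguments"])
```

The `json.loads` step is where malformed arguments surface; wrapping it in a try/except and feeding the parse error back to the model is a common recovery pattern.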

The limitation is the Llama 4 Community License. It's permissive for most teams, but organizations exceeding 700 million monthly active users need a separate agreement with Meta. For the vast majority of production deployments, this isn't a practical concern. Where Maverick falls short compared to Qwen3 is raw tool-calling accuracy on complex multi-step chains. It handles single-turn and parallel function calls well but occasionally loses track during longer sequential tool-use sessions.

Hosting: all 400B parameters must be resident in memory even at INT4 (~200GB), so expect a multi-GPU node (on the order of 4x A100 80GB); the 17B active parameters reduce compute per token, not memory. API: $0.15-0.27/1M input tokens.

4. Mistral Large 2

Developer: Mistral AI · Parameters: 123B (dense) · Architecture: Dense Transformer with GQA

Mistral Large 2 takes a different architectural approach. Instead of MoE, it's a fully dense 123B parameter model, meaning every parameter activates on every token. That makes it more expensive to run than MoE alternatives, but it also means more predictable behavior. Dense models don't have the expert routing variance that occasionally causes MoE models to produce inconsistent outputs on similar inputs.

Where Mistral Large 2 outperforms larger models is function calling. In benchmarks comparing parallel and sequential tool use, it beat both GPT-4o and Claude 3.5 Sonnet at the time of its release. It natively supports parallel function calling, letting an agent dispatch multiple tool requests simultaneously and aggregate results. The 128K context window handles substantial conversation histories and retrieval contexts without truncation.
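On the agent side, dispatching a parallel batch of tool calls and aggregating the results is a plain fan-out/fan-in. A sketch with two stub tools standing in for real API-backed functions (the tool names and return shapes are invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

# Fan-out/fan-in for a parallel tool-call batch. Both tools are stubs.

def get_stock(ticker):
    return {"ticker": ticker, "price": 100.0}   # stub

def get_news(ticker):
    return {"ticker": ticker, "headlines": []}  # stub

TOOLS = {"get_stock": get_stock, "get_news": get_news}

# Tool calls as the model might emit them in one parallel batch:
calls = [
    {"name": "get_stock", "args": {"ticker": "ACME"}},
    {"name": "get_news", "args": {"ticker": "ACME"}},
]

with ThreadPoolExecutor() as pool:
    futures = [pool.submit(TOOLS[c["name"]], **c["args"]) for c in calls]
    results = [f.result() for f in futures]  # preserves call order
```

Because the futures list preserves order, each result can be appended back to the conversation as a tool message matched to its originating call.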

The multilingual support is genuinely strong: over 80 coding languages and dozens of natural languages including Chinese, Japanese, Arabic, and Hindi. For teams building agents that operate across language boundaries, Mistral Large 2 is the safest open-weight choice. The catch is the Mistral Research License: unlike Qwen3's Apache 2.0, it restricts commercial deployment without a separate agreement with Mistral, so verify your use case qualifies before committing.

Hosting: 2-4x A100 80GB (dense architecture demands more memory). API: ~$0.50/1M input tokens.

5. Command R+

Developer: Cohere · Parameters: 104B · Architecture: Dense Transformer

Command R+ was purpose-built for retrieval-augmented generation, and it shows. The model generates inline citations by default, pointing to specific passages in its context that support each claim. For RAG-based agents that need to cite sources, this eliminates the post-processing step of mapping outputs back to retrieved documents.

On function calling, Command R+ outperformed GPT-4 Turbo on both the ToolTalk (Hard) benchmark and the Berkeley Function Calling Leaderboard when it launched. Its multi-step tool use capability handles error correction automatically: if a tool call fails, the model identifies the problem and retries with corrected arguments. The 128K context window accommodates large retrieval sets without chunking compromises.
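Consuming those inline citations is mostly bookkeeping: each citation carries character offsets into the generated text plus the ids of the supporting documents. The fragment below sketches that mapping; the citation shape (start/end offsets and `document_ids`) follows Cohere's chat API convention, but the response itself is hand-written for illustration.

```python
# Mapping Command R+-style inline citations back to source documents.
# The response fragment is illustrative, not a live API result.

docs = {
    "doc_0": {"title": "Refund policy", "snippet": "Refunds within 30 days."},
}

response = {
    "text": "You can get a refund within 30 days.",
    "citations": [
        {"start": 4, "end": 36,
         "text": "can get a refund within 30 days.",
         "document_ids": ["doc_0"]},
    ],
}

# Pair each cited span with the titles of its supporting documents:
cited = [
    (c["text"], [docs[d]["title"] for d in c["document_ids"]])
    for c in response["citations"]
]
```

For a RAG agent, this is the step that would otherwise require a separate post-processing model: the grounding comes back structured, for free.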

The main constraint is the CC-BY-NC-4.0 license for the open-weight version, which prohibits direct commercial use without a Cohere enterprise agreement. Teams that need RAG-optimized agents without licensing overhead should evaluate whether Qwen3 or DeepSeek can match Command R+'s citation quality for their specific retrieval patterns. For enterprise deployments where Cohere's managed service is acceptable, Command R+ remains the strongest choice for knowledge-intensive agent tasks.

Hosting: 2-4x A100 80GB. Enterprise API pricing through Cohere (varies by contract).

6. Gemma 3 27B

Developer: Google DeepMind · Parameters: 27B · Architecture: Dense Transformer

Gemma 3 occupies a sweet spot that the larger models can't reach: strong enough for real agent work, small enough to run on a single consumer GPU. The 27B variant fits on a single RTX 4090 with INT4 quantization, making it the most accessible model on this list for teams without cloud GPU budgets.

Google released FunctionGemma, a 270M parameter variant specifically fine-tuned for function calling on mobile and edge devices. This makes Gemma 3 the only family on our list with a purpose-built model for on-device agent deployment. The full 27B model supports function calling, planning, and structured output through standard tool-use interfaces, with the 128K context window providing enough room for complex agent memory.

Where Gemma 3 falls behind is raw reasoning depth. On SWE-bench Verified and multi-step tool-calling benchmarks, it consistently scores below the larger MoE models. It's not going to replace Qwen3 for complex coding agents. But for chatbots with tool access, smart home controllers, customer service agents, or any workload where latency and hosting cost matter more than peak reasoning, Gemma 3 is the practical pick. (For more on when smaller models beat larger ones, see our SLMs vs. LLMs comparison.)

Hosting: 1x RTX 4090 (INT4) or 1x A100 40GB (FP16). Free via Google AI Studio API.

7. Phi-4-mini

Developer: Microsoft · Parameters: 3.8B · Architecture: Dense Transformer

Phi-4-mini proves that 3.8 billion parameters can handle function calling. Microsoft added native tool-use support to the Phi-4 family, enabling both single and parallel function calls. For edge devices, IoT controllers, and mobile applications where running even a 7B model is impractical, Phi-4-mini is currently the best option.

The model performs surprisingly well on instruction following and basic reasoning for its size. Microsoft's post-training pipeline specifically targeted function calling accuracy, and the results show in structured scenarios with well-defined tool schemas. Where it struggles is ambiguity. Give Phi-4-mini a vaguely specified tool and it will sometimes hallucinate function names or fabricate URL parameters. Production deployments need strict input validation and well-constrained tool definitions.
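The strict validation layer suggested above can be as simple as a whitelist check before anything executes: reject hallucinated tool names and unexpected arguments outright. A minimal sketch, with an invented thermostat tool as the registry entry:

```python
# Guard layer for small models: validate every proposed tool call
# against a registry before execution. The tool schema is illustrative.

TOOL_SCHEMAS = {
    "set_thermostat": {"required": {"temp_c"}, "allowed": {"temp_c", "room"}},
}

def validate_call(name, args):
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return False, f"unknown tool: {name}"          # hallucinated name
    keys = set(args)
    if not schema["required"] <= keys:
        return False, f"missing args: {schema['required'] - keys}"
    if not keys <= schema["allowed"]:
        return False, f"unexpected args: {keys - schema['allowed']}"
    return True, "ok"

ok, _ = validate_call("set_thermostat", {"temp_c": 21})
bad, reason = validate_call("set_thermostt", {"temp_c": 21})  # typo'd name
```

Feeding the rejection reason back to the model as a tool message often lets even a 3.8B model self-correct on the next turn.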

The MIT license is maximally permissive. The model runs on laptop CPUs, Raspberry Pi-class hardware, and mobile chipsets. For teams building agent functionality into devices that can't phone home to a cloud API, Phi-4-mini is the only real option at this scale.

Hosting: Runs on CPU. 4GB RAM minimum. No GPU required for basic inference.

8. Yi-Large

Developer: 01.AI · Parameters: 70B (dense) · Architecture: Dense Transformer

Yi-Large trails the other models on this list in agent-specific capabilities, but it earns its place through strong multilingual performance and solid general reasoning. On the LMArena benchmark, it closely trails GPT-4, Gemini 1.5 Pro, and Claude 3 Opus in overall quality. Its particular strength is East Asian language performance, where it's competitive with models two to three times its size.

For agent workloads, Yi-Large handles knowledge retrieval, data classification, and conversational tasks competently. Tool-calling support exists but lags behind the dedicated function-calling optimization in Qwen3 or Mistral Large 2. The 32K context window is the smallest on this list and a real constraint for agents that need to maintain long conversation histories or process large retrieval sets.

The Apache 2.0 license and reasonable parameter count (70B fits on 2x A100 40GB) make it an accessible choice. Teams building multilingual agents focused on Chinese, Japanese, or Korean markets should benchmark Yi-Large against Qwen3 for their specific language distribution. For English-primary agent workloads, the other models on this list offer better tool-calling reliability.

Hosting: 2x A100 40GB (FP16) or 1x A100 80GB (INT4). API: $2.00/1M tokens (input and output).

Decision Matrix: Which Model for Which Agent?

| Agent Type | Top Pick | Runner-Up | Why |
|------------|----------|-----------|-----|
| Coding agent | DeepSeek-V3.1 | Qwen3-235B | 66% SWE-bench Verified, cheapest per token |
| RAG agent | Command R+ | Qwen3-235B | Native inline citations, grounded retrieval |
| Conversational agent | Llama 4 Maverick | Mistral Large 2 | 1M context for long conversations, low cost per turn |
| Multi-agent orchestration | Qwen3-235B | DeepSeek-V3.1 | Best overall tool calling, MCP support, thinking modes |
| Multilingual agent | Mistral Large 2 | Yi-Large | 80+ coding languages, dozens of natural languages |
| Edge/mobile agent | Phi-4-mini | Gemma 3 27B | 3.8B runs on CPU; Gemma fits single GPU |
| Budget-constrained | DeepSeek-V3.1 | Gemma 3 27B | $0.15/1M input; Gemma free on Google AI Studio |
| Maximum permissiveness | DeepSeek-V3.1 (MIT) | Qwen3 (Apache 2.0) | No restrictions, no thresholds, no procurement blockers |

What About GLM-4.5?

GLM-4.5 from Zhipu AI topped the BFCL V4 leaderboard at 70.9%, beating both Claude Opus 4.1 and Claude Sonnet 4 on function calling accuracy. We didn't include it in the main rankings because its English-language documentation and international hosting infrastructure are still maturing. For teams comfortable operating primarily in Chinese or with bilingual engineering staff, GLM-4.5 deserves serious evaluation. It's built on MoE architecture and specifically optimized for tool use, web browsing, and software development. Watch this space.

Frequently Asked Questions

Can open-weight models match GPT-4o or Claude for agent tasks?

On tool calling specifically, yes. Multiple open-weight models now match or exceed GPT-4o on the Berkeley Function Calling Leaderboard. Qwen3 and DeepSeek-V3.1 are competitive on multi-step reasoning benchmarks too. Where proprietary models still hold an edge is in the long tail of unusual tool-calling patterns and graceful error recovery on ambiguous inputs. For well-defined agent workflows with clear tool schemas, the gap is effectively closed.

How much does it cost to self-host these models?

The range is enormous. Phi-4-mini runs on a $200 laptop. Gemma 3 27B needs a single $1,600 RTX 4090. Qwen3-235B requires $30,000-60,000 in A100 infrastructure (or ~$2-4/hour on cloud GPU). DeepSeek-V3.1 at full precision needs 4-8x A100s. For most teams, the practical choice is between self-hosting a smaller model (Gemma, Phi) or using API providers for the larger ones. MoE models offer a middle ground: their active parameter count is much lower than total parameters, so quantized versions are more deployable than the headline numbers suggest.
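For the API route, the cost math is simple enough to sanity-check before committing to a provider. A back-of-envelope sketch using the DeepSeek prices quoted above; the token volumes are illustrative assumptions, not measurements:

```python
# Back-of-envelope API cost estimate. Prices are USD per 1M tokens
# (DeepSeek-V3.1: $0.15 input, $0.75 output, as quoted above).

def monthly_cost(input_toks, output_toks, in_price, out_price):
    """Return monthly USD cost given token volumes and per-1M prices."""
    return (input_toks * in_price + output_toks * out_price) / 1_000_000

# Hypothetical agent: 10k runs/month, ~8k input + 1k output tokens per run.
cost = monthly_cost(10_000 * 8_000, 10_000 * 1_000, 0.15, 0.75)
# 80M input tokens + 10M output tokens -> $12.00 + $7.50 = $19.50/month
```

Agents are input-heavy (tool schemas and results re-enter the context every turn), so input price usually dominates, which is exactly where DeepSeek undercuts the field.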

Which license is safest for commercial use?

MIT (DeepSeek, Phi-4-mini) and Apache 2.0 (Qwen3, Yi-Large) impose no meaningful commercial restrictions. Llama 4's Community License adds a 700M MAU threshold that affects only the largest consumer platforms. Mistral Large 2's Research License and Command R+'s CC-BY-NC both require a commercial agreement for production use. Gemma's terms are permissive but custom. If licensing simplicity is your top priority, DeepSeek's MIT license is the cleanest option.

Should I use thinking mode or non-thinking mode for agents?

Both Qwen3 and DeepSeek-V3.1 support hybrid thinking/non-thinking modes. Use thinking mode for complex planning steps: deciding which tools to call, interpreting ambiguous results, recovering from errors. Switch to non-thinking mode for straightforward tool execution where the action is obvious from context. Most agent frameworks let you toggle this per step. The latency overhead of thinking mode is 2-5x, so using it on every turn will noticeably slow your agent. (See our guide on Qwen's open-source approach for more on how thinking modes evolved.)
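The per-step toggle can live in the request builder. Qwen3 exposes an `enable_thinking` switch through its chat template; passing it as `chat_template_kwargs` matches the convention vLLM-style OpenAI-compatible servers use, but the exact mechanism varies by serving stack, so treat the field names here as assumptions to verify against your provider's docs:

```python
# Per-step thinking-mode toggle. The `chat_template_kwargs` /
# `enable_thinking` fields follow vLLM-style serving conventions
# and may differ on other stacks (assumption, verify for your host).

def build_request(messages, step_kind):
    # Deliberate only where it pays: planning and error recovery.
    think = step_kind in {"plan", "recover"}
    return {
        "model": "Qwen/Qwen3-235B-A22B",  # provider-specific model id
        "messages": messages,
        "chat_template_kwargs": {"enable_thinking": think},
    }

plan_req = build_request([{"role": "user", "content": "Plan the steps."}], "plan")
exec_req = build_request([{"role": "user", "content": "Call the tool."}], "execute")
```

Routing the toggle through a single builder function keeps the policy ("when do we pay the 2-5x latency?") in one place instead of scattered across the agent graph.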

