Mistral Medium 3.5 Hits 77.6% on SWE-Bench, Adds Cloud Coding Agents

Mistral AI on May 2 released Medium 3.5, a unified 128-billion-parameter model that fuses chat, reasoning, and code into a single set of weights — and paired the launch with cloud-based coding agents in its Vibe CLI and a new Work mode inside Le Chat. The combined release is Mistral's most aggressive push yet at the agentic-AI market dominated by OpenAI and Anthropic.

Medium 3.5 scores 77.6% on SWE-Bench Verified, the closest the industry has to a real-world coding test. That's well behind Claude Opus 4.7's 87.6% but ahead of Mistral's prior coding-focused Devstral 2 — and it lands at a price of $1.50 per million input tokens and $7.50 per million output tokens, roughly a third of what Anthropic charges for Opus 4.7.

What Mistral Shipped

The headline is the model itself. Medium 3.5 is a dense 128B model with a 256K context window, replacing the previous split between Mistral's chat model, its reasoning model, and Devstral for coding. Mistral describes it as the company's "first flagship merged model," meaning one set of weights handles instruction-following, multi-step reasoning, and code generation without routing across specialized variants.

On agentic benchmarks, Medium 3.5 hits 91.4% on τ³-Telecom, a tool-use evaluation that measures how reliably a model can chain external API calls. That score puts it in the same neighborhood as the top closed models on agent reliability, even if it trails them on raw coding.

The bigger product story is Vibe. Mistral's coding CLI now supports remote agents — long-running coding sessions that execute in isolated cloud sandboxes instead of your laptop. You spawn a remote agent from the Vibe CLI or Le Chat, hand it a task, and it works in the background until it opens a pull request on GitHub.

Why this matters: every major AI coding tool is moving in this direction. OpenAI's Codex, Anthropic's Claude Code, and Cursor's background agents all have similar models. Mistral was conspicuously behind on remote execution until today.

Le Chat Gets a Work Mode

Alongside the model, Mistral added Work mode to Le Chat (in preview). Work mode is built around a long-running agent powered by Medium 3.5 that can call tools in parallel across email, calendar, documents, Jira, and Slack. The agent maintains persistent sessions that survive across multiple steps until a task is complete.

Work mode includes transparency features that show every tool call the agent makes, and it requires explicit approval before sensitive actions like sending emails or posting to Slack. That governance layer is a direct response to enterprise concerns about agent autonomy that have slowed deployment of competing products.

Mistral is positioning Work mode as the European answer to ChatGPT Enterprise and Claude for Work — both of which have moved aggressively into multi-step agentic workflows over the past six months. The pitch is data residency in the EU plus parallel tool execution.

How It Stacks Up Against Rivals

The competitive reality on coding: Medium 3.5's 77.6% on SWE-Bench Verified puts it ahead of older open models but behind the closed frontier. Claude Opus 4.7 leads at 87.6%, OpenAI's GPT-5.5 sits in the low 80s on the same benchmark, and DeepSeek V4-Pro is in the mid-70s.

Where Mistral has an edge is unit economics. At $1.50 input and $7.50 output per million tokens, Medium 3.5 is roughly 3x cheaper than Opus 4.7 and 2x cheaper than GPT-5.5. For high-volume coding workloads — think automated PR review or background refactor agents — the per-task cost matters more than the last few benchmark points.

The model is also available with open weights through Hugging Face, which closed-frontier labs don't offer. That makes Medium 3.5 a real option for regulated enterprises and sovereign-AI customers in Europe.

Industry Implications

For developers, the immediate effect is a cheaper option for cloud-based coding agents. Vibe's remote execution means you can offload long-running tasks — large refactors, dependency upgrades, framework migrations — without keeping a terminal open.

For Mistral, the merged-model architecture is a strategic bet. Anthropic still ships separate Sonnet and Opus models. OpenAI maintains a family of GPT-5 variants. By collapsing chat, reasoning, and code into one model, Mistral is wagering that customers prefer a single endpoint to multi-model routing.

For OpenAI and Anthropic, the price point puts pressure on the middle of their pricing ladders — the segments where Sonnet and GPT-5 mid-tier compete. Both labs have been holding pricing steady for months as DeepSeek and now Mistral push the cost frontier down.

Expert Perspectives

Reaction on X was mixed. Researchers focused on the merged-architecture decision — flagging that fusing reasoning into a chat model usually costs latency on simple queries, but Mistral claims its routing inside the model handles that internally.

Some open-source developers were skeptical, pointing out that Mistral's earlier "open" releases shipped under restrictive licenses. Others welcomed Vibe's remote agents as the first credible European alternative to Codex and Claude Code.

What's Next

Medium 3.5 is in public preview today, available through Mistral's API, the Vibe CLI, Le Chat, and Hugging Face. Work mode is in preview inside Le Chat with a phased rollout to enterprise customers.

Mistral has signaled that a Large 3.5 model is in training and likely to ship later this quarter. The company also recently entered three-way partnership talks with xAI and Cursor, which would extend Medium 3.5's distribution if those discussions close.

The bottom line: Medium 3.5 won't dethrone Opus 4.7 on raw coding capability, but it lands a credible mid-tier flagship at an aggressive price with the cloud-agent infrastructure that customers now expect. If you're running a high-volume coding agent and the last few SWE-Bench points don't matter, this is the upgrade to evaluate this week.

Mistral Medium 3.5 Hits 77.6% on SWE-Bench, Adds Cloud Coding Agents

Mistral Medium 3.5 Hits 77.6% on SWE-Bench, Adds Cloud Coding Agents

What Mistral Shipped

Le Chat Gets a Work Mode

How It Stacks Up Against Rivals

Industry Implications

Expert Perspectives

What's Next

Sources

Don't fall behind

Related Articles

OpenAI's GeneBench-Pro Tests AI on Real Biology Research

Anthropic Launches Claude Fable 5: Its Most Capable Model Yet

China Plans $295B AI Data Center Buildout to Rival the US