Did Anthropic Nerf Claude Opus 4.6? The BridgeBench Debate
Krasa AI
2026-04-13
5 minute read
Did Anthropic Nerf Claude Opus 4.6? The BridgeBench Debate
A viral benchmark post is forcing Anthropic to defend the quality of its flagship model. On April 12, the BridgeMind account posted that Claude Opus 4.6 had dropped from 83.3% accuracy and a No. 2 ranking on its hallucination benchmark to 68.3% and No. 10 in a retest. The post racked up views fast and amplified months of growing developer complaints that Claude has quietly gotten worse.
Anthropic pushed back, and independent researchers are divided. The real answer is messier than either side is making it sound — and the implications for how the industry benchmarks frontier models are the most interesting part of the story.
The Claim That Went Viral
BridgeMind runs BridgeBench, an independent hallucination benchmark. In their April 12 post, they claimed Claude Opus 4.6 regressed 15 percentage points between runs, dropping eight ranking positions. The framing — "Opus 4.6 nerfed" — landed at a moment when the Claude developer community was already primed to believe it.
The post immediately became part of a broader narrative. A GitHub issue titled "Opus 4.6 Max 20x: systematic hallucinations, rule violations, 80% weekly usage wasted — April 2026" has hundreds of comments from paying Claude Code users reporting degraded performance on sustained coding tasks. A separate independent analysis by Marginlab of over 6,800 Claude Code sessions found that reasoning depth dropped roughly 67% by late February, and the file-read ratio before editing code fell from 6.6 to 2.0 — both proxies for how carefully the model investigates before acting.
Why the Benchmark Critics Have a Point
Computer scientist Paul Calcraft and other researchers were quick to note serious methodology problems with the BridgeBench comparison. The retest used a different set of tasks than the original run, which means the scores aren't directly comparable. On the overlapping tasks where apples-to-apples comparison is possible, the performance difference was minor.
This matters. Benchmark scores are only meaningful when the test set is held constant between runs. Swapping in different prompts and reporting a score change as model regression is the kind of error that makes the entire field's benchmark discourse less trustworthy.
In other words: the specific BridgeBench headline is probably wrong, but that doesn't automatically mean Claude Opus 4.6 is performing the way it did at launch.
Anthropic's Explanation
Anthropic has acknowledged user complaints and attributed some of the friction to specific product changes rather than weights regression. The company cited three factors: a UI-only change that reduces latency but "does not impact thinking itself," Opus 4.6's move to adaptive thinking by default on February 9, and a March 3 shift to medium effort level as the default.
The translation: the weights are the same, but the model is reasoning less hard by default than it used to. Users who remember the launch-day experience of Opus 4.6 are, in some sense, comparing today's medium-effort responses against the original high-effort responses — and noticing the gap.
This is a real product decision with real consequences. Anthropic is trying to balance cost, latency, and quality across a surging user base. Lowering default reasoning effort keeps the system usable for free and Plus users, but it changes the experience for power users who never opted into that tradeoff.
The Deeper Industry Problem
Whether or not Opus 4.6 is actually worse, this episode exposes a structural issue with how frontier models are shipped and measured. Models change after launch. Default settings change. Inference infrastructure changes. Rate limiters change. All of these affect quality, and none of them are surfaced to end users in a way that lets them separate "the model got worse" from "the defaults got cheaper."
This is why third-party benchmark trackers like BridgeBench and Marginlab exist, and why they get amplified even when their methodology is imperfect. Users don't have better instruments for detecting changes in model behavior over time.
The fix, if there is one, probably involves frontier labs publishing version-level changelogs for user-facing model behavior — not just for the underlying weights. Anthropic, OpenAI, and Google all change default settings regularly and rarely disclose the specifics.
What This Means for Claude Customers
For teams paying for Claude Code or Claude API access, the practical guidance is specific: set reasoning effort and thinking budgets explicitly in your requests rather than relying on defaults. If you were getting great results at launch and getting mediocre results now with the same prompts, the default effort level has almost certainly changed under you.
For enterprise customers evaluating Claude against GPT-5.2 and Gemini 3.1 Pro, the Opus 4.6 noise is a reminder that benchmark scores at launch don't always reflect production behavior six months later. Build your own evaluation harness, run it on a schedule, and treat public benchmark rankings as a starting point rather than a conclusion.
What's Next
Anthropic is reportedly preparing Claude Opus 5 for release later this year, alongside continued iteration on Claude Sonnet 4.6. If Opus 5 launches and the defaults reset to higher reasoning effort, a lot of the "nerfing" narrative will quiet down — regardless of whether the underlying weights changed.
For now, the BridgeBench controversy is more informative about benchmark hygiene than about Claude. The methodology critics are right that the specific 15-point drop claim doesn't hold up. But the broader developer complaints about degraded coding behavior are too widespread and too specific to dismiss as imagination.
The bottom line: Claude Opus 4.6 is probably not meaningfully worse at the weights level, but the user experience has shifted as Anthropic tuned defaults for scale. Serious users should set reasoning parameters explicitly. Serious buyers should run their own evals. And serious observers should treat viral single-post benchmark claims with the same skepticism they'd apply to any other one-sample study.
Sources: VentureBeat on Claude Quality Reports | Yahoo Tech on BridgeBench Critique | Claude Code GitHub Issue #46727 | Marginlab Performance Tracker
Sources
Don't fall behind
Expert AI Implementation →Related Articles
Anthropic Launches Claude Fable 5: Its Most Capable Model Yet
Anthropic released Claude Fable 5, a Mythos-class model that's state-of-the-art on nearly every benchmark — with new safeguards built in. Here's what it means.
min read
China Plans $295B AI Data Center Buildout to Rival the US
China is readying a $295 billion plan to build nationwide AI data centers using mostly domestic chips — squeezing out Nvidia and AMD. Here's what it means.
min read
Flourish Raises $500M to Copy the Brain and Fix AI's Power Crisis
Flourish raised $500M at a $2.5B valuation — backed by Jeff Bezos — to build brain-inspired AI that runs on a fraction of today's energy. Here's the bet.
min read