Google DeepMind's Vision Banana Beats SAM 3 by Generating Pixels

Google DeepMind released Vision Banana on April 25, and the result quietly upends a decade of computer vision orthodoxy. The model — built by instruction-tuning the Nano Banana Pro image generator — beats Meta's SAM 3 on segmentation, Depth Anything V3 on metric depth, and Lotus-2 on surface normal estimation, all in zero-shot transfer settings.

The headline result: Vision Banana hits 0.699 mIoU on Cityscapes semantic segmentation versus SAM 3's 0.652, a 4.7-point gain. On metric depth, it reaches 0.929 δ1 versus Depth Anything V3's 0.918 — using only synthetic training data.

Why a Generator Beating Specialists Is a Big Deal

Computer vision has historically split into specialized models: segmentation systems like SAM, depth estimators like Depth Anything, surface-normal predictors like Lotus. Each was trained on task-specific data with task-specific architectures.

Vision Banana throws that approach out. It's a generative image model — built to make pictures from text — that DeepMind tuned with a small amount of vision-task data. The model produces task outputs as RGB images, and a fixed decoding scheme converts those colored pixels back into segmentation masks, depth maps, or normal vectors.

This matters because it is strong evidence for what DeepMind calls "generation equals understanding." If a model trained primarily to make plausible images already encodes enough visual understanding to top specialist benchmarks, the dominant paradigm in computer vision research needs an update.

The Technical Sleight of Hand

The clever piece is how Vision Banana represents non-image outputs as images. For metric depth estimation, the encoding first applies a power transform with shape parameter λ = -3 and scale parameter c = 10/3 to compress the dynamic range of physical distances. The compressed values are then rendered as a false-color visualization whose colors trace the edges of the RGB color cube along a 3D Hilbert curve.
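To make the depth encoding concrete, here is a minimal sketch of a round trip from distance to color and back. Only λ = -3 and c = 10/3 come from the reporting above. The functional form of the power transform, the order in which the cube corners are visited, and every function name are assumptions made for illustration, and the sketch uses floating-point colors rather than 8-bit pixels.

```python
# Sketch only: the transform's functional form and the corner order are assumptions.
LAMBDA = -3.0      # shape parameter quoted in the article
C = 10.0 / 3.0     # scale parameter quoted in the article

# A path over the 8 corners of the RGB cube in which neighbours differ in exactly
# one channel, so it runs along 7 cube edges (a first-order 3D Hilbert curve).
CORNERS = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
           (1, 1, 0), (1, 1, 1), (1, 0, 1), (1, 0, 0)]
SEGMENTS = len(CORNERS) - 1


def compress_depth(d):
    """Map a distance d >= 0 into [0, 1). Assumed Box-Cox-style form."""
    return 1.0 - (1.0 + d / C) ** LAMBDA


def expand_depth(y):
    """Inverse of compress_depth; y is clamped just below 1 to stay finite."""
    y = min(max(y, 0.0), 1.0 - 1e-12)
    return C * ((1.0 - y) ** (1.0 / LAMBDA) - 1.0)


def position_to_rgb(t):
    """Walk a fraction t in [0, 1] along the corner path and return that color."""
    t = min(max(t, 0.0), 1.0)
    s = t * SEGMENTS
    i = min(int(s), SEGMENTS - 1)      # index of the edge we are on
    f = s - i                          # fraction travelled along that edge
    a, b = CORNERS[i], CORNERS[i + 1]
    return tuple(a[k] + f * (b[k] - a[k]) for k in range(3))


def rgb_to_position(rgb):
    """Project a color onto the nearest point of the path and return its fraction."""
    best_t, best_d2 = 0.0, float("inf")
    for i in range(SEGMENTS):
        a, b = CORNERS[i], CORNERS[i + 1]
        # Each edge has unit length, so the dot product is the projection fraction.
        f = sum((rgb[k] - a[k]) * (b[k] - a[k]) for k in range(3))
        f = min(max(f, 0.0), 1.0)
        d2 = sum((rgb[k] - (a[k] + f * (b[k] - a[k]))) ** 2 for k in range(3))
        if d2 < best_d2:
            best_t, best_d2 = (i + f) / SEGMENTS, d2
    return best_t


def depth_to_rgb(d):
    return position_to_rgb(compress_depth(d))


def rgb_to_depth(rgb):
    return expand_depth(rgb_to_position(rgb))
```

Under these assumptions, `rgb_to_depth(depth_to_rgb(d))` recovers `d` up to floating-point error. Projecting onto the nearest edge in the decoder also means a slightly off-path color produced by a generator would still map to a nearby distance.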

The encoding is invertible — the colored output decodes cleanly back to physical metric distances. For segmentation, classes are mapped to fixed colors. For surface normals, the X, Y, Z components of each vector become R, G, B channels.
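The segmentation and surface-normal decoders can be sketched in the same spirit. The palette below is a hypothetical three-class example, since the actual class colors are not given here, and the affine map from [-1, 1] to 8-bit channels is a common convention that is assumed rather than confirmed. All function names are illustrative.

```python
import math

# Hypothetical palette for illustration; the real class colors are not stated.
PALETTE = {"road": (128, 64, 128), "car": (0, 0, 142), "person": (220, 20, 60)}


def decode_class(pixel):
    """Return the class whose palette color is nearest to the generated pixel."""
    return min(
        PALETTE,
        key=lambda name: sum((p - q) ** 2 for p, q in zip(pixel, PALETTE[name])),
    )


def normal_to_rgb(n):
    """Map normal components in [-1, 1] to 8-bit R, G, B (assumed affine map)."""
    return tuple(int((c + 1.0) / 2.0 * 255.0 + 0.5) for c in n)


def rgb_to_normal(rgb):
    """Invert the affine map, then renormalise to undo quantisation drift."""
    v = [c / 255.0 * 2.0 - 1.0 for c in rgb]
    length = math.sqrt(sum(c * c for c in v)) or 1.0
    return tuple(c / length for c in v)
```

Nearest-color matching is one simple way to tolerate generated pixels that land slightly off the palette. The normal round trip is only approximate because 8-bit channels quantise each component.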

The point is that Vision Banana never has to learn task-specific output heads. It just generates a slightly different kind of image, and a deterministic decoder turns it into the answer. One set of weights, prompt-only switching across four major vision tasks.

What the Numbers Say

The benchmark results hold up across every task reported. On semantic segmentation, Vision Banana reaches 0.699 mIoU on Cityscapes (19 urban-scene classes), the best zero-shot result and 4.7 points ahead of SAM 3's 0.652. On reasoning segmentation, the model scores 0.793 gIoU, beating SAM 3 Agent's 0.770 and even surpassing X-SAM, a non-zero-shot model trained specifically on the test domain.
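For readers unfamiliar with the metric, mIoU averages the per-class intersection-over-union between predicted and ground-truth labels. The sketch below computes it for a single flattened label map; the function name is illustrative, and benchmark evaluations typically accumulate counts over a whole dataset rather than per image, so this shows the idea rather than the official protocol.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that appear in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:                      # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels, two classes present: IoU is 1/2 for class 0 and 2/3 for class 1.
score = mean_iou(pred=[0, 1, 1, 1], target=[0, 0, 1, 1], num_classes=3)
```

Here `score` is the mean of 1/2 and 2/3, about 0.583. On this scale, a 4.7-point gap corresponds to a difference of 0.047.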

On metric depth estimation, Vision Banana's 0.929 δ1 score outperforms Depth Anything V3's 0.918 — and Vision Banana uses only synthetic training data, while Depth Anything V3 trains on millions of real-world depth-labeled images.
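The δ1 score is the standard depth-accuracy metric: the fraction of pixels whose predicted depth is within a factor of 1.25 of the ground truth. A minimal sketch follows; the function name and the handling of non-positive depths as invalid pixels are assumptions made for illustration.

```python
def delta1(pred, target, threshold=1.25):
    """Fraction of valid pixels with max(pred/target, target/pred) < threshold."""
    valid = [(p, t) for p, t in zip(pred, target) if p > 0 and t > 0]
    if not valid:
        return 0.0
    hits = sum(1 for p, t in valid if max(p / t, t / p) < threshold)
    return hits / len(valid)


# Ratios are 1.1, 1.3, 1.05 and about 1.11, so three of four pixels pass.
accuracy = delta1(pred=[1.1, 2.6, 4.2, 9.0], target=[1.0, 2.0, 4.0, 10.0])
```

In this toy case `accuracy` is 0.75. A score of 0.929 means roughly 93 percent of evaluated pixels fall inside the 1.25 ratio band.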

On surface normal estimation, the model beats Lotus-2 across multiple benchmarks, again using a single set of weights and no task-specific fine-tuning.

Authors Worth Noting

The paper's author list includes He Kaiming and Xie Saining, two of the most cited researchers in computer vision. He Kaiming is the originator of ResNet, one of the most influential deep-learning architectures of the past decade. Xie Saining co-authored ConvNeXt and Diffusion Transformers (DiT).

Both are now at DeepMind, and their involvement signals that Vision Banana isn't a one-off experiment — it's part of a broader research program at Google around generative pretraining as a foundation for visual understanding.

Industry Implications

The most direct impact is on companies building vision pipelines. If you've been stitching together SAM for segmentation, Depth Anything for depth, and a separate model for normals, Vision Banana suggests a single foundation model can replace that stack at higher accuracy.

For self-driving and robotics teams, that's a real cost reduction. Multi-task vision pipelines are expensive to maintain, and unifying them on a single backbone simplifies both training and deployment.

For Meta — whose SAM (Segment Anything Model) franchise has been a flagship open-research win — the result is uncomfortable. SAM 3 had been positioned as the state of the art in zero-shot segmentation. Vision Banana takes that crown without being designed for the task.

For the broader AI research community, Vision Banana strengthens the case that generative pretraining produces stronger and more general features than discriminative pretraining. Expect a wave of follow-up papers re-examining other "specialist" vision tasks through the same lens.

Expert Perspectives

The reaction on AI Twitter has focused on the methodology. Researchers have praised the elegance of treating depth estimation as image generation with a structured color encoding — it's a non-obvious move that turns out to work better than careful task-specific architectures.

Skeptics have pointed out that Vision Banana inherits the compute and data costs of Nano Banana Pro, which is a much larger model than SAM 3. The fair comparison, they argue, should account for inference cost as well as benchmark accuracy. DeepMind has not yet published full inference cost numbers.

What's Next

DeepMind has open-sourced the encoding scheme and released the paper, but the full Vision Banana weights remain inside Google. The expectation is that the techniques will roll into Gemini's vision stack and into commercial Google Cloud vision APIs over the coming months.

The bigger story is what happens to Meta's SAM franchise. Meta has been a major contributor to open-research vision, and a credible response would likely involve re-examining SAM's architecture in light of Vision Banana's results.

The bottom line: the boundary between generative AI and "understanding" AI just got blurrier. If you're building vision software, Vision Banana is a signal that the next generation of computer vision foundations will look more like image generators than like specialized vision models. Plan your roadmap accordingly.
