Google DeepMind Introduces Vision Banana: An Instruction-Tuned Image Generator That Beats SAM 3 on Segmentation and Depth Anything V3 on Metric Depth Estimation
For years, the computer vision community has operated on two separate tracks: generative models, which produce images, and discriminative models, which understand them. The assumption was simple: models that are good at making pictures aren't necessarily good at reading them. A new paper from Google, titled "Image Generators are Generalist Vision Learners" (arXiv:2604.20329), published April 22, 2026, blows that assumption apart.
A team of Google DeepMind researchers introduced Vision Banana, a single unified model that surpasses or matches state-of-the-art specialist systems across a range of visual understanding tasks, including semantic segmentation, instance segmentation, monocular metric depth estimation, and surface normal estimation, while simultaneously retaining the original image generation capabilities of its base model.

The LLM Analogy That Changes Everything
If you've worked with large language models, you already know the two-phase playbook: first, pretrain a base model on vast text data using a generative objective, then apply instruction-tuning to align it for downstream tasks. The pretraining phase is where the model develops a rich internal representation of language that can be repurposed for almost anything.
The Google team's core claim is that image generation training plays the exact same foundational role for vision. Their base model, Nano Banana Pro (NBP), is Google's state-of-the-art image generator. By performing a lightweight instruction-tuning pass, mixing a small proportion of computer vision task data at a very low ratio into NBP's original training mixture, they created Vision Banana. The key insight: generating photorealistic images implicitly requires a model to understand geometry, semantics, depth, and object relationships. Vision Banana learns to express that latent knowledge in measurable, decodable formats.
Critically, no training data from any of the evaluation benchmarks is included in the instruction-tuning mixture, ensuring that all results reflect true generalist capability rather than in-domain memorization.
How It Works: Perception as Image Generation
Rather than adding specialized decoder heads or regression modules for each task, all vision task outputs are parameterized as RGB images. The model is instruction-tuned to produce visualizations that follow precise, invertible color schemes, meaning the generated images can be decoded back into quantitative outputs for benchmark evaluation.
The research team identified three key advantages of this approach. First, it supports a wide variety of tasks with a single unified model: after instruction-tuning, only the prompt changes, not the weights. Second, it requires relatively little new training data, since instruction-tuning merely teaches the model how to format computer vision outputs as RGB. Third, it lets the model retain its original image generation capabilities, since the outputs are simply more RGB images.
For semantic segmentation, the model is prompted with instructions such as: "Generate a segmentation visualization of this image, using the color mapping: {'cat': 'red', 'background': 'yellow'}." Each pixel is colored by its predicted class, and because the color assignments are specified in the prompt, no fixed label vocabulary is required.
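Because the color assignments are stated in the prompt, decoding the generated visualization back to class labels reduces to a nearest-color lookup. A minimal sketch of that decoding step (the function name and the toy colors are illustrative, not from the paper):

```python
import numpy as np

def decode_semantic_mask(rgb, color_map):
    """Assign each pixel to the class whose prompt color is nearest in RGB space.
    `color_map` mirrors the mapping given in the prompt, e.g.
    {"cat": (255, 0, 0), "background": (255, 255, 0)} (illustrative values)."""
    names = list(color_map)
    palette = np.array([color_map[n] for n in names], dtype=np.float64)  # (K, 3)
    pixels = rgb.reshape(-1, 3).astype(np.float64)                       # (N, 3)
    # Euclidean distance from every pixel to every palette color.
    dists = np.linalg.norm(pixels[:, None, :] - palette[None, :, :], axis=-1)
    return dists.argmin(axis=1).reshape(rgb.shape[:2]), names

# Toy 2x2 "generated" mask: top row near-red (cat), bottom row near-yellow.
viz = np.array([[[250, 5, 5], [240, 10, 0]],
                [[250, 250, 10], [255, 240, 0]]], dtype=np.uint8)
labels, names = decode_semantic_mask(viz, {"cat": (255, 0, 0),
                                           "background": (255, 255, 0)})
```

The nearest-color step also makes decoding tolerant of small color drift in the generated pixels.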
For instance segmentation, since the number of instances is unknown upfront, Vision Banana uses a per-class inference strategy: it runs a separate pass per class and dynamically assigns unique colors to each instance. Masks are recovered by clustering pixels with similar colors using a threshold.
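The paper describes mask recovery only as clustering pixels with similar colors under a threshold; a minimal greedy sketch of that idea (an assumed procedure, not the authors' exact algorithm) might look like:

```python
import numpy as np

def recover_instance_masks(rgb, threshold=30.0):
    """Greedy color clustering: take the first unassigned pixel as a seed,
    group every pixel within `threshold` RGB distance of it into one instance
    mask, and repeat until all pixels are assigned."""
    pixels = rgb.reshape(-1, 3).astype(np.float64)
    unassigned = np.ones(len(pixels), dtype=bool)
    masks = []
    while unassigned.any():
        seed = pixels[np.argmax(unassigned)]            # first unassigned pixel
        close = np.linalg.norm(pixels - seed, axis=1) <= threshold
        members = close & unassigned
        masks.append(members.reshape(rgb.shape[:2]))
        unassigned &= ~members
    return masks

# Toy "generated" visualization: two instances, one red-ish and one blue-ish.
viz = np.array([[[255, 0, 0], [250, 5, 0]],
                [[0, 0, 255], [5, 0, 250]]], dtype=np.uint8)
masks = recover_instance_masks(viz)
```

A production decoder would likely also merge tiny clusters and discard pixels matching the background color, but the threshold-on-color-distance core is the same.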
Metric depth estimation uses a bijective mapping between unbounded metric depth values in [0, ∞) and bounded RGB values in [0, 1]³. A power transform (shape parameter λ = −3, scale parameter c = 10/3) first "curves" the metric depth values, which are then encoded as a false-color visualization that traverses the edges of the RGB cube, following the structure of a 3D Hilbert curve. This transform is strictly invertible, so the generated depth image decodes cleanly back to physical metric distances. Crucially, no camera parameters, neither intrinsics nor extrinsics, are required at training or inference time. The model infers absolute scale purely from visual cues and world knowledge embedded during pretraining. The depth training data is also entirely synthetic, generated from simulation rendering engines, with zero real-world depth data used.
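The article reports only the parameters of the power transform (λ = −3, c = 10/3), not its closed form. One strictly invertible Box-Cox-like form consistent with those parameters, mapping [0, ∞) onto [0, 1), is sketched below; the exact expression is an assumption, and the Hilbert-curve coloring step is omitted:

```python
import numpy as np

LAM, C = -3.0, 10.0 / 3.0  # shape and scale parameters reported in the paper

def depth_to_unit(d):
    """Curve unbounded metric depth [0, inf) into the bounded interval [0, 1).
    The closed form here is an assumed Box-Cox-style transform; the paper only
    states that the mapping is strictly invertible."""
    return 1.0 - (1.0 + np.asarray(d, dtype=np.float64) / C) ** LAM

def unit_to_depth(t):
    """Exact inverse of depth_to_unit, recovering physical metric distance."""
    return C * ((1.0 - np.asarray(t, dtype=np.float64)) ** (1.0 / LAM) - 1.0)

# Round trip is lossless: the encoding decodes back to the input depths.
depths = np.array([0.0, 0.5, 2.0, 25.0, 400.0])
recovered = unit_to_depth(depth_to_unit(depths))
```

Note how the negative exponent spends most of the unit interval on nearby depths, where metric precision matters most, while still covering arbitrarily far scenes.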
For surface normal estimation, the mapping is more direct: surface normals are unit vectors (x, y, z) with components ranging from −1.0 to 1.0, which map naturally to RGB channels. Left-facing normals encode as pinkish-red; upward-facing normals encode as light green; normals pointing toward the camera encode as light blue/purple.
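This is consistent with the standard affine normal-map encoding, sketched below (assuming the common (n + 1)/2 scaling and a camera convention whose x-axis points left, which reproduces the colors the article describes):

```python
import numpy as np

def normal_to_rgb(n):
    """Affine map from unit-normal components in [-1, 1] to RGB in [0, 255]."""
    return np.round((np.asarray(n, dtype=np.float64) + 1.0) / 2.0 * 255.0).astype(np.uint8)

def rgb_to_normal(rgb):
    """Inverse map back to [-1, 1]^3, re-normalized to unit length."""
    n = np.asarray(rgb, dtype=np.float64) / 255.0 * 2.0 - 1.0
    return n / np.linalg.norm(n, axis=-1, keepdims=True)

# A normal pointing straight at the camera, (0, 0, 1), encodes as
# (128, 128, 255): the light blue/purple the article describes.
toward_camera = normal_to_rgb([0.0, 0.0, 1.0])
```

The re-normalization in the decoder absorbs small quantization and generation errors, since valid outputs must be unit vectors.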
The Numbers: Beating Specialists at Their Own Game
Vision Banana's results across benchmarks, all in zero-shot transfer settings where the model has never seen any training data from the evaluated datasets, are significant:
- Semantic segmentation on Cityscapes val: mIoU of 0.699, compared to SAM 3's 0.652, a 4.7-point gain.
- Referring expression segmentation on RefCOCOg UMD val: cIoU of 0.738, edging out SAM 3 Agent's 0.734.
- Reasoning segmentation on ReasonSeg val: gIoU of 0.793, beating SAM 3 Agent's 0.770, and notably surpassing even non-zero-shot methods trained on in-domain data, including X-SAM.
- Instance segmentation on SA-Co/Gold: pmF1 of 0.540, on par with DINO-X (0.552), and ahead of Gemini 2.5 (0.461), APE-D (0.369), and OWLv2 (0.420) under zero-shot transfer.
- Metric depth estimation: average δ1 of 0.882 across six major benchmarks; on the four datasets where Depth Anything V3 was evaluated (NYU, ETH3D, DIODE-Indoor, KITTI), Vision Banana scores 0.929 versus Depth Anything V3's 0.918, while using zero real-world training data and no camera parameters.
- Surface normal estimation: average mean angle error of 18.928° across four datasets, compared to Lotus-2's 19.642°. On indoor datasets specifically, Vision Banana achieves the lowest mean angle error (15.549°) and lowest median angle error (9.300°) among all compared methods.
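For readers unfamiliar with the metrics above: δ1 is the standard depth-accuracy score (the fraction of pixels whose prediction is within a factor of 1.25 of ground truth), and mean angle error is the average angular deviation between predicted and ground-truth unit normals. In code:

```python
import numpy as np

def delta1(pred, gt):
    """delta_1 accuracy: fraction of pixels with max(pred/gt, gt/pred) < 1.25."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    ratio = np.maximum(pred / gt, gt / pred)
    return float((ratio < 1.25).mean())

def mean_angle_error(pred_n, gt_n):
    """Mean angular error in degrees between predicted and true unit normals."""
    cos = np.clip(np.sum(np.asarray(pred_n) * np.asarray(gt_n), axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

# A prediction within 25% of ground truth everywhere scores delta_1 = 1.0.
score = delta1([1.0, 2.0, 4.0], [1.1, 2.0, 3.5])
```

Higher is better for δ1 (maximum 1.0); lower is better for angle error.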
On generative benchmarks, Vision Banana holds its own against its base model: it achieves a 53.5% win rate against Nano Banana Pro on GenAI-Bench (text-to-image), and a 47.8% win rate on ImgEdit (image editing), where Nano Banana Pro scores 52.2%. Overall, the results confirm that lightweight instruction-tuning does not degrade the model's generative capabilities.
Key Takeaways
- Image generation pretraining produces a generalist vision learner: Just as LLM pretraining unlocks emergent language understanding, Google's research shows that training on image generation naturally develops powerful internal visual representations that transfer to perception tasks like segmentation, depth estimation, and surface normal estimation.
- Vision Banana beats specialist models without specialist architecture: Built by lightweight instruction-tuning of Nano Banana Pro, Vision Banana surpasses SAM 3 on three segmentation benchmarks, Depth Anything V3 on metric depth estimation (δ1: 0.929 vs 0.918), and Lotus-2 on surface normal estimation (mean angle error: 18.928° vs 19.642°), all in zero-shot transfer settings.
- All vision tasks are reframed as image generation: By parameterizing vision task outputs as RGB images with decodable color schemes, Vision Banana uses a single set of weights and prompt-only switching across semantic segmentation, instance segmentation, depth estimation, and surface normal estimation, with no task-specific modules required.
- Metric depth estimation works without any camera parameters or real-world data: Using a bijective power-transform mapping from depth values to RGB color space, Vision Banana infers absolute metric scale purely from visual context, requiring neither camera intrinsics nor extrinsics, and is trained entirely on synthetic data from simulation engines.
- Image generation can serve as a universal interface for vision: Analogous to how text generation unifies language tasks, image generation may become the universal output interface for computer vision, pointing toward a paradigm shift where generative vision pretraining powers true Foundational Vision Models for both generation and understanding.
Check out the Paper and Project Page.
The post Google DeepMind Introduces Vision Banana: An Instruction-Tuned Image Generator That Beats SAM 3 on Segmentation and Depth Anything V3 on Metric Depth Estimation appeared first on MarkTechPost.
