Google DeepMind Releases Gemma 4 12B: An Encoder-Free Multimodal Model with Native audio that runs on a 16 GB laptop

Google DeepMind simply launched (*16*), a dense multimodal mannequin that strips out conventional encoders completely. Vision and audio circulate straight into the LLM spine. The result’s a mannequin that runs agentic workflows on a shopper laptop with 16 GB of RAM. It ships beneath the Apache 2.0 license.

Model Overview & Access

Gemma 4 12B is a 12-billion-parameter decoder-only transformer. It handles textual content, photos, audio, and video natively. There are not any separate imaginative and prescient or audio encoders. The decoder makes use of the identical construction because the Gemma 4 31B Dense mannequin. It bridges the hole between the edge-friendly E4B and the bigger 26B Mixture of Experts variant.

Architecture: Unified, encoder-free decoder-only transformer.
Modalities: Text, picture, video, and native audio enter — the primary mid-sized Gemma with audio.
Hardware requirement: 16 GB VRAM or unified reminiscence. Runs on shopper GPU laptops and Apple Silicon Macs.
License: Apache 2.0. Weights are open and publicly downloadable.
Inference stack: Compatible with llama.cpp, MLX, vLLM, Ollama, SGLang, Unsloth, and LM Studio.
Download: Hugging Face and Kaggle. The instruct variant is google/gemma-4-12B-it.
Integration: Hugging Face Transformers, LiteRT-LM CLI, and an OpenAI-compatible native API server by way of litert-lm serve.

A devoted Multi-Token Prediction (MTP) drafter mannequin can also be launched. It reduces inference latency on native {hardware}.

Architecture: The Encoder-Free Design

Every prior mid-sized Gemma mannequin used separate Transformer encoders for imaginative and prescient and audio. Those encoders added latency and parameter overhead. The medium-sized Gemma 4 fashions carry a 550M-parameter imaginative and prescient encoder. The E2B and E4B fashions embrace a 300M-parameter audio encoder. All of that is gone within the 12B.

Vision embedder (35M parameters): Raw photos are break up into 48×48 pixel patches. Each patch is projected to the LLM’s hidden dimension with a single matrix multiplication. There is not any consideration layer; every patch is processed independently. Spatial place is injected utilizing a factorized coordinate lookup: a discovered X matrix and a discovered Y matrix. For a patch at (x, y), the mannequin seems up two discovered embeddings and provides them to type a place vector. This is added to the patch embedding, adopted by normalization. That is your entire imaginative and prescient pipeline.

Audio wave projection: Raw 16 kHz audio is sliced into 40 ms frames. Each body comprises 640 values. Those values are linearly projected into the identical embedding area as textual content tokens. There is not any characteristic extraction and no conformer layers. The LLM’s present Rotary Position Embedding (RoPE) handles the 1-D temporal sequence. The audio encoder within the E2B and E4B used 12 conformer layers. All of that is eliminated.

Importance: The unified weight area means you now not co-tune separate frozen encoders. Downstream fine-tuning with LoRA or full tuning updates imaginative and prescient, audio, and textual content processing in a single move. Hugging Face Transformers and Unsloth already assist this.

The encoder-free design reduces multimodal latency. The LLM spine begins processing instantly. No encoder should end first.

Capabilities & Performance

Google DeepMind staff has not revealed full benchmark leads to the preliminary launch supplies. The official launch notes state the 12B mannequin performs nearing the 26B MoE mannequin on customary benchmarks, at lower than half the full reminiscence footprint.

https://weblog.google/innovation-and-ai/expertise/developers-tools/introducing-gemma-4-12b/

The mannequin’s demonstrated capabilities embrace:

Automatic speech recognition. Transcribes audio natively with out an exterior ASR pipeline.
Agentic reasoning. Runs multi-step workflows regionally, with efficiency approaching the 26B MoE mannequin.
Diarization. Distinguishes audio system in audio enter.
Video understanding. Processes video frames alongside audio. A demo analyzed a 5-minute Google I/O keynote phase utilizing 313 frames at 1 FPS with a visible token finances of 70 per body.
Coding. Built a Gradio image-processing app utilizing its personal code era, served regionally with llama.cpp.
Multimodal agentic workflows. The official Gemma Skills repository at github.com/google-gemma/gemma-skills offers pre-built agent capabilities.

In Google’s personal Google AI Edge Eloquent app, the swap to Gemma 4 12B produced what Google studies as a 60%+ soar in total high quality, with improved instruction following and scope adherence.

Marktechpost’s Visual Explainer

Released June 3, 2026

Gemma 4 12B

Google DeepMind’s unified, encoder-free multimodal mannequin

A 12-billion-parameter decoder-only transformer that drops separate imaginative and prescient and audio encoders. Vision and audio circulate straight into the LLM spine. It runs regionally on a 16 GB laptop beneath an Apache 2.0 license.

Encoder-free — no separate imaginative and prescient or audio encoders
First mid-sized Gemma with native audio enter; provides video
Local-ready — 16 GB VRAM or unified reminiscence

Overview & Access

What ships

Specs, weights, and the inference stack

Architecture — decoder-only, similar construction as Gemma 4 31B Dense
Modalities — textual content, picture, video, and native audio
Hardware — 16 GB VRAM / unified reminiscence; GPU laptops and Apple Silicon
License — Apache 2.0; weights on Hugging Face and Kaggle
Instruct variant — google/gemma-4-12B-it
Speed — a devoted Multi-Token Prediction (MTP) drafter can also be launched

Architecture · Vision

A 35M imaginative and prescient embedder

Replacing the 550M imaginative and prescient encoder of the medium-sized fashions

Raw photos break up into 48×48 pixel patches
Each patch projected to the LLM hidden dimension with a single matrix multiplication
No consideration layer — every patch is processed independently
Position by way of a factorized X/Y coordinate lookup, then normalization
That is your entire imaginative and prescient pipeline

Architecture · Audio

Direct audio wave projection

No conformer layers, no characteristic extraction

Removes the 12 conformer layers utilized in Gemma 4 E2B and E4B
Raw 16 kHz audio sliced into 40 ms frames (640 values every)
Frames projected into the similar embedding area as textual content tokens
The LLM’s present RoPE handles the temporal sequence
The first mid-sized Gemma to natively ingest audio

Capabilities & Performance

Near-26B reasoning, half the reminiscence

Google studies efficiency nearing the 26B MoE at beneath half the reminiscence footprint

ASR & diarization — native transcription, speaker separation
Agentic reasoning — multi-step workflows run regionally
Video — demo on a 5-min I/O keynote: 313 frames at 1 FPS, token finances 70
Coding — constructed a Gradio app by way of gemma-skills, served with llama.cpp
No full benchmark tables revealed at launch

Run It Locally

Three paths on day one

Native macOS apps plus a drop-in native server

Google AI Edge Gallery (macOS) — sandboxed Python execution loop
Google AI Edge Eloquent (macOS) — on-device dictation and enhancing
LiteRT-LM CLI — litert-lm serve exposes an OpenAI-compatible endpoint
Works with Continue, Aider, OpenCode, Open WebUI
Also LM Studio, Ollama, Transformers, Unsloth, vLLM, SGLang, MLX
Deploy on Cloud Run, GKE, or Gemini Enterprise Agent Platform Model Garden

Key Takeaways

The backside line

What the encoder-free design buys you

No separate imaginative and prescient (550M) or audio (300M) encoders
35M imaginative and prescient embedder plus direct audio wave projection
Fine-tuning updates imaginative and prescient, audio, and textual content in a single move
Nears 26B efficiency at beneath half the reminiscence; runs on 16 GB
Apache 2.0 with broad ecosystem assist out of the gate

1 / 7

Marktechpost
— AI analysis, mannequin releases & developer instruments for 1M+ practitioners.
marktechpost.com

Key Takeaways

Google DeepMind launched Gemma 4 12B, a dense encoder-free multimodal mannequin beneath the Apache 2.0 license.
Vision and audio feed straight into the LLM spine — no separate imaginative and prescient (550M) or audio (300M) encoders.
A 35M imaginative and prescient embedder makes use of a single matmul plus factorized X/Y place lookup; audio initiatives uncooked 16 kHz frames instantly.
It is the primary mid-sized Gemma with native audio, and provides video, working on a 16 GB laptop.
Benchmark efficiency nears the 26B MoE mannequin at lower than half the reminiscence footprint.

Check out the Model Weights and Technical details. Also, be at liberty to comply with us on Twitter and don’t neglect to hitch our 150k+ ML SubReddit and Subscribe to our Newsletter. Wait! are you on telegram? now you can join us on telegram as well.

Need to companion with us for selling your GitHub Repo OR Hugging Face Page OR Product Release OR Webinar and many others.? Connect with us

The submit (*4*) appeared first on MarkTechPost.

Google DeepMind Releases Gemma 4 12B: An Encoder-Free Multimodal Model with Native audio that runs on a 16 GB laptop

Model Overview & Access

Architecture: The Encoder-Free Design

Capabilities & Performance

Marktechpost’s Visual Explainer

Gemma 4 12B

Google DeepMind’s unified, encoder-free multimodal mannequin

What ships

Specs, weights, and the inference stack

A 35M imaginative and prescient embedder

Replacing the 550M imaginative and prescient encoder of the medium-sized fashions

Direct audio wave projection

No conformer layers, no characteristic extraction

Near-26B reasoning, half the reminiscence

Google studies efficiency nearing the 26B MoE at beneath half the reminiscence footprint

Three paths on day one

Native macOS apps plus a drop-in native server

The backside line

What the encoder-free design buys you

Key Takeaways

Google AI Releases Gemini 3.1 Pro with 1 Million Token Context and 77.1 Percent ARC-AGI-2 Reasoning for AI Agents

GitHub Introduces Vibe Coding with Spark: Revolutionizing Intelligent App Development in a Flash

NVIDIA AI Released DiffusionRenderer: An AI Model for Editable, Photorealistic 3D Scenes from a Single Video

NVIDIA Releases DreamDojo: An Open-Source Robot World Model Trained on 44,711 Hours of Real-World Human Video Data

Patter SDK Guide to Building a Restaurant Booking Phone Agent with Dynamic Variables, Guardrails, Latency Dashboards, and Eval Checks

Chroma Releases Context-1: A 20B Agentic Search Model for Multi-Hop Retrieval, Context Management, and Scalable Synthetic Task Generation

Curated by experts. Filtered for relevance.

Resources

About

Subscribe & learn more every day!

Model Overview & Access

Architecture: The Encoder-Free Design

Capabilities & Performance

Marktechpost’s Visual Explainer

Gemma 4 12B

Google DeepMind’s unified, encoder-free multimodal mannequin

What ships

Specs, weights, and the inference stack

A 35M imaginative and prescient embedder

Replacing the 550M imaginative and prescient encoder of the medium-sized fashions

Direct audio wave projection

No conformer layers, no characteristic extraction

Near-26B reasoning, half the reminiscence

Google studies efficiency nearing the 26B MoE at beneath half the reminiscence footprint

Three paths on day one

Native macOS apps plus a drop-in native server

The backside line

What the encoder-free design buys you

Key Takeaways

Similar Posts

Curated by experts. Filtered for relevance.

Resources

About

Subscribe & learn more every day!