IBM Releases New Granite 4.0 Models with a Novel Hybrid Mamba-2/Transformer Architecture: Drastically Reducing Memory Use without Sacrificing Performance

IBM has launched Granite 4.0, an open-source LLM family that swaps the monolithic Transformer for a hybrid Mamba-2/Transformer stack to cut serving memory while maintaining quality. Sizes span a 3B dense “Micro,” a 3B hybrid “H-Micro,” a 7B hybrid MoE “H-Tiny” (~1B active), and a 32B hybrid MoE “H-Small” (~9B active). The models are Apache-2.0 licensed, cryptographically signed, and, per IBM, the first open models covered by an accredited ISO/IEC 42001:2023 AI management system certification. They are available on watsonx.ai and through Docker Hub, Hugging Face, LM Studio, NVIDIA NIM, Ollama, Replicate, Dell Pro AI Studio/Enterprise Hub, Kaggle, with Azure AI Foundry…
So, what’s new?
Granite 4.0 introduces a hybrid design that interleaves a small fraction of self-attention blocks with a majority of Mamba-2 state-space layers (roughly a 9:1 ratio). According to IBM’s technical blog, relative to conventional Transformer LLMs, Granite 4.0-H can reduce RAM by more than 70% for long-context and multi-session inference, which translates into lower GPU cost at a given throughput/latency target. IBM’s internal comparisons also show the smallest Granite 4.0 models outperforming Granite 3.3-8B despite using fewer parameters.
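To make the 9:1 interleaving concrete, here is a minimal, purely illustrative PyTorch sketch of such a layer stack. The block internals are stand-ins (a plain linear mixer instead of a real Mamba-2 scan), and the layer count and dimensions are arbitrary assumptions, not IBM’s actual architecture:

```python
import torch
import torch.nn as nn

class Mamba2BlockStub(nn.Module):
    """Stand-in for a Mamba-2 state-space block (constant memory w.r.t. sequence length)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mixer = nn.Linear(d_model, d_model)  # placeholder for the real SSM scan

    def forward(self, x):
        return x + self.mixer(self.norm(x))

class AttentionBlock(nn.Module):
    """Standard self-attention block (its KV cache grows with context length)."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class HybridStack(nn.Module):
    """Interleaves SSM and attention blocks at a 9:1 ratio, as described for Granite 4.0-H."""
    def __init__(self, d_model: int = 512, n_layers: int = 40, ratio: int = 9):
        super().__init__()
        layers = []
        for i in range(n_layers):
            # every (ratio + 1)-th layer is attention; the rest are Mamba-2-style blocks
            if (i + 1) % (ratio + 1) == 0:
                layers.append(AttentionBlock(d_model))
            else:
                layers.append(Mamba2BlockStub(d_model))
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(1, 1024, 512)   # (batch, sequence, hidden)
print(HybridStack()(x).shape)   # torch.Size([1, 1024, 512])
```

Because most layers carry no per-token KV cache, the memory footprint of long or concurrent sessions is dominated by the few attention layers, which is the intuition behind IBM’s reported savings.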
What are the launched variants?
IBM is shipping both Base and Instruct variants across four initial models:
- Granite-4.0-H-Small: 32B total, ~9B active (hybrid MoE).
- Granite-4.0-H-Tiny: 7B total, ~1B active (hybrid MoE).
- Granite-4.0-H-Micro: 3B (hybrid dense).
- Granite-4.0-Micro: 3B (dense Transformer for stacks that don’t yet support hybrids).
All are Apache-2.0 and cryptographically signed; IBM states Granite is the first open model family with accredited ISO/IEC 42001 coverage for its AI management system (AIMS). Reasoning-optimized (“Thinking”) variants are planned for later in 2025.
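For a quick local test of an Instruct checkpoint, a minimal Hugging Face transformers sketch might look like the following. The repository id is assumed from IBM’s Hugging Face organization and should be verified against the published model cards; the hybrid (H) variants may also require a recent transformers release:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; the dense 3B Micro is used since it needs no hybrid-architecture support.
model_id = "ibm-granite/granite-4.0-micro"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # public checkpoints are published in BF16
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize the Granite 4.0 lineup in one sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```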
How was it trained, and what are the context length and dtype?
Granite 4.0 was trained on samples up to 512K tokens and evaluated at context lengths up to 128K tokens. Public checkpoints on Hugging Face are BF16 (quantized and GGUF conversions are also published), while FP8 is an execution option on supported hardware, not the format of the released weights.
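For the published GGUF conversions, a local run with llama-cpp-python could look roughly like this. The file name and context size are illustrative assumptions, and since llama.cpp enablement for the hybrid variants was still in progress at launch, the dense Micro model is the safer starting point:

```python
from llama_cpp import Llama

llm = Llama(
    model_path="granite-4.0-micro-Q4_K_M.gguf",  # hypothetical local GGUF file
    n_ctx=32768,                                  # well below the 128K evaluated context
)

result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "What is a state-space model, in one sentence?"}],
    max_tokens=64,
)
print(result["choices"][0]["message"]["content"])
```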
Let’s understand its performance signals (enterprise-relevant)
IBM highlights instruction-following and tool-use benchmarks:
- IFEval (HELM): Granite-4.0-H-Small leads most open-weights models, trailing only Llama 4 Maverick, which operates at a far larger scale.
- BFCLv3 (Function Calling): H-Small is competitive with larger open and closed models at lower price points.
- MTRAG (multi-turn RAG): Improved reliability on complex retrieval workflows.

How can I get access?
Granite 4.0 is live on IBM watsonx.ai and distributed via Dell Pro AI Studio/Enterprise Hub, Docker Hub, Hugging Face, Kaggle, LM Studio, NVIDIA NIM, Ollama, OPAQUE, and Replicate. IBM notes ongoing enablement of vLLM, llama.cpp, NexaML, and MLX for hybrid serving.
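As a served-deployment sketch, offline batch generation with vLLM might look like the following. The repository id is an assumption, and the dense Micro variant is used here because hybrid enablement in vLLM was still ongoing at launch:

```python
from vllm import LLM, SamplingParams

# Assumed repo id; swap in a hybrid (H) variant once vLLM hybrid support is confirmed.
llm = LLM(model="ibm-granite/granite-4.0-micro")
params = SamplingParams(temperature=0.2, max_tokens=128)

outputs = llm.generate(["List two use cases for small open-weight LLMs."], params)
print(outputs[0].outputs[0].text)
```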
My thoughts/comments
I see Granite 4.0’s hybrid Mamba-2/Transformer stack and active-parameter MoE as a practical path to lower TCO: the >70% memory reduction and long-context throughput gains translate directly into smaller GPU fleets without sacrificing instruction-following or tool-use accuracy (IFEval, BFCLv3, MTRAG). The BF16 checkpoints with GGUF conversions simplify local evaluation pipelines, and ISO/IEC 42001 plus signed artifacts address provenance/compliance gaps that often stall enterprise deployment. Net result: a lean, auditable base model family (1B–9B active) that is easier to productionize than prior 8B-class Transformers.
Check out the Hugging Face Model Card and Technical details.