
TII Releases Falcon Perception: A 0.6B-Parameter Early-Fusion Transformer for Open-Vocabulary Grounding and Segmentation from Natural Language Prompts

In the current landscape of computer vision, the standard operating procedure involves a modular ‘Lego-brick’ approach: a pre-trained vision encoder for feature extraction paired with a separate decoder for task prediction. While effective, this architectural separation complicates scaling and bottlenecks the interplay between language and vision.

The Technology Innovation Institute (TII) research team is challenging this paradigm with Falcon Perception, a 600M-parameter unified dense Transformer. By processing image patches and text tokens in a shared parameter space from the very first layer, the TII research team has developed an early-fusion stack that handles perception and task modeling with high efficiency.

https://arxiv.org/pdf/2603.27365

The Architecture: A Single Stack for Every Modality

The core design of Falcon Perception is built on the hypothesis that a single Transformer can simultaneously learn visual representations and perform task-specific generation.

Hybrid Attention and GGROPE

Unlike standard language models that use strict causal masking, Falcon Perception employs a hybrid attention strategy. Image tokens attend to one another bidirectionally to build a global visual context, while text and task tokens attend to all preceding tokens (causal masking) to enable autoregressive prediction.
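The hybrid mask can be sketched in a few lines. The following is an illustrative reconstruction, not the released implementation, assuming the image tokens occupy the sequence prefix:

```python
import numpy as np

def hybrid_attention_mask(n_image: int, n_text: int) -> np.ndarray:
    """Boolean attention mask for an [image tokens | text/task tokens] sequence.

    mask[i, j] is True when query token i may attend to key token j:
      - image tokens attend bidirectionally among themselves;
      - text/task tokens attend causally to the entire preceding sequence.
    """
    n = n_image + n_text
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention among image tokens.
    mask[:n_image, :n_image] = True
    # Text/task rows: standard causal masking over the whole prefix.
    rows = np.arange(n)[:, None]
    cols = np.arange(n)[None, :]
    causal = cols <= rows
    mask[n_image:, :] = causal[n_image:, :]
    return mask

m = hybrid_attention_mask(n_image=3, n_text=2)
# Image token 0 sees the later image token 2 (bidirectional) ...
assert m[0, 2]
# ... but the first text token (index 3) cannot see the later text token (index 4).
assert not m[3, 4] and m[4, 3]
```

In a real stack this mask would be applied per attention layer (or expressed as a block mask for an efficient kernel); here it simply makes the two attention regimes concrete.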

To preserve 2D spatial relationships in a flattened sequence, the research team uses 3D Rotary Positional Embeddings. This decomposes the head dimension into a sequential component and a spatial component using Golden Gate ROPE (GGROPE). GGROPE allows attention heads to attend to relative positions along arbitrary angles, making the model robust to rotation and aspect-ratio variations.
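The paper's exact GGROPE formulation is not reproduced here; the sketch below only illustrates the underlying idea of directional rotary embeddings, where a head's rotation is driven by the 2D patch position projected onto an arbitrary angle. The function name and the choice of angle are assumptions for illustration:

```python
import numpy as np

def directional_rope(x, positions, theta_head, base=10000.0):
    """Apply a rotary embedding along one spatial direction (illustrative sketch).

    x:          (seq, dim) features for one attention head, dim even.
    positions:  (seq, 2) patch (row, col) coordinates.
    theta_head: angle of this head's preferred spatial axis.

    Each 2D position is projected onto the unit vector at theta_head, giving a
    scalar coordinate that drives a standard RoPE-style rotation, so relative
    offsets along that direction are what the attention scores see.
    """
    seq, dim = x.shape
    direction = np.array([np.cos(theta_head), np.sin(theta_head)])
    t = positions @ direction                      # (seq,) projected coordinate
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    ang = t[:, None] * inv_freq[None, :]           # (seq, dim/2) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each feature pair is rotated rather than scaled, token norms are preserved, and dot products between rotated queries and keys depend only on the relative displacement along the head's direction.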

Minimalist Sequence Logic

The primary architectural sequence follows a Chain-of-Perception format:

[Image] [Text] <coord> <size> <seg> ... <eos>

This ensures that the model resolves spatial ambiguity (position and size) as a conditioning signal before generating the final segmentation mask.

Engineering for Scale: Muon, FlexAttention, and Raster Ordering

The TII research team introduced several optimizations to stabilize training and maximize GPU utilization for these heterogeneous sequences.

  • Muon Optimization: The research team reports that using the Muon optimizer for the specialized heads (coordinate, size, and segmentation) led to lower training losses and improved benchmark performance compared to standard AdamW.
  • FlexAttention and Sequence Packing: To process images at native resolutions without wasting compute on padding, the model uses a scatter-and-pack strategy. Valid patches are packed into fixed-length blocks, and FlexAttention restricts self-attention to each image sample’s boundaries.
  • Raster Ordering: When multiple objects are present, Falcon Perception predicts them in raster order (top-to-bottom, left-to-right). This ordering was found to converge faster and produce lower coordinate loss than random or size-based orderings.
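The raster-ordering target is simple to state in code. A minimal sketch, assuming boxes are given as (x, y, w, h) with (x, y) the top-left corner:

```python
def raster_order(boxes):
    """Sort object boxes in raster order: top-to-bottom, ties broken left-to-right.

    Each box is (x, y, w, h); sorting on (y, x) of the top-left corner yields
    the deterministic prediction order the paper reports converges fastest.
    """
    return sorted(boxes, key=lambda b: (b[1], b[0]))

boxes = [(50, 10, 5, 5), (10, 10, 5, 5), (30, 0, 5, 5)]
# The topmost box (y=0) comes first; the two boxes at y=10 are ordered by x.
assert raster_order(boxes) == [(30, 0, 5, 5), (10, 10, 5, 5), (50, 10, 5, 5)]
```

The point of a fixed, deterministic ordering is that the autoregressive loss no longer has to average over arbitrary permutations of the same object set.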

The Training Recipe: Distillation to 685GT

The model uses multi-teacher distillation for initialization, distilling knowledge from DINOv3 (ViT-H) for local features and SigLIP2 (So400m) for language-aligned features. Following initialization, the model undergoes a three-stage perception training pipeline totaling roughly 685 Gigatokens (GT):

  1. In-Context Listing (450 GT): Learning to ‘list’ the scene inventory to build global context.
  2. Task Alignment (225 GT): Transitioning to independent-query tasks using Query Masking to ensure the model grounds each query solely on the image.
  3. Long-Context Finetuning (10 GT): A short adaptation for extreme density, increasing the mask limit to 600 per expression.

During these stages, the following task-specific serialization is used:

<image>expr1<present><coord><size><seg><eoq>expr2<absent><eoq><eos>

The <present> and <absent> tokens force the model to commit to a binary decision on an object’s existence before localization.
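As an illustrative sketch of this serialization (the special-token names come from the paper; the Python structure and function name are hypothetical):

```python
def serialize_queries(image_token, queries):
    """Build a training sequence in the paper's query format (illustrative).

    `queries` maps each referring expression to its list of matched objects;
    an empty list takes the <absent> branch, so the model must first decide
    existence, then localize via <coord>/<size>/<seg> placeholders.
    """
    seq = [image_token]
    for expr, objects in queries.items():
        seq.append(expr)
        if objects:
            seq.append("<present>")
            for _ in objects:
                seq += ["<coord>", "<size>", "<seg>"]
        else:
            seq.append("<absent>")
        seq.append("<eoq>")
    seq.append("<eos>")
    return seq

tokens = serialize_queries("<image>", {"a red car": ["obj0"], "a blue dog": []})
```

Note how the <absent> branch terminates immediately with <eoq>: no localization tokens are ever emitted for an expression the model judges to have no match.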

PBench: Profiling Capabilities Beyond Saturated Baselines

To measure progress, the TII research team introduced PBench, a benchmark that organizes samples into five levels of semantic complexity to disentangle model failure modes.

Main Results: Falcon Perception vs. SAM 3 (Macro-F1)

Benchmark Split           | SAM 3 | Falcon Perception (600M)
--------------------------|-------|-------------------------
L0: Simple Objects        | 64.3  | 65.1
L1: Attributes            | 54.4  | 63.6
L2: OCR-Guided            | 24.6  | 38.0
L3: Spatial Understanding | 31.6  | 53.5
L4: Relations             | 33.3  | 49.1
Dense Split               | 58.4  | 72.6

Falcon Perception significantly outperforms SAM 3 on complex semantic tasks, notably showing a +21.9-point gain on spatial understanding (Level 3).
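The quoted gains follow directly from the Macro-F1 table:

```python
# Macro-F1 scores from the PBench table (SAM 3, Falcon Perception 600M).
scores = {
    "L0: Simple Objects":        (64.3, 65.1),
    "L1: Attributes":            (54.4, 63.6),
    "L2: OCR-Guided":            (24.6, 38.0),
    "L3: Spatial Understanding": (31.6, 53.5),
    "L4: Relations":             (33.3, 49.1),
    "Dense Split":               (58.4, 72.6),
}
# Per-split point gain of Falcon Perception over SAM 3.
gains = {k: round(falcon - sam, 1) for k, (sam, falcon) in scores.items()}
assert gains["L3: Spatial Understanding"] == 21.9
assert gains["L2: OCR-Guided"] == 13.4
```

The pattern is consistent: the harder the semantic level, the larger the margin, with the dense split showing a +14.2-point gain.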


FalconOCR: The 300M Document Specialist

The TII team also extended this early-fusion recipe to FalconOCR, a compact 300M-parameter model initialized from scratch to prioritize fine-grained glyph recognition. FalconOCR is competitive with several larger proprietary and modular OCR systems:

  • olmOCR: Achieves 80.3% accuracy, matching or exceeding Gemini 3 Pro (80.2%) and GPT 5.2 (69.8%).
  • OmniDocBench: Reaches an overall score of 88.64, ahead of GPT 5.2 (86.56) and Mistral OCR 3 (85.20), though it trails the top modular pipeline, PaddleOCR VL 1.5 (94.37).

Key Takeaways

  • Unified Early-Fusion Architecture: Falcon Perception replaces modular encoder-decoder pipelines with a single dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. It uses a hybrid attention mask (bidirectional for visual tokens, causal for task tokens) to act simultaneously as a vision encoder and an autoregressive decoder.
  • Chain-of-Perception Sequence: The model serializes instance segmentation into a structured sequence (<coord> → <size> → <seg>), which forces it to resolve spatial position and size as a conditioning signal before generating the pixel-level mask.
  • Specialized Heads and GGROPE: To handle dense spatial information, the model uses Fourier Feature encoders for high-dimensional coordinate mapping and Golden Gate ROPE (GGROPE) to enable isotropic 2D spatial attention. The Muon optimizer is employed for these specialized heads to balance learning rates against the pre-trained backbone.
  • Semantic Performance Gains: On the new PBench benchmark, which disentangles semantic capabilities (Levels 0-4), the 600M model demonstrates significant gains over SAM 3 in complex categories, including a +13.4-point lead in OCR-guided queries and a +21.9-point lead in spatial understanding.
  • High-Efficiency OCR Extension: The architecture scales down to FalconOCR, a 300M-parameter model that achieves 80.3% on olmOCR and 88.64 on OmniDocBench. It matches or exceeds the accuracy of much larger systems such as Gemini 3 Pro and GPT 5.2 while maintaining high throughput for large-scale document processing.
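The Fourier Feature coordinate mapping used by the specialized heads can be sketched generically; the band count, frequency range, and [0, 1] normalization below are assumptions, not the paper's values:

```python
import numpy as np

def fourier_features(coords, num_bands=8, max_freq=64.0):
    """Lift normalized (x, y) coordinates in [0, 1] to a high-dimensional
    embedding via fixed sinusoidal bands (a generic Fourier Feature sketch).

    coords: (n, 2) array -> returns (n, 2 * 2 * num_bands):
    sin and cos of each coordinate at num_bands log-spaced frequencies.
    """
    freqs = np.geomspace(1.0, max_freq, num_bands)        # (num_bands,)
    ang = 2 * np.pi * coords[:, :, None] * freqs          # (n, 2, num_bands)
    feats = np.concatenate([np.sin(ang), np.cos(ang)], axis=-1)
    return feats.reshape(coords.shape[0], -1)

emb = fourier_features(np.array([[0.25, 0.75]]))
assert emb.shape == (1, 32)
```

The motivation is standard: a raw scalar coordinate is a poor regression target for a Transformer head, while a multi-frequency sinusoidal embedding makes nearby positions distinguishable at several spatial scales.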

Check out the Paper, Model Weights, Repo and Technical details.

The post TII Releases Falcon Perception: A 0.6B-Parameter Early-Fusion Transformer for Open-Vocabulary Grounding and Segmentation from Natural Language Prompts appeared first on MarkTechPost.
