
Hugging Face Open-Sourced FineVision: A New Multimodal Dataset with 24 Million Samples for Training Vision-Language Models (VLMs)

Hugging Face has just released FineVision, an open multimodal dataset designed to set a new standard for training Vision-Language Models (VLMs). With 17.3 million images, 24.3 million samples, 88.9 million question-answer turns, and nearly 10 billion answer tokens, FineVision positions itself as one of the largest and most systematically structured publicly available VLM training datasets.

FineVision aggregates 200+ sources into a unified format, carefully filtered for duplicates and benchmark contamination. Every sample is rated across multiple quality dimensions, allowing researchers and developers to assemble strong training mixtures while minimizing data leakage.

Why is FineVision Important for VLM Training?

Most state-of-the-art VLMs rely on proprietary datasets, limiting reproducibility and accessibility for the broader research community. FineVision addresses this gap on several fronts:

  • Scale and Coverage: 5 TB of curated data across 9 categories, including General VQA, OCR QA, Chart & Table reasoning, Science, Captioning, Grounding & Counting, and GUI navigation.
  • Benchmark Gains: Across 11 widely used benchmarks (e.g., AI2D, ChartQA, DocVQA, ScienceQA, OCRBench), models trained on FineVision outperform alternatives by significant margins: up to 46.3% over LLaVA, 40.7% over Cauldron, and 12.1% over Cambrian.
  • New Skill Domains: FineVision introduces data for emerging tasks such as GUI navigation, pointing, and counting, expanding VLM capabilities beyond conventional captioning and VQA.

How Was FineVision Built?

The curation pipeline followed a three-step process:

  1. Collection and Augmentation
    Over 200 publicly available image-text datasets were gathered. Missing modalities (e.g., text-only data) were reformatted into QA pairs, and underrepresented domains, such as GUI data, were supplemented through targeted collection.
  2. Cleaning (see the filtering sketch after this list)
    • Removed oversized QA pairs (>8192 tokens).
    • Resized large images to a maximum of 2048 px while preserving aspect ratio.
    • Discarded corrupted samples.
  3. Quality Rating
    Using Qwen3-32B and Qwen2.5-VL-32B-Instruct as judges, every QA pair was rated on four axes (a prompt-level sketch also follows this list):
    • Text Formatting Quality
    • Question-Answer Relevance
    • Visual Dependency
    • Image-Question Correspondence

    These scores enable selective training mixtures, though ablations show that keeping all samples yields the best performance, even when lower-rated samples are included.
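The cleaning rules above are simple enough to sketch in a few lines of Python. The snippet below is a minimal illustration under stated assumptions, not the released pipeline: the 8192-token cap and the 2048 px limit come from the description above, while the tokenizer choice and the clean_sample helper are made up for the example.

```python
from PIL import Image
from transformers import AutoTokenizer

MAX_TOKENS = 8192  # QA pairs longer than this are dropped
MAX_SIDE = 2048    # longest allowed image side after resizing

# Tokenizer choice is an assumption; any tokenizer works for a rough length check.
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

def clean_sample(image_path: str, question: str, answer: str):
    """Apply the three cleaning rules: token cap, corruption check, resize."""
    # 1. Drop oversized QA pairs (> 8192 tokens).
    n_tokens = len(tokenizer(question + " " + answer)["input_ids"])
    if n_tokens > MAX_TOKENS:
        return None

    # 2. Drop corrupted images that cannot be decoded.
    try:
        image = Image.open(image_path).convert("RGB")
    except Exception:
        return None

    # 3. Resize so the longest side is at most 2048 px, preserving aspect ratio.
    longest = max(image.size)
    if longest > MAX_SIDE:
        scale = MAX_SIDE / longest
        image = image.resize((int(image.width * scale), int(image.height * scale)))

    return {"image": image, "question": question, "answer": answer}
```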
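The quality-rating step can likewise be sketched at the prompt level. The template, score scale, and the ask_judge callable below are assumptions for illustration only; the exact prompts and serving setup used with Qwen3-32B and Qwen2.5-VL-32B-Instruct are not specified in this post.

```python
import json

# The four rating axes described above.
AXES = [
    "Text Formatting Quality",
    "Question-Answer Relevance",
    "Visual Dependency",
    "Image-Question Correspondence",
]

# Hypothetical prompt template; the real judging prompts are not published here.
PROMPT_TEMPLATE = """You are rating a visual question-answer pair.
Question: {question}
Answer: {answer}
For each axis, return an integer score from 1 (poor) to 5 (excellent)
as a JSON object mapping axis name to score.
Axes: {axes}"""

def rate_qa_pair(question: str, answer: str, ask_judge) -> dict:
    """`ask_judge` stands in for any call to a judge model (e.g. Qwen3-32B for
    text-only axes); it takes a prompt string and returns the model's reply."""
    prompt = PROMPT_TEMPLATE.format(question=question, answer=answer, axes=", ".join(AXES))
    return json.loads(ask_judge(prompt))  # e.g. {"Visual Dependency": 5, ...}
```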

Comparative Analysis: FineVision vs. Existing Open Datasets

Dataset        Images   Samples   Turns    Tokens   Leakage   Perf. Drop After Deduplication
Cauldron       2.0M     1.8M      27.8M    0.3B     3.05%     -2.39%
LLaVA-Vision   2.5M     3.9M      9.1M     1.0B     2.15%     -2.72%
Cambrian-7M    5.4M     7.0M      12.2M    0.8B     2.29%     -2.78%
FineVision     17.3M    24.3M     88.9M    9.5B     1.02%     -1.45%

FineVision is not only among the largest open datasets but also the least contaminated, with roughly 1% overlap with benchmark test sets. This ensures minimal data leakage and more reliable evaluation results.

Performance Insights

  • Model Setup: Ablations were run with nanoVLM (460M parameters), combining SmolLM2-360M-Instruct as the language backbone and SigLIP2-Base-512 as the vision encoder (a loading sketch follows this list).
  • Training Efficiency: On 32 NVIDIA H100 GPUs, one full epoch (12k steps) takes ~20 hours.
  • Performance Trends:
    • FineVision models improve steadily with exposure to diverse data, overtaking the baselines after ~12k steps.
    • Deduplication experiments confirm FineVision’s low leakage compared to Cauldron, LLaVA, and Cambrian.
    • Multilingual subsets show slight performance gains even though the language backbone is monolingual, suggesting that data diversity outweighs strict language alignment.
    • Attempts at multi-stage training (two or 2.5 stages) did not yield consistent benefits, reinforcing that scale plus diversity matters more than training heuristics.
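As context for the ablation setup, both backbones are public on the Hugging Face Hub and can be loaded with transformers. This is only a sketch of the ingredients, not the nanoVLM training code; the SigLIP2 checkpoint ID is an assumption inferred from the "Base-512" name.

```python
from transformers import (AutoImageProcessor, AutoModel,
                          AutoModelForCausalLM, AutoTokenizer)

# Language backbone used in the nanoVLM ablations (SmolLM2, 360M, instruct-tuned).
lm = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-360M-Instruct")

# Vision encoder: SigLIP2-Base at 512 px input. The exact Hub ID is an assumption;
# check the nanoVLM repository for the checkpoint actually used.
vision = AutoModel.from_pretrained("google/siglip2-base-patch16-512")
image_processor = AutoImageProcessor.from_pretrained("google/siglip2-base-patch16-512")

# nanoVLM joins the two through a small projection module and trains the
# combined ~460M-parameter model on FineVision mixtures (training loop not shown).
```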

Why Does FineVision Set a New Standard?

  1. +20% Average Performance Boost: Outperforms all existing open datasets across 10+ benchmarks.
  2. Unprecedented Scale: 17M+ images, 24M+ samples, nearly 10B answer tokens.
  3. Skill Expansion: GUI navigation, counting, pointing, and document reasoning included.
  4. Lowest Data Leakage: ~1% contamination, compared with 2–3% in other datasets.
  5. Fully Open Source: Available on the Hugging Face Hub for immediate use via the datasets library (a loading example follows this list).
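As a quick start, the dataset can be streamed with the datasets library so the full ~5 TB never has to be downloaded up front. The Hub ID and subset name below are assumptions for illustration; check the FineVision dataset card for the exact identifiers and available subsets.

```python
from datasets import load_dataset

# Stream a single FineVision subset instead of downloading the full ~5 TB.
# The repo ID and subset name ("ai2d") are assumptions; see the dataset card.
ds = load_dataset("HuggingFaceM4/FineVision", name="ai2d",
                  split="train", streaming=True)

sample = next(iter(ds))
print(sample.keys())  # typically an image plus its question-answer turns
```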

Conclusion

FineVision marks a significant advance in open multimodal datasets. Its large scale, systematic curation, and transparent quality assessments create a reproducible and extensible foundation for training state-of-the-art Vision-Language Models. By reducing dependence on proprietary resources, it enables researchers and developers to build competitive systems and accelerate progress in areas such as document analysis, visual reasoning, and agentic multimodal tasks.


Check out the Dataset and Technical details.

