Meta AI Releases NeuralBench: A Unified Open-Source Framework to Benchmark NeuroAI Models Across 36 EEG Tasks and 94 Datasets
Evaluating AI models trained on brain signals has long been a messy, inconsistent affair. Different research groups use different preprocessing pipelines, train models on different datasets, and report results on a narrow set of tasks, making it nearly impossible to know which model actually works best, or for what. A new framework from the Meta AI team is designed to fix that.
Meta researchers have released NeuralBench, a unified, open-source framework for benchmarking AI models of brain activity. Its first release, NeuralBench-EEG v1.0, is the largest open benchmark of its kind: 36 downstream tasks, 94 datasets, 9,478 subjects, 13,603 hours of electroencephalography (EEG) data, and 14 deep learning architectures evaluated under a single standardized interface.

The Problem NeuralBench Solves
The broader field of NeuroAI, where deep learning meets neuroscience, has exploded in recent years. Self-supervised learning methods originally developed for language, speech, and images are now being adapted to build brain foundation models: large models pretrained on unlabeled brain recordings and fine-tuned for downstream tasks ranging from clinical seizure detection to decoding what a person is seeing or hearing.
Yet the evaluation landscape has been badly fragmented. Existing benchmarks like MOABB cover up to 148 brain-computer interfacing (BCI) datasets but restrict evaluation to just five downstream tasks. Other efforts, such as EEG-Bench, EEG-FM-Bench, and AdaBrain-Bench, are each constrained in their own ways. For modalities like magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI), there is no systematic benchmark at all.
The result: claims about foundation models being “generalizable” or “foundational” often rest on cherry-picked tasks with no common reference point.
What is NeuralBench?
NeuralBench is built on three core Python packages that form a modular pipeline.
NeuralFetch handles dataset acquisition, pulling curated data from public repositories including OpenNeuro, DANDI, and NEMAR. NeuralSet prepares data as PyTorch-ready dataloaders, wrapping existing neuroscience tools like MNE-Python and nilearn for preprocessing, and Hugging Face for extracting stimulus embeddings (for tasks involving images, speech, or text). NeuralTrain provides modular training code built on PyTorch Lightning, Pydantic, and the exca execution and caching library.
Once installed via pip install neuralbench, the framework is driven through a command-line interface (CLI). Running a task is as simple as three commands: download the data, prepare the cache, and execute. Each task is configured through a lightweight YAML file that specifies the data source, train/validation/test splits, preprocessing steps, target processing, training hyperparameters, and evaluation metrics.
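Such a task configuration might look like the sketch below. The field names are illustrative, inferred from the description of what the YAML file specifies, and are not NeuralBench's actual schema:

```yaml
# Hypothetical task config; field names are illustrative, not the real schema
data:
  source: openneuro        # public repository fetched via NeuralFetch
splits:
  train: 0.8
  valid: 0.1
  test: 0.1
preprocessing:
  - bandpass: [0.5, 40.0]  # Hz
  - resample: 128          # Hz
target:
  type: multiclass
training:
  optimizer: adamw
  lr: 1.0e-4
  weight_decay: 0.05
  max_epochs: 50
evaluation:
  metric: balanced_accuracy
```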

What NeuralBench-EEG v1.0 Covers
The first release focuses on EEG and spans eight task categories: cognitive decoding (image, sentence, speech, typing, video, and word decoding), brain-computer interfacing (BCI), evoked responses, clinical tasks, internal state, sleep, phenotyping, and miscellaneous.
Three classes of models are compared:
- Task-specific architectures (~1.5K–4.2M parameters, trained from scratch): ShallowFBCSPNet, Deep4Net, EEGNet, BDTCN, ATCNet, EEGConformer, SimpleConvTimeAgg, and CTNet.
- EEG foundation models (~3.2M–157.1M parameters, pretrained and fine-tuned): BENDR, LaBraM, BIOT, CBraMod, LUNA, and REVE.
- Handcrafted feature baselines: sklearn-style pipelines using symmetric positive definite (SPD) matrix representations fed into logistic or ridge regression.
All foundation models are fine-tuned end-to-end using a shared training recipe: AdamW optimizer, learning rate of 10⁻⁴, weight decay of 0.05, cosine annealing with 10% warmup, and up to 50 epochs with early stopping (patience = 10). The sole exception is BENDR, for which the learning rate is lowered to 10⁻⁵ and gradients are clipped at 0.5 to obtain stable learning curves. This deliberate standardization otherwise removes model-specific optimization tricks, such as layer-wise learning rate decay, two-stage probing, or LoRA, so that architecture and pretraining method are what actually get evaluated.
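The schedule in that shared recipe, 10% linear warmup followed by cosine annealing, can be sketched in plain Python. This is an illustration of the schedule's shape under those stated hyperparameters, not NeuralBench's actual implementation:

```python
import math

def lr_at_step(step, total_steps, base_lr=1e-4, warmup_frac=0.10):
    """Learning rate with linear warmup, then cosine annealing toward zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        # Linear ramp from ~0 up to base_lr over the warmup phase.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# The peak (base_lr) is reached at the end of warmup; for BENDR the same
# schedule would simply be run with base_lr=1e-5.
```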
Data splitting is handled differently per task type to reflect real-world generalization constraints: predefined splits where provided by the dataset's research group, leave-concept-out splits for cognitive decoding tasks (all subjects seen in training, but a held-out set of stimuli used for testing), cross-subject splits for most clinical and BCI tasks, and within-subject splits for datasets with very few participants. Each model is trained three times per task using three different random seeds.
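The leave-concept-out idea can be made concrete with a minimal sketch: every subject appears in training, but test trials use only held-out stimuli. The trial representation and field names here are hypothetical, chosen for illustration:

```python
def leave_concept_out_split(trials, held_out_stimuli):
    """Split trials so the test set contains only held-out stimuli.

    All subjects still appear in training; only the stimuli are unseen.
    """
    held_out = set(held_out_stimuli)
    train = [t for t in trials if t["stimulus"] not in held_out]
    test = [t for t in trials if t["stimulus"] in held_out]
    return train, test

trials = [
    {"subject": "s1", "stimulus": "cat"},
    {"subject": "s1", "stimulus": "dog"},
    {"subject": "s2", "stimulus": "cat"},
    {"subject": "s2", "stimulus": "dog"},
]
train, test = leave_concept_out_split(trials, ["dog"])
# Both subjects remain in train; every "dog" trial is held out for testing.
```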
Evaluation metrics are standardized by task type: balanced accuracy for binary and multiclass classification, macro F1-score for multilabel classification, Pearson correlation for regression, and top-5 accuracy for retrieval tasks. All results are additionally reported as normalized scores (s̃), where 0 corresponds to dummy-level performance and 1 corresponds to perfect performance, enabling fair cross-task comparisons regardless of metric scale.
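A normalized score of this kind is typically a linear rescaling, s̃ = (s − s_dummy) / (s_perfect − s_dummy); the exact form used in NeuralBench is assumed here from the 0-to-1 description above:

```python
def normalized_score(score, dummy_score, perfect_score):
    """Linearly map a raw metric so that 0 = dummy-level and 1 = perfect."""
    return (score - dummy_score) / (perfect_score - dummy_score)

# Example: balanced accuracy of 0.75 on a balanced binary task,
# where a dummy classifier scores 0.5 and a perfect one scores 1.0.
normalized_score(0.75, 0.5, 1.0)  # → 0.5
```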
One important methodological note: some EEG foundation models were pretrained on datasets that overlap with NeuralBench's downstream evaluation sets. Rather than discarding those results, the benchmark flags them with hatched bars in the result figures so readers can identify potential pretraining data leakage. No strong trend suggesting that leakage inflates performance was observed, but the transparency is preserved.
The benchmark comes in two variants: NeuralBench-EEG-Core v1.0, which uses a single representative dataset per task for broad coverage, and NeuralBench-EEG-Full v1.0, which expands to up to 24 datasets per task to study within-task variability across recording hardware, labs, and subject populations. A Kendall's τ of 0.926 (p < 0.001) between Core and Full rankings confirms that the Core variant is a reliable proxy, though a few model positions do shift, including CTNet overtaking LUNA when more datasets are included.
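Kendall's τ, the statistic used to compare the Core and Full rankings, measures pairwise rank agreement: +1 for identical orderings, −1 for fully reversed ones. A minimal pure-Python sketch (ignoring tie handling, which the real statistic accounts for):

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """Kendall's tau: (concordant - discordant pairs) / total pairs, no ties."""
    pairs = list(combinations(range(len(rank_a)), 2))
    concordant = discordant = 0
    for i, j in pairs:
        # Positive product => the pair is ordered the same way in both rankings.
        agree = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if agree > 0:
            concordant += 1
        elif agree < 0:
            discordant += 1
    return (concordant - discordant) / len(pairs)

# Identical rankings give tau = 1.0; fully reversed rankings give -1.0.
```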

Two Key Findings
Finding 1: Foundation models only marginally outperform task-specific models. The top-ranked models overall are REVE (69.2M parameters, mean normalized rank 0.20), LaBraM (5.8M, rank 0.21), and LUNA (40.4M, rank 0.30). But several task-specific models trained from scratch, including CTNet (150K parameters, rank 0.32), SimpleConvTimeAgg (4.2M, rank 0.35), and Deep4Net (146K, rank 0.43), trail closely behind. CTNet actually overtakes the LUNA foundation model to rank third in the Full variant, despite having roughly 270× fewer parameters. This shows the gap between task-specific and foundation models is narrow enough that expanding dataset coverage alone is sufficient to change global rankings.
Finding 2: Many tasks remain genuinely hard. Cognitive decoding tasks, which recover dense representations of images, speech, sentences, video, or words from brain activity, are particularly challenging, with even the best models scoring well below ceiling. Tasks like mental imagery, sleep arousal, psychopathology decoding, and cross-subject motor imagery and P300 classification frequently yield performance close to dummy level. These tasks represent the best benchmarks for stress-testing the next generation of EEG foundation models.
Tasks approaching saturation include SSVEP classification, pathology detection, seizure detection, sleep stage classification, and phenotyping tasks like age regression and sex classification.
Beyond EEG: MEG and fMRI
Even in this initial EEG-focused release, NeuralBench already supports MEG and fMRI tasks as a proof of concept. Notably, the REVE model, pretrained solely on EEG data, achieves the best performance among all tested models on the typing decoding task in MEG. This is a striking early signal that EEG-pretrained representations may transfer meaningfully across brain recording modalities, a hypothesis the framework is positioned to rigorously test in future releases.
The infrastructure is explicitly designed for expansion to intracranial EEG (iEEG), functional near-infrared spectroscopy (fNIRS), and electromyography (EMG).
How to Get Started
Installation takes a single command: pip install neuralbench. From there, running the audiovisual stimulus classification task on EEG looks like this:
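The invocation below is a sketch based on the three-step flow described earlier (download the data, prepare the cache, execute); the exact subcommand names and task identifier are assumptions, so check the project's CLI help for the real syntax:

```shell
# Hypothetical invocation; subcommand and task names are illustrative
neuralbench download audiovisual_stimulus_classification   # fetch the dataset(s)
neuralbench prepare  audiovisual_stimulus_classification   # build the cached dataloaders
neuralbench run      audiovisual_stimulus_classification   # train and evaluate
```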
