Fastino Labs Open-Sources GLiGuard: A 300M Parameter Safety Moderation Model That Matches or Exceeds Accuracy of Models 23–90x Its Size
As LLM-powered applications move into production, and as AI agents take on more consequential tasks like browsing the web, writing and executing code, and interacting with external services, safety moderation has quietly become one of the most operationally expensive parts of the stack.
Most developers who have deployed a production LLM system know the problem: you have to evaluate every user prompt before it reaches the model, and every model response before it reaches the user. That means your guardrail model runs on every single request, at every turn of a conversation. The guardrail latency compounds. The cost compounds. And the current generation of open-source guardrail models (LlamaGuard4 at 12B, WildGuard at 7B, ShieldGemma at 27B, NemoGuard at 8B) are all decoder-only models with billions of parameters, built for flexibility but not for speed.
Fastino Labs released GLiGuard, a 300 million parameter open-source safety moderation model designed to address this specific problem. GLiGuard evaluates multiple safety dimensions in a single pass, and across nine safety benchmarks its accuracy matches or exceeds that of models 23 to 90 times its size while running up to 16 times faster.

Why Decoder LLMs May Not Be the Right Tool for Safety Moderation
To understand what makes GLiGuard different, it helps to know why current guardrail models are slow. Most leading guardrail models are built on decoder-only transformer architectures: they generate their safety verdicts autoregressively, one token at a time, the same way a large language model generates a response to a chat message.
This design made sense when safety requirements were fluid. Decoder models can interpret natural language task descriptions and adapt to new safety policies without retraining. But autoregressive generation is inherently sequential, which makes it slow and computationally expensive.
There is a compounding problem on top of that. Most guardrail models need to assess inputs across multiple safety dimensions: what type of harm is present, whether the user prompt is attempting to bypass safety training, whether the model's response is itself unsafe, and so on. Because decoder models generate output sequentially, these assessments are often produced one after another, and latency compounds as more criteria are evaluated.
In other words, the architecture that makes decoder models flexible is also the architecture that makes them the wrong tool for what is fundamentally a classification problem.
What GLiGuard Actually Does
GLiGuard is a small encoder-based model that reframes safety moderation as a text classification problem rather than a text generation problem. Encoder models process the entire input at once and output a classification label from a fixed label set, while decoder models generate their output one token at a time, left to right.
The key architectural insight is in how GLiGuard handles multiple tasks simultaneously. Instead of generating tokens, GLiGuard encodes both the input text and the task definitions (labels) together. These are fed to the model, which scores every label simultaneously in a single forward pass and returns the highest-scoring label for each task. Because all tasks and their candidate labels are part of the input itself, evaluating more safety dimensions doesn't add latency; it simply means including more labels in the input.

GLiGuard runs four moderation tasks concurrently in a single forward pass:
- Safety classification (safe / unsafe), applied to both user prompts before generation and model responses after generation.
- Jailbreak technique detection across 11 techniques, including prompt injection, roleplay bypass, instruction override, and social engineering. If any jailbreak technique is detected, the prompt is automatically flagged as unsafe.
- Harm category detection across 14 categories: violence, sexual content, hate speech, PII exposure, misinformation, child safety, copyright violation, and others. A single input can trigger multiple categories at once.
- Refusal detection (compliance / refusal), tracked separately to help measure over-refusal (when a model refuses safe requests) and detect false compliance (when a model appears to comply but doesn't). If a refusal is detected, the response is automatically marked as safe.
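The two override rules in the list above can be captured in a few lines. The function names and the `"none"` / `"refusal"` label strings are assumptions for illustration; the source only states the overrides themselves:

```python
def prompt_verdict(safety_label, jailbreak_label):
    # Any detected jailbreak technique overrides the base safety label.
    return "unsafe" if jailbreak_label != "none" else safety_label

def response_verdict(safety_label, refusal_label):
    # A detected refusal overrides: the model declined, so the exchange is safe.
    return "safe" if refusal_label == "refusal" else safety_label
```

Note that the jailbreak override applies on the prompt side and the refusal override on the response side, matching how the tasks are described above.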
Training Data and Fine-Tuning
GLiGuard was trained on a mixture of human-annotated and synthetically generated training data. For prompt safety, response safety, and refusal detection, the team used WildGuardTrain, a dataset of 87,000 human-annotated examples. For harm category and jailbreak technique detection, labels for the unsafe samples were generated using GPT-4.1.
During early training, the model struggled to distinguish between similar harm categories like toxic speech and violence, so the team used Pioneer to generate supplemental synthetic data with edge cases targeting these fine-grained distinctions.
On the architecture side, GLiGuard was trained via full fine-tuning of the GLiNER2-base-v1 checkpoint for 20 epochs using the AdamW optimizer. GLiNER2 is Fastino's own architecture for multi-task text classification, a natural starting point for a model designed to score multiple label sets in a single pass.
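The stated training setup can be summarized as a configuration sketch. Only the checkpoint, tuning mode, epoch count, and optimizer come from the write-up; the remaining hyperparameters are placeholders, not reported values:

```python
# Entries marked "assumed" are NOT from the source.
TRAIN_CONFIG = {
    "base_checkpoint": "GLiNER2-base-v1",  # from the source
    "tuning": "full-finetune",             # from the source: all weights updated
    "epochs": 20,                          # from the source
    "optimizer": "AdamW",                  # from the source
    "learning_rate": 2e-5,                 # assumed placeholder
    "weight_decay": 0.01,                  # assumed placeholder
}
```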

Benchmark Results: Accuracy and Speed
The research team evaluated GLiGuard across nine established safety benchmarks. These benchmarks cover both prompt and response classification, testing whether a model can identify harmful content, withstand adversarial attacks, distinguish between different types of harm, and avoid over-flagging safe content. Results use macro-averaged F1, a standard metric that balances precision and recall.
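Macro-averaged F1 computes F1 independently per class and then takes the unweighted mean, so a rare class (e.g. "unsafe") counts as much as a common one. A minimal sketch of the metric:

```python
def f1(tp, fp, fn):
    # Standard F1 from true positives, false positives, false negatives.
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def macro_f1(per_class_counts):
    # Unweighted mean of per-class F1 scores: each class, however rare,
    # contributes equally to the final number.
    return sum(f1(*c) for c in per_class_counts) / len(per_class_counts)
```

This is why macro F1 is the usual choice for moderation benchmarks, where unsafe examples are typically a small minority of the data.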
On accuracy:
- GLiGuard scores 87.7 average F1 on prompt classification, within 1.7 points of the best model (PolyGuard-Qwen at 89.4).
- It achieves the second-highest average F1 on response classification (82.7), behind only Qwen3Guard-8B (84.1).
- It outperforms LlamaGuard4-12B, ShieldGemma-27B, and NemoGuard-8B despite being 23–90× smaller.

On throughput and latency, benchmarked on a single NVIDIA A100 GPU:
- GLiGuard achieves up to 16.2× higher throughput (133 vs. 8.2 samples/s at batch size 4).
- GLiGuard achieves up to 16.6× lower latency: 26 ms vs. 426 ms at sequence length 64.
These are not marginal improvements. At 26 ms per request versus 426 ms, the difference is meaningful in any real-time user-facing application, and the compounding effect across a multi-turn conversation makes the gap even larger in practice.
Key Takeaways
- GLiGuard is a 300M parameter encoder-based safety moderation model that handles four tasks (safety classification, jailbreak detection, harm categorization, and refusal detection) in a single forward pass.
- Unlike decoder-only guardrail models that generate verdicts autoregressively, GLiGuard reframes safety moderation as a text classification problem, eliminating the sequential latency bottleneck.
- Benchmarked on a single NVIDIA A100 GPU, GLiGuard achieves up to 16.2× higher throughput and 16.6× lower latency (26 ms vs. 426 ms) compared to existing SOTA models like ShieldGemma-27B.
- Across nine safety benchmarks, GLiGuard scores 87.7 average F1 on prompt classification and 82.7 on response classification, outperforming LlamaGuard4-12B, ShieldGemma-27B, and NemoGuard-8B despite being 23–90× smaller.
- Model weights are available under Apache 2.0 on Hugging Face (fastino/gliguard-LLMGuardrails-300M), making it deployable on a single GPU without heavy infrastructure.
Check out the Paper, Model Weights on Hugging Face, GitHub Repo, and technical details.
The post Fastino Labs Open-Sources GLiGuard: A 300M Parameter Safety Moderation Model That Matches or Exceeds Accuracy of Models 23–90x Its Size appeared first on MarkTechPost.
