
OpenAI Releases Research Preview of ‘gpt-oss-safeguard’: Two Open-Weight Reasoning Models for Safety Classification Tasks

OpenAI has launched a research preview of gpt-oss-safeguard, two open-weight safety reasoning models that let developers apply custom safety policies at inference time. The models come in two sizes, gpt-oss-safeguard-120b and gpt-oss-safeguard-20b, both fine-tuned from gpt-oss, both licensed under Apache 2.0, and both available on Hugging Face for local use.

https://openai.com/index/introducing-gpt-oss-safeguard/

Why Policy-Conditioned Safety Matters

Conventional moderation models are trained on a single fixed policy. When that policy changes, the model must be retrained or replaced. gpt-oss-safeguard reverses this relationship. It takes the developer-authored policy as input along with the user content, then reasons step by step to decide whether the content violates the policy. This turns safety into a prompting and evaluation task, which is better suited to fast-changing or domain-specific harms such as fraud, biology, self-harm, or game-specific abuse.
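To make the policy-as-input idea concrete, here is a minimal sketch of how a developer-authored policy and user content might be packaged together for a chat-style model. The function name `build_safeguard_prompt` and the exact message layout are illustrative assumptions, not OpenAI's documented interface:

```python
# Hypothetical sketch: policy-conditioned moderation. The model receives
# the policy alongside the content, so changing the policy means editing
# a string, not retraining weights.

def build_safeguard_prompt(policy: str, content: str) -> list[dict]:
    """Package a developer-authored policy and user content as chat
    messages for a policy-conditioned classifier."""
    return [
        {"role": "system", "content": f"Apply the following policy:\n{policy}"},
        {"role": "user", "content": f"Classify this content:\n{content}"},
    ]

# A fraud policy written today can be swapped out tomorrow with no retraining:
policy = "Flag any message that solicits payment via untraceable gift cards."
messages = build_safeguard_prompt(policy, "Send me $200 in gift card codes.")
print(messages[0]["content"])
```

The design point is that the policy lives in the prompt, which is what makes iteration a prompt-engineering loop rather than a training run.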

Same Pattern as OpenAI’s Internal Safety Reasoner

OpenAI states that gpt-oss-safeguard is an open-weight implementation of the Safety Reasoner used internally across systems like GPT-5, ChatGPT Agent, and Sora 2. In production settings OpenAI already runs small high-recall filters first, then escalates uncertain or sensitive items to a reasoning model, and in recent launches up to 16% of total compute was spent on safety reasoning. The open release lets external teams reproduce this defense-in-depth pattern instead of guessing how OpenAI's stack works.

Model Sizes and Hardware Fit

The large model, gpt-oss-safeguard-120b, has 117B parameters with 5.1B active parameters and is sized to fit on a single 80GB H100-class GPU. The smaller gpt-oss-safeguard-20b has 21B parameters with 3.6B active parameters and targets lower latency or smaller GPUs, including 16GB setups. Both models were trained on the harmony response format, so prompts must follow that structure or results will degrade. The license is Apache 2.0, the same as the parent gpt-oss models, so commercial local deployment is permitted.
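As a rough illustration of what the harmony framing looks like, here is a simplified renderer. This is an assumption-laden sketch: the real format (implemented in the openai-harmony library and in the models' Hugging Face chat templates) adds channels and further structure, and in practice you should use those tools rather than hand-rolling strings:

```python
# Simplified sketch of harmony-style role/message framing.
# Real prompts should be built with the openai-harmony library or the
# model's chat template; this only shows why free-form prompts degrade.

def render_harmony(messages: list[dict]) -> str:
    parts = [
        f"<|start|>{m['role']}<|message|>{m['content']}<|end|>"
        for m in messages
    ]
    # Leave the prompt open for the assistant turn to be generated.
    return "".join(parts) + "<|start|>assistant"

prompt = render_harmony([
    {"role": "system", "content": "You are a safety classifier."},
    {"role": "user", "content": "Is this content allowed?"},
])
print(prompt)
```

A plain-text prompt without this structure lands outside the distribution the models were trained on, which is why OpenAI warns that results degrade.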


Evaluation Results

OpenAI evaluated the models on internal multi-policy tests and on public datasets. In multi-policy accuracy, where the model must correctly apply several policies at once, gpt-oss-safeguard and OpenAI's internal Safety Reasoner outperform gpt-5-thinking and the open gpt-oss baselines. On the 2022 moderation dataset the new models slightly outperform both gpt-5-thinking and the internal Safety Reasoner, however OpenAI specifies that this gap is not statistically significant, so it should not be oversold. On ToxicChat, the internal Safety Reasoner still leads, with gpt-oss-safeguard close behind. This places the open models in the competitive range for real moderation tasks.

OpenAI is explicit that pure reasoning on every request is expensive. The recommended setup is to run small, fast, high-recall classifiers on all traffic, then send only uncertain or sensitive content to gpt-oss-safeguard, and when user experience requires fast responses, to run the reasoner asynchronously. This mirrors OpenAI's own production guidance and reflects the reality that dedicated task-specific classifiers can still win when there is a large, high-quality labeled dataset.
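The layered setup described above can be sketched as a simple escalation pipeline. `fast_score` and `reason_over_policy` are stand-in stubs (a real deployment would use a trained lightweight classifier and an actual gpt-oss-safeguard call); only the routing logic is the point:

```python
# Hedged sketch of a layered moderation pipeline: a cheap high-recall
# classifier screens all traffic, and only the uncertain middle band
# escalates to the expensive reasoning model.

def fast_score(text: str) -> float:
    """Stub for a small, fast, high-recall classifier."""
    t = text.lower()
    if "gift card" in t:
        return 0.9
    if "payment" in t:
        return 0.5
    return 0.05

def reason_over_policy(policy: str, text: str) -> str:
    """Stub for a gpt-oss-safeguard call returning a verdict label."""
    return "violation"  # placeholder for the model's reasoned decision

def moderate(policy: str, text: str, low: float = 0.1, high: float = 0.8) -> str:
    score = fast_score(text)
    if score < low:
        return "allow"   # confidently benign, never reaches the reasoner
    if score > high:
        return "block"   # confidently harmful, also skips the reasoner
    return reason_over_policy(policy, text)  # uncertain band escalates

policy = "No gift card payment scams."
print(moderate(policy, "hello there"))
print(moderate(policy, "send gift card codes"))
print(moderate(policy, "unusual payment request"))
```

The thresholds `low` and `high` are the cost lever: widening the confident bands cuts reasoner traffic, narrowing them trades compute for accuracy on borderline content.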

Key Takeaways

  1. gpt-oss-safeguard is a research preview of two open-weight safety reasoning models, 120b and 20b, that classify content using developer-supplied policies at inference time, so policy changes don’t require retraining.
  2. The models implement the same Safety Reasoner pattern OpenAI uses internally across GPT-5, ChatGPT Agent, and Sora 2, where a fast first-pass filter routes only risky or ambiguous content to a slower reasoning model.
  3. Both models are fine-tuned from gpt-oss, keep the harmony response format, and are sized for real deployments: the 120b model fits on a single H100-class GPU, the 20b model targets 16GB-level hardware, and both are Apache 2.0 on Hugging Face.
  4. On internal multi-policy evaluations and on the 2022 moderation dataset, the safeguard models outperform gpt-5-thinking and the gpt-oss baselines, but OpenAI notes that the small margin over the internal Safety Reasoner is not statistically significant.
  5. OpenAI recommends using these models in a layered moderation pipeline, together with community resources such as ROOST, so platforms can express custom taxonomies, audit the chain of thought, and update policies without touching weights.

Editorial Comments

OpenAI is taking an internal safety pattern and making it reproducible, which is the most important part of this launch. The models are open-weight, policy-conditioned, and Apache 2.0, so platforms can finally apply their own taxonomies instead of accepting fixed labels. The fact that gpt-oss-safeguard matches and sometimes slightly exceeds the internal Safety Reasoner on the 2022 moderation dataset, while outperforming gpt-5-thinking on multi-policy accuracy, albeit by a margin that is not statistically significant, shows the approach is already usable. The recommended layered deployment is realistic for production.

The post OpenAI Releases Research Preview of ‘gpt-oss-safeguard’: Two Open-Weight Reasoning Models for Safety Classification Tasks appeared first on MarkTechPost.
