
StepFun Releases Step 3.7 Flash: A 198B MoE Vision-Language Model for Coding Agents and Search Workflows

StepFun today launched Step 3.7 Flash, a multimodal Mixture-of-Experts model targeting agentic use cases. It adds native vision input and improved tool-use reliability over Step 3.5 Flash.

What is Step 3.7 Flash?

Step 3.7 Flash is a 198B-parameter sparse Mixture-of-Experts (MoE) vision-language model. It pairs a 196B-parameter language backbone with a 1.8B-parameter vision encoder (ViT) for native image understanding.

The model activates roughly 11B parameters per token during inference. In MoE architectures, only a subset of “expert” sub-networks fires per forward pass, not the full network. This keeps inference compute closer to that of an 11B dense model while maintaining a 198B total parameter budget.
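To make the routing idea concrete, here is a minimal sketch of top-k expert selection for a single token. The expert count, the value of k, and the random router scores are hypothetical placeholders chosen for illustration; the article does not state Step 3.7 Flash's actual routing configuration. Only the 198B and 11B figures come from the article.

```python
import math
import random


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    peak = max(router_logits[i] for i in top)
    exps = [math.exp(router_logits[i] - peak) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


# Hypothetical sizes for illustration; not the model's real configuration.
NUM_EXPERTS = 64
TOP_K = 4

rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
chosen = top_k_route(logits, TOP_K)
print(f"experts used for this token: {len(chosen)} of {NUM_EXPERTS}")

# Parameter counts reported in the article (in billions).
TOTAL_PARAMS_B = 198
ACTIVE_PARAMS_B = 11
active_share = ACTIVE_PARAMS_B / TOTAL_PARAMS_B
print(f"active share of parameters: {active_share:.1%}")
```

Under the article's figures, roughly one parameter in eighteen participates in any given token, which is where the compute saving comes from.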

Key specs:

| Spec | Value |
| --- | --- |
| Total parameters | 198B (196B language + 1.8B ViT) |
| Active parameters per token | ~11B |
| Context window | 256k tokens |
| Throughput | Up to 400 tokens/sec |
| Reasoning levels | Low, medium, high |
| License | Apache 2.0 |

Architecture Notes

The vision encoder runs as a separate 1.8B ViT module. It injects image representations into the language backbone’s context. Step 3.5 Flash had no multimodal support; this is a new addition in 3.7.

Three selectable reasoning depths (low, medium, and high) let developers trade latency for reasoning depth. Low is faster and cheaper; high applies more computation per response.

Agentic Coding Performance

On SWE-Bench Pro, Step 3.7 Flash scores 56.26%, up from Step 3.5 Flash’s 51.3%, a gain of roughly 5 percentage points. On Terminal-Bench 2.1, it scores 59.55%, up from 53.37%.

On SWE-MTLG (a multi-task long-generation coding benchmark), it scores 72.42%.

Cross-harness consistency on StepFun’s internal Step-SWE-Bench:

| Scaffold | Step 3.7 Flash | Step 3.5 Flash |
| --- | --- | --- |
| Hermes Agent | 67.5% | 60.0% |
| OpenClaw | 67.0% | 47.0% |
| KiloCode | 67.5% | 59.0% |
| RooCode | 64.5% | 43.0% |
| Claude Code | 71.5% | 73.0% |
| OpenCode | 64.5% | 57.0% |

Step 3.5 Flash ranged from 43% to 73% across harnesses, a spread of 30 points. Step 3.7 Flash ranges from 64.5% to 71.5%, a spread of 7 points. In production, coding agents often run inside heterogeneous scaffolds, each with its own prompting conventions and tool schemas. Narrower per-harness variance means more predictable behavior across different setups.
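The spread can be checked directly from the per-scaffold scores. The short script below is only a sanity check on the reported numbers: it computes the minimum, maximum, range, mean, and population standard deviation for each model. The helper name `spread` is introduced here for illustration.

```python
from statistics import mean, pstdev

# Per-scaffold scores from the Step-SWE-Bench table: (Step 3.7 Flash, Step 3.5 Flash).
scores = {
    "Hermes Agent": (67.5, 60.0),
    "OpenClaw": (67.0, 47.0),
    "KiloCode": (67.5, 59.0),
    "RooCode": (64.5, 43.0),
    "Claude Code": (71.5, 73.0),
    "OpenCode": (64.5, 57.0),
}


def spread(values):
    """Summarize how widely a set of scores varies."""
    values = list(values)
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": mean(values),
        "stdev": pstdev(values),
    }


v37 = spread(pair[0] for pair in scores.values())
v35 = spread(pair[1] for pair in scores.values())

for name, stats in (("Step 3.7 Flash", v37), ("Step 3.5 Flash", v35)):
    print(f"{name}: {stats['min']}-{stats['max']} "
          f"(range {stats['range']:.1f}, mean {stats['mean']:.1f}, "
          f"stdev {stats['stdev']:.1f})")
```

Note that the one scaffold where 3.5 Flash scores higher (Claude Code, 73.0% vs. 71.5%) is also the outlier that sets the top of its range.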

Advisor Mode

Step 3.7 Flash supports Advisor Mode, StepFun’s implementation of the advisor strategy described by Anthropic. The model runs the agentic loop end-to-end (calling tools, reading results, iterating) and escalates to a larger advisor model only at specific inflection points, such as planning or recovering from repeated failures. Most of the run stays at executor cost.
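The control flow described above can be sketched as a loop in which a cheap executor does the work and a larger advisor is consulted only at chosen points. This is an illustrative sketch, not StepFun's or Anthropic's implementation: the function names, the outcome strings, and the failure threshold are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class RunLog:
    executor_calls: int = 0
    advisor_calls: int = 0


def run_with_advisor(executor_step, advisor, max_steps=20, failure_threshold=2):
    """Run an executor loop, escalating to the advisor only when needed.

    executor_step(guidance) -> "ok", "fail", or "done"   (hypothetical contract)
    advisor(reason) -> guidance string                    (hypothetical contract)
    """
    log = RunLog()

    # Escalation point 1 (assumed): ask the advisor for an initial plan.
    guidance = advisor("plan")
    log.advisor_calls += 1

    consecutive_failures = 0
    for _ in range(max_steps):
        outcome = executor_step(guidance)
        log.executor_calls += 1
        if outcome == "done":
            break
        consecutive_failures = consecutive_failures + 1 if outcome == "fail" else 0

        # Escalation point 2 (assumed): repeated failures trigger a recovery consult.
        if consecutive_failures >= failure_threshold:
            guidance = advisor("recover")
            log.advisor_calls += 1
            consecutive_failures = 0
    return log


# Usage with scripted stand-ins for the two models.
outcomes = iter(["ok", "fail", "fail", "ok", "done"])
log = run_with_advisor(
    executor_step=lambda guidance: next(outcomes),
    advisor=lambda reason: f"advice for {reason}",
)
print(log)
```

In this scripted run the executor is called five times and the advisor twice, which mirrors the claim that most of the run stays at executor cost.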

With Advisor Mode enabled on SWE-Bench Verified, StepFun reports that Step 3.7 Flash reaches 97% of Claude Opus 4.6’s coding performance at roughly one-ninth the per-task cost ($0.19 vs. $1.76 per task). These are StepFun’s internal figures.
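Both ratios follow from the underlying numbers. The check below uses the SWE-Bench Verified scores given in the explainer later in this article (76.3% for Step 3.7 Flash with Advisor Mode, 78.7% for Claude Opus 4.6) together with the per-task costs; all four values are StepFun's reported figures, not independent measurements.

```python
# Figures as reported by StepFun (internal, not independently verified).
flash_score, opus_score = 76.3, 78.7   # SWE-Bench Verified, percent
flash_cost, opus_cost = 0.19, 1.76     # US dollars per task

relative_performance = flash_score / opus_score
cost_ratio = opus_cost / flash_cost

print(f"relative performance: {relative_performance:.1%}")  # 97.0%
print(f"cost ratio: {cost_ratio:.1f}x")                      # 9.3x
```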

Multimodal Capabilities

Step 3.7 Flash supports two visual tool pathways:

Visual Search Tool: For recognition tasks where the model’s parametric knowledge is insufficient (long-tail entities, recently emerged concepts), it invokes a visual search tool to retrieve and verify. On SimpleVQA (with Search), it scores 79.16%, comparable to GPT 5.5 (79.11%) and above Kimi K2.6 (78.24%) and GLM 5V Turbo (78.20%).

Python Tool: For fine-grained visual tasks (high-resolution images, visual probing, bounding-box analysis), it uses a code interface to crop, zoom, and draw pixels or bounding boxes. On V* (a self-tested score with Python), it scores 95.29%. On HR-Bench 4K and HR-Bench 8K, it scores 89.13% and 86.34% respectively.

StepFun notes a behavior observed during testing: the model combined visual tools with non-visual tools without being explicitly trained to do so. For example, after generating frontend code, it used the GUI to render and inspect the result before iterating. StepFun describes this as emergent compositional tool use.

On Android Daily (long-horizon phone UI task completion), Step 3.7 Flash scores 61.87%, ahead of Kimi K2.6 (53.36%) and GLM 5V Turbo (51.68%). Gemini 3 Flash (63.21%) leads this benchmark.

Search and Research Benchmarks

StepFun focused this model’s search design on planning, evidence filtering, and synthesis, integrating search as part of the reasoning loop rather than as a separate add-on.

| Benchmark | Step 3.7 Flash | Notable comparison |
| --- | --- | --- |
| HLE with Tools (acc) | 47.20% | DeepSeek V4 Flash: 45.10% |
| BrowseComp (acc) | 75.82% | Claude Opus 4.7: 79.30% |
| DeepSearchQA (F1) | 92.82% | Kimi K2.6: 92.50% |
| ResearchRubrics (score) | 71.68% | GPT 5.5: 61.50% |

Note: The HLE with Tools score of 47.20% compares to Step 3.5 Flash’s text-only score of 35.68%. Step 3.5 Flash did not support tool-augmented evaluation on HLE.

General Agent Benchmarks

| Benchmark | Step 3.7 Flash | Description |
| --- | --- | --- |
| Toolathlon | 49.51% | Multi-tool coordination |
| ClawEval-1.1 | 67.07% | Daily autonomous task execution in realistic environments |
| GDPval (44 occupations) | 45.8% | General professional task execution |
| Tau2-bench Telecom | >98% | Across different reasoning difficulty tiers |

On ClawEval-1.1, Step 3.7 Flash (67.07%) leads DeepSeek V4 Flash (57.80%) and DeepSeek V4 Pro (59.80%) among the compared models.

Long-Context Performance

On AA-LCR (a long-context retrieval benchmark, avg@16/acc), Step 3.7 Flash scores 63.94%. This is comparable to DeepSeek V4 Flash (63.70%) and slightly below DeepSeek V4 Pro (66.30%).

Pricing

| Token Type | Price |
| --- | --- |
| Input (cache miss) | $0.20 / M tokens |
| Input (cache hit) | $0.04 / M tokens |
| Output | $1.15 / M tokens |
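As a worked example of how these rates combine, the sketch below estimates the cost of a single request. The function name and the sample token counts are hypothetical; only the three per-million-token prices come from the table above.

```python
# Prices in US dollars per million tokens, from the pricing table.
PRICE_PER_M = {"input_miss": 0.20, "input_hit": 0.04, "output": 1.15}


def request_cost(input_tokens, cached_tokens, output_tokens):
    """Estimate the cost of one request in US dollars.

    cached_tokens is the portion of input_tokens served from the cache.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    uncached = input_tokens - cached_tokens
    return (
        uncached * PRICE_PER_M["input_miss"]
        + cached_tokens * PRICE_PER_M["input_hit"]
        + output_tokens * PRICE_PER_M["output"]
    ) / 1_000_000


# Hypothetical agent turn: 100k-token context, 80% cached, 5k tokens generated.
cost = request_cost(input_tokens=100_000, cached_tokens=80_000, output_tokens=5_000)
print(f"estimated cost: ${cost:.5f}")
```

For long-running agents that resend a large, mostly unchanged context on every turn, the cache-hit rate matters as much as the headline input price.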

Marktechpost’s Visual Explainer

Model Release
Step 3.7 Flash — A 198B MoE Vision-Language Model
StepFun · Released May 29, 2026 · Apache 2.0

Slide 1 of 8 — Overview

What Is Step 3.7 Flash?

Step 3.7 Flash is a sparse Mixture-of-Experts (MoE) vision-language model from StepFun. It combines a 196B-parameter language backbone with a 1.8B-parameter Vision Transformer (ViT) encoder for native image understanding.

In an MoE model, only a subset of “expert” sub-networks activates per token, not the full network. This keeps inference compute close to that of an 11B dense model while maintaining 198B total parameters.

| Spec | Value |
| --- | --- |
| Total Params | 198B |
| Active / Token | ~11B |
| Context Window | 256k tokens |
| Throughput | 400 tok/sec |
| Reasoning Levels | Low / Med / High |
| License | Apache 2.0 |

Slide 2 of 8 — Architecture

Architecture Notes

The 1.8B ViT encoder runs as a separate module and injects image representations into the language backbone’s context. Step 3.5 Flash was text-only; native multimodal support is new in 3.7.

Three selectable reasoning depths let developers balance speed and cost:

  • Low: Fastest, cheapest. Suitable for simple completions.
  • Medium: Balanced cost and reasoning depth.
  • High: More compute per response. Best for complex agent tasks.

MoE routing means you pay for ~11B active params at inference, not 198B. This is the core efficiency trade-off in Flash-tier models.

Slide 3 of 8 — Agentic Coding

Agentic Coding Performance

Step 3.7 Flash scores 56.26% on SWE-Bench Pro (up from 51.3% in 3.5 Flash) and 59.55% on Terminal-Bench 2.1 (up from 53.37%). On SWE-MTLG it scores 72.42%.

Per-harness scores on StepFun’s internal Step-SWE-Bench:

| Scaffold | 3.7 Flash | 3.5 Flash |
| --- | --- | --- |
| Hermes Agent | 67.5% | 60.0% |
| OpenClaw | 67.0% | 47.0% |
| KiloCode | 67.5% | 59.0% |
| RooCode | 64.5% | 43.0% |
| Claude Code | 71.5% | 73.0% |
| OpenCode | 64.5% | 57.0% |

3.5 Flash ranged 43–73% across harnesses. 3.7 Flash narrows that to 64.5–71.5%, making it more predictable across heterogeneous scaffolds.

Slide 4 of 8 — Advisor Mode

Advisor Mode

Step 3.7 Flash supports Advisor Mode, StepFun’s implementation of the advisor strategy described by Anthropic. The model runs the full agentic loop (calling tools, reading results, iterating) and escalates to a larger advisor model only at specific inflection points.

  • Escalates during planning or recovery from repeated failures
  • Most of the run stays at executor (Flash) cost
  • The large advisor model is consulted sparingly

SWE-Bench Verified results with Advisor Mode (StepFun internal figures):

| Model | Score | Per-task cost |
| --- | --- | --- |
| Step 3.7 Flash + Advisor | 76.3% | $0.19 |
| Claude Opus 4.6 | 78.7% | $1.76 |

Slide 5 of 8 — Multimodal

Multimodal Capabilities

Step 3.7 Flash supports two visual tool pathways:

  • Visual Search Tool: Invoked for long-tail entity recognition or recently emerged concepts where parametric knowledge is insufficient. SimpleVQA (Search): 79.16%
  • Python Tool: Code interface for cropping, zooming, and pixel/bounding-box operations on high-resolution images. V* (Python): 95.29% | HR-Bench 4K: 89.13% | HR-Bench 8K: 86.34%

Android Daily (long-horizon phone UI tasks): Step 3.7 Flash scores 61.87%, ahead of Kimi K2.6 (53.36%) and GLM 5V Turbo (51.68%). Gemini 3 Flash leads at 63.21%.

StepFun reports emergent compositional tool use during testing: the model combined visual and non-visual tools without explicit training to do so.

Slide 6 of 8 — Search & Research

Search and Research Benchmarks

Search is integrated into the model’s reasoning loop rather than treated as an external add-on. StepFun focused training on search planning, evidence filtering, and synthesis.

| Benchmark | 3.7 Flash | Comparison |
| --- | --- | --- |
| HLE w. Tools (acc) | 47.20% | DeepSeek V4 Flash: 45.10% |
| BrowseComp (acc) | 75.82% | Claude Opus 4.7: 79.30% |
| DeepSearchQA (F1) | 92.82% | Kimi K2.6: 92.50% |
| ResearchRubrics | 71.68% | GPT 5.5: 61.50% |

HLE comparison: Step 3.5 Flash scored 35.68% text-only. Step 3.7 Flash scores 47.20% with tool access, so the two figures are not an apples-to-apples comparison.

Slide 7 of 8 — Deployment

Pricing, Deployment & Ecosystem

| Token Type | Price |
| --- | --- |
| Input (cache miss) | $0.20 / M tokens |
| Input (cache hit) | $0.04 / M tokens |
| Output | $1.15 / M tokens |

Available on:

StepFun Platform
OpenRouter
NVIDIA NIM
DeepInfra (soon)
Fireworks AI (soon)
Modal (soon)

Inference backends: vLLM, SGLang, Hugging Face Transformers (requires v5.0+), llama.cpp

Quantization formats: BF16, FP8, NVFP4, GGUF

Local minimum: 120 GB unified memory/VRAM
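A rough back-of-the-envelope estimate shows why the precision of the weights dominates the local memory requirement. The sketch below multiplies the 198B parameter count by the nominal bits per weight of each listed format. It is an approximation under stated assumptions: it ignores the KV cache, activations, runtime overhead, and the per-block scale factors that low-bit formats carry, and it omits GGUF because its size depends on the quantization variant chosen.

```python
TOTAL_PARAMS = 198e9  # total parameter count reported in the article

# Nominal bits per weight; real files are somewhat larger (scales, metadata).
BITS_PER_WEIGHT = {"BF16": 16, "FP8": 8, "NVFP4": 4}


def weight_gb(params, bits):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return params * bits / 8 / 1e9


estimates = {name: weight_gb(TOTAL_PARAMS, bits)
             for name, bits in BITS_PER_WEIGHT.items()}

for name, gb in estimates.items():
    print(f"{name}: ~{gb:.0f} GB for weights alone")
```

By this estimate only a roughly 4-bit format fits under 120 GB with room left for the KV cache, which is consistent with the stated local minimum, though the article does not say which format that figure assumes.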

Slide 8 of 8 — Key Takeaways

Key Takeaways

  • 198B sparse MoE model with ~11B active params per token and a 256k context window
  • Native multimodal support (images, GUIs, documents); Step 3.5 Flash was text-only
  • Advisor Mode scores 76.3% on SWE-Bench Verified at $0.19/task vs. Claude Opus 4.6 at $1.76
  • Cross-harness coding variance narrowed from 43–73% (3.5) to 64.5–71.5% (3.7)
  • Released under Apache 2.0 with BF16, FP8, NVFP4, and GGUF weights on Hugging Face

Compatible harnesses:

Claude Code
KiloCode
Hermes Agent
OpenClaw


Key Takeaways

  • Step 3.7 Flash is a 198B sparse MoE model with ~11B active params per token and a 256k context window.
  • Native multimodal support (images, GUIs, documents) is new; Step 3.5 Flash was text-only.
  • Advisor Mode reaches 97% of Claude Opus 4.6's SWE-Bench Verified performance at $0.19 per task vs. $1.76.
  • Cross-harness coding variance narrowed from a 43–73% range (3.5 Flash) to 64.5–71.5% (3.7 Flash).
  • Released under Apache 2.0 with BF16, FP8, NVFP4, and GGUF weights on Hugging Face.


The post StepFun Releases Step 3.7 Flash: A 198B MoE Vision-Language Model for Coding Agents and Search Workflows appeared first on MarkTechPost.
