
Sakana AI Launches Sakana Fugu: An Orchestration Model That Routes Tasks Across a Swappable Pool of Frontier LLMs

Today, Sakana AI launched Sakana Fugu, a multi-agent orchestration system that behaves like a single model. You send a request to one endpoint, and Fugu decides how to handle it internally. It solves a task directly when that is enough, and it assembles and coordinates a team of expert models when needed. The complexity of a multi-agent system never reaches your code.

TL;DR

  • Fugu delivers a multi-agent system behind one OpenAI-compatible API.
  • Fugu Ultra leads most published coding and reasoning benchmarks.
  • The orchestrator beats the individual models it coordinates.
  • Opt-out and provider routing target compliance and single-vendor risk.
  • Routing is proprietary, so per-query model selection stays hidden.

What is Sakana Fugu

Fugu is itself a language model. It is trained to call other LLMs in an agent pool. That pool includes instances of itself, called recursively. Fugu manages model selection, delegation, verification, and synthesis internally.

Instead of hard-coded roles or workflows, Fugu learns how to coordinate. It decides when to delegate and how agents should communicate. It then combines their work into one answer. From the outside, you call a single model. Inside, a coordinated system of experts does the work.

Sakana AI frames this as a hedge against single-vendor dependency. If one provider restricts access, Fugu routes around the disruption. The research team cites recent export controls on Anthropic’s Fable and Mythos models as motivation. Over time, newer models can be folded into the pool.

Fugu and Fugu Ultra: Two Models, One API

Fugu ships in two variants, both behind the same OpenAI-compatible API:

  • Fugu balances strong performance with low latency. It is the default for everyday coding, code review, and chatbots. It also fits tools like Codex. You can opt specific agents out of its pool, which helps teams meet data, privacy, and compliance requirements.
  • Fugu Ultra is tuned for maximum answer quality on hard, multi-step problems. It coordinates a deeper pool of expert agents. Its pool is fixed, so opt-out isn’t available. The current model ID is fugu-ultra-20260615.

The Research Behind the Orchestrator

Fugu builds on two ICLR 2026 papers on learned orchestration: TRINITY and Conductor.

TRINITY uses a lightweight evolved coordinator across multiple turns. It assigns Thinker, Worker, or Verifier roles to delegate work adaptively. Conductor is trained with reinforcement learning. It discovers natural-language coordination strategies and focused prompts for different LLM pools.

Together, they show that systems can learn to assemble and route agents per task, replacing hand-designed workflows.
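To make the role-based idea more concrete, here is a minimal Python sketch of a multi-turn loop that hands work to Thinker, Worker, and Verifier roles. It is not the TRINITY or Conductor method: the role-picking policy is a hand-written rule where those papers describe a learned coordinator, and every name here (`pick_role`, `orchestrate`, `demo_pool`, the stub agents) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in: a real pool member would be an LLM call.
Agent = Callable[[str], str]


@dataclass
class Transcript:
    task: str
    # Each step is (role, agent_name, output).
    steps: List[Tuple[str, str, str]] = field(default_factory=list)


def pick_role(transcript: Transcript) -> str:
    """Toy hand-written policy; a learned coordinator would replace this."""
    if not transcript.steps:
        return "thinker"
    last_role = transcript.steps[-1][0]
    # After a failed verification the loop goes back to the worker.
    return {"thinker": "worker", "worker": "verifier", "verifier": "worker"}[last_role]


def orchestrate(task: str, pool: Dict[str, Tuple[str, Agent]], max_turns: int = 6) -> Transcript:
    transcript = Transcript(task)
    for _ in range(max_turns):
        role = pick_role(transcript)
        name, agent = pool[role]
        # Each agent sees the previous output, or the task on the first turn.
        context = transcript.steps[-1][2] if transcript.steps else task
        output = agent(context)
        transcript.steps.append((role, name, output))
        if role == "verifier" and output.startswith("OK"):
            break  # the verifier accepted the worker's draft
    return transcript


def demo_pool() -> Dict[str, Tuple[str, Agent]]:
    """Stub agents that only format strings, so the sketch runs offline."""
    return {
        "thinker": ("planner-stub", lambda text: f"plan: split '{text}' into steps"),
        "worker": ("coder-stub", lambda text: f"draft based on [{text}]"),
        "verifier": ("checker-stub", lambda text: "OK" if text.startswith("draft") else "RETRY"),
    }


if __name__ == "__main__":
    for role, name, output in orchestrate("sum a list", demo_pool()).steps:
        print(f"{role:9s} {name:13s} {output}")
```

With the stub pool, the loop runs one Thinker turn, one Worker turn, and one Verifier turn, then stops. The point of the learned approaches described above is that the policy inside `pick_role`, and the prompts passed between agents, are discovered by training instead of being written by hand.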

Benchmark

Sakana AI compares Fugu against the foundation models it orchestrates. Baselines use provider-reported scores. SWE Bench Pro uses mini-swe-agent as scaffolding.

Benchmark               Fugu  Fugu Ultra  Opus 4.8  Gemini 3.1 Pro  GPT 5.5
SWE Bench Pro*          59.0  73.7        69.2      54.2            58.6
TerminalBench 2.1       80.2  82.1        74.6      70.3            78.2
LiveCodeBench           92.9  93.2        87.8      88.5            85.3
LiveCodeBench Pro       87.8  90.8        84.8      82.9            88.4
Humanity’s Last Exam    47.2  50.0        49.8      44.4            41.4
CharXiv Reasoning       85.1  86.6        84.2      83.3            84.1
GPQA-D                  95.5  95.5        92.0      94.3            93.6
SciCode                 60.1  58.7        53.5      58.9            56.1
τ³ Banking              21.7  20.6        20.6      8.4             20.6
Long Context Reasoning  74.7  73.3        67.7      72.7            74.3
MRCRv2                  86.6  93.6        87.9      84.9            94.8

A Fugu variant posts the top score on 10 of 11 rows. Fugu Ultra leads the four coding benchmarks, CharXiv Reasoning, and Humanity’s Last Exam. It ties regular Fugu on GPQA-D. Regular Fugu leads SciCode, τ³ Banking, and Long Context Reasoning. GPT 5.5 wins MRCRv2, the only baseline win here.
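The tally in the paragraph above can be checked mechanically. The short Python sketch below copies the scores from the table and counts the rows where a Fugu variant holds, or shares, the top score. The helper names (`leaders`, `orchestrator_wins`) are illustrative and not part of any Sakana tooling.

```python
MODELS = ("Fugu", "Fugu Ultra", "Opus 4.8", "Gemini 3.1 Pro", "GPT 5.5")
ORCHESTRATORS = {"Fugu", "Fugu Ultra"}

# Scores copied from the benchmark table, in MODELS order.
SCORES = {
    "SWE Bench Pro": (59.0, 73.7, 69.2, 54.2, 58.6),
    "TerminalBench 2.1": (80.2, 82.1, 74.6, 70.3, 78.2),
    "LiveCodeBench": (92.9, 93.2, 87.8, 88.5, 85.3),
    "LiveCodeBench Pro": (87.8, 90.8, 84.8, 82.9, 88.4),
    "Humanity's Last Exam": (47.2, 50.0, 49.8, 44.4, 41.4),
    "CharXiv Reasoning": (85.1, 86.6, 84.2, 83.3, 84.1),
    "GPQA-D": (95.5, 95.5, 92.0, 94.3, 93.6),
    "SciCode": (60.1, 58.7, 53.5, 58.9, 56.1),
    "tau3 Banking": (21.7, 20.6, 20.6, 8.4, 20.6),
    "Long Context Reasoning": (74.7, 73.3, 67.7, 72.7, 74.3),
    "MRCRv2": (86.6, 93.6, 87.9, 84.9, 94.8),
}


def leaders(row):
    """Return every model that holds the top score in a row (ties included)."""
    best = max(row)
    return [model for model, score in zip(MODELS, row) if score == best]


def orchestrator_wins(scores):
    """Benchmarks where at least one Fugu variant holds or shares the top score."""
    return [name for name, row in scores.items() if ORCHESTRATORS & set(leaders(row))]


if __name__ == "__main__":
    wins = orchestrator_wins(SCORES)
    print(f"Fugu variants lead {len(wins)} of {len(SCORES)} rows")
    for name, row in SCORES.items():
        print(f"{name:24s} {', '.join(leaders(row))}")
```

Several margins are narrow: Fugu Ultra leads Humanity’s Last Exam by 0.2 points, and regular Fugu leads Long Context Reasoning by 0.4.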

Sakana AI’s Fugu models stand shoulder-to-shoulder with Anthropic’s Fable 5 and Mythos Preview. Those two are not in Fugu’s pool, since they are not publicly available.

Use Cases

Sakana AI ran a beta with close to 500 early users. The published examples favor long, multi-step tasks.

  • AutoResearch: An agent autonomously improved a small GPT’s training recipe. It ran 123 experiments over roughly 14 hours on one H100 GPU. Fugu Ultra reached the best mean validation BPB of 0.9774, with a best single run of 0.9748.
  • Rubik’s Cube solver: Each model wrote a pure-Python solver, no libraries allowed. Fugu Ultra solved all 300 held-out cubes, averaging 19.72 moves. One baseline matched it closely at 19.76 moves. Two others crashed and solved none.
  • Classical Japanese kana reading order: On a 1610 letter, Fugu Ultra scored NED 0.80. The nearest baseline reached only 0.24.
  • Blindfold chess: Fugu played four games from memory, with no board shown. It beat three frontier models and a 2100-Elo Stockfish engine.
  • Online trading: On one 50-week window, Fugu Ultra returned +19.43% on average across five runs. The other frontier models stayed below +15%. Sakana AI notes that past performance does not guarantee future results.

A Minimal API Example

Fugu uses an OpenAI-compatible API, so no SDK migration is required. Point an existing client at your console-provided endpoint.

from openai import OpenAI

# Endpoint and key come from your Sakana console (console.sakana.ai).
client = OpenAI(
    base_url="https://<your-fugu-endpoint>/v1",  # from console.sakana.ai
    api_key="YOUR_SAKANA_API_KEY",
)

# One chat-completions call; orchestration happens behind the endpoint.
resp = client.chat.completions.create(
    model="fugu-ultra-20260615",           # or "fugu"
    messages=[
        {"role": "user",
         "content": "Reproduce the method in this paper and report the gap."},
    ],
)

# The synthesized answer comes back as a single assistant message.
print(resp.choices[0].message.content)

Token usage and cost are reported per request, so you can track spend in real time.
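As a rough sketch of per-request tracking, the helper below accumulates token counts across responses. It assumes Fugu responses carry the standard OpenAI-style `usage` object with `prompt_tokens` and `completion_tokens`, which is how OpenAI-compatible chat completions usually report usage but is not confirmed here for Fugu. No cost field is read, because the article does not say where cost appears. `UsageTotals` and `fake_response` are hypothetical names, and `fake_response` only builds a stand-in object so the sketch runs without network access.

```python
from dataclasses import dataclass
from types import SimpleNamespace


@dataclass
class UsageTotals:
    """Running token totals across requests."""
    requests: int = 0
    prompt_tokens: int = 0
    completion_tokens: int = 0

    @property
    def total_tokens(self) -> int:
        return self.prompt_tokens + self.completion_tokens

    def add(self, response) -> None:
        """Fold one response's usage into the totals; tolerate a missing usage block."""
        self.requests += 1
        usage = getattr(response, "usage", None)
        if usage is None:
            return
        self.prompt_tokens += usage.prompt_tokens
        self.completion_tokens += usage.completion_tokens


def fake_response(prompt_tokens: int, completion_tokens: int) -> SimpleNamespace:
    """Stand-in shaped like an OpenAI-style response, for offline use only."""
    usage = SimpleNamespace(
        prompt_tokens=prompt_tokens,
        completion_tokens=completion_tokens,
        total_tokens=prompt_tokens + completion_tokens,
    )
    return SimpleNamespace(usage=usage)


if __name__ == "__main__":
    totals = UsageTotals()
    totals.add(fake_response(120, 480))
    totals.add(fake_response(90, 310))
    print(totals.requests, totals.prompt_tokens, totals.completion_tokens, totals.total_tokens)
```

In real use, you would pass each `resp` returned by `client.chat.completions.create(...)` to `totals.add(resp)` in place of the stand-in objects.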

Community Reactions

Sakana Fugu: Early Community Sentiment

A manual review of public reaction on X and Hacker News, with links to every source. Captured June 22, 2026.

12 posts reviewed

Sentiment split (n = 12)

  • Supportive: 3
  • Skeptical: 6
  • Critical: 3

Early reaction skews skeptical. The “is this just a router or wrapper?” question dominates. The clearest supportive voices are Sakana-affiliated.

Method: sentiment was assigned by hand from a small sample of public posts on June 22, 2026. This is not a statistical survey, and the split can shift as more reactions arrive. Two of the three supportive posts are from Sakana AI or its CEO. Quotes are shortened; follow each link for full context. The Reddit quote is as reported by VentureBeat.
Marktechpost · Sakana Fugu sentiment tracker
Sources: X · Hacker News · VentureBeat
