Cisco AI Introduces FAPO: Pipeline-Aware Prompt Optimization With Step-Level Failure Attribution and Claude Code Orchestration

Getting prompts right is still the hardest part of shipping reliable LLM applications. Small wording changes can swing accuracy by 20%. What works on a few examples often breaks at scale. When a multi-step pipeline returns a wrong answer, finding the failing step means inspecting intermediate outputs by hand.

Cisco AI released FAPO to address that bottleneck. FAPO stands for Fully Automated Prompt Optimization. It is a Claude Code-driven system that optimizes LLM pipelines from baseline prompts to a target accuracy. You provide a dataset and an initial prompt. FAPO then evaluates, classifies failures, proposes variants, validates them, and iterates. The whole loop is orchestrated by Claude Code agents. The project ships as open source under Apache 2.0, and also supports Codex as the optimization agent.

In Cisco’s reported evaluation, FAPO beat GEPA, a state-of-the-art prompt optimizer, on 15 of 18 model-benchmark comparisons. On the two benchmarks where FAPO escalated to pipeline changes, the mean gain over GEPA reached +33.8pp.

TL;DR

  • FAPO is a Claude Code-driven system that autonomously optimizes multi-step LLM pipelines from baseline prompts to a target accuracy, open source under Apache 2.0.
  • It escalates through three levels (prompt, parameter, then chain structure), using step-level failure attribution to decide what to change next.
  • In Cisco’s evaluation, FAPO beat GEPA on 15 of 18 model-benchmark comparisons, with a +14.1pp mean gain.
  • On HoVer and IFBench, where it escalated to pipeline changes, FAPO won all six pairs at a +33.8pp mean gain; AIME was GEPA’s only win, within sampling noise.
  • Guardrails against overfitting include training-split-only inspection, immutable variant files, and an independent reviewer on every proposal.

What is FAPO

FAPO is a multi-tenant evaluation and optimization framework. A tenant is a self-contained optimization project. Each tenant directory holds one task’s prompts, dataset, chain definition, scorer, and config. Tenants stay isolated, so unrelated tasks optimize side by side without interference.

The core engine is called hephaestus and is domain-agnostic. It handles evaluation, chain execution, and scoring. Chains are LangGraph state graphs that process each test case. Out of the box, FAPO supports three providers: OpenAI, Baseten, and SageMaker.

The one input you must bring is a dataset: paired inputs and expected outputs that define success. FAPO splits it into a validation set and a held-out test set. The validation set drives iteration; the test set is used only for a final one-shot evaluation. From a task description, Claude can scaffold the rest: the initial prompt, the chain, and the scorer.
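As a rough illustration of a validation/held-out split, the sketch below shuffles cases with a fixed seed and sets a portion aside. The helper name split_dataset, the 80/20 ratio, and the seed are assumptions made for this example; FAPO's own splitting logic may differ.

```python
import random

def split_dataset(cases, test_fraction=0.2, seed=0):
    """Hypothetical helper: deterministic shuffle, then hold out a test portion."""
    shuffled = list(cases)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)   # fixed seed keeps the split reproducible
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]  # (validation, held_out)

cases = [{"case_id": str(i)} for i in range(10)]
validation, held_out = split_dataset(cases)
```

Keeping the split deterministic means repeated runs score variants against the same cases, so score differences come from the variant and not from a reshuffle.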

How the Optimization Loop Works

Once the pieces exist, FAPO runs a closed loop until the target accuracy is reached. Each cycle runs six stages:

  1. Evaluate: run the chain on the dataset, collect per-case scores and step-level outputs.
  2. Attribute: classify failures by root cause using rule-based heuristics plus LLM analysis.
  3. Propose: generate a variant targeting the dominant failure cluster.
  4. Review: an independent agent validates the proposal for scope compliance and data leakage.
  5. Compare: accept the variant only if it improves on the previous best, otherwise reject.
  6. Iterate: continue until target accuracy is reached or the optimization budget is exhausted.
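The control flow of the six stages can be sketched as an accept-if-better loop. This is a minimal toy under stated assumptions, not FAPO's implementation: the function optimize and every callable passed to it are hypothetical stand-ins, and the attribution stage is folded into propose.

```python
from typing import Callable, Tuple

def optimize(
    baseline: str,
    evaluate: Callable[[str], float],
    propose: Callable[[str, float], str],
    review: Callable[[str], bool],
    target: float,
    budget: int,
) -> Tuple[str, float]:
    """Toy loop; all callables are hypothetical stand-ins for FAPO stages."""
    best, best_score = baseline, evaluate(baseline)  # Evaluate the baseline
    for _ in range(budget):                          # Iterate within the budget
        if best_score >= target:                     # Stop once the target is met
            break
        candidate = propose(best, best_score)        # Attribute + Propose (collapsed)
        if not review(candidate):                    # Review: skip rejected proposals
            continue
        score = evaluate(candidate)
        if score > best_score:                       # Compare: keep strict improvements
            best, best_score = candidate, score
    return best, best_score

# Toy usage: the "score" is the prompt length, so each proposal adds one point.
best, best_score = optimize(
    baseline="ab",
    evaluate=lambda prompt: float(len(prompt)),
    propose=lambda prompt, score: prompt + "x",
    review=lambda prompt: True,
    target=5.0,
    budget=10,
)
```

In the toy run the loop stops after three accepted proposals, because the target is reached before the budget runs out.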

The system works at three escalating levels. Prompt edits are the lowest cost and are tried first. Parameter changes adjust config values like retrieval_k or temperature. Structural changes alter chain topology, such as adding a self-reflection node or switching to a ReAct pattern. FAPO exhausts one level before escalating to the next.
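The escalation rule can be pictured as a small ladder. The sketch below is illustrative only: the level names come from the description above, while the function next_level and the boolean exhausted flag are assumptions about how such a rule might be expressed.

```python
LEVELS = ["prompt", "parameter", "structural"]  # cheapest first

def next_level(current: str, exhausted: bool) -> str:
    """Stay on the current level until it is exhausted, then move one step up."""
    idx = LEVELS.index(current)
    if exhausted and idx < len(LEVELS) - 1:
        return LEVELS[idx + 1]
    return current  # not exhausted yet, or already at the top level
```

For example, next_level("prompt", True) moves to "parameter", while next_level("structural", True) stays put because there is nothing higher to escalate to.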

Step attribution sorts failures into four classes. Retrieval failures return empty or irrelevant content. Cascading failures begin when an early step produces empty output. Format failures hide the correct answer inside text the scorer cannot parse. Reasoning failures occur when good inputs still produce a wrong conclusion. Format and reasoning issues are prompt-addressable. Retrieval and cascade issues are structural-addressable.
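A rule-based first pass over the four classes might look like the sketch below. The step name "retrieve", the ordering of the rules, and the substring test for format failures are all assumptions for illustration; the article notes that FAPO combines such heuristics with LLM analysis, which this sketch omits (so it cannot detect irrelevant, non-empty retrievals).

```python
# Which optimization level addresses each class, per the description above.
ADDRESSED_BY = {
    "retrieval": "structural",
    "cascade": "structural",
    "format": "prompt",
    "reasoning": "prompt",
}

def classify_failure(step_outputs: dict, final_output: str, expected: str) -> str:
    """Hypothetical heuristic classifier for a case that already failed scoring."""
    # Retrieval: the (assumed) "retrieve" step returned nothing usable.
    if not step_outputs.get("retrieve", "").strip():
        return "retrieval"
    # Cascade: some other intermediate step produced empty output.
    if any(not text.strip() for name, text in step_outputs.items() if name != "retrieve"):
        return "cascade"
    # Format: the right answer is present but buried in extra text.
    if expected.strip().lower() in final_output.lower():
        return "format"
    # Otherwise the inputs looked fine and the conclusion was still wrong.
    return "reasoning"

label = classify_failure(
    {"retrieve": "Paris is the capital of France.", "extract": "capital: Paris"},
    "The answer is Paris.",
    "Paris",
)
```

Here label is "format", which ADDRESSED_BY maps to a prompt-level fix.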

Guardrails keep the optimizer from overfitting. It inspects only training-split cases, while validation and test expose aggregate scores only. Every variant is a new immutable file, never edited in place. An independent reviewer checks each proposal before it runs.
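One simple way to get the "new file, never edited in place" property is to open files in exclusive-creation mode. The sketch below is an assumption about how such a guardrail could be enforced; the helper write_variant, the .txt extension, and the v001 naming are hypothetical and not taken from FAPO.

```python
import tempfile
from pathlib import Path

def write_variant(directory: Path, name: str, prompt_text: str) -> Path:
    """Create a new variant file; refuse to overwrite an existing one."""
    path = directory / f"{name}.txt"
    # Mode "x" raises FileExistsError if the file already exists.
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(prompt_text)
    return path

variants_dir = Path(tempfile.mkdtemp())
first = write_variant(variants_dir, "v001", "Answer concisely.")
```

A second call with the same name fails instead of silently replacing the earlier variant, which keeps every evaluated prompt recoverable for audit.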

The Benchmark Case: FAPO vs. GEPA

The Cisco team evaluated FAPO against GEPA (Generalized Evolutionary Prompt Architecture), a state-of-the-art prompt optimization method. GEPA uses evolutionary search with genetic operators to optimize prompts for multi-step pipelines. Both systems started from identical baseline pipelines and prompts. FAPO could escalate to structural changes when attribution found bottlenecks. GEPA was limited to prompt-level optimization.

The comparison spanned six benchmarks and three task models: GPT-4.1-mini, GPT-5.4-mini, and Gemma 3-12B. Claude Opus 4.6 served as both FAPO’s orchestrator and GEPA’s reflector. Scores below are averaged across the three task models.

Benchmark        Baseline   GEPA   FAPO   Gain vs. GEPA
HoVer            35.9       48.5   83.8   +35.3pp
IFBench          35.7       48.5   80.7   +32.2pp
LiveBench-Math   51.0       52.6   62.0   +9.4pp
HotpotQA         50.9       61.8   68.3   +6.5pp
Papillon         73.6       90.7   94.9   +4.2pp
AIME             16.7       16.0   12.9   -3.1pp

FAPO won 15 of 18 model-benchmark comparisons, with a mean gain of +14.1pp over GEPA. On HoVer and IFBench, where FAPO escalated to pipeline changes, it won all six model-benchmark pairs. The mean gain there was +33.8pp. On the four benchmarks without structural changes, FAPO still won 9 of 12 through prompt optimization alone. AIME was the only benchmark where GEPA led, by 3.1pp. That gap is smaller than the standard deviation across stochastic trials.

A capability comparison shows the design difference reported by Cisco. Every row below reflects the source description of the two systems.
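The headline means can be rechecked from the averaged GEPA and FAPO scores reported for each benchmark. The sketch below uses Decimal so that the half-way value 33.75 rounds up predictably; it is only an arithmetic check on the published averages, not a reproduction of Cisco's per-model computation.

```python
from decimal import Decimal, ROUND_HALF_UP

# (GEPA, FAPO) scores as reported, averaged across the three task models.
SCORES = {
    "HoVer": ("48.5", "83.8"),
    "IFBench": ("48.5", "80.7"),
    "LiveBench-Math": ("52.6", "62.0"),
    "HotpotQA": ("61.8", "68.3"),
    "Papillon": ("90.7", "94.9"),
    "AIME": ("16.0", "12.9"),
}

def one_decimal(value: Decimal) -> Decimal:
    """Round to one decimal place, with halves rounded up."""
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

gains = {name: Decimal(fapo) - Decimal(gepa) for name, (gepa, fapo) in SCORES.items()}
mean_all = one_decimal(sum(gains.values()) / len(gains))
mean_escalated = one_decimal((gains["HoVer"] + gains["IFBench"]) / 2)
print(mean_all, mean_escalated)  # prints: 14.1 33.8
```

The six gains sum to 84.5pp, so the overall mean is about 14.08pp, and the HoVer and IFBench gains average 33.75pp; both agree with the reported figures after rounding.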

Capability                               GEPA                                         FAPO
Optimization levels                      Prompt text only                             Prompt → parameter → structural
Can change chain structure               No                                           Yes, when attribution finds bottlenecks
How it’s driven                          Evolutionary search with genetic operators   Claude Code or Codex agent loop
Result across 18 model-benchmark pairs   Reference                                    Wins 15 of 18; +14.1pp mean

Where It Fits: Use Cases

FAPO targets multi-step LLM pipelines, not single prompts. A few concrete examples:

  • Multi-hop question answering: A chain retrieves documents, extracts facts, reasons over evidence, and formats an answer. In Cisco’s documented walkthrough, a multi-hop QA chain rose from 39.3% to 70.3% validation exact match across two iterations. Attribution then flagged the remaining failures as retrieval-limited, signaling a structural fix. Separately, on the HotpotQA benchmark, FAPO reached 68.3% test accuracy versus GEPA’s 61.8%.
  • Instruction following: On IFBench, format-constraint failures pushed FAPO to escalate beyond prompts, reaching 80.7% test accuracy.
  • Classification: A software-name-to-category task can be scaffolded by Claude Code, then optimized to exact-match targets.
  • ReAct agents: An MCP workflow extension optimizes a tool-calling ReAct agent using trajectory scoring and LLM-as-Judge scoring.

Getting Started

The fastest path is to let Claude Code create the tenant files. From the repo, describe your task in plain English, then add a JSONL dataset. Each line is one test case with case_id, task_type, context, expected, and metadata:

{"case_id": "1", "task_type": "qa", "context": {"question": "What is the capital of France?"}, "expected": {"answer": "Paris"}, "metadata": {}}
{"case_id": "2", "task_type": "qa", "context": {"question": "What is 2 + 2?"}, "expected": {"answer": "4"}, "metadata": {}}
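Before running an evaluation, it can help to check that every JSONL line parses and carries the five fields. The sketch below is a standalone pre-flight check using only the standard library; the helper parse_cases is hypothetical and not part of hephaestus, and the key names assume the case_id, task_type, context, expected, and metadata schema.

```python
import json

REQUIRED_KEYS = {"case_id", "task_type", "context", "expected", "metadata"}

def parse_cases(jsonl_text: str) -> list:
    """Parse JSONL text and fail loudly on the first line with missing fields."""
    cases = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines
        case = json.loads(line)
        missing = REQUIRED_KEYS - case.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        cases.append(case)
    return cases

sample = (
    '{"case_id": "1", "task_type": "qa", "context": {}, '
    '"expected": {"answer": "Paris"}, "metadata": {}}\n'
)
cases = parse_cases(sample)
```

Reporting the line number makes it easy to locate a malformed case in a large dataset file.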

A scorer compares the chain output to the expected answer. It implements validate_case to catch bad data early and score_case to return a composite score:

from hephaestus.scoring.scorer import Scorer as BaseScorer

class Scorer(BaseScorer):
    def validate_case(self, case, scoring_profile):
        # Fail fast on malformed cases before any model calls are made.
        assert "answer" in case.expected, "Missing 'answer' in expected"

    def score_case(self, case, output_text, scoring_profile):
        # Case-insensitive exact match after trimming surrounding whitespace.
        expected = case.expected["answer"].strip().lower()
        predicted = output_text.strip().lower()
        em = 100.0 if predicted == expected else 0.0
        return {"composite_score": em, "score_breakdown": {"exact_match": em}}

Verify the setup with a baseline evaluation:

export OPENAI_API_KEY="sk-..."
python -m hephaestus.cli eval --config tenants/my_project/configs/eval.json

Then invoke the optimization agent with a tenant, config, and success criteria such as composite_score >= 90. Claude Code produces a scope contract, then iterates autonomously. Every prompt variant, config, and per-variant analysis is written to disk, so each run stays auditable. A local read-only UI called FAPO Explorer browses the artifacts afterward.

Strengths and Weaknesses

Strengths

  • Pipeline-aware scoring attributes failures to the step that caused them, not just the final output.
  • Three-level escalation handles failures that prompts alone cannot fix.
  • Guardrails against overfitting: training-split-only inspection, immutable variants, and an independent reviewer.
  • Open source under Apache 2.0, with both Claude Code and Codex supported.

Weaknesses

  • Optimization quality is bounded by the dataset’s quality and coverage, which you must supply.
  • The project is recent, so independent production track records are still limited.
  • The default loop depends on agentic coding tools (Claude Code or Codex) rather than a standalone optimizer.
