
NVIDIA HORIZON: A Hands-Free Agent that Evolves Git Worktrees and Hits 100% RTL Benchmark Completion

NVIDIA Research launched HORIZON, a hands-free agent framework for hardware design. It treats hardware design as repository-level code evolution. The research team exercises the register-transfer level (RTL) instantiation. A structured Markdown harness is compiled into a project pack. A self-contained agent loop then evolves an isolated git worktree. It commits a version only when an executable acceptance gate passes.

The research team reports 100% completion across every evaluated RTL benchmark suite. It also states plainly that agentic hardware design is not solved.

What is HORIZON?

Single-turn code generation has a clear limit on executable design tasks. Plausible Verilog is not enough for real hardware. Correctness depends on cycle-level behavior, reset conventions, bit widths, and simulator feedback.

HORIZON hosts each design problem as a version-controlled repository, not a one-shot prompt. The only required input is a structured Markdown harness. That harness carries four parts: a goal, domain-knowledge instructions, an evaluator specification, and an acceptance predicate.
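
The four harness parts can be pictured as a small record. The following is a minimal sketch, assuming the harness uses one second-level Markdown heading per part. The heading names, the `Harness` class, and `parse_harness` are hypothetical illustrations, not HORIZON's actual format or API.

```python
from dataclasses import dataclass

# Hypothetical section names: the source lists the four parts but not their headings.
REQUIRED_SECTIONS = ("goal", "domain knowledge", "evaluator", "acceptance")


@dataclass(frozen=True)
class Harness:
    goal: str
    domain_knowledge: str
    evaluator: str
    acceptance: str


def parse_harness(markdown: str) -> Harness:
    """Split a Markdown document on '## ' headings and collect the four parts."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        if line.startswith("## "):
            current = line[3:].strip().lower()
            sections[current] = []
        elif current is not None:
            sections[current].append(line)
    missing = [name for name in REQUIRED_SECTIONS if name not in sections]
    if missing:
        raise ValueError(f"harness is missing sections: {missing}")
    text = {name: "\n".join(sections[name]).strip() for name in REQUIRED_SECTIONS}
    return Harness(
        goal=text["goal"],
        domain_knowledge=text["domain knowledge"],
        evaluator=text["evaluator"],
        acceptance=text["acceptance"],
    )


# Invented example content, for illustration only.
EXAMPLE = """\
## Goal
Implement an 8-bit counter with synchronous reset.
## Domain knowledge
Reset is active high.
## Evaluator
Compile, then run the testbench.
## Acceptance
All testbench checks pass.
"""

harness = parse_harness(EXAMPLE)
print(harness.goal)
```

Rejecting a harness with a missing section up front mirrors the idea that the harness is the only required input, so it has to be complete before any agent loop starts.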

A bootstrap agent compiles the harness into a project pack. The research team writes this as p = (πagent, Ep, Ap, Γp, Ωp). Those terms cover the agent policy, the executable evaluator, and the acceptance predicate. They also cover the version-control policy and the domain expertise.

For RTL, the evaluator Ep may include compilation, simulation, coverage extraction, and assertion or testbench checks. In other domains, that same slot could hold unit tests, theorem provers, profilers, or synthesis tools. Problems are therefore defined over git worktrees, not over a fixed repository type.
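
One way to read the five-element pack is as a record whose evaluator slot is just a callable over a worktree path, so that an RTL flow or a unit-test runner can be swapped in. This is a sketch under that assumption. `ProjectPack`, `EvalReport`, `stub_rtl_evaluator`, and `all_checks_pass` are hypothetical names, and the stub returns canned numbers instead of calling real EDA tools.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalReport:
    passed_checks: int
    total_checks: int
    log: str


# The evaluator slot: any callable from a worktree path to a report.
Evaluator = Callable[[str], EvalReport]
AcceptancePredicate = Callable[[EvalReport], bool]


@dataclass(frozen=True)
class ProjectPack:
    agent_policy: str                # πagent
    evaluator: Evaluator             # Ep
    acceptance: AcceptancePredicate  # Ap
    version_control_policy: str      # Γp
    domain_expertise: str            # Ωp


def stub_rtl_evaluator(worktree: str) -> EvalReport:
    """Stand-in for compile + simulate + checks; returns canned results."""
    return EvalReport(passed_checks=3, total_checks=4, log=f"stub run in {worktree}")


def all_checks_pass(report: EvalReport) -> bool:
    """Accept only when at least one check ran and none failed."""
    return report.total_checks > 0 and report.passed_checks == report.total_checks


pack = ProjectPack(
    agent_policy="single-agent, hands-free",
    evaluator=stub_rtl_evaluator,
    acceptance=all_checks_pass,
    version_control_policy="commit only on acceptance",
    domain_expertise="reset is active high",
)

report = pack.evaluator("/tmp/worktree")
accepted = pack.acceptance(report)
print(accepted)  # False: 3 of 4 checks passed
```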

https://arxiv.org/pdf/2606.28279

How the Repository-Level Loop Works

After bootstrap, the loop runs without further human input. Each cycle plans a target, edits the worktree, invokes tools, and runs the evaluator. The acceptance predicate then decides one thing: commit the new version, or log the failure.
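
The cycle described above can be sketched as a plain loop with a gate. This is an illustrative skeleton, not HORIZON's implementation: `run_loop`, `Trajectory`, and the three callables are hypothetical, and lists stand in for git commits and failure logs.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class Trajectory:
    accepted: list = field(default_factory=list)  # stands in for commits
    rejected: list = field(default_factory=list)  # stands in for logged failures


def run_loop(
    propose: Callable[[Optional[Any], int], Any],
    evaluate: Callable[[Any], Any],
    accept: Callable[[Any], bool],
    max_iterations: int,
) -> Trajectory:
    trajectory = Trajectory()
    checkpoint: Optional[Any] = None  # last accepted version
    for iteration in range(max_iterations):
        candidate = propose(checkpoint, iteration)  # plan a target and edit
        evidence = evaluate(candidate)              # run the evaluator
        if accept(evidence):                        # acceptance predicate
            trajectory.accepted.append((iteration, candidate))
            checkpoint = candidate
        else:
            trajectory.rejected.append((iteration, candidate, evidence))
    return trajectory


# Toy run: candidates are integers, and only even ones "pass".
result = run_loop(
    propose=lambda checkpoint, i: i,
    evaluate=lambda candidate: candidate % 2 == 0,
    accept=lambda evidence: evidence,
    max_iterations=4,
)
print(result.accepted)  # [(0, 0), (2, 2)]
```

Each new proposal starts from the last accepted checkpoint, so a rejected attempt never becomes the base for later edits.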

Git is the substrate here, not incidental bookkeeping. Diffs expose proposed state changes. Commits define accepted checkpoints. Notes attach evaluator evidence. The log recovers the full trajectory.

The loop leans on native git commands to keep tracing cheap. Staged edits are inspected with git diff --cached. Each accepted attempt becomes a git commit whose notes carry the verdict and reward. Successful commits become positive repair examples. Rejected attempts are logged as negative examples. The repository history is the experience buffer, not a separate datastore.
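
The git operations named above map onto a handful of standard commands. The sketch below only builds the argument lists and leaves execution to a small `run` helper. The JSON note layout and the function names are assumptions for illustration; the source says notes carry the verdict and reward but does not give their format.

```python
import json
import subprocess


def inspect_staged_cmd() -> list:
    """Show staged edits before deciding whether to accept them."""
    return ["git", "diff", "--cached"]


def accept_cmds(summary: str, verdict: str, reward: float) -> list:
    """Commit the staged edits, then attach evaluator evidence as a git note."""
    # Assumed note format: a small JSON object (not specified by the source).
    note = json.dumps({"verdict": verdict, "reward": reward}, sort_keys=True)
    return [
        ["git", "commit", "-m", summary],
        ["git", "notes", "add", "-m", note, "HEAD"],
    ]


def trajectory_cmd() -> list:
    """Replay the accepted history together with the attached notes."""
    return ["git", "log", "--notes"]


def run(cmd: list, worktree: str) -> str:
    """Execute one git command inside the given worktree and return stdout."""
    result = subprocess.run(
        cmd, cwd=worktree, check=True, capture_output=True, text=True
    )
    return result.stdout


# Print the commands instead of running them, so no repository is required.
for cmd in [inspect_staged_cmd(), *accept_cmds("Fix reset polarity", "pass", 1.0)]:
    print(" ".join(cmd))
```

Keeping the evidence in notes rather than in commit messages leaves the commit content untouched while still letting the log recover the verdict for every checkpoint.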

The research team borrows semi-Markov decision process vocabulary for one narrow purpose. It names the recorded objects, nothing more. A “state” is a versioned snapshot of the repository. An “option” is one episode between two checkpoints. HORIZON does not train or update an RL policy in this work. The agent backbone stays fixed throughout a campaign.

Session reuse keeps cost down. HORIZON holds a persistent model session across iterations. The harness, project pack, and stable sources are served from the provider’s prompt cache. Newly billed tokens are then dominated by the current diff and the latest evaluator output.
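
A rough way to see why the diff and evaluator output dominate is to split the prompt into stable and changing segments. The sketch below is a simplified accounting model with invented token counts. It assumes stable segments cost nothing new, whereas real providers usually cache by prompt prefix and may bill cached tokens at a reduced rate instead of zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    name: str
    tokens: int
    stable: bool  # True if unchanged since the previous iteration


def newly_billed_tokens(segments: list) -> int:
    """Count tokens that this simplified model cannot serve from the cache."""
    return sum(s.tokens for s in segments if not s.stable)


# Hypothetical token counts, chosen only to illustrate the proportions.
prompt = [
    Segment("harness", 2_000, stable=True),
    Segment("project pack", 3_000, stable=True),
    Segment("stable sources", 20_000, stable=True),
    Segment("current diff", 400, stable=False),
    Segment("latest evaluator output", 600, stable=False),
]

total = sum(s.tokens for s in prompt)
fresh = newly_billed_tokens(prompt)
print(f"{fresh} of {total} tokens are newly billed")
```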

Where HORIZON Sits Among Self-Evolving Systems

HORIZON extends a lineage of repository-scale self-evolution. Earlier systems evolved the software that engineers run. HORIZON instead evolves the hardware artifacts that engineers create.

| System | Object evolved | Domain | Evaluation signal |
|---|---|---|---|
| AlphaEvolve (2025) | Algorithmic kernels | Scientific and algorithmic discovery | Automated evaluators |
| SATLUTION (2025) | Full SAT-solver repositories | SAT solving | Distributed correctness and runtime |
| ABCEvo (2026) | ABC logic-synthesis system | EDA software | Correctness and QoR |
| HORIZON (this work) | RTL sources, testbenches, verification artifacts | Hardware design | Compile, simulate, coverage, assertion checks |

All four share one principle. A candidate change is admitted only when executable evidence supports it.

Benchmark Results

The backbone is GPT-5.3, fixed for all experiments. Every result uses single-agent, hands-free mode. Campaigns ran on an AMD EPYC 9334 32-core host with 512 GB of RAM.

The evaluation spans ChipBench, RTLLM-2.0, and Verilog-Eval. It adds nine CVDP code- and verification-generation categories, CID 002 to 016. CVDP contains 783 human-authored problems across 13 task categories (Pinckney et al., 2025).

An iteration is one automated outer step. The agent edits the worktree, runs the evaluator, then commits a pass or logs a rejection. HORIZON reaches a 100% pass rate on every suite. The one residual miss is a ChipBench specification-harness defect, not an agent failure.

The aggregate first-iteration pass rate is 47.8%. Iteration 0 is not a standalone Pass@1 measurement. It is the repository state after the first agent iteration. The agent may defer debugging and repair to later iterations by design.

| Suite / category | Focus | Iter. 0 pass rate (%) | Convergence iteration | HORIZON pass rate (%) |
|---|---|---|---|---|
| ChipBench | Mixed RTL generation | 20.0 | 5 | 100.0 |
| RTLLM-2.0 | NL spec to RTL | 78.0 | 2 | 100.0 |
| Verilog-Eval-v2 | HDLBits-style Verilog | 86.2 | 2 | 100.0 |
| CVDP CID 002 | RTL code completion | 3.2 | 82 | 100.0 |
| CVDP CID 003 | NL spec to RTL | 19.2 | 24 | 100.0 |
| CVDP CID 004 | RTL code modification | 10.9 | 36 | 100.0 |
| CVDP CID 005 | Spec-to-RTL module reuse | 9.1 | 14 | 100.0 |
| CVDP CID 007 | Linting / QoR improvement | 0.0 | 24 | 100.0 |
| CVDP CID 012 | Test-plan to stimulus generation | 47.8 | 32 | 100.0 |
| CVDP CID 013 | Test-plan to checker generation | 3.8 | 19 | 100.0 |
| CVDP CID 014 | Test-plan to assertion generation | 79.1 | 1 | 100.0 |
| CVDP CID 016 | Debugging and bug fixing | 25.7 | 13 | 100.0 |

Convergence difficulty varies widely across categories. RTLLM-2.0 and Verilog-Eval reach 100% within two iterations. Checker generation (CID 013) starts at just 3.8%. Yet it climbs steadily to 100% by iteration 19, with almost no plateau. Code completion (CID 002) needs 82 iterations. Its long tail is the single largest token cost.
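
The spread in convergence can be checked directly from the results table. The short script below copies the iteration-0 pass rates and convergence iterations from that table and ranks the suites. Only the variable names are new.

```python
# (suite, iteration-0 pass rate in %, convergence iteration), from the results table.
RESULTS = [
    ("ChipBench", 20.0, 5),
    ("RTLLM-2.0", 78.0, 2),
    ("Verilog-Eval-v2", 86.2, 2),
    ("CVDP CID 002", 3.2, 82),
    ("CVDP CID 003", 19.2, 24),
    ("CVDP CID 004", 10.9, 36),
    ("CVDP CID 005", 9.1, 14),
    ("CVDP CID 007", 0.0, 24),
    ("CVDP CID 012", 47.8, 32),
    ("CVDP CID 013", 3.8, 19),
    ("CVDP CID 014", 79.1, 1),
    ("CVDP CID 016", 25.7, 13),
]

slowest = max(RESULTS, key=lambda row: row[2])
fastest = min(RESULTS, key=lambda row: row[2])
within_two = [name for name, _, conv in RESULTS if conv <= 2]
unweighted_mean_iter0 = sum(rate for _, rate, _ in RESULTS) / len(RESULTS)

print("slowest:", slowest[0], slowest[2])
print("fastest:", fastest[0], fastest[2])
print("converged within two iterations:", within_two)
print("unweighted mean of iteration-0 rates:", round(unweighted_mean_iter0, 1))
```

The unweighted mean of the iteration-0 column comes out near 31.9, which differs from the reported 47.8% aggregate. One plausible reading is that the aggregate is weighted by the number of problems per suite, which the table does not list.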
