AutoCode: A New AI Framework that Lets LLMs Create and Verify Competitive Programming Problems, Mirroring the Workflow of Human Problem Setters

Are your LLM code benchmarks really rejecting wrong-complexity solutions and interactive-protocol violations, or are they passing under-specified unit tests? A team of researchers from UCSD, NYU, University of Washington, Princeton University, Canyon Crest Academy, OpenAI, UC Berkeley, MIT, University of Waterloo, and Sentient Labs introduces AutoCode, a new AI framework that lets LLMs create and verify competitive programming problems, mirroring the workflow of human problem setters. AutoCode reframes evaluation for code-reasoning models by treating problem setting (not only problem solving) as the target task. The system trains LLMs to produce competition-grade statements, test data, and verdict logic that match official online judges at high rates. On a 7,538-problem benchmark built from prior datasets, AutoCode achieves 91.1% consistency with official judgments (FPR 3.7%, FNR 14.1%). On a separate, harder set of 720 recent Codeforces problems (including interactive tasks), the full framework reports 98.7% consistency, 1.3% FPR, and 1.2% FNR.

Why does problem setting matter for evaluation?
Public code benchmarks often rely on under-specified tests that let wrong-complexity or shortcut solutions pass. That inflates scores and pollutes reinforcement signals (rewarding fragile solutions). AutoCode's validator-first approach and adversarial test generation aim to reduce the false positive rate (FPR: incorrect programs that pass) and the false negative rate (FNR: correct programs rejected because of malformed inputs).
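Read as rates over the corresponding pools of submissions, this is the standard confusion-matrix convention (the exact bookkeeping is assumed here rather than quoted from the paper):

```latex
\mathrm{FPR} = \frac{\#\{\text{incorrect programs accepted by the generated tests}\}}{\#\{\text{incorrect programs}\}},
\qquad
\mathrm{FNR} = \frac{\#\{\text{correct programs rejected by the generated tests}\}}{\#\{\text{correct programs}\}}
```

Consistency, as used below, is then the fraction of submissions whose verdict under the generated tests matches the official judge's verdict.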

The core loop: Validator → Generator → Checker
AutoCode runs a closed loop that mirrors human contest workflows, but each step is selected from LLM-generated candidates using targeted in-framework tests.
1) Validator (lower FNR by enforcing input legality)
The system first asks an LLM to synthesize 40 evaluation inputs: 10 valid and 30 near-valid illegal ones (e.g., off-by-one boundary violations). It then prompts the LLM for three candidate validator programs and selects the one that best classifies these cases. This prevents "correct" solutions from crashing on malformed data.
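A minimal sketch of that selection step, assuming the candidate validators are programs that read an input on stdin and signal legality via their exit code (the helper names and the exit-code convention are illustrative, not taken from the paper):

```python
import subprocess

def run_validator(validator_cmd, test_input: str) -> bool:
    """Run one candidate validator; exit code 0 is read as 'input is valid'."""
    try:
        proc = subprocess.run(validator_cmd, input=test_input, text=True,
                              capture_output=True, timeout=5)
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # a hanging validator is treated as rejecting the input

def select_validator(validator_cmds, labeled_inputs):
    """Pick the candidate validator that classifies the most labeled cases
    correctly, e.g. the 10 valid and 30 near-valid illegal inputs above."""
    def score(cmd):
        return sum(run_validator(cmd, text) == is_valid
                   for text, is_valid in labeled_inputs)
    return max(validator_cmds, key=score)
```

A candidate that crashes or hangs on the labeled cases simply scores poorly and is never selected.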

2) Generator (reduce FPR through adversarial coverage)
Three complementary strategies produce test cases:
• Small-data exhaustion for boundary coverage,
• Randomized and extreme cases (overflows, precision issues, hash collisions),
• TLE-inducing constructions to break wrong-complexity solutions.
Invalid cases are filtered by the selected validator; the surviving cases are then deduplicated and bucket-balanced before sampling, as sketched below.
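A minimal sketch of the filter, dedupe, and balance stage, assuming each candidate case is tagged with the strategy that produced it; the helper name, hash-based dedup, and per-bucket quota are illustrative choices, not the paper's exact procedure:

```python
import hashlib
import random
from collections import defaultdict

def filter_dedupe_balance(candidates, is_valid, per_bucket=50, seed=0):
    """candidates: (strategy_tag, input_text) pairs from the three strategies.

    1) Drop inputs the selected validator rejects.
    2) Deduplicate by content hash.
    3) Sample up to `per_bucket` cases per strategy so no single strategy
       (exhaustive, random/extreme, TLE-inducing) dominates the suite.
    """
    buckets, seen = defaultdict(list), set()
    for tag, text in candidates:
        if not is_valid(text):
            continue                                  # legality filter
        digest = hashlib.sha256(text.encode()).hexdigest()
        if digest in seen:
            continue                                  # dedupe
        seen.add(digest)
        buckets[tag].append(text)

    rng = random.Random(seed)
    suite = []
    for cases in buckets.values():                    # bucket-balanced sampling
        rng.shuffle(cases)
        suite.extend(cases[:per_bucket])
    return suite
```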

3) Checker (verdict logic)
The checker compares contestant outputs with the reference solution under complex rules. AutoCode again generates 40 checker scenarios and three candidate checker programs, keeps only the scenarios with validator-approved inputs, and selects the best checker by its accuracy on the 40 labeled scenarios.
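A rough sketch of the checker selection, assuming each labeled scenario bundles an input, a contestant output, a reference output, and a ground-truth accept/reject label (the scenario format and function names are assumptions):

```python
def select_checker(checker_fns, scenarios, input_is_valid):
    """checker_fns: candidates mapping (input, contestant_out, reference_out) -> bool.
    scenarios: dicts with 'input', 'contestant', 'reference', 'should_accept'.
    Scenarios whose input fails the selected validator are discarded first."""
    usable = [s for s in scenarios if input_is_valid(s["input"])]

    def accuracy(checker):
        hits = sum(checker(s["input"], s["contestant"], s["reference"])
                   == s["should_accept"] for s in usable)
        return hits / max(len(usable), 1)

    return max(checker_fns, key=accuracy)
```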

4) Interactor (for interactive problems)
For tasks that require dialogue with the judge, AutoCode introduces a mutant-based interactor: it makes small logical edits ("mutants") to the reference solution and selects interactors that accept the true solution but reject the mutants, maximizing discrimination. This addresses a gap in earlier public datasets, which avoided interactive problems.
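A sketch of that selection criterion, assuming a hypothetical run_interactive(interactor, solution) helper that plays one full session against the judge-side interactor and returns True if it accepts the solution:

```python
def select_interactor(interactor_cmds, reference_solution, mutant_solutions,
                      run_interactive):
    """Prefer interactors that accept the true solution and reject the most mutants."""
    def discrimination(interactor):
        if not run_interactive(interactor, reference_solution):
            return -1                    # must never reject the reference solution
        return sum(not run_interactive(interactor, mutant)
                   for mutant in mutant_solutions)

    best = max(interactor_cmds, key=discrimination)
    return best if discrimination(best) >= 0 else None
```

Scoring an interactor by how many mutants it rejects, subject to never rejecting the reference, is one simple way to operationalize "maximizing discrimination".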

Dual verification enables new problems (not just tests for existing ones)
AutoCode can generate novel problem variants starting from a random "seed" Codeforces problem (<2200 Elo). The LLM drafts a new statement and two solutions: an efficient reference and a simpler brute-force baseline. A problem is accepted only if the reference output matches the brute force across the generated test suite (the brute force may TLE on large cases but serves as ground truth on small/exhaustive cases). This dual-verification protocol filters out roughly 27% of error-prone items, lifting reference-solution correctness from 86% to 94% before human review.
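A minimal sketch of that acceptance rule, assuming a run_solution(cmd, input_text, time_limit) helper that returns an (output, status) pair; the token-level output comparison stands in for the selected checker, and the time limits are arbitrary:

```python
def dual_verify(reference_cmd, brute_cmd, test_inputs, run_solution,
                ref_limit=2.0, brute_limit=10.0):
    """Accept a generated problem only if the reference output matches the
    brute force on every case where the brute force finishes in time."""
    for case in test_inputs:
        ref_out, ref_status = run_solution(reference_cmd, case, ref_limit)
        if ref_status != "OK":
            return False                 # the reference must pass every case
        brute_out, brute_status = run_solution(brute_cmd, case, brute_limit)
        if brute_status == "TLE":
            continue                     # brute force only vouches for small cases
        if brute_status != "OK" or ref_out.split() != brute_out.split():
            return False                 # any disagreement rejects the problem
    return True
```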
Human experts then grade the survivors on solvability, solution correctness, quality, novelty, and difficulty. After filtering, 61.6% are usable for model training, 76.3% for human training, and 3.2% reach ICPC/IOI level. Difficulty typically increases relative to the seed, and the difficulty gain correlates with perceived quality.

Understanding the results
Existing problems (7,538 total; 195,988 human submissions). AutoCode: 91.1% consistency, 3.7% FPR, 14.1% FNR, versus 72.9–81.0% consistency for prior test generators (CodeContests, CodeContests+, TACO, HardTests).
Recent Codeforces problems (720, unfiltered; includes interactives). AutoCode: 98.7% consistency, 1.3% FPR, 1.2% FNR. Ablations show that all three generator strategies and prompt optimization contribute: removing prompt optimization drops consistency to 98.0% and more than doubles FNR to 2.9%.

Key Takeaways
- AutoCode couples a Validator–Generator–Checker (+Interactor) loop with dual verification (reference vs. brute force) to build contest-grade test suites and new problems.
- On held-out problems, AutoCode's test suites reach ~99% consistency with official judges, surpassing prior generators such as HardTests (<81%).
- For recent Codeforces tasks (including interactives), the full framework reports ~98.7% consistency with ~1.3% FPR and ~1.2% FNR.
- The mutant-based interactor reliably accepts the true solution while rejecting mutated variants, improving evaluation of interactive problems.
- Human experts rate a large fraction of AutoCode-generated items as training-usable and a non-trivial share as contest-quality, aligning with the goals of the LiveCodeBench Pro program.
Editorial Comments
AutoCode is a practical fix for current code benchmarks. It centers problem setting and uses a closed-loop Validator–Generator–Checker (+Interactor) pipeline with dual verification (reference vs. brute force). This structure reduces false positives and false negatives and yields judge-aligned consistency (≈99% on held-out problems; 98.7% on recent Codeforces, including interactives). The approach standardizes constraint legality, adversarial coverage, and protocol-aware judging, which makes downstream RL reward signals cleaner. Its placement under LiveCodeBench Pro fits a hallucination-resistant evaluation program that emphasizes expert-checked rigor.
Check out the Paper and Project.