
Andrej Karpathy Releases ‘nanochat’: A Minimal, End-to-End ChatGPT-Style Pipeline You Can Train in ~4 Hours for ~$100

Andrej Karpathy has open-sourced nanochat, a compact, dependency-light codebase that implements a full ChatGPT-style stack, from tokenizer training to web UI inference, aimed at reproducible, hackable LLM training on a single multi-GPU node.

The repo provides a single-script “speedrun” that executes the complete loop: tokenization, base pretraining, mid-training on chat/multiple-choice/tool-use data, Supervised Finetuning (SFT), optional RL on GSM8K, evaluation, and serving (CLI + ChatGPT-like web UI). The recommended setup is an 8×H100 node; at ~$24/hour, the 4-hour speedrun lands close to $100. A post-run report.md summarizes metrics (CORE, ARC-E/C, MMLU, GSM8K, HumanEval, ChatCORE).

Tokenizer and data path

  • Tokenizer: custom Rust BPE (built via Maturin), with a 65,536-token vocab; training uses FineWeb-EDU shards (re-packaged/shuffled for easy access). The walkthrough reports ~4.8 characters/token compression and compares against the GPT-2 and GPT-4 tokenizers (a rough way to measure such a figure is sketched after this list).
  • Eval bundle: a curated set for CORE (22 autocompletion datasets such as HellaSwag, ARC, BoolQ, etc.), downloaded into ~/.cache/nanochat/eval_bundle.
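A compression figure like ~4.8 characters/token is simply total characters divided by total tokens over a text sample. The sketch below shows that measurement shape only; the `encode` callable is a stand-in, not nanochat's tokenizer API, and the whitespace splitter is purely illustrative.

```python
# Minimal sketch: measuring characters-per-token compression for a tokenizer.
# `encode` is a placeholder for whatever encode function a tokenizer exposes.

def chars_per_token(encode, texts):
    """Average compression ratio: total characters / total tokens."""
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(encode(t)) for t in texts)
    return total_chars / total_tokens

# Illustrative run with a whitespace "tokenizer"; a trained 65,536-entry BPE
# vocab on FineWeb-EDU text is what the walkthrough reports at ~4.8.
sample = ["The quick brown fox jumps over the lazy dog."] * 100
print(f"chars/token: {chars_per_token(lambda s: s.split(), sample):.2f}")
```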

Model, scaling, and “speedrun” target

The speedrun config trains a depth-20 Transformer (≈560M params, with 1280 hidden channels and 10 attention heads of dim 128) for ~11.2B tokens, following Chinchilla-style scaling (params × ~20 tokens). The author estimates this run as a ~4e19-FLOPs capability model. Training uses Muon for the matmul parameters and AdamW for embeddings/unembeddings; loss is reported in bits-per-byte (bpb) to be tokenizer-invariant.
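The paragraph's numbers can be sanity-checked with back-of-the-envelope arithmetic. The ~6 × params × tokens FLOPs rule of thumb and the bits-per-byte conversion below are standard approximations, not figures taken from the repo, and the loss value is hypothetical.

```python
import math

params = 560e6                      # ≈560M-parameter depth-20 model
tokens = 20 * params                # Chinchilla-style: ~20 tokens per parameter
flops = 6 * params * tokens         # common ~6*N*D training-FLOPs rule of thumb

print(f"tokens: {tokens / 1e9:.1f}B")   # -> 11.2B, matching the walkthrough
print(f"FLOPs:  {flops:.1e}")           # -> ~3.8e19, i.e. the quoted ~4e19

# Bits-per-byte from a per-token cross-entropy loss (in nats), treating the
# ~4.8 characters/token figure as ~4.8 bytes/token for mostly-ASCII text.
loss_nats = 3.0                     # hypothetical value, not a reported number
bpb = loss_nats / math.log(2) / 4.8
print(f"bits/byte at loss {loss_nats}: {bpb:.3f}")
```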

Mid-training, SFT, and tool use

After pretraining, mid-training adapts the base model to conversations (SmolTalk), explicitly teaches multiple-choice behavior (100K MMLU auxiliary-train questions), and introduces tool use by inserting <|python_start|>…<|python_end|> blocks; a small GSM8K slice is included to seed calculator-style usage. The default mixture: SmolTalk (460K), MMLU aux-train (100K), GSM8K main (8K), totaling 568K rows.
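As a rough illustration, a tool-augmented training row might look like the sketch below. Only the <|python_start|>/<|python_end|> markers come from the article; the message/role schema and the exact wording are assumptions, not nanochat's data format.

```python
# Illustrative only: a mid-training row that seeds calculator-style tool use
# by wrapping Python inside the special tokens. The surrounding structure is
# an assumption, not the repo's actual conversation schema.
example_row = {
    "messages": [
        {"role": "user",
         "content": "Natalia sold clips to 48 of her friends in April, and then "
                    "she sold half as many clips in May. How many clips did "
                    "Natalia sell altogether in April and May?"},
        {"role": "assistant",
         "content": "April: 48 clips; May: half of that. "
                    "<|python_start|>print(48 + 48 // 2)<|python_end|>"
                    "72\nNatalia sold 72 clips altogether."},
    ]
}
print(example_row["messages"][1]["content"])
```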

SFT then fine-tunes on higher-quality conversations while matching test-time formatting (padded, non-concatenated rows) to reduce train/inference mismatch. The repo’s example post-SFT metrics (speedrun tier) report ARC-Easy 0.3876, ARC-Challenge 0.2807, MMLU 0.3151, GSM8K 0.0455, HumanEval 0.0854, ChatCORE 0.0884.
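"Padded, non-concatenated rows" means each conversation stays in its own sequence, right-padded to the longest row in the batch, rather than being packed end-to-end as in pretraining, which is how single conversations arrive at inference time. A minimal sketch under those assumptions (the pad id and helper name are made up for illustration):

```python
# Sketch of padded, non-concatenated SFT batching (illustrative names only).
PAD_ID = 0  # placeholder pad token id

def pad_batch(token_rows):
    """Right-pad each conversation to the batch max length; mask pad positions
    so they contribute no loss."""
    max_len = max(len(r) for r in token_rows)
    input_ids = [r + [PAD_ID] * (max_len - len(r)) for r in token_rows]
    loss_mask = [[1] * len(r) + [0] * (max_len - len(r)) for r in token_rows]
    return input_ids, loss_mask

ids, mask = pad_batch([[5, 8, 13], [21, 34], [55, 89, 144, 233]])
print(ids)
print(mask)
```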

Tool use is wired end-to-end: the custom Engine implements a KV cache, prefill/decode inference, and a simple Python interpreter sandbox for tool-augmented runs, used in both training and evaluation flows.
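The prefill/decode split works roughly as follows: one forward pass over the full prompt populates the KV cache, and subsequent steps feed a single new token while reusing everything already cached. The toy stub below only illustrates that control flow; it is not the Engine's API, and the "model" is a fake.

```python
# Schematic prefill/decode loop with a KV cache (toy stand-in, not nanochat).
import random

class ToyModel:
    """Fake Transformer; the 'KV cache' here just records processed context."""
    def new_kv_cache(self):
        return []
    def forward(self, token_ids, kv_cache):
        kv_cache.extend(token_ids)                  # real caches store per-layer keys/values
        return [random.random() for _ in range(8)]  # fake logits over a tiny vocab

def sample_greedy(logits):
    return max(range(len(logits)), key=logits.__getitem__)

def generate(model, prompt_ids, max_new_tokens):
    kv_cache = model.new_kv_cache()
    logits = model.forward(prompt_ids, kv_cache)    # prefill: whole prompt in one pass
    token = sample_greedy(logits)
    out = [token]
    for _ in range(max_new_tokens - 1):             # decode: one new token per step,
        logits = model.forward([token], kv_cache)   # reusing the cached context
        token = sample_greedy(logits)
        out.append(token)
    return out

print(generate(ToyModel(), prompt_ids=[3, 1, 4, 1, 5], max_new_tokens=4))
```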

Optional RL on GSM8K via a simplified GRPO loop

The final (optional) stage applies reinforcement learning on GSM8K with a simplified GRPO routine. The walkthrough clarifies what is omitted relative to canonical PPO-style RLHF: no trust region via a reference model, no KL penalties, on-policy updates (discarding PPO ratios/clipping), token-level GAPO-style normalization, and a mean-shift advantage. Practically, it behaves close to REINFORCE while preserving the group-relative advantage calculation. The scripts scripts.chat_rl and scripts.chat_eval -i rl -a GSM8K demonstrate the loop.
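The group-relative idea is: sample several completions per prompt, score each (e.g. 1 if the GSM8K answer is correct, else 0), subtract the group mean reward, and weight each completion's log-probability by that advantage. The sketch below is a schematic of that reduction to a REINFORCE-like update under the simplifications the article lists; it is not the nanochat chat_rl code, and the numbers are toy values.

```python
# Schematic group-relative (GRPO-style) advantage reduced to a REINFORCE-like
# surrogate: no reference model, no KL penalty, no PPO ratio/clipping.

def group_relative_advantages(rewards):
    """Mean-shift rewards within one prompt's group of sampled completions."""
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

def reinforce_loss(logprobs, advantages):
    """Policy-gradient surrogate: -mean_i(A_i * logp_i), where logprobs[i] is
    the summed token log-probability of completion i under the policy."""
    n = len(logprobs)
    return -sum(a * lp for a, lp in zip(advantages, logprobs)) / n

# Toy usage: 4 sampled answers to one GSM8K prompt, reward 1 if correct else 0.
rewards = [1.0, 0.0, 0.0, 1.0]
logprobs = [-12.3, -15.1, -14.8, -11.9]   # would come from the policy model
adv = group_relative_advantages(rewards)  # -> [0.5, -0.5, -0.5, 0.5]
print(reinforce_loss(logprobs, adv))
```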

Cost/quality scaling and bigger models

The README sketches two bigger targets beyond the ~$100 speedrun (a quick cost check follows the list):

  • ~$300 tier: d=26 (~12 hours), slightly surpasses GPT-2 on CORE; requires extra pretraining shards and batch-size adjustments.
  • ~$1,000 tier: ~41.6 hours, with materially improved coherence and basic reasoning/coding ability.
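Assuming the same ~$24/hour 8×H100 rate quoted for the speedrun (cloud prices vary), the tier price tags line up with their wall-clock times:

```python
# Cost check at the article's quoted ~$24/hour for an 8xH100 node.
rate = 24  # USD per node-hour (assumed constant across tiers)
for label, hours in [("speedrun", 4), ("d=26 tier", 12), ("larger tier", 41.6)]:
    print(f"{label:>12}: ~${rate * hours:.0f}")
# -> ~$96, ~$288, ~$998: roughly the $100 / $300 / $1,000 tiers in the README.
```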

The repo also notes prior experimental runs in which a d=30 model trained for ~24 hours reached the 40s on MMLU, the 70s on ARC-Easy, and the 20s on GSM8K.

Evaluation snapshot (speedrun tier)

An example report.md table for the ~$100/≈4-hour run shows: CORE 0.2219 (base); after mid-training/SFT, ARC-E 0.3561→0.3876, ARC-C ~0.2875→0.2807, MMLU 0.3111→0.3151, GSM8K 0.0250→0.0455, HumanEval 0.0671→0.0854, ChatCORE 0.0730→0.0884; wall-clock 3h51m.

https://github.com/karpathy/nanochat/discussions/1

Key Takeaways

  • nanochat is a minimal, end-to-end ChatGPT-style stack (~8K LOC) that runs via a single speedrun.sh on one 8×H100 node (~4h ≈ $100).
  • The pipeline covers the tokenizer (Rust BPE), base pretraining, mid-training, SFT, optional RL on GSM8K (simplified GRPO), evaluation, and serving (CLI + web UI).
  • Speedrun metrics (example report.md): CORE 0.2219 base; after SFT, ARC-Easy 0.3876, ARC-Challenge 0.2807, MMLU 0.3151, GSM8K 0.0455, HumanEval 0.0854.
  • Scaling tiers are outlined: ~$300 (d=26, ~12h) slightly outperforms GPT-2 on CORE; ~$1,000 (~41.6h) for materially better coherence and reasoning.

Editorial Comments

Karpathy’s nanochat lands in a useful middle ground: a single, clear, dependency-light repository that stitches tokenizer training (Rust BPE), pretraining on FineWeb-EDU, mid-training (SmolTalk/MMLU aux/GSM8K with tool-use tags), SFT, optional simplified GRPO on GSM8K, and a thin Engine (KV cache, prefill/decode, Python interpreter) into a reproducible speedrun on an 8×H100 node, producing a traceable report.md with CORE/ARC/MMLU/GSM8K/HumanEval metrics and a minimal web UI.



