Autoresearch gives an AI coding agent a 630-line training script and lets it modify architecture, train for 5 minutes, evaluate, and repeat—~12 experiments/hour, ~100 overnight. Karpathy's first run: 89 experiments, 15 keepers, 11% efficiency gain on Time to GPT-2. Shopify saw 19% performance gains overnight. Works best for single-metric optimization on fixed hardware; breaks down for multi-objective tasks and architectural innovation.
Two weeks ago, Andrej Karpathy pushed a 630-line Python script to GitHub and went to sleep. By morning, an AI agent had run 89 experiments on his training code, kept 15 improvements, discarded 74 dead ends, and crashed exactly zero times. The next day, Fortune ran a feature calling it "a glimpse of where AI is heading."
I've been running autoresearch on client fine-tuning jobs for the past week. The results are real—but the limitations are sharper than the hype suggests. Here's what actually happens when you let an agent run your ML experiments overnight, and when you absolutely shouldn't.
What Autoresearch Actually Does
Autoresearch is not hyperparameter tuning. That distinction matters.
Traditional hyperparameter search explores a predefined grid: learning rates from 1e-4 to 1e-2, batch sizes of 16/32/64, maybe dropout rates. The search space is human-defined and the modifications are parametric.
Autoresearch hands an AI coding agent a ~630-line training file (train.py) and says: make this better. The agent can restructure the model architecture, swap optimizers, add normalization layers, delete components, change the training loop—anything. As Karpathy put it: "When agents can modify code arbitrarily, the notion of a 'hyperparameter' dissolves."
The system has three files:

| File | Role | Who modifies it |
|---|---|---|
| prepare.py | Data preparation and evaluation | Locked—nobody |
| train.py | Model architecture + training loop (~630 lines) | The AI agent |
| program.md | Natural language constraints and objectives | The human |

The human programs the organization. The agent programs the model.
The Karpathy Loop: How It Works
The core loop is deceptively simple:
1. Read program.md for constraints and objectives
2. Modify train.py (architecture, optimizer, any code change)
3. Train for 5 minutes, then evaluate
4. Improvement → git commit. Regression → git reset.
5. Log to results.tsv, start again. No human intervention.

Git is the agent's persistent memory. The commit history records what worked and what failed. The agent reads this history to inform future hypotheses—which is why it's more "junior ML engineer" than "grid search."
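Mechanically, the commit-or-reset rule is greedy hill climbing on a single metric. Here is a toy Python simulation of one session; the starting value and the perturbation model are invented for illustration, and only the accept/commit/reset logic mirrors the loop above:

```python
import random

def accept(best_bpb, new_bpb):
    """Greedy rule the loop implements: keep a change only if the metric improves."""
    return new_bpb < best_bpb

def simulate_session(n_experiments=89, seed=0):
    """Toy model of one overnight session: each experiment perturbs val_bpb;
    improvements are 'committed', regressions 'reset', like the git loop."""
    rng = random.Random(seed)
    best = 1.50  # illustrative starting val_bpb, not a real measurement
    kept = 0
    for _ in range(n_experiments):
        candidate = best + rng.uniform(-0.01, 0.03)  # most random changes hurt
        if accept(best, candidate):
            best = candidate  # git commit
            kept += 1
        # else: git reset --hard, so best is unchanged
    return best, kept

best, kept = simulate_session()
print(f"final val_bpb {best:.3f}, kept {kept} of 89")
```

Even this crude model reproduces the shape of Karpathy's numbers: most experiments are discarded, and the metric only ever moves down.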
At ~12 experiments per hour, you get roughly 100 experiments per 8-hour overnight run. Karpathy's first session: 89 experiments in 7.5 hours, 15 kept, 74 discarded, zero crashes.
The Numbers: What Autoresearch Found
Over two days and ~700 experiments, Karpathy documented ~20 additive improvements. They fell into three categories that tell you a lot about where AI agents currently excel:
Resource Reallocation
The agent halved the batch size—a counterintuitive move that improved results by prioritizing iteration speed within fixed 5-minute time budgets. More gradient updates in less time beat fewer updates on larger batches. This is the kind of insight a human might eventually reach, but probably wouldn't test first.
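The arithmetic behind that tradeoff is worth making explicit. Under a fixed wall-clock budget, halving the batch size roughly doubles the number of gradient updates, assuming step time scales close to linearly with batch size (only approximately true on real GPUs; the step times below are invented):

```python
def gradient_updates(budget_s, step_time_s):
    """Optimizer steps that fit in a fixed wall-clock budget."""
    return int(budget_s / step_time_s)

BUDGET = 5 * 60  # the fixed 5-minute experiment budget, in seconds

# Illustrative: assume a step at batch 64 costs 0.40s, and batch 32
# slightly more than half that due to per-step overhead.
print(gradient_updates(BUDGET, 0.40))  # batch 64 → 750 updates
print(gradient_updates(BUDGET, 0.21))  # batch 32 → 1428 updates
```

Nearly twice the optimizer steps in the same five minutes—that's the lever the agent found.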
Narrow Sweet Spots
The agent found parameter ranges so tight a human would never search there manually. Example: a multiplier of 0.68x worked, but 0.66x degraded performance. These narrow optima exist everywhere in deep learning, and brute-force exploration is exactly what finds them.
Actual Bugs
Missing multipliers. Suboptimal default optimizer settings. These aren't hyperparameter discoveries—they're bugs that had been silently costing performance. The agent found them because it tested every modification against a clear metric, with no assumptions about what was "already working." The stacked improvements dropped the Time to GPT-2 leaderboard metric from 2.02 hours to 1.80 hours—an 11% efficiency gain. And critically, these findings transferred to larger models without modification.
Real-World Validation: Shopify's Overnight Run
Shopify CEO Tobi Lütke ran autoresearch on internal company data the week after release. 37 experiments overnight. The result: a 19% performance gain, with a smaller model outperforming a manually configured model twice its size.
That's the pitch for autoresearch in one sentence: it finds configurations that humans wouldn't search for, on hardware you already have, while you're asleep.
The Ecosystem: 7 Days, 5+ Derivatives
Autoresearch hit 41,000+ GitHub stars in under two weeks. More interesting than the star count is what developers built on top of it:
| Project | Stars | What It Does |
|---|---|---|
| pi-autoresearch | 1,377 | Persistent sessions that survive restarts, dashboard UI, branch-aware tracking |
| autoresearch-mlx | 701 | Apple Silicon/Mac port via MLX. M4 Max hit val_bpb of 1.294 from 2.667 baseline |
| autokernel | 689 | GPU kernel optimization. ~40 experiments/hour. Discovers faster Triton/CUDA kernels |
| autoresearch-at-home | 423 | Distributed SETI@home-style runs. Multiple machines share experiments via Ensue |
| autoresearch-agents | 72 | Harrison Chase (LangChain). Optimizes agent code using LangSmith eval scores |

The MLX port revealed something unexpected: smaller hardware favors different winning strategies than H100 runs. Optimizations that worked on an H100 sometimes degraded on an M4 Max. This has implications for anyone running autoresearch on consumer hardware and expecting results to transfer to production GPUs.

Karpathy's vision for where this goes: "The next step for autoresearch is that it has to be asynchronously massively collaborative for agents—think SETI@home style. The goal is not to emulate a single PhD student, it's to emulate a research community of them."
Setting Up Autoresearch: Step by Step
The setup is deliberately minimal. No distributed systems, no cloud configs, no container orchestration.
Prerequisites

- Single NVIDIA GPU (H100 tested; RTX consumer GPUs via community ports)
- Python 3.10+
- uv package manager (by Astral)
- Any AI coding agent (Claude, Codex, etc.)

Quick Start

```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync

# Prepare data
uv run prepare.py

# Run a single training baseline
uv run train.py

# Then point your coding agent at train.py with program.md as context
# and let it loop: modify → train → evaluate → commit/reset
```
Writing program.md
The program.md file is where you define the rules of engagement. This is the human's lever—everything the agent is allowed to do, forbidden from doing, and optimizing for. Key principles from Karpathy's own setup:
- Single metric: val_bpb (lower is better). Don't expose multiple metrics—the agent will Goodhart whichever one is easiest to game.
- Fixed time budget: 5 minutes per experiment. This prevents the agent from "improving" by simply training longer.
- Immutable eval: prepare.py is locked. The agent can't improve scores by modifying the evaluation.
- Git-based memory: Every run is committed or reset. The agent can read history.
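Putting those principles together, an illustrative program.md might look like this (the wording is mine; Karpathy's actual file differs):

```markdown
# Objective
Minimize val_bpb (lower is better). This is the only metric that counts.

# Rules
- You may modify train.py only. prepare.py is read-only.
- Every experiment trains for at most 5 minutes of wall-clock time.
- Commit improvements with a one-line summary; git reset --hard regressions.
- Append every result to results.tsv before starting the next experiment.
```

Notice what's absent: no suggested architectures, no hyperparameter ranges. The constraints define the game; the agent defines the moves.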
When Autoresearch Works—and When It Doesn't
After running this on three client projects, here's my honest assessment:
Works Well For
- Single-metric optimization on fixed hardware — The sweet spot. One GPU, one metric, overnight.
- Finding unintuitive tradeoffs — Like the batch-size halving that improved results by prioritizing iteration speed.
- Bug discovery — Missing multipliers, suboptimal defaults. The agent tests everything with zero assumptions.
- Hardware-specific tuning — Optimizations that exploit specific GPU characteristics.
Breaks Down For
- Multi-objective optimization — If you need to balance latency, accuracy, and memory footprint, the agent will sacrifice two to improve one.
- Architectural innovation — The agent is great at local optimization within a 630-line file. It won't invent the transformer.
- Cross-hardware portability — MLX port showed that winning strategies differ by hardware. Don't assume H100 results transfer to your production T4s.
- Long runs with diminishing returns — After ~100 experiments, the agent starts making marginal changes (Karpathy noted late sessions degraded to random seed adjustments).
- Expensive eval functions — The 5-minute budget works because training a small model is cheap. If your eval takes 2 hours, you get 4 experiments overnight instead of 100.
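That last constraint is just throughput arithmetic: experiments per night equals the time budget divided by per-experiment cost (training plus evaluation, ignoring agent overhead). The numbers below are illustrative:

```python
def experiments_per_night(hours, train_min, eval_min):
    """Modify → train → evaluate cycles that fit in an overnight run."""
    return int(hours * 60 / (train_min + eval_min))

print(experiments_per_night(8, train_min=5, eval_min=0.2))  # cheap eval: 92 per night
print(experiments_per_night(8, train_min=5, eval_min=120))  # 2-hour eval: 3 per night
```

The search only works when iteration is cheap; an expensive eval turns an overnight sweep into a handful of guesses.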
The Goodhart's Law Problem
The agent optimizes aggressively for whatever metric you expose. If your val_bpb doesn't perfectly correlate with downstream task performance, the agent will find ways to game it. This isn't a bug in autoresearch—it's the fundamental challenge of evaluation-driven development. Your eval must genuinely measure what you care about, because the agent will find every shortcut you left open.
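One cheap guardrail follows from the locked-prepare.py design: fingerprint the eval file before the session starts, and refuse any score produced after it changes. A sketch; the function names here are mine, not part of autoresearch:

```python
import hashlib
from pathlib import Path

def file_sha256(path):
    """Content fingerprint of the eval code."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def trusted_score(score, eval_path, expected_hash):
    """Accept a metric value only if the eval file is byte-identical
    to the version hashed before the agent started."""
    if file_sha256(eval_path) != expected_hash:
        raise RuntimeError(f"score untrusted: {eval_path} was modified")
    return score

# Usage: baseline = file_sha256("prepare.py") before the session, then gate
# every reported val_bpb through trusted_score(bpb, "prepare.py", baseline).
```

This catches tampering with the grader, not gaming within its rules—the harder Goodhart failure mode still requires an eval that measures what you actually care about.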
The Bigger Picture: "All LLM Frontier Labs Will Do This"
Karpathy didn't frame autoresearch as a toy. His exact words: "All LLM frontier labs will do this. It's the final boss battle."
He's probably right. The pattern—agent modifies code, trains, evaluates, iterates—is hardware-agnostic, model-agnostic, and scales with compute. The 630-line single-GPU version is a proof of concept. The production version is a fleet of agents across a cluster, each exploring different branches of the search space, sharing findings via something like autoresearch-at-home.
For teams already doing fine-tuning, autoresearch is the obvious next step: take your existing training pipeline, wrap it in the loop, and let an agent search the space you don't have time to explore manually. The investment is minimal—you're running experiments on hardware you already have, during hours you're not using it.
For teams choosing between small specialized models and flagship models, autoresearch changes the calculus. A 7B model that's been autoresearched overnight might beat a flagship model that was configured manually. Shopify proved this isn't theoretical.
And for anyone building AI evaluation systems, autoresearch is a stress test for your evals. If an agent can game your metric overnight, your metric has holes. Better to find them now than in production.
What Comes Next
Karpathy's SETI@home vision—distributed agents collaborating on shared experiment logs—is already happening with autoresearch-at-home. The domain-agnostic forks extend the pattern beyond ML: test coverage, bundle size, Terraform compliance, accessibility scores. Any metric reasonably efficient to evaluate can be autoresearched.
The 630-line script is a seed. The idea it represents—that AI agents can run your experiments better than you can, if you give them a clear metric and get out of the way—is going to reshape how ML teams operate.
Set it up. Run it tonight. Check the results tomorrow. That's the whole pitch.
Frequently Asked Questions
What is autoresearch?

Autoresearch is an open-source framework (MIT license) that gives an AI coding agent a ~630-line training script and lets it experiment autonomously. The loop: agent reads instructions from program.md, modifies train.py, trains for exactly 5 minutes, evaluates validation bits-per-byte, then commits improvements via git or resets failures. It runs ~12 experiments per hour with zero human intervention. Released March 6, 2026, it hit 41,000+ GitHub stars within two weeks.



