Autoresearch gives an AI coding agent a 630-line training script and lets it modify architecture, train for 5 minutes, evaluate, and repeat—~12 experiments/hour, ~100 overnight. Karpathy's first run: 89 experiments, 15 keepers, 11% efficiency gain on Time to GPT-2. Shopify saw 19% performance gains overnight. Works best for single-metric optimization on fixed hardware; breaks down for multi-objective tasks and architectural innovation.
Two weeks ago, Andrej Karpathy pushed a 630-line Python script to GitHub and went to sleep. By morning, an AI agent had run 89 experiments on his training code, kept 15 improvements, discarded 74 dead ends, and crashed exactly zero times. The next day, Fortune ran a feature calling it "a glimpse of where AI is heading."
I've been running autoresearch on client fine-tuning jobs for the past week. The results are real—but the limitations are sharper than the hype suggests. Here's what actually happens when you let an agent run your ML experiments overnight, and when you absolutely shouldn't.
What Autoresearch Actually Does
Autoresearch is not hyperparameter tuning. That distinction matters.
Traditional hyperparameter search explores a predefined grid: learning rates from 1e-4 to 1e-2, batch sizes of 16/32/64, maybe dropout rates. The search space is human-defined and the modifications are parametric.
Autoresearch hands an AI coding agent a ~630-line training file (train.py) and says: make this better. The agent can restructure the model architecture, swap optimizers, add normalization layers, delete components, change the training loop—anything. As Karpathy put it: "When agents can modify code arbitrarily, the notion of a 'hyperparameter' dissolves."
The system has three files:

| File | Role | Who modifies it |
|---|---|---|
| prepare.py | Data preparation and evaluation | Locked—nobody |
| train.py | Model architecture + training loop (~630 lines) | The AI agent |
| program.md | Natural language constraints and objectives | The human |

The human programs the organization. The agent programs the model.
The Karpathy Loop: How It Works
The core loop is deceptively simple:
1. Read program.md for constraints and objectives
2. Modify train.py (architecture, optimizer, any code change)
3. Train for 5 minutes, then evaluate
4. Improvement → git commit. Regression → git reset.
5. Log to results.tsv, start again. No human intervention.

Git is the agent's persistent memory. The commit history records what worked and what failed. The agent reads this history to inform future hypotheses—which is why it's more "junior ML engineer" than "grid search."
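Mechanically, the commit-or-reset rule is greedy hill climbing on a single metric. Here is a toy Python simulation of one session; the starting value and the perturbation model are invented for illustration, and only the accept/commit/reset logic mirrors the loop above:

```python
import random

def accept(best_bpb, new_bpb):
    """Greedy rule the loop implements: keep a change only if the metric improves."""
    return new_bpb < best_bpb

def simulate_session(n_experiments=89, seed=0):
    """Toy model of one overnight session: each experiment perturbs val_bpb;
    improvements are 'committed', regressions 'reset', like the git loop."""
    rng = random.Random(seed)
    best = 1.50  # illustrative starting val_bpb, not a real measurement
    kept = 0
    for _ in range(n_experiments):
        candidate = best + rng.uniform(-0.01, 0.03)  # most random changes hurt
        if accept(best, candidate):
            best = candidate  # git commit
            kept += 1
        # else: git reset --hard, so best is unchanged
    return best, kept

best, kept = simulate_session()
print(f"final val_bpb {best:.3f}, kept {kept} of 89")
```

Even this crude model reproduces the shape of Karpathy's numbers: most experiments are discarded, and the metric only ever moves down.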
At ~12 experiments per hour, you get roughly 100 experiments per 8-hour overnight run. Karpathy's first session: 89 experiments in 7.5 hours, 15 kept, 74 discarded, zero crashes.
The Numbers: What Autoresearch Found
Over two days and ~700 experiments, Karpathy documented ~20 additive improvements. They fell into three categories that tell you a lot about where AI agents currently excel:
Resource Reallocation
The agent halved the batch size—a counterintuitive move that improved results by prioritizing iteration speed within fixed 5-minute time budgets. More gradient updates in less time beat fewer updates on larger batches. This is the kind of insight a human might eventually reach, but probably wouldn't test first.
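The arithmetic behind that tradeoff is worth making explicit. Under a fixed wall-clock budget, halving the batch size roughly doubles the number of gradient updates, assuming step time scales close to linearly with batch size (only approximately true on real GPUs; the step times below are invented):

```python
def gradient_updates(budget_s, step_time_s):
    """Optimizer steps that fit in a fixed wall-clock budget."""
    return int(budget_s / step_time_s)

BUDGET = 5 * 60  # the fixed 5-minute experiment budget, in seconds

# Illustrative: assume a step at batch 64 costs 0.40s, and batch 32
# slightly more than half that due to per-step overhead.
print(gradient_updates(BUDGET, 0.40))  # batch 64 → 750 updates
print(gradient_updates(BUDGET, 0.21))  # batch 32 → 1428 updates
```

Nearly twice the optimizer steps in the same five minutes—that's the lever the agent found.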
Narrow Sweet Spots
The agent found parameter ranges so tight a human would never search there manually. Example: a multiplier of 0.68x worked, but 0.66x degraded performance. These narrow optima exist everywhere in deep learning, and brute-force exploration is exactly what finds them.
Actual Bugs
Missing multipliers. Suboptimal default optimizer settings. These aren't hyperparameter discoveries—they're bugs that had been silently costing performance. The agent found them because it tested every modification against a clear metric, with no assumptions about what was "already working." The stacked improvements dropped the Time to GPT-2 leaderboard metric from 2.02 hours to 1.80 hours—an 11% efficiency gain. And critically, these findings transferred to larger models without modification.
Real-World Validation: Shopify's Overnight Run
Shopify CEO Tobi Lütke ran autoresearch on internal company data the week after release. 37 experiments overnight. The result: a 19% performance gain, with a smaller model outperforming a manually configured model twice its size.
That's the pitch for autoresearch in one sentence: it finds configurations that humans wouldn't search for, on hardware you already have, while you're asleep.
The Ecosystem: 7 Days, 5+ Derivatives
Autoresearch hit 41,000+ GitHub stars in under two weeks. More interesting than the star count is what developers built on top of it:
| Project | Stars | What It Does |
|---|---|---|
| pi-autoresearch | 1,377 | Persistent sessions that survive restarts, dashboard UI, branch-aware tracking |
| autoresearch-mlx | 701 | Apple Silicon/Mac port via MLX. M4 Max hit val_bpb of 1.294 from 2.667 baseline |
| autokernel | 689 | GPU kernel optimization. ~40 experiments/hour. Discovers faster Triton/CUDA kernels |
| autoresearch-at-home | 423 | Distributed SETI@home-style runs. Multiple machines share experiments via Ensue |
| autoresearch-agents | 72 | Harrison Chase (LangChain). Optimizes agent code using LangSmith eval scores |

The MLX port revealed something unexpected: smaller hardware favors different winning strategies than H100 runs. Optimizations that worked on an H100 sometimes degraded on an M4 Max. This has implications for anyone running autoresearch on consumer hardware and expecting results to transfer to production GPUs.

Karpathy's vision for where this goes: "The next step for autoresearch is that it has to be asynchronously massively collaborative for agents—think SETI@home style. The goal is not to emulate a single PhD student, it's to emulate a research community of them."
Setting Up Autoresearch: Step by Step
The setup is deliberately minimal. No distributed systems, no cloud configs, no container orchestration.
Prerequisites

- Single NVIDIA GPU (H100 tested; RTX consumer GPUs via community ports)
- Python 3.10+
- uv package manager (by Astral)
- Any AI coding agent (Claude, Codex, etc.)

Quick Start

```bash
# Install uv if you don't have it
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and setup
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync

# Prepare data
uv run prepare.py

# Run a single training baseline
uv run train.py

# Then point your coding agent at train.py with program.md as context
# and let it loop: modify → train → evaluate → commit/reset
```
Writing program.md
The program.md file is where you define the rules of engagement. This is the human's lever—everything the agent is allowed to do, forbidden from doing, and optimizing for. Key principles from Karpathy's own setup:
- Single metric: val_bpb (lower is better). Don't expose multiple metrics—the agent will Goodhart whichever one is easiest to game.
- Fixed time budget: 5 minutes per experiment. This prevents the agent from "improving" by simply training longer.
- Immutable eval: prepare.py is locked. The agent can't improve scores by modifying the evaluation.
- Git-based memory: Every run is committed or reset. The agent can read history.
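Putting those principles together, an illustrative program.md might look like this (the wording is mine; Karpathy's actual file differs):

```markdown
# Objective
Minimize val_bpb (lower is better). This is the only metric that counts.

# Rules
- You may modify train.py only. prepare.py is read-only.
- Every experiment trains for at most 5 minutes of wall-clock time.
- Commit improvements with a one-line summary; git reset --hard regressions.
- Append every result to results.tsv before starting the next experiment.
```

Notice what's absent: no suggested architectures, no hyperparameter ranges. The constraints define the game; the agent defines the moves.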
When Autoresearch Works—and When It Doesn't
After running this on three client projects, here's my honest assessment:
Works Well For
- Single-metric optimization on fixed hardware — The sweet spot. One GPU, one metric, overnight.
- Finding unintuitive tradeoffs — Like the batch-size halving that improved results by prioritizing iteration speed.
- Bug discovery — Missing multipliers, suboptimal defaults. The agent tests everything with zero assumptions.
- Hardware-specific tuning — Optimizations that exploit specific GPU characteristics.
Breaks Down For
- Multi-objective optimization — If you need to balance latency, accuracy, and memory footprint, the agent will sacrifice two to improve one.
- Architectural innovation — The agent is great at local optimization within a 630-line file. It won't invent the transformer.
- Cross-hardware portability — MLX port showed that winning strategies differ by hardware. Don't assume H100 results transfer to your production T4s.
- Long runs with diminishing returns — After ~100 experiments, the agent starts making marginal changes (Karpathy noted late sessions degraded to random seed adjustments).
- Expensive eval functions — The 5-minute budget works because training a small model is cheap. If your eval takes 2 hours, you get 4 experiments overnight instead of 100.
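That last constraint is just throughput arithmetic: experiments per night equals the time budget divided by per-experiment cost (training plus evaluation, ignoring agent overhead). The numbers below are illustrative:

```python
def experiments_per_night(hours, train_min, eval_min):
    """Modify → train → evaluate cycles that fit in an overnight run."""
    return int(hours * 60 / (train_min + eval_min))

print(experiments_per_night(8, train_min=5, eval_min=0.2))  # cheap eval: 92 per night
print(experiments_per_night(8, train_min=5, eval_min=120))  # 2-hour eval: 3 per night
```

The search only works when iteration is cheap; an expensive eval turns an overnight sweep into a handful of guesses.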
The Goodhart's Law Problem
The agent optimizes aggressively for whatever metric you expose. If your val_bpb doesn't perfectly correlate with downstream task performance, the agent will find ways to game it. This isn't a bug in autoresearch—it's the fundamental challenge of evaluation-driven development. Your eval must genuinely measure what you care about, because the agent will find every shortcut you left open.
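One cheap guardrail follows from the locked-prepare.py design: fingerprint the eval file before the session starts, and refuse any score produced after it changes. A sketch; the function names here are mine, not part of autoresearch:

```python
import hashlib
from pathlib import Path

def file_sha256(path):
    """Content fingerprint of the eval code."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def trusted_score(score, eval_path, expected_hash):
    """Accept a metric value only if the eval file is byte-identical
    to the version hashed before the agent started."""
    if file_sha256(eval_path) != expected_hash:
        raise RuntimeError(f"score untrusted: {eval_path} was modified")
    return score

# Usage: baseline = file_sha256("prepare.py") before the session, then gate
# every reported val_bpb through trusted_score(bpb, "prepare.py", baseline).
```

This catches tampering with the grader, not gaming within its rules—the harder Goodhart failure mode still requires an eval that measures what you actually care about.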
The Bigger Picture: "All LLM Frontier Labs Will Do This"
Karpathy didn't frame autoresearch as a toy. His exact words: "All LLM frontier labs will do this. It's the final boss battle."
He's probably right. The pattern—agent modifies code, trains, evaluates, iterates—is hardware-agnostic, model-agnostic, and scales with compute. The 630-line single-GPU version is a proof of concept. The production version is a fleet of agents across a cluster, each exploring different branches of the search space, sharing findings via something like autoresearch-at-home.
For teams already doing fine-tuning, autoresearch is the obvious next step: take your existing training pipeline, wrap it in the loop, and let an agent search the space you don't have time to explore manually. The investment is minimal—you're running experiments on hardware you already have, during hours you're not using it.
For teams choosing between small specialized models and flagship models, autoresearch changes the calculus. A 7B model that's been autoresearched overnight might beat a flagship model that was configured manually. Shopify proved this isn't theoretical.
And for anyone building AI evaluation systems, autoresearch is a stress test for your evals. If an agent can game your metric overnight, your metric has holes. Better to find them now than in production.
What Comes Next
Karpathy's SETI@home vision—distributed agents collaborating on shared experiment logs—is already happening with autoresearch-at-home. The domain-agnostic forks extend the pattern beyond ML: test coverage, bundle size, Terraform compliance, accessibility scores. Any metric reasonably efficient to evaluate can be autoresearched.
The 630-line script is a seed. The idea it represents—that AI agents can run your experiments better than you can, if you give them a clear metric and get out of the way—is going to reshape how ML teams operate.
Set it up. Run it tonight. Check the results tomorrow. That's the whole pitch.
Frequently Asked Questions
What is autoresearch?

Autoresearch is an open-source framework (MIT license) that gives an AI coding agent a ~630-line training script and lets it experiment autonomously. The loop: agent reads instructions from program.md, modifies train.py, trains for exactly 5 minutes, evaluates validation bits-per-byte, then commits improvements via git or resets failures. It runs ~12 experiments per hour with zero human intervention. Released March 6, 2026, it hit 41,000+ GitHub stars within two weeks.



