Running code an agent wrote inside a Docker container is a gamble — shared kernel, slow enough cold starts (500ms–2s) that teams skip sandboxing on the hot path, and a steady drip of container-escape CVEs. SmolVM (launched April 17, 2026, 482 HN points) ships a single-executable microVM with sub-200ms cold starts on Hypervisor.framework/libkrun and lets you spin up a fresh VM per agent invocation. Firecracker is still the right choice for long-lived tenant workloads; gVisor for low-trust research tooling; Docker for cached build steps. Route per-request ephemeral AI code execution to SmolVM, not a container.
Last week a client's platform team asked me to review the infrastructure behind their internal agent that writes and runs data-cleaning scripts. The agent produced Python, dropped it into a /tmp directory, and ran it with subprocess.run — on the same host as the orchestrator, the auth service, and the customer database. "It's behind a VPN," they said. The VPN does not help when the prompt that asks for a script is written by a user on the other side of that VPN. The agent did what agents do, which is produce plausible code from whatever instruction it was given, and the host executed it.
This is the "agent wrote code, do I dare run it?" problem, and it has had the same bad answer for two years: wrap the exec in a Docker container and hope. Docker shares the kernel. A kernel CVE or a runc escape dissolves the boundary. Cold starts of 500ms to 2 seconds make teams skip sandboxing on hot paths altogether. For AI-generated code — where the instruction itself is attacker-influenceable through prompt injection — that is the wrong threat model. On April 17, 2026, smol-machines/smolvm launched and hit 482 Hacker News points in a day because it finally closed the gap: a single-executable microVM with sub-200ms cold starts, designed for ephemeral per-invocation agent sandboxing. This post is the head-to-head across SmolVM, Firecracker, Docker, and gVisor, and the decision framework we use at Particula Tech to pick the right one per workload.
The Threat Model: Why Docker Alone Keeps Failing
Every time we audit an AI coding agent or a tool-using agent in production, we find the same pattern: generated code runs in a Docker container on the orchestrator host, with --privileged or with a bind mount to /var/run/docker.sock, because someone needed it to build an image or talk to the host. The shared-kernel model then becomes the problem: a kernel CVE or a runc escape dissolves the boundary, and a mounted docker.sock is operationally equivalent to full root on the host with extra steps.
The fix is a separate kernel per invocation: a microVM. That used to mean Firecracker on Linux with KVM, a jailer process, and an ops team. As of April 17, 2026, it also means SmolVM on your laptop.
SmolVM in 200 Words
SmolVM is a microVM runtime from smol-machines, shipped as a single static executable. It launched April 17, 2026, and filled a gap the Firecracker ecosystem never addressed: a microVM you can spawn as easily as you spawn a subprocess, from any developer machine, on macOS or Linux, in under 200ms cold.
Under the hood it uses Hypervisor.framework on macOS (the same layer Docker Desktop's VM builds on) and libkrun on Linux (Red Hat's library for running OCI images in a microVM). The image is a minimal Linux kernel plus a tiny Alpine-based userland whose init mounts virtio-fs shares, applies the network policy, and execs an entrypoint. The API surface is intentionally small: spawn a VM with a code mount, a network policy, and a resource budget; exec a command; collect output; kill the VM. That is the entire lifecycle, and it maps cleanly to one agent invocation.
The release is new — treat it as production-adjacent, not production-hardened — but the primitives it exposes are the right ones for AI code execution.
Head-to-Head: SmolVM vs Firecracker vs Docker vs gVisor
| Dimension | Docker | gVisor | SmolVM | Firecracker |
|---|---|---|---|---|
| Isolation boundary | Shared kernel | User-space kernel | Hardware virtualization | Hardware virtualization |
| Cold start (single-shot) | 500ms–2s | 300–800ms | 80–260ms | 100–150ms (no jailer) |
| Cold start (prewarmed pool) | 50–200ms | N/A | ~50ms | <50ms (snapshots) |
| Runs on macOS dev machines | Yes (via VM) | Linux only | Yes (native) | Linux + KVM only |
| Memory overhead per instance | 20–50 MB | 15–30 MB | 40–80 MB | 5–15 MB |
| Concurrency on one host | Hundreds | Hundreds | ~50–100 (laptop), hundreds (server) | 1000+ |
| Escape-class CVEs (2024–26) | Multiple | Few | None yet (new) | None |
| Ecosystem maturity | Massive | Moderate | New (Apr 2026) | Mature (AWS uses it) |
| Setup effort | Minimal | Moderate | Minimal (single binary) | High (jailer + KVM) |
A few things jump out. Docker is the worst on isolation by a wide margin and the best on ecosystem; gVisor sits awkwardly in the middle on both. Firecracker wins density and prewarm speed; SmolVM wins single-shot cold start and developer ergonomics. For AI-generated code execution, the two columns that matter most are "isolation boundary" (you want hardware virtualization) and "cold start (single-shot)" (per-invocation, no pool). That puts SmolVM and Firecracker ahead, with SmolVM winning when your workload is "one agent invocation, start from zero, on any OS."
When Each Sandbox Wins
Match the sandbox to the workload. Getting this wrong means either paying for isolation you don't need or running without the isolation you do.
SmolVM — per-request ephemeral AI code execution. An agent generates a script, you want it to run once in 200ms with no persistent state, and you want the same code path to work on a developer laptop and a CI runner. This is SmolVM's sweet spot and the exact workload nothing else fits cleanly.
Firecracker — heavy tenant workloads on Linux. You are running multiple long-lived VMs on Linux hosts with KVM, you care about density (100+ VMs per host), and you want snapshot-based prewarming to push cold starts under 50ms. This is the AWS Lambda / Fargate model and it still beats everything else for that shape of workload.
Docker — cached build steps and trusted workers. Your agent needs to run pnpm install, docker build, or a CI pipeline step where the code is not attacker-influenced and the cache matters. Docker's ecosystem and layer caching are still unbeatable for this.
gVisor — untrusted HTTP workloads and research tooling. You are hosting untrusted code behind an HTTP interface (function-as-a-service, code playgrounds) where syscall overhead is acceptable and operational simplicity beats raw speed. gVisor is fine here. For agent code execution specifically, reach for a microVM first.
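The four rules above reduce to a few lines of routing logic. This is a sketch of the decision framework, not any runtime's API; the workload taxonomy and field names are invented for illustration:

```typescript
// A workload is one of three shapes from the framework above.
type Workload =
  | { kind: "agent-exec"; longLived: boolean; hostHasKvm: boolean } // AI-generated code
  | { kind: "build-step" }       // trusted, cache-sensitive (pnpm install, docker build)
  | { kind: "untrusted-http" };  // playgrounds, function-as-a-service

function pickSandbox(w: Workload): "smolvm" | "firecracker" | "docker" | "gvisor" {
  switch (w.kind) {
    case "build-step":
      return "docker"; // ecosystem and layer caching win when the code is trusted
    case "untrusted-http":
      return "gvisor"; // syscall overhead acceptable, operational simplicity wins
    case "agent-exec":
      // Prompt-injectable code gets a hardware boundary every time; the only
      // question is which microVM fits the host and the invocation pattern.
      return w.longLived && w.hostHasKvm ? "firecracker" : "smolvm";
  }
}
```

The useful property is that "which sandbox" becomes a reviewable function of the workload shape instead of a per-team habit.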
We spell out the broader isolation-vs-cost tradeoff — including when the answer is to move execution entirely off a shared platform — in our cloud vs on-premise AI security and cost comparison.
Wiring SmolVM to an Agent: A 40-Line Pattern
Here is the minimum viable pattern for running AI-generated code through SmolVM from a TypeScript orchestrator. The same shape works for Claude Code tool handlers, Codex CLI custom commands, or OpenHands runtimes.
```typescript
// sandbox.ts — run agent-generated code in an ephemeral SmolVM
import { spawn } from "node:child_process";
import { mkdtempSync, writeFileSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface SandboxResult {
  stdout: string;
  stderr: string;
  exitCode: number;
  wallMs: number;
}

export async function runGeneratedCode(
  code: string,
  opts: { timeoutMs?: number; allowedHosts?: string[] } = {}
): Promise<SandboxResult> {
  const stage = mkdtempSync(join(tmpdir(), "agent-exec-"));
  writeFileSync(join(stage, "main.py"), code, { mode: 0o400 });
  const started = Date.now();
  const args = [
    "run",
    "--mount", `${stage}:/work:ro`,
    "--network", (opts.allowedHosts ?? []).length === 0
      ? "none"
      : `allowlist=${(opts.allowedHosts ?? []).join(",")}`,
    "--cpu", "1",
    "--memory", "512M",
    "--timeout", String(opts.timeoutMs ?? 10_000),
    "--",
    "python3", "/work/main.py",
  ];
  return new Promise((resolve, reject) => {
    const vm = spawn("smolvm", args, { timeout: opts.timeoutMs ?? 10_000 });
    let stdout = "", stderr = "";
    vm.stdout.on("data", (b) => { stdout += b.toString(); });
    vm.stderr.on("data", (b) => { stderr += b.toString(); });
    vm.on("error", reject); // smolvm binary missing, or spawn failed outright
    vm.on("close", (code) => {
      rmSync(stage, { recursive: true, force: true }); // drop the staged code
      resolve({ stdout, stderr, exitCode: code ?? -1, wallMs: Date.now() - started });
    });
  });
}
```

Three pieces are worth pointing out. The staging directory is mounted read-only: the VM cannot modify the host's copy of the code, even if it tries. The default network policy is deny-all (--network none); any egress is an explicit allowlist of hostnames. The budget is hard: CPU, memory, and wall-clock are capped at spawn, and the host kills the VM if it exceeds them. These three together are the minimum safe defaults. Shipping without any of them is the same mistake we wrote about in the n8n CVE-2026-21858 RCE chain: generated code running with too much trust, in the wrong place, with no kill switch.
This pattern composes naturally with the parallel-agent workflow we documented in our oh-my-codex worktree guide: one sandbox per worker, one VM per exec, one audit trail per invocation. It also slots cleanly into the zero-trust agent model we walked through in our analysis of Microsoft's ZT4AI framework — every tool invocation authenticated, scoped, sandboxed, and logged.
Production Gotchas We've Hit
A sandbox is not a product. The microVM is the isolation primitive, but the production operational surface has sharp edges.
Network Egress Is Where Most Teams Leak
Deny-all is the default you want, and it is the default almost nobody ships with. Teams allowlist *.pypi.org so the sandboxed code can pip install, and that allowlist becomes a DNS-exfiltration path because pip fetches arbitrary package names. The fix: pre-install dependencies into the VM image rather than allowlisting the package index. If the sandboxed code needs HTTP at runtime, allowlist specific API hostnames, not ecosystem registries.
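Enforcing that rule on the host side is a few lines. This sketch reuses the --network flag shape from the earlier example; the registry blocklist and the refusal of wildcards are our own convention, not anything SmolVM ships:

```typescript
// Hosts we refuse to allowlist even if asked: package registries are
// DNS-exfiltration paths, because pip/npm will resolve attacker-chosen names.
const REGISTRY_PATTERNS = [/pypi\.org$/, /npmjs\.org$/, /crates\.io$/];

function networkFlag(allowedHosts: string[]): string {
  for (const host of allowedHosts) {
    const bare = host.replace(/^\*\./, "");
    if (host.startsWith("*.") || REGISTRY_PATTERNS.some((p) => p.test(bare))) {
      // The right fix is baking dependencies into the VM image, not widening egress.
      throw new Error(`refusing egress to ${host}; pre-install dependencies in the image`);
    }
  }
  return allowedHosts.length === 0 ? "none" : `allowlist=${allowedHosts.join(",")}`;
}
```

Making the policy builder throw on registry hosts turns "someone quietly allowlisted *.pypi.org" into a failing test instead of a silent exfiltration path.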
Filesystem Mount Strategy Matters
Read-only code mounts are the easy part. Write scratch space is where the footguns live. If you give the VM a writable scratch mount on the host filesystem, you've reintroduced a side channel: agent A writes a payload file that agent B reads on the next invocation. Use an ephemeral tmpfs mount inside the VM for scratch (size-capped), and a separate, one-way write mount for the exec result. When the VM dies, the tmpfs dies with it.
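Expressed as spawn arguments, the layout looks like the sketch below. Only the read-only --mount form appears in the earlier example; the writable result mount and the --tmpfs flag are assumptions about how a runtime would express this, made for illustration:

```typescript
// Three mounts per exec: read-only code in, write-only results out,
// and size-capped scratch that lives and dies inside the VM.
function mountArgs(codeDir: string, resultDir: string): string[] {
  return [
    "--mount", `${codeDir}:/work:ro`,   // host code, immutable from inside
    "--mount", `${resultDir}:/out:rw`,  // one-way result drop; fresh dir per exec
    "--tmpfs", "/scratch:size=256M",    // ASSUMED flag: ephemeral, no cross-invocation channel
  ];
}
```

The important invariant is that resultDir is created fresh per invocation and never re-mounted, so no later agent can read an earlier agent's output.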
GPU Passthrough Is a Reality Check
If your agent writes code that needs a local GPU — training, inference, CUDA — the microVM story gets much worse. Hypervisor.framework has limited GPU passthrough; libkrun supports virtio-gpu but it's slow and driver-fragile. In practice, keep GPU workloads outside the sandbox: run CPU-bound generated code inside SmolVM, and route model calls to a separate, authenticated inference service. For orgs that actually need sandboxed GPU compute, the answer is Firecracker on a Linux host with PCI passthrough, or bare-metal with strong auth — not a cross-platform microVM.
Cold Start Regresses with Image Size
The "under 200ms" number is the stripped Alpine image with a minimal init. The moment you add Python, Node, and a handful of common libraries, cold start drifts to 400-600ms. Two options: ship multiple images (a Python sandbox, a Node sandbox, a "minimal" sandbox) and pick by task; or keep a warm pool of N pre-started VMs and rotate them. The pool approach is what Firecracker does at Lambda scale; for SmolVM it is a ~100-line wrapper you write once.
Observability Needs to Survive the VM's Death
When the VM dies, everything inside it dies. If your agent's debug output matters, capture it to a file on a one-way scratch mount before the VM exits, then read it from the host after. Do not try to stream logs over a VM-to-host socket under load — the data race on VM shutdown loses the last few hundred ms of output, which is exactly where the interesting failures happen.
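A minimal host-side collector for that pattern, where resultDir is the host path backing the one-way write mount; the debug.log filename is an illustrative convention, not anything SmolVM prescribes:

```typescript
import { readFileSync, existsSync } from "node:fs";
import { join } from "node:path";

// Read the sandboxed code's debug output only after the VM process has fully
// exited: the file either made it onto the one-way mount before death or it
// didn't, and there is no partial-stream state to reason about.
function collectDebugLog(resultDir: string): string | null {
  const logPath = join(resultDir, "debug.log");
  return existsSync(logPath) ? readFileSync(logPath, "utf8") : null;
}
```

Call this from the spawn's close handler, never from a data handler racing the shutdown.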
The Bottom Line: Match the Sandbox to the Risk
Running AI-generated code is a capability question and a trust question. The capability is easy — any container or VM can exec the code. The trust question is what the isolation boundary has to hold, and for code an agent wrote in response to a user-supplied prompt, the boundary has to hold against a hostile workload every time. Docker doesn't. gVisor mostly doesn't for syscall-heavy code. Firecracker and SmolVM do, and the choice between them is mostly about where you run and how often you cold-start.
At Particula Tech we've shipped this pattern for clients running coding agents on developer laptops, multi-tenant SaaS agent platforms, and internal tools that execute model-generated SQL and Python. The recipe is the same every time: hardware-virtualized microVM per invocation, deny-all network with explicit allowlist, read-only code mount, hard budget, audit every exec. SmolVM made the developer-side of that recipe cheap enough that you can put it in dev, CI, and prod without a platform team rewriting the world. For how the same isolation thinking applies to the rest of the AI-in-the-loop development stack — from IDE agents to CI/CD — start with the AI development tools pillar and our Cursor AI development best practices guide.
Three days of running a Claude Code instance through a microVM sandbox is worth more than a quarter of security-review meetings about "what if the agent writes something bad." Let the sandbox answer the question.
Frequently Asked Questions
Why isn't a Docker container enough to sandbox AI-generated code?
Docker shares the host kernel. A container escape or a kernel vulnerability gives the guest root on the host, and 2023-2026 saw a steady drip of runc, containerd, and kernel CVEs (CVE-2024-21626, CVE-2025-23359, CVE-2026-1109) that each broke the isolation boundary. For code that an LLM wrote from a prompt that might contain injection, shared-kernel isolation is the wrong threat model. Docker is fine for cached build steps or trusted workers behind an auth boundary. It is not fine as the only thing between an attacker-influenced prompt and your production host. A hardware-virtualized microVM (SmolVM, Firecracker) gives you a separate kernel per invocation and blocks the entire escape class.