Running code an agent wrote inside a Docker container is a gamble — shared kernel, slow enough cold starts (500ms–2s) that teams skip sandboxing on the hot path, and a steady drip of container-escape CVEs. SmolVM (launched April 17, 2026, 482 HN points) ships a single-executable microVM with sub-200ms cold starts on Hypervisor.framework/libkrun and lets you spin up a fresh VM per agent invocation. Firecracker is still the right choice for long-lived tenant workloads; gVisor for low-trust research tooling; Docker for cached build steps. Route per-request ephemeral AI code execution to SmolVM, not a container.
Last week a client's platform team asked me to review the infrastructure behind their internal agent that writes and runs data-cleaning scripts. The agent produced Python, dropped it into a /tmp directory, and ran it with subprocess.run — on the same host as the orchestrator, the auth service, and the customer database. "It's behind a VPN," they said. The VPN does not help when the prompt that asks for a script is written by a user on the other side of that VPN. The agent did what agents do, which is produce plausible code from whatever instruction it was given, and the host executed it.
This is the "agent wrote code, do I dare run it?" problem, and it has had the same bad answer for two years: wrap the exec in a Docker container and hope. Docker shares the kernel. A kernel CVE or a runc escape dissolves the boundary. Cold starts of 500ms to 2 seconds make teams skip sandboxing on hot paths altogether. For AI-generated code — where the instruction itself is attacker-influenceable through prompt injection — that is the wrong threat model. On April 17, 2026, smol-machines/smolvm launched and hit 482 Hacker News points in a day because it finally closed the gap: a single-executable microVM with sub-200ms cold starts, designed for ephemeral per-invocation agent sandboxing. This post is the head-to-head across SmolVM, Firecracker, Docker, and gVisor, and the decision framework we use at Particula Tech to pick the right one per workload.
The Threat Model: Why Docker Alone Keeps Failing
Every time we audit an AI coding agent or a tool-using agent in production, we find the same pattern: generated code runs in a Docker container on the orchestrator host, with --privileged or with a bind mount to /var/run/docker.sock, because someone needed it to build an image or talk to the host. The shared-kernel model then becomes the problem: a kernel CVE or a runc escape dissolves the boundary, and a mounted docker.sock is operationally equivalent to full root on the host with extra steps.
The fix is a separate kernel per invocation: a microVM. That used to mean Firecracker on Linux with KVM, a jailer process, and an ops team. As of April 17, 2026, it also means SmolVM on your laptop.
SmolVM in 200 Words
SmolVM is a microVM runtime from smol-machines, shipped as a single static executable. It launched April 17, 2026, and filled a gap the Firecracker ecosystem never addressed: a microVM you can spawn as easily as you spawn a subprocess, from any developer machine, on macOS or Linux, in under 200ms cold.
Under the hood it uses Hypervisor.framework on macOS (the same layer Docker Desktop's VM builds on) and libkrun on Linux (Red Hat's library for running OCI images in a microVM). The image is a minimal Linux kernel plus a tiny Alpine-based userland whose init mounts virtio-fs shares, applies the network policy, and execs an entrypoint. The API surface is intentionally small: spawn a VM with a code mount, a network policy, and a resource budget; exec a command; collect output; kill the VM. That is the entire lifecycle, and it maps cleanly to one agent invocation.
The release is new — treat it as production-adjacent, not production-hardened — but the primitives it exposes are the right ones for AI code execution.
Head-to-Head: SmolVM vs Firecracker vs Docker vs gVisor
| Dimension | Docker | gVisor | SmolVM | Firecracker |
|---|---|---|---|---|
| Isolation boundary | Shared kernel | User-space kernel | Hardware virtualization | Hardware virtualization |
| Cold start (single-shot) | 500ms–2s | 300–800ms | 80–260ms | 100–150ms (no jailer) |
| Cold start (prewarmed pool) | 50–200ms | N/A | ~50ms | <50ms (snapshots) |
| Runs on macOS dev machines | Yes (via VM) | Linux only | Yes (native) | Linux + KVM only |
| Memory overhead per instance | 20–50 MB | 15–30 MB | 40–80 MB | 5–15 MB |
| Concurrency on one host | Hundreds | Hundreds | ~50–100 (laptop), hundreds (server) | 1000+ |
| Escape-class CVEs (2024–26) | Multiple | Few | None yet (new) | None |
| Ecosystem maturity | Massive | Moderate | New (Apr 2026) | Mature (AWS uses it) |
| Setup effort | Minimal | Moderate | Minimal (single binary) | High (jailer + KVM) |
A few things jump out. Docker is the worst on isolation by a wide margin and the best on ecosystem; gVisor sits awkwardly in the middle on both. Firecracker wins density and prewarm speed; SmolVM wins single-shot cold start and developer ergonomics. For AI-generated code execution, the two columns that matter most are "isolation boundary" (you want hardware virtualization) and "cold start (single-shot)" (per-invocation, no pool). That puts SmolVM and Firecracker ahead, with SmolVM winning when your workload is "one agent invocation, start from zero, on any OS."
When Each Sandbox Wins
Match the sandbox to the workload. Getting this wrong means either paying for isolation you don't need or running without the isolation you do.
SmolVM — per-request ephemeral AI code execution. An agent generates a script, you want it to run once in 200ms with no persistent state, and you want the same code path to work on a developer laptop and a CI runner. This is SmolVM's sweet spot and the exact workload nothing else fits cleanly.
Firecracker — heavy tenant workloads on Linux. You are running multiple long-lived VMs on Linux hosts with KVM, you care about density (100+ VMs per host), and you want snapshot-based prewarming to push cold starts under 50ms. This is the AWS Lambda / Fargate model and it still beats everything else for that shape of workload.
Docker — cached build steps and trusted workers. Your agent needs to run pnpm install, docker build, or a CI pipeline step where the code is not attacker-influenced and the cache matters. Docker's ecosystem and layer caching are still unbeatable for this.
gVisor — untrusted HTTP workloads and research tooling. You are hosting untrusted code behind an HTTP interface (function-as-a-service, code playgrounds) where syscall overhead is acceptable and operational simplicity beats raw speed. gVisor is fine here. For agent code execution specifically, reach for a microVM first.
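The four rules above reduce to a few lines of routing logic. This is a sketch of the decision framework, not any runtime's API; the workload taxonomy and field names are invented for illustration:

```typescript
// A workload is one of three shapes from the framework above.
type Workload =
  | { kind: "agent-exec"; longLived: boolean; hostHasKvm: boolean } // AI-generated code
  | { kind: "build-step" }       // trusted, cache-sensitive (pnpm install, docker build)
  | { kind: "untrusted-http" };  // playgrounds, function-as-a-service

function pickSandbox(w: Workload): "smolvm" | "firecracker" | "docker" | "gvisor" {
  switch (w.kind) {
    case "build-step":
      return "docker"; // ecosystem and layer caching win when the code is trusted
    case "untrusted-http":
      return "gvisor"; // syscall overhead acceptable, operational simplicity wins
    case "agent-exec":
      // Prompt-injectable code gets a hardware boundary every time; the only
      // question is which microVM fits the host and the invocation pattern.
      return w.longLived && w.hostHasKvm ? "firecracker" : "smolvm";
  }
}
```

The useful property is that "which sandbox" becomes a reviewable function of the workload shape instead of a per-team habit.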
We spell out the broader isolation-vs-cost tradeoff — including when the answer is to move execution entirely off a shared platform — in our cloud vs on-premise AI security and cost comparison.
Wiring SmolVM to an Agent: A 40-Line Pattern
Here is the minimum viable pattern for running AI-generated code through SmolVM from a TypeScript orchestrator. The same shape works for Claude Code tool handlers, Codex CLI custom commands, or OpenHands runtimes.
```typescript
// sandbox.ts — run agent-generated code in an ephemeral SmolVM
import { spawn } from "node:child_process";
import { mkdtempSync, writeFileSync, rmSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

interface SandboxResult {
  stdout: string;
  stderr: string;
  exitCode: number;
  wallMs: number;
}

export async function runGeneratedCode(
  code: string,
  opts: { timeoutMs?: number; allowedHosts?: string[] } = {}
): Promise<SandboxResult> {
  const stage = mkdtempSync(join(tmpdir(), "agent-exec-"));
  writeFileSync(join(stage, "main.py"), code, { mode: 0o400 });
  const started = Date.now();
  const args = [
    "run",
    "--mount", `${stage}:/work:ro`,
    "--network", (opts.allowedHosts ?? []).length === 0
      ? "none"
      : `allowlist=${(opts.allowedHosts ?? []).join(",")}`,
    "--cpu", "1",
    "--memory", "512M",
    "--timeout", String(opts.timeoutMs ?? 10_000),
    "--",
    "python3", "/work/main.py",
  ];
  return new Promise((resolve, reject) => {
    const vm = spawn("smolvm", args, { timeout: opts.timeoutMs ?? 10_000 });
    let stdout = "", stderr = "";
    vm.stdout.on("data", (b) => { stdout += b.toString(); });
    vm.stderr.on("data", (b) => { stderr += b.toString(); });
    vm.on("error", reject); // smolvm binary missing, or spawn failed outright
    vm.on("close", (code) => {
      rmSync(stage, { recursive: true, force: true }); // drop the staged code
      resolve({ stdout, stderr, exitCode: code ?? -1, wallMs: Date.now() - started });
    });
  });
}
```

Three pieces are worth pointing out. The staging directory is mounted read-only: the VM cannot modify the host's copy of the code, even if it tries. The default network policy is deny-all (--network none); any egress is an explicit allowlist of hostnames. The budget is hard: CPU, memory, and wall-clock are capped at spawn, and the host kills the VM if it exceeds them. These three together are the minimum safe defaults. Shipping without any of them is the same mistake we wrote about in the n8n CVE-2026-21858 RCE chain: generated code running with too much trust, in the wrong place, with no kill switch.
This pattern composes naturally with the parallel-agent workflow we documented in our oh-my-codex worktree guide: one sandbox per worker, one VM per exec, one audit trail per invocation. It also slots cleanly into the zero-trust agent model we walked through in our analysis of Microsoft's ZT4AI framework — every tool invocation authenticated, scoped, sandboxed, and logged.
Production Gotchas We've Hit
A sandbox is not a product. The microVM is the isolation primitive, but the production operational surface has sharp edges.
Network Egress Is Where Most Teams Leak
Deny-all is the default you want, and it is the default almost nobody ships with. Teams allowlist *.pypi.org so the sandboxed code can pip install, and that allowlist becomes a DNS-exfiltration path because pip fetches arbitrary package names. The fix: pre-install dependencies into the VM image rather than allowlisting the package index. If the sandboxed code needs HTTP at runtime, allowlist specific API hostnames, not ecosystem registries.
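Enforcing that rule on the host side is a few lines. This sketch reuses the --network flag shape from the earlier example; the registry blocklist and the refusal of wildcards are our own convention, not anything SmolVM ships:

```typescript
// Hosts we refuse to allowlist even if asked: package registries are
// DNS-exfiltration paths, because pip/npm will resolve attacker-chosen names.
const REGISTRY_PATTERNS = [/pypi\.org$/, /npmjs\.org$/, /crates\.io$/];

function networkFlag(allowedHosts: string[]): string {
  for (const host of allowedHosts) {
    const bare = host.replace(/^\*\./, "");
    if (host.startsWith("*.") || REGISTRY_PATTERNS.some((p) => p.test(bare))) {
      // The right fix is baking dependencies into the VM image, not widening egress.
      throw new Error(`refusing egress to ${host}; pre-install dependencies in the image`);
    }
  }
  return allowedHosts.length === 0 ? "none" : `allowlist=${allowedHosts.join(",")}`;
}
```

Making the policy builder throw on registry hosts turns "someone quietly allowlisted *.pypi.org" into a failing test instead of a silent exfiltration path.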
Filesystem Mount Strategy Matters
Read-only code mounts are the easy part. Write scratch space is where the footguns live. If you give the VM a writable scratch mount on the host filesystem, you've reintroduced a side channel: agent A writes a payload file that agent B reads on the next invocation. Use an ephemeral tmpfs mount inside the VM for scratch (size-capped), and a separate, one-way write mount for the exec result. When the VM dies, the tmpfs dies with it.
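Expressed as spawn arguments, the layout looks like the sketch below. Only the read-only --mount form appears in the earlier example; the writable result mount and the --tmpfs flag are assumptions about how a runtime would express this, made for illustration:

```typescript
// Three mounts per exec: read-only code in, write-only results out,
// and size-capped scratch that lives and dies inside the VM.
function mountArgs(codeDir: string, resultDir: string): string[] {
  return [
    "--mount", `${codeDir}:/work:ro`,   // host code, immutable from inside
    "--mount", `${resultDir}:/out:rw`,  // one-way result drop; fresh dir per exec
    "--tmpfs", "/scratch:size=256M",    // ASSUMED flag: ephemeral, no cross-invocation channel
  ];
}
```

The important invariant is that resultDir is created fresh per invocation and never re-mounted, so no later agent can read an earlier agent's output.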
GPU Passthrough Is a Reality Check
If your agent writes code that needs a local GPU — training, inference, CUDA — the microVM story gets much worse. Hypervisor.framework has limited GPU passthrough; libkrun supports virtio-gpu but it's slow and driver-fragile. In practice, keep GPU workloads outside the sandbox: run CPU-bound generated code inside SmolVM, and route model calls to a separate, authenticated inference service. For orgs that actually need sandboxed GPU compute, the answer is Firecracker on a Linux host with PCI passthrough, or bare-metal with strong auth — not a cross-platform microVM.
Cold Start Regresses with Image Size
The "under 200ms" number is the stripped Alpine image with a minimal init. The moment you add Python, Node, and a handful of common libraries, cold start drifts to 400-600ms. Two options: ship multiple images (a Python sandbox, a Node sandbox, a "minimal" sandbox) and pick by task; or keep a warm pool of N pre-started VMs and rotate them. The pool approach is what Firecracker does at Lambda scale; for SmolVM it is a ~100-line wrapper you write once.
Observability Needs to Survive the VM's Death
When the VM dies, everything inside it dies. If your agent's debug output matters, capture it to a file on a one-way scratch mount before the VM exits, then read it from the host after. Do not try to stream logs over a VM-to-host socket under load — the data race on VM shutdown loses the last few hundred ms of output, which is exactly where the interesting failures happen.
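A minimal host-side collector for that pattern, where resultDir is the host path backing the one-way write mount; the debug.log filename is an illustrative convention, not anything SmolVM prescribes:

```typescript
import { readFileSync, existsSync } from "node:fs";
import { join } from "node:path";

// Read the sandboxed code's debug output only after the VM process has fully
// exited: the file either made it onto the one-way mount before death or it
// didn't, and there is no partial-stream state to reason about.
function collectDebugLog(resultDir: string): string | null {
  const logPath = join(resultDir, "debug.log");
  return existsSync(logPath) ? readFileSync(logPath, "utf8") : null;
}
```

Call this from the spawn's close handler, never from a data handler racing the shutdown.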
The Bottom Line: Match the Sandbox to the Risk
Running AI-generated code is a capability question and a trust question. The capability is easy — any container or VM can exec the code. The trust question is what the isolation boundary has to hold, and for code an agent wrote in response to a user-supplied prompt, the boundary has to hold against a hostile workload every time. Docker doesn't. gVisor mostly doesn't for syscall-heavy code. Firecracker and SmolVM do, and the choice between them is mostly about where you run and how often you cold-start.
At Particula Tech we've shipped this pattern for clients running coding agents on developer laptops, multi-tenant SaaS agent platforms, and internal tools that execute model-generated SQL and Python. The recipe is the same every time: hardware-virtualized microVM per invocation, deny-all network with explicit allowlist, read-only code mount, hard budget, audit every exec. SmolVM made the developer-side of that recipe cheap enough that you can put it in dev, CI, and prod without a platform team rewriting the world. For how the same isolation thinking applies to the rest of the AI-in-the-loop development stack — from IDE agents to CI/CD — start with the AI development tools pillar and our Cursor AI development best practices guide.
Three days of running a Claude Code instance through a microVM sandbox is worth more than a quarter of security-review meetings about "what if the agent writes something bad." Let the sandbox answer the question.
Frequently Asked Questions
Why isn't a Docker container enough to sandbox AI-generated code?
Docker shares the host kernel. A container escape or a kernel vulnerability gives the guest root on the host, and 2023-2026 saw a steady drip of runc, containerd, and kernel CVEs (CVE-2024-21626, CVE-2025-23359, CVE-2026-1109) that each broke the isolation boundary. For code that an LLM wrote from a prompt that might contain injection, shared-kernel isolation is the wrong threat model. Docker is fine for cached build steps or trusted workers behind an auth boundary. It is not fine as the only thing between an attacker-influenced prompt and your production host. A hardware-virtualized microVM (SmolVM, Firecracker) gives you a separate kernel per invocation and blocks the entire escape class.