June 18, 2026

Run GLM-5.2 Locally: Hardware, Quantization & vLLM

Run GLM-5.2 locally: 744B MoE, 1M context, MIT license. Unsloth's 2-bit GGUF squeezes 1.51 TB down to 239 GB at 3-9 tok/s on a Mac Studio.

Sebastian Mondragon

11 min read


TL;DR

GLM-5.2 is a ~744B-parameter MoE (~40B active) with a 1M-token context window under an MIT license, released mid-June 2026. Production serving targets an 8x H200 node (~1,128 GB aggregate VRAM), which holds the ~750 GB FP8 weights with room for KV cache; the ~1.51 TB BF16 weights do not fit on a single such node. Unsloth's 2-bit dynamic GGUF compresses the model from ~1.51 TB to ~239 GB, fitting a 256 GB Mac Studio or a 24 GB GPU + 256 GB RAM box with MoE offloading at ~3-9 tok/s. It scored 62.1 on SWE-bench Pro versus GPT-5.5's 58.6 at roughly 1/6 the cost.

In mid-June 2026, Z.ai released GLM-5.2 under an MIT license, and within a day Simon Willison was calling it "probably the most powerful text-only open weights LLM." If you want to run GLM-5.2 locally, the headline numbers are both encouraging and sobering: it beats GPT-5.5 on long-horizon coding benchmarks at roughly 1/6 the cost, and it is a 744-billion-parameter model that occupies 1.51 terabytes at full precision. Those two facts define the entire practical problem.

The model is a Mixture-of-Experts design with about 40 billion active parameters per token and a 1-million-token context window. That MoE architecture is exactly why local inference is even on the table: only a fraction of the weights fire on any given token, so you can offload the inactive experts to slower memory and still get usable output. The question is not whether you can run GLM-5.2 locally. It is which tradeoff you accept: an 8x H200 node for production throughput, or a 256 GB Mac Studio at 3-9 tokens per second for a quantized version that fits on one machine.

This is a how-to for both paths. We will do the VRAM math, walk through Unsloth's 2-bit dynamic GGUF and what quality you give up, set up vLLM with tensor parallelism for the GPU path, configure llama.cpp with MoE offloading for the consumer path, and end on the cost case: at what token volume does owning the hardware beat renting an API. For the broader landscape of where GLM-5.2 sits among open-weight coders, our DeepSeek V4 vs Kimi K2.6 vs GLM-5.1 comparison covers the prior generation, and the llm-models pillar maps the full cluster.

What GLM-5.2 Is: 744B MoE, 1M Context, MIT License

GLM-5.2 is roughly 744 billion total parameters arranged as a Mixture-of-Experts, with about 40 billion active per token. The practical meaning of that ratio: the model has the knowledge capacity of a 744B dense model but the per-token compute of a 40B one. For a router-based MoE, the inactive experts still need to be stored somewhere, but they do not all need to be in the fastest memory at once. That single property is the hinge the entire local-inference story swings on.

The 1-million-token context window is the second headline. That is enough to load a mid-sized monorepo, a full set of API docs, or a long agent trajectory into a single prompt without chunking. Context that large changes how you architect retrieval: for some workloads you can skip the RAG-versus-long-context tradeoff entirely and just paste the corpus in. The caveat is that 1M tokens of KV cache is itself enormous and competes for the same memory you need for weights, so on a quantized local box you will rarely run anywhere near the full window.

The MIT license is the third, and for a lot of teams it is the deciding factor. MIT is about as permissive as licenses get: commercial use, modification, redistribution, fine-tuning, all allowed, no per-seat fees, no usage royalties. Combined with the fact that model weights are static files that transmit nothing, this is the sovereignty play. Nothing leaves your network, there is no API jurisdiction question, and a hosted-endpoint compliance review becomes moot because there is no hosted endpoint in the loop. On benchmarks, GLM-5.2 scored 62.1 on SWE-bench Pro against GPT-5.5's 58.6, so this is not a "good for an open model" caveat. It is competitive at the frontier on long-horizon coding, full stop.

VRAM Math: Full-Precision Serving vs Quantized Footprint

Start with the uncomfortable number. At full precision (BF16, 2 bytes per parameter), 744 billion parameters is about 1.51 TB of weights before you add any KV cache. An 8x H200 node, at 141 GB per GPU, gives you about 1,128 GB of aggregate VRAM, which is less than 1.51 TB: the BF16 weights alone would need at least 11 H200s, which means a second node. What a single 8x H200 node does hold is the FP8 build (~750 GB), sharded across the eight GPUs with tensor parallelism and with roughly 380 GB left over for KV cache and activations. Either way, that is a cloud-rental or on-prem datacenter setup. It delivers real production throughput, but it is not a workstation.
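The arithmetic above is easy to rerun for any precision or GPU size. The following is a minimal sketch, not a sizing tool: the function names are made up for illustration, it counts weights only (no KV cache, activations, or framework overhead), and it uses the rounded 744B parameter count, so BF16 comes out at 1,488 GB rather than the quoted ~1.51 TB.

```python
import math

def weight_footprint_gb(params_billion: float, bytes_per_param: float) -> float:
    """Weight size in decimal GB (1 GB = 1e9 bytes), excluding KV cache."""
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    return params_billion * bytes_per_param

def gpus_needed(weights_gb: float, gpu_vram_gb: float = 141.0) -> int:
    """Smallest GPU count whose combined VRAM holds the weights alone."""
    return math.ceil(weights_gb / gpu_vram_gb)

bf16_gb = weight_footprint_gb(744, 2.0)  # 2 bytes per parameter
fp8_gb = weight_footprint_gb(744, 1.0)   # 1 byte per parameter

print(f"BF16: {bf16_gb:.0f} GB -> at least {gpus_needed(bf16_gb)} x 141 GB GPUs")
print(f"FP8:  {fp8_gb:.0f} GB -> at least {gpus_needed(fp8_gb)} x 141 GB GPUs")
```

Treat the GPU counts as floors. A real deployment needs headroom on top of the weights, which is why the FP8 build is usually planned for a full 8-GPU node rather than the 6-GPU minimum.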

Quantization is what brings GLM-5.2 down to a single machine. Here is the footprint ladder:

Precision	Bytes/param (effective)	Approx. weight size	Fits on	Realistic throughput
BF16 (full)	2.0	~1.51 TB	More than one 8x H200 node (1,128 GB VRAM per node)	Production (tens-hundreds tok/s)
FP8	1.0	~750 GB	One 8x H200 node (6x is the floor for weights alone)	Production
4-bit GGUF	~0.5	~400 GB	Multi-GPU or large RAM box	Workstation
2-bit dynamic GGUF (Unsloth)	~0.32	~239 GB	256 GB Mac Studio, or 24 GB GPU + 256 GB RAM	3-9 tok/s

The bottom row is the one that makes local GLM-5.2 real. Unsloth's 2-bit dynamic GGUF compresses the model from ~1.51 TB to about 239 GB, a roughly 6.3x reduction. That fits inside a 256 GB unified-memory Mac Studio with a little room left for context, or a desktop with one 24 GB GPU and 256 GB of system RAM where the MoE offloading scheme keeps attention and the other always-used tensors on the GPU and parks the expert weights in RAM.

A note on the KV cache, because it is the silent VRAM eater. At a 1M-token context the KV cache for a model this size runs into hundreds of gigabytes on its own. On the quantized local path you simply will not have that headroom, so plan your context budget realistically: a few tens of thousands of tokens is comfortable, the full million is not. If your use case genuinely needs the full window, you are back to the multi-GPU path.
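To see why the KV cache grows into the hundreds of gigabytes at a 1M-token window, here is a rough sketch using the textbook formula for an uncompressed cache: keys plus values, per layer, per KV head. The layer count, KV-head count, and head dimension below are hypothetical placeholders, not GLM-5.2's published configuration, and architectures that compress the cache will come in lower.

```python
def kv_cache_gb(tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Uncompressed KV cache size in decimal GB for one sequence."""
    # Factor of 2 covers the separate key and value tensors.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem
    return tokens * bytes_per_token / 1e9

# HYPOTHETICAL architecture values, chosen only to illustrate the scaling.
LAYERS, KV_HEADS, HEAD_DIM = 90, 8, 128

for ctx in (16_384, 32_768, 131_072, 1_000_000):
    size = kv_cache_gb(ctx, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"{ctx:>9,} tokens -> {size:7.1f} GB of KV cache")
```

The point is the linear scaling: whatever the true per-token figure is, a context 30 times longer costs 30 times the cache, and that memory comes out of the same pool as the weights.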

Quantization Options: Unsloth 2-Bit Dynamic GGUF and the Quality Tradeoff

The word "2-bit" sets off alarm bells for anyone who has watched a naive 2-bit quant turn a good model into word salad. Unsloth's dynamic quantization is the reason GLM-5.2 survives the compression. Instead of flattening every layer to 2 bits, it assigns bit-widths per layer based on sensitivity: the layers where precision loss does the most damage (attention projections, certain expert gates, early and late layers) get kept at higher precision, while the bulk of the expert weights, which tolerate aggressive quantization, get squeezed to 2 bits. The "239 GB" figure is the blended result, not a flat 2-bit cast.
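The "blended result" can be made concrete with a little arithmetic. The sketch below first derives the effective bits per weight implied by the 239 GB and 744B figures, then shows how a mix of per-layer bit-widths averages out. The mix itself is a hypothetical illustration, not Unsloth's actual allocation, and it ignores the per-block scale metadata that real GGUF quant types carry.

```python
def effective_bits_per_weight(size_gb: float, params_billion: float) -> float:
    """Average bits stored per parameter for a file of the given size."""
    return size_gb * 8 / params_billion

def blended_bits(mix: list[tuple[float, float]]) -> float:
    """Weighted average bit-width; mix is (fraction_of_params, bits) pairs."""
    total_fraction = sum(fraction for fraction, _ in mix)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("parameter fractions must sum to 1")
    return sum(fraction * bits for fraction, bits in mix)

print(f"239 GB over 744B params: {effective_bits_per_weight(239, 744):.2f} bits/weight")

# HYPOTHETICAL allocation: most expert weights at 2 bits, a sensitive
# minority kept at 6 and 16 bits.
hypothetical_mix = [(0.90, 2.0), (0.08, 6.0), (0.02, 16.0)]
print(f"Hypothetical mix: {blended_bits(hypothetical_mix):.2f} bits/weight")
```

Both numbers land around 2.6 bits per weight, which is why a "2-bit" build is noticeably larger than a flat 2-bit cast of 744B parameters (about 186 GB) would be.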

What do you actually give up? Quality degradation at 2-bit dynamic is real but graceful. Expect:

Slightly more brittle long-horizon reasoning, the kind that compounds over a 30-step agent loop.

Occasional formatting drift on strict structured output, worth pairing with constrained decoding if you need guaranteed JSON.

Reduced margin on the hardest tickets, where the full-precision model's 62.1 SWE-bench Pro edge narrows.

For interactive coding assistance, local prototyping, and async batch work, the 2-bit dynamic GGUF is a legitimate tool. For a production endpoint where every point of accuracy maps to revenue, run FP8 or BF16 on GPUs and treat the 2-bit build as a dev-environment convenience. This is the same calculus we cover in when to use smaller models versus flagship models: match the precision to the cost of being wrong on that specific task.

Download the GGUF and verify the file before you load it:

# Pull the Unsloth 2-bit dynamic GGUF (UD-Q2_K_XL class build)
huggingface-cli download unsloth/GLM-5.2-GGUF \
  --include "*UD-Q2_K_XL*" \
  --local-dir ./glm-5.2-gguf

# Sanity-check the on-disk size lands near ~239 GB
du -sh ./glm-5.2-gguf
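One thing to watch with the du check: du -h reports in powers of 1024, so a 239 GB (decimal) download shows up as roughly 223G. If you would rather check in decimal gigabytes, a small script can sum the files directly. This is a sketch under assumptions: the helper names and the 5% tolerance are arbitrary choices for illustration, and the 239 GB target is the article's figure, not a checksum.

```python
from pathlib import Path

def gguf_total_gb(directory: str) -> float:
    """Sum the sizes of all .gguf files under `directory`, in decimal GB."""
    total_bytes = sum(p.stat().st_size for p in Path(directory).rglob("*.gguf"))
    return total_bytes / 1e9

def size_looks_right(total_gb: float, expected_gb: float = 239.0,
                     tolerance: float = 0.05) -> bool:
    """True if the total is within `tolerance` (as a fraction) of the target."""
    return abs(total_gb - expected_gb) <= expected_gb * tolerance
```

Calling size_looks_right(gguf_total_gb("./glm-5.2-gguf")) after the download gives a quick pass/fail. A size check only catches truncated or missing shards; it does not verify file integrity.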

Serving With vLLM: Setup, Tensor Parallel, and Commands

If you have the GPUs, vLLM is the serving engine for GLM-5.2. It gives you continuous batching, paged attention, and tensor-parallel sharding across multiple cards, which is exactly what a production endpoint needs. For the full engine comparison, our vLLM vs Ollama vs TensorRT breakdown and the Ollama vs vLLM post cover where each one wins. The short version: Ollama for single-user local convenience, vLLM for concurrent production throughput.

For FP8 GLM-5.2 on an 8-GPU H200 node, you shard the model with tensor parallelism set to the GPU count:

# 8x H200 node, FP8 to roughly halve the BF16 footprint
vllm serve zai-org/GLM-5.2 \
  --tensor-parallel-size 8 \
  --quantization fp8 \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.92 \
  --served-model-name glm-5.2 \
  --port 8000

A few practitioner notes. Set --tensor-parallel-size to your physical GPU count; mismatches either waste cards or fail to load. Keep --max-model-len realistic: 131072 is a sane production cap that leaves KV-cache room, and pushing toward the full 1M will force tiny batch sizes and tank your aggregate throughput. --gpu-memory-utilization 0.92 caps the share of each GPU that vLLM claims for weights and KV cache, leaving a margin for everything else on the card so the node does not OOM under load. For multi-node setups beyond a single 8-GPU box, vLLM supports pipeline parallelism on top of tensor parallelism, but start single-node: it is dramatically simpler to operate.
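The tradeoff between context length and batch size can be sketched as a memory budget. The function below is a back-of-the-envelope estimate, not how vLLM schedules requests: it assumes every sequence uses its full context, and the per-sequence KV sizes passed in are hypothetical placeholders that you would replace with figures measured for your own deployment.

```python
def max_concurrent_sequences(total_vram_gb: float, utilization: float,
                             weights_gb: float, kv_gb_per_seq: float) -> int:
    """Worst-case number of full-length sequences that fit beside the weights."""
    kv_budget_gb = total_vram_gb * utilization - weights_gb
    if kv_budget_gb <= 0:
        return 0  # the weights alone exceed the usable memory
    return int(kv_budget_gb // kv_gb_per_seq)

NODE_VRAM_GB = 8 * 141   # 8x H200
FP8_WEIGHTS_GB = 744     # ~1 byte per parameter

# HYPOTHETICAL per-sequence KV cache sizes at two context caps.
for label, kv_gb in (("131072-token cap", 48.0), ("1M-token cap", 370.0)):
    n = max_concurrent_sequences(NODE_VRAM_GB, 0.92, FP8_WEIGHTS_GB, kv_gb)
    print(f"{label}: about {n} full-length sequences")
```

In practice paged attention allocates cache as sequences grow, so real concurrency is usually better than this worst case, but the direction holds: a larger context cap leaves room for fewer simultaneous long requests.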

Hit it with the standard OpenAI-compatible API:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "glm-5.2",
    "messages": [{"role": "user", "content": "Refactor this module for testability."}],
    "max_tokens": 1024
  }'
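The same request from Python, using only the standard library, might look like the sketch below. It assumes the server from the command above is listening on localhost:8000 and that the response follows the usual OpenAI-compatible chat-completions shape; the helper names are made up for illustration.

```python
import json
import urllib.request

def build_chat_request(prompt: str, model: str = "glm-5.2",
                       base_url: str = "http://localhost:8000",
                       max_tokens: int = 1024) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def chat(prompt: str, **kwargs) -> str:
    """Send the prompt and return the first choice's message text."""
    request = build_chat_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=600) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, chat("Refactor this module for testability.") returns the completion text. The generous timeout is deliberate: long generations on a busy node can take minutes.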

This GPU path is what we reach for when teams need a self-hosted GLM-5.2 endpoint behind their own VPC. It is also where our model-serving and infrastructure work tends to live: sizing the node, tuning batch and context limits, and proving out the cost model before anyone commits to hardware.

Running on a Mac Studio or Consumer Box With llama.cpp and MoE Offloading

The consumer path is llama.cpp with the 2-bit GGUF, and the magic ingredient is MoE expert offloading. Because only ~40B of the 744B parameters are active per token, llama.cpp can keep the tensors every token touches (attention, router, shared layers) in fast memory and leave the bulk expert weights in slower memory, where only the few experts the router selects are read for each token.

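On a 256 GB unified-memory Mac Studio, the 239 GB GGUF fits in memory directly and Metal handles the compute. The fit is tight: macOS by default reserves part of unified memory for the system, so you may need to raise the GPU wired-memory limit (the iogpu.wired_limit_mb sysctl on recent macOS versions) before the whole model will load on the GPU: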

# Mac Studio, 256 GB unified memory, full GGUF resident
# (if the download is split into shards, point -m at the first shard)
./llama-server \
  -m ./glm-5.2-gguf/GLM-5.2-UD-Q2_K_XL.gguf \
  --ctx-size 32768 \
  --n-gpu-layers 999 \
  --host 0.0.0.0 --port 8080

On a desktop with a single 24 GB GPU and 256 GB of system RAM, you split the model: attention and the other always-used tensors on the GPU, expert weights in system RAM. The key flag is the regex-based tensor override that pins expert tensors to CPU/RAM while keeping attention on the GPU:

# 24 GB GPU + 256 GB RAM, MoE experts offloaded to system memory
# (if the download is split into shards, point -m at the first shard)
./llama-server \
  -m ./glm-5.2-gguf/GLM-5.2-UD-Q2_K_XL.gguf \
  --ctx-size 16384 \
  --n-gpu-layers 999 \
  --override-tensor "\.ffn_.*_exps\.=CPU" \
  --host 0.0.0.0 --port 8080

Both setups land in the same throughput band: roughly 3-9 tokens per second. Keep --ctx-size modest. Every token of context you allow reserves KV-cache memory you would rather spend on weights, and on the RAM-offload box the expert weights are read from system RAM on every token, so memory bandwidth rather than GPU compute is where a lot of the slowdown lives. Treat this as a single-user, async-friendly deployment: a personal coding assistant, an overnight refactor runner, or a privacy-sensitive workstation, not a multi-tenant service.
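Before committing to an override pattern, it helps to check which tensors it actually catches. The sketch below applies the same expression with Python's re.search, which behaves like llama.cpp's regex search for a pattern this simple. The tensor names are hypothetical examples that follow llama.cpp's usual MoE naming; the names in the real GGUF may differ, so list them from your own file before relying on this. The second pattern is an illustrative variant that sends only the experts in layers 20-99 to CPU.

```python
import re

# The expression passed to --override-tensor, without the "=CPU" suffix.
EXPERTS_TO_CPU = r"\.ffn_.*_exps\."
# HYPOTHETICAL variant: only experts in layers 20-99 go to CPU.
LATE_LAYERS_ONLY = r"blk\.[2-9][0-9]\.ffn_.*_exps\."

def placement(tensor_name: str, pattern: str = EXPERTS_TO_CPU) -> str:
    """Return "CPU" if the override matches, else "default" placement."""
    return "CPU" if re.search(pattern, tensor_name) else "default"

# HYPOTHETICAL tensor names in the common llama.cpp MoE style.
SAMPLE_TENSORS = [
    "blk.3.attn_q.weight",          # attention projection
    "blk.3.ffn_gate_inp.weight",    # router
    "blk.3.ffn_down_shexp.weight",  # shared expert
    "blk.3.ffn_gate_exps.weight",   # routed experts
    "blk.42.ffn_up_exps.weight",    # routed experts, later layer
]

for name in SAMPLE_TENSORS:
    print(f"{name:30s} all-experts: {placement(name):8s} "
          f"late-only: {placement(name, LATE_LAYERS_ONLY)}")
```

"default" here means the tensor is left wherever --n-gpu-layers would put it, which with 999 is the GPU. Keeping a few layers' experts on the GPU this way only helps if they still fit in 24 GB alongside attention and the KV cache.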

Benchmarks vs GPT-5.5 and the Cost Case for Self-Hosting

The benchmark that matters here is SWE-bench Pro, which measures long-horizon, multi-step coding, the workload where context handling and reasoning stamina actually show up. GLM-5.2 scored 62.1 versus GPT-5.5's 58.6, at roughly 1/6 the cost. That is an open-weights model beating the proprietary frontier on a hard coding benchmark while costing a fraction as much to run.

The cost case for self-hosting is a break-even calculation, not a slogan. Renting an API has zero fixed cost and a per-token variable cost. Owning hardware (or reserving cloud GPUs) flips that: high fixed cost, near-zero marginal cost. The break-even is the monthly token volume at which your fixed infrastructure spend, divided by the tokens you serve, drops below the API's per-token price. Two anchors set the curve here. The 2-bit local box is cheap to own but caps at 3-9 tok/s, so its break-even is low-volume and latency-tolerant. The 8x H200 node is expensive to run but delivers production throughput, so it only pays off at high sustained volume. We walk the full model in our self-host vs API break-even math post; the GLM-5.2 numbers plug straight into it, with one caution about the baseline: measured against GPT-5.5's per-token price the crossover arrives early, while measured against a hosted GLM-5.2 endpoint at roughly 1/6 of that price it takes about six times the volume to reach the same point.
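The break-even described above reduces to one division, plus a capacity check that is easy to forget: a box that generates 3-9 tok/s can only produce so many tokens in a month. The sketch below uses entirely hypothetical dollar figures, counts generated tokens only, and assumes a 30-day month; substitute your own amortized hardware cost and the API price you would actually pay.

```python
def break_even_mtok_per_month(fixed_monthly_cost: float,
                              api_price_per_mtok: float,
                              marginal_cost_per_mtok: float = 0.0) -> float:
    """Monthly volume (millions of tokens) where self-hosting matches the API."""
    saving_per_mtok = api_price_per_mtok - marginal_cost_per_mtok
    if saving_per_mtok <= 0:
        return float("inf")  # self-hosting never catches up
    return fixed_monthly_cost / saving_per_mtok

def max_monthly_mtok(tokens_per_second: float, utilization: float = 1.0) -> float:
    """Most tokens (in millions) a box can generate in a 30-day month."""
    seconds_per_month = 30 * 24 * 3600
    return tokens_per_second * seconds_per_month * utilization / 1e6

# HYPOTHETICAL: $300/month amortized hardware and power, $30 per million tokens.
needed = break_even_mtok_per_month(fixed_monthly_cost=300, api_price_per_mtok=30)
for tps in (3, 9):
    capacity = max_monthly_mtok(tps)
    verdict = "reachable" if capacity >= needed else "out of reach"
    print(f"{tps} tok/s: capacity {capacity:.1f} Mtok/month, "
          f"break-even {needed:.1f} Mtok -> {verdict}")
```

With these placeholder numbers the break-even is reachable at the top of the throughput band and not at the bottom, even running around the clock. That is the practical meaning of a low-volume break-even: the local box has to be both cheap and busy.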

The practical read: if your GLM-5.2 usage is a developer or two running coding agents, the 256 GB Mac Studio is a one-time cost that pays itself off fast against per-token API billing, and you get full data sovereignty as a bonus. If you are serving a product feature to thousands of users, you need the GPU node and the throughput math changes, but the 1/6 cost-per-token advantage still makes the owned-hardware crossover arrive sooner than it would for a more expensive model.

Model	SWE-bench Pro	License	Relative cost	Hosting
GLM-5.2	62.1	MIT	~1/6 of GPT-5.5	Self-host or API
GPT-5.5	58.6	Proprietary	1x (reference)	API only

Throughput Expectations and Tuning the Deployment

Set expectations honestly before you deploy: the 2-bit local build runs at roughly 3-9 tokens per second on consumer hardware. That number is not a bug to fix, it is the physics of streaming a 744B MoE off unified memory or RAM. Where you land in that 3-to-9 band, and how to nudge it upward, comes down to a handful of levers:

Context length. Shorter prompts process faster and leave more memory resident for weights. Trim aggressively; do not load 200K tokens of context for a task that needs 8K.

Prompt-processing vs generation. Prompt ingestion (prefill) and token generation (decode) have different bottlenecks. A long prompt with a short answer is prefill-bound; a short prompt with a long answer is decode-bound. Batch your prefill where the tooling allows.

Expert residency. On the RAM-offload box, the more of the hot expert path you can keep on the GPU, the less work falls on the slower system-RAM path. Tune the --override-tensor regex so only the genuinely cold experts go to CPU.

Quantization tier. If 2-bit quality is marginal for your task and you have the memory, step up to a 4-bit GGUF (~400 GB). It is harder to fit and slower per token, but noticeably sharper on hard reasoning.

Batch size on GPUs. On the vLLM path, throughput per stream and aggregate throughput trade off. For a production endpoint, raise concurrency and accept slightly higher per-request latency; for a single interactive user, do the opposite.
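For the decode-bound case, a simple ceiling explains most of the throughput band: each generated token has to read the active weights once, so tokens per second cannot exceed memory bandwidth divided by the bytes read per token. The sketch below is an upper bound only. The 100 GB/s bandwidth is a hypothetical figure for a desktop RAM path, and the ~0.32 bytes per parameter is simply 239 GB divided by 744B parameters; real throughput lands below the bound once compute, KV-cache reads, and router overhead are included.

```python
def decode_tok_per_s_upper_bound(active_params_billion: float,
                                 bytes_per_param: float,
                                 bandwidth_gb_s: float) -> float:
    """Bandwidth-limited ceiling on single-stream decode speed."""
    gb_read_per_token = active_params_billion * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token

# ~40B active parameters at ~0.32 bytes each, over a HYPOTHETICAL 100 GB/s path.
bound = decode_tok_per_s_upper_bound(40, 0.32, 100)
print(f"Upper bound: {bound:.1f} tok/s")
```

Plug in the measured bandwidth of your own memory path. If your observed tokens per second is far below the bound, the bottleneck is somewhere other than weight reads, and the levers above are where to look.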

The honest framing for picking a path: the local 2-bit build is for sovereignty, privacy, and low-to-moderate volume where 3-9 tok/s is acceptable. The GPU FP8 build is for production throughput where you have committed to the hardware. GLM-5.2's MIT license and benchmark lead make both paths defensible. What you should not do is expect a 256 GB Mac Studio to serve a high-traffic product, or rent 8x H200 to support one developer's coding agent. Match the deployment to the volume, and the 1/6-cost advantage does the rest of the work.

If you are weighing self-hosting GLM-5.2 against staying on an API, the deciding inputs are your monthly token volume, your latency tolerance, and whether data sovereignty is a hard requirement or a nice-to-have. Run those through the break-even math, and if the answer is "self-host but we are not sure how to size it," that sizing-and-serving problem is exactly the kind of work our model-serving practice exists to de-risk.

Frequently Asked Questions

Quick answers to common questions about this topic

The hardware you need depends entirely on precision. Production serving of GLM-5.2 targets an 8x H200 node, about 1,128 GB of aggregate VRAM, which is a datacenter or cloud-rental setup, not a desk; that node holds the FP8 weights (~750 GB), while the full BF16 weights (~1.51 TB) exceed it. For local use you quantize. Unsloth's 2-bit dynamic GGUF drops the footprint to about 239 GB, which fits a single 256 GB unified-memory Mac Studio or a box with one 24 GB GPU plus 256 GB of system RAM using Mixture-of-Experts offloading. Expect roughly 3-9 tokens per second on that consumer hardware. If you want production throughput rather than a local workstation, plan for a multi-GPU H200 node and FP8.
