GLM-5.2: Strong at Coding, Design, and Tool Calls — But Slow Enough to Pick Your Spots

An evidence-first read on Z.ai's GLM-5.2: the #1 open-weights coding/design/agentic model, but token-hungry and slow end-to-end. When to use it — and when not.

Spend an afternoon driving GLM-5.2 through real coding work and two things land fast: it is genuinely good — at code, at front-end/design generation, at multi-step tool calling — and it is slow. Both are true, and the interesting part is why it's slow, because the obvious explanation is wrong. Here's the evidence-first read: what GLM-5.2 actually is, where it's strong (with independent numbers, not just the vendor's), exactly how the slowness works, and when to reach for it versus when to skip it.

The pitch, and the catch

GLM-5.2 is, as of mid-June 2026, the strongest open-weights coding/design/agentic model you can run — MIT-licensed, 1M-token context, and roughly one-sixth the per-token cost of closed frontier models. The catch: it "buys" that quality by thinking out loud, and on large contexts it's slow to first token. Great value; wrong tool for latency-sensitive work. Both halves matter.

What it actually is

GLM-5.2 comes from Z.ai (the international brand of Zhipu AI, a 2019 Tsinghua spinout). It's a Mixture-of-Experts model with ~744B total / 40B active parameters (the HF model card lists 753B), a 1M-token context (up ~5× from GLM-5.1), 128K max output, and two reasoning "effort" levels. Weights are open on Hugging Face and ModelScope. It landed across the GLM Coding Plan tiers on June 13 without published benchmarks; the API and weights followed days later.
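
To put those parameter counts in perspective, here is a rough sketch of the raw weight footprint at a few common precisions. The parameter counts are the approximate figures quoted above; the bytes-per-parameter table is an assumption about which storage formats you might use, and KV cache, activations, and runtime overhead are ignored entirely.

```python
# Back-of-the-envelope weight footprint for a Mixture-of-Experts model.
# Parameter counts are the article's approximate figures. The precision
# table is an assumption; KV cache and activations are not counted.

TOTAL_PARAMS = 744e9   # ~744B total parameters
ACTIVE_PARAMS = 40e9   # ~40B parameters active per token

BYTES_PER_PARAM = {"bf16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_footprint_gb(total_params: float, bytes_per_param: float) -> float:
    """Raw weight size in gigabytes (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of the parameters that participate in each token."""
    return active_params / total_params


if __name__ == "__main__":
    for fmt, width in BYTES_PER_PARAM.items():
        print(f"{fmt}: ~{weight_footprint_gb(TOTAL_PARAMS, width):,.0f} GB")
    share = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
    print(f"active share per token: {share:.1%}")
```

All of the weights have to be resident even though only a small share is active per token, which is why memory, not compute, tends to decide whether a model like this is practical to host yourself.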

One thing to watch: the HF card says MIT, while the GLM-5 GitHub repo footer shows Apache-2.0. Minor, but if licensing matters to you, confirm before you ship.

The strengths — with numbers, and who measured them

Lead with the independent evidence, because the vendor's slides always look good:

  • Artificial Analysis Intelligence Index v4.1: 51 — the #1 open-weights model, #4 overall, behind only Claude Fable 5 (60), Opus 4.8 (56), and GPT-5.5 (55) (AA). That's an 11-point jump from GLM-5.1.
  • Design and front-end: it ranks #1 on Design Arena and #2 on Code Arena (WebDev) on LMArena, +29 Elo over Claude Opus 4.7 Thinking — despite being text-only with no image input, which is genuinely surprising. Simon Willison calls it "probably the most powerful text-only open-weights LLM."

Now the vendor's self-reported numbers, which are useful but best kept at arm's length: Terminal-Bench 2.1: 81.0; SWE-bench Pro: 62.1; FrontierSWE: 74.4; MCP-Atlas: 77.0 (per Z.ai). DataCamp confirms these are self-reported and notes what's missing: no SWE-bench Verified, no LiveCodeBench, no BFCL or tau-bench for 5.2. So the agentic/tool-calling strength is real in hands-on use and arenas, but the cleanest public coding benchmark hasn't landed yet.

The slow part, quantified — and why the obvious reason is wrong

Here's the counterintuitive bit. It is not slow because it's a weak model, and not because its raw throughput is low. Artificial Analysis measures GLM-5.2's output at ~98–102 tokens/sec, roughly 61–67% above the ~61 t/s median for its size class, ranking #19 of 92 for speed. Aggregate time-to-first-token is a reasonable ~2.1s.

The slowness is wall-clock, and it comes from two places:

  1. Verbosity. It burns ~43k output tokens per task (~37k of them reasoning) versus ~26k for GLM-5.1 (AA). An independent OpenCode test put it plainly: "slower because it thinks more." Wall-clock ≈ tokens × per-token latency, so at ~100 t/s (10 ms per token) those ~43k tokens take about seven minutes to stream. The token count is the problem, not the per-token speed.
  2. Long-context TTFT. On big contexts, first-token latency runs 15–26 seconds depending on provider (AA providers) — long enough to trip default kill thresholds in agent harnesses. Provider speed also varies ~2.7× (≈164 t/s on the fastest FP8 host down to ~60 t/s on the slowest), so which endpoint you pick matters a lot.

So your firsthand impression that "it's too slow" is right, but the fix isn't "wait for a smarter model"; it's "manage the token budget and pick a fast provider."
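
The two factors above can be combined into a rough wall-clock estimate: time to first token plus the time to stream every output token. The sketch below is illustrative only. The throughput figures are the ones quoted above, the endpoint names are hypothetical labels rather than real providers, and giving every endpoint the same 2.1 s TTFT is a simplifying assumption (long-context calls would add 15–26 s instead).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str              # hypothetical label, not a real provider name
    tokens_per_sec: float  # sustained output speed
    ttft_sec: float        # time to first token


def wall_clock_sec(output_tokens: int, endpoint: Endpoint) -> float:
    """Time to first token plus time to stream every output token."""
    return endpoint.ttft_sec + output_tokens / endpoint.tokens_per_sec


def fastest(output_tokens: int, endpoints: list[Endpoint]) -> Endpoint:
    """Pick the endpoint with the lowest estimated wall-clock time."""
    return min(endpoints, key=lambda e: wall_clock_sec(output_tokens, e))


# Speeds come from the article; the shared 2.1 s TTFT is a simplification.
ENDPOINTS = [
    Endpoint("fast-fp8", tokens_per_sec=164.0, ttft_sec=2.1),
    Endpoint("typical", tokens_per_sec=100.0, ttft_sec=2.1),
    Endpoint("slowest", tokens_per_sec=60.0, ttft_sec=2.1),
]

if __name__ == "__main__":
    for ep in ENDPOINTS:
        minutes = wall_clock_sec(43_000, ep) / 60
        print(f"{ep.name}: ~{minutes:.1f} min for ~43k output tokens")
    print("pick:", fastest(43_000, ENDPOINTS).name)
```

Under these assumptions a ~43k-token task takes roughly 4, 7, and 12 minutes on the three endpoints. Generation time dominates the estimate, so trimming the token budget or switching provider moves the total far more than shaving a second or two off TTFT.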

Cost: cheap sticker, verbose bill

List pricing is about $1.40/M input and $4.40/M output (cached input ~$0.26/M), roughly 1/6 of GPT-5.5; OpenRouter lists it at ~$1.20/$3.20 (OpenRouter). Per completed task, though, AA measured ~$0.46, because the reasoning-token volume inflates the output bill. The sticker discount is real; your effective cost depends on how many tokens it burns to finish. Budget on output tokens, not the headline rate.
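
A small estimator makes the "budget on output tokens" advice concrete. This is a sketch under stated assumptions: the rates are the list prices quoted above, the function and its parameter names are illustrative, and the example isolates the output side only. The input-token mix behind AA's per-task figure is not given here, so the sketch does not try to reproduce it.

```python
# Rates are the article's list prices in USD per million tokens.
INPUT_RATE = 1.40
CACHED_INPUT_RATE = 0.26
OUTPUT_RATE = 4.40


def task_cost_usd(
    input_tokens: int,
    output_tokens: int,
    cached_input_tokens: int = 0,
) -> float:
    """Cost of one task: fresh input + cached input + output."""
    return (
        input_tokens * INPUT_RATE
        + cached_input_tokens * CACHED_INPUT_RATE
        + output_tokens * OUTPUT_RATE
    ) / 1_000_000


if __name__ == "__main__":
    # Output-only bill, to isolate the effect of verbosity.
    print(f"~43k output tokens: ${task_cost_usd(0, 43_000):.2f}")
    # GLM-5.1's token count priced at the same rate, for comparison only.
    print(f"~26k output tokens: ${task_cost_usd(0, 26_000):.2f}")
```

At the same per-token rate, the jump from ~26k to ~43k output tokens raises the output bill by about two-thirds. The remainder of any real per-task cost presumably comes from input tokens, which you can plug in once you know your own prompt sizes and cache hit rate.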

Running it without getting burned

  • In Claude Code / agent harnesses: raise API_TIMEOUT_MS — 1M-context first-token latency can exceed the default kill threshold. Use the OpenAI-compatible coding endpoint (/api/coding/paas/v4) to avoid dropped nested tool-result blocks on long loops.
  • When latency matters: drop to the lower "High" effort level and route to a fast FP8 provider.
  • Self-hosting: the MIT weights are real, but it's a ~744B MoE (~1.5TB). Local runs are brutal — single-digit tokens/sec on a quantized workstation — so realistically it's the API or ~8×H100. (The hosted Z.ai API also carries a China data-residency consideration.)
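
For the timeout advice above, one way to pick a value is to budget for the worst case you expect and add headroom. The sketch below is an assumption-laden illustration, not a recommended setting: the 1.5× safety factor is arbitrary, and whether API_TIMEOUT_MS bounds the whole request or only the wait for the first token depends on your harness, so both cases are shown.

```python
import math


def timeout_ms(
    ttft_sec: float,
    output_tokens: int = 0,
    tokens_per_sec: float = 1.0,
    safety: float = 1.5,  # arbitrary headroom factor (assumption)
) -> int:
    """Timeout covering TTFT plus streaming time, with headroom."""
    total_sec = ttft_sec + output_tokens / tokens_per_sec
    return math.ceil(total_sec * safety * 1000)


def timeout_env(value_ms: int) -> dict[str, str]:
    """Environment overlay to pass to the harness process."""
    return {"API_TIMEOUT_MS": str(value_ms)}


if __name__ == "__main__":
    # Case 1: the timeout only has to survive the wait for the first token.
    first_token_only = timeout_ms(ttft_sec=26.0)
    # Case 2: the timeout bounds the whole request on the slowest provider.
    whole_request = timeout_ms(
        ttft_sec=26.0, output_tokens=43_000, tokens_per_sec=60.0
    )
    print(timeout_env(first_token_only))
    print(timeout_env(whole_request))
```

The inputs are the worst-case figures quoted earlier (26 s long-context TTFT, ~43k output tokens, ~60 t/s on the slowest provider). Replace them with measurements from the endpoint you actually use.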

People also ask

Is GLM-5.2 actually slow? Not in raw tokens/sec (~98–102, above median). It's slow end-to-end because it's verbose (~43k tokens/task) and slow to first token on big contexts (15–26s).

Is it open source and free? Open weights under MIT (commercial use OK) on Hugging Face/ModelScope — free if you have the hardware for a 744B MoE. Note the MIT-vs-Apache repo discrepancy.

How does it compare to GPT-5.5 and Opus 4.8? On Z.ai's self-reported coding evals it edges GPT-5.5 on several long-horizon tasks but trails Opus 4.8; independently, AA ranks it #1 open-weights and #4 overall.

Can I use it in Claude Code? Yes, via env vars — but raise API_TIMEOUT_MS or long-context calls will time out.

Builder takeaway

GLM-5.2 is the best open-weights coding, design, and tool-calling model you can run today, and the price is hard to argue with. But it earns its quality by thinking out loud, so treat it as a one-shot, batch, or background agent — not a low-latency interactive copilot. Reach for it on front-end/design generation, long-horizon refactors, and high-volume pipelines where depth and cost beat responsiveness; skip it for autocomplete or chat where wall-clock latency rules. Budget on output tokens, drop to "High" effort and a fast FP8 provider when you need speed, raise your timeouts — and keep the vendor benchmarks at arm's length until an independent SWE-bench Verified number lands.