MODEL FAMILY · GLM
FAMILY LINEUP
GLM-5 AND BEYOND
GLM-5.2
FLAGSHIP · CURRENT
Z.ai's latest flagship multimodal model — strong bilingual (Chinese-English) reasoning, long-context understanding (up to 1M tokens), vision inputs, advanced tool use, code generation, and agent-oriented behavior.
PARAMS
CONTEXT
QUANT
MODALITY
TOOL CALLING
REASONING
CODE GEN
VISION
RESPONSES API
LICENSE: MIT
GLM-5.1
PREVIOUS FLAGSHIP
Previous-generation flagship in the GLM-5 line — strong bilingual reasoning, long-horizon agentic tasks (sustains thousands of tool calls), and SWE-Bench Pro state-of-the-art code generation.
PARAMS
CONTEXT
QUANT
MODALITY
TOOL CALLING
REASONING
CODE GEN
LICENSE: MIT
GLM-4.7-Flash
FAST / LIGHTWEIGHT MOE
30B-parameter MoE with 3B active per token — preserved thinking mode for multi-turn agentic tasks, with speculative decoding and multi-token prediction for low-latency, high-throughput inference.
PARAMS
CONTEXT
QUANT
MODALITY
TOOL CALLING
REASONING
LOW LATENCY
LICENSE: MIT
PARAMS
CONTEXT
QUANT
MODALITY
VISION
OCR
DOC PARSING
TABLES
LICENSE: MIT
CAPABILITIES
WHAT GLM DOES WELL
BILINGUAL REASONING
Strong Chinese-English reasoning across long-form text, code, and structured tasks. Competitive with Western flagships on English benchmarks and class-leading on Chinese.
1M LONG CONTEXT
Up to 1M tokens on GLM-5.2 (with sparse-attention IndexShare reducing per-token FLOPs ~2.9× at full context) — absorb full legal filings, codebases, or research corpora without chunking. Persistent agent state for multi-turn loops.
NATIVE TOOL USE
First-class function calling, structured outputs, and Responses API support. Agent-oriented architecture handles multi-step plans and tool composition.
EXPLICIT REASONING
Reasoning mode surfaces chain-of-thought scratchpads for complex math, code, and analytical tasks. Tunable depth at the API boundary.
MULTIMODAL INPUTS
GLM-5.2 accepts image and text inputs natively. Pair with the specialized GLM-OCR (0.9B, CogViT + GLM-0.5B; #1 on OmniDocBench V1.5) for high-volume document extraction pipelines.
FP8 EFFICIENCY
Native FP8 quantization keeps cost-per-token competitive and inference fast on H100-class hardware across the open-weight provider network.
WHERE GLM FITS
REPRESENTATIVE WORKLOADS
BILINGUAL ENTREPRISE ASSISTANTS
Customer-facing or internal assistants serving Chinese-English markets with consistent reasoning quality across both languages.
LONG DOCUMENT ANALYSIS
Legal filings, financial disclosures, technical specifications — 432K context absorbs full documents without retrieval chunking.
AGENTIC WORKFLOWS
Tool-calling backbone for multi-step agents — research loops, code generation pipelines, structured action sequences.
RAG WITH REDUCED RETRIEVAL
Long-context tolerance lets you pack more context per query and reduce the brittleness of retrieval recall.
MULTIMODAL DOCUMENT PIPELINES
GLM-OCR for visual extraction → GLM-5.2 for reasoning over extracted content. End-to-end open-weight document understanding.
OPEN-WEIGHT PRODUCTION INFERENCE
MIT-licensed alternative to closed flagships, with deployable weights for sovereign and on-prem buyers.
PROCUREMENT
HOW TO ACCESS GLM
TOKEN FORWARDS
COMMITTED INFERENCE
Lock GLM inference capacity in advance, denominated in Standardized Token Units. Provider operates the model; you tap tokens against a committed balance over terms up to six months.
Provider operates and scales GLM endpoint
Per-STU rate locked at commitment
Realtime, batch, or mixed latency
Quotes against the published STU index
RESERVED GPU RENTAL
RUN YOUR OWN
Reserve H100-class capacity from the neocloud network and deploy GLM weights yourself. Full operational control — sovereign data, custom serving stack, fine-tuned variants.
MIT-licensed weights — deploy anywhere
Custom serving stack (vLLM, SGLang, TensorRT-LLM)
Sovereign / on-prem / air-gapped deployments
Terms from 1 month to 24+ months
Frequently Asked Questions