MODEL FAMILY · GLM

FLAGSHIP OPEN-WEIGHT
BILINGUAL REASONING.

FLAGSHIP OPEN-WEIGHT
BILINGUAL REASONING.

GLM is Z.ai's open-weight family of large language models, led by GLM-5.2 a 753B-parameter multimodal flagship with 1M-token context, FP8 inference, native tool calling, code generation, explicit reasoning, and vision inputs. Procure GLM inference through Compute Exchange Token Forwards or Reserved GPU Rental.
GLM is Z.ai's open-weight family of large language models, led by GLM-5.2 a 753B-parameter multimodal flagship with 1M-token context, FP8 inference, native tool calling, code generation, explicit reasoning, and vision inputs. Procure GLM inference through Compute Exchange Token Forwards or Reserved GPU Rental.

FAMILY LINEUP

GLM-5 AND BEYOND

The GLM-5 line is the current flagship; GLM-4.7-Flash covers latency-sensitive workloads; GLM-OCR specializes in document and vision extraction. All ship under MIT license as open-weight releases from Z.ai.

The GLM-5 line is the current flagship; GLM-4.7-Flash covers latency-sensitive workloads; GLM-OCR specializes in document and vision extraction. All ship under MIT license as open-weight releases from Z.ai.

GLM-5.2

FLAGSHIP · CURRENT

Z.ai's latest flagship multimodal model — strong bilingual (Chinese-English) reasoning, long-context understanding (up to 1M tokens), vision inputs, advanced tool use, code generation, and agent-oriented behavior.

PARAMS

753B
753B

CONTEXT

1M
1M

QUANT

FP8
FP8

MODALITY

MULTIMODAL
MULTIMODAL

TOOL CALLING

REASONING

CODE GEN

VISION

RESPONSES API

LICENSE: MIT

GLM-5.1

PREVIOUS FLAGSHIP

Previous-generation flagship in the GLM-5 line — strong bilingual reasoning, long-horizon agentic tasks (sustains thousands of tool calls), and SWE-Bench Pro state-of-the-art code generation.

PARAMS

754B
754B

CONTEXT

200K
200K

QUANT

FP8
FP8

MODALITY

TEXT-TO-TEXT
TEXT-TO-TEXT

TOOL CALLING

REASONING

CODE GEN

LICENSE: MIT

GLM-4.7-Flash

FAST / LIGHTWEIGHT MOE

30B-parameter MoE with 3B active per token — preserved thinking mode for multi-turn agentic tasks, with speculative decoding and multi-token prediction for low-latency, high-throughput inference.

PARAMS

30B - A3B
30B - A3B

CONTEXT

200K
200K

QUANT

BF16
BF16

MODALITY

TEXT-TO-TEXT
TEXT-TO-TEXT

TOOL CALLING

REASONING

LOW LATENCY

LICENSE: MIT

GLM-OCR

VISION / OCR SPECIALIST

CogViT visual encoder + GLM-0.5B language decoder for OCR, document parsing, formula and table recognition. #1 on OmniDocBench V1.5 (94.62). *PP-DocLayoutV3 sub-component under Apache 2.0.

PARAMS

0.9B
0.9B

CONTEXT

128K
128K

QUANT

-
-

MODALITY

IMAGE-TEXT-TO-TEXT
IMAGE-TEXT-TO-TEXT

VISION

OCR

DOC PARSING

TABLES

LICENSE: MIT

CAPABILITIES

WHAT GLM DOES WELL

BILINGUAL REASONING

Strong Chinese-English reasoning across long-form text, code, and structured tasks. Competitive with Western flagships on English benchmarks and class-leading on Chinese.

1M LONG CONTEXT

Up to 1M tokens on GLM-5.2 (with sparse-attention IndexShare reducing per-token FLOPs ~2.9× at full context) — absorb full legal filings, codebases, or research corpora without chunking. Persistent agent state for multi-turn loops.

NATIVE TOOL USE

First-class function calling, structured outputs, and Responses API support. Agent-oriented architecture handles multi-step plans and tool composition.

EXPLICIT REASONING

Reasoning mode surfaces chain-of-thought scratchpads for complex math, code, and analytical tasks. Tunable depth at the API boundary.

MULTIMODAL INPUTS

GLM-5.2 accepts image and text inputs natively. Pair with the specialized GLM-OCR (0.9B, CogViT + GLM-0.5B; #1 on OmniDocBench V1.5) for high-volume document extraction pipelines.

FP8 EFFICIENCY

Native FP8 quantization keeps cost-per-token competitive and inference fast on H100-class hardware across the open-weight provider network.

WHERE GLM FITS

REPRESENTATIVE WORKLOADS

BILINGUAL ENTREPRISE ASSISTANTS

Customer-facing or internal assistants serving Chinese-English markets with consistent reasoning quality across both languages.

LONG DOCUMENT ANALYSIS

Legal filings, financial disclosures, technical specifications — 432K context absorbs full documents without retrieval chunking.

AGENTIC WORKFLOWS

Tool-calling backbone for multi-step agents — research loops, code generation pipelines, structured action sequences.

RAG WITH REDUCED RETRIEVAL

Long-context tolerance lets you pack more context per query and reduce the brittleness of retrieval recall.

MULTIMODAL DOCUMENT PIPELINES

GLM-OCR for visual extraction → GLM-5.2 for reasoning over extracted content. End-to-end open-weight document understanding.

OPEN-WEIGHT PRODUCTION INFERENCE

MIT-licensed alternative to closed flagships, with deployable weights for sovereign and on-prem buyers.

PROCUREMENT

HOW TO ACCESS GLM

Two procurement paths through Compute Exchange. Choose by who you want operating the model.

Two procurement paths through Compute Exchange. Choose by who you want operating the model.

TOKEN FORWARDS

COMMITTED INFERENCE

Lock GLM inference capacity in advance, denominated in Standardized Token Units. Provider operates the model; you tap tokens against a committed balance over terms up to six months.

  • Provider operates and scales GLM endpoint

  • Per-STU rate locked at commitment

  • Realtime, batch, or mixed latency

  • Quotes against the published STU index

RESERVED GPU RENTAL

RUN YOUR OWN

Reserve H100-class capacity from the neocloud network and deploy GLM weights yourself. Full operational control — sovereign data, custom serving stack, fine-tuned variants.

  • MIT-licensed weights — deploy anywhere

  • Custom serving stack (vLLM, SGLang, TensorRT-LLM)

  • Sovereign / on-prem / air-gapped deployments

  • Terms from 1 month to 24+ months

Frequently Asked Questions

GLM, EXPLAINED

Who builds GLM?

What is the difference between GLM-5.2 and GLM-5.1?

Is GLM open-weight?

How does GLM compare to Western flagship models?

How do I procure GLM inference through Compute Exchange?

How does GLM map to the STU methodology?

PROCURE GLM CAPACITY.

PROCURE GLM CAPACITY.

Submit a commitment request and Compute Exchange returns GLM inference quotes across the verified open-model provider network.
DISCLAIMER
DISCLAIMER

GLM is an open-weight model family developed by Z.ai (Zhipu AI), released under MIT license. Compute Exchange facilitates quotes from verified third-party inference providers serving the GLM family and does not operate inference infrastructure or guarantee model availability, performance, or SLA. All commitment terms — including pricing, tap, rollover, and settlement — are negotiated directly between buyer and provider.

GLM is an open-weight model family developed by Z.ai (Zhipu AI), released under MIT license. Compute Exchange facilitates quotes from verified third-party inference providers serving the GLM family and does not operate inference infrastructure or guarantee model availability, performance, or SLA. All commitment terms — including pricing, tap, rollover, and settlement — are negotiated directly between buyer and provider.

COMPUTE

EXCHANGE

The transparent GPU marketplace for AI infrastructure. Built for builders.

ALL SYSTEMS OPERATIONAL

© 2026 COMPUTE EXCHANGE

BUILT FOR THE AI ERA

COMPUTE

EXCHANGE

The transparent GPU marketplace for AI infrastructure. Built for builders.

ALL SYSTEMS OPERATIONAL

© 2026 COMPUTE EXCHANGE

BUILT FOR THE AI ERA

COMPUTE

EXCHANGE

The transparent GPU marketplace for AI infrastructure. Built for builders.

ALL SYSTEMS OPERATIONAL

© 2026 COMPUTE EXCHANGE

BUILT FOR THE AI ERA