Setup
To get started, sign in to Baseten with Truss and then install the OpenAI SDK.
Sign in to Baseten
uvx truss login --browser
Install the OpenAI SDK
uv pip install openai
- Standard
- Flash
zai-org/GLM-4.7 is a MoE model with up to 198K context. This preset serves GLM-4.7 from an FP4 checkpoint on B200:4, delivering frontier-class reasoning at single-node cost.

| Hardware | Engine | Context | Concurrency |
|---|---|---|---|
| B200 × 4 | TRT-LLM v2 | 198K | 64 |

Write the config
Create and move into the project directory:
mkdir glm-4.7-latency && cd glm-4.7-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:glm-4.7 preset:latency"
resources:
  accelerator: B200:4
  cpu: "1"
  memory: 10Gi
  use_gpu: true
model_metadata:
  tags:
    - openai-compatible
  example_model_input:
    model: glm47
    messages:
      - role: user
        content: "Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order. class Solution: def twoSum(self, nums: List[int], target: int) -> List[int]:"
    stream: true
    max_tokens: 2048
    temperature: 0.5
secrets:
  hf_access_token: null
weights:
  - source: "hf://baseten-admin/glm-4.7-fp4@main"
    mount_location: "/app/model_cache/glm47"
    auth_secret_name: "hf_access_token"
    allow_patterns:
      - "*.safetensors"
      - "*.json"
      - "*.model"
      - "*.jinja"
      - "*.py"
    ignore_patterns:
      - "original/*"
      - "*.pth"
      - "*.bin"
runtime:
  predict_concurrency: 64
trt_llm:
  build:
    checkpoint_repository:
      # repo: baseten-admin/glm-4.7-fp4
      repo: michaelfeil/empty-model
      revision: main
      source: HF
  inference_stack: v2
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 64
    max_num_tokens: 8192
    max_seq_len: 202752
    tensor_parallel_size: 4
    served_model_name: glm47
    patch_kwargs:
      disable_overlap_scheduler: True
      moe_expert_parallel_size: 4
      moe_config:
        use_low_precision_moe_combine: true
        backend: TRTLLM
      kv_cache_config:
        free_gpu_memory_fraction: 0.8
        enable_block_reuse: true
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 64
      speculative_config:
        decoding_type: MTP
        num_nextn_predict_layers: 3
      autotuner_enabled: false
      model_path: /app/model_cache/glm47
      reasoning_parser: glm47
      tool_call_parser: glm47
Key parameters
Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

| Parameter | Value |
|---|---|
| Tensor parallel size | 4 |
| Max sequence length | 202752 |
| Max batch size | 64 |
| Max batched tokens | 8192 |
| Chunked prefill | enabled |
| Inference stack | v2 |
| Served model name | glm47 |
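
To make the table's numbers concrete, the short sketch below checks that the maximum sequence length matches the 198K context quoted for this preset and estimates how many prefill chunks a full-length prompt would need. It assumes, as a simplification, that chunked prefill processes at most max_num_tokens prompt tokens per engine step. The real scheduler also shares that budget with decode tokens from other requests, so treat the result as a lower bound rather than a measurement. The function names are hypothetical.

```python
import math

# Values copied from the trt_llm block of config.yaml.
MAX_SEQ_LEN = 202752
MAX_NUM_TOKENS = 8192


def context_in_k(max_seq_len: int) -> float:
    """Express a token count in units of 1024 tokens ("K")."""
    return max_seq_len / 1024


def min_prefill_chunks(prompt_tokens: int, max_num_tokens: int = MAX_NUM_TOKENS) -> int:
    """Lower bound on prefill steps, assuming one chunk of at most
    max_num_tokens prompt tokens per step (a simplification)."""
    return math.ceil(prompt_tokens / max_num_tokens)


print(context_in_k(MAX_SEQ_LEN))        # 198.0
print(min_prefill_chunks(MAX_SEQ_LEN))  # 25
```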
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model glm-4.7-latency was successfully pushed ✨
Model ID: abc1d2ef
Deployment ID: xyz123
Endpoint: model-abc1d2ef.api.baseten.co
Logs: https://app.baseten.co/models/abc1d2ef/logs/xyz123
Your model ID is printed in the truss push output (abc1d2ef in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
Python (main.py)
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="glm47",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
cURL
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "glm47",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
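
If you script deployments, you can avoid hand-editing the {model_id} placeholder. The sketch below pulls the model ID out of truss push output and builds the base URL used in main.py. The helper names (parse_model_id, baseten_base_url, client_settings) are hypothetical, and the regular expression assumes the "Model ID:" line keeps the format of the sample output, which may differ between Truss versions.

```python
import os
import re


def parse_model_id(push_output: str) -> str:
    """Return the value on the 'Model ID:' line of truss push output."""
    match = re.search(r"^\s*Model ID:\s*(\S+)\s*$", push_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Model ID:' line found in truss push output")
    return match.group(1)


def baseten_base_url(model_id: str) -> str:
    """Build the OpenAI-compatible base URL for the production environment."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


def client_settings(model_id: str) -> dict:
    """Keyword arguments for OpenAI(...); fails early if the API key is unset."""
    api_key = os.environ.get("BASETEN_API_KEY")
    if not api_key:
        raise RuntimeError("BASETEN_API_KEY is not set")
    return {"api_key": api_key, "base_url": baseten_base_url(model_id)}


# Sample lines in the same shape as the push output.
sample_output = """\
Model ID: abc1d2ef
Deployment ID: xyz123
"""
print(baseten_base_url(parse_model_id(sample_output)))
# https://model-abc1d2ef.api.baseten.co/environments/production/sync/v1
```

With the openai package installed, OpenAI(**client_settings(model_id)) should give you a client equivalent to the one in main.py.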
zai-org/GLM-4.7-Flash is a MoE model with up to 128K context. This preset serves GLM-4.7 Flash on H100:2 with the glm47 tool-call parser enabled, tuned for latency-sensitive agent workflows.

| Hardware | Engine | Context | Concurrency |
|---|---|---|---|
| H100 × 2 | vLLM (0.22.0-cu129 build) | 128K | 32 |

Write the config
Create and move into the project directory:
mkdir glm-4.7-flash-latency && cd glm-4.7-flash-latency
Then create a file named config.yaml and paste the following:
config.yaml
model_name: "model:glm-4.7-flash preset:latency"
model_metadata:
  description: >-
    Zhipu GLM-4.7 Flash via vLLM (H100 × 2 TP), fast GLM tool calling and auto tool choice from BDN-mounted weights.
  repo_id: zai-org/GLM-4.7-Flash
  example_model_input:
    model: zai-org/GLM-4.7-Flash
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What is the meaning of life?"
    stream: true
    max_tokens: 32768
    temperature: 0.7
  tags:
    - openai-compatible
base_image:
  image: vllm/vllm-openai:v0.22.0-cu129
weights:
  - source: "hf://zai-org/GLM-4.7-Flash@main"
    mount_location: "/app/checkpoint/model"
    auth_secret_name: "hf_access_token"
secrets:
  hf_access_token: null
environment_variables:
  VLLM_LOGGING_LEVEL: WARNING
  VLLM_ENGINE_READY_TIMEOUT_S: "3600"
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/model
    --tensor-parallel-size $GPU_COUNT
    --gpu-memory-utilization 0.8
    --tool-call-parser glm47
    --enable-auto-tool-choice
    --served-model-name zai-org/GLM-4.7-Flash
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --max-model-len auto
    --enable-prefix-caching
    --load-format runai_streamer"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
resources:
  accelerator: H100:2
  cpu: "1"
  memory: 2Gi
  use_gpu: true
runtime:
  is_websocket_endpoint: false
  predict_concurrency: 32
  transport:
    kind: http
  health_checks:
    restart_check_delay_seconds: 1800
    restart_threshold_seconds: 1200
    stop_traffic_threshold_seconds: 120
Flags
The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

| Flag | Value | What it does |
|---|---|---|
| --tensor-parallel-size | $GPU_COUNT | Number of GPUs to shard the model across. |
| --gpu-memory-utilization | 0.8 | Fraction of GPU memory vLLM may use for weights and KV cache. |
| --tool-call-parser | glm47 | Server-side parser that emits structured tool_calls on the response. |
| --enable-auto-tool-choice | (no value) | Let the model choose when to call tools without requiring tool_choice: "required". |
| --trust-remote-code | (no value) | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures). |
| --max-model-len | auto | Maximum context length (tokens) the server accepts per request. |
| --enable-prefix-caching | (no value) | Reuse KV cache across requests that share a prefix. |
| --load-format | runai_streamer | Weight loading backend. runai_streamer streams weights from object storage without materializing them to disk. |
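
When you tune these flags, it can help to keep them in one structure and render the command from it. The following is a minimal sketch and not part of Truss or vLLM: render_vllm_command is a hypothetical helper that turns a flag mapping into the vllm serve argument string used in start_command, treating a value of None as a bare switch. It covers only the flags in the table; the config also passes --served-model-name, --host, and --port.

```python
import shlex

# Flags from the table above; None marks a switch that takes no value.
FLAGS = {
    "--tensor-parallel-size": "$GPU_COUNT",
    "--gpu-memory-utilization": "0.8",
    "--tool-call-parser": "glm47",
    "--enable-auto-tool-choice": None,
    "--trust-remote-code": None,
    "--max-model-len": "auto",
    "--enable-prefix-caching": None,
    "--load-format": "runai_streamer",
}


def render_vllm_command(model_path: str, flags: dict) -> str:
    """Join the model path and flags into a single shell command string."""
    parts = ["vllm", "serve", shlex.quote(model_path)]
    for name, value in flags.items():
        parts.append(name)
        if value is not None:
            # Leave shell variables such as $GPU_COUNT unquoted so the shell expands them.
            parts.append(value if value.startswith("$") else shlex.quote(value))
    return " ".join(parts)


print(render_vllm_command("/app/checkpoint/model", FLAGS))
```

The printed string can be pasted inside the sh -c "..." wrapper after the GPU_COUNT assignment.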
Deploy
Push the config to Baseten:
uvx truss push
You should see output similar to:
✨ Model glm-4.7-flash-latency was successfully pushed ✨
Model ID: abc1d2ef
Deployment ID: xyz123
Endpoint: model-abc1d2ef.api.baseten.co
Logs: https://app.baseten.co/models/abc1d2ef/logs/xyz123
Your model ID is printed in the truss push output (abc1d2ef in the example). Use it wherever you see {model_id} in the next section.
Call the model
Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set.
Now call your deployment to run inference:
Python (main.py)
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)
cURL
curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "zai-org/GLM-4.7-Flash",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'
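
The example_model_input in config.yaml sets stream: true, so a streaming call is a natural next step. The sketch below assumes the deployment follows the OpenAI streaming format, where each chunk carries an incremental choices[0].delta.content. collect_stream and fake_chunk are hypothetical helpers; collect_stream accepts any iterable of such chunks, so it can be exercised locally without a network call.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream(chunks: Iterable) -> str:
    """Print streamed text as it arrives and return the full reply."""
    pieces = []
    for chunk in chunks:
        if not chunk.choices:  # some chunks carry no choices (for example, usage-only)
            continue
        text = chunk.choices[0].delta.content
        if text:  # skip None or empty deltas (role or tool-call chunks)
            print(text, end="", flush=True)
            pieces.append(text)
    print()
    return "".join(pieces)


def fake_chunk(text):
    """Stand-in object shaped like an SDK streaming chunk, for a local dry run."""
    return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])


reply = collect_stream([fake_chunk("Machine "), fake_chunk(None), fake_chunk("learning")])
```

Against the live deployment, pass the iterator returned by client.chat.completions.create(model="zai-org/GLM-4.7-Flash", messages=[...], stream=True) to collect_stream.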
To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)

print(response.choices[0].message.tool_calls)
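
Printing tool_calls only shows what the model asked for; your code still has to run the function and send the result back. The sketch below assumes the OpenAI tool-call shape, in which each call has an id and a function with a name and JSON-encoded arguments. The get_weather implementation is a stub invented for illustration, and run_tool_calls is a hypothetical helper.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    # Stub for illustration only; a real implementation would query a weather service.
    return f"No live data available for {location}"


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY) -> list:
    """Execute each requested tool and return 'tool' role messages for the follow-up request."""
    messages = []
    for call in tool_calls or []:  # tool_calls is None when the model answers directly
        func = registry.get(call.function.name)
        if func is None:
            result = f"unknown tool: {call.function.name}"
        else:
            args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
            result = func(**args)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": str(result)})
    return messages


# Local dry run with a stand-in object shaped like an SDK tool call.
fake_call = SimpleNamespace(
    id="call_0",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
print(run_tool_calls([fake_call]))
```

To continue the conversation, append response.choices[0].message and the returned tool messages to messages, then call client.chat.completions.create again.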