GLM-4.7 - Baseten

Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

Sign in to Baseten

uvx truss login --browser

Install the OpenAI SDK

uv pip install openai

Pick the model you want to deploy. The Standard and Flash recipes below are each self-contained.

Standard

zai-org/GLM-4.7 is a mixture-of-experts (MoE) model with up to 198K context. This preset serves GLM-4.7 from an FP4 checkpoint on four B200 GPUs (B200:4), delivering frontier-class reasoning at single-node cost.

Hardware	B200 × 4
Engine	TRT-LLM v2
Context	198K
Concurrency	64 (predict_concurrency in config.yaml)

Write the config

Create and move into the project directory:

mkdir glm-4.7-latency && cd glm-4.7-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:glm-4.7 preset:latency"

resources:
  accelerator: B200:4
  cpu: "1"
  memory: 10Gi
  use_gpu: true

model_metadata:
  tags:
    - openai-compatible
  example_model_input:
    model: glm47
    messages:
      - role: user
        content: "Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order. class Solution: def twoSum(self, nums: List[int], target: int) -> List[int]:"
    stream: true
    max_tokens: 2048
    temperature: 0.5

secrets:
  hf_access_token: null

weights:
  - source: "hf://baseten-admin/glm-4.7-fp4@main"
    mount_location: "/app/model_cache/glm47"
    auth_secret_name: "hf_access_token"
    allow_patterns:
      - "*.safetensors"
      - "*.json"
      - "*.model"
      - "*.jinja"
      - "*.py"
    ignore_patterns:
      - "original/*"
      - "*.pth"
      - "*.bin"

runtime:
  predict_concurrency: 64

trt_llm:
  build:
    checkpoint_repository:
      # repo: baseten-admin/glm-4.7-fp4
      repo: michaelfeil/empty-model
      revision: main
      source: HF
  inference_stack: v2
  runtime:
    enable_chunked_prefill: true
    max_batch_size: 64
    max_num_tokens: 8192
    max_seq_len: 202752
    tensor_parallel_size: 4
    served_model_name: glm47
    patch_kwargs:
      disable_overlap_scheduler: True
      moe_expert_parallel_size: 4
      moe_config:
        use_low_precision_moe_combine: true
        backend: TRTLLM
      kv_cache_config:
        free_gpu_memory_fraction: 0.8
        enable_block_reuse: true
      cuda_graph_config:
        enable_padding: true
        max_batch_size: 64
      speculative_config:
        decoding_type: MTP
        num_nextn_predict_layers: 3
      autotuner_enabled: false
      model_path: /app/model_cache/glm47
      reasoning_parser: glm47
      tool_call_parser: glm47

Key parameters

Baseten Inference Stack (BIS) reads these fields from the trt_llm block. Each one shapes how the engine is built and served:

Parameter	Value
Tensor parallel size	`4`
Max sequence length	`202752`
Max batch size	`64`
Max batched tokens	`8192`
Chunked prefill	`enabled`
Inference stack	`v2`
Served model name	`glm47`
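Two of these values are easier to read with the arithmetic spelled out. The sketch below shows that a max sequence length of 202752 is 198 × 1,024 tokens (the 198K context quoted for this preset), and estimates how many prefill passes a long prompt needs when chunked prefill caps each pass at 8192 batched tokens. The helper names are illustrative, and the estimate assumes a single request that receives the full token budget on every pass; the real scheduler shares that budget with other requests, so treat the result as a lower bound.

```python
# Values copied from the trt_llm runtime block in config.yaml.
MAX_SEQ_LEN = 202752
MAX_NUM_TOKENS = 8192


def context_in_k(max_seq_len: int) -> int:
    """Express a token limit in units of 1,024 tokens."""
    return max_seq_len // 1024


def min_prefill_passes(prompt_tokens: int, max_num_tokens: int = MAX_NUM_TOKENS) -> int:
    """Lower bound on prefill passes: ceiling of prompt_tokens / max_num_tokens.

    Assumption: each pass processes at most max_num_tokens prompt tokens.
    """
    return -(-prompt_tokens // max_num_tokens)  # integer ceiling division


print(context_in_k(MAX_SEQ_LEN))        # 198
print(min_prefill_passes(MAX_SEQ_LEN))  # 25
print(min_prefill_passes(2048))         # 1
```

A prompt that fits within 8192 tokens is prefilled in one pass; only longer prompts are split.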

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model glm-4.7-latency was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

Your model ID is printed in the truss push output (abc1d2ef in the example). Use it wherever you see {model_id} in the next section.
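If you script the deployment, you may want to pull the model ID out of the push output instead of copying it by hand. The following is a minimal sketch, assuming the output contains a `Model ID:` line formatted like the sample above; `extract_model_id` and `base_url` are hypothetical helpers, not part of Truss or the OpenAI SDK, and the real CLI output may differ between versions.

```python
import re

# Sample text copied from the example push output above.
SAMPLE_OUTPUT = """\
Model glm-4.7-latency was successfully pushed

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
"""


def extract_model_id(push_output: str) -> str:
    """Return the value on the 'Model ID:' line (hypothetical helper)."""
    match = re.search(r"^[ \t]*Model ID:[ \t]*(\S+)", push_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Model ID:' line found in the push output")
    return match.group(1)


def base_url(model_id: str) -> str:
    """Fill {model_id} into the endpoint pattern used in the next section."""
    return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


model_id = extract_model_id(SAMPLE_OUTPUT)
print(model_id)
print(base_url(model_id))
```

The returned URL can be passed as `base_url` when constructing the OpenAI client.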

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Then call your deployment to run inference:

Python

main.py

import os
from openai import OpenAI

# Point the OpenAI client at your deployment; replace {model_id} with your model ID.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

# "glm47" matches served_model_name in config.yaml.
response = client.chat.completions.create(
    model="glm47",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "glm47",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'

Flash

zai-org/GLM-4.7-Flash is a mixture-of-experts (MoE) model with up to 128K context. This preset serves GLM-4.7 Flash on two H100 GPUs (H100:2) with the glm47 tool-call parser enabled, tuned for latency-sensitive agent workflows.

Hardware	H100 × 2
Engine	vLLM (0.22.0-cu129 build)
Context	128K
Concurrency	32 (predict_concurrency in config.yaml)

Write the config

Create and move into the project directory:

mkdir glm-4.7-flash-latency && cd glm-4.7-flash-latency

Then create a file named config.yaml and paste the following:

config.yaml

model_name: "model:glm-4.7-flash preset:latency"
model_metadata:
  description: >-
    Zhipu GLM-4.7 Flash via vLLM (H100 × 2 TP), fast GLM tool calling and auto tool choice from BDN-mounted weights.
  repo_id: zai-org/GLM-4.7-Flash
  example_model_input:
    model: zai-org/GLM-4.7-Flash
    messages:
      - role: system
        content: "You are a helpful assistant."
      - role: user
        content: "What is the meaning of life?"
    stream: true
    max_tokens: 32768
    temperature: 0.7
  tags:
    - openai-compatible
base_image:
  image: vllm/vllm-openai:v0.22.0-cu129
weights:
  - source: "hf://zai-org/GLM-4.7-Flash@main"
    mount_location: "/app/checkpoint/model"
    auth_secret_name: "hf_access_token"
secrets:
  hf_access_token: null
environment_variables:
  VLLM_LOGGING_LEVEL: WARNING
  VLLM_ENGINE_READY_TIMEOUT_S: "3600"
docker_server:
  start_command: >-
    sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/model
    --tensor-parallel-size $GPU_COUNT
    --gpu-memory-utilization 0.8
    --tool-call-parser glm47
    --enable-auto-tool-choice
    --served-model-name zai-org/GLM-4.7-Flash
    --host 0.0.0.0
    --port 8000
    --trust-remote-code
    --max-model-len auto
    --enable-prefix-caching
    --load-format runai_streamer"
  readiness_endpoint: /health
  liveness_endpoint: /health
  predict_endpoint: /v1/chat/completions
  server_port: 8000
resources:
  accelerator: H100:2
  cpu: "1"
  memory: 2Gi
  use_gpu: true
runtime:
  is_websocket_endpoint: false
  predict_concurrency: 32
  transport:
    kind: http
  health_checks:
    restart_check_delay_seconds: 1800
    restart_threshold_seconds: 1200
    stop_traffic_threshold_seconds: 120

Flags

The start_command passes these flags to the engine. Each one controls a runtime or serving behavior:

Flag	Value	What it does
`--tensor-parallel-size`	`$GPU_COUNT`	Number of GPUs to shard the model across.
`--gpu-memory-utilization`	`0.8`	Fraction of GPU memory vLLM may use for weights and KV cache.
`--tool-call-parser`	`glm47`	Server-side parser that emits structured `tool_calls` on the response.
`--enable-auto-tool-choice`	(no value)	Let the model choose when to call tools without requiring `tool_choice: "required"`.
`--trust-remote-code`	(no value)	Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).
`--max-model-len`	`auto`	Maximum context length (tokens) the server accepts per request.
`--enable-prefix-caching`	(no value)	Reuse KV cache across requests that share a prefix.
`--load-format`	`runai_streamer`	Weight-loading backend. `runai_streamer` streams weights from object storage without materializing them to disk.
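To make the first two flags concrete, here is a small worked calculation. It parses the `H100:2` accelerator string from the config to get the GPU count that `$GPU_COUNT` resolves to, then applies the `0.8` memory fraction. The 80 GB per-GPU figure is an assumption (this page does not state the memory size), and both helper names are illustrative.

```python
def parse_accelerator(spec: str) -> tuple[str, int]:
    """Split an accelerator string such as 'H100:2' into (type, count)."""
    name, _, count = spec.partition(":")
    return name, int(count) if count else 1


def memory_budget_gb(per_gpu_gb: float, utilization: float, gpu_count: int) -> tuple[float, float]:
    """Memory vLLM may use per GPU and in total, given --gpu-memory-utilization."""
    per_gpu = per_gpu_gb * utilization
    return per_gpu, per_gpu * gpu_count


gpu_type, gpu_count = parse_accelerator("H100:2")
# Assumption: 80 GB of memory per H100.
per_gpu, total = memory_budget_gb(80, 0.8, gpu_count)
print(f"{gpu_count} x {gpu_type}: {per_gpu:.0f} GB per GPU, {total:.0f} GB total")
# 2 x H100: 64 GB per GPU, 128 GB total
```

Under that assumption, weights and KV cache together must fit within roughly 64 GB on each GPU; the remainder is left free for other allocations.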

Deploy

Push the config to Baseten:

uvx truss push

You should see output similar to:

✨ Model glm-4.7-flash-latency was successfully pushed ✨

   Model ID:      abc1d2ef
   Deployment ID: xyz123
   Endpoint:      model-abc1d2ef.api.baseten.co
   Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123

Your model ID is printed in the truss push output (abc1d2ef in the example). Use it wherever you see {model_id} in the next section.

Call the model

Your deployment serves an OpenAI-compatible API. Replace {model_id} with your model ID and make sure BASETEN_API_KEY is set. Then call your deployment to run inference:

Python

main.py

import os
from openai import OpenAI

# Point the OpenAI client at your deployment; replace {model_id} with your model ID.
client = OpenAI(
    api_key=os.environ["BASETEN_API_KEY"],
    base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
)

# The model name matches --served-model-name in the start_command.
response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",
    messages=[
        {"role": "user", "content": "What is machine learning?"}
    ],
)

print(response.choices[0].message.content)

cURL

curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "zai-org/GLM-4.7-Flash",
    "messages": [
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'

To let the model call tools, pass a tools array. The server returns structured tool_calls on the response:

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="zai-org/GLM-4.7-Flash",
    messages=[
        {"role": "user", "content": "What's the weather in Paris?"}
    ],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
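Printing `tool_calls` is only the first half of a tool round trip: the caller still has to run each requested function and send the result back. The sketch below shows one way to do that, turning each tool call into a `role: "tool"` message in the OpenAI chat format. `run_tool_calls` and the `get_weather` body are hypothetical stand-ins, and the fake call object only imitates the `id`, `function.name`, and `function.arguments` fields so the snippet runs offline.

```python
import json
from types import SimpleNamespace


def get_weather(location: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"No live data in this sketch for {location}."


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_calls(tool_calls, registry=TOOL_REGISTRY):
    """Execute each requested tool and build the reply messages."""
    messages = []
    for call in tool_calls or []:  # tool_calls is None when no tool was chosen
        func = registry[call.function.name]
        args = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": str(func(**args))}
        )
    return messages


# Offline stand-in for response.choices[0].message.tool_calls.
fake_call = SimpleNamespace(
    id="call_1",
    function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
)
print(run_tool_calls([fake_call]))
```

To continue the conversation, append `response.choices[0].message` and the returned tool messages to `messages`, then call `client.chat.completions.create` again with the same `tools`.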
