
# GLM-4.7

> GLM-4.7 recipes: 2 variants (Standard, Flash), MoE architecture.

<div className="capability-pills">
  <a href="/examples/models/capabilities/reasoning" className="capability-pill">Reasoning</a>
  <a href="/examples/models/capabilities/tool-calling" className="capability-pill">Tool calling</a>
  <a href="/examples/models/capabilities/long-context" className="capability-pill">Long context</a>
</div>

## Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install the OpenAI SDK**

    ```sh theme={"system"}
    uv pip install openai
    ```
  </Column>
</Columns>

Pick the model you want to deploy. Each tab is a self-contained recipe.

<Tabs>
  <Tab title="Standard">
    [zai-org/GLM-4.7](https://huggingface.co/zai-org/GLM-4.7) is a MoE model with up to 198K context.

    This preset serves GLM-4.7 from an FP4 checkpoint on B200:4, delivering frontier-class reasoning at single-node cost.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">B200 × 4</Card>
      <Card title="Engine" icon="server">TRT-LLM v2</Card>
      <Card title="Context" icon="ruler-horizontal">198K</Card>
      <Card title="Concurrency" icon="layer-group">64</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir glm-4.7-latency && cd glm-4.7-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: "model:glm-4.7 preset:latency"

    resources:
      accelerator: B200:4
      cpu: "1"
      memory: 10Gi
      use_gpu: true

    model_metadata:
      tags:
        - openai-compatible
      example_model_input:
        model: glm47
        messages:
          - role: user
            content: "Given an array of integers nums and an integer target, return indices of the two numbers such that they add up to target. You may assume that each input would have exactly one solution, and you may not use the same element twice. You can return the answer in any order. class Solution: def twoSum(self, nums: List[int], target: int) -> List[int]:"
        stream: true
        max_tokens: 2048
        temperature: 0.5

    secrets:
      hf_access_token: null

    weights:
      - source: "hf://baseten-admin/glm-4.7-fp4@main"
        mount_location: "/app/model_cache/glm47"
        auth_secret_name: "hf_access_token"
        allow_patterns:
          - "*.safetensors"
          - "*.json"
          - "*.model"
          - "*.jinja"
          - "*.py"
        ignore_patterns:
          - "original/*"
          - "*.pth"
          - "*.bin"

    runtime:
      predict_concurrency: 64

    trt_llm:
      build:
        checkpoint_repository:
          # repo: baseten-admin/glm-4.7-fp4
          repo: michaelfeil/empty-model
          revision: main
          source: HF
      inference_stack: v2
      runtime:
        enable_chunked_prefill: true
        max_batch_size: 64
        max_num_tokens: 8192
        max_seq_len: 202752
        tensor_parallel_size: 4
        served_model_name: glm47
        patch_kwargs:
          disable_overlap_scheduler: True
          moe_expert_parallel_size: 4
          moe_config:
            use_low_precision_moe_combine: true
            backend: TRTLLM
          kv_cache_config:
            free_gpu_memory_fraction: 0.8
            enable_block_reuse: true
          cuda_graph_config:
            enable_padding: true
            max_batch_size: 64
          speculative_config:
            decoding_type: MTP
            num_nextn_predict_layers: 3
          autotuner_enabled: false
          model_path: /app/model_cache/glm47
          reasoning_parser: glm47
          tool_call_parser: glm47
    ```

    ## Key parameters

    [Baseten Inference Stack](/engines/bis-llm/overview) (BIS) reads these fields from the `trt_llm` block. Each one shapes how the engine is built and served:

    | Parameter            | Value     |
    | -------------------- | --------- |
    | Tensor parallel size | `4`       |
    | Max sequence length  | `202752`  |
    | Max batch size       | `64`      |
    | Max batched tokens   | `8192`    |
    | Chunked prefill      | `enabled` |
    | Inference stack      | `v2`      |
    | Served model name    | `glm47`   |
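
    Several of these values line up with other parts of `config.yaml`. The sketch below copies the relevant numbers by hand into a plain dict (it does not parse the YAML) and checks the relationships that hold in this preset. The `check_config` and `prefill_chunks` helpers are hypothetical, not part of Truss or BIS, and the chunk count is a simplification that assumes each prefill chunk holds at most `max_num_tokens` tokens.

    ```python theme={"system"}
    import math

    # Values copied by hand from the config.yaml above.
    config = {
        "accelerator": "B200:4",
        "predict_concurrency": 64,
        "tensor_parallel_size": 4,
        "max_batch_size": 64,
        "max_num_tokens": 8192,
        "max_seq_len": 202752,
    }


    def check_config(cfg):
        """Return a list of mismatches against this preset's pattern (empty if none)."""
        problems = []
        gpu_count = int(cfg["accelerator"].split(":")[1])
        if cfg["tensor_parallel_size"] != gpu_count:
            problems.append("tensor_parallel_size differs from the GPU count")
        if cfg["max_batch_size"] != cfg["predict_concurrency"]:
            problems.append("max_batch_size differs from predict_concurrency")
        return problems


    def prefill_chunks(prompt_tokens, max_num_tokens):
        """Rough chunk count for one prompt, assuming chunks of at most max_num_tokens."""
        return math.ceil(prompt_tokens / max_num_tokens)


    print(check_config(config))           # []
    print(config["max_seq_len"] // 1024)  # 198, the "198K" shown on the Context card
    print(prefill_chunks(config["max_seq_len"], config["max_num_tokens"]))  # 25
    ```

    If you change the accelerator count or the concurrency, rerun a check like this before pushing so the `resources`, `runtime`, and `trt_llm` blocks stay in step.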

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```output theme={"system"}
    ✨ Model glm-4.7-latency was successfully pushed ✨

       Model ID:      abc1d2ef
       Deployment ID: xyz123
       Endpoint:      model-abc1d2ef.api.baseten.co
       Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
    ```

    Your **model ID** is printed in the `truss push` output (`abc1d2ef` in the example). Use it wherever you see `{model_id}` in the next section.
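
    If you script the deployment, a small helper can pull the model ID out of the push output and build the base URL used in the next section. This is a minimal sketch: `extract_model_id` and `base_url_for` are hypothetical names, and the regular expression assumes the `Model ID:` line looks like the sample output above, which the CLI may format differently.

    ```python theme={"system"}
    import re

    # Sample lines copied from the example output above; your IDs will differ.
    push_output = """
       Model ID:      abc1d2ef
       Deployment ID: xyz123
    """


    def extract_model_id(output):
        """Find the value on the 'Model ID:' line (assumed format)."""
        match = re.search(r"Model ID:\s+(\S+)", output)
        if match is None:
            raise ValueError("no 'Model ID:' line found in the output")
        return match.group(1)


    def base_url_for(model_id):
        """Fill the {model_id} placeholder in the production endpoint URL."""
        return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"


    model_id = extract_model_id(push_output)
    print(base_url_for(model_id))
    # https://model-abc1d2ef.api.baseten.co/environments/production/sync/v1
    ```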

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="glm47",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "glm47",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>
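
    The `example_model_input` in the config sets `stream: true`. To stream from your own code, pass `stream=True` and read the chunks as they arrive. The `stream_chat` helper below is a hypothetical sketch: it takes the `client` from the Python tab as an argument, so it makes no request until you call it.

    ```python theme={"system"}
    def stream_chat(client, prompt, model="glm47"):
        """Stream a chat completion, print tokens as they arrive, return the full text."""
        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        parts = []
        for chunk in stream:
            # Some chunks carry no choices or an empty delta; skip those.
            if not chunk.choices:
                continue
            text = chunk.choices[0].delta.content
            if text:
                print(text, end="", flush=True)
                parts.append(text)
        print()
        return "".join(parts)
    ```

    Call it as `stream_chat(client, "What is machine learning?")`. Printing inside the loop shows output as soon as the first tokens arrive, while the return value gives you the whole reply for logging or post-processing.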
  </Tab>

  <Tab title="Flash">
    [zai-org/GLM-4.7-Flash](https://huggingface.co/zai-org/GLM-4.7-Flash) is a MoE model with up to 128K context.

    This preset serves GLM-4.7 Flash on H100:2 with the `glm47` tool-call parser enabled, tuned for latency-sensitive agent workflows.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100 × 2</Card>
      <Card title="Engine" icon="server">vLLM (0.22.0-cu129 build)</Card>
      <Card title="Context" icon="ruler-horizontal">128K</Card>
      <Card title="Concurrency" icon="layer-group">32</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir glm-4.7-flash-latency && cd glm-4.7-flash-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: "model:glm-4.7-flash preset:latency"
    model_metadata:
      description: >-
        Zhipu GLM-4.7 Flash via vLLM (H100 × 2 TP), fast GLM tool calling and auto tool choice from BDN-mounted weights.
      repo_id: zai-org/GLM-4.7-Flash
      example_model_input:
        model: zai-org/GLM-4.7-Flash
        messages:
          - role: system
            content: "You are a helpful assistant."
          - role: user
            content: "What is the meaning of life?"
        stream: true
        max_tokens: 32768
        temperature: 0.7
      tags:
        - openai-compatible
    base_image:
      image: vllm/vllm-openai:v0.22.0-cu129
    weights:
      - source: "hf://zai-org/GLM-4.7-Flash@main"
        mount_location: "/app/checkpoint/model"
        auth_secret_name: "hf_access_token"
    secrets:
      hf_access_token: null
    environment_variables:
      VLLM_LOGGING_LEVEL: WARNING
      VLLM_ENGINE_READY_TIMEOUT_S: "3600"
    docker_server:
      start_command: >-
        sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/model
        --tensor-parallel-size $GPU_COUNT
        --gpu-memory-utilization 0.8
        --tool-call-parser glm47
        --enable-auto-tool-choice
        --served-model-name zai-org/GLM-4.7-Flash
        --host 0.0.0.0
        --port 8000
        --trust-remote-code
        --max-model-len auto
        --enable-prefix-caching
        --load-format runai_streamer"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    resources:
      accelerator: H100:2
      cpu: "1"
      memory: 2Gi
      use_gpu: true
    runtime:
      is_websocket_endpoint: false
      predict_concurrency: 32
      transport:
        kind: http
      health_checks:
        restart_check_delay_seconds: 1800
        restart_threshold_seconds: 1200
        stop_traffic_threshold_seconds: 120
    ```

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                        | Value            | What it does                                                                                                   |
    | --------------------------- | ---------------- | -------------------------------------------------------------------------------------------------------------- |
    | `--tensor-parallel-size`    | `$GPU_COUNT`     | Number of GPUs to shard the model across.                                                                      |
    | `--gpu-memory-utilization`  | `0.8`            | Fraction of GPU memory vLLM may use for weights and KV cache.                                                  |
    | `--tool-call-parser`        | `glm47`          | Server-side parser that emits structured `tool_calls` on the response.                                         |
    | `--enable-auto-tool-choice` | (no value)       | Let the model choose when to call tools without requiring `tool_choice: "required"`.                           |
    | `--trust-remote-code`       | (no value)       | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).     |
    | `--max-model-len`           | `auto`           | Maximum context length (tokens) per request; `auto` lets vLLM pick the limit instead of a fixed number.        |
    | `--enable-prefix-caching`   | (no value)       | Reuse KV cache across requests that share a prefix.                                                            |
    | `--load-format`             | `runai_streamer` | Weight loading backend. **runai\_streamer:** Stream weights from object storage without materializing to disk. |

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```output theme={"system"}
    ✨ Model glm-4.7-flash-latency was successfully pushed ✨

       Model ID:      abc1d2ef
       Deployment ID: xyz123
       Endpoint:      model-abc1d2ef.api.baseten.co
       Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
    ```

    Your **model ID** is printed in the `truss push` output (`abc1d2ef` in the example). Use it wherever you see `{model_id}` in the next section.

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="zai-org/GLM-4.7-Flash",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "zai-org/GLM-4.7-Flash",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>

    To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

    ```python theme={"system"}
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="zai-org/GLM-4.7-Flash",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
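
    The model only requests a tool; your code has to run it and send the result back. The sketch below shows that step under some assumptions: `get_weather` is a hypothetical local function returning canned data, and `TOOL_REGISTRY` and `tool_messages` are illustrative names. Each tool call carries its arguments as a JSON string, so they are decoded before the function runs.

    ```python theme={"system"}
    import json


    def get_weather(location):
        """Hypothetical tool implementation; returns canned data, not a real forecast."""
        return {"location": location, "forecast": "sunny", "temperature_c": 21}


    # Map tool names from the `tools` array to local callables.
    TOOL_REGISTRY = {"get_weather": get_weather}


    def tool_messages(tool_calls):
        """Run each requested tool and build the `tool` role messages to send back."""
        messages = []
        for call in tool_calls or []:  # tool_calls is None when no tool was requested
            func = TOOL_REGISTRY[call.function.name]
            arguments = json.loads(call.function.arguments)
            result = func(**arguments)
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })
        return messages
    ```

    To finish the exchange, append `response.choices[0].message` and the output of `tool_messages(response.choices[0].message.tool_calls)` to your `messages` list, then call `client.chat.completions.create` again with the same `tools` so the model can write its final answer.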
  </Tab>
</Tabs>
