# Qwen3.5

> Qwen3.5 recipes for four variants (4B, 9B, 35B, and 122B) covering dense, hybrid MoE, and MoE architectures.

<div className="capability-pills">
  <a href="/examples/models/capabilities/reasoning" className="capability-pill">Reasoning</a>
  <a href="/examples/models/capabilities/tool-calling" className="capability-pill">Tool calling</a>
  <a href="/examples/models/capabilities/long-context" className="capability-pill">Long context</a>
  <a href="/examples/models/capabilities/agentic" className="capability-pill">Agentic</a>
</div>

## Setup

To get started, sign in to Baseten with Truss, then install the OpenAI SDK.

<Columns cols={2}>
  <Column>
    **Sign in to Baseten**

    ```sh theme={"system"}
    uvx truss login --browser
    ```
  </Column>

  <Column>
    **Install the OpenAI SDK**

    ```sh theme={"system"}
    uv pip install openai
    ```
  </Column>
</Columns>
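
The recipes below read your API key from the `BASETEN_API_KEY` environment variable. As a quick sanity check before deploying, a sketch like the following reports whether the SDK is importable and the key is set. The `missing_prerequisites` helper is hypothetical and is not part of Truss or the OpenAI SDK.

```python
import importlib.util
import os

def missing_prerequisites(env=os.environ, module="openai"):
    """List what still needs to be set up before calling a deployment."""
    missing = []
    # find_spec returns None when a top-level package cannot be imported.
    if importlib.util.find_spec(module) is None:
        missing.append(f"Python package '{module}' is not installed")
    if not env.get("BASETEN_API_KEY"):
        missing.append("BASETEN_API_KEY is not set")
    return missing

print(missing_prerequisites() or "Ready to go")
```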

Pick the model you want to deploy. Each tab is a self-contained recipe.

<Tabs>
  <Tab title="4B">
    [Qwen/Qwen3.5-4B](https://huggingface.co/Qwen/Qwen3.5-4B) is a 4B-parameter dense model with up to 256K context.

    This preset serves Qwen3.5-4B with BF16 weights on a single H100, optimized for low time-to-first-token.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100 × 1</Card>
      <Card title="Engine" icon="server">vLLM (0.22.0-cu129 build)</Card>
      <Card title="Context" icon="ruler-horizontal">32K</Card>
      <Card title="Concurrency" icon="layer-group">128</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir qwen3.5-4b-latency && cd qwen3.5-4b-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: "model:qwen3.5-4b preset:latency"
    model_metadata:
      description: >-
        Qwen 3.5 4B instruct (dense), OpenAI-compatible chat via vLLM with Qwen tooling parsers.
      repo_id: Qwen/Qwen3.5-4B
      example_model_input:
        model: "Qwen/Qwen3.5-4B"
        messages:
          - role: user
            content: "What is the capital of France?"
        stream: true
        max_tokens: 100
        temperature: 0.7
      tags:
        - openai-compatible
    base_image:
      image: vllm/vllm-openai:v0.22.0-cu129
    weights:
      - source: "hf://Qwen/Qwen3.5-4B@main"
        mount_location: "/app/checkpoint/model"
        auth_secret_name: "hf_access_token"
    secrets:
      hf_access_token: null
    docker_server:
      start_command: >-
        sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/model --tensor-parallel-size $GPU_COUNT
        --served-model-name Qwen/Qwen3.5-4B
        --host 0.0.0.0
        --port 8000
        --gpu-memory-utilization 0.95
        --max-model-len 32768
        --dtype bfloat16
        --reasoning-parser qwen3
        --enable-auto-tool-choice
        --tool-call-parser qwen3_coder
        --trust-remote-code
        --enable-prefix-caching
        --load-format runai_streamer"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    environment_variables:
      VLLM_LOGGING_LEVEL: WARNING
      VLLM_ENGINE_READY_TIMEOUT_S: "3600"
    runtime:
      predict_concurrency: 128
      health_checks:
        restart_check_delay_seconds: 1800
        restart_threshold_seconds: 1200
        stop_traffic_threshold_seconds: 120
    resources:
      accelerator: H100:1
      use_gpu: true
    ```

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                        | Value            | What it does                                                                                                                                                |
    | --------------------------- | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | `--tensor-parallel-size`    | `$GPU_COUNT`     | Number of GPUs to shard the model across.                                                                                                                   |
    | `--gpu-memory-utilization`  | `0.95`           | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                               |
    | `--max-model-len`           | `32768`          | Maximum context length (tokens) the server accepts per request.                                                                                             |
    | `--dtype`                   | `bfloat16`       | Weight precision loaded at runtime. **bfloat16:** BF16 weights, no quantization.                                                                            |
    | `--reasoning-parser`        | `qwen3`          | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
    | `--enable-auto-tool-choice` | (no value)       | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                        |
    | `--tool-call-parser`        | `qwen3_coder`    | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                           |
    | `--trust-remote-code`       | (no value)       | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |
    | `--enable-prefix-caching`   | (no value)       | Reuse KV cache across requests that share a prefix.                                                                                                         |
    | `--load-format`             | `runai_streamer` | Weight loading backend. **runai\_streamer:** Stream weights from object storage without materializing to disk.                                              |
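
    As a rough illustration of how the table maps onto the command line, the sketch below renders a dictionary of flags into `vllm serve` arguments. The `build_serve_args` helper is hypothetical and not part of Truss or vLLM. It only shows that flags with a value become `--flag value` pairs, while flags listed as (no value) are passed bare.

    ```python
    import shlex

    def build_serve_args(model_path, flags):
        """Render flags as CLI arguments; True means a bare switch (hypothetical helper)."""
        args = ["vllm", "serve", model_path]
        for name, value in flags.items():
            if value is False or value is None:
                continue  # omitted flags fall back to the engine default
            args.append(f"--{name}")
            if value is not True:
                args.append(str(value))
        return args

    flags = {
        "max-model-len": 32768,
        "dtype": "bfloat16",
        "reasoning-parser": "qwen3",
        "enable-auto-tool-choice": True,
        "tool-call-parser": "qwen3_coder",
        "enable-prefix-caching": True,
    }
    command = shlex.join(build_serve_args("/app/checkpoint/model", flags))
    print(command)
    ```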

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```output theme={"system"}
    ✨ Model qwen3.5-4b-latency was successfully pushed ✨

       Model ID:      abc1d2ef
       Deployment ID: xyz123
       Endpoint:      model-abc1d2ef.api.baseten.co
       Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
    ```

    Your **model ID** is printed in the `truss push` output (`abc1d2ef` in the example). Use it wherever you see `{model_id}` in the next section.
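
    If you script your deployments, you may want to capture the model ID programmatically. The sketch below assumes the output keeps the `Model ID:` label shown above, which may change between Truss versions. Both `parse_model_id` and `base_url` are hypothetical helpers.

    ```python
    import re

    def parse_model_id(push_output):
        """Return the value after 'Model ID:' in truss push output, or None."""
        match = re.search(r"Model ID:\s*(\S+)", push_output)
        return match.group(1) if match else None

    def base_url(model_id):
        """Build the OpenAI-compatible base URL used when calling the model."""
        return f"https://model-{model_id}.api.baseten.co/environments/production/sync/v1"

    # Sample text copied from the example output above.
    sample_output = """
       Model ID:      abc1d2ef
       Deployment ID: xyz123
    """
    model_id = parse_model_id(sample_output)
    print(base_url(model_id))
    ```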

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-4B",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "Qwen/Qwen3.5-4B",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>

    To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

    ```python theme={"system"}
    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-4B",
        messages=[
            {"role": "user", "content": "How many r's in strawberry?"}
        ],
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    print(response.choices[0].message.reasoning_content)  # chain of thought
    print(response.choices[0].message.content)            # final answer
    ```
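
    When thinking is not enabled for a request, `reasoning_content` may be empty or absent. That is an assumption about server behavior rather than something this recipe guarantees. A defensive accessor along these lines avoids an `AttributeError`. `split_reasoning` is a hypothetical helper, and the stand-in objects replace a live response so the snippet runs offline.

    ```python
    from types import SimpleNamespace

    def split_reasoning(message):
        """Return (reasoning, answer), tolerating a missing reasoning_content field."""
        reasoning = getattr(message, "reasoning_content", None) or ""
        answer = message.content or ""
        return reasoning, answer

    # Stand-in objects shaped like response.choices[0].message, for illustration only.
    with_thinking = SimpleNamespace(reasoning_content="Count each letter.", content="3")
    without_thinking = SimpleNamespace(content="3")

    print(split_reasoning(with_thinking))
    print(split_reasoning(without_thinking))
    ```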

    To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

    ```python theme={"system"}
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-4B",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
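
    The response only describes which tool to call; running it is up to your code. The sketch below shows one way to dispatch `tool_calls` to local functions and build the `role: "tool"` messages for a follow-up request. The `get_weather` implementation, `TOOL_REGISTRY`, and `run_tool_calls` are hypothetical, and the stand-in tool call replaces a live response so the snippet runs offline.

    ```python
    import json
    from types import SimpleNamespace

    def get_weather(location):
        """Hypothetical local implementation of the get_weather tool."""
        return {"location": location, "forecast": "sunny"}

    TOOL_REGISTRY = {"get_weather": get_weather}

    def run_tool_calls(tool_calls):
        """Execute each tool call and return role="tool" messages for the next request."""
        results = []
        for call in tool_calls or []:
            handler = TOOL_REGISTRY[call.function.name]
            arguments = json.loads(call.function.arguments)  # arguments arrive as a JSON string
            results.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(handler(**arguments)),
            })
        return results

    # Stand-in for response.choices[0].message.tool_calls, for illustration only.
    fake_calls = [SimpleNamespace(
        id="call_1",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
    )]
    tool_messages = run_tool_calls(fake_calls)
    print(tool_messages)
    ```

    To continue the conversation, append the assistant message that contained the tool calls, followed by these tool messages, to `messages` and call `client.chat.completions.create` again.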
  </Tab>

  <Tab title="9B">
    [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) is a 9B-parameter dense model with up to 256K context.

    This preset serves Qwen3.5-9B with BF16 weights on a single H100, with reasoning and tool calling enabled.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100 × 1</Card>
      <Card title="Engine" icon="server">vLLM (0.22.0-cu129 build)</Card>
      <Card title="Context" icon="ruler-horizontal">32K</Card>
      <Card title="Concurrency" icon="layer-group">128</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir qwen3.5-9b-latency && cd qwen3.5-9b-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: "model:qwen3.5-9b preset:latency"
    model_metadata:
      description: >-
        Qwen 3.5 9B instruct (dense), OpenAI-compatible chat via vLLM with Qwen tooling parsers.
      repo_id: Qwen/Qwen3.5-9B
      example_model_input:
        model: "Qwen/Qwen3.5-9B"
        messages:
          - role: user
            content: "What is the capital of France?"
        stream: true
        max_tokens: 100
        temperature: 0.7
      tags:
        - openai-compatible
    base_image:
      image: vllm/vllm-openai:v0.22.0-cu129
    weights:
      - source: "hf://Qwen/Qwen3.5-9B@main"
        mount_location: "/app/checkpoint/model"
        auth_secret_name: "hf_access_token"
    secrets:
      hf_access_token: null
    docker_server:
      start_command: >-
        sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/model --tensor-parallel-size $GPU_COUNT
        --served-model-name Qwen/Qwen3.5-9B
        --host 0.0.0.0
        --port 8000
        --gpu-memory-utilization 0.95
        --max-model-len 32768
        --dtype bfloat16
        --reasoning-parser qwen3
        --enable-auto-tool-choice
        --tool-call-parser qwen3_coder
        --trust-remote-code
        --enable-prefix-caching
        --load-format runai_streamer"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    environment_variables:
      VLLM_LOGGING_LEVEL: WARNING
      VLLM_ENGINE_READY_TIMEOUT_S: "3600"
    runtime:
      predict_concurrency: 128
      health_checks:
        restart_check_delay_seconds: 1800
        restart_threshold_seconds: 1200
        stop_traffic_threshold_seconds: 120
    resources:
      accelerator: H100:1
      use_gpu: true
    ```

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                        | Value            | What it does                                                                                                                                                |
    | --------------------------- | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | `--tensor-parallel-size`    | `$GPU_COUNT`     | Number of GPUs to shard the model across.                                                                                                                   |
    | `--gpu-memory-utilization`  | `0.95`           | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                               |
    | `--max-model-len`           | `32768`          | Maximum context length (tokens) the server accepts per request.                                                                                             |
    | `--dtype`                   | `bfloat16`       | Weight precision loaded at runtime. **bfloat16:** BF16 weights, no quantization.                                                                            |
    | `--reasoning-parser`        | `qwen3`          | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
    | `--enable-auto-tool-choice` | (no value)       | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                        |
    | `--tool-call-parser`        | `qwen3_coder`    | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                           |
    | `--trust-remote-code`       | (no value)       | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |
    | `--enable-prefix-caching`   | (no value)       | Reuse KV cache across requests that share a prefix.                                                                                                         |
    | `--load-format`             | `runai_streamer` | Weight loading backend. **runai\_streamer:** Stream weights from object storage without materializing to disk.                                              |

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```output theme={"system"}
    ✨ Model qwen3.5-9b-latency was successfully pushed ✨

       Model ID:      abc1d2ef
       Deployment ID: xyz123
       Endpoint:      model-abc1d2ef.api.baseten.co
       Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
    ```

    Your **model ID** is printed in the `truss push` output (`abc1d2ef` in the example). Use it wherever you see `{model_id}` in the next section.

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

    Now call your deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-9B",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "Qwen/Qwen3.5-9B",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>
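
    The `example_model_input` in the config sets `stream: true`. With the OpenAI SDK, passing `stream=True` to `client.chat.completions.create` returns an iterator of chunks whose text arrives in `choices[0].delta.content`. The sketch below shows one way to accumulate that text. `collect_stream` and `fake_chunk` are hypothetical helpers, and the stand-in chunks replace a live response so the snippet runs offline.

    ```python
    from types import SimpleNamespace

    def collect_stream(chunks):
        """Join the delta.content pieces of a streamed chat completion."""
        parts = []
        for chunk in chunks:
            if not chunk.choices:  # some chunks may carry no choices
                continue
            piece = chunk.choices[0].delta.content
            if piece:
                parts.append(piece)
        return "".join(parts)

    def fake_chunk(text):
        """Build an object shaped like a streamed chunk, for illustration only."""
        return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(content=text))])

    stream = [
        fake_chunk("Machine "),
        fake_chunk(None),
        fake_chunk("learning"),
        SimpleNamespace(choices=[]),
    ]
    print(collect_stream(stream))
    ```

    Against a live deployment, the return value of `client.chat.completions.create(..., stream=True)` would take the place of `stream`.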

    To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

    ```python theme={"system"}
    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-9B",
        messages=[
            {"role": "user", "content": "How many r's in strawberry?"}
        ],
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    print(response.choices[0].message.reasoning_content)  # chain of thought
    print(response.choices[0].message.content)            # final answer
    ```

    To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

    ```python theme={"system"}
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-9B",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
  </Tab>

  <Tab title="35B">
    [Qwen/Qwen3.5-35B-A3B](https://huggingface.co/Qwen/Qwen3.5-35B-A3B) is a 35B-parameter hybrid MoE model (3B active per token) with up to 256K context.

    This variant ships in 2 presets tuned for different goals: **Latency** for lowest time-to-first-token, and **Throughput** for highest tokens per second. Pick the tab that matches your workload.

    <Tabs>
      <Tab title="Latency">
        This preset serves Qwen3.5-35B with FP8 weights on two H100 GPUs, optimized for low time-to-first-token on interactive chat and short-horizon agent workflows.

        <CardGroup cols={4}>
          <Card title="Hardware" icon="microchip">H100 × 2</Card>
          <Card title="Engine" icon="server">vLLM (0.22.0-cu129 build)</Card>
          <Card title="Context" icon="ruler-horizontal">32K</Card>
          <Card title="Concurrency" icon="layer-group">128</Card>
        </CardGroup>

        ## Write the config

        Create and move into the project directory:

        ```sh theme={"system"}
        mkdir qwen3.5-35b-latency && cd qwen3.5-35b-latency
        ```

        Then create a file named `config.yaml` and paste the following:

        ```yaml config.yaml theme={"system"}
        model_name: "model:qwen3.5-35b preset:latency"
        model_metadata:
          description: >-
            Qwen 3.5 35B A3B MoE instruct FP8 weights, TP=2 latency preset with Qwen parsers.
          repo_id: Qwen/Qwen3.5-35B-A3B-FP8
          example_model_input:
            model: "Qwen/Qwen3.5-35B-A3B-FP8"
            messages:
              - role: user
                content: "What is the capital of France?"
            stream: true
            max_tokens: 100
            temperature: 0.7
          tags:
            - openai-compatible
        base_image:
          image: vllm/vllm-openai:v0.22.0-cu129
        weights:
          - source: "hf://Qwen/Qwen3.5-35B-A3B-FP8@main"
            mount_location: "/app/checkpoint/model"
            auth_secret_name: "hf_access_token"
        secrets:
          hf_access_token: null
        docker_server:
          start_command: >-
            sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/model
            --served-model-name Qwen/Qwen3.5-35B-A3B-FP8
            --host 0.0.0.0
            --port 8000
            --gpu-memory-utilization 0.95
            --max-model-len 32768
            --kv-cache-dtype fp8
            --tensor-parallel-size $GPU_COUNT
            --reasoning-parser qwen3
            --enable-auto-tool-choice
            --tool-call-parser qwen3_coder
            --trust-remote-code
            --enable-prefix-caching
            --load-format runai_streamer"
          readiness_endpoint: /health
          liveness_endpoint: /health
          predict_endpoint: /v1/chat/completions
          server_port: 8000
        environment_variables:
          VLLM_LOGGING_LEVEL: WARNING
          VLLM_ENGINE_READY_TIMEOUT_S: "3600"
        runtime:
          predict_concurrency: 128
          health_checks:
            restart_check_delay_seconds: 1800
            restart_threshold_seconds: 1200
            stop_traffic_threshold_seconds: 120
        resources:
          accelerator: H100:2
          use_gpu: true
        ```

        ## Flags

        The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

        | Flag                        | Value            | What it does                                                                                                                                                |
        | --------------------------- | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
        | `--gpu-memory-utilization`  | `0.95`           | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                               |
        | `--max-model-len`           | `32768`          | Maximum context length (tokens) the server accepts per request.                                                                                             |
        | `--kv-cache-dtype`          | `fp8`            | KV cache numeric precision. **fp8:** \~2× KV cache density with negligible quality impact on most models.                                                   |
        | `--tensor-parallel-size`    | `$GPU_COUNT`     | Number of GPUs to shard the model across.                                                                                                                   |
        | `--reasoning-parser`        | `qwen3`          | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
        | `--enable-auto-tool-choice` | (no value)       | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                        |
        | `--tool-call-parser`        | `qwen3_coder`    | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                           |
        | `--trust-remote-code`       | (no value)       | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |
        | `--enable-prefix-caching`   | (no value)       | Reuse KV cache across requests that share a prefix.                                                                                                         |
        | `--load-format`             | `runai_streamer` | Weight loading backend. **runai\_streamer:** Stream weights from object storage without materializing to disk.                                              |
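
        To see where the roughly 2× figure for `--kv-cache-dtype fp8` comes from, the sketch below compares per-token KV cache size at 2 bytes per element (BF16) and 1 byte per element (FP8). The layer count, KV head count, and head dimension are hypothetical placeholders, not the real Qwen3.5-35B-A3B architecture, and the formula assumes standard attention layers only.

        ```python
        def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_element):
            """Keys plus values stored for one token across all attention layers."""
            return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element

        # Hypothetical architecture numbers, chosen only to make the arithmetic concrete.
        shape = dict(num_layers=40, num_kv_heads=4, head_dim=128)

        bf16_bytes = kv_bytes_per_token(**shape, bytes_per_element=2)
        fp8_bytes = kv_bytes_per_token(**shape, bytes_per_element=1)

        print(bf16_bytes, fp8_bytes, bf16_bytes / fp8_bytes)
        ```

        Halving the bytes per element doubles the number of tokens that fit in the same KV cache budget, which leaves room for more concurrent sequences or longer contexts.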

        ## Deploy

        Push the config to Baseten:

        ```sh theme={"system"}
        uvx truss push
        ```

        You should see output similar to:

        ```output theme={"system"}
        ✨ Model qwen3.5-35b-latency was successfully pushed ✨

           Model ID:      abc1d2ef
           Deployment ID: xyz123
           Endpoint:      model-abc1d2ef.api.baseten.co
           Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
        ```

        Your **model ID** is printed in the `truss push` output (`abc1d2ef` in the example). Use it wherever you see `{model_id}` in the next section.

        ## Call the model

        Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

        Now call your deployment to run inference:

        <Tabs>
          <Tab title="Python">
            ```python main.py theme={"system"}
            import os
            from openai import OpenAI

            client = OpenAI(
                api_key=os.environ["BASETEN_API_KEY"],
                base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
            )

            response = client.chat.completions.create(
                model="Qwen/Qwen3.5-35B-A3B-FP8",
                messages=[
                    {"role": "user", "content": "What is machine learning?"}
                ],
            )

            print(response.choices[0].message.content)
            ```
          </Tab>

          <Tab title="cURL">
            ```sh theme={"system"}
            curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
              -H "Content-Type: application/json" \
              -H "Authorization: Bearer $BASETEN_API_KEY" \
              -d '{
                "model": "Qwen/Qwen3.5-35B-A3B-FP8",
                "messages": [
                  {"role": "user", "content": "What is machine learning?"}
                ]
              }'
            ```
          </Tab>
        </Tabs>
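
        If you prefer not to install the SDK, the same request can be assembled with Python's standard library. This sketch mirrors the cURL example. `build_chat_request` is a hypothetical helper, the key is a placeholder, and the request is only constructed here, not sent.

        ```python
        import json
        import urllib.request

        def build_chat_request(model_id, api_key, prompt):
            """Assemble (but do not send) a chat completions request mirroring the cURL call."""
            url = (
                f"https://model-{model_id}.api.baseten.co"
                "/environments/production/sync/v1/chat/completions"
            )
            payload = {
                "model": "Qwen/Qwen3.5-35B-A3B-FP8",
                "messages": [{"role": "user", "content": prompt}],
            }
            return urllib.request.Request(
                url,
                data=json.dumps(payload).encode("utf-8"),
                headers={
                    "Content-Type": "application/json",
                    "Authorization": f"Bearer {api_key}",
                },
                method="POST",
            )

        request = build_chat_request("abc1d2ef", "placeholder-key", "What is machine learning?")
        print(request.full_url)
        ```

        Passing the result to `urllib.request.urlopen` sends it and returns the JSON response body.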

        To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

        ```python theme={"system"}
        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-35B-A3B-FP8",
            messages=[
                {"role": "user", "content": "How many r's in strawberry?"}
            ],
            extra_body={"chat_template_kwargs": {"enable_thinking": True}},
        )
        print(response.choices[0].message.reasoning_content)  # chain of thought
        print(response.choices[0].message.content)            # final answer
        ```

        To let the model call tools, pass a `tools` array. The server returns structured `tool_calls` on the response:

        ```python theme={"system"}
        tools = [{
            "type": "function",
            "function": {
                "name": "get_weather",
                "parameters": {
                    "type": "object",
                    "properties": {"location": {"type": "string"}},
                    "required": ["location"],
                },
            },
        }]

        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-35B-A3B-FP8",
            messages=[
                {"role": "user", "content": "What's the weather in Paris?"}
            ],
            tools=tools,
        )
        print(response.choices[0].message.tool_calls)
        ```
      </Tab>

      <Tab title="Throughput">
        This preset serves Qwen3.5-35B with FP8 weights on a single B200, with prefix caching and chunked prefill enabled. It maximizes aggregate throughput at high concurrency, at the cost of a minor quality impact from FP8.

        <CardGroup cols={4}>
          <Card title="Hardware" icon="microchip">B200</Card>
          <Card title="Engine" icon="server">vLLM 0.22.0</Card>
          <Card title="Context" icon="ruler-horizontal">256K</Card>
          <Card title="Concurrency" icon="layer-group">1000</Card>
        </CardGroup>

        ## Write the config

        Create and move into the project directory:

        ```sh theme={"system"}
        mkdir qwen3.5-35b-throughput && cd qwen3.5-35b-throughput
        ```

        Then create a file named `config.yaml` and paste the following:

        ```yaml config.yaml theme={"system"}
        ########################################################
        # Remove --language-model-only from the start command to turn on multimodal mode
        ########################################################
        model_name: "model:qwen3.5-35b preset:throughput"
        model_metadata:
          description: >-
            Qwen 3.5 35B A3B FP8 MoE throughput on B200, language-only mode (--language-model-only) optional.
          repo_id: Qwen/Qwen3.5-35B-A3B-FP8
          example_model_input:
            model: "Qwen/Qwen3.5-35B-A3B-FP8"
            messages:
              - role: user
                content: "What is the capital of France?"
            stream: true
            max_tokens: 100
            temperature: 0.7
          tags:
            - openai-compatible
        base_image:
          image: vllm/vllm-openai:v0.22.0
        weights:
          - source: "hf://Qwen/Qwen3.5-35B-A3B-FP8@main"
            mount_location: "/app/checkpoint/model"
            auth_secret_name: "hf_access_token"
        secrets:
          hf_access_token: null
        environment_variables:
          VLLM_LOGGING_LEVEL: WARNING
          VLLM_ENGINE_READY_TIMEOUT_S: "3600"
          VLLM_USE_FLASHINFER_MOE_FP8: "0"
          PYTORCH_ALLOC_CONF: "expandable_segments:True"
        docker_server:
          start_command: >-
            sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/model
            --tensor-parallel-size $GPU_COUNT
            --served-model-name Qwen/Qwen3.5-35B-A3B-FP8
            --host 0.0.0.0
            --port 8000
            --language-model-only
            --gpu-memory-utilization 0.95
            --kv-cache-dtype fp8
            --reasoning-parser qwen3
            --enable-chunked-prefill
            --enable-prefix-caching
            --max-num-seqs 512
            --trust-remote-code
            --load-format runai_streamer"
          readiness_endpoint: /health
          liveness_endpoint: /health
          predict_endpoint: /v1/chat/completions
          server_port: 8000
        runtime:
          predict_concurrency: 1000
          health_checks:
            restart_check_delay_seconds: 1800
            restart_threshold_seconds: 1200
            stop_traffic_threshold_seconds: 120
        resources:
          accelerator: B200
          use_gpu: true
        ```

        ## Flags

        The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

        | Flag                       | Value            | What it does                                                                                                                                                |
        | -------------------------- | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
        | `--tensor-parallel-size`   | `$GPU_COUNT`     | Number of GPUs to shard the model across.                                                                                                                   |
        | `--language-model-only`    | (no value)       | Disable the multimodal path; text-only serving. Remove to enable image/video inputs.                                                                        |
        | `--gpu-memory-utilization` | `0.95`           | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                               |
        | `--kv-cache-dtype`         | `fp8`            | KV cache numeric precision. **fp8:** \~2× KV cache density with negligible quality impact on most models.                                                   |
        | `--reasoning-parser`       | `qwen3`          | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
        | `--enable-chunked-prefill` | (no value)       | Process long prompts in chunks so decode requests keep running.                                                                                             |
        | `--enable-prefix-caching`  | (no value)       | Reuse KV cache across requests that share a prefix.                                                                                                         |
        | `--max-num-seqs`           | `512`            | Maximum number of concurrent sequences in the batch.                                                                                                        |
        | `--trust-remote-code`      | (no value)       | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |
        | `--load-format`            | `runai_streamer` | Weight loading backend. **runai\_streamer:** Stream weights from object storage without materializing to disk.                                              |
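
        As a rough illustration of the `--kv-cache-dtype` row, the sketch below estimates how many bytes of KV cache one token occupies in a standard attention stack. The layer count, KV-head count, and head dimension are placeholder values chosen for the arithmetic, not Qwen3.5's actual configuration, and the formula ignores any layers that don't keep a per-token KV cache.

        ```python theme={"system"}
        def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
            """Bytes of KV cache per token: one key and one value vector per layer per KV head."""
            return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem

        # Placeholder dimensions (assumptions for illustration only).
        LAYERS, KV_HEADS, HEAD_DIM = 40, 4, 128

        bf16 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=2)
        fp8 = kv_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, bytes_per_elem=1)

        print(f"BF16: {bf16} bytes/token, FP8: {fp8} bytes/token, ratio: {bf16 / fp8:.1f}x")
        ```

        Halving the bytes per element halves the per-token footprint, so the same memory budget holds roughly twice as many cached tokens.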

        ## Deploy

        Push the config to Baseten:

        ```sh theme={"system"}
        uvx truss push
        ```

        You should see output similar to:

        ```output theme={"system"}
        ✨ Model qwen3.5-35b-throughput was successfully pushed ✨

           Model ID:      abc1d2ef
           Deployment ID: xyz123
           Endpoint:      model-abc1d2ef.api.baseten.co
           Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
        ```

        Your **model ID** is printed in the `truss push` output (`abc1d2ef` in the example). Use it wherever you see `{model_id}` in the next section.
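
        One way to avoid hand-editing the placeholder is to build the base URL from the model ID in code. The sketch below assumes the `production` environment path used in the next section; `baseten_base_url` is a hypothetical helper name, not part of any SDK.

        ```python theme={"system"}
        def baseten_base_url(model_id: str, environment: str = "production") -> str:
            """Build the OpenAI-compatible base URL for a deployment (hypothetical helper)."""
            return f"https://model-{model_id}.api.baseten.co/environments/{environment}/sync/v1"

        # Model ID copied from the example `truss push` output.
        print(baseten_base_url("abc1d2ef"))
        ```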

        ## Call the model

        Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.

        Now call your deployment to run inference:

        <Tabs>
          <Tab title="Python">
            ```python main.py theme={"system"}
            import os
            from openai import OpenAI

            client = OpenAI(
                api_key=os.environ["BASETEN_API_KEY"],
                base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
            )

            response = client.chat.completions.create(
                model="Qwen/Qwen3.5-35B-A3B-FP8",
                messages=[
                    {"role": "user", "content": "What is machine learning?"}
                ],
            )

            print(response.choices[0].message.content)
            ```
          </Tab>

          <Tab title="cURL">
            ```sh theme={"system"}
            curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
              -H "Content-Type: application/json" \
              -H "Authorization: Bearer $BASETEN_API_KEY" \
              -d '{
                "model": "Qwen/Qwen3.5-35B-A3B-FP8",
                "messages": [
                  {"role": "user", "content": "What is machine learning?"}
                ]
              }'
            ```
          </Tab>
        </Tabs>

        To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

        ```python theme={"system"}
        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-35B-A3B-FP8",
            messages=[
                {"role": "user", "content": "How many r's in strawberry?"}
            ],
            extra_body={"chat_template_kwargs": {"enable_thinking": True}},
        )
        print(response.choices[0].message.reasoning_content)  # chain of thought
        print(response.choices[0].message.content)            # final answer
        ```
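
        If thinking is disabled, the `reasoning_content` attribute may be missing or empty on the returned message. A defensive accessor like the sketch below avoids an `AttributeError` in that case. `split_reasoning` is a hypothetical helper, and the stand-in objects only mimic the shape of the SDK's message.

        ```python theme={"system"}
        from types import SimpleNamespace

        def split_reasoning(message) -> tuple[str, str]:
            """Return (reasoning, answer), treating a missing or None field as an empty string."""
            reasoning = getattr(message, "reasoning_content", None) or ""
            answer = getattr(message, "content", None) or ""
            return reasoning, answer

        # Stand-ins for `response.choices[0].message` (not real API responses).
        with_thinking = SimpleNamespace(reasoning_content="Count the letters.", content="Three.")
        without_thinking = SimpleNamespace(content="Three.")

        print(split_reasoning(with_thinking))
        print(split_reasoning(without_thinking))
        ```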
      </Tab>
    </Tabs>
  </Tab>

  <Tab title="122B">
    [Qwen/Qwen3.5-122B-A10B](https://huggingface.co/Qwen/Qwen3.5-122B-A10B) is a 122B-parameter MoE model (10B active per token) with up to 256K context.

    This preset serves Qwen3.5-122B with FP8 weights across four H100 GPUs (`H100:4`), keeping time-to-first-token low while fitting the full model on a single H100 node.

    <CardGroup cols={4}>
      <Card title="Hardware" icon="microchip">H100 × 4</Card>
      <Card title="Engine" icon="server">vLLM (0.22.0-cu129 build)</Card>
      <Card title="Context" icon="ruler-horizontal">32K</Card>
      <Card title="Concurrency" icon="layer-group">128</Card>
    </CardGroup>

    ## Write the config

    Create and move into the project directory:

    ```sh theme={"system"}
    mkdir qwen3.5-122b-latency && cd qwen3.5-122b-latency
    ```

    Then create a file named `config.yaml` and paste the following:

    ```yaml config.yaml theme={"system"}
    model_name: "model:qwen3.5-122b preset:latency"
    model_metadata:
      description: >-
        Qwen 3.5 122B A10B MoE instruct FP8 weights, TP=4 latency preset via vLLM with Qwen parsers.
      repo_id: Qwen/Qwen3.5-122B-A10B-FP8
      example_model_input:
        model: "Qwen/Qwen3.5-122B-A10B-FP8"
        messages:
          - role: user
            content: "What is the capital of France?"
        stream: true
        max_tokens: 100
        temperature: 0.7
      tags:
        - openai-compatible
    base_image:
      image: vllm/vllm-openai:v0.22.0-cu129
    weights:
      - source: "hf://Qwen/Qwen3.5-122B-A10B-FP8@main"
        mount_location: "/app/checkpoint/model"
        auth_secret_name: "hf_access_token"
    secrets:
      hf_access_token: null
    docker_server:
      start_command: >-
        sh -c "GPU_COUNT=$(nvidia-smi --list-gpus | wc -l) && vllm serve /app/checkpoint/model
        --served-model-name Qwen/Qwen3.5-122B-A10B-FP8
        --host 0.0.0.0
        --port 8000
        --gpu-memory-utilization 0.95
        --max-model-len 32768
        --kv-cache-dtype fp8
        --tensor-parallel-size $GPU_COUNT
        --reasoning-parser qwen3
        --enable-auto-tool-choice
        --tool-call-parser qwen3_coder
        --trust-remote-code
        --enable-prefix-caching
        --load-format runai_streamer"
      readiness_endpoint: /health
      liveness_endpoint: /health
      predict_endpoint: /v1/chat/completions
      server_port: 8000
    environment_variables:
      VLLM_LOGGING_LEVEL: WARNING
      VLLM_ENGINE_READY_TIMEOUT_S: "3600"
    runtime:
      predict_concurrency: 128
      health_checks:
        restart_check_delay_seconds: 1800
        restart_threshold_seconds: 1200
        stop_traffic_threshold_seconds: 120
    resources:
      accelerator: H100:4
      use_gpu: true
    ```

    ## Flags

    The `start_command` passes these flags to the engine. Each one controls a runtime or serving behavior:

    | Flag                        | Value            | What it does                                                                                                                                                |
    | --------------------------- | ---------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | `--gpu-memory-utilization`  | `0.95`           | Fraction of GPU memory vLLM may use for weights and KV cache.                                                                                               |
    | `--max-model-len`           | `32768`          | Maximum context length (tokens) the server accepts per request.                                                                                             |
    | `--kv-cache-dtype`          | `fp8`            | KV cache numeric precision. **fp8:** \~2× KV cache density with negligible quality impact on most models.                                                   |
    | `--tensor-parallel-size`    | `$GPU_COUNT`     | Number of GPUs to shard the model across.                                                                                                                   |
    | `--reasoning-parser`        | `qwen3`          | Server-side parser that separates reasoning output into `reasoning_content`. **qwen3:** Qwen3-family thinking format (used by Qwen3, Qwen3.5, and Qwen3.6). |
    | `--enable-auto-tool-choice` | (no value)       | Let the model choose when to call tools without requiring `tool_choice: "required"`.                                                                        |
    | `--tool-call-parser`        | `qwen3_coder`    | Server-side parser that emits structured `tool_calls` on the response. **qwen3\_coder:** Qwen3-Coder tool format.                                           |
    | `--trust-remote-code`       | (no value)       | Execute model-specific Python from the checkpoint (required for many Qwen, Phi, and custom architectures).                                                  |
    | `--enable-prefix-caching`   | (no value)       | Reuse KV cache across requests that share a prefix.                                                                                                         |
    | `--load-format`             | `runai_streamer` | Weight loading backend. **runai\_streamer:** Stream weights from object storage without materializing to disk.                                              |

    ## Deploy

    Push the config to Baseten:

    ```sh theme={"system"}
    uvx truss push
    ```

    You should see output similar to:

    ```output theme={"system"}
    ✨ Model qwen3.5-122b-latency was successfully pushed ✨

       Model ID:      abc1d2ef
       Deployment ID: xyz123
       Endpoint:      model-abc1d2ef.api.baseten.co
       Logs:          https://app.baseten.co/models/abc1d2ef/logs/xyz123
    ```

    Your **model ID** is printed in the `truss push` output (`abc1d2ef` in the example). Use it wherever you see `{model_id}` in the next section.

    ## Call the model

    Your deployment serves an OpenAI-compatible API. Replace `{model_id}` with your model ID and make sure `BASETEN_API_KEY` is set.
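
    A small preflight check can surface a missing key before the first request. This is a sketch; `require_env` is a hypothetical helper name.

    ```python theme={"system"}
    import os

    def require_env(name: str, env=os.environ) -> str:
        """Return the variable's value, or raise a clear error if it is unset or empty."""
        value = env.get(name, "")
        if not value:
            raise RuntimeError(f"{name} is not set; export it before calling the model")
        return value

    # Usage: api_key = require_env("BASETEN_API_KEY")
    ```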

    Now call your deployment to run inference:

    <Tabs>
      <Tab title="Python">
        ```python main.py theme={"system"}
        import os
        from openai import OpenAI

        client = OpenAI(
            api_key=os.environ["BASETEN_API_KEY"],
            base_url="https://model-{model_id}.api.baseten.co/environments/production/sync/v1",
        )

        response = client.chat.completions.create(
            model="Qwen/Qwen3.5-122B-A10B-FP8",
            messages=[
                {"role": "user", "content": "What is machine learning?"}
            ],
        )

        print(response.choices[0].message.content)
        ```
      </Tab>

      <Tab title="cURL">
        ```sh theme={"system"}
        curl -s https://model-{model_id}.api.baseten.co/environments/production/sync/v1/chat/completions \
          -H "Content-Type: application/json" \
          -H "Authorization: Bearer $BASETEN_API_KEY" \
          -d '{
            "model": "Qwen/Qwen3.5-122B-A10B-FP8",
            "messages": [
              {"role": "user", "content": "What is machine learning?"}
            ]
          }'
        ```
      </Tab>
    </Tabs>

    To access the model's chain of thought, enable thinking mode. The server parses the reasoning output into a separate `reasoning_content` field on the response:

    ```python theme={"system"}
    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-122B-A10B-FP8",
        messages=[
            {"role": "user", "content": "How many r's in strawberry?"}
        ],
        extra_body={"chat_template_kwargs": {"enable_thinking": True}},
    )
    print(response.choices[0].message.reasoning_content)  # chain of thought
    print(response.choices[0].message.content)            # final answer
    ```
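
    The preset's `example_model_input` sets `stream: true`. When streaming, reasoning and answer text arrive as deltas that the client has to accumulate. The sketch below assumes each chunk's delta exposes `reasoning_content` alongside `content`, mirroring the non-streaming field; verify this against your deployment. `collect_stream` and `fake_chunk` are hypothetical helpers, exercised here with stand-in chunks rather than a live response.

    ```python theme={"system"}
    from types import SimpleNamespace

    def collect_stream(chunks) -> tuple[str, str]:
        """Accumulate (reasoning, answer) text from an iterable of chat completion chunks."""
        reasoning, answer = [], []
        for chunk in chunks:
            if not chunk.choices:  # skip chunks that carry no choices
                continue
            delta = chunk.choices[0].delta
            reasoning.append(getattr(delta, "reasoning_content", None) or "")
            answer.append(getattr(delta, "content", None) or "")
        return "".join(reasoning), "".join(answer)

    def fake_chunk(**delta):
        """Build a stand-in chunk with the same attribute layout as an SDK chunk."""
        return SimpleNamespace(choices=[SimpleNamespace(delta=SimpleNamespace(**delta))])

    chunks = [
        fake_chunk(reasoning_content="s-t-r-a-w-b-e-r-r-y has "),
        fake_chunk(reasoning_content="three r's."),
        fake_chunk(content="3"),
        SimpleNamespace(choices=[]),
    ]
    print(collect_stream(chunks))
    ```

    Against a live deployment, pass `stream=True` to `client.chat.completions.create` and hand the returned iterator to `collect_stream`.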

    To let the model call tools, pass a `tools` array. When the model decides to call one, the server returns structured `tool_calls` on the response:

    ```python theme={"system"}
    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
                "required": ["location"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="Qwen/Qwen3.5-122B-A10B-FP8",
        messages=[
            {"role": "user", "content": "What's the weather in Paris?"}
        ],
        tools=tools,
    )
    print(response.choices[0].message.tool_calls)
    ```
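
    Each `tool_calls` entry carries a function name and a JSON-encoded argument string; your code is responsible for running the function and sending the result back in a `tool` message. The sketch below shows that dispatch step. `get_weather` is a stub with a made-up return value, and `TOOL_REGISTRY` and `run_tool_call` are hypothetical names.

    ```python theme={"system"}
    import json
    from types import SimpleNamespace

    def get_weather(location: str) -> dict:
        """Stub tool: a real implementation would query a weather service."""
        return {"location": location, "forecast": "unknown (stub)"}

    TOOL_REGISTRY = {"get_weather": get_weather}

    def run_tool_call(tool_call) -> dict:
        """Execute one tool call and return the `tool` message to append to the conversation."""
        func = TOOL_REGISTRY[tool_call.function.name]
        arguments = json.loads(tool_call.function.arguments)  # arguments arrive as a JSON string
        return {
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(func(**arguments)),
        }

    # Stand-in for one entry of `response.choices[0].message.tool_calls`.
    example_call = SimpleNamespace(
        id="call_0",
        function=SimpleNamespace(name="get_weather", arguments='{"location": "Paris"}'),
    )
    tool_message = run_tool_call(example_call)
    print(tool_message)
    ```

    To get the final answer, append the assistant message that contained the tool calls, followed by each `tool` message, to `messages` and call `client.chat.completions.create` again.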
  </Tab>
</Tabs>
