Some Model APIs support extended thinking, where the model reasons through a problem before producing a final answer. The reasoning process generates additional tokens that appear in a separate reasoning_content field, distinct from the final response.

Supported models

| Model | Slug | Reasoning |
| --- | --- | --- |
| DeepSeek V4 Pro | deepseek-ai/DeepSeek-V4-Pro | Enabled by default |
| GPT OSS 120B | openai/gpt-oss-120b | Enabled by default |
| GLM 5.2 | zai-org/GLM-5.2 | Enabled by default |
| Kimi K2.5 | moonshotai/Kimi-K2.5 | Opt-in through chat_template_args |
| Kimi K2.6 | moonshotai/Kimi-K2.6 | Opt-in through chat_template_args |
| Kimi K2.7 Code | moonshotai/Kimi-K2.7-Code | Opt-in through chat_template_args |
| GLM 4.7 | zai-org/GLM-4.7 | Opt-in through chat_template_args |
| GLM 5 | zai-org/GLM-5 | Opt-in through chat_template_args |
| GLM 5.1 | zai-org/GLM-5.1 | Opt-in through chat_template_args |
| Nemotron Super | nvidia/Nemotron-120B-A12B | Opt-in through chat_template_args |
| Nemotron Ultra | nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B | Opt-in through chat_template_args |
DeepSeek V4 Pro, GPT OSS 120B, and GLM 5.2 also support reasoning_effort. Models not listed here don’t support reasoning.
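The table above can be encoded as a small lookup so client code knows whether a model needs the opt-in flag. This is a minimal sketch, not part of the API: the names DEFAULT_ON, OPT_IN, and thinking_extra_body are hypothetical, and the enable_thinking flag is the one shown under "Enable thinking" below.

```python
# Reasoning behavior per model slug, copied from the "Supported models" table.
DEFAULT_ON = {
    "deepseek-ai/DeepSeek-V4-Pro",
    "openai/gpt-oss-120b",
    "zai-org/GLM-5.2",
}
OPT_IN = {
    "moonshotai/Kimi-K2.5",
    "moonshotai/Kimi-K2.6",
    "moonshotai/Kimi-K2.7-Code",
    "zai-org/GLM-4.7",
    "zai-org/GLM-5",
    "zai-org/GLM-5.1",
    "nvidia/Nemotron-120B-A12B",
    "nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B",
}


def thinking_extra_body(slug: str) -> dict:
    """Return the extra_body needed to get reasoning from `slug` (hypothetical helper)."""
    if slug in DEFAULT_ON:
        return {}  # reasoning is already on; nothing extra to send
    if slug in OPT_IN:
        return {"chat_template_args": {"enable_thinking": True}}
    # Unlisted models don't support reasoning, so fail loudly rather than guess.
    raise ValueError(f"{slug} is not listed as supporting reasoning")
```

Raising for unlisted slugs is a design choice in this sketch; returning an empty dict would also work if you prefer requests to proceed without reasoning.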

Enable thinking

For models marked opt-in in the table above, enable thinking by setting enable_thinking to True inside chat_template_args.
Because chat_template_args is not part of the standard OpenAI API, pass it through extra_body:
enable_thinking.py
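from openai import OpenAI
import os

client = OpenAI(base_url="https://inference.baseten.co/v1", api_key=os.environ.get("BASETEN_API_KEY"))

# With stream=True, the call returns an iterator of chunks rather than a single response.
response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    extra_body={"chat_template_args": {"enable_thinking": True}},
    max_tokens=4096,
    stream=True,
)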

Control reasoning depth

The reasoning_effort parameter controls how thoroughly the model reasons through a problem. DeepSeek V4 Pro, GPT OSS 120B, and GLM 5.2 support this parameter. Supported values vary by model:
| Model | Supported values |
| --- | --- |
| DeepSeek V4 Pro | none, minimal, low, medium (default), high, xhigh, max |
| GPT OSS 120B | none, minimal, low, medium (default), high, xhigh, max |
| GLM 5.2 | none, high, max |
Lower values return faster responses with less thorough reasoning; higher values reason longer and use more output tokens. none disables reasoning entirely. GLM 5.2 returns a 400 error for values outside its set.
Some model templates also read reasoning_effort from inside chat_template_args (GLM 5.2 honors both placements). Prefer the top-level parameter: the API validates it and returns a 400 for invalid values, but it doesn’t validate the contents of chat_template_args, so mistakes there fail silently.
A successful request doesn’t guarantee that reasoning_effort took effect: models not listed in this table accept the parameter but ignore it.
Because reasoning_effort here extends the standard OpenAI API, pass it through extra_body, which the OpenAI Python SDK merges into the top level of the request body:
reasoning_effort.py
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY")
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "What is the sum of the first 100 prime numbers?"}
    ],
    extra_body={"reasoning_effort": "high"},
)

print(response.choices[0].message.content)
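Since unlisted models silently ignore reasoning_effort, a client-side check against the table of supported values can catch mistakes before a request is sent. The following is a minimal sketch: SUPPORTED_EFFORTS and effort_extra_body are hypothetical names, and the values are copied from the table above, so they need updating whenever that table changes.

```python
# Allowed reasoning_effort values per model slug, copied from the table above.
_FULL_RANGE = ("none", "minimal", "low", "medium", "high", "xhigh", "max")
SUPPORTED_EFFORTS = {
    "deepseek-ai/DeepSeek-V4-Pro": _FULL_RANGE,
    "openai/gpt-oss-120b": _FULL_RANGE,
    "zai-org/GLM-5.2": ("none", "high", "max"),
}


def effort_extra_body(model: str, effort: str) -> dict:
    """Build extra_body for reasoning_effort, rejecting unsupported combinations."""
    allowed = SUPPORTED_EFFORTS.get(model)
    if allowed is None:
        # The API would accept the parameter here but ignore it.
        raise ValueError(f"{model} is not documented as honoring reasoning_effort")
    if effort not in allowed:
        raise ValueError(f"{effort!r} is not valid for {model}; choose one of {allowed}")
    return {"reasoning_effort": effort}
```

The returned dict can be passed as extra_body to client.chat.completions.create, as in the request above.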
Reasoning improves quality for tasks that benefit from step-by-step thinking: mathematical calculations, multi-step logic problems, code generation with complex requirements, and analysis requiring multiple considerations. For straightforward tasks like simple Q&A or text generation, reasoning adds latency and token cost without improving quality. In these cases, use a model without reasoning support, or lower reasoning_effort to a value the model supports (for example low, or none to disable reasoning entirely).

Parse the response

The model’s thinking process appears in reasoning_content, separate from the final answer in content.
Both fields are returned on the message object, so you can read them directly:
parse_reasoning.py
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY"),
)

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.6",
    messages=[{"role": "user", "content": "Is 91 a prime number? Answer in one sentence."}],
    extra_body={"chat_template_args": {"enable_thinking": True}},
)

message = response.choices[0].message
print("Reasoning:", message.reasoning_content)
print("Answer:", message.content)
The response body contains both fields on the assistant message:
Response
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "The user is asking whether 91 is a prime number... 91 = 7 × 13, so it is not prime...",
        "content": "No, 91 is not a prime number because it can be factored as $7 \\times 13$."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "completion_tokens": 203,
    "total_tokens": 224
  }
}
Reasoning tokens are included in completion_tokens and count toward your total usage and billing.
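To make the accounting concrete, the sketch below reads the same fields from a raw response body and checks the usage arithmetic. The body dict is an abbreviated copy of the response shown above, summarize is a hypothetical helper, and treating reasoning_content as optional is an assumption for responses where thinking is off.

```python
# Abbreviated copy of the response body shown above, as a Python dict.
body = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "reasoning_content": "91 = 7 x 13, so it is not prime...",
                "content": "No, 91 is not a prime number.",
            }
        }
    ],
    "usage": {"prompt_tokens": 21, "completion_tokens": 203, "total_tokens": 224},
}


def summarize(response_body: dict) -> dict:
    """Extract reasoning, answer, and token accounting from a response body."""
    message = response_body["choices"][0]["message"]
    usage = response_body["usage"]
    return {
        # .get() tolerates a missing field (assumed possible when thinking is off).
        "reasoning": message.get("reasoning_content"),
        "answer": message["content"],
        # completion_tokens covers reasoning tokens plus answer tokens.
        "output_tokens": usage["completion_tokens"],
        "totals_consistent": (
            usage["prompt_tokens"] + usage["completion_tokens"] == usage["total_tokens"]
        ),
    }


summary = summarize(body)
print(summary["output_tokens"], summary["totals_consistent"])
```

Here 21 prompt tokens plus 203 completion tokens give the reported total of 224. The usage block shown above has no separate reasoning count, so the 203 output tokens include both the reasoning and the one-sentence answer.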

Next steps

Model APIs overview: supported models, pricing, and the feature support matrix

Structured outputs: constrain reasoning models to a JSON schema