Control extended thinking for reasoning-capable models
Some Model APIs support extended thinking, where the model reasons through a problem before producing a final answer. The reasoning process generates additional tokens that appear in a separate reasoning_content field, distinct from the final response.
For models marked opt-in in the table above, enable thinking by passing chat_template_args.
In Python, pass chat_template_args through extra_body, since it extends the standard OpenAI API:
enable_thinking.py
import os
from openai import OpenAI

# Client setup for the OpenAI-compatible endpoint.
client = OpenAI(base_url="https://inference.baseten.co/v1", api_key=os.environ.get("BASETEN_API_KEY"))

response = client.chat.completions.create(
    model="moonshotai/Kimi-K2.5",
    messages=[{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    extra_body={"chat_template_args": {"enable_thinking": True}},
    max_tokens=4096,
    stream=True,
)
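In JavaScript, include chat_template_args directly in the request options:
enable_thinking.js
import OpenAI from "openai";

// Client setup for the OpenAI-compatible endpoint.
const client = new OpenAI({ baseURL: "https://inference.baseten.co/v1", apiKey: process.env.BASETEN_API_KEY });

const response = await client.chat.completions.create({
  model: "moonshotai/Kimi-K2.5",
  messages: [{ role: "user", content: "What is the sum of the first 100 prime numbers?" }],
  chat_template_args: { enable_thinking: true },
  max_tokens: 4096,
  stream: true,
});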
With cURL, include chat_template_args in the JSON request body:
Request
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "moonshotai/Kimi-K2.5",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "chat_template_args": {"enable_thinking": true},
    "max_tokens": 4096,
    "stream": false
  }'
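The Python example above sets stream=True but does not show how the stream is read. The sketch below separates reasoning text from answer text as chunks arrive. It assumes each streamed delta exposes reasoning_content alongside content, mirroring the non-streaming message; this page does not confirm the streaming shape, so treat that attribute name as an assumption. collect_stream and fake_chunk are hypothetical helpers, and the fake chunks stand in for a live response so the block runs without network access.

```python
from types import SimpleNamespace


def collect_stream(chunks):
    """Accumulate (reasoning, answer) text from an iterable of streamed chunks."""
    reasoning_parts, answer_parts = [], []
    for chunk in chunks:
        if not chunk.choices:  # skip chunks that carry no choices
            continue
        delta = chunk.choices[0].delta
        # Assumption: reasoning text arrives in delta.reasoning_content.
        reasoning_piece = getattr(delta, "reasoning_content", None)
        answer_piece = getattr(delta, "content", None)
        if reasoning_piece:
            reasoning_parts.append(reasoning_piece)
        if answer_piece:
            answer_parts.append(answer_piece)
    return "".join(reasoning_parts), "".join(answer_parts)


def fake_chunk(reasoning=None, content=None):
    """Build a stand-in for one streamed chunk (illustration only)."""
    delta = SimpleNamespace(reasoning_content=reasoning, content=content)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


demo_chunks = [
    fake_chunk(reasoning="step one, "),
    fake_chunk(reasoning="step two"),
    fake_chunk(content="final answer"),
    SimpleNamespace(choices=[]),
]
reasoning, answer = collect_stream(demo_chunks)
print(repr(reasoning), repr(answer))
```

With a live request, the object returned by client.chat.completions.create(..., stream=True) would be passed as chunks in place of demo_chunks.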
The reasoning_effort parameter controls how thoroughly the model reasons through a problem. DeepSeek V4 Pro, GPT OSS 120B, and GLM 5.2 support this parameter. Supported values vary by model:
| Model | Supported values |
| --- | --- |
| DeepSeek V4 Pro | none, minimal, low, medium (default), high, xhigh, max |
| GPT OSS 120B | none, minimal, low, medium (default), high, xhigh, max |
| GLM 5.2 | none, high, max |
Lower values return faster responses with less thorough reasoning; higher values reason longer and cost more output tokens. none disables reasoning entirely. GLM 5.2 returns a 400 error for values outside its set.
Some model templates also read reasoning_effort from inside chat_template_args (GLM 5.2 honors both placements). Use the top-level parameter: the API validates it and returns a 400 for invalid values, but doesn’t validate chat_template_args contents, so mistakes there fail silently.
A successful request doesn’t mean reasoning_effort took effect. Models not listed in this table accept the parameter but ignore it.
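Because unlisted models accept reasoning_effort but ignore it, a client-side check can catch mistakes before a request is sent. The following is a minimal sketch that mirrors the table above. SUPPORTED_EFFORTS and check_reasoning_effort are hypothetical names, not part of the API, and the value sets are copied from this page, so they would need updating if the supported values change.

```python
# Hypothetical client-side guard; values copied from the table above.
FULL_RANGE = ("none", "minimal", "low", "medium", "high", "xhigh", "max")

SUPPORTED_EFFORTS = {
    "deepseek-ai/DeepSeek-V4-Pro": FULL_RANGE,
    "openai/gpt-oss-120b": FULL_RANGE,
    "zai-org/GLM-5.2": ("none", "high", "max"),
}


def check_reasoning_effort(model: str, effort: str) -> str:
    """Return effort if the model is documented to honor it, else raise ValueError."""
    allowed = SUPPORTED_EFFORTS.get(model)
    if allowed is None:
        # Unlisted models would accept the parameter but silently ignore it.
        raise ValueError(f"{model} is not documented to honor reasoning_effort")
    if effort not in allowed:
        raise ValueError(f"{model} supports {', '.join(allowed)}; got {effort!r}")
    return effort
```

One way to use it is to wrap the value at the call site, for example extra_body={"reasoning_effort": check_reasoning_effort(model, "high")}, so an unsupported combination fails locally instead of being ignored.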
DeepSeek V4 Pro
In Python, pass reasoning_effort through extra_body, since it extends the standard OpenAI API:
reasoning_effort.py
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY"),
)

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-V4-Pro",
    messages=[
        {"role": "user", "content": "What is the sum of the first 100 prime numbers?"}
    ],
    extra_body={"reasoning_effort": "high"},
)

print(response.choices[0].message.content)
In JavaScript, include reasoning_effort directly in the request options:
reasoning_effort.js
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
  model: "deepseek-ai/DeepSeek-V4-Pro",
  messages: [
    { role: "user", content: "What is the sum of the first 100 prime numbers?" }
  ],
  reasoning_effort: "high",
});

console.log(response.choices[0].message.content);
With cURL, include reasoning_effort in the JSON request body:
Request
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V4-Pro",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "reasoning_effort": "high"
  }'
GPT OSS 120B
In Python, pass reasoning_effort through extra_body, since it extends the standard OpenAI API:
reasoning_effort.py
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY"),
)

response = client.chat.completions.create(
    model="openai/gpt-oss-120b",
    messages=[
        {"role": "user", "content": "What is the sum of the first 100 prime numbers?"}
    ],
    extra_body={"reasoning_effort": "high"},
)

print(response.choices[0].message.content)
In JavaScript, include reasoning_effort directly in the request options:
reasoning_effort.js
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
  model: "openai/gpt-oss-120b",
  messages: [
    { role: "user", content: "What is the sum of the first 100 prime numbers?" }
  ],
  reasoning_effort: "high",
});

console.log(response.choices[0].message.content);
With cURL, include reasoning_effort in the JSON request body:
Request
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "openai/gpt-oss-120b",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "reasoning_effort": "high"
  }'
GLM 5.2
In Python, pass reasoning_effort through extra_body, since it extends the standard OpenAI API:
reasoning_effort.py
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://inference.baseten.co/v1",
    api_key=os.environ.get("BASETEN_API_KEY"),
)

response = client.chat.completions.create(
    model="zai-org/GLM-5.2",
    messages=[
        {"role": "user", "content": "What is the sum of the first 100 prime numbers?"}
    ],
    extra_body={"reasoning_effort": "high"},
)

print(response.choices[0].message.content)
In JavaScript, include reasoning_effort directly in the request options:
reasoning_effort.js
import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://inference.baseten.co/v1",
  apiKey: process.env.BASETEN_API_KEY,
});

const response = await client.chat.completions.create({
  model: "zai-org/GLM-5.2",
  messages: [
    { role: "user", content: "What is the sum of the first 100 prime numbers?" }
  ],
  reasoning_effort: "high",
});

console.log(response.choices[0].message.content);
With cURL, include reasoning_effort in the JSON request body:
Request
curl https://inference.baseten.co/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $BASETEN_API_KEY" \
  -d '{
    "model": "zai-org/GLM-5.2",
    "messages": [{"role": "user", "content": "What is the sum of the first 100 prime numbers?"}],
    "reasoning_effort": "high"
  }'
Reasoning improves quality for tasks that benefit from step-by-step thinking: mathematical calculations, multi-step logic problems, code generation with complex requirements, and analysis requiring multiple considerations.
For straightforward tasks like simple Q&A or text generation, reasoning adds latency and token cost without improving quality. In these cases, use a model without reasoning support or set reasoning_effort to low.
The response body contains both fields, reasoning_content and content, on the assistant message:
Response
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "The user is asking whether 91 is a prime number... 91 = 7 × 13, so it is not prime...",
        "content": "No, 91 is not a prime number because it can be factored as $7 \\times 13$."
      }
    }
  ],
  "usage": {
    "prompt_tokens": 21,
    "completion_tokens": 203,
    "total_tokens": 224
  }
}
Reasoning tokens are included in completion_tokens and count toward your total usage and billing.
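To make the split between the two fields concrete, the sketch below parses a response shaped like the example above with Python's standard json module. split_reasoning is a hypothetical helper name, and the sample payload is abbreviated from the example. It also assumes reasoning_content may be missing or null when the model did not reason, which this page does not state.

```python
import json

# Abbreviated copy of the example response shown above.
raw = """
{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "reasoning_content": "91 = 7 x 13, so it is not prime...",
        "content": "No, 91 is not a prime number."
      }
    }
  ],
  "usage": {"prompt_tokens": 21, "completion_tokens": 203, "total_tokens": 224}
}
"""


def split_reasoning(body: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from the first choice of a chat completion body."""
    message = body["choices"][0]["message"]
    # Assumption: the field may be absent or null when reasoning is disabled.
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    return reasoning, answer


body = json.loads(raw)
reasoning, answer = split_reasoning(body)
usage = body["usage"]

print("answer:", answer)
# completion_tokens already includes the reasoning tokens.
print("billed tokens:", usage["prompt_tokens"] + usage["completion_tokens"])
```

Keeping the reasoning text out of anything shown to end users, and logging it separately, is one reasonable design choice here, since only content is the final answer.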