Deployment starts with a config.yaml that specifies the model, the hardware, and the engine; then uvx truss push builds a TensorRT-optimized container and deploys it. No Python code, no Dockerfile, no container management.
This guide walks through deploying Qwen 2.5 3B Instruct, a small but capable LLM, from a config file to a production API. You’ll set up Truss, write a config, deploy to Baseten, and call the model’s OpenAI-compatible endpoint.
Set up your environment
Before you begin:
- Sign up or sign in to Baseten.
- Install uv, a fast Python package manager. This guide uses uvx to run Truss commands without a separate install step.
Authenticate with Baseten
Generate an API key from Settings > API keys, then log in (for example, with uvx truss login) and paste the key when prompted.
Create a Truss project
Scaffold a new project with truss init. When prompted for a model name, enter Qwen 2.5 3B.
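The scaffold step might look like the following; the directory name qwen-2-5-3b is an assumption, and you can pick any name:

```shell
# Scaffold a new Truss project into the qwen-2-5-3b directory
uvx truss init qwen-2-5-3b
cd qwen-2-5-3b
```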
This creates config.yaml, a model/ directory, and supporting files. For engine-based deployments like this one, you only need config.yaml. The model/ directory is for custom Python code when you need custom preprocessing, postprocessing, or unsupported model architectures.
Write the config
Replace the contents of config.yaml with:
config.yaml
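A sketch of the config, using the fields explained below; the exact nesting is an assumption, so check Baseten's Engine Builder reference for the current schema:

```yaml
model_name: Qwen 2.5 3B Instruct
resources:
  accelerator: L4
trt_llm:
  build:
    checkpoint_repository:
      source: HF
      repo: Qwen/Qwen2.5-3B-Instruct
    quantization_type: fp8
    max_seq_len: 8192
```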
- model_name identifies the model in your Baseten dashboard.
- resources selects an L4 GPU (24 GB VRAM), which is plenty for a 3B-parameter model.
- trt_llm tells Baseten to use Engine-Builder-LLM, which compiles the model with TensorRT-LLM for optimized inference.
- checkpoint_repository points to the model weights on Hugging Face. Qwen 2.5 3B Instruct is ungated, so no access token is needed.
- quantization_type: fp8 compresses weights to 8-bit floating point, cutting memory usage roughly in half with negligible quality loss.
- max_seq_len: 8192 sets the maximum context length for requests.
Deploy
Push the model to Baseten. The output includes a logs URL containing your model ID, the string after /models/ (for example, abc1d2ef). You'll need this to call the model's API. You can also find it in your Baseten dashboard.
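Assuming uv is installed and you are logged in, the push step looks like:

```shell
# Build and deploy the model described in config.yaml
uvx truss push
```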
Baseten downloads the model weights from Hugging Face, compiles them with TensorRT-LLM, and deploys the resulting container to an L4 GPU. This build step takes roughly 10-20 minutes for the first deploy. You can watch progress in the logs linked above.
Engine-based deployments (TRT-LLM) use published deployments by default. The --watch flag, which creates a development deployment with live reload, is not supported for TRT-LLM models. For custom Python models, see Customize a model, where --watch enables a faster development loop.
Call the model
Engine-based deployments serve an OpenAI-compatible API. Once the deployment shows “Active” in the dashboard, call it using the OpenAI SDK or cURL. Replace {model_id} with your model ID from the deployment output.
Install the OpenAI SDK if you don't have it (pip install openai), then create a chat completion. The SDK works with any OpenAI-compatible API; you just point base_url at your model's endpoint.
Iterate on your model
To update your deployed model, edit config.yaml and run uvx truss push again:
Next steps
Engine configuration
Tune max sequence length, batch size, quantization, and runtime settings.
Customize a model
Add custom Python code when you need preprocessing, postprocessing, or unsupported model architectures.
Autoscaling
Configure replicas, concurrency targets, and scale-to-zero for production traffic.