vLLM supports a wide range of models and performance optimizations. This guide deploys a vLLM model as a custom Docker server on Baseten. This configuration serves Qwen 2.5 3B with vLLM on an L4 GPU. The deployment process is the same for larger models like GLM-4.7. Adjust theDocumentation Index
Fetch the complete documentation index at: https://docs.baseten.co/llms.txt
Use this file to discover all available pages before exploring further.
resources and start_command to match your model’s requirements.
Set up your environment
This guide usesuvx to run Truss commands without a separate install step. Sign in to Baseten and install the OpenAI SDK. Browser login opens a tab to approve this device, so there’s no API key to copy and paste.
Sign in to Baseten
Install the OpenAI SDK
Hugging Face access for gated models. Some models require that you accept terms and conditions on Hugging Face before deployment. To prevent issues:
- Accept the license for any gated models you wish to access, like Gemma 3.
- Create a read-only user access token from your Hugging Face account.
- Add the
hf_access_tokensecret to your Baseten workspace.
Configure the model
Create a directory with aconfig.yaml file:
config.yaml:
config.yaml
base_image specifies the vLLM Docker image. The weights block uses the Baseten Delivery Network to mirror the model from Hugging Face and mount it at /models/qwen before the container starts. vLLM reads weights directly from that path and serves the model with --served-model-name, which sets the model identifier for the OpenAI-compatible API. The health_checks settings control how Baseten monitors the server after it passes the startup probe.
Deploy the model
Push the model to Baseten to start the deployment:Call the model
Call the deployed model with the OpenAI client:call_model.py
model_url with the URL from your deployment output.
Route through an external LLM gateway
To route traffic from a third-party OpenAI-compatible gateway to this deployment, see External LLM gateways. Themodel value the gateway sends must match the --served-model-name in the start_command above.