A cold start is the time a fresh replica spends booting before it can accept traffic. During that boot, the deployment moves through the statuses you see in the dashboard: it leaves Scaled to zero, enters Waking up while the container is scheduled, then Loading model while weights move into GPU memory and the model’s setup code runs. Any request that triggered the scale-up sits in the queue until the deployment reaches Active, so the cold-start duration becomes the latency floor for that request. The diagram below traces a deployment cycling from Scaled to zero through Waking up and Loading model to Active and back, with a queued request waiting through the boot bar before being admitted.
When cold starts happen
Cold starts show up in two situations. The first is scale-from-zero: a deployment with min_replica set to 0 has shut all its replicas down to save cost, so the next request triggers a fresh boot before anything can serve it. The second is during ordinary scaling events: when load crosses the autoscaler’s threshold, every new replica it provisions has to cold-start before joining the pool. The first pattern is the more visible one because the scale-from-zero request is the one waiting on the boot. The second is usually masked by the existing replicas absorbing load while the new ones come online, so users only feel it when load grows faster than the autoscaler can keep up.
What contributes to cold start time
Cold start duration is the sum of three steps that the new replica works through in order:

| Factor | Impact |
|---|---|
| Model loading | Loading model weights (10s–100s of GB), typically the dominant factor. |
| Container pull | Downloading Docker image layers. |
| Initialization | Running your model’s setup code. |
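
To get a feel for how the three factors in the table add up, here is a minimal back-of-the-envelope sketch. It is not a Baseten API: `estimate_cold_start`, its parameters, and the throughput figures in the example call are all hypothetical and chosen for illustration. It also assumes the steps run strictly one after another, so it will overestimate wherever the platform overlaps them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColdStartEstimate:
    """Per-step durations, in seconds, for one replica boot."""
    container_pull_s: float
    model_loading_s: float
    initialization_s: float

    @property
    def total_s(self) -> float:
        # Assumes the steps run sequentially, so their durations add up.
        return self.container_pull_s + self.model_loading_s + self.initialization_s


def estimate_cold_start(image_gb, weights_gb, pull_gb_per_s, load_gb_per_s, init_s):
    """Rough estimate: size divided by throughput for each transfer, plus setup time.

    All inputs are assumptions you measure or guess for your own deployment.
    """
    if pull_gb_per_s <= 0 or load_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return ColdStartEstimate(
        container_pull_s=image_gb / pull_gb_per_s,
        model_loading_s=weights_gb / load_gb_per_s,
        initialization_s=float(init_s),
    )


# Illustrative numbers only: a 5 GB image, 40 GB of weights, 15 s of setup code.
estimate = estimate_cold_start(
    image_gb=5, weights_gb=40, pull_gb_per_s=0.5, load_gb_per_s=1.0, init_s=15
)
print(estimate.total_s)
```

With these made-up inputs, model loading accounts for most of the total, which matches the table's note that it is typically the dominant factor.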
Minimizing cold starts
Keep replicas warm
Set min_replica to 1 or higher so that at least one replica is always ready to serve requests. This eliminates the cold start for the first request but increases cost, because you pay for the replica while it is idle.
Use min_replica ≥ 2 so one replica can fail during maintenance without causing cold starts.
Pre-warm before expected traffic
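
A pre-warm only helps if the extra replicas finish booting before the spike arrives, so the change has to be made at least one cold-start duration ahead of time. The sketch below works out when to raise and restore min_replica. It is a planning helper under stated assumptions, not a Baseten API: `prewarm_schedule`, `safety_margin_s`, and the example times are hypothetical, and applying the resulting settings is left to whatever tooling you already use.

```python
from datetime import datetime, timedelta


def prewarm_schedule(spike_start, spike_end, cold_start_s,
                     baseline_min_replica, peak_min_replica,
                     safety_margin_s=60.0):
    """Return a list of (when, min_replica) changes for one expected spike.

    The first change raises min_replica early enough for new replicas to
    finish their cold start; the second restores the baseline afterwards.
    """
    if peak_min_replica < baseline_min_replica:
        raise ValueError("peak_min_replica must be >= baseline_min_replica")
    if spike_end <= spike_start:
        raise ValueError("spike_end must be after spike_start")
    # Lead time = estimated cold start plus a margin for estimation error.
    raise_at = spike_start - timedelta(seconds=cold_start_s + safety_margin_s)
    return [(raise_at, peak_min_replica), (spike_end, baseline_min_replica)]


# Illustrative only: a spike from 09:00 to 10:00 and a 4-minute cold start.
plan = prewarm_schedule(
    spike_start=datetime(2025, 1, 1, 9, 0),
    spike_end=datetime(2025, 1, 1, 10, 0),
    cold_start_s=240,
    baseline_min_replica=0,
    peak_min_replica=3,
)
for when, min_replica in plan:
    print(when.isoformat(), min_replica)
```

Restoring the baseline at the end of the window keeps the extra idle cost limited to the period where the traffic is actually expected.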
For predictable traffic spikes, increase min replicas before the expected load.
Use longer scale-down delay
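
The effect of the scale-down delay can be illustrated with a toy model. The sketch below assumes a single replica with min_replica set to 0, requests that complete instantly, and an idle timer that restarts on every request. These are simplifications for illustration rather than a description of the actual autoscaler, and `count_cold_starts` is a hypothetical helper.

```python
def count_cold_starts(arrival_times_s, scale_down_delay_s):
    """Count requests that would hit a scaled-to-zero deployment.

    A request is treated as cold when it is the first one, or when the gap
    since the previous request exceeds the scale-down delay (the replica
    has been idle long enough to be shut down).
    """
    cold_starts = 0
    last_arrival = None
    for t in sorted(arrival_times_s):
        if last_arrival is None or t - last_arrival > scale_down_delay_s:
            cold_starts += 1
        last_arrival = t
    return cold_starts


# Illustrative arrivals (seconds): a burst, then a quiet stretch of about 16 minutes.
arrivals = [0, 100, 350, 1300]
print(count_cold_starts(arrivals, scale_down_delay_s=300))
print(count_cold_starts(arrivals, scale_down_delay_s=1000))
```

In this model the longer delay spans the quiet stretch, so only the very first request waits on a boot. The cost is that the replica stays up, and billed, through the dip.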
A longer scale-down delay keeps replicas warm during temporary traffic dips.
Platform optimizations
Baseten automatically applies several optimizations to reduce cold start times:
- Baseten Delivery Network: The weights configuration optimizes cold starts by mirroring weights to Baseten’s infrastructure and caching them close to your model pods. See Baseten Delivery Network (BDN) for full configuration options.
- Image streaming: Optimized images stream into nodes, allowing model loading to begin before the full download completes.
The tradeoff
Cold starts create a fundamental tradeoff between cost and latency:

| Approach | Cost | Latency |
|---|---|---|
| Scale to zero (min_replica: 0) | Lower: no cost when idle | Higher: first request waits for cold start |
| Always on (min_replica: ≥1) | Higher: pay for idle replicas | Lower: no cold starts |
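
One way to weigh the two rows of the table is to put rough numbers on each side. The sketch below is simple arithmetic under assumed inputs: the function names, the hourly price, and the traffic figures are hypothetical and not taken from Baseten pricing.

```python
def idle_cost(idle_hours_per_day, cost_per_replica_hour, min_replica=1, days=30):
    """Cost of keeping min_replica replicas up during hours with no traffic."""
    return idle_hours_per_day * cost_per_replica_hour * min_replica * days


def cold_start_wait_s(cold_starts_per_day, cold_start_s, days=30):
    """Total seconds that requests spend waiting on boots with scale to zero."""
    return cold_starts_per_day * cold_start_s * days


# Illustrative only: 16 idle hours a day at 2.0 per replica-hour, versus
# 10 cold starts a day that each take 65 seconds.
print(idle_cost(idle_hours_per_day=16, cost_per_replica_hour=2.0))
print(cold_start_wait_s(cold_starts_per_day=10, cold_start_s=65))
```

Comparing the two figures makes the decision concrete: always-on is worth it when the added wait matters more to your users than the idle spend does to your budget.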
Next steps
- Request lifecycle: What happens to requests during cold starts, including queuing and timeout behavior.
- Autoscaling: Configure min replicas and scale-down delay.
- Traffic patterns: Pre-warming strategies for different traffic types.
- Troubleshooting: Diagnose cold start issues.