A cold start is the time a fresh replica spends booting before it can accept traffic. During that boot, the deployment moves through the statuses you see in the dashboard: it leaves Scaled to zero, enters Waking up while the container is scheduled, then Loading model while weights move into GPU memory and the model’s setup code runs. Any request that triggered the scale-up sits in the queue until the deployment reaches Active, so the cold-start duration becomes the latency floor for that request. The diagram below traces a deployment cycling from Scaled to zero through Waking up and Loading model to Active and back, with a queued request waiting through the boot bar before being admitted.

When cold starts happen

Cold starts show up in two situations. The first is scale-from-zero: a deployment with min_replica set to 0 has shut all its replicas down to save cost, so the next request triggers a fresh boot before anything can serve it. The second is during ordinary scaling events: when load crosses the autoscaler’s threshold, every new replica it provisions has to cold-start before joining the pool. The first pattern is the more visible one because the scale-from-zero request is the one waiting on the boot. The second is usually masked by the existing replicas absorbing load while the new ones come online, so users only feel it when load grows faster than the autoscaler can keep up.

What contributes to cold start time

Cold start duration is the sum of three steps that the new replica works through in order:

| Factor | Impact |
| --- | --- |
| Model loading | Loading model weights (10s–100s of GB), typically the dominant factor. |
| Container pull | Downloading Docker image layers. |
| Initialization | Running your model’s setup code. |

For large models, cold starts can take minutes, and model weight downloads are usually the bottleneck. Even with caching in place, the physics of moving hundreds of gigabytes from storage into GPU memory creates inherent lag, which is why Baseten’s platform optimizations focus on shrinking that hop.
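
As a rough illustration of how the three steps add up, the sketch below sums hypothetical per-step durations and reports which step dominates. The image size, weight size, throughput figures, and setup time are assumed values chosen for the example, not Baseten measurements, and `ColdStartEstimate` and `transfer_seconds` are names invented here.

```python
from dataclasses import dataclass


@dataclass
class ColdStartEstimate:
    """Rough per-step estimate, in seconds, for one replica boot."""

    container_pull_s: float
    model_loading_s: float
    initialization_s: float

    @property
    def total_s(self) -> float:
        # Cold start duration is modeled as the sum of the three steps.
        return self.container_pull_s + self.model_loading_s + self.initialization_s

    def dominant_step(self) -> str:
        # Return the name of the step with the largest duration.
        steps = {
            "container_pull": self.container_pull_s,
            "model_loading": self.model_loading_s,
            "initialization": self.initialization_s,
        }
        return max(steps, key=steps.get)


def transfer_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to move size_gb at a sustained throughput."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_gb / throughput_gb_per_s


# Hypothetical inputs: a 32 GB image, 140 GB of weights, 20 s of setup code.
estimate = ColdStartEstimate(
    container_pull_s=transfer_seconds(32, 2.0),   # assumed 2 GB/s image pull
    model_loading_s=transfer_seconds(140, 1.0),   # assumed 1 GB/s weight load
    initialization_s=20.0,                        # assumed setup time
)
print(f"total: {estimate.total_s:.0f}s, dominant: {estimate.dominant_step()}")
# prints "total: 176s, dominant: model_loading"
```

With these assumed numbers, weight loading accounts for 140 of the 176 seconds, so it is the step where shorter transfer times would help most.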

Minimizing cold starts

Keep replicas warm

Set min_replica to 1 or higher so that at least one replica is always ready to serve requests. This eliminates the scale-from-zero cold start for the first request, but increases cost because you pay for the replica while it is idle.
{
  "min_replica": 1
}
For production redundancy, set min_replica ≥ 2 so one replica can fail during maintenance without causing cold starts.

Pre-warm before expected traffic

For predictable traffic spikes, increase min replicas before the expected load:
# 10-15 minutes before expected spike
curl -X PATCH \
  https://api.baseten.co/v1/models/{model_id}/deployments/{deployment_id}/autoscaling_settings \
  -H "Authorization: Api-Key $BASETEN_API_KEY" \
  -d '{"min_replica": 5}'
After traffic stabilizes, reset to your normal minimum.
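
The pre-warm and reset cycle can also be scripted. The following is a minimal Python sketch using only the standard library. It targets the same endpoint as the curl command above; the helper names, the explicit `Content-Type` header, and the placeholder IDs are assumptions made for illustration, not part of Baseten's documented client.

```python
import json
import urllib.request

API_BASE = "https://api.baseten.co/v1"


def build_autoscaling_patch(model_id, deployment_id, min_replica, api_key):
    """Build (but do not send) the PATCH request that sets min_replica."""
    url = (
        f"{API_BASE}/models/{model_id}"
        f"/deployments/{deployment_id}/autoscaling_settings"
    )
    body = json.dumps({"min_replica": min_replica}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Api-Key {api_key}",
            # Assumption: the body is declared as JSON explicitly.
            "Content-Type": "application/json",
        },
    )


def set_min_replica(model_id, deployment_id, min_replica, api_key):
    """Send the request and return the HTTP status code."""
    request = build_autoscaling_patch(model_id, deployment_id, min_replica, api_key)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


# Example usage (placeholder IDs; requires a real API key and network access):
#   set_min_replica("MODEL_ID", "DEPLOYMENT_ID", 5, api_key)  # before the spike
#   set_min_replica("MODEL_ID", "DEPLOYMENT_ID", 1, api_key)  # after traffic stabilizes
```

Keeping request construction separate from sending makes it possible to inspect or log the payload before anything is changed on the deployment.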

Use longer scale-down delay

A longer scale-down delay keeps replicas warm during temporary traffic dips:
{
  "scale_down_delay": 900
}
This prevents cold starts when traffic returns within the delay window.
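
To see how the delay window interacts with traffic gaps, consider the simplified model below. It assumes a scale-to-zero deployment where a replica shuts down once it has been idle for the full delay; the real autoscaler may behave differently, and the function names and gap values are hypothetical.

```python
def replicas_still_warm(idle_gap_s: float, scale_down_delay_s: float) -> bool:
    """True if traffic returns before the scale-down delay has elapsed."""
    return idle_gap_s < scale_down_delay_s


def count_cold_starts(idle_gaps_s, scale_down_delay_s, min_replica=0):
    """Count idle gaps long enough to force a boot on the next request.

    Simplified model: with min_replica >= 1 a replica is always up,
    so no idle gap leads to a scale-from-zero cold start.
    """
    if min_replica >= 1:
        return 0
    return sum(
        1
        for gap in idle_gaps_s
        if not replicas_still_warm(gap, scale_down_delay_s)
    )


# Hypothetical idle gaps between bursts of traffic, in seconds.
gaps = [120, 600, 1200, 300, 2400]
print(count_cold_starts(gaps, scale_down_delay_s=900))  # prints 2
print(count_cold_starts(gaps, scale_down_delay_s=60))   # prints 5
```

Under this model, raising the delay from 60 to 900 seconds cuts the cold starts from five to two for the same traffic pattern, at the cost of paying for replicas during the shorter gaps.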

Platform optimizations

Baseten automatically applies several optimizations to reduce cold start times:

Baseten Delivery Network: The weights configuration optimizes cold starts by mirroring weights to Baseten’s infrastructure and caching them close to your model pods. See Baseten Delivery Network (BDN) for full configuration options.

Image streaming: Optimized images stream into nodes, allowing model loading to begin before the full download completes:
Successfully pulled streaming-enabled image in 15.851s. Image size: 32 GB.
These optimizations are applied automatically.

The tradeoff

Cold starts create a fundamental tradeoff between cost and latency:

| Approach | Cost | Latency |
| --- | --- | --- |
| Scale to zero (min_replica: 0) | Lower: no cost when idle | Higher: first request waits for cold start |
| Always on (min_replica: ≥1) | Higher: pay for idle replicas | Lower: no cold starts |

For latency-sensitive production workloads, the cost of keeping replicas warm is usually justified. For batch workloads or development, scale-to-zero often makes sense.
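
A back-of-the-envelope calculation can make the tradeoff concrete. The sketch below uses an assumed hourly rate, idle fraction, and cold start duration; none of these figures come from Baseten pricing or benchmarks, and both function names are hypothetical.

```python
def monthly_idle_cost_usd(hourly_rate_usd, min_replica, idle_fraction, hours_per_month=730):
    """Cost of keeping min_replica replicas up during the hours they sit idle."""
    return hourly_rate_usd * min_replica * idle_fraction * hours_per_month


def first_request_latency_s(cold_start_s, inference_s, warm):
    """Latency of the first request after an idle period."""
    # A warm replica serves immediately; otherwise the request waits out the boot.
    return inference_s if warm else cold_start_s + inference_s


# Assumed values: $4.00/hour GPU, idle half the time, 176 s cold start, 2 s inference.
always_on_cost = monthly_idle_cost_usd(4.0, min_replica=1, idle_fraction=0.5)
scale_to_zero_latency = first_request_latency_s(176.0, 2.0, warm=False)
always_on_latency = first_request_latency_s(176.0, 2.0, warm=True)

print(f"always on: ${always_on_cost:.2f}/month idle cost, {always_on_latency:.0f}s first request")
print(f"scale to zero: $0.00/month idle cost, {scale_to_zero_latency:.0f}s first request")
# prints "always on: $1460.00/month idle cost, 2s first request"
# prints "scale to zero: $0.00/month idle cost, 178s first request"
```

Plugging in your own rate and measured cold start time shows whether the idle spend is worth the latency it removes for a given workload.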

Next steps