Different traffic patterns require different autoscaling configurations. Identify your pattern below for recommended starting settings.
These are starting points, not final answers. Monitor your
deployment’s performance and adjust based on observed behavior. See
Autoscaling for parameter details.
Identifying your pattern
Not sure which pattern you have? Check your metrics:
- Go to your model’s Metrics tab in the Baseten dashboard.
- Look at Inference volume and Replicas over the past week.
- Compare to the patterns below.
Some workloads are a mix of patterns. If your traffic has both a smooth diurnal pattern and occasional bursts, optimize for the bursts (they cause the most pain) and accept slightly higher cost during steady periods.
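If you export per-minute request counts, a rough first-pass classification can be scripted. The sketch below is a hypothetical heuristic, not something Baseten provides: the function name and the `sustained_minutes`, `idle_fraction`, and 5% idle thresholds are illustrative assumptions. Only the "2x within one minute" jump test comes from the pattern descriptions on this page.

```python
def classify_traffic(per_minute, sustained_minutes=10, idle_fraction=0.5):
    """Guess a traffic pattern from a list of per-minute request counts.

    All thresholds are illustrative assumptions; tune them for your data.
    """
    if not per_minute:
        raise ValueError("per_minute must not be empty")

    peak = max(per_minute)
    if peak == 0:
        return "scheduled"  # no traffic at all in the sample

    # Scheduled: most minutes are (near) idle relative to the peak.
    idle = sum(1 for count in per_minute if count <= 0.05 * peak)
    if idle / len(per_minute) >= idle_fraction:
        return "scheduled"

    # Look for sharp jumps: traffic at least doubling from one minute to the next.
    longest_run = 0
    found_jump = False
    for i in range(1, len(per_minute)):
        prev, cur = per_minute[i - 1], per_minute[i]
        if prev > 0 and cur >= 2 * prev:
            found_jump = True
            # Count how many minutes traffic stays at or above 2x the pre-jump level.
            run = 0
            j = i
            while j < len(per_minute) and per_minute[j] >= 2 * prev:
                run += 1
                j += 1
            longest_run = max(longest_run, run)

    if not found_jump:
        return "steady"
    return "bursty" if longest_run >= sustained_minutes else "jittery"


# One-hour example: a flat baseline with three one-minute spikes.
jittery_hour = [100] * 60
for minute in (10, 30, 50):
    jittery_hour[minute] = 220
print(classify_traffic(jittery_hour))
```

Treat the result as a hint for which section to read first, then confirm against the Metrics tab.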
Jittery traffic
Small, frequent spikes that quickly return to baseline.
Characteristics
- Baseline replica count is steady, but spikes up by 2x several times per hour.
- Spikes are short-lived and return to baseline quickly.
- Often not real load growth, just temporary surges that cause the autoscaler to overreact.
Common causes
- Consumer products with intermittent usage bursts.
- Traffic splitting or A/B testing with low percentages.
- Polling clients with synchronized intervals.
Recommended settings
| Parameter | Value | Why |
|---|---|---|
| Autoscaling window | 2-5 minutes | Smooth out noise, avoid reacting to every spike |
| Scale-down delay | 300-600s | Moderate stability |
| Target utilization | 70% | Default is fine |
| Concurrency target | Benchmarked value | Start conservative |
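To see why a longer window helps with jitter, the sketch below averages a short spike over two window lengths. It assumes the autoscaler reacts to a simple trailing mean of in-flight requests; the real averaging method may differ, and `windowed_average` is a hypothetical helper written for this illustration.

```python
def windowed_average(samples, window):
    """Trailing mean over the last `window` samples at each point in time."""
    averages = []
    running = 0.0
    for i, sample in enumerate(samples):
        running += sample
        if i >= window:
            running -= samples[i - window]  # drop the sample that left the window
        averages.append(running / min(i + 1, window))
    return averages


# One sample per second: baseline of 10 in-flight requests,
# with a 30-second spike to 20 in the middle.
samples = [10] * 300 + [20] * 30 + [10] * 300

peak_60s = max(windowed_average(samples, 60))    # 1-minute window
peak_300s = max(windowed_average(samples, 300))  # 5-minute window
print(peak_60s, peak_300s)  # 15.0 11.0
```

Under this assumption, the 1-minute window sees the load rise by 50%, while the 5-minute window sees only a 10% rise, which is much less likely to trigger a scale-up.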
Bursty traffic
Characteristics
- Traffic jumps sharply (2x+ within 60 seconds).
- Stays high for a sustained period before dropping.
- The “pain” is queueing and latency spikes while new replicas start.
Common causes
- Daily morning ramp-up (users starting their day).
- Marketing events, product launches, viral moments.
- Top-of-hour scheduled jobs or cron-triggered traffic.
Recommended settings
| Parameter | Value | Why |
|---|---|---|
| Autoscaling window | 30-60s | React quickly to genuine load increases |
| Scale-down delay | 900s+ | Handle back-to-back waves without thrashing |
| Target utilization | 50-60% | More headroom absorbs the burst while scaling |
| Min replicas | ≥2 | Redundancy + reduces cold start impact |
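The headroom effect of a lower target utilization can be worked through numerically. This sketch assumes the autoscaler treats a replica as full once it carries `concurrency_target × target utilization` in-flight requests; that model and the helper name `replicas_needed` are assumptions made for illustration, so check them against the Autoscaling parameter documentation.

```python
def replicas_needed(in_flight, concurrency_target, target_utilization_pct):
    """Replicas required so each stays at or below the utilization target.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    per_replica_x100 = concurrency_target * target_utilization_pct
    return -(-in_flight * 100 // per_replica_x100)


in_flight = 40          # steady-state concurrent requests
concurrency_target = 8  # benchmarked per-replica concurrency

for pct in (70, 50):
    replicas = replicas_needed(in_flight, concurrency_target, pct)
    hard_capacity = replicas * concurrency_target
    print(f"{pct}%: {replicas} replicas, absorbs up to {hard_capacity} concurrent requests")
```

With these numbers, a 70% target runs 8 replicas that saturate at 64 concurrent requests (1.6x the steady load), while a 50% target runs 10 replicas that saturate at 80 (2x), leaving more room to absorb a burst while new replicas start.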
Pre-warming for predictable bursts
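Pre-warming can be scripted as a single API request. The following is a minimal sketch that only builds the request. The endpoint path, the `min_replica` field name, and the `Api-Key` authorization scheme are assumptions about Baseten's management API and should be verified against the API reference before use. The model ID, deployment ID, and key are placeholders.

```python
import json
import urllib.request

API_BASE = "https://api.baseten.co/v1"  # assumed base URL


def build_min_replica_request(model_id, deployment_id, min_replica, api_key):
    """Build (but do not send) a PATCH request that sets min replicas.

    The path and body schema are assumptions; confirm them in the API reference.
    """
    url = (
        f"{API_BASE}/models/{model_id}"
        f"/deployments/{deployment_id}/autoscaling_settings"
    )
    body = json.dumps({"min_replica": min_replica}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PATCH",
        headers={
            "Authorization": f"Api-Key {api_key}",
            "Content-Type": "application/json",
        },
    )


request = build_min_replica_request("MODEL_ID", "DEPLOYMENT_ID", 4, "YOUR_API_KEY")
print(request.get_method(), request.full_url)
# To send it: urllib.request.urlopen(request)
```

Run the same call with the original value after the spike has passed so the deployment can scale back down.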
If your bursts are predictable (morning ramp, scheduled events), pre-warm by bumping min replicas before the expected spike.
Scheduled traffic
Characteristics
- Long periods of low or zero traffic.
- Large bursts tied to job schedules (hourly, daily, weekly).
- Traffic patterns are predictable but infrequent.
Common causes
- ETL pipelines and data processing jobs.
- Embedding backfills and batch inference.
- Periodic evaluation or testing jobs.
- Document processing triggered by user uploads.
Recommended settings
| Parameter | Value | Why |
|---|---|---|
| Min replicas | 0 (if cold starts acceptable) or 1 (during job windows) | Cost savings when idle |
| Scale-down delay | Moderate to high | Jobs often come in waves |
| Autoscaling window | 60-120s | Don’t overreact to the first few requests |
| Target utilization | 70% | Default is fine |
Scheduled pre-warming
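The timing side of a pre-warm can be derived from the job schedule. This sketch assumes an hourly job that starts at minute 0 and uses standard five-field cron syntax; both helper names are hypothetical.

```python
from datetime import datetime, timedelta


def prewarm_start(job_time, lead=timedelta(minutes=5)):
    """Return when to scale up so replicas are ready by `job_time`."""
    return job_time - lead


def hourly_prewarm_cron(lead_minutes=5):
    """Cron expression that fires `lead_minutes` before the top of every hour."""
    if not 0 < lead_minutes < 60:
        raise ValueError("lead_minutes must be between 1 and 59")
    return f"{60 - lead_minutes} * * * *"


print(hourly_prewarm_cron())                       # 55 * * * *
print(prewarm_start(datetime(2024, 1, 1, 9, 0)))   # 2024-01-01 08:55:00
```

Pick the lead time from your measured cold-start duration; if replicas take longer than 5 minutes to become ready, increase it accordingly.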
For predictable batch jobs, use cron and the API to pre-warm: scale up 5 minutes before the hourly job.
Steady traffic
Characteristics
- Traffic rises and falls gradually over the day.
- Classic diurnal pattern with no sharp edges.
- Predictable, cyclical behavior.
Common causes
- Always-on inference APIs with consistent user base.
- B2B applications with business-hours usage.
- Production workloads with stable, mature traffic.
Recommended settings
| Parameter | Value | Why |
|---|---|---|
| Target utilization | 70-80% | Can run replicas hotter safely |
| Autoscaling window | 60-120s | Moderate reaction speed |
| Scale-down delay | 300-600s | Moderate |
| Min replicas | ≥2 | Redundancy for production |
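The numeric recommendations from the four settings tables can be collected into one lookup and used to sanity-check a deployment. This is a sketch only: the dictionary keys are hypothetical names, not Baseten parameter names, and non-numeric recommendations ("Benchmarked value", "Moderate to high") are left out.

```python
# (low, high) ranges copied from the tables; None means "no upper bound".
PRESETS = {
    "jittery": {
        "autoscaling_window_s": (120, 300),
        "scale_down_delay_s": (300, 600),
        "target_utilization_pct": (70, 70),
    },
    "bursty": {
        "autoscaling_window_s": (30, 60),
        "scale_down_delay_s": (900, None),
        "target_utilization_pct": (50, 60),
        "min_replicas": (2, None),
    },
    "scheduled": {
        "autoscaling_window_s": (60, 120),
        "target_utilization_pct": (70, 70),
        "min_replicas": (0, 1),
    },
    "steady": {
        "autoscaling_window_s": (60, 120),
        "scale_down_delay_s": (300, 600),
        "target_utilization_pct": (70, 80),
        "min_replicas": (2, None),
    },
}


def in_recommended_range(pattern, setting, value):
    """Return True if `value` falls inside the starting range for `pattern`."""
    low, high = PRESETS[pattern][setting]
    return value >= low and (high is None or value <= high)


print(in_recommended_range("bursty", "scale_down_delay_s", 1200))
print(in_recommended_range("bursty", "target_utilization_pct", 70))
```

A value outside the range is not necessarily wrong, since these are starting points, but it is worth a second look.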
Next steps
- Autoscaling: Full parameter documentation.
- Troubleshooting autoscaling: Diagnose and fix common problems.
- Truss configuration reference: Configure predict_concurrency in your model.