Choose your path
- Coming from Tinker: Loops is a Tinker-compatible backend for your existing scripts. `import tinker` works with one install change, and the Tinker compatibility page lists exactly what differs. Start with the Loops overview.
- Already training elsewhere: Your Axolotl config, TRL script, or custom loop runs unchanged in a container. Baseten provisions the GPUs, syncs checkpoints as your job saves them, and deploys any checkpoint as a production endpoint. That’s Truss Train, documented in this section.
| | Truss Train | Loops |
|---|---|---|
| Training code | Any container image | Tinker-compatible Python (`import tinker`) |
| Models | Any | Supported base models only |
| Hardware | You declare GPUs in a Truss config | Baseten picks GPUs for the base model |
| Inference path | Deploy any synced checkpoint with one CLI command | Sampler serves new weights live during training; deploy checkpoints when ready |
| Lifecycle | Job runs to completion and exits | Session stays live until you run `truss loops deactivate` |
| Availability | All workspaces | Early access |
| Documentation | This section | Loops docs |
How Truss Train works
Baseten stores your checkpoints while the job runs and deploys any of them as a production endpoint. You don’t download weights, re-upload them, or manage separate serving infrastructure. The core workflow centers on two commands:
- Define your job: Declare compute, container image, runtime, and checkpointing in a Python config file.
- Submit it: `truss train push` packages your code and starts the job on H100 or H200 GPUs, single-node or multi-node.
- Watch checkpoints sync: Baseten stores each checkpoint your job saves.
- Deploy a checkpoint: `truss train deploy_checkpoints` turns any synced checkpoint into a production endpoint.
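
As a rough picture of the first step, the sketch below models the four things a job definition declares (compute, container image, runtime, and checkpointing) with plain dataclasses. Every class, field, and value here is a hypothetical stand-in chosen for illustration; it is not the Truss Train API, and the image name and start command are placeholders.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Compute:
    gpu_type: str = "H100"  # the workflow above mentions H100 or H200
    gpu_count: int = 1      # GPUs per node (illustrative field)
    node_count: int = 1     # values above 1 mean multi-node


@dataclass
class Runtime:
    start_commands: List[str] = field(default_factory=list)
    checkpointing_enabled: bool = True  # illustrative checkpointing switch


@dataclass
class TrainingJob:
    image: str  # container image that holds the training code
    compute: Compute
    runtime: Runtime


# Hypothetical job: two nodes of eight GPUs each.
job = TrainingJob(
    image="example.registry/trainer:latest",
    compute=Compute(gpu_type="H100", gpu_count=8, node_count=2),
    runtime=Runtime(start_commands=["python train.py"]),
)

total_gpus = job.compute.gpu_count * job.compute.node_count
print(f"{job.compute.gpu_type} x {total_gpus}")
```

Grouping the declaration this way keeps hardware, environment, and startup behavior separate, so changing the GPU count does not touch the training command.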
Supported frameworks
Truss Train is framework-agnostic: if it runs in a container, it runs here.
| Framework | Best for | Example |
|---|---|---|
| Axolotl | Configuration-driven fine-tuning with LoRA/QLoRA | oss-gpt-20b-axolotl |
| TRL | SFT, DPO, and GRPO with Hugging Face | oss-gpt-20b-lora-trl |
| TRL | LoRA DPO fine-tuning | qwen3-8b-lora-dpo-trl |
| VeRL | Reinforcement learning with custom rewards | qwen3-8b-lora-verl |
| MS-Swift | Long-context and multilingual training | qwen3-30b-mswift-multinode |
Key features
Checkpoint management
Checkpoints sync automatically to Baseten storage during training. You can:
- Deploy any checkpoint as a production endpoint with `truss train deploy_checkpoints`.
- Download checkpoints for local evaluation and analysis.
- Resume from any checkpoint if a job fails or you want to train further.
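
Resuming usually starts with finding the most recent checkpoint on disk. The sketch below is one way to do that, assuming the job writes Hugging Face Trainer-style `checkpoint-<step>` directories; that naming and the directory you pass in are assumptions about your job, not something Truss Train requires.

```python
import re
from pathlib import Path
from typing import Optional

# Matches directory names such as "checkpoint-500" (assumed naming scheme).
_CHECKPOINT_RE = re.compile(r"^checkpoint-(\d+)$")


def latest_checkpoint(output_dir: str) -> Optional[Path]:
    """Return the checkpoint-<step> subdirectory with the highest step, or None."""
    root = Path(output_dir)
    if not root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in root.iterdir():
        match = _CHECKPOINT_RE.match(child.name)
        if match and child.is_dir():
            step = int(match.group(1))
            # Compare steps numerically so checkpoint-1000 beats checkpoint-500.
            if step > best_step:
                best_step, best_path = step, child
    return best_path
```

With Hugging Face `Trainer`, the returned path can be passed as `trainer.train(resume_from_checkpoint=str(path))`. When the function returns `None`, the job starts from scratch.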
BDN weight and data loading
Load model weights and training data through Baseten Delivery Network (BDN). Mount weights from Hugging Face, S3, GCS, R2, or any HTTPS URL directly into your training container with no download code needed. BDN mirrors weights before compute is provisioned, then caches them for faster mounting on subsequent jobs. See storage and data ingestion for setup details.
Persistent caching
Speed up training iterations by caching models, datasets, and preprocessed data between jobs. The cache persists across training runs, so you don’t re-download 70B models every time. See the training cache guide for configuration options.
Multi-node training
Scale training across multiple GPU nodes with InfiniBand networking. Baseten handles node orchestration, communication setup, and environment variables. You set `node_count` in your configuration.
Learn more about multi-node training.
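
Raising the node count changes the effective batch size unless you adjust other settings. The sketch below works through that arithmetic, assuming plain data-parallel training in which every GPU processes its own micro-batch; the function names and the numbers are illustrative only.

```python
def world_size(node_count: int, gpus_per_node: int) -> int:
    """Total number of data-parallel workers across all nodes."""
    return node_count * gpus_per_node


def effective_batch_size(
    per_device_batch: int,
    grad_accum_steps: int,
    node_count: int,
    gpus_per_node: int,
) -> int:
    """Samples consumed per optimizer step under data parallelism."""
    return per_device_batch * grad_accum_steps * world_size(node_count, gpus_per_node)


# Same per-device settings, one node versus two nodes of eight GPUs.
single = effective_batch_size(4, 8, node_count=1, gpus_per_node=8)
double = effective_batch_size(4, 8, node_count=2, gpus_per_node=8)
print(single, double)
```

Doubling the node count doubles the samples per optimizer step, so to keep the effective batch fixed you would halve the gradient accumulation steps or the per-device batch.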
Remote access
Connect to running training containers to debug, inspect state, and iterate without resubmitting. Baseten offers two options:
- SSH: Connect from any OpenSSH client for terminal sessions and file transfer with `scp` or `sftp`.
- VS Code & Cursor: Connect from VS Code or Cursor Remote Tunnels for a full IDE experience.
Next steps
Get started
Run your first training job and deploy the result.
Loops
Tinker-compatible SFT and async RL with checkpoint deploys to inference.
ML Cookbook
Production-ready examples for frameworks and models.