truss train push, checkpoint sync, deployment) stays the same.
Swap the dataset
The tutorial’strain.py loads a public Hugging Face dataset:
load_dataset() at your own Hugging Face repo, or at files bundled with your project (everything in your project directory ships with truss train push):
hf_access_token secret to the job and read it from the environment; the tutorial’s container makes unauthenticated Hugging Face requests, so a gated model fails at training time without this. See secrets in training.
TRL’s SFTTrainer consumes chat-format datasets (a messages column) directly. For other shapes, apply a formatting function; the TRL SFT docs cover the options.
Load weights and data through BDN
The tutorial’strain.py downloads the base model inside the container, on billed GPU time, on every job. The better pattern is to mount weights and large datasets through BDN: declare a WeightsSource on the job, and the files are on local disk before your start commands run. BDN mirrors each source once and caches it, so re-runs skip the download entirely.
Add WeightsSource to the truss_train imports in config.py and declare the mounts:
config.py
train.py
auth block. See storage and data ingestion for the full configuration.
Adjust the training config
The tutorial caps the run at 50 steps so it finishes fast. For a real run, train on the full dataset and checkpoint less often. Keep the tutorial’s other settings (learning_rate, bf16, max_length); these are the fields that change:
output_dir on $BT_CHECKPOINT_DIR: that’s the directory Baseten syncs and deploys from. And every checkpoint you save is uploaded, so pick a save_steps cadence you’d actually resume or deploy from; frequent saves on a large model cost sync time and storage.
If the job hits GPU out-of-memory, lower per_device_train_batch_size and raise gradient_accumulation_steps to hold the effective batch size, or move up a GPU tier.
Scale the hardware
Hardware lives inconfig.py, not in your training code. A bigger base model needs more GPUs:
node_count and Baseten handles the InfiniBand networking and orchestration; see multi-node training.
Iterate faster on re-runs
Your second submission shouldn’t re-download the base model. The tutorial’s config already enables the training cache (CacheConfig(enabled=True)); keep it on, and cache model downloads and preprocessed data under the cache directory so subsequent jobs skip them.
To debug a live job instead of resubmitting, SSH into the running container or attach VS Code or Cursor.
Next steps
- Browse the ML Cookbook for complete recipes: Axolotl, DPO, RL with VeRL, and multi-node FSDP.
- Deploy your checkpoints when training finishes.