truss train workstation --node-count N provisions N full nodes, bootstraps Slurm across them, and prints the SSH command to connect. Every node runs slurmd, the rank-0 node also runs slurmctld as the controller, and each node’s GPUs register as gres automatically.
For single-node workstations, see SSH access. For non-interactive multi-node training jobs, see Multinode training.
How Baseten builds the cluster
When --node-count is greater than 1, every node runs a Slurm bootstrap at startup:
- Each node installs Slurm and munge, then detects its GPUs.
- Nodes coordinate through the shared project cache, registering themselves until all BT_GROUP_SIZE nodes are present.
- The rank-0 node generates /etc/slurm/slurm.conf and distributes it: cluster name workstation, a single default partition named gpu with no time limit, and each node’s GPUs registered as gres.
- The controller starts slurmctld and slurmd; workers start slurmd. The controller is also a compute node, so all N nodes accept work.
Every node ends up with the same slurm.conf and munge key, so Slurm commands work from any node. For the environment variables Baseten injects (BT_NODE_RANK, BT_GROUP_SIZE, BT_PROJECT_CACHE_DIR, and more), see the SDK reference.
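As a rough illustration of how the injected variables map to Slurm roles, the sketch below prints which daemons a node would run based on BT_NODE_RANK. The node_role function is a hypothetical helper, not part of Baseten's bootstrap, and the fallback values are assumptions that only let the snippet run outside a workstation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which Slurm daemons this node runs.
# Assumes BT_NODE_RANK and BT_GROUP_SIZE are injected by the workstation;
# the defaults (0 and 1) are placeholders for running elsewhere.
node_role() {
  local rank="${BT_NODE_RANK:-0}"
  local size="${BT_GROUP_SIZE:-1}"
  if [ "$rank" -eq 0 ]; then
    # Rank 0 is the controller and also accepts work.
    echo "rank ${rank} of ${size}: controller (slurmctld + slurmd)"
  else
    echo "rank ${rank} of ${size}: worker (slurmd)"
  fi
}

node_role
```

Running it on each node should show exactly one controller and N-1 workers.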
Launch a workstation
First, set up SSH access if you haven’t (see SSH access). Then launch the workstation with truss train workstation --node-count N.
- --node-count provisions full nodes, using all GPUs on each. It’s mutually exclusive with --gpu-count, which configures single-node workstations.
- --accelerator selects the GPU type (H100 by default).
- --image swaps the base image (default nvidia/cuda:12.8.1-devel-ubuntu24.04). The Slurm bootstrap installs its own packages, so any Debian-based image with your framework preinstalled works.
Run truss train stop when you finish; that tears down Slurm and releases the nodes.
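A launch and teardown might look like the sketch below. It uses only the flags named in this section; the node count of 2, the value format passed to --accelerator, and calling truss train stop without arguments are assumptions, so check the CLI reference before relying on them. The run helper is hypothetical and only prints commands unless DRY_RUN=0.

```shell
#!/usr/bin/env bash
# Sketch: launch a two-node workstation, then tear it down.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

NODES=2   # illustrative node count

# H100 and the CUDA image are the defaults stated in this page,
# spelled out here only to show where the flags go.
run truss train workstation \
  --node-count "$NODES" \
  --accelerator H100 \
  --image nvidia/cuda:12.8.1-devel-ubuntu24.04

# ... connect with the printed SSH command and do your work ...

# Tear down Slurm and release the nodes (arguments, if any, are assumed).
run truss train stop
```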
Verify the cluster
After connecting, confirm the cluster sees every node and GPU with standard Slurm commands such as sinfo. To find out which node you are on, run echo $BT_NODE_RANK; rank 0 is the controller.
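One way to script that check is sketched below, using standard sinfo options. The check_node_count helper is hypothetical; it assumes BT_GROUP_SIZE holds the number of nodes in the workstation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if Slurm lists as many nodes as
# BT_GROUP_SIZE says the workstation has.
check_node_count() {
  local expected="${BT_GROUP_SIZE:?BT_GROUP_SIZE is not set}"
  local seen
  # -h drops the header, -N prints one line per node, %N is the node name.
  seen="$(sinfo -h -N -o '%N' | sort -u | wc -l | tr -d ' ')"
  echo "slurm sees ${seen} of ${expected} nodes"
  [ "$seen" -eq "$expected" ]
}

# Typical interactive checks on a workstation (left as comments so the
# snippet can be sourced anywhere):
#   sinfo -N -o '%N %P %G %T'   # node, partition, gres, state
#   scontrol show nodes         # full per-node detail, including Gres
#   check_node_count
```

A non-zero exit status from check_node_count suggests some nodes have not registered yet.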
Run distributed work
The project cache directory is shared across all nodes. Put your code, data, and outputs there so every rank sees the same files. Run commands across the nodes with srun.
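For instance, a small wrapper can start one task per node from the shared cache. This is a sketch: srun_all_nodes is a hypothetical helper, and it assumes BT_NUM_GPUS is the per-node GPU count and BT_GROUP_SIZE the node count.

```shell
#!/usr/bin/env bash
# Sketch: run one task per node, starting in the shared cache directory.
# Assumes BT_GROUP_SIZE is the node count and BT_NUM_GPUS the per-node
# GPU count; the fallbacks only let the snippet load outside a workstation.
srun_all_nodes() {
  srun --nodes="${BT_GROUP_SIZE:-1}" \
       --ntasks-per-node=1 \
       --gres="gpu:${BT_NUM_GPUS:-1}" \
       --chdir="${BT_PROJECT_CACHE_DIR:-$PWD}" \
       "$@"
}

# Example calls on a workstation:
#   srun_all_nodes hostname        # one hostname per node
#   srun_all_nodes nvidia-smi -L   # list the GPUs each node sees
```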
For batch jobs, keep a script such as pretrain.sbatch on the shared cache.
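A batch script along these lines could serve as a starting point. It is a minimal sketch, not Baseten's reference script: the job name, output pattern, and train.py are placeholders, and the snippet falls back to the current directory when BT_PROJECT_CACHE_DIR is unset.

```shell
#!/usr/bin/env bash
# Sketch: write a minimal pretrain.sbatch onto the shared cache.
# train.py and the job name are placeholders, not files Baseten provides.
CACHE_DIR="${BT_PROJECT_CACHE_DIR:-$PWD}"   # falls back to the current directory

# The quoted 'EOF' keeps the shell from expanding anything in the script body.
cat > "${CACHE_DIR}/pretrain.sbatch" <<'EOF'
#!/bin/bash
#SBATCH --job-name=pretrain
#SBATCH --partition=gpu
#SBATCH --ntasks-per-node=1
#SBATCH --chdir=/root/.cache/user_artifacts
#SBATCH --output=pretrain-%j.out

# One task per node; each task runs the (placeholder) training script.
srun python train.py
EOF

echo "wrote ${CACHE_DIR}/pretrain.sbatch"
```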
#SBATCH lines don’t expand environment variables, so --chdir uses the literal path that $BT_PROJECT_CACHE_DIR resolves to (/root/.cache/user_artifacts). Pass the GPU count on the command line instead, where $BT_NUM_GPUS expands. Submit the job with sbatch and track it with squeue.
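Submission could be wrapped as in the sketch below, which passes the node and GPU counts on the command line so the variables expand. submit_pretrain is a hypothetical helper, and it assumes BT_NUM_GPUS is the per-node GPU count.

```shell
#!/usr/bin/env bash
# Sketch: submit pretrain.sbatch and print the new job ID.
submit_pretrain() {
  local script="${1:-pretrain.sbatch}"
  # --parsable makes sbatch print only the job ID, followed by
  # ";cluster" when a cluster name is present; cut keeps the ID.
  sbatch --parsable \
         --nodes="${BT_GROUP_SIZE:-1}" \
         --gres="gpu:${BT_NUM_GPUS:-1}" \
         "$script" | cut -d';' -f1
}

# On a workstation:
#   cd "$BT_PROJECT_CACHE_DIR"
#   job_id="$(submit_pretrain)"
#   squeue -j "$job_id"                # queue state for this job
#   tail -f "pretrain-${job_id}.out"   # assumes --output=pretrain-%j.out
```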
Slurm sets the standard SLURM_* environment variables (SLURM_NODEID, SLURM_NTASKS, SLURM_JOB_NODELIST) for each task, so distributed launchers like torchrun pick up the topology the standard way. For job arrays, dependencies, and everything beyond launching, see the Slurm documentation.
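As one possible mapping, the sketch below builds a torchrun command line from those variables. The torchrun_cmd helper, the rendezvous port 29500, and train.py are assumptions for illustration; the function prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Sketch: build torchrun arguments from the variables Slurm sets per task.
# The port (29500) and train.py are assumptions for illustration.
torchrun_cmd() {
  local head_node
  # scontrol expands a compressed node list (e.g. node-[0-1]) to one
  # hostname per line; the first host serves as the rendezvous endpoint.
  head_node="$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)"
  echo torchrun \
    --nnodes="$SLURM_JOB_NUM_NODES" \
    --node_rank="$SLURM_NODEID" \
    --nproc_per_node="${BT_NUM_GPUS:-1}" \
    --rdzv_id="$SLURM_JOB_ID" \
    --rdzv_backend=c10d \
    --rdzv_endpoint="${head_node}:29500" \
    train.py
}
```

Inside a job step with one task per node, each task can evaluate the printed command to start one torchrun agent per node.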
Checkpoints and the shared cache
Workstations support the same storage as training jobs:
- The shared cache mounts on every node and persists across workstation restarts within a project. See Cache.
- Pass --enable-checkpointing (with optional --checkpoint-path and --checkpoint-volume-size) to mount checkpoint storage, and --checkpoint-from-job to load the latest checkpoint from a previous job. See Checkpoints.
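Combined with a multi-node launch, the checkpoint flags might be used as sketched below. The snippet only prints the command. Treating the argument to --checkpoint-from-job as a job ID is an assumption, as is the placeholder value, so confirm the exact form in the CLI reference.

```shell
#!/usr/bin/env bash
# Sketch: print a launch command that mounts checkpoint storage and
# resumes from an earlier job. PREVIOUS_JOB_ID is a placeholder, and
# passing a job ID to --checkpoint-from-job is an assumption.
PREVIOUS_JOB_ID="${PREVIOUS_JOB_ID:-<job-id>}"

checkpoint_launch_cmd() {
  echo truss train workstation \
    --node-count "${1:-2}" \
    --enable-checkpointing \
    --checkpoint-from-job "$PREVIOUS_JOB_ID"
}

checkpoint_launch_cmd 2
```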
Notes and limits
- Everything runs as root, and there is one partition. The bootstrap regenerates slurm.conf on every start, so manual edits don’t survive a restart.
- Multi-node workstations always allocate full nodes; there is no fractional multi-node sizing.
Next steps
Once your training script behaves across nodes, the same project can run it as a non-interactive multi-node training job, with the cache and checkpoints carrying over.
- SSH access: Single-node workstations and direct SSH connections.
- VS Code & Cursor: Attach your IDE to a workstation with remote tunnels.
- Multinode training: Non-interactive distributed training jobs.
- CLI reference: All truss train workstation options.