Distributed Training
thsolver supports two distributed launch styles, spawn and torchrun,
selected with SOLVER.ddp_mode.
Spawn Mode
spawn is the default. It is best suited to single-node, multi-GPU training.
SOLVER.gpu defines the GPU ids to use.
SOLVER.port defines the localhost NCCL rendezvous port.
The solver launches one worker per listed GPU with torch.multiprocessing.spawn().
When only one GPU id is provided, the same code path still works and runs a single worker.
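The launch pattern described above can be sketched roughly as follows. This is an illustration, not thsolver's actual code: worker_settings, _worker, and launch are hypothetical names, and the tcp://localhost rendezvous URL is an assumption about how SOLVER.port is consumed.

```python
def worker_settings(gpu_ids, port):
    """Build one settings dict per worker (hypothetical helper)."""
    world_size = len(gpu_ids)
    # Assumption: the port is used for a TCP rendezvous on localhost.
    init_method = f"tcp://localhost:{port}"
    return [
        {"rank": rank, "gpu": gpu, "world_size": world_size,
         "init_method": init_method}
        for rank, gpu in enumerate(gpu_ids)
    ]


def _worker(rank, gpu_ids, port):
    """Entry point for each spawned process; spawn passes rank first."""
    import torch
    import torch.distributed as dist

    cfg = worker_settings(gpu_ids, port)[rank]
    torch.cuda.set_device(cfg["gpu"])
    dist.init_process_group(
        backend="nccl",
        init_method=cfg["init_method"],
        world_size=cfg["world_size"],
        rank=cfg["rank"],
    )
    try:
        pass  # the training loop would run here
    finally:
        dist.destroy_process_group()


def launch(gpu_ids, port):
    """Start one worker per listed GPU id and wait for all of them."""
    import torch.multiprocessing as mp

    mp.spawn(_worker, args=(gpu_ids, port), nprocs=len(gpu_ids), join=True)
```

With a single GPU id the list has one entry, so the same path starts exactly one worker with world size 1.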
torchrun Mode
torchrun uses the environment variables created by
torch.distributed.run. This is the better fit for multi-node launches or
when you already standardize on torchrun.
torchrun --nproc_per_node=4 train.py --config configs/experiment.yaml
In this mode, WORLD_SIZE, RANK, and LOCAL_RANK come from the launch
environment rather than from SOLVER.gpu.
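A minimal sketch of reading those variables is shown below. The function name read_torchrun_env and the single-process fallback defaults are assumptions made for illustration; thsolver may handle missing variables differently.

```python
import os


def read_torchrun_env(env=None):
    """Read the process layout that torchrun exports (hypothetical helper)."""
    env = os.environ if env is None else env
    return {
        # Defaults are an assumption: fall back to a single local process.
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
    }
```

LOCAL_RANK is the value normally used to pick the CUDA device on each node, while RANK identifies the process across all nodes.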
Master Process Responsibilities
Only rank 0 performs:
TensorBoard and CSV logging
checkpoint save and restore bookkeeping
best-checkpoint updates
console logging intended for the user
All other ranks participate in forward and backward passes and synchronize
epoch-level metrics through thsolver.tracker.AverageTracker.average_all_gather().
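The division of labor above can be illustrated with a small sketch. It does not reproduce AverageTracker.average_all_gather(), whose internals are not shown here; is_master, average_metrics, and gather_and_average are hypothetical names, and torch.distributed.all_gather_object is used as one plausible way to collect per-rank values.

```python
def is_master(rank):
    """Only rank 0 should write logs and checkpoints."""
    return rank == 0


def average_metrics(per_rank):
    """Average each metric key across a list of per-rank dicts."""
    keys = per_rank[0].keys()
    return {k: sum(m[k] for m in per_rank) / len(per_rank) for k in keys}


def gather_and_average(local_metrics):
    """Collect every rank's metrics, then average them on all ranks."""
    import torch.distributed as dist

    gathered = [None] * dist.get_world_size()
    dist.all_gather_object(gathered, local_metrics)
    return average_metrics(gathered)
```

Every rank must call gather_and_average, since the gather is a collective operation; only the logging that follows should be guarded by is_master.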
Reproducibility
Set SOLVER.rand_seed to a positive integer to enable deterministic seeding
for Python, NumPy, and PyTorch. When a seed is fixed, the solver also disables
cudnn.benchmark and enables deterministic cuDNN behavior.
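The seeding behavior described above might look roughly like this. The function seed_everything is a hypothetical name, and treating a non-positive seed as "leave everything unseeded" is an assumption inferred from the wording, not confirmed behavior.

```python
import random


def seed_everything(seed):
    """Seed Python, NumPy, and PyTorch; return True if seeding was applied."""
    if seed <= 0:
        # Assumption: a non-positive value disables deterministic seeding.
        return False
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; skip it in this sketch
    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # ignored when CUDA is unavailable
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
    except ImportError:
        pass  # PyTorch not installed; skip it in this sketch
    return True
```

Disabling cudnn.benchmark trades some speed for repeatability, because the autotuner may otherwise select different convolution algorithms from run to run.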