Distributed Training

thsolver supports two distributed launch styles, spawn and torchrun, selected by SOLVER.ddp_mode.

Spawn Mode

spawn is the default. It is best suited to single-node, multi-GPU training.

SOLVER.gpu defines the GPU ids to use.
SOLVER.port defines the localhost NCCL rendezvous port.
The solver launches one worker per listed GPU with torch.multiprocessing.spawn().

When only one GPU id is provided, the same code path still works and runs a single worker.
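
The spawn flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not thsolver's actual implementation: worker_settings, _worker, and launch are hypothetical names, and the tcp://localhost:<port> init method is an assumption about how the rendezvous on SOLVER.port is set up.

```python
def worker_settings(gpu_ids, port):
    """Return one settings dict per worker (hypothetical helper)."""
    world_size = len(gpu_ids)
    return [
        {
            "rank": rank,
            "world_size": world_size,
            "device": gpu_id,
            # Assumed rendezvous address: localhost plus the configured port.
            "init_method": f"tcp://localhost:{port}",
        }
        for rank, gpu_id in enumerate(gpu_ids)
    ]


def _worker(rank, gpu_ids, port):
    # torch.multiprocessing.spawn passes the process index as the first argument.
    import torch
    import torch.distributed as dist

    cfg = worker_settings(gpu_ids, port)[rank]
    torch.cuda.set_device(cfg["device"])
    dist.init_process_group(
        backend="nccl",
        init_method=cfg["init_method"],
        world_size=cfg["world_size"],
        rank=cfg["rank"],
    )
    try:
        pass  # the training loop would run here
    finally:
        dist.destroy_process_group()


def launch(gpu_ids, port):
    import torch.multiprocessing as mp

    # One process per listed GPU; a single id simply yields one worker.
    mp.spawn(_worker, args=(gpu_ids, port), nprocs=len(gpu_ids), join=True)
```

Because the rank is just the position of a GPU id in the list, a one-element list produces rank 0 with a world size of 1, which is why the single-GPU case needs no separate code path.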

torchrun Mode

torchrun uses the environment variables created by torch.distributed.run. This is the better fit for multi-node launches or when your workflow is already standardized on torchrun.

torchrun --nproc_per_node=4 train.py --config configs/experiment.yaml

In this mode, WORLD_SIZE, RANK, and LOCAL_RANK come from the launch environment rather than from SOLVER.gpu.
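
As a rough sketch of what reading the launch environment looks like, the helper below collects the three variables named above. read_torchrun_env is a hypothetical name used for illustration; thsolver's own code may be organized differently.

```python
import os


def read_torchrun_env(environ=None):
    """Read the rank layout that torchrun exports to every worker."""
    env = os.environ if environ is None else environ
    required = ("WORLD_SIZE", "RANK", "LOCAL_RANK")
    missing = [key for key in required if key not in env]
    if missing:
        raise RuntimeError(f"not launched by torchrun, missing: {missing}")
    return {
        "world_size": int(env["WORLD_SIZE"]),  # total number of workers
        "rank": int(env["RANK"]),              # global index of this worker
        "local_rank": int(env["LOCAL_RANK"]),  # index of this worker on its node
    }
```

A worker would typically pass local_rank to torch.cuda.set_device() and then call torch.distributed.init_process_group() with the default env:// initialization, which reads MASTER_ADDR and MASTER_PORT from the same environment.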

Master Process Responsibilities

Only rank 0 performs:

TensorBoard and CSV logging
checkpoint save and restore bookkeeping
best-checkpoint updates
console logging intended for the user

All other ranks participate in forward and backward passes and synchronize epoch-level metrics through thsolver.tracker.AverageTracker.average_all_gather().
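
The division of labor above can be illustrated with a small sketch. Both helpers are hypothetical and are not part of thsolver: master_only shows one common way to restrict a side effect to rank 0, and average_over_ranks shows, conceptually, the value an all-gather based average produces once every rank has contributed its metric. The plain unweighted mean is an assumption; the actual AverageTracker may weight values differently.

```python
import functools


def master_only(get_rank):
    """Decorator factory: run the wrapped function on rank 0 only."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if get_rank() == 0:
                return fn(*args, **kwargs)
            return None  # every other rank skips the side effect
        return wrapper
    return decorator


def average_over_ranks(per_rank_values):
    """Mean of the values gathered from all ranks (assumed unweighted)."""
    return sum(per_rank_values) / len(per_rank_values)
```

In a real run, get_rank would be torch.distributed.get_rank, and the list passed to average_over_ranks would come from an all-gather across the process group rather than from a local list.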

Reproducibility

Set SOLVER.rand_seed to a positive integer to enable deterministic seeding for Python, NumPy, and PyTorch. When a seed is fixed, the solver also disables cudnn.benchmark and enables deterministic cuDNN behavior.
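
The seeding steps described above correspond roughly to the sketch below. seed_everything is a hypothetical name, and the exact calls thsolver makes are not shown in this document, so treat this as an illustration of the general technique. The optional imports are only there so the sketch runs where NumPy or PyTorch is not installed.

```python
import random


def seed_everything(seed):
    """Seed Python, NumPy, and PyTorch; return the names of the libraries seeded."""
    random.seed(seed)
    seeded = ["random"]
    try:
        import numpy as np
        np.random.seed(seed)
        seeded.append("numpy")
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)  # seeds the CPU and all CUDA generators
        torch.backends.cudnn.benchmark = False     # no autotuned kernel choice
        torch.backends.cudnn.deterministic = True  # deterministic cuDNN kernels
        seeded.append("torch")
    except ImportError:
        pass
    return seeded
```

Disabling cudnn.benchmark matters because the autotuner may pick different convolution algorithms from run to run, which can change results even when every random generator is seeded identically.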