Distributed Training

thsolver supports two distributed launch styles, spawn and torchrun, selected by SOLVER.ddp_mode.

Spawn Mode

spawn is the default. It is best suited to single-node, multi-GPU training.

  • SOLVER.gpu defines the GPU ids to use.

  • SOLVER.port defines the localhost NCCL rendezvous port.

  • The solver launches one worker per listed GPU with torch.multiprocessing.spawn().

When only one GPU id is provided, the same code path still works and runs a single worker.
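As a rough illustration of the spawn path, the sketch below starts one worker per GPU id and has each worker join an NCCL process group on a localhost port. It is not thsolver's actual implementation: plan_workers, worker, and launch are hypothetical names, and the tcp://localhost rendezvous URL is an assumption about how SOLVER.port is used.

```python
from typing import List, Tuple


def plan_workers(gpu_ids: List[int], port: int) -> Tuple[str, List[Tuple[int, int]]]:
    """Return the rendezvous URL and one (rank, gpu_id) pair per worker."""
    # Assumption: rendezvous happens over TCP on localhost at the given port.
    init_method = f"tcp://localhost:{port}"
    return init_method, list(enumerate(gpu_ids))


def worker(rank: int, gpu_ids: List[int], port: int) -> None:
    """Entry point run in each spawned process (hypothetical)."""
    # Imported lazily so plan_workers() can be used without PyTorch installed.
    import torch
    import torch.distributed as dist

    init_method, _ = plan_workers(gpu_ids, port)
    torch.cuda.set_device(gpu_ids[rank])
    dist.init_process_group(
        "nccl", init_method=init_method, world_size=len(gpu_ids), rank=rank
    )
    try:
        pass  # the training loop would run here
    finally:
        dist.destroy_process_group()


def launch(gpu_ids: List[int], port: int) -> None:
    """Start one worker per listed GPU; a single id yields a single worker."""
    import torch.multiprocessing as mp

    # spawn() passes the process index as the first argument to worker().
    mp.spawn(worker, args=(gpu_ids, port), nprocs=len(gpu_ids))
```

With a single GPU id, nprocs is 1 and the world size is 1, so the same code path applies.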

torchrun Mode

torchrun mode reads the environment variables set by torchrun (torch.distributed.run). It is the better fit for multi-node launches, or when your workflow already standardizes on torchrun.

torchrun --nproc_per_node=4 train.py --config configs/experiment.yaml

In this mode, WORLD_SIZE, RANK, and LOCAL_RANK come from the launch environment rather than from SOLVER.gpu.
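A minimal sketch of reading these variables follows. DistEnv and read_torchrun_env are hypothetical helpers for illustration, not part of thsolver; only the three variable names come from the launch environment described above.

```python
import os
from typing import Mapping, NamedTuple


class DistEnv(NamedTuple):
    rank: int        # global rank across all nodes
    local_rank: int  # rank within the current node, usually the GPU index
    world_size: int  # total number of worker processes


def read_torchrun_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Parse the variables that torchrun exports for each worker."""
    return DistEnv(
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        world_size=int(env["WORLD_SIZE"]),
    )
```

Passing the mapping as a parameter keeps the helper easy to check outside a torchrun launch, for example with a plain dict.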

Master Process Responsibilities

Only rank 0 performs:

  • TensorBoard and CSV logging

  • checkpoint save and restore bookkeeping

  • best-checkpoint updates

  • console logging intended for the user

All ranks, including rank 0, participate in forward and backward passes and synchronize epoch-level metrics through thsolver.tracker.AverageTracker.average_all_gather().
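The division of labor can be sketched as follows: every rank computes the same averaged metrics, but only rank 0 reports them. This is an illustration under stated assumptions, not the behavior of AverageTracker itself; is_master, average_gathered, and end_of_epoch are hypothetical names, and the per-rank list stands in for values that a collective such as torch.distributed.all_gather_object would provide.

```python
from typing import Callable, Dict, List


def is_master(rank: int) -> bool:
    """Rank 0 is the master process."""
    return rank == 0


def average_gathered(per_rank: List[Dict[str, float]]) -> Dict[str, float]:
    """Average metrics gathered from every rank (one dict per rank)."""
    keys = per_rank[0].keys()
    return {k: sum(m[k] for m in per_rank) / len(per_rank) for k in keys}


def end_of_epoch(
    rank: int,
    per_rank: List[Dict[str, float]],
    log: Callable[[str], None] = print,
) -> Dict[str, float]:
    """Compute averaged metrics on every rank; log them only on rank 0."""
    averaged = average_gathered(per_rank)
    if is_master(rank):
        log(", ".join(f"{k}={v:.4f}" for k, v in sorted(averaged.items())))
    return averaged
```

Guarding only the side effects (logging, checkpoint writes) rather than the computation keeps all ranks in step during collective operations.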

Reproducibility

Set SOLVER.rand_seed to a positive integer to seed Python, NumPy, and PyTorch deterministically. When a seed is set, the solver also disables cudnn.benchmark and enables deterministic cuDNN behavior.
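The seeding described above might look roughly like the sketch below. seed_everything is a hypothetical helper, not thsolver's own function, and treating non-positive seeds as "leave everything unseeded" is an assumption inferred from the wording above.

```python
import random


def seed_everything(seed: int) -> bool:
    """Seed Python, NumPy, and PyTorch; return False if seeding is skipped."""
    if seed <= 0:
        # Assumption: a non-positive seed means "do not fix the seed".
        return False

    random.seed(seed)

    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed

    try:
        import torch
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # safe to call without a GPU
        # Trade autotuned kernels for repeatable ones.
        torch.backends.cudnn.benchmark = False
        torch.backends.cudnn.deterministic = True
    except ImportError:
        pass  # PyTorch not installed; nothing to seed

    return True
```

Disabling cudnn.benchmark can slow training somewhat, because cuDNN no longer autotunes convolution algorithms per input shape.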