Distributed Training
====================

``thsolver`` supports two distributed launch styles controlled by
``SOLVER.ddp_mode``.


Spawn Mode
----------

``spawn`` is the default. It is best suited to single-node, multi-GPU training.

- ``SOLVER.gpu`` defines the GPU ids to use.
- ``SOLVER.port`` defines the localhost NCCL rendezvous port.
- The solver launches one worker per listed GPU with
  :func:`torch.multiprocessing.spawn`.

When only one GPU id is provided, the same code path still works and runs a
single worker.
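
The launch described above can be sketched roughly as follows. This is a
minimal illustration, not the ``thsolver`` implementation: the names
``rendezvous_url``, ``worker``, and ``launch`` are hypothetical, and the
sketch assumes a TCP rendezvous on ``localhost`` with the NCCL backend.

.. code-block:: python

   def rendezvous_url(port):
       """Build the localhost TCP rendezvous address shared by all workers."""
       return f"tcp://localhost:{port}"


   def worker(rank, gpu_ids, port):
       # Hypothetical worker entry point; PyTorch is imported lazily so the
       # helper above stays usable on machines without it.
       import torch
       import torch.distributed as dist

       # Each worker binds to the GPU id stored at its own index in the list.
       torch.cuda.set_device(gpu_ids[rank])
       dist.init_process_group(
           backend="nccl",
           init_method=rendezvous_url(port),
           world_size=len(gpu_ids),
           rank=rank,
       )
       # ... build the model and run the training loop here ...
       dist.destroy_process_group()


   def launch(gpu_ids, port):
       import torch.multiprocessing as mp

       # mp.spawn calls worker(i, gpu_ids, port) for i in range(nprocs),
       # so a single GPU id simply yields a single worker.
       mp.spawn(worker, args=(gpu_ids, port), nprocs=len(gpu_ids))

Calling ``launch(gpu_ids=[0, 1], port=10001)`` from a script guarded by
``if __name__ == "__main__":`` would start two workers; the port value here
is only an example.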


torchrun Mode
-------------

``torchrun`` uses the environment variables created by
``torch.distributed.run``. This is the better fit for multi-node launches, or
for setups that already standardize on ``torchrun``.

.. code-block:: bash

   torchrun --nproc_per_node=4 train.py --config configs/experiment.yaml

In this mode, ``WORLD_SIZE``, ``RANK``, and ``LOCAL_RANK`` come from the launch
environment rather than from ``SOLVER.gpu``.
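
As a rough sketch of how a script might read these variables, assuming a
hypothetical helper named ``read_torchrun_env`` (the solver's own parsing may
differ):

.. code-block:: python

   import os


   def read_torchrun_env(environ=None):
       """Return (rank, local_rank, world_size) from torchrun-style variables."""
       environ = os.environ if environ is None else environ
       names = ("RANK", "LOCAL_RANK", "WORLD_SIZE")
       missing = [name for name in names if name not in environ]
       if missing:
           # Fail early with a clear message when not launched by torchrun.
           raise RuntimeError(f"missing environment variables: {missing}")
       # The launcher exports the values as strings, so convert them to int.
       return tuple(int(environ[name]) for name in names)

``LOCAL_RANK`` is the index of the process on its own node and is the usual
choice for selecting the CUDA device, while ``RANK`` is unique across all
nodes.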


Master Process Responsibilities
-------------------------------

Only rank 0 performs:

- TensorBoard and CSV logging
- checkpoint save and restore bookkeeping
- best-checkpoint updates
- console logging intended for the user

Every rank, including rank 0, participates in the forward and backward passes.
Epoch-level metrics are synchronized across ranks through
:meth:`thsolver.tracker.AverageTracker.average_all_gather`.
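
The division of labor can be illustrated with a small, framework-free sketch.
The helpers ``is_master``, ``average_gathered``, and ``end_of_epoch`` are
hypothetical, and the list ``per_rank_loss`` stands in for the values an
all-gather would return; the sketch does not reproduce the tracker's actual
implementation.

.. code-block:: python

   def is_master(rank):
       """True only for the process that owns logging and checkpoints."""
       return rank == 0


   def average_gathered(per_rank_values):
       """Average one scalar per rank, as collected by an all-gather."""
       return sum(per_rank_values) / len(per_rank_values)


   def end_of_epoch(rank, per_rank_loss, log_lines):
       # Every rank computes the same average from the gathered values ...
       mean_loss = average_gathered(per_rank_loss)
       # ... but only rank 0 records it for the user.
       if is_master(rank):
           log_lines.append(f"loss={mean_loss:.4f}")
       return mean_loss

Keeping the averaging on every rank and gating only the side effects avoids
a deadlock in which rank 0 waits on a collective call that the other ranks
never enter.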


Reproducibility
---------------

Set ``SOLVER.rand_seed`` to a positive integer to enable deterministic seeding
for Python, NumPy, and PyTorch. When a seed is fixed, the solver also disables
``cudnn.benchmark`` and enables deterministic cuDNN behavior.
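
A minimal sketch of what such seeding typically involves is shown below. The
function name ``seed_everything`` is hypothetical, and treating non-positive
values as "seeding disabled" is an assumption drawn from the description
above rather than from the solver's code.

.. code-block:: python

   import random

   import numpy as np
   import torch


   def seed_everything(seed):
       """Seed Python, NumPy, and PyTorch; return False if seeding is skipped."""
       if seed <= 0:
           # Assumed convention: a non-positive seed leaves the RNGs untouched.
           return False
       random.seed(seed)
       np.random.seed(seed)
       torch.manual_seed(seed)
       # Safe to call without CUDA; the call is silently ignored in that case.
       torch.cuda.manual_seed_all(seed)
       # Disable autotuning and request deterministic cuDNN kernels.
       torch.backends.cudnn.benchmark = False
       torch.backends.cudnn.deterministic = True
       return True

Deterministic cuDNN kernels can be slower than the autotuned ones, so a fixed
seed may trade some throughput for repeatable runs.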