
6. Test Results and Future Directions

Crunching the numbers and what comes next

Part 6 of 6 in the Distributed Training series

6.1 Results and Analysis

Running with 4 simulated GPUs on a single RTX 3060:

torchrun --nproc_per_node=4 benchmark_strategies.py --quick
Benchmark Results: 4 Simulated GPUs (gloo backend)

Strategy             Implementation     Peak Memory (MB)   Weights
DDPFlat              Custom             1277               Synced
DDPBucketed          Custom             1631               Synced
DDPFlat + ZeRO       Custom             1361               Synced
DDPBucketed + ZeRO   Custom             1718               Synced
TensorParallel       Custom             1030               Sharded (expected)
PyTorch DDP          PyTorch baseline   1874               Synced
PyTorch FSDP         PyTorch baseline   1158               Sharded (expected)
PyTorch DDP + ZeRO   PyTorch baseline   1722               Synced

TensorParallel and FSDP intentionally shard weights: different sums across ranks is correct behavior.

Note on benchmarking throughput: Throughput numbers are not meaningful in simulation mode because all processes share a single GPU. What matters is correctness: weights stay synchronized, and training dynamics match PyTorch's battle-tested implementations.

The results confirm correctness across all implementations:

Data parallel strategies (DDPFlat, DDPBucketed, PyTorch DDP): All ranks converge to identical weights after training. The sync check passes, confirming that gradient synchronization is working correctly. My custom implementations match PyTorch's behavior.

ZeRO combinations: Adding ShardedOptimizer to any data-parallel strategy maintains weight synchronization while sharding optimizer state. The slight memory increase in the benchmark is an artifact of the simulation. In real multi-GPU scenarios, each rank would hold only 1/N of the optimizer state.

TensorParallel and FSDP: These show different weight sums across ranks, which is correct behavior. Both strategies intentionally shard model weights across ranks. TensorParallel splits layer dimensions (each rank holds different columns/rows), while FSDP shards entire parameter tensors. Different weight sums are expected.
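The data-parallel invariant behind the sync check can be captured in a few lines. Here is a minimal pure-Python sketch (standing in for torch.distributed; the function names are illustrative, not from the benchmark code): if every rank applies the same averaged gradient, weights stay bit-identical, so comparing per-rank weight checksums against rank 0 is a sufficient check.

```python
# Toy model of the data-parallel sync invariant (pure Python stand-in
# for torch.distributed; names here are illustrative, not from the repo).

def all_reduce_mean(per_rank_grads):
    """Average gradients across ranks, as an averaging all-reduce would."""
    n = len(per_rank_grads)
    return [sum(g[i] for g in per_rank_grads) / n
            for i in range(len(per_rank_grads[0]))]

def train_step(weights_per_rank, grads_per_rank, lr=0.1):
    """Every rank applies the *same* averaged gradient -> weights stay equal."""
    avg = all_reduce_mean(grads_per_rank)
    return [[w - lr * g for w, g in zip(rank_w, avg)]
            for rank_w in weights_per_rank]

# 4 "ranks" start from identical weights but see different local gradients
weights = [[1.0, -2.0, 0.5] for _ in range(4)]
grads = [[0.1, 0.2, 0.3], [0.3, 0.0, 0.1], [0.2, 0.1, 0.0], [0.0, 0.3, 0.2]]
weights = train_step(weights, grads)

# The "sync check": every rank's weight checksum must match rank 0's
checksums = [sum(w) for w in weights]
assert all(abs(c - checksums[0]) < 1e-12 for c in checksums)
```

Sharded strategies deliberately fail this check: for TensorParallel and FSDP, per-rank checksums differ by construction.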

The memory profile reveals the tradeoffs:

Strategy             Peak Memory   vs PyTorch DDP
TensorParallel       1030 MB       -45%
PyTorch FSDP         1158 MB       -38%
DDPFlat              1277 MB       -32%
DDPFlat + ZeRO       1361 MB       -27%
DDPBucketed          1631 MB       -13%
DDPBucketed + ZeRO   1718 MB       -8%
PyTorch DDP + ZeRO   1722 MB       -8%
PyTorch DDP          1874 MB       baseline

DDPBucketed uses more memory than DDPFlat because it allocates flat buffers to coalesce gradients before reduction. This is the expected tradeoff: bucket memory overhead in exchange for fewer collective operations. In real multi-GPU scenarios with high-latency interconnects, the reduced communication overhead more than compensates.
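The bucketing tradeoff is easy to see in a sketch. Assuming a greedy packing policy similar in spirit to PyTorch DDP's bucket_cap_mb (the parameter sizes and cap below are invented for illustration):

```python
# Sketch of gradient bucketing (illustrative; parameter sizes and the
# element cap are made up, not taken from the benchmark code).

def assign_buckets(param_sizes, cap_elems):
    """Greedily pack parameter gradients into flat buckets of <= cap_elems."""
    buckets, current, used = [], [], 0
    for i, size in enumerate(param_sizes):
        if current and used + size > cap_elems:
            buckets.append(current)
            current, used = [], 0
        current.append(i)
        used += size
    if current:
        buckets.append(current)
    return buckets

sizes = [4_000_000, 1_000_000, 3_000_000, 2_500_000, 500_000]  # elements
buckets = assign_buckets(sizes, cap_elems=6_000_000)

# One all-reduce per bucket instead of one per parameter tensor:
print(f"{len(sizes)} tensors -> {len(buckets)} collectives: {buckets}")
```

The flat buffers backing each bucket are exactly the extra memory the benchmark shows for DDPBucketed over DDPFlat; the payoff is fewer, larger collectives.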

A note on ZeRO memory in simulation: The memory numbers for ZeRO combinations with our custom implementations appear higher than their non-ZeRO counterparts, which seems counterintuitive. This is an artifact of the simulation: ZeRO shards optimizer state across ranks, but here all 4 ranks share the same physical GPU. Each process allocates its 1/4 shard on the same device, so we pay the bookkeeping overhead without the memory savings. On real multi-GPU hardware, ZeRO would show dramatic memory reduction: each GPU would hold only 1/N of the optimizer state.
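The savings ZeRO would deliver on real hardware are easy to quantify. A back-of-envelope sketch, assuming Adam with two fp32 moment tensors per parameter (the model size is an example, not the benchmark model):

```python
# Back-of-envelope ZeRO-1 optimizer-state math (model size is a made-up
# example; 8 bytes/param assumes Adam's two fp32 moment tensors).

def adam_state_bytes(n_params, world_size=1):
    """Optimizer state per rank: exp_avg + exp_avg_sq, fp32, sharded 1/N."""
    full = n_params * 2 * 4          # two fp32 tensors per parameter
    return full // world_size

n_params = 125_000_000               # e.g. a GPT-2-sized model
full = adam_state_bytes(n_params)
sharded = adam_state_bytes(n_params, world_size=4)

print(f"full:   {full / 2**20:.0f} MiB of optimizer state per rank")
print(f"ZeRO-1: {sharded / 2**20:.0f} MiB per rank (1/4 shard)")
```

In the single-GPU simulation, all four shards land on the same physical device, so this reduction never shows up in the peak-memory numbers.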

The key validation: all custom implementations produce the same training dynamics as PyTorch's battle-tested DDP. The abstractions work. The synchronization is correct. The code is ready for real distributed hardware.

6.2 Reflection

This project started as a way to answer a simple question: what actually happens when you call loss.backward() across multiple GPUs? The documentation tells you how to use DDP and FSDP, but not why they're built the way they are, or what breaks when you scale beyond the happy path.

Building these primitives from scratch forced me to confront the constraints that dominate large-scale training: the tension between computation and communication, the memory hierarchy from registers to network, and the coordination overhead that compounds with every additional process. These determine whether a training run takes hours or days, whether a model fits on your cluster or requires a different parallelism strategy entirely.

The single-GPU validation approach proved more valuable than expected. By simulating distributed training with gloo, I could iterate on synchronization logic in minutes rather than waiting for cluster allocation. The correctness guarantees transfer directly: if gradients reduce correctly over CPU shared memory, they'll reduce correctly over NVLink or InfiniBand. The only difference is bandwidth.

6.3 Future Directions

The implementations here cover the core parallelism primitives, but production training systems require additional layers:

Pipeline Parallelism. Tensor parallelism splits layers across devices; pipeline parallelism splits the model vertically, assigning different layers to different GPUs. The challenge is keeping all devices utilized: naive implementations leave GPUs idle waiting for activations from earlier stages. Techniques like GPipe's micro-batching and PipeDream's 1F1B scheduling address this, but add complexity around activation memory and gradient staleness.
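The idle time of micro-batched pipelining has a standard closed form: with p stages and m micro-batches, the bubble fraction is (p - 1) / (m + p - 1). A quick sketch (the stage and micro-batch counts are examples):

```python
# GPipe-style micro-batching: the pipeline "bubble" fraction is
# (p - 1) / (m + p - 1) for p stages and m micro-batches. Stage and
# micro-batch counts below are illustrative.

def bubble_fraction(stages, micro_batches):
    return (stages - 1) / (micro_batches + stages - 1)

for m in (1, 4, 16, 64):
    print(f"4 stages, {m:>2} micro-batches: "
          f"{bubble_fraction(4, m):.0%} of pipeline time idle")
```

This is why micro-batch count matters so much: with a single micro-batch, a 4-stage pipeline is idle 75% of the time; with 64, under 5%.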

Sequence Parallelism. For very long sequences, even a single layer's activations can exceed GPU memory. Sequence parallelism extends tensor parallelism to shard the sequence dimension during LayerNorm and Dropout operations, where the full sequence would otherwise need to be materialized on each device.
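The memory pressure is easy to see with rough numbers. A sketch of the activation footprint at a LayerNorm input, assuming fp16 activations and illustrative sizes (none of these dimensions come from the project):

```python
# Why sequence parallelism helps: LayerNorm/Dropout activations scale as
# batch * seq_len * hidden. All sizes below are illustrative, not measured.

def act_bytes(batch, seq_len, hidden, bytes_per=2, sp_degree=1):
    """fp16 activation tensor at one LayerNorm input, sequence-sharded 1/sp."""
    return batch * (seq_len // sp_degree) * hidden * bytes_per

b, s, h = 4, 128_000, 4096           # a long-context example
full = act_bytes(b, s, h)
sharded = act_bytes(b, s, h, sp_degree=8)

print(f"unsharded: {full / 2**30:.2f} GiB per LayerNorm input")
print(f"sp=8:      {sharded / 2**30:.2f} GiB per rank")
```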

Activation Checkpointing Integration. The current implementations support activation checkpointing at the block level, but finer-grained policies like selective checkpointing based on memory pressure or offloading activations to CPU could push the memory-compute tradeoff further.

Communication Optimization. The bucketed all-reduce implementation uses synchronous collectives. Production systems like Megatron-LM overlap communication with computation by carefully scheduling operations: while one bucket is being reduced, the next layer's backward pass is already computing gradients. This requires dependency tracking and asynchronous kernel launches that I haven't implemented.
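The benefit of overlap can be estimated with a toy cost model (all timings invented): backward compute proceeds serially, while each bucket's all-reduce starts as soon as its gradients exist and runs in the background on a separate stream.

```python
# Toy cost model for comm/compute overlap during backward (timings are
# invented). Each bucket's all-reduce can begin once its gradients exist
# and runs concurrently with the remaining backward compute.

def backward_time(compute, comm, overlap):
    if not overlap:
        return sum(compute) + sum(comm)
    t = comm_done = 0.0
    for c_step, c_comm in zip(compute, comm):
        t += c_step                             # backward compute is serial
        comm_done = max(comm_done, t) + c_comm  # comm queue runs in background
    return max(t, comm_done)

compute = [10.0, 10.0, 10.0, 10.0]   # ms of backward compute per bucket
comm    = [ 4.0,  4.0,  4.0,  4.0]   # ms of all-reduce per bucket

print(f"sequential: {backward_time(compute, comm, False):.0f} ms")
print(f"overlapped: {backward_time(compute, comm, True):.0f} ms")
```

In this model only the last bucket's all-reduce is exposed; everything else hides under compute, which is the effect the dependency tracking buys.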

Fault Tolerance. At scale, hardware failures are inevitable. Checkpointing state, detecting failures, and resuming with different world sizes are essential for multi-day training runs. The current implementation assumes a fixed, reliable cluster.

Mixed Parallelism Strategies. Real systems combine multiple approaches: tensor parallelism within a node (where NVLink provides high bandwidth), data parallelism across nodes (where network bandwidth is the bottleneck), and pipeline parallelism for models that exceed even sharded memory limits. Orchestrating these layers efficiently is where frameworks like Megatron-LM and DeepSpeed earn their complexity.

6.4 Closing Thoughts

Distributed training is often treated as infrastructure: something you configure and forget about while focusing on model architecture and data. But the parallelism strategy fundamentally shapes what's possible. The difference between fitting a model on your hardware and not determines which experiments you can run at all.

Understanding these systems at the implementation level changes how you think about the tradeoffs. When you know that ZeRO-3 broadcasts parameters on every forward pass, you understand why it's slower than ZeRO-1 but enables larger models. When you've implemented bucketed gradient reduction, you can reason about why certain batch sizes interact poorly with bucket boundaries. The abstractions stop being black boxes.

The code from this project is available on GitHub.