1. Why Single-GPU Training Doesn't Scale

Understanding the constraints before building distributed systems

Part 1 of 6 in the Distributed Training series

1.1 Motivation: Understanding the System

Most distributed training material focuses on how to use abstractions like DDP or FSDP, not on what needs to be true for them to work once models and hardware stop being small. The APIs hide the constraints that actually dominate large-scale training: communication latency, memory duplication, synchronization order, and failure modes.

When I looked for references, I found plenty of high-level explanations, but very little that walks through building everything from gradient synchronization to optimizer state partitioning, with explicit correctness checks along the way. I needed to convince myself that I understood what would break, what would scale, and why these systems behave the way they do under load.

Doing that on real multi-GPU infrastructure is expensive and slow to iterate on, so this project is intentionally a proof of concept. The goal was to rebuild the core distributed training primitives under tight constraints (single consumer GPU, no opaque framework behavior, explicit correctness checks, minimal infra cost) and then use that setup to reason about how the same ideas extend to larger models and real clusters.

The result is a small transformer with a distributed training stack implemented from first principles and validated without access to multi-GPU hardware.

1.2 Baseline: Single-GPU Training and Its Limits

Ryan-GPT is a 12.5M parameter transformer language model trained initially on a single RTX 3060 (12GB VRAM). The baseline setup ran roughly 80,000 training steps in about 12 hours.

The model itself was not the limiting factor. The training system was.

Scaling model size or reducing wall-clock time immediately exposed two hard limits:

  • Memory: parameters are only a fraction of total footprint once gradients and optimizer state are included.
  • Throughput: batch size and sequence length are capped by memory, limiting tokens processed per step.

At that point, further progress required distributing work across processes.

1.3 Why Single-Process Training Does Not Scale

The baseline training loop followed the standard single-process pattern:

for step in range(max_steps):
    x, y = get_batch(data, batch_size, seq_len, device)  # sample a batch

    logits = model(x)          # forward pass
    loss = loss_fn(logits, y)

    loss.backward()            # backward pass: compute gradients
    optimizer.step()           # apply the update
    optimizer.zero_grad()      # reset gradients for the next step

This loop is functionally correct, but it implicitly assumes that computation, memory, and optimizer state all fit on a single device. Once model size increases, those assumptions break independently and for different reasons.

For a model with P parameters, the memory footprint during training is not just the P parameter values but closer to:

  • parameters: P
  • gradients: P
  • optimizer state: 2P (Adam first and second moments)
  • activations: O(B * L * H)

Even before activations, Adam-based training requires roughly 4× parameter memory. At scale, optimizer state becomes the dominant term, not parameters.
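The 4× figure can be made concrete with a small back-of-the-envelope sketch. This is a minimal estimate for the 12.5M-parameter model above, assuming fp32 everywhere and ignoring activations; the function name and breakdown are illustrative, not part of the project's code:

```python
BYTES_PER_FP32 = 4

def adam_static_memory_bytes(num_params: int) -> dict:
    """Static training memory for Adam: parameters, gradients, two moment buffers."""
    params = num_params * BYTES_PER_FP32
    grads = num_params * BYTES_PER_FP32
    optimizer_state = 2 * num_params * BYTES_PER_FP32  # first + second moments
    return {
        "params": params,
        "grads": grads,
        "optimizer_state": optimizer_state,
        "total": params + grads + optimizer_state,  # 4 * P * 4 bytes
    }

mem = adam_static_memory_bytes(12_500_000)
print(f"total: {mem['total'] / 1e6:.0f} MB")  # total: 200 MB
```

Even at 12.5M parameters, half of that static footprint is optimizer state, which is why sharding it (ZeRO) pays off before anything else does.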

On a consumer GPU, training becomes memory-bound well before compute is saturated, even at modest parameter counts.

With fixed sequence length and batch size, tokens processed per step are constant:

tokens/step = B * L

Increasing throughput requires increasing either batch size or sequence length, both of which increase activation memory linearly. Once memory limits are hit, throughput plateaus regardless of available compute.
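The coupling between throughput and activation memory can be shown in a few lines. The batch size, sequence length, and hidden size below are assumptions chosen for the sketch, not measured values from the project:

```python
def tokens_per_step(batch_size: int, seq_len: int) -> int:
    return batch_size * seq_len

def activation_bytes(batch_size: int, seq_len: int, hidden: int,
                     bytes_per_elem: int = 4) -> int:
    # O(B * L * H); real constants depend on layer count and architecture.
    return batch_size * seq_len * hidden * bytes_per_elem

base = tokens_per_step(32, 512)      # 16384 tokens/step
doubled = tokens_per_step(64, 512)   # doubling B doubles tokens/step...
ratio = activation_bytes(64, 512, 768) / activation_bytes(32, 512, 768)
print(base, doubled, ratio)          # ...but activation memory also doubles
```

There is no knob that raises tokens/step without raising activation memory by the same factor, which is exactly the ceiling described above.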

This creates a hard ceiling:

  • compute remains underutilized
  • wall-clock time cannot be reduced
  • scaling model size becomes impossible

Therefore, we must break at least one of the single-process assumptions:

  • replicate parameters and shard computation (data parallelism)
  • shard optimizer state (ZeRO-style approach)
  • shard parameters themselves (tensor parallelism)

The remainder of this project focuses on introducing these primitives incrementally.
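As a preview of the first primitive, the core of data parallelism is an all-reduce that averages gradients across replicas so every copy takes the same optimizer step. This is a pure-Python stand-in with no real communication backend; `allreduce_mean` is an illustrative name, not a library call:

```python
def allreduce_mean(per_replica_grads: list[list[float]]) -> list[float]:
    """Average one flattened gradient vector across N simulated replicas.

    On real hardware this is a collective (e.g. NCCL all-reduce); here the
    'replicas' are just lists living in one process.
    """
    n = len(per_replica_grads)
    dim = len(per_replica_grads[0])
    return [sum(g[i] for g in per_replica_grads) / n for i in range(dim)]

# Two replicas, each with gradients from its own data shard:
grads = [[1.0, 2.0], [3.0, 4.0]]
print(allreduce_mean(grads))  # [2.0, 3.0]
```

Because every replica applies the same averaged gradient, the parameters stay bit-identical across replicas, which is the invariant the later correctness checks verify.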