0. TL;DR: Ryan-GPT
Who, what, and why.
1. Why Single-GPU Training Doesn't Scale
Understanding the hardware constraints before building distributed systems.
2. Data Parallelism as the First Scaling Primitive
Explicit gradient synchronization and the cost of inter-GPU communication.
3. Bucketing and Sharding as the Second and Third Scaling Primitives
Reducing communication overhead and memory usage.
4. Tensor Parallelism as the Fourth Scaling Primitive
Splitting individual layers across GPUs.
5. Training Validation on a Single GPU
Verifying distributed code without distributed hardware.
6. Test Results and Future Directions
Crunching the numbers and what comes next.