0. TL;DR: Ryan-GPT
Who, what, and why.
1. Why Single-GPU Training Doesn't Scale
Understanding the hardware constraints before building distributed systems.
2. Data Parallelism as the First Scaling Primitive
Explicit gradient synchronization and the cost of inter-GPU communication.
3. Bucketing and Sharding as the Second and Third Scaling Primitives
Reducing communication overhead and memory usage.
4. Tensor Parallelism as the Fourth Scaling Primitive
Splitting individual layers across GPUs.
5. Training Validation on a Single GPU
Verifying distributed code without distributed hardware.
6. Test Results and Future Directions
Crunching the numbers and what comes next.