A Deep Dive into Distributed Training of Large-scale Language Models with PyTorch on a Supercomputer
This tutorial presents a comprehensive, in-depth guide to large-scale distributed training of large language models (LLMs) on supercomputers managed with SLURM. It begins with the basics of collective communication in message passing, including the gather, scatter, and all-gather operations. It then covers data parallelism techniques, namely Data Parallel (DP) and Distributed Data Parallel (DDP) in PyTorch, as well as model parallelism techniques including Tensor Parallelism, Pipeline Parallelism, and 3D Parallelism, with hands-on PyTorch code examples. It also shows how to set up and leverage distributed training frameworks such as NVIDIA Megatron-LM and Microsoft DeepSpeed to run PyTorch code efficiently on multiple GPUs on a supercomputer.
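To give a flavor of what follows, here is a minimal sketch (not the tutorial's own code) of launching PyTorch DDP under SLURM. It assumes one GPU per SLURM task, NCCL as the backend, and that MASTER_ADDR and MASTER_PORT are exported in the job script; the rank and world size are read from SLURM-provided environment variables.

```python
# Minimal DDP sketch under SLURM (illustrative; one GPU per task assumed).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # SLURM exports SLURM_PROCID, SLURM_NTASKS, and SLURM_LOCALID per task.
    # MASTER_ADDR / MASTER_PORT are assumed to be set in the sbatch script.
    rank = int(os.environ["SLURM_PROCID"])
    world_size = int(os.environ["SLURM_NTASKS"])
    local_rank = int(os.environ["SLURM_LOCALID"])

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # Wrap a toy model; DDP all-reduces gradients across ranks during backward().
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    ddp_model = DDP(model, device_ids=[local_rank])

    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)
    x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
    loss = ddp_model(x).sum()
    loss.backward()   # gradient all-reduce happens here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Such a script would typically be launched with `srun python train.py` so that SLURM spawns one process per GPU; the later sections walk through the job scripts, parallelism strategies, and framework-specific setups in detail.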