슈퍼컴퓨터에서 멀티 GPU노드기반 분산딥러닝하기

Date:

Distributed deep learning (DDL) training refers to the process of training a deep learning model on multiple machines, possibly with multiple GPUs on each machine. Not only does DDL training speed up the training process, but also enables the use of larger models and datasets that could not fit on a single GPU. This tutorial is intended to guide users to run his/her distributed deep learning codes on multiple GPU nodes using Horovod on Neuron. Neuron is a KISTI GPU cluster consisting of 65 nodes with 260 GPUs (120 of NVIDIA A100 and 140 of NVIDIA V100 GPUs). Horovod, originally developed by Uber in 2017, is a distributed deep learning framework aiming to make it easy and simple to take a DL code developed with different DL frameworks such as Tensorflow and Pytorch and scale it to run across many GPUs. This tutorial will also give a short demo of how to practice large-scale distributed deep learning training using Horovod on Perlmutter, the world 9th fastest supercomptuer, which is located at NERSC supercomputing center in Lawrence Berkeley National Laboratory.

13:30~14:30분산딥러닝 방법론 소개 (Data Parallelism, Model Parallelism, Horovod Framework)
14:30~15:00휴식
15:30~17:30KISTI GPU 클러스터 뉴론 시스템에서의 Horovod 실습 (Installation of Conda & Horovod, Singularity Container)