Recent tutorials and talks that I gave at conferences and workshops

A Deep Dive into Distributed Training of Large-scale Language Modeling with PyTorch on a Supercomputer

December 18, 2024

Tutorial, Korea Software Congress 2024 (KSC2024), Yeosu Expo Convention Center

This tutorial presents a comprehensive, in-depth guide to large-scale distributed training of LLMs on supercomputers managed with SLURM. It briefly covers the basics of collective communication in message passing, including gather, scatter, and all-gather operations, then delves into data parallelism techniques such as DataParallel (DP) and DistributedDataParallel (DDP) in PyTorch, and model parallelism techniques including tensor parallelism, pipeline parallelism, and 3D parallelism, with hands-on PyTorch code examples. It also covers how to set up and leverage distributed training tools such as NVIDIA Megatron-LM and Microsoft DeepSpeed to run PyTorch code efficiently across multiple GPUs on a supercomputer.
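
To give a flavor of the hands-on part, here is a minimal sketch (my own illustration, not the tutorial's actual code) of a PyTorch script that initializes torch.distributed from SLURM-provided environment variables, performs an all-gather, and wraps a toy model in DDP; the toy model and the assumption that MASTER_ADDR/MASTER_PORT are exported in the batch script are mine.

```python
# Minimal DDP sketch driven by SLURM environment variables (illustrative only).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # srun typically exports SLURM_PROCID/SLURM_NTASKS/SLURM_LOCALID;
    # MASTER_ADDR and MASTER_PORT are assumed to be set in the batch script.
    rank = int(os.environ.get("SLURM_PROCID", 0))
    world_size = int(os.environ.get("SLURM_NTASKS", 1))
    local_rank = int(os.environ.get("SLURM_LOCALID", 0))

    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(local_rank)

    # Collective communication example: all-gather one tensor per rank.
    local = torch.tensor([float(rank)], device="cuda")
    gathered = [torch.zeros_like(local) for _ in range(world_size)]
    dist.all_gather(gathered, local)

    # Data parallelism: each rank holds a replica; DDP averages gradients.
    model = DDP(torch.nn.Linear(16, 4).cuda(), device_ids=[local_rank])
    out = model(torch.randn(8, 16, device="cuda"))
    out.sum().backward()  # gradient all-reduce happens during backward

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```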

Best Practices for Generative AI with LLMs on a Supercomputer

August 17, 2024

Invited Session Talk, 2024 Summer Conference of the Korean Artificial Intelligence Association, BEXCO, Busan

This session introduces best practices for hands-on generative AI with LLMs on a supercomputer. It walks through the entire generative AI project lifecycle, including data collection, model selection, prompt engineering, instruction fine-tuning, and evaluation and benchmarking, and gives a live demonstration by connecting to the KISTI GPU cluster. In particular, it compares the supercomputer environment with cloud platforms such as AWS and shares hands-on generative AI practice using Jupyter notebooks in a SLURM-based supercomputer environment.
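
As one illustration of the prompt engineering step, the following minimal sketch (my own toy example, not the session's materials) contrasts zero-shot and one-shot prompts for dialogue summarization; the model name google/flan-t5-base and the example dialogues are assumptions.

```python
# Zero-shot vs. one-shot prompting with Hugging Face transformers (illustrative).
from transformers import pipeline

# Any instruction-tuned seq2seq model would do; flan-t5-base is just an assumption.
generator = pipeline("text2text-generation", model="google/flan-t5-base")

dialogue = ("A: The training job ran out of GPU memory again.\n"
            "B: Try a smaller batch size or gradient accumulation.")

zero_shot = f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"

one_shot = (
    "Summarize the following conversation.\n\n"
    "A: Is the cluster down?\nB: No, the login node is just slow today.\n\n"
    "Summary: B tells A the cluster is fine but the login node is slow.\n\n"
    f"Summarize the following conversation.\n\n{dialogue}\n\nSummary:"
)

for name, prompt in [("zero-shot", zero_shot), ("one-shot", one_shot)]:
    result = generator(prompt, max_new_tokens=50)[0]["generated_text"]
    print(f"{name}: {result}")
```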

Best Practices for Generative AI with LLMs on a Supercomputer

June 26, 2024

Tutorial, Korea Computer Congress 2024 (KCC2024), ICC Jeju

Generative AI with LLMs refers to the use of large language models like GPT-3 to generate human-like content, spanning text, images, and even code. LLMs are trained on vast amounts of data and code, and are usually carefully prompt-engineered or fine-tuned to suit specific downstream tasks such as chatbots, translation, question answering, and summarization. The contents and Python code of this tutorial originate from the 16-hour “Generative AI with LLMs” course offered by DeepLearning.AI. This tutorial mainly covers the key concepts and practices of a typical LLM-powered generative AI lifecycle, from data gathering and model selection, to instruction fine-tuning and RLHF-based alignment with human preferences, to performance evaluation and deployment. For hands-on exercises, students will have access to the KISTI GPU cluster known as Neuron, which consists of 65 nodes with 260 GPUs (120 NVIDIA A100 and 140 NVIDIA V100 GPUs) running SLURM as its workload manager.
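
For the evaluation stage of that lifecycle, metric computation can be as small as the following sketch (my own toy example with made-up predictions and references, using the Hugging Face evaluate library rather than the tutorial's exact notebooks).

```python
# Computing ROUGE scores for generated summaries (illustrative toy data).
import evaluate

rouge = evaluate.load("rouge")

predictions = [
    "The user asks how to request two GPUs on the cluster.",
    "The instructor explains how to fine-tune the model with LoRA.",
]
references = [
    "The user wants to know how to allocate two GPUs on the cluster.",
    "The instructor walks through LoRA fine-tuning of the model.",
]

scores = rouge.compute(predictions=predictions, references=references)
print(scores)  # e.g. {'rouge1': ..., 'rouge2': ..., 'rougeL': ..., 'rougeLsum': ...}
```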

Best Practices for Generative AI with LLMs on a Supercomputer

May 28, 2024

Seminar, KAIST-BECS Research Center & Computer Science Department, Room 201, N1 Building at KAIST

Generative AI with LLMs refers to the use of large language models like GPT-3 to generate human-like content, spanning text, images, and even code. LLMs are trained on vast amounts of data and code, and are usually carefully prompt-engineered or fine-tuned to suit specific downstream tasks such as chatbots, translation, question answering, and summarization. The contents and Python code of this seminar originate mainly from the 16-hour “Generative AI with LLMs” course offered by DeepLearning.AI. This talk covers the key concepts and practices of a typical LLM-powered generative AI lifecycle, from data gathering and model selection, to instruction fine-tuning and RLHF-based alignment with human preferences, to performance evaluation and deployment. I will give a short demo of how users can create a conda virtual environment on the KISTI Neuron cluster with 260 GPUs, launch a Jupyter server on a compute node, and access it from their own PC or laptop for generative AI practice on a supercomputer. The demo will illustrate how to carry out LLM practices including prompting and prompt engineering, instruction fine-tuning and parameter-efficient fine-tuning (PEFT) with LoRA, and LLM evaluation and benchmarking.
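
As a pointer to what PEFT with LoRA looks like in code, here is a minimal sketch (my own illustration under assumed settings, not the seminar's notebook) that attaches LoRA adapters to a base model using the Hugging Face peft library; the base model google/flan-t5-base, the target modules, and the hyperparameters are assumptions.

```python
# Wrapping a base model with LoRA adapters via peft (illustrative settings).
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

lora_config = LoraConfig(
    r=16,                       # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the updates
    target_modules=["q", "v"],  # T5-style attention projections (assumed)
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_2_SEQ_LM,
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # typically ~1% or less of all weights
```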

Best Practices for Generative AI with LLMs on a Supercomputer

March 25, 2024

Lecture, Spring 2024 Computer Science and Engineering Department Colloquium, Adam Schall Hall 510, Sogang University

Generative AI with LLMs refers to the use of large language models like GPT-3 to generate human-like content, spanning text, images, and even code. LLMs are trained on vast amounts of data and code, and are usually carefully prompt-engineered or fine-tuned to suit specific downstream tasks such as chatbots, translation, question answering, and summarization. The contents and Python code of this seminar originate mainly from the 16-hour “Generative AI with LLMs” course offered by DeepLearning.AI. This talk covers the key concepts and practices of a typical LLM-powered generative AI lifecycle, from data gathering and model selection, to instruction fine-tuning and RLHF-based alignment with human preferences, to performance evaluation and deployment. I will give a short demo of how users can create a conda virtual environment on the KISTI Neuron cluster with 260 GPUs, launch a Jupyter server on a compute node, and access it from their own PC or laptop for generative AI practice on a supercomputer. The demo will illustrate how to carry out LLM practices including prompting and prompt engineering, instruction fine-tuning and parameter-efficient fine-tuning (PEFT) with LoRA, and LLM evaluation and benchmarking.

Best Practices for Generative AI with LLMs on a Supercomputer

December 20, 2023

Tutorial, Korea Software Congress 2023 (KSC 2023), BEXCO, Busan

Generative AI with LLMs refers to the use of large language models like GPT-3 to generate human-like content, spanning text, images, and even code. LLMs are trained on vast amounts of data and code, and are usually carefully prompt-engineered or fine-tuned to suit specific downstream tasks such as chatbots, translation, question answering, and summarization. The contents and Python code of this tutorial originate from the 16-hour “Generative AI with LLMs” course offered by DeepLearning.AI. This tutorial mainly covers the key concepts and practices of a typical LLM-powered generative AI lifecycle, from data gathering and model selection, to instruction fine-tuning and RLHF-based alignment with human preferences, to performance evaluation and deployment. For hands-on exercises, students will have access to the KISTI GPU cluster known as Neuron, which consists of 65 nodes with 260 GPUs (120 NVIDIA A100 and 140 NVIDIA V100 GPUs) running SLURM as its workload manager.
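
Since instruction fine-tuning recurs throughout the lifecycle above, here is a minimal sketch of a full fine-tuning step with the Hugging Face Trainer API (my own toy example with an assumed model, a single made-up prompt/target pair, and assumed hyperparameters; not the tutorial's notebook).

```python
# Toy instruction fine-tuning with the Seq2Seq Trainer API (illustrative only).
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments,
                          Seq2SeqTrainer)

model_name = "google/flan-t5-small"  # assumed small model for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# A single made-up instruction example; a real run would use a full dataset.
raw = Dataset.from_dict({
    "prompt": ["Summarize: A: Is the job done? B: Yes, it finished overnight."],
    "target": ["B confirms the job finished overnight."],
})

def preprocess(example):
    inputs = tokenizer(example["prompt"], truncation=True)
    labels = tokenizer(text_target=example["target"], truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = raw.map(preprocess, remove_columns=raw.column_names)

args = Seq2SeqTrainingArguments(output_dir="ft-out", num_train_epochs=1,
                                per_device_train_batch_size=1, learning_rate=1e-4)
trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=tokenized,
                         data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
trainer.train()
```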

Multi-Node Distributed Deep Learning on a Supercomputer

July 18, 2023

Invited Session Talk, 2023 Summer Conference of the Korean Artificial Intelligence Association, Sono Calm Yeosu

Recently built top-10 supercomputers are equipped with 4 to 8 of the latest GPUs per node. This session introduces best practices for large-scale distributed deep learning using multiple GPU nodes on a supercomputer. I share and discuss the AI/DL research process on a supercomputer: (1) connect to the supercomputer, (2) create your own virtual environment, (3) get one or two GPUs allocated interactively and develop and test distributed deep learning code in a terminal or Jupyter environment, and then (4) run large-scale distributed deep learning jobs through batch submission.
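
To make step (4) concrete, here is a minimal, illustrative sketch of generating and submitting a multi-node batch job from Python; the environment name, script path, and resource values are assumptions, and the real batch script depends on the cluster's configuration.

```python
# Writing and submitting a multi-node SLURM batch job from Python (illustrative).
import subprocess
import textwrap

batch_script = textwrap.dedent("""\
    #!/bin/bash
    #SBATCH --job-name=ddl-train
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=2
    #SBATCH --gres=gpu:2
    #SBATCH --time=02:00:00

    source activate my_ddl_env          # virtual environment from step (2), assumed name
    srun python train_distributed.py    # hypothetical training script
    """)

with open("train_job.sh", "w") as f:
    f.write(batch_script)

# sbatch prints something like "Submitted batch job <id>" on success.
result = subprocess.run(["sbatch", "train_job.sh"], capture_output=True, text=True)
print(result.stdout)
```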

Distributed Deep Learning Using Multiple GPU Nodes on a Supercomputer

June 18, 2023

Tutorial, Korea Computer Congress 2023 (KCC 2023), Ramada Plaza Jeju Hotel

Distributed deep learning (DDL) training refers to the process of training a deep learning model on multiple machines, possibly with multiple GPUs on each machine. Not only does DDL training speed up the training process, but it also enables the use of larger models and datasets that cannot fit on a single GPU. This tutorial is intended to share best practices for large-scale distributed training on a supercomputer, guiding users on how to run their distributed deep learning code on multiple GPU nodes using Horovod on Neuron. The Neuron system is a KISTI GPU cluster consisting of 65 nodes with 260 GPUs (120 NVIDIA A100 and 140 NVIDIA V100 GPUs). Horovod, originally developed by Uber in 2017, is a distributed deep learning framework that aims to make it easy and simple to take a DL code developed with frameworks such as TensorFlow and PyTorch and scale it to run across many GPUs. This tutorial will also give a short demo of large-scale distributed deep learning training using Horovod on Perlmutter, the world's 9th-fastest supercomputer, located at the NERSC supercomputing center at Lawrence Berkeley National Laboratory.
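
The core Horovod pattern described above fits in a few lines; the following is a minimal sketch (a toy model and assumed hyperparameters, not the tutorial's actual code) showing GPU pinning by local rank, learning-rate scaling by world size, the distributed optimizer, and the initial broadcast from rank 0.

```python
# Minimal Horovod + PyTorch training loop (illustrative toy model).
import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())  # pin one GPU per process

model = torch.nn.Linear(32, 10).cuda()
# Scale the learning rate by the number of workers (a common heuristic).
optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

# Wrap the optimizer so gradients are averaged across all workers.
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())

# Start every worker from the same initial model and optimizer state.
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for step in range(10):
    x = torch.randn(64, 32, device="cuda")
    loss = model(x).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if hvd.rank() == 0 and step % 5 == 0:
        print(f"step {step}, loss {loss.item():.3f}")
```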

Multi-Node Distributed Deep Learning on a Supercomputer

May 17, 2023

Tutorial, special lecture for employees of (주)씨이랩, (주)씨이랩 main conference room

Distributed deep learning (DDL) training refers to the process of training a deep learning model on multiple machines, possibly with multiple GPUs on each machine. Not only does DDL training speed up the training process, but it also enables the use of larger models and datasets that cannot fit on a single GPU. This talk introduces the principles, concepts, and approaches of large-scale distributed DL practice on a supercomputer, and guides users in running their distributed deep learning code on multiple GPU nodes using Horovod on Neuron. Neuron is a KISTI GPU cluster consisting of 65 nodes with 260 GPUs (120 NVIDIA A100 and 140 NVIDIA V100 GPUs). Horovod, originally developed by Uber in 2017, is a distributed deep learning framework that aims to make it easy and simple to take a DL code developed with frameworks such as TensorFlow and PyTorch and scale it to run across many GPUs. This talk will also give a short demo of large-scale distributed deep learning training using Horovod on Perlmutter, the world's 9th-fastest supercomputer, located at the NERSC supercomputing center at Lawrence Berkeley National Laboratory.

Multi-Node Distributed Deep Learning on a Supercomputer

May 03, 2023

Tutorial, special lecture at the AI Convergence Seminar of the Department of Bio-AI Convergence, Chungnam National University, Room 411, Engineering Building 5 (W2), Chungnam National University

Distributed deep learning (DDL) training refers to the process of training a deep learning model on multiple machines, possibly with multiple GPUs on each machine. Not only does DDL training speed up the training process, but it also enables the use of larger models and datasets that cannot fit on a single GPU. This talk introduces the principles, concepts, and approaches of large-scale distributed DL practice on a supercomputer, and guides users in running their distributed deep learning code on multiple GPU nodes using Horovod on Neuron. Neuron is a KISTI GPU cluster consisting of 65 nodes with 260 GPUs (120 NVIDIA A100 and 140 NVIDIA V100 GPUs). Horovod, originally developed by Uber in 2017, is a distributed deep learning framework that aims to make it easy and simple to take a DL code developed with frameworks such as TensorFlow and PyTorch and scale it to run across many GPUs.

Distributed Deep Learning Using Multiple GPU Nodes on a Supercomputer

March 31, 2023

Tutorial, invited technical seminar at the KAIST ITRC-BECS Center, Room 201, N1 Building, KAIST

Distributed deep learning (DDL) training refers to the process of training a deep learning model on multiple machines, possibly with multiple GPUs on each machine. Not only does DDL training speed up the training process, but it also enables the use of larger models and datasets that cannot fit on a single GPU. This tutorial is intended to guide users in running their distributed deep learning code on multiple GPU nodes using Horovod on Neuron. Neuron is a KISTI GPU cluster consisting of 65 nodes with 260 GPUs (120 NVIDIA A100 and 140 NVIDIA V100 GPUs). Horovod, originally developed by Uber in 2017, is a distributed deep learning framework that aims to make it easy and simple to take a DL code developed with frameworks such as TensorFlow and PyTorch and scale it to run across many GPUs. This tutorial will also give a short demo of large-scale distributed deep learning training using Horovod on Perlmutter, the world's 9th-fastest supercomputer, located at the NERSC supercomputing center at Lawrence Berkeley National Laboratory.

Using Multiple GPUs for Distributed Deep Learning on Neuron

January 26, 2023

Tutorial, special lecture at the Department of Artificial Intelligence, Dongguk University, Room 6122, New Engineering Building

Distributed deep learning (DDL) training refers to the process of training a deep learning model on multiple machines, possibly with multiple GPUs on each machine. Not only does DDL training speed up the training process, but it also enables the use of larger models and datasets that cannot fit on a single GPU. This tutorial is intended to guide users in running their distributed deep learning code on multiple GPU nodes using Horovod on Neuron. Neuron is a KISTI GPU cluster consisting of 65 nodes with 260 GPUs (120 NVIDIA A100 and 140 NVIDIA V100 GPUs). Horovod, originally developed by Uber in 2017, is a distributed deep learning framework that aims to make it easy and simple to take a DL code developed with frameworks such as TensorFlow and PyTorch and scale it to run across many GPUs.

Multi-GPU-Node Distributed Deep Learning Tutorial

December 20, 2022

Tutorial, 2022 KSCSE/KISTI AI Winter School, Wonchuri Hall, 3rd Floor, Mountain Plaza, High1 Resort

Distributed deep learning (DDL) training refers to the process of training a deep learning model on multiple machines, possibly with multiple GPUs on each machine. Not only does DDL training speed up the training process, but it also enables the use of larger models and datasets that cannot fit on a single GPU. This tutorial is intended to guide users in running their distributed deep learning code on multiple GPU nodes using Horovod on Neuron. Neuron is a KISTI GPU cluster consisting of 65 nodes with 260 GPUs (120 NVIDIA A100 and 140 NVIDIA V100 GPUs). Horovod, originally developed by Uber in 2017, is a distributed deep learning framework that aims to make it easy and simple to take a DL code developed with frameworks such as TensorFlow and PyTorch and scale it to run across many GPUs. This tutorial will also give a short demo of large-scale distributed deep learning training using Horovod on Perlmutter, the world's 9th-fastest supercomputer, located at the NERSC supercomputing center at Lawrence Berkeley National Laboratory.

Entering the Era of Exascale Supercomputing

December 22, 2020

Invited Talk, Korea Software Congress 2020 (KSC2020), Online

Supercomputer performance has advanced continuously, from mega to giga to tera and beyond, and the era of exascale computing, a "dream computer" capable of a quintillion (10^18) operations per second, is now upon us. For several years, the United States, China, and Japan have been engaged in a fierce exascale computing race. This summer, Japan deployed the 450-petaflop-class Fugaku system, moving a step closer to an exascale system. Next year, the world's first exascale supercomputer, Frontier, is scheduled to be installed at Oak Ridge National Laboratory in the United States, ushering in the era of exascale supercomputing in earnest. This talk surveys recent global trends in supercomputer deployment and technology, as well as the history and current status of supercomputing in Korea.