AI Safety Upskilling: CUDA Track
Table of Contents
- Week 1: Intro to GPUs and writing your first kernel!
- Weeks 2 and 3: Learning to optimize your kernels!
- Weeks 4 and 5: Learning to optimize with Tensor Cores!
- Week 6: Exploring other parallel optimization techniques!
- Weeks 7 and 8: Putting it all together in Flash Attention
Week 1: Intro to GPUs and writing your first kernel!
Can you guess which architecture more closely resembles a CPU? What about a GPU?
Recommended Readings:
Motivation for GPUs in Deep Learning
A gentle introduction to CUDA
Further resources/references to use:
PMPP Book Access
NVIDIA GPU Glossary
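To see the shape of a complete CUDA program before starting the readings, here is a minimal sketch of the classic first kernel: element-wise vector addition. The names and launch parameters are illustrative, not part of any assignment.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element of c = a + b.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may have more threads than elements
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);  // unified memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover every element
    vector_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The `(n + threads - 1) / threads` rounding plus the `i < n` guard is the standard pattern for covering an array whose length is not a multiple of the block size.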
Weeks 2 and 3: Learning to optimize your kernels!
From the image, how many FLOPs (floating-point operations) are in a matrix multiplication?
Recommended Readings:
Aalto University’s Course on GPU Programming
Simon’s Blog on SGEMM (Kernels 1-5 are the most relevant for the assignment)
How to use NCU profiler
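To make the optimization readings concrete, here is a hedged sketch of the shared-memory tiling idea that Simon's early kernels build on: each block stages TILE x TILE tiles of A and B in shared memory so that every global load is reused TILE times. Sizes and names are illustrative.

```cuda
#define TILE 32

// C = A * B for square n x n matrices, one output element per thread.
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n; t += TILE) {
        // Cooperatively stage one tile of A and one tile of B in shared memory.
        As[threadIdx.y][threadIdx.x] = (row < n && t + threadIdx.x < n)
                                         ? A[row * n + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t + threadIdx.y < n && col < n)
                                         ? B[(t + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();

        // Each global load above is now reused TILE times from fast shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < n && col < n) C[row * n + col] = acc;
}
```

A launch such as `sgemm_tiled<<<dim3((n + TILE - 1) / TILE, (n + TILE - 1) / TILE), dim3(TILE, TILE)>>>(A, B, C, n)` covers the whole output; profiling this against a naive kernel in NCU shows the drop in global memory traffic.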
Weeks 4 and 5: Learning to optimize with Tensor Cores!
How much faster are Tensor Core operations compared to FP32 CUDA cores?
Recommended Readings:
Alex Armbruster’s Blog on HGEMM
Bruce’s Blog on HGEMM
NVIDIA’s Presentation on A100 Tensor Cores
Further references to use:
Primer on Inline PTX Assembly
CUTLASS GEMM Documentation
NVIDIA PTX ISA Documentation (Chapter 9.7 is most relevant)
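Before dropping down to inline PTX, the `nvcuda::wmma` API is the gentlest way to touch Tensor Cores from CUDA C++. Here is a minimal sketch of a single 16x16x16 Tensor Core tile (FP16 inputs, FP32 accumulator; requires sm_70 or newer); a real HGEMM kernel tiles many of these per warp, and all names here are illustrative.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of C = A * B on Tensor Cores.
// A is 16x16 row-major FP16, B is 16x16 column-major FP16, C is 16x16 FP32.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);       // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp, e.g. `wmma_tile<<<1, 32>>>(A, B, C)`, all 32 threads cooperatively own parts of each fragment rather than individual elements.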
Week 6: Exploring other parallel optimization techniques!
How could we compute the sum of all the elements in a vector with one million elements? (One approach is sketched after the reading list below.)
Recommended Readings:
Primer on Parallel Reduction
Warp level Primitives
Vectorization
Efficient Softmax Kernel
Online Softmax Paper
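As a sketch of how the reduction and warp-primitive readings combine (one answer to the question above), the kernel below gives each thread a private partial sum via a grid-stride loop, reduces within each warp using `__shfl_down_sync`, and finishes with one `atomicAdd` per block. It assumes `blockDim.x` is a multiple of 32, and all names are illustrative.

```cuda
#include <cuda_runtime.h>

// Sum all n elements of `in` into *out (caller initializes *out to 0).
__global__ void reduce_sum(const float* in, float* out, int n) {
    float v = 0.0f;
    // Grid-stride loop: each thread accumulates a private partial sum.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];

    // Warp-level tree reduction in registers, no shared memory needed.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // Lane 0 of each warp publishes its warp's sum to shared memory.
    __shared__ float warp_sums[32];
    int lane = threadIdx.x % 32, warp = threadIdx.x / 32;
    if (lane == 0) warp_sums[warp] = v;
    __syncthreads();

    // The first warp reduces the per-warp sums; thread 0 adds the block total.
    if (warp == 0) {
        v = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) atomicAdd(out, v);
    }
}
```

The caller zeroes the result first, e.g. `cudaMemset(out, 0, sizeof(float))`, then launches something like `reduce_sum<<<256, 256>>>(in, out, n)`.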
Weeks 7 and 8: Putting it all together in Flash Attention
Is the self-attention layer in LLMs compute-bound or memory-bound?
Recommended Readings:
Flash Attention V1 Paper
Aleksa Gordic’s Flash Attention Blog
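Flash Attention's core trick is the online-softmax recurrence from the Week 6 paper, applied to tiles of K and V held in shared memory so the full attention matrix never materializes in HBM. Below is a hedged scalar sketch of that recurrence for a single query with scalar scores and values; the real kernels apply the same max-rescaling update to whole blocks.

```cuda
#include <math.h>

// Streaming attention for one query: scores[i] = dot(q, k_i),
// shown with scalar values for clarity.
float online_attention(const float* scores, const float* values, int n) {
    float m = -INFINITY;  // running max of scores seen so far
    float l = 0.0f;       // running sum of exp(score - m)
    float acc = 0.0f;     // running (unnormalized) weighted sum of values

    for (int i = 0; i < n; ++i) {
        float m_new = fmaxf(m, scores[i]);
        float correction = expf(m - m_new);  // rescale old stats to the new max
        float p = expf(scores[i] - m_new);

        l   = l * correction + p;
        acc = acc * correction + p * values[i];
        m   = m_new;
    }
    return acc / l;  // normalize once at the end: softmax(scores) . values
}
```

Keeping only the running statistics `m`, `l`, and `acc` per query is what lets the kernel stream K and V in a single pass instead of writing the full n x n score matrix out to memory.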