AI Safety Upskilling: CUDA Track
Table of Contents
- Week 1: Intro to GPUs and writing your first kernel!
- Weeks 2 and 3: Learning to optimize your kernels!
- Weeks 4 and 5: Learning to optimize with Tensor Cores!
- Week 6: Exploring other parallel optimization techniques!
- Weeks 7 and 8: Putting it all together in Flash Attention
Week 1: Intro to GPUs and writing your first kernel!
Can you guess which architecture more closely resembles a CPU? What about a GPU?
Recommended Readings:
Motivation for GPUs in Deep Learning
A gentle introduction to CUDA
Further resources/references to use:
PMPP Book Access
NVIDIA GPU Glossary
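To see the shape of a complete CUDA program before starting the readings, here is a minimal sketch of the classic first kernel: element-wise vector addition. The names and launch parameters are illustrative, not part of any assignment.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes one element of c = a + b.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may have more threads than elements
        c[i] = a[i] + b[i];
    }
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);  // unified memory keeps the sketch short
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover every element
    vector_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

The `(n + threads - 1) / threads` rounding plus the `i < n` guard is the standard pattern for covering an array whose length is not a multiple of the block size.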
Weeks 2 and 3: Learning to optimize your kernels!
From the image, how many FLOPs (floating-point operations) are in a matrix multiplication?
Recommended Readings:
Aalto University’s Course on GPU Programming
Simon’s Blog on SGEMM (Kernels 1-5 are the most relevant for the assignment)
How to use NCU profiler
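To make the optimization readings concrete, here is a hedged sketch of the shared-memory tiling idea that Simon's early kernels build on: each block stages TILE x TILE tiles of A and B in shared memory so that every global load is reused TILE times. Sizes and names are illustrative.

```cuda
#define TILE 32

// C = A * B for square n x n matrices, one output element per thread.
__global__ void sgemm_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < n; t += TILE) {
        // Cooperatively stage one tile of A and one tile of B in shared memory.
        As[threadIdx.y][threadIdx.x] = (row < n && t + threadIdx.x < n)
                                         ? A[row * n + t + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (t + threadIdx.y < n && col < n)
                                         ? B[(t + threadIdx.y) * n + col] : 0.0f;
        __syncthreads();

        // Each global load above is now reused TILE times from fast shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    if (row < n && col < n) C[row * n + col] = acc;
}
```

A launch such as `sgemm_tiled<<<dim3((n + TILE - 1) / TILE, (n + TILE - 1) / TILE), dim3(TILE, TILE)>>>(A, B, C, n)` covers the whole output; profiling this against a naive kernel in NCU shows the drop in global memory traffic.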
Weeks 4 and 5: Learning to optimize with Tensor Cores!
How much faster are Tensor Core operations compared to FP32 CUDA cores?
Recommended Readings:
Alex Armbruster’s Blog on HGEMM
Bruce’s Blog on HGEMM
NVIDIA’s Presentation on A100 Tensor Cores
Further references to use:
Primer on Inline PTX Assembly
CUTLASS GEMM Documentation
NVIDIA PTX ISA Documentation (Chapter 9.7 is most relevant)
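Before dropping down to inline PTX, the `nvcuda::wmma` API is the gentlest way to touch Tensor Cores from CUDA C++. Here is a minimal sketch of a single 16x16x16 Tensor Core tile (FP16 inputs, FP32 accumulator; requires sm_70 or newer); a real HGEMM kernel tiles many of these per warp, and all names here are illustrative.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes a single 16x16 tile of C = A * B on Tensor Cores.
// A is 16x16 row-major FP16, B is 16x16 column-major FP16, C is 16x16 FP32.
__global__ void wmma_tile(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);       // zero the accumulator
    wmma::load_matrix_sync(a_frag, A, 16);   // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);  // D = A*B + C on Tensor Cores
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

Launched with a single warp, e.g. `wmma_tile<<<1, 32>>>(A, B, C)`, all 32 threads cooperatively own parts of each fragment rather than individual elements.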
Week 6: Exploring other parallel optimization techniques!
How could we compute the sum of all the elements in a vector with one million elements? (One approach is sketched after the reading list below.)
Recommended Readings:
Primer on Parallel Reduction
Warp level Primitives
Vectorization
Efficient Softmax Kernel
Online Softmax Paper
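As a sketch of how the reduction and warp-primitive readings combine (one answer to the question above), the kernel below gives each thread a private partial sum via a grid-stride loop, reduces within each warp using `__shfl_down_sync`, and finishes with one `atomicAdd` per block. It assumes `blockDim.x` is a multiple of 32, and all names are illustrative.

```cuda
#include <cuda_runtime.h>

// Sum all n elements of `in` into *out (caller initializes *out to 0).
__global__ void reduce_sum(const float* in, float* out, int n) {
    float v = 0.0f;
    // Grid-stride loop: each thread accumulates a private partial sum.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        v += in[i];

    // Warp-level tree reduction in registers, no shared memory needed.
    for (int offset = 16; offset > 0; offset >>= 1)
        v += __shfl_down_sync(0xffffffff, v, offset);

    // Lane 0 of each warp publishes its warp's sum to shared memory.
    __shared__ float warp_sums[32];
    int lane = threadIdx.x % 32, warp = threadIdx.x / 32;
    if (lane == 0) warp_sums[warp] = v;
    __syncthreads();

    // The first warp reduces the per-warp sums; thread 0 adds the block total.
    if (warp == 0) {
        v = (lane < blockDim.x / 32) ? warp_sums[lane] : 0.0f;
        for (int offset = 16; offset > 0; offset >>= 1)
            v += __shfl_down_sync(0xffffffff, v, offset);
        if (lane == 0) atomicAdd(out, v);
    }
}
```

The caller zeroes the result first, e.g. `cudaMemset(out, 0, sizeof(float))`, then launches something like `reduce_sum<<<256, 256>>>(in, out, n)`.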
Weeks 7 and 8: Putting it all together in Flash Attention
Is the self-attention layer in LLMs compute-bound or memory-bound?
Recommended Readings:
Flash Attention V1 Paper
Aleksa Gordic’s Flash Attention Blog
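Flash Attention's core trick is the online-softmax recurrence from the Week 6 paper, applied to tiles of K and V held in shared memory so the full attention matrix never materializes in HBM. Below is a hedged scalar sketch of that recurrence for a single query with scalar scores and values; the real kernels apply the same max-rescaling update to whole blocks.

```cuda
#include <math.h>

// Streaming attention for one query: scores[i] = dot(q, k_i),
// shown with scalar values for clarity.
float online_attention(const float* scores, const float* values, int n) {
    float m = -INFINITY;  // running max of scores seen so far
    float l = 0.0f;       // running sum of exp(score - m)
    float acc = 0.0f;     // running (unnormalized) weighted sum of values

    for (int i = 0; i < n; ++i) {
        float m_new = fmaxf(m, scores[i]);
        float correction = expf(m - m_new);  // rescale old stats to the new max
        float p = expf(scores[i] - m_new);

        l   = l * correction + p;
        acc = acc * correction + p * values[i];
        m   = m_new;
    }
    return acc / l;  // normalize once at the end: softmax(scores) . values
}
```

Keeping only the running statistics `m`, `l`, and `acc` per query is what lets the kernel stream K and V in a single pass instead of writing the full n x n score matrix out to memory.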