AI Safety Upskilling: CUDA Track

Table of Contents

  1. Week 1: Intro to GPUs and writing your first kernel!
  2. Weeks 2 and 3: Learning to optimize your kernels!
  3. Weeks 4 and 5: Learning to optimize with Tensor Cores!
  4. Week 6: Exploring other parallel optimization techniques!
  5. Weeks 7 and 8: Putting it all together in Flash Attention

Week 1: Intro to GPUs and writing your first kernel!

Figure: CPU vs. GPU architecture ("the GPU devotes more transistors to data processing")

Can you guess which architecture more closely resembles a CPU? What about a GPU?

Motivation for GPUs in Deep Learning
A gentle introduction to CUDA

Further resources/references to use:

PMPP Book Access
NVIDIA GPU Glossary
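
To make the launch mechanics concrete before the assignment, here is a minimal, self-contained vector-addition example (a hypothetical illustration, not the course's starter code): one thread per element, a bounds check, and enough blocks to cover the input.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Element-wise vector addition: each thread handles one index.
__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard threads past the end
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Unified (managed) memory keeps the example short; real code often
    // uses explicit cudaMalloc + cudaMemcpy instead.
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;  // round up to cover all n elements
    vector_add<<<blocks, threads>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %f\n", c[0]);  // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```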

Weeks 2 and 3: Learning to optimize your kernels!

Figure: matrix multiplication (GEMM)

From the figure, how many FLOPs (floating-point operations) does a matrix multiplication perform?

Aalto University’s Course on GPU Programming
Simon’s Blog on SGEMM (Kernels 1-5 are the most relevant for the assignment)
How to use NCU profiler

Further references to use:

NCU Documentation
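
As a mental model for the profiling work, here is a naive SGEMM sketch, roughly the baseline that Simon's Kernel 1 starts from (the code and names here are illustrative, not taken from his post): one thread per output element, every operand read straight from global memory. Note that multiplying an M×K matrix by a K×N matrix performs about 2·M·N·K FLOPs (one multiply and one add per term), which is the number to compare against the throughput NCU reports.

```cuda
// C = alpha * A @ B + beta * C, with row-major A (MxK), B (KxN), C (MxN).
__global__ void sgemm_naive(int M, int N, int K, float alpha,
                            const float* A, const float* B,
                            float beta, float* C) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C this thread owns
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C this thread owns
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)                   // K multiply-adds per output
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = alpha * acc + beta * C[row * N + col];
    }
}
```

Each of the later kernels in the blog series (coalescing, shared-memory tiling, register blocking) changes how this loop touches memory, which is exactly what the NCU metrics make visible.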

Weeks 4 and 5: Learning to optimize with Tensor Cores!

Figure: Tensor Core matrix multiply-accumulate

How much faster are Tensor Core operations compared to FP32 CUDA Cores?

Alex Armbruster’s Blog on HGEMM
Bruce’s Blog on HGEMM
NVIDIA’s Presentation on A100 Tensor Cores

Further references to use:

Primer on Inline PTX Assembly
CUTLASS GEMM Documentation
NVIDIA PTX ISA Documentation (Chapter 9.7 is most relevant)
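
Before reaching for inline PTX mma instructions, the WMMA API in <mma.h> is the most approachable way to touch Tensor Cores. Below is a toy sketch (not a tuned HGEMM, and the kernel name is made up): a single warp computes one 16×16×16 tile in half precision with float accumulation. It assumes a launch with exactly 32 threads and an sm_70 or newer target.

```cuda
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

// One warp computes D = A * B for a single 16x16x16 tile on the Tensor Cores.
__global__ void wmma_tile(const half* A,   // 16x16, row-major
                          const half* B,   // 16x16, column-major
                          float* D) {      // 16x16, row-major
    // Fragments live in registers, distributed across the 32 threads of the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::fill_fragment(acc_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);               // leading dimension = 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);  // the Tensor Core MMA
    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}
```

A real HGEMM tiles many of these fragments per block, stages A and B through shared memory, and double-buffers the loads, which is what the HGEMM blogs above walk through.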

Week 6: Exploring other parallel optimization techniques!

Figure: parallel reduction

How could we compute the sum of all the elements of a vector with one million entries?

Primer on Parallel Reduction
Warp-Level Primitives
Vectorization
Efficient Softmax Kernel
Online Softmax Paper
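
For the question above, a common recipe combines a grid-stride loop, warp-level shuffles, and one atomic per block: each thread accumulates a private partial sum, each warp reduces with __shfl_down_sync, and block results are combined with atomicAdd. The kernel below is a hypothetical sketch along those lines (the result pointer is assumed to be zero-initialized), not the assignment's reference solution.

```cuda
__global__ void sum_kernel(const float* __restrict__ x, float* result, int n) {
    float local = 0.0f;

    // Grid-stride loop: each thread accumulates a private partial sum.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        local += x[i];

    // Warp-level reduction with shuffle intrinsics (no shared memory needed here).
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        local += __shfl_down_sync(0xffffffff, local, offset);

    // Lane 0 of each warp publishes its warp's sum to shared memory.
    __shared__ float warp_sums[32];
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;
    if (lane == 0) warp_sums[warp] = local;
    __syncthreads();

    // The first warp reduces the per-warp sums, then one atomicAdd per block.
    if (warp == 0) {
        int num_warps = (blockDim.x + warpSize - 1) / warpSize;
        local = (lane < num_warps) ? warp_sums[lane] : 0.0f;
        for (int offset = warpSize / 2; offset > 0; offset /= 2)
            local += __shfl_down_sync(0xffffffff, local, offset);
        if (lane == 0) atomicAdd(result, local);
    }
}
```

Vectorized loads (float4) in the grid-stride loop and tuning the grid size are the usual next steps, and the same thread → warp → block → grid hierarchy carries over directly to the softmax kernels.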

Weeks 7 and 8: Putting it all together in Flash Attention

Figure: Flash Attention

Is the self-attention layer in LLMs compute-bound or memory-bound?

Flash Attention V1 Paper
Aleksa Gordic’s Flash Attention Blog
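
The ingredient Flash Attention borrows from online softmax is that a single pass over the keys can produce the softmax-weighted sum without ever materializing the row of scores. The serial device function below shows that recurrence for one query row (a simplified sketch with a made-up signature; real kernels tile K and V through shared memory and parallelize across the block). It also makes the answer to the question above concrete: the arithmetic per key is small, so what dominates the naive implementation is reading and writing the N×N score matrix to HBM.

```cuda
// One query row attends over N keys/values in a single pass, using the
// online-softmax running max (m) and running denominator (l).
__device__ void attention_one_row(const float* q,   // [d] query row
                                  const float* K,   // [N][d], row-major
                                  const float* V,   // [N][d], row-major
                                  float* o,         // [d] output row
                                  int N, int d, float scale) {
    float m = -INFINITY;   // running max of the scores seen so far
    float l = 0.0f;        // running sum of exp(score - m)
    for (int j = 0; j < d; ++j) o[j] = 0.0f;

    for (int i = 0; i < N; ++i) {
        // score_i = (q . K_i) * scale; never stored, only streamed.
        float s = 0.0f;
        for (int j = 0; j < d; ++j) s += q[j] * K[i * d + j];
        s *= scale;

        // Online softmax update: when the max grows, rescale what we have so far.
        float m_new = fmaxf(m, s);
        float correction = __expf(m - m_new);  // rescales the old accumulators
        float p = __expf(s - m_new);           // weight of the current key
        l = l * correction + p;
        for (int j = 0; j < d; ++j)
            o[j] = o[j] * correction + p * V[i * d + j];
        m = m_new;
    }

    // Normalize by the softmax denominator at the very end.
    for (int j = 0; j < d; ++j) o[j] /= l;
}
```

Flash Attention applies the same rescaling block-by-block instead of key-by-key, which is why it can keep the working set in on-chip SRAM and only touch HBM for Q, K, V, and the output.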