Speed Up PyTorch With Custom Kernels. But It Gets Progressively Darker
We’ll begin with torch.compile, move on to writing a custom Triton kernel, and finally dive into designing a CUDA kernel from scratch.

Source: Towards Data Science