Maximizing GPU efficiency when training large language models (LLMs) is challenging: out-of-memory (OOM) errors, batch-size scaling, and long sequence lengths all get in the way. To address these challenges, LinkedIn has developed an open-source library called Liger-Kernel, which offers efficient Triton kernels for LLM training. With just one line of code, it can increase training throughput by 20% and reduce memory usage by 60%.
The custom Triton kernels we developed at LinkedIn integrate smoothly with Flash Attention, PyTorch FSDP, and DeepSpeed. Patch your Hugging Face model with one line, or compose your own model from the provided kernels. Lightweight and efficient, the kernels have minimal dependencies: just Torch and Triton.
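For example, the one-line patch looks roughly like this (a sketch based on the Llama entry point; the exact per-model function names and the model checkpoint below are illustrative, so check the repo README):

```python
import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Swap Liger's Triton kernels (RMSNorm, RoPE, SwiGLU, cross entropy, ...)
# into the Hugging Face Llama modules. One call, before model creation.
apply_liger_kernel_to_llama()

model = transformers.AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B"  # any Llama checkpoint works here
)
```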
*Implementation details*
We took our inspiration from llm.c, but reimplemented RMSNorm, RoPE, SwiGLU, CrossEntropy, and FusedLinearCrossEntropy from scratch, with both forward and backward passes written in pure Triton. The kernels are exact, with no approximations.
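To give a flavor of what these kernels look like, here is a minimal RMSNorm forward pass in Triton. This is an illustrative sketch, not the Liger-Kernel implementation: it assumes one program per row, a contiguous input, and a hidden size that fits in a single block.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def rmsnorm_fwd(X, W, Y, stride, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    # Each program normalizes one row of X.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    # RMSNorm: y = x / sqrt(mean(x^2) + eps) * w
    rms = tl.sqrt(tl.sum(x * x, axis=0) / n_cols + eps)
    w = tl.load(W + cols, mask=mask, other=0.0).to(tl.float32)
    tl.store(Y + row * stride + cols, x / rms * w, mask=mask)

def rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6):
    # x: (n_rows, n_cols) contiguous; weight: (n_cols,)
    n_rows, n_cols = x.shape
    y = torch.empty_like(x)
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    rmsnorm_fwd[(n_rows,)](x, weight, y, x.stride(0), n_cols, eps,
                           BLOCK_SIZE=BLOCK_SIZE)
    return y
```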
We employ kernel fusion, in-place operations, tiling, and chunking. For example, because some models have very large vocabularies, materializing the full logits tensor can cost tens of gigabytes; by combining chunking, gradient-in-forward, and online softmax, we reduce that memory usage by 5x.
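The chunking and gradient-in-forward ideas can be sketched in plain PyTorch (the function name, shapes, and chunk size below are illustrative; the real kernels fuse these steps, plus online softmax, inside Triton):

```python
import torch
import torch.nn.functional as F

def chunked_linear_cross_entropy(hidden, weight, targets, chunk_size=1024):
    # hidden: (N, H) final hidden states; weight: (V, H) lm_head weight;
    # targets: (N,) token ids. Only (chunk_size, V) logits live at any time.
    n_tokens = hidden.shape[0]
    total_loss = hidden.new_zeros(())
    grad_hidden = torch.empty_like(hidden)
    for start in range(0, n_tokens, chunk_size):
        h = hidden[start:start + chunk_size].detach().requires_grad_(True)
        logits = h @ weight.T  # never materialize the full (N, V) tensor
        loss = F.cross_entropy(logits, targets[start:start + chunk_size],
                               reduction="sum")
        loss.backward()  # gradient-in-forward: grads computed now, logits freed
        grad_hidden[start:start + chunk_size] = h.grad
        total_loss += loss.detach()
    # Normalize so the loss and gradients match mean-reduced cross entropy.
    return total_loss / n_tokens, grad_hidden / n_tokens
```

If `weight` requires grad, each `backward()` call also accumulates the lm_head weight gradient chunk by chunk, so no extra bookkeeping is needed for it.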
torch.compile now supports custom Triton kernels, so ours integrate seamlessly. For example, by combining torch.compile with FusedLinearCrossEntropy, we have observed more than a 50% reduction in memory usage.
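Concretely, a user-defined Triton kernel can sit inside a compiled region (a minimal sketch, reusing the hypothetical rmsnorm wrapper from above; requires a recent PyTorch 2.x):

```python
import torch

@torch.compile
def fused_block(x, weight):
    # Dynamo traces through the user-defined Triton kernel launch, so it can
    # be optimized together with the surrounding graph instead of breaking it.
    return rmsnorm(x, weight)

x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
w = torch.ones(4096, device="cuda", dtype=torch.float16)
y = fused_block(x, w)
```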
*Acknowledgement*
We’d like to first give a shout-out to Andrej Karpathy’s llm.c for inspiring us to develop llm.triton. FlashAttention, vLLM, and Unsloth have been pioneers in custom Triton kernels. Special thanks to the Triton team for the revolutionary kernel interface, and to Efficient Cross Entropy for the linear cross entropy tricks.
We would like to thank Animesh Singh, Haowen Ning, and Yanning Chen for their leadership support, and Shao Tang, Qingquan Song, Yun Dai, Vignesh Kothapalli, Jason (Siyu) Zhu, Steven Shimizu, Shivam Sahni, and Zain Merchant for their technical contributions.
*Want to contribute?*
Are you a dedicated researcher looking for a reliable kernel, a kernel guru who can help us shape better kernels, or a curious novice wanting to learn Triton? Join our community at https://discord.gg/CX2YmNmn to hack together.
Stay tuned for our talk at CUDA MODE (https://discord.gg/CX2YmNmn?event=1273323969788772455), where we will provide an immersive experience in developing Triton kernels. We’ll share code examples, and together we’ll identify bottlenecks, derive backward formulas, ensure exactness, and fix intricate bugs.