📚 200+ Tensor/CUDA Cores kernels: ⚡️flash-attn-mma and ⚡️hgemm with WMMA, MMA, and CuTe, reaching 98%~100% of cuBLAS/FA2 TFLOPS 🎉🎉.
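For context on what "hgemm with WMMA" involves, here is a minimal sketch of a half-precision GEMM tile using CUDA's WMMA API. The grid mapping and the assumption that M, N, K are multiples of 16 are illustrative choices, not code taken from the repository above.

```cuda
// Minimal WMMA sketch: one warp computes a single 16x16 tile of C = A * B
// for half-precision inputs with FP32 accumulation. Assumes M, N, K are
// multiples of 16 and that A, B, C are all row-major.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void wmma_hgemm_tile(const half* A, const half* B, float* C,
                                int M, int N, int K) {
    // Map each warp to one 16x16 output tile of C.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    // Walk the K dimension in 16-wide steps, feeding the Tensor Cores.
    for (int k = 0; k < K; k += 16) {
        wmma::load_matrix_sync(a_frag, A + (warpM * 16) * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + warpN * 16, N);
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + (warpM * 16) * N + warpN * 16, c_frag, N,
                            wmma::mem_row_major);
}
```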
🔥🔥🔥 A curated collection of awesome public CUDA, cuBLAS, cuDNN, CUTLASS, TensorRT, TensorRT-LLM, Triton, TVM, MLIR, and High Performance Computing (HPC) projects.
Benchmarks of the C++ interfaces of FlashAttention and FlashAttention-2 in large language model (LLM) inference scenarios.
GEMM- and Winograd-based convolutions using CUTLASS.
A study of CUTLASS.
Multiple GEMM operators constructed with CUTLASS to support LLM inference.
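As a rough illustration of how GEMM operators are typically constructed with CUTLASS's device-level API (the element types, layouts, and Sm80 target below are assumptions for the sketch, not code from the repository above):

```cuda
// Minimal CUTLASS device-GEMM sketch: computes D = alpha * A * B + beta * C
// with FP16 inputs and FP32 accumulation on Tensor Cores.
#include <cutlass/gemm/device/gemm.h>

using Gemm = cutlass::gemm::device::Gemm<
    cutlass::half_t, cutlass::layout::RowMajor,   // A: element type, layout
    cutlass::half_t, cutlass::layout::RowMajor,   // B: element type, layout
    float,           cutlass::layout::RowMajor,   // C/D: element type, layout
    float,                                        // accumulator type
    cutlass::arch::OpClassTensorOp,               // use Tensor Cores
    cutlass::arch::Sm80>;                         // target architecture

cutlass::Status run_gemm(int M, int N, int K,
                         const cutlass::half_t* A,   // device pointers
                         const cutlass::half_t* B,
                         float* C) {
    Gemm gemm_op;
    Gemm::Arguments args({M, N, K},
                         {A, K},          // lda = K (row-major A)
                         {B, N},          // ldb = N (row-major B)
                         {C, N},          // source C
                         {C, N},          // destination D (in place)
                         {1.0f, 0.0f});   // alpha = 1, beta = 0
    return gemm_op(args);                 // configures and launches the kernel
}
```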
A PyTorch implementation of block-sparse operations.