Stars
A C++ compile-time math library using generalized constant expressions
Examples demonstrating available options to program multiple GPUs in a single node or a cluster
Learning how to write "Less Slow" code in C++20, C99, CUDA, PTX, & Assembly, from numerics & SIMD to coroutines, ranges, exception handling, networking and user-space IO
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
Demonstration of various hardware effects.
Memory-Efficient CUDA kernels for training ConvNets with PyTorch.
Fork of https://source.codeaurora.org/quic/hexagon_nn/nnlib
Any model. Any hardware. Zero compromise. Built with @ziglang / @openxla / MLIR / @bazelbuild
Cross-platform C++ SDK & model hub for easy AI inference
A retargetable MLIR-based machine learning compiler and runtime toolkit.
Official code of "QKFormer: Hierarchical Spiking Transformer using Q-K Attention" (NeurIPS 2024, Spotlight 3%)
An easy-to-use and fast library for task-based parallelism, utilizing coroutines.
Low-overhead tracing of all Linux kernel-user transitions, for serious performance analysis. Includes kernel patches, loadable module, and post-processing software. Output is HTML/SVG per-CPU-core …
[CVPR 2024] Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3)
A minimal GPU design in Verilog to learn how GPUs work from the ground up
CPU INFOrmation library (x86/x86-64/ARM/ARM64, Linux/Windows/Android/macOS/iOS)
A CPU tool for benchmarking peak floating-point performance
Utility to explode a tflite pipeline into individual ops for testing.
A language for fast, portable data-parallel computation
Library for specialized dense and sparse matrix operations, and deep learning primitives.
A collection of benchmarks to measure basic GPU capabilities
Tensor Core Multiplication at the Speed of CuBLAS in Three Simple Steps
Large World Model -- Modeling Text and Video with Millions Context
The Compute Library is a set of computer vision and machine learning functions optimised for both Arm CPUs and GPUs using SIMD technologies.
GPU programming related news and material links
VMamba: Visual State Space Models; code is based on Mamba