Starred repositories
An evolutionary approach to find small and low latency sorting networks
A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs.
CUDA Python: Performance meets Productivity
A high-performance distributed file system designed to address the challenges of AI training and inference workloads.
Analyze computation-communication overlap in DeepSeek V3/R1.
A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.
DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling
DeepEP: an efficient expert-parallel communication library
FlashMLA: Efficient Multi-head Latent Attention Kernels
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
MoBA: Mixture of Block Attention for Long-Context LLMs
NVIDIA Linux open GPU kernel modules with P2P support
Simple retrieval from LLMs at various context lengths to measure accuracy
Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks
A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology
A flexible framework for heterogeneous LLM inference and fine-tuning optimizations
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
How to optimize common algorithms in CUDA.
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Large Language Model Text Generation Inference
A high-throughput and memory-efficient inference and serving engine for LLMs
Transformer-related optimizations, including BERT and GPT
Grasper: A High Performance Distributed System for OLAP on Property Graphs.
A solver for subgraph isomorphism problems, based upon a series of papers by subsets of McCreesh, Prosser, and Trimble.
CP 2015 subgraph isomorphism experiments, data and paper