👋
bonjour
  • Tsinghua University
  • Beijing, China

Organizations

@nju-calabash

Starred repositories

An evolutionary approach to find small and low latency sorting networks

HTML 80 10 Updated Feb 22, 2026

A tool for running small microbenchmarks on recent Intel and AMD x86 CPUs.

Python 515 67 Updated Mar 29, 2026

CUDA Python: Performance meets Productivity

Cython 3,231 276 Updated Apr 28, 2026

A high-performance distributed file system designed to address the challenges of AI training and inference workloads.

C++ 9,840 1,038 Updated Mar 30, 2026

Expert Parallelism Load Balancer

Python 1,365 201 Updated Mar 24, 2025

Analyze computation-communication overlap in V3/R1.

1,152 146 Updated Mar 21, 2025

A bidirectional pipeline parallelism algorithm for computation-communication overlap in DeepSeek V3/R1 training.

Python 2,951 322 Updated Jan 14, 2026

DeepGEMM: clean and efficient FP8 GEMM kernels with fine-grained scaling

Cuda 7,125 946 Updated Apr 24, 2026

DeepEP: an efficient expert-parallel communication library

Cuda 9,578 1,208 Updated Apr 28, 2026

FlashMLA: Efficient Multi-head Latent Attention Kernels

C++ 12,604 1,015 Updated Apr 27, 2026

Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation

7,984 287 Updated May 15, 2025

MoBA: Mixture of Block Attention for Long-Context LLMs

Python 2,105 142 Updated Apr 3, 2025

NVIDIA Linux open GPU with P2P support

C 1,360 139 Updated Jun 6, 2025

Doing simple retrieval from LLM models at various context lengths to measure accuracy

Jupyter Notebook 2,275 242 Updated Aug 17, 2024

Reexamining Direct Cache Access to Optimize I/O Intensive Applications for Multi-hundred-gigabit Networks

Makefile 102 22 Updated Sep 2, 2021

A fast GPU memory copy library based on NVIDIA GPUDirect RDMA technology

C++ 1,373 185 Updated Mar 12, 2026

A Flexible Framework for Experiencing Heterogeneous LLM Inference/Fine-tune Optimizations

Python 17,086 1,277 Updated Apr 27, 2026

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 5,219 714 Updated Apr 28, 2026

How to optimize some algorithms in CUDA.

Cuda 2,952 272 Updated Apr 22, 2026

CUDA Templates and Python DSLs for High-Performance Linear Algebra

C++ 9,644 1,823 Updated Apr 25, 2026

Large Language Model Text Generation Inference

Python 10,847 1,263 Updated Mar 21, 2026

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 78,464 16,202 Updated Apr 28, 2026

Transformer related optimization, including BERT, GPT

C++ 6,414 935 Updated Mar 27, 2024

Inference code for Llama models

Python 59,375 9,818 Updated Jan 26, 2025

Intel® Performance Counter Monitor (Intel® PCM)

C++ 3,262 523 Updated Apr 13, 2026

Grasper: A High Performance Distributed System for OLAP on Property Graphs.

C++ 30 9 Updated Apr 3, 2021

A solver for subgraph isomorphism problems, based upon a series of papers by subsets of McCreesh, Prosser, and Trimble.

C++ 101 28 Updated Apr 3, 2026

CP 2015 subgraph isomorphism experiments, data and paper

C++ 13 5 Updated Sep 5, 2015