Lists (32)
AI Efficiency
Attention
Checkpointing
Collective Communication
Compilers
CUDA
Datasets
Diffusion
Distributed Systems
Graphs
gRPC
Hardware
HPC
Infrastructure as Code
Jax
Job Orchestration
K8s
Kernel Language
Language Models
Linux
Mechanistic Interpretability
NVIDIA GDS: DMA/RDMA
PEFT
PyTorch
Quantization
Research Tools
Rust Machine Learning Ecosystem
Serving
Simulation
Storage
Vision
WASM
Starred repositories
A Python DSL to write Nvidia PTX for Hopper and Blackwell in JAX and PyTorch
Automated CUDA kernel performance diagnostics from NVIDIA Nsight Compute (NCU) CSV exports.
Hy3 preview (295B A21B), a leading reasoning and agent model for its size, with great cost efficiency
A pure-Python implementation of the Nvidia CuTe layout algebra intended to be approachable and easy to learn.
Efficient and unified implementations for TopK-based sparse attention
CUDA kernels for linear attention variants, written in CuTe DSL and CUTLASS C++.
TraceRoot - open-source observability and self-healing layer for AI agents. YC S25
Reverse-engineered NVIDIA SASS instruction dictionary, with kernel audits and pattern recognition across GPU architectures.
Comparative study and experimentation on standard vs. mHC vs. attention residual (full and block).
PTX ISA 9.1 documentation converted to searchable markdown. Includes Claude Code skill for CUDA development.
Train the smallest LM you can that fits in 16MB. Best model wins!
Autonomous systems-engineering CLI agent for any cloud environment: AWS, GCP, Cloudflare, etc.
A plug-and-play compiler that delivers free-lunch optimizations for both inference and training.
Autonomous GPU kernel optimization system driven by AI agents.
MDM-Prime-v2: Binary Encoding and Index Shuffling Enable Compute-optimal Scaling of Diffusion Language Models
Engine-agnostic LLM gateway in Rust. Full OpenAI & Anthropic API compatibility across SGLang, vLLM, TRT-LLM, OpenAI, Gemini & more. Industry-first gRPC pipeline, KV cache-aware routing, chat histor…
IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse
geohot / nanocode
Forked from 1rgs/nanocode
Minimal Claude Code alternative. Single Python file, zero dependencies, ~250 lines.
Minimal and scalable research codebase in JAX, designed for rapid iteration on frontier research in LLM and other autoregressive models.
AI agents running research on single-GPU nanochat training automatically
GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring
A lightweight inference engine supporting speculative decoding (SSD).