Stars
A Chinese-language reinforcement learning tutorial (the "Mushroom Book" 🍄); read it online at https://datawhalechina.github.io/easy-rl/
CUDA Templates and Python DSLs for High-Performance Linear Algebra
Evaluate and Enhance Your LLM Deployments for Real-World Inference Needs
The awesome collection of OpenClaw skills. 5,400+ skills filtered and categorized from the official OpenClaw Skills Registry. 🦞
AI-powered, vision-driven UI automation for every platform.
Midscene connector for PCs, supporting both local machines and remote PC servers, on Windows/Linux/macOS: a cross-platform desktop automation agent built on Midscene.
Paper list in the survey: A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
[ACL 2026] Towards Efficient Large Language Model Serving: A Survey on System-Aware KV Cache Optimization
SpotServe: Serving Generative Large Language Models on Preemptible Instances
Persist and reuse KV Cache to speedup your LLM.
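For context, KV-cache reuse means computing the attention keys/values for a shared prompt prefix once, persisting them, and letting later requests that share the prefix skip that prefill work. A minimal sketch of the idea using the Hugging Face transformers API (illustrative only, not this project's interface):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# 1) Prefill the shared prefix once and keep its KV cache around.
prefix = tok("You are a helpful assistant. ", return_tensors="pt")
with torch.no_grad():
    cache = model(**prefix, use_cache=True).past_key_values

# 2) A later request sharing the prefix reuses the cache instead of
#    recomputing attention over the prefix tokens.
suffix = tok("What is speculative decoding?", return_tensors="pt")
attn = torch.ones(
    1, prefix.input_ids.shape[1] + suffix.input_ids.shape[1], dtype=torch.long
)  # mask must cover cached prefix + new tokens
with torch.no_grad():
    out = model(input_ids=suffix.input_ids, attention_mask=attn,
                past_key_values=cache, use_cache=True)
```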
This project shares the technical principles behind large language models along with hands-on experience (LLM engineering and putting LLM applications into production).
TurboDiffusion: 100–200× Acceleration for Video Diffusion Models
Offline optimization of your disaggregated Dynamo graph
A compact implementation of SGLang, designed to demystify the complexities of modern LLM serving systems.
A domain-specific language designed to streamline the development of high-performance GPU/CPU/accelerator kernels
A Simplified PyTorch Implementation of Vision Transformer (ViT)
Train speculative decoding models effortlessly and port them smoothly to SGLang serving.
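For background, speculative decoding has a small draft model cheaply propose several tokens, which the large target model then verifies in a single forward pass. A minimal sketch of the greedy variant (production systems use an acceptance-rejection rule for sampling; `target` and `draft` are hypothetical callables returning per-position next-token logits):

```python
import torch

def speculative_step(target, draft, ids, k=4):
    """One greedy speculative-decoding step.

    `target(ids)` and `draft(ids)` return next-token logits for every
    position, shape [seq_len, vocab]. Accept the longest prefix of the
    draft's proposals that the target would also have chosen.
    """
    # Draft model proposes k tokens autoregressively (cheap).
    proposal = ids
    for _ in range(k):
        nxt = draft(proposal)[-1].argmax()
        proposal = torch.cat([proposal, nxt.view(1)])

    # Target scores all k proposals in ONE forward pass (expensive,
    # but amortized over up to k accepted tokens).
    logits = target(proposal)
    n = ids.shape[0]
    accepted = ids
    for i in range(k):
        want = logits[n - 1 + i].argmax()
        accepted = torch.cat([accepted, want.view(1)])  # draft token, or target's correction
        if want != proposal[n + i]:
            break  # first disagreement: discard the remaining draft tokens
    return accepted

# Toy usage: draft == target (a random scorer), so every proposal is accepted.
torch.manual_seed(0)
emb = torch.randn(100, 100)
score = lambda ids: emb[ids]          # [seq_len, vocab=100]
print(speculative_step(score, score, torch.tensor([1, 2, 3])))
```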
Open Model Engine (OME) — Kubernetes operator for LLM serving, GPU scheduling, and model lifecycle management. Works with SGLang, vLLM, TensorRT-LLM, and Triton
Puzzles for learning Triton
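For a flavor of what such puzzles build toward, the canonical first Triton kernel is a masked, blocked vector add: each program instance owns one block of elements (this mirrors the official Triton tutorial and needs a CUDA GPU):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # which block this instance owns
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)               # one program per 1024-element block
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```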
A workload for deploying LLM inference services on Kubernetes
Checkpoint-engine is a simple middleware to update model weights in LLM inference engines
Since the emergence of ChatGPT in 2022, accelerating large language models has become increasingly important. Here is a list of papers on accelerating LLMs, currently focused mainly on inference.
Efficient Interactive LLM Serving with Proxy Model-based Sequence Length Prediction | A tiny BERT model can tell you the verbosity of an LLM (with low latency overhead!)
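The idea in that last entry: a cheap proxy model predicts each request's output length before dispatch, so the scheduler can serve shortest-predicted-job-first and cut mean queueing delay. A minimal sketch with a hypothetical `predict_length` regressor standing in for the tiny BERT model:

```python
import heapq

def schedule_sjf(requests, predict_length):
    """Shortest-predicted-job-first dispatch order.

    `predict_length(prompt)` is a stand-in for the proxy regressor: it
    returns an estimated number of output tokens. Serving the
    predicted-shortest requests first reduces mean waiting time.
    """
    heap = [(predict_length(p), i, p) for i, p in enumerate(requests)]
    heapq.heapify(heap)
    while heap:
        est, _, prompt = heapq.heappop(heap)
        yield prompt, est  # dispatch to the LLM serving engine here

# Toy proxy: pretend longer prompts yield proportionally longer answers.
order = schedule_sjf(["hi", "explain KV caching in detail", "ok?"],
                     predict_length=lambda p: len(p.split()) * 8)
for prompt, est in order:
    print(est, prompt)
```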