vLLM TPU

| Documentation | Blog | User Forum | Developer Slack (#sig-tpu) |


🤝 Contribute to the Project
Looking to help? Click a badge below to find issues that need your attention.

Labels: bug · good first issue · enhancement · contribution-welcome · auto-generated · View All Issues

Latest News

Previous News 🔥

About

vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin that unifies JAX and PyTorch under a single lowering path within the vLLM project. The new backend gives developers a framework to:

  • Push the limits of TPU hardware performance in open source.
  • Offer JAX and PyTorch users more flexibility: PyTorch model definitions run performantly on TPU without any additional code changes, and native support extends to JAX as well.
  • Retain vLLM standardization: keep the same user experience, telemetry, and interface.
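
As a quick illustration of that unified experience, here is a minimal offline-inference sketch using vLLM's standard Python API (the model name and sampling settings are illustrative, not an official configuration):

```python
# Minimal sketch: offline inference with vLLM on a TPU host.
# The model name and sampling settings are illustrative; see the
# Recommended Models and Features page for validated models.
from vllm import LLM, SamplingParams

# vLLM selects the TPU backend automatically once tpu-inference is installed.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=64)
for output in llm.generate(["What makes TPUs fast for LLM serving?"], params):
    print(output.outputs[0].text)
```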

Recommended models and features

Although vLLM TPU's new unified backend makes out-of-the-box, high-performance serving possible with any model supported in vLLM, we are still in the process of implementing a few core components.

For this reason, we’ve provided a Recommended Models and Features page detailing the models and features that are validated through unit, integration, and performance testing.


Get started

Get started with vLLM on TPUs by following the quickstart guide.

Visit our documentation to learn more.
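
For online serving, the quickstart launches an OpenAI-compatible server. As a hedged sketch (the port and model name below are assumptions; adjust them to your deployment), a server started with `vllm serve` can be queried with the standard openai client:

```python
# Sketch: query a vLLM OpenAI-compatible server from Python.
# Assumes a server started per the quickstart, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# The port (8000) and model name are assumptions; adjust as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize vLLM TPU in one sentence."}],
)
print(response.choices[0].message.content)
```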

Compatible TPU Generations

  • Recommended: v7x, v5e, v6e
  • Experimental: v3, v4, v5p

Recipes


TPU Support Matrix Dashboard

Below is the live status of our supported models, features, and kernels. Click on any category to expand the detailed support table. It is automatically updated from our detailed Support Matrices.

Last Updated: 2026-04-16 10:24 PM UTC

🚦 Status Legend
  • Passing: Tested and works as expected. Ready for use.
  • Failing: Known to be broken or not functional. Help is wanted to fix this!
  • 🧪 Experimental: Works, but unoptimized or pending community validation.
  • 📝 Planned: Not yet implemented, but on the official roadmap.
  • ⛔️ Unplanned: There is no benefit to adding this.
  • Untested: The functionality exists but has not been recently or thoroughly verified.
📐 View Matrix Aggregation Rules (v6e/v7x & C+P)
  • 🛠️ Correctness + Performance (C + P)

    • Failing: If either check fails.
    • Passing: If BOTH checks pass successfully.
    • Untested: If any check is untested (and neither fails).
  • 🌐 Hardware Rollups (v6e + v7x)

    • Failing: If the feature fails on either v6e or v7x.
    • Passing: If the feature passes on BOTH v6e and v7x.
    • Untested: If either generation is untested (and neither fails).
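
The rollup rules above amount to a small three-valued join. Here is a minimal sketch of that logic (the function and status names are hypothetical, for illustration only):

```python
# Sketch of the dashboard's rollup rules (names are hypothetical).
# A combined status is Failing if any input fails, Passing only if
# ALL inputs pass, and Untested if any input is untested (and none fail).
def rollup(statuses: list[str]) -> str:
    if "failing" in statuses:
        return "failing"
    if all(s == "passing" for s in statuses):
        return "passing"
    return "untested"

# Correctness + Performance rollup for one feature:
print(rollup(["passing", "untested"]))  # -> untested
# Hardware rollup across v6e and v7x:
print(rollup(["passing", "passing"]))   # -> passing
```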

Release Support Matrices

Click to expand support matrices

Stable support status for official releases and production deployments.

✅ Tested Models

| Model | Type | Unit Test | Correctness Test | Performance Test |
|---|---|---|---|---|
| google/gemma-3-27b-it | Text | | | |
| meta-llama/Llama-3.1-8B-Instruct | Text | | | |
| meta-llama/Llama-3.3-70B-Instruct | Text | | | |
| Qwen/Qwen3-30B-A3B | Text | | | |
| Qwen/Qwen3-32B | Text | | | |
| Qwen/Qwen3-4B | Text | | | |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | | | |
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | | | |
| deepseek-ai/DeepSeek-OCR | Multimodal | | | |
| moonshotai/Kimi-K2.5 | Multimodal | | | |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | | | |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | | | |
| Qwen/Qwen3.5-9B | Multimodal | | | |
| deepseek-ai/DeepSeek-Math-V2 | Text | | | |
| deepseek-ai/DeepSeek-R1 | Text | | | |
| deepseek-ai/DeepSeek-V3.1 | Text | | | |
| deepseek-ai/DeepSeek-V3.2 | Text | | | |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | | | |
| MiniMaxAI/MiniMax-M2.5 | Text | | | |
| moonshotai/Kimi-K2-Thinking | Text | | | |
| openai/gpt-oss-120b | Text | | | |
| openai/gpt-oss-20b | Text | | | |
| zai-org/GLM-5 | Text | | | |
🚀 Advanced Capabilities

Core Features

| Feature | Flax | Torchax | Default |
|---|---|---|---|
| async scheduler | | | |
| Chunked Prefill | | | |
| DCN-based P/D disaggregation | | | |
| LoRA_Torch | | | |
| Out-of-tree model support | | | |
| Prefix Caching | | | |
| Single Program Multi Data | | | |
| Speculative Decoding: Ngram | | | |
| Multimodal Inputs | | | |
| Speculative Decoding: Eagle3 | | | |
| hybrid kv cache | | | |
| KV cache host offloading | | | |
| multi-host | | | |
| runai_model_streamer_loader | | | |
| sampling_params | | | |
| Single-Host-P-D-disaggregation | | | |
| structured_decoding | | | |
Parallelism Techniques

| Feature | Flax (Single-host) | Flax (Multi-host) | Torchax (Single-host) | Torchax (Multi-host) |
|---|---|---|---|---|
| EP | | | | |
| TP | | | | |
| PP | | | | |
| DP | | | | |
| CP | | | | |
| SP (vote to prioritize) | | | | |
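
The parallelism techniques above map onto vLLM's standard engine arguments. As a hedged sketch (the model and degree are illustrative), tensor parallelism across TPU chips is configured the same way as on other backends:

```python
# Sketch: tensor parallelism (TP) via vLLM's standard API.
# The model name and tensor_parallel_size are illustrative; set the
# degree to the number of TPU chips available on the host.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=8,  # shard the model across 8 chips
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```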
Quantization Methods

| Checkpoint dtype | Method | Supported Hardware | Acceleration (Flax) | Acceleration (Torchax) |
|---|---|---|---|---|
| AWQ INT4 | | v5, v6 | | |
| FP4 W4A16 | mxfp4 | v7 | | |
| FP8 W8A16 | compressed-tensor | v7 | | |
| FP8 W8A8 | compressed-tensor | v7 | | |
| INT4 W4A16 | awq | v5, v6 | | |
| INT8 W8A8 | compressed-tensor | v5, v6 | | |

Note:

  • This table verifies checkpoint-loading compatibility only.
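
Loading a quantized checkpoint goes through vLLM's usual interface. A minimal sketch, assuming an AWQ INT4 checkpoint (the model name is hypothetical; for compressed-tensors checkpoints the method is typically inferred from the checkpoint config):

```python
# Sketch: loading a quantized checkpoint with vLLM.
# The model name is hypothetical; for compressed-tensors checkpoints
# vLLM typically infers the method from the checkpoint config, so the
# explicit flag is often unnecessary.
from vllm import LLM

llm = LLM(
    model="org/some-model-AWQ-INT4",  # hypothetical AWQ INT4 checkpoint
    quantization="awq",               # explicit W4A16 AWQ loading
)
```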
🔬 Microbenchmark Kernel Support

| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
|---|---|---|---|---|---|---|---|
| MoE | Fused MoE | | | | | | |
| MoE | gmm | | | | | | |
| Dense | All-gather matmul | | | | | | |
| Attention | Generic Ragged Paged Attention V3* | | | | | | |
| Attention | MLA Ragged Paged Attention V3 (Head_Dim 64)* | | | | | | |

Note:

  • For attention kernels, W[x]A[y] denotes the KV-cache precision (W) and the compute precision (A), with x and y as the bit widths.

Nightly Support Matrices

Click to expand support matrices

Support status for the latest nightly/main branch developments.

✅ Tested Models

| Model | Type | Unit Test | Correctness Test | Performance Test |
|---|---|---|---|---|
| google/gemma-3-27b-it | Text | | | |
| meta-llama/Llama-3.1-8B-Instruct | Text | | | |
| meta-llama/Llama-3.3-70B-Instruct | Text | | | |
| Qwen/Qwen3-30B-A3B | Text | | | |
| Qwen/Qwen3-32B | Text | | | |
| Qwen/Qwen3-4B | Text | | | |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | | | |
| Qwen/Qwen3.5-397B-A17B | Text | | | |
| openai/gpt-oss-120b | Text | | | |
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | | | |
| deepseek-ai/DeepSeek-R1 | Text | | | |
| google/gemma-4-26B-A4B-it | Multimodal | | | |
| google/gemma-4-31B-it | Multimodal | | | |
| deepseek-ai/DeepSeek-OCR | Multimodal | | | |
| moonshotai/Kimi-K2.5 | Multimodal | | | |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | | | |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | | | |
| Qwen/Qwen3.5-9B | Multimodal | | | |
| deepseek-ai/DeepSeek-Math-V2 | Text | | | |
| deepseek-ai/DeepSeek-V3.1 | Text | | | |
| deepseek-ai/DeepSeek-V3.2 | Text | | | |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | | | |
| MiniMaxAI/MiniMax-M2.5 | Text | | | |
| moonshotai/Kimi-K2-Thinking | Text | | | |
| openai/gpt-oss-20b | Text | | | |
| zai-org/GLM-5 | Text | | | |
🚀 Advanced Capabilities

Core Features

| Feature | Flax | Torchax | Default |
|---|---|---|---|
| Chunked Prefill | | | |
| DCN-based P/D disaggregation | | | |
| LoRA_Torch | | | |
| Prefix Caching | | | |
| Single Program Multi Data | | | |
| Speculative Decoding: Ngram | | | |
| async scheduler | | | |
| Speculative Decoding: Eagle3 | | | |
| Out-of-tree model support | | | |
| Multimodal Inputs | | | |
| Single-Host-P-D-disaggregation | | | |
| hybrid kv cache | | | |
| KV cache host offloading | | | |
| multi-host | | | |
| runai_model_streamer_loader | | | |
| sampling_params | | | |
| structured_decoding | | | |
Parallelism Techniques

| Feature | Flax (Single-host) | Flax (Multi-host) | Torchax (Single-host) | Torchax (Multi-host) |
|---|---|---|---|---|
| EP | | | | |
| TP | | | | |
| PP | | | | |
| DP | | | | |
| CP | | | | |
| SP (vote to prioritize) | | | | |
Quantization Methods

| Checkpoint dtype | Method | Supported Hardware | Acceleration (Flax) | Acceleration (Torchax) |
|---|---|---|---|---|
| FP4 W4A16 | mxfp4 | v7 | | |
| FP8 W8A16 | compressed-tensor | v7 | | |
| FP8 W8A8 | compressed-tensor | v7 | | |
| INT4 W4A16 | awq | v5, v6 | | |
| INT8 W8A8 | compressed-tensor | v5, v6 | | |

Note:

  • This table verifies checkpoint-loading compatibility only.
🔬 Microbenchmark Kernel Support

| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
|---|---|---|---|---|---|---|---|
| MoE | Fused MoE | | | | | | |
| MoE | gmm | | | | | | |
| Dense | All-gather matmul | | | | | | |
| Attention | Generic Ragged Paged Attention V3* | | | | | | |
| Attention | MLA Ragged Paged Attention V3 (Head_Dim 64)* | | | | | | |

Note:

  • For attention kernels, W[x]A[y] denotes the KV-cache precision (W) and the compute precision (A), with x and y as the bit widths.

🤝 Contribute

Labels: bug · good first issue · enhancement · contribution-welcome · auto-generated · View All Issues

We're thrilled you're interested in contributing to the vLLM TPU project! Your help is essential for making our tools better for everyone. There are many ways to get involved, even if you're not ready to write code.

Ways to Contribute:

  • 🐞 Submit Bugs & Suggest Features: See an issue or have an idea? Open a new issue to let us know.
  • 👀 Provide Feedback on Pull Requests: Lend your expertise by reviewing open pull requests and helping us improve the quality of our codebase.
  • 📚 Improve Our Documentation: Help us make our guides clearer. Fix a typo, clarify a confusing section, or write a new recipe.

If you're ready to contribute code, our Contributing Guide is the best place to start. It covers everything you need to know, including:

  • Tips for finding an issue to work on; we recommend starting with our good first issues!

🌟 Contributors Wall

A huge thank you to everyone who has helped build and improve vllm-project/tpu-inference!

🌟 Contribution Type Legend & Ranking
| Emoji | Contribution | Meaning |
|---|---|---|
| 💻 | Code | Submitted merged pull requests or code changes. |
| 🐛 | Issues | Opened valid issues or bug reports. |
| 👀 | Reviews | Reviewed pull requests and provided feedback. |

🏆 Ranking: Contributors are sorted from highest to lowest based on their total effort score (Total Commits + Unique Issues Opened + PRs Reviewed). If there is a tie, contributors are displayed alphabetically.
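
A minimal sketch of that ranking rule (the data and field names are hypothetical; the real dashboard may compute it differently):

```python
# Sketch of the contributor ranking rule (data and field names hypothetical).
# Effort score = total commits + unique issues opened + PRs reviewed;
# ties break alphabetically by username.
contributors = [
    {"name": "alice", "commits": 12, "issues": 3, "reviews": 5},
    {"name": "bob",   "commits": 15, "issues": 0, "reviews": 5},
]

def effort(c: dict) -> int:
    return c["commits"] + c["issues"] + c["reviews"]

ranked = sorted(contributors, key=lambda c: (-effort(c), c["name"]))
print([c["name"] for c in ranked])  # -> ['alice', 'bob'] (tie at 20, alphabetical)
```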


xiangxu-google 💻
jrplatin 🐛 👀 💻
buildkite-bot 💻
kyuyeunk 🐛 👀 💻
py4 💻
fenghuizhang 💻
lk-chen 🐛 👀 💻
wenxindongwork 👀 💻
vanbasten23 👀 💻
sixiang-google 💻
lsy323 💻
Lumosis 💻
QiliangCui 👀 💻
Chenyaaang 👀 💻
bzgoogle 👀 💻
gpolovets1 👀 💻
mrjunwan-lang 👀 💻
yarongmu-google 💻
wwl2755-google 💻
yaochengji 💻
patemotter 👀 💻

...and more! Click to view all contributors.

boe20211 💻
jcyang43 👀 💻
kwang3939 👀 💻
bythew3i 💻
pv97 👀 💻
karan 🐛 💻
dennisYehCienet 👀 💻
syhuang22 👀 💻
helloworld1 🐛 👀 💻
ica-chao 💻
richardsliu 👀 💻
catswe 👀 💻
RobMulla 🐛 💻
xingliu14 🐛 💻
juncgu-google 👀
saltysoup 🐛
weiyu0824 👀 💻
andrewkvuong 💻
rupengliu-meta 🐛 💻
bvrockwell 🐛 💻
sierraisland 💻
wang2yn84 💻
wdhongtw 💻
JiriesKaileh 💻
ylangtsou 💻
amacaskill 💻
BirdsOfAFthr 💻
patrickji2014 👀 💻
qihqi 🐛 💻
yuanfz98 🐛
cychiuak 💻
hosseinsarshar 🐛 💻
samos123 🐛
AlienKevin 🐛
dgouju 🐛
eitanporat 🐛
ernie-chang 💻
lepan-google 🐛 💻
muskansh-google 🐛
saikat-royc 👀
abhinavclemson 💻
aman2930 💻
BabyChouSr 🐛
CienetStingLin 💻
coolkp 💻
functionstackx 🐛
helloleah 💻
mailvijayasingh 💻
QiliangCui2023 👀
shireen-bean 🐛
utkarshsharma1 💻
A9isha 💻
AahilA 💻
amishacorns 💻
carlesoctav 🐛
dannikay 💻
depksingh 🐛
Dineshkumar-Anandan-ZS0367 🐛
dtrifiro 🐛
erfanzar 🐛
inho9606 💻
jk1333 🐛
jyj0w0 👀
kuafou 💻
kyle-google 💻
Mhdaw 🐛
mokeddembillel 🐛
oindrila-b 🐛
oliverdutton 🐛
pathfinder-pf 🐛
piotrfrankowski 🐛
reeaz27-droid 🐛
rupeng-liu 💻
salmanmohammadi 🐛
vlad-karp 💻
XMaster96 🐛
yixinshi 👀
yuyanpeng-google 💻
zixi-qi 💻
zongweiz 🐛
zzzwen 💻

💬 Contact us
