vLLM TPU

| Documentation | Blog | User Forum | Developer Slack (#sig-tpu) |


🤝 Contribute to the Project
Looking to help? Click a badge below to find issues that need your attention.

Labels: bug · good first issue · enhancement · contribution-welcome · auto-generated · View All Issues

Latest News

Previous News 🔥

About

vLLM TPU is now powered by tpu-inference, an expressive and powerful new hardware plugin that unifies JAX and PyTorch under a single lowering path within the vLLM project. The new backend gives developers a framework to:

  • Push the limits of TPU hardware performance in open source.
  • Offer JAX and PyTorch users more flexibility: PyTorch model definitions run performantly on TPU without any additional code changes, and native support extends to JAX as well.
  • Retain vLLM standardization: keep the same user experience, telemetry, and interface.
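
As a quick illustration of that unified experience, here is a minimal offline-inference sketch using vLLM's standard Python API (the model name and sampling settings are illustrative, not an official configuration):

```python
# Minimal sketch: offline inference with vLLM on a TPU host.
# The model name and sampling settings are illustrative; see the
# Recommended Models and Features page for validated models.
from vllm import LLM, SamplingParams

# vLLM selects the TPU backend automatically once tpu-inference is installed.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")

params = SamplingParams(temperature=0.7, max_tokens=64)
for output in llm.generate(["What makes TPUs fast for LLM serving?"], params):
    print(output.outputs[0].text)
```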

Recommended models and features

Although vLLM TPU's new unified backend makes out-of-the-box, high-performance serving possible with any model supported in vLLM, we are still in the process of implementing a few core components.

For this reason, we’ve provided a Recommended Models and Features page detailing the models and features that are validated through unit, integration, and performance testing.


Get started

Get started with vLLM on TPUs by following the quickstart guide.

Visit our documentation to learn more.
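
For online serving, the quickstart launches an OpenAI-compatible server. As a hedged sketch (the port and model name below are assumptions; adjust them to your deployment), a server started with `vllm serve` can be queried with the standard openai client:

```python
# Sketch: query a vLLM OpenAI-compatible server from Python.
# Assumes a server started per the quickstart, e.g.:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct
# The port (8000) and model name are assumptions; adjust as needed.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize vLLM TPU in one sentence."}],
)
print(response.choices[0].message.content)
```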

Compatible TPU Generations

  • Recommended: v7x, v5e, v6e
  • Experimental: v3, v4, v5p

Recipes


TPU Support Matrix Dashboard

Below is the live status of our supported models, features, and kernels. Click on any category to expand the detailed support table. It is automatically updated from our detailed Support Matrices.

Last Updated: 2026-04-16 10:24 PM UTC

🚦 Status Legend
  • Passing: Tested and works as expected. Ready for use.
  • Failing: Known to be broken or not functional. Help is wanted to fix this!
  • 🧪 Experimental: Works, but unoptimized or pending community validation.
  • 📝 Planned: Not yet implemented, but on the official roadmap.
  • ⛔️ Unplanned: There is no benefit to adding this.
  • Untested: The functionality exists but has not been recently or thoroughly verified.
📐 View Matrix Aggregation Rules (v6e/v7x & C+P)
  • 🛠️ Correctness + Performance (C + P)

    • Failing: If either check fails.
    • Passing: If BOTH checks pass successfully.
    • Untested: If any check is untested (and neither fails).
  • 🌐 Hardware Rollups (v6e + v7x)

    • Failing: If the feature fails on either v6e or v7x.
    • Passing: If the feature passes on BOTH v6e and v7x.
    • Untested: If either generation is untested (and neither fails).
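
The rollup rules above amount to a small three-valued join. Here is a minimal sketch of that logic (the function and status names are hypothetical, for illustration only):

```python
# Sketch of the dashboard's rollup rules (names are hypothetical).
# A combined status is Failing if any input fails, Passing only if
# ALL inputs pass, and Untested if any input is untested (and none fail).
def rollup(statuses: list[str]) -> str:
    if "failing" in statuses:
        return "failing"
    if all(s == "passing" for s in statuses):
        return "passing"
    return "untested"

# Correctness + Performance rollup for one feature:
print(rollup(["passing", "untested"]))  # -> untested
# Hardware rollup across v6e and v7x:
print(rollup(["passing", "passing"]))   # -> passing
```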

Release Support Matrices

Click to expand support matrices

Stable support status for official releases and production deployments.

✅ Tested Models

| Model | Type | Unit Test | Correctness Test | Performance Test |
|---|---|---|---|---|
| google/gemma-3-27b-it | Text | | | |
| meta-llama/Llama-3.1-8B-Instruct | Text | | | |
| meta-llama/Llama-3.3-70B-Instruct | Text | | | |
| Qwen/Qwen3-30B-A3B | Text | | | |
| Qwen/Qwen3-32B | Text | | | |
| Qwen/Qwen3-4B | Text | | | |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | | | |
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | | | |
| deepseek-ai/DeepSeek-OCR | Multimodal | | | |
| moonshotai/Kimi-K2.5 | Multimodal | | | |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | | | |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | | | |
| Qwen/Qwen3.5-9B | Multimodal | | | |
| deepseek-ai/DeepSeek-Math-V2 | Text | | | |
| deepseek-ai/DeepSeek-R1 | Text | | | |
| deepseek-ai/DeepSeek-V3.1 | Text | | | |
| deepseek-ai/DeepSeek-V3.2 | Text | | | |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | | | |
| MiniMaxAI/MiniMax-M2.5 | Text | | | |
| moonshotai/Kimi-K2-Thinking | Text | | | |
| openai/gpt-oss-120b | Text | | | |
| openai/gpt-oss-20b | Text | | | |
| zai-org/GLM-5 | Text | | | |
🚀 Advanced Capabilities

Core Features

| Feature | Flax | Torchax | Default |
|---|---|---|---|
| async scheduler | | | |
| Chunked Prefill | | | |
| DCN-based P/D disaggregation | | | |
| LoRA_Torch | | | |
| Out-of-tree model support | | | |
| Prefix Caching | | | |
| Single Program Multi Data | | | |
| Speculative Decoding: Ngram | | | |
| Multimodal Inputs | | | |
| Speculative Decoding: Eagle3 | | | |
| hybrid kv cache | | | |
| KV cache host offloading | | | |
| multi-host | | | |
| runai_model_streamer_loader | | | |
| sampling_params | | | |
| Single-Host-P-D-disaggregation | | | |
| structured_decoding | | | |
Parallelism Techniques

| Feature | Flax (Single-host) | Flax (Multi-host) | Torchax (Single-host) | Torchax (Multi-host) |
|---|---|---|---|---|
| EP | | | | |
| TP | | | | |
| PP | | | | |
| DP | | | | |
| CP | | | | |
| SP (vote to prioritize) | | | | |
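
The parallelism techniques above map onto vLLM's standard engine arguments. As a hedged sketch (the model and degree are illustrative), tensor parallelism across TPU chips is configured the same way as on other backends:

```python
# Sketch: tensor parallelism (TP) via vLLM's standard API.
# The model name and tensor_parallel_size are illustrative; set the
# degree to the number of TPU chips available on the host.
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",
    tensor_parallel_size=8,  # shard the model across 8 chips
)
print(llm.generate(["Hello"])[0].outputs[0].text)
```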
Quantization Methods

| Checkpoint dtype | Method | Supported Hardware | Acceleration (Flax) | Acceleration (Torchax) |
|---|---|---|---|---|
| AWQ INT4 | | v5, v6 | | |
| FP4 W4A16 | mxfp4 | v7 | | |
| FP8 W8A16 | compressed-tensor | v7 | | |
| FP8 W8A8 | compressed-tensor | v7 | | |
| INT4 W4A16 | awq | v5, v6 | | |
| INT8 W8A8 | compressed-tensor | v5, v6 | | |

Note:

  • This table verifies checkpoint-loading compatibility only.
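
Loading a quantized checkpoint goes through vLLM's usual interface. A minimal sketch, assuming an AWQ INT4 checkpoint (the model name is hypothetical; for compressed-tensors checkpoints the method is typically inferred from the checkpoint config):

```python
# Sketch: loading a quantized checkpoint with vLLM.
# The model name is hypothetical; for compressed-tensors checkpoints
# vLLM typically infers the method from the checkpoint config, so the
# explicit flag is often unnecessary.
from vllm import LLM

llm = LLM(
    model="org/some-model-AWQ-INT4",  # hypothetical AWQ INT4 checkpoint
    quantization="awq",               # explicit W4A16 AWQ loading
)
```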
🔬 Microbenchmark Kernel Support

| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
|---|---|---|---|---|---|---|---|
| MoE | Fused MoE | | | | | | |
| MoE | gmm | | | | | | |
| Dense | All-gather matmul | | | | | | |
| Attention | Generic Ragged Paged Attention V3* | | | | | | |
| Attention | MLA Ragged Paged Attention V3 (Head_Dim 64)* | | | | | | |

Note:

  • For attention kernels, W[x]A[y] denotes the KV-cache precision (W) and the compute precision (A), with x and y as the bit widths.

Nightly Support Matrices

Click to expand support matrices

Support status for the latest nightly/main branch developments.

✅ Tested Models

| Model | Type | Unit Test | Correctness Test | Performance Test |
|---|---|---|---|---|
| google/gemma-3-27b-it | Text | | | |
| meta-llama/Llama-3.1-8B-Instruct | Text | | | |
| meta-llama/Llama-3.3-70B-Instruct | Text | | | |
| Qwen/Qwen3-30B-A3B | Text | | | |
| Qwen/Qwen3-32B | Text | | | |
| Qwen/Qwen3-4B | Text | | | |
| Qwen/Qwen3-Coder-480B-A35B-Instruct | Text | | | |
| Qwen/Qwen3.5-397B-A17B | Text | | | |
| openai/gpt-oss-120b | Text | | | |
| Qwen/Qwen2.5-VL-7B-Instruct | Multimodal | | | |
| deepseek-ai/DeepSeek-R1 | Text | | | |
| google/gemma-4-26B-A4B-it | Multimodal | | | |
| google/gemma-4-31B-it | Multimodal | | | |
| deepseek-ai/DeepSeek-OCR | Multimodal | | | |
| moonshotai/Kimi-K2.5 | Multimodal | | | |
| Qwen/Qwen3-Omni-30B-A3B-Instruct | Multimodal | | | |
| Qwen/Qwen3-VL-8B-Instruct | Multimodal | | | |
| Qwen/Qwen3.5-9B | Multimodal | | | |
| deepseek-ai/DeepSeek-Math-V2 | Text | | | |
| deepseek-ai/DeepSeek-V3.1 | Text | | | |
| deepseek-ai/DeepSeek-V3.2 | Text | | | |
| deepseek-ai/DeepSeek-V3.2-Speciale | Text | | | |
| MiniMaxAI/MiniMax-M2.5 | Text | | | |
| moonshotai/Kimi-K2-Thinking | Text | | | |
| openai/gpt-oss-20b | Text | | | |
| zai-org/GLM-5 | Text | | | |
🚀 Advanced Capabilities

Core Features

| Feature | Flax | Torchax | Default |
|---|---|---|---|
| Chunked Prefill | | | |
| DCN-based P/D disaggregation | | | |
| LoRA_Torch | | | |
| Prefix Caching | | | |
| Single Program Multi Data | | | |
| Speculative Decoding: Ngram | | | |
| async scheduler | | | |
| Speculative Decoding: Eagle3 | | | |
| Out-of-tree model support | | | |
| Multimodal Inputs | | | |
| Single-Host-P-D-disaggregation | | | |
| hybrid kv cache | | | |
| KV cache host offloading | | | |
| multi-host | | | |
| runai_model_streamer_loader | | | |
| sampling_params | | | |
| structured_decoding | | | |
Parallelism Techniques

| Feature | Flax (Single-host) | Flax (Multi-host) | Torchax (Single-host) | Torchax (Multi-host) |
|---|---|---|---|---|
| EP | | | | |
| TP | | | | |
| PP | | | | |
| DP | | | | |
| CP | | | | |
| SP (vote to prioritize) | | | | |
Quantization Methods

| Checkpoint dtype | Method | Supported Hardware | Acceleration (Flax) | Acceleration (Torchax) |
|---|---|---|---|---|
| FP4 W4A16 | mxfp4 | v7 | | |
| FP8 W8A16 | compressed-tensor | v7 | | |
| FP8 W8A8 | compressed-tensor | v7 | | |
| INT4 W4A16 | awq | v5, v6 | | |
| INT8 W8A8 | compressed-tensor | v5, v6 | | |

Note:

  • This table verifies checkpoint-loading compatibility only.
🔬 Microbenchmark Kernel Support

| Category | Test | W16A16 | W8A8 | W8A16 | W4A4 | W4A8 | W4A16 |
|---|---|---|---|---|---|---|---|
| MoE | Fused MoE | | | | | | |
| MoE | gmm | | | | | | |
| Dense | All-gather matmul | | | | | | |
| Attention | Generic Ragged Paged Attention V3* | | | | | | |
| Attention | MLA Ragged Paged Attention V3 (Head_Dim 64)* | | | | | | |

Note:

  • For attention kernels, W[x]A[y] denotes the KV-cache precision (W) and the compute precision (A), with x and y as the bit widths.

🤝 Contribute

Labels: bug · good first issue · enhancement · contribution-welcome · auto-generated · View All Issues

We're thrilled you're interested in contributing to the vLLM TPU project! Your help is essential for making our tools better for everyone. There are many ways to get involved, even if you're not ready to write code.

Ways to Contribute:

  • 🐞 Submit Bugs & Suggest Features: See an issue or have an idea? Open a new issue to let us know.
  • 👀 Provide Feedback on Pull Requests: Lend your expertise by reviewing open pull requests and helping us improve the quality of our codebase.
  • 📚 Improve Our Documentation: Help us make our guides clearer. Fix a typo, clarify a confusing section, or write a new recipe.

If you're ready to contribute code, our Contributing Guide is the best place to start. It covers everything you need to know, including:

  • Tips for finding an issue to work on; we recommend starting with our good first issues!

🌟 Contributors Wall

A huge thank you to everyone who has helped build and improve vllm-project/tpu-inference!

🌟 Contribution Type Legend & Ranking
| Emoji | Contribution | Meaning |
|---|---|---|
| 💻 | Code | Submitted merged pull requests or code changes. |
| 🐛 | Issues | Opened valid issues or bug reports. |
| 👀 | Reviews | Reviewed pull requests and provided feedback. |

🏆 Ranking: Contributors are sorted from highest to lowest based on their total effort score (Total Commits + Unique Issues Opened + PRs Reviewed). If there is a tie, contributors are displayed alphabetically.
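
A minimal sketch of that ranking rule (the data and field names are hypothetical; the real dashboard may compute it differently):

```python
# Sketch of the contributor ranking rule (data and field names hypothetical).
# Effort score = total commits + unique issues opened + PRs reviewed;
# ties break alphabetically by username.
contributors = [
    {"name": "alice", "commits": 12, "issues": 3, "reviews": 5},
    {"name": "bob",   "commits": 15, "issues": 0, "reviews": 5},
]

def effort(c: dict) -> int:
    return c["commits"] + c["issues"] + c["reviews"]

ranked = sorted(contributors, key=lambda c: (-effort(c), c["name"]))
print([c["name"] for c in ranked])  # -> ['alice', 'bob'] (tie at 20, alphabetical)
```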


xiangxu-google 💻
jrplatin 🐛 👀 💻
buildkite-bot 💻
kyuyeunk 🐛 👀 💻
py4 💻
fenghuizhang 💻
lk-chen 🐛 👀 💻
wenxindongwork 👀 💻
vanbasten23 👀 💻
sixiang-google 💻
lsy323 💻
Lumosis 💻
QiliangCui 👀 💻
Chenyaaang 👀 💻
bzgoogle 👀 💻
gpolovets1 👀 💻
mrjunwan-lang 👀 💻
yarongmu-google 💻
wwl2755-google 💻
yaochengji 💻
patemotter 👀 💻

...and more! Click to view all contributors.

boe20211 💻
jcyang43 👀 💻
kwang3939 👀 💻
bythew3i 💻
pv97 👀 💻
karan 🐛 💻
dennisYehCienet 👀 💻
syhuang22 👀 💻
helloworld1 🐛 👀 💻
ica-chao 💻
richardsliu 👀 💻
catswe 👀 💻
RobMulla 🐛 💻
xingliu14 🐛 💻
juncgu-google 👀
saltysoup 🐛
weiyu0824 👀 💻
andrewkvuong 💻
rupengliu-meta 🐛 💻
bvrockwell 🐛 💻
sierraisland 💻
wang2yn84 💻
wdhongtw 💻
JiriesKaileh 💻
ylangtsou 💻
amacaskill 💻
BirdsOfAFthr 💻
patrickji2014 👀 💻
qihqi 🐛 💻
yuanfz98 🐛
cychiuak 💻
hosseinsarshar 🐛 💻
samos123 🐛
AlienKevin 🐛
dgouju 🐛
eitanporat 🐛
ernie-chang 💻
lepan-google 🐛 💻
muskansh-google 🐛
saikat-royc 👀
abhinavclemson 💻
aman2930 💻
BabyChouSr 🐛
CienetStingLin 💻
coolkp 💻
functionstackx 🐛
helloleah 💻
mailvijayasingh 💻
QiliangCui2023 👀
shireen-bean 🐛
utkarshsharma1 💻
A9isha 💻
AahilA 💻
amishacorns 💻
carlesoctav 🐛
dannikay 💻
depksingh 🐛
Dineshkumar-Anandan-ZS0367 🐛
dtrifiro 🐛
erfanzar 🐛
inho9606 💻
jk1333 🐛
jyj0w0 👀
kuafou 💻
kyle-google 💻
Mhdaw 🐛
mokeddembillel 🐛
oindrila-b 🐛
oliverdutton 🐛
pathfinder-pf 🐛
piotrfrankowski 🐛
reeaz27-droid 🐛
rupeng-liu 💻
salmanmohammadi 🐛
vlad-karp 💻
XMaster96 🐛
yixinshi 👀
yuyanpeng-google 💻
zixi-qi 💻
zongweiz 🐛
zzzwen 💻

💬 Contact us
