vllm-swift

A native Swift/Metal backend for vLLM on Apple Silicon.
No Python in the inference hot path.
OpenAI-compatible API. Up to 2.6× faster short-context decode than the Python/MLX engine.

Quick Start

1. Install

brew tap TheTom/tap && brew install vllm-swift

Or from source:

git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh       # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH (generated by install.sh)

2. Run

vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096  # increase as needed, max 40960

Homebrew users don't need activate.sh; vllm-swift serve handles everything.

Server running at http://localhost:8000 (OpenAI-compatible API).

Drop-in replacement for vLLM on Apple Silicon. All vllm serve flags work unchanged.
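
Once the server is up, any OpenAI-style request should work. Here is a minimal sanity check using only the Python standard library; the "model" field is an assumption and must match whatever you passed to serve (or --served-model-name, if you set one):

# Minimal sanity check against the local server (stdlib only).
# Adjust "model" to the path or --served-model-name you actually served.
import json, urllib.request

payload = {
    "model": "mlx-community/Qwen3-4B-4bit",   # placeholder; match your serve argument
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])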

Performance (M5 Max 128GB)

Decode throughput, tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy (temp=0). Both engines measured via offline benchmark (no HTTP overhead). vllm-swift uses the Swift/Metal engine via ctypes. vllm-metal uses the Python/MLX engine via vLLM's offline LLM API.

Qwen3-0.6B

                         Single   8 concurrent   32 concurrent   64 concurrent
vllm-swift                  364          1,527           2,859           3,425
vllm-metal (Python/MLX)     111            652           2,047           2,620

Qwen3-4B

                         Single   8 concurrent   32 concurrent   64 concurrent
vllm-swift                  147            477           1,194           1,518
vllm-metal (Python/MLX)     104            396           1,065           1,375

Full matrix, methodology, and long-context cells in docs/PERFORMANCE.md.
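
The numbers above come from the project's own offline benchmark harness. A rough way to reproduce the shape of this measurement with vLLM's offline LLM API (not the exact harness; the throughput figure below also includes prefill):

# Rough throughput sketch via vLLM's offline LLM API, not the project's exact
# benchmark script. Greedy decoding, 50 generated tokens, 64 concurrent prompts.
import os, time
from vllm import LLM, SamplingParams

llm = LLM(model=os.path.expanduser("~/models/Qwen3-4B-4bit"), max_model_len=4096)
params = SamplingParams(temperature=0, max_tokens=50)

prompts = ["Write one sentence about Apple Silicon."] * 64

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} tok/s across {len(prompts)} concurrent requests")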

TurboQuant+ KV Cache Compression

TurboQuant+ compresses KV cache to fit longer context with modest throughput cost.

Qwen3.5 2B (4-bit weights)

KV Cache    Compression   Prefill @1K   Decode @1K   Prefill @4K   Decode @4K
FP16        1.0×          1,252 tok/s   259 tok/s    1,215 tok/s   249 tok/s
turbo4v2    3.0×          1,331 tok/s   245 tok/s    1,245 tok/s   240 tok/s
turbo3      4.6×          1,346 tok/s   174 tok/s    1,276 tok/s   241 tok/s

Architecture

The entire forward pass runs in Swift/Metal. Python is used only for orchestration.

Python (vLLM API, tokenization, scheduling)  ← github.com/vllm-project/vllm
  ↓ ctypes FFI
C bridge (bridge.h)
  ↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
  ↓
Metal GPU
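
On the Python side, the plugin loads the Swift library through ctypes. The snippet below only illustrates that call path; the symbol name and signature are placeholders, not the real bridge.h API (see swift/bridge.h for the actual exports):

# Illustration of the Python -> Swift call path. The symbol name and signature
# below are hypothetical placeholders, NOT the real bridge.h API.
import ctypes

lib = ctypes.CDLL("libVLLMBridge.dylib")   # resolved via DYLD_LIBRARY_PATH

# Hypothetical batched-decode entry point: token IDs in, next-token logits out.
lib.vllm_bridge_batched_decode.argtypes = [
    ctypes.POINTER(ctypes.c_int32),   # one token ID per active sequence
    ctypes.c_int32,                   # batch size
    ctypes.POINTER(ctypes.c_float),   # output logits buffer (batch * vocab)
]
lib.vllm_bridge_batched_decode.restype = ctypes.c_int32   # status code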

Features

  • OpenAI-compatible API (/v1/completions, /v1/chat/completions)
  • Streaming (SSE) responses
  • Chat templates (applied by vLLM, model-specific)
  • Batched concurrent decode with BatchedKVCache (fully batched projections + attention)
  • Per-request temperature sampling in batched path
  • Auto model download from HuggingFace Hub
  • TurboQuant+ KV cache compression (turbo3, turbo4v2) via mlx-swift-lm
  • Decode and prompt logprobs
  • Greedy and temperature sampling
  • EOS / stop token detection (vLLM scheduler)
  • VLM (vision-language model) support (experimental)
  • Works with Hermes, OpenCode, and any OpenAI-compatible client

Use with AI tools

# Start server with tool calling enabled
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 40960 \
  --served-model-name qwen3-4b \
  --enable-auto-tool-choice --tool-call-parser hermes

Then point your tool at it:

# Hermes — set in ~/.hermes/config.yaml:
#   base_url: http://localhost:8000/v1
#   model: qwen3-4b

# OpenCode
OPENAI_API_BASE=http://localhost:8000/v1 OPENAI_API_KEY=local opencode

# Any OpenAI-compatible client
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-4b","messages":[{"role":"user","content":"Hello"}]}'

Configuration

vllm-swift serve is a thin wrapper around vllm serve — all standard vLLM flags work. Here are the common setups:

Basic serving

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960

Agent / tool calling (Hermes, OpenCode, etc.)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes

Chain-of-thought models (strip <think> tags)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-reasoning --reasoning-parser deepseek_r1

Long context with TurboQuant+

Compress the KV cache roughly 3-4.6× to fit longer context with modest throughput cost:

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'

Scheme     Compression   Best for
turbo4v2   ~3×           Recommended — best quality/compression balance
turbo3     ~4.6×         Maximum compression, higher PPL trade-off

Full setup (agent + reasoning + TurboQuant+)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --enable-reasoning --reasoning-parser deepseek_r1 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'

All flags

vllm-swift serve <model> [options]

  --served-model-name NAME   Clean model name for API clients (recommended)
  --max-model-len N          Max sequence length (default: model config)
  --port PORT                API server port (default: 8000)
  --gpu-memory-utilization F Memory fraction 0.0-1.0 (default: 0.9)
  --dtype float16            Model dtype (default: float16)
  --enable-auto-tool-choice  Enable tool/function calling
  --tool-call-parser NAME    Tool call format (hermes, llama3, mistral, etc.)
  --enable-reasoning         Enable chain-of-thought parsing
  --reasoning-parser NAME    Reasoning format (deepseek_r1, etc.)
  --additional-config JSON   Extra config (kv_scheme, kv_bits)

All standard vLLM flags work — these are just the most common ones.

Changelog

See CHANGELOG.md for release history.

Known Limitations (early development)

  • LoRA not supported (Swift engine limitation)
  • Chunked prefill disabled (Swift engine handles full sequences)
  • top_p sampling not supported in batched decode path (temperature works)
  • Only Qwen3 models use the fully batched decode path; other architectures fall back to sequential decode (still functional, just slower at high concurrency)
  • Requires macOS on Apple Silicon (no Linux/CUDA)

Install

Homebrew

brew tap TheTom/tap && brew install vllm-swift

Prebuilt bottle — no Swift toolchain needed. First run of vllm-swift serve sets up a managed Python environment automatically.

To update to the latest version:

vllm-swift update

# Or via standard Homebrew (works from any version):
brew update && brew upgrade vllm-swift

From source

git clone https://github.com/TheTom/vllm-swift.git
cd vllm-swift
./scripts/install.sh       # builds Swift, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096

Manual (full control)

git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
cd swift && swift build -c release && cd ..
pip install -e .
DYLD_LIBRARY_PATH=swift/.build/arm64-apple-macosx/release \
  vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096

Troubleshooting

Homebrew checksum error on reinstall:

brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm*
brew tap TheTom/tap && brew install vllm-swift

"No module named vllm" or plugin not loading after brew install:

brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift
brew tap TheTom/tap && brew install vllm-swift
vllm-swift setup

vLLM build error (Apple Clang parentheses): Our install script and brew wrapper handle this automatically. If you're on an older bottle or installing vLLM manually:

# Brew users: get the latest bottle first
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift/venv
brew tap TheTom/tap && brew install vllm-swift && vllm-swift setup

# Or install vLLM manually with the fix
CFLAGS="-Wno-parentheses" pip install vllm

activate.sh not found: Make sure you run ./install.sh (or ./scripts/install.sh) first — it generates activate.sh in the project root.

Metal kernel not found (GDN/TurboFlash models): The mlx.metallib file must be in the same directory as libVLLMBridge.dylib. For manual installs, copy it:

cp swift/.build/arm64-apple-macosx/release/mlx.metallib \
   "$(echo "$DYLD_LIBRARY_PATH" | cut -d: -f1)/"

Download a model

vllm-swift download mlx-community/Qwen3-4B-4bit

# Or manually:
huggingface-cli download mlx-community/Qwen3-4B-4bit --local-dir ~/models/Qwen3-4B-4bit

# Already have models in HuggingFace cache? Point directly at them:
vllm-swift serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/latest

Project Structure

vllm_swift/           Python plugin (vLLM WorkerBase)
swift/
  Sources/VLLMBridge/       Swift bridge (@_cdecl exports)
  bridge.h                  C API (prefill, decode, batched decode)
scripts/
  install.sh                One-step build + install
  build_bottle.sh           Build + upload Homebrew bottle
  integration_test.sh       End-to-end smoke test
homebrew/
  vllm-swift.rb             Homebrew formula
tests/                      84 tests, 97% coverage

Requirements

  • macOS 14+ on Apple Silicon
  • Xcode 15+ or Swift 6.0+ (for building from source; Homebrew bottle skips this)
  • Python 3.10+
  • vLLM 0.19+
  • mlx-swift-lm (pulled automatically by Swift Package Manager)

License

Apache-2.0
