A native Swift/Metal backend for vLLM on Apple Silicon. No Python in the inference hot path.

OpenAI-compatible API. Up to 2.6× faster short-context decode.
```bash
brew tap TheTom/tap && brew install vllm-swift
```

Or from source:

```bash
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh   # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh     # sets DYLD_LIBRARY_PATH (generated by install.sh)
```

Then download and serve a model:

```bash
vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096   # increase as needed, max 40960
```

Homebrew users don't need `activate.sh`; `vllm-swift serve` handles everything.
Server running at http://localhost:8000 (OpenAI-compatible API).
Drop-in replacement for vLLM on Apple Silicon. All `vllm serve` flags work unchanged.
Decode throughput, tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy (temp=0). Both engines measured via offline benchmark (no HTTP overhead). vllm-swift uses the Swift/Metal engine via ctypes. vllm-metal uses the Python/MLX engine via vLLM's offline LLM API.
| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 364 | 1,527 | 2,859 | 3,425 |
| vllm-metal (Python/MLX) | 111 | 652 | 2,047 | 2,620 |

| Engine | Single | 8 concurrent | 32 concurrent | 64 concurrent |
|---|---|---|---|---|
| vllm-swift | 147 | 477 | 1,194 | 1,518 |
| vllm-metal (Python/MLX) | 104 | 396 | 1,065 | 1,375 |
Full matrix, methodology, and long-context cells in docs/PERFORMANCE.md.
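Decode throughput here is simply generated tokens divided by wall-clock decode time. A minimal sketch of that measurement, where `decode_tps` and the dummy workload are illustrative stand-ins (the real harness drives the Swift engine through ctypes, or the Python/MLX engine via vLLM's offline LLM API):

```python
import time

# Illustrative harness: decode throughput = generated tokens / wall-clock time.
def decode_tps(generate, n_tokens):
    start = time.perf_counter()
    generate(n_tokens)  # stand-in for the engine's decode loop
    return n_tokens / (time.perf_counter() - start)

# Dummy workload so the sketch runs anywhere; the real benchmark calls
# into the inference engine here.
tps = decode_tps(lambda n: sum(i * i for i in range(n * 1000)), 50)
print(f"{tps:.0f} tok/s")
```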
TurboQuant+ KV Cache Compression
TurboQuant+ compresses KV cache to fit longer context with modest throughput cost.
Qwen3.5 2B (4-bit weights)
| KV Cache | Compression | Prefill @1K | Decode @1K | Prefill @4K | Decode @4K |
|---|---|---|---|---|---|
| FP16 | 1.0× | 1,252 tok/s | 259 tok/s | 1,215 tok/s | 249 tok/s |
| turbo4v2 | 3.0× | 1,331 tok/s | 245 tok/s | 1,245 tok/s | 240 tok/s |
| turbo3 | 4.6× | 1,346 tok/s | 174 tok/s | 1,276 tok/s | 241 tok/s |
The entire forward pass runs in Swift/Metal. Python is used only for orchestration.
```
Python (vLLM API, tokenization, scheduling)   ← github.com/vllm-project/vllm
    ↓ ctypes FFI
C bridge (bridge.h)
    ↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
    ↓
Metal GPU
```
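The ctypes hop above is the standard FFI pattern: `CDLL` the dylib, declare the C signatures, call. The bridge's real entry points live in bridge.h and aren't reproduced here; as a self-contained illustration of the same mechanism, this binds a function from libm instead:

```python
import ctypes
import ctypes.util

# The Python plugin reaches the Swift engine the same way: ctypes.CDLL on
# libVLLMBridge.dylib, then argtypes/restype declarations matching bridge.h.
# libm stands in here so the sketch runs anywhere.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

libm.cos.argtypes = [ctypes.c_double]   # C signature: double cos(double)
libm.cos.restype = ctypes.c_double

print(libm.cos(0.0))  # 1.0
```

Declaring `argtypes`/`restype` up front is what keeps the call type-safe across the boundary; without it, ctypes guesses and can silently corrupt arguments.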
- OpenAI-compatible API (`/v1/completions`, `/v1/chat/completions`)
- Streaming (SSE) responses
- Chat templates (applied by vLLM, model-specific)
- Batched concurrent decode with `BatchedKVCache` (fully batched projections + attention)
- Per-request temperature sampling in batched path
- Auto model download from HuggingFace Hub
- TurboQuant+ KV cache compression (`turbo3`, `turbo4v2`) via mlx-swift-lm
- Decode and prompt logprobs
- Greedy and temperature sampling
- EOS / stop token detection (vLLM scheduler)
- VLM (vision-language model) support (experimental)
- Works with Hermes, OpenCode, and any OpenAI-compatible client
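Streaming responses use the standard OpenAI SSE framing: each event line is `data: <json chunk>`, terminated by `data: [DONE]`. A sketch of client-side parsing; the `collect_deltas` helper and the canned transcript are illustrative, not part of vllm-swift:

```python
import json

# Extract the text deltas from a raw OpenAI-style SSE transcript.
def collect_deltas(sse_lines):
    text = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue  # skip keep-alives / blank separators
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        text.append(delta)
    return "".join(text)

# A canned transcript shaped like a streamed /v1/chat/completions reply:
stream = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    'data: [DONE]',
]
print(collect_deltas(stream))  # Hello
```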
```bash
# Start server with tool calling enabled
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 40960 \
  --served-model-name qwen3-4b \
  --enable-auto-tool-choice --tool-call-parser hermes
```

Then point your tool at it:

```bash
# Hermes — set in ~/.hermes/config.yaml:
#   base_url: http://localhost:8000/v1
#   model: qwen3-4b

# OpenCode
OPENAI_API_BASE=http://localhost:8000/v1 OPENAI_API_KEY=local opencode

# Any OpenAI-compatible client
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-4b","messages":[{"role":"user","content":"Hello"}]}'
```

`vllm-swift serve` is a thin wrapper around `vllm serve`; all standard vLLM flags work. Here are the common setups:
**Basic:**

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960
```

**Tool calling:**

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes
```

**Reasoning:**

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-reasoning --reasoning-parser deepseek_r1
```

**Long context with TurboQuant+**
Compress KV cache 3-5× to fit longer context with modest throughput cost:
```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
```

| Scheme | Compression | Best for |
|---|---|---|
| `turbo4v2` | ~3× | Recommended — best quality/compression balance |
| `turbo3` | ~4.6× | Maximum compression, higher PPL trade-off |
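To see what ~3× compression buys, a back-of-envelope KV-cache sizing sketch. The model dimensions below (36 layers, 8 KV heads, head dim 128) are illustrative placeholders, not Qwen3-4B's actual config:

```python
# Bytes held by the KV cache: 2 tensors (K and V) per layer,
# each [n_kv_heads, head_dim] per token, at bytes_per_elem precision.
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

fp16 = kv_cache_bytes(40960, 36, 8, 128, 2)   # FP16 = 2 bytes/element
print(f"FP16 @ 40960 tokens: {fp16 / 2**30:.2f} GiB")
print(f"turbo4v2 (~3x):      {fp16 / 3.0 / 2**30:.2f} GiB")
```

Under these assumptions a full 40960-token context drops from roughly 5.6 GiB to under 2 GiB of KV cache, which is the headroom that makes `--max-model-len 40960` practical on smaller machines.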
**Everything enabled:**

```bash
vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --enable-reasoning --reasoning-parser deepseek_r1 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'
```

```
vllm-swift serve <model> [options]

--served-model-name NAME     Clean model name for API clients (recommended)
--max-model-len N            Max sequence length (default: model config)
--port PORT                  API server port (default: 8000)
--gpu-memory-utilization F   Memory fraction 0.0-1.0 (default: 0.9)
--dtype float16              Model dtype (default: float16)
--enable-auto-tool-choice    Enable tool/function calling
--tool-call-parser NAME      Tool call format (hermes, llama3, mistral, etc.)
--enable-reasoning           Enable chain-of-thought parsing
--reasoning-parser NAME      Reasoning format (deepseek_r1, etc.)
--additional-config JSON     Extra config (kv_scheme, kv_bits)
```

All standard vLLM flags work — these are just the most common ones.
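If you generate launch commands programmatically, building the `--additional-config` value with `json.dumps` plus `shlex.quote` sidesteps shell-quoting mistakes around the embedded double quotes:

```python
import json
import shlex

# Build the JSON config in Python, then quote it safely for a shell line.
cfg = {"kv_scheme": "turbo4v2", "kv_bits": 4}
arg = json.dumps(cfg)
print(f"--additional-config {shlex.quote(arg)}")
```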
See CHANGELOG.md for release history.
- LoRA not supported (Swift engine limitation)
- Chunked prefill disabled (Swift engine handles full sequences)
- top_p sampling not supported in batched decode path (temperature works)
- Only Qwen3 models use the fully batched decode path; other architectures fall back to sequential decode (still functional, just slower at high concurrency)
- Requires macOS on Apple Silicon (no Linux/CUDA)
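For reference, the sampling semantics available in the batched path: temperature 0 means greedy argmax, otherwise logits are rescaled by the temperature before softmax sampling. A sketch of those semantics in Python (illustrative only, not the Swift implementation):

```python
import math
import random

def sample(logits, temperature):
    # temperature 0 -> greedy argmax
    if temperature == 0.0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in scaled]  # unnormalized softmax
    return random.choices(range(len(logits)), weights=weights)[0]

print(sample([1.0, 3.0, 2.0], 0.0))  # 1 (index of the max logit)
```

Note that `top_p` truncation is deliberately absent, matching the limitation above.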
```bash
brew tap TheTom/tap && brew install vllm-swift
```

Prebuilt bottle — no Swift toolchain needed. First run of `vllm-swift serve` sets up a managed Python environment automatically.

To update to the latest version:

```bash
vllm-swift update
# Or via standard Homebrew (works from any version):
brew update && brew upgrade vllm-swift
```

From source:

```bash
git clone https://github.com/TheTom/vllm-swift.git
cd vllm-swift
./scripts/install.sh   # builds Swift, installs plugin, creates activate.sh
source activate.sh     # sets DYLD_LIBRARY_PATH
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
```

Manual build:

```bash
git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
cd swift && swift build -c release && cd ..
pip install -e .
DYLD_LIBRARY_PATH=swift/.build/arm64-apple-macosx/release \
  vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096
```

**Homebrew checksum error on reinstall:**
```bash
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm*
brew tap TheTom/tap && brew install vllm-swift
```

**"No module named vllm" or plugin not loading after brew install:**

```bash
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift
brew tap TheTom/tap && brew install vllm-swift
vllm-swift setup
```

**vLLM build error (Apple Clang parentheses):** Our install script and brew wrapper handle this automatically. If you're on an older bottle or installing vLLM manually:
```bash
# Brew users: get the latest bottle first
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift/venv
brew tap TheTom/tap && brew install vllm-swift && vllm-swift setup

# Or install vLLM manually with the fix
CFLAGS="-Wno-parentheses" pip install vllm
```

**activate.sh not found:** Make sure you run ./install.sh (or ./scripts/install.sh) first — it generates activate.sh in the project root.
**Metal kernel not found (GDN/TurboFlash models):** The mlx.metallib file must sit in the same directory as libVLLMBridge.dylib (the first DYLD_LIBRARY_PATH entry). For manual installs, copy it:

```bash
cp swift/.build/arm64-apple-macosx/release/mlx.metallib \
  $(echo $DYLD_LIBRARY_PATH | cut -d: -f1)/
```

**Downloading models:**

```bash
vllm-swift download mlx-community/Qwen3-4B-4bit

# Or manually:
huggingface-cli download mlx-community/Qwen3-4B-4bit --local-dir ~/models/Qwen3-4B-4bit

# Already have models in the HuggingFace cache? Point directly at them:
vllm-swift serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/latest
```

**Project layout:**

```
vllm_swift/                Python plugin (vLLM WorkerBase)
swift/
  Sources/VLLMBridge/      C bridge (@_cdecl exports)
    bridge.h               C API (prefill, decode, batched decode)
scripts/
  install.sh               One-step build + install
  build_bottle.sh          Build + upload Homebrew bottle
  integration_test.sh      End-to-end smoke test
homebrew/
  vllm-swift.rb            Homebrew formula
tests/                     84 tests, 97% coverage
```
- macOS 14+ on Apple Silicon
- Xcode 15+ or Swift 6.0+ (for building from source; Homebrew bottle skips this)
- Python 3.10+
- vLLM 0.19+
- mlx-swift-lm (pulled automatically by Swift Package Manager)
Apache-2.0