vllm-swift

A native Swift/Metal backend for vLLM on Apple Silicon.
No Python in the inference hot path.
OpenAI-compatible API. Up to 2.6× faster short-context decode than the Python/MLX engine.

Quick Start

1. Install

brew tap TheTom/tap && brew install vllm-swift

Or from source:

git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
./scripts/install.sh       # builds Swift bridge, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH (generated by install.sh)

2. Run

vllm-swift download mlx-community/Qwen3-4B-4bit
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 4096  # increase as needed, max 40960

Homebrew users don't need activate.sh; vllm-swift serve handles everything.

Server running at http://localhost:8000 (OpenAI-compatible API).

Drop-in replacement for vLLM on Apple Silicon. All vllm serve flags work unchanged.
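
Once the server is up, any OpenAI-style request should work. Here is a minimal sanity check using only the Python standard library; the "model" field is an assumption and must match whatever you passed to serve (or --served-model-name, if you set one):

# Minimal sanity check against the local server (stdlib only).
# Adjust "model" to the path or --served-model-name you actually served.
import json, urllib.request

payload = {
    "model": "mlx-community/Qwen3-4B-4bit",   # placeholder; match your serve argument
    "prompt": "The capital of France is",
    "max_tokens": 16,
    "temperature": 0,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])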

Performance (M5 Max 128GB)

Decode throughput, tok/s. Prompt = 18 tokens, generation = 50 tokens, greedy (temp=0). Both engines measured via offline benchmark (no HTTP overhead). vllm-swift uses the Swift/Metal engine via ctypes. vllm-metal uses the Python/MLX engine via vLLM's offline LLM API.

Qwen3-0.6B

                         Single   8 concurrent   32 concurrent   64 concurrent
vllm-swift                  364          1,527           2,859           3,425
vllm-metal (Python/MLX)     111            652           2,047           2,620

Qwen3-4B

                         Single   8 concurrent   32 concurrent   64 concurrent
vllm-swift                  147            477           1,194           1,518
vllm-metal (Python/MLX)     104            396           1,065           1,375

Full matrix, methodology, and long-context cells in docs/PERFORMANCE.md.
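
The numbers above come from the project's own offline benchmark harness. A rough way to reproduce the shape of this measurement with vLLM's offline LLM API (not the exact harness; the throughput figure below also includes prefill):

# Rough throughput sketch via vLLM's offline LLM API, not the project's exact
# benchmark script. Greedy decoding, 50 generated tokens, 64 concurrent prompts.
import os, time
from vllm import LLM, SamplingParams

llm = LLM(model=os.path.expanduser("~/models/Qwen3-4B-4bit"), max_model_len=4096)
params = SamplingParams(temperature=0, max_tokens=50)

prompts = ["Write one sentence about Apple Silicon."] * 64

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.0f} tok/s across {len(prompts)} concurrent requests")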

TurboQuant+ KV Cache Compression

TurboQuant+ compresses KV cache to fit longer context with modest throughput cost.

Qwen3.5 2B (4-bit weights)

KV Cache    Compression   Prefill @1K   Decode @1K   Prefill @4K   Decode @4K
FP16        1.0×          1,252 tok/s   259 tok/s    1,215 tok/s   249 tok/s
turbo4v2    3.0×          1,331 tok/s   245 tok/s    1,245 tok/s   240 tok/s
turbo3      4.6×          1,346 tok/s   174 tok/s    1,276 tok/s   241 tok/s

Architecture

The entire forward pass runs in Swift/Metal. Python is used only for orchestration.

Python (vLLM API, tokenization, scheduling)  ← github.com/vllm-project/vllm
  ↓ ctypes FFI
C bridge (bridge.h)
  ↓ @_cdecl
Swift (mlx-swift-lm, BatchedKVCache, batched decode)
  ↓
Metal GPU
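
On the Python side, the plugin loads the Swift library through ctypes. The snippet below only illustrates that call path; the symbol name and signature are placeholders, not the real bridge.h API (see swift/bridge.h for the actual exports):

# Illustration of the Python -> Swift call path. The symbol name and signature
# below are hypothetical placeholders, NOT the real bridge.h API.
import ctypes

lib = ctypes.CDLL("libVLLMBridge.dylib")   # resolved via DYLD_LIBRARY_PATH

# Hypothetical batched-decode entry point: token IDs in, next-token logits out.
lib.vllm_bridge_batched_decode.argtypes = [
    ctypes.POINTER(ctypes.c_int32),   # one token ID per active sequence
    ctypes.c_int32,                   # batch size
    ctypes.POINTER(ctypes.c_float),   # output logits buffer (batch * vocab)
]
lib.vllm_bridge_batched_decode.restype = ctypes.c_int32   # status code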

Features

  • OpenAI-compatible API (/v1/completions, /v1/chat/completions)
  • Streaming (SSE) responses
  • Chat templates (applied by vLLM, model-specific)
  • Batched concurrent decode with BatchedKVCache (fully batched projections + attention)
  • Per-request temperature sampling in batched path
  • Auto model download from HuggingFace Hub
  • TurboQuant+ KV cache compression (turbo3, turbo4v2) via mlx-swift-lm
  • Decode and prompt logprobs
  • Greedy and temperature sampling
  • EOS / stop token detection (vLLM scheduler)
  • VLM (vision-language model) support (experimental)
  • Works with Hermes, OpenCode, and any OpenAI-compatible client

Use with AI tools

# Start server with tool calling enabled
vllm-swift serve ~/models/Qwen3-4B-4bit --max-model-len 40960 \
  --served-model-name qwen3-4b \
  --enable-auto-tool-choice --tool-call-parser hermes

Then point your tool at it:

# Hermes — set in ~/.hermes/config.yaml:
#   base_url: http://localhost:8000/v1
#   model: qwen3-4b

# OpenCode
OPENAI_API_BASE=http://localhost:8000/v1 OPENAI_API_KEY=local opencode

# Any OpenAI-compatible client
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3-4b","messages":[{"role":"user","content":"Hello"}]}'

Configuration

vllm-swift serve is a thin wrapper around vllm serve — all standard vLLM flags work. Here are the common setups:

Basic serving

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960

Agent / tool calling (Hermes, OpenCode, etc.)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes

Chain-of-thought models (strip <think> tags)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-reasoning --reasoning-parser deepseek_r1

Long context with TurboQuant+

Compress the KV cache roughly 3-4.6× to fit longer context with modest throughput cost:

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'

Scheme     Compression   Best for
turbo4v2   ~3×           Recommended — best quality/compression balance
turbo3     ~4.6×         Maximum compression, higher PPL trade-off

Full setup (agent + reasoning + TurboQuant+)

vllm-swift serve ~/models/Qwen3-4B-4bit \
  --served-model-name qwen3-4b \
  --max-model-len 40960 \
  --enable-auto-tool-choice --tool-call-parser hermes \
  --enable-reasoning --reasoning-parser deepseek_r1 \
  --additional-config '{"kv_scheme": "turbo4v2", "kv_bits": 4}'

All flags

vllm-swift serve <model> [options]

  --served-model-name NAME   Clean model name for API clients (recommended)
  --max-model-len N          Max sequence length (default: model config)
  --port PORT                API server port (default: 8000)
  --gpu-memory-utilization F Memory fraction 0.0-1.0 (default: 0.9)
  --dtype float16            Model dtype (default: float16)
  --enable-auto-tool-choice  Enable tool/function calling
  --tool-call-parser NAME    Tool call format (hermes, llama3, mistral, etc.)
  --enable-reasoning         Enable chain-of-thought parsing
  --reasoning-parser NAME    Reasoning format (deepseek_r1, etc.)
  --additional-config JSON   Extra config (kv_scheme, kv_bits)

All standard vLLM flags work — these are just the most common ones.

Changelog

See CHANGELOG.md for release history.

Known Limitations (early development)

  • LoRA not supported (Swift engine limitation)
  • Chunked prefill disabled (Swift engine handles full sequences)
  • top_p sampling not supported in batched decode path (temperature works)
  • Only Qwen3 models use the fully batched decode path; other architectures fall back to sequential decode (still functional, just slower at high concurrency)
  • Requires macOS on Apple Silicon (no Linux/CUDA)

Install

Homebrew

brew tap TheTom/tap && brew install vllm-swift

Prebuilt bottle — no Swift toolchain needed. First run of vllm-swift serve sets up a managed Python environment automatically.

To update to the latest version:

vllm-swift update

# Or via standard Homebrew (works from any version):
brew update && brew upgrade vllm-swift

From source

git clone https://github.com/TheTom/vllm-swift.git
cd vllm-swift
./scripts/install.sh       # builds Swift, installs plugin, creates activate.sh
source activate.sh         # sets DYLD_LIBRARY_PATH
vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096

Manual (full control)

git clone https://github.com/TheTom/vllm-swift.git && cd vllm-swift
cd swift && swift build -c release && cd ..
pip install -e .
DYLD_LIBRARY_PATH=swift/.build/arm64-apple-macosx/release \
  vllm serve ~/models/Qwen3-4B-4bit --max-model-len 4096

Troubleshooting

Homebrew checksum error on reinstall:

brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm*
brew tap TheTom/tap && brew install vllm-swift

"No module named vllm" or plugin not loading after brew install:

brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift
brew tap TheTom/tap && brew install vllm-swift
vllm-swift setup

vLLM build error (Apple Clang parentheses): Our install script and brew wrapper handle this automatically. If you're on an older bottle or installing vLLM manually:

# Brew users: get the latest bottle first
brew uninstall vllm-swift && brew untap TheTom/tap
rm -rf $(brew --cache)/downloads/*vllm* ~/.vllm-swift/venv
brew tap TheTom/tap && brew install vllm-swift && vllm-swift setup

# Or install vLLM manually with the fix
CFLAGS="-Wno-parentheses" pip install vllm

activate.sh not found: Make sure you run ./install.sh (or ./scripts/install.sh) first — it generates activate.sh in the project root.

Metal kernel not found (GDN/TurboFlash models): The mlx.metallib file must be in the same directory as libVLLMBridge.dylib. For manual installs, copy it:

cp swift/.build/arm64-apple-macosx/release/mlx.metallib \
   "$(echo "$DYLD_LIBRARY_PATH" | cut -d: -f1)/"

Download a model

vllm-swift download mlx-community/Qwen3-4B-4bit

# Or manually:
huggingface-cli download mlx-community/Qwen3-4B-4bit --local-dir ~/models/Qwen3-4B-4bit

# Already have models in HuggingFace cache? Point directly at them:
vllm-swift serve ~/.cache/huggingface/hub/models--mlx-community--Qwen3-4B-4bit/snapshots/latest

Project Structure

vllm_swift/           Python plugin (vLLM WorkerBase)
swift/
  Sources/VLLMBridge/       Swift bridge (@_cdecl exports)
  bridge.h                  C API (prefill, decode, batched decode)
scripts/
  install.sh                One-step build + install
  build_bottle.sh           Build + upload Homebrew bottle
  integration_test.sh       End-to-end smoke test
homebrew/
  vllm-swift.rb             Homebrew formula
tests/                      84 tests, 97% coverage

Requirements

  • macOS 14+ on Apple Silicon
  • Xcode 15+ or Swift 6.0+ (for building from source; Homebrew bottle skips this)
  • Python 3.10+
  • vLLM 0.19+
  • mlx-swift-lm (pulled automatically by Swift Package Manager)

License

Apache-2.0
