
Tags: kekzl/imp

v0.7

imp v0.7.0

Long-context correctness + Gemma-4 / GDN stabilization.

Headline:
- FP8 FMHA S_tile shared-memory overlap fix: pp>1024 now coherent on all
  tested models, up to x1.70 vs llama.cpp at pp=8192
- Qwen 3.5 / 3.6 GDN: launch_bounds + partial-RoPE + h_state FP32 fixes
- Gemma-4 suite: CUDA graphs, rope_freqs, SWA long-context, host-MoE fix
  Q4_K_M decode 55 -> 183 tok/s (x1.21 vs llama.cpp)
- Platform: CUDA 13.2.1, stream priorities, StreamingLLM smart KV,
  weight-storage refactor, CUTLASS 3.x NVFP4 Grouped GEMM scaffold

Full changelog: CHANGELOG.md

v0.6

v0.6 — Jinja2 macros, Qwen3.5 fix, MXFP4, n-gram spec, HuggingFace Hub

v0.5.1

v0.5.1: Fix GDN multi-turn chat degeneration

- FP16 prefill weights for GDN models (FP8 precision loss in recurrent state)
- Chunked prefill preserves SSM/GDN state across chunk boundaries
- Conv1d uses prev chunk context instead of zero-padding
- Prefix caching disabled for recurrent models
- Updated benchmarks for v0.5.1
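The conv1d fix above amounts to carrying the last (kernel_width - 1) inputs of each chunk into the next one, so chunked prefill matches a single full-sequence pass. A minimal NumPy sketch under assumed toy shapes (a shared-weight filter, not imp's actual per-channel kernel):

```python
import numpy as np

K = 4  # short-conv kernel width, as in Mamba/GDN-style conv1d

def causal_conv1d(x, w, ctx=None):
    """Causal conv over the time axis; `ctx` carries the previous chunk's tail."""
    if ctx is None:
        ctx = np.zeros((K - 1,) + x.shape[1:])  # first chunk: zero history
    xp = np.concatenate([ctx, x], axis=0)
    y = np.stack([w @ xp[t:t + K] for t in range(len(x))])
    return y, xp[-(K - 1):]                     # tail to carry into next chunk

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 8))                # (time, channels)
w = rng.standard_normal(K)                      # toy shared-weight filter

y_full, _ = causal_conv1d(x, w)                 # whole sequence at once
y0, tail = causal_conv1d(x[:16], w)             # chunk 0
y1, _ = causal_conv1d(x[16:], w, ctx=tail)      # chunk 1: prev context, not zeros
print(np.allclose(np.concatenate([y0, y1]), y_full))  # True
```

Dropping `ctx` on the second chunk (i.e. zero-padding, the pre-fix behavior) makes the first K-1 outputs of every chunk diverge from the full-sequence result.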

v0.4.1

fix: disable NVFP4 for GDN models + cuBLASLt fallback

GDN (Gated DeltaNet) models: auto-disable NVFP4 decode cache. The delta
rule scan accumulates quantization error in recurrent state H across
tokens; 4-bit NVFP4 causes visible quality degradation on 9B+ models
(repeated <|im_start|> tokens). FP8 prefill + dp4a decode path preserves
enough precision. Qwen3.5-4B was unaffected but 9B/27B were broken.
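The accumulation effect can be reproduced with a toy recurrent scan. This NumPy sketch uses a crude symmetric quantizer and a simplified delta-rule update as stand-ins for NVFP4/FP8 and imp's kernels; it only illustrates why re-quantizing the state H every token hurts far more at 4 bits than at 8:

```python
import numpy as np

def fake_quant(x, bits):
    # Crude symmetric per-tensor quantizer: a stand-in for NVFP4 (4b) / FP8 (8b).
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale if scale > 0 else x

def scan_error(bits, steps=512, dim=64, seed=0):
    rng = np.random.default_rng(seed)
    h_ref = np.zeros((dim, dim))  # full-precision recurrent state H
    h_q = np.zeros((dim, dim))    # state re-quantized every token
    for _ in range(steps):
        k = rng.standard_normal(dim) / np.sqrt(dim)
        v = rng.standard_normal(dim) / np.sqrt(dim)
        h_ref += np.outer(k, v - h_ref.T @ k)   # delta-rule-style update
        h_q += np.outer(k, v - h_q.T @ k)
        h_q = fake_quant(h_q, bits)             # quantization error injected each step
    return np.abs(h_q - h_ref).mean()

print(scan_error(4) > scan_error(8))  # 4-bit state drifts far more -> True
```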

cuBLASLt: add cublasGemmEx fallback when cublasLtMatmul fails with
status 7 (CUBLAS_STATUS_INVALID_VALUE). CUDA 13.2 returns this for
certain M/K/N combinations on sm_120. Previously the failing call was
silently ignored and execution continued with corrupted output. All three
cuBLASLt paths now fall back gracefully.
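The shape of the fix is: run the fast path, check the status, re-run on the reference path instead of continuing. A Python sketch of that control flow; the callables and status values below are hypothetical stand-ins for cublasLtMatmul / cublasGemmEx, not real bindings:

```python
CUBLAS_STATUS_SUCCESS = 0
CUBLAS_STATUS_INVALID_VALUE = 7  # the status now caught on sm_120

def gemm_with_fallback(lt_matmul, gemm_ex, *args):
    """Run the cuBLASLt path; on failure, retry with the cublasGemmEx path."""
    status, out = lt_matmul(*args)
    if status != CUBLAS_STATUS_SUCCESS:      # never keep corrupted output
        status, out = gemm_ex(*args)
    if status != CUBLAS_STATUS_SUCCESS:
        raise RuntimeError(f"both GEMM paths failed (status {status})")
    return out

# Toy stand-ins: the "Lt" path rejects K not divisible by 8, mimicking the
# shape-dependent failures seen on CUDA 13.2 / sm_120.
lt = lambda m, k, n: (0, f"lt:{m}x{k}x{n}") if k % 8 == 0 else (7, None)
ex = lambda m, k, n: (0, f"ex:{m}x{k}x{n}")

print(gemm_with_fallback(lt, ex, 128, 96, 64))   # lt:128x96x64 (fast path ok)
print(gemm_with_fallback(lt, ex, 128, 100, 64))  # ex:128x100x64 (fell back)
```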

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

v0.4

docs: update CLAUDE.md for v0.4 — Qwen3.5 GDN benchmarks, CUDA 13.2

- Add Qwen3.5 (GDN) to benchmark tables: tg128 +82%, pp512 +44% vs llama.cpp
- Add GDN architecture section documenting fused kernels and design
- Update CUDA version references from 13.1 to 13.2
- Add Qwen3.5 to supported architectures list
- Add test_gdn_kernel.cu to test file table
- Version bump v0.3 → v0.4

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>

v0.2

Release 0.2 — Decode Performance Optimization

Major decode throughput improvements across all model architectures on RTX 5090.

Key optimizations:
- NVFP4 prmt register LUT (no shared memory dequant)
- Vectorized RMSNorm (float4 loads, 512 threads)
- Fused FP8 KV cache write (halves kernel launches per layer)
- Split SwiGLU for large K (separate activation + prmt GEMV)
- Occupancy-aware NVFP4 GEMV dispatch (multi-row vs kpar)
- NVFP4 down_proj for post-FFN norm models (Gemma-3)
- NVFP4 LM head in async CUDA graph decode loop
- Multi-row occupancy 6→8 blocks/SM
- rmsnorm_quantize_q8_1 threads 256→1024
- Vectorized rmsnorm_fp32_accum (float4, 512 threads)
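For context, the math the RMSNorm kernels vectorize is small. A NumPy sketch of the computation only; the release's kernels implement this with float4 loads across 512-1024 threads, fused with Q8_1 quantization where noted:

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # y = x / sqrt(mean(x^2) + eps) * weight
    rms = np.sqrt((x.astype(np.float32) ** 2).mean(axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.random.default_rng(0).standard_normal((1, 4096)).astype(np.float32)
w = np.ones(4096, dtype=np.float32)
y = rmsnorm(x, w)
print(np.isclose((y ** 2).mean(), 1.0, atol=1e-3))  # unit RMS after norm -> True
```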

Benchmark results (RTX 5090, Q8_0, decode tg128):
  Qwen3-4B:       393 tok/s (llama.cpp: 244, +61%)
  Qwen3-8B:       262 tok/s (llama.cpp: 157, +67%)
  Gemma-3-12B:    146 tok/s (llama.cpp:  98, +49%)
  Phi-4-mini:     264 tok/s (llama.cpp: 277,  -5%)
  Qwen3-Coder-30B MoE: 293 tok/s (llama.cpp: 251, +17%)