Tags: kekzl/imp
imp v0.7.0 — Long-context correctness + Gemma-4 / GDN stabilization

Headline:
- FP8 FMHA S_tile smem overlap fix — pp>1024 coherent on all tested models, up to x1.70 vs llama.cpp at pp=8192
- Qwen 3.5 / 3.6 GDN: launch_bounds + partial-RoPE + h_state FP32 fixes
- Gemma-4 suite: CUDA graphs, rope_freqs, SWA long-context, host-MoE fix; Q4_K_M decode 55 -> 183 tok/s (x1.21 vs llama.cpp)
- Platform: CUDA 13.2.1, stream priorities, StreamingLLM smart KV, weight-storage refactor, CUTLASS 3.x NVFP4 Grouped GEMM scaffold

Full changelog: CHANGELOG.md
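The "StreamingLLM smart KV" bullet refers to the StreamingLLM eviction scheme: retain a few initial attention-sink positions plus a rolling window of recent positions, and drop the middle. A minimal host-side sketch of the retention rule — the function name, signature, and defaults are illustrative, not imp's actual API:

```cpp
#include <cstddef>
#include <vector>

// StreamingLLM-style KV retention sketch: keep the first `n_sink`
// positions (attention sinks) plus the `window` most recent positions.
// Hypothetical helper, not imp's actual interface.
std::vector<size_t> kv_positions_to_keep(size_t seq_len,
                                         size_t n_sink,
                                         size_t window) {
    std::vector<size_t> keep;
    if (seq_len <= n_sink + window) {
        // Cache still fits: keep every position.
        for (size_t i = 0; i < seq_len; ++i) keep.push_back(i);
        return keep;
    }
    for (size_t i = 0; i < n_sink; ++i)
        keep.push_back(i);                       // attention sinks
    for (size_t i = seq_len - window; i < seq_len; ++i)
        keep.push_back(i);                       // recent window
    return keep;
}
```

The cache size is thus bounded by `n_sink + window` regardless of how long the conversation runs, which is what makes arbitrarily long decoding possible without growing KV memory.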
v0.5.1: Fix GDN multi-turn chat degeneration

- FP16 prefill weights for GDN models (FP8 precision loss in recurrent state)
- Chunked prefill preserves SSM/GDN state across chunk boundaries
- Conv1d uses previous-chunk context instead of zero-padding
- Prefix caching disabled for recurrent models
- Updated benchmarks for v0.5.1
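The state-across-chunks fix can be illustrated with a toy linear recurrence: as long as the state variable flows from one chunk into the next, chunked processing is bit-identical to a single full-sequence pass (the conv1d fix is the analogous idea for the convolution's lookback context). This is a simplified sketch, not imp's actual scan kernel:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy recurrence h_t = a*h_{t-1} + x_t, processed in chunks.
// The v0.5.1 fix in one line: `h` persists across chunk boundaries
// instead of being reset to zero per chunk. Illustrative only.
float scan_chunked(const std::vector<float>& x, size_t chunk, float a) {
    float h = 0.0f;  // recurrent state, carried across chunks
    for (size_t start = 0; start < x.size(); start += chunk) {
        size_t end = std::min(start + chunk, x.size());
        for (size_t t = start; t < end; ++t)
            h = a * h + x[t];  // the same h flows into the next chunk
    }
    return h;
}
```

Because the per-token operations and their order are unchanged, any chunk size gives exactly the same result as processing the whole sequence at once — resetting `h` per chunk is what produced the multi-turn degeneration.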
fix: disable NVFP4 for GDN models + cuBLASLt fallback

GDN (Gated DeltaNet) models: auto-disable the NVFP4 decode cache. The delta-rule scan accumulates quantization error in the recurrent state H across tokens; 4-bit NVFP4 causes visible quality degradation on 9B+ models (repeated <|im_start|> tokens). The FP8 prefill + dp4a decode path preserves enough precision. Qwen3.5-4B was unaffected, but 9B/27B were broken.

cuBLASLt: add a cublasGemmEx fallback when cublasLtMatmul fails with status 7 (CUBLAS_STATUS_INVALID_VALUE). CUDA 13.2 returns this for certain M/K/N combinations on sm_120. Previously the code silently continued with corrupted output; all three cuBLASLt paths now fall back gracefully.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
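Why 4-bit state quantization breaks a recurrent scan while 8-bit survives can be shown with a toy uniform quantizer — note this is deliberately *not* the real NVFP4 e2m1 format, just a demonstration of the failure mode: with a coarse grid, small per-token updates round to zero and the state never moves, while a finer grid tracks the true trajectory.

```cpp
#include <cmath>

// Re-quantize the recurrent state after every step with a uniform
// n-bit grid on [-1, 1]. Toy model of the precision-loss failure mode;
// NOT the actual NVFP4 (e2m1) format or imp's kernels.
float quantize(float v, int bits) {
    float levels = float((1 << (bits - 1)) - 1);  // 7 for 4-bit, 127 for 8-bit
    return std::round(v * levels) / levels;
}

float scan_quantized(int steps, int bits) {
    float h = 0.0f;
    for (int i = 0; i < steps; ++i)
        h = quantize(0.9f * h + 0.05f, bits);  // exact math converges to 0.5
    return h;
}
```

With 4 bits the grid step (1/7 ≈ 0.14) exceeds the per-step update, so the state rounds back to zero forever and ends at error 0.5; with 8 bits the state settles near the true fixed point. The same mechanism, compounded over thousands of tokens and a full state matrix H, is what produced the repeated-token output on the larger models.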
docs: update CLAUDE.md for v0.4 — Qwen3.5 GDN benchmarks, CUDA 13.2

- Add Qwen3.5 (GDN) to benchmark tables: tg128 +82%, pp512 +44% vs llama.cpp
- Add GDN architecture section documenting fused kernels and design
- Update CUDA version references from 13.1 to 13.2
- Add Qwen3.5 to supported architectures list
- Add test_gdn_kernel.cu to test file table
- Version bump v0.3 → v0.4

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
Release 0.2 — Decode Performance Optimization

Major decode throughput improvements across all model architectures on RTX 5090.

Key optimizations:
- NVFP4 prmt register LUT (no shared-memory dequant)
- Vectorized RMSNorm (float4 loads, 512 threads)
- Fused FP8 KV cache write (halves kernel launches per layer)
- Split SwiGLU for large K (separate activation + prmt GEMV)
- Occupancy-aware NVFP4 GEMV dispatch (multi-row vs kpar)
- NVFP4 down_proj for post-FFN-norm models (Gemma-3)
- NVFP4 LM head in async CUDA graph decode loop
- Multi-row occupancy 6 → 8 blocks/SM
- rmsnorm_quantize_q8_1 threads 256 → 1024
- Vectorized rmsnorm_fp32_accum (float4, 512 threads)

Benchmark results (RTX 5090, Q8_0, decode tg128):

| Model | imp (tok/s) | llama.cpp (tok/s) | Δ |
|---|---|---|---|
| Qwen3-4B | 393 | 244 | +61% |
| Qwen3-8B | 262 | 157 | +67% |
| Gemma-3-12B | 146 | 98 | +49% |
| Phi-4-mini | 264 | 277 | -5% |
| Qwen3-Coder-30B MoE | 293 | 251 | +17% |
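The "occupancy-aware NVFP4 GEMV dispatch (multi-row vs kpar)" bullet describes choosing a kernel shape from how well the problem fills the GPU. A hypothetical host-side heuristic in that spirit — the threshold, names, and enum are illustrative assumptions, not imp's actual dispatch logic:

```cpp
#include <cstddef>

// Hypothetical occupancy-aware GEMV dispatch sketch: with one block per
// output row, a small row count leaves SMs idle, so switch to a
// k-parallel (split-K) kernel that spreads each row's dot product over
// several blocks. Illustrative only.
enum class GemvKernel { MultiRow, KPar };

GemvKernel choose_gemv_kernel(size_t n_rows, size_t n_sm,
                              size_t blocks_per_sm) {
    size_t capacity = n_sm * blocks_per_sm;  // blocks the GPU runs at once
    // Too few rows to saturate the device: split K instead.
    return (n_rows < capacity) ? GemvKernel::KPar : GemvKernel::MultiRow;
}
```

During single-token decode the GEMV row count is fixed by the layer's output dimension, so small projections (e.g. narrow down_proj slices) land on the split-K path while large ones keep the simpler one-block-per-row kernel.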