#matrix #gemm #avx2 #simd #algebra #matrix-multiply #transpose #fma

cool_matrix

Extremely optimized linear algebra - BLIS-style GEMM with AVX2/FMA SIMD

1 unstable release

0.1.0 Mar 30, 2026

#2303 in Math

MIT license

47KB
852 lines

matrixa — Extremely Fast Linear Algebra in Pure Rust

A from-scratch matrix multiplication engine that beats OpenBLAS on modern Intel hardware (tested: 33% faster on i9-14900KF). No C, no Fortran, no external BLAS — just Rust with hand-tuned AVX2+FMA SIMD intrinsics.

What's Inside

  • Matrix Multiply (A * B): BLIS-style blocked GEMM, ~850 GFLOPS on f64
  • Transpose Multiply (Aᵀ*B, A*Bᵀ): the transpose is folded into the packing step, so no transposed copy of the matrix is ever materialized
  • Linear Solve (Ax = b): LU decomposition with partial pivoting
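
The solver's algorithm — LU decomposition with partial pivoting — can be sketched in plain Rust. This is an illustration of the technique, not the crate's internal code, and the function name and flat row-major layout are assumptions for the sketch:

```rust
// Illustrative sketch: LU factorization with partial pivoting,
// then forward/back substitution to solve Ax = b.
// `a` is an n×n row-major matrix stored as a flat Vec<f64>.
// (Not matrixa's internal code; layout and names are assumed.)
fn lu_solve(mut a: Vec<f64>, mut b: Vec<f64>, n: usize) -> Vec<f64> {
    for k in 0..n {
        // Partial pivoting: pick the row with the largest |a[i][k]|
        // to keep the elimination numerically stable.
        let mut piv = k;
        for i in k + 1..n {
            if a[i * n + k].abs() > a[piv * n + k].abs() {
                piv = i;
            }
        }
        if piv != k {
            for j in 0..n {
                a.swap(k * n + j, piv * n + j);
            }
            b.swap(k, piv);
        }
        // Eliminate below the pivot; store the multiplier in place
        // (the strictly lower triangle becomes L, the rest becomes U).
        for i in k + 1..n {
            let m = a[i * n + k] / a[k * n + k];
            a[i * n + k] = m;
            for j in k + 1..n {
                a[i * n + j] -= m * a[k * n + j];
            }
        }
    }
    // Forward substitution with L (unit diagonal).
    for i in 1..n {
        for j in 0..i {
            b[i] -= a[i * n + j] * b[j];
        }
    }
    // Back substitution with U.
    for i in (0..n).rev() {
        for j in i + 1..n {
            b[i] -= a[i * n + j] * b[j];
        }
        b[i] /= a[i * n + i];
    }
    b
}
```

For example, solving [[2, 1], [1, 3]] · x = [3, 5] yields x = [0.8, 1.4].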

Why It's Fast

  1. Cache blocking — matrices are split into tiles sized for L1/L2/L3
  2. Panel packing — tiles are copied into contiguous memory for sequential access
  3. AVX2+FMA micro-kernel — the 4×8 inner loop does 8 fused multiply-adds per iteration using 256-bit SIMD registers
  4. Rayon parallelism — work-stealing across all cores, naturally handling Intel's asymmetric P-core/E-core architecture
  5. Zero allocation in hot path — thread-local buffers reused across calls
  6. Compile-time dispatch — AVX2 check at compile time, not runtime
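
The blocking and packing ideas (points 1–2) can be sketched in portable, scalar Rust. The tile sizes below are illustrative placeholders, not the crate's tuned values, and the real kernel would use the SIMD micro-kernel in the innermost loops rather than scalar code:

```rust
// Illustrative cache-blocked GEMM (row-major f64, C += A*B).
// BLIS-style kernels choose MC/KC/NC so each tile fits a cache level
// and pack panels into contiguous buffers; this sketch shows the loop
// structure plus a simple A-tile pack. Caller zero-initializes `c`.
const MC: usize = 64; // rows of A per block (illustrative size)
const KC: usize = 64; // shared dimension per block
const NC: usize = 64; // columns of B per block

fn gemm_blocked(a: &[f64], b: &[f64], c: &mut [f64], m: usize, k: usize, n: usize) {
    let mut a_pack = vec![0.0; MC * KC]; // packing buffer, reused per tile
    for jc in (0..n).step_by(NC) {
        let nc = NC.min(n - jc);
        for pc in (0..k).step_by(KC) {
            let kc = KC.min(k - pc);
            for ic in (0..m).step_by(MC) {
                let mc = MC.min(m - ic);
                // Pack the mc×kc tile of A into contiguous memory so the
                // inner loops stream through it sequentially.
                for i in 0..mc {
                    for p in 0..kc {
                        a_pack[i * kc + p] = a[(ic + i) * k + (pc + p)];
                    }
                }
                // Multiply the packed A tile into the C tile.
                for i in 0..mc {
                    for p in 0..kc {
                        let aval = a_pack[i * kc + p];
                        for j in 0..nc {
                            c[(ic + i) * n + (jc + j)] += aval * b[(pc + p) * n + (jc + j)];
                        }
                    }
                }
            }
        }
    }
}
```

Because every tile fits in cache and the packed panel is read linearly, the inner loops stay compute-bound instead of stalling on memory.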

Quick Example

use matrixa::Matrix;

let a = Matrix::from_vec(2, 3, vec![1.0, 2.0, 3.0,
                                    4.0, 5.0, 6.0]);
let b = Matrix::from_vec(3, 2, vec![7.0, 8.0,
                                    9.0, 10.0,
                                    11.0, 12.0]);
let c = a.multiply(&b); // 2×2 result
assert_eq!(c.get(0, 0), 58.0); // 1*7 + 2*9 + 3*11
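
The AVX2+FMA idea from point 3 above can be sketched with `std::arch` intrinsics. This is a simplified dot-product accumulator, not the crate's 4×8 micro-kernel (which keeps a whole tile of C in registers), and it uses runtime feature detection for portability where the crate claims compile-time dispatch:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Simplified FMA inner loop: accumulate a[i] * b[i] four f64 lanes at
// a time with _mm256_fmadd_pd on 256-bit registers. (Illustrative
// sketch only; matrixa's real kernel is a 4×8 tiled micro-kernel.)
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_fma(a: &[f64], b: &[f64]) -> f64 {
    let n = a.len() / 4 * 4;
    let mut acc = _mm256_setzero_pd();
    let mut i = 0;
    while i < n {
        let va = _mm256_loadu_pd(a.as_ptr().add(i));
        let vb = _mm256_loadu_pd(b.as_ptr().add(i));
        acc = _mm256_fmadd_pd(va, vb, acc); // acc += va * vb, fused
        i += 4;
    }
    // Horizontal sum of the four accumulator lanes.
    let mut lanes = [0.0f64; 4];
    _mm256_storeu_pd(lanes.as_mut_ptr(), acc);
    let mut sum: f64 = lanes.iter().sum();
    // Scalar tail for lengths not divisible by 4.
    for j in n..a.len() {
        sum += a[j] * b[j];
    }
    sum
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        return unsafe { dot_fma(a, b) };
    }
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Each `_mm256_fmadd_pd` performs four multiplies and four adds in one instruction; the 4×8 kernel issues eight of them per iteration of the inner loop.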

Dependencies

~1.5MB
~25K SLoC