1 unstable release: 0.1.0 (Mar 30, 2026)
# matrixa — Extremely Fast Linear Algebra in Pure Rust
A from-scratch matrix multiplication engine that beats OpenBLAS on modern Intel hardware (tested: 33% faster on i9-14900KF). No C, no Fortran, no external BLAS — just Rust with hand-tuned AVX2+FMA SIMD intrinsics.
## What's Inside
- Matrix Multiply (`A * B`): BLIS-style blocked GEMM, ~850 GFLOPS on f64
- Transpose Multiply (`Aᵀ*B`, `A*Bᵀ`): Zero-copy transpose via packing tricks
- Linear Solve (`Ax = b`): LU decomposition with partial pivoting
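The linear-solve path can be illustrated with a plain-Rust sketch of LU decomposition with partial pivoting. The function name, row-major slice layout, and signature here are illustrative assumptions, not matrixa's API or internal solver:

```rust
// Textbook LU decomposition with partial pivoting, solving Ax = b.
// `a` is an n x n row-major matrix. Illustrative only.
fn lu_solve(n: usize, a: &[f64], b: &[f64]) -> Vec<f64> {
    let mut lu = a.to_vec();
    let mut x = b.to_vec();
    for k in 0..n {
        // Partial pivoting: bring the largest |entry| in column k to row k.
        let p = (k..n)
            .max_by(|&i, &j| lu[i * n + k].abs().total_cmp(&lu[j * n + k].abs()))
            .unwrap();
        if p != k {
            for c in 0..n {
                lu.swap(k * n + c, p * n + c);
            }
            x.swap(k, p);
        }
        // Eliminate below the pivot; apply the same updates to b.
        let xk = x[k];
        for i in k + 1..n {
            let m = lu[i * n + k] / lu[k * n + k];
            for c in k + 1..n {
                let u = lu[k * n + c];
                lu[i * n + c] -= m * u;
            }
            x[i] -= m * xk;
        }
    }
    // Back substitution through the upper-triangular factor.
    for i in (0..n).rev() {
        for c in i + 1..n {
            let t = x[c];
            x[i] -= lu[i * n + c] * t;
        }
        x[i] /= lu[i * n + i];
    }
    x
}

fn main() {
    // 2x + y = 5 and x + 3y = 10 have the solution x = 1, y = 3.
    let a = [2.0, 1.0, 1.0, 3.0];
    let b = [5.0, 10.0];
    println!("{:?}", lu_solve(2, &a, &b)); // [1.0, 3.0]
}
```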
## Why It's Fast
- Cache blocking — matrices are split into tiles sized for L1/L2/L3
- Panel packing — tiles are copied into contiguous memory for sequential access
- AVX2+FMA micro-kernel — the 4×8 inner loop does 8 fused multiply-adds per iteration using 256-bit SIMD registers
- Rayon parallelism — work-stealing across all cores, naturally handling Intel's asymmetric P-core/E-core architecture
- Zero allocation in hot path — thread-local buffers reused across calls
- Compile-time dispatch — AVX2 check at compile time, not runtime
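The cache-blocking idea behind the first two bullets can be sketched in plain Rust: a three-level loop tiling that keeps one tile of A and B hot in cache while it is reused. The block sizes and the scalar inner loop below are illustrative stand-ins, not matrixa's tuned values or its packed SIMD micro-kernel:

```rust
// Cache-blocked GEMM sketch: c[m x n] += a[m x k] * b[k x n], row-major.
// MC/KC/NC are illustrative tile sizes, not matrixa's tuned constants.
const MC: usize = 64;
const KC: usize = 64;
const NC: usize = 64;

fn gemm_blocked(m: usize, n: usize, k: usize, a: &[f64], b: &[f64], c: &mut [f64]) {
    for jc in (0..n).step_by(NC) {
        for pc in (0..k).step_by(KC) {
            for ic in (0..m).step_by(MC) {
                // One tile: in a real BLIS-style kernel, panels of A and B
                // would be packed into contiguous buffers here and handed
                // to the 4x8 AVX2+FMA micro-kernel.
                for i in ic..(ic + MC).min(m) {
                    for p in pc..(pc + KC).min(k) {
                        let aip = a[i * k + p];
                        for j in jc..(jc + NC).min(n) {
                            c[i * n + j] += aip * b[p * n + j];
                        }
                    }
                }
            }
        }
    }
}

fn main() {
    // The same 2x3 * 3x2 product as the Quick Example.
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    let b = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0];
    let mut c = [0.0f64; 4];
    gemm_blocked(2, 2, 3, &a, &b, &mut c);
    println!("{:?}", c); // [58.0, 64.0, 139.0, 154.0]
}
```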
## Quick Example
```rust
use matrixa::Matrix;

let a = Matrix::from_vec(2, 3, vec![1.0, 2.0, 3.0,
                                    4.0, 5.0, 6.0]);
let b = Matrix::from_vec(3, 2, vec![7.0, 8.0,
                                    9.0, 10.0,
                                    11.0, 12.0]);
let c = a.multiply(&b); // 2×2 result
assert_eq!(c.get(0, 0), 58.0); // 1*7 + 2*9 + 3*11
```
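The 4×8 micro-kernel described under "Why It's Fast" is out of scope here, but the core operation it leans on, a 256-bit fused multiply-add, can be shown in isolation. This toy uses a runtime feature check for safety rather than the crate's compile-time dispatch; the function name and shape are illustrative, not matrixa's kernel:

```rust
// Toy AVX2+FMA illustration: one _mm256_fmadd_pd computes four f64
// lanes of a*b + c in a single instruction. Not matrixa's 4x8 kernel.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn fma4(a: [f64; 4], b: [f64; 4], c: [f64; 4]) -> [f64; 4] {
    use std::arch::x86_64::{_mm256_fmadd_pd, _mm256_loadu_pd, _mm256_storeu_pd};
    let va = _mm256_loadu_pd(a.as_ptr());
    let vb = _mm256_loadu_pd(b.as_ptr());
    let vc = _mm256_loadu_pd(c.as_ptr());
    let mut out = [0.0f64; 4];
    _mm256_storeu_pd(out.as_mut_ptr(), _mm256_fmadd_pd(va, vb, vc));
    out
}

#[cfg(target_arch = "x86_64")]
fn main() {
    // Runtime check keeps the binary safe on CPUs without AVX2/FMA.
    if std::arch::is_x86_feature_detected!("avx2")
        && std::arch::is_x86_feature_detected!("fma")
    {
        let r = unsafe { fma4([1.0; 4], [2.0; 4], [3.0; 4]) };
        println!("{:?}", r); // each lane is 1.0 * 2.0 + 3.0
    } else {
        println!("AVX2/FMA not available on this CPU");
    }
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {
    println!("x86_64 only");
}
```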