#matrix #gemm #avx2 #simd #algebra #matrix-multiply #transpose #fma

cool_matrix

Extremely optimized linear algebra - BLIS-style GEMM with AVX2/FMA SIMD

1 unstable release

0.1.0 Mar 30, 2026

#2303 in Math

MIT license

47KB
852 lines

matrixa — Extremely Fast Linear Algebra in Pure Rust

A from-scratch matrix multiplication engine that beats OpenBLAS on modern Intel hardware (tested: 33% faster on i9-14900KF). No C, no Fortran, no external BLAS — just Rust with hand-tuned AVX2+FMA SIMD intrinsics.

What's Inside

  • Matrix Multiply (A * B): BLIS-style blocked GEMM, ~850 GFLOPS on f64
  • Transpose Multiply (Aᵀ*B, A*Bᵀ): the transpose is folded into the packing step, so no transposed copy of the matrix is ever materialized
  • Linear Solve (Ax = b): LU decomposition with partial pivoting
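
The solver's algorithm — LU decomposition with partial pivoting — can be sketched in plain Rust. This is an illustration of the technique, not the crate's internal code, and the function name and flat row-major layout are assumptions for the sketch:

```rust
// Illustrative sketch: LU factorization with partial pivoting,
// then forward/back substitution to solve Ax = b.
// `a` is an n×n row-major matrix stored as a flat Vec<f64>.
// (Not matrixa's internal code; layout and names are assumed.)
fn lu_solve(mut a: Vec<f64>, mut b: Vec<f64>, n: usize) -> Vec<f64> {
    for k in 0..n {
        // Partial pivoting: pick the row with the largest |a[i][k]|
        // to keep the elimination numerically stable.
        let mut piv = k;
        for i in k + 1..n {
            if a[i * n + k].abs() > a[piv * n + k].abs() {
                piv = i;
            }
        }
        if piv != k {
            for j in 0..n {
                a.swap(k * n + j, piv * n + j);
            }
            b.swap(k, piv);
        }
        // Eliminate below the pivot; store the multiplier in place
        // (the strictly lower triangle becomes L, the rest becomes U).
        for i in k + 1..n {
            let m = a[i * n + k] / a[k * n + k];
            a[i * n + k] = m;
            for j in k + 1..n {
                a[i * n + j] -= m * a[k * n + j];
            }
        }
    }
    // Forward substitution with L (unit diagonal).
    for i in 1..n {
        for j in 0..i {
            b[i] -= a[i * n + j] * b[j];
        }
    }
    // Back substitution with U.
    for i in (0..n).rev() {
        for j in i + 1..n {
            b[i] -= a[i * n + j] * b[j];
        }
        b[i] /= a[i * n + i];
    }
    b
}
```

For example, solving [[2, 1], [1, 3]] · x = [3, 5] yields x = [0.8, 1.4].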

Why It's Fast

  1. Cache blocking — matrices are split into tiles sized for L1/L2/L3
  2. Panel packing — tiles are copied into contiguous memory for sequential access
  3. AVX2+FMA micro-kernel — the 4×8 inner loop does 8 fused multiply-adds per iteration using 256-bit SIMD registers
  4. Rayon parallelism — work-stealing across all cores, naturally handling Intel's asymmetric P-core/E-core architecture
  5. Zero allocation in hot path — thread-local buffers reused across calls
  6. Compile-time dispatch — AVX2 check at compile time, not runtime
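
The blocking and packing ideas (points 1–2) can be sketched in portable, scalar Rust. The tile sizes below are illustrative placeholders, not the crate's tuned values, and the real kernel would use the SIMD micro-kernel in the innermost loops rather than scalar code:

```rust
// Illustrative cache-blocked GEMM (row-major f64, C += A*B).
// BLIS-style kernels choose MC/KC/NC so each tile fits a cache level
// and pack panels into contiguous buffers; this sketch shows the loop
// structure plus a simple A-tile pack. Caller zero-initializes `c`.
const MC: usize = 64; // rows of A per block (illustrative size)
const KC: usize = 64; // shared dimension per block
const NC: usize = 64; // columns of B per block

fn gemm_blocked(a: &[f64], b: &[f64], c: &mut [f64], m: usize, k: usize, n: usize) {
    let mut a_pack = vec![0.0; MC * KC]; // packing buffer, reused per tile
    for jc in (0..n).step_by(NC) {
        let nc = NC.min(n - jc);
        for pc in (0..k).step_by(KC) {
            let kc = KC.min(k - pc);
            for ic in (0..m).step_by(MC) {
                let mc = MC.min(m - ic);
                // Pack the mc×kc tile of A into contiguous memory so the
                // inner loops stream through it sequentially.
                for i in 0..mc {
                    for p in 0..kc {
                        a_pack[i * kc + p] = a[(ic + i) * k + (pc + p)];
                    }
                }
                // Multiply the packed A tile into the C tile.
                for i in 0..mc {
                    for p in 0..kc {
                        let aval = a_pack[i * kc + p];
                        for j in 0..nc {
                            c[(ic + i) * n + (jc + j)] += aval * b[(pc + p) * n + (jc + j)];
                        }
                    }
                }
            }
        }
    }
}
```

Because every tile fits in cache and the packed panel is read linearly, the inner loops stay compute-bound instead of stalling on memory.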

Quick Example

use matrixa::Matrix;

let a = Matrix::from_vec(2, 3, vec![1.0, 2.0, 3.0,
                                    4.0, 5.0, 6.0]);
let b = Matrix::from_vec(3, 2, vec![7.0, 8.0,
                                    9.0, 10.0,
                                    11.0, 12.0]);
let c = a.multiply(&b); // 2×2 result
assert_eq!(c.get(0, 0), 58.0); // 1*7 + 2*9 + 3*11
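
The AVX2+FMA idea from point 3 above can be sketched with `std::arch` intrinsics. This is a simplified dot-product accumulator, not the crate's 4×8 micro-kernel (which keeps a whole tile of C in registers), and it uses runtime feature detection for portability where the crate claims compile-time dispatch:

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::*;

// Simplified FMA inner loop: accumulate a[i] * b[i] four f64 lanes at
// a time with _mm256_fmadd_pd on 256-bit registers. (Illustrative
// sketch only; matrixa's real kernel is a 4×8 tiled micro-kernel.)
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2,fma")]
unsafe fn dot_fma(a: &[f64], b: &[f64]) -> f64 {
    let n = a.len() / 4 * 4;
    let mut acc = _mm256_setzero_pd();
    let mut i = 0;
    while i < n {
        let va = _mm256_loadu_pd(a.as_ptr().add(i));
        let vb = _mm256_loadu_pd(b.as_ptr().add(i));
        acc = _mm256_fmadd_pd(va, vb, acc); // acc += va * vb, fused
        i += 4;
    }
    // Horizontal sum of the four accumulator lanes.
    let mut lanes = [0.0f64; 4];
    _mm256_storeu_pd(lanes.as_mut_ptr(), acc);
    let mut sum: f64 = lanes.iter().sum();
    // Scalar tail for lengths not divisible by 4.
    for j in n..a.len() {
        sum += a[j] * b[j];
    }
    sum
}

fn dot(a: &[f64], b: &[f64]) -> f64 {
    #[cfg(target_arch = "x86_64")]
    if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
        return unsafe { dot_fma(a, b) };
    }
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```

Each `_mm256_fmadd_pd` performs four multiplies and four adds in one instruction; the 4×8 kernel issues eight of them per iteration of the inner loop.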

Dependencies

~1.5MB
~25K SLoC