Great explanations and the step by step speed improvements are amazing to see!
However, in the end a comparison to a real-world alternative is interesting. No one would seriously do matmul in pure Python, so I compared the performance to `numpy`, which is a much better "baseline" for comparison.
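For anyone who wants to reproduce the comparison, a numpy baseline can be measured roughly like this (a sketch; the matrix size, dtype, and iteration count are my assumptions, not the original benchmark's settings):

```python
import time
import numpy as np

# Assumed benchmark parameters: 512x512 float32 matrices.
n = 512
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

# Warm-up run so one-time setup costs don't skew the timing.
a @ b

iters = 10
start = time.perf_counter()
for _ in range(iters):
    c = a @ b
elapsed = time.perf_counter() - start

# Multiplying two n x n matrices takes 2*n^3 floating-point operations.
gflops = 2 * n**3 * iters / elapsed / 1e9
print(f"numpy: {gflops:.2f} GFLOP/s")
```

Note that numpy dispatches `@` to an optimised BLAS library (OpenBLAS or MKL, depending on the install), so this really measures the underlying BLAS, not Python itself.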
Results on my machine:
- Naive matrix multiplication: 0.854 GFLOP/s
- Vectorized matrix multiplication without `vectorize`: 5.71 GFLOP/s
- Vectorized matrix multiplication with `vectorize`: 5.81 GFLOP/s
Result: a gigantic speedup over naive, pure Python, but still almost 4x SLOWER than `numpy`.
I'm wondering whether `numpy` is so heavily optimised for this operation that there is little way to keep up with or improve upon it. Does anyone have ideas for further optimisations to get Mojo closer to numpy? Is this something that only a framework like MAX or super-low-level bit manipulation can achieve?