How can I optimize matrix multiplication performance and reduce L3 cache misses in my C++ library? - DevHeads IoT Integration Server