My first approach when optimizing a single loop is to apply vectorize. Now I wonder whether, in some cases, it makes sense to transform the single loop into a nested loop, vectorizing the inner loop and parallelizing the outer one.
Instead of vectorizing
for i in range(12): ...
using
for k in range(4):
    for j in range(3):
        var i = 3*k + j
        ...
and then vectorize over j and parallelize over k.
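To make the index mapping concrete, here is a minimal sketch in plain Python (not Mojo, and without any actual vectorization or parallelism). It only checks that the nested form visits the same indices, in the same order, as the single loop. The function names `flat_indices` and `nested_indices` are made up for illustration.

```python
def flat_indices(n):
    """Indices visited by the original single loop."""
    return list(range(n))


def nested_indices(outer, inner):
    """Indices visited by the nested form.

    k is the outer index (the one I would parallelize over),
    j is the inner index (the one I would vectorize over),
    and the flat index is i = inner * k + j.
    """
    visited = []
    for k in range(outer):
        for j in range(inner):
            visited.append(inner * k + j)
    return visited


if __name__ == "__main__":
    # Compare the 4 x 3 nested form against the flat loop over 12 elements.
    print(nested_indices(4, 3) == flat_indices(12))
```

This assumes the loop length is exactly outer * inner. If it is not, the last outer iteration would need a shorter inner loop or a separate remainder loop.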
If it makes sense, how do I find a good balance between vectorize and parallelize? In my concrete example, I have a loop of around 120 million .... (updating parameters in llm.mojo)
What I also wonder in this regard is whether the compiler detects these optimizations anyway, in which case it would be better to keep the code simple and let the compiler do this type of standard optimization.