Why is my SIMD code slower than the scalar version? - C#