Why is my SIMD code slower than the scalar version?

I wrote the following to learn more about SIMD. It tries to find a substring:

https://paste.mod.gg/jswpvcpgoxgo/0

I ran a benchmark on my machine, which supports AVX-512, and compared it against the scalar path, which I forced by setting DOTNET_EnableHWIntrinsic=0.
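For reference, the two runs can be reproduced roughly as below. This is only a sketch: it assumes a Unix-like shell and that the benchmark project is started with `dotnet run -c Release` from its own directory, neither of which is stated in the post.

```shell
# Vectorized run: hardware intrinsics enabled (the default).
dotnet run -c Release

# Scalar run: disable hardware intrinsics for this one invocation only.
# The variable is inherited by any child processes the benchmark spawns.
DOTNET_EnableHWIntrinsic=0 dotnet run -c Release
```

Setting the variable inline keeps it scoped to the single command, so the next run goes back to the vectorized path without any cleanup.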

In my benchmark I search 2 paragraphs of Lorem Ipsum (1,156 characters) for a string of a few words (47 characters).

The Vector512 benchmark takes approximately 2.8 µs and the scalar benchmark takes 4.1 µs, which seems like a fairly large difference and suggests that I've done something wrong.

Is there any further profiling I can do to work out what went wrong?