What Are the Architectural Constraints in Haswell That Limit CPE Optimization?

I want to understand why no scalar version of the inner product procedure can achieve a CPE (cycles per element) below 1.00 on an Intel Core i7-4790 (Haswell) processor. My setup is Ubuntu 20.04 with Linux kernel 5.4 and GCC 9.3.0.
I am trying to optimize the inner product procedure with 6x1a loop unrolling, that is, unrolling by a factor of six with a single accumulator and reassociated additions.
For integer data, my unrolled version reaches a CPE of 1.07.
For floating-point data, the CPE stays at 3.01.
I understand that pipelining and vectorization offer opportunities for parallelism, but is there a fundamental limitation in scalar code that prevents CPE from dropping below 1.00, even with loop unrolling?
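
For reference, 6x1a unrolling of the inner product can be sketched roughly as follows. This is only an illustration, not necessarily the exact code measured above: the function name `inner6x1a` and the plain-array signature are assumptions made for the example, and it uses `double` for the floating-point case.

```c
#include <stddef.h>

/* Hypothetical signature: plain arrays rather than any particular
 * vector abstraction. Computes sum of u[i] * v[i] for i in [0, n). */
double inner6x1a(const double *u, const double *v, size_t n)
{
    double sum = 0.0;
    size_t i = 0;

    /* Main loop: six elements per iteration, one accumulator.
     * The parentheses group the six products so that only one
     * addition per iteration depends on the previous value of sum. */
    for (; i + 5 < n; i += 6) {
        sum = sum + (u[i]     * v[i]     + u[i + 1] * v[i + 1] +
                     u[i + 2] * v[i + 2] + u[i + 3] * v[i + 3] +
                     u[i + 4] * v[i + 4] + u[i + 5] * v[i + 5]);
    }

    /* Cleanup loop for the remaining 0 to 5 elements. */
    for (; i < n; i++) {
        sum += u[i] * v[i];
    }
    return sum;
}
```

Since GCC does not reassociate floating-point additions unless flags such as `-ffast-math` are given, the grouping written in the source should be what determines the dependency chain through `sum`.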

Are there architectural constraints in the Haswell processor that make a CPE below 1.00 impossible for scalar code? If so, what would be the best approach to optimize further?