DevHeads IoT Integration Server•13mo ago

What Are the Architectural Constraints in Haswell That Limit CPE Optimization?

I want to understand why any scalar version of the inner product procedure cannot achieve a CPE less than 1.00 on an Intel Core i7-4790 (Haswell) processor running Ubuntu 20.04 (Linux kernel 5.4) with the GCC 9.3.0 compiler.
I want to optimize the inner product procedure using 6x1a loop unrolling on the Intel Core i7 Haswell processor.
For integer data, my unrolled version achieves a CPE (cycles per element) of 1.07.
For floating-point data, the CPE still remains at 3.01.
I understand that pipelining and vectorization offer opportunities for parallelism, but is there a fundamental limitation in scalar code that prevents CPE from dropping below 1.00, even with loop unrolling?
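
For reference, here is a minimal sketch of what I mean by 6x1a unrolling: six elements per iteration, one accumulator, with the six products grouped in parentheses so that only one addition per iteration depends on the previous sum. The function name `inner_6x1a`, the `data_t` typedef, and the plain-array interface are assumptions made for illustration, not necessarily the exact code that was measured.

```c
#include <stddef.h>

/* Assumed element type; switch to long to try the integer case. */
typedef double data_t;

/* Inner product with 6x1a unrolling (hypothetical name and interface):
   6 elements per iteration, 1 accumulator, reassociated additions. */
data_t inner_6x1a(const data_t *u, const data_t *v, size_t n)
{
    data_t sum = 0;
    size_t i = 0;

    /* Main loop: the parenthesized group does not depend on sum, so its
       multiplies and adds are not chained through the accumulator. */
    for (; i + 5 < n; i += 6) {
        sum = sum + (u[i]     * v[i]     + u[i + 1] * v[i + 1] +
                     u[i + 2] * v[i + 2] + u[i + 3] * v[i + 3] +
                     u[i + 4] * v[i + 4] + u[i + 5] * v[i + 5]);
    }

    /* Cleanup loop for the remaining 0 to 5 elements. */
    for (; i < n; i++) {
        sum = sum + u[i] * v[i];
    }

    return sum;
}
```

Calling `inner_6x1a(u, v, n)` on two arrays of length `n` returns their dot product. Note that each element still needs two loads (one from `u`, one from `v`), whatever the unrolling factor.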

Are there architectural constraints in the Haswell processor that make achieving a CPE of less than 1.00 impossible? What would be the best approach to optimize further?
