What Are the Architectural Constraints in Haswell That Limit CPE Optimization?
I want to understand why any scalar version of the inner product procedure cannot achieve a CPE (cycles per element) of less than 1.00 on an Intel Core i7 4790 Haswell processor, running Ubuntu 20.04 (Linux kernel 5.4) with the GCC 9.3.0 compiler.

I want to optimize the inner product procedure using 6x1a loop unrolling on this processor. For integer data, my unrolled version gives a CPE of 1.07. For floating-point data, it still remains at 3.01.

I understand that pipelining and vectorization offer opportunities for parallelism, but is there a fundamental limitation in scalar code that prevents the CPE from dropping below 1.00, even with loop unrolling? Are there architectural constraints in the Haswell processor that make achieving a CPE of less than 1.00 impossible? What would be the best approach to optimize further?
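
For concreteness, the kind of loop I am asking about can be sketched roughly as below. This is a minimal illustration, not the exact code that was measured: the function name `inner6x1a`, the `data_t` typedef, and the use of plain arrays are all assumptions made for the example. It unrolls by six, keeps a single accumulator, and groups the products so they are added together before being combined with `sum`.

```c
#include <stddef.h>

/* Assumption: double for the floating-point case; use long for integers. */
typedef double data_t;

/* Hypothetical 6x1a inner product: 6-way unrolling, one accumulator,
   with the six products summed among themselves before being added
   to the accumulator. */
data_t inner6x1a(const data_t *u, const data_t *v, size_t n)
{
    data_t sum = 0;
    size_t i = 0;

    /* Main loop: six elements per iteration. */
    for (; i + 5 < n; i += 6) {
        sum = sum + ((u[i]     * v[i]     + u[i + 1] * v[i + 1])
                   + (u[i + 2] * v[i + 2] + u[i + 3] * v[i + 3])
                   + (u[i + 4] * v[i + 4] + u[i + 5] * v[i + 5]));
    }

    /* Finish any remaining elements one at a time. */
    for (; i < n; i++)
        sum = sum + u[i] * v[i];

    return sum;
}
```

Note that each element processed here needs two loads (one from `u`, one from `v`) and one multiply, whatever the unrolling factor.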