Unnecessary NaN checks: performance issue or missing compile option?
I'm not sure whether this is a performance issue or a feature request, so I figured I'd ask here first.
The issue is a performance regression caused by an unnecessary NaN check in (e.g.) max and min operations.
+298 and +306 load data0 and data1.
+314 computes the maximum of zmm0 and zmm2 and stores the result in zmm1.
+320 sets mask register k1 for the lanes where zmm0 (data0) contains NaN values.
+327 overwrites the result (zmm1) with the value of data1 (zmm2) in the lanes where zmm0 was NaN.
+333 writes the result back to memory.
If data0 could contain NaN values, the above assembly would be correct. But when data0 is known not to contain such values, the code is slower than it needs to be, because a NaN check is performed for every float min/max operation. This is something I would like to control in HPC/AI workloads.
Q: Is this a regression bug, or something else (for which I need to make a feature request)?
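To make the difference concrete, here is a minimal scalar sketch in C of what the two variants compute per lane. It is an illustration only, not the code that produced the assembly: the function names (max_hw, max_nan_checked, max_unchecked) are hypothetical, and the operand order passed to the hardware-style max is an assumption chosen so that the behaviour matches the steps described above (a NaN in data0 survives the max and is then replaced by data1).

```c
#include <math.h>

/* Models one lane of an x86 max instruction: when either input is NaN
 * the comparison is false, so the second operand is returned. */
static float max_hw(float first, float second)
{
    return first > second ? first : second;
}

/* The sequence described above: max, NaN test on data0, masked overwrite.
 * Operand order is an assumption made for this sketch. */
static float max_nan_checked(float data0, float data1)
{
    float result = max_hw(data1, data0); /* the max step                  */
    int data0_is_nan = isnan(data0);     /* the mask step (k1)            */
    if (data0_is_nan) {
        result = data1;                  /* the masked overwrite step     */
    }
    return result;
}

/* What I would like to get when data0 is known to be NaN-free:
 * only the max itself, with no compare and no masked move. */
static float max_unchecked(float data0, float data1)
{
    return max_hw(data1, data0);
}
```

For NaN-free inputs both functions return the same value, so the check and the masked move are pure overhead there. As a point of comparison, GCC and Clang let C code opt out of this kind of check with -ffinite-math-only (implied by -ffast-math); I am looking for an equivalent control here.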