In the previous builds I posted, I only disabled AVX2 (AVX was still enabled). I have now made an additional
build (57.5 MB) with only the SSE2, SSE3 and SSE4.1 kernels enabled (CCX_HAS_AVX & CCX_HAS_AVX2 = False).
On Piledriver there was no difference whatsoever, and on Haswell the difference falls within the margin of error (< 1.5%).
If you want to compare the builds against each other, compare the "ICL" and "ICLWOAVX2" builds from the previous package with this one.
AVX and AVX2 should definitely help, even in a pure FP workload. In this package you can find a simple Monte Carlo raytracer (based on a SmallPT port). The code and build options are exactly the same, except for the allowed instruction set (Arch SSE4.2 / AVX / AVX2). With SSE4.2 as the baseline, AVX improves performance by 6.1% and AVX2 by 15.8% (tested on Haswell-EP).
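The percentages above compare runs of the same fixed workload. One common way to turn two timings into a "performance gain" is to treat throughput as the inverse of run time, so the gain is baseline_time / candidate_time - 1. The post does not say which formula was used, so this is only a sketch under that assumption; the helper names (speedup_pct, within_noise) and the timings are hypothetical, not taken from the benchmarks.

```python
# Hypothetical helpers: names and timings are illustrative, not from the original benchmarks.


def speedup_pct(baseline_seconds: float, candidate_seconds: float) -> float:
    """Percentage throughput gain of the candidate build over the baseline build.

    Assumes both builds process the same fixed amount of work, so throughput
    is proportional to 1 / time and the gain is baseline / candidate - 1.
    """
    if baseline_seconds <= 0 or candidate_seconds <= 0:
        raise ValueError("timings must be positive")
    return (baseline_seconds / candidate_seconds - 1.0) * 100.0


def within_noise(baseline_seconds: float, candidate_seconds: float, noise_pct: float) -> bool:
    """True if the absolute gain or loss is no larger than the run-to-run noise."""
    return abs(speedup_pct(baseline_seconds, candidate_seconds)) <= noise_pct


if __name__ == "__main__":
    # Hypothetical render times (seconds) for one fixed scene.
    sse42, avx2 = 50.0, 40.0
    print(f"AVX2 vs SSE4.2: {speedup_pct(sse42, avx2):+.1f}%")  # +25.0%
    # A 1% slower run counts as noise when the margin of error is 1.5%.
    print(within_noise(100.0, 101.0, 1.5))  # True
```

Measuring throughput rather than raw time difference keeps the numbers comparable when one build is much faster: halving the run time reads as a 100% gain, not a 50% one.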