Cool thread. Actually, Haswell might start looking even better: http://forums.anandtech.com/showpost.php?p=35123392&postcount=8
I managed to get a ~22% clock-for-clock speedup over the Ivy Bridge generation using AVX2 (including both the FMA and gather instructions).