Thanks for the kind words, everyone. About the code: I do plan to open-source it within a few months, perhaps after we get one or two more articles out.
It is not exactly rocket science, though. With the description in the article and a little experimenting, you can figure it out pretty easily. It is a standard technique, so you should be able to find articles, and maybe existing code, about it on the web.
Comparisons with x86: Well, I wanted to avoid detailed comparisons in this article.
Generally, cross-ISA comparisons of instruction throughput are a minefield. Ideally, such comparisons are done at the application level rather than at the level of synthetic instruction throughput.
Anyway, since you asked, here is some data from memory:
Core 2 Duo: 4 DP flops/cycle, 8 SP flops/cycle
Nehalem: 4 DP flops/cycle, 8 SP flops/cycle
K10: 4 DP flops/cycle, 8 SP flops/cycle
Sandy/Ivy: 8 DP flops/cycle, 16 SP flops/cycle
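
One way to sanity-check figures like these is to treat peak flops/cycle as SIMD lanes times the number of floating-point pipes that can issue each cycle. The sketch below is only an illustration: the vector widths (128-bit SSE, 256-bit AVX) and the assumption of one add pipe plus one multiply pipe per core are mine, not part of the numbers quoted from memory above, and the function and table names are hypothetical.

```python
# Rough model: peak flops/cycle = SIMD lanes * FP pipes issuing per cycle.
# The widths and pipe count are assumptions for illustration only.

def peak_flops_per_cycle(vector_bits, element_bits, fp_pipes=2):
    """Lanes per vector register times the number of FP pipes (add + mul)."""
    lanes = vector_bits // element_bits
    return lanes * fp_pipes

# Assumed SIMD register width per core family (128-bit SSE, 256-bit AVX).
VECTOR_BITS = {
    "Core 2 Duo": 128,
    "Nehalem": 128,
    "K10": 128,
    "Sandy/Ivy": 256,
}

def peak_table():
    """Return {core: (DP flops/cycle, SP flops/cycle)} under the model above."""
    return {
        name: (peak_flops_per_cycle(bits, 64), peak_flops_per_cycle(bits, 32))
        for name, bits in VECTOR_BITS.items()
    }

if __name__ == "__main__":
    for name, (dp, sp) in peak_table().items():
        print(f"{name}: {dp} DP flops/cycle, {sp} SP flops/cycle")
```

Under those assumptions the model gives the same per-cycle figures as the list above. It says nothing about sustained throughput in real applications.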