How about GCC and AMD/Intel? Lots of compiler benchmarks put it on roughly equal footing with icc for compiled code. Have you seen any big differences across architectures with GCC?
gcc and the Microsoft compiler are not affected. They produce very good code for both architectures. Even the old Intel compilers produced good code for either architecture (though they were always a bit faster on Intel, but not in a strange way). However, the "newer" Intel compilers show these problems. I mean versions 5 to 7 were okay, 8-9 slowly got strange, and 10 was really very bad for AMD. And the Microsoft compiler and gcc continuously closed the performance gap (in the old days you could gain 20% just by compiling with icc, but that is no longer the case, and icc is slower when run on AMD).
Also, for your code above, how do you know that Intel wasn't taking a different code path for AMD CPUs (something they have become notorious for doing) rather than just using instructions that hurt AMD more than they hurt Intel?
Theoretically the Intel compiler could do this, but it did not, because of the compiler settings I used (which does not mean the compiler doesn't do it elsewhere). With those settings it actually thinks it is compiling for an Intel CPU (forced by the architecture settings), and there was no additional dispatch code. I know this because I inspect the disassembly of the performance-critical parts (I do this anyway because I do e.g. SSE optimizations and need to investigate them thoroughly, since performance is critical for this application). After the bad results I tried different compiler settings; the slowdown for AMD CPUs varies with them, but it does not disappear.
However, the newer versions of Intel's profiler VTune do contain such a CPU check and will simply refuse to do a profiling run on a computer with an AMD CPU (older versions worked fine). I think that was the reason why AMD released its free profiler "AMD CodeAnalyst" afterwards. My company, as a customer of Intel VTune, was very displeased by that, since many of our development machines ran on AMD CPUs and we therefore had to switch profilers (we switched to Rational Quantify)!
This business with different code paths is more a matter for certain applications than for the compiler itself. E.g. an application could use vanilla code and use SSE only if an Intel CPU has been detected.
But a compiler can't do that if I put the SSE instructions in the source code myself (which is what I did, just as an example).
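For illustration, here is a hedged sketch of what "SSE instructions in the source code" means (the function add4 is my own example, not from the application discussed): with intrinsics spelled out like this, the compiler must emit exactly these SSE instructions on every CPU and has no room to substitute a per-vendor fallback.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Adds four pairs of floats with a single SSE instruction.
   Because the intrinsics appear in the source, the compiler emits
   these SSE instructions regardless of the CPU vendor. */
void add4(const float *a, const float *b, float *out)
{
    __m128 va = _mm_loadu_ps(a);             /* unaligned load of 4 floats */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));  /* 4 additions at once */
}
```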
Yes, I was wondering this as well. How do the two compare with GCC? How fair is GCC between Intel and AMD?
I am not familiar with the latest gcc 4.x versions, but I cannot imagine that they are not fair. You could maybe use some compiler options which work better for one vendor or the other, but that is something you can freely choose, and all of those things have very minor effects. Again, there is nothing inherent in the compiler, and I would declare gcc and the Microsoft compiler 100% fair.
I mean, it is really nothing special to be "fair". It takes a lot of work to find things that hurt your competitor without hurting yourself. Obviously only Intel had enough funds to do so. And it does not give Intel users any advantage!
Pretty much the same approach Intel took when they launched Banias/Dothan and its derivatives like Conroe/Penryn and Nehalem/Lynnfield. AMD's approach is similar to what Intel did with their NetBurst architecture, except that the Stars architecture is much better than NetBurst.
That is a bit oversimplified. They did a lot more than that: they changed almost every aspect of the design compared to K7-K10.5, so it's not a Stars core design anymore. The high-speed design has a similarity, but it works differently and much better than NetBurst, because e.g. Bulldozer does not have such extremely long pipelines. And the core split might be compared with Hyper-Threading, but again AMD's approach of splitting cores is far better (an 80-100% boost for the module technology versus -5 to 30% for Hyper-Threading).
So Bulldozer is a complete core overhaul + frontend/backend overhaul + cache-system overhaul + predictor/prefetcher overhaul. So far those are more conventional changes (in the direction of what Intel did with Conroe), but larger than those of the K8-to-K10.5 architecture switch; only the K7-to-K8 switch was nearly that large.
The two big changes on top of that are the high frequency design and the "module technology".
But if you simplify and want to condense it into a single statement, then you could call it "NetBurst-like", though this is somewhat misleading regarding the results and the techniques in detail. It is much more like what IBM did with POWER7, and the real credit for the innovation should maybe go to IBM.
Intel could do the same for their future CPUs. However, regarding the module technology, this would be extremely difficult for Intel because of, let's say, the way Intel does x86. AMD could do it because their design is optimized for throughput: they had so many parallel units that they could just split them and still have enough (they even added units, btw). On the other hand, it would be difficult to add Hyper-Threading to any AMD CPU architecture. That is why we never saw a Hyper-Threading CPU from AMD, even though it is quite an old technology by now.
Therefore this "module tech" vs. Hyper-Threading advantage could last for quite a long while, and that advantage is really huge. It will be interesting to see if and when Intel manages to provide a similar technology in their upcoming CPUs. Maybe Intel might counter with a Super-HT approach where they could fully use their latency advantage, and "Reverse-HT" would be included. But that is just personal speculation of mine.