Just a quick answer. I would have to get the info from Intel, but Intel's FMA3 has 3 operands yet can pull from 4 registers. If Intel went FMA4 and used 4 operands, then it could pull from 5 registers.
I find this more to the point
Personally I think a strided load would be a waste in the long term. Sooner or later true scatter/gather will be added (*) and the strided load becomes another superseded legacy instruction that you have to drag with you till the end of days.
If that's not a concern, fine, but please consider adding the gather instruction as soon as possible. An early implementation could work just like Larrabee's: using multiple wide loads till all the elements have been 'gathered'. It would definitely be faster than using individual insertps instructions, with a best-case latency equal to that of a movups (for sequential indexes, or indexes all in the same vector).
And it would be useful for a lot more than just matrix transposition. It opens the door to things that aren't even conceivable today. Truly, any loop with independent iterations could be (automatically) parallelized once we have scatter/gather instructions, no matter how the data is organized, or even in the presence of pointer chasing. So it's not just for HPC or multimedia (although those would benefit massively as well). If you think that's radical, please realise that the rules for writing high-performance software already changed dramatically when we went multi-core. So you might as well finish what you started and add scatter/gather support, or the CPU will keep losing ground to the GPU. You're nearing the point where people just buy the cheapest CPU available and instead invest in a more powerful GPU to do the 'real work'. The competition (both AMD and NVIDIA) is in a rather sweet spot to take the biggest piece of the pie in this scenario. So you'd better give people good reasons to keep buying the latest CPUs, by adding instructions that support algorithms which would otherwise run better outside the CPU. The only reason I care is because I believe it's better for the end user.
Until single-uop execution units are available, that is. Intel is working on scatter/gather with AVX. That's the more forward-looking approach.
Absolutely. It's really about adoption and compatibility:
Scenario 1: FMA instructions are added later when single uop execution units are available.
Let's say this happens in four years. At that point developers will be eager to use FMA, but they have to be careful to still support older processors. So they have the choice of writing two code paths, or just not using FMA till it's ubiquitous. Maintaining multiple code paths is a software engineer's daily nightmare (it's not just FMA, it's other ISA extensions and many other system parameters as well). So it's not uncommon to only start supporting new instructions years later. In fact I believe it has only recently become relatively safe to assume SSE2 support as a minimum (i.e. putting that on the box won't cost us a significant number of clients). That's a full 7 years after its introduction! So in this scenario FMA would suffer pretty slow adoption up to the year 2019...
Scenario 2: FMA instructions are added sooner and executed in two uops.
Developers can and will experiment with these instructions sooner. Compilers and other tools will support them years sooner too. Code size, extra precision, and the potential of seeing faster implementations in future processors (without requiring a code rewrite) are enough incentive for the early adopters. By the time single-uop FMA processors become available, they'll see a nice boost in performance. That's good for Intel too, since real-world applications can be used as benchmarks, which is a lot more convincing for consumers than numbers on paper and a much later return on investment. And just as importantly, those 2-uop FMA processors will still run applications that have one code path and demand FMA as a minimum. They won't run it faster than an application with two code paths (one using separate mul and add), but at least they'll run it. There's nothing more frustrating than not being able to run an application because the hardware doesn't support it (and guess who gets the blame).
So I think scenario 2 is a win for everybody (hardware guys, software guys and consumers). And I strongly believe it applies to much more than FMA. Of course you can't just blindly start adding instructions, but if you already decided you're going to invest transistors into a feature at some point, it really doesn't hurt to have a functional 'interface' much sooner. In fact, if it turns out that developers are not so interested in the feature after all, you have the option of postponing the full-fledged implementation a couple years till they're more interested, investing those transistors elsewhere in the meantime.
Lastly, in case anyone's worried about the marketing aspects: it's simply a case of not marketing to consumers until the faster execution units are added. Core 2's vastly increased SSE performance has been a grand success even though SSE has been around for a decade. It's easy to market when the numbers speak for themselves.
