I was just wondering if anyone could point out the flaws in this logic.
Depending on the actual workload, the design can incur an insignificant performance hit, or it can be much slower: integer-heavy code barely touches the FPU, while FP-dense code running on both cores of a module contends for it.
This decision in Bulldozer is not so much a "here's a great idea!" thing as a "hmmm... we need a balancing act here, and maybe this is a worthy trade-off" thing. So with that out of the way: the best/ideal design is still separate everything for every core, except for the components that need to be shared because that is their function (for example, shared L2/L3 cache).
In a perfect/ideal/mystical world, the CPU would not only have a separate FPU for each core, but would also have a hybrid branch predictor: two-level adaptive with both global and local history, plus a loop counter, subroutine-return prediction, and a BTB holding a hundred thousand entries. It would also have as much private L1 as possible, then 10x that in shared L2 for the entire chip, with no slower L3 at all.
We could continue this exercise in ideal fantasy and enumerate more parts of the CPU to "perfect", but the point is already obvious: we don't have them because they are impossible within the bounds we have to play in. Our chips need to stay within particular size, power, and thermal envelopes to remain economically feasible. The perfect, turbo-charged branch predictor would take up half the die area of a current chip. High-speed cache memory is extremely expensive. And we have a limited transistor budget for the thermals and die size we are shooting for, which are of course constrained by the process node we have.
I am saying this merely as a point of context: when AMD decided to share the FPU, it was not because of some insight nobody had thought of before, with AMD being the geniuses who finally figured it out. Rather, it's a design for efficiency, a way to balance all the constraints and hit their projected power/performance/thermal/size targets.
So the logic behind a shared FPU is more of
"we can save die-space / transistors by fusing this together, and the trade-off is acceptable because we think X and Y" (where X and Y are justifications for why the trade-off is acceptable), and less of
"discrete FPU units aren't really needed for each core in the real world".
Are the aforementioned X and Y really valid justifications? I don't know. AMD thinks so, and they even think so for server loads. I assume they know more than I do, since they have solid data from their partners/customers, so at the end of the day it may indeed turn out that yes, sharing the FPU is okay, because real-world data says a separate 256-bit FPU per core is hardly necessary today, and we can get away for now with a shared 256-bit FPU that can execute two 128-bit operations at the same time. But from the designers' point of view it is always just a balancing act, because ideally we would still want discrete components for everything, to better handle all situations.
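To make that last point concrete, here is a minimal sketch in C using x86 intrinsics (the function names are mine, purely illustrative). Each Bulldozer module pairs two integer cores with one shared FPU built from two 128-bit pipes: a 128-bit op like the one in add128 needs only one pipe, so both cores can issue them side by side, while a 256-bit AVX op like the one in add256 occupies both pipes at once.

```c
#include <immintrin.h>

/* 128-bit SSE add: uses one of the module's two 128-bit FPU pipes,
   leaving the other pipe free for the sibling core. */
__m128 add128(__m128 a, __m128 b) {
    return _mm_add_ps(a, b);
}

/* 256-bit AVX add: split across both 128-bit pipes, so it occupies
   the module's entire shared FPU for the duration of the operation. */
__m256 add256(__m256 a, __m256 b) {
    return _mm256_add_ps(a, b);
}
```

So a module running two threads of 128-bit (or integer-heavy) code barely notices the sharing, while two threads saturating it with 256-bit ops will contend for the single FPU.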