The Logic of a Shared FPU

sm625

Diamond Member
May 6, 2011
8,172
137
106
If you take two CPUs both running at 100% (according to Task Manager), that doesn't mean they are running at the same IPC, or even close. One could be processing four times as many operations as the other (four times the GFLOPS). I don't have any actual data, but it is safe to assume that during a typical gaming session, even if your CPU reads 100% in Task Manager, it might only be running at 5-50% of its maximum potential FPU throughput. I seriously doubt that during gaming your FPU throughput ever reaches more than 50%. If there were a way to actually measure this, I would like to know it! So if you double the size of your FPU and then share it, I don't see how it could have any negative performance impact on gaming. If two cores are both loading the FPU at 5-50%, then combined that makes 10-100%, which at no point bottlenecks the FPU.

If it is found that typical usage scenarios never load the FPU more than 25% on average, then the FPU could be shared amongst 4 integer cores with virtually no loss in performance.

I was just wondering if anyone could point out the flaws in this logic.
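
A toy sketch of that combined-load argument (my own simplification, purely for illustration: each core independently wants the FPU with probability p in any given cycle, and the shared FPU can start at most one op per cycle):

Code:
/* Toy Monte Carlo model of two cores sharing one FPU.
 * Assumption (illustrative, not measured): each core independently issues
 * an FP op in a given cycle with probability p, and the shared FPU can
 * start at most one op per cycle. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const double p = 0.25;        /* per-core chance of wanting the FPU this cycle */
    const long cycles = 10000000;
    long demanded = 0, served = 0;

    srand(12345);
    for (long c = 0; c < cycles; c++) {
        int want = ((double)rand() / RAND_MAX < p) +
                   ((double)rand() / RAND_MAX < p);   /* 0, 1 or 2 cores want it */
        demanded += want;
        served   += want > 1 ? 1 : want;              /* only one op starts per cycle */
    }
    printf("FP ops demanded: %ld, started immediately: %ld (%.1f%% had to wait)\n",
           demanded, served, 100.0 * (demanded - served) / demanded);
    return 0;
}

Even with a combined average load of only 50%, some cycles still see both cores wanting the unit at once, so the real question is how much those occasional waits cost, not whether the average ever hits 100%.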
 

hans007

Lifer
Feb 1, 2000
20,212
18
81
Well, Bulldozer does share one FPU per two integer cores, so AMD might agree with you.
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
Academically, yes. In practice, we'll see soon enough. For gaming, I'd be more worried about the shared front end hurting performance than the shared FPU. As far as I understand, gaming is mostly integer now that we have hardware offload.
 

deimos3428

Senior member
Mar 6, 2009
697
0
0
I was just wondering if anyone could point out the flaws in this logic.
The only "flaw" I see is the assumption that you're using the CPU for gaming. What happens when you're doing something else? CPUs aren't designed specifically for gaming, so unless you're making a "gaming accelerator" they'll have to work well in all potential scenarios.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
It's just shared fetch and decode, right? The cache is already shared, so it kind of makes sense. The schedulers are still separate.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
It really seems like an enhanced version of the old x86 + x87 coprocessor split is occurring, with both AMD and NVIDIA wanting their GPUs to be used for FP processing. Very interesting when you also consider that the "cloud" stuff is an improved version of mainframe + terminals.

Edit: As to your original summation, the GPU compute movement is pretty much all about the resource sharing you have mentioned.
 
Last edited:

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
If you take two CPUs both running at 100% (according to Task Manager), that doesn't mean they are running at the same IPC, or even close. One could be processing four times as many operations as the other (four times the GFLOPS). I don't have any actual data, but it is safe to assume that during a typical gaming session, even if your CPU reads 100% in Task Manager, it might only be running at 5-50% of its maximum potential FPU throughput. I seriously doubt that during gaming your FPU throughput ever reaches more than 50%. If there were a way to actually measure this, I would like to know it! So if you double the size of your FPU and then share it, I don't see how it could have any negative performance impact on gaming. If two cores are both loading the FPU at 5-50%, then combined that makes 10-100%, which at no point bottlenecks the FPU.

If it is found that typical usage scenarios never load the FPU more than 25% on average, then the FPU could be shared amongst 4 integer cores with virtually no loss in performance.

I was just wondering if anyone could point out the flaws in this logic.

Of course this is true: pipeline stalls, cache misses, instruction latency itself... there is a whole bevy of reasons why the actual IPC of a core falls short of its theoretical "peak capable" IPC.

First, you are dealing with the fact that the ISA itself supports >700 instructions, each of which has its own latency and throughput that depend on the implementation of the ISA within the microarchitecture.

[Image: x86ISAovertime.jpg - growth of the x86 instruction set over time]


So you have to decide upfront which instruction you are referring to when you speak of "IPC" (instructions per clock or cycle).

Measuring Instruction Latency and Throughput

You can use Everest to assess the instruction latency for your processor.
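
If you want to ballpark this yourself rather than rely on a canned tool, here is a minimal sketch of the usual dependent-chain approach (assumes x86-64 and GCC or Clang at a fixed core clock; RDTSC counts TSC ticks, which only approximate core cycles when turbo or power saving is active):

Code:
/* Rough estimate of mulsd latency: time a long chain of dependent multiplies
 * with RDTSC and divide by the iteration count.  The loop's own integer work
 * overlaps with the FP latency, so it mostly hides under the measurement. */
#include <stdio.h>
#include <stdint.h>
#include <x86intrin.h>              /* __rdtsc() */

#define ITERS 100000000ULL

int main(void)
{
    double x = 1.0, m = 1.0000001;  /* multiplier kept in a register */
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++) {
        /* each multiply depends on the previous result, so the chain is
         * bounded by instruction latency, not throughput */
        __asm__ volatile("mulsd %1, %0" : "+x"(x) : "x"(m));
    }
    uint64_t end = __rdtsc();
    printf("approx cycles per dependent mulsd: %.2f (result %g)\n",
           (double)(end - start) / ITERS, x);
    return 0;
}

Using several independent accumulators instead of one chain would measure throughput rather than latency, which is why the two numbers differ per instruction.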
 
Last edited:

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
I was just wondering if anyone could point out the flaws in this logic.
Depending on the actual load, the design can have an insignificant performance hit, or it can be much slower.

This decision in Bulldozer is not so much a "here's a great idea!" thing, but more of a "hmmm... we need a balancing act here, and maybe this is a worthy trade-off". So with that out of the way: the best/ideal way is still to give every core its own copy of everything, except for the components that need to be shared because that is their function (for example, the shared L2/L3 cache).

In a perfect/ideal/mystical world, the CPU would not only have a separate FPU for each core, but would also have a hybrid, two-level adaptive branch predictor with global history and local history, a loop counter, subroutine return prediction, and a BTB capacity of a hundred thousand lines. It would also have as much private L1 as possible, then 10x more shared L2 for the entire chip, with no slower L3 needed at all.

We can continue this exercise of ideal fantasy and enumerate more parts of the CPU that should be "perfected", but the point is already obvious: we don't have them because it is impossible within the bounds we have to play with - our chips need to stay within a particular size, power, and thermal envelope to remain economically feasible. The perfect, turbo-charged branch predictor is impossible, or it would take up half the die area of a current chip. High-speed cache memory is extremely expensive. And we have a limited transistor budget for the thermals and die size we are shooting for, which of course are constrained by the process we have.

I am saying this merely as a point of context - when AMD decided to share the FPU, it was not because it was an insight nobody had thought of before and they are geniuses for finally figuring it out. Rather, it's a design for efficiency, a way to balance all the constraints and hit their projected power/performance/thermal/size targets.

So the logic behind a shared FPU is more of "we can save die-space / transistors by fusing this together, and the trade-off is acceptable because we think X and Y" (where X and Y are justifications for why the trade-off is acceptable), and less of "discrete FPU units aren't really needed for each core in the real world".

Are the aforementioned X and Y really valid justifications? I don't know. AMD thinks so, and they think so for server loads too. I assume they know more than me, since they have solid data from their partners/customers, so at the end of the day it may indeed turn out that yes, sharing the FPU is OK, because real-world data says separate 256-bit FPUs are hardly necessary today and we can get away for now with a shared 256-bit FPU that can do two 128-bit operations at the same time. But from the designers' point of view, it is always just a balancing act, because ideally we still want discrete components everywhere to better handle all situations.
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
Many people say they expect Intel to follow AMD with a module-type design. I personally don't see this happening unless GF or another contract fab (who?!) catches up with Intel.

The fact of the matter is, Intel has a solid manufacturing lead. To compensate for that, AMD has to be very parsimonious in their transistor use. I believe that Bulldozer's current module configuration is an effort to do so.
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,677
2,560
136
Many people say they expect Intel to follow AMD with a module-type design. I personally don't see this happening unless GF or another contract fab (who?!) catches up with Intel.

Honestly, with HT they are half-way there. They can just keep making the core fatter, even though they are way past the point of diminishing returns for a single thread. This still gives (small) benefits for single-threaded software, but more importantly, it increases the advantage that HT gives.

Aren't we all expecting a single SNB thread to be in the same ballpark as a BD core?
 

TakeNoPrisoners

Platinum Member
Jun 3, 2011
2,599
1
81
Well, even if it is slower, I expect AMD won't be dumb enough to price it above Intel. They realize their product is slower.
 

Edgy

Senior member
Sep 21, 2000
366
20
81
Didn't Intel almost always have a "manufacturing lead" (which I assume to mean smaller nanometer manufacturing tech) over AMD throughout their CPU history?

I think this was true even during AMD's heyday (Athlon/64 vs Pentium4s) more or less...

I'm sure it matters but probably not to the extent that many of us presume nor in areas where we expect.
 

formulav8

Diamond Member
Sep 18, 2000
7,004
523
126
AMD has to be very parsimonious in their transistor use. I believe that Bulldozer's current module configuration is an effort to do so.

This. BD was quite strongly built around a fab-constrained, money-constrained, transistor-constrained AMD back when they still owned the fabs. It's still very important anyway.
 

sm625

Diamond Member
May 6, 2011
8,172
137
106
Aren't we all expecting a single SNB thread to be in the same ballpark as a BD core?

This is a question I cannot get a straight answer to. In terms of running one thread per chip, BD really should be 10-30% faster. They are both 4-issue integer. But BD can do 5 in certain circumstances. Intel is 128-bit in all but AVX, which basically means they are 128-bit for every piece of code being run right now. BD is 256-bit. BD has most of the optimizations that made Core x's branch prediction better. If BD can match the ridiculous memory bandwidth of an i7-2600, then it seems to me it really should outperform it by 20% in gaming. But nobody else is making that same analysis, and they are not even saying why. They just say it will have lower IPC, with no specific reason given. Again, I am talking about running one thread per chip here and giving it exclusive access to all shared resources.
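
For what the 128-bit vs 256-bit distinction actually means in code, here is a minimal sketch (function names are mine) of the same elementwise multiply written with 128-bit SSE2 and with 256-bit AVX. Existing binaries use the first form no matter how wide the FPU is; only code compiled for AVX (e.g. with -mavx) uses the second.

Code:
/* Same loop, two vector widths.  Assumes n is a multiple of 4 for brevity. */
#include <immintrin.h>

/* 128-bit SSE2: two doubles per instruction */
void mul_sse(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        _mm_storeu_pd(a + i, _mm_mul_pd(va, vb));
    }
}

/* 256-bit AVX: four doubles per instruction (needs -mavx) */
void mul_avx(double *a, const double *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        _mm256_storeu_pd(a + i, _mm256_mul_pd(va, vb));
    }
}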
 

podspi

Golden Member
Jan 11, 2011
1,982
102
106
Didn't Intel almost always have a "manufacturing lead" (which I assume to mean smaller nanometer manufacturing tech) over AMD throughout their CPU history?

I think this was true even during AMD's heyday (Athlon/64 vs Pentium4s) more or less...

I'm sure it matters but probably not to the extent that many of us presume nor in areas where we expect.


I don't think it was quite as strong a lead (or even a lead at all) as it is today, but I could be wrong. AMD was first to 1 GHz, and IIRC it was because they had sprinted ahead on some process innovation (risky at the time, but it paid off).
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I don't think it was quite as strong a lead (or even a lead at all) as it is today, but I could be wrong. AMD was first to 1 GHz, and IIRC it was because they had sprinted ahead on some process innovation (risky at the time, but it paid off).

AMD dove into copper a full node before Intel. AMD at 180nm, Intel waited until 130nm.

Intel, the world's largest marketing machine, elected to name their 180nm P3s "Coppermine", though, to at least engender whatever mindshare they could at the time by getting people thinking the 180nm Intel chips had copper like AMD's did.

Then AMD dove into SOI at 130nm, really setting the stage for their low-power initiatives which reached a crescendo at 90nm.

IMO Intel did not get "serious" about their process tech until the 90nm fiasco with Prescott.

As a process development engineer, I saw the clear Heaviside step-function change in Intel's demeanor and approach to process technology at that juncture in the timeline.
 

wuliheron

Diamond Member
Feb 8, 2011
3,536
0
0
The idea that the success of Intel or AMD merely hinges on having better ideas and hard work is a joke in bad taste. These are billion dollar corporations who spend countless millions just thinking up new ways to use and abuse the law to get what they want. When AMD used risky new fabrication processes to get a leg up, Intel countered by using every dirty trick in the book to drive them so close to bankruptcy they had to sell off their fabs.

AMD has been forced to focus on architecture as the only way left to compete, and as jvroig pointed out, it is a balancing act. Insects might have six legs and spiders eight, but larger animals only have four, and I'm sure there are very good reasons why. More and bigger and stronger etc. is not always better, even if it confers certain advantages. Exactly what advantages and disadvantages Bulldozer brings to the table, we'll just have to wait and see.
 

Accord99

Platinum Member
Jul 2, 2001
2,259
172
106
This is a question I cannot get a straight answer to. In terms of running one thread per chip, BD really should be 10-30% faster. They are both 4-issue integer.
If by chip you mean an SB core vs. a BD module, a difference is that SB can dedicate all of its execution resources to a single thread; a BD module cannot.

Intel is 128-bit in all but AVX, which basically means they are 128-bit for every piece of code being run right now. BD is 256-bit.
BD only has 128-bit FMACs.
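
For reference, an FMAC is a fused multiply-add unit: it computes a*b + c as one operation with a single rounding. C99's fma() is the portable way to express that, and it maps to the hardware instruction when one exists; a tiny sketch:

Code:
/* d = a*b + c in one fused step, single rounding (link with -lm). */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double a = 1.5, b = 2.0, c = 0.25;
    printf("fma(a, b, c) = %g\n", fma(a, b, c));   /* prints 3.25 */
    return 0;
}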
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
I seriously doubt that during gaming your FPU throughput ever reaches more than 50%. If there was a way to actually measure this I would like to know it!

There are things called "performance counters" that track event rates, and I would assume there are counters for events like "retired floating point instructions". It may be possible to use a debugger or a profiler to inject some monitoring code, but I wouldn't expect it to be very easy for an end user to do. Google should help if you want to investigate them ("performance counter", "floating point", "retired instructions").
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
As CTho9305 said, you could figure this out using performance monitor registers, although I'm not sure you can easily - key word being "easily" - figure out the utilization of a given unit. You can do it with a profiler if you have the source code. Without the source code you could still do it, but you'd need to start and stop the application and the performance monitor counters synchronously, which would be a trick, and you'd need to code up some sort of program to set up, clear, and read the registers. I've done similar things myself for various projects that I've worked on - and it works pretty well - but I've had access to tools (like an Intel In-Target Probe) that are harder for the general public to get hold of, and even with something like an ITP, I will admit that this sort of thing is trickier without access to the source code... if you have the source and can do inline assembly, it's easy once you figure out how to configure the monitors.
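
A rough sketch of that "code up some sort of program" route on Linux, using the perf_event_open syscall (my example, nothing ITP- or Everest-specific). The generic retired-instructions event shown here is portable; counting retired floating-point instructions specifically needs a CPU-specific raw event code (PERF_TYPE_RAW), which you would look up in the vendor's event documentation.

Code:
/* Count retired instructions around a workload with perf_event_open (Linux). */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <linux/perf_event.h>

static long perf_open(struct perf_event_attr *attr, pid_t pid,
                      int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;   /* all retired instructions */
    attr.disabled = 1;
    attr.exclude_kernel = 1;
    attr.exclude_hv = 1;

    int fd = perf_open(&attr, 0, -1, -1, 0);    /* this process, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the code you want to measure goes here ... */
    volatile double x = 1.0;
    for (int i = 0; i < 1000000; i++) x *= 1.0000001;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    if (read(fd, &count, sizeof(count)) == sizeof(count))
        printf("retired instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}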