The Logic of a Shared FPU


Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Thanks podspi, I would also have mentioned these tools to measure IPC. They even work for binaries without much (or any) debug info in them. Some overall performance-counter measurements are always possible; just working out the meaning of the disassembled code might require more specific knowledge than most readers have.
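As a minimal sketch of what such a measurement boils down to: on Linux, `perf stat -e instructions,cycles ./some_binary` reports raw counter values even for stripped binaries, and IPC is just their ratio. The counter values below are illustrative, not measured:

```python
# Minimal sketch: derive IPC from raw hardware counter values, as reported
# by e.g. `perf stat -e instructions,cycles ./some_binary` on Linux.
# The numbers used here are made up for illustration.

def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: the basic throughput metric."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return instructions / cycles

# Hypothetical run: 12 billion instructions retired in 8 billion cycles.
print(ipc(12_000_000_000, 8_000_000_000))  # 1.5
```

No disassembly knowledge is needed for this level of analysis; interpreting *why* the IPC is low is where the architecture-specific knowledge comes in.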

Further on the topic:
Intel already has a shared FPU - shared between 2 threads via SMT, like the rest of the execution units, the scheduler, the front end and some other units. And indeed there are Intel patents (filed 2-4 years ago IIRC; I found them 2 years ago) showing cores containing execution clusters (at least ALUs, I'm not sure about FP units).
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,678
2,561
136
This is a question I cannot get a straight answer to. In terms of running one thread per chip, BD really should be 10-30% faster. They are both 4-issue integer designs.
But SNB can do 3 integer (non-memory) instructions per clock, while BD can only do 2.
But BD can do 5 in certain circumstances.
So can Intel.
Intel is 128-bit in all but AVX, which basically means they are 128-bit for every piece of code being run right now. BD is 256-bit.
BD has much higher latencies, especially for fdiv.
BD has most of the optimizations that made Core x branch prediction better.
But I have heard nothing about the strided prefetchers, which are the other thing that gave Core 2 its boost.
If BD can match the ridiculous memory bandwidth of an i7-2600, then it seems to me it really should outperform it by 20% in gaming. But nobody else is making that same analysis, and they are not even saying why. They just say it will have lower IPC, with no specific reason given. Again, I am talking about running one thread per chip here and giving it exclusive access to all shared resources.

Intel's caches should give them much better latencies/BW for most cached items. They also have much more OoOE resources.

At equal clocks, I'm not expecting that much from BD. The big question is, when not limited by thermals, how high can that one BD thread clock?
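The clock-vs-IPC tradeoff here is simple arithmetic: single-thread performance scales as IPC times frequency. A quick sketch, with the IPC deficit purely assumed for illustration:

```python
# Hypothetical question: if BD's single-thread IPC were 15% lower than
# SNB's, how much higher would it have to clock to break even?
# performance is proportional to IPC * frequency.

snb_ipc = 1.00   # normalized baseline
bd_ipc = 0.85    # assumed 15% IPC deficit (illustrative only)

required_clock_ratio = snb_ipc / bd_ipc
print(round(required_clock_ratio, 3))  # 1.176 -> roughly 18% higher clock
```

That is why the achievable single-thread turbo clock, not the equal-clock comparison, decides the outcome.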
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
BD has much higher latencies, especially for fdiv.
This is mostly for FP ops and the more complex integer ops (imul, idiv). Most common integer ops have 1-cycle latency (reg, reg).

But I have heard nothing about the strided prefetchers, which are the other thing that gave core 2 it's boost.
The optimization manual mentions them: "The L1 hardware prefetcher in AMD Family 15h processors is a stride prefetcher that is triggered by L1 cache misses and received training data from the L2 prefetcher."

Then there are region prefetchers (good for server type workloads)...

Intel's caches should give them much better latencies/BW for most cached items. They also have much more OoOE resources.
Two threads on a BD module can fetch 32B+32B per cycle from L1 at 4-cycle latency. Writes then go through the WCC and L2, at a much lower rate. L2 reads might reach 32B per cycle at 18-20 cycles latency, but I don't remember the bandwidth right now.

OTOH BDv1 has some bandwidth issues with streaming writes, according to the manual.
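To put those per-cycle figures in perspective, converting bytes/cycle into GB/s is straightforward; the clock speed below is an assumption for illustration, not a Bulldozer spec:

```python
# Convert a per-cycle fetch width into aggregate bandwidth.
# Example: 32B + 32B per cycle per module, at an assumed 4.0 GHz clock.

def bandwidth_gb_s(bytes_per_cycle: int, clock_ghz: float) -> float:
    """Peak bandwidth in GB/s (using 1 GB = 1e9 bytes)."""
    # bytes/cycle * 1e9 cycles/s, then divided by 1e9 bytes/GB
    return bytes_per_cycle * clock_ghz

print(bandwidth_gb_s(64, 4.0))  # 256.0 GB/s peak L1 read per module
```

Peak numbers like this are upper bounds; sustained rates depend on the write path (WCC/L2) limits mentioned above.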

At equal clocks, I'm not expecting that much from BD. The big question is, when not limited by thermals, how high can that one BD thread clock?
It's likely that BD won't run at equal clocks on the same workloads as SB. It will be interesting to watch whether BD at least reaches similar overall performance. But due to the very different architectures, I expect some dramatic positive and negative outliers.