The Logic of a Shared FPU


Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Thanks podspi, I would also have mentioned these tools to measure IPC. They even work for binaries without much (or any) debug info in them. Some overall performance-counter measurements are always possible; just working out the meaning of the disassembled code might require more specific knowledge than most readers have.
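As a minimal sketch of what such a measurement boils down to: on Linux, `perf stat -e instructions,cycles ./some_binary` reports raw counter values even for stripped binaries, and IPC is just their ratio. The counter values below are illustrative, not measured:

```python
# Minimal sketch: derive IPC from raw hardware counter values, as reported
# by e.g. `perf stat -e instructions,cycles ./some_binary` on Linux.
# The numbers used here are made up for illustration.

def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle: the basic throughput metric."""
    if cycles <= 0:
        raise ValueError("cycle count must be positive")
    return instructions / cycles

# Hypothetical run: 12 billion instructions retired in 8 billion cycles.
print(ipc(12_000_000_000, 8_000_000_000))  # 1.5
```

No disassembly knowledge is needed for this level of analysis; interpreting *why* the IPC is low is where the architecture-specific knowledge comes in.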

Further on the topic:
Intel already has a shared FPU - shared between 2 threads via SMT, like the rest of the execution units, the scheduler, the front end and some other units. And indeed there are Intel patents (filed 2-4 years ago IIRC; I found them 2 years ago) showing cores containing execution clusters (at least ALUs, I'm not sure about FP units).
 

Tuna-Fish

Golden Member
Mar 4, 2011
1,678
2,561
136
This is a question I cannot get a straight answer to. In terms of running one thread per chip, BD really should be 10-30% faster. They are both 4-issue integer designs.
But SNB can do 3 integer (non-memory) instructions per clock, while BD can only do 2.
But BD can do 5 in certain circumstances.
So can Intel.
Intel is 128-bit in all but AVX, which basically means they are 128-bit for every piece of code being run right now. BD is 256-bit.
BD has much higher latencies, especially for fdiv.
BD has most of the optimizations that made Core x branch prediction better.
But I have heard nothing about the strided prefetchers, which are the other thing that gave Core 2 its boost.
If BD can match the ridiculous memory bandwidth of an i7-2600, then it seems to me it really should outperform it by 20% in gaming. But nobody else is making that same analysis, and they are not even saying why. They just say it will have lower IPC, with no specific reason given. Again, I am talking about running one thread per chip here and giving it exclusive access to all shared resources.

Intel's caches should give them much better latencies/BW for most cached items. They also have much more OoOE resources.

At equal clocks, I'm not expecting that much from BD. The big question is, when not limited by thermals, how high can that one BD thread clock?
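The clock-vs-IPC tradeoff here is simple arithmetic: single-thread performance scales as IPC times frequency. A quick sketch, with the IPC deficit purely assumed for illustration:

```python
# Hypothetical question: if BD's single-thread IPC were 15% lower than
# SNB's, how much higher would it have to clock to break even?
# performance is proportional to IPC * frequency.

snb_ipc = 1.00   # normalized baseline
bd_ipc = 0.85    # assumed 15% IPC deficit (illustrative only)

required_clock_ratio = snb_ipc / bd_ipc
print(round(required_clock_ratio, 3))  # 1.176 -> roughly 18% higher clock
```

That is why the achievable single-thread turbo clock, not the equal-clock comparison, decides the outcome.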
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
BD has much higher latencies, especially for fdiv.
This is mostly for FP ops and the more complex integer ops (imul, idiv). Most common integer ops have 1-cycle latency (reg, reg).

But I have heard nothing about the strided prefetchers, which are the other thing that gave core 2 it's boost.
The optimization manual mentions them: "The L1 hardware prefetcher in AMD Family 15h processors is a stride prefetcher that is triggered by L1 cache misses and received training data from the L2 prefetcher."

Then there are region prefetchers (good for server type workloads)...

Intel's caches should give them much better latencies/BW for most cached items. They also have much more OoOE resources.
Two threads on a BD module can fetch 32B+32B per cycle from L1 at 4-cycle latency. Writes then go through the WCC and L2, at a much lower rate. L2 reads might reach 32B per cycle at 18-20 cycles latency, but I don't remember the bandwidth right now.

OTOH BDv1 has some bandwidth issues with streaming writes, according to the manual.
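To put those per-cycle figures in perspective, converting bytes/cycle into GB/s is straightforward; the clock speed below is an assumption for illustration, not a Bulldozer spec:

```python
# Convert a per-cycle fetch width into aggregate bandwidth.
# Example: 32B + 32B per cycle per module, at an assumed 4.0 GHz clock.

def bandwidth_gb_s(bytes_per_cycle: int, clock_ghz: float) -> float:
    """Peak bandwidth in GB/s (using 1 GB = 1e9 bytes)."""
    # bytes/cycle * 1e9 cycles/s, then divided by 1e9 bytes/GB
    return bytes_per_cycle * clock_ghz

print(bandwidth_gb_s(64, 4.0))  # 256.0 GB/s peak L1 read per module
```

Peak numbers like this are upper bounds; sustained rates depend on the write path (WCC/L2) limits mentioned above.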

At equal clocks, I'm not expecting that much from BD. The big question is, when not limited by thermals, how high can that one BD thread clock?
It's likely that BD won't run at equal clocks on the same workloads as SB. It will be interesting to watch whether BD at least reaches similar overall performance. But due to the very different architectures, I expect some dramatic positive and negative outliers.