Bulldozer on Slashdot
MitchAlsup
On Aug 25, 11:38 pm, Brett Davis <gg...@yahoo.com> wrote:
> In article <ggtgp-2F5622.01163525082...@news.isp.giganews.com>,
> Brett Davis <gg...@yahoo.com> wrote:
> K10 has one major bottleneck outside the issue pipeline to executing
> more instructions per cycle. The 16 byte decode unit will give you 3.5
> instructions per cycle on average, less for SSE code, as few as 2.5.
> A 32 byte decode unit will be idle greater than 50% of the time on average.
> Huge die area and a huge win to share.
> The k10 retirement unit can only retire 3 instructions a cycle, Bulldozer
> will do 4.
(Ahem) K10 is BullDozer, K8 is Opteron and follow-ons.
> The third AGU was never used, waste of die area and heat.
The issue was that the 3rd unit was used a lot, only to run into the
dual-only ported DataCache. This caused sequencing issues.
> The third ALU is of more concern, Intel will standardize benchmarks to
> make this look bad, even though I know it was used 1% on average.
So what else is new.
> AMD now has separate load and store pipelines, this can be a huge advantage.
> For every 90 instructions on average you will have 60 integer ops, 20 loads,
> and 10 stores.
We measured very close to 50% of x86 instructions having memory
reference attachments. So, for every 90 x86 instructions, one would
expect 45 memory references with a general ratio of just over 2 reads
to 1 write. Thus, I would expect 30-33 reads and 12-15 writes.
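
The arithmetic behind those ranges can be checked with a minimal sketch. The 50% memory-reference fraction and the 90-instruction window come from the text above; the read:write ratio bounds of 2.0 and 2.75 are assumptions picked here only to bracket "just over 2 to 1", and the function name is hypothetical.

```python
def memory_mix(instructions, mem_fraction, read_write_ratio):
    """Split a window of instructions into expected reads and writes.

    instructions     -- size of the instruction window (90 in the post)
    mem_fraction     -- fraction of instructions with a memory reference
    read_write_ratio -- reads per write (assumed bounds, see lead-in)
    """
    refs = instructions * mem_fraction
    # With r reads per write, reads are r/(r+1) of all references.
    reads = refs * read_write_ratio / (read_write_ratio + 1)
    writes = refs - reads
    return refs, reads, writes


if __name__ == "__main__":
    # Assumed lower and upper bounds for "just over 2 reads to 1 write".
    for ratio in (2.0, 2.75):
        refs, reads, writes = memory_mix(90, 0.5, ratio)
        print(f"ratio {ratio}: {refs:.0f} refs, "
              f"{reads:.1f} reads, {writes:.1f} writes")
```

Under these assumed bounds the sketch spans the quoted 30-33 reads and 12-15 writes; a different ratio would shift the split but not the total of 45 references.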
> The branch unit is not on the Bulldozer slides,
We always put these in the ALUs with means to redirect the front-end
on discovery of mispredict.
> Bulldozer will be faster than K10, the question is how much,
When I left, BD was supposed to be 20-25% faster frequency-wise, and to
lose a little architectural figure of merit (5%-ish) due to the
microarchitecture. The surprising thing was the lack of mention of
frequency in the market-droid-ing.
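
As a rough illustration of how those two figures combine, here is a minimal sketch. It assumes performance scales as frequency times architectural merit (per-clock work), which ignores memory-bound effects; that model and the function name are assumptions, not something stated above.

```python
def net_speedup(freq_gain, merit_loss):
    """Combine a frequency gain and a per-clock loss into one factor.

    freq_gain  -- fractional frequency increase (0.20 means 20% faster)
    merit_loss -- fractional loss in per-clock work (0.05 means 5% lower)
    Assumes performance = frequency * per-clock work (a simplification).
    """
    return (1.0 + freq_gain) * (1.0 - merit_loss)


if __name__ == "__main__":
    # The 20-25% frequency range and 5%-ish loss are the figures quoted above.
    for gain in (0.20, 0.25):
        factor = net_speedup(gain, 0.05)
        print(f"+{gain:.0%} frequency, -5% per clock: "
              f"{(factor - 1.0):.1%} net")
```

Under that simple model the net gain lands somewhere in the mid-to-high teens in percent, so frequency would have carried most of the improvement.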
Mitch