
Some Bulldozer and Bobcat articles have sprung up


Cerb

Elite Member
Aug 26, 2000
17,484
33
86
And most performance-demanding applications are demanding because they have plenty of uops to execute. :)
Yes. I haven't been saying that's not the case. I even quoted myself about it being common, there.

It started specifically about a situation in which there would be equal numbers of threads across the modules, where half were not using AVX, but still issuing SIMD instructions every cycle, and half were, with one of each variety represented on each module. Also, for whatever reason, each could only execute a single SIMD instruction each cycle.

So, then, if OOO allows you to not be stuck that way, then the whole initial worry was moot. But OOO can only go so far--it's not a panacea--and worrying about high-utilization cases should really be left to benchmarks. Feeding the cores is no small task even when there are parallel instructions to go around, AMD is not going to release enough info to do a good job of predicting the results, and OS scheduling will have a significant impact on such a scenario, since surely any modern OS will schedule it the same way it handles HT.
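If you really wanted to test that pairing instead of guessing, you'd pin the threads yourself rather than trust the scheduler. A minimal Linux sketch (the module-to-CPU numbering is an assumption on my part, not anything AMD has published):

Code:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Assumed layout: CPUs 2n and 2n+1 share module n. Give every module
 * one AVX thread and one non-AVX SIMD thread, instead of whatever
 * pairing the OS scheduler happens to pick. */
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);  /* restrict the calling thread to one logical CPU */
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *avx_worker(void *arg) { pin_to_cpu(2 * (int)(long)arg);     /* ...AVX kernel... */ return NULL; }
static void *sse_worker(void *arg) { pin_to_cpu(2 * (int)(long)arg + 1); /* ...SSE kernel... */ return NULL; }

int main(void)
{
    pthread_t t[8];
    for (long m = 0; m < 4; m++) {  /* 4 modules, 2 threads each */
        pthread_create(&t[2 * m], NULL, avx_worker, (void *)m);
        pthread_create(&t[2 * m + 1], NULL, sse_worker, (void *)m);
    }
    for (int i = 0; i < 8; i++)
        pthread_join(t[i], NULL);
    return 0;
}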
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
An 8-core BD could well be smaller than a 4-core SB, since adding the four secondary cores apparently only increases size by 5%
Two things are not quite right: the 5% figure, and the "secondary cores" concept.

The 5% figure is wrong, and has been for a long time.

This figure has been retracted (debunked, if you will) and the actual figure is 50%.

There's a good story behind how the "5%" figure got started (and what it actually means; it wasn't an outright lie, but a context mismatch between someone in Marketing and someone in Engineering), but I tire of having to re-tell it.

It makes sense that "50%" is the real figure, because there are no "secondary cores". To get it out of the way:
-There is no "primary" core with an accompanying smaller "secondary core"
-There are also no "mini cores" in Bulldozer

Bulldozer (the unit, as in one module) is a monolithic dual core: they started the design with 2 full cores and tweaked them in such a way as to achieve what they thought was the best tradeoff - minimal single-threaded performance loss (relative to if they had not tweaked and fused the 2 full cores together) in exchange for a significant multi-threaded performance gain. They got the end result they wanted with how Bulldozer is now, and along with the "minimal single-threaded performance loss" promise, they have also made it a point to emphasize that IPC improved from Deneb (no telling if it is actually better than anything Intel has, but better than Deneb, yes), and that the performance loss is relative to, well, what I just said earlier in the paragraph.

I think this is already the third or fourth time I've had to explain this in this thread. Not that I mind, just thinking out loud; this thread has become rather hard to follow anyway, and I doubt anybody is actually reading through all the posts.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Yes. I haven't been saying that's not the case. I even quoted myself about it being common, there.

It started specifically about a situation in which there would be equal numbers of threads across the modules, where half were not using AVX, but still issuing SIMD instructions every cycle, and half were, with one of each variety represented on each module. Also, for whatever reason, each could only execute a single SIMD instruction each cycle.

Yeah, I'm no OOO expert, but I figure trying to keep the AVX and non-AVX SIMD threads in separate modules would probably lead to worse performance, so it's entirely possible to end up with some scheduling issue where one module gets a mix of both.

Personally, my first impression was that as long as they can get the next design out soon after this one, they won't have any problems. AVX is not the norm right now, and if each core had its own 256b FMAC, the majority of the time half of it wouldn't be doing jack. So as opposed to letting it sit there and do nothing, at least at release you get the nice bonus of two 128b FMACs. And then some time in the future they'll probably do the same thing, with each core having a 256b FMAC and sharing a 512b FMAC, and so on.
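To put the one-256b-vs-two-128b idea in code (standard SSE/AVX intrinsics below; how BD actually splits 256b ops across its two 128b FMACs is an assumption on my part, and BD's real FMA4 instructions fuse the multiply and add, so treat this as a sketch):

Code:
#include <immintrin.h>

/* One 256b op: on a shared pair of 128b FMACs this would occupy both
 * halves of the module's FP unit. */
__m256 madd256(__m256 a, __m256 b, __m256 c)
{
    return _mm256_add_ps(_mm256_mul_ps(a, b), c);
}

/* The same math at 128b: needs only one FMAC, leaving the other free
 * for the second core in the module -- the "nice bonus". */
__m128 madd128(__m128 a, __m128 b, __m128 c)
{
    return _mm_add_ps(_mm_mul_ps(a, b), c);
}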
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Blizzard DID spend resources on multi-threading and graphics. Yes, it added to the overhead. But it also added to the quality of the application.

Not every application needs to be threaded, nor does every application need an Opengl interface. However, where they do, it seems developers often take the lazy way out.

You aren't wrong either, but Blizzard does not spend anywhere near as many resources on graphics and multi-threading as some other developers do. Hell, there are some that rely primarily on those to succeed.

It's not about being "lazy". It's about being economical and cost-efficient. Extra programming work generally means more R&D cost.

the only similarity btwn ht and a BD module is that they each take up ~ 5-10% more die space. Obviously the proof will be in the pudding, but up to 80% extra performance on the 2nd core is much better than ht's 15%.

Right, as if duplicating registers takes anywhere near as much space as adding more execution units. I bet Hyperthreading only takes 2-3% on Nehalem excluding the L3 cache. At the chip level that's ~1%.

You should look at Anandtech's article again: "12% per core including L2" and "5% per chip". That's at least 5x as much circuitry as Hyperthreading requires, but because Moore's Law offers so much headroom for more transistors, the impact is small. Cores aren't the dominating factor in terms of die size anymore; caches, interconnect, and uncore are equally dominant.

Sandy Bridge's competitor will be Llano for a significant amount of its lifetime, because of when it'll be released.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
An 8-core BD could well be smaller than a 4-core SB, since adding the four secondary cores apparently only increases size by 5%, and it doesn't have the IGP that SB does.

Wasn't that 5% figure in the Anandtech article later corrected to 50%?
5% is impossible... only something like HT takes 5%.
http://www.anandtech.com/show/2881
AMD has come back to us with a clarification: the 5% figure was incorrect. AMD is now stating that the additional core in Bulldozer requires approximately an additional 50% die area. That's less than a complete doubling of die size for two cores, but still much more than something like Hyper Threading.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
the only similarity btwn ht and a BD module is that they each take up ~ 5-10% more die space. Obviously the proof will be in the pudding, but up to 80% extra performance on the 2nd core is much better than ht's 15%.

Incorrect... see my post above, correcting the 5% figure to 50%.
I don't see why people believe in such fairy tales anyway.
HT takes up very little die space because hardly any physical hardware is added. Most of it is just partitioning the existing resources.
AMD adds actual units to it... you really think 4 integer units are only 5% of a die?
 

Scali

Banned
Dec 3, 2004
2,495
0
0
so if they don't explicitly state it then it must be something horrible, but if they DO explicitly state it then they're lying. kind of convenient, no?

I'm just pointing out that AMD has absolutely NO credibility for me.
I didn't buy their Barcelona claims back then, because of the missing 'secret sauce'... and I turned out to be right.
I'm doing the same today: I don't see the 'secret sauce' that makes it run as fast as they claim it will, so I don't believe it. I'm willing to bet JFAMD's job on this.
What you're basically saying is that it's foolish not to believe AMD's marketing babble.
I would advise the exact opposite: Use common sense and logic, and don't believe anything marketing says unless they can back it up with sensible, logical and technical arguments.
What we know so far is this:
- They are removing one ALU and one AGU per core
- They are removing one decoder per core (effectively: you get one 4-wide decoder per module, so 2-wide per core, as opposed to the three decoders each core had before).
- They are sharing one FPU/SIMD unit per 2 cores

These are all actions that DECREASE the execution resources in some way.
We have not heard about them even COMPENSATING for the removal of these resources yet... so even being on par with previous generation IPC in single threads would already be quite an improvement in efficiency (roughly 33%, which is arguably more than what AMD or Intel ever achieved in a microarchitecture update, with the obvious exception of Netburst->Conroe, although this was skewed by the drop of about 1 GHz in clockspeed).
So basically, with this level of reduction and sharing of resources, and STILL increasing IPC over the previous gen, that would be one INCREDIBLE feat of engineering... And they'd need some pretty nifty 'secret sauce' to make these 'anemic' cores run that fast. Have you seen it? I haven't. And that is after AMD made its big introduction about Bulldozer to the press. So I guess it's just not there, and they're not going to pull it off.
That's called common sense. Educated guess, if you will. I'm willing to bet JFAMD's job on it.
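Spelling out the arithmetic behind those percentages (just the ALU counts, nothing more):

Code:
#include <stdio.h>

/* 3 ALUs down to 2: a third of the resources gone, so holding IPC
 * steady means each remaining ALU must do half again as much work. */
int main(void)
{
    double old_alus = 3.0, new_alus = 2.0;
    printf("resources removed: %.0f%%\n",
           (1.0 - new_alus / old_alus) * 100.0);     /* ~33% */
    printf("per-ALU efficiency gain needed to stay on par: %.0f%%\n",
           (old_alus / new_alus - 1.0) * 100.0);     /* 50% */
    return 0;
}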
 
Last edited:

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
jvroig said:
The 5% figure is wrong

http://www.hardocp.com/image.html?image=MTI4MjMyNzEzOHV3YTdlWk81TTNfMV84X2wuanBn

From the above AMD slide we have that the Integer Core (integer execution unit + scheduler) is 12% of the whole Module, and each Module has 2 Integer Cores, so 24% of each Module is just the INT Cores.

http://images.anandtech.com/reviews/cpu/amd/phenom2/phenom2die.jpg

From the above AMD Phenom II picture, it is clear that the L3 Cache (purple) occupies 33-35% of the entire CPU die, the 4 cores (blue) are almost half (50%), and the rest is the IMC (Integrated Memory Controller), the INC (Integrated NorthBridge Controller), and I/O. (Approximation)

If we take this analogy and replace the Phenom II Cores with Bulldozer Modules, each module then occupies 12.5% of the whole die area, and of that only 12% is an Integer Core, so when AMD says 5% of the total die it's correct

Edit:
and it means a Bulldozer Integer Core occupies 5% of the whole CPU die. ;)

That’s wrong (Sorry) ;)
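For what it's worth, plugging the approximations above into the arithmetic (my computation, using only the figures in this post):

Code:
#include <stdio.h>

/* 4 modules take ~50% of the die (like the 4 Phenom II cores did),
 * and an integer core is 12% of a module. */
int main(void)
{
    double module_share = 0.50 / 4.0;           /* one module: ~12.5% of the die */
    double one_int_core = module_share * 0.12;
    printf("one integer core = %.1f%% of the die\n", 100.0 * one_int_core);         /* ~1.5% */
    printf("all four together = %.1f%% of the die\n", 100.0 * 4.0 * one_int_core);  /* ~6%, the "5%" ballpark */
    return 0;
}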
 
Last edited:

Riek

Senior member
Dec 16, 2008
409
15
76
I'm just pointing out that AMD has absolutely NO credibility for me.
I didn't buy their Barcelona claims back then, because of the missing 'secret sauce'... and I turned out to be right.
I'm doing the same today: I don't see the 'secret sauce' that makes it run as fast as they claim it will, so I don't believe it. I'm willing to bet JFAMD's job on this.
What you're basically saying is that it's foolish not to believe AMD's marketing babble.
I would advise the exact opposite: Use common sense and logic, and don't believe anything marketing says unless they can back it up with sensible, logical and technical arguments.
What we know so far is this:
- They are removing one ALU and one AGU per core
- They are removing one decoder per core (effectively: you get one 4-wide decoder per module, so 2-wide per core, as opposed to the three decoders each core had before).
- They are sharing one FPU/SIMD unit per 2 cores

These are all actions that DECREASE the execution resources in some way.
We have not heard about them even COMPENSATING for the removal of these resources yet... so even being on par with previous generation IPC in single threads would already be quite an improvement in efficiency (roughly 33%, which is arguably more than what AMD or Intel ever achieved in a microarchitecture update, with the obvious exception of Netburst->Conroe, although this was skewed by the drop of about 1 GHz in clockspeed).
So basically, with this level of reduction and sharing of resources, and STILL increasing IPC over the previous gen, that would be one INCREDIBLE feat of engineering... And they'd need some pretty nifty 'secret sauce' to make these 'anemic' cores run that fast. Have you seen it? I haven't. And that is after AMD made its big introduction about Bulldozer to the press. So I guess it's just not there, and they're not going to pull it off.
That's called common sense. Educated guess, if you will. I'm willing to bet JFAMD's job on it.

First problem: you compare a Bulldozer core to another core. Bulldozer is meant to compete with HT, and its cores always come in modules. E.g. a Bulldozer core competes with a thread from Intel's SB, Nehalem, Westmere... A module (or 2 cores) competes with a core from K8, K10...


What we know so far is this:
- They are removing one ALU and one AGU per core
- They are removing one decoder per core (effectively: you get one 4-wide decoder per module, so 2-wide per core, as opposed to the three decoders each core had before).
- They are sharing one FPU/SIMD unit per 2 cores
--->
- They are removing one ALU and one AGU per core but increasing their flexibility (and also improving prediction and other logic to keep the pipes fed).
- They add one decoder per module (a module is competing with an Intel core and an older AMD core).
- They are sharing two 128-bit FMAC FPU units, which can process requests from 2 threads simultaneously (except for 256-bit ops). Basically each core has the same throughput (without special FMAC optimization) as a K10 core. Since they are competing with a module against a core, they have double the throughput!


The module is not 50% larger due to the added integer core... That is simply impossible, since the cache is the biggest space consumer and it is shared between the cores. All we know is that the BD module is only a little larger with the extra core than without. Over the complete die this will be just a mere few percent - more than HT, but it allows them to compete with/outperform HT, for which they previously needed a whole extra core. Note that while HT adds little space, the wider execution units that were needed to get a boost from HT also cost die space and are not counted in the HT figure.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
First problem: you compare a Bulldozer core to another core. Bulldozer is meant to compete with HT, and its cores always come in modules. E.g. a Bulldozer core competes with a thread from Intel's SB, Nehalem, Westmere... A module (or 2 cores) competes with a core from K8, K10...

We were discussing the effects on single-threaded performance however.
So I don't see a 'problem'. Try to stay on topic.

And I disagree that a module competes with one physical core... it's about 50% larger... it falls somewhere between 1 and 2 physical cores of a K8/10.

That is not that relevant when discussing single-threaded performance however.

- They are removing one ALU and one AGU per core but increasing their flexibility (and also improving prediction and other logic to keep the pipes fed).

Does this flexibility guarantee a 50% increase in efficiency or more? If not, single-threaded performance will suffer (two ALUs need to do the same work that 3 ALUs used to do).
And as I said, I think 50% increase in efficiency is a pretty tall order.

- They add one decoder per module (a module is competing with an Intel core and an older AMD core).

Since a module houses two cores, that is effectively removing one decoder per core, as I said.
Especially with the decoder, the notion that a module competes with a single core is nonsense. A module NEEDS to decode two threads at a time, because there's two cores inside. Even if a core is just executing the idle thread, it still needs to decode halt instructions every cycle.

- They are sharing two 128-bit FMAC FPU units, which can process requests from 2 threads simultaneously (except for 256-bit ops). Basically each core has the same throughput (without special FMAC optimization) as a K10 core. Since they are competing with a module against a core, they have double the throughput!

Not exactly. Each core used to have three FPU ports, two of which also handled SIMD (128-bit). Now there are two, shared by two cores.
Again, this is about single-threaded performance. Hence we compare one logical Bulldozer core's IPC against one logical (and also physical) K10 core IPC.

The module is not 50% larger due to the added integer core... That is simply impossible, since the cache is the biggest space consumer and it is shared between the cores.

The 50% figure is a quote from Anand who says it comes directly from AMD (correcting his earlier figure of 5%).
I'll take AMD's information over yours, thank you very much.
 
Last edited:

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
And I disagree that a module competes with one physical core... it's about 50% larger... it falls somewhere between 1 and 2 physical cores of a K8/10.


If you only take the Integer Core, that's only 12%, but you have to include the uncore part of the Module (fetchers, decoders, etc.), which has to be larger because it has to feed two INT Cores. But that doesn't make the Module 50% more than one Phenom Core. If that were true, then the whole point of having less silicon (than two full cores) with Bulldozer and better performance would not be possible. ;)
 
Last edited:

Scali

Banned
Dec 3, 2004
2,495
0
0
If you only take the Integer Core, that's only 12%, but you have to include the uncore part of the Module (fetchers, decoders, etc.), which has to be larger because it has to feed two INT Cores. But that doesn't make the Module 50% more than one Phenom Core. If that were true, then the whole point of having less silicon (than two full cores) with Bulldozer and better performance would not be possible. ;)

Uhhh, what?
The whole module being 50% larger than one Phenom core would make perfect sense.
One module handles 2 threads.
How much larger is a Phenom core if you need to handle two threads? That's right, 100%, because you just copy-paste a complete core next to it.
So if AMD managed to save 50% on adding that second core now, that's quite a significant gain. And that will probably come at a small cost in terms of single-threaded performance, as each thread now has fewer execution resources at its disposal. But since they can now add more cores in the same die area, they can improve multi-threaded performance.
 
Last edited:

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
You could have one CLASSICAL single core (one INT execution unit + FP) if you take one Integer Core out of the Bulldozer Module, and that will only save you 12% (according to AMD). The uncore part of the Module likely doesn't have to be that large if you only have ONE Integer Core, so say that's another 5-10%; thus I believe a Bulldozer Module is not more than 20-25% larger than a single classical core like Phenom II's would be. ;)
 

Scali

Banned
Dec 3, 2004
2,495
0
0
You could have one CLASSICAL single core (one INT execution unit + FP) if you take one Integer Core out of the Bulldozer Module, and that will only save you 12% (according to AMD). The uncore part of the Module likely doesn't have to be that large if you only have ONE Integer Core, so say that's another 5-10%; thus I believe a Bulldozer Module is not more than 20-25% larger than a single classical core like Phenom II's would be. ;)

Well, I'm getting tired of all these figures floating around.
What's the point?
What really matters is:
- How many threads will we REALLY get? (8 apparently, according to JFAMD, rather than the 6 of Thuban, which ironically is not even 50% more, so are cores really THAT much smaller? Why don't they go for 10, 12, heck 16 cores, if adding cores is only a few % extra die space now? Especially when they know Intel is going for 8 physical cores with SB, hence 16 logical cores... which probably have higher IPC than AMD's).
- At what price?
- How well do these threads perform?

It's useless trying to debate over die-sizes with so many unknowns.
 
Last edited:

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
and it means a Bulldozer Integer Core occupies 5% of the whole CPU die.
5% is the size of the int core; I didn't say it wasn't. What is wrong is the claim that "5% is the added die area to 'create/add' a second core to an existing core". Plain wrong. And that was the context of it, when someone said "well, it only takes 5% to add another core". Or did you miss the context and just wanted to "correct" something, even if it means taking it out of context?

Pop quiz #1: An "int core" is the whole shebang that you need to add to an existing single core for it to be a dual core, yes or no?

Pop quiz #2: AMD already has a die with 4 cores (normal ones, like Deneb, Propus, etc.); if they were to add 4 more cores (turning the chip into something Bulldozer-like), how much more die area would be needed? Is it 5%, or close to it?

No and no.

The problem is that you can obviously read slides, but you don't really show any concrete understanding of CPU architecture/design. I don't begrudge you participating in forums, of course, but maybe you could try to stay out of pissing matches.

I am not about to repeat myself again (for what would be the 4th or 5th time) about BD architecture, and I don't want to repeat the origin and context of the "5% story", and I don't want to link again to a direct quote from AMD courtesy of Anand.

But that don’t make the Module 50% more than one Phenom Core. If that was true then the whole point of having les silicon (Than two Full Cores) with Bulldozer and better performance will not be possible.
That's the rub, isn't it? How indeed would it be possible? Maybe they are over-promising, maybe they haven't shown all their cards yet. Who knows.

Anand published 5%. He wasn't exactly misled; it came from marketing, but marketing misconstrued something from engineering. Engineering gave a very literal answer to a question posed by marketing, an answer that, while correct, did not capture the 'spirit' of the question, because marketing failed to phrase the question precisely in the first place (their domains and jargon are different; the question was rather direct, and for marketing people it was more than good enough, since they aren't engineers and don't understand the nuances of such questions, as it isn't their job). The result was a failure of communication. There you go, the gist of the 5% story, right after I just said I didn't want to repeat it, but I'm at a loss how else to proceed. Anand then redacted what he said.

Why?

AMD called him out on it. Not Intel. AMD. Anand published a figure so awesome, why wouldn't AMD just let it go? Because it was just too good to be true, and most people who understand CPU architecture and design will not be fooled by it for a second. So AMD pretty much had no choice but tell Anand, "Hi, actually it's 50%. Thanks. Btw, remember, emphasize the part about IPC improvements from Deneb, our new BD int cores have better throughput than Deneb!"

So 50% is the "non-marketing" reality. It's not as bad as you make it sound with "then the whole point of having less silicon (than two full cores) with Bulldozer and better performance would not be possible". They still have 2 cores that only occupy 1.5x the space. Had they not gone this way, their octo-core would take up 8x, whereas now it only takes up 6x (all sizes approximate and only relative to each other). So they did achieve less silicon (they saved 25%), but the performance is still an open question, and only benchmarks will show us whether they actually achieved that as well.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
"Hi, actually it's 50%. Thanks. Btw, remember, emphasize the part about IPC improvements from Deneb, our new BD int cores have better throughput than Deneb!"

I think this again is a 'miscommunication' of some sort.

I mean, it's very easy for anyone to see that theoretically BD can handle 4 integer instructions per cycle, two of them being ALU, two of them being AGU.
Deneb can only handle 3 integer instructions per cycle, where either can be ALU or AGU, doesn't matter.
So yes, in theory BD has higher throughput.
However, it places some extra constraints on the instruction mix that Deneb doesn't have.
And I am going to just boldly state that in practice, the instruction mix will not be that favourable to BD on average. Sustaining 3 instructions of throughput on Deneb is difficult, but possible in practical situations.
But sustaining 4 instructions of throughput on BD? Nah, I don't see it happening.
I will present two arguments for this:
1) If you have AGU operations, generally they will be either to load data for an ALU instruction, or to store a result from an ALU operation. In both cases, they are dependent, and as such cannot be scheduled in the same cycle (often both micro-ops are encoded into a single x86 instruction).
2) A lot of the time, you use registers for loading or storing data with ALU operations. So no AGU operations required (and depending on how AMD implements their ALUs and AGUs exactly, certain simple memory accesses may not require AGUs either... eg the form mov eax, [edx]).

Hence, I think you'll generally find a higher load of ALU operations than AGU operations in performance-critical code... and when you do find AGU operations, you may run into dependency problems, which limit your parallelization opportunities.

By definition you cannot get 3 ALU throughput out of BD anyway.
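A toy loop makes both arguments concrete (the micro-op mapping in the comments is my reading of the argument, not AMD documentation):

Code:
/* Summing an array: address generation (AGU), the load, and the ALU add
 * form a dependency chain, so they cannot all issue in the same cycle --
 * argument 1). The loop bookkeeping (increment, compare, branch) is pure
 * ALU work with no AGU involvement -- argument 2). */
int sum(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];  /* AGU: &a[i] -> load -> ALU add, all dependent */
    return s;
}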
 

Riek

Senior member
Dec 16, 2008
409
15
76
the 50%, 5%, 12% figures are completely out of context.

No, it is IMPOSSIBLE for a module to be 50% larger due to an extra integer core.
What is possible is that the integer area is 50% larger due to the extra core (since they have shared resources!!!). But due to the additional shared FPU and shared caches, this becomes a lot less at the module level. If you then look at the chip level, which also contains the L3 and IMC, the extra cores' share is almost nothing.

Again, the 50% figure was for the integer part only!!

Euhm, there will be 16 cores (8 modules) on the server side, and there will not be a 6-core SB at launch either.

And the biggest problem for K10 was not its max throughput, but keeping the architecture/units filled. Intel was leaps ahead in prediction/fetching/op fusion, which is now also in Bulldozer, and they go further than the current i7 in this.
 

khon

Golden Member
Jun 8, 2010
1,318
124
106
It's not as if I pulled the 5% out of thin air, or got it from the old presentation. It's taken directly from the press info AMD released two days ago:

[attached AMD slide: bulldozerefficient.jpg]


If you want to disagree with that then take it up with AMD.

You could of course argue that they are making the shared resources larger than they would need to be for only one core, which would alter the 5% number somewhat. But it's still going to be somewhere in that ballpark, certainly not 50%.
 
Last edited:

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
It's useless trying to debate over die-sizes with so many unknowns.
It is useless now because some people would rather believe what they think the marketing slides mean, even despite a clear statement from the same company that refutes the image in their heads about how fantastical, magical, fairy-dust-and-unicorns BD is or should be.

I think this again is a 'miscommunication' of some sort.
Most certainly possible, or it could be a host of other things: marketing hype, a move to allay foreseen concerns, etc. Maybe they are even right; the only saving grace is if Deneb's weakness was not the ALU+AGU mix at all, but rather the coupled branch prediction and instruction fetch instead of raw ALU crunching power. The closest thing I can think of is the Radeon scenario: perhaps Deneb is like Cypress, where theoretical peak performance is astounding but not really achievable in real life, so instead of adding more number crunchers, this time around they just make sure all the number crunchers are fed much better. While theoretical peak performance will certainly drop, real-world performance may increase.

I am rather more forgiving of their engineering team, so at the moment I assume performance is "ok" - not "holy crap Intel-eating great performance", but also not "terrible Barcelona II fiasco". We'll see when benchmarks arrive, and no matter how it turns out it will certainly be preferable to what we have now with people who have no idea what they are talking about practically declaring "AMD will pownz Intel, lolololol!!!" or how much smarter AMD engineers are because AMD's 5% > Intel's 5%.

No, it is IMPOSSIBLE for a module to be 50% larger due to an extra integer core.
Yeah, here we go again. This time I will be smarter and not repeat myself for the fifth time or so. We've been over that. Nobody said the int core is 50%, and an int core is not the only thing you need to create a Bulldozer out of an existing single core.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
It's not as if I pulled the 5% out of thin air, or got it from the old presentation. It's taken directly from the press info AMD released two days ago:
I know where the "5%" started; in fact I just repeated the story about it (its origin story, if you will), or at least the gist of it. RTT if you missed it.

If you want to disagree with that then take it up with AMD.
If you disagree with 50%, take it up with Anand and AMD; I even posted a link. Then I even bothered to repeat myself for the fourth time or so, just to explain the hows and whys. Or have you missed the last few posts?

RTT. I'm not repeating myself for the fifth time or so.

Nothing personal, I don't hate you or anything, and I'm not pissed off or anything (and apologies in advance if you feel it is due). But after repeating myself over and over (not necessarily to you, there are a lot of other people here too), it just gets old pretty fast.

See you around :)



----
Anyway, this thread is amassing fail at a spectacular level. I don't see it going anywhere else (having to repeat myself 4 - 5 times was a clue), so I'll just leave. I hope AMD releases some more info and benchmarks sooner rather than later so we can all have something real to talk about that people can easily understand, relate to, and get excited about.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
The only thing I can think of close to this is the Radeon scenario - perhaps Deneb is like Cypress, theoretical peak performance is astounding, but is not really that achievable in real life

That's pretty much the signature of x86.

But as I said, the last 2-ALU x86 was the Pentium III.
K7 didn't get all that much mileage out of the third ALU over the Pentium III... but still.
We've come a long way since then, so I think we can assume that the third ALU is at least somewhat useful... especially since both AMD and Intel have insisted on that third ALU for so many years.
If we count the Pentium 4 in as well (since it had two ALUs that could each execute 2 instructions per clk, so it was more or less a pseudo-4-ALU CPU), then Intel has had four 3(+)-ALU micro-architectures in a row since 1999: Netburst, Conroe, Nehalem, and the upcoming Sandy Bridge.
Likewise, AMD has had K7, K8 and K10, all with the three ALU design.

I would think that in all those years of micro-architecture design between these two companies... if you could really remove one of the ALUs without hurting IPC, at least one of them would have taken that opportunity already.
It just can't be that simple that after all these years, some guy at AMD suddenly wakes up and says "Hey wait a second, you know that 3rd ALU that we've been using all these years? We don't need it, it's never being used!"
That just goes against all common sense.

Not that I'm saying that it will be a huge drop in performance, or even a significant one (assuming they can optimize other parts of the architecture to compensate)... it's just that removing an ALU *and* improving IPC at the same time, that's just not going to happen. Mark my words.
I've written too much code with more than 2 ALU ops per clk throughput to buy this fairytale.
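The kind of code I mean looks like this (a sketch; whether a given compiler keeps the adds this cleanly independent is another matter):

Code:
/* Three independent accumulators: the three adds per iteration have no
 * dependencies on each other, so a 3-ALU core can issue all of them in
 * one cycle, while a 2-ALU core by definition cannot. */
unsigned sum3(const unsigned *a, int n)
{
    unsigned s0 = 0, s1 = 0, s2 = 0;
    for (int i = 0; i + 2 < n; i += 3) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
    }
    return s0 + s1 + s2;
}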
 

Riek

Senior member
Dec 16, 2008
409
15
76
Yes, by adding another integer part... (see link...)

http://en.wikipedia.org/wiki/File:K10h.jpg

...you add 50% of the total module size?? (A module also includes L2 caches, which are not even shown in that picture.) RTTP on Anandtech first; that article is outdated (or you could just assume the latest slides from AMD are correct... they were made a few days ago).
 

JFAMD

Senior member
May 16, 2009
565
0
0
OK folks, enough of the 50% number.

Let's think about this realistically for a moment. The L3 cache, HT links, memory controller and other components are well over 50% of the total die space.

Then at the module level, the FPUs are large. There is an L2 cache. There is the shared front end. Those things all eat up real estate.

The integer cores are actually pretty small.

Here is the math. Start with a full die. Remove 1 integer core from each module and 1 integer scheduler (everything else stays where it is). Measure the die size. Your new number is 95% of the total die.
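That 5% also squares with the "12% per module" figure quoted earlier in the thread (the module share below is implied by the two figures, not an official number):

Code:
#include <stdio.h>

/* If dropping the second integer core (plus scheduler) shrinks a module
 * by ~12%, and doing it to every module shrinks the die by ~5%, the
 * modules together are ~42% of the die; the rest is L3, links, IMC. */
int main(void)
{
    double module_saving = 0.12;  /* fraction of one module removed */
    double die_saving = 0.05;     /* fraction of the whole die removed */
    printf("implied total module share of the die: %.0f%%\n",
           100.0 * die_saving / module_saving);  /* ~42% */
    return 0;
}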

The 50% number was way wrong, and it took me a lot of time to get that straightened out with Anand. And for some reason everyone keeps quoting the original article instead of the numerous corrections.

It was so bad that we added the slide that was posted above just in case it came up in press interviews (it did in 1 out of ~25 or so that I did.)

If you just sit back and think about it from a logical perspective there is no way that increasing the integer core count could increase the die by 50% because there is so much other stuff in the die.
 

Scali

Banned
Dec 3, 2004
2,495
0
0
Yes, by adding another integer part... (see link...)

http://en.wikipedia.org/wiki/File:K10h.jpg

...you add 50% of the total module size?? (A module also includes L2 caches, which are not even shown in that picture.) RTTP on Anandtech first; that article is outdated (or you could just assume the latest slides from AMD are correct... they were made a few days ago).

A second core in BD adds the following non-shared resources (see http://www.anandtech.com/Gallery/Album/754#7 ):
- 4 integer units
- An integer scheduler unit
- An integer retire unit
- A load-store unit
- L1 Datacache
- L1 Data TLB

This would make up roughly the middle half of the picture of the K10 core you posted, about 50% (not taking into account that some units were grown as well to accommodate sharing, such as the decoder and the L2 cache).
 
Last edited:

Scali

Banned
Dec 3, 2004
2,495
0
0
If you just sit back and think about it from a logical perspective there is no way that increasing the integer core count could increase the die by 50% because there is so much other stuff in the die.

I think they were not talking about a die, but about a module.
And a module doesn't include the uncore functionality, such as the L3 cache, which makes a big difference in terms of any die-area related issues.
There will also be multiple modules on a die, to further affect any die-area talks.
For a module, adding a second integer core, the 50% figure seems correct, and it is also meaningful: it shows that the Bulldozer architecture significantly reduces die area per core, a very important metric in the current parallel climate.
For a full die, removing an integer core from every module, 5% may be correct, but what does that prove, really? It seems to be a completely arbitrary metric.
 
Last edited: