Some Bulldozer and Bobcat articles have sprung up

Idontcare · Aug 31, 2010

Guys, this is escalating, please stop.

We all see the handwriting on the wall. Please resist the urge to make a post here if it isn't related to the topic of the thread.

Do feel free to use that "report post" button all you want (the red triangle at the lower left of each post, under the poster post-count), as well as making a thread in PFI or moderator discussions so you can vent in an environment where venting doesn't get you an infraction for personal attacks, etc.

Vent your frustrations, we encourage that, but do it in its proper place. This is not the proper place.

edit: I want to be doubly, triply, clear that I am not posting this as a mod, I am posting this as a fellow cpu forum citizen begging you guys to cool it before we cross that threshold where everything and everyone gets put on ice for a while...

CTho9305 · Aug 31, 2010

Scali said:
The exact amount of IPC improvement is not relevant to the point I was making.
Do you understand the point I was making?
I said that BD would require an improvement of about 50% in efficiency in order to maintain the same IPC.
I then pointed out that CPUs rarely make a jump of 50% in efficiency.
So whether K8 vs K10 is 5%, 10% or even 20%, it's not relevant. All these figures are a LONG way from the 50% I was arguing, hence it's not relevant to the point.

Ok, then please be very clear when you're using a number as a number and when you're not. I didn't find it obvious that "5%" meant "some unknown, which is less than the 50% required" and not "5%, which is less than the 50% required". I interpreted it as the latter, which can mislead people who hope to learn more from your post than the one point you intended to make. I know when I read posts, I usually try to get as much out of them as I can... which means it helps when peripheral remarks are correct.

I haven't thought through your "50% performance uplift is required" argument, so I'm not making any comments on its accuracy.

Riek · Aug 31, 2010

you might want to first explain how you reach that 50% efficiency increase needed. And i don't mean "deneb has 50% more ALUs" (since in that case you give deneb an impossible 100% efficient design without any AGU calculation whatsoever, with perfect non-branched code that fits in L1 cache).

Scali · Aug 31, 2010

CTho9305 said:
Ok, then please be very clear when you're using a number as a number and when you're not. I didn't find it obvious that "5%" meant "some unknown, which is less than the 50% required" and not "5%, which is less than the 50% required". I interpreted it as the latter, which can mislead people who hope to learn more from your post than the one point you intended to make. I know when I read posts, I usually try to get as much out of them as I can... which means it helps when peripheral remarks are correct.

I haven't thought through your "50% performance uplift is required" argument, so I'm not making any comments on its accuracy.

Here is the exact post I made:
http://forums.anandtech.com/showpost.php?p=30370842&postcount=320

I think it was VERY clear. It is also clear that I said 'perhaps 5%', so I wasn't sure about the exact figure, didn't think it was important enough to look it up exactly, as nobody would argue that it would be anywhere near 50% anyway.
Don't blame me for the fact that people started systematically picking that post apart in the aftermath... *I* was perfectly clear with what I meant, and what function the K8 vs K10 comparison had in my argumentation. And *you* were free to look back through the thread and find this original post yourself.
Don't be so quick to blame *me* for *your* misunderstandings.

Martimus · Aug 31, 2010

Obviously this thread has gone to looking at the trees but missing the forest.

There is something I would like to learn more about though. What will this more aggressive prefetcher and branch prediction mean for this architecture?

I have to admit that was the one thing that made me excited to see the Bulldozer architecture in action. Before that, I was very lukewarm on the processor, but I really want to see how this change affects the chip.

Does anyone know what the pitfalls of going more aggressive are, and what sorts of trade-offs they would need to make?

EDIT: I found a good explanation of the options for the prefetchers and branch prediction for BD here: http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333&p=4

Markfw · Aug 31, 2010

See previous page. 2 infractions so far. Let's stop the insults before I lock this thread.

Scali · Aug 31, 2010

Why don't you ban those troublemakers for a few days (you know who you are), so I can have a normal time on the forum as well, without this constant stream of insults, heckling, trolling and derailing?
I would really REALLY like to just be able to post without having 5-6 people ganging up on me all the time, just like most other forum users. You think we could do that, even if just for a few days?

Markfw · Aug 31, 2010

Scali said:
Why don't you ban those troublemakers for a few days (you know who you are), so I can have a normal time on the forum as well, without this constant stream of insults, heckling, trolling and derailing?
I would really REALLY like to just be able to post without having 5-6 people ganging up on me all the time, just like most other forum users. You think we could do that, even if just for a few days?

Banning is for multiple infractions. I gave you one for insulting CTho9305, and one to the next poster for insulting you. I don't want this to turn into an infraction-fest. If you don't get over it and move on, I will lock this thread.

Scali · Aug 31, 2010

Given all the insults I've received from a few people over the past few months, they are bound to have run up enough infractions to be eligible for a ban by now.
I'm fed up with the insults, and basically it's your fault that I feel this way. You moderators let it come this far. Just like you let it come way too far with KeysPlayr.
How can you possibly expect me to be perfectly friendly and nice all the time in this hostile climate? I'm a lot more tolerant, patient and friendly than most here... I don't just "vent" like some of the posters (such as the one Idontcare spoke out against).
Shouldn't you try to make the climate a bit less hostile first?

OK, you were given multiple warnings to stop. This is not the place, yet you continue. You are one of the problems here. The sooner you get over yourself and your self-absorbed style of posting, the better off you will be. Right now, for disobeying several mod posts and warnings, you are gone for a week.
You were the only one that continued when told to stop. IDC told you there was a venue for complaining, yet you chose to do it here anyway.
Goodbye.

esquared
Anandtech Administrator

Scali · Aug 31, 2010

This post from Markfw900 is at least as much of a personal attack towards me as what he infracted me for:
http://forums.anandtech.com/showpost.php?p=30275198&postcount=71

You just don't know the definition of stop, do you?

Here it is:
Stop

esquared
Anandtech Administrator

jones377 · Aug 31, 2010

Now can we please get back to discussing Bulldozer?

FYI, RWT has posted its Bulldozer article. The extra days spent analyzing the information seem well spent, as it's IMHO the best BD article yet.

http://www.realworldtech.com/page.cfm?ArticleID=RWT082610181333

Markfw · Aug 31, 2010

Scali said:
This post from Markfw900 is at least as much of a personal attack towards me as what he infracted me for:
http://forums.anandtech.com/showpost.php?p=30275198&postcount=71

That was a friendly suggestion, based on exactly what is going on here.

Now, everyone, let's get back on track here. Scali has a vacation, so the rest of us can continue while he has time to think.

bryanW1995 · Aug 31, 2010

nice article, but it didn't bring any new info to light.

jvroig · Aug 31, 2010

Martimus said:
There is something I would like to learn more about though. What will this more aggressive prefetcher and branch prediction mean for this architecture?

(Damn, I already typed about 5 paragraphs and then I clicked an updated link from my email and it just replaced the tab I was working on)

Anyway, that is about as "star material" as this architecture can get for desktop use. Before we delve into that: in contrast, serverland can be well served as long as the promised "perf/watt" materializes. It's not a tough sell, especially for those with power limits set by off-site datacenters, or even by an internal company IT infrastructure limitation. You'll be getting 16 strong cores (or 32/64 in 2P/4P), and compared to 200 mini cores like a SeaMicro deal, this is way better for reasons serverland is concerned with. Bursty usage, for one: "smoothing out bursty usage" is one of the things they mentioned in their slides, aside from perf/watt and perf/mm2. Clearly, they are selling this arch hard to serverland, and so far everything they've laid out makes me feel this is a "server-first" arch. Look at everything they've been saying, and it's a clear serverland message, nothing for desktops really, unless your local computer store salesman quotes you perf/watt, perf/mm2, "efficiency", etc. So as far as serverland is concerned, and provided they hold true to the promise of better-than-MC performance (not hard to do: compared to Thuban, MC has low clocks, which made Thuban all the more a surprise when it clocked at 3.3 GHz), BD Interlagos is not a hard sell.

For desktop usage, it's a hard sell. Cutting down on execution units means less IPC (assuming nothing was changed to improve instruction fetch and branch prediction compared to last arch), but the fact that they changed fetch and prediction can tell us one of two things, depending on if we are optimistic or pessimistic.

If we are optimistic, it means something within this line: when they made instruction fetch and branch prediction better (not just more aggressive but decoupling them), they found out that IPC improved to a point that removing the third ALU and AGU resulted in a hit that was still better (maybe far better, depending on how generous you are with adjectives) than their previous case - all that matters is they do hit their target, which they've set way before when all they had was the drawing board, before any real silicon was in their hands.

If we are pessimistic, well, then it's a "holy-crap-this-arch-is-made-for-serverland-and-now-we-are-stuck-with-low-cost-desktop-chips-to-compensate-for-the-performance" type of scenario. They pretty much removed 33% of the execution units. If the changes above did not result in far beyond 33% improvement, then there's no gain, and 33% is a tough figure to overcome. I, for one, don't want more cores - I love quad-cores just fine, and now I want even-faster quad-cores, not just-as-fast octo-cores.
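
A quick back-of-the-envelope check on that 33% figure, as a minimal sketch. It assumes, purely for illustration, that integer throughput is just unit count times per-unit utilization, which real cores don't obey (fetch, decode, and memory all get in the way). The function name is made up for this example. Under that naive model, losing one of three units means the remaining two each have to do 50% more to break even, so the bar is even higher than 33%.

```python
def required_utilization_gain(old_units: int, new_units: int) -> float:
    """Fractional per-unit throughput gain needed to match the old total.

    Naive model (an assumption, not how a real pipeline behaves):
    total throughput = number of units * per-unit utilization.
    """
    if old_units <= 0 or new_units <= 0:
        raise ValueError("unit counts must be positive")
    return old_units / new_units - 1.0


# Going from 3 ALUs to 2 ALUs, as discussed above.
gain = required_utilization_gain(3, 2)
print(f"Each remaining unit must do {gain:.0%} more work to break even")
```

If the third unit was idle most of the time to begin with, the real break-even point would be far lower than this worst case.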

Which one is a better POV is a tough call. On the more optimistic scale, there are two things that make it believable:

1.) There's been talk before about why the hell the third ALU and AGU are there, and that they are still there mostly for historical reasons. Because optimizing fetch and decode is the harder work, and especially since there hasn't been an overhaul of archs for quite some time, it's easier to just add or retain execution units to increase IPC (retain = it would be nice to save die space, but unless IPC is improved from somewhere else, like faster cache, you don't cut out a unit). Of course, as you can imagine, this only works to a point: past a certain point adding more units increases IPC less and less, and the diminishing returns come at a great cost in power and space. Now that they have worked on a new arch and decided to go the "correct" way, they improved fetch and prediction, cut an ALU and AGU, and went their merry way. By "correct", I do not mean that the brute force way of adding execution units is the "wrong/lazy" way. Rather, I mean it is more in line with their "perf/mm2" goals: if, after optimizing fetch, the third ALU+AGU is now only adding x% of performance but at y% cost of size & power, and the relationship of x to y falls below their threshold, then it makes sense to cut the third ALU+AGU. In reality it would be more complicated than that, since optimizing and decoupling fetch and prediction didn't come for free (the module design itself isn't free), but they've accounted for those as well.

2.) Cutting out 33% of ALUs and AGUs might be significant enough to actually make performance better even if the optimizations in fetch alone could not make the IPC much better than before. After all, clockspeeds also play a big part. If IPC just increased by 5% due to optimized fetch, but cutting down ALU/AGU by 33% means smaller cores or a smaller chip and X% more clockspeed, then performance per core just went up. It's not just IPC that matters, after all, and slimming down the cores can (not guaranteed, as that's not the only thing to consider) give you more thermal headroom and better clockspeeds. Maybe the act of cutting 33% of their exec units ended up giving them more thermal rope, and that's a good thing, since it's always a balancing act between performance and the thermal envelope you need to be in. (IBM might be an exception, if they are done selling their power chips to outsiders. If all they need to service are their midranges and mainframes, they can go crazy with thermals as long as the rest of their mainframe design compensates for it, and all their Z-series already come with their own refrigeration units, even for a "single book" cage, so their thermal limit may well be beyond what Intel and AMD have for now. I have had the misfortune of having to deal with their Z-series tech manual before, "IBM System Z10 Enterprise Class Technical Guide", and while no TDP figure sticks in my memory (I don't think it ever was mentioned, but I could easily just have forgotten), I remember quite well that uniprocessor performance improved 60% (based on LSPR mixed workload) over Z9. That's ~60% single-thread improvement in our language, in one iteration: just a sample of how different IBM's playing field is, compared to us in PC world getting stuck with %s well below that in single-thread performance increase per generation due to much harder power and thermal budgets.)
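
The x% versus y% trade-off from point 1 can be sketched roughly as below. This is only an illustration: the function name, the choice of a simple ratio as the "relationship", the threshold, and every number are hypothetical, not anything AMD has published.

```python
def worth_keeping(perf_gain: float, cost: float, threshold: float) -> bool:
    """Decide whether an execution unit earns its keep.

    perf_gain: fractional performance the unit adds (the "x%" above).
    cost:      fractional size/power the unit costs (the "y%" above).
    threshold: minimum acceptable perf gained per unit of cost.
    All three are illustrative inputs, not real design data.
    """
    if cost <= 0:
        raise ValueError("cost must be positive")
    return perf_gain / cost >= threshold


# Hypothetical: the third ALU+AGU adds 5% performance for 15% more area.
print(worth_keeping(perf_gain=0.05, cost=0.15, threshold=0.5))  # False
# Hypothetical: the same unit adds 10% performance instead.
print(worth_keeping(perf_gain=0.10, cost=0.15, threshold=0.5))  # True
```

The point is only that once fetch is optimized, the same unit can flip from "keep" to "cut" without its cost changing at all.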

As for the pessimistic, well... I personally just find the cut rather unnerving - although it is too early to tell, I admit. You know, if they cut only the third AGU, I would really be unconcerned. It reminds me of a post from Matthias (Dresdenboy) in July, about ALU:AGU being 4:3 per core, and I thought "yeah, better ratio". Then of course, here comes Hot Chips and we get... 2:2. So instead of more ALU than AGU, we get less. It's understandable that with less number-crunchers (yep, that's an engineering term now) inside, now it's all up to the fetch and prediction to be loads better. In fact, we better pray Deneb had terrible instruction fetch and prediction (possible), because if it's actually pretty efficient already, we aren't getting much more, and those two ALUs better be damn good.

As I've said earlier in the thread, I'm more "wait-and-see" for Bulldozer, as compared to excited for Bobcat. However, for even more consolation, Matthias seems to be pretty pumped about Bulldozer, expecting it to be almost a 5.2GHz Phenom II in performance. However, when I saw his computation, it is no different from the joke post I made that concludes in 40% performance increase (41%, actually). The only problem there is that it is based on a marketing statement, made from basket benchmarks that are all server loads. There is no way to tell if 40% is possible or not because there is no info on what benchmarks were run. It makes all the difference because if we knew, we could see how bad (or good) the scaling is, and if it is one that hammers the FPU or not (big difference - if it isn't, then ho-hum, but if it is, then better performance can be expected). Anyway, when I saw his post, well... at least that's one person with more credibility than me who thinks Bulldozer can be pretty good, and he damn knows his stuff. Of course, it is worth noting that both our computations hinge on a marketing statement, and we know such statements (especially this early - come on, didn't they just finalize whether or not BD will be AM3 drop-in or not, right? That just isn't good news to measure BD development progress) can easily change.

jvroig · Aug 31, 2010

Triskain said:
A boost of at least 12.8% is already confirmed by the "50 % over Magny-Cours" statement, however, because of the shared architecture the behavior in single thread situations will be different, with a greater boost to be expected.

Yes, 41%. That's what I got when I did the math (before JFAMD said don't do it that way, and so I removed the post) starting from "33% more cores for 50% more performance", then factoring in "Module = 80% max throughput of CMP".
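
For anyone who wants to retrace it, here is a minimal sketch of that arithmetic. It assumes the two marketing figures can simply be divided through, which is exactly the kind of math JFAMD warned against, and the function names are made up for this example.

```python
def per_core_gain(perf_ratio: float, core_ratio: float) -> float:
    """Throughput per core relative to the old chip, with all cores loaded."""
    return perf_ratio / core_ratio


def single_thread_estimate(per_core: float, module_scaling: float) -> float:
    """Undo the module sharing penalty to estimate a lone thread's speed.

    module_scaling is the "module = 80% of CMP" figure, read here as:
    a core in a fully loaded module runs at 80% of what it would alone.
    """
    return per_core / module_scaling


per_core = per_core_gain(perf_ratio=1.50, core_ratio=1.33)
single = single_thread_estimate(per_core, module_scaling=0.80)

print(f"per-core gain with all cores busy: {per_core - 1:.1%}")
print(f"single-thread estimate:            {single - 1:.0%}")
```

This reproduces the 12.8% and 41% figures, but both inherit every assumption baked into the marketing statement.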

jvroig · Aug 31, 2010

I found no other earlier references for some of the things I mentioned in my latest wall of text (we don't all walk around with a collection of links and books, all properly bookmarked and meticulously organized in case the need to post arises, right?), so I just have to redirect you to this recent comp.arch discussion. I know, comp.arch is not exactly easy reading if you want AnandTech-like flow and clarity (Anand's an awesome writer, and he seems to be able to distil normally complicated information into "swallowable" chunks), but that's all I can give right now.

You can find some talk similar to some of my own statements, including trains of thought such as:
1. "Uselessness" of the third set of ALU & AGU
2. That the third ALU is more critical than the third AGU
3. Some sort of "support" from an ex-AMD engineer regarding frequency/GHz (~20% better), showing they did expect frequency improvement to result from all of the slimming down

This far off, nothing is certain, and even the word of an ex-engineer just can't be taken as infallible. He was speaking of targets only, and was surprised himself such information was not in the marketing slides (draw your own conclusion - maybe it was achieved but not important enough alongside perf/watt, or maybe it was not achieved, or most likely it is just too early to tell because they are still working on it).

There are far better sources and reading material, there's just no way I can dig up 2 year old long-forgotten things in comp.arch and other sites about ALU/AGU, thermal budgets, clockspeeds and what have you.

(Although, I do still have that boring IBM tech manual in PDF form. Can be used as a sleeping aid.)

bryanW1995 · Aug 31, 2010

jvroig said:
*snip*

Of course, it is worth noting that both our computations hinge on a marketing statement, and we know such statements (especially this early - come on, didn't they just finalize whether or not BD will be AM3 drop-in or not, right? That just isn't good news to measure BD development progress) can easily change.

wow, back on topic with a vengeance.

Looks like that last part is extremely important. look at barcelona, fermi, 2900, all of those were somewhat disappointing from a performance perspective. but their real failure was being VERY late in each case. BD can come out great now, and honestly I expect that it will be a great improvement for amd, but it is still so late (even if it is on time by the new standard) that SB 4 core HT should be competitive with it and skt 2011 will annihilate it.

Martimus · Aug 31, 2010

Here is an interesting diagram of the proposed differences between Bulldozer and Westmere.

From everything I have read about this architecture, AMD is targeting and relying on high frequencies to meet their performance goals (many parts of the architecture have large cycle times, which seems to point to high frequencies - along with the more aggressive prefetchers and longer pipeline).

I know I already compared BD to Netburst, but I can see even more of a connection now that I have read more of the details of the architecture changes. AMD seems to really be relying on GF to hit their targets for 32nm SOI to reach their targeted clockspeeds. I honestly wouldn't be surprised if we see a stock 4GHz four core BD SKU.

bryanW1995 · Aug 31, 2010

from reading that comp.arch thread, the former amd engineer said their internal goals were 20-25% higher clocks with 5% less ipc. so figure 15-20% more performance/core as a best case scenario, that will leave them 10-15% behind SB on up to 4 cores. figure breakeven is between 5/6, then 7-8 cores give them the advantage. very server friendly, not so much for desktop users

their only chance on desktop vs 2011 is to go to 12/16 cores.
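
A rough sketch of the math behind the 15-20% figure, assuming per-core performance is simply clock times IPC (a simplification, and the function name is made up for this example):

```python
def per_core_change(clock_gain: float, ipc_change: float) -> float:
    """Net per-core performance change, treating perf as clock * IPC."""
    return (1.0 + clock_gain) * (1.0 + ipc_change) - 1.0


# Quoted targets: 20-25% higher clocks with 5% less IPC.
low = per_core_change(clock_gain=0.20, ipc_change=-0.05)
high = per_core_change(clock_gain=0.25, ipc_change=-0.05)

print(f"low end:  {low:.2%}")
print(f"high end: {high:.2%}")
```

Multiplying the factors instead of adding the percentages gives roughly 14% to 19%, a touch under the 15-20% above.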

Kuzi · Aug 31, 2010

Thanks for the informative and interesting posts jvroig, and for bringing the thread back to life

jvroig said:
As for the pessimistic, well... I personally just find the cut rather unnerving - although it is too early to tell, I admit. You know, if they cut only the third AGU, I would really be unconcerned. It reminds me of a post from Matthias (Dresdenboy) in July, about ALU:AGU being 4:3 per core, and I thought "yeah, better ratio". Then of course, here comes Hot Chips and we get... 2:2. So instead of more ALU than AGU, we get less. It's understandable that with less number-crunchers (yep, that's an engineering term now) inside, now it's all up to the fetch and prediction to be loads better. In fact, we better pray Deneb had terrible instruction fetch and prediction (possible), because if it's actually pretty efficient already, we aren't getting much more, and those two ALUs better be damn good.

This is one area where I'm sure K10 is weak, and that's the Branch Prediction unit. Current Intel offerings and even Core2's have a more efficient unit. AMD has been working on Bulldozer for a long time, and even had it delayed for a couple of years, so they should have had lots of time to improve this area.

As I've said earlier in the thread, I'm more "wait-and-see" for Bulldozer, as compared to excited for Bobcat. However, for even more consolation, Matthias seems to be pretty pumped about Bulldozer, expecting it to be almost a 5.2GHz Phenom II in performance. However, when I saw his computation, it is no different from the joke post I made that concludes in 40% performance increase (41%, actually). The only problem there is that it is based on a marketing statement, made from basket benchmarks that are all server loads. There is no way to tell if 40% is possible or not because there is no info on what benchmarks were run. It makes all the difference because if we knew, we could see how bad (or good) the scaling is, and if it is one that hammers the FPU or not (big difference - if it isn't, then ho-hum, but if it is, then better performance can be expected).

Have you checked Anand's latest Thuban article?

http://www.anandtech.com/show/3877/...investigation-of-thuban-performance-scaling/7

The OCed NB produced some pretty good results in some cases, SC2 got a 16% increase in performance from the higher clocked NB frequency alone. AMD had to lower the NB/L3 clocks to keep thermals down, and that slightly handicapped Deneb's performance in many situations.

So that's another area where BD could improve: now that it will be produced on a smaller 32nm process with HK/MG, the leakage and heat issues facing current AMD CPUs should be alleviated. That can mean higher Core/NB clocks, and I'd think that increasing NB frequency plus tweaking memory controller/cache (lower L1/L2/L3 latency) can produce at least 10% higher performance.

Anyway, when I saw his post, well... at least that's one person with more credibility than me who thinks Bulldozer can be pretty good, and he damn knows his stuff. Of course, it is worth noting that both our computations hinge on a marketing statement, and we know such statements (especially this early - come on, didn't they just finalize whether or not BD will be AM3 drop-in or not, right? That just isn't good news to measure BD development progress) can easily change.

Let's hope Matthias is right, although a 40% increase in one jump seems almost impossible. I personally believe BD would do remarkably well even with a 20% IPC improvement, as long as power draw is improved over Deneb (which will be the case), is priced appropriately, and OCs nicely.

bryanW1995 · Aug 31, 2010

remember, they've made no claims about ipc improvement, just total single thread improvement. that could be made by ipc improvement and same clock speed or higher clock speed with same/slightly lower ipc. current evidence seems to point towards the latter.

Kuzi · Aug 31, 2010

Martimus said:
From everything I have read about this architecture, AMD is targeting and relying on high frequencies to meet their performance goals (many parts of the architecture have large cycle times, which seems to point to high frequencies - along with the more aggressive prefetchers and longer pipeline).

I know I already compared BD to Netburst, but I can see even more of a connection now that I have read more of the details of the architecture changes. AMD seems to really be relying on GF to hit their targets for 32nm SOI to reach their targeted clockspeeds. I honestly wouldn't be surprised if we see a stock 4GHz four core BD SKU.

I fully agree here. I actually wouldn't be surprised to see a stock 4GHz 8-core BD on the same process, although not initially

Riek · Aug 31, 2010

bryanW1995 said:
remember, they've made no claims about ipc improvement, just total single thread improvement. that could be made by ipc improvement and same clock speed or higher clock speed with same/slightly lower ipc. current evidence seems to point towards the latter.

the only thing that went down is the number of ALUs. a third ALU only gives a small performance boost (only <5% of code makes use of it).
Considering the biggest change is the AGUs, and that they can work simultaneously with the ALUs in BD, that alone can already result in a gain large enough to offset the <5% from the 3rd ALU. i'm pretty confident that the ipc on average will be higher than K10. What i agree with others on is that the potential peak of K10 will be higher due to the 3rd ALU. But in real applications i see enough improvement in BD to keep its ALUs better occupied (remember it will also support op fusion).

Idontcare · Aug 31, 2010

Kuzi said:
I fully agree here. I actually wouldn't be surprised to see a stock 4GHz 8-core BD on the same process, although not initially

If they can do 3.4GHz stock, and 3.6GHz turbo, on a 6-core thuban with 45nm w/o HKMG, I am going to be floored and gobsmacked if they can't do 4GHz with a 4module BD on 32nm w/HKMG while operating within the same thermals.

Consider JF has already said Interlagos is going to be 50% more performance with 33% more cores while operating within the same thermals as Magny-Cours.

People OC their 45nm X6's to 4GHz routinely. I expect to see 5GHz with BD a year after intro.

JFAMD · Aug 31, 2010

bryanW1995 said:
from reading that comp.arch thread, the former amd engineer said their internal goals were 20-25% higher clocks with 5% less ipc. so figure 15-20% more performance/core as a best case scenario, that will leave them 10-15% behind SB on up to 4 cores. figure breakeven is between 5/6, then 7-8 cores give them the advantage. very server friendly, not so much for desktop users

their only chance on desktop vs 2011 is to go to 12/16 cores.

So a former engineer who no longer works on the project says "5% less" and a current engineer who IS working on the project tells me "more IPC."

Why do you choose to believe the one with less access to today's data?
