AMD sheds light on Bulldozer, Bobcat, desktop, laptop plans


GaiaHunter

Diamond Member
Jul 13, 2008
3,732
432
126
I fully agree with what you are saying, I didn't state my concern clearly and in retrospect I was beating around the bush a bit.

When a company misrepresents, even a little, the performance of a product, it invariably profits in the short term but receives backlash from the reviewers (and opinion leaders, us), which has a long-term, very-difficult-to-quantify but very real effect on not only sales but also the reputation of the company and, by extension, the industry. I tried to be brand-agnostic in my first post, but I am worried that AMD will advertise "more cores than Intel", which will fool the average consumer into thinking that this means "better than Intel" (and maybe it will be), while really each Bulldozer "module" should *honestly* be referred to as just 1 really awesome core or maybe 1.5 cores. I really do wish the best for AMD and I don't want to see them lose credibility with short-sighted marketing.

I've no problem with AMD stating they have a bazillion cores as long as they perform like that.

I remember that during the Athlon XP era an Athlon XP 1700+ was roughly equivalent to a P4 1.7GHz, even though it ran at 1467 MHz - slightly ahead at stuff like gaming, a bit worse at encoding.

So if AMD states a dual-Bulldozer-module CPU is a quad core, and it really performs as a quad core comparable to the current generation of Phenom II, I don't see a problem.

If AMD advertised that dual-Bulldozer-module CPU as a dual-core and it performed as a quad, then a less tech-savvy person could also be misled into thinking that something else with 4 cores is superior, or that some other processor with 2 cores is equal, when that might not be the case.
 

ilkhan

Golden Member
Jul 21, 2006
1,117
1
0
A dual-core Zambezi (that is a CPU with one bulldozer module) with 80% thread scaling means it almost acts like a native dual core cpu.

The quad-core/quad-thread zambezi (2 BD modules) would be compared to a dual-core/quad-thread clarkdale (2 westmere cores with HT).

Likewise the octo-core/octo-thread zambezi (4 BD modules) would be compared to the quad-core/octo-thread sandy-bridge.
Which is exactly the downside of their terminology "adjustment".
If 2 cores = 2 threads, then it'll match a traditional dual core. If 2 cores = 4 threads, it'll match a traditional quad core. From the explanation above, I'm pretty sure it's the former.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
A dual-core Zambezi (that is a CPU with one bulldozer module) with 80% thread scaling means it almost acts like a native dual core cpu.

The quad-core/quad-thread zambezi (2 BD modules) would be compared to a dual-core/quad-thread clarkdale (2 westmere cores with HT).

Likewise the octo-core/octo-thread zambezi (4 BD modules) would be compared to the quad-core/octo-thread sandy-bridge.

Thanks for clearing this up. Based on what I saw in post #7, I was thinking Bulldozer would start as the four-core/8-thread model (which I consider at this time to be the rough equivalent of a native octo-core CPU).
 

JFAMD

Senior member
May 16, 2009
565
0
0
Here is the challenge. Outside of my powerpoint and the die etching, nobody sees modules.

If you take an interlagos, which has 16 cores by way of 8 bulldozer modules, you get the following:

When the customer selects the product, it will be called "16 core"
At boot up, the system sees 16 cores
When the OS loads, it sees 16 cores
When the application loads, it will see 16 cores
The processor will be able to handle 16 simultaneous threads

There is nowhere that anyone will ever see 8 modules, so why would we ever call it 8 core, or, even for that matter, 8 module?

I like the idea of getting to threads, but I have an issue with just using raw threads.

Take a quad core intel processor today, run an app with 4 threads active. That is your 100% baseline. Now, turn on HT. 8 threads. 10-20% performance gain. Basic math says that 8 threads = ~4.8 cores. If you did the same on an Interlagos, with ~80% uplift, 16 threads = ~14.4 cores. As you can see, there is no easy way to do this.
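That back-of-the-envelope math can be sketched in a few lines of Python; the 20% Hyperthreading and 80% CMT uplifts are the figures quoted in this thread, not measured numbers:

```python
def effective_cores(base_cores: int, second_thread_uplift: float) -> float:
    """Translate an SMT/CMT throughput uplift into 'equivalent cores'.

    base_cores: physical cores (or modules), each running one thread
    second_thread_uplift: fractional gain from a second thread per unit
    """
    return base_cores * (1.0 + second_thread_uplift)

# Quad-core Intel part, Hyperthreading on, ~20% uplift: 8 threads act like...
print(effective_cores(4, 0.20))  # 4.8

# 8-module Interlagos, ~80% uplift from the second core per module:
print(effective_cores(8, 0.80))  # 14.4
```

As the post says, the same thread count maps to very different "core equivalents" depending on the uplift, which is why raw thread counts make a poor comparison metric.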

That is why I believe that we should get off of the cores vs. threads discussion and get onto the real results.

Performance per dollar per watt. With a customer's actual application and environment. That is the truest measure.
 

Fox5

Diamond Member
Jan 31, 2005
5,957
7
81
If it performs like an 8 core, I see no problem with calling it an 8 core (and a huge win in design for AMD, the first real sign of technological leadership from AMD since the Athlon 64 era). If it performs more like a quad core with Hyperthreading, then they should market it as that.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Given the nature of the equations of the device physics that underlie power-consumption, current CMOS-based xtor processors most certainly benefit from operating at the lowest clockspeed and lowest Vcc possible while having more of those xtors working in parallel if necessary to get a particular calculation done per unit time.

Consider that for a given Vcc and clockspeed the power-consumption scales linearly with xtor count (assuming replication of active architectural functions of course) whereas increasing Vcc and frequency results in a cubic relationship to power consumption.

So if given the choice between doubling the number of cores (threads) to get 2x the work done per unit time (assuming we have a coarse-grained dual-threaded application in mind), or increasing the clockspeed and Vcc as needed so we can execute and retire both threads serially in the same period of time on a single-core (single-thread) processor (the increase will need to be >2x, as the fixed memory subsystem latency hinders clockspeed scaling efficiency), we would always choose the higher thread/core count processor if power-consumption were our primary concern.

However doubling the xtor count means cost of production increases (and non-linearly so to the disfavor of larger die), whereas doubling the operating frequency of the xtors does not necessarily incur a doubling of the production costs...so from a design and manufacturing viewpoint simply making super low clockspeed CPU's the size of wafers for use on massively threaded applications isn't exactly the path to pots of gold either.

This was a really good explanation. From this I can get a little bit of a feel for the balancing act that goes on with building these chips.

I didn't realize production costs increase non-linearly with increased xtor count and die size.
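The linear-versus-cubic tradeoff in the quoted explanation can be put into a quick numerical sketch, using the standard dynamic-power approximation P ≈ N·C·V²·f and assuming frequency scales linearly with Vcc (all constants here are arbitrary illustration values):

```python
def dynamic_power(n_cores: int, vcc: float, freq: float, cap: float = 1.0) -> float:
    """Classic CMOS dynamic-power approximation: P = N * C * V^2 * f."""
    return n_cores * cap * vcc ** 2 * freq

baseline = dynamic_power(n_cores=1, vcc=1.0, freq=1.0)

# Doubling cores at fixed Vcc/clock: power scales linearly (2x).
print(dynamic_power(2, 1.0, 1.0) / baseline)  # 2.0

# Doubling the clock, which roughly requires doubling Vcc:
# power scales as 2^2 * 2 = 8x (the cubic relationship).
print(dynamic_power(1, 2.0, 2.0) / baseline)  # 8.0
```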
 

evolucion8

Platinum Member
Jun 17, 2005
2,867
3
81
130nm Athlon XPs managed sub 10W power consumption...if you undervolted and underclocked them as far as they go. (300Mhz and they'll run passively cooled at that)
A low power cpu isn't worth much without knowing its performance profile.

A single core Athlon 64, undervolted and underclocked, will run at 800Mhz and use around 8W iirc. An Athlon X2 roughly doubles that. There was actually an Oqo sized device that used an Athlon X2, undervolted and underclocked to 800mhz.

To add more notes on power efficiency: a Pentium M running at 2.26GHz would only consume 27W at full load, and it would consume much less when idle.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Take a quad core intel processor today, run an app with 4 threads active. That is your 100% baseline. Now, turn on HT. 8 threads. 10-20% performance gain. Basic math says that 8 threads = ~4.8 cores. If you did the same on an Interlagos, with ~80% uplift, 16 threads = ~14.4 cores. As you can see, there is no easy way to do this.

That is why I believe that we should get off of the cores vs. threads discussion and get onto the real results.

Performance per dollar per watt. With a customer's actual application and environment. That is the truest measure.

These "quad core" Bulldozer processors almost sound like they have two stronger cores and two weaker cores (due to the second set of "cores" scaling at 80%).

How is this better than what Intel is doing with Turbo mode? With Turbo mode it sounds like the increased performance can be more selective....One, two or three cores could receive an increase in processing power on a native quad-core CPU. To me this sounds a lot more flexible than having two permanently stronger cores and two permanently slightly weaker cores.

That being said, I am still eager to find out how good these CPUs will be on a performance-per-watt and performance-per-dollar basis (especially the smaller Bulldozers).
 

DrMrLordX

Lifer
Apr 27, 2000
23,217
13,300
136
Atom cannot compete against ARM; their chips are already becoming inferior in every respect. In 2011, dual-core ARMs with 1GHz+ clock speeds and TDPs of a quarter of a watt will be available, and for even less cost. Even Tegra is better than the HD4500 or any other Intel graphics. Bobcat in the higher-clocked version can compete against mobile Celerons, but I see a separation happening in the mobile market; people will buy their fully functional Core 2s (or i3s then) or they will go ultra-mobile. Now Intel will still get far more sales than AMD no matter how inferior their product, but if AMD can really deliver with this chip, there is no way the current Atom architecture can come close to its combination of competent CPU power coupled with a good, DX11 IGP.

No offense, but I don't think ARM chips are going to do as well as their proponents claim unless they get some decent software-based x86 compatibility going on (such as what Transmeta used to do, and Nvidia is the only company I see with that going for them right now). Give someone an ultraportable/smartphone that runs Windows 7 and that's all Intel would need to gain an enormous foothold in the ultraportable market. Atom will eventually be able to fill that niche, and, when paired with Win7, will make a great many consumers not really care about what ARM can do. Maybe I sound crazy, but I see no reason why the resources necessary for a stripped-down Win7 couldn't be crammed into a smartphone in the near future. Intel could supply the CPU, the chipset, and the flash-based storage, so the only wildcard would be getting volatile RAM in sufficient amounts crammed into that package.

I think Bobcat will do well, but I think it will probably do better to fight against mobile celerons. Maybe I underestimate AMD's ability to move Bobcat into the ultraportable market, maybe I don't.
 

DrMrLordX

Lifer
Apr 27, 2000
23,217
13,300
136
AMD could probably produce sub 10W processors right now on its 45nm process, but the performance class might not warrant the cost to produce (though somehow Intel gets through it with its LV and ULV chips). To get down to an Atom level of power consumption (4W per core), they'd probably have to go down to an Atom level of performance, but at a much larger die size.

I agree, and I don't know that Bobcat will change that. That's why I think we're more likely to see Bobcat encroach on what's left of the netbook market (or slimtops or what have you) while Intel repositions Atom for ultraportable applications. Bobcat will probably be squaring off against die-shrunk Core2s on updated p945GSE platforms (with die-shrunk components), since, after all, there's no reason why Intel couldn't/shouldn't toss die-shrunk Core2 chips onto platforms that are roughly analogous to 945GSE at least until they're prepared to roll out i3/p55 in netbooks. Eventually I predict that we'll see Bobcat vs i3 mobile, or maybe a single-core i3 variant (Core i1?).
 

JFAMD

Senior member
May 16, 2009
565
0
0
These "quad core" Bulldozer processors almost sound like they have two stronger cores and two weaker cores (due to the second set of "cores" scaling at 80%).

How is this better than what Intel is doing with Turbo mode? With Turbo mode it sounds like the increased performance can be more selective....One, two or three cores could receive an increase in processing power on a native quad-core CPU. To me this sounds a lot more flexible than having two permanently stronger cores and two permanently slightly weaker cores.

That being said, I am still eager to find out how good these CPUs will be on a performance-per-watt and performance-per-dollar basis (especially the smaller Bulldozers).


No, there are not 2 kinds of cores, all cores are identical. Here is the scaling:

The 80% scaling is the recognition that there are some shared components. In a perfect world, 2 threads would be 200% the throughput of one thread. That is perfect scaling. Bulldozer has some shared components so you get 180% the performance of one, not 200% when you run 2 threads through the 2 cores. With Hyperthreading you get 120% running 2 threads on one core.
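Those percentages can be restated as per-thread throughput, relative to one thread on a dedicated core (a quick sketch using the quoted figures, which are design targets rather than benchmark results):

```python
def per_thread_throughput(two_thread_scaling: float) -> float:
    """Average throughput of each of 2 threads sharing a unit,
    relative to 1 thread running alone on a full core."""
    return two_thread_scaling / 2.0

print(per_thread_throughput(2.0))  # 1.0 - two fully independent cores
print(per_thread_throughput(1.8))  # 0.9 - a Bulldozer module, per the post
print(per_thread_throughput(1.2))  # 0.6 - Hyperthreading, per the post
```

Seen this way, each thread on a module runs at roughly 90% of a dedicated core, versus roughly 60% for a Hyperthreaded pair.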
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Here is the challenge. Outside of my powerpoint and the die etching, nobody sees modules.

If you take an interlagos, which has 16 cores by way of 8 bulldozer modules, you get the following:

When the customer selects the product, it will be called "16 core"
At boot up, the system sees 16 cores
When the OS loads, it sees 16 cores
When the application loads, it will see 16 cores
The processor will be able to handle 16 simultaneous threads

There is nowhere that anyone will ever see 8 modules, so why would we ever call it 8 core, or, even for that matter, 8 module?

I like the idea of getting to threads, but I have an issue with just using raw threads.

Take a quad core intel processor today, run an app with 4 threads active. That is your 100% baseline. Now, turn on HT. 8 threads. 10-20% performance gain. Basic math says that 8 threads = ~4.8 cores. If you did the same on an Interlagos, with ~80% uplift, 16 threads = ~14.4 cores. As you can see, there is no easy way to do this.

That is why I believe that we should get off of the cores vs. threads discussion and get onto the real results.

Performance per dollar per watt. With a customer's actual application and environment. That is the truest measure.

In HPC environments we speak to "thread scaling" within the context of Amdahl's law and Almasi & Gottlieb's IPC characterizations fairly routinely.

http://i272.photobucket.com/albums/jj163/idontcare_photo_bucket/Euler3DBenchmarkScaling.gif

http://i272.photobucket.com/albums/jj163/idontcare_photo_bucket/MyriMatchBenchmarkScaling.gif

(credit: The data presented in the linked graphs above came from TechReport; I just crunched it into standard thread-scaling format for ease of digestion and interpretation)

I would think the framework already exists for capturing and communicating the throughput computing enhancements that bulldozer represents over a hyperthreading system or even a more traditional multi-socket single-core=single-thread architecture.

The challenge as I see it is to transfer the understanding and appreciation of the relevance of those metrics from the HPC arena to the consumer. The HPC market has had nearly 50 years of working with and characterizing the performance of multithreaded systems, but the industry that has grown up around the consumer markets has had less time to come to terms with the vernacular.

You guys have always had the upper hand when it comes to fine-grained IPC-bound workloads versus the competition; I can only imagine an Interlagos product is really going to be quite exciting by these metrics.
 

Fox5

Diamond Member
Jan 31, 2005
5,957
7
81
I agree, and I don't know that Bobcat will change that. That's why I think we're more likely to see Bobcat encroach on what's left of the netbook market (or slimtops or what have you) while Intel repositions Atom for ultraportable applications. Bobcat will probably be squaring off against die-shrunk Core2s on updated p945GSE platforms (with die-shrunk components), since, after all, there's no reason why Intel couldn't/shouldn't toss die-shrunk Core2 chips onto platforms that are roughly analogous to 945GSE at least until they're prepared to roll out i3/p55 in netbooks. Eventually I predict that we'll see Bobcat vs i3 mobile, or maybe a single-core i3 variant (Core i1?).

Well, Bobcat does look cut down from even an Athlon 64 core (though I assume they'll keep performance at least on par).
Thankfully for AMD, there is demand for something more powerful than a 1.6GHz Atom (and the chipsets it comes with). However, I think part of the reason Atom doesn't scale higher is that Intel doesn't want it to cut into sales of more expensive ULV/LV chips, or even low-end Celeron chips.
It's a known quantity that Atom can overclock into the low-to-mid 2GHz range. Intel probably could have kept a similar die size and designed it to hit 3GHz+ (at higher power consumption). This may still be an option if AMD starts to pose a real threat in the low-power market. Again, it's not the performance or power consumption to focus on, but the cost.
Atom also lacks almost all power management options at the moment (they may simply be disabled), probably to make it compare less favorably with ULV chips. Right now Atom is 4W all the time, but I think some of the ULV chips get that low in their sleep states. What if Atom had sleep states enabled and could go to 2W, if not lower?

BTW, here's something from way back in 2004:
http://www.tomshardware.com/reviews/athlonxp-underclocking-a-low,892-18.html
An entire desktop system (using parts not optimized for low-power use, besides the CPU) uses 15W at 300MHz. Granted, my entire Pentium M laptop from around the same time period uses the same, but it has a 1.5GHz CPU and an LCD screen.

The Athlon XP @ 300MHz on a 130nm process uses as little as 4.5W. Hmm, for 2004 that's pretty impressive. I always wondered why AMD never got into the ULV game like Intel; their processors were obviously just as capable of it. Chipsets were lacking, but the desktop chipset used in that article doesn't seem to fare any worse than the chipsets paired with Atom.

The 80% scaling is the recognition that there are some shared components. In a perfect world, 2 threads would be 200% the throughput of one thread. That is perfect scaling. Bulldozer has some shared components so you get 180% the performance of one, not 200% when you run 2 threads through the 2 cores. With Hyperthreading you get 120% running 2 threads on one core.

When you say 80%, are you talking average (some higher some lower) or peak?
 

Duvie

Elite Member
Feb 5, 2001
16,215
0
71
There's a saying in business - "He who lives by price dies by price".

Being able to compete only on price is death, especially in an industry that requires the kind of funding that developing CPUs requires.

Exactly. We have seen this act before from AMD... it didn't get them much then, either....
 

JFAMD

Senior member
May 16, 2009
565
0
0
Based on what has been released on bulldozer to date, I don't think that price is the only thing we will be competing on.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
After reading little more about Bulldozer, here are my 2 cents. :)

I'm going to have to agree with JFAMD in his remarks about the 8-core Bulldozer being a 4-module/8-thread part (i.e., a module refers to 2 cores).

Single-thread performance - Hah! It should not be a big issue anymore, especially considering how the architecture is arranged. No apps are less than 2-threaded anymore. It's more important to consider the performance per core/thread. Anyway, for those that care, I think each integer core will be mostly separate from the others but might do things like take two branches or whatever. At worst, it should not be worse than K10.

Multi-thread performance - Currently the per-core size of Core 2 and Core i7 is significantly larger than that of AMD's Agena/Deneb cores. Single-thread performance is significantly better on Core 2/i7, and multi-thread performance follows. But as I said, single-thread performance doesn't matter anymore, right? Who cares, if your overall performance ends up higher? Multi-threading support on i7 specifically is even better than AMD's, not even counting SMT.

Multi-thread performance with multiple cores is bound by interconnect performance and latency. In most applications, dual cores do not give anywhere near 2x the performance. On PCs, doubling cores probably gives less than a 50% gain. Such a close connection between the two "tightly linked cores", and sharing of resources optimized for multiple threads, could turn out better than a totally separate 2-core approach. If a single Bulldozer module can do 1.8x the performance of a hypothetical single-integer-core approach, versus two separate cores which can only scale to 1.4x, it'll be a win-win for AMD with better performance AND smaller die size.

AMD cannot afford to make greater-than-350mm2 die-size chips. MCM approaches like Magny-Cours are a desperate approach to avoid looking too terrible against the competition (besides, 2x350mm2 is different from 1x700mm2).
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Single-thread performance - Hah! It should not be a big issue anymore, especially considering how the architecture is arranged. No apps are less than 2-threaded anymore.

If only that were true with Metatrader 4's strategy testing modules. Completely single-threaded and entirely CPU limited. (core/thread pegs at 100% utilization)

I'm keeping my fingers crossed though, your statement might be true in five to ten years.

Multi-thread performance - Currently the per-core size of Core 2 and Core i7 is significantly larger than that of AMD's Agena/Deneb cores. Single-thread performance is significantly better on Core 2/i7, and multi-thread performance follows. But as I said, single-thread performance doesn't matter anymore, right? Who cares, if your overall performance ends up higher? Multi-threading support on i7 specifically is even better than AMD's, not even counting SMT.

It's both, right? The ability to execute and retire a single thread in a given period of time, combined with the ability of those threads to periodically "synchronize" their results as dictated by the "graininess" of the code being executed, is what determines the absolute performance of the CPU in a multithreaded environment.

For example, the Euler3D thread-scaling performance comparison shows that Shanghai trounces Bloomfield in thread scaling with this particular application (which is impacted by both serial code as well as IPC), but the absolute performance of the systems is such that no Shanghai rig will come close to the performance of a Bloomfield with this application.

Thread scaling:
Euler3DBenchmarkScaling.gif


Absolute performance:
euler3d.gif

(source: http://techreport.com/articles.x/15905/9)

Bulldozer can take thread scaling to 11 if it likes, but if clockspeed and single-threaded performance are lackluster then absolute performance will likewise be lackluster, even if a bazillion cores/threads are thrown at the application.

(Consider SUN's Niagara microarchitecture philosophy...not exactly heralded as a performance monster despite its ridiculous thread capabilities and excellent interprocessor communications topology)

AMD cannot afford to make greater-than-350mm2 die-size chips. MCM approaches like Magny-Cours are a desperate approach to avoid looking too terrible against the competition (besides, 2x350mm2 is different from 1x700mm2).

If they can sell 343mm^2 Cypress chips for sub-$150 ASPs then I've no doubt they can survive selling 350mm^2 bulldozer chips. Production cost is only one of the two numbers used in determining gross margins and profits.

Performance determines ASP and ASP is what is critical. Production cost doubling from $15 to $30 is not as detrimental to one's gross margins as ASPs getting halved from $300 to $150.
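That margin arithmetic can be checked with a short sketch (the $15/$30 production costs and $300/$150 ASPs are the illustrative figures from this post):

```python
def gross_profit(asp: float, cost: float) -> float:
    """Gross profit per unit."""
    return asp - cost

def gross_margin(asp: float, cost: float) -> float:
    """Gross margin as a fraction of ASP."""
    return (asp - cost) / asp

print(gross_profit(300, 15), gross_margin(300, 15))  # 285 0.95
print(gross_profit(300, 30), gross_margin(300, 30))  # 270 0.9  (cost doubled)
print(gross_profit(150, 15), gross_margin(150, 15))  # 135 0.9  (ASP halved)
```

Interestingly, with these particular numbers the margin percentage comes out the same whether cost doubles or the ASP halves; the difference is the absolute gross profit per chip, which is roughly halved in the low-ASP case, hence ASP being the critical number.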

If performance sucks relative to the competition, and consequently ASPs are dismal, then at that point to be sure they'd much rather be buying those chips from GloFo for $15 instead of $30. But I doubt they are going to set out with the intent to make a chip that can be produced cheaply out of expectation that the ASPs are going to suck. Well, I hope they didn't, I guess I shouldn't be making assumptions as if I know any different.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Yea I thought of your points IDC. ;)

1st point. I'm pretty sure we won't see it with reviews.

2nd point. Actually Bloomfield's scaling on that particular graph isn't poor, just that "8 cores" in that graph for Bloomfield is 8 logical threads, aka Hyperthreading. Case in point:
http://www.techreport.com/articles.x/15818/13
http://www.techreport.com/articles.x/16656/8
(Look at the Core i7 965EE with no Hyperthreading. The speedup at 4 threads is 3.14x, which is nearly as good as the Opteron's 3.27x)

http://www.anandtech.com/bench/default.aspx?p=49&p2=56

I know about Amdahl's Law. But think about it. Do you think the efficiency of current multi-core implementations reaches anywhere NEAR 100%? How much can they pull from that before Amdahl's Law (which says the amount of single-threaded code limits multi-thread scaling) becomes a significant hindrance?

Compare the E8400 to the Q9650, the closest dual-core to quad-core comparison you can EVER get from Anand's beta Bench test above. The average improvement turns out to be 45%. If they can get the heavily multi-threaded apps from 70% scaling to near 100%, and the lightly multi-threaded apps from 25% to 70%, I can't say it'll be bad at all.
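As a back-of-the-envelope illustration, Amdahl's law can be inverted to ask what parallel fraction is consistent with a ~45% average quad-over-dual improvement. This treats the whole benchmark suite as one hypothetical workload, so take the result as indicative only:

```python
def parallel_fraction_from_speedup(r: float, n1: int = 2, n2: int = 4) -> float:
    """Invert Amdahl's law: given measured speedup r of an n2-core chip
    over an n1-core chip, solve S(n) = 1/((1-p) + p/n) for p."""
    return (r - 1.0) / (r * (1.0 - 1.0 / n2) - (1.0 - 1.0 / n1))

# A 45% quad-vs-dual gain implies roughly 77% of the work is parallel:
print(round(parallel_fraction_from_speedup(1.45), 2))  # 0.77
```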

Sharing the front end might allow significantly faster communication between the cores. For example, you could share the L1 caches, which you can't do now (or only in a limited way). You know what a snoop filter is, right? The limit with current multi-core designs is that cores need to go through the arbitration logic or the cache to communicate with each other. CMT approaches like this will allow bypassing that. In essence, it's improving the "single thread" part of multi-thread.

At the really high end, if they go MCM with this, it'll solve performance in very parallel apps.

3rd point. I'm not sure what to say exactly, but CPUs and GPUs are much different. The significant low-level refinement CPUs get compared to GPUs is probably a big reason why GPUs release every year with a new architecture, tape-out to production is significantly faster, and profit margins are lower. To roughly quote one reader: "A hand-tuned high-frequency part versus an automated low-frequency part".
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Ok, so let me conclude.

If they really do take the 1 module = 2 cores approach, where "8 cores" means 4 modules, then non-CMT implementations with much better single-thread performance might do better in highly and well-threaded apps like multimedia. But this CMT approach might do better in threaded apps that don't currently do well. As they say, they want to do those "embarrassingly parallel" apps on the GPU eventually, right? The Intel=encode/decode vs. AMD=general-purpose split might become amplified here.
 

Mothergoose729

Senior member
Mar 21, 2009
409
2
81
No offense, but I don't think ARM chips are going to do as well as their proponents claim unless they get some decent software-based x86 compatibility going on (such as what Transmeta used to do, and Nvidia is the only company I see with that going for them right now). Give someone an ultraportable/smartphone that runs Windows 7 and that's all Intel would need to gain an enormous foothold in the ultraportable market. Atom will eventually be able to fill that niche, and, when paired with Win7, will make a great many consumers not really care about what ARM can do. Maybe I sound crazy, but I see no reason why the resources necessary for a stripped-down Win7 couldn't be crammed into a smartphone in the near future. Intel could supply the CPU, the chipset, and the flash-based storage, so the only wildcard would be getting volatile RAM in sufficient amounts crammed into that package.

I think Bobcat will do well, but I think it will probably do better to fight against mobile celerons. Maybe I underestimate AMD's ability to move Bobcat into the ultraportable market, maybe I don't.

You are right of course; Intel has x86 code and the bigger marketing name and will continue to sell more products. As an intelligent consumer, though, I am looking to ARM and AMD to provide better products in the future. I have an Atom-based netbook now and love it, but I realize its shortcomings.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I know about Amdahl's Law. But think about it. Do you think the efficiency of current multi-core implementations reaches anywhere NEAR 100%? How much can they pull from that before Amdahl's Law (which says the amount of single-threaded code limits multi-thread scaling) becomes a significant hindrance?

Compare the E8400 to the Q9650, the closest dual-core to quad-core comparison you can EVER get from Anand's beta Bench test above. The average improvement turns out to be 45%. If they can get the heavily multi-threaded apps from 70% scaling to near 100%, and the lightly multi-threaded apps from 25% to 70%, I can't say it'll be bad at all.

The subject of heterogeneous processing resources and speedup was the topic of a chapter in my dissertation, I would argue that I have put a considerable amount of thinking into it and my posts here are an attempt to share some of the fruits of all that thinking.

The infrastructure for characterizing the contributions of various hardware and software limitations on overall speedup exists, you just have to use it correctly if you intend to extract meaningful interpretations from the results.

To determine hardware-based limitations to thread scaling you must first deconvolve the scaling data and remove the portion of imperfect thread scaling that is application/software dependent. That is what Amdahl's law helps us characterize.

Take the Euler3D example above: cursory analysis of the scaling data available so far indicates this code is semi-coarse-grained; roughly 96% of the computation effort can be performed in parallel while roughly 4% of the computations are serial in nature.

That sets the upper limit of what we would call "perfect scaling" for a hardware solution at the Amdahl limit (the thick red line in my graph) which itself is a function of the number of threads.

Regardless of your hardware efficiency, you simply cannot exceed the scaling limits imposed by Amdahl's law. And owing to hardware scaling inefficiencies, we lose further scaling from there.

I take the time to belabor this point because it speaks to the crux of the issue when people refer to the "efficiency of current multi-core implementations" by way of speaking to absolute scaling numbers which are convoluted by the ramifications of Amdahl's law.

With Euler3D code you will never see 100% thread scaling, never, no matter the type of microarchitecture involved.

In the absolute best-case scenario, in which the interprocessor communication topology is infinitely fast (zero latency) and infinitely wide (infinite bandwidth) and the cores are absolutely identical, sharing no resources (no fetch/decode/cache contention), the best thread scaling you will see with Euler3D in going from 1 core to 2 cores is 92%, the best scaling you could ever see in going from 2 cores to 4 cores is 85%, and the best scaling you would ever observe in going from 4 cores to 8 cores is 75%.

Thread scaling efficiency only goes down from there, solely owing to limitations of the ratio of time spent processing parallel computations versus those that must be done serially.
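Those Amdahl limits can be sanity-checked with a short script. The 96%/4% parallel/serial split is the estimate given above; the exact "scaling" percentages depend on how you normalize, so these efficiencies are in the same ballpark as, rather than identical to, the figures quoted:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: best-case speedup on n cores with parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

p = 0.96  # ~96% parallel, ~4% serial, per the Euler3D estimate above
for n in (2, 4, 8):
    s = amdahl_speedup(p, n)
    print(f"{n} cores: {s:.2f}x speedup, {s / n:.0%} of perfect scaling")
```

Even in this idealized zero-latency case, efficiency falls toward 78% by 8 cores purely from the 4% serial share.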

Further still the thread scaling efficiency declines because interprocessor communications are not infinitely fast and wide, and with the advent of hyperthreading and bulldozer we have further reduction in thread scaling because of resource contention adding idle cycles to any given thread.

What we see in the Euler3D data is that thread scaling certainly suffers from resource contention in nehalem with hyperthreading, no one is arguing any differently, but the results also show how much of that inefficiency in thread scaling with bloomfield can be eliminated by way of improving on the hardware - be it by adopting an architecture like that of Opteron or disabling HT.

The relevance of that statement is that we are here discussing the ramifications of the assured additional deterioration in thread scaling that bulldozer modules will incur (JFAMD says the penalty is 20%) over that of a more efficient (for scaling purposes) architecture as seen in current Shanghai/Istanbul systems.

Provided the right kinds of data are generated (thread scaling data), we have the tools to generate the sort of analyses that can isolate thread scaling efficiencies attributable to software versus hardware.
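As one hedged sketch of the analysis described above (the measured speedups below are invented placeholders, not real Euler3D data): dividing a measured speedup by the Amdahl-limited speedup isolates the portion of scaling loss attributable to the hardware rather than to the software's serial fraction.

```python
def amdahl_speedup(p, n):
    """Amdahl-limited (software-only) speedup on n cores, parallel fraction p."""
    return 1.0 / ((1.0 - p) + p / n)

def hardware_efficiency(measured_speedup, p, n):
    """Measured speedup as a fraction of the Amdahl limit.
    Values below 1.0 are scaling losses attributable to the hardware
    (interconnect latency/bandwidth, shared fetch/decode/cache resources),
    not to the software's serial fraction."""
    return measured_speedup / amdahl_speedup(p, n)

# Hypothetical measurements (NOT real Euler3D results), with an
# assumed parallel fraction of 0.913.
measured = {2: 1.75, 4: 2.90, 8: 4.10}
for n, s in measured.items():
    print(f"{n} threads: hardware efficiency {hardware_efficiency(s, 0.913, n):.0%}")
```

With thread-scaling curves from two architectures on the same binary, comparing these ratios at each core count attributes the difference to hardware, since the software's serial fraction cancels out.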
 
Last edited:

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Thanks for the reply. FYI, I believe the 80% number could be related to their 2009 Bulldozer presentation, where they claimed a 50% die size increase with an 80% performance boost. Now, if that's true, what are they really comparing it to? Each integer core that can't be combined with another integer core, or do speculative threading? Compared to K10 cores? Compared to the per-thread performance it can actually achieve with all the nitty gritty like speculative threading and integer core combining?

My justification that 2 cores = 1 module came from the presentation claiming 4-8 "cores" for the Zambezi version. But maybe I was reading my own speculation into it; you are right about what you said. Compared even with their OWN predecessor CPU, the Thuban, 2 cores = 1 module looks like a sidegrade.

Did I mention the current competition, the Bloomfield, which outperforms the 6-core Istanbul in highly threaded apps?: http://www.xbitlabs.com/articles/cpu/display/amd-istanbul.html

How will it do against Gulftown, which is basically a 6-core version of Bloomfield, using 8 mini "cores"?

1. Either the 8 "cores" in Zambezi (or any other code name) are really 8 modules, or
2. They are taking a strategy similar to the GPU "small-die" approach: lower performance, but cheaper to produce and sold for less.
 

jvroig

Platinum Member
Nov 4, 2009
2,394
1
81
How will it do against Gulftown, which is basically a 6-core version of Bloomfield, using 8 mini "cores"?
1. Either the 8 "cores" in Zambezi (or any other code name) are really 8 modules, or
2. They are taking a strategy similar to the GPU "small-die" approach: lower performance, but cheaper to produce and sold for less.

That's the reasoning that led me to believe the "Anand" interpretation of "Bulldozer cores" rather than the "Matthias" interpretation. Unfortunately, Anand seems to be in error, as the official AMD word c/o JFAMD seems to be unwaveringly in favor of the Matthias/Dresdenboy interpretation.

If Bulldozer has 4 modules therefore 8 cores, how will that stack up against Gulftown? What's the AMD plan behind this?
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Thanks for the reply. FYI, I believe the 80% number could be related to their 2009 Bulldozer presentation, where they claimed a 50% die size increase with an 80% performance boost. Now, if that's true, what are they really comparing it to? Each integer core that can't be combined with another integer core, or do speculative threading? Compared to K10 cores? Compared to the per-thread performance it can actually achieve with all the nitty gritty like speculative threading and integer core combining?

My justification that 2 cores = 1 module came from the presentation claiming 4-8 "cores" for the Zambezi version. But maybe I was reading my own speculation into it; you are right about what you said. Compared even with their OWN predecessor CPU, the Thuban, 2 cores = 1 module looks like a sidegrade.

Did I mention the current competition, the Bloomfield, which outperforms the 6-core Istanbul in highly threaded apps?: http://www.xbitlabs.com/articles/cpu/display/amd-istanbul.html

How will it do against Gulftown, which is basically a 6-core version of Bloomfield, using 8 mini "cores"?

1. Either the 8 "cores" in Zambezi (or any other code name) are really 8 modules, or
2. They are taking a strategy similar to the GPU "small-die" approach: lower performance, but cheaper to produce and sold for less.

I don't recall seeing any die-size tradeoff numbers in the presentations (that doesn't mean they aren't there, just that if they are I am ignorant of their existence), but JFAMD said in this thread it was more like a 5% die-size adder while netting that 80% effective thread processing capability.

I will scan the thread for the exact quote, but to paraphrase (going from memory) it is something like "with HT on Intel there is a 5% die-size increase and only a 10-20% thread performance increase, whereas with Bulldozer there is the same 5% die-size increase but an 80% thread performance increase".
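Taking those paraphrased figures at face value (they are quoted from memory above, so treat this as illustrative arithmetic only), the per-area payoff of each approach works out roughly as:

```python
# Illustrative arithmetic only: the die-size and performance percentages
# are paraphrased-from-memory figures, not official numbers.
def extra_throughput_per_extra_area(die_increase, perf_increase):
    """Marginal thread throughput gained per unit of extra die area."""
    return perf_increase / die_increase

ht = extra_throughput_per_extra_area(0.05, 0.15)  # HT: ~5% die, ~10-20% perf (midpoint)
bd = extra_throughput_per_extra_area(0.05, 0.80)  # Bulldozer module: ~5% die, ~80% perf
print(f"payoff per unit of extra area -- HT: {ht:.0f}x, Bulldozer module: {bd:.0f}x")
```

If those numbers hold, the second integer core in a module buys several times more thread throughput per transistor spent than SMT does, which is presumably the whole point of the design.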

I haven't begun to worry about dickering over the ramifications of the numbers if they are true; I am still trying to get my head firmly wrapped around what is a core and what resources are available (or shared) for two given threads in a bulldozer module.

In a very nerdy pedagogical sort of way it is all very titillating to me ;)

That's the reasoning that led me to believe the "Anand" interpretation of "Bulldozer cores" rather than the "Matthias" interpretation. Unfortunately, Anand seems to be in error, as the official AMD word c/o JFAMD seems to be unwaveringly in favor of the Matthias/Dresdenboy interpretation.

If Bulldozer has 4 modules therefore 8 cores, how will that stack up against Gulftown? What's the AMD plan behind this?

In sort of an ironic twist, the Matthias interpretation was that integer cores capable of processing higher-bit instructions would be busted up (broken down into clusters) so that more integer instructions of lower bit length could be processed in parallel, but it turns out this is actually what AMD did with the FPU and not the INT units. They doubled the INT units, and the FPU, but only the FPU is a "cluster" in the spirit of the term as it was being applied to Bulldozer via the Matthias interpretation.

(I love saying "the Matthias interpretation" in my head, it seems so clandestine and super-spy, like the Bourne Identity or some such...the dude is soooo going to get a wiki page out of all this, lol)

Regarding "8 core" zambezi versus Gulftown...I lol'ed when I saw the term "castrated core" on semiaccurate forum (don't hate me for admitting to lurking there)...yeah I can see the question now that will be posed in many threads to come will be "how do 8 castrated cores square up against 6 "real" cores in Gulftown?"

My expectation is that they will be quite comparable in terms of IPC capabilities per "thread" which means actual performance will, once again, come down to how well an 8-core bulldozer clocks compared to gulftown (or 4C/8T sandy) at the time.

Look at how well Phenom II does compared to Core 2 Yorkfield...Phenom II debuted a year after Yorkfield, and it basically closed the IPC gap (more or less) and has steadily driven up clockspeeds. There is no reason to assume or expect Bulldozer won't make a similar closure of the gap to Westmere's IPC. A big question on our minds should be what Sandy will bring that Bulldozer will also have to compete with...
 
Last edited: