AMD sheds light on Bulldozer, Bobcat, desktop, laptop plans


JFAMD

Senior member
May 16, 2009
565
0
0
I think the difference between 50% and 5% might be the difference between marketing and engineering. Engineers tend to be very literal.

If 2 cores get you 180% performance of 1, then in simple terms, that extra core is 50% that gets you the 80%.

What I asked the engineering team was "what is the actual die space of the dedicated integer portion of the module"? So, for instance, if I took an 8 core processor (with 4 modules) and removed one integer core from each module, how much die space would that save. The answer was ~5%.

Simply put, in each module, there are plenty of shared components. And there is a large cache in the processor. And a northbridge/memory controller. The dies themselves are small in relative terms.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Does Intel hyperthreading reduce efficiency when the processor is dealing with a smaller number of threads?

For example, could a quad core with hyperthreading perform worse than a quad core without hyperthreading, provided the number of threads does not exceed four? What about when overclocked? If so, how much difference are we talking?
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Does Intel hyperthreading reduce efficiency when the processor is dealing with a smaller number of threads?

For example, could a quad core with hyperthreading perform worse than a quad core without hyperthreading, provided the number of threads does not exceed four? What about when overclocked? If so, how much difference are we talking?

The answer is that yes it does, but not because of the hardware per se; it's because of thread migration and scheduling as dictated by the OS in question.

Kind of like how Vista w/o Trim vs Win7 w/trim can make all the difference in the performance of your Intel or OCZ SSD.

Check out IntelUser's link above, where he highlighted the thread-scaling difference for i7 with and without HT in the Euler3D benches.

Anytime the shared resources become critical to the execution speed of the threads themselves, increasing the number of threads/cores sharing those resources can degrade performance. We just hope those cases are rare.

Take either a modern-day i7 or PhII with their shared L3$...in theory you could end up with a program that critically depends on the L3$ size for just a single thread, and adding a second thread that also needs that L3$ space suddenly results in considerable IMC activity and DRAM fetches. Now all of a sudden both threads are stalling like crazy and your speedup actually becomes worse than if you had just stuck with a single thread. It can happen, though only with extreme rarity, but the concerns are legitimate.
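To make the contention scenario concrete, here is a minimal toy model (not tied to any real CPU; the cache size, hit/miss costs, and working-set size are all made-up numbers) showing how a second thread sharing an L3 can push the combined working set out of cache and leave aggregate throughput below the single-thread case:

```python
# Toy model of shared-L3 contention: illustrative numbers only.

L3_SIZE_MB = 8          # assumed shared L3 capacity
HIT_COST = 1.0          # relative cost of a cache hit
MISS_COST = 20.0        # relative cost of a DRAM fetch

def avg_access_cost(working_set_mb, threads):
    """Average memory-access cost per thread when `threads` threads each
    touch `working_set_mb` of data and split the L3 evenly."""
    effective_share = L3_SIZE_MB / threads
    hit_rate = min(1.0, effective_share / working_set_mb)
    return hit_rate * HIT_COST + (1.0 - hit_rate) * MISS_COST

one_thread = avg_access_cost(6, 1)    # 6 MB working set fits in L3 -> fast
two_threads = avg_access_cost(6, 2)   # combined 12 MB spills -> both stall

# Throughput grows with thread count but shrinks with access cost.
print(1 / one_thread)    # single-thread throughput (1.0)
print(2 / two_threads)   # two-thread throughput (~0.27) -- worse overall
```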
 

DrMrLordX

Lifer
Apr 27, 2000
23,217
13,300
136
Well, Bobcat does look cut down from even an Athlon 64 core (though I assume they'll keep performance at least on par).

It will be interesting to see what performance from Bobcat will be like in production chips, though I would think that Bobcat will have process technology on its side (at least) when comparing it to k8. It will also be interesting to see how many cores AMD decides to deploy. I could easily see four Bobcat cores in a netbook, or at least two.

Thankfully for AMD, there is demand for something more powerful than a 1.6GHz Atom (and the chipsets it comes with). However, I think that part of the reason Atom doesn't scale higher is that Intel doesn't want it to cut into sales of more expensive ULV/LV chips, or even low-end Celeron chips.
It's a known quantity that Atom can overclock to the low-to-mid 2GHz range. Intel probably could have kept a similar die size and designed it to hit 3GHz+ (at higher power consumption). This may still be an option if AMD starts to pose a real threat in the low-power market. Again, it's not the performance or power consumption to focus on, but the cost.

I would agree that Atom has been "kept in its place", so to speak, mostly by the chipset that hobbles it, but also by some of the other issues you brought up.

Atom also lacks almost all power management options at the moment (they may simply be disabled), probably to make it compare less favorably with ULV chips. Right now, Atom is 4W all the time, but I think some of the ULV chips get that low in their sleep states. What if Atom had sleep states enabled and could go to 2W, if not lower?

A 32nm Atom with sleep states could probably be rolled out at 2.5-3GHz in dual-core form and stay within its current 4W power envelope (or something very close to it), or at 2GHz in dual-core form at a lower power envelope. Whether or not that would be competitive with a Bobcat dual or quad system remains to be seen, though it would probably be a lot cheaper.

After that, it's a matter of figuring out what consumers want in different market segments. Personally I think netbook buyers will want more performance with the same battery life, which is where I think Bobcat could win out (at least over next-gen Atom anyway). Then Intel will have to move other products into the netbook market which is where things could get interesting.

The Athlon XP @ 300MHz on a 130nm process uses as little as 4.5W. Hmm, for 2004, that's pretty impressive. I always wondered why AMD never got into the ULV game like Intel; their processors were obviously just as capable of it. Chipsets were lacking, but the desktop chipset used in that article doesn't seem to fare any worse than the chipsets paired with Atom.

My guess is lack of vision or lack of R&D budget. Also keep in mind that was an era when Intel was still engaging in incentive programs to keep AMD chips out of certain market segments, which may or may not have had an influence on AMD's ability to penetrate new markets. Or maybe they just didn't see the need for a 300MHz Athlon XP. Or maybe they couldn't get them running at such low voltages reliably based on their own binning practices; just because Tom's could do it with one system to their own satisfaction doesn't necessarily mean that the same parts would have cleared AMD's QA processes. To further complicate things, AMD wasn't in as much control of their own platforms back then as they are now, so it would have been a matter of the chipsets passing Nvidia QA and then the mobos passing OEM QA before being ready for sale.

I'm sure it could have been done, but I have doubts about how many of the parts AMD and Nvidia produced back then could have run reliably at those speeds and power levels.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Given the nature of the equations of the device physics that underlie power-consumption, current CMOS-based xtor processors most certainly benefit from operating at the lowest clockspeed and lowest Vcc possible while having more of those xtors working in parallel if necessary to get a particular calculation done per unit time.

Consider that for a given Vcc and clockspeed the power-consumption scales linearly with xtor count (assuming replication of active architectural functions of course) whereas increasing Vcc and frequency results in a cubic relationship to power consumption.
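A rough numeric sketch of that relationship, using the usual dynamic-power approximation (P ~ C * Vcc^2 * f); the transistor counts, voltages, and frequencies below are purely illustrative:

```python
# Dynamic CMOS switching power scales linearly with xtor count and frequency
# and quadratically with Vcc; raising frequency usually forces Vcc up too,
# giving the roughly cubic behavior described above. Numbers are made up.

def dynamic_power(xtors, vcc, freq, cap_per_xtor=1e-15):
    return xtors * cap_per_xtor * vcc**2 * freq

baseline = dynamic_power(xtors=1e9, vcc=1.0, freq=3e9)

# Option A: double the cores (2x xtors), same Vcc and clock -> 2x power.
two_cores = dynamic_power(xtors=2e9, vcc=1.0, freq=3e9)

# Option B: one core at double the clock, which also needs a higher Vcc
# (assumed 1.3 V here) -> well over 2x power for the same nominal work rate.
double_clock = dynamic_power(xtors=1e9, vcc=1.3, freq=6e9)

print(two_cores / baseline)     # 2.0x
print(double_clock / baseline)  # ~3.4x
```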

So if given the choice of (a) doubling the number of cores (threads) to get 2x the work done per unit time (assuming we have a coarse-grained dual-threaded application in mind), or (b) increasing the clockspeed and Vcc as needed (which will need to be >2x, as the fixed memory-subsystem latency hinders clockspeed-scaling efficiency) so we can execute and retire both threads serially in the same period of time on a single-core (single-thread) processor, we would always choose the higher thread/core-count processor if power consumption were our primary concern.

However, doubling the xtor count means the cost of production increases (and non-linearly so, to the disfavor of larger dies), whereas doubling the operating frequency of the xtors does not necessarily incur a doubling of the production costs...so from a design and manufacturing viewpoint, simply making super-low-clockspeed CPUs the size of wafers for use on massively threaded applications isn't exactly the path to pots of gold either.
This was a really good explanation. From this I can get a little bit of a feel for the balancing act that goes on with building these chips.

I didn't realize production costs increase non-linearly with increased xtor count and die size.

Coincidentally Intel just published an article today on semiconductor.net discussing this very subject matter:

Intel Chip Vision: Run Slow to Stay Cool

To keep power within reasonable limits, Intel Corp.'s director of MPU research, Shekhar Borkar, has a vision of microprocessors with many hundreds of small cores running at slow frequencies, using extremely low operating voltages that hug the threshold voltage.

Borkar said tomorrow's microprocessors may operate much as today's watch chips, with very low operating voltages that are barely above the threshold voltage. Although today's MPUs have a Vdd of ~1 V and a Vt of perhaps 0.3 V, processors a decade in the future might have many transistors operating at 0.4 Vdd, barely a tenth of a volt higher than the Vt. Although he acknowledged that those cores would be very slow, the issue will be forced by the need to keep power consumption under control.


Running cores at very low frequencies, with the operating voltage near the threshold voltage, may be required.

http://www.semiconductor.net/article/389668-Intel_Chip_Vision_Run_Slow_to_Stay_Cool.php

How cool is that? Data and all, 320mV!
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
A 32nm Atom with sleep states could probably be rolled out at 2.5-3GHz in dual-core form and stay within its current 4W power envelope (or something very close to it), or at 2GHz in dual-core form at a lower power envelope. Whether or not that would be competitive with a Bobcat dual or quad system remains to be seen, though it would probably be a lot cheaper.

Possible, but not likely. The "Lincroft" Atom will be on the 45nm SoC process, which is high-performing for an SoC process but still a loss compared to the 45nm HP process. Quite likely the 32nm Atom, Medfield, will be on the 32nm SoC process.

I don't recall seeing any die-size tradeoff numbers from the presentations (doesn't mean they aren't there, I'm just saying that if they are then I am ignorant of their existence), but JFAMD said in this thread it was more like a 5% die adder while netting that 80% effective thread-processing capability.

Slight mistake. It was from their presentation about various Multi-threading technologies.

Here: http://data5.blog.de/media/732/3663732_9bc35365d1_l.png

It might not exactly be Bulldozer, but it was made back when 2009 Bulldozer was being talked about.

And what JFAMD said: " I think the difference between 50% and 5% might be the difference between marketing and engineering. Engineers tend to be very literal.

If 2 cores get you 180% performance of 1, then in simple terms, that extra core is 50% that gets you the 80%."

I guess it can be interpreted as: "If a hypothetical single-core (mini) Bulldozer-based CPU without CMT-like capabilities were compared to the CMT-enabled 1-module/2-core version, the performance improvement would be 80% and the die-size increase would be 50%".

Look at how well Phenom II does compared to Core 2 Yorkfield...Phenom II debuted a year after Core 2 Yorkfield and it basically closed the IPC gap (more or less) and has steadily driven up those clockspeeds.

There is still an IPC advantage of approximately 10% per clock.

Core 2 Q9650 3GHz vs Phenom II X4 940 3GHz: http://www.anandtech.com/bench/default.aspx?p=49&p2=80

You can see that the Phenom II X4 outperforms the Core 2 Quad in the latter part of the multi-threaded benches.

Now Phenom II X2 550BE 3.1GHz vs Core 2 Duo E8400 3.0GHz

http://www.anandtech.com/bench/default.aspx?p=56&p2=97

You can see there are no cases where the Core 2 Duo loses to the slightly higher-clocked Phenom II X2.

How did it go from losing in some apps at the same clock speed to never losing at a lower clock speed? Probably because the Core 2 Quad is two dual cores kludged together, aka an MCM, while the Phenom II X4 isn't an MCM of two Phenom II X2s.

Nehalem isn't a kludged core, but a well-architected quad core with a fast memory controller and interconnects.
 
Last edited:

DrMrLordX

Lifer
Apr 27, 2000
23,217
13,300
136
Possible, but not likely. The "Lincroft" Atom will be on the 45nm SoC process, which for an SoC process its high performing, but still a loss from the 45nm HP process. Quite likely the 32nm Atom, Medfield will be on the 32nm SoC process.

So you don't think Intel will bother pushing clockspeed on Medfield much? I really don't think it will be a good idea for Intel not to force the issue of performance in that segment. Though, if Tom's Hardware is to be believed (http://www.tomshardware.com/news/Intel-Atom-Medfield,6671.html - yes, I know it's a fairly old article), AMD is going to avoid the netbook market altogether.
 

JFAMD

Senior member
May 16, 2009
565
0
0
If you start to deconstruct a bulldozer die, you start to see that the *unique* integer components are a small piece.

Start with probably half the die being cache (I am guessing, I don't have the numbers, only the die size). Then, you have to consider the northbridge circuits. Then the memory controller and the HT PHY. What you are left with is the bulldozer modules. Of those, the fetch, decode, L2 cache and FPU are all shared, so they would not go away by removing the second core from each module.

You start to see that if you removed one integer core from each module (by taking out the discrete circuitry), you really are talking about a very small part of the die.
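As a back-of-envelope illustration of that breakdown (the area fractions below are assumptions loosely based on the "probably half the die is cache" guess above, not real Bulldozer floorplan data):

```python
# Hypothetical die-area budget; only the shape of the argument matters.
die_area = {
    "L2/L3 cache":                 0.50,  # "probably half the die" (a guess)
    "northbridge + IMC + HT PHY":  0.20,  # assumed
    "shared module front-end/FPU": 0.20,  # fetch, decode, FPU, etc. (assumed)
    "dedicated integer cores":     0.10,  # both integer cores of every module
}

assert abs(sum(die_area.values()) - 1.0) < 1e-9

# Removing one of the two integer cores from every module only removes half
# of the "dedicated integer cores" slice:
saving = die_area["dedicated integer cores"] / 2
print(f"die-area saving from dropping one core per module: ~{saving:.0%}")  # ~5%
```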
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
So you don't think Intel will bother pushing clockspeed on Medfield much? I really don't think it will be a good idea for Intel not to force the issue of performance in that segment. Though, if Tom's Hardware is to be believed (http://www.tomshardware.com/news/Intel-Atom-Medfield,6671.html - yes, I know it's a fairly old article), AMD is going to avoid the netbook market altogether.

The whole issue with calling something a "netbook" when it has a processor like Bobcat or Atom is that, depending on what the manufacturer wants to do, it might be a "netbook" or become a "notebook"; it's hard to draw the line. Doesn't it seem likely the market will at least put Bobcat into the high-end netbook segment, given the rough die and power estimations they gave us? The 1-10W range suggests we'll see single- and dual-core versions, with lower- and higher-TDP versions of each. It looks like AMD will fill the niche Via is leaving behind, that is, a notch higher than Atom but lower than regular laptop CPUs.

Possibly, we might see dual-core versions of Medfield and its derivatives. Maybe that and some architectural enhancements. Clock speed increases don't seem too efficient. Dual cores were already hinted at with Moorestown, and an integrated memory controller on an in-order CPU should do a lot of good.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I think the difference between 50% and 5% might be the difference between marketing and engineering. Engineers tend to be very literal.

If 2 cores get you 180% performance of 1, then in simple terms, that extra core is 50% that gets you the 80%.

What I asked the engineering team was "what is the actual die space of the dedicated integer portion of the module"? So, for instance, if I took an 8 core processor (with 4 modules) and removed one integer core from each module, how much die space would that save. The answer was ~5%.

Simply put, in each module, there are plenty of shared components. And there is a large cache in the processor. And a northbridge/memory controller. The dies themselves are small in relative terms.

If you start to deconstruct a bulldozer die, you start to see that the *unique* integer components are a small piece.

Start with probably half the die being cache (I am guessing, I don't have the numbers, only the die size). Then, you have to consider the northbridge circuits. Then the memory controller and the HT PHY. What you are left with is the bulldozer modules. Of those, the fetch, decode, L2 cache and FPU are all shared, so they would not go away by removing the second core from each module.

You start to see that if you removed one integer core from each module (by taking out the discrete circuitry), you really are talking about a very small part of the die.

I can't speak for anyone else beyond myself, but I think it is perfectly clear what you are communicating.

~10% of the die is integer units; remove half of those integer units and you've reduced the die size by ~5%.

I can see the flip-side of the question being "if AMD's engineers knew they didn't need to make the module's other shared components as beefy as they did because that 5% would end up being removed then how much smaller would the rest of the die have been?".

Presumably the shared resources were beefed up to diminish some of the anticipated performance degradation that would come from resource contention, L2$ sizes are larger than they otherwise would have been, etc. (i.e. there is a reason the expected thread scaling efficiency is 80% and not 70% or 50% or 10% for two threads in a bulldozer module)

So if you removed half the integer units and also removed the excess portions of the shared resources (basically re-optimized the architecture to handle one thread instead of two), then how much smaller still would the die have become?

In that case I imagine the value does approach 50%, sans the non-thread-count-scaling components such as the IMC, etc.

edit: Something occurred to me since I made this post...and that is perhaps the shared resources would actually be left in place even in a hypothetical bulldozer module in which half the Integer units were removed because one advantage of having shared resources is that 100% of those resources are available for assisting the processing rate of a single-thread in the event that only a single-thread is tasked to a given bulldozer module. So for the sake of single-thread/module performance reasons those beefed up shared resources would be retained anyways because they are dual-purpose and removing half the integer units only eliminates one of the two purposes served by those shared resources.
 
Last edited:

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
The whole issue with calling something a "netbook" when it has a processor like Bobcat or Atom is that, depending on what the manufacturer wants to do, it might be a "netbook" or become a "notebook"; it's hard to draw the line. Doesn't it seem likely the market will at least put Bobcat into the high-end netbook segment, given the rough die and power estimations they gave us? The 1-10W range suggests we'll see single- and dual-core versions, with lower- and higher-TDP versions of each. It looks like AMD will fill the niche Via is leaving behind, that is, a notch higher than Atom but lower than regular laptop CPUs.

Possibly, we might see dual-core versions of Medfield and its derivatives. Maybe that and some architectural enhancements. Clock speed increases don't seem too efficient. Dual cores were already hinted at with Moorestown, and an integrated memory controller on an in-order CPU should do a lot of good.

Has it been mentioned explicitly for Bobcat yet whether or not it will be a fusion-product with on-die IGP and possibly dual-purposing as an APU?

If it is then that would certainly step-up bobcat's game when it came to competing with Atom/Via/Ion out there.
 

JFAMD

Senior member
May 16, 2009
565
0
0
I can't speak for anyone else beyond myself, but I think it is perfectly clear what you are communicating.

~10% of the die is integer units; remove half of those integer units and you've reduced the die size by ~5%.

I can see the flip-side of the question being "if AMD's engineers knew they didn't need to make the module's other shared components as beefy as they did because that 5% would end up being removed then how much smaller would the rest of the die have been?".

Presumably the shared resources were beefed up to diminish some of the anticipated performance degradation that would come from resource contention, L2$ sizes are larger than they otherwise would have been, etc. (i.e. there is a reason the expected thread scaling efficiency is 80% and not 70% or 50% or 10% for two threads in a bulldozer module)

So if you removed half the integer units and also removed the excess portions of the shared resources (basically re-optimized the architecture to handle one thread instead of two), then how much smaller still would the die have become?

In that case I imagine the value does approach 50%, sans the non-thread-count-scaling components such as the IMC, etc.

edit: Something occurred to me since I made this post...and that is perhaps the shared resources would actually be left in place even in a hypothetical bulldozer module in which half the Integer units were removed because one advantage of having shared resources is that 100% of those resources are available for assisting the processing rate of a single-thread in the event that only a single-thread is tasked to a given bulldozer module. So for the sake of single-thread/module performance reasons those beefed up shared resources would be retained anyways because they are dual-purpose and removing half the integer units only eliminates one of the two purposes served by those shared resources.

You are absolutely correct. Pretty soon you start to get into angels dancing on the head of a pin territory. I am a simple guy at heart, I like to address technical things in simplistic terms. Someone can add an asterisk to just about anything that someone else says, but as long as the general premise was directionally correct, they get into diminishing returns.

And, if someone wanted to split hairs, I would actually have a lower starting point because it was actually in the 4-point-something range. I said 5% because there are people who take things too literally, so I tend to be conservative in my statements.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Someone can add an asterisk to just about anything that someone else says, but as long as the general premise was directionally correct, they get into diminishing returns.

Beautifully stated. That should be included in our forum's TOS.
 

deimos3428

Senior member
Mar 6, 2009
697
0
0
Bulldozer has some shared components so you get 180% the performance of one, not 200% when you run 2 threads through the 2 cores. With Hyperthreading you get 120% running 2 threads on one core.
It's really not surprising that a dual core scales better than a single core with HT, though, is it? I'm more interested in how it compares to other apples, namely, how does a single BD module scale vs. other dual cores?

Using the data provided by Idontcare above, it looks like the Opterons are getting 76-92% scaling going from single core to dual, 69-73% going from dual to quad, and 54-58% going from quad to octo. The Xeons weren't faring anywhere near as well, with ranges of 68-76%, 46-53%, and 39-45%. (I ignored the 6-thread info completely, but otherwise there's no fancy math involved here, just dividing the latter by the former and subtracting one to get the scaling as the number of cores doubles.)
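For reference, the arithmetic being described is just this (the benchmark scores below are made up, since the actual Euler3D numbers aren't reproduced here; they're chosen only to land in roughly the ranges quoted above):

```python
# Incremental scaling as the core count doubles: divide the latter score by
# the former and subtract one. Scores are hypothetical placeholders.

def incremental_scaling(perf_n, perf_2n):
    return perf_2n / perf_n - 1.0

scores = {1: 10.0, 2: 18.4, 4: 31.3, 8: 48.5}  # made-up throughput numbers

for n in (1, 2, 4):
    pct = incremental_scaling(scores[n], scores[2 * n])
    print(f"{n} -> {2 * n} cores: {pct:.0%} scaling")
```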

In that light if the BD is scaling at about 80%, it would seem most useful beyond four threads, as we've already got roughly equivalent levels of scaling in the Opteron for four threads or less.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
In that light if the BD is scaling at about 80%, it would seem most useful beyond four threads, as we've already got roughly equivalent levels of scaling in the Opteron for four threads or less.

deimos3428, my current interpretation of the "80%" number is that you would reduce existing thread scaling to 0.8x and arrive at the Bulldozer thread scaling equivalent assuming you could hold all other limitations to scaling constant (degree of software parallelism, latency and bandwidth of interprocessor communications topology, single-threaded IPC capability, etc).

So if an otherwise equivalent CPU generated a speedup of, say, 6x with 8 cores (so thread-scaling efficiency equals 75% for the given app at 8 threads), we should expect an 8-core Bulldozer CPU to produce a speedup of 0.8 x 6 = 4.8x, for an overall thread-scaling efficiency of 60% (again, just for that app and only for 8-thread comparison purposes).
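In code form, that interpretation is just a multiplicative factor applied to whatever scaling the comparison CPU achieves (a sketch under that assumption, not anything AMD has published):

```python
# Apply the assumed 0.8x module factor to an existing CPU's measured speedup.

def bulldozer_equivalent(baseline_speedup, cores, module_factor=0.8):
    speedup = baseline_speedup * module_factor
    return speedup, speedup / cores   # speedup and thread-scaling efficiency

print(bulldozer_equivalent(6.0, 8))   # (4.8, 0.6) -> 4.8x speedup, 60% efficiency
```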

So the next question would naturally be "why make a cpu that gives you 60% thread scaling in app XYZ for 8 threads if I can get 75% thread scaling in the same app for the same number of threads on a different CPU?".

The answer would be three-fold...first, the "cost" of the Bulldozer chip would presumably be less than that of the full-fledged octo-core comparison chip, because incorporating four of those eight thread processors only increased the die size by 5%. So it favors the price/performance end of things.

Second is that absolute performance might very well still be higher if the IPC per thread is higher despite the lower thread-scaling efficiency (this is the case for 8 threads on bloomfield w/HT vs. a dual-socket shanghai opteron in the Euler3D bench). Again this would speak to price/performance.

And a third reason would be that, despite having lower overall thread efficiency, owing to the reduced footprint of the module itself over that of implementing two fully isolated cores, AMD (and Intel) can elect to "throw more cores" at the problem in an effort to boost the absolute performance higher (enter Magny-Cours, Interlagos, Beckton w/HT) regardless of the diminishing thread-scaling efficiency (Amdahl's Law) incurred by doing so.
 
Last edited:

deimos3428

Senior member
Mar 6, 2009
697
0
0
The answer would be three-fold...first, the "cost" of the Bulldozer chip would presumably be less than that of the full-fledged octo-core comparison chip, because incorporating four of those eight thread processors only increased the die size by 5%. So it favors the price/performance end of things.

Second is that absolute performance might very well still be higher if the IPC per thread is higher despite the lower thread-scaling efficiency (this is the case for 8 threads on bloomfield w/HT vs. a dual-socket shanghai opteron in the Euler3D bench). Again this would speak to price/performance.

And a third reason would be that, despite having lower overall thread efficiency, owing to the reduced footprint of the module itself over that of implementing two fully isolated cores, AMD (and Intel) can elect to "throw more cores" at the problem in an effort to boost the absolute performance higher (enter Magny-Cours, Interlagos, Beckton w/HT) regardless of the diminishing thread-scaling efficiency (Amdahl's Law) incurred by doing so.
Thanks for that excellent explanation.
 

Mothergoose729

Senior member
Mar 21, 2009
409
2
81
In order for Atom to remain a cutting-edge product for consumers, Intel's engineers need to add OOO back into the chip. Without it the processor will always seem "almost fast", and much more competent chips that do have it will quickly come to market and take its place. Intel also needs to get their graphics situation sorted out, and pretty quick. Consumers expect their laptops to be able to do anything, most specifically surf the internet and play all forms of digital media. Netbooks can't really do that with Intel IGPs. Asus just recently announced that they will now ship all their netbooks with Ion; being the first mass producer of Atom netbooks, that should say to Intel "hint, hint, make better graphics or lose your valuable IGP market".
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Why was OOO processing not implemented in Atom to begin with? Was a public reason ever stated?

I'd assume that while OOO improves performance, it doesn't do so while adhering to the 2:1 rule - "a 2% performance increase can increase power-consumption by no more than 1%".
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Goto-san has put up his weekly digest with Bulldozer being the topic this week:

Bulldozer: intermediate between dual-core and Hyper-Threading

[kaigai2.jpg: the article's diagram comparing dual-core CMP, Bulldozer's module approach, and SMT]


The dual-core CMP at the top is the simplest type: each of the two CPU cores runs its own single thread. In this case there is essentially no sharing of resources, limited at most to a small portion of the cache, so each CPU core can run its thread completely undisturbed.

CMP gives high performance, but because you implement a full set of resources for each of the two CPU cores, the implementation cost is high. As a simple calculation, each CPU core costs 100% and delivers up to 100% of the theoretical performance; a dual-core needs twice the resources of a single-core, and a quad-core four times.

SMT, at the bottom, is the idea of extending a single CPU core so that two threads can run on it. Most resources - the computational resources as well as the caches, the scheduling mechanism, and instruction fetch and decode - are shared by the two threads. In Intel's implementation, only a handful of resources, such as registers and buffers, are dedicated to each thread.

http://pc.watch.impress.co.jp/docs/column/kaigai/20091120_330076.html

I liked the graphic as it simplistically illustrates the implicit trade-off between the degree of shared resources and the impact on performance and cost from doing so.
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
Why was OOO processing not implemented in Atom to begin with? Was a public reason ever stated?

I'd assume that while OOO improves performance, it doesn't do so while adhering to the 2:1 rule - "a 2% performance increase can increase power-consumption by no more than 1%".

"The decision to go in-order eliminated the need for much complex, power hungry circuitry. While you get good performance from out-of-order execution, the corresponding increase in scheduling complexity was simply too great for Atom at 45nm. Keep in mind that just as out-of-order execution wasn't feasible on Intel CPUs until the Pentium Pro, there may come a time where transistor size is small enough that it makes sense to implement an OoOE engine on Atom. I don't expect that you'll see such a change in the next 5 years however."
http://www.anandtech.com/showdoc.aspx?i=3276&p=6

They may yet make an OOO Atom chip, but I would expect them to continue reducing power consumption so they can get this chip into smartphones and similar devices. We may see a divergence where they create a more powerful Atom for netbooks and a less power-hungry Atom for smaller devices.

EDIT: Also, I believe you are confusing Intel's 1% rule with their old 2% rule. It used to be that Intel could add a feature to a microprocessor design if it gave a 1% increase in performance for at most a 2% increase in power. However, this rule was changed so that now Intel may only add a feature if it yields a 1% increase in performance for at most a 1% increase in power consumption. Although I could definitely be wrong when it comes to Atom, and they may have the rule you stated specifically for that platform.
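A tiny sketch of that design rule as a budget check (my own illustrative framing of the 1%-for-1% and 1%-for-2% rules, not anything from Intel's actual methodology):

```python
# A feature passes if its power cost per percent of performance gained stays
# within the allowed ratio: 2.0 under the old rule, 1.0 under the newer one.

def feature_allowed(perf_gain_pct, power_cost_pct, max_power_per_perf=1.0):
    return power_cost_pct <= perf_gain_pct * max_power_per_perf

print(feature_allowed(1.0, 1.8, max_power_per_perf=2.0))  # old rule: True
print(feature_allowed(1.0, 1.8, max_power_per_perf=1.0))  # new rule: False
```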
 
Last edited:

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Intel's Atom: Changing Intel from the Inside

For years at Intel the rule of thumb for power vs. performance was this: a designer could add a feature to a microprocessor design if you get a 1% increase in performance for at most a 2% increase in power. Unfortunately, it's thinking like that which gave us the NetBurst architecture used in the Pentium 4 and its derivatives.

The Intel Atom was the first Intel CPU to redefine the rule of thumb and now the requirement is that a designer may add a feature if it yields a 1% increase in performance for at most a 1% increase in power consumption. It's a pretty revolutionary change and it's one that will be seen in other Intel architectures as well (Nehalem comes to mind), but Atom was the first.

While Atom started as a single-issue, in-order microprocessor the Austin team quickly widened it to be a dual-issue core. The in-order decision stuck however.

Yeah this is what I was thinking of when I made my post.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Coincidentally Intel just published an article today on semiconductor.net discussing this very subject matter:



How cool is that? Data and all, 320mV!

If you think that's bizarre, maybe 6-7 years ago I was reading a paper on sub-threshold logic where things never really turn off or on. Of course, I think they were talking about clock frequencies on the order of kHz.
 

Fox5

Diamond Member
Jan 31, 2005
5,957
7
81
Why was OOO processing not implemented in Atom to begin with? Was a public reason ever stated?

I'd assume that while OOO improves performance, it doesn't do so while adhering to the 2:1 rule - "a 2% performance increase can increase power-consumption by no more than 1%".

I got the feeling Atom was going for the absolute lowest production cost and power draw, regardless of power efficiency (since Atom is very inefficient compared to even the 90nm Pentium M, let alone the Core 2 Duo).
 

cbn

Lifer
Mar 27, 2009
12,968
221
106


Second is that absolute performance might very well still be higher if the IPC per thread is higher despite the lower thread-scaling efficiency (this is the case for 8 threads on bloomfield w/HT vs. a dual-socket shanghai opteron in the Euler3D bench). Again this would speak to price/performance.

To someone like me (who doesn't know much about computer science), higher IPC makes sense. Even if scaling per core is less, the overall effect is still greater.

On top of that, I wonder how much power their dual-module (quad-core) Bulldozer will draw. I am guessing it won't draw that many watts relative to its processing power.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Take either a modern-day i7 or PhII with their shared L3$...in theory you could end up with a program that critically depends on the L3$ size for just a single thread, and adding a second thread that also needs that L3$ space suddenly results in considerable IMC activity and DRAM fetches. Now all of a sudden both threads are stalling like crazy and your speedup actually becomes worse than if you had just stuck with a single thread. It can happen, though only with extreme rarity, but the concerns are legitimate.

I have noticed this L3 cache takes up quite a bit of die size as well as adding quite a few xtors.

Something tells me this approach isn't really energy-efficient or cost-effective. But I guess there comes a point where there is nothing else that can be done.