Heterogeneous Computing (at the level of the CPU) vs. Intel's Process tech advantage?

cbn

Lifer
Mar 27, 2009
12,968
221
106
Here is a chart (I found on another forum posted by username "bladerash") showing process technology leads over the years:

Intel

250nm_____january 1998_____Deschutes
180nm_____25 october 1999_____Coppermine
130nm_____july 2001_____Tualatin
90nm_____february 2004_____Prescott
65nm_____january 2006_____Cedar Mill
45nm_____january 2008_____Wolfdale
32nm_____7 january 2010_____Clarkdale
22nm_____january 2012??_____Ivy Bridge

AMD

250nm_____6 january 1998_____K6 ''Little Foot''
180nm_____23 june 1999_____Athlon ''original''
130nm_____10 june 2002_____Thoroughbred
90nm_____14 october 2004_____Winchester
65nm_____5 december 2006_____Brisbane
45nm_____8 january 2009_____Deneb
32nm_____30 june 2011_____Llano

If this chart is correct, at one time AMD and Intel were on par with each other (with AMD actually beating Intel to the 180nm node).

But look at things today: AMD is 18 months behind Intel on the 32nm node. Furthermore, Intel is about to release its 22nm FinFET process technology, which will probably widen the gap even more. (see quote below)

http://www.anandtech.com/show/4318/intel-roadmap-ivy-bridge-panther-point-ssds/1

The shrink to 22nm and 3D transistors (FinFET) almost represents a two-node process technology jump, so we expect performance at various power levels to increase quite a bit.

This brings me to my question:

We've all heard about using the GPU for energy-efficient computing, but what about AMD adding Bobcat CPU cores (to augment large CPU cores) to the equation? Could this help close the gap on battery run times?

A good example of mixing large CPU cores with small CPU cores would be the upcoming ARM "big.LITTLE" arrangement, with a Cortex A15 coupled to a Cortex A7.

http://www.arm.com/products/processors/technologies/bigLITTLEprocessing.php

[Image: ARM big.LITTLE diagram]


[Image: ARM Cortex-A7 big.LITTLE processing diagram]


Even though we are looking at a scaleless graph above, I'd have to imagine the power savings would be significant.

The downside for AMD with "big.LITTLE" would be increased die area, but Bobcat cores are pretty small at 4.6 mm² on the 40nm node.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
Here is a chart (I found on another forum posted by username "bladerash") showing process technology leads over the years:


If this chart is correct, at one time AMD and Intel were on par with each other (with AMD actually beating Intel to the 180nm node).

That never happened.

Few corrections with timeline differences:

Intel

250nm_____August 1997_____Pentium(Tillamook)
180nm_____June 14, 1999_____Pentium II(Dixon??)
130nm_____July 2001_____Tualatin
90nm_____February 2004_____Prescott
65nm_____Dec 27, 2005_____Pentium Extreme Edition 965(Presler)
45nm_____Oct 28, 2007_____Core 2 QX9650(Yorkfield)
32nm_____7 January 2010_____Clarkdale
22nm_____January 2012??_____Ivy Bridge

AMD

250nm_____6 January 1998_____K6 ''Little Foot'' +5 months
180nm_____Nov 29, 1999_____Athlon K75 (Pluto/Orion) +5.5 months
130nm_____Apr 17, 2002_____Thoroughbred(Mobile) +8.5 months
90nm_____Aug 17, 2004_____Oakville(Mobile) +7 months
65nm_____5 December 2006_____Brisbane +11.5 months
45nm_____November 13, 2008_____Shanghai(Server) +12.5 months
32nm_____30 June 2011_____Llano +18.5 months

Other:
- The original Athlon debuted at 0.25 micron and was code-named "Argon"
 

Lonbjerg

Diamond Member
Dec 6, 2009
4,419
0
0
That never happened.

Few corrections with timeline differences:

Intel

250nm_____August 1997_____Pentium(Tillamook)
180nm_____June 14, 1999_____Pentium II(Dixon??)
130nm_____July 2001_____Tualatin
90nm_____February 2004_____Prescott
65nm_____Dec 27, 2005_____Pentium Extreme Edition 965(Presler)
45nm_____Oct 28, 2007_____Core 2 QX9650(Yorkfield)
32nm_____7 January 2010_____Clarkdale
22nm_____January 2012??_____Ivy Bridge

AMD

250nm_____6 January 1998_____K6 ''Little Foot'' +5 months
180nm_____Nov 29, 1999_____Athlon K75 (Pluto/Orion) +5.5 months
130nm_____Apr 17, 2002_____Thoroughbred(Mobile) +8.5 months
90nm_____Aug 17, 2004_____Oakville(Mobile) +7 months
65nm_____5 December 2006_____Brisbane +11.5 months
45nm_____November 13, 2008_____Shanghai(Server) +12.5 months
32nm_____30 June 2011_____Llano +18.5 months

It does show how Intel has pulled away on process nodes...
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
This brings me to my question:

We've all heard about using the GPU for energy-efficient computing, but what about AMD adding Bobcat CPU cores (to augment large CPU cores) to the equation? Could this help close the gap on battery run times?

This might not work as well in the PC ecosystem, which relies on standards rather than proprietary hardware and software. No one will be happy with what might be marginal power savings if there's a performance loss.

Of course, it's easier in laptops than desktops, for the reasons above.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Here is a chart (I found on another forum posted by username "bladerash") showing process technology leads over the years:

Intel

250nm_____january 1998_____Deschutes
180nm_____25 october 1999_____Coppermine
130nm_____july 2001_____Tualatin
90nm_____february 2004_____Prescott
65nm_____january 2006_____Cedar Mill
45nm_____january 2008_____Wolfdale
32nm_____7 january 2010_____Clarkdale
22nm_____january 2012??_____Ivy Bridge

AMD

250nm_____6 january 1998_____K6 ''Little Foot''
180nm_____23 june 1999_____Athlon ''original''
130nm_____10 june 2002_____Thoroughbred
90nm_____14 october 2004_____Winchester
65nm_____5 december 2006_____Brisbane
45nm_____8 january 2009_____Deneb
32nm_____30 june 2011_____Llano

If this chart is correct, at one time AMD and Intel were on par with each other (with AMD actually beating Intel to the 180nm node).

But look at things today: AMD is 18 months behind Intel on the 32nm node. Furthermore, Intel is about to release its 22nm FinFET process technology, which will probably widen the gap even more. (see quote below)

http://www.anandtech.com/show/4318/intel-roadmap-ivy-bridge-panther-point-ssds/1



This brings me to my question:

We've all heard about using the GPU for energy-efficient computing, but what about AMD adding Bobcat CPU cores (to augment large CPU cores) to the equation? Could this help close the gap on battery run times?

A good example of mixing large CPU cores with small CPU cores would be the upcoming ARM "big.LITTLE" arrangement, with a Cortex A15 coupled to a Cortex A7.

http://www.arm.com/products/processors/technologies/bigLITTLEprocessing.php

[Image: ARM big.LITTLE diagram]


[Image: ARM Cortex-A7 big.LITTLE processing diagram]


Even though we are looking at a scaleless graph above, I'd have to imagine the power savings would be significant.

The downside for AMD with "big.LITTLE" would be increased die area, but Bobcat cores are pretty small at 4.6 mm² on the 40nm node.

Looking at your charts made me think back. It started me thinking about Haswell, all the research on Intel's tech I have done, and the computing power we have right now. I think Haswell is going to be very different. Intel doesn't really need to increase processor speed; it needs to increase core count. It also needs to be very modular.

From everything I have researched about Intel, in order for Haswell to give Intel the edge it needs in both compute and efficiency, Intel has to drop the x86 decoders and emulate x86 instead. Intel has the compiler infrastructure in place, and with Knights Corner on the horizon, it all ties together for Intel to drop the power-hungry x86 decoders.
 

cotak13

Member
Nov 10, 2010
129
0
0
If this actually works, and I think it won't: what's stopping Intel from doing the same? And what's motivating Microsoft to support this quickly? Don't get me wrong, I am not saying MS wouldn't support it, just that they don't work fast. So even if you launch this, it will take at least a year for it to work correctly with Windows, if not 2-3 years.

The idea that AMD can outmaneuver Intel in one step is basically folly at this point. The APU was supposed to be the flanking move to end Intel's domination. But as we can see, it took so long that Intel managed to build its own half-decent graphics. And with no real consumer apps that use OpenCL, the whole APU thing turned out to be a bit of a mirage. Yes, Llano has decent graphics and longer battery life while running 3D games. But seriously, besides games it has basically no unique selling point. It's a product that's good if your needs fit in a narrow box. For the rest of the world it's just OK because of the price. A big problem for AMD: the low price it must sell at to be relevant in the market.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
If this actually works, and I think it won't: what's stopping Intel from doing the same? And what's motivating Microsoft to support this quickly? Don't get me wrong, I am not saying MS wouldn't support it, just that they don't work fast. So even if you launch this, it will take at least a year for it to work correctly with Windows, if not 2-3 years.

Here is some information I found:

http://www.wired.com/cloudline/2011/10/arms-cortex-a7/

The OS doesn't actually need to be modified or to be at all aware of the smaller A7 cores in order to take advantage of the technology. All popular mobile and desktop OSes now ship with dynamic voltage and frequency scaling (DVFS) capabilities, so that they can tell the CPU when they need more horsepower and when they need less. For lighter workloads, a typical CPU responds to the OS's signal by throttling back its operating frequency and lowering its voltage, thereby saving power; for heavier workloads, it can burst the frequency and voltage higher temporarily to provide a performance boost. The open-source firmware layer that will sit between the OS and a big.LITTLE chip can take these standard signals and, instead of downclocking the A15 when the OS asks for less horsepower, it simply moves the workload onto the A7 cores. So while it will be possible to modify an OS to be big.LITTLE-aware, it's not necessary in order to take advantage of the capability.
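To make the mechanism described in that quote a bit more concrete, here is a minimal sketch in Python. The thresholds and function names are made-up assumptions for illustration; the real logic lives in ARM's switcher firmware and the kernel cpufreq layer, not in anything like this.

```python
# Hypothetical sketch of a big.LITTLE "switcher" policy: the OS keeps issuing its
# normal DVFS performance requests, and a firmware layer decides whether those
# requests should be served by the little (A7) or big (A15) cluster.
# Thresholds and names are illustrative assumptions, not ARM's actual firmware.

A7_MAX_PERF = 0.4   # assume the A7 cluster covers the bottom ~40% of the perf range

def choose_cluster(requested_perf):
    """Map an OS DVFS request (0.0 .. 1.0) to a cluster and an operating point."""
    if requested_perf <= A7_MAX_PERF:
        # Light load: run on the A7 cluster, scaled within its own DVFS range.
        return ("A7", requested_perf / A7_MAX_PERF)
    # Heavy load: migrate to the A15 cluster and scale within its range.
    return ("A15", (requested_perf - A7_MAX_PERF) / (1.0 - A7_MAX_PERF))

if __name__ == "__main__":
    for req in (0.05, 0.25, 0.50, 0.90):
        cluster, point = choose_cluster(req)
        print(f"OS asks for {req:.0%} performance -> {cluster} cluster at {point:.0%} of its DVFS range")
```

The point of the sketch is simply that, from the OS's side, the interface is still an ordinary performance request; which cluster ends up servicing it is hidden below.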
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
I thought the following was a good explanation of the advantages of a heterogeneous multi-core CPU vs. a homogeneous multi-core CPU:

http://www.theinquirer.net/inquirer...nstruments-arms-cortex-a7-android-accelerator

Texas Instruments sees ARM's Cortex A7 as an Android accelerator
Claims mix and match is the way forward
By Lawrence Latif
Thu Oct 20 2011, 13:23

CHIP VENDOR Texas Instruments (TI) said ARM's heterogeneous 'Big.little' architecture helps it accelerate Google's Android operating system.

TI, which designs the popular range of OMAP system-on-chip (SoC) processors found in many smartphones, told The INQUIRER that ARM's newly unveiled Big.little architecture will help improve overall performance of the Android operating system.

Avner Goren, GM of OMAP Strategy at TI told The INQUIRER that ARM's Big.little architecture, which uses Cortex A7 and Cortex A15 cores, addresses a different need than that of multi-core processors made up of identical cores.

Goren said, "We have been using heterogeneous multi-cores since 2002, we always had an ARM CPU coupled to accelerators for video, graphics, DSPs, image processing. This [Big.little] doesn't change anything in this idea. On the contrary, it builds on this concept and it is another dimension. None of what was held here changes what we are doing in the rest of the system."

Goren continued by saying that Big.little is a natural progression from the multi-core, accelerator-aided processors of yesteryear. "What we have held today doesn't change the fact I would continue doing accelerators, DSPs, video accelerators and use [Cortex] M3s inside, but it changes what I'm doing on the high-level Android side."

When ARM's multi-core processors tipped up at Mobile World Congress earlier this year, firms were banging on about how it would be a golden age of power efficiency due to being able to run multiple cores at lower frequencies. Now, less than a year later and with dual-core smartphones still having relatively poor battery life, it looks like that strategy has gone for a Burton. Goren admits that homogeneous multi-core architectures do have a problem.

"Multicores give you scalability in a range, performance goes up and down within this range based on how many cores are active and what is the voltage level for these cores. On the other hand it has a floor, this floor is when you have one core running at the lowest voltage. What we have identified is a need for general processing power, meaning running Android, even at a lower [power] level," said Goren.

Goren said ARM's A7 processor will allow TI to ramp up the Cortex A15 core without hurting the 'idle' performance of the more frequently used Cortex A7 core.


TI and other chip vendors have used accelerators to optimise particular aspects of SoC for use cases such as video playback, but seemed to exclude the fact that optimisations for the underlying operating system would also come in handy.

While ARM's Cortex A7 isn't an Android specific chip, meaning it is a general purpose processor capable of running any code compiled for the architecture, the fact that its primary job in Android smartphones will be to take care of the OS leaving most of the heavy lifting to the Cortex A15 cores, effectively means that chip makers are viewing the Cortex A7 as an Android accelerator.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
I think Haswell is going to be very different. Intel doesn't really need to increase processor speed; it needs to increase core count.

As I understand things, increasing core counts can improve efficiency, but results in decreases in efficiency if too many are added (Amdahl's law).

[Image: Amdahl's law speedup vs. number of processors]


Therefore I have the following question: "In what situation does increasing frequency or IPC start to make more sense to Intel than adding more cores?"

Increases in voltage (all things being equal) can lead to increases in processor frequency. This increases single-threaded performance at the cost of higher power consumption (net effect: decreased performance per watt).

With IPC increases comes the effect of Pollack's rule (absolute performance may increase, but performance per watt decreases).

[Image: Pollack's rule chart]


Do I have these three efficiency concepts right?

1. Amdahl's law
2. Processor Frequency vs. voltage curve
3. Pollack's rule
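For anyone who wants rough numbers behind those three concepts, here is a back-of-the-envelope sketch. Every input below is an assumed example value for illustration, not a measurement of any real chip.

```python
# Back-of-envelope illustration of the three efficiency concepts above.
# All inputs are assumed example values, not measured data.

def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup limited by the serial fraction of the work."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

def dynamic_power(capacitance, voltage, frequency):
    """Classic CMOS dynamic power approximation: P ~ C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency

def pollack_performance(area_ratio):
    """Pollack's rule: performance scales roughly with sqrt of core area."""
    return area_ratio ** 0.5

if __name__ == "__main__":
    # 1. Amdahl: a 90%-parallel task stops scaling well past a handful of cores.
    for n in (2, 4, 8, 64):
        print(f"{n:3d} cores -> {amdahl_speedup(0.90, n):.2f}x speedup")

    # 2. Voltage/frequency: +20% V and +20% f costs ~73% more dynamic power
    #    for only ~20% more single-thread performance.
    base, fast = dynamic_power(1.0, 1.0, 1.0), dynamic_power(1.0, 1.2, 1.2)
    print(f"Dynamic power increase: {fast / base - 1:.0%}")

    # 3. Pollack: a core with 2x the area delivers only ~1.41x the performance,
    #    so performance per area (and roughly per watt) drops.
    print(f"2x area core -> {pollack_performance(2.0):.2f}x performance")
```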
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
As I understand things, increasing core counts can improve efficiency, but results in decreases in efficiency if too many are added (Amdahl's law).

Do I have these three efficiency concepts right?

1. Amdahl's law
2. Processor Frequency vs. voltage curve
3. Pollack's rule

You have it correct, within a set of caveats of course, and one caveat that bears mentioning is the particularly insidious performance degradation mechanism that augments Amdahl's Law - namely interprocessor communication overhead (or more aptly, interthread communication overhead) as captured by Almasi and Gottlieb.

Here's Amdahl's Law:

Tpp = Ts + (Tp / P)


Tpp = Time to complete processing of a particular task. Ts = time required to complete the serial portion (depends on single-thread performance, IPC and clockspeed). Tp = time to compute the parallel portion of the task and P = number of threads or processors.

In the limit of infinite processors, the so-called Amdahl limit, the computation time can only be reduced to the time required to compute the serial portions of the task:

Tpp (as P → ∞) = Ts


But this ignores the increase in computation time associated with the overhead of thread-management itself. The parsing of data, the recompiling of the results, and so on.

Almasi and Gottlieb captured this as an additional time adder to the Amdahl equation:

Tpp = Ts + (Tp / P) + To(P)


In this equation we now have an overhead term, To(P), that scales with the number of processors (threads). Once all the overhead processing is imputed, that term can come to rival the time required to compute the task itself (Ts + Tp).

I refer to this caveat as "insidious" because it can actually cause parallelized tasks to take longer to finish (in terms of absolute time to completion) if you add too many processors to the job.

[Graph: impact of broadcast protocol on scaling; throughput peaks and then declines as more cores are added]


When the performance peaks and then starts to decline as more cores are used, we refer to this as the Almasi/Gottlieb limit. It is real, and it is a real issue affecting multi-threaded applications.

Improving the interthread communication (both in software/messaging protocol and hardware performance) lessens the issue, but it can never be eliminated.

More chip cores can mean slower supercomputing, Sandia simulation shows

The worldwide attempt to increase the speed of supercomputers merely by increasing the number of processor cores on individual chips unexpectedly worsens performance for many complex applications, Sandia simulations have found.
A Sandia team simulated key algorithms for deriving knowledge from large data sets. The simulations show a significant increase in speed going from two to four multicores, but an insignificant increase from four to eight multicores. Exceeding eight multicores causes a decrease in speed. Sixteen multicores perform barely as well as two, and after that, a steep decline is registered as more cores are added.

Source
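Here is a small numerical sketch of the augmented equation above, in the spirit of that Sandia result. The coefficients (Ts, Tp, and the per-core overhead) are invented purely to show the shape of the curve; they are not measurements of any real workload.

```python
# Illustrative model of Amdahl's law with an Almasi/Gottlieb-style overhead term.
# Ts, Tp and the overhead coefficient are made-up numbers chosen to show the shape
# of the curve, not measurements of any real workload.

TS = 1.0          # time for the serial portion of the task
TP = 99.0         # time for the parallel portion on one core
OVERHEAD = 0.05   # per-core interthread-communication cost (scales with P)

def total_time(cores):
    return TS + TP / cores + OVERHEAD * cores

if __name__ == "__main__":
    best = min(range(1, 129), key=total_time)
    for p in (1, 2, 4, 8, 16, 32, 64, 128):
        print(f"{p:3d} cores: {total_time(p):6.2f} time units")
    print(f"Fastest at {best} cores; beyond that, overhead makes it slower again.")
```

With these made-up numbers, total time bottoms out somewhere in the mid-double-digit core counts and then rises again, which is the peak-then-decline behaviour described above.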
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
When the performance peaks and then starts to decline as more cores are used, we refer to this as the Almasi/Gottlieb limit. It is real, and it is a real issue affecting multi-threaded applications.

More chip cores can mean slower supercomputing, Sandia simulation shows

It's interesting you mention that. Some research group was comparing scaling for Nehalem/Shanghai/Harpertown processors. On Harpertown, where the memory system was a significant bottleneck and even interprocessor communications went through the FSB (which is also where the memory traffic passes through), the performance decreased by a noticeable amount at a certain number of cores. Yet on Nehalem/Shanghai the performance improvement was pretty decent with the maximum number of threads, and it was a very parallelizable program.

There's also the often-mentioned Intel paper which showed that beyond 4 cores, you needed performance-per-clock improvements, ISA extensions (like SSE), clock speed increases, and memory bandwidth scaling, along with software optimizations, to take advantage of extra cores. To me, it sounded similar to how improvements in strain, low-k interconnects, and HKMG, plus general optimizations, were needed to reap the benefits from new process generations.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
Yeah I haven't crunched the numbers for the newer chips out there but this graph was very telling:
[Graph: Euler3D benchmark thread scaling]


A direct example of the impact of interprocessor communications overhead on performance can be seen in these data:
[Graph: LinX thread scaling]


And another comparison where microarchitecture design can be superior for multithreaded apps:
[Graph: LinX scaling on a Phenom II X4]


And if you throw in Nehalem (admittedly dated study here) you see where the shortcomings can be addressed:
[Graph: LinX scaling on Nehalem, Deneb, and Kentsfield]
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
In post #7, there is a quote by the TI OMAP strategist:

Goren said ARM's A7 processor will allow TI to ramp up the Cortex A15 core without hurting the 'idle' performance of the more frequently used Cortex A7 core.

Can this heterogeneous CPU strategy go a step further?

Let's say we have a Cortex A7 coupled to a ramped-up Cortex A15 operating at higher voltage and higher frequency. At what point would it make sense to hand off that single-threaded task to an even wider, higher-IPC core? (for even greater single-threaded performance gains)

Cortex A15 with higher voltage and higher frequency

vs.

A larger ARM core (maybe ARMv8) operating at lower voltage/lower frequency?

Which CPU core provides better performance per watt in that instance? Pollack's rule vs. the CPU voltage/frequency curve: which of these two rules governing efficiency would have the stronger effect in that scenario?

To help the reader visualize what I am thinking about, imagine one or two "really big" cores (4-wide ARMv8, or maybe even something bigger), one or two "big" cores (3-wide Cortex A15, etc.) and one or two little cores (Cortex A7, etc.), all on the same piece of silicon.
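One way to frame that comparison numerically is to apply Pollack's rule to the wide core and the V²·f rule to the overclocked narrow core, then compare energy for the same single-threaded task. The scaling factors below are assumptions chosen only for illustration; with different assumptions (a bigger voltage bump for the narrow core, or sub-linear power scaling for the wide one) the conclusion flips, which is exactly why it is a design-point question.

```python
# Rough framework for the "wide core at low V/f vs. narrow core at high V/f"
# question. The scaling factors are assumptions for illustration only; real
# answers depend on the actual microarchitectures and process.

def energy_per_task(perf, power):
    # Energy for a fixed single-threaded task = power * time = power / performance.
    return power / perf

# Option A: narrow core pushed +30% frequency with an assumed +15% voltage.
# Performance ~ f, dynamic power ~ V^2 * f.
perf_a = 1.30
power_a = (1.15 ** 2) * 1.30            # ~1.72x the baseline power

# Option B: a core with 2x the transistors/area at baseline V/f.
# Pollack's rule: performance ~ sqrt(area); power assumed to scale ~ area.
perf_b = 2.0 ** 0.5                      # ~1.41x performance
power_b = 2.0                            # ~2x power

for name, perf, power in [("Narrow core, high V/f", perf_a, power_a),
                          ("Wide core, baseline V/f", perf_b, power_b)]:
    print(f"{name}: {perf:.2f}x perf, {power:.2f}x power, "
          f"{energy_per_task(perf, power):.2f}x energy per task")
```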
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Part of my inspiration for the above post came from seeing this.

Tegra (Wayne) series

Processor: quad- or octa-core ARM Cortex-A15 MPCore
Improved 24 (for the quad-core) or 32 to 64 (for the octa-core) GPU cores with support for DirectX 11+, OpenGL 4.x, OpenCL 1.x, and PhysX
28 nm
About 10 times faster than Tegra 2
To be released in 2012

With Nvidia planning octa-core Cortex A15 smartphone SoCs, I'd imagine there is quite a bit of room for Tegra to "Turbo" up one of the cores for faster single-threaded performance.

But why not use one or two higher-IPC cores instead of applying "Max Turbo" (frequency + voltage) to one or more smaller A15-sized cores? Wouldn't that result in better performance per watt for certain heavy single-threaded tasks?

Furthermore, I'd imagine using higher-IPC cores at lower voltages for mobile could even make for a better "docked" smartphone experience. (re: Once the smartphone is docked and connected to AC power/lapdock battery power, the higher-IPC cores would probably have more room for clock improvements.)
 

Idontcare

Elite Member
Oct 10, 1999
21,110
59
91
In post #7, there is a quote by the TI OMAP strategist:



Can this heterogeneous CPU strategy go a step further?

Let's say we have a Cortex A7 coupled to a ramped-up Cortex A15 operating at higher voltage and higher frequency. At what point would it make sense to hand off that single-threaded task to an even wider, higher-IPC core? (for even greater single-threaded performance gains)

Cortex A15 with higher voltage and higher frequency

vs.

A larger ARM core (maybe ARMv8) operating at lower voltage/lower frequency?

Which CPU core provides better performance per watt in that instance? Pollack's rule vs. the CPU voltage/frequency curve: which of these two rules governing efficiency would have the stronger effect in that scenario?

To help the reader visualize what I am thinking about, imagine one or two "really big" cores (4-wide ARMv8, or maybe even something bigger), one or two "big" cores (3-wide Cortex A15, etc.) and one or two little cores (Cortex A7, etc.), all on the same piece of silicon.

In cases of arguing the "sprint to idle" line of thinking, the metric of interest is the ratio of static leakage to dynamic power consumption as a function of clockspeed (assuming the chip is optimally operated at the minimum necessary Vcc for each point on the clockspeed curve).

For example, here's such a curve for my 2600K:
[Graph: dynamic-to-static power ratio vs. clock speed for a 2600K]


Understand this curve is application dependent. This curve would look different if I used photoshop or email browser versus LinX.

But it speaks to the tradeoff that must be targeted in the design decisions you are talking about.
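To illustrate why that dynamic-to-static ratio drives the "sprint to idle" decision, here is a toy energy model; every number in it is an invented example, not 2600K data. The idea it sketches: once the voltage floor is reached, running slower just accumulates leakage over a longer runtime, while pushing to the top operating point burns disproportionate switching energy, so the energy-optimal point for a fixed task sits somewhere in between.

```python
# Toy model behind the "sprint to idle" tradeoff: total energy to finish a fixed
# task at different operating points, assuming the core can be power-gated the
# moment the task is done. Dynamic energy scales ~ C*V^2 per cycle; static
# leakage is paid for as long as the core stays awake. All numbers are invented
# for illustration; they are not measurements of a 2600K.

WORK = 10e9            # task length in cycles
CEFF = 12e-9           # effective switching energy coefficient (J per cycle per V^2)
LEAK_AT_MAX_V = 15.0   # static leakage in watts at the top voltage

# (frequency in Hz, assumed minimum stable voltage) - note the voltage floor at 0.80 V
OPERATING_POINTS = [(0.8e9, 0.80), (1.6e9, 0.80), (2.4e9, 0.95),
                    (3.4e9, 1.10), (4.5e9, 1.35)]

for freq, volts in OPERATING_POINTS:
    runtime = WORK / freq                                  # seconds the core stays awake
    dynamic_j = CEFF * volts ** 2 * WORK                   # switching energy for the task
    static_j = LEAK_AT_MAX_V * (volts / 1.35) * runtime    # leakage while awake
    print(f"{freq/1e9:.1f} GHz @ {volts:.2f} V: {runtime:5.2f} s awake, "
          f"total {dynamic_j + static_j:5.1f} J "
          f"(dynamic {dynamic_j:5.1f} J, static {static_j:5.1f} J)")
```

As IDC notes, the real curve is application-dependent; the toy numbers only show why neither the slowest nor the fastest operating point is automatically the most efficient for a given task.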
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
In cases of arguing the "sprint to idle" line of thinking, the metric of interest is the ratio of static leakage to dynamic power consumption as a function of clockspeed (assuming the chip is optimally operated at the minimum necessary Vcc for each point on the clockspeed curve).

For example, here's such a curve for my 2600K:
[Graph: dynamic-to-static power ratio vs. clock speed for a 2600K]


Understand this curve is application dependent. This curve would look different if I used photoshop or email browser versus LinX.

But it speaks to the tradeoff that must be targeted in the design decisions you are talking about.

If I understand that correctly, static leakage current factors quite a bit into the total power consumption.

In fact, according to that graph, the dynamic power consumption to static leakage ratio of your 2600K is pretty low at lower clock speeds and particularly at the highest clock speeds. This suggests a high amount of sub-threshold leakage for the Sandy Bridge CPU core at both of those extremes? <--- Please correct me if I am wrong.

Now with respect to "race to idle" arguments: what if a smaller CPU core (such as Atom) were present in the system? Couldn't the ability to switch to Atom (with fewer xtors, and perhaps a design optimized with a higher threshold voltage for less leakage current) reduce idle power consumption even more? (assuming all the Sandy Bridge cores could be appropriately power gated, rather than leaving one active)

In other words, "race to idle" would mean shutting down the larger Intel CPU via power gating and switching to a small atom.

For non-tech-savvy people (like me) reading this, here is the Wikipedia link for power gating.

This technique uses high-Vt sleep transistors which cut off VDD from a circuit block when the block is not switching. The sleep transistor sizing is an important design parameter. This technique, also known as MTCMOS, or Multi-Threshold CMOS, reduces stand-by or leakage power.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
I think Haswell is going to be very different. Intel doesn't really need to increase processor speed; it needs to increase core count. It also needs to be very modular.

From everything I have researched about Intel, in order for Haswell to give Intel the edge it needs in both compute and efficiency, Intel has to drop the x86 decoders and emulate x86 instead. Intel has the compiler infrastructure in place, and with Knights Corner on the horizon, it all ties together for Intel to drop the power-hungry x86 decoders.
They don't have to do anything that radical. More cores is not the answer anyway. Instead, AVX2 will double the SIMD throughput, and the gather instruction support makes it possible to auto-vectorize a lot of code. This is a massive increase in performance/Watt and doesn't require more x86 decoders.

The next step is to do even more work per instruction. This can be achieved by executing AVX-1024 instructions on the existing 256-bit units, in four cycles. The throughput remains the same, but the instruction rate is reduced, so the power-hungry front-end can be clock gated for 3/4 of the time.
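A rough way to picture that claim is to compare front-end activity for the same 1024 bits of vector work issued as four 256-bit instructions versus one hypothetical AVX-1024 instruction cracked into four cycles on the same 256-bit units. The per-block energy weights below are assumptions for illustration only.

```python
# Illustrative comparison of front-end activity for the same amount of vector work:
# four 256-bit AVX instructions vs. one hypothetical AVX-1024 instruction executed
# over four cycles on 256-bit units. Energy weights are invented for illustration.

FRONTEND_ENERGY_PER_DECODE = 1.0   # fetch/decode/rename cost per instruction (a.u.)
EXECUTE_ENERGY_PER_256B_OP = 1.0   # execution cost per 256-bit slice (a.u.)

def energy(instructions, slices_per_instruction):
    frontend = FRONTEND_ENERGY_PER_DECODE * instructions
    execute = EXECUTE_ENERGY_PER_256B_OP * instructions * slices_per_instruction
    return frontend, execute

if __name__ == "__main__":
    for name, (insns, slices) in {"4 x AVX-256": (4, 1), "1 x AVX-1024": (1, 4)}.items():
        fe, ex = energy(insns, slices)
        print(f"{name}: front-end {fe:.0f}, execute {ex:.0f}, total {fe + ex:.0f} "
              f"(front-end active {insns}/4 of the cycles)")
```

The execution energy is the same either way in this toy model; the saving comes entirely from the front-end being busy for one cycle out of four instead of all four.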
 

ocre

Golden Member
Dec 26, 2008
1,594
7
81
Intel doesn't really need to increase processor speed; it needs to increase core count. It also needs to be very modular.

Wow, this sounds like a chip I heard about recently? Uhm... Bulldozer!!!!

So Intel needs to be like AMD, just like AMD???

WTH are you saying. This is insane.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
In cases of arguing the "sprint to idle" line of thinking, the metric of interest is the ratio of static leakage to dynamic power consumption as a function of clockspeed (assuming the chip is optimally operated at the minimum necessary Vcc for each point on the clockspeed curve).

For example, here's such a curve for my 2600K:
[Graph: dynamic-to-static power ratio vs. clock speed for a 2600K]


Understand this curve is application dependent. This curve would look different if I used photoshop or email browser versus LinX.

But it speaks to the tradeoff that must be targeted in the design decisions you are talking about.

Nice graph, IDC. I don't think that I've ever seen that sort of graph for a real processor before. Pretty cool.
 

CPUarchitect

Senior member
Jun 7, 2011
223
0
0
In cases of arguing the "sprint to idle" line of thinking, the metric of interest is the ratio of static leakage to dynamic power consumption as a function of clockspeed (assuming the chip is optimally operated at the minimum necessary Vcc for each point on the clockspeed curve).

For example, here's such a curve for my 2600K:
[Graph: dynamic-to-static power ratio vs. clock speed for a 2600K]
That's a nice graph, which proves once again that lower clock rates don't improve performance / Watt. But keep in mind that it will look quite different for a Tri-Gate process.

And I think that argument can be extended to any "dark silicon" discussion. They assume silicon technology will scale exactly the way it has in the past. But this ignores the fact that from now on, for every new process node, they will focus on technology that improves power efficiency more than anything else.

There's lots of new research going on in this area, as the incentive for having such technology is huge. No matter how you look at it, dark silicon is always a loss and each workaround has severe consequences. So they're throwing big bucks at solving this problem at the root instead.

Of course that doesn't mean I believe it can or should be ignored at the design level. When silicon goes idle it should be clock gated. And the opportunities for making it go idle without lowering performance should be maximized.

Homogeneous computing has huge advantages in terms of ease of programmability and data locality, which should not be ignored. So instead of adding heterogeneous cores, the existing ones should be made capable of achieving the same power efficiency as a more specialized one.

AVX-1024 would offer exactly that. The high IPC of the out-of-order execution is still there for sequential scalar workloads, but for throughput oriented workloads that benefit from DLP the front-end can be clock gated for 3/4 of the time, turning the CPU into an architecture that achieves GPU-like GFLOPS / Watt.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
"dark silicon"

Thank you for bringing up this term "dark silicon". (This is a new concept for me.)

While doing my internet search I found some interesting information on it.

[For other non-technical readers in the forum] The idea of "dark silicon" stems from the fact that with every node shrink the number of xtors nearly doubles, but the power consumption does not get cut in half (the power consumption improvement is much less than that... I believe somewhere around 25% is typical).

Since these modest power consumption improvements cannot keep up with the much greater increases in xtor count, a point is reached where all the xtors on the die can no longer be powered at the same time.

Apparently, the problem of "dark silicon" will get worse and worse with every node shrink barring some breakthrough.
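A quick arithmetic sketch of that trend, using round assumed factors in line with the figures above (roughly 2x transistors per shrink, roughly 25% less power per active transistor):

```python
# Back-of-envelope dark silicon illustration: transistor count roughly doubles per
# node, but power per active transistor improves far less, so at a fixed power
# budget a shrinking fraction of the chip can be active at once.
# The 2.0x and 0.75x factors are assumed round numbers for illustration.

TRANSISTOR_GROWTH_PER_NODE = 2.0     # ~2x transistors each shrink
POWER_PER_XTOR_SCALING = 0.75        # ~25% less power per active transistor

transistors, power_per_xtor = 1.0, 1.0
budget = 1.0  # fixed chip power budget (normalized so 100% can be lit at node 0)

for node in range(5):
    active_fraction = min(1.0, budget / (transistors * power_per_xtor))
    print(f"Node {node}: {transistors:4.1f}x transistors, "
          f"{active_fraction:5.1%} of them can be active at once")
    transistors *= TRANSISTOR_GROWTH_PER_NODE
    power_per_xtor *= POWER_PER_XTOR_SCALING
```

Under these assumed factors the active fraction falls from 100% toward roughly a fifth of the die within four shrinks, which is the "dark silicon" trend in miniature.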
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
They don't have to do anything that radical. More cores is not the answer anyway. Instead, AVX2 will double the SIMD throughput, and the gather instruction support makes it possible to auto-vectorize a lot of code. This is a massive increase in performance/Watt and doesn't require more x86 decoders.

The next step is to do even more work per instruction. This can be achieved by executing AVX-1024 instructions on the existing 256-bit units, in four cycles. The throughput remains the same, but the instruction rate is reduced, so the power-hungry front-end can be clock gated for 3/4 of the time.

Just get rid of the x86 decoders. They're power-hungry monsters.

I was looking for an article I read recently; couldn't find it. Ya, considering it's me, I can see why you missed what I was saying. Grammar, LOL. I like to use whatever spelling comes out, but in this case I used 'very' and 'vary', and I used them correctly for a change to throw ya a curve ball. By more cores I wasn't being specific. I meant maybe 1 large and 4 small, with maybe 3 IGP cores or even a small Knights Corner on the die. So no, not what AMD is doing, but closer to what NV is doing. It may have been an AT article; I will keep looking and if I find it I will link it. Intel's new power gating may be the answer to the dark silicon that CPUarchitect brought up. I know Intel says Haswell is going to be power gated differently, which should allow for varying the types of cores in Haswell.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,786
136
In other words, "race to idle" would mean shutting down the larger Intel CPU via power gating and switching to a small atom.

Remember all the C and P states? The multiple frequency and voltage steps between? P states are for active and C states are for idle. The lower the power state, the longer it takes to wake up.

By having enough C and P states they can avoid adding another small CPU core just to save power. Race to idle to save power won't work if the CPU core is many times slower, as it would be if you put in an Atom CPU. Any potential power that might have been saved might be wasted because it takes longer to execute the task.

Package-level idle, not just core-level idle, only happens when ALL the cores are idle. You'd have to fire up the I/Os, the interconnects, the caches, and the memory controllers just for the Atom CPU. Maybe not everything, but you can't avoid it. And the CPU is now much slower. Any power advantage is gone.

There are two things going for chips like Tegra. They are already far less complicated, and the system and software built around them are proprietary. That allows optimizations that are unthinkable in the standards-based PC ecosystem. And who knows? Their 5th-core approach may or may not work.
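To put rough numbers on the race-to-idle argument above: if the small core is several times slower and the uncore has to stay awake for it anyway, the platform-level energy for the same task can end up higher. All figures below are invented placeholders, not measurements of Sandy Bridge or Atom.

```python
# Back-of-envelope version of the race-to-idle argument above. Every number here
# is an invented placeholder, not a measurement of any real Sandy Bridge or Atom.

TASK_SECONDS_ON_BIG_CORE = 1.0   # how long the big core needs for the job
BIG_CORE_POWER = 15.0            # watts while the big core is active
SMALL_CORE_POWER = 2.0           # watts while the small core is active
SMALL_CORE_SLOWDOWN = 6.0        # small core takes this many times longer
UNCORE_POWER = 5.0               # I/O, interconnect, caches, memory controller
                                 # must stay awake whichever core runs the task

def task_energy(core_power, runtime):
    return (core_power + UNCORE_POWER) * runtime

big = task_energy(BIG_CORE_POWER, TASK_SECONDS_ON_BIG_CORE)
small = task_energy(SMALL_CORE_POWER, TASK_SECONDS_ON_BIG_CORE * SMALL_CORE_SLOWDOWN)
print(f"Big core, then package idle: {big:.0f} J")
print(f"Small core, staying awake 6x longer: {small:.0f} J")
```

With these placeholder numbers the big core finishes the task for less total energy; flip the slowdown or the uncore power and the small core can win, which is why the answer depends so heavily on the platform.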
 

BenchPress

Senior member
Nov 8, 2011
392
0
0
Just get rid of the x86 decoders. They're power-hungry monsters.
Whatever you replace them with will still be quite power hungry. You just can't have a fast 4-wide instruction decoder that is significantly more power efficient than what current x86 processors have. Also anything you might win with a different ISA will easily be lost in the code-morphing software. Transmeta never succeeded at becoming much of a threat to Intel.

Sandy Bridge's micro-operation cache is a more effective way to lower the power consumption of the decoders. The next step is to make sure each micro-instruction does more work, and this can be achieved through extended macro-op fusion and the AVX-1024 that has been proposed before. Clock gating the decoders (and probably more than that) during the execution of very wide vector instructions is an excellent idea.
 

Nemesis 1

Lifer
Dec 30, 2006
11,366
2
0
Whatever you replace them with will still be quite power hungry. You just can't have a fast 4-wide instruction decoder that is significantly more power efficient than what current x86 processors have. Also anything you might win with a different ISA will easily be lost in the code-morphing software. Transmeta never succeeded at becoming much of a threat to Intel.

Sandy Bridge's micro-operation cache is a more effective way to lower the power consumption of the decoders. The next step is to make sure each micro-instruction does more work, and this can be achieved through extended macro-op fusion and the AVX-1024 that has been proposed before. Clock gating the decoders (and probably more than that) during the execution of very wide vector instructions is an excellent idea.

That's old thinking. Intel only needs x86 to be good enough; Intel has to beat ARM at ARM's game. Haswell is a long way off, about 1 1/2 years. So here's what I think a Haswell core will look like for desktop; mobile and server will look way different. Haswell desktop: 2 IVB-type cores emulating x86, AVX2, HT, a new IGP, and a small Knights Corner core.

Server: likely 4+ IVB-type cores, AVX2, 1 IGP core, and a larger Knights Corner core. If Intel doesn't do something like this they will get their ass handed to them. NV has publicly stated it's all about compilers, and I agree; Intel has the best compilers. IDC and I have over the years discussed this type of change, and even though I expected it at 32nm, 3D 22nm Intel chips seem to have what it takes to combine all these cores on one die. With the recent AMD news and the fact that Intel's Haswell has taped out, it's going to be a long 1 1/2 years.
 