Intel Atom's Hyperthreading

IntelUser2000

Elite Member
Oct 14, 2003
Multi-threading gain of various architectures on SpecInt2KRate:

Power 5: 21%
Pentium 4: less than 5%
Pentium 4 single core to Pentium D dual core: 76%
Atom: 39%

Cinebench gain:
Atom: 53%
Pentium 4: 15%
Pentium Extreme Edition(Smithfield): -7.6%

From: http://download.intel.com/pres...C_Chandrasekher_EN.pdf

I think the implementations of multi-threading on Atom and Nehalem will change people's perspectives on Hyperthreading. The gains are simply amazing here. An in-order design does take better advantage of HT, but the Atom architecture is further optimized to maximize the benefits.
 

Idontcare

Elite Member
Oct 10, 1999
A fair chunk of the improvement has got to come from the fact that memory subsystems are so much better here, 4 years after HT was introduced on the P4C.

SMT will effectively cut the available L1/L2 and RAM bandwidth in half for each thread, which is a problem if the system wasn't designed with 2X bandwidth requirements in mind.
 
soccerballtux

Dec 30, 2004
I'm not interested. There's a huge difference between my e2180 at 2.0GHz and at 3.4GHz.

And, single core still sucks for desktop. I want dual core and great battery life.
 

CTho9305

Elite Member
Jul 26, 2000
Originally posted by: Idontcare
A fair chunk of the improvement has got to come from the fact that memory subsystems are so much better here, 4 years after HT was introduced on the P4C.

SMT will effectively cut the available L1/L2 and RAM bandwidth in half for each thread, which is a problem if the system wasn't designed with 2X bandwidth requirements in mind.

I think a much bigger factor is just that Atom's single-thread performance is likely abysmal. In an in-order machine (Atom may not be completely in-order, but for these purposes it is -- at least according to Anand's article), every single cache miss stalls the processor. If you have a 90% L1 hit rate and 1/3rd of operations are memory operations, then once every 30 instructions the CPU grinds to a halt for however long it takes to get data back from the L2. If you make that same machine multi-threaded, you can run operations from the other thread(s) during this window, drastically decreasing the wasted time.
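The miss-rate arithmetic above can be sketched as a toy model; the L2 miss penalty and the 1-cycle base CPI below are illustrative assumptions, not Atom's real figures:

```python
# Toy model of the miss-rate arithmetic above: how often an in-order
# core stalls on an L1 miss, and how much time is wasted waiting.
# The miss penalty and 1-cycle base CPI are illustrative assumptions.

mem_op_fraction = 1 / 3    # share of instructions that touch memory
l1_hit_rate = 0.90         # 90% L1 hit rate (from the post)
miss_penalty = 15          # cycles to fetch from L2 (assumed)

misses_per_instr = mem_op_fraction * (1 - l1_hit_rate)
print(f"one miss every {1 / misses_per_instr:.0f} instructions")   # ~30

# Assuming 1 cycle/instruction when not stalled:
cycles_per_instr = 1 + misses_per_instr * miss_penalty             # 1.5
stall_fraction = (misses_per_instr * miss_penalty) / cycles_per_instr
print(f"fraction of cycles spent stalled: {stall_fraction:.0%}")   # ~33%

# An ideal second SMT thread runs during those stall cycles, so the
# core's combined throughput recovers toward 1 instruction/cycle:
smt_speedup = min(2.0, cycles_per_instr)
print(f"ideal 2-thread speedup: {smt_speedup:.2f}x")               # 1.50x
```

Under these assumed numbers the second thread reclaims essentially all the stall time, which is the mechanism behind Atom's outsized HT gains.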

Similarly, for any small chain of dependent instructions, Atom will become serialized, leaving one issue port idle all the time. An out-of-order processor could look past a small group of instructions and find work to do on its other execution units in parallel. Again, SMT gives you a separate pool of instructions, so now even if two threads are just chains of dependent instructions, you'll have two instructions ready to execute each cycle.
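The dual-issue point can be illustrated with a toy throughput cap, assuming a 2-wide in-order core and threads that are pure chains of dependent single-cycle ops (an illustrative sketch, not a real Atom model):

```python
# A pure dependent chain supplies at most one ready instruction per
# cycle, so a single thread leaves the second issue port idle; a
# second SMT thread supplies its own independent chain and fills it.

def sustained_ipc(num_threads: int, issue_width: int = 2) -> int:
    # Throughput is capped by both the number of independent
    # chains (threads) and the machine's issue width.
    return min(num_threads, issue_width)

print(sustained_ipc(1))  # 1 -> one issue port idle every cycle
print(sustained_ipc(2))  # 2 -> both ports busy
```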

To summarize, I think an in-order processor generally needs much less memory bandwidth because it just can't take advantage of it (upon the first miss, it's stuck), whereas an out-of-order processor can keep working and generate memory accesses in parallel. Adding SMT to an in-order processor allows it to have two (for SMT with 2 threads) memory accesses happening in parallel, which is closer to what an out-of-order CPU might do in common cases (even though most modern out-of-order CPUs do 8+ if your code manages to expose that much memory parallelism).

By the way, this statement from Anand's article is wrong: "A small signal array design based on a 6T cell has a certain minimum operating voltage, in other words it can retain state until a certain Vmin. In the L2 cache, Intel was able to use a 6T signal array design since it had inline ECC."

ECC doesn't affect your Vmin - when you cross Vmin, you start losing data so rapidly that ECC can't help at all. Anand must have misinterpreted something. Maybe the L2's power supply keeps it at a higher voltage, and/or the voltage to the L2 only drops significantly in sleep states that flush it.
 

nismotigerwvu

Golden Member
May 13, 2004
Originally posted by: CTho9305
Originally posted by: Idontcare
A fair chunk of the improvement has got to come from the fact that memory subsystems are so much better here, 4 years after HT was introduced on the P4C.

SMT will effectively cut the available L1/L2 and RAM bandwidth in half for each thread, which is a problem if the system wasn't designed with 2X bandwidth requirements in mind.

I think a much bigger factor is just that Atom's single-thread performance is likely abysmal. In an in-order machine (Atom may not be completely in-order, but for these purposes it is -- at least according to Anand's article), every single cache miss stalls the processor. If you have a 90% L1 hit rate and 1/3rd of operations are memory operations, then once every 30 instructions the CPU grinds to a halt for however long it takes to get data back from the L2. If you make that same machine multi-threaded, you can run operations from the other thread(s) during this window, drastically decreasing the wasted time.

Similarly, for any small chain of dependent instructions, Atom will become serialized, leaving one issue port idle all the time. An out-of-order processor could look past a small group of instructions and find work to do on its other execution units in parallel. Again, SMT gives you a separate pool of instructions, so now even if two threads are just chains of dependent instructions, you'll have two instructions ready to execute each cycle.

To summarize, I think an in-order processor generally needs much less memory bandwidth because it just can't take advantage of it (upon the first miss, it's stuck), whereas an out-of-order processor can keep working and generate memory accesses in parallel. Adding SMT to an in-order processor allows it to have two (for SMT with 2 threads) memory accesses happening in parallel, which is closer to what an out-of-order CPU might do in common cases (even though most modern out-of-order CPUs do 8+ if your code manages to expose that much memory parallelism).

By the way, this statement from Anand's article is wrong: "A small signal array design based on a 6T cell has a certain minimum operating voltage, in other words it can retain state until a certain Vmin. In the L2 cache, Intel was able to use a 6T signal array design since it had inline ECC."

ECC doesn't affect your Vmin - when you cross Vmin, you start losing data so rapidly that ECC can't help at all. Anand must have misinterpreted something. Maybe the L2's power supply keeps it at a higher voltage, and/or the voltage to the L2 only drops significantly in sleep states that flush it.

Exactly. If you want to boil it down just a little more, you can almost think of HT on Atom as a poor man's out-of-order execution, since both have the same net result: Atom gets hung up on a cache miss, but has another thread going that can reduce the downtime.
 

Nemesis 1

Lifer
Dec 30, 2006
Now I don't know, but I am assuming that Atom cores will be used on Larrabee, even though some reports state Larrabee cores will do 4 threads, using a huge vector engine and a new kind of shared cache. Rumors are Rambus XDR memory and a ringbus connecting all cores. With 16 cores, do ya think this would make a good GPU?
 

Zap

Elite Member
Oct 13, 1999
Originally posted by: soccerballtux
And, single core still sucks for desktop. I want dual core and great battery life.

Well, until dual core notebooks with desktop performance average 5 hours on a normal battery, I think I'd settle for lower performance with great battery life.
 

Idontcare

Elite Member
Oct 10, 1999
Originally posted by: CTho9305
Originally posted by: Idontcare
A fair chunk of the improvement has got to come from the fact that memory subsystems are so much better here, 4 years after HT was introduced on the P4C.

SMT will effectively cut the available L1/L2 and RAM bandwidth in half for each thread, which is a problem if the system wasn't designed with 2X bandwidth requirements in mind.

I think a much bigger factor is just that Atom's single-thread performance is likely abysmal. In an in-order machine (Atom may not be completely in-order, but for these purposes it is -- at least according to Anand's article), every single cache miss stalls the processor. If you have a 90% L1 hit rate and 1/3rd of operations are memory operations, then once every 30 instructions the CPU grinds to a halt for however long it takes to get data back from the L2. If you make that same machine multi-threaded, you can run operations from the other thread(s) during this window, drastically decreasing the wasted time.

Similarly, for any small chain of dependent instructions, Atom will become serialized, leaving one issue port idle all the time. An out-of-order processor could look past a small group of instructions and find work to do on its other execution units in parallel. Again, SMT gives you a separate pool of instructions, so now even if two threads are just chains of dependent instructions, you'll have two instructions ready to execute each cycle.

To summarize, I think an in-order processor generally needs much less memory bandwidth because it just can't take advantage of it (upon the first miss, it's stuck), whereas an out-of-order processor can keep working and generate memory accesses in parallel. Adding SMT to an in-order processor allows it to have two (for SMT with 2 threads) memory accesses happening in parallel, which is closer to what an out-of-order CPU might do in common cases (even though most modern out-of-order CPUs do 8+ if your code manages to expose that much memory parallelism).

This is not too unlike the CMT situation for 8 threads/core on an in-order Niagara2, so I am not surprised the logic is upheld.

I appreciate the extra info, but I should have stipulated in my post that I was speaking more generically about SMT and thinking about P4C vs Nehalem rather than, say, P4C vs Atom...which is totally off-topic and I don't know why my mind went there.

But Niagara scores 8 threads/core with in-order processing in no small part thanks to its fairly beefy bandwidth.

Originally posted by: CTho9305
By the way, this statement from Anand's article is wrong: "A small signal array design based on a 6T cell has a certain minimum operating voltage, in other words it can retain state until a certain Vmin. In the L2 cache, Intel was able to use a 6T signal array design since it had inline ECC."

ECC doesn't affect your Vmin - when you cross Vmin, you start losing data so rapidly that ECC can't help at all. Anand must have misinterpreted something. Maybe the L2's power supply keeps it at a higher voltage, and/or the voltage to the L2 only drops significantly in sleep states that flush it.

I was avoiding passing judgement on Anand's comments regarding 6T + ECC, as I don't have the S/N experience with Intel's specific SRAM designs that I do with TI's (where we repeatedly held the benchmark smallest SRAMs across many nodes)...but he is at least technically correct to some non-zero degree, as ECC will provide some non-zero margin to Vmin. Whether that is 0.1V of margin or 0.001V depends intrinsically on the quality of the SRAM design and its robustness to process variation.

So there is a chance he is entirely correct (one would assume he has connections at Intel and would have proofed his ideas first), and possibly there is something to be learned here for non-Intel folks regarding the quality of SRAM cells (in Si) that Intel's layout folks get to work with?
 

CTho9305

Elite Member
Jul 26, 2000
Originally posted by: Idontcare
Originally posted by: CTho9305
By the way, this statement from Anand's article is wrong: "A small signal array design based on a 6T cell has a certain minimum operating voltage, in other words it can retain state until a certain Vmin. In the L2 cache, Intel was able to use a 6T signal array design since it had inline ECC."

ECC doesn't affect your Vmin - when you cross Vmin, you start losing data so rapidly that ECC can't help at all. Anand must have misinterpreted something. Maybe the L2's power supply keeps it at a higher voltage, and/or the voltage to the L2 only drops significantly in sleep states that flush it.

I was avoiding passing judgement on Anand's comments regarding 6T + ECC, as I don't have the S/N experience with Intel's specific SRAM designs that I do with TI's (where we repeatedly held the benchmark smallest SRAMs across many nodes)...but he is at least technically correct to some non-zero degree, as ECC will provide some non-zero margin to Vmin. Whether that is 0.1V of margin or 0.001V depends intrinsically on the quality of the SRAM design and its robustness to process variation.

So there is a chance he is entirely correct (one would assume he has connections at Intel and would have proofed his ideas first), and possibly there is something to be learned here for non-Intel folks regarding the quality of SRAM cells (in Si) that Intel's layout folks get to work with?

I just don't buy it. You could verify that Anand is wrong relatively easily - put the part at its rated Vmin and check whether ECC detects any correctable errors (there should be performance counters for it). Nobody is going to ship designs that operate microvolts from their failure point.

Maybe they use 6T cells in the L2 because they have 2 different Vmins: a retention voltage, and a voltage used when it's being accessed. The reason 6T SRAMs have high Vmins is usually dominated by read stability... if you're not reading the data (e.g. when the core is asleep) you can drop the voltage quite a bit further without corrupting the state nodes. This gives you a range of choices:
1. Core operating, L1 and L2 at operating voltages
2. Core L1 asleep (flushed), L2 at retention voltage (extremely low power - especially if the L2 is built with long-L / HVT devices), with reasonably fast wakeup (data comes from L2)
3. Core and all caches asleep (flushed), with slow wakeup (all data has to come from DRAM)
 

Idontcare

Elite Member
Oct 10, 1999
Originally posted by: CTho9305
I just don't buy it. You could verify that Anand is wrong relatively easily - put the part at its rated Vmin and check whether ECC detects any correctable errors (there should be performance counters for it). Nobody is going to ship designs that operate microvolts from their failure point.

Maybe they use 6T cells in the L2 because they have 2 different Vmins: a retention voltage, and a voltage used when it's being accessed. The reason 6T SRAMs have high Vmins is usually dominated by read stability... if you're not reading the data (e.g. when the core is asleep) you can drop the voltage quite a bit further without corrupting the state nodes. This gives you a range of choices:
1. Core operating, L1 and L2 at operating voltages
2. Core L1 asleep (flushed), L2 at retention voltage (extremely low power - especially if the L2 is built with long-L / HVT devices), with reasonably fast wakeup (data comes from L2)
3. Core and all caches asleep (flushed), with slow wakeup (all data has to come from DRAM)

I wonder if we are misinterpreting what Anand means by "Vmin"...we are thinking in terms of manufacturer parameters and shmoo plots for the specific use of the Vmin term...maybe he means "the minimum voltage the array requires while still operating with an acceptable number of ECC events", not the technical Vmin definition we are familiar with using in our daily jobs.
 

CTho9305

Elite Member
Jul 26, 2000
Originally posted by: Idontcare
Originally posted by: CTho9305
I just don't buy it. You could verify that Anand is wrong relatively easily - put the part at its rated Vmin and check whether ECC detects any correctable errors (there should be performance counters for it). Nobody is going to ship designs that operate microvolts from their failure point.

Maybe they use 6T cells in the L2 because they have 2 different Vmins: a retention voltage, and a voltage used when it's being accessed. The reason 6T SRAMs have high Vmins is usually dominated by read stability... if you're not reading the data (e.g. when the core is asleep) you can drop the voltage quite a bit further without corrupting the state nodes. This gives you a range of choices:
1. Core operating, L1 and L2 at operating voltages
2. Core L1 asleep (flushed), L2 at retention voltage (extremely low power - especially if the L2 is built with long-L / HVT devices), with reasonably fast wakeup (data comes from L2)
3. Core and all caches asleep (flushed), with slow wakeup (all data has to come from DRAM)

I wonder if we are misinterpreting what Anand means by "Vmin"...we are thinking in terms of manufacturer parameters and shmoo plots for the specific use of the Vmin term...maybe he means "the minimum voltage the array requires while still operating with an acceptable number of ECC events", not the technical Vmin definition we are familiar with using in our daily jobs.

Ah, you're thinking that given the reduced Q at low voltage (Q=CV), a cosmic ray/alpha particle is more likely to flip bits, so ECC becomes even more important to maintain an acceptable FIT rate? That doesn't sound unreasonable (I don't know the math to really make a thorough evaluation), but it's independent of 6T vs 8T SRAM cells.
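The Q=CV point can be put in rough numbers; the node capacitance and voltages below are made-up illustrative values, not anything from Intel:

```python
# Stored charge on an SRAM state node scales linearly with supply
# voltage (Q = C*V), so a lower retention voltage means a particle
# strike needs to deposit less charge to flip the cell.
# All numbers are illustrative assumptions.

C_node = 1e-15      # node capacitance: 1 fF (assumed)
V_nominal = 1.0     # nominal supply voltage in volts (assumed)
V_retention = 0.7   # lowered retention voltage (assumed)

Q_nominal = C_node * V_nominal
Q_retention = C_node * V_retention

print(f"charge at nominal voltage:   {Q_nominal * 1e15:.2f} fC")
print(f"charge at retention voltage: {Q_retention * 1e15:.2f} fC")
print(f"critical charge reduced by:  {1 - Q_retention / Q_nominal:.0%}")
```

On these assumed numbers a 30% voltage drop cuts the critical charge by the same 30%, which is why a lower retention voltage raises the soft-error rate and makes ECC more valuable, independent of the 6T-vs-8T cell choice.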
 

Idontcare

Elite Member
Oct 10, 1999
Originally posted by: CTho9305
Ah, you're thinking that given the reduced Q at low voltage (Q=CV), a cosmic ray/alpha particle is more likely to flip bits, so ECC becomes even more important to maintain an acceptable FIT rate? That doesn't sound unreasonable (I don't know the math to really make a thorough evaluation), but it's independent of 6T vs 8T SRAM cells.

Yeah, just playing out that mental game of "assuming it's 100% correct, what else would have to be true"...it's a slow weekend.

In the end I suspect your initial Occam's razor assessment is correct regarding Anand's 6T+ECC comments...he probably misstated something in there.
 

Fox5

Diamond Member
Jan 31, 2005
Originally posted by: IntelUser2000
Multi-threading gain of various architectures on SpecInt2KRate:

Power 5: 21%
Pentium 4: less than 5%
Pentium 4 single core to Pentium D dual core: 76%
Atom: 39%

Cinebench gain:
Atom: 53%
Pentium 4: 15%
Pentium Extreme Edition(Smithfield): -7.6%

From: http://download.intel.com/pres...C_Chandrasekher_EN.pdf

I think the implementations of multi-threading on Atom and Nehalem will change people's perspectives on Hyperthreading. The gains are simply amazing here. An in-order design does take better advantage of HT, but the Atom architecture is further optimized to maximize the benefits.

HT works better because in-order execution is quite a bit less efficient per clock cycle; HT covers up for that.
IBM's in-order console CPUs also gain significantly from HT (their equivalent of it, anyway).