Originally posted by: CTho9305
Originally posted by: Idontcare
A fair chunk of the improvements has got to come from the fact that memory subsystems are so much better here 4 years after HT was introduced on the P4C.
SMT will effectively cut the available L1/L2 and RAM bandwidth in half for each thread, which is a problem if the system wasn't designed with 2X bandwidth requirements in mind.
I think a much bigger factor is just that Atom's single-thread performance is likely abysmal. In an in-order machine (Atom may not be completely in-order, but for these purposes it is -- at least according to Anand's article), every single cache miss stalls the processor. If you have a 90% L1 hit rate and 1/3 of operations are memory operations, then roughly once every 30 instructions the CPU grinds to a halt for however long it takes to get data back from the L2. If you make that same machine multi-threaded, you can run operations from the other thread(s) during this window, drastically decreasing the wasted time.
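To put rough numbers on that (these are purely illustrative figures, not measured Atom latencies), here's a quick back-of-envelope model in C:

    /* Back-of-envelope model of the stall math above. All numbers are
     * assumptions for illustration: 1/3 of instructions touch memory,
     * 10% of those miss L1, and an L2 hit costs ~15 cycles. */
    #include <stdio.h>

    int main(void) {
        double mem_fraction = 1.0 / 3.0;  /* instructions that are loads/stores */
        double l1_miss_rate = 0.10;       /* 90% L1 hit rate */
        double l2_latency   = 15.0;       /* assumed L2 hit latency in cycles */
        double base_cpi     = 1.0;        /* idealized 1 instruction/cycle otherwise */

        double misses_per_inst = mem_fraction * l1_miss_rate;   /* ~1 per 30 */
        double stall_cpi       = misses_per_inst * l2_latency;  /* cycles lost per inst */
        double in_order_cpi    = base_cpi + stall_cpi;

        printf("misses per instruction: %.4f (about 1 per %.0f)\n",
               misses_per_inst, 1.0 / misses_per_inst);
        printf("blocking in-order CPI:  %.2f\n", in_order_cpi);
        /* With SMT, a second thread can issue during those stall cycles, so in
         * the best case throughput moves back toward base_cpi. */
        return 0;
    }

With those assumed numbers the blocking core spends a third of its cycles stalled; that whole third is what a second SMT thread gets to use for free.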
Similarly, for any small chain of dependent instructions, Atom's execution becomes serialized, leaving one of its issue ports idle the whole time. An out-of-order processor could look past a small group of instructions and find work to do on its other execution units in parallel. Again, SMT gives you a separate pool of instructions, so now even if both threads are just chains of dependent instructions, you'll have two instructions ready to execute each cycle.
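A hypothetical C sketch of that point (the multiply-add chain itself is made up; what matters is the dependence pattern, not the arithmetic): chain() is one serially dependent stream, like a single thread on Atom, while two_chains() is roughly what the core sees with two SMT threads each running chain(), i.e. two independent streams it can issue side by side.

    #include <stdint.h>
    #include <stdio.h>

    /* One serially dependent chain: each step needs the x just computed,
     * so a 2-issue in-order core can only use one issue port per cycle. */
    uint64_t chain(uint64_t x, int n) {
        for (int i = 0; i < n; i++)
            x = x * 3 + 1;
        return x;
    }

    /* Same work as two independent chains (what two SMT threads provide):
     * the a and b updates don't depend on each other, so both issue ports
     * can be kept busy each cycle. */
    uint64_t two_chains(uint64_t a, uint64_t b, int n) {
        for (int i = 0; i < n; i++) {
            a = a * 3 + 1;
            b = b * 3 + 1;
        }
        return a ^ b;
    }

    int main(void) {
        printf("%llu %llu\n",
               (unsigned long long)chain(1, 1000000),
               (unsigned long long)two_chains(1, 2, 1000000));
        return 0;
    }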
To summarize, I think an in-order processor generally needs much less memory bandwidth because it just can't take advantage of it (upon the first miss, it's stuck), whereas an out-of-order processor can keep working and generate memory accesses in parallel. Adding SMT to an in-order processor lets it have two (for 2-thread SMT) memory accesses in flight at once, which is closer to what an out-of-order CPU does in common cases (though most modern out-of-order CPUs can track 8+ outstanding misses if your code exposes that much memory-level parallelism).
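A rough Little's-law sketch of that bandwidth point (all numbers are assumptions for illustration): the bandwidth a core can actually demand is about outstanding misses * line size / memory latency, so a core that blocks on one miss at a time simply can't use much of the bus.

    #include <stdio.h>

    int main(void) {
        double line_bytes = 64.0;    /* assumed cache line size */
        double latency_ns = 100.0;   /* assumed memory latency */

        /* 1 = blocking in-order, 2 = 2-way SMT, 8 = OoO with plenty of MLP */
        int outstanding[] = { 1, 2, 8 };
        for (int i = 0; i < 3; i++) {
            /* bytes per nanosecond is the same as GB/s */
            double gb_per_s = outstanding[i] * line_bytes / latency_ns;
            printf("%d outstanding miss(es): ~%.2f GB/s of demand bandwidth\n",
                   outstanding[i], gb_per_s);
        }
        return 0;
    }

With those assumed figures the blocking core tops out around 0.64 GB/s, which is why doubling the memory subsystem's bandwidth does little for it, while the SMT and out-of-order cases scale up with the number of misses they can keep in flight.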
By the way, this statement from Anand's article is wrong:
A small signal array design based on a 6T cell has a certain minimum operating voltage, in other words it can retain state until a certain Vmin. In the L2 cache, Intel was able to use a 6T signal array design since it had inline ECC.
ECC doesn't affect your Vmin: once you drop below Vmin, cells start losing data so rapidly that ECC can't help at all. Anand must have misinterpreted something. Maybe the L2's power supply keeps it at a higher voltage, and/or the voltage to the L2 only drops significantly in sleep states that flush it.