How much will future games support Hyper-Threading?


iCyborg

Golden Member
Aug 8, 2008
1,356
64
91
Windows is dumb as a rock when it comes to thread handling, but that is not Windows' fault in most cases. Hyperthreading was designed to require little to no change at the software level. To do this, the CPU lies and reports the logical Hyperthreading cores as being real.

Because they report as real, Windows swaps the threads it has to work with onto the CPU with the lowest usage, which means a program with two threads can end up on the same physical core from time to time. By "from time to time" I mean Windows will take one of the threads and move it to another core (real or logical) and so "fix" the issue of running on the same core. The downside is that there is overhead from changing cores mid-run, as data needs to be moved around before the new core can use it.

All Windows tries to do is move threads from heavily used CPUs to lightly used CPUs, with the intent that each thread gets the best performance it can. The downside is that the hardware does not know which core it can put to sleep (to save power), since Windows could try to use it without notice. Intel and AMD have both had this issue (AMD first, IIRC), and it has led to both manufacturers needing to speed CPU cores up and down together to keep Windows happy.
Maybe Windows 98 is dumb with respect to HT, but not Win XP and later, or any Linux running a newer kernel. All of those know about HT; they won't schedule obliviously or randomly move stuff around like you've described.
Also, the CPU does not lie and does not report logical cores as real:
http://en.wikipedia.org/wiki/CPUID
Check EAX=1, bit 28.
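
If anyone wants to poke at that themselves, here's a minimal sketch in C using the <cpuid.h> helper that GCC/Clang ship on x86 (MSVC has __cpuid instead; and mind the caveat in the comment):

[code]
#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang wrapper for the x86 CPUID instruction */

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1 returns the feature flags; bit 28 of EDX is HTT. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        printf("CPUID leaf 1 not supported\n");
        return 1;
    }

    /* Caveat: HTT set means "topology info is valid", and it's also set
       on some multi-core chips without HT; the OS reads the topology
       leaves to tell logical siblings from real cores. */
    printf("HTT flag: %s\n", (edx & (1u << 28)) ? "set" : "clear");
    return 0;
}
[/code]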

So a program can also do its own scheduling, taking HT into account, if the programmer thinks he/she can do a better job than the OS. But the OS will then leave you on your own, and you'll have to take care of a lot of other things the Win scheduler handles, so it's generally not worth the trouble.
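
For the curious, "doing your own scheduling" on Windows mostly boils down to affinity masks. A minimal sketch, assuming a layout where logical CPUs 0 and 1 are HT siblings of the same physical core (typical on Intel desktops, but not guaranteed; GetLogicalProcessorInformation tells you the real topology):

[code]
#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* Pin the current thread to logical processor 0 only. On typical
       Intel desktops, logical CPUs 0 and 1 are the two HT siblings of
       physical core 0, but that layout is NOT guaranteed; use
       GetLogicalProcessorInformation() to discover the real topology. */
    DWORD_PTR prev = SetThreadAffinityMask(GetCurrentThread(), 0x1);
    if (prev == 0) {
        printf("SetThreadAffinityMask failed: %lu\n", GetLastError());
        return 1;
    }
    printf("Pinned to logical CPU 0 (previous mask: 0x%llx)\n",
           (unsigned long long)prev);

    /* ... latency-sensitive work here; the OS won't migrate this thread
       (though other threads may still land on sibling CPU 1) ... */
    return 0;
}
[/code]

And that's exactly the trap: pin wrong, or forget what else the scheduler was doing for you, and you've made things worse.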
 

LokutusofBorg

Golden Member
Mar 20, 2001
1,065
0
76
Do you guys realize how dumb it sounds to pipe off about how crappy HT is and say that disabling HT will actually get you performance increases, when the chart posted right at the top of the thread proves you completely wrong?

There is no reason for anybody to disable HT unless you're tuning a high-throughput database server. Pretty much every other application will run better (or no different) on Windows with HT enabled.
 

janas19

Platinum Member
Nov 10, 2011
2,313
1
0
So a program can also do its own scheduling, taking HT into account, if the programmer thinks he/she can do a better job than the OS. But the OS will then leave you on your own, and you'll have to take care of a lot of other things the Win scheduler handles, so it's generally not worth the trouble.

Can I ask a question: what does the Win scheduler send its instructions to? Is it the CPU itself, or some other piece of hardware?
 

Throckmorton

Lifer
Aug 23, 2007
16,829
3
0
Do you guys realize how dumb it sounds to pipe off about how crappy HT is and say that disabling HT will actually get you performance increases, when the chart posted right at the top of the thread proves you completely wrong?

There is no reason for anybody to disable HT unless you're tuning a high-throughput database server. Pretty much every other application will run better (or no different) on Windows with HT enabled.

If a program has 3 threads, and you have a quad core processor, how can HT help? Is there any game with 5+ threads?
 

iCyborg

Golden Member
Aug 8, 2008
1,356
64
91
Do you guys realize how dumb it sounds to pipe off about how crappy HT is and say that disabling HT will actually get you performance increases, when the chart posted right at the top of the thread proves you completely wrong?

There is no reason for anybody to disable HT unless you're tuning a high-throughput database server. Pretty much every other application will run better (or no different) on Windows with HT enabled.
In 70% (16/23) of the games, the HT advantage was less than 10%, sometimes even negative. Only in a handful of games did it matter, and one of them doesn't make much sense (100% scaling for F1 2010). This is for the dual-core i3 2105; for a quad core it would matter even less.
And it doesn't come completely free. I don't have a Kill A Watt, but my thermals go up quite a bit, about 5-10 degrees Celsius higher on my i7 920 at full load vs. HT disabled. If you're an OC-er, you could probably OC a smidgen higher to further lessen HT's advantage.
 

iCyborg

Golden Member
Aug 8, 2008
1,356
64
91
Can I ask a question: what does the Win scheduler send its instructions to? Is it the CPU itself, or some other piece of hardware?
I'm not sure I understand the question. Win is a piece of software, and so is its scheduler; like any other software, it uses the CPU and RAM, and probably disk for config data, etc. Sure, it runs in the kernel and has some special privileges, but bottom line, it's a piece of software.
 

wuliheron

Diamond Member
Feb 8, 2011
3,536
0
0
In 70% (16/23) of the games, the HT advantage was less than 10%, sometimes even negative. Only in a handful of games did it matter, and one of them doesn't make much sense (100% scaling for F1 2010). This is for the dual-core i3 2105; for a quad core it would matter even less.
And it doesn't come completely free. I don't have a Kill A Watt, but my thermals go up quite a bit, about 5-10 degrees Celsius higher on my i7 920 at full load vs. HT disabled. If you're an OC-er, you could probably OC a smidgen higher to further lessen HT's advantage.

Yeah, I think the max it ever helps in any given instance is 20%. Nothing to sneeze at, but not exactly something to write home about either. It's a moot point anyway as far as I'm concerned. Processors are already reaching the limits of silicon with speeds over 5GHz, and within the next few years heterogeneous architectures will begin to dominate. Instead of worrying so much about the speed of the CPU, gamers will be focused on the raw bandwidth of the chip and other features.
 

Pray To Jesus

Diamond Member
Mar 14, 2011
3,622
0
0
In 70% (16/23) of the games, the HT advantage was less than 10%, sometimes even negative. Only in a handful of games did it matter, and one of them doesn't make much sense (100% scaling for F1 2010). This is for the dual-core i3 2105; for a quad core it would matter even less.
And it doesn't come completely free. I don't have a Kill A Watt, but my thermals go up quite a bit, about 5-10 degrees Celsius higher on my i7 920 at full load vs. HT disabled. If you're an OC-er, you could probably OC a smidgen higher to further lessen HT's advantage.

It's been proven time and time again that a higher OC w/o HT is better than a lower OC w/ HT for game performance.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Good day AT people. I have a question that I thought some people might like to discuss. Hyper-Threading is a proprietary technology developed by Intel where every physical core can be "split" into two logical cores that can perform processes simultaneously to speed up an application (that's a very watered down explanation but the point of this post is not to discuss what HT is but how).
Hyperthreading adds the minimum amount of additional resource handlers to a CPU core, so that the CPU core can execute multiple threads. HT as currently implemented is very close to a fully shared implementation. A few parts are still split, but the important bits are either added on (dedicated full-size resources for each thread) or shared (execution units, caches, etc.). It is much closer to fully shared SMT than it is to fully partitioned (IE, split up) SMT.

But the program must be developed to utilize Hyper-Threading in order for it to work...
This is also wrong. The program needs to be developed to use Hyperthreading to extract the most performance from a Hyperthreading CPU, by running low-IPC/high-CPI code that is not low-IPC or high-CPI due to bandwidth limitations or execution resource stalls. Such optimizations are generally either leftovers from console development (the XB360 and PS3 CPUs have SMT that can be used like HT) or leftovers from the era of early P4 Xeons. Once the K8 came out, developers generally stopped caring (this is a good thing, mind you), and HT in the Core i series has far superior performance to HT in the P4.

Ok, got that, that part makes sense now. But how or exactly why this is the case is still eluding me...
CPUs have gotten very fast very quickly, but memory hasn't. Even as they stay <4GHz, they are doing more work per cycle, so a modern 3GHz CPU might be equivalent to what people in the early 90s would have expected from a 10GHz CPU.

A Pentium might have had to take 20 cycles to get out to RAM (I made up the number, but it should be close). A Core ix-2xxx might take 100-200 cycles (or more, if the address needed is far enough off from any open address, and on a RAM channel that's getting hammered by another thread). Ouch. Given that real IPC has been increasing, it would probably be equivalent to 300-500 cycles, if we were to normalize it to the older CPU's performance per clock.
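
If you want to see this on your own box, a pointer chase is the standard trick: build a random cycle so every load depends on the previous one and the prefetchers can't help. A rough sketch only (POSIX timing, glibc-sized RAND_MAX assumed), not a proper benchmark:

[code]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)   /* 4M nodes * 8 bytes = 32 MB, well past any 2012 cache */

int main(void)
{
    size_t *next = malloc(N * sizeof *next);
    size_t *perm = malloc(N * sizeof *perm);
    if (!next || !perm) return 1;

    /* Build one big random cycle so every load depends on the previous
       one and the prefetchers can't guess the next address.
       (Assumes RAND_MAX >= N, which is true on glibc.) */
    for (size_t i = 0; i < N; i++) perm[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {          /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < N; i++)
        next[perm[i]] = perm[(i + 1) % N];
    free(perm);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < N; i++)      /* dependent loads: pure latency */
        p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (p=%zu)\n", ns / N, p);
    free(next);
    return 0;
}
[/code]

The per-load number you get is effectively your memory latency: the 100-200 cycle figure above, expressed in nanoseconds.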

This is why CPUs have been getting bigger and more complicated caches. However, as these caches get bigger, they also get slower, and then they must also be made slower still for the CPU to reach higher speeds. So, today, going out to L2 is as big of a deal as going out to main RAM was 15+ years ago.

To mitigate these latencies, your CPU is constantly trying to determine the most likely instruction and data addresses it may need in the near future, and to fetch them before you need them. Programming languages, compilers, preferred data structures, and preferred common algorithms have all been evolving in a complementary fashion with the advancement of such speculative hardware; so while it is hard to look at past use and determine future use, we try to use methods that make it less hard on the hardware, when we can.
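
A tiny concrete case of "making it less hard on the hardware": walking the same matrix in row order (sequential, so full cache lines and the prefetcher work for you) versus column order (strided, so they don't). Just a sketch; the exact gap depends on the machine:

[code]
#include <stdio.h>
#include <time.h>

#define DIM 4096   /* 4096x4096 ints = 64 MB */

static int m[DIM][DIM];

static double ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    struct timespec t0, t1;
    long sum = 0;

    /* Touch everything first so we aren't timing page faults
       (or reading the kernel's shared zero page). */
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            m[i][j] = i ^ j;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < DIM; i++)       /* row order: sequential cache lines */
        for (int j = 0; j < DIM; j++)
            sum += m[i][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("row-major:    %.1f ms\n", ms(t0, t1));

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int j = 0; j < DIM; j++)       /* column order: new line every load */
        for (int i = 0; i < DIM; i++)
            sum += m[i][j];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("column-major: %.1f ms (sum=%ld)\n", ms(t0, t1), sum);
    return 0;
}
[/code]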

When the CPU is correct, it can perform speculative execution as well. The first form that came to commodity CPUs, TMK, was speculative execution of a predicted branch. IE, given "if A do B else do C", it figures it's probably going to need C, and it has the data on the CPU to execute C, so it does so. If it was right, the CPU never even had to wait on evaluating A. If it was wrong, it can get rid of everything after A and jump to B (a performance hit is taken for this, but the prediction is correct often enough that it's worth occasionally being wrong). More recently, speculative re-ordering of loads and stores, and speculative execution by memory value (IE, goto A+B, but B has not been evaluated yet), have been getting more popular (and value speculation is hard).
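
The branch-predictor half of this is easy to demo: run the exact same data-dependent branch over unsorted and then sorted data (the classic trick; a sketch, and compile with light optimization, since a clever compiler may replace the branch with a conditional move and flatten the difference):

[code]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 22)

static int cmp_int(const void *a, const void *b)
{
    return *(const int *)a - *(const int *)b;
}

static void count_big(const int *v, const char *label)
{
    struct timespec t0, t1;
    long hits = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int pass = 0; pass < 10; pass++)
        for (int i = 0; i < N; i++)
            if (v[i] >= 128)            /* the branch under test */
                hits++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
    printf("%s: %.1f ms (hits=%ld)\n", label, ms, hits);
}

int main(void)
{
    int *v = malloc(N * sizeof *v);
    if (!v) return 1;
    srand(1);
    for (int i = 0; i < N; i++) v[i] = rand() % 256;

    count_big(v, "unsorted (branch is a coin flip)");
    qsort(v, N, sizeof *v, cmp_int);    /* now the branch is predictable */
    count_big(v, "sorted   (predictor locks on)  ");
    free(v);
    return 0;
}
[/code]

Unsorted, the branch mispredicts roughly half the time and you pay the flush penalty over and over; sorted, the predictor is nearly always right.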

Then, modern systems use virtual memory. So a memory address the program asks for is in its own little world, and must be translated to a physical memory address, the location of which is controlled by the hardware and operating system. CPUs have buffers of address translation data (TLBs), but these will not always have the value needed. Having to go look it up can take time, even going so far as multiple round trips out to main RAM.

Finally, instructions run in a CPU take time. Most basic instructions can complete very quickly (1 cycle, if everything goes well), but many memory operations, divides, and modulos (a divide that returns the remainder) can take up to ~30 cycles, even after getting all the data ready for them. When another instruction depends on one of those, you have a whole set of instructions twiddling their thumbs in a queue. Good out-of-order execution (OOOE) allows any instruction whose data is ready to execute, so that while that set of instructions is waiting on either one long instruction to complete or on a cache request (if it has to go out to memory, OOOE won't help much), others can run, reducing the penalty.
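
Here's that queue of twiddling thumbs in miniature: a chain of divides where each one needs the previous result, versus the same number of divides split into independent chains that OOOE can keep in flight together. A sketch only; how much overlap you actually get depends on how pipelined your CPU's divider is:

[code]
#include <stdio.h>
#include <time.h>

#define ITERS 50000000L

static double ms(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

int main(void)
{
    volatile double seed = 1e18;        /* volatile: keep the compiler honest */
    struct timespec t0, t1;

    /* One dependent chain: every divide waits for the previous result,
       so the core mostly sits on divider latency. */
    double a = seed;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        a = a / 1.000000001;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double dep = ms(t0, t1);

    /* Same number of divides in four independent chains: OOOE can keep
       more of them in flight (how much depends on the divider). */
    double b = seed, c = seed, d = seed, e = seed;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i += 4) {
        b = b / 1.000000001;
        c = c / 1.000000001;
        d = d / 1.000000001;
        e = e / 1.000000001;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ind = ms(t0, t1);

    printf("dependent chain: %.1f ms\nfour chains:     %.1f ms (check: %g)\n",
           dep, ind, a + b + c + d + e);
    return 0;
}
[/code]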

Oh, wait, that's not the end. What we call operating systems today grew from what used to be called time-sharing systems, and the way threads are handled follows that history. When swapping from thread A to thread B, the CPU lets execution finish, then packs up the register state and sends it out to RAM. Once that is done, it loads the state for thread B. This takes for freaking ever. It's like wrapping a gift in a box, packaging it, sending it by post, then waiting for the receiver to send back a thank-you letter before handling the next gift. There are good reasons for sticking with it, but you would generally imagine the process being quicker and simpler than it is. Some of this latency can be hidden by HT, which greatly helped the P4.
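
One crude way to put a number on that "pack it up and ship it out" cost is a pipe ping-pong between two processes, which forces the scheduler to swap on every byte. A POSIX sketch, and note it measures switch cost plus pipe overhead together, not a clean context-switch figure:

[code]
#include <stdio.h>
#include <unistd.h>
#include <time.h>

#define ROUNDS 100000

int main(void)
{
    int ab[2], ba[2];                   /* parent->child, child->parent */
    char byte = 'x';

    if (pipe(ab) || pipe(ba)) return 1;

    if (fork() == 0) {                  /* child: echo every byte back */
        for (int i = 0; i < ROUNDS; i++) {
            if (read(ab[0], &byte, 1) != 1) _exit(1);
            if (write(ba[1], &byte, 1) != 1) _exit(1);
        }
        _exit(0);
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ROUNDS; i++) {  /* each round trip forces 2+ switches */
        if (write(ab[1], &byte, 1) != 1) return 1;
        if (read(ba[0], &byte, 1) != 1) return 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double us = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                 (t1.tv_nsec - t0.tv_nsec)) / 1e3;
    printf("~%.2f us per round trip (2+ context switches each)\n",
           us / ROUNDS);
    return 0;
}
[/code]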

Well, taken all together, you should be able to see how a CPU will often be left waiting on something to do, and thus how swapping between threads at the front end is not a big deal (the front end can decode and prepare instructions faster than they will ever be executed, outside of synthetic benchmarks).

The benefit of Hyperthreading, and any similar SMT implementation, is that it can get the CPU doing work while one thread is waiting around, making more efficient use of often-idle execution resources. The downsides with Hyperthreading, and any similar SMT implementation, are that the other thread is every bit as likely to stall as the first, and they both compete over the same memory and execution resources when they aren't stalled.
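
If you want to poke at that trade-off on your own machine, the blunt-instrument test is to time a fixed pile of stall-heavy work split across N threads, once at your physical core count and once at your logical count. A pthreads sketch (build with -lpthread; the program name and numbers below are mine, e.g. "./smt_test 4 8" on an i7 920, not anything from the benchmarks in this thread):

[code]
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <time.h>

#define TOTAL_WORK (1L << 30)   /* fixed total iterations, split across threads */

static void *worker(void *arg)
{
    long iters = *(long *)arg;
    volatile double x = 1.0;
    for (long i = 0; i < iters; i++)   /* dependent chain through memory:     */
        x = x * 1.0000001 + 1e-9;      /* stall-heavy, the case HT likes best */
    return NULL;
}

static double run(int nthreads)
{
    pthread_t tid[64];
    long per = TOTAL_WORK / nthreads;
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tid[i], NULL, worker, &per);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(int argc, char **argv)
{
    /* e.g. "./smt_test 4 8" on a 4-core/8-thread i7 920 */
    for (int i = 1; i < argc; i++) {
        int n = atoi(argv[i]);
        if (n >= 1 && n <= 64)
            printf("%2d threads: %.2f s\n", n, run(n));
    }
    return 0;
}
[/code]

Whether the 8-thread run beats the 4-thread run, and by how much, comes down to exactly what Cerb describes: how often the threads stall, and how hard they fight over the shared execution resources when they don't.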
 

janas19

Platinum Member
Nov 10, 2011
2,313
1
0
Kudos to you Cerb for breaking it down for us. Much appreciated.

+5 Rep for you! ;)