Important reminder that CPU usage in a game is not always indicative of anything

generalmx

Junior Member
Jan 15, 2008
10
0
0
High CPU usage for a core, process, or thread does not necessarily mean the game is actually doing anything important or significant with that core (something that would benefit from more cores or faster ones): sometimes "Core #1 Usage: 99%" actually means it's not really doing anything except waiting for something on "Core #0" to finish.

In programming, there is what is known as a "busy wait" (or spin wait): instead of freeing up resources by sleeping (a true blocking wait) while waiting for something else (usually I/O like disk access, network data, etc., but it can also be simply waiting for another thread to finish), the thread sits in a tight loop checking over and over until it can continue. Busy waits are generally bad because they waste cycles and can feed into problems around mutual exclusion and eventually deadlocks; however, the OS, not the program, is the main one in charge of preventing deadlocks, and dealing with stuff like proper thread synchronization is pretty damn tricky (to pretty goddamn tricky). Remember that graphics developers' main goal is to make a prettier, more immersive, and more functional experience, and a game developer's main goal is to make a fun and profitable game, etc.: optimization often takes a back seat except where it's clearly multi-platform safe. And tricks like busy waits can make thread synchronization a heck of a lot easier and/or safer.
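
To make it concrete, here's a minimal C++ sketch (made-up names, not from any real game) of the kind of wait being described: one thread spins on a flag while another finishes its work, so the spinning thread's core shows near-100% usage even though nothing useful is happening there.

```cpp
// Minimal busy-wait sketch (hypothetical names): the waiting thread pegs a
// core at ~100% in Task Manager even though it is doing no useful work.
#include <atomic>
#include <chrono>
#include <thread>

std::atomic<bool> worker_done{false};   // the "thing on Core #0" we're waiting on

void worker() {
    // Stand-in for the real work happening on the other core.
    std::this_thread::sleep_for(std::chrono::seconds(5));
    worker_done.store(true, std::memory_order_release);
}

int main() {
    std::thread t(worker);

    // "Core #1 Usage: 99%": spin in a tight loop re-checking the flag
    // instead of letting the OS put this thread to sleep.
    while (!worker_done.load(std::memory_order_acquire)) {
        // burn cycles doing nothing useful
    }

    t.join();
}
```

Run it and watch Task Manager: one core sits pegged for five seconds while the program accomplishes nothing.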

You may notice in your Windows Task Manager that the "System Idle Process" usually takes up most of your CPU resources. It is indeed a process, one that continually feeds the processor an instruction in a loop: most often, in modern times, the HALT (HLT) instruction, which maximizes power savings. Games with busy waits, on the other hand, not being coded anywhere near the level of an operating system (and for multi-platform/portability reasons), will most often be compiled to feed the processor something like NOP (No Operation) instead, along with possibly some overhead instructions they do while waiting. While both the hardware and the OS can sometimes optimize power usage for instructions like NOP, it'll still show up as high CPU usage for that thread/process.
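
For contrast with the spin loop above, here's the truly blocking version of the same wait, again just a sketch with made-up names: the OS parks the waiting thread (much like HLT parks the processor in the idle loop), so it shows up as roughly 0% usage instead of 99%. Real spin loops often also drop in a PAUSE/NOP-style hint such as _mm_pause(), which helps the hardware a little but still registers as a fully busy thread.

```cpp
// Sleeping (truly blocking) wait sketch: the waiting thread is descheduled,
// so Task Manager shows ~0% usage for it while it waits.
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable cv;
bool worker_done = false;   // same hypothetical "thing we're waiting on"

void worker() {
    std::this_thread::sleep_for(std::chrono::seconds(5));  // stand-in for real work
    {
        std::lock_guard<std::mutex> lock(m);
        worker_done = true;
    }
    cv.notify_one();
}

int main() {
    std::thread t(worker);

    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return worker_done; });   // thread sleeps here, no cycles burned

    t.join();
}
```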

Those who are old enough (or cared enough back then) may remember that John Carmack was trying to explain partly this, along with the general immaturity of multithreading at the time, when he was asked repeatedly why Doom 3 wasn't optimized for multicore, and why his answer seemed to indicate that "multicore isn't important for gaming [at this time]". Not to say it isn't, but to say that all the marketing hype about games -- especially multi-platform games -- being "optimized for quad-core" or "optimized for multi-core multithreading" is more likely marketing bullshit with some poor attempt at multithreading going on than a serious multicore/multithreaded innovation.

That said, the newest generation of x86 multicore, thread-loving APU (CPU+GPU) gaming consoles may bring about real, serious innovation in multithreaded gaming.
 

Falafil

Member
Jun 5, 2013
51
0
0
You sure it's not just John Carmack making up an excuse for not multithreading his game?
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
It's a classic from back in the day: 100% usage, no matter what you did in games.

And it still applies to games today.

Remember, looking at Windows Task Manager graphs to see how loaded the cores are is a very bad idea, because you won't get the result you're looking for. That's also why reviews with Task Manager graphs are about as useless as they get.
 

Headfoot

Diamond Member
Feb 28, 2008
4,444
641
126
The first generation of multi-core-architected engines is arriving: Frostbite 3 and Cryengine 3 so far.
 

BrightCandle

Diamond Member
Mar 15, 2007
4,762
0
76
You can be CPU limited at 12.5% usage on an i7 because the game is really only single-threaded. The complexities of locking and busy waiting aside, the basic problem is that none of the tools currently in use show the thread-level usage of a game. Thus we don't really know how parallel a game is or whether that parallelism is useful.

The sad state of tooling is that, despite the Windows API supporting queries about a process's per-thread usage, very few people use perfmon to gather per-thread stats.
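
For what it's worth, the raw data is easy to get at. Here's a rough Win32 sketch (the PID is whatever process you point it at; error handling kept minimal) that enumerates a process's threads and prints the kernel/user CPU time each one has accumulated via GetThreadTimes. Sample it twice and diff the numbers to get per-thread % usage, which is essentially what perfmon's per-thread counters give you.

```cpp
// Sketch: list the threads of one process and the CPU time each has used.
// Pass the target process ID on the command line.
#include <windows.h>
#include <tlhelp32.h>
#include <cstdio>
#include <cstdlib>

static double filetime_to_seconds(const FILETIME& ft) {
    ULARGE_INTEGER u;
    u.LowPart  = ft.dwLowDateTime;
    u.HighPart = ft.dwHighDateTime;
    return u.QuadPart / 1e7;   // FILETIME ticks are 100 ns
}

int main(int argc, char** argv) {
    if (argc < 2) { std::printf("usage: threadtimes <pid>\n"); return 1; }
    DWORD pid = static_cast<DWORD>(std::atoi(argv[1]));

    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);
    if (snap == INVALID_HANDLE_VALUE) return 1;

    THREADENTRY32 te;
    te.dwSize = sizeof(te);
    for (BOOL ok = Thread32First(snap, &te); ok; ok = Thread32Next(snap, &te)) {
        if (te.th32OwnerProcessID != pid) continue;

        HANDLE th = OpenThread(THREAD_QUERY_INFORMATION, FALSE, te.th32ThreadID);
        if (!th) continue;

        FILETIME created, exited, kernel, user;
        if (GetThreadTimes(th, &created, &exited, &kernel, &user)) {
            std::printf("thread %6lu  kernel %7.2fs  user %7.2fs\n",
                        static_cast<unsigned long>(te.th32ThreadID),
                        filetime_to_seconds(kernel),
                        filetime_to_seconds(user));
        }
        CloseHandle(th);
    }
    CloseHandle(snap);
    return 0;
}
```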
 

fixbsod

Senior member
Jan 25, 2012
415
0
0
I don't think a multicore D3 would be any improvement on the disaster it already was.
 

KingFatty

Diamond Member
Dec 29, 2010
3,034
1
81
I see the point about how the CPU usage can be misleading about the particular state of that particular machine.

But doesn't it affect CPUs equally, so you'll see similar misleading information across CPUs, so that it is still meaningful to consider CPU usage when comparing between two different CPUs?

What I mean is that this misleading effect will be like fouls that "offset" each other on two different CPUs, so you can still compare which is better than the other?

I guess I'm also asking whether this misleading aspect is something we can safely ignore because it's just part of CPUs and the CPU meter, so the chart still gives you good info: you don't really need to know the actual truth, just use what the CPU usage chart says and remain blissfully ignorant of the misleading part?
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
KingFatty said:
I see the point about how the CPU usage can be misleading about the particular state of that particular machine.

But doesn't it affect CPUs equally, so you'll see similar misleading information across CPUs, so that it is still meaningful to consider CPU usage when comparing between two different CPUs?

I would be inclined to say no.
 

generalmx

Junior Member
Jan 15, 2008
10
0
0
This all actually gets even more confusing when we deal with modern Intel vs. AMD, and even more so when we delve into non-x86 architectures (like ARM). Back in the ancient days a floating-point unit was something extra, but nowadays a "modern CPU" is assumed to support both integer and floating-point (fractional) operations. However, both of the major x86 companies have kinda changed what that means when you think about a "core" handling a "thread".

The Pentium 4 NetBurst microarchitecture had two double-pumped "fast" ALUs for simple integer instructions (each effectively acting like a pair), one slower ALU for complex integer instructions, one FP execution unit for simple instructions, one FP execution unit for "Move" instructions, one FP/MMX/SSE/SSE2 ALU, and two AGUs for load and store addresses. (Source)

The Nehalem ("Core i") microarchitecture is even more complex: integer and floating-point (FP) operations (along with MMX and SSE operations) are spread across 3 different ALU-type execution units in each core, instead of dedicated, separate integer and floating-point execution units, plus a pair of AGUs for load/store addresses and one special "store data" unit. Those 3 ALUs + AGUs aren't separate processor cores, though: they're specialized execution units inside each of the four identical cores; together they can do the basic operations of that old P4 core, but each unit is meant for specific types of them. In other words, the execution resources within a core are heterogeneous (not all the same), even though the cores themselves are identical. (See: https://en.wikipedia.org/wiki/File:Intel_Nehalem_arch.svg and http://www.pcper.com/reviews/Processors/Inside-Nehalem-Intels-New-Core-i7-Microarchitecture/Core-Enhancements-and-Cache-S)

And then AMD decided to push heterogeneous design even further with Bulldozer, building modules out of two "integer cluster cores" with two ALUs each, where those two "cores" share a single floating-point cluster (what we'd have just called the FPU in the olden days, back when the CPU was mainly integer). This is one of the main reasons Bulldozer ended up sucking: an AMD "native processor core" effectively had half the floating-point power of a native Intel core. So with that impressive "8-core" Bulldozer/Piledriver you actually have only 4 FP clusters. What ended up happening is that, while this was great for integer-heavy workloads (like CPU video encoding), Intel quad-cores with HyperThreading would still regularly beat AMD's new "cores", since Intel's cores were much better at any single task; basically, programs would have to be not just multithread-optimized, but multithread-optimized specifically for this odd "two cores share an FPU" design. (See: https://en.wikipedia.org/wiki/File:AMD_Bulldozer_block_diagram_%28CPU_core_bloack%29.PNG)

Lemme try to explain this a lot more basically: a computer is a super-fast calculator that is programmed using mainly ADD/SUB (add and subtract), MUL/DIV (multiply and divide), MOV (move from/to memory), and branch ("if x = 2 go here") operations. So let's say you give your computer 2 + 2 to resolve. Each Nehalem core has 3 different, complex ALUs that can deal with that, while a quad-core AMD Bulldozer has 4 "integer clusters" with 2 simpler ALUs each. However, if we want to do 2 + 2 + 0.2, then Bulldozer needs to get its separate, shared FPU involved, while Nehalem can possibly do it on the same unit that did the 2 + 2, effectively computing (2 + 2) + 0.2 straight through instead of waiting for an "integer cluster" to finish the 2 + 2 first and then possibly waiting for the shared FPU to finish a computation required by the other "core".
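
If you want to see the shared-FPU effect for yourself, something along these lines would do it. This is only a rough sketch: the affinity masks assume Windows numbers an FX chip's logical CPUs so that 0 and 1 are the two cores of one module while 0 and 2 sit in different modules (adjust for your machine), and how big the gap is depends a lot on the actual floating-point code.

```cpp
// Rough benchmark sketch: run two floating-point-heavy threads either on the
// two cores of one Bulldozer module (shared FPU) or on cores in different
// modules, and compare wall-clock time. Core numbering is an assumption.
#include <windows.h>
#include <chrono>
#include <cstdio>
#include <thread>

volatile double sink;   // keeps the optimizer from deleting the loops

void fp_work(DWORD_PTR affinity_mask) {
    SetThreadAffinityMask(GetCurrentThread(), affinity_mask);  // pin before working
    double a = 0.1, b = 0.2, c = 0.3, d = 0.4;
    for (long i = 0; i < 100000000L; ++i) {
        // independent FP multiply-adds to keep the FPU pipes busy
        a = a * 1.0000001 + 0.1;
        b = b * 1.0000002 + 0.2;
        c = c * 1.0000003 + 0.3;
        d = d * 1.0000004 + 0.4;
    }
    sink = a + b + c + d;
}

double run_pair(DWORD_PTR mask_a, DWORD_PTR mask_b) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1(fp_work, mask_a);
    std::thread t2(fp_work, mask_b);
    t1.join();
    t2.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    std::printf("same module       (CPUs 0+1): %.2f s\n", run_pair(1 << 0, 1 << 1));
    std::printf("different modules (CPUs 0+2): %.2f s\n", run_pair(1 << 0, 1 << 2));
}
```

On an Intel quad-core the two runs should look about the same; on a Bulldozer/Piledriver chip you'd expect the same-module run to come out slower if the shared FPU is the bottleneck.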

As for the future of heterogeneous computing, AMD is working swiftly towards merging more and more parts of their CPUs and GPUs, which may be quite awesome for everyone. So instead of having just 4 units in a "quad-core" that can do integer operations, we may also have 40 units optimized for different types of simple operations (modern graphics "cores" are basically lots of really simple ALUs) plus lots of other execution units that do not require explicit recoding to use.

Note: I'm sure I made some type of engineering error here -- I'm not an engineer -- but AFAIK this should be basically true in P4 vs Modern Intel Multicore (Core i) vs Modern AMD Multicore (Bulldozer on).
 