Interesting BF 4 CPU usage [GameGPU.Ru]


ChronoReverse

Platinum Member
Mar 4, 2004
2,562
31
91
No need to repeat myself, I'll just point you to the same conversation I've been having with people since forever.

http://forums.anandtech.com/showthread.php?t=2329338&highlight=hyper

That's... part of my explanation. There are cases where you can neatly fit your load into the execution units so there's nothing left for the second thread.

But unless you're doing something as "easy" as Linpack, that's not going to happen for most code. It's simply not possible to optimize to that point.



For example x264 has a lot of hand-crafted assembly. It's so fast that on its fastest setting, it can match the pace of Quicksync (dedicated encoder) but yield similar or better quality. And even then, it gains benefit from Hyperthreading.



It really seems to me that you're getting stuck on the point that there are some cases where SMT doesn't help (and in fact HURTS) performance. Intelligent scheduling can help with that (and IBM's POWER line does some more magic in hardware to deal with that). But that doesn't change that HT is a net benefit and there are many cases where it's not possible to optimize code to "perfectly fit" into the core.

And even if we take your view at face value, SMT remains useful in general since we all have legacy code that's not well optimized.
 
Aug 11, 2008
10,451
642
126
It's a good showing for the 8350, but even then it's basically right there with the slower-clocked 4-core 4670K. OC them both and the 4670K leaves it behind while using quite a bit less power. Heck, even the 2500K would pass up the 8350 with both OCed. And again, this is about as good as the 8350 can look.

Depends on the benchmark you are looking at. In some at stock, they are basically the same, as in the GameGPU results. In the pcper multiplayer benchmarks though, the 4670K is significantly ahead of the FX-8350. So I think we need more tests to know the true performance of the 4670K vs the 8350.

I don't really think the results so far show the great results for the 8350 that some are claiming (not referring to you). Basically a non-hyperthreaded i5 is faster than or the same as the 8350. Maybe there was just too much hype about how optimized this game was going to be for multicore and what a boost the 8350 would get.
 

24601

Golden Member
Jun 10, 2007
1,683
40
86
That's... part of my explanation. There are cases where you can neatly fit your load into the execution units so there's nothing left for the second thread.

But unless you're doing something as "easy" as Linpack, that's not going to happen for most code. It's simply not possible to optimize to that point.



x264 has a lot of hand-crafted assembly for example. It's so fast that on its fastest setting, it can match the pace of Quicksync (dedicated encoder) but yield similar or better quality. And even then, it gains benefit from Hyperthreading.

I never said it's "easy" :p that's the point. The reason I even made the statements there is in response to the AMD people saying benchmarks were "Intel Optimized" or "Intel Biased"

If they really were "Intel Optimized" or "Intel Biased", you wouldn't be able to extract a large performance increase by enabling hyperthreading.

It's the AMD people's way around the truth that Intel just has more compute resources per core.



Their hand-crafting of the assembly is obviously not good enough or else it wouldn't show any improvements in hyperthreading over non-hyperthreading. It would probably be easier if the programmers coded to the actual RISC co-processors in the cores (they can't of course).
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
The game is using all eight cores of the CPU (although two cores are receiving a moderately greater load than the other six).

There is no scaling from 6 to 8 that is really visible. The 6300 runs at a lower speed and is minimally slower (some of which is attributable to 3M vs 4M). If there is scaling, then it's minimal.
 

ChronoReverse

Platinum Member
Mar 4, 2004
2,562
31
91
For them to be "Intel Optimized" or "Intel Biased", then you wouldn't be able to extract a large performance increase by enabling hyperthreading

But that's the thing, there are some (real) workloads that are impossible to optimize to the point that Hyperthreading doesn't help.

Something simple like Linpack can do it, but as general purpose processors, CPUs face loads that are complex enough that they can't.


What if you have code that's not math-heavy but has lots of branching and data loading? Out-of-order execution is limited by the dependencies, so while your pipeline is stalled, a second thread has resources to play with. Net benefit. You can't optimize that kind of thing away.
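
To put a rough number on the dependency point, here's a toy Python sketch. It's my own simplification, not a model of any real core: every instruction is assumed to take one cycle, the core is assumed to be infinitely wide, and `critical_path` and `max_ipc` are made-up helper names. The only thing it shows is that the longest dependency chain caps how much a single thread can do per cycle, no matter how good the out-of-order engine is.

```python
def critical_path(deps):
    """Length of the longest dependency chain.

    deps[i] lists the earlier instructions (indices < i) that instruction i
    has to wait for. Assumes 1 cycle per instruction and unlimited width.
    """
    depth = []
    for waits_on in deps:
        # An instruction can start one cycle after its slowest input is done.
        depth.append(1 + max((depth[j] for j in waits_on), default=0))
    return max(depth, default=0)


def max_ipc(deps):
    """Upper bound on instructions per cycle for a non-empty stream."""
    return len(deps) / critical_path(deps)


# Eight independent math ops: nothing waits on anything else.
independent = [[] for _ in range(8)]
# Pointer-chasing style chain: each op needs the result of the previous one.
chain = [[]] + [[i - 1] for i in range(1, 8)]

print(max_ipc(independent))  # bounded only by how wide the core is
print(max_ipc(chain))        # stuck at one op per cycle regardless of width
```

In this sketch the chained stream can never go above one instruction per cycle, so any extra execution units sit idle unless something else, like a second hardware thread, has work for them.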
 

24601

Golden Member
Jun 10, 2007
1,683
40
86
But that's the thing, there are some (real) workloads that are impossible to optimize to the point that Hyperthreading doesn't help.

Something simple like Linpack can do it, but as general purpose processors, CPUs face loads that are complex enough that they can't.


What if you have code that's not math-heavy but has lots of branching and data loading? Out-of-order execution is limited by the dependencies, so while your pipeline is stalled, a second thread has resources to play with. Net benefit. You can't optimize that kind of thing away.

I'm not seeing how you can both be limited by that but also be benefiting from balanced parallelization.

Adding more threads to solve a problem about lacking the ability to be as parallel?
 

ChronoReverse

Platinum Member
Mar 4, 2004
2,562
31
91
The way SMT works, if one thread is stalled doing its thing (or isn't heavy enough to use everything), the free math assets can be used by the other thread.


Linpack can easily fill a CPU's math assets since it's just doing math questions really quickly. You don't have to worry about dependencies or anything, just pack the math together and let the CPU chew through it. It should be easy to see why a single thread could then use up all the math assets in a superscalar CPU.

But what if you're running mixed code, like a game where you have AI decision (branching) as well as physics code (math) for instance? Now you can more efficiently use the CPU's resources with two threads. Use both the branching crap and math muck at the same time.


The idea here isn't that the first thread is any faster (if anything it's slower) but because you can do additional work on the second thread, the net benefit is positive.

And if you don't have the second thread, then the first thread gets all the CPU and runs at the same speed as if the CPU isn't SMT enabled.

This is why SMT is known as a way to increase efficiency of a CPU (note that the Atom uses HT for this reason). It's trying to avoid having unused bits in the CPU.
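
If it helps, the slot-filling idea can be sketched as a toy Python model. This is purely illustrative and not how a real scheduler works: the core is assumed to have 4 issue slots per cycle, each thread is just a list of how many ops it could issue on each cycle (0 means stalled), the first thread always gets priority, and `run_core` plus all the numbers are made up.

```python
def run_core(demands, slots_per_cycle=4):
    """Toy SMT model: each cycle, threads take issue slots in priority order.

    demands[t][c] is how many ops thread t could issue in cycle c
    (0 models a stall). Returns total ops issued per thread.
    """
    issued = [0] * len(demands)
    for c in range(len(demands[0])):
        free = slots_per_cycle
        for t, thread in enumerate(demands):
            take = min(thread[c], free)  # a thread only gets what's left over
            issued[t] += take
            free -= take
    return issued


# "Linpack-like" thread: can fill all 4 slots every cycle.
dense = [4] * 8
# "AI/branchy" thread: stalls or issues only a little on most cycles.
mixed = [2, 0, 1, 0, 2, 0, 1, 2]
# "Physics/math" thread: steady stream of 3 ops per cycle.
physics = [3] * 8

print(run_core([dense]))           # dense thread alone
print(run_core([dense, mixed]))    # second thread finds nothing left over
print(run_core([mixed]))           # branchy thread alone leaves slots idle
print(run_core([mixed, physics]))  # second thread picks up the idle slots
```

With these made-up numbers the dense thread leaves nothing for a sibling, while the branchy thread alone only issues 8 ops in 8 cycles. Pairing it with the math thread gets 29 ops through in the same 8 cycles, even though the math thread on its own would have managed 24 rather than the 21 it gets when sharing.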
 

24601

Golden Member
Jun 10, 2007
1,683
40
86
The way SMT works, if one thread is stalled doing its thing, the free math assets could be used by the other thread.


Linpack can easily fill a CPU's math assets since it's just doing math questions really quickly. You don't have to worry about dependencies or anything, just pack the math together and let the CPU chew through it. It should be easy to see why a single thread could then use up all the math assets in a superscalar CPU.

But what if you're running mixed code, like a game where you have AI decision (branching) as well as physics code (math) for instance? Now you can more efficiently use the CPU's resources with two threads. Use both the branching crap and math muck at the same time.


The idea here isn't that the first thread is any faster (if anything it's slower sometimes) but because you can do additional work on the second thread, the net benefit is positive.

And if you don't have the second thread, then the first thread gets all the CPU and runs at the same speed as if the CPU isn't SMT enabled.

Sounds like a problem with how you are dispatching (your program logic) more than the core just not being able to execute out of order itself.

What you're talking about sounds like the tax for continuing backwards compatibility with x86 instructions.
 

mikk

Diamond Member
May 15, 2012
4,311
2,395
136
For example x264 has a lot of hand-crafted assembly. It's so fast that on its fastest setting, it can match the pace of Quicksync (dedicated encoder) but yield similar or better quality. And even then, it gains benefit from Hyperthreading.


This is nonsense unless you compare a high clocked 8 core Ivy Bridge-EP with a Mainstream Haswell using Quicksync. Quicksync VBR+mbbrc on Haswell i5 is twice as fast as ultrafast preset x264 with better quality.
 

ChronoReverse

Platinum Member
Mar 4, 2004
2,562
31
91
This is nonsense unless you compare a high clocked 8 core Ivy Bridge-EP with a Mainstream Haswell using Quicksync. Quicksync VBR+mbbrc on Haswell i5 is twice as fast as ultrafast preset x264 with better quality.

Eh, last time I read reviews of Quicksync (Ivy Bridge) this was the case. I just got a new i7-4700k CPU so I'll do some tests this weekend. I don't know if they've improved Quicksync performance though (although x264 got some gains with AVX2, maybe they'll balance out).
 

ChronoReverse

Platinum Member
Mar 4, 2004
2,562
31
91
Sounds like a problem with how you are dispatching (your program logic) more than the core just not being able to execute out of order itself.

What you're talking about sounds like the tax for continuing backwards compatibility with x86 instructions.

No, I was speaking about CPUs in general without Intel particularly in mind. You'll note that I had mentioned POWER as another CPU that heavily uses SMT.

Dependencies and stalls are a fact of life when executing out of order (since that still operates on a single thread).
 

mikk

Diamond Member
May 15, 2012
4,311
2,395
136
Eh, last time I read reviews of Quicksync (Ivy Bridge) this was the case. I just got a new i7-4700k CPU so I'll do some tests this weekend. I don't know if they've improved Quicksync performance though (although x264 got some gains with AVX2, maybe they'll balance out).


Use Handbrake for Quicksync, I recommend TU4 and VBR as a starting point and make sure gop-ref-dist is on default (3). If you want something really fast try TU7 :biggrin:


Gains from AVX2 were relatively small in the 1-5% range.
 

24601

Golden Member
Jun 10, 2007
1,683
40
86
They don't because they can't..

Yes I should have worded that better. I'll fix my post now.

No, I was speaking about CPUs in general without Intel particularly in mind. You'll note that I had mentioned POWER as another CPU that heavily uses SMT.

Dependencies and stalls are a fact of life when executing out of order (since that still operates on a single thread).

Yes, it affects all CPUs that use pipelines, which is why I point to the x86 instruction set as limiting per-coprocessor coding in this case, where we are talking about x86 CPUs. RISC instruction sets have the same problem when they use pipelines.
 

ChronoReverse

Platinum Member
Mar 4, 2004
2,562
31
91
Use Handbrake for Quicksync, I recommend TU4 and VBR as a starting point and make sure gop-ref-dist is on default (3). If you want something really fast try TU7 :biggrin:


Gains from AVX2 were relatively small in the 1-5% range.

No CRF modes available? I can't see the "VBR" being anything other than ABR if it's still a single pass. In any case, it's easy enough to test speed.
 

mikk

Diamond Member
May 15, 2012
4,311
2,395
136
No CRF modes available? I can't see the "VBR" being anything other than ABR if it's still a single pass. In any case, it's easy enough to test speed.


No CRF but CP. CP is faster than VBR but quality with VBR is usually better at the same bitrate. AVBR and VBR are two different bitrate modes. Both are supported by Quicksync. For a while Handbrake used AVBR instead of VBR, but this resulted in some quality issues and they went back to VBR. Speed is the same with VBR and AVBR. Same for CBR, but with worse quality.
 

ChronoReverse

Platinum Member
Mar 4, 2004
2,562
31
91
Ugh, Intel didn't make it easy to use Quicksync if you have another video card as your primary. I was hoping to niggle it into action tonight but I'll have to work it out later.
 

mikk

Diamond Member
May 15, 2012
4,311
2,395
136
Ugh, Intel didn't make it easy to use Quicksync if you have another video card as your primary. I was hoping to niggle it into action tonight but I'll have to work it out later.


It has nothing to do with Intel really, it's a Windows 7 limitation. Windows 8 has native headless support as long as the application supports a D3D11 implementation and the 15.31 drivers are installed. Under Windows 7 you could use a virtual display.

http://www.bandicam.com/support/tips/intel-quick-sync/
 

ChronoReverse

Platinum Member
Mar 4, 2004
2,562
31
91
Hmm, I'm on Windows 8.1 and have the 15335 drivers installed. Still giving me a Code 43 error when I try to enable it.

A quick google seems to indicate some people saying I may have had to install Windows with the iGFX as my primary in the first place. Hopefully that guy is just wrong.
 

mikk

Diamond Member
May 15, 2012
4,311
2,395
136
Hmm, I'm on Windows 8.1 and have the 15335 drivers installed. Still giving me a Code 43 error when I try to enable it.

A quick google seems to indicate some people saying I may have had to install Windows with the iGFX as my primary in the first place. Hopefully that guy is just wrong.


The guy is wrong. Code 43 sounds more like a recent Windows 8.1 issue with this driver, reported by several people in Intel's forum. It seems related to configurations with a dedicated GPU plus the iGPU.
 

Carfax83

Diamond Member
Nov 1, 2010
6,841
1,536
136
So have we found out why Nvidia has much better performance than AMD in BF4 under Windows 8?

Many of us were theorizing that it was due to driver multithreading, but that begs the question why these gains only show up in Windows 8 and not Windows 7.
 

MeldarthX

Golden Member
May 8, 2010
1,026
0
76
You're in single player though :(

Compare the 780 cpu results to the R290X results in multiplayer...

bf4_cpu_radeon.png


Every cpu takes a big hit, the 8350 at 5GHz only manages 51 fps with a R290X.


Notice there's no uber mode in that chart. Uber is shown in everything else except multiplayer... why is that? *serious question*