
Interesting BF 4 CPU usage [GameGPU.Ru]

No need to repeat myself, I'll just point you to the same conversation I've been having with people since forever.

http://forums.anandtech.com/showthread.php?t=2329338&highlight=hyper

That's... part of my explanation. There are cases where you can neatly fit your load into the execution units so there's nothing left for the second thread.

But unless you're doing something as "easy" as Linpack, that's not going to happen for most code. It's simply not possible to optimize to that point.



For example x264 has a lot of hand-crafted assembly. It's so fast that on its fastest setting, it can match the pace of Quicksync (dedicated encoder) but yield similar or better quality. And even then, it gains benefit from Hyperthreading.



It really seems to me that you're getting stuck on the point that there are some cases where SMT doesn't help (and in fact HURTS) performance. Intelligent scheduling can help with that (and IBM's POWER line does some more magic in hardware to deal with that). But that doesn't change that HT is a net benefit and there are many cases where it's not possible to optimize code to "perfectly fit" into the core.

And even if we take your view at face value, SMT remains useful in general since we all have legacy code that's not well optimized.
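To make the "fit your load into the execution units" point concrete, here's a toy issue model (my own sketch, not anything from the thread): a hypothetical WIDTH-wide core issues independent instructions side by side, but a dependency chain issues one per cycle, leaving the other slots empty for a second SMT thread.

```python
# Toy model of a superscalar core (hypothetical, for illustration only).
# The core can issue up to WIDTH instructions per cycle, but an
# instruction can't issue until the one it depends on has completed
# (results become visible the cycle after issue).
WIDTH = 4

def cycles_needed(deps):
    """deps[i] = index of the instruction that i depends on, or None."""
    done = {}        # instruction index -> cycle it issued
    cycle = 0
    i = 0
    while i < len(deps):
        issued = 0
        while i < len(deps) and issued < WIDTH:
            d = deps[i]
            if d is not None and done[d] >= cycle:
                break               # operand not ready yet: bubble
            done[i] = cycle
            issued += 1
            i += 1
        cycle += 1
    return cycle

independent = [None] * 8            # Linpack-like: no dependencies
chain = [None] + list(range(7))     # each op needs the previous result

print(cycles_needed(independent))   # 8 ops / 4-wide = 2 cycles, core full
print(cycles_needed(chain))         # 8 cycles, 3 of 4 slots idle each cycle
```

In the first case there's genuinely "nothing left for the second thread"; in the second, three issue slots per cycle sit idle, which is exactly what HT is there to soak up.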
 
It's a good showing for the 8350, but even then it's basically right there with the slower-clocked 4-core 4670K. OC them both and the 4670K leaves it behind while using quite a bit less power. Heck, even the 2500K would pass up the 8350 with both OC'd. And again, this is about as good as the 8350 can look.

Depends on the benchmark you are looking at. In some, at stock, they are basically the same, as in the GameGPU results. In the PCPer multiplayer benchmarks, though, the 4670K is significantly ahead of the FX-8350. So I think we need more tests to know the true performance of the 4670K vs. the 8350.

I don't really think the results so far show the great performance for the 8350 that some are claiming (not referring to you). Basically, a non-hyperthreaded i5 is faster than or the same as the 8350. Maybe there was just too much hype about how optimized this game was going to be for multicore and what a boost the 8350 would get.
 
That's... part of my explanation. There are cases where you can neatly fit your load into the execution units so there's nothing left for the second thread.

But unless you're doing something as "easy" as Linpack, that's not going to happen for most code. It's simply not possible to optimize to that point.



x264 has a lot of hand-crafted assembly for example. It's so fast that on its fastest setting, it can match the pace of Quicksync (dedicated encoder) but yield similar or better quality. And even then, it gains benefit from Hyperthreading.

I never said it's "easy" 😛 that's the point. I only made those statements in response to the AMD people saying benchmarks were "Intel Optimized" or "Intel Biased".

If the benchmarks really were "Intel Optimized" or "Intel Biased", you wouldn't be able to extract a large performance increase by enabling Hyperthreading.

It's the AMD people's way around the truth that Intel just has more compute resources per core.



Their hand-crafted assembly is obviously not good enough, or else x264 wouldn't show any improvement with Hyperthreading over non-Hyperthreading. It would probably be easier if programmers could code to the actual RISC co-processors in the cores (they can't, of course).
 
The game is using all eight cores of the CPU (although two cores are receiving a moderately greater load than the other six).

There is no scaling from 6 to 8 cores that is really visible. The 6300 runs at a lower speed and is only minimally slower (some of which is attributable to 3 modules vs. 4 modules). If there is scaling, it's minimal.
 
If the benchmarks really were "Intel Optimized" or "Intel Biased", you wouldn't be able to extract a large performance increase by enabling Hyperthreading.

But that's the thing, there are some (real) workloads that are impossible to optimize to the point that Hyperthreading doesn't help.

Something simple like Linpack can do it, but as general-purpose processors, CPUs face loads that are complex enough that they can't.


What if you have code that's not math-heavy but does lots of branching and data loading? Out-of-order execution is limited by the dependencies, so while your pipeline is stalled, a second thread has resources to play with. Net benefit. You can't optimize that kind of thing away.
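A quick sketch of the "data loading" half of that (hypothetical code, just to illustrate the dependency): in a linked-list walk, the address of each load is the result of the previous load, so out-of-order hardware can't start the next load early no matter how aggressive it is.

```python
def walk(nodes, start):
    """nodes[i] = (value, index of next node or None); sums the values.
    Each iteration must finish its load before the next address is
    even known -- a serial dependency chain no OoO engine can break."""
    total, i = 0, start
    while i is not None:
        value, nxt = nodes[i]   # load the node...
        total += value
        i = nxt                 # ...only now do we know where to go next
    return total

# a 4-node list scattered through the array: 2 -> 0 -> 3 -> 1
nodes = [(10, 3), (40, None), (5, 0), (20, 1)]
print(walk(nodes, 2))  # 5 + 10 + 20 + 40 = 75
```

While each of those loads is in flight, the core's math units are idle, and a second SMT thread can use them.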
 
But that's the thing, there are some (real) workloads that are impossible to optimize to the point that Hyperthreading doesn't help.

Something simple like Linpack can do it, but as general-purpose processors, CPUs face loads that are complex enough that they can't.


What if you have code that's not math-heavy but does lots of branching and data loading? Out-of-order execution is limited by the dependencies, so while your pipeline is stalled, a second thread has resources to play with. Net benefit. You can't optimize that kind of thing away.

I'm not seeing how you can both be limited by that but also be benefiting from balanced parallelization.

Adding more threads to solve a problem that's about lacking the ability to be parallel?
 
The way SMT works, if one thread is stalled doing its thing, the free math assets can be used by the other thread (or if the first thread isn't heavy enough to use them all).


Linpack can easily fill a CPU's math assets since it's just doing math problems really quickly. You don't have to worry about dependencies or anything; just pack the math together and let the CPU chew through it. It should be easy to see why a single thread could then use up all the math assets in a superscalar CPU.

But what if you're running mixed code, like a game where you have AI decisions (branching) as well as physics code (math)? Now you can use the CPU's resources more efficiently with two threads: work the branching crap and the math muck at the same time.


The idea here isn't that the first thread is any faster (if anything it's slower) but because you can do additional work on the second thread, the net benefit is positive.

And if you don't have the second thread, then the first thread gets all the CPU and runs at the same speed as if the CPU isn't SMT enabled.

This is why SMT is known as a way to increase the efficiency of a CPU (note that the Atom uses HT for this reason). It's trying to avoid leaving bits of the CPU unused.
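That stall-filling argument can be sketched with a toy timeline (my own simulation, nothing from the thread): one branchy thread that keeps waiting on memory, one math thread, one shared issue pipe.

```python
def run(threads):
    """threads: list of segment lists; a segment is ('cpu', n) or
    ('wait', n). One instruction pipe is shared; 'wait' cycles (think
    cache misses) tick down on their own, which is the gap SMT fills."""
    pos = [0] * len(threads)            # current segment per thread
    left = [t[0][1] for t in threads]   # cycles left in that segment
    cycles = 0
    while any(pos[i] < len(t) for i, t in enumerate(threads)):
        cycles += 1
        pipe_free = True
        for i, t in enumerate(threads):
            if pos[i] >= len(t):
                continue                 # this thread already finished
            if t[pos[i]][0] == 'wait':
                left[i] -= 1             # memory wait elapses on its own
            elif pipe_free:
                left[i] -= 1             # this thread gets the pipe
                pipe_free = False
            if left[i] == 0:
                pos[i] += 1
                if pos[i] < len(t):
                    left[i] = t[pos[i]][1]
    return cycles

branchy = [('cpu', 2), ('wait', 6), ('cpu', 2), ('wait', 6), ('cpu', 2)]
mathy   = [('cpu', 10)]

print(run([branchy]))         # alone: 18 cycles, mostly waiting
print(run([mathy]))           # alone: 10 cycles
print(run([branchy, mathy]))  # together: 18, the math hides in the waits
```

The branchy thread finishes no faster, but the math thread rides along for free: 18 cycles total instead of 28 run back to back.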
 
The way SMT works, if one thread is stalled doing its thing, the free math assets could be used by the other thread.


Linpack can easily fill a CPU's math assets since it's just doing math problems really quickly. You don't have to worry about dependencies or anything; just pack the math together and let the CPU chew through it. It should be easy to see why a single thread could then use up all the math assets in a superscalar CPU.

But what if you're running mixed code, like a game where you have AI decisions (branching) as well as physics code (math)? Now you can use the CPU's resources more efficiently with two threads: work the branching crap and the math muck at the same time.


The idea here isn't that the first thread is any faster (if anything it's slower sometimes) but because you can do additional work on the second thread, the net benefit is positive.

And if you don't have the second thread, then the first thread gets all of the CPU and runs at the same speed as if the CPU weren't SMT-enabled.

Sounds like a problem with how you are dispatching (your program logic) more than the core being unable to execute out of order itself.

What you're talking about sounds like the tax for continuing backwards compatibility with x86 instructions.
 
For example x264 has a lot of hand-crafted assembly. It's so fast that on its fastest setting, it can match the pace of Quicksync (dedicated encoder) but yield similar or better quality. And even then, it gains benefit from Hyperthreading.


This is nonsense unless you compare a high-clocked 8-core Ivy Bridge-EP with a mainstream Haswell using Quicksync. Quicksync VBR+mbbrc on a Haswell i5 is twice as fast as x264's ultrafast preset, with better quality.
 
This is nonsense unless you compare a high-clocked 8-core Ivy Bridge-EP with a mainstream Haswell using Quicksync. Quicksync VBR+mbbrc on a Haswell i5 is twice as fast as x264's ultrafast preset, with better quality.

Eh, the last time I read reviews of Quicksync (Ivy Bridge) this was the case. I just got a new i7-4770K CPU, so I'll do some tests this weekend. I don't know if they've improved Quicksync performance, though (although x264 got some gains with AVX2, so maybe they'll balance out).
 
Sounds like a problem with how you are dispatching (your program logic) more than the core being unable to execute out of order itself.

What you're talking about sounds like the tax for continuing backwards compatibility with x86 instructions.

No, I was speaking about CPUs in general, without Intel particularly in mind. You'll note that I mentioned POWER as another CPU line that heavily uses SMT.

Dependencies and stalls are a fact of life when executing out of order (since that still operates on a single thread).
 
Eh, the last time I read reviews of Quicksync (Ivy Bridge) this was the case. I just got a new i7-4770K CPU, so I'll do some tests this weekend. I don't know if they've improved Quicksync performance, though (although x264 got some gains with AVX2, so maybe they'll balance out).


Use Handbrake for Quicksync, I recommend TU4 and VBR as a starting point and make sure gop-ref-dist is on default (3). If you want something really fast try TU7 :biggrin:


Gains from AVX2 were relatively small in the 1-5% range.
 
They don't because they can't.

Yes I should have worded that better. I'll fix my post now.

No, I was speaking about CPUs in general, without Intel particularly in mind. You'll note that I mentioned POWER as another CPU line that heavily uses SMT.

Dependencies and stalls are a fact of life when executing out of order (since that still operates on a single thread).

Yes, it affects all CPUs that use pipelines, which is why I point to the x86 instruction set as the thing limiting per-coprocessor coding in this case, where we're talking about x86 CPUs. RISC instruction sets have the same problem when they use pipelines.
 
Use Handbrake for Quicksync, I recommend TU4 and VBR as a starting point and make sure gop-ref-dist is on default (3). If you want something really fast try TU7 :biggrin:


Gains from AVX2 were relatively small in the 1-5% range.

No CRF modes available? I can't see the "VBR" being anything other than ABR if it's still a single pass. In any case, it's easy enough to test speed.
 
No CRF modes available? I can't see the "VBR" being anything other than ABR if it's still a single pass. In any case, it's easy enough to test speed.


No CRF, but CQP. CQP is faster than VBR, but quality with VBR is usually better at the same bitrate. AVBR and VBR are two different bitrate modes. Both are supported by Quicksync; for a while Handbrake used AVBR instead of VBR, but this resulted in some quality issues and they went back to VBR. Speed is the same with VBR and AVBR. Same for CBR, but with worse quality.
 
Ugh, Intel didn't make it easy to use Quicksync if you have another video card as your primary. I was hoping to finagle it into action tonight, but I'll have to work it out later.
 
Ugh, Intel didn't make it easy to use Quicksync if you have another video card as your primary. I was hoping to finagle it into action tonight, but I'll have to work it out later.


It has nothing to do with Intel, really; it's a Windows 7 limitation. Windows 8 has native headless support as long as a D3D11 path is supported by the application and the 15.31 drivers are installed. Under Windows 7 you could use a virtual display.

http://www.bandicam.com/support/tips/intel-quick-sync/
 
Hmm, I'm on Windows 8.1 and have the 15335 drivers installed. Still giving me a Code 43 error when I try to enable it.

A quick google seems to indicate some people saying I may have had to install Windows with the iGFX as my primary in the first place. Hopefully that guy is just wrong.
 
Hmm, I'm on Windows 8.1 and have the 15335 drivers installed. Still giving me a Code 43 error when I try to enable it.

A quick google seems to indicate some people saying I may have had to install Windows with the iGFX as my primary in the first place. Hopefully that guy is just wrong.


The guy is wrong. Code 43 sounds more like a recent Windows 8.1 issue with this driver, reported by several people on Intel's forum. It seems related to configurations with a dedicated GPU + iGPU.
 
So have we found out why Nvidia has much better performance than AMD in BF4 under Windows 8?

Many of us were theorizing that it was due to driver multithreading, but that raises the question of why these gains only show up in Windows 8 and not Windows 7.
 
You're in single player though 🙁

Compare the 780's CPU results to the R9 290X results in multiplayer...

[Attached image: bf4_cpu_radeon.png]


Every CPU takes a big hit; the 8350 at 5 GHz only manages 51 fps with an R9 290X.


Notice there's no uber mode here. Uber mode is shown as a second result in everything else, but not multiplayer... why is that? *serious question*
 