Do AMD cpus at least give a smoother desktop experience w/more cores?


VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
You are off the mark, because CMT has nothing to do with it. I guess that you realized the weak point of your "logic", but you are now so invested in sustaining the unsustainable that you are relying on obviously flawed assumptions.

Think a little: the 8 cores are used in WinRAR and the FX scores 100%. Add CB and the FX will retain 92% of its WinRAR throughput while providing 65% of its CB throughput, so what does CMT have to do with those scores?

Are you implying that only 4 cores are used for each app?

But then why is the WinRAR score at 92%?

Because it's memory-bound? Or are you going to try to tell me that a 4-module FX CPU can simultaneously execute not only 8 threads of WinRAR but also 8 threads of Cinebench, at the same time? When I posted the statements from AMD that a maximum of exactly TWO OS threads can execute at a time per module?

And when I'm talking about scheduler, I'm talking about the OS's time-slicing thread scheduler.

Edit: And if CMT (and HT) have nothing to do with it, then tell me, how many threads can the OS schedule per CPU core, on Intel and on AMD FX CPUs?
 

superstition

Platinum Member
Feb 2, 2008
2,219
221
101
First of all, the FX has CMT, clustered multi-threading, which I am less familiar with, other than seeing the architecture diagrams from AMD slides showing which portions of the "cores" are shared within a "module".
And one explanation for the FX losing only 8% of its performance in WinRAR might be that the WinRAR task is far more memory-bound than computation-bound, and thus adding another computationally heavy, but not as memory-bound, program would naturally show very little loss in the memory-bound program.

Do you know if Cinebench scales to 8 cores?
Because it's memory-bound? Or are you going to try to tell me that a 4-module FX CPU can simultaneously execute not only 8 threads of WinRAR but also 8 threads of Cinebench, at the same time?
Cinebench is primarily a floating point test as far as I know. It is one of the worst-case scenario benchmarks for the FX design.

WinRAR and other integer-heavy tasks benefit the FX design because each module has two integer cores and one floating point core. Another thing that benefits FX in testing, as far as I've read, is when a program can not only utilize all of the integer cores but can do so while keeping the code independent and out of the L3 cache (in the L2), since the L3 is on the slow side.

So, an FX chip basically has eight integer cores and four floating point cores. It's not exactly surprising that a chip with double the integer cores performs better in integer tasks and not nearly as well in floating point.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
Because it's memory-bound? Or are you going to try to tell me that a 4-module FX CPU can simultaneously execute not only 8 threads of WinRAR but also 8 threads of Cinebench, at the same time? When I posted the statements from AMD that a maximum of exactly TWO OS threads can execute at a time per module?

And when I'm talking about scheduler, I'm talking about the OS's time-slicing thread scheduler.

There are 8 threads from WinRAR and 8 threads from CB stressing the cores. The FX does well because it has two schedulers and separate execution units for integer and FP; it's a multithreaded core par excellence.

Besides, an OS thread is not necessarily related to a single app thread, or is it?

I'm aware that you're talking about time slicing, but that's not what I'm talking about, and that doesn't explain the Computerbase tests.

If you were to slice the time, this would be the equivalent of serializing the tasks but at high speed: let's say 1000 cycles for one app and then 1000 for the other app, or whatever other time distribution maximises the throughput.

In such a scenario the outcome is the same as running one app after the other has ended and the result in the tests would be to halve the scores (assuming the two apps use equivalent amount of time).
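
To put numbers on the argument above, here is a minimal sketch of the "pure time slicing" model, assuming the two apps never overlap and context switches are free. The function name is made up for illustration; the 0.92 and 0.65 figures are the ones quoted in this thread for the FX in the Computerbase test.

```python
# Toy model: two apps share the CPU purely by time slicing (no overlap at all).

def time_sliced_throughput(share_a):
    """Relative throughput of apps A and B when A gets `share_a` of CPU time."""
    if not 0.0 <= share_a <= 1.0:
        raise ValueError("share_a must be between 0 and 1")
    return share_a, 1.0 - share_a

# Equal split: each app keeps half of its standalone throughput.
winrar, cinebench = time_sliced_throughput(0.5)
print(f"pure time slicing: WinRAR {winrar:.0%}, Cinebench {cinebench:.0%}")

# Figures quoted in the thread for the FX running both apps at once.
observed_winrar, observed_cinebench = 0.92, 0.65
observed_sum = observed_winrar + observed_cinebench
print(f"observed sum: {observed_sum:.2f} (pure slicing caps the sum at 1.00)")
```

Under this model the two relative throughputs always add up to 1.0 whatever the split, so a combined figure above 1.0 means at least one app was not saturating the CPU on its own, or the two workloads overlap in hardware. The model by itself cannot tell which.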
 

jhu

Lifer
Oct 10, 1999
11,918
9
81
Besides, an OS thread is not necessarily related to a single app thread, or is it?

A thread is a thread. The OS ultimately retains control due to interrupts (otherwise everything software related falls apart). So the real question is whether everything currently in flight (pipeline, reorder buffer, etc.) gets flushed or otherwise becomes useless during a context switch.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
There are 8 threads from WinRAR and 8 threads from CB stressing the cores. The FX does well because it has two schedulers and separate execution units for integer and FP; it's a multithreaded core par excellence.
I'm not taking pot shots at AMD here, and I'm not trying to compare them to Intel.
I think that the overall intent of the design of Bulldozer and its successors isn't bad, but the specific actual implementation turned out to be, along with outdated process tech.

Oh, and you're not (to my knowledge) running 16 threads at once on a 4-module CMT CPU. There's some amount of time-slicing going on here.

Besides, an OS thread is not necessarily related to a single app thread, or is it?
App threads share a process address space. But they get scheduled on CPU cores according to priority queues by the OS, one thread per logical CPU core.
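
As a rough illustration of "one thread per logical CPU core", here is a toy round-robin scheduler. It is only a sketch: it ignores priorities, blocking on I/O, and any sharing inside a core or module, and the function name and tick counts are made up for the example.

```python
from collections import deque

def round_robin(num_threads, num_cpus, ticks):
    """Return how many ticks each thread ran under a simple round-robin policy."""
    run_queue = deque(range(num_threads))
    ran = [0] * num_threads
    for _ in range(ticks):
        # At most one thread per logical CPU runs during a tick.
        running = [run_queue.popleft() for _ in range(min(num_cpus, len(run_queue)))]
        for tid in running:
            ran[tid] += 1
        # Preempted threads go to the back of the queue.
        run_queue.extend(running)
    return ran

ticks = 1000
ran = round_robin(num_threads=16, num_cpus=8, ticks=ticks)
print([count / ticks for count in ran])
```

With 16 always-runnable threads on 8 logical CPUs, every thread ends up with half of the ticks in this model; a real OS scheduler would deviate from that as soon as threads block or have different priorities.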
I'm aware that you're talking about time slicing, but that's not what I'm talking about, and that doesn't explain the Computerbase tests.

If you were to slice the time, this would be the equivalent of serializing the tasks but at high speed: let's say 1000 cycles for one app and then 1000 for the other app, or whatever other time distribution maximises the throughput.

In such a scenario the outcome is the same as running one app after the other has ended, and the result in the tests would be to halve the scores (assuming the two apps use an equivalent amount of time).
Well, except for cache effects in the case of Pentium / i5 (non-HT Intel CPUs), sharing a core (in the case of i3 / i7 on Intel), and sharing a module (in the case of AMD FX CPUs).

Here's a question, do you see the same kind of scaling, with Cinebench and WinRAR, when you limit your FX CPUs, to one thread per module, rather than two?

Likewise, running the same tests, on an L3-less FM2/FM2+ APU?

Let me also say, that if you had re-phrased your initial statement, about the i5 not being able to multitask both an INT-heavy thread and an FP-heavy thread, but instead simply stated that the FX CPUs scaled better when presented with 8 int threads and 8 FP threads, I might have simply agreed with you.

It's not that the i5 slows down with mixed thread workloads, because as I've stated, it executes one thread per full core. The only issue with the i5 would be the shared L3, that I can see.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
There are 8 threads from WinRAR and 8 threads from CB stressing the cores,
There are 8 threads from CB, but they don't run the WinRAR benchmark; they actually compress something, as they say in the review. Try it out yourself and you will see that it will not stress all 8 cores; your CPU will most probably stay below 50%.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
There are 8 threads from CB, but they don't run the WinRAR benchmark; they actually compress something, as they say in the review. Try it out yourself and you will see that it will not stress all 8 cores; your CPU will most probably stay below 50%.

Running CB AND WinRAR simultaneously will not stress all 8 cores? Where did you see that?

CB alone will stress those 8 cores 100%.
 

coercitiv

Diamond Member
Jan 24, 2014
7,362
17,455
136
If he was correct then stressing a single core with two threads instead of one wouldn't increase the throughput.
That's not enough to prove your point: the job of the OS is not only to maximize throughput, but to do so while multitasking is still effective. It trades some throughput for responsiveness, just like HT trades some responsiveness for throughput at the hardware level.

Increasing the thread count beyond the number of CPU threads does increase 7Zip archiving speed, but does so at the cost of making the system hardly responsive. Archiving a 1.2GB folder on a Haswell i7 went down from 2m06s with 8 threads to 1m43s with 16 threads, but the system couldn't even load a webpage properly while doing that.
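
For reference, the gain from those two timings works out as follows. This is just the arithmetic on the numbers quoted above; the variable names are illustrative.

```python
def to_seconds(minutes, seconds):
    """Convert a minutes/seconds timing to plain seconds."""
    return minutes * 60 + seconds

t8 = to_seconds(2, 6)     # 8 threads: 2m06s
t16 = to_seconds(1, 43)   # 16 threads: 1m43s

speedup = t8 / t16        # how many times faster the 16-thread run is
time_saved = 1 - t16 / t8 # fraction of wall-clock time saved

print(f"speedup: {speedup:.2f}x, time saved: {time_saved:.1%}")
```

So doubling the thread count bought a bit over a fifth more throughput in this one run, which is the gain being weighed against the loss of responsiveness.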

I invite any BD or XV owner to do a similar test and report back on results & responsiveness while doubling the thread count. Should be interesting to compare behavior.

By the same logic, the FX8350's throughput in WinRAR + CB 11.5 should be at least halved for one or both apps when running simultaneously, yet CB loses 38% throughput and WinRAR only 8%.
What happens to your line of reasoning when you double the WinRar threads on BD to increase archiving throughput? Will it increase when only WinRar is running, and how will that affect overall throughput when CB threads are running as well? If operands are handled independently, throughput should still go up, correct?

What is your comment on the behavior of the HEDT platform in these tests? Why is it that HEDT Haswell does not succumb to the same weakness as mainstream Haswell?

There are clear differences between the way SMT and CMT cores end up handling combined FP & INT loads, but basing conclusions solely on observation while there are so many layers affecting performance (both hw and sw based) can only lead to a sterile debate. We really need some form of written documentation to make this discussion worthwhile.

Running CB AND WinRAR simultaneously will not stress all 8 cores? Where did you see that?

CB alone will stress those 8 cores 100%.
An archiving thread doesn't necessarily stress a CPU thread to 100%. Combine that with time slicing... and you end up with lower than 100% overall CPU stress.

7Zip running on 8 threads
[image: wC1agRa.png]

and the same folder being compressed with 16 threads
[image: CE75Vg3.png]
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
Running CB AND WinRAR simultaneously will not stress all 8 cores? Where did you see that?

CB alone will stress those 8 cores 100%.

That's what I said: yes, CB alone will stress those 8 cores 100%.
But compressing something with WinRAR will not.
That's why CB loses 38% throughput and WinRAR 8%; that's only 46% CPU time that WinRAR would use when running on its own.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
I thought you were talking running both apps together.

If I remember correctly, WinRAR is heavily memory bandwidth bound.

Intel HEDT has quad-channel memory with up to 2133MHz DIMMs; that could be the reason they don't degrade like the desktop SKUs.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
If I remember correctly, WinRAR is heavily memory bandwidth bound.
WinRAR, when you really use it and not just run the benchmark, is bound by a lot of things: SSD speed, memory, even the OS having to access a bunch of files.
 

coercitiv

Diamond Member
Jan 24, 2014
7,362
17,455
136
I thought you were talking running both apps together.
We are talking about running both apps together, but the explanation Abwx offers for the increased throughput of BD versus Haswell is that out-of-order designs can (or should) process operands from multiple threads as long as the necessary HW resource is not being used by the active thread. Based on this we might conclude HW has a hard time running combined loads (FP + INT).

This hypothesis should also hold true for multiple INT threads like the ones created by 7Zip, hence Abwx brought up this example where increasing 7Zip archiving thread count brings about increased throughput.

However, unlike CB or other processing benchmarks, archiving depends on several other hardware resources, which creates conditions for sub optimal CPU utilization. Simply observing an increase in throughput does not prove the hypothesis to be correct. See my post above.

If we accept that WinRAR isn't actually using the full resources the FX8350 has to offer, a different explanation comes to mind: we have 4 BD modules running on average 4 CB threads and 4 WinRAR threads at a time (with the other 8 awaiting execution time). If 4 CB threads are placed on different modules, would that result in more than 50% FP throughput being available?
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126



Threadcrapping and trolling are not allowed
Markfw900
 
Last edited by a moderator:

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Each BD module can fetch, decode, and execute up to two different threads per cycle. If I remember correctly, the FPU can only execute mOps from two different threads every two cycles. That is, it needs two cycles to fetch and then it can decode and execute from up to two threads.

If you only use one core from each module then you get the maximum throughput from each thread, since you don't have the CMT penalty.
 

coercitiv

Diamond Member
Jan 24, 2014
7,362
17,455
136
Each BD module can fetch, decode, and execute up to two different threads per cycle. If I remember correctly, the FPU can only execute mOps from two different threads every two cycles. That is, it needs two cycles to fetch and then it can decode and execute from up to two threads.

If you only use one core from each module then you get the maximum throughput from each thread, since you don't have the CMT penalty.
So in theory, with a somewhat optimal thread placement for CB threads, BD could make very good use of its FP resources. Meanwhile, if and only if the WinRAR load is considerably lower than the actual CPU processing resources, archiving time might suffer only a small penalty due to more efficient resource usage (less CPU time, but better fed).

I'll stop here though, since I have no idea whether the scheduler can actually accomplish this kind of thread distribution.

Hope this discussion remains civil, it's been a while since anything remotely interesting came out of an AMD vs. Intel forum skirmish.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
If we accept that WinRAR isn't actually using the full resources the FX8350 has to offer, a different explanation comes to mind: we have 4 BD modules running on average 4 CB threads and 4 WinRAR threads at a time (with the other 8 awaiting execution time). If 4 CB threads are placed on different modules, would that result in more than 50% FP throughput being available?

And the result would be compression time increased 2x and the CB score halved; here we have the compression time at 1.08x and the CB score at 0.65x.

So that's 7.36 cores for WinRAR and 4.8 cores for CB when both apps are running simultaneously.
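
The "equivalent cores" figures can be reproduced with a couple of lines, assuming they are simply relative throughput multiplied by 8 cores (that reading is an assumption, as are the names below). Note that 4.8 corresponds to a 0.60 factor; with the 0.65 quoted just above, the CB figure comes out to 5.2.

```python
CORES = 8

def equivalent_cores(relative_throughput, cores=CORES):
    """Express a relative throughput as a number of fully used cores."""
    return relative_throughput * cores

winrar_cores = equivalent_cores(0.92)  # WinRAR keeps 92% of its throughput
cb_cores_065 = equivalent_cores(0.65)  # CB score at 0.65x
cb_cores_060 = equivalent_cores(0.60)  # factor implied by the 4.8 figure

print(f"WinRAR: {winrar_cores:.2f} cores")
print(f"CB at 0.65x: {cb_cores_065:.2f} cores, CB at 0.60x: {cb_cores_060:.2f} cores")
print(f"total with 0.65x: {winrar_cores + cb_cores_065:.2f} of {CORES} cores")
```

Either way the total is well above 8, which is the point being argued: the two apps together get more than a pure time-sliced split would allow.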

What is your comment on the behavior of the HEDT platform in these tests? Why is it that HEDT Haswell does not succumb to the same weakness as mainstream Haswell?
The difference between the i5 and the i7 is that the latter has added cache to serve 4 other threads, the same way i3s have the necessary added cache to deal with 2 more threads than the Pentium/Celeron.

There are 8 threads from CB, but they don't run the WinRAR benchmark; they actually compress something, as they say in the review. Try it out yourself and you will see that it will not stress all 8 cores; your CPU will most probably stay below 50%.

From the pic above we can see that an archiver uses more than 50% when there's 1 thread per core.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
From the pic above we can see that an archiver uses more than 50% when there's 1 thread per core.
You mean the pic from coercitiv?
Task Manager shows activity over time, mixing hyper- and normal threads; it shows us nothing about how the threads actually run and even less about how the FX will run them.
We don't even know what version or settings Computerbase used.

Look at this video of a Celeron compressing a folder with WinRAR: it uses ~38 threads and the usage varies from very low (~10%) to about 60-70%.
https://www.youtube.com/watch?v=mcbICLEeIBg
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
You mean the pic from coercitiv?
Task Manager shows activity over time, mixing hyper- and normal threads; it shows us nothing about how the threads actually run and even less about how the FX will run them.
We don't even know what version or settings Computerbase used.

Look at this video of a Celeron compressing a folder with WinRAR: it uses ~38 threads and the usage varies from very low (~10%) to about 60-70%.
https://www.youtube.com/watch?v=mcbICLEeIBg

I made a test with 7Zip: CPU utilisation gets low because each document in a folder is compressed separately, so there are dips; otherwise the CPU utilisation is 95% with one thread per core. Indeed, Computerbase.de uses a big file, so the CPU utilisation is high with no interruption.

So the example you're showing is not adequate at all with respect to a proper methodology, which is fortunately the case for Computerbase's.
 

TheELF

Diamond Member
Dec 22, 2012
4,027
753
126
I made a test with 7Zip: CPU utilisation gets low because each document in a folder is compressed separately, so there are dips; otherwise the CPU utilisation is 95% with one thread per core. Indeed, Computerbase.de uses a big file, so the CPU utilisation is high with no interruption.

So the example you're showing is not adequate at all with respect to a proper methodology, which is fortunately the case for Computerbase's.

http://www.computerbase.de/2015-10/prozessoren-benchmarks-testsystem-amd-intel-2015/2/
WinRAR 5.30 Beta 1: time needed for real-world packing of the complete PCMark 8 folder via script, multiple runs
They are compressing the PCMark-8 folder several times via a script and not a single big file.
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
I made a test with 7Zip: CPU utilisation gets low because each document in a folder is compressed separately, so there are dips; otherwise the CPU utilisation is 95% with one thread per core. Indeed, Computerbase.de uses a big file, so the CPU utilisation is high with no interruption.

That actually varies by settings. 7zip has a variable solid block size and you can set it to "solid" where it concatenates all of the files and then compresses the block.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
That actually varies by settings. 7zip has a variable solid block size and you can set it to "solid" where it concatenates all of the files and then compresses the block.

I think "Solid" archives compress better, because as you say they concatenate all the files together into a big blob, and compress it that way. That way, they have one compression dictionary, instead of one for each file.
 

Essence_of_War

Platinum Member
Feb 21, 2013
2,650
4
81
I think "Solid" archives compress better, because as you say they concatenate all the files together into a big blob, and compress it that way. That way, they have one compression dictionary, instead of one for each file.

Depends on a lot of things, but yeah, solid archiving should usually improve compression ratios, but it will vary by algorithm, the underlying redundancies, etc.
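
A quick way to see the effect for yourself is sketched below. It uses Python's `zlib` (DEFLATE) as a stand-in, not 7-Zip's own LZMA implementation, and the "files" are made-up, highly redundant text, so the size of the gap is only illustrative.

```python
import zlib

# Hypothetical stand-ins for a folder of small, similar text files.
files = [
    (f"report_{i}.txt",
     (f"Report number {i}\n" + "status=ok; cpu=FX-8350; test=compression\n" * 20).encode())
    for i in range(50)
]

# Non-solid: every file is compressed on its own, starting from an empty dictionary.
separate_size = sum(len(zlib.compress(data, 9)) for _, data in files)

# Solid: concatenate everything and compress it as a single block.
solid_size = len(zlib.compress(b"".join(data for _, data in files), 9))

print(f"per-file total: {separate_size} bytes, solid block: {solid_size} bytes")
```

With files this similar the solid block comes out much smaller, because matches can reach back into earlier files. With unrelated or already-compressed files the difference would shrink, which fits the "depends on a lot of things" caveat.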
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
http://www.computerbase.de/2015-10/prozessoren-benchmarks-testsystem-amd-intel-2015/2/

They are compressing the PCMark-8 folder several times via a script and not a single big file.

Their methodology is still adequate. Actually, they must run WinRAR at 100% for the time it takes CB to complete the test, then run CB several times consecutively if necessary to check the compression time.

From their results, i5s are adequate for Integer + Integer tasks but not for intensive Integer + FP.

The Celeron and i3 are to be discarded even in INT + INT as they lack the necessary grunt, this can be witnessed in the Winrar + Witcher tests where they lose throughput on much bigger %ages than all other CPUs, and they have less throughput to begin with.