AMD APU not showing proper core amounts?

Phaetos

Senior member
Jan 27, 2005
391
27
91
Just installed Win 8.1 and looking at the Performance tab in Task Manager, it shows:
1 Socket (yes)
2 Cores (nope, got 4)
4 Logical Processors (shouldn't that be 8 if the cores were reading correctly?)

What's the deal here?
 

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
Just installed Win 8.1 and looking at the Performance tab in Task Manager, it shows:
1 Socket (yes)
2 Cores (nope, got 4)
4 Logical Processors (shouldn't that be 8 if the cores were reading correctly?)

What's the deal here?

You have an FM2/FM2+ APU, correct? Not an AM1?

In that case, what Windows has listed is correct.
 

Phaetos

Senior member
Jan 27, 2005
391
27
91
You have an FM2/FM2+ APU, correct? Not an AM1?

In that case, what Windows has listed is correct.

Correct. Device Manager shows 4 processors, and CPU-Z shows 1 processor, 4 cores, 4 threads. So what Windows is showing is not correct. It showed as 4 cores under Win7; 8.1 is reporting it incorrectly.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
Windows 8.1 shows it correctly. (2 modules, 4 threads.)

Microsoft no longer accepts AMD's CMT as real cores, but rather treats it on the same level as SMT.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Correct. Device Manager shows 4 processors, and CPU-Z shows 1 processor, 4 cores, 4 threads. So what Windows is showing is not correct. It showed as 4 cores under Win7; 8.1 is reporting it incorrectly.
How is Windows incorrect? You haven't even told us what the CPU is.

My guess is that Windows is being quite correct, though, both versions, and it's a BD-based APU on an FM2(+) socket.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
What's the deal here?
Windows 7 does not have the scheduling patch pre-installed, while Windows 8 and 8.1 do.
Currently, the CPU scheduling techniques that are used by Windows 7 and Windows Server 2008 R2 are not optimized for the AMD Bulldozer module architecture. This architecture is found on AMD FX series, AMD Opteron 4200/4300 Series, and AMD Opteron 6200/6300 Series processors. Therefore, multithreaded workloads may not be optimally distributed on computers that have one of these processors installed in a lightly-threaded environment. This may result in decreased system performance for some applications.
AMD's Bulldozer module and Intel's Core i Hyper-Threading have the same threading technology in the front-end. So Microsoft took the Hyper-Threading optimization and applied it to Bulldozer, without correcting the terminology.

It's an erratum that most likely won't be fixed by Microsoft.

Microsoft officially considers the Bulldozer Module to be two cores, unlike what ShintaiDK states.
 
Last edited:

Yuriman

Diamond Member
Jun 25, 2004
5,530
141
106
FM2 CPUs top out at 2 modules / 4 threads (cores). As NostaSeronx said, it's just how Windows needs to look at the chip when scheduling tasks, because loading up the second core before the third can cause a hefty performance penalty due to shared resources.
 
Last edited:

VirtualLarry

No Lifer
Aug 25, 2001
56,587
10,225
126
It's not really a "4 core", any more than that "HP Hexacore" is really a hex-core.
 

Yuriman

Diamond Member
Jun 25, 2004
5,530
141
106
I disagree, it's generally accepted that AMD cores are real cores, albeit slow individually and sharing resources. There was some debate when the FX's first came out but I don't know of anyone who questions whether an FX-8350 is really an 8 core processor.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
FM2 CPUs top out at 2 modules / 4 threads (cores). As NostaSeronx said, it's just how Windows needs to look at the chip when scheduling tasks, because loading up the second core before the third can cause a hefty performance penalty due to shared resources.
Well the actual purpose of the patch is SPMD.

Windows 7;
Task A = 2 parallel threads
Task B = 1 serial thread

Task A(1A) and Task B(1) would be run in Module A on Cores A and B,
while Task A(2A) would be run in Module B on Core A or B.

Windows 7+Hotfix / Windows 8 / Windows 8.1;
Task A = 2 parallel threads
Task B = 1 serial thread

Task A(1A) and Task A(2A) would be run in Module A on Cores A and B,
while Task B(1) would be run in Module B on Core A or B.

SPMD = Single Program Multiple Data. The module is built to optimize for such workloads, so running Multiple Program Multiple Data (MPMD) workloads on a single module is non-optimal.
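
The two placements listed above can be sketched in a few lines of Python. This is only a toy model of the behaviour described in this post, not the actual Windows scheduler; the function names `place_naive` and `place_grouped` and the (task, thread) tuples are made up for illustration.

```python
def place_naive(threads, modules=2, cores_per_module=2):
    """Fill cores in arrival order, ignoring module boundaries
    (the "Windows 7" case described in the post)."""
    placement = {m: [] for m in range(modules)}
    total_cores = modules * cores_per_module
    for i, thread in enumerate(threads):
        core = i % total_cores
        placement[core // cores_per_module].append(thread)
    return placement


def place_grouped(threads, modules=2, cores_per_module=2):
    """Keep threads of the same task on one module while it has room
    (the "hotfix / Windows 8" case described in the post)."""
    placement = {m: [] for m in range(modules)}
    home = {}  # task name -> module currently used for that task
    for task, tid in threads:
        m = home.get(task)
        if m is None or len(placement[m]) >= cores_per_module:
            # No home yet, or the home module is full: pick the emptiest one.
            m = min(placement, key=lambda k: len(placement[k]))
            home[task] = m
        placement[m].append((task, tid))
    return placement


# Arrival order assumed for the example: A's first thread, B, A's second thread.
threads = [("A", 1), ("B", 1), ("A", 2)]
naive = place_naive(threads)
grouped = place_grouped(threads)
print("naive:  ", naive)
print("grouped:", grouped)
```

With that arrival order, the naive version puts A(1) and B(1) on module 0 and A(2) on module 1, while the grouped version keeps both A threads on module 0 and gives B module 1 to itself.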
 
Last edited:

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
I disagree, it's generally accepted that AMD cores are real cores, albeit slow individually and sharing resources. There was some debate when the FX's first came out but I don't know of anyone who questions whether an FX-8350 is really an 8 core processor.
There is no historical definition of specifically what is a core, v. not a core, until you get close to memory. Either way is correct, so long as the definitions are well defined and consistent.
 

Phaetos

Senior member
Jan 27, 2005
391
27
91
How is Windows incorrect? You haven't even told us what the CPU is.

My guess is that Windows is being quite correct, though, both versions, and it's a BD-based APU on an FM2(+) socket.

I didn't mention the APU? My bad, A10-6800K, socket FM2.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Both are correct, then, in their view of the world. Windows 7, by default, does not recognize the shared caches, but that can be fixed. Windows 8 does out of the box, and treats it much like an HT CPU, which should be better for performance. But, with 4 sets of int processing units and L1Ds, 4 cores isn't all wrong, just more superficial than would be ideal.
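
To make the two counting conventions concrete, here is a minimal Python sketch. The `Topology` class and both function names are invented for illustration; they are not any Windows or CPU-Z API. It assumes the A10-6800K layout discussed in this thread: 1 socket, 2 modules, 2 integer cores per module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Topology:
    sockets: int
    modules_per_socket: int
    int_cores_per_module: int


def count_as_cores(t: Topology) -> dict:
    """Every integer core counts as a full core (the CPU-Z / Windows 7 view)."""
    cores = t.sockets * t.modules_per_socket * t.int_cores_per_module
    return {"sockets": t.sockets, "cores": cores, "logical": cores}


def count_as_smt(t: Topology) -> dict:
    """Each module counts as one core with several hardware threads
    (the Windows 8.1 Task Manager view)."""
    modules = t.sockets * t.modules_per_socket
    return {
        "sockets": t.sockets,
        "cores": modules,
        "logical": modules * t.int_cores_per_module,
    }


a10_6800k = Topology(sockets=1, modules_per_socket=2, int_cores_per_module=2)
core_view = count_as_cores(a10_6800k)
smt_view = count_as_smt(a10_6800k)
print("per-core view:  ", core_view)
print("per-module view:", smt_view)
```

Both views agree on 4 logical processors; they only disagree on whether to call the unit above them "2 cores" or "4 cores".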
 
Last edited:

sm625

Diamond Member
May 6, 2011
8,172
137
106
It's not really a "4 core", any more than that "HP Hexacore" is really a hex-core.

That's a totally different thing, and not entirely fair anyway. The CMT scaling is actually pretty good. A 2M/4C Steamroller-based CPU scales at around 80%. The problem isn't the CMT design, it's just the fact that the cores are just plain bad/slow.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
That's a totally different thing, and not entirely fair anyway. The CMT scaling is actually pretty good. A 2M/4C Steamroller-based CPU scales at around 80%. The problem isn't the CMT design, it's just the fact that the cores are just plain bad/slow.

Until you add FP loads. Then it scales 0%.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Bulldozer has 80% scaling; Steamroller is closer to 90-95%.
 
Last edited:

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Until you add FP loads. Then it scales 0%.

FP scales very nicely even in Bulldozer

Edit: Phenom II x6, FX8150 and Core i7 2600K

go0mqkj
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
You know that it's wrong, don't you?

If I was wrong there wouldn't be a need to share the FP unit in a module.

You can try running Linpack or something and tell me the throughput. It will, for some odd reason, end up in the ballpark of a dual-core SB/IB ;)
 

Yuriman

Diamond Member
Jun 25, 2004
5,530
141
106
As I understand it, with scaling of about "80%" you actually only get about 160% of one core's performance out of two cores.

Anandtech bench results of FX-6300:

xPaOJxq.png


470/6 = 78.33, which means all cores are performing at around 81.5% of their single-threaded score due to sharing. Using a second core in a module doesn't make it 80% faster; rather, both cores take a roughly 20% hit, so loading up the module fully gets you about 60% more performance.
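
The arithmetic above, spelled out as a small Python sketch. The 470 multi-threaded score, the 6 cores and the 81.5% figure are the ones quoted in this post; the single-threaded score was in the screenshot and is not reproduced here, so the value below is only back-solved from the 81.5% and should be treated as an assumption. The function names are made up for illustration.

```python
def per_core_score(multi_score: float, cores: int) -> float:
    """Average score contributed by each core when all cores are loaded."""
    return multi_score / cores


def module_gain(per_core_efficiency: float, cores_per_module: int = 2) -> float:
    """Throughput of a fully loaded module relative to one core running alone."""
    return cores_per_module * per_core_efficiency


per_core = per_core_score(470, 6)             # FX-6300 figures quoted above
efficiency = 0.815                            # per-core efficiency quoted above
implied_single = per_core / efficiency        # back-solved, NOT read from the chart
gain = module_gain(efficiency)                # two cores at 81.5% each

print(f"per-core score under full load: {per_core:.2f}")
print(f"implied single-thread score:    {implied_single:.1f}")
print(f"module vs. one core alone:      {gain:.2f}x")
```

Two cores at 81.5% each give 1.63x the throughput of one core running alone, which is where the "about 60% more" comes from.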


An i5 4690 by comparison is 3% short of linear scaling with 4 cores (used as a control to show potential scaling in Cinebench):

CoHdqhu.png


EDIT: Is Cinebench an FP-heavy bench? It may not be representative of the average task. If so, what are some other multithreaded benches that don't use the FPUs as heavily?
 
Last edited:
Dec 30, 2004
12,553
2
76
I still think that if they had done a 3+2 wide decode instead of 2+2 wide, they could have gotten 100% scaling right out of the box for like 99% of workloads.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
If I was wrong there wouldn't be a need to share the FP unit in a module.

You can try running Linpack or something and tell me the throughput. It will, for some odd reason, end up in the ballpark of a dual-core SB/IB ;)

Scaling is not equal to throughput.

Here's another one:
my4u1hj
 

Abwx

Lifer
Apr 2, 2011
11,888
4,874
136
If I was wrong there wouldn't be a need to share the FP unit in a module.

You can try running Linpack or something and tell me the throughput. It will, for some odd reason, end up in the ballpark of a dual-core SB/IB ;)


What is the relevance of a comparison with SB/IB to your assumption that it didn't scale at all?

Don't try to move the goalposts. You did say that scaling was 0% with more threads; I guess you mean 0% for more than 2 threads in a 2-module configuration. Either find us data that says so, and you know that you can't, or else it will mean that you're deliberately misleading the general public.
 

NTMBK

Lifer
Nov 14, 2011
10,461
5,845
136
If I was wrong there wouldn't be a need to share the FP unit in a module.

Even on a pure, 100% FPU workload with no integer code whatsoever, you would get >0% scaling. It can swap in the second thread when the first one stalls on memory access, branch misprediction, whatever. The FPU is basically SMT, and as such pure FPU workloads will scale much the same as on an SMT core. So yes, a 2 module PD chip scales much like a 2 core Sandy Bridge chip in that case.

But of course the vast majority of code isn't purely FPU bound. What scaling you get depends on your use case. *shrug*
 

Abwx

Lifer
Apr 2, 2011
11,888
4,874
136
As I understand it, with scaling of about "80%" you actually only get about 160% of one core's performance out of two cores.

Anandtech bench results of FX-6300:

470/6 = 78.33, which means all cores are performing at around 81.5% of their single-threaded score due to sharing. Using a second core in a module doesn't make it 80% faster; rather, both cores take a roughly 20% hit, so loading up the module fully gets you about 60% more performance.

These estimates don't hold for Kaveri, since it solved the shared front-end penalty.

CB115.png