
New Zen microarchitecture details

Page 37

itsmydamnation

Platinum Member
Feb 6, 2011
2,153
1,674
136
Okay, nevermind then ;)



AFAIK, a fastpath double instruction was issued, and there was nothing stopping it from executing on two SIMD ports in parallel, so long as they were available.

But a lot of people said that two pipes "fused" together to become one AVX pipe. AFAIK AMD themselves used such language. It's fairly meaningless. But I think this concept is what the Bits and Chips article is alluding to.

I had a reread of Agner's microarch PDF, and it seems you're right that macro-ops can go to any available unit; for some reason I was sure that they both went to the same unit back to back. The entire FPU section is actually a really good read in relation to what we might see get better in Zen, stuff like:

The data cache has two 128-bit ports which can be used for either read or write. This
means that it can do two reads or one read and one write in the same clock cycle.
The measured throughput is two reads or one read and one write per clock cycle when only
one thread is active. We would not expect the throughput to be less when multiple threads
are active because each core has separate load/store units and level-1 data cache. But my
measurements indicate that level-1 cache throughput is several times lower when multiple
threads are running, even if the threads are running in different units that do not share any
level-1 or level-2 cache. This phenomenon is seen on both Bulldozer, Piledriver and
Steamroller. No explanation for this effect has been found. Level-2 cache throughput is
shared between two threads running in the same unit, but not affected by threads running in
different units.
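For a rough sense of scale, two 128-bit ports per cycle work out as follows (the 4 GHz clock here is a hypothetical figure for illustration, not from Agner's text):

```python
# Peak L1 data-cache read bandwidth from two 128-bit ports per cycle.
PORT_WIDTH_BITS = 128
PORTS = 2
CLOCK_HZ = 4e9  # hypothetical 4 GHz core clock

bytes_per_cycle = PORTS * PORT_WIDTH_BITS // 8  # 32 bytes/cycle
peak_read_bw = bytes_per_cycle * CLOCK_HZ       # bytes/second

print(f"{bytes_per_cycle} B/cycle -> {peak_read_bw / 1e9:.0f} GB/s peak reads")
# With one port doing a write instead, read bandwidth halves to 16 B/cycle.
```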
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
At some point AMD changed the behavior of the shared resources with an AGESA / microcode patch. On earlier AGESA / microcode versions, disabling the second core on a compute unit resulted in a performance uplift. Nowadays doing the same doesn't improve performance at all.
 

Abwx

Diamond Member
Apr 2, 2011
9,117
902
126
But a lot of people said that two pipes "fused" together to become one AVX pipe. AFAIK AMD themselves used such language. It's fairly meaningless. But I think this concept is what the Bits and Chips article is alluding to.
It exists, nevertheless...

I observed an interesting phenomenon when executing 256-bit vector instructions on the Skylake. There is a warm-up period of approximately 14 µs before it can execute 256-bit vector instructions at full speed. Apparently, the upper 128-bit half of the execution units and data buses is turned off in order to save power when it is not used. As soon as the processor sees a 256-bit instruction it starts to power up the upper half. It can still execute 256-bit instructions during the warm-up period, but it does so by using the lower 128-bit units twice for every 256-bit vector.
http://www.agner.org/optimize/blog/read.php?i=415
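The warm-up behaviour Agner describes amounts to double-pumping: until the upper lanes are powered, each 256-bit op is cracked into two 128-bit passes. A toy cycle-count model of just that effect (the numbers are illustrative, not measured):

```python
def cycles_for_256bit_ops(n_ops, upper_half_powered):
    """Throughput cycles for n 256-bit vector ops.

    With the upper 128-bit lanes powered, each op issues in one pass;
    during warm-up each op is executed as two 128-bit halves.
    """
    passes_per_op = 1 if upper_half_powered else 2
    return n_ops * passes_per_op

print(cycles_for_256bit_ops(1000, upper_half_powered=False))  # 2000 (warm-up)
print(cycles_for_256bit_ops(1000, upper_half_powered=True))   # 1000 (full speed)
```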
 

majord

Senior member
Jul 26, 2015
349
324
136
At some point AMD changed the behavior of the shared resources with an AGESA / microcode patch. On earlier AGESA / microcode versions, disabling the second core on a compute unit resulted in a performance uplift. Nowadays doing the same doesn't improve performance at all.
Can you expand on this a bit? Do you mean running a single thread on a module with vs without the 2nd 'core' disabled?
 

Exophase

Diamond Member
Apr 19, 2012
4,440
8
81
What does that have to do with Bulldozer or any other AMD core, or "fusing" units in general? Running units with half the width power gated is a completely different sort of thing, and of course you can execute a full-width operation on half-width units over multiple cycles, Intel and AMD did this with the original SSE and SSE2 for years. Still has nothing to do with any kind of fusion, whatever that even means.
 

AtenRa

Lifer
Feb 2, 2009
13,548
2,522
126
At some point AMD changed the behavior of the shared resources with an AGESA / microcode patch. On earlier AGESA / microcode versions, disabling the second core on a compute unit resulted in a performance uplift. Nowadays doing the same doesn't improve performance at all.
You mean on Carrizo ??

or

from Trinity to Carrizo etc ??
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
But a lot of people said that two pipes "fused" together to become one AVX pipe. AFAIK AMD themselves used such language. It's fairly meaningless. But I think this concept is what the Bits and Chips article is alluding to.
I think JF blogged about this at AMD first, so there is a link to marketing. But this could also be an ELI5 approach to explaining the execution.
 

AtenRa

Lifer
Feb 2, 2009
13,548
2,522
126
The ones without the second decoder (BD & PD).
I see,

Is the single-thread performance the same on the old and the new AGESA?

Or is the single-thread performance on the old AGESA higher than on the new one?
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
The only difference I noticed was that disabling the second core in a CU no longer improved the performance.

I don't know what exactly they changed, but I know for a fact that with BD at least they did some radical stuff with the µcode. No idea if it means having "Super Bottom Expand" enabled all the time or not, and stuff like that... :sneaky:
 

swilli89

Golden Member
Mar 23, 2010
1,528
1,085
136
Didn't Kim Kardashian get that done?
She had the special 4XL version of that done. It was a very tedious operation, they removed mass from Kanye's giant face and brain and injected it straight into Kim's behind. Now even her ass is smug.
 

AtenRa

Lifer
Feb 2, 2009
13,548
2,522
126
The only difference I noticed was that disabling the second core in a CU no longer improved the performance.

I don't know what exactly they changed, but I know for a fact that with BD at least they did some radical stuff with the µcode. No idea if it means having "Super Bottom Expand" enabled all the time or not, and stuff like that... :sneaky:
ok thanks.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
The only difference I noticed was that disabling the second core in a CU no longer improved the performance.

I don't know what exactly they changed, but I know for a fact that with BD at least they did some radical stuff with the µcode. No idea if it means having "Super Bottom Expand" enabled all the time or not, and stuff like that... :sneaky:
Were the compared configs 2M/4T vs. 4M/4T? Or did you run ST tests?

I think it could be related to the Windows scheduler and the reporting of cores and thread capabilities of the CPU - so some different CPUID results. Or is it power mgmt related (module/core turbo w/ and w/o 2nd core).
 
Last edited:

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Were the compared configs 2M/4T vs. 4M/4T? Or did you run ST tests?

I think it could be related to the Windows scheduler and the reporting of cores and thread capabilities of the CPU - so some different CPUID results. Or is it power mgmt related (module/core turbo w/ and w/o 2nd core).
Disabling the second core improved performance in all scenarios. chew and I posted results of the benefits back in the day; they can still probably be found on XS.

It had nothing to do with the power management, as on FX (Zambezi & Vishera) it is extremely simple and must be disabled when overclocking anyway.

IIRC in some cases disabling the second core within a unit was so beneficial that, performance-wise, it would have been better to use a single core per CU in several applications, even though doing so didn't lower the power consumption. I would imagine it has something to do with the L2 cache, but I never got an official answer on the technical background of the phenomenon.
 

del42sa

Member
May 28, 2013
26
11
81
IIRC in some cases disabling the second core within a unit was so beneficial that, performance-wise, it would have been better to use a single core per CU in several applications, even though doing so didn't lower the power consumption. I would imagine it has something to do with the L2 cache, but I never got an official answer on the technical background of the phenomenon.
I think it had something to do with the very low associativity of the instruction cache, so there is cache thrashing when two threads are active.

"Each Intel core has a 32K instruction cache that's eight-way associative, whereas Steamroller has a 96KB shared cache that's just three-way associative. Bulldozer's is only two-way associative and even smaller. Cache conflicts remain a significant problem — when two different threads are running in the same module, they can overwrite each others' code."

The same happens in the L2 cache, but there it is less problematic, as the L2 is quite large and has higher associativity.

If you look at Intel's designs, they tend to have 8-way caches, or 4 ways per thread. AMD has 1.5 ways per thread (with the Steamroller core) and 1 way per thread with Bulldozer, and will therefore have more associativity conflicts.
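The thrashing effect is easy to reproduce with a toy LRU model of a single cache set: interleave two threads' hot lines that map to the same set and count misses at 2-way (Bulldozer-like) vs 8-way (Intel-like) associativity. The line names and access pattern are made up for illustration:

```python
from collections import OrderedDict

def misses(ways, accesses):
    """LRU miss count for a single cache set with the given number of ways."""
    cache = OrderedDict()
    miss = 0
    for line in accesses:
        if line in cache:
            cache.move_to_end(line)      # refresh LRU position on a hit
        else:
            miss += 1
            if len(cache) >= ways:
                cache.popitem(last=False)  # evict the least-recently-used line
            cache[line] = True
    return miss

# Two threads, each looping over 2 hot lines that map to the same set.
t0 = ["A0", "A1"]
t1 = ["B0", "B1"]
interleaved = (t0 + t1) * 100  # 4 distinct lines compete for one set

print(misses(2, interleaved))  # 2-way: every access misses -> 400
print(misses(8, interleaved))  # 8-way: only the 4 cold misses
```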

btw: I did post your BD benchmark in the post right above yours :sneaky:
 
Last edited:

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Regarding the quoted "40% IPC improvement over Excavator": it is now certain that it is a single core of a Zen CCX vs. a single core of an Excavator CU.

While running at the same frequency, each Zen core should be able to provide ~73% of the performance of an Excavator CU in FP workloads. It will be interesting to see how efficient AMD's SMT implementation will be.
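Putting those two figures together gives an implied CMT scaling factor for an Excavator CU (a back-of-the-envelope check on the claims, not an official number):

```python
zen_vs_xv_core = 1.40   # "+40% IPC" vs a single Excavator core
zen_vs_xv_cu   = 0.73   # ~73% of a full Excavator CU (two threads)

# Implied throughput of a 2-thread XV CU relative to one XV core:
implied_cmt_scaling = zen_vs_xv_core / zen_vs_xv_cu
print(f"{implied_cmt_scaling:.2f}x")  # ~1.92x per-CU CMT scaling
```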
 

majord

Senior member
Jul 26, 2015
349
324
136
Interesting. The ~73% of a CU's performance = Zen single thread vs. two threads (full CU throughput) on XV?
 

majord

Senior member
Jul 26, 2015
349
324
136
Well, you're never going to get the same throughput out of similar resources as you do by shoving two threads into them, regardless.

Righto. Well, plugging that scenario into Excel, I get the following out of my XV numbers:



I would have expected FP to be higher than the int/mixed results represented by those benches, though.

-edit: Worth noting I'd probably get higher CMT scaling with 1C/1M vs 2C/1M as opposed to the 2C/2M vs 4C/2M tested above, due to XV's 1MB cache (which hinders total throughput); that would skew the average back towards 7x%.
 
Last edited:

Abwx

Diamond Member
Apr 2, 2011
9,117
902
126
Regarding the quoted "40% IPC improvement over Excavator": it is now certain that it is a single core of a Zen CCX vs. a single core of an Excavator CU.
So if one is at, say, 100, the other will be at 140, this for one thread..

While running at the same frequency, each Zen core should be able to provide ~73% of the performance of an Excavator CU in FP workloads. It will be interesting to see how efficient AMD's SMT implementation will be.
In an FP workload like CB, an EXV module scores 188 relative to a single core at 100, so 73% of this performance means that a Zen core loaded with two threads would be at 137; that is, less than what you're stating for a single thread a few lines above. Hey, that makes a negative efficiency for SMT...
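The objection can be checked directly with the figures quoted in this post (the normalization to 100 is the one used above):

```python
xv_core = 100            # one Excavator core in CB, normalized
xv_module = 188          # one EXV module running two threads
zen_1t = xv_core * 1.40  # the "+40% IPC" claim -> 140

zen_2t_implied = 0.73 * xv_module  # the "~73% of a CU" claim -> 137.24

smt_scaling = zen_2t_implied / zen_1t
print(f"{zen_2t_implied:.2f} vs {zen_1t:.0f} -> SMT scaling {smt_scaling:.2f}x")
# 137.24 < 140: taken literally on the same workload, the two claims
# together imply SMT scaling below 1.0, which is the contradiction here.
```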
 
