
New Zen microarchitecture details

Page 37

itsmydamnation

Platinum Member
Feb 6, 2011
2,153
1,674
136
Okay, nevermind then ;)



AFAIK, a fastpath double instruction was issued, and there was nothing stopping it from executing on two SIMD ports in parallel, so long as they were available.

But a lot of people said that two pipes "fused" together to become one AVX pipe. AFAIK AMD themselves used such language. It's fairly meaningless. But I think this concept is what the Bits and Chips article is alluding to.

I had a reread of Agner's microarch PDF, and it seems you're right that macro-ops can go to any available unit; for some reason I was sure that they both went to the same unit back to back. The entire FPU section is actually a really good read in relation to what we might see get better in Zen, stuff like:

The data cache has two 128-bit ports which can be used for either read or write. This
means that it can do two reads or one read and one write in the same clock cycle.
The measured throughput is two reads or one read and one write per clock cycle when only
one thread is active. We would not expect the throughput to be less when multiple threads
are active because each core has separate load/store units and level-1 data cache. But my
measurements indicate that level-1 cache throughput is several times lower when multiple
threads are running, even if the threads are running in different units that do not share any
level-1 or level-2 cache. This phenomenon is seen on both Bulldozer, Piledriver and
Steamroller. No explanation for this effect has been found. Level-2 cache throughput is
shared between two threads running in the same unit, but not affected by threads running in
different units.
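For a rough sense of scale, two 128-bit ports per cycle work out as follows (the 4 GHz clock here is a hypothetical figure for illustration, not from Agner's text):

```python
# Peak L1 data-cache read bandwidth from two 128-bit ports per cycle.
PORT_WIDTH_BITS = 128
PORTS = 2
CLOCK_HZ = 4e9  # hypothetical 4 GHz core clock

bytes_per_cycle = PORTS * PORT_WIDTH_BITS // 8  # 32 bytes/cycle
peak_read_bw = bytes_per_cycle * CLOCK_HZ       # bytes/second

print(f"{bytes_per_cycle} B/cycle -> {peak_read_bw / 1e9:.0f} GB/s peak reads")
# With one port doing a write instead, read bandwidth halves to 16 B/cycle.
```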
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
At some point AMD changed the behavior of the shared resources with an AGESA / microcode patch. On earlier AGESA / microcode versions, disabling the second core on a compute unit resulted in a performance uplift. Nowadays doing the same doesn't improve performance at all.
 

Abwx

Diamond Member
Apr 2, 2011
9,117
902
126
But a lot of people said that two pipes "fused" together to become one AVX pipe. AFAIK AMD themselves used such language. It's fairly meaningless. But I think this concept is what the Bits and Chips article is alluding to.
It exists, nevertheless...

I observed an interesting phenomenon when executing 256-bit vector instructions on the Skylake. There is a warm-up period of approximately 14 µs before it can execute 256-bit vector instructions at full speed. Apparently, the upper 128-bit half of the execution units and data buses is turned off in order to save power when it is not used. As soon as the processor sees a 256-bit instruction it starts to power up the upper half. It can still execute 256-bit instructions during the warm-up period, but it does so by using the lower 128-bit units twice for every 256-bit vector.
http://www.agner.org/optimize/blog/read.php?i=415
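The warm-up behaviour Agner describes amounts to double-pumping: until the upper lanes are powered, each 256-bit op is cracked into two 128-bit passes. A toy cycle-count model of just that effect (the numbers are illustrative, not measured):

```python
def cycles_for_256bit_ops(n_ops, upper_half_powered):
    """Throughput cycles for n 256-bit vector ops.

    With the upper 128-bit lanes powered, each op issues in one pass;
    during warm-up each op is executed as two 128-bit halves.
    """
    passes_per_op = 1 if upper_half_powered else 2
    return n_ops * passes_per_op

print(cycles_for_256bit_ops(1000, upper_half_powered=False))  # 2000 (warm-up)
print(cycles_for_256bit_ops(1000, upper_half_powered=True))   # 1000 (full speed)
```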
 

majord

Senior member
Jul 26, 2015
349
324
136
At some point AMD changed the behavior of the shared resources with an AGESA / microcode patch. On earlier AGESA / microcode versions, disabling the second core on a compute unit resulted in a performance uplift. Nowadays doing the same doesn't improve performance at all.
Can you expand on this a bit? Do you mean running a single thread on a module with vs without the 2nd 'core' disabled?
 

Exophase

Diamond Member
Apr 19, 2012
4,440
8
81
What does that have to do with Bulldozer or any other AMD core, or "fusing" units in general? Running units with half the width power gated is a completely different sort of thing, and of course you can execute a full-width operation on half-width units over multiple cycles, Intel and AMD did this with the original SSE and SSE2 for years. Still has nothing to do with any kind of fusion, whatever that even means.
 

AtenRa

Lifer
Feb 2, 2009
13,548
2,522
126
At some point AMD changed the behavior of the shared resources with an AGESA / microcode patch. On earlier AGESA / microcode versions, disabling the second core on a compute unit resulted in a performance uplift. Nowadays doing the same doesn't improve performance at all.
You mean on Carrizo ??

or

from Trinity to Carrizo etc ??
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
But a lot of people said that two pipes "fused" together to become one AVX pipe. AFAIK AMD themselves used such language. It's fairly meaningless. But I think this concept is what the Bits and Chips article is alluding to.
I think JF blogged about this at AMD first, so there is a link to marketing. But this could also be an ELI5 approach to explaining the execution.
 

AtenRa

Lifer
Feb 2, 2009
13,548
2,522
126
The ones without the second decoder (BD & PD).
I see,

Is the single-thread performance the same on the old and the new AGESA?

Or is the single-thread performance on the old AGESA higher than on the new one?
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
The only difference I noticed was that disabling the second core in a CU no longer improved the performance.

I don't know what exactly they changed, but I know for a fact that with BD at least they did some radical stuff with the µcode. No idea if it means having "Super Bottom Expand" enabled all the time or not, and stuff like that... :sneaky:
 

swilli89

Golden Member
Mar 23, 2010
1,528
1,085
136
Didn't Kim Kardashian get that done?
She had the special 4XL version of that done. It was a very tedious operation, they removed mass from Kanye's giant face and brain and injected it straight into Kim's behind. Now even her ass is smug.
 

AtenRa

Lifer
Feb 2, 2009
13,548
2,522
126
The only difference I noticed was that disabling the second core in a CU no longer improved the performance.

I don't know what exactly they changed, but I know for a fact that with BD at least they did some radical stuff with the µcode. No idea if it means having "Super Bottom Expand" enabled all the time or not, and stuff like that... :sneaky:
ok thanks.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
The only difference I noticed was that disabling the second core in a CU no longer improved the performance.

I don't know what exactly they changed, but I know for a fact that with BD at least they did some radical stuff with the µcode. No idea if it means having "Super Bottom Expand" enabled all the time or not, and stuff like that... :sneaky:
Were the compared configs 2M/4T vs. 4M/4T? Or did you run ST tests?

I think it could be related to the Windows scheduler and the reporting of cores and thread capabilities of the CPU - so some different CPUID results. Or is it power mgmt related (module/core turbo w/ and w/o 2nd core).
 
Last edited:

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Were the compared configs 2M/4T vs. 4M/4T? Or did you run ST tests?

I think it could be related to the Windows scheduler and the reporting of cores and thread capabilities of the CPU - so some different CPUID results. Or is it power mgmt related (module/core turbo w/ and w/o 2nd core).
Disabling the second core improved performance in all scenarios. chew and I posted results of the benefits back in the day; they can still probably be found on XS.

It had nothing to do with the power management, as on FX (Zambezi & Vishera) it is extremely simple and must be disabled when overclocking anyway.

IIRC in some cases disabling the second core within a unit was so beneficial that, performance-wise, it would have been better to use a single core per CU in several applications, even though doing so didn't lower the power consumption. I would imagine it has something to do with the L2 cache, but I never got an official answer on the technical background of the phenomenon.
 

del42sa

Member
May 28, 2013
26
11
81
IIRC in some cases disabling the second core within a unit was so beneficial that, performance-wise, it would have been better to use a single core per CU in several applications, even though doing so didn't lower the power consumption. I would imagine it has something to do with the L2 cache, but I never got an official answer on the technical background of the phenomenon.
I think it had something to do with the very low associativity of the instruction cache, so there is cache thrashing when two threads are active.

"Each Intel core has a 32K instruction cache that's eight-way associative, whereas Steamroller has a 96KB shared cache that's just three-way associative. Bulldozer's is only two-way associative and even smaller. Cache conflicts remain a significant problem — when two different threads are running in the same module, they can overwrite each others' code."

The same happens in the L2 cache, but there it is less problematic, as the L2 is quite large and has higher associativity.

If you look at Intel's designs, they tend to have 8-way caches, or 4 ways per thread. AMD has 1.5 ways per thread (with the Steamroller core) and 1 way per thread with Bulldozer, and will therefore have more associativity conflicts.
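The thrashing effect is easy to reproduce with a toy LRU model of a single cache set: interleave two threads' hot lines that map to the same set and count misses at 2-way (Bulldozer-like) vs 8-way (Intel-like) associativity. The line names and access pattern are made up for illustration:

```python
from collections import OrderedDict

def misses(ways, accesses):
    """LRU miss count for a single cache set with the given number of ways."""
    cache = OrderedDict()
    miss = 0
    for line in accesses:
        if line in cache:
            cache.move_to_end(line)      # refresh LRU position on a hit
        else:
            miss += 1
            if len(cache) >= ways:
                cache.popitem(last=False)  # evict the least-recently-used line
            cache[line] = True
    return miss

# Two threads, each looping over 2 hot lines that map to the same set.
t0 = ["A0", "A1"]
t1 = ["B0", "B1"]
interleaved = (t0 + t1) * 100  # 4 distinct lines compete for one set

print(misses(2, interleaved))  # 2-way: every access misses -> 400
print(misses(8, interleaved))  # 8-way: only the 4 cold misses
```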

btw: I did post your BD benchmark in the post right above yours :sneaky:
 
Last edited:

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Regarding the quoted "40% IPC improvement over Excavator": it is now certain that it is a single core of a Zen CCX vs. a single core of an Excavator CU.

While running at the same frequency, each Zen core should be able to provide ~73% of the performance of an Excavator CU in FP workloads. It will be interesting to see how efficient AMD's SMT implementation will be.
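Putting those two figures together gives an implied CMT scaling factor for an Excavator CU (a back-of-the-envelope check on the claims, not an official number):

```python
zen_vs_xv_core = 1.40   # "+40% IPC" vs a single Excavator core
zen_vs_xv_cu   = 0.73   # ~73% of a full Excavator CU (two threads)

# Implied throughput of a 2-thread XV CU relative to one XV core:
implied_cmt_scaling = zen_vs_xv_core / zen_vs_xv_cu
print(f"{implied_cmt_scaling:.2f}x")  # ~1.92x per-CU CMT scaling
```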
 

majord

Senior member
Jul 26, 2015
349
324
136
Interesting. The ~73% of a CU's performance = Zen single thread vs. two threads (full CU throughput) on XV?
 

majord

Senior member
Jul 26, 2015
349
324
136
Well, you're never going to get the same throughput out of similar resources as you do by shoving two threads into them, regardless.

Righto. Well, plugging that scenario into Excel, I get the following out of my XV numbers:



I would have expected FP to be higher than the int/mixed results represented by those benches, though.

-edit: Worth noting I'd probably get higher CMT scaling with 1C/1M vs 2C/1M as opposed to the 2C/2M vs 4C/2M tested above, due to XV's 1MB cache (which hinders total throughput); that would skew the average back towards 7x%.
 
Last edited:

Abwx

Diamond Member
Apr 2, 2011
9,117
902
126
Regarding the quoted "40% IPC improvement over Excavator": it is now certain that it is a single core of a Zen CCX vs. a single core of an Excavator CU.
So if one is at, say, 100, the other will be at 140, this for one thread..

While running at the same frequency, each Zen core should be able to provide ~73% of the performance of an Excavator CU in FP workloads. It will be interesting to see how efficient AMD's SMT implementation will be.
In an FP workload like CB, an EXV module scores 188 relative to a single core at 100, so 73% of this performance means that a Zen core loaded with two threads would be at 137; that is, less than what you're stating for a single thread a few lines above. Hey, that makes a negative efficiency for SMT...
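The objection can be checked directly with the figures quoted in this post (the normalization to 100 is the one used above):

```python
xv_core = 100            # one Excavator core in CB, normalized
xv_module = 188          # one EXV module running two threads
zen_1t = xv_core * 1.40  # the "+40% IPC" claim -> 140

zen_2t_implied = 0.73 * xv_module  # the "~73% of a CU" claim -> 137.24

smt_scaling = zen_2t_implied / zen_1t
print(f"{zen_2t_implied:.2f} vs {zen_1t:.0f} -> SMT scaling {smt_scaling:.2f}x")
# 137.24 < 140: taken literally on the same workload, the two claims
# together imply SMT scaling below 1.0, which is the contradiction here.
```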
 
