AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

Status
Not open for further replies.

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
The Stilt published an XV result @ 3.4GHz:
https://forums.anandtech.com/threads/first-summit-ridge-zen-benchmarks.2482739/page-51#post-38501596

1.4x the perf/MHz shown here (and assuming linear scaling from 3.4GHz to 3.5GHz) would give a score of ~3549 at 3.5GHz. My Broadwell @ 3.5GHz manages to get 3903 in single core.

This implies that Broadwell is ~10% ahead in perf/clock or, more succinctly, AMD has built its own version of Ivy Bridge.

That's still 3 core iterations behind Intel's current best in terms of perf/clock, but that's a much better position than what AMD was previously in.
I'm referencing the ~1140 GB4 result from the Zen sample. If we assume it ran at 1GHz (and Excavator's ~750-per-GHz result certainly gives some credence to that idea, if we assume for a minute AMD does not lie), it is pretty similar to your Broadwell's result on a per-GHz basis. If you are up for redoing your low-frequency test, I have the GB4 Zen results backed up.
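For what it's worth, the arithmetic in the quoted post can be sanity-checked with a quick script (numbers taken from the posts above; linear frequency scaling and the 1GHz assumption for the Zen sample are, of course, just assumptions):

```python
# Rough perf/clock comparison using the GB4 numbers quoted in this thread.
# Assumption: GB4 single-core scores scale linearly with core frequency.

zen_projected = 3549   # projected Zen GB4 score at 3.5GHz (1.4x XV perf/MHz)
bdw_measured = 3903    # measured Broadwell GB4 score at 3.5GHz

zen_per_mhz = zen_projected / 3500
bdw_per_mhz = bdw_measured / 3500

lead = bdw_per_mhz / zen_per_mhz - 1
print(f"Broadwell perf/clock lead: {lead:.1%}")  # ~10%, matching the claim

# The ~1140 Zen sample score at an assumed 1GHz gives ~1140 points/GHz,
# in the same ballpark as Broadwell's points/GHz at 3.5GHz:
print(bdw_measured / 3.5)
```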
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Zen's extra cores should make its multi-thread performance well above those older Intel chips, though. That should be a reason to consider Zen, depending on price, when they upgrade.

I will agree.
If ZEN single-thread (not IPC) performance is lower than Intel Kaby Lake, which I believe it will be, then AMD will have to sell more threads at the same or lower price than Intel.

That means they will be forced to sell a quad core + HT (4C/8T) or even 6 cores/6 threads at the price of quad-core (4C/4T) Intel CPUs (Core i5).

An example could be a 4C/8T ZEN unlocked at $200-220, 10-15% slower in ST but 10-20% faster in MT than a Core i5 Kaby Lake at $220-230.
 
Mar 10, 2006
11,715
2,012
126
I'm referencing the ~1140 GB4 result from the Zen sample. If we assume it ran at 1GHz (and Excavator's ~750-per-GHz result certainly gives some credence to that idea, if we assume for a minute AMD does not lie), it is pretty similar to your Broadwell's result on a per-GHz basis. If you are up for redoing your low-frequency test, I have the GB4 Zen results backed up.

Yeah, I'd be up for that. Interestingly, perf/MHz for my Broadwell-E goes up at lower frequencies. At 1.4GHz (DDR4-2400, uncore @ 1.4GHz), I get 1968 single core. Perf/MHz here is ~1405.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Based on your post, would you agree that IPC is utterly useless in describing ST performance of a SMT capable CPU core?
You got me to check the literature. ;)

And there I found this:
Computer Architecture Performance Evaluation Methods said:
In contrast to what is the case for single-threaded workloads IPC is not an accurate and reliable performance metric for multi-threaded workloads and may lead to misleading or incorrect conclusions.
BTW, IPC as the reciprocal of CPI (part of the "Iron Law of Performance") is described in the chapter "Single-threaded Workloads".

EDIT: I found the cited paper leading to this remark.
In this article, we challenge the commonly held view that IPC accurately reflects performance— at least for multithreaded workloads running on multiprocessors. Our simple counterexamples show cases in which IPC increases do not reflect a performance gain, and others in which IPC decreases do not reflect a performance loss. In some of our examples, IPC actually decreases as performance increases, and vice versa. As the number of processors increases, IPC becomes a less accurate measure of performance.
Source: A. R. Alameldeen (Intel) and D. A.Wood (University of Wisconsin-Madison). IPC considered harmful for multiprocessor workloads.
IEEE Micro, 26(4):8–17, July 2006.

EDIT#2: IBM researchers distinguished between single-threaded IPC and SMT IPC here
http://pharm.ece.wisc.edu/wddd/2002/final/squillante.pdf

Other researchers give a whole slew of different IPC definitions:
https://pdfs.semanticscholar.org/9750/fd5a20c3ecf846b589e2dfa3ba8610862e27.pdf

Fair enough, it was an article rewrite that got the "up to" part added. :)

But it would be nice to see AMD reach the performance they claim for the first time in 10 years. But that would mean the leaks are wrong.
No problem. However, The Stilt also already confirmed that AMD matched its promised IPC improvements for PD, SR, and XV (and likely the cat cores, too, if you care to check).
 
Last edited:
  • Like
Reactions: coercitiv

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Just downloaded GB4.
Pity I cannot underclock below 1.6GHz, so I used 2.2GHz and double that at my current 4.4GHz.

Core i7 3770K @ 2.2GHz
2133MHz memory
https://browser.geekbench.com/v4/cpu/659503

Core i7 3770K @ 4.4GHz
2133MHz memory
https://browser.geekbench.com/v4/cpu/659569
Hm, looking at the individual tests, scaling looks linear. The overall score is not, because of memory, I take it. Well, great, then dividing the 6950X@1.2GHz results by 1.2 will be fine.

For now, have that:
[attached chart: TaL0KFa.png]

Yes, I do assume that the sample was working at 1GHz. And yes, I am bad enough at using LOCalc that I failed to put proper labels on the tests. Just know that number 14 is SGEMM, which uses hand-written code with AVX2.
 
  • Like
Reactions: Dresdenboy

Nothingness

Diamond Member
Jul 3, 2013
3,301
2,374
136
Yeah, I'd be up for that. Interestingly, perf/MHz for my Broadwell-E goes up at lower frequencies. At 1.4GHz (DDR4-2400, uncore @ 1.4GHz), I get 1968 single core. Perf/MHz here is ~1405.
That's to be expected: if you maintain RAM speed, the number of core cycles lost on misses is reduced ;)
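The effect is easy to illustrate: with RAM latency fixed in nanoseconds, a cache miss costs fewer core cycles at a lower clock (the latency figure below is illustrative, not measured):

```python
MISS_LATENCY_NS = 80.0  # assumed DRAM round-trip latency, held constant

for freq_ghz in (3.5, 1.4):
    # cycles lost per miss = latency (ns) * clock rate (cycles per ns)
    cycles_lost = MISS_LATENCY_NS * freq_ghz
    print(f"{freq_ghz} GHz: {cycles_lost:.0f} core cycles per miss")
```

So at 1.4GHz the same miss stalls the core for well under half as many cycles as at 3.5GHz, which pushes per-clock throughput up.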
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
That's to be expected: if you maintain RAM speed, the number of core cycles lost on misses is reduced ;)
Actually, it's mostly because GB4's scoring takes memory performance into account, which does not really correlate linearly with clock speed (or uncore speed, for that matter).

EDIT: Took Arachnotronic's 1.4GHz results for now. Woah, that's one hell of a mixed picture.
 
Last edited:

cdimauro

Member
Sep 14, 2016
163
14
91
That's a funny statement to make. IPC in isolation is useless: a poorly optimized application might run many useless instructions and that might artificially increase IPC.
Nevertheless, if you run the application you execute those instructions, and they should be counted as well, since they impact the performance, and thus the time the application takes to complete the task.
One place where IPC can be considered as useful is for comparing two different CPU running the same program (or when tuning a micro-architecture :))
Also. But it remains a metric for measuring the performance of the application.
With a clear distinction it should be possible to use IPC for 1T on a 2T core.

Where is the definition of IPC, which excludes its application to parts of programs, or different scenarios on a SMT machine? This would just cut that metrics' usability. In fact I've read papers showing the actual IPC plotted over time for different applications.
Nobody stops you from doing it: IPC is a generic measure, so you can apply it to portions of an application, as you reported. You only need to measure the number of cycles taken by a certain number of executed (retired) instructions, and that's it.
If the quote above is a definition of sorts, then I'm an interstellar rocket.
In fact it isn't a definition. The definition was in the link that I gave:
"Instructions Retired per Cycle, or IPC shows average number of retired instructions per cycle."
I reported the other sentence to show that the IPC is about measuring the performance of an application.
Based on your post, would you agree that IPC is utterly useless in describing ST performance of a SMT capable CPU core?
Maybe because IPC is related to the overall performance of an application, and not to theoretical numbers which mean nothing by themselves?

As I reported above, IPC is... just the average number of instructions executed per cycle while running an application. Whether the application is ST or MT doesn't matter when looking at IPC.

It's clear that, with the latter and with an SMT-capable core, the IPC is affected by the contributions of both hardware threads, which are running and concurrently using the available shared resources.
You got me to check the literature. ;)

And there I found this:
Which shows that IPC isn't tied only to single-threaded workloads (because it is contrasted with "the case for single-threaded workloads").
BTW, IPC as the reciprocal of CPI (part of the "Iron Law of Performance") is described in the chapter "Single-threaded Workloads".
The fact that IPC is described in a chapter with such a name doesn't mean that IPC is a single-thread measure. Logic in hand.

In fact, the sentence that you reported from the text states the exact opposite.
EDIT: I found the cited paper leading to this remark.

Source: A. R. Alameldeen (Intel) and D. A.Wood (University of Wisconsin-Madison). IPC considered harmful for multiprocessor workloads.
IEEE Micro, 26(4):8–17, July 2006.

EDIT#2: IBM researchers distinguished between single-threaded IPC and SMT IPC here
http://pharm.ece.wisc.edu/wddd/2002/final/squillante.pdf
From p.5:
"The primary performance measures presented are the average number of instructions executed per cycle (IPC) and the miss ratios of all caches. We compute IPC and other statistics for SMT simulations and compare to singlethreaded performance as follows."
Here IPC isn't tied to ST performance, since that result is compared against ST performance.

It's also not true that researchers distinguish between ST-IPC and SMT-IPC: this is the artificial work of splitting instruction execution that they did to get numbers for ST and MT while running an SMT application.
Again from p.5:
"In SMT mode our simulator halts when one trace runs out of instructions. Thus our SMT measurements reflect only multi-threaded performance. We record the position in the trace of the second thread, the one that did not run to completion, and extract the statistics for that initial portion of the trace from a single threaded run. The combined single-threaded IPC of the two traces is then computed by adding the total number of instructions executed and dividing by the total number of cycles on the two single-threaded runs (one complete and one partial). Thus we are comparing single-threaded and multi-threaded performance on exactly the same set of instructions."

From p.7:
"Selfishness is the relative speed of an application when running in SMT mode as measured by its IPC as a percentage of its single-threaded IPC."
Here an MT application's IPC is, again, related to... SMT. Which is obvious.
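The two metrics quoted from the paper are straightforward to express in code; here's a minimal sketch (function and variable names are mine, not the paper's):

```python
def combined_st_ipc(runs):
    """Combined single-threaded IPC as described in the quoted passage:
    total instructions divided by total cycles over the single-threaded
    runs -- not the average of the per-run IPCs."""
    total_insns = sum(insns for insns, _ in runs)
    total_cycles = sum(cycles for _, cycles in runs)
    return total_insns / total_cycles

def selfishness(smt_ipc, st_ipc):
    """'Selfishness' per the p.7 quote: a thread's SMT-mode IPC as a
    percentage of its single-threaded IPC."""
    return 100.0 * smt_ipc / st_ipc

# Two hypothetical traces: (instructions, cycles)
runs = [(1_000_000, 500_000), (600_000, 600_000)]
print(combined_st_ipc(runs))   # 1.6M insns / 1.1M cycles ≈ 1.45
print(selfishness(1.1, 1.45))  # SMT IPC of 1.1 vs ST IPC of 1.45 ≈ 75.9%
```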

From p.9:
"A partial explanation for this difference may be their 8-thread SMT vs our 2-thread: IPC results were only given for 8 threads, and miss rates for 8 threads show a much greater difference between COLOR and BINHOP than for 2 threads."
Why weren't IPC results given for fewer threads? 8 threads were available and, guess what, they reported the results with all 8.
Other researchers give a whole slew of different IPC definitions:
https://pdfs.semanticscholar.org/9750/fd5a20c3ecf846b589e2dfa3ba8610862e27.pdf
With the IPC definition being:
"Average number of useful instructions executed per cycle"
Wow! Incredible :D

And another interesting thing on p.10:
"We prefer the use of two metrics, one for fairness and one for throughput (IPC)."
I was merely attempting a little reduction to absurdity, but I guess more solid conventional knowledge will also do the trick, with less OT to boot.
The conventional knowledge doesn't seem to be on your side. See above what I've reported from the same sources.

Now I await your reduction to absurdity, but possibly without just empty words, eh!
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Which shows that IPC isn't tied only to single-threaded workloads (because it is contrasted with "the case for single-threaded workloads").

Sorry man, but I think English is not your first language, because it clearly shows that the contrast is between "IPC is useless for multi-threaded workloads" and "IPC for single-threaded workloads". It does not outright say IPC is a single-thread measurement, but it points out that it's entirely useless for multi-threaded workloads (which makes perfect sense if you have ever heard of multi-threaded applications).
Here an MT application's IPC is, again, related to... SMT. Which is obvious.
No, here both IPCs are used; you are taking it out of context.
The conventional knowledge doesn't seem to be on your side. See above what I've reported from the same sources.
Conventional knowledge seems to be that you are taking it all out of context to drive your point, sorry.
 

cdimauro

Member
Sep 14, 2016
163
14
91
Haswell is worse, because the 4 int ports have 2 256-bit FPUs plus 3 vector ALUs attached, so it's 4 INT/FP/Vec with a maximum of 2 FP and 3 vec int.
Vec ALU & Vec Shuffle (mapped to port 5, which supports some vector operations) are able to execute several SIMD operations, which are pretty common and very important in SIMD code.
For 128-bit code without FMAC it's half of Zen's throughput...
Sure, but see above: SIMD code isn't made up only of FADD/MUL/MACs.

When Zen has to execute some Vec ALU or Vec Shuffle operation, it needs to use one of the four FPU ports (but currently we don't know how they are mapped).

So, it's true that an Intel can execute only 2 FADD/MUL/MACs instructions per cycle, but in general, counting all SIMD instructions, it can execute up to 3 of them.
I was talking about Skylake, which hopefully has a better, true 4 shared ports (I can't find a diagram on Google).
No, the ports are essentially the same, only with some improvements on the execution units.

Anyway, I was talking about Haswell because Broadwell-E has a very similar microarchitecture.
But we see that AMD, for two threads, can do 4 INT plus 4 FP, while Haswell does 4 INT, or 3 INT and 1 FP, etc., up to 2 FP or 3 vec... This is what I meant... Obviously a unified queue (from queueing theory) is better, but not quite with half the queues...
You're forgetting the L/S ports: Intel processors have 4 of them (half of the total!), with one dedicated solely to store addresses (added in Haswell). I think that if Intel decided to dedicate so many ports to this kind of operation, it has its strong reasons, right?

Whereas Zen has only 2 of them.
AMD MOPS and uops are very powerful too... And there is also uop fusion in AMD architectures...
We have no details on what AMD's MOPS/uops and Intel's uops can do, so we can't make comparisons.
Even if i read carefully the diagrams, there are 2 reasons that forces us to wait actual benchmarks:

1) We don't have the instruction-type layout for Zen. We don't know which pipeline can do what, e.g. how many IMULs? How many cycles, what limitations? So a simulation is impossible.
2) Even if we have that details, it's a difficult calculation, better done with a simulator, that we don't have.
Sure, but here we are not trying to make estimations about the performance: just talking about the pros and cons of the microarchitectures.
We can only do a high-level analysis, using queueing theory to roughly estimate the outcome.
And we know that 4+4 specialized queues are faster than 4 shared queues, given the same latencies, but they obviously require much more logic...
See above: you're completely ignoring the L/S units, which do some useful work too.
The only reason to be slower is if there are some limitations in instruction combinations that cap the maximum IPC.
Including the only 2 available AGUs.
Moreover, 10 uops is not sustainable, only 6, and this only assuming 100% cache hits and a uop cache hit rate high enough to avoid the 4 uops/cycle bottleneck of the decoder, plus low enough dependencies...

So Intel's design is balanced enough for very well mixed instructions (8 peak uops/cycle processing, with a max of 6 uops/cycle dispatched),
It's the Decoded Instruction Cache which can send 6 uops/cycle to the micro-op queue. The micro-op queue can receive up to 5 uops/cycle (on Skylake; it's 4 for the previous microarchitectures) directly from the decoder (called Legacy Decode Pipeline in the Intel terminology), and up to 4 uops/cycle from the MicroROM.

The micro-op queue can then send uops (don't know how many of them: it's not specified) to the rename unit, and finally to the scheduler. Finally, the scheduler can send up to 8 uops/cycle.
but it could have some stalls with particular instruction mixes, e.g. a complex FP instruction mix with low interlocking that leaves few free ports for integer instructions, like 2 threads of the 2.4 IPC SPECfp bench.
Moreover, branch prediction is probably better on Intel, and this assures a constant 6 uops/cycle flow, whereas on Zen this could be more intermittent...

Zen is better equipped for peak throughput with awfully mixed instructions...
Zen is short on L/S-AGU units, as I reported, whereas Intel has plenty of resources here.

Also, if an FPU instruction has to access memory (which is not a rare condition), it has to go through the AGUs first, while this isn't a problem for Intel.

Just to add a couple of important things.

Anyway, I'm curious to see how Zen behaves with some other kind of workloads, like emulators, database queries, compilation, web app execution, etc.
 

cdimauro

Member
Sep 14, 2016
163
14
91
Sorry man, but I think English is not your first language,
True.
because it clearly shows that the contrast is between "IPC is useless for multi-threaded workloads" and "IPC for single-threaded workloads". It does not outright say IPC is a single-thread measurement, but points out that it's entirely useless for multi-threaded workloads
Correct.
(and that makes perfect sense if you have ever heard of multi threaded applications).
I've clearly expressed my opinion in several other places, so I'm quite aware of it, but if there's something wrong in what I've stated, you're free to quote me and show how/why.
No, here both IPCs are used, you are taking it out of context.
Sorry, but I was right there. Here's again the sentence:

"Selfishness is the relative speed of an application when running in SMT mode as measured by its IPC as a percentage of its single-threaded IPC."

The application is running in SMT, and its IPC is measured. In SMT mode...
Conventional knowledge seems to be that you are taking it all out of context to drive your point, sorry.
Well, actually only one point was wrong: that's too little to discard all the other points, which are correct.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
You're forgetting the L/S ports: Intel processors have 4 of them (half of the total!), with one dedicated solely to store addresses (added in Haswell). I think that if Intel decided to dedicate so many ports to this kind of operation, it has its strong reasons, right?

Whereas Zen has only 2 of them.

See above: you're completely ignoring the L/S units, which do some useful work too.

Wrong, there are 2 loads and one store in Zen; how could that be half of 4?

So who's completely ignoring uarch details?

It's the Decoded Instruction Cache which can send 6 uops/cycle to the micro-op queue. The micro-op queue can receive up to 5 uops/cycle (on Skylake; it's 4 for the previous microarchitectures) directly from the decoder (called Legacy Decode Pipeline in the Intel terminology), and up to 4 uops/cycle from the MicroROM.

The micro-op queue can then send uops (don't know how many of them: it's not specified) to the rename unit, and finally to the scheduler. Finally, the scheduler can send up to 8 uops/cycle.

Well explained, and Zen's dispatcher can get 6 uops from the decoder and 4 uops from the MicroROM; that's as many as 10 uops that can be dispatched (and then scheduled) each cycle, 6 to the INT scheduler and 4 to the FPU scheduler.
 

cdimauro

Member
Sep 14, 2016
163
14
91
Wrong, there s 2 load and one store in Zen, how could that be half of 4..?..

So who s completely ignoring uarch details..?.
Can you show me where the L/S units are attached?
Well explained, and Zen's dispatcher can get 6 uops from the decoder and 4 uops from the MicroROM; that's as many as 10 uops that can be dispatched (and then scheduled) each cycle, 6 to the INT scheduler and 4 to the FPU scheduler.
And you're ignoring the fact that, in Intel's processors, it's the decoder unit which performs instruction fusing (up to 2 of these operations per cycle) and uop fusing (every one of the 4 decoders can do it). So, in the latter case, you can have up to 8 uops which are packed into 4 uops and sent to the micro-op queue.

AFAIK, on Zen it's the micro-op cache which performs that kind of operation, after it has received the uops from the decoder.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
Interestingly, perf/MHz for my Broadwell-E goes up at lower frequencies.

It doesn't go up, and you can certainly check that with the individual FP scores. The thing is that when you lower the CPU frequency, RAM bandwidth is not decreased accordingly, and since it's part of the GB score, its weight in the test will be artificially increased and it will raise the score the same way, artificially...
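A toy model makes the mechanism concrete (the component weights below are made up purely for illustration, not GB4's actual weighting): if part of the score comes from a RAM-bound component that doesn't scale with core clock, score-per-GHz rises as the clock drops, even though the core itself isn't getting "faster".

```python
def toy_score(freq_ghz, cpu_pts_per_ghz=1000.0, mem_pts=400.0):
    # Toy GB4-like score: a component that scales with core clock,
    # plus a fixed RAM-bound component (RAM speed held constant).
    return cpu_pts_per_ghz * freq_ghz + mem_pts

for f in (3.5, 1.4):
    print(f"{f} GHz: {toy_score(f) / f:.0f} points/GHz")
```

The fixed memory component dominates more at low clocks, so the per-GHz figure goes up artificially, exactly as argued above.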
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
Can you show me where the L/S units are attached?

And you're ignoring the fact that, in Intel's processors, it's the decoder unit which performs instruction fusing (up to 2 of these operations per cycle) and uop fusing (every one of the 4 decoders can do it). So, in the latter case, you can have up to 8 uops which are packed into 4 uops and sent to the micro-op queue.

AFAIK, on Zen it's the micro-op cache which performs that kind of operation, after it has received the uops from the decoder.

I ignore nothing, and it's even the other way around: you should know that Zen's decoder also fuses two instructions into a single op, and they will be unpacked later in the pipeline at the output of the schedulers...

On Zen there are two paths for the uops, contrary to what you're stating: one from the decoder and the other from the op cache.

[slide: 8-630.1663522994.png]

[slide: 9-630.2716233526.png]


I suggest that you read this article before discussing further; even with Google's sloppy translation, it should help you understand the design better, as well as spare us walls of text built on wrong basics:

www.hardware.fr/news/14758/amd-detaille-architecture-zen.html
 
Last edited:

coercitiv

Diamond Member
Jan 24, 2014
7,359
17,443
136
The conventional knowledge doesn't seem to be on your side. See above what I've reported from the same sources.

Now I await your reduction to absurdity, but possibly without just empty words, eh!
I have addressed you a question, care to answer it please?
Based on your post, would you agree that IPC is utterly useless in describing ST performance of a SMT capable CPU core?
 

cdimauro

Member
Sep 14, 2016
163
14
91
I ignore nothing, and it's even the other way around: you should know that Zen's decoder also fuses two instructions into a single op, and they will be unpacked later in the pipeline at the output of the schedulers...
According to the article, this is only for instruction fusing; it says nothing about uop fusing.
On Zen there are two paths for the uops, contrary to what you're stating: one from the decoder and the other from the op cache.
[slide: 8-630.1663522994.png]

Why did you post this slide? The correct one is another.

Anyway, I saw: the micro-op queue is able to take uops both from the decoder and the micro-op cache.

However, it can only dispatch 6 uops/cycle, not 10, to the schedulers:

[slide: IMG0051665.jpg]

I suggest that you read this article before discussing further, even with Google sloppy translation it should help you understand better the design as well as sparing wall of texts built out of wrong basics:

www.hardware.fr/news/14758/amd-detaille-architecture-zen.html
Maybe it's better that you take a look too. See above.

And BTW, you haven't answered my question: where are the L/S units located?
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Vec ALU & Vec Shuffle (mapped to port 5, which supports some vector operations) are able to execute several SIMD operations, which are pretty common and very important in SIMD code.

Sure, but see above: SIMD code isn't made up only of FADD/MUL/MACs.

When Zen has to execute some Vec ALU or Vec Shuffle operation, it needs to use one of the four FPU ports (but currently we don't know how they are mapped).

So, it's true that Intel can execute only 2 FADD/MUL/MAC instructions per cycle, but in general, counting all SIMD instructions, it can execute up to 3 of them.

Ok, so for true 128-bit non-FMAC code Zen is better equipped... for shuffle, logic, int SIMD, etc... When we have details on Zen's FPU we'll see...

No, the ports are essentially the same, only with some improvements on the execution units.

Anyway, I was talking about Haswell because Broadwell-E has a very similar microarchitecture.

Too bad.

You're forgetting the L/S ports: Intel processors have 4 of them (half of the total!), with one dedicated solely to store addresses (added in Haswell). I think that if Intel decided to dedicate so many ports to this kind of operation, it has its strong reasons, right?

Whereas Zen has only 2 of them.

[...]

See above: you're completely ignoring the L/S units, which make some useful work too.

Included the only 2 available AGUs.

Zen has 2 AGUs but can do 2x128-bit reads and 1x128-bit write. For 128-bit code that is only slightly less than Intel.

It's the Decoded Instruction Cache which can send 6 uops/cycle to the micro-op queue. The micro-op queue can receive up to 5 uops/cycle (on Skylake; it's 4 for the previous microarchitectures) directly from the decoder (called Legacy Decode Pipeline in the Intel terminology), and up to 4 uops/cycle from the MicroROM.

The micro-op queue can then send uops (don't know how many of them: it's not specified) to the rename unit, and finally to the scheduler. Finally, the scheduler can send up to 8 uops/cycle.

I was using six, supposing the common case of a uop cache hit, where the two architectures are even on par... Most of the time, up to 6 uops can be delivered...

Zen is short on L/S-AGU units, as I reported, whereas Intel has plenty of resources here.

Also, if an FPU instruction has to access memory (which is not a rare condition), it has to go through the AGUs first, while this isn't a problem for Intel.

Just to add a couple of important things.

Anyway, I'm curious to see how Zen behaves with some other kind of workloads, like emulators, database queries, compilation, web app execution, etc.

I am curious too, but for write operations the latency does not matter, and for read instructions the other thread or other instructions can probably be executed most of the time...
 

cdimauro

Member
Sep 14, 2016
163
14
91
I have addressed you a question, care to answer it please?
I've already answered: it doesn't make sense to talk about ST or MT performance for an SMT core. It depends entirely on the kind of application(s) the core is running.
 

bjt2

Senior member
Sep 11, 2016
784
180
86
Can you show me where are attached the L/S units?

And you're ignoring the fact that it's the decoder unit, in Intel's processors, which realizes the Instructions-fusing (up to 2 of these operations per cycle) and uops-fusing (every one of the 4 decoders can do it). So, in the latter case, you can have up 8uops which are packed into 4 uops, and sent to the micro-op queue.

AFAIK, on Zen it's the micro-op cache which makes such kind of operations, after it has received the uops from the decoder.
AFAIK only 1 of the 4 can be a fused uop, for a total of max 5 uops...
 

cdimauro

Member
Sep 14, 2016
163
14
91
Not that bad, looking at the benchmarks. ;)
Zen has 2 AGUs but can do 2x128-bit reads and 1x128-bit write. For 128-bit code that is only slightly less than Intel.
But Intel has 4 ports which take care of store-address operations, data loads and stores, whereas on Zen you have to go through the only 2 AGUs for that.

That's a considerable difference, especially for FPU code.
I am curious too, but for write operations the latency does not matter, and for read instructions the other thread or other instructions can probably be executed most of the time...
Well, it's an OoO microarchitecture in the end. :D

But this can also be applied to Intel processors. ;)
AFAIK only 1 of the 4 can be a fused uop, for a total of max 5 uops...
Are you talking about Zen? Because Intel processors can do two instruction-fusion operations per cycle.
 