cdimauro
Member
Sep 14, 2016
Purple box in the image you posted: 2 load + 1 store per cycle

I know it, but I wanted Abwx, the Zen expert, to tell it.
Why did you report this slide? The correct one is another slide.
Anyway, I saw it: the micro-op queue can take uops both from the decoder and from the micro-op cache.
However, it can only dispatch 6 uops/cycle to the schedulers, not 10:
And BTW, you haven't answered my question: where are the L/S units located?
But Intel has 4 ports which take care of store-address operations, data loads, and data stores, whereas on Zen you have to pass through the only 2 AGUs for all of them.
That's a significant difference, especially for FPU code.
Are you talking about Zen? Because Intel processors can do two instruction-fusion operations per cycle.
I edited my post and added the decode stage. There are 10 uops that can be dispatched simultaneously: 6 to the INT schedulers and 4 to the FP schedulers. If there are 10 uops available in the uop queue, it doesn't make sense to pretend that only 6 can be dispatched; it's 6 only if the queued uops are all to be executed by the ALUs and none are related to the FP execution units...
![]()
That was a useless question, but if it pleases you to throw up a smoke screen, well, you posted a slide that contradicts your previous claim about Zen's L/S being half of Intel's CPUs; one has to wonder if you even bothered to read an article here or there about Zen...
Btw, the diagram you posted is higher-level than the ones I posted. Of course, using those high-level slides allows you to ignore the data that are on the low-level diagrams; we can see on the latter that the INT and FP dispatches use two different paths, which renders moot your "argumentation" that only 6 uops can be dispatched...
There are TWO slides which show that the micro-op queue dispatches 6 uops/cycle:
![]()
![]()
I already explained that to use the L/S units you have to pass through the AGUs, which are ONLY TWO!
So you can execute 2 loads/cycle or 1 load + 1 store/cycle, even if you have 2 load units + 1 store unit, because of the AGUs.
Understood?
See above: there are TWO slides which report the same information: 6 uops / cycle.
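To make the argument above concrete, here is a toy Python enumeration (mine, not from the thread) that simply encodes this post's premise: every load and every store must occupy one of Zen's two AGU slots in the cycle it is dispatched. The replies below dispute the premise itself, so treat this as a model of the claim, not of the hardware.

```python
from itertools import product

# Premise under test (this post's claim, not established fact):
# every memory op consumes one of the two AGU slots in its cycle.
AGUS = 2

def fits_in_one_cycle(loads: int, stores: int) -> bool:
    """True if the mix can be address-generated in a single cycle."""
    return loads + stores <= AGUS

# Enumerate small mixes and show which ones a 2-AGU design would allow.
for loads, stores in product(range(3), range(2)):
    verdict = "OK" if fits_in_one_cycle(loads, stores) else "no"
    print(f"{verdict}: {loads} load(s) + {stores} store(s)")
# Under this premise "2 loads + 1 store" never fits in one cycle,
# which is exactly the objection raised in this post.
```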
Are you trolling?...
It's written "2 loads and 1 store per cycle" on the very slide you posted; this has already been pointed out by a member just a few posts before this one...

Trolling? Please, can you explain to me how you can do that when you have to pass through the only TWO available AGUs?
If the AGUs were a limit for the R/W ops, then I don't see the need for 3 ports on L1D. Probably for simple addressing, like direct (e.g. [AX]), an AGU is not needed... Also, the stack engine has its own AGUs.
Whatever the memory operations are, they all pass through the AGUs. See here:
![]()
And it's interesting to note that the AGUs can access just 1 Store Queue and 1 Load Queue. So you're limited to 1 load + 1 store dispatched to the proper queue per cycle.
I don't know why 2 loads are reported. Maybe there is the possibility to "fuse" two loads made at consecutive memory locations, but this is just speculation.
Some info on the FO4 delay metric that people keep bringing up:
http://www.realworldtech.com/fo4-metric/5/
In this article, we’ve provided a cursory review of the FO4 metric as it applies to the delay depth of a logic block. We also engaged in brief discussions using the barrel shifter as an example in illustrating the use of the FO4 metric. Finally, the barrel shifter also served as an example to illustrate that the FO4 depth of a logic block can be reduced through the use of more advanced circuits or better architectural re-arrangement of logic and arrays to minimize signal delay paths. What is not mentioned is that even more engineering resources may be applied to specific logic paths, especially if that logic path happens to be the critical path that limits the operating frequency of a processor. In a design that uses the standard cell libraries, the 2 input mux may itself be a standard cell, or smaller bit vector barrel shifters may be created as a standard cell with which larger bit vector shifters may be constructed from. Processors or ASICs that use standard cell design methodology require less design effort as compared with a full custom design effort. In a full custom design effort, logic circuits may be specifically tailored to a given process technology at the cost of engineering resources. This point is well covered in the article by Chinnery and Keutzer [3].
Physical implementation of a processor utilizing a full custom design methodology requires detailed simulations and considerations for a functional unit’s area, power, and latency. The optimal design point of each logic block would depend on the assumptions made in the overall microarchitecture of the processor. However, such detailed simulations and considerations are not needed in cases where a processor’s architects wish to perform preliminary research into a new and novel architecture. The FO4 metric provides an abstraction from physical circuit implementations and allows architects to project, design and simulate a processor without knowing the specifics and details of the implementation. As a result, the FO4 metric is a useful and popular metric used by architects and engineers as a basis to explore architectural concepts at an abstract level.

This is what I was referring to. I remember that article...
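As a rough illustration of how the FO4 metric from the article gets used (my own back-of-the-envelope sketch; the per-mux delay and the stage budget are assumed numbers, not figures from the article): a naive n-bit barrel shifter built from 2-input muxes needs log2(n) mux levels, so its FO4 depth is roughly levels times FO4-per-mux, which you can compare against a cycle-time budget expressed in FO4.

```python
import math

# Assumed placeholder numbers, purely for illustration: real values
# depend on circuit style and process, as the article stresses.
FO4_PER_MUX = 2.0        # delay of one 2:1 mux level, in FO4
CYCLE_BUDGET_FO4 = 10.0  # an aggressive high-frequency stage budget, in FO4

def shifter_fo4_depth(bits: int, fo4_per_mux: float = FO4_PER_MUX) -> float:
    """FO4 depth of a naive log2(bits)-level, 2:1-mux barrel shifter."""
    levels = math.ceil(math.log2(bits))
    return levels * fo4_per_mux

for bits in (8, 32, 64):
    depth = shifter_fo4_depth(bits)
    fits = "fits" if depth <= CYCLE_BUDGET_FO4 else "exceeds"
    print(f"{bits:>2}-bit shifter: ~{depth:.0f} FO4 ({fits} a "
          f"{CYCLE_BUDGET_FO4:.0f} FO4 budget)")
```

This is the kind of quick feasibility check the FO4 abstraction enables before any circuit-level design exists.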
I think that this is a simplified diagram. There's not even the stack engine depicted. They clearly stated that the stack engine has its own AGUs...

Stack operations have to deal with pushes (stores) and pops (loads) to/from the stack's memory, so you need load + store units, which doesn't match the available information.

There is a sort of L0 cache (called the memory file), and I think that "cache" misses are redirected to the L1D cache, where the AGU calculations were already made... The stack engine is higher in the hierarchy and "steals" stack instructions from the uop stream...

But its operations go to the dispatcher. Take a look at the frontend slide.
And both slides show 6 uops dispatched to the INT unit; that's high-level, while the low-level slide explicitly shows 6 on the INT path.
According to Hardware.fr, the INT and FP parts of the CPU can work independently, without one part stalling the other, which could happen in previous uarchs.
As said, it doesn't make sense that, if there are more than 6 uops in the queue, only 6 could be dispatched, since the FP path is separate; the limitation to 6 holds only if it's exclusively INT code. But you can of course think otherwise if that's somewhat relieving for you; after all, better to make wrong assumptions than to lose sleep, isn't it...

Well, the only thing that I think is that you can have a hard time even with elementary fluid dynamics.
http://www.anandtech.com/show/10591...t-2-extracting-instructionlevel-parallelism/3

The new Stack Engine comes into play between the queue and the dispatch, allowing for a low-power address generation when it is already known from previous cycles. This allows the system to save power from going through the AGU and cycling back around to the caches.
Finally, the dispatch can apply six instructions per cycle, at a maximum rate of 6/cycle to the INT scheduler or 4/cycle to the FP scheduler. We confirmed with AMD that the dispatch unit can simultaneously dispatch to both INT and FP inside the same cycle, which can maximize throughput (the alternative would be to alternate each cycle, which reduces efficiency). We are told that the operations used in Zen for the uOp cache are ‘pretty dense’, and equivalent to x86 operations in most cases.
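Taking the AnandTech wording literally, the per-cycle constraints would be: at most 6 ops leave the micro-op queue in total, of which at most 6 may go to the INT scheduler and at most 4 to the FP scheduler, in the same cycle. Here is a minimal greedy sketch of that reading (mine, not AMD documentation); under the Hardware.fr "6 + 4 = 10" reading quoted later in the thread, you would simply drop the shared 6-op cap.

```python
from collections import deque

# Per-cycle dispatch caps under the AnandTech reading.
TOTAL, INT_CAP, FP_CAP = 6, 6, 4

def dispatch_one_cycle(queue: deque) -> tuple[int, int]:
    """Greedily dispatch in order from the queue head; returns (int_ops, fp_ops)."""
    sent_int = sent_fp = 0
    while queue and sent_int + sent_fp < TOTAL:
        kind = queue[0]
        if kind == "int" and sent_int < INT_CAP:
            sent_int += 1
        elif kind == "fp" and sent_fp < FP_CAP:
            sent_fp += 1
        else:
            break  # the head op's scheduler is full this cycle
        queue.popleft()
    return sent_int, sent_fp

# A mixed stream: with the shared cap, only 6 of these 8 ops leave in cycle 1.
q = deque(["int", "int", "fp", "int", "fp", "int", "fp", "fp"])
cycle = 0
while q:
    cycle += 1
    i, f = dispatch_one_cycle(q)
    print(f"cycle {cycle}: dispatched {i} INT + {f} FP")
```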
An outlier of a game or two doesn't change the fact that most of the time BD and its derivatives just can't keep up.

I don't think dismissing a shipping game as an "outlier" is necessarily the best response. Why is Cinebench so vastly popular here? It's just one program. Why isn't it an outlier?
Piledriver/Vishera was a fine CPU back in its day when competing with Sandy (8350 matching 2500k as expected)... nowadays it's game over.

One can get an 8320E and a board capable of clocking it at 4.4 GHz on air for $130 after rebate with tax at Micro Center. That is enough to meet the performance of the 9370, performance that can be adequate depending on the workload and how optimal the coding of the software is.
The 9370 and 9590 are that high in the charts through balls-to-the-wall overclocking and overvolting for an off-the-shelf consumer part.

They're stupid parts because they use turbo, which is not at all optimal. The Stilt also says the leakage is worse with those than with the E chips. It makes more sense to get a cheaper 8-core FX and overclock it. Even the plain 8320 has shown fine results from what I've read.
In XV the low-hanging fruit in the design has been picked, and further refinement could do wonders, as you say. BR as it is, a cache-starved part on an outdated 28nm process tuned for sheer density, is a much stronger product than Trinity/Richland were. Imagine what a four-module XV v2 part with L3 and a better uncore could do as a replacement for Vishera, designed for higher power on a newer, better node... sadly we won't get to see that. Another what-if.
AMD knows very well why they didn't keep pursuing the BD lineage for high performance and power usage, and instead decided to design a new core. Zen should be all-around more competitive than BD ever was in any of its iterations, given the 40% figure over XV... if it can clock high enough.

I think the reason AMD didn't improve upon Piledriver for the enthusiast and server arenas is its weakened market position, thanks to Intel's tactics (as well as its various managerial errors over the years), and the decision to pursue stronger sales of Zen by withholding an upgrade, making Zen seem better. This latter decision seems to be due to the limitations of node improvement from GloFo since 32nm SOI. I'm speculating, but it seems sensible that, if AMD can't offer Skylake or Kaby performance with Zen, it can at least point to a vast improvement over Piledriver (thanks to Piledriver's old age).
cdimauro, that's the sustained vs. peak story again.
How about this:
http://www.anandtech.com/show/10591...t-2-extracting-instructionlevel-parallelism/3
And let me point you at the differences between AMD's "instructions", "micro-ops" and "µops" (the latter as found in the schedulers). According to Mike Clark, those "dense" micro-ops get split up in the dispatch unit. This should include "fast-path doubles" like 256b AVX ops. The instructions fused by micro-op fusion in Intel CPUs usually cover EX+mem ops (IIRC since Banias). That type of instruction gets decoded as one Cop/Mop/instruction in AMD's processors (no splitting up in the first place). Macro-op fusion (test+branch) is the only fusion happening in AMD's chips. Whether the "fast-path doubles" (medium-complexity x86 ops which break up into up to 4 uops later, e.g. 256b AVX + mem) can be decoded at a rate of 4/cycle remains to be seen, as it was different in the past.
A lot of information can be found in the GCC patches both for Intel's and AMD's x86 processors, including latencies, throughput (in most cases), dependencies, unit occupancy, etc.
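To illustrate the fusion being described, here is a toy scanner (illustrative only; the real rules depend on operands, condition codes and alignment, and the prefix check here is a simplification) that merges cmp/test + conditional-branch pairs into one macro-op, so the pair costs a single slot downstream:

```python
# Toy macro-op fusion: merge cmp/test + Jcc into one macro-op.
# Real fusion rules are much narrower than this simple prefix check.
FUSABLE_FIRST = {"cmp", "test"}

def fuse(stream: list[str]) -> list[str]:
    fused, i = [], 0
    while i < len(stream):
        cur = stream[i]
        nxt = stream[i + 1] if i + 1 < len(stream) else None
        if cur in FUSABLE_FIRST and nxt is not None and nxt.startswith("j"):
            fused.append(f"{cur}+{nxt}")  # one slot instead of two
            i += 2
        else:
            fused.append(cur)
            i += 1
    return fused

ops = ["mov", "cmp", "jne", "add", "test", "jz", "sub"]
print(fuse(ops))   # ['mov', 'cmp+jne', 'add', 'test+jz', 'sub']
print(len(ops), "->", len(fuse(ops)), "slots")
```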
Then there should be some misleading information in the slides. Let me quickly recap.
According to these two slides:
The Micro-op Queue receives micro-ops (obviously) from the Decoder and the Op-Cache, but then dispatches 6 uops (NOT micro-ops!).
Whereas in this slide:
the dispatcher sends Micro-ops (6 + 4), which shouldn't be the case.
In practice, 10 micro-ops can be sent (6 to the "Integer" part of the chip, 4 to the "Floating Point" part), i.e. two more than on Haswell (Intel did not give us that information for Skylake).
No, this is 100% wrong: there are load and store QUEUES after the AGUs. You can issue 2 loads and a store per cycle, as clearly stated by Michael Clark @ 12:39 into the Hot Chips presentation.
You are also completely ignoring the stack engine (which is used to do address generation + unknown stuff for stack addresses/data).
Also, you keep saying Intel has 4 L/S ports. They do not: they have a dedicated store AGU (it does not move data!), a dedicated store port (it doesn't generate addresses!) and two general-purpose AGUs that can each do a load. So the aggregate is 3!
So the question is, on average, what percentage of loads are for the stack, because AMD doesn't use the AGUs to calculate their addresses; so it's completely plausible that Haswell and newer, and Zen, issue on average the same number of loads and stores per cycle.
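The question in that last paragraph can be turned into a toy sustained-throughput model (the formula and all numbers are my assumptions, for illustration only): if a fraction f of memory ops are stack accesses whose addresses come from the stack engine instead of the two general AGUs, the AGU-limited rate is 2/(1-f) ops per cycle, capped by the 2-load + 1-store limit of the queues and ports.

```python
# Toy model: how the stack engine could lift Zen's memory throughput
# above the raw 2-AGU rate. All parameters are illustrative assumptions.
AGUS = 2      # general-purpose AGUs
PORT_CAP = 3  # 2 loads + 1 store per cycle at the L1D

def mem_ops_per_cycle(stack_fraction: float) -> float:
    """Sustained memory ops/cycle if `stack_fraction` of ops bypass the AGUs."""
    if stack_fraction >= 1.0:
        return PORT_CAP
    # Non-stack ops alone must fit in the AGUs: ops * (1 - f) <= AGUS.
    return min(PORT_CAP, AGUS / (1.0 - stack_fraction))

for f in (0.0, 0.1, 0.25, 1 / 3, 0.5):
    print(f"stack fraction {f:5.1%}: ~{mem_ops_per_cycle(f):.2f} mem ops/cycle")
```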
This setup looks similar to the K12 one, except that the latter has one more AGU to feed the queues.
Anyway, the solution is in another slide:
![]()
Both AGUs can send operations to the Load or Store Queue, and the Load Queue can do two loads.
Anyway, as I've already stated, you have only 2 AGUs, so you can only dispatch 2 memory operations per cycle to the proper queues.
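Putting both halves of the argument together: the AGUs bound the rate at which new addresses enter the queues (2/cycle), while the queues can drain up to 2 loads + 1 store per cycle into the L1D, so bursts of 3 are possible even though the sustained average cannot exceed the AGU rate. A tiny producer/consumer simulation (my illustrative assumptions: a random 60/40 load/store mix, unbounded queues) shows the decoupling:

```python
import random

random.seed(1)

AGUS, LOAD_DRAIN, STORE_DRAIN = 2, 2, 1  # issue and drain caps per cycle

load_q = store_q = 0
drained_total, cycles = 0, 200

for _ in range(cycles):
    # Producer: the 2 AGUs enqueue 2 addresses/cycle (random load/store mix).
    for _ in range(AGUS):
        if random.random() < 0.6:
            load_q += 1
        else:
            store_q += 1
    # Consumer: the L1D drains up to 2 loads + 1 store this cycle.
    loads, stores = min(load_q, LOAD_DRAIN), min(store_q, STORE_DRAIN)
    load_q -= loads
    store_q -= stores
    drained_total += loads + stores

print(f"average: {drained_total / cycles:.2f} mem ops/cycle")
# The average settles near 2 (the AGU issue rate), even though the drain
# side can momentarily do 3, consistent with both halves of the debate.
```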