AMD Ryzen (Summit Ridge) Benchmarks Thread (use new thread)

The Stilt · Sep 13, 2016

krumme said:
I guess most thought of Zen+ as one of the motivators when hearing news about the new wafer agreement with Mubadala.
Server CPUs are high margin. It doesn't really matter much if you have to pay a fine to GF each time.
As for Raven, I wouldn't bet on it. Why? Because they have experience with HBM and TSMC?
APUs are low margin and AMD needs to allocate some wafers at GF. Polaris is a good place to start. Zen at GF is... ehh... perhaps a serious challenge?
Nah, keep those freqs very low and let's get 4x moar cores for our hosting cost.

Raven is supposed to be based on the exact same CCXs as Zeppelin, so I wonder if they would go to the trouble of porting it. Since there is no need to worry about another fab lacking experience with HBM, with Raven I'm certain that AMD will pick the cheapest overall process available (i.e. 14nm LPP). APUs will definitely remain a low-end / low-margin product, even though adding Zen will definitely raise their pricing range.

krumme · Sep 13, 2016

bjt2 said:
Who is this Mark P?

I think it's Mark B.
Don't take his 50% FinFET uplift as necessarily the whole truth, as he has had too many blue pills.

bjt2 · Sep 13, 2016

Dresdenboy said:
Main reason: different voltages. Compared at the same voltage, this might be interesting.

Do you have any juicy information on Vcore? Production Vcore? Because ES Vcore may be higher than production...

20 vs 19 stages, even on the same process, implies 0.05V or less of difference in Vcore... But here we are talking about 28nm BULK vs 14nm FF...

Are you talking about Vcore margins due to the immaturity of 14nm FF? Because AVFS and the other amenities of Polaris can compensate for low-quality VRMs, high temperatures and silicon aging...

bjt2 · Sep 13, 2016

krumme said:
I think it's Mark B.
Don't take his 50% FinFET uplift as necessarily the whole truth, as he has had too many blue pills.

Mine was a rhetorical question. I do not doubt that, at the same FO4, 14nm FF gives a higher FMAX than 28nm BULK... But many here doubt this...

Glo. · Sep 13, 2016

bjt2 said:
You said: "The GPU die for D700 consumes 70W of power while having 850 MHz core clock.
GPU die for RX 470 can consume around 85-90W, while having 1206 MHz core clock."

The latter was a typo, as it was 1266, but you calculated the first 70W from 165W, so 95W for the RAM??? I think the die power was comparable, so we have the same power, but +50% clock...

The FirePro D700 has a 129W TDP and consumes that amount of power, because that is the limit set in the GPU BIOS, which does not allow more power draw for the whole GPU board. The GPU die consumes 70W, because the RAM consumes around 57-59W.
The RX 470's memory consumes around 30-34W, and the whole reference RX 470 board has a 1206 MHz core clock and 125W power consumption. So the GPU consumes around 85-90W under load.

Just to clear up the off-topic information.
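
As a rough sketch of the subtraction behind these figures (the helper name `die_power_range` is made up for illustration; the wattages are the ones quoted above):

```python
def die_power_range(board_w, mem_low_w, mem_high_w):
    """Return (min, max) die power left after subtracting memory power from board power."""
    return (board_w - mem_high_w, board_w - mem_low_w)

# FirePro D700: 129W board limit, memory estimated at 57-59W
d700 = die_power_range(129, 57, 59)

# Reference RX 470: 125W board power, memory estimated at 30-34W
rx470 = die_power_range(125, 30, 34)

print("D700 die power (W):", d700)
print("RX 470 die power (W):", rx470)
```

Plain subtraction leaves 91-95W for the RX 470, a little above the 85-90W estimate, so that estimate presumably also allows for other board losses such as the VRM. That last part is an assumption, not a measurement.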

bjt2 · Sep 13, 2016

Glo. said:
The FirePro D700 has a 129W TDP and consumes that amount of power, because that is the limit set in the GPU BIOS, which does not allow more power draw for the whole GPU board. The GPU die consumes 70W, because the RAM consumes around 57-59W.
The RX 470's memory consumes around 30-34W, and the whole reference RX 470 board has a 1206 MHz core clock and 125W power consumption. So the GPU consumes around 85-90W under load.

Just to clear up the off-topic information.

I quoted a consumer GPU card. I don't even know if this FirePro board has 2048 SPs... A selected, low-leakage Pro board does not count, because the 480 is a consumer, high-yield board...

bjt2 · Sep 13, 2016

Dresdenboy said:
Main reason: different voltages. Compared at the same voltage, this might be interesting.

Maybe I missed something, but I thought that 14nm FF has a lower Vcore at the same frequency and FO4... So Zen should have a lower Vcore at the same clock vs XV...

Abwx · Sep 13, 2016

bjt2 said:
Mine was a rhetorical question. I do not doubt that, at the same FO4, 14nm FF gives a higher FMAX than 28nm BULK... But many here doubt this...

FO4 delay is a characteristic of a process not of an uarch.

This is the delay of a signal that has passed through three gates (inverters, for instance), with each successive gate driving four gates; that is, each gate output is loaded by a total capacitance that is four times its own input capacitance.

When measuring the delay, a fourth gate is added to drive the first of the series of 3 gates; the output of this added gate is of course loaded by only one gate input.

Glo. · Sep 13, 2016

bjt2 said:
I quoted a consumer GPU card. I don't even know if this FirePro board has 2048 SPs... A selected, low-leakage Pro board does not count, because the 480 is a consumer, high-yield board...

The FirePro D700 appears in Windows as an HD7970. Apple just changed the badge of the GPU to FirePro D700 under OS X. Windows revealed the truth about this GPU. It is a perfectly standard HD7970 with 6 GB of VRAM. It doesn't even have ECC memory.

An RX 470 at 1000 MHz will be able to fit into a much lower thermal envelope than 125W.

A similar thing will happen with Zen. That 95W CPU is heavily downclocked.

bjt2 · Sep 13, 2016

Abwx said:
FO4 delay is a characteristic of a process not of an uarch.

This is the delay of a signal that has passed through three gates (inverters, for instance), with each successive gate driving four gates; that is, each gate output is loaded by a total capacitance that is four times its own input capacitance.

When measuring the delay, a fourth gate is added to drive the first of the series of 3 gates; the output of this added gate is of course loaded by only one gate input.

The FO4 delay in ns is a characteristic of each process. The FO4 delay as a relative number is a characteristic of a given architecture and is (more or less) independent of the process. Knowing both lets you calculate FMAX...

When we say that the FO4 delay of an architecture is 17, that 17 is not in ns. It is the delay relative to that of an inverter with a fan-out of 4...
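
A minimal sketch of that calculation, assuming the cycle time is simply the stage depth in FO4 multiplied by the process's FO4 delay. The function name and the two FO4 delays (15 ps and 10 ps) are hypothetical placeholders, not measured figures for any real process:

```python
def fmax_ghz(fo4_per_stage, fo4_delay_ps):
    """Estimate FMAX in GHz from the stage depth (in FO4) and the process FO4 delay (in ps)."""
    cycle_ps = fo4_per_stage * fo4_delay_ps  # shortest usable clock period
    return 1000.0 / cycle_ps                 # a 1000 ps period corresponds to 1 GHz

STAGE_DEPTH_FO4 = 17  # relative number, property of the architecture

# Hypothetical FO4 delays for a slower and a faster process
for label, fo4_ps in (("slower process", 15.0), ("faster process", 10.0)):
    print(f"{label}: FO4 = {fo4_ps} ps -> FMAX ~ {fmax_ghz(STAGE_DEPTH_FO4, fo4_ps):.2f} GHz")
```

With the stage depth held at 17 FO4, the only thing that moves the estimate is the FO4 delay of the process.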

bjt2 · Sep 13, 2016

Glo. said:
The FirePro D700 appears in Windows as an HD7970. Apple just changed the badge of the GPU to FirePro D700 under OS X. Windows revealed the truth about this GPU. It is a perfectly standard HD7970 with 6 GB of VRAM. It doesn't even have ECC memory.

An RX 470 at 1000 MHz will be able to fit into a much lower thermal envelope than 125W.

A similar thing will happen with Zen. That 95W CPU is heavily downclocked.

I know. This is what I am saying... That Zen can clock at least as high as XV. And AMD is even saying that: AMD says that Zen has +40% IPC with the SAME ENERGY per clock. This can only mean the same power consumption at the same clock, core to core, vs XV... What is the power consumption of an XV core? With the latest Bristol Ridge, a 4-core APU has a TDP of 65W for 3.8GHz base and 4.2GHz turbo. This means at most 16W per core at 3.8GHz... If we subtract the GPU, which is more or less 50% of the TDP (in reality it's usually more on AMD APUs), we have 8W per core... So a 3.8GHz 8-core Zen should draw anywhere from 65 to 130W, depending on the GPU consumption... Perfectly in line with my forecast of 4GHz@95W... Maybe not in the first batches...
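
The estimate above can be reproduced in a few lines. This is only a sketch of the arithmetic under the post's own assumptions (TDP split evenly across cores, GPU taking between 0% and 50% of the TDP, and a Zen core drawing the same as an XV core at the same clock); the function name is made up:

```python
def per_core_power_w(tdp_w, cores, gpu_share):
    """Watts available per CPU core after reserving gpu_share of the TDP for the GPU."""
    return tdp_w * (1.0 - gpu_share) / cores

BRISTOL_RIDGE_TDP_W = 65.0  # 4-core APU, 3.8GHz base
XV_CORES = 4
ZEN_CORES = 8

# Upper bound: the GPU draws nothing. Lower bound: the GPU takes half of the TDP.
xv_core_high = per_core_power_w(BRISTOL_RIDGE_TDP_W, XV_CORES, 0.0)
xv_core_low = per_core_power_w(BRISTOL_RIDGE_TDP_W, XV_CORES, 0.5)

# Same power per core at the same clock, scaled to 8 cores
zen_low = xv_core_low * ZEN_CORES
zen_high = xv_core_high * ZEN_CORES

print(f"XV core: {xv_core_low:.2f} to {xv_core_high:.2f} W")
print(f"8-core Zen at 3.8GHz: {zen_low:.0f} to {zen_high:.0f} W")
```

The 95W figure sits roughly in the middle of that range, which corresponds to a GPU share of about 27% of the APU's TDP.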

krumme · Sep 13, 2016

bjt2 said:
.. Maybe not in the first batches...

Define batches...

I am not in the business, but my guess is you can't find a longer road from easy desk calculations to high-yield mass production than with this product.
A long and crazy expensive road. And I am actually surprised it's that controlled and predictable.
It's IMO extremely impressive what Intel has done here on the technical level, but also what AMD can muster as a company so relatively small compared to the huge projects they take on.

The Stilt · Sep 13, 2016

Glo. said:
It doesn't even have ECC memory..

All GDDR5 GCN ASICs have ECC

I just had the error monitoring implemented in HWInfo a few months back.

Abwx · Sep 13, 2016

bjt2 said:
The FO4 delay in ns is a characteristic of each process. The FO4 delay as a relative number is a characteristic of a given architecture and is (more or less) independent of the process. Knowing both lets you calculate FMAX...

Not at all, the delay is a function of the gate max output current and input capacitance, the higher the current (and the lower the loading capacitance) the lower the delay.

bjt2 said:
When we say that the FO4 delay of an architecture is 17, that 17 is not in ns. It is the delay relative to that of an inverter with a fan-out of 4...

It means 17 FO4 delays...
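
To first order, the dependency on drive current and load capacitance is often approximated as t = C * V / I. A small sketch with purely hypothetical values (1 fF input capacitance, 0.9 V supply, 50 uA drive current), ignoring the gate's own parasitic capacitance:

```python
def gate_delay_ps(c_load_f, vdd_v, i_drive_a):
    """First-order switching delay t = C * V / I, returned in picoseconds."""
    return c_load_f * vdd_v / i_drive_a * 1e12

# Hypothetical values, chosen only to make the scaling visible
C_IN_F = 1e-15     # input capacitance of one gate, in farads
VDD_V = 0.9        # supply voltage, in volts
I_DRIVE_A = 50e-6  # maximum output current of the driving gate, in amperes

fo1 = gate_delay_ps(C_IN_F, VDD_V, I_DRIVE_A)             # driving one gate input
fo4 = gate_delay_ps(4 * C_IN_F, VDD_V, I_DRIVE_A)         # driving four gate inputs
faster = gate_delay_ps(4 * C_IN_F, VDD_V, 2 * I_DRIVE_A)  # same load, twice the current

print(f"fan-out 1: {fo1:.1f} ps")
print(f"fan-out 4: {fo4:.1f} ps")
print(f"fan-out 4, double drive current: {faster:.1f} ps")
```

In this simplified model, quadrupling the load quadruples the delay and doubling the drive current halves it.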

bjt2 · Sep 13, 2016

krumme said:
Define batches...

I am not in the business, but my guess is you can't find a longer road from easy desk calculations to high-yield mass production than with this product.
A long and crazy expensive road. And I am actually surprised it's that controlled and predictable.
It's IMO extremely impressive what Intel has done here on the technical level, but also what AMD can muster as a company so relatively small compared to the huge projects they take on.

Let's say the first production weeks, 10-15, that will last for at least 3-4 months...
We know that Zen has AVFS, boot-time calibration and other features to pick the best Vcore for the given processor and the correct frequency during a voltage droop, so this is another advantage vs Intel, giving a higher frequency at a given TDP...

bjt2 · Sep 13, 2016

Abwx said:
Not at all, the delay is a function of the gate max output current and input capacitance, the higher the current (and the lower the loading capacitance) the lower the delay.

It means 17 FO4 delays...

Yes. The FO4 delay in absolute terms depends directly on the process and not on the architecture, and is measured in ns (nanoseconds).

When we say that an architecture has a delay of 17 FO4, we mean that the worst pipeline stage has 17 times the delay of an inverter under the known conditions... This is a relative number, characterizing a given architecture.

This is what I said... The same as you.
I don't get what the problem is...

The Stilt · Sep 13, 2016

Design-wise, I would reckon the L2 caches on Zen are the most limiting factor for Fmax. If AMD modified Piledriver to have L2 characteristics similar to Zen's, the resulting part would have an Fmax of ~2.8GHz, instead of the usual ~4.7GHz.

Hopefully AMD has managed to improve their L2 caches in Zen, because they have been the first limiting factor since K7. Even the GCN GPUs suffer from this.

Abwx · Sep 13, 2016

bjt2 said:
the worst pipeline stage has 17 times the delay of an inverter under the known conditions...

It is not the delay of one inverter but the delay of a chain of three inverters, with each inverter output loaded by 4 inverter inputs.

To explain it simply: an inverter drives four inverters; we take one of those 4 inverters and load its output with 4 inverters, and so on, so there are 3 stages. In the pic below there are two stages, so we add 4 gates driven by, say, the upper gate; the delay at the output of the third consecutive gate is the FO4 delay.

Phynaz · Sep 13, 2016

bjt2 said:
Let's say the first production weeks, 10-15, that will last for at least 3-4 months...
We know that Zen has AVFS, boot-time calibration and other features to pick the best Vcore for the given processor and the correct frequency during a voltage droop, so this is another advantage vs Intel, giving a higher frequency at a given TDP...

Your posting style is suspiciously familiar. What's your main account?

bjt2 · Sep 13, 2016

The Stilt said:
Design-wise, I would reckon the L2 caches on Zen are the most limiting factor for Fmax. If AMD modified Piledriver to have L2 characteristics similar to Zen's, the resulting part would have an Fmax of ~2.8GHz, instead of the usual ~4.7GHz. Hopefully AMD has managed to improve their L2 caches in Zen, because they have been the first limiting factor since K7. Even the GCN GPUs suffer from this.

A dedicated 512KB L2 cache, slower than a shared 2MB L2 cache? Sounds strange... Anyway, increasing the latency should be enough to reach parity...

Abwx said:
It is not the delay of one inverter but the delay of a chain of three inverters, with each inverter output loaded by 4 inverter inputs.

To explain it simply: an inverter drives four inverters; we take one of those 4 inverters and load its output with 4 inverters, and so on, so there are 3 stages. In the pic below there are two stages, so we add 4 gates driven by, say, the upper gate; the delay at the output of the third consecutive gate is the FO4 delay.

OK, I got the definition of FO4 wrong... Excuse me... It's 1 AM here (Italy), and after the Champions League and 2 liters of beer I can miss something, I hope... Anyway, that does not affect the point...

bjt2 · Sep 13, 2016

Phynaz said:
Your posting style is suspiciously familiar. What's your main account?

I am bjt2 from hwupgrade.it... I never had an account here... If you know Italian, you can see on the Zen thread my announcement that I was signing up here... This is the post: http://www.hwupgrade.it/forum/showpost.php?p=44017916&postcount=6440

EDIT: and the same bjt2 who wrote, some years ago, articles on Llano and Bulldozer on xtremehardware.it... And an AMD PR person asked us if they could use one of my phrases somewhere...

Phynaz · Sep 13, 2016

bjt2 said:
I am bjt2 from hwupgrade.it... I never had an account here... If you know Italian, you can see on the Zen thread my announcement that I was signing up here... This is the post: http://www.hwupgrade.it/forum/showpost.php?p=44017916&postcount=6440

Thank you.

krumme · Sep 14, 2016

The Stilt said:
Design-wise, I would reckon the L2 caches on Zen are the most limiting factor for Fmax. If AMD modified Piledriver to have L2 characteristics similar to Zen's, the resulting part would have an Fmax of ~2.8GHz, instead of the usual ~4.7GHz. Hopefully AMD has managed to improve their L2 caches in Zen, because they have been the first limiting factor since K7. Even the GCN GPUs suffer from this.

Is the performance of the L2 mostly a result of design?

It might be that AMD needs some IP or competence to design a high-performance L2 with both low latency and high frequency scalability. But as you note, it's a problem that has been there for years.
I am no engineer, but in my world having full control of the process and knowing the -strict- process parameters beforehand should be crucial for designing the high-performance L2 in particular.
I think it's fair to assume:
Intel has far better integration of design and production,
They have less process variability,
They can forecast parameters for future processes more precisely and use them when designing the arch
- all factors that, btw, can delay an EUV introduction. They simply don't have the same need because they can better go to the edge, so to speak.

In an L2 design all of the above factors are vital, as it's a constant balancing act, where the sum of all those tight parameters makes a large difference in total performance. It hits L2 performance relatively harder than the rest of the arch design compromises and dilemmas. If you don't have that control you have to give small slices of slack everywhere.

So where we see the benefits of a tight integration of process and production is especially in L2 performance.

cdimauro · Sep 14, 2016

lolfail9001 said:
Well, I could start by noting that Zen is x86, so it has the added cost of instruction decoding compared to any ARM core.

I mostly agree, but x86 microarchitectures aren't all the same.

For example, Intel uses the famous 4-1-1-1 decoder: one complex decoder which can decode any instruction, and 3 simpler ones for decoding simpler (but more common) instructions.

On the other hand, AMD uses all complex decoders in its x86 microarchitectures, and Zen has 4 of them.

So, depending on the specific microarchitecture, an x86 decoder can (or not) have a very small impact on power consumption. Particularly on high-end cores, where we talk about billions of transistors used, whereas decoders use only some millions of them.

Anyway, I agree that ARM and x86 cores are too different, so it's better to avoid comparisons.

cdimauro · Sep 14, 2016

bjt2 said:
Zen has the uop cache, that reduces the consumption by the cache hit rate (I think between 50% and 80%) and AFAIK the A9 no, but I can be wrong...

It might be, because ARM has a simpler architecture/ISA, and may not require a uop cache (albeit usually requires some microcode for very complex/legacy instructions).

bjt2 said:
Anyway, I remember the A9 having 6 decoders and the ARM ISA, in the 64 bit incarnation, is not that simple...

It's the exact opposite: ARMv8/ARM64/AArch64 has a much simpler opcode table & instruction formats, compared to the previous ISA (ARMv7/ARM32/AArch32).

Contrary to what AMD did with x64, ARM took the chance to completely rewrite its ISA when extending it to 64 bits, removing all legacy instructions and replacing some of them with much simpler versions (e.g. loading of register pairs).

ARM64 resembles an Alpha architecture more than ARM32: a simpler ISA, devoted to very fast instruction decoding & execution.
