New Zen microarchitecture details


The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
IMO it is obvious that Samsung has manufactured some parts for AMD, since AMD had prototype silicon available before GlobalFoundries had even ramped up (certainly for Polaris, possibly for Zeppelin too). Because of the WSA I would expect that Samsung won't be making any parts beyond the earliest samples. And why would they? It's not like GlobalFoundries is short of capacity (for obvious reasons), and the process itself should be identical regardless of whether it is made at GlobalFoundries or Samsung plants.
 
Aug 11, 2008
10,451
642
126
Playing Devil's Advocate:

7770 performance is in reach, but only just. The Zen APU will allegedly have up to ~704 GCN4 SPs (or, possibly, the presumably updated Vega SPs). Memory compression will provide a ~20 to 25% increase in effective bandwidth (so about 48GB/s), and the many architectural updates made since the 7770 reduce demand on memory bandwidth by a further ~10% net. That raw 48GB/s is almost exactly 2/3 of the 7770's 72GB/s, and with the lower demand it would be like giving the 7770 about 52GB/s of bandwidth.

Bandwidth isn't everything, but it is very important. The RX 480 only loses 1~4% from losing 12.5% of its bandwidth, so the 7770 should be expected to follow a similar curve - losing 5~15% of its more idealized performance - but architectural improvements should make up for that, and the Zen APU should end up fairly close to the 7770 in performance.
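If it helps to see that arithmetic laid out, here is a quick back-of-the-envelope sketch. The dual-channel DDR4-2400 baseline, the exact compression factor, and the rounding are my assumptions, not AMD figures:

```python
# Back-of-the-envelope check of the numbers above (assumed inputs, not AMD data).
ddr4_bw   = 2 * 2400e6 * 8 / 1e9        # dual-channel DDR4-2400: ~38.4 GB/s raw
effective = ddr4_bw * 1.25              # ~20-25% gain from color compression -> ~48 GB/s
as_if     = effective / 0.90            # ~10% lower bandwidth demand vs. GCN1 -> ~52-53 GB/s
hd7770    = 72.0                        # HD 7770: 128-bit GDDR5 @ 4.5 Gbps = 72 GB/s
print(f"effective {effective:.0f} GB/s = {effective / hd7770:.0%} of the 7770's bandwidth,"
      f" acting like ~{as_if:.0f} GB/s after the demand reduction")
```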

BUT

Doing things another way, though, leads us to believe AMD wouldn't bother with 704 or more SPs on a Zen APU:

RX 480 has 36 CUs, and 256GB/s of hardware memory bandwidth.
ZN APU has 11 CUs and ~40GB/s of hardware memory bandwidth.

RX 480 has 7.1GB/s per CU.
ZN APU has 3.6GB/s per CU

I have a problem with this 11 CU business. It doesn't make any sense given the physical layout of GCN - the CUs are organized in pairs, so the count should always be even unless one CU is disabled... which would be odd for the top APU SKU.

Adding one CU would take it to 768 SPs - Xbox One and RX 460 territory - but the memory bandwidth would make this a waste, and the larger die wouldn't be a good thing either. Dropping one CU instead would take us to 640 SPs (10 CUs). This seems better. Fewer CUs means less power, a smaller die, and higher achievable clocks - and more bandwidth per CU.

Bandwidth wise, things look a little better:

RX 480 has 7.1GB/s per CU.
ZN APU has 4.0GB/s per CU

That reduction in bandwidth would easily cost 20% of the performance potential... so why not just take off 20% of the CUs?

So, let's take off two more CUs, and stick with the 512 SPs and 8 CUs AMD has been using on their APUs.

RX 480 has 7.1GB/s per CU.
ZN APU has 5.0GB/s per CU

Now we're talking! Still a memory bandwidth issue, but a much smaller one.
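For anyone who wants to poke at the per-CU numbers themselves, a trivial sketch (the ~40GB/s APU bandwidth is the same assumption as above; 64 SPs per CU is standard GCN):

```python
# Bandwidth available per CU for the configurations discussed above.
rx480_bw, rx480_cus = 256.0, 36      # GB/s, CUs
apu_bw = 40.0                        # assumed dual-channel DDR4 bandwidth for the Zen APU

print(f"RX 480:  {rx480_bw / rx480_cus:.1f} GB/s per CU")
for cus in (11, 10, 8):
    print(f"Zen APU, {cus} CUs ({cus * 64} SPs): {apu_bw / cus:.1f} GB/s per CU")
```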

Run that little GPU at 1GHz to 1.2GHz and you have a sizeable improvement in performance over current APUs - and you don't threaten the RX 460's reason for existing. But you also can't reach HD 7770 or modern console levels of performance...

So, read my lips: more than 512 SPs only makes sense if the Zen APU has an L4 cache. Even a reasonably fast 32MB L4 would permit HD 7770 levels of performance... and if HBM is used, then we will likely see effectively dedicated VRAM on the top APUs - and console-like graphics performance.

But the 480 must have excess bandwidth, no? So the APU should perform better than expected based on bandwidth alone. They really need HBM though, or a lot more apps that can use HSA.
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
But the 480 must have excess bandwidth, no? So the APU should perform better than expected based on bandwidth alone. They really need HBM though, or a lot more apps that can use HSA.

The APU will perform worse than expected based on bandwidth alone. It has to share bandwidth with the CPU.
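As a rough illustration of how much that sharing can cost, here's a toy example; the CPU's share of the traffic is purely an assumed figure:

```python
# On a shared-memory APU, whatever the CPU streams from DRAM comes straight out
# of the GPU's budget. The 8 GB/s CPU figure below is an assumption for illustration.
total_bw, cpu_traffic = 40.0, 8.0                 # GB/s
gpu_bw = total_bw - cpu_traffic
print(f"GPU effectively sees ~{gpu_bw:.0f} GB/s ({gpu_bw / total_bw:.0%} of nominal)")
```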
 

Doom2pro

Senior member
Apr 2, 2016
587
619
106
The APU will perform worse than expected based on bandwidth alone. It has to share bandwidth with the CPU.

Isn't that shared bandwidth between the socket and the RAM? With HSA, wouldn't an HBM APU be able to have the CPU directly access the on-package HBM, avoiding system memory access in the first place? Or is that hindered by the need to fill that HBM with data from system RAM anyway?
 

KTE

Senior member
May 26, 2016
478
130
76
Intel went from Netburst's 2-cycle L1D to 3 cycles for Conroe, and from Conroe's 3 cycles to 4 cycles for Nehalem...

Load-to-use latency is only a part of the whole thing; it is not compute.

Relaxing cache latencies from one generation to the next is well known when going for higher frequencies.

You're looking at just the L1, which doesn't paint the full picture... Let's look at some of the competition.

SNB 4/11/25/148 cycles (L1/L2/L3/memory).
BD 4/21/65/195 cycles.
K10 3/14/55/157 cycles.

According to AT.

Sent from HTC 10
(Opinions are own)
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
Isn't that shared bandwidth between the socket and the RAM? With HSA, wouldn't an HBM APU be able to have the CPU directly access the on-package HBM, avoiding system memory access in the first place? Or is that hindered by the need to fill that HBM with data from system RAM anyway?

Sorry, my comment was not in regards to HBM.
The first versions of Zen will not have HBM unless AMD is planning on selling these chips for minimum $300+.

My guess is that HBM is off the table for Zen consumer APUs. Maybe for Zen+.
 

Doom2pro

Senior member
Apr 2, 2016
587
619
106
Sorry, my comment was not in regards to HBM.
The first versions of Zen will not have HBM unless AMD is planning on selling these chips for minimum $300+.

My guess is that HBM is off the table for Zen consumer APUs. Maybe for Zen+.

The first versions of Zen won't even be APUs... They will be FX High End Desktop CPUs...

Next is Server CPUs...

Then APUs... And it still isn't known if any of those SKUs will have HBM...
 

Enigmoid

Platinum Member
Sep 27, 2012
2,907
31
91
The first versions of Zen won't even be APUs... They will be FX High End Desktop CPUs...

Next is Server CPUs...

Then APUs... And it still isn't known if any of those SKUs will have HBM...

Whoops sorry again.

Clarifying. I strongly believe that the first versions of consumer Zen (the APU versions for the consumer market, not the server stuff) will not have HBM. HBM Zen for the consumer, on APUs, in my opinion will not arrive until Zen+.

Like GDDR5 on the Bulldozer products, it simply won't make economic sense, even if it would be a good product.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
Every single performance per watt chart on the internet shows nothing close to 2.8x.

You HAVE to scale the GPU for the process-derived improvement to become visible. Remember that the RX 480 is clocked over 20% higher. When you clock Hawaii to that level the power usage climbs dramatically.

You have to compare apples to apples.

Nonsense. You can't add and subtract shaders and the like to 'fit' performance and power characteristics. Or boost or lower clock speeds.

Especially the nonsense comparison to a cut GM204 when the 1060 is GP106 and a non-cut die.

Except you absolutely can - and must. The numbers won't be 100% accurate, but the scientific method requires a leveling of all but one variable prior to being able to make a statement about that one variable.

That means the clock speeds need to be equalized, as well as the amount of hardware in play (SPs, CUs, RAM, VRMs, etc.).

We are trying to determine the ability of 14nm LPP to save power. You can only do that if all other factors have been accounted for - RX 480 doesn't exist on 28nm, so we have to take the nearest GPU like it and scale it to fit the RX 480 specifications - then, and only then, can we determine what effect 14nm LPP had on the GPU.


Now consider that the GPU is using a smaller relative portion of that power.

On R9 290, the RAM is pulling about 45~50W of power.
On RX 480, the RAM is pulling about 25~30W of power.

For both, there is between 15W and 20W lost to the VRM and other components.

So the 290 GPU is using 350~370W at 1.15GHz on my system.
And RX 480 GPU is using 125~130W of power.

That gets you right around 2.8x of improvement on the GPU for efficiency at the same frequency. The improvement will be notably diminished when considering the power improvements that were made after Hawaii.

Also, I didn't fiddle with the numbers to make the 2.8x figure, it just came out that way :thumbsup:

-----EDIT: Duh, I need to use the cut-down version of the R9 290 to match RX 480 specs...

So, basing the figures on a 390W starting point:

Cut R9 290 GPU: 310~330W
And RX 480 GPU: 125~130W

Improvement: 2.48~2.64x for the GPU alone.
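Laying that same subtraction out explicitly (the RAM and VRM wattages are the estimates above, not board-level measurements; the exact endpoints shift a little depending on which ends of the ranges you pair up):

```python
# GPU-only power = board power - RAM power - VRM/other losses, per the estimates above.
def gpu_only(board_w, ram_w, vrm_w):
    return board_w - ram_w - vrm_w

cut_290 = (gpu_only(390, 50, 20), gpu_only(390, 45, 15))   # ~(320, 330) W at ~1.15 GHz
rx480   = (125, 130)                                       # W, GPU-only estimate above

print(f"GPU-only efficiency gain: {cut_290[0] / rx480[1]:.2f}x to {cut_290[1] / rx480[0]:.2f}x")
```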

----RESUME:

Fury et al. can't be compared without considering that the GPU is most of the power being consumed - the memory only uses a few watts. And Fury Nano is a low-clocking chip, running well within the peak efficiency range of the process and architecture.

Take Fury and overclock it to 1.2GHz, and it blows right past 400W during stress tests. An efficiency comparison is highly compromised by the loss of the GDDR5 memory controllers, so we can't even begin to estimate the efficiency change between Fury and RX 480 without adding back the power draw of 8 GDDR5 controllers... things which we know are quite power hungry.

420W for a 290X is also astonishingly high.

Yes it is, but that's what I see with my equipment once I start to overclock. Without overclocking, the card is much more efficient, but I have to push it to that level to equalize the clocks. The other alternative is to drop both to 1GHz, which I'd prefer to do, but I don't have an RX 480 for testing.

I'm also using stress-testing power usage figures across the board. I can be off by 20% and my point still remains valid - 14nm LPP is providing significant efficiency improvements - easily more than 2x.

Then there is the argument that Polaris is NOT a Hawaii replacement. More of a Pitcairn replacement (Hawaii has a ton of DP).

And entirely irrelevant, beyond which GCN versions are being compared - not that there have been massive efficiency improvements beyond Fury's use of HBM.
 
Last edited:

looncraz

Senior member
Sep 12, 2011
722
1,651
136
But the 480 must have excess bandwidth, no? So the APU should perform better than expected based on bandwidth alone. They really need HBM though, or a lot more apps that can use HSA.

In a manner of speaking, yes. The RX 480 8GB version sits happily right at the knee of the diminishing-returns curve for bandwidth.

Remove 12.5% bandwidth - only lose 1~4% performance. Add 12.5% bandwidth, only gain 1~4% performance.

At what point that curve becomes steeper is not yet apparent, but it's safe to say that losing another 12.5% of the bandwidth will have a larger impact than the first 12.5%. And the 12.5% after that will be worse still.

At that point, it makes more sense to cut down the GPU - either in frequency or in CU count - because you're just spinning your wheels waiting for data... which AMD APUs are known to do.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
looncraz, I like your method, although it's the overall graphics card power that counts, I suppose.

The RX 470 is likely to be relatively less power efficient than the RX 480, since it uses harvested ASICs. Unless of course they clock it below the threshold of the process / the design (i.e. <= 1GHz range) :sneaky:

The 2.8x total figure AMD displayed for RX 480 is so far off that I wonder if they actually mixed up the numbers. For a P11 the figures might be plausible, if it is clocked low enough. AFAIK P11 is just basically a "cropped" P10 (or vice versa).
It seems, the TDP of 110W has been leaked already:
http://videocardz.com/amd/radeon-rx-400/radeon-rx-470
But there is also this nice slide, supporting your 460/Polaris 11 theory:
[Image: AMD Polaris 10 and Polaris 11 specification slide]

http://www.pcper.com/news/Graphics-Cards/AMD-Reveals-Radeon-RX-460-and-RX-470-Specifications

Relaxing cache latencies from one generation to the next is well known when going for higher frequencies.

You're looking at just the L1, which doesn't paint the full picture... Let's look at some of the competition.

SNB 4/11/25/148 cycles (L1/L2/L3/memory).
BD 4/21/65/195 cycles.
K10 3/14/55/157 cycles.
A more complex cache (more ways, other interesting stuff) might also have higher latencies. And then there is also power efficiency (e.g. implementing the cache logic and arrays as fast, power-hungry logic or as more power-efficient logic).

Clarifying. I strongly believe that the first versions of consumer Zen (the APU versions for the consumer market, not the server stuff) will not have HBM. HBM Zen for the consumer, on APUs, in my opinion will not arrive until Zen+.

Like GDDR5 on the Bulldozer products, it simply won't make economic sense, even if it would be a good product.
There will be a server/datacenter APU with HBM first. The consumer side surely also depends on costs and production volume.

And then there is this:
[Image: eSilicon HBM2 slide]


The server APU should look different, as it might use ZP dies and GL dies either partly (plus MCM) or fully on an interposer. But it could also be something else altogether (console, embedded?).
 
Aug 11, 2008
10,451
642
126
You HAVE to scale the GPU for the process-derived improvement to become visible. Remember that the RX 480 is clocked over 20% higher. When you clock Hawaii to that level the power usage climbs dramatically.

You have to compare apples to apples.

[...]

I'm also using stress-testing power usage figures across the board. I can be off by 20% and my point still remains valid - 14nm LPP is providing significant efficiency improvements - easily more than 2x.

Now you are shifting the goalposts, or maybe AMD itself was being duplicitous in making the calculation. But as the majority of people understand the claim, a 2.8x performance per watt increase means 2.8x faster FPS for the same power usage by the entire card. Yes, the process, clockspeed, and RAM power usage are all a part of that, but only insofar as they affect the final "performance", which is FPS. (Your PSU certainly does not know what part of the card is using the power it must provide.)


In fact, by introducing other variables such as clockspeed, process, and RAM power usage, you are doing exactly what you said is not part of the "scientific method": examining several variables at once. And as another poster said, I have not seen any review which shows anywhere close to 2.8 times performance per watt for the 480, at least as it is generally understood. In fact, I think most reviews are showing a 1.5 to 2.0x increase at best. And AMD actually admitted they did not meet that target for the 480, by publicly stating that the claim applies to the 470 (coincidentally *after* the 480 came out and did not meet the goal). Now if the 470 meets that target, or if they were using some other way to calculate "performance per watt", they technically may have met the criteria, but in what I consider a deceptive manner.
 

Headfoot

Diamond Member
Feb 28, 2008
4,444
641
126
Now you are shifting the goalposts, or maybe AMD itself was being duplicitous in making the calculation. But as the majority of people understand the claim, a 2.8x performance per watt increase means 2.8x faster FPS for the same power usage by the entire card. Yes, the process, clockspeed, and RAM power usage are all a part of that, but only insofar as they affect the final "performance", which is FPS. (Your PSU certainly does not know what part of the card is using the power it must provide.)


In fact, by introducing other variables such as clockspeed, process, and RAM power usage, you are doing exactly what you said is not part of the "scientific method": examining several variables at once. And as another poster said, I have not seen any review which shows anywhere close to 2.8 times performance per watt for the 480, at least as it is generally understood. In fact, I think most reviews are showing a 1.5 to 2.0x increase at best. And AMD actually admitted they did not meet that target for the 480, by publicly stating that the claim applies to the 470 (coincidentally *after* the 480 came out and did not meet the goal). Now if the 470 meets that target, or if they were using some other way to calculate "performance per watt", they technically may have met the criteria, but in what I consider a deceptive manner.

Chill on the 2.8x per watt improvement until all SKUs are out. Polaris 11, or a cut-down, downclocked P10, will hit the number. The 480 was clearly pushed to hit a higher performance target on the cheap at the expense of using up the power budget, and it still delivered massive perf/watt gains.
 

coercitiv

Diamond Member
Jan 24, 2014
6,202
11,907
136
looncraz, I like your method, albeit the overall graphics card power counts, I suppose.
You like the idea of "equalizing" clocks past the 28nm max stock clocks? Can't wait to see how much PPW Pascal has over a 2GHz Maxwell.
 
Aug 11, 2008
10,451
642
126
Chill on the 2.8x per watt improvement until all SKUs are out. Polaris 11, or a cut-down, downclocked P10, will hit the number. The 480 was clearly pushed to hit a higher performance target on the cheap at the expense of using up the power budget, and it still delivered massive perf/watt gains.

I am not the one that made the claim. If they did not want to get criticized, AMD should specifically have said what SKU it applied to. I think it is natural to expect the first and "flagship" chip out on a new process node to meet the power expectations, not some cut down, downclocked successor that is not even available until a later date.
 

ElFenix

Elite Member
Super Moderator
Mar 20, 2000
102,414
8,356
126
from the above it appears the 2.8x was for polaris 11. the 2.8x we saw months ago was from an incomplete slide deck, and whoever leaked it did not do AMD any favors.


anyway, if you guys want to figure out what sort of power improvements come just from the process, you need to do a very careful investigation of RX480 vs. a 380X, matching clocks and ram. roughly same number of xtors, separated by just 1 generational improvement.
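a minimal way to express the comparison suggested above, with matched clocks and memory; the FPS and wattage inputs here are placeholders, not measurements:

```python
# Perf/watt ratio between a 14nm part (RX 480) and a 28nm part (R9 380X) on the
# same workload, with clocks and RAM matched. Inputs are illustrative placeholders.
def perf_per_watt(fps, watts):
    return fps / watts

def process_gain(fps_14nm, w_14nm, fps_28nm, w_28nm):
    return perf_per_watt(fps_14nm, w_14nm) / perf_per_watt(fps_28nm, w_28nm)

print(f"~{process_gain(60, 120, 45, 190):.1f}x")   # made-up numbers -> ~2.1x
```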
 
Last edited:

KTE

Senior member
May 26, 2016
478
130
76
Now you are shifting the goalposts, or maybe AMD itself was being duplicitous in making the calculation. But as the majority of people understand the claim, a 2.8x performance per watt increase means 2.8x faster FPS for the same power usage by the entire card. Yes, the process, clockspeed, and RAM power usage are all a part of that, but only insofar as they affect the final "performance", which is FPS. (Your PSU certainly does not know what part of the card is using the power it must provide.)


In fact, by introducing other variables such as clockspeed, process, and RAM power usage, you are doing exactly what you said is not part of the "scientific method": examining several variables at once. And as another poster said, I have not seen any review which shows anywhere close to 2.8 times performance per watt for the 480, at least as it is generally understood. In fact, I think most reviews are showing a 1.5 to 2.0x increase at best. And AMD actually admitted they did not meet that target for the 480, by publicly stating that the claim applies to the 470 (coincidentally *after* the 480 came out and did not meet the goal). Now if the 470 meets that target, or if they were using some other way to calculate "performance per watt", they technically may have met the criteria, but in what I consider a deceptive manner.
Agreed.

They definitely weren't upfront about this, nor about many of their previous marketing campaigns. People who are skeptical have 10 years of AMD marketing claims to thank.

The way Zen is sounding and playing out so far is not looking good (DT, compared to that 40% claim). I'll give them HUGE props if they can even deliver 30% average IPC increase.

25/07/2016 I'm saying this.

Sent from HTC 10
(Opinions are own)
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
The server APU should look different, as it might use ZP dies and GL dies either partly (plus MCM) or fully on an interposer. But it could also be something else altogether (console, embedded?).


I have been wondering if AMD might not make a common console/APU die at some point.

Four Zen cores, 16/18CUs, 1 HBM2 module, and external support for DDR4.

This setup would be a performance, efficiency, and implementation improvement - and quite likely a net cost improvement - especially if AMD got both Sony and Microsoft to use some variant of the same setup... the trifecta would be getting Nintendo on board - one die to rule them all.

This is a capability uniquely available to AMD.

They could then take a certain portion of these and sell them into the retail channel (with or without HBM, with many of the CUs disabled if no HBM).
 

jpiniero

Lifer
Oct 1, 2010
14,599
5,218
136
The server APU should look different, as it might use ZP dies and GL dies either partly (plus MCM) or fully on an interposer. But it could also be something else altogether (console, embedded?).

Keep in mind the APU is specifically for HPC and no other markets. It almost has to be multiple dies using an interposer. I don't know if they would be able to fit a Zen die plus the bigger Vega, though; I guess we will find out.