New Zen microarchitecture details


looncraz

Senior member
Sep 12, 2011
722
1,651
136
Decent, but not good enough compared to Intel's. Run some branchy bench you know of and use CodeAnalyst to check the misprediction rates. I'll run the same with Intel Skylake.

You make the project, I'll run it on three different systems (Excavator, Sandy Bridge, Deneb).

Excavator's branch prediction rates seem like they should be better than Sandy Bridge, judging by Agner's comments.

http://www.agner.org/optimize/microarchitecture.pdf

Pages 28 & 33
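
To get a feel for what a misprediction rate measures before running a real benchmark, here is a minimal sketch. It is not CodeAnalyst output and not how Excavator, Sandy Bridge, or Skylake actually predict branches: it simulates a single textbook 2-bit saturating counter in Python. The function name and the two synthetic branch patterns are assumptions made up for illustration.

```python
import random

def mispredict_rate(outcomes):
    """Fraction of branches mispredicted by one 2-bit saturating counter."""
    counter = 2  # states 0-1 predict not-taken, 2-3 predict taken; start weakly taken
    misses = 0
    for taken in outcomes:
        predicted_taken = counter >= 2
        if predicted_taken != taken:
            misses += 1
        # Move the counter toward the actual outcome, saturating at 0 and 3.
        counter = min(counter + 1, 3) if taken else max(counter - 1, 0)
    return misses / len(outcomes)

rng = random.Random(0)
loop_like = ([True] * 9 + [False]) * 1000            # taken 9 times, then the loop exits
random_like = [rng.random() < 0.5 for _ in range(10_000)]  # data-dependent coin flip

print(f"loop-like:   {mispredict_rate(loop_like):.3f}")
print(f"random-like: {mispredict_rate(random_like):.3f}")
```

The loop-like pattern only misses on the loop exit, while the coin-flip pattern sits near 50% for any predictor, which is why a "branchy" benchmark needs branches that are hard but not random to separate one core from another.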

I disagree that it's just cache and the rest was equally matched to Intel.

Not what I'm saying at all, I'm just comparing Zen to the Construction cores. AMD does have areas where it has been stronger than Intel - and Zen looks to attempt to exploit that. Zen+ looks to double down on that strategy (if the 15% boost is to be believed).

Slow cache is also a major oversimplification - it's ways, it's how many accesses per line can be made simultaneously, it's the latency and bandwidth with cache contention.

I wouldn't call it an oversimplification - all the things you said make a cache slow. Slow != bandwidth alone. Slow comes in terms of latency, throughput, and any combination thereof... and nearly everything is tied to the performance of the caches one way or another.

If Excavator had even Sandy Bridge level caches, things would be a lot different.

BDvsSandy-Caches.png

Oh, and that's 5GHz Bulldozer vs 4.5GHz Sandy Bridge.
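
Because the two chips run at different clocks, latency in cycles and latency in wall-clock time don't line up one-to-one. A rough sketch of the conversion follows; the cycle counts are placeholders chosen for illustration, not values read from the chart, and `cycles_to_ns` is a hypothetical helper name.

```python
def cycles_to_ns(cycles, clock_ghz):
    """Convert a latency in core cycles to nanoseconds at a given clock."""
    return cycles / clock_ghz

# Placeholder latencies (hypothetical, NOT taken from BDvsSandy-Caches.png).
fast_clock_ns = cycles_to_ns(20, 5.0)   # 20 cycles at 5.0 GHz
slow_clock_ns = cycles_to_ns(12, 4.5)   # 12 cycles at 4.5 GHz

print(f"5.0 GHz part: {fast_clock_ns:.2f} ns")
print(f"4.5 GHz part: {slow_clock_ns:.2f} ns")
```

A higher clock shortens each cycle, so a cache that still comes out slower in nanoseconds at 5 GHz is losing by more than the cycle counts alone suggest.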
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
The L3 is rather irrelevant, as in Bulldozer it's just an eviction cache (anything faster than main memory is a benefit); it's the horrible L2. David Kanter blames a lot of Bulldozer's issues on the L1D/L1I/L2 arrangement. I think a 10-12 cycle L2 for Zen is a reasonable expectation for a 512K L2.

According to the CERN leaks, L1D and L1I are 64K each, which, coupled with the much faster L2 and probably much faster L3, will be a massive improvement for single-thread perf, though probably not much of a difference for throughput. Even if they are just 32K each, that should be sufficient when backed by a fast, good-sized L2 (on the assumption the L1I is better than the current 3-way).

If I was a betting man I would bet that Zen has just as long a pipeline as CON (my guess is the front end and L/S are CON core evolutions), so it will be interesting to see how/if they can reduce failed branches and the associated penalty. Will we see the much-patented retirement queue cache/trace cache, checkpointing, etc.? I expect they will have done something to alleviate the 20+ cycle branch miss penalty of a pipeline of that length.
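
To put the 20+ cycle miss penalty in perspective, here is a first-order sketch of what mispredictions cost per instruction. It ignores overlap with other stalls, and the branch fraction, misprediction rate, and the 14-cycle comparison point are assumptions picked for illustration, not measurements of any core.

```python
def mispredict_cpi(branch_fraction, mispredict_rate, penalty_cycles):
    """Average extra cycles per instruction lost to branch mispredictions."""
    return branch_fraction * mispredict_rate * penalty_cycles

# Hypothetical workload: 20% of instructions are branches, 5% of those mispredict.
for penalty in (14, 20):
    extra = mispredict_cpi(0.20, 0.05, penalty)
    print(f"{penalty}-cycle penalty: +{extra:.2f} CPI")
```

In this simple model the cost scales linearly with both the miss rate and the penalty, so a long pipeline can claw the loss back either by predicting better or by recovering faster (which is what checkpointing aims at).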
 

coffeemonster

Senior member
Apr 18, 2015
241
86
101
Hypothetical thought: After Kaveri, Carrizo/Excavator was designed for efficiency foremost, no? To become the CAT cores' replacement. So what would have happened if instead they had decided to make the Kaveri arch improvements on the mature 32nm SOI and continued with an FX-8400, and later the Excavator arch improvements (perhaps without using High Density Libraries?) as an FX-8500? I just wonder what sort of CPU would have come out if the FX line had continued with all the CON core improvements on 32nm SOI, since that seems to be the node best suited to the uarch.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,686
1,221
136
Hypothetical thought: After Kaveri, Carrizo/Excavator was designed for efficiency foremost, no? To become the CAT cores' replacement. So what would have happened if instead they had decided to make the Kaveri arch improvements on the mature 32nm SOI and continued with an FX-8400, and later the Excavator arch improvements (perhaps without using High Density Libraries?) as an FX-8500? I just wonder what sort of CPU would have come out if the FX line had continued with all the CON core improvements on 32nm SOI, since that seems to be the node best suited to the uarch.
If it was a drop and optimize port it would not be on 32nm PDSOI but on 22nm FDSOI.

With an optimized Orochi die on FDSOI with OD+FBB/SRAM+RBB, we are looking at ~4.5 GHz as the nominal clock. With 30-40% yield aiming towards 5 GHz.

So, FX-9600(117W/8-core Top SKU)/FX-9400(88W/8-core Nominal SKU)/FX-7400(88W/6-core Salvaged SKU)/FX-5400(88W/4-core Salvage SKU). 170-to-156.42-to-140 mm² range. (Would require internal Northbridge/Southbridge to support AM4)

With an optimized Gecko (SR28)/Basilisk (XV20), we would be looking at a nominal all-core clock of ~4 GHz @ 140 watts with a 16-core SKU. 360-to-344.124-to-320 mm² range. (Would just need a southbridge to support AM4)

22FDSOI has ~55% lower PVT variation than 14LPP, currently. Which means at this moment 0.3_PDK_22 is faster than 1.2_PDK_14.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
The L3 is rather irrelevant, as in Bulldozer it's just an eviction cache (anything faster than main memory is a benefit); it's the horrible L2. David Kanter blames a lot of Bulldozer's issues on the L1D/L1I/L2 arrangement. I think a 10-12 cycle L2 for Zen is a reasonable expectation for a 512K L2.

According to the CERN leaks, L1D and L1I are 64K each, which, coupled with the much faster L2 and probably much faster L3, will be a massive improvement for single-thread perf, though probably not much of a difference for throughput. Even if they are just 32K each, that should be sufficient when backed by a fast, good-sized L2 (on the assumption the L1I is better than the current 3-way).

If I was a betting man I would bet that Zen has just as long a pipeline as CON (my guess is the front end and L/S are CON core evolutions), so it will be interesting to see how/if they can reduce failed branches and the associated penalty. Will we see the much-patented retirement queue cache/trace cache, checkpointing, etc.? I expect they will have done something to alleviate the 20+ cycle branch miss penalty of a pipeline of that length.

A few weeks ago I actually tested Vishera with the L3 caches disabled. The only workloads I noticed any performance difference in were WinRAR and 7-Zip. In WinRAR the performance was ~2% higher with L3 enabled and in 7-Zip 3.7% higher. This was at 1866MHz MEMCLK, so at higher memory speeds the difference would have been even lower.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
How about power consumption?

L3 on vs L3 off.

Cannot be measured easily. The total NB (NB + 4x 2MB L3) worst case power consumption on Vishera is ~15W so it is nearly irrelevant. For a Piledriver based NPU (without GPU) the NB power consumption is ~8W.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
Yeah, things with bigger, more often accessed data sets will see more benefit from the extra cache. Did you test any games? I would expect a bigger difference there.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
Yeah, things with bigger, more often accessed data sets will see more benefit from the extra cache. Did you test any games? I would expect a bigger difference there.

Nope, just some of the most common benchmarks. I would be shocked if any other workload would show greater gains than 7-Zip.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
Compress/decompress workloads are only a small subset; heavily branching logic code will stress prefetch, prediction, and cache far more. Back in the Phenom II days the 6MB L3 was good for around 10-15% in games.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
Cannot be measured easily. The total NB (NB + 4x 2MB L3) worst case power consumption on Vishera is ~15W so it is nearly irrelevant. For a Piledriver based NPU (without GPU) the NB power consumption is ~8W.
These numbers are helpful. Is it safe to assume that a Zeppelin die might have ~6W NB power (lower for internal logic, roughly similar for off-die I/O)?

Edit:
@all: From here:
Lead a team on a brand new, next generation microprocessor core design in 16/14 nm FinFET Technology.
Delivered Various CPU x86/ARM core design from netlist to tape-out includes AMD K8, Bulldozer, PileDriver, and Excavator.
OK, there is no ARM based core in this list...

This could be a result of component sharing between both designs. The question is: are there any upcoming 16nm FinFET CPUs on the internal AMD roadmap?
 

BigDaveX

Senior member
Jun 12, 2014
440
216
116
A few weeks ago I actually tested Vishera with the L3 caches disabled. The only workloads I noticed any performance difference in were WinRAR and 7-Zip. In WinRAR the performance was ~2% higher with L3 enabled and in 7-Zip 3.7% higher. This was at 1866MHz MEMCLK, so at higher memory speeds the difference would have been even lower.

There seems to be something about these "speed demon" designs whereby they hardly gain anything from cache size increases. Back in the day, the Pentium 4 got some pretty nice performance increases when the cache was doubled in Northwood, and then when the Extreme Edition slapped on that huge (for the time) L3 cache. Then Prescott made the pipeline crazy deep, and all of a sudden we had a situation where the IPC difference between a Celeron D with 256KB of L2 cache and a Pentium 4 600-series with 2MB of L2 cache was fairly negligible.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
These numbers are helpful. Is it safe to assume that a Zeppelin die might have ~6W NB power (lower for internal logic, roughly similar for off-die I/O)?

Edit:
@all: From here:


OK, there is no ARM based core in this list...

This could be a result of component sharing between both designs. The question is: are there any upcoming 16nm FinFET CPUs on the internal AMD roadmap?

Isn't there an ARM core inside Zen acting as a system agent?
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
Zen added the support for AMD's Secure Memory Encryption (SME) and AMD's Secure Encrypted Virtualization (SEV). Secure Memory Encryption is real time memory encryption done per Page Table Entry. This is done utilizing the onboard "Security" Processor (ARM Cortex-A5) at boot time to encrypt each page, allowing any DDR-4 Memory (including Non Volatile varieties) to be encrypted. AMD SME also makes the contents of the memory more resistant to memory snooping and cold boot attacks.

^ Wikipedia

It's the security processor I was thinking of - AMD Trust Zone.

http://www.amd.com/en-us/press-releases/Pages/amd-strengthens-security-2012jun13.aspx
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
^ Wikipedia

It's the security processor I was thinking of - AMD Trust Zone.

http://www.amd.com/en-us/press-releases/Pages/amd-strengthens-security-2012jun13.aspx
The PSP is using an ARM Cortex-A5. David Kaplan, the guy who held the CPU design talk at CCC, is mainly involved in the PSP stuff. But Hon Hin Wong worked on CPU schedulers and the like, at a much lower level than the PSP. I think, even if K12 had been cancelled, many people had already worked on it for years, and even more on reused components for both uarchs.
 

Hi-Fi Man

Senior member
Oct 19, 2013
601
120
106
AMD seems to have a lot of sound designs but they are usually held back by their poor cache and memory performance. If AMD could improve that, I think it would bring them much closer to where they need to be.

I'm also worried about their chipset. Rumors say ASMedia is designing it, but rumors also say there have been delays/issues. Not only does AMD have to execute their CPU well but they also have to execute their chipset well. This is an area where I think AMD will be able to provide an advantage over Intel at the same price point, at least in the mainstream market.

There's a lot riding on GlobalFoundries' 14nm process and so far it doesn't seem to be that great but hopefully from now till Zen launch it'll mature a bit. I can only wonder what Zen on IBM's (now GF's) 22nm FDSOI would look like...
 

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,829
136
There's a lot riding on GlobalFoundries' 14nm process and so far it doesn't seem to be that great but hopefully from now till Zen launch it'll mature a bit. I can only wonder what Zen on IBM's (now GF's) 22nm FDSOI would look like...

A lot of us are wondering about that. Back in 2014 it gave us POWER8 CPUs in the 4.7 GHz territory, albeit at massive power draw. Those were big honkin chips though. Considering how much improving GF has done with their 32nm and 28nm processes, you'd think 22nm FDSOI would have seen some improvements since fall of 2014.
 

The Stilt

Golden Member
Dec 5, 2015
1,709
3,057
106
POWER9 will be using the 14nm HP FinFET process developed by IBM and now owned by GlobalFoundries. Now AMD could grow some balls and say that the WSA won't be fulfilled unless they get access to the process. I would assume that obligations to fulfill any existing contracts expire the moment a bankruptcy is declared anyway, or? So in that respect GlobalFoundries has nothing to lose.
 

looncraz

Senior member
Sep 12, 2011
722
1,651
136
I'm also worried about their chipset. Rumors say ASMedia is designing it, but rumors also say there have been delays/issues.

The "issue" that was leaked was just the standard affair about USB 3.1 - the same applies to Intel chipsets because you can only drive the signal so strong from the chip. Basically, USB 3.1 degrades over relatively short distances (and with relatively minor interference), so the traces from the chipset to the header/ports need to be short and clean - which is a serious packaging challenge.

The solution is to either use a ~$2 driver chip to boost and filter the signal after a few inches or to do what HP did - and move the chipset right behind the USB 3.1 ports - and to keep the headers closer:
AMD-AM4-Motherboard-1.jpg



There's a lot riding on GlobalFoundries' 14nm process and so far it doesn't seem to be that great but hopefully from now till Zen launch it'll mature a bit. I can only wonder what Zen on IBM's (now GF's) 22nm FDSOI would look like...

I disagree with assertions that 14nm LPP isn't looking good - I think it is delivering handsomely. It took the poor-clocking and power hungry GCN architecture and gave it 20% better clocks and cut power usage of the GPU immensely.

Yields are probably something of an issue at this time, but that should be resolved by the time Zen makes it to production.
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,785
136
I disagree with assertions that 14nm LPP isn't looking good - I think it is delivering handsomely. It took the poor-clocking and power hungry GCN architecture and gave it 20% better clocks and cut power usage of the GPU immensely.

Back with Pentium 4 Northwood, a shrink made the CPU drop its power use from ~90W to 54W, a big 40% drop in power use. And they did not have to use the fancy power management techniques that are not only a requirement nowadays but, even when used, don't bring the huge improvements that a "simple" shrink brought over a decade ago with Northwood.
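
As a quick sanity check on that figure, a minimal sketch using only the two wattages quoted above (`percent_drop` is a hypothetical helper name):

```python
def percent_drop(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0

# ~90W before the shrink, 54W after, as quoted in the post.
print(f"{percent_drop(90, 54):.0f}% drop")
```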

Back then, in a very general sense we could call it a simple/straight shrink, because compared to uArch changes the efforts put in were negligible. Of course, there were changes needed that we didn't know about.

But nowadays, process improvements and even architectural overhauls bring much smaller changes. And at the same time it's much more difficult. If it was 1x effort and 1y gain, now it's 4x effort and 0.25y gain. There isn't such a thing as a simple/straight shrink anymore. Not from an engineer's point of view. Remember how AMD promised massive, top-to-bottom changes for GCN4 aka GCN1.4 aka Polaris? Well, 10 years ago that kind of effort might have really brought massive improvements, but now they are a requirement to merely benefit from a process. It's clear that it's a fundamental limitation, like what humans are capable of (or not capable of).
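
Taking the "1x effort, 1y gain" versus "4x effort, 0.25y gain" figures at face value (they are rhetorical, not measured), the return on effort can be worked out with a short sketch; `gain_per_effort` is a hypothetical helper name.

```python
def gain_per_effort(gain, effort):
    """Gain delivered per unit of engineering effort."""
    return gain / effort

then = gain_per_effort(1.0, 1.0)    # "1x effort and 1y gain"
now = gain_per_effort(0.25, 4.0)    # "4x effort and 0.25y gain"

print(f"return on effort is {then / now:.0f}x lower")
```

Under those numbers the two factors compound: four times the work for a quarter of the result is a sixteen-fold drop in return, not a four-fold one.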

So to blame or credit a process change alone is incorrect. Process is merely one part of the recipe for success, not the whole picture.
 