New Zen microarchitecture details


Doom2pro

Senior member
Apr 2, 2016
587
619
106
So where are the apps that run for seconds and would benefit from HSA, and thus the average user? I don't see them.

Also, I would like to point out the age-old pitfall of the term "average user": a long time ago the "average user" didn't ever need a computer, and look where we are now... You cannot take advantage of something if it isn't already there. You don't get modern video games or PC software if there isn't a PC in the first place, and if you don't even bother attempting to put one in place, you will never know the benefits of doing so.
 
  • Like
Reactions: cytg111

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Also, I would like to point out the age-old pitfall of the term "average user": a long time ago the "average user" didn't ever need a computer, and look where we are now... You cannot take advantage of something if it isn't already there. You don't get modern video games or PC software if there isn't a PC in the first place, and if you don't even bother attempting to put one in place, you will never know the benefits of doing so.
I perfectly agree with your points about how programming is done. Teaching culture models. We need a paradigm change. And we see game engines coming without a main thread. So it's en route.
But those benefits will get to us even if the CPU and GPU are not on the same die.
What I originally questioned was this: with Infinity Fabric in hand, who in their sane mind wants to fork out $1.2B to get the last step in latency reduction by going single-die instead of connecting on an interposer?
Imo Infinity Fabric is a game changer here. You get, say, 80% of the advantage of an APU using off-the-shelf parts, at far less than 20% of the cost.
 
  • Like
Reactions: Doom2pro

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
Snowy Owl is probably not coming out until the end of next year, though, if the rumors that Vega 20 won't release until then are correct.
I think the socket and infrastructure most certainly will (SP4??). That's going to be the up-to-16-core, 4-memory-channel 1P/2P platform to take on the lower end of the server market, Xeon-D, and the workstation space.

The APU itself will come out whenever it comes out. I don't know if they need to wait for Vega 20; AMD have already said all Vegas are using the Infinity Fabric. The question is whether they added the needed I/O/interface for it.
 

Doom2pro

Senior member
Apr 2, 2016
587
619
106
I think the socket and infrastructure most certainly will (SP4??). That's going to be the up-to-16-core, 4-memory-channel 1P/2P platform to take on the lower end of the server market, Xeon-D, and the workstation space.

The APU itself will come out whenever it comes out. I don't know if they need to wait for Vega 20; AMD have already said all Vegas are using the Infinity Fabric. The question is whether they added the needed I/O/interface for it.

I wouldn't be surprised if Infinity Fabric is also being used in those 2P systems across packages (and possibly to GPUs over PCIe), in addition to its use as GMI between dies on the MCMs.
 

jpiniero

Lifer
Oct 1, 2010
14,585
5,209
136
The APU itself will come out whenever it comes out. I don't know if they need to wait for Vega 20; AMD have already said all Vegas are using the Infinity Fabric. The question is whether they added the needed I/O/interface for it.

Kind of need the 1/2-rate DP of Vega 20 to make it work, I think. They could of course go ahead and release models without the GPU die in the meantime.
 

DrMrLordX

Lifer
Apr 27, 2000
21,620
10,830
136
I understand that. I just look and see "spreadsheet dingus in LibreOffice" and simply can't imagine the software where this utilization of the GPU gives a user benefit. I don't question that it can make a lot of software faster, just that I can't imagine the software where the user will notice.
It's that simple. What software are we talking about? I don't know.

I haven't looked at enough of the code for stuff like Firefox, for example, but there's still plenty of software out there where end-users will eventually want a quicker race-to-idle. It will take faster storage and memory for people to fully appreciate those improvements, since those things are more likely to be bottlenecks.

The other thing to think about here is this: in an fp workload, using both the iGPU and CPU in their entirety, an A10-7850K approached the same 32-bit fp output as a decent Haswell i5/i7 doing the same. Think about that for a second. Stuff like OpenCL 2.0 and HSA should have made cheaper APUs/CPUs capable of doing the same work as a higher-end CPU. We're still not getting as much "everyday" processing power out of stuff like i3s and A10s thanks to the shortage of proper OpenCL 2.0 acceleration.
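To make that concrete, here's a minimal sketch of the kind of zero-copy iGPU offload OpenCL 2.0 was supposed to make routine. This is my own hypothetical example, not AMD's code; it assumes a driver with fine-grained SVM support (the feature HSA-class APUs like Kaveri/Carrizo expose), and error handling is omitted for brevity:

```c
/* Minimal OpenCL 2.0 shared-virtual-memory (SVM) sketch: a 32-bit float
 * add offloaded to the iGPU with no explicit buffer copies.
 * Hypothetical illustration only; assumes an OpenCL 2.0 capable driver
 * with fine-grained SVM. */
#include <CL/cl.h>
#include <stdio.h>

static const char *src =
    "__kernel void vadd(__global const float *a,"
    "                   __global const float *b,"
    "                   __global float *c) {"
    "    size_t i = get_global_id(0);"
    "    c[i] = a[i] + b[i];"
    "}";

int main(void) {
    enum { N = 1 << 20 };
    cl_platform_id plat; cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, dev, NULL, NULL);

    /* Fine-grained SVM: CPU and iGPU share one address space, so the
     * host writes the inputs and reads the result directly. */
    cl_svm_mem_flags f = CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER;
    float *a = clSVMAlloc(ctx, f, N * sizeof(float), 0);
    float *b = clSVMAlloc(ctx, f, N * sizeof(float), 0);
    float *c = clSVMAlloc(ctx, f, N * sizeof(float), 0);
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, "-cl-std=CL2.0", NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "vadd", NULL);
    clSetKernelArgSVMPointer(k, 0, a);
    clSetKernelArgSVMPointer(k, 1, b);
    clSetKernelArgSVMPointer(k, 2, c);

    size_t global = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(q);
    printf("c[42] = %f\n", c[42]);   /* expect 126.0 */

    clSVMFree(ctx, a); clSVMFree(ctx, b); clSVMFree(ctx, c);
    return 0;
}
```

The point of the SVM path is exactly the APU pitch: no staging buffers, no PCIe copies, just the same pointers on both sides of the fabric.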

Exactly what I was gonna write... well, not exactly, but something along those lines :).
What apps?
I could maybe see it in games à la PhysX, that is, an APU with a discrete card on the side... Other than that, what? (That wouldn't already benefit from completely offloading to a GPGPU.)

Complete offload requires massively parallel code. iGPGPU can be utilized in any scenario where a small set of calculations can be carried out before a branch/dependency interrupts the parallel workflow. Ditto for SIMD really, but the iGPU can do it better. In the case of something like Carrizo or Kaveri, the iGPU can do it a lot better.

To use SIMD as an example: if I had code where I frequently needed to carry out calculations in blocks of 4, 8, 12, or 16 32-bit addition operations before reaching a dependency, then obviously 128-bit SIMD, or (for the 8-wide or 16-wide blocks) 256-bit SIMD, would be potentially helpful.
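To sketch the blocks-of-4 case in code (my own hypothetical example, nothing vendor-specific; the 8-wide blocks would use AVX2's 256-bit _mm256_add_epi32 the same way):

```c
/* Blocks-of-4 SIMD illustration: four 32-bit integer adds per iteration
 * done with one 128-bit SSE2 instruction before any dependency is hit.
 * Hypothetical sketch, not from the thread. */
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* out[i] = a[i] + b[i], vectorized in blocks of four 32-bit lanes. */
void add_blocks4(const int32_t *a, const int32_t *b, int32_t *out, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; i++)          /* scalar tail when n isn't a multiple of 4 */
        out[i] = a[i] + b[i];
}

int main(void)
{
    int32_t a[6] = {1, 2, 3, 4, 5, 6};
    int32_t b[6] = {10, 20, 30, 40, 50, 60};
    int32_t c[6];
    add_blocks4(a, b, c, 6);
    for (int i = 0; i < 6; i++)
        printf("%d ", c[i]);    /* prints: 11 22 33 44 55 66 */
    printf("\n");
    return 0;
}
```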

There is a lot of software out there that uses heavily parallel loads (rendering), and software that could but currently doesn't (like SPICE, for example)... The programming techniques being taught today need to change, because blindly assuming that CPUs are going to get faster and faster in frequency and IPC isn't helping things.

Well that, and there also needs to be more work on ubiquitous compilers that can help programmers utilize GPUs for this purpose. It's a shame that most of the OpenCL/HSA work done for Java 9 died on the vine and won't make it into the next release.

Yea, I get that, but overall, if I am rendering stuff, wouldn't I go the GPGPU route and do OpenCL? Rendering seems like a very specific task. What I am getting at is: where are the day-to-day apps that would benefit from HSA? I can only come up with games! (And that might be enough given the console deals, I dunno...)

See above. A lot of it has to do with the underlying code of apps that we don't look at on a daily basis. I'll bet you there's more that could be accelerated via iGPUs than people think.
 

itsmydamnation

Platinum Member
Feb 6, 2011
2,764
3,131
136
Kind of need the 1/2-rate DP of Vega 20 to make it work, I think. They could of course go ahead and release models without the GPU die in the meantime.
I was thinking "deep learning" APU, but I don't know how much value there is in that. Also, I wonder if Vega really is only 1/16th-rate DP (or whatever it is); it could just be limited to that for the products announced.
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
This is a dead horse, but I'ma gonna dig it up anyway :) ... Amdahl's law.
Very many things are inherently serial in nature. No way around it.

For example, why would this benefit the user experience:
Do X at 2 ms: result at 10 watts
Do X with HSA at 1 ms: result at 10 watts

You are doing twice the work per unit time but still at the same watts. The user ain't gonna notice going from 2 ms to 1 ms.
So where are the apps that run for seconds and would benefit from HSA, and thus the average user? I don't see them.

I don't see endless parallelism saving us here; my bet is on frequency scaling.
Science is going to have to come up with more Hz's for our chips.
We should not forget Gustafson's Law, which explains the existence and wide application of Deep Learning.

Amdahl's law has a flaw, btw, as it estimates scaling better than what can be achieved in reality, due to increased communication overhead, power constraints, etc. OTOH, for today's compute tasks the values used for the serial component are often too high; it can be lower than 1%, which gives nice scaling at reasonable core counts. And over time it gets smaller. Just think of raytracing as an example: in the '90s we calculated 640x480 pictures, now 1080p or even 2160p is the new standard. By scaling the parallel part by a factor of 27, the serial part simply shrank in relation to that (Gustafson's law).
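For reference, the two laws side by side, with the factor-27 arithmetic spelled out (standard formulas, restated by me; s is the serial fraction, N the number of parallel units):

```latex
% Amdahl (fixed problem size) vs. Gustafson (problem grows with N):
S_{\text{Amdahl}}(N) = \frac{1}{s + \frac{1 - s}{N}} \le \frac{1}{s},
\qquad
S_{\text{Gustafson}}(N) = s + (1 - s)\,N.
% Raytracing example: 2160p has
% (3840 \cdot 2160)/(640 \cdot 480) = 8\,294\,400 / 307\,200 = 27
% times the pixels of 640x480, so the parallel part grew 27x while the
% serial part stayed fixed, shrinking its relative share (Gustafson's view).
```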

For typical non-compute tasks, there might be long serial chains, of course. This is where the "Fusion" concept begins to make sense -> a quick, low-overhead switch between, or even combined application of, fast serial and fast parallel compute resources.

If your 2 ms vs. 1 ms happened every frame in a game, average users would notice it. ;) Also, using 10 W for half the time would mean less energy used for the task.
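Spelling out that energy arithmetic (same figures as the quoted example):

```latex
E = P \cdot t:\qquad
10\,\mathrm{W} \times 2\,\mathrm{ms} = 20\,\mathrm{mJ}
\quad \text{vs.} \quad
10\,\mathrm{W} \times 1\,\mathrm{ms} = 10\,\mathrm{mJ}.
% Same power, half the time: half the energy per task, which is why
% race-to-idle wins even though the instantaneous watts are unchanged.
```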
 
  • Like
Reactions: Doom2pro

Crumpet

Senior member
Jan 15, 2017
745
539
96
Fott posted this in SemiAccurate: some new information about GF and Samsung's 14LPP (same node as Polaris, Vega, and Ryzen). IMO a good read, especially considering the lack of anything new of late: http://www.bitsandchips.it/52-engli...obal-foundries-pdf-on-the-14nm-finfet-process

I would be more than happy with Ryzen rolling out at 4 GHz. With the new turbo and some efficient cooling, it would have some potentially decent clocks without even overclocking.

All speculation and rumour of course, but if the data reported above is correct...
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
For typical non-compute tasks, there might be long serial chains, of course. This is where the "Fusion" concept begins to make sense -> a quick, low-overhead switch between, or even combined application of, fast serial and fast parallel compute resources.

If your 2 ms vs. 1 ms happened every frame in a game, average users would notice it. ;) Also, using 10 W for half the time would mean less energy used for the task.
It's a theoretical exercise.
What serial chains, of course, will present a 4C/8T Zen CPU with difficulties that the user will notice and that cannot be offloaded to the GPU on a separate die?
It's not like async compute isn't working because it's on a separate die; it isn't working because programmers don't use it.
If I look at the FPU in a 4C Zen, it's about 4 times as strong as in a 2M/4C Excavator. And then look at an 8C solution.
It just doesn't make any sense to go the last stretch and pay $1.2B for it.
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
I will play a game with you guys.
The server APU has a 16C/32T setup with 2 stacks of HBM2 (16 GB total) and a 64 CU GPU.
4C/8T, 16 CU, 4 GB of HBM2 is exactly 1/4th of the big brother.
Did it not even once cross your mind that a 16C/32T setup with 2 stacks of HBM2 and a 64 CU GPU is basically a description of an MCM with 2 Zeppelin dies and 1 Vega 10 die? So you propose to cut precisely 3/4ths of each component and call it Raven Ridge, don't you?
 

Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
Did it not even once cross your mind that a 16C/32T setup with 2 stacks of HBM2 and a 64 CU GPU is basically a description of an MCM with 2 Zeppelin dies and 1 Vega 10 die? So you propose to cut precisely 3/4ths of each component and call it Raven Ridge, don't you?
Simplest answer: no.

Read about the designs. Read about the philosophy behind the designs. Zen is about balance. Your proposals about Raven Ridge completely contradict the design of the microarchitectures. Raven Ridge is 4C/8T + 16 CU + 4 GB HBM2 in its highest-end design. From this point everything else will be derived. Designing an 11 CU GPU would break the balanced design that Raven Ridge is supposed to be.

I am not saying that we will not see a 4C/4T + 11 CU design with no HBM2. We might. But that is not the design that will spawn it; only as a cut-down version, then yes.

And start to think about what AMD has in software initiatives, and what they would want to achieve with the Raven Ridge APU. They are not only meant for consumer markets. They will see a lot of traction in professional use, especially in machine learning in embedded applications.

One last bit, because it sums up why people have a hard time believing anything about AMD, given the perception they have of the brand: has any one of you considered that Ryzen is better than you think? Have you considered that Vega is better than you believe it is?
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
Infinity Fabric has a high-bandwidth path for data and a low-latency path for control/safety. This separation is very interesting.

We might see a future where some of the big guys like Google, FB, and Amazon add their own fixed-function hardware to the fabric, especially on the control/safety side.

To me it looks like AMD just invented something worth 100 times more than McAfee, just by using their brains.
 

krumme

Diamond Member
Oct 9, 2009
5,952
1,585
136
In 1991 I plugged a Cyrix FPU into my 386SX motherboard. Not only were fractals calculated faster, but heck, even Excel screen redrawing was faster ;)

With Infinity Fabric, AMD has the tech to open the door for third-party IP and innovation they cannot even fathom themselves, e.g. just by making room on the motherboard for it.
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Your proposals about Raven Ridge completely contradict the design of the microarchitectures.
Did you really miss the sarcasm in the part where I suggest using 5 dies totalling over 1000 mm^2 of silicon to produce 200 mm^2 worth of functional silicon? Frankly, I have no clue what Raven Ridge will be, except that it will most likely be similar to Intel's mainstream line-up, with a worse CPU and a better GPU; the exact configuration of the GPU itself is irrelevant because it will be memory-limited... Oh, and it won't come with HBM.
Raven Ridge is 4C/8T + 16 CU + 4 GB HBM2 in its highest-end design.
What is your reasoning for that? Because the big brother, as you have put it, is not a single die. So your proposal that the small brother is a separately developed, expensive die [HBM2 and an interposer for a 200 mm^2 APU, jesus] without any target market is, frankly, puzzling.
And start to think about what AMD has in software initiatives, and what they would want to achieve with the Raven Ridge APU.
There you have me; I barely recall any AMD software initiative with any traction.
They will see a lot of traction in professional use, especially in machine learning in embedded applications.
If that's your reasoning, then it is pretty weak, after all.
that Ryzen is better than you think?
It has proven that it is indeed better than I thought it was back when I learned of its Broadwell-E-level clocks.
Have you considered that Vega is better than you believe it is?
It has certainly proven larger than I thought it would be, given what AMD has shown so far.
 
  • Like
Reactions: Sweepr

Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
What is your reasoning for that? Because the big brother, as you have put it, is not a single die. So your proposal that the small brother is a separately developed, expensive die [HBM2 and an interposer for a 200 mm^2 APU, jesus] without any target market is, frankly, puzzling.
Because this is one design that will be used not only in desktop and professional applications, but also in mobile. It can be cut down and sold with disabled cores, in both the GPU and the CPU.

And this: http://www.bitsandchips.it/english/...two-versions-of-raven-ridge-under-development

I'm having a hard time believing that Raven Ridge with HBM2 will be offered in a 35 W package for desktop. 65-95 W, yes. 35 W can only be for mobile (Apple).
 

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
Because this is one design that will be used not only in desktop and professional applications, but also in mobile. It can be cut down and sold with disabled cores, in both the GPU and the CPU.
And why exactly is the, let's call it rumored, 11 or 12 CU design unfit for both desktop and mobile? Even 8 GCN 1.2 CUs are seriously memory-starved on both, and Vega would need to flip the Earth upside down to solve that problem. Alas, with its die size, flipping the Earth is one thing it does not do.
Half the article is "Hey hey hey. This is a rumor", come on.
I'm having a hard time believing that Raven Ridge with HBM2 will be offered in a 35 W package for desktop. 65-95 W, yes. 35 W can only be for mobile (Apple).
And I am having a hard time believing that this table is all information, without a grain of speculation.
 

Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
And why exactly is the, let's call it rumored, 11 or 12 CU design unfit for both desktop and mobile? Even 8 GCN 1.2 CUs are seriously memory-starved on both, and Vega would need to flip the Earth upside down to solve that problem. Alas, with its die size, flipping the Earth is one thing it does not do.

Half the article is "Hey hey hey. This is a rumor", come on.

And I am having a hard time believing that this table is all information, without a grain of speculation.
Because the Vega architecture's scheduling is optimized for an 8 CU shader engine. 8 CUs we already have in Bristol Ridge. Any improvement comes ONLY from adding a second shader engine with 8 CUs. Full designs can be either 8 CU or 16 CU. B&C claims that the mobile platform will have 12 CUs. Yes, if the GPU has a few of its cores disabled, that's possible.

You can have 12 or 11 CUs, but only in a cut-down version.

The rumor came to the B&C site from somewhere, as did their other credible information about Ryzen before.
 

Mopetar

Diamond Member
Jan 31, 2011
7,835
5,981
136
Not sure if they'll make a 16 CU model anytime soon, or even something that gets cut down from that, as it completely cannibalizes Polaris 11. Unless they're going to retire it really early, I can't see an APU with even 11 or 12 CUs.

8 CUs is the most they've gone with in their APUs prior to this point, and if you're getting an APU for games like DotA, LoL, etc., you really don't need more than that, as the 460 is already overkill for those games. A 4C/8T/8 CU APU would be great for notebooks or low-end desktops that people want to use for casual gaming.
 

coercitiv

Diamond Member
Jan 24, 2014
6,187
11,859
136
Not sure if they'll make a 16 CU model anytime soon, or even something that gets cut down from that, as it completely cannibalizes Polaris 11.
Any modern Zen-based APU would "cannibalize" Polaris 11; do you really expect people to buy the APU and then add a Polaris 11 for the extra grunt? :)
 
  • Like
Reactions: Drazick

Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
Any modern Zen-based APU would "cannibalize" Polaris 11; do you really expect people to buy the APU and then add a Polaris 11 for the extra grunt? :)
It will not cannibalize it, because a 1024-core Vega GCN GPU will be MUCH faster than a 1024-core Polaris GCN GPU.
 

coercitiv

Diamond Member
Jan 24, 2014
6,187
11,859
136
It will not cannibalize it, because a 1024-core Vega GCN GPU will be MUCH faster than a 1024-core Polaris GCN GPU.
Oh, because the 1024-core GCN part would be the only version, with no cut-down SKUs.
You can have 12 or 11 CUs, but only in a cut-down version.
I can haz? Thank you so much! But... what about that cannibalization? You promised, and now the violence... they will rip each other apart!
 
  • Like
Reactions: Drazick