bitsandchips [RUMOR] Bristol Ridge could have a 16CU GPU

Abwx · Feb 24, 2016

The Stilt said:
Ideally Steamroller / Excavator designs would have six units which would be clocked higher.

That would be far from ideal: at 33% higher frequency they would have the same throughput as with 8 units, but at the expense of 33% more GPU TDP than in the latter case, that is, 25% lower perf/watt.

On the other hand, going to 16 CUs would theoretically allow doubling the perf/watt (of the GPU) at equal throughput compared with an 8 CU solution.

This wouldn't have made sense in Kaveri, which does not have separate supplies for the GPU and CPU. Not that it's implemented in Bristol Ridge, but in its case there would be some advantages.

deasd · Feb 24, 2016

Yes. The leakage curve of a process always shows that a 3GHz dual core is more power hungry than a 1.5GHz quad core. If the process node can tolerate a larger die area, the GPU especially should have more CUs rather than a higher clock; of course, the CPU side is a different story. I don't know where to find data on the 28nm process leakage curve, but you should know about this.

cbn · Feb 24, 2016

deasd said:
Yes. The leakage curve of a process always shows that a 3GHz dual core is more power hungry than a 1.5GHz quad core. If the process node can tolerate a larger die area, the GPU especially should have more CUs rather than a higher clock; of course, the CPU side is a different story. I don't know where to find data on the 28nm process leakage curve, but you should know about this.

I agree with your theory, but I am skeptical that GPU frequency/voltage scales that well at the lower end (where it makes the most sense for an APU with limited bandwidth).

But maybe AMD would do this because 1.) they have the WSA burning wafers anyway 2.) extending the iGPU on the Carrizo/Bristol Ridge die (while keeping everything else the same) isn't that tough for them 3.) they would like to be more competitive against Intel and Nvidia in 35W to 45W laptops 4.) budget laptops use lower resolutions than budget desktops do (so the limited bandwidth doesn't hurt so much here).

cbn · Feb 24, 2016

So assuming we had two 35W/45W Excavator APUs (one with 512sp and one with 1024sp), how much frequency does the 1024sp iGPU need in games to make it worth it over a 512sp iGPU that runs at around 750 MHz in a game?

1024sp @ 375 MHz would be equal to 512sp @ 750 MHz, so in this case such an iGPU would be a waste of silicon area.

But what about 1024sp @ 450 MHz (a 20% gain over 512sp @ 750 MHz) or 1024sp @ 500 MHz (a 33% gain over 512sp @ 750 MHz)? Is this worth the ~60mm2 extra silicon?
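The percentages in this post can be checked with a rough sketch. It assumes performance scales purely with shader count times clock, with no bandwidth or ROP limits, which is a simplification; the function names are made up for illustration.

```python
# Rough check of the throughput figures above.
# Assumption: throughput is proportional to (shader count x clock) only.

def relative_throughput(shaders: int, clock_mhz: float) -> float:
    """Shader-count x clock product, in arbitrary units."""
    return shaders * clock_mhz


# Baseline from the post: 512sp at roughly 750 MHz in games.
BASELINE = relative_throughput(512, 750)


def gain_over_baseline(shaders: int, clock_mhz: float) -> float:
    """Fractional gain over the 512sp @ 750 MHz baseline (0.2 means +20%)."""
    return relative_throughput(shaders, clock_mhz) / BASELINE - 1.0


if __name__ == "__main__":
    for clock in (375, 450, 500):
        gain = gain_over_baseline(1024, clock)
        print(f"1024sp @ {clock} MHz: {gain:+.0%} vs 512sp @ 750 MHz")
```

Under that assumption the break-even clock for the 1024sp part is 375 MHz, and anything above it is headroom that still has to survive the memory bandwidth limit discussed elsewhere in the thread.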

Unoid · Feb 24, 2016

deasd said:
Yes. The leakage curve of a process always shows that a 3GHz dual core is more power hungry than a 1.5GHz quad core. If the process node can tolerate a larger die area, the GPU especially should have more CUs rather than a higher clock; of course, the CPU side is a different story. I don't know where to find data on the 28nm process leakage curve, but you should know about this.

I agree also.

Just look at the Fury Nano: take the wide, massive CU chip, but lower the clocks. It almost geometrically reduces power usage.

Mobile chip Carrizo with 512sp at, let's say, 1GHz
vs.
mobile chip Carrizo with 1024sp at 500MHz.

Just eyeballing it, you'd think both would have nearly the same performance, ignoring bottlenecks and assuming the same ratio of ROPs etc.

It should be near the same performance; however, operating at 500MHz, even with 2x the die space, it'll consume much less power.

That could mean the difference between throttling or not.

I'd love to see a 500mm^2 mobile Carrizo with the same 4-threaded Excavator, but make it a 3000ish SP GPU and put it on an HBM interposer with that awesome bandwidth. Clock the GPU at like 300MHz and keep the volts low. You'd still have a sub-45 watt MONSTER APU!

Yuriman · Feb 24, 2016

^ Arguably not, because (as I understand it) even 512SP parts are not scaling well with clockspeed @ 750mhz. AMD can probably get around the same effective performance out of their 512sp parts with slashed clockspeed as-is, while lowering power usage.

Until the memory situation changes, I don't think there's much point to a bigger die, other than to lower power consumption.

Abwx said:
That would be far from ideal: at 33% higher frequency they would have the same throughput as with 8 units, but at the expense of 33% more GPU TDP than in the latter case, that is, 25% lower perf/watt.

On the other hand, going to 16 CUs would theoretically allow doubling the perf/watt (of the GPU) at equal throughput compared with an 8 CU solution.

This wouldn't have made sense in Kaveri, which does not have separate supplies for the GPU and CPU. Not that it's implemented in Bristol Ridge, but in its case there would be some advantages.

6 CUs clocked 33% higher would not consume 33% more power than 8 CUs... at least probably not. It depends on what voltages are needed. Those extra CUs still need power to run. I'm sure there would be gains, but I doubt you'll cut power usage in half by doubling the size of the chip.

cbn · Feb 24, 2016

Yuriman said:
^ Arguably not, because (as I understand it) even 512SP parts are not scaling well with clockspeed @ 750mhz. AMD can probably get around the same effective performance out of their 512sp parts with slashed clockspeed as-is, while lowering power usage.

Remember Kaveri and Godavari don't have GCN 1.2, which gives something like a 35% improvement in bandwidth efficiency due to its delta color compression.

So Bristol Ridge with 512sp at 750 MHz goes farther on DDR4 2400 than Kaveri with 512sp @ 750 MHz using DDR3 2400 (i.e., the same 38.4 GB/s bandwidth).

Also remember laptops often have lower resolutions than desktops. So that stronger core might be useful for turning up detail settings (which I believe at 1366 x 768 stress GPU core more than bandwidth).
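For reference, the 38.4 GB/s figure above follows from a dual-channel, 64-bit-per-channel memory setup at 2400 MT/s. The sketch below reproduces it and, as a loudly labeled assumption, treats the "something like 35%" compression benefit as a simple multiplier on effective bandwidth, which real workloads will not match exactly. Function and parameter names are illustrative.

```python
# Peak theoretical bandwidth for DDR3/DDR4-2400.
# Assumption: dual channel, 64-bit (8-byte) bus per channel.

def peak_bandwidth_gbs(transfer_rate_mts: float,
                       channels: int = 2,
                       bus_bytes: int = 8) -> float:
    """Peak bandwidth in GB/s for a given transfer rate in MT/s."""
    return transfer_rate_mts * 1e6 * bus_bytes * channels / 1e9


def effective_bandwidth_gbs(raw_gbs: float, efficiency_gain: float) -> float:
    """Hypothetical 'equivalent' bandwidth if compression stretches each byte.

    Assumption: the efficiency gain applies uniformly, which is optimistic.
    """
    return raw_gbs * (1.0 + efficiency_gain)


if __name__ == "__main__":
    raw = peak_bandwidth_gbs(2400)
    print(f"Raw peak bandwidth: {raw:.1f} GB/s")
    print(f"With ~35% efficiency gain: {effective_bandwidth_gbs(raw, 0.35):.1f} GB/s equivalent")
```

So even on identical raw bandwidth, a GCN 1.2 part would behave roughly as if it had more bandwidth to play with, if the 35% estimate holds.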

AtenRa · Feb 24, 2016

NTMBK said:
Please, not this myth again...

A lot of the OpenCL benchmarks out there don't need a lot of memory bandwidth. That is because they are bad benchmarks. They are like Dhrystone, and don't exercise the memory subsystem. There are plenty of GPGPU loads out there that depend on memory bandwidth, and are on the GPU for that precise reason.

OK, I will word it better:

There are many compute workloads that benefit more from a higher number of compute units (CUs) than from higher memory bandwidth. There is not a single game that will not need higher memory bandwidth as you scale your CUs higher.

And yes, there are compute workloads that will benefit from higher memory bandwidth.

cbn · Feb 24, 2016

Unoid said:
I agree also.

Just look at the Fury Nano: take the wide, massive CU chip, but lower the clocks. It almost geometrically reduces power usage.

Mobile chip Carrizo with 512sp at, let's say, 1GHz
vs.
mobile chip Carrizo with 1024sp at 500MHz.

Just eyeballing it, you'd think both would have nearly the same performance, ignoring bottlenecks and assuming the same ratio of ROPs etc.

It should be near the same performance; however, operating at 500MHz, even with 2x the die space, it'll consume much less power.

That could mean the difference between throttling or not.

I'd love to see a 500mm^2 mobile Carrizo with the same 4-threaded Excavator, but make it a 3000ish SP GPU and put it on an HBM interposer with that awesome bandwidth. Clock the GPU at like 300MHz and keep the volts low. You'd still have a sub-45 watt MONSTER APU!

I don't think the Fury Nano drops clocks that low though.

In the following article the author mentions a clockspeed of 850 MHz to 900 MHz in games:

http://www.maximumpc.com/amd-r9-nano-review/

But most people are going to want to know how the Nano performs at stock, and how far it may end up dropping below its maximum 1,000MHz core clock; in our testing, it generally stays above 900MHz in most games, with occasional drops into the 850MHz range.

So dropping clocks from 1050 MHz (Fury X) to 850/900 MHz saves a lot of power... but does going from 850 MHz (or whatever) to 500 MHz reduce power to the same degree?

Or does this approach the point where reducing iGPU size makes sense even for a mobile chip?

Shehriazad · Feb 24, 2016

Highly doubt this.

This level of iGPU is not gonna be a thing in their chips until Zen APUs with HBM....unless they decided to go with on-die GDDR5...which I highly doubt...and they will NOT force board makers to put it on the MBs...for obvious reasons.

DDR4 2400 is not gonna cut it. If Kaveri had clear performance gains all the way up to 3000 MHz DDR3... DDR4 2400 would bottleneck the GPU so hard that it might actually win a prize for horrible design.

If anything: same "size" GPU with a higher clock/better optimization and DDR4 instead of DDR3. I wouldn't be surprised to see the GPU clock around 900 MHz, plus the bandwidth optimization that was promised with the (hopefully) last-gen modular design.

And I'm really not trying to hate on AMD...unless they had a magic solution to stop the bottlenecking (somehow double the bandwidth use efficiency) there is just no way....they would be wasting die space and money....two things AMD cannot afford to waste right now!

DrMrLordX · Feb 24, 2016

NTMBK said:
You need some serious shader throughput for those pachinko machines.

Ahahaha Konami has their APU of choice nao!

Yuriman said:
^ Arguably not, because (as I understand it) even 512SP parts are not scaling well with clockspeed @ 750mhz.

Eh, sort of. It depends on the application. At least on Kaveri, you can still see performance gains in games at clockspeeds up to 900-960 MHz. Beyond that, it gets a bit silly.

With the colour compression scheme of GCN 1.2, it should have a little more scaling headroom. After all, the top-end Bristol Ridge has default GPU clockspeeds in excess of 850 MHz.

Abwx · Feb 24, 2016

Yuriman said:
6 CUs clocked 33% higher would not consume 33% more power than 8 CUs... at least probably not. It depends on what voltages are needed. Those extra CUs still need power to run. I'm sure there would be gains, but I doubt you'll cut power usage in half by doubling the size of the chip.

A 33% higher frequency requires sqrt(1.33) higher voltage, which translates to 1.33^2 higher power per SP; given that there are 1.33x fewer SPs, the bottom line is 33% more power for the same throughput.

RAM bandwidth wouldn't be an issue, relatively speaking, if the CU count is doubled, as at half frequency the throughput would be identical, but the perf/watt improvement would be dramatic and somewhat needed for SKUs at 65W and below.
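The arithmetic in this post can be written out as a small sketch. It takes the post's own model at face value: dynamic power proportional to CU count x frequency x voltage squared, with voltage scaling as the square root of frequency. That voltage assumption is the contested part (it ignores leakage, static power, and any minimum-voltage floor), so treat the numbers as illustrative only. The function names are hypothetical.

```python
# Relative power and perf/watt under the model used in the post above.
# Assumptions (not measured data):
#   power      is proportional to  CU count * frequency * voltage^2
#   voltage    is proportional to  sqrt(frequency)
#   throughput is proportional to  CU count * frequency

def relative_power(cu_ratio: float, freq_ratio: float) -> float:
    """Power relative to the baseline for given CU-count and clock ratios."""
    voltage_ratio = freq_ratio ** 0.5
    return cu_ratio * freq_ratio * voltage_ratio ** 2


def relative_throughput(cu_ratio: float, freq_ratio: float) -> float:
    """Throughput relative to the baseline."""
    return cu_ratio * freq_ratio


def perf_per_watt(cu_ratio: float, freq_ratio: float) -> float:
    """Perf/watt relative to the baseline."""
    return relative_throughput(cu_ratio, freq_ratio) / relative_power(cu_ratio, freq_ratio)


if __name__ == "__main__":
    # 6 CUs clocked 33% higher, compared with 8 CUs at the base clock.
    print("6 CU @ 1.33x:", relative_power(6 / 8, 4 / 3), perf_per_watt(6 / 8, 4 / 3))
    # 16 CUs at half clock, compared with 8 CUs at the base clock.
    print("16 CU @ 0.5x:", relative_power(16 / 8, 0.5), perf_per_watt(16 / 8, 0.5))
```

Under these assumptions the 6 CU case lands at about 1.33x power and 0.75x perf/watt, and the 16 CU case at 0.5x power and 2x perf/watt, matching the figures quoted in the thread. If voltage cannot drop as far as the model assumes at low clocks, the 16 CU gain shrinks accordingly.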

bsp2020 · Feb 24, 2016

Nobody here seems to think that AMD might be bringing out 14nm Bristol Ridge.

What if AMD shrank Carrizo to 14nm? It would be about 125mm^2. Adding 8 more graphics CUs would bring it up to 185mm^2. HBM1 is about 35mm^2. So that hypothetical Bristol Ridge with HBM1 would fit nicely on a 250mm^2 interposer and fit perfectly on the FP4 packaging AMD is currently using for Carrizo.

Such a chip, if it existed, would still need less power than the current Carrizo and offer 3X the graphics performance. It would also substantially improve CPU performance (https://homes.cs.washington.edu/~oskin/pact2015.pdf). In fact, you can expect up to a 40% uplift in CPU performance, delivering what Zen promised before Zen becomes available.

BTW, did you guys notice that Stoney Ridge has a Polaris-like display controller capable of 10-bit display (http://www.amd.com/Documents/J-Family-Product-Brief.pdf)? I'm now more convinced that Stoney Ridge is 14nm and has Polaris graphics.
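The area budget in this post adds up as follows. This is only a sketch using the post's own estimates; it assumes a single HBM1 stack and ignores die-to-die spacing, keep-out zones, and routing on the interposer, all of which would eat into the margin. Variable names are illustrative.

```python
# Area budget using the estimates quoted in the post above (all in mm^2).
# Assumptions: one HBM1 stack; no spacing or routing overhead counted.

CARRIZO_14NM_MM2 = 125    # estimated Carrizo die shrunk to 14nm
WITH_EXTRA_CUS_MM2 = 185  # same die with 8 more CUs added
HBM1_STACK_MM2 = 35       # estimated footprint of one HBM1 stack
INTERPOSER_MM2 = 250      # hypothetical interposer size


def area_per_extra_cu(base_mm2: float, grown_mm2: float, extra_cus: int) -> float:
    """Implied silicon cost of each added CU."""
    return (grown_mm2 - base_mm2) / extra_cus


def interposer_margin(die_mm2: float, hbm_mm2: float, interposer_mm2: float) -> float:
    """Interposer area left over after placing the die and one HBM stack."""
    return interposer_mm2 - (die_mm2 + hbm_mm2)


if __name__ == "__main__":
    per_cu = area_per_extra_cu(CARRIZO_14NM_MM2, WITH_EXTRA_CUS_MM2, 8)
    margin = interposer_margin(WITH_EXTRA_CUS_MM2, HBM1_STACK_MM2, INTERPOSER_MM2)
    print(f"Implied area per extra CU: {per_cu} mm^2")
    print(f"Interposer area left over: {margin} mm^2")
```

On these numbers the die plus one stack uses 220mm^2 of the 250mm^2 interposer, so the fit is plausible on paper but leaves little slack.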

monstercameron · Feb 24, 2016

bsp2020 said:
Nobody here seems to think that AMD might be bringing out 14nm Bristol Ridge.

What if AMD shrank Carrizo to 14nm? It would be about 125mm^2. Adding 8 more graphics CUs would bring it up to 185mm^2. HBM1 is about 35mm^2. So that hypothetical Bristol Ridge with HBM1 would fit nicely on a 250mm^2 interposer and fit perfectly on the FP4 packaging AMD is currently using for Carrizo.

Such a chip, if it existed, would still need less power than the current Carrizo and offer 3X the graphics performance. It would also substantially improve CPU performance (https://homes.cs.washington.edu/~oskin/pact2015.pdf). In fact, you can expect up to a 40% uplift in CPU performance, delivering what Zen promised before Zen becomes available.

BTW, did you guys notice that Stoney Ridge has a Polaris-like display controller capable of 10-bit display (http://www.amd.com/Documents/J-Family-Product-Brief.pdf)? I'm now more convinced that Stoney Ridge is 14nm and has Polaris graphics.

They can upgrade the hardware blocks relatively easily due to their modular design approach.

NTMBK · Feb 25, 2016

bsp2020 said:
Nobody here seems to think that AMD might be bringing out 14nm Bristol Ridge.

What if AMD shrank Carrizo to 14nm? It would be about 125mm^2. Adding 8 more graphics CUs would bring it up to 185mm^2. HBM1 is about 35mm^2. So that hypothetical Bristol Ridge with HBM1 would fit nicely on a 250mm^2 interposer and fit perfectly on the FP4 packaging AMD is currently using for Carrizo.

Such a chip, if it existed, would still need less power than the current Carrizo and offer 3X the graphics performance. It would also substantially improve CPU performance (https://homes.cs.washington.edu/~oskin/pact2015.pdf). In fact, you can expect up to a 40% uplift in CPU performance, delivering what Zen promised before Zen becomes available.

BTW, did you guys notice that Stoney Ridge has a Polaris-like display controller capable of 10-bit display (http://www.amd.com/Documents/J-Family-Product-Brief.pdf)? I'm now more convinced that Stoney Ridge is 14nm and has Polaris graphics.

That hypothetical chip would be super expensive, waste a lot of R&D resources, and still have crap CPU performance.
