[VR-Zone]AMD upcoming Tonga GPU to be released in mid August


blackened23

Diamond Member
Jul 26, 2011
8,548
2
0
You aren't looking at the whole picture. Perf/watt is definitely one part, and Nvidia is treating it as the paramount metric. But AMD might be trying to maximize perf/mm^2, which is evident in Hawaii's transistor density. Despite having seen Kepler in action some 18 months in advance, and despite improvements in the GCN architecture, Hawaii could not compete in perf/watt vs. Kepler. Instead, Hawaii competes much more effectively in perf/mm^2.

There are more ways to compete than in just perf/watt.

It really depends on how you look at it. It doesn't matter so much for desktop, but looking at the broad market, I think performance per watt is the be-all and end-all metric, because you really cannot get your chips into mobile devices without excellent performance per watt.

For desktop enthusiasts it doesn't matter so much, but for the company providing the hardware, I think designing an efficient architecture that can range from the smallest mobile devices to the high-end discrete GPUs is probably the better bet. Getting mobile design wins depends on that entirely. So I'd have to agree that AMD would have to #1) be reasonably close to Maxwell in terms of efficiency if they want to get mobile dGPU design wins again and #2) increase the quality of their software, for their mobile products in particular. Basically, AMD's software for their mobile GPUs is a disaster and Enduro generally does not work properly. That is something AMD needs to fix, although AMD has software problems/lack of funding (I assume) on many fronts and not just mobile.

My speculation is that AMD wanted this primarily as a mobile uarch. I guess the desktop Tonga is semi-interesting, but it will be more important for AMD to get back into the mobile dGPU game with a more efficient perf/watt uarch and better software (their desktop-level software has improved a LOT, mobile not so much). Personally I think that's more of AMD's end game with Tonga. I don't think people will care about R9 280-level performance on the desktop, but if they can do that in mobile with good power consumption and much improved software (compared to their current lackluster mobile software), then it will be a big win for AMD. And really, more competition in the mobile/ultrabook dGPU space would definitely be a good thing... it's pretty lopsided right now.
 

buletaja

Member
Jul 1, 2013
80
0
66
@buletaja:
You are confusing me a bit here.

R9 M295X = 32 CUs. We know that from the previous leak.

Are you saying Tonga will have 128 ALUs per CU? Or still 64 ALUs per CU?
Or 16 CUs and 128 ALUs per CU (which it isn't)?

If you are saying 128 ALUs per CU, which means a 30-40% bigger die if it's 28nm, it should beat the R9 290X by a good margin. Which I highly doubt.

I mean Tonga is 16 CUs with 128 ALUs per CU.
That fits with its 256-bit memory controller.

Yes, AMD could still describe it as 32 CUs with 64 ALUs each,
but as compute gets more attention,
it is better to have 16 CUs with 128 ALUs
and better SIMD distribution.

Plus, this time each 8-wide SIMD has its own LSM,
just like in AMD's December 2013 slide.

LSM is not LDS; LSM has a different target than LDS.

This link about Hawaii's DX scratch bytes gives an idea of
why future GPUs will incorporate 16 EX, each with its own LSM:
http://www.tuicool.com/articles/jQ3YFz

usedScratchBytes:

The last item in the shader statistics is usedScratchBytes. When the AMD driver’s shader compiler needs to allocate more than 256 VGPRs, it spills into scratch memory. Main video memory is used for scratch, backed by the L1 and L2 caches.

Having to spill VGPRs into scratch does not come up often in practice for HLSL shaders. However, when it does, it will almost surely result in sub-optimal performance and your shader should be modified to reduce VGPR pressure and eliminate scratch usage. For example, temporary arrays (i.e. local arrays) use VGPRs and potentially can cause high VGPR usage. (As an aside, indexing into local arrays is an expensive operation on GCN, so you should be wary of relying heavily on local arrays already, beyond the potential for high VGPR usage.)

On GCN 1.0, scratch-byte usage means poorer performance,
because it has to go to GDDR5 or DDR3, which adds >100-200 cycles of latency.

So how do you overcome that, given that future GPUs/engines will focus on locality
and deeply nested branching (ray tracing)?

That is where LSM is a good idea.
LSM will be used in GCN 2.0
and is already incorporated in one of the next-gen consoles.

AMD already uses LSM (scratch memory) in the Tensilica-based TrueAudio,
and AMD said it is there to ensure fast local operation.

I believe the second enhancement is that AMD will include
a scalar unit in each EX, paired with its LSM,
so each 8-wide SIMD can fetch its own instruction bundles ---> DX12.
On those slides there is no scalar unit per CU, because each EX now has its own scalar unit.
So instead of 1 scalar unit controlling 64 ALUs (64-wide), 1 scalar unit controls one 8-wide SIMD.
So 16 EX, as the slide shows, means 16 scalar units controlling 16 x 8-wide SIMDs = 128 ALUs.

Mike Mantor's latest paper suggests improving the scalar unit in future GCN.
 

tviceman

Diamond Member
Mar 25, 2008
6,734
514
126
My personal opinion is that perf/watt is the most important metric driving MOST chip makers these days, but like I said, it's not the be-all and end-all. Especially for a company like AMD that is operating on significantly smaller margins. Going for optimized die space as the #1 priority and making perf/watt a secondary priority can make just as much sense when money is very, very tight. But at the same time, AMD isn't actively and aggressively trying to get into ultra-mobile Android products (phones, tablets, Google TV boxes). They have a reference tablet but nothing else to show for it.

For the most part, Kepler beat GCN in perf/watt, but with Pitcairn and smaller chips the differences were negligible and sometimes in AMD's favor. If AMD had mobile software just as good as Nvidia's (i.e. Enduro at the same level as Optimus, Raptr with battery-saving options a la GeForce Experience's battery saver), then they would have remained every bit as competitive in the notebook market as they have been (and continue to be) in the desktop segment. The GTX 680M had the same hardware specs as the desktop GTX 760, but with considerably lower clocks. Pitcairn was binned as the 7970M for AMD's mobile high end and went tit-for-tat with the GTX 680M. Comparisons between the refreshes since then showed relative parity until the very recent GTX 880M. So I do think that prioritizing die space can be beneficial, even in mobile. AMD's entire problem in mobile since 40nm (when Nvidia started introducing differentiating features) has been software.

I think the biggest problem facing AMD right now is that Kepler (and especially Maxwell), besides GK110, is doing the same or better work with FEWER transistors.
 

ocre

Golden Member
Dec 26, 2008
1,594
7
81
There are compromises being made all the time. They have to decide where they will spend their money and what they expect to gain. When you recognize this, what tviceman is saying can surely fit. It is all speculation from the sidelines. He is proposing that AMD took a route that would be more profitable per chip sold. Considering they sell 1/3 the chips Nvidia does, they need to make as much profit per chip as they can. If their chips are 2/3rds more profitable, they are closing the gap. They can make more money while selling fewer chips.

Of course there are many goals, but decisions are made on where to put the largest focus. AMD might actually have put heavy focus on performance per mm^2 over performance per watt. This could be the case. But it is all just pure speculation.

I myself love to speculate from the sidelines. It's just the love of discussing HW in general.
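
To put rough numbers on the 1/3 and 2/3 figures in this post, here is a back-of-the-envelope sketch. It takes both figures at face value and normalizes Nvidia's volume and per-chip profit to 1; the variable names are made up for illustration and none of this is real sales data.

```python
from fractions import Fraction

# Assumed inputs, taken at face value from the post (illustrative only).
nvidia_units = Fraction(1)             # normalize Nvidia's unit sales to 1
nvidia_profit_per_chip = Fraction(1)   # normalize Nvidia's profit per chip to 1

amd_units = nvidia_units / 3                                          # "1/3 the chips"
amd_profit_per_chip = nvidia_profit_per_chip * (1 + Fraction(2, 3))   # "2/3rds more profitable"

# Total profit is units sold times profit per chip.
nvidia_total = nvidia_units * nvidia_profit_per_chip
amd_total_equal_margin = amd_units * nvidia_profit_per_chip
amd_total_higher_margin = amd_units * amd_profit_per_chip

print("AMD total at equal margin: ", amd_total_equal_margin)    # 1/3
print("AMD total at higher margin:", amd_total_higher_margin)   # 5/9
```

Under those assumptions the gap narrows from one third to five ninths of Nvidia's total, but it does not close.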
 

Cloudfire777

Golden Member
Mar 24, 2013
1,787
95
91
Mobile Tonga is confirmed at 32 CUs, not 16 CUs.
http://videocardz.com/50737/amd-radeon-r9-m295x-tonga-gpu-32-compute-units

What is not confirmed is:
- How many CUs a full (desktop) Tonga has
- How many ALUs come with 1 CU
- HBM or not
- What other improvements come with GCN 2.0
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
- 32 CUs for the full bin, 24 to 28 for the non-full bin.
- 1 CU equals 64 32-bit ALUs
- Should have HBM
- More GPRs, discrete HSA, L3 caches

We'll see 64-bit ALUs in GCN-based architectures before we'll see two times more 32-bit ALUs.
 

tviceman

Diamond Member
Mar 25, 2008
6,734
514
126
It's not going to have HBM. HBM is too recent in the pipeline to have been included in this chip. This chip's design was finalized 15 months ago. HBM wasn't completely developed at that point by AMD.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
HBM has been in the works at AMD since before September 2011. It is only recently that AMD announced its partnership with SK Hynix. HBM is an AMD standard, crafted by AMD and shared with SK Hynix, who are the production masters of DRAM, which means HBM is high volume from the get-go.

SK Hynix HBM - Mass Production; Late Q2 2014 / Early Q3 2014.
GlobalFoundries 2.5D Si Interposer - Mass Production; Sometime Q2 2014.

The hard part is finding out who is going HBM;
http://www.linkedin.com/pub/himakiran-kodihalli/53/59/89a
http://www.linkedin.com/pub/ashfaq-shaikh/0/949/bb9

http://www.linkedin.com/in/matsmatsuo
http://www.linkedin.com/in/jinwookjang
http://www.linkedin.com/pub/brian-amick/0/9a2/45b

---
Some logical deductions;

We know for a FACT* that Tonga is eventually replacing Hawaii Pro as the W8100v2. The math is actually particularly interesting when you think about it. I never thought of dividing 2560 by 32.

2560 ÷ 32 => 80 ALUs per GCN CU.
2048 (2x HBM) ÷ 2560 => 0.8 GHz

The clock is dictated by the HBM interface, then.
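
A quick sketch of the two divisions above. The inputs are the speculation in this post, not confirmed specs, and reading "2048 (2x HBM)" as two stacks at 1024 Gbit/s each is an assumption.

```python
# Speculative inputs from the post; none of these are confirmed Tonga specs.
TOTAL_ALUS = 2560          # W8100 (Hawaii Pro) ALU count
COMPUTE_UNITS = 32         # leaked CU count for R9 M295X
HBM_GBIT_PER_S = 2048      # assumed: 2 HBM stacks at 1024 Gbit/s each

alus_per_cu = TOTAL_ALUS / COMPUTE_UNITS
implied_clock_ghz = HBM_GBIT_PER_S / TOTAL_ALUS

print(f"{alus_per_cu:.0f} ALUs per CU")            # 80 ALUs per CU
print(f"{implied_clock_ghz:.1f} GHz core clock")   # 0.8 GHz core clock
```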

The funny part of this is that I guessed the ALU part back in 2012;
http://forums.evga.com/FindPost/1740201
(Do not look at my old posts please don't jeez if I could I would erase them)

I sure hate old me always undermining myself in past tense. @2012seronx; I hate you! / @buletaja; delay your 128 ALU theory to Pirate Islands!

*

6920 Tonga
6921 Amethyst XT [Radeon R9 M295X]
692b Tonga XT GL [FirePro W8100]
692f Tonga XT GL [FirePro W8100]

Hawaii Pro for desktop = 275 Watts / Hawaii Pro GL for workstations = 220 Watts.

If we assume Hawaii Pro GL is relative to GlobalFoundries, not leaky TSMC, then a 0.5x reduction in TDP is feasible if the clock is static. 220 W * 0.5 = 110 W

(((110 Watts/20-nm - 85 watts/512b) + 42.5 watts/256b) + 14.5 watts/2048b) => ~82 watts for TDP
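
The same estimate written out step by step, as a sketch only. Every wattage is this post's guess, and reading the three terms as "drop the 512-bit interface, add a 256-bit GDDR5 interface and a 2048-bit HBM interface" is an interpretation of the formula, labeled in the comments.

```python
# All figures are the post's own guesses, in watts.
hawaii_pro_gl_tdp = 220.0
process_scaling = 0.5            # assumed 20-nm saving at a static clock
interface_512bit = 85.0          # interpreted as: 512-bit interface removed
interface_256bit = 42.5          # interpreted as: 256-bit GDDR5 interface added
interface_2048bit_hbm = 14.5     # interpreted as: 2048-bit HBM interface added

shrunk_tdp = hawaii_pro_gl_tdp * process_scaling   # 110.0
estimated_tdp = (shrunk_tdp - interface_512bit
                 + interface_256bit
                 + interface_2048bit_hbm)
print(estimated_tdp)  # 82.0
```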

Now the wait is to see if it has the same potential as the W8100(Hawaii Pro); 2560 ALUs / 160 TMUs / 64 ROPs.

---
This never happened okay?
We'll see 64-bit ALUs in GCN-based architectures before we'll see two times more 32-bit ALUs.

---
In case you guys are wondering, the clock speeds of mobile GPUs are dependent on their memory interfaces.

R9 M290X => 1228.8 Gbit/s ÷ 1280 => 0.96 GHz or 960 MHz, but the actual clock is up to 900 MHz.
R9 M295X =>
2048 Gbit/s ÷ 2560 => 0.8 GHz or 800 MHz <-- this could be the boost rate.
1408 Gbit/s ÷ 2560 => 0.55 GHz or 550 MHz <-- this could be the base rate or idle rate.
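
The rule of thumb used in these three lines, as a small sketch. The function name is made up, and the rule itself (memory bandwidth in Gbit/s divided by ALU count) is this post's conjecture rather than anything AMD has documented.

```python
def implied_clock_ghz(bandwidth_gbit_s: float, alu_count: int) -> float:
    """Poster's rule of thumb: memory bandwidth in Gbit/s divided by ALU count."""
    return bandwidth_gbit_s / alu_count

# (label, memory bandwidth in Gbit/s, ALU count), all taken from the post.
cases = [
    ("R9 M290X", 1228.8, 1280),
    ("R9 M295X (boost?)", 2048.0, 2560),
    ("R9 M295X (base?)", 1408.0, 2560),
]
for name, gbit_s, alus in cases:
    print(f"{name}: {implied_clock_ghz(gbit_s, alus) * 1000:.0f} MHz")
```

For the R9 M290X the rule gives 960 MHz against an actual clock of up to 900 MHz, so treat the M295X numbers as rough at best.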
 

Cloudfire777

Golden Member
Mar 24, 2013
1,787
95
91
[Image: iAK2lXs.jpg]
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
Bateluer cares not for mid range cards! Where's the R9 390X, AMD?
Tonga XT is a high-end card that is next to Hawaii Pro but below Hawaii XT.

Highest Model in Mobile => R9 M295X
Second Highest Model in FirePro => FirePro W8100

Technically, both of these names would mesh, leading to the conclusion that:

Tonga XT = R9 290
Tonga Pro = R9 285X
Tonga LE = R9 285

Most will also have to make note that Volcanic Islands can be or most likely will be more efficient than Sea Islands and Southern Islands.


---
Other than that, can anyone explain why Volcanic Islands (Iceland and Tonga) have a preference for the VOP2 GCN instruction set over the VOP3a/3b GCN instruction set?
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
Sorry for double posting but new information has popped up.

http://i.imgur.com/tnbJwUK.png

Another Possibility;

Amethyst XT = 0.8 GHz HBM
Tonga XT = 1.0 GHz HBM

2048-bit * 0.8 GHz = 1,638.4 GBit/s

1,638.4 GBit/s ÷ 2048 ALUs = 0.8 GHz

HBM allows for 1 ALU clock to 1 MEM clock if the bit count equals the ALU count.

Gbits ÷ ALU count = Base core clock speed.
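
A sketch of that round trip, assuming a 2048-bit HBM interface and 2048 ALUs as in the lines above. The function names are made up for illustration. Because the bus width equals the ALU count, the result is simply the memory clock that went in.

```python
def bandwidth_gbit_s(bus_width_bits: int, mem_clock_ghz: float) -> float:
    """Bus width in bits times memory clock in GHz gives Gbit/s."""
    return bus_width_bits * mem_clock_ghz

def base_core_clock_ghz(gbit_s: float, alu_count: int) -> float:
    """The post's formula: Gbits divided by ALU count."""
    return gbit_s / alu_count

BUS_WIDTH_BITS = 2048   # assumed HBM interface width
ALU_COUNT = 2048        # assumed ALU count

for label, mem_clock in [("Amethyst XT", 0.8), ("Tonga XT", 1.0)]:
    bw = bandwidth_gbit_s(BUS_WIDTH_BITS, mem_clock)
    core = base_core_clock_ghz(bw, ALU_COUNT)
    print(f"{label}: {bw:.1f} Gbit/s -> {core:.1f} GHz")
```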
 

Cloudfire777

Golden Member
Mar 24, 2013
1,787
95
91
What on earth would they do with a 1638GB/s bandwidth? Tahiti with 2048 shaders got 264GB/s.
Also, the chart you posted above says 128GB/s for 1Gb chips. How does that work? 1 gigabit = 125MB. Stack 30 of those chips together, you get 125x30 = 3750MB (about 4GB of VRAM) but the bandwidth is 128x30 = 3840GB/s?

But nice that HBM is available and ready. Question is whether AMD used it in Tonga when they had probably finished the design many months ago.
 

f1sherman

Platinum Member
Apr 5, 2011
2,243
1
0

How many patents remain just that?
And even if the company commits to the patent, how long does it take to bring a commercially available product to market?



After reading this^^ I am 99% sure HBM is N O T happening in this GPU gen.

Code:
Die stacking is catching on in FPGAs, Power Devices, and MEMs

But…

There is nothing in mainstream computing CPUs, GPUs, and APUs
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
Cloudfire777 said: "What on earth would they do with a 1638GB/s bandwidth?"
That is 1638.4 Gbit/s, or 204.8 GB/s, for the HBM L3 cache,
plus
1408 Gbit/s, or 176 GB/s, for GDDR5.

Cloudfire777 said: "Tahiti with 2048 shaders got 264GB/s."
Tahiti XTL/XT2 = 384 * 6 => 2,304 Gbit/s or 288 GByte/s
Tahiti XT = 384 * 5.5 => 2,112 Gbit/s or 264 GByte/s
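
The Gbit/s versus GByte/s mix-up is just a factor of 8. A small sketch of the conversions in this reply; the helper names are made up, and the HBM and GDDR5 figures for Tonga remain speculation.

```python
BITS_PER_BYTE = 8

def gbit_to_gbyte(gbit_s: float) -> float:
    """Convert a bandwidth in Gbit/s to GByte/s."""
    return gbit_s / BITS_PER_BYTE

def bus_gbit_s(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Bus width in bits times per-pin data rate in Gbit/s."""
    return bus_width_bits * gbps_per_pin

hbm_cache = gbit_to_gbyte(1638.4)                     # speculated HBM L3 cache
gddr5 = gbit_to_gbyte(1408)                           # speculated GDDR5 pool
tahiti_xt2 = gbit_to_gbyte(bus_gbit_s(384, 6.0))      # Tahiti XTL/XT2
tahiti_xt = gbit_to_gbyte(bus_gbit_s(384, 5.5))       # Tahiti XT

print(hbm_cache, gddr5, tahiti_xt2, tahiti_xt)  # 204.8 176.0 288.0 264.0
```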

Cloudfire777 said: "Also, the chart you posted above says 128GB/s for 1Gb chips. How does that work? 1 gigabit = 125MB. Stack 30 of those chips together, you get 125x30 = 3750MB (about 4GB of VRAM) but the bandwidth is 128x30 = 3840GB/s?"
It is a 1 GB 4-Hi stack at 0.8 Gbit/s or 1 Gbit/s per bit of bus width. It is exactly 1 GB, no more and no less.

Cloudfire777 said: "But nice that HBM is available and ready. Question is whether AMD used it in Tonga when they had probably finished the design many months ago."
Development of HBM started in 2012. This is around the same time AMD started working on 20-nm from GlobalFoundries and TSMC.

f1sherman said: "How many patents remain just that? And even if the company commits to the patent, how long does it take to bring a commercially available product to market?"
Since SK Hynix is helping, it only took AMD two years to go from patent to real product.

f1sherman said: "After reading this^^ I am 99% sure HBM is N O T happening in this GPU gen."
Scroll down to the takeaways:
Die stacking is happening in the mainstream
It is happening now because we need it
&
It is going to change who and how we build sockets in the future
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136

f1sherman

Platinum Member
Apr 5, 2011
2,243
1
0
And all this doesn't scream EARLY SAMPLING?

I am going to eat my hat if this makes it into the mainstream of the next GPU gen.
At best we are looking at the stop-gap card ala HD4770, 750/Ti.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
f1sherman said: "And all this doesn't scream EARLY SAMPLING?"
Early sampling was in 2013; we are in the volume production stage of 20-nm and HBM.

f1sherman said: "I am going to eat my hat if this makes it into the mainstream of the next GPU gen. At best we are looking at the stop-gap card ala HD4770, 750/Ti."
It is a stop-gap, as it is 20-nm planar and not 20-nm FinFETs.

http://i.imgur.com/bDlN7Lm.png

At best we are looking at the same core and ALU counts but with AVFS, HBM, etc.: a large power efficiency increase at the same or lower performance.
 

raghu78

Diamond Member
Aug 23, 2012
4,093
1,475
136
I can't say about Tonga. But it makes sense for AMD to introduce HBM at the ultra high end, where it can command high margins and the memory bandwidth is badly needed. You should remember AMD introduced GDDR5 on the HD 4870 and then the HD 4870X2, which were the top of the HD 4000 product stack. GDDR5 across the product stack did not arrive until the HD 5000 series almost 18 months later. It makes the least sense to introduce cutting-edge memory tech on entry-level models, which can easily be served by GDDR5 without the extra cost of a 2.5D silicon interposer.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,701
1,230
136
There are only two GPUs for Volcanic Islands.

Tonga and Iceland.

Maui is the dual GPU implementation of Tonga or the Logic + Logic + Memory version of it. AMD is heavily abusing the 2.5D TSV interposer to max out yields on 20-nm.

Note: Maui is a codename related not to an island but to a myth.
http://en.wikipedia.org/wiki/M%C4%81ui_%28mythology%29
 