[BitsAndChips]390X ready for launch - AMD ironing out drivers - Computex launch


ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
I'm saying that the connection to HBM is different than GDDR5, and that is written in the code somewhere. With GDDR5 you have, say, 8x32-bit spread across 8 different chips, each chip with its own memory controller. With HBM the chips are still spread out, but they are stacked, which needs a slightly different approach.

Any such code wouldn't be in a driver. And HBM modules are accessed with regular DDR commands.

[Image: AMD-Volcanic-Islands-2.0-HBM-Memory.jpeg]
 

zagitta

Member
Sep 11, 2012
27
0
0
Lots of FUD in this thread; saying the OS just sees all RAM as equal is quite naive and outdated knowledge at best.
Multi-socket motherboards and the like give the OS ample opportunity to optimize performance by allocating memory on the RAM slots belonging to the processor that the allocating process/thread is running on, to avoid expensive RAM accesses across the processor interconnect.
This is typically done through the virtual-to-physical mappings in the page tables, which are cached by the Translation Lookaside Buffer (TLB).
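
Just to make that concrete, here's a minimal sketch of node-local allocation using libnuma on Linux (link with -lnuma); the node lookup and the 64 MiB size are placeholders, nothing more:

```c
/* Sketch: allocate memory on the NUMA node local to the calling thread,
 * so accesses don't cross the processor interconnect. Assumes Linux with
 * libnuma installed; build with: gcc numa_local.c -lnuma */
#define _GNU_SOURCE            /* for sched_getcpu() */
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "no NUMA support on this system\n");
        return 1;
    }

    int cpu  = sched_getcpu();           /* CPU this thread is running on */
    int node = numa_node_of_cpu(cpu);    /* ...and its local memory node */

    size_t size = 64ull << 20;           /* 64 MiB, arbitrary for the example */
    void *buf = numa_alloc_onnode(size, node);   /* backed by the local node */
    if (!buf) return 1;

    printf("allocated %zu bytes on node %d (cpu %d)\n", size, node, cpu);
    numa_free(buf, size);
    return 0;
}
```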

Furthermore it's completely misleading to compare VRAM to RAM, since they first of all have vastly different access patterns in terms of latency and bandwidth, and secondly aren't coded for in the same way either. Regular code accessing RAM is compiled and optimized on the developer's machine, meaning optimization of memory accesses is done offline and only once (A LOT of compiler effort goes into this, because the CPU waits ~100 cycles on a cache miss; setting compiler flags for a specific CPU family will sometimes even tweak memory access patterns for that family's memory controller).
This is very different from GPUs: they don't share a common instruction set, so the shaders must be compiled at runtime by the DRIVER. What does this mean? Well, the driver becomes the compiler that optimizes the memory access patterns, which is SUPER DUPER MEGA important for the massively parallel architectures that GPUs are, since their throughput is entirely dependent on wavefronts being scheduled and interleaved with the right timing relative to memory latency and bandwidth, so that the ALUs are never idle.
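
To see what "compiled at runtime by the driver" looks like in practice, here's a minimal sketch with desktop OpenGL; it assumes a GL 3.3 context is already current and entry points are loaded (e.g. via GLEW), and the GLSL source is just a trivial texture fetch for illustration:

```c
/* Sketch: the GLSL source goes to the driver as plain text, and the driver's
 * compiler turns it into machine code for whatever GPU it is running on,
 * choosing register allocation and memory access scheduling itself.
 * Assumes an OpenGL 3.3+ context is current and functions are loaded. */
#include <GL/glew.h>
#include <stdio.h>

static const char *frag_src =
    "#version 330 core\n"
    "uniform sampler2D tex;\n"
    "in vec2 uv;\n"
    "out vec4 color;\n"
    "void main() { color = texture(tex, uv); }\n";

GLuint compile_fragment_shader(void)
{
    GLuint sh = glCreateShader(GL_FRAGMENT_SHADER);
    glShaderSource(sh, 1, &frag_src, NULL);
    glCompileShader(sh);                 /* the driver compiles this right here */

    GLint ok = GL_FALSE;
    glGetShaderiv(sh, GL_COMPILE_STATUS, &ok);
    if (!ok) {
        char log[1024];
        glGetShaderInfoLog(sh, sizeof log, NULL, log);
        fprintf(stderr, "shader compile failed: %s\n", log);
    }
    return sh;
}
```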

This scheduling is essentially black magic and highly dependent on the GPU architecture, trust me I'm a software engineer ;)
 

Gikaseixas

Platinum Member
Jul 1, 2004
2,836
218
106
It's nice to see that level of commitment from AMD, as they know they must nail it this time since their reputation is a bit lacking.

I know little about memory access, but I do believe drivers play some sort of role there. After all, games do tax memory, and software controls memory allocation/resources... once there's not enough to play with, the GPU chokes.
 

Techhog

Platinum Member
Sep 11, 2013
2,834
2
26
Users: "AMD's drivers suck! They need to improve them!"

Rumor: "AMD is delaying the 390X to ensure that the drivers are good right off the bat."

Users: "BS it should just release now!"


I'm going to bang my head on a wall. There are most likely hardware things to fix too, but given AMD's image issues it's really not that far-fetched.
 

Khato

Golden Member
Jul 15, 2001
1,251
321
136
trust me I'm a software engineer ;)

Since when should I trust those software guys who can get it wrong as many times as they want so long as the end result works and is delivered on time? Sorry, I have too much fun teasing software at any opportunity ;)

That said, I would tend to disagree with your assessment of the relationship between memory latency/bandwidth and scheduling on current graphics architectures. Now for prior graphics architectures there most definitely was dependency on such when it comes to GPGPU functionality, but the majority of that has been removed with dynamic scheduling capabilities. Graphics workloads, however, have always been relatively agnostic to latency, no? Primarily because they run the same small piece of code how many times over? Enough to keep all compute resources active until the memory latency on the first round of execution has been exceeded.
 

xthetenth

Golden Member
Oct 14, 2014
1,800
529
106
It's pretty obvious to basically everyone here that there's a yawning chasm between drivers that work, and drivers optimized for high performance on a card, but I'm glad someone gave a bit more detail so we have specific examples of it.
 

itsmydamnation

Diamond Member
Feb 6, 2011
3,044
3,831
136
Since when should I trust those software guys who can get it wrong as many times as they want so long as the end result works and is delivered on time? Sorry, I have too much fun teasing software at any opportunity ;)

That said, I would tend to disagree with your assessment of the relationship between memory latency/bandwidth and scheduling on current graphics architectures. Now for prior graphics architectures there most definitely was dependency on such when it comes to GPGPU functionality, but the majority of that has been removed with dynamic scheduling capabilities. Graphics workloads, however, have always been relatively agnostic to latency, no? Primarily because they run the same small piece of code how many times over? Enough to keep all compute resources active until the memory latency on the first round of execution has been exceeded.

Go read places like B3D where there are lots of console devs; GPU memory latency is super important, and register pressure is super important. On the consoles you have the ability to control these yourself. On PC you have the driver. On B3D you will also find a lot of talk about how the current GCN compiler is very conservative in the way it allocates registers and over-allocates regularly, which can cause big performance cliffs.

So yes, the driver matters a lot. Really, think about it: executing instructions is cheap and easy, moving data is hard and expensive. So if memory management doesn't need a driver, then what does.....
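
A back-of-the-envelope sketch of the cliff I mean, using the commonly cited GCN figures (256 VGPRs per SIMD shared between wavefronts, at most 10 resident wavefronts per SIMD; allocation granularity ignored to keep it simple):

```c
/* Sketch: why one extra register per thread can cost a whole wavefront of
 * latency hiding. Figures are the commonly cited GCN ones, simplified. */
#include <stdio.h>

static int waves_per_simd(int vgprs_per_thread)
{
    int by_regs = 256 / vgprs_per_thread;   /* integer division: the cliff */
    return by_regs > 10 ? 10 : by_regs;     /* hardware cap of 10 waves */
}

int main(void)
{
    for (int v = 24; v <= 33; v++)
        printf("%2d VGPRs per thread -> %d waves/SIMD\n", v, waves_per_simd(v));
    return 0;
}
```

Fewer resident waves means less memory latency can be hidden, which is exactly where a conservative or over-allocating compiler hurts.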
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
The original memory discussion isn't about what the driver does with the memory or how it handles it in terms of processing. It's about the fact that there is no difference from the driver's perspective whether the memory controller is fitted with GDDR, HBM or anything else.
 

Cloudfire777

Golden Member
Mar 24, 2013
1,787
95
91
Lots of FUD in this thread; saying the OS just sees all RAM as equal is quite naive and outdated knowledge at best.
Multi-socket motherboards and the like give the OS ample opportunity to optimize performance by allocating memory on the RAM slots belonging to the processor that the allocating process/thread is running on, to avoid expensive RAM accesses across the processor interconnect.
This is typically done through the virtual-to-physical mappings in the page tables, which are cached by the Translation Lookaside Buffer (TLB).

Furthermore it's completely misleading to compare VRAM to RAM, since they first of all have vastly different access patterns in terms of latency and bandwidth, and secondly aren't coded for in the same way either. Regular code accessing RAM is compiled and optimized on the developer's machine, meaning optimization of memory accesses is done offline and only once (A LOT of compiler effort goes into this, because the CPU waits ~100 cycles on a cache miss; setting compiler flags for a specific CPU family will sometimes even tweak memory access patterns for that family's memory controller).
This is very different from GPUs: they don't share a common instruction set, so the shaders must be compiled at runtime by the DRIVER. What does this mean? Well, the driver becomes the compiler that optimizes the memory access patterns, which is SUPER DUPER MEGA important for the massively parallel architectures that GPUs are, since their throughput is entirely dependent on wavefronts being scheduled and interleaved with the right timing relative to memory latency and bandwidth, so that the ALUs are never idle.

This scheduling is essentially black magic and highly dependent on the GPU architecture, trust me I'm a software engineer ;)

Thank you for the in-depth explanation. :thumbsup:
I guess AMD faced some small challenges on the driver side. Maybe not in making it work, but in optimizing it fully and making it stable.

As long as we don't have to wait any longer than June, I'm happy.
 
Feb 19, 2009
10,457
10
76
Go read places like B3D where there are lots of console devs; GPU memory latency is super important

Latency is very important to keeping shaders fed. I tested lots of VRAM settings during the mining craze; even lowering VRAM clocks, when it decreased latency, resulted in massive throughput gains. It was totally against what we were expecting back then: everyone pushed VRAM OCs to the max, but typically performance did not improve unless the clock speed fell into certain ranges that reduced latency.

An entirely new memory subsystem for moving data around within the GPU definitely requires new drivers that take advantage of it.
 

biostud

Lifer
Feb 27, 2003
19,730
6,808
136
http://techreport.com/news/27994/let-handicap-the-2015-gpu-race-for-a-moment

Scott thinks they need to sell off their present inventory first.

Could make sense:

Shipments of standalone graphics cards in the first quarter of 2015 are projected to decrease by a whopping 20 – 25 per cent quarter-over-quarter, reports DigiTimes citing sources in the industry. The decline from Q4 to Q1 is unprecedented.

http://www.kitguru.net/components/g...aphics-cards-drops-dramatically-in-q1-report/
 
Feb 19, 2009
10,457
10
76
http://techreport.com/news/27994/let-handicap-the-2015-gpu-race-for-a-moment

Scott thinks they need to sell off their present inventory first.

That's in regard to why we didn't get a "full 384-bit Tonga"... which he does not seem to realize is actually supplied to Apple as the M295X.

It doesn't address why there's no Fiji. The supply argument does NOT fly against simple logic: it wouldn't be in the same price segment, so it wouldn't affect sales of the lower products. We're talking a potential $599/$799 R9 390/390X segment.

They had samples a long time ago, they have cards at AIBs, and journalists have seen systems running on the 390X. Why hasn't it launched?

1. Not enough inventory; could be linked to limited Asetek/Cooler Master units, or to low GF 28nm yields, etc.

2. Drivers aren't ready and it's underperforming; AMD thinks they can harness much more performance.

3. Drivers aren't ready and CF doesn't work (new architecture, new memory management). This could potentially be a problem serious enough to prevent a launch.

Edit: There's no way AMD is stupid enough to pit full Tonga (280X + ~10% perf/power) against the 970/980 as the new 380X. We've seen 370X leaks which put it at ~GTX 780 performance levels. Full Tonga would not even match the 370X... it's a new architecture with vastly improved perf/W and perf/shader.
 

Cloudfire777

Golden Member
Mar 24, 2013
1,787
95
91
AMD surely must have exhausted their rebranding options by now?

The 300 series had better be all new and not come with some rebranded Tonga crap. They need new cards across the entire line to recoup market share and to sell cards. They can't expect the 390/390X to carry the entire weight.

I don't think TechReport's speculation is correct anyway. We know the R9 370 is brand new and very efficient. The R9 390X/390 too. The 380X had better be a scaled-down Fiji.
 

snorge

Member
Dec 30, 2011
32
0
0
The XFX R9 370 is releasing in April. Recent rumors suggested GTX 780 performance, so it should be able to compete with the GTX 960 and beat it. That's where they will start, I think.

Don't forget, though, that Nvidia still has the GTX 960 Ti/965 to release, which showed up in some of the leaked benchmarks. It appears to be a lot faster than the regular 960. I'm guessing regular 960 prices will then go down quite a bit.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
It's still not 384-bits.

Yes, but it's likely full Tonga, or at least full Tonga with the memory controller not fully used due to power efficiency concerns.

I don't think an inventory glut explains the lack of a 300 series, at least not past January. The way AMD's market share is being hit, it's not as if stretching out the release moves many units at this point. Unless one of the reasons the GM of AMD's Computing and Graphics Business Group left (or perhaps was let go) at the end of the year was making the terrible call of ordering a bunch (2x, 3x) of Hawaii wafers even as the mining craze trailed off into oblivion, leaving retailers desperately trying to clear out hundreds of thousands of units. Doubtful that was the case, though; the price on Hawaii would have been lower pre-970/980 than it was. Much more likely they're just having problems getting the HBM version of the 300 series to a shippable state. Missing a GPU refresh by 6 months is probably worse than AIBs and retailers losing a few million dollars selling 290(X)s at a loss after the 300 series launch.
 

Khato

Golden Member
Jul 15, 2001
1,251
321
136
Go read places like B3D where there are lots of console devs; GPU memory latency is super important, and register pressure is super important. On the consoles you have the ability to control these yourself. On PC you have the driver. On B3D you will also find a lot of talk about how the current GCN compiler is very conservative in the way it allocates registers and over-allocates regularly, which can cause big performance cliffs.

So yes, the driver matters a lot. Really, think about it: executing instructions is cheap and easy, moving data is hard and expensive. So if memory management doesn't need a driver, then what does.....

Now that's a perfectly valid point - if they're attempting to cut it too close on their register space then that would indeed become the limiting factor. I have a hard time believing that AMD would design their hardware/software quite so poorly as to be hitting such issues, but it's entirely possible.

Anyway, I stand corrected. Memory latency does indeed affect the hardware design decision of how much register space to build into the design, and if that's inadequate or borderline, then a poor driver will bottleneck throughput by running out of registers. It would be a rather odd scenario, though, considering that HBM latency should be lower than GDDR5's, and hence it would only affect the driver if they attempted to reduce the number of available registers to recover a bit of die space... not something I'd expect for a first iteration.

Latency is very important to keeping shaders fed. I tested lots of VRAM settings during the mining craze; even lowering VRAM clocks, when it decreased latency, resulted in massive throughput gains. It was totally against what we were expecting back then: everyone pushed VRAM OCs to the max, but typically performance did not improve unless the clock speed fell into certain ranges that reduced latency.

An entirely new memory subsystem for moving data around within the GPU definitely requires new drivers that take advantage of it.

Correct, but keep in mind that GPGPU workloads are quite different in nature compared to graphics.
 

BFG10K

Lifer
Aug 14, 2000
22,709
3,002
126
Its about that there is no difference from the drivers perspective if the memory controller is fitted with GDDR, HBM or anything else.
This is simply untrue. An elementary example is the GeForce3, which wasn't much faster than the GeForce2 at launch until later drivers took full advantage of the new memory configuration.

A device driver is about as close to the metal as you can get, because it programs the registers/memory/etc. directly. There are no layers below it that abstract away the difference between GDDR and HBM; that abstraction happens at the API level, not in the driver.

I have absolutely no trouble believing the driver is being heavily tuned to take optimal advantage of the new bandwidth structure it has available.
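
For a rough feel of what "programs the registers directly" means, here's a user-space sketch that maps a PCI BAR through Linux sysfs and reads a register; the device path and the 0x100 offset are purely hypothetical, and a real driver does the equivalent from kernel space using offsets from the hardware documentation:

```c
/* Sketch: memory-mapped register access over a PCI BAR. Requires root and
 * Linux sysfs. The device path and register offset are hypothetical. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/sys/bus/pci/devices/0000:01:00.0/resource0",
                  O_RDWR | O_SYNC);
    if (fd < 0) { perror("open"); return 1; }

    volatile uint32_t *mmio = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
    if (mmio == MAP_FAILED) { perror("mmap"); close(fd); return 1; }

    uint32_t value = mmio[0x100 / 4];    /* read a (hypothetical) register */
    printf("reg 0x100 = 0x%08x\n", value);

    munmap((void *)mmio, 4096);
    close(fd);
    return 0;
}
```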
 

boozzer

Golden Member
Jan 12, 2012
1,549
18
81
Lots of FUD in this thread; saying the OS just sees all RAM as equal is quite naive and outdated knowledge at best.
Multi-socket motherboards and the like give the OS ample opportunity to optimize performance by allocating memory on the RAM slots belonging to the processor that the allocating process/thread is running on, to avoid expensive RAM accesses across the processor interconnect.
This is typically done through the virtual-to-physical mappings in the page tables, which are cached by the Translation Lookaside Buffer (TLB).

Furthermore it's completely misleading to compare VRAM to RAM, since they first of all have vastly different access patterns in terms of latency and bandwidth, and secondly aren't coded for in the same way either. Regular code accessing RAM is compiled and optimized on the developer's machine, meaning optimization of memory accesses is done offline and only once (A LOT of compiler effort goes into this, because the CPU waits ~100 cycles on a cache miss; setting compiler flags for a specific CPU family will sometimes even tweak memory access patterns for that family's memory controller).
This is very different from GPUs: they don't share a common instruction set, so the shaders must be compiled at runtime by the DRIVER. What does this mean? Well, the driver becomes the compiler that optimizes the memory access patterns, which is SUPER DUPER MEGA important for the massively parallel architectures that GPUs are, since their throughput is entirely dependent on wavefronts being scheduled and interleaved with the right timing relative to memory latency and bandwidth, so that the ALUs are never idle.

This scheduling is essentially black magic and highly dependent on the GPU architecture, trust me I'm a software engineer ;)
This post just owned the entire driver discussion :thumbsup:
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
145
106
This is simply untrue. An elementary example is the GeForce3, which wasn't much faster than the GeForce2 at launch until later drivers took full advantage of the new memory configuration.

A device driver is about as close to the metal as you can get, because it programs the registers/memory/etc. directly. There are no layers below it that abstract away the difference between GDDR and HBM; that abstraction happens at the API level, not in the driver.

I have absolutely no trouble believing the driver is being heavily tuned to take optimal advantage of the new bandwidth structure it has available.

Optimization is a completely different thing. And the fact that you have to show it with two completely different GPUs, the NV1x and NV20 series (programmable shaders etc.), proves the point. Nobody says that a driver can't optimize a GPU's performance. But saying you need a special driver just because someone placed HBM modules on a card instead of GDDR is rubbish.

People should at least claim it is due to the new uarch if they want to fuel the speculation.

And bandwidth-wise there may not even be any relevant difference in terms of the bandwidth/SP ratio.

290X = 113 MB/s per SP.
390X = 125 MB/s per SP.
 

DownTheSky

Senior member
Apr 7, 2013
800
167
116
And bandwidth-wise there may not even be any relevant difference in terms of the bandwidth/SP ratio.

290X = 113 MB/s per SP.
390X = 125 MB/s per SP.

How is it not relevant?

290X = 113.63 MB/s per SP
390X = 156.25 MB/s per SP

The 390X has HBM @ 1.25 GHz
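
For reference, here's where both sets of numbers in this exchange come from, plugging in the rumored figures (320 GB/s and 2816 SPs for the 290X; a 4096-bit HBM interface at 1.0 or 1.25 Gbps per pin and a rumored 4096 SPs for the 390X; all of it speculation at this point):

```c
/* Sketch: bandwidth per SP under the rumored configurations. */
#include <stdio.h>

static void per_sp(const char *name, double gb_per_s, int sps)
{
    printf("%-24s %5.0f GB/s / %4d SPs = %6.2f MB/s per SP\n",
           name, gb_per_s, sps, gb_per_s * 1000.0 / sps);
}

int main(void)
{
    per_sp("290X (GDDR5, 5 Gbps)",  320.0, 2816);   /* ~113.6 */
    per_sp("390X (HBM, 1.0 Gbps)",  512.0, 4096);   /* 125.0  */
    per_sp("390X (HBM, 1.25 Gbps)", 640.0, 4096);   /* 156.25 */
    return 0;
}
```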
 

Skurge

Diamond Member
Aug 17, 2009
5,195
1
71
How is it not relevant?

290X = 113.63 MB/s per SP
390X = 156.25 MB/s per SP

The 390X has HBM @ 1.25 GHz

After saying the drivers would be the same, he now backtracks, saying sure, you can optimize it, but you don't need a special driver. Nobody said you need a special driver.