Relative Memory Bandwidth

Lepton87 · May 30, 2016

What's going on with the memory bandwidth available for every GFLOP of compute? Did memory bandwidth requirements go down significantly in recent games due to a different mix of workloads? It seems that pure compute is getting more important than texturing and other workloads specific to graphics rendering. Maybe on top of that there were some great advancements in memory compression technology, along with bigger caches?

I analyzed the memory bandwidth available for every GFLOP of theoretical compute, and it turns out that it's been going down and the downward trend is accelerating. The utilization of all that computational power is also going up, which makes the bandwidth scarcity even more notable than the cold numbers would suggest. I did the analysis on NV hardware, but the trend is mostly the same on AMD's silicon. I'll start with the first modern unified shader architecture, which as everyone probably remembers is the G80 in the GeForce 8800 GTX, one of the greatest leaps in graphics performance.

Bandwidth 86.4GB/s, compute performance 518 GFLOPs, BW-to-compute ratio 167MB/s per GFLOP. That's the most in my comparison.

65nm G92 in 9800GTX 70.4GB/s 648GFLOPs 108MB/s per GFLOP.
55nm G92b in 9800GTX+ 70.4GB/s 705GFLOPs 99MB/s per GFLOP.
65nm GT200 in GTX 280 141GB/s 933GFLOPs 151MB/s per GFLOP. This one was a massive increase in bandwidth and the return of high memory bandwidth per GFLOP of compute.
55nm GT200B in GTX 285 159GB/s 1062GFLOPs 149MB/s per GFLOP.
40nm GF100 in GTX480 177 GB/s 1345GFLOPs 131MB/s per GFLOP.
40nm GF110 in GTX580 192 GB/s 1581GFLOPs 121MB/s per GFLOP.
28nm GK104 in GTX680 192GB/s 3000GFLOPs 64MB/s per GFLOP. That's the biggest reduction.
28nm GK110 in GTX780 288GB/s 4000GFLOPs 72MB/s per GFLOP. It seems that after every big relative reduction there's an increase, and of course this card uses a GK110 that is cut down quite a lot on the compute side but not on the memory controllers.
28nm GK110 in GTX780TI 336GB/s 5046GFLOPs 66MB/s per GFLOP.
28nm GM200 in TITAN X 336GB/s 6144GFLOPs 54MB/s per GFLOP.
16nm GP104 in GTX1080 320GB/s 8228GFLOPs 38MB/s per GFLOP.
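
The ratio in the list above is just bandwidth divided by theoretical compute, so it is easy to recompute. Here is a minimal Python sketch using the bandwidth and GFLOPs figures quoted in this post; the names `CARDS` and `mb_per_gflop` are made up for illustration, and 1 GB is taken as 1000 MB. Results can differ by 1 from the list depending on whether you round or truncate.

```python
# (chip, card, bandwidth in GB/s, theoretical GFLOPs) as quoted in this thread.
# Assumption: 86.4 GB/s is used for the 8800 GTX, the spec figure that
# yields the 167 MB/s per GFLOP quoted for it.
CARDS = [
    ("G80",    "8800 GTX",   86.4,  518),
    ("GT200",  "GTX 280",   141.0,  933),
    ("GF110",  "GTX 580",   192.0, 1581),
    ("GK104",  "GTX 680",   192.0, 3000),
    ("GK110",  "GTX 780",   288.0, 4000),
    ("GK110",  "GTX 780 Ti", 336.0, 5046),
    ("GM200",  "TITAN X",   336.0, 6144),
    ("GP104",  "GTX 1080",  320.0, 8228),
]


def mb_per_gflop(bandwidth_gbs, gflops):
    """Memory bandwidth in MB/s available per GFLOP of theoretical compute."""
    return bandwidth_gbs * 1000 / gflops


for chip, card, bw, gflops in CARDS:
    print(f"{chip:6} {card:11} {mb_per_gflop(bw, gflops):6.1f} MB/s per GFLOP")
```

Dividing the first entry's ratio by the last one gives the overall drop from G80 to the 1080, which comes out a bit over 4x.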

GP104's memory bandwidth looks very bad in relative terms. Did that card really not need a wider memory interface, or is the improved memory compression enough? How low can we go? Just for kicks I added an APU to see how it compares.

A10 7870K (886 GFLOPS) with dual-channel DDR3-2133: 34GB/s. Is that the fastest memory that can be run?
38MB/s per GFLOP. That's the same as GP104. We all know that the APU is heavily memory bottlenecked and that this bandwidth has to be shared with the CPU, but still, that's the same as the 1080. What's the 1080's trick for getting away with such a relatively puny amount of BW? Is the memory compression really that effective? Anyway, this looks promising for the performance of upcoming APUs.
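
For anyone wanting to check the 34GB/s figure: peak bandwidth is the transfer rate times the bus width. A rough sketch, assuming dual-channel DDR3 means two 64-bit channels (a 128-bit bus in total) and ignoring real-world efficiency; the function name `peak_bandwidth_gbs` is just illustrative.

```python
def peak_bandwidth_gbs(transfers_mt_s, bus_width_bits):
    """Theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_mt_s: effective transfer rate in megatransfers per second
    bus_width_bits: total memory bus width in bits
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfers_mt_s * 1e6 * bytes_per_transfer / 1e9


# Dual-channel DDR3-2133: two 64-bit channels at 2133 MT/s.
apu_bw = peak_bandwidth_gbs(2133, 2 * 64)

# Ratio against the 886 GFLOPS quoted above for the A10 7870K.
apu_ratio = apu_bw * 1000 / 886

print(f"{apu_bw:.1f} GB/s, {apu_ratio:.1f} MB/s per GFLOP")
```

This is a theoretical peak only; as noted above, the CPU cores share the same bandwidth, so the GPU side of the APU sees less in practice.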

It's very interesting that such a mature technology as memory compression can still be improved by such a huge amount in just one generation.

Flapdrol1337 · May 30, 2016

I read an AnandTech article about HBM, and it said GDDR5 memory controllers on the GPU die don't really shrink with a smaller process node.

I guess the die space is better spent on more gpu cores and trickery to get as much out of the memory as possible than on a wider interface.

crisium · May 30, 2016

The compression has definitely gotten better. Tech Report usually has some data that can demonstrate the generational improvements.

The 980, with 224 GB/s, overall outperforms the 780 Ti which has 336 GB/s, in a bandwidth test.

The 680, which has equal memory bandwidth and less pixel fillrate than the GTX 580, outperforms it on a color fill test. According to TR, this test is usually limited by memory bandwidth. So the 680, despite its lower ROP throughput, is able to outshine the 'better on paper' 580 thanks to better bandwidth compression.

Also, the equal-bandwidth 680 isn't held back in texture fillrate by the memory bandwidth either, easily crushing the 580.

[Image: 3dm-tex-fill.gif]

I'm sure every generation improves. I don't fully understand all these tests, but each architecture makes improvements independent of memory bandwidth.

Lepton87 · May 30, 2016

Obviously bandwidth utilization gets better every generation and caches get bigger, but still, from G80 to the 1080 we have a more than 4X reduction in bandwidth available per unit of compute (167 vs. 38 MB/s per GFLOP). That's pretty massive. With the 1080 we are at the relative bandwidth level of APUs, and APUs are considered to be notoriously bandwidth-bottlenecked.

dogen1 · May 30, 2016

In addition to improved compression, larger cache sizes help a lot too.

Lepton87 · May 30, 2016

dogen1 said:
In addition to improved compression, larger cache sizes help a lot too.

Sure, I wrote about it. The best example is the i7 5775C.
