NVIDIA Volta Rumor Thread


Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
Stop with this register file nonsense. Maxwell doubled the ROPs, quadrupled the cache, brought a 40% clock increase, new delta color compression, and tile-based rasterization. That is why it was so much better than Kepler.
Explain then why the Maxwell GTX 980, with 2048 CUDA cores, was faster in compute than the GTX 780 Ti, which had 2880 CUDA cores?

[Attached: compute benchmark charts comparing the GTX 980 and GTX 780 Ti]


ROPs alone will not bring you any meaningful performance uplift. Increased cache size - maybe, if your cores have enough throughput to make use of it, which is what happened with the Maxwell transition.

Tile-based rasterization does not increase performance in a meaningful way by itself; it improves memory bandwidth efficiency and the overall efficiency of the GPU under load.

And lastly, you cannot say that the compute performance increase came from 40% higher core clocks. That is pretty much BS.

The GTX 980 has 4.1 TFLOPs of compute power, the GTX 780 Ti 5.5 TFLOPs. And yet the GTX 980 was faster in BOTH compute and gaming.

So, do I still write nonsense about Register File Sizes, or do they have a meaningful impact on GPU performance?

Oh, and here is proof:
http://www.anandtech.com/show/7764/the-nvidia-geforce-gtx-750-ti-and-gtx-750-review-maxwell/3
NVIDIA hasn’t given us hard numbers on SMM power efficiency, but for space efficiency a single 128 CUDA core SMM can deliver 90% of the performance of a 192 CUDA core SMX at a much smaller size.

It is just because of the shift from 192 cores per 256 KB register file to 128 cores per 256 KB register file. Those 128 Maxwell cores did the same amount of work as the 192 cores in Kepler, which means they were far less starved for work.

With Volta and the shift to 64 cores per 256 KB register file, those 64 cores will do the same amount of work as the 128 cores in Maxwell/consumer Pascal.
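Rough numbers to illustrate (a sketch only; the 65,536 × 32-bit per-SM register file is from NVIDIA's published specs, and the registers-per-core framing is my own simplification):

```python
# Register-file budget per FP32 CUDA core in one SM.
# 65,536 x 32-bit registers = 256 KB per SM; the core count per SM changes by generation.
REGFILE_KB = 65536 * 4 / 1024  # 256 KB

for arch, cores in [("Kepler SMX", 192), ("Maxwell/consumer Pascal SM", 128), ("Volta SM", 64)]:
    print(f"{arch}: {REGFILE_KB / cores:.2f} KB of registers per core")

# Kepler SMX: 1.33 KB of registers per core
# Maxwell/consumer Pascal SM: 2.00 KB of registers per core
# Volta SM: 4.00 KB of registers per core
```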
 

Head1985

Golden Member
Jul 8, 2014
1,864
686
136
Explain then why the Maxwell GTX 980, with 2048 CUDA cores, was faster in compute than the GTX 780 Ti, which had 2880 CUDA cores?

The GTX 980 has 4.1 TFLOPs of compute power, the GTX 780 Ti 5.5 TFLOPs. And yet the GTX 980 was faster in BOTH compute and gaming. [...]
This is a gaming forum, right? Why are you talking about compute while the rest of us are talking about Volta/Maxwell gaming performance?
Btw, the GTX 980 doesn't have 4.1 TFLOPs. The GTX 980 reference card boosts to around 1250 MHz:
2048 × 1250 MHz × 2 = 5.12 TFLOPs

The 780 Ti reference card boosts to around 900 MHz:
2880 × 900 MHz × 2 = 5.18 TFLOPs
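A quick sanity check of that math (just a sketch; the boost clocks above are approximate reference values):

```python
# Peak FP32 throughput = CUDA cores x clock (Hz) x 2 FLOPs per FMA.
def peak_fp32_tflops(cuda_cores: int, clock_mhz: float) -> float:
    return cuda_cores * clock_mhz * 1e6 * 2 / 1e12

print(f"GTX 980    @ 1250 MHz: {peak_fp32_tflops(2048, 1250):.2f} TFLOPs")  # 5.12
print(f"GTX 780 Ti @  900 MHz: {peak_fp32_tflops(2880, 900):.2f} TFLOPs")   # 5.18
```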
 

Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
This is a gaming forum, right? Why are you talking about compute while the rest of us are talking about Volta/Maxwell gaming performance?
Btw, the GTX 980 doesn't have 4.1 TFLOPs. The GTX 980 reference card boosts to around 1250 MHz:
2048 × 1250 MHz × 2 = 5.12 TFLOPs

The 780 Ti reference card boosts to around 900 MHz:
2880 × 900 MHz × 2 = 5.18 TFLOPs
I'm using compute to show you that it was not only core clock, ROPs, and increased cache that brought the increase in performance. It was also the sheer throughput of the cores, which you have to increase if you increase the cache sizes; otherwise your GPU will be starved for resources. The gaming performance uplift came from this in the same way the compute uplift did.
 

Head1985

Golden Member
Jul 8, 2014
1,864
686
136
Jesus... whatever. Sure, Volta will be 50% faster in games than Pascal just because of registers...
 

Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
Jesus... whatever. Sure, Volta will be 50% faster in games than Pascal just because of registers...
It's not. But core IPC is reliant on register file size. The whole performance of a GPU relies on cache size, ROPs, memory bandwidth, cache structure, scheduling, etc.

It's all about feeding the cores. There is no point in increasing core throughput if you are unable to feed those cores with work.
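To make "feeding the cores" a bit more concrete, here is a rough occupancy-style sketch (my own simplified numbers; real limits also involve per-SM warp/block caps and shared memory):

```python
# Warps an SM can keep resident when limited only by the register file.
# Fewer resident warps = less latency hiding = cores sitting idle more often.
def resident_warps(regfile_regs: int, regs_per_thread: int, threads_per_warp: int = 32) -> int:
    return regfile_regs // (regs_per_thread * threads_per_warp)

REGFILE = 65536  # 32-bit registers per SM (Kepler SMX, Maxwell SMM, Volta SM)
for regs_per_thread in (32, 64, 128):
    print(f"{regs_per_thread:3d} regs/thread -> {resident_warps(REGFILE, regs_per_thread)} resident warps")
# 32 regs/thread -> 64, 64 -> 32, 128 -> 16
```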
 

DooKey

Golden Member
Nov 9, 2005
1,811
458
136
I expect Volta to be a beast, but I expect the price to be beastly as well. I'm really going to be tested this time when I see the Titan Volta price. I'm expecting $1400 with Ti coming out later for $800.
 

raghu78

Diamond Member
Aug 23, 2012
4,093
1,475
136
It's not. But core IPC is reliant on register file size. The whole performance of a GPU relies on cache size, ROPs, memory bandwidth, cache structure, scheduling, etc.

It's all about feeding the cores. There is no point in increasing core throughput if you are unable to feed those cores with work.

You are oversimplifying GPU design to one or two factors when it's not that simple. Register file size might impact throughput, but that's not the only thing that matters. It's much more complicated: design choices are made with respect to performance, area, and efficiency goals. Maxwell was a huge architectural jump from Kepler. FYI, a 192 CUDA core Kepler SMX had the same register file size (65536 × 32-bit) as a 128 CUDA core Maxwell SMM, but that alone was not the reason for the improved perf.

http://www.anandtech.com/show/8526/nvidia-geforce-gtx-980-review/2

"Starting with the Maxwell 1 SMM, NVIDIA has adjusted their streaming multiprocessor layout to achieve better efficiency. Whereas the Kepler SMX was for all practical purposes a large, flat design with 4 warp schedulers and 15 different execution blocks, the SMM has been heavily partitioned. Physically each SMM is still one contiguous unit, not really all that different from an SMX. But logically the execution blocks which each warp scheduler can access have been greatly curtailed.

The end result is that in an SMX the 4 warp schedulers would share most of their execution resources and work out which warp was on which execution resource for any given cycle. But on an SMM, the warp schedulers are removed from each other and given complete dominion over a far smaller collection of execution resources. No longer do warp schedulers have to share FP32 CUDA cores, special function units, or load/store units, as each of those is replicated across each partition. Only texture units and FP64 CUDA cores are shared.

Among the changes NVIDIA made to reduce power consumption, this is among the greatest. Shared resources, though extremely useful when you have the workloads to fill them, do have drawbacks. They’re wasting space and power if not fed, the crossbar to connect all of them is not particularly cheap on a power or area basis, and there is additional scheduling overhead from having to coordinate the actions of those warp schedulers. By forgoing the shared resources NVIDIA loses out on some of the performance benefits from the design, but what they gain in power and space efficiency more than makes up for it.


NVIDIA still isn’t sharing hard numbers on SMM power efficiency, but for space efficiency a single 128 CUDA core SMM can deliver 90% of the performance of a 192 CUDA core SMX at a much smaller size."

Maxwell's improved SMM efficiency allowed it to clock much higher than Kepler (this was a combination of multiple factors: a mature process, a more efficient SMM, and, my guess, other optimizations to sustain much higher frequencies). Most importantly, Nvidia's driver software is also one of the major contributors to performance gains, as it handles the scheduling of instructions.

http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3

With Kepler, Nvidia moved to software scheduling, and with each generation they improve the drivers based on what they learned from the previous ones. We have to wait and see how the architectural changes in Volta improve its performance and efficiency relative to Pascal. But do not oversimplify everything down to register file size.
 
Aug 20, 2015
60
38
61
Maxwell/Pascal were holding up (or rather, dominating) impressively enough, but Volta looks to be another step beyond and appears very impressive indeed. Nvidia's improvements have been nothing short of amazing these past few years.

I don't take any pleasure in it, but honestly, RTG seem to be finished (for at least the next few years). Vega is completely dead in the water and if Navi still insists on bloating up GCN's carcass further instead of bringing a new architecture to the scene, it likely will be too.
 

Det0x

Golden Member
Sep 11, 2014
1,028
2,953
136
HPC Innovation Lab. September 27 2017 said:
In both PCIe and SXM2 versions, V100 is >40% faster than P100 in FP32 for both NV-Caffe and MXNet. This matches the theoretical speedup. Because FP32 is single precision floating points, and V100 is 1.5x faster than P100 in single precision. With TensorFlow, V100 is more than 30% faster than P100. Its performance improvement is lower than the other two frameworks and we think that is because of different algorithm implementations in these frameworks.

In both PCIe and SXM2 versions, V100 is >2x faster than P100 in FP16. Based on the specification, V100 tensor performance is ~6x than P100 FP16. The reason that the actual speedup does not match the theoretical speedup is that not all data are stored in FP16 and so not all operations are tensor operations (the FMA matrix multiply and add operation).

In V100, the performance of FP16 is close to 2x than that of FP32. This is because FP16 only requires half storage compared to FP32 and therefore we could double the batch size in FP16 to improve the computation speed.

[Attached: V100 vs. P100 performance charts and results table from the Dell HPC blog post]

HPC Innovation Lab. September 27 2017 said:
Conclusions and Future Work
After evaluating the performance of V100 with three popular deep learning frameworks, we conclude that in training V100 is more than 40% faster than P100 in FP32 and more than 100% faster in FP16, and in inference V100 is 3.7x faster than P100. This demonstrates the performance benefits when the V100 tensor cores are used. In the future work, we will evaluate different data type combinations in FP16 and study the accuracy impact with FP16 in deep learning training. We will also evaluate the TensorFlow with FP16 once support is added into the software. Finally, we plan to scale the training to multiple nodes with these frameworks.

Recommended read over at Dell community for those interested.

My takeaway:

V100 is roughly 40% faster than P100 in FP32, largely thanks to ~40% more CUDA cores at about the same clock speed.

V100 is over twice as fast as P100 in FP16, helped by double-rate FP16 ("rapid packed math" style: two FP16 operations packed into a single FP32 lane) and by the tensor cores highlighted in the quoted conclusions.
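A rough check of the FP32 claim (a sketch only; the core counts and boost clocks are the published SXM2 specs):

```python
# Theoretical FP32 peak, V100 (SXM2) vs P100 (SXM2), from published specs.
def fp32_tflops(cores: int, boost_mhz: float) -> float:
    return cores * boost_mhz * 1e6 * 2 / 1e12

v100 = fp32_tflops(5120, 1530)  # ~15.7 TFLOPs
p100 = fp32_tflops(3584, 1480)  # ~10.6 TFLOPs
print(f"V100/P100 FP32 ratio: {v100 / p100:.2f}x")  # ~1.48x, in line with the >40% measured
```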
 

TheF34RChannel

Senior member
May 18, 2017
786
309
136
It's still too quiet around Volta consumer products, quieter than I'd like, and the fact that new current-generation products are still being released tells us it's still a ways off. However, I'm eager to learn more and want one ASAP...
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
I'm expecting leaks to start November/December and continue up to launch, March-May time IMO

I expect Nvidia will keep everything very quiet until after the holiday selling season, since they don't want people holding off on buying Pascal during the holidays.

In the slow season early in the new year, information will get looser.
 

Dayman1225

Golden Member
Aug 14, 2017
1,152
974
146
I expect Nvidia will keep everything very quiet until after the holiday selling season, since they don't want people holding off on buying Pascal during the holidays.

In the slow season early in the new year, information will get looser.

Nvidia is big; it'll be hard to keep it all under control. Same problem Intel faces, but on a smaller scale.
 

moonbogg

Lifer
Jan 8, 2011
10,635
3,095
136
I'm ready to drop my pile on a new Volta Ti. I'll have to wait a while though, because they will do the usual song and dance: release the mid-rangers for $700 and $500, then the Titan for $1500, then finally, a year later, the Ti for $800. Prices reflect future truth, not past observations.
 

tviceman

Diamond Member
Mar 25, 2008
6,734
514
126
www.facebook.com
V100 being 40% faster than P100 at FP32 is a good indicator that Volta will be a leap over Pascal similar to the one Maxwell made over Kepler at Maxwell's launch. 12nm FFN doesn't improve density much, but it does offer 30% better efficiency over 16nm FF. Coupled with architecture enhancements, 50% better perf/W is very realistic. Die sizes will probably be similar to Maxwell's too (i.e. GV100 ~600 mm², GV104 ~400 mm², etc.).
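The rough arithmetic behind that guess (a sketch only; the ~15% architectural figure is my own placeholder, not a quoted number):

```python
# Composing the claimed process gain with a guessed architectural gain.
process_gain = 1.30  # ~30% better efficiency from 12nm FFN (claim above)
arch_gain = 1.15     # placeholder architectural improvement
print(f"Combined perf/W gain: ~{process_gain * arch_gain:.2f}x")  # ~1.49x, roughly the 50% figure
```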
 

Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
I want GTX 1060 3 GB performance out of a GTX 2050, for the price of a GTX 1050, and I want it yesterday!
 

TheF34RChannel

Senior member
May 18, 2017
786
309
136
The single most exciting hardware upgrade to me remains the GPU. Therefore I cannot hold out for a Ti and will very likely hop on as soon as the 2080s release :D

- Wow this post adds nothing to the discussion.
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
I'm ready to drop my pile on a new Volta Ti. I'll have to wait a while though, because they will do the usual song and dance: release the mid-rangers for $700 and $500, then the Titan for $1500, then finally, a year later, the Ti for $800. Prices reflect future truth, not past observations.

Except this year all prices will be 30% higher (matching the performance gain) because of "miners", which I feel is starting to be used at least partially as cover for price gouging.