NVIDIA Volta Rumor Thread


Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
Stop with this register file nonsense. Maxwell doubled the ROPs, quadrupled the cache, brought a 40% clock increase, new delta color compression, and tile-based rasterization. That is why it was so much better than Kepler.
Explain then why the Maxwell GTX 980, with 2048 CUDA cores, was faster in compute than the GTX 780 Ti, which had 2880 CUDA cores?

[Attached: compute benchmark charts comparing the GTX 980 and GTX 780 Ti]


ROPs alone will not bring you any meaningful performance uplift. Increased cache size - maybe, if your cores have enough throughput to make use of it, which is what happened with the Maxwell transition.

Tile-based rasterization does not increase performance in a meaningful way by itself; it improves memory bandwidth efficiency and the overall efficiency of the GPU under load.

And lastly, you cannot say that the compute performance increase came from 40% higher core clocks. That is pretty much BS.

The GTX 980 has 4.1 TFLOPs of compute power, the GTX 780 Ti 5.5 TFLOPs. And yet the GTX 980 was faster in BOTH compute and gaming.

So, do I still write nonsense about Register File Sizes, or do they have a meaningful impact on GPU performance?

Oh, and here is proof:
http://www.anandtech.com/show/7764/the-nvidia-geforce-gtx-750-ti-and-gtx-750-review-maxwell/3
NVIDIA hasn’t given us hard numbers on SMM power efficiency, but for space efficiency a single 128 CUDA core SMM can deliver 90% of the performance of a 192 CUDA core SMX at a much smaller size.

It is just because of the shift from 192 cores per 256 KB register file to 128 cores per 256 KB register file. Those 128 Maxwell cores did the same amount of work as the 192 cores in Kepler, which means they were far less starved for work.

With Volta and the shift to 64 cores per 256 KB register file, those 64 cores will do the same amount of work as the 128 cores in Maxwell/consumer Pascal.
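Rough numbers to illustrate (a sketch only; the 65,536 × 32-bit per-SM register file is from NVIDIA's published specs, and the registers-per-core framing is my own simplification):

```python
# Register-file budget per FP32 CUDA core in one SM.
# 65,536 x 32-bit registers = 256 KB per SM; the core count per SM changes by generation.
REGFILE_KB = 65536 * 4 / 1024  # 256 KB

for arch, cores in [("Kepler SMX", 192), ("Maxwell/consumer Pascal SM", 128), ("Volta SM", 64)]:
    print(f"{arch}: {REGFILE_KB / cores:.2f} KB of registers per core")

# Kepler SMX: 1.33 KB of registers per core
# Maxwell/consumer Pascal SM: 2.00 KB of registers per core
# Volta SM: 4.00 KB of registers per core
```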
 

Head1985

Golden Member
Jul 8, 2014
1,864
686
136
Explain then why the Maxwell GTX 980, with 2048 CUDA cores, was faster in compute than the GTX 780 Ti, which had 2880 CUDA cores?

The GTX 980 has 4.1 TFLOPs of compute power, the GTX 780 Ti 5.5 TFLOPs. And yet the GTX 980 was faster in BOTH compute and gaming. [...]
This is a gaming forum, right? Why are you talking about compute while the rest of us are talking about Volta/Maxwell gaming performance?
Btw, the GTX 980 doesn't have 4.1 TFLOPs. The GTX 980 reference card boosts to around 1250 MHz:
2048 × 1250 MHz × 2 = 5.12 TFLOPs

The 780 Ti reference card boosts to around 900 MHz:
2880 × 900 MHz × 2 = 5.18 TFLOPs
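A quick sanity check of that math (just a sketch; the boost clocks above are approximate reference values):

```python
# Peak FP32 throughput = CUDA cores x clock (Hz) x 2 FLOPs per FMA.
def peak_fp32_tflops(cuda_cores: int, clock_mhz: float) -> float:
    return cuda_cores * clock_mhz * 1e6 * 2 / 1e12

print(f"GTX 980    @ 1250 MHz: {peak_fp32_tflops(2048, 1250):.2f} TFLOPs")  # 5.12
print(f"GTX 780 Ti @  900 MHz: {peak_fp32_tflops(2880, 900):.2f} TFLOPs")   # 5.18
```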
 

Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
This is a gaming forum, right? Why are you talking about compute while the rest of us are talking about Volta/Maxwell gaming performance?
Btw, the GTX 980 doesn't have 4.1 TFLOPs. The GTX 980 reference card boosts to around 1250 MHz:
2048 × 1250 MHz × 2 = 5.12 TFLOPs

The 780 Ti reference card boosts to around 900 MHz:
2880 × 900 MHz × 2 = 5.18 TFLOPs
I'm using compute to show you that it was not only core clock, ROPs, and increased cache that brought the increase in performance. It was also the sheer throughput of the cores, which you have to increase if you increase the cache sizes; otherwise your GPU will be starved for resources. The gaming performance uplift came from this in the same way the compute uplift did.
 

Head1985

Golden Member
Jul 8, 2014
1,864
686
136
Jesus... whatever. Sure, Volta will be 50% faster in games than Pascal just because of registers...
 

Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
Jesus... whatever. Sure, Volta will be 50% faster in games than Pascal just because of registers...
It's not. But core IPC is reliant on register file size. The whole performance of a GPU relies on cache size, ROPs, memory bandwidth, cache structure, scheduling, etc.

It's all about feeding the cores. There is no point in increasing core throughput if you are unable to feed those cores with work.
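To make "feeding the cores" a bit more concrete, here is a rough occupancy-style sketch (my own simplified numbers; real limits also involve per-SM warp/block caps and shared memory):

```python
# Warps an SM can keep resident when limited only by the register file.
# Fewer resident warps = less latency hiding = cores sitting idle more often.
def resident_warps(regfile_regs: int, regs_per_thread: int, threads_per_warp: int = 32) -> int:
    return regfile_regs // (regs_per_thread * threads_per_warp)

REGFILE = 65536  # 32-bit registers per SM (Kepler SMX, Maxwell SMM, Volta SM)
for regs_per_thread in (32, 64, 128):
    print(f"{regs_per_thread:3d} regs/thread -> {resident_warps(REGFILE, regs_per_thread)} resident warps")
# 32 regs/thread -> 64, 64 -> 32, 128 -> 16
```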
 

DooKey

Golden Member
Nov 9, 2005
1,811
458
136
I expect Volta to be a beast, but I expect the price to be beastly as well. I'm really going to be tested this time when I see the Titan Volta price. I'm expecting $1400 with Ti coming out later for $800.
 

raghu78

Diamond Member
Aug 23, 2012
4,093
1,475
136
It's not. But core IPC is reliant on register file size. The whole performance of a GPU relies on cache size, ROPs, memory bandwidth, cache structure, scheduling, etc.

It's all about feeding the cores. There is no point in increasing core throughput if you are unable to feed those cores with work.

You are oversimplifying GPU design to one or two factors when it's not that simple. Register file size might impact throughput, but that's not the only thing that matters. It's much more complicated: design choices are made with respect to performance, area, and efficiency goals. Maxwell was a huge architectural jump from Kepler. FYI, a 192 CUDA core Kepler SMX had the same register file size (65536 × 32-bit) as a 128 CUDA core Maxwell SMM, but that alone was not the reason for the improved perf.

http://www.anandtech.com/show/8526/nvidia-geforce-gtx-980-review/2

"Starting with the Maxwell 1 SMM, NVIDIA has adjusted their streaming multiprocessor layout to achieve better efficiency. Whereas the Kepler SMX was for all practical purposes a large, flat design with 4 warp schedulers and 15 different execution blocks, the SMM has been heavily partitioned. Physically each SMM is still one contiguous unit, not really all that different from an SMX. But logically the execution blocks which each warp scheduler can access have been greatly curtailed.

The end result is that in an SMX the 4 warp schedulers would share most of their execution resources and work out which warp was on which execution resource for any given cycle. But on an SMM, the warp schedulers are removed from each other and given complete dominion over a far smaller collection of execution resources. No longer do warp schedulers have to share FP32 CUDA cores, special function units, or load/store units, as each of those is replicated across each partition. Only texture units and FP64 CUDA cores are shared.

Among the changes NVIDIA made to reduce power consumption, this is among the greatest. Shared resources, though extremely useful when you have the workloads to fill them, do have drawbacks. They’re wasting space and power if not fed, the crossbar to connect all of them is not particularly cheap on a power or area basis, and there is additional scheduling overhead from having to coordinate the actions of those warp schedulers. By forgoing the shared resources NVIDIA loses out on some of the performance benefits from the design, but what they gain in power and space efficiency more than makes up for it.


NVIDIA still isn’t sharing hard numbers on SMM power efficiency, but for space efficiency a single 128 CUDA core SMM can deliver 90% of the performance of a 192 CUDA core SMX at a much smaller size."

Maxwell's improved SMM efficiency allowed it to clock much higher than Kepler (this was a combination of multiple factors: a mature process, a more efficient SMM, and, my guess, other optimizations to sustain much higher frequencies). Most importantly, Nvidia's driver software is also one of the major contributors to performance gains, as it handles the scheduling of instructions.

http://www.anandtech.com/show/5699/nvidia-geforce-gtx-680-review/3

With Kepler, Nvidia moved to software scheduling, and with each generation they improve the drivers based on what they learned from the previous ones. We have to wait and see how the architectural changes in Volta improve its performance and efficiency relative to Pascal. But do not oversimplify everything down to register file size.
 
Aug 20, 2015
60
38
61
Maxwell/Pascal were holding up (or rather, dominating) impressively enough, but Volta looks to be another step beyond and appears very impressive indeed. Nvidia's improvements have been nothing short of amazing these past few years.

I don't take any pleasure in it, but honestly, RTG seem to be finished (for at least the next few years). Vega is completely dead in the water and if Navi still insists on bloating up GCN's carcass further instead of bringing a new architecture to the scene, it likely will be too.
 

Det0x

Golden Member
Sep 11, 2014
1,028
2,953
136
HPC Innovation Lab. September 27 2017 said:
In both PCIe and SXM2 versions, V100 is >40% faster than P100 in FP32 for both NV-Caffe and MXNet. This matches the theoretical speedup. Because FP32 is single precision floating points, and V100 is 1.5x faster than P100 in single precision. With TensorFlow, V100 is more than 30% faster than P100. Its performance improvement is lower than the other two frameworks and we think that is because of different algorithm implementations in these frameworks.

In both PCIe and SXM2 versions, V100 is >2x faster than P100 in FP16. Based on the specification, V100 tensor performance is ~6x than P100 FP16. The reason that the actual speedup does not match the theoretical speedup is that not all data are stored in FP16 and so not all operations are tensor operations (the FMA matrix multiply and add operation).

In V100, the performance of FP16 is close to 2x than that of FP32. This is because FP16 only requires half storage compared to FP32 and therefore we could double the batch size in FP16 to improve the computation speed.

[Attached: V100 vs. P100 performance charts and results table from the Dell HPC blog post]

HPC Innovation Lab. September 27 2017 said:
Conclusions and Future Work
After evaluating the performance of V100 with three popular deep learning frameworks, we conclude that in training V100 is more than 40% faster than P100 in FP32 and more than 100% faster in FP16, and in inference V100 is 3.7x faster than P100. This demonstrates the performance benefits when the V100 tensor cores are used. In the future work, we will evaluate different data type combinations in FP16 and study the accuracy impact with FP16 in deep learning training. We will also evaluate the TensorFlow with FP16 once support is added into the software. Finally, we plan to scale the training to multiple nodes with these frameworks.

Recommended read over at Dell community for those interested.

My takeaway:

V100 is roughly 40% faster than P100 in FP32, largely thanks to ~40% more CUDA cores at about the same clock speed.

V100 is over twice as fast as P100 in FP16, helped by double-rate FP16 ("rapid packed math" style: two FP16 operations packed into a single FP32 lane) and by the tensor cores highlighted in the quoted conclusions.
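A rough check of the FP32 claim (a sketch only; the core counts and boost clocks are the published SXM2 specs):

```python
# Theoretical FP32 peak, V100 (SXM2) vs P100 (SXM2), from published specs.
def fp32_tflops(cores: int, boost_mhz: float) -> float:
    return cores * boost_mhz * 1e6 * 2 / 1e12

v100 = fp32_tflops(5120, 1530)  # ~15.7 TFLOPs
p100 = fp32_tflops(3584, 1480)  # ~10.6 TFLOPs
print(f"V100/P100 FP32 ratio: {v100 / p100:.2f}x")  # ~1.48x, in line with the >40% measured
```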
 

TheF34RChannel

Senior member
May 18, 2017
786
309
136
It's still too quiet around Volta consumer products, quieter than I'd like, and the fact that new current-generation products are still being released tells us it's still a ways off. However, I'm eager to learn more and want one ASAP...
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
I'm expecting leaks to start November/December and continue up to launch, March-May time IMO

I expect Nvidia will keep everything very quiet until after the holiday selling season, since they don't want people holding off on buying Pascal during the holidays.

In the slow season early in the new year, information will get looser.
 

Dayman1225

Golden Member
Aug 14, 2017
1,152
974
146
I expect Nvidia will keep everything very quiet until after the holiday selling season, since they don't want people holding off on buying Pascal during the holidays.

In the slow season early in the new year, information will get looser.

Nvidia is big; it'll be hard to keep it all under control. Same problem Intel faces, but on a smaller scale.
 

moonbogg

Lifer
Jan 8, 2011
10,635
3,095
136
I'm ready to drop my pile on a new Volta Ti. I'll have to wait a while though, because they will do the usual song and dance: release the mid-rangers for $700 and $500, then the Titan for $1500, then finally, a year later, the Ti for $800. Prices reflect future truth, not past observations.
 

tviceman

Diamond Member
Mar 25, 2008
6,734
514
126
www.facebook.com
V100 being 40% faster than P100 at FP32 is a good indicator that Volta will be a leap over Pascal similar to the one Maxwell made over Kepler at Maxwell's launch. 12nm FFN doesn't improve density much, but it does offer 30% better efficiency over 16nm FF. Coupled with architecture enhancements, 50% better perf/W is very realistic. Die sizes will probably be similar to Maxwell's too (i.e. GV100 ~600 mm², GV104 ~400 mm², etc.).
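The rough arithmetic behind that guess (a sketch only; the ~15% architectural figure is my own placeholder, not a quoted number):

```python
# Composing the claimed process gain with a guessed architectural gain.
process_gain = 1.30  # ~30% better efficiency from 12nm FFN (claim above)
arch_gain = 1.15     # placeholder architectural improvement
print(f"Combined perf/W gain: ~{process_gain * arch_gain:.2f}x")  # ~1.49x, roughly the 50% figure
```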
 

Glo.

Diamond Member
Apr 25, 2015
5,705
4,549
136
I want GTX 1060 3 GB performance out of a GTX 2050, for the price of a GTX 1050, and I want it yesterday!
 

TheF34RChannel

Senior member
May 18, 2017
786
309
136
The single most exciting hardware upgrade to me remains the GPU. Therefore I cannot hold out for a Ti and will very likely hop on as soon as the 2080s release :D

- Wow this post adds nothing to the discussion.
 

PeterScott

Platinum Member
Jul 7, 2017
2,605
1,540
136
I'm ready to drop my pile on a new Volta Ti. I'll have to wait a while though, because they will do the usual song and dance: release the mid-rangers for $700 and $500, then the Titan for $1500, then finally, a year later, the Ti for $800. Prices reflect future truth, not past observations.

Except this year all prices will be 30% higher (matching the performance gain) because of "miners", which I feel is starting to be used at least partially as cover for price gouging.