William Gaatjes

May 11, 2008
I have been reading this article, and these people know what they are talking about.
The Case for ECC Memory in Nvidia's Next GPU

I know that a lot of rounding errors are made in the graphical output for games, simply because there is no need for that accuracy when the image itself does not even look photorealistic. I was amazed by this text in the article:

For double precision, a GT200 or GT200b is only 65% or 88% faster than Nehalem; computing the results twice would make the GPU slower than a standard CPU.


I always assumed GPUs were calculation beasts. That was with single precision calculations; now it seems all too obvious to me. Larrabee is just an experiment for the end result: a minimal 8-core Nehalem with vector capabilities. My guess is the end result will be a Larrabee-like chip, but with Nehalem-like cores instead of the Atom derivatives the first incarnation of Larrabee has. As we all know by now, Nvidia is building a graphics chip with ECC called Fermi. I am sure this ECC is selectable and can be left out of the cheaper graphics cards for home users who do not need the reliability.


Here is another article about the calculation power of the various GPUs/CPUs.

Computational Efficiency in Modern Processors


GPUs of today are monstrously powerful compute devices for explicit and embarrassingly parallel workloads. They are multi-core processors with 10 (AMD) to 30 (Nvidia) cores per GPU, and each core executes extremely wide vectors. The Nvidia GT200 is capable of 933 or 622 GFLOP/s single precision (SP), depending on how you count, and 77 GFLOP/s double precision (DP). The competing AMD RV770 can execute up to 1.2 SP TFLOP/s and 240 DP GFLOP/s. In contrast, a high-end CPU like Nehalem can achieve roughly 102 GFLOP/s and 51 GFLOP/s for single and double precision respectively.
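
Those peak numbers fall straight out of multiplying execution units by FLOPs per clock by clock speed. A quick sanity check (the unit counts and clocks are my own assumptions taken from the usual published specs, so treat this as a back-of-the-envelope sketch):

Code:
# Rough peak-rate arithmetic: units * FLOPs-per-clock * clock (GHz) = GFLOP/s.
# Unit counts and clock speeds are assumptions based on published specs.
def peak_gflops(units, flops_per_clock, clock_ghz):
    return units * flops_per_clock * clock_ghz

# Nvidia GT200: 240 shader ALUs at ~1.296 GHz, MAD = 2 FLOPs (+1 if you count the SFU MUL)
print(peak_gflops(240, 2, 1.296))  # ~622 GFLOP/s SP
print(peak_gflops(240, 3, 1.296))  # ~933 GFLOP/s SP
print(peak_gflops(30, 2, 1.296))   # ~78 GFLOP/s DP (one DP unit per multiprocessor)

# AMD RV770: 800 ALUs at 750 MHz; DP runs at one fifth the SP rate
sp = peak_gflops(800, 2, 0.750)    # ~1200 GFLOP/s SP
print(sp, sp / 5)                  # ~240 GFLOP/s DP

# Intel Nehalem: 4 cores, 4-wide SSE add + 4-wide SSE mul per clock, ~3.2 GHz
print(peak_gflops(4, 8, 3.2))      # ~102 GFLOP/s SP
print(peak_gflops(4, 4, 3.2))      # ~51 GFLOP/s DP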

If I see those numbers on the AMD RV770, I wonder if AMD will also make a GPU with ECC capabilities like Nvidia is doing. Then we might be surprised by AMD presenting a GPGPU that meets the specifications of the high-performance computing world. AMD processors are widely used in the server world. They might as well make an x86/ATI type rack, just like the Cell blades from IBM.




On a side note, what is NASA using for the calculations on the data coming from this telescope:

Fermi Gamma-ray Space Telescope.







 

BenSkywalker

Diamond Member
Oct 9, 1999
I always assumed GPUs were calculation beasts. That was with single precision calculations; now it seems all too obvious to me.

This is one of the two major differences between G100-based parts and GT2x0 offerings. G100 will be back to roughly an order of magnitude faster than CPUs at DP, still only half the speed of SP performance, but far beyond the reach of current desktop CPUs.
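
Rough math, assuming the rumored ~512 shader ALUs and a ~1.3 GHz shader clock for G100 (neither figure is confirmed), with DP at half the SP rate:

Code:
# Back-of-the-envelope DP comparison; the G100 ALU count and clock are assumptions.
fermi_dp   = 512 * 2 * 1.3 / 2   # FMA = 2 FLOPs per clock, halved for DP -> ~666 GFLOP/s
nehalem_dp = 4 * 4 * 3.2         # 4 cores * 4 DP FLOPs/clock * 3.2 GHz   -> ~51 GFLOP/s
print(fermi_dp / nehalem_dp)     # ~13x, i.e. roughly an order of magnitude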

Larrabee is just an experiment for the end result: a minimal 8-core Nehalem with vector capabilities. My guess is the end result will be a Larrabee-like chip, but with Nehalem-like cores instead of the Atom derivatives the first incarnation of Larrabee has.

This would be a bad design on several different fronts. In order for Larrabee to be competitive on the HPC front it will need somewhere in the range of 64 or more cores. It won't be remotely competitive with 8. If you use a Nehalem-style core you are wasting too much die space dealing with OoO to be competitive on a transistor basis; GPUs will simply kill you in terms of functional units. Going with the P54 core makes sense for Intel here, as it is the best in-order core they have, and a significantly updated version of it may be used in future versions, but using one of the monster OoO cores would fail, horribly, at going toe to toe with a GPU for HPC-style applications.

The other angle of that approach is that while an 8-core Nehalem-style chip would be significantly better at general purpose applications, you are wasting large chunks of die space, as typical applications aren't going to benefit from that many cores due to threading issues. They also don't tend to be accelerated nearly as much by increased vector performance (if at all). Intel is using the best tools it has available at this time for the approach they are taking. Larrabee will not be a competitive GPU, but for HPC tasks it actually should be fairly competitive.
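
To put rough numbers on the core-count argument (the vector widths and clocks here are my assumptions, not real Larrabee specs):

Code:
# Peak SP GFLOP/s = cores * vector lanes * 2 (mul+add or FMA) * clock (GHz).
# All core counts, widths and clocks are assumptions, purely to illustrate the scaling.
eight_nehalem = 8 * 4 * 2 * 3.2     # 8 big OoO cores, 4-wide SSE, 3.2 GHz      ->  ~205
larrabee_64   = 64 * 16 * 2 * 1.0   # 64 simple in-order cores, 16-wide, 1 GHz  ->  2048
gt200         = 240 * 2 * 1.296     # GT200 for reference (MAD only)            ->  ~622
rv770         = 800 * 2 * 0.750     # RV770 for reference                       ->  1200
print(eight_nehalem, larrabee_64, gt200, rv770)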

If I see those numbers on the AMD RV770, I wonder if AMD will also make a GPU with ECC capabilities like Nvidia is doing.

Not with their current architecture. While the raw FP performance is there, the issue with AMD GPUs at the moment is their lack of instruction support and lack of on-die memory compared to nV's offerings. If you check out the compliance level of their parts, they aren't even close to full x87 the way Fermi is (the GT200 is missing a couple of features as well, but it is still much closer than ATi's parts). Certain tasks that can be done simply on GT200 would be considerably more complex on ATi's offerings, reducing their effective performance by a rather large amount. Also, the accuracy of their 'DP' is not to spec; for applications that require DP, that is not a tolerable position.

This isn't knocking ATi GPUs in the least: they made a choice to dedicate their die space to gaming performance and add the minimal computational hardware that allows them to hit DX specs. That is a completely valid approach, and one that makes sense given that they are in the CPU business. nVidia needs to push GPGPU computing hard as they don't have a CPU business at all, and without pushing into the computational market they run the risk of an evaporating market moving forward, as both AMD and Intel are interested in having the GPU on die with the CPU years down the road.

AMD processors are widely used in the server world. They might as well make an x86/ATI type rack, just like the Cell blades from IBM.

ATi needs to add the missing instruction support, adjust their DP precision and add ECC support to compete in that market. They have the raw performance numbers, but that is only one element. GT200 was already ahead of them in those areas; Fermi has them feature-complete for this market. If ATi were to match up with Fermi, their die size would increase a decent amount, the cost of their chips would go up a decent amount, and their yields would suffer as a result, making them more costly on multiple fronts. Is that worth it for AMD currently? Given their financial situation at the moment, and considering that they (when paired with Cell) are already in the fastest supercomputer in the world, I wouldn't think so. Yes, nVidia is going to take the world's fastest supercomputer title in the not too distant future; perhaps AMD should focus on being the support chips in that system if it really means that much to them. A CPU or GPU sale for them is still a sale.
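
Rough illustration of the yield point, using the textbook first-order yield model with a made-up defect density and approximate die areas:

Code:
import math

# First-order (Poisson) yield model: yield falls off exponentially with die area.
# The defect density is invented and the areas are approximate, for illustration only.
def die_yield(area_mm2, defects_per_mm2=0.004):
    return math.exp(-defects_per_mm2 * area_mm2)

print(die_yield(334))  # Cypress-class die, ~334 mm^2 -> ~0.26
print(die_yield(530))  # Fermi-class die,   ~530 mm^2 -> ~0.12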
 

Keysplayr

Elite Member
Jan 16, 2003
Originally posted by: SSChevy2001
Originally posted by: William Gaatjes
If I see those numbers on the AMD RV770, I wonder if AMD will also make a GPU with ECC capabilities like Nvidia is doing.
AMD/ATi 5870/5850 already has EDC & temperature compensation, which is not the same as what Nvidia is doing, but it works without having to increase the bus size.
http://www.anandtech.com/video/showdoc.aspx?i=3643&p=12
From your link:

"Finally, we should also note that this error detection scheme is only for detecting bus errors. Errors in the GDDR5 memory modules or errors in the memory controller will not be detected, so it?s still possible to end up with bad data should either of those two devices malfunction. By the same token this is solely a detection scheme, so there are no error correction abilities. The only way to correct a transmission error is to keep trying until the bus gets it right."

So it's error "Detection" only? Not "Correction"? ATI would need to move to ECC to get the attention of a large portion of the scientific/academic communities.

Supposedly, ATI is said to be moving to ECC in their next gen "6xxx" product.

 

BD231

Lifer
Feb 26, 2001
Originally posted by: Keysplayr
Originally posted by: SSChevy2001
Originally posted by: William Gaatjes
If I see those numbers on the AMD RV770, I wonder if AMD will also make a GPU with ECC capabilities like Nvidia is doing.
AMD/ATi 5870/5850 already has EDC & temperature compensation, which is not the same as what Nvidia is doing, but it works without having to increase the bus size.
http://www.anandtech.com/video/showdoc.aspx?i=3643&p=12
From your link:

"Finally, we should also note that this error detection scheme is only for detecting bus errors. Errors in the GDDR5 memory modules or errors in the memory controller will not be detected, so it?s still possible to end up with bad data should either of those two devices malfunction. By the same token this is solely a detection scheme, so there are no error correction abilities. The only way to correct a transmission error is to keep trying until the bus gets it right."

So it's error "Detection" only? Not "Correction"? ATI would need to move to ECC to get the attention of a large portion of the scientific/academic communities.

Supposedly, ATI is said to be moving to ECC in their next gen "6xxx" product.

The end result is no artifacts when the cards overheat; overclockers will be happy about this. It will be harder to determine your max overclock because you can't really see the errors any longer with this feature. At least it won't freeze when you've pushed your card too far; you will, however, get a drop in performance if your overclock is no good.
 

Keysplayr

Elite Member
Jan 16, 2003
Originally posted by: BD231
Originally posted by: Keysplayr
Originally posted by: SSChevy2001
Originally posted by: William Gaatjes
If I see those numbers on the AMD RV770, I wonder if AMD will also make a GPU with ECC capabilities like Nvidia is doing.
AMD/ATi 5870/5850 already has EDC & temperature compensation, which is not the same as what Nvidia is doing, but it works without having to increase the bus size.
http://www.anandtech.com/video/showdoc.aspx?i=3643&p=12
From your link:

"Finally, we should also note that this error detection scheme is only for detecting bus errors. Errors in the GDDR5 memory modules or errors in the memory controller will not be detected, so it?s still possible to end up with bad data should either of those two devices malfunction. By the same token this is solely a detection scheme, so there are no error correction abilities. The only way to correct a transmission error is to keep trying until the bus gets it right."

So it's error "Detection" only? Not "Correction"? ATI would need to move to ECC to get the attention of a large portion of the scientific/academic communities.

Supposedly, ATI is said to be moving to ECC in their next gen "6xxx" product.

The end result is no artifacts when the cards overheat; overclockers will be happy about this. It will be harder to determine your max overclock because you can't really see the errors any longer with this feature. At least it won't freeze when you've pushed your card too far; you will, however, get a drop in performance if your overclock is no good.

Slowdown in performance because errors have to keep retrying the bus due to o/c'ing?
The end result is cool. I was just curious about the route "to" the end result. Interesting stuff.
 

BD231

Lifer
Feb 26, 2001
The memory controller itself detects 1- or 2-bit errors with 100% accuracy, and a retransmission request is made until the data comes through correctly. William pointed out in one of his quotes that GPUs are embarrassingly parallel, which is true; the vast majority of errors should be limited to 1-bit and 2-bit errors, so this feature is actually quite effective.

Those same errors can arise from heat as well, so it's a thermal feature too.
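
Here's a rough sketch of that detect-and-retransmit idea. It's only an illustration of the concept, not ATi's actual hardware; the CRC-8 polynomial and the error model are made up for the example:

Code:
import random

def crc8(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 over a burst; the polynomial here is illustrative."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def noisy_bus(burst: bytes, error_rate: float) -> bytes:
    """Pretend memory bus: occasionally flips one bit somewhere in the burst."""
    if random.random() < error_rate:
        i = random.randrange(len(burst))
        return bytes(b ^ (1 << random.randrange(8)) if j == i else b
                     for j, b in enumerate(burst))
    return burst

def transfer(burst: bytes, error_rate: float) -> int:
    """Detect and retry: resend until the receiver's CRC matches; returns attempts used."""
    sent_crc = crc8(burst)              # the checksum travels alongside the data
    attempts = 0
    while True:
        attempts += 1
        received = noisy_bus(burst, error_rate)
        if crc8(received) == sent_crc:  # mismatch -> the controller asks for a resend
            return attempts

# A bad overclock shows up as extra attempts, i.e. lost effective bandwidth.
trials = [transfer(b"\xde\xad\xbe\xef", 0.2) for _ in range(10000)]
print(sum(trials) / len(trials))        # ~1.25 transmissions per burst at a 20% error rate

That last number is where the slowdown comes from: every failed burst costs another trip over the bus instead of corrupting the output.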
 

Keysplayr

Elite Member
Jan 16, 2003
Originally posted by: BD231
The memory controller itself detects 1- or 2-bit errors with 100% accuracy, and a retransmission request is made until the data comes through correctly. William pointed out in one of his quotes that GPUs are embarrassingly parallel, which is true; the vast majority of errors should be limited to 1-bit and 2-bit errors, so this feature is actually quite effective.
Those same errors can arise from heat as well, so it's a thermal feature too.

Now how does this compare to ECC? Just as effective, or more so?

 

SSChevy2001

Senior member
Jul 9, 2008
Originally posted by: Keysplayr
So it's error "Detection" only? Not "Correction"? ATI would need to move to ECC to get the attention of a large portion of the scientific/academic communities.

Supposedly, ATI is said to be moving to ECC in their next gen "6xxx" product.
That's correct. It will keep requesting the data until it's correct, and while this is not as good as ECC, it's more cost-effective for us gamers.

Now on Nvidia's side we know there's going to be ECC for Tesla, but what about GeForce?

The register file, L1 cache, L2 cache and DRAM all have full ECC support in Fermi. This is one of those Tesla-specific features.
http://www.anandtech.com/video/showdoc.aspx?i=3651&p=6

Nvidia Can Disable Certain Fermi Features on Gaming Graphics Cards
http://www.xbitlabs.com/news/v...ng_Graphics_Cards.html
 

BD231

Lifer
Feb 26, 2001
Multi-bit errors are rare, so I'd say it's actually comparable when you consider that ECC requires more logic in the memory controller to handle error correction. Errors in the memory modules or errors in the memory controller are not detected, and apparently they're not common enough to be of great concern right now; there's a clear focus on protecting the GPU's calculations.

Soft errors in memory are caused predominantly by electrical disturbances, which can be minimized through design, so I'd imagine this is why we haven't seen talk of ECC on video cards earlier. To me this is a stripped-down version of what you would find in ECC memory, with the non-critical processes removed. I'd imagine we should see some type of implementation like this in the future, given that ECC would raise costs.
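
For contrast, here's roughly what the ECC side does: the redundant bits let the controller fix a single flipped bit in place instead of asking for a resend. This is a toy Hamming(7,4) sketch; real ECC DRAM uses wider SECDED codes, this is just to show the idea:

Code:
def hamming74_encode(nibble: int) -> int:
    """Pack 4 data bits into a 7-bit codeword; parity bits live at positions 1, 2 and 4."""
    d = [(nibble >> i) & 1 for i in range(4)]
    bits = {3: d[0], 5: d[1], 6: d[2], 7: d[3]}    # data bits
    bits[1] = bits[3] ^ bits[5] ^ bits[7]          # parity over positions with bit 0 set
    bits[2] = bits[3] ^ bits[6] ^ bits[7]          # parity over positions with bit 1 set
    bits[4] = bits[5] ^ bits[6] ^ bits[7]          # parity over positions with bit 2 set
    return sum(bits[p] << (p - 1) for p in range(1, 8))

def hamming74_correct(word: int) -> int:
    """Recompute the parities; a non-zero syndrome is the 1-based position of the bad bit."""
    b = {p: (word >> (p - 1)) & 1 for p in range(1, 8)}
    syndrome = ((b[1] ^ b[3] ^ b[5] ^ b[7])
                | (b[2] ^ b[3] ^ b[6] ^ b[7]) << 1
                | (b[4] ^ b[5] ^ b[6] ^ b[7]) << 2)
    return word ^ (1 << (syndrome - 1)) if syndrome else word

codeword  = hamming74_encode(0b1011)
corrupted = codeword ^ (1 << 5)                    # flip the bit at position 6
assert hamming74_correct(corrupted) == codeword    # fixed in place, no retransmission needed

The extra parity generation and checking is exactly the additional controller logic being traded off against.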
 

Keysplayr

Elite Member
Jan 16, 2003
Originally posted by: BD231
Multi-bit errors are rare, so I'd say it's actually comparable when you consider that ECC requires more logic in the memory controller to handle error correction. Errors in the memory modules or errors in the memory controller are not detected, and apparently they're not common enough to be of great concern right now; there's a clear focus on protecting the GPU's calculations.

Soft errors in memory are caused predominantly by electrical disturbances, which can be minimized through design, so I'd imagine this is why we haven't seen talk of ECC on video cards earlier. To me this is a stripped-down version of what you would find in ECC memory, with the non-critical processes removed. I'd imagine we should see some type of implementation like this in the future, given that ECC would raise costs.

ECC was the most widely requested enhancement from "big corps" scientists. Other requests were numerous as well, but ECC was the biggie. While bit errors don't affect gaming much, they certainly can affect days or weeks (whatever duration) worth of heavy computation. And after that week, it might be good to know that the data is accurate.
This is why it was requested so much. This is why it was implemented. Nvidia would like to push further into the world of GPGPU computation, and to do that they needed to give people what they asked for.
Maybe we will have non-ECC Fermi versions for gaming. It's possible. It depends on how "modular" or dynamic the design is, I would guess.

 

BD231

Lifer
Feb 26, 2001
So widely requested that it took a new, possibly profitable feature to bring it about?

:p.
 

Minas

Junior Member
Jul 27, 2009
Originally posted by: BD231
So widely requested that it took a new, possibly profitable feature to bring it about?

:p.
There might not be so many of them but they do buy 'bleeding edge' tech.

Keysplayr: do you know if the scientific groups tend to request ECC specifically, or do they specify an overall error tolerance that necessitates it?
 

Keysplayr

Elite Member
Jan 16, 2003
Originally posted by: Minas
Originally posted by: BD231
So widely requested that it took a new, possibly profitable feature to bring it about?

:p.
There might not be so many of them but they do buy 'bleeding edge' tech.

Keysplayr: do you know if the scientific groups tend to request ECC specifically, or do they specify an overall error tolerance that necessitates it?

These scientific groups currently use massive clusters with tons of ECC memory.
There can be no tolerance for errors in these scientific calculations, else all their data will be off and useless to them. I used to work at one of the largest scientific labs in the States not too long ago. They'd spend hundreds of thousands of dollars on 120-node clusters, each node having multiple processors and maxed out with ECC memory. Some of these calculations took weeks at a time. Imagine all that time lost due to errors in the data.

Yes, ECC was requested specifically. It has been said that the reason most haven't adopted GPU farms is the lack of ECC. There are plenty of specifics in the Fermi whitepapers regarding this.