nVidia cuts Fermi based Tesla performance.

evolucion8 · May 6, 2010

Source: http://www.theinquirer.net/inquirer/news/1635153/nvidia-cuts-tesla-performance

DESIGNER OF WARM GPUS Nvidia has cut the performance of its Fermi based Tesla GPGPU aimed at the high performance computing (HPC) market.

The server vendors Appro and Supermicro recently announced 1U servers featuring up to four Tesla cards. We marvelled at the engineering brilliance of the two firms as they had to overcome the associated power and heat issues that come with graphics cards that use the Green Goblin's Fermi architecture.

It seems those engineering feats were helped, in a manner of speaking, by Nvidia itself, through a decrease in the shader count and clock speeds. The latest specifications detail a 18 per cent clock decrease, coupled with a previously announced 12.5 per cent cut in the number of "stream processors" from 512 to 448, and the board tipped up drawing 10 per cent more power than previous vague estimates that had been bandied about by the firm late last year.

Given that server vendors flog their units by stressing the performance per watt per square foot metric, Nvidia's cuts in performance and increase in power won't help Appro or Supermicro shift too many of their boxes kitted out with Tesla cards.

This is perhaps not surprising given the months of delays that Nvidia has had trying to get its Fermi GPUs out the door. Regardless of who you believe, even the most fortuitous yield reports are abysmal and Nvidia's repeated attempts to rein in the power consumption of a chip that has the same thirst for power as a tinpot dictator seem to have failed, with the firm even having admitted that its Fermi GTX480 GPU chip runs hot.

The whole sorry saga is set to hurt Nvidia in its latest push into the HPC market where it has previously done well against its competition. The problem is, with the power requirements being so high and the scale of the performance decrease, the numbers of extra servers that will be required to overcome the shortfall might be too much for prospective customers.

All of this could make Nvidia's efforts to become a big player in the HPC market go up in smoke like an insufficiently cooled Fermi chip. µ

Source: http://www.semiaccurate.com/2010/05/05/nvidia-downgrades-tesla-again/

IT LOOKS LIKE Nvidia is being it's normal honest self with respect to the company's high end Tesla compute cards. Yes, the specs on them dropped again, precipitously, and that is from the already castr^h^h^h^h^h downgraded specs released last fall.

If you recall, a year ago, Nvidia was telling people that Fermi would come out in October of 2009 at 1500MHz, have 512 shaders, and only take about 175W. At its triumphant launch during not-Nvision 2009, those specs creeped down a bit, finally finishing off at 1.25GHz-1.40GHz clock, 448 shaders, 1.8GHz-2.0GHz memory clock and only sipping a mere 225W. The ship date slipped from 2009 to Q1 of 2010, then Q2, and if Nvidia liked you, and you were a financial analyst covering its stock, Q3 for anything resembling real quantities.

Customers were not bothered by this change, they took it in stride. Everything was going well, just ask Nvidia. No problems. Can't make chips? Feh, the architecture is fine on paper. Less than 20 percent yields? Not a problem, just obfuscate when asked about it, and telling the truth seems to be punishable at Nvidia.

Step forward to the 'release' Tesla cards, the C2050 / C2070, as seen by the spec sheet here. Remember the spec sheet that was here, but now has a page not found for some reason? Odd. The link may be dead, but the documents are pictured here, and we have saved copies of both that document titled BD-04983-001_v01.pdf and the datasheet titled NV_DS_Tesla_C2050_C2070_Final_lowres.pdf as created on 11/11/2009 and dated "NOV09". They differ a bit from the more modern ones.

November stats, from Nvidia's PDF

The new specs are a significant reduction, unless you are wondering about power, then they are higher. If you are an OEM, they are higher still, but lets not quibble about levels of dishonesty, I mean, if the SEC doesn't care about what Nvidia is telling the analysts and the investing public, why should mere journalists bother to hold it to its statements? It doesn't give analysts 30" monitors.

The new stats, slightly reduced for quick sales

You can find all the new literature here, but the one you want is specifically the C2050 / C2070 data sheets from "APR10", here. Those 'puppies' only run at 1.15GHz, have 515GFLOPS DP FP performance, and 1.03TFLOPS SP FP performance. Memory is at 1.5GHz, and power consumption is now 247W. Here's an analysis of the slippage.

Last year, November, and now.

What is there to say? The Fermi based compute cards are already a running joke, delivering only 68 percent of the promised performance, 88 percent of the cores at 77 percent of the intended clock speed,for 141 percent of the power. That turns out to be slightly less than half the promised performance per watt, the overridingly critical measure in the compute space.

In the end, Nvidia seems to have delivered about half of what it had promised, less if you consider memory speeds, but it is late, draws lots of power, runs hot and costs far more than hoped per chip. None of these problems are fixable, it is time for a new architecture.

With any luck, Nvidia will get to those favored financial analysts before they realize this. One thing for sure, it needs to get word out before some pesky journalists start raising inconvenient questions about threading, asynchronous transfer capability, and how much CPU time that takes versus what Nvidia promised. If word of that gets out to any analyst who understands the effect this will likely have on sales, things could get mighty awkward for the boys in green.S|A

If that's true, all the efforts done to the GPGPU arena will be wasted until they do a respin or go to a smaller manufacturing process. A pity not being able to see the true power of GPGPU performance with Fermi based Tesla.

DaveSimmons · May 6, 2010

> None of these problems are fixable, it is time for a new architecture.

Not even with a die shrink? Amazing.

Honest question, does any comparable system offer better price and performance-per-watt at the 515GFLOPS DP FP performance level? nV may have under-delivered but if they still offer the best performance the trolling comments in the article ("joke", "sorry saga") won't apply.

bunnyfubbles · May 6, 2010

Considering all the talk about skipping 32nm and even 28nm, a die shrink might not happen any time soon.

HurleyBird · May 6, 2010

DaveSimmons said:
Honest question, does any comparable system offer better price and performance-per-watt at the 515GFLOPS DP FP performance level? nV may have under-delivered but if they still offer the best performance the trolling comments in the article ("joke", "sorry saga") won't apply.

My 5870 delivers 544GFlops DP at significantly less heat and cost

Not saying that it can outdo a Tesla Fermi in compute (at least not for most code), but the point is that you can't just look at Flops/watt. In terms of efficiency the 5870 will be less efficient than a Tesla Fermi which in turn is a lot less efficient than an i7. Performance/watt is what matters, and to test actual performance you need benchmarks.

VirtualLarry · May 6, 2010

evolucion8 said:
You can find all the new literature here, but the one you want is specifically the C2050 / C2070 data sheets from "APR10", here. Those 'puppies' only run at 1.15GHz, have 515GFLOPS DP FP performance, and 1.03TFLOPS SP FP performance. Memory is at 1.5GHz, and power consumption is now 247W. Here's an analysis of the slippage.

Wow. My 4850s have 1Tflop SP FP performance already. Fermi isn't that much better.

Lonyo · May 6, 2010

HurleyBird said:
My 5870 delivers 544GFlops DP at significantly less heat and cost

Not saying that it can outdo a Tesla Fermi in compute (at least not for most code), but the point is that you can't just look at Flops/watt. In terms of efficiency the 5870 will be less efficient than a Tesla Fermi which in turn is a lot less efficient than an i7. Performance/watt is what matters, and to test actual performance you need benchmarks.

Well a regular Gulftown (6C/12T) can do 146GFlop @ 130w (TDP), so that's nearly 300GFlop from two of them at 260w TDP.
That's 60% of Fermi's perf/watt, and if you got energy efficient models it would probably go up.
Of course, it all depends on what exactly you are doing, as you say. Sometimes an HD5870 will destroy a GTX480 (not often though), sometimes a CPU will be better, and sometimes a Fermi will be better, but when it's getting only 50% of the performance/watt they were shooting for a year ago, that's a heavy penalty.

railven · May 6, 2010

What can cause such a steep hit in performance?

HurleyBird · May 6, 2010

Lonyo said:
Well a regular Gulftown (6C/12T) can do 146GFlop @ 130w (TDP), so that's nearly 300GFlop from two of them at 260w TDP.
That's 60% of Fermi's perf/watt, and if you got energy efficient models it would probably go up.

60% *Flops*/watt, yeah. An i7 is going to put it's power to use much more efficiently than a GPU though, so *performance*/watt will generally be higher than that 60% and often a lot higher than 100%. Flops/Transistor is even worse for Nvidia with 1 Fermi being equal to 3 gulftowns. Flops/diesize might be more in Nvidia's favor since GPUs tend to be more compact although Gulftown is on a newer node.

Honestly, I don't think Fermi gives good enough double precision power to be attractive over a CPU at all. Fermi has a big advantage when it comes to single precision which for some workloads could definitely be worth it. The issue is that Cypress destroys Fermi when it comes to single precision, assuming you can tailor your code for it. If Fermi has a saving grace, it will be the ease to program and extract performance from it vs. Cypress. That and good marketing

ronnn · May 6, 2010

DaveSimmons said:
> None of these problems are fixable, it is time for a new architecture.

Not even with a die shrink? Amazing.

Honest question, does any comparable system offer better price and performance-per-watt at the 515GFLOPS DP FP performance level? nV may have under-delivered but if they still offer the best performance the trolling comments in the article ("joke", "sorry saga") won't apply.

Everything is fixable and will be very impressive. Lets hope amd is also impressive with their new stuff and the race is on.

Paratus · May 6, 2010

As good as Fermi is/could be in GPGPU there will always be a small (or not so small) subset of code that runs beter on Cypress due to the architectural differneces. The password cracking app is one example where a single 5870 beats GTX 480 SLI.

HurleyBird · May 7, 2010

Paratus said:
As good as Fermi is/could be in GPGPU there will always be a small (or not so small) subset of code that runs beter on Cypress due to the architectural differneces. The password cracking app is one example where a single 5870 beats GTX 480 SLI.

Compute mark and directcompute benchmark are two more examples where the 5870 handily beats a GTX 480. Really, any single precision app should perform better on Cypress unless its completely unoptimized or extremely unfriendly to the architecture. Fermi may be more efficient but having only ~37% of the theoretical SP performance of Cypress has to be hard to overcome. Double precision will probably favor Fermi (by which I mean Tesla, not the castrated 80GFlop GTX480 and 470 versions), but again Perf/watt will probably favor Gulftown vs. either GPU solution so not really worth it, especially considering the fact that the GPUs don't do x86.

I see Nvidia's foray into HPC closing quickly though. Nvidia currently has a head start but is lacking both x86 compatibility and traditional ILP-centric cores like i7 or Deneb. Intel has been developing massively parallel x86 tech (LRB, and other massively multi-core experimental platforms) for awhile now and AMD has massively parallel tech that will eventually be updated to decode x86 instructions (not an if, but a when). Eventually AMD and Intel will have heterogeneous (a few big cores surrounded by many more smaller ones) x86 solutions and Nvidia will be unable to compete. Nvidia either needs to carve out a big enough market segment to itself (pull an apple) or gain an x86 license otherwise it will be artificially locked out (the irony) sooner or later. An acquisition of VIA might help them acquire a license if it transfers over, but we don't know the details of VIA's agreement.

Meghan54 · May 7, 2010

HurleyBird said:
An acquisition of VIA might help them acquire a license if it transfers over, but we don't know the details of VIA's agreement.

And that's the real question, isn't it? Did Intel have a poison pill inserted into VIA's x86 licensing agreement with Intel, stating that any acquisition of VIA by another company automatically renders said licensing null and void?

I'd almost bet there is such a clause, otherwise why wouldn't a cash-rich company like Nvidia not have already acquired VIA? Such an acquisition would certainly give Nvidia a big boost into developing x86 products, something I'd think Nvidia would dearly love to do.

A_Dying_Wren · May 7, 2010

Meghan54 said:
And that's the real question, isn't it? Did Intel have a poison pill inserted into VIA's x86 licensing agreement with Intel, stating that any acquisition of VIA by another company automatically renders said licensing null and void?

I'd almost bet there is such a clause, otherwise why wouldn't a cash-rich company like Nvidia not have already acquired VIA? Such an acquisition would certainly give Nvidia a big boost into developing x86 products, something I'd think Nvidia would dearly love to do.

Idk... It would be hard for Nvidia to compete with AMD and Intel assuming they take-over VIA. VIA's chips aren't exactly going anywhere and CPUs are a whole different ball game from GPUs. Idk how much Nvidia would have to do to make Fermi x86. I'm sure an even larger chip would be needed however to process more complex instruction sets.

The poison pill may also be in that Intel may cancel VIA's x86 licensing agreement at any time. Of course now there's no point as VIA isn't getting anywhere but I wouldn't put it past Intel to let Nvidia spend loads of money of x86 development and then cancel their licensing.:twisted:

HurleyBird · May 7, 2010

A_Dying_Wren said:
Idk how much Nvidia would have to do to make Fermi x86. I'm sure an even larger chip would be needed however to process more complex instruction sets.

Not much probably. There is no such thing as a true x86 chip anymore. Modern processors from AMD and Intel simply take x86 instructions and translate them into a simpler RISC-like internal language. Give IBM, Nvidia, ARM, or anyone an x86 license and they could easily make an x86 product in the same fashion. Translating x86 instructions to another architecture for compatibilities sake can be done with a simple truth table (decoder hardware reads a complex x86 instruction, looks up the instructions internal to the architecture that accomplish the same thing, then passes those new instructions and their corresponding operands down the pipe), I would think that doing the translation quickly and efficiently is where the real work is, and even then that would be a small amount of effort compared to designing a 3B transistor monster like Fermi. There's no doubt that AMD already has plans to add x86 compatibility to their stream processors (if I were a betting man I would say NI* and/or 2nd generation Fusion), and if Nvidia wants to stay on the game they need:

a. To build a more robust and self sustaining ecosystem around CUDA, gaining a significant amount of marketshare, or
b. Have a massive performance increase vs. x86 solutions so that the benefits of the speedup outweigh the costs of losing x86 compatibility (this is why Nvidia currently has a foot in the HPC space, however this will no longer be the case when AMD and Intel put out massively parallel x86 solutions), or
c. Gain x86 compatibility themselves.

Make no mistake about it, Nvidia is either in the process of acquiring c. or is a race against time to establish a. I hope for competitions sake it's the former

*Rumors are that AMD is doing a significant redesign of their SPs for Northern Islands. I doubt that this is for performance's sake (AMD already has much, MUCH better perf/mm2 than Nvidia using their current gen shader tech), but to pave way for x86 compatibility. VLIW processors use the compiler to try to order instructions in the most efficient manner (the downside is increased compile time, compiler complexity, file size, and severely reduced portability), while most other architectures do so internally at the hardware level. Assuming a perfect compiler, VLIW can theoretically attain the same performance as a hardware solution using a lot fewer transistors. In this case x86 compatibility will come at a very steep cost (think Itanium) as the compiled code will not be tailer made for that processor anymore. In order to run x86 with any kind of speed on their SPs, AMD needs to do a major overhaul. Ironically, there is a good chance that in redesigning the shader core into a non-VLIW design, AMD might actually lower performance/transistor vs. the old core. Assuming Nvidia does not acquire x86, they may be able to increase their performance/mm2 by switching to a VLIW design themselves (seeing as all their tech is proprietary anyway), however in doing so they would teether themselves in many ways to that first VLIW design (you don't want future architectures to stray too far away from the original, or all that legacy code compiled specifically for previous generations of hardware isn't going to play very nice)

WelshBloke · May 7, 2010

A_Dying_Wren said:
Idk... It would be hard for Nvidia to compete with AMD and Intel assuming they take-over VIA. VIA's chips aren't exactly going anywhere and CPUs are a whole different ball game from GPUs. Idk how much Nvidia would have to do to make Fermi x86. I'm sure an even larger chip would be needed however to process more complex instruction sets.

The poison pill may also be in that Intel may cancel VIA's x86 licensing agreement at any time. Of course now there's no point as VIA isn't getting anywhere but I wouldn't put it past Intel to let Nvidia spend loads of money of x86 development and then cancel their licensing.:twisted:

The licence runs out in 2013 anyway.

HurleyBird · May 7, 2010

WelshBloke said:
The licence runs out in 2013 anyway.

There's a difference between "runs out" and "due for renegotiation"

WelshBloke · May 7, 2010

HurleyBird said:
There's a difference between "runs out" and "due for renegotiation"

How well do you think the renegotiation would go if Nvidia owned VIA?:sneaky:

Anyway isnt VIA owned by some massive holding company?

ZimZum · May 7, 2010

The x86 licences that VIA and AMD have are non transferable and evaporate if either company is bought out.

BenSkywalker · May 7, 2010

Based on everything I can find, I'm seeing Fermi being 8.6% off from claimed clocked rates for shipping products, not off at all on shader cores(they announed 448 six months ago for Fermi) and off by 9.3% on power. Maybe I missed it, but I looked around and can't find anything resembling what Charlie is talking about. They stated they would have a 512 shader part, but they still haven't announced what that part is going to be yet(anyone here think a die shrink isn't going to enable that?).

Assuming a perfect compiler, VLIW can theoretically attain the same performance as a hardware solution using a lot fewer transistors.

While I don't disagree with this statement, I can see it certainly giving the very wrong impression. There is a huge area between x86 and VLIW, you can have the same architecture that today's chips use to compute without the need for x86 and without them being VLIW(see anything POWER based for many years for likely the most popular examples). As of now AMD and the moderator protected company both use a relatively small part of their overall die for decode hardware because they only have six cores. The amount of die space needed for 1,000 cores would be staggering, an abject failure of a part in no uncertain terms when looking at performance/transitor, or performance/watt.

Meh, can't get too much into this, the moderators don't allow criticism of Intel

yh125d · May 7, 2010

4 Fermi in a 1U? riiiiiiiiight

HurleyBird · May 7, 2010

BenSkywalker said:
As of now AMD and the moderator protected company both use a relatively small part of their overall die for decode hardware because they only have six cores. The amount of die space needed for 1,000 cores would be staggering, an abject failure of a part in no uncertain terms when looking at performance/transitor, or performance/watt.

Well, the real metric isn't going to be the number of cores, but number of instructions fetched, Eg. how many instructions need to be translated every clock cycle. Obviously this will be higher for GPUs, but whether it becomes a prohibiting factor I really can't say -- I don't know how many transistors are dedicated to instruction translation in current processors, just that it is supposed to be very small. I would say that even in a bad case scenario gaining x86 compatability is worth the investment, but I doubt the performance hit would be too bad (a very very small number multipled by one order of magnitude is still a very small number).

Voo · May 7, 2010

Well Tesla is a SIMD parallel architecture, something for which the x86 ISA was obviously never planned for, so what would be the point in emulating the whole ISA (and that isn't as easy as you make it sound, fetching and decoding those instructions is rather.. unpleasant and then you'd need a large µc and that probably for each SM) if you still couldn't reuse existing programs in a sensible manner? Also there are instructions that don't make a lot of sense for such a architecture and you need some instructions that aren't in the ISA.

Not to say it can't be done, but since Nvidia didn't create the whole architecture with it in the back of their heads, it sounds like a lot of work for relatively small gain. Why Intel decided to use it for larrabee on the other side isn't that surprising, they have a lot of experience with making those things work well, also larabee should be a SIMD vector computer so that should make a bit more sense.

VirtualLarry · May 7, 2010

Voo said:
Well Tesla is a SIMD parallel architecture,

Fermi is MIMD, isn't it? That was one of the major changes.

HurleyBird · May 7, 2010

Voo said:
Well Tesla is a SIMD parallel architecture, something for which the x86 ISA was obviously never planned for, so what would be the point in emulating the whole ISA?

And yet all modern "x86 CPUs" use SIMD instructions. No one really cares (except in special cases like VLIW, obviously) what x86 was designed to do or not to do -- no one actually uses x86 anymore -- the lucky few with access to the liscense translate x86 instructions into better architectures that look nothing like x86. Some architectures can be better suited to x86 translation than others. First of all you need enough internal registers, and x86 instructions may map to some architectures better than others. There are disadvantages, but also enourmous advantages due to the prevailence of x86 *everywhere*. They don't call x86 the "golden handcuffs" for nothing!

Nemesis 1 · May 7, 2010

BenSkywalker said:
Based on everything I can find, I'm seeing Fermi being 8.6% off from claimed clocked rates for shipping products, not off at all on shader cores(they announed 448 six months ago for Fermi) and off by 9.3% on power. Maybe I missed it, but I looked around and can't find anything resembling what Charlie is talking about. They stated they would have a 512 shader part, but they still haven't announced what that part is going to be yet(anyone here think a die shrink isn't going to enable that?).

While I don't disagree with this statement, I can see it certainly giving the very wrong impression. There is a huge area between x86 and VLIW, you can have the same architecture that today's chips use to compute without the need for x86 and without them being VLIW(see anything POWER based for many years for likely the most popular examples). As of now AMD and the moderator protected company both use a relatively small part of their overall die for decode hardware because they only have six cores. The amount of die space needed for 1,000 cores would be staggering, an abject failure of a part in no uncertain terms when looking at performance/transitor, or performance/watt.

Meh, can't get too much into this, the moderators don't allow criticism of Intel

The mods don't care what you say about Intel so long as its true facts . You and I have battled many times . Your true facts don't turn out to be true at all . Apple NV . and all your talk about Fermi none of it turned out true . The threads are here . Calling the Mods out like this is not wise.

nVidia cuts Fermi based Tesla performance.

Platinum Member

Elite Member

Lifer

Platinum Member

No Lifer

Lifer

Diamond Member

Platinum Member

Diamond Member

Lifer

Platinum Member

Lifer

Member

Platinum Member

Lifer

Platinum Member

Lifer

Golden Member

Diamond Member

Diamond Member

Platinum Member

Golden Member

No Lifer

Platinum Member

Lifer