Architectural Direction of GPUs


Seero

Golden Member
Nov 4, 2009
1,456
0
0
I wonder how the GTX 480 ended up with 480 CUDA cores instead of 512. Did they send the fab a 512-core design or a 480-core design? If they did send in a 512-core design and, even with low yields, were somehow able to mark the bad cores and sell the chip as a 480-core card, then the Fermi design rocks. Think about it: they could max out the yield by marking bad cores, making 512-, 480-, 448-, 420-, 400-, and 386-core GPUs and marketing each one individually. Or maybe they simply sent in a design with 480 CUDA cores and then quickly pushed a revised 448-core design to the fab.

Getting back to the architecture: grouping cores into SMs is a good idea. If they can somehow power only the active SMs, it will rock.

Seriously, if the new architecture lets them selectively disable/bridge the bad cores, it will be a dream come true in terms of production.
---
Let's talk about the Cypress architecture. A new instruction cannot be executed before the previous one is finished, and after that it needs to fetch data from and write results back to memory. So the total time to finish one instruction is processing time + 2x fetch time (memory latency). The problem is that Cypress can't handle more than one process at a time, although it has massive power for a single process, which makes it good for graphics and pixels. In fact, that is the reason for Eyefinity, as the architecture benefits from super-high resolutions.
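
Purely to make the arithmetic of that simplified model concrete, here is a toy sketch in Python. The timings are invented, and real GPUs hide memory latency by keeping many threads in flight, so this illustrates the post's model rather than actual Cypress behaviour:

```python
# Toy sketch of the fully serialized model described above: fetch operands,
# execute, write back, with no overlap. Timings are invented for illustration;
# real GPUs hide memory latency by switching between many threads in flight.

def serial_instruction_time(processing_ns, memory_latency_ns):
    """Total time per instruction = processing time + 2x memory latency."""
    return processing_ns + 2 * memory_latency_ns

print(serial_instruction_time(processing_ns=4, memory_latency_ns=400), "ns")  # 804 ns
```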

Fermi, on the other hand, has less raw power but better handling of multiprocessing. Both the tessellation and CUDA results support this theory. The winning punch is that it can also handle a single process about as well as the Cypress architecture.

Fermi also has a lower clock speed, meaning there is headroom. The downside is, of course, the amount of power needed. There is no trick here: the Fermi design consists of more transistors and therefore needs more power pushed through it.

It is very difficult to compare the two designs, as each does something better than the other. Cypress is better at pushing pixels and Fermi is better at multitasking. Since Cypress is better at pixels, scaling it down makes sense, but what about Fermi? The only way to scale Fermi down is by reducing the number of CUDA cores, but what about the number of SMs? Should there be fewer CUDA cores per SM, or just fewer SMs? Then comes another question: how many CUDA cores should there be in an SM? Have they found the optimal numbers yet?

It appears that memory latency has less impact on the Fermi design, as we have seen in some overclocking reviews. The one thing that has me interested in Fermi is that instructions can be executed back to back. That changes everything when it comes to programming; it will be much easier to fully utilize the card compared to the older structure.

Having said all that, most of the above are just wild guesses. But what if they all come true? Then we will see a video card that is
a) cheap, because the yield problem disappears;
b) efficient, as SMs switch on/off (or overclock) dynamically;
c) high performing, good at multitasking or at a single task;
d) efficient again, as the chance of powering inactive SMs is minimized;
e) multi-purpose, as it can act like a tessellation unit, performing better than a dedicated tessellation unit, without completely destroying other tasks.

However, now that it is out and ATI has a product that is just as good, there is nothing preventing ATI from LEARNING from Fermi and creating an even better design. Plus, the current Fermi isn't that OMG at the moment, as the software environment does not fully support its architecture yet.
 

BenSkywalker

Diamond Member
Oct 9, 1999
9,140
67
91
Again FILLRATE is still king.

I don't agree with that at all in today's market. The 5770 has as much fill as the 470, yet the 470 absolutely obliterates it. The 480 has significantly less fill than the 5870, yet is still clearly the faster part. What I'm talking about is that in the specific case of the 470/480, they are limited by their texel fillrate more than anything else. The 5870 doesn't appear to be texel limited at all.

I mean, do we know for sure that the fixed function stuff in Cypress isn't also running at a fraction of core speed?

Cypress doesn't use the type of differing clock domains that Fermi does.

You’re assuming the drop is based on texture fillrate, which hasn’t been proven to be the case.

All evidence points to that being the case at the moment. When you remove fill as a major factor, the 480 is running with and sometimes besting the 5970. The more shader limited a game is, the better Fermi starts to look. The more fill limited, the worse off it is.

Then it’d also be bound the same way on GT200 parts given they have less effective texturing performance, but that hasn’t been the case.

GT200 has a fraction of the shader power of Fermi. A game that is nigh entirely shader bound on GT200 could easily be nigh entirely texel bound on a Fermi part.

You can’t cut the shader hardware in half, not without reducing texturing and geometry performance by the same amount, given those units are tied to the SMs.

Cut the DP units out entirely. You get half the shader hardware, significant die savings, and a part that is more in line with the shader/texture balance of the other parts available today.

The idea here is to remove parts of the chip not needed by gamers and produce a "gamer edition" of the card. With such a massive transistor count compared to the 5xxx parts, there must be much more than DP that could be removed without altering gaming performance.

You could take out the L2 cache, remove the cross-CUDA-core data paths, remove DP, take out the additional registers, use a simpler scheduler; all of them would have minimal impact in most games out today, and all of them would cripple GPGPU performance and reduce the chips' effectiveness in terms of raw computational throughput.

99.99% of gamers won’t be running CS5.

How many users of CS5 are going to be using GeForces, though? I would wager a rather high percentage. While gaming is clearly the dominant usage of GeForce parts, making premium graphics hardware something the broader market wants is very much in line with nV's current strategy. Fermi is clear evidence of this.

BTW- The ComputerBase AF numbers are particularly interesting. The 480 takes close to double the performance hit using 16x AF that the GT200 parts do. Clearly there are TMU limitations at play in the architecture compared to the other parts on the market.

Does anyone here see Fermi as being a capable chip for Shared frame rendering?

The problem with that is the way games are rendered today: they are not handled in a way that is friendly to split frame rendering. It is possible that with something like ray-traced reflections you could hand all of the ray tracing off to one board and handle the rest of the rendering on the other; that would be a reasonable approach. But particularly with all of the post-process effects being used today, split frame rendering isn't really viable.
 

blanketyblank

Golden Member
Jan 23, 2007
1,149
0
0
I wonder how the GTX 480 ended up with 480 CUDA cores instead of 512. Did they send the fab a 512-core design or a 480-core design? If they did send in a 512-core design and, even with low yields, were somehow able to mark the bad cores and sell the chip as a 480-core card, then the Fermi design rocks. Think about it: they could max out the yield by marking bad cores, making 512-, 480-, 448-, 420-, 400-, and 386-core GPUs and marketing each one individually. Or maybe they simply sent in a design with 480 CUDA cores and then quickly pushed a revised 448-core design to the fab.

Getting back to the architecture: grouping cores into SMs is a good idea. If they can somehow power only the active SMs, it will rock.

Seriously, if the new architecture lets them selectively disable/bridge the bad cores, it will be a dream come true in terms of production.
---
Let's talk about the Cypress architecture. A new instruction cannot be executed before the previous one is finished, and after that it needs to fetch data from and write results back to memory. So the total time to finish one instruction is processing time + 2x fetch time (memory latency). The problem is that Cypress can't handle more than one process at a time, although it has massive power for a single process, which makes it good for graphics and pixels. In fact, that is the reason for Eyefinity, as the architecture benefits from super-high resolutions.

Fermi, on the other hand, has less raw power but better handling of multiprocessing. Both the tessellation and CUDA results support this theory. The winning punch is that it can also handle a single process about as well as the Cypress architecture.

Fermi also has a lower clock speed, meaning there is headroom. The downside is, of course, the amount of power needed. There is no trick here: the Fermi design consists of more transistors and therefore needs more power pushed through it.

It is very difficult to compare the two designs, as each does something better than the other. Cypress is better at pushing pixels and Fermi is better at multitasking. Since Cypress is better at pixels, scaling it down makes sense, but what about Fermi? The only way to scale Fermi down is by reducing the number of CUDA cores, but what about the number of SMs? Should there be fewer CUDA cores per SM, or just fewer SMs? Then comes another question: how many CUDA cores should there be in an SM? Have they found the optimal numbers yet?

It appears that memory latency has less impact on the Fermi design, as we have seen in some overclocking reviews. The one thing that has me interested in Fermi is that instructions can be executed back to back. That changes everything when it comes to programming; it will be much easier to fully utilize the card compared to the older structure.

Having said all that, most of the above are just wild guesses. But what if they all come true? Then we will see a video card that is
a) cheap, because the yield problem disappears;
b) efficient, as SMs switch on/off (or overclock) dynamically;
c) high performing, good at multitasking or at a single task;
d) efficient again, as the chance of powering inactive SMs is minimized;
e) multi-purpose, as it can act like a tessellation unit, performing better than a dedicated tessellation unit, without completely destroying other tasks.

However, now that it is out and ATI has a product that is just as good, there is nothing preventing ATI from LEARNING from Fermi and creating an even better design. Plus, the current Fermi isn't that OMG at the moment, as the software environment does not fully support its architecture yet.

I believe Fermi was meant to have 512 cores but then got 480 through binning.
Binning is relatively simple: each chip is tested against certain temperature and power limits; those that pass are 480s, those that don't are tested as 470s, and the rest are dumped. I'm pretty sure they can't selectively disable the bad cores; it's more about providing enough power to enable the good or semi-good ones and then setting the limit in the BIOS.
They could also potentially laser-cut both good and bad cores to make a lower model that can't be modded.

Also, you are forgetting about patents, which CAN prevent ATI from using technology from Fermi, and design-arounds are pretty hard to do and often inefficient.
 

AzN

Banned
Nov 26, 2001
4,112
2
0
I don't agree with that at all in today's market. The 5770 has as much fill as the 470, yet the 470 absolutely obliterates it. The 480 has significantly less fill than the 5870, yet is still clearly the faster part. What I'm talking about is that in the specific case of the 470/480, they are limited by their texel fillrate more than anything else. The 5870 doesn't appear to be texel limited at all.

The 5770 does not have as much pixel fill as the 470, nor as much bandwidth. It does have as much texel fill, though.

GTX 470
24.3 Gpixels/s
34.0 peak bilinear (Gtexels/s)
17.0 FP16 (Gtexels/s)
133.9 GB/s

5770
13.6 Gpixels/s
34.0 peak bilinear (Gtexels/s)
17.0 FP16 (Gtexels/s)
76.8 GB/s

GTX 480
33.6 Gpixels/s
42.0 peak bilinear (Gtexels/s)
21.0 FP16 (Gtexels/s)
177.4 GB/s

5870
27.2 Gpixels/s
68.0 peak bilinear (Gtexels/s)
34.0 FP16 (Gtexels/s)
153.6 GB/s

With the 5870 vs. the 480 it kind of evens out. A slight edge to the GTX 480 as you add AA and raise the resolution, since it has more ROPs and bandwidth. Tweaks and whatnot. Pretty much on par.
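
For reference, those theoretical figures fall straight out of unit counts and clocks. A small sketch using the commonly published ROP/TMU counts, core clocks, bus widths, and memory data rates for these cards, and assuming the FP16 texel rate is half the bilinear rate, as in the numbers above:

```python
# Rough derivation of the theoretical throughput figures quoted above.
# Unit counts, clocks, bus widths, and memory data rates are the commonly
# published specs; the FP16 texel rate is assumed to be half the bilinear
# rate, matching the figures in the post.

cards = {
    #           ROPs TMUs core MHz  bus bits  mem MT/s
    "GTX 470": (40,  56,  607,      320,      3348),
    "HD 5770": (16,  40,  850,      128,      4800),
    "GTX 480": (48,  60,  700,      384,      3696),
    "HD 5870": (32,  80,  850,      256,      4800),
}

for name, (rops, tmus, core_mhz, bus_bits, mem_mts) in cards.items():
    pixel_fill = rops * core_mhz / 1000        # Gpixels/s
    bilinear = tmus * core_mhz / 1000          # Gtexels/s
    fp16 = bilinear / 2                        # Gtexels/s (half-rate assumption)
    bandwidth = bus_bits / 8 * mem_mts / 1000  # GB/s
    print(f"{name}: {pixel_fill:.1f} Gpix/s, {bilinear:.1f} bilinear GTex/s, "
          f"{fp16:.1f} FP16 GTex/s, {bandwidth:.1f} GB/s")
```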
 

HurleyBird

Platinum Member
Apr 22, 2003
2,684
1,268
136
A slight edge to the GTX 480 as you add AA and raise the resolution, since it has more ROPs and bandwidth. Tweaks and whatnot. Pretty much on par.

Actually, as you can see in the link I posted earlier, at least with current drivers, 480 loses ground as resolution increases. It also loses ground as AF increases. 480 does tend to handle higher levels of AA better, although there are exceptions.
 

Seero

Golden Member
Nov 4, 2009
1,456
0
0
I believe Fermi was meant to have 512 cores but then got 480 through binning.
Binning is relatively simple: each chip is tested against certain temperature and power limits; those that pass are 480s, those that don't are tested as 470s, and the rest are dumped. I'm pretty sure they can't selectively disable the bad cores; it's more about providing enough power to enable the good or semi-good ones and then setting the limit in the BIOS.
They could also potentially laser-cut both good and bad cores to make a lower model that can't be modded.

Also, you are forgetting about patents, which CAN prevent ATI from using technology from Fermi, and design-arounds are pretty hard to do and often inefficient.
Here is the thing: how can a chip end up with 480 cores when it was supposed to have 512? Once the laser burns it in, the number of cores, like everything else, is fixed. If the mask has 512 cores, how did they end up with 480? The only explanation is that they somehow bridged out the bad ones, if they did indeed send in a 512-core design.

I didn't forget about patents. ATI won't copy or steal tech from Nvidia, but learning from it doesn't violate patents. It is the idea that counts, not the actual implementation. ATI patented an on-board tessellation unit, yet Nvidia can now do tessellation too.
 

ViRGE

Elite Member, Moderator Emeritus
Oct 9, 1999
31,516
167
106
Here is the thing: how can a chip end up with 480 cores when it was supposed to have 512? Once the laser burns it in, the number of cores, like everything else, is fixed. If the mask has 512 cores, how did they end up with 480? The only explanation is that they somehow bridged out the bad ones.

I didn't forget about patents. ATI won't copy or steal tech from Nvidia, but learning from it doesn't violate patents. It is the idea that counts, not the actual implementation. ATI patented an on-board tessellation unit, yet Nvidia can now do tessellation too.
I'm not sure I get the question. It's only using 480 cores because NVIDIA turned off 32 cores.
 

EarthwormJim

Diamond Member
Oct 15, 2003
3,239
0
76
Here is the thing: how can a chip end up with 480 cores when it was supposed to have 512? Once the laser burns it in, the number of cores, like everything else, is fixed. If the mask has 512 cores, how did they end up with 480? The only explanation is that they somehow bridged out the bad ones.

I didn't forget about patents. ATI won't copy or steal tech from Nvidia, but learning from it doesn't violate patents. It is the idea that counts, not the actual implementation. ATI patented an on-board tessellation unit, yet Nvidia can now do tessellation too.

It still has 512 cores; however, only 480 are enabled. This is not a new practice; it has been done since all the way back in the 9700 Pro days. Just about every GPU has at least one partially crippled derivative. The 5850 has shader units disabled compared to the 5870, but it still physically has those units on its die.

To compensate for this, Nvidia upped the clock speeds on the GTX 480, which is also why the TDP went up.
 

Seero

Golden Member
Nov 4, 2009
1,456
0
0
It still has 512 cores; however, only 480 are enabled. This is not a new practice; it has been done since all the way back in the 9700 Pro days. Just about every GPU has at least one partially crippled derivative. The 5850 has shader units disabled compared to the 5870, but it still physically has those units on its die.

To compensate for this, Nvidia upped the clock speeds on the GTX 480, which is also why the TDP went up.
Yes, they can do that, but what happened is that they disabled 1 SM, and each SM has 32 cores. How can they group all the bad cores into one SM and then disable it? It does not make sense. If they had kept 16 SMs and given each SM fewer cores, it would make sense, but that isn't the case.

Even if they manage to disable the bad ones, since the new structure works differently, can they somehow tell the SM controller not to address the bad ones? If they can, then it won't be difficult to control how many cores to use within an SM, and disabling SMs dynamically will not be a problem.

Of course, I am not saying I know; I am saying I don't know, but I find it quite interesting.

As to the clock speed, NV was aiming for 750 MHz and is now at 700 MHz. How can you, or anyone else who pretends to have inside sources, claim that the clock speed was upped? IMO it was downclocked because it was too hot at that speed, which means a lot of headroom for overclocking.
 

BFG10K

Lifer
Aug 14, 2000
22,709
2,971
126
Texture is only part of the fillrate equation. Bandwidth and pixel fill are also part of that equation, and there the GTX 4xx is stronger.
Those graphs are useless synthetics which can’t be used to determine anything in the real world.
And for your info, that graph is showing FP16 blending tests, which make more sense for modern games, instead of 3DMark06, which tests bilinear filtering as in older games.
I didn’t get an answer: how many FP16 games do you play using point filtering?

I didn’t get an answer: the GTX470 has the same color fill as the GTX260, and the same FP texture fill as the 5770 on those graphs. How are those graphs relevant in the real world when we know the GTX470 is vastly faster than the GTX260/5770?

Nonsense to you, maybe. Again, 3DMark is a tool to test different parts of video cards. Just because you can't consume the information does not mean it's useless.
I can consume the information just fine, which is why I know it’s useless in the real world.

The 5770 has a whole lot more texture fillrate with FP16 blending. That's why it does so well in modern games, while older cards like the GTX 260 crush the 5770 in older games.
What are you talking about? The GTX260 has a higher FP texel fillrate than the 5770: 18.4 vs 17.0 GTex/sec.

That and the 5770 crushes the GTX260 in plenty of non FP games; likewise, the GTX260 crushes the 5770 in plenty of FP games. I know this because I actually tested both parts using ~30 games.

Tell me, BFG, what happened when you overclocked your GTX285's core, SP, and memory clocks? Which made the most impact in games? I rest my case. Fillrate is still king as long as it's not constrained by bandwidth or shaders.
The core made the biggest difference because all indications are now pointing to the GT200 being ROP bound, especially when running AA. Your graphs can’t show this, which is why they show the 5770 and GTX260 as “equals”, but the real world shows the GTX470 being 30%-45% faster than the GTX285 with 4xAA/8xAA.
The 5770 does not have as much pixel fill as the 470, nor as much bandwidth.
But the 5770 has much less bandwidth and less fillrate across the board than the GTX260:

5770:
Pixel: 13.6 Gpix/s
Int texel: 34.0 GTex/s
FP texel: 17.0 GTex/s
Bandwidth: 76.8 GB/s

GTX260:
Pixel: 16.1 Gpix/s
Int texel: 36.9 GTex/s
FP texel: 18.4 GTex/s
Bandwidth: 111.9 GB/s

So the GTX260 comes out ahead everywhere you claimed mattered (including having a lot more bandwidth, and more FP texel fillrate), yet in the real world the two parts are about equal overall.
 

blanketyblank

Golden Member
Jan 23, 2007
1,149
0
0
Here is the thing: how can a chip end up with 480 cores when it was supposed to have 512? Once the laser burns it in, the number of cores, like everything else, is fixed. If the mask has 512 cores, how did they end up with 480? The only explanation is that they somehow bridged out the bad ones, if they did indeed send in a 512-core design.

I didn't forget about patents. ATI won't copy or steal tech from Nvidia, but learning from it doesn't violate patents. It is the idea that counts, not the actual implementation. ATI patented an on-board tessellation unit, yet Nvidia can now do tessellation too.

It's something like this: the chip has 512 cores, but some of them may be bad, require more power than can be provided, or create more heat than can be dissipated to run properly. So they bin the chips to see how many cores they can actually use while staying under those limits. It is possible that some 480s being sold have 512 working cores, but NV found too few such chips for a 512-core product to be worth selling. After testing the yields from its binning process, the company decides on the criteria for its minimum standard: a number of working cores at certain heat/power requirements. That standard is set in the BIOS, so the rest of the card never tries to use 512 cores and runs at a set voltage, even though more cores may work or the chip may be able to run at a lower voltage.

For example, NV used to sell a 9600 GSO that was basically the same chip as the more expensive and powerful 8800 GTS. Demand for the 8800 was smaller and supply of 9600s was insufficient, so they crippled the G92 chips on those cards through a simple BIOS change that allowed the cards to use only a fraction of the cores and memory on the PCB. These cards, however, could be "unlocked" by flashing the BIOS.

In other words, NV isn't disabling the bad cores. It is ENABLING a minimum number of cores on chips that have been tested to have at least that minimum.
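
A minimal sketch of that idea, with invented test criteria (NVIDIA's actual binning rules are proprietary); the bin names simply mirror the shipping configurations, with the GTX 480 enabling 15 of 16 SMs and the GTX 470 enabling 14:

```python
# Toy illustration of the binning flow described above. The thresholds and
# bin names mirror the shipping configs (GTX 480 = 15 of 16 SMs enabled,
# GTX 470 = 14), but the test criteria themselves are invented.

def assign_bin(working_sms_at_spec, within_power_limit):
    """Decide what a tested GF100 die could be sold as."""
    if not within_power_limit:
        return "reject"
    if working_sms_at_spec >= 15:
        return "GTX 480 (15 SMs enabled in BIOS)"
    if working_sms_at_spec >= 14:
        return "GTX 470 (14 SMs enabled in BIOS)"
    return "reject"

# Even a fully working die gets sold as a 480 with one SM switched off.
print(assign_bin(16, True))   # GTX 480 (15 SMs enabled in BIOS)
print(assign_bin(14, True))   # GTX 470 (14 SMs enabled in BIOS)
print(assign_bin(15, False))  # reject
```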
 

EarthwormJim

Diamond Member
Oct 15, 2003
3,239
0
76
Yes, they can do that, but what happened is that they disabled 1 SM, and each SM has 32 cores. How can they group all the bad cores into one SM and then disable it? It does not make sense. If they had kept 16 SMs and given each SM fewer cores, it would make sense, but that isn't the case.

Even if they manage to disable the bad ones, since the new structure works differently, can they somehow tell the SM controller not to address the bad ones? If they can, then it won't be difficult to control how many cores to use within an SM, and disabling SMs dynamically will not be a problem.

Of course, I am not saying I know; I am saying I don't know, but I find it quite interesting.

As to the clock speed, NV was aiming for 750 MHz and is now at 700 MHz. How can you, or anyone else who pretends to have inside sources, claim that the clock speed was upped? IMO it was downclocked because it was too hot at that speed, which means a lot of headroom for overclocking.

750 MHz was before Nvidia started getting chips back. When they saw how awful yields were, they aimed for 600-625 MHz. When yields were too low even for that, they went to a 480-core part. Since performance would go down, they upped the clock to 700 MHz at the last minute, and consequently the TDP went up to 250 W.

The disabled shader multiprocessors aren't necessarily nonfunctional; they just may not be able to clock up to 700 MHz at the right voltage.

Also, on each GTX 480 there are probably very few defects on the actual GPU, just enough that a couple of adjacent cores are bad, forcing one SM to be disabled (so maybe 10 defects or fewer, and I'm just throwing that out, probably concentrated in the same physical area). The key is that the cores are adjacent; they can't reallocate cores from one SM to another.

Remember, yields are ridiculously low for Nvidia right now. Charlie is claiming 2% for the GTX 480 (and considering the quantities they're launching with, it has to be ugly). So there are probably a ton of GPUs with defects spaced out such that a core or two in each SM is bad, making the whole chip bad.
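
To illustrate why adjacency matters, here is a toy Monte Carlo under the common simplifying assumptions that point defects land randomly and independently on the die and that every defect hits some SM; the defect rate is made up, so only the trend is meaningful:

```python
# Toy Monte Carlo of the "adjacent cores" point above: random point defects
# on a 16-SM die, assuming (for simplicity) that every defect lands inside
# some SM. A die can ship as a 480-class part (one SM fused off) only if
# all of its defects fall within a single SM. The defect rate is made up.
import math
import random

SMS = 16
MEAN_DEFECTS_PER_DIE = 2.0   # invented number, purely illustrative
TRIALS = 100_000

def poisson(mean):
    """Knuth's method for sampling a Poisson-distributed defect count."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

sellable = 0
for _ in range(TRIALS):
    hit_sms = {random.randrange(SMS) for _ in range(poisson(MEAN_DEFECTS_PER_DIE))}
    if len(hit_sms) <= 1:   # zero defects, or every defect in the same SM
        sellable += 1

print(f"dies usable with at most one disabled SM: {sellable / TRIALS:.1%}")
```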
 

Mr. Pedantic

Diamond Member
Feb 14, 2010
5,039
0
76
Yes, they can do that, but what happened is that they disabled 1 SM, and each SM has 32 cores. How can they group all the bad cores into one SM and then disable it? It does not make sense. If they had kept 16 SMs and given each SM fewer cores, it would make sense, but that isn't the case.

Even if they manage to disable the bad ones, since the new structure works differently, can they somehow tell the SM controller not to address the bad ones? If they can, then it won't be difficult to control how many cores to use within an SM, and disabling SMs dynamically will not be a problem.

Of course, I am not saying I know; I am saying I don't know, but I find it quite interesting.

As to the clock speed, NV was aiming for 750 MHz and is now at 700 MHz. How can you, or anyone else who pretends to have inside sources, claim that the clock speed was upped? IMO it was downclocked because it was too hot at that speed, which means a lot of headroom for overclocking.
They disable one SM if only 1 SM is affected. If 2 are affected, they disable two. More than that and they can the chip. They don't 'group' all the bad cores into a single SM; it doesn't work like that.

They can just cut the connections to a single SM. As far as I know the design is modular, so cutting one off doesn't suddenly make the whole chip useless; there are still 15 good SMs on there, and they can work perfectly fine without the bad one. Seeing as I'm not an EE, there are probably more subtle ways of doing this, but to me it makes sense.

Not necessarily. Underclocking a chip that happens to be right on the thermal limit doesn't necessarily mean that heat was the sole, or even main, reason why the clocks were scaled back.
 

nosfe

Senior member
Aug 8, 2007
424
0
0
Just a heads-up for all those CS5-related comments: only Premiere Pro CS5 will use CUDA. After Effects and Photoshop will still be using only OpenGL, so there's no advantage to using a GeForce card there.
 

NoQuarter

Golden Member
Jan 1, 2001
1,006
0
76
As to the clock speed, NV was aiming for 750 MHz and is now at 700 MHz. How can you, or anyone else who pretends to have inside sources, claim that the clock speed was upped? IMO it was downclocked because it was too hot at that speed, which means a lot of headroom for overclocking.

That's because it was only stable with all cores enabled at something like 600 MHz. The GPU was faster clocked at 700 MHz with some cores disabled. So it just depends on how you look at it: either they downclocked and disabled cores relative to their original plan, or they overclocked and disabled cores relative to what they ended up with from the wafers. I see it as the latter, since the plan evolved as they got each attempt back from TSMC.
 

HurleyBird

Platinum Member
Apr 22, 2003
2,684
1,268
136
Yields are one possibility, or you could think of it as Nvidia running into a power wall. For example, Nvidia could disable another SM and clock the parts even higher so that they have the same power draw as current GTX 480s, and you'd end up with a chip that has less shader performance but a more capable uncore. My guess is that disabling one SM came down to a mix of yields and shifting the performance ratio away from shaders while staying at the same TDP.
 

Seero

Golden Member
Nov 4, 2009
1,456
0
0
Please excuse my ignorance about fabrication, and thank you all for sharing your knowledge. I will say I got the idea from blanketyblank. If they sent in a design consisting of 512 cores, then every 470 and 480 in fact has 512 cores. However, the chip probably required too much power due to weak transistors, and having all 512 cores running at 750 MHz would probably be too hot for the heatsink to handle. Cutting one SM is one solution to save power and reduce heat, but it also reduces the number of working cores. Reducing the clock is another solution, but Nvidia seems not to want to do that. The reason may be that 512 cores @ 600 MHz < 480 cores @ 700 MHz in terms of performance.

In terms of design, ATI wins, as its yield is much better than Nvidia's. The 5870 hit its design target, meaning it performs as intended. The GTX 480, on the other hand, has fewer active cores and runs at a lower speed than designed. Yet the 480's performance is still as good as the 5870's in most things and better in others. The 480 barely wins the round, single-handed.

Let's play with numbers a little. 480/512 = 0.9375 and 700/750 ≈ 0.93, so the card has about 94% of its cores, each running at about 93% of its intended speed. Net performance compared to the design target is roughly 87.5%. One of the Anand articles said that the trick to getting better yield on the 40 nm process is to use 2 vias instead of 1. That trick is the sole reason ATI won big this round, as it led to better yield, better yield led to more high-grade chips, and therefore more good cards. So Cypress runs at 100% of its design, and Fermi runs at about 87.5%. What if Fermi ran at 100%?
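
As a quick sanity check of that arithmetic (using cores x clock only as a crude proxy for raw shader throughput, which ignores bandwidth, ROPs, and everything else):

```python
# Quick check of the scaling arithmetic above. "cores x clock" is only a
# crude proxy for raw shader throughput; it ignores bandwidth, ROPs, etc.

design_cores, design_clock = 512, 750   # rumoured target spec
ship_cores, ship_clock = 480, 700       # GTX 480 as shipped

fraction = (ship_cores / design_cores) * (ship_clock / design_clock)
print(f"shipped GTX 480 vs. design target: {fraction:.3f}")   # ~0.875

# The trade-off mentioned earlier in the thread: 512 cores @ ~600 MHz
# vs. 480 cores @ 700 MHz.
print(512 * 600, "vs", 480 * 700)   # 307200 vs 336000 -> 480 @ 700 wins
```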

I am not saying Nvidia is better; the fact is ATI's engineers have won this round, and Fermi can only play single-handed because it only has one hand. But the Fermi architecture is still sound.
 

blanketyblank

Golden Member
Jan 23, 2007
1,149
0
0
Please excuse my ignorance about fabrication, and thank you all for sharing your knowledge. I will say I got the idea from blanketyblank. If they sent in a design consisting of 512 cores, then every 470 and 480 in fact has 512 cores. However, the chip probably required too much power due to weak transistors, and having all 512 cores running at 750 MHz would probably be too hot for the heatsink to handle. Cutting one SM is one solution to save power and reduce heat, but it also reduces the number of working cores. Reducing the clock is another solution, but Nvidia seems not to want to do that. The reason may be that 512 cores @ 600 MHz < 480 cores @ 700 MHz in terms of performance.

In terms of design, ATI wins, as its yield is much better than Nvidia's. The 5870 hit its design target, meaning it performs as intended. The GTX 480, on the other hand, has fewer active cores and runs at a lower speed than designed. Yet the 480's performance is still as good as the 5870's in most things and better in others. The 480 barely wins the round, single-handed.

Let's play with numbers a little. 480/512 = 0.9375 and 700/750 ≈ 0.93, so the card has about 94% of its cores, each running at about 93% of its intended speed. Net performance compared to the design target is roughly 87.5%. One of the Anand articles said that the trick to getting better yield on the 40 nm process is to use 2 vias instead of 1. That trick is the sole reason ATI won big this round, as it led to better yield, better yield led to more high-grade chips, and therefore more good cards. So Cypress runs at 100% of its design, and Fermi runs at about 87.5%. What if Fermi ran at 100%?

I am not saying Nvidia is better; the fact is ATI's engineers have won this round, and Fermi can only play single-handed because it only has one hand. But the Fermi architecture is still sound.

I think you answered your own question. If Fermi could run 512 cores at 750 MHz, it could potentially be up to ~14% faster, though probably not even that much, since scaling isn't perfect and those extra theoretical numbers won't necessarily translate into real-world performance.

We've been talking about NV vs. ATI for a while, but if we are discussing architecture, why not talk about what is keeping Fermi from competing with Intel? I haven't tried programming in CUDA or DirectCompute yet, but I'm wondering exactly what capabilities these cards are currently missing that keep them from replacing a CPU.
 

cbn

Lifer
Mar 27, 2009
12,968
221
106
Let's play with numbers a little. 480/512 = 0.9375 and 700/750 ≈ 0.93, so the card has about 94% of its cores, each running at about 93% of its intended speed. Net performance compared to the design target is roughly 87.5%. One of the Anand articles said that the trick to getting better yield on the 40 nm process is to use 2 vias instead of 1. That trick is the sole reason ATI won big this round, as it led to better yield, better yield led to more high-grade chips, and therefore more good cards. So Cypress runs at 100% of its design, and Fermi runs at about 87.5%. What if Fermi ran at 100%?

Could a 750 MHz, 512-core Fermi be roughly 30% faster than a stock HD 5870? (1.15 x 1/0.875 ≈ 1.31)
 

Cookie Monster

Diamond Member
May 7, 2005
5,161
32
86
There is another rumour that GF100 has 128 TMUs but only 64 enabled, for power reasons. The rumour kicked off when GF104 was rumoured to have 64 TMUs. Whether or not this is true remains to be seen, but it could explain some anomalies in nVIDIA's decisions for the Fermi architecture.

The first question is why suddenly go from a 3:1 ALU:TEX ratio to 8:1. Games are quite pixel heavy, but texture performance is just as important, as you can see from previous generations. ATi finally learned from this, hence the HD 5800 series with its strong texture performance. Even today's games benefit from strong texture performance, and this part was supposed to be released six months ago (DX11 games weren't around at that time).
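
For reference, the ratios being discussed, computed from the published shader and TMU counts; the 128-TMU GF100 configuration is the rumour, not a confirmed spec:

```python
# ALU:TEX ratios referenced above, from published unit counts. The 128-TMU
# GF100 configuration is the rumour under discussion, not a confirmed spec.

configs = {
    "GT200 (GTX 280)":         (240, 80),    # 3:1
    "GF100 as shipped":        (512, 64),    # 8:1
    "GF100 rumoured 128 TMUs": (512, 128),   # 4:1
}

for name, (alus, tmus) in configs.items():
    print(f"{name}: {alus // tmus}:1")
```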

We still don't know what clock frequency the rest of the core is running at, i.e. the L2 cache, ROPs, etc., because from what I've read they are different. Yes, Fermi has a shader clock, a core clock, and a different frequency for the rest of the chip. So if you look at the bigger picture (and if it does have disabled units), Fermi at 100% could well have been as fast as an HD 5970 or faster. From what was rumoured to the final retail product, I think a lot of things were cut from the target spec. They simply aimed so high that they failed to deliver the whole lot. I guess in some sense they were looking to redeem themselves after a lot of rebranding, rebranding, and further rebranding.
 

Kuzi

Senior member
Sep 16, 2007
572
0
0
We still don't know what clock frequency the rest of the core is running at, i.e. the L2 cache, ROPs, etc., because from what I've read they are different. Yes, Fermi has a shader clock, a core clock, and a different frequency for the rest of the chip. So if you look at the bigger picture (and if it does have disabled units), Fermi at 100% could well have been as fast as an HD 5970 or faster. From what was rumoured to the final retail product, I think a lot of things were cut from the target spec. They simply aimed so high that they failed to deliver the whole lot. I guess in some sense they were looking to redeem themselves after a lot of rebranding, rebranding, and further rebranding.

Actually, the L2 cache speed should be the same as the core clock. As for the ROPs, their speed depends on the memory clock: the higher the memory frequency, the higher the ROP performance. I fully agree that Nvidia aimed too high with Fermi; they also did not expect TSMC's 40 nm process to turn out this badly, which made the situation even worse for them.

Interesting thread, guys. The way I see it, ATI took the "safe" route with a smaller and less complex architecture, and this strategy paid off very well for them. We can talk about the impressive improvements Fermi brings in terms of tessellation and GPGPU capability (CS5, etc.), but when there is hardly any product on the market, it doesn't really make much difference. Hopefully that changes soon.

Architecture-wise, Fermi is clearly ahead of Cypress in many areas. Nvidia worked hard to differentiate its product from the competition, especially in GPGPU performance, which I'm sure will allow Fermi to shine once developer support catches up.

A Fermi refresh on a smaller process (28 nm?) would be a better product all round and a more serious threat to ATI. I still believe GlobalFoundries will have actual 28 nm product on the market before TSMC, but it seems Nvidia is sticking with TSMC for the next round, which may not turn out to be the best choice. I think ATI will use GF for their next GPU, but we'll have to wait and see.
 

Puddle Jumper

Platinum Member
Nov 4, 2009
2,835
1
0
On the subject of GPGPU, I can't help but feel that Nvidia's dominance of the HPC environment is as much marketing as fact. If you look at the Top500 list of supercomputers, the highest-ranked ATI system is #5, while the top Nvidia system is #56.

http://www.top500.org/list/2009/11/100

Obviously none of these systems uses GPUs exclusively; however, if Nvidia's cards were really as dominant as they are supposed to be, it seems they would at least be utilized in a top-50 system.
 

Mr. Pedantic

Diamond Member
Feb 14, 2010
5,039
0
76
Just because the fastest system happens to be based on Radeons doesn't mean that AMD has the largest (or even a large-ish) stake in the HPC market. It's a complete non sequitur.
 

Sheninat0r

Senior member
Jun 8, 2007
516
1
81
On the subject of GPGPU, I can't help but feel that Nvidia's dominance of the HPC environment is as much marketing as fact. If you look at the Top500 list of supercomputers, the highest-ranked ATI system is #5, while the top Nvidia system is #56.

http://www.top500.org/list/2009/11/100

Obviously none of these systems uses GPUs exclusively; however, if Nvidia's cards were really as dominant as they are supposed to be, it seems they would at least be utilized in a top-50 system.

I'm under the impression that most of those computers use vast parallel arrays of CPUs and not GPUs to do calculations, so that's not an entirely accurate metric.
 

Outrage

Senior member
Oct 9, 1999
217
1
0
Tianhe-1 is a hybrid, using both GPUs and CPUs; it's powered by 5,120 R700-class GPUs.

Oak Ridge and Nvidia had/have plans for a supercomputer using Fermi that would take the top spot on that list. I haven't heard anything about it since September last year, though; does anyone have any news about that project?