I wonder how the GTX 480 ended up with 480 CUDA cores instead of 512. Did they send the fab a 512-core design, or a 480-core design? If they really sent a 512-core design and, even though yields were low, were somehow able to mark the bad cores and retrofit the chip as a 480-core card, then the Fermi design rocks. Think about it: they could max out the yield by marking bad cores, making 512-, 480-, 448-, 420-, 400-, and 384-core GPUs and marketing each one individually. Or maybe they simply sent in a design with 480 CUDA cores and quickly changed it to 448 CUDA cores at the fab.
Getting back to the architecture: grouping cores into SMs is a good idea. If they can somehow power only the active SMs, it will rock.
Seriously, if the new architecture allows them to selectively disable or bypass the bad cores, it will be a dream come true in terms of production.
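To make the binning idea concrete, here is a toy sketch (my own illustration, not NVIDIA's actual process). It assumes GF100's layout of 16 SMs with 32 CUDA cores each; the `bin_chip` function is hypothetical.

```python
# Hypothetical sketch of binning a die with defective SMs into a lower
# product tier. Assumes Fermi GF100's layout: 16 SMs x 32 CUDA cores = 512.
CORES_PER_SM = 32
TOTAL_SMS = 16

def bin_chip(defective_sms: int) -> int:
    """Return usable CUDA cores after fusing off whole bad SMs."""
    good_sms = TOTAL_SMS - defective_sms
    return good_sms * CORES_PER_SM

# A perfect die ships with 512 cores, one bad SM gives a 480-core part,
# two bad SMs give a 448-core part, and so on down the product stack.
print(bin_chip(0))  # 512
print(bin_chip(1))  # 480
print(bin_chip(2))  # 448
```

Every die that comes off the wafer lands in *some* sellable bucket, which is why this would make the yield problem mostly disappear.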
---
Let's talk about the Cypress architecture. A new instruction cannot be executed before the previous one is finished, and each instruction needs to fetch data from and write data back to memory. So the total time to finish one instruction is processing time + 2× fetch time (memory latency). The problem is that Cypress can't handle more than one process at a time, although it has massive power for a single process, which makes it good for graphics and pixels. In fact, that is the reason for Eyefinity, as the architecture benefits from super-high resolutions.
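The cost formula above can be turned into a quick back-of-the-envelope calculation. The cycle counts below are made up purely for illustration, not real Cypress figures:

```python
# Toy cost model for strictly serialized execution: every instruction
# pays its processing time plus one fetch before and one write-back after.
def serialized_time(n_instructions: int, processing: int, mem_latency: int) -> int:
    # total = n * (processing + 2 * memory latency)
    return n_instructions * (processing + 2 * mem_latency)

# e.g. 1000 instructions, 4 cycles of work each, 100 cycles of memory
# latency: the two memory round trips dominate the total time.
print(serialized_time(1000, 4, 100))  # 204000 cycles
```

With numbers like these, roughly 98% of the time is spent waiting on memory rather than computing, which is why memory latency matters so much for this kind of design.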
Fermi, on the other hand, has less raw power but handles multiprocessing better. Both the tessellation and CUDA code results support this theory. The winning punch is that it can also handle a single process about as well as the Cypress architecture.
Fermi also has a lower clock speed, meaning there is headroom. The downside, of course, is the amount of electricity needed. There is no trick here: the Fermi design consists of more transistors and therefore requires more electricity to push through it.
It is very difficult to compare the two designs, as each does something better than the other. Cypress is better at pushing more pixels, and Fermi is better at multitasking. Since Cypress is better at more pixels, scaling it down makes sense, but what about Fermi? The only way to scale Fermi down is by reducing the number of CUDA cores, but how? Should there be fewer CUDA cores per SM, or just fewer SMs? And that raises another question: how many CUDA cores should there be per SM? Have they found the optimal number yet?
It appears that memory latency has less impact on the Fermi design, as we have seen in some OC reviews. The one thing that has me interested in Fermi is that instructions can be executed back to back. That changes everything when it comes to programming: it will be much easier to fully utilize the card compared to the older structure.
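Here is a rough sketch of why back-to-back issue would matter so much (again, made-up cycle counts): if the next instruction can start while the previous one's memory traffic is still in flight, the latency is paid roughly once for the whole stream instead of once per instruction.

```python
# Compare strictly serialized execution against an idealized pipeline
# where memory accesses overlap with computation. Numbers are illustrative.
def serialized_time(n: int, processing: int, mem_latency: int) -> int:
    # every instruction waits out its own fetch and write-back
    return n * (processing + 2 * mem_latency)

def overlapped_time(n: int, processing: int, mem_latency: int) -> int:
    # idealized pipeline: pay the full round-trip latency once to fill
    # the pipe, then retire one instruction per `processing` cycles
    return 2 * mem_latency + n * processing

n, proc, lat = 1000, 4, 100
print(serialized_time(n, proc, lat))  # 204000 cycles
print(overlapped_time(n, proc, lat))  # 4200 cycles
```

Under these toy assumptions the overlapped stream is almost 50× faster, which would explain why OC reviews see memory latency mattering less on Fermi.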
Having said all that, most of the above is just wild guessing. But what if it all comes true? Then we will see a video card that is:
a) cheap, because the yield problem disappears;
b) efficient, as SMs switch on/off (or OC) dynamically;
c) good at performance, whether multitasking or running a single task;
d) efficient again, as the chance of powering inactive SMs is minimized;
e) multi-purpose, as it can act like a tessellation unit and perform better than a dedicated tessellation unit, without completely destroying other tasks.
However, now that it is out, and ATI has a product that is just as good, there is nothing preventing ATI from LEARNING from Fermi and creating an even better design. Plus, the current Fermi isn't that OMG ATM, as the software environment does not fully support its architecture yet.