Dual Architecture Graphics Card

User1001

Golden Member
May 24, 2003
1,017
0
0
There's been all this talk about the failed XGI Duo cards with 2 GPUs. Couldn't ATI and NVIDIA tell manufacturers to put 2 GPUs on a circuit board and offer a dual architecture?
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
Suffice it to say that there's a *lot* more to it than just "put[ting] 2 GPUs on a circuit board".
 

Shinei

Senior member
Nov 23, 2003
200
0
0
If the GPUs communicate across an HT bus, and use a 512-bit memory interface, wouldn't it be possible, though? Granted, the GPUs would have to be specially designed to incorporate the HT communication link, but considering the bandwidth advantage it has, I think the extra R&D would be worth it.
Yes, the cost of the second GPU is an issue, but if you're taking two GPUs at 350MHz on a 150nm process and including HT and a tremendous memory bus, aren't you effectively producing two cheaper chips that do at least the same work as a considerably faster single GPU on a smaller process? I could be wrong, since I'm not entirely sure how GPUs are designed compared to K8s or P4s, so if I am, feel free to yell "IDIOT". Well, don't, it hurts my feelings. ;)
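For rough context, here's a back-of-the-envelope comparison of link bandwidth versus local memory bandwidth. The HyperTransport and memory data rates are my own assumptions for illustration, not figures from this thread.

```python
# Peak bandwidth = (bus width in bytes) x (transfers per second).
# All transfer-rate figures below are assumed for illustration.

def peak_gbs(width_bits: int, transfers_per_sec: float) -> float:
    return (width_bits / 8) * transfers_per_sec / 1e9

ht_one_way = peak_gbs(16, 1.6e9)    # assumed 16-bit HT link at 1600 MT/s -> ~3.2 GB/s per direction
mem_256bit = peak_gbs(256, 800e6)   # 256-bit local memory bus at 800 MT/s -> ~25.6 GB/s
mem_512bit = peak_gbs(512, 500e6)   # the 512-bit interface mentioned above at 500 MT/s -> ~32 GB/s

print(f"HT link (one direction): {ht_one_way:5.1f} GB/s")
print(f"256-bit local memory:    {mem_256bit:5.1f} GB/s")
print(f"512-bit local memory:    {mem_512bit:5.1f} GB/s")
```

Under those assumptions the chip-to-chip link is roughly an order of magnitude narrower than either GPU's local memory, which hints at why splitting work across two GPUs is harder than it looks.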
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
How about we just integrate the CPU to work off the GPU's memory controller? How's that for dual architecture?
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: MadRat
How about we just integrate the CPU to work off the GPU's memory controller? How's that for dual architecture?

How about we just ditch the whole cost-effective principle? How's that for practicality?
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
Originally posted by: Sahakiel
Originally posted by: MadRat
How about we just integrate the CPU to work off the GPU's memory controller? How's that for dual architecture?
How about we just ditch the whole cost-effective principle? How's that for practicality?


I'm still trying to jar some of these whiz-bang processor designers into toying with the idea of using internal memory bandwidth on par with the latest, greatest graphics cards. It wasn't but seven or eight years ago when $500 was mid-range for a CPU core. It's not like people aren't doling out $1000 for the P4EE on the belief that the bleeding edge is worth it.

I think the design lines between CPU and GPU would certainly blur if the CPU core suddenly gained a 256-bit memory controller attached to a local 256MB memory package operating in the 20-25GB/sec range. Graphics cards have been pushing these theoretical bandwidth ranges for better than a year now, so why not CPUs? There should be a way to do it in the $500-$600 range, especially if a NUMA approach were used where people could run a second set of memory, like in conventional slots, to push the total system memory up to where it's useful for the largest of programs. For the SMP crowd, each processor enjoys its own uninterrupted, dedicated memory, and all of the processors would (because of the NUMA approach) share the second expansion set of memory placed in the slots on the mainboard. Less expensive mainstream-grade modules in the $150-$400 range could use 128-bit memory controllers tied to 128MB of memory running in the 10-12GB/sec range. Cheap bastards in the entry-level crowd could use the NUMA expansion memory slots (otherwise without any high-speed memory tied to them) if they're too cheap to fork over $150... (Yes, it would sell, just like it's true that people really do buy Pentium4-based Celerons when real P4s are but a little more money!)
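For what it's worth, those bandwidth windows fall straight out of bus width times data rate. A minimal sketch, with the data rates being illustrative assumptions rather than anything quoted above:

```python
# Peak theoretical bandwidth = (bus width in bytes) x (data rate in MT/s).
# The data rates chosen here are assumptions picked to land in the ranges above.

def peak_gbs(width_bits: int, mts: int) -> float:
    return (width_bits / 8) * mts * 1e6 / 1e9

configs = [
    (256, 700),   # ~22.4 GB/s -- the 20-25 GB/s "high end" window
    (256, 800),   # ~25.6 GB/s
    (128, 700),   # ~11.2 GB/s -- the 10-12 GB/s "mainstream" window
    (64,  400),   # ~ 3.2 GB/s -- ordinary single-channel DDR400, for comparison
]
for width, mts in configs:
    print(f"{width:3d}-bit @ {mts} MT/s -> {peak_gbs(width, mts):5.1f} GB/s")
```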

The real limit of CPU power today, in my opinion, is that we really don't yet have a killer application out there that makes more power worthwhile. Not many of us even push our current systems in any meaningful way 99% of the time they are on.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: MadRat
Originally posted by: Sahakiel
Originally posted by: MadRat
How about we just integrate the CPU to work off the GPU's memory controller? How's that for dual architecture?
How about we just ditch the whole cost-effective principle? How's that for practicality?


I'm still trying to jar some of these whiz-bang processor designers into toying with the idea of using internal memory bandwidth on par with the latest, greatest graphics cards. It wasn't but seven or eight years ago when $500 was mid-range for a CPU core. It's not like people aren't doling out $1000 for the P4EE on the belief that the bleeding edge is worth it.

I think the design lines between CPU and GPU would certainly blur if the CPU core suddenly gained a 256-bit memory controller attached to a local 256MB memory package operating in the 20-25GB/sec range. Graphics cards have been pushing these theoretical bandwidth ranges for better than a year now, so why not CPUs? There should be a way to do it in the $500-$600 range, especially if a NUMA approach were used where people could run a second set of memory, like in conventional slots, to push the total system memory up to where it's useful for the largest of programs. For the SMP crowd, each processor enjoys its own uninterrupted, dedicated memory, and all of the processors would (because of the NUMA approach) share the second expansion set of memory placed in the slots on the mainboard. Less expensive mainstream-grade modules in the $150-$400 range could use 128-bit memory controllers tied to 128MB of memory running in the 10-12GB/sec range. Cheap bastards in the entry-level crowd could use the NUMA expansion memory slots (otherwise without any high-speed memory tied to them) if they're too cheap to fork over $150... (Yes, it would sell, just like it's true that people really do buy Pentium4-based Celerons when real P4s are but a little more money!)

The real limit of CPU power today, in my opinion, is that we really don't yet have a killer application out there that makes more power worthwhile. Not many of us even push our current systems in any meaningful way 99% of the time they are on.

I spent half an hour typing up a long flame. Good grief, I'm starting to dread reading your posts.
Long story short, what you propose is absolutely possible.... SEVERAL YEARS FROM NOW.
Go read a book on basic computer architecture. Patterson and Hennessy come highly recommended.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
Originally posted by: Sahakiel
I spent half an hour typing up a long flame. Good grief, I'm starting to dread reading your posts.
Long story short, what you propose is absolutely possible.... SEVERAL YEARS FROM NOW.
Go read a book on basic computer architecture. Patterson and Hennessy come highly recommended.

Like I care if you have some kind of personal problem with my ideas. This is not something that is only possible several years away; the technology exists today. Integrating memory into the mainboard is nothing new, nor is mating dedicated memory to an MPU. Different companies have already tackled the technology behind the memory controllers, so it's not like the memory technology does not exist. What I proposed was putting the CPU on an expansion card, perhaps even moving back to a slot since the interface would no longer need the huge pin count used by current fancy motherboard-centric designs, and then putting dedicated memory on the expansion card.

I realize you may be thinking that memory dedicated to the card would mean the architecture is akin to parallel processing, but this can be done in an x86-compatible machine using a NUMA-aware OS. The NUMA-aware OS is already available now, too. If graphics cards can do this kind of bandwidth for hardly the cost of a single CPU, then there is a margin upon which to build a profit. With die size dropping so drastically every couple of years, it makes more sense to integrate the GPU and CPU together now than ever before. Memory has shrunk to the point where current slots are inefficient for the speeds easily possible; what could be better than putting the memory right there next to the CPU? In order to do a combination core like this effectively, the thing must have adequate memory bandwidth. Varying levels of performance can be gleaned by varying the amounts of and pathways to the memory on the card. The idea is massive memory bandwidth for either graphics work or computing depending on need, but all of the horsepower in one package. Basically it becomes a jack-of-all-trades CPU/GPU core at an affordable price, and motherboard layouts again become simplified, as they should be kept.

In short, AMD and INTEL had better beware of graphics card makers doing this before they do it themselves.

Edited for Sahakiel's sake.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: MadRat


Like I care if you have some kind of personal problem with my ideas. This is not something that is only possible several years away; the technology exists today. Integrating memory into the mainboard is nothing new, nor is mating dedicated memory to an MPU. Different companies have already tackled the technology behind the memory controllers, so it's not like the memory technology does not exist.
My personal opinion of your posts has nothing to do with your character beyond the "get-rich-quick" type of mentality. First off, whether or not the technology is available is only one problem. 10 GHz transistors have been in existence for at least a year. There are several damn good reasons you don't see a 10 GHz Pentium IV, and they have little to do with marketing.
Second, memory technology has ALWAYS lagged behind CPU technology. From the first mainframes to the first desktops, memory technology has never once matched CPUs. The trend lately has been an ever widening gap between CPU speed and memory speed given a CONSTANT DATA BUS WIDTH. In other words, if you decide to quadruple the memory bus to 256-bit, you're gonna get a slower clock speed. It may have larger throughput overall, but it's definitely gonna cost a lot more than 4x as much as the current 64-bit setup.

What I proposed was putting the CPU on an expansion card, perhaps even moving back to a slot since the interface would no longer need the huge pin count used by current fancy motherboard-centric designs, and then putting dedicated memory on the expansion card.
CPUs on expansion cards are nothing new. How the heck do you think backplanes came into existence? SCSI wasn't always a cable bus for your hard drives. SCSI backplanes hooked up to CPU riser cards have been around since at least the early '80s. Heck, I even worked on a PC with a Pentium 233 MMX on a CPU card hooked up to a PCI backplane circa 1996. The idea is nothing new. The PROBLEM you don't seem to want to acknowledge is the fact that those systems were more expensive than ATX layouts, due to cramming everything into one card, and that everything NOT soldered on board was hella slow. Also, I think you're ignoring the fact that both Intel and AMD moved away from the Slot format as soon as was feasible. The main reason they even used the slot in the first place was that die sizes were too large to integrate enough cache on die to match performance requirements. The first iteration, the Pentium Pro, used a dual-cavity socket design for the sole reason that cache at that time was becoming a serious bottleneck, much the same as DRAM being a HUGE bottleneck spawned the introduction of said caches in the first place.
So why did Intel move away from dual-cavity packages? Because it was too bloody expensive. Why did Intel move away from Slot? Because it was too bloody expensive, especially since the CPU was seriously outpacing off-die cache speeds even at half the clock rate and latency was starting to kill performance.

I realize you may be thinking that memory dedicated to the card would mean the architecture is akin to parallel processing, but this can be done in an x86-compatible machine using a NUMA-aware OS. The NUMA-aware OS is already available now, too. If graphics cards can do this kind of bandwidth for hardly the cost of a single CPU, then there is a margin upon which to build a profit.
Let me try to hammer this point in one more time: graphics cards have different design architectures than CPUs. Why do you think graphics cards have such great memory? Because it's soldered on, close to the GPU, and the card design is somewhat different from a motherboard's. Plus, I think the memory used for graphics cards is somewhat different from the DRAM used for system memory; not only in packaging, but also in density and interfacing/accessing.

With die size dropping so drastically every couple of years, it makes more sense to integrate the GPU and CPU together now than ever before.
What planet did you hail from? Die size of CPUs is somewhat smaller each year, that's true. But how does this make room for a GPU? You have a 50+ million transistor Pentium IV versus a 110 million transistor Radeon 9800 (125 M for the GeforceFX) and somehow these two are supposed to work together? We haven't even started counting the transistors required for coupling the two very different processors. In other words, try later, preferably SEVERAL YEARS LATER.

Memory has shrunk to the point where current slots are inefficient for the speeds easily possible; what could be better than putting the memory right there next to the CPU? In order to do a combination core like this effectively, the thing must have adequate memory bandwidth. Varying levels of performance can be gleaned by varying the amounts of and pathways to the memory on the card. The idea is massive memory bandwidth for either graphics work or computing depending on need, but all of the horsepower in one package. Basically it becomes a jack-of-all-trades CPU/GPU core at an affordable price, and motherboard layouts again become simplified, as they should be kept.
Shrinking memory increases density more than speed. In the DRAM market, density is everything. Think about it.
For that sole reason, the transistor structure on DRAM is a lot different from CMOS on a CPU. Heck, I can barely even make out the parts of the transistor in a picture, it's that weird. I don't know much about process technology, but I don't think you'll be fitting the same DRAM density onto a CPU die. In fact, memory process technology is usually one or two generations ahead of CPU process technology. So, in essence, you've just killed the main advantage of DRAM: density. Which means you'll probably use something like SRAM, because DRAM in CPU CMOS technology will be too damn small and too damn slow for the die size.

Motherboard layouts are complex for one reason: expandability. Because of that, we have 5-6 PCI slots plus AGP even though everything could be integrated. We have legacy peripherals to provide support for the vast majority of hardware still on the market. We have sockets to change CPUs and sockets to swap memory, plus all the lovely voltage and current regulation. That's without the associated chipsets.
Your idea would somewhat work with embedded systems.
Oh, wait, your idea IS an embedded system, which is LOW POWER, LOW SPEED, LOW PERFORMANCE, and pretty much DOESN'T BELONG ON DESKTOPS. Not for a while, anyway.

In short, AMD and NVidia better beware graphics card makers doing this before they do it themselves.

1. "AMD and Intel"
2. They have plenty of time, quite possibly forever.
Why is that? Because nobody can put as much money into research as Intel, which means that no x86 (read: desktop) CPU is going to come close to Intel in terms of processing power for a while to come. You would have to integrate the CPU, GPU, memory controller, the DRAM you want (which I'm not even sure is possible), plus associated data paths, bus controllers, interfaces, and so on into one package with a die somewhere in the ballpark of 1cm², preferably less. Otherwise, you'll end up with something large and unwieldy with very low yields, lots of technical issues, and a price tag rivaling a CRAY.
And you're asking to do this sometime in the next few years.
Keep dreaming.
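To put a rough number on the yield point: a first-order Poisson yield model says yield ≈ e^(−D·A), where D is defect density and A is die area. The defect density below is purely an assumed figure for illustration.

```python
import math

# First-order Poisson yield model: yield ~= exp(-D * A).
# D (defects per cm^2) is an assumed, illustrative value, not a real fab figure.
D = 0.5

for area_cm2 in (1.0, 2.0, 4.0):
    estimated_yield = math.exp(-D * area_cm2)
    print(f"die area {area_cm2:.1f} cm^2 -> estimated yield {estimated_yield:.0%}")

# Growing the die from ~1 cm^2 to the area needed for CPU + GPU + controllers
# (+ any on-die DRAM) drops yield roughly exponentially, which is where the
# "price tag rivaling a CRAY" argument comes from.
```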


I would post more, but it's time to head to class. Perhaps you should do the same.
 

Mday

Lifer
Oct 14, 1999
18,647
1
81
Originally posted by: User1001
There's been all this talk about the failed XGI Duo cards with 2 GPUs. Couldn't ATI and NVIDIA tell manufacturers to put 2 GPUs on a circuit board and offer a dual architecture?

Bandwidth. AGP is not fast enough. The $3000 cards already have this feature. Etc.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
1. Sorry, User1001, for derailing the discussion.
2. Sahakiel, get real. You fire a torpedo at my idea before even contemplating the big picture. I am beginning to think you have some frustrations with something, perhaps at school, and would rather deflect it my way. Perhaps it has to do with you spending long hours up at night reading ATHT...
3. The 256-bit pathway to memory already exists and is used by today's graphics cards. Likewise, high-speed memory exists on the latest graphics cards at speeds that far exceed desktop memory. The GeForce FX5800U uses a 256-bit memory controller tied to 500MHz DDR (1000MHz eq.) memory. I rest my case that 256-bit pathways to dedicated memory are possible using today's technology.
4. The reason a slot or expansion-card type of interface COULD be used is simplicity. As the technology was polished, a socket interface could then be used to shrink the entire package. The design would not need as many pins for external memory because there would be less need for high-speed memory on the motherboard; the high-speed memory would already be on the package. A simpler 64-bit pathway should be acceptable for consumer boards, whereas dual-channel pathways would still be available for servers and other high-end products.
5. The memory is put on graphics cards close to the controller to limit trace length, else the timing would be thrown off at high clock speeds. This is the same hurdle that current SDRAM is running into using conventional DDR modules, with 250MHz memory pushing the limits of even the best boards out right now. If we want to entertain the idea of 500MHz memory, then the trace lengths have to shrink to a quarter of their current length, meaning it's absolutely necessary to put the memory next to the controller (rough flight-time arithmetic is sketched at the end of this post). Hey, that's exactly what current graphics cards do...
6. With a NUMA approach the clock speeds of main memory and expansion memory become less relevant. Therefore new memory technology can be used on the upper echelon of processors without affecting the bottom line of the mainstream processors using my approach. Lesser-performing cards could use less expensive memory configurations to keep costs down. Current graphics cards use memory controllers that allow multiple memory configurations, some offering the choice of 64-bit, 128-bit or 256-bit settings. I remember when 32MB of memory was relatively expensive, yet the Savage4 cost roughly 50-60% more than the price of just a 32MB stick of RAM, meaning the whole graphics card was dirt cheap. The Savage4 had a huge transistor count, as high as a lot of CPUs at the time, yet somehow they managed to cram 32MB and this core into the package, all for a measly pittance in price. The secret of the low price was that the 32MB of memory was dirt cheap and had relatively poor performance. This meant that the card's low memory speed was more relevant to its low price than the high transistor count of the GPU. They sure sold a lot of these cards, so someone had to be making a profit along the line, else they wouldn't have been so common.
7. AMD and Intel have dual-core processors in their roadmaps, and it's rumoured to be because the cores are becoming pad-limited as they shrink. GPUs, on the other hand, have been lagging a good 12 months behind CPUs when it comes to process size. The graphics makers just might decide to add the less complex CPU component into their cores, which is why I said AMD and INTEL need to beware. What was that direct challenge NVidia made towards Intel a short while ago?
8. My idea would work for embedded systems, but it takes a lesson from the embedded world to simplify the desktop world. A lot of modern mainboards had to move to 6 layers because the chipset/memory complexities are too much for a 4-layer design. Why not make it easier to stay with 4 layers if that is what can keep the system simpler and cheaper? Removing the need for a 6-layer mainboard is incentive to work on this approach. Hell, it could make it so Intel and AMD could share a common motherboard design but use different sockets to connect their CPU packages. Kind of like in the Socket A/370 days, where some VIA reference designs could merely call for a different northbridge and CPU socket (they had common pinouts) for AMD or Intel processor support, yet that was the only way to distinguish them. This would conserve valuable design time for motherboard layouts yet allow each to keep things proprietary.

Innovation comes from someone taking a lead and running with the design they know will work. Convincing naysayers that it is possible would cost too much time, so the deed gets done and the naysayers get left behind. I have a feeling that AMD and INTEL had better watch out for some renegade GPU maker someday soon issuing a design that makes theirs outdated. They would catch up, but any serious outside competition can have serious long-term consequences for one's bottom line. I think my idea is basically a bridge from what others have already done to what can be done by taking it another step in a new direction. If it doesn't work, then okay, I'm wrong. But if it revolutionized the industry, then who wants to be the one in the marketplace looking at the other guy's back?
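Point 5's trace-length claim can be sanity-checked with first-order flight-time arithmetic. The propagation speed and the fraction of the bit time budgeted for trace skew below are assumptions, not figures from this thread:

```python
# Rough flight-time arithmetic behind "the memory has to sit next to the controller".
# Assumed: ~15 cm/ns signal propagation on FR4, and a quarter of the bit time
# budgeted for trace-length skew. Both numbers are illustrative assumptions.

PROP_CM_PER_NS = 15.0
BUDGET_FRACTION = 0.25

def max_trace_cm(data_rate_mts: float) -> float:
    bit_time_ns = 1e3 / data_rate_mts     # bit time in ns for a given MT/s rate
    return bit_time_ns * BUDGET_FRACTION * PROP_CM_PER_NS

for mts in (250, 500, 1000):
    print(f"{mts:4d} MT/s -> bit time {1e3 / mts:.1f} ns, ~{max_trace_cm(mts):4.1f} cm trace budget")

# Under these assumptions, every doubling of the data rate roughly halves how
# far the DRAM can sit from its controller -- hence soldering it right next to
# the GPU, exactly as graphics cards do.
```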
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: MadRat
1. Sorry, User1001, for derailing the discussion.
2. Sahakiel, get real. You fire a torpedo at my idea before even contemplating the big picture. I am beginning to think you have some frustrations with something, perhaps at school, and would rather deflect it my way. Perhaps it has to do with you spending long hours up at night reading ATHT...
Unfortunately, you're the only one I know that exemplifies such passion.

3. The 256-bit pathway to memory already exists and is used by today's graphics cards. Likewise, high-speed memory exists on the latest graphics cards at speeds that far exceed desktop memory. The GeForce FX5800U uses a 256-bit memory controller tied to 500MHz DDR (1000MHz eq.) memory. I rest my case that 256-bit pathways to dedicated memory are possible using today's technology.
I never said the technology was impossible today. Perhaps you should read carefully and see that I keep saying that such a design for CPUs will far exceed reasonable cost. Plus, you still have the socket problem at the system level.

4. The reason a slot or expansion-card type of interface COULD be used is simplicity. As the technology was polished, a socket interface could then be used to shrink the entire package. The design would not need as many pins for external memory because there would be less need for high-speed memory on the motherboard; the high-speed memory would already be on the package. A simpler 64-bit pathway should be acceptable for consumer boards, whereas dual-channel pathways would still be available for servers and other high-end products.
That's why I keep saying wait several years. It takes relatively little time to take an existing system design, shave off everything except the bare necessities, and then plop it onto an expansion board. Notebooks have been doing that for years (except for the expansion card layout). However, if you want to integrate it into a socket, well, that's a whole 'nother story. That requires at least one or two new process technologies to even shrink the CPU to a decent size to make room for integration of other system components.

5. The memory is put on graphics cards close to the controller to limit trace length, else the timing would be thrown off at high clock speeds. This is the same hurdle that current SDRAM is running into using conventional DDR modules, with 250MHz memory pushing the limits of even the best boards out right now. If we want to entertain the idea of 500MHz memory, then the trace lengths have to shrink to a quarter of their current length, meaning it's absolutely necessary to put the memory next to the controller. Hey, that's exactly what current graphics cards do...
How many times do I have to tell you that graphics cards and motherboards have different design methodologies? You can try integrating the CPU, memory controller, and DRAM onto one package (or Opteron + DRAM), but you'll run into problems with upgrades. Unfortunately, upgradeability supersedes performance when designing for the consumer and server markets. That, and you run into more problems with I/O running through the socket pins. I'm not even sure if HyperTransport would work through a socket.
On the other hand, embedded systems love your integration idea and have been designing systems as such for years. In fact, I'm holding one right now which is used in our lab sessions. The system is based around a Motorola 68HC12 core. However, you keep forgetting that with embedded systems, power usage (and heat) plus real-time response far outweigh overall performance. Hence, what you see now on the desktop, that level of raw power won't make it to the embedded market for years to come.

6. With a NUMA approach the clock speeds of main memory and expansion memory become less relevant. Therefore new memory technology can be used on the upper echelon of processors without affecting the bottom line of the mainstream processors using my approach. Lesser-performing cards could use less expensive memory configurations to keep costs down. Current graphics cards use memory controllers that allow multiple memory configurations, some offering the choice of 64-bit, 128-bit or 256-bit settings.
The memory hierarchy you describe is pretty well exploited with caching. The main difference I can see is whether or not the OS is aware of it.
Oh, and by the way, newer isn't necessarily faster or better.

Graphics cards' memory configurations tend to come from one of two ideologies:
The first is to design a high-end part (with the vaunted 256-bit crossbar) and downgrade defective parts to the lower end. Aka, if half your crossbar doesn't work, you get a 128-bit memory bus width. Well, effectively.
The second is to design the same pipelines, but multiple chips. Graphics chips are superpipelined, massively parallel architectures. You have one long pipeline duplicated several times over. Basically, you then have multiple performance tiers with different numbers of copies on chip. Your high-end chip has everything you can fit on the biggest die you can use, whereas the value chip has one copy of each with a smaller die to reduce cost.
CPUs follow the first ideology for one good reason: CPUs are general-purpose chips. Each pipeline tends to be different (it's called superscalar). You may have more copies of certain pipelines depending on your characteristic applications' needs.

I remember when 32MB of memory was relatively expensive, yet the Savage4 cost roughly 50-60% more than the price of just a 32MB stick of RAM, meaning the whole graphics card was dirt cheap. The Savage4 had a huge transistor count, as high as a lot of CPUs at the time, yet somehow they managed to cram 32MB and this core into the package, all for a measly pittance in price. The secret of the low price was that the 32MB of memory was dirt cheap and had relatively poor performance. This meant that the card's low memory speed was more relevant to its low price than the high transistor count of the GPU. They sure sold a lot of these cards, so someone had to be making a profit along the line, else they wouldn't have been so common.
If I remember correctly, S3's Savage4 had around 9 M transistors at a 110MHz core and 125 MHz memory. At the same time, the Pentium 3 had around 5 M transistors running at 600 MHz with 100MHz memory. The Savage4's memory was based on SDRAM, the same type as desktops. 125MHz SDRAM was not exactly "dirt" cheap, but it was cheaper than, say, SGRAM. Notice that S3 could not hit above 150 MHz no matter how much they tried. I do believe S3 could have made the Savage4 run at even 200MHz if they had the financial backing of Intel. However, even today's Radeon9800 and GeforceFX run slower than that 600 MHz Pentium3.
My point so far has been that you can't get more transistors to run fast without large gobs of money and technical know-how. There is a lot more to cost than just transistor count. It may have been an error on my part to emphasize that aspect due to it being the easiest to point out, and the one you seem to ignore the most.
Oh, and you seem to be implying that the Savage4's memory was integrated onto the same die as the GPU. Well, no, it wasn't. Anyway, even if they did manage to sell a lot of those cards, the reason is due entirely to both the low price and S3's relationship with OEMs. However, it doesn't seem to have been enough to keep the company competitive. It may not have been enough to cover development costs, but that's pure speculation on my part.
Intel runs its own fabs. In terms of chip output, I believe they exceed any other company in the world. S3 doesn't have anywhere near the same fab capabilities; never had, never will. Why does Intel have its own fabs? So they can use cutting-edge process technologies and high capacity to outproduce and outperform the competition. Why doesn't ATI or nVidia do the same? Because it costs too much for them to run a fab so they outsource. ATI is fabless, I think, and nVidia is either close or in the same boat. Without a fab, development costs go down because then you can hire a fab company that can produce your chips at a lower cost than if you ran the fab yourself. Fab companies have low cost due to more efficient use of resources (like technicians) and production capacity. Much like pipelines are inherently more efficient when full and the "cost" per instruction becomes minimal.

7. AMD and Intel have dual-core processors in their roadmaps, and it's rumoured to be because the cores are becoming pad-limited as they shrink. GPUs, on the other hand, have been lagging a good 12 months behind CPUs when it comes to process size. The graphics makers just might decide to add the less complex CPU component into their cores, which is why I said AMD and INTEL need to beware. What was that direct challenge NVidia made towards Intel a short while ago?
I can't recall off the top of my head what challenge nVidia made to Intel, but I can speculate. In terms of nVidia producing a CPU to challenge Intel, that would be very surprising. Like I said many times before (and you've so far ignored), GPUs and CPUs have different design methodologies. They have different tasks and different data sets. It wouldn't be hard for nVidia (or any other company, for that matter) to produce an x86 CPU. What would be difficult is ramping up clock speeds to match Intel and AMD's level of performance. No other company has as much experience with the x86 ISA as those two. Plus, you gotta wonder where nVidia would produce said CPUs. They don't have a new fab coming online in the next couple of years, which means if they suddenly decided to drop $6+ billion today for a cutting-edge 65nm fab, it wouldn't be ready for at least 3 years. After that, they have to tweak the process to get anything to work, then tweak it some more to get anything running fast, send the data back to the design lab, and tweak the modified design a bit more. Once that fab's ready and the CPUs come rolling off the line ready to market, you have to deal with Intel's brute-force fab production capacity. It's hard selling your new, untested CPU when the established competition can flood the market and drive you out. The only way nVidia could pull off something of that scale is a new CPU that easily surpasses anything Intel can offer, at a low price, AND keeping it up for years and years. Intel has an established reputation that's damn hard to beat (see how long AMD has been at it).
Thus, it's not entirely surprising to me that I haven't heard anything about nVidia challenging Intel in x86 CPUs, because such a challenge is literally laughable. They'd seriously have to pull off something revolutionary to take on AMD, let alone Intel. Adding a simple CPU to the GPU is not exactly the right idea. The entire system architecture runs contrary to that. I can't begin to imagine the chaos such a move would cause for software engineers. If you have any inkling of the true nature of the x86-64 debate, then you know that radically changing software to run on a new architecture is met with very high resistance.
On the other hand, if you do integrate a CPU into the GPU die, it's no simple matter by itself. You also have to integrate the memory controller. Then you have to figure out your bus interface. Do you want to try accessing main memory over the AGP bus (not recommended), or PCI (even worse), or integrate it onto the graphics board (a better idea, but it costs a lot more)? What about I/O? The same questions come up, and if you're looking for anything remotely close to good performance, you'll integrate it onto the graphics board. At that point, it seems you have a system on a riser board connected to a backplane. Now where have I heard this before? HMMMMMmmmmm..........
And then there's embedded....
Sheesh, you're making me sound like a parrot.

8. My idea would work for embedded systems, but it takes a lesson from the embedded world to simplify the desktop world. A lot of modern mainboards had to move to 6 layers because the chipset/memory complexities are too much for a 4-layer design. Why not make it easier to stay with 4 layers if that is what can keep the system simpler and cheaper? Removing the need for a 6-layer mainboard is incentive to work on this approach.
You just gave the reason for 6-layer boards: chipset/memory complexities. That, and I/O and expansion boards.

Hell, it could make it so Intel and AMD could share a common motherboard design but use different sockets to connect their CPU packages. Kind of like in the Socket A/370 days, where some VIA reference designs could merely call for a different northbridge and CPU socket (they had common pinouts) for AMD or Intel processor support, yet that was the only way to distinguish them. This would conserve valuable design time for motherboard layouts yet allow each to keep things proprietary.
Not a new idea. AMD pushed the methodology you describe during the Slot A era. AMD motherboards at that time were relatively simple changes to existing Intel boards. Just swap out the chipset and socket, add a few tweaks here and there, and we have an AMD board. Unfortunately, nowadays the CPU architecture and the accompanying bus architecture are so different, and the complexity of the board has increased to the point where I don't think that situation can occur. My guess is the expansion and basic I/O areas could be relatively untouched, but the power regulation, memory, northbridge, and basically everything else would require major redesigns. However, I know very little of what goes on underneath, so they may be able to share more than I know.


Innovation comes from someone taking a lead and running with the design they know will work. Convincing naysayers that it is possible would cost too much time, so the deed gets done and the naysayers get left behind. I have a feeling that AMD and INTEL had better watch out for some renegade GPU maker someday soon issuing a design that makes theirs outdated. They would catch up, but any serious outside competition can have serious long-term consequences for one's bottom line. I think my idea is basically a bridge from what others have already done to what can be done by taking it another step in a new direction. If it doesn't work, then okay, I'm wrong. But if it revolutionized the industry, then who wants to be the one in the marketplace looking at the other guy's back?
Damn, now you just sound like Romero.


*edit : Less annoyed, now.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
I'm not sure what I said implied embedded memory. I was thinking more along the lines of designing the CPU's package (you know, the thing they mount the CPU core onto) around a flat card that includes a first stage of high-speed memory dedicated to the local processor. The memory would not be built into the CPU core, but rather mounted onto the CPU's packaging to place it right up close to the core. The second stage of memory, the NUMA memory, would be external to the CPU packaging and likely on the mainboard like we do it today. The card could include CPU and GPU functions, seeing as they would share enough memory bandwidth that it would be possible to avoid using AGP/PCI-Express video cards.

I forget the exact quote made by the chairman of NVidia, but I believe he said NVidia has big plans to sink Intel across the whole market spectrum. He may have only meant integrated graphics, but it sounded like he meant the GPU was going to be taking on functions that are impossible to do at anywhere near the same performance on current CPUs. It sounded like a thinly disguised threat to Intel's overall processor and chipset businesses.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: MadRat
I'm not sure what I said implied embedded memory. I was thinking more along the lines of designing the CPU's package (you know, the thing they mount the CPU core onto) around a flat card that includes a first stage of high-speed memory dedicated to the local processor. The memory would not be built into the CPU core, but rather mounted onto the CPU's packaging to place it right up close to the core. The second stage of memory, the NUMA memory, would be external to the CPU packaging and likely on the mainboard like we do it today. The card could include CPU and GPU functions, seeing as they would share enough memory bandwidth that it would be possible to avoid using AGP/PCI-Express video cards.

I'm also still trying to figure out what gave you the idea that DRAM could be integrated into a socket. Socket size hasn't changed much in the last several years. Socket 5 and Socket 7 were pretty much the same size as Socket 370 and Socket A and even Socket 423. Socket 478 shrunk the socket size significantly rather than expanding it, whereas the AMD Sockets 754, 939, and 940 are all pretty close to older socket sizes. To me, that means that motherboard real estate is expensive. Or, it could be that larger sockets are harder to design. Either way, if the size of sockets is relatively constant, it's hard to see any room to slap on even a single 8MB DRAM chip.
The only solution I can see is a PCB. Mounting the CPU onto a PCB and plopping down a northbridge and DRAM looks suspiciously like a slot cartridge with an extra chip. The only difference is that the cache from Slot 1/A is now DRAM. Slot cartridges are, if I remember correctly, expensive and difficult to ramp up to high speeds. They're better suited for servers, where capacity and massive parallelism are more important than executing single tasks very fast. Plus, the customers have deeper wallets. If slot cartridges make a return to the desktop, it is only as a last resort. If you're using NUMA, you gotta wonder whether a slot cartridge can handle GPU, southbridge and I/O, and socketed DRAM traffic. The old slots only had CPU traffic, and they were big enough. Imagine what plopping the northbridge on there would do. Plus, I'm sure having a second memory controller on the motherboard means you don't have much in terms of cost reduction.
If you integrate a GPU onto your CPU card, you still have to slap on a memory controller in between your CPU and GPU. Graphics cards are not independent of CPU functions, even with the AGP spec. Then, if you're sharing memory, you're going to need either a separate chip handling memory accesses for both chips, or you're going to have to route GPU memory through the CPU's integrated controller. Otherwise, if you're pushing for both the CPU and GPU to share the same physical DRAM but use their own 256-bit crossbar memory controllers, that's just asking for some serious trouble.
You're also forced to cannibalize either the CPU's or the GPU's performance with shared memory. We all know that. We don't have high-speed DRAM on graphics cards purely for marketing purposes. Those processors really do take advantage of the high bandwidth. If you're going to shave off 6.4 GB/s from the 25 GB/s average for graphics cards, that's roughly a 25% reduction in memory bandwidth. Add in extra latency for the crossbar, and you have to ask whether or not the extra throughput can make up for it.
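The arithmetic behind that figure, as a quick sketch using the two numbers quoted above:

```python
# Shared-memory split, using the figures quoted above.
gpu_local_bw = 25.0   # GB/s, the "average for graphics cards" cited in the post
cpu_share    = 6.4    # GB/s carved out for the CPU (dual-channel DDR400-class)

remaining = gpu_local_bw - cpu_share
reduction = cpu_share / gpu_local_bw

print(f"left for the GPU: {remaining:.1f} GB/s ({reduction:.0%} of peak given up)")
# ~18.6 GB/s remains for the GPU, roughly a quarter of its bandwidth gone
# before counting any added crossbar/arbitration latency.
```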
So we're stuck with independent memory arrays for integrating a high-performance CPU and a decent GPU (the basis of your entire argument). So we're looking at something the size of the Voodoo5 6000 just for the processors, northbridge, DRAM, and associated power, buses, etc. Of course, after spending all this money integrating so many components into something so large, it doesn't take an idiot to figure out that integrating the rest of the system takes little effort. At that point, you're looking at a system board. Expansion is only available through the backplane.
You keep pushing for doing the same thing, only shrinking everything down first. That's the domain of embedded systems. They're small, highly integrated, cheap, low power, but low performance. Their small size is possible only due to specially designed components or die-shrunk versions of previous generations. You don't see any significant market for embedded systems using cutting-edge components. Well, maybe the military, but they're so slow that by the time anything is deployed it's not cutting edge anymore.
You have to realize that high performance comes at a price. Independent, specialized components will outperform general-purpose and shared units. Independent memory pools for graphics and CPU outperform shared memory pools even when running at the same speed.

And NUMA-aware OSes, from my understanding, are much better suited to multiprocessor environments with localized memory. Dual-core desktop processors are still a ways off, and dual-processor desktops are still the domain of workstations. A NUMA OS could be applied to Opteron systems, I'm sure, but I wonder just how much MS is willing to invest in AMD.
As for single-processor platforms, I have my doubts as to whether NUMA-aware Windows is going to make much of a difference. The caching scheme currently in use is largely effective, even if it is highly inefficient. Heck, if you look at it one way, the only difference between NUMA and exclusive caching is which system component determines what data requires the faster access.

I forget the exact quote made by the chairman of NVidia, but I believe he said NVidia has big plans to sink Intel across the whole market spectrum. He may have only meant integrated graphics, but it sounded like he meant the GPU was going to be taking on functions that are impossible to do at anywhere near the same performance on current CPUs. It sounded like a thinly disguised threat to Intel's overall processor and chipset businesses.

The GPU already takes on functions at performance levels that current CPUs can't come anywhere near. Their peak floating-point performance far exceeds the best Pentium IV's, even with SSE/SSE2. That's what they're designed for. But tacking a CPU onto a GPU is a lot different than tacking a GPU onto a CPU. The system architecture isn't designed for that. Perhaps the best way for nVidia to pull off something like that is if a large majority of programs suddenly, overnight, required massive floating point (on massive data sets). The AGP bus is too slow to run regular programs on the GPU using memory from the system, and that will be a problem if you're running any games with large textures. Trying to run the entire program on the graphics card is an idea, but the problem is you can't upgrade the memory if you run out. Consumers will balk at having to replace a $500 part (for the low end, too) when in the past it was closer to $50.
The other option is that nVidia has to pull off a miracle: a good GPU with a decent CPU for use in a CPU socket. My guess is the lack of memory bandwidth will just kill the idea before it reaches the labs. It would work well in a proprietary system where you're not limited to sockets or you can just solder components on board. Starting to sound like a console... like the PS2.
 

MrSheep

Junior Member
Dec 1, 2003
2
0
0
Originally posted by: MadRat
I'm not sure what I said implied embedded memory. I was thinking more along the lines of designing the CPU's package (you know, the thing they mount the CPU core onto) around a flat card that includes a first stage of high-speed memory dedicated to the local processor. The memory would not be built into the CPU core, but rather mounted onto the CPU's packaging to place it right up close to the core. The second stage of memory, the NUMA memory, would be external to the CPU packaging and likely on the mainboard like we do it today. The card could include CPU and GPU functions, seeing as they would share enough memory bandwidth that it would be possible to avoid using AGP/PCI-Express video cards.

So this is one processor card containing a distinct CPU, "L1" memory (aka off-die L3 cache?), a GPU (on the CPU die or off-die?) and a memory controller (where does it go?). Sounds expensive, and a packaging nightmare in terms of supplied configurations. Assume the processor "card" is sold with 6 different CPU speed grades and 3 graphics grades (high-end, midrange, value) with, say, 2 different memory configurations (64MB and 128MB). This would give you 36 SKUs to package and sell; OEMs would hate you, end users would be confused, and the manufacturing and packaging costs would literally kill your profit margin.
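The SKU count is just the product of the option axes; a quick check using the figures from the paragraph above (the axis labels are the ones listed there):

```python
from itertools import product

# Option axes from the paragraph above: 6 CPU speed grades x 3 graphics grades
# x 2 memory configurations.
cpu_grades  = range(6)
gpu_grades  = ["high-end", "midrange", "value"]
mem_configs = ["64MB", "128MB"]

skus = list(product(cpu_grades, gpu_grades, mem_configs))
print(len(skus))   # 36 distinct packages to build, stock, and support
```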

Now, if you were proposing a desktop SoC (system on a chip), I could at least partially understand where you're coming from. SoCs are nothing particularly new or innovative; lots exist within the embedded space, and all are very low performance (in comparison to a Celeron, P3, P4, Athlon, A64 or even XScale), mainly for cost reasons.
Speaking of which, I remember the ill-fated Intel "Timna" desktop SoC from 2000/2001. IIRC it was humanely killed just before launch after the realisation dawned that an SoC wasn't yet commercially viable for desktop PCs. Timna was a marvel of integration, combining CPU core + GPU + memory controller + PCI all on a single large die. Shame it cost more than a Celeron + i810 and had lower performance.

NUMA, umm, yes. Why or how is the concept NUMA rather than just a strange UMA, or, for that matter, why is it not COMA?
You seem to be advocating NUMA (and I'm assuming ccNUMA here, not plain NUMA), but your proposed ccNUMA system wouldn't play to any of the strengths of ccNUMA arch systems; actually, your arch plays right into where NUMA is weakest (high-speed remote access). In particular, with NUMA arch systems local memory references are fast whilst remote ones are slow (typical ratio 1:[2-6] for remote access).
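A quick sketch of why that ratio bites: average access time as a function of how often you hit the remote pool. Only the 1:2 to 1:6 ratio comes from the paragraph above; the local latency and the access mixes are assumed for illustration.

```python
# Average access time under NUMA for a given fraction of remote accesses.
# local_ns is an assumed figure; the remote ratios follow the 1:[2-6] range above.

local_ns = 100.0  # assumed local access latency, ns

def avg_access_ns(remote_fraction: float, remote_ratio: float) -> float:
    remote_ns = local_ns * remote_ratio
    return (1 - remote_fraction) * local_ns + remote_fraction * remote_ns

for ratio in (2, 6):
    for frac in (0.1, 0.5):
        print(f"ratio 1:{ratio}, {frac:.0%} remote -> {avg_access_ns(frac, ratio):.0f} ns average")

# With half the traffic going to the remote pool at 1:6, the average access is
# several times worse than local -- which is the weakness being pointed out for
# an architecture that leans heavily on the "expansion" memory.
```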

I forget the exact quote made by the chairman of NVidia, but I believe he said NVidia has big plans to sink Intel across the whole market spectrum. He may have only meant integrated graphics, but it sounded like he meant the GPU was going to be taking on functions that are impossible to do at anywhere near the same performance on current CPUs. It sounded like a thinly disguised threat to Intel's overall processor and chipset businesses.

NVidia don't have a CPU design team, don't have a fab, don't have the cash to buy/build a fab, and don't have a history of being able to successfully execute designs on latest-generation CMOS manufacturing processes. I'm reminded of NVidia saying the usage of a low-k process was "crazy" and impossible earlier this year. Then a couple of months later ATi shipped the RV360 on 130nm low-k, whilst IBM, Intel & AMD had been shipping high-performance, high-volume low-k products for nearly a whole year. If it hadn't been for both ATi and NVidia using TSMC, then perhaps it could be argued that the low-k problem was TSMC's, but with ATi able to ship a high-volume, low-cost part using the same process, it clearly indicated NVidia lacked current-generation process design understanding. IMHO, if you're going to make a run for the desktop CPU market this is critical expertise to possess internally; a single large misstep can literally kill you and historically has knocked out a multitude of competitors.
If anything, NVidia has the "vision" that GPUs will become more important than CPUs, leaving the CPU as a mere commodity-priced component whilst the GPU commands a premium. IMHO the relatively robust sales of low-performance commodity integrated graphics chipsets make this scenario quite unlikely. Intel Extreme Graphics, which in performance terms is a joke, is actually the market-leading GPU based on units sold.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
Sahakiel-

The socket would enlarge, true, but it takes on enough components to justify the cost. Two 256-bit memory controllers are not necessary. A single 256-bit memory controller would suffice for the local dedicated memory, with the CPU core acting as both CPU and GPU. One core. Not two, but one general-purpose jack-of-all-trades-master-of-neither core. General purpose with adequate mainstream performance, supplemented with capacity for both high- and low-end offshoots, is the key to reducing cost.

A general-purpose core is never going to outperform standalone, specialized chips. Then again, the standalones each require so many components that it basically ends up adding to the overall cost to support them. We're cutting out the independent memory controllers for each GPU and CPU, thereby simplifying the system to one high-speed memory bus. So what if we cannibalize the GPU side of the performance somewhat? The CPU is going to have unfettered memory access that it never had before. In this scheme the GPU is not a standalone device, actually, so it may not affect the performance as much as you might think in comparison to a standalone architecture. The nForce 200 with a matched set of 266DDR was actually quite similar to the GF4 MX200 in performance, although the integrated graphics core shared memory with the CPU! So why would we expect this scheme to perform differently?

The external (NUMA) memory architecture does not have to be as fast as the primary memory, just the bridge between memory and storage. The primary memory would be the high-end performance memory. The external memory (NUMA) would be whatever the market would bear as efficient for the customer base, with each market segment able to use different memory types if need be. This gives the mainboard manufacturer a lot of flexibility, something that they've been working towards for years now.

MrSheep-

This card would contain: 1. a distinct MPU (multipurpose processor unit); 2. a flexible-setting memory controller (256-, 128-, or 64-bit); 3. flexible support for memory quantity (anywhere from 64MB to 2GB). This is not an SoC. The external memory is not a uniform standard where, say, SDRAM or RDRAM has to be affixed. The northbridge on the mainboard (for boards that would support external memory) would need to be compatible with the pathway from CPU to northbridge, but the memory type need not be defined.

The writing is on the wall. High-tech companies don't need their own fabs. Few megacorporations can even afford to do it on their own. NVidia has been a fabless company for years now. ATI is a fabless company. So what? They still get their products out the old-fashioned way: they bid for fab time. That's how the free market system works, you know. ;)

What this design gives is a way to integrate memory into the core's socket without defining the support required from the mainboard manufacturers. The CPU package does not need much memory for basic functions, but we are to the point where it's ridiculous to think of memory in any quantity below 128MB. I can run just about anything on my machine with 128MB, not that I'd want to, but for beginners and low-end customers it's plenty acceptable.

System integrators could opt for absolutely no expansion slots for memory and place no embedded memory on the mainboard for absolute bare-bones costs. With this strategy there can be one board for no external memory and another for using external memory. Either way, whichever CPU package they choose largely determines what performance spectrum they want to target. The customer can still be sold on upgrading the CPU later to something bigger and better!

As for a lot of SKUs, why would one offer 36 SKUs? The GPU functions are inherent to the core, not separate. The memory on the CPU package would vary by size and quantity to fit the target market. I don't see them using much more than 3-4 memory sizes, with the memory bandwidth configuration varying in but three ways. Let's think using today's technology: at the high end you may see a 256-bit/1GB config and a 256-bit/2GB config with much slower memory. At the mid-market level you might see a 128-bit/1GB config, a 128-bit/512MB config, and maybe even a 128-bit/256MB config. At the low end you might see a 64-bit/256MB config and a 64-bit/128MB config. Notice how the different market segments are related? The 256-bit/2GB and 128-bit/1GB configurations are one and the same configuration, only the latter with half the installed memory. Likewise, the configurations overlap all the way down to the bottom of the spectrum. So we see but six SKUs for a very flexible range of CPU packaging.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: MadRat
Sahakiel-

The socket would enlarge, true, but it takes on enough components to justify the cost. Two 256-bit memory controllers are not necessary. A single 256-bit memory controller would suffice for the local dedicated memory, with the CPU core acting as both CPU and GPU. One core. Not two, but one general-purpose jack-of-all-trades-master-of-neither core. General purpose with adequate mainstream performance, supplemented with capacity for both high- and low-end offshoots, is the key to reducing cost.
Your "general purpose jack-of-all-trades-master-of-neither core" is a regular CPU. I can't begin to fathom how you don't see that.
Enlarging the socket size for nothing more than adding on DRAM is largely a waste and practically infeasible with current technology. I don't even know if it's possible to make a socket adhere to the same trace designs as PCBs. That's pretty much what you're going to end up with if you try to push forward this "integrated DRAM" idea. The only way you're going to be able to bring DRAM closer to the CPU is by soldering your components onto a PCB and connecting it to the motherboard via slots. If you're doing it with a Pentium IV, you'll also need a second memory controller in addition to the one on your motherboard. If you're doing it with an Opteron, you're gonna need to get HyperTransport links through the socket AND a SECOND memory bus to access memory in the "low end" standard DRAM sockets.

A general-purpose core is never going to outperform standalone, specialized chips. Then again, the standalones each require so many components that it basically ends up adding to the overall cost to support them. We're cutting out the independent memory controllers for each GPU and CPU, thereby simplifying the system to one high-speed memory bus. So what if we cannibalize the GPU side of the performance somewhat? The CPU is going to have unfettered memory access that it never had before.
That's great, but much like the nForce2, CPUs aren't designed to sustain that much bandwidth. You can't just take fifty years of development and throw it out the window. Caching systems have been in place for years, and for a good reason: memory is slower than the CPU. Even if you were to somehow get DRAM to run at 3 GHz to match your 3GHz Pentium IV, the gains you'll see are really not worth the cost. Something on the order of 15% better performance for close to 1000% the cost, depending on DRAM speed. What you're proposing is technically just another level of cache. It just happens to be off-die (like older versions) and high-speed (like the slot era).
The problem with main memory isn't really the bandwidth. Well, not at this point, anyway. It's the latency. While you can help alleviate that by putting DRAM closer to the CPU and ramping up throughput, the best way is still sticking the memory on the same die. That's the lesson learned with the Pentium Pro.
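Since the "just another level of cache" framing keeps coming up, the standard average-memory-access-time recurrence from Patterson and Hennessy makes the trade-off concrete. Every latency and miss rate below is an assumption for illustration, not a measured figure.

```python
# AMAT (average memory access time) with and without an extra off-die memory
# level. AMAT = hit_time + miss_rate * (AMAT of the next level down).
# All latencies and miss rates are assumed, illustrative values.

def amat(levels):
    """levels = [(hit_time_ns, miss_rate), ...]; the last level should have miss_rate 0."""
    total = 0.0
    for hit_ns, miss_rate in reversed(levels):
        total = hit_ns + miss_rate * total
    return total

baseline     = [(1, 0.05), (10, 0.20), (100, 0.0)]              # L1, L2, main DRAM
with_package = [(1, 0.05), (10, 0.20), (40, 0.30), (100, 0.0)]  # + fast on-package DRAM

print(f"baseline:        {amat(baseline):.2f} ns")
print(f"with on-package: {amat(with_package):.2f} ns")
# The extra level only helps to the extent its hit time and hit rate beat going
# straight to main memory -- raw bandwidth alone does not fix the latency problem.
```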

In this scheme the GPU is not a standalone device, actually, so it may not affect performance as much as you might think compared to a standalone architecture. The nForce 200 with a matched set of 266DDR was actually quite similar to the GF4 MX200 in performance, even though the integrated graphics core shared memory with the CPU! So why would we expect this scheme to perform differently?
If you're referring to GeForce2 MX200 level of performance, how is your statement supposed to mean anything? GF2 MX200 has 1.6 GB/s in memory bandwidth. One single channel of 266DDR would more than satisfy that. Heck, it seems to me SDRAM on the GPU's 128-bit memory bus would suffice. You're left with 2.6 GB/s for the CPU. That's just a bit shy of a 333FSB, but you'll probably run faster RAM with a 333FSB than a 200 or 266 FSB, anyway.
If you're comparing it to a GeForce2 MX400, we're looking at 2.7 GB/s bandwidth. Still easily supplied by dual PC2100 DDR with enough left over (1.5GB/s) for a large chunk of Athlon processors. Up it to dual DDR333 (5.2GB/s) and problem's solved. The CPU gets 2.5 GB/s, just shy of the 333 FSB requirement.
The Geforce4 MX420 is the same as or similar to the Geforce2 MX400 in memory.
The Geforce4 MX440 finally strains the memory system weighing in at 6.4 GB/s. That's the exact amount given by dual channel DDR400. The AGP8x version gets better with a whopping 8.2 GB/s, just shy of Geforce4 MX460's 8.8GB/s.
So what's the point? The point is the nForce2 system provides Geforce4 level of performance with system memory for the sole reason that those graphics chips are paired with memory slower than system memory in the first place. That's why they're cheap. You have to get to Geforce4 MX440 level to really strain the system with dual channel DDR400. That problem is easily addressed by adding better memory management to the MX440 (like that found in the Geforce4 Ti series).
So why would I expect your scheme to work differently? Because it doesn't work with anything beyond value parts. The Geforce4 MX line has always been the value series. There are already plenty of solutions, all of them cheaper than what you're proposing. Using your idea on a high-end part just ends up cannibalizing one component AND driving up costs, thus negating any advantage.
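All of the bandwidth figures in this exchange reduce to bus width times effective transfer rate; a small sketch with the commonly quoted clocks (treat the exact numbers as assumptions):

# Peak memory bandwidth in GB/s = bus width in bytes x effective transfer rate.
def bw_gbs(bus_bits, effective_mts):
    return bus_bits / 8 * effective_mts / 1000

# Commonly quoted specs for the parts mentioned above (assumed, not verified):
print(bw_gbs(128, 400))  # Geforce4 MX440 / dual-channel DDR400: 6.4 GB/s
print(bw_gbs(128, 550))  # Geforce4 MX460: 8.8 GB/s
print(bw_gbs(64, 266))   # single 64-bit channel of 266DDR: ~2.1 GB/s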


The external (NUMA) memory architecture does not have to be as fast as the primary memory; it just has to serve as the bridge between memory and storage. The primary memory would be the high-end performance memory. The external (NUMA) memory would be whatever the market would bear as efficient for the customer base, with each market segment able to use different memory types if need be. This gives the mainboard manufacturer a lot of flexibility, something they've been working towards for years now.
"bridge between memory and storage"... are you referring to DMA, now? Or did you suddenly integrate storage capabilities into your new "super-sized" socket? Might I remind you that the more you integrate into a single board, the more you move towards embedded systems. High performance single-board solutions cost a pretty penny and cheap solutions perform rather low. All hail the now defunct internet web appliances, which were highly integrated solutions that suddenly found themselves too expensive when compared to desktop PCs AND were more limited in functionality and upgrades.
And I do know that you're trying to advocate "normal" slow memory on the system board and faster memory on the CPU socket. What I fail to see is how this is any different from adding an additional cache level and how it can possibly be cheaper.

MrSheep-

This card would contain a distinct:
1. MPU (multipurpose processor unit)
2. flexible-setting memory controller (256-, 128-, or 64-bit)
3. flexible support for memory quantity (anywhere from 64MB to 2GB)
This is not a SOC. The external memory is not a uniform standard where, say, SDRAM or RDRAM has to be affixed. The northbridge on the mainboard (for boards that would support external memory) would need to be compatible with the pathway from CPU to northbridge, but the memory need not be defined.
You're quite literally only a few steps away from SOC. Your socket design contains just about everything needed for SOC. It basically just lacks nonvolatile memory and a simple I/O interface.

The writing is on the wall. High-tech companies need not own their own fabs. Few megacorporations can even afford to do it on their own. NVidia has been a fabless company for years now. ATI is a fabless company. So what? They still get their products out the old-fashioned way: they bid for fab time. That's how the free market system works, you know. ;)
And if your supplier lets you down, you're screwed. Still, the trend is towards fabless technology companies because, like I said, Intel is probably the only company that can keep using cutting edge process technology. What it means is that Intel will always have a leg up on the competition since they're the first to try out new processes.

What this design gives is a way to integrate memory into the core's socket without the mainboard manufacturers having to define the support. The CPU package does not need much memory for basic functions, but we are to the point where it's ridiculous to think of memory in any quantity below 128MB. I can run just about anything on my machine with 128MB, not that I'd want to, but for beginners and low-end customers it's plenty acceptable.

System integrators could opt for absolutely no expansion slots for memory and place no embedded memory on the mainboard for absolute barebone costs. With this strategy there can be one board for no external memory and another for using external memory. Either way, whichever CPU package they choose largely determines which performance spectrum they want to target. The customer can still be sold on upgrading the CPU later to something bigger and better!
128MB of DRAM still requires multiple chips. A quick check on Samsung shows the highest DDR density yields a 64MB chip. Okay, so it's only two chips. However, that's two chips in a stacked TSOP package. Pull out your CPU and tell me how much real estate is left to place a memory controller (if you don't have an Opteron) and two DRAMs plus associated support chips and traces. You want to expand the socket? Great. Fantastic. Now explain why nobody does that for desktops. Oh, could it be, *gasp*, COST?!?!?

As for a lot of SKUs, why would one offer 36 SKUs? The GPU functions are inherent to the core, not separate. The memory on the CPU package would vary by size and quantity to fit the target market. I don't see them using much more than 3-4 memory sizes, with the memory bandwidth configuration varying in but three ways. Let's think using today's technology: at the high end you may see a 256-bit/1GB config and a 256-bit/2GB config with much slower memory. At the mid-market level you might see a 128-bit/1GB config, a 128-bit/512MB config, and maybe even a 128-bit/256MB config. At the low end you might see a 64-bit/256MB config and a 64-bit/128MB config. Notice how the different market segments are related? The 256-bit/2GB and 128-bit/1GB configurations are one and the same configuration, only the latter with half the installed memory. Likewise, the configurations overlap all the way down to the bottom of the spectrum. So we see but six SKUs for a very flexible configuration of CPU packaging.
I STILL don't see how this is any better than simply adding another level of cache.
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
You could save about 10 gigs of server space by simply rebutting, bud. ;)

I should perhaps have used smaller memory sizes to denote "current technology," keeping it in line with what's on the market. I was using what's possible as my high-water mark. After doing some reading, I need to lower my expectations. :)

Figure the most practical solutions would be 64MB chips in either DDR or DDR2 configurations. We could weed out the less efficient possibilities to arrive at a whopping three simple solutions overall, all of them using DDR2. If you made it possible to use DDR, then it really doesn't make sense to offer it in more than the 512MB/256bit, 256MB/128bit and 128MB/64bit solutions. Adding DDR support would make a total of nine symmetrical DDR configurations possible via the use of 16MB or 32MB chips, although these smaller chips aren't really practical. That makes a grand total of six worthwhile SKUs.

So we pare possibilities down to the following sizes:

256bit = High end:

512MB/256bit (8x64 DDR/375-425MHz - Most practical 512MB 'Extreme Edition' solution)
256MB/256bit (4x64 DDR2/400-500MHz; Most practical 256MB/256bit solution)

128bit = Mid-Range:

256MB/128bit (4x64 DDR/375-425MHz)
128MB/128bit (2x64 DDR2/400-500MHz; Most practical 128MB/128bit solution)

64bit = Low end: (All solutions would still necessitate the use of on-mainboard expansion slots)

128MB/64bit (2x64 DDR/375-425MHz)
64MB/64bit (1x64 DDR2/400-500MHz; Cheapest DDR2 solution)
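A quick check on the chip math above (assuming 64MB, i.e. 512Mbit, parts throughout):

# Capacity and implied per-chip bus share for the six SKUs above,
# assuming 64MB (512Mbit) DRAM parts throughout.
skus = [
    ("512MB/256bit", 8, 256),
    ("256MB/256bit", 4, 256),
    ("256MB/128bit", 4, 128),
    ("128MB/128bit", 2, 128),
    ("128MB/64bit",  2, 64),
    ("64MB/64bit",   1, 64),
]
for label, chips, bus_bits in skus:
    print(f"{label}: {chips} x 64MB = {chips * 64}MB, {bus_bits // chips} bus bits per chip")
# Note the rows that work out to 64 bus bits per chip would need DRAM with a
# 64-bit interface, which is wider than the x16/x32 parts actually shipping,
# so in practice those rows would take more, narrower chips.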

The cost of the overall system is reduced, which makes supporting older systems less logical. I'm not sure why you figure even stacked modules become a real estate problem for mounting the heatsink, considering you have two sides from which to mount memory. For DDR I figure they'd come in pairs of 16, 32, or 64MB chips, up to four per side. For DDR2 I figure they could be mounted singly as 64MB chips. I then figured four 64MB DDR2 chips per side for the 256MB/256bit version and four 64MB DDR2 chips per side, in stacked pairs, for the 512MB/256bit version. However, to be realistic, I think it would be more logical to use eight 64MB DDR chips for the 512MB/256bit 'Extreme Edition'. The newer system would be simpler and faster at a similar or potentially noticeably lower cost. It's not like we're talking exotic memory solutions here. The memory controller technology already exists and has already been integrated into existing GPUs. Some of the current GPU memory controllers, like those in the Nvidia 5900 series of cards, are fully capable of DDR and DDR2 support.

I would think the latest generation of performance DDR would be ideal for the CPU side of the argument, but DDR2 is smaller overall and offers higher bandwidth in exchange for a marginal increase in latency. Either one is certainly better than the DC-DDR400 technology found in today's mainboards, and perhaps DDR2 is no worse off in latency than current DC-DDR technology. DDR2 has more headroom for clock increases, too, especially when talking about future-proofing the CPU package. DDR3 is yet another improvement, but we're years away from that technology; too bad, because it offers the low latency of DDR and the high bandwidth of DDR2!

Feel free to correct me if I'm wrong in any estimation.

Perhaps the processors could be sold without any DDR/DDR2 memory mounted at all, and the OEMs could configure them how they saw fit. Below are some odd sizes that are possible:

OEM Wildcards w/stacked 64MB DDR2 chips: (Stacked pairs of DDR2 are an exotic 256MB solution; limited to 400-450MHz)

512MB/256bit (4x2x64; Too-'Extreme'-to-justify Editions for sure!)
256MB/128bit (2x2x64)
128MB/64bit (1x2x64)

OEM Wildcards w/32MB DDR chips: (limited to 350-400MHz)

256MB/256bit (8x32)
128MB/128bit (4x32)
64MB/64bit (2x32)

OEM Wildcards w/16MB DDR chips: (limited to 325-375MHz)

128MB/256bit (8x16)
64MB/128bit (4x16)
32MB/64bit (2x16; Opposite of the 'Extreme' edition?!)
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
I put a lot of thought into this today. My revisions have been made to the last post to reflect DDR2 technology.
 

MrSheep

Junior Member
Dec 1, 2003
2
0
0
I should perhaps have used smaller memory sizes to denote "current technology," keeping it in line with what's on the market. I was using what's possible as my high-water mark. After doing some reading, I need to lower my expectations.

Figure the most practical solutions would be 64MB chips in either DDR or DDR2 configurations. We could weed out the less efficient possibilities to arrive at a whopping three simple solutions overall, all of them using DDR2. If you made it possible to use DDR, then it really doesn't make sense to offer it in more than the 512MB/256bit, 256MB/128bit and 128MB/64bit solutions. Adding DDR support would make a total of nine symmetrical DDR configurations possible via the use of 16MB or 32MB chips, although these smaller chips aren't really practical. That makes a grand total of six worthwhile SKUs.

Assuming five core MPU speed grades for each configuration would result in 30 different combinations. :)
Selling one core MPU grade per configuration, or even per configuration group, would literally be stupid, as discrete vendors would price you out of the market within days, and the natural trickle-down of CPU speed grades from discrete vendors would shorten the life of your product. IMHO you need to be able to deliver at least one speed grade per quarter just to remain in the game. Allowing the core a life of 18 to 24 months means 5 or 6 speed grades would be required over the design life.

Are these real clock speeds or DDR ratings? E.g., is the quoted 375-425MHz really just DDR375 (187MHz)/DDR425 (212MHz), or is it a real clock speed of 375MHz (DDR750)/425MHz (DDR850)?
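For what it's worth, the two readings differ by exactly a factor of two; a one-line sketch of the naming convention (the convention itself is the only assumption here):

# A "DDRxxx" rating counts transfers per second, i.e. twice the real clock.
to_real_clock = lambda ddr_rating: ddr_rating / 2
print(to_real_clock(375), to_real_clock(425))  # 187.5, 212.5 MHz real clock
# Read the other way, real 375/425MHz clocks would be marketed as DDR750/DDR850.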

JEDEC 66-pin TSOP DDR components have a surface area of 225mm2 vs. 144mm2 for 144-ball FBGA (G)DDR/GDDR-II.

A 256-bit pathway will require the use of at least four memory chips; for various reasons stemming from the design of DDR, you would in reality want to use 8 chips. Eight TSOPs will take a lot of surface area, and even 8 FBGAs will be difficult to place. In short, you're looking at a big package.
Design rules regarding DDR mandate at least two chips even in a 64-bit configuration; in reality, four chips work out easier. For 128-bit you need at least four chips, but eight can in some cases be simpler and cheaper.
A few other things to point out: high-density DDR chips have a far higher latency than the versions used in graphics cards, and they yield poorly when high clock speeds and low latency are required. IIRC Samsung and Hynix offer graphics-targeted DDR chips with real clock speeds of up to 500MHz (DDR1000) and are sampling graphics-targeted chips with clock speeds of up to 800MHz (DDR1600). Density is 16MB (128Mbit) per chip, or 32MB (256Mbit) for selected slower grades.
This contrasts with the 167MHz clocks and 128MB (1Gbit) densities achieved by JEDEC-conformant DDR chips; even so, that density still yields poorly at 200MHz. In fact, that density isn't even being sampled by either manufacturer at 200MHz.
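The chip-count floor follows directly from the chip's I/O width; a minimal sketch assuming the x16/x32 widths graphics DDR actually ships in:

# Minimum DRAM chips needed to populate a bus of a given width.
def min_chips(bus_bits, chip_io_bits):
    return bus_bits // chip_io_bits

for bus in (64, 128, 256):
    print(f"{bus}-bit bus: {min_chips(bus, 32)} x32 chips or {min_chips(bus, 16)} x16 chips")
# 256-bit: 8 x32 (or 16 x16); 128-bit: 4 x32 (or 8 x16); 64-bit: 2 x32 (or 4 x16).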

Coming back into the real world for a second, an NVidia 5200-based card typically uses eight Samsung K4D26163HE-TC40 memory chips. These are rated for a real clock speed of 250MHz (DDR500), 4.0ns, 4K/32ms refresh; not particularly fast, but still better than the stuff found in normal DDR modules. By the time you're into the NVidia 5600 and ATi Radeon 9600, then stuff like the Samsung K4D26323RA-GD2A (DDR800, 2.8ns, 4K/32ms) or K4D26323RA-GD33 (DDR600, 3.3ns, 4K/32ms) is being used in an eight-chip configuration. The top-end ATi Radeon 9800 and NVidia GeforceFX 5800 use even faster stuff.

Some of the current GPU memory controllers, like those in the Nvidia 5900 series of cards, are fully capable of DDR and DDR2 support.

Actually, those models really use (G)DDR or (G)DDR-II, which is slightly different: it requires tighter timings, is designed for a different bank configuration, and is packaged as 144-ball FBGA for higher speed grades and 100-pin TQFP or 66-pin TSOP for lower grades.

Now the questions left outstanding are how you connect it to the mainboard and how power injection is dealt with.
Assume a PGA-socketed PCB with power injection via the socket. This would leave a requirement for roughly 800 pins and a surface area of 7225mm2 for the 256-bit memory configuration; for perspective, that is about 75% of the surface area of a CD jewel case.
The surface area estimate is based on taking the size of a typical CPU manufactured on a 130nm process, calculating the size of a theoretical mainstream GPU on the same process, and then roughly adding the two cores together while allowing for routing and layout problems. This core would occupy 247mm^2; once an IHS is added, the surface area grows by a factor of 4.5 (based on core vs. IHS sizes averaged across the P4, A64, R300, and R350), giving 1111mm2. Factoring in each of the eight memory chips as consuming 324mm2 once clearances are allowed, and allowing for a general clearance of 8mm around the core IHS and a further 4mm around the package edge, gives the total. The cooling requirements of the RAM mean a backside mount is not possible when the PCB is socketed via a rear-mounted PGA.
Some form of SEC-style slotting would allow a reduction in PCB size by backside-mounting half the RAM, but would require the VRM to be integrated onto the PCB. You simply are not going to get 800-ish pins on a small (slightly larger than the original Athlon/P2) SEC.
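For anyone following along, the area figure roughly decomposes as follows (a back-of-envelope reconstruction; the clearance handling is simplified, so it only approximates the quoted total):

# Rough reconstruction of the package-area estimate above; all inputs come
# from the post, and the clearance handling here is deliberately simplified.
core_mm2   = 247      # combined CPU+GPU core on 130nm
ihs_factor = 4.5      # core-to-IHS area growth
chip_mm2   = 324      # per memory chip, clearances included
n_chips    = 8

ihs_area = core_mm2 * ihs_factor       # ~1111 mm^2
mem_area = n_chips * chip_mm2          # 2592 mm^2
print(f"components: {ihs_area + mem_area:.0f} mm^2, quoted package: {85 * 85} mm^2")
# ~3704 mm^2 of components; the 8mm IHS clearance and 4mm edge clearance are
# what grow the layout to roughly an 85mm x 85mm (7225 mm^2) package.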
 

MadRat

Lifer
Oct 14, 1999
11,999
307
126
Your MPU may be 247mm^2, but the IHS doesn't need to be 4 or 5 times that size. Last time I checked, the IHS did little as far as cooling and was there more as a physical shield. So we might be talking 16mm by 16mm. The Barton core is 101mm^2 and 54.3M transistors on 130nm. The NV35 is 125M transistors on 130nm, and the R420 core will be 160M transistors on 130nm. (In comparison, the Prescott is 109mm^2 and 125M transistors on 90nm.) So we can roughly guess that a general purpose MPU would need to be in the range of 220M transistors and around 250mm^2 to squeeze in the 8 rendering pipelines, 256-bit memory controller, 128K L1/512K L2 of a Barton or 12K trace/512K L2 of a Northwood, etc... Let's say we squeeze it down to 170mm^2 (13mm x 13mm) and roughly 160M transistors by reducing the rendering pipelines to 4, the L1 cache to 64K (32i/32d), and the L2 to 256K. Sure, we've crippled the GPU side of the processor a wee bit, but we're going for a technology demonstration, and we can wait for ultimate performance on the move to 90nm. ;)

Each memory chip takes up roughly 324mm^2, according to your estimate. We're talking somewhere along the lines of a meager 3 inches of height in the card and perhaps five inches of width, or the size of a notecard. So, yeah, we're talking around the size of Slot 1 or Slot A, with comparative power to an AMD Thoroughbred or Intel P4M processor, the equivalent of an ATI 9600-9700 or NVidia 5600-5700 in raw video performance, and anywhere from 32MB to 512MB of dedicated high-speed GDDR/GDDR2 memory depending on our target market. For the 128MB/128bit version, we added an extra $200 to the cost of the CPU to gain $200 worth of video card power (ATI 9600 or NVidia 5600 equivalent), plus a real-time 325-500MHz front-side bus and 10-15GB/sec of memory bandwidth. For the 256MB/256bit version, we added an extra $300 to the cost of the CPU to gain $400 worth of video card power (ATI 9700 or NVidia 5700 equivalent), plus the same 325-500MHz front-side bus and 20-30GB/sec of memory bandwidth. We've completely eliminated the AGP slot, (in some cases) the need for both the northbridge and on-mainboard memory sockets, and the complex dual-channel or high-speed busses on the mainboard.
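Those bandwidth ranges check out if the 325-500MHz figures are read as real clocks, i.e. 650-1000 MT/s effective; that reading is the assumption in this quick check:

# Peak bandwidth of the proposed packages, reading 325-500MHz as real DDR
# clocks (650-1000 MT/s effective) -- that reading is an assumption.
def bw_gbs(bus_bits, effective_mts):
    return bus_bits / 8 * effective_mts / 1000

for bus in (128, 256):
    print(f"{bus}-bit: {bw_gbs(bus, 650):.1f} to {bw_gbs(bus, 1000):.1f} GB/s")
# 128-bit: 10.4-16.0 GB/s and 256-bit: 20.8-32.0 GB/s, in the same ballpark
# as the 10-15 and 20-30 GB/sec ranges quoted above.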

Not too shabby I must say. AMD, Intel, ATI, Nvidia, VIA/S3... can you hear me?