Nvidia GPUs soon a fading memory?


BenSkywalker

Diamond Member
Oct 9, 1999
Besides the Zune HD (which I own, and which is an awesome piece of hardware), I can't recall seeing it in other devices or seeing a lot of talk from companies wanting to adopt it yet.

http://www.pcworld.com/article/194072/microsofts_kin_are_the_first_tegra_smartphones.html

http://gizmodo.com/5448170/audi-turning-to-nvidia-tegra-chipset-to-make-their-dashboards-pop

Audi sells roughly 1 million cars a year (and all of them are to use Tegra). Those are done deals besides the Zune HD. I realize they aren't anywhere near as impressive as building the most powerful supercomputer or anything else along the HPC lines, but in terms of volume it looks like Tegra is nV's long-term solution, and it is gaining traction. Technology-wise they have everyone else in the field beaten; unless AMD decides to make a real effort in this market, I don't see that situation changing in the near future. There is also another portable device, to be announced next month, that is strongly rumored to be using Tegra 2 (the 3DS). If that pans out, it will likely mean another ~10-20 million units a year in sales.
 

Fox5

Diamond Member
Jan 31, 2005
http://www.pcworld.com/article/194072/microsofts_kin_are_the_first_tegra_smartphones.html

http://gizmodo.com/5448170/audi-turning-to-nvidia-tegra-chipset-to-make-their-dashboards-pop

Audi sells roughly 1 million cars a year (and all of them are to use Tegra). Those are done deals besides the Zune HD. I realize they aren't anywhere near as impressive as building the most powerful supercomputer or anything else along the HPC lines, but in terms of volume it looks like Tegra is nV's long-term solution, and it is gaining traction. Technology-wise they have everyone else in the field beaten; unless AMD decides to make a real effort in this market, I don't see that situation changing in the near future. There is also another portable device, to be announced next month, that is strongly rumored to be using Tegra 2 (the 3DS). If that pans out, it will likely mean another ~10-20 million units a year in sales.

AMD won't be back in the ARM market for a while, if ever. They might just follow Intel's Atom lead eventually.
But there are members of the Snapdragon family (using AMD IP) that should be competitive with Tegra 2. PowerVR also has graphics tech competitive with Tegra 2 (better feature-wise, in fact, but that didn't help PowerVR back when they were in the PC market), though I don't know of any designs using it. Nvidia just happens to be pursuing the low-power/higher-performance market more actively, but there are competitors.

How has AMD tackled it before?

AMD has used tile-based deferred rendering with their cell phone IP (the graphics used by Snapdragon) and the Xbox 360. It doesn't require eDRAM to be used. It's unlikely that they'll go this path, but it's one solution to the bandwidth problem.

AMD also uses dedicated memory with their current IGPs.
Take a look at the diagram on this page:
http://www.hexus.net/content/item.php?item=12116&page=2
AMD's existing integrated graphics communicate with the CPU over HyperTransport to use its memory controller and system memory. The IGP itself can also be connected to a DDR3 chip directly. (I'm not sure if this uses a dedicated memory bus or HyperTransport... actually, it might even use their sideport bus, which in its existing form is slower than both of those.)

In AMD's new APU, the IGP is merely being moved off the southbridge and onto the CPU. It still communicates with main memory through the CPU's on-die memory controller. However, the southbridge still exists and is connected by a HyperTransport bus.
Assuming AMD doesn't make Llano variants with on-chip memory (hanging off an additional memory bus or an existing HyperTransport bus), memory could still be soldered onto the motherboard; the CPU could talk to the southbridge over HyperTransport, and the southbridge would talk to that memory. From a high-level perspective, this is identical to what their IGPs do now. I would be willing to bet money that AMD continues to do this with Llano; all they've done is swap which side of the southbridge/CPU connection the GPU sits on.
Assuming we're still using AM3 (why would they? there's no way Llano will be compatible with existing motherboards; they need some type of additional pinout for the graphics) and HyperTransport 3.0, that allows an additional 16 GB/s of bandwidth on top of whatever the system bandwidth is. It's not ideal, but the aggregate bandwidth would be respectable for an IGP.
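Rough numbers, just to put that in perspective (a sketch; the dual-channel DDR3-1333 figure and the 16 GB/s HT number are assumptions on my part, not confirmed Llano specs):

```python
# Back-of-the-envelope aggregate bandwidth for a hypothetical Llano-style IGP.
# Every figure here is an assumption for illustration, not a confirmed spec.

def ddr_bandwidth_gbs(transfer_rate_mts, bus_width_bits, channels=1):
    """Peak bandwidth in GB/s: transfers per second * bytes per transfer * channels."""
    return transfer_rate_mts * 1e6 * (bus_width_bits / 8) * channels / 1e9

system_ram = ddr_bandwidth_gbs(1333, 64, channels=2)  # dual-channel DDR3-1333, ~21.3 GB/s
ht_link    = 16.0                                     # the HT 3.0 figure quoted above

print(f"System RAM:       {system_ram:5.1f} GB/s")
print(f"HT 3.0 link:      {ht_link:5.1f} GB/s")
print(f"Aggregate (IGP):  {system_ram + ht_link:5.1f} GB/s")
```

That puts the aggregate somewhere around 37 GB/s, which is the "respectable for an IGP, not ideal" territory I mean.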

If they're using a new socket, there's also the possibility of a dedicated memory bus for VRAM, or a faster HyperTransport bus. I think HT 3.0 has a double-bit-width variant coming up soon.
 

BenSkywalker

Diamond Member
Oct 9, 1999
But there are members of the Snapdragon family (using AMD IP) that should be competitive with Tegra 2.

I have a Snapdragon phone and a Zune HD, and they aren't really all that close (that is comparing it to Tegra 1, not Tegra 2). Not saying they are bad parts, but there is a far larger gap between them than there is between the current desktop companies. I have heard a lot about PVR's offerings, but due to the nature of their technology I have a very firm 'don't believe anything until you see it' attitude toward them. Not saying they are dishonest, just that their unusual approach to how they handle things frequently results in problems.

AMD has used tile-based deferred rendering with their cell phone IP (the graphics used by Snapdragon) and the Xbox 360.

The 360 uses an immediate-mode renderer; it is not deferred at all. It tiles memory access, but all rasterizers and GPUs do that.

It doesn't require eDRAM to be used.

An effective TBR does, and a lot of it with today's geometric loads.
 

tweakboy

Diamond Member
Jan 3, 2010
No it's not. Wait until they come out with a dual-GPU card, two GPUs on one card; nVidia hasn't done that yet. But when they do, you'll feel dumb for getting something else.. hint hint lol jk
 

Fox5

Diamond Member
Jan 31, 2005
I have a Snapdragon phone and a Zune HD, and they aren't really all that close (that is comparing it to Tegra 1, not Tegra 2). Not saying they are bad parts, but there is a far larger gap between them than there is between the current desktop companies. I have heard a lot about PVR's offerings, but due to the nature of their technology I have a very firm 'don't believe anything until you see it' attitude toward them. Not saying they are dishonest, just that their unusual approach to how they handle things frequently results in problems.



The 360 uses an immediate-mode renderer; it is not deferred at all. It tiles memory access, but all rasterizers and GPUs do that.



An effective TBR does, and a lot of it with today's geometric loads.

The Snapdragon used in the Nexus One is the weakest in the Snapdragon family; the high-end Snapdragon has at least 4x the graphics performance. Not sure how you would compare the performance of two devices running entirely different software, though.

I've seen Xenos (the Xbox 360 GPU) referred to as a tile-based deferred renderer before at Beyond3D; I don't know if it actually is. However, AMD has slides referring to their Snapdragon IP as a TBDR, and that's supposedly based on Xenos.

Did the older PowerVR stuff have eDRAM? Heck, does the current phone stuff? There's a small on-chip tile buffer in PowerVR's designs, I think (hundreds of kilobytes, IIRC), but that's different from the 10 MB of eDRAM Xenos uses. Of course, that may be why no eDRAM-less TBDR system has ever pushed polygon rates beyond what the PS2 could handle.
 

Scali

Banned
Dec 3, 2004
AMD has used tile-based deferred rendering with their cell phone IP (the graphics used by Snapdragon) and the Xbox 360. It doesn't require eDRAM to be used. It's unlikely that they'll go this path, but it's one solution to the bandwidth problem.

As already stated, the XBox 360 is not a TBDR.
As far as I can tell, the Snapdragon GPUs are described as 'scaled-down Xenos', so they wouldn't be TBDR either then.

AMD also uses dedicated memory with their current IGPs.
Take a look at the diagram on this page:
http://www.hexus.net/content/item.php?item=12116&page=2
AMD's existing integrated graphics communicate with the CPU over HyperTransport to use its memory controller and system memory. The IGP itself can also be connected to a DDR3 chip directly. (I'm not sure if this uses a dedicated memory bus or HyperTransport... actually, it might even use their sideport bus, which in its existing form is slower than both of those.)

You mean the 'display cache' thing? That could be for an eDRAM chip or something similar... but I've never actually seen it applied in practice. I'm not sure it was ever actually implemented, and I haven't seen any mention of it related to Llano either.
Even so, it doesn't solve texturing bandwidth; it merely boosts fillrate for AA and such.
 

Scali

Banned
Dec 3, 2004
I've seen Xenos (the Xbox 360 GPU) referred to as a tile-based deferred renderer before at Beyond3D; I don't know if it actually is. However, AMD has slides referring to their Snapdragon IP as a TBDR, and that's supposedly based on Xenos.

No, AMD's slides say it's a tile-based renderer. Note the missing 'deferred'.
I think you actually need to license a lot of PowerVR IP in order to implement deferred rendering in hardware.
Pretty much all modern GPUs classify as tile-based renderers in one way or another, since they use hierarchical compression for z/stencil and multisample buffers.

Did the older PowerVR stuff have eDRAM? Heck, does the current phone stuff? There's a small on-chip tile buffer in PowerVR's designs, I think (hundreds of kilobytes, IIRC), but that's different from the 10 MB of eDRAM Xenos uses.

Yes, PowerVR had an on-chip cache that could hold the framebuffer and z/stencil buffer for one tile. Since they have a TBDR, that was enough to handle everything in the cache in one pass. I'm not sure if it still works that way today, though.
For the original PVR PCI add-on cards that was a very nice solution. They copied each tile via a burst transfer over the PCI bus into the host 2D card's memory, so you didn't need an external piggyback VGA cable like the 3dfx add-on cards, and you didn't suffer from image degradation.
The Xbox just buffers the entire framebuffer.
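To make the 'everything for one tile handled on-chip' idea concrete, here's a toy sketch (my own illustration, nothing to do with PowerVR's actual hardware or drivers): primitives get binned into screen tiles, each tile is resolved in a small tile-sized buffer, and every pixel is shaded only once, for the surviving primitive.

```python
# Toy illustration of the tile-based deferred idea (a sketch, not PowerVR's algorithm):
# 1) bin primitives into screen tiles, 2) resolve depth per tile in a tile-sized buffer,
# 3) shade each pixel once, only for the winning primitive.
# Primitives are axis-aligned rects (x0, y0, x1, y1, depth, colour) to keep it tiny.

WIDTH, HEIGHT, TILE = 64, 64, 16

prims = [
    (0,  0,  64, 64, 0.9, (50, 50, 50)),    # far background
    (8,  8,  40, 40, 0.5, (200, 0, 0)),     # mid-distance quad
    (20, 20, 60, 60, 0.2, (0, 200, 0)),     # near quad (wins where it overlaps)
]

# Pass 1: binning - record which primitives touch which tile.
tiles = {}
for p in prims:
    x0, y0, x1, y1, _, _ = p
    for ty in range(y0 // TILE, (y1 - 1) // TILE + 1):
        for tx in range(x0 // TILE, (x1 - 1) // TILE + 1):
            tiles.setdefault((tx, ty), []).append(p)

# Pass 2: per-tile resolve - the depth test lives entirely in a tile-sized buffer
# (the stand-in for the on-chip tile cache), and each pixel is shaded at most once.
framebuffer = [[(0, 0, 0)] * WIDTH for _ in range(HEIGHT)]
shaded = 0
for (tx, ty), plist in tiles.items():
    zbuf   = [[1.0]  * TILE for _ in range(TILE)]
    winner = [[None] * TILE for _ in range(TILE)]
    for x0, y0, x1, y1, z, colour in plist:
        for y in range(max(y0, ty * TILE), min(y1, (ty + 1) * TILE)):
            for x in range(max(x0, tx * TILE), min(x1, (tx + 1) * TILE)):
                if z < zbuf[y - ty * TILE][x - tx * TILE]:
                    zbuf[y - ty * TILE][x - tx * TILE] = z
                    winner[y - ty * TILE][x - tx * TILE] = colour
    for y in range(TILE):
        for x in range(TILE):
            if winner[y][x] is not None:
                framebuffer[ty * TILE + y][tx * TILE + x] = winner[y][x]
                shaded += 1

touched = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1, _, _ in prims)
print(f"pixels shaded: {shaded}, pixels an immediate-mode renderer would touch: {touched}")
```

Overdraw never reaches the shading stage; the cost is the binning pass and the memory to hold the per-tile primitive lists, which is the part that blows up with modern geometry loads.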
 

Fox5

Diamond Member
Jan 31, 2005
As already stated, the XBox 360 is not a TBDR.
As far as I can tell, the Snapdragon GPUs are described as 'scaled-down Xenos', so they wouldn't be TBDR either then.



You mean the 'display cache' thing? That could be for an eDRAM chip or something similar... but I've never actually seen it applied in practice. I'm not sure it was ever actually implemented, and I haven't seen any mention of it related to Llano either.
Even so, it doesn't solve texturing bandwidth; it merely boosts fillrate for AA and such.

There are a lot of shipping motherboards that use the display cache. It's a DDR2 or DDR3 chip (typically 128 MB) hanging off a 64-bit bus, and it supplements the bandwidth of the system RAM. Why couldn't it aid in texturing bandwidth? It's probably not high-bandwidth enough to matter, though; just splitting the framebuffer off to the display cache and texturing out of system RAM would already provide a performance boost.

I wonder if the sideport terminology used for the display cache refers to the same sideport that was going to be utilized for the X2 cards, with AMD hanging a memory chip off of it instead of another GPU.

Since it hangs on the HyperTransport bus, the fastest memory chip that could be attached to this would provide 16 GB/s of bandwidth for a framebuffer. On a 64-bit memory bus, that would require a 2 GHz (effective) DDR3 memory chip. That's possible by 2011; I can find higher-end motherboards equipped with 1333 MHz sideport memory right now. Lower-end products in 2011 would probably downclock the GPU and equip it with 1333 MHz or 1600 MHz memory. All current AMD IGPs are in effect a 128-bit memory bus (dual-channel DDR2/3) plus an optional 32- or 64-bit sideport (for the framebuffer only?). I'd imagine AMD would keep the option for Llano, and maybe even bump it to a 128-bit sideport. (It sounds like the current sideport supported 128-bit chips, but no one ever made use of it.)
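The arithmetic behind that (a sketch; effective double-data-rate transfer speeds assumed):

```python
# What DDR3 speed does a 64-bit sideport need to hit 16 GB/s?
# Sketch using effective (double-data-rate) transfer rates.

def sideport_gbs(effective_mts, bus_bits=64):
    return effective_mts * 1e6 * bus_bits / 8 / 1e9

for rate in (1333, 1600, 2000):
    print(f"DDR3-{rate} on a 64-bit sideport: {sideport_gbs(rate):4.1f} GB/s")

# 1333 -> 10.7 GB/s, 1600 -> 12.8 GB/s, 2000 -> 16.0 GB/s, which is where the
# '2 GHz effective DDR3 for 16 GB/s' figure above comes from.
```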
 

Scali

Banned
Dec 3, 2004
Why couldn't it aid in texturing bandwidth? It's probably not high-bandwidth enough to matter, though; just splitting the framebuffer off to the display cache and texturing out of system RAM would already provide a performance boost.

Texturing is going to be very slow, with or without a framebuffer cache. The Xbox suffers from the same problem; texture quality on the Xbox is considerably lower than on the PC.
That's the whole point we've been discussing all along: bandwidth.
You need bandwidth to keep 400-480 SPs busy. If they constantly have to wait for textures (and modern games use a LOT of textures), it's not going to work.
We saw the same thing a while ago with the TurboCache constructions, which tried to store textures in main memory and swap them into video card memory just-in-time. It wasn't exactly a success.
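A purely illustrative back-of-the-envelope (every number here is an assumption, since Llano's configuration isn't public): even a modest number of texture units can outrun shared DDR3.

```python
# Illustrative texel-fetch demand vs. shared-memory supply.
# Every number is an assumption for the sake of argument, not a real Llano spec.

tmus            = 20          # assumed texture units
gpu_clock_hz    = 600e6       # assumed GPU clock
bytes_per_texel = 4           # uncompressed 32-bit texel; ignores caches and DXT

demand = tmus * gpu_clock_hz * bytes_per_texel / 1e9   # worst-case fetch rate
supply = 2 * 1333e6 * 8 / 1e9                          # dual-channel DDR3-1333, shared with the CPU

print(f"Peak texel fetch demand: {demand:5.1f} GB/s")  # ~48 GB/s
print(f"Shared DDR3-1333 supply: {supply:5.1f} GB/s")  # ~21.3 GB/s
```

Caches and texture compression close part of that gap in practice, but the imbalance is the point.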

What you'd need is the whole video memory dedicated... but that's going to be too expensive.

Judging from this review: http://www.tomshardware.com/reviews/790gx-graphics-sideport,2088-6.html
The display cache isn't very spectacular anyway; it barely improves performance.
 

evolucion8

Platinum Member
Jun 17, 2005
http://www.pcworld.com/article/194072/microsofts_kin_are_the_first_tegra_smartphones.html

http://gizmodo.com/5448170/audi-turning-to-nvidia-tegra-chipset-to-make-their-dashboards-pop

Audi sells roughly 1 million cars a year (and all of them are to use Tegra). Those are done deals besides the Zune HD. I realize they aren't anywhere near as impressive as building the most powerful supercomputer or anything else along the HPC lines, but in terms of volume it looks like Tegra is nV's long-term solution, and it is gaining traction. Technology-wise they have everyone else in the field beaten; unless AMD decides to make a real effort in this market, I don't see that situation changing in the near future. There is also another portable device, to be announced next month, that is strongly rumored to be using Tegra 2 (the 3DS). If that pans out, it will likely mean another ~10-20 million units a year in sales.

The Kin only just debuted, and it's surely based on the same platform and performance level as the current Zune HD, which is good enough. Audi? LOL, I don't think those cars are sold in high volumes compared to other car manufacturers like Toyota (the car brand, not the nVidia brand loyalist). But in the end, it is a good step in the right direction; at least nVidia managed to do something right and profitable. It was needed, compared to the pathetic solutions most smartphones currently have, especially in graphics acceleration.
 

evolucion8

Platinum Member
Jun 17, 2005
Sideport is a very nice piece of technology, but it doesn't work great in its current incarnation due to the lack of bandwidth of current RAM offerings. I think it may fare better with dedicated onboard VRAM, or for GPU-to-GPU communication in CrossFire in the future.
 

BFG10K

Lifer
Aug 14, 2000
I have to agree with Scali and Ben here. There’s no way even a high-end IGP can match a discrete low-end part, for the simple fact that the discrete part has dedicated VRAM while the IGP shares it with the system. Even if both parts had exactly the same bandwidth, the discrete part would still win because it doesn’t have to share it. Sticking the IGP onto the same die as a CPU doesn’t solve that problem at all.

eDRAM might help for rendering, but it won’t help for storing or transferring assets. Crysis has 1 GB worth of textures. You can’t store that in a tiny eDRAM buffer. The GTX480 has 1.5 GB dedicated VRAM that delivers 177.4 GB/sec bandwidth. You can’t have 1.5 GB eDRAM, nor can you solder 1.5 GB of such high-bandwidth RAM onto the motherboard. It’s just not practical or realistic.
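For what it’s worth, the 177.4 GB/sec figure falls straight out of the published memory configuration (a quick sketch; 384-bit bus, 924 MHz GDDR5 quad-pumped):

```python
# Where the GTX 480's 177.4 GB/s comes from: published 384-bit bus, 924 MHz GDDR5.
mem_clock_hz = 924e6              # base memory clock
transfers    = mem_clock_hz * 4   # GDDR5 moves 4 bits per pin per clock -> 3696 MT/s
bus_bytes    = 384 / 8            # 384-bit bus = 48 bytes per transfer

print(f"GTX 480 VRAM bandwidth: {transfers * bus_bytes / 1e9:.1f} GB/s")  # ~177.4 GB/s

# For comparison, dual-channel DDR3-1333 shared with the CPU works out to ~21.3 GB/s
# by the same arithmetic: 2 * 1333e6 transfers/s * 8 bytes.
```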

Also as has been pointed out, the Xbox 360 is not a tile based renderer. The last tile based renderer in consumer space was the Kyro 2, and that wasn’t exactly a stunning success.

In particular, it required a lot of application specific driver workarounds because both OpenGL and Direct3D are primarily designed as forward rendered APIs, as are most games. I’d imagine nowadays with more complex shaders, deferred rendering, post-processing and more complex AA modes, tile based rendering would be even harder to pull off properly.
 

Scali

Banned
Dec 3, 2004
In particular, it required a lot of application specific driver workarounds because both OpenGL and Direct3D are primarily designed as forward rendered APIs, as are most games.

Actually, Direct3D < 10 is designed to support zbuffer-less devices.
The problem is mainly that a lot of game developers don't follow Direct3D specs to the letter. Can't really blame them... before Kyro arrived, there was no way to verify if your code would work on a zbuffer-less device.
3DMark2001 originally had Kyro-related bugs as well, but they released an update that fixed everything (3DMark2001 SE).
The BeginScene()/EndScene() calls are fundamental to zbuffer-less devices. They act as markers for when the device needs to start and end buffering the draw calls, and when the actual deferred rendering is to take place.
There is actually a caps flag for it: D3DPRASTERCAPS_ZBUFFERLESSHSR:
http://msdn.microsoft.com/en-us/library/bb172513(VS.85).aspx

In Direct3D 10 this was dropped. There is no BeginScene()/EndScene() call anymore, and all devices are expected to behave as if they have a zbuffer (which PowerVR devices could do btw).

I’d imagine nowadays with more complex shaders, deferred rendering, post-processing and more complex AA modes, tile based rendering would be even harder to pull off properly.

Tile-based rendering in itself is not that hard... but deferred rendering is a problem. You have to buffer and sort the entire frame before drawing, and there is just far too much geometry and far too many complex state changes (shader constants and whatnot) to do this efficiently.
However, since more and more games are implementing deferred rendering themselves, and all GPUs already use some kind of tile-based z/stencil algorithm, I don't think the gains are as high as they were back in the Kyro days either. I think the main win would be that the z-buffer sits in fast on-chip cache (or, like the Xbox, in a fast eDRAM chip). But these days video memory is incredibly fast anyway, so I wonder how much there is to win with on-chip cache.
The Kyro wasn't so much a faster card... it just could get the same performance as a GeForce 2 card without requiring fast and expensive DDR memory; it used standard SDRAM. But that wasn't in the age of massive multitexturing.
Its main claims to fame were very efficient overdraw handling, fast and high-quality alpha blending, and very clever on-the-fly mipmap filtering for efficient trilinear-filtered texturing.
 

Scali

Banned
Dec 3, 2004
Dreamcast used tile based rendering.

Yes, tile-based DEFERRED rendering even. It used a PowerVR chip.
But it left the sorting to the application. This worked very well, as there was no driver overhead.
I have an Apocalypse 3Dx card myself, with a cousin to the DreamCast's chip (the PCX2)... But it doesn't work that well for Direct3D, as the driver has to do all the sorting.
With its own PowerSGL API though, you could do great things with that card.
 

Fox5

Diamond Member
Jan 31, 2005
The current embedded PowerVR solutions are at least competitive with Intel IGPs at a smaller die size (excluding the Intel IGPs that ARE PowerVR tech). The higher-end PowerVR stuff (not really utilized in any devices that I know of) could probably compete with the current best IGPs from AMD and nVidia, though I think AMD and nVidia are still on 55 nm or 65 nm while the PowerVR IGPs are on 45 nm.
 

Lonbjerg

Diamond Member
Dec 6, 2009
I don't know much about what you guys are discussing, but I just came across this: http://www.bit-tech.net/news/hardware/2010/05/15/amd-fusion-cpu-gpu-will-ship-this-year/1

These tidbits are "golden":

"I don’t think there’s a simple answer to that," said Grim. "If you look at the history of AMD, when we came out with dual-core processors, we built a true dual-core processor. When we came out with quad-cores, we built a true quad-core processor. What our competitors did was an MCM solution – taking two chips and gluing them together."


And it beat the crap out of them... all of AMD's boasting about "native" fell flat on its face.

When asked if building a CPU/GPU hybrid chip on a single piece of silicon would yield any advantages beyond speed, Grim replied, "We hope so."

That sounds reassuring...:D
 

evolucion8

Platinum Member
Jun 17, 2005
2,867
3
81
These tidbits are "golden":

"I don’t think there’s a simple answer to that," said Grim. "If you look at the history of AMD, when we came out with dual-core processors, we built a true dual-core processor. When we came out with quad-cores, we built a true quad-core processor. What our competitors did was an MCM solution – taking two chips and gluing them together."


And it beat the crap out of them... all of AMD's boasting about "native" fell flat on its face.

When asked if building a CPU/GPU hybrid chip on a single piece of silicon would yield any advantages beyond speed, Grim replied, "We hope so."

That sounds reassuring...:D

AMD did beat Intel in the dual-core race, bringing out the first native dual core, which soundly beat the non-native Pentium 4 dual-core solutions like the Pentium D.

Intel's Conroe, which was a native dual core, beat AMD's dual-core solution. AMD's native quad core failed to beat Intel's "fake" quad core, but not because of the inter-core communication or because it was a native quad core; AMD's quad-core architecture was simply a disaster in terms of IPC and cache latency, plus the TLB bug. You could pour lots of fairy dust into the original Phenom to improve interconnect speed and it simply wouldn't catch Kentsfield, period.

Phenom II X4 was an evolution of the original Phenom architecture and can go toe to toe with the best Kentsfield/Yorkfield quad cores in most scenarios, but that's it. Nehalem is an evolution of the Yorkfield/Kentsfield architecture and simply outruns anything AMD currently has, especially in the CPU market beyond $300.

This stuff about the competition using MCM is simply a PR stunt, same with Fusion; what kind of enthusiast will fall for that? I believe in execution, and AMD simply failed to execute properly with the Phenom, but since they did fine with the Phenom II and the latest Phenom X6, they may surprise us; the same story happened with the HD 2900 XT/3800 series. But Intel definitively owns the enthusiast market now, and their tick/tock execution has paid off very well. Only execution matters, not marketing and propaganda.
 

Scali

Banned
Dec 3, 2004
AMD did beat Intel in the dual-core race, bringing out the first native dual core, which soundly beat the non-native Pentium 4 dual-core solutions like the Pentium D.

Not because the Pentium D wasn't a native dual core, though.
The Pentium 4 was not as good an architecture as the Athlon 64, so logically two Pentium 4 cores aren't as good as two Athlon 64 cores either.
The problem wasn't related to it being an MCM specifically; the Pentium 4 scaled to the Pentium D about as well as the Athlon 64 did to the Athlon X2.

As an aside, the original Pentium D Smithfield (8xx series) was a 'native dual core' as well: it was a single-die solution, and it was on the market before AMD's first dual core, so AMD didn't beat them to it.
Intel didn't move to MCM until the Presler core (9xx series). For some reason nobody seems to know this.
There was no performance penalty going from single-die to MCM; in fact, the MCM version was a slightly optimized architecture and was actually faster.

See this image:
[image: 1109721988.jpg]


Notice the difference between Presler/Dempsey (MCM) and Smithfield (single-die)

Smithfield from another angle:
[image: smithfield.jpg]


One big rectangular die, containing two Pentium 4 'logic blocks' copy-pasted together.

Intel's Conroe, which was a native dual core, beat AMD's dual-core solution. AMD's native quad core failed to beat Intel's "fake" quad core, but not because of the inter-core communication or because it was a native quad core; AMD's quad-core architecture was simply a disaster in terms of IPC and cache latency, plus the TLB bug. You could pour lots of fairy dust into the original Phenom to improve interconnect speed and it simply wouldn't catch Kentsfield, period.

The thing is, AMD claimed that there was going to be an advantage because of the native quad-core design, but cache2cache clearly demonstrates that this is not the case.
In theory you *could* get an advantage from a native design, but it is in no way a guarantee. With Barcelona it was clear long before the chip launched that the design wasn't going to take advantage of being native.
The same seems to go for Llano.
 

Skurge

Diamond Member
Aug 17, 2009
Not because the Pentium D wasn't a native dual core, though.
The Pentium 4 was not as good an architecture as the Athlon 64, so logically two Pentium 4 cores aren't as good as two Athlon 64 cores either.
The problem wasn't related to it being an MCM specifically; the Pentium 4 scaled to the Pentium D about as well as the Athlon 64 did to the Athlon X2.

As an aside, the original Pentium D Smithfield (8xx series) was a 'native dual core' as well: it was a single-die solution, and it was on the market before AMD's first dual core, so AMD didn't beat them to it.
Intel didn't move to MCM until the Presler core (9xx series). For some reason nobody seems to know this.
There was no performance penalty going from single-die to MCM; in fact, the MCM version was a slightly optimized architecture and was actually faster.

See this image:
[image: 1109721988.jpg]


Notice the difference between Presler/Dempsey (MCM) and Smithfield (single-die)

Smithfield from another angle:
[image: smithfield.jpg]


One big rectangular die, containing two Pentium 4 'logic blocks' copy-pasted together.



The thing is, AMD claimed that there was going to be an advantage because of the native quad-core design, but cache2cache clearly demonstrates that this is not the case.
In theory you *could* get an advantage from a native design, but it is in no way a guarantee. With Barcelona it was clear long before the chip launched that the design wasn't going to take advantage of being native.
The same seems to go for Llano.

I'm not sure you can put Llano and Barcelona in the same category.
 

Lonbjerg

Diamond Member
Dec 6, 2009
I'm not sure you can put Llano and Barcelona in the same category.

They both have to obey the laws of physics, which IMHO is Scali's point.

Magic fairies aren't going to come along with pixie dust and move the data by casting a spell.
 

Scali

Banned
Dec 3, 2004
They both have to obey the laws of physics, which IMHO is Scali's point.

Magic fairies aren't going to come along with pixie dust and move the data by casting a spell.

Basically AMD shot themselves in the foot here.
Their HyperTransport bus is so good that there is little to gain by making single-die 'native' solutions.
 