ATI Xenos Demystified - 11 Page Interview-Article

hasanahmad

Junior Member
Jun 7, 2005
10
0
0
http://www.beyond3d.com/articles/xenos/

11 pages of technical detail on what the author is calling a revolutionary 3D design.


Conclusion

Overall it looks as though Xenos represents some highly interesting design choices on many fronts and clearly seems as though ATI have attempted to come up with a very different architecture to at the very least target the needs of the XBOX 360 console platform. It will be very interesting to see the performance and quality of graphics it is able to produce once developers have had decent access to development kits based on the final hardware, however we suspect that it won't be until the second generation of XBOX 360 titles before we see developers being able to seriously scratch the surface of understanding the processing capabilities of Xenos and the XBOX 360 as a whole. That being said, though, much of the architecture is transparent to the developer and they shouldn't need to concern themselves much with the types of workloads they are handing to the graphics processor as this will all be handled automatically, and without stalling any part of the pipeline.

Apart from the interesting use of eDRAM in this design, which is clearly targeted towards the console environment (although from its operation even this could potentially be moved into the PC space if the driver forced a Z only pass, however this may be a little risky), the design of the ALU arrays, texture processing and threaded nature of the system is clearly a large departure from any of the shader architecture we've seen so far. Despite having a raw ALU quantity that exceeds any platform currently available, clearly the primary key to the design of the processing is that of "efficiency" when processing shader programs, by organising the workloads in a threaded manner in order to try and constantly keep the available processing elements active, not stalling by interleaving latency bound dependent operations and having a unified platform that is agnostic to whether it is processing Vertex or Pixel Shaders and never having one type of operation stalling the other. The primary question here is exactly how "inefficient" are current architectures in relation to this one, which is a difficult question to answer because no hardware vendor is going to tell you their graphics processors are inefficient. All we can say at the moment is that clearly Xenos's shader processing architecture is fundamentally and significantly different from current platforms and clearly ATI did perceive an issue with current methodologies otherwise they wouldn't have gone to these lengths to change the pipeline.

In the future, with WGF2.0's unified shader language, it would be hard not to see this type of threaded shader architecture make its way across to ATI's PC products.
 

Pete

Diamond Member
Oct 10, 1999
4,953
0
0
Yes! Finally, ATI cleared it. Thanks for the link, I'm off to read it now.

Edit: Page seven has interesting info on the pipeline array, for those still confused by the hype:

"Its been said that Xenos's shader processor is an array of 48 ALU's, however it is more correct to say that that it is 3 separate arrays of SIMD (Single Instruction Multiple Data) ALU's. Each one of the 48 ALU's can co-issue a vector (Vec4) and a scalar instruction simultaneously, essentially allowing a "5D" operation per cycle. Each one of the ALU's is a complete instruction duplicate of the others and are all single precision IEEE floating point 32-bit compliant. The ALU's will process everything in FP32 internal precision and there are no internal partial precision requirements for FP16. Additional to the 48 ALU's is specific logic that performs all the pixel shader interpolation calculations which ATI suggests equates to about an extra 33% of pixels shader computational capability.

The arrows on ATI's diagram above indicates that there is some dependency from one of the shader arrays to another, almost as though they are pipelined; this is in fact not the case and each ALU array is working independently of the other and the data is not pipelined between them. This being the case there is no dependency between what programs, or types of programs, are being executed on each of the three ALU arrays - at a snapshot in time they could, potentially, all be vertex processing, all be pixel processing or there can be a mixture of both vertex processing and pixel processing occurring on the three different 16 ALU arrays."

So, three "pipelines" with 16 ALUs each. That's a big difference from the current setup of 16 pipes with about two ALUs each for the GF6 series and X800 series. Most interesting is that each pipe can run vertex or fragment shaders independently of the others. Heh, three cores for the Xb360 CPU, and three for its GPU.
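For anyone trying to keep the numbers straight, here is the arithmetic on the figures quoted above as a quick sketch. The variable names are mine, and no clock speed is assumed (the quote doesn't give one), so these are per-cycle counts only.

```python
# Per-cycle arithmetic from the figures quoted above (no clock speed assumed).
ARRAYS = 3              # independent SIMD arrays
ALUS_PER_ARRAY = 16     # ALUs in each array
VEC_COMPONENTS = 4      # one Vec4 instruction per ALU per cycle
SCALAR_COMPONENTS = 1   # co-issued scalar instruction

total_alus = ARRAYS * ALUS_PER_ARRAY
components_per_alu = VEC_COMPONENTS + SCALAR_COMPONENTS   # the "5D" operation
components_per_cycle = total_alus * components_per_alu

print(total_alus)            # 48
print(components_per_cycle)  # 240
```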

Really interesting reading, so far.

Edit the Second: Page ten:

The combination of the shader array and tessellation unit can now make the, oft spoken of but rarely seen, capability of displacement mapping an attainable method to use as this truly becomes a single pass algorithm for Xenos.
 

hasanahmad

Junior Member
Jun 7, 2005
10
0
0
Summary by GameMaster at Teamxbox forums


*XENOS is a "Split Processor" GPU, meaning that it is actually 2 GPU cores that are packaged together, with the "Parent" GPU handling the majority of shader tasks and acting as the "North Bridge" for the system, among other things. The "Daughter" GPU is directly linked to the "Parent" GPU and this is the module that has the 10MB of eDRAM. There is a considerable amount of additional logic on the "Daughter" GPU that will process a number of things such as HDR, 4xMSAA/FSAA, Z Buffer (Depth), Alpha Buffer (Transparency), Stencil Buffer (Shadows), Occlusion Culling (Removing unseen polygons), Radiosity Lighting (such as Global Illumination), Real Time LOD (Level of Detail/Tessellation), and something that ATI refers to as "Fluid Reality", which is basically material physics such as hair, clothing, and water. All of that without burdening the "Parent" GPU and saving memory bandwidth at the same time since these tasks can be performed on the eDRAM.

*XENOS's Parent GPU has 232 million transistors and the Daughter GPU has 150 million transistors (80 million of which are for the eDRAM), for a grand total of around 382 million transistors. XENOS's "Parent" GPU is manufactured by TSMC using their 90nm manufacturing process and the "Daughter" GPU is manufactured by NEC using their 90nm manufacturing process. The "Split Processor" design allows XENOS to improve yield during manufacturing and also helps with heat output/power consumption issues.

*XENOS uses deferred tile based rendering (some of you would be familiar with this as the Dreamcast used this rendering technique). This is how they will be able to process high resolution displays with 4xFSAA active, and there are some additional performance enhancing technologies that will take advantage of the tile based rendering.
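To see why tiling matters with only 10MB of eDRAM, here is a rough back-of-the-envelope sketch. The 10MB figure and 4x AA come from the summary; the 1280x720 target and the 4 bytes of colour plus 4 bytes of depth/stencil per sample are my assumptions, not figures from the posts, and the function name is made up.

```python
import math

EDRAM_BYTES = 10 * 1024 * 1024  # 10MB of eDRAM, from the summary

def tiles_needed(width, height, msaa_samples,
                 color_bytes=4, depth_bytes=4):
    """Number of eDRAM-sized tiles needed for one frame.

    color_bytes and depth_bytes per sample are assumed values.
    """
    samples = width * height * msaa_samples
    frame_bytes = samples * (color_bytes + depth_bytes)
    return math.ceil(frame_bytes / EDRAM_BYTES)

print(tiles_needed(1280, 720, 1))  # 1
print(tiles_needed(1280, 720, 4))  # 3
```

Under those assumptions a 720p frame without AA fits in the eDRAM in one go, while the same frame with 4x AA has to be split into three tiles.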

*XENOS contains 16 texture fetch and 16 vertex fetch units. Each of the texture units has bilinear sampling capacity per clock, and for trilinear or anisotropic filtering each unit will loop itself through multiple samples until the target sampling and filtering level is complete (Basically this means there is less performance loss when you are using trilinear or anisotropic texture filtering). This is done OUTSIDE of the shader units, which increases efficiency and so improves performance.

*XENOS is capable of processing 64 threads simultaneously. This is to make sure that all elements are being utilized, so there is minimal or no stalling of the graphics architecture. So even if an ALU would otherwise be waiting for a texture sample to be fetched, that thread would not stall the ALU, as it would be working on something else from another thread. This effectively hides tasks that would normally have a large latency penalty attached to them. ATI suggests that their testing achieves an average of 95% efficiency of the shader array in general purpose graphics usage conditions. The throughput is said to be two loops, two texture instructions, 6 ALU instructions, per pixel, per cycle at Xenos's peak fill rate.
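The latency-hiding idea is easy to play with in a toy model. The sketch below is not ATI's scheduler: it assumes a single ALU, threads that each do a few cycles of math and then wait on a texture fetch, and a "run the first ready thread" rule. The 4 ALU cycles and 20 cycle fetch latency are numbers I made up for illustration.

```python
def alu_utilization(num_threads, alu_cycles=4, fetch_latency=20,
                    total_cycles=10_000):
    """Fraction of cycles a single toy ALU spends doing useful work."""
    ready_at = [0] * num_threads            # cycle when each thread can run again
    work_left = [alu_cycles] * num_threads  # ALU cycles before the next fetch
    busy = 0
    for cycle in range(total_cycles):
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                busy += 1
                work_left[t] -= 1
                if work_left[t] == 0:
                    # Thread issues a texture fetch and sleeps until it returns.
                    ready_at[t] = cycle + 1 + fetch_latency
                    work_left[t] = alu_cycles
                break  # only one thread runs per cycle
    return busy / total_cycles

for n in (1, 3, 8):
    print(n, alu_utilization(n))
```

With one thread the ALU sits idle most of the time; with enough threads in flight there is always another thread ready to run while the others wait on their fetches.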

*XENOS has 48 ALUs that are 16-way, and are grouped into 3 arrays of SIMD ALUs. Each ALU can co-issue a Vector4 and a scalar instruction simultaneously, essentially a "5D" operation per cycle (basically 2 Vec4 and 2 scalar instructions per cycle per ALU). The ALUs process everything in FP32 precision with no internal partial precision requirements for FP16. Additionally each of the 48 ALUs contains additional logic that performs all the pixel shader interpolation calculations. ATI suggests that this basically equates to an extra 33% pixel shader computational capacity.

*Developers can choose to allow XENOS to automatically handle load balancing of the ALUs for their applications or take direct control of the ALUs. The load balancing is based on an algorithm that affects prioritization of the vertex and pixel shader programs. ATI believes that the algorithm gives near-optimal throughput and expects only a few developers to actually look into changing the weightings of the algorithm. They also state that there will never be an unused shader array or texture sampler if there are threads available to use it.
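ATI hasn't published the actual arbitration algorithm, so the following is purely a hypothetical sketch of what "weighted prioritization" between vertex and pixel work could look like. The function name, the weight values and the "weight times pending threads" rule are all my own invention; the only property taken from the summary is that an array is never left idle while any thread is waiting.

```python
def pick_next(pending, weights):
    """Pick which shader type an idle array should run next.

    pending: waiting thread counts, e.g. {"vertex": 2, "pixel": 10}
    weights: developer-tunable priorities (hypothetical)
    Returns None only when nothing at all is waiting.
    """
    candidates = {kind: n for kind, n in pending.items() if n > 0}
    if not candidates:
        return None
    # Favour the type with the largest weighted backlog.
    return max(candidates, key=lambda kind: weights[kind] * candidates[kind])

print(pick_next({"vertex": 2, "pixel": 10}, {"vertex": 1.0, "pixel": 1.0}))  # pixel
print(pick_next({"vertex": 2, "pixel": 10}, {"vertex": 8.0, "pixel": 1.0}))  # vertex
```

Even with a very low vertex weight, a lone vertex thread still gets picked when no pixel threads are waiting, which is the "no unused array" behaviour described above.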

*XENOS capabilities... 4K instruction slots (shared between VS and PS), greater than 500K maximum number of instructions executed, has instruction predication, 64 temporary registers, 512 constant registers (shared between VS and PS), has static flow control, has dynamic flow control, has a 4 dynamic flow control depth or 2^23 if nesting, has vertex texture fetch (dependent fetches and all formats), 32 surface shared pool where a texture consumes 1 entry and a vertex consumes 1/3 of an entry so maximum of 32 texture or 96 vertex, has geometry instancing, has no dependent texture limits or texture instruction limits, has position registers, has 16 interpolated registers, has arbitrary swizzling, has gradient instructions, has loop count registers, and has face registers (2 sided lighting). What does all that mean? Don't ask... it would take too long to describe everything, but all this does mean it EXCEEDS VS3.0 and PS3.0 specifications.
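The shared surface pool is the one item in that list with some arithmetic behind it, so here is a small sketch of the budget check. It only uses the figures above (32 entries, a texture costs 1 entry, a vertex costs 1/3 of an entry); the function name is hypothetical, and I count in thirds of an entry to avoid floating point rounding.

```python
POOL_THIRDS = 32 * 3  # 32 shared entries, counted in thirds of an entry

def fits_in_pool(textures, vertex_streams):
    """True if the mix of textures and vertex surfaces fits in the shared pool."""
    used = textures * 3 + vertex_streams  # texture = 3 thirds, vertex = 1 third
    return used <= POOL_THIRDS

print(fits_in_pool(32, 0))   # True  (the 32 texture maximum)
print(fits_in_pool(0, 96))   # True  (the 96 vertex maximum)
print(fits_in_pool(16, 49))  # False (16 textures leave room for only 48)
```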

*XENOS has something called "MEMEXPORT" which will be important for shader programs that exceed 4000 instructions, but that is only the start of this particular beauty. It would take me too long to describe this feature in this post, but developers will absolutely love this feature...

*XENOS is capable of processing a displacement map in a single pass (this basically gives free additional geometry for the object).

...and a lot more than the 10 item limit that was requested by the earlier poster. Bottom line, XENOS is both POWERFUL and EFFICIENT... now what happens when you combine something that is powerful AND efficient? More later...