I don't know, but real-time ray tracing and physics look like they're going to do well on Larrabee. We know Intel is working on its own games. We also know that other companies are working on games for Larrabee. It's said 6 games will be released with Larrabee. I bet one of those will do a lot of ray tracing.
Larrabee is also suitable for a wide variety of non-rasterization
based throughput applications. The following is a brief discussion
of the observed scalability and characteristics of several examples.
Figure 17: Game Physics Scalability Performance: this shows
that the Larrabee architecture is scalable to meet the growing
performance needs of interactive rigid body, fluid, and cloth
simulation algorithms and some commonly used collision kernels.
Game Physics: We have performed detailed scalability
simulation analysis of several game physics workloads on various
configurations of Larrabee cores. Figure 17 shows scalability of
some widely used game physics benchmarks and algorithms for
rigid body, fluid, and cloth. We achieve better than 50% resource
utilization using up to 64 Larrabee cores, and achieve near-linear
parallel speedup in some cases. The game rigid body simulation is
based on the popular "castle" destruction scene with 10K objects.
Scalability plots for Sweep-and-Prune [Cohen et al. 1995] and
GJK [Gilbert et al. 1988] distance algorithms are included since
they are some of the most commonly used collision detection
routines. Game fluid simulation is based on the smoothed particle
hydrodynamics (SPH) algorithm. We used a mass spring model
and Verlet integration for our game cloth simulation [Jacobsen
2001]. Bader et al. [2008] provide details on the implementation
and scalability analysis for these game physics workloads.
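
To make the cloth setup a bit more concrete, here is a minimal sketch of position Verlet integration driving a single damped spring in 1D. It is not the implementation from the paper or from Bader et al.; the function names, the damping term, and all constants are assumptions chosen only so the example settles to a value that can be checked by hand. A real cloth solver would apply the same update to a grid of particles joined by many springs.

```python
def verlet_step(x, x_prev, accel, dt, damping=0.0):
    """One position-Verlet update. Returns (new position, current position)."""
    # Velocity is implicit in (x - x_prev); damping scales it down each step.
    x_new = x + (1.0 - damping) * (x - x_prev) + accel * dt * dt
    return x_new, x


def spring_accel(x, rest_length, stiffness, mass, gravity):
    """Acceleration of a particle hanging from a spring anchored at 0 (1D)."""
    return gravity - (stiffness / mass) * (x - rest_length)


def settle(steps=5000, dt=0.01):
    """Run the damped spring until it is effectively at rest (assumed constants)."""
    rest_length, stiffness, mass, gravity = 1.0, 100.0, 1.0, 9.81
    x = x_prev = rest_length  # start at rest length with zero velocity
    for _ in range(steps):
        a = spring_accel(x, rest_length, stiffness, mass, gravity)
        x, x_prev = verlet_step(x, x_prev, a, dt, damping=0.02)
    return x
```

With these assumed constants the particle should come to rest where the spring force balances gravity, at rest_length + mass * gravity / stiffness. Each particle's update reads only the previous state, which is the property that lets this kind of simulation spread across many cores.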
Figure 18: Real time ray tracing on Larrabee: cropped from a
1Kx1K sample image that requires ~4M rays. The ray tracer was
implemented in C++ with some hand-coded assembly code for
key routines like ray intersection. Kd-trees are typically 25MB
and are built dynamically per frame. Primary and reflection rays
are tested in 16 ray bundles. Nearly all 234K triangles are visible
to primary or reflection rays. (Bar Carta Blanca model by
Guillermo M Leal Llaguno, courtesy of Cornell University.)
Real Time Ray Tracing: The highly irregular nature of spatial
data structures used in Whitted-style real-time ray tracers benefits
from Larrabee's general purpose memory hierarchy, relatively
short pipeline, and VPU instruction set. Here we used SIMD 16
packet ray tracing traversing through a kd-tree. For the complete
workload, we observe that a single Intel Core 2 Duo processor
requires 4.67x more clock cycles than a single Larrabee core,
which shows the effectiveness of the Larrabee instruction set and
wide SIMD. Results are even better for small kernels. For
example, the intersection test of 16 rays to 1 triangle takes 47
cycles on a single Larrabee core. The same test takes 257 Core 2
Duo processor cycles. Figure 18 shows a 1024x1024 frame of the
bar scene with 234K triangles, 1 light source, 1 reflection level,
and typically 4M rays per frame. Figure 19 compares performance
for Larrabee with an instance of the ray tracer running on an Intel
Xeon® processor 2.6GHz with 8 cores total. Shevtsov et al. [2007]
and Reshetov et al. [2005] describe details of this implementation.
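
As a rough illustration of what "16 rays against 1 triangle" means, here is a sketch of a packet intersection test in plain Python. It uses the Möller-Trumbore ray/triangle test, which is an assumption on my part: the excerpt does not say which intersection routine the Larrabee ray tracer used. The helper names and the bitmask return value are also made up for illustration. On real wide-SIMD hardware the per-lane loop would be one vector operation with a mask register tracking which rays hit.

```python
EPS = 1e-9  # tolerance for parallel rays and hits behind the origin


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def intersect_packet(origins, directions, v0, v1, v2):
    """Test every ray in a packet against one triangle (two-sided).

    Returns (mask, ts): bit i of mask is set when ray i hits, and ts[i]
    is the hit distance along ray i, or None on a miss.
    """
    e1, e2 = sub(v1, v0), sub(v2, v0)  # triangle edges, shared by all lanes
    mask, ts = 0, []
    for lane, (o, d) in enumerate(zip(origins, directions)):
        t_hit = None
        p = cross(d, e2)
        det = dot(e1, p)
        if abs(det) > EPS:  # skip rays parallel to the triangle plane
            inv = 1.0 / det
            s = sub(o, v0)
            u = dot(s, p) * inv          # first barycentric coordinate
            q = cross(s, e1)
            v = dot(d, q) * inv          # second barycentric coordinate
            t = dot(e2, q) * inv         # distance along the ray
            if u >= 0.0 and v >= 0.0 and u + v <= 1.0 and t > EPS:
                t_hit = t
                mask |= 1 << lane
        ts.append(t_hit)
    return mask, ts
```

Calling it with a 4x4 grid of parallel rays gives one 16-lane packet; the returned mask plays the role a SIMD write mask would play in a vectorized version.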
Figure 19: Real time ray tracing scalability: this graph compares
different numbers of Larrabee cores with a nominal 1GHz clock
speed to an Intel Xeon processor 2.6GHz with 8 cores total. The
latter uses 4.6x more clock cycles than are required by 8
Larrabee cores due to Larrabee's wide VPU and vector
instruction set. Figure 18 describes the workload for these tests.
Image and Video Processing: The Larrabee architecture is
suitable for many traditional 2D image and video analysis
applications. Native implementations of traditional 2D filtering
functions (both linear and non-linear) as well as more advanced
functions, like video cast indexing, sports video analysis, human
body tracking, and foreground estimation offer significant
scalability as shown in Figure 20. Biomedical imaging represents
an important subset of this processing type. Medical imaging
needs such as back-projection, volume rendering, automated
segmentation, and robust deformable registration, are related yet
different from those of consumer imaging and graphics. Figure 20
also includes scalability analysis of iso-surface extraction on a 3D
volume dataset using the marching cubes algorithm.
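
For anyone unsure what "linear and non-linear 2D filtering" covers, here is a small sketch with one of each: a 3x3 box blur (linear) and a 3x3 median filter (non-linear). This is generic textbook filtering in plain Python, not the native Larrabee implementation, and the function names and border handling are my own choices. The point to notice is that every output pixel depends only on the input image, so rows can be handed to different cores independently.

```python
from statistics import median


def filter3x3(image, reduce_fn):
    """Apply reduce_fn to each 3x3 neighbourhood of a 2D list of numbers.

    Border pixels are copied through unchanged to keep the sketch short.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [image[y + dy][x + dx]
                      for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = reduce_fn(window)
    return out


def box_blur(image):
    """Linear filter: each pixel becomes the mean of its 3x3 window."""
    return filter3x3(image, lambda win: sum(win) / 9.0)


def median_filter(image):
    """Non-linear filter: each pixel becomes the median of its 3x3 window."""
    return filter3x3(image, median)
```

On an image with a single bright outlier pixel, the median filter removes the outlier entirely while the box blur only smears it into its neighbours, which is why both kinds show up in image analysis pipelines.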
Physical Simulation: Physical simulation applications use
numerical simulation to model complex natural phenomena in
movies and games, such as fire effects, waterfalls in virtual
worlds, and collisions between rigid or deformable objects. Large
data-sets, unstructured control-flow and data accesses often make
these applications more challenging to scale than traditional
streaming applications. Looking beyond interactive game physics,
we also analyzed applicability of Larrabee architecture for the
broader class of entertainment physics including offline movie-industry
effects and distributed real-time virtual-world simulation.
Specific simulation results based on Stanford's PhysBAM are
shown in Figure 20 and illustrate very good scalability for
production fluid, production cloth, and production face.
Implementation and scalability analysis details are described by
Hughes et al. [2007].
18:12 • L. Seiler et al.
ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, Publication date: August 2008.
Figure 20: Scalability of select non-graphics applications and
kernels: Larrabee's general-purpose many-core architecture
delivers performance scalability for various non-graphics visual
and throughput computing workloads and common HPC kernels.
Larrabee is also highly scalable for non-visual throughput
applications, as shown in Figure 20. Larrabee's highly-threaded
x86 architecture benefits traditional enterprise throughput
computing applications, such as text indexing. Its threading,
together with its wide-SIMD IEEE-compliant double-precision
support, makes it well positioned for financial analytics, such as
portfolio management. Internal research projects have proven
Larrabee architecture scalability for many traditional high
performance computing (HPC) workloads and well-known HPC
kernels such as 3D-FFT and BLAS3 (with dataset larger than on-die
cache). More details are described by Chen et al. [2008].
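
The "dataset larger than on-die cache" remark matters because BLAS3 kernels such as matrix multiply are usually tiled so that each tile is reused while it is still cached. Below is a minimal sketch of that blocking idea in plain Python. It is a generic illustration only: the excerpt does not describe how the Larrabee kernels were written, and the function name and default block size here are arbitrary assumptions.

```python
def matmul_blocked(a, b, block=2):
    """Multiply matrices a (n x k) and b (k x m) tile by tile.

    The three outer loops walk over block x block tiles; the three inner
    loops do an ordinary multiply restricted to one tile, so the working
    set at any moment is only a few tiles.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for j0 in range(0, m, block):
            for k0 in range(0, k, block):
                for i in range(i0, min(i0 + block, n)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + block, k)):
                        aik = row_a[kk]
                        row_b = b[kk]
                        for j in range(j0, min(j0 + block, m)):
                            row_c[j] += aik * row_b[j]
    return c
```

The result is the same as a plain triple loop; only the order of the work changes. Different (i0, j0) tiles of the output are independent, so they can also be assigned to different cores.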
7. Conclusions
We have described the Larrabee architecture, which uses multiple
x86-based CPU cores, together with wide vector processor units
and some fixed function logic, to achieve high performance and
flexibility for interactive graphics and other applications. We have
also described a software renderer for the Larrabee architecture
and a variety of other throughput applications, with performance
and scalability analysis for each. Larrabee is more programmable
than current GPUs, with fewer fixed function units, so we believe
that Larrabee is an appropriate platform for the convergence of
GPU and CPU applications.
We believe that this architecture opens a rich set of opportunities
for both graphics rendering and throughput computing. We have
observed a great deal of convergence towards a common core of
computing primitives across the workloads that we analyzed on
Larrabee. This underlying workload convergence [Chen et al.
2008] implies potential for a common programming model, a
common run-time, and a native Larrabee implementation of
common compute kernels, functions, and data structures.
Acknowledgements: The Larrabee project was started by Doug
Carmean and Eric Sprangle, with assistance from many others,
both inside and outside Intel. The authors wish to thank many
people whose hard work made this project possible, as well as
many who helped with this paper. Workload implementation and
data analysis were provided by Jeff Boody, Dave Bookout, Jatin
Chhugani, Chris Gorman, Greg Johnson, Danny Lynch, Oliver
Macquelin, Teresa Morrison, Misha Smelyanskiy, Alexei
Soupikov, and others from Intel's Application Research Lab,
Software Systems Group, and Visual Computing Group.
References
AKENINE-MÖLLER, T., HAINES, E. 2002. Real-Time Rendering.
2nd Edition. A. K. Peters.
AILA, T., LAINE, S. 2004. Alias-Free Shadow Maps. In
Proceedings of Eurographics Symposium on Rendering 2004,
Eurographics Association. 161-166.
ALPERT, D., AVNON, D. 1993. Architecture of the Pentium
Microprocessor. IEEE Micro, v.13, n.3, 11-21. May 1993.
AMD. 2007. Product description web site:
ati.amd.com/products/Radeonhd3800/specs.html.
BADER, A., CHHUGANI, J., DUBEY, P., JUNKINS, S., MORRISON, T.,
RAGOZIN, D., SMELYANSKIY, M. 2008. Game Physics Performance
On Larrabee Architecture. Intel whitepaper, available in
August, 2008. Web site: techresearch.intel.com.
BAVOIL, L., CALLAHAN, S., LEFOHN, A., COMBA, J., SILVA, C. 2007.
Multi-fragment effects on the GPU using the k-buffer. In
Proceedings of the 2007 Symposium on Interactive 3D
Graphics and Games (Seattle, Washington, April 30 - May 02,
2007). I3D 2007. ACM, New York, NY, 97-104.
BLUMOFE, R., JOERG, C., KUSZMAUL, B., LEISERSON, C., RANDALL,
K., ZHOU, Y. Aug. 25, 1996. Cilk: An Efficient Multithreaded
Runtime System. Journal of Parallel and Distributed
Computing, v. 37, i. 1, 55-69.
BLYTHE, D. 2006. The Direct3D 10 System. ACM Transactions
on Graphics, 25, 3, 724-734.
BOOKOUT, D. July, 2007. Shadow Map Aliasing. Web site:
www.gamedev.net/reference/articles/article2376.asp.
BUCK, I., FOLEY, T., HORN, D., SUGERMAN, J., FATAHALIAN, K.,
HOUSTON, M., HANRAHAN, P. 2004. Brook for GPUs:
stream computing on graphics hardware. ACM Transactions on
Graphics, v. 23, n. 3, 777-786.
CALLAHAN, S., IKITS, M., COMBA, J., SILVA, C. 2005. Hardware-assisted
visibility sorting for unstructured volume rendering.
IEEE Transactions on Visualization and Computer Graphics,
11, 3, 285-295.
CHANDRA, R., MENON, R., DAGUM, L., KOHR, D., MAYDAN, D.,
MCDONALD, J. 2000. Parallel Programming in OpenMP.
Morgan Kaufmann.
CHEN, M., STOLL, G., IGEHY, H., PROUDFOOT, K., HANRAHAN, P.
1998. Simple models of the impact of overlap in bucket
rendering. In Proceedings of the ACM SIGGRAPH /
EUROGRAPHICS Workshop on Graphics Hardware (Lisbon,
Portugal, August 31 - September 01, 1998). S. N. Spencer, Ed.
HWWS '98. ACM, New York, NY, 105-112.
CHEN, Y., CHHUGANI, J., DUBEY, P., HUGHES, C., KIM, D., KUMAR,
S., LEE, V., NGUYEN, A., SMELYANSKIY, M. 2008. Convergence
of Recognition, Mining, and Synthesis Workloads and its
Implications. In Proceedings of IEEE, v. 96, n. 5, 790-807.
CHUVELEV, M., GREER, B., HENRY, G., KUZNETSOV, S., BURYLOV,
I., SABANIN, B. Nov. 2007. Intel Performance Libraries: Multicore-ready
Software for Numeric Intensive Computation. Intel
Technology Journal, v. 11, i. 4, 1-10.
COHEN, J., LIN, M., MANOCHA, D., PONAMGI, D. 1995.
I-COLLIDE: An Interactive and Exact Collision Detection
System for Large-Scale Environments. In Proceedings of 1995
Symposium on Interactive 3D Graphics. SI3D '95. ACM, New
York, NY, 189-196.
Larrabee: