THE CRAY X1 DESIGN
The overall design goal of the X1 is to combine the historically high bandwidth of vector supercomputers with the efficient scaling of MPPs. This translates into several specific design elements, including:
* New custom processor architecture - The system is designed around custom multi-piped vector processors, with 12.8 Gflops peak performance per processor (25.6 Gflops for 32-bit computations).
** Processors use a new Instruction Set Architecture (ISA) that is partially based on the MIPS ISA, with many additional instructions supporting vector processing and other enhancements (e.g., fixed instruction size, more registers, masked vector operations, large integer vector support, 32-bit data, and cache control).
** The design carries over the multi-streaming concept introduced in the SV1. In addition, it incorporates superscalar processing, integrated vector caches, and a decoupled microarchitecture that allows the processors to better tolerate memory latencies.
** Processors are configured with 8 vector pipes.
* Balanced high bandwidth memory systems - The system is organized into four-processor nodes, each of which contains 128 Rambus memory channels, for a local bandwidth of 200 GB/s. The nodes are connected by 16 parallel networks, providing 25 GB/s of point-to-point bandwidth. In maximum configurations, the network scales to over 4 TB/s of global bandwidth.
* Scalability - The Cray X1 is designed to provide high performance at both small and large processor counts. The system scales to thousands of processors. The scalable address translation mechanisms and communication protocols were carried forward from the Cray T3E design.
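As a rough sanity check, the headline per-processor and per-node figures above can be reproduced from the stated clock rate and pipe count. This is a sketch under an assumption the article does not state: that each vector pipe retires two floating-point results (a multiply and an add) per clock.

```python
# Back-of-the-envelope check of the figures listed above.
# Assumption (not stated in the article): each vector pipe retires
# 2 floating-point results per clock (one multiply + one add).
vector_clock_ghz = 0.8           # 800 MHz vector units
pipes_per_msp = 8                # 8 vector pipes per MSP
flops_per_pipe_per_clock = 2     # assumed multiply-add per pipe

peak_gflops = vector_clock_ghz * pipes_per_msp * flops_per_pipe_per_clock
print(peak_gflops)               # 12.8 GFLOPS per MSP, matching the spec

# Implied per-channel rate of the node's Rambus memory system
# (derived from the article's totals, not stated directly):
node_bw_gbs = 200.0              # 200 GB/s local bandwidth per node
channels_per_node = 128          # 128 Rambus channels per node
print(node_bw_gbs / channels_per_node)   # 1.5625 GB/s per channel
```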
CRAY X1 HIGHLIGHTS
The highlights of the new Cray X1 include:
* Scaling from 4 to over 4,000 processors
* Each processor is rated at 12.8 peak GFLOPS. The processors are constructed of four sets of scalar/vector units to create an MSP (Multi-Streaming Processor).
The processor chips run at 800 MHz for the vector units and 400 MHz for the scalar units.
** Providing 3.2 scalar GOPS and 12.8 vector GFLOPS per MSP processor (25.6 GFLOPS in 32-bit mode)
* High bandwidth, low latency memory system design:
** Processor bandwidth to cache is 76 GB/s (50 GB/s for loads and 26 GB/s for stores)
** Peak bandwidth to local main memory is 51 GB/s per processor (38 GB/s sustained). Global interconnect main memory bandwidth is 102 GB/s per four-processor node board.
** I/O bandwidth is 4.8 GB/s per 4-processor node board and up to 75 GB/s per cabinet. Up to one I/O channel per processor. Each I/O channel is 1.2 GB/s full duplex, and is globally accessible by all processors in the machine.
** Latency to global memory is in the microsecond range even in the largest configurations; typical latency across a 512-processor system (128 nodes) is around one microsecond.
* U.S. list pricing begins at $2.5 million
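Taking the per-MSP figures above at face value, the system-level peak follows directly. A minimal sketch, treating "over 4,000 processors" as exactly 4,000 (the article does not give the exact maximum):

```python
# System-level peak implied by the per-processor rating above.
per_msp_gflops = 12.8       # peak per MSP (64-bit)
processors = 4000           # "over 4,000" per the article; exact max not given

peak_tflops = per_msp_gflops * processors / 1000.0
print(peak_tflops)          # 51.2 TFLOPS peak at 4,000 processors
```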
http://www.supercomputingonline.com/article.php?sid=3049