- Oct 13, 2016
- 68
- 13
- 36
An excerpt from the original paper:
In the center of the EHP are two CPU clusters, each consisting of four multi-core CPU chiplets stacked on an active interposer base die. On either side of the CPU clusters are a total of four GPU clusters, each consisting of two GPU chiplets on a respective active interposer. Upon each GPU chiplet is a 3D stack of DRAM (e.g., some future generation of JEDEC high-bandwidth memory (HBM) [12]). The DRAM is directly stacked on the GPU chiplets to maximize bandwidth while minimizing memory-related data movement energy and total package footprint. CPU computations tend to be more latency sensitive, and so the central placement of the CPU cores reduces NUMA-like effects by keeping the CPU-to-DRAM distance relatively uniform. The interposers underneath the chiplets provide the interconnection network between the chiplets along with other common system functions. Interposers maintain high-bandwidth connectivity among themselves by utilizing wide, short-distance, point-to-point paths.
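To make this layout concrete, here is a toy C++ sketch of the package hierarchy as I read it from the excerpt. The type names and structure are my own shorthand for illustration, not AMD's terminology; only the counts come from the text above.

```cpp
#include <cstdio>

// Illustrative-only model of the EHP package composition described above.
struct GpuChiplet { int hbmStacks = 1; };       // one 3D DRAM stack sits on each GPU chiplet
struct GpuCluster { GpuChiplet chiplets[2]; };  // two GPU chiplets per active interposer
struct CpuChiplet { int cores = 4; };           // multi-core CPU chiplet
struct CpuCluster { CpuChiplet chiplets[4]; };  // four CPU chiplets per base die

struct EhpPackage {
    CpuCluster cpu[2];  // central placement keeps CPU-to-DRAM distance uniform
    GpuCluster gpu[4];  // two GPU clusters on either side of the CPU clusters
};

int main() {
    EhpPackage ehp;
    int gpuChiplets = 0, hbmStacks = 0, cpuCores = 0;
    for (const auto& cluster : ehp.gpu)
        for (const auto& chip : cluster.chiplets) {
            ++gpuChiplets;
            hbmStacks += chip.hbmStacks;
        }
    for (const auto& cluster : ehp.cpu)
        for (const auto& chip : cluster.chiplets)
            cpuCores += chip.cores;
    std::printf("GPU chiplets: %d, HBM stacks: %d, CPU cores: %d\n",
                gpuChiplets, hbmStacks, cpuCores);  // 8, 8, 32
}
```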
The EHP uses eight GPU chiplets. Our initial configuration provisions 32 compute units (CUs) per chiplet. Each chiplet is projected to provide two teraflops of double-precision computation, for a total of 16 teraflops. The EHP also employs eight CPU chiplets (four cores each), for a total of 32 cores, with greater parallelism through optional simultaneous multi-threading.
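The totals are easy to sanity-check. A quick back-of-the-envelope using only the numbers quoted above (the per-CU rate at the end is my own derived figure, not stated in the paper):

```cpp
#include <cstdio>

int main() {
    // Figures quoted in the excerpt.
    const int    gpuChiplets     = 8;    // four GPU clusters x two chiplets each
    const int    cusPerChiplet   = 32;   // initial provisioning
    const double dpTflopsPerChip = 2.0;  // projected double-precision TFLOP/s per chiplet

    std::printf("Total CUs:        %d\n", gpuChiplets * cusPerChiplet);        // 256
    std::printf("Total DP TFLOP/s: %.0f\n", gpuChiplets * dpTflopsPerChip);    // 16
    // Derived: 2 TFLOP/s over 32 CUs = 62.5 GFLOP/s of DP per CU.
    std::printf("DP GFLOP/s per CU: %.1f\n",
                dpTflopsPerChip * 1000.0 / cusPerChiplet);                     // 62.5
}
```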
Heterogeneous System Architecture (HSA) compatibility is one of the major design goals of the APU. HSA provides a system architecture where all computing elements (CPU, GPU, and possibly other accelerators) share a unified coherent virtual address space. These features are supported by AMD’s Radeon Open Compute platform (ROCm) to improve the programmability of such heterogeneous systems. We have created novel mechanisms like the QuickRelease synchronization mechanism [14] and heterogeneous-race-free memory models (HRF) [15]–[17] to reduce synchronization overhead between GPU threads, and heterogeneous system coherence (HSC) [18] to transparently manage coherence between CPU and GPU caches.
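The unified coherent virtual address space the excerpt describes is roughly what today's ROCm already exposes through HIP managed memory. Below is a minimal HIP sketch of that programming model, not the paper's mechanism: the CPU and GPU dereference the same pointer with no explicit copies. It assumes a ROCm toolchain (compile with hipcc), and whether the accesses are truly coherent in hardware depends on the device and driver.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// GPU kernel: scales each element in place through the shared pointer.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1024;
    float* data = nullptr;

    // Managed allocation: one virtual address valid on both CPU and GPU.
    hipMallocManaged((void**)&data, n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;      // CPU writes directly

    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);  // GPU touches the same memory
    hipDeviceSynchronize();                          // wait for the kernel

    std::printf("data[0] = %.1f\n", data[0]);        // CPU reads the GPU result: 2.0
    hipFree(data);
    return 0;
}
```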
Original paper: http://www.computermachines.org/joe/publications/pdfs/hpca2017_exascale_apu.pdf