Signs of an HPC APU by AMD

otinane

Member
Oct 13, 2016
68
13
36
A small excerpt from the original paper:

In the center of the EHP are two CPU clusters, each consisting of four multi-core CPU chiplets stacked on an active interposer base die. On either side of the CPU clusters are a total of four GPU clusters, each consisting of two GPU chiplets on a respective active interposer. Upon each GPU chiplet is a 3D stack of DRAM (e.g., some future generation of JEDEC high-bandwidth memory (HBM) [12]). The DRAM is directly stacked on the GPU chiplets to maximize bandwidth while minimizing memory-related data movement energy and total package footprint. CPU computations tend to be more latency sensitive, and so the central placement of the CPU cores reduces NUMA-like effects by keeping the CPU-to-DRAM distance relatively uniform. The interposers underneath the chiplets provide the interconnection network between the chiplets along with other common system functions. Interposers maintain high-bandwidth connectivity among themselves by utilizing wide, short-distance, point-to-point paths.


The EHP uses eight GPU chiplets. Our initial configuration provisions 32 CUs per chiplet. Each chiplet is projected to provide two teraflops of double-precision computation, for a total of 16 teraflops. The EHP also employs eight CPU chiplets (four cores each), for a total of 32 cores, with greater parallelism through optional simultaneous multi-threading.


Heterogeneous System Architecture (HSA) compatibility is one of the major design goals of the APU. HSA provides a system architecture where all computing elements (CPU, GPU, and possibly other accelerators) share a unified coherent virtual address space. These features are supported by AMD’s Radeon Open Compute platform (ROCm) to improve the programmability of such heterogeneous systems. We have created novel mechanisms like the QuickRelease synchronization mechanism [14] and heterogeneous race free memory models (HRF) [15]–[17] to reduce synchronization overhead between GPU threads, and heterogeneous system coherence (HSC) [18] to transparently manage coherence between CPU and GPU caches.

Original paper: http://www.computermachines.org/joe/publications/pdfs/hpca2017_exascale_apu.pdf
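A quick back-of-the-envelope check of the totals, using only the per-chiplet figures quoted above (a minimal C++ sketch; nothing here beyond simple multiplication of the paper's own numbers):

```cpp
#include <cstdio>

// Per-chiplet figures quoted from the paper; the totals follow by multiplication.
int main() {
    const int    gpu_chiplets      = 8;    // four GPU clusters x two chiplets each
    const int    cus_per_chiplet   = 32;   // initial provisioning per the paper
    const double dp_tflops_each    = 2.0;  // projected FP64 throughput per GPU chiplet

    const int    cpu_chiplets      = 8;    // two CPU clusters x four chiplets each
    const int    cores_per_chiplet = 4;

    std::printf("Total CUs:         %d\n",   gpu_chiplets * cus_per_chiplet);    // 256
    std::printf("Total FP64 TFLOPS: %.1f\n", gpu_chiplets * dp_tflops_each);     // 16.0
    std::printf("Total CPU cores:   %d\n",   cpu_chiplets * cores_per_chiplet);  // 32
    return 0;
}
```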
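And for what the "unified coherent virtual address space" part means on the software side, here's a minimal HIP sketch under ROCm: one managed allocation is written by the CPU, touched by a GPU kernel, and read back by the CPU through the same pointer. This is just an illustration of the HSA-style programming model the paper references (using hipMallocManaged as a stand-in for shared virtual memory), not anything specific to the EHP hardware or its QuickRelease/HSC mechanisms.

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// GPU kernel that scales an array in place.
__global__ void scale(float* data, int n, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;

    // One allocation, one pointer, visible to both CPU and GPU.
    hipMallocManaged(reinterpret_cast<void**>(&data), n * sizeof(float));

    for (int i = 0; i < n; ++i) data[i] = 1.0f;            // CPU writes
    hipLaunchKernelGGL(scale, dim3((n + 255) / 256), dim3(256), 0, 0,
                       data, n, 2.0f);                     // GPU reads/writes the same memory
    hipDeviceSynchronize();

    std::printf("data[0] = %.1f\n", data[0]);              // CPU reads the GPU result: 2.0
    hipFree(data);
    return 0;
}
```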
 
Reactions: Dresdenboy

Doom2pro

Senior member
Apr 2, 2016
587
619
106
Interesting. I wonder how good the thermal conductivity from the GPUs through the HBM stacks to the heat spreader will be... Also, an active interposer: I'd heard that was a possibility with interposers, so I wonder what they're doing with it here? Perhaps something to do with Infinity Fabric?
 
Reactions: Drazick

Mopetar

Diamond Member
Jan 31, 2011
7,831
5,980
136
I think this is where AMD has wanted to go for a long while, but it's been a slow road with a lot of learning. I don't think that this would be something aimed at gaming though. Stacking the memory on top of the GPU limits the cooling unless the memory is designed to facilitate it. This is probably aimed at parts of the market that care more about efficiency than maximum performance.

I'm assuming that this won't see the light of day until we're down to a 7 nm process or something similar; otherwise this would be a truly obscene chip in terms of die size. Even then it would be beastly.
 
Reactions: Doom2pro

Doom2pro

Senior member
Apr 2, 2016
587
619
106
Mopetar said:
I think this is where AMD has wanted to go for a long while, but it's been a slow road with a lot of learning. I don't think that this would be something aimed at gaming though. Stacking the memory on top of the GPU limits the cooling unless the memory is designed to facilitate it. This is probably aimed at parts of the market that care more about efficiency than maximum performance.

I'm assuming that this won't see the light of day until we're down to a 7 nm process or something similar; otherwise this would be a truly obscene chip in terms of die size. Even then it would be beastly.

Think about how this would have been done before: one massive die, something like 30-40 dies per wafer with only 10 good ones. Not quite as obscene as some huge, uber-expensive CCD sensors (one good die per 5 wafers!), but close.

Now, with this multi-chip module approach, an active interposer, and stacked HBM, they can get yields to sane levels (albeit with packaging yield taking its own hit) and make a good profit off them.
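To put some toy numbers on that, here's a rough sketch using the classic Poisson yield model (yield ≈ e^(−D·A)). The defect density and die areas below are made-up illustrative values, not anything from AMD or a real process; the point is just that eight small chiplets yield far better than one monolithic die of the same total area.

```cpp
#include <cmath>
#include <cstdio>

// Toy Poisson yield model: yield = exp(-defect_density * die_area).
// Defect density and areas are illustrative assumptions, not real process data.
double yield(double defects_per_cm2, double area_cm2) {
    return std::exp(-defects_per_cm2 * area_cm2);
}

int main() {
    const double d0              = 0.2;  // defects per cm^2 (assumed)
    const double monolithic_area = 8.0;  // cm^2: one huge die (assumed)
    const double chiplet_area    = 1.0;  // cm^2: one of eight GPU chiplets (assumed)

    std::printf("Monolithic die yield: %.0f%%\n", 100.0 * yield(d0, monolithic_area)); // ~20%
    std::printf("Single chiplet yield: %.0f%%\n", 100.0 * yield(d0, chiplet_area));    // ~82%
    // Known-good chiplets can be tested and binned before assembly, so the
    // assembled package's yield is dominated by packaging/bonding losses,
    // not by the area of any single die.
    return 0;
}
```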
 
Reactions: Drazick

Mopetar

Diamond Member
Jan 31, 2011
7,831
5,980
136
No doubt the yields on the individual components will be good; I'm just saying that the whole thing is massive. Fury has 64 CUs, so this single chip would be four Furys strapped to a 32-core CPU. Going by their calculations, we can conclude the clock speed for the GPU must be fairly low, as a Fury has ~8 TFLOPS by itself.

The silly part of it all is that, given enough time, there will be a more powerful chip in something that gets sold as a child's toy.
 
Reactions: lightmanek

lolfail9001

Golden Member
Sep 9, 2016
1,056
353
96
That thing sounds like a composition of quite a few MCMs, actually. Namely, two Snowy Owls and a few Greenlands, or something.
Mopetar said:
Going by their calculations, we can conclude the clock speed for the GPU must be fairly low, as a Fury has ~8 TFLOPS by itself.
They're talking about double-precision FLOPS; this thing will have about the same clocks as a Fury X.
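Rough math on that: if you assume a GCN-style CU (64 lanes, FMA counted as 2 FLOPs per cycle) and a 1:2 double-precision rate (my assumption; the paper only gives the 2 TFLOPS-per-chiplet figure), the implied clock comes out right around Fury X territory.

```cpp
#include <cstdio>

int main() {
    // Per the paper: 32 CUs per GPU chiplet, 2 TFLOPS of double precision each.
    const double cus              = 32;
    const double lanes_per_cu     = 64;    // GCN-style CU (assumed)
    const double flops_per_lane   = 2;     // fused multiply-add = 2 FLOPs/cycle
    const double fp64_rate        = 0.5;   // 1:2 double-precision rate (assumed)
    const double dp_tflops_target = 2.0;

    const double fp64_per_cycle = cus * lanes_per_cu * flops_per_lane * fp64_rate;  // 2048
    const double clock_ghz      = dp_tflops_target * 1e12 / fp64_per_cycle / 1e9;

    std::printf("Implied GPU clock: %.2f GHz\n", clock_ghz);  // ~0.98 GHz, i.e. Fury X territory
    return 0;
}
```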
 

Valantar

Golden Member
Aug 26, 2014
1,792
508
136
Doom2pro said:
Interesting. I wonder how good the thermal conductivity from the GPUs through the HBM stacks to the heat spreader will be... Also, an active interposer: I'd heard that was a possibility with interposers, so I wonder what they're doing with it here? Perhaps something to do with Infinity Fabric?
That was my initial reaction too. It's well established that mobile SoCs suffer due to stacked RAM (which is why iPads perform so well, as they have off-package RAM), and those are ~2W SoCs. On what must be a 100W+ chip, either that HBM would run extremely hot, or the GPU would throttle. Neither sounds good to me.

Other than that, though, this sounds awesome.