[WCCF] AMD Carrizo APU on the 28nm Node Will Have Stacked DRAM On Package

Ajay · Jul 16, 2014

SAAA said:
Saving space for better/wider cores? Besides, Haswell and co. (since Nehalem, actually) have just 256KB, but that's good enough... just reduce the latency and everything's OK.

Yeah, but Intel 'Core' tech has L3$ (and LLC on some SKUs). Make cores wider and cut cache - seems like an odd choice.

Exophase · Jul 16, 2014

IF the cache reduction lets them cut latency by a significant amount (like say, 4-5 cycles?) it'll almost definitely be worth it. That's a big if, since the configuration of Bulldozer with half cache only reduced L2 latency by one cycle. But it's possible they could realize greater gains with a proper redesign.

SAAA · Jul 16, 2014

ShintaiDK said:
L3 being the key.

Actually, cutting it completely wasn't so bad for all the APUs, but it makes one think when you see a Piledriver die shot and imagine it with 50-100% more cores... Was it really necessary for server applications? I guess today (back then it was a dream) a nice solution would be HBM as an L3 equivalent to reduce main memory accesses.

ShintaiDK · Jul 16, 2014

SAAA said:
Actually, cutting it completely wasn't so bad for all the APUs, but it makes one think when you see a Piledriver die shot and imagine it with 50-100% more cores... Was it really necessary for server applications? I guess today (back then it was a dream) a nice solution would be HBM as an L3 equivalent to reduce main memory accesses.

It didn't help the APUs. And now the L2 is cut in half, while the memory speed is the same.

And there is no indication of HBM anywhere for APUs.

Homeles · Jul 16, 2014

ShintaiDK said:
It didn't help the APUs. And now the L2 is cut in half, while the memory speed is the same.

And there is no indication of HBM anywhere for APUs.

There's been plenty of images floating around. It's not a matter of "if;" it's a matter of when.

pTmdfx · Jul 16, 2014

Homeles said:
I just hadn't heard anything about hardware changes... all the coverage was on the software side, which honestly is by far the most important bit.

That's because MS hasn't disclosed anything about the new hardware features yet, apart from a sneak peek at ordered UAVs and conservative rasterization.

pTmdfx · Jul 16, 2014

ShintaiDK said:
L3 being the key.

Or there is no L3 perhaps? Propus with the K10 cores lived without L3 Cache back in 2009. Latency of the L2 cache matters too.

AtenRa · Jul 16, 2014

Exophase said:
IF the cache reduction lets them cut latency by a significant amount (like say, 4-5 cycles?) it'll almost definitely be worth it. That's a big if, since the configuration of Bulldozer with half cache only reduced L2 latency by one cycle. But it's possible they could realize greater gains with a proper redesign.

Which CPU was that ??

pTmdfx · Jul 16, 2014

AtenRa said:
Which CPU was that ??

I'm not sure what his source is, but the Bulldozer IEEE paper mentioned an 18-cycle load-use latency for the 1MB L2 version (physically 1MB, I assume) of the Bulldozer module.

NostaSeronx · Jul 16, 2014

Checking the parts of the Carrizo slide that relate to the GPU rather than the core:
VCE1/2 = 3x 1080p30
VCE3 = 9x 1080p30

UVD4 = 4x-8x 1080p30
-skip to 6-
UVD6 = 9x-18x 1080p30

ACP1; HiFi EP Audio DSP
ACP2?; HiFi 3 Audio DSP

There are already some errors as well. The FP4 socket is mostly used for the 16h family now. Anything in that socket would be single channel.

Project Discovery (FT3b);
http://browser.primatelabs.com/geekbench3/574913
http://browser.primatelabs.com/geekbench3/574933

Project Gardenia (FP4);
http://browser.primatelabs.com/geekbench3/626430
http://browser.primatelabs.com/geekbench3/626420

There is also the issue of the Carrizo slide using the old style even though they have since switched to a newer one.
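
For a sense of scale, the "N x 1080p30" multipliers listed above can be turned into raw pixel rates. This is a minimal sketch that only does the arithmetic; the multipliers come from the post, while the names `pixels_per_second` and `claims` are made up for illustration, and nothing here says whether the slide itself is accurate.

```python
# Pixel throughput implied by "N x 1080p30" figures (plain arithmetic only).
WIDTH, HEIGHT, FPS = 1920, 1080, 30


def pixels_per_second(streams: int) -> int:
    """Pixels per second for `streams` simultaneous 1080p30 streams."""
    return streams * WIDTH * HEIGHT * FPS


# Stream multipliers as quoted in the post; the labels are illustrative.
claims = {
    "VCE1/2": 3,
    "VCE3": 9,
    "UVD4 (low)": 4,
    "UVD4 (high)": 8,
    "UVD6 (low)": 9,
    "UVD6 (high)": 18,
}

for block, streams in claims.items():
    print(f"{block}: {pixels_per_second(streams) / 1e6:.1f} Mpixel/s")
```

One 1080p30 stream works out to 62,208,000 pixels per second, so every figure in the list is just a multiple of that.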

monstercameron · Jul 16, 2014

NostaSeronx said:
Checking the parts of the Carrizo slide that relate to the GPU rather than the core:
VCE1/2 = 3x 1080p30
VCE3 = 9x 1080p30

UVD4 = 4x-8x 1080p30
-skip to 6-
UVD6 = 9x-18x 1080p30

ACP1; HiFi EP Audio DSP
ACP2?; HiFi 3 Audio DSP

There are already some errors as well. The FP4 socket is mostly used for the 16h family now. Anything in that socket would be single channel.

Project Discovery (FT3b);
http://browser.primatelabs.com/geekbench3/574913
http://browser.primatelabs.com/geekbench3/574933

Project Gardenia (FP4);
http://browser.primatelabs.com/geekbench3/626430
http://browser.primatelabs.com/geekbench3/626420

There is also the issue of the Carrizo slide using the old style even though they have since switched to a newer one.

yep seems fake or old...

Exophase · Jul 16, 2014

AtenRa said:
Which CPU was that ??

Not a released CPU, but a configuration option mentioned in the software optimization guide.

AtenRa · Jul 16, 2014

Exophase said:
Not a released CPU, but a configuration option mentioned in the software optimization guide.

Yea thx,

Well, that could be for a fused-off part, not a real half-L2 module.

Homeles · Jul 16, 2014

AtenRa said:
Yea thx,

Well, that could be for a fused-off part, not a real half-L2 module.

Undoubtedly this is the case. Steamroller's L2 latency is down to 19 clocks now (Piledriver, 20; Bulldozer, 21). Phenom II's was as low as 14 on Thuban... so it should be somewhere in between for 1MB, and probably closer to the Phenom side of things.

NostaSeronx · Jul 16, 2014

Here is the image for those not wanting to click the VR-Zone link.

Exophase · Jul 16, 2014

Homeles said:
Undoubtedly this is the case. Steamroller's L2 latency is down to 19 clocks now (Piledriver, 20; Bulldozer, 21). Phenom II's was as low as 14 on Thuban... so it should be somewhere in between for 1MB, and probably closer to the Phenom side of things.

Interestingly, Llano had 1MB L2 caches per core (and 16-way set associative), and maintained a 15 cycle latency like its predecessors. But there are still some key differences. Being shared by two cores/interfacing two L1 dcaches probably increases latency, at the very least due to physical requirements. BD/PD were likely penalized further by their higher clock speed requirements. Maybe with XV AMD will give up even more clock headroom so they can tighten down latencies.

Incidentally, Llano didn't have a huge performance improvement over Athlon II, and this is including several minor improvements beyond the doubling of L2 cache. So I'm skeptical that 2MB/module is really much of a performance requirement for BD (1MB/module should be more flexible than 512KB/core anyway)
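
The trade-off being argued here (a smaller L2 misses more often but may answer faster) can be put into a rough average-access-cost model. This is a minimal sketch, not a measurement: the 19-cycle figure is the Steamroller number quoted earlier in the thread, while the 15-cycle latency for a 1MB L2, the miss rates, and the 200-cycle memory penalty are all hypothetical values chosen only to show the arithmetic.

```python
def l2_access_cost(l2_latency_cycles: float,
                   l2_miss_rate: float,
                   memory_penalty_cycles: float) -> float:
    """Average cycles per L2 access: hit latency plus the expected miss penalty."""
    return l2_latency_cycles + l2_miss_rate * memory_penalty_cycles


def break_even_miss_increase(latency_saved_cycles: float,
                             memory_penalty_cycles: float) -> float:
    """Extra L2 miss rate that exactly cancels a given L2 latency saving."""
    return latency_saved_cycles / memory_penalty_cycles


MEMORY_PENALTY = 200.0   # hypothetical cycles to reach DRAM
BIG_L2_LATENCY = 19.0    # Steamroller figure quoted in the thread
SMALL_L2_LATENCY = 15.0  # hypothetical latency for a redesigned 1MB L2

big = l2_access_cost(BIG_L2_LATENCY, 0.10, MEMORY_PENALTY)      # hypothetical 10% miss rate
small = l2_access_cost(SMALL_L2_LATENCY, 0.11, MEMORY_PENALTY)  # hypothetical 11% miss rate
limit = break_even_miss_increase(BIG_L2_LATENCY - SMALL_L2_LATENCY, MEMORY_PENALTY)

print(f"2MB L2: {big:.1f} cycles, 1MB L2: {small:.1f} cycles")
print(f"break-even miss-rate increase: {limit:.3f}")
```

Under these made-up numbers, halving the cache only pays off if the miss rate rises by less than the break-even value (latency saved divided by memory penalty), which is why the size of the latency cut matters so much to the argument.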

NostaSeronx · Jul 16, 2014

It just seems weird to me that Excavator is essentially Puma+. 4 cores, 2MB L2 total cache, 3rd Gen GCN, VCE3, UVD6, Integrated FCH, PSP. Just happens to be on the same platform as the revised Puma(Mullins) platform for Tablets/Convertibles.

There is no Kaveri APU at the 15-watt designation, either. So it is clearly pointing at the Beema SKUs, not the Kaveri SKUs, which are only 19 watts and up.

30% more performance than Beema at the same TDP is not bad. It does seem to fall in line with the Trinity to Richland setup, happening to be just a respin of the Kaveri design on 28 nm.

Blitzvogel · Jul 16, 2014

NostaSeronx said:
It just seems weird to me that Excavator is essentially Puma+. 4 cores, 2MB L2 total cache, 3rd Gen GCN, VCE3, UVD6, Integrated FCH, PSP. Just happens to be on the same platform as the revised Puma(Mullins) platform for Tablets/Convertibles.

It's still a Bulldozer derived CMT architecture.

NostaSeronx · Jul 16, 2014

^ Essentially, with the changes, this is what the die would look like after the respin, ± iFCH. I was also considering cutting out half the GPU since it is 16 compute units. I'm pretty sure you guys can cut that out for yourselves.

It will be really weird if Kaveri gets a second stepping and becomes a FX APU.

inf64 · Jul 16, 2014

The Excavator core is not an SR respin and Carrizo will not be a Kaveri variant. It is unlikely to be noticeably faster (via clock, IPC, or a combination of the two), but calling it a Kaveri respin is just wrong.

NostaSeronx · Jul 16, 2014

inf64 said:
The Excavator core is not an SR respin and Carrizo will not be a Kaveri variant. It is unlikely to be noticeably faster (via clock, IPC, or a combination of the two), but calling it a Kaveri respin is just wrong.

I'm just thinking how aggressive AMD can be with this split.

40h-4Fh(FX CPU) : 16 Excavator Cores, No GPU, 8MB L2, 8 MB L3, 256-bit DDR3/DDR4. >65W
50h-5Fh(FX APU) : 6 Excavator Cores, 16 3rd gen GCN CUs, 3 MB L2, 6 MB L2+L3, 256-bit DDR3/DDR4. >45W
60h-6Fh(A(x) APU) : 4 Excavator Cores, 8 3rd gen GCN CUs, 2 MB L2, 128-bit DDR3/DDR4. <35W

This implies that these are all launching very late this year, with 20-nm node drop-downs planned for later next year.

http://www.linkedin.com/pub/ramya-gandamaneni/16/485/aa8

Designs whose macros port quickly, like GCN and 16h, can jump to the 20-nm node first. Designs with a large macro count and slow porting speed, like 15h, port later. A small L2 and L3 make better use of the space on 28 nm; the large L2 and large L3 can then come back with the shrink.

This would technically also allow AMD to do what Intel did with Sandy Bridge/-E and Ivy Bridge/-E, while doing it with PCIe 3.0 (/HyperTransport 8 Gb/s) and PCIe 4.0 (/HyperTransport 16 Gb/s).
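
To put the 128-bit and 256-bit memory interfaces in the split above into numbers, peak theoretical bandwidth is just bus width in bytes times the transfer rate. This is a minimal sketch; the DDR3-2133 transfer rate used below is an assumption picked for illustration, since the post does not state memory speeds, and `peak_bandwidth_gb_s` is a made-up helper name.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfers_per_second: float) -> float:
    """Peak theoretical bandwidth in GB/s: bytes per transfer times transfers per second."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * transfers_per_second / 1e9


# Assumption: nominal DDR3-2133 rate (2133 MT/s); not stated anywhere in the thread.
ASSUMED_RATE = 2133e6

for width in (128, 256):
    print(f"{width}-bit @ 2133 MT/s: {peak_bandwidth_gb_s(width, ASSUMED_RATE):.1f} GB/s")
```

Whatever the actual memory speed turns out to be, the 256-bit configurations would have twice the peak bandwidth of the 128-bit one at the same transfer rate.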

Shivansps · Jul 16, 2014

NostaSeronx said:
Checking the parts of the Carrizo slide that relate to the GPU rather than the core:
VCE1/2 = 3x 1080p30
VCE3 = 9x 1080p30

UVD4 = 4x-8x 1080p30
-skip to 6-
UVD6 = 9x-18x 1080p30

ACP1; HiFi EP Audio DSP
ACP2?; HiFi 3 Audio DSP

There are already some errors as well. The FP4 socket is mostly used for the 16h family now. Anything in that socket would be single channel.

Project Discovery (FT3b);
http://browser.primatelabs.com/geekbench3/574913
http://browser.primatelabs.com/geekbench3/574933

Project Gardenia (FP4);
http://browser.primatelabs.com/geekbench3/626430
http://browser.primatelabs.com/geekbench3/626420

There is also the issue of the Carrizo slide using the old style even though they have since switched to a newer one.

Or maybe it's more like a Haswell-Y...

yulgrhet · Jul 16, 2014

The second section of the slide says:

"Full HSA:Hi Perf Bus for GFX & DRAM, Fine-grain Preemption for Context Switches"

Doesn't that mean DRAM can be added to the package? No DRAM on mobile, a DRAM module added on desktop?

Does "Fine-grain Preemption for Context Switches" mean the dGPU is additive to the iGPU depending on load?

If both are true, desktop Carrizo would be a badass chip, no?

pTmdfx · Jul 16, 2014

yulgrhet said:
Doesn't that mean DRAM can be added to the package? No DRAM on mobile, a DRAM module added on desktop?

I cannot see such an implication there.

yulgrhet said:
Fine-grained Preemption for Context Switches

It is about the fine-grained multitasking ability of the integrated GPU.

NostaSeronx · Jul 16, 2014
