
WCCF: AMD Carrizo APU on the 28nm Node Will Have Stacked DRAM On Package

HBM gets its bandwidth from its large number of dies, not unlike NAND. If you're going to use only a small amount of it like a cache, you're better off with eDRAM or SRAM like Intel has done.
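For rough numbers, here's a back-of-the-envelope sketch of where that bandwidth comes from, assuming first-generation HBM figures (a 1024-bit interface per stack at roughly 1 Gb/s per pin); the values are illustrative, not from an AMD spec sheet:

Code:
def stack_bandwidth_gbs(bus_width_bits=1024, pin_rate_gbps=1.0):
    # Peak bandwidth of one HBM stack in GB/s: width (bits) * per-pin rate (Gb/s) / 8
    return bus_width_bits * pin_rate_gbps / 8

per_stack = stack_bandwidth_gbs()            # 1024 * 1.0 / 8 = 128 GB/s
for stacks in (1, 2, 4):
    print(f"{stacks} stack(s): {stacks * per_stack:.0f} GB/s")

The wide, slow interface spread across many dies is what produces the aggregate number, not high per-pin speed.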

It is 100% true that AMD will only use HBM for L3.
[...]
There is nothing from AMD implying or stating that Stacked DRAM will replace System RAM. Thus, HBM is stuck as an L3 cache, no ifs or buts.

Aren't those two quotes contradictory? Or are we talking about a huge L3 cache at 1+ GB or so? 😕
 
Even if we (slightly strangely, I suspect 🙂) accept that the iGPU in the APU is all about compute, you can bet that AMD would love to produce a version that delivered decent gaming performance.

With the GDDR5 option gone now, the only way to do that is stacked memory, so if they can halfway reasonably push it out on a halo part they'll do it. Intel don't exactly give away their Iris Pro chips, so there's quite a lot of scope price-wise.
 
Even if we (slightly strangely, I suspect 🙂) accept that the iGPU in the APU is all about compute, you can bet that AMD would love to produce a version that delivered decent gaming performance.

With the GDDR5 option gone now, the only way to do that is stacked memory, so if they can halfway reasonably push it out on a halo part they'll do it. Intel don't exactly give away their Iris Pro chips, so there's quite a lot of scope price-wise.

Well, remember as well that AMD wears two hats. While their APU sales are disastrous, I bet they still reckon that a faster APU means fewer dGPU sales. And if they accelerate performance "too fast", they may trigger Intel to up its game too. Again, fewer dGPU sales.
 
While their APU sales are disastrous.

Any evidence for this? I was under the impression that the Jaguar-based APUs (Kabini in craptops, and the console APUs) were propping up the company right now. But the big-core APUs certainly don't seem to have set the market alight.
 
Any evidence for this? I was under the impression that the Jaguar-based APUs (Kabini in craptops, and the console APUs) were propping up the company right now. But the big-core APUs certainly don't seem to have set the market alight.

They are down 20% YoY and they are losing market share to Intel. I wouldn't call that propping up. The only thing propping up AMD for now is the console contract, and even that isn't hugely relevant.
 
They are down 20% YoY and they are losing market share to Intel. I wouldn't call that propping up. The only thing propping up AMD for now is the console contract, and even that isn't hugely relevant.

Do you know the volume of APUs sold in Q2 2014 and last year? 🙄
 
Do you know the volume of APUs sold in Q2 2014 and last year? 🙄

Are you saying Kaveri is a complete flop and that AMD more or less only sells Kabini? Because otherwise you have to agree with him.

You guys are not having this conversation in here. This thread is about the technical details of the next APU, not the business of AMD's current APUs.
-ViRGE
 
They are down 20% YoY and they are losing market share to Intel. I wouldn't call that propping up. The only thing propping up AMD for now is the console contract, and even that isn't hugely relevant.

The "console contract" is made up of APUs...
 
I wonder if Nintendo is working with AMD on their new console that will compete with (crush) the PS4 and Xbone. 20nm, 1024 GCN 2.0 shaders + 3-module Carrizo + HBM.
It could be beneficial to both of them.
How does HBM cost compare to GDDR5?
 
I wonder if Nintendo is working with AMD on their new console that will compete with (crush) the PS4 and Xbone. 20nm, 1024 GCN 2.0 shaders + 3-module Carrizo + HBM.
It could be beneficial to both of them.
How does HBM cost compare to GDDR5?

Yeah, that doesn't sound like Nintendo.
 
Have a look at the HD 7770 and see how much faster it is compared to the HD 7750; both of them have the same memory at the same frequency.


Also, running Luxmark 2.0 in GPU mode doesn't use the CPU at all, so it doesn't matter what CPU you use in the system because the benchmark will only use the GPU. Only if you run in CPU + GPU mode does the final score depend on the CPU as well.

You're still using a dGPU on the PCI-e bus hosted by an Intel processor to try to prove a point about AMD APU compute performance. It could be that there is a memory bandwidth threshold beyond which superior bandwidth is irrelevant (for compute functions anyway), and both the 7750 and 7770 have local memory exceeding that threshold (while that would most certainly not be the case for an APU with DDR3-1600 + weak timings). The 7730 results might indicate that we are seeing that memory threshold in effect. Again, too many uncontrolled factors.

You have a 7850K that you've been testing recently, correct? Why not just run a few simple Luxmark 2.0 tests with it? You'd be benching the machine against itself, so setting up a control would be extremely easy.
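As an aside on the GPU-mode point above, here is a minimal sketch (using Python with pyopencl purely for illustration; it is not how Luxmark itself is written) of picking GPU-only versus CPU + GPU OpenCL devices:

Code:
import pyopencl as cl

def pick_devices(include_cpu=False):
    # "GPU mode" = GPU devices only; "CPU + GPU mode" also admits CPU devices.
    wanted = cl.device_type.GPU | (cl.device_type.CPU if include_cpu else 0)
    devices = []
    for platform in cl.get_platforms():
        try:
            devices.extend(platform.get_devices(device_type=wanted))
        except cl.Error:
            pass  # this platform has no matching devices
    return devices

print([d.name for d in pick_devices(include_cpu=False)])

In GPU-only mode the host CPU only feeds the queue, which is why the CPU choice barely shows up in the score.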
 
Aren't those two quotes contradictory? Or are we talking about a huge L3 cache at 1+ GB or so? 😕
A huge L3 cache which operates as the nearest memory.

HBM => 1 GB, 2 GB, 4 GB
DDR3 => 4 GB to 64 GB

HBM2 => 2 GB, 4 GB, 8 GB, 16 GB, 32 GB
DDR4 => 8 GB to 256 GB

As flash/non-volatile memory becomes faster and bigger, it will replace system RAM. Fusing "hard drive/SSD storage" and "system RAM" is the end goal for NVM Express.
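Rough math behind those capacity ranges, assuming the commonly quoted die densities and stack heights (2 Gbit dies, 4-high for first-gen HBM; 8 Gbit dies, up to 8-high for HBM2); treat these as illustrative figures:

Code:
def package_capacity_gb(stacks, dies_per_stack, die_density_gbit):
    # total capacity on the interposer, in GB (Gbit / 8)
    return stacks * dies_per_stack * die_density_gbit / 8

print(package_capacity_gb(1, 4, 2))   # first-gen HBM, one 4-high stack of 2 Gbit dies -> 1.0 GB
print(package_capacity_gb(4, 4, 2))   # four stacks -> 4.0 GB
print(package_capacity_gb(4, 8, 8))   # HBM2-style, four 8-high stacks of 8 Gbit dies -> 32.0 GB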
 
You're still using a dGPU on the PCI-e bus hosted by an Intel processor to try to prove a point about AMD APU compute performance.

Nope, the HD 7770 vs HD 7750 comparison was meant to show you that memory bandwidth is not that important to the Luxmark GPU score; compute units affect the score far more than memory does.

You have a 7850K that you've been testing recently, correct? Why not just run a few simple Luxmark 2.0 tests with it? You'd be benching the machine against itself, so setting up a control would be extremely easy.

I had the 7700K, but I will have the 7850K in a few days and I will test it in Luxmark and more.
 
Last warning. Stick to a tech discussion of Carrizo or you will receive a vacation. I do not want to see any more of this unending business arguing in this thread.

And that goes for everyone in this thread.

-ViRGE
 
Compute depends on latency, graphics on bandwidth. There is a benchmark where they test Kaveri's memory on graphics-related tasks, and they come to the conclusion that increasing memory clocks, i.e. more bandwidth, is better (sorry for not putting links, I will search later). HBM working as an L3 benefits both, but graphics more (working with textures); as for HBM working as extra memory, well, just ask Xbox One developers: it breaks HSA.
 
If HBM is used, it will be the L3 cache. While the stacked memory will have latency similar to system memory, it will not have the issues of previous L3 caches from AMD.

Orochi (AMD), for example, had pathetic bandwidth and a very small size, which made its latency awful. HBM as the L3 will provide up to 128 GB/s and 1 gigabyte of memory.

[Image: AIDA64 memory benchmark results]

Haswell in comparison: http://i1365.photobucket.com/albums...C10_AIDA64_25133Copy_zps9802a583.png~original

SRAM @ 8 MB - 128 GB/s Read - 40 GB/s Write - 50 GB/s Copy / 50% of the die (Interface + SRAM)
vs
Stacked DRAM @ 1 GB - 128 GB/s peak throughput / ~5?% of the die (Interface)
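A toy average-memory-access-time comparison of the two, with made-up hit rates and latencies (nothing measured); the only point is that a 1 GB cache can win on hit rate even if each individual hit is slower than SRAM:

Code:
def amat(hit_rate, hit_ns, dram_ns):
    # average memory access time: hits cost hit_ns, misses cost hit_ns + dram_ns
    return hit_rate * hit_ns + (1 - hit_rate) * (hit_ns + dram_ns)

print(amat(hit_rate=0.50, hit_ns=10, dram_ns=80))   # small 8 MB SRAM L3  -> 50.0 ns
print(amat(hit_rate=0.95, hit_ns=40, dram_ns=80))   # big 1 GB stacked L3 -> 44.0 ns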



I recall AMD saying that they found the problem behind the L3 latency issues in BD and PD. They didn't have the time, or maybe the desire, to fix it for SR. I cannot recall if they said they would fix it for EX, but I would assume that if the issue was identified, why not fix it for EX if EX has an L3?

Of course, if there is no L3 and only HBM, then the L3 latency bug issue may be moot.

I think an L3 will be present. It boils down to how successful putting HBM on a chip is and how it affects yields. AMD tends to be conservative; didn't they have two memory controllers for SR? I hope they have HBM and EX will be far superior to SR, but I doubt it.
 
I recall AMD saying that they found the problem behind the L3 latency issues in BD and PD.
http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture/2 said:
According to AMD, they’ve isolated the reason for the unusually high L3 latency in the Bulldozer architecture, however fixing it isn’t a top priority.
I think an L3 will be present. It boils down to how successful putting HBM on a chip is and how it affects yields. AMD tends to be conservative; didn't they have two memory controllers for SR? I hope they have HBM and EX will be far superior to SR, but I doubt it.
HBM and the Si interposer are testable before they become part of a SKU.

HBM package specification; Known Good Stacked Die

HBM and the Si interposer do not affect the yield of the product. Everything is tested beforehand to assure good yield. This comes at an increased cost and thus an increased ASP.
 
HBM and the Si interposer do not affect the yield of the product.
Even if the stack itself works, the yield of assembling the host die and the stack onto the interposer is still affected by the maturity of the TSV process. If that yield is not so good, it already affects the yield of the final product.

LOL.
 
Compute depends on latency, graphics on bandwidth.
Compute is a generic term, and you have tons of different types of workloads in the real world, so you really shouldn't generalize things this way, particularly when we are talking about exploiting DATA PARALLELISM. Graphics is just compute, by the way, so some stages of the graphics pipeline, or maybe even the shaders, may want lower latencies too. Atomic operations and the RMW operations of render backends are great examples.

HBM working as an L3 benefits both, but graphics more (working with textures); as for HBM working as extra memory, well, just ask Xbox One developers: it breaks HSA.
No, it won't. Okay, don't bring those hUMA slides to me; read the HSA Platform System Architecture Specification 1.0 Provisional instead. Does it mention even a single word of hUMA, the marketing hype of Kaveri? No, but it tells you "yeah, you need these features of hUMA to be HSA compliant". So basically the way people interpreted hUMA by overlooking the last three letters can now be thrown away. Oh, by the way, Carrizo and those Project Skybridge APUs should be the first wave of full HSA platforms. Sorry, but Kaveri is not on the list*. Yep.

What else does it tell you? Discrete HSA devices with component local memory! Multi-node, multi-device topology discovery! So now, if a discrete GPU can be supported and covered by the spec, why would conceptually integrating a discrete GPU with its own pool of memory into a host processor suddenly break HSA?

Hmm. Not to mention that HSA (and the higher-level OpenCL 2 of the HSA software stack) would already be "broken" in your sense, with the holy group memory, the merciful image memory and the virtuous private memory segments...

P.S. Doesn't working as an L3 cache contradict the "graphics needs bandwidth" claim, by the way? You are burning bandwidth on cache management, while graphics is fine operating in a smaller chunk of memory (and 1 GB is not too small for a GPU). Making it a dedicated pool guarantees never a single cache miss and the full bandwidth to the DRAM!
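A tiny accounting sketch of that P.S. point, with assumed miss rates and dirty-line fraction (nothing measured):

Code:
HBM_BW = 128.0  # GB/s peak of the stacked DRAM

def useful_bandwidth(miss_rate, dirty_fraction=0.3):
    # each miss adds a line fill into HBM, and a dirty victim adds a writeback,
    # so HBM traffic per demand access grows to 1 + miss_rate * (1 + dirty_fraction)
    return HBM_BW / (1 + miss_rate * (1 + dirty_fraction))

for mr in (0.0, 0.05, 0.20):
    print(f"miss rate {mr:.0%}: ~{useful_bandwidth(mr):.0f} GB/s left for the GPU")

As a dedicated pool there is no cache-management traffic, so the full 128 GB/s stays available.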

* Carrizo supports hard preemption of wavefronts, which is a requirement of the Full Profile of the HSA platform spec. So there was a reason why Kaveri is just marketed as... "first to support HSA features".
 
Even if the stack itself works, the yield of assembling the host die and the stack onto the interposer is still affected by the maturity of the TSV process. If that yield is not so good, it already affects the yield of the final product.

LOL.
TSVs aren't used to connect the logic die or the memory stack to the Si interposer. The worst outcome, based on validated research, is a 5% loss in yield, and that is without KGS and KGD, or KGSD as a whole. With these safeguards the loss in yield is none.
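For the yield argument, the compounding is simple enough to sketch; the 95% assembly figure is the worst-case 5% loss quoted above, the other numbers are placeholders:

Code:
def final_yield(host_die, memory_stack, assembly):
    # independent process steps multiply into the final product yield
    return host_die * memory_stack * assembly

print(f"{final_yield(0.80, 0.90, 0.95):.1%}")  # untested parts: every term compounds (~68%)
print(f"{final_yield(1.00, 1.00, 0.95):.1%}")  # KGD/KGSD parts, 5% assembly loss (95%)
print(f"{final_yield(1.00, 1.00, 1.00):.1%}")  # lossless assembly, as argued above (100%)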

Also, the L3 is hUMA through the cache-coherent interconnect that all Excavator and Volcanic Islands SKUs will have.
 
WRT iGPU performance, no. The 7850K has been shown to be severely lacking in memory bandwidth when it comes to iGPU performance. If you're running something like DDR3-1600 with "normal" timings, you can run up your iGPU speed without seeing any appreciable performance increase (while overclocking the RAM does produce positive results). It seems fairly obvious that the iGPU is starved for bandwidth.

On that 512-stream-processor iGPU with DDR3-1600, I have been wondering if certain settings are harder on memory bandwidth than on GPU processing power.

Here is a thread I made last year in an attempt to optimize the GPU setting when low memory bandwidth is present:

http://forums.anandtech.com/showthread.php?t=2352752&highlight=

Any opinions on the best way to set up graphics until we get something better from AMD?
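For context on the bandwidth-starvation point quoted above, the usual theoretical-peak math for dual-channel DDR3 (real sustained figures are lower):

Code:
def ddr_peak_gbs(transfer_rate_mts, bus_width_bits=64, channels=2):
    # MT/s * bytes per transfer * channels, converted from MB/s to GB/s
    return transfer_rate_mts * (bus_width_bits / 8) * channels / 1000

print(ddr_peak_gbs(1600))   # dual-channel DDR3-1600 -> 25.6 GB/s
print(ddr_peak_gbs(2400))   # dual-channel DDR3-2400 -> 38.4 GB/s

Against the 128 GB/s figure discussed for a single HBM stack, it's easy to see why the 512-shader iGPU scales with memory overclocks.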
 