
WCCF: AMD Carrizo APU on the 28nm Node Will Have Stacked DRAM On Package

HBM gets its bandwidth from its large number of dies, not unlike NAND. If you're going to use only a small amount of it like a cache, you're better off with eDRAM or SRAM like Intel has done.
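For rough numbers, here's a back-of-the-envelope sketch of where that bandwidth comes from, assuming first-generation HBM figures (a 1024-bit interface per stack at roughly 1 Gb/s per pin); the values are illustrative, not from an AMD spec sheet:

Code:
def stack_bandwidth_gbs(bus_width_bits=1024, pin_rate_gbps=1.0):
    # Peak bandwidth of one HBM stack in GB/s: width (bits) * per-pin rate (Gb/s) / 8
    return bus_width_bits * pin_rate_gbps / 8

per_stack = stack_bandwidth_gbs()            # 1024 * 1.0 / 8 = 128 GB/s
for stacks in (1, 2, 4):
    print(f"{stacks} stack(s): {stacks * per_stack:.0f} GB/s")

The wide, slow interface spread across many dies is what produces the aggregate number, not high per-pin speed.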

It is 100% true that AMD will only use HBM for L3.
[...]
There is nothing from AMD implying or stating that Stacked DRAM will replace System RAM. Thus, HBM is stuck as an L3 cache, no ifs or buts.

Aren't those two quotes contradictory? Or are we talking about a huge L3 cache at 1+ GB or so? 😕
 
Even if we (slightly strangely, I suspect 🙂) accept that the iGPU in the APU is all about compute, you can bet that AMD would love to produce a version that delivered decent gaming performance.

With the GDDR5 option gone now, the only way to do that is stacked memory, so if they can halfway reasonably push it out on a halo part they'll do it. Intel don't exactly give away their Iris Pro chips, so there's quite a lot of scope price-wise.
 
Even if we (slightly strangely, I suspect 🙂) accept that the iGPU in the APU is all about compute, you can bet that AMD would love to produce a version that delivered decent gaming performance.

With the GDDR5 option gone now, the only way to do that is stacked memory, so if they can halfway reasonably push it out on a halo part they'll do it. Intel don't exactly give away their Iris Pro chips, so there's quite a lot of scope price-wise.

Well, remember as well that AMD wears two hats. While their APU sales are disastrous, I bet they still reckon that a faster APU means fewer dGPU sales. And if they accelerate performance "too fast", they may trigger Intel to up its game too. Again, fewer dGPU sales.
 
While their APU sales are disastrous.

Any evidence for this? I was under the impression that the Jaguar-based APUs (Kabini in craptops, and the console APUs) were propping up the company right now. But the big-core APUs certainly don't seem to have set the market alight.
 
Any evidence for this? I was under the impression that the Jaguar-based APUs (Kabini in craptops, and the console APUs) were propping up the company right now. But the big-core APUs certainly don't seem to have set the market alight.

They are down 20% YoY and they are losing market share to Intel. I wouldn't call that propping up. The only thing propping up AMD for now is the console contract, and even that isn't hugely relevant.
 
They are down 20% YoY and they are losing market share to Intel. I wouldn't call that propping up. The only thing propping up AMD for now is the console contract, and even that isn't hugely relevant.

Do you know the volume of APUs sold in Q2 2014 and last year? 🙄
 
Do you know the volume of APUs sold in Q2 2014 and last year? 🙄

Are you saying Kaveri is a complete flop and that AMD more or less only sells Kabini? Because otherwise you have to agree with him.

You guys are not having this conversation in here. This thread is about the technical details of the next APU, not the business of AMD's current APUs.
-ViRGE
 
They are down 20% YoY and they are losing market share to Intel. I wouldn't call that propping up. The only thing propping up AMD for now is the console contract, and even that isn't hugely relevant.

The "console contract" is made up of APUs...
 
I wonder if Nintendo is working with AMD on their new console that will compete with (crush) the PS4 and Xbone. 20nm, 1024 GCN 2.0 shaders + 3-module Carrizo + HBM.
It could be beneficial to both of them.
How does HBM cost compare to GDDR5?
 
I wonder if Nintendo is working with AMD on their new console that will compete with (crush) the PS4 and Xbone. 20nm, 1024 GCN 2.0 shaders + 3-module Carrizo + HBM.
It could be beneficial to both of them.
How does HBM cost compare to GDDR5?

Yeah, that doesn't sound like Nintendo.
 
Have a look at the HD 7770 and see how much faster it is compared to the HD 7750; both of them have the same memory at the same frequency.


Also, running Luxmark 2.0 in GPU mode doesn't use the CPU at all, so it doesn't matter what CPU you use in the system because the benchmark will only use the GPU. Only if you run in CPU + GPU mode does the final score depend on the CPU as well.

You're still using a dGPU on the PCI-e bus hosted by an Intel processor to try to prove a point about AMD APU compute performance. It could be that there is a memory bandwidth threshold beyond which superior bandwidth is irrelevant (for compute functions anyway), and both the 7750 and 7770 have local memory exceeding that threshold (while that would most certainly not be the case for an APU with DDR3-1600 + weak timings). The 7730 results might indicate that we are seeing that memory threshold in effect. Again, too many uncontrolled factors.

You have a 7850K that you've been testing recently, correct? Why not just run a few simple Luxmark 2.0 tests with it? You'd be benching the machine against itself, so setting up a control would be extremely easy.
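As an aside on the GPU-mode point above, here is a minimal sketch (using Python with pyopencl purely for illustration; it is not how Luxmark itself is written) of picking GPU-only versus CPU + GPU OpenCL devices:

Code:
import pyopencl as cl

def pick_devices(include_cpu=False):
    # "GPU mode" = GPU devices only; "CPU + GPU mode" also admits CPU devices.
    wanted = cl.device_type.GPU | (cl.device_type.CPU if include_cpu else 0)
    devices = []
    for platform in cl.get_platforms():
        try:
            devices.extend(platform.get_devices(device_type=wanted))
        except cl.Error:
            pass  # this platform has no matching devices
    return devices

print([d.name for d in pick_devices(include_cpu=False)])

In GPU-only mode the host CPU only feeds the queue, which is why the CPU choice barely shows up in the score.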
 
Aren't those two quotes contradictory? Or are we talking about a huge L3 cache at 1+ GB or so? 😕
A huge L3 cache which operates as the nearest memory.

HBM => 1 GB, 2 GB, 4 GB
DDR3 => 4 GB to 64 GB

HBM2 => 2 GB, 4 GB, 8 GB, 16 GB, 32 GB
DDR4 => 8 GB to 256 GB

As flash/non-volatile memory becomes faster and bigger, it will replace system RAM. Fusing "hard drive/SSD storage" and "system RAM" is the end goal for NVM Express.
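Rough math behind those capacity ranges, assuming the commonly quoted die densities and stack heights (2 Gbit dies, 4-high for first-gen HBM; 8 Gbit dies, up to 8-high for HBM2); treat these as illustrative figures:

Code:
def package_capacity_gb(stacks, dies_per_stack, die_density_gbit):
    # total capacity on the interposer, in GB (Gbit / 8)
    return stacks * dies_per_stack * die_density_gbit / 8

print(package_capacity_gb(1, 4, 2))   # first-gen HBM, one 4-high stack of 2 Gbit dies -> 1.0 GB
print(package_capacity_gb(4, 4, 2))   # four stacks -> 4.0 GB
print(package_capacity_gb(4, 8, 8))   # HBM2-style, four 8-high stacks of 8 Gbit dies -> 32.0 GB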
 
You're still using a dGPU on the PCI-e bus hosted by an Intel processor to try to prove a point about AMD APU compute performance.

Nope, the HD 7770 vs HD 7750 comparison was meant to show you that memory bandwidth is not that important to the Luxmark GPU score; compute units affect the score far more than memory does.

You have a 7850K that you've been testing recently, correct? Why not just run a few simple Luxmark 2.0 tests with it? You'd be benching the machine against itself, so setting up a control would be extremely easy.

I had the 7700K, but I will have the 7850K in a few days and I will test it in Luxmark and more.
 
Last warning. Stick to a tech discussion of Carrizo or you will receive a vacation. I do not want to see any more of this unending business arguing in this thread.

And that goes for everyone in this thread.

-ViRGE
 
Compute depends on latency, graphics on bandwidth. There is a benchmark where they test Kaveri's memory on graphics-related tasks, and they come to the conclusion that increasing memory clocks, i.e. more bandwidth, is better (sorry for not putting links, I will search later). HBM working as an L3 benefits both, but graphics more (working with textures); as for HBM working as extra memory, well, just ask Xbox One developers: it breaks HSA.
 
If HBM is used, it will be the L3 cache. While the stacked memory will have latency similar to system memory, it will not have the issues of previous L3 caches from AMD.

Orochi (AMD), for example, had pathetic bandwidth and a very small size, which made its latency awful. HBM as the L3 will provide up to 128 GB/s and 1 gigabyte of memory.

[Image: AIDA64 memory benchmark results]

Haswell in comparison: http://i1365.photobucket.com/albums...C10_AIDA64_25133Copy_zps9802a583.png~original

SRAM @ 8 MB - 128 GB/s Read - 40 GB/s Write - 50 GB/s Copy / 50% of the die (Interface + SRAM)
vs
Stacked DRAM @ 1 GB - 128 GB/s peak throughput / ~5?% of the die (Interface)
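A toy average-memory-access-time comparison of the two, with made-up hit rates and latencies (nothing measured); the only point is that a 1 GB cache can win on hit rate even if each individual hit is slower than SRAM:

Code:
def amat(hit_rate, hit_ns, dram_ns):
    # average memory access time: hits cost hit_ns, misses cost hit_ns + dram_ns
    return hit_rate * hit_ns + (1 - hit_rate) * (hit_ns + dram_ns)

print(amat(hit_rate=0.50, hit_ns=10, dram_ns=80))   # small 8 MB SRAM L3  -> 50.0 ns
print(amat(hit_rate=0.95, hit_ns=40, dram_ns=80))   # big 1 GB stacked L3 -> 44.0 ns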



I recall AMD saying that they found the problem behind the L3 latency issues in BD and PD. They didn't have the time, or maybe the desire, to fix it for SR. I cannot recall if they said they would fix it for EX, but I would assume that if the issue was identified, why not fix it for EX if EX has an L3?

Of course, if there is no L3 and only HBM, then the L3 latency bug issue may be moot.

I think an L3 will be present. It boils down to how successful putting HBM on a chip is and how it affects yields. AMD tends to be conservative; didn't they have two memory controllers for SR? I hope they have HBM and EX will be far superior to SR, but I doubt it.
 
I recall AMD saying that they found the problem behind the L3 latency issues in BD and PD.
http://www.anandtech.com/show/6201/amd-details-its-3rd-gen-steamroller-architecture/2 said:
According to AMD, they’ve isolated the reason for the unusually high L3 latency in the Bulldozer architecture, however fixing it isn’t a top priority.
I think an L3 will be present. It boils down to how successful putting HBM on a chip is and how it affects yields. AMD tends to be conservative; didn't they have two memory controllers for SR? I hope they have HBM and EX will be far superior to SR, but I doubt it.
HBM and the Si interposer are testable before they become part of a SKU.

HBM package specification; Known Good Stacked Die

HBM and the Si interposer do not affect the yield of the product. Everything is tested beforehand to assure good yield. This comes at an increased cost and thus an increased ASP.
 
HBM and the Si interposer do not affect the yield of the product.
Even if the stack itself works, the yield of assembling the host die and the stack onto the interposer is still affected by the maturity of the TSV process. If that yield is not so good, it already affects the yield of the final product.

LOL.
 
Compute depends on latency, graphics on bandwidth.
Compute is a generic term, and you have tons of different types of workloads in the real world, so you really shouldn't generalize things this way, particularly when we are talking about exploiting DATA PARALLELISM. Graphics is just compute, by the way, so some stages of the graphics pipeline, or maybe even the shaders, may want lower latencies too. Atomic operations and the RMW operations of render backends are great examples.

HBM working as an L3 benefits both, but graphics more (working with textures); as for HBM working as extra memory, well, just ask Xbox One developers: it breaks HSA.
No, it won't. Okay, don't bring those hUMA slides to me; read the HSA Platform System Architecture Specification 1.0 Provisional instead. Does it mention even a single word of hUMA, the marketing hype of Kaveri? No, but it tells you "yeah, you need these features of hUMA to be HSA compliant". So basically the way people interpreted hUMA by overlooking the last three letters can now be thrown away. Oh, by the way, Carrizo and those Project Skybridge APUs should be the first wave of full HSA platforms. Sorry, but Kaveri is not on the list*. Yep.

What else does it tell you? Discrete HSA devices with component local memory! Multi-node, multi-device topology discovery! So now, if a discrete GPU can be supported and covered by the spec, why would conceptually integrating a discrete GPU with its own pool of memory into a host processor suddenly break HSA?

Hmm. Not to mention that HSA (and the higher-level OpenCL 2 of the HSA software stack) would already be "broken" in your sense, with the holy group memory, the merciful image memory and the virtuous private memory segments...

P.S. Doesn't working as an L3 cache contradict the "graphics needs bandwidth" claim, by the way? You are burning bandwidth on cache management, while graphics is fine operating in a smaller chunk of memory (and 1 GB is not too small for a GPU). Making it a dedicated pool guarantees never a single cache miss and the full bandwidth to the DRAM!
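A tiny accounting sketch of that P.S. point, with assumed miss rates and dirty-line fraction (nothing measured):

Code:
HBM_BW = 128.0  # GB/s peak of the stacked DRAM

def useful_bandwidth(miss_rate, dirty_fraction=0.3):
    # each miss adds a line fill into HBM, and a dirty victim adds a writeback,
    # so HBM traffic per demand access grows to 1 + miss_rate * (1 + dirty_fraction)
    return HBM_BW / (1 + miss_rate * (1 + dirty_fraction))

for mr in (0.0, 0.05, 0.20):
    print(f"miss rate {mr:.0%}: ~{useful_bandwidth(mr):.0f} GB/s left for the GPU")

As a dedicated pool there is no cache-management traffic, so the full 128 GB/s stays available.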

* Carrizo supports hard preemption of wavefronts, which is a requirement of the Full Profile of the HSA platform spec. So there was a reason why Kaveri is just marketed as... "first to support HSA features".
 
Even if the stack itself works, the yield of assembling the host die and the stack onto the interposer is still affected by the maturity of the TSV process. If that yield is not so good, it already affects the yield of the final product.

LOL.
TSVs aren't used to connect the logic die or the memory stack to the Si interposer. The worst outcome, based on validated research, is a 5% loss in yield, and that is without KGS and KGD, or KGSD as a whole. With these safeguards the loss in yield is none.
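For the yield argument, the compounding is simple enough to sketch; the 95% assembly figure is the worst-case 5% loss quoted above, the other numbers are placeholders:

Code:
def final_yield(host_die, memory_stack, assembly):
    # independent process steps multiply into the final product yield
    return host_die * memory_stack * assembly

print(f"{final_yield(0.80, 0.90, 0.95):.1%}")  # untested parts: every term compounds (~68%)
print(f"{final_yield(1.00, 1.00, 0.95):.1%}")  # KGD/KGSD parts, 5% assembly loss (95%)
print(f"{final_yield(1.00, 1.00, 1.00):.1%}")  # lossless assembly, as argued above (100%)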

Also, the L3 is hUMA through the cache-coherent interconnect that all Excavator and Volcanic Islands SKUs will have.
 
WRT iGPU performance, no. The 7850K has been shown to be severely lacking in memory bandwidth when it comes to iGPU performance. If you're running something like DDR3-1600 with "normal" timings, you can run up your iGPU speed without seeing any appreciable performance increase (while overclocking the RAM does produce positive results). It seems fairly obvious that the iGPU is starved for bandwidth.

On that 512-stream-processor iGPU with DDR3-1600, I have been wondering if certain settings are harder on memory bandwidth than on GPU processing power.

Here is a thread I made last year in an attempt to optimize the GPU setting when low memory bandwidth is present:

http://forums.anandtech.com/showthread.php?t=2352752&highlight=

Any opinions on the best way to set up graphics until we get something better from AMD?
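For context on the bandwidth-starvation point quoted above, the usual theoretical-peak math for dual-channel DDR3 (real sustained figures are lower):

Code:
def ddr_peak_gbs(transfer_rate_mts, bus_width_bits=64, channels=2):
    # MT/s * bytes per transfer * channels, converted from MB/s to GB/s
    return transfer_rate_mts * (bus_width_bits / 8) * channels / 1000

print(ddr_peak_gbs(1600))   # dual-channel DDR3-1600 -> 25.6 GB/s
print(ddr_peak_gbs(2400))   # dual-channel DDR3-2400 -> 38.4 GB/s

Against the 128 GB/s figure discussed for a single HBM stack, it's easy to see why the 512-shader iGPU scales with memory overclocks.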
 