AMD Q4/2013 Desktop Roadmap

Ajay · Dec 2, 2013

sefsefsefsef said:
Zambezi's and Vishera's L3 is shared between all cores, and is a "mostly exclusive" cache, meaning it basically is a big victim cache for the L2s. What do you mean that a 1:1 size ratio is bad for a victim cache? Using a 1:1 sized L3 as anything but a victim cache would be absurd.

On an unrelated note, the place where large caches help the most for server workloads is in caching instructions, not data. Generally not a lot of data locality in server workloads.

Intel's eDRAM LLC is 128 MB and is a victim cache. IIRC, you want victim caches to be >> 1:1 - then again, my memory may just be off. 1:1 L3$:L2$ seems to be absurd to me - generally speaking. Apparently, based on preliminary results, Kaveri loses some FP performance without the victim cache, even compared to 1:1. That said, a small inclusive L3$ would be pointless.

sefsefsefsef · Dec 2, 2013

When victim caches were first proposed by Jouppi in '90 they were extremely small, and fully-associative (he only tested up to 15 total cache lines in his initial proposed victim cache). Victim caches of all sizes can be effective.
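
To make the idea concrete, here is a minimal sketch of the kind of arrangement Jouppi described: a direct-mapped cache with a tiny fully associative victim buffer behind it. The class name, the LRU replacement in the victim buffer, and the block-address interface are assumptions made for illustration; this is not a model of any AMD or Intel cache.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped cache backed by a small fully associative victim buffer."""

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets      # one block address per set
        self.victims = OrderedDict()        # evicted blocks, oldest first
        self.victim_entries = victim_entries
        self.hits = self.victim_hits = self.misses = 0

    def access(self, block):
        index = block % self.num_sets
        if self.lines[index] == block:
            self.hits += 1
            return "hit"
        evicted = self.lines[index]
        if block in self.victims:
            # Swap: the requested block moves back into the main cache.
            del self.victims[block]
            self.victim_hits += 1
            result = "victim hit"
        else:
            self.misses += 1
            result = "miss"
        self.lines[index] = block
        if evicted is not None:
            # The displaced block becomes a victim; drop the oldest if full.
            self.victims[evicted] = True
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)
        return result
```

With 8 sets, blocks 0 and 8 map to the same line. Alternating between them misses every time when `victim_entries` is 0, but only misses twice when a handful of victim entries are available, which is why even a very small victim cache can pay off on conflict-heavy access patterns.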

blastingcap · Dec 2, 2013

JDG1980 said:
AMD's roadmap includes a server version of Kaveri, called "Berlin". The slides indicate that it will indeed support ECC. Of course, we don't yet know what pricing will be like on this, or if mainstream boards will support it the way that Asus's current offerings do with AM3+.

If you already have an AM3+ motherboard that supports ECC, I suspect that an FX-8320 would be good enough for a NAS, and these chips are currently on sale for very reasonable prices at several locations. There are plenty of people who run a NAS on much worse chips, even Atoms. Most off-the-shelf NASes use low power ARM processors that are far weaker than Vishera.

I want lower power draw, not higher performance... I already have a 2.7GHz Sempron but would like to get something with lower idle wattage. Intel's ECC tax is outrageous. And buying AMD or INTC server CPUs isn't a solution either; at those prices I might as well just buy a Proliant.

ShintaiDK · Dec 2, 2013

blastingcap said:
I want lower power draw, not higher performance... I already have a 2.7GHz Sempron but would like to get something with lower idle wattage. Intel's ECC tax is outrageous. And buying AMD or INTC server CPUs isn't a solution either; at those prices I might as well just buy a Proliant.

Easier if you just did the research before complaining:

http://ark.intel.com/products/77773/Intel-Pentium-Processor-G3220-3M-Cache-3_00-GHz

64$ for CPU with ECC. Huge tax....

And here is another:
http://ark.intel.com/products/71072/Intel-Celeron-Processor-G1610-2M-Cache-2_60-GHz

42$ if the Christmas budget is extra tight due to taxes.

Ajay · Dec 2, 2013

sefsefsefsef said:
When victim caches were first proposed by Jouppi in '90 they were extremely small, and fully-associative (he only tested up to 15 total cache lines in his initial proposed victim cache). Victim caches of all sizes can be effective.

When Jouppi first proposed them, he was looking at a miss penalty of 1 cycle! And much less memory to map. I need to get the full article - or stop being lazy and open up my copy of Hennessy and Patterson. Thanks for the info.

JDG1980 · Dec 2, 2013

blastingcap said:
I want lower power draw, not higher performance... I already have a 2.7GHz Sempron but would like to get something with lower idle wattage. Intel's ECC tax is outrageous. And buying AMD or INTC server CPUs isn't a solution either; at those prices I might as well just buy a Proliant.

Intel's "ECC tax" applies to quad cores and up. Their single and dual core mainstream offerings (Celeron, Pentium, i3) support ECC on both Ivy Bridge and Haswell, if used in conjunction with a server grade C-series chipset. For a NAS, you might want to try an i3-4130 CPU (dual-core, 54W TDP, $129.99 at Newegg). As mentioned, you'll need a server-specific motherboard, and the Supermicro X10SLM-F will get you there for $164.99.

If a really lightweight solution is good enough for you, there's the Supermicro X9SBAA, based on the Centerton Atom platform. $220 at Newegg for the board, which includes the CPU soldered in. It supports ECC, though you'll have to find a compact SO-DIMM instead of a standard full-size module. This might be a good choice for a firewall if you need ECC but don't need that much processing power. (People run firewalls on old single-core CPUs, so this should work fine.)

NTMBK · Dec 2, 2013

NostaSeronx said:
The two L2 caches from the CPU and GPU are coherent by a 256-bit interconnect.

Coherent != accessible from GPU. The two modules' L2s are also coherent, but they can't access each other's caches.

Shivansps · Dec 2, 2013

NostaSeronx said:
IOMMU 2.0(Windows only HSA)
Switchable Graphics V7 from V5.5

Massively improved power and thermal control, which led to a 2x increase in perf/watt.
CPU & GPU now have boosts.

IOMMU is AMD's VT-d

About HSA, that's the thing I want to know: I don't see HSA listed as supported on the slide, which makes no sense to me.
Boost is great, finally AMD is coming to its senses, but I'm very skeptical about the 2x perf/watt claim; it seems too much for the same process.

But the missing HSA really puzzles me. I was starting to think Puma+ is nothing more than Jaguar+ with a name change and maybe some minimal changes.

NostaSeronx · Dec 2, 2013

NTMBK said:
Coherent != accessible from GPU. The two modules' L2s are also coherent, but they can't access each other's caches.

All L2 CUs are connected, coherent, and accessible by each other.

Shivansps said:
IOMMU is AMD's VT-d

IOMMU 2.0 isn't just AMD-Vi or Intel's VT-d.
http://www.linux-kvm.org/wiki/images/b/b1/2011-forum-amd-iommuv2-kvm.pdf

IOMMU v2 is required for HSA/hQ/hUMA for Windows.
IOMMU v2.5 is required for HSA/hQ/hUMA for the rest of the OSes.

https://www.youtube.com/watch?v=Wt-oRrk-tZQ
https://www.youtube.com/watch?v=GtYlcTeBFfo

IOMMUv2+ is the HSA-MMU.

Shivansps said:
But the missing HSA really puzzles me. I was starting to think Puma+ is nothing more than Jaguar+ with a name change and maybe some minimal changes.

It was meant to be Jaguar+, not Puma+, as Puma is the enhanced version of the Jaguar core. Don't rely heavily on AMD's marketing slides, as they have been known to be inaccurate sometimes.

----
The L3 cache on the Orochi dies can be virtually separated into 2MB partitions. If not, then the total of the L3 is available to all modules, all 8 MBs of it.

- L3 Cache Partitioning -
Allows customers to associate Bulldozer modules with L3 sub-caches so that each Bulldozer module can be guaranteed a certain amount of L3 cache (one Bulldozer module cannot monopolize the whole L3). With a minimum partition size of 2MB, the cache can be dynamically allocated through the software task scheduler. This can be done through the hypervisor or at the kernel level.

The L2:L3 ratio can be at minimum 1 to 1 and at maximum 1 to 4.
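
The partitioning rule above can be sketched in a few lines. This is only an illustration of the bookkeeping, assuming 2 MB of L2 per Bulldozer module; the function names and the module-to-sub-cache mapping are hypothetical and do not reflect how the hardware or any hypervisor actually programs it.

```python
L3_TOTAL_MB = 8        # Orochi L3 size, from the post
PARTITION_MB = 2       # minimum partition size, from the post
L2_PER_MODULE_MB = 2   # assumption: 2 MB of L2 per Bulldozer module


def assign_partitions(requests_mb):
    """Map each module id to the list of L3 sub-cache indices it is guaranteed.

    requests_mb maps a module id to the L3 capacity (in MB) it should get.
    """
    free = list(range(L3_TOTAL_MB // PARTITION_MB))
    assignment = {}
    for module, size in sorted(requests_mb.items()):
        if size <= 0 or size % PARTITION_MB != 0:
            raise ValueError(
                f"module {module}: size must be a positive multiple of {PARTITION_MB} MB"
            )
        count = size // PARTITION_MB
        if count > len(free):
            raise ValueError("requests exceed total L3 capacity")
        assignment[module] = free[:count]
        free = free[count:]
    return assignment


def l3_per_l2(assignment, module):
    """Guaranteed L3 capacity for a module, per MB of its L2."""
    return len(assignment[module]) * PARTITION_MB / L2_PER_MODULE_MB
```

Giving each of four modules one 2 MB partition yields a ratio of 1.0 (the 1 to 1 case), while handing all four sub-caches to a single module yields 4.0 (the 1 to 4 case).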

NTMBK · Dec 3, 2013

NostaSeronx said:
All L2 CUs are connected, coherent, and accessible by each other.

"L2 CUs"? What are you talking about? A CU is a collection of GPU shader cores, not a cache.

mrmt · Dec 3, 2013

Shivansps said:
About HSA, that's the thing I want to know: I don't see HSA listed as supported on the slide, which makes no sense to me.
Boost is great, finally AMD is coming to its senses, but I'm very skeptical about the 2x perf/watt claim; it seems too much for the same process.

It's not performance/watt, it's performance/TDP, by AMD's own admission. AMD can be credited for always pushing creative marketing boundaries.

SiliconWars · Dec 3, 2013

Where did AMD say that?

PPB · Dec 3, 2013

SiliconWars said:
Where did AMD say that?

Because perf is 2x and TDP stays the same. He is upset because he probably thinks AMD is like Intel in the mobile space, where TDP becomes meaningless and it's OK to exceed your own TDP by 10 or more watts.

Nope. Ironically, AMD does the opposite of what it does on desktop, and their mobile solutions' TDP ends up being more or less the real load power consumption. So in this case, 2x perf at the same TDP = 2x perf/watt.

Abwx · Dec 3, 2013

PPB said:
Because perf is 2x and TDP stays the same. He is upset because he probably thinks AMD is like Intel in the mobile space, where TDP becomes meaningless and it's OK to exceed your own TDP by 10 or more watts.

Nope. Ironically, AMD does the opposite of what it does on desktop, and their mobile solutions' TDP ends up being more or less the real load power consumption. So in this case, 2x perf at the same TDP = 2x perf/watt.

Actually it is more performance at a smaller TDP. Their metric is totally relevant, unlike SDP's erratic and unsubstantiated numbers.

SiliconWars said:
Where did AMD say that?

Here, and it seems that some are concerned by the numbers and AMD's confidence.

“AMD is establishing excellent momentum this year in the low-power, mobile computing market and with ‘Mullins’ and ‘Beema’ coming in 2014 we are not standing still. AMD aims to deliver a set of platforms in the first half of next year that will outperform the competition in graphics and total compute performance in fanless tablets, 2-in-1s and ultrathin notebooks,” said Mark Papermaster

The new 2014 AMD A-Series low power APU platform, codenamed “Mullins,” is expected to deliver up to 139 percent better productivity performance per watt when compared to the previous generation “Temash” platform. Testing conducted by AMD Performance Labs on optimized AMD reference systems. PC manufacturers may vary configuration yielding different results. PCMark 8 - Home score divided by TDP (W) is used to simulate productivity performance per watt; the Mullins platform (4.5W) scored 1809 while the Temash platform (8W) scored 1343. AMD "Larne" reference platform system used for both APUs. Temash-based AMD A6-1450 quad-core APU with AMD Radeon™ HD 8250 Graphics, 2x2GB of DDR3-1333MHz RAM (running at 1066MHz), Windows 8.1, 13.200.11.0 - 03-Sep-2013 driver. Pre-production engineering sample of “Mullins” quad-core APU with next generation AMD Radeon graphics (model number TBD), 2x2GB DDR3-1333MHz RAM, Windows 8.1, and unreleased reference driver. MUN-3

The new 2014 AMD A-Series mainstream APU platform, codenamed “Beema,” is expected to deliver up to 104 percent better productivity performance per watt when compared to the previous generation “Kabini” platform. Testing conducted by AMD Performance Labs on optimized AMD reference systems. PC manufacturers may vary configuration yielding different results. PCMark 8 - Home score divided by TDP (W) is used to simulate productivity performance per watt; the Beema platform (15W) scored 2312 while the Kabini platform (25W) scored 1861. AMD "Larne" reference platform system used for both APUs. Kabini-based AMD A6-5200 quad-core APU with AMD Radeon™ HD 8400 Graphics, 2x2GB of DDR3-1600MHz RAM, Windows 8.1, 13.200.11.0 - 03-Sep-2013 driver. Pre-production engineering sample of “Beema” quad-core APU with next generation AMD Radeon graphics (model number TBD), 2x2GB DDR3-1600MHz RAM, Windows 8.1, and unreleased reference driver. BMN-3

http://www.amd.com/us/press-releases/Pages/amd-2014-mobile-apu-2013nov13.aspx
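
For anyone who wants to check the arithmetic in those footnotes, the sketch below recomputes "score divided by TDP" from the quoted PCMark 8 Home scores and TDP figures. The helper names are made up for illustration; the only inputs are the numbers quoted above.

```python
def score_per_tdp(score, tdp_watts):
    """PCMark 8 Home score divided by rated TDP, as in AMD's footnote."""
    return score / tdp_watts


def percent_better(new, old):
    """Relative improvement of new over old, in percent."""
    return (new / old - 1.0) * 100.0


# (score, TDP in watts) pairs taken from the quoted press-release footnotes.
comparisons = {
    "Mullins vs Temash": ((1809, 4.5), (1343, 8.0)),
    "Beema vs Kabini": ((2312, 15.0), (1861, 25.0)),
}

for name, (new, old) in comparisons.items():
    per_tdp_gain = percent_better(score_per_tdp(*new), score_per_tdp(*old))
    raw_gain = percent_better(new[0], old[0])
    print(f"{name}: {per_tdp_gain:.1f}% better score/TDP, {raw_gain:.1f}% better raw score")
```

The Mullins figure comes out at roughly 139 percent, matching the press release. The Beema figure comes out at roughly 107 percent from these scores, slightly above the quoted 104 percent; the footnote does not explain the difference. In both cases the raw score gain is much smaller than the score/TDP gain, so most of the improvement in this metric comes from the lower rated TDP in the denominator.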

SiliconWars · Dec 3, 2013

Seems fair enough. We keep hearing the comparison between Bay Trail vs the "15W" A4-5000 after all.

Abwx · Dec 3, 2013

SiliconWars said:
Seems fair enough. We keep hearing the comparison between Bay Trail vs the "15W" A4-5000 after all.

From the numbers above we can see that the 4.5W TDP Mullins will have the performance of the 15W TDP Kabini. Also, it is mentioned at the bottom that the numbers are measured from:

Pre-production engineering sample of “Mullins” quad-core APU with next generation AMD Radeon graphics (model number TBD)

Pre-production engineering sample of “Beema” quad-core APU with next generation AMD Radeon graphics (model number TBD)

So these are real silicon numbers...

mrmt · Dec 3, 2013

PPB said:
Because perf is 2x and TDP stays the same. He is upset because he probably thinks AMD is like Intel in the mobile space, where TDP becomes meaningless and it's OK to exceed your own TDP by 10 or more watts.

That *may* have been the case of their small core line until now, but what about the new Puma chips? What's the secret sauce to get 2x efficiency at the same node?

Mark my words: with Beema and Mullins, AMD is playing with their TDP specs big time, and because of that the chips will get lambasted by every reviewer out there, except for Semiaccurate and maybe Phoronix.

inf64 · Dec 3, 2013

mrmt said:
Mark my words: with Beema and Mullins, AMD is playing with their TDP specs big time, and because of that the chips will get lambasted by every reviewer out there, except for Semiaccurate and maybe Phoronix.

While that might be true, why not just wait and see the reviews before proclaiming it as a fact?

mrmt · Dec 3, 2013

inf64 said:
While that might be true, why not just wait and see the reviews before proclaiming it as a fact?

Because I don't believe in fairy tales, nor in 2x efficiency from an architectural update at the same node.

inf64 · Dec 3, 2013

Ok but still claiming you are correct without any evidence to back it up besides your own opinion is kinda pointless.

raghu78 · Dec 3, 2013

mrmt said:
That *may* have been the case of their small core line until now, but what about the new Puma chips? What's the secret sauce to get 2x efficiency at the same node?

Firstly, Kabini and Temash did not have an aggressive turbo implementation like Bay Trail. In fact, Kabini had no turbo at all, and among Temash parts the A6-1450 was the only model to have it. It looks like AMD was limited by the TSMC 28nm process and time-to-market design constraints with Kabini / Temash. When only the CPU is heavily loaded, in workloads like Cinebench R11.5, better power sharing between CPU and GPU will allow the CPU to use the full available power budget. Also, next year the TSMC 28nm process will be in its third year of production and very mature. This would allow an aggressive turbo implementation and better yields.
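
A rough sketch of the power-sharing argument, for illustration only. The per-core power table, the uncore figure, and the GPU power numbers are invented, not AMD data; the point is just to show why a chip without turbo has to pick its clock for the worst case, while one with power sharing can clock higher when the GPU and most cores are idle.

```python
# Hypothetical per-core power (W) at each CPU frequency (GHz); not AMD data.
CORE_POWER_W = {1.0: 1.0, 1.5: 1.5, 2.0: 3.0}


def pick_cpu_clock(tdp_w, gpu_power_w, active_cores, uncore_w=3.0):
    """Return the highest CPU frequency whose total power fits within TDP."""
    budget = tdp_w - gpu_power_w - uncore_w
    fitting = [f for f, p in CORE_POWER_W.items() if p * active_cores <= budget]
    # Fall back to the lowest clock if nothing fits the remaining budget.
    return max(fitting) if fitting else min(CORE_POWER_W)


def fixed_clock(tdp_w, total_cores, worst_case_gpu_w, uncore_w=3.0):
    """No turbo: one clock, chosen for all cores plus a fully loaded GPU."""
    return pick_cpu_clock(tdp_w, worst_case_gpu_w, total_cores, uncore_w)
```

With these made-up numbers, `fixed_clock(15, 4, 6)` settles on 1.5 GHz, while `pick_cpu_clock(15, 1, 1)` (one active core, GPU nearly idle) can go to 2.0 GHz inside the same 15W budget.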

http://www.anandtech.com/show/6981/...ality-of-mainstream-pcs-with-its-latest-apu/2

"A big issue here is Kabini, at least in its launched versions, lacks any turbo core support. The 15W A4-5000 runs even single threaded tasks as if all four cores were active and eating into that TDP budget. The fastest Jaguar implementation seems to be 2GHz, but even if the A4-5000 could turbo up to that level I feel like I’d still want a bit more. There’s obviously room on the table for a Kabini refresh, even at 28nm."

With Puma AMD also has the opportunity to tweak the Jaguar core for higher IPC for a slight increase in core size.

http://www.anandtech.com/show/7514/amd-2014-mobile-apu-update-beema-and-mullins

"AMD hasn’t disclosed clock speeds or anything else for the upcoming APUs, but given A6-1450 is clocked at 1000-1400MHz with the GPU core running at 300-400MHz, it is possible AMD was able to arrive at the above performance increases simply with higher clock speeds. Also possible is that similar to the Bobcat to Jaguar transition, AMD tweaked other elements of the Puma core (e.g. the scheduler could have more entries)."

With Sony's PS4, based on the same Jaguar core, reportedly boasting a 2.75 GHz turbo, the TSMC process is now in good shape to yield high-clocking enhanced Jaguar (Puma) designs.

mrmt said:
Mark my words: with Beema and Mullins, AMD is playing with their TDP specs big time, and because of that the chips will get lambasted by every reviewer out there, except for Semiaccurate and maybe Phoronix.

Your words mean nothing. AMD has clearly come out with a 4.5W TDP and 2W SDP specification for Mullins. They have posted results from actual silicon running benchmarks. You don't know anything about the improvements to the core (IPC increase, better turbo, deeper sleep states for lower idle power) or the process-maturity-related improvements (better yields even at higher clocks), and here you are telling us what's possible and not possible. Stop the crap.

http://www.anandtech.com/show/7514/amd-2014-mobile-apu-update-beema-and-mullins

mrmt · Dec 3, 2013

raghu78 said:
Your words mean nothing. AMD has clearly come out with a 4.5W TDP and 2W SDP specification for Mullins. They have posted results from actual silicon running benchmarks.

Let me refresh your memory, because you seem not to have read the disclaimer in AMD's presentation:

AMD said:
http://www.slideshare.net/AMD/amd-mobility-apu-lineup-announcement

Home score divided by TDP (W) is used to simulate productivity performance per watt (...)

Why did they have to put a disclaimer like that if they were really sure that performance per watt actually doubled? Why can't they report actually measured power consumption? Maybe AMD is saving on electricity measuring devices? Do you think that only power management and a few tweaks on the core would be enough to double performance/watt? Nobody ever achieved what AMD is saying it achieved here, and the fact that they are not giving actual measurements just makes it worse.

As for AMD coming with any kind of specification, it wouldn't be the first time they breached it. There are plenty of "125W" FX chips being sold out there, for them to start doing the same thing on the mobile market wouldn't be much of a stretch.

In any case, feel free to believe in AMD claims. It wouldn't be the first time that AMD lies like that, and I doubt it will be the last. AMD marketing is a gift that keeps on giving.

SiliconWars · Dec 3, 2013

So what you're basically saying in your own special way, mrmt, is that AMD rated their Kabini TDPs far too high. Interestingly enough, that's what you might believe if you compared power draw to Intel's "17W" chips.

http://www.tomshardware.com/reviews/kabini-a4-5000-review,3518-13.html

Or you might believe that when it comes to getting creative with measuring TDP, Intel has a massive lead. Either way nothing you're claiming AMD will do with Beema or Mullins makes any kind of sense.

AtenRa · Dec 3, 2013

mrmt said:
Because I don't believe in fairy tales, nor in 2x efficiency from an architectural update at the same node.

1. Jaguar has no turbo: it cannot use the entire TDP for the CPU alone, or downclock the CPU and raise the iGPU frequency, until it reaches the TDP/voltage/thermal limit. As a result, the CPU and the iGPU in Jaguar can only run at a fixed frequency, even if the chip never reaches the TDP limit.
That will change with Puma+.

2. They may still use the 28nm TSMC process but they can also change the process to TSMC 28nm HPM for the new SoCs.

http://www.tsmc.com/english/dedicatedFoundry/technology/28nm.htm

TSMC also provides high performance for mobile applications (HPM) technology to address the need for applications requiring high speed as well as low leakage power. Such technology can provide better speed than 28HP and similar leakage power as 28LP. With such wide performance/leakage coverage, 28HPM is also ideal for many applications from networking, tablet, to mobile consumer products.

So, just because you lack the knowledge doesn't mean AMD lies.

NostaSeronx · Dec 3, 2013

NTMBK said:
"L2 CUs"? What are you talking about? A CU is a collection of GPU shader cores, not a cache.

A compute unit is:
A Jaguar module, a Steamroller module, a Graphics Core Next SIMD.

In the case of Steamroller and GCN 1.1, the L2s connected to the CUs are shared and coherent through a 256-bit data bus.
