AMD Q4/2013 Desktop Roadmap

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Zambezi's and Vishera's L3 is shared between all cores, and is a "mostly exclusive" cache, meaning it basically is a big victim cache for the L2s. What do you mean that a 1:1 size ratio is bad for a victim cache? Using a 1:1 sized L3 as anything but a victim cache would be absurd.

On an unrelated note, the place where large caches help the most for server workloads is in caching instructions, not data. Generally not a lot of data locality in server workloads.

Intel's eDRAM LLC is 128 MB and is a victim cache. IIRC, you want victim caches to be >> 1:1 - then again, my memory may just be off. 1:1 L3$:L2$ seems to be absurd to me - generally speaking. Apparently, based on preliminary results, Kaveri loses some FP performance without the victim cache, even compared to 1:1. That said, a small inclusive L3$ would be pointless.
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
When victim caches were first proposed by Jouppi in '90 they were extremely small, and fully-associative (he only tested up to 15 total cache lines in his initial proposed victim cache). Victim caches of all sizes can be effective.
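For anyone who wants to play with the idea, here is a minimal Python sketch of a direct-mapped cache backed by a small fully-associative victim buffer. The class name, the sizes, and the access trace are all made up for illustration; it is a toy model under those assumptions, not Jouppi's simulator and not how any particular CPU implements it.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Toy direct-mapped cache with a fully-associative victim buffer."""

    def __init__(self, num_sets, victim_entries):
        self.num_sets = num_sets
        self.lines = [None] * num_sets      # one cached line address per set
        self.victim_entries = victim_entries
        self.victims = OrderedDict()        # oldest evicted line first
        self.hits = self.victim_hits = self.misses = 0

    def access(self, line_addr):
        idx = line_addr % self.num_sets
        if self.lines[idx] == line_addr:
            self.hits += 1
            return "hit"
        evicted = self.lines[idx]
        if line_addr in self.victims:
            # Swap: the requested line moves back into the main cache.
            del self.victims[line_addr]
            self.victim_hits += 1
            result = "victim_hit"
        else:
            self.misses += 1
            result = "miss"
        self.lines[idx] = line_addr
        if evicted is not None and self.victim_entries > 0:
            self.victims[evicted] = True
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)  # drop the oldest victim
        return result


def run(trace, num_sets, victim_entries):
    """Replay a trace and return (hits, victim_hits, misses)."""
    cache = DirectMappedWithVictim(num_sets, victim_entries)
    for addr in trace:
        cache.access(addr)
    return cache.hits, cache.victim_hits, cache.misses


# Two line addresses that map to the same set and evict each other.
trace = [0, 8] * 5
print("no victim buffer:", run(trace, num_sets=8, victim_entries=0))
print("4-entry victim buffer:", run(trace, num_sets=8, victim_entries=4))
```

With a conflict-heavy trace like this one, even a handful of victim entries turns almost every conflict miss into a victim hit, which is the effect the tiny original designs were after.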
 

blastingcap

Diamond Member
Sep 16, 2010
6,654
5
76
AMD's roadmap includes a server version of Kaveri, called "Berlin". The slides indicate that it will indeed support ECC. Of course, we don't yet know what pricing will be like on this, or if mainstream boards will support it the way that Asus's current offerings do with AM3+.

If you already have an AM3+ motherboard that supports ECC, I suspect that an FX-8320 would be good enough for a NAS, and these chips are currently on sale for very reasonable prices at several locations. There are plenty of people who run a NAS on much worse chips, even Atoms. Most off-the-shelf NASes use low-power ARM processors that are far weaker than Vishera.

I want lower power draw, not higher performance... I already have a 2.7GHz Sempron but would like to get something with lower idle wattage. Intel's ECC tax is outrageous. And buying AMD or INTC server CPUs isn't a solution either; at those prices I might as well just buy a Proliant.
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
I want lower power draw, not higher performance... I already have a 2.7GHz Sempron but would like to get something with lower idle wattage. Intel's ECC tax is outrageous. And buying AMD or INTC server CPUs isn't a solution either; at those prices I might as well just buy a Proliant.

Easier if you just did the research before complaining:

http://ark.intel.com/products/77773/Intel-Pentium-Processor-G3220-3M-Cache-3_00-GHz

$64 for a CPU with ECC. Huge tax....

And here is another:
http://ark.intel.com/products/71072/Intel-Celeron-Processor-G1610-2M-Cache-2_60-GHz

$42 if the Christmas budget is extra tight due to taxes.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
When victim caches were first proposed by Jouppi in '90 they were extremely small, and fully-associative (he only tested up to 15 total cache lines in his initial proposed victim cache). Victim caches of all sizes can be effective.

When Jouppi first proposed them, he was looking at a miss penalty of 1 cycle! And much less memory to map. I need to get the full article - or stop being lazy and open up my copy of Hennessy and Patterson. Thanks for the info :)
 

JDG1980

Golden Member
Jul 18, 2013
1,663
570
136
I want lower power draw, not higher performance... I already have a 2.7GHz Sempron but would like to get something with lower idle wattage. Intel's ECC tax is outrageous. And buying AMD or INTC server CPUs isn't a solution either; at those prices I might as well just buy a Proliant.

Intel's "ECC tax" applies to quad cores and up. Their single and dual core mainstream offerings (Celeron, Pentium, i3) support ECC on both Ivy Bridge and Haswell, if used in conjunction with a server grade C-series chipset. For a NAS, you might want to try an i3-4130 CPU (dual-core, 54W TDP, $129.99 at Newegg). As mentioned, you'll need a server-specific motherboard, and the Supermicro X10SLM-F will get you there for $164.99.

If a really lightweight solution is good enough for you, there's the Supermicro X9SBAA, based on the Centerton Atom platform. $220 at Newegg for the board, which includes the CPU soldered in. It supports ECC, though you'll have to find a compact SO-DIMM instead of a standard full-size module. This might be a good choice for a firewall if you need ECC but don't need that much processing power. (People run firewalls on old single-core CPUs, so this should work fine.)
 

Shivansps

Diamond Member
Sep 11, 2013
3,918
1,570
136
IOMMU 2.0 (Windows-only HSA)
Switchable Graphics V7, up from V5.5

Massively improved power and thermal control, which leads to a 2x increase in perf/watt.
CPU & GPU now have boosts.

IOMMU is AMD's VT-d.

About HSA, that's the thing I want to know: I don't see HSA listed as supported on the slide, which makes no sense to me.
Boosts are great, AMD is finally coming to its senses, but I'm very skeptical about the 2x perf/watt claim; it seems like too much for the same process.

But the missing HSA really puzzles me. I was starting to think Puma+ is nothing more than Jaguar+ with a name change and maybe some minimal changes.
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
Coherent != accessible from GPU. The two modules' L2 is also coherent, but they can't access each others' cache.
All L2 CUs are connected, coherent, and accessible by each other.
IOMMU is AMD's VT-d.
IOMMU 2.0 isn't just AMD-Vi or Intel's VT-d.
http://www.linux-kvm.org/wiki/images/b/b1/2011-forum-amd-iommuv2-kvm.pdf

IOMMU v2 is required for HSA/hQ/hUMA for Windows.
IOMMU v2.5 is required for HSA/hQ/hUMA for the rest of the OSes.

https://www.youtube.com/watch?v=Wt-oRrk-tZQ
https://www.youtube.com/watch?v=GtYlcTeBFfo

IOMMUv2+ is the HSA-MMU.
But the missing HSA really puzzles me. I was starting to think Puma+ is nothing more than Jaguar+ with a name change and maybe some minimal changes.
It was meant to be Jaguar+, not Puma+, as Puma is the enhanced version of the Jaguar core. Don't rely heavily on AMD's marketing slides, as they have been known to be inaccurate at times.

--
---
----
The L3 cache on the Orochi dies can be virtually separated into 2MB partitions. If it is not partitioned, the entire L3, all 8MB of it, is available to all modules.

- L3 Cache Partitioning -
Allows customers to associate Bulldozer modules with L3 sub-caches so that each Bulldozer module can be guaranteed a certain amount of L3 cache (one Bulldozer module cannot monopolize the whole L3). With a minimum partition size of 2MB, the cache can be dynamically allocated through the software task scheduler. This can be done through the hypervisor or at the kernel level.

So the L2-to-L3 ratio per module can range from 1 to 1 at minimum up to 1 to 4 at maximum.
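To make the ratios concrete, here is a rough Python sketch of the bookkeeping. The 8MB total and the 2MB minimum partition come from the post above; the 2MB of L2 per module, the function names, and the dictionary layout are assumptions made for illustration, not AMD's actual interface.

```python
L3_TOTAL_MB = 8        # total L3 on the die, per the post
SUBCACHE_MB = 2        # minimum partition size, per the post
L2_PER_MODULE_MB = 2   # assumption: each module has a 2 MB L2


def partition_l3(subcaches_per_module):
    """Return {module: L3 MB} for a hypothetical sub-cache assignment."""
    if any(n < 1 for n in subcaches_per_module.values()):
        raise ValueError("each module needs at least one sub-cache")
    total_mb = sum(subcaches_per_module.values()) * SUBCACHE_MB
    if total_mb > L3_TOTAL_MB:
        raise ValueError("assignment exceeds the available L3")
    return {m: n * SUBCACHE_MB for m, n in subcaches_per_module.items()}


def l3_per_l2(l3_mb):
    """L3 capacity visible to a module, in multiples of its L2."""
    return l3_mb / L2_PER_MODULE_MB


# Fully partitioned: every module is guaranteed exactly one sub-cache.
even = partition_l3({0: 1, 1: 1, 2: 1, 3: 1})
print(even, "->", l3_per_l2(even[0]), "x L2 per module")

# Unpartitioned: any module may end up using the whole L3.
print("shared ->", l3_per_l2(L3_TOTAL_MB), "x L2 per module")
```

The two printed cases correspond to the 1 to 1 and 1 to 4 extremes; anything in between is an uneven assignment of the four sub-caches.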
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
About HSA, that's the thing I want to know: I don't see HSA listed as supported on the slide, which makes no sense to me.
Boosts are great, AMD is finally coming to its senses, but I'm very skeptical about the 2x perf/watt claim; it seems like too much for the same process.

It's not performance/watt, it's performance/TDP, by AMD's own admission. AMD can be credited for always pushing creative marketing boundaries.
 

PPB

Golden Member
Jul 5, 2013
1,118
168
106
Where did AMD say that?

Because perf is 2x and TDP stays the same. He is upset because he probably thinks AMD is like Intel in the mobile space, where TDP becomes meaningless and it's OK to exceed your own TDP by 10 or more watts.

Nope. Ironically, AMD does the opposite of what it does on desktop: its mobile parts' TDP ends up being more or less the real load power consumption. So in this case, 2x perf at the same TDP = 2x perf/watt.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
Because perf is 2x and TDP stays the same. He is upset because he probably thinks AMD is like Intel in the mobile space, where TDP becomes meaningless and it's OK to exceed your own TDP by 10 or more watts.

Nope. Ironically, AMD does the opposite of what it does on desktop: its mobile parts' TDP ends up being more or less the real load power consumption. So in this case, 2x perf at the same TDP = 2x perf/watt.

Actually, it is more performance at a lower TDP. Their metric is entirely relevant, unlike the erratic and unsubstantiated SDP numbers.

Where did AMD say that?

Here, and it seems that some people are bothered by the numbers and by AMD's confidence...

“AMD is establishing excellent momentum this year in the low-power, mobile computing market and with ‘Mullins’ and ‘Beema’ coming in 2014 we are not standing still. AMD aims to deliver a set of platforms in the first half of next year that will outperform the competition in graphics and total compute performance in fanless tablets, 2-in-1s and ultrathin notebooks,” said Mark Papermaster


The new 2014 AMD A-Series low power APU platform, codenamed “Mullins,” is expected to deliver up to 139 percent better productivity performance per watt when compared to the previous generation “Temash” platform. Testing conducted by AMD Performance Labs on optimized AMD reference systems. PC manufacturers may vary configuration yielding different results. PCMark 8 - Home score divided by TDP (W) is used to simulate productivity performance per watt; the Mullins platform (4.5W) scored 1809 while the Temash platform (8W) scored 1343. AMD "Larne" reference platform system used for both APUs. Temash-based AMD A6-1450 quad-core APU with AMD Radeon™ HD 8250 Graphics, 2x2GB of DDR3-1333MHz RAM (running at 1066MHz), Windows 8.1, 13.200.11.0 - 03-Sep-2013 driver. Pre-production engineering sample of “Mullins” quad-core APU with next generation AMD Radeon graphics (model number TBD), 2x2GB DDR3-1333MHz RAM, Windows 8.1, and unreleased reference driver. MUN-3
The new 2014 AMD A-Series mainstream APU platform, codenamed “Beema,” is expected to deliver up to 104 percent better productivity performance per watt when compared to the previous generation “Kabini” platform. Testing conducted by AMD Performance Labs on optimized AMD reference systems. PC manufacturers may vary configuration yielding different results. PCMark 8 - Home score divided by TDP (W) is used to simulate productivity performance per watt; the Beema platform (15W) scored 2312 while the Kabini platform (25W) scored 1861. AMD "Larne" reference platform system used for both APUs. Kabini-based AMD A6-5200 quad-core APU with AMD Radeon™ HD 8400 Graphics, 2x2GB of DDR3-1600MHz RAM, Windows 8.1, 13.200.11.0 - 03-Sep-2013 driver. Pre-production engineering sample of “Beema” quad-core APU with next generation AMD Radeon graphics (model number TBD), 2x2GB DDR3-1600MHz RAM, Windows 8.1, and unreleased reference driver. BMN-3
http://www.amd.com/us/press-releases/Pages/amd-2014-mobile-apu-2013nov13.aspx
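For reference, the footnote metric (PCMark 8 Home score divided by rated TDP) is easy to reproduce. The short Python sketch below just redoes that arithmetic using the scores and TDPs quoted above; the function and variable names are made up for illustration, and nothing here measures actual power draw.

```python
def perf_per_tdp(score, tdp_w):
    """PCMark 8 Home score divided by rated TDP in watts."""
    return score / tdp_w


def percent_better(new, old):
    """Relative gain of `new` over `old`, in percent."""
    return (new / old - 1.0) * 100.0


# (score, TDP in W) pairs copied from the footnotes quoted above.
mullins, temash = (1809, 4.5), (1343, 8)
beema, kabini = (2312, 15), (1861, 25)

mullins_gain = percent_better(perf_per_tdp(*mullins), perf_per_tdp(*temash))
beema_gain = percent_better(perf_per_tdp(*beema), perf_per_tdp(*kabini))
mullins_raw = percent_better(mullins[0], temash[0])
beema_raw = percent_better(beema[0], kabini[0])

print(f"Mullins vs Temash: {mullins_gain:.0f}% per TDP watt, {mullins_raw:.0f}% raw score")
print(f"Beema vs Kabini:   {beema_gain:.0f}% per TDP watt, {beema_raw:.0f}% raw score")
```

On these figures the Mullins line works out to about 139 percent, matching the footnote. The Beema line works out to about 107 percent rather than the quoted 104, and the footnote does not explain the difference. In both cases the raw score gain is much smaller (roughly 35 and 24 percent); the rest comes from dividing by a lower rated TDP.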
 

SiliconWars

Platinum Member
Dec 29, 2012
2,346
0
0
Seems fair enough. We keep hearing the comparison between Bay Trail vs the "15W" A4-5000 after all.
 

Abwx

Lifer
Apr 2, 2011
11,885
4,873
136
Seems fair enough. We keep hearing the comparison between Bay Trail vs the "15W" A4-5000 after all.

From the numbers above we can see that the 4.5W TDP Mullins will have the performance of the 15W TDP Kabini. It is also mentioned at the bottom that the numbers were measured on:

Pre-production engineering sample of “Mullins” quad-core APU with next generation AMD Radeon graphics (model number TBD)

Pre-production engineering sample of “Beema” quad-core APU with next generation AMD Radeon graphics (model number TBD)
So these are real silicon numbers...
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Because perf is 2x and TDP stays the same. He is upset because he probably thinks AMD is like Intel in the mobile space, where TDP becomes meaningless and it's OK to exceed your own TDP by 10 or more watts.

That *may* have been the case of their small core line until now, but what about the new Puma chips? What's the secret sauce to get 2x efficiency at the same node?

Mark my words: with Beema and Mullins AMD is playing with its TDP specs big time, and because of that the chips will get lambasted by every reviewer out there, except for SemiAccurate and maybe Phoronix.
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,692
136
Mark my words: with Beema and Mullins AMD is playing with its TDP specs big time, and because of that the chips will get lambasted by every reviewer out there, except for SemiAccurate and maybe Phoronix.
While that might be true, why not just wait and see the reviews before proclaiming it as a fact?
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
While that might be true, why not just wait and see the reviews before proclaiming it as a fact?

Because I don't believe in fairy tales, nor in 2x efficiency from an architectural update at the same node.
 

inf64

Diamond Member
Mar 11, 2011
3,884
4,692
136
Ok but still claiming you are correct without any evidence to back it up besides your own opinion is kinda pointless.
 

raghu78

Diamond Member
Aug 23, 2012
4,093
1,476
136
That *may* have been the case of their small core line until now, but what about the new Puma chips? What's the secret sauce to get 2x efficiency at the same node?

Firstly, Kabini and Temash did not have an aggressive turbo implementation like Bay Trail. In fact, Kabini had no turbo at all, and the A6-1450 was the only Temash model to have it. It looks like AMD was limited by the TSMC 28nm process and time-to-market design constraints with Kabini / Temash. When only the CPU is heavily loaded, in workloads like Cinebench R11.5, better power sharing between CPU and GPU will allow the CPU to use the full available power budget. Also, next year the TSMC 28nm process will be in its 3rd year of production and very mature. This would allow an aggressive turbo implementation and better yields.
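A toy Python model of the power-sharing point, for illustration only. The 15W budget, the 50/50 static split, and the 1W idle GPU draw are hypothetical numbers chosen for the example, and the function names are made up; this says nothing about how AMD's actual power management works.

```python
def cpu_budget_static(tdp_w, cpu_share):
    """Static split: the CPU never gets more than its fixed share of TDP."""
    return tdp_w * cpu_share


def cpu_budget_shared(tdp_w, gpu_draw_w):
    """Shared budget: the CPU may use whatever the GPU is not drawing."""
    return max(tdp_w - gpu_draw_w, 0.0)


TDP_W = 15.0  # hypothetical package TDP

# CPU-only workload: the GPU sits near idle.
static_w = cpu_budget_static(TDP_W, cpu_share=0.5)
shared_w = cpu_budget_shared(TDP_W, gpu_draw_w=1.0)

print(f"static split: CPU may draw up to {static_w:.1f} W")
print(f"shared budget: CPU may draw up to {shared_w:.1f} W")
```

Under these made-up numbers the CPU gets nearly twice the power headroom in a CPU-only workload without the package TDP changing at all, which is the kind of gain a turbo implementation can convert into higher clocks.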

http://www.anandtech.com/show/6981/...ality-of-mainstream-pcs-with-its-latest-apu/2

"A big issue here is Kabini, at least in its launched versions, lacks any turbo core support. The 15W A4-5000 runs even single threaded tasks as if all four cores were active and eating into that TDP budget. The fastest Jaguar implementation seems to be 2GHz, but even if the A4-5000 could turbo up to that level I feel like I’d still want a bit more. There’s obviously room on the table for a Kabini refresh, even at 28nm."

With Puma, AMD also has the opportunity to tweak the Jaguar core for higher IPC at the cost of a slight increase in core size.

http://www.anandtech.com/show/7514/amd-2014-mobile-apu-update-beema-and-mullins

"AMD hasn’t disclosed clock speeds or anything else for the upcoming APUs, but given A6-1450 is clocked at 1000-1400MHz with the GPU core running at 300-400MHz, it is possible AMD was able to arrive at the above performance increases simply with higher clock speeds. Also possible is that similar to the Bobcat to Jaguar transition, AMD tweaked other elements of the Puma core (e.g. the scheduler could have more entries)."

With the Sony PS4, based on the same Jaguar core, reportedly boasting a 2.75 GHz turbo, the TSMC process is now in good shape to yield high-clocking enhanced Jaguar (Puma) designs.

Mark my words: with Beema and Mullins AMD is playing with its TDP specs big time, and because of that the chips will get lambasted by every reviewer out there, except for SemiAccurate and maybe Phoronix.
Your words mean nothing. AMD has clearly come out with a 4.5W TDP and 2W SDP specification for Mullins. They have posted results from actual silicon running benchmarks. You don't know anything about the improvements to the core (IPC increase, better turbo, deeper sleep states for lower idle power) or the process-maturity improvements (better yields even at higher clocks), and yet here you are telling us what's possible and what isn't. Stop the crap.

http://www.anandtech.com/show/7514/amd-2014-mobile-apu-update-beema-and-mullins
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
Your words mean nothing. AMD has clearly come out with a 4.5W TDP and 2W SDP specification for Mullins. They have posted results from actual silicon running benchmarks.

Let me refresh your memory, because you seem not to have read the disclaimer in AMD's presentation:

AMD said:
http://www.slideshare.net/AMD/amd-mobility-apu-lineup-announcement


Home score divided by TDP (W) is used to simulate productivity performance per watt (...)

Why did they have to put in a disclaimer like that if they were really sure that performance per watt actually doubled? Why can't they report actual measured power consumption? Maybe AMD is saving on electricity measuring devices? Do you think that power management and a few tweaks to the core alone would be enough to double performance/watt? Nobody has ever achieved what AMD says it achieved here, and the fact that they are not giving actual measurements just makes it worse.

As for AMD coming out with any kind of specification, it wouldn't be the first time they breached one. There are plenty of "125W" FX chips being sold out there; for them to start doing the same thing in the mobile market wouldn't be much of a stretch.

In any case, feel free to believe AMD's claims. It wouldn't be the first time that AMD lies like that, and I doubt it will be the last. AMD marketing is a gift that keeps on giving.
 

SiliconWars

Platinum Member
Dec 29, 2012
2,346
0
0
So what you're basically saying, in your own special way, mrmt, is that AMD rated their Kabini TDPs far too highly. Interestingly enough, that's what you might believe if you compared power draw to Intel's "17W" chips.

http://www.tomshardware.com/reviews/kabini-a4-5000-review,3518-13.html

Or you might believe that when it comes to getting creative with measuring TDP, Intel has a massive lead. Either way nothing you're claiming AMD will do with Beema or Mullins makes any kind of sense.
 

AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Because I don't believe in fairy tales, nor in 2x efficiency from an architectural update at the same node.

1. Jaguar has no turbo. It cannot give the entire TDP to the CPU alone, nor downclock the CPU and raise the iGPU frequency until it hits the TDP/voltage/thermal limit. So the CPU and the iGPU in Jaguar can only run at a fixed frequency, even if the chip never reaches its TDP limit.
That will change with Puma+.

2. They may still use the 28nm TSMC process but they can also change the process to TSMC 28nm HPM for the new SoCs.

http://www.tsmc.com/english/dedicatedFoundry/technology/28nm.htm
TSMC also provides high performance for mobile applications (HPM) technology to address the need for applications requiring high speed as well as low leakage power. Such technology can provide better speed than 28HP and similar leakage power as 28LP. With such wide performance/leakage coverage, 28HPM is also ideal for many applications from networking, tablet, to mobile consumer products.

So, just because you lack the knowledge doesn't mean AMD lies. :rolleyes:
 

NostaSeronx

Diamond Member
Sep 18, 2011
3,811
1,290
136
"L2 CUs"? What are you talking about? A CU is a collection of GPU shader cores, not a cache.
A compute unit is:
a Jaguar module, a Steamroller module, or a Graphics Core Next SIMD.

In the case of Steamroller and GCN 1.1, the L2s connected to the CUs are shared and coherent through a 256-bit data bus.
 