
David Kanter: AMD's ARM core will be 10% faster than their x86 one, ditch Bulldozer

With Intel's recent announcements around fabbing (Altera, Panasonic, etc.), it seems like they see this coming.

I think that part is simply to put TSMC and/or Samsung in the ground. The two are already balanced on a knife's edge in their survival game over who gets Apple to pay the bill.
 
Intel makes a huge margin on their Xeon parts, and the rest of the industry (both their customers AND their competitors) hates this. AMD making ARM servers is about commoditizing the market even more than Intel did with Xeon. Xeon brought down the big iron guys by selling commodity hardware that was "good enough"; I imagine AMD hopes to do the same here. If AMD's parts are a bit worse in performance/W, they'll sell them cheaper, or rather, /much/ cheaper. There are two goals here: make some money on selling ARM SoCs that are good in their own right for some very specific workloads; and damage Intel's margins. Intel's huge margins are what funds their process lead, so the argument would be that in the long run it would damage Intel's lead indirectly too. With Intel's recent announcements around fabbing (Altera, Panasonic, etc.), it seems like they see this coming.

Selling worse perf/watt chips for cheaper was the Magny Cours/Bulldozer strategy. How'd that work out?
 
AMD sell their APUs on the cheap and seem to be recovering a bit; I just imagined they may try to do the same in servers?

AMD isn't recovering, they are losing. Not to mention they jacked up the price of their APUs with Kaveri.

They also sell server chips for cheap. Not working either.
 
Thanks for backing me up ch424 🙂

But, I'm going to have to politely - and somewhat reservedly - disagree with what you said here:

Clearly they're in the cross-over area where it does matter; somewhere very much below Haswell-EX/POWER8 but quite a bit above (possibly; I'm guessing) Silvermont/Cortex-A12 levels.

I think there's quite a bit of evidence that goes against this. Consider what Jim Keller told AT:

Jim Keller added some details on K12. He referenced AMD's knowledge of doing high frequency designs as well as "extending the range" that ARM is in. Keller also mentioned he told his team to take the best of the big and little cores that AMD presently makes in putting together this design.
A Silvermont/A12 level core is going to be far below that sort of mark. For that matter, it even falls below the level set by the Cortex-A57s in AMD's Seattle, to be released this year. A57 is not exactly Haswell level, or Cyclone level for that matter, but it's still quite a bit wider and deeper than A12 or A17. From all of this information I very much think that AMD is doing a custom core so they can go beyond the performance of an ARM design, and incorporating elements from the construction and cat cores (note that cat also already goes beyond Silvermont to an extent) reaffirms this.

I agree absolutely that the wider, faster, and bigger you go with a core design, the less the x86 overhead hits you. But without empirical data it's hard to nail down exactly at which point it fades into the noise. The only ones in much of a position to analyze this are those who are intimately involved in designing the CPUs and understand the tradeoffs made. Someone like Jim Keller can at least run simulations to determine the impact of the pieces of the pipeline that are catering to x86.

I mentioned uop cache earlier, I'd like to follow that up with this additional insight from David Kanter:

The uop cache is one of the most promising features in Sandy Bridge because it both decreases power and improves performance. It avoids power hungry x86 decoding, which spans several pipeline stages and requires fairly expensive hardware to handle the irregular instruction set. For a hit in the uop cache, Sandy Bridge’s pipeline (as measured by the mispredict penalty) is several cycles shorter than Nehalem’s, although in the case of a uop cache miss, the pipeline is about 2 stages longer.
http://www.realworldtech.com/sandy-bridge/4/

Scaling an x86 design will introduce a lot of elements that start to minimize the cost of decoders. But as you scale up the rest of the system in performance you still have to scale up the decoders. Complex instruction encoding means a longer serialized decode time, which means more pipeline stages - in addition to increasing your branch mispredict penalty, more stages mean more area and power dedicated to saving and forwarding state, as well as more clock distribution. And the cost of decoders doesn't merely increase linearly as you add them, because instruction positions depend on each other - even if you add predecode bits to the L1 icache you still have to scan for the instruction boundaries first. And chunking up the instructions along boundaries doesn't equalize everything: x86 still has complex variable offsets within the instruction based on prefixes (REX effectively being a common prefix), opcode type (variable-size opcodes), presence of operand extension bytes, and combination of address and immediate fields.
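Purely to illustrate the boundary-scanning problem, here's a toy sketch in Python. The encoding below is made up (the prefix bytes, escape byte, and immediate-carrying opcodes are invented, this is not a real x86 decoder); the point is only that the start of instruction N+1 isn't known until instruction N has been length-decoded, whereas a fixed 4-byte encoding needs no scan at all.

```python
# Toy illustration only: a made-up variable-length encoding loosely inspired by
# x86 (prefix bytes, 1- or 2-byte opcodes, optional immediate), NOT real x86.
# The point: the start of instruction N+1 is unknown until instruction N has
# been length-decoded, so boundary detection is a serial dependency chain.

PREFIX_BYTES = {0x66, 0x67, 0xF0}      # hypothetical "prefix" bytes
TWO_BYTE_OPCODE_ESCAPE = 0x0F          # hypothetical escape to a 2-byte opcode
OPCODES_WITH_IMM32 = {0xB8, 0xE8}      # hypothetical opcodes with a 4-byte immediate

def toy_insn_length(code, pos):
    """Return the length of the toy instruction starting at code[pos]."""
    length = 0
    while code[pos + length] in PREFIX_BYTES:   # prefixes shift everything after them
        length += 1
    opcode = code[pos + length]
    length += 1
    if opcode == TWO_BYTE_OPCODE_ESCAPE:        # variable-size opcode
        opcode = code[pos + length]
        length += 1
    if opcode in OPCODES_WITH_IMM32:            # optional immediate field
        length += 4
    return length

def find_boundaries(code):
    """Serial scan: each boundary depends on the length of the previous insn."""
    boundaries, pos = [], 0
    while pos < len(code):
        boundaries.append(pos)
        pos += toy_insn_length(code, pos)
    return boundaries

# A fixed-width ISA needs no scan at all: boundaries are simply every 4th byte.
fixed_width_boundaries = lambda code: list(range(0, len(code), 4))

demo = bytes([0x66, 0xB8, 1, 0, 0, 0, 0x0F, 0x05, 0x90])
print(find_boundaries(demo))   # -> [0, 6, 8]
```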

The uop cache is a nice alternative to this but it does come at a cost. I described this in that post earlier, but to give some extra information, the mapping from icache lines to uop cache lines is not trivial. The cache lines are tagged by the address of the first instruction in the line, but lookups are done to arbitrary instructions in the line. So the lines must maintain a list of instruction offsets that must be scanned to find out which uops to return, at least in the case where the code isn't known to be continuing to the next sequential uop cache line. And since each icache line corresponds to 1-3 uop cache lines, there needs to be some mechanism for sequencing multiple ways from a line lookup. One other thing is that you either need to take on extra latency on a uop cache miss, or start looking up the L1 icache tags in parallel with the uop cache, which uses more power. Intel appears to have done the latter, in the spirit of not allowing the uop cache (which doesn't have an amazing hit rate) to degrade performance vs the L1 icache.
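To make the lookup mechanics concrete, here's a simplified model of what's described above. The class layout and field names are my own invention for illustration; they are not Intel's actual structures.

```python
# Simplified model of the lookup mechanics described above; field names and
# sizes are invented for illustration and are not Intel's actual structures.

class UopCacheLine:
    def __init__(self, tag, entries):
        self.tag = tag              # address of the first x86 instruction in the line
        self.entries = entries      # list of (instruction_offset, uop_slot) pairs

    def lookup(self, fetch_addr):
        """Jumping into the middle of a line: scan the offset list to find
        which uop slot corresponds to the requested instruction address."""
        offset = fetch_addr - self.tag
        for insn_offset, uop_slot in self.entries:   # serial scan over offsets
            if insn_offset == offset:
                return uop_slot
        return None                                  # not in this line -> uop cache miss

# Each icache line may map to several uop cache lines (1-3 in the text), so a
# hit may have to sequence through multiple ways found by the same tag lookup.
line = UopCacheLine(tag=0x1000, entries=[(0, 0), (3, 2), (7, 4)])
print(line.lookup(0x1003))   # entering mid-line still resolves, via the scan
print(line.lookup(0x1005))   # address isn't a recorded instruction start -> miss
```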

Bottom line, it's not negligible. A lot of that could just be that it's a bunch of work to get right, I don't pretend to have real numbers on performance/efficiency/area impact.

Also, I don't really know the details on this, but I suspect that x86's memory model imposes additional expenses even in OoO CPUs that implement speculative memory ordering/disambiguation/alias prediction because of the read ordering requirements. But I could be totally wrong about that one. An x86 core will have to maintain coherency between icache and dcache where an ARM core won't, but there are good arguments for the ARM core to do this anyway (although I don't know any that do?) so that's kind of moot.
 
Selling worse perf/watt chips for cheaper was the Magny Cours/Bulldozer strategy. How'd that work out?

The problem is that these chips are way worse than competing Intel products - both in power consumption and in IPC.

AMD's process disadvantage gets talked up a lot, but it's not their primary problem at this time. Remember, Intel did Sandy Bridge on 32nm; if AMD had an x86 core that was as powerful and efficient as that, you wouldn't see nearly so many complaints from enthusiasts. (Plenty of gamers still run an overclocked i5-2500K.) And if the process disadvantage were the only driver of power efficiency differences, then AMD would be able to come close enough to make a serious challenge in the server space.

The real issue is simply that the Bulldozer design sucks.
 
The problem is that these chips are way worse than competing Intel products - both in power consumption and in IPC.

AMD's process disadvantage gets talked up a lot, but it's not their primary problem at this time. Remember, Intel did Sandy Bridge on 32nm; if AMD had an x86 core that was as powerful and efficient as that, you wouldn't see nearly so many complaints from enthusiasts. (Plenty of gamers still run an overclocked i5-2500K.) And if the process disadvantage were the only driver of power efficiency differences, then AMD would be able to come close enough to make a serious challenge in the server space.

The real issue is simply that the Bulldozer design sucks.

Why is it so difficult to stay on topic?
 
As has been mentioned just a bit in this thread, one of AMD's main concepts with ARM is to offer the customization and partnerships it is not allowed to do with x86 due to its licensing agreements with Intel. Note AMD promoting itself as a design house in the last year or so and offering to integrate customers' logic/modules with AMD IP.
 
I imagine AMD hopes to do the same here. If AMD's parts are a bit worse in performance/W, they'll sell them cheaper, or rather, /much/ cheaper.

The problem is that performance/W is pretty much the most important metric in servers. Not only do you save on power cost directly, but also indirectly (less cooling). Then Xeons cost what, $2000-$6000? How much does the full server cost including the software it is running? That will quickly reach >$100K. So even if your server chip is only $100, you won't save much of the total server cost.
 
The problem is that performance/W is pretty much the most important metric in servers. Not only do you save on power cost directly, but also indirectly (less cooling). Then Xeons cost what, $2000-$6000? How much does the full server cost including the software it is running? That will quickly reach >$100K. So even if your server chip is only $100, you won't save much of the total server cost.
Why would $100 or so a year in power cost (3 kWh a day) be important, but several hundred to a thousand dollars in CPU cost not be?
Of course the CPU cost is important as well, especially in big server farms. It is by keeping costs down that a server farm stays profitable and competitive.

In a mature and competitive industry a lot of effort is spent on reducing cost everywhere, with specific engineering teams whose only purpose is to find or develop cost reductions.
Of course a server farm will try to find cheaper CPU solutions as well, and they will be eager to try out the different ARM solutions that are on the way.
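To put rough numbers on both arguments, here's a back-of-envelope sketch. The electricity rate, amortization period, and Xeon/ARM chip prices are assumptions for illustration (only the 3 kWh/day figure comes from the post above), not figures anyone in the thread has confirmed.

```python
# Back-of-envelope comparison of the two arguments above. All prices are
# assumptions for illustration (electricity rate, CPU prices, 4-year life),
# not figures from the thread.

kwh_per_day   = 3.0        # per-socket figure used in the post above
elec_rate     = 0.10       # $/kWh, assumed
years         = 4          # assumed amortization period

power_cost    = kwh_per_day * 365 * years * elec_rate   # ~$438 over the life
cpu_cost_xeon = 2000                                     # low end of the quoted Xeon range
cpu_cost_arm  = 100                                      # the "$100 server chip" from the post

print(f"power over {years} yrs: ${power_cost:,.0f}")
print(f"CPU price delta:        ${cpu_cost_xeon - cpu_cost_arm:,}")
# Under these assumptions the CPU price delta is actually larger than the
# lifetime power bill, and both are small next to the quoted >$100K system cost.
```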
 
This line of reasoning doesn't make any sense. Saying that ARM means low frequency isn't any better than saying that x86 means high power.

AMD has said that the x86 core released alongside the ARM-based K12 (it's unknown as of yet what it will be called) is built from scratch. It doesn't matter what came before it. Likewise, it doesn't matter what other companies have done with their ARM cores. All marketing thus far has put the new x86 and ARM cores in the same category.

I don't disagree, I just gave an explanation that isn't too unreasonable. We've never seen an ARM core at 3GHz, but plenty of x86 cores that went to 5GHz.
 
I don't disagree, I just gave an explanation that isn't too unreasonable. We've never seen an ARM core at 3GHz, but plenty of x86 cores that went to 5GHz.

I always thought this was because ARM has been used almost exclusively in mobile platforms, so it is designed for low power/frequency and then fabbed on processes meant for low clock speeds and power saving. Kinda like how Intel is now using a special 14nm process to fab the Core M line.
 
If you don't design it well, your CPU won't go above 1-2GHz. High-end SoCs are made on a regular HPM process. Qualcomm's Krait now tops out at 2.7GHz, up from 1.9GHz a year ago, although I wonder at what voltage. It has nothing to do with the instruction set, but it could make you cautious about claims of high frequencies.
 
Problem is that power consumption goes through the roof when ARM scales up in that frequency range. Just like x86 for that matter.

TSMC made a 3.1GHz A9 on 28nm HPM as a test, and it was anything but power efficient.

Kinda funny as well, since the usual super-optimistic projection was mass-produced 3GHz ARM chips on 20nm in 2014.
 
ARM SoCs have their place, but let's be real here. Once you get past the low-power-envelope devices where low power matters and nothing else, things quickly change. Intel pretty much dominates the high-performance area for consumer devices; this is why you won't see high-end ultrabooks or MacBook Pros with ARM SoCs anytime soon - the performance difference at the high end is very much tilted in favor of Intel.

That isn't to say ARM SoCs don't have their place. And they can evolve over time. Yet as things stand, I see a major convergence happening. For one, Intel is reaching performance-per-watt levels easily comparable to ARM SoCs in their newer core architectures, with better performance to boot. They're not quite where they need to be yet, but they're evolving at a much faster pace than ARM SoCs are. Yet I don't see ARM SoCs creeping into the high end anytime soon. The performance difference at this time is pretty vast, and will stay that way for some time. Again, that doesn't mean that ARM SoCs won't have their place - perhaps some servers/data centers will benefit from this, but it isn't like Intel isn't improving their performance per watt by leaps and bounds every year either. Intel is definitely improving a lot with every iteration.

Anyway, perhaps AMD can carve out their niche with their ARM-based K12 core. It fits right in with their SeaMicro acquisition, since SeaMicro previously sold relatively low-power, efficient servers. So this would fit the bill. But a Xeon replacement it will not be, IMO. The performance difference will still be large (I think) and Intel has been continually improving performance per watt. But competition is always a good thing, so it'll be interesting to see how AMD fares.
 
Problem is that power consumption goes through the roof when ARM scales up in that frequency range. Just like x86 for that matter.
Definitely. And that's why, in the context of a 5 GHz x86 (which I have personally never seen), the comparison against a 3 GHz ST A9 makes sense. So basically witeken's original claim is wrong, at least IMHO 🙂
 
Power consumption doesn't really start going through the roof until about 3.5GHz. A Chromebook with a 3GHz boost CPU should be possible, maybe on the 20nm process in 2015 (although they're finally starting to realize that you're better off increasing IPC instead of frequency).
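As a crude illustration of why power climbs so steeply at the top of the range: dynamic power scales roughly with f·V², and past a certain frequency the voltage has to come up too. The voltage/frequency points below are invented for illustration, not measured from any real chip.

```python
# Crude illustration of the frequency/power wall. Dynamic power scales roughly
# as C * V^2 * f; beyond a certain frequency the voltage must be raised to keep
# the chip stable, so power grows much faster than clock speed. The V/f points
# below are invented for illustration, not measured from any real part.

def relative_dynamic_power(freq_ghz, volts, base_freq=2.0, base_volts=0.9):
    """Power relative to the (base_freq, base_volts) operating point."""
    return (freq_ghz / base_freq) * (volts / base_volts) ** 2

for f, v in [(2.0, 0.90), (2.5, 0.95), (3.0, 1.05), (3.5, 1.20), (4.0, 1.35)]:
    print(f"{f:.1f} GHz @ {v:.2f} V -> {relative_dynamic_power(f, v):.1f}x power")
# Prints roughly 1.0x, 1.4x, 2.0x, 3.1x, 4.5x: doubling the clock in this sketch
# costs more than 4x the dynamic power once the voltage bump is included.
```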
 
In this context, high-end x86 is around 1.5-2.5GHz (with 8+ cores).

Any Xeon replacement will be against Atom-based Xeons, if anything, and not due to highly efficient CPU cores, but efficiency of the whole system, which basically would mean popular hardware offloading that stays a bit ahead of whatever Intel may offer, and/or useful IO support with fewer added chips.

Not a big market window, there, but both AMD and ARM are far more nimble than Intel, and AMD has certainly exploited that in the past, and may find ways to do so again (but maybe this time not with Netburst 2.0 🙂).
 
It seems like the window of opportunity for ARM based servers is closing: http://www.pcworld.com/article/2461180/processor-delays-hurt-arm-server-adoption-dell-exec-says.html

Expectations for the success of ARM servers are diminishing as processors and product releases get delayed, a top Dell executive said.

Dell is offering servers based on ARM processors for testing. But the main advantage of ARM processors—relatively low power consumption—is quickly disappearing as rival chip companies catch up, and this means there is less incentive for customers to invest in switching over from x86 architecture, said Forrest Norrod, general manager for servers at Dell.

“I think quite frankly the ecosystem is developing a little bit more slowly than expected,” Norrod said.

If the K12 ARM CPU were already here on 20nm, it would have a bigger impact on the market. I fear AMD has leaped too late to really open a new market with ARM.

As far as AMD's next-gen x86 goes - I just can't make sense of all the comments made by AMD execs to really have a clue where it's going to fall performance-wise. A big problem for AMD is that their APU business (excluding consoles) is shrinking - so the x86 counterpart to K12 will either be a cat-core replacement (a bit beefier with better perf/watt), or AMD pulls off a miracle and produces a competitive gaming APU in 2016 (really competitive performance with dGPU Intel systems). The latter would be the last hurrah for AMD: they either make it, or downsize again and lose heavily against Intel and Nvidia because of constrained R&D dollars.
 
AMD's computing solutions group saw sales down 20% y-o-y... the "sell uncompetitive APUs cheap" strategy doesn't work so well.

"JPR Reports AMD jumps 11% in GPU shipments in Q2 [2014], Intel up 4%, Nvidia slips
[...]
Quick highlights:
* AMD’s overall unit shipments increased 11% quarter-to-quarter, Intel’s total shipments increased 4% from last quarter, and Nvidia’s decreased 8.3%.
* The attach rate of GPUs (includes integrated and discrete GPUs) to PCs, for the quarter was 139% (up 3.2%) and 32% of PCs had discrete GPUs, (down 3.6%) which mean 68% of the PCs are using the embedded graphics in the CPU.
* The overall PC market increased 1.3% quarter-to-quarter, and decreased 1.7% year-to-year.
* Desktop graphics add-in boards (AIBs) that use discrete GPUs declined 17.5%.
[...]
This report does not include the x86 game consoles, handhelds (i.e., mobile phones), x86 Servers or ARM-based Tablets (i.e. iPad and Android-based Tablets), or ARM-based Servers. It does include x86-based tablets, Chromebooks, and embedded systems."

http://jonpeddie.com/press-releases...-gpu-shipments-in-q2-intel-up-4-nvidia-slips/

[Image: JPR Table 1b]
 