Counting the success of Kabini & Temash

strata8 · Aug 2, 2013

rainy said:
I don't know how reliable PassMark is, but it shows that the A8-4555M (1.6/2.4 GHz) scores 744 points in single thread and 2310 in multi, while the A6-5200 scores 794/2616.
Of course, on the GPU side Trinity is much faster.

The A8-5545M's CPU clock is a bit higher (1.7/2.7 GHz), but the GPU is a different story: 450/554 MHz vs 320/424 MHz on its predecessor.
The difference is even bigger with the A10-4655M (2.0/2.8 GHz) vs the A10-5745M (2.1/2.9 GHz): 360/496 MHz vs 533/626 MHz.

http://www.cpubenchmark.net/singleThread.html

The A8-4555M has pretty bad throttling/turbo issues - something that Richland is supposed to fix. AMD lists the TDP as "19W2", whatever that means.

I used the benchmarks from NotebookCheck, who are generally pretty reliable (they test multiple laptops and average the results). No A6-5200 laptops have been reviewed yet so I just extrapolated from the A4-5000.

Cinebench R11.5 (Single/Multi)
A8-4555M: 0.46/1.23
A4-5000: 0.39/1.49
A6-5200 (est.): 0.52/2.08

3DMark Vantage CPU
A8-4555M: 3526
A4-5000: 4451
A6-5200 (est.): 5919

x264 Benchmark
A8-4555M: 28 fps
A4-5000: 41 fps
A6-5200 (est.): 54 fps

3DMark 11/Vantage/06 GPU
A8-4555M: 628/1,380/3,504
A4-5000: 521/1,497/3,424
A6-5200 (est.): 625/1,796/4,108
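
One plausible way to arrive at estimates like these is to scale the A4-5000 scores linearly with clock speed. The sketch below does that; the clocks are assumptions not stated in the post (A4-5000: 1.5 GHz CPU / 500 MHz GPU, A6-5200: 2.0 GHz CPU / 600 MHz GPU), and `scale_by_clock` is a hypothetical helper, not necessarily the method used above.

```python
# Linear clock-scaling sketch. All clock speeds are assumptions, not from the post.
A4_CPU_MHZ, A6_CPU_MHZ = 1500, 2000   # assumed CPU clocks
A4_GPU_MHZ, A6_GPU_MHZ = 500, 600     # assumed GPU clocks


def scale_by_clock(score, base_mhz, target_mhz):
    """Estimate a score at target_mhz, assuming perfectly linear clock scaling."""
    return score * target_mhz / base_mhz


# A4-5000 results quoted in the post.
a4_cpu = {"cinebench_single": 0.39, "cinebench_multi": 1.49,
          "vantage_cpu": 4451, "x264_fps": 41}
a4_gpu = {"3dmark11": 521, "vantage_gpu": 1497, "3dmark06": 3424}

a6_cpu_est = {k: scale_by_clock(v, A4_CPU_MHZ, A6_CPU_MHZ) for k, v in a4_cpu.items()}
a6_gpu_est = {k: scale_by_clock(v, A4_GPU_MHZ, A6_GPU_MHZ) for k, v in a4_gpu.items()}

for name, value in {**a6_cpu_est, **a6_gpu_est}.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the GPU figures reproduce the posted estimates (about 625/1,796/4,108), while the CPU figures land close to them but not always on them (roughly 1.99 rather than 2.08 for Cinebench multi), so the posted numbers may have been derived somewhat differently.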

mikk · Aug 2, 2013

lefty2 said:
Is the die size of Silvermont even known at this stage? I would have thought that it's smaller than Kabini.

We know the size of Avoton - 107 mm². But Avoton has 8 Silvermont cores on the die while the consumer variants only have 4 cores.

rainy · Aug 2, 2013

strata8 said:
The A8-4555M has pretty bad throttling/turbo issues - something that Richland is supposed to fix. AMD lists the TDP as "19W2", whatever that means.

I used the benchmarks from NotebookCheck, who are generally pretty reliable (they test multiple laptops and average the results). No A6-5200 laptops have been reviewed yet so I just extrapolated from the A4-5000.

Cinebench R11.5 (Single/Multi)
A8-4555M: 0.46/1.23
A4-5000: 0.39/1.49
A6-5200 (est.): 0.52/2.08

3DMark Vantage CPU
A8-4555M: 3526
A4-5000: 4451
A6-5200 (est.): 5919

x264 Benchmark
A8-4555M: 28 fps
A4-5000: 41 fps
A6-5200 (est.): 54 fps

3DMark 11/Vantage/06 GPU
A8-4555M: 628/1,380/3,504
A4-5000: 521/1,497/3,424
A6-5200 (est.): 625/1,796/4,108

...being beaten by the A4-5000 is kinda sad.

I'm really surprised by those results: it looks like the Radeon HD 8400 (A6-5200) could be faster than the Radeon HD 7600G!
I don't know how it is possible that 384 SP (at a significantly lower clock) together with a dual MC could end up slower than 128 SP - there's definitely something wrong.

SiliconWars · Aug 2, 2013

It will probably be the usual problem, stuck with 1 stick of RAM so operating in single channel mode.

monstercameron · Aug 2, 2013

rainy said:
I'm really surprised by those results: it looks like the Radeon HD 8400 (A6-5200) could be faster than the Radeon HD 7600G!
I don't know how it is possible that 384 SP (at a significantly lower clock) together with a dual MC could end up slower than 128 SP - there's definitely something wrong.

I have seen ~160 GFLOPS for the 8400 and ~240 GFLOPS for the 7600G, so I don't know how accurate those GPU results are.
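
Figures in this range fall out of the usual peak-throughput formula (shaders x 2 FLOPs per clock x clock speed). A minimal sketch, taking the shader counts and the 7600G's 320/424 MHz clocks from this thread and assuming a 600 MHz clock for the HD 8400, which is not stated here; `peak_gflops` is a hypothetical helper:

```python
def peak_gflops(shaders, clock_mhz, flops_per_shader_per_clock=2):
    """Theoretical peak single-precision GFLOPS (one fused multiply-add per clock)."""
    return shaders * flops_per_shader_per_clock * clock_mhz / 1000.0


hd7600g_base = peak_gflops(384, 320)    # 384 SP at base clock
hd7600g_turbo = peak_gflops(384, 424)   # 384 SP at turbo clock
hd8400 = peak_gflops(128, 600)          # 128 SP, 600 MHz is an assumption

print(f"HD 7600G: {hd7600g_base:.1f} (base) / {hd7600g_turbo:.1f} (turbo) GFLOPS")
print(f"HD 8400:  {hd8400:.1f} GFLOPS")
```

These are theoretical peaks only; they say nothing about how much of that throughput a bandwidth-starved IGP can actually sustain.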

strata8 · Aug 2, 2013

SiliconWars said:
It will probably be the usual problem, stuck with 1 stick of RAM so operating in single channel mode.

Good catch. I used the averages but they tested one laptop with dual channel memory.

3DMark 11/06 GPU
A8-4555M: 895/4,596
A4-5000: 521/3,424
A6-5200 (est.): 625/4,108

Makes me wonder whether Kabini is limited by the memory at all.

monstercameron · Aug 2, 2013

strata8 said:
Good catch. I used the averages but they tested one laptop with dual channel memory.

3DMark 11/06 GPU
A8-4555M: 895/4,596
A4-5000: 521/3,424
A6-5200 (est.): 625/4,108

Makes me wonder whether Kabini is limited by the memory at all.

Quick answer: it is. Those GCN cores would love some more bandwidth.
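
For a sense of scale, peak DRAM bandwidth is transfer rate x bus width x channel count. The sketch below assumes DDR3-1600 on a 64-bit channel in every case, which is an illustrative choice and not a figure from the tested laptops; `peak_bandwidth_gbs` is a hypothetical helper.

```python
def peak_bandwidth_gbs(mega_transfers_per_s, bus_width_bits=64, channels=1):
    """Theoretical peak memory bandwidth in GB/s."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * bytes_per_transfer * channels / 1000.0


single_channel = peak_bandwidth_gbs(1600, channels=1)  # assumed DDR3-1600, one DIMM
dual_channel = peak_bandwidth_gbs(1600, channels=2)    # assumed DDR3-1600, two DIMMs

print(f"single channel: {single_channel:.1f} GB/s")
print(f"dual channel:   {dual_channel:.1f} GB/s")
```

Whatever bandwidth is available is shared between the CPU cores and the GPU, which is why an IGP is sensitive to it.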

rainy · Aug 2, 2013

SiliconWars said:
It will probably be the usual problem, stuck with 1 stick of RAM so operating in single channel mode.

Yes, that's most probably the only "rational" explanation - many (if not the majority of) laptops with AMD APUs have just a single RAM module.

strata8 said:
Good catch. I used the averages but they tested one laptop with dual channel memory.

3DMark 11/06 GPU
A8-4555M: 895/4,596
A4-5000: 521/3,424
A6-5200 (est.): 625/4,108

Makes me wonder whether Kabini is limited by the memory at all.

The gain would not be about 50 percent as with stronger IGPs (single vs dual channel), but I'm sure that even those 128 SP could be faster with a dual MC.
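
The A8-4555M numbers posted in this thread give a rough idea of the uplift. A small sketch comparing the averaged scores (628/3,504) with the dual-channel laptop (895/4,596); since the averages mix configurations, this is only an approximation of a clean single- vs dual-channel comparison, and `percent_gain` is a hypothetical helper:

```python
def percent_gain(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# A8-4555M scores from the thread: averaged results vs the dual-channel laptop.
gain_3dmark11 = percent_gain(628, 895)
gain_3dmark06 = percent_gain(3504, 4596)

print(f"3DMark 11 GPU: +{gain_3dmark11:.1f}%")
print(f"3DMark 06:     +{gain_3dmark06:.1f}%")
```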

mrmt · Aug 2, 2013

SiliconWars said:
The mistake was assuming the industry would be quick to embrace 8 cores, when the reality has been a long slow slog. BD would be fine if most software was properly multi-threaded, the benchmarks show this. In many ways it was an architecture ahead of its time.

You can make this argument if you are talking about the engineers circa 2005 who were thinking about Bulldozer and pushed forward its current concept (many cores, high clocks, low IPC), but the same cannot be said about the decoder.

I do think the decoder size is anything but obvious. If the decoder is supposed to make a *huge* difference in performance, why was it cut down in the first place? The cost of the processor is defined by its die size and R&D costs, but one of the components of the price of the processor is its performance. Bulldozer was already a *huge* die by consumer standards, so why not spend a bit more in order to earn much better prices on the market? This is the kind of simple decision that any engineer would make, and no bean counter in management could reasonably oppose it because it would be fairly straightforward.

Complex decisions such as defining the size of the decoder aren't taken in a vacuum. You must take into consideration the number of units, the caches, the pipeline length and whatnot to determine how much you'll feed your pipeline per cycle. There is no point in being able to issue 4 instructions per clock if the pipeline is stalled most of the time. And here's the reason I think AMD went for a shared decoder: with atrocious cache latencies and hit rates and a lengthy pipeline, they could not really afford a 4-wide decoder.

And this brings us to Steamroller. It has the same number of units and the units themselves seem to be unchanged. What they are doing is going for more L1 cache and better branch prediction, and now the trade-off for a bigger decoder favors a wider solution. What remains to be seen is by how much, but it's surely not the 20-30% that some people here expect.
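
The stall argument can be illustrated with the textbook CPI model, where cycles per instruction are a front-end term (1 / width) plus average stall cycles per instruction. This is a toy sketch with made-up stall values, not measured Bulldozer data; `sustained_ipc` and `widening_gain` are hypothetical helpers, and the model optimistically assumes the front end otherwise runs at its full width.

```python
def sustained_ipc(decode_width, stall_cycles_per_instr):
    """Toy model: CPI = 1/width + stalls per instruction; IPC = 1/CPI."""
    cpi = 1.0 / decode_width + stall_cycles_per_instr
    return 1.0 / cpi


def widening_gain(stall_cycles_per_instr, narrow=2, wide=4):
    """Percent IPC gain from widening the decoder at a given stall level."""
    before = sustained_ipc(narrow, stall_cycles_per_instr)
    after = sustained_ipc(wide, stall_cycles_per_instr)
    return (after / before - 1.0) * 100.0


# Hypothetical stall levels (cycles lost per instruction to misses, mispredicts, ...).
for stalls in (0.2, 0.6, 1.0):
    print(f"stalls={stalls:.1f}: 2-wide -> 4-wide gains {widening_gain(stalls):.0f}%")
```

In this model the benefit of a wider decoder shrinks as stalls grow, which matches the point that reducing stalls (caches, branch prediction) has to come first for extra width to pay off.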

SiliconWars said:
But there's no need to worry because AMD won't be doing anything daring like this in future, that's why SR is a return to a more normal architecture. Of course their high end CPUs will totally stagnate in the same way Intel's have, but that's the price we pay for Intel's determination to put margins ahead of progress.

Daring? How was Bulldozer daring? The clocks? The high clock/low IPC concept was dead at Intel when Bulldozer started to be developed, plus IBM had some special high clock products already. FPU sharing? SUN was sharing the FPU among 8 (!) cores. Core density? There's Knights Landing for that.

The only exceptional thing about Bulldozer was a management team that saw the project through to the end rather than cancelling it when they had time and alternatives, despite all the evidence that Bulldozer would be a failure.

monstercameron · Aug 2, 2013

geekbench

http://browser.primatelabs.com/geekbench2/compare/2176197/2191049
[a6-5200(25W) vs. a4-5000(15W)]

http://browser.primatelabs.com/geekbench2/compare/2176197/2182108
[a6-5200(25W) vs. a8-4555m(19W)]

http://browser.primatelabs.com/geekbench2/compare/2176197/1265092
[a6-5200(25W) vs. a10-4655m(25W)] note both at 2GHz(a10 possible turbo)

beginner99 · Aug 2, 2013

krumme said:
Who can find an explanation for CPU revenue being up 12% other than Jaguar sales?

Console APUs. But yeah, they do contain Jaguar cores.

Just a general impression of how prevalent (or not) AMD is outside of the US: in the online shop I use, 9 out of 430 models contain an AMD APU, and one of them is Jaguar (the Acer Aspire).

Idontcare · Aug 2, 2013

mrmt said:
Daring? How was Bulldozer daring? The clocks? The high clock/low IPC concept was dead at Intel when Bulldozer started to be developed, plus IBM had some special high clock products already. FPU sharing? SUN was sharing the FPU among 8 (!) cores. Core density? There's Knights Landing for that.

I thought the cluster concept was pretty cool, the idea of enabling flexible (dynamic) tradeoff between using the hardware for boosting ILP (single-thread on one module) versus using the same hardware for boosting TLP (two-thread on one module).

SiliconWars · Aug 2, 2013

mrmt said:
Daring? How was Bulldozer daring? The clocks? The high clock/low IPC concept was dead at Intel when Bulldozer started to be developed, plus IBM had some special high clock products already. FPU sharing? SUN was sharing the FPU among 8 (!) cores. Core density? There's Knights Landing for that.

It was daring for a lot of reasons. Going for 8 cores was the main one as it was taking a risk on the industry being there. Yes the clocks were daring, especially on a new process (some might call that insane).

Basically speaking, anything AMD does that Intel isn't doing is very daring, because Intel owns the market by virtue of their sheer size even when they aren't operating dubiously. Had the industry gone with more cores, Intel would probably have had to "convince" it otherwise again.

So yes, daring, or foolhardy. Pick one depending on your fanboy favourite.

SiliconWars · Aug 2, 2013

beginner99 said:
Console APUs. But yeah, they do contain Jaguar cores.

Just a general impression of how prevalent (or not) AMD is outside of the US: in the online shop I use, 9 out of 430 models contain an AMD APU, and one of them is Jaguar (the Acer Aspire).

Console revenue is counted under graphics.

Dribble · Aug 2, 2013

These CPUs are AMD strategy all over. A great new market appears (netbooks), pro-active companies move in fast and start to make a packet. AMD have all the bits to make a great netbook CPU but don't pull their finger out and actually produce some hardware until that market is dying. In the meantime the market is moving on (tablets + phones) and they are sat on the sidelines watching others make money again.

sm625 · Aug 2, 2013

The decoder improvements aren't going to do anything for Bulldozer if L1 and L2 cache latencies stay where they are.

SiliconWars · Aug 2, 2013

Dribble said:
These CPUs are AMD strategy all over. A great new market appears (netbooks), pro-active companies move in fast and start to make a packet. AMD have all the bits to make a great netbook CPU but don't pull their finger out and actually produce some hardware until that market is dying. In the meantime the market is moving on (tablets + phones) and they are sat on the sidelines watching others make money again.

It might look that way, but the reality is that very few companies are making money in tablets and phones. Intel's Atom segment has cost the company $3 billion in the past 2 years, and Nvidia's Tegra has cost the company $0.5 billion in the past 4 years.

That market is saturated with some very big guns.

mrmt · Aug 2, 2013

Idontcare said:
I thought the cluster concept was pretty cool, the idea of enabling flexible (dynamic) tradeoff between using the hardware for boosting ILP (single-thread on one module) versus using the same hardware for boosting TLP (two-thread on one module).

Doesn't SUN have something like that?

Idontcare · Aug 2, 2013

mrmt said:
Doesn't SUN have something like that?

SUN is to the microarchitecture world what IBM is to the process technology world.

A visionary, a pioneer, a herd of cats headed in 99 different directions at the same time.

You are right, SUN has something like it but at the same time it only works in one direction (sacrifice ILP to boost TLP) and not the other way around (can not operate in the mode where TLP is sacrificed to boost ILP).

What I really thought was inspiring was the prospect of having (in the future) extremely wide cores that really pushed way far out on the ILP curve (that part where diminishing returns are silly bad) for running light-threaded work but then the front-end could dynamically re-allocate the width of the core such that it became less wide per thread but could support more threads simultaneously (for software that was designed to maximize TLP over ILP).

I still think that this is the ultimate destination for the evolution of big cores; Bulldozer just wasn't a good enough implementation of it.
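
The trade-off can be sketched with a toy model in which per-thread performance grows with the square root of core width, a stand-in for the diminishing-returns ILP curve described above. The curve, the widths and the helper names are all assumptions for illustration, not a description of any real core.

```python
import math


def thread_perf(width):
    """Toy diminishing-returns curve: per-thread performance ~ sqrt(width)."""
    return math.sqrt(width)


def throughput(total_width, threads, partitions):
    """Aggregate throughput when the core is split into equal partitions.

    Each partition runs one thread at a time; partitions without a thread idle.
    """
    active = min(threads, partitions)
    return active * thread_perf(total_width / partitions)


def best_partitioning(total_width, threads, options=(1, 2)):
    """Pick the partition count that maximizes throughput for this thread count."""
    return max(options, key=lambda p: throughput(total_width, threads, p))


for threads in (1, 2):
    p = best_partitioning(8, threads)
    print(f"{threads} thread(s): use {p} partition(s), "
          f"throughput {throughput(8, threads, p):.2f}")
```

With one thread the fused 8-wide configuration wins; with two threads, splitting into two 4-wide halves gives more total throughput, which is the dynamic reallocation idea in miniature.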

NostaSeronx · Aug 3, 2013

What I would like to see with Temash is a router implementation. I'm tired of this Geode dominated x86 router world!

insertcarehere · Aug 3, 2013

NostaSeronx said:
What I would like to see with Temash is a router implementation. I'm tired of this Geode dominated x86 router world!

What advantages would this bring over a dirt-cheap ARM or Geode implementation? Considering that Kabini isn't exactly tiny in its current implementation (~100mm^2).

NostaSeronx · Aug 3, 2013

insertcarehere said:
What advantages would this bring over a dirt-cheap ARM or Geode implementation? Considering that Kabini isn't exactly tiny in its current implementation (~100mm^2).

Geode -> Bobcat -> Jaguar
--
Geode -> G-series -> GX-series(Using A6 Temash) vs BCM4708 <- Fastest SoC for routing that is used in routers.

Geode: 3.6 Watts, 130-nm, 1? core, 4 cm x 4 cm package
500 MHz
32-bit x86, MMX and 3DNow!, 64-bit DDR
64 KB L1
128 KB L2

G-T24L: ~5 Watts, 40-nm, 1 core, 1.9 cm x 1.9 cm package
1 GHz
64-bit x86, MMX to SSE4a, 64-bit DDR3
32 KB L1
512 KB L2

A6-1450: ~8 Watts, 28-nm, 4 cores, 2.45 cm x 2.45 cm package
1 GHz
64-bit x86, MMX to AVX, 64-bit DDR3+ECC
32 KB L1
2 MB L2

BCM4708: ~2 Watts, 40-nm, 2 cores, idk
800 MHz
32-bit ARMv7, 64-bit DDR3
32 KB L1
256 KB L2
--
BCM4708 -> 1+ Gbps Routing WAN/LAN Simultaneous Perf
Geode -> <100 Mbps Routing WAN/LAN Simultaneous Perf
G-T24L -> <600 Mbps Routing WAN/LAN Simultaneous Perf(Estimation)
A6-1450 -> >4 Gbps Routing WAN/LAN Simultaneous Perf(Estimation)
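
To compare these on efficiency rather than raw speed, the figures above can be turned into throughput per watt. A rough sketch that treats the quoted bounds and estimates ("<100 Mbps", ">4 Gbps", "~8 Watts") as point values, which is a simplification; `gbps_per_watt` is a hypothetical helper:

```python
# (watts, routing throughput in Gbps), taken from the list above.
# The "<" / ">" bounds and "~" estimates are treated as exact values here.
chips = {
    "Geode":   (3.6, 0.1),
    "G-T24L":  (5.0, 0.6),
    "A6-1450": (8.0, 4.0),
    "BCM4708": (2.0, 1.0),
}


def gbps_per_watt(watts, gbps):
    """Routing throughput per watt of rated power."""
    return gbps / watts


efficiency = {name: gbps_per_watt(w, g) for name, (w, g) in chips.items()}

for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:8s} {value:.3f} Gbps/W")
```

On these numbers the A6-1450 estimate and the BCM4708 come out at about the same throughput per watt, with the A6-1450 offering more absolute headroom at a higher power budget.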

mrmt · Aug 3, 2013

Idontcare said:
You are right, SUN has something like it but at the same time it only works in one direction (sacrifice ILP to boost TLP) and not the other way around (can not operate in the mode where TLP is sacrificed to boost ILP).

I'm assuming you are talking about T1 here, but wasn't SUN's Rock supposed to do just that, be able to sacrifice a bit of TLP to boost ILP?

Btw Rock was internally cancelled circa 2009 by Oracle for the same reasons 45nm Bulldozer was cancelled.

Larry Ellison said:
This processor had two incredible virtues: It was incredibly slow and it consumed vast amounts of energy. It was so hot that they had to put about 12 inches of cooling fans on top of it to cool the processor. It was just madness to continue that project.

An MPU company very prone to taking risks and operating in an environment where clusters made a lot of sense decided to can their cluster project because it didn't perform as advertised. They went on with more conventional SMT.

An MPU company operating in a consumer environment where clusters don't make sense for a significant number of cases decided not only to proceed with a similar project, but also to bet the entire farm on it.

As much as the idea might be inspiring, the red flags were all there and not only at AMD.

frozentundra123456 · Aug 3, 2013

The problem I see with these chips is that they are not bad, but nothing jumps out at me as a compelling case to use them. Graphics is good, but neither CPU performance nor battery life, at least in the platforms presented so far, is outstanding. I really was interested in them as a tablet chip, but so far they seem to be used to make marginally powerful laptops. And they are not especially well priced either.

Idontcare · Aug 3, 2013

mrmt said:
I'm assuming you are talking about T1 here, but wasn't SUN's Rock supposed to do just that, be able to sacrifice a bit of TLP to boost ILP?

Btw Rock was internally cancelled circa 2009 by Oracle for the same reasons 45nm Bulldozer was cancelled.

An MPU company very prone to taking risks and operating in an environment where clusters made a lot of sense decided to can their cluster project because it didn't perform as advertised. They went on with more conventional SMT.

An MPU company operating in a consumer environment where clusters don't make sense for a significant number of cases decided not only to proceed with a similar project, but also to bet the entire farm on it.

As much as the idea might be inspiring, the red flags were all there and not only at AMD.

Yeah I got to work on SUN's Rock (the process side of it)...it was an interesting project that basically died from project mismanagement more so than technical microarchitectural issues.

For example, we prepared a special metal-gate flow for SUN at 65nm but they didn't want to pay for it, opting instead for very hot (drive current) very leaky standard SiON gates (for cost reasons). This made their Rock chip run at low clocks unless you turned up the voltage (which then cranked up the leakage).

Ellison was right to cancel the project because by the time he inherited it, the timeline was so bad that there was no way he could salvage a sellable product from it.

But the microarchitecture wasn't the problem, the decision making at the corporate level was the issue.

In that regard the same problem befell AMD's Bulldozer. Had the process (32nm) been gate-last instead of gate-first, the drive currents would have been much higher relative to the leakage at any given clock speed. Bulldozer would have looked like a champ at 4GHz with much lower leakage (Vcore would have been much lower while still hitting the same clock speed) and in turn the chip could have been clocked higher for higher performance within the current power footprint.
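
The voltage point follows from the standard dynamic power relation, P ~ C x V^2 x f. A minimal sketch with hypothetical Vcore values (1.40 V vs 1.25 V at the same 4 GHz); the voltages are illustrative assumptions, leakage is not modelled, and `relative_dynamic_power` is a hypothetical helper:

```python
def relative_dynamic_power(vcore, freq_ghz, ref_vcore, ref_freq_ghz):
    """Dynamic power relative to a reference point, using P ~ C * V^2 * f.

    Switched capacitance C is assumed identical in both cases; leakage is ignored.
    """
    return (vcore / ref_vcore) ** 2 * (freq_ghz / ref_freq_ghz)


# Hypothetical: same 4 GHz clock reached at 1.25 V instead of 1.40 V.
ratio = relative_dynamic_power(1.25, 4.0, ref_vcore=1.40, ref_freq_ghz=4.0)
print(f"dynamic power at lower Vcore: {ratio:.0%} of the reference")
```

Because voltage enters squared, even a modest Vcore reduction at the same clock frees up a noticeable share of the power budget.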

The people who designed bulldozer weren't given a voice or a vote on the executive decision to go with a gate-first integration scheme, an integration scheme that was exactly the opposite of what a bulldozer microarchitecture needed.
