With the current rate of Intel CPU performance increases, could AMD be catching up?


AtenRa

Lifer
Feb 2, 2009
14,003
3,362
136
Doesn't look like a higher IPC to me.

[Image: SiSoft Sandra 2011 arithmetic benchmark results]

We measure IPC at the same frequency with a single core.

Just to add that Llano was an update over Athlon II, not Phenom II, in IPC.
 

Fjodor2001

Diamond Member
Feb 6, 2010
4,235
595
126
In any case, the time scales don't matter in the slightest. It's the microarchitectural generations that determine how different one CPU is from another. The performance difference between the Husky (read: Stars) cores in Llano and the Piledriver cores in Trinity is significant, whereas the comparison of Stars to Bulldozer is more or less a wash.

That's why it's 2 generations.

So let's say we have 2 companies; A and B:

A: Releases 2 uarch generations (and corresponding CPUs) per year. Each uarch change is minor, so performance increases by 10% per year.

B: Releases 1 uarch generation (and corresponding CPU) per year. Each uarch change is major, so performance increases by 20% per year.

You're saying that A is better than B? Because all that matters is the number of uarch generations released per year, and not the actual performance increase per year?
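To make the A/B hypothetical concrete, here is a quick compounding sketch (the 10% and 20% figures are the made-up numbers from the example above, nothing more):

```python
def relative_perf(yearly_gain: float, years: int) -> float:
    """Overall speedup after compounding `yearly_gain` for `years`, vs. a 1.0 baseline."""
    return (1.0 + yearly_gain) ** years

years = 3
perf_a = relative_perf(0.10, years)  # company A: ~1.331x
perf_b = relative_perf(0.20, years)  # company B: ~1.728x
print(f"A after {years} years: {perf_a:.3f}x, B after {years} years: {perf_b:.3f}x")
```

B pulls well ahead despite shipping half as many generations, which is the point: generations per year don't compound, performance per year does.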
 

Piroko

Senior member
Jan 10, 2013
905
79
91
Doesn't look like a higher IPC to me.
... There are nearly a half dozen minor tweaks to improve performance that cumulatively add about 6% to IPC, according to AMD architects. ...

The most significant improvement in Llano is the larger 1MB, 16-way associative L2 caches. Essentially, AMD opted to eliminate the shared L3 cache and instead make the private L2 caches larger. ...

There are also modest improvements to the execution resources within each core. The renaming window grew slightly from 72 to 84 macro-ops. The schedulers received more attention and improvement. The integer scheduler went from 24 to 30 entries, while the FP scheduler is up to 42 entries. A new divider was added to the third integer pipeline, and certain FP instructions execute faster.
http://www.realworldtech.com/fusion-llano/
 

Blandge

Member
Jul 10, 2012
172
0
0
So let's say we have 2 companies; A and B:

A: Releases 2 uarch generations (and corresponding CPUs) per year. Each uarch change is minor, so performance increases by 10% per year.

B: Releases 1 uarch generation (and corresponding CPU) per year. Each uarch change is major, so performance increases by 20% per year.

You're saying that A is better than B? Because all that matters is the number of uarch generations released per year, and not the actual performance increase per year?

I agree with your logic, but your understanding of the historical events that took place is flawed.

AMD releases K10.5 uarch in 2008, recycles this uarch in Llano (Edit: Ok +6% performance) to be released in early 2011. Llano gets delayed until June 2011. Bulldozer gets released in 2011 with 0% (possibly negative) increase in IPC. Trinity gets released in 2012 with decent increase (15%?) in performance over Llano (K10.5).

That sounds like a 21% increase in performance over 4 years to me.
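For the record, the two steps compound rather than add (using only the figures claimed in the post above):

```python
# K10.5 -> Llano: +6%; Llano -> Trinity: ~+15% (both figures as claimed above).
total = 1.06 * 1.15
print(f"Overall 2008 -> 2012 gain: {(total - 1) * 100:.1f}%")  # ~21.9%
```

So "21% over 4 years" slightly understates it, but only by a rounding error.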
 

Blandge

Member
Jul 10, 2012
172
0
0
Doesn't look like you know what IPC is.

1) The frequency is not the same. IPC is Instructions Per Cycle (or clock).

2) Phenom II has L3. Llano does not.

I took a few minutes to look for a benchmark I know to be good for IPC measurements and that's the best I could find.

1) Score/Frequency = DMIPS/MHz, which is a common metric for IPC.
2) Dhrystone fits in L1 Cache.
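A quick sketch of that normalization, with invented scores purely to illustrate (not measurements of any real CPU):

```python
def dmips_per_mhz(dmips: float, freq_mhz: float) -> float:
    """Normalize a Dhrystone score by clock frequency to get a per-cycle (IPC-like) figure."""
    return dmips / freq_mhz

# Hypothetical example: two CPUs at different clocks.
cpu_a = dmips_per_mhz(10000.0, 3000.0)  # ~3.33 DMIPS/MHz
cpu_b = dmips_per_mhz(9000.0, 2500.0)   # 3.60 DMIPS/MHz
# cpu_b has the lower absolute score but the higher per-clock figure,
# which is why dividing out frequency matters for IPC comparisons.
print(cpu_a, cpu_b)
```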

Given the 6% number mentioned above I concede the point.
 
Last edited:

Piroko

Senior member
Jan 10, 2013
905
79
91
AMD releases K10.5 uarch in 2008, recycles this uarch in Llano (Edit: Ok +6% performance) to be released in early 2011. Llano gets delayed until June 2011. Bulldozer gets released in 2011 with 0% (possibly negative) increase in IPC. Trinity gets released in 2012 with decent increase (15%?) in performance over Llano (K10.5).

That sounds like a 21% increase in performance over 4 years to me.
Llano also increased clock rate of mobile quad cores while lowering platform power. That's the same reason that made SB so good compared to Nehalem.
It's not as big of an improvement as Intel realized in the same step, but it's noteworthy.
 

Blandge

Member
Jul 10, 2012
172
0
0
Llano also increased clock rate of mobile quad cores while lowering platform power. That's the same reason that made SB so good compared to Nehalem.
It's not as big of an improvement as Intel realized in the same step, but it's noteworthy.

Well, yes, Llano and Trinity are great APUs, don't get me wrong. Don't confuse my arguments with disdain for AMD or their engineering. I'm simply pointing out what I believe to be faults in others' logic or understanding, and if you can convince me that my views are wrong I'll gladly change my mind.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
Lots of talk about peak single threaded performance improvement from Llano to Trinity, but the only reason it's so high is because the clock speed is so much higher. And the only reason the clock speed is so much higher is because GF's 32nm was premature when Llano was released.

This is pretty much indisputable. Llano couldn't reliably hit clock speeds anywhere close to the Athlon IIs it superseded.

IDC mentioned it before but releasing products on processes that still needed work used to be standard practice.. and Intel and AMD could get away with it because the new process and uarchs offered so much more clock speed that they could still start with a big perf improvement even with the process far below spec. Then ramp it up to spec gradually and trickle out faster products. Once they hit the power wall and more diminishing returns in uarch scaling they didn't get nearly as big of an improvement in peak perf from a new process. Hence releasing a product on a grossly premature process meant releasing one that actually had substantially lower peak perf. Hence Llano. AMD could probably justify it since it wasn't meant to be a higher end product to begin with.

Trinity deserves a good amount of credit and is a much better showing than BD was.. but comparing it to Llano for peak CPU perf improvement is still going to be too exaggerated..
 

Fjodor2001

Diamond Member
Feb 6, 2010
4,235
595
126
I agree with your logic, but your understanding of the historical events that took place is flawed.

AMD releases K10.5 uarch in 2008, recycles this uarch in Llano (Edit: Ok +6% performance) to be released in early 2011. Llano gets delayed until June 2011. Bulldozer gets released in 2011 with 0% (possibly negative) increase in IPC. Trinity gets released in 2012 with decent increase (15%?) in performance over Llano (K10.5).

That sounds like a 21% increase in performance over 4 years to me.

The 6% performance increase that came with Llano is comparable to SB->IB. So Llano could be considered a new CPU generation just as much as IB (even though Llano did not have a completely new uarch, but neither did IB). Also note that uarch changes are not the only thing that increases performance; frequency increases do too. In the end, all that matters is performance increase per year.

Also, the performance increase from Llano to Trinity is more in the range of 20-30%, not 15%. See my previous post. And it was achieved in about 1 year and 2 months (Llano -> Trinity).

As Exophase mentioned in the previous post, it might be considered a "lucky period" to use for comparison and that may be the case, but still.

An interesting question is also what lies ahead. What will Intel Broadwell/Skylake bring, and what will Richland/Kaveri bring? Will Intel continue to increase performance by 8% per year? Can we expect more than that from AMD, so it may catch up?
 
Last edited:

Blandge

Member
Jul 10, 2012
172
0
0
The 6% performance increase that came with Llano is comparable to SB->IB. So Llano could be considered a new CPU generation just as much as IB (even though Llano did not have a completely new uarch, but neither did IB). Also note that uarch changes are not the only thing that increases performance; frequency increases do too. In the end, all that matters is performance increase per year.

Also, the performance increase from Llano to Trinity is more in the range of 20-30%, not 15%. See my previous post. And it was achieved in about 1 year and 2 months (Llano -> Trinity), as mentioned previously.

SB->IB is not a new uarch.

And Llano was late, as mentioned previously.

If AMD had delayed Llano long enough, they could have released Llano and Trinity on the same day, and by your logic that would have been a miraculous accomplishment: it only took them a day of engineering work to gain 20-30%.
 

Fjodor2001

Diamond Member
Feb 6, 2010
4,235
595
126
SB->IB is not a new uarch.

And Llano was late, as mentioned previously.

If AMD had delayed Llano long enough, they could have released Llano and Trinity on the same day, and by your logic that would have been a miraculous accomplishment: it only took them a day of engineering work to gain 20-30%.

It was "only" late by 6 months. But you still have a point. So perhaps a better way to phrase it is: "What matters is CPU performance increase per year, sustained over a longer period of time.".

It could be that Llano->Trinity was a "lucky period" for AMD. The question is whether AMD can sustain an increase in CPU performance per year that trumps Intel over a longer period of time. I guess that depends on what Haswell/Broadwell/[...] vs Richland/Kaveri/[...] brings, and when they are actually released.

But looking at the latest CPU releases from Intel, CPU performance increase does not seem to be their focus. Instead, lower power consumption, a better iGPU, and competing with ARM seem to be where their attention is. Broadwell being rumored as BGA-only perhaps also points in that direction.
 
Last edited:

Blandge

Member
Jul 10, 2012
172
0
0
You have a point. Llano->Trinity could perhaps be a "lucky period" for AMD. The question is whether AMD can sustain an increase in CPU performance per year that trumps Intel over a longer period of time. I guess that depends on what Broadwell/Skylake vs Richland/Kaveri brings, and when they are actually released.

But looking at the latest CPU releases from Intel, CPU performance increase does not seem to be their focus. Instead lower power consumption, better iGPU, and competing with ARM seems to be where their attention is at. Broadwell rumored as being BGA perhaps also points in that direction.

Lucky for sure, considering that Llano only performed 6% better than a 2.5-year-old architecture at the time of its release.

I do believe that AMD will be able to charge more for their top-end desktop SKUs ($200 is so insanely cheap for a flagship part), but I don't believe they will ever beat Intel's flagship (i.e. i7-3960X) performance.

That is to say: yes, the gap is closing on Intel's ~$300 i7-x770K parts, but not on the top end.
 
Last edited:

itsmydamnation

Diamond Member
Feb 6, 2011
3,079
3,915
136
I am in no way qualified to comment at a technical level, but from all the reading I have done, Bulldozer had the following major issues:

1. L1D write is slow
2. L1I can have alignment issues and needs to pull data from L2
3. Decode throughput for a module is low vs. its execution resources
4. Soft flip-flops
5. Long pipeline length means missed branches etc. can have a bigger impact

Only one of those was fixed in Piledriver; the other four are being fixed in Steamroller. Llano to Trinity might be a "lucky period", but Trinity to Kaveri might be another "lucky period", simply because Bulldozer obviously had several missteps and shortcomings. That's assuming AMD don't **** up Kaveri.
 

Blandge

Member
Jul 10, 2012
172
0
0
I am in no way qualified to comment at a technical level, but from all the reading I have done, Bulldozer had the following major issues:

1. L1D write is slow
2. L1I can have alignment issues and needs to pull data from L2
3. Decode throughput for a module is low vs. its execution resources
4. Soft flip-flops
5. Long pipeline length means missed branches etc. can have a bigger impact

Only one of those was fixed in Piledriver; the other four are being fixed in Steamroller. Llano to Trinity might be a "lucky period", but Trinity to Kaveri might be another "lucky period", simply because Bulldozer obviously had several missteps and shortcomings. That's assuming AMD don't **** up Kaveri.

#5 alone makes this situation similar to Netburst->Conroe: AMD went for a long-pipeline, high-frequency architecture with Bulldozer (just like Netburst), decided it sucks, and then moved to an approach more similar to Intel's current microarchitecture, resulting in a huge performance increase between the two generations (assuming Steamroller does indeed give a huge performance increase).
 

Puppies04

Diamond Member
Apr 25, 2011
5,909
17
76
I am in no way qualified to comment at a technical level, but from all the reading I have done, Bulldozer had the following major issues:

1. L1D write is slow
2. L1I can have alignment issues and needs to pull data from L2
3. Decode throughput for a module is low vs. its execution resources
4. Soft flip-flops
5. Long pipeline length means missed branches etc. can have a bigger impact

Only one of those was fixed in Piledriver; the other four are being fixed in Steamroller. Llano to Trinity might be a "lucky period", but Trinity to Kaveri might be another "lucky period", simply because Bulldozer obviously had several missteps and shortcomings. That's assuming AMD don't **** up Kaveri.

6. Software, including games and even Windows, was unable to use all 8 available cores effectively when it was released, compounding its comparatively weak single-threaded performance.
 

mrmt

Diamond Member
Aug 18, 2012
3,974
0
76
1. L1D write is slow
2. L1I can have alignment issues and needs to pull data from L2
3. Decode throughput for a module is low vs. its execution resources
4. Soft flip-flops
5. Long pipeline length means missed branches etc. can have a bigger impact


Darek Mihocka also did some low-level analysis of Bulldozer and gave us a hint as to why Bulldozer is so slow:

But Sandy Bridge has addressed the load port and partial EFLAGS stalls now, and it is on other design details that Bulldozer loses and loses big:

http://www.emulators.com/docs/nx34_2011_avx.htm#Bulldozer

(...)
Bulldozer now has the same 4-cycle L1 cache latency as all the Core i5/i7 products. No longer the advantage of a 3-cycle L1 latency.

Bulldozer is now slower than Sandy Bridge at PUSHFD and LAHF arithmetic flags instructions, so again, no longer an advantage.

Bulldozer maxes out at 2 addition operations per cycle, compared to 3 in Sandy Bridge, meaning lower ILP on fundamental ALU operations.

Bulldozer still uses an older-style integer divider, needing 44 cycles to perform an integer division instead of 22 cycles. Similarly, Bulldozer needs 4 cycles for integer multiply instead of 3. Therefore integer scaling operations are slower than Sandy Bridge.

Bulldozer, as with previous AMD products, is consistently slower at most MMX and SSE operations. For example, a simple register move between 64-bit GPR and XMM (MOVD instruction) is 9 cycles instead of 1 on Sandy Bridge. This limits the ability to use SIMD registers as extensions of the integer register file.

A very key instruction introduced in SSSE3 - byte permute (PSHUFB) - is 3 cycles instead of 1 cycle. Practically throw darts at any other SSE instructions, they mostly tend to be slower.

L2 cache latency is almost twice as slow as Intel parts, looks like about 21 or 22 cycles as opposed to about 12 on Sandy Bridge.

L3 cache latency appears to be about 44 cycles, comparable to older Intel parts but slower than Sandy Bridge's 35.

Executing self-modifying code, and thus dynamic generation of code in Java or .NET, appears to be about twice as slow as Sandy Bridge.

CMPXCHG, a fundamental atomic instruction used for synchronization primitives and locks, appears to need about 50 cycles for an uncontended operation, more than twice as slow as Sandy Bridge.

==========================================

This, on top of one less ALU per core and the atrocious memory controller.
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
#5 alone makes this situation similar to Netburst->Conroe: AMD went for a long-pipeline, high-frequency architecture with Bulldozer (just like Netburst), decided it sucks, and then moved to an approach more similar to Intel's current microarchitecture, resulting in a huge performance increase between the two generations (assuming Steamroller does indeed give a huge performance increase).
Intel's pipeline on SNB/IVB is almost as long. They just have better ways to mitigate the impact of having a long pipeline (a uop cache, of which Steamroller will have a similar implementation) and better branch prediction (which Piledriver provided some of, with Steamroller bringing more improvements). They also have much better caches.

Darek Mihocka also did some low level analysis of Bulldozer and he gave us a hint on why Bulldozer is so slow:
Johan De Gelas here at AnandTech found it to come down primarily to three things: low clock speeds (relative to the pipeline length), L1 instruction cache is too small, branch misprediction penalty is too large. Cache latency was not a major factor.

http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper
 
Last edited:

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Johan De Gelas here at AnandTech found it to come down primarily to three things: low clock speeds (relative to the pipeline length), L1 instruction cache is too small, branch misprediction penalty is too large. Cache latency was not a major factor.

http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper

Having only two integer pipelines per core is yet another big negative (down from 3 in Thuban). I have no idea if AMD will stick to the BD architecture after Kaveri, but they will likely wait till 20nm, IMHO, to release a redesigned, rather than tweaked, core. That is, if they are still around and have the R&D budget to do a major redesign.

Three int ports per core would be a nice improvement in peak performance, and especially helpful in handling today's bursty compiled code profiles/traces.
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
Having only two integer pipelines per core is yet another big negative (down from 3 in Thuban). I have no idea if AMD will stick to the BD architecture after Kaveri, but they will likely wait till 20nm, IMHO, to release a redesigned, rather than tweaked, core. That is, if they are still around and have the R&D budget to do a major redesign.

Three int ports per core would be a nice improvement in peak performance, and especially helpful in handling today's bursty compiled code profiles/traces.
Wouldn't it need a fatter decode to feed 6 (total) int pipelines? To me, it sounds like the int pipelines aren't even being fully utilized.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Having only two integer pipelines per core is yet another big negative (down from 3 in Thuban). I have no idea if AMD will stick to the BD architecture after Kaveri, but they will likely wait till 20nm, IMHO, to release a redesigned, rather than tweaked, core. That is, if they are still around and have the R&D budget to do a major redesign.

Three int ports per core would be a nice improvement in peak performance, and especially helpful in handling today's bursty compiled code profiles/traces.

In regards to the bold, reality check equals a big "ugh" here :(

Consider this - If you took the AMD that exists today (financially, human resources, market presence, etc) and time-ported them back to circa 2006 when bulldozer was just starting development and tasked present-day AMD to develop what would one-day become bulldozer (with the same development costs that existed back then) then there is virtually no chance AMD would have had the resources to even develop the bulldozer (and Piledriver) that exists today.

They worked on borrowed time and money just to develop bulldozer. That was their hail-mary attempt after Core2Duo devastated their Phenom/Opteron revenue.

Now you want to talk about the prospects of an even less capable AMD, fewer employees, fewer resources, etc, facing an environment where IC design and validation is not only more expensive but is a LOT more expensive (future nodes)...you want to speculate on the likelihood that that AMD is going to come out with a "next-gen" microarchitecture that will supplant the bulldozer lineage?

I don't see how it is possible. The math just doesn't add up. I think we'll see AMD fall back to evolving their bobcat successors and become Via by a different name.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
...L1 instruction cache is too small, branch misprediction penalty is too large. Cache latency was not a major factor.

That just doesn't sound right to me when I read it.

If your L1$ is too small then you are going to be gated by the cache latency because your cache miss rate is going to be high.

If your branch misprediction penalty is too large then you are going to be gated by the cache latency because the branch mispredict is going to hit the cache.

If the first two things you wrote are to be taken as true then I would have thought having an even lower cache latency would help negate the magnitude of the impact of the first two things.

The only time you should be able to say "cache latency is not a major factor" is when your code is such that your microarchitecture basically doesn't need the cache to get its job done. But that goes without saying.

In AMD's case you are arguing that the microarchitecture does need the cache to get its job done, and as such the latency ought to be critical. Shouldn't it?
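For what it's worth, this intuition matches the standard average-memory-access-time model. A toy sketch, with invented numbers rather than measured Bulldozer figures:

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time in cycles: AMAT = hit_time + miss_rate * miss_penalty."""
    return hit_time + miss_rate * miss_penalty

# Illustrative only: same L1 hit time and miss rate, but a slower L2
# (larger miss penalty) directly inflates the average access time.
fast_l2 = amat(hit_time=4, miss_rate=0.05, miss_penalty=12)  # 4.60 cycles
slow_l2 = amat(hit_time=4, miss_rate=0.05, miss_penalty=21)  # 5.05 cycles
print(fast_l2, slow_l2)
```

The higher the miss rate (small L1) and the more often the pipeline restarts from the cache (mispredicts), the more weight the miss-penalty term carries, which is exactly the argument above.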
 

Homeles

Platinum Member
Dec 9, 2011
2,580
0
0
I'm just passing on our resident editor's findings.

As far as branch misprediction goes, I think his major point there was that AMD needed to implement a uop cache. AMD apparently sees it the same way, considering they'll be doing just that with Steamroller.

For the L1 I-cache, it's too small when running both integer cores in a module. See here: http://images.anandtech.com/graphs/graph5057/42776.png

In SQL server, the hit rate drops from 97% to 95% when CMT is enabled. As you know, that's a big deal -- a 67% higher miss rate.
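To spell out the arithmetic behind that 67% figure:

```python
# Hit rate drops from 97% to 95% with CMT enabled, so the miss rate
# goes from 3% to 5%: a two-thirds (~67%) relative increase in misses.
miss_single = 1.0 - 0.97  # one thread per module
miss_cmt    = 1.0 - 0.95  # both integer cores active
relative_increase = (miss_cmt - miss_single) / miss_single
print(f"{relative_increase * 100:.0f}% more misses")  # ~67%
```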

I'm sure a lower latency L2 would negate the issue, but that would be treating the symptom and not the problem in this circumstance, as I understand it (although we can't ignore that the L2 latency is a problem in and of itself).

Back to L1 I-cache being too small -- Johan was right here as well: AMD is expanding the L1 instruction cache from 64KB to 96KB. Think they're increasing the associativity to 3-way as well.
 
Last edited:

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Wouldn't it need a fatter decode to feed 6 (total) int pipelines? To me, it sounds like the int pipelines aren't even being fully utilized.

Correct, but a large part of that is the lack of decoder resources, which is supposed to be fixed. The current scheme has 4 decoders per module, whereas Kaveri will have 8. One way of increasing IPC, especially since clocks are probably going to drop on bulk Si, would be to have 10 decoders and 5 int ports per core (2 AGU, 3 ALU). But that is a substantial redesign, would require more xtors, and would likely be better off being done on a 20nm node (which means the revised design would already need to be at least 1-1.5 years along, depending on the actual release date).