AtenRa
Doesn't look like a higher IPC to me.
We measure IPC at the same frequency with a single core.
Just to add that, in terms of IPC, Llano was an update over Athlon II, not Phenom II.
In any case, the time scales don't matter in the slightest. It's the microarchitectural generations that determine how different one CPU is from another. The performance comparison of the Husky (read: Stars) cores in Llano with the Piledriver cores inside Trinity is significantly different from the comparison of Stars to Bulldozer, where performance is more or less a wash.
That's why it's 2 generations.
http://www.realworldtech.com/fusion-llano/... There are nearly a half dozen minor tweaks to improve performance that cumulatively add about 6% to IPC, according to AMD architects. ...
The most significant improvement in Llano is the larger 1MB, 16-way associative L2 caches. Essentially, AMD opted to eliminate the shared L3 cache and instead make the private L2 caches larger. ...
There are also modest improvements to the execution resources within each core. The renaming window grew slightly from 72 to 84 macro-ops. The schedulers received more attention and improvement. The integer scheduler went from 24 to 30 entries, while the FP scheduler is up to 42 entries. A new divider was added to the third integer pipeline, and certain FP instructions execute faster.
maybe, who knows
Looking at sales, it's exactly what people are asking for. It seems you're the one making a minority request.
So let's say we have 2 companies; A and B:
A: Releases 2 uarch generations and CPUs per year. Each uarch change is minor, so performance increases at 10% per year.
B: Releases 1 uarch generation and CPU per year. Each uarch change is major, so performance increases by 20% per year.
You're saying that A is better than B? Because all that matters is the number of uarch generations released per year, and not the actual performance increase per year?
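To put numbers on the A vs B hypothetical, here is a minimal sketch, assuming the yearly gains compound. The function name and the 4-year horizon are made up for illustration.

```python
def relative_performance(yearly_gain: float, years: int) -> float:
    """Performance relative to the starting point after compounding a yearly gain."""
    return (1.0 + yearly_gain) ** years


# Company A: 10% per year (two minor uarch steps per year).
# Company B: 20% per year (one major uarch step per year).
for years in range(1, 5):
    a = relative_performance(0.10, years)
    b = relative_performance(0.20, years)
    print(f"year {years}: A = {a:.2f}x, B = {b:.2f}x")
```

The number of uarch releases never enters the calculation; only the yearly gain does.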
Doesn't look like a higher IPC to me.
Doesn't look like you know what IPC is.
1) The frequency is not the same. IPC is Instructions Per Cycle (or clock).
2) Phenom II has L3. Llano does not.
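For anyone following along, a rough sketch of why the clocks have to match before you can read IPC off a benchmark chart. The scores and clock speeds below are made-up placeholders, not measurements of any real chip, and dividing a score by the clock is only a crude single-threaded stand-in for IPC.

```python
def ipc(instructions: float, cycles: float) -> float:
    """Instructions Per Cycle: retired instructions divided by elapsed clock cycles."""
    return instructions / cycles


def per_clock_score(score: float, clock_ghz: float) -> float:
    """Benchmark score per GHz, a crude proxy for IPC when clocks differ."""
    return score / clock_ghz


# Hypothetical chips: Y wins on raw score only because it is clocked higher.
chip_x = per_clock_score(score=100.0, clock_ghz=3.0)
chip_y = per_clock_score(score=105.0, clock_ghz=3.5)
print(f"X: {chip_x:.1f} per GHz, Y: {chip_y:.1f} per GHz")
```

Even this normalization ignores point 2: a different cache hierarchy changes the result independently of the core.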
Llano also increased clock rate of mobile quad cores while lowering platform power. That's the same reason that made SB so good compared to Nehalem.
It's not as big of an improvement as Intel realized in the same step, but it's noteworthy.
I agree with your logic, but your understanding of the historical events that took place is flawed.
AMD releases K10.5 uarch in 2008, recycles this uarch in Llano (Edit: Ok +6% performance) to be released in early 2011. Llano gets delayed until June 2011. Bulldozer gets released in 2011 with 0% (possibly negative) increase in IPC. Trinity gets released in 2012 with decent increase (15%?) in performance over Llano (K10.5).
That sounds like a 21% increase in performance over 4 years to me.
The 6% performance increase that came with Llano is comparable to SB->IB. So Llano could be considered a new CPU generation just as much as IB (even though Llano did not have a completely new uarch, but neither did IB). Also note that uarch changes are not the only thing that increases performance, e.g. frequency increases do too. In the end, all that matters is performance increase per year.
Also, the performance increase from Llano to Trinity is more in the range of 20-30%, not 15%. See my previous post. And it was achieved in about 1 year and 2 months (Llano -> Trinity), as mentioned previously.
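The figures being thrown around here can be put on a common per-year footing. This is only a sketch: it assumes the step gains multiply and that the annualized rate is the geometric one, and the helper names are invented for the example. The percentages are the ones quoted in this thread, not independent measurements.

```python
def cumulative_gain(*step_gains: float) -> float:
    """Total gain when several step gains are applied one after another."""
    total = 1.0
    for gain in step_gains:
        total *= 1.0 + gain
    return total - 1.0


def annualized(total_gain: float, years: float) -> float:
    """Constant yearly gain that compounds to total_gain over the given span."""
    return (1.0 + total_gain) ** (1.0 / years) - 1.0


# K10.5 -> Llano (+6%) -> Trinity (+15%), spread over 4 years.
total = cumulative_gain(0.06, 0.15)
print(f"total: {total:.1%}, per year over 4 years: {annualized(total, 4):.1%}")

# Llano -> Trinity quoted as 20-30% in about 1 year and 2 months.
for gain in (0.20, 0.30):
    print(f"{gain:.0%} in 14 months is {annualized(gain, 14 / 12):.1%} per year")
```

Which span you pick as the starting point changes the per-year number a lot, which is really what the disagreement is about.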
SB->IB is not a new uarch.
And Llano was late, as mentioned previously.
If AMD had delayed Llano long enough, they could have released Llano and Trinity on the same day, and by your logic that would have been a miraculous accomplishment: it only took them a day of engineering work to gain 20-30%.
You have a point. Llano->Trinity could perhaps be a "lucky period" for AMD. The question is whether AMD can sustain an increase in CPU performance per year that trumps Intel over a longer period of time. I guess that depends on what Broadwell/Skylake vs Richland/Kaveri brings, and when they are actually released.
But looking at the latest CPU releases from Intel, CPU performance increases do not seem to be their focus. Instead, lower power consumption, a better iGPU, and competing with ARM seem to be where their attention is. Broadwell being rumored to be BGA perhaps also points in that direction.
i am in no way qualified to comment at a technical level, but from all the reading i have done Bulldozer had the following major issues.
1. L1D write is slow
2. L1I can have alignment issues and need to pull data from L2
3. Decode throughput for a module is low vs its execution resources
4. soft flip-flops
5. long pipeline length means missed branches etc can have a bigger impact.
only 1 of those was fixed in piledriver, the other four are being fixed in steamroller. Llano to trinity might be a "lucky period", but trinity to kaveri might be another "lucky period" simply because bulldozer obviously had several missteps and shortcomings. That's assuming AMD don't **** up kaveri.
#5 alone makes this situation similar to Netburst->Conroe, in that AMD went for a long-pipeline, high-frequency architecture with Bulldozer (just like Netburst), decided it sucks, and then moved to an approach more similar to Intel's current microarchitecture, resulting in a huge performance increase between the two generations. (Assuming Steamroller does indeed give a huge performance increase.)
Intel's pipeline on SNB/IVB is almost as long. They just have better ways to mitigate the impact of having a long pipeline (a uop cache, of which Steamroller will have a similar implementation) and better branch prediction (which Piledriver provided some of, with Steamroller bringing more improvements). They also have much better caches.
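The pipeline-length argument can be sketched with the usual textbook model, where mispredicted branches add stall cycles on top of a base CPI. Every number below is a made-up placeholder, not a measured figure for Bulldozer, SNB, or anything else, and the model ignores everything except branches.

```python
def effective_cpi(base_cpi: float, branch_fraction: float,
                  mispredict_rate: float, penalty_cycles: float) -> float:
    """Base CPI plus the average stall cycles per instruction from mispredicted branches."""
    return base_cpi + branch_fraction * mispredict_rate * penalty_cycles


# Hypothetical workload: 20% of instructions are branches.
short_pipe = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=15)
long_pipe = effective_cpi(1.0, 0.20, 0.05, penalty_cycles=20)
# Same long pipeline, but with a better predictor (3% misses instead of 5%).
long_pipe_better_bp = effective_cpi(1.0, 0.20, 0.03, penalty_cycles=20)

print(f"short pipe: {short_pipe:.2f} CPI")
print(f"long pipe: {long_pipe:.2f} CPI")
print(f"long pipe, better predictor: {long_pipe_better_bp:.2f} CPI")
```

In this toy setup a long pipeline with a better predictor ends up ahead of the shorter one, which is the point about mitigation: the penalty only hurts as often as you mispredict.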
Darek Mihocka also did some low level analysis of Bulldozer and he gave us a hint on why Bulldozer is so slow:
Johan De Gelas here at AnandTech found that it primarily comes down to three things: clock speeds that are low (relative to the pipeline length), an L1 instruction cache that is too small, and a branch misprediction penalty that is too large. Cache latency was not a major factor.
http://www.anandtech.com/show/5057/the-bulldozer-aftermath-delving-even-deeper
Having only two integer pipelines per core is yet another big negative (down from 3 in Thuban). I have no idea if AMD will stick to the BD architecture after Kaveri, but they will likely wait till 20nm, IMHO, to release a redesigned, rather than tweaked, core. That is if they are still around and have the R&D budget to do a major redesign.
Three int ports per core would be a nice improvement in peak performance and especially helpful in handling today's bursty compiled code profiles/traces.
...L1 instruction cache is too small, branch misprediction penalty is too large. Cache latency was not a major factor.
Wouldn't it need a fatter decode to feed 6 (total) int pipelines? To me, it sounds like the int pipelines aren't even being fully utilized.
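A toy model of that bottleneck, assuming a 4-wide decoder shared by the two cores of a module (which is how Bulldozer is usually described) and assuming both cores are busy so each gets half the decode slots. The function is hypothetical and ignores the difference between x86 instructions and macro-ops, AGU pipes, and everything else in the core.

```python
def sustained_ops_per_cycle(decode_width_shared: int, cores_sharing: int,
                            int_pipes_per_core: int) -> float:
    """Per-core throughput is capped by the narrower of decode supply and execution width."""
    decode_per_core = decode_width_shared / cores_sharing
    return min(decode_per_core, int_pipes_per_core)


# Two int pipes per core, decode shared across the module.
print(sustained_ops_per_cycle(4, 2, int_pipes_per_core=2))
# Adding a third int pipe per core changes nothing while decode stays the same.
print(sustained_ops_per_cycle(4, 2, int_pipes_per_core=3))
# Only with a wider front end does the third pipe become usable.
print(sustained_ops_per_cycle(8, 2, int_pipes_per_core=3))
```

Under these assumptions, extra int ports would only help peak throughput if the front end got wider too.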
