Thoughts, Rumors, or Specs of AMD FX Series Steamroller CPU

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
I don't think Bulldozer's L2 is bad at all, at least compared to their own CPUs... from Anandtech's review:

L1/L2 latency:
FX-8150: 4/21 cycles
Phenom II X6: 3/14 cycles

L1 cache latency increased due to clock speed reasons. The culprit for the higher L2 latency is twofold: one is that it's much larger at 1MB, and the second is that it's a shared cache, while in Phenom II it's a dedicated one. Despite that, it's delivering more bandwidth than the Phenom II, and I think that's quite respectable.

Hi,
at the module level, L2 cache bandwidth should be quite respectable. However, the problems are:
- it seems that a single integer core (i.e. a single thread) can use only half of the L2 bandwidth;
- as L1 is a write-through cache, once the WCC is full, L1 write speed will be ganged to the L2's.

While Phenom II also had a relatively slow L2, its L1 was a write-back design: this means that, in many workloads, the large (64 KB) L1 cache could completely mask L2 speed/latency. This is not always the case with Bulldozer: if the WCC fills up, L1 write speed becomes ganged to L2 speed.

I wrote a Bulldozer analysis some months ago, describing this problem in detail: http://www.ilsistemista.net/index.p...n-whats-wrong-with-amd-bulldozer.html?start=4
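
To make the write-through/WCC point measurable, here is a rough, self-contained C sketch (my own illustration, not something from the article above) of the kind of streaming-write microbenchmark one could run: it times plain cached stores over working sets from a couple of KB up past the L2 size, so if L1 write speed really gets ganged to L2 once the small WCC saturates, the bandwidth curve should fall off long before the working set outgrows L1. All buffer sizes and iteration counts are arbitrary.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Streaming-write microbenchmark sketch: write over a working set of
   'size' bytes and report MB/s. On Bulldozer, if cached writes become
   ganged to L2 speed once the small WCC is saturated, the reported
   bandwidth should drop well before the working set reaches the 2 MB L2. */
static double write_bw(volatile uint64_t *buf, size_t size, int iters)
{
    size_t n = size / sizeof(uint64_t);
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int it = 0; it < iters; it++)
        for (size_t i = 0; i < n; i++)
            buf[i] = i;                     /* plain cached stores */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)size * iters / sec / (1024.0 * 1024.0);
}

int main(void)
{
    /* working sets from 2 KB (inside the WCC) up to 8 MB (past the L2) */
    for (size_t size = 2 * 1024; size <= 8 * 1024 * 1024; size *= 2) {
        uint64_t *buf = malloc(size);
        memset(buf, 0, size);               /* touch the pages first */
        printf("%8zu KB : %8.1f MB/s\n", size / 1024,
               write_bw(buf, size, 256));
        free(buf);
    }
    return 0;
}

Built with something like gcc -O2, the interesting part is the shape of the curve (where it drops), not the absolute numbers.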

Shared L2 cache in Core Duo increased latency to 14 cycles, up from 10 cycles.

I think the problem is again the module concept. You can see that even Sandy Bridge-E's straightforward addition of 2 more cores, with scaled-up interconnect and memory bandwidth, offered diminishing returns; now add lower performance/clock on top of that, and the module concept delivering smaller gains than adding full cores.

- 6 K10 cores to 8 Bulldozer cores is, in theory, a 33% increase.
- But in reality it ends up being less than that. For 50% more cores, the 3960X ends up being mostly low-40% faster.
- Then add that modules don't deliver as much as full cores.
- And in the rest of the applications, it doesn't benefit from having more cores and there's less performance/clock.

Obviously, increasing core count past a certain point gives diminishing returns, especially in the desktop space. After all, while SB-E is an 8-core design, Intel doesn't have a single desktop SKU with all 8 cores enabled.

Relying on further clock increases won't work with Steamroller, as 28nm might end up performing somewhat worse than 32nm. 28nm doesn't just forgo SOI (which is only responsible for a few %, but still), it may also be a lower-power process. Fortunately, 28nm should improve leakage characteristics, as that's the benefit of a slower transistor.

I'm very curious to see GF's 28nm process. Maybe they'll surprise us, maybe not! :hmm:
 

Ancalagon44

Diamond Member
Feb 17, 2010
3,274
202
106
I think now that AMD has transitioned to a new architecture, the delays won't be as long as they were with Bulldozer, and Steamroller will be faster than Thuban. At 28nm, they will have enough transistors available to finally fix all the shortcomings of Bulldozer. Steamroller should be Bulldozer done right. Of course, not everything will be rosy - Intel will still be walking away from AMD, performance-wise, at a brisk pace.

Why should you have to throw transistors at Bulldozer to fix it, though? Thuban performed better with fewer transistors while being built on an older process node. Throwing transistors at the problem doesn't make it go away.
 

Jacky60

Golden Member
Jan 3, 2010
1,123
0
0
It will have crap performance and will be a huge disappointment, like nearly all AMD CPUs in the last decade. AMD's delayed attempts to successfully integrate decent CPU and GPU performance on a single chip mean Intel will be p*ssing all over them in that space as well very soon.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
Why should you have to throw transistors at Bulldozer to fix it, though? Thuban performed better with fewer transistors while being built on an older process node. Throwing transistors at the problem doesn't make it go away.

Hi,
Thuban, while good, showed that the old K10.5 uarch had problems growing beyond 4 cores. Consider this: while a single K10.5 core + L2 cache weighs in at about ~22 mm2, Thuban is ~88 mm2 bigger than a regular PII X4 die, yet it has only 2 more cores (2x22 = 44 mm2).

The additional ~44 mm2 are invested in core/northbridge interconnects, which tend to scale exponentially with core count. In other words, with this classical approach, interconnect size tends to increase disproportionately.

For more information: AMD Bulldozer vs Sandy Bridge and K10 performance and benchmark analysis (see page 2).
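
To put the arithmetic above into a form you can play with, here is a toy area model in C. It is entirely my own construction, not from the linked article: it assumes per-core area stays roughly fixed (~22 mm2) while a crossbar-style core/northbridge interconnect grows with the square of the core count, with the quadratic constant fitted so that going from 4 to 6 cores costs the ~44 mm2 of extra glue estimated above.

#include <stdio.h>

/* Toy die-area model: per-core area is roughly fixed, while a crossbar-style
   core/northbridge interconnect is assumed to grow with the square of the
   core count. The O(n^2) term and its constant are assumptions chosen only
   so that going from 4 to 6 cores adds ~44 mm2 of interconnect, matching the
   Thuban estimate above. */
int main(void)
{
    const double core_mm2 = 22.0;                 /* K10.5 core + L2 */
    const double k = 44.0 / (6 * 6 - 4 * 4);      /* fit: +44 mm2 from 4 to 6 cores */

    for (int cores = 4; cores <= 10; cores += 2) {
        double core_area = cores * core_mm2;
        double glue_area = k * cores * cores;     /* assumed O(n^2) interconnect */
        printf("%2d cores: %6.1f mm2 cores + %6.1f mm2 interconnect = %6.1f mm2 total\n",
               cores, core_area, glue_area, core_area + glue_area);
    }
    return 0;
}

Under that (admittedly crude) assumption, a monolithic 8-core K10.5-style die would spend well over 100 mm2 on interconnect alone, which is the kind of overhead the module approach tries to sidestep.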

Bulldozer is a more modular design, enabling AMD to integrate 8 integer cores and a huge amount of cache in a reasonably sized die.

Regards.
 

Arzachel

Senior member
Apr 7, 2011
903
76
91
Bulldozer is a more modular design, enabling AMD to integrate 8 integer cores and a huge amount of cache in a reasonably sized die.

Or to scale the design down to four cores in a small package to make room for a larger iGPU.

When you consider that laptop and server CPUs are clocked comparatively low, a design aiming for clockspeed over IPC makes a lot of sense. It's just that, due to a combination of GF's 32nm node being pretty bad at first and AMD cutting some corners to release faster, the clockspeed advantage didn't materialise.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
Or to scale the design down to four cores in a small package to make room for a larger iGPU.

When you consider that laptop and server CPUs are clocked comparatively low, a design aiming for clockspeed over IPC makes a lot of sense. It's just that, due to a combination of GF's 32nm node being pretty bad at first and AMD cutting some corners to release faster, the clockspeed advantage didn't materialise.

Yes, I agree.

For that reason I am curious to see desktop Piledriver processors: with yields improving and the new clock distribution scheme, maybe AMD will surprise us ;)

Thanks.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Hi,
at the module level, L2 cache bandwidth should be quite respectable. However, the problems are:
- it seems that a single integer core (i.e. a single thread) can use only half of the L2 bandwidth;
- as L1 is a write-through cache, once the WCC is full, L1 write speed will be ganged to the L2's.

While Phenom II also had a relatively slow L2, its L1 was a write-back design: this means that, in many workloads, the large (64 KB) L1 cache could completely mask L2 speed/latency. This is not always the case with Bulldozer: if the WCC fills up, L1 write speed becomes ganged to L2 speed.

I wrote a Bulldozer analysis some months ago, describing this problem in detail: http://www.ilsistemista.net/index.p...n-whats-wrong-with-amd-bulldozer.html?start=4

Given your experience in deconstructing and deconvoluting Bulldozer's microarchitectural intricacies, I'm curious if you have had a chance to read Johan's recent article The Bulldozer Aftermath: Delving Even Deeper?

Cache Is Not the Only, Or Even the Main, Culprit

Most people pointed to high latency caches as a reason for subpar Bulldozer performance, but the real explanation of why Bulldozer's performance was underwhelming is a lot more complex.

The Real Shortcomings: Branch Misprediction Penalty and Instruction Cache Hit Rate

Bulldozer is a deeply pipelined CPU, just like Sandy Bridge, but the latter has a µop cache that can cut the fetching and decoding cycles out of the branch misprediction penalty. The lower than expected performance in SAP and SQL Server, plus the fact that the worst performing subbenches in SPEC CPU2006 int are the ones with hard to predict branches, all points to there being a serious problem with branch misprediction.

Another significant problem is that the L1 instruction cache does not seem to cope well with two threads. We have measured significantly higher miss rates once we run two threads on the 2-way 64KB L1 instruction cache. It looks like the associativity of that cache is simply too low. There is a reason why Intel has an 8-way associative cache to run two threads.
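
To see in practice how much hard-to-predict branches can cost on a deeply pipelined core, here is a classic little C experiment (my own sketch, not from Johan's article): the same loop is timed over random data, where the branch is essentially unpredictable, and over sorted data, where the predictor gets it right almost every time. The gap between the two timings is roughly the misprediction penalty paid per branch, and it grows with pipeline depth.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 20)

/* Sum only the elements above 127. For random bytes the branch is taken
   ~50% of the time in an unpredictable pattern; for sorted data it is
   almost perfectly predictable. The timing difference approximates the
   branch misprediction penalty paid per element. */
static double time_sum(const uint8_t *data, long *out)
{
    struct timespec t0, t1;
    long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 100; rep++)
        for (int i = 0; i < N; i++)
            if (data[i] > 127)              /* the branch under test */
                sum += data[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    *out = sum;
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

static int cmp_u8(const void *a, const void *b)
{
    return *(const uint8_t *)a - *(const uint8_t *)b;
}

int main(void)
{
    uint8_t *data = malloc(N);
    long sum;
    for (int i = 0; i < N; i++)
        data[i] = (uint8_t)(rand() & 0xff);

    printf("unsorted (unpredictable branch): %.2f s\n", time_sum(data, &sum));
    qsort(data, N, 1, cmp_u8);
    printf("sorted   (predictable branch):   %.2f s\n", time_sum(data, &sum));
    free(data);
    return 0;
}

Note that an aggressive compiler can turn the branch into a conditional move and hide the effect, so building with something like gcc -O1 keeps the branch intact.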
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
Prescott also had a Write Combining Cache.

Obviously, increasing core count past a certain point gives diminishing returns, especially in the desktop space. After all, while SB-E is an 8-core design, Intel doesn't have a single desktop SKU with all 8 cores enabled.

While that is true, they have Hyperthreading.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
Given your experience in deconstructing and deconvoluting Bulldozer's microarchitectural intricacies, I'm curious if you have had a chance to read Johan's recent article The Bulldozer Aftermath: Delving Even Deeper?

Yes, I read Johan's article and I found it excellent: his profiling analysis is a really valuable one.

He posed two valid points, but I think that the real problem remains the slow L2 cache coupled with low clock speed / high thermal output.

Bulldozer's pipeline, while longer than K10's, is not that much longer (about 25%). The I-cache can similarly be overhauled with increased associativity, but it has a cost: higher-associativity caches tend to be slower, and with the already slow Bulldozer caches, this can be problematic.

Obviously, I could be wrong. But all the reviews examined so far tend to point towards a low-bandwidth L2 cache as the main culprit.

Speaking of the shared FPU, I feel that the design needs another store port and/or a larger store queue.

Thanks.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
Prescott also had a Write Combining Cache.

All Intel processors from the PPro to IB have a feature called "write combining", but it is applied to non-cached I/O & memory operations. For example, writing 4x 64 bits to the PCI bus results in a single burst on PPro+ processors. In the same way, 4x non-temporal MOVNT store instructions will cause a single memory flush.

Bulldozer extended this concept, applying it to cached memory operations as well. The WCC is, in practice, a glorified WCB (write combining buffer) applied to the cache/memory hierarchy.

For more info: http://semipublic.comp-arch.net/wiki/Difference_between_write_combining_and_write_coalescing
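
For readers who have not run into classic write combining, here is a minimal C sketch of the non-cached flavour described above (my own example, not from the linked wiki page): four 16-byte SSE2 non-temporal stores fill one 64-byte line, get collected in a write-combining buffer, and leave the core as a single burst, bypassing the caches. Bulldozer's WCC applies the same coalescing idea to ordinary cached stores on their way from the write-through L1 to the L2.

#include <emmintrin.h>   /* SSE2: _mm_stream_si128, _mm_set1_epi32, _mm_sfence */
#include <stdalign.h>
#include <stdio.h>

/* Four 16-byte non-temporal stores covering one 64-byte line: they are
   coalesced in a write-combining buffer and flushed to memory as a single
   burst, without allocating the line in the cache hierarchy. */
alignas(64) static char dst[64];

int main(void)
{
    __m128i v = _mm_set1_epi32(0x12345678);

    _mm_stream_si128((__m128i *)(dst +  0), v);
    _mm_stream_si128((__m128i *)(dst + 16), v);
    _mm_stream_si128((__m128i *)(dst + 32), v);
    _mm_stream_si128((__m128i *)(dst + 48), v);
    _mm_sfence();        /* make the combined write globally visible */

    printf("first byte written: 0x%02x\n", (unsigned char)dst[0]);
    return 0;
}

A partially filled write-combining buffer gets flushed as several smaller transactions instead, which is why non-temporal store code is usually written to fill whole cache lines.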

While that is true, they have Hyperthreading.

Correct :)
 

Makaveli

Diamond Member
Feb 8, 2002
4,990
1,579
136
This has been one of the more interesting threads in here recently.

And kernelc, thanks for your breakdown of BD, very informative, and IDC, thanks for the link also.

Good reading for a rainy Friday morning at work :)

Question for both of you: do you think there are any bottlenecks with the IMC in BD? Ever since Nehalem, Intel has had a better IMC from what I've seen. Is this an area AMD needs to focus on as well?
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
This has been one of the more interesting threads in here recently.

And kernelc, thanks for your breakdown of BD, very informative, and IDC, thanks for the link also.

Good reading for a rainy Friday morning at work :)

Question for both of you: do you think there are any bottlenecks with the IMC in BD? Ever since Nehalem, Intel has had a better IMC from what I've seen. Is this an area AMD needs to focus on as well?

I think Bulldozer's IMC is quite OK. However, the slow L2/L3 caches somewhat hamper its maximum speed.

To understand the importance of the caches with regard to memory bandwidth, think of SB: the ring bus really improved memory transfer speed.

Regards.
 

Olikan

Platinum Member
Sep 23, 2011
2,023
275
126
Question for both of you: do you think there are any bottlenecks with the IMC in BD? Ever since Nehalem, Intel has had a better IMC from what I've seen. Is this an area AMD needs to focus on as well?

for CPUs, no...it's pointless

for APUs, yes it's critical
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
Why should you have to throw transistors at Bulldozer to fix it, though? Thuban performed better with fewer transistors while being built on an older process node. Throwing transistors at the problem doesn't make it go away.

Well, what kernelc already said. Plus, given the discussion here, I think AMD will be forced to stop reaching for higher clocks after Piledriver, and their main goals will have to be a better front end, cores with higher sustained throughput (IPC), and improved cache performance. The first two will definitely require more resources, e.g. xtors.

Basically, since AMD has committed to this uArch through Excavator, they will have to morph it towards 'fatter cores' (& possibly leaner caches) like Intel's, because they will not be able to achieve their higher clock goals at anything near reasonable power limits. How well they are able to pull this off (and how well their fab partners' 28/20nm nodes perform) is all up in the air. What other way can they go?
 

Makaveli

Diamond Member
Feb 8, 2002
4,990
1,579
136
for CPUs, no...it's pointless

for APUs, yes it's critical

How is the IMC pointless?

That is one of the features in AMD64 that allowed it to crap all over the P4.

I know the current CPU landscape is very different these days. However, I would think that further development of the IMC would be just as important as other areas of the CPU improve.

This question is for all.

Pertaining to performance, if one were to list the areas of importance within the design, what would they be?

example.
(list in order of importance)

1. Core design (IPC)
2. Cache (size/speed)
3. IMC

?
 

Arzachel

Senior member
Apr 7, 2011
903
76
91

I'd love to see more articles in this vein; this one is pretty specific to server workloads. As far as I know, games are less branch intensive and more sensitive to cache latency. What's more, writes in the L1 cache have the same latency as in L2 once the WCC is full, if I understand correctly, which would absolutely tank performance.

Question to kernelc: is code that fills up the WCC too quickly prevalent enough to be a large concern or is it more of a worst case scenario?
 

Charles Kozierok

Elite Member
May 14, 2012
6,762
1
0
Basically, since AMD has committed to this uArch through Excavator, they will have to morph it towards 'fatter cores' (& possibly leaner caches) like Intel's, because they will not be able to achieve their higher clock goals at anything near reasonable power limits.

Well, that remains to be seen.

Recall that this phenomenon -- new microarchitecture that seems crappy compared to its predecessor and doesn't scale as much as initially predicted -- has been repeated many times over the course of microprocessor history. I remember when the original Pentium came out and everyone declared how much it "sucked". The same thing happened with the P4, which eventually scaled up to decent levels (if not nearly as decent as Intel wanted), and as Johan pointed out, it's happened to AMD as well.

Time will tell.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
Question for both of you: do you think there are any bottlenecks with the IMC in BD? Ever since Nehalem, Intel has had a better IMC from what I've seen. Is this an area AMD needs to focus on as well?

He wasn't technically wrong that in the desktop area it doesn't help as much, though perhaps the wording is a bit overstated. Due to the APU uArch, AMD relies very heavily on DDR frequencies in order to provide good GPU throughput. Unlike Intel, AMD doesn't dedicate a set amount of cache to the on-die GPU for reads/writes (at least I'm pretty sure they don't ;P). In turn, even for small block transfers this means AMD's APUs rely very heavily on high DRAM frequencies. A beefier IMC will help the APUs in tasks like gaming far more than it would a desktop CPU, which, for the most part, doesn't care whether you've got 1066MHz or 2400MHz RAM. In server workloads that's a bit of a different story, but let's ignore that :p

Basically, since AMD has committed to this uArch through Excavator, they will have to morph it towards 'fatter cores' (& possibly leaner caches) like Intel's, because they will not be able to achieve their higher clock goals at anything near reasonable power limits. How well they are able to pull this off (and how well their fab partners' 28/20nm nodes perform) is all up in the air. What other way can they go?

28nm GloFo is likely to be bulk. This allows AMD to use foundries other than GloFo to make their chips, although at this point it's highly unlikely TSMC will be making any chips other than GPUs for AMD, considering their current issues. Though they'll be able to produce slightly cheaper chips, they also can't rely on having the SOI they did at 32nm. I think they're looking at FD-SOI for 20nm, but that's years away and another discussion entirely. Going 28nm bulk will likely hamper clock speed goals a bit, so they can't expect to ramp up clock speeds for Steamroller like they have on the 32nm HKMG gate-first SOI from BD > Piledriver.

He posed two valid points, but I think that the real problem remains the slow L2 cache coupled with low clock speed / high thermal output.

I keep going back to this and still think the L1$ is too small. Well, either the L1$ or the WCC, or both. A slow L2 is a given in a deeper-pipelined, higher-clocking architecture, so why not mask it by increasing the size of the L1$ or the WCC? It seems to me they exacerbated the L2 issues by cutting down the L1$/WCC. Granted, given higher clockspeeds we would likely not even be mentioning the L2 cache speeds.

Maybe we see smaller cache sizes in Steamroller? :) A wider front end is almost certain. I believe they increased the FPU queue by 10% in Piledriver. But whatever they do on the FPU end will be heavily influenced by their HSA goals so it wouldn't surprise me if that looked entirely different.

What's more, writes in the L1 cache have the same latency as in L2 once the WCC is full, if I understand correctly, which would absolutely tank performance.

Question to kernelc: is code that fills up the WCC too quickly prevalent enough to be a large concern or is it more of a worst case scenario?

I'm wondering about this too. Given the small size of the WCC, 4KB, is it just a matter of the WCC/L1$ being filled up too quickly that makes the L2 problems look worse than they already are? Games tend to benefit quite a bit from cache, and they make use of it all the way up to L3. IIRC the difference between no-L3 and L3 is about 10%? The L3 speeds aren't exactly quick either, but AMD offers more usable L3 capacity than Intel due to the way data moves from L2 up to L3: in Intel's (inclusive) cache design the entire block is also kept in the L3 (well, almost the entire thing), whereas with AMD only a small portion is duplicated, so you've got more L3 available. That probably helps mask the speed concerns slightly, though the gap in L3 access speed between SB and BD is quite large due to AMD using an asynchronous cache, meaning the L3 isn't clocked the same as the rest of the chip (2200-2400MHz for the 4-module chips?). This is also the reason why AMD's chips respond quite favorably to NB/HyperTransport clock bumps (and would even more with a better IMC).
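
Since a lot of this thread hinges on cache and NB latencies, here is a rough pointer-chasing sketch in C (my own, with made-up working-set sizes rather than exact FX cache sizes): every load depends on the previous one, so the time per iteration approximates the load-to-use latency of whichever level the working set fits in, and re-running it after an NB/L3 or IMC overclock would show how much the L3 and DRAM numbers actually move.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Pointer-chasing latency sketch: walk a randomly shuffled cycle of
   pointers so that every load depends on the previous one and hardware
   prefetching is defeated. The average time per hop approximates the
   load-to-use latency of the cache level (or DRAM) the working set fits in. */
static double chase_ns(size_t bytes)
{
    size_t n = bytes / sizeof(void *);
    void **ring = malloc(n * sizeof(void *));
    size_t *order = malloc(n * sizeof(size_t));

    for (size_t i = 0; i < n; i++)
        order[i] = i;
    for (size_t i = n - 1; i > 0; i--) {            /* Fisher-Yates shuffle */
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i < n; i++)                  /* link into one big cycle */
        ring[order[i]] = &ring[order[(i + 1) % n]];

    struct timespec t0, t1;
    void **p = ring;
    const long hops = 50 * 1000 * 1000;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < hops; i++)
        p = (void **)*p;                            /* serialized dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = ((t1.tv_sec - t0.tv_sec) * 1e9 +
                 (t1.tv_nsec - t0.tv_nsec)) / (double)hops;
    if (p == NULL)                                  /* keep the chain live */
        ns = 0.0;
    free(order);
    free(ring);
    return ns;
}

int main(void)
{
    /* 16 KB (L1) -> 1 MB (L2) -> 8 MB (L3) -> 64 MB (DRAM); adjust to taste */
    size_t sizes[] = { 16u << 10, 1u << 20, 8u << 20, 64u << 20 };
    for (int i = 0; i < 4; i++)
        printf("%6zu KB : %6.1f ns per load\n",
               sizes[i] >> 10, chase_ns(sizes[i]));
    return 0;
}

Tools like lmbench's lat_mem_rd do the same thing more carefully, but even this sketch shows why the asynchronously clocked L3/NB shows up directly in load latency once the working set spills out of L2.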
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Well, that remains to be seen.

Recall that this phenomenon -- new microarchitecture that seems crappy compared to its predecessor and doesn't scale as much as initially predicted -- has been repeated many times over the course of microprocessor history. I remember when the original Pentium came out and everyone declared how much it "sucked". The same thing happened with the P4, which eventually scaled up to decent levels (if not nearly as decent as Intel wanted), and as Johan pointed out, it's happened to AMD as well.

Time will tell.

The FDIV bug in the original Pentium did suck. But I don't remember anyone ever saying the performance of the Pentium sucked :confused: It ran circles around the 486 in both INT and FPU, despite its clockspeed disadvantage.

Not calling shens on you, it's just that I honestly have a very different recollection of how the Pentium was received by the market and reviewers when those original 60 and 66 MHz chips hit the streets.

Now the K5 on the other hand...

for Phenom, it was important.... for Bulldozer it's not

IMC overclocked from 2.2GHz to 3.2GHz = ~2% performance increase
http://www.madshrimps.be/articles/article/1000220/AMD-FX-8150-Bulldozer-CPU-Review/6#axzz1w3yc4sWh

Unless the IMC is the performance-limiting component in the memory subsystem, OC'ing it isn't going to net much of a performance boost.

But surely you aren't arguing that if we took that 2.2GHz IMC out of Bulldozer and put it back into the NB, performance would not suffer, and suffer badly, as the memory subsystem latency then doubles?

Having an IMC is critical to enabling today's performance, but having it run at 3GHz or 10GHz is not critical unless your CPU is 15GHz and your RAM is DDR4 1-1-1-1-2 T0 ;)
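
A back-of-the-envelope AMAT (average memory access time) calculation makes the same point. The hit rates and latencies below are made-up round numbers purely for illustration, not measured FX-8150 figures, but they show why even a large cut in memory latency barely moves the average when the caches absorb most accesses.

#include <stdio.h>

/* Back-of-envelope average memory access time: the cache hierarchy absorbs
   almost every access, so even a big change in DRAM/IMC latency moves the
   average very little. All hit rates and latencies are invented round
   numbers for illustration only. */
int main(void)
{
    const double l1_lat = 4, l2_lat = 21, l3_lat = 65;      /* cycles */
    const double l1_hit = 0.95, l2_hit = 0.80, l3_hit = 0.60;

    for (double mem_lat = 200; mem_lat >= 150; mem_lat -= 50) {
        /* misses compound down the hierarchy */
        double amat = l1_lat
                    + (1 - l1_hit) * (l2_lat
                    + (1 - l2_hit) * (l3_lat
                    + (1 - l3_hit) * mem_lat));
        printf("memory latency %3.0f cycles -> AMAT %.2f cycles\n",
               mem_lat, amat);
    }
    return 0;
}

With those numbers, shaving 25% off the memory latency moves the average access time from about 6.5 to 6.3 cycles, i.e. a few percent, which is in the same ballpark as the ~2% gain in the madshrimps link quoted above.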
 

Charles Kozierok

Elite Member
May 14, 2012
6,762
1
0
The FDIV bug in the original Pentium did suck. But I don't remember anyone ever saying the performance of the Pentium sucked :confused: It ran circles around the 486 in both INT and FPU, despite its clockspeed disadvantage.

I may be confusing the Pentium/486 launch with the Pentium II launch. But nobody was a big fan of those original Pentiums, as I recall. :) And everyone was very gung-ho about the high-clocked 486 clones AMD was putting out.

Anyway, the general pattern is of the first launch of a new microarchitecture being underwhelming.

Having an IMC is critical to enabling today's performance, but having it run at 3GHz or 10GHz is not critical unless your CPU is 15GHz and your RAM is DDR4 1-1-1-1-2 T0 ;)

I'm not even really sure how important the IMC is at all any more. Caches just seem to hide so much of any performance differences related to memory.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
I'm not even really sure how important the IMC is at all any more. Caches just seem to hide so much of any performance differences related to memory.

What about for APUs with no L3 cache? :p

Though it's not surprising to see rumors swirling around an L4 cache being dedicated to an on-die GPU. We might see AMD take the same route as well. For seamless CPU/GPU compute, a larger shared cache makes more sense.