Thoughts, Rumors, or Specs of AMD fx series steamroller cpu


pelov

Diamond Member
Dec 6, 2011
3,510
6
0
Write-through cache on a CPU makes me shudder.

*shudders*

Not bad if your L2 cache is one of the strongest points of your architecture. But somehow AMD decided to take one of their weakest points (at least in desktop workloads) and magnify its weaknesses by using a write-through L1 that essentially sits on top of the L2.

Wonder who thought that would have been a good idea :p
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
I'm quite curious to see whether AMD will continue to use its write-through L1 design + WCC or revert to a write-back L1.

Maybe they will simply increase the WCC size ;)

I don't know, we'll see how fast the L1$ is in Piledriver, if AMD can make significant gains with L1$ (mainly write speed of course) maybe WCC can be eliminated. I would think that that is the way they'd like to go, but I could be completely off base (it could be someone's pet architectural feature :| )
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
I don't know, we'll see how fast the L1$ is in Piledriver, if AMD can make significant gains with L1$ (mainly write speed of course) maybe WCC can be eliminated. I would think that that is the way they'd like to go, but I could be completely off base (it could be someone's pet architectural feature :| )

I think it has more to do with the size of L1$ than the actual speeds.

[Image: SiSoft Sandra cache/memory benchmark results (sandra-cm.gif)]


[Image: AIDA64 cache speed results (aida64_cache_speed.png)]


As soon as you have the L1 write you see it dumps down to L2 levels because of WCC (you can see this by comparing L1 write and L2 write). So they can improve the L1 access but ultimately if they don't increase the size of the L1$ to quit relying on the L2, errr WCC, it won't help. So if it can fit on the L1$, great, it should do well. If it can't...
 

ShintaiDK

Lifer
Apr 22, 2012
20,378
146
106
Where did the "BD uses write through on L1" come from?

Writes suck when you share part of the L1, and the L2 for that matter. The others have unshared L1/L2s.

[Image: CPU cache hierarchy comparison diagram (figure1_c_840x987.jpg)]
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
Where did the "BD uses write through on L1" come from?

Writes suck when you share part of the L1, and the L2 for that matter. The others have unshared L1/L2s.

Well, if they improve the L2 drastically it wouldn't be a big problem (it would also address the misprediction penalties), unless you're tasked with workloads that use a very small portion of the cache, something like CPU queens? I think that fits entirely in the L1$. Generally speaking, I guess that's not a bad give-and-take.

I'm pretty sure it was stated by both AMD and Kanter in this article. In slides it's considered part of the L2 cache because it's essentially the same thing :p
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
As soon as you have the L1 write you see it dumps down to L2 levels because of WCC. So they can improve the L1 access but ultimately if they don't increase the size of the L1$ to quit relying on the L2, errr WCC, it won't help. So if it can fit on the L1$, great, it should do well. If it can't...

I said what I did based on the second graph which you posted, where L1$ write needs to be ~2.5x faster.

Oddly, Westmere has a 32KB L1$ (x2, data + instruction) and a 256KB L2$, yet the fall off point in Sandra is @ 128KB. The 2500K also has a 256KB L2$, so it seems like speed is more the issue than size. Really it's about a balanced design, which Bulldozer clearly isn't.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
The WCC is essentially the L2 cache so the L1 concerns are directly translated to the L2 problems.

While the WCC should technically be considered part of the L2, it is quite special.
You can think of the WCC as an "L1.5" cache that sits between the L1 and L2; its main job is to hide the long (>20 cycle) L2 latency and to coalesce multiple writes to the L2 into a single one.

Based on what I have seen (but I don't have a Bulldozer sample to test myself, sorry), the WCC hides the L2 latency quite well, but it falls short when dealing with prolonged writes that fill it quickly. Once it is full, any write to L1 starts to take over 20 cycles (the L2 latency), and so L1 speed, in terms of both latency and bandwidth, falls off terribly.
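To make this concrete, here is a quick toy model. This is definitely not AMD's actual implementation: the line size, buffer size, latencies and the simple FIFO policy are all assumptions taken from the numbers quoted in this thread, and it ignores that a real WCC can drain in the background.

```python
# Toy model of a write-through L1 backed by a small write-coalescing buffer.
# Assumed figures: 64-byte lines, 4 KB (64-line) WCC, 4-cycle L1 write,
# ~20-cycle L2 write. Not a real Bulldozer simulation.
LINE = 64
WCC_LINES = 4096 // LINE       # 64 lines tracked at once
L1_WRITE, L2_WRITE = 4, 20

def cycles_per_store(addresses):
    wcc = []                   # lines currently being coalesced (simple FIFO)
    total = 0
    for addr in addresses:
        line = addr // LINE
        total += L1_WRITE      # every store still updates the write-through L1
        if line in wcc:
            continue           # coalesced with a pending write: no extra cost
        if len(wcc) == WCC_LINES:
            wcc.pop(0)         # buffer full: a pending line must drain to the L2
            total += L2_WRITE  # ...and the store waits behind the slow L2 write
        wcc.append(line)
    return total / len(addresses)

hot = [(i * 64) % 2048 for i in range(100_000)]    # rewriting the same 2 KB
stream = [i * 64 for i in range(100_000)]          # every store hits a new line
print("hot 2 KB loop:", cycles_per_store(hot), "cycles/store")    # ~4
print("streaming    :", cycles_per_store(stream), "cycles/store") # ~24
```

The short working set keeps coalescing in the buffer and stays at L1 speed, while the streaming case collapses towards the L2 write latency, which is the behavior described above.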

Regards.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
I don't know, we'll see how fast the L1$ is in Piledriver, if AMD can make significant gains with L1$ (mainly write speed of course) maybe WCC can be eliminated. I would think that that is the way they'd like to go, but I could be completely off base (it could be someone's pet architectural feature :| )

Hi,
Piledriver was already examined by TechReport, and its L1 seems to be as fast as Bulldozer's.

The WCC can be eliminated only if the L1 cache policy changes to write-back: with a write-through design, you cannot get rid of the WCC without an enormous impact on performance (unless you integrate an L2 cache as quick as the L1, both in terms of latency and throughput, but that is not only next to impossible, it is illogical...).

Regards.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
What's the size of the WCC? I remember reading 4KB somewhere but I may be completely off...

http://realworldtech.com/beta/forums/index.cfm?action=detail&id=128675&threadid=128602&roomid=2

http://www.olcf.ornl.gov/wp-content/uploads/2012/01/TitanWorkshop2012_Day1_AMD.pdf

It seems it is only 4KB 4-way

Even so, the WCC and the small L1$ aren't the issue by themselves. If the L2 could be accessed more quickly, with a lesser penalty, we wouldn't be talking about the WCC at all. Ofc, if the L1$ were bigger or they changed the L1$ policy, they could likely bypass the need for the WCC entirely...
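As a back-of-envelope check on just how small 4KB is (the line size and store width below are assumptions for illustration, not published figures):

```python
# How much streaming data a 4 KB WCC can track before it has to drain to L2.
# 64-byte lines and 16-byte stores are assumed, not AMD-published figures.
wcc_bytes, line_bytes, store_bytes = 4 * 1024, 64, 16
lines = wcc_bytes // line_bytes                 # 64 cache lines
stores = lines * (line_bytes // store_bytes)    # ~256 back-to-back stores
print(f"{lines} lines, i.e. roughly {stores} back-to-back streaming stores")
# A very short streaming write loop is already enough to push every further
# store towards the (slow) L2, matching the fall-off in the write graphs.
```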
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
As soon as you have the L1 write you see it dumps down to L2 levels because of WCC (you can see this by comparing L1 write and L2 write). So they can improve the L1 access but ultimately if they don't increase the size of the L1$ to quit relying on the L2, errr WCC, it won't help. So if it can fit on the L1$, great, it should do well. If it can't...

With a write-through L1 cache policy, _any_ write to L1 is immediately broadcast to the WCC and, after a very short time, to the L2.

So increasing the L1$ size would be beneficial only for reads, while writes would remain quite slow.

On the other hand, if AMD increases the WCC size or improves L2 latency/throughput, L1 write performance will immediately get better.
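To illustrate the read/write asymmetry, here is a rough average-cost model; the hit rates and latencies are illustrative assumptions, not Bulldozer measurements.

```python
# Rough average-cost model: with write-through, a bigger L1 mostly helps reads.
# All hit rates and latencies are illustrative assumptions.
L1_LAT, L2_LAT = 4, 20   # cycles, roughly the figures quoted in this thread

def read_cost(l1_hit_rate):
    # reads that miss the L1 pay the L2 latency on top of the L1 lookup
    return L1_LAT + (1 - l1_hit_rate) * L2_LAT

def write_cost(wcc_absorb_rate):
    # with write-through every store is forwarded below the L1; only the share
    # the WCC absorbs avoids the L2 latency, independent of L1 size
    return L1_LAT + (1 - wcc_absorb_rate) * L2_LAT

print("reads, 16 KB L1 (assume 93% hits):", read_cost(0.93))        # ~5.4 cycles
print("reads, 32 KB L1 (assume 97% hits):", read_cost(0.97))        # ~4.6 cycles
print("writes, any L1 size (WCC absorbs 80%):", write_cost(0.80))   # 8.0 cycles
```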

Regards.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
With a write-through L1 cache policy, _any_ write to L1 is immediately broadcast to the WCC and, after a very short time, to the L2.

So increasing the L1$ size would be beneficial only for reads, while writes would remain quite slow.

On the other hand, if AMD increases the WCC size or improves L2 latency/throughput, L1 write performance will immediately get better.

Regards.

I understand that, but would they have had to use the WCC at all if the L1$ were larger? Using a write-through approach, writes only go up to the L2 tolerably because of the WCC; if you didn't use write-through, they wouldn't have to go up at all given a larger L1$ size, unless you've used up the larger and faster (than L2) L1$ entirely, in which case you have to move up anyway.

Increasing the L1$ without increasing the speed of the L2 would help, but it wouldn't cure the ailment. The issue here isn't JUST the WCC size or the WCC speed, but rather whether the WCC would be needed at all if a larger L1$ and/or a faster L2 existed. Masking the problem helps, but why magnify the problem by decreasing the size of the L1$? Unless AMD felt that a slow L2 wouldn't hurt them... but then why add the WCC at all? To make it hurt less?
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
I said what I did based on the second graph which you posted, where L1$ write needs to be ~2.5x faster.

With a write-through scheme, to get a faster L1 cache you have to implement a faster L2 cache. With a 2 MB L2 cache (per module), AMD had a hard time creating a fast L2. Maybe they can correct this in future architecture iterations ;)

Oddly, Westmere has a 32KB L1$ (x2, data + instruction) and a 256KB L2$, yet the fall off point in Sandra is @ 128KB. The 2500K also has a 256KB L2$, so it seems like speed is more the issue than size. Really it's about a balanced design, which Bulldozer clearly isn't.

The Westmere graph is perfectly fine: Sandra uses a multi-threaded approach, so the 128 KB point is really a ~21.3 KB data array per core. As this is less than the 32 KB L1 data cache per core, you see maximum speed. The 256 KB data point instead works out to about a 42.6 KB data array per core and, as this is more than the available L1 data cache (32 KB), some data spills to the L2 cache (which is slower).

The same applies to the i5-2500K: 128 KB = 32 KB per core, so it fits more or less entirely into L1. At 256 KB, you are clearly into L2 territory (and Westmere has 6 L2 caches, while SB has "only" 4 of them).
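Spelled out in a few lines (assuming Sandra splits the test buffer evenly across the cores it runs threads on):

```python
# Per-core slice of the multi-threaded test buffer, sizes as quoted above.
L1_DATA_KB = 32   # per-core L1 data cache on both Westmere and Sandy Bridge
for label, total_kb, cores in [("Westmere @ 128 KB", 128, 6),
                               ("Westmere @ 256 KB", 256, 6),
                               ("2500K    @ 128 KB", 128, 4),
                               ("2500K    @ 256 KB", 256, 4)]:
    per_core = total_kb / cores
    verdict = "fits in L1D" if per_core <= L1_DATA_KB else "spills to L2"
    print(f"{label}: {per_core:.1f} KB per core -> {verdict}")
```

That is why the fall-off shows up at 128 KB total on the 6-core Westmere even though each core only has a 32 KB L1 data cache.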

Thanks.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net

Yes, correct :)

Even so, the WCC and the small L1$ aren't the issue by themselves. If the L2 could be accessed more quickly, with a lesser penalty, we wouldn't be talking about the WCC at all. Ofc, if the L1$ were bigger or they changed the L1$ policy, they could likely bypass the need for the WCC entirely...

Yes, I agree. But a write-back L1 would be quite interesting :whiste:
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
I understand that, but would they have had to use the WCC at all if the L1$ were larger?

Yes, absolutely: with a write-through L1 cache and no WCC, L1 write latency becomes tied to L2 latency. An L1 write latency of >20 cycles would kill performance.

Using a write-through approach, writes only go up to the L2 tolerably because of the WCC; if you didn't use write-through, they wouldn't have to go up at all given a larger L1$ size, unless you've used up the larger and faster (than L2) L1$ entirely, in which case you have to move up anyway.

Increasing the L1$ without increasing the speed of the L2 would help, but it wouldn't cure the ailment. The issue here isn't JUST the WCC size or the WCC speed, but rather whether the WCC would be needed at all if a larger L1$ and/or a faster L2 existed. Masking the problem helps, but why magnify the problem by decreasing the size of the L1$? Unless AMD felt that a slow L2 wouldn't hurt them... but then why add the WCC at all? To make it hurt less?

With the current cache policy, a larger L1 cache would be beneficial only for reads. As it appears that the L1 hit rate is quite good even at 16 KB, the latency penalty to pay for a larger L1 that would only help reads was not acceptable.

This opens another interesting question: AMD left the L1 at only 16 KB, but used a relatively relaxed 4-cycle latency. This was clearly done to scale clock speed aggressively, but first-generation Bulldozer basically failed to reach much higher base clocks. This is probably one of Bulldozer's bigger problems...
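As a rough illustration, here is that 4-cycle L1 expressed in absolute time at a few assumed clock speeds (roughly stock Phenom II X6, stock FX-8150, and the kind of clock Bulldozer was presumably aimed at):

```python
# Absolute L1 latency in nanoseconds; clock speeds are illustrative assumptions.
for label, cycles, ghz in [("K10 L1, 3 cycles @ 3.3 GHz",        3, 3.3),
                           ("Bulldozer L1, 4 cycles @ 3.6 GHz",  4, 3.6),
                           ("Bulldozer L1, 4 cycles @ 4.5 GHz",  4, 4.5)]:
    print(f"{label}: {cycles / ghz:.2f} ns")
# At the clocks Bulldozer actually shipped with, the 4-cycle L1 is slower in
# absolute terms than K10's 3-cycle L1; only at much higher clocks does the
# trade-off break even.
```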

Thanks.
 

pelov

Diamond Member
Dec 6, 2011
3,510
6
0
Yes, absolutely: with a write-through L1 cache and no WCC, L1 write latency becomes tied to L2 latency. An L1 write latency of >20 cycles would kill performance.

But what if it wasn't write-through? And the WCC helps, but given its close ties to the L2 and its small size, does it really help THAT much? I'm assuming they used write-through and the WCC because of the small L1$ size in the first place, no? It's not like their L1$ performance in previous architectures hurt them massively...

With the current cache policy, a larger L1 cache would be beneficial only for reads. As it appears that the L1 hit rate is quite good even at 16 KB, the latency penalty to pay for a larger L1 that would only help reads was not acceptable.

I think my point is: imagine a different cache policy, one without the write-through and one where the WCC would be unnecessary. It seems to me that the small L1$, though it does perform well, ultimately hampers performance because of the slow L2 and the WCC's small size and close ties to the L2. You need write-through if you have a smaller L1$, and in turn you need the WCC (and the smaller the L1$, the larger the WCC needs to be).

But I guess in the end it mostly depends on L2 access speed, as that's one of the worst problems of the architecture (for desktop workloads at least). With a higher clock speed those problems wouldn't be so bad.
 

sefsefsefsef

Senior member
Jun 21, 2007
218
1
71
Before reading through this thread I somehow missed the existence (or at least the significance) of the write-coalescing cache in BD. I used to think that BD had the stupidest cache hierarchy of all time, but knowing about the WCC mostly redeems it for me (in concept, at least). I study memory hierarchies and I like the idea of the asymmetrical treatment and structures for reads and writes.
 

Ajay

Lifer
Jan 8, 2001
16,094
8,114
136
This opens another interesting question: AMD left the L1 at only 16 KB, but used a relatively relaxed 4-cycle latency. This was clearly done to scale clock speed aggressively, but first-generation Bulldozer basically failed to reach much higher base clocks. This is probably one of Bulldozer's bigger problems...

Thanks.

Yeah, shades of, uhm, Netburst! I really hope that AMD focuses on L1$ write performance and L2$ latency. The WCC was a clever idea to hide the L2$ latency (as previously mentioned), but I think a more standard cache architecture (I don't know which is better, write-through or exclusivity) will ultimately be easier to implement - if by Steamroller the design team can improve cache speeds and latencies appropriately. I think AMD will need to beef up the cores and then the front end to handle higher IPC from the cores. If AMD can hold clock speeds about the same and achieve this - that would be a big win. Trying to push the clocks even higher might work with Piledriver, but I think dropping to 28nm with Steamroller will increase heat density and leakage and work against them vis-à-vis clock speed improvements.

Oh, I just took a look @ this slide:
[Image: AMD architecture roadmap slide (excavator.png)]


I have a bad feeling that "greater parallelism" is code for 'more cores'. I wish they meant more core resources, but probably not. So much for my stupid theories :rolleyes:
 

IntelUser2000

Elite Member
Oct 14, 2003
8,686
3,787
136
I don't think Bulldozer's L2 is bad at all, at least compared to their own CPUs... from Anandtech's review: http://forums.anandtech.com/newreply.php?do=newreply&noquote=1&p=33513489

L1/L2 latency-
FX-8150: 4/21 cycles
Phenom II X6: 3/14 cycles

L1 cache latency increased for clock speed reasons. The culprit for the higher L2 latency is twofold: one is that it's much larger at 1MB, and the second is that it's a shared cache, while in Phenom II it's a dedicated one. Despite that, it's delivering more bandwidth than the Phenom II, and I think that's quite respectable.

Shared L2 cache in Core Duo increased latency to 14 cycles, up from 10 cycles.

I think the problem is, again, the module concept. You can see that even Sandy Bridge-E's straightforward addition of 2 more cores, with the interconnect and memory bandwidth scaled up, offered diminishing returns; now add lower performance/clock, and the module concept delivering smaller gains than adding full cores.

- 6 K10 cores to 8 Bulldozer cores is, in theory, a 33% increase.
- But in reality it ends up being less than that. For 50% more cores, the 3960X ends up being mostly low-40% faster (a rough version of this arithmetic is sketched below).
- Then you add that module cores don't deliver as much as full cores.
- And in the rest of the applications, it doesn't benefit from having more cores, and there's less performance/clock.
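Here's a rough sketch of that arithmetic; the module-core efficiency figure is purely an assumption for illustration, not a measurement:

```python
# Rough scaling arithmetic for 6 K10 cores -> 8 Bulldozer (module) cores.
ideal_gain = 8 / 6 - 1
print(f"ideal gain from core count alone: {ideal_gain:.0%}")   # ~33%

# Even SNB-E's "real" extra cores scale imperfectly: +50% cores -> ~low-40% perf.
scaling_eff = 0.42 / 0.50                                       # ~84%

# Assume the second core of each module adds ~0.8 of a full core,
# so 4 modules are roughly 4 + 4 * 0.8 = 7.2 core-equivalents.
bd_core_equiv = 4 + 4 * 0.8
realistic_gain = (bd_core_equiv / 6 - 1) * scaling_eff
print(f"more realistic multithreaded gain: {realistic_gain:.0%}")  # ~17%
# ...and that's before accounting for the lower per-clock performance, which is
# what hurts in the applications that don't use all the cores.
```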

Trying to push the clocks even higher might work with Piledriver, but I think dropping to 28nm with Steamroller will increase heat density and leakage and work against them vis-à-vis clock speed improvements.
Relying on further clock increases won't work with Steamroller, as 28nm might end up performing somewhat worse than 32nm. 28nm doesn't just forgo SOI (which is only responsible for a few % but still), it may also be a lower-power process. Fortunately, 28nm should improve leakage characteristics, as that's the benefit of a slower transistor.
 

SickBeast

Lifer
Jul 21, 2000
14,377
19
81
I have a bad feeling that "greater parallelism" is code for 'more cores'. I wish they meant more core resources, but probably not. So much for my stupid theories :rolleyes:
The really bad news is that it looks like Bulldozer is the foundation for AMD's next 3 major product cycles. The CEO behind that roadmap should be fired.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
That's interesting, Olikan; that would mesh with Steamroller being the generation where AMD's HSA initiative should first appear. My guess is that most of the effort on Steamroller will go into delivering a decent HSA-enabled Kaveri. That would certainly involve lots of parallelism work.
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
But what if it wasn't write-through?
The WCC would not be required in the case of a write-back L1: the write coalescing could be handled directly by the L1, and L2 latency would be a much smaller problem.

And the WCC helps, but given its close ties to the L2 and its small size, does it really help THAT much? I'm assuming they used write-through and the WCC because of the small L1$ size in the first place, no? It's not like their L1$ performance in previous architectures hurt them massively...
Yes, albeit small, the WCC helps a lot: without the WCC, with a "pure" write-through L1 cache, _any_ store to L1 would incur that same very high L2 latency. And a >20 cycle delay to write anything into a cached memory area is not tolerable. In that case, the L1 would basically be a read cache. Fortunately, it is not! :p

As David Kanter emphasized, AMD used a write-through L1 cache to get rid of ECC: this has a positive implication for L1 latency (it goes down). That is why everyone was a bit surprised by Bulldozer's 4-cycle L1 latency: K10 had a larger, ECC-enabled L1 with only 3 cycles of latency. Bulldozer's 4-cycle design clearly shows the pursuit of high clock speed. Given how well Nehalem performs with a 4-cycle L1 latency, I don't think L1 latency is a real problem for Bulldozer.

However:
- Bulldozer falls short of its projected clock frequency (at a tolerable TDP, at least)
- the write-through approach, while enabling a lower-latency L1, only works if you pair it with a very, very fast (both in terms of latency and bandwidth) L2. Bulldozer's L2 cache is instead slow (especially from a bandwidth perspective); a rough illustration of the resulting bound is sketched below.
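A minimal sketch of what that means for streaming stores; the bytes-per-cycle figures and the clock are purely illustrative assumptions:

```python
# Illustrative bound on streaming store bandwidth under write-through: once the
# WCC is saturated, stores cannot retire faster than the L2 can drain them.
clock_ghz = 4.0
l1_store_bw = 16   # bytes/cycle the L1/store path could nominally accept (assumed)
l2_drain_bw = 4    # bytes/cycle a slow L2 can absorb (assumed)
print("L1-limited streaming stores:", l1_store_bw * clock_ghz, "GB/s")
print("L2-limited streaming stores:", l2_drain_bw * clock_ghz, "GB/s")
# With a write-back L1 the first figure would apply until the working set spills
# out of the L1; with write-through, long streams of stores are pinned to the
# second, no matter how large or fast the L1 itself is.
```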


I think my point is: imagine a different cache policy, one without the write-through and one where the WCC would be unnecessary. It seems to me that the small L1$, though it does perform well, ultimately hampers performance because of the slow L2 and the WCC's small size and close ties to the L2.

All right. ;)

You need write-through if you have a smaller L1$, and in turn you need the WCC (and the smaller the L1$, the larger the WCC needs to be).

In reality, a smaller cache is better suited to write-back than a larger one. AMD's decision to go write-through has little to do with the L1 size (but L1 size does become important when deciding the WCC size).

But I guess in the end it mostly depends on L2 access speed, as that's one of the worst problems of the architecture (for desktop workloads at least). With a higher clock speed those problems wouldn't be so bad.

I agree :)
 

kernelc

Member
Aug 4, 2011
77
0
66
www.ilsistemista.net
Yeah, shades of, uhm, Netburst! I really hope that AMD focuses on L1$ write performance and L2$ latency. The WCC was a clever idea to hide the L2$ latency (as previously mentioned), but I think a more standard cache architecture (I don't know which is better, write-through or exclusivity) will ultimately be easier to implement - if by Steamroller the design team can improve cache speeds and latencies appropriately. I think AMD will need to beef up the cores and then the front end to handle higher IPC from the cores. If AMD can hold clock speeds about the same and achieve this - that would be a big win. Trying to push the clocks even higher might work with Piledriver, but I think dropping to 28nm with Steamroller will increase heat density and leakage and work against them vis-à-vis clock speed improvements.

Hi,
speaking about an "ideal" cache hierarchy is quite difficult: it depends on a number of parameters, such as cache size, clock speed and workload.

From a purely theoretical point of view, given the actual L1/L2 sizes and chip clock speed, I think that a write-back L1 cache with an inclusive L2 (similar to Intel's) could give very good results.

Anyway, AMD's focus on clock speed led it to de-emphasize a fast L2 cache in favor of a small, fast, high-clocked L1 / WCC. This _can_ work, except that the L2 bandwidth is far too low, and that negatively affects the L1 as well; moreover, those high clock speeds failed to materialize. This is my opinion at least, and I could be wrong ;)

Thanks.