AMD Bulldozer and Llano going to be delayed? GF 32nm troubles?

Scali · Jul 20, 2010

ModestGamer said:
building a processor that work at all is a massive undertaking. I think they put the money elsewhere. Smartly so.

Okay, so you can call other people fanboys when they present technical arguments...
But then you come here with unsubstantiated faith in AMD's future products, and you're expecting us to just take your word for it... What exactly does that make you?

grimpr · Jul 20, 2010

24 of August is not far off. We have many real experts and sofa cpu designers that will drill through AMDs presentation. So just relax and dont give a damn about it.

ModestGamer · Jul 20, 2010

Scali said:
Okay, so you can call other people fanboys when they present technical arguments...
But then you come here with unsubstantiated faith in AMD's future products, and you're expecting us to just take your word for it... What exactly does that make you?

Right. I am saying wait till it comes out before you slam it.

DrMrLordX · Jul 20, 2010

I wish I had had time to respond to this sooner. This is getting out of hand.

Scali said:
It's four K8 *cores* glued together actually (yes, with insignificant improvements, that was the reason I referred to the K8, to emphasize that fact. Bonus point for reading comprehension)...

Right, it's four cores, not a couple, and there were some design differences. You'd have to drag out technical documentation and all that, but it's not like it was copypasta on AMD's part.

The dualcore K8 derivatives were already of the glued kind, as they made no attempt of exploiting the fact that they were on a single die, unlike the Core Duo and Core2 Duo, which exploited that with a shared L2 cache.

1). K8 dual cores shared a memory controller and a high-bandwidth interconnect between cores. They had no problems maintaining L2 cache coherency between cores in single-socket configurations, which is more-or-less what is needed when tackling multithreaded applications (and that's what dual core CPUs were introduced to do, after all). That was all AMD really needed to do to "exploit the fact that they were on a single die", and they did it rather nicely.

2). Shared L2 might not have worked so well with AMD's exclusive cache hierarchy. If you want to attack their chosen cache hierarchy, feel free to do so, but let's not slam K8 for having L2 per core rather than shared L2. The reason why Conroe and Penryn did so well with their L2 designs is that there was so much of it, and that it was fast, even for L2.

K10 tries to finally do something with the L3 cache, but it is so poorly designed that the latency is barely better than what you'd get from a two-socket dualcore K8 system using the hypertransport bus (which cache2cache proves, Anandtech tested it in one of their reviews).

I think you're looking at this the wrong way. Most K10/K10.5 chips have had an L3 cache latency of around 50-56 cycles, depending on whom you ask. Higher levels of cache will always have higher latency and lower bandwidth than cache lower down the hierarchy. There is nothing intrinsic to AMD's cache design that makes it run this slow; rather, it is their decision to launch processors with slow NB speeds that hobbles their L3 cache performance.

Furthermore ,the fact that communication could be carried out between sockets on an older Opteron platform within 50 CPU cycles (or less) is a testament to how wonderful HT was, is, and will be in the future.

Aside from that the L3 cache was also very small, so the hit ratio was too low to really have much benefit.

Again, I think you're looking at this the wrong way. Intel has a higher cache density than AMD, both due to some design differences (Intel's is arguably better, at least from a raw density standpoint) and relative differences in process tech (Intel has been in the lead there for some time now). The L2 on K10 and K10.5 is by no means "very small", especially when you consider the size of L2 present on most Stars chips.

As to whether or not L3 has been of any benefit, let us not ignore the facts: Stars processors (yes, even those afflicted with the dreaded TLB bug) have been at least 10% faster when equipped with L3, even in single-socket configurations. L3 has been valuable in multi-socket configurations when maintaining cache coherency, especially on platforms that support HT Assist. The fact that 48-core Magny-Cours systems run as well as they do at such low clockspeeds is a testament to this fact.

The only way I've ever been able to get an L3-less Stars chip to come close to an L3-equipped Stars chip is to overclock the hell out of the NB and choose a benchmark with an enormous working set (SuperPi 32m). On 2p and 4p systems, I have every reason to believe that disabling L3 (or removing it altogether) would ruin the performance of modern Opterons.

L3 has helped a great deal, even though I think the way they implemented it was more complex than their R&D department could handle, leading to the delays with Barcelona's and Agena's launches, along with the dreaded TLB bug. Personally, I think they should have launched a K8 MCM quad (or native quad, if it would have been possible) as a stop-gap until K10 was ready.

I was lenient enough to ignore the TLB issue actually.

The TLB bug was the only significant design failure of K10. Well, that and the clockspeeds weren't great, but the design has proven to be scalable to clockspeeds much higher than K8 ever saw.

'Glued' does not necessarily imply MCM. The original Pentium D Smithfield was also a single-die. Still it was pretty much copy-paste of Pentium 4 cores, not that different from what AMD did with the K8 and K10.

K8 was designed, from the ground up, to be scalable to up to four cores. Curiously, AMD never exercised the option to roll out K8 quads. Prescott was not designed for multi-core operation.

This conversation, and this thread, are not about bashing Smithfield or Presler, so I'll leave it to someone else to do that somewhere else. It's been done enough already.

Anyone else who wants to challenge what I said? I think it's pretty useless to try.

Furthermore, K10 differs from K8 in numerous other ways:

http://www.anandtech.com/show/2183/

The list of optimizations to k10 from K8 takes up damn near the entire article. And you think Barcelona was just four K8 cores glued together? I think not.

Scali · Jul 21, 2010

DrMrLordX said:
Right, it's four cores, not a couple, and there were some design differences.

'A couple of' can be four, it is not always to be taken literally as two, but meant to be 'a few', or 'several'. A small number.
http://www.thefreedictionary.com/a+couple+of
Again, reading comprehension (we're talking about a quadcore architecture, how many cores do you think it has? So what would 'a couple of' refer to? Useful to argue something basic like that?).
You deliberately seem to want to misunderstand EVERYTHING I post, just so you have something to argue.
But you'll lose everytime, this way. You can't argue the dictionary.

DrMrLordX said:
1). K8 dual cores shared a memory controller and a high-bandwidth interconnect between cores.

Back then it may have been a bit confusing, but these days they introduced the term 'uncore' to separate that from the 'core' logic.
K8 dual cores share the 'uncore' logic, but there are still two sets of identical 'core' logic. Which is copy-paste.
Core Duo/Core2 Duo share the L2 cache, which is 'core' logic, not 'uncore'. See the difference?
The result is massively improved core-to-core communication.

DrMrLordX said:
They had no problems maintaining L2 cache coherency between cores in single-socket configurations, which is more-or-less what is needed when tackling multithreaded applications (and that's what dual core CPUs were introduced to do, after all). That was all AMD really needed to do to "exploit the fact that they were on a single die", and they did it rather nicely.

No, that is a requirement for proper functioning of the CPU. Exploiting it means using it to your advantage to improve performance. AMD did not do this. Their dualcores didn't have significantly faster core-to-core communication than a two-socket single-core K8 system. Again, see cache2cache results.

DrMrLordX said:
2). Shared L2 might not have worked so well with AMD's exclusive cache hierarchy. If you want to attack their chosen cache hierarchy, feel free to do so, but let's not slam K8 for having L2 per core rather than shared L2. The reason why Conroe and Penryn did so well with their L2 designs is that there was so much of it, and that it was fast, even for L2.

Making a shared cache and making it fast, is very difficult. Intel clearly was well ahead of AMD there. Intel's shared L2 cache performed about as well as AMD's unshared cache in terms of latency and bandwidth, yet had the advantage of being shared, and thus having implicit core synchronization. AMD simply couldn't pull it off, making all the 'native die' talk nothing but hot air.

DrMrLordX said:
I think you're looking at this the wrong way. Most K10/K10.5 chips have had an L3 cache latency of around 50-56 cycles, depending on whom you ask. Higher levels of cache will always have higher latency and lower bandwidth than cache lower down the hierarchy. There is nothing intrinsic to AMD's cache design that makes it run this slow; rather, it is their decision to launch processors with slow NB speeds that hobbles their L3 cache performance.

That is the biggest load of crock that I've heard yet, as an excuse for AMD's L3.
Bottom line is, removing L3 cache on a K10 barely affects performance. So why is it there in the first place? If you're going to put it on there, get it right. Else it's just a waste of die space.

DrMrLordX said:
Furthermore ,the fact that communication could be carried out between sockets on an older Opteron platform within 50 CPU cycles (or less) is a testament to how wonderful HT was, is, and will be in the future.

Makes you wonder what the AMD engineers have been thinking then. They should have known that HT was so good, that it would be very hard for them to get an L3 cache system out that would have significant benefits (especially the early 65 nm models, where only 2 MB would fit, not even larger than the combined per-core L2 caches. If you're going to apply high-latency cache, rule #1 is to use a cache big enough so that the hit rate would make up for the latency).
As I said before, solutions to problems that aren't there.

DrMrLordX said:
And you think Barcelona was just four K8 cores glued together? I think not.

Not literally, but at the end of the day, that's pretty much how it performs.

IntelUser2000 · Jul 21, 2010

Scali said:
Not literally, but at the end of the day, that's pretty much how it performs.

Come on Scali, Barcelona and Agena was significantly faster than its predecessors. It commonly achieved performance increases of 15% faster per clock. Think that's easy? I don't think so. It narrowed the gap to mere 10% from the original Core 2. The only problem was Intel chips clocked significantly higher and 45nm Penryn generation were what it was going up against.

The SINGLE BIGGEST problem for the Barcelona generation was AMD couldn't clock the chips high. Had it been able to clock at 3GHz at launch AMD would have been much more competitive.

I think you are underestimating the importance of the L3 cache too. First, the chips were much faster than previous generations so it makes the point moot, and L3 cache probably helped in server configurations.

Scali · Jul 21, 2010

IntelUser2000 said:
Come on Scali, Barcelona and Agena was significantly faster than its predecessors. It commonly achieved performance increases of 15% faster per clock. Think that's easy? I don't think so.

For an 'entirely new' architecture, I think 15% is not impressive (especially not if you factor in that clockspeeds went DOWN by about 15% if not more).
Conroe vs Presler was much bigger than that, and Nehalem vs Penryn aswell (without sacrificing any clockspeed at all).

If Bulldozer is again an 'entirely new' architecture, giving an 'impressive' boost of 15% per clock over Phenom... well, Intel will really be sweating, won't they?

IntelUser2000 said:
The SINGLE BIGGEST problem for the Barcelona generation was AMD couldn't clock the chips high. Had it been able to clock at 3GHz at launch AMD would have been much more competitive.

And the SINGLE BIGGEST reason for that problem is that the die was too large. Had they gone MCM rather than single-die, or had they removed the L3 cache and relied on the HT bus only, they could probably have scaled the clockspeed considerably.
That's what I've been driving at.
It just didn't work on 65 nm. They had to wait for 45 nm to get it going.

IntelUser2000 said:
I think you are underestimating the importance of the L3 cache too. First, the chips were much faster than previous generations so it makes the point moot, and L3 cache probably helped in server configurations.

But if it only helps in server configurations, why put it on the desktop too, where it's only driving price up and dragging yields down? Intel has also built Xeons with L3 cache in the past, when the desktop versions only had L1/L2. There doesn't seem to be a lot of business sense within AMD.

DrMrLordX · Jul 21, 2010

Scali said:
'A couple of' can be four, it is not always to be taken literally as two, but meant to be 'a few', or 'several'. A small number.
http://www.thefreedictionary.com/a+couple+of
Again, reading comprehension (we're talking about a quadcore architecture, how many cores do you think it has? So what would 'a couple of' refer to? Useful to argue something basic like that?).

No offense, but I've always considered "a couple of" to mean two, because literally speaking, that's what a couple is. It's two. Really. The fact that its use may refer to an indefinite small number is merely a product of sloppy language usage, which is particularly silly, when you knew how many cores it was (four). Why would you use "a couple of" to refer to an indefinite quantity, when the quantity was clearly defined, and which you knew about all along?

You deliberately seem to want to misunderstand EVERYTHING I post, just so you have something to argue.
But you'll lose everytime, this way. You can't argue the dictionary.

No, the point is you've made a vague and inaccurate statement (that K10 = K8) for the purpose of supporting an argument, and I'm saying that your statement is inaccurate. The fact is that you just don't like the performance of Stars processors, and want to blame it on being unsatisfactorily improved over K8, or something along those lines.

Back then it may have been a bit confusing, but these days they introduced the term 'uncore' to separate that from the 'core' logic.
K8 dual cores share the 'uncore' logic, but there are still two sets of identical 'core' logic. Which is copy-paste.
Core Duo/Core2 Duo share the L2 cache, which is 'core' logic, not 'uncore'. See the difference?
The result is massively improved core-to-core communication.

So, despite the fact that AMD launched the first x86 dual-core processor that actually shared logic between cores (be it 'uncore' or 'core' logic), something which Smithfield and Presler did *not* do, I'm to believe that K8 was not a "real" dual-core processor or that it was somehow inadequate at the time of launch?

By 2006, when Conroe finally hit the market, the last thing that was hobbling AM2 K8 was its L2 architecture (or the fact that it didn't have shared L2 between cores).

And honestly, if your argument holds water, then Nehalem is also "glued" together since it doesn't share its l2 cache either. Last time I checked, Nehalem didn't use any glue.

No, that is a requirement for proper functioning of the CPU.

According to whom? K8 didn't just function "properly" as a dual-core processor . . . for its time, it was brilliant. Rehashed to death by summer of '06, but brilliant at its launch and for many months following.

Exploiting it means using it to your advantage to improve performance. AMD did not do this.

Putting the memory controller on the die did improve performance.

Their dualcores didn't have significantly faster core-to-core communication than a two-socket single-core K8 system. Again, see cache2cache results.

Well let's see here . . . we have an Opteron 880 (could not find cachetocache results for an Opteron 280) in this CachetoCache benchmark run by Anandtech:

http://www.anandtech.com/show/2322/4

Core to core latency on an Opteron 880: 134 ns
Socket to socket latency on an Opteron 880: 169-188 ns <-- I would be curious to see if the socket-to-socket latency on a 2p system would be different, but given HT's design, I would expect that it would not be, especially given the socket-to-socket latency on the 2p Barcelona systems in the same benchmark.

What we also see here is core-to-core latency on the Opteron 2350 (B1 Barcelona) being lower than socket-to-socket latency on the Opteron 2350, which sort of flies in the face of what you were saying about Barcelona's L3 performance. Then, in this review by Anandtech:

http://www.anandtech.com/show/2386/4

B1 Barcelona returns with core-to-core latency a full 25ns lower than what it was in the first Anandtech article. B2 (in the form of the Opteron 2360SE) shows a core-to-core latency another 20 ns lower than the updated Opteron 2350 result. The socket-to-socket latency for Barcelona remained unchanged between benchmarks.

So, not only is your statement about core-to-core latency on K8 being the same as socket-to-socket latency on an Opteron rig bogus, so was your statement about the L3 on Barcelona being "so poorly designed that the latency is barely better than what you'd get from a two-socket dualcore K8 system using the hypertransport bus".

From Anandtech's own benchmarks, we see that Barcelona's cache architecture allowed cache propagation to occur with a latency penalty 72-92 ns lower than what cache propagation over the HT link would incur. The raw latency of K10's L3 varies in ns based on core frequency, but for the 2360SE (2.5 ghz), assuming a "worst case" of 56-cycle latency on the L3, that gives you a raw L3 latency of 22.4 ns.

making all the 'native die' talk nothing but hot air.

I do not believe you are in any position to accuse anyone of being full of hot air. And, as I have already stated, shared L2 is not a "requirement" for a multi-core die to be a native multi-core die; look at Nehalem. Or K8. Or K10/K10.5. They all took (or take) advantage of the cores being on the same die in their own way, and they all gain in performance from doing so. Conroe and Penryn were their own special beasts, but that doesn't make them the only processors in x86 history to take advantage of having multiple cores on the same die.

That is the biggest load of crock that I've heard yet, as an excuse for AMD's L3.

No it isn't. You ever tried benchmark runs on a K10.5 at different NB speeds, especially on chips that have L3? The difference is quite telling. Slow stock NB speeds have been a source of frustration for AMD enthusiasts since people started toying with Agena.

Bottom line is, removing L3 cache on a K10 barely affects performance.

I used to think that, but my own comparative benchmarking results have shown that to be false, even in 1P configurations where L3 doesn't reach its full potential. Do you remember the TLB bug that you were "lenient enough" to overlook? The fix on systems afflicted with the TLB bug, which entailed disabling the L3 cache altogether, caused a performance hit on Barcelona systems that was severe enough that many chose not to implement it when they had the option not to.

So why is it there in the first place? If you're going to put it on there, get it right. Else it's just a waste of die space.

It certainly isn't a waste of die space.

Makes you wonder what the AMD engineers have been thinking then. They should have known that HT was so good, that it would be very hard for them to get an L3 cache system out that would have significant benefits (especially the early 65 nm models, where only 2 MB would fit, not even larger than the combined per-core L2 caches. If you're going to apply high-latency cache, rule #1 is to use a cache big enough so that the hit rate would make up for the latency).

Well, according to Anandtech's own CachetoCache numbers, it looks like AMD's engineers managed to get core-to-core cache propagation down to lower latencies than ever vs. K8 Opterons despite having twice as many cores on the same die. They also managed to cut core-to-core cache propagation latency down to half of what the socket-to-socket cache propagation latency is on a multi-socket Barcelona system. Despite HT being awesome, AMD's engineers came through anyway.

As I said before, solutions to problems that aren't there.

Right . . .

Not literally, but at the end of the day, that's pretty much how it performs.

Revisionist history.

Scali said:
had they removed the L3 cache and relied on the HT bus only, they could probably have scaled the clockspeed considerably.

I used to think that. Then Propus showed up and, lamentably, proved me very wrong. L3-equipped Deneb kicks Propus' ass; I should know, I benched my Propus and compared numbers to Deneb benchmarks out there, and it was damn hard getting the Propus to catch up in anything but SuperPi 32m.

But if it only helps in server configurations,

That is not the case.

IntelUser2000 · Jul 21, 2010

Scali said:
And the SINGLE BIGGEST reason for that problem is that the die was too large. Had they gone MCM rather than single-die, or had they removed the L3 cache and relied on the HT bus only, they could probably have scaled the clockspeed considerably.

And they did not. Arguably it was a bad decision, but it was made. It could have been successful, but it wasn't. Delays occur often in this industry, and along with delays follow design trimming, deal with it.

But if it only helps in server configurations, why put it on the desktop too, where it's only driving price up and dragging yields down? Intel has also built Xeons with L3 cache in the past, when the desktop versions only had L1/L2. There doesn't seem to be a lot of business sense within AMD.

Do you have ANY idea how long it takes for the L3 cache enabled MP Xeons to come out? It's a FULL year after the regular 2P server chip. For AMD, its available as soon as the desktop parts are ready. Hell, they have been even earlier most of the time!

Why wouldn't you put out a full-L3 cache enabled chip when the whole architecture is designed with it? Or are you suggesting making a whole new part with design optimized for L2 cache and HT link to connect the cores together would have took no time to make?

Scali · Jul 21, 2010

IntelUser2000 said:
And they did not. Arguably it was a bad decision, but it was made. It could have been successful, but it wasn't. Delays occur often in this industry, and along with delays follow design trimming, deal with it.

I'm just pointing out facts. I don't have to 'deal with it'. If anyone has to deal with it, it's AMD. I'm just saying that AMD made some rather unfortunate decisions at crucial points in time.

IntelUser2000 said:
Do you have ANY idea how long it takes for the L3 cache enabled MP Xeons to come out? It's a FULL year after the regular 2P server chip. For AMD, its available as soon as the desktop parts are ready. Hell, they have been even earlier most of the time!

Again, what's your point?
I'm just saying that it doesn't seem to be the most beneficial business model for AMD.

IntelUser2000 said:
Why wouldn't you put out a full-L3 cache enabled chip when the whole architecture is designed with it? Or are you suggesting making a whole new part with design optimized for L2 cache and HT link to connect the cores together would have took no time to make?

AMD is doing it now, aren't they? Making seperate models with and without L3 cache. I'm just saying this realization came a generation too late for AMD.

Scali · Jul 21, 2010

DrMrLordX said:
No offense, but I've always considered "a couple of" to mean two, because literally speaking, that's what a couple is. It's two. Really. The fact that its use may refer to an indefinite small number is merely a product of sloppy language usage, which is particularly silly, when you knew how many cores it was (four). Why would you use "a couple of" to refer to an indefinite quantity, when the quantity was clearly defined, and which you knew about all along?

It's called 'style figure'.

DrMrLordX said:
No, the point is you've made a vague and inaccurate statement (that K10 = K8) for the purpose of supporting an argument, and I'm saying that your statement is inaccurate. The fact is that you just don't like the performance of Stars processors, and want to blame it on being unsatisfactorily improved over K8, or something along those lines.

Again, 'style figure'.

The rest is not even worth a reply. I stick by what I said, and I'm done arguing with you.

IntelUser2000 · Jul 21, 2010

Scali said:
AMD is doing it now, aren't they? Making seperate models with and without L3 cache. I'm just saying this realization came a generation too late for AMD.

Yes, and they are value models, so? The full L3 cache variants still exist. The benefits of the L3 cache was still there. Be it 5% here and there is still significant.

Scali · Jul 21, 2010

IntelUser2000 said:
Yes, and they are value models, so? The full L3 cache variants still exist. The benefits of the L3 cache was still there. Be it 5% here and there is still significant.

Now it's 45 nm, and the L3 cache is three times as large aswell (which is one of the flaws I mentioned: the 2 MB was too small to benefit).
So it's not an accurate comparison.

IntelUser2000 · Jul 21, 2010

Wrong: http://www.anandtech.com/bench/Product/21?vs=106

And the situation is a bit favorable for the Athlon II X4 because its based on the updated Phenom II core with IPC improvements, even though it might be extremely small. Yet the Phenom 9950 at the same clock is still faster, because of the L3 cache.

Scali · Jul 21, 2010

IntelUser2000 said:
Wrong: http://www.anandtech.com/bench/Product/21?vs=106

And the situation is a bit favorable for the Athlon II X4 because its based on the updated Phenom II core with IPC improvements, even though it might be extremely small. Yet the Phenom 9950 at the same clock is still faster, because of the L3 cache.

You're the one who's wrong here...
You're comparing CPUs at the same clock...
Read what I said again, and pay attention this time. What I said was that the L3 didn't have significant improvements, but it made the die larger, which would affect yields and clockspeed scaling (I said the same thing about Fermi: too big to manufacture on that process... scaled down GF104 does wonders).
Without the cache, the clockspeed would probably scale better.
So when you compare CPUs at the same (low) clock, you completely missed my point.

My point is that the difference with the cache-less CPU is minor, and can be compensated by higher clocks. Just a few hundred MHz extra should do the trick.

DrMrLordX · Jul 21, 2010

Scali said:
It's called 'style figure'.

Mmmhmm. Next time I want four of something, I'll be sure to ask for a couple and see how that goes. Can I have them call you so you can berate them when I only get two?

Seriously, what really happened is that you tried to make Barcelona look bad by referring to it as a couple of k8 cores glued together. You can defend what you said using whatever silly excuse you can dig up, but I chose to process your statement literally to expose it as being intrinsically false. It was a pointless smear. If you want to say that Barcelona was a lousy CPU, go ahead and dig up some benchmarks that make it look bad compared to contemporary Intel offerings. There are probably plenty out there just waiting for their links to be pasted in this thread. Using a 'style figure' to show baseless contempt is pointless and contributes nothing.

The rest is not even worth a reply. I stick by what I said, and I'm done arguing with you.

Yay! I win. Thanks to Anandtech for archiving the CachetoCache figures that you inaccurately cited but couldn't be bothered to actually reference.

IntelUser2000 · Jul 21, 2010

Scali said:
You're the one who's wrong here...
You're comparing CPUs at the same clock...

My point is that the difference with the cache-less CPU is minor, and can be compensated by higher clocks. Just a few hundred MHz extra should do the trick.

All speculation. You are basing a lot of things on what-ifs. Asynchronous clock L3 brought on Agena makes that irrelevant. Plus caches are largely redundant logic so the impact on yields are much less significant.

Scali · Jul 21, 2010

DrMrLordX said:
Mmmhmm. Next time I want four of something, I'll be sure to ask for a couple and see how that goes. Can I have them call you so you can berate them when I only get two?

Wow, now there's someone with no communication skills whatsoever.
If you need to specify something, you specify it.
In my case, it was clear how many cores I was speaking of, so I chose not to state that specifically. The information would have been redundant.

DrMrLordX said:
Yay! I win. Thanks to Anandtech for archiving the CachetoCache figures that you inaccurately cited but couldn't be bothered to actually reference.

No you didn't.
I never claimed that there was NO difference. I claimed the difference was not significant (which is not citing any numbers, let alone inaccurately, as it is purely a subjective observation). I *did* however refer to the Conroe architecture and how it had much better core-to-core communication.
Care to cite the numbers for Intel's CPUs there, which you so conveniently neglected to mention?
Or will you just accept it when I say they're roughly three times as fast as AMD's? And then compare that to AMD's gain by sticking L3 on their CPU... I think the word is: fail.

Scali · Jul 21, 2010

IntelUser2000 said:
All speculation. You are basing a lot of things on what-ifs. Asynchronous clock L3 brought on Agena makes that irrelevant. Plus caches are largely redundant logic so the impact on yields are much less significant.

Then we simply disagree. You're speculating as much as I am.

DrMrLordX · Jul 21, 2010

Scali said:
Without the cache, the clockspeed would probably scale better.

The only thing that scales better when you pull the L3 is the Northbridge, and that only buys you 100-200 mhz NB speed. The reduction in system memory latency and increase in system memory read/write bandwidth from the extra NB speed doesn't make up for the lack of L3.

Scali · Jul 21, 2010

DrMrLordX said:
The only thing that scales better when you pull the L3 is the Northbridge, and that only buys you 100-200 mhz NB speed. The reduction in system memory latency and increase in system memory read/write bandwidth from the extra NB speed doesn't make up for the lack of L3.

Erm, you're not getting what I'm saying!
Smaller die means less defects per die, means better yields, means better binning for clockspeeds.

Phynaz · Jul 21, 2010

Scali said:
Erm, you're not getting what I'm saying!
Smaller die means less defects per die, means better yields, means better binning for clockspeeds.

Defect yeild != parametric yeild.

Less defects does not equate to higher speed bins.

BTW, cache is designed with built in redundancy, a defect in cache still results in a fully operational CPU.

DrMrLordX · Jul 21, 2010

Wow, now there's someone with no communication skills whatsoever.

What I'm communicating to you is that using baseless smears to strengthen your argument is poor form.

If you need to specify something, you specify it.

You did need to specify something.

In my case, it was clear how many cores I was speaking of, so I chose not to state that specifically. The information would have been redundant.

No, what was clear was that you want to make the entire Stars line of processors look even worse than its benchmark data would already indicate.

Scali said:
No you didn't.
I never claimed that there was NO difference. I claimed the difference was not significant (which is not citing any numbers, let alone inaccurately, as it is purely a subjective observation).

How can you possibly claim that the difference between core-to-core cache propagation and socket-to-socket cache propagation on B2 Barcelona systems was not significant when core-to-core cache propagation had HALF the latency as compared to socket-to-socket?

Furthermore, if your observation was subjective, then how can it be of any value in a field where objectivity is king?

I *did* however refer to the Conroe architecture

Why did you even bring up Conroe? Your statement was that K10 = K8, and I proved you wrong. The fact that Conroe and Penryn were outstanding micro-architectures is irrelevant. It's not like the transition from Netburst to Conroe set the bog standard for architectural updates. That was a once-in-a-lifetime kick in the pants from internally-competitive R&D departments with enormous R&D budgets on tap.

The transition from K8 to K10, both on the desktop and in the server market, was an improvement for AMD, albeit one that was late. To say that there was no improvement (or that it was insignificant) is false, whether or not the transition is viewed in the light of Intel's success with the Conroe launch.

Erm, you're not getting what I'm saying!
Smaller die means less defects per die, means better yields, means better binning for clockspeeds.

I very well get what you're saying. AMD never made a 65nm x4 without L3, so we'll never know if, on that process or with those steppings (B1-B3), whether or not their yields would have improved without L3.

What we DO know is that AMD DID make a 45nm x4 without L3; in fact, there are four of them, and three of them have two different steppings. Having beaten a C2-stepping Athlon II x4 635 to death (literally), I can hereby certify that the absence of L3 on a 45nm, C2-stepping x4 quad does nothing, zip, zero, zilch to improve yields, binning, or clockspeeds. AMD released C2 Denebs (Phenom II x4s) at higher clockspeeds than the fastest C2 Propus (Athlon II x4), and the fastest C2 Denebs certainly overclocked higher than your average Propus. My Propus hit 3.7-3.75 ghz tops with a consolidated overclock, whereas most Propus chips struggled to go faster than 3.5 ghz. C2 Deneb hit 3.6-3.8 ghz on a fairly regular basis.

Whatever AMD did when they made Propus, it sure as heck didn't help it scale up to better clockspeeds. In fact, it seems to be consistently worse than Deneb. I haven't had the chance to play with a C3 Propus yet, but maybe someday . . .

Scali · Jul 21, 2010

Phynaz said:
Defect yeild != parametric yeild.

Less defects does not equate to higher speed bins.

BTW, cache is designed with built in redundancy, a defect in cache still results in a fully operational CPU.

That's not what I'm talking about though.
Look at Fermi... on the GTX465, they turn pretty much half the GPU off... yea, it's still fully operational, but the power consumption and clockspeed scaling are horrible. It's just too large for the 40 nm process. The dies are so large that there will always be poor areas on the die (not just cache obviously). Then it all just drops off exponentially.
AMD had a similar situation with their 65 nm process. Even their dualcores were initially having trouble beating the 90 nm ones.
Cutting down die size and/or using MCM could have benefited Barcelona on 65 nm considerably, no doubt in my mind.
Eventually 45 nm solved the problems.

Scali · Jul 21, 2010

DrMrLordX said:
You did need to specify something.

So you're saying you didn't know how many cores Barcelona has? And I had to explain that to you?

DrMrLordX said:
No, what was clear was that you want to make the entire Stars line of processors look even worse than its benchmark data would already indicate.

No, I wanted to make them look exactly as bad as the benchmark data indicates.

DrMrLordX said:
How can you possibly claim that the difference between core-to-core cache propagation and socket-to-socket cache propagation on B2 Barcelona systems was not significant when core-to-core cache propagation had HALF the latency as compared to socket-to-socket?

From the article you linked yourself:
"AMD's native quad-core needs about 76ns to exchange (L1) cache information. That's not bad, but it's not fantastic either as the shared L2 cache approach of the Xeons allows the dual cores to exchange information via the L2 in about 26-30ns. Once you need to get information from core 0 to core 3, the dual die CPU of Intel still doesn't need much more time (77ns) than the quad-core Opteron (76ns). The complex L1-L2-L3 hierarchy might negate the advantages of being a "native" quad-core somewhat, but we have to study this a bit further as it is quite a complex matter."

Half the latency? Hardly.
And you want to accuse ME of poor form?

DrMrLordX said:
Furthermore, if your observation was subjective, then how can it be of any value in a field where objectivity is king?

You can figure out the objective results yourself.
See the above quote from the article. Intel's L2 cache gives you 26-30 ns core-to-core, almost a factor 3 better than AMD.
Even through the FSB (which is even worse than AMD's HT theoretically), Intel can pretty much match AMD's shared L3 cache.
So where is the advantage of the L3 cache and the native quad core design? We're not seeing it.
Intel has a much better best case, and a comparable worst case.
And as you can see, the dual core Opteron actually did BETTER, both with on-die and socket-to-socket communication.

DrMrLordX said:
Why did you even bring up Conroe?

I made that very clear: It is a native dualcore design that DOES take advantage of the single-die design. It serves as a reference for the gains you can get in core-to-core communications.

DrMrLordX said:
I very well get what you're saying.

That wasn't apparent in your reply at all.

AMD Bulldozer and Llano going to be delayed? GF 32nm troubles?

Banned

Golden Member

Banned

Lifer

Banned

Elite Member

Banned

Lifer

Elite Member

Banned

Banned

Elite Member

Banned

Elite Member

Banned

Lifer

Elite Member

Banned

Banned

Lifer

Banned

Lifer

Lifer

Banned

Banned