'A couple of' can be four; it is not always to be taken literally as two, but often means 'a few' or 'several': a small number.
http://www.thefreedictionary.com/a+couple+of
Again, reading comprehension (we're talking about a quadcore architecture; how many cores do you think it has? So what would 'a couple of' refer to? Is it useful to argue something that basic?).
No offense, but I've always considered "a couple of" to mean two, because literally speaking, that's what a couple is. It's two. Really. The fact that its use may refer to an indefinite small number is merely a product of sloppy language usage, which is particularly silly when you knew how many cores there were (four). Why would you use "a couple of" to refer to an indefinite quantity, when the quantity was clearly defined and you knew it all along?
You deliberately seem to want to misunderstand EVERYTHING I post, just so you have something to argue.
But you'll lose every time, this way. You can't argue with the dictionary.
No, the point is you've made a vague and inaccurate statement (that K10 = K8) for the purpose of supporting an argument, and I'm saying that your statement is inaccurate. The fact is that you just don't like the performance of Stars processors, and want to blame it on being unsatisfactorily improved over K8, or something along those lines.
Back then it may have been a bit confusing, but these days they introduced the term 'uncore' to separate that from the 'core' logic.
K8 dual cores share the 'uncore' logic, but there are still two sets of identical 'core' logic, which is copy-paste.
Core Duo/Core2 Duo share the L2 cache, which is 'core' logic, not 'uncore'. See the difference?
The result is massively improved core-to-core communication.
So, despite the fact that AMD launched the first x86 dual-core processor that actually shared logic between cores (be it 'uncore' or 'core' logic), something which Smithfield and Presler did *not* do, I'm to believe that K8 was not a "real" dual-core processor or that it was somehow inadequate at the time of launch?
By 2006, when Conroe finally hit the market, the last thing that was hobbling AM2 K8 was its L2 architecture (or the fact that it didn't have shared L2 between cores).
And honestly, if your argument holds water, then Nehalem is also "glued" together, since it doesn't share its L2 cache either. Last time I checked, Nehalem didn't use any glue.
No, that is a requirement for proper functioning of the CPU.
According to whom? K8 didn't just function "properly" as a dual-core processor . . . for its time, it was brilliant. Rehashed to death by summer of '06, but brilliant at its launch and for many months following.
Exploiting it means using it to your advantage to improve performance. AMD did not do this.
Putting the memory controller on the die did improve performance.
Their dualcores didn't have significantly faster core-to-core communication than a two-socket single-core K8 system. Again, see cache2cache results.
Well let's see here . . . we have an Opteron 880 (could not find cachetocache results for an Opteron 280) in this CachetoCache benchmark run by Anandtech:
http://www.anandtech.com/show/2322/4
Core to core latency on an Opteron 880: 134 ns
Socket to socket latency on an Opteron 880: 169-188 ns <-- I would be curious to see if the socket-to-socket latency on a 2p system would be different, but given HT's design, I would expect that it would not be, especially given the socket-to-socket latency on the 2p Barcelona systems in the same benchmark.
What we also see here is core-to-core latency on the Opteron 2350 (B1 Barcelona) being lower than socket-to-socket latency on the Opteron 2350, which sort of flies in the face of what you were saying about Barcelona's L3 performance. Then, in this review by Anandtech:
http://www.anandtech.com/show/2386/4
B1 Barcelona returns with core-to-core latency a full 25 ns lower than what it was in the first Anandtech article. B2 (in the form of the Opteron 2360SE) shows a core-to-core latency another 20 ns lower than the updated Opteron 2350 result. The socket-to-socket latency for Barcelona remained unchanged between benchmarks.
So, not only is your statement about core-to-core latency on K8 being the same as socket-to-socket latency on an Opteron rig bogus, so was your statement about the L3 on Barcelona being "so poorly designed that the latency is barely better than what you'd get from a two-socket dualcore K8 system using the hypertransport bus".
From Anandtech's own benchmarks, we see that Barcelona's cache architecture allowed cache propagation to occur with a latency penalty 72-92 ns lower than what cache propagation over the HT link would incur. The raw latency of K10's L3 varies in ns based on core frequency, but for the 2360SE (2.5 GHz), assuming a "worst case" of 56-cycle latency on the L3, that gives you a raw L3 latency of 22.4 ns.
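For anyone who wants to check that arithmetic, the cycles-to-nanoseconds conversion is trivial; this little sketch just reproduces the 56-cycle / 2.5 GHz figure quoted above (those two inputs come straight from the post, everything else is boilerplate):

```python
def latency_ns(cycles: float, freq_ghz: float) -> float:
    """Convert a latency given in clock cycles to nanoseconds.

    A clock at f GHz completes f cycles per nanosecond, so
    latency in ns is simply cycles divided by frequency in GHz.
    """
    return cycles / freq_ghz

# Worst-case L3 on a 2.5 GHz Opteron 2360SE, per the numbers above:
print(latency_ns(56, 2.5))  # 22.4 ns
```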
making all the 'native die' talk nothing but hot air.
I do not believe you are in any position to accuse anyone of being full of hot air. And, as I have already stated, shared L2 is not a "requirement" for a multi-core die to be a native multi-core die; look at Nehalem. Or K8. Or K10/K10.5. They all took (or take) advantage of the cores being on the same die in their own way, and they all gain in performance from doing so. Conroe and Penryn were their own special beasts, but that doesn't make them the only processors in x86 history to take advantage of having multiple cores on the same die.
That is the biggest load of crock that I've heard yet, as an excuse for AMD's L3.
No it isn't. You ever tried benchmark runs on a K10.5 at different NB speeds, especially on chips that have L3? The difference is quite telling. Slow stock NB speeds have been a source of frustration for AMD enthusiasts since people started toying with Agena.
Bottom line is, removing L3 cache on a K10 barely affects performance.
I used to think that, but my own comparative benchmarking results have shown that to be false, even in 1P configurations where L3 doesn't reach its full potential. Do you remember the TLB bug that you were "lenient enough" to overlook? The fix on systems afflicted with the TLB bug, which entailed disabling the L3 cache altogether, caused a performance hit on Barcelona systems that was severe enough that many chose not to implement it when they had the option not to.
So why is it there in the first place? If you're going to put it on there, get it right. Else it's just a waste of die space.
It certainly isn't a waste of die space.
Makes you wonder what the AMD engineers were thinking, then. They should have known that HT was so good that it would be very hard for them to get an L3 cache system out that would have significant benefits (especially the early 65 nm models, where only 2 MB would fit, not even larger than the combined per-core L2 caches. If you're going to apply high-latency cache, rule #1 is to use a cache big enough that the hit rate makes up for the latency).
Well, according to Anandtech's own CachetoCache numbers, it looks like AMD's engineers managed to get core-to-core cache propagation down to lower latencies than ever vs. K8 Opterons despite having twice as many cores on the same die. They also managed to cut core-to-core cache propagation latency down to half of what the socket-to-socket cache propagation latency is on a multi-socket Barcelona system. Despite HT being awesome, AMD's engineers came through anyway.
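The "hit rate makes up for the latency" rule can be sketched with the standard average-memory-access-time formula. The hit rate and DRAM latency below are made-up illustrative numbers, not measured Barcelona figures; only the 22.4 ns L3 latency is taken from the discussion above.

```python
def amat_ns(l3_hit_rate: float, l3_latency_ns: float, mem_latency_ns: float) -> float:
    """Average latency (ns) of an access that has already missed in L2.

    A hit pays only the L3 latency; a miss pays the L3 lookup *and*
    the trip to main memory.
    """
    hit_cost = l3_hit_rate * l3_latency_ns
    miss_cost = (1.0 - l3_hit_rate) * (l3_latency_ns + mem_latency_ns)
    return hit_cost + miss_cost

# Hypothetical numbers: every L2 miss going straight to a 100 ns DRAM,
# versus a small L3 with a modest 40% hit rate at 22.4 ns.
no_l3 = amat_ns(0.0, 0.0, 100.0)     # 100.0 ns
with_l3 = amat_ns(0.4, 22.4, 100.0)  # 82.4 ns
print(no_l3, with_l3)
```

Even a mediocre hit rate lowers the average in this toy model, which is the basic case for shipping an L3 at all; whether a specific L3 design clears that bar depends on the real hit rates, which is exactly what the benchmarks are arguing about.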
As I said before, solutions to problems that aren't there.
Right . . .
Not literally, but at the end of the day, that's pretty much how it performs.
Revisionist history.
had they removed the L3 cache and relied on the HT bus only, they could probably have scaled the clockspeed considerably.
I used to think that. Then Propus showed up and, lamentably, proved me very wrong. L3-equipped Deneb kicks Propus' ass; I should know, I benched my Propus and compared numbers to Deneb benchmarks out there, and it was damn hard getting the Propus to catch up in anything but SuperPi 32m.
But if it only helps in server configurations,
That is not the case.