AMD demos Bulldozer at Investors conference

soccerballtux · Nov 12, 2010

Kuzi said:
Of course BD will have lots of improvements over Deneb/Thuban, but you have to remember the competition (Intel) already had some of these improvements for a year/few years already, so it's hard to get too excited about them (Examples: 32nm process, HKMG, Power gating, Turbo etc).

Honestly I see BD as being more interesting than SandyBridge, but I won't kid myself into believing that BD will be faster. It's possible that an 8-core BD beats a 4-core (8-thread) SB when most/all cores are running, but that's maybe the only situation where BD may come ahead (at similar clocks). Though this comparison would only be fair if AMD prices their 8-core processors similarly to Intel's Quad SBs.

It's very likely that AMD will compete with SB by releasing higher clocked BD at reasonable prices. One handicap that SandyBridge will have is that every CPU will have an IGP with it, unnecessarily increasing die size (cost), when enthusiasts will use a discrete Graphics card anyways.

at this point everything is so fast anyways I just like having the psychological benefit of having 8 virtual cores in task manager. I'm past the point of caring which is faster. Everything I need my comp for it accomplishes perfectly fast.

Scali · Nov 12, 2010

Kuzi said:
One handicap that SandyBridge will have is that every CPU will have an IGP with it, unnecessarily increasing die size (cost), when enthusiasts will use a discrete Graphics card anyways.

From what I understood, Intel will continue to make standalone CPUs, although perhaps not under the codename of Sandy Bridge.

Kuzi · Nov 12, 2010

soccerballtux said:
I'm past the point of caring which is faster. Everything I need my comp for it accomplishes perfectly fast.

I feel the same way too 🙂

And I'd take a fast SSD over a faster CPU any day after experiencing the difference a good SSD can make.

Scali said:
From what I understood, Intel will continue to make standalone CPUs, although perhaps not under the codename of Sandy Bridge.

From what I know, at least initially, Intels next-gen "tock" CPUs will all have on-die graphics. Maybe Nemesis could pitch in as he probably knows more about this.

Edrick · Nov 12, 2010

Kuzi said:
One handicap that SandyBridge will have is that every CPU will have an IGP with it

Interesting....where did you read that? Everything I ever read for the past year states that the SB-E (LGA2011) which will complete with BD will not have an IGP, but rather more cores and more L3 cache.

ilkhan · Nov 12, 2010

Edrick said:
Interesting....where did you read that? Everything I ever read for the past year states that the SB-E (LGA2011) which will complete with BD will not have an IGP, but rather more cores and more L3 cache.

correct. Everything we know says s2011 will NOT have on-chip GPUs, but all s1155 chips WILL have on-die GPUs.

JFAMD · Nov 12, 2010

Scali said:
We know that the L2 is relatively large, so that implies relatively high latency.
This means that a small low-latency L3 cache won't make much sense, as you already have the high-latency L2 cache 'in the way'. Which is one of the problems with Barcelona, hence the reference.
Hence I would expect a large L3 cache (like Phenom II), but we know that the L3 cache is the same size as the combined L2 caches, which again reminds me of Barcelona.

So it is up to JFAMD to explain the secret sauce which makes this configuration work, despite its appearance working against it, based on our previous experiences with AMD architectures.
He has had the chance to explain it, but decided not to. I've been in the industry long enough to know what that means.

How about this instead:

- Bulldozer is not Barcelona
- Sandy Bridge is not NetBurst

Let's all agree on this and move forward. If you've been in in the indsutry long enough then you should know that compaines learn from the past. Both companies learn.

Scali · Nov 12, 2010

JFAMD said:
How about this instead:

- Bulldozer is not Barcelona
- Sandy Bridge is not NetBurst

How about you not putting words in my mouth?
I never claimed either, nor did I even imply such.

JFAMD said:
If you've been in in the indsutry long enough then you should know that compaines learn from the past. Both companies learn.

The question is: what have you learnt? How does Bulldozer differ from Barcelona in the aforementioned aspects?

Hard Ball · Nov 12, 2010

Scali said:
We know that the L2 is relatively large, so that implies relatively high latency.
This means that a small low-latency L3 cache won't make much sense, as you already have the high-latency L2 cache 'in the way'. Which is one of the problems with Barcelona, hence the reference.

From microarchitectural point of view, this statement is wrong on many levels.
Latency for L3 cache is not necessarily additive with earlier levels of caches. The same kind of absurd arguments rose up a couple of years ago with speculations about K10 and the like, you can see my post there regarding this:

http://arstechnica.com/civis/viewtopic.php?f=8&t=181040&p=4752739&#p4752739

I can offer more details if you are interested.

Scali said:
Hence I would expect a large L3 cache (like Phenom II), but we know that the L3 cache is the same size as the combined L2 caches, which again reminds me of Barcelona.

It is pretty close to the norm, if you take a survey of CMP mem hierarchies that do not enforce precise inclusiveness. It also will depend on the number of banks of each cache level, set-associativity of each, presence and absence of way-prediction, and a number of other factors. Simple metric of relative size alone tells little to nothing about the hierarchy organization.

Scali · Nov 12, 2010

Hard Ball said:
From microarchitectural point of view, this statement is wrong on many levels.
Latency for L3 cache is not necessarily additive with earlier levels of caches.

That is not the statement that I was making however.
I never said they were additive. Merely that having two or more levels of cache with more or less the same latency and size isn't very useful.

Namely, if you can lookup quickly in L2, you win everytime you get a cache hit. You can do a parallel lookup in L3, which you can wait for if L2 misses.

But if you already have a large and high-latency L2 cache, then you don't have the advantage of the quick hits.
Therefore, you want the L3 to have the advantage of having MORE hits, because you won't be getting lower latencies from L3, obviously. Better hit rate is basically the only advantage that L3 can give you. And this means L3 should be large.
Problem is that since you already have a relatively large L2 cache, you won't be getting as many misses, so L3 is not being used all that often, making it all the harder to really gain much performance.

Hard Ball · Nov 12, 2010

Scali said:
That is not the statement that I was making however.
I never said they were additive. Merely that having two or more levels of cache with more or less the same latency and size isn't very useful.

Namely, if you can lookup quickly in L2, you win everytime you get a cache hit. You can do a parallel lookup in L3, which you can wait for if L2 misses.

OK, if that's what you actually meant, then I misunderstood you the first time.

But if you already have a large and high-latency L2 cache, then you don't have the advantage of the quick hits.
Therefore, you want the L3 to have the advantage of having MORE hits, because you won't be getting lower latencies from L3, obviously. Better hit rate is basically the only advantage that L3 can give you. And this means L3 should be large.
Problem is that since you already have a relatively large L2 cache, you won't be getting as many misses, so L3 is not being used all that often, making it all the harder to really gain much performance.

However, your reasoning is still wrong, size of the entire cache does not determine the latency. The # cycles it takes for initial tag lookup (which depends largely on the size of banks, length of the wires that carry the control bits, and pass through associative decoder, per way comparator and mux) + way prediction + any VPN to PPN lookup + a few other things. As I said before, size or relative size alone in the hierarchy tells you nearly nothing. I don't see how you can conclude these things from the evidence that are public right now.

Scali · Nov 12, 2010

Hard Ball said:
However, your reasoning is still wrong, size of the entire cache does not determine the latency.

Theoretically no...
But practically larger caches tend to have either higher latency... or poorer associativity, so they get less hits.
I think I have a pretty good idea of what AMD is going to aim for/is capable of, cache-wise. Call it an educated guess.
But it wouldn't hurt if JFAMD spilled the goods, so we can be sure that my estimates are correct.

Hard Ball · Nov 12, 2010

Scali said:
Theoretically no...
But practically larger caches tend to have either higher latency... or poorer associativity, so they get less hits.
I think I have a pretty good idea of what AMD is going to aim for/is capable of, cache-wise.
But it wouldn't hurt if JFAMD spilled the goods, so we can be sure that my estimations are correct.

I see what you are saying. But microarchitectures are not collocations of broad tendencies, especially given the fact that this is a completely redesigned microarchitecture. These kind of assumptions are simply not warranted, given that you only have a single one of the a few dozens of variables that you would need to determine latency from the publicly available info.

It's like saying that Mnt. Everest lies on a lower latitude than Rome, so it must have awefully mild climate.

Schmide · Nov 12, 2010

Scali said:
Merely that having two or more levels of cache with more or less the same latency and size isn't very useful.

It is if the higher level (L3) is fully (or very largely) associative, even if the size is relatively similar. Not to mention there are often tables to track cache coherency in the L3.

Scali · Nov 12, 2010

Hard Ball said:
I see what you are saying. But microarchitectures are not collocations of broad tendencies, especially given the fact that this is a completely redesigned microarchitecture. These kind of assumptions are simply not warranted, given that you only have a single one of the a few dozens of variables that you would need to determine latency from the publicly available info.

It's like saying that Mnt. Everest lies on a lower latitude than Rome, so it must have awefully mild climate.

Who decides what is 'warranted' in speculation and what is not?
Really, you don't have to agree with my speculations, but I don't care for this post of yours at all. Why do you try to ridicule me with that 'analogy' of yours? I think that is completely uncalled for.
The irony is: you are doing exactly the same thing here: You don't know what information I have exactly, nor what kind of assumptions I have made.

Scali · Nov 12, 2010

Schmide said:
It is if the higher level (L3) is fully (or very largely) associative, even if the size is relatively similar. Not to mention there are often tables to track cache coherency in the L3.

That's a nice theory, but given AMD's track record with associativity, I think we all know that it's not going to be THAT associative.

Hard Ball · Nov 12, 2010

Schmide said:
It is if the higher level (L3) is fully (or very largely) associative, even if the size is relatively similar.

This doesn't really exist in the real world, there is actually nothing anywhere close to it.

In order for a modern L2 sized cache to be fully associative, the logic that are needed to do tag lookups are going to take up the entire IC and then some, and the latency would be much higher than going to system memory.

Vast majority of the time, it is pretty close to completely opposite of fully associative:
# sets >>>>>> ways of associativity.
Otherwise, it's not a realistic solution for the sizes we are talking about in modern ICs.

Not to mention there are often tables to track cache coherency in the L3.

I think you are speaking of directory based coherence, which is present in Istanbul in a primitive form. It remains to see, in the public info from AMD, whether the scheme would become more elaborate with a new iteration.

Scali · Nov 12, 2010

Hard Ball said:
This doesn't really exist in the real world, there is actually nothing anywhere close to it.

In order for a modern L2 sized cache to be fully associative, the logic that are needed to do tag lookups are going to take up the entire IC and then some, and the latency would be much higher than going to system memory.

Vast majority of the time, it is pretty close to completely opposite of fully associative:
# sets >>>>>> ways of associativity.
Otherwise, it's not a realistic solution for the sizes we are talking about in modern ICs.

I think you are speaking of directory based coherence, which is present in Istanbul in a primitive form. It remains to see, in the public info from AMD, whether the scheme would become more elaborate with a new iteration.

There you go... Now you're doing pretty much the same kind of speculation and assumptions as what I was doing.

Hard Ball · Nov 12, 2010

Scali said:
Who decides what is 'warranted' in speculation and what is not?
Really, you don't have to agree with my speculations, but I don't care for this post of yours at all. Why do you try to ridicule me with that 'analogy' of yours? I think that is completely uncalled for.
The irony is: you are doing exactly the same thing here: You don't know what information I have exactly, nor what kind of assumptions I have made.

I'm not trying to ridicule by some analogy, if I came across as such, I apologize.

But I do have to say that the degree of obsurdity in that analogy is very close to that of the line of reasoning that you chose, which is why I thought it might be appropriate. Again, it's nothing personal at all, merely talking logic.

I did have to point out to other forum members that your speculations are about 99.9% baseless, from the POV of someone who has studies and worked in microarchitectural design. Of course, you are free to form your own opinion, but people here should also know that your opinion is not an informed one.

Schmide · Nov 12, 2010

Hard Ball said:
In order for a modern L2 sized cache to be fully associative, the logic that are needed to do tag lookups are going to take up the entire IC and then some, and the latency would be much higher than going to system memory.

L2 will never be uber associative although Intel's L2 is nicely high.

I was talking about the L3 which I believe is rumored to be 32-64 way associative. I know it's not fully associative, but at some point very many way associative becomes fully associativeish.

Hard Ball · Nov 12, 2010

Scali said:
There you go... Now you're doing pretty much the same kind of speculation and assumptions as what I was doing.

How so, the only things I mentioned are the real, known, designs today.

I only said that his proposal would not come close to fitting into any realistic design that can possibly be manufactured today.

Scali · Nov 12, 2010

Hard Ball said:
I did have to point out to other forum members that your speculations are about 99.9% baseless, from the POV of someone who has studies and worked in microarchitectural design. Of course, you are free to form your own opinion, but people here should also know that your opinion is not an informed one.

That one is even more uncalled for.
On top of that it is, ironically enough, just your uninformed opinion of me.

Hard Ball · Nov 12, 2010

Schmide said:
L2 will never be uber associative although Intel's L2 is nicely high.

I was talking about the L3 which I believe is rumored to be 32-64 way associative. I know it's not fully associative, but at some point very many way associative becomes fully associativeish.

OK, I see that you mean "nearly fully" as something that provide a similar probability model of lookup success as the fully associative model.

That is a valid comparison, and a good point. But just to make sure in the future, "fully associative" is a very technical term in comp arch that means something very specific, so you might want to use a different term next time. I certainly thought that you meant something that you actually didn't.

Phynaz · Nov 12, 2010

Scali said:
How about you not putting words in my mouth?
I never claimed either, nor did I even imply such.

The question is: what have you learnt? How does Bulldozer differ from Barcelona in the aforementioned aspects?

Scali, the confrontational and argumentative tone of your posts isn't going to get you any favors from JF.

99% of the details of BD architecture is never going to be released to the public.

Scali · Nov 12, 2010

Hard Ball said:
How so, the only things I mentioned are the real, known, designs today.

I only said that his proposal would not come close to fitting into any realistic design that can possibly be manufactured today.

So you apparently have some kind of mental model of what realistic designs would be possible today. And you are implying that Bulldozer has to fit in there somehow.
That is exactly what I am doing.

Scali · Nov 12, 2010

Phynaz said:
Scali, the confrontational and argumentative tone of your posts isn't going to get you any favors from JF.

I think JFAMD set the confrontation and argumentative tone here.

Phynaz said:
99% of the details of BD architecture is never going to be released to public.

I'm not asking for all the details. Just some more info on things like cache latency and associativity.
This has always been public information for both Intel and AMD CPUs (tools like CPU-Z can tell you what you want to know), and I have no reason to believe that Bulldozer will be different. For the most part you can measure these things with existing software anyway.

AMD demos Bulldozer at Investors conference

Lifer

Banned

Senior member

Golden Member

Golden Member

Senior member

Banned

Senior member

Banned

Senior member

Banned

Senior member

Diamond Member

Banned

Banned

Senior member

Banned

Senior member

Diamond Member

Senior member

Banned

Senior member

Lifer

Banned

Banned