Some Bulldozer and Bobcat articles have sprung up

AtenRa · Aug 25, 2010

When we have ONE thread that needs a 256-bit FP and the SECOND thread needs 128-bit FP, one thread will be executed in one Module (256-bit FMAC) and the second thread will be executed in Module number two (128-bit FMAC).

Now imagine that we have 3 x 256-bit FP Threads and 3 x 128-bit FP threads (6 threads), that means you don’t have enough modules (4 Module) to execute in the same cycle. If you had 6 Physical cores (with 256-bit FP units) you could execute 6 such Threads simultaneously at the same cycle.

So don’t forget the difference between 8 Physical and 8 Logical cores

jones377 · Aug 25, 2010

"Logical" cores doesn't necessary mean the cores aren't actual cores, it's just a way for the OS and software to manage the scheduling in a way to maximize performance. And it can be done using the techniques already in place for SMT. But JF is on a crusade against anything SMT or related to SMT so naturally he'd argue against it. I'm just glad the actual engineers at AMD aren't bound by their marketing department.

I mean it shouldn't even be an issue because it's not something that will come up for the general public, so AMD can call their CPUs whatever they want.

Cerb · Aug 25, 2010

Scali said:
AMD put its eggs in the OpenCL basket... perhaps because it's a multiplatform solution and open standard.

The problem being that they pretty much put out the basket, and talked about the eggs, but then left the eggs with the hen.

(ever looked at their GPGPU SDK? It contains quite a bit of OpenCL samples, and more and better documentation than AMD has... even a nice Cuda-to-OpenCL migration guide)...

Yes, and that was one of the reasons I got a nVidia card. Other aspects sealed the deal, but I was dead-set against nVidia, until AMD kept putting off adding support to release drivers, and doing so little to improve their SDK. It's all only so meaningful and challenging with two CPU cores to use. While I'm not keen on working as a programmer again (I've more or less had the Office Space experience, but as an asthmatic, construction work is out of the question

), I'd rather use tools that might be useful if such a job came into my path, or at least tools that let me make binaries anyone can make effective use of, and AMD is lacking a bit, there (where they have it on a technicality, they lack the 'effective' part, compared to nVidia).

AtenRa said:
Now imagine that we have 3 x 256-bit FP Threads and 3 x 128-bit FP threads (6 threads), that means you dont have enough modules (4 Module) to execute in the same cycle. If you had 6 Physical cores (with 256-bit FP units) you could execute 6 such Threads simultaneously at the same cycle.

That's all assuming that any given CPU core can feed those units every cycle, and the application(s) has/have such instructions every cycle in a common loop, in just the right way, over just the right number of threads, so as to pose this problem of having 75% utilization. It just seems like too deep speculation to worry about, when such a problem requires everything else in all of the CPUs to be functioning in a practically perfect way. Real code, running on real systems, will tell the story, and we're just going to have to wait several months.

Scali · Aug 25, 2010

Cores are always logical. That's also how CPUID works, for example.
Since Pentium 4, all CPUs report the HTT support bit in the CPUID feature flags. This doesn't mean that HT is enabled, or even that it COULD be enabled. It just indicates that you can query the logical core count of the CPU.
So eg a Core i5 750 will report that it has 4 logical cores. Which in this case is exactly the same as its 4 physical cores.

Basically, a core is always logical... and a logical core COULD be a physical core, but it could be something else as well.

A logical core is just 'a unit that a thread can be executed on'. It's become more of a software thing, really, with technologies like SMT blurring the lines between cores and threads.

JFAMD · Aug 25, 2010

jvroig said:
They've already said their first Zambezi chip will be a quad-core.

Zambezi is 8-core.

AtenRa · Aug 25, 2010

Cerb said:
That's all assuming that any given CPU core can feed those units every cycle, and the application(s) has/have such instructions every cycle in a common loop, in just the right way, over just the right number of threads

With OoO (Out of Order) execution you dont need to execute instructions one after the other in a given order.

AtenRa · Aug 25, 2010

from xtremesystems forum

http://www.xtremesystems.org/forums/showpost.php?p=4523917&postcount=204

JFAMD said:
Today's processors have 3 execution units that are shared between ALU/AGU. That is essentially 1.5 ALU and 1.5 AGU. With BD we get 2 AGU and 2 ALU. Much better.

Deneb has 3 execution units (3 way) within the Integer execution unit and as JF says that’s actually 1 and a half ALU and 1 and a half AGU (for 3 instructions) plus 3 Load/Store units.

But in BD, the Integer execution unit has 2 ALUs plus 2 AGUs (4 vs 3) plus one Ld/ST unit.

http://www.anandtech.com/Gallery/Album/754#7

So with a 4 way Decoder (vs 3-way Deneb) and 2 independent Int Schedulers we have better IPC in BD

Scali · Aug 25, 2010

AtenRa said:
Deneb has 3 execution units (3 way) within the Integer execution unit and as JF says thats actually 1 and a half ALU and 1 and a half AGU (for 3 instructions) plus 3 Load/Store units.

That's not true.
Deneb has three ALUs AND three AGUs:
http://en.wikipedia.org/wiki/File:AMD_A64_Opteron_arch.svg

Perhaps what JF means is that they share an execution port... but that's not the same thing as saying it has 1.5 of each.
AGU is usually mutually exclusive with ALU... That is, the ALU instruction is dependent on the address being generated and the data being fetched or stored. So I don't quite see how this shared setup would be a huge disadvantage, especially if you have three pipelines.
It may make more sense when you have only two... but trying to sell that as a deficit of the K7/8/10 architecture is a bit silly.

Idontcare · Aug 25, 2010

jvroig said:
The overhaul in architecture is supposedly what makes this possible - from there, we can also reasonably infer that perhaps K10's 3 ALU / 3 AGU is actually overkill, and does not actually get utilized anywhere close to its max potential, hence AMD did not go around adding more ALUs and AGUs - instead, they removed one ALU and AGU, and then optimized the relevant parts of the architecture to make sure the two are fed much better, hence a performance increase was achieved.

I think we all to easily forget just how xtor-constrained CPU designs of 4 and even 6 yrs ago were compared to the xtor budgets that today's designers get to throw at their plans.

For the xtor budget, time budget, and resource budget the original 3/3 of the K10 may have represented the optimal balance to get the performance where they could at a die size and clockspeed they needed for it to be commercially viable.

Move forward a couple nodes, give the same guys 4-5x more xtors and ask them "now what would you do different with all this extra hardware to throw at the architecture?" and they may suddenly implement a design they wanted to do 5 yrs ago but couldn't for reasons stated above. Now they can reduce the ALU/AGU count while overhauling other areas of the pipeline (sucking up massive xtor counts while doing so) and create an all the more performing architecture.

(note: I am not writing this as if I am telling you anything, just airing my thoughts...it always knocks my socks of when I think that P4 northwood had a mere 55million xtors on 130nm and ran at 3.4GHz, look how far we've come with xtor budgets per core to throw at making the same pipeline perform better)

JFAMD said:
Zambezi is 8-core.

Despite all the negative crap swirling around in this thread JF finds the high-road and drops in to give us some nice info. Thanks! :thumbsup:

Idontcare · Aug 25, 2010

http://www.anandtech.com/Gallery/Album/753#14

Bobcat's 512KB L2$ is clocked at half the speed of the core logic?

They state it is to save power, which lowering the clockspeed most certainly will do, but sram isn't usually a big power user. Lower clocks on the sram means they saved die space as they are able to make the sram cells all the smaller. (I'm just guessing what the engineering motivation was here, marketing spin aside, as they obviously had to give up something die-area-wise to get their OOO processing but be tiny to compete cost-wise with Atom)

Scali · Aug 25, 2010

If I read the small print correctly, AMD isn't even banking on improved IPC.
http://www.anandtech.com/Gallery/Album/754#6

Throughput advantages for multi-threaded workloads without significant loss on serial single-threaded workload components

To me that reads like: "Okay, there will be a loss on serial single-threaded workloads, but we managed to keep this loss under control, so it's not going to be significant."
So the approach seems to favour parallel workloads, more cores per die, less transistors per core... that sort of thing. But not better IPC.

Ben90 · Aug 25, 2010

Idontcare said:
...it always knocks my socks of when I think that P4 northwood had a mere 55million xtors on 130nm and ran at 3.4GHz, look how far we've come with xtor budgets per core to throw at making the same pipeline perform better

47million xtors on this:

khon · Aug 25, 2010

Ben90 said:
47million xtors on this:

That actually seems large to me.

The i7-980X has 1.2B transistors and at 248mm^2 it's smaller than a US penny (285 mm^2) or a US dime (251 mm^2)

Cogman · Aug 25, 2010

Scali said:
If I read the small print correctly, AMD isn't even banking on improved IPC.
http://www.anandtech.com/Gallery/Album/754#6

To me that reads like: "Okay, there will be a loss on serial single-threaded workloads, but we managed to keep this loss under control, so it's not going to be significant."
So the approach seems to favour parallel workloads, more cores per die, less transistors per core... that sort of thing. But not better IPC.

I kind of got the impression that the improvements would be minimal, but as you said, favoring highly threaded apps over single threaded apps. They are deepening the pipeline, and if I am reading into things correctly, Integer performance MIGHT be pretty good for a single threaded app (Two integer units available depending on how well the internal scheduler works).

I am, however, generally interested to see how floating point arithmetic works on these things. Gleaning from the slides, it seems like it might take a big hit in multithreaded applications with large amounts of floating point math. Which says to me "Science applications and encoders, beware" Perhaps they are going to be pushing OpenCL like technologies for highly parallel floating point math.

Of course, if they are offering 8 cores, it seems almost like a hyperthreading type deal is going on in the background. Either way, operating system schedulers should probably be looking into what they will do for thread management on these puppies.

Interesting architecture. Hopefully it competes well with the i7/Nehalems.

------------------------------------------------

Now, onto the bobcats. Not much to say about these other then.
* I really hope they don't saddle these things with a CRAPTASTIC chipset such as intel did with the atom. It is a bad thing when your chipset consumes MORE power than your CPU.
* These will be interesting for things such as thin clients.
* I don't think they will break into the mobile market any time soon. x86 + Mobile just does not mix well. Rather, discrete computing environments seems a more fitting place for these things, such as HTPCs.
* I really hope the integrated GPU included is better than intels GPUs. It would be nice to be able to do SOMETHING with this thing's GPU.

All in all, More numbers/demos would really be nice. When are they planning on releasing these things?

zebrax2 · Aug 25, 2010

khon said:
That actually seems large to me.

The i7-980X has 1.2B transistors and at 248mm^2 it's smaller than a US penny (285 mm^2) or a US dime (251 mm^2)

You are looking at it the wrong way. The thing you are seeing is the whole package not the die. I think that is an atom CPU which has a lot more smaller die size than the i7-980X being the largest at only 87mm^2 and going as low as 26mm^2 (according to Wikipedia)

Cogman · Aug 25, 2010

Idontcare said:
http://www.anandtech.com/Gallery/Album/753#14

Bobcat's 512KB L2$ is clocked at half the speed of the core logic?

They state it is to save power, which lowering the clockspeed most certainly will do, but sram isn't usually a big power user. Lower clocks on the sram means they saved die space as they are able to make the sram cells all the smaller. (I'm just guessing what the engineering motivation was here, marketing spin aside, as they obviously had to give up something die-area-wise to get their OOO processing but be tiny to compete cost-wise with Atom)

Doesn't cache usually take up a significant portion of the die? Yes, it will be smaller like you said, but it seems like having a lower clock speed would provide a significant power savings.

Scali · Aug 25, 2010

Cogman said:
They are deepening the pipeline, and if I am reading into things correctly, Integer performance MIGHT be pretty good for a single threaded app (Two integer units available depending on how well the internal scheduler works).

Intel's Conroe/Nehalem pipelines were already deeper, and both have 3 ALUs available.
K7/8/10 also had 3 ALUs available, so I don't quite see how they're going to improve single-threaded performance while having less ALUs than all common x86 processors today.
I'm not saying it's impossible, just very unlikely. Once again the problem of the 'missing secret sauce'.

jvroig · Aug 25, 2010

JFAMD said:
Zambezi is 8-core.

Thanks for the info. I have to say, this is a surprise. What happened to the "4/8 CPU" mentioned in an old slide before? You said to interpret that as "there will be a 4-core and 8-core variants available" instead of "4c/8t" since no HT on BD. The slide in question is below, but I can't link to your response - I forgot which forum it was, either here, at SA, or at AMDZone, I suppose.

AtenRa said:
So don’t forget the difference between 8 Physical and 8 Logical cores

I hope Scali's response has had you enlightened already. For the past weeks (or was it months already) it is getting tiresome having to explain "HT core", "logical core", "real core" over and over to people who think hyperthreading produces "one real + one hyperthreaded/logical core", and similar such fallacies regarding "logical cores", whether in HT context or not.

zebrax2 · Aug 25, 2010

Cogman said:
Doesn't cache usually take up a significant portion of the die? Yes, it will be smaller like you said, but it seems like having a lower clock speed would provide a significant power savings.

Maybe the extra transistors removed helped with the power savings?

ModestGamer · Aug 25, 2010

Scali said:
Because of the shared resources in a module (eg decoder, FPU), I'm not sure if you can speak of 'physical cores' with Bulldozer, to be honest.
I think we can say this:
A Bulldozer module is similar to one physical core on a HT processor: It contains two logical cores.
Logical cores on Bulldozer and HT processors can be considered equivalent.

But I'm not sure what a 'physical core' would be for Bulldozer. I think perhaps we should not even try to define it, as it isn't very relevant.

But yes, I think AMD will be marketing it on their logical core count.

I disagree with this assertion. There is one pipeline that HT uses it just feeds in sets of instructions. With the AMD approach if my understanding of it is correct you will have a pipeline for each logical core. Ergo HT will loose this fight and massively.

2 pipelines are faster then 1.

Scali · Aug 25, 2010

ModestGamer said:
I disagree with this assertion. There is one pipeline that HT uses it just feeds in sets of instructions. With the AMD approach if my understanding of it is correct you will have a pipeline for each logical core. Ergo HT will loose this fight and massively.

2 pipelines are faster then 1.

There's no such thing as 'a pipeline'. We're talking about out-of-order architectures, which have a number of parallel execution ports.
The main difference is that with HT *everything* is shared between two threads.
With Bulldozer, the cache is shared, the instruction fetcher and decoder is shared, the FPU is shared, but there are two sets of integer logic (albeit rather anemic ones, compared to regular x86 cores) and two sets of integer schedulers.
So it's HT, but not quite.
We'll have to wait for some benchmarks until we know for sure which of the two approaches works best in practice (and don't forget, by the time Bulldozer comes out, Intel has updated its HT-architecture to Sandy Bridge, so HT may work even better than what we know from Nehalem today).

jvroig · Aug 25, 2010

Scali said:
If I read the small print correctly, AMD isn't even banking on improved IPC.

...

To me that reads like: "Okay, there will be a loss on serial single-threaded workloads, but we managed to keep this loss under control, so it's not going to be significant."

I have touched on that in an earlier post, but this thread has exploded into so much "negative crap" (in the words of Idontcare) that perhaps it has gotten lost amidst all the swirl.

What AMD meant there was there is a performance hit, but in the context of having the same two "BD cores" (to put a name on them) completely isolated as opposed to "fused together" as they are in the shared architecture which is the Bulldozer reality, not Deneb cores. So while there is a theoretical performance hit, this is not against Deneb.

To illustrate:
If the BD cores were completely separate: Int performance = 150
If the BD cores were in a module (reality): Int performance = 140
Real world Deneb cores : Int performance = 100

The absolute value of the figures are all made up, naturally. It is only to illustrate that the performance hit they mentioned was in relation to "What if we didn't fuse together those int cores?", not in relation to "What kind of int performance do we already have with Deneb?". AMD has already promised that throughput of each BD int core is greater than current Deneb cores.

I take it to mean that they have, in fact, worked on improving IPC from Deneb, despite the supposedly insignificant loss as a result of their Bulldozer design.

Scali · Aug 25, 2010

jvroig said:
AMD has already promised that throughput of each BD int core is greater than current Deneb cores.

Careful there...
"AMD is also careful to mention that the integer throughput of one of these integer cores is greater than that of the Phenom II's integer units."

Problem is, each Phenom II core has 3 integer units (or well 3+3, if you break it down to ALU/AGU).
Making the statement a bit of a 'no shit, Sherlock'-one (two units better than one? really?)

ModestGamer · Aug 25, 2010

Scali said:
There's no such thing as 'a pipeline'. We're talking about out-of-order architectures, which have a number of parallel execution ports.
The main difference is that with HT *everything* is shared between two threads.
With Bulldozer, the cache is shared, the instruction fetcher and decoder is shared, the FPU is shared, but there are two sets of integer logic (albeit rather anemic ones, compared to regular x86 cores) and two sets of integer schedulers.
So it's HT, but not quite.
We'll have to wait for some benchmarks until we know for sure which of the two approaches works best in practice (and don't forget, by the time Bulldozer comes out, Intel has updated its HT-architecture to Sandy Bridge, so HT may work even better than what we know from Nehalem today).

I really really really doubt AMD dropped the ball here. The have been releasing blah products for a few years now while throwing massive R&D money at this design and overall architecture. I think they are playing this close to the vest and we aren't getting all the details just yet. Ht can only carry you so far and OoO will only cary you so far.

At some point physical hardware has to crunch numbers and it sounds like they went with a dvidie and conquer method.If they got the module on there that handles threading very well or better then other CPU's BD could rock. It sounds like AMD was well aware of the fact that most CPU cores go underutilized "thanx microsoft" and I would imagine that before they taped out they looked at current Intel CPUs and made the call as to wether the Intel HT process was going to be better or worse. they could have fixed it before tape out. So that should tell you a bunch right there.

Your also stating that things on the chip are anemic. How do you know this ? I bet we will start seeing some data trickle in soon. I for one will hold judgement until such time.

It sucks when the CPU has to do the job the OS should be doing as far as threading.

You should look at the HaikuOS and BEOS becuase they really show how great multi cpu can be.

jvroig · Aug 25, 2010

Scali said:
Careful there...
"AMD is also careful to mention that the integer throughput of one of these integer cores is greater than that of the Phenom II's integer units."

Problem is, each Phenom II core has 3 integer units (or well 3+3, if you break it down to ALU/AGU). Making the statement a bit of a 'no shit, Sherlock'-one (two units better than one? really?)

I agree, that statement is a bit vaguely-worded. It can either be interpreted as:

1. "each BD int core is greater than that of ALL 3 Deneb int units", or
2. "each BD int core is greater than a single Deneb int unit".

I have always interpreted it as #1, because #2 (after briefly considering the possibility when I first read that article) seemed too much of a negative, and the last thing AMD needs is to lessen single-threaded performance more. Perhaps Anand's wording needs a bit more clarity.

I believe it is #1, but since all we have is Anand's statement (and he could really have worded it better) and nothing from AMD, then I can only agree that it might possibly turn out to be #2.

Some Bulldozer and Bobcat articles have sprung up

Lifer

Senior member

Elite Member

Banned

Senior member

Lifer

Lifer

Banned

Elite Member

Elite Member

Banned

Platinum Member

Golden Member

Lifer

Senior member

Lifer

Banned

Platinum Member

Senior member

Banned

Banned

Platinum Member

Banned

Banned

Platinum Member