K8L article from X-bit Labs

KeithTalent · Aug 25, 2006

Article

Ok, I have read this article twice, but most of the information is way over my head. I was trying to figure out the possible performance jump from K8 and possibly Conroe.
Of course it seems everything is still TBD, but by looking at some of the technical stuff can anyone tell how things may pan out if everything written there is accurate?

Also, does this have anything to do with AM3 or is it something completely different? I am asking because I have heard speculation that AM3 processors will be compatible with AM2 motherboards (that would be great).

Sorry for the dumb questions.

Cheers, KT

imported_inspire · Aug 25, 2006

Actually, AMD has mentioned that AM3 will be backwards compatible with the AM2 socket, so I'd bump it one past speculation.

KeithTalent · Aug 25, 2006

Originally posted by: inspire
Actually, AMD has mentioned that AM3 will be backwards compatible with the AM2 socket, so I'd bump it one past speculation.

Good stuff. So K8L = AM3?

myocardia · Aug 25, 2006

I still can't figure out why AMD refuses to call it what it really is, K9.

betasub · Aug 25, 2006

Originally posted by: myocardia
I still can't figure out why AMD refuses to call it what it really is, K9.

woof

myocardia · Aug 25, 2006

Originally posted by: betasub
woof

Well, it will either be a dog, or it will be a case of "who let the dogs out"...

Baked · Aug 25, 2006

Originally posted by: myocardia
I still can't figure out why AMD refuses to call it what it really is, K9.

Reserve that name for the quad core?

KeithTalent · Aug 25, 2006

Geez, you guys slay me (K9, Quad core), hilarious.

zephyrprime · Aug 25, 2006

All the info in x-bit labs is pretty speculative at this point. All AMD has really told us is that the K8L has twice the FPU resources, an instruction fetch doubled in size, and improved IPC. There's just no telling what all this will mean in specific terms. The K8L will certainly be faster than the K8, but by how much is unknown.

However, x-bit gives us a pretty in depth speculation on the performance provided by some elements of the K8L. I'll try to condense what they have to say:

1. With a 32B instruction fetch (instead of 16B like in the K8/P4/C2D), being starved for instructions will be less likely. This is useful since SSE instructions are big and the SSE issue rate will be a lot higher with the K8L, since it has real 128-bit SSE execution instead of the half-assed 64-bit/64-bit SSE execution we now suffer with in the K8 (and P4). Decoding SSE into real 128-bit instructions on the K8L is easier than decoding an SSE instruction into 2x64-bit instructions as is done in the K8.

Also, 64-bit execution should be sped up by the 32B fetch because 64-bit instructions are bigger than their 32-bit counterparts.

(zephyrprime opinion: However, we don't know if the K8L has more instruction decoders, so the picture of K8L instruction decoding power is very incomplete. I think x-bit is saying that it doesn't, but is that just a guess or is it a fact?)

2. Branch prediction is improved but specifics are unknown.

3. There will be some sort of read reordering but specifics are not known.

4. The K8L will retain separate instruction pools for int and fp code.

5. The K8L will have 2x128bit connections from SSE to L1. (z.p.: vs 2x64bit in the K8 and 1x128bit in the conroe).

6. The L1 & L2 caches sound unchanged. The L3 is new (obviously). The crossbar is enhanced. Doesn't sound like the K8L will have fancy prefetching like the Conroe.
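
To put rough numbers on point 1, here is a minimal sketch of how many instructions one fetch window can hold. The average instruction lengths in ASSUMED_AVG_LENGTHS and the function name are made up for illustration; they are not figures from x-bit or AMD.

```python
# Rough model: how many instructions fit in one fetch window on average.
# The average instruction lengths below are illustrative assumptions,
# not measurements of any real program.

def instructions_per_fetch(window_bytes: int, avg_instr_bytes: float) -> float:
    """Average number of instructions delivered by one fetch window."""
    return window_bytes / avg_instr_bytes

ASSUMED_AVG_LENGTHS = {
    "32-bit integer code": 3.0,
    "64-bit code (extra prefixes)": 4.0,
    "SSE-heavy code": 5.0,
}

for label, length in ASSUMED_AVG_LENGTHS.items():
    narrow = instructions_per_fetch(16, length)  # 16B fetch, as in the K8
    wide = instructions_per_fetch(32, length)    # 32B fetch, as claimed for the K8L
    print(f"{label}: 16B fetch = {narrow:.1f} instr, 32B fetch = {wide:.1f} instr")
```

Under these assumed lengths, a 16B fetch delivers only about three 5-byte SSE instructions at a time, while a 32B fetch leaves headroom. Whether the decoders can actually use that headroom is a separate question.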

I could have made some mistakes. It's tough to read what those x-bit guys wrote.

I think it's highly informative to look at pictures of the K8L die and the K8 die.

http://www.techwarelabs.com/reviews/processors/amd4000_fx55/die_marked_E.jpg
http://www.xbitlabs.com/images/cpu/amd-k8l/image001s.png

From the pictures, you can see that the layout of the individual cores is basically the same. The most obvious difference is the smaller L2. Also, the additional SSE unit can easily be seen and looks extremely similar to the current FP unit. The L1 caches look basically the same. The instruction decoder looks really different. The load/store unit looks like it has significant changes. The int unit looks little changed. The bus unit looks like it has some changes.

Since the size of the int unit is approximately the same, I would guess that there are no new int execution resources. The decoder doesn't look any bigger either so I guess there will be no additional instruction decoders.

In SSE&FP code, I think the K8L will beat the Conroe. In int code, I'm guessing Conroe will win unless there is some fancy prefetching in the bus unit which doesn't seem to be the case at this point in time. It looks like the K8L will have the same number of pipeline stages as the K8 so I would speculate that the Conroe will clock higher than the K8L. In 64bit code, the K8L should have a big edge with its 32B fetcher.

KeithTalent · Aug 25, 2006

This is fantastic, thank you very much zp.

Pardon me if any of my questions about what you said are stupid, but I have two:

1. What is the implication of this new L3 cache? Does it have anything to do with performance or is it just a necessary implementation for this architecture?

2. Do you have any examples of what applications use SSE&FP code and which use int code?

Thanks again, KT

Kromis · Aug 25, 2006

Are there any major architectural upgrades as far as the K8L goes?

zephyrprime · Aug 25, 2006

Originally posted by: KeithTalent
1. What is the implication of this new L3 cache? Does it have anything to do with performance or is it just a necessary implementation for this architecture?

2. Do you have any examples of what applications use SSE&FP code and which use int code?

Thanks again, KT

Well, you really need the L3 in order to have an effective quad core. Otherwise, you'd have insufficient bandwidth, since the external memory system is still the same as on the current dual cores. Also, it's just more die-space efficient to have a shared L3 instead of a bunch of big exclusive L2s (like on my 805D).
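
A back-of-the-envelope illustration of the bandwidth argument: the sketch below splits one shared memory system across cores and shows how a shared L3 could cut the traffic that reaches DRAM. Dual-channel DDR2-800 is used only as an example memory system, and the per-core miss traffic and L3 hit rate are hypothetical numbers picked for illustration.

```python
# Peak bandwidth of dual-channel DDR2-800: 2 channels x 8 bytes x 800 MT/s.
# Chosen only as an example of a memory system shared by all cores.
PEAK_GBPS = 2 * 8 * 800e6 / 1e9  # 12.8 GB/s

def per_core_share(peak_gbps: float, cores: int) -> float:
    """Peak DRAM bandwidth available to each core if all cores are busy."""
    return peak_gbps / cores

def dram_traffic(l2_miss_gbps: float, l3_hit_rate: float) -> float:
    """Traffic that still reaches DRAM after a shared L3 absorbs some misses."""
    return l2_miss_gbps * (1.0 - l3_hit_rate)

for cores in (2, 4):
    print(f"{cores} cores: {per_core_share(PEAK_GBPS, cores):.1f} GB/s per core")

# Hypothetical numbers: each core misses its L2 at 4 GB/s, the L3 catches half.
ASSUMED_MISS_GBPS = 4.0
ASSUMED_L3_HIT_RATE = 0.5
without_l3 = 4 * ASSUMED_MISS_GBPS
with_l3 = 4 * dram_traffic(ASSUMED_MISS_GBPS, ASSUMED_L3_HIT_RATE)
print(f"4 cores, no L3:   {without_l3:.1f} GB/s wanted of {PEAK_GBPS:.1f} GB/s peak")
print(f"4 cores, with L3: {with_l3:.1f} GB/s wanted of {PEAK_GBPS:.1f} GB/s peak")
```

With these made-up inputs, four cores would ask for more than the memory system can deliver without an L3, and fit comfortably with one.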

Most office apps will be dominated by int code and have hardly any FP code. The OS itself is also int heavy. Only programs that do a lot of calculations of some sort have a lot of FP/SSE code. This includes games, video stuff, some graphics stuff, and scientific and engineering apps. The thing is, the x87 FPU on the x86 platform is crap, so using SSE is much preferred. I've always wondered how much SSE code there really is nowadays, though. I've written SSE code, and doing that sucks compared to writing plain FP code. I wish they would just make the SSE data types native datatypes in Visual Studio that can do everything the FP data types can. There are SSE data types in Visual Studio, but you can't do simple stuff like "c = a + b;" if a, b, c are SSE datatypes. I would think that Visual Studio would automatically generate SSE code nowadays, but I wonder how good it is at that.
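
For anyone wondering what a "c = a + b" on an SSE datatype would even mean, here is a toy model in Python of a packed single-precision add: four float lanes added in one operation, with operator overloading supplying the "+". In C or C++ the same add is normally written with the `_mm_add_ps` intrinsic on `__m128` values. The `Vec4` class and its helper are hypothetical names invented for this sketch; this is not how any compiler implements it.

```python
import struct
from dataclasses import dataclass

def _to_f32(x: float) -> float:
    """Round a Python float to single precision, as a 32-bit lane would store it."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

@dataclass(frozen=True)
class Vec4:
    """Toy stand-in for a 128-bit register holding four 32-bit floats."""
    lanes: tuple

    def __post_init__(self):
        if len(self.lanes) != 4:
            raise ValueError("Vec4 needs exactly four lanes")
        # Store every lane at single precision.
        object.__setattr__(self, "lanes", tuple(_to_f32(v) for v in self.lanes))

    def __add__(self, other):
        if not isinstance(other, Vec4):
            return NotImplemented
        # Lane-wise add: all four sums belong to one packed operation.
        return Vec4(tuple(x + y for x, y in zip(self.lanes, other.lanes)))

a = Vec4((1.0, 2.0, 3.0, 4.0))
b = Vec4((10.0, 20.0, 30.0, 40.0))
c = a + b
print(c.lanes)  # (11.0, 22.0, 33.0, 44.0)
```

The point of the model is only the shape of the operation: one "+" covers four independent additions, which is what the wider SSE hardware speeds up.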

Some1ne · Aug 25, 2006

Originally posted by: zephyrprime
This is useful since SSE instructions are big and the SSE issue rate will be a lot higher with the K8L since it has real 128bit SSE execution instead of the half assed 64bit/64bit sse execution we now suffer with in the K8 (and P4). Decoding sse into real 128bit instructions on the K8L is easier than decoding an sse instruction into 2x64bit instructions as is done in the K8.

But how many applications actually use SSE instructions...weren't those mostly reserved for multimedia encoding applications? If so, then I don't see why so much focus seems to be given by AMD towards optimizing them.

zephyrprime · Aug 25, 2006

Originally posted by: Some1ne

But how many applications actually use SSE instructions...weren't those mostly reserved for multimedia encoding applications? If so, then I don't see why so much focus seems to be given by AMD towards optimizing them.

Well, going forward - especially in Vista - the x87 FPU is deprecated. You're supposed to use SSE for everything and never use the x87 FPU. The x87 FPU is supposed to be for legacy apps only.

KeithTalent · Aug 25, 2006

Originally posted by: zephyrprime

Well, you really need the L3 in order to have an effective quad core. Otherwise, you'd have insufficient bandwidth, since the external memory system is still the same as on the current dual cores. Also, it's just more die-space efficient to have a shared L3 instead of a bunch of big exclusive L2s (like on my 805D).

Most office apps will be dominated by int code and have hardly any FP code. The OS itself is also int heavy. Only programs that do a lot of calculations of some sort have a lot of FP/SSE code. This includes games, video stuff, some graphics stuff, and scientific and engineering apps. The thing is, the x87 FPU on the x86 platform is crap, so using SSE is much preferred. I've always wondered how much SSE code there really is nowadays, though. I've written SSE code, and doing that sucks compared to writing plain FP code. I wish they would just make the SSE data types native datatypes in Visual Studio that can do everything the FP data types can. There are SSE data types in Visual Studio, but you can't do simple stuff like "c = a + b;" if a, b, c are SSE datatypes. I would think that Visual Studio would automatically generate SSE code nowadays, but I wonder how good it is at that.

That actually makes perfect sense to me, thanks for clarifying.

zephyrprime · Aug 25, 2006

Originally posted by: Kromis
Are there any major architectural upgrades as far as the K8L goes?

Yeah, the doubled fpu resources, the shared L3, and the crossbar. Everything else sounds like evolutionary stuff. From the pictures, it looks like the fetch/decode unit is really different but I don't know if that is a major upgrade or not. I'd expect that AMD would do something about power too but x-bit doesn't mention that so I can't say.

KeithTalent · Aug 25, 2006

Originally posted by: zephyrprime

Yeah, the doubled fpu resources, the shared L3, and the crossbar. Everything else sounds like evolutionary stuff. From the pictures, it looks like the fetch/decode unit is really different but I don't know if that is a major upgrade or not. I'd expect that AMD would do something about power too but x-bit doesn't mention that so I can't say.

On the picture of the K8L processor there is a spot pointed to that says "Enhanced Power Management" though that is the only mention of this I have seen.

Keysplayr · Aug 25, 2006

Nice Post!! But I wonder how much of a "duh" factor there will be. I mean, is the "twice the FPU resources" there because there are twice as many cores? Know what I mean? Likewise for Instruction prefetch and improved IPC? In other words, should we take the "no sh!t sherlock" approach to the marketing?

I'm just messing around. Had some wine. It's Friday FINALLY!!!! hehe.

Hard Ball · Aug 25, 2006

Originally posted by: keysplayr2003
Nice Post!! But I wonder how much of a "duh" factor there will be. I mean, is the "twice the FPU resources" there because there are twice as many cores? Know what I mean? Likewise for Instruction prefetch and improved IPC? In other words, should we take the "no sh!t sherlock" approach to the marketing?

I'm just messing around. Had some wine. It's Friday FINALLY!!!! hehe.

No, the doubling of FP resources refers to each core, specifically the width of the datapath for FP execution.
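
A minimal sketch of what a wider FP datapath buys per core, assuming a 128-bit SSE operation is simply split into as many passes as the datapath width requires. The function name and the one-pass-per-step model are assumptions made for illustration, not a description of the actual hardware.

```python
import math

def passes_per_op(operand_bits: int, datapath_bits: int) -> int:
    """Number of passes needed to push one operand through the datapath."""
    return math.ceil(operand_bits / datapath_bits)

# 64-bit datapath (K8-style halves) versus a full 128-bit datapath.
for width in (64, 128):
    print(f"{width}-bit datapath: {passes_per_op(128, width)} pass(es) per 128-bit SSE op")
```

So the doubling is independent of the core count: each core finishes a 128-bit operation in one pass instead of two.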
