AMD delays Phenom 2.4 GHz due to TLB errata

brxndxn

Diamond Member
Apr 3, 2001
8,475
0
76
Hopefully this is the whole reason AMD has been unable to ship higher clock speeds in volume...

 

nyker96

Diamond Member
Apr 19, 2005
5,630
2
81
They've had nothing but trouble lately. The 3xxx graphics card launch has been pretty decent so far, though; many cards sold out pretty quickly.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
The article claims there's no microcode workaround and product has to be recalled, but also says that there is a BIOS update to work around the problem. That doesn't make much sense to me.
 

DrMrLordX

Lifer
Apr 27, 2000
22,915
12,988
136
That is odd. Particularly this bit:

"Some 9500/9600 parts may even be overclocked to 2.6, 2.8, 2.9, 3.0 GHz and they will have no problems whatsoever, while some will have this error."

That does not sound good at all. On the other hand, word that Phenom will reach stock speeds as high as 3 GHz once B3 hits the market is encouraging.
 

zach0624

Senior member
Jul 13, 2007
535
0
0
Originally posted by: Phynaz
Theo Valich is a clown.

Ignore him.

The Inq. hasn't been right on many things regarding AMD and Phenom, so I wouldn't take this too seriously (remember that 3DMark record score?). Also, the claim that some 9500s and 9600s don't have the problem and hit 3 GHz is a little fishy.
 

DrMrLordX

Lifer
Apr 27, 2000
22,915
12,988
136
After reading Anandtech's Phenom review it would seem that B2 chips may indeed be having problems. Theirs certainly wasn't stable at high speeds, and they did cite the TLB problem (though they did not make the connection between the two).
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: Phynaz
Theo Valich is a clown.

Ignore him.

This is scary...I agree completely. :Q

However, the data that he presented so incompetently and completely misunderstood is essentially correct (and you'll note that it's exactly what I've been saying for months now...).
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: CTho9305
The article claims there's no microcode workaround and product has to be recalled, but also says that there is a BIOS update to work around the problem. That doesn't make much sense to me.

I'll quote from a more knowledgeable source than myself...

"There is microcode for the L3 controller separate from the main controller loop that is not updatable"
"In K7, all the memory controller microcode is generated during bootup by the BIOS. In K8 the same thing happens for the main memory controller loop, but there are some microcode routines for controllers (such as the L3 controller) which are not part of the main loop. These cannot be updated without a mask revision. As I understand it, there will be a BIOS update which allows the main loop to recover from a glitch in synchronization. If the problem occurs rarely, if at all--as is expected in 2.3 GHz and below CPUs--this is sufficient. But if it occurs constantly, you take a big performance hit. So rev B3 is needed to swat the problem at its source"
 

DrMrLordX

Lifer
Apr 27, 2000
22,915
12,988
136
So this is a problem with the L3 cache controller? Could this bug be side-stepped by disabling the L3 cache altogether?
 

harpoon84

Golden Member
Jul 16, 2006
1,084
0
0
Originally posted by: DrMrLordX
So this is a problem with the L3 cache controller? Could this bug be side-stepped by disabling the L3 cache altogether?

According to Fudzilla, yes you can, but you lose 10% performance, so it's hardly practical.
 

DrMrLordX

Lifer
Apr 27, 2000
22,915
12,988
136
Interesting. I wonder if B2 chips will OC better with their L3 cache disabled. Hmm.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: harpoon84
Originally posted by: DrMrLordX
So this is a problem with the L3 cache controller? Could this bug be side-stepped by disabling the L3 cache altogether?

According to Fudzilla, yes you can, but you lose 10% performance, so it's hardly practical.

The L3 is worth 10% performance? Wow, that's pretty amazing. Got links to benchmarks that compare L3 enabled/disabled?
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: CTho9305
Originally posted by: harpoon84
Originally posted by: DrMrLordX
So this is a problem with the L3 cache controller? Could this bug be side-stepped by disabling the L3 cache altogether?

According to Fudzilla, yes you can, but you lose 10% performance, so it's hardly practical.

The L3 is worth 10% performance? Wow, that's pretty amazing. Got links to benchmarks that compare L3 enabled/disabled?

I am surprised as well, as this would imply the bulk of the K10 IPC improvements relative to K8 come from cache hierarchy and not micro-architecture improvements.

I assumed L3$ would improve performance scaling as number of threads crossed the 2->3 boundary.

I.e. a dual-core K10 should perform at least as well as a dual-core K8 even with L3$ disabled or removed entirely. Shouldn't it?
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Idontcare
Originally posted by: CTho9305
Originally posted by: harpoon84
Originally posted by: DrMrLordX
So this is a problem with the L3 cache controller? Could this bug be side-stepped by disabling the L3 cache altogether?

According to Fudzilla, yes you can, but you lose 10% performance, so it's hardly practical.

The L3 is worth 10% performance? Wow, that's pretty amazing. Got links to benchmarks that compare L3 enabled/disabled?

I am surprised as well, as this would imply the bulk of the K10 IPC improvements relative to K8 come from cache hierarchy and not micro-architecture improvements.

I assumed L3$ would improve performance scaling as number of threads crossed the 2->3 boundary.

I.e. a dual-core K10 should perform at least as well as a dual-core K8 even with L3$ disabled or removed entirely. Shouldn't it?

I would expect a dual-core Barcelona-based chip to match or kick the crap out of a K8 depending on the application. In particular, significant improvements are SSE128 (doubled SSE performance) and the doubled L1 bandwidth. One possibility is that some of the benchmarks are using codepaths that are highly-optimized for K8, and may not schedule operations efficiently to take advantage of Barcelona's enhancements. I don't really see how some of the synthetic benchmarks could reliably measure, say, floating point performance without using code tuned to each microarchitecture.

edit: To elaborate, obtaining decent performance from modern processors is easy. They'll do a certain amount of rescheduling of instructions if you don't do an optimal job, and they're relatively forgiving. Obtaining peak performance, however, is much more difficult. You have to track a lot of things - making sure operands are ready at the right time, keeping in mind how much work you have to make available to cover the latency of a cache access, keeping track of decode slot limitations (particularly on the Intel chips), etc.

Note: this next paragraph is based on my current understanding of the architectures, but I'm not sure about any numbers here and don't really know how to take advantage of SSE. Highly optimized code for K8 might keep execution units busy during a load operation by performing 2 128-bit SSE additions, which is 4 cycles of work - if the next instruction depends on the load, the code keeps the execution units busy 100% of the time. The same code sequence is sub-optimal on Barcelona/Phenom: the 2 additions would take only 2 cycles total, leaving the execution units idle for a cycle (loads take 3 cycles).

Pseudocode for the case I'm thinking of:
r1 = mem[1234] <- cache access; data won't come back for 3 cycles
r2 = r2 + r4 <- 128-bit packed add, 2 cycles on K8, 1 on Barcelona/Phenom
r3 = r3 + r5 <- 128-bit packed add, 2 cycles on K8, 1 on Barcelona/Phenom
r4 = r1 + r6 <- depends on the first instruction; r1 won't be ready yet on Barcelona/Phenom
xyz = abc <- some other instruction; unless it doesn't depend on r1 / r4, Barcelona and Phenom will have to spend a cycle doing nothing.

edit2: One thing that disappointed me is that reviewers didn't do much analysis of per-thread performance relative to K8. On Linux, I'd think it would be easy enough to keep the scheduler from using the 3rd and 4th core; there might be ways to do it on Windows too (worst case, 2 instances of while(1); run at high priority with affinity set to particular cores?).
 

DrMrLordX

Lifer
Apr 27, 2000
22,915
12,988
136
Originally posted by: CTho9305

The L3 is worth 10% performance? Wow, that's pretty amazing. Got links to benchmarks that compare L3 enabled/disabled?

I was thinking the same thing and would like to see those numbers. With the L3 cache supposedly locked at 2 GHz in B2 chips, and with the way system memory performance scales so well at higher clock speeds in K8 chips (something that will hopefully be true in K10 chips as well), I would think that a heavily-overclocked B2 stepping Phenom would gain very little from its L3 cache.