"TLB errata" bug


harpoon84

Golden Member
Jul 16, 2006
1,084
0
0
Originally posted by: taltamir
Are you saying that the patch isn't currently implemented? I thought the whole point was that they are using the BIOS patch to prevent the CPU from crashing, by sacrificing 10% of its performance on those reset loops.

No, that is a possible workaround to guarantee stability at speeds exceeding 2.4GHz. However, a 10% performance hit on a 2.4GHz Phenom will make it slower than a 2.2GHz 9500, making the 'fix' impractical.

It has nothing to do with performance on current Phenoms. What you see is what you get. 2.3GHz and lower Phenoms aren't affected by the bug.
 

bfdd

Lifer
Feb 3, 2007
13,312
1
0
Originally posted by: harpoon84
Originally posted by: bfdd
Originally posted by: harpoon84
It only affects Phenom at 2.4GHz or higher. Barcelona is not clocked that high.

The bug is in all chips though, even those clocked below 2.4GHz.

But 2.3GHz and lower chips are not affected stability-wise... so whether it is there or not is a moot point as long as they are stable.

Then how come, when you overclock a 2.3GHz CPU that has the bug to a speed well above 2.4GHz, where the TLB erratum comes into play, there are no stability issues?
 

harpoon84

Golden Member
Jul 16, 2006
1,084
0
0
Because it's a rare issue that occurs sporadically? Some sites had difficulty overclocking past 2.6GHz with stability, but there is no way to tell if this is due to the bug or if the chip itself just won't clock any higher.
 

bfdd

Lifer
Feb 3, 2007
13,312
1
0
Originally posted by: harpoon84
Because it's a rare issue that occurs sporadically? Some sites had difficulty overclocking past 2.6GHz with stability, but there is no way to tell if this is due to the bug or if the chip itself just won't clock any higher.

And some got to 2.8GHz stable. What I'm saying is, if it's such a problem that they can't release a 2.4GHz chip, why can these reviewers go past it by a few hundred MHz and not have problems? We all know there is a problem, but I think their yields for 2.4GHz chips are low *shrug*
 

JumpingJack

Member
Mar 7, 2006
61
0
0
Originally posted by: Viditor
The linked site (like many I have seen out there) definitely has at least some of their facts wrong...
The AMD Errata Sheet from Sept lists the two revs we currently have...BA for Barcelona, and B2 for Phenom.
Those 2 came out simultaneously (rev B1 was an engineering sample only), and I have confirmed (from a client) that current Barcelonas are rev BA still, and obviously Phenom is rev B2 at the moment.
Even though the cores are the same, the different chips (low power vs performance) ship with different revs (in the same way that Opteron had a different rev than A64, but it was the same core).

You were absolutely right about it being errata #254...many thanks for that! It appears that BA is also in need of the same fix, so we will probably see a new rev there as well (BB?)...though the need would seem to be less pressing as they are low power chips.

Well, I would not have brought it up, but I have seen this repeated by a few other sites. I had a whiff of the pre-BA stepping errata (by accident, it was stored on a non-secure server) and this bug was there from the beginning.

It simply does not make sense to me that 2.4 GHz shows the error and 2.3 GHz doesn't if it is simply a logic bug. Two explanations floating around the web seem reasonable... the 2.4 GHz part has a different NB frequency, and it may be screwing up the logic since it is asynchronous to the core... or there is a hot spot that creates soft errors and only manifests itself at the higher clocks. I lean toward the latter myself.

It will be Q1 before we know for sure, with the B3s, but if they did fix this with a BIOS and the data we see includes that 10% performance hit, then why wait on 2.4 GHz... the errata was corrected.

We can argue back and forth, but until we see B3 ... it is up in the air. My guess is that you won't see any IPC improvement, that this is a process marginality. Others seem to want to take the position that AMD has already 'disabled' the offending feature and the data we see is a result of the performance hit.... without data it is impossible to say who is right.
 

JumpingJack

Member
Mar 7, 2006
61
0
0
Originally posted by: bfdd
Originally posted by: harpoon84
Originally posted by: bfdd
Originally posted by: harpoon84
It only affects Phenom at 2.4GHz or higher. Barcelona is not clocked that high.

The bug is in all chips though, even those clocked below 2.4GHz.

But 2.3GHz and lower chips are not affected stability-wise... so whether it is there or not is a moot point as long as they are stable.

Then how come, when you overclock a 2.3GHz CPU that has the bug to a speed well above 2.4GHz, where the TLB erratum comes into play, there are no stability issues?

The exact report of the TLB bug by AMD is that it only demonstrates itself under full-load conditions on some CPUs -- i.e. supporting the hot spot theory (process variation and all).

Regardless, it was enough that AMD felt they could not guarantee quality at that bin, hence they pushed it out -- better that than to release a product that is defective at that bin.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
What do you mean, not affected? Wouldn't the bug cause a crash if the patch wasn't implemented?
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: JumpingJack
Originally posted by: Viditor
The linked site (like many I have seen out there) definitely has at least some of their facts wrong...
The AMD Errata Sheet from Sept lists the two revs we currently have...BA for Barcelona, and B2 for Phenom.
Those 2 came out simultaneously (rev B1 was an engineering sample only), and I have confirmed (from a client) that current Barcelonas are rev BA still, and obviously Phenom is rev B2 at the moment.
Even though the cores are the same, the different chips (low power vs performance) ship with different revs (in the same way that Opteron had a different rev than A64, but it was the same core).

You were absolutely right about it being errata #254...many thanks for that! It appears that BA is also in need of the same fix, so we will probably see a new rev there as well (BB?)...though the need would seem to be less pressing as they are low power chips.

Well, I would not have brought it up, but I have seen this repeated by a few other sites. I had a whiff of the pre-BA stepping errata (by accident, it was stored on a non-secure server) and this bug was there from the beginning.

It simply does not make sense to me that 2.4 GHz shows the error and 2.3 GHz doesn't if it is simply a logic bug. Two explanations floating around the web seem reasonable... the 2.4 GHz part has a different NB frequency, and it may be screwing up the logic since it is asynchronous to the core... or there is a hot spot that creates soft errors and only manifests itself at the higher clocks. I lean toward the latter myself.

If you look at reviews more closely, I think you might lean towards the former instead...
Notice that in every case where the reviewers' overclocks ran stable, the NB is downclocked to under 400 MHz, even though the cores are clocked much higher than normal. This should not work if the issue were one of hot spots...


It will be Q1 before we know for sure, with the B3s, but if they did fix this with a BIOS and the data we see includes that 10% performance hit, then why wait on 2.4 GHz... the errata was corrected.

I don't believe that there is a BIOS fix per se, merely a BIOS workaround...

We can argue back and forth, but until we see B3 ... it is up in the air. My guess is that you won't see any IPC improvement, that this is a process marginality. Others seem to want to take the position that AMD has already 'disabled' the offending feature and the data we see is a result of the performance hit.... without data it is impossible to say who is right.

I absolutely agree...we really won't know any more until the B3 surfaces.
 

JumpingJack

Member
Mar 7, 2006
61
0
0
Originally posted by: Viditor
If you look at reviews more closely, I think you might lean towards the former instead...
Notice that in every case where the reviewers' overclocks ran stable, the NB is downclocked to under 400 MHz, even though the cores are clocked much higher than normal. This should not work if the issue were one of hot spots...

The NB clock (I assume you're looking at the memory clock) in this case is not under 400 MHz because of a 'fix'; it is because they are using DDR2-800 (400 MHz standard clock), but AMD's odd and half multipliers do not allow it to hit 400 MHz -- the odd multiplier and all -- AM2 K8 CPUs do the same thing.

AMD's memory controller does not have a fractional divider, so for non-even multipliers the IMC will default to the closest whole-number divider below the rated memory speed. I don't think this observation can be attributed to a 'BIOS workaround' as you are attempting to do.

For example, to get 400 MHz to drive DDR2-800, the IMC will take CPU clock / x, thus:

2000 MHz / 5 = 400 MHz (bang on).
2100 MHz / 5 = 420 MHz (not good, so use 6)
2100 MHz / 6 = 350 MHz (this is where it will run)
2200 MHz / 6 = 367 MHz (this is where it will run, as 2200/5 = 440 MHz, above the DDR2-800 spec)
2300 MHz / 6 = 383 MHz (again, this is where DDR2-800 will run, as 2300/5 = 460 MHz)

Curiously, when the leaked Phenom benches were published by OCworkbench and epreivew.com, people wondered why the memory was underclocked... this is the reason why.
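
To make the divider rule concrete, here is a small Python sketch of the whole-number divider selection described above (my own illustrative model, not AMD's firmware; the function name and search loop are invented for illustration):

def imc_memory_clock(cpu_mhz, rated_mem_mhz=400.0):
    # Hypothetical model: pick the smallest whole-number divider that
    # keeps the memory clock at or below the rated speed (DDR2-800 -> 400 MHz).
    divider = 2
    while cpu_mhz / divider > rated_mem_mhz:
        divider += 1
    return cpu_mhz / divider

for cpu in (2000, 2100, 2200, 2300):
    print(f"{cpu} MHz CPU -> {imc_memory_clock(cpu):.0f} MHz memory clock")
# Prints 400, 350, 367, and 383 MHz, matching the figures above.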

Finally, I did find another site indicating that a BIOS workaround (or fix, whatever you want to call it) does indeed take a 10% performance hit, and that a 9700 would be slower than a 9500 if it were used. This implies that whatever component is responsible remained active (not BIOS-disabled) for the Tahoe show and tell.

http://www.channelregister.co....amd_phenom_9700_isses/

Last week, AMD Europe executive Dave Everett admitted "errata" uncovered at the eleventh hour had held up the 9700's release but that a BIOS fix could bypass them - at the cost of a ten per cent reduction in CPU performance.

Enough of a reduction, in other words, to make the 9700 run more slowly than the 9500. The errata were said to affect the 9700 when it's under "heavy load", which is the state of most CPUs in gaming PCs, of course.

Jack
 

bfdd

Lifer
Feb 3, 2007
13,312
1
0
JumpingJack, I am glad they didn't release a buggy product, but it doesn't make sense that it happens at 2.4GHz+ and not lower if they're on the same process, so *shrug* I don't know, it just sounds like they're using it as an excuse.
 

DrMrLordX

Lifer
Apr 27, 2000
22,706
12,663
136
Originally posted by: harpoon84
Because it's a rare issue that occurs sporadically? Some sites had difficulty overclocking past 2.6GHz with stability, but there is no way to tell if this is due to the bug or if the chip itself just won't clock any higher.

Sure there is. If the BIOS allows you to disable the L3 cache, disable it. The TLB bug is in the L3 cache controller (or something along those lines).
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Originally posted by: bfdd
JumpingJack, I am glad they didn't release a buggy product, but it doesn't make sense that it happens at 2.4GHz+ and not lower if they're on the same process, so *shrug* I don't know, it just sounds like they're using it as an excuse.

It does happen, only in controllable amounts, causing performance degradation but not crashing. So that means that if they fix it, performance will increase even at those lower clock speeds.
 

Duvie

Elite Member
Feb 5, 2001
16,215
0
71
Originally posted by: taltamir
Originally posted by: bfdd
JumpingJack, I am glad they didn't release a buggy product, but it doesn't make sense that it happens at 2.4GHz+ and not lower if they're on the same process, so *shrug* I don't know, it just sounds like they're using it as an excuse.

It does happen, only in controllable amounts, causing performance degradation but not crashing. So that means that if they fix it, performance will increase even at those lower clock speeds.

That may be a lot of assumptions on your part. While I agree any errata in the cache would have an effect on scores (especially a cache reset and flush, as explained by Viditor), the amount may vary widely per application, whether the performance hit is 1-5%... The L3 is not that large, nor does it run at full speed the way the L2 cache does, nor is it the amount Intel has, so I don't think fixing it will amount to the big 10-20% it needs to compare clock for clock with the C2D.

This won't reverse any results that we have seen so far, IMHO... Again, if you look at the reviews, it appears the scores were already scaling appropriately...

Apps like multimedia historically gain little advantage from cache like this... partly due to their linear, constantly changing data, but the sheer size of it as well... Games seem to like some aspects of it, as do DC computing projects like Folding@Home... certain WUs on 4MB cache C2D models were twice as fast as on 2MB cache models at the same speed... This was seen by me and verified by MarkFW900.
 

JumpingJack

Member
Mar 7, 2006
61
0
0
Originally posted by: bfdd
JumpingJack, I am glad they didn't release a buggy product, but it doesn't make sense that it happens at 2.4GHz+ and not lower if they're on the same process, so *shrug* I don't know, it just sounds like they're using it as an excuse.

Well, Viditor and I are arguing over salient points... there is indeed an erratum in the TLB that is pushing out the higher-clocked parts. The reasons, though, are not clear; it appears to be a bug in the logic, but one only manifesting itself at higher clocks and full load... this is why I lean toward the hot-spot theory.

And if some reports are to be understood correctly, some higher-clocked parts will exhibit the bug and others won't, again explained by the hot-spot theory.

Viditor wants it to be a straight-up bug that has been caught and deactivated, because this would give some credence to the argument that a stepping will fix it and will improve the IPC.

A hot-spot argument, if this is indeed true, implies that IPC will not be affected, that this is a process marginality that only manifests itself in the warmest-running parts.

We will find out soon enough... also, Viditor argues that there are 2.4 GHz+ parts running stable, but what he does not explain is that the bug may or may not manifest itself depending on the conditions. It is perfectly reasonable to have this bug in the chip and actually never see it, depending on the circumstances... as is the case with many errata.

It could be, as you explained, simply an excuse to buy more time -- perhaps to give some assurance to investors that it is a correctable situation and that higher speed bins are a piece of cake once the bug is eradicated...

Again, we won't know until Q1 of next year, when the B3s hit and if they meet their revised clock/roadmaps.

Jack
 

JumpingJack

Member
Mar 7, 2006
61
0
0
Originally posted by: taltamir
Originally posted by: bfdd
JumpingJack, I am glad they didn't release a buggy product, but it doesn't make sense that it happens at 2.4GHz+ and not lower if they're on the same process, so *shrug* I don't know, it just sounds like they're using it as an excuse.

It does happen, only in controllable amounts, causing performance degradation but not crashing. So that means that if they fix it, performance will increase even at those lower clock speeds.

Actually, when the bug occurs, the system hangs. At least that is what has been reported, and it is what is listed as erratum 254 in AMD's errata list.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Originally posted by: CTho9305
Also, what is it? What part of the chip does it affect? Will the BIOS update disable part of the chip, making it less effective?

Disclaimer: I don't really know anything about this story beyond what's on the Inquirer/forums, and I'm not speaking for any companies.

The TLB is the "translation lookaside buffer". Background:
It used to be that if a program accessed memory location 5, the CPU really accessed physical memory location 5, and the program could access any memory location it wanted to. Programs also saw only as much memory as the computer really had (because they were accessing the physical memory directly). Modern systems use "paging". When a program accesses what it thinks is location 5, the CPU instead looks in a mapping table set up by the OS that maps the "virtual" address that the program sees to a real "physical" address that the CPU actually accesses.

That mapping table is called the "page table", because it maps memory at a "page" granularity (4KB). Along with the translation, the page table stores some permission bits that can be used to keep one program from accessing memory belonging to the OS or another program. Also, because the virtual addresses don't have to map directly to physical addresses, it's possible to make programs think a machine has more memory than it really does (when it runs out of physical memory, the OS can pick a page and swap it out to the hard drive until it's needed again...without the programs even realizing it).

Now, these mappings are pretty big, so the page table is actually hierarchical (don't worry about the details). The net result is that finding the translation from a virtual to physical address generally requires ~3 memory accesses (for 32-bit apps - it's about 2x as bad for 64-bit apps)... so to do one useful memory access, you'd need to actually do a total of 4 accesses! To make paging feasible performance-wise, the translations are cached so that they don't have to be looked up each time. This cache is the TLB.

Disabling the TLB is not an option, because the performance hit would be unreasonably large (best case, each memory access, even accesses that hit in the L1 data cache would take 4x as long). Now, modern processors actually have multiple levels of TLBs (multiple levels of cache for the page table translations, just like the L1/L2/L3 caches for data and instructions) - maybe the L2 TLB(s) could be disabled if they were buggy, but I would imagine that would have a large performance impact in some situations. I'm not familiar with Barcelona/Phenom's TLB organization though.
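
To make the idea concrete, here is a toy Python model of a TLB (a hypothetical software sketch, nothing like the real hardware structures; the page-table contents are made up). It shows how caching a translation lets repeated accesses to the same page skip the page-table walk.

PAGE_SIZE = 4096                 # 4KB pages, as described above

page_table = {0: 7, 1: 3, 2: 9}  # hypothetical: virtual page -> physical page
tlb = {}                         # cached translations
page_walks = 0                   # counts the expensive page-table lookups

def translate(vaddr):
    # Translate a virtual address to a physical one, consulting the TLB first.
    global page_walks
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in tlb:           # TLB miss: walk the page table
        page_walks += 1          # (~3 extra memory accesses on real 32-bit hardware)
        tlb[vpn] = page_table[vpn]
    return tlb[vpn] * PAGE_SIZE + offset

for i in range(1000):            # 1000 accesses within the same page...
    translate(i)
print(page_walks)                # ...cost only 1 page-table walk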

OK, it seems I completely misunderstood the "uber technical explanation"... In fact, I read the AMD blurb about it and I think I get it now. Even so, I reread this and I see a very detailed explanation of how the cache works, but no mention of the actual PROBLEM...

So here is my LAYMAN'S terms understanding of the issue, based on what I read from AMD:

I read the explanation by AMD, and according to them they messed up the order in which the L2 and L3 caches are updated, so that you could end up with different data in L2 and L3 (with the L3 data being wrong). If the calculation finished with the L2 cache, then nothing happened, but if ANOTHER process was intensive enough to cause the first one to drop out of L2, it would then later get a copy of what it dropped from L3, and THAT copy was wrong, causing the crash.

This means that any time a sufficiently intensive operation occurs, the chip will crash. Above 2.4GHz, almost every program is sufficiently intensive to deplete the L2 cache, causing the crash (not the error, mind you, the crash because of the error). But even at 2.3GHz, certain programs (like Photoshop, for example) will cause it quite often. That is why the whole shebang has to be disabled by the BIOS... but with it disabled you lose 10-20% in performance...
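
If that reading of AMD's description is right, the failure sequence would look something like this toy Python model (my own highly simplified sketch of the scenario just described, not the actual cache-coherency logic):

l2, l3 = {}, {}    # two cache levels: line -> data

def buggy_write(line, new_data):
    # Hypothetical wrong ordering: L2 gets the new data, but the
    # L3 copy is left holding the old (wrong) value.
    l3.setdefault(line, "old/wrong data")
    l2[line] = new_data

def read(line):
    # Reads are served from L2 if the line is present, else from L3.
    return l2[line] if line in l2 else l3[line]

buggy_write("X", "correct data")
print(read("X"))   # 'correct data' -- harmless while "X" stays in L2

del l2["X"]        # a second, intensive process evicts "X" from L2
print(read("X"))   # 'old/wrong data' comes back from L3 -> the hang/crash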

To fix such a problem they will have to update the L2 and L3 caches in a slower process, resulting in a speed DECREASE, not an increase, OR use a more complicated logic design (the examples given were the ability to somehow update BOTH at once; another was locking the data against changes temporarily...).

I am guessing they probably went with the "more advanced circuit design" fix, which means they are taking extra time and hoping to end up with something faster rather than slower (but not as slow as 10-20% slower). But we will know when the xx50 versions show up. They are saying that these will now go through a very rigorous testing process to make absolutely sure there are no more such problems on the revised chips.

BTW, due to the nature of the problem with Phenom, it should cause NO delays whatsoever in AMD's transition to 45nm, since it had nothing to do with manufacturing and was due to an architectural flaw in the design of the chip.