"TLB errata" bug

SickBeast

Lifer
Jul 21, 2000
14,377
19
81
So: why Phenom and not Barcelona?

Also, what is it? What part of the chip does it affect? Will the BIOS update disable part of the chip, making it less effective?

I don't understand the difference between the Barcelona and the Phenom, marketing aside. Why does this bug only affect a few Phenoms (9700)?

Opteron and X2 were always essentially the same chip running on the same socket (for the most part)...I take it things have changed?

It really feels like Phenom was rushed to market, which is strange for a product that's something like a year late.
 

bfdd

Lifer
Feb 3, 2007
13,312
1
0
Originally posted by: harpoon84
It only affects Phenom at 2.4GHz or higher. Barcelona is not clocked that high.

The bug is in all chips, though, even those clocked below 2.4 GHz.
 

SickBeast

Lifer
Jul 21, 2000
14,377
19
81
Originally posted by: Sheninat0r
What is this bug? I haven't heard anything about it...

That's why I created this thread. :)

I did find this using google:
OMFG


What a train smash.


This errata stuff is BS speak for "our 2.4 GHZ yields are crap". You dont get different errata at different speed bins.

and

This problem was found during speed-binning the B2 revision processors, and this was the cause for the Phenom FX 3.0 GHz delay. It turns out that some CPUs running at 2.4 GHz or above in some benchmarking combinations, while all four cores are running at 100% load, can cause a system freeze.

The first quote is from someone named 'Wombat 2' on the THG forums. The second is from the Inquirer, AFAIK.

It sounds to me like AMD simply hit the wall at 2.4 GHz. It's probably similar to when you overclock a chip and it's not Prime95-stable once you reach a certain speed.

AMD doesn't seem to be saying anything about it, so it must be somewhat of an embarrassment. :brokenheart:
 

harpoon84

Golden Member
Jul 16, 2006
1,084
0
0
Originally posted by: bfdd
Originally posted by: harpoon84
It only affects Phenom at 2.4GHz or higher. Barcelona is not clocked that high.

The bug is in all chips, though, even those clocked below 2.4 GHz.

But chips at 2.3 GHz and lower are not affected stability-wise... so whether it is there or not is a moot point as long as they are stable.
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Thanks for posting the thread, SB... it seems that we all need to learn more about what exactly the TLB erratum affects.

My understanding comes from an engineering-savvy friend...though of course I'm probably not smart enough to understand it well.

"the problem with stability occurs when the memory controller runs faster than 2.4 GHz. Remember that with the K10 memory controller, the memory controller clock speed must be higher than the CPU clock speed, and is set in 400 MHz increments to synchronize best with actual DRAM speeds. So, 2.4 GHz is stable, and can be used with 2.3 GHz CPU clock speeds, and 2.8 GHz or above has the stability problem"

"a BIOS update allows the main loop to recover from a glitch in synchronization. If the problem occurs rarely if at all--as is expected in 2.3 GHz and below CPUs--this is sufficient. But if it occurs constantly, you take a big performance hit. So rev B3 is needed to swat the problem at its source"

What I glean from that is that the TLB erratum causes both a stability limit at 2.3 GHz and a slight performance hit whenever the loop is reset.
The higher you clock the CPU, the more resets you get...so the overclocked B2 Phenoms should perform quite poorly, overclocking will be quite limited, and normally clocked B2 Phenoms will only take a very minor performance hit (if any).

 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Also, what is it? What part of the chip does it affect? Will the BIOS update disable part of the chip, making it less effective?

Disclaimer: I don't really know anything about this story beyond what's on the Inquirer/forums, and I'm not speaking for any companies.

The TLB is the "translation lookaside buffer". Background:
It used to be that if a program accessed memory location 5, the CPU really accessed physical memory location 5, and the program could access any memory location it wanted to. Programs also saw only as much memory as the computer really had (because they were accessing the physical memory directly). Modern systems use "paging". When a program accesses what it thinks is location 5, the CPU instead looks in a mapping table set up by the OS that maps the "virtual" address that the program sees to a real "physical" address that the CPU actually accesses.

That mapping table is called the "page table", because it maps memory at a "page" granularity (4KB). Along with the translation, the page table stores some permission bits that can be used to keep one program from accessing memory belonging to the OS or another program. Also, because the virtual addresses don't have to map directly to physical addresses, it's possible to make programs think a machine has more memory than it really does (when it runs out of physical memory, the OS can pick a page and swap it out to the hard drive until it's needed again...without the programs even realizing it).
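To make the page-granularity idea concrete, here is a minimal sketch of a single-level lookup. The table contents and the `translate` helper are made up for illustration; a real page table lives in memory and is walked by hardware, not held in a Python dict.

```python
PAGE_SIZE = 4096  # 4 KB pages, as described above

# Hypothetical page table: virtual page number -> (physical page number, writable flag)
page_table = {
    0: (7, True),
    1: (3, False),
}

def translate(vaddr, write=False):
    """Map a virtual address to a physical one, checking a permission bit."""
    # Split the address into a page number and an offset within that page.
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        # The OS would handle this, e.g. by bringing the page back from disk.
        raise LookupError(f"page fault: virtual page {vpn} is not mapped")
    ppn, writable = page_table[vpn]
    if write and not writable:
        raise PermissionError(f"virtual page {vpn} is read-only")
    # Only the page number is translated; the offset carries over unchanged.
    return ppn * PAGE_SIZE + offset
```

So "location 5" in the program lands at offset 5 of whichever physical page the OS mapped virtual page 0 to.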

Now, these mappings are pretty big, so the page table is actually hierarchical (don't worry about the details). The net result is that finding the translation from a virtual to physical address generally requires ~3 memory accesses (for 32-bit apps - it's about 2x as bad for 64-bit apps)... so to do one useful memory access, you'd need to actually do a total of 4 accesses! To make paging feasible performance-wise, the translations are cached so that they don't have to be looked up each time. This cache is the TLB.
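A toy model of that caching effect might look like the sketch below. Everything in it is an assumption for illustration: the `ToyMMU` class, the two-level table (so a walk costs 2 accesses here rather than the ~3 mentioned above), and the fan-out of 1024 are not taken from any real CPU.

```python
PAGE_SIZE = 4096   # 4 KB pages
FANOUT = 1024      # entries per table level (illustrative assumption)

class ToyMMU:
    def __init__(self, root):
        # root: nested dicts, top index -> {second index -> physical page number}
        self.root = root
        self.tlb = {}             # cached translations: virtual page -> physical page
        self.memory_accesses = 0  # counts every simulated trip to memory

    def _walk(self, vpn):
        """Walk the two-level table, counting one memory access per level."""
        top, second = divmod(vpn, FANOUT)
        self.memory_accesses += 1   # read the top-level entry
        second_level = self.root[top]
        self.memory_accesses += 1   # read the second-level entry
        return second_level[second]

    def load(self, vaddr):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn not in self.tlb:
            # TLB miss: pay for the walk once, then remember the result.
            self.tlb[vpn] = self._walk(vpn)
        self.memory_accesses += 1   # the access the program actually asked for
        return self.tlb[vpn] * PAGE_SIZE + offset
```

With `mmu = ToyMMU({0: {0: 7, 1: 3}})`, the first `mmu.load(5)` costs three accesses (two for the walk, one for the data), while a second load from the same page costs only one, because the translation is already in `mmu.tlb`.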

Disabling the TLB is not an option, because the performance hit would be unreasonably large (best case, each memory access, even one that hits in the L1 data cache, would take 4x as long). Now, modern processors actually have multiple levels of TLBs (multiple levels of cache for the page table translations, just like the L1/L2/L3 caches for data and instructions) - maybe the L2 TLB(s) could be disabled if they were buggy, but I would imagine that would have a large performance impact in some situations. I'm not familiar with Barcelona/Phenom's TLB organization though.
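The 4x figure falls out of simple arithmetic, sketched below. The helper name is made up, the 3-access walk is the rough number from the explanation above, and the hit rates are hypothetical rather than measurements of any real chip. It also treats every access as equally expensive, which real caches do not.

```python
def accesses_per_load(tlb_hit_rate, walk_accesses=3):
    """Average memory accesses needed per useful load.

    Every load pays 1 access for the data itself; only TLB misses
    pay the extra page-table walk (assumed to cost walk_accesses).
    """
    miss_rate = 1.0 - tlb_hit_rate
    return 1 + miss_rate * walk_accesses

# No TLB at all is the same as a 0% hit rate: 1 + 3 = 4 accesses per load.
no_tlb = accesses_per_load(0.0)
# A hypothetical 99% hit rate keeps the average very close to 1.
good_tlb = accesses_per_load(0.99)
```

Under these assumptions, turning off even part of the TLB hierarchy pushes the average from near 1 toward 4, which is why any workaround that touches the TLB is likely to cost performance.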
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
I thought the B2 rev was the one that fixed the performance issues. Wasn't somebody saying that B2 was 10% faster than BA?

Now we are holding out hope for a B3 rev?

Yeah...Okay.
 

Duvie

Elite Member
Feb 5, 2001
16,215
0
71
Originally posted by: Viditor
Thanks for posting the thread, SB... it seems that we all need to learn more about what exactly the TLB erratum affects.

My understanding comes from an engineering-savvy friend...though of course I'm probably not smart enough to understand it well.

"the problem with stability occurs when the memory controller runs faster than 2.4 GHz. Remember that with the K10 memory controller, the memory controller clock speed must be higher than the CPU clock speed, and is set in 400 MHz increments to synchronize best with actual DRAM speeds. So, 2.4 GHz is stable, and can be used with 2.3 GHz CPU clock speeds, and 2.8 GHz or above has the stability problem"

"a BIOS update allows the main loop to recover from a glitch in synchronization. If the problem occurs rarely if at all--as is expected in 2.3 GHz and below CPUs--this is sufficient. But if it occurs constantly, you take a big performance hit. So rev B3 is needed to swat the problem at its source"

What I glean from that is that the TLB erratum causes both a stability limit at 2.3 GHz and a slight performance hit whenever the loop is reset.
The higher you clock the CPU, the more resets you get...so the overclocked B2 Phenoms should perform quite poorly, overclocking will be quite limited, and normally clocked B2 Phenoms will only take a very minor performance hit (if any).

So are we saying there may be an explanation for the rather poor performance of the Phenoms in the reviews? Or just for the fact that they didn't seem to have any headroom to OC with stability?

The only reason I ask is that, looking at the numbers, performance seems to scale correctly (linearly, that is) from the lowest speed tested at 2.2 GHz up through the 2.6 GHz setting. So if there was a hit in performance I didn't see it... even in chips that use 4 cores well...
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: Duvie
So are we saying there may be an explanation for the rather poor performance of the Phenoms in the reviews? Or just for the fact that they didn't seem to have any headroom to OC with stability?

The only reason I ask is that, looking at the numbers, performance seems to scale correctly (linearly, that is) from the lowest speed tested at 2.2 GHz up through the 2.6 GHz setting. So if there was a hit in performance I didn't see it... even in chips that use 4 cores well...

A very good question...and the answer is that I don't know (I was reading from someone else's notes).

Possibilities that occur to me:
1. I notice that the OC actually involved lowering very slightly the NB and memory controller. It could be that this is what allowed a linear OC and kept stability and performance.
2. It could be that the performance is already more affected at 2.2 GHz than AMD is letting on. I remember that Kris Kubicki mentioned in his blog back in Sept that the new steppings (BA and B3) should increase performance by 5% or more.
3. It could be that my interpretation is way off base and that performance doesn't enter into it at all, only stability.

Sadly, there's no way for us to know until we see the new steppings...

Edit: Another point that occurs to me is that in all of the leaked Phenom benches where they ran at 3 GHz, they all had to downclock the memory controller significantly (375 MHz).

Edit 2: One last point... since the AM2+ mobos are designed to run DDR2 at 1066, the mem controller should be able to run at 533 MHz without a problem. Obviously, there is a problem...
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Originally posted by: Viditor
Thanks for posting the thread, SB... it seems that we all need to learn more about what exactly the TLB erratum affects.

My understanding comes from an engineering-savvy friend...though of course I'm probably not smart enough to understand it well.

"the problem with stability occurs when the memory controller runs faster than 2.4 GHz. Remember that with the K10 memory controller, the memory controller clock speed must be higher than the CPU clock speed, and is set in 400 MHz increments to synchronize best with actual DRAM speeds. So, 2.4 GHz is stable, and can be used with 2.3 GHz CPU clock speeds, and 2.8 GHz or above has the stability problem"

"a BIOS update allows the main loop to recover from a glitch in synchronization. If the problem occurs rarely if at all--as is expected in 2.3 GHz and below CPUs--this is sufficient. But if it occurs constantly, you take a big performance hit. So rev B3 is needed to swat the problem at its source"

What I glean from that is that the TLB erratum causes both a stability limit at 2.3 GHz and a slight performance hit whenever the loop is reset.
The higher you clock the CPU, the more resets you get...so the overclocked B2 Phenoms should perform quite poorly, overclocking will be quite limited, and normally clocked B2 Phenoms will only take a very minor performance hit (if any).

So that means you still get random dips in performance under 2.4 GHz, it is just rare... but still pretty irksome if it is noticeable (i.e., FPS-wise)... I wonder how often those dips occur.
 

pm

Elite Member Mobile Devices
Jan 25, 2000
7,419
22
81
Nice description of TLBs, CTho. Good explanation.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
So the CPU randomly crashes and gets reset with a slight performance dip every now and then, but if the CPU runs above 2.4 GHz the crashing is so frequent that it gets worse performance than at lower speeds? That is ridiculous...
I am not happy at all to hear that they are basically selling defective CPUs with a workaround...
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: taltamir
So the CPU randomly crashes and gets reset with a slight performance dip every now and then, but if the CPU runs above 2.4 GHz the crashing is so frequent that it gets worse performance than at lower speeds? That is ridiculous...
I am not happy at all to hear that they are basically selling defective CPUs with a workaround...

My understanding is that it's not a CPU crash, just a reset of the loop. However, if the resets occur frequently enough, then a crash does occur because syncing is no longer possible.

Going by CTho's excellent description of the TLB, it seems to me that what happens is that the TLB cache is flushed and has to be reset. If this starts to happen continuously, I assume that is when a crash occurs.
My guess is that the problem is in the L3 TLB...
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Meh, same difference... CPU crash, loop crash... something is crashing and it is not supposed to.

And performance consistency over time matters more than peak performance... so dips in performance due to loop crashes are a big no-no.

PS. Thanks for the clarification though, It is good to know things more accurately and I am always glad when people correct me... speaking of which... what exactly IS a loop crash? how does it work and what is a loop?
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: taltamir
Meh, same difference... CPU crash, loop crash... something is crashing and it is not supposed to.

And performance consistency over time matters more than peak performance... so dips in performance due to loop crashes are a big no-no.

PS. Thanks for the clarification though, It is good to know things more accurately and I am always glad when people correct me... speaking of which... what exactly IS a loop crash? how does it work and what is a loop?

I should probably use different terms...how about CPU crash and loop reset?
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: Phynaz
I thought the B2 rev was the one that fixed the performance issues.

No...
BA was the fix for B1
B3 is the fix for B2

B1/BA was designed for lower/standard powered Barcelona chips
B2/B3 was designed for Phenoms and high-performance Barcelonas
 

JumpingJack

Member
Mar 7, 2006
61
0
0
Originally posted by: Sheninat0r
What is this bug? I haven't heard anything about it...

If you read some of the Phenom reviews, as well as Theo's report in the Inquirer, or FUDzilla, they discuss a TLB (Translation Look-aside Buffer - http://www.cs.umass.edu/~weems...Lecture11/L11.18.html) bug in the L3 cache. It is a logic bug, from what I can gather, that expresses itself at higher clock speeds (no one has provided a good explanation as of yet; it may have to do with the NB/L3 clock at higher CPU speeds).

AMD has a published erratum describing a TLB error in the BA/B2 steppings that can cause hard lockups under certain load conditions.

http://www.amd.com/us-en/asset...nd_tech_docs/41322.pdf

See erratum 254.

If this is the same bug, then there is (and has been reported) a BIOS workaround; however, the info reported from reputable sites (non-FUDzilla/non-Inq) states AMD expects a 10% performance hit using this workaround.

Considering that Phenom and Barcelona cores are the same, it is probable that Barcelona also has this bug... I am actually wondering if this will impact their commitment to deliver 2.5 GHz Barceys by Dec. as promised... if so, there will be a few more non-compliant stamps on their benchmark submissions at Spec.org.

Jack

 

JumpingJack

Member
Mar 7, 2006
61
0
0
Originally posted by: Phynaz
I thought the B2 rev was the one that fixed the performance issues. Wasn't somebody saying that B2 was 10% faster than BA?

Now we are holding out hope for a B3 rev?

Yeah...Okay.

That is what many people said... B2 was what would fix everything, and now it looks like it is B3. So far the story has been: the next stepping is the one, just wait for the next stepping... After B3, will it be B4, or will they try C0?
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
So a 10% bonus to speed at the same clock and the ability to go from 2.3 GHz to maybe 3.0 GHz? I can imagine AMD is very anxious to get that fixed; it will make them comparable to the C2Q.
 

JumpingJack

Member
Mar 7, 2006
61
0
0
Originally posted by: taltamir
So a 10% bonus to speed at the same clock and the ability to go from 2.3 GHz to maybe 3.0 GHz? I can imagine AMD is very anxious to get that fixed; it will make them comparable to the C2Q.

That was not my understanding... in the current review samples, the TLB is still on, and because the bug exists they won't launch the 2.4 GHz CPU.

If you think they are taking a 10% hit now, resulting in the current data, I think you will be in for disappointment. The reviewers clearly stated that if the BIOS patch is implemented the CPU will take a 10% hit; fixing the bug won't result in a 10% improvement.

In addition, there's currently an errata in the 2.4GHz+ parts B1 and B2 stepping with the L3 TLB (translation lookup buffer) that requires a BIOS fix which should now be available for the latest 7-series chipsets. However this fix also kills performance "about 10 percent" according to AMD and it will be up to the user to enable or disable the BIOS fix, depending on the amount of work they're doing on the system.

http://www.bit-tech.net/news/2...es_phenom_and_spider/1

This is just one site where I saw this; a couple of others also report this info, as I recall.


 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: JumpingJack
Originally posted by: taltamir
So a 10% bonus to speed at the same clock and the ability to go from 2.3 GHz to maybe 3.0 GHz? I can imagine AMD is very anxious to get that fixed; it will make them comparable to the C2Q.

That was not my understanding... in the current review samples, the TLB is still on, and because the bug exists they won't launch the 2.4 GHz CPU.

If you think they are taking a 10% hit now, resulting in the current data, I think you will be in for disappointment. The reviewers clearly stated that if the BIOS patch is implemented the CPU will take a 10% hit; fixing the bug won't result in a 10% improvement.

In addition, there's currently an errata in the 2.4GHz+ parts B1 and B2 stepping with the L3 TLB (translation lookup buffer) that requires a BIOS fix which should now be available for the latest 7-series chipsets. However this fix also kills performance "about 10 percent" according to AMD and it will be up to the user to enable or disable the BIOS fix, depending on the amount of work they're doing on the system.

http://www.bit-tech.net/news/2...es_phenom_and_spider/1

This is just one site where I saw this; a couple of others also report this info, as I recall.

The linked site (like many I have seen out there) definitely has at least some of its facts wrong...
The AMD errata sheet from Sept lists the two revs we currently have... BA for Barcelona, and B2 for Phenom.
Those 2 came out simultaneously (rev B1 was an engineering sample only), and I have confirmed (from a client) that current Barcelonas are still rev BA, and obviously Phenom is rev B2 at the moment.
Even though the cores are the same, the different chips (low power vs performance) ship with different revs (in the same way that Opteron had a different rev than A64, but it was the same core).

You were absolutely right about it being erratum #254... many thanks for that! It appears that BA is also in need of the same fix, so we will probably see a new rev there as well (BB?)... though the need would seem to be less pressing as they are low power chips.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Are you saying that the patch isn't currently implemented? I thought the whole point was that they are using the BIOS patch to prevent the CPU from crashing by sacrificing 10% of its performance on those loop resets.
 

SickBeast

Lifer
Jul 21, 2000
14,377
19
81
Originally posted by: taltamir
Are you saying that the patch isn't currently implemented? I thought the whole point was that they are using the BIOS patch to prevent the CPU from crashing by sacrificing 10% of its performance on those loop resets.

I'll bet AMD wanted the initial reviews to come out before implementing the patch. If Phenom becomes 10% slower, it will probably be very similar to A64 clock-for-clock, which would make AMD look pretty bad (why did they spend all this time and money on a chip that's no better than their previous architecture?).