What does CPU bug AK43 mean, really?

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
From the errata of the Intel Quad cores:

AK43. Concurrent Multi-processor Writes to Non-dirty Page May Result in
Unpredictable Behavior
Problem: When a logical processor writes to a non-dirty page, and another logical processor
either writes to the same non-dirty page or explicitly sets the dirty
bit in the corresponding page table entry, complex interaction with internal
processor activity may cause unpredictable system behavior.
Implication: This erratum may result in unpredictable system behavior and hang.

Workaround: It is possible for BIOS to contain a workaround for this erratum.

Status: For the steppings affected, see the Summary Tables of Changes.

So correct me if I'm wrong here --

Basically if a page hasn't been written to since its data was last loaded from or written
to memory, it's not dirty.
I.e. dirty = the cached copy is modified, but the main memory copy isn't yet updated to match.

So when a first CPU core writes to memory page X, that page becomes
automatically marked dirty, right? Without software intervention (normally),
the page's dirty bit would get set by the CPU, right?
Or is it always a software process to mark pages dirty?
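
For reference, a rough sketch of the bit in question. On x86 the dirty flag lives in the page table entry (bit 6 of a 4 KB PTE, next to the accessed flag at bit 5). The processor sets it on a write through that entry, and the OS can also set or clear it explicitly by rewriting the entry. That is a separate thing from a cache line sitting in the modified state. The function names below are made up for illustration, and this toy model ignores TLBs, locked updates and everything the erratum is actually about.

```python
# Architectural flag bits in a 4 KB x86 page table entry (PTE).
PTE_PRESENT = 1 << 0   # page is mapped
PTE_WRITABLE = 1 << 1  # writes are allowed
PTE_ACCESSED = 1 << 5  # set by the CPU on any access
PTE_DIRTY = 1 << 6     # set by the CPU on a write


def is_dirty(pte: int) -> bool:
    """Return True if the dirty flag is set in this PTE value."""
    return bool(pte & PTE_DIRTY)


def model_hardware_write(pte: int) -> int:
    """Toy model (hypothetical helper): what the PTE looks like after a write.

    A real CPU does this in hardware during the page walk; here we only
    show which bits end up set.
    """
    if not (pte & PTE_PRESENT) or not (pte & PTE_WRITABLE):
        raise PermissionError("write would page-fault in this toy model")
    return pte | PTE_ACCESSED | PTE_DIRTY


clean = PTE_PRESENT | PTE_WRITABLE
print(is_dirty(clean), is_dirty(model_hardware_write(clean)))
```

So a "non-dirty page" in the erratum is simply one whose PTE still has bit 6 clear at the moment two processors get involved with it.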

How / when would another CPU core write to that same page when that
page ISN'T marked dirty? Are they saying that BEFORE it gets a chance to
be marked dirty, the other core could write to it in its "still non dirty" state?

Why/when would another processor core (if it hasn't written to the page)
explicitly set that page's dirty bit?

How would the BIOS work around this problem? I don't get what the BIOS
could possibly do to make this situation better, unless they mean there's a
CPU microcode update for this problem, but if that was the case
wouldn't they just say that, and say that the BIOS or OS could load
corrective microcode patches/updates?

Anyway it sounds like the cache coherency must be wildly broken
if this can occur due to either a race condition between cores writing
to the same page, and/or cores affecting the same page's dirty bit.

If multiple cores CAN'T safely do I/O to the same cached copy of a
write-back memory page, doesn't that basically break the whole
usefulness of a memory write back data cache and cache coherency in general?

I don't recall what page sizes CAN be set to, but I seem to recall they're
often 4 KB, though can be bigger or maybe smaller too. It doesn't seem
that uncommon for multiple cores to be writing SOMEWHERE within the
same page of memory if it's containing some kind of common data
structure e.g. semaphores, counters, shared buffer space or something like
that.
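
To put numbers on that: on x86, 4 KB is the smallest page size (larger 2 MB / 4 MB pages also exist), so the page number is just the address shifted right by 12 bits. The sketch below shows two addresses that sit in different cache lines but the same page. The addresses are arbitrary examples and the 64-byte line size is an assumption that matches Core 2.

```python
PAGE_SHIFT = 12   # 4 KB pages: 2**12 bytes
CACHE_LINE = 64   # assumed cache line size in bytes


def page_number(addr: int) -> int:
    """Page frame that a byte address falls into."""
    return addr >> PAGE_SHIFT


def cache_line_number(addr: int) -> int:
    """Cache line that a byte address falls into."""
    return addr // CACHE_LINE


# Two hypothetical addresses, e.g. a per-core counter each.
counter_core0 = 0x7F001040
counter_core1 = 0x7F001F80

same_page = page_number(counter_core0) == page_number(counter_core1)
same_line = cache_line_number(counter_core0) == cache_line_number(counter_core1)
print(same_page, same_line)
```

Keeping per-core data in separate cache lines avoids false sharing, but as far as the page table is concerned both writes still hit the same page and the same dirty bit.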

If I am reading this correctly it doesn't seem to say that the problem
will not occur if the processor cores use atomic operation instructions
or refrain from writing to the SAME page cache lines or whatever.
So it seems like it'd be "generally unsafe" to write or manage a page
for two cores no matter HOW they did it e.g. even if areas of the page
were for the exclusive use of the individual cores.

 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
That erratum doesn't have much info - AMD's revision guides generally have more detail than that, and workaround info is more explicit, e.g. (from erratum 63 on Athlon 64s and Opterons):
In MP systems, disable the TLB flush filter by setting HWCR.FFDIS (bit 6 of MSR 0xC001_0015).

Originally posted by: QuixoticOne
How would the BIOS work around this problem? I don't get what the BIOS could possibly do to make this situation better, unless they mean there's a CPU microcode update for this problem, but if that was the case wouldn't they just say that, and say that the BIOS or OS could load corrective microcode patches/updates?

CPUs often have the ability to enable/disable various features through special control registers. At boot time, the BIOS can set/clear certain bits, which might disable whatever optimization caused the problem. Sometimes there can be a performance penalty that arises from the workaround.
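
As a rough illustration using the AMD example above, the workaround amounts to a read-modify-write of one register bit. The sketch only does the bit arithmetic on a plain integer; actually reading and writing the MSR needs ring 0 (RDMSR/WRMSR), which is why it is BIOS or kernel work. The helper names are made up here.

```python
HWCR_MSR = 0xC0010015  # MSR address quoted in the AMD erratum text
FFDIS_BIT = 6          # HWCR.FFDIS: disable the TLB flush filter


def set_bit(value: int, bit: int) -> int:
    """Return value with the given bit set."""
    return value | (1 << bit)


def clear_bit(value: int, bit: int) -> int:
    """Return value with the given bit cleared."""
    return value & ~(1 << bit)


# Hypothetical current register contents, as the BIOS might have read them.
old_hwcr = 0x0000_0010
new_hwcr = set_bit(old_hwcr, FFDIS_BIT)
print(hex(HWCR_MSR), hex(old_hwcr), hex(new_hwcr))
```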

Anyway it sounds like the cache coherency must be wildly broken if this can occur due to either a race condition between cores writing to the same page, and/or cores affecting the same page's dirty bit.

If multiple cores CAN'T safely do I/O to the same cached copy of a write-back memory page, doesn't that basically break the whole usefulness of a memory write back data cache and cache coherency in general?

Yes, but CPU errata often only cause problems under very specific circumstances.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,580
10,216
126
Does this erratum only apply to C2Qs, or does it apply to all C2D CPUs? It sounds pretty bad to me.
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
See also these links:
http://marc.info/?l=openbsd-misc&m=118296441702631
..."These processors are buggy as hell, and some of these bugs don't just
cause development/debugging problems, but will *ASSUREDLY* be
exploitable from userland code."

..."Note that some errata like AI65, AI79, AI43, AI39, AI90, AI99 scare
the hell out of us. Some of these are things that cannot be fixed in
running code, and some are things that every operating system will do
until about mid-2008, because that is how the MMU has always been
managed on all generations of Intel/AMD/whoeverelse hardware. Now
Intel is telling people to manage the MMU's TLB flushes in a new and
different way. Yet even if we do so, some of the errata listed are
unaffected by doing so.

As I said before, hiding in this list are 20-30 bugs that cannot be
worked around by operating systems, and will be potentially
exploitable. I would bet a lot of money that at least 2-3 of them
are.

For instance, AI90 is exploitable on some operating systems (but not
OpenBSD running default binaries).

At this time, I cannot recommend purchase of any machines based on the
Intel Core 2 until these issues are dealt with (which I suspect will
take more than a year). Intel must become more transparent."
...


Here are the full details of the bug lists from Intel for Core2 DUO and Core2 QUAD:

Intel® Core™2 Extreme Processor X6800 and Intel® Core™2 Duo Desktop Processor E6000 and E4000 Sequence Specification Update
http://www.intel.com/design/pr...or/specupdt/313279.htm

Intel® Core™2 Extreme Quad-Core Processor QX6000 Sequence and Intel® Core™2 Quad Processor Q6000 Sequence Specification Update
http://www.intel.com/design/pr...or/specupdt/315593.htm


My text:

Thanks for the info, CTho9305; yes, reading the list shows that many of
the other errata are documented as happening in rather unusual/specific/unlikely
circumstances. But a few bad-sounding ones like the one I've posted here
don't indicate that they're limited to uncommon or unusual situations, and that's what
worries me.

Sounds like there COULD be a huge performance hit if they have to compromise
the effectiveness of the cache to make processor cache coherency sort-of-work. Eck!

VirtualLarry, I just looked, and, yes, there's an essentially identical bug
in the Core2 dual core CPUs.

Actually a quick glance makes it look like there are several additional bad-sounding
Core2 DUO bugs that I don't recall seeing duplicated on the QUAD errata list.

The different documents have different ID numbers but sometimes the
bugs described are the same.


From Core2 DUO bugs document:

AI43. Concurrent Multi-processor Writes to Non-dirty Page May Result in
Unpredictable Behavior
Problem: When a logical processor writes to a non-dirty page, and another logical processor
either writes to the same non-dirty page or explicitly sets the dirty
bit in the corresponding page table entry, complex interaction with internal
processor activity may cause unpredictable system behavior.
Implication: This erratum may result in unpredictable system behavior and hang.
Workaround: It is possible for BIOS to contain a workaround for this erratum.
Status: For the steppings affected, see the Summary Tables of Changes.

AI39. Cache Data Access Request from One Core Hitting a Modified Line in
the L1 Data Cache of the Other Core May Cause Unpredictable System
Behavior
Problem: When request for data from Core 1 results in a L1 cache miss, the request is
sent to the L2 cache. If this request hits a modified line in the L1 data cache
of Core 2, certain internal conditions may cause incorrect data to be returned
to the Core 1.
Implication: This erratum may cause unpredictable system behavior.
Workaround: It is possible for the BIOS to contain a workaround for this erratum.
Status: For the steppings affected, see the Summary Tables of Changes.

AI40. PREFETCHh Instruction Execution under Some Conditions May Lead
to Processor Livelock
Problem: PREFETCHh instruction execution after a split load and dependent upon
ongoing store operations may lead to processor livelock.
Implication: Due to this erratum, the processor may livelock.
Workaround: It is possible for the BIOS to contain a workaround for this erratum.
Status: For the steppings affected, see the Summary Tables of Changes.

AI42. Upper 32 Bits of the FPU Data (Operand) Pointer in the FXSAVE
Memory Image May Be Unexpectedly All 1's after FXSAVE
Problem: The upper 32 bits of the FPU Data (Operand) Pointer may incorrectly be set
to all 1's instead of the expected value of all 0's in the FXSAVE memory
image if all of the following conditions are true:
• The processor is in 64-bit mode.
• The last floating point operation was in compatibility mode
• Bit 31 of the FPU Data (Operand) Pointer is set.
• An FXSAVE instruction is executed
Implication: Software depending on the full FPU Data (Operand) Pointer may behave
unpredictably.
Workaround: None identified.
Status: For the steppings affected, see the Summary Tables of Changes.

...etc...

Originally posted by: VirtualLarry
Does this erratum only apply to C2Qs, or does it apply to all C2D CPUs? It sounds pretty bad to me.

 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
Here's another 'nice' one for the C2Ds:

"Burn, baby, burn...."

AI65. A Thermal Interrupt is Not Generated when the Current Temperature
is Invalid
Problem: When the DTS (Digital Thermal Sensor) crosses one of its programmed
thresholds it generates an interrupt and logs the event
(IA32_THERM_STATUS MSR (019Ch) bits [9,7]). Due to this erratum, if the
DTS reaches an invalid temperature (as indicated IA32_THERM_STATUS MSR
bit[31]) it does not generate an interrupt even if one of the programmed
thresholds is crossed and the corresponding log bits become set.
Implication: When the temperature reaches an invalid temperature the CPU does not
generate a Thermal interrupt even if a programmed threshold is crossed.
Workaround: None identified.
Status: For the steppings affected, see the Summary Tables of Changes.
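
A small sketch of the condition AI65 describes, decoding a raw IA32_THERM_STATUS value. Bits 7 and 9 are the threshold log bits and bit 31 is the "reading valid" flag, as named in the erratum text; my reading that bit 31 = 1 means the reading is valid is an assumption worth checking against the SDM. Reading the MSR itself needs ring 0, so this only works on a value you already have, and the function names are made up.

```python
IA32_THERM_STATUS = 0x19C  # MSR address given in the erratum


def decode_therm_status(value: int) -> dict:
    """Pick out the three fields AI65 talks about from a raw MSR value."""
    return {
        "threshold1_log": bool((value >> 7) & 1),
        "threshold2_log": bool((value >> 9) & 1),
        "reading_valid": bool((value >> 31) & 1),  # assumed: 1 = valid
    }


def matches_ai65_condition(value: int) -> bool:
    """True if a threshold was logged while the reading is flagged invalid.

    Per the erratum, this is the case where no thermal interrupt is raised.
    """
    fields = decode_therm_status(value)
    crossed = fields["threshold1_log"] or fields["threshold2_log"]
    return crossed and not fields["reading_valid"]


# Hypothetical raw values: threshold #1 logged, with and without a valid reading.
print(matches_ai65_condition(1 << 7), matches_ai65_condition((1 << 7) | (1 << 31)))
```

Software that polls the log bits instead of relying on the interrupt would still notice the crossing in this case.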
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
IIRC, that Theo de Raadt guy has done this sort of sensationalistic thing before. Linus Torvalds has disavowed his statements. Find a reputable person to quote.
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
Originally posted by: bryanW1995
IIRC, that Theo de Raadt guy has done this sort of sensationalistic thing before. Linus Torvalds has disavowed his statements. Find a reputable person to quote.

Well I quoted Intel themselves, and, in my own humble opinion based on reading
the actual known details of the errata, I'm inclined to agree with Mr. de Raadt
and others who've said that several of these bugs are likely going to adversely
affect system reliability / performance / security.

Don't judge the message by the messenger, look at the technical merits of the
message. If I had a C2D I'd be writing some test code to see if things really
malfunctioned / crashed or took a big performance hit in cases similar to the ones
described, so I'd know what to watch out for in coding.

However at this point my question is WHETHER to buy a C2Q given these bugs.
Many of them I've already concluded I can live with because they're unusual
cases, but many others seem like things that could cripple reliability / stability,
hence my concern.

I'd be most interested to learn more about the details from those with actual data from
Intel / the BIOS people / the OS people, or those who've experimented and
benchmarked to see if they really have problems in relevant situations with these.

If you don't abide by T. de Raadt's opinion, here are several other independent
people who seem at least moderately knowledgeable and also quite concerned about
the potential / actual problems.

http://forums.pcquest.com/foru...49afd494238cbe2ae592c4

http://www.hardwareasylum.com/...showthread.php?p=28322

http://www.securityfocus.com/blogs/228

http://www.tjrforum.com/showthread.php?t=3182

 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
So what's your story here?

You admit you don't know what these errata mean, but yet you say they are bad.

Sounds to me that you have a hidden agenda.
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
(putting away flamethrower) de Raadt said similar things about the original C2D release IIRC. I'm still a closet AMD fanboy, so I was happy to read his comments last year, but they seem to be more theoretical than actual problems. All CPUs have these issues, but I'm not going to be too concerned about them until they start popping up in real-world computers. It would definitely be interesting to see if somebody could actually bring one of these bugs to life...on his own computer and not mine, of course.
 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
Originally posted by: Phynaz
So what's your story here?

You admit you don't know what these errata mean, but yet you say they are bad.

Sounds to me that you have a hidden agenda.

Are you kidding? Are you a shill from Intel doing PR damage control?

If you actually understand CPU architecture and read the content of my
posts and the original errata you'd realize:

a) These are BAD bugs and in some cases are sufficiently commonplace
risks that they deserve analysis / consideration of their impact to the
many average users / programs.

b) I have a pretty good understanding of what the nature and worst case impacts
of the bugs are. What I wanted to understand BETTER was the extent
to which an obviously bad problem can be ameliorated, and to explicitly
understand the quantitative risks / performance impacts of the bugs and any
kludgy 'workarounds'. It's a no-brainer to say that if you turn off the
L1/L2 cache you'll probably be less affected by cache bugs, but to me that's
a rather unacceptable solution since the memory performance on C2D/Q is
already pretty bad, especially for writes. So that leaves workarounds that are
less drastic for performance but also less EFFECTIVE / SAFE / CLEAN, bypassing
the issues in less direct fashions.

I'm a lower level EE & CS developer than about 99.5% of the people out there,
and though Intel architecture MMU / Cache coherency isn't at all my specialty,
I certainly know enough to call a problem a problem and overlook the ones that
are less disastrous.

Notice that there's a short list of 5-10 that I and others mention as being really quite
potentially disastrous, and most of the other 101 on the list are certainly unpleasant in
general, disastrous for a minority, but not crippling in many common use cases
(e.g. unless you're debugging or writing driver / hypervisor / memory management code).

I was hoping that someone had either:

a) Written test code to check the safety / correctness / performance impact of any
BIOS / microcode patches that might exist.

b) Gotten more details of any useful ideas individual developers could use to
help ameliorate the bugs.

c) Been able to write demonstration code that could confirm the presence / absence
of the problems so one could even KNOW if one's BIOS HAD a workaround /
microcode update relevant to the few cases where it suggested one might be possible,
since the last BIOS / microcode updates I've seen were pathetically devoid of ANY
useful details to indicate WHAT was even changed.

So what's YOUR opinion about the severity of these, and what's your expertise
in development / CPU architecture that we may view your opinions in a useful
context?

 

QuixoticOne

Golden Member
Nov 4, 2005
1,855
0
0
Originally posted by: Diogenes2
Are these errata not present in core 2 duo ?

Have you seen the errata list for AMD 64 ?

Three pages worth in this doc..

http://www.amd.com/us-en/asset...nd_tech_docs/25759.pdf


I think we would do well ( if anyone really cares ) to find out what these bugs really mean
in the real world ..

Yes, I agree we need better DETAILS of how the problems exactly manifest themselves
and what workarounds there could be, and what the PERFORMANCE impact of those would
be.

Yes the bugs are apparently even much worse in the C2 DUO than the C2 QUAD;
i.e. there are some that only exist on C2D and not on C2Q, and several that are
problems both on C2D and C2Q. Some fair number of them were 'FIXED' in the
last stepping (G0) released for C2D/C2Q, but many potentially serious ones remain
for each kind of CPU.

Yes I've seen AMD's errata lists many times. They certainly have their share of problems
too.

That being said, I read and understood the AMD errata before I bought my X2 4400
a couple of years ago, and now that I'm considering buying a C2Q for heavy
low level hardware / assembly level development, I'm trying to assess the risk
of the C2 bugs before I shoot myself in the foot writing code that's going to break
due to cache coherency problems or whatever.

 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Originally posted by: QuixoticOne
a) These are BAD bugs and in some cases are sufficiently commonplace
risks that they deserve analysis / consideration of their impact to the
many average users / programs.

Are the bugs "bad" because you have no idea how the bug manifests itself and the big words on the errata list impress you? Or are they "bad" in the actual sense, namely, the bug can be hit without some obscure cycle accurate sequence of events and no workaround has been implemented, and it results in actual machine death and not a barely detectable perf loss?

b) I have a pretty good understanding of what the nature and worst case impacts
of the bugs are. What I wanted to understand BETTER was the extent
to which an obviously bad problem can be ameliorated, and to explicitly
understand the quantitative risks / performance impacts of the bugs and any
kludgy 'workarounds'. It's a no-brainer to say that if you turn off the
L1/L2 cache you'll probably be less affected by cache bugs, but to me that's
a rather unacceptable solution since the memory performance on C2D/Q is
already pretty bad, especially for writes. So that leaves workarounds that are
less drastic for performance but also less EFFECTIVE / SAFE / CLEAN, bypassing
the issues in less direct fashions.

I'm not going to bother asking you about "obviously bad problem" (see above), but in regards to your request, chip manufacturers don't have to tell the outside world anything about the performance impact of any workaround for any bug. That is proprietary information. If the BIOS workaround for the bug would tank performance, then the bug would have been fixed on CPU instead of in the BIOS. Simple ROI evaluation.

And why should the outside world care about how effective or clean the solution and/or workaround is? The only thing needed (and the only thing they usually get) is the workaround on how to get the part to work as specified.

I'm a lower level EE & CS developer than about 99.5% of the people out there,
and though Intel architecture MMU / Cache coherency isn't at all my specialty,
I certainly know enough to call a problem a problem and overlook the ones that
are less disastrous.

Actually, you don't. Reading the errata never gives the full story, and the public never gets the full story. Again, proprietary information.

Notice that there's a short list of 5-10 that I and others mention as being really quite
potentially disastrous, and most of the other 101 on the list are certainly unpleasant in
general, disastrous for a minority, but not crippling in many common use cases
(e.g. unless you're debugging or writing driver / hypervisor / memory management code).

I was hoping that someone had either:

a) Written test code to check the safety / correctness / performance impact of any
BIOS / microcode patches that might exist.

b) Gotten more details of any useful ideas individual developers could use to
help ameliorate the bugs.

c) Been able to write demonstration code that could confirm the presence / absence
of the problems so one could even KNOW if one's BIOS HAD a workaround /
microcode update relevant to the few cases where it suggested one might be possible,
since the last BIOS / microcode updates I've seen were pathetically devoid of ANY
useful details to indicate WHAT was even changed.

So what's YOUR opinion about the severity of these, and what's your expertise
in development / CPU architecture that we may view your opinions in a useful
context?

Those are valid concerns, however, no CPU manufacturer guarantees perfect functionality on any part. If full transparency on errata is a critical concern, then you are free to choose a manufacturer that discloses all bug information. My opinion is that you are blowing things way out of proportion. C2D has been out in the wild for over a year, statistically, the chance of a meaningful bug being still out there is practically nil.
 

Toadster

Senior member
Nov 21, 1999
598
0
76
scoop.intel.com
Originally posted by: QuixoticOne
That being said, I read and understood the AMD errata before I bought my X2 4400
a couple of years ago, and now that I'm considering buying a C2Q for heavy
low level hardware / assembly level development, I'm trying to assess the risk
of the C2 bugs before I shoot myself in the foot writing code that's going to break
due to cache coherency problems or whatever.

wow - I can't imagine what it takes for you to buy anything in the consumer market if you're that worried about errata! have you seen the allowable filth recommendations posted by the FDA for food consumption?

http://www.fda.gov/consumer/default.htm

I bet you'll never buy at the store again with all that 'errata'!
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: dmens
I'm not going to bother asking you about "obviously bad problem" (see above), but in regards to your request, chip manufacturers don't have to tell the outside world anything about the performance impact of any workaround for any bug. That is proprietary information. If the BIOS workaround for the bug would tank performance, then the bug would have been fixed on CPU instead of in the BIOS. Simple ROI evaluation.

And why should the outside world care about how effective or clean the solution and/or workaround is? The only thing needed (and the only thing they usually get) is the workaround on how to get the part to work as specified.

I'd hope published SPEC / TPC / etc benchmark results are done using all performance-affecting workarounds someone who has their systems set up for maximum reliability might use. Why should the outside world care about how effective a workaround is? I hope you just worded that part poorly and don't mean that.
 

sayNOtoFSB

Banned
May 29, 2007
26
0
0
All I can say is: don't sell your C2D to NASA or Shuttle spacecraft.
Or worse than that: don't use C2D in nuclear reactor plants or weaponry systems. I think I am changing my alias to "SayNoToC2D".
 

SexyK

Golden Member
Jul 30, 2001
1,343
4
76
Originally posted by: dmens
*snip*

C2D has been out in the wild for over a year, statistically, the chance of a meaningful bug being still out there is practically nil.

Bingo. If any of the errata noted above were going to cause a real-world issue, it would have been identified by now. At this point, I think it's safe to say that C2D's have been used to run 99.99999% of the software that's out there in day-to-day use. If one of these errata was breaking people's code left and right, we would know about it, no doubt.

I'm honestly not really sure what is motivating the OP here. Everyone knows that CPUs with 100's of millions of transistors are going to come with some errata, it's nothing new and earth shattering. There's no reason at all to suggest people should steer clear of C2D or C2Q because they are "buggy" processors, that's the farthest thing from the truth.
 

Toadster

Senior member
Nov 21, 1999
598
0
76
scoop.intel.com
Originally posted by: SexyK
Originally posted by: dmens
*snip*

C2D has been out in the wild for over a year, statistically, the chance of a meaningful bug being still out there is practically nil.

Bingo. If any of the errata noted above were going to cause a real-world issue, it would have been identified by now. At this point, I think it's safe to say that C2D's have been used to run 99.99999% of the software that's out there in day-to-day use. If one of these errata was breaking people's code left and right, we would know about it, no doubt.

I'm honestly not really sure what is motivating the OP here. Everyone knows that CPUs with 100's of millions of transistors are going to come with some errata, it's nothing new and earth shattering. There's no reason at all to suggest people should steer clear of C2D or C2Q because they are "buggy" processors, that's the farthest thing from the truth.

fully agree... OP's attempt at trolling denied! ;)
 

dmens

Platinum Member
Mar 18, 2005
2,275
965
136
Originally posted by: CTho9305
I'd hope published SPEC / TPC / etc benchmark results are done using all performance-affecting workarounds someone who has their systems set up for maximum reliability might use. Why should the outside world care about how effective a workaround is? I hope you just worded that part poorly and don't mean that.

By effective, I meant the elegance of the fix within the design. It was poorly worded. This nomenclature is different from the ones I'm used to.

As for the performance impact, published benchmarks should be with systems closest to specification. What I meant by proprietary information was for chip manufacturers to release specific design details regarding the bug to explain any performance difference that could be caused. That is unnecessary.
 

xtknight

Elite Member
Oct 15, 2004
12,974
0
71
Hm, been using my Core 2 Duo since January and I've done everything with it I can possibly think of. I'm sure at least a million others are doing the same. The problems must not have been that bad.
 

VirtualLarry

No Lifer
Aug 25, 2001
56,580
10,216
126
Originally posted by: Toadster
wow - I can't imagine what it takes for you to buy anything in the consumer market if you're that worried about errata! have you seen the allowable filth recommendations posted by the FDA for food consumption?
I bet you'll never buy at the store again with all that 'errata'!

Heh. You have no idea how perfectionistic most ASM programmers are.