The TLB erratum is about an error in calculation, not heat.

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Everywhere I read on the forums, people were talking as if the TLB erratum was due to a process issue or some such causing a part of the CPU to overheat...

But I read the explanation by AMD, and according to them they messed up the order in which the L2 and L3 caches are updated, so that you could end up with different data in L2 and L3 (with the L3 data being wrong). If the calculation finished with the L2 cache, then nothing happened; but if ANOTHER process was intensive enough to cause the first one to drop out of the L2 cache, it would later get back a copy of what it dropped from L3, and THAT copy was wrong, causing the crash.
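The stale-copy scenario described above can be sketched as a toy model (a hypothetical Python simplification for illustration only, not AMD's actual cache protocol; `ToyCacheHierarchy` and its methods are invented names):

```python
# Toy model of the described bug: a "buggy" write updates L2 but leaves the
# older copy in L3 stale. When cache pressure evicts the line from L2, a
# later read refills from L3 and silently gets the old data.

class ToyCacheHierarchy:
    def __init__(self):
        self.l2 = {}  # line address -> value (fast, small, can be evicted)
        self.l3 = {}  # line address -> value (backing copy)

    def write(self, line, value, buggy=False):
        self.l2[line] = value
        if not buggy:
            self.l3[line] = value  # correct ordering keeps L3 in sync

    def evict(self, line):
        # Simulates L2 pressure from another intensive process.
        self.l2.pop(line, None)

    def read(self, line):
        # Hit L2 first; on a miss, refill from L3 (possibly stale).
        return self.l2[line] if line in self.l2 else self.l3.get(line)
```

With `buggy=True`, a write followed by an eviction makes the next read return the stale L3 value instead of the value just written, which is the data-corruption pattern described above.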

This means that any time a sufficiently intensive operation occurs, the chip will crash. Above 2.4 GHz almost every program is sufficiently intensive to deplete the L2 cache, causing the crash (not the error, mind you; the crash because of the error). But even at 2.3 GHz certain programs (like Photoshop, for example) will cause it quite often. That is why the whole shebang has to be disabled by the BIOS... but with it disabled you lose 10-20% in performance...

To fix such a problem they will have to update the L2 and L3 caches in a slower process, resulting in a speed DECREASE, not an increase. OR use a much more complicated logic design (the ability to somehow update BOTH at once is one example; another example given was temporarily locking the data against change...).

I am guessing they probably went with the "more advanced circuit design" fix, which means they are taking extra time and hoping to end up with something faster rather than slower (or at worst, not the full 10-20% slower). But we will know when the xx50 versions show up.
BTW, due to the nature of the problem with Phenom, it should cause NO delays whatsoever in AMD's transition to 45nm, since it had nothing to do with manufacturing and was due to an architectural flaw in the design of the chip.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Not a single thread discussing the TLB errata mentioned that. In fact, each and every one of them discussed doping, electromigration, temperatures and such... all completely irrelevant to an architectural-flaw type of error.
 

myocardia

Diamond Member
Jun 21, 2003
9,291
30
91
Originally posted by: taltamir
Not a single thread discussing the TLB errata mentioned that. In fact, each and every one of them discussed doping, electromigration, temperatures and such...

What are you talking about? In this thread (the very first discussion of it, in the CPU/OC forum), read the 8th post, by CTho9305, where it was thoroughly discussed. Also read the 2nd link in the post by JumpingJack, about the 20th post in that thread, which you posted in, BTW. :D
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
I just reread it and I still don't understand how exactly he is saying the TLB bug occurs... I see a VERY VERY technical explanation of how the cache works...
but I didn't see a single reference to it handling the cache INCORRECTLY and causing data corruption that could cause a crash if the L2 cache is depleted.

Maybe he is just being too technical for me, but it only hit home when I read the explanation AMD gave out.
 

SX2012

Member
Feb 4, 2005
48
0
0
I read somewhere that the problem stems from a race condition where two CPUs try to access the L2 and L3 cache at the same time and store the information in reverse order in the L3 cache but correctly in the L2 cache. If that particular bit of information is purged from the L2 (during a very CPU-intensive operation) and later one of the CPUs goes looking to the L3 for the backup copy (that doesn't exist in the L2 anymore), there's a chance that it's ass-backwards and the system freezes...

I heard that the bug can only be reproduced under special conditions, or if you overclock, since then the caches are stressed and the system becomes L3-dependent. But other people say it's a more serious problem, overclocking or not. I personally haven't seen it, but I don't really know how to make it happen. It could be true that 2.2 GHz just isn't fast enough to cause enough race conditions to trigger the problem. I really don't know...


Look at the bright side of all this: everyone is going to learn something about cache :D whether it's true or false anyway.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Originally posted by: SX2012
I read somewhere that the problem stems from a race condition where two CPUs try to access the L2 and L3 cache at the same time and store the information in reverse order in the L3 cache but correctly in the L2 cache. If that particular bit of information is purged from the L2 (a CPU-intensive operation) and later one of the CPUs goes looking to the L3 for the backup copy, there's a chance that it's ass-backwards and the system freezes...

Exactly, absolutely right.

I heard that the bug can only be reproduced under special conditions, or if you overclock, since then the caches are stressed and the system becomes L3-dependent. But other people say it's a more serious problem. I personally haven't seen it, but I don't really know how to make it happen. It could be true that 2.2 GHz just isn't fast enough to cause enough race conditions to trigger the problem. I really don't know...


Look at the bright side of all this: everyone is going to learn something about cache :D

No, it happens with enough programs above 2.4 GHz that it crashes almost instantly. The faster the processor runs, the more programs are affected. But each program has a different "threshold"... that's why Photoshop triggers it consistently at 2.3 GHz, the lowest speed Phenom shipped at. And even without a super-intensive program it could occur, just as a "once in a..." kind of deal rather than "every 5 minutes"... It is still unacceptable.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
I stand corrected... but still, that's what happens... even at 2.2 GHz it crashes on certain applications. The crashing depends entirely on the usage and application. So the OC part is BS marketing by AMD... which they used to explain the 2.4 GHz limit.
 

SX2012

Member
Feb 4, 2005
48
0
0
Anyways, Barcelonas and Phenoms both have the bug, so either AMD tried to hide the whole thing because they fabbed 500,000 chips and realized they were all possibly scrap, or they were truly oblivious to the bug until the last minute, when they realized they couldn't make it stable past 2.4 GHz.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
This means that any time a sufficiently intensive operation occurs, the chip will crash. Above 2.4 GHz almost every program is sufficiently intensive to deplete the L2 cache, causing the crash (not the error, mind you; the crash because of the error).

No. The issue is what is meant by "intensive". There are very many parts in a modern CPU, and it's actually pretty tricky to exercise all of them to their full potential (programs that do this are usually only used to measure the maximum power consumption and do no useful work; they're sometimes referred to as "power viruses"). The % CPU usage reported by OSes is actually an extremely poor indicator of how CPU-intensive a program is from the perspective of each unit inside a CPU. You can get 100% reported CPU usage with an endless stream of "do nothing" instructions that exercise very little of the CPU. Such a program would never (as I understand it) evoke this bug. If you're creative with the order you touch memory locations, you could actually have the CPU be stalled (idle) for about 99 out of 100 cycles, but the OS report 100% CPU usage. You could write a highly-optimized program that performs computations and keeps execution units busy nearly every cycle, results in 100% reported CPU usage and a lot of heat, but still would never hit the bug. I guess my point is that things people here consider "intensive" may never encounter the bug; other things may. It's very difficult for a person who isn't an experienced software developer to determine which programs do things that could cause the bug to show up (with any reasonable probability).

Put very simply, in order to evoke the bug, you'll need multiple threads that cause TLB misses. Just causing TLB misses isn't enough; certain events (whose timing is challenging to control) have to happen at certain times on multiple cores to evoke the bug. Ars Technica has a pretty good writeup on the issue here. I wrote some stuff here but removed it because even though I tried to simplify it, I still think it wouldn't have made anything clearer to you.
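As a rough illustration of the kind of memory-access pattern that hammers a TLB (a hypothetical Python sketch; as noted above, causing TLB misses alone is NOT enough to evoke the erratum, which also needs specific multi-core timing):

```python
# Touch one byte in each 4 KiB page of a large buffer, in random order, so
# nearly every access needs a fresh virtual-to-physical translation. A
# program doing this reports 100% CPU usage even though the core spends
# most of its cycles stalled waiting on page-table walks.

import random

PAGE = 4096  # x86 page size in bytes

def tlb_hostile_sweep(buf, rounds=1, seed=0):
    rng = random.Random(seed)
    offsets = list(range(0, len(buf), PAGE))
    total = 0
    for _ in range(rounds):
        rng.shuffle(offsets)       # defeat page locality and prefetching
        for off in offsets:
            total += buf[off]      # one touch per page => one translation
    return total
```

Compare this with a tight arithmetic loop: both show "100% CPU" in Task Manager, but only the pattern above keeps the TLB constantly missing, which is the point about "intensive" being a poor predictor.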

The rest of your post is pretty far off too, but I'm not going to go into specifics. Sorry.

edit: If you really didn't understand my explanation of a TLB which myocardia linked to, it's probably impossible for you to understand the bug in any meaningful way. Feel free to start a new thread asking what a TLB is, quoting my entire explanation (in the thread myocardia linked to) and pointing out parts you didn't understand. I did not explain what the bug was in that thread because I wasn't sure what had been publicly disclosed.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Which does not, in any way, shape, or form, contradict what I said... who cares about "the definition of intensive"... by your definition it's not "intensive operations" but "operations that stress that particular part of the architecture, being intensive on it but mellow overall, because intensive usually refers to 100% CPU usage, which isn't the case here and is a bad measurement anyway"... Gee whiz, tell me something I didn't know. I said intensive because it was a simple and adequate description of the issue, rather than "multiple threads with high L2 cache requirements rapidly switching various data".
Considering my whole post was meant to summarize the whole issue in one simple-to-understand sentence, I don't think the usage of the word "intensive" was incorrect in this regard.

Thank you for belittling me, though. No, I really mean it.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: taltamir
Which does not, in any way, shape, or form, contradict what I said... who cares about "the definition of intensive"... by your definition it's not "intensive operations" but "operations that stress that particular part of the architecture, being intensive on it but mellow overall, because intensive usually refers to 100% CPU usage, which isn't the case here and is a bad measurement anyway"... Gee whiz, tell me something I didn't know. I said intensive because it was a simple and adequate description of the issue, rather than "multiple threads with high L2 cache requirements rapidly switching various data".
You said any program is "intensive" when run at 2.4 GHz, and that's flat out wrong. There are some things that a program can do differently depending on frequency, but there are others (relevant here) that really aren't frequency dependent.

like photoshop for example
Do you have any evidence that the Photoshop troubles the one review site had were caused by the TLB erratum? The Tech Report (I think? I can't find the article now...) couldn't reproduce Photoshop issues when using a BIOS that actually supported Phenom. edit: See edit at bottom.

If you really didn't understand my explanation of a TLB which myocardia linked to, it's probably impossible for you to understand the bug in any meaningful way.
Thank you for belittling me though. no, I really mean it.
That was a simple statement of fact. To understand what's going on, you have to at least understand what a TLB is. If you don't understand what a TLB is, you're really just repeating words and not understanding what's going on (which helps nobody).

I simplified my explanation of a TLB here pretty much to the limit while still retaining accuracy. If you can't follow that, you just don't know enough about CPUs (or maybe even programming) to understand what the bug is, how to hit it, and what needs to be changed to fix it. I was hoping you'd realize this, and stop posting your horribly incorrect information, for the benefit of everybody (and maybe even start asking questions that would get you to the point where you do understand what's going on). I did offer to try further explaining TLBs if you asked about them, but I'm not going to do it in threads where you just post misinformed speculation ("more advanced circuit design"? It really sounds like you have no idea what you're talking about. Not understanding something is fine, and I'd be happy to answer questions that are posed as questions, but it's just wrong posting threads claiming something that you don't even understand) over and over. This isn't even the only time you've made the same post in multiple threads (road-map interpretation, anyone?).

edit: From here:
The Barcelona Opterons were rock solid when we conducted the testing for our review of those chips. We did run into some stability problems with our early Phenom test systems, but we'd trace those issues back to a pre-production Asus motherboard. (emphasis mine).
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Originally posted by: CTho9305
Originally posted by: taltamir
Which does not, in any way, shape, or form, contradict what I said... who cares about "the definition of intensive"... by your definition it's not "intensive operations" but "operations that stress that particular part of the architecture, being intensive on it but mellow overall, because intensive usually refers to 100% CPU usage, which isn't the case here and is a bad measurement anyway"... Gee whiz, tell me something I didn't know. I said intensive because it was a simple and adequate description of the issue, rather than "multiple threads with high L2 cache requirements rapidly switching various data".
You said any program is "intensive" when run at 2.4 GHz, and that's flat out wrong. There are some things that a program can do differently depending on frequency, but there are others (relevant here) that really aren't frequency dependent.
Every review I read said that, regardless of temperature, they hit a brick wall at about 2.4 GHz. The only explanation I can think of is that the bug is manifesting at those speeds. The only reason I can think that would happen is if the L2 fills, causing valid TLB entries to be dropped from it and later looked up from the L3 (where they are corrupt). Can you come up with a different explanation?

like photoshop for example
Do you have any evidence that the Photoshop troubles the one review site had were caused by the TLB erratum? The Tech Report (I think? I can't find the article now...) couldn't reproduce Photoshop issues when using a BIOS that actually supported Phenom. edit: See edit at bottom.

I've got zero evidence to support that; I just read that some reviews couldn't get Photoshop to run on Phenom, which is a pretty huge freaking deal.

If you really didn't understand my explanation of a TLB which myocardia linked to, it's probably impossible for you to understand the bug in any meaningful way.
Thank you for belittling me though. no, I really mean it.
That was a simple statement of fact. To understand what's going on, you have to at least understand what a TLB is. If you don't understand what a TLB is, you're really just repeating words and not understanding what's going on (which helps nobody).

I simplified my explanation of a TLB here pretty much to the limit while still retaining accuracy. If you can't follow that, you just don't know enough about CPUs (or maybe even programming) to understand what the bug is, how to hit it, and what needs to be changed to fix it. I was hoping you'd realize this, and stop posting your horribly incorrect information, for the benefit of everybody (and maybe even start asking questions that would get you to the point where you do understand what's going on). I did offer to try further explaining TLBs if you asked about them, but I'm not going to do it in threads where you just post misinformed speculation ("more advanced circuit design"? It really sounds like you have no idea what you're talking about. Not understanding something is fine, and I'd be happy to answer questions that are posed as questions, but it's just wrong posting threads claiming something that you don't even understand) over and over. This isn't even the only time you've made the same post in multiple threads (road-map interpretation, anyone?).

edit: From here:
The Barcelona Opterons were rock solid when we conducted the testing for our review of those chips. We did run into some stability problems with our early Phenom test systems, but we'd trace those issues back to a pre-production Asus motherboard. (emphasis mine).

OK, so are you referring to this explanation?
Disclaimer: I don't really know anything about this story beyond what's on the Inquirer/forums, and I'm not speaking for any companies.

The TLB is the "translation lookaside buffer". Background:
It used to be that if a program accessed memory location 5, the CPU really accessed physical memory location 5, and the program could access any memory location it wanted to. Programs also saw only as much memory as the computer really had (because they were accessing the physical memory directly). Modern systems use "paging". When a program accesses what it thinks is location 5, the CPU instead looks in a mapping table set up by the OS that maps the "virtual" address that the program sees to a real "physical" address that the CPU actually accesses.

That mapping table is called the "page table", because it maps memory at a "page" granularity (4KB). Along with the translation, the page table stores some permission bits that can be used to keep one program from accessing memory belonging to the OS or another program. Also, because the virtual addresses don't have to map directly to physical addresses, it's possible to make programs think a machine has more memory than it really does (when it runs out of physical memory, the OS can pick a page and swap it out to the hard drive until it's needed again...without the programs even realizing it).

Now, these mappings are pretty big, so the page table is actually hierarchical (don't worry about the details). The net result is that finding the translation from a virtual to physical address generally requires ~3 memory accesses (for 32-bit apps - it's about 2x as bad for 64-bit apps)... so to do one useful memory access, you'd need to actually do a total of 4 accesses! To make paging feasible performance-wise, the translations are cached so that they don't have to be looked up each time. This cache is the TLB.

Disabling the TLB is not an option, because the performance hit would be unreasonably large (best case, each memory access, even accesses that hit in the L1 data cache would take 4x as long). Now, modern processors actually have multiple levels of TLBs (multiple levels of cache for the page table translations, just like the L1/L2/L3 caches for data and instructions) - maybe the L2 TLB(s) could be disabled if they were buggy, but I would imagine that would have a large performance impact in some situations. I'm not familiar with Barcelona/Phenom's TLB organization though.
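The lookup described in the quoted explanation can be modeled with a toy simulator (a hypothetical Python sketch for illustration; real TLBs are set-associative hardware with smarter replacement policies, and real page tables are hierarchical, not a flat dict):

```python
# Toy TLB in front of a flat page table. A hit costs one lookup; a miss
# stands in for the ~3 extra memory accesses of a page-table walk.

PAGE_SIZE = 4096

class ToyTLB:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}               # virtual page -> physical page
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr, page_table):
        vpage, offset = divmod(vaddr, PAGE_SIZE)
        if vpage in self.entries:
            self.hits += 1              # cached translation: fast path
        else:
            self.misses += 1            # "walk" the page table, then cache it
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))  # evict oldest (FIFO)
            self.entries[vpage] = page_table[vpage]
        return self.entries[vpage] * PAGE_SIZE + offset
```

Two accesses within the same 4 KB page hit the cached entry; touching a new page misses and pays the walk, which is why TLB hit rate, not raw CPU usage, governs memory-access cost.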

I just read it from beginning to end, slowly and carefully. It is an informative, detailed, technical, yet somewhat simplified explanation of cache. But it contains absolutely zero information about the bug. So let me rephrase my statement from the inoffensive and unconfrontational "I don't understand" to say: I completely understand it, and while your data is good, it does not explain the actual bug in any way, shape, or form...

When I was explaining the bug and referring to the TLB as "memory addresses in cache", it was MORE than enough; your explanation is completely superfluous to the actual BUG. The bug being that incorrect access and modification rules exist for the L2 and L3 caches, leaving outdated data in the L3 copy. So if the L2 cache fills and data has to be dropped, an incorrect copy is received when it is later retrieved from L3.

My one misunderstanding was that I thought TLB stood for Translation LOCK Buffer instead of LOOKaside Buffer. So I thought the caching mechanism was called the Translation Buffer (TB) and that TLB referred to the bug specifically (which is why I expected you to be explaining it)... Seeing as I was wrong about that, I now understand you were merely explaining what a TLB is, which, again, I don't understand why people were referring to as an explanation of the bug.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: taltamir
Originally posted by: CTho9305
Originally posted by: taltamir
Which does not, in any way, shape, or form, contradict what I said... who cares about "the definition of intensive"... by your definition it's not "intensive operations" but "operations that stress that particular part of the architecture, being intensive on it but mellow overall, because intensive usually refers to 100% CPU usage, which isn't the case here and is a bad measurement anyway"... Gee whiz, tell me something I didn't know. I said intensive because it was a simple and adequate description of the issue, rather than "multiple threads with high L2 cache requirements rapidly switching various data".
You said any program is "intensive" when run at 2.4 GHz, and that's flat out wrong. There are some things that a program can do differently depending on frequency, but there are others (relevant here) that really aren't frequency dependent.
Every review I read said that, regardless of temperature, they hit a brick wall at about 2.4 GHz. The only explanation I can think of is that the bug is manifesting at those speeds. The only reason I can think that would happen is if the L2 fills, causing valid TLB entries to be dropped from it and later looked up from the L3 (where they are corrupt). Can you come up with a different explanation?

Maybe the chips just weren't that overclockable. I don't know. Given a complex system that can fail in a billion different ways, it doesn't make sense to me to conclude that the things you see are due to the one issue that you do know of. The average enthusiast/reviewer doesn't have the tools to figure out what's causing a particular symptom (e.g. distinguish an ordinary OC-related setup time failure from an issue caused by a specific bug). In this particular case, it sounds like the bug should cause a specific type of BSOD (Windows) or kernel panic (Linux) - see the "read this" link below.

If you really didn't understand my explanation of a TLB which myocardia linked to, it's probably impossible for you to understand the bug in any meaningful way.
Thank you for belittling me though. no, I really mean it.
That was a simple statement of fact. To understand what's going on, you have to at least understand what a TLB is. If you don't understand what a TLB is, you're really just repeating words and not understanding what's going on (which helps nobody).

I simplified my explanation of a TLB here pretty much to the limit while still retaining accuracy. If you can't follow that, you just don't know enough about CPUs (or maybe even programming) to understand what the bug is, how to hit it, and what needs to be changed to fix it. I was hoping you'd realize this, and stop posting your horribly incorrect information, for the benefit of everybody (and maybe even start asking questions that would get you to the point where you do understand what's going on). I did offer to try further explaining TLBs if you asked about them, but I'm not going to do it in threads where you just post misinformed speculation ("more advanced circuit design"? It really sounds like you have no idea what you're talking about. Not understanding something is fine, and I'd be happy to answer questions that are posed as questions, but it's just wrong posting threads claiming something that you don't even understand) over and over. This isn't even the only time you've made the same post in multiple threads (road-map interpretation, anyone?).

edit: From here:
The Barcelona Opterons were rock solid when we conducted the testing for our review of those chips. We did run into some stability problems with our early Phenom test systems, but we'd trace those issues back to a pre-production Asus motherboard. (emphasis mine).

Ok so are you refering to this explanation?
Disclaimer: I don't really know anything about this story beyond what's on the Inquirer/forums, and I'm not speaking for any companies.

The TLB is the "translation lookaside buffer". Background:
It used to be that if a program accessed memory location 5, the CPU really accessed physical memory location 5, and the program could access any memory location it wanted to. Programs also saw only as much memory as the computer really had (because they were accessing the physical memory directly). Modern systems use "paging". When a program accesses what it thinks is location 5, the CPU instead looks in a mapping table set up by the OS that maps the "virtual" address that the program sees to a real "physical" address that the CPU actually accesses.

That mapping table is called the "page table", because it maps memory at a "page" granularity (4KB). Along with the translation, the page table stores some permission bits that can be used to keep one program from accessing memory belonging to the OS or another program. Also, because the virtual addresses don't have to map directly to physical addresses, it's possible to make programs think a machine has more memory than it really does (when it runs out of physical memory, the OS can pick a page and swap it out to the hard drive until it's needed again...without the programs even realizing it).

Now, these mappings are pretty big, so the page table is actually hierarchical (don't worry about the details). The net result is that finding the translation from a virtual to physical address generally requires ~3 memory accesses (for 32-bit apps - it's about 2x as bad for 64-bit apps)... so to do one useful memory access, you'd need to actually do a total of 4 accesses! To make paging feasible performance-wise, the translations are cached so that they don't have to be looked up each time. This cache is the TLB.

Disabling the TLB is not an option, because the performance hit would be unreasonably large (best case, each memory access, even accesses that hit in the L1 data cache would take 4x as long). Now, modern processors actually have multiple levels of TLBs (multiple levels of cache for the page table translations, just like the L1/L2/L3 caches for data and instructions) - maybe the L2 TLB(s) could be disabled if they were buggy, but I would imagine that would have a large performance impact in some situations. I'm not familiar with Barcelona/Phenom's TLB organization though.

I just read it from beginning to end, slowly and carefully. It is an informative, detailed, technical, yet somewhat simplified explanation of cache.

Just to clarify, someone saying just "cache" is almost never talking about a TLB. This is an explanation of a very specific type of cache that is almost always referred to as a "TLB" (or maybe as a "page table cache" in rare circumstances). Don't confuse regular caches with the TLBs.

But it contains absolutely zero information about the bug. So let me rephrase my statement from the inoffensive and unconfrontational "I don't understand" to say: I completely understand it, and while your data is good, it does not explain the actual bug in any way, shape, or form...

When I was explaining the bug and referring to the TLB as "memory addresses in cache", it was MORE than enough; your explanation is completely superfluous to the actual BUG. The bug being that incorrect access and modification rules exist for the L2 and L3 caches, leaving outdated data in the L3 copy. So if the L2 cache fills and data has to be dropped, an incorrect copy is received when it is later retrieved from L3.

My one misunderstanding was that I thought TLB stood for Translation LOCK Buffer instead of LOOKaside Buffer. So I thought the caching mechanism was called the Translation Buffer (TB) and that TLB referred to the bug specifically (which is why I expected you to be explaining it)... Seeing as I was wrong about that, I now understand you were merely explaining what a TLB is, which, again, I don't understand why people were referring to as an explanation of the bug.

It wasn't an explanation of the bug. I explained what part of the chip the bug affected, and said I thought disabling the TLBs was not an option--about half of the questions the OP asked. I'm not going to explain the bug beyond the information that's already available. I was hoping an explanation of a TLB might at least get rid of some of the wilder and less-informed speculation.

There's some new information since I posted the TLB explanation. Ars Technica's post is good. The Tech Report post Ars references is also excellent and should make it pretty clear what's going on. Read this too. I believe Windows machines would produce a specific BSOD in situations where Linux machines kernel panic.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
BSOD isn't that hard... I wrote a C++ program that would BSOD anything on my third day in programming class... I wish I remembered HOW I did it, though...

It is immediately obvious to me that, unless the OS is crippling the capability of software, it CANNOT prevent badly written software from causing a kernel panic/BSOD... If it were to prevent such things, then most software would simply not work on such an OS. (And making it user-account related is not an option... it will just cause game developers and the like to explain that users need to click "allow this program"... like they do with Microsoft's new security prompts.)
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: taltamir
BSOD isn't that hard... I wrote a C++ program that would BSOD anything on my third day in programming class... I wish I remembered HOW I did it, though...

It is immediately obvious to me that, unless the OS is crippling the capability of software, it CANNOT prevent badly written software from causing a kernel panic/BSOD... If it were to prevent such things, then most software would simply not work on such an OS. (And making it user-account related is not an option... it will just cause game developers and the like to explain that users need to click "allow this program"... like they do with Microsoft's new security prompts.)

I'd doubt you could write an app to BSOD a fully patched XP SP2 box without exploiting a known security bug (e.g. doing something nobody would ever do normally).
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Well, I did... I misunderstood the way to use a pointer and did something that my teacher described as "very very bad"... I should have saved the code, condensed it to a few lines, and called it the BSOD program :)

Anyways, things people should never do? You mean like how QuickBooks used the registry to communicate between MODULES between 2001 (when Microsoft banned the practice) and 2006 (when they found out that absolutely nothing short of a complete rewrite would make their application work on Vista)? Or that thing in Titan Quest that made it BSOD (the first five patches contained 11 common crash fixes... and I remember at least one fix for a BSOD)...
It's funny, really: the Titan Quest patch notes said they fixed a situation where the game causes a BSOD (I know it does, I have seen it do so), and yet the developers of Uplink (which also causes a BSOD under certain circumstances) told me that it is absolutely impossible for a game to cause a BSOD...

Now that I think about it... I remember that SpellForce: The Order of Dawn would cause random BSODs... back in the 32-bit-only days, that game would sometimes try to access a nonexistent PAGEFILE entry in the very, VERY high numbers regardless of how much free RAM there was (or at least, that's what the BSOD screen said)... I found a workaround by manually forcing the pagefile to ridiculous sizes (I just went and flat-out set it to 4 GB of pagefile, or maybe it was 2 or 3... something of that order).
 

Mark R

Diamond Member
Oct 9, 1999
8,513
16
81
I think you misunderstand how the CPU and OS work together to prevent chaos if a program incorrectly uses a pointer, or performs some other damaging operation.

It should be possible to build an OS where application software cannot cause a BSOD/kernel panic, and modern OSs like Windows XP, Vista, Linux and MacOS X all come extremely close to this ideal. The OS maintains lists of memory addresses that the program is authorized to read, write or execute. If the program attempts to access an unauthorized address, or otherwise exceeds its authority (e.g. it uses a pointer incorrectly), the OS will spot this and terminate the program. This is standard functionality on all the above OSs.

Yes, the 'functionality' of the program is 'crippled', as it doesn't have free roam of the hardware - it can't address hardware directly because the hardware IO is outside its authorized range of addresses. If it wants memory, it calls the OS, the OS finds a chunk of VM, maps it as required, and returns the address to the application.

There is no need for this to hurt performance, because the CPU can check all this in hardware. When the OS switches execution to an application, it provides the CPU with the address/authorization list. If the CPU detects a violation, it returns control to the OS immediately with an exception message. The OS can then close down the application without a panic.

Because applications can only communicate with the OS kernel in a highly restricted manner, it is possible to check all inputs to the kernel to make sure that they are consistent.

You don't say what OS your BSOD program worked on. But I'd wager that it wasn't XP or Vista. Windows 95/98/ME was quite slack in the memory protection described above - and while most of the time a stray pointer would end up with the OS nuking the program, occasionally, the dead pointer would hit an unprotected area and bring down the OS. In XP and Vista, memory is essentially 100% protected, and an application should never ever cause a BSOD. If it does, it indicates a bug either in the OS, a driver, or the CPU. E.g. there was a bug in some Pentium CPUs that would send the OS an incorrect exception message following a certain type of invalid operation in an application - this would cause the OS to BSOD/panic.

Similarly, if you can write a C++ program that triggers a BSOD, you should submit it to Microsoft, as they regard such findings as 'highly critical' security bugs, which are investigated and patched urgently.

The problem of some application software triggering BSODs is a difficult one, as it is virtually never the application itself that is responsible. The actual BSOD would be due to a driver malfunctioning (because in XP, drivers run in kernel mode and aren't protected), but that doesn't mean the app can't be patched to avoid it - if the app is using the driver incorrectly, then a defective driver may crash. Fixing the app can mask the driver bug so that it doesn't show up.

Again, accessing memory (or VM) that doesn't exist should never cause a BSOD. While it may look as though it was the app at fault, it was more likely the way the app worked exposing a buggy driver (or other defective hardware) which was changed with different virtual memory settings.



 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
That BSOD program I wrote worked on XP SP2. It was not even a GUI program; it was a command-prompt Windows program, two screen lengths in size, and all it did was take in some basic info, do calculations, and output the results to the screen using cout. So I really doubt it had anything to do with any drivers. I did, however, run it from within Microsoft Visual Studio 2003 (the run-without-debug command)... could that make it circumvent the supposed OS protections and give it full access to anything it pleases?

And you overestimate how much I DON'T know...
I admit I'm not an expert. But I know enough, and simplification does not mean ignorance.

At the time I didn't think much of that BSOD program... I figured out what I did wrong, fixed the code, and moved on. Until you said so a minute ago, it never occurred to me to bother reporting such a thing to Microsoft.