AMD Phenom TLB bug is really just a software bug in VMWare?

VirtualLarry · Sep 24, 2008

More comments from the same guy that claimed Shanghai was going to have a 4GHz commercial SKU. (Prior thread)

From this thread

From a PM he wrote to me:

I see that you have been corrected by the retired senior technical advisor of IBM (whiteboi) and now another has seen to say BS. If you had read Mike's paper you would have known that Mike talked about Intel's version of this problem in IA32. He projected it for the MS coded Virtualization software under AMD-V since none of the MS codes other than MS's proprietary code enable dynamic rather than binary implementation under the Popek/Goldberg Principles. Gerry Popek criticized VMware for as he said "screwing up the process".

AMD-V was originally developed as part of DARPA's HEC. The spec design calls for the extensions to be 100% compatible with UNICOS/SUSE Linux based on the following:

NSA, DOE/SC (SNL and ORNL): Continue cooperative development of Black Widow and Red Storm systems, leading to introduction in 2006 of a new generation of these systems.

So the Phenom went into the Black Widow and the virtualization ran straight out of the box with no issues for SUSE and Debian based Linux. By definition then if it ran to spec under the DARPA standards it had no flaw. The flaw was a bunch of questionable software code written by others not following the IEEE spec. There is a hiccup on Red Hat due to the rpm structure.

There is also a discussion of the problems here: U of ILL C/A. I trust that you will be competent enough to find it. And you might try Dr Katherine Yelick's follow up to this paper. If you, like Jackone, are fool enough to doubt her credentials: Bio

Remember this:
" The National Energy Research Scientific Computing Center and DARPA have both done projects that demonstrate that the TLB issue is a VMware/Windows problem only (if were truly the cpu it would affect Linux and Unix as well).

You're on drugs. The VM bug is a genuine CPU errata. Why else do you think that AMD made a fix? The fact is that Linux and Unix implemented workarounds, not that there is no bug. "

Now tell me if you are a Summa Cum Laude Phi Beta Kappa graduate of MIT or Dumb Kid with an overinflated ego too high of an opinion of himself.

The other thing is you must be in High School because every one in freshman English learns to support their writing by appropriate references and citations. I didn't see a single source anywhere in your postings.

By the way I am the retired Project Director for Red Storm at Sandia National Labs and was there for the development of AMD-V. I have had my regular drug test along with every federal employee for the last 20 years. What about you?

Where to start... first of all, this is the supposed "correction" post by whiteboi:

Originally posted by: WhiteBoi

Originally posted by: nsdp
.. is really no more valid than people who claim Obama is Muslim.


According to an authority in FL, he is a MUSLIN.

Yeah, highly technical at that. The fact that he claims whiteboi even agreed with him, or corrected me, with that post, suggests that he is delusional and detached from reality, as that post contained nothing technical, and nothing to suggest that I was incorrect.

Second, is this even remotely true - that AMD would spend R&D money on a chip respin, just to work around a software bug in VMWare under Windows? That seems just inconceivable to me.

Perhaps the Unix code paths never exercise that particular bug, but the Windows code does. That's my theory. I just have a hard time believing that these ivory-tower types deny the AMD TLB bug exists. It's like denying the Holocaust. Well, sort of.

Would AMD endure a chip re-spin just to be able to wash away the bad publicity from the bug?

taltamir · Sep 24, 2008

The first post is obviously multiple people saying things, all bundled together as one quote... can you separate it?
The muslin link seems to be a joke.

Idontcare · Sep 24, 2008

The word to describe this individual is zealot.

VirtualLarry · Sep 24, 2008

Originally posted by: taltamir
The first post is obviously multiple people saying things, all bundled together as one quote... can you separate it?
The muslin link seems to be a joke.

No, it's not multiple people, it's all one PM. The Italicized part is where he quoted my comment.

I agree, the muslin thing was a funny reply; my point was that it had nothing to do with correcting me.

myocardia · Sep 25, 2008

Larry, why do you keep trying to argue technical information about extremely technical things, like CPU architecture, with this idiot who:

A) seems to know extremely little about even the basic functions of computers/CPU's
B) obviously isn't even a member of a computer hardware forum
C) has never said anything (at least on fatwallet.com) that's even remotely true?

Here's what to reply to said chump. First, completely ignore any PM's he sends you. If he doesn't get a reply, he'll stop sending them. If he ever mentions something about you not responding to his PM's in a thread there, tell him (in the thread) that you do all of your discussions out in the open, in threads.

In the thread you linked, tell him:

1) to post his IEEE membership #, which I can assure you he doesn't have.
2) in response to this: "And if it is hard ware related tell me why Ranger and all the Sun Micro units didn't need the fix.", tell him that neither Sun's systems, nor Ranger, use any type of VMWare, hence they need no fix.
3) re: this statement, "The National Energy Research Scientific Computing Center and DARPA have both done projects that demonstrate that the TLB issue is a VMware/Windows problem only (if were truly the cpu it would affect Linux and Unix as well).", post this link, which shows he's an outright liar:
the thread you linked from fatwallet.com and this thread are the only two links on the internet that say a single thing about what he's claiming
4) and finally, tell him that I have invited the moron to come here and discuss how little he knows about computers and CPU's. Feel free to give him a link to this thread, if you'd like.

VirtualLarry · Sep 25, 2008

I posted some replies, sourced with references, to the thread on FW. That should hopefully shut him up.

myocardia · Sep 25, 2008

Originally posted by: VirtualLarry
I posted some replies, sourced with references, to the thread on FW. That should hopefully shut him up.

You have a PM.

SickBeast · Sep 25, 2008

I would suggest that it is within the specter of possibility that VMWare is to blame for the TLB errata (along with Microsoft, potentially).

Are all employees of AMD and VMWare bound by an NDA? If so, it is quite possible that a single programmer at VMWare was bought off by someone at Intel.

If AMD goes broke, I wonder if they could be resurrected retrospectively once all of Intel's anti-competitive behavior comes to light.

myocardia · Sep 25, 2008

Originally posted by: SickBeast
I would suggest that it is within the specter of possibility that VMWare is to blame for the TLB errata (along with Microsoft, potentially).

Almost anything is possible, but since the Phenom also had that same TLB errata with Microsoft's virtual machine, it becomes a lot less likely, don't you think? I would guess that Microsoft actually has enough employees that if it were actually a bug in their code, it would have been caught (not beforehand, obviously), and patched.

pm · Sep 25, 2008

I must be missing something obvious here. The TLB issue - errata 298 - is a known issue. AMD's issued statements about it. There's a microcode patch for it, there's a Linux kernel patch. He's saying that this errata doesn't exist? That the guys at VMWare just wrote bad code? Bad code caused a cache coherency errata in the L2 -> L3 cache? And as proof he's listing all the things that he's seen done with the hardware that doesn't result in a cache coherency failure? That kind of misses the point of functional validation.

SickBeast · Sep 25, 2008

Originally posted by: myocardia

Originally posted by: SickBeast
I would suggest that it is within the specter of possibility that VMWare is to blame for the TLB errata (along with Microsoft, potentially).


Almost anything is possible, but since the Phenom also had that same TLB errata with Microsoft's virtual machine, it becomes a lot less likely, don't you think? I would guess that Microsoft actually has enough employees that if it were actually a bug in their code, it would have been caught (not beforehand, obviously), and patched.

As an early adopter of Vista64, I have a more cynical attitude toward MS at this point.

IMO the Win-Tel alliance is still formidable.

myocardia · Sep 25, 2008

Originally posted by: pm
I must be missing something obvious here. The TLB issue - errata 298 - is a known issue. AMD's issued statements about it. There's a microcode patch for it, there's a Linux kernel patch. He's saying that this errata doesn't exist? That the guys at VMWare just wrote bad code? Bad code caused a cache coherency errata in the L2 -> L3 cache? And as proof he's listing all the things that he's seen done with the hardware that doesn't result in a cache coherency failure? That kind of misses the point of functional validation.

Umm, an EE like yourself would know these things, but that guy isn't an EE, even though he's trying to make whoever reads the thread think that he's an EE. Pretty sad, isn't it?

Originally posted by: SickBeast
As an early adopter of Vista64, I have a more cynical attitude toward MS at this point.

IMO the Win-Tel alliance is still formidable.

You really think that a company who cares as much about its image as M$ does would risk it over something as petty as this? Anyway, conspiracy theories aside, why would AMD actually have stopped shipping B2 Barcelonas, if there were in fact no hardware TLB errata?

OCGuy · Sep 25, 2008

And AMD isn't really in tons of debt, it is just an illusion, perpetuated by M$, Intel, George W Bush, Die-Bold, the NSA, CIA, and the Flying Spaghetti Monster.

jones377 · Sep 25, 2008

It is my experience that every forum has a guy like that.. so please don't invite him here

soccerballtux · Sep 25, 2008

Originally posted by: jones377
It is my experience that every forum has a guy like that.. so please don't invite him here

Yes we already have dmcowen thank you.

Zstream · Sep 25, 2008

I have not seen the TLB except in VMWARE. I do quite a bit of VMware/virtual pc.

Umm Microsoft makes virtual pc not vmware fyi.

VirtualLarry · Sep 25, 2008

Originally posted by: pm
I must be missing something obvious here. The TLB issue - errata 298 - is a known issue. AMD's issued statements about it. There's a microcode patch for it, there's a Linux kernel patch. He's saying that this errata doesn't exist? That the guys at VMWare just wrote bad code? Bad code caused a cache coherency errata in the L2 -> L3 cache? And as proof he's listing all the things that he's seen done with the hardware that doesn't result in a cache coherency failure? That kind of misses the point of functional validation.

I think you nailed it there.

pm · Sep 25, 2008

I would imagine that my comment about missing the point of functional validation might be too much of an Intel-ism.

Functional validation is the double-checking of the microarchitectural implementation of a given CPU by running content on the actual chip. During the design phase of a CPU, logic designers and validation engineers write small "tests" of assembly code to check chip logical functionality for correctness against the architectural model. So you have a blueprint - the architectural specification for the CPU - embodied in a so-called "architectural simulator" (which doesn't have a concept of clock frequency, or caches, or any other low-level microarchitectural features), and then engineers run little chunks of assembly code on a microarchitectural model and compare against this architectural simulator to check that the results match.

Once the chip tapes out and you have real silicon in your hands, you can then directly test the chip. This is called "functional validation" - you are validating that the chip functions like it's supposed to. For this, in my experience you use a suite of existing tests (again, snippets of assembly or machine code) from previous designs and often "RCGs" or random code generators are used, and then on other systems you have them boot to Windows or Linux and run programs and benchmarks and wait for a bluescreen. For the RCGs, you get a whole bunch of systems and then you put your recently designed CPU in them and run something like "brancher" which would spend its days doing branches, and "memtraffic" which just does reads and writes to and from memory. You turn on your systems and put all of these RCGs on them looking for interesting things. When you find a failure - which is something that doesn't do what the architectural simulator says that it should - then you debug it. This stage is a long and arduous affair and it can often have a lot of ups and downs (you either worry that you aren't finding enough bugs and thus aren't looking hard/well enough, or you worry that you have too many bugs and you are going to miss your release schedule).
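As a toy illustration of that compare-against-the-simulator loop, here is a minimal Python sketch. Everything in it is made up for illustration: the three-operation instruction set, the 8-bit accumulator, the injected bug, and the names `reference_model`, `device_under_test`, `random_program` and `find_failures`. Real validation runs machine code on silicon against a far richer architectural simulator.

```python
import random

MASK = 0xFF  # hypothetical 8-bit accumulator


def reference_model(program):
    """Toy 'architectural simulator': defines the correct result, no timing or caches."""
    acc = 0
    for op, val in program:
        if op == "add":
            acc = (acc + val) & MASK
        elif op == "sub":
            acc = (acc - val) & MASK
        elif op == "xor":
            acc ^= val
    return acc


def device_under_test(program):
    """Stand-in for the real chip, with a deliberately injected corner-case bug."""
    acc = 0
    for op, val in program:
        if op == "add":
            acc = (acc + val) & MASK
            if acc == 0xFF:
                # Injected bug (an assumption for this sketch): an add whose
                # result is exactly 0xFF comes out as 0xFE instead.
                acc = 0xFE
        elif op == "sub":
            acc = (acc - val) & MASK
        elif op == "xor":
            acc ^= val
    return acc


def random_program(rng, length=20):
    """Toy random code generator: a random sequence of (operation, operand) pairs."""
    return [(rng.choice(["add", "sub", "xor"]), rng.randrange(256))
            for _ in range(length)]


def find_failures(seed, trials):
    """Run random programs on both models and collect every mismatch for debugging."""
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        program = random_program(rng)
        if reference_model(program) != device_under_test(program):
            failures.append(program)
    return failures
```

A hand-written program such as `[("add", 1), ("add", 2)]` gives the same answer on both models. Only inputs that land on the corner case expose the mismatch, which is why you throw large volumes of random content at the part instead of relying on a few well-behaved programs.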

The point is functional validation is hard - it's not like you take the chips and give them to Sandia or the DoE and say "ok, guys, run your nastiest stuff on these new chips and if they bluescreen or something hand them back to us". It's not enough to run just the brainiest program in the world - because if you have an errata in the way that you do, say, multithreaded cache coherency and your brainiest program in the world doesn't do cache writebacks of data in large enough datasets to hit the errata, they would never see it even on these really difficult nuclear simulations. And likewise for writebacks that cross cacheline boundaries - a well-written program wouldn't do this because it's not going to run efficiently, and would thus miss this interesting corner case. A well-written chunk of code for something like a Laplace transform is designed never to fill queues completely (because it wouldn't run as efficiently) and yet, errata often occur on corner cases like FIFO overflows. So a really efficient simulation program is probably not that great at checking things for functional validation because it's designed to run very efficiently. But still, this doesn't make these inefficient programs "wrong" or bad - they are legitimate architecturally valid instructions... just not running super-efficiently on a given CPU. So, it's not fair/right/proper to blame a software vendor for hitting a functional bug unless they were doing something that violated the architectural specification, even if the way they hit it was because their code wasn't as tight as it could be.
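To make the FIFO corner case concrete, here is a toy Python sketch. It is an assumption-laden illustration, not a model of any real CPU queue: `BuggyFifo`, its overwrite-when-full behavior, and the `run` helper are all hypothetical. The "efficient" workload drains the queue as fast as it fills it and never sees a problem; the "sloppy" but perfectly legal workload fills the queue and loses data.

```python
from collections import deque


class BuggyFifo:
    """Hypothetical hardware queue whose bug only appears when it is completely full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def push(self, item):
        if len(self.items) == self.capacity:
            # Injected erratum (an assumption for this sketch): instead of
            # stalling the producer, a push to a full queue silently
            # overwrites the newest entry.
            self.items[-1] = item
            return
        self.items.append(item)

    def pop(self):
        return self.items.popleft()


def run(pushes_per_pop, total, capacity=4):
    """Push `total` items, popping one item after every `pushes_per_pop` pushes,
    then drain the queue. Returns the items received, in order."""
    fifo = BuggyFifo(capacity)
    received = []
    for i in range(total):
        fifo.push(i)
        if (i + 1) % pushes_per_pop == 0:
            received.append(fifo.pop())
    while fifo.items:
        received.append(fifo.pop())
    return received


# Well-tuned workload: the queue never holds more than one item.
efficient = run(pushes_per_pop=1, total=16)

# Legal but untuned workload: the queue fills up and hits the corner case.
sloppy = run(pushes_per_pop=8, total=16)
```

Here `efficient` comes back intact while `sloppy` is missing entries, even though both workloads only use operations the queue is supposed to support. The fault is in the queue, not in the workload that happened to fill it.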

So that's what I meant by "missing the point of functional validation".

All that said, for what it's worth, VirtualLarry, I thought you shouldn't have posted a Private Message from another forum here without the author's permission. Paraphrase it, sure, but I've never been pleased to see a Private Message that I send someone posted somewhere by someone. Maybe I'm "old school" (I'm certainly "old") but there's a bit of an expectation of privacy in a PM and I often don't phrase things the way that I would have if I knew that the world was going to be examining my words closely rather than just one person. Just my $0.02.

* As always, I'm not speaking for Intel Corp. *

SickBeast · Sep 26, 2008

Originally posted by: myocardia
You really think that a company who cares as much about its image as M$ does would risk it over something as petty as this? Anyway, conspiracy theories aside, why would AMD actually have stopped shipping B2 Barcelonas, if there were in fact no hardware TLB errata?

Maybe they had to. It's not like they can force VMWare and MS to play nice. Perhaps Intel had their own way of doing things. It wouldn't be the first time this type of thing has happened. USB anyone?

taltamir · Sep 28, 2008

SickBeast, it is impossible for this to be a VMWare issue, there is too much specifically known about it. THE EXACT issue is known, publicized, and explained. Which allowed the Linux community to patch it, and allowed motherboard makers to make BIOS workarounds. You couldn't work around it if it wasn't a real problem.
Virtual machines are just known to trigger it reliably, but it is theoretically possible for other software to cause it.
