TLB Bug fixed in B3 steppings

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: BlueAcolyte
TLB bug is gone! Fix has no real performance impact.

I really do hope for the best with AMD, but I am suspicious anytime someone says "this will affect memory performance, but there will be almost mostly hardly none at all performance impact, usually that is".

All Anand showed was that a B3 can run the WinRAR benchmark utility as unimpaired as an unpatched B2.

I'd like to see something loading all 4 cores and actually trying to use that shared L3$ in a high-frequency environment (remember this was supposed to only affect >2.4GHz K10's...) before I am going to toss this whole TLB thing into the memory shredder of my mind.
 

Killrose

Diamond Member
Oct 26, 1999
6,230
8
81
Bleh, no mention of other performance gains, so it must mean it's still the same ol' sucky Phenom we all love to hate. Bummer, I was hoping to put something decent in my AM2 socket for quite some time :(
 

Sylvanas

Diamond Member
Jan 20, 2004
3,752
0
0
Firstly, it is a 'Preview'; I assume that in usual AT style the actual 'Review' of a retail Phenom B3 will provide all the benchmarks you need to make up your mind. It's disappointing to see not much of an improvement in overclockability, but as Anand said, that will probably come with Phenom's 45nm shrink.
 

Killrose

Diamond Member
Oct 26, 1999
6,230
8
81
Sure it's a preview, but AMD knows how well it performs in other apps, even @ 2.2GHz, so I figured if they are not showing other benches to us in this preview, then all we will get is a bug fix with nothing other than that. Same poor Phenom performance, or PPP for short.

I don't want to wait. I want my kickass Phenom performance now, or KPP for short ;)
 

lopri

Elite Member
Jul 27, 2002
13,209
594
126
What's the consensus when it comes to the best motherboard for Phenom? Stability first and overclockability second. Features, layout, etc. not considered.
 

Sylvanas

Diamond Member
Jan 20, 2004
3,752
0
0
Originally posted by: lopri
What's the consensus when it comes to the best motherboard for Phenom? Stability first and overclockability second. Features, layout, etc. not considered.

770/790FX chipset. Only a few AM2 mobos support Phenom via a BIOS update, and even then it can be finicky... you are better off with the new 7-series chipsets with support out of the box. I have a write-up here on my MSI K9A2 Platinum... the other options for the high end are the M3A32-MVP Deluxe and the DFI 790FX; for the cheaper 770 boards I hear the Abit AX78 is doing well.
 

Cookie Monster

Diamond Member
May 7, 2005
5,161
32
86
Adding to what Sylvanas has said, nVIDIA is launching their AMD chipsets, the 780a/750a, within this month, so there's some food for thought. The nForce motherboards have always done a lot of good things for AMD, actually, so I hope this time they will try to make Phenoms more of a viable option with the introduction of the new nForce 7 series (MCP72). They will be the first single-chip solution from nVIDIA.

Regarding nVIDIA's AMD chipsets, I have high hopes for them, unlike their Intel variants. :)
 

Martimus

Diamond Member
Apr 24, 2007
4,488
152
106
Originally posted by: lopri
What's the consensus when it comes to the best motherboard for Phenom? Stability first and overclockability second. Features, layout, etc. not considered.

The MSI 7series boards are known to have poor support, but the ASUS and Gigabyte boards are working pretty well. There is extensive discussion on each board on xtreme systems.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Actually, notice that the B3 Phenom is slightly faster than the B2 without the BIOS fix. They used WinRAR because it was the biggest sufferer from the BIOS fix, losing 72.8% of its speed.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: taltamir
Actually, notice that the B3 Phenom is slightly faster than the B2 without the BIOS fix. They used WinRAR because it was the biggest sufferer from the BIOS fix, losing 72.8% of its speed.

Not sure how much benching you do with the WinRAR utility but it is not the bastion of repeatability. Anything that is in the same "ballpark" merely means equivalence.

You can't take a Phenom B2, or any chip that uses multi-level cache, and hard-wire it to force cache-evictions and suddenly get higher performance. It simply doesn't work that way.

At best you will get no performance degradation for those workloads which aren't affected by forced cache evictions, but you won't be able to avoid a performance penalty on 100% of the applications. There will be applications that suffer from having forced cache evictions.

If this weren't true then forced cache evictions would already be the norm. It's not like AMD stumbled on the idea of working around the TLB issue and suddenly realized the industry had been using cache structures incorrectly all this time.

All we can conclude at this time is that WinRAR's benchmark utility does not appear to suffer a performance hit by having forced cache evictions. Nothing more can be said.
 

v8envy

Platinum Member
Sep 7, 2002
2,720
0
0
If memory serves these CPUs have exclusive caches. And it's not so much an eviction as moving of pages from processor specific L2 to shared L3 cache -- where they are available to another core where otherwise they might not have been. A demo application could be written to show better performance in this scenario. But it wouldn't be easy.

That said, yeah, there can be only one case of an application benefiting from premature cache evictions -- a monolithic block of code with no loops or function calls. In a blue moon scenario the time saved by possibly quicker tag lookups (assuming that happens) might somehow yield almost measurable performance gains.

As for the B3 Phenom -- too bad re: no higher clocks or > 1.8 'FSB'. I think everyone's over the whole TLB bug thing, and the main barriers standing in the way of mass Phenom adoption are clock speed & overclockability, third core quality and scaling with higher clock rates. Guess we wait for the 45nm part before considering the Phenom again.

That, or a price drop to ~$100 for the quad.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Idontcare
Originally posted by: taltamir
Actually, notice that the B3 Phenom is slightly faster than the B2 without the BIOS fix. They used WinRAR because it was the biggest sufferer from the BIOS fix, losing 72.8% of its speed.

Not sure how much benching you do with the WinRAR utility but it is not the bastion of repeatability. Anything that is in the same "ballpark" merely means equivalence.
Nobody publishes standard deviations, so we don't know confidence intervals :(

You can't take a Phenom B2, or any chip that uses multi-level cache, and hard-wire it to force cache-evictions and suddenly get higher performance. It simply doesn't work that way.

Actually, if hitting in the L3 is faster than hitting in another processor's L2, it could improve performance, couldn't it? I hadn't thought about that before. It shouldn't be too hard for a good programmer to write a test program that thrashes the TLB from multiple threads.
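A minimal sketch of the kind of test program described above: each thread reads one byte per 4 KiB page of a large buffer, so nearly every access needs a distinct address translation. All names here are made up for illustration, and CPython's GIL serializes the threads, so this only demonstrates the access pattern; a real concurrent TLB-pressure test would be written in C.

```python
import threading
import time

PAGE = 4096        # typical x86 page size
PAGES = 20_000     # ~80 MB of address space to walk
THREADS = 4

buf = bytearray(PAGE * PAGES)

def thrash(offset, results, idx):
    """Read one byte per page; a stride of PAGE defeats TLB-entry reuse."""
    total = 0
    t0 = time.perf_counter()
    for page in range(offset, PAGES, THREADS):
        total += buf[page * PAGE]
    results[idx] = time.perf_counter() - t0

results = [None] * THREADS
workers = [threading.Thread(target=thrash, args=(n, results, n))
           for n in range(THREADS)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print("per-thread walk times (s):", [round(r, 4) for r in results])
```

Comparing the walk time against a version that reads the buffer sequentially (hitting each page many times before moving on) would show the cost of the translation misses.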
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: CTho9305
Actually, if hitting in the L3 is faster than hitting in another processor's L2, it could improve performance, couldn't it? I hadn't thought about that before. It shouldn't be too hard for a good programmer to write a test program that thrashes the TLB from multiple threads.

This is a good point. Hyperthreading spent a couple of years in the doghouse because its corner conditions of thread thrashing actually resulted in performance degradation. So I have no doubt that anything which gets thrashed around (when Windows moves threads from core to core to core) will actually benefit from having its prior L2 data evicted to the L3 as Windows moves the thread to another core: then, when the thread accesses its own data (otherwise stuck in the other core's L2), it at least only has to hit the L3.

Thread thrashing on my Intel quads is awful under WinXP. When I run four single-threaded applications which saturate a CPU core each I will see upwards of a 15% performance penalty from thread thrashing unless I lock the affinity for each thread.
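For reference, "locking the affinity" looks something like the sketch below, using Python's Linux-only os.sched_setaffinity (on Windows the analogue is SetProcessAffinityMask / SetThreadAffinityMask from the Win32 API). This is an illustrative sketch, not the tool actually used above.

```python
import os

# Linux-only: query the CPU-affinity mask of this process (pid 0 = self).
available = os.sched_getaffinity(0)
print("allowed CPUs before:", sorted(available))

# Pin to a single core so the scheduler cannot migrate the process,
# keeping its cached data and branch-predictor state on that core.
os.sched_setaffinity(0, {min(available)})
print("allowed CPUs while pinned:", sorted(os.sched_getaffinity(0)))

# Restore the original mask so nothing else is affected.
os.sched_setaffinity(0, available)
```

Running one pinned process per core is the programmatic equivalent of setting affinity by hand in Task Manager.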
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
It is possible that it does degrade performance slightly, but that there is a performance improvement from some optimizations that have entered the core since B2... unless the TLB is the ONLY thing that was changed.

And then there are all the other suggestions...

That being said, it would be nice to see a full suite of tests testing more than just one app.
 
soccerballtux

Dec 30, 2004
12,554
2
76
Originally posted by: Idontcare
Originally posted by: CTho9305
Actually, if hitting in the L3 is faster than hitting in another processor's L2, it could improve performance, couldn't it? I hadn't thought about that before. It shouldn't be too hard for a good programmer to write a test program that thrashes the TLB from multiple threads.

This is a good point. Hyperthreading spent a couple of years in the doghouse because its corner conditions of thread thrashing actually resulted in performance degradation. So I have no doubt that anything which gets thrashed around (when Windows moves threads from core to core to core) will actually benefit from having its prior L2 data evicted to the L3 as Windows moves the thread to another core: then, when the thread accesses its own data (otherwise stuck in the other core's L2), it at least only has to hit the L3.

Thread thrashing on my Intel quads is awful under WinXP. When I run four single-threaded applications which saturate a CPU core each I will see upwards of a 15% performance penalty from thread thrashing unless I lock the affinity for each thread.

Wow that is very interesting!
 

VirtualLarry

No Lifer
Aug 25, 2001
56,343
10,046
126
Originally posted by: soccerballtux
Wow that is very interesting!
Probably the same factor is at work in why using Affinity Changer for F@H gives better PPD: it pins down the threads and prevents cache thrashing.

 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: VirtualLarry
Originally posted by: soccerballtux
Wow that is very interesting!
Probably the same factor is at work in why using Affinity Changer for F@H gives better PPD: it pins down the threads and prevents cache thrashing.

It's not just caches - you throw away all of your branch prediction history when you move a thread across processors.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: CTho9305
Originally posted by: VirtualLarry
Originally posted by: soccerballtux
Wow that is very interesting!
Probably the same factor is at work in why using Affinity Changer for F@H gives better PPD: it pins down the threads and prevents cache thrashing.

It's not just caches - you throw away all of your branch prediction history when you move a thread across processors.

I never understood why WindowsXP moves threads around from core to core.

The *only* benefit I can gather from watching it happen is that it reduces the average/peak temperature any given core sees so if you have your HSF fanspeed dynamically controlled by temp then it doesn't get as loud because no one core is pegged at 100% until all cores in the system are fully loaded.

But why would Microsoft ever care about such things as average core temp or system noise?

So there must be some other good reason why Microsoft has WinXP migrate threads every few milliseconds... has anyone ever come across an explanation? And does Vista do it too? What about Windows Server editions? Are they any more intelligent about avoiding thread thrashing?
 

VirtualLarry

No Lifer
Aug 25, 2001
56,343
10,046
126
IDC, have you tested XP with the MS multicore patch V4 (or newer)? Supposedly that helps with the thread thrashing, so I've read.
 

Idontcare

Elite Member
Oct 10, 1999
21,118
58
91
Originally posted by: VirtualLarry
IDC, have you tested XP with the MS multicore patch V4 (or newer)? Supposedly that helps with the thread thrashing, so I've read.

If it isn't something that gets taken care of by Microsoft's auto-update then no, I haven't tested it.

If Microsoft isn't confident enough in it to make it part of their auto-update KBs then I wouldn't put it on my computers anyway. I have enough troubles as it is without inviting new ones that even Microsoft's QC isn't willing to stand behind.

But if this is something they are "patching" on XP, then I take it this is no longer an issue on Vista? Anyone care to pop open Task Manager on a dual-core or quad-core and watch the CPU loads while you run a single-threaded app? Does it peg just one core at 100% utilization, or do you see all cores getting hit with the average utilization number (50% on dual-core, 25% on quad)?
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Originally posted by: Idontcare
Originally posted by: VirtualLarry
IDC, have you tested XP with the MS multicore patch V4 (or newer)? Supposedly that helps with the thread thrashing, so I've read.

If it isn't something that gets taken care of by Microsoft's auto-update then no, I haven't tested it.

If Microsoft isn't confident enough in it to make it part of their auto-update KBs then I wouldn't put it on my computers anyway. I have enough troubles as it is without inviting new ones that even Microsoft's QC isn't willing to stand behind.

But if this is something they are "patching" on XP, then I take it this is no longer an issue on Vista? Anyone care to pop open Task Manager on a dual-core or quad-core and watch the CPU loads while you run a single-threaded app? Does it peg just one core at 100% utilization, or do you see all cores getting hit with the average utilization number (50% on dual-core, 25% on quad)?

I don't know if it means anything, but I do know that the scheduler on Vista supports NUMA while the one on XP does not...
 

bryanW1995

Lifer
May 22, 2007
11,144
32
91
Originally posted by: soccerballtux
Originally posted by: Idontcare
Originally posted by: CTho9305
Actually, if hitting in the L3 is faster than hitting in another processor's L2, it could improve performance, couldn't it? I hadn't thought about that before. It shouldn't be too hard for a good programmer to write a test program that thrashes the TLB from multiple threads.

This is a good point. Hyperthreading spent a couple of years in the doghouse because its corner conditions of thread thrashing actually resulted in performance degradation. So I have no doubt that anything which gets thrashed around (when Windows moves threads from core to core to core) will actually benefit from having its prior L2 data evicted to the L3 as Windows moves the thread to another core: then, when the thread accesses its own data (otherwise stuck in the other core's L2), it at least only has to hit the L3.

Thread thrashing on my Intel quads is awful under WinXP. When I run four single-threaded applications which saturate a CPU core each I will see upwards of a 15% performance penalty from thread thrashing unless I lock the affinity for each thread.

Wow that is very interesting!

If this weren't a tech site I would swear that you were being sarcastic ;) When I read about getting thrashed I think of Vanderbilt-Kentucky or Tiger Woods vs everyone else...