TLB Bug fixed in B3 steppings


imported_SLIM

Member
Jun 14, 2004
176
0
0
Originally posted by: Idontcare
Originally posted by: VirtualLarry
IDC, have you tested XP with the MS multicore patch V4 (or newer)? Supposedly that helps with the thread thrashing, so I've read.

If it isn't something that gets taken care of by Microsoft's auto-update then no, I haven't tested it.

If Microsoft isn't confident enough in it to make it part of their auto-update KBs then I wouldn't put it on my computers anyway. I have enough troubles as it is without inviting new ones that even Microsoft's QC isn't willing to stand behind.

But if this is something they are "patching" on XP, then I take it this is no longer an issue on Vista? Anyone care to pop open task manager on a dualcore or a quadcore and watch the CPU loads while you run a single-threaded app? Does it peg just one core at 100% utilization or do you see all cores getting hit with the average utilization number (50% on dual-core, 25% on quad)?

Vista Home Premium 32-bit on a Core 2 laptop says MS is still load sharing when running a single-threaded workload in Prime95 25.6. CPU usage stays at ~50%.
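
For anyone wanting to sanity-check the 50% / 25% figures asked about above, here is a minimal sketch of the arithmetic. It assumes an otherwise idle machine and jobs that each saturate exactly one core's worth of CPU time; the function name `average_utilization` is made up for illustration.

```python
def average_utilization(busy_threads: int, cores: int) -> float:
    """Overall CPU usage (percent) reported when `busy_threads` single-threaded
    jobs each consume one core's worth of CPU time on a `cores`-core machine."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    # The jobs cannot consume more CPU time than the cores can supply.
    busy = min(busy_threads, cores)
    return 100.0 * busy / cores


for cores in (2, 4):
    print(f"{cores} cores: {average_utilization(1, cores):.0f}% overall")
```

Note that the overall figure comes out the same whether the thread stays on one core or gets bounced between cores, so it is the per-core graphs (one core pegged versus all cores partly loaded) that show which of the two is happening.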
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: SLIM
Originally posted by: Idontcare
Originally posted by: VirtualLarry
IDC, have you tested XP with the MS multicore patch V4 (or newer)? Supposedly that helps with the thread thrashing, so I've read.

If it isn't something that gets taken care of by Microsoft's auto-update then no, I haven't tested it.

If Microsoft isn't confident enough in it to make it part of their auto-update KBs then I wouldn't put it on my computers anyway. I have enough troubles as it is without inviting new ones that even Microsoft's QC isn't willing to stand behind.

But if this is something they are "patching" on XP, then I take it this is no longer an issue on Vista? Anyone care to pop open task manager on a dualcore or a quadcore and watch the CPU loads while you run a single-threaded app? Does it peg just one core at 100% utilization or do you see all cores getting hit with the average utilization number (50% on dual-core, 25% on quad)?

Vista Home Premium 32-bit on a Core 2 laptop says MS is still load sharing when running a single-threaded workload in Prime95 25.6. CPU usage stays at ~50%.

Thanks SLIM! Answers my question. Here's a :cookie: for the help!
 
Dec 30, 2004
12,553
2
76
Originally posted by: bryanW1995
Originally posted by: soccerballtux
Originally posted by: Idontcare
Originally posted by: CTho9305
Actually, if hitting in the L3 is faster than hitting in another processor's L2, it could improve performance, couldn't it? I hadn't thought about that before. It shouldn't be too hard for a good programmer to write a test program that thrashes the TLB from multiple threads.

This is a good point. Hyperthreading spent a couple years in the doghouse because its corner conditions of thread thrashing actually resulted in performance degradation so I have no doubt that anything which gets thrashed around (when Windows moves threads from core to core to core) will actually benefit in having its prior L2 data evicted to the L3 as windows moves the thread to another core...as then when the thread accesses its own data (otherwise stuck on the other core's L2) at least it only has to hit the L3.

Thread thrashing on my Intel quads is awful under WinXP. When I run four single-threaded applications which saturate a CPU core each I will see upwards of a 15% performance penalty from thread thrashing unless I lock the affinity for each thread.

Wow that is very interesting!

If this wasn't a tech site I would swear that you were being sarcastic ;) When I read about getting thrashed I think of vanderbilt-kentucky or tiger woods vs everyone else...

Particularly the 15% lost performance.
 
Dec 30, 2004
12,553
2
76
Originally posted by: Viditor
Originally posted by: Idontcare
Originally posted by: VirtualLarry
IDC, have you tested XP with the MS multicore patch V4 (or newer)? Supposedly that helps with the thread thrashing, so I've read.

If it isn't something that gets taken care of by Microsoft's auto-update then no, I haven't tested it.

If Microsoft isn't confident enough in it to make it part of their auto-update KBs then I wouldn't put it on my computers anyway. I have enough troubles as it is without inviting new ones that even Microsoft's QC isn't willing to stand behind.

But if this is something they are "patching" on XP, then I take it this is no longer an issue on Vista? Anyone care to pop open task manager on a dualcore or a quadcore and watch the CPU loads while you run a single-threaded app? Does it peg just one core at 100% utilization or do you see all cores getting hit with the average utilization number (50% on dual-core, 25% on quad)?

I don't know if it means anything, but I do know that the scheduler on Vista supports NUMA while the one on XP does not...

NUMA NUMA NUMAI YAY?
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: Viditor
Originally posted by: Idontcare
Originally posted by: VirtualLarry
IDC, have you tested XP with the MS multicore patch V4 (or newer)? Supposedly that helps with the thread thrashing, so I've read.

If it isn't something that gets taken care of by Microsoft's auto-update then no, I haven't tested it.

If Microsoft isn't confident enough in it to make it part of their auto-update KBs then I wouldn't put it on my computers anyway. I have enough troubles as it is without inviting new ones that even Microsoft's QC isn't willing to stand behind.

But if this is something they are "patching" on XP, then I take it this is no longer an issue on Vista? Anyone care to pop open task manager on a dualcore or a quadcore and watch the CPU loads while you run a single-threaded app? Does it peg just one core at 100% utilization or do you see all cores getting hit with the average utilization number (50% on dual-core, 25% on quad)?

I don't know if it means anything, but I do know that the scheduler on Vista supports NUMA while the one on XP does not...

Don't quote me on this but I believe NUMA for a Windows box really only comes into play when you have >1 socket or >1 memory controller at play.

Last time I dealt explicitly with NUMA was when we were building clusters of computers (beowulfs) where the interprocessor communications were handled via ethernet...so each "box" was most definitely blind to the contents of the other computer's ram and cache.

This was 6 years ago, so a lot has probably changed since then. I should read up more on it.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: soccerballtux
Originally posted by: bryanW1995
Originally posted by: soccerballtux
Wow that is very interesting!

If this wasn't a tech site I would swear that you were being sarcastic ;) When I read about getting thrashed I think of vanderbilt-kentucky or tiger woods vs everyone else...

Particularly the 15% lost performance.

:confused: I'm not following... Is this an inside joke thing? No need to expound if it is, I'm content to just move along...
 
Dec 30, 2004
12,553
2
76
Originally posted by: Idontcare
Originally posted by: soccerballtux
Originally posted by: bryanW1995
Originally posted by: soccerballtux
Wow that is very interesting!

If this wasn't a tech site I would swear that you were being sarcastic ;) When I read about getting thrashed I think of vanderbilt-kentucky or tiger woods vs everyone else...

Particularly the 15% lost performance.

:confused: I'm not following... Is this an inside joke thing? No need to expound if it is, I'm content to just move along...

He said if it weren't a tech site he'd think I was trolling/being sarcastic. 15% performance loss to thread-cache-thrashing was what I found the most interesting.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: soccerballtux
He said if it weren't a tech site he'd think I was trolling/being sarcastic. 15% performance loss to thread-cache-thrashing was what I found the most interesting.

Ah, I'm with you now. Yep, I'm sure the performance penalty gets worse the more your application of interest "fits inside" the non-shared cache.

What I find truly odd is that, for such an obvious way to lose performance, Microsoft hasn't made any progress in making the OS more performance-savvy when handling thread allocation across multiple cores.
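
As a rough illustration of what "locking the affinity" means in practice, here is a minimal sketch. It is an assumption-laden stand-in: it uses the Linux scheduler call exposed by Python's `os` module, not the Windows affinity setting discussed in this thread, and the helper name `pin_to_core` is made up for the example.

```python
import os


def pin_to_core(core: int, pid: int = 0) -> set:
    """Restrict a process (pid 0 = the calling process) to a single core.

    Linux-only: os.sched_setaffinity does not exist on Windows or macOS.
    Returns the affinity set that is in effect afterwards.
    """
    if not hasattr(os, "sched_setaffinity"):
        raise NotImplementedError("os.sched_setaffinity is not available here")
    allowed = os.sched_getaffinity(pid)
    if core not in allowed:
        raise ValueError(f"core {core} is not in the allowed set {sorted(allowed)}")
    # After this call the scheduler may only run the process on `core`,
    # so it can no longer be migrated between cores.
    os.sched_setaffinity(pid, {core})
    return os.sched_getaffinity(pid)


if __name__ == "__main__" and hasattr(os, "sched_getaffinity"):
    first = min(os.sched_getaffinity(0))
    print("now restricted to:", sorted(pin_to_core(first)))
```

With four single-threaded jobs, the idea would be to give each one a different core so that none of them keeps losing the contents of its private cache to migration.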
 

Viditor

Diamond Member
Oct 25, 1999
3,290
0
0
Just to add to things on the server front (especially virtual machines), have you read Johan's blog on the TLB fix?

Johan Blog

Barcelona, or AMD's K10, supports 4 KB, 2 MB and 1 GB page sizes. 2 MB pages are getting more popular (especially on Linux servers) as they significantly reduce the memory management overhead. AMD's TLB architecture:
Low latency L1 TLB (Data and Instructions): 48 entries, supporting all page sizes
L2 TLB (Data and Instructions): 512 4 KB entries, or 128 2 MB entries

If you compare this with the Intel Penryn family:
One instruction TLB: 128 entries (4 KB) but only 8 entries for 2MB pages.
The Data TLB has 2 levels:
- 16 entries (4 KB)
- 256 entries (4 KB), but only 32 for larger pages (2 MB)

You can see that AMD's K10 family has really massive TLBs compared to the Penryn and previous Intel CPUs, especially if you want to run with large pages. So while this will certainly not affect anyone behind a desktop or mobile, it may well have an impact in the server world.
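
To put the entry counts quoted above into perspective, here is a small sketch that turns them into "TLB reach", meaning how much memory the second-level TLB can map at once (entries times page size). The entry counts are the ones from the blog excerpt; the function name `tlb_reach` is just for illustration.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size


# (label, entries, page size in bytes), using the figures quoted in the post
configs = [
    ("K10 L2 TLB, 4 KB pages", 512, 4 * KIB),
    ("K10 L2 TLB, 2 MB pages", 128, 2 * MIB),
    ("Penryn L2 data TLB, 4 KB pages", 256, 4 * KIB),
    ("Penryn L2 data TLB, 2 MB pages", 32, 2 * MIB),
]

for label, entries, page in configs:
    print(f"{label}: {tlb_reach(entries, page) / MIB:g} MB")
```

By these numbers the K10 second-level TLB covers 2 MB with 4 KB pages and 256 MB with 2 MB pages, against 1 MB and 64 MB for the Penryn data TLB, which is why the gap matters most for large-page server workloads.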
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
I was revisiting Anand's B3 Preview article again, thinking to myself why it's taken nearly 2 weeks and not a single additional benchmark has creeped out (I find this odd)...but it caught my eye that the WinRAR benchmark was actually done with an overclocked B3 sample.

Anyone else catch this?

So I'm curious whether the ram got overclocked by the 4.5% they overclocked the B3 to get it from 2.2 stock to 2.3GHz. And if ram did get overclocked...does the B3 really deliver the same WinRAR score as the B2?

Just seemed odd to do that and offer no qualifications as to what was overclocked and what wasn't. I find this odd.
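
For reference, a quick sketch of the arithmetic behind the question. Everything beyond the 2.2 and 2.3 GHz figures is an assumption made for illustration: an 11x multiplier on a 200 MHz reference clock at stock, DDR2-800 memory, and memory speed scaling with the reference clock. The article does not say which method was used.

```python
def percent_increase(old: float, new: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return 100.0 * (new - old) / old


stock_mhz, oc_mhz = 2200, 2300
print(f"CPU overclock: {percent_increase(stock_mhz, oc_mhz):.1f}%")

# ASSUMPTION: stock 2.2 GHz is an 11x multiplier on a 200 MHz reference clock.
REF_MHZ, MULT = 200.0, 11.0

# Scenario A: multiplier stays at 11x and the reference clock is raised.
ref_needed = oc_mhz / MULT
# ASSUMPTION: hypothetical DDR2-800 whose speed scales with the reference clock.
mem_oc = 800.0 * ref_needed / REF_MHZ
print(f"A: reference clock {ref_needed:.1f} MHz, memory about DDR2-{mem_oc:.0f}")

# Scenario B: reference clock stays at 200 MHz and the multiplier is raised.
mult_needed = oc_mhz / REF_MHZ
print(f"B: multiplier {mult_needed:g}x, memory unchanged")
```

Only in scenario A would the RAM pick up the same ~4.5%, which is exactly the detail the WinRAR comparison would need to state.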
 

Sylvanas

Diamond Member
Jan 20, 2004
3,752
0
0
Well I asked for an ETA on the 9850 black edition from a local store here in Australia and they said 'Early April'.... so as soon as it comes in I hope to be picking it up- will let you all know my experiences.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Idontcare
I was revisiting Anand's B3 Preview article again, thinking to myself why it's taken nearly 2 weeks and not a single additional benchmark has creeped out (I find this odd)...but it caught my eye that the WinRAR benchmark was actually done with an overclocked B3 sample.

Anyone else catch this?

So I'm curious whether the ram got overclocked by the 4.5% they overclocked the B3 to get it from 2.2 stock to 2.3GHz. And if ram did get overclocked...does the B3 really deliver the same WinRAR score as the B2?

Just seemed odd to do that and offer no qualifications as to what was overclocked and what wasn't. I find this odd.

Good catch. That makes all but page 2 of Anand's article virtually worthless. Gotta love how Tom, Anand, et al. are so unscientific in so many ways ;).
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: CTho9305
Originally posted by: Idontcare
I was revisiting Anand's B3 Preview article again, thinking to myself why it's taken nearly 2 weeks and not a single additional benchmark has creeped out (I find this odd)...but it caught my eye that the WinRAR benchmark was actually done with an overclocked B3 sample.

Anyone else catch this?

So I'm curious whether the ram got overclocked by the 4.5% they overclocked the B3 to get it from 2.2 stock to 2.3GHz. And if ram did get overclocked...does the B3 really deliver the same WinRAR score as the B2?

Just seemed odd to do that and offer no qualifications as to what was overclocked and what wasn't. I find this odd.

Good catch. That makes all but page 2 of Anand's article virtually worthless. Gotta love how Tom, Anand, et al. are so unscientific in so many ways ;).

It's a bummer, isn't it?
 

Martimus

Diamond Member
Apr 24, 2007
4,490
157
106
Originally posted by: Idontcare
I was revisiting Anand's B3 Preview article again, thinking to myself why it's taken nearly 2 weeks and not a single additional benchmark has creeped out (I find this odd)...but it caught my eye that the WinRAR benchmark was actually done with an overclocked B3 sample.

Anyone else catch this?

So I'm curious whether the ram got overclocked by the 4.5% they overclocked the B3 to get it from 2.2 stock to 2.3GHz. And if ram did get overclocked...does the B3 really deliver the same WinRAR score as the B2?

Just seemed odd to do that and offer no qualifications as to what was overclocked and what wasn't. I find this odd.

I thought that the Phenoms used a divider on the IMC and not the processor speed, unlike the X2's. Was I wrong about that?
 

v8envy

Platinum Member
Sep 7, 2002
2,720
0
0
BTW, for what it's worth -- desktop Linux does the same thing re: migrating processes between cores for no good reason. 2.6.2something kernel -- was playing EVE Online, and watching the load go 0->100 on one core, then the other, every 5 seconds.

Both the Borg and the Penguin might be doing it for the aforementioned thermals reasons -- if machines with inadequate cooling bluescreen when a core melts it might reflect badly on the OS vendor.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
Originally posted by: v8envy
BTW, for what it's worth -- desktop Linux does the same thing re: migrating processes between cores for no good reason. 2.6.2something kernel -- was playing EVE Online, and watching the load go 0->100 on one core, then the other, every 5 seconds.

Both the Borg and the Penguin might be doing it for the aforementioned thermals reasons -- if machines with inadequate cooling bluescreen when a core melts it might reflect badly on the OS vendor.

Thanks for confirming V8.

It seems silly and odd to me. Especially with power saving schemes like Foxton and that feature AMD is rolling out where cores can be independently set into power savings. If you really aren't loading your system with enough single-threaded apps to consume a full core at 100% utilization then ideally you'd have the other three idle down, reduce their clock multipliers and go to sleep until they are needed. But if the OS keeps skipping the threads around then nothing gets to sleep for too long and performance goes down.

I guess what I am getting at is that I can't envision a worse way to implement thread management than what has already been done. It's like it was intentionally designed to be the least effective. Of course I suppose that is how you manage planned obsolescence...build in your features that you know will need to be replaced.