AMD 'Bulldozer' gets an Update from Microsoft


taltamir

Lifer
Mar 21, 2004
13,576
6
76
this confirms Windows 7 was in fact hampering “Bulldozer” from performing at 100% in all prior benches.

No, it confirms that the Bulldozer design requires OS optimization (a claim made before, and I don't think anyone actually refuted it), which is a really bad way to design a processor.
Windows, MacOS, Linux, Solaris, BSD... every OS will feel such hampering until it adds code to detect Bulldozer and then applies Bulldozer-specific optimizations.

Hmmmm could it be a comeback??

No.
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
Hope for competition's sake that Bulldozer is more a Pentium Pro than a Pentium 4. I disagree that it's terrible to need OS optimizations, Intel was throwing that FUD around with the introduction of x86-64 and AMD was using similar FUD to attack HyperThreading. Regardless of which category it ends up in, Bulldozer is still not a buy for consumers.
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Hope for competition's sake that Bulldozer is more a Pentium Pro than a Pentium 4. I disagree that it's terrible to need OS optimizations, Intel was throwing that FUD around with the introduction of x86-64 and AMD was using similar FUD to attack HyperThreading. Regardless of which category it ends up in, Bulldozer is still not a buy for consumers.

It's not FUD... it's terrible because AMD doesn't actually write that optimization, which means few others will.
Who do you think is going to write those optimizations for open source OS, the type that powers all the servers in the world? Nobody, that's who.
 
Dec 30, 2004
12,553
2
76
AMD and MS would need to work together on it, and to be sure it's correct, MS would need chip revisions guaranteed to give exactly the same performance as final production models. If they had it ready on the day of launch, they would still need to spend time testing it, and likely would not have gotten launch-equivalent CPU samples more than a few weeks ahead of the rest of the world.

Reliability anomalies are somewhat rare, thankfully, but performance anomalies are common, and have been getting patched for a good long while, either when MS or the vendor gets around to it.

It's more impressive on Intel's part that Nehalem and SB did not need any significant performance patching, than it is a black mark on AMD that BD needs it.

Kinda disagree there. This doesn't do anything besides spread the threads on 0, 2, 4, and 6 first and then 1, 3, 5, and 7 second.
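For illustration, the fill order described above can be sketched in a few lines of Python. This is a toy model, not the actual Windows scheduler: the function name `placement_order` and the assumption that logical CPUs 2k and 2k+1 share one module (or one physical core) are mine.

```python
def placement_order(n_logical: int, siblings: int = 2) -> list[int]:
    """Return logical CPU ids in the order a 'spread first' policy would fill them.

    Assumes ids are numbered so that each group of `siblings` consecutive ids
    shares one module (Bulldozer) or one physical core (Hyper-Threading).
    """
    order = []
    # Pass 0 takes the first logical CPU of every group, pass 1 the second, ...
    for slot in range(siblings):
        order.extend(range(slot, n_logical, siblings))
    return order


if __name__ == "__main__":
    print(placement_order(8))  # [0, 2, 4, 6, 1, 3, 5, 7]
```

Under this ordering, up to four runnable threads each get a module to themselves; only the fifth thread has to share front-end and L2 resources with a sibling.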
 

Vesku

Diamond Member
Aug 25, 2005
3,743
28
86
It's not FUD... it's terrible because AMD doesn't actually write that optimization, which means few others will.
Who do you think is going to write those optimizations for open source OS, the type that powers all the servers in the world? Nobody, that's who.

AMD has submitted Bulldozer optimizations for Linux. Phoronix has been benchmarking Linux performance.

http://www.phoronix.com/scan.php?page=article&item=amd_bdver1_ofast&num=1

http://www.phoronix.com/scan.php?page=article&item=amd_fx8150_bulldozer&num=1

http://www.phoronix.com/scan.php?page=news_item&px=MTAxMzg

Their CPU software development wing isn't as impressive as Intel's, but they aren't leaving BD on a mountain top with a spike through its ankle.

I'd actually be more inclined to consider BD if I was planning a dedicated server that benefitted from their multi-thread focus. Instead I'm left disappointed and begrudgingly running a 1090t. I would have gotten a 2500K but AMD 990FX boards have some nice layouts in the sub-$150 range, and I can actually fully utilize the x6 so the 2500K isn't as much of a must have as it would be for a pure 4 threads or less type of enthusiast.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
Kinda disagree there. This doesn't do anything besides spread the threads on 0, 2, 4, and 6 first and then 1, 3, 5, and 7 second.
...and MS has a magic QA button that allows them to know that changes made to the behavior of core OS functionality are going to remain isolated, and work exactly as expected, in 100% of cases, without doing extensive testing of use cases? :rolleyes:

I don't know how extensive the changes that the hotfix makes are, but I'm quite confident that you are both underestimating the needed changes in the OS itself (it's never as simple as it first looks), and you've never been on either end of a Friday evening update deployment (some lessons you learn the hard way!).
 

grkM3

Golden Member
Jul 29, 2011
1,407
0
0
Just so you guys know, I installed the hotfix right when it came out on my Sandy setup, and it's benching higher in everything that uses 8 threads, with a bigger boost with AVX.

I broke 120 GFlops using AVX and 8 threads in IntelBurnTest, when before I could never get over 108, and my Cinebench points increased by as much as a 150-200 MHz overclock would add.

This patch helps all threaded apps and helped out my 2600K.

It's crazy how a 1.3 MB file can add so much performance.

PM me if you need it; I have it saved on my computer.
 

red454

Senior member
Oct 7, 2011
205
0
0
www.cardomain.com
Just so you guys know, I installed the hotfix right when it came out on my Sandy setup, and it's benching higher in everything that uses 8 threads, with a bigger boost with AVX.

I broke 120 GFlops using AVX and 8 threads in IntelBurnTest, when before I could never get over 108, and my Cinebench points increased by as much as a 150-200 MHz overclock would add.

This patch helps all threaded apps and helped out my 2600K.

It's crazy how a 1.3 MB file can add so much performance.

PM me if you need it; I have it saved on my computer.


Did you notice anything negative (slowdowns, etc.) with the patch? And any noticeable change in normal use applications?
 

grkM3

Golden Member
Jul 29, 2011
1,407
0
0
Nothing bad so far. The OS feels more responsive, and the best way I can explain it is that it's like going from 4 GB of RAM to 8 GB.
 

Diceman2037

Member
Dec 19, 2011
54
0
66
Just so you guys know, I installed the hotfix right when it came out on my Sandy setup, and it's benching higher in everything that uses 8 threads, with a bigger boost with AVX.

I broke 120 GFlops using AVX and 8 threads in IntelBurnTest, when before I could never get over 108, and my Cinebench points increased by as much as a 150-200 MHz overclock would add.

This patch helps all threaded apps and helped out my 2600K.

It's crazy how a 1.3 MB file can add so much performance.

PM me if you need it; I have it saved on my computer.

Numbers, please. "Felt" responsiveness can change from one boot to the next depending on a number of factors.

HeavyHemi said:
i7 980 4.3 GHz 1.35 vcore


Just because I'm a crazy guy...

Pre patch...

Intel(R) LINPACK 64-bit data - LinX 0.6.4

Current date/time: Fri Dec 16 23:50:58 2011

CPU frequency: 4.207 GHz
Number of CPUs: 1
Number of cores: 6
Number of threads: 12

Parameters are set to:

Number of tests : 1
Number of equations to solve (problem size) : 10000
Leading dimension of array : 10008
Number of trials to run : 20
Data alignment value (in Kbytes) : 4

Maximum memory requested that can be used = 800844256, at the size = 10000

============= Timing linear equation system solver =================

Size   LDA    Align.  Time(s)  GFlops   Residual       Residual(norm)
10000  10008  4       10.715   62.2389  9.915883e-011  3.496441e-002
10000  10008  4       10.576   63.0562  9.915883e-011  3.496441e-002
10000  10008  4       10.549   63.2171  9.915883e-011  3.496441e-002
10000  10008  4       10.538   63.2818  9.915883e-011  3.496441e-002
10000  10008  4       10.622   62.7827  9.915883e-011  3.496441e-002

Post patch....

Intel(R) LINPACK 64-bit data - LinX 0.6.4

Current date/time: Sat Dec 17 00:24:04 2011

CPU frequency: 4.222 GHz
Number of CPUs: 1
Number of cores: 6
Number of threads: 12

Parameters are set to:

Number of tests : 1
Number of equations to solve (problem size) : 10000
Leading dimension of array : 10008
Number of trials to run : 20
Data alignment value (in Kbytes) : 4

Maximum memory requested that can be used = 800844256, at the size = 10000

============= Timing linear equation system solver =================

Size   LDA    Align.  Time(s)  GFlops   Residual       Residual(norm)
10000  10008  4       9.952    67.0063  9.915883e-011  3.496441e-002
10000  10008  4       9.951    67.0170  9.915883e-011  3.496441e-002
10000  10008  4       9.828    67.8505  9.915883e-011  3.496441e-002
10000  10008  4       9.900    67.3604  9.915883e-011  3.496441e-002
10000  10008  4       9.727    68.5595  9.915883e-011  3.496441e-002
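
To put a single number on the two runs above, here is a minimal Python sketch that averages the GFlops columns and reports the relative change. The values are copied from the pre-patch and post-patch listings; the helper name `percent_gain` is just for illustration.

```python
from statistics import mean

# GFlops columns copied from the two LinX runs above (HT enabled, 12 threads).
pre_patch = [62.2389, 63.0562, 63.2171, 63.2818, 62.7827]
post_patch = [67.0063, 67.0170, 67.8505, 67.3604, 68.5595]


def percent_gain(before, after):
    """Relative change of the mean, in percent."""
    return (mean(after) / mean(before) - 1.0) * 100.0


gain = percent_gain(pre_patch, post_patch)
# The reported CPU frequency differs slightly between the two runs.
clock_gain = (4.222 / 4.207 - 1.0) * 100.0

if __name__ == "__main__":
    print(f"mean GFlops gain: {gain:.1f}%")
    print(f"clock difference: {clock_gain:.2f}%")
```

The mean gain comes out a little over 7%, while the reported clock difference is well under half a percent, so the clock alone does not account for it. Only five of the 20 trials are shown for each run, so treat the figure as indicative.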
 

Diceman2037

Member
Dec 19, 2011
54
0
66
It looks like this patch is not only about SMT (see i7 improvements), but probably contains some temporary thread pinning.

More likely it includes the cache aliasing fix, which improves the situation on HT-enabled processors.

The same processor showed no improvement with HT disabled.

What is somewhat interesting is that with HT disabled, there is essentially no difference in results. I just ran this twice, each time from a clean boot.
With the patch:
Intel(R) LINPACK 64-bit data - LinX 0.6.4

Current date/time: Mon Dec 19 01:17:58 2011

CPU frequency: 4.234 GHz
Number of CPUs: 1
Number of cores: 6
Number of threads: 6

Parameters are set to:

Number of tests : 1
Number of equations to solve (problem size) : 10000
Leading dimension of array : 10008
Number of trials to run : 20
Data alignment value (in Kbytes) : 4

Maximum memory requested that can be used = 800844256, at the size = 10000

============= Timing linear equation system solver =================

Size   LDA    Align.  Time(s)  GFlops   Residual       Residual(norm)
10000  10008  4       8.007    83.2852  9.915883e-011  3.496441e-002
10000  10008  4       7.898    84.4365  9.915883e-011  3.496441e-002
10000  10008  4       7.827    85.1980  9.915883e-011  3.496441e-002
10000  10008  4       7.836    85.1014  9.915883e-011  3.496441e-002
10000  10008  4       7.836    85.1068  9.915883e-011  3.496441e-002

And without the patch:

Intel(R) LINPACK 64-bit data - LinX 0.6.4

Current date/time: Mon Dec 19 01:27:29 2011

CPU frequency: 4.233 GHz
Number of CPUs: 1
Number of cores: 6
Number of threads: 6

Parameters are set to:

Number of tests : 1
Number of equations to solve (problem size) : 10000
Leading dimension of array : 10008
Number of trials to run : 20
Data alignment value (in Kbytes) : 4

Maximum memory requested that can be used = 800844256, at the size = 10000

============= Timing linear equation system solver =================

Size   LDA    Align.  Time(s)  GFlops   Residual       Residual(norm)
10000  10008  4       7.830    85.1681  9.915883e-011  3.496441e-002
10000  10008  4       7.839    85.0728  9.915883e-011  3.496441e-002
10000  10008  4       7.843    85.0243  9.915883e-011  3.496441e-002
10000  10008  4       7.839    85.0719  9.915883e-011  3.496441e-002
10000  10008  4       7.840    85.0554  9.915883e-011  3.496441e-002
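
A rough way to check the "no difference with HT disabled" reading is to compare the gap between the two means against the run-to-run spread. The sketch below does that with the GFlops values from the two 6-thread listings; the criterion (gap smaller than the larger sample standard deviation) is a crude rule of thumb chosen for illustration, not a proper significance test.

```python
from statistics import mean, stdev

# GFlops columns copied from the two HT-disabled runs above (6 threads).
with_patch = [83.2852, 84.4365, 85.1980, 85.1014, 85.1068]
without_patch = [85.1681, 85.0728, 85.0243, 85.0719, 85.0554]


def within_noise(a, b):
    """True if the gap between the means is smaller than the larger sample stdev."""
    gap = abs(mean(a) - mean(b))
    return gap < max(stdev(a), stdev(b))


if __name__ == "__main__":
    print("gap between means:", round(abs(mean(with_patch) - mean(without_patch)), 3))
    print("within noise:", within_noise(with_patch, without_patch))
```

Most of the gap comes from the slow first trial of the patched run (83.2852); the remaining trials of both runs sit within a fraction of a GFlop of each other.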
 

tweakboy

Diamond Member
Jan 3, 2010
9,517
2
81
www.hammiestudios.com
Would this hotfix do anything for people with Thuban or Sandy cpus?

I regularly see 10-15% core activity across all my PII X6 cores during browsing and listening to music, which keeps the cores active when they should be parked.


That is not right. My CPU usage hovers between 0 and 4 percent while I'm listening to Winamp with all my browsers open, one of them playing a YouTube video... hmmmm

You shouldn't get 10 percent on a single core, let alone on all cores. That's not cool; makes me wonder!
 

Dresdenboy

Golden Member
Jul 28, 2003
1,730
554
136
citavia.blog.de
So... in other words... Bulldozer was so bad that they had to adjust Windows for it to run?
Windows does run on Bulldozer without this patch. It's just a tribute to the complexity and diversity among CPU architectures. There is no "one OS fits all" or "one compiler fits all" (on actual code generation) anymore ;)

More likely it includes the cache aliasing fix, which improves the situation on HT-enabled processors.

The same processor showed no improvement with HT disabled.
So why would the cache aliasing fix help Sandy Bridge? The fix improved performance by 1-2% in most cases under linux (see phoronix).

Microsoft mentioned the scheduler.
 

Ferzerp

Diamond Member
Oct 12, 1999
6,438
107
106
open source OS, the type that powers all the servers in the world? Nobody, that's who.

Now that's just wishful thinking on your part. "Servers" are more than just HTTP servers (yes, Linux does have huge market share for public-facing WWW sites).
 

Diceman2037

Member
Dec 19, 2011
54
0
66
Windows does run on Bulldozer without this patch. It's just a tribute to the complexity and diversity among CPU architectures. There is no "one OS fits all" or "one compiler fits all" (on actual code generation) anymore ;)


So why would the cache aliasing fix help Sandy Bridge? The fix improved performance by 1-2% in most cases under linux (see phoronix).

Microsoft mentioned the scheduler.

The cache aliasing tweak benefits Hyperthreaded processors where both threads share the same core/cache

This has a 5-10% improvement in SMT workloads, where the worker threads are working together

not so much in cases like prime95 where the worker threads are working individually.

That means that you can have two copies of the same data in separate parts of the cache without knowing it... and they wouldn't be updated correctly, so you'd get wrong results.

If the wrong result resides in cache, then the CPU has to go back to L2 or L3, and finally system memory, to update this data, which will reduce throughput.

Cache aliasing occurs when multiple mappings to a physical page of memory have conflicting caching states, such as cached and uncached. Due to these conflicting states, data in that physical page may become corrupted when the processor's cache is flushed. If that page is being used for DMA by a driver, this can lead to hardware stability problems and system lockups.
So an overall improvement to aliasing will affect any CPU which has multiple cores, virtual or physical, if they share cache.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
It's not FUD... it's terrible because AMD doesn't actually write that optimization, which means few others will.
Who do you think is going to write those optimizations for open source OS, the type that powers all the servers in the world? Nobody, that's who.
Nobody? Hmmm, well, somebody did. Linux does have a performance patch. It is for a cache aliasing issue involving shared libraries.

Linux has also made new schedulers over the years, specifically to be able to schedule a wide variety of workloads well for a wide variety of CPUs with very different performance characteristics. They generally have been able to get very good performance with CFS (especially after CK got back in there and lit a fire :)), with just a few issues here and there to fix. BD, for instance, needed no changes to scale better than Windows with low thread counts, and the fix it did need was the same kind they've had to do many times before, for many different CPUs (unless we can find a better way to handle memory than fast cache, we'll either have high-latency shared caches, poor corner case performance with fully exclusive caches, or unnecessary corruption/eviction to deal with).

The cache aliasing tweak benefits Hyperthreaded processors where both threads share the same core/cache
But it shouldn't, unless they specifically added a method for SB. Better temporal locality would improve any modern CPU's performance, however. Linux's alias fix, for instance, worked around the specifics of BD's I$. More generic methods, like coloring and extra copies, tend to be error-prone and wasteful (in both space and performance) when used where not needed (OSes that use coloring and the like for most or all virtual memory are a different matter). If the hardware can do it reasonably well, it's not worth it.
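
To make the aliasing and coloring idea concrete, here is a minimal Python sketch of how a virtually indexed cache can place the same physical line in two different sets. The geometry (64 KiB, 2-way, 64-byte lines, 4 KiB pages) is an assumption based on figures commonly cited for BD's shared L1 I$, not something stated in this thread, and the function names are hypothetical.

```python
PAGE_SIZE = 4 * 1024     # assumed 4 KiB pages
LINE_SIZE = 64           # assumed cache line size in bytes
CACHE_SIZE = 64 * 1024   # assumed I-cache capacity
WAYS = 2                 # assumed associativity

WAY_SIZE = CACHE_SIZE // WAYS      # bytes covered by one way
NUM_SETS = WAY_SIZE // LINE_SIZE   # sets selected by the index bits


def set_index(vaddr: int) -> int:
    """Cache set chosen by a virtual address in a virtually indexed cache."""
    return (vaddr // LINE_SIZE) % NUM_SETS


def color_mask() -> int:
    """Index bits above the page offset; these come from the virtual page number."""
    return (WAY_SIZE - 1) & ~(PAGE_SIZE - 1)


def same_color(va1: int, va2: int) -> bool:
    """Two mappings of one physical page index the same sets only if this holds."""
    return (va1 & color_mask()) == (va2 & color_mask())


if __name__ == "__main__":
    print(hex(color_mask()))               # address bits an OS would need to align
    print(same_color(0x1000, 0x2000))      # different colors: the mappings can alias
    print(same_color(0x1000, 0x9000))      # same color: both hit the same sets
```

With this geometry the mask covers address bits 12 through 14, so an OS can avoid the alias by keeping those bits identical across every mapping of the same shared library page, which is the kind of targeted workaround described above, as opposed to coloring all virtual memory.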

The BD alias fix appears to only help by a few percent, and sometimes not even that. It's good to do, because someone down the line will end up in a situation where it will be much more than that, and having that fix in place will prevent any performance crisis problems. It may have been better if AMD had prevented the need for it, on one hand; but on the other, it's better that they err on the side of correctness, if they're going to take a performance shortcut.

If the wrong result resides in cache, then the CPU has to go back to L2 or L3, and finally system memory, to update this data, which will reduce throughput.
But, which result is correct? The penalty of figuring that out will typically be negligible, but when it's not, it can be damning. In addition, we don't know if, or how much, Windows may be affected. Cache aliasing on BD does not appear to be a major problem. Linux, for instance, patched for a very specific case, and it only mildly affects performance even for that case (sometimes even performing slightly worse when corrected!).

So an overall improvement to aliasing will affect any CPU which has multiple cores, virtual or physical, if they share cache.
But, not all CPUs will need it, ones that need it for one OS may not need it for another (assuming it's a performance, not correctness, problem), and in some cases, it may appear to need work, but letting the hardware take care of it may still be faster than a software fix. In the case that it does need work, it will need to be done in a way specific to the OS' use of memory on the specific family of CPUs in question.
 

Diceman2037

Member
Dec 19, 2011
54
0
66
But it shouldn't, unless they specifically added a method for SB. Better temporal locality would improve any modern CPU's performance, however. Linux's alias fix, for instance, worked around the specifics of BD's I$. More generic methods, like coloring and extra copies, tend to be error-prone and wasteful (in both space and performance) when used where not needed (OSes that use coloring and the like for most or all virtual memory are a different matter). If the hardware can do it reasonably well, it's not worth it.
Well, there's also the fact that this patch may be a backport of the Windows 8 scheduler, and Windows 8 shows improvements in certain scenarios as well.

It's possible that an issue was identified with how Windows handles logical processors, and MSFT decided to keep the fix neutral so that it didn't just seem to be optimising for Bulldozer.

From the results, it would seem that the hyperthread is gaining additional throughput, and if I recall the numbers correctly, the hyperthread usually delivered 25% of the performance of the first thread.
 

Cerb

Elite Member
Aug 26, 2000
17,484
33
86
From the results, it would seem that the hyperthread is gaining additional throughput, and if I recall the numbers correctly, the hyperthread usually delivered 25% of the performance of the first thread.
Maybe. I was addressing the potential for fixing cache aliasing issues across different CPUs. At this point, that sort of problem is very much CPU-dependent. It's not a program behavior, but the specific behavior of one uarch. Another may exhibit cache aliasing, as well, but the conditions in which it may present a problem to be fixed will generally be different, as will what will be needed to take care of it.

Memory mapping changes just for BD are pretty much just for BD. Changes to where threads are placed, how often, and where they should be placed if another thread is active in the prior location, ought to affect HT CPUs in a very similar manner to BD, but less so (due to BD's big L2).
 

taltamir

Lifer
Mar 21, 2004
13,576
6
76
Nobody? Hmmm, well, somebody did. Linux does have a performance patch. It is for a cache aliasing issue involving shared libraries.

Getting some low-hanging fruit is obvious. But if AMD does not pay for it itself, nobody else has the time, money, interest, and knowledge to do the advanced stuff.
 

Idontcare

Elite Member
Oct 10, 1999
21,110
64
91
So I have a 2600k running Win 7 Ult 64. Should I run this? Any gaming benefits?

Wait for it to be rolled out as something more robust than a hotfix. If it is everything and a bag of chips, then it will be included in a standard Windows Update download without you needing to install it manually.