18-30% improvement in SuperPI for AMD via patch

monstercameron

Diamond Member
Feb 12, 2013
3,818
1
0
The Book of Bulldozer - Revelations: Episode 2 (SuperPI / x87)
Exactly two years ago, when I tested a Bulldozer-based Zambezi CPU for the first time, I was shocked.
The early sample units were even hotter and slower than the final silicon revision CPUs, which were finally released four months later.
One of the single largest letdowns came from a benchmark from way back: SuperPI.

SuperPI mainly uses legacy x87 instructions, which have been almost completely superseded.
SuperPI gives no indication whatsoever of SMP performance, as it can only utilize a single thread. On top of that, it has no real-world use or purpose, as there are newer programs which can calculate pi almost 100 times faster.
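
For a sense of what a single-threaded pi calculation involves: SuperPI is generally described as using the Gauss-Legendre iteration. The sketch below is not SuperPI's code. It is a minimal illustration in Python using the standard decimal module, and the function name gauss_legendre_pi and the number of guard digits are assumptions made for this example.

```python
from decimal import Decimal, localcontext


def gauss_legendre_pi(digits: int) -> str:
    """Return pi truncated to `digits` decimal places, as a string."""
    with localcontext() as ctx:
        ctx.prec = digits + 10  # assumed guard digits to absorb rounding error
        a = Decimal(1)
        b = Decimal(1) / Decimal(2).sqrt()
        t = Decimal(1) / Decimal(4)
        p = Decimal(1)
        # Each pass roughly doubles the number of correct digits,
        # so about log2(digits) passes are enough.
        for _ in range(max(1, digits.bit_length())):
            a_next = (a + b) / 2
            b = (a * b).sqrt()          # uses the old value of a
            t -= p * (a - a_next) ** 2  # uses the old value of a
            a = a_next
            p *= 2
        pi = (a + b) ** 2 / (4 * t)
    return str(pi)[: digits + 2]  # "3." plus the requested digits


print(gauss_legendre_pi(30))
```

Every step depends on the result of the previous one, which is part of why a run like this stays on one thread and is sensitive to arithmetic and cache latency rather than core count.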

Still, SuperPI can almost be considered an industry standard.
Nowadays it is generally a VERY poor indicator of real-world performance, yet it is so addictive for any old-school overclocker. It scales very well with CPU/NB/DRAM/IO performance, and tweaking it is a big challenge. An overclocker who has never benched SuperPI simply doesn't exist.

SuperPI has a special place in my heart simply because it was one of the first benchmarks I ever ran... almost 14 years ago...

So, why are all of the 15h (Bulldozer) based CPU/APU/NPUs performing so badly in SuperPI?
Some people say it is because the 15h family has 50% fewer FPUs per core than the preceding 10h family.
In the 15h family a compute unit (two cores) shares one FPU, whereas the 10h/12h families had a dedicated FPU for each core.

If this were the only reason, the issue would be solved by disabling the "slave" core of the CU, leaving a "private" FPU for the "master" (BSC) core. However, this is not the case, and it shouldn't be anyway, as SuperPI is single-threaded, remember?

The caches on the 15h family have higher latency than those on the 10h family, for example, and SuperPI happens to love large, low-latency caches.
The 15h family was initially designed for high frequencies. Just like F1 engines, these chips produce no power at low revs. And unfortunately it currently doesn't seem to be possible to build an engine capable of revving high enough. We might discuss the caches further in "Episode 3"... If possible.

source

Maybe AMD is hamstrung by software implementation after all...
 

grimpr

Golden Member
Aug 21, 2007
1,095
7
81
Can't believe that in the era of AVX/TSX/FMA/XOP we're still talking about, and taking seriously, SuperPI and x87.
 

exar333

Diamond Member
Feb 7, 2004
8,518
8
91
I checked out as soon as I reached the car analogy in the paragraph. It is never that useful to release patches for synthetic performance, especially when they arrive literally years too late. Not that impressed, honestly.
 

Phynaz

Lifer
Mar 13, 2006
10,140
819
126
Manipulating undocumented(?) CPU registers to get a speedup in Superpi = :|
 

BallaTheFeared

Diamond Member
Nov 15, 2010
8,115
0
71
I calculate 1m in .250 seconds, another example of how ancient software doesn't take advantage of modern chips.
 

JQuilty

Junior Member
Mar 28, 2013
9
0
66
Anybody that uses SuperPi or gives any credence to results in 2013 needs to be beaten with a baseball bat. x87 is useless and I don't think anyone particularly cares about calculating pi.
 

moonbogg

Lifer
Jan 8, 2011
10,731
3,440
136
So what excuse is there for everything else running so terribly on AMD chips?
 

bononos

Diamond Member
Aug 21, 2011
3,939
190
106
From the source

Another example of how software has been underusing AMD chips.

It would have been more impressive if the record was more than the best of '32nm AMD chips'. How does it fare next to Sandys?
 

ViRGE

Elite Member, Moderator Emeritus
Oct 9, 1999
31,516
167
106
Anybody that uses SuperPi or gives any credence to results in 2013 needs to be beaten with a baseball bat. x87 is useless and I don't think anyone particularly cares about calculating pi.
Indeed. This is as geeky as all heck and I love it for that. But the end result, SuperPi, has no real world significance. SuperPi has been outdated for so long that there are members of this board almost as old as it is; poor SuperPi performance was never a real concern for AMD in the first place.
Manipulating undocumented(?) CPU registers to get a speedup in Superpi = :|
I was really hoping for more details on what exactly he's done. "x87 instruction (NRAC) block" is not a description, especially since that instruction (NRAC) doesn't exist.
 

SPBHM

Diamond Member
Sep 12, 2012
5,068
423
126
Cool, it's still very little compared to how much faster Intel CPUs are in SuperPI but, nevertheless, I think Skyrim 1.0 would also greatly benefit from this patch.

The question is, why did AMD decide not to use this?
 

JQuilty

Junior Member
Mar 28, 2013
9
0
66
Still, it was SuperPI scores from Bulldozer ES that gave us the first indication of Bulldozer's weak performance.

In other news, the AIDA64 memory bandwidth tests have been updated to support multi-threading, and Bulldozer is now much more competitive in read performance.

http://www.xtremesystems.org/forums/showthread.php?286212-new-AIDA-64-seems-fixed-for-FX-CPUs


AMD doesn't do x87 in hardware anymore.

Cool, it's still very little compared to how much faster Intel CPUs are in SuperPI but, nevertheless, I think Skyrim 1.0 would also greatly benefit from this patch.

The question is, why did AMD decide not to use this?

Because x87 is obsolete. There is no benefit to using it over SSE, and AMD doesn't even support it in hardware anymore. Both they and Intel have actively discouraged its use for over a decade.
 

Exophase

Diamond Member
Apr 19, 2012
4,439
9
81
I don't know where all this stuff about AMD not doing x87 in hardware anymore comes from. Of course it's done in hardware, it's not like the instructions raise exceptions that make the OS emulate them. If this were happening things would be a lot slower than they are.

They could be heavily microcoded, but IIRC the timings in the tables reflect this. I think they just have worse latencies than their SSE2 equivalents.
 

bononos

Diamond Member
Aug 21, 2011
3,939
190
106
One day they'll just drop x87/mmx/3dnow to save on silicon real estate and hand out some software emulator.
 

Hitman928

Diamond Member
Apr 15, 2012
6,737
12,455
136
They could be heavily microcoded, but IIRC the timings in the tables reflect this. I think they just have worse latencies than their SSE2 equivalents.

:thumbsup: IIRC the reliance on microcode increased in Bulldozer, but I could just be remembering the large increases in latency.

One day they'll just drop x87/mmx/3dnow to save on silicon real estate and hand out some software emulator.

AMD said 3dnow is no longer supported and x87 is heavily microcoded, as Exophase pointed out.
 

zir_blazer

Golden Member
Jun 6, 2013
1,263
580
136
I think that people are underestimating the in-depth knowledge of both the processor architecture and assembler required to get a thing like this done, regardless of the uselessness of x87 in this era. I bet he spent a ton of time reading data sheets, white papers and manuals with countless pages to become aware of the existence of a processor-specific register and how to modify it.
Wouldn't you love any sort of microcode hack that allows you to unlock things like the multiplier, Hyper-Threading, VT-d, cores, L3 cache, whatever, that mainly Intel likes to disable? Boy, I would love for some Russian ASM gurus to get into the core of how this works and attempt to reproduce it on other processors. At the very least, this is one step closer to that.
 

Atreidin

Senior member
Mar 31, 2011
464
27
86
I can't take this guy seriously when he claims to be knowledgeable about CPU architectures but thinks that x87 is done in software on these processors. That is just absurd. :rolleyes:

Maybe he spent a lot of time reading but he sure didn't understand it all.
 

toyota

Lifer
Apr 15, 2001
12,957
1
0
can they get a patch for this? I am sure Skyrim would be more of a concern than superpi. lol


 

zir_blazer

Golden Member
Jun 6, 2013
1,263
580
136
I can't take this guy seriously when he claims to be knowledgeable about CPU architectures but thinks that x87 is done in software on these processors. That is just absurd.

Maybe he spent a lot of time reading but he sure didn't understand it all.
That's relative. The fact that you can send the processor an instruction that gets computed directly inside it and gives you back the result doesn't mean that the processor has purpose-specific logic dedicated to it. Actually, it may not have any logic specifically for that instruction at all, so for that task it has to rely on another unit that exists or is optimized for another purpose.
The FPU of a Bulldozer module is supposed to include two 128-bit FMAC units for AVX and FMA, and another two 128-bit units that can do either x87, MMX or SSE. Those last two units should be quite general purpose instead of purpose-specific, as I doubt there is much in common between the 80-bit floating point calculations that x87 does and what MMX and SSE do, which is SIMD. And considering that x87 should have much lower priority than SSE, I doubt they spent too many resources on it. So basically, when you run x87 code on a Bulldozer, you have to run it on a non-specific unit that should provide inferior performance for those instructions.
This is similar to what Intel did with the Pentium 3 Katmai. It introduced SSE support, but the logic for it was subpar because they were trying not to increase the die size too much. As some instruction sets fade away (or get introduced but aren't expected to be picked up quickly in that generation, so nobody wants to waste die space on them), chances are that there is no dedicated logic for them and they instead get processed by other units inside the processor in a sort of "software emulation" fashion.
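
To make the 80-bit point above a bit more concrete: the x87 extended format carries a 64-bit significand, while an SSE2 double carries 53 bits. The sketch below is only an illustration in Python using exact fractions. The helper name round_to_significand is made up for this example, it only handles positive values, and it ignores exponent range and denormals entirely.

```python
from fractions import Fraction

X87_SIGNIFICAND_BITS = 64  # 80-bit extended precision
SSE2_DOUBLE_BITS = 53      # 64-bit double: 52 stored bits plus an implicit 1


def round_to_significand(x: Fraction, bits: int) -> Fraction:
    """Round a positive value to the nearest number with `bits` significand bits."""
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    # First guess at the power of two that scales x into [2**(bits-1), 2**bits).
    exponent = x.numerator.bit_length() - x.denominator.bit_length() - bits
    scaled = x * Fraction(2) ** -exponent
    # Correct the guess if it is off by one in either direction.
    while scaled >= 2 ** bits:
        scaled /= 2
        exponent += 1
    while scaled < 2 ** (bits - 1):
        scaled *= 2
        exponent -= 1
    # round() on a Fraction rounds half to even, like the default FP rounding mode.
    return round(scaled) * Fraction(2) ** exponent


tenth = Fraction(1, 10)
err_double = abs(round_to_significand(tenth, SSE2_DOUBLE_BITS) - tenth)
err_x87 = abs(round_to_significand(tenth, X87_SIGNIFICAND_BITS) - tenth)
print(float(err_double), float(err_x87))
```

The extra 11 bits cut the rounding error on a value like 0.1 by roughly three orders of magnitude. That is a different datapath width from the packed 32-bit and 64-bit lanes that SSE works on, whichever unit ends up executing it.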