18-30% improvement in SuperPI for AMD via patch

monstercameron · Jun 21, 2013

The Book of Bulldozer - Revelations: Episode 2 (SuperPI / x87)
Exactly two years ago, when I tested a Bulldozer-based Zambezi CPU for the first time, I was shocked.
The early sample units were even hotter and slower than the final silicon revision CPUs, which were finally released four months later.
One of the single largest letdowns came from a benchmark from way back: SuperPI.

SuperPI mainly uses legacy x87 instructions, which have been almost completely superseded.
SuperPI gives no indication whatsoever of SMP performance, as it can only use a single thread. On top of that, it has no real-world use or purpose, as there are newer programs that can calculate pi almost 100 times faster.
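
For context, SuperPI is usually described as an implementation of the Gauss-Legendre iteration. What follows is only a minimal Python sketch of that iteration, assuming that description is accurate. It is not SuperPI's actual code, the function name gauss_legendre_pi is made up for illustration, and it uses Python's decimal module rather than x87 arithmetic. Like SuperPI, it runs on a single thread.

```python
from decimal import Decimal, localcontext


def gauss_legendre_pi(digits: int) -> str:
    """Return pi truncated to `digits` decimal places, as a string.

    Illustrative sketch of the Gauss-Legendre iteration; single-threaded.
    """
    if digits < 1:
        raise ValueError("digits must be at least 1")
    with localcontext() as ctx:
        # Work with a few guard digits so the truncated result is exact.
        ctx.prec = digits + 10
        a = Decimal(1)
        b = Decimal(1) / Decimal(2).sqrt()
        t = Decimal("0.25")
        p = Decimal(1)
        # The number of correct digits roughly doubles per iteration,
        # so about log2(digits) rounds are enough.
        for _ in range(digits.bit_length() + 1):
            a_next = (a + b) / 2
            b = (a * b).sqrt()
            t -= p * (a - a_next) ** 2
            a = a_next
            p *= 2
        pi = (a + b) ** 2 / (4 * t)
    # "3." plus the requested number of decimal places.
    return str(pi)[: digits + 2]
```

Calling gauss_legendre_pi(1000) gives pi to a thousand decimal places. The whole computation is a chain of dependent multiplies, divides and square roots, which is why a benchmark built on it says nothing about multi-core throughput.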

Still, SuperPI can almost be considered an industry standard.
Nowadays it is generally a VERY poor indicator of real-world performance, yet it is so addictive for any old-school overclocker. It scales very well with CPU/NB/DRAM/IO performance, and tweaking it is a big challenge. An overclocker who has never benched SuperPI simply doesn't exist.

SuperPI has a special place in my heart simply because it was one of the first benchmarks I ever ran... almost 14 years ago...

So, why are all of the 15h (Bulldozer) based CPU/APU/NPUs performing so badly in SuperPI?
Some people say it is because the 15h family has 50% fewer FP units per core than the preceding 10h family.
In the 15h family, a compute unit (two cores) shares one FP unit, whereas the 10h/12h families had a dedicated FP unit for each core.

If this were the only reason, the issue would be solved by disabling the "slave" core of the CU, leaving a "private" FP for the "master" (BSC) core. However, this is not the case, and it shouldn't be, as SuperPI is single-threaded, remember?

The caches on the 15h family have higher latency than those on the 10h family, for example, and SuperPI happens to love large, low-latency caches.
The 15h family was initially designed for high frequencies. Just like F1 engines, these chips produce no power at low revs. And unfortunately, it currently doesn't seem to be possible to build an engine capable of revving high enough. We might discuss the caches more in "Episode 3"... If possible.

source

Maybe AMD is hamstrung by software implementation after all...

grimpr · Jun 21, 2013

Can't believe that in the era of AVX/TSX/FMA/XOP we're still talking about, and taking seriously, SuperPI and x87.

exar333 · Jun 21, 2013

I checked out as soon as I reached the car analogy in the paragraph. It is never that useful to release patches for synthetic performance, especially literally years too late. Not that impressed, honestly.

Phynaz · Jun 21, 2013

Manipulating undocumented(?) CPU registers to get a speedup in Superpi = :|

ShintaiDK · Jun 21, 2013

grimpr said:
Can't believe that in the era of AVX/TSX/FMA/XOP we're still talking about, and taking seriously, SuperPI and x87.

This, plus SSE btw.

Phynaz said:
Manipulating undocumented(?) CPU registers to get a speedup in Superpi = :|

And this.

galego · Jun 21, 2013

From the source

AMD 32nm SuperPI 32M record taken easily.

Another example of how software has been underusing AMD chips.

BallaTheFeared · Jun 21, 2013

I calculate 1m in .250 seconds, another example of how ancient software doesn't take advantage of modern chips.

JQuilty · Jun 21, 2013

Anybody that uses SuperPi or gives any credence to results in 2013 needs to be beaten with a baseball bat. x87 is useless and I don't think anyone particularly cares about calculating pi.

moonbogg · Jun 21, 2013

So what excuse is there for everything else running so terribly on AMD chips?

Vesku · Jun 21, 2013

moonbogg said:
So what excuse is there for everything else running so terribly on AMD chips?

I think terrible is quite a stretch, unless you zero in specifically on gaming. Even Bulldozer round 2, Piledriver, can be quite poor in single-threaded games, more so than in most single-threaded non-games.

http://www.anandtech.com/bench/Product/699?vs=363

bononos · Jun 21, 2013

galego said:
From the source

Another example of how software has been underusing AMD chips.

It would have been more impressive if the record was more than the best of '32nm AMD chips'. How does it fare next to Sandys?

ViRGE · Jun 21, 2013

JQuilty said:
Anybody that uses SuperPi or gives any credence to results in 2013 needs to be beaten with a baseball bat. x87 is useless and I don't think anyone particularly cares about calculating pi.

Indeed. This is as geeky as all heck and I love it for that. But the end result, SuperPi, has no real world significance. SuperPi has been outdated for so long that there are members of this board almost as old as it is; poor SuperPi performance was never a real concern for AMD in the first place.

Phynaz said:
Manipulating undocumented(?) CPU registers to get a speedup in Superpi = :|

I was really hoping for more details on what exactly he's done. "x87 instruction (NRAC) block" is not a description, especially since that instruction (NRAC) doesn't exist.

wlee15 · Jun 21, 2013

Still, it was SuperPI scores from Bulldozer ES that gave us the first indication of Bulldozer's weak performance.

In other news, the AIDA64 memory bandwidth tests have been updated to support multi-threading, and Bulldozer is now much more competitive in read performance.

http://www.xtremesystems.org/forums/showthread.php?286212-new-AIDA-64-seems-fixed-for-FX-CPUs

SPBHM · Jun 21, 2013

Cool. It's still very little compared to how much faster Intel CPUs are in SuperPI but, nevertheless, I think Skyrim 1.0 would also greatly benefit from this patch.

The question is, why did AMD decide not to use this?

JQuilty · Jun 22, 2013

wlee15 said:
Still, it was SuperPI scores from Bulldozer ES that gave us the first indication of Bulldozer's weak performance.

In other news, the AIDA64 memory bandwidth tests have been updated to support multi-threading, and Bulldozer is now much more competitive in read performance.

http://www.xtremesystems.org/forums/showthread.php?286212-new-AIDA-64-seems-fixed-for-FX-CPUs

AMD doesn't do x87 in hardware anymore.

SPBHM said:
Cool. It's still very little compared to how much faster Intel CPUs are in SuperPI but, nevertheless, I think Skyrim 1.0 would also greatly benefit from this patch.

The question is, why did AMD decide not to use this?

Because x87 is obsolete. There is no benefit to using it over SSE, and AMD doesn't even support it in hardware anymore. Both they and Intel have actively discouraged its use for over a decade.

Exophase · Jun 22, 2013

I don't know where all this stuff about AMD not doing x87 in hardware anymore comes from. Of course it's done in hardware, it's not like the instructions raise exceptions that make the OS emulate them. If this were happening things would be a lot slower than they are.

They could be heavily microcoded, but IIRC the timings in the tables reflect this. I think they just have worse latencies than their SSE2 equivalents.

bononos · Jun 22, 2013

One day they'll just drop x87/mmx/3dnow to save on silicon real estate and hand out some software emulator.

BallaTheFeared · Jun 22, 2013

Does AMD still even support 3DNow!?

lol, that never caught on, but sure caught my eye as a kid

ShintaiDK · Jun 22, 2013

bononos said:
One day they'll just drop x87/mmx/3dnow to save on silicon real estate and hand out some software emulator.

Hardly any savings from that. I doubt all that together takes up 0.1 mm².

Hitman928 · Jun 22, 2013

Exophase said:
They could be heavily microcoded, but IIRC the timings in the tables reflect this. I think they just have worse latencies than their SSE2 equivalents.

:thumbsup: IIRC the reliance on microcode increased in Bulldozer, but I could just be remembering the large increases in latency.

bononos said:
One day they'll just drop x87/mmx/3dnow to save on silicon real estate and hand out some software emulator.

AMD said 3dnow is no longer supported and x87 is heavily microcoded, as Exophase pointed out.

ShintaiDK · Jun 22, 2013

x87 is still done in dedicated HW, and always will be as long as it's there.

zir_blazer · Jun 22, 2013

I think people are underestimating the in-depth knowledge of both the Processor architecture and Assembler required to get a thing like this done, regardless of the uselessness of x87 in this era. I bet he spent a ton of time reading Data Sheets, White Papers and Manuals with countless pages to become aware of the existence of a Processor-specific register and how to modify it.
Wouldn't you love any sort of microcode hack that lets you unlock things like the Multiplier, Hyper Threading, VT-d, Cores, Cache L3, whatever, that mainly Intel likes to disable? Boy, I would love for some Russian ASM gurus to get into the core of how this works and attempt to reproduce it on other Processors. At the very least, this is one step closer to that.

Atreidin · Jun 22, 2013

I can't take this guy seriously when he claims to be knowledgeable about CPU architectures but thinks that x87 is done in software on these processors. That is just absurd.

Maybe he spent a lot of time reading but he sure didn't understand it all.

toyota · Jun 22, 2013

can they get a patch for this? I am sure Skyrim would be more of a concern than superpi. lol

zir_blazer · Jun 22, 2013

Atreidin said:
I can't take this guy seriously when he claims to be knowledgeable about CPU architectures but thinks that x87 is done in software on these processors. That is just absurd.

Maybe he spent a lot of time reading but he sure didn't understand it all.

Relative. The fact that you can send the Processor an instruction that gets computed directly inside it and gives you back the result doesn't mean that the Processor has purpose-specific logic dedicated to it. Actually, it may not have any sort of logic for that instruction at all, so for that task it has to rely on another unit that exists for, or is optimized for, another purpose.
The FPU of a Bulldozer module is supposed to include two 128-bit FMAC units for AVX and FMA, and another two 128-bit units that can do either x87, MMX or SSE. Those last two units should be quite general-purpose rather than purpose-specific, as I doubt there is much in common between the 80-bit floating point calculations that x87 does and what MMX and SSE do, which is SIMD. And considering that x87 should have much lower priority than SSE, I doubt they spent too many resources on it. So basically, when you run x87 code on a Bulldozer, you have to run it on a non-specific unit that should provide inferior performance for those instructions.
This is similar to what Intel did with the Pentium 3 Katmai. It introduced SSE support, but the logic for it was subpar because they were trying not to increase the die size too much. As some Instruction Sets fade away (or get introduced but aren't expected to be picked up quickly in that generation, so nobody wants to waste die space on them), chances are that there is no dedicated logic for them, and they instead get processed by other units inside the Processor in a sort of "Software emulation" fashion.
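
On the 80-bit point above: the x87 extended format carries a 64-bit significand, while an SSE2 double carries 53 bits (52 stored plus an implicit leading bit). Here is a minimal Python sketch of what that difference means. It models only significand rounding with exact fractions, with no exponent limits, denormals or real FPU behaviour, and the helper name round_to_significand is hypothetical.

```python
from fractions import Fraction


def round_to_significand(x: Fraction, bits: int) -> Fraction:
    """Round a positive value to the nearest number with `bits` significand bits.

    Ties go to even. Exponent range is unlimited in this toy model.
    """
    if x <= 0 or bits < 1:
        raise ValueError("expects x > 0 and bits >= 1")
    # Find e such that 2**e <= x < 2**(e + 1).
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    # Scale so the significand becomes an integer of `bits` bits, then round.
    scale = Fraction(2) ** (bits - 1 - e)
    return Fraction(round(x * scale)) / scale


# 1 + 2**-60 needs 61 significand bits to be stored exactly.
x = 1 + Fraction(1, 2**60)
as_double = round_to_significand(x, 53)  # SSE2 double: 53-bit significand
as_x87 = round_to_significand(x, 64)     # x87 extended: 64-bit significand

print("double keeps the small term:", as_double == x)
print("x87 extended keeps the small term:", as_x87 == x)
```

In this model the 53-bit result collapses to exactly 1, while the 64-bit result keeps the extra term. That is a data-format difference only; it says nothing about which execution units a given CPU uses to do the work.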
