x87 has been deprecated for many years now, with Intel and AMD recommending the much faster SSE instructions for the last 5 years. On modern CPUs, code using SSE instructions can easily run 1.5-2X faster than similar code using x87. By using x87, PhysX diminishes the performance of CPUs, calling into question the real benefits of PhysX on a GPU.
...
The bottom line is that Nvidia is free to hobble PhysX on the CPU by using single threaded x87 code if they wish. That choice, however, does not benefit developers or consumers, and casts substantial doubts on the purported performance advantages of running PhysX on a GPU, rather than a CPU.
Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD has deprecated x87 since the K8 in 2003, as x86-64 is defined with SSE2 support; VIA's C7 has supported SSE2 since 2005.
The Athlon 3000 supported SSE as well as 3DNow!
I wonder if someone who works in the biz like Scali can add to this.
Is this true to the extent that the author claims? That is disgusting, man. I understand pimping your product, but PhysX would see more adoption the more platforms it can be used on. Just charge them all a license! You want PhysX on your ATI+Intel system? Buy the PhysX middleware for $x.xx.
Imagine the performance boost for users who run GT250s or less.
Crazy.
Probably more important is what the consoles support.
If it's a compiler flag change then sure, update it on Windows (although if some of your target Windows CPUs don't support it then you can't; you have to compile for the lowest common denominator).
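For what it's worth, on a 32-bit target this really can be just a flag: the same scalar source compiles to either x87 or scalar SSE2 depending on the build switches. A tiny illustration (dampedVelocity is a made-up function; the GCC/MSVC flags shown are the standard ones):

```cpp
// Same scalar source; which FP instruction set you get is a build-flag choice
// on 32-bit x86, where x87 is the default:
//
//   g++ -m32 -O2 physics.cpp                      -> x87 code (FMUL/FSUB)
//   g++ -m32 -O2 -msse2 -mfpmath=sse physics.cpp  -> scalar SSE2 (MULSS/SUBSS)
//   cl /O2 /arch:SSE2 physics.cpp                 -> scalar SSE2 (MSVC)
//
float dampedVelocity(float v, float damping, float dt)
{
    return v * (1.0f - damping * dt);
}
```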
The main difference from SSE is that SSE2 adds double-precision support, allowing it to act as a near-complete replacement for x87, where SSE only had single precision.
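To make that distinction concrete, here's a minimal intrinsics sketch (the function names are mine): SSE packs four floats per register, and SSE2 adds a two-lane double type, which is what lets it stand in for x87:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in SSE)

// SSE: four single-precision lanes per operation.
__m128 add4f(__m128 a, __m128 b) { return _mm_add_ps(a, b); }

// SSE2: two double-precision lanes, covering what previously needed x87.
__m128d add2d(__m128d a, __m128d b) { return _mm_add_pd(a, b); }
```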
In terms of physics, movnti, rcpps, rcpss and clflush seem to be far more useful than double precision.
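As a hedged example of why rcpps matters for physics: a fast reciprocal estimate plus one Newton-Raphson step is typically much cheaper than a full divide in a normalization-heavy inner loop (fast_recip is an illustrative name, not from any SDK):

```cpp
#include <xmmintrin.h>  // SSE

// rcpps returns a ~12-bit reciprocal estimate; one Newton-Raphson step
// (r' = r * (2 - x*r)) refines it to near full single precision, and the
// whole sequence is still cheaper than divps in a hot loop.
__m128 fast_recip(__m128 x)
{
    __m128 r = _mm_rcp_ps(x);  // rcpps: fast estimate
    return _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(x, r)));
}
```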
If you want to have maximum benefit from SSE, you need to rewrite the code with SSE optimizations. Which is a LOT of work. And even if you do, I doubt you'd get the 1.5x-2x gains that are claimed in the article.
However, I'm with you when you state the gain will be much smaller for the whole library (you'll have to factor in sections of code that do not scale, overheads, etc.).
That's the part that gets me. As Ben noted, the minimum system requirements don't necessarily call for SSE2 (although I suspect they meant a K8 3000+, not a K7 3000+), but that shouldn't make a difference because they can use different code paths. Given this I'm not at all surprised that they have an x87 code path, but I'm very surprised that they only have an x87 code path.

Certain compilers (not MS and GCC afaik) can compile multiple paths into an executable and select the proper one at runtime. Alternatively you can manually compile an x87 and an SSE-powered path and select the correct one at runtime.
It's not uncommon for software to have optional support for instruction set extensions this way.
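A minimal sketch of that manual approach, assuming GCC/Clang (all the function names here are hypothetical; MSVC would use __cpuid from <intrin.h> instead, and the SSE translation unit has to be built with SSE enabled):

```cpp
#include <emmintrin.h>

// Plain scalar path: what a pre-SSE target falls back to (x87 on 32-bit x86).
static void integrate_scalar(float* p, const float* v, int n, float dt)
{
    for (int i = 0; i < n; ++i)
        p[i] += v[i] * dt;
}

// SSE path: four positions per iteration (n assumed a multiple of 4 here).
static void integrate_sse(float* p, const float* v, int n, float dt)
{
    const __m128 d = _mm_set1_ps(dt);
    for (int i = 0; i < n; i += 4) {
        const __m128 pos = _mm_loadu_ps(p + i);
        const __m128 vel = _mm_loadu_ps(v + i);
        _mm_storeu_ps(p + i, _mm_add_ps(pos, _mm_mul_ps(vel, d)));
    }
}

using IntegrateFn = void (*)(float*, const float*, int, float);

// GCC/Clang feature check; done once at startup.
static IntegrateFn pick_integrator()
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse") ? integrate_sse : integrate_scalar;
}
```

Call pick_integrator() once at startup and stash the function pointer; the handful of pre-SSE users still run, and everyone else gets the vectorized path.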
Can't say I'm all that surprised if there is truth to this.
Given this I'm not at all surprised that they have an x87 code path, but I'm very surprised that they only have an x87 code path.
Scali said: Multithreading is also left to the developer
Feel free to correct me if I'm mistaken, but I don't think threads and core (thread parallelism) have anything to do with SSE (instruction parallelism) ...
As for multi-threading, I didn't dig into the library, so I can't tell if it's almost free to implement with PhysX or if you have to struggle to make it happen. What I gathered from my reading is that some complain that multi-threading is not embedded in the engine (the library does not spawn threads on its own). We developers all know how lazy we are ...
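In that model the burden really is on the application. A hypothetical sketch (Island, simulateIsland and stepPhysics are stand-in names, not the PhysX API) of what "the library doesn't spawn threads" means in practice:

```cpp
#include <thread>
#include <vector>

struct Island { /* bodies, joints, contacts ... */ };

// Stand-in for the SDK's solver call on one independent group of bodies.
void simulateIsland(Island& island, float dt)
{
    (void)island; (void)dt;  // per-island solver work would go here
}

// The game, not the library, decides how to spread the work over cores.
void stepPhysics(std::vector<Island>& islands, float dt)
{
    std::vector<std::thread> workers;
    workers.reserve(islands.size());
    for (Island& island : islands)
        workers.emplace_back([&island, dt] { simulateIsland(island, dt); });
    for (std::thread& t : workers)
        t.join();  // application owns thread lifetime and scheduling
}
```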
Nadeem Mohammad said: I have been a member of the PhysX team, first with AGEIA, and then with Nvidia, and I can honestly say that since the merger with Nvidia there have been no changes to the SDK code which purposely reduce the software performance of PhysX or its use of CPU multi-cores.
Our PhysX SDK API is designed such that thread control is done explicitly by the application developer, not by the SDK functions themselves. One of the best examples is 3DMark Vantage which can use 12 threads while running in software-only PhysX. This can easily be tested by anyone with a multi-core CPU system and a PhysX-capable GeForce GPU. This level of multi-core support and programming methodology has not changed since day one. And to anticipate another ridiculous claim, it would be nonsense to say we “tuned” PhysX multi-core support for this case.
PhysX is a cross platform solution. Our SDKs and tools are available for the Wii, PS3, Xbox 360, the PC and even the iPhone through one of our partners. We continue to invest substantial resources into improving PhysX support on ALL platforms--not just for those supporting GPU acceleration.
As is par for the course, this is yet another completely unsubstantiated accusation made by an employee of one of our competitors. I am writing here to address it directly and call it for what it is, completely false. Nvidia PhysX fully supports multi-core CPUs and multithreaded applications, period. Our developer tools allow developers to design their use of PhysX in PC games to take full advantage of multi-core CPUs and to fully use the multithreaded capabilities.
No, but I don't mean to imply that the two are related... just that the PPU and GPU handle 'threads' in a different way from CPUs. If you want to run multiple threads with SSE-optimized routines, you need to keep them fed too. If not, the effect of the SSE optimizations will be marginal. The PPU was designed in a way that also keeps its units fed. That's going to be very difficult on a CPU. CPU threads have quite a bit of overhead for creation, switching, synchronization, etc.
If a physics library just blasts off threads by itself, it will be harder to control the CPU resources.
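The creation/switching overhead is easy to demonstrate for yourself. A rough, runnable illustration (absolute numbers will vary wildly by OS and hardware): if an empty create-and-join already costs tens of microseconds, the work handed to each thread has to be big enough to hide that.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

int main()
{
    using clock = std::chrono::steady_clock;
    const int N = 1000;
    const auto t0 = clock::now();
    for (int i = 0; i < N; ++i)
        std::thread([] {}).join();  // create + destroy, no actual work
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                        clock::now() - t0).count();
    std::printf("avg thread create/join: %lld us\n",
                static_cast<long long>(us / N));
}
```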
Hmm ... I kinda understand your point, but you don't have to implement thread parallelism to benefit from instruction parallelism.
Yup, as I say... the raw parallel and computational power aren't the most important factor.
The PPU was relatively modest in terms of both. I think it had 12 execution units and was rated at about 58 GFLOPS peak performance (a quad-core Core i7 would be around 80 GFLOPS).
It's the design that made the difference. x86 is not very efficient at shoveling data from one core (thread) to the next, and computational power is often crippled by branch or cache mispredictions.
A GPU is more nimble in how it handles threads. It's a completely different approach. Still, the effective PhysX performance is not all that impressive when compared to the PhysX PPU. GPUs are mainly faster at PhysX because they have 1+ TFLOPS and hundreds of parallel threads to start with.
So in theory you may be able to speed up some computations by a factor of 3-4 with SSE, but it may not have much of an impact overall, because a lot of time is lost in other areas. Just the overall inefficiency of a general-purpose CPU.
If most of the time is spent in gathering and dispatching data to your threads, then any instruction-level optimizations are going to be marginal at best...
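A quick Amdahl's-law sanity check makes the same point; the numbers here are assumptions for illustration, not measurements. If half the runtime is in SSE-friendly math and that half gets 4x faster:

```latex
S_{\text{overall}} = \frac{1}{(1-p) + p/s}
\qquad p = 0.5,\; s = 4
\;\Rightarrow\; S_{\text{overall}} = \frac{1}{0.5 + 0.125} = 1.6
```

So a 3-4x win in the kernels turns into roughly 1.6x overall, and that's before any of the overheads mentioned above eat into it.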
Interesting. How would AVX play into this?
Sure, but given that no implementation of CPU PhysX is multi-threaded at this time (apart from Vantage, it seems) ...
However, I agree other sections of the API could be bottlenecked as well. And that's assuming the game engine can still feed the PhysX engine fast enough if we (they) make the PhysX engine faster ...
