x87 has been deprecated for many years now, with Intel and AMD recommending the much faster SSE instructions for the last 5 years. On modern CPUs, code using SSE instructions can easily run 1.5-2X faster than similar code using x87. By using x87, PhysX diminishes the performance of CPUs, calling into question the real benefits of PhysX on a GPU.
...
The bottom line is that Nvidia is free to hobble PhysX on the CPU by using single threaded x87 code if they wish. That choice, however, does not benefit developers or consumers, and casts substantial doubts on the purported performance advantages of running PhysX on a GPU, rather than a CPU.
Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD has deprecated x87 since the K8 in 2003, as x86-64 is defined with SSE2 support; VIA's C7 has supported SSE2 since 2005.
The Athlon 3000 supported SSE as well as 3DNow!
I wonder if someone who works in the biz like Scali can add to this.
Is this true to the extent that the author claims? That is disgusting, man. I understand pimping your product, but PhysX would see more adoption the more platforms it can be used on. Just charge them all a license! You want PhysX on your ATI+Intel system? Buy the PhysX middleware for $x.xx.
Imagine the performance boost for users who run GT250s or less.
Crazy.
Probably more important is what the consoles support.
If it's a compiler flag change then sure, update it on Windows (although if some of your target Windows CPUs don't support it then you can't; you have to compile for the lowest common denominator).
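For what it's worth, on a 32-bit target this really can be just a flag: the same scalar source compiles to either x87 or scalar SSE2 depending on the build switches. A tiny illustration (dampedVelocity is a made-up function; the GCC/MSVC flags shown are the standard ones):

```cpp
// Same scalar source; which FP instruction set you get is a build-flag choice
// on 32-bit x86, where x87 is the default:
//
//   g++ -m32 -O2 physics.cpp                      -> x87 code (FMUL/FSUB)
//   g++ -m32 -O2 -msse2 -mfpmath=sse physics.cpp  -> scalar SSE2 (MULSS/SUBSS)
//   cl /O2 /arch:SSE2 physics.cpp                 -> scalar SSE2 (MSVC)
//
float dampedVelocity(float v, float damping, float dt)
{
    return v * (1.0f - damping * dt);
}
```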
The main difference from SSE is that SSE2 adds double-precision support, allowing it to act as a near-complete replacement for x87, where SSE only had single precision.
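To make that distinction concrete, here's a minimal intrinsics sketch (the function names are mine): SSE packs four floats per register, and SSE2 adds a two-lane double type, which is what lets it stand in for x87:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics (also pulls in SSE)

// SSE: four single-precision lanes per operation.
__m128 add4f(__m128 a, __m128 b) { return _mm_add_ps(a, b); }

// SSE2: two double-precision lanes, covering what previously needed x87.
__m128d add2d(__m128d a, __m128d b) { return _mm_add_pd(a, b); }
```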
In terms of physics, movnti, rcpps, rcpss and clflush seem to be far more useful than double precision.
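As a hedged example of why rcpps matters for physics: a fast reciprocal estimate plus one Newton-Raphson step is typically much cheaper than a full divide in a normalization-heavy inner loop (fast_recip is an illustrative name, not from any SDK):

```cpp
#include <xmmintrin.h>  // SSE

// rcpps returns a ~12-bit reciprocal estimate; one Newton-Raphson step
// (r' = r * (2 - x*r)) refines it to near full single precision, and the
// whole sequence is still cheaper than divps in a hot loop.
__m128 fast_recip(__m128 x)
{
    __m128 r = _mm_rcp_ps(x);  // rcpps: fast estimate
    return _mm_mul_ps(r, _mm_sub_ps(_mm_set1_ps(2.0f), _mm_mul_ps(x, r)));
}
```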
If you want to have maximum benefit from SSE, you need to rewrite the code with SSE optimizations. Which is a LOT of work. And even if you do, I doubt you'd get the 1.5x-2x gains that are claimed in the article.
However, I'm with you when you state the gain will be much smaller for the whole library (you'll have to factor in sections of code that do not scale, overheads, etc.).
That's the part that gets me. As Ben noted, the minimum system requirements don't necessarily call for SSE2 (although I suspect they meant a K8 3000+, not a K7 3000+), but that shouldn't make a difference because they can use different code paths. Given this I'm not at all surprised that they have an x87 code path, but I'm very surprised that they only have an x87 code path.

Certain compilers (not MS and GCC afaik) can compile multiple paths into an executable and select the proper one at runtime. Alternatively you can manually compile an x87 and an SSE-powered path and select the correct one at runtime.
It's not uncommon for software to have optional support for instruction set extensions this way.
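A minimal sketch of that manual approach, assuming GCC/Clang (all the function names here are hypothetical; MSVC would use __cpuid from <intrin.h> instead, and the SSE translation unit has to be built with SSE enabled):

```cpp
#include <emmintrin.h>

// Plain scalar path: what a pre-SSE target falls back to (x87 on 32-bit x86).
static void integrate_scalar(float* p, const float* v, int n, float dt)
{
    for (int i = 0; i < n; ++i)
        p[i] += v[i] * dt;
}

// SSE path: four positions per iteration (n assumed a multiple of 4 here).
static void integrate_sse(float* p, const float* v, int n, float dt)
{
    const __m128 d = _mm_set1_ps(dt);
    for (int i = 0; i < n; i += 4) {
        const __m128 pos = _mm_loadu_ps(p + i);
        const __m128 vel = _mm_loadu_ps(v + i);
        _mm_storeu_ps(p + i, _mm_add_ps(pos, _mm_mul_ps(vel, d)));
    }
}

using IntegrateFn = void (*)(float*, const float*, int, float);

// GCC/Clang feature check; done once at startup.
static IntegrateFn pick_integrator()
{
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse") ? integrate_sse : integrate_scalar;
}
```

Call pick_integrator() once at startup and stash the function pointer; the handful of pre-SSE users still run, and everyone else gets the vectorized path.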
Can't say I'm all that surprised if there is truth to this.
Given this I'm not at all surprised that they have an x87 code path, but I'm very surprised that they only have an x87 code path.
Scali said: Multithreading is also left to the developer
Feel free to correct me if I'm mistaken, but I don't think threads and core (thread parallelism) have anything to do with SSE (instruction parallelism) ...
As for multi-threading, I didn't dig into the library, so I can't tell if it's almost free to implement with PhysX or if you have to struggle to make it happen. What I gathered from my reading is that some complain that multi-threading is not embedded in the engine (the library does not spawn threads on its own). We developers all know how lazy we are ...
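In that model the burden really is on the application. A hypothetical sketch (Island, simulateIsland and stepPhysics are stand-in names, not the PhysX API) of what "the library doesn't spawn threads" means in practice:

```cpp
#include <thread>
#include <vector>

struct Island { /* bodies, joints, contacts ... */ };

// Stand-in for the SDK's solver call on one independent group of bodies.
void simulateIsland(Island& island, float dt)
{
    (void)island; (void)dt;  // per-island solver work would go here
}

// The game, not the library, decides how to spread the work over cores.
void stepPhysics(std::vector<Island>& islands, float dt)
{
    std::vector<std::thread> workers;
    workers.reserve(islands.size());
    for (Island& island : islands)
        workers.emplace_back([&island, dt] { simulateIsland(island, dt); });
    for (std::thread& t : workers)
        t.join();  // application owns thread lifetime and scheduling
}
```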
Nadeem Mohammad said: I have been a member of the PhysX team, first with AGEIA, and then with Nvidia, and I can honestly say that since the merger with Nvidia there have been no changes to the SDK code which purposely reduce the software performance of PhysX or its use of CPU multi-cores.
Our PhysX SDK API is designed such that thread control is done explicitly by the application developer, not by the SDK functions themselves. One of the best examples is 3DMark Vantage which can use 12 threads while running in software-only PhysX. This can easily be tested by anyone with a multi-core CPU system and a PhysX-capable GeForce GPU. This level of multi-core support and programming methodology has not changed since day one. And to anticipate another ridiculous claim, it would be nonsense to say we “tuned” PhysX multi-core support for this case.
PhysX is a cross platform solution. Our SDKs and tools are available for the Wii, PS3, Xbox 360, the PC and even the iPhone through one of our partners. We continue to invest substantial resources into improving PhysX support on ALL platforms--not just for those supporting GPU acceleration.
As is par for the course, this is yet another completely unsubstantiated accusation made by an employee of one of our competitors. I am writing here to address it directly and call it for what it is, completely false. Nvidia PhysX fully supports multi-core CPUs and multithreaded applications, period. Our developer tools allow developers to design their use of PhysX in PC games to take full advantage of multi-core CPUs and to fully use the multithreaded capabilities.
No, but I don't mean to imply that the two are related... just that the PPU and GPU handle 'threads' in a different way from CPUs. If you want to run multiple threads with SSE-optimized routines, you need to keep them fed too. If not, the effect of the SSE optimizations will be marginal. The PPU was designed in a way that also keeps its units fed. That's going to be very difficult on a CPU. CPU threads have quite a bit of overhead for creation, switching, synchronization, etc.
If a physics library just blasts off threads by itself, it will be harder to control the CPU resources.
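The creation/switching overhead is easy to demonstrate for yourself. A rough, runnable illustration (absolute numbers will vary wildly by OS and hardware): if an empty create-and-join already costs tens of microseconds, the work handed to each thread has to be big enough to hide that.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

int main()
{
    using clock = std::chrono::steady_clock;
    const int N = 1000;
    const auto t0 = clock::now();
    for (int i = 0; i < N; ++i)
        std::thread([] {}).join();  // create + destroy, no actual work
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                        clock::now() - t0).count();
    std::printf("avg thread create/join: %lld us\n",
                static_cast<long long>(us / N));
}
```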
Hmm ... I kinda understand your point, but you don't have to implement thread parallelism to benefit from instruction parallelism.
Yup, as I say... the raw parallel and computational power aren't the most important factor.
The PPU was relatively modest in terms of both. I think it had 12 execution units and was rated at about 58 GFLOPS peak performance (a quad-core Core i7 would be around 80 GFLOPS).
It's the design that made the difference. x86 is not very efficient at shoveling data from one core (thread) to the next, and computational power is often crippled by branch or cache mispredictions.
A GPU is more nimble in how it handles threads. It's a completely different approach. Still, the effective PhysX performance is not all that impressive when compared to the PhysX PPU. GPUs are mainly faster at PhysX because they have 1+ TFLOPS and hundreds of parallel threads to start with.
So in theory you may be able to speed up some computations by a factor of 3-4 with SSE, but it may not have much of an impact overall, because a lot of time is lost in other areas. Just the overall inefficiency of a general-purpose CPU.
If most of the time is spent in gathering and dispatching data to your threads, then any instruction-level optimizations are going to be marginal at best...
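A quick Amdahl's-law sanity check makes the same point; the numbers here are assumptions for illustration, not measurements. If half the runtime is in SSE-friendly math and that half gets 4x faster:

```latex
S_{\text{overall}} = \frac{1}{(1-p) + p/s}
\qquad p = 0.5,\; s = 4
\;\Rightarrow\; S_{\text{overall}} = \frac{1}{0.5 + 0.125} = 1.6
```

So a 3-4x win in the kernels turns into roughly 1.6x overall, and that's before any of the overheads mentioned above eat into it.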
Interesting. How would AVX play into this?
Sure, but given that no implementation of CPU PhysX is multi-threaded at this time (apart from Vantage, it seems) ...
However, I agree other sections of the API could be bottlenecked as well. And that's assuming the game engine can still feed the PhysX engine fast enough if we (they) make the PhysX engine faster ...
