"PhysX hobbled on CPU by x87 code"


Kr@n

Member
Feb 25, 2010
Based on Scali's tests on Bullet, it seems physics isn't so much computation-heavy as bandwidth-constrained, or something along those lines. There is no way a massively parallel, computation-heavy algorithm would fail to benefit from a good SSE implementation unless there is another bottleneck somewhere (memory bandwidth, cache misses, thread management and synchronization, etc.)...
 

Scali

Banned
Dec 3, 2004
Well, similar to packet raytracing... the packets themselves are parallel, but there is some management involved (splitting up packets, regrouping etc).
But raytracing usually doesn't need a lot of bounces per ray, so the inefficiency of the packet management is not that great.
With physics it's different: it's an iterative process. It's more as if everything in the world is reflective and your rays keep bouncing around. In that case, packet raytracing won't be that much more efficient than a simpler approach.
 

Kr@n

Member
Feb 25, 2010
Now I kinda understand. We have the same problem with incoherent rays (successive bounces keep splitting packets). Some clever implementations solve this by resorting rays into groups when packets don't contain enough rays (or by casting all rays of a particular level, then sort bounces, then continue for next level). It's quite tricky, though, and may not be applicable to physics computations.

Nevertheless, it seems nVidia has acknowledged the need to improve the efficiency of CPU PhysX by automating multi-threading and implementing SSE (if we are to believe the leaks), so it must be possible to squeeze out some performance gains that way...
 

Scali

Banned
Dec 3, 2004
Now I kinda understand. We have the same problem with incoherent rays (successive bounces keep splitting packets). Some clever implementations solve this by resorting rays into groups when packets don't contain enough rays (or by casting all rays of a particular level, then sort bounces, then continue for next level). It's quite tricky, though, and may not be applicable to physics computations.

Yes, you can see that if your packets usually contain only 1-2 active rays, they won't be all that much more efficient... The extra overhead of re-sorting packets, etc., may be more costly than the processing of the rays themselves.
That's why I said that 'threading' is done differently on a PPU/GPU. They can sort and forward the data more efficiently than a CPU can.

Nevertheless, it seems nVidia has acknowledged the need to improve the efficiency of CPU PhysX by automating multi-threading and implementing SSE (if we are to believe the leaks), so it must be possible to squeeze out some performance gains that way...

Yes, even if SSE doesn't give much of a gain, there is little reason not to use it.
As for the automated multi-threading... I'm not sure if that improves efficiency, or just allows more developers to take advantage of multithreading, since apparently not many of them bothered to do it manually.
 

Scali

Banned
Dec 3, 2004
Apparently this thread has found its way onto many other forums...
Sad part is, people are either stupid, or extremely biased...
Take this one for example:
http://hardforum.com/showpost.php?p=1035928063&postcount=56
kllrnohj said:
But yeah, switching to SSE vs. x87 does give a speedup, just not a particularly large one. Still a good speedup for little more than a compiler option.

I guess he didn't actually LOOK at the source code.
In this thread, I mentioned the BT_USE_SSE flag that I disabled.
If he actually bothered to look at the source code and search for that flag, he'd see that Bullet does a LOT more than just flipping the compiler to SSE mode. It makes full use of SIMD datatypes and packed arithmetic through the SSE intrinsics available in VS2008.

In btScalar.h we find:
Code:
#if (defined (_WIN32) && (_MSC_VER) && _MSC_VER >= 1400) && (!defined (BT_USE_DOUBLE_PRECISION))
			#define BT_USE_SSE
			#include <emmintrin.h>
#endif
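
For reference, one simple way to force the pure-x87 build for a comparison like the one described in this thread (a sketch of the approach, not necessarily how it was actually done) is to keep BT_USE_SSE from ever being defined here:
Code:
#if (defined (_WIN32) && (_MSC_VER) && _MSC_VER >= 1400) && (!defined (BT_USE_DOUBLE_PRECISION))
			//#define BT_USE_SSE		// commented out: one way to force the scalar/x87 code paths
			#include <emmintrin.h>
#endif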

And that BT_USE_SSE enables SIMD in btSolverBody.h:
Code:
///Until we get other contributions, only use SIMD on Windows, when using Visual Studio 2008 or later, and not double precision
#ifdef BT_USE_SSE
#define USE_SIMD 1
#endif

And then we get some of the good stuff, like for example this parallelized datatype:
Code:
#ifdef USE_SIMD

struct	btSimdScalar
{
	SIMD_FORCE_INLINE	btSimdScalar()
	{

	}

	SIMD_FORCE_INLINE	btSimdScalar(float	fl)
	:m_vec128 (_mm_set1_ps(fl))
	{
	}

	SIMD_FORCE_INLINE	btSimdScalar(__m128 v128)
		:m_vec128(v128)
	{
	}
	union
	{
		__m128		m_vec128;
		float		m_floats[4];
		int			m_ints[4];
		btScalar	m_unusedPadding;
	};
	SIMD_FORCE_INLINE	__m128	get128()
	{
		return m_vec128;
	}

	SIMD_FORCE_INLINE	const __m128	get128() const
	{
		return m_vec128;
	}

	SIMD_FORCE_INLINE	void	set128(__m128 v128)
	{
		m_vec128 = v128;
	}

	SIMD_FORCE_INLINE	operator       __m128()       
	{ 
		return m_vec128; 
	}
	SIMD_FORCE_INLINE	operator const __m128() const 
	{ 
		return m_vec128; 
	}
	
	SIMD_FORCE_INLINE	operator float() const 
	{ 
		return m_floats[0]; 
	}

};

///@brief Return the elementwise product of two btSimdScalar
SIMD_FORCE_INLINE btSimdScalar 
operator*(const btSimdScalar& v1, const btSimdScalar& v2) 
{
	return btSimdScalar(_mm_mul_ps(v1.get128(),v2.get128()));
}

///@brief Return the elementwise sum of two btSimdScalar
SIMD_FORCE_INLINE btSimdScalar 
operator+(const btSimdScalar& v1, const btSimdScalar& v2) 
{
	return btSimdScalar(_mm_add_ps(v1.get128(),v2.get128()));
}

Flicking a switch? Not quite. Bullet has some specific SSE optimizations. It's just that disabling them all (which I did) doesn't really make a difference (as I explained earlier in this thread).
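
As a minimal standalone illustration (my own example, not Bullet code) of what that packed arithmetic buys: the _mm_mul_ps behind the operator* above multiplies four floats in a single instruction, where x87 code needs four separate multiplies.
Code:
#include <emmintrin.h>
#include <cstdio>

int main()
{
	__m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);	// lanes hold 1, 2, 3, 4
	__m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);	// lanes hold 5, 6, 7, 8
	__m128 c = _mm_mul_ps(a, b);			// four multiplies in one instruction

	float r[4];
	_mm_storeu_ps(r, c);
	printf("%g %g %g %g\n", r[0], r[1], r[2], r[3]);	// prints: 5 12 21 32
	return 0;
}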
 

Scali

Banned
Dec 3, 2004
Yea, even guys like Andrew Lauritzen don't seem to get it. Then again, it IS Beyond3D...

But that's the sad thing... I seem to be the *only* person telling the truth... and on tons of forums, tons of people are saying I'm wrong (so much for open source... everyone can see, but nobody bothers to look. Apparently prejudice is stronger than openness).
I wish Erwin Coumans would join in.

I sent Andrew Lauritzen an email, pointing out his errors. So now we know for sure that he knows about the SIMD optimizations in Bullet. If he doesn't respond with a correction, we also know that he has no integrity (just like David Kanter... we know he has been reading this, but he hasn't responded so far. Hasn't modified his article, hasn't posted a follow-up, nothing... despite all this new information contradicting him).
 

Lonbjerg

Diamond Member
Dec 6, 2009
Yea, even guys like Andrew Lauritzen don't seem to get it. Then again, it IS Beyond3D...

But that's the sad thing... I seem to be the *only* person telling the truth... and on tons of forums, tons of people are saying I'm wrong.
I wish Erwin Coumans would join in.

I sent Andrew Lauritzen an email, pointing out his errors. So now we know for sure that he knows about the SIMD optimizations in Bullet. If he doesn't respond with a correction, we also know that he has no integrity (just like David Kanter... we know he has been reading this, but he hasn't responded so far. Hasn't modified his article, hasn't posted a follow-up, nothing... despite all this new information contradicting him).

I am done with B3D too.
That site has gone from a good 3d information site to a small bastion of AMD-fluffers that hate anything green.

Most of the posts directed at me were either fallacies (lots of ad hominem), outdated or even false information, and a lot of hot air.

It's sad really... B3D went from a great site to a mockery of a site.
 

evolucion8

Platinum Member
Jun 17, 2005
I am done with B3D too.
That site has gone from a good 3d information site to a small bastion of AMD-fluffers that hate anything green.

Most of the posts directed at me were either fallacies (lots of ad hominem), outdated or even false information, and a lot of hot air.

It's sad really... B3D went from a great site to a mockery of a site.

Because lots of forums are full of one-sided trolls like you. Regardless of brand preference, both kinds are harmful to forums, and you are pretty much criticizing the same attitude that you display here quite often. Just buy what you need to buy and be happy; neither nVidia nor AMD cares enough about your purchase to smile back at you. :)

Apparently this thread has found its way onto many other forums...
Sad part is, people are either stupid, or extremely biased...
Take this one for example:

What do you get in exchange for insulting people and calling them stupid? I don't think you would say something like that to someone's face. Remember that behind the keyboard there are people writing and expressing their opinions, and that part of being human is tolerating differences rather than imposing your point of view; that's what makes us different from animals. If you continue with your elitist attitude you will look like a jackass; just shine with your knowledge and be tolerant. :)
 

Scali

Banned
Dec 3, 2004
What do you get in exchange for insulting people and calling them stupid?

No, it's those guys who insult me first... and then make false claims about Bullet (like claiming it only does scalar SSE via the compiler with no hand-optimizations, or that I enabled SSE, when in fact it was enabled by default and I *disabled* it for the x87 version), even AFTER I pointed out what I did.
I pointed out what I did, you can download the source (and someone like kllrnohj claims he did and recompiled it) and see for yourself.
If you then STILL claim that Bullet uses only scalar SSE code, and just insult me, yea, I think that's pretty stupid.
I don't have any tolerance for that. Why should I?

So it's not 'my view' that I'm trying to 'impose'; it's simple facts, easy for everyone to verify (I just tried to verify DKanter's claims with his own suggestions). But nobody DOES.
 

evolucion8

Platinum Member
Jun 17, 2005
No, it's those guys who insult me first... and then make false claims about Bullet (like claiming it only does scalar SSE via the compiler with no hand-optimizations, or that I enabled SSE, when in fact it was enabled by default and I *disabled* it for the x87 version), even AFTER I pointed out what I did.
I pointed out what I did, you can download the source (and someone like kllrnohj claims he did and recompiled it) and see for yourself.
If you then STILL claim that Bullet uses only scalar SSE code, and just insult me, yea, I think that's pretty stupid.
I don't have any tolerance for that. Why should I?

Because you could get banned for that. Are you going to stoop to your opponent's level? That's what the 'report post' button is for. :)
 

Scali

Banned
Dec 3, 2004
Because you could get banned for that. Are you going to stoop to your opponent's level? That's what the 'report post' button is for. :)

I'm commenting on people on another forum... firstly, why would I get banned for that *here*... secondly, how is the 'report post' button here going to help me?
 

evolucion8

Platinum Member
Jun 17, 2005
You never specified that the 'stupid' comment was aimed at another forum per se; you just said that people are stupid, "take this one for example", so I was just underscoring your comment. And the report button does work.

Meh, never mind; it's not as if this is for anything but your own good, and you are too stubborn to accept it. :)
 

Lonbjerg

Diamond Member
Dec 6, 2009
Because lots of forums are full of one-sided trolls like you. Regardless of brand preference, both kinds are harmful to forums, and you are pretty much criticizing the same attitude that you display here quite often. Just buy what you need to buy and be happy; neither nVidia nor AMD cares enough about your purchase to smile back at you. :)

It's the lies (which are so stupid, and debunked with actual games) that were the final straw.

Lies from B3D posters:

  • PhysX is only single-threaded on the CPU (debunked that one with a link to Metro 2033)
  • AGEIA had better CPU multithreading (I have never seen a multithreaded AGEIA PhysX implementation; even their reality marks were single-threaded)
  • Devs are only using PhysX because NVIDIA pays them to (see the fanboyism coming out?)
  • NVIDIA has deliberately "borked" PhysX (while they only kept the core code as they got it)
All that, masked with countless smoke-and-mirrors posts such as:

  • Havok can do it better (but not posting a single game)
  • The CPU can do as well as the GPU (but not posting any game that does it)
  • "You should go to NVnwez" (as the facts are not welcome here if they are NV-positive), and a gazillion other smoke-and-mirrors posts
Kinda like a small lot here... but they are a minority, not the majority.
 

Lonbjerg

Diamond Member
Dec 6, 2009
Because you could get banned for that, are you going lower to match your opponent? that's why the purpose of a report post button. :)

Don't worship AMD's GPUs = ban on B3D... especially if you're too technical to be placed in the "n00b troll category", which I highly suspect after the number of retarded posts from the regular crowd over there.

They are like the new AMD fansite...
 

evolucion8

Platinum Member
Jun 17, 2005
It's the lies (which are so stupid, and debunked with actual games) that were the final straw.

Lies from B3D posters:

  • PhysX is only single-threaded on the CPU (debunked that one with a link to Metro 2033)
  • AGEIA had better CPU multithreading (I have never seen a multithreaded AGEIA PhysX implementation; even their reality marks were single-threaded)
  • Devs are only using PhysX because NVIDIA pays them to (see the fanboyism coming out?)
  • NVIDIA has deliberately "borked" PhysX (while they only kept the core code as they got it)
All that, masked with countless smoke-and-mirrors posts such as:

  • Havok can do it better (but not posting a single game)
  • The CPU can do as well as the GPU (but not posting any game that does it)
  • "You should go to NVnwez" (as the facts are not welcome here if they are NV-positive), and a gazillion other smoke-and-mirrors posts
Kinda like a small lot here... but they are a minority, not the majority.

Usually most games that use PhysX are single-threaded (Batman: AA, Mirror's Edge, etc.), but that isn't entirely nVidia's fault; the developers also play an important role in that. Metro 2033 is far more GPU-bound than CPU-bound, so any multi-threaded gains come more from the game's own optimization than from PhysX itself. I turned PhysX on and off with my Crossfire setup and couldn't notice a slowdown or a performance boost. I have the PhysX card with everything original from AGEIA, and their software is single-threaded, but it used to work much better.

nVidia didn't bork PhysX that much; the software was designed in 2002 using x87, and at that time SSE was barely taking off. That's why we can only blame nVidia's laziness for doing a cheap port of their PhysX runtime to CUDA and not optimizing the CPU side, but I can understand it: they don't own a CPU, so it's a waste of time for them; they only cared to optimize for their GPUs, where their market is.

Havok runs very fast on the CPU, but no one knows which instructions it uses, so it wouldn't be fair to compare the two directly on a per-code basis. CPU-optimized PhysX code can do very well, but that depends on how much parallelism the code has and on whether it avoids lots of dependencies (GPUs are terrible with those). Highly parallelizable code without many dependencies (a la Vec5) will definitely run much faster on a GPU: a single shader processor is much weaker than a current single-core Intel processor, but having lots of them (a la Cell) adds up to quite a powerful set of rails.
 

Skurge

Diamond Member
Aug 17, 2009

Dude, do you ever post anything constructive?
I was enjoying the 1st 2 pages of this thread and then you came along with your conspiracy theory that everyone is out to get nV.

No one was comparing PhysX to Havok or anything like that; you started that, and that's not what this thread is about. Scali and the rest have been posting very informative material that I'm learning a lot from. You are just giving me a headache.

Speaking of the techreport article, I would say they need to amend it. What they are saying there is pretty deceptive, if what Scali has shown so far is accurate.
 

Scali

Banned
Dec 3, 2004
Speaking of the techreport article, I would say they need to amend it. What they are saying there is pretty deceptive, if what Scali has shown so far is accurate.

I fully agree...
A few parts are just wrong altogether, such as the multithreading.
PhysX supports multithreading just fine; it just leaves thread management to the developer (although, according to nVidia, there will be automated threading in the upcoming 3.0 version).
There are examples of PhysX applications using multithreading.
So the fact that not all games use a multithreaded PhysX implementation can't be blamed on nVidia, certainly not as an attempt to slow down CPU performance. nVidia doesn't prevent you from doing it.
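
For what it's worth, a rough sketch of what 'leaving thread management to the developer' looks like in practice (the function names are illustrative placeholders, not the actual PhysX API): the game kicks the physics step off on a worker thread, does the rest of the frame's work, and joins before consuming the results.
Code:
#include <thread>

// Illustrative stand-ins; a real engine would call the SDK's own step/fetch functions here.
static void simulatePhysics(float dt) { (void)dt; /* advance the physics scene by dt */ }
static void doGameAndRenderWork()     { /* everything else the frame needs */ }

void runFrame(float dt)
{
	// Developer-managed threading: physics runs concurrently with the rest
	// of the frame, and we join before its results are consumed.
	std::thread physicsThread(simulatePhysics, dt);
	doGameAndRenderWork();
	physicsThread.join();	// sync point: physics results are now safe to read
}

int main()
{
	runFrame(1.0f / 60.0f);
	return 0;
}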

As for x87 vs SSE: firstly, I don't think nVidia did that deliberately, as I don't think the code has ever been anything other than x87 (not in the NovodeX or Ageia days either). So it's not as if nVidia threw away existing SSE code and replaced it with x87... SSE just wasn't there, and nVidia didn't put any in (or well, apparently there was some experimental SSE code, but it was not enabled by default, although licensed developers with access to the source code could use it). That's a completely different story. At most you can call it 'neglect', but nothing like 'deliberate sabotage'.

Secondly, they need to make a better case for the performance improvements gained from SSE. Why didn't they actually DO the recompile for Bullet and use that as an example? That would be a much stronger argument than these speculative numbers (but presented more as fact than as speculation).
I'm not saying I did the recompilation correctly... But I explained what I did, and Bullet is open source, so anyone can verify my results and correct me if I'm wrong.
But assuming I'm not... then they don't HAVE a case. If you indeed get about 10-20% performance gain from SSE vector optimizations, that is never going to be enough to close the gap with GPUs.

And since you can already multithread on CPUs as well, apparently that part isn't going to do it either. 3DMark Vantage uses a multithreaded implementation... and actually goes as far as adding extra physics load depending on the number of cores you have (you see more objects on screen with a quad-core than with a dual-core, for example). This maximizes the benefit of multithreading. Even so, high-end GPUs are much faster. So making the CPUs 20% faster isn't going to cut it.
 

Lonbjerg

Diamond Member
Dec 6, 2009
Dude, do you ever post anything constructive?
I was enjoying the 1st 2 pages of this thread and then you came along with your conspiracy theory that everyone is out to get nV.

You mean you enjoyed the NVIDIA bashing..based on presumptions masked as facts?
You are only posting now because it got shot down badly.

No one was comparing PhysX to Havok or anything like that; you started that, and that's not what this thread is about. Scali and the rest have been posting very informative material that I'm learning a lot from. You are just giving me a headache.

Yeah, not falling for speculation, and actually seeing one guy (Scali) doing a recompile and comparing x87 and SSE performance in the Bullet physics API (where the difference turned out to be so small it's laughable), and another guy (me) pointing out that no one crippled anything, that the argument is flawed, and that no one can show any physics API that makes CPU PhysX look borked, must really give you headaches :hmm:

Perhaps the HelloKitty forum would be more kind on your head?

Speaking of the techreport article, I would say they need to amend it. What they are saying there is pretty deceptive if what Scali as shown so far is accurate.

It's not techreport that needs to "amend" anything.
It's David Kanter, aka Realworldtech (Char-lie's IRL friend), who needs to deliver evidence for his speculations... to make up for serving his unproven speculations as facts.

It's his FUD that started this whole mess.
 

Schmide

Diamond Member
Mar 7, 2002
Scali? I finally had time to look at the Bullet code. The one flag BT_USE_SSE only affects a very small chunk of code. In fact, the flag seems to only affect the resolveSplitPenetrationSIMD function.

Edit: I got it to disable.

I will say, though: there is basically one function that has SSE implemented. I can see how there is very little, if any, difference in execution.
 

Skurge

Diamond Member
Aug 17, 2009
You mean you enjoyed the NVIDIA bashing..based on presumptions masked as facts?
You are only posting now because it got shot down badly.

I'm posting now because YOU have derailed this thread from whether PhysX would run faster with SSE into "this guy and this guy are anti-nV" and "somebody show me a Havok game that does what PhysX does".

Oh, and as for facts, remember this thread: http://forums.anandtech.com/showthread.php?t=2087488&page=3

I haven't seen you post there after your claims were "shot down badly" (isn't that the same thing the guys at Beyond3D are doing? Proven wrong, yet they don't take back what they said previously).

Yeah, not falling for speculation, and actually seeing one guy (Scali) doing a recompile and comparing x87 and SSE performance in the Bullet physics API (where the difference turned out to be so small it's laughable), and another guy (me) pointing out that no one crippled anything, that the argument is flawed, and that no one can show any physics API that makes CPU PhysX look borked, must really give you headaches :hmm:

Perhaps the HelloKitty forum would be more kind on your head?

Again, no one was comparing PhysX to anything; you claimed someone did.

It's not techreport that needs to "amend" anything.
It's David Kanter, aka Realworldtech (Char-lie's IRL friend), who needs to deliver evidence for his speculations... to make up for serving his unproven speculations as facts.

It's his FUD that started this whole mess.

I don't really care who wrote it; it should be fixed, or another article should be put up with actual testing.
 

Scali

Banned
Dec 3, 2004
Scali? I finally had time to look at the Bullet code. The one flag BT_USE_SSE only affects a very small chunk of code. In fact, the flag seems to only affect the resolveSplitPenetrationSIMD function.

Edit: I got it to disable.

I will say, though: there is basically one function that has SSE implemented. I can see how there is very little, if any, difference in execution.

Uhh... the datatypes themselves are SSE, as is the solver (and there are multiple *SIMD() functions).
Look at btSequentialImpulseConstraintSolver::solveSingleIteration() for example.
As you can see, it will call the SIMD versions of the solver if available.
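
The dispatch boils down to a pattern like this (a simplified, self-contained sketch, not verbatim Bullet source; in Bullet the SIMD variants are functions such as resolveSplitPenetrationSIMD()):
Code:
#include <cstdio>

static void resolveRowSIMD()   { std::puts("packed SSE path"); }	// stands in for a *SIMD() solver routine
static void resolveRowScalar() { std::puts("scalar/x87 path"); }	// stands in for the plain fallback

void solveSingleRow()
{
#ifdef USE_SIMD
	resolveRowSIMD();	// chosen when btSolverBody.h defined USE_SIMD
#else
	resolveRowScalar();	// chosen for the x87 build
#endif
}

int main() { solveSingleRow(); return 0; }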

It's the meat of the processing... So what's your point?

At any rate, it doesn't really matter for the original point that Kanter was going for... Apparently he assumed that Bullet, like PhysX, has little or no SSE code at all, and the compiler switch would have to do all the vectorizing. Any amount of vector-optimized SSE code in Bullet is a bonus in favour of Kanter's SSE claims... which still fails to make his point.
 

GaiaHunter

Diamond Member
Jul 13, 2008
Scali? I finally had time to look at the Bullet code. The one flag BT_USE_SSE only affects a very small chunk of code. In fact, the flag seems to only affect the resolveSplitPenetrationSIMD function.

Edit: I got it to disable.

I will say, though: there is basically one function that has SSE implemented. I can see how there is very little, if any, difference in execution.

What does the sentence in bold mean?
 

Schmide

Diamond Member
Mar 7, 2002
Uhh... the datatypes themselves are SSE, as is the solver (and there are multiple *SIMD() functions).
Look at btSequentialImpulseConstraintSolver::solveSingleIteration() for example.
As you can see, it will call the SIMD versions of the solver if available.

It's the meat of the processing... So what's your point?

Although I subscribe to the 90-10 rule*, it would be premature IMO to declare this the meat of the processing. Stepping through the code, collision detection seems to get the bulk of the processing. Also, IMO the collision detection seems a bit brute-force-ish and could do with a dose of optimization. There are no intrinsics outside ConstraintSolver.

Too bad I don't own VTune anymore.

* 90-10 rule: 90% of your processing time usually takes place in 10% of your code.
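
For what it's worth, even without VTune you can get a rough split by wrapping the suspect phases in a timer. A minimal sketch (my own; stepCollision() and stepSolver() are placeholders, not Bullet entry points):
Code:
#include <chrono>
#include <cstdio>

// Placeholders for whatever phase entry points the engine actually exposes.
static void stepCollision() { /* broadphase + narrowphase */ }
static void stepSolver()    { /* constraint solving */ }

template <typename F>
static double millisecondsSpentIn(F f)
{
	auto t0 = std::chrono::steady_clock::now();
	f();
	auto t1 = std::chrono::steady_clock::now();
	return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

int main()
{
	double collisionMs = millisecondsSpentIn(stepCollision);
	double solverMs    = millisecondsSpentIn(stepSolver);
	std::printf("collision: %.3f ms, solver: %.3f ms\n", collisionMs, solverMs);
	return 0;
}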

At any rate, it doesn't really matter for the original point that Kanter was going for... Apparently he assumed that Bullet, like PhysX, has little or no SSE code at all, and the compiler switch would have to do all the vectorizing. Any amount of vector-optimized SSE code in Bullet is a bonus in favour of Kanter's SSE claims... which still fails to make his point.

I'm not going to affirm or deny any point on either side of too many unknowns.

What does the sentence in bold mean?

It means I got the SSE to disable.
 

Scali

Banned
Dec 3, 2004
I'm not going to affirm or deny any point on either side of too many unknowns.

This I don't get.
So you don't even agree that using optimizations with SSE intrinsics will yield more performance than just compiling vanilla C++ code and letting the compiler extract and optimize all parallelism for SSE?
Because that part is not an unknown.
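
To make that distinction concrete, a minimal illustration (my own example, not Bullet code): the first function is the 'vanilla C++' case, where a compiler of that era typically emits scalar SSE (one float per instruction) even with /arch:SSE2; the second is the hand-written packed case, which is what the BT_USE_SSE code quoted earlier in the thread amounts to.
Code:
#include <emmintrin.h>

// Vanilla C++: left to the compiler, which may or may not vectorize this loop.
void scaleScalar(float* v, float s, int n)
{
	for (int i = 0; i < n; ++i)
		v[i] *= s;
}

// Hand-written SSE intrinsics: four floats per multiply, guaranteed.
void scaleSSE(float* v, float s, int n)
{
	__m128 vs = _mm_set1_ps(s);
	int i = 0;
	for (; i + 4 <= n; i += 4)
	{
		__m128 x = _mm_loadu_ps(v + i);
		_mm_storeu_ps(v + i, _mm_mul_ps(x, vs));
	}
	for (; i < n; ++i)	// scalar tail for leftover elements
		v[i] *= s;
}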