Speaking of the techreport article, I would say they need to amend it. What they are saying there is pretty deceptive if what Scali has shown so far is accurate.
I fully agree...
A few parts are just wrong altogether, such as the multithreading.
PhysX supports multithreading just fine, it just leaves thread management to the developer (although according to nVidia, there will be automated threading in the upcoming 3.0 version).
There are examples of PhysX applications using multithreading.
So the fact that not all games use a multithreaded PhysX implementation can't be blamed on nVidia, certainly not as an attempt to slow down CPU performance. nVidia doesn't prevent you from doing it.
As for x87 vs SSE: firstly, I don't think nVidia did that deliberately, as I don't think the code has ever been anything other than x87 (not in the NovodeX or Ageia days either). So it's not as if nVidia threw away existing SSE code and replaced it with x87. SSE simply wasn't there, and nVidia didn't put any in (well, apparently there was some experimental SSE code, but it was not enabled by default, although licensed developers with access to the source code could use it). That's a completely different story. At most you can call it 'neglect', not anything like 'deliberate sabotage'.
Secondly, they need to make a better case for the performance improvements gained from SSE. Why didn't they actually DO the recompile for Bullet and use that as an example? That would be a much stronger argument than these speculative numbers (but presented more as fact than as speculation).
I'm not saying I did the recompilation correctly... But I explained what I did, and Bullet is open source, so anyone can verify my results and correct me if I'm wrong.
But assuming I'm not... then they don't HAVE a case. If you indeed get about 10-20% performance gain from SSE vector optimizations, that is never going to be enough to close the gap with GPUs.
And since you can already multithread on CPUs as well, apparently that part isn't going to do it either. 3DMark Vantage uses a multithreaded implementation... and actually goes as far as to add extra physics load depending on the number of cores you have (you see more objects on screen with a quad-core than with a dual-core, for example). This maximizes the benefit of multithreading. Even so, high-end GPUs are much faster. So making the CPUs 20% faster isn't going to cut it.