I never understood AMD's odd performance with x87. Is there a reason for it?
Odd in what way?
Back in the days of Athlon classic/XP/64 vs Pentium 3/4, it was like this:
The Pentium 3 had a relatively simple x87 unit. Intel was moving towards SSE anyway, and didn't really care about x87 performance anymore.
AMD, on the other hand, used technology from the Alpha processor to build a very advanced, pipelined x87 implementation.
This gave AMD the upper hand in x87, but Intel was usually faster with floating point when SSE was used (which AMD didn't support yet). AMD did have 3DNow!, but on an Athlon it didn't make as much sense as on the K6 anymore, now that the x87 was so powerful.
With the Pentium 4, Intel put the final nail in x87's coffin. SSE2 was going to replace the x87 completely for floating point operations (aside from some legacy stuff), and x87 was implemented in micro-code macros using the SSE execution units.
The result was that Pentium 4 had very good floating point performance when SSE2 was used, but x87 was absolutely atrocious.
AMD slowly caught up with SSE support, but their implementations weren't as strong as Intel's... Combined with AMD's superior x87 design, situations could arise where x87 was faster than SSE on Athlons.
When AMD introduced their 64-bit extensions, they took Intel's hint from the Pentium 4 and deprecated x87 for 64-bit: SSE2 became the preferred path for all floating point operations.
AMD later moved to a better, full 128-bit implementation of SSE, so their SSE wasn't a weak point anymore. Conversely, with the Core 2, Intel improved legacy x87 performance again, since it had turned out to be a weak point in the Pentium 4 and SSE2 adoption was much slower than anticipated.
Currently, Intel and AMD are reasonably well matched in both x87 and SSE (with Intel having the advantage of getting new SSE extensions first, for those programs that adopt them early)... but with 64-bit adoption going the way it is with Windows 7, x87 will probably be nothing but a bad memory soon.
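If you want to see the x87 vs SSE2 difference on your own hardware, here is a minimal sketch, assuming GCC on x86. The function name dot_product is just something I made up for illustration. The source itself doesn't choose x87 or SSE2; the compiler flags do. With GCC, something like `-O2 -m32 -mfpmath=387` should give you x87 code, and `-O2 -m32 -msse2 -mfpmath=sse` should give you scalar SSE2 code, so you can time the same loop both ways.

```c
#include <stddef.h>

/* Plain scalar double-precision kernel (hypothetical example).
   Nothing in the source selects x87 or SSE2; the compiler flags
   decide which floating point instructions are emitted. */
double dot_product(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];   /* one multiply and one add per element */
    return sum;
}
```

Keep in mind that x87 can hold intermediates in 80-bit extended precision, so the two builds may differ in the last bits of the result on less trivial inputs.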
I would love to know why. It also seems to me that Penryn can perform slightly better than the i7 used, thanks to its very low-latency cache. Maybe code that puts more pressure on the execution resources, or that is heavily multithreaded, will shine on the i7.
Penryn has the advantage that two cores share a relatively large, low-latency L2 cache. In some cases, when you need to share data between only two threads at a time, you can take advantage of that and do things the i7 simply cannot do.