Are you asking about flavors of Windows, or all OSes in general? The code of Windows is highly proprietary. If you want to compare with Linux, a comparison of the relative benefits of a monolithic kernel model (i.e. Linux) vs. a microkernel (Windows NT) is a place to start... It depends on who you ask. Old theory said the microkernel was much more difficult to optimize, but much more flexible. Even Linus has admitted that he feels Microsoft has overcome most of the difficulties of designing a solid microkernel system (Apple's OS X uses a pseudo-microkernel too). It depends on the code you write.
Since the actual kernel code is not something I can look at, I can only estimate from performance benchmark numbers I've read about and run myself. Since you really can't "see" the difference (it's very small), it's hard to argue either way without simply doing the benchmarks, and it varies for different applications. Like I said, I've seen benchmarks that show XP faster, some that show 98 faster, some that show 2000 faster, and some that would run faster on DOS. It REALLY depends on the application, the APIs used, and the type of processing. Is it multithreaded or single-threaded? Does it use heap memory? Stack? Do you make an effort to use SSE2? What math libraries? What threading libraries? How much precision are you using? Is it branch-heavy code? Can it be executed in parallel for distributed execution, or is it sequential? Does it operate on single data or multiple? Does it draw bursts of bandwidth or a steady flow? Can you prefetch data? What about code locality (spatial, temporal)? What *language* are you writing it in? They all change the answer, and they favor different HARDWARE too.
I have a lot more experience with the hardware side. But again, I could go on and on about the theory of simultaneous multithreading (HyperThreading) and the cases where it benefits you... that wouldn't do any good. It slows down certain single-threaded tasks, but VASTLY improves multithreaded, parallelizable tasks.
The numbers are pretty clear in AnandTech's Pentium 4 3.06GHz review. Go to the main page and click the "CPU" button on the right side. It should be right there. There are hundreds of benchmark results on all types of applications. Since I don't know what sort of application you're developing, I can't say where you might fall in those. Then again, maybe you should get one of IBM's Power4 chips. They fly at data-heavy work, but they don't run Windows. Same with Sun's US3. *Shrugs*
Besides, if you're that concerned about subtle performance differences, it would be more important to use a system where you're FAMILIAR with the best means of optimizing code and compilers (for SSE2, as an example), rather than relying on someone else's code and compiler settings to try to prove some subtle speed variations. It's been shown time and again that with raw x86 floating point, the Athlon rules. If you're willing to use SSE2, you will love the P4. If you can write data prefetching into the code, it will help; but with some chipsets (i850/845/nForce) already integrating some aspects of prefetching, it can actually slow things down, as the two prefetching algorithms sometimes clash and flush the data you intentionally prefetched because the chipset doesn't think it's as relevant. If you get a processor with SMT (HyperThreading), you need to write code as if it were for 2 CPUs to get the best speed, but if you do that on a chip that doesn't support SMT, you're going to slow it down a bit because of thread overhead. It's all tradeoffs...
hehe like I said, depends on the app when talking about subtle differences.
Eric