Okay, if you're that fixated on asm code, let's play a small game:
We both write a nice small Boyer-Moore string search - a trivial algorithm that every decent programmer should be able to write in less than half an hour.. at least in C or any other higher-level language, that is (but still something I could imagine giving to students in an asm course, it's not that hard).
Let's see how long it takes you in asm and how much performance you gain compared to my C version.
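To give an idea of how small the C side of that game is, here's a minimal sketch. Note that it's the Horspool simplification (bad-character table only, no good-suffix rule), not full Boyer-Moore, and the name bmh_search and its signature are just my own choice for illustration:

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Boyer-Moore-Horspool search (simplified Boyer-Moore: bad-character
 * shifts only). Returns a pointer to the first occurrence of needle
 * in haystack, or NULL if there is none. An empty needle matches at
 * the start. The function name and signature are illustrative only. */
const char *bmh_search(const char *haystack, size_t hlen,
                       const char *needle, size_t nlen)
{
    size_t skip[UCHAR_MAX + 1];
    size_t i, pos;

    assert(haystack != NULL && needle != NULL);

    if (nlen == 0)
        return haystack;
    if (nlen > hlen)
        return NULL;

    /* Default shift: the whole pattern length. */
    for (i = 0; i <= UCHAR_MAX; i++)
        skip[i] = nlen;
    /* For every pattern byte except the last one: distance from its
     * last occurrence to the end of the pattern. */
    for (i = 0; i + 1 < nlen; i++)
        skip[(unsigned char)needle[i]] = nlen - 1 - i;

    pos = 0;
    while (pos <= hlen - nlen) {
        /* Compare right to left inside the current window. */
        size_t j = nlen;
        while (j > 0 && haystack[pos + j - 1] == needle[j - 1])
            j--;
        if (j == 0)
            return haystack + pos;
        /* Shift by the table entry of the window's last byte. */
        pos += skip[(unsigned char)haystack[pos + nlen - 1]];
    }
    return NULL;
}
```

Calling it looks like bmh_search(text, strlen(text), "example", 7). The full algorithm adds the good-suffix table on top of this, which is more code but still nothing exotic.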
And really, don't forget that that's a trivial algorithm, nowhere near what the kernel of a modern OS looks like. So if you want something a little more demanding, how about a multithreaded (after all, multithreading is the way to go in this day and age) 3D stencil computation? Cache-oblivious while we're at it.
If you get THAT right in assembly and get a measurable performance improvement compared to an optimized Cilk program, I'll applaud you and you should be able to get a job wherever you want.. honestly, I don't think I could write that efficiently even in something like MPI.
Optimizing a program written in C to get more performance out of it? For some applications, especially in HPC, sure that's worth it. Manual register allocation, making sure that the FMAs are maximized, yep.
And I'm also sure that some parts of Windows are heavily optimized, but writing complex programs completely in asm? Never, it's just not worth the cost. Also, you seem to forget that if you're using asm you've got to write every program half a dozen times: Itanium, x86, x86-64, or even for such trivial things as whether SSE is available or not.
And at that level, even the architecture itself starts to play an important role. I've seen PowerPCs that used parts of the FP pipe for ld/st ops.. which meant you just couldn't get more than ~90% of peak performance out of them.
No, there are many better ways to optimize programs today. You can probably speed up 50% of all matrix multiplications by rewriting them as cache-oblivious, for example - that's something no compiler can do for you.
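For anyone who hasn't seen it, the cache-oblivious idea for matrix multiplication is just divide and conquer: keep halving the largest dimension until the sub-problem is small, and it ends up fitting into whatever cache levels the machine has without the code knowing their sizes. A rough sketch, assuming row-major double matrices; the function names and the BASE cutoff are my own picks (the cutoff only limits recursion overhead, it isn't tuned to any cache):

```c
#include <assert.h>
#include <stddef.h>

#define BASE 16 /* assumed recursion cutoff, not a cache parameter */

/* C[m x p] += A[m x n] * B[n x p], all row-major, with leading
 * dimensions (row strides) lda, ldb, ldc. Splits the largest
 * dimension in half until every dimension is at most BASE. */
void matmul_rec(const double *A, const double *B, double *C,
                size_t m, size_t n, size_t p,
                size_t lda, size_t ldb, size_t ldc)
{
    if (m <= BASE && n <= BASE && p <= BASE) {
        /* Small block: plain triple loop. */
        for (size_t i = 0; i < m; i++)
            for (size_t k = 0; k < n; k++)
                for (size_t j = 0; j < p; j++)
                    C[i * ldc + j] += A[i * lda + k] * B[k * ldb + j];
        return;
    }
    if (m >= n && m >= p) {
        /* Split the rows of A and C. */
        size_t h = m / 2;
        matmul_rec(A, B, C, h, n, p, lda, ldb, ldc);
        matmul_rec(A + h * lda, B, C + h * ldc, m - h, n, p, lda, ldb, ldc);
    } else if (n >= p) {
        /* Split the shared dimension: both halves add into the same C. */
        size_t h = n / 2;
        matmul_rec(A, B, C, m, h, p, lda, ldb, ldc);
        matmul_rec(A + h, B + h * ldb, C, m, n - h, p, lda, ldb, ldc);
    } else {
        /* Split the columns of B and C. */
        size_t h = p / 2;
        matmul_rec(A, B, C, m, n, h, lda, ldb, ldc);
        matmul_rec(A, B + h, C + h, m, n, p - h, lda, ldb, ldc);
    }
}

/* Convenience wrapper for densely packed matrices; C must already
 * hold the values to accumulate into (e.g. all zeros). */
void matmul_cache_oblivious(const double *A, const double *B, double *C,
                            size_t m, size_t n, size_t p)
{
    assert(A != NULL && B != NULL && C != NULL);
    matmul_rec(A, B, C, m, n, p, n, p, p);
}
```

The point is that this is a change of algorithm structure, not of instruction selection, which is why you can't expect a compiler to come up with it from the naive triple loop.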
Engineering is always about tradeoffs. At Google, for example, lots of new code is written in Python (well, C++ is still dominant, but that has historic reasons) and you really need to make a case before something gets rewritten in C++. The performance improvements have to be large enough to warrant all the invested time, testing and - extremely important but somehow completely ignored here - maintainability.
Some parts of the inner search loop surely are asm, since that's where it matters, but that's the exception and can be compared to the inner loops of the scheduler in Windows, which are surely optimized as well.