If the OS were written in optimized ASM... maybe we'd have a real-time 3D-rendered desktop right now, and whatever other fancy-schmancy Matrix-style stuff, simply because the optimized code WOULD be able to do it with the given hardware.
I doubt it would make much of a difference really.
We already use a 3D accelerator for all the fancy effects.
Which means that:
1) Custom-designed hardware is way more power-efficient than a CPU performing the same task. Most of the savings here come from the hardware, not the software.
A fine example of this is HD video playback. A regular Atom-based system is not capable of it on its own: you need a pretty high-end CPU to decode HD video in realtime, which would give you a LOT more power consumption than the Atom.
However, pair an Atom with a decent IGP, such as the nVidia Ion, and suddenly you can do HD video perfectly without even taxing the CPU all that much, and overall power consumption is still relatively low. You can still make devices with passive cooling and long battery life with an Atom+Ion solution. There's your massive power savings.
2) The 3D accelerators use a programming model that is pretty much custom-made for D3D/OpenGL. Although in theory you can still program them in asm, in practice there is little to gain, because HLSL and GLSL are designed to match the hardware so closely, and the compilers are optimized so well, that there is little or no difference in performance.
In fact, with D3D10 Microsoft abandoned the use of their assembly shader language altogether; only HLSL is supported now. (Granted, that assembly language is not exactly the same as the underlying hardware's instruction set, but the underlying hardware is optimized to run it as efficiently as possible, not much more than that.)