Interesting: I rebooted and set node interleaving to On in my SuperMicro BIOS (i.e. non-NUMA mode, which evens out memory latency across all sockets). Result: 187 GFLOPS.
With the BIOS set to NUMA mode (i.e. node interleaving Off), the result was 148.xx GFLOPS.
I normally keep it in NUMA mode for exactly the reasons you mention: I like to have most games run on socket 1 and use the memory attached to that socket, since lower latency means higher FPS.
Linpack, however, seems to prefer the extra memory bandwidth.
This is with four Opteron 61xx-series CPUs (12 K10 cores each) running at 3.0 GHz. Unfortunately I can't try 3.3 GHz, since these 92 mm Noctuas can't keep them cool enough.
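For context, here's a rough back-of-envelope sketch of how those two Linpack numbers compare to the box's theoretical peak. It assumes 4 double-precision FLOPs per cycle per K10 core (one 128-bit SSE add plus one 128-bit SSE multiply per clock). Treat that figure as an assumption, not a measured value. It also uses 148.0 for the NUMA run because the exact decimals weren't recorded.

```python
# Back-of-envelope Linpack efficiency estimate (a sketch, not a benchmark).
# ASSUMPTION: 4 double-precision FLOPs per cycle per K10 core
# (one 128-bit SSE add + one 128-bit SSE multiply, 2 doubles each).
SOCKETS = 4
CORES_PER_SOCKET = 12
CLOCK_GHZ = 3.0
FLOPS_PER_CYCLE = 4  # assumed DP FLOPs per core per clock


def peak_gflops(sockets, cores, ghz, flops_per_cycle):
    """Theoretical double-precision peak in GFLOPS."""
    return sockets * cores * ghz * flops_per_cycle


peak = peak_gflops(SOCKETS, CORES_PER_SOCKET, CLOCK_GHZ, FLOPS_PER_CYCLE)

# Measured results from the two BIOS settings
# (148.0 stands in for "148.xx"; the decimals weren't given).
measured = {"interleaving On": 187.0, "NUMA (interleaving Off)": 148.0}

for mode, gflops in measured.items():
    # Fraction of theoretical peak achieved in each mode
    print(f"{mode}: {gflops:.0f} GFLOPS = {gflops / peak:.1%} of {peak:.0f} GFLOPS peak")
```

Under that assumption the peak works out to 576 GFLOPS, so both runs land well below it, and interleaving gives roughly a quarter more throughput than NUMA mode on this test.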
This is the "problem" with MP/DP rigs running Windows: a lot of apps and games use only one CPU because Windows treats each CPU separately.