Oh, yeah, we're way beyond double precision! (And yet not.)
Way back in the Pentium days, Intel noticed that there were these eight 80-bit x87 registers (usually used for 64-bit double-precision math at most) that couldn't be used for anything integer. So they came up with MMX, which aliased the low 64 bits of those registers and let a single instruction operate on sets of either 8 bytes, 4 pairs-of-bytes ("words"), or occasionally 2 double-words (32-bit integers).
Well, this was a hit. But floating-point math remained as it had always been: awkward and slow. AMD actually had the bright idea of working with two 32-bit floating-point numbers at once in the same way, in the same registers. This was called "3DNow!". But it never really took off.
So in the Pentium III, Intel created eight 128-bit SSE registers. Each holds 4 single-precision numbers or 2 double-precision numbers, and instructions can operate on them as sets. You can also work with just a single FP number at a time here. This is the new standard way to do floating-point math. (Except for some 80-bit FPU math hijinks.)
Well, SSE2 added double-precision and integer math to the mix, and SSE3, 4.1, and 4.2 have added new instructions. With 64-bit processors they even added 8 more SSE registers. But the registers stayed 128 bits, until Sandy Bridge.
Sandy Bridge and Bulldozer have 256-bit AVX registers, the lower 128 bits of which can still be used as SSE registers. Applications built specifically for AVX (either through compiler optimizations that generally don't exist yet or through hand-written assembly) can work with 4 double-precision or 8 single-precision FP numbers at a time. But "at a time" can be misleading: I believe Sandy Bridge simply works on half of an AVX register in each clock cycle. So there are fewer instructions to decode if that's a bottleneck, but otherwise speed stays the same for now.
Bulldozer has an interesting trick: it shares one 256-bit FPU between each pair of integer cores. In applications that use regular SSE registers, half the FPU is allocated to each core, and both can do work at the same time. If one (and only one) core of a pair wants to execute an AVX instruction, that instruction takes only one clock cycle, apparently running twice as fast. But if, as in DC, all the cores want to do AVX work at the same time, they have to trade off, and the speed again becomes 2 clock cycles per instruction, just like Intel.
In case you're wondering: yes, you could run 4 AVX-heavy WUs and 4 integer-heavy WUs on an 8-core Bulldozer at the same time. If they landed on the proper cores, the AVX work would run up to twice as fast. But current OSes don't know how to allocate work to the proper cores. Reportedly Windows 8 will know how, but it would seem to be a very tough thing to do.