I've been seeing a lot of confusion surrounding 64-bit computing lately, undoubtedly spurred by discussion of AMD's Hammer. There seems to be this idea that 64-bit computing will provide a 2X performance increase over 32-bit computing; this couldn't be more false. This idea was probably proliferated by video game console marketing (i.e., a "64-bit" game console is twice as fast as a "32-bit" one). I often see the phrase "a 64-bit CPU can process twice as much data per cycle as a 32-bit CPU" or something similar in articles, implying that the 64-bit CPU is twice as fast.
Background
General-purpose microprocessors have long relied on SISD computing: Single Instruction, Single Data. Essentially, a single arithmetic instruction specifies a single set of operands and yields a single result.
There used to be a time, from the 1970s to about 1986, when bit-level parallelism in microprocessors was very important. The "bit-ness" is generally defined as the width of the general purpose registers, the integer ALUs (arithmetic logic units), and the flat memory addressing capability. Using Intel as an example, going from the 4-bit 4004, to the 8-bit 8008, to the 16-bit 8086 provided greater parallelism. The reason is that, for example, the 8-bit 8008 could only do arithmetic on integers from -128 to 127 (using two's complement), hardly a useful range for a single operation. Going to the 16-bit 8086 provided 16-bit two's complement arithmetic, from -32,768 to 32,767, yielding greater parallelism within a single arithmetic instruction.
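To make that concrete, here is a minimal C sketch (my own illustration, not taken from any real 8-bit toolchain) of why a 16-bit addition is one operation on a 16-bit ALU but has to be synthesized from two 8-bit adds with carry propagation on an 8-bit ALU:

#include <stdint.h>

/* Sketch only: adding two 16-bit values on an 8-bit ALU takes two adds
   with carry propagation, while a 16-bit (or wider) ALU does the same
   thing in a single operation. */
uint16_t add16_on_8bit_alu(uint16_t a, uint16_t b)
{
    uint8_t lo    = (uint8_t)(a & 0xFF) + (uint8_t)(b & 0xFF);     /* low-byte add */
    uint8_t carry = lo < (uint8_t)(a & 0xFF);                      /* did the low byte wrap? */
    uint8_t hi    = (uint8_t)(a >> 8) + (uint8_t)(b >> 8) + carry; /* high-byte add-with-carry */
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}

uint16_t add16_on_16bit_alu(uint16_t a, uint16_t b)
{
    return a + b; /* one add instruction */
}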
What about Integer Code?
This advance in bit-level parallelism promptly ended with the implementation of 32-bit microprocessors around 1986, followed about 10 years later by 64-bit RISC MPUs. 32-bit MPUs can do single-instruction arithmetic on integers from -2,147,483,648 to 2,147,483,647, providing more than enough range for almost all integer applications. A 64-bit MPU obviously can do single-instruction arithmetic from -2^63 to 2^63 - 1, but that does not necessarily yield twice the performance. A greater range is simply not needed for the vast majority of code; I honestly can't remember the last time I've had to use a 64-bit long integer in C or Java. Even in the rare cases that a 64-bit integer data type is needed, it's rarely in a critical inner loop; when compiled to assembly code, that 64-bit integer is going to comprise an even smaller percentage of the arithmetic, given the amount of arithmetic needed for temporary registers, compiler optimizations, and pointer chasing that is invisible to the high-level code.
Going back to SISD, remember that a single arithmetic instruction has a single result. Thus doing 2 + 3, whether the operands are expressed as 32-bit integers and executed on a 32-bit CPU, or expressed as 64-bit integers and executed on a 64-bit CPU, still takes a single operation. Again, as the range provided by 64-bit integers is rarely needed, there is no extra parallelism offered by 64-bit computing in the vast majority of code, except perhaps for scientific computing, some engineering applications, cryptography, and a few other niche applications (hardly applicable to most people here). That's not to say that 64-bit computing is useless, but recompiling Quake3 for a 64-bit MPU will not see a 2X increase in performance (or any increase at all) from the increase in bit-level parallelism. In fact, many software vendors for 64-bit MPUs (when 64-bit memory addressing isn't needed) are reluctant to compile 64-bit code, given the increase in code size.
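A trivial C illustration of the point (a compiler will of course constant-fold this; imagine the values arriving at run time):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 2 + 3 takes one add instruction whether the operands are 32-bit
       or 64-bit; the wider type does not make the operation any faster. */
    int32_t sum32 = (int32_t)2 + (int32_t)3; /* a single 32-bit add */
    int64_t sum64 = (int64_t)2 + (int64_t)3; /* a single 64-bit add */
    printf("%d %lld\n", (int)sum32, (long long)sum64);
    return 0;
}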
There is, however, one important use for 64-bit integer addition (see below).
What about Floating-Point Code?
Increasing the bit width of floating-point data types has two effects: larger range and, more importantly, greater accuracy. The format is similar to scientific notation, though obviously in base-2 instead of base-10, with a mantissa M and exponent E, in the form M * 2^E. The IEEE 754 standard, which virtually everyone has used since the mid-1980s, defines the composition of the sign, mantissa, and exponent in 32-bit single-precision and 64-bit double-precision data types. Going from single-precision to double-precision increases the range; single-precision covers roughly 2*10^-38 to 2*10^38, while double-precision offers roughly 2*10^-308 to 2*10^308. More importantly, however, double-precision offers greater accuracy by adding 29 bits to the mantissa (53 significant bits instead of 24).
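A quick way to see the accuracy difference (a throwaway C snippet of my own, nothing more):

#include <stdio.h>

int main(void)
{
    /* 2^24 + 1 needs 25 significant bits, so the 24-bit single-precision
       mantissa rounds it away; the 53-bit double-precision mantissa keeps it. */
    float  f = 16777216.0f + 1.0f;
    double d = 16777216.0  + 1.0;
    printf("float:  %.1f\n", f); /* prints 16777216.0 */
    printf("double: %.1f\n", d); /* prints 16777217.0 */
    return 0;
}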
Floating-point width, however, is generally independent of the "bit-ness" of an MPU. x86 MPUs have had single- and double-precision arithmetic, as well as an optional 80-bit extended-precision mode, since the adoption of the x87 floating-point unit. Modern fully-pipelined x86 MPUs are capable of low-latency and high-throughput double-precision arithmetic (though the x87 FP stack and two-operand format hold back performance, but that's a different story).
Why a 64-bit mode can actually be slower than a 32-bit counterpart
If all else is equal, a 64-bit MPU will be slower than its 32-bit equivalent. Since instructions (assuming a fixed-length instruction format) and data are now twice as long, half as many instructions and data can be stored in the caches. By the usual rule of thumb that miss rate scales roughly with the inverse square root of effective cache capacity, halving that capacity increases the miss rate by a factor of about sqrt(2), i.e. roughly 41%, and the longer instruction and data words require more bandwidth. Performance studies have shown this to be detrimental to overall performance by around 10-15%.
What good is 64-bit computing?
There is one good use for 64-bit integer arithmetic: 64-bit flat memory addressing. To address more than 4GB of memory with a flat address space, a processor needs 64-bit general purpose registers and 64-bit integer arithmetic for pointer calculations. This is a capability argument rather than a 32-bit vs. 64-bit performance comparison, since flat 64-bit memory addressing on a 32-bit processor is a moot point.
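A hedged C sketch of what this looks like from a program's point of view (whether the large allocation actually succeeds depends on the OS and available memory, of course):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* On a 32-bit build, pointers are 4 bytes and 5GB does not even fit
       in size_t; on a 64-bit build, pointers are 8 bytes and the
       allocation can at least be attempted. */
    printf("pointer size: %u bytes\n", (unsigned)sizeof(void *));

    unsigned long long five_gb = 5ULL << 30;
    if (five_gb > (size_t)-1) {
        printf("5GB does not even fit in this platform's size_t\n");
    } else {
        void *p = malloc((size_t)five_gb);
        printf("5GB allocation %s\n", p ? "succeeded" : "failed");
        free(p);
    }
    return 0;
}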
With memory addressing needs increasing by less than a bit per year, 64-bit addressing should last for 20-40 years.
What about x86-64?
The situation for x86-64 is slightly different. First off, x86 is a variable-length instruction set, from 1 byte to 15 bytes for 32-bit x86, so a 64-bit x86 instruction set wouldn't necessarily double the instruction word length (perhaps the only advantage of a variable-length instruction set; otherwise the prefix kludges are a decoding nightmare). AMD says compiling for x86-64 increases the code size by around 10%, so the miss-rate increase in the caches due to larger code size is minimal (perhaps 5%). On the other hand, 64-bit data types would cause the same ~40% increase in data cache miss rate and an increase in data bandwidth requirements. As such, AMD actually sets the default data operand size to 32 bits in the 64-bit addressing mode. The motivation is that 64-bit data operands are not likely to be needed and could hurt performance; in those situations where 64-bit data operands are desired, they can be activated using the new REX prefix (woohoo, yet another x86 instruction prefix).
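A small sketch of what that means in practice (the byte encodings in the comments are the standard x86-64 ones, but the register choice is just illustrative; a real compiler will pick whatever registers it likes):

#include <stdint.h>

/* In 64-bit mode the default operand size is still 32 bits, so 32-bit
   arithmetic needs no prefix at all; asking for a 64-bit operation makes
   the compiler put the one-byte REX.W prefix in front of the same opcode. */
uint32_t add32(uint32_t a, uint32_t b) { return a + b; } /* 01 d8     add eax, ebx */
uint64_t add64(uint64_t a, uint64_t b) { return a + b; } /* 48 01 d8  add rax, rbx */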
x86-64 mode also doubles the number of general purpose registers from 8 to 16. When used, this will reduce the reliance on memory accesses for data, though it requires efficient register allocation from supporting x86-64 compilers. 16 registers may still not be enough to do cool RISC-like compiler optimizations such as loop unrolling and software pipelining, so the increase in performance for integer code will probably be around 5-10%.
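For a feel of why register count matters, here is a hypothetical unrolled loop of my own (not taken from any particular compiler's output); the four partial sums, the pointer, the index, and the bound all want to live in registers at the same time:

/* Unrolled by four with independent partial sums, so the adds don't
   depend on each other; with only 8 architectural GPRs the compiler
   quickly runs out and starts spilling to memory. */
long sum_unrolled(const int *a, long n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    long i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* clean up the leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}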
Also, x86-64 compilers can take advantage of implementation-specific (Hammer) characteristics... Compiler optimizations for the Hammer's improved front-end and other microarchitectural modifications may yield perhaps another 5-10% increase in performance over existing 32-bit x86 compilers (get on the ball with compiler development, AMD; gcc is hardly adequate).
Can't a 64-bit MPU do two 32-bit arithmetic operations at once, and thereby be twice as fast?
No, then it wouldn't be doing SISD computing as described above. Vector, or SIMD (single instruction, multiple data), computing works in this manner. The idea is that 128-bit SIMD can perform the same arithmetic operation on eight 16-bit, four 32-bit, or two 64-bit sets of operands at once. The challenge (which is why 128-bit SIMD isn't going to be four times as fast as 32-bit SISD) is to find the data-level parallelism in the code... finding four sets of 32-bit operands on which you want to do the same operation, without data dependencies between them, isn't easy.
But this is exactly the role that MMX, 3DNow!, SSE, and SSE2 fill. SSE/SSE2 already do 128-bit SIMD with a dedicated flat set of 8 registers (SSE2 extends it to integer arithmetic); the others do 64-bit SIMD mapped onto the x87 FP stack (MMX does 64-bit integer SIMD, 3DNow! does both FP and integer).
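As a rough sketch of what this looks like in C with SSE intrinsics (assuming the arrays are 16-byte aligned and the length is a multiple of four; a real version would have to handle the leftovers):

#include <xmmintrin.h>  /* SSE intrinsics */

/* One addps instruction adds four pairs of 32-bit floats at once. */
void add_arrays(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&out[i], _mm_add_ps(va, vb));
    }
}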
edited to include SIMD
