I've been seeing a lot of confusion surrounding 64-bit computing lately, undoubtedly spurred by discussion of AMD's Hammer. There seems to be this idea that 64-bit computing will provide a 2X performance increase over 32-bit computing; this couldn't be more false. This idea was probably proliferated by video game console marketing (i.e., a "64-bit" game console is twice as fast as a "32-bit" one). I often see the phrase "a 64-bit CPU can process twice as much data per cycle as a 32-bit CPU" or something similar in articles, implying that the 64-bit CPU is twice as fast.
Background
General-purpose microprocessors have long relied on SISD computing: Single Instruction, Single Data. Essentially, a single arithmetic instruction specifies a single set of operands and yields a single result.
There used to be a time, from the 1970s to about 1986, when bit-level parallelism in microprocessors was very important. The "bit-ness" is generally defined as the width of the general purpose registers, the integer ALUs (arithmetic logic units), and the flat memory addressing capability. Using Intel as an example, going from the 4-bit 4004, to the 8-bit 8008, to the 16-bit 8086 provided greater parallelism. The reason is that, for example, the 8-bit 8008 could only do arithmetic on integers from -128 to 127 (using two's complement), hardly a useful range for a single operation. Going to the 16-bit 8086 provided 16-bit two's complement arithmetic, from -32,768 to 32,767, yielding greater parallelism within a single arithmetic instruction.
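To make that concrete, here is a minimal C sketch (my own illustration, not taken from any real 8-bit toolchain) of why a 16-bit addition is one operation on a 16-bit ALU but has to be synthesized from two 8-bit adds with carry propagation on an 8-bit ALU:

#include <stdint.h>

/* Sketch only: adding two 16-bit values on an 8-bit ALU takes two adds
   with carry propagation, while a 16-bit (or wider) ALU does the same
   thing in a single operation. */
uint16_t add16_on_8bit_alu(uint16_t a, uint16_t b)
{
    uint8_t lo    = (uint8_t)(a & 0xFF) + (uint8_t)(b & 0xFF);     /* low-byte add */
    uint8_t carry = lo < (uint8_t)(a & 0xFF);                      /* did the low byte wrap? */
    uint8_t hi    = (uint8_t)(a >> 8) + (uint8_t)(b >> 8) + carry; /* high-byte add-with-carry */
    return (uint16_t)(((uint16_t)hi << 8) | lo);
}

uint16_t add16_on_16bit_alu(uint16_t a, uint16_t b)
{
    return a + b; /* one add instruction */
}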
What about Integer Code?
This advance in bit-level parallelism promptly ended with the implementation of 32-bit microprocessors around 1986, followed about 10 years later by 64-bit RISC MPUs. 32-bit MPUs can do single-instruction arithmetic on integers from -2,147,483,648 to 2,147,483,647, providing more than enough range for almost all integer applications. A 64-bit MPU obviously can do single-instruction arithmetic from -2^63 to 2^63 - 1, but that does not necessarily yield twice the performance. A greater range is simply not needed for the vast majority of code; I honestly can't remember the last time I've had to use a 64-bit long integer in C or Java. Even in the rare cases that a 64-bit integer data type is needed, it's rarely in a critical inner loop; when compiled to assembly code, that 64-bit integer is going to comprise an even smaller percentage of the arithmetic, given the amount of arithmetic needed for temporary registers, compiler optimizations, and pointer chasing that is invisible to the high-level code.
Going back to SISD, remember that a single arithmetic instruction has a single result. Thus doing 2 + 3, whether the operands are expressed as 32-bit integers and executed on a 32-bit CPU, or expressed as 64-bit integers and executed on a 64-bit CPU, still takes a single operation. Again, as the range provided by 64-bit integers is rarely needed, there is no extra parallelism offered by 64-bit computing in the vast majority of code, except perhaps for scientific computing, some engineering applications, cryptography, and a few other niche applications (hardly applicable to most people here). That's not to say that 64-bit computing is useless, but recompiling Quake3 for a 64-bit MPU will not see a 2X increase in performance (or any increase at all) from the increase in bit-level parallelism. In fact, many software vendors for 64-bit MPUs (when 64-bit memory addressing isn't needed) are reluctant to compile 64-bit code, given the increase in code size.
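A trivial C illustration of the point (a compiler will of course constant-fold this; imagine the values arriving at run time):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* 2 + 3 takes one add instruction whether the operands are 32-bit
       or 64-bit; the wider type does not make the operation any faster. */
    int32_t sum32 = (int32_t)2 + (int32_t)3; /* a single 32-bit add */
    int64_t sum64 = (int64_t)2 + (int64_t)3; /* a single 64-bit add */
    printf("%d %lld\n", (int)sum32, (long long)sum64);
    return 0;
}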
There is, however, one important use for 64-bit integer addition (see below).
What about Floating-Point Code?
Increasing the bit width of floating-point data types has two effects: larger range and, more importantly, greater accuracy. The format is similar to scientific notation, though obviously in base-2 instead of base-10, with a mantissa M and exponent E, in the form M * 2^E. The IEEE 754 standard, which virtually everyone has used since the mid-1980s, defines the composition of the sign, mantissa, and exponent in 32-bit single-precision and 64-bit double-precision data types. Going from single-precision to double-precision increases the range; single-precision covers roughly 2*10^-38 to 2*10^38, while double-precision offers roughly 2*10^-308 to 2*10^308. More importantly, however, double-precision offers greater accuracy by adding 29 bits to the mantissa (53 significant bits instead of 24).
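A quick way to see the accuracy difference (a throwaway C snippet of my own, nothing more):

#include <stdio.h>

int main(void)
{
    /* 2^24 + 1 needs 25 significant bits, so the 24-bit single-precision
       mantissa rounds it away; the 53-bit double-precision mantissa keeps it. */
    float  f = 16777216.0f + 1.0f;
    double d = 16777216.0  + 1.0;
    printf("float:  %.1f\n", f); /* prints 16777216.0 */
    printf("double: %.1f\n", d); /* prints 16777217.0 */
    return 0;
}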
Floating-point width, however, is generally independent of the "bit-ness" of an MPU. x86 MPUs have had single- and double-precision arithmetic, as well as an optional 80-bit extended-precision mode, since the adoption of the x87 floating-point unit. Modern fully-pipelined x86 MPUs are capable of low-latency and high-throughput double-precision arithmetic (though the x87 FP stack and two-operand format hold back performance, but that's a different story).
Why a 64-bit mode can actually be slower than a 32-bit counterpart
If all else is equal, a 64-bit MPU will be slower than its 32-bit equivalent. Since instructions (assuming a fixed-length instruction format) and data are now twice as long, half as many instructions and data can be stored in the caches. By the usual rule of thumb that miss rate scales roughly with the inverse square root of effective cache capacity, halving that capacity increases the miss rate by a factor of about sqrt(2), i.e. roughly 41%, and the longer instruction and data words require more bandwidth. Performance studies have shown this to be detrimental to overall performance by around 10-15%.
What good is 64-bit computing?
There is one good use for 64-bit integer arithmetic: 64-bit flat memory addressing. To address more than 4GB of memory with a flat address space, a processor needs 64-bit general purpose registers and 64-bit integer arithmetic for pointer calculations. This is a capability argument rather than a 32-bit vs. 64-bit performance comparison, since flat 64-bit memory addressing on a 32-bit processor is a moot point.
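A hedged C sketch of what this looks like from a program's point of view (whether the large allocation actually succeeds depends on the OS and available memory, of course):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* On a 32-bit build, pointers are 4 bytes and 5GB does not even fit
       in size_t; on a 64-bit build, pointers are 8 bytes and the
       allocation can at least be attempted. */
    printf("pointer size: %u bytes\n", (unsigned)sizeof(void *));

    unsigned long long five_gb = 5ULL << 30;
    if (five_gb > (size_t)-1) {
        printf("5GB does not even fit in this platform's size_t\n");
    } else {
        void *p = malloc((size_t)five_gb);
        printf("5GB allocation %s\n", p ? "succeeded" : "failed");
        free(p);
    }
    return 0;
}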
With memory addressing needs increasing by less than a bit per year, 64-bit addressing should last for 20-40 years.
What about x86-64?
The situation for x86-64 is slightly different. First off, x86 is a variable-length instruction set, from 1 byte to 15 bytes for 32-bit x86, so a 64-bit x86 instruction set wouldn't necessarily double the instruction word length (perhaps the only advantage of a variable-length instruction set; otherwise the prefix kludges are a decoding nightmare). AMD says compiling for x86-64 increases the code size by around 10%, so the miss-rate increase in the caches due to larger code size is minimal (perhaps 5%). On the other hand, 64-bit data types would cause the same ~40% increase in data cache miss rate and an increase in data bandwidth requirements. As such, AMD actually sets the default data operand size to 32 bits in the 64-bit addressing mode. The motivation is that 64-bit data operands are not likely to be needed and could hurt performance; in those situations where 64-bit data operands are desired, they can be activated using the new REX prefix (woohoo, yet another x86 instruction prefix).
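A small sketch of what that means in practice (the byte encodings in the comments are the standard x86-64 ones, but the register choice is just illustrative; a real compiler will pick whatever registers it likes):

#include <stdint.h>

/* In 64-bit mode the default operand size is still 32 bits, so 32-bit
   arithmetic needs no prefix at all; asking for a 64-bit operation makes
   the compiler put the one-byte REX.W prefix in front of the same opcode. */
uint32_t add32(uint32_t a, uint32_t b) { return a + b; } /* 01 d8     add eax, ebx */
uint64_t add64(uint64_t a, uint64_t b) { return a + b; } /* 48 01 d8  add rax, rbx */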
x86-64 mode also doubles the number of general purpose registers from 8 to 16. When used, this will reduce the reliance on memory accesses for data, though it requires efficient register allocation from supporting x86-64 compilers. 16 registers may still not be enough to do cool RISC-like compiler optimizations such as loop unrolling and software pipelining, so the increase in performance for integer code will probably be around 5-10%.
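For a feel of why register count matters, here is a hypothetical unrolled loop of my own (not taken from any particular compiler's output); the four partial sums, the pointer, the index, and the bound all want to live in registers at the same time:

/* Unrolled by four with independent partial sums, so the adds don't
   depend on each other; with only 8 architectural GPRs the compiler
   quickly runs out and starts spilling to memory. */
long sum_unrolled(const int *a, long n)
{
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    long i;
    for (i = 0; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* clean up the leftover elements */
        s0 += a[i];
    return s0 + s1 + s2 + s3;
}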
Also, x86-64 compilers can take advantage of implementation-specific (Hammer) characteristics... Compiler optimizations for the Hammer's improved front-end and other microarchitectural modifications may yield perhaps another 5-10% increase in performance over existing 32-bit x86 compilers (get on the ball with compiler development, AMD; gcc is hardly adequate).
Can't a 64-bit MPU do two 32-bit arithmetic operations at once, and thereby be twice as fast?
No, then it wouldn't be doing SISD computing as described above. Vector, or SIMD (single instruction, multiple data), computing works in this manner. The idea is that 128-bit SIMD can perform the same arithmetic operation on eight 16-bit, four 32-bit, or two 64-bit sets of operands at once. The challenge (which is why 128-bit SIMD isn't going to be four times as fast as 32-bit SISD) is to find the data-level parallelism in the code... finding four sets of 32-bit operands on which you want to do the same operation, without data dependencies between them, isn't easy.
But this is exactly the role that MMX, 3DNow!, SSE, and SSE2 fill. SSE/SSE2 already do 128-bit SIMD with a dedicated flat set of 8 registers (SSE2 extends it to integer arithmetic); the others do 64-bit SIMD mapped onto the x87 FP stack (MMX does 64-bit integer SIMD, 3DNow! does both FP and integer).
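As a rough sketch of what this looks like in C with SSE intrinsics (assuming the arrays are 16-byte aligned and the length is a multiple of four; a real version would have to handle the leftovers):

#include <xmmintrin.h>  /* SSE intrinsics */

/* One addps instruction adds four pairs of 32-bit floats at once. */
void add_arrays(const float *a, const float *b, float *out, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&out[i], _mm_add_ps(va, vb));
    }
}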
edited to include SIMD
