
Some quick questions regarding 64-bit CPUs

her209

1. Does the CPU work with 64-bit data (all registers/ALU being 64-bit as well)?
2. Is the CPU capable of addressing a 2^64 memory space (physically and virtually)?
3. If #1 is true, does this mean instruction length is 64-bits wide and if so, are there more registers?

Thanks in advance.
 
1. yes
2. Virtually, not physically. AMD64 (and IA-32e, Intel's version of it) allows for smaller physical addresses - I believe the AMD implementations are 40 or 48 bits, and Intel's is just 36.
3. The instruction length actually varies a lot in x86 anyway, from 1 byte to ~15 bytes. There are more registers, but that's not an inherent result of the switch to 64-bit... AMD had the opportunity to improve the number of architectural registers when they changed the instruction set, and they took that opportunity.
 
Will bit size continue to go up? Will we have 512-bit CPUs in, say, 20 years? Or is there no need, or diminishing returns, for larger widths at some point?
 
CTho9305 is correct.
Just to elaborate on "2":


For AMD64, the virtual space to rumble around in is fully 16 exabytes.
But "only" 4 petabytes of that space can be mapped, so that's the limit.

However, that is for AMD x86-64 as such, not for the current K8s.
Current K8s have 256 terabytes of virtual space to rumble around in, but "only" a total of 1 terabyte can be mapped, so that's the limit of virtual memory. But again, the space is much bigger, which should be a help for many things, including fragmentation.

Even further limiting is Windows XP64's addressing scheme, which I understand will give you 'only' 16 terabytes of virtual space, and initially map only 16GB.

Also, current implementations of x86-64 processors, both AMD and Intel, are of course more limited in physical address space. In the case of AMD, the most constrictive component is the integrated memory controller (currently 16GB). Opterons can use other Opterons' memory controllers over HT links to access 128GB. Intel implementations, too, might have some issues beyond 4GB (so far). But the important thing is that the software virtual memory model is not limited. It will have enough addresses.
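The capacities quoted above follow directly from the address bit widths; here's a quick back-of-the-envelope check in Python (the bit counts - 64, 52, 48, 40 - are as stated in the post):

```python
# Sanity check of the address-space figures above.
# A machine with n address bits can name 2**n bytes.

def size(bits):
    """Return 2**bits bytes expressed in a human-readable binary unit."""
    units = ["B", "KB", "MB", "GB", "TB", "PB", "EB"]
    n = 2 ** bits
    i = 0
    while n >= 1024 and i < len(units) - 1:
        n //= 1024
        i += 1
    return f"{n} {units[i]}"

print(size(64))  # full AMD64 virtual space      -> 16 EB
print(size(52))  # mappable within that space    -> 4 PB
print(size(48))  # current K8 virtual space      -> 256 TB
print(size(40))  # current K8 mappable (physical) -> 1 TB
```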

Originally posted by: Gibsons
Will bit size continue to go up? Will we have 512 bit cpus in say, 20 years? Or is there no need or limiting returns for larger bits at some point?

The need for integer width greater than 32 bits is primarily about pointers.
There will be no need for addressing larger than 64-bit for a good while.

Processors are already much wider than 64 bits in terms of how many bits can be committed per cycle. That width is apparently going to continue to increase: Intel's Conroe moves on to four execution pipelines. We're likely to see increased width of vector registers too; that width is today 128 bits. Vector instructions represent explicit parallelism, which is easier and more expedient to schedule into execution units.
 
Originally posted by: Vee
For AMD64, the virtual space to rumble around in is fully 16 exabytes.
But "only" 4 petabytes of that space can be mapped, so that's the limit.

Some posters may wonder why microprocessor manufacturers and OS designers would want to impose such limits on the amount of memory you can address.

The essential problem is the page table used by each process in a virtual memory system. Every access to memory is translated by the processor using the appropriate process's page table. Most virtual memory systems divide memory into 4KB pages, so on a 32-bit memory system, you have 2^32 bytes of virtual memory divided by 2^12 bytes/page, leaving you with 2^20 entries in your page table. Assuming you only need a single 32-bit physical address for each entry, each process's page table would require 4 megabytes of memory.

However, in a 64-bit memory system, you have 2^64 bytes of virtual memory divided by 2^12 bytes/page, leaving you with 2^52 entries. Each entry would need at least 64 bits, leaving you with a requirement of 2^55 bytes (32 petabytes) of memory for each process's page table. That's obviously absurd. You could increase the size of each page, but pages would have to be far beyond the size of current physical memory capacity to make a difference here.
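The flat-table arithmetic in the two paragraphs above can be checked in a few lines (with 8-byte entries, the 64-bit case works out to 2^55 bytes, or 32 petabytes, per process):

```python
PAGE = 2 ** 12  # 4 KB pages

# 32-bit: 2^32 / 2^12 = 2^20 entries, 4 bytes each -> 4 MB per process
entries_32 = 2 ** 32 // PAGE
table_32 = entries_32 * 4
print(table_32 // 2 ** 20, "MB per process")

# 64-bit: 2^64 / 2^12 = 2^52 entries, 8 bytes each -> 2^55 bytes = 32 PB
entries_64 = 2 ** 64 // PAGE
table_64 = entries_64 * 8
print(table_64 // 2 ** 50, "PB per process")
```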

While real processors and OSes use a variety of techniques to reduce the memory cost of page tables, those techniques cost in terms of performance and simply don't scale sufficiently to reduce page table size for 64-bit architectures such that they could fit in the physical memory of any existing computer. That's why all 64-bit architectures restrict their address space well below the full size that would be allowed by 64-bit addresses.
 
While real processors and OSes use a variety of techniques to reduce the memory cost of page tables, those techniques cost in terms of performance and simply don't scale sufficiently to reduce page table size for 64-bit architectures such that they could fit in the physical memory of any existing computer. That's why all 64-bit architectures restrict their address space well below the full size that would be allowed by 64-bit addresses.

That doesn't seem like a good reason to me. If that were the case, wouldn't they have had the same problem with the switch to 32 bits and the initial introduction of virtual memory? At that time, the 4MB required per process would have been more memory than most machines had (the 386 came out in 1985). x86 certainly wasn't the first 32-bit ISA to do virtual memory either.

Hierarchical page tables really make this a moot point, since you only need to create page table entries for the pages that are in use (plus some small overhead). With 32 bits and 4KB pages (i.e. the usual setup), you can get by with 2-level page tables with each level having 1024 entries - so each process needs a top-level page table (1024 entries, or 4KB worth of physical addresses) and at least one second-level page (another 4KB of physical addresses). This takes us to a minimum of 8KB, instead of the 4MB a flat page table would require (I'm ignoring everything but the space for the actual physical address, e.g. permission bits). For 64 bits, if we added another two levels to the hierarchy, I think you'd be able to do it with 16KB worth of physical addresses at the smallest. You do have to do 2 more memory accesses in this scheme, but it makes the sizes manageable. Section 5.3.3 of this AMD document says they do use a 4-level hierarchy:
Each field is used as an index into the four-level page-translation hierarchy.
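The four-level split that quote describes can be sketched in a few lines. This assumes the AMD64 long-mode layout for 48-bit virtual addresses: four 9-bit table indices (512 entries per table) plus a 12-bit offset into the 4KB page.

```python
# Sketch: decompose a 48-bit AMD64 long-mode virtual address into the
# four 9-bit table indices plus the 12-bit page offset.

def split_va(va):
    offset = va & 0xFFF           # bits 0-11:  offset within the 4 KB page
    pt     = (va >> 12) & 0x1FF   # bits 12-20: page-table index
    pd     = (va >> 21) & 0x1FF   # bits 21-29: page-directory index
    pdp    = (va >> 30) & 0x1FF   # bits 30-38: page-directory-pointer index
    pml4   = (va >> 39) & 0x1FF   # bits 39-47: top-level (PML4) index
    return pml4, pdp, pd, pt, offset

print(split_va(0x0000_7FFF_FFFF_F000))  # -> (255, 511, 511, 511, 0)
```

A sparse process needs only the tables on paths it actually uses: one 4KB table per level on a single path is 4 levels x 4KB = 16KB, matching the minimum estimated above.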

I would think the limit arises from things like keeping the processor actually implementable. For example, I think you have to do CAM lookups on physical addresses in the load-store unit, and more bits to compare makes the CAM slower (CAM lookups work by having every bit control a transistor, with the transistors for the bits all in parallel - if any bit is set, the shared wire's value changes... more bits => more capacitance, which makes it slower, and more bits => more transistors => more leakage => you need smaller transistors to control leakage, or a pullup leaker device, both of which again make it slower). Additionally, your cache tags need to get bigger - tag matching probably uses a CAM also (or at least a z-detect, which again gets slower with size). You'd also need either more pins to send the address bits, or you'd have to use the address pins for more cycles per memory access (and since these run at the bus clock, a single cycle is a long time).
 
Originally posted by: CTho9305
That doesn't seem like a good reason to me. If that were the case, wouldn't they have had the same problem with the switch to 32 bits and the initial introduction of virtual memory? At that time, the 4MB required per process would have been more memory than most machines had (the 386 came out in 1985). x86 certainly wasn't the first 32-bit ISA to do virtual memory either.

I mentioned the existence of memory-saving techniques that would make the page table less than 4MB above.

Hierarchical page tables really make this a moot point, since you only need to create page table entries for the pages that are in use (plus some small overhead).

Hierarchical page tables are one of the techniques that reduce the size, but they don't eliminate the problem of page table size and they incur a performance hit on every memory lookup and require more processor space devoted to TLB entries. Hierarchical tables also cost in performance in terms of page table locking in multiprocessor systems.

For 64 bits, if we added another two levels to the hierarchy, I think you'd be able to do it with 16KB worth of physical addresses at the smallest. You do have to do 2 more memory accesses in this scheme, but it makes the sizes manageable. Section 5.3.3 of this AMD document says they do use a 4-level hierarchy.

While AMD64 uses a 4-level hierarchy, other 64-bit processors use a 3-level hierarchy. Most OSes also use 3-level tables - Linux (prior to 2.6.11) and at least some of the BSD UNIXes I know for certain, as I've seen the source, but I believe MS Windows does the same. (To be pedantic, they put one entry in the highest-level table pointing to the 3-level table the OS builds.)

It's also worth calculating the size of a full AMD page table: with 512 entries/table and 8 bytes/entry, it's 512GB.
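As a sanity check on that 512GB figure: for 48-bit virtual addresses and 4KB pages, a fully populated bottom level alone needs 2^36 leaf entries, which dominates the total (the upper levels add comparatively little).

```python
# Fully populated bottom level of a 48-bit, 4-level page table:
# 2^48 bytes of virtual space / 2^12 bytes per page = 2^36 leaf entries,
# at 8 bytes per entry -> 2^39 bytes.
leaf_entries = 2 ** 48 // 2 ** 12
leaf_bytes = leaf_entries * 8
print(leaf_bytes // 2 ** 30, "GB")  # -> 512 GB
```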

I would think the limit arises from things like keeping the processor actually implementable.

Adding more levels to the page table costs in terms of processor architecture too, which is another way that page table size is a limiting factor.

Additionally, your cache tags need to get bigger - tag matching probably uses a CAM also (or at least a z-detect, which again gets slower with size). You'd also need either more pins to send the address bits, or you'd have to use the address pins for more cycles per memory access (and since these are the bus cycle time, a single cycle is a long time).

I agree that page table size isn't the only reason for limiting address size, but I thought it was one of the easier ones to understand.
 
Originally posted by: cquark
Hierarchical page tables are one of the techniques that reduce the size, but they don't eliminate the problem of page table size and they incur a performance hit on every memory lookup and require more processor space devoted to TLB entries. Hierarchical tables also cost in performance in terms of page table locking in multiprocessor systems.
Hopefully your TLBs are good enough that you don't have to do lookups too often 😉

I would think the limit arises from things like keeping the processor actually implementable.

Adding more levels to the page table costs in terms of processor architecture too, which is another way that page table size is a limiting factor.
I wouldn't expect that to be a significant factor... the table walk is probably a pretty simple FSM, and I'd guess each extra level just adds a couple of states.
 
Originally posted by: CTho9305
Originally posted by: cquark
Hierarchical page tables are one of the techniques that reduce the size, but they don't eliminate the problem of page table size and they incur a performance hit on every memory lookup and require more processor space devoted to TLB entries. Hierarchical tables also cost in performance in terms of page table locking in multiprocessor systems.
Hopefully your TLBs are good enough that you don't have to do lookups too often 😉

Hopefully. I'm always surprised at how effective they are given their small size.

I wouldn't expect that to be a significant factor... the table walk is probably a pretty simple FSM, and I'd guess each extra level just adds a couple of states.

I read some benchmarks on that recently, but I don't recall the URL. The difference was measurable, but not terribly large (a few percent). Although, given the amount of silicon we add to a branch predictor to get a couple of percent of performance, it's worth thinking about, especially as the Athlon64's 4-level hierarchy only deals with 40-bit addresses (48-bit on Opteron). We may need a deeper hierarchy of page tables to deal with the memory issues of full 64-bit addresses, but why add that until people start needing more than the terabyte accessible via 40-bit addresses?
 
3. If #1 is true, does this mean instruction length is 64-bits wide and if so, are there more registers?

On the registers part, yes: i386 has 8 architectural registers, while AMD64 has 16. This is actually responsible for a lot of the performance increase you see running an AMD64 CPU in 64-bit mode as opposed to 32-bit mode: decreased "register pressure."
 