For Intel systems how many clock cycles to go from logical address to physical address?

chrstrbrts · Dec 16, 2016

Hello,

So, I'm making my way through the Intel bible, Intel 64 and IA-32 Architectures Software Developer’s Manual Combined Volumes 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, and 3D, and I'm reading about segmentation and paging.

I understand, basically, how to go from a logical address to a physical one that you put on a bus and hand over to a memory controller of some kind.

Though, I'm wondering how many clock cycles it takes.

There seem to be quite a few steps in referencing tables and directories and such.

That is, if you want to reach into physical memory and read or write a location, you have to access descriptor tables, page directories, and page tables all of which are located in physical memory themselves.

So, one attempt at a memory fetch requires 5 or more memory fetches just to grab all the pointers you need to find the address of the memory location that you initially wanted.

So, I'd like to know how many clock cycles it takes to go from a logical address to a physical address if no TLB or paging structure caches are used.

I'd also like to know how many clock cycles it takes to go from a logical address to a physical address if TLB or paging structure caches are used.

Thanks.

Schmide · Dec 17, 2016

Well, it doesn't take 5 memory fetches to load memory. The GDT can be thought of as always loaded, and with modern page-based OSes the LDT is generally unnecessary. So say you're working on a 32-bit system. That means a paging system covering a 4 GiB address space: a 1024-entry page directory pointing to 1024-entry page tables. If the translation is in the TLB, you can almost bet that the data is in the cache system somewhere, so you're looking at 4-75 cycles. Otherwise you're looking at 3 memory accesses at 60+ cycles each. Now if that page is not in memory, you're loading the data from disk and, well, that's in the 500k-cycle range.
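To make the "3 memory accesses" concrete, here is a rough sketch of a two-level walk for classic 32-bit paging with 4 KiB pages. The dict-based tables, the function names, and the example mapping are all made up for illustration; real hardware reads packed entries with flag bits from physical memory, which this toy version ignores.

```python
def split_linear(addr):
    """Split a 32-bit linear address into (directory index, table index, offset)."""
    pde_index = (addr >> 22) & 0x3FF   # top 10 bits select 1 of 1024 directory entries
    pte_index = (addr >> 12) & 0x3FF   # next 10 bits select 1 of 1024 table entries
    offset = addr & 0xFFF              # low 12 bits are the offset inside the 4 KiB page
    return pde_index, pte_index, offset


def walk(page_directory, addr):
    """Toy two-level walk; returns (physical address, number of table reads)."""
    pde_index, pte_index, offset = split_linear(addr)
    page_table = page_directory[pde_index]   # read 1: page directory entry
    frame_base = page_table[pte_index]       # read 2: page table entry
    return frame_base + offset, 2


# Hypothetical mapping: linear page 0x00403000 -> physical frame 0x0009A000
directory = {1: {3: 0x0009A000}}
phys, table_reads = walk(directory, 0x00403ABC)
print(hex(phys), table_reads + 1)   # table reads plus the data access itself
```

With nothing cached, that is two table reads plus the access you actually wanted, so three trips to memory in total.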

Edit: on this.

chrstrbrts said:
So, I'd like to know how many clock cycles it takes to go from a logical address to a physical address if no TLB or paging structure caches are used.

If you're not using paging, the physical address is basically the segment base + offset, and in general it takes zero extra time for the processor to compute this. Because of caching, paging adds, on average, very little on top of any address calculation. Some say at most 10%. Pretty much once you go to main memory it's all the same.

Here's a good table. https://gist.github.com/jboner/2841832

Code:

Latency Comparison Numbers
--------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

Credit
------
By Jeff Dean:               http://research.google.com/people/jeff/
Originally by Peter Norvig: http://norvig.com/21-days.html#answers

Contributions
-------------
Some updates from:       https://gist.github.com/2843375
'Humanized' comparison:  https://gist.github.com/2843375
Visual comparison chart: http://i.imgur.com/k0t1e.png
Animated presentation:   http://prezi.com/pdkvgys-r0y6/latency-numbers-for-programmers-web-development/latency.txt
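Since the question was about clock cycles and the table is in nanoseconds, here is a small conversion sketch. The 3 GHz clock is just an assumed figure for illustration, and `ns_to_cycles` is a made-up helper name; plug in your own clock rate.

```python
def ns_to_cycles(latency_ns, clock_ghz=3.0):
    """Convert a latency in nanoseconds to CPU cycles at the given clock rate."""
    # 1 GHz is one cycle per nanosecond, so cycles = ns * GHz.
    return latency_ns * clock_ghz


# A few rows taken from the table above (values in ns).
latencies_ns = {
    "L1 cache reference": 0.5,
    "L2 cache reference": 7,
    "Main memory reference": 100,
}

for name, ns in latencies_ns.items():
    print(f"{name}: ~{ns_to_cycles(ns):.1f} cycles at 3 GHz")
```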

Merad · Dec 17, 2016

chrstrbrts said:
So, I'd like to know how many clock cycles it takes to go from a logical address to a physical address if no TLB or paging structure caches are used.

I'm assuming you mean "how long does it take when there's a TLB hit", since all address translation (AFAIK) goes through the TLB. Modern CPUs have a TLB hierarchy and AFAIK the L1 TLB is basically on par with the L1 d-cache at ~4 cycles to access.

chrstrbrts said:
I'd also like to know how many clock cycles it takes to go from a logical address to a physical address if TLB or paging structure caches are used.

There isn't really an easy answer to this, because resolving a TLB miss is a complex operation with many factors involved. Modern x86 has dedicated hardware to walk the page tables and resolve the miss, and I believe even has facilities to perform speculative searches before a miss actually occurs, as well as supporting multiple simultaneous searches. I don't know many details about them, though. They're probably somewhere in those manuals you're reading.

The main potential problem is that the page tables may not be resident in memory. Absolute worst case scenario I suppose you could have many (several? a dozen? I'm not sure) page faults trying to resolve the correct PTE. Each page fault will cost you millions of cycles. That's highly unlikely, however. IIRC "typical" time to resolve a TLB miss is ~100 cycles.
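As a back-of-the-envelope model, you can combine the two rough figures above (~4 cycles on an L1 TLB hit, ~100 cycles to resolve a miss) into an average cost per translation. The function name and the hit rates below are assumptions for illustration, and the model ignores second-level TLBs, overlapping walks, and page faults.

```python
def expected_translation_cycles(tlb_hit_rate, hit_cycles=4, miss_cycles=100):
    """Average cycles per translation under a simple hit/miss model."""
    return tlb_hit_rate * hit_cycles + (1 - tlb_hit_rate) * miss_cycles


# Hypothetical hit rates, just to show how quickly misses dominate.
for rate in (0.90, 0.99, 0.999):
    print(f"hit rate {rate:.3f}: ~{expected_translation_cycles(rate):.2f} cycles")
```

Even a small miss rate pulls the average well above the hit latency, which is why the TLB hit rate matters far more than the raw walk time.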

Merad · Dec 17, 2016

Schmide said:
Here's a good table. https://gist.github.com/jboner/2841832

Some of those numbers have decreased dramatically since that was originally written. Berkeley has a neat visualization of how they've changed over time: https://people.eecs.berkeley.edu/~rcs/research/interactive_latency.html

you2 · Dec 17, 2016

In practice, how would one use this information?

Ken g6 · Dec 17, 2016

you2 said:
In practice, how would one use this information?

Try to limit the amount of memory you need access to at any one time to something that fits in the L1 cache. If that's not possible, try for something that fits in the L2 cache. If that's not possible either, consider prefetching - though it may not be as effective as it used to be.
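A quick way to sanity-check that is to compare your working set against the cache sizes. The sizes below (32 KiB L1 data cache, 256 KiB L2) are assumed values for illustration only, and the helper name is made up; check the actual numbers for your CPU.

```python
L1D_BYTES = 32 * 1024    # assumed L1 data cache size
L2_BYTES = 256 * 1024    # assumed L2 cache size


def smallest_cache_that_fits(num_elements, element_bytes):
    """Return the smallest cache level that can hold the whole working set."""
    working_set = num_elements * element_bytes
    if working_set <= L1D_BYTES:
        return "L1"
    if working_set <= L2_BYTES:
        return "L2"
    return "beyond L2"


# Example: arrays of 8-byte values of increasing length.
for n in (4096, 10_000, 100_000):
    print(n, smallest_cache_that_fits(n, 8))
```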

chrstrbrts · Dec 17, 2016

you2 said:
In practice, how would one use this information?

LOL....If you follow my lines of questioning here, you'll see that I'm mostly concerned with hypotheticals and the garnering of knowledge for the sake of the garnering of knowledge.

Practical I am not.
