Some questions about TLBs in AMD64's MMU

uOpt

Golden Member
There are several things that I don't understand about TLBs.

First of all, AMD64 CPUs have 40/40 L1 TLBs (instruction/data) and
512/512 L2 TLBs.

That seems low to me. 512 entries at page size 4096 bytes means that
only 2 MB worth of data can be accessed without having a TLB miss and
walking the page table. That is just twice as big as the L2 cache.
Do I have some misunderstanding here?

Then, since AMD64 and all other x86 processors have caches working on physical addresses, does that mean that a TLB lookup is needed every time an address is used in the program, even if that address's data is currently cached?

I am looking at the performance counters for an application I want to improve. I can see that, compared to a reference program assumed to be average (gcc), I get better L2 data hit rates but more L1+L2 DTLB misses. I wonder what I should do about it, and which other performance counter would tell me more about the resulting memory accesses (because I now walk the page table).


Semi-related question: where is AMD hiding the list of performance
counters and what they mean? Is there anything published by AMD on
this?

Thanks
 

CTho9305

Elite Member
That seems low to me. 512 entries at page size 4096 bytes means that
only 2 MB worth of data can be accessed without having a TLB miss and
walking the page table. That is just twice as big as the L2 cache.
Do I have some misunderstanding here?
Conroe has 128 entries in the ITLB and 256 in the DTLB. The P4 has 128(?) ITLB entries and 64(?) DTLB entries.

Then, since AMD64 and all other x86 processors have caches working on physical addresses, does that mean that a TLB lookup is needed every time an address is used in the program, even if that address's data is currently cached?
Yes, but I believe all relevant architectures work that way. You kinda have to do that, because with virtual memory, the OS could map multiple virtual pages to the same physical pages... if you didn't access the TLB on the way to the cache, you'd have to check more locations in the cache because different virtual addresses could actually be using the same data.
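
Here's a minimal sketch of that aliasing, assuming Linux/POSIX mmap (the file path and sizes are just placeholders): one physical page, two virtual addresses.

/* One file-backed page mapped at two virtual addresses. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/alias-demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
    ftruncate(fd, 4096);    /* one 4 KB page backs both mappings */

    char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    strcpy(a, "hello");     /* store through one virtual address... */
    printf("%p %p -> %s\n", (void *)a, (void *)b, b);  /* ...load through the other */
    return 0;
}

A cache indexed purely by virtual address would have to find "hello" under two unrelated addresses; going through the TLB first collapses both onto the same physical line.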

I am looking at the performance counters for an application I want to improve. I can see that, compared to a reference program assumed to be average (gcc), I get better L2 data hit rates but more L1+L2 DTLB misses.
You can't just take an arbitrary program and call it average... you might at least try a few more standard SPEC INT-style benchmarks (gzip, bzip2, perl, lisp). My guess is just that gcc has better locality at page granularity - while both your program and gcc might have a working set that fits in the L2, every line in the L2 could be coming from a different page in your app, vs. a small number of pages in gcc.
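
To make the page-granularity point concrete, here's a toy sketch (all numbers made up, nothing measured from your app): both layouts keep the same 32 KB live in the cache, but one needs ~8 DTLB entries and the other needs all 512.

#include <stdio.h>
#include <stdlib.h>

struct node { long val; char pad[56]; };   /* one 64-byte cache line per node */

int main(void) {
    /* Layout A: 512 nodes packed together -> 32 KB -> ~8 pages -> ~8 DTLB entries. */
    struct node *packed = calloc(512, sizeof(struct node));

    /* Layout B: one node per 4 KB page -> 512 pages -> 512 DTLB entries
       (the whole K8 L2 DTLB) for the same 32 KB of live data. */
    struct node *spread[512];
    for (int i = 0; i < 512; i++) {
        posix_memalign((void **)&spread[i], 4096, 4096);
        spread[i]->val = 0;
    }

    long sum = 0;
    for (int i = 0; i < 512; i++)
        sum += packed[i].val + spread[i]->val;
    printf("%ld\n", sum);
    return 0;
}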

I wonder what I should do about it, and which other performance counter would tell me more about the resulting memory accesses (because I now walk the page table).
How bad is your TLB hit rate? If it's really awful, then you might want to look into either shrinking your working set, or doing something so that objects you access at around the same time end up packed onto fewer pages (allocate them at the same time?).
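
If you go the allocate-them-together route, the usual trick is a bump-pointer arena. A minimal sketch, not production code (no growth, no freeing):

#include <stdlib.h>

/* Everything allocated back-to-back from one arena lands on the same
   few pages, so traversing it later touches few TLB entries. */
struct arena { char *base, *cur, *end; };

static void arena_init(struct arena *a, size_t size) {
    a->base = a->cur = malloc(size);
    a->end  = a->base + size;
}

static void *arena_alloc(struct arena *a, size_t n) {
    n = (n + 15) & ~(size_t)15;            /* keep 16-byte alignment */
    if (a->cur + n > a->end) return NULL;  /* real code would chain a new block */
    void *p = a->cur;
    a->cur += n;
    return p;
}

int main(void) {
    struct arena a;
    arena_init(&a, 1 << 20);               /* 1 MB = 256 pages total */
    int *x = arena_alloc(&a, sizeof(int));
    int *y = arena_alloc(&a, sizeof(int)); /* x and y share a page */
    *x = 1; *y = 2;
    return *x + *y;
}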

Hmm... I wonder what the effect of OpenBSD's randomized malloc is on TLBs. Probably not good.

Semi-related question: where is AMD hiding the list of performance
counters and what they mean? Is there anything published by AMD on
this?
There's probably a list somewhere... I know CodeAnalyst has info on a lot of them. Page 238 (258) of the K7 document has them; there's an equivalent K8 document, but I don't see a list in it.
 

uOpt

Golden Member
Originally posted by: CTho9305
Then, since AMD64 and all other x86 processors have caches working on physical addresses, does that mean that a TLB lookup is needed every time an address is used in the program, even if that address's data is currently cached?
Yes, but I believe all relevant architectures work that way. You kinda have to do that, because with virtual memory, the OS could map multiple virtual pages to the same physical pages... if you didn't access the TLB on the way to the cache, you'd have to check more locations in the cache because different virtual addresses could actually be using the same data.

SPARC has virtual caches, but I think that's the only current implementation that does.

I am looking at the performance counters for an application I want to improve. I can see that, compared to a reference program assumed to be average (gcc), I get better L2 data hit rates but more L1+L2 DTLB misses.
You can't just take an arbitrary program and call it average... you might at least try a few more standard SPEC INT-style benchmarks (gzip, bzip2, perl, lisp). My guess is just that gcc has better locality at page granularity - while both your program and gcc might have a working set that fits in the L2, every line in the L2 could be coming from a different page in your app, vs. a small number of pages in gcc.

Well, I didn't just take an arbitrary program and call it average. I compare my application to a gcc run, which is one of the benchmarks you suggest.

I wonder what I should do about it, and which other performance counter would tell me more about the resulting memory accesses (because I now walk the page table).
How bad is your TLB hit rate? If it's really awful, then you might want to look into either shrinking your working set, or doing something so that objects you access at around the same time end up packed onto fewer pages (allocate them at the same time?).

Dunno. I don't know what exactly counts as a very bad TLB hit rate. I can compare with SuperPi or the stream benchmark, but those are artificial memory crunchers and the comparison isn't useful. So I know I am worse than gcc, but I don't know what that means in wall-clock units.

What I would like is to figure out how much slowdown I actually get from it. For that, I would need a performance counter that tells me whether my memory accesses are really bad or not.
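
Best I can do for now is a back-of-envelope estimate. Everything in this little calculation is a placeholder - especially the page-walk cost, which is a pure guess (a walk that hits the cache costs tens of cycles, one that goes to DRAM far more):

#include <stdio.h>

int main(void) {
    double l2_dtlb_misses = 5e8;     /* substitute the real counter reading */
    double walk_cycles    = 50.0;    /* assumed average page-walk cost */
    double total_cycles   = 4e11;    /* e.g. ~200 s at 2 GHz (placeholder) */

    printf("est. time spent in page walks: %.1f%%\n",
           100.0 * l2_dtlb_misses * walk_cycles / total_cycles);
    return 0;
}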

The documentation I can find on the performance counters for AMD chips is not sufficient for that. I don't even know whether there is a catch-all "these are all memory accesses" counter.

Neither do I know how to separate read and write memory accesses for the purpose of counting TLB misses. That makes a huge difference. The data that I read is mostly mmap()ed data which has already been through intensive review for locality. However, if writes are my problem, then I could have more room to play with, such as using more preallocated buffers.
Hmm... I wonder what the effect of OpenBSD's randomized malloc is on TLBs. Probably not good.

The application is written in Common Lisp, so I would have to play with the allocator in the GC; I can't just drop in a malloc replacement.

Also, as I mentioned, most of the data is read-only mapped, not allocated, so there's nothing to do on the fly.

However, one thing I could try is to deliberately spread out the writes and observe what happens. That way I can work my way towards figuring out what exactly I'm doing there.
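
Something along the lines of this crude microbenchmark (page count and repetitions are arbitrary) should show the effect: one store per page over more pages than the DTLB covers, versus over a few pages:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define PAGE   4096
#define NPAGES 4096     /* try 64 (fits the DTLB) vs 4096 (thrashes it) */

int main(void) {
    char *buf = malloc((size_t)NPAGES * PAGE);
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int rep = 0; rep < 1000; rep++)
        for (int i = 0; i < NPAGES; i++)
            buf[(size_t)i * PAGE] = (char)rep;   /* one store per page */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%.1f ns per store\n", ns / (1000.0 * NPAGES));
    return 0;
}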

Semi-related question: where is AMD hiding the list of performance
counters and what they mean? Is there anything published by AMD on
this?
There's probably a list somewhere... I know CodeAnalyst has info on a lot of them. Page 238 (258) of the K7 document has them; there's an equivalent K8 document, but I don't see a list in it.

Yeah, my problem exactly.

I have a list with short descriptions from the FreeBSD pmc(3) manpage for K8.

But that is really not comparable to what Intel gives you in the IA-32 software developer's manual, Volume 3 ("System Programming"), which is an order of magnitude better.

I remember that I once found what I needed. I should probably just mail the FreeBSD guy or some Linux perfmon/perfmon2 guy.