Cache question

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
What is the probability that, given that something is presently in cache, it will be in cache after time t? I know this is a really bad question, but I don't know if I can be much more specific... I'm mostly interested in asymptotic behavior. Does the probability fall off linearly? Exponentially? A power function? If different replacement strategies produce different results, I'm interested in all of them ;).

I'm wondering because of some stuff I learned in a psych class ;).

edit: If it changes the answer, ignore exceptions/interrupts, and just take a single program running by itself. If the answer is different for different workloads, I'm interested in all of them, too :). Probability or odds are fine. If p is probability, q is odds: q = p/(1-p). While I'm asking... what's the probability something will be needed again, given that it is needed now, as a function of time?
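For anyone who wants to play with it, that probability/odds conversion is easy to write down as a couple of throwaway Python functions (just the identity above, nothing more):

```python
# The conversion above: odds q = p / (1 - p),
# and going the other way, p = q / (1 + q).
def prob_to_odds(p):
    return p / (1.0 - p)

def odds_to_prob(q):
    return q / (1.0 + q)

print(prob_to_odds(0.75))    # 3.0, i.e. "3 to 1"
print(odds_to_prob(3.0))     # 0.75
```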
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
Welcome to one of the thorniest problems in practical computer science. :)

I'll try to give a simple explanation, hopefully not leaving out (or forgetting) too many factors. I'm going to assume a single-level cache here (although size and associativity will be touched on in a minute), that you *must* read data into cache to do anything with it (you can't somehow read data directly into registers while bypassing cache), that you do not know the exact order of the data needed (if you did, there wouldn't be much of a problem), and that there is some very large amount of slow storage out there somewhere for you to pull data from (read: hard disk).

One of the more important things to know when talking about cache is the size of the "working set" -- this is all the information that is going to be accessed by your system. If the working set is known in advance and is smaller than the cache, and the cache is fully associative, then it is possible to simply load the entire working set into cache and then you'll have a 100% hit ratio. Unfortunately, this is rarely the case -- due to the high cost of low-latency memory, you inevitably have only a limited amount at your disposal. With a fully associative cache, in theory the best you can do is to have a hit ratio of (cache size / working set size) -- that is, if you have 50MB of cache, and 100MB of data that you might want to access at any given time, then there's a 50/100 = 50% chance of your data being in cache when you go to look for it. This assumes you have no precise knowledge of the workload -- that the data accesses are all random.
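A quick Monte Carlo sketch supports that estimate. The numbers below are my own toy scale (50 and 100 blocks instead of megabytes), with uniformly random accesses against an LRU cache:

```python
import random

# Estimate the hit ratio of an LRU cache under uniformly random accesses.
# The prediction above is: hit ratio ~= cache_size / working_set
# (here 50/100 = 0.5).
def lru_hit_ratio(cache_size, working_set, n_accesses, seed=0):
    rng = random.Random(seed)
    cache = []                   # front of list = least recently used
    hits = 0
    for _ in range(n_accesses):
        block = rng.randrange(working_set)
        if block in cache:
            hits += 1
            cache.remove(block)  # will be re-appended as most recently used
        elif len(cache) >= cache_size:
            cache.pop(0)         # evict the least recently used block
        cache.append(block)
    return hits / n_accesses

print(lru_hit_ratio(50, 100, 200_000))   # ~0.5, as predicted
```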

Associativity affects the probabilities roughly linearly -- in an 8-way set-associative cache, for example, each particular piece of data can only go into the 8 ways of one set, rather than anywhere in the cache. For this analysis, I'm going to assume a fully associative cache (one where any data can go into any space). It makes things much simpler.

The problem, of course, is that you rarely know the working set in advance; even if you do, it's almost always larger than the amount of cache you have available; and in any general-purpose computing system, it's constantly changing. Thus, you wind up with various algorithms for dealing with cache. There are basically two questions you're trying to answer: a) what data will be needed next, and b) if we have to kick something out of cache to make room for new data, what should be kicked out?

The first question is highly dependent on your system and what you're doing with it. For instance, your program might be reading a very large file sequentially. If it's possible for the operating system (or, in more recent years, the hard disk itself) to detect this series of reads, it can start spooling the next pieces of data into cache before you even ask for them! You rarely have this exact luxury, but you can often make educated guesses about what's going to be read next depending on what has been read in the recent past. As this is more of an art than a science, and is usually done at an application level (rather than in hardware), I'm not going to talk much about it.

The second question is that of a cache replacement algorithm -- if the cache is full, and you need a piece of data that's *not* in the cache currently, some data has to be removed to make room for it. If you had perfect information, you would simply remove the data whose next access is furthest in the future. But since you don't *know* exactly which data is going to be accessed next, you have to guess. The two approaches generally used are LFU (Least Frequently Used) and LRU (Least Recently Used). The idea is that you track usage of the data in cache, and try to get rid of data that hasn't been used in a while, or that hasn't been used very frequently. In practice (at least for physical cache memories), LFU schemes require too much overhead, and LRU (often an approximate LRU) is overwhelmingly used. Another piece of data that can be exploited here is "locality of reference" -- if you just looked at data block n, you're more likely to need data near n in the near future than other data. This requires too much overhead to be used in low-level cache memory (since it's dependent on the data itself), but is often used for swapping virtual memory pages and in relational database applications.
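To make the LRU rule concrete, here's a toy software version of a fully associative LRU cache (obviously nothing like the hardware implementation, which uses cheaper approximations, but the eviction behavior is the same idea):

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache: any item can occupy any of `capacity` slots."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()   # keys kept in order of last use

    def access(self, key):
        """Touch `key`; returns True on a hit, False on a miss."""
        if key in self.data:
            self.data.move_to_end(key)      # now the most recently used
            return True
        if len(self.data) >= self.capacity:
            self.data.popitem(last=False)   # evict the least recently used
        self.data[key] = None
        return False

c = LRUCache(2)
c.access("a"); c.access("b")
c.access("a")               # hit; "b" becomes least recently used
c.access("c")               # miss; "b" is evicted to make room
print(c.access("b"))        # False -- "b" is gone
```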

So... your question is, if a piece of data is in cache at time t, how likely is it to be in cache at time t + n? For a very loose first-order approximation, what you'd need to know is how many data accesses there are per time unit, how those accesses are distributed, and the relative sizes of the cache and working set (for sanity's sake, I'll assume a constant working set here). We'll use an LRU replacement method. Let's say we have 128 units of fully associative cache, and 1024 units of working set (an 8:1 ratio), and we read 16 pieces of data per time interval, uniformly distributed over the working set.

Let's say at time t=0, we just read in data block n. Block n will be removed if we read 128 items that *aren't* block n (as at that point, it will be the least recently used block). If we read block n again, then we "reset" back to time t=0.

In any particular time interval, we read 16 items. Each has a (1/1024) chance of being block n, and a (1023/1024) chance of *not* being block n. So, on average it should take 64 time intervals before we randomly hit block n again (1 / (16 * (1/1024)) = 64), but only about 8 before it gets removed from the cache (128 / (16 * (1023/1024)) ~= 8).

In this case, our data block isn't going to last long, and we're going to have a sorry hit ratio. But since the data accesses are random, you can't do much better -- 7/8 of the data is just not going to be in cache at any given time. If we were using block n all the time (say, there was a 1 in 16 chance of any particular access being on block n), then it would be very unlikely to leave cache -- on average, we'd hit it once per time interval, and it would only be removed if we didn't hit it for 8 time intervals. This is a grossly simplified analysis -- a book on discrete mathematics and/or probability theory would be a good place to start if you want to get more complex.
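If you'd rather not trust the back-of-the-envelope numbers, here's a throwaway simulation of that exact scenario (128-entry fully associative LRU cache, 1024-block working set, uniformly random reads). It lands right around the predicted 1/8 steady-state hit ratio:

```python
import random
from collections import OrderedDict

def simulate(cache_size=128, working_set=1024, n_reads=500_000, seed=1):
    rng = random.Random(seed)
    cache = OrderedDict()            # keys kept in order of last use
    hits = 0
    for _ in range(n_reads):
        block = rng.randrange(working_set)
        if block in cache:
            hits += 1
            cache.move_to_end(block)          # most recently used now
        else:
            if len(cache) >= cache_size:
                cache.popitem(last=False)     # evict least recently used
            cache[block] = None
    return hits / n_reads

print(simulate())    # ~0.125, i.e. about 1/8 of reads hit
```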

In short, your question is dependent on everything. :) Perhaps if you had some particular situations in mind, more precise answers could be provided.

Your second question -- "what's the probability something will be needed again, given that it is needed now, as a function of time?" -- is a function of the workload of the system. That's something you'd define as an input to a model like this, not a question you'd ask of it. Also, in real life, you often have *no idea* what this number will be -- if you knew this number for every piece of data, then you'd simply load in all the most-often-used pieces of data and be done with the whole cache management thing.

But hey, if this was easy, I wouldn't have a job. :p
 

CTho9305

Originally posted by: Matthias99
In short, your question is dependent on everything. :) Perhaps if you had some particular situations in mind, more precise answers could be provided.

Choose some random app where the working set is very similar to the size of the cache. For the sake of simplicity, make it fully associative so you don't have to worry about conflict misses. Again for simplicity's sake, you get to choose the app, and a cache size to match ;). Maybe the SPEC Int 2000 chess benchmark would be good.

Your second question -- "what's the probability something will be needed again, given that it is needed now, as a function of time?" -- is a function of the workload of the system. That's something you'd define as an input to a model like this, not a question you'd ask of it. Also, in real life, you often have *no idea* what this number will be -- if you knew this number for every piece of data, then you'd simply load in all the most-often-used pieces of data and be done with the whole cache management thing.

We can know this at compile time, assuming we know what the input to the program is going to be. I'm not interested in predicting... I'm interested in the "correct" answer. Again, it depends on workload, so use the same one from above. Running the program the whole way through on a simulator, and then looking at what happened would also provide the answer.
 

Matthias99

Originally posted by: CTho9305

Choose some random app where the working set is very similar to the size of the cache. For the sake of simplicity, make it fully associative so you don't have to worry about conflict misses. Again for simplicity's sake, you get to choose the app, and a cache size to match ;). Maybe the SPEC Int 2000 chess benchmark would be good.

I'm not familiar with the internals of that benchmark, unfortunately. However, if the working set is "similar" in size to the cache, then (assuming we're dealing with just one process here -- in a real system, obviously the OS and other processes will end up overwriting parts of it while your process isn't running) the cache will eventually be filled with all (or most of) the data in the working set. In theory, you should see a hit ratio eventually around (size of cache / size of working set), unless you know enough about the workload to do some sort of intelligent caching, or it is significantly nonrandom (which this benchmark probably is). You'd have to simulate it.

We can know this at compile time, assuming we know what the input to the program is going to be. I'm not interested in predicting... I'm interested in the "correct" answer. Again, it depends on workload, so use the same one from above. Running the program the whole way through on a simulator, and then looking at what happened would also provide the answer.

If the program is deterministic, and you know exactly what the inputs are going to be, then yes, you'll know this. I'm not certain exactly what your question becomes at that point, as you then have enough information to determine an optimal caching strategy. With full knowledge of the future, the optimal policy is simple to state: whenever you need to throw something out of cache, throw out the data whose next use is furthest in the future (this is Belady's algorithm; LRU and LFU are the practical stand-ins when you don't know the future). If you're wondering about the properties of that strategy (for example, what the read hit rate will be), it's dependent on the structure of the workload, how big the working set is relative to the cache, and how fast the working set is changing (if it's not constant). Maybe I'm just not understanding your question properly (this is likely). :frown:
 

CTho9305

Originally posted by: Matthias99
Maybe I'm just not understanding your question properly (this is likely). :frown:

I don't think you are.

...as you then have enough information to determine an optimal caching strategy. With full knowledge of the future, the optimal policy is simple to state: whenever you need to throw something out of cache, throw out the data whose next use is furthest in the future (this is Belady's algorithm; LRU and LFU are the practical stand-ins when you don't know the future).

I'm not interested in knowing an algorithm that would give you a good hitrate. (Well, I am, but not for this question, and that kind of thing is covered in classes and many papers have been written on the subject.)

I'm not familiar with the internals of that benchmark, unfortunately.
What are you familiar with?

Once you've picked a benchmark to use that you're familiar with, the questions are...
1) Given that memory location x is in cache at time t, how likely is it that it will be in cache at time t+n?
2) Given that we used memory location x at time t (some history information, such as frequency of access, time between accesses, etc. can be included), how likely is it that we will access it at time t+n?

Obviously in both cases, the general behavior will be that as n increases, the chances are smaller. But how are the chances shrinking? Linearly? Exponentially?

Hopefully that clears up what I'm asking...

edit: while question 2 is of some interest normally when designing a cache (answering it would help pick a replacement strategy), question 1 is probably not considered when designing a cache... as computer engineers, we don't particularly care how much time something spends in cache, so long as it is present when needed.
 

sao123

Lifer
May 27, 2002
12,656
207
106
Graphing the decay of P(t) = P(A|B) = P(A & B) / P(B) probably isn't very hard, as it seems to be a recursive function.

Though I can't remember all of my probability class -- can you define equations for
P(B) and P(A & B) based on the size of the working set, the size of the cache, and the replacement strategy?
 

Matthias99

Okay, I think I at least understand the question now. This is progress. :)

The problem I'm having trying to answer your two questions is that the answer to question 1 is partially dependent on the answer to question 2 (and partially on the cache details), and the answer to question 2 is dependent on the pattern of accesses to the cache (and therefore the application you're running).

When you start getting into the nitty-gritty details of this stuff, it has to be analyzed in the context of a particular application, since the performance starts to depend on how the cache is being used rather than just on its structure. If you keep using the same block of data very frequently, it will almost always be in the cache, and the odds of it not being there are very small. If you use it only occasionally, then it is increasingly likely to be replaced during each interval, and the odds of it being in cache at any given time start to drop. But the odds are also simultaneously dependent on the cache/working set size and the replacement strategy.

For any particular pattern of usage, it's possible to define (or at least approximate) the probabilities referred to in the last post, and from there to model the chances of any particular piece of data being in cache at a given time. But it's not possible to answer your questions "in general". The best I can do is to offer the following, which is how I would try to model this (very simplistically):

If you assume a fully associative cache capable of holding N items, with an LRU replacement policy (many L2 or L3 processor caches are like this), then a piece of data D that is in the cache will be removed from the cache if N other pieces of data *not already in the cache* (I had neglected this before) are looked at before you read D again. Let W be the size of the working set.

If W is smaller than N, nothing will ever leave the cache. After enough time, your hit rate will be 100%.
If N << W, then it's not really possible to cache much data effectively -- you need a bigger cache. Your hit rate will never be very good. We'll assume here that N is at least within a couple orders of magnitude of W.

Let P(D) be the probability that any particular data read will try to access the piece of data you're interested in. In reality, this is probably a very complex function, but for now we'll assume it's a constant. We'll assume also that any accesses not to item D (P(NOT-D)) are evenly distributed over the rest of the working set.

P(NOT-D) = 1 - P(D). Now, there are actually two cases here -- one where we access something that's already in the cache, and one where we pull some new item into the cache and kick some old item out.
P(H) = Probability of a cache hit = P(NOT-D) * ((N-1) / (W-1)) (if N and W are big enough, you can just ignore the -1 for item D)
P(M) = Probability of a cache miss = P(NOT-D) - P(H).

The number we're looking for here is the probability, after R reads, that data D is still in cache. In our particular case, the data leaves cache if we have N read misses without reading item D.

I'm lacking access to any statistics books right now, so this is even more grossly oversimplified and probably not 100% correct. :p Generally, in R reads, we would expect to hit item D (P(D)*R) times. We would also expect P(H) * R read hits, and P(M) * R read misses. If (P(D) + P(H)) >> P(M), item D will almost always be in cache (this implies either a very large cache, or that you're constantly using D). If (P(D) + P(H)) << P(M), it will almost never be in cache except immediately after it is read (this implies a very small cache, or that D is very infrequently accessed). If P(D) << P(M) (likely if you have a large working set), then the expected number of reads before D is no longer in cache is R such that P(M) * R = N. I'll try to come up with a better equation later on (I suspect this would be best modelled as a Markov process), but hopefully this gives you the idea.

GROSSLY SIMPLIFIED:

Solve for R:
P(M) * R = N

R = N / P(M)
R = N / (P(NOT-D) - P(H))
R = N / (P(NOT-D) - (P(NOT-D) * (N/W)))   (note: I left out the -1s here because I'm assuming that N and W are big enough that N-1 ~= N and W-1 ~= W)
R = N / ( P(NOT-D) * (1 - (N/W)) )

R = N / ( (1-P(D)) * (1 - (N/W)) )

If we set P(D) to zero (if we want to know how long we can randomly read other data while D will still remain in cache):
R = N / ( 1 - (N/W) )

As W gets bigger and bigger, R approaches N. Which makes sense, since if you have a working set much much bigger than the cache size, the odds of any particular piece of data staying in cache longer than the minimum possible time (N reads) are very low.
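For what it's worth, that last expression is easy to poke at numerically (same simplifying assumptions as above: constant P(D), fully associative LRU, N and W large enough to ignore the -1 terms):

```python
# Expected number of reads R before item D is evicted from an N-entry
# fully associative LRU cache, per the simplified model above:
#   R = N / ((1 - P(D)) * (1 - N/W))
def expected_reads_until_eviction(p_d, n, w):
    return n / ((1.0 - p_d) * (1.0 - n / w))

print(expected_reads_until_eviction(0.0, 128, 1024))       # ~146.3
print(expected_reads_until_eviction(0.0, 128, 1_000_000))  # ~128: as W >> N, R -> N
```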

Close enough for government work, if I didn't screw anything up. :)
 

CTho9305

So, the function describing the probability that something is in cache after R reads is 1-(N/(R*1-(N/W))). I must be doing something wrong when I solved for that, because that doesn't vary between 1 and 0 ;).
 

Matthias99

I'm not sure how you came up with that formula, exactly. Could you show your work? (I'm starting to feel like a TA again...)

The last formula I had in my post (R = N / (1 - (N/W))) should be calculating the *expected value* of R when the item in question will be removed from cache (I'm fairly certain it's correct, since the values it gives make sense). You can't use that directly as a probability function, if that's what you were trying to do.
 

CTho9305

Originally posted by: Matthias99
I'm not sure how you came up with that formula, exactly. Could you show your work? (I'm starting to feel like a TA again...)

The last formula I had in my post (R = N / (1 - (N/W))) should be calculating the *expected value* of R when the item in question will be removed from cache (I'm fairly certain it's correct, since the values it gives make sense). You can't use that directly as a probability function, if that's what you were trying to do.

Oh. Forgot my stats. Did you have a formula for probability in terms of R in your post somewhere that I missed?
 

Matthias99

No; that's what I haven't had time to go back and figure out. I have this "job" thing I have to do. :p