What exactly does cache do?

Malak

Lifer
Dec 4, 2004
14,696
2
0
So it says my proc has a certain amount of cache, which seems tiny compared to the memory in any other piece of hardware, so I'm curious what goes through that small amount, how it affects performance (AMD procs have different amounts of L1/L2 cache than P4s), and why there are 3 different caches instead of just one large one (which might be answered by the previous questions). I read about this some time ago, I think in Maximum PC, and have since forgotten.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
http://en.wikipedia.org/wiki/C...che#Multi-level_caches
First google result for cpu cache small fast. Summary: smaller caches are faster. You want to hit the L1 cache in <=3 cycles, and that gives you the vast majority (80+%) of your accesses. Those that miss will cost about 10-20 cycles if they hit in the L2, so you pay a larger performance hit, and if you miss in the L2, you wait hundreds of cycles. If you had just one big cache, you'd always be waiting 10-20 cycles, so 80% of the time, you're going to get worse performance. If you only had the small cache, 20% of the time you'd have to wait hundreds of cycles for main memory.

http://www.pantherproducts.co..../CPU/CPU%20Cache.shtml second google result for the same thing
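
To put rough numbers on that (a back-of-the-envelope sketch in Python; the latencies and hit rates are just the approximate figures above, not measurements of any particular CPU):

def amat(hit_time, miss_rate, miss_penalty):
    # average memory access time: every access pays hit_time, and the
    # fraction that misses additionally pays miss_penalty
    return hit_time + miss_rate * miss_penalty

mem = 300                                       # "hundreds of cycles" to DRAM
big_only  = amat(15, 0.20, mem)                 # one big L2-speed cache
two_level = amat(3, 0.20, amat(15, 0.20, mem))  # small L1 in front of the L2
                                                # (assumes the L2 also misses
                                                # on ~20% of what it sees)
print(big_only)    # 75.0 cycles per access on average
print(two_level)   # 18.0 cycles per access on average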
 

walla

Senior member
Jun 2, 2001
987
0
0
Cache has a huge impact on performance.

L1 cache is small, typically on-chip. The smaller it is, the shorter the line delays are and the faster it can be associatively searched. Hence, performance is a balance between cache size and cache efficiency.

L2 cache is larger, typically on-chip. L2 is searched in the event of an L1 cache miss, so on L2 hits it essentially mitigates the penalty of the L1 miss. Of course, it is larger, so its access latency will be higher than L1's. But if reading from main memory (RAM) takes 100 cycles, the extra ~7 cycles suffered on an L2 hit is a much better alternative.

Essentially, adding levels of cache reduces the number of times you have to access main memory - however, each additional level offers diminishing returns.

It is really all about the power of hierarchy. You could implement 100 levels of cache on a chip (given the real estate)... but if the 99th-percentile case doesn't access beyond L4, can you justify the expense and size of the unused levels? The size and number of cache levels should therefore be based on studies and statistics of typical applications for that processor class.
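
A toy calculation of those diminishing returns (Python; every number below is invented purely for illustration - each extra level is assumed slower, catching some of the remaining misses):

def amat(levels, mem_latency):
    # levels = [(hit_cycles, local_miss_rate), ...] from L1 outward;
    # the penalty for missing level i is the average time of level i+1
    penalty = mem_latency
    for hit, miss in reversed(levels):
        penalty = hit + miss * penalty
    return penalty

levels = [(3, 0.20), (15, 0.10), (40, 0.05), (80, 0.025)]
for n in range(1, len(levels) + 1):
    print(n, "level(s):", round(amat(levels[:n], 300), 1), "cycles")
# roughly 63.0, 12.0, 7.1, 6.9 - the second level helps enormously,
# the fourth barely moves the needle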
 

silverpig

Lifer
Jul 29, 2001
27,703
12
81
Layman's terms:

Hard drive = Library of Congress
RAM = Local library
L2 cache = file cabinet in your office
L1 cache = stack of papers on your desk

Say you need some information. It's obviously easiest and quickest to just flip through a few papers on your desk, and gets progressively more difficult and more time consuming the higher up that list you go. Of course if you have the entire contents of the library of congress in a stack of paper on your desk, it'll take you for damn ever to find anything too. The trick is to break it up into intelligently scaled sizes and speeds to optimize memory access.
 

Pacemaker

Golden Member
Jul 13, 2001
1,184
2
0
I think a point that was not made is why cache is effective. Because most programs spend their time in loops, the same code is executed over and over again. The cache can store this code so that you don't have to go all the way to RAM (or, in the worst case, the hard drive) to get it.
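
You can see this with a toy cache model (Python; a fully-associative LRU cache is a simplification of real hardware, but the locality argument is the same):

from collections import OrderedDict

def hit_rate(trace, num_lines):
    cache, hits = OrderedDict(), 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            if len(cache) >= num_lines:
                cache.popitem(last=False)  # evict least recently used
            cache[addr] = True
    return hits / len(trace)

loop_trace = [0, 1, 2, 3] * 250   # a 4-instruction loop run 250 times
print(hit_rate(loop_trace, num_lines=8))   # 0.996: after the first pass,
                                           # every fetch is a hit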
 

piroroadkill

Senior member
Sep 27, 2004
731
0
0
If you're doubting the need for cache, do something geeky:

Get drunk, and decide it would be a good idea to turn off your level 1 and 2 cache.

Oh man, dear christ, me and a friend literally sat there for just over half an hour and only managed to open one explorer window, and it still hadn't loaded up all my startup items.

Eventually I just gave up, but it was fun. Then we decided a good prank to play on people would be to disable their cache. Harmless, but dear god, so intrusive. ;)
 

sao123

Lifer
May 27, 2002
12,653
205
106
Summary: smaller caches are faster.
Actually, I think it's better written as:
faster caches are smaller... So why not just make an L1-speed cache the size of an L2 cache, or even bigger?

Why? Cost!

The fastest cache (L1) is an on-chip, extremely low-latency SRAM. The L2 cache is a slightly slower form of SRAM, but still faster than main memory (DRAM). Look at what the difference in price between speed grades of RAM is... now try to imagine what a 256 MB stick of L1-grade cache memory would cost. Probably $1-2K or more.


 

DoubleE

Junior Member
Nov 24, 2004
4
0
0
The point that Pacemaker makes is important: L1/L2 cache is typically separated into areas for instructions and data. It may be overlooked because people tend to think of memory as storing information used by programs, but the processor primarily consumes memory containing the programs themselves (aka instructions). Store/load operations are clearly affected by cache memory, but it is essential to get a huge win out of effective instruction caching (especially with Branch Target Buffers or the like) because modern processors try to execute many instructions at the same time. While not all instructions reference other memory (to load/store data), *all* instructions are themselves located in memory, so when a program branches, the cache can make the difference between a big stall and little/no stall of the processor pipeline. (If a load/store stalls, speculative execution may still occur.)

So, that's one big reason that even a 'small' cache is so important: instructions are needed on every cycle and are highly localized and predictable. Data fits the model as well, but is more likely to miss.

One reason that larger cache isn't used more frequently is that more die size typically reduces yield, thus increasing the costs (which is what sao123 mentioned).

Other people may have more in-depth (or accurate) knowledge, as I haven't worked with cache design in quite a while!
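
Some rough stall arithmetic to back that up (every number here is assumed, just to show the shape of it): since every instruction must be fetched but only a fraction are loads/stores, a small I-cache miss rate can cost as much as a much larger D-cache miss rate.

miss_penalty = 20      # cycles to L2 on a miss (assumed)
i_miss_rate  = 0.02    # misses per instruction fetched
d_miss_rate  = 0.05    # misses per load/store
mem_ops      = 0.30    # fraction of instructions that are loads/stores

stall_i = 1.00 * i_miss_rate * miss_penalty     # ~0.4 stall cycles/instruction
stall_d = mem_ops * d_miss_rate * miss_penalty  # ~0.3 stall cycles/instruction
print(stall_i, stall_d)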

 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: sao123
Summary: smaller caches are faster.
Actually, I think it's better written as:
faster caches are smaller... So why not just make an L1-speed cache the size of an L2 cache, or even bigger?

Why? Cost!

The fastest cache (L1) is an on-chip, extremely low-latency SRAM. The L2 cache is a slightly slower form of SRAM, but still faster than main memory (DRAM). Look at what the difference in price between speed grades of RAM is... now try to imagine what a 256 MB stick of L1-grade cache memory would cost. Probably $1-2K or more.

I really don't think that's the issue - in a cache, your limiting factor is going to be the drive strength of the bit cells vs. the capacitance of the bit lines. As your bit cells get bigger (for more drive), you pay for it in longer bit lines and lower density; shorter bit lines mean either fewer cells or smaller cells (weaker drive). I think it just comes down to manufacturability / process limitations (interconnect capacitance vs. transistor drive)... if building a 1MB L1 cache that you can access in 1-3 cycles were possible, I bet you'd see it on high-end stuff like Itaniums or other server processors.
 

Varun

Golden Member
Aug 18, 2002
1,161
0
0
SRAM also uses far more transistors than DRAM (typically six per bit, versus DRAM's one transistor per bit), upping the cost. DRAM also must be refreshed often to keep its data intact.

Back to cache though: with all the things the CPU and computer have to do at once, cache is vital. The CPU will pull the instructions it needs into cache, and can then turn the system bus over to the chipset for things like DMA. If a program fits entirely in the cache, it is obviously very fast, as the CPU never needs to go to DRAM for opcodes.
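
You can even watch the working-set effect from a script (results vary by machine, and CPython's interpreter overhead blunts the sharp cliffs you'd see in C, but the upward steps usually still show once the data outgrows a cache level):

import random, time

def one_cycle_perm(n):
    # Sattolo's algorithm: a random permutation that is a single cycle,
    # so the pointer chase below visits every element
    p = list(range(n))
    for i in range(n - 1, 0, -1):
        j = random.randrange(i)
        p[i], p[j] = p[j], p[i]
    return p

def ns_per_step(n, steps=500_000):
    p, i = one_cycle_perm(n), 0
    t0 = time.perf_counter()
    for _ in range(steps):
        i = p[i]                    # each step is one dependent load
    return (time.perf_counter() - t0) / steps * 1e9

for n in [2**k for k in range(10, 24, 2)]:
    print(n, "elements:", round(ns_per_step(n), 1), "ns/step")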
 

sao123

Lifer
May 27, 2002
12,653
205
106
in a cache, your limiting factor is going to be the drive strength of the bit cells vs. the capacitance of the bit lines. As your bit cells get bigger (for more drive), you pay for it in longer bit lines and lower density; shorter bit lines mean either fewer cells or smaller cells (weaker drive). I think it just comes down to manufacturability / process limitations (interconnect capacitance vs. transistor drive)...

I don't disagree that on-die silicon area and trace length are determining factors, but even those can be overcome with other, more expensive designs.


if building a 1MB L1 cache that you can access in 1-3 cycles were possible... I bet you'd see it on high-end stuff like Itaniums or other server processors.

Why? Most servers aren't built around high-end CPUs for high-speed processing; they are designed for maximal I/O throughput. Look at the purpose of most servers... File & Print, web hosting, domain controller, database hosting. Low speed, high volume usage.
This should be evident in that current Itaniums are only 1.6 GHz with a 400/533 MHz FSB, and have the same 32KB of L1 cache as a P4, the same 256KB of L2 cache as the crippled Celery CPU, and some large 3-9MB of slow L3 cache. Servers have multiple (2-32 or more) CPUs for handling as many simultaneous data requests at a time as possible.





 

Peter

Elite Member
Oct 15, 1999
9,640
1
0
Intel are keeping their L1 caches very small for the simple fact that their L2 caches are "inclusive", meaning that the L2 always holds copies of what's going on in L1. Thus, making L1 caches large would make the same portion of L2 redundant.

AMD processors on the other hand have "exclusive" cache levels. L1-cached data aren't found in L2. Thus, all the cache there is contributes to the performance of the machine, and L1 caches are comparatively big - 128 KBytes currently.
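
The capacity arithmetic is simple (the 128 KB L1 is the Athlon figure above; the 512 KB L2 is just an assumed size for the example):

l1_kb, l2_kb = 128, 512

inclusive_capacity = l2_kb          # L2 duplicates whatever is in L1
exclusive_capacity = l1_kb + l2_kb  # levels hold disjoint lines, so they add

print("inclusive: at most", inclusive_capacity, "KB of distinct data")
print("exclusive: at most", exclusive_capacity, "KB of distinct data")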
 

sao123

Lifer
May 27, 2002
12,653
205
106
inclusive caching schemes are a big headache.


AMD processors on the other hand have "exclusive" cache levels. L1-cached data aren't found in L2. Thus, all the cache there is contributes to the performance of the machine, and L1 caches are comparatively big - 128 KBytes currently.

Now if only AMD could implement the trace execution cache. It would be so much faster if they stored decoded micro-ops instead of the full 32-bit instructions themselves. It would save re-decoding the instruction on every pass through a loop.
 

DoubleE

Junior Member
Nov 24, 2004
4
0
0
Originally posted by: sao123
inclusive caching schemes are a big headache.

Is the reason Intel does this to improve cache coherency algorithms for multiprocessor designs?

It would seem to be simpler to solve this problem with an inclusive L2 cache rather than attempting to determine the correct memory content in all levels of cache...
 

sao123

Lifer
May 27, 2002
12,653
205
106
I suggest you read this article from CPU-Z... contrasting the AMD K8 with the Pentium 4 line of CPUs. The section on caching schemes is excellently written.


Is the reason Intel does this to improve cache coherency algorithms for multiprocessor designs?
No, the L1 & L2 caches are both on the P4 CPU (unlike the PII & PIII, where the L2 was off-die). So the cache is self-contained, and not shared among the processors in a multiple-processor scheme.


Intel chose the inclusive cache relationship because the inclusive scheme gives a big performance gain on an L2 hit compared to exclusive. An L1 miss / L2 hit in an exclusive cache requires 2 writes (eviction from L1 into a victim buffer, then a copy from L2 into L1), while in the inclusive scheme an L1 miss / L2 hit requires only 1 write step (a copy from L2 to L1 - the evicted line already exists in the L2 cache, so a write-back is not necessary).

The headache involved is determining the best-fit sizes of the caches. Too little L1 cache, and you'll have too many L1 misses. Too much L1, and a fast L2 is negated by nearly everything hitting in the L1 (not necessarily a bad thing speed-wise, just a waste of L2 cache and die space). In any case, the bigger the L2 cache in an inclusive model, the bigger the performance jump you can achieve. This is why the Celery processors are a complete failure: they skimp on the L2 cache, which completely breaks the performance found in the P4s.
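
A sketch of those transfer counts (simplified: it ignores dirty/clean state and the real replacement policy):

def l1_miss_l2_hit(scheme):
    # transfers needed to service an L1 miss that hits in L2
    if scheme == "exclusive":
        return ["evict the displaced L1 line into L2 (victim move)",
                "move the requested line from L2 into L1"]
    else:  # inclusive: the displaced line is already duplicated in L2
        return ["copy the requested line from L2 into L1"]

for scheme in ("inclusive", "exclusive"):
    print(scheme, "->", len(l1_miss_l2_hit(scheme)), "transfer(s)")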

 

Calin

Diamond Member
Apr 9, 2001
3,112
0
0
As the processor executes one instruction at a time, the need for a large cache doesn't arise so often. However, cache memory is a big consumer of power (current) compared to DRAM - for a similar speed, it uses something like 10x the current and generates 10x the heat.
The cache holds areas of main memory that were used in the recent past. Most of the code and data in a program tend to be reused again and again (they say that a program spends 90% of its execution time in 10% of the code).
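
That 90/10 figure is exactly why a small cache pays off. The Amdahl's-law arithmetic (the 10x factor for code running from cache is assumed for illustration):

hot_time   = 0.90   # fraction of run time spent in the hot 10% of the code
cache_gain = 10.0   # assumed speedup when that code runs from cache

new_time = hot_time / cache_gain + (1 - hot_time)
print("overall speedup:", round(1 / new_time, 2), "x")   # ~5.26x from
# keeping just the hot 10% of the code in cache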
 

Ariste

Member
Jul 5, 2004
173
0
71
Alright, I'm probably just repeating here, but this is the best way that I've seen cache explained:

Pretend you're a librarian. When somebody comes into the library, they have to ask you to get the book that they want. So every time that somebody comes in for a book, you have to get up from your desk, walk to the shelf, get the book, and come all the way back. This takes a lot of time and slows up all the operations of the library.

So one day you get fed up with this system and try to think up something more efficient. While thinking about it, you notice that a few books have been taken out over and over again while the rest of the books have barely been taken out at all. So you take the 5 most commonly used books and put them on your desk. This way when somebody comes in and asks for those books, you can circumvent the whole system and just give them the book immediately. Much faster and more efficient than the other system.

Unfortunately you don't have the best memory, and every time that somebody comes in and asks for a book you have to check through the books on your desk to make sure that it's not in there before going out to the shelves to get them the book. If the book that the customer asks for is on your desk, great! You just saved a bunch of time through your nifty little system. If the book isn't on your desk, though, you've just wasted time looking through the books on your desk that you wouldn't have otherwise. As a result of this, you need to keep the stack of books on your desk relatively small, or else you will spend more time searching through the book pile on your desk than you actually save by having them there in the first place.

This is more or less how the cache in your CPU works. The cache is like the stack of books on your desk. Since the cache is right on the CPU, it takes almost no time to find things that are stored there. The cache is much faster than your system RAM because the RAM is "farther away" from the CPU, and the process of the CPU retrieving information from system RAM is much like you having to get up to get a book off of the shelf. If the CPU had to send out a command to the RAM and wait for the RAM to send back the relevant information every time it had to execute a command, computers would run horribly slowly. Cache allows the CPU to work much more efficiently. If, however, the cache is too large, the added lookup time on every access means the time lost whenever the information is not in the cache (called a cache miss) will negate the benefits of having the cache there in the first place. This, coupled with cost, is why caches are relatively small compared to RAM and hard drives.

Hope this helps,
 

Gamingphreek

Lifer
Mar 31, 2003
11,679
0
81
So why don't we make a whole mess of small caches, like L1 8KB, L2 8KB, L3 8KB, L4 8KB, and just have a decent amount that way?

I know about the differences between L1 and L2, but for the sake of an example I used those.

-Kevin
 

ghackmann

Member
Sep 11, 2002
39
0
0
Originally posted by: Gamingphreek
So why don't we make a whole mess of small caches, like L1 8KB, L2 8KB, L3 8KB, L4 8KB, and just have a decent amount that way?

I know about the differences between L1 and L2, but for the sake of an example I used those.

-Kevin
Because each level of cache you miss just adds more latency to the fetch.
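
Rough numbers (latencies assumed): with four same-size levels, a line that isn't in the first 8KB cache is unlikely to be in the next 8KB either, so deep misses pay every lookup on the way out to memory.

lookup = [3, 5, 7, 9]   # assumed latencies of four tiny 8 KB levels
mem = 300               # cycles to DRAM

print("miss everything:", sum(lookup) + mem, "cycles")              # 324
print("vs. 3-cycle L1 + 15-cycle big L2:", 3 + 15 + mem, "cycles")  # 318
# ...and the big L2 actually catches most of the misses, while a fourth
# 8 KB cache holds almost nothing the first three didn't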
 

mjia

Member
Oct 8, 2004
94
0
0
I believe that the primary reason for smaller cache is cost.

If you double the size of your cache, you will not double the access time. Accessing memory is not a matter of linearly searching through all the data; caches use a much more efficient lookup (not unlike the indexing that file systems use for drives). The difference in speed between memory stored in chips and physical storage like hard drives is orders of magnitude. So if you can store more useful code in the cache, you will significantly improve performance.

However, a larger size doesn't guarantee improved performance, as the extra data stored has to be useful. The prefetch and branch prediction algorithms have to guess which pieces will be accessed next, and the ratio of correct guesses to useless ones is not guaranteed to be the same between, say, filling 512 KB vs. 1024 KB.

The benefits do not always justify the costs, so manufacturers pick an optimal size (to limit their expense).
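
For the curious, the non-linear lookup works roughly like this (the generic textbook set-associative scheme, not any particular CPU): bits of the address select a set directly, and only the few lines in that set are compared in parallel.

LINE_SIZE = 64    # bytes per cache line (a common choice)
NUM_SETS  = 128   # e.g. a 32 KB 4-way cache: 32768 / (64 * 4) sets

def split_address(addr):
    offset = addr % LINE_SIZE                  # byte within the line
    index  = (addr // LINE_SIZE) % NUM_SETS    # which set to probe
    tag    = addr // (LINE_SIZE * NUM_SETS)    # matched against the set's ways
    return tag, index, offset

print(split_address(0xDEADBEEF))   # one indexed probe, no linear search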
 

Googer

Lifer
Nov 11, 2004
12,576
7
81
Damn, this is the Highly Technical forum. I was seriously hoping to find something other than a simple question like this here.
 

harrkev

Senior member
May 10, 2004
659
0
71
Originally posted by: Gamingphreek
So why dont we make a whole mess of small caches. Like L1 8kb L2 8kb L38kb L4 8kb and just have a decent amount that way.

We do. You can look at it this way:

L0 - on-chip registers
L1 - Close to the CPU - 8-64K or so
L2 - Slower than L1. Still on CPU. 128K to 1M or so.
L3 - 256MB to 1GB. Often called DDR DRAM. Made by Mushkin, PNY, Crucial, etc.
L4 - 100GB. Made by Seagate, Western Digital (this actually happens when you use virtual memory)
L5 - Multiple Terabytes. Service provided by EarthLink, RoadRunner, AOL, etc.

You can literally think of it this way. When comparing ANY two memory technologies, one will be faster, and the other will be cheaper per bit. If this is NOT the case, the one that is more expensive per bit AND slower will not be used, and will become obsolete instantly.

So, every type of memory fits in a different place on the speed/size curve. But people want speed AND size. A lot of cheap memory will hold all of your stuff, but you will wait forever to get at it. A system with only the fast, expensive memory will fly, but will be priced so high that you'd have to sell your house to buy it.

The way that current systems are designed is the ultimate compromise. You can have as much of the cheap storage as possible, but you use ever-decreasing amounts of the faster stuff to make the level below appear to be faster.
 

sao123

Lifer
May 27, 2002
12,653
205
106
L0 - on-chip registers
L1 - Close to the CPU - 8-64K or so
L2 - Slower than L1. Still on CPU. 128K to 1M or so.
L3 - 256MB to 1GB. Often called DDR DRAM. Made by Mushkin, PNY, Crucial, etc.
L4 - 100GB. Made by Seagate, Western Digital (this actually happens when you use virtual memory)
L5 - Multiple Terabytes. Service provided by EarthLink, RoadRunner, AOL, etc.



This is not necessarily a good pattern to follow...

Many processors... including the P4EE, P4 Xeon, and Itanium 1 & Itanium 2 server lines, have on-CPU L3 cache (1MB to 9MB).

What you are calling L3 is actually known as main memory, or system RAM.

What you are calling L4 is your virtual memory, or swap file - not really a cache, since it is slower than main memory.

What you are calling L5 is not even a form of cache. I'm not even sure where you are going with this. Web cache on ISP DNS servers? This has nothing to do with the software / operating system processing & memory hierarchy.
 

Googer

Lifer
Nov 11, 2004
12,576
7
81
Originally posted by: sao123
L0 - on-chip registers
L1 - Close to the CPU - 8-64K or so
L2 - Slower than L1. Still on CPU. 128K to 1M or so.
L3 - 256MB to 1GB. Often called DDR DRAM. Made by Mushkin, PNY, Crucial, etc.
L4 - 100GB. Made by Seagate, Western Digital (this actually happens when you use virtual memory)
L5 - Multiple Terabytes. Service provided by EarthLink, RoadRunner, AOL, etc.



This is not necessarily a good pattern to follow...

Many processors... including the P4EE, P4 Xeon, and Itanium 1 & Itanium 2 server lines, have on-CPU L3 cache (1MB to 9MB).

What you are calling L3 is actually known as main memory, or system RAM.

What you are calling L4 is your virtual memory, or swap file - not really a cache, since it is slower than main memory.

What you are calling L5 is not even a form of cache. I'm not even sure where you are going with this. Web cache on ISP DNS servers? This has nothing to do with the software / operating system processing & memory hierarchy.


Agreed.
 

Tab

Lifer
Sep 15, 2002
12,145
0
76
Originally posted by: CTho9305
http://en.wikipedia.org/wiki/C...che#Multi-level_caches
First google result for cpu cache small fast. Summary: smaller caches are faster. You want to hit the L1 cache in <=3 cycles, and that gives you the vast majority (80+%) of your accesses. Those that miss will cost about 10-20 cycles if they hit in the L2, so you pay a larger performance hit, and if you miss in the L2, you wait hundreds of cycles. If you had just one big cache, you'd always be waiting 10-20 cycles, so 80% of the time, you're going to get worse performance. If you only had the small cache, 20% of the time you'd have to wait hundreds of cycles for main memory.

http://www.pantherproducts.co..../CPU/CPU%20Cache.shtml second google result for the same thing

What do you mean by miss and hit? The CPU tries to fit things into the L1 cache, but if it can't, it tries the L2?