Another issue is what is known as "locality of reference". Data accesses tend to exhibit two kinds of locality: spatial locality and temporal locality.
Spatial locality refers to the fact that when you access a piece of data in memory, you are highly likely to access nearby addresses in the near future. Indexing through an array with a for loop is the classic example of spatial locality.
Temporal locality refers to the fact that when you access a piece of data at one point in time, you are highly likely to need to access it again in the near future.
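To make both ideas concrete, here is a minimal C++ sketch (my own toy example, not taken from any particular program):

```cpp
#include <cstddef>

// Toy example: summing an array exhibits both kinds of locality.
double sum(const double *a, std::size_t n) {
    double total = 0.0;    // 'total' is reused on every iteration - temporal locality
    for (std::size_t i = 0; i < n; ++i) {
        total += a[i];     // a[0], a[1], a[2], ... sit next to each other in memory - spatial locality
    }
    return total;
}
```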
Because memory access patterns in most programs exhibit such predictable behavior, even a very small L1 cache can generate an extremely high hit rate. Modern L1 caches often achieve hit rates above 96%, meaning that over 96% of the time, the data requested by the program is found in the L1 cache despite the cache's small size.
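To see why a high hit rate matters so much, plug some illustrative numbers (mine, not measurements) into the textbook average-memory-access-time formula, AMAT = hit time + miss rate × miss penalty: with a 1-cycle L1 hit, a 4% miss rate, and a 20-cycle penalty to fetch from L2, the average access costs 1 + 0.04 × 20 = 1.8 cycles, versus roughly 20 cycles if every access had to go to L2.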
Another thing you may be wondering is "why, then, are L2 and L3 caches so large compared to the L1 cache?" Aside from physical design issues (L2 and L3 caches are deliberately designed to be slower, and therefore cheaper, etc.), the reason is that they need to be large enough to capture whatever temporal locality remains in a program after the L1 has caught the easy cases.
This topic is somewhat advanced, and I'm sure I could dig up a ton of papers written on it, but basically, here is how it breaks down:
Let's say you write down, in order, the address of every memory access a program makes. Now for each access, define the "stack distance" to be the number of accesses between that access and the previous access to the same address. So let's suppose you have accesses to the following memory addresses, in order:
0x01, 0x02, 0x03, 0x01
The stack distances of the first three will be infinite since they are the first accesses to occur at those addresses. The stack distance of the second 0x01 will be 2, because there are 2 memory accesses between that access to 0x01 and the previous access to 0x01.
Now suppose you do this for the entire program and construct a histogram of the number of occurrences of each stack distance. What you will find for most programs is that the vast majority of memory accesses have very short stack distances - that is, the same memory addresses are accessed frequently in time. This is temporal locality, and, without going into too much detail about how caches work, it is why a small L1 cache performs so well: it doesn't have to store much data to satisfy the bulk of the program's accesses. Making it larger would be very expensive and would not significantly improve the performance of most programs.
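If you want to experiment with this, here is a rough C++ sketch (using the definition above, where the distance counts all intervening accesses, and a toy trace I made up) that computes the stack distances and tallies the histogram:

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <map>
#include <unordered_map>
#include <vector>

int main() {
    // Toy trace; in practice you'd feed in millions of addresses from a real program.
    std::vector<std::uint64_t> trace = {0x01, 0x02, 0x03, 0x01, 0x02, 0x04, 0x01};

    std::unordered_map<std::uint64_t, std::size_t> last_seen; // address -> index of previous access
    std::map<std::size_t, std::size_t> histogram;             // stack distance -> occurrences

    for (std::size_t i = 0; i < trace.size(); ++i) {
        auto it = last_seen.find(trace[i]);
        if (it != last_seen.end()) {
            // Number of accesses between this one and the previous access to the same address.
            std::size_t distance = i - it->second - 1;
            ++histogram[distance];
        }
        // First accesses have "infinite" distance; we simply leave them out of the histogram.
        last_seen[trace[i]] = i;
    }

    for (const auto &[distance, count] : histogram)
        std::cout << "distance " << distance << ": " << count << " occurrence(s)\n";
}
```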
But back to the other issue - why L2 and L3 caches are so much larger. If you make the histogram plot I described in the previous paragraph, you will find that most memory accesses have very short stack distances. However, you will also find a reasonably large occurrence of very *high* stack distances - that is, a non-negligible number of memory locations are accessed semi-frequently, but with their accesses spread pretty far apart in time. You need a large cache to take advantage of this; in a smaller cache, other memory accesses would evict the original piece of data before it could be used again.
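Here is a hedged sketch of that eviction effect, using a toy fully-associative LRU cache (real caches are set-associative, so this only approximates the idea):

```cpp
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <list>
#include <unordered_map>
#include <vector>

// Toy fully-associative cache with LRU replacement; returns the hit count for a trace.
std::size_t count_hits(const std::vector<std::uint64_t> &trace, std::size_t capacity) {
    std::list<std::uint64_t> lru;    // front = most recently used
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> where;
    std::size_t hits = 0;

    for (std::uint64_t addr : trace) {
        auto it = where.find(addr);
        if (it != where.end()) {
            ++hits;
            lru.erase(it->second);           // move to front on a hit
        } else if (lru.size() == capacity) {
            where.erase(lru.back());         // evict the least recently used address
            lru.pop_back();
        }
        lru.push_front(addr);
        where[addr] = lru.begin();
    }
    return hits;
}

int main() {
    // 0x00 is reused, but only after four other addresses have gone by (a long stack distance).
    std::vector<std::uint64_t> trace = {0x00, 0x01, 0x02, 0x03, 0x04, 0x00};
    std::cout << "capacity 2: " << count_hits(trace, 2) << " hit(s)\n";  // 0: 0x00 got evicted
    std::cout << "capacity 8: " << count_hits(trace, 8) << " hit(s)\n";  // 1: 0x00 survived
}
```

With capacity 2, the reuse of 0x00 misses because the intervening accesses pushed it out; with capacity 8, it hits. That is the whole argument for big L2/L3 caches in miniature.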
Again, this is a very abstract, high-level view of why caches are sized the way they are. You can get *very* quantitative about all of this, but that is really beyond the scope of what I could put into a message board post.
Also, the behavior I described above is typical of *most* programs, but exceptions always exist. A few of the SPEC2000 benchmarks exhibit very poor locality, and therefore the cache does not provide a tremendous performance benefit in these cases.