L1, L2, L3 cache

Jeff7181 · May 31, 2005

Originally posted by: Mildlyamused

Originally posted by: Goi

Originally posted by: roguerower
Thanks for all the replies. I take it that 2mb on L2 is better than 1mb. INTEL RULES.


More cache is always better for performance, as long as the cache access latency remains the same. However, increasing cache size also increases hit time, which is why L1 caches are so small compared to L2 and L3. Also, increasing cache size increases leakage power as well as die area, which may be a problem in certain designs.


That's not necessarily true. I mean, look at Intel's Extreme Edition processors with huge amounts of cache but no performance gain...

Notice he said as long as latency remains the same. The reason EE processors don't show as big a performance gain with more cache is because to add more cache they had to increase latency.
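A rough way to put numbers on this point is the textbook average memory access time (AMAT) formula: hit time plus miss rate times miss penalty. The Python sketch below uses made-up cycle counts and miss rates purely for illustration; none of them are measurements of any Pentium 4 part.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in cycles: hit time plus expected miss cost."""
    return hit_time + miss_rate * miss_penalty

# All values below are assumptions chosen only to show the trade-off.
MEMORY_PENALTY = 200  # assumed cycles lost on a miss to main memory

baseline = amat(hit_time=10, miss_rate=0.05, miss_penalty=MEMORY_PENALTY)
bigger_same_latency = amat(hit_time=10, miss_rate=0.04, miss_penalty=MEMORY_PENALTY)
bigger_slower = amat(hit_time=14, miss_rate=0.04, miss_penalty=MEMORY_PENALTY)

for name, cycles in [
    ("baseline cache", baseline),
    ("bigger cache, same latency", bigger_same_latency),
    ("bigger cache, higher latency", bigger_slower),
]:
    print(f"{name}: {cycles:.1f} cycles per access")
```

With these assumed numbers the bigger cache wins when latency is unchanged and loses once every hit costs four extra cycles, which is the caveat being made above.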

redhatlinux · Jun 1, 2005

OK... a little education for FREE here... CACHES never talk, the I-unit in the pipe does all the heavy lifting for the E-unit, in an 'IDEAL World', NOTHING should get to the E-UNIT that can't be executed. The I-unit completes lots of ops itself, REG1toREG2, duck soup, the E-unit NEVER sees it. EVERY instruction that uses main memory has its address placed on 'the memory bus'. L1, L2, L3, regular memory. L1 organization and speed is the best, L1 hit, zing on with the show.. L2, L3 etc. never get a chance to even decode the address fully and see if the data, or instruction, in the case of a branch, is in the lower cache levels. Other design features, such as instruction and data pre-fetch, have the data from L1 already sitting in "special internal 64-bit registers", ready to zoooom down the pipe with zero cache access at all!! Now comes some p!ss poor code, an instruction executing in the E-unit MODIFIES another instruction that is somewhere in 'the pipe' behind it ... boom ... flush everything, ...get the drift..., you MUST now store thru the cache to main memory, no wait for completion of the store thru as long as the new object of the modified instruction is in some cache... or sit there and spit no-ops down the pipe waiting for main memory to store then to cough up the data. hope it helps shed some light ...

pm · Jun 1, 2005

hope it helps shed some light ...

redhat, I helped design a fairly complex cache (the single-cycle read L1 cache on the Itanium 2), and I had a really hard time understanding much of what you wrote. 🙂

Patrick Mahoney
Microprocessor Design Engineer
Intel Corp.

RichUK · Jun 1, 2005

Originally posted by: pm

hope it helps shed some light ...


redhat, I helped design a fairly complex cache (the single-cycle read L1 cache on the Itanium 2), and I had a really hard time understanding much of what you wrote. 🙂

Patrick Mahoney
Microprocessor Design Engineer
Intel Corp.

PWnage ... haha ... yeah, I was wondering what he was on about as well. I didn't want to challenge it, as I don't know much on this subject ...

So pm,
would you like to give a brief explanation of the different caches, or a linky ... I would like to learn more 😀

IntelUser2000 · Jun 1, 2005

Okay people, some of you guys are seriously misunderstanding Intel's cache architecture. It seems that, no matter how much you educate people, they never learn.

On the Extreme Edition series of Pentium 4s that contain L3 caches, L2 is NOT inclusive with the L3 cache. I guess the confusion came when they were told that Intel uses inclusive caches. True, Intel does use inclusive caches, and with processors that have two-level caches, it's easy to understand that L1 is inclusive with L2. However, with three-level-cache CPUs like the Pentium 4 EEs, that's different. L1 and L2 are inclusive like on the other Intel CPUs; however, L3 is separate here. The total cache of the Northwood-core-based EE is actually 2.5MB, not 2MB, since the 512KB L2 and the 2MB L3 are SEPARATE, not INCLUSIVE. Same with all their chips that have three-level caches.

I haven't heard of any CPUs these days that use DRAMs for cache. All of AMD's and Intel's CPUs use SRAMs for cache.

There may also be some people who think an integrated memory controller means that your RAM is now in your CPU like caches are :roll:. My friend asked me about this for the Athlon 64 processors and I told him that ONLY THE MEMORY CONTROLLER THAT WAS IN YOUR CHIPSET GETS INTEGRATED, NOT THE RAM ITSELF.

Glossary
SRAM - Basically very fast storage located very close to the CPU core, using 6 transistors per bit (called 6T).
DRAM - Used in main memory; 1 transistor and 1 capacitor store 1 bit. Higher latency than SRAM, even when both are placed in the same spot on the CPU.
eDRAM - Embedded DRAM, integrated into the CPU the way SRAM-based caches are. Still slower than SRAM, but takes less power and less die space at the same capacity.

(There is also 1T-SRAM, which takes 1 transistor per bit for SRAM-like function, but beyond that I don't understand the difference between it and DRAM, plus it was supposedly still slower than 6T SRAM.)
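To make the die-space point in the glossary concrete, here is a small Python sketch that counts storage-cell transistors only. It is a back-of-the-envelope estimate under that assumption: tags, decoders, sense amplifiers and the DRAM capacitor are ignored, and the 2MB capacity is just an example size.

```python
MB = 1024 * 1024  # bytes per megabyte

def storage_transistors(capacity_bytes, transistors_per_bit):
    """Transistors in the data array alone (assumption: no tags or peripheral logic)."""
    return capacity_bytes * 8 * transistors_per_bit

sram_6t = storage_transistors(2 * MB, 6)  # 6T SRAM cell
dram_1t = storage_transistors(2 * MB, 1)  # 1 transistor (plus 1 capacitor) per DRAM cell

print(f"2MB as 6T SRAM: {sram_6t:,} transistors")
print(f"2MB as DRAM:    {dram_1t:,} transistors")
```

The 6:1 ratio in transistor count is the main reason the same capacity costs far more die area as SRAM.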

TuxDave · Jun 2, 2005

Originally posted by: pm

hope it helps shed some light ...


redhat, I helped design a fairly complex cache (the single-cycle read L1 cache on the Itanium 2), and I had a really hard time understanding much of what you wrote. 🙂

Patrick Mahoney
Microprocessor Design Engineer
Intel Corp.

Bwahaha

ToeJam13 · Jun 2, 2005

Originally posted by: Cheesetogo
What is the point of a level 3 cache?

It's used when (1) you're working with huge datasets, (2) the difference between your core processor speed and main memory speed is really large [5:1 or more] and (3) you have lots of money.

Some of the largest users of systems with L3 caches have historically been universities running large scientific models and businesses with huge databases.

They were also more common with high speed RISC processors during the memory speed slump of the 1990s. A DEC Alpha 21164 running at 300MHz or a MIPS R4400 running at 150MHz usually had to suffer with 60ns or 50ns SIMMs running with a bus speed of 33MHz or 50MHz. That's as much as a 90% speed difference.
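The ratios behind that "90%" figure can be checked with a few lines of Python. The clock and access-time numbers come from the paragraph above; pairing the 60ns SIMMs with the 300MHz part and the 50ns SIMMs with the 150MHz part is my own assumption for illustration.

```python
def clock_ratio(core_mhz, bus_mhz):
    """How many core cycles pass per bus cycle."""
    return core_mhz / bus_mhz

def access_in_core_cycles(latency_ns, core_mhz):
    """Memory access time expressed in core clock cycles."""
    return latency_ns * core_mhz / 1000

# (label, core MHz, bus MHz, SIMM access time in ns); the pairings are assumed.
systems = [
    ("Alpha 21164", 300, 33, 60),
    ("MIPS R4400", 150, 50, 50),
]

for name, core, bus, simm_ns in systems:
    ratio = clock_ratio(core, bus)
    slower = 1 - bus / core  # fraction by which the bus clock trails the core clock
    cycles = access_in_core_cycles(simm_ns, core)
    print(f"{name}: {ratio:.1f}:1 core-to-bus, bus {slower:.0%} slower, "
          f"{cycles:.1f} core cycles per SIMM access")
```

The 300MHz core on a 33MHz bus works out to roughly 9:1, which is where a figure close to 90% comes from.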

Today's memory speed is fast enough (PC2-8000!) that the need for a large L3 cache has waned. Several benchmarks of MySQL and Lightwave have shown modest performance gains when an L3 cache is present, but nothing astonishing like in the old days. As always, some apps benefit more than others.

Depending on the processor and application, the addition of a large L3 cache might actually be a hindrance if you're experiencing a lot of cache misses, because of the increased latency. If your application causes long pipeline chains to form and your pipe stalls due to that added latency, your app might not see any speed increase from the L3 cache.

Goi · Jun 2, 2005

Well, I wouldn't say that main memory is "fast enough". The cost of going to memory is still getting higher and higher, and is known as the "memory wall". It takes hundreds of cycles to access main memory today, compared to a few cycles for L1 cache and under two dozen for L2/L3 cache. This has been an increasing problem for decades. While memory throughput increases, memory latencies haven't decreased that much at all, which is why all sorts of spatial/temporal locality tricks are exploited to mitigate the problem. Adding levels to the memory hierarchy is one of them.
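As a sketch of why an extra level can help, the usual average-access-time formula can be applied level by level, working inward from main memory. The cycle counts below loosely follow the ranges mentioned in this post (a few cycles for L1, under two dozen for L2/L3, hundreds for memory), but the exact values and all miss rates are assumptions, not data for any real chip.

```python
def hierarchy_amat(levels, memory_latency):
    """Average access time in cycles for a cache hierarchy.

    levels: list of (hit_time_cycles, local_miss_rate), ordered from L1 outward.
    Each level's miss cost is the average access time of everything behind it.
    """
    amat = memory_latency
    for hit_time, miss_rate in reversed(levels):
        amat = hit_time + miss_rate * amat
    return amat

MEMORY = 300  # assumed main-memory latency in cycles

L1 = (3, 0.10)   # assumed hit time and local miss rate
L2 = (12, 0.30)
L3 = (20, 0.50)

two_level = hierarchy_amat([L1, L2], MEMORY)
three_level = hierarchy_amat([L1, L2, L3], MEMORY)

print(f"L1+L2:    {two_level:.1f} cycles per access")
print(f"L1+L2+L3: {three_level:.1f} cycles per access")
```

Under these assumptions the L3 pays for itself because it catches half of the accesses that would otherwise go all the way to memory; with a poor L3 hit rate the added lookup time could outweigh the benefit.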

gobucks · Jun 3, 2005

all right, a couple points
1) More cache is usually better, but it's not always an apples-to-apples comparison. The Pentium M has a great low-latency 2MB L2 cache, and that works great for its architecture. However, the Pentium M was designed to operate as much as possible in the cache, because its memory bandwidth sucks - it only uses 400MHz and 533MHz buses, as opposed to the 800MHz and 1066MHz buses of P4s or the dedicated 6.4GB/s link on the A64.
2) The P4, by comparison, also needs a large cache, but only to a point, just enough to keep its pipeline fed. The Northwood core needed 512KB to keep its 20-stage pipeline happy, and Prescott needs 1MB for its 31-stage monster. Because it has tons of memory bandwidth by comparison, larger caches (i.e. 2MB Prescotts) don't do much to help performance, because alleviating the memory usage doesn't buy you as much as it does with the Pentium M. Also, the P4's L2 cache is much higher latency than the Pentium M's, and while Northwood's latency was slightly lower than the A64's, the Prescott's latency is way higher than all other chips.
3) And finally, the A64 has a completely different approach. Its L1 cache is 128KB, 4 times the size of the P4's, and it runs at full CPU speed. Because it is so much faster than L2, it alleviates the need to have a large L2 cache. This is helped further by the A64's short 12-stage pipeline, which doesn't require a lot of cache to stay fed. In fact, if you look at benchmarks of A64 CPUs, the performance difference when going from 1MB L2 to 512KB L2 is usually under 3%. Even Semprons, which only have 256KB or even 128KB, still perform less than 5% slower than 512KB A64s, whereas earlier Celerons (i.e. the 128KB ones) were sometimes twice as slow as equivalent Northwood P4s because they simply couldn't keep the pipeline full. Oh, and also, AMD's cache is exclusive. For the total cache, you take 128KB of L1 + the L2, so you end up with 640KB of cache on lower-end A64s, and 1152KB on high-end chips. With Intel, all the data from L1 is replicated into L2, and if there is an L3, then it is all replicated into the L3. That means 1MB Prescotts have 1MB cache TOTAL, while P4 EEs have 2MB TOTAL.
4) All cache memory is definitely SRAM. Like people have said, it is basically made up of logic flip-flops, whereas a DRAM cell is more like a charging capacitor. SRAM is about an order of magnitude faster, but because it is logic, it requires 6 transistors for each bit, versus 1 for DRAM. That is why cache densities are so much lower than system RAM densities. Because there is usually a small amount of data that is accessed extremely often, this approach helps performance a lot.
5) Currently all cache is on-die. L3 at one point in time was located on the mobo or something, I think (didn't Slot A do this?), but it has since moved onto the actual piece of silicon itself. I think the "on die" label that L1 cache also gets is more accurately "on chip", i.e. part of the actual core, whereas L2 and L3 are on the silicon but not actually integrated into the chip.
6) Intel is not better simply because it has more cache. Intel's cache is large because it needs to be, due to the P4 and PM designs, in addition to the fact that Intel's manufacturing prowess makes it easier for them to make these types of chips. AMD decided on a design focused on efficiency that relies on L2 cache less, which also fits in with its manufacturing style, which is less sophisticated. So simply saying Intel is better because of more cache is like saying Intel is better "cuz they got more of them gigahurtz thiniges."
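The total-cache arithmetic in point 3 can be sketched in Python as follows. The function name and the two policy labels are my own; the sizes come from this thread. The P4 EE is computed both ways because posters here disagree on whether its L3 is inclusive, and this sketch does not settle that.

```python
def effective_capacity_kb(level_sizes_kb, policy):
    """Unique data a hierarchy can hold, under a simplified model.

    exclusive: no line is duplicated across levels, so sizes add up.
    inclusive: inner levels are copies of the outermost one, so only
               the largest level counts.
    """
    if policy == "exclusive":
        return sum(level_sizes_kb)
    if policy == "inclusive":
        return max(level_sizes_kb)
    raise ValueError(f"unknown policy: {policy}")

# Athlon 64: 128KB L1 plus 512KB or 1MB L2, exclusive.
a64_low = effective_capacity_kb([128, 512], "exclusive")
a64_high = effective_capacity_kb([128, 1024], "exclusive")

# P4 EE: 512KB L2 and 2MB L3, under each of the two views given in this thread.
p4ee_if_inclusive = effective_capacity_kb([512, 2048], "inclusive")
p4ee_if_separate = effective_capacity_kb([512, 2048], "exclusive")

print(a64_low, a64_high, p4ee_if_inclusive, p4ee_if_separate)
```

The two P4 EE results correspond to the 2MB and 2.5MB totals quoted earlier in the thread.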

Gamingphreek · Jun 3, 2005

More cache is not necessarily better. There is a happy medium you have to attain. Too much cache and it gets really slow. Too little cache and it simply doesn't have enough room to store frequently accessed data and whatnot.

-Kevin
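The "too little cache" half of that can be seen with a toy LRU cache simulation in Python. This is a simplified model with assumptions of my own: a fully associative cache, a synthetic trace that loops over 8 lines, and no latency modelling, so it says nothing about the "too much cache gets slow" half.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity):
    """Fraction of accesses that hit in a fully associative LRU cache."""
    cache = OrderedDict()  # keys are line addresses, ordered oldest to newest
    hits = 0
    for addr in trace:
        if addr in cache:
            hits += 1
            cache.move_to_end(addr)        # mark as most recently used
        else:
            cache[addr] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used line
    return hits / len(trace)

# Synthetic workload: loop over the same 8 cache lines 100 times.
trace = list(range(8)) * 100

for capacity in (4, 8, 16):
    print(f"capacity {capacity:2d} lines: hit rate {lru_hit_rate(trace, capacity):.2%}")
```

In this toy workload a cache smaller than the working set never hits, one that just fits misses only on the first pass, and doubling it again gains nothing.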
