All right, a couple of points:
1) More cache is usually better, but it's not always an apples-to-apples comparison. The Pentium M has a great low-latency 2MB L2 cache, and that works great for its architecture. However, the Pentium M was designed to operate out of the cache as much as possible, because its memory bandwidth sucks: it only uses 400MHz and 533MHz buses, as opposed to the 800MHz and 1066MHz buses of P4s or the dedicated 6.4GB/s link on the A64.
2) The P4, by comparison, also needs a large cache, but only to a point: just enough to keep its pipeline fed. The Northwood core needed 512KB to keep its 20-stage pipeline happy, and Prescott needs 1MB for its 31-stage monster. Because the P4 has tons of memory bandwidth by comparison, larger caches (i.e. 2MB Prescotts) don't do much to help performance, since taking load off main memory doesn't buy you as much as it does with the Pentium M. Also, the P4's L2 cache has much higher latency than the Pentium M's, and while Northwood's latency was slightly lower than the A64's, Prescott's latency is way higher than that of all the other chips.
3) And finally, the A64 has a completely different approach. Its L1 cache is 128KB, 4 times the size of the P4's, and it runs at full CPU speed. Because L1 is so much faster than L2, it reduces the need for a large L2 cache. This is helped further by the A64's short 12-stage pipeline, which doesn't require a lot of cache to stay fed. In fact, if you look at benchmarks of A64 CPUs, the performance difference between 1MB of L2 and 512KB of L2 is usually under 3%. Even Semprons, which only have 256KB or even 128KB, still perform less than 5% slower than 512KB A64s, whereas earlier Celerons (i.e. the 128KB ones) were sometimes twice as slow as equivalent Northwood P4s because they simply couldn't keep the pipeline full. Oh, and also, AMD's cache is exclusive. For the total cache, you take 128KB of L1 + the L2, so you end up with 640KB of cache on lower-end A64s, and 1152KB on high-end chips. With Intel, all the data from L1 is replicated in L2, and if there is an L3, then it is all replicated in the L3 as well. That means 1MB Prescotts have 1MB of cache TOTAL, while P4 EEs have 2MB TOTAL.
4) All cache memory is definitely SRAM. Like people have said, each SRAM bit is basically a logic flip-flop, whereas a DRAM bit is basically a capacitor that holds a charge. SRAM is about an order of magnitude faster, but because it is logic, it requires 6 transistors for each bit, versus 1 transistor (plus the capacitor) for DRAM. That is why cache sizes are so much smaller than system RAM sizes. Because there is usually a small amount of code and data that gets used extremely often, even a small, fast cache helps performance a lot.
5) Currently all cache is on-die. At one point in time L3 was located on the motherboard or something, I think (didn't Slot A do this?), but it has since moved onto the actual piece of silicon itself. I think the "on die" label that people use for L1 cache is more accurately "on chip", i.e. part of the actual core, whereas L2 and L3 are on the silicon but not actually integrated into the core.
6) Intel is not better simply because it has more cache. Intel's cache is large because it needs to be, due to the P4 and Pentium M designs, and because Intel's manufacturing prowess makes it easier for them to build these types of chips. AMD decided on an efficiency-focused design that relies less on L2 cache, which also fits its less sophisticated manufacturing. So simply saying Intel is better because of more cache is like saying Intel is better "cuz they got more of them gigahurtz thiniges."
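For anyone who wants to check the arithmetic behind points 1, 3 and 4, here is a rough Python sketch. The function names are made up for illustration. It assumes a 64-bit (8-byte) front-side bus and dual-channel DDR400 for the A64's 6.4GB/s figure. It also takes the "inclusive means the biggest level is the total" simplification from point 3 at face value. The L1 sizes plugged in for the Intel chips are assumed values and don't affect the result. The transistor count covers storage cells only, ignoring tags, ECC and decoders.

```python
# Back-of-envelope numbers for points 1, 3 and 4. Helper names are hypothetical.

def peak_bandwidth_gb_s(mega_transfers_per_s, width_bytes=8):
    """Peak bus bandwidth in GB/s (1 GB = 10**9 bytes).

    Assumes a 64-bit (8-byte) data path unless told otherwise.
    """
    return mega_transfers_per_s * 1e6 * width_bytes / 1e9


def total_cache_kb(level_sizes_kb, exclusive):
    """Unique data the cache hierarchy can hold, in KB.

    Exclusive: each level holds different data, so the sizes add up.
    Inclusive (simplified): the largest level already contains the rest.
    """
    return sum(level_sizes_kb) if exclusive else max(level_sizes_kb)


def cell_transistors(cache_kb, transistors_per_bit):
    """Transistors in the storage cells alone (no tags, ECC or decoders)."""
    return cache_kb * 1024 * 8 * transistors_per_bit


# Point 1: bus speed -> peak bandwidth.
for name, rate, width in [
    ("Pentium M, 400MHz bus", 400, 8),
    ("Pentium M, 533MHz bus", 533, 8),
    ("P4, 800MHz bus", 800, 8),
    ("P4, 1066MHz bus", 1066, 8),
    ("A64, dual-channel DDR400 (assumed)", 400, 16),
]:
    print(f"{name}: {peak_bandwidth_gb_s(rate, width):.1f} GB/s")

# Point 3: exclusive (AMD) vs inclusive (Intel) totals.
print("A64 512KB L2:", total_cache_kb([128, 512], exclusive=True), "KB")
print("A64 1MB L2:", total_cache_kb([128, 1024], exclusive=True), "KB")
print("Prescott 1MB L2:", total_cache_kb([16, 1024], exclusive=False), "KB")
print("P4 EE 2MB L3:", total_cache_kb([8, 512, 2048], exclusive=False), "KB")

# Point 4: 6 transistors per SRAM bit vs 1 per DRAM bit, for 1MB of storage.
print("1MB as SRAM:", cell_transistors(1024, 6), "transistors")
print("1MB as DRAM:", cell_transistors(1024, 1), "transistors (plus capacitors)")
```

These are peak theoretical figures, so real-world throughput will be lower, but the ratios are what matter for the argument: the Pentium M has roughly half the bus bandwidth of a P4, and the exclusive design gives the A64 640KB and 1152KB of usable cache.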