Games benefit. Databases benefit. Photoshop sure as heck benefits.
The type of application that benefits from more on-chip L2 cache is the type of program that likes to do large operations on a set of data 1 or 2 MB in size. With sufficient cache, you can hold the data on-chip and eliminate memory accesses for reading the file, since the whole file has been loaded into the chip. This lets work on the file proceed much faster than if the chip had to pull 3/4ths of the file from main memory. Lemme just give you an example of the massive speed disparity between on-chip cache and main memory.
P4 2.2GHz
Internal L2 bandwidth, SSE2 enabled: 70GB/s
Internal L2 bandwidth, non-SSE2: 35GB/s
Main memory: 3.2GB/s
In programs where you're working with reasonably sized files (e.g. a 1 or 2 MB bitmap), having enough cache to bring the whole thing into cache would give you *tremendous* boosts in speed! As in the 20% range. Same thing for databases. If one entry in a database needs some things changed and you can fit the whole thing into L2 cache, you're going to work on it significantly faster than if you can't.
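To put rough numbers on that disparity, here's a minimal sketch that just divides a 1.5 MB data set by each of the quoted bandwidths. It assumes the figures are peak rates in decimal GB/s and ignores latency entirely, so treat the ratio as an upper bound on what the memory system alone could buy you, not a predicted speedup. Real programs spend most of their time computing, which is why the gain lands nearer the 20% range. The function name and the 1.5 MB size are mine, picked for illustration.

```python
# Bandwidth figures quoted in the post for a 2.2 GHz P4 (assumed decimal GB/s).
L2_SSE2_GBPS = 70.0
L2_PLAIN_GBPS = 35.0
MAIN_MEMORY_GBPS = 3.2


def transfer_time_us(size_bytes, bandwidth_gbps):
    """Microseconds to move size_bytes once at the given peak bandwidth."""
    return size_bytes / (bandwidth_gbps * 1e9) * 1e6


DATASET_BYTES = 1.5 * 1024 * 1024  # hypothetical 1.5 MB working set

for name, bandwidth in [
    ("L2, SSE2", L2_SSE2_GBPS),
    ("L2, non-SSE2", L2_PLAIN_GBPS),
    ("main memory", MAIN_MEMORY_GBPS),
]:
    print(f"{name:>13}: {transfer_time_us(DATASET_BYTES, bandwidth):8.2f} us per pass")

# Peak-bandwidth ratio between SSE2 L2 access and main memory (roughly 22x).
print(f"L2 (SSE2) vs main memory: {L2_SSE2_GBPS / MAIN_MEMORY_GBPS:.1f}x")
```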
Things that *don't* benefit much from L2 cache are streaming applications, where you depend mostly on RAM for your data access, or programs with such small datasets that you don't need more than 256KB of total cache to hold the objects you're working on. But just imagine this situation.
You have a database. You have one entry where every person who's owed a certain amount of money needs to have their interest recalculated. If one set of data is 1.5MB, a 2MB Xeon is gonna do a hell of a lot better than a 512K Xeon because it doesn't have to go rummaging back to main memory every time it needs to make a modification to the file. Just make your modifications in L2, ship it back to main memory, get the next set of data, perform the same operation, ship it back, etc. etc. etc.
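Here's a back-of-the-envelope model of that load, modify, ship-it-back loop. It's a sketch under a big assumption that isn't in the figures above: a set that fits in cache gets read once and written back once, while a set that doesn't fit has to be re-read and re-written on every sweep. The function name and the number of sweeps are hypothetical.

```python
KB = 1024


def main_memory_bytes(set_bytes, cache_bytes, passes):
    """Rough main-memory traffic for `passes` sweeps over one data set.

    Assumption for this sketch: a set that fits in cache is read once and
    written back once; a set that does not fit is re-read and re-written
    on every sweep.
    """
    if set_bytes <= cache_bytes:
        return 2 * set_bytes
    return 2 * set_bytes * passes


SET_BYTES = 1536 * KB  # the 1.5 MB data set from the example
PASSES = 4             # hypothetical number of sweeps per set

for label, cache_bytes in [("512K Xeon", 512 * KB), ("2MB Xeon", 2048 * KB)]:
    traffic = main_memory_bytes(SET_BYTES, cache_bytes, PASSES)
    print(f"{label}: {traffic / (KB * KB):.1f} MB of main-memory traffic per set")
```

With four sweeps, the 512K part moves 12 MB per set against 3 MB for the 2MB part, and the gap grows with every extra sweep.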
*OR* the best example! Photoshop! Imagine (again) you're working with a 1.5MB file, and this time you're doing a Gaussian blur filter. If you only had 256K of cache, you would have to go back to main memory constantly to get the information about a certain part of the picture for interpolation etc., and it would just waste a lot of time. Now, assume you have a 2MB cache running at full core speed. Do a logic operation, check interpolation points, do another logic operation, create interpolation points, etc. until the whole image is finished. Ship it back to main memory. You're done! No memory access in between! Again, I show you the figures.
P4 2.2GHz
Internal L2 bandwidth, SSE2 enabled: 70GB/s
Internal L2 bandwidth, non-SSE2: 35GB/s
Main memory: 3.2GB/s
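If you want to see the "go back to main memory constantly" effect for yourself, here's a toy simulation that counts cache-line misses for repeated sweeps over a 1.5 MB image. It's a sketch, not a model of any real chip: it assumes a fully associative LRU cache, a 64-byte line, strictly sequential sweeps, and no prefetching. The function name and the three-sweep count are made up for illustration.

```python
from collections import OrderedDict

LINE_BYTES = 64  # assumed cache line size for this sketch


def count_misses(buffer_bytes, cache_bytes, passes):
    """Count line misses for `passes` sequential sweeps over a buffer,
    using a fully associative LRU cache (a deliberate simplification)."""
    capacity = cache_bytes // LINE_BYTES
    lines = buffer_bytes // LINE_BYTES
    cache = OrderedDict()  # keys are line numbers, ordered oldest to newest
    misses = 0
    for _ in range(passes):
        for line in range(lines):
            if line in cache:
                cache.move_to_end(line)  # hit: mark as most recently used
            else:
                misses += 1
                cache[line] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)  # evict least recently used
    return misses


KB = 1024
IMAGE_BYTES = 1536 * KB  # the 1.5 MB image from the example
PASSES = 3               # hypothetical number of sweeps the filter makes

for label, cache_bytes in [("256K cache", 256 * KB), ("2MB cache", 2048 * KB)]:
    print(f"{label}: {count_misses(IMAGE_BYTES, cache_bytes, PASSES)} line misses")
```

With the 2MB cache only the first sweep misses. With 256K, every sweep misses on every line, because by the time you loop back around, the start of the image has already been evicted.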
Now, the P4's prefetching drastically reduces the benefit of tremendous caches compared to the P3. But large L2 caches are still very useful in the applications mentioned above.