
A little more info on the upcoming Mustang....

"by shrinking the die area and adding 2 or more stages to the Athlon first gen's short 10-stage pipeline."

:Q I didn't think they would go to such extremes to compete for MHz. I sure hope this doesn't get ridiculous eventually. It would suck majorly if they just kept increasing MHz without really increasing speed..

EDIT: I think the reason AMD chose 512K of L2 cache was because they need whatever speed advantage they can get, and with their superior facilities at Dresden for chips with onboard L2 cache (their facility is much cleaner than Intel's, which is one reason Intel has so many bad processors with the P3), they might as well take advantage of that. Unfortunately, it means a bigger die size..

Of course, all the tweaks they put into the Mustang will probably overcome the 2 extra pipeline stages added..
 
The 512KB Mustang is a higher-end chip. It's placed more in the P4's range, since neither will be a "consumer" chip right off until the shift to .13u.

Honestly, increasing the size doesn't help that much in "desktop" applications.

256k->512k doesn't do much!!!

What the Athlon could benefit from is a SMALLER L1 cache.

The 128KB L1 is going to start to hurt high-MHz scalability. If they want to hit 2GHz with this core, they are going to have to scale back to 64KB or increase the cache latency. Since they're using an exclusive architecture, this shouldn't hurt hit rates, and shrinking will be MUCH better than increasing latency.

P4 uses its small cache (8KB) in order to decrease latency by 1/3, which, running through some numbers, actually improves overall cache-system speed!!

The numbers: a 32KB L1 d-cache results in an average hit rate of 96%. A 32KB d-cache would have a higher latency (3 cycles) at high speed (>1GHz) than an 8KB d-cache (2 cycles). The 8KB cache would have an average hit rate of 92%. These are numbers taken from computer engineering textbooks, so don't debate me here. So, you get a 3% lower hit rate but 33% lower latency. If it isn't obvious, it can be worked out mathematically, but just look at it... the low latency is faster!
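One way to work it out: solve for the break-even miss penalty at which the bigger, slower cache would start to win. A quick sketch using only the hit rates and latencies quoted above (the 26-cycle break-even figure is derived here, not stated in the post):

```python
# Effective L1 cost in cycles: latency on a hit, plus a miss penalty on a miss.
def effective_cost(latency, hitrate, miss_penalty):
    return latency * hitrate + miss_penalty * (1.0 - hitrate)

# Solve 2*.92 + p*.08 = 3*.96 + p*.04 for the break-even penalty p:
breakeven = (3 * 0.96 - 2 * 0.92) / (0.08 - 0.04)
print(round(breakeven, 1))  # 26.0 cycles

# With a 6-cycle L2 behind the L1, the real penalty is far below 26 cycles,
# so the smaller, faster cache comes out ahead:
print(effective_cost(2, 0.92, 6) < effective_cost(3, 0.96, 6))  # True
```

In other words, the 32KB cache would only pay off if a miss cost more than about 26 cycles, which it doesn't when a fast L2 sits behind the L1.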

Eric

 
8KB? What the hell can you do with that?


Oh, I don't know much about this stuff, but can't AMD do what Intel did with the P3 steppings?
 
IaPuP...you can't just quote numbers from textbooks, since they often base their averages on applications such as small benchmarks (LINPACK, for example) or stuff like SPECint and SPECfp, and the memory access patterns of such programs aren't really indicative of the consumer programs we actually use.

Also, the hit rate and hit time of a cache system aren't dependent only on its size, but also on its set associativity, its replacement policy, its block size, and specific tweaks such as victim caches and cache write pipelining.
 
"Oh, I don't know much about these stuff, but can't AMD do what Intel did with the P3 steppings?"

AMD's Mustang is the first core to be improved upon. We may still see further "improvements" on the core after that; I'm just hoping they don't "improve" it by increasing the pipeline stage count..
 
ok.. I apologize to the nitpickers out there.

I was referring to the load-to-use latency in a direct-mapped cache. These are averages, of course: some applications will approach a 100% hit rate and others may barely reach 50%. The point is that MOST programs don't benefit from a slower, larger L1.

Assuming these average numbers, a 99% L2 hit rate with 6-cycle L2 latency, and a 150-cycle 'apparent' main-memory latency (as on the 1.4GHz P4):
8KB (2-cycle): 2 * .92 + 6 * .08 + 150 * .01 * .08 = 2.44 average CPU cycles

32KB (3-cycle): 3 * .96 + 6 * .04 + 150 * .01 * .04 = 3.18 average CPU cycles

Assuming lower hit rates (i.e. 90% and 83%, but mostly fitting in L2):
8KB (2-cycle): 2 * .83 + 6 * .17 + 150 * .01 * .17 = 2.935 average CPU cycles

32KB (3-cycle): 3 * .9 + 6 * .1 + 150 * .01 * .1 = 3.45 average CPU cycles
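Those four lines can be checked with a few lines of Python. This is just the post's own average-access-time formula and numbers, wrapped in a function (note the first 32KB case works out to 3.18):

```python
# Average cycles per access, per the formula above:
#   L1_latency*hit + L2_latency*miss + mem_latency*L2_missrate*miss
def amat(l1_latency, l1_hitrate, l2_latency=6, l2_missrate=0.01, mem_latency=150):
    miss = 1.0 - l1_hitrate
    return (l1_latency * l1_hitrate
            + l2_latency * miss
            + mem_latency * l2_missrate * miss)

print(round(amat(2, 0.92), 3))  # 2.44   (8KB, 2-cycle)
print(round(amat(3, 0.96), 3))  # 3.18   (32KB, 3-cycle)
print(round(amat(2, 0.83), 3))  # 2.935  (8KB, lower hit rate)
print(round(amat(3, 0.90), 3))  # 3.45   (32KB, lower hit rate)
```

The smaller, faster cache wins in both scenarios.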

Assuming the data set is fairly random, main memory becomes 95% of the puzzle, and then you're talking about data streaming, etc., where the cache only holds the data about 10% of the time. *shrug*

I'm sure you could find some numbers where the larger, slower cache beats the smaller, faster one (similar to how the old Katmai beats Coppermine in 1 or 2 benchmarks).

In any case, there is both saved space and improved performance with the 8KB 2-cycle cache. Perhaps they could have implemented a 3-cycle 16KB set-associative cache, but direct-mapped is so much faster and less complex that I think it was the best way to go. Getting fancy with the L1 only limits the clock speed of the MPU.

BTW, these numbers are fairly widely accepted "average" hit rates. They should loosely apply to most applications that have smaller data-sets. Disclaimer: this is a rough approximation and if you want to nitpick it, go nitpick elsewhere. We are going to wait until the chip is released to have a "good idea" of what makes it faster and what doesn't and whether it is faster at all...

Eric Hagen
 
But what happens when new programs come out that take advantage of the larger cache? That's where AMD's 512K will come in handy, I suppose.
 
There currently ARE some programs that DO take advantage of 512K of cache. SOME programs. Others take advantage of FAST cache.

So, AMD appears to be making sure they have as many advantages over the Willamette as possible. I'm just wondering how wide the pipe between the core and the L2 cache is..
 