AMD Chip question

fuzzynavel

Senior member
Sep 10, 2004
629
0
0
I read a few years ago that AMD chips perform more instructions per clock cycle than Intel chips... I think it was something like 3 instructions per cycle for the Athlon and 2 for Intel. Is this the only/main reason slower-clocked AMDs keep up with faster-clocked Intels, or is there something else?
Does this still hold true for the A64 chips, or is it just the long pipeline of the Prescotts letting Intel down?

This isn't for a project or anything... just interested. A bit old for school now at 26 anyway!
 

Steffenm

Member
Aug 24, 2004
79
0
0
I think of AMD and Intel this way:

Imagine a tunnel that is very short and very wide. This is AMD. Then imagine another tunnel that is very long and very thin. This is Intel. The data packets going through are cars, and Intel's cars move a hell of a lot faster, but they can only go through one at a time... in a row, that is. AMD's cars move a bit slower, but many more of them can go through at the same time, in parallel. This means AMD's tunnel gets more done per second even though the speed of the cars is lower :p Yes, I know, ridiculous... but there has to be some truth in it? Maybe not? Or?

Anyway, I use AMD and always have. I find it more efficient for my use.
 

TuxDave

Lifer
Oct 8, 2002
10,571
3
71
Originally posted by: Steffenm
I think of AMD and Intel this way:

Imagine a tunnel that is very short and very wide. This is AMD. Then imagine another tunnel that is very long and very thin. This is Intel. The data packets going through are cars, and Intel's cars move a hell of a lot faster, but they can only go through one at a time... in a row, that is. AMD's cars move a bit slower, but many more of them can go through at the same time, in parallel. This means AMD's tunnel gets more done per second even though the speed of the cars is lower :p Yes, I know, ridiculous... but there has to be some truth in it? Maybe not? Or?

Anyway, I use AMD and always have. I find it more efficient for my use.

That was for the Intel P4. Now what's the analogy for the Intel Centrinos? :)
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
The K7/K8 designs are much wider than Netburst or the P6 cores, yes. There are other reasons too, flexibility among them, but on average they will achieve better per-clock performance. The trade-off, of course, is that you need more hardware.

If you look at modern performance numbers, performance is mainly bottlenecked by memory. It's usually not the processor that can process more data that's faster, but rather the processor with the better latency-hiding technique. Netburst had a great caching system (before Prescott) and it was ahead of the K7 designs most of the time. The K8 came out with an integrated memory controller and completely dominated. Intel made a big mistake with Prescott by focusing more on the processor and making an inferior caching system (almost 4x the L1 latency, 2x the L2 latency).
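As an aside, the latency numbers being argued about in this thread come from microbenchmarks that chase pointers, so that every load depends on the previous one and the cache latency cannot be hidden. A minimal sketch of that idea in C, with arbitrary sizes and iteration counts (this is an illustration, not Cachemem's actual code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical pointer-chasing microbenchmark: each load's address comes
   from the previous load, so latency cannot be overlapped. The working-set
   size N is an assumption; grow it past L1/L2 to see the latency jump. */
#define N (64 * 1024)             /* number of pointer-sized slots */
#define ITERS (10 * 1000 * 1000)

int main(void) {
    size_t *chain = malloc(N * sizeof *chain);
    if (!chain) return 1;

    /* Build a cyclic chain with a large odd stride to defeat spatial
       locality (a real tool would use a random permutation). */
    for (size_t i = 0; i < N; i++)
        chain[i] = (i + 4093) % N;

    clock_t t0 = clock();
    size_t idx = 0;
    for (long i = 0; i < ITERS; i++)
        idx = chain[idx];         /* dependent load: the latency chain */
    clock_t t1 = clock();

    /* Print idx so the compiler can't optimize the loop away. */
    printf("final index %zu, %.1f ns/load\n", idx,
           1e9 * (double)(t1 - t0) / CLOCKS_PER_SEC / ITERS);
    free(chain);
    return 0;
}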
 

uOpt

Golden Member
Oct 19, 2004
1,628
0
0
Just ignore the GHz rating.

BTW, Pentium 4s do execute more than one instruction per clock cycle too, although only for a select few integer operations, and of course only with the right scheduling.
 

itachi

Senior member
Aug 17, 2004
390
0
0
No, the P4 executes more than 1 integer op per cycle all the time, not just in select cases. It has 3 parallel execution engines; shift/rotate and other time-consuming ops take an entire clock cycle, while add/sub executes in a half cycle on each of the two double-pumped integer units.

imgod2u - where'd you hear that L1 latency increased by a factor of 4? I can't find anything that supports that... what I have seen is that the latency increased by 1 clock cycle (2 to 3) and the L2 increased by roughly the same factor (20 to 30).
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Our own dear Anandtech has the numbers:

http://www.anandtech.com/showdoc.aspx?i=1956&p=8

L1 cache latency with Cachemem increased from 1 cycle to 4. Sciencemark changed from 2 to 4. The L2 cache latency increased from 16 to 23 cycles. Considering how small the L1 cache is, and that *all* FP data/instructions are fetched from the L2, having a cache latency that is way more than the level of parallelism you could ever hope to extract from code practically guarantees that you'll have stalls even with cache *hits*.

Btw, the P4 is a 3-issue design (3 micro-ops, anyway), so it can issue and execute 3 micro-ops each cycle. The K7/K8 can issue 3-6 depending on the type of instruction.
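To illustrate the "stalls even with cache *hits*" point, here is a small assumed sketch in C (nothing from the Anandtech article): a single chain of dependent loads can only retire one load per L1 hit latency, while several independent chains give the core enough parallel work to cover that latency.

#include <stddef.h>

/* Illustrative only. With a 4-cycle L1 hit latency, the dependent walk
   is limited to roughly one load every 4 cycles; the 4-way interleaved
   walk gives the out-of-order core 4 independent loads to overlap,
   which is what an "ILP of 4" buys you. */

/* One dependent chain: each load feeds the next address. */
size_t walk_dependent(const size_t *next, size_t start, long steps) {
    size_t i = start;
    while (steps--)
        i = next[i];            /* serial: latency-bound even on L1 hits */
    return i;
}

/* Four independent chains that the core can keep in flight at once. */
size_t walk_interleaved(const size_t *next, const size_t start[4], long steps) {
    size_t a = start[0], b = start[1], c = start[2], d = start[3];
    while (steps--) {
        a = next[a];            /* these four loads don't depend on    */
        b = next[b];            /* each other, so they can overlap     */
        c = next[c];
        d = next[d];
    }
    return a + b + c + d;
}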
 

itachi

Senior member
Aug 17, 2004
390
0
0
I was about to argue my old case... it wasn't till I read the Intel documentation that I realized Anandtech's measurements were closer to the truth (how 4 sites can all put out the same BS is beyond me). The claimed access latency for the L1 cache increased from 2 to 4 cycles for integer ops and 9 to 12 for FP ops, in case you wanted to know.

Anyways, increasing the size of the cache and increasing its associativity reduces the miss rate by a significant amount... cache misses have a far higher penalty than the difference between the two cache latencies, and that difference is far less significant for FP. Also, you can't ignore that Intel had to change the L1 cache architecture to handle AMD's 64-bit architecture. I don't know how all the other factors weigh in statistically, but from what I see, I think you're making the increase in latency out to be far more detrimental than it actually is.

The P4 was a 3-issue design... the Prescott's trace cache can now push out 4 uops per cycle.
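To make the miss-rate-versus-hit-latency trade-off concrete, here is a back-of-envelope average-memory-access-time sketch (AMAT = hit time + miss rate × miss penalty). The cycle counts are the ones quoted in this thread; the miss rates and the memory penalty are invented purely to show the arithmetic, so don't read the output as a verdict either way.

#include <stdio.h>

/* AMAT = L1 hit + L1 miss rate * (L2 hit + L2 miss rate * memory penalty).
   Latencies (cycles) are the ones argued about above; the miss rates and
   memory penalty are assumptions for illustration only. */
static double amat(double l1_hit, double l1_miss_rate,
                   double l2_hit, double l2_miss_rate,
                   double mem_penalty) {
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_penalty);
}

int main(void) {
    double mem = 200.0;  /* assumed main-memory penalty in cycles */

    /* "Northwood-ish": 2-cycle L1, 16-cycle L2, assumed 6%/10% miss rates */
    double nw = amat(2, 0.06, 16, 0.10, mem);

    /* "Prescott-ish": 4-cycle L1, 23-cycle L2, but a bigger L1/L2 assumed
       to cut the miss rates to 4%/7% */
    double ps = amat(4, 0.04, 23, 0.07, mem);

    printf("Northwood-ish AMAT: %.2f cycles\n", nw);
    printf("Prescott-ish  AMAT: %.2f cycles\n", ps);
    return 0;
}

Whether the bigger, slower cache wins depends entirely on how much the miss rates actually drop, which is exactly what is being argued here.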
 

complacent

Banned
Dec 22, 2004
191
0
0
Originally posted by: Steffenm
I think of AMD and Intel this way:

Imagine a tunnel that is very short and very wide. This is AMD. Then imagine another tunnel that is very long and very thin. This is Intel. The data packets going through are cars, and Intel's cars move a hell of a lot faster, but they can only go through one at a time... in a row, that is. AMD's cars move a bit slower, but many more of them can go through at the same time, in parallel. This means AMD's tunnel gets more done per second even though the speed of the cars is lower :p Yes, I know, ridiculous... but there has to be some truth in it? Maybe not? Or?

Anyway, I use AMD and always have. I find it more efficient for my use.

That is wrong. AMD is not a parallel architecture. Its "tunnel" doesn't allow several cars through at once. Basically, the reason AMD is clocked lower is purely a function of its pipeline stages. Overall speedup scales with the number of pipe stages: the more stages, the faster the clock can run. For instance, a 5-stage pipeline with each stage taking 1 ns can have a theoretical speedup of 5x over a single-stage pipeline. Did you know that there are stages in the P4 pipeline that only forward information? That is, their only reason for existing is to keep synchronization? Hardly efficient...
The problem with a long pipeline comes from branch mispredictions, delays, etc. If you flush a 10-stage pipeline vs. a 30-stage one, it takes much less time to get results back out of the 10-stage pipeline. Intel is changing course with their Pentium M (which will be used for the dual core). They have realized that there is a glass ceiling on clock speeds...
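The flush argument reduces to arithmetic: effective CPI ≈ base CPI + branch frequency × misprediction rate × flush penalty, where the flush penalty grows with pipeline depth. A minimal sketch in C; the branch frequency and misprediction rate below are assumed values, not measurements.

#include <stdio.h>

/* Effective cycles-per-instruction with branch mispredictions:
   CPI = base_cpi + branch_freq * mispredict_rate * flush_penalty.
   The flush penalty is roughly the pipeline depth. All rates here
   are assumptions for illustration. */
int main(void) {
    double base_cpi    = 1.0;    /* ideal one instruction per cycle   */
    double branch_freq = 0.20;   /* ~1 in 5 instructions is a branch  */
    double mispredict  = 0.05;   /* 5% of branches mispredicted       */

    int depths[] = { 10, 20, 31 };   /* short pipe vs. Prescott-length pipe */
    for (int i = 0; i < 3; i++) {
        double cpi = base_cpi + branch_freq * mispredict * depths[i];
        printf("%2d-stage pipe: effective CPI %.2f\n", depths[i], cpi);
    }
    return 0;
}

The deeper pipe clocks higher, so the real question is whether the clock gain outweighs the extra CPI, which is essentially the Netburst bet.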

 

RelaxTheMind

Platinum Member
Oct 15, 2002
2,245
0
76
Has anyone been paying attention to the Intel roadmaps? I forgot the link, maybe someone can help, but they're nothing but dual cores ending with a bunch of question marks. Can't wait to see what they're up to and what AMD has to put on the table.

Then there's the super marketing ploy of the Plus rating. They were basically basing performance on bandwidth rather than raw clock speed, which in turn comes down to the pipelines and the efficiency of the various cache stages. Current dual cores, from what I know, only have 1 shared ALU (arithmetic logic unit) amongst the cores.

IMHO... I don't really see the big point for the general market when they still ship them with outdated, slow IDE drives. Uber-fast loading times are what really please the general public. Comments?
 

KSmith

Junior Member
Feb 23, 2005
19
0
0
Originally posted by: RelaxTheMind
... I don't really see the big point for the general market when they still ship them with outdated, slow IDE drives.
The transition to SATA with striped performance RAID has already begun. Virtually all motherboards support this now. Come next Xmas season, SATA RAID will be mainstream.
 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: itachi
I was about to argue my old case... it wasn't till I read the Intel documentation that I realized Anandtech's measurements were closer to the truth (how 4 sites can all put out the same BS is beyond me). The claimed access latency for the L1 cache increased from 2 to 4 cycles for integer ops and 9 to 12 for FP ops, in case you wanted to know.

Anyways, increasing the size of the cache and increasing its associativity reduces the miss rate by a significant amount... cache misses have a far higher penalty than the difference between the two cache latencies, and that difference is far less significant for FP. Also, you can't ignore that Intel had to change the L1 cache architecture to handle AMD's 64-bit architecture. I don't know how all the other factors weigh in statistically, but from what I see, I think you're making the increase in latency out to be far more detrimental than it actually is.

The P4 was a 3-issue design... the Prescott's trace cache can now push out 4 uops per cycle.

The increase in L2 latency may not be as significant, but the increase in L1 latency most likely accounts for a *lot*. If you have an L1 miss, you go to the L2, which is a ~25 cycle latency nowadays. If you have an L1 hit, you now have a ~4 cycle latency. Considering integer data is retrieved mostly from the L1, a 4 cycle latency would require an ILP of 4 instructions at all times to maintain throughput. Anyone who's run your average code knows that is nowhere near the case. You're lucky to get 2 sometimes. Having a latency of 1 cycle is a *huge* advantage. How much could halving the miss rate really make up for a 4-fold processing time on each dependent integer instruction?

As for the L2 cache: keep in mind you'd need an ILP of ~30 instructions to hide the latency of the *cache*. Cache miss rates aside, if your L2 cache has a higher latency than your average ILP can cover, you'll have stalls even with cache hits.
 

KSmith

Junior Member
Feb 23, 2005
19
0
0
This discussion of L1 and L2 performance on various Intel and AMD architectures is fascinating.

How does all of this impact JIT (Just In Time) compilers?
Examples of JITs are the JVM (Java Virtual Machine) and the CLR (Microsoft .NET's Common Language Runtime).
We often think of x86 CPUs as being plug-compatible.
They are, but it would seem that a given JIT compiler could eke out some performance gains by taking advantage of CPU peculiarities.

Java's slogan is "write once, run anywhere."
That applies to applications only, though.
JVMs stay with the host machine.
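A JIT can indeed query the host CPU at startup and pick code paths to match it. As a rough native-code illustration only (the kernels below are hypothetical names, not anything from a real runtime), this is the kind of dispatch a runtime compiler does internally, shown here with GCC's CPUID builtins:

#include <stdio.h>

/* Hypothetical kernels a JIT-like dispatcher might choose between. */
static void sum_sse2(void)   { puts("using the SSE2 path"); }
static void sum_scalar(void) { puts("using the plain scalar/x87 path"); }

int main(void) {
    /* GCC builtins that read CPUID at run time (init must come first). */
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse2"))
        sum_sse2();       /* vectorized path for P4/K8-class chips */
    else
        sum_scalar();     /* fallback for older CPUs               */
    return 0;
}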
 

itachi

Senior member
Aug 17, 2004
390
0
0
Originally posted by: imgod2u
The increase in L2 latency may not be as significant, but the increase in L1 latency most likely accounts for a *lot*. If you have an L1 miss, you go to the L2, which is a ~25 cycle latency nowadays. If you have an L1 hit, you now have a ~4 cycle latency. Considering integer data is retrieved mostly from the L1, a 4 cycle latency would require an ILP of 4 instructions at all times to maintain throughput. Anyone who's run your average code knows that is nowhere near the case. You're lucky to get 2 sometimes. Having a latency of 1 cycle is a *huge* advantage. How much could halving the miss rate really make up for a 4-fold processing time on each dependent integer instruction?
First... the latency of the L1 on pre-Prescott is 2 cycles, not 1. Second, if you do get lucky and your program gets an ILP of 2, how would the Northwood react if another program could get an ILP of 2 at the same instant? Not only does the Northwood have a smaller bus out of the trace cache (pushing 3 vs. 4 uops on the Prescott), it can also only process 1 thread every 2 cycles, whereas the Prescott can alternate between threads each cycle.

Anyways, I've got a crapload of work piled up... thanks to procrastination. I'll give it more thought when I actually have the time.

 

imgod2u

Senior member
Sep 16, 2000
993
0
0
Originally posted by: itachi
Originally posted by: imgod2u
The increase in L2 latency may not be as significant, but the increase in L1 latency most likely accounts for a *lot*. If you have an L1 miss, you go to the L2, which is a ~25 cycle latency nowadays. If you have an L1 hit, you now have a ~4 cycle latency. Considering integer data is retrieved mostly from the L1, a 4 cycle latency would require an ILP of 4 instructions at all times to maintain throughput. Anyone who's run your average code knows that is nowhere near the case. You're lucky to get 2 sometimes. Having a latency of 1 cycle is a *huge* advantage. How much could halving the miss rate really make up for a 4-fold processing time on each dependent integer instruction?
First... the latency of the L1 on pre-Prescott is 2 cycles, not 1. Second, if you do get lucky and your program gets an ILP of 2, how would the Northwood react if another program could get an ILP of 2 at the same instant? Not only does the Northwood have a smaller bus out of the trace cache (pushing 3 vs. 4 uops on the Prescott), it can also only process 1 thread every 2 cycles, whereas the Prescott can alternate between threads each cycle.

Anyways, I've got a crapload of work piled up... thanks to procrastination. I'll give it more thought when I actually have the time.

Depends on the program being run. Anandtech's Cachemem results seem to indicate 1 cycle. But that doesn't change the fact that the ILP in average integer code is almost *never* 4 instructions. 2 cycles may be acceptable, but not 4. As for micro-op issue, I doubt that's a bottleneck, especially, again, considering the ILP bottleneck. And considering how Netburst's FP throughput depends almost entirely on SSE (which is limited to 1 instruction at any given time anyway), I doubt the wider issue really comes in handy. Your average SSE instruction includes a load, a store and an arithmetic instruction.
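For reference, here is what that "load, arithmetic op, store" shape of scalar SSE code looks like with intrinsics; the second function shows the dependent-chain case where wider issue doesn't help. Purely an illustrative sketch, not anything measured in this thread.

#include <xmmintrin.h>   /* SSE scalar-single intrinsics */

/* c[i] = a[i] + b[i] with scalar SSE: each iteration is two loads,
   an add (addss) and a store. Iterations are independent, so an
   out-of-order core can overlap them. */
void add_scalar_sse(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i++) {
        __m128 x = _mm_load_ss(&a[i]);
        __m128 y = _mm_load_ss(&b[i]);
        _mm_store_ss(&c[i], _mm_add_ss(x, y));
    }
}

/* Contrast: a running sum is one long dependent addss chain, so it
   runs at one add per addss latency no matter how wide the issue is. */
float sum_scalar_sse(const float *a, int n) {
    __m128 acc = _mm_set_ss(0.0f);
    for (int i = 0; i < n; i++)
        acc = _mm_add_ss(acc, _mm_load_ss(&a[i]));   /* serial dependency */
    float out;
    _mm_store_ss(&out, acc);
    return out;
}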
 

complacent

Banned
Dec 22, 2004
191
0
0
Originally posted by: RelaxTheMind

Then there's the super marketing ploy of the Plus rating. They were basically basing performance on bandwidth rather than raw clock speed, which in turn comes down to the pipelines and the efficiency of the various cache stages. Current dual cores, from what I know, only have 1 shared ALU (arithmetic logic unit) amongst the cores.

IMHO... I don't really see the big point for the general market when they still ship them with outdated, slow IDE drives. Uber-fast loading times are what really please the general public. Comments?

First, the Plus rating was to help people get over the MHz myth. It didn't have anything to do with basing performance on bandwidth, and everything to do with comparing AMD to Intel. When Joe Average walks into a store and sees a 1.8 GHz Athlon 64 or a 3.0 GHz P4, what do you think he'll buy? By slapping 3000+ on the Athlon, it gives him a reference point against the P4: a 3000+ Athlon 64 should perform about as well as a 3 GHz P4. I think it is very helpful, and the actual clock rate is right on the box.

You are wrong about dual cores having a shared ALU. The ALU is the key component of a core; there would be no point in having dual cores if there were only one ALU. Every time a square root or FP divide happened, both cores would stall. Dual cores share the same cache, and in most cases that is all they share.

Also, the loading time of a hard drive has little to do with the speed problem. Most programs can be loaded into main memory now, and there is a certain amount of spatial and temporal locality that keeps the bulk of a program's working set very close (physically) and also relatively small. Even when decoding a DVD, I can guarantee you the processor is handling no more than 133 MB/s. Actually, DVD Decrypter usually works at about 2 KB a second, and that is a very processor-demanding program. One of the only processors that would truly need a huge amount of bandwidth, with lots of data to push through it, would be a vector processor, which has its own market.


The HUGE bottleneck everyone needs to work on is the speed of memory. Increasing cache will help this, but that is very expensive.
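On that note, the memory bottleneck is easy to see with a simple streaming test in the spirit of the STREAM benchmark (this is an assumed sketch, not that benchmark's code; the array size and units are arbitrary choices).

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Rough streaming copy bandwidth: touch arrays far larger than the
   caches so the loop is limited by main memory, not the core. */
#define N (8 * 1024 * 1024)   /* 8M doubles = 64 MB per array, an assumption */

int main(void) {
    double *src = malloc(N * sizeof *src);
    double *dst = malloc(N * sizeof *dst);
    if (!src || !dst) return 1;

    for (size_t i = 0; i < N; i++)
        src[i] = (double)i;

    clock_t t0 = clock();
    for (size_t i = 0; i < N; i++)
        dst[i] = src[i];                     /* pure read + write stream */
    clock_t t1 = clock();

    double secs = (double)(t1 - t0) / CLOCKS_PER_SEC;
    double mb   = 2.0 * N * sizeof(double) / (1024.0 * 1024.0);
    printf("moved %.0f MB at %.0f MB/s (dst[1]=%g)\n", mb, mb / secs, dst[1]);
    free(src);
    free(dst);
    return 0;
}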