Why AMD?

Page 2

Demon-Xanth

Lifer
Feb 15, 2000
20,551
2
81
Most encoding programs are heavily optimized for Intel, and not optimized for AMD.
 

Matthias99

Diamond Member
Oct 7, 2003
8,808
0
0
Originally posted by: So
Originally posted by: Cawchy87
why doesn't intel make their pipelines shorter thus blowing away AMD?

Simply put, the longer pipeline is a big part of WHY they can have such high clock rates. When you lengthen the pipeline, it becomes easier to increase the clock rate without improving the materials science. If Intel shortened the pipeline they would have to drop clock rates (see the Pentium M); conversely, AMD could increase clock rates if they made a new CPU with a longer pipeline. You must remember that these architectures cannot be easily changed: any small change in CPU design represents millions of dollars spent. Changing the pipeline means redesigning the entire CPU, so 'shortening the pipeline' is nontrivial.

However, all signs right now point to Intel's next CPU (since they canned Tejas) being derived from the Pentium M (which is itself derived from the old Pentium 3): fewer pipeline stages, lower clock speed, higher IPC. Someone also mentioned 'parallelism' in CPUs. Both Intel and AMD appear to be moving towards dual-core designs, where you get two CPUs on a single die (and, further in the future, there could even be more than two cores per die). The reason this hasn't caught on sooner is that most single-user desktop applications are single-threaded: partly because the benefits of multithreading are mostly realized in large-scale server applications (or when running lots of different programs at once), and partly because writing thread-safe code for large projects is a difficult and time-consuming process (and turning a large single-threaded program into a multithreaded one is VERY difficult as well).
 

Vee

Senior member
Jun 18, 2004
689
0
0
This seems like an occasion for me to contribute to this forum. A lot has already been said, but I hope to put it into a more cohesive package.

The first thing you need to understand is that the amount of work a CPU does is not directly dependent on clock rate. That would only be true when comparing two otherwise identical CPUs at different clock rates.

AMD's AthlonXP and Athlon64 architectures are similar in the sense that they are based on the same ideas. AMD's architectures and Intel's P4 architecture have rather little in common, though; they are very different. This architectural difference shows up as different performance profiles on different types of software.

Early CPUs had only a few tens of thousands of transistors on the chip. By the 486DX it was 1.2 million, the Pentium went to 3.3 million, and today we have passed 100 million transistors. What are all these transistors used for? Simply put: to compute more per clock cycle. Both in the direct sense, and in the indirect sense of avoiding the penalties and bottlenecks that appear at higher clock rates, such as memory latency and bandwidth becoming more of a restriction. In the case of the P4, additional transistors are also 'wasted' simply on achieving a higher clock rate.

The basic picture of a computer is that you have data models in memory. The CPU performs work on these data models, changing them and creating new data, according to a program that is also in memory. At the end of the day, a piece of work cannot finish faster than the time needed for all the required input data to travel into the CPU and for the finished results to travel out. Having the same data make the trip again and again is clearly something to avoid as much as possible.

Early CPUs worked pretty much in the intuitive manner: reading an instruction from memory, loading data from memory, executing a very simple operation, and finally writing new data back to memory.
This is the equivalent of a small car manufacturer: taking the order for a car, ordering and receiving all the parts, assembling the car in a garage, delivering it to the customer, and then going home to take the order for the next car.

What has happened to CPUs since then is pretty much the equivalent of a modern car factory and its supporting supply and distribution networks. They build the same cars (execute the same instructions), but they work on many cars simultaneously, in sequential stages, on a production line (the CPU pipeline). It still takes many hours of work for each car to travel down the production line, but a new car comes off the end of the line every few minutes. So the production rate is much higher.

In order to run the line smoothly, parts are preordered and stocked (prefetch, cache). To keep the line busy, cars that have not been ordered and have no customer yet are also built (speculative execution, speculative results).
Nor does anyone wait for a customer to come and pick up the car at the end of the line; instead it's driven to a buffer, a depot (write buffer, cache). With that, I'm leaving the car factory analogy.

What can be done to increase speed? CPU instructions are quite simple. Simpler than most people without assembly programming experience probably realize. In typical x86 software, only some 37% of all instructions load something from memory, and only some 22% write to memory. Obviously, here's an opportunity to run the CPU much faster than we can read and write memory. But it's even better, because most of those reads and writes hit the same small memory areas again and again. So we keep high-speed images of those memory areas inside the CPU: the cache, or caches. The goal now is to keep these caches stocked as cleverly as possible, to utilize the memory bus as efficiently as possible. This is one area of CPU logic development that claims lots of transistors, and a good strategy here is very much the key to real CPU performance.

Another thing that can be done for performance is to add new instructions that perform a common task more efficiently. The original instruction extension was the x87 floating-point math instructions. Since then, all the extensions (MMX, 3DNow, SSE, 3DNow+, SSE2) have been adding vector operations. More on what that means follows soon, but first a bit about the AthlonXP and P4 strategies.

The idea behind the AthlonXP is simple. Make a processor that will execute existing software code as fast as possible, within the limits of available number of transistors.

The P4 architecture is based on three goals:
1: High clock rate. A high clock rate was a goal in itself, regardless of actual performance, the reason being the market's perception of high clock rates as *performance*, and as something desirable.
2: Low technical risk. An architecture with highly predictable performance was desired.
3: Finally, Intel didn't design the P4 to execute the code that existed at the time well (and it didn't, and still doesn't). They intended software to change into something other than classic '386/'387 code. Once all the common benchmarks were re-optimized for the P4, it would seem as if the P4 were performing well. And eventually, when the common applications that could take advantage of P4 optimizations were actually optimized for it, it would perform well in reality as well. At least some of the time.


Understanding Clockrate and Pipelines:

The clock synchronizes the pace of the logic switching, toggling from one state to another. The more complex and branched the transistor trees are, the longer the chains, and the longer the clock pulse needs to be to let enough electrons flow through the gates and toggle all the transistors.
So the simpler you keep each stage, the faster you can clock it. Simpler logic unfortunately means less work gets done per stage, so you need more stages in the pipeline. This is the P4's deep pipeline.
But the rate of finished instructions leaving the end of the line still increases, so does this still seem like a good idea to you?

Well, it isn't. First of all, a higher clock rate is not needed to increase the 'production' rate. We can just as well have more parallel execution units. That's how the Athlons do it (Intel too, to some extent).
In the Athlons, instructions are gathered in a 'scheduler'. Here their order is broken: when all the data needed to execute an instruction is ready, the scheduler dispatches it to one of the execution units. The Athlons have 3 integer units and 3 (different) FP units. The retired instructions/results then go into a queue unit to be put back into the proper order again.
This is called "Out of Order" execution. And (to avoid misunderstandings) it has nothing to do with multiple cores or in-core threading. It is parallel execution of instructions from one thread, in one core.

There's one thing that is needed here: more registers. And the Athlons (and P4s) have many more registers than are visible. The thing is, different instructions simultaneously in flight have different opinions about what the current contents of a register should be. To avoid such false dependencies, each gets its own version of the register. The scheme covering this, and speculative results, is called "register renaming".

So a deep, fast-clocked pipe doesn't have any real advantage.
It does have some serious disadvantages, though. One problem is that you have many more instructions in the pipe. If you hit a branch and guess the outcome wrong, all the instructions in the pipe are wrong. You have to flush the pipe and start over, losing as many clock cycles as there are stages in the pipe. This is one thing that can make P4 performance look ridiculous at times.

The deeper pipe also makes it harder to keep the cache cleverly stocked. Intel has compensated for this with high FSB bandwidth, which refills the caches faster, and with a large cache. These penalties are clearly visible in Celerons (smaller cache) and older P4 generations (slower FSB).

The final disadvantage is the clock rate itself. Heat increases much faster than the clock rate does. And this is the really big problem Intel has ventured into now with its late P4Cs and Prescotts.
AMD's CPUs run much cooler than Intel's today.

The key to CPU performance is not core execution speed anyway! You can make out-of-order execution arbitrarily powerful, but you can't use more execution power than you can feed with instructions and data. The important things are utilizing the memory bus well and keeping the execution flow smooth: prefetch, branch prediction, cache handling, total memory latency and, to some extent, bandwidth.

But the benchmarks Intel sells its processors with don't need to contain many branches (and they don't).


P4 Code Optimization:

Say you have a block of 100,000 32-bit floats, and say you want to subtract 1.0 from each and then multiply by 0.414. In old-style code that would be a loop, indexing from 0 to 99,999.
But if you recompile the code for the P4 using Intel's auto-vectorizing, SSE2-optimizing compiler, the floats will be packed into 128-bit vectors (4 x 32-bit) and the loop will index from 0 to 24,999 instead, executing the operations as vector operations in the SSE2 registers. This is up to 4 times faster.

Programs _MUST_ be recompiled and optimized for the P4, in order for Intel to be competitive at all! On old programs, corresponding Athlons are 20% - 150% faster. The AthlonXP is an absolute brute on 386/387 code. The P4 is rather weak on old 386/387 code. On a clockcycle basis, even the old PIIIe is stronger.

But code resembling the above example makes up most of the time-consuming, performance-critical work a CPU has to do today. It's particularly common in media applications. And this was Intel's idea with the P4 architecture: concentrate on the things that seem likely to become a bottleneck. The AthlonXP is a very powerful general-purpose CPU; the P4 is more of a multimedia chip with x86 extensions. And many modern, large-market applications are optimized for the P4!
And (important for market perception) most benchmarks are!

If the above example is also optimized for AMD's 3DNow+, the AthlonXP will execute it as 50,000 64-bit (2 x 32-bit) vector operations, BUT it will execute 2 of those at a time, in parallel. So it, too, will do 4 32-bit floats simultaneously.
If the P4's SSE2 operation is absolutely optimal (no overflow/underflow, no branches, no division), its performance relates to 'corresponding' AMD 3DNow+ as roughly 7 to 5. If there are kinks, on the other hand, AMD's relative performance will be better, since it will not slow down the way the P4 does. And I think, theoretically, the AthlonXP could actually be better at audio encoding than the P4. (There is the additional complication that the AthlonXP will bottleneck on memory bandwidth before the P4 does, though that may only be a factor in some cases.)

It comes down to the degree of optimization in applications.
And good optimization for the P4 has been much more common! I think that is the big reason AMD will not do its own vector extensions in the future; now they just follow Intel, with SSE in the AthlonXP and SSE2/SSE3 in the Athlon64. (Even so, 3DNow+ and A64 SSE2 optimization has picked up somewhat recently, as can be seen in some later media apps and in the Winstone 2004 Content Creation benchmark.)

However, for video editing in particular, and media in general, the P4 has been a much better choice than AMD. That might not have been the case with different software optimizations (though I still think the P4 would have had the edge), but we live in the real world.

Finally, I think AMD has got a raw deal in benchmark bias. Except for media encoding, common benchmarks do not paint an entirely correct picture of AMD vs Intel performance. Both PCMark and SYSMark are pretty worthless.

Those who chose a P4 for math, science, AI, flight simulators, or an old application suite made a rather poor choice. Lots of AMD CPUs, even some rather cheap ones, leave even the $1000 P4EE in the dust on such applications.
Actual real gaming (as opposed to gaming benchmarks, which only play a 'movie' through a highly SSE2-optimized 3D engine) also seems to slightly favor AMD (or not so 'slightly' in the case of the A64).
 

Demon-Xanth

Lifer
Feb 15, 2000
20,551
2
81
Vee: for only having 46 posts, that was a damned good one.

As far as AMD getting a "raw deal" on benchmarks, a fair amount of it is that there are more P4s out there than Athlons, so guess what people optimize for?
 
Mar 11, 2004
23,444
5,852
146
Holy shiznit Vee!

That's some crazy postage. I admit I didn't even begin to read it, but what I saw while glancing through seemed to be well done.

And I thought that I got wordy when posting stuff....:p