This one's for you GUTB, or whatever you call yourself ^^

Phokus

Lifer
Nov 20, 1999
Subject: Re: Athlon vs Pentium
Date: Wed, 10 Jan 2001 17:25:02 -0600
From: "Togolot" <mfevs@winco.net>
Organization: Info Avenue Internet Services
Newsgroups: 24hoursupport.helpdesk


----------------------------------------------------------------------------

Pentium 4: In Depth
Copyright (C) 2000 by Darek Mihocka
President and Founder, Emulators Inc.
Updated December 30, 2000

----------------------------------------------------------------------------

Introduction

How Intel Blew It

Processor Basics

Limitations of the Pentium III

Pentium 4 - Generation 7 or complete stupidity?

Analyzing the results

Updates

----------------------------------------------------------------------------

Introduction

According to Gateway's web site, the Pentium 4 is "the most powerful
processor available for your PC". Unfortunately for most computer users,
it's simply not true.

Despite a huge pavilion at COMDEX Las Vegas last month, Intel is almost mute
when it comes to this "most powerful processor". Instead, Intel has been
insulting the television viewer all through this Christmas shopping season
with its blue-guy commercials pimping an almost 2-year-old Pentium III processor
instead of its new flagship processor. Why? Could there be... problems?

Merry Christmas, Happy 21st Century, and brace for impact! The PC industry
is taking a huge leap backwards as Intel's new flagship Pentium 4 processor
turns out to be an engineering disaster that will hurt both consumers and
computer manufacturers for some time to come. Effects of Intel's heavily
delayed Pentium 4 release and this summer's aborted high-end Pentium III
release are already being felt, with sharp drops in PC sales this season,
and migration to competing AMD Athlon based systems. Intel, and Intel
exclusive vendors such as DELL have already suffered crippling drops in
stock price due to Intel's processor woes, with each company's stock falling
well over 50% in the past few months and hitting yearly lows this month.
Don't say I didn't warn you folks about this back in September, I did.

As has been confirmed by other independent sources such as Tom's Hardware
and by Intel engineers themselves at
http://www.eet.com/story/OEG20001213S0045, and as a month of close study of
the chip reveals, the Pentium 4 does not live up to speed claims, loses to
months old AMD Athlon processors, and lacks some of the crucial features
originally designed into the Pentium 4 spec. The only thing the chip does
live up to is the claim that it is based on a new redesigned architecture -
something that Intel does every few years, but usually to increase speed.
The new architecture has serious fatal flaws that in some cases can throttle
the speed of a 1.5 GHz Pentium 4 chip down to the equivalent speed of a mere
200 MHz Pentium MMX chip of 4 years ago, even slower than the level of any
Celeron, Pentium II, or Pentium III chip ever released! It's a huge setback
for Intel.

As Tom's Hardware points out, in some cases a massive rewrite of the code
does put the Pentium 4 on top, barely, over the Athlon, but for most "bread
and butter" code it loses to the Athlon. As I've found out this past month
in rewriting both FUSION PC and SoftMac 2000 code, no amount of code
re-writing can make up for the simple fact that the Pentium 4 contains
serious defects in design and implementation. Other developers who have
followed Intel's architectures and optimization guidelines and optimized
their code for the Pentium II and III will also find that no amount of
rewriting will make their code run faster on the Pentium 4 than it currently
runs on the Pentium III. And this is not the fault of the developer. In
cutting corners to rush to release the Pentium 4 as soon as possible, Intel
made numerous cuts to the design to reduce transistor counts, reduce die
size, reduce manufacturing costs, and thus get a product out the door. And
in the process crippled the chip.

What's worse, popular compiler tools such as Microsoft's Visual C++ 6.0 are
still producing code optimized for obsolete 486 and Pentium classic
processors. They haven't even caught up to Pentium III levels yet. Since
most developers do not write in low-level machine language as I do, most
Windows software released for the next year or two will not be Pentium 4 (or
even Pentium III) optimized. Far from it. Since Microsoft traditionally
takes about 3 to 5 years from the release of a processor to the time when
their compiler tools are optimized for that processor, it will be a long
wait for all of us waiting for Intel and Microsoft to get things right.

Good news is, the AMD Athlon processor is still the fastest x86 processor on
the planet and works around many of the problems in Microsoft's compilers
and Intel's flawed Pentium III and Pentium 4 designs. If you weren't an AMD
fan before, you will be after you read what I have to say.

What happened? In an attempt to regain the coveted PC processor speed crown
which the Intel Pentium III lost to the AMD Athlon in late 1999, Intel seems
to have lost all sense of reason and no longer allows engineering to dictate
product design. Under pressure from stock holders to prop up its sagging
stock price, and under pressure from PC users to deliver a chip faster than
the AMD Athlon, Intel made two serious back-to-back mistakes in 2000 trying
to rush chips out the door:

#1 - in the summer of 2000 it tried to push the aging "P6" architecture too
far. The P6 design, or 6th generation of x86 processor which since 1996 has
been the heart of all Pentium Pro, Pentium II, Celeron, and Pentium III
processors, simply does not scale well above 1 GHz. As the aborted 1.13 GHz
Pentium III launch this summer showed, Intel tried to overclock an aging
architecture without doing thorough enough testing to make sure it would
work. The chip was recalled on the day of the launch, costing Intel, and
costing computer manufacturers such as DELL millions of dollars in lost
sales as speed conscious users migrated to the faster AMD Athlon.

#2 - after numerous postponements and under pressure to ship by year end,
Intel finally launched the Pentium 4 chip on November 20, 2000, but only
after engineering cut so many features from the chip as to effectively
render it useless.

Consider that Intel stock was over $70 a share just 4 months ago, prior to
these two mishaps. On the last day of trading in 2000, Intel stock dipped
below $30, having failed to beat AMD all year. Yes, the whole industry is
down this year. Dell is down. Gateway is down. Microsoft is down. But
consider how much damage Intel did to its own credibility and to the
credibility of the whole market by launching one dud after another. Not to
mention its confused and three-pronged battle against AMD, not quite sure
whether to keep pushing Pentium III, go forward with Pentium 4, or switch
all efforts to the new IA64 Itanium architecture. Their engineers are spread
out three ways right now.

What it boils down to is this - just like at Microsoft and just like at
Apple, the marketing scumbags at Intel have prevailed and pushed sound
engineering aside. With the 1.13 GHz Pentium III chip dead on arrival, and
the Pentium 4 crippled beyond easy repair, Intel may have just set itself
back a good 3 to 5 years. Don't get me wrong, I've liked Intel's processors
for years. I rode their stock up when their engineers were allowed to
innovate. After all, they invented the processor that powers the PC. For
almost a decade the 486 and Pentium architectures have been superior to any
competitors' efforts - better and faster than the AMD K5 and K6 chips, far
more backward compatible than Motorola 68K and PowerPC chips, and almost as
fast or faster than the previous generation of chips they replaced. But, as
past history shows, it takes an Intel or an AMD or a Motorola a good 3 to 5
years to design a new processor architecture. And when you blow it, you blow
it. You sit in second place for those next 3 to 5 years. Pentium III has no
future. Pentium 4 needs to be redesigned. Itanium is still not ready and
will require all-new operating systems, computer tools, and application
software.

What users get today, buying either the 1.4 or 1.5 GHz systems from DELL or
Gateway or whoever, is an over-priced, under-engineered, and very costly
computer. A basic 1.5 GHz Pentium 4 computer runs for well over $3000, while
comparable Athlon and Pentium III based systems literally cost 1/3 to 1/2 as
much. Given the price the PC manufacturers pay Intel for the Pentium 4 chip
(a few hundred dollars more than the Pentium III), and given the $1000 to
$2000 premium consumers pay for Pentium 4 systems, the only ones who benefit
from the Pentium 4 are the PC manufacturers themselves! That is, if people
will be stupid enough to fall for it.

The Pentium 4 fails miserably on all counts. In terms of speed and running
existing Windows code, the Pentium 4 is as slow or slower than existing
Pentium III and AMD Athlon processors. In terms of price, an entry level
Pentium 4 system from DELL or Gateway sells for about double the cost of a
similar Pentium III or AMD Athlon based system, with little or no benefit to
the consumer. And most sadly of all, from the engineering viewpoint, the
Pentium 4 design is very disappointing and casts serious doubts on whether
any intelligent life exists in Intel's engineering department. After a month
of using them, I was so disgusted with the two Pentium 4 machines I
purchased in November that both machines have since been returned to DELL
and Gateway. I personally own dozens of PCs and hundreds of PC peripherals,
and never have I been so disgusted with a product (and the way it is
marketed) as to return it.

Both DELL and Gateway falsely advertise their Pentium 4 based systems as
somehow being superior or better than their Pentium III and/or Athlon based
systems. The only thing that is superior is the price. I urge all computer
consumers to BOYCOTT THE PENTIUM 4 and BOYCOTT ALL INTEL PRODUCTS until such
time as Intel redesigns their chips to work as advertised. If you have
already purchased a Pentium 4 system and sadly found out that it doesn't
work as fast as expected, RETURN IT IMMEDIATELY for a refund.

In hindsight, it is not surprising then that prior to the November 20th
launch of the Pentium 4, Intel delayed the chip numerous times, and Intel,
DELL, Gateway, and COMPAQ all warned of potential earnings problems in the
coming quarter, probably knowing full well of the defects in the Pentium 4.
Remember, the engineers at those companies have had Pentium 4 chips to play
with for several months prior to launch.

It is also not surprising that a week before the Pentium 4 launch, at COMDEX
Las Vegas the week of November 13th 2000, neither Intel nor Gateway, who
both had huge displays at that show, would give much information about the
Pentium 4 systems. Not price. Not speed. Not specs. While Intel did display
Pentium 4 based computers, they were locked up during show hours and not
available for inspection by the general public. At Gateway's booth, many of
the salespeople appeared ignorant, apparently not even aware that the
Pentium 4 was being launched the following week. Even DELL, usually a big
exhibitor at these shows (hell, they were half the show at the Windows 2000
launch in February), chose to pull their show exhibit completely, holding
only closed door private sessions with the press. Shareholders and software
developers and the public were barred from these secret meetings. Why?

Don't be a sucker. Don't buy a Pentium 4 based computer. Do as we have
suggested here at Emulators for over a year. If you need a fast inexpensive
PC, buy one that uses an inexpensive Intel Celeron processor if you must. If
you require maximum speed, buy one based on the AMD Athlon. Under no
circumstances should you purchase a Pentium II, Pentium III, or Pentium 4
based computer! In fact, with the cheaper AMD Duron now available to rival
the Celeron, it makes more sense to boycott Intel completely. Buy AMD based
systems. AMD has worked hard to outperform Intel and they deserve your
business!

----------------------------------------------------------------------------

How Intel Blew It

Before I start the next section and get very technical, I'll explain briefly
how over the past 5 years Intel dug itself into the hole it is in now. When
you understand the trouble Intel is in, their erratic behavior will make a
little more sense.

Let's go back 2 or 3 years, back to when the basic Pentium and Pentium MMX
chips were battling with AMD's K6 chips. AMD knew (I'm sure) that it had
inferior chips on its hands with the K6 line. With the goal of producing a
chip truly superior to anything from Intel, its engineers went back to the
drawing board and designed a chip architecture from scratch, codenamed
the "K7", which in late 1999 was released as the AMD Athlon processor. It
took 5 years of work, but they hit their goal. Faster than the best Pentium
III at the time, the Athlon delivered 20% to 50% more speed at only slightly
higher cost than the basic Pentium III. Mission accomplished!

Intel on the other hand, not content with 90% market share, focused not on
FASTER chips, but on CHEAPER SLOWER chips. Monopolistic actions, much like
Microsoft's, designed not to deliver a better product to the consumer but
rather to wipe out the competition. The Pentium II, while easily the fastest
chip on the market at the time, was also more expensive than the AMD K6 and
its own Pentium chips. And thus started a comical series of brain dead
marketing blunders:

Intel launched the Celeron processor, a marketing gimmick aimed directly at
AMD, which consisted of nothing more than taking a Pentium II and removing
the entire 512K of L2 cache memory. Basically what they did was chop off the
most expensive part of the chip to reduce costs, with no regard to side
effects. The result was a smaller less expensive processor, which
unfortunately had the nasty side effect of running far slower than other AMD
or Intel chips on the market at the time! Buying a Celeron was like buying
an old 486 system, it was that slow.

When that didn't pan out, Intel kept selling its older line of Pentium MMX
chips. While running at the same 233, 266, and 300 MHz clock speeds as the
Pentium II, the Pentium MMX was based on the older design of the original
Pentium, the "P5" architecture, and thus delivered about 30% lower
performance than the Pentium II. Again, it lost out to the AMD K6.

In an effort to fix the Celeron problem, Intel re-launched the Celeron as
the Celeron-A, (starship Enterprise anyone?) which now featured, surprise,
surprise, an L2 cache right on the chip. The new chip was indeed much faster
than the original Celeron. Due to the faster on-chip L2 cache, the new
Celeron was even faster than the more expensive Pentium II chip! Intel now
shot itself in the foot by offering a "low end" Celeron processor that
outperformed the "high end" Pentium II. Confusion!

Finally, in 1999, Intel killed off the slower more expensive Pentium II by
introducing the "new" Pentium III, which for all intents and purposes is
simply a Pentium II with a higher number to justify the higher cost relative
to the Celeron.

In other words, Intel succeeded so well at producing a low cost version of
the Pentium II, that it not only put the AMD K6 to shame, it also killed off
the Pentium II and was forced to fraudulently remarket the chip as the
Pentium III! For all intents and purposes, the Pentium II, the Celeron, and
the Pentium III are ONE AND THE SAME CHIP. They're based on the same P6
architecture, with only things like clock speed and cache size to
differentiate the chips. This is why we tell you not to purchase a Pentium
II or Pentium III based system. If you must buy Intel, buy a Celeron. Same
chip, lower cost.

Sure, sure, the Pentium III has new and innovative features, like, oooh, a
unique serial number on each chip. Well guess what? The serial number idea
was so poorly received, and rightfully so, that the serial number is already
dead. The Pentium 4 has no such feature. The new MMX instructions, renamed
SSE to sound more important, are still not supported by most compilers.

What Intel FAILED TO DO during these past 5 years was anticipate
that the end of the line for the P6 architecture would come as quickly as it
did. It hits an upper limit around 1 GHz and cannot compete with faster AMD
chips which people already have running over-clocked in the 1.5 GHz range.

Here is how and why Intel REALLY blew it. Intel has known since the Athlon
first came out in 1999 that its P6 architecture was doomed. Intel was
already well under way to developing the Pentium 4. Remember, these chips
take 3 to 5 years to design and implement and it had already been 3 years
since the P6 architecture was launched. Intel had about two more years of
work left, but that meant losing badly to the Athlon for those next two
years.

So instead of focusing on engineering - doing what AMD did and biting the
bullet while it developed the new chip - Intel went ahead and first tried to
ship a faster Pentium III chip. That backfired. So as a last resort they
pulled another Celeron-type stunt and shipped a crippled Pentium 4 chip that
cut so many features as to result in a chip that is neither fast nor cheap
and benefits no one but greedy computer makers.

I've been studying Intel's publicly available white papers on the Pentium 4
for the better part of 6 months now, and while the chip looked promising on
paper, the actual first release of the chip is a castrated version at best
of the ideal chip that Intel set out to design. Intel selectively left out
important implementation details of the Pentium 4, which they finally
revealed in November with the posting of the Intel Pentium 4 Processor
Optimization manual on their web site.

In an attempt to cover up their design defects, and with no backup plan in
place (since the demise of the 1+ GHz Pentium III chip), Intel has been
forced to carefully word their optimization document. I encourage all
software developers and technically literate computer users to download the
Pentium 4 optimization manual mentioned above, and also, for comparison, to
download and study the Pentium III manuals as well as the AMD Athlon manual.
It does not take a rocket scientist to read and compare the three sets of
documents to realize what the design flaws in the Pentium 4 are.

This is not a simple Pentium floating point bug that can be fixed by
replacing the processor. This is not a 486SX scam where Intel was selling
crippled 486DX chips as SX chips and then selling you a second processor (a
real 486DX) as an upgrade. No, in both those past cases the defective chip
still delivered the true speed performance advertised. One was simply the
result of a minor design error while the other was a marketing scam, but in
the end, the chips lived up to spec. And both chips could be replaced with
working chips.

In the case of the Pentium 4, the chip contains design flaws which aren't
easily fixed, and it is marketed fraudulently since the speed claims are
pulled out of thin air. No quick upgrade or chip fix exists to deliver the
true performance that the Pentium 4 was supposed to have. Users will have to
wait another year or two while Intel cranks out new silicon which truly
implements the full Pentium 4 spec and fixes some of the glaring flaws of
the Pentium 4 design.

If you do not have a good technical background on Pentium processors, I
recommend you read my Processor Basics section. It will give you a good
outline of the history of PC processors over the past 20 years and will
allow you to read and understand most of the Intel and AMD processor
documents. You have to have at least a basic understanding of the concepts
in order to understand why the Pentium 4 is the disaster that it is.

Or if you're a geek like me, skip right ahead to the Pentium 4 - Generation
7 section.

----------------------------------------------------------------------------

Processor Basics - the various generations of processors over the past 20
years

Generation 1 - 8086 and 68000

In the beginning, the computer dark ages of two decades ago, there was the
8086 chip, Intel's first 16-bit processor, which delivered 8 16-bit registers
and could manipulate 16 bits of data at a time. It could also address only 16
bits of address space at a time (or 64K, much like the Atari 800 and Apple II of
the same time period). Using a trick known as segment registers, a program
could simultaneously address 4 such 64K segments at a time and have a total
of 1 megabyte of addressable memory in the computer. Thus was born the
famous 640K RAM limitation of DOS, since the remaining 384K was used for
hardware and video.
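
To make the segment trick concrete, here is a rough C sketch (my own
illustration, not anything from Intel's documentation) of how a real-mode
segment:offset pair becomes a 20-bit physical address:

#include <stdio.h>
#include <stdint.h>

/* Real-mode address calculation: physical = segment * 16 + offset.
   Each 16-bit segment register selects a 64K window within the 1 MB space. */
static uint32_t physical_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

int main(void)
{
    /* Two different segment:offset pairs can name the same physical byte. */
    printf("%05X\n", (unsigned)physical_address(0xB800, 0x0000)); /* B8000 */
    printf("%05X\n", (unsigned)physical_address(0xB000, 0x8000)); /* also B8000 */
    return 0;
}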

A lower cost and slower variant, the 8088, was used in early PCs, providing
only an 8-bit bus externally to limit the number of pins on the chip and
reduce costs. As I incorrectly stated here before, the 8086 was not used in
the original IBM PC. It was actually the lower cost 8088.

The original Motorola 68000 chip, while containing 16 32-bit registers and
being essentially a 32-bit processor, used a similar trick of having only 16
external data pins and 24 external address pins to reduce the pin count on the
chip. An even smaller 68008 chip addressed only 20 bits of address space
externally and had the same 1 megabyte memory limitation as the 8086.

While these first generation processors from Intel and Motorola ran at
speeds of 4 to 8 MHz, they each required multiple clock cycles to execute
any given machine language instruction. This is because these processors
lacked any of the modern features we know today such as caches and
pipelines. A typical instruction to 4 to 8 cycles to execute, really giving
the chips an equivalent speed of 1 MIPS (i.e. 1 million instructions per
second).

Generation 2 - 80286 and 68020

By 1984, Intel released the 80286 chip used in the IBM AT and clones. The
80286 introduced the concept of protect mode, a way of protecting memory so
that multiple programs could run at the same time and not step on each
other. This was the base chip that OS/2 was designed for and which was also
used by Windows/286. The 286 ran at 8 to 16 MHz, offering over double the
speed of the original 8086 and could address 16 megabytes of memory.

Motorola meanwhile developed the 68020, the true 32-bit version of the
68000, with a full 32-bit data bus and 32-bit address bus capable of
addressing 4 gigabytes of memory.

By the way, both companies did release a "1" version of each processor - the
80186 and 68010 - but these were minor enhancements over the 8086 and 68000
and not widely used in home computers.

Generation 3 - 80386 and 68030

The world of home computers didn't really become interesting until late 1986
when Intel released its 3rd generation chip - the 80386, or simply the 386.
This chip, although almost 15 years old now, is the base on which OS/2 2.0,
Windows 95, and the original Windows NT run. It was Intel's first true
32-bit x86 chip, extending the registers to a full 32 bits in size and
increasing addressable memory to 4 gigabytes. In effect, catching up to the
68020 in a big way, by also adding things like paging (which is the basis of
virtual memory) and support for true multi-tasking and mode switching
between 16-bit and 32-bit modes.

The 386 is really the chip, I feel, that put Intel in the lead over Motorola
for good. It opened the door to things like OS/2 and Windows NT and Linux -
truly pre-emptive, multi-tasking, memory protected operating systems. It was
a 286 on steroids, so much more powerful, so much faster, so much more
capable than the 286, that at over $20,000 a machine, people were dying to
get their hands on them. I remember reading the review of the first Compaq
386 machine, again, a $20,000+ machine that today you can buy for $50, and
the reviewer would basically kill to get one.

What made the 386 so special? Well, Intel did a number of things right.
First they made the chip more orthogonal. What that means is that they
extended the machine language instructions so that in 32-bit mode, almost
any of the 8 32-bit registers could be used for anything - storing data,
addressing memory, or performing arithmetic operations. Compare this to the
8086 and 80286, whose 16-bit instructions could only use certain registers
for certain operations. The orthogonality of the 386 registers made up for
the extra registers in the Motorola chips, which specifically had 8
registers which could be used for data and 8 for addressing memory. While
you could use an address register to hold data or use data registers to
address memory, it was more costly in terms of clock cycles.

The 386 allowed the average programmer to do away with segment registers and
640K limitations. In 386 protect mode, which is what most Windows, OS/2, and
Linux programs run in today, a program has the freedom to address up to 4
gigabytes of memory. Even when such memory is not present, the chip's paging
feature allows the OS to implement virtual memory by swapping memory to hard
disk, what most people know as the swap file.

Another innovation of the 386 chip was the code cache, the ability of the
chip to buffer up to 256 bytes of code on the chip itself and eliminate
costly memory reads. This is especially useful in tight loops that are
smaller than 256 bytes of code.

Motorola countered with the 68030 chip, a similar chip which added built-in
paging and virtual memory support, memory protection, and a 256 byte code
cache. The 68030 also added a pipeline, a way of executing parts of multiple
instructions at the same time, to overlap instructions, in order to speed up
execution.

Both the 386 and 68030 ran at speeds ranging from 16 MHz to well above 40
MHz, easily bringing the speed of the chips to over 10 MIPS. Both chips
still required multiple clock cycles to execute even the simplest machine
language instructions, but were still an order of magnitude faster than their first
generation counterparts. Microsoft quickly developed Windows/386 (and later
OS/2 2.0 and Windows NT) for the 386, and Apple added virtual memory support
to Mac OS.

Both chips also introduced something known as a barrel shifter, a circuit in
the chip which can shift or rotate any 32-bit number in one clock cycle.
Something used often by many different machine language instructions.
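
As a quick illustration (my own C sketch, purely for the concept), a rotate
of arbitrary distance is exactly the kind of operation a barrel shifter
completes in a single cycle:

#include <stdio.h>
#include <stdint.h>

/* Rotate a 32-bit value left by n bits. A barrel shifter does this in one
   clock cycle regardless of n; without one, hardware would have to shift
   one bit position per cycle. */
static uint32_t rotate_left(uint32_t value, unsigned n)
{
    n &= 31;                      /* keep the rotate count in range */
    if (n == 0)
        return value;
    return (value << n) | (value >> (32 - n));
}

int main(void)
{
    printf("%08X\n", (unsigned)rotate_left(0x80000001u, 4)); /* prints 00000018 */
    return 0;
}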

The 386 chip is famous for unseating IBM as the leading PC developer and for
causing the breakup with Microsoft. IBM looked at the 386, decided it was
too powerful for the average user, and decided not to use it in PCs and not
to write operating systems for it. Instead it chose to keep using the 286
and to develop OS/2 for the 286. Microsoft on the other hand developed
Windows/386 with improved multitasking, Compaq and other clone makers did
use the 386 to deliver the horsepower needed to run such a graphical
operating system, and the rest is history. By the time IBM woke up, it was
too late. Microsoft won. Compaq, DELL, and Gateway won.

Generation 4 - 486 and 68040

This generation is famous for integrating the floating point co-processor,
previously a separate external chip, into the main processor. This
generation also refined the existing technology to run faster. The pipelines
on the Intel 486 and Motorola 68040 were improved to in effect give the
appearance of 1 clock cycle per instruction execution. 20 MIPS. 25 MIPS. 33
MIPS. Double or triple the speed of the previous generation with virtually
no change in instruction set! As far as the typical programmer or computer
user is concerned, the 386 and 486, or 68030 and 68040, were the same chips,
except that the 4th generation ran quicker than the 3rd. And speed was the
selling point and the main reason you upgraded to these chips.

These chips gained speed in a number of ways. First, the
caches were increased in size to 8K, and made to handle both code and data.
Suddenly relatively large amounts of data (several thousand bytes) could be
manipulated without incurring the costly penalty of accessing main memory.
Great for mathematical calculations and other such applications. This is why
many operating systems today and many video games don't support anything
prior to the 4th generation. Mac OS 8 and many Macintosh games require a
68040. Windows 98, Windows NT 4.0, and most Windows software today requires
at least a 486. The caches made that huge a difference in speed! Remember
this for later!

With the ability to read memory in a single clock cycle now came the ability
to execute instructions in a single clock cycle. By decoding one instruction
while finishing the execution of the previous instruction, both the 486 and
68040 could give the appearance of executing 1 instruction per cycle. Any
given instruction still takes multiple clock cycles to execute, but by
overlapping several instructions at once at different stages of execution,
you get the appearance of one instruction per cycle. This is the job of the
pipeline.

Keeping the pipeline full is of extreme importance! If you have to stop and
wait for memory (i.e. the data or code being executed isn't in the cache) or
you execute a complex instruction such as a square root, you introduce a
bubble into the pipeline - an empty step where no useful work is being done.
This is also known as a stall. Stalls are bad. Remember that.

One of the great skills of writing assembly language code, or writing a
compiler, is knowing how to arrange the machine language instructions in
such an order so that the steps you ask the processor to perform are done as
efficiently as possible.

The rules for optimizing code on the 486 and 68040 are fairly simple:

keep loops small to take advantage of the code cache

avoid referencing memory by using the chip's 32-bit registers

avoid referencing memory blocks larger than the size of the data cache

avoid complex instructions - for example where possible use simple
instructions such as ADD numerous times in place of a multiply

The techniques used in the 4th generation are very similar to techniques
used by RISC (reduced instruction set) processors. The concept is to use the
simplest instructions possible. Use several simple instructions in place of
one complex instruction. For example, to multiply by 2 simply add a value
to itself instead of forcing the chip to use its multiply circuitry.
Multiply and divide take many clock cycles, which is fine when multiplying
by a large number. But if you simply need to double a number, it is faster
to tell the chip to add two numbers than to multiply two numbers.
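
Here is a minimal sketch of that strength-reduction idea in C (my own
example, not from any vendor manual):

#include <stdio.h>

/* Strength reduction: replace an expensive multiply with cheap adds and
   shifts. On a 486-class chip, a multiply could take a dozen or more
   cycles, while ADD and SHL take one each. */
static int times_two(int x)
{
    return x + x;               /* instead of x * 2 */
}

static int times_ten(int x)
{
    return (x << 3) + (x << 1); /* x*8 + x*2 instead of x * 10 */
}

int main(void)
{
    printf("%d %d\n", times_two(21), times_ten(7)); /* prints 42 70 */
    return 0;
}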

Another reason to follow the optimization rules is because both the 486 and
68040 introduced the concept of clock doubling, or in general, using a clock
multiplier to run the processor internally at several times the speed of the
main computer clock. The computer may run at say, 33 MHz, the bus speed, but
a typical 486 or 68040 chip is actually running at 66 MHz internally and
delivering a whopping 66 MIPS of speed.

The year is now 1990. Windows 3.0 and Macintosh System 7 are about to be
released.

Generation 5 - the Pentium and PowerPC

With the first decade and the first 4 generations of chips now in the bag,
both Motorola and Intel looked for new ways to squeeze speed out of their
chips. Brick walls were being hit in terms of speed. For one, memory chips
weren't keeping up with the rapidly increasing speed of processors. Even
today, most memory chips are barely 10 or 20 times faster than the memory
chips used in computers two decades ago, yet processor speeds are up by a
factor of a thousand!

Worse, the remaining hardware in the PC, things like video cards and sound
cards and hard disks and modems, run at fixed clock speeds of 8 MHz or 33
MHz or some sub multiple of bus speed. Basically, any time the processor has
to reference external memory or hardware, it stalls. The faster the clock
multiplier, the more instructions that execute each bus cycle, and the
higher the chances of a stall.

This is why, for example, upgrading from a 33 MHz 486 to a 66 MHz 486 only
offers about a 50% speed increase in general, and similarly when upgrading
from the 68030 to the clock-doubled 68040.

It's been said many times by many people, but by now you should have
realized that CLOCK SPEED IS NOT EVERYTHING!!

Two chips running at the same speed (a 33 MHz 386 and a 33 MHz 486) do not
necessarily give the same level of performance, and

Doubling the internal clock speed of a chip (486 from 33 to 66 MHz) does not
always double the performance.
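
To see why, here is a rough worked example with made-up numbers (purely
illustrative): suppose two thirds of a program's time goes to the core and
one third goes to waiting on the fixed-speed bus and memory. Doubling the
clock only halves the first part.

#include <stdio.h>

/* Illustrative only: doubling the clock halves the compute time but leaves
   bus/memory time untouched, so the overall speedup is less than 2x. */
int main(void)
{
    double compute = 2.0;   /* arbitrary time units spent executing */
    double memory  = 1.0;   /* time units spent waiting on bus/memory */

    double before = compute + memory;          /* 3.0 */
    double after  = compute / 2.0 + memory;    /* 2.0 */

    printf("speedup = %.2fx\n", before / after); /* prints 1.50x */
    return 0;
}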

What can affect speed far more than mere clock speed is the rate at which
the chip can process instructions. The 4th generation brought the chip down
to one instruction per clock cycle. The 5th generation developed the concept
of superscalar execution. That is, executing more than one instruction per
clock cycle by executing instructions in parallel.

Intel and Motorola chose different paths to achieve this. After an aborted
68050 chip and short lived 68060 chip, Motorola abandoned its 68K line of
processors and designed a new chip based on IBM's POWER RISC chip. A RISC
processor (or Reduced Instruction Set) does away with complicated machine
language instructions which can take multiple clock cycles to execute, and
replaces them with simpler instructions which execute in fewer cycles. The
advantage of this is the chip achieves a higher throughput in terms of
instructions per second or instructions per clock cycle, but the down side
is it usually takes more instructions to do the same thing as on a CISC (or
Complex Instruction Set) processor.

The theory with RISC processors, which has long since proven to be bullshit,
was that by making the instructions simpler the chip could be clocked at a
higher clock speed. But this in turn only made up for the fact that more
instructions were now required to implement any particular algorithm, and
worse, the code grew bigger and thus used up more memory. In reality a RISC
processor is no more or less powerful than a CISC processor.

Intel engineers realized this and continued the x86 product line by
introducing the Pentium chip, a superscalar version of the 486. The original
Pentium was for all intents and purposes a faster 486, executing up to 2
instructions per clock cycle, compared to the 1 instruction per cycle limit
of the 486. Once again, CLOCK SPEED IS NOT EVERYTHING.

By executing multiple instructions at the same time, the design of the
processor gets more complicated. No longer is it a serial operation. While
earlier processors essentially followed this process:

fetch an instruction from memory or the code cache

decode the instruction

execute the instruction in either the floating point unit (FPU), integer
unit (ALU), or branch unit

repeat

a superscalar processor now has additional steps to worry about:

fetch two instructions from memory or the code cache

decode the two instructions

execute the first instruction

if the second instruction does not depend on the results of the first
instruction, and if the second instruction does not require an execution
unit being used by the first instruction, execute the second instruction

repeat

The extra checks are necessary to make sure that the code executes in the
correct order. If two ADD operations follow one another, and the second ADD
depends on the result of the first, the two ADD operations cannot execute in
parallel. They must execute in serial order.

Intel gave special names to the two "pipes" that instructions execute in -
the U pipe and the V pipe. The U pipe is the main path of execution. The V
pipe executes "paired" instructions, that is, the second instruction sent
from the decoder and which is determined not to conflict with the first
instruction.

Since the concept of superscalar execution was new to most programmers, and
to Microsoft's compilers, the original Pentium chip only delivered about 20%
faster speed than a 486 at the same speed. Not 100% faster speed as
expected. But faster nevertheless. The problem was very simply that most
code was written serially.

Code written today on the other hand does execute much faster, since
compilers now generate code that "schedules" instructions correctly. That
is, it interleaves pairs of mutually exclusive instructions so that most of
the time two instructions execute each clock cycle.
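
As a hedged sketch of what such scheduling looks like at the source level
(my own example, not actual compiler output), the first loop below is one
long dependency chain, while the second exposes two independent chains that
the U and V pipes can work on together:

#include <stdio.h>

#define N 1000000

/* Serial version: every add depends on the previous one, so the two
   Pentium pipes cannot work in parallel. */
static long sum_serial(const int *a)
{
    long s = 0;
    for (int i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Scheduled version: two independent accumulators give the decoder a pair
   of non-conflicting adds each iteration, one for the U pipe, one for V. */
static long sum_paired(const int *a)
{
    long s0 = 0, s1 = 0;
    for (int i = 0; i < N; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    return s0 + s1;
}

int main(void)
{
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = 1;
    printf("%ld %ld\n", sum_serial(a), sum_paired(a)); /* both print 1000000 */
    return 0;
}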

The original PowerPC 601 chip similarly had the ability to execute two
instructions per cycle, an arithmetic instruction paired with a branch
instruction. The PowerPC 603 and later versions of the PowerPC added
additional arithmetic execution units in order to execute 2 math
instructions per cycle.

With the ability to execute twice as much code as before comes greater
demand on memory. Twice as many instructions need to be fed into the
processor, and potentially twice as much data memory is processed.

Intel and Motorola found that as clock speed was being increased in the
processors, performance didn't scale, even on older chips. A 66 MHz 486 only
delivered 50% more speed than a 33 MHz 486. Why?

The reason again has to do with memory speed. When you double the speed of a
processor, the speed of main memory stays the same. That means that a cache
miss, which forces the processor to read main memory, now takes TWICE the
number of clock cycles. With today's fast processors, a memory read can
literally take 100 or more clock cycles. That means 100, or worse, 200
instructions not being executed.
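
Here is a rough C sketch of the effect (my own illustration, with made-up
sizes): the two loops below touch the same bytes, but the second one jumps
around so much that nearly every read misses the caches and pays that
main-memory penalty:

#include <stdio.h>
#include <stdlib.h>

#define BYTES  (16 * 1024 * 1024)  /* 16 MB, far larger than the caches */
#define STRIDE 4096                /* jump 4K between consecutive reads */

int main(void)
{
    unsigned char *buf = calloc(BYTES, 1);
    if (!buf)
        return 1;

    long sum = 0;

    /* Cache-friendly: sequential walk; after each miss the next few dozen
       reads hit the cache line that was just loaded. */
    for (long i = 0; i < BYTES; i++)
        sum += buf[i];

    /* Cache-hostile: touches the same bytes, but consecutive reads are 4K
       apart, so nearly every read misses the on-chip caches and waits on
       main memory (on the order of 100 clock cycles per miss). */
    for (long j = 0; j < STRIDE; j++)
        for (long i = j; i < BYTES; i += STRIDE)
            sum += buf[i];

    printf("%ld\n", sum);  /* prints 0; the point is the relative running time */
    free(buf);
    return 0;
}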

The way Intel and Motorola attacked this problem was to increase the size of
the L1 cache, the very high speed on-chip level one cache. For example, the
original 486 had an 8K cache. The newer 100 MHz 486 chips had a 16K cache.

But 8K or 16K is nothing compared to the megabytes that a processor can suck
in every second. So computers started to include a second level cache, the
L2 cache, which was made up of slightly slower but larger memory. Typically
256K. The L2 cache is still on the order of 10 times faster than main
memory, and allows most code to operate at near to full speed.

When the L2 cache is disabled (which most PC users can do in the BIOS
setup), or when it is left out completely, as Apple did in the original
Power Macintosh 6100, performance suffers.

Generation 6 - the P6 architecture and PowerPC G3/G4

By 1996 as processor speeds hit 200 MHz, more brick walls were being hit.
Programmers simply weren't optimizing their code and as processor speeds
increased, the processors simply spent more time waiting on memory or
waiting for instructions to finish executing. Intel and Motorola adopted a
whole new set of tricks in their 6th generation of processors. Tricks such
as "register renaming", "out of order execution", and "predication".

In other words, if the programmer won't fix the code, the chip will do it
for him. The Intel P6 architecture, first released in 1996 in the Pentium
Pro processor, is at the heart of all of Intel's current processors - the
Pentium II, the Celeron, and the Pentium III. Even AMD's Athlon processor
uses the same tricks.

What they did is as follows:

expand the L2 cache to a full 512K of memory. The Pentium II, the original
Pentium III, and the original AMD Athlon all did this. Big speed win with no
burden on the programmer.

expand the L1 cache. The P6 processors have 32K of L1 cache (16K for data,
16K for code), while the AMD Athlon has a whopping 128K of L1 cache (64K
data, 64K code). Another big speed win, more so for the Athlon. Again with
no burden on the programmer.

expand the decoder to handle 3 instructions at once. This places a burden on
the programmer because instructions now have to be grouped in sets of 3, not
just in pairs. Potential 50% speed increase if the code is written properly.

allow decoded instructions to execute out of order provided they are
mutually exclusive. This is a huge speed win because it can make up for poor
scheduling on the part of the programmer. It also allows code to execute
around "bubbles", or instructions which are stalled due to a memory access.
Big big big big big speed win.

improved branch prediction. The processor can "guess" with pretty good
reliability whether a branch instruction (an "if/else" in a higher level
language) will go one way or the other. Higher rates of branch prediction
mean fewer stalls caused by branching to the wrong code.

conditional execution or "predication" allows the processor to conditionally
execute an instruction based on the result of a previous operation. This is
similar to branching, except no branch occurs. Instead data is either moved
or not moved. This reduces the number of "if/else" style branch conditions,
which is a big win. Unfortunately, predication is new to the P6 family and
is not supported on the 486 or earlier versions of the Pentium.

add additional integer execution units so that up to 3 integer instructions
can execute at once. Big speed win thanks to out of order execution.

in the AMD Athlon, add additional floating point units to allow up to 3
floating point instructions to execute at once. Big speed win for the
Athlon, allowing it to trounce the Intel chips on 3-D and math intensive
tasks.

allow registers to be mapped to a larger set of internal registers, a
process called "register renaming". Internally, the P6 and K7 architectures
do away with the standard 8 x86 32-bit general purpose registers. Instead
they contain something like 40 32-bit registers. The processor decides how
to assign the 8 registers which the programmer "sees" to the 40 internal
registers. This is a speed win for cases where the same register is reused
serially by mutually exclusive instructions. The two uses of the
register will get renamed to two different internal registers, thus allowing
superscalar out-of-order execution to take place. This trick works best on
older 386 and 486 code, or poorly optimized C code which tends to make heavy
use of one or two registers only.
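
As a hedged illustration of renaming (my own C example, not Intel's or
AMD's), the variable below is reused for two unrelated computations, much
the way old compiler output reuses EAX; renaming lets the hardware treat the
two lifetimes as separate internal registers and overlap them:

#include <stdio.h>

/* The variable t is reused for two computations that do not depend on each
   other. A compiler targeting the old 8-register x86 set would likely keep
   t in EAX both times; register renaming maps the two lifetimes of EAX to
   two different internal registers so the chip can execute them in parallel. */
static int reuse_example(int a, int b, int c, int d)
{
    int t;

    t = a + b;          /* first, independent use of t */
    int x = t * 3;

    t = c + d;          /* second, unrelated use of the same variable */
    int y = t * 5;

    return x + y;
}

int main(void)
{
    printf("%d\n", reuse_example(1, 2, 3, 4)); /* prints 44 */
    return 0;
}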

From an engineering standpoint, the enhancements in the 6th generation
processors are truly amazing. Through the use of brute force (larger caches
and faster clock speed), parallel execution (multiple execution units and 3
decoders), and clever interlocking circuitry to allow out-of-order
execution, Intel has been able to stick with the same basic architecture for
5 years now, catapulting CPU throughput from the 100 to 150 MHz range in
1995 to over 1 GHz today. Most code, even poorly written unoptimized code,
executes at a throughput of over 1 instruction per clock cycle, or roughly
1000 MIPS on today's fastest Pentium III processors.

The PowerPC G3 and G4 chips use much the same tricks (after all, all these
silicon engineers went to the same schools and read the same technical
papers) which is why the G3 runs faster than a similarly clocked 603 or 604
chip.

----------------------------------------------------------------------------

Limitations of the Pentium III

AMD, calling the Athlon a "7th generation" processor, something I don't
fully agree with since they really didn't have a 6th generation processor,
took the basic ideas behind the Pentium II/III and PowerPC G3 and used them
to implement the Athlon. Having the benefit of seeing the original Pentium
Pro's faults, they fixed many of the bottlenecks of the P6 design which even
today limit the full speed of the Pentium III.

These are the same problems that Intel of course is trying to address in the
Pentium 4. Understanding why the AMD Athlon is a faster chip, and what AMD
did right, helps us understand why Intel needed to design the Pentium 4,
and that is what I shall discuss in this section.

Not counting the unbuffered segment register problem in the original Pentium
Pro (which was fixed in the far more popular Pentium II chip), what are the
bottlenecks? What can possibly slow down the processor when instructions are
being executed out-of-order 3 at a time!?!?

Well, keep in mind that a chain is only as strong as its weakest link. In
the case of the processor, each stage can be considered a link in a chain.
The main memory. The L2 cache. The L1 cache. The decoder. The scheduler
which takes decoded micro-ops and feeds them into the various execution
units. The two main bottlenecks in the P6 architecture are the 4-1-1
limitation of the decoder, and the dreaded partial register stall.

If you read the Pentium III optimization document, you will see reference to
the 4-1-1 rule for decoding instructions. When the Pentium III (for example)
fetches code, it pulls in up to three instructions through the decoders each
clock cycle. Decoder 1 can decode any machine language instruction. Decoders
2 and 3 can decode only simple, RISC-like instructions that break down into
1 micro-op. A micro-op is a basic operation performed inside the processor.
For example, adding two registers takes one micro-op. Adding a memory
location to a register requires two micro-ops: a load from memory, then an
add. It uses two execution units inside the processor, the load/store unit
on one clock cycle, and then an ALU on the next clock cycle. Micro-ops
translate roughly into clock cycles per instruction but don't think of it
that way. Since several instructions are being executed in parallel and out
of order, the concept of clock cycles per instruction becomes rather fuzzy.

Instead, think of it like this. What is the limitation of each link? How
frequently does that link get hit? Main memory, for example, may not be
accessed for thousands of clock cycles at a time. So while accessing main
memory may cost 100 clock cycles, that penalty is taken infrequently thanks
to the buffering performed by the L1 and L2 caches. Only when dealing with
large amounts of memory at a time, such as when processing a multi-megabyte
bitmap, does it start to hurt.

Intel and AMD have addressed this problem in two ways. First, over the
years they have gradually increased the speed of the "front side bus", the
data path between main memory and the processor, to work at faster and
faster clock speeds. From 66 MHz in the Celeron and Pentium II, to 100 and
133 MHz in the Pentium III, to 200 MHz in the AMD Athlon. Second, Intel
produces a version of the Pentium II and III called the "Xeon", which
contains up to 2 megabytes of L2 cache. The Xeon is used frequently in
servers as it supports 8-way multi-processing, but even on the desktop the
Xeon does offer considerable speed advantages over the standard Pentium III when
large amounts of data are involved. The PowerPC G4 has up to 1 megabyte of
L2 cache, which explains why a slower clock speed Power Mac G4 blows away a
Pentium III in applications such as Photoshop.

Basically, the larger the working set of an application, that is, the amount
of code and data in use at any given time, the larger the L2 cache needs to
be. To keep costs low, Intel and AMD have both actually DECREASED the sizes
of their L2 caches in newer versions of the Pentium III and Athlon, which I
believe is a mistake.

The top level cache, the L1 cache, is the most crucial, since it is accessed
first for any memory operation. The L1 cache uses extremely high speed
memory (which has to keep up with the internal speed of the processor), so
it is very expensive to put on chip and tends to be relatively small. Again,
from 8K in the 486 to 128K in the Athlon. But as my tests have shown, the
larger the L1 cache, the better.

The next step is the decoder, and this is one of the two major flaws of the
P6 family. The 4-1-1 rule prevents more than one "complex" instruction from
being decoded each clock cycle. Much like the U-V pairing rules for the
original Pentium, Intel's documents contain tables showing how many
micro-ops are required by every machine language instruction and they give
guidelines on how to group instructions.
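
As a hedged C-level sketch (my own example; the exact micro-op counts live
in Intel's tables, and a modern compiler may well emit identical code for
both functions), the point is the difference between an add that references
memory, which decodes into a load plus an add, and plain register-to-register
operations that any decoder slot can handle:

#include <stdio.h>

#define N 1000

/* Each iteration adds straight from memory: the compiler tends to emit an
   "add reg, [mem]" style instruction, which decodes to two micro-ops
   (a load, then an add) and can only go through decoder 1. */
static int sum_from_memory(const int *a)
{
    int total = 0;
    for (int i = 0; i < N; i++)
        total += a[i];
    return total;
}

/* Loading into a local first suggests separate simple "mov" and "add"
   register instructions, each a single micro-op, which any of the three
   decoders can handle. */
static int sum_via_register(const int *a)
{
    int total = 0;
    for (int i = 0; i < N; i++) {
        int value = a[i];   /* explicit load into a register-sized local */
        total += value;     /* register-to-register add */
    }
    return total;
}

int main(void)
{
    static int a[N];
    for (int i = 0; i < N; i++)
        a[i] = 2;
    printf("%d %d\n", sum_from_memory(a), sum_via_register(a)); /* both 2000 */
    return 0;
}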

Unlike main memory, the decoder is always in use. Every clock cycle, it
decodes 1, 2, or 3 instructions of machine language code. This limits the
throughput of the processor to at most 3 times the clock speed. For example,
a 1 GHz Pentium III can execute at most 3 billion instructions per second,
or 3000 MIPS. In reality, most programmers and most compilers write code
that is less than optimal, and which is usually grouped for the
complex-simple-complex-simple pairing rules of the original Pentium. As a
result, the typical throughput of a P6 family processor is more like double
the clock speed. For example, 2000 MIPS for a 1 GHz processor.

By sticking to simpler instruction forms and simpler instructions in general
(which in turn decode to fewer micro-ops) a machine language programmer can
achieve close to the 3x MIPS limit imposed by the decoder. In fact, this
simple technique (along with elimination of the partial register stalls) is
the reason our SoftMac 2000 Macintosh emulator runs so much faster than
other emulators, and why in the summer of 2000, when I re-wrote the FUSION PC
emulator, I was able to achieve about a 50% increase in the speed of
that emulator in only a few days' worth of work. By simply understanding how
the decoder works and writing code appropriately, one can achieve near
optimal speeds of the processor.

Once again, let me repeat: CLOCK SPEED IS NOT EVERYTHING! So many people
stupidly go out and buy a new computer every year expecting faster clock
speed to solve their problems, when the main problem is not clock speed. The
problem is poorly written code, uneducated programmers, and out of date
compilers (that's YOU Microsoft) that target obsolete processors. How many
people still run Microsoft Office 95? Ok, do a DUMPBIN on WINWORD.EXE or
EXCEL.EXE to get the version number of the compiler tools. That product was
written in an old version of Visual C++ which targets now obsolete 486
processors. Do the same thing with Office 97 or Office 2000. Newer tools
that target P6. Wonder why your Office 97 runs faster than your Office 95 on
the same Pentium III computer? Ditto for Windows 98 over Windows 95. Windows
2000 over Windows 98. Etc. etc. The newer the compiler tools, the better
optimized the code is for today's processors.

The next bottleneck is the actual execution units - the guts of the
processor. They determine how many micro-ops of a given type can execute in
one clock cycle. For example, the P6 family can load or store one memory
location per clock cycle. It can execute one floating point instruction per
clock cycle because there is only one FPU. This means that even the most
optimized code, code that caches perfectly and decodes perfectly, can still hit a
bottleneck simply because too many instructions of the same type are trying
to execute. Again, one needs to mix instructions - integer and floating
point and branch - to make best use of the processor.

Finally that dreaded partial register stall! The one serious bug in the P6
design that can cause legacy code to run slower. By "legacy code" I mean
code written for a previous version of the processor. See, until now, every
generation so far improved on the design of previous generations. No matter
what, you were almost 100% guaranteed that a newer processor, even running
at the same clock speed as a previous processor, would deliver more speed.
Why a 68040 is faster than a 68030. Why a Pentium is faster than a 486.

Not so with generation 6. While every other optimization in the P6 family
pretty much boosts performance without requiring the programmer to rewrite
one single line of code, even the 4-1-1 decode rule, the register renaming
optimization has one fatal flaw that kills performance: partial registers
stalls! A partial register stall is when a partial register (that is, the
AL, AH, and AX parts of the EAX register, the BL, BH, and BX parts of the
EBX register, etc) get renamed to different internal registers because the
processor believes the uses are mutually exclusive.

For example, a C compiler will typically read an 8-bit or 16-bit integer
from memory into the AL or AX register. It will then perform some operation
on that integer, for example, incrementing it or testing a value. A typical
C code sequence to test a byte for zero goes something like this:

int foo(unsigned char ch)
{
    return (ch == 0) ? 1 : -1;
}

Microsoft's compilers for years have used a "clever" little trick with
conditional expressions, and that is to use a compare instruction to set the
carry flag based on the result of an expression, then to use the x86 SBB
instruction to set a register to all 1's or 0's. Once set, the register can
be masked and manipulated to generate any two desired resulting values. MMX
code makes heavy use of this trick as well, although MMX registers are not
subject to the partial register stall.
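
Here is a hedged C rendering of that trick (my own sketch, not Microsoft's
code): smear the comparison result into a mask of all 1s or all 0s, then
shape the mask into the two desired values, with no branch anywhere:

#include <stdio.h>
#include <stdint.h>

/* Branchless version of (ch == 0) ? 1 : -1, mimicking the cmp/sbb/and/dec
   sequence: build a mask of all 1s or all 0s, then shape it into the two
   target values. */
static int foo_branchless(unsigned char ch)
{
    /* All 1s when ch == 0, otherwise all 0s - this is what
       "cmp ch, 1 / sbb eax, eax" computes. */
    uint32_t mask = 0u - (uint32_t)(ch == 0);

    /* (mask & 2) - 1 gives 2-1 = 1 when ch == 0, and 0-1 = -1 otherwise,
       matching the "and eax, 2 / dec eax" that VC 4.2 emits. */
    return (int)(mask & 2) - 1;
}

int main(void)
{
    printf("%d %d\n", foo_branchless(0), foo_branchless(7)); /* prints 1 -1 */
    return 0;
}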

Anyway, when you compile the above code using Microsoft's Visual C++ 4.2
compiler with full Pentium optimizations (-O2 -G5) you get the
following code:

PUBLIC _foo
_TEXT SEGMENT
_ch$ = 8

; 4 : return (ch == 0) ? 1 : -1;

00000 80 7c 24 04 01 cmp BYTE PTR _ch$[esp-4], 1
00005 1b c0 sbb eax, eax
00007 83 e0 02 and eax, 2
0000a 48 dec eax

0000b c3 ret 0
_foo ENDP
_TEXT ENDS
END

and when compiled with Microsoft's latest Visual C++ 6.0 SP4 compiler you
get code like this:

PUBLIC _foo
_TEXT SEGMENT
_ch$ = 8

; 4 : return (ch == 0) ? 1 : -1;

00000 8a 44 24 04 mov al, BYTE PTR _ch$[esp-4]
00004 f6 d8 neg al
00006 1b c0 sbb eax, eax
00008 83 e0 fe and eax, -2 ; fffffffeH
0000b 40 inc eax

0000c c3 ret 0
_foo ENDP
_TEXT ENDS
END

Notice in both cases the use of the SBB instruction to set EAX to either
$FFFFFFFF or $00000000. Internally the processor reads the EAX register,
subtracts it from itself, then writes the value back to EAX. (Yes, it is
stupid that when a processor subtracts a register from itself it would
read the register first, but I have verified that it does). In the VC 4.2
case, the processor may or may not stall because we don't know how far back
the EAX register was last updated and whether all or part of it was updated.

But interestingly, with the latest 6.0 compiler, even using the -G6
(optimize for P6 family) flag, a partial register stall results. AL is
written to, then all of EAX is used by the SBB instruction. This is
perfectly valid code, and runs perfectly fine on the 486, Pentium classic,
and AMD processors, but suffers a partial register stall on any of the P6
processors. On the Pentium Pro a stall of about 12 clock cycles, and on the
newer Pentium III about 4 clock cycles.

Why does the partial register stall occur? Because internally the AL
register and the EAX registers get mapped to two different internal
registers. The processor does not discover the mistake until the second
micro-op is about to execute, at which point it needs to stop and re-execute
the instruction properly. This results in the pipeline being flushed and the
processor having to decode the instructions a second time.

How to solve the problem? Well, Intel DID tell developers how to avoid the
problem. Most didn't listen. The way you work around a partial register
stall is to clear a register, either using an XOR operation on itself, a SUB
on itself, or moving the value 0 into the register. (Ironically, SBB which
is almost identical to SUB, does not do the trick!) Using one of these three
tricks will flag the register as being clear, i.e. zero. This allows the
second use of the register to be mapped to the same internal register. No
stall.

So what is the correct code? Something like this is correct (generated with
the Visual C++ 7.0 beta):

PUBLIC _foo
_TEXT SEGMENT
_ch$ = 8

; 4 : return (ch == 0) ? 1 : -1;

00000 8a 4c 24 04 mov cl, BYTE PTR _ch$[esp-4]
00004 33 c0 xor eax, eax
00006 84 c9 test cl, cl
00008 0f 94 c0 sete al
0000b 8d 44 00 ff lea eax, DWORD PTR [eax+eax-1]

0000f c3 ret 0
_foo ENDP
_TEXT ENDS
END

Until every single Windows application out there gets re-compiled with
Visual C++ 7.0, or gets hand coded in assembly language, your brand spanking
new Pentium III processor will not run as fast as it can. And even then, it
comes at the expense of code size and larger memory usage. Note the extra XOR
instruction needed to prevent the partial register stall on the SETE
instruction. While this does eliminate the partial register stall, it does
so at the expense of larger code. You eliminate one bottleneck, you end up
increasing another.

Why the AMD Athlon doesn't suck

Guess what folks? The AMD Athlon has no partial register stall!!!! Woo hoo!
AMD's engineers attacked the problem and eliminated it. I've verified that
to be true by checking several different releases of the Athlon. That simple
design fix, which affects just about every single Windows program ever
written, along with the larger L1 cache and better parallel execution, is why
the AMD Athlon runs circles around the Pentium III.

Floating point code especially, which means many 3-D games, runs faster on
the Athlon because the code runs into fewer bottlenecks inside the
processor.

That's it, simple things that AMD did right:

They stuck to the principle of making every generation of processor faster
than every previous generation without forcing programmers to re-write their
code.

Don't force the programmer to group code a certain way. That's the job of
the decoder.

Don't force code to execute serially because you were too lazy to add a
second or third floating point unit.

Don't let perfectly legal code cause the processor to have a cardiac arrest.
Code that has worked perfectly well for 15 years on every x86 processor
shouldn't suddenly have serious speed problems. I'll point out a few
examples below.

Intel? Are you listening? HELLO?

----------------------------------------------------------------------------

Pentium 4 - Generation 7 or complete stupidity?

Let's get to the meat of it. WHY THE PENTIUM 4 SUCKS. If you've read this
far I expect you have downloaded the Intel and AMD manuals I mentioned
above, you've read them, and you have a good understanding of how the
Pentium III, AMD Athlon, and Pentium 4 work internally. If not, start over!

You've read my previous section on the cool tricks introduced in the 6th
generation processors (Pentium II, AMD Athlon, PowerPC G3) and the kinds of
bottlenecks that can slow down the code:

partial register stalls, big big minus on Pentium II and Pentium III
(including Celeron and Xeon)

decoder limitation - forcing the programmer to re-arrange code a certain way

lack of execution units - such as only having one floating point unit
available each clock cycle

small caches - the faster the processor, the larger the cache you need to
keep up with the data coming in

As I mentioned, AMD, to their well-deserved credit, attacked all these
problems head on in the Athlon by detecting and eliminating the partial
register stall, by relaxing limitations on the decoder and instruction
grouping, and by making the L1 caches larger than ever.

So, after 5 years of deep thought, billions of dollars in R&D, months of
delays, hype beyond belief, how did Intel respond?

In what can only be considered a monumental lapse in judgment, Intel went
ahead and threw out the many tried and tested ideas implemented in both the
PowerPC and AMD Athlon processor families and literally took a step back 5
years to the days of the 486.

It seems that Intel is taking an approach similar to that of their upcoming
Itanium chip - that the chip should do less optimization work and that the
programmer should be responsible for that work. An idea not unfamiliar to
RISC chip programmers, but Intel really went a little too far. They
literally stripped the processor bare and tried to use brute force clock
speed to make up for it!

Except the idea doesn't work. Benchmark after benchmark after benchmark
shows the 1.5 GHz Pentium 4 chip running slower than a 900 MHz Athlon, and in
some cases slower than a 533 MHz Celeron, even as slow as a 200 MHz Pentium
in rare cases.

Intel did throw a few new ideas into the Pentium 4. The two ALUs (arithmetic
logic units) which perform adds and other simple integer operations, run at
twice the clock speed of the processors. In other words, the Pentium 4 is in
theory capable of executing 6 billion integer arithmetic operations per
second. As I'll explain below, the true limit is actually much lower and not
any better than what you can get out of a Pentium III or Athlon already.
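
The arithmetic behind that theoretical figure (my own back-of-the-envelope,
using only the numbers above) is just clock speed times the number of ALUs
times the double-pumping factor:

#include <stdio.h>

int main(void)
{
    double clock_hz    = 1.5e9; /* 1.5 GHz Pentium 4 */
    int    alus        = 2;     /* two simple-integer ALUs */
    int    double_pump = 2;     /* each ALU runs at twice the core clock */

    /* Theoretical peak: 1.5e9 * 2 * 2 = 6e9 simple integer ops per second. */
    printf("%.0f ops/sec\n", clock_hz * alus * double_pump);
    return 0;
}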

Another new idea is the concept of a "trace cache", or what is basically a
code cache for decoded micro-ops. Rather than constantly decoding the
instructions in a loop over and over again, the Pentium 4 does not have an
L1 code cache at all. Instead, it caches the output of the decoder - the raw
micro-ops. This sounds like a good idea at first
 

Wingznut

Elite Member
Dec 28, 1999
Now... What was your point in bringing up a four month old article, that has been discussed on these forums many times before?

And why reference someone (in the title) who's been banned for well over a year now?

And why copy/paste instead of bringing your own thoughts?
 

AndyHui

Administrator Emeritus, Elite Member, AT FAQ M
Oct 9, 1999
This is a fairly biased article. Especially those parts regarding the Athlon.

While the Processor Basics section is not too bad, the writer fails to understand the architecture of the Pentium 4 and its purpose.

There are several statements that I do find inaccurate, but as this is a long article, I will not elaborate. If you wish, PM me.

An interesting slant, but no more accurate view than that Inquest article.

Phokus: I don't see the need for the deliberately inflammatory topic title. The reason these forums have been going to hell lately, and top members have been leaving, is stuff like this.
 

crab

Diamond Member
Jan 29, 2001
Man, I think I just lowered my IQ scrolling down for so long..


oh yeah, and I read this before.