This was posted to an Intel internal newsgroup. The responses to it were almost as funny. The author gave tacit approval for outside posting, so I'm posting it here (if you forward it, please keep the original author's name attached).
My apologies if you don't think it's funny. If nothing else, it can serve as an analogy for explaining highly technical CPU-architecture terms.
It's about a McDonald's drive-thru on a street called Cornell in an area west of Portland, OR - somewhere near Beaverton, I gather. I've never been there myself, but I can imagine. They implemented a new setup with 2 ordering stations that merge into a single drive-thru lane, so 2 people can order at a time.
----------------------------------------------------------------------------------------------
It seems to me that McDonald's could use a serious lesson in pipeline architecture. They started out well. They recognized that their classic 2-stage drive-through pipeline (order, execute) could be optimized by extending it to a 3-deep pipeline, splitting the execute stage into pay and pickup stages, since the execute stage typically had a higher latency than the order stage (assuming well-behaved instructions whose operands are known at order time, thus avoiding an order-stall condition). So they had a high-performance 3-stage pipeline (order, pay, pickup) which worked rather well, in my humble opinion.

But they couldn't leave well enough alone. They decided to attempt to get really fancy with a full-blown out-of-order machine (apart from the catastrophe-avoidance out-of-order "penalty box" for inordinately high-latency instructions, which has existed for as long as I can remember). Now they have a 2-wide order unit feeding the rest of the pipeline, allowing them to simultaneously process the order stage of 2 instructions (I guess they thought that the order stage had become the critical latency in the pipe by a large enough amount to warrant such an architectural change...). The problem is that the execution engine cannot keep up with a 2-wide front end. Invariably the pipeline stalls in execution and leaves instructions which have already completed the order stage stuck in the pipe, preventing subsequent instructions from entering. With the older system, during peak processing times, the execution engine already had trouble matching the throughput of a single order unit, making any benefit from a higher-throughput front end minuscule at best during high-load times. In an almost empty system (perhaps that is the condition they are designing to) there may be a moderate improvement in throughput, but I can't imagine the overall benefit being worthwhile.
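
If you want to convince yourself of this without loitering in the parking lot, a crude back-of-the-envelope simulation makes the point. The numbers below are pure invention on my part (30 seconds to order, 75 seconds to pay and pick up), and the model is deliberately generous to the new design (it doesn't even charge the 2-wide front end for the merge stalls described above), yet the conclusion holds: once the execution engine is the bottleneck, doubling the order unit buys you essentially nothing.

    def drive_thru_time(n_cars, order_width, order_t, exec_t):
        """Seconds to push n_cars through `order_width` parallel order
        stations feeding a single pay/pickup (execution) engine, with all
        cars already queued up, i.e. peak load."""
        order_free = [0.0] * order_width   # when each order station frees up
        exec_free = 0.0                    # when the execution engine frees up
        finish = 0.0
        for _ in range(n_cars):
            # the car orders at the first station to come free...
            station = min(range(order_width), key=lambda i: order_free[i])
            order_done = order_free[station] + order_t
            order_free[station] = order_done
            # ...then waits its turn at the single execution engine
            exec_start = max(order_done, exec_free)
            exec_free = exec_start + exec_t
            finish = exec_free
        return finish

    if __name__ == "__main__":
        CARS, ORDER_T, EXEC_T = 50, 30.0, 75.0   # invented latencies, seconds
        for width in (1, 2):
            total = drive_thru_time(CARS, width, ORDER_T, EXEC_T)
            print(f"{width}-wide order unit: {CARS / total * 3600:.1f} cars/hour")

Both widths come out to about 48 cars an hour. Swap the two latencies (a slow order stage feeding a fast execution engine) and the 2-wide front end really does roughly double the throughput, which is presumably the case somebody was imagining when they signed off on the remodel.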
Another problem is the lack of an arbitration mux to schedule
instructions from the front end into the execution pipeline. The
combination of a lack of collision avoidance logic with the fact
that there are almost always 2 instructions stalled and waiting to
enter the execution pipeline is a fault waiting to happen...
Additionally, the absence of scheduled control of entry for
instructions from the front end into the execution pipe prevents
the instruction queue from being able to correlate operands with
instructions in the (frequent) event of a collision. This results
in an increased risk of instructions executing with other
instructions' operands and an added "confirm" logic latency in
the pay pipestage. All in all the new design strikes me as an
overly ambitious attempt at optimization which has resulted in a
drive-through model which offers no significant benefit over the
simpler model but plenty of drawbacks.
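
For what it's worth, the fix isn't exotic: a tagged, round-robin arbiter between the two order lanes and the execution pipe would let each instruction carry its own operands through the merge, so the pay stage never has to "confirm" whose Big Mac is whose. Here is a toy sketch of the sort of thing I mean (the names and structure are invented for illustration, not anything McDonald's actually runs):

    from collections import deque
    from itertools import count

    class TaggedArbiter:
        """Round-robin arbitration between two order lanes feeding one
        execution pipe. Every entry is tagged at order time, so pay/pickup
        always sees the operands (the order) belonging to the car that is
        actually at the window."""

        def __init__(self):
            self.lanes = (deque(), deque())
            self._tags = count()
            self._priority = 0              # which lane wins the next grant

        def order(self, lane, items):
            """A car places its order in lane 0 or 1; returns its tag."""
            tag = next(self._tags)
            self.lanes[lane].append((tag, items))
            return tag

        def issue(self):
            """Admit the next car into the pay/pickup pipe (None if both
            lanes are empty), rotating priority so neither lane starves."""
            for lane in (self._priority, 1 - self._priority):
                if self.lanes[lane]:
                    self._priority = 1 - lane
                    return self.lanes[lane].popleft()
            return None

    arb = TaggedArbiter()
    arb.order(0, ["Big Mac", "fries"])
    arb.order(1, ["6 McNuggets"])
    print(arb.issue())   # (0, ['Big Mac', 'fries'])
    print(arb.issue())   # (1, ['6 McNuggets'])

The tag is the whole trick: once the merge order is recorded explicitly, the operands follow the instruction through the pipe and the extra "confirm" latency in the pay pipestage goes away.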
Of course, the analogy does have (at least) one gaping hole. In
a CPU pipeline an instruction doesn't get irritated when 3 other
instructions which were issued behind it end up executing before
it, dramatically increasing its latency through the system. *grumble*
I'd be really interested in seeing some system latency, throughput,
and error rate data comparing the old system to the new. I can't
imagine the new system shows any real benefit (although maybe I just
don't have a wild enough imagination). I suspect it is simply an
overarchitected mess. I hope they have someone doing performance
validation on it.
Jim Vaught