An Intel possibility: dual virtual pipelines

KF

Golden Member
Dec 3, 1999
1,371
0
0
Well I don't know how possible it is, but after seeing all the speculation about some super thing that may be in Prescott, this thought came to mind:

The P4's big slowdown is on branch instructions, because the long pipeline means a lot of wasted time when a branch is mispredicted and the pipeline is invalidated. One obvious way to negate that is to speculatively execute both paths of the branch instead of only one. Then regardless of which path is taken, the pipeline will be prepared, and there will be no branch penalty. The problem is that you would have to duplicate most of the CPU's resources, roughly doubling the cost in silicon, and that's probably not the best use of that much silicon. But Intel has already worked out a way of creating a second "virtual" CPU that takes good advantage of what, by Intel's own statistics, are largely idle resources: Hyper-Threading. So a version of the same idea could create a second virtual pipeline holding the second branch path, just as HT runs two virtual CPUs on the same silicon real estate. Seems logically possible.
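
Here's a toy cost model of the tradeoff, in Python. Every number in it is a made-up assumption for illustration, not an Intel figure:

```python
# Toy cost model for the idea above. Every number is a made-up
# assumption for illustration, not an Intel figure.
PIPELINE_DEPTH = 20      # cycles lost when a mispredict flushes the pipe
BRANCH_FRACTION = 0.20   # fraction of instructions that are branches
MISPREDICT_RATE = 0.05   # fraction of branches predicted wrongly

# Predict one path: the average mispredict penalty per instruction.
penalty = BRANCH_FRACTION * MISPREDICT_RATE * PIPELINE_DEPTH
print(f"average penalty: {penalty:.2f} cycles per instruction")  # 0.20

# Execute both paths: no flush penalty, but every unresolved branch ties
# up a second (virtual) pipeline until it resolves.
print("both-paths penalty: ~0 cycles, paid for in duplicated work")
```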

Maybe I could get a job with the Inquirer :) Only thing is, I didn't have some "vole" tell me he heard it from a big-time Intel exec! Maybe someone would volunteer, so I could make that claim. I mean, how would I know if you heard it from a big exec or not :)
 

sao123

Lifer
May 27, 2002
12,653
205
106
Theoretically...
How would one undo the incorrect instruction if both instructions are carried out?
 

KF

Golden Member
Dec 3, 1999
1,371
0
0
>How would one undo the incorrect instruction if both instructions are carried out?

I don't know, but they do it now. When a present-day CPU takes the wrong branch, it has to do this. When a branch instruction enters the pipeline, the branch prediction logic follows the instruction stream to the predicted location, and the pipeline starts to fill with those instructions. When the branch instruction is retired and the true path of the branch turns out to be different, all the instructions in the pipeline have to be undone. Maintaining the possibility of undoing them is what makes it speculative execution: it is speculative because the CPU can go back and repair what has been done. If there are two pipelines, even virtual ones, they will be able to do the same thing.

In Hyper-Threading, there are two instruction streams being executed by the CPU. Both streams can contain branch instructions, and the branch logic does sometimes mispredict, so it is evidently possible to undo either, or both, streams.
 

EvanB

Senior member
Nov 3, 2001
268
0
0
This is not possible on the current processors. I'm not sure where, sorry, but I read something that posed the exact question you have, and someone, I think from Intel, said that it was not possible. I'll try to dig up the article, because I forget the reasoning behind it.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Speculation based on Amdahl's law:
1. On average, about 20% of integer instructions are branches.
2. Presently, we mispredict 5% of those, for a total of about 1% of all instructions.
3. If we remove all mispredicts, that 1% of instructions runs faster, and the other 99% aren't improved.

If, instead, we put our transistor budget towards something else, like a better cache, a better branch predictor, or whatever (even an extra "regular" pipeline), we can speed up all instructions a little bit (a simplification, but you get the idea). Since you're now speeding up 99% of instructions instead of 1%, even a tiny speedup will be much more significant than a huge speedup on the 1%.
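
To put rough numbers on that (the 10x below is an arbitrary stand-in for "the mispredict penalty mostly goes away", not a measured figure):

```python
# Amdahl's-law check on the numbers above. The 10x speedup applied to
# the mispredicted fraction is an arbitrary stand-in for "remove the
# penalty".
branch_fraction = 0.20
mispredict_rate = 0.05
affected = branch_fraction * mispredict_rate     # ~1% of instructions

# Amdahl: speedup = 1 / ((1 - p) + p / s)
fix_mispredicts = 1 / ((1 - affected) + affected / 10)
print(f"speed up the 1%: {fix_mispredicts:.3f}x overall")   # ~1.009x

# ...versus a modest 5% gain applied to every instruction:
print("speed up everything by 5%: 1.050x overall")
```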

edit: You don't really "undo" instructions... instead, every time you are executing something that you aren't sure should be done, you don't actually commit anything to memory or the registers. Once you have established for sure that you are executing instructions that should be executed, you commit the results. If you predicted wrong, you flush the pipeline (dump its contents) and don't commit the stuff you just computed.
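
A minimal sketch of that commit-vs-flush idea; the names are invented for illustration and don't correspond to anything in the real P4:

```python
# Minimal sketch of commit-vs-flush. The names are invented for
# illustration and don't correspond to real P4 structures.
class SpeculationBuffer:
    def __init__(self):
        self.pending = []     # speculative (register, value) results
        self.registers = {}   # architectural state the program can see

    def execute(self, reg, value):
        self.pending.append((reg, value))  # computed, but not yet final

    def commit(self):
        # Prediction confirmed correct: make the results architectural.
        for reg, value in self.pending:
            self.registers[reg] = value
        self.pending.clear()

    def flush(self):
        # Misprediction: nothing to "undo", just drop pending results.
        self.pending.clear()

buf = SpeculationBuffer()
buf.execute("eax", 42)   # executed down the predicted path
buf.flush()              # branch went the other way
print(buf.registers)     # {} - the wrong-path result never became visible
```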
 

zephyrprime

Diamond Member
Feb 18, 2001
7,512
2
81
I've thought of that idea also. I believe it would be entirely possible, and in my opinion it will eventually be done.

It won't take 100% more silicon, but it will take a significant amount. We would only have to duplicate the pipeline from the point that dispatches micro-ops up to the portion that completes ALU/FPU comparisons. We might need some more decoders and dispatchers, or we could do without them and only allow "both-path branching" when the necessary ops are in the trace cache (but that would only work on the P4). So we don't need to duplicate the retirement units or cache or branch prediction unit, etc. If you look at a CPU die map, a lot of it isn't execution units.
[picture: CPU die map from chip-architect.com]

Originally posted by: CTho9305
If we remove all mispredicts, that 1% of instructions runs faster, and the other 99% aren't improved.
It's a little more complicated than that. You forget that getting the 1% wrong causes a disproportionately big penalty. Supposing the average instruction has a 2-cycle latency on the P4, then 99 instructions would consume 198 cycles and 1 branch misprediction would consume 20 cycles. That's 20/218 = 9.1% of the processor's time spent on branch mispredictions.
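
Or as a quick calculation (same two assumptions, 2 cycles per instruction and a 20-cycle penalty):

```python
# Back-of-the-envelope, assuming 2 cycles per instruction on average
# and a 20-cycle misprediction penalty (both rough assumptions).
normal = 99 * 2                        # 198 cycles of useful work
penalty = 20                           # one mispredicted branch
print(penalty / (normal + penalty))    # 20/218 = 0.0917 -> ~9.1%
```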

Plus, the real benefit of doing such a thing would be that it would allow a CPU to safely have a longer pipeline, and with it a higher clock speed.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: zephyrprime
If we remove all mispredicts, that 1% of instructions runs faster, and the other 99% aren't improved.
It's a little more complicated than that. You forget that getting the 1% wrong causes a disproportionately big penalty. Supposing the average instruction has a 2-cycle latency on the P4, then 99 instructions would consume 198 cycles and 1 branch misprediction would consume 20 cycles. That's 20/218 = 9.1% of the processor's time spent on branch mispredictions.

You can't have a 2-cycle latency on the P4. It's got to be much higher than that. Looking at the pipeline for the P4, given that 2 stages are dispatch, 2 more are register fetches, and one is execute, you're going to have a latency of well over 5 cycles (I'm sure there are other stages that have to be executed whether or not the instruction is in the trace cache).

The latency is really irrelevant anyway, because (in theory) you can issue one instruction per pipeline per cycle regardless of latency.

I think your math is wrong because of your 2-cycle latency assumption (what matters is throughput on non-mispredicts), and the penalty might actually be higher. It looks like at the 19th stage you know what a branch outcome really is... so you will have executed 18 (or 19?) junk instructions, and lost 18 cycles. If you have a mispredicted branch followed by 99 instructions, you'll need to redo 18 instructions, meaning you did 118 instructions' worth of work total, for a penalty closer to 20%.
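
Roughly, in the same style (assuming ~1 instruction per cycle of throughput and an 18-stage flush, both estimates):

```python
# Same idea with ~1 instruction per cycle of throughput and an 18-stage
# flush per mispredict (the stage count is an estimate, not a spec).
useful = 100     # 1 mispredicted branch plus 99 good instructions
redone = 18      # junk work flushed and redone after the mispredict
print(redone / useful)   # 0.18 -> ~18%, i.e. "closer to 20%"
```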

edit: linkify that picture. You're widening the page. Scrolling sideways sucks.
 

KF

Golden Member
Dec 3, 1999
1,371
0
0
> linkify that picture.
direct link to die pic at chip-architect.com


>I've thought of that idea also. It would be entirely possible I believe and will eventually done in my opinion.
>It won't take 100% more silicon but it will take a significant amount. ...

I figured there were brainier people than me who had thought of this. In fact, I thought Intel might be doing this in Prescott, or whatever follows it.

Pardon me if I'm going over what was already clear enough in my original message.

The way Intel does Hyper-Threading doesn't require much added hardware, according to Intel. (Although working out the requirements is conceptually very complex.) Intel claims it offers the most improvement for the least amount of silicon, far more than a larger cache or anything else. The duplication of hardware is minimal, they say, and only done where it would not take much silicon. Basically, there is logic on the CPU that tracks two instruction streams going through almost all the same hardware as on a non-hyperthreaded CPU.

Now suppose they drop Hyper-Threading but apply the same concepts to the P4 pipeline executing a single instruction stream, only for branches. Then they could keep track of both potential branch paths, only one of which is the correct one, without adding any more hardware than Hyper-Threading did.

One reason this would work on branches is that the two paths are independent: the results of one path are not required by the other, because only one of the paths is actually supposed to be executed.

If they tried to just arbitrarily split a single instruction stream into two and run it through the same CPU resources like HT does, it could not work, because of instruction dependences. But Intel has already worked out how to do it when two instruction streams are independent, in the form of Hyper-Threading.
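
In toy form, the bookkeeping might look something like this (purely illustrative; nothing here reflects an actual Intel design):

```python
# Toy sketch of the idea (purely illustrative, not a real Intel design):
# tag work from each branch path the way HT tags work from each logical
# CPU, then squash the losing tag when the branch resolves.
in_flight = {"taken": [], "not_taken": []}   # path tag -> results

def issue(path, result):
    in_flight[path].append(result)   # both paths share the same core

def resolve(correct_path):
    survivors = in_flight[correct_path]   # keep the correct path
    for path in in_flight:                # squash the other one:
        if path != correct_path:          # no flush, no restart,
            in_flight[path] = []          # so no mispredict penalty
    return survivors

issue("taken", "r1 = 5")
issue("not_taken", "r1 = 7")   # independent: neither needs the other
print(resolve("taken"))        # ['r1 = 5'] - the wrong path just vanishes
```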

Over and over again, I read how the P4 gets less done per cycle (on average) than the Athlon because of the longer pipeline. And the reason longer pipelines have this problem is said to be ONLY mispredicted branches, and nothing else. Therefore the concept I'm trying to explain has the potential to close the gap in instructions per cycle. The new type of P4 could actually exceed the Athlon in IPC, since it might have approximately zero branch penalty, while the Athlon would still have some.

Is there some other reason the P4 gets less IPC than the Athlon?
 

zephyrprime

Diamond Member
Feb 18, 2001
7,512
2
81
Sorry about that link thing. I never realized that long links wouldn't wrap.

Originally posted by: CTho9305
You can't have a 2-cycle latency on the P4. It's got to be much higher than that.
Yes, my mistake. I was thinking of instruction throughput, not latency. Most processors cannot execute any instruction in 1 cycle, even though they are pipelined. Check it out: P4 optimization manual, page 427+.
 

Sahakiel

Golden Member
Oct 19, 2001
1,746
0
86
Originally posted by: KF

Over and over again, I read how the P4 gets less done per cycle (on average) than the Athlon because of the longer pipeline. And the reason longer pipelines have this problem is said to be ONLY mispredicted branches, and nothing else. Therefore the concept I'm trying to explain has the potential to close the gap in instructions per cycle. The new type of P4 could actually exceed the Athlon in IPC, since it might have approximately zero branch penalty, while the Athlon would still have some.

Is there some other reason the P4 gets less IPC than the Athlon?

Also, the lower IPC may not be entirely due to branch mispredicts. One has to remember that x86 instructions are now decoded into "RISC" instructions (which I'm assuming means fixed-length instructions). Both chips have different execution engines, which means the decoded instructions will be different. What may take two instructions on the Athlon may require five on the P4, and vice versa. The Athlon may use a more efficient core while the P4 uses a simpler design.

I'm not entirely sure whether the P4 flushes the entire pipeline on a mispredict. It's more likely to flush just the few stages where the actual execution is done. The fetch, decode, and retire stages shouldn't really be affected beyond having to hold extra data that turns out to be useless.


What you're describing would best be done on a true multi-threaded processor. It could also be done at the OS or even compiler level.
The problem is that what you're proposing would result in another pipeline, or something very close to one, at which point it'd be stupid not to make it a full one.
This means that you're effectively wasting half the resources on the CPU for one single branch. And that's assuming the instruction sequence doesn't have branches every five addresses.
Hyper-Threading works by having two threads ready to execute, but not executing them in true SMT fashion. What happens is that the second thread is executed only if the other logic units are free. It's like trying to archive a bunch of files onto CDs: you first fill up the CD with the biggest files that fit, then shove in progressively smaller files until you can't fit any more. Except here it's done while only seeing two files at a time, each of which can be split into three or four different-sized segments. That's why Hyper-Threading sometimes works great and at other times gives you a hit to execution time.
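
Here's that packing idea as a toy scheduler (this doesn't model the real P4 scheduler at all; the 3-slot issue width is made up for the example):

```python
# Toy version of the CD-packing analogy. This doesn't model the real P4
# scheduler; the 3-slot issue width is made up for the example.
SLOTS_PER_CYCLE = 3

def cycles_to_finish(thread_a, thread_b):
    cycles = 0
    while thread_a or thread_b:
        issued = thread_a[:SLOTS_PER_CYCLE]   # primary thread picks first
        thread_a = thread_a[len(issued):]
        free = SLOTS_PER_CYCLE - len(issued)
        issued += thread_b[:free]             # second thread fills gaps
        thread_b = thread_b[free:]
        cycles += 1
    return cycles

# When thread A leaves slots free, thread B rides along almost for free:
print(cycles_to_finish(list("abcd"), list("ef")))  # 2 cycles
                                                   # (3 if run serially)
```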

That's about all I know. I'm still learning quite a bit of this stuff.
 

CTho9305

Elite Member
Jul 26, 2000
9,214
1
81
Originally posted by: Sahakiel
The Athlon may use a more efficient core while the P4 uses a simpler design.
I think unless you are an employee of both Intel's processor design division and AMD's, you can't really say that. In general, I think making a pipeline longer while keeping it fast is probably harder.

I'm not entirely sure whether the P4 flushes the entire pipeline on a mispredict. It's more likely to flush just the few stages where the actual execution is done. The fetch, decode, and retire stages shouldn't really be affected beyond having to hold extra data that turns out to be useless.
Right. The data is useless, so you have to flush it.


This means that you're effectively wasting half the resources on the CPU for one single branch. And that's assuming the instruction sequence doesn't have branches every five addresses.
That's a good point. On a 20-stage pipeline, you're going to have an average of FOUR branches in flight. Each of those 4 can go 2 ways. That's a lot of pipelines if you want to execute all of the possibilities.
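
The arithmetic, reusing the ~20% branch figure from earlier in the thread:

```python
# The combinatorics: 20% branches in a 20-deep pipeline means about four
# unresolved branches in flight, and every one doubles the paths to cover.
branches_in_flight = int(20 * 0.20)   # ~4
print(2 ** branches_in_flight)        # 16 simultaneous pipelines
```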