Avoiding Single-ported Translation Lookaside Buffers

Status
Not open for further replies.

polarmystery

Diamond Member
Aug 21, 2005
3,888
8
81
Hey guys,

I got a homework question in my "advanced" computer architecture class (2nd graduate level course in the series) this semester and I'm a little confused on one of the questions. It's not a question of theory, rather a question of implementation. Here is the question:

"The MIPS pipeline shown in Table 2.7 employs a two-phase clocking scheme that makes efficient use of a shared TLB, since instruction fetch accesses the TLB in phase one and data fetch accesses in phase two. However, when resolving a conditional branch, both the branch target address and the branch fall-through address need to be translated during phase one – in parallel with the branch condition check in phase one of the ALU stage – to enable instruction fetch from either the target or the fall-through during phase two. This seems to imply a dual-ported TLB. Suggest an architected solution to this problem that avoids dual-porting the TLB."


This question suggests that the ALU needs to compute both branch addresses (the taken address and the not-taken address) while also doing the branch condition check. We are supposed to avoid dual-porting the TLB, but I can't think of how one could use a single TLB to translate both addresses. I'm thinking maybe data forwarding in the ALU stage (the table shows a 5-stage pipeline, where each stage has two phases) might solve the problem, but I'm under the impression that would only work for the branch fall-through, not for a taken branch.

Thoughts?
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Unfortunately I can't help you, but if you don't get many responses here you might try posting in the Programming forum as well. There's a good chance that at least user degibson would be able to provide some input; I don't know if he checks HT.

Just out of curiosity, does 'dual port' mean a TLB that can service 2 requests simultaneously?
 

Born2bwire

Diamond Member
Oct 28, 2005
9,840
6
71
Well, the most obvious solution would be, like you said, to forward the data, or to simply stall the pipeline so that the ALU first does the branch check and then calculates the memory address. I don't think that's an ideal solution, though.

But in terms of your data forwarding, have you looked at some basic branch predictors? The simplest is to assume that the branch is taken and let the pipeline move forward under that assumption while the actual branch check is still in flight. Once the branch check resolves, if the branch is taken then nothing happens, since you were already operating under that assumption. If you find the branch is not taken, you flush the pipeline and restart it on the fall-through path.

So one solution is to implement some form of branch prediction first and then let the ALU catch up with the branch check.
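The predict-taken scheme above can be sketched as a toy cycle counter. Everything here — the instruction tuples, the 2-cycle flush penalty, the function name — is invented for illustration, not real MIPS behavior:

```python
# Toy model of "always predict taken": correct predictions cost nothing
# extra; a mispredict flushes the pipeline and pays a refill penalty.

def execute(instructions, predict_taken=True):
    """Count cycles for a stream of (op, actually_taken) pairs, charging
    a flush penalty whenever a branch's outcome disagrees with the
    static prediction."""
    FLUSH_PENALTY = 2  # assumed pipeline refill cost, in cycles
    cycles = 0
    for op, actually_taken in instructions:
        cycles += 1
        if op == "branch" and actually_taken != predict_taken:
            cycles += FLUSH_PENALTY  # mispredict: flush and refetch
    return cycles

# A stream with two branches, one taken and one not taken:
stream = [("alu", None), ("branch", True), ("alu", None), ("branch", False)]
print(execute(stream))  # 6: 4 base cycles + 2-cycle penalty for one mispredict
```

With a 50/50 branch mix like this, "predict taken" and "predict not taken" cost the same; the win only shows up when the static guess matches the common case.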

DISCLAIMER: It's been several years since I did my computer architecture class. I mean, I can read your post and the words make sense, but I could be wrong on my interpretation of the problem. Just been a long time since I've had to think about this stuff.
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
I wanted to suggest or ask about branch prediction too but I figured OP must know about this already if he's in grad level classes. That and the phrasing of the question made me think there's some kind of trick you can do w/o adding something as substantial as branch pred.

Though as you point out, even trivial 'always taken (or never taken)' prediction would be better than always stalling until you have both targets resolved.
 

Born2bwire

Diamond Member
Oct 28, 2005
9,840
6
71
I wanted to suggest or ask about branch prediction too but I figured OP must know about this already if he's in grad level classes. That and the phrasing of the question made me think there's some kind of trick you can do w/o adding something as substantial as branch pred.

Though as you point out, even trivial 'always taken (or never taken)' prediction would be better than always stalling until you have both targets resolved.

Yeah, I missed that it's a graduate level course which would imply a more subtle answer.
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Feel free to ignore this b/c I don't want to hijack your thread, buuuuuut...

What is data forwarding?

And back to the OP: would it be possible to like pair up TLB entries? So branch taken/not taken targets are stored together & you only have to do one lookup to get both of them?
 

polarmystery

Diamond Member
Aug 21, 2005
3,888
8
81
Feel free to ignore this b/c I don't want to hijack your thread, buuuuuut...

What is data forwarding?

And back to the OP: would it be possible to like pair up TLB entries? So branch taken/not taken targets are stored together & you only have to do one lookup to get both of them?

In an instruction pipeline, there are stages that each handle one step of processing an instruction. Normally, it happens like this:

instruction fetch -> instruction decode -> ALU operation -> memory access -> write back.

To improve efficiency, the pipeline holds one instruction per stage. Let's say you have two instructions. The first instruction gets fetched. After it gets fetched, it gets decoded in stage two of the pipeline. While it's in stage two, the second instruction gets fetched and is in stage one of the pipeline.

For data forwarding, if there is a data dependency (i.e., the second instruction going through the pipeline needs the result of the first), then rather than waiting for the first instruction to traverse the entire pipeline and write its result back, you create a data path that hands the result to the second instruction as soon as it's available.
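Here's a minimal sketch of that forwarding path. The register names, the EX/MEM latch, and the two instructions are all made up for illustration:

```python
# Sketch: resolving a read-after-write hazard by forwarding from the
# EX/MEM latch instead of waiting for write-back.

regs = {"r1": 0, "r2": 5, "r3": 7}  # architectural register file
ex_mem_latch = {}  # result sitting at the end of the ALU stage

def alu_stage(dest, a, b):
    # The ALU result lands in the latch; write-back hasn't happened yet.
    ex_mem_latch[dest] = a + b

def read_operand(name):
    # Forwarding: prefer the in-flight result over the stale register file.
    return ex_mem_latch.get(name, regs[name])

# instr1: add r1, r2, r3   (r1 = 12, but only in the latch so far)
alu_stage("r1", regs["r2"], regs["r3"])

# instr2: add r4, r1, r2   -- needs r1 before instr1 has written back
result = read_operand("r1") + read_operand("r2")
print(result)  # 17: forwarded 12 plus 5, no stall needed
```

Without the `read_operand` bypass, instr2 would read the stale `r1 = 0` (or have to stall until write-back); the forwarding path is what lets the dependent instruction proceed immediately.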


****

For my homework answer, I found that the answer lies with the Program Counter: with forwarding I don't need to compute the fall-through address at all, and the TLB is used only to translate the branch target. Thanks for all the comments guys!
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
In an instruction pipeline, there are stages that each handle one step of processing an instruction. Normally, it happens like this:

instruction fetch -> instruction decode -> ALU operation -> memory access -> write back.

To improve efficiency, the pipeline holds one instruction per stage. Let's say you have two instructions. The first instruction gets fetched. After it gets fetched, it gets decoded in stage two of the pipeline. While it's in stage two, the second instruction gets fetched and is in stage one of the pipeline.

For data forwarding, if there is a data dependency (i.e., the second instruction going through the pipeline needs the result of the first), then rather than waiting for the first instruction to traverse the entire pipeline and write its result back, you create a data path that hands the result to the second instruction as soon as it's available.

Oh I think I've heard of this idea before but not w/a name associated. Out of order execution has a buffer or two dedicated to this purpose, right? Instructions with a data dependence will go through as far as possible & I think they maintain a "pointer" to the register(?) where the result they need will be placed. And they can move ahead as soon as that value is ready, even before the instruction generating the result has retired.

****
For my homework answer, I found that the answer lies with the Program Counter: with forwarding I don't need to compute the fall-through address at all, and the TLB is used only to translate the branch target. Thanks for all the comments guys!

Just for my curiosity... the program counter helps b/c it holds the instruction for one condition of the branch? (Either the next instruction to be executed for the branch not-taken path or the taken path) So that means you don't have to do any target look up for one path b/c PC has the answer already?

Do you need extra logic to track which result is held by the PC? If I'm understanding you correctly, sounds like you're assuming that if I have...
<code block A>
if(branch)
<code block B>
endif
<code block C>
Then the assembly layout will be A,B,C, so the instruction after branch is always the 'taken' path. But at least in x86, it's possible to get A,C,B as well.
 

polarmystery

Diamond Member
Aug 21, 2005
3,888
8
81
Oh I think I've heard of this idea before but not w/a name associated. Out of order execution has a buffer or two dedicated to this purpose, right? Instructions with a data dependence will go through as far as possible & I think they maintain a "pointer" to the register(?) where the result they need will be placed. And they can move ahead as soon as that value is ready, even before the instruction generating the result has retired.


Just for my curiosity... the program counter helps b/c it holds the instruction for one condition of the branch? (Either the next instruction to be executed for the branch not-taken path or the taken path) So that means you don't have to do any target look up for one path b/c PC has the answer already?

Do you need extra logic to track which result is held by the PC? If I'm understanding you correctly, sounds like you're assuming that if I have...
<code block A>
if(branch)
<code block B>
endif
<code block C>
Then the assembly layout will be A,B,C, so the instruction after branch is always the 'taken' path. But at least in x86, it's possible to get A,C,B as well.

Yes sir, out-of-order execution uses buffers between stages, but the premise for instruction decode is the same. Out-of-order execution is for parallel or "superscalar" pipeline implementations.

As far as extra logic for the PC, no you don't. The program counter simply steps to the next instruction depending on the size of the words fetched. Fetched words typically carry the opcode, target register, and operand locations. If each instruction word is 4 bytes (as in MIPS), the PC increments by 4 per fetch (typically, although this is not -ALWAYS- the case). The program counter is helpful in computing the branch fall-through address (IE: not taken) because of the way fetches work. If the branch is not taken, there is no need to compute a target address because the next instruction fetch location is simply the next sequential address; fetch being the keyword here, not execution as in out of order.
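That fall-through-vs-target computation can be sketched like this. MIPS-style 4-byte instruction words and word-scaled branch offsets are assumed; the function name and addresses are made up:

```python
# Sketch: the PC gives the fall-through address "for free" as part of
# the normal fetch increment; only a taken branch needs an added offset.

INSTR_BYTES = 4  # MIPS-style fixed 4-byte instruction words

def next_pc(pc, is_branch, taken, offset):
    fall_through = pc + INSTR_BYTES  # the PC's ordinary fetch step
    if is_branch and taken:
        # MIPS-style: word offset, relative to the instruction after
        # the branch, scaled to bytes by shifting left 2.
        return fall_through + (offset << 2)
    return fall_through

print(hex(next_pc(0x1000, False, False, 0)))  # 0x1004: sequential fetch
print(hex(next_pc(0x1000, True, True, 3)))    # 0x1010: taken-branch target
```

This is the sense in which only one address needs translating: the not-taken path is just the PC's own increment, so the single TLB port can be spent on the taken target.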

Using your example:
<code block A>
if(branch)
<code block B>
endif
<code block C>

Code block C sits at the address immediately after code block A's last instruction, and execution falls through to it when the branch is not taken. So yes, it is possible to get A, C, B. This actually happens a TON in superscalar out-of-order operation.
 

eLiu

Diamond Member
Jun 4, 2001
6,407
1
0
Yes sir, out-of-order execution uses buffers between stages, but the premise for instruction decode is the same. Out-of-order execution is for parallel or "superscalar" pipeline implementations.

As far as extra logic for the PC, no you don't. The program counter simply steps to the next instruction depending on the size of the words fetched. Fetched words typically carry the opcode, target register, and operand locations. If each instruction word is 4 bytes (as in MIPS), the PC increments by 4 per fetch (typically, although this is not -ALWAYS- the case). The program counter is helpful in computing the branch fall-through address (IE: not taken) because of the way fetches work. If the branch is not taken, there is no need to compute a target address because the next instruction fetch location is simply the next sequential address; fetch being the keyword here, not execution as in out of order.

Using your example:
<code block A>
if(branch)
<code block B>
endif
<code block C>

Code block C sits at the address immediately after code block A's last instruction, and execution falls through to it when the branch is not taken. So yes, it is possible to get A, C, B. This actually happens a TON in superscalar out-of-order operation.

Whoops, it looks like my comment actually arose from some misunderstanding on my end. I was thinking the PC increment is always basically +1. So it fetches the branch instruction, then it fetches whatever instruction follows. By "follows", I mean whatever instr is next "on disk", implying that the instruction arrangement generated by whoever/whatever wrote the assembly would affect what the PC does after seeing a branch instr. I guess that "on disk" code ordering is more relevant for caching (e.g. if a branch is almost never taken, better cache performance if the branch's 'taken' code does not appear in the main code body).

But after some googling it looks like the PC doesn't quite work that way. Instead, jumps (jmp, branch, function call/return, etc.) interrupt the standard "+1" and modify the value. Which makes more sense b/c my mental image made it pretty hard to do function calls, lol.

Though now I'm kind of confused on how you're using the PC. So the PC comes into play in the first stage of the pipeline you drew, 'fetch' right? I'm assuming that an instruction 'moves' from 1 stage to the next during 1 cycle. When you run into a branch, in order for things to keep moving smoothly, the PC has to be immediately updated.

But I don't see how that's possible since at the very least it seems like instruction decode would have to happen. So in the current scheme (w/no prediction), does running into a branch always cause stalling?

In the original post, the problem said the branch target & fall-thru addresses need to be translated in phase 1. Is this so that once the ALU completes its computation to decide the taken-ness of the branch (which happens in parallel), the address of the next instruction needed is immediately known? Wouldn't this next address be written to the PC? This seems like where the data-forwarding could help; the ALU's result can be handed to the PC asap.

Clearly I've screwed something up b/c I still need to resolve branch target & fall-thru simultaneously. What did I miss? How does the PC help you only need to compute 1 address? Does the branch instruction fetched by the PC also contain the address of the instruction that should be executed next after taking (or is it not-taking?) the branch?

Thanks for entertaining my questions, btw. I'm clearly not a computer engineering major, haha.
 
Status
Not open for further replies.