Stage | Abbreviation | Actions |
Instruction Fetch | IF | Read next instruction into CPU and increment PC by 4 |
Instruction Decode | ID | Determine opcode, read registers, compare registers (if branch), sign-extend immediate if needed, compute target address of branch, update PC if branch |
Execution / Effective addr | EX | Calculate using operands prepared in ID |
Memory access | MEM |   |
Write-back | WB |   |
Example: Two possible "streams" of instructions
BEQ R3, R8, ELSE      # In what stage is the outcome of the comparison known?
ADD R4, R5, R6        # ADD should not be executed if the branch is taken
SUB R8, R5, R6
.
.
.
ELSE: OR R3, R3, R2
Assume the branch is taken:
  | Time |
Instructions | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
BEQZ R3, ELSE |   | IF | ID | EX | MEM | WB |   |   |   |   |   |   |
ADD R4, R5, R6 |   |   | IF | ID | EX | MEM | WB |   |   |   |   |   |
ELSE: OR R3, R3, R2 |   |   |   | IF | ID | EX | MEM | WB |   |   |   |   |
If the branch is taken, then there is a branch penalty of 1 cycle.
If the branch is not taken and we continue to fetch instructions sequentially, then there is no branch penalty.
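To see how the penalty adds up over a whole program, here is a minimal Python sketch. It assumes the five-stage pipeline above, branches resolved in ID (a 1-cycle penalty when taken), and no other stalls. The function name `total_cycles` and its parameters are hypothetical, chosen only for illustration.

```python
def total_cycles(n_instructions, n_taken_branches, stages=5, taken_penalty=1):
    """Cycles to run n_instructions when taken branches are the only stalls.

    Assumes one instruction completes per cycle once the pipeline is full,
    and each taken branch wastes `taken_penalty` fetch slots.
    """
    fill = stages - 1  # cycles before the first instruction reaches WB
    return fill + n_instructions + n_taken_branches * taken_penalty


# The example above: BEQZ and OR complete, one taken branch in between.
print(total_cycles(2, 1))     # 7

# A hypothetical run of 100 instructions containing 15 taken branches.
print(total_cycles(100, 15))  # 119
```

The first result matches the timing table: BEQZ is fetched in cycle 2 and OR writes back in cycle 8, which is 7 cycles.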
How could we reduce the branch penalty in the pipeline above?
Delayed Branching - redefine the branch such that one (or two) instruction(s) after the branch will always be executed.
Compiler automatically rearranges code to fill the delayed-branch slot(s) with instructions that can always be executed. Instructions in the delayed-branch slot(s) do not need to be flushed after branching. If no instruction can be found to fill the delayed-branch slot(s), then a NOOP instruction is inserted.
Without Delayed Branching | With Delayed Branching |
SUB R8, R2, R1 | BEQZ R3, ELSE |
BEQZ R3, ELSE | SUB R8, R2, R1 # delay slot always done |
ADD R4, R5, R6 | ADD R4, R5, R6 |
. | . |
. | . |
ELSE: ADD R3, R3, R2 | ELSE: ADD R3, R3, R2 |
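The rearrangement in the table can be sketched as a tiny compiler pass in Python. This is a simplified illustration, not how a real compiler does it: it looks only at the instruction just before the branch and moves it into the delay slot when the branch does not read the register that instruction writes. The tuple layout `(opcode, dest, sources)` and the name `fill_delay_slot` are assumptions made for this example.

```python
def fill_delay_slot(block):
    """Return a copy of `block` with the branch delay slot filled.

    `block` is a list of (opcode, dest, sources) tuples whose last entry is
    the branch. This sketch only tries the instruction just before the branch.
    """
    *body, branch = block
    branch_sources = branch[2]
    # Safe to move only if the branch does not read what the candidate writes.
    if body and body[-1][1] not in branch_sources:
        candidate = body.pop()
        return body + [branch, candidate]
    # Nothing suitable found: pad the slot with a NOOP.
    return body + [branch, ("NOOP", None, ())]


independent = [("SUB", "R8", ("R2", "R1")), ("BEQZ", None, ("R3",))]
dependent = [("SUB", "R3", ("R2", "R1")), ("BEQZ", None, ("R3",))]

print([op for op, _, _ in fill_delay_slot(independent)])  # ['BEQZ', 'SUB']
print([op for op, _, _ in fill_delay_slot(dependent)])    # ['SUB', 'BEQZ', 'NOOP']
```

In the second case SUB writes R3, which BEQZ tests, so SUB must stay ahead of the branch and the slot gets a NOOP.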
Due to data dependences, the instruction before the branch cannot always be moved into the branch-delay slot. Other alternatives to consider are:
The Instruction at the Target of the Branch | |
Without Delayed Branching | With Delayed Branching |
  | ADD R7, R8, R9 |
LOOP: ADD R7, R8, R9 | LOOP: |
. | . |
. | . |
. | . |
SUB R3, R2, R1 | SUB R3, R2, R1 |
BEQZ R3, LOOP | BEQZ R3, LOOP |
  | ADD R7, R8, R9 # delay slot |
MUL R4, R5, R6 | MUL R4, R5, R6 |
Can this technique always be used?
The Instruction From the Fall-Through of the Branch | |
Without Delayed Branching | With Delayed Branching |
SUB R3, R2, R1 | SUB R3, R2, R1 |
BEQZ R3, ELSE | BEQZ R3, ELSE |
ADD R8, R5, R6 | ADD R8, R5, R6 # delay slot |
. | . |
. | . |
ELSE: ADD R3, R3, R2 | ELSE: ADD R3, R3, R2 |
Can this technique always be used?
Branch Prediction to reduce the branch penalty
Main idea: predict whether the branch will be taken and fetch accordingly
Fixed Techniques:
a) Predict never taken - continue to fetch sequentially. If the branch is not taken, then there are no wasted fetches.
b) Predict always taken - fetch from branch target as soon as possible
(From analyzing program behavior, > 50% of branches are taken.)
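As a rough illustration of the two fixed techniques, the Python sketch below scores each one against a recorded list of branch outcomes. The trace is hypothetical (a loop-closing branch that is taken nine times and then falls through), and the function name `accuracy` is made up for this example.

```python
def accuracy(outcomes, predict_taken):
    """Fraction of branches for which a fixed prediction was correct."""
    hits = sum(1 for taken in outcomes if taken == predict_taken)
    return hits / len(outcomes)


# Hypothetical trace: True = branch taken, False = not taken.
trace = [True] * 9 + [False]

print(accuracy(trace, predict_taken=True))   # 0.9
print(accuracy(trace, predict_taken=False))  # 0.1
```

Which fixed choice wins depends entirely on the mix of branches in the program, which is why the measured taken rate matters.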
Static Techniques: Predict by opcode - compiler helps by having different opcodes based on likely outcome of the branch
Consider the HLL constructs:
HLL | AL |
  | CMP x, #0 |
While (x > 0) do | BR_LE_PREDICT_NOT_TAKEN END_WHILE |
{loop body} |   |
end while | END_WHILE: |
Studies have found about a 75-82% successful prediction rate using this technique.
Dynamic Techniques: try to improve prediction by recording the history of each conditional branch
We need to store one or more history bits to reflect whether the most recent executions of the branch were taken or not.
Problem: How do we avoid always fetching the instruction after the branch?
Solution:
Branch-History Table (BHT) - a small, fully-associative cache that stores information about the most recently executed branch instructions. (Figure 8.13)
With a fully-associative cache, you supply a tag/key value to search for across the whole cache in parallel. In a BHT, the Branch instruction address acts as the tag since that's what you know at IF.
During the IF stage, the Branch-History Table is checked to see whether the instruction being fetched is a branch instruction (that is, whether its address matches the tag of an entry).
If the instruction is a branch instruction and it is in the Branch-History Table, then the target address and prediction can be supplied by the BHT by the end of the IF for the branch instruction.
If the branch instruction is in the Branch-History Table, will the target address supplied correspond to the correct instruction to be executed next?
What if the instruction is a branch instruction and it is not in the Branch-History Table?
Should the Branch-History Table contain entries for unconditional as well as conditional branch instructions?
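The lookup and update steps described above can be sketched in Python as follows. This is a toy model under several assumptions that are not from the notes: a dictionary stands in for the fully-associative search, the history is a 2-bit saturating counter (which may differ in detail from the state diagram in Figure 8.12), new entries start in a weak state, the oldest entry is evicted when the table is full, and instructions are 4 bytes. The class and method names are hypothetical.

```python
class BranchHistoryTable:
    """Toy BHT: branch address -> [target address, 2-bit counter]."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = {}  # dicts keep insertion order, so the oldest is first

    def lookup(self, pc):
        """Called during IF: return the predicted address of the next fetch."""
        entry = self.entries.get(pc)
        if entry is None:
            return pc + 4  # no match: treat as a non-branch, fetch sequentially
        target, counter = entry
        return target if counter >= 2 else pc + 4

    def update(self, pc, taken, target):
        """Called once the branch outcome and target are actually known."""
        if pc not in self.entries:
            if len(self.entries) >= self.capacity:
                self.entries.pop(next(iter(self.entries)))  # evict the oldest
            self.entries[pc] = [target, 2 if taken else 1]  # start weak
            return
        entry = self.entries[pc]
        entry[0] = target
        entry[1] = min(3, entry[1] + 1) if taken else max(0, entry[1] - 1)


bht = BranchHistoryTable()
print(bht.lookup(100))        # 104 (branch not yet in the table)
bht.update(100, taken=True, target=200)
print(bht.lookup(100))        # 200 (now predicted taken)
```

The first lookup shows what happens to a branch that is not yet in the table: the fetch simply continues sequentially, and the entry is created only after the branch resolves.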
Table 8.2 shows the advantage of using a Branch-history table to improve accuracy of the branch prediction. It shows the impact of past n branches on prediction accuracy.
Notice:
Typically, two prediction bits are used so that two wrong predictions in a row are needed to change the prediction -- see Figure 8.12.
How does this help for nested loops?
Consider the nested loops:
for (i = 1; i <= 100; i++) {
    for (j = 1; j <= 100; j++) {
        <do something>
    }
}
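One way to explore the nested-loop question is to count mispredictions for the inner loop's branch with one and with two prediction bits. The Python sketch below makes several assumptions that are not from the notes: the inner loop is compiled with a branch at the bottom that is taken to repeat the loop (99 times taken, then once not taken, per outer iteration), the predictor is a saturating counter, and it starts in the "taken" state. The function name `mispredictions` is hypothetical.

```python
def mispredictions(outcomes, bits):
    """Count wrong predictions made by a `bits`-bit saturating counter."""
    top = (1 << bits) - 1   # highest counter value
    counter = top           # assumed initial state: predict taken
    wrong = 0
    for taken in outcomes:
        predicted_taken = counter > top // 2
        if predicted_taken != taken:
            wrong += 1
        # Move toward "taken" or "not taken", saturating at the ends.
        counter = min(top, counter + 1) if taken else max(0, counter - 1)
    return wrong


# Inner-loop branch: taken 99 times, then falls through; repeated 100 times.
inner_branch = ([True] * 99 + [False]) * 100

print(mispredictions(inner_branch, bits=1))  # 199
print(mispredictions(inner_branch, bits=2))  # 100
```

With one bit, each exit from the inner loop flips the prediction, so the first iteration of the next pass is also mispredicted. With two bits, the single not-taken outcome is not enough to change the prediction, so only the loop exits are missed.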