Control Hazards - branching causes problems because the pipeline may already be filled with the wrong instructions.

Stage                        Abbreviation  Actions
Instruction fetch            IF            Read next instruction into CPU and increment PC by 4
Instruction decode           ID            Determine opcode, read registers, compare registers (if branch),
                                           sign-extend immediate if needed, compute target address of branch,
                                           update PC if branch
Execution / effective addr   EX            Calculate using operands prepared in ID:
  • memory reference: add base register to offset to form the effective address
  • register-register ALU: ALU performs the specified calculation
  • register-immediate ALU: ALU performs the specified calculation
Memory access                MEM
  • load: read memory at the effective address into a pipeline register
  • store: write the register value read in ID to memory at the effective address
Write-back                   WB
  • ALU or load instruction: write the result into the register file
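The stage sequence above can be sketched as a small timing-diagram generator (a minimal Python sketch assuming an ideal pipeline with one instruction fetched per cycle and no stalls; the instruction strings are just labels):

```python
# Minimal sketch: timing diagram for an ideal 5-stage pipeline
# (one instruction fetched per cycle, no stalls, no hazards).
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def timing_diagram(instructions):
    """Return {instruction: {cycle: stage}} for an ideal pipeline."""
    diagram = {}
    for i, instr in enumerate(instructions):
        # Instruction i enters IF in cycle i + 1 and advances one stage per cycle.
        diagram[instr] = {i + 1 + s: stage for s, stage in enumerate(STAGES)}
    return diagram

diagram = timing_diagram(["ADD R4,R5,R6", "SUB R8,R5,R6", "OR R3,R3,R2"])
# ADD occupies IF in cycle 1 and WB in cycle 5; OR finishes WB in cycle 7.
```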

Example: two possible "streams" of instructions

      BEQ R3, R8, ELSE    # In what stage is the outcome of the comparison known?
      ADD R4, R5, R6      # ADD should not be executed if the branch is taken
      SUB R8, R5, R6
      .
      .
      .
ELSE: OR R3, R3, R2

Assume the branch is taken:

                       Time
Instruction            1    2    3    4    5    6    7
BEQ R3, R8, ELSE       IF   ID   EX   MEM  WB
ADD R4, R5, R6              IF   (flushed: branch taken)
ELSE: OR R3, R3, R2              IF   ID   EX   MEM  WB

The branch outcome and target address are known at the end of ID (cycle 2), so the ADD fetched in cycle 2 is flushed and the OR at the branch target is fetched in cycle 3.

If the branch is taken, then there is a branch penalty of 1 cycle.

If the branch is not taken and we continue to fetch instructions sequentially, then there is no branch penalty.
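The average cost of the penalty is simple arithmetic. A minimal sketch, assuming illustrative numbers (20% of instructions are branches, 60% of branches are taken) together with the 1-cycle taken-branch penalty from above:

```python
# Effective CPI with a 1-cycle penalty on taken branches.
# The branch/taken fractions below are assumed for illustration.
base_cpi = 1.0          # ideal pipelined CPI
branch_fraction = 0.20  # assumed fraction of instructions that are branches
taken_fraction = 0.60   # assumed fraction of branches that are taken
taken_penalty = 1       # cycles lost per taken branch (from the diagram above)

# 1.0 + 0.20 * 0.60 * 1 = 1.12 cycles per instruction on average
effective_cpi = base_cpi + branch_fraction * taken_fraction * taken_penalty
```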

How could we reduce the branch penalty in the pipeline above?

Delayed Branching - redefine the branch so that one (or two) instruction(s) after the branch will always be executed.

The compiler automatically rearranges code to fill the delayed-branch slot(s) with instructions that can always be executed. Instructions in the delayed-branch slot(s) do not need to be flushed after branching. If no instruction can be found to fill the delayed-branch slot(s), then a NOOP instruction is inserted.

Without Delayed Branching          With Delayed Branching
      SUB R8, R2, R1                     BEQZ R3, ELSE
      BEQZ R3, ELSE                      SUB R8, R2, R1   # delay slot: always executed
      ADD R4, R5, R6                     ADD R4, R5, R6
      .                                  .
      .                                  .
ELSE: ADD R3, R3, R2               ELSE: ADD R3, R3, R2

Due to data dependences, the instruction before the branch cannot always be moved into the branch-delay slot. Other alternatives to consider are:

The Instruction at the Target of the Branch

Without Delayed Branching          With Delayed Branching
LOOP: ADD R7, R8, R9                     ADD R7, R8, R9
      .                            LOOP: .
      .                                  .
      .                                  .
      SUB R3, R2, R1                     SUB R3, R2, R1
      BEQZ R3, LOOP                      BEQZ R3, LOOP
      MUL R4, R5, R6                     ADD R7, R8, R9   # delay slot
                                         MUL R4, R5, R6

Can this technique always be used?

The Instruction From the Fall-Through of the Branch

Without Delayed Branching          With Delayed Branching
      SUB R3, R2, R1                     SUB R3, R2, R1
      BEQZ R3, ELSE                      BEQZ R3, ELSE
      ADD R8, R5, R6                     ADD R8, R5, R6   # delay slot
      .                                  .
      .                                  .
ELSE: ADD R3, R3, R2               ELSE: ADD R3, R3, R2

Can this technique always be used?

Branch Prediction to reduce the branch penalty

Main idea: predict whether the branch will be taken and fetch accordingly

Fixed Techniques:

a) Predict never taken - continue to fetch sequentially. If the branch is not taken, then there are no wasted fetches.

b) Predict always taken - fetch from branch target as soon as possible

(From analyzing program behavior, > 50% of branches are taken.)
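A quick sketch comparing the two fixed policies on an assumed toy trace of branch outcomes (True = taken, False = not taken; the trace is made up for illustration):

```python
# Compare the two fixed prediction policies on an assumed toy trace.
trace = [True, True, True, False, True, True, True, False]

always_taken_hits = sum(1 for outcome in trace if outcome)
never_taken_hits = sum(1 for outcome in trace if not outcome)

always_taken_accuracy = always_taken_hits / len(trace)  # 6/8 = 0.75
never_taken_accuracy = never_taken_hits / len(trace)    # 2/8 = 0.25
```

Since more than half of branches are taken on this trace, predict-always-taken wins, matching the observation above about program behavior.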

Static Techniques: Predict by opcode - the compiler helps by using different opcodes based on the likely outcome of the branch

Consider the HLL construct and its assembly-language (AL) translation:

HLL                      AL
While (x > 0) do         WHILE:     CMP x, #0
    {loop body}                     BR_LE_PREDICT_NOT_TAKEN END_WHILE
end while                           {loop body}
                                    BR WHILE
                         END_WHILE:

Studies have found about a 75-82% successful prediction rate using this technique.

Dynamic Techniques: try to improve prediction by recording the history of conditional branches

We need to store one or more history bits that reflect whether the most recent executions of the branch were taken or not.
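A minimal sketch of a single history bit: predict that each branch will do whatever it did the previous time (the trace is illustrative):

```python
def one_bit_predict(trace, initial_prediction=True):
    """Count correct predictions when each branch is predicted to do
    whatever it did the previous time (a single history bit)."""
    prediction, correct = initial_prediction, 0
    for outcome in trace:
        if prediction == outcome:
            correct += 1
        prediction = outcome  # remember only the most recent outcome
    return correct

# A loop branch taken 9 times, then not taken once at loop exit:
trace = [True] * 9 + [False]
hits = one_bit_predict(trace)  # 9 of 10 correct; only the loop exit is missed
```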

Problem: How do we avoid always fetching the instruction after the branch?

Solution:

Branch-History Table (BHT) - a small, fully-associative cache that stores information about the most recently executed branch instructions. (Figure 8.13)

With a fully-associative cache, you supply a tag/key value that is searched for across the whole cache in parallel. In a BHT, the branch instruction's address acts as the tag, since that is what is known at IF.

During the IF stage, the Branch-History Table is checked to see whether the instruction being fetched is a branch instruction (i.e., whether its address matches an entry).

If the instruction is a branch instruction and it is in the Branch-History Table, then the target address and prediction can be supplied by the BHT by the end of IF for the branch instruction.

If the branch instruction is in the Branch-History Table, will the target address supplied correspond to the correct instruction to be executed next?

What if the instruction is a branch instruction and it is not in the Branch-History Table?

Should the Branch-History Table contain entries for unconditional as well as conditional branch instructions?
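The lookup/update behavior can be sketched with a Python dict standing in for the small fully-associative cache (the addresses and target below are made up for illustration):

```python
# A dict stands in for the fully-associative cache; the branch
# instruction's address is the tag. Addresses/targets are illustrative.
bht = {}

def bht_lookup(pc):
    """At IF: return (predict_taken, target_address) on a hit, else None."""
    return bht.get(pc)

def bht_update(pc, taken, target):
    """After the branch resolves: record its outcome and target address."""
    bht[pc] = (taken, target)

bht_update(0x400, True, 0x480)  # branch at 0x400 resolved taken, target 0x480
hit = bht_lookup(0x400)         # (True, 0x480): fetch from 0x480 after IF
miss = bht_lookup(0x404)        # None: not a known branch, fetch sequentially
```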

Table 8.2 shows the advantage of using a Branch-History Table to improve the accuracy of branch prediction. It shows the impact of the past n branches on prediction accuracy.

Notice:

  1. the big jump in accuracy from using knowledge of just 1 past branch to predict the branch
  2. the big jump in going from 1 to 2 past branches for scientific applications. What types of data do scientific applications spend most of their time processing? What would be true about the code for processing this type of data?

Typically, two prediction bits are used so that two wrong predictions in a row are needed to change the prediction -- see Figure 8.12.
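A minimal sketch of the 2-bit saturating counter (the state numbering is an assumption: states 0-1 predict not taken, 2-3 predict taken; Figure 8.12 may label the states differently):

```python
def two_bit_step(state, taken):
    """Saturating counter: move toward 3 when taken, toward 0 when not.
    States 2 and 3 predict taken; states 0 and 1 predict not taken."""
    return min(state + 1, 3) if taken else max(state - 1, 0)

def predicts_taken(state):
    return state >= 2

# Starting strongly taken (state 3), one not-taken outcome drops the
# state to 2, which still predicts taken; only a second miss in a row
# would flip the prediction.
state = two_bit_step(3, taken=False)  # state == 2, still predicts taken
```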

How does this help for nested loops?

Consider the nested loops:

for (i = 1; i <= 100; i++) {
    for (j = 1; j <= 100; j++) {
        <do something>
    }
}

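For the inner-loop branch above (taken 99 times, then not taken once at loop exit), a 2-bit counter misses only the exit on each pass: the single mispredict does not flip the prediction, so re-entering the inner loop predicts correctly immediately. A minimal sketch (same assumed state numbering as before, with states 2-3 predicting taken):

```python
def run_two_bit(trace, state):
    """Run a 2-bit saturating counter over a branch-outcome trace.
    Returns (number of correct predictions, final state)."""
    correct = 0
    for taken in trace:
        if (state >= 2) == taken:  # states 2-3 predict taken
            correct += 1
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct, state

inner_loop = [True] * 99 + [False]      # one full inner-loop execution

c1, state = run_two_bit(inner_loop, 3)  # first pass: 99 correct, 1 miss
c2, _ = run_two_bit(inner_loop, state)  # re-entry: again only the exit misses
```

A 1-bit scheme would mispredict twice per pass (the exit and the first iteration after re-entry), which is exactly what the second prediction bit avoids.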