Computer Organization
Andreas Moshovos
FALL 2013
Pipelining
Concept
Thus far we have seen two different ways of implementing processors. In the first, and slower-performing, each instruction took a single cycle to execute. In the second, multi-cycle implementation, each instruction took multiple, shorter cycles to execute. In either case, instructions execute in isolation in time, as the time diagram below shows:
However, both approaches fail to exploit the fact that executing an instruction is not a monolithic operation. Instead, instruction execution comprises several smaller actions that must be performed in sequence. Let's use a cake-making analogy. Let's say that after the valuable education you receive you decide to become a purveyor of fine birthday cakes. To start, you decide to focus on a single product: shark cakes. Here's one:
How do you make such a cake you ask? I’m glad you did. I recently made one. First you make the dough for the base and then bake it:
BTW, a nice recipe can be found here: http://www.marthastewart.com/337010/chocolate-cupcakes.
You may want to cut down on the sugar; it will still taste delicious.
Once you have the base, then comes the fun part: frosting. For that you need three frostings colored blue, white, and black:
Recipe here: http://www.marthastewart.com/318727/swiss-meringue-buttercream-for-cupcakes.
Then, in three steps, you apply first the blue, then the white, and then the black "frosting":
So, in total, making a cake requires five actions:
1. Make the dough
2. Bake the dough
3. Apply blue frosting
4. Apply white frosting
5. Apply black frosting
If you are making one cake, you would perform the actions sequentially:
You find that business has picked up, and you can no longer keep up with demand. What can you do? How about this? You get four of your friends on board and now the five of you work together:
The previous time diagram shows five cakes in the making. Each person does one thing. The first makes the dough, the second bakes it, the third applies blue frosting, and so on. While each cake still takes five steps to make, multiple cakes are being made in parallel, each at a different stage of preparation (you may be thinking that each person could instead make a whole cake in parallel; yes, that is possible, but defer that thought until we start talking about superscalar architectures). This is pipelining.
In our example, each stage takes exactly the same time. As a result, each cake takes almost exactly the same time to be made as before, and because we partially overlap the making of many cakes, we make more cakes over the same amount of time. That is, the cake throughput, defined as #cakes/time, is higher. In reality, however, the stages will require different amounts of time. In this case, the longest stage determines the rate at which the pipeline works, as shown in the following diagram:
In this example, baking a cake takes much longer than every other step. So, our pipeline now proceeds in stages, each taking the same amount of time as baking the cake. Except for baking, all other actions finish early in their respective stages and then wait. This is necessary since we assume that we only have one oven. While this way it takes longer to make each cake, the cake throughput is still higher than when making one cake at a time.
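To make this concrete, here is a small worked example; the stage times (60 minutes to bake, 15 minutes for each of the other four actions) are made-up numbers for illustration, not from the recipes above:

```latex
% Sequential: one cake occupies the kitchen from start to finish.
t_{\text{seq}} = 15 + 60 + 15 + 15 + 15 = 120 \text{ min}, \qquad
\text{throughput}_{\text{seq}} = \tfrac{1\ \text{cake}}{120\ \text{min}}

% Pipelined: every stage is stretched to the longest (baking) time,
% so each cake takes longer, but one cake finishes every 60 minutes.
t_{\text{pipe}} = 5 \times 60 = 300 \text{ min per cake}, \qquad
\text{throughput}_{\text{pipe}} = \tfrac{1\ \text{cake}}{60\ \text{min}}
```

Each cake's latency grows from 120 to 300 minutes, yet throughput doubles; this is exactly the latency-for-throughput trade described above.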
Pipelining Instruction Execution
Pipelining can be applied to many activities, including instruction execution. Instead of approaching instruction execution as a monolithic sequence of actions that have to happen one at a time, we can approach it as a series of steps, or stages, that can be partially overlapped across instructions. Applying pipelining to instruction execution will be a bit trickier than baking cakes, for the following reasons:
1. Instructions may produce results that other instructions may use. That is, instructions have data dependencies. No cake depends on another cake.
2. Instructions may change the PC and hence may change which instructions will execute next. These are control dependencies. When we were making cakes, we knew in advance which cakes we would make.
3. Instructions may produce results at different execution stages. We have to be careful to adhere to sequential semantics so that the program still runs correctly.
4. Instructions need resources to execute, such as ALUs, registers, memory, etc. The same is true for cakes, but in our example we made things simple by assuming we had enough of those.
5. While this is not possible in our architecture, instruction execution may have to be stopped due to exceptions. For example, in NIOS II this may be a division by zero, or an unaligned memory load or store.
For the time being, let's ignore these issues and assume that somehow we will be executing straight-line code (that is, PCnext = PC + 1 always). We will progressively refine our implementation to tackle all the aforementioned challenges.
To start building our pipeline, let's start simple by trying to pipeline just the ADD, SUB, and NAND instructions. These are structurally similar; the only thing that changes is what calculation the ALU performs.
Let’s recall how these are implemented in our multi-cycle implementation.
The following table shows the execution steps for these instructions as we have optimized them:

CYCLE | ADD, SUB, NAND
------+-----------------------------------------------
  1   | [IR] = Mem[ [PC] ];  [PC] = [PC] + 1
  2   | [R1] = RF[ [IR7..6] ];  [R2] = RF[ [IR5..4] ]
  3   | [ALUout] = [R1] op [R2];  update Z & N
  4   | RF[ [IR7..6] ] = [ALUout]
Recall that the aforementioned schedule relies on speculation and data dependencies to reduce the number of cycles needed to four. Without speculation, the schedule required five cycles:
CYCLE | ADD, SUB, NAND
------+-----------------------------------------------
  1   | [IR] = Mem[ [PC] ];  [PC] = [PC] + 1
  2   | Decode
  3   | [R1] = RF[ [IR7..6] ];  [R2] = RF[ [IR5..4] ]
  4   | [ALUout] = [R1] op [R2];  update Z & N
  5   | RF[ [IR7..6] ] = [ALUout]
We'll start with this 5-cycle schedule to build our pipeline since it is easier to understand. It is possible to use speculation to reduce the number of stages, but that would require corrective action for ORI. We may discuss this later.
The aforementioned
five cycle schedule for ADD, SUB and NAND can be distilled down to the
following five stages:
1. FETCH: Fetch instruction and update the PC (PC = PC + 1)
2. DECODE: Decode instruction
3. RF: Read registers, or update the PC for branches
4. EXEC: Execute, memory read, and memory write
5. WB: Write back results to the register file
The following table names these stages and details what actions take place:

CYCLE | STAGE  | ADD, SUB, NAND
------+--------+----------------------------------------------
  1   | FETCH  | [IR] = Mem[ [PC] ];  [PC] = [PC] + 1
  2   | DECODE | Decode
  3   | RF     | [R1] = RF[ [IR7..6] ];  [R2] = RF[ [IR5..4] ]
  4   | EXEC   | [ALUout] = [R1] op [R2];  update Z & N
  5   | WB     | RF[ [IR7..6] ] = [ALUout]
In time, we have the following diagram showing six instructions, I1 through I6, executing in a pipelined fashion:
In C1, I1 is being fetched. In C2, I1 is being decoded while, in parallel, I2 is being fetched. In C3, I1 is reading from the register file, I2 is being decoded, and I3 is being fetched, and so on. Let's focus on C5: in C5 we have five instructions executing in parallel, each at a different stage of its execution:
1. I1 is finishing writing its result back to the register file.
2. I2 is performing a calculation in the ALU.
3. I3 is reading registers from the register file.
4. I4 is being decoded.
5. I5 is being fetched from memory and updating the PC.
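The stage occupancy in this diagram follows a simple rule: instruction Ii occupies stage number (c - i + 1) during cycle Cc. Here is a minimal Python sketch, purely illustrative and not part of the hardware, that prints the occupancy table:

```python
# Print which stage each instruction occupies in each cycle of an ideal
# 5-stage pipeline: Ii enters FETCH in cycle Ci and advances one stage
# per cycle with no stalls.
STAGES = ["FETCH", "DECODE", "RF", "EXEC", "WB"]

def occupancy(num_instructions, num_cycles):
    for cycle in range(1, num_cycles + 1):
        cells = []
        for i in range(1, num_instructions + 1):
            stage = cycle - i              # 0-based stage index of Ii
            if 0 <= stage < len(STAGES):
                cells.append(f"I{i}:{STAGES[stage]}")
        print(f"C{cycle}:", "  ".join(cells))

occupancy(num_instructions=6, num_cycles=10)
# C5 prints: I1:WB  I2:EXEC  I3:RF  I4:DECODE  I5:FETCH
```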
Notice that at every stage of the pipeline there are different actions. What action needs to take place depends solely on the instruction executing at that stage. So each stage needs to operate independently, and at the end of each cycle each stage passes its work to the next stage down the pipeline. As a result, our pipelined implementation will look as follows:
Each stage has a combinational component that performs the necessary calculations, and in between the pipeline stages there are registers. These registers act as an output for one stage and as an input for the next. At the end of the clock cycle, the results of each stage are loaded into that stage's output register. These are then used as inputs for the next stage during the next cycle. The FETCH stage has no input register since it uses the PC as its only input.
Let us now see how we can implement this in hardware. Let's start with the multi-cycle datapath, and, to make things easier to follow, assume that we are executing the following instruction sequence:
I1: ADD K1 K1
I2: SUB K2 K2
I3: NAND K3 K3
I4: ADD K0 K0
I5: SUB K2 K1
FETCH stage:
We can start from our multi-cycle implementation and build the pipeline stage by stage. In stage 1, we need to fetch an instruction and update the PC. In the multi-cycle implementation this is done as shown below:
At the end of the cycle the new instruction will be loaded into the IR register, which can act as the pipeline stage register between the FETCH and DECODE stages. While in the multi-cycle implementation we could use the main ALU to do PC = PC + 1, this is not possible in the pipelined implementation since that ALU will be busy implementing the EXEC stage. This is a structural hazard. A structural hazard occurs when at least two different stages attempt to use the same resource at the same time. Since the stages operate concurrently, this is not allowed. One solution to such structural hazards is replication: if a resource cannot be shared, we can use a second copy to accommodate the other stage. In our case, we can provide a dedicated adder for incrementing the PC in the FETCH stage. Our datapath now looks as follows:
During the FETCH
stage, the PC can become PC + 1 and an instruction can be fetched into IR.
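In behavioral terms the stage reads the current state twice and loads two registers at the clock edge. Here is a minimal Python sketch of that behavior; the list-of-strings instruction memory is an assumption for illustration:

```python
# FETCH stage behavior: both next values are computed from the CURRENT
# PC, and both registers (PC and IR) load together at the clock edge,
# mimicking edge-triggered flip-flops.
mem = ["ADD K1 K1", "SUB K2 K2", "NAND K3 K3"]   # instruction memory
PC, IR = 0, None

def fetch_cycle():
    global PC, IR
    next_IR = mem[PC]            # memory read port
    next_PC = PC + 1             # dedicated adder, not the main ALU
    IR, PC = next_IR, next_PC    # clock edge: load both registers

for _ in range(3):
    fetch_cycle()
    print(f"PC={PC}  IR={IR}")
```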
DECODE stage:
Next comes the DECODE stage. Nothing happens in the datapath during this stage; the control takes a look at the instruction just fetched and prepares the next set of actions that will happen in the RF stage. Since a cycle does elapse and another instruction is being fetched, we need a register to pass the previous IR along to the RF stage while allowing a new IR to be loaded with the next instruction. Our datapath is now as follows:
RF Stage:
In stage 3, the RF stage, we can now proceed to read the two registers specified by the instruction. For this purpose, we use the instruction stored in the IR2 register, which is the one provided by the DECODE stage:
The existing R1 and R2 registers can act as the registers between the RF and EXEC stages. The EXEC stage also needs to know what operation to perform, and finally, we need to remember which register we will be writing to when we eventually arrive at the WB stage. Accordingly, we can add one more IR register as shown below. Alternatively, we can pass along all the necessary information encoded in a convenient form. For example, we can use an alternate encoding for the operation, etc.
EXECUTE stage:
In the fourth, EXECUTE stage, the calculation takes place in the ALU and we update the Z and N flags:
As before, this is not sufficient; we also need to pass some information along to the next stage, WB, so that it can write the result back to the appropriate register. For this purpose we can simply introduce another IR register as shown below:
WB stage:
In the fifth and final WB stage, the result previously calculated into ALUout gets written into the register file. The register name now has to come from the IR register that sits between EXEC and WB. That would be IR4 in our not-so-great naming scheme:
We have now seen how a single ADD, SUB, or NAND instruction flows through a simple 5-stage pipeline. However, as designed, our pipeline has problems. We discuss those next.
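Before turning to those problems, here is a minimal Python sketch that ties the walk-through together by simulating the stage registers (IR, IR2, IR3, IR4, R1, R2, ALUout) for straight-line ADD/SUB/NAND code. The tuple encoding of instructions is an assumption for readability, and there is no hazard handling yet:

```python
# Every stage computes from its INPUT register; all stage registers then
# load together at the clock edge (the "p = nxt" assignment).
OPS = {"ADD": lambda a, b: a + b,
       "SUB": lambda a, b: a - b,
       "NAND": lambda a, b: ~(a & b)}

def run(program, regs, cycles):
    p = {"IR": None, "IR2": None, "IR3": None, "IR4": None,
         "R1": 0, "R2": 0, "ALUout": 0, "PC": 0}
    for _ in range(cycles):
        nxt = dict(p)
        if p["IR4"]:                      # WB: destination comes from IR4
            _, rd, _ = p["IR4"]
            regs[rd] = p["ALUout"]
        if p["IR3"]:                      # EXEC: operate on R1 and R2
            op, _, _ = p["IR3"]
            nxt["ALUout"] = OPS[op](p["R1"], p["R2"])
        nxt["IR4"] = p["IR3"]
        if p["IR2"]:                      # RF: read the named registers
            _, rd, rs = p["IR2"]
            nxt["R1"], nxt["R2"] = regs[rd], regs[rs]
        nxt["IR3"] = p["IR2"]
        nxt["IR2"] = p["IR"]              # DECODE: just pass it along
        nxt["IR"] = program[p["PC"]] if p["PC"] < len(program) else None
        nxt["PC"] = p["PC"] + 1           # FETCH: dedicated adder
        p = nxt                           # the clock edge
    return regs

regs = {"K0": 0, "K1": 1, "K2": 2, "K3": 3}
print(run([("ADD", "K1", "K1"), ("SUB", "K2", "K2")], regs, cycles=9))
# K1 = 1 + 1 = 2 and K2 = 2 - 2 = 0; these two instructions are
# independent, so no bypassing is needed -- hazards come next.
```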
Data Hazards:
Let us consider the
following instruction sequence:
I1: ADD K2 K1
I2: ADD K0 K0
I3: ADD K3 K2
Instruction I1 writes to register K2 while instruction I3 reads that value. Let's see how these will happen in time and whether they will execute correctly on our pipeline:
At cycle C5, I1 will be writing K2 while I3 will be reading K2. If we use edge-triggered flip-flops for the registers, the write of I1 will happen at the end of the clock cycle, while the read will happen during the clock cycle. As a result, I3 will read the previous value of K2 and not the one I1 is writing. In this case execution will be incorrect.
There are two possible solutions to this problem. The first is to delay I3 by one cycle, and the second is to bypass, or forward, the value from ALUout into R2.
Data Hazards -- Bubbles:
Let's look at the first option. The goal here is to delay the execution of I3 by one cycle. The decision can be taken during the RF stage for I3. At this stage, the control can compare the destination register of the instruction in the WB stage with the source registers being read during the RF stage. If they are the same, then at the end of the cycle, all stages from RF and earlier, that is, RF, DECODE, and FETCH, will be reloaded with their current input values. That is, their stage registers will be loaded with the same values as before through a feedback connection. This is called a pipeline bubble. In time, execution now proceeds as follows:
The control logic of the RF stage will need to detect whether a bubble is necessary. This can be done by comparing the register names that it is currently using (found in IR2) with the one written by the WB stage (found in IR4). If a data hazard is detected, then the RF stage will have to notify both the DECODE and the FETCH stages. In general, bubbles affect all preceding stages. If our pipeline had 10 stages and a bubble had to be inserted at stage 5, then stage 5 and all earlier stages would have to repeat the same action next cycle. That is, all those stages would have to reload their input stage registers with their current values and prevent loading the value from their preceding stage. As we know, the FETCH stage has no input stage register. What needs to be done for it? We can either inhibit another fetch from memory, since the one we just did is good enough, or we can inhibit only the update to the PC. Either solution is fine; the first reduces energy but may run into problems with self-modifying code. The second solution, where FETCH repeats the fetch, will work no matter what but uses more energy because it repeats the memory read.
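A minimal sketch of the check follows; register names appear as strings, whereas the real control compares bit fields in IR2 and IR4, and the freeze itself is the feedback path on the stage registers:

```python
def needs_bubble(rf_srcs, wb_dst):
    """True when the instruction reading the register file (sources taken
    from IR2) uses the register that the WB-stage instruction
    (destination taken from IR4) is about to write."""
    return wb_dst is not None and wb_dst in rf_srcs

# I1: ADD K2 K1 is in WB while I3: ADD K3 K2 is in RF, so FETCH, DECODE,
# and RF must all repeat their current work for one cycle.
print(needs_bubble(rf_srcs=("K3", "K2"), wb_dst="K2"))   # True
```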
Data Hazards -- Bypass:
As we said, during the RF stage I3 can detect that it is reading a value that will be written into K2 at the end of the clock cycle. However, that value is currently available at the output of ALUout (the output register of the EXEC stage). Note that all that is needed is for the right value to get loaded, at the end of the cycle, into register R2 (the output register of the RF stage). Accordingly, we can bypass, through a multiplexer, the ALUout value into R2 instead of the value read from the register file. The modified datapath is as follows:
R1 and R2 are now driven by two multiplexers. Each of those multiplexers can either pass the value read from the register file or bypass the value currently in ALUout. The diagram does not show the control logic needed to make the selection. In our case, this logic would have to compare the register identifiers fed to the register file (found in IR2) with the destination register of the instruction in the WB stage (found in IR4). If they are the same, the value can be bypassed.
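Here is a minimal sketch of one read port's bypass mux; the function is an illustrative stand-in for the mux select logic, not the actual control equations:

```python
def read_port(read_reg, rf_value, wb_dst, aluout):
    """Bypass mux for one register-file read port: if the register being
    read (named in IR2) matches the WB destination (named in IR4), take
    ALUout; otherwise take the (stale) register-file output."""
    return aluout if read_reg == wb_dst else rf_value

# I1: ADD K2 K1 sits in WB with ALUout = 7; I3 reads K2 this cycle and
# must see 7, not the old register-file contents.
print(read_port("K2", rf_value=3, wb_dst="K2", aluout=7))   # 7
```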
In our discussion we have assumed that it is not possible to bypass the value directly through the register file. The reality of custom-built register files is that the write can happen in the first half of the clock cycle and the read can then proceed in the second half. If you pick up most books that cover pipelining, you will see this as the base assumption. In such a design there is no need for an explicit bypass, as this is done internally in the register file, assuming that the registers are latches and not edge-triggered. With edge-triggered register elements such as the ones we assumed, explicit forwarding paths are necessary.
Now let’s consider
the following instruction sequence:
I1: ADD K2 K1
I2: ADD K3 K2
I2 must read the value for K2 that I1 is producing. In time this will appear as follows:
From this time diagram it would appear that we have no option other than to delay I2, since the value it needs to read in stage 3 at cycle C4 does not get written back until stage 5 of I1, which occurs a cycle later in C5. However, observe the following:
1. The value for K2 has been calculated by I1 in C4 and has been stored in ALUout.
2. The value for K2 is only needed by I2 in cycle C5, where the actual calculation takes place.
So, in reality the data dependency between I1 and I2 appears as follows:
Now the arrow moves forward in time, which is to say that the value gets produced before it needs to be consumed. To exploit this, however, we need to further extend our datapath to allow forwarding from WB to EXEC (before, we forwarded from WB to RF):
The control logic of the EXEC stage will have to compare the two source registers (found in the IR3 register) with the destination register of the instruction in the WB stage, as found in IR4. If the names are the same, the value can be bypassed through the ALU1 and/or ALU2 multiplexers.
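A minimal sketch of that comparison, steering each ALU input mux independently (again with illustrative names rather than the actual control equations):

```python
def alu_inputs(src1, src2, r1, r2, wb_dst, wb_value):
    """WB-to-EXEC forwarding: each source register (named in IR3) is
    checked against the WB destination (named in IR4); on a match the
    corresponding ALU input takes the WB value instead of R1/R2."""
    a = wb_value if src1 == wb_dst else r1   # ALU1 mux
    b = wb_value if src2 == wb_dst else r2   # ALU2 mux
    return a, b

# I1: ADD K2 K1 is in WB producing K2 = 7; I2: ADD K3 K2 is in EXEC.
print(alu_inputs("K3", "K2", r1=4, r2=0, wb_dst="K2", wb_value=7))  # (4, 7)
```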
This concludes the
pipelined design for ADD, SUB and NAND. We next proceed to consider the other
instructions.
ORI and SHIFT:
ORI and SHIFT are structurally similar to ADD. The difference is that ORI does not name its register in the instruction (it implicitly uses K1), while the second operand for both ORI and SHIFT comes from the instruction as an Imm5 or Imm3. Our datapath is already capable of performing those operations. The bypass logic will have to be modified to take the instruction opcode into consideration to work correctly for these instructions. For ORI, it will have to use the name K1 when comparing for bypassing with the EXEC and WB stages, and use K1 as the destination register for those stages as well. For SHIFT, it has to use only the R1 field from the instruction, since SHIFT does not use a second register operand.
STORE:
The STORE instruction is identical to ADD up to and including the RF stage. In the EXEC stage, however, it needs to write to memory. Unfortunately, memory is already being used by the FETCH stage of every instruction. So we have a structural hazard, since all instructions read from memory to fetch during the FETCH stage and stores also try to write to memory at their fourth stage, the EXEC stage. How can we overcome this problem? There are two solutions:
1. Bubble: Give priority to the store over the instruction currently at FETCH.
2. Replicate: Use a second memory for data accesses.
Let’s look at these
options more closely.
Allowing Memory Writes with a single memory:
If we opt for the bubble approach, we need to prevent the FETCH stage from accessing memory while a store is at its EXEC stage. In time this looks as follows:
This solution requires no modification to the datapath. The control of the EXEC stage will need to detect that it is executing a STORE instruction and set the MemWrite and AddrSelect signals accordingly. The control of the FETCH stage will need to inhibit writing to the PC and disable the MemRead signal. So the two stages will have to coordinate, and moreover, the decision is best taken by the end of the RF stage instead. That is, when a STORE enters the RF stage, it can signal that in the next cycle FETCH should be deactivated.
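A minimal sketch of that coordination; the opcode-string check stands in for the RF stage's decode of its own instruction:

```python
def fetch_enabled_next_cycle(rf_opcode):
    """Single-memory arbitration: when a STORE is in the RF stage it
    claims the memory port for its upcoming EXEC cycle, so FETCH must
    inhibit MemRead and the PC write for that cycle."""
    return rf_opcode != "STORE"

for op in ["ADD", "STORE", "SUB"]:
    print(f"{op} in RF -> fetch allowed next cycle: {fetch_enabled_next_cycle(op)}")
```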
Replicating Memory:
The other solution is to use one memory for instructions and a separate one for data. This is a Harvard-style architecture, familiar to us from the single-cycle implementation. This is essentially the architecture of choice today. However, instead of having separate instruction and data memories, modern processors use a unified memory space for both instructions and data. Pipelining is made possible by using separate caches for instructions and data. Those caches feed from the same memory, but they can be accessed independently. As long as the accesses hit, the pipeline can proceed unabated. Here's the modified datapath:
The figure also contains the changes for the LOAD instruction, which are explained next.
LOAD:
For LOAD we need to access memory at the EXEC stage, something we already support as part of the discussion for STORE. The result of the memory read gets stored in the MDR register. At the WB stage we then need to write it to the register file. The diagram below shows how this can be done. We need to introduce additional bypass options to support bypassing MDR from WB to RF and from WB to EXEC:
Instead of having loads and ALU operations write to different registers at the EXEC stage, we can unify ALUout and MDR. The modified datapath then needs no extra bypass options; moreover, we no longer need a multiplexer to select the value that will be written into the register file:
Branches:
All instructions we have seen thus far update the PC to be PC + 1. Branches, however, may further update the PC. Let's first look at where branches can do that update. The information needed comprises the following:
1. Which branch this instruction is. This is known after DECODE.
2. The value of the flags. This may be produced as late as the end of EXEC of the immediately preceding instruction.
Accordingly the
simplest approach is to have branches update the PC during the EXEC stage. The
datapath is as follows:
The first input to the ALU can now also pass the value of PC3. PC1, PC2, and PC3 are pipeline registers that carry the value PC + 1 for the corresponding instruction. When the branch gets fetched, it also updates the PC to PC + 1. That value gets written back into the PC and carried with the branch into PC1. As the branch moves from stage to stage, its PC value flows with it, so that the correct calculation can take place in the EXEC stage as shown. When a branch executes in the EXEC stage and is taken, that is, it decides to update the PC, all instructions that follow it must be squashed. That is, they must not be allowed to make any changes to architectural state, which includes the registers, memory, flags, and the PC. The instructions that must be squashed are those in the FETCH, DECODE, and RF stages. In time, execution flows like this:
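A minimal sketch of the squash: the wrong-path instructions sitting in the FETCH, DECODE, and RF stages are turned into NOPs, which change no architectural state, and the PC is redirected. The dictionary of stage registers is an illustrative stand-in for the hardware:

```python
def resolve_branch(pipe, taken, target, fallthrough_pc):
    """Called at the end of the branch's EXEC stage. On a taken branch,
    squash the three younger instructions and redirect fetch."""
    if taken:
        pipe["fetching"] = "NOP"   # instruction being read from memory
        pipe["IR"] = "NOP"         # instruction in DECODE
        pipe["IR2"] = "NOP"        # instruction in RF
        return target              # PC gets the branch target
    return fallthrough_pc          # speculation was right; carry on

pipe = {"fetching": "I4", "IR": "I3", "IR2": "I2"}
print(resolve_branch(pipe, taken=True, target=42, fallthrough_pc=11), pipe)
```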
Control Flow Hazards:
Every branch instruction creates a control flow dependency for the instructions that follow it. That is, when we encounter a branch we do not know which instruction will execute next until we execute the branch to completion. In the pipeline this manifests as a control flow hazard, or simply a control hazard. The simplest approach to handling control hazards would be to stall. That is, whenever a branch is fetched we would stall the pipeline until the branch executes, or resolves. Unfortunately, we do not know whether an instruction is a branch until the end of DECODE. Accordingly, the stall approach would stall the pipeline for a cycle after fetching every instruction, just to make sure that the instruction is not a branch. The pipeline we have designed thus far does not use this approach. Instead it uses control flow speculation. In control flow speculation the pipeline guesses the address of the next instruction to fetch and proceeds to fetch it speculatively. Eventually, when the branch executes, the speculation is validated. If it was correct, no further action is necessary, and we gained performance since we speculatively executed the right instructions without waiting to make sure that we should. If, however, the speculation is incorrect, we have a misspeculation and we must:
(1) Undo all speculative changes made by the incorrectly executed instructions.
(2) Redirect fetch to the right instruction.
Let’s see now how
these concepts apply to our pipeline.
To do control flow speculation, or branch speculation, we first need a branch predictor. A branch predictor does two things:
1. Speculate whether the branch is taken or not. This is a binary decision.
2. Speculate the exact address of the next instruction.
Our pipeline uses a simple, static branch predictor that always speculates not taken. So, it always answers Q1 above with "not taken." For Q2 it does not need to speculate, since we already have the PC and can calculate the fall-through PC as PC + 1. This is a static branch predictor since its decision is not influenced by program execution: no matter what, it is always the same.
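A minimal sketch of this predictor and its validation; it is trivially simple, which is exactly the point of a static scheme:

```python
def predict(pc):
    """Static always-not-taken: the predicted next PC is always the
    fall-through PC + 1, regardless of program behavior."""
    return {"taken": False, "next_pc": pc + 1}

def misspeculated(actual_taken):
    """Validated at the end of the branch's EXEC stage: since we always
    predicted not-taken, any taken branch is a misspeculation."""
    return actual_taken

print(predict(10))           # {'taken': False, 'next_pc': 11}
print(misspeculated(True))   # True -> undo speculative changes, redirect
```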
In the event of a misspeculation we need to first undo any speculative changes and then redirect fetch. Let's look at where changes to architectural state happen:
1. All instructions update the PC with PC + 1 at FETCH.
2. Some instructions update the flags in EXEC.
3. Stores update memory in EXEC.
4. ADD, SUB, NAND, ORI, SHIFT, and LOAD update the registers at WB.
Branches signal a misspeculation at the end of their EXEC stage. Instructions that follow the branch are at earlier stages and have had no chance of updating any architectural state except for the PC. As we have seen, the branch carries its PC with it and calculates the right target at EXEC. Accordingly, the PC gets correctly updated on a misspeculation. As a result, our pipelined implementation correctly handles branch misspeculations.
Summary of what happens when:
The following table
summarizes what each instruction does at each stage.
CYCLE 1 -- FETCH (all instruction classes):
    [IR] = Mem[ [PC] ];  [PC] = [PC] + 1

CYCLE 2 -- DECODE (all instruction classes):
    Decode

CYCLE 3 -- RF:
    ADD, SUB, NAND:  [R1] = RF[ [IR7..6] ];  [R2] = RF[ [IR5..4] ]
    SHIFT:           [R1] = RF[ [IR7..6] ]
    ORI:             [R1] = RF[ 1 ]
    LOAD:            [R2] = RF[ [IR5..4] ]
    STORE:           [R1] = RF[ [IR7..6] ];  [R2] = RF[ [IR5..4] ]
    BPZ, BZ, BNZ:    nothing

CYCLE 4 -- EXECUTE:
    ADD, SUB, NAND:  [WBin] = [R1] op [R2];  update Z & N
    SHIFT:           [WBin] = [R1] shift Imm3
    ORI:             [WBin] = [R1] OR Imm5
    LOAD:            [WBin] = Mem[ [R2] ]
    STORE:           Mem[ [R2] ] = [R1]
    BPZ:             if (N') PC = PC + SE(Imm4)
    BZ:              if (Z) PC = PC + SE(Imm4)
    BNZ:             if (Z') PC = PC + SE(Imm4)

CYCLE 5 -- WRITEBACK:
    ADD, SUB, NAND:  RF[ [IR7..6] ] = [WBin]
    SHIFT:           RF[ [IR7..6] ] = [WBin]
    ORI:             RF[ 1 ] = [WBin]
    LOAD:            RF[ [IR7..6] ] = [WBin]
    STORE, BPZ, BZ, BNZ:  nothing
Adhering to Sequential Semantics:
Early on in the lectures, we discussed that conventional programs use sequential semantics. That is, instructions are supposed to execute one after the other, in order, and without any overlap. Pipelining, however, allows the partial overlap of instructions. Does our pipelined implementation still adhere to sequential semantics? To reason about this, let's try to better understand what sequential semantics really calls for. An equivalent definition would be that at any given point in time, if one stops and inspects the architectural state, it should be such that:
1. All instructions in the original sequential execution order up to and including some instruction A have executed. That is, they have made all their changes to the architectural state.
2. No instruction after A has made any changes to the architectural state.
For example, assume a sequence of instructions A, B, C, D, and so on, as they would have executed sequentially. If we stop the processor at any point and inspect its architectural state (that is, registers, PC, memory, and flags), the above two conditions must hold. That is, if C has executed, then A and B must have executed too, and no changes from D, or E, and so on should be visible. Equivalently, if we take an execution timeline in the pipeline, we should always be able to draw a vertical line such that all instructions before it have made all their changes to the architectural state and no instruction after it has made any changes:
Let us focus on point A in time. At this point, the first instruction has just finished the EXEC stage. If this is an ADD, it could have updated the flags. Say it is an ADD K2 K1. If the machine is stopped at this point, an outside observer could see that the flags have been updated but the register file entry for K2 has not. As a result, this implementation would seem to violate sequential semantics. However, recall that the only way someone could observe the register values would be by executing an instruction that reads them. So, the observer would try to read K2 through an instruction that possibly immediately follows our ADD. Our pipeline has bypass paths that would correctly bypass the value from WB to EXEC or RF; accordingly, to the outside observer it would appear as if the register had been updated at the same time as the flags. Hence our pipelined implementation would appear to adhere to sequential execution semantics.
Similar reasoning applies to STOREs, which update memory in the EXEC stage. The only instructions that can observe the change are LOADs, which also access memory at the EXEC stage. Moreover, changes to the registers are visible to other instructions immediately after the EXEC stage. So, to an outside observer, all architectural state changes appear to happen at the end of the EXEC stage or, equivalently, the beginning of the WB stage. The only exception is the PC, which gets updated at the FETCH stage. If it were possible to read the PC value, one could detect those changes. There is no instruction in our architecture that reads the PC as data; accordingly, it is not possible to directly observe the value of the PC. However, even if there were, recall that our instructions carry their PC value along with them in registers PC1, PC2, and PC3 as they pass through the pipeline stages. Accordingly, the PC could be treated like any other register and bypassed, similarly to what we did for the K registers at the EXEC and WB stages.
For registers, flags, and memory, all updates become visible by the end of the EXEC stage. For the PC, we can introduce bypassing to maintain the illusion of sequential execution as needed.