Advanced Computer Architecture
HW #3 - Due Friday,
Nov. 4th 2005 at 23:59 under the door of EA310
Andreas Moshovos
1.
Consider
a MIPS-like architecture that has delayed
loads That is, an instruction immediately following a load is guaranteed
not to see the loaded value. In this architecture the following two code
fragments are equivalent:
load r1, 10(r2) add
r3, r1, r4
add
r3, r1, r4 load r1,
10(r2)
sub r5, r1, 1 noop
sub
r5, r1, 1
In either case, the “add” sees the value r1 had before the load and the “sub”
sees the value loaded from memory. The “noop” (no operation) is necessary.
Delayed loads “simplify” the pipeline control as it is not necessary for the
hardware to check for load-use hazards. Enforcement of load-use hazards is left
to the compiler. Can you modify conventional out-of-order execution to
correctly execute programs for this architecture? Assume a register renaming
table such as that we discussed in class and describe the necessary extensions.
Assume that no control flow (a.k.a. branch) prediction is used. Of course,
stalling execution after a load is one solution. However, it may not perform
very well. Come up with a solution that will allow out-of-order execution for
all instructions following a load. Assume that it is illegal for an instruction
in a delay slot to have the same destination register as the corresponding
load. Briefly explain what changes need to be made to the RAT and the scheduler
if any. Be concise and describe your changes at a high level.
2.
Consider
adding indirect addressing to the register file. That is consider instructions
of the following form:
op Rdest,
(Rs1), Rs2
The
semantics are:
A
= read register Rs1 from register file
B
= read register A from register file
C
= read register Rs2 from register file
Rdest
= B op C
Argue
(one reason is good enough) why this is a good idea. Then argue why this might
no be such a good idea by commenting on how these instructions impact register
renaming. NOTE: Similar issues exist in stack-based ISAs.
3.
Is
it possible that a dynamically scheduled, superscalar processor with imperfect
branch prediction to perform better than the same processor but with perfect
branch prediction? Explain why or why not. Needed be provide a
pseudo-code/block-level example.
4.
Write
a code sequence of two MIPS instructions for which register renaming will
increase the opportunities for out-of-order execution. Also , write a sequence
of two MIPS instructions for which register renaming will not increase the
opportunities for out-of-order execution.
5.
Current
Instruction Set Architectures use an indirect specification of
inter-instruction communication. In this specification communication happens
through registers. The specification is indirect as producing instructions do
not directly name their consumers or vice versa. Of course, this is not the
only possible way of specifying inter-instruction communication. An alternative
is to have the consuming instructions name the producer. For example, an
instruction may be of the following format:
add (10), (2)
This is interpreted as take the result of
the instruction at distance 10 (distance is measured in instructions during
execution) before this one, the result of the instruction at distance 2 and add
them up. Notice that no target register is specified. Why such an ISA might be a good idea. What
are the implications for register renaming and code size? Will we be able to do
everything that is possible with registers using such an ISA? Can you think of
scenario where this scheme creates complications?
6.
In
dynamically-scheduled, superscalar processors producers broadcast their result
tag to consumer so that the consumers can determine when all their source
operands are available. In modern processors the result tag broadcast happens
the cycle the result is actually produced. This way, the result can be bypassed
the next cycle directly to the consumers. For operations of fixed latency the
aforementioned process is straightforward: if we know an add takes two cycle to
produce its result we have it broadcast its result tag at the beginning of the
second cycle of its execution. Loads combined with caches present a challenge.
With caches a load requires a variable number of cycles to produce its result
depending on where it finds the data. Accordingly, we cannot safely broadcast
the result tag at the beginning of the last cycle since we do not know which
the last cycle is. Do a quick online
survey to determine what solutions are currently used to overcome this
limitation. Summarize your findings using less than 500 words (with 100 being
more preferable).