Advanced Computer Architecture
HW #1 - Due Tuesday, Nov. 3, 2009 in class
Instructor: Andreas Moshovos
There are two parts in this assignment
NOTE: This assignment refers to several source and executables that you need to complete it. They are available through the web site. The gcc port for simplescalar will be made available through the web page. If you untar it, it will install on /usr/local under CYGWIN)
In this
assignment you will be exposed to using and modifying Simplescalar’s functional
simulator
The
assignment is almost identical to the one offered last year.
In this assignment you will modify the functional simulator from the Simplescalar simulation suite. You are asked to collect certain statistics about program execution. There are other, more detailed simulators in the Simplescalar suite. We will start with the functional simulation not only because it’s the simplest, but also because it forms the building-block for all other simulators. A functional simulator simulates the function of the program. It executes one instruction after the other with no notion of time. Instructions execute one after the other, as assumed by sequential execution semantics. In contrast, a timing simulator is in addition concerned with time, that is how long it takes to complete each instruction. The core of the functional simulator is basically a loop with a huge switch statement where every case corresponds to a different instruction type. The register file is simulated via an array and memory is simulated via another array (actually implemented as a two-level array: first level is indexed by page number, second level is dynamically allocated only for pages actually touched by the simulated program). System calls are forwarded to the actual host OS (more info in class).
While Simplescalar is a fairly elaborate and lengthy piece of software, still there is no need to not panic. A detailed, step-by-step description of all the necessary modifications for collecting the first set of statistics is given in this homework. This is why there are so many pages.
Equipment Required: You will need access to a Cygwin machine to complete this assignment. This assignment requires the use of benchmarks that are compiled for a little-endian machine such as an x86 one. You may be able to complete this assignment on other little-endian architectures. However, this may require various changes in the Simplescalar code (which of course, you will have to figure out on your own).
Structure: First, this assignment shows you how to modify the simulator using an example. Then, you are asked to develop code for collecting a different set of statistics. You will have to figure out how to do the latter on your own. Please plan ahead. This is not something you want to attempt the night before the due date. Minor details not necessarily related to the core of this assignment may take more time than expected.
Here’s the statistics we are interested in. A detailed description of all required modifications follows:
Assume an ideal 5-stage pipeline (similar to the DLX pipeline in H&P ch. 3). In this pipeline accessing memory for instructions or data takes a single cycle. Moreover, an instruction access and a data access can proceed at the same cycle. For a very good reason we need to measure how often a load use hazard occurs in typical programs. What is a load hazard? It occurs when an instruction that immediately follows a load has to use the value read by the load. Recall that in a typical 5-stage pipeline memory accesses occur in the MEM stage (4th stage). In the same pipeline, registers are used in the beginning of the EX stage (3rd stage). An instruction following a load has to wait for a cycle before the load value is read. The following figure illustrates a load use hazard:
Cycle |
N |
N+1 |
N+2 |
N+3 |
N+4 |
N+5 |
ld $1, 10($2) |
FETCH |
DECODE |
EXEC |
MEM |
WB |
|
add $3, $1, 1 |
|
FETCH |
DECODE |
STALL |
EXEC |
MEM |
Here’s the main idea on how to collect the desired statistic. Every time an instruction is executed we can simply check whether one of its inputs is the target of an immediately preceding load. If so, we simply count one more load use hazard. To get this done, we will modify the functional simulator to collect the source register numbers for all instructions and to also identify loads and their target registers.
Here’s a step-by-step description on how this can be done (there are other ways this is one that works):
a) Create a directory where you will copy the simulator files. For clarity, let’s name this directory ~/mysim. The commands to do so are:
cd ~
mkdir mysim
b) Copy the simulator files using:
cd ~/mysim
bunzip2 –c simplesim-aca.tar.bz | tar xvf -
c) Before you attempt any changes, first make sure that your simulator copy compiles and works correctly without any modifications. Use the following command to compile the functional simulator:
cd simplesim-aca
make config-pisa
make sim-safe
The first command configures
simplescalar to use the
There many simulators in the Simplescalar suite. We will be working with “sim-safe” which is a functional simulator. “Sim-fast” is pretty much the same simulator, however, it does not include many of the sanity checks included in sim-safe. For example, sim-fast would never complaint if your code jumps into illegal memory addresses.
We can use sim-safe to measure load use hazards without having to write the complete code for a pipeline. This is a standard trick in simulation. In general, simulating is different than building hardware. If you are interested in studying a phenomenon you may be able to do so (and save a lot of time and effort) by thinking whether it is possible to collect the necessary statistics without necessarily simulating how the hardware would do something but instead simulating the resulting behavior. For example, in a typical five stage pipeline writeback would occur in the 5th stage. In hardware, each instruction would go through every stage with the appropriate information stored in a set of inter-stage latches. In simulation, this behavior can be simulated simply by delaying the writeback for five “cycles” (i.e., we do not need to have code for the latches and copy values every “cycle”). Also note that Simplescalar performs execution-driven simulation. That is, it actually simulates the execution of each instruction. This is in contrast to trace-driven simulation where a trace of a previous execution is used to retrace the path followed when the program executed in the past.
If you get “warnings” while compiling “sim-safe” you can safely ignore them. If you get errors, then something went terribly wrong (this is not supposed to happen).
d) Check that the compiled simulator works. You can invoke it as follows:
./sim-safe [arguments] executable [executable arguments]
A precompiled, short running program is provided through the website for your convenience. Run it as follows:
./sim-safe msg
You should see a nice message along with numerous statistics about the program. Try running count also.
e) Now it’s time to start making modifications. We will be working with “sim-safe.c” which is the core implementation of the sim-safe simulator. Much of the code in “sim-safe.c” can be ignored for completing this assignment. Because Simplescalar was written to facilitate multiple simulators there is a set of functions that have to be provided by any simulator implementation even though many of them are not always used.
The core of instruction execution occurs in function “sim_main()”. Go ahead and edit “sim-safe.c” with your favorite editor and locate “sim_main()”. After a couple of statements in “sim_main()” you will find an infinite loop starting with “while (TRUE)” statement. Each iteration of this loop corresponds to the execution of one instruction. The first statement in this loop is (ignore all code that is within an #ifdef TARGET_ALPHA…#endif):
regs.regs_R[MD_REF_ZERO] = 0;
This statement sets register $0 to 0. In the MIPS instruction set architecture (Simplescalar uses a derivative of the MIPS ISA) register $0 is always 0. The table “regs.regs_R[]” implements the general purpose register file. This structure has another array “regs.regs_F[]” which implements the floating point register file. Look in “regs.h” and “regs.c” if you wish to learn more (but, leave this for later and this is not necessary – actually, it is good practice to abstract away parts of the code that are not of immediate interest. This is the only hope one has to work with existing, lengthy and complex code).
The next statement reads an instruction from memory:
MD_FETCH_INST (inst, mem, regs.regs_PC);
This macro is defined in
“machine.h” which contains many,
Following this statement is one that keeps a tally of executed instructions followed with some bookeeping and instruction opcode determination statements:
MD_SET_OPCODE (op, inst);
This command, extracts the opcode field out of the raw instruction. This field specific what the instruction does (e.g., addition, load, etc.). It is defined in “machine.h”.
Instruction execution takes place in the switch statement that follows. You will see that this switch statement uses the instruction’s opcode (switch (op)) to determine what needs to be done. If at first the code within the switch statement looks weird don’t worry. You are absolutely right! Just keep reading on and soon all will make sense. The idea is that the switch statement oddly enough contains one “case” statement per instruction opcode. Within each of this “case” statements are command that simulate instruction execution.
In the switch statement you will find a couple of “#define” statements defining various macros. The one to pay attention to is the:
#define DEFINST(OP, MSK, NAME, OPFORM, RES, FLAGS, O1, O2, I1, I2, I3)
This is defined to be “SYM_CAT(OP,_IMPL)”. “SYM_CAT(X,Y)” is a macro that expands to XY which is then also treated as a macro that is expanded by the preprocessor (I think this is gcc specific). As a result, whenever DEFINST is called, it expands into OP_IMPL where “OP” is the first argument passed into DEFINST. For example, the statement “DEFINST(ADD,….)” expands into “ADD_IMPL” which is also treated as macro and expanded. After these definitions, there is an “#include “machine.def”” statement. View this file and you will see that it contains a macro definition and a macro call per instruction opcode. The first macro is of the form “X_IMPL” where “X” is an opcode such as ADD or SUB. The second macro, is a call to “DEFINST(X,…)” Returning back to “sim_main()” and the switch statement, we can now understand what happens when this code passes through the preprocessor. As a result of the definition of “DEFINST” in sim-safe.c, every call to DEFINST from machine.def expands into the following code:
case OP: OP_IMPL; break;
Next the preprocessor replaces each of these OP_IMPL, with the appropriate statements as defined in the “machine.def”. As a result, after preprocessing the “switch (op)” statement expands into a huge switch statement with a case statement for every possible opcode. Within each case, the code provided in “machine.def” is included. You can see the expanded code (doesn’t look pretty) via the following command:
gcc –E sim-safe.c –o sim-safe.pcc.
This should produce a
“sim-safe.pcc” file which you may view. Search for
“sim_main()” in that file and then take a look at the expanded switch
statement. On another note, if you want to see what gcc does to compile your
programs use the “-v” option. In fact, gcc is just a front-end that calls a
variety of other programs.
The “OP_IMPL” macros utilize other macros to access machine state including the register file and the memory. These have to be provided in the simulator. “Sim-safe.c” includes appropriate definitions for our purposes. For example, to read a register, OP_IMPL uses the “GPR(N)” macro, while it uses the “SET_GPR(N)” to write to a register. “GPR(N)” is defined simply as “regs.regs_R[N]” in “sim-safe.c”. That is, to read register N we just access the N element of the array regs.regs_R[] which implements the register file. There are other macros, that deal with floating point registers and memory which you can find in “sim-safe.c” starting at line 239.
A more detailed description of
how to use the “DEFINST” macro is given at the beginning of “machine.def”. For
our purposes, suffices to note that “O1” and “O2” are the target register
numbers, while “I1”, “I2” and “I3” are the source register numbers for each
instruction. There are 2 targets and 3 sources as in
After the switch statement you will encounter a couple of “if”statements that either detect faults or report statistics. There is also one that checks whether the instruction was a load or a store (“if (MD_OP_FLAGS(op) & F_MEM)”). This is also used for keeping statistics and has nothing to do with execution. Throughout the simulator you may notice calls to functions named “dlite_....”. Simplescalar implements a simple debugger. Simply ignore these for the time being.
After all these, there are two statements that update the PC and a statement that checks whether the maximum count of instructions has been reached.
/* go to the next instruction */
regs.regs_PC = regs.regs_NPC;
regs.regs_NPC += sizeof(md_inst_t);
f) Now that we have a high-level understanding of how the simulator works, it’s time to start modifying it. Recall that we want to measure how often the target register of a load is used by the next instruction. Here’s how we are going to measure this. We are going to keep a record per register indicating when was the last “time” the register was the targer of a load. For example, if register $1 is set by load at “time” (instruction count really) 1000, then we will mark that this value will be ready at “time” 1002. A subsequent instruction can then check whether any of its inputs are not yet available by inspecting the corresponding record.
g) Let’s first declare the table with the ready times per register. Go at the beginning of sim-safe.c just after the initial #include statements and add the following line:
static counter_t reg_ready[MD_TOTAL_REGS];
MD_TOTAL_REGS is defined in “machine.h” to be the maximum number of registers that are available (includes the general purpose, the floating point, hi, lo and others).
h) After the above statement, declare a variable that we will use to count the load use hazards:
static counter_t sim_num_lduh = 0;
i) Simplescalar provides an elaborate package for collecting statistics. It keeps an internal database of counters and other statistics-related data structures. It also prints out this database at the end of the simulation. We need to register our counter with this database so that it’s printed out at the end. To do so go to the “sim_regs_stats()” function include the following statements:
stat_reg_counter(sdb, “sim_num_lduh”, “total number of load use hazards”, &sim_num_lduh, sim_num_lduh, NULL);
stat_reg_formula(sdb, “sim_load_use_ratio”, “load use fraction”, “sim_num_lduh / sim_num_insn”, NULL);
The second statement registers a formula that reports the load use hazard count as a fraction of all instructions executed. The statistics database is implemented in the file “stats.c”.
j) Now, go back to “sim_main()” and declare the following local variables:
int r_out[2], r_in[3];
Recall the DEFINST macro? Its arguments include O1, O2, I1, I2, and I3 which are the target and source register numbers. “machine.def” defines these appropriately for every possible opcode. We will use the previously defined local variables to store these register numbers for our purposes.
k) Modify the “DEFINST” macro to extract the source and target register numbers. To do so first include the file “decode.def (“#include “decode.def”) immediately after all other “#include” directives at the beginning of “sim-safe.c”. If you browse through “decode.def” you will see a number of “D…(N)” macros being defined. These macros map architectural register numbers (i.e., $0, $r0, $f0) to internal register numbers used by simplescalar. This mapping is necessary as the number 1 may be used to refer to both integer and floating point registers depending on which instruction uses it. For example, an integer ADD instruction uses the number 1 to refer to the general purpose register 1. At the same time, a floating point ADD (FADD) instruction uses the number 1 to refer to the floating point register 1. Hence the number 1 is used by different instruction to refer to different registers. Simplescalar maps all these register/type combinations into a continuous register name space starting with 0. “decode.def” provides the macros necessary to do this mapping. For example, “DGPR(N)” maps the general purpose registers while “DFPR(N)” maps the double precision floating point registers. As you will see, GPRs are mapped to 0…31 while FPRs are mapped to 32…63. There are other special purpose macros for the other registers such as HI, LO, PC, and FCC (floating point condition codes). These “D…(N)” macros are called from the DEFINST macros in “machine.def”. Not all simulators need this mapping to function. However, for our purpose it is convenient to use them.
Do the following modifications to “DEFINST”. Go back to “sim_main()” and find where “DEFINST” is defined within the “switch (op)” statement. Change the definition to the following:
#define DEFINST(OP,MSK,NAME,OPFORM,RES,FLAGS,O1,O2,I1,I2,I3) \
case OP: \
r_out[0] = (O1); r_out[1] = (O2); \
r_in[0] = (I1); r_in[1] = (I2); r_in[2] = (I3); \
SYM_CAT(OP,_IMPL); \
break;
Make sure that there are no extra spaces after the “\” in every line. The “\” are used to split the macro definition across multiple lines. A space breaks the definition. Using parenthesis around macro arguments is a very good practice.
l) Getting close! Time to add the code for collecting our statistic. Go after the switch statement, but stay within the “while (TRUE)” loop. Now include the following code:
{
int i;
for (i = 0; i < 3; i++)
if (r_in[i] != DNA && reg_ready [r_in [i]] > sim_num_insn)
{
sim_num_lduh++;
break;
}
}
This loop scans through all input registers and checks whether they are currently available. The DNA constant is used to indicate that the corresponding source register is not really used by the instruction.
Simplescalar does not provide a default definition for DNA. We have to provide our own. Do so at the beginning of the file by inserting the following directive:
#define DNA -1
m) We are almost there. The last thing we need to do is update the “reg_ready” array on loads. Go after the code we added in part (l) and add the following:
if (MD_OP_FLAGS(op) & F_MEM) && (MD_OP_FLAGS(op) & F_LOAD))
{
if (r_out[0] != DNA)
reg_read[r_out[0]] = sim_num_insn + 2;
if (r_out[1] != DNA)
reg_read[r_out[1]] = sim_num_insn + 2;
}
This code flags the load’s target registers indicating that they will not be available for the next instruction to see.
Part
(1) Compile the new sim-safe using: make sim-safe
Run
the modified simulator twice. The fist time use the “test” program as your
input. Then use the “test.O2” program as your input. Both are compiled versions
of the “test.c” file.
(a)
Provide print-outs of the two runs
Test
your simulator with a “real” program. For that, we are going to use gcc itself.
We are providing a precompiled version of gcc for the simplescalar
architecture. To run gcc use the following command:
./sim-safe
cc1.lit.ss gcc.i >& cc.out
This
simulates gcc (the C compiler really which is called cc1) compiling the
preprocessed C file gcc.i. It generates assembly which is written in the
“gcc.s” file.
(b)
Hand-in a printout of this run. The output goes into the cc.out file.
Part
(2): Branch Predictors
Even
in simple pipelines branch prediction is used to reduce control-flow induced
pipeline bubbles. Modify sim-safe so that you can study how well various static
predictors may do. After the switch statement include an “if” statement that
identifies branches:
if ( (MD_OP_FLAGS (op) & (F_CTRL|F_COND) ) == (F_CTRL|F_COND) ) …
This
captures only conditional branches and not jumps such as “jmp”, “jalr”, “jr”,
etc. Still, this is sufficient for our purpose. To determine whether a branch
is taken or not you can simply compare regs.reg_PC and regs.reg_NPC. The first
is the PC of the current instruction and the other is the PC of the next to
execute instruction. If the branch is not taken then “..PC
= ..NPC + sizeof (md_inst_t)". Make sure
you insert your code before the last update to PC and NPC because at that point
PC points to the next instruction and next PC to the one following it.
Make
sim-safe print out how many conditional branches execute and what fraction of
overall instructions these represent. Then make it print out the fraction of
conditional branches that are taken and the fraction of those that are not
taken.
Then,
print out how often the following dynamic branch prediction methods will be
correct (percentage):
1.
GShare with 32K
entries using a single bit (last outcome) and 8 bits of global history. Use the
lower bits of “PC ^ (PC >> 16) ^
Global History” to index the prediction table.
2.
Bimodal with 32K
entries.
3.
The GShare predictor but using 32K entries with two-bit confidence.
If
you need to determine the target address of a branch use the following: The
implementation macros for branches in “machine.def” include a macro call to
SET_TPC(ARG) where ARG is the target of the branch. You can define the
SET_TPC() macro in sim-safe.c so that you can get the target address. For
example, you may define it as:
#define
SET_TPC(N) (target_addr = (N))
Where
“target_addr” is a variable. Every time a branch executes, “target_addr” will
hold the target of the branch (as if it was taken). Comparing this to the PC of
the branch you can determine whether the branch is going backward (lower
address) or not.
Part (3):
“Lies, damned lies, and statistics”:
Collecting
statistics is easy. Verifying that they are correct
is always challenging. To increase your confidence on the correctness of your
statistics gathering code develop two micro-benchmarks in C. These should be
short codes that you know how their branches should behave. For example, a long
running loop with no branches will probably get very close to 100% with all
three predictors. Your micro-benchmarks should be such that you indeed test all
three predictors. If you feel you need more than two please do so (the number
two is just an estimate which I haven’t really thought about). Once you
wrote your C micro-benchmark compile it with the following command:
ss-gcc –o mbench-1 mbench-1.c -O
Double-check
the code that is generated using ss-objdump –d by focusing on main() or any
function you wrote. A complication you have to think about is that any C
program you write, when compiled, is linked with several initialization and
shutdown routines. So, it is not only your code that will run, but also some
other code. These additional routines usually don’t take too much time. To
reduce their impact on the overall statistics, make sure that the core of your
micro-benchmarks runs for a sufficiently large number of instructions (e.g., a
loop that iterates 10,000 times or so).
What
to hand-in: Printouts of your micro-benchmark C files along with a brief
explanation on what they do and why you chose them. Also, printouts of the
sim-safe runs showing that the micro-benchmarks indeed produce the desired
behavior.