ACA HW2

Advanced Computer Architecture
HW #1 - Due Tuesday, Nov. 3, 2009 in class

Instructor: Andreas Moshovos

There are two parts in this assignment

NOTE: This assignment refers to several source and executables that you need to complete it. They are available through the web site. The gcc port for simplescalar will be made available through the web page. If you untar it, it will install on /usr/local under CYGWIN)

In this assignment you will be exposed to using and modifying Simplescalar’s functional simulator

The assignment is almost identical to the one offered last year.

In this assignment you will modify the functional simulator from the Simplescalar simulation suite. You are asked to collect certain statistics about program execution. There are other, more detailed simulators in the Simplescalar suite. We will start with the functional simulation not only because it’s the simplest, but also because it forms the building-block for all other simulators. A functional simulator simulates the function of the program. It executes one instruction after the other with no notion of time. Instructions execute one after the other, as assumed by sequential execution semantics. In contrast, a timing simulator is in addition concerned with time, that is how long it takes to complete each instruction. The core of the functional simulator is basically a loop with a huge switch statement where every case corresponds to a different instruction type. The register file is simulated via an array and memory is simulated via another array (actually implemented as a two-level array: first level is indexed by page number, second level is dynamically allocated only for pages actually touched by the simulated program). System calls are forwarded to the actual host OS (more info in class).

While Simplescalar is a fairly elaborate and lengthy piece of software, still there is no need to not panic. A detailed, step-by-step description of all the necessary modifications for collecting the first set of statistics is given in this homework. This is why there are so many pages.

Equipment Required: You will need access to a Cygwin machine to complete this assignment. This assignment requires the use of benchmarks that are compiled for a little-endian machine such as an x86 one. You may be able to complete this assignment on other little-endian architectures. However, this may require various changes in the Simplescalar code (which of course, you will have to figure out on your own).

Structure: First, this assignment shows you how to modify the simulator using an example. Then, you are asked to develop code for collecting a different set of statistics. You will have to figure out how to do the latter on your own. Please plan ahead. This is not something you want to attempt the night before the due date. Minor details not necessarily related to the core of this assignment may take more time than expected.

Here’s the statistics we are interested in. A detailed description of all required modifications follows:

Assume an ideal 5-stage pipeline (similar to the DLX pipeline in H&P ch. 3). In this pipeline accessing memory for instructions or data takes a single cycle. Moreover, an instruction access and a data access can proceed at the same cycle. For a very good reason we need to measure how often a load use hazard occurs in typical programs. What is a load hazard? It occurs when an instruction that immediately follows a load has to use the value read by the load. Recall that in a typical 5-stage pipeline memory accesses occur in the MEM stage (4^th stage). In the same pipeline, registers are used in the beginning of the EX stage (3^rd stage). An instruction following a load has to wait for a cycle before the load value is read. The following figure illustrates a load use hazard:

*Cycle*	N	*N+1*	*N+2*	*N+3*	*N+4*	*N+5*
ld $1, 10($2)	FETCH	DECODE	EXEC	MEM	WB
add $3, $1, 1		FETCH	DECODE	STALL	EXEC	MEM

Here’s the main idea on how to collect the desired statistic. Every time an instruction is executed we can simply check whether one of its inputs is the target of an immediately preceding load. If so, we simply count one more load use hazard. To get this done, we will modify the functional simulator to collect the source register numbers for all instructions and to also identify loads and their target registers.

Here’s a step-by-step description on how this can be done (there are other ways this is one that works):

a) Create a directory where you will copy the simulator files. For clarity, let’s name this directory ~/mysim. The commands to do so are:

cd ~

mkdir mysim

b) Copy the simulator files using:

cd ~/mysim

bunzip2 –c simplesim-aca.tar.bz | tar xvf -

c) Before you attempt any changes, first make sure that your simulator copy compiles and works correctly without any modifications. Use the following command to compile the functional simulator:

cd simplesim-aca

make config-pisa

make sim-safe

The first command configures simplescalar to use the PISA instruction set. There are other instruction set supported such as ALPHA. For most of our assignments we will use PISA since it is somewhat simpler while still being a representative of commercial load/store instruction sets.

There many simulators in the Simplescalar suite. We will be working with “sim-safe” which is a functional simulator. “Sim-fast” is pretty much the same simulator, however, it does not include many of the sanity checks included in sim-safe. For example, sim-fast would never complaint if your code jumps into illegal memory addresses.

We can use sim-safe to measure load use hazards without having to write the complete code for a pipeline. This is a standard trick in simulation. In general, simulating is different than building hardware. If you are interested in studying a phenomenon you may be able to do so (and save a lot of time and effort) by thinking whether it is possible to collect the necessary statistics without necessarily simulating how the hardware would do something but instead simulating the resulting behavior. For example, in a typical five stage pipeline writeback would occur in the 5^th stage. In hardware, each instruction would go through every stage with the appropriate information stored in a set of inter-stage latches. In simulation, this behavior can be simulated simply by delaying the writeback for five “cycles” (i.e., we do not need to have code for the latches and copy values every “cycle”). Also note that Simplescalar performs execution-driven simulation. That is, it actually simulates the execution of each instruction. This is in contrast to trace-driven simulation where a trace of a previous execution is used to retrace the path followed when the program executed in the past.

If you get “warnings” while compiling “sim-safe” you can safely ignore them. If you get errors, then something went terribly wrong (this is not supposed to happen).

d) Check that the compiled simulator works. You can invoke it as follows:

./sim-safe [arguments] executable [executable arguments]

A precompiled, short running program is provided through the website for your convenience. Run it as follows:

./sim-safe msg

You should see a nice message along with numerous statistics about the program. Try running count also.

e) Now it’s time to start making modifications. We will be working with “sim-safe.c” which is the core implementation of the sim-safe simulator. Much of the code in “sim-safe.c” can be ignored for completing this assignment. Because Simplescalar was written to facilitate multiple simulators there is a set of functions that have to be provided by any simulator implementation even though many of them are not always used.

The core of instruction execution occurs in function “sim_main()”. Go ahead and edit “sim-safe.c” with your favorite editor and locate “sim_main()”. After a couple of statements in “sim_main()” you will find an infinite loop starting with “while (TRUE)” statement. Each iteration of this loop corresponds to the execution of one instruction. The first statement in this loop is (ignore all code that is within an #ifdef TARGET_ALPHA…#endif):

regs.regs_R[MD_REF_ZERO] = 0;

This statement sets register $0 to 0. In the MIPS instruction set architecture (Simplescalar uses a derivative of the MIPS ISA) register $0 is always 0. The table “regs.regs_R[]” implements the general purpose register file. This structure has another array “regs.regs_F[]” which implements the floating point register file. Look in “regs.h” and “regs.c” if you wish to learn more (but, leave this for later and this is not necessary – actually, it is good practice to abstract away parts of the code that are not of immediate interest. This is the only hope one has to work with existing, lengthy and complex code).

The next statement reads an instruction from memory:

MD_FETCH_INST (inst, mem, regs.regs_PC);

This macro is defined in “machine.h” which contains many, PISA specific functions. Suprisingly, variable “mem” implements main memory. Again, you may look in “memory.c” if you care for details. You should probably avoid the temptation and assume that mem is simply a gargantuan array (it’s almost that, as it is implemented as via an array of pointers pointing to 4K chunks of simulated memory. This is similar to page tables as we will learn later on in the course). The above statement reads an instruction (which is of size “sizeof(md_inst_t)” from address “regs.regs_PC”. “Regs.regs_PC” is an integer simulating the program counter.

Following this statement is one that keeps a tally of executed instructions followed with some bookeeping and instruction opcode determination statements:

MD_SET_OPCODE (op, inst);

This command, extracts the opcode field out of the raw instruction. This field specific what the instruction does (e.g., addition, load, etc.). It is defined in “machine.h”.

Instruction execution takes place in the switch statement that follows. You will see that this switch statement uses the instruction’s opcode (switch (op)) to determine what needs to be done. If at first the code within the switch statement looks weird don’t worry. You are absolutely right! Just keep reading on and soon all will make sense. The idea is that the switch statement oddly enough contains one “case” statement per instruction opcode. Within each of this “case” statements are command that simulate instruction execution.

In the switch statement you will find a couple of “#define” statements defining various macros. The one to pay attention to is the:

#define DEFINST(OP, MSK, NAME, OPFORM, RES, FLAGS, O1, O2, I1, I2, I3)

This is defined to be “SYM_CAT(OP,_IMPL)”. “SYM_CAT(X,Y)” is a macro that expands to XY which is then also treated as a macro that is expanded by the preprocessor (I think this is gcc specific). As a result, whenever DEFINST is called, it expands into OP_IMPL where “OP” is the first argument passed into DEFINST. For example, the statement “DEFINST(ADD,….)” expands into “ADD_IMPL” which is also treated as macro and expanded. After these definitions, there is an “#include “machine.def”” statement. View this file and you will see that it contains a macro definition and a macro call per instruction opcode. The first macro is of the form “X_IMPL” where “X” is an opcode such as ADD or SUB. The second macro, is a call to “DEFINST(X,…)” Returning back to “sim_main()” and the switch statement, we can now understand what happens when this code passes through the preprocessor. As a result of the definition of “DEFINST” in sim-safe.c, every call to DEFINST from machine.def expands into the following code:

case OP: OP_IMPL; break;

Next the preprocessor replaces each of these OP_IMPL, with the appropriate statements as defined in the “machine.def”. As a result, after preprocessing the “switch (op)” statement expands into a huge switch statement with a case statement for every possible opcode. Within each case, the code provided in “machine.def” is included. You can see the expanded code (doesn’t look pretty) via the following command:

gcc –E sim-safe.c –o sim-safe.pcc.

This should produce a “sim-safe.pcc” file which you may view. Search for
“sim_main()” in that file and then take a look at the expanded switch statement. On another note, if you want to see what gcc does to compile your programs use the “-v” option. In fact, gcc is just a front-end that calls a variety of other programs.

The “OP_IMPL” macros utilize other macros to access machine state including the register file and the memory. These have to be provided in the simulator. “Sim-safe.c” includes appropriate definitions for our purposes. For example, to read a register, OP_IMPL uses the “GPR(N)” macro, while it uses the “SET_GPR(N)” to write to a register. “GPR(N)” is defined simply as “regs.regs_R[N]” in “sim-safe.c”. That is, to read register N we just access the N element of the array regs.regs_R[] which implements the register file. There are other macros, that deal with floating point registers and memory which you can find in “sim-safe.c” starting at line 239.

A more detailed description of how to use the “DEFINST” macro is given at the beginning of “machine.def”. For our purposes, suffices to note that “O1” and “O2” are the target register numbers, while “I1”, “I2” and “I3” are the source register numbers for each instruction. There are 2 targets and 3 sources as in PISA an instruction may write up to 2 registers (load double or multiply) and read up to 3 (e.g., store double). Most instructions read up to 2 registers and write just one. The macro DNA is used to indicate that the instruction does not write or read the corresponding target or source register (i.e., if the instruction only writes one target register).

After the switch statement you will encounter a couple of “if”statements that either detect faults or report statistics. There is also one that checks whether the instruction was a load or a store (“if (MD_OP_FLAGS(op) & F_MEM)”). This is also used for keeping statistics and has nothing to do with execution. Throughout the simulator you may notice calls to functions named “dlite_....”. Simplescalar implements a simple debugger. Simply ignore these for the time being.

After all these, there are two statements that update the PC and a statement that checks whether the maximum count of instructions has been reached.

/* go to the next instruction */

regs.regs_PC = regs.regs_NPC;

regs.regs_NPC += sizeof(md_inst_t);

f) Now that we have a high-level understanding of how the simulator works, it’s time to start modifying it. Recall that we want to measure how often the target register of a load is used by the next instruction. Here’s how we are going to measure this. We are going to keep a record per register indicating when was the last “time” the register was the targer of a load. For example, if register $1 is set by load at “time” (instruction count really) 1000, then we will mark that this value will be ready at “time” 1002. A subsequent instruction can then check whether any of its inputs are not yet available by inspecting the corresponding record.

g) Let’s first declare the table with the ready times per register. Go at the beginning of sim-safe.c just after the initial #include statements and add the following line:

static counter_t reg_ready[MD_TOTAL_REGS];

MD_TOTAL_REGS is defined in “machine.h” to be the maximum number of registers that are available (includes the general purpose, the floating point, hi, lo and others).

h) After the above statement, declare a variable that we will use to count the load use hazards:

static counter_t sim_num_lduh = 0;

i) Simplescalar provides an elaborate package for collecting statistics. It keeps an internal database of counters and other statistics-related data structures. It also prints out this database at the end of the simulation. We need to register our counter with this database so that it’s printed out at the end. To do so go to the “sim_regs_stats()” function include the following statements:

stat_reg_counter(sdb, “sim_num_lduh”, “total number of load use hazards”, &sim_num_lduh, sim_num_lduh, NULL);

stat_reg_formula(sdb, “sim_load_use_ratio”, “load use fraction”, “sim_num_lduh / sim_num_insn”, NULL);

The second statement registers a formula that reports the load use hazard count as a fraction of all instructions executed. The statistics database is implemented in the file “stats.c”.

j) Now, go back to “sim_main()” and declare the following local variables:

int r_out[2], r_in[3];

Recall the DEFINST macro? Its arguments include O1, O2, I1, I2, and I3 which are the target and source register numbers. “machine.def” defines these appropriately for every possible opcode. We will use the previously defined local variables to store these register numbers for our purposes.

k) Modify the “DEFINST” macro to extract the source and target register numbers. To do so first include the file “decode.def (“#include “decode.def”) immediately after all other “#include” directives at the beginning of “sim-safe.c”. If you browse through “decode.def” you will see a number of “D…(N)” macros being defined. These macros map architectural register numbers (i.e., $0, $r0, $f0) to internal register numbers used by simplescalar. This mapping is necessary as the number 1 may be used to refer to both integer and floating point registers depending on which instruction uses it. For example, an integer ADD instruction uses the number 1 to refer to the general purpose register 1. At the same time, a floating point ADD (FADD) instruction uses the number 1 to refer to the floating point register 1. Hence the number 1 is used by different instruction to refer to different registers. Simplescalar maps all these register/type combinations into a continuous register name space starting with 0. “decode.def” provides the macros necessary to do this mapping. For example, “DGPR(N)” maps the general purpose registers while “DFPR(N)” maps the double precision floating point registers. As you will see, GPRs are mapped to 0…31 while FPRs are mapped to 32…63. There are other special purpose macros for the other registers such as HI, LO, PC, and FCC (floating point condition codes). These “D…(N)” macros are called from the DEFINST macros in “machine.def”. Not all simulators need this mapping to function. However, for our purpose it is convenient to use them.

Do the following modifications to “DEFINST”. Go back to “sim_main()” and find where “DEFINST” is defined within the “switch (op)” statement. Change the definition to the following:

#define DEFINST(OP,MSK,NAME,OPFORM,RES,FLAGS,O1,O2,I1,I2,I3) \

case OP: \

r_out[0] = (O1); r_out[1] = (O2); \

r_in[0] = (I1); r_in[1] = (I2); r_in[2] = (I3); \

SYM_CAT(OP,_IMPL); \

break;

Make sure that there are no extra spaces after the “\” in every line. The “\” are used to split the macro definition across multiple lines. A space breaks the definition. Using parenthesis around macro arguments is a very good practice.

l) Getting close! Time to add the code for collecting our statistic. Go after the switch statement, but stay within the “while (TRUE)” loop. Now include the following code:

{

int i;

for (i = 0; i < 3; i++)

if (r_in[i] != DNA && reg_ready [r_in [i]] > sim_num_insn)

{

sim_num_lduh++;

break;

}

This loop scans through all input registers and checks whether they are currently available. The DNA constant is used to indicate that the corresponding source register is not really used by the instruction.

Simplescalar does not provide a default definition for DNA. We have to provide our own. Do so at the beginning of the file by inserting the following directive:

#define DNA -1

m) We are almost there. The last thing we need to do is update the “reg_ready” array on loads. Go after the code we added in part (l) and add the following:

if (MD_OP_FLAGS(op) & F_MEM) && (MD_OP_FLAGS(op) & F_LOAD))

{

if (r_out[0] != DNA)

reg_read[r_out[0]] = sim_num_insn + 2;

if (r_out[1] != DNA)

reg_read[r_out[1]] = sim_num_insn + 2;

}

This code flags the load’s target registers indicating that they will not be available for the next instruction to see.

Part (1) Compile the new sim-safe using: make sim-safe

Run the modified simulator twice. The fist time use the “test” program as your input. Then use the “test.O2” program as your input. Both are compiled versions of the “test.c” file.

(a) Provide print-outs of the two runs

Test your simulator with a “real” program. For that, we are going to use gcc itself. We are providing a precompiled version of gcc for the simplescalar architecture. To run gcc use the following command:

./sim-safe cc1.lit.ss gcc.i >& cc.out

This simulates gcc (the C compiler really which is called cc1) compiling the preprocessed C file gcc.i. It generates assembly which is written in the “gcc.s” file.

(b) Hand-in a printout of this run. The output goes into the cc.out file.

Part (2): Branch Predictors

Even in simple pipelines branch prediction is used to reduce control-flow induced pipeline bubbles. Modify sim-safe so that you can study how well various static predictors may do. After the switch statement include an “if” statement that identifies branches:

if ( (MD_OP_FLAGS (op) & (F_CTRL|F_COND) ) == (F_CTRL|F_COND) ) …

This captures only conditional branches and not jumps such as “jmp”, “jalr”, “jr”, etc. Still, this is sufficient for our purpose. To determine whether a branch is taken or not you can simply compare regs.reg_PC and regs.reg_NPC. The first is the PC of the current instruction and the other is the PC of the next to execute instruction. If the branch is not taken then “..PC = ..NPC + sizeof (md_inst_t)". Make sure you insert your code before the last update to PC and NPC because at that point PC points to the next instruction and next PC to the one following it.

Make sim-safe print out how many conditional branches execute and what fraction of overall instructions these represent. Then make it print out the fraction of conditional branches that are taken and the fraction of those that are not taken.

Then, print out how often the following dynamic branch prediction methods will be correct (percentage):

1. GShare with 32K entries using a single bit (last outcome) and 8 bits of global history. Use the lower bits of “PC ^ (PC >> 16) ^ Global History” to index the prediction table.

2. Bimodal with 32K entries.

3. The GShare predictor but using 32K entries with two-bit confidence.

If you need to determine the target address of a branch use the following: The implementation macros for branches in “machine.def” include a macro call to SET_TPC(ARG) where ARG is the target of the branch. You can define the SET_TPC() macro in sim-safe.c so that you can get the target address. For example, you may define it as:

#define SET_TPC(N) (target_addr = (N))

Where “target_addr” is a variable. Every time a branch executes, “target_addr” will hold the target of the branch (as if it was taken). Comparing this to the PC of the branch you can determine whether the branch is going backward (lower address) or not.

Part (3): “Lies, damned lies, and statistics”:

Collecting statistics is easy. Verifying that they are correct is always challenging. To increase your confidence on the correctness of your statistics gathering code develop two micro-benchmarks in C. These should be short codes that you know how their branches should behave. For example, a long running loop with no branches will probably get very close to 100% with all three predictors. Your micro-benchmarks should be such that you indeed test all three predictors. If you feel you need more than two please do so (the number two is just an estimate which I haven’t really thought about). Once you wrote your C micro-benchmark compile it with the following command:

ss-gcc –o mbench-1 mbench-1.c -O

Double-check the code that is generated using ss-objdump –d by focusing on main() or any function you wrote. A complication you have to think about is that any C program you write, when compiled, is linked with several initialization and shutdown routines. So, it is not only your code that will run, but also some other code. These additional routines usually don’t take too much time. To reduce their impact on the overall statistics, make sure that the core of your micro-benchmarks runs for a sufficiently large number of instructions (e.g., a loop that iterates 10,000 times or so).

What to hand-in: Printouts of your micro-benchmark C files along with a brief explanation on what they do and why you chose them. Also, printouts of the sim-safe runs showing that the micro-benchmarks indeed produce the desired behavior.