We will modify the SimpleScalar simulator. While old, it is sufficiently simple to allow us, in the time available, to gain experience with how to explore architectural options.
Here's what you will need:
1. The SimpleScalar simulator sources.
2. A GNU GCC and binutils toolchain that can generate and manipulate binaries for the simulated architecture (PISA, a derivative of MIPS).
3. If you wish to run gcc and binutils you will have to install the Cygwin environment on a Windows machine. Make sure to install the 32-bit version. If you already have the 64-bit version installed, no worries; just install the 32-bit Cygwin as well. They can happily co-exist.
#1 and #2 above can be found here: Simplescalar sources and compilers.
Do these in the order presented:
1. You can skip this step if cygwin32 is already installed. Here's how to install cygwin32.
2. How to install the compilers and compile your first program under cygwin.
3. How to compile one of the simplescalar simulators under cygwin.
4. Here's a set of benchmarks you can use. These are traces of execution (EIO traces, which SimpleScalar can generate): each contains a complete checkpoint of the machine's state after 1B instructions and a record of all return values from system calls for the next 1B instructions, so the execution of the program can be replayed. They are useful in two ways: (a) you do not need to have the binaries and inputs, and (b) you do not have to simulate the first 1B instructions (this supposedly allows us to skip the initialization).
We will be working with SimpleScalar's OOO core simulator, sim-outorder. Most of the changes will happen in ruu_issue, the function that simulates the OOO scheduler using register dependencies. We will go over how it works during the lectures.
Mirroring what would be a typical exploratory effort, we will follow these steps:
1. Is there potential for our candidate technique to even work? What measurements would help us decide whether the idea is worth exploring further? Recall that it is easier to show that something does not hold than that it does (that is, to prove that something is not true rather than that it is). Mirroring the paper, we first want to characterize instructions by how many operands they still need to wait for when they first enter the scheduler. Create a graph where the X axis lists the workloads; no need to try multiple scheduler sizes, just pick 128 entries. The Y axis is the fraction of instructions, up to 100%, split into three categories: waiting on no operand, waiting on 1 operand, and waiting on 2 operands. Show these as stacked bars. A sketch of the instrumentation follows this paragraph.
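The sketch below is an illustration of how such counting might be hooked into the simulator, not a drop-in patch. It assumes the stock sim-outorder data structures (struct RUU_station with an idep_ready[] flag per input dependence, MAX_IDEPS inputs, and the ruu_dispatch/ruu_link_idep functions); verify these names against your copy of sim-outorder.c. The counter names below are made up.

```c
/*
 * Instrumentation sketch for step 1 (an illustration, not a drop-in patch).
 * Assumes the stock sim-outorder structures; check the names in your sources.
 */

/* instructions that enter the scheduler waiting on 0, 1, or 2 operands */
static counter_t wait_on_n_ops[3];

/* Call once per instruction as it enters the scheduler, e.g. in ruu_dispatch
   after ruu_link_idep() has filled in the idep_ready[] flags. */
static void
count_unready_operands(struct RUU_station *rs)
{
  int i, not_ready = 0;

  for (i = 0; i < MAX_IDEPS; i++)
    if (!rs->idep_ready[i])
      not_ready++;

  if (not_ready > 2)
    not_ready = 2;      /* fold anything beyond two register inputs into "2" */

  wait_on_n_ops[not_ready]++;
}
```

One option for reporting is to register the three counters in sim_reg_stats() (e.g., via stat_reg_counter) so they are printed along with the rest of the statistics; building the stacked-bar graph then becomes a matter of post-processing the simulator output.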
Any time we use simulation we are creating a model of a hypothetical machine. A critical consideration is correctness, and then accuracy. Let's talk about correctness. How do we know that the model we created is correct, that is, that it properly models what it is supposed to? In our case the question becomes: how do we know that the measurements we are collecting are correct? We need some validation test. However, “validation” is too strong a word here, as it means making sure that the model is absolutely correct, which is hard to do. So, let's instead talk about increasing our confidence that the measurements are correct.
For this purpose you will have to create a set of synthetic workloads (microbenchmarks) for which you know what the measurements should look like. It's up to you to figure out what these should be. I would suggest creating at least three: (1) worst case, (2) best case, and (3) an “average case”. A starting-point sketch follows this paragraph. Deliverable 1: the microbenchmark code and an explanation of why you think they are good validation microbenchmarks.
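To make the idea concrete, here is a minimal sketch of one possible best-case/worst-case pair. The function names and loop bodies are purely illustrative, and you should inspect the PISA assembly (the binutils objdump, for instance) to confirm the dependence pattern you intended actually survives compilation.

```c
/*
 * Candidate validation microbenchmarks (illustrative only).  Compile with the
 * PISA gcc at -O0 and check the emitted assembly: the point is the dependence
 * pattern, not the arithmetic.
 */
#define ITERS 10000000

volatile long sink;            /* keep results live so nothing is optimized away */

/* Best-case candidate: the adds read values that were written long ago, so
   their operands should already be ready when they enter the scheduler. */
static long best_case(void)
{
  long a = 1, b = 2, s = 0;
  long i;
  for (i = 0; i < ITERS; i++)
    s += a + b;                /* a and b never change inside the loop */
  return s;
}

/* Worst-case candidate: each add consumes two values that were produced just
   before it, so it should have to wait on both operands. */
static long worst_case(void)
{
  long x = 1, y = 2, s = 0;
  long i;
  for (i = 0; i < ITERS; i++)
    {
      x = x * 3 + i;           /* freshly produced operand #1 */
      y = y * 5 + i;           /* freshly produced operand #2 */
      s += x + y;              /* the add x + y waits on both */
    }
  return s;
}

int main(void)
{
  sink = best_case() + worst_case();
  return 0;
}
```

An “average case” could interleave the two patterns; whatever you pick, you should be able to state the expected 0/1/2-operand breakdown before running it.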
Deliverable 2: the above graph for the workloads we have provided. No need to run the full 1B instructions; 100M is fine for all workloads except gcc, for which you should run the full 1B.
2. Predicting the last tag: Assuming the first set of experiments provides enough motivation to move forward, let's now try to build a predictor for the last tag. We will not be changing how the scheduler works; we will just check whether, and with what accuracy, we can predict which of its two operands an instruction will have to wait for. Report the results as a graph where X lists the benchmarks and Y is accuracy (up to 100%). Let's focus only on a 16K-entry predictor, which ought to be similar to what is described in the paper. Use 8 bits of global history and the lower 14b of the instruction address (after omitting the lower bits that are always zero). XOR these bits with the upper 16b of the instruction address first and then index the table. The entries should be 2-bit saturating counters. A sketch of one possible implementation follows this paragraph.
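The sketch below shows one reading of the indexing scheme above; the exact bit positions, what feeds the 8 bits of global history (taken here to be recent last-operand outcomes), and all ltp_* names are assumptions you should check against the paper and your own design. The shift by 3 assumes PISA's always-zero low PC bits.

```c
/*
 * Last-operand ("last tag") predictor sketch: 16K 2-bit saturating counters,
 * indexed by a hash of the PC and 8 bits of global history.  All names and
 * the exact hash are illustrative assumptions.
 */
#define LTP_ENTRIES   (16 * 1024)             /* 16K entries -> 14-bit index */
#define LTP_MASK      (LTP_ENTRIES - 1)
#define LTP_PC_SHIFT  3                       /* drop always-zero low PC bits */

static unsigned char ltp_table[LTP_ENTRIES];  /* 2-bit saturating counters */
static unsigned int  ltp_ghr;                 /* 8 bits of global history */

/* Hash the PC and history into a 14-bit table index. */
static unsigned int
ltp_index(unsigned int pc)
{
  unsigned int lo = (pc >> LTP_PC_SHIFT) & LTP_MASK;  /* lower useful PC bits */
  unsigned int hi = (pc >> 16) & LTP_MASK;            /* upper 16b of the PC  */
  return (lo ^ hi ^ (ltp_ghr & 0xff)) & LTP_MASK;
}

/* Predict which input operand arrives last: returns 0 or 1. */
static int
ltp_predict(unsigned int pc)
{
  return ltp_table[ltp_index(pc)] >= 2;       /* MSB of the 2-bit counter */
}

/* Train once the actual last-arriving operand (0 or 1) is known. */
static void
ltp_update(unsigned int pc, int last_was_op1)
{
  unsigned char *ctr = &ltp_table[ltp_index(pc)];

  if (last_was_op1) { if (*ctr < 3) (*ctr)++; }
  else              { if (*ctr > 0) (*ctr)--; }

  /* Assumption: the global history records recent last-operand outcomes. */
  ltp_ghr = ((ltp_ghr << 1) | (last_was_op1 & 1)) & 0xff;
}
```

Note that if the global history can change between the time an instruction is dispatched and the time its last operand is known, you will want to remember the index (or the history) used at prediction time and reuse it for the update.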
Deliverable 3: the accuracy measurement graph. In addition, instead of just reporting correct/wrong, think about what other categories would be useful to report.
3. We will not ask you to change the simulator to model execution with the technique; this may be trickier than it looks. We may revisit it in a later assignment.