ECE 454 - Computer Systems Programming

The Edward S. Rogers Sr. Department of Electrical and Computer Engineering

Midterm Examination  Fall 2011

<table>
<thead>
<tr>
<th>Name</th>
</tr>
</thead>
<tbody>
<tr>
<td>Student #</td>
</tr>
</tbody>
</table>

Answer all questions. Write your answers on the exam paper. Show your work. Each question has a different assigned value, as indicated.

The exam is open book (only simple calculators allowed, no cell phones or PDAs)

Total time available: **110 minutes**

Total marks available: **110** (roughly one mark per minute)

Verify that your exam has all the pages.

<table>
<thead>
<tr>
<th>Part</th>
<th>Points</th>
<th>Mark</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>30</td>
<td></td>
</tr>
<tr>
<td>2</td>
<td>15</td>
<td></td>
</tr>
<tr>
<td>3</td>
<td>15</td>
<td></td>
</tr>
<tr>
<td>4</td>
<td>15</td>
<td></td>
</tr>
<tr>
<td>5</td>
<td>15</td>
<td></td>
</tr>
<tr>
<td>6</td>
<td>10</td>
<td></td>
</tr>
<tr>
<td>7</td>
<td>10</td>
<td></td>
</tr>
<tr>
<td>Total</td>
<td>110</td>
<td></td>
</tr>
</tbody>
</table>
PART 1) [30] Short Answer

a) **IBM-Visit Question**: Why did processor architects for machines programmed using punchcards favour CISC instructions over RISC instructions?

b) Why are array based codes easier for compilers to parallelize than pointer-based codes?

c) What is the main challenge for a deep pipeline, and what is the solution?

d) Name advantages of a superscalar processor architecture. For each indicate if the advantage is due to the wide-issue ability or the out-of-order-execution ability of the superscalar processor.
e) OptsRU's has only enough time to implement one of two optimizations for a customer. Which one provides the best overall performance? Compute the speedup expected for each, and state which option is the best.

1) Memory Optimization: makes 40% of the program 3 times faster, but doing so makes the remaining 60% of the program 1.1 times slower.

2) Parallelization: 50% of the program is perfectly parallel, the rest is sequential. Half of the users have dual-core processors and the other half of the users have quad-core processors (there is not support for hyperthreading).

3) which optimization is best?

f) What are pros & cons of profiling an application using a software processor simulator, compared to using an instrumentation approach like PIN.

1) Pros of simulation (vs PIN):

2) Cons of simulation (vs PIN):
g) In homework2, loop unrolling did not improve performance much in most cases, and made performance worse in other cases. Why?

h) You are using gprof to analyze a long-running program as you try different gcc optimization levels. You notice that a function that was 20% of execution time for level -O2 is only 0.01% of execution time for level -O3. What most likely happened to cause this?

i) Give two reasons why this code:

```c
for (i=n-1; i>=0; i--){
   sum += A[i];
}
```

would likely perform better than this code:

```c
for (i=0; i<n; i++){
   sum += A[permute(i,n)];
}
```

assuming that

- both codes calculate the identical result
- the \( x=\text{permute}(i,n) \) operation converts each \( i \) into a unique pseudo-random number \( x \) such that \( 0 \leq x < n \)
- the \( \text{permute()} \) operation is "free" (has zero execution time)
- \( A[] \) is large
- the code is only executed once (not multiple times in an outer loop)
PART 2) [15] Compiler Optimizations

Apply optimizations that a compiler would do to the following function foo (you probably want to do this on a scrap paper first, then copy your final optimized version onto the exam). Give a list of the optimizations that you did in order, and for each give the name of the optimization and the original line number of the code that was modified. Write your final optimized version of foo in the space provided.

```c
1: void foo(int n){
2:     int x=10;
3:     int y=20;
4:     int z = x + y;
5:     int w = 1;
6:     int v = 0;
7: 
8: 
9: 10:     for (int i=0;i<n;i++){
11:         x = y*z;
12:         if (x==600){
13:             v += i*(i + n);
14:         }
15:         w *= i + v + n;
16:     }
17:     printf("%d %d %d %d\n",v,w,x,z);
18: }
```
(this page blank, for part2)
PART 3) [15] Alignment

struct mystruct {
    int x;
    char c;
    float y;
    short z;
    char k;
    short p;
} A[10];

a) How many bytes of memory will the array A occupy?

b) Give a space-optimized version of the above code.

c) How many bytes of memory will the optimized version occupy?
PART 4) [15] Tiling

Given this machine:

1st-level data cache:
- 128B cache blocks
- 8-way set associative
- 16 sets

Page size: 8K B

TLB:
- fully associative
- 128 entries

If you were to tile the following code

```c
// assume D[] is already initialized to zeros
void mmmcm(int A[], int B[], int C[], int D[], int n){
    for (i=0; i<n; i++){
        for (j=0; j<n; j++){
            for (k=0; k<n; k++)
                D[i*n+j] += A[i*n+k] * B[k*n+j] * C[i*n+j];
        }
    }
}
```

Ignoring storage for instructions and small variables, and assuming that there are no unlucky conflict misses (only capacity misses), what is the largest tile size $T$ (in number of elements, not bytes) that you can use to tile to minimize misses for the:

a) first level data cache
b) TLB

c) given your answers to (a) and (b), what is a good overall tile size in elements to use (considering only 1st level data cache and TLB together)? Explain why the next power-of-two tile sizes that are bigger and smaller than your answer would be worse than your answer (e.g., if your answer is 8 elements, discuss for 4 elements and 1 elements).
PART 5) [15] M alloc

Consider an allocator that uses an implicit free list. Each memory block, either allocated or free, has a size that is a multiple of eight bytes. Thus, only the 29 higher order bits in the header and footer are needed to record block size, which includes the header and footer and is represented in units of bytes. The usage of the remaining 3 lower order bits is as follows:

- bit 0 indicates the use of the current block: 1 for allocated, 0 for free.
- bit 1 indicates the use of the previous adjacent block: 1 for allocated, 0 for free.
- bit 2 is unused and is always set to be 0.

Five helper routines are defined to facilitate the implementation of free(void *p). The functionality of each routine is explained in the comment above the function definition. Fill in the missing code for the body of each helper routine.

HINT 1: each solution can be done in a single line of C code, but you are allowed to use more than one line if you want

HINT 2: use lots of parentheses "()" to clarify your order of operations

/* given a pointer p to an allocated block, i.e., p is a pointer returned by some previous malloc()/realloc() call; returns the pointer to the header of the block */
void * header(void* p)
{
    void *ptr;

    return ptr;
}

/* given a pointer to a valid block header or footer, returns the size of the block */
int size(void *hp)
{
    int result;

    return result;
}
/* given a pointer p to an allocated block, i.e. p is a pointer returned by some previous malloc()/realloc() call; returns the pointer to the footer of the block */
void * footer(void *p)
{
    void *ptr;
    return ptr;
}

/* given a pointer to a valid block header or footer, returns the usage of the current block, 1 for allocated, 0 for free */
int allocated(void *hp)
{
    int result;
    return result;
}

/* given a pointer to a valid block header, returns the pointer to the header of previous block in memory */
void * prev(void *hp)
{
    void *ptr;
    return ptr;
}
PART 6) [10] Cache Coherence

Assuming a 4-CPU multicore with MESI invalidation-based cache coherence with write-back caches, considering only the cache block for location X, in the table below are shown the order in time of loads and stores to X. Put a '1' or a checkmark in the column on the left next to each row for which a coherence message occurs that requests a copy of the cache block contents for the corresponding CPU’s cache. HINT: a load-miss results in a copy request, while a write-hit does not.

<table>
<thead>
<tr>
<th>Copy of cache block requested?</th>
<th>CPU 0</th>
<th>CPU 1</th>
<th>CPU 2</th>
<th>CPU 3</th>
</tr>
</thead>
<tbody>
<tr>
<td>Load X</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Store X</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>Load X</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Load X</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Store X</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>Load X</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Load X</td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Load X</td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td></td>
<td>Load X</td>
</tr>
<tr>
<td></td>
<td></td>
<td>Store X</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Store X</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Store X</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>Load X</td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>Load X</td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td>Load X</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
PART 7) [10] Consistency

Consider the following code for two threads, including an initial state:

Initialization:  \( a=0, b=0, x=0, y=0; \)

T1:
\( x = 1; \)
\( b = y; \)

T2:
\( y = 1; \)
\( a = x; \)

a) What are the possible outcomes for \((a,b)\) for all possible executions of the two threads given, assuming a modern out-of-order superscalar processor? Put a checkmark next to each possible valid outcome.

\( (a,b) = (0,0) \) __________

\( (a,b) = (0,1) \) __________

\( (a,b) = (1,0) \) __________

\( (a,b) = (1,1) \) __________

b) Sequential consistency is a strict form of consistency for a parallel architecture where we assume that each thread/CPU executes operations in their original program order, but that instructions from different threads might be interleaved in any manner. What are the possible outcomes for \((a,b)\) for all possible interleavings of the two threads given? Put a checkmark next to each possible valid outcome.

\( (a,b) = (0,0) \) __________

\( (a,b) = (0,1) \) __________

\( (a,b) = (1,0) \) __________

\( (a,b) = (1,1) \) __________
c) Assuming sequential consistency, will the following code for threads T1 and T2 properly synchronize the communication of the value of a from T1 to T2 via x? Explain briefly why or why not. Assume that flag is initialized to zero.

**T1:**
```
x = a;
flag = 1;
```

**T2:**
```
while(!flag){};
... = x;
```