Determining Hits & Misses with Caches
In this set of lectures we will look at how, given (1) a piece of assembly code and (2) a memory hierarchy, we can determine how many hits and misses will occur and thus how long it will take for the code to execute. To simplify the discussion we make the following assumptions:
1. Assuming that memory is ideally fast, all instructions execute in a single cycle.
2. We ignore the memory accesses needed for fetching instructions (we would otherwise have to figure out how many bytes each instruction occupies, and adding these references would make it harder to understand what is happening without really enhancing the material).
Example #1: Linear access of array elements
Let’s see what will happen when the following code executes:
movia r8, array ; base of the array
add r9, r0, r0 ; sum = 0
add r10, r0, r0 ; counts the number of elements processed thus far
addi r11, r0, 4096 ; number of array elements
loop:
ldw r12, 0(r8) ; load the next element of the array
add r9, r9, r12 ; sum = sum + next element of the array
addi r8, r8, 4 ; make r8 point to the next element
addi r10, r10, 1 ; one more element processed
bne r10, r11, loop ; have we processed all elements?
When this code executes, and ignoring instruction fetches, it will access memory via the instruction “ldw r12, 0(r8)”. It will access the following addresses:
0, 4, 8, 12, ..., 16380 (decimal offsets from array; 4096 word accesses in total).
Execution time with ideal memory (single cycle latency):
Assuming that each instruction takes a single cycle to execute, the aforementioned code will execute in the following number of cycles:
4 + 4096 x 5 cycles
The first four cycles are for the first four instructions that execute only once. Each loop iteration comprises five instructions. The loop iterates 4096 times (r10 counting from 0 to 4096).
Execution time with a Cache
Let’s now determine how long it will take for the code to execute with a realistic memory hierarchy. For this purpose let us assume that a memory reference completes in a single cycle if it hits in the cache (cache latency = 1 cycle). A reference that misses in the cache takes 100 cycles in total to complete (1 cycle for determining that this is a miss plus 99 cycles to go to memory). Let’s assume that the data cache is direct mapped with a capacity of 16Kbytes. Notice that many more bits are needed for implementing the cache, since it also contains the tags, valid bits and possibly dirty bits; these are not included in the data cache size, which refers only to the useful data capacity. The data cache has 32-byte frames.
Let us further assume that array (the base address of the array) is divisible by 16K, that is, (array modulo 16K) = 0. This assumption will allow us to determine which set each access maps to. Because array is divisible by 16K, its last 14 bits are zeroes (that’s because 2^14 = 16K). So, array = x...x0...0 where x...x is 18 bits and 0...0 is 14 bits.
First we need to determine how the cache is indexed. That is, given a 32-bit address from the CPU, how do we determine which set and frame it may map to?
We know that:
#frames = Capacity / bytes_per_frame = 16K / 32 = 2^14 / 2^5 = 2^9 = 512
#sets = #frames / associativity = 2^9 / 1 = 2^9
So, our cache has 512 sets with one frame per set. Each frame contains 32 bytes. Thus, the 32-bit address is divided into three fields:
Offset: 5 bits in total because each frame holds 32 bytes → bits 4 through 0 (the least significant ones)
Set: 9 bits in total because there are 2^9 sets in the cache → bits 13 through 5 (immediately next to the offset)
Tag: what is left, so 32 – 9 – 5 = 18 bits, or bits 31 through 14.
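This field breakdown can be sketched in Python (a minimal sketch; the function name is ours, and it just applies the bit widths derived above):

```python
# Direct mapped, 16 KB data, 32-byte frames -> 5 offset bits, 9 set bits.
OFFSET_BITS = 5   # 2^5 = 32 bytes per frame
SET_BITS = 9      # 2^9 = 512 sets

def split_address(addr):
    """Return (tag, set, offset) for a 32-bit address."""
    offset = addr & ((1 << OFFSET_BITS) - 1)                   # bits 4..0
    set_index = (addr >> OFFSET_BITS) & ((1 << SET_BITS) - 1)  # bits 13..5
    tag = addr >> (OFFSET_BITS + SET_BITS)                     # bits 31..14
    return tag, set_index, offset

# Since array modulo 16K = 0, offsets from array decode directly:
print(split_address(0))    # (0, 0, 0)  -> array + 0
print(split_address(4))    # (0, 0, 4)  -> array + 4
print(split_address(32))   # (0, 1, 0)  -> array + 32 falls in the next set
```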
Now that we know how addresses are mapped onto the cache we can determine how many are hits and how many are misses.
Access #1: address array + 0 → this maps to tag x...x, set 0, and offset 0 → the cache is initially empty, hence this is a miss → latency 100 cycles
Note that the tag is x...x, the upper 18 bits of the “array” address.
Because this is a miss, the whole block will be fetched from memory during those 100 cycles. The block includes the bytes at addresses array + 0 through array + 31 (32 bytes in total).
Recall that the assignment of addresses into blocks is static. Addresses 0 through 31 belong to block 0, 32 through 63 to block 1, 64 through 95 to block 2 and so on.
Access #2: address array + 4 → tag x...x, set 0, offset 4 → this is a hit since we brought in this data when we missed on the first access → latency 1 cycle
Access #3: address array + 8 → tag x...x, set 0, offset 8 → hit → latency 1 cycle
Accesses #4 through #7 behave the same way.
Access #8: address array + 28 → tag x...x, set 0, offset 28 → hit → latency 1 cycle
Access #9: address array + 32 → tag x...x, set 1, offset 0 → miss, since this belongs to block 1 and this is the first access to an address within block 1 → latency 100 cycles
This will bring in the data at addresses array + 32 through array + 63.
Access #10: address array + 36 → tag x...x, set 1, offset 4 → hit → latency 1 cycle
If we continue this way we will see a repeating pattern of 1 miss and 7 hits per block.
So our execution time will be:
4 (for the first four instructions) + 4096 x 4 (for the four instructions in the loop that do not access memory) + 4096 x (7/8 x 1 (for the 7 hits per block) + 1/8 x 100 (for the 1 miss per block)) cycles
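The 7-hits-plus-1-miss pattern and the resulting cycle count can be checked with a small simulation (a sketch in Python, not part of the lecture; since array modulo 16K = 0, we can use offsets from array directly as addresses):

```python
# Direct-mapped cache: 512 sets, 32-byte frames; hit = 1 cycle, miss = 100.
NUM_SETS, FRAME_BYTES = 512, 32
HIT_CYCLES, MISS_CYCLES = 1, 100

cache = {}                 # set index -> tag currently stored in that frame
cycles = hits = misses = 0
for i in range(4096):      # one 4-byte load per loop iteration
    addr = 4 * i           # offset from array (array modulo 16K = 0)
    set_index = (addr // FRAME_BYTES) % NUM_SETS
    tag = addr // (FRAME_BYTES * NUM_SETS)
    if cache.get(set_index) == tag:
        hits += 1
        cycles += HIT_CYCLES
    else:
        misses += 1
        cycles += MISS_CYCLES
        cache[set_index] = tag     # fetch the block into the frame

print(hits, misses)        # 3584 512 -> 7 hits and 1 miss per 8-word block
total = 4 + 4096 * 4 + cycles      # add the non-memory instructions
print(total)
```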
If the above code were executed twice with the same cache, the second time the array elements are accessed they would all be hits, since the full array has been cached. This is possible because the cache’s capacity is sufficient to hold the whole array (4096 x 4 = 16Kbytes). The same holds true for caches with higher capacity. If the cache was 8Kbytes instead, then accesses to elements 2048 through 4095 would replace elements 0 through 2047 in the cache. So, the next time elements 0 through 2047 are accessed they would again observe the 1 miss – 7 hit pattern.
Example #2: Accessing two aligned arrays
Let us now look at another example where we access two arrays simultaneously:
movia r8, aA ; array A
movia r9, aB ; array B
addi r10, r0, 4096 ; elements per array
add r11, r0, r0 ; elements processed thus far
add r12, r0, r0 ; sum = 0
loop:
ldw r13, 0(r8) ; load next element of array A
ldw r14, 0(r9) ; load next element of array B
add r12, r12, r13
add r12, r12, r14
addi r11, r11, 1 ; one more element per array processed
addi r8, r8, 4 ; point to next element of array a
addi r9, r9, 4 ; point to next element of array b
bne r10, r11, loop ; all elements processed?
If we assume an ideal memory system then this many cycles will be needed:
5 (for the initialization) + 4096 (iterations) x 8 (instructions in loop body)
Let’s assume that aA modulo 16K = 0 and aB modulo 16K = 0, hence aA = x...x0...0 and aB = y...y0...0, where the 0...0 portions comprise 14 bits and x...x and y...y are each 18 bits. The exact values of x...x and y...y are not important.
As before, let us assume a 16Kbyte direct-mapped cache with 32 bytes per frame. Here’s the sequence of accesses and what happens during each one of them:
Access #1: aA + 0 → tag x...x, set 0, offset 0 → miss → 100 cycles → brings in data from aA + 0 through aA + 31
Access #2: aB + 0 → tag y...y, set 0, offset 0 → miss → 100 cycles → brings in data from aB + 0 through aB + 31 → replaces the data in the frame (the block fetched by the previous access)
Access #3: aA + 4 → tag x...x, set 0, offset 4 → miss → 100 cycles → brings in data from aA + 0 through aA + 31 → replaces the data from the previous reference
Access #4: aB + 4 → tag y...y, set 0, offset 4 → miss → 100 cycles → brings in data from aB + 0 through aB + 31 → replaces the data from the previous reference
And so on.
As we see, no access results in a hit: consecutive accesses map onto the same cache frame but belong to different blocks, and hence they keep evicting each other.
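This thrashing is easy to reproduce with a short simulation (a sketch; the base addresses below are hypothetical values chosen only to be divisible by 16K):

```python
# Direct-mapped cache: 512 sets, 32-byte frames; aA and aB conflict on every set.
NUM_SETS, FRAME_BYTES = 512, 32
aA, aB = 0x10000, 0x20000      # hypothetical bases, both divisible by 16K

cache = {}                     # set index -> tag currently stored
misses = 0
for i in range(4096):
    for base in (aA, aB):      # the loop body loads aA[i] then aB[i]
        addr = base + 4 * i
        set_index = (addr // FRAME_BYTES) % NUM_SETS
        tag = addr // (FRAME_BYTES * NUM_SETS)
        if cache.get(set_index) != tag:
            misses += 1
            cache[set_index] = tag   # evicts whatever the other array left here

print(misses)                  # 8192: all 2 x 4096 accesses miss
```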
So it will take this many cycles for the code to execute:
5 (for initialization) + 4096 x 6 (for the six instructions in the loop body that do not access memory) + 4096 x 2 x 100 (for the two load instructions, which always miss)
What if the cache was 2-way set associative? In this case, the cache will have 256 sets where each set contains 2 frames, so the tag is 19 bits long and the set index 8 bits long. The offset is still 5 bits long since each frame is still 32 bytes.
Here’s what happens at each access:
Access #1: aA + 0 → tag x...x0, set 0, offset 0 → miss → 100 cycles → brings in data from aA + 0 through aA + 31, placed in frame 0 of set 0
Access #2: aB + 0 → tag y...y0, set 0, offset 0 → miss → 100 cycles → brings in data from aB + 0 through aB + 31 → placed in frame 1 of set 0
Access #3: aA + 4 → tag x...x0, set 0, offset 4 → hit
Access #4: aB + 4 → tag y...y0, set 0, offset 4 → hit
Accesses #5 through #14 are hits as well.
Access #15: aA + 28 → tag x...x0, set 0, offset 28 → hit
Access #16: aB + 28 → tag y...y0, set 0, offset 28 → hit
Access #17: aA + 32 → tag x...x0, set 1, offset 0 → miss → 100 cycles → brings in data from aA + 32 through aA + 63, placed in frame 0 of set 1
Access #18: aB + 32 → tag y...y0, set 1, offset 0 → miss → 100 cycles → brings in data from aB + 32 through aB + 63 → placed in frame 1 of set 1
And so on.
So the execution time will be: 5 + 4096 x 6 + 4096 x (1/8 x 100 + 7/8 x 1) x 2 cycles (each of the two load instructions experiences one miss and seven hits per eight references).
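The 2-way behavior can also be checked with a sketch that models two frames per set and LRU replacement (the base addresses are again hypothetical 16K-aligned values, not part of the lecture):

```python
# 2-way set associative cache: 256 sets x 2 frames x 32 bytes = 16 KB.
NUM_SETS, FRAME_BYTES, WAYS = 256, 32, 2
aA, aB = 0x10000, 0x20000          # hypothetical bases, both divisible by 16K

cache = {s: [] for s in range(NUM_SETS)}   # per set: tags ordered LRU -> MRU
hits = misses = 0
for i in range(4096):
    for base in (aA, aB):          # the loop body loads aA[i] then aB[i]
        addr = base + 4 * i
        set_index = (addr // FRAME_BYTES) % NUM_SETS
        tag = addr // (FRAME_BYTES * NUM_SETS)
        ways = cache[set_index]
        if tag in ways:
            hits += 1
            ways.remove(tag)
            ways.append(tag)       # move to the MRU position
        else:
            misses += 1
            if len(ways) == WAYS:
                ways.pop(0)        # evict the LRU block
            ways.append(tag)

print(hits, misses)    # 7168 1024 -> back to 1 miss + 7 hits per block, per array
```

The two arrays now occupy different frames of the same set instead of evicting each other, which is exactly why associativity removes the conflict misses here.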