A cache is a small and fast memory, which typically contains a subset of the current contents of main memory. There are often several levels of caches: for example, in a modern desktop, there are typically three levels of cache referred to as L1, L2, and L3. L1 is the smallest, fastest, and closest to the processor. L2 is larger but slower than the L1. L1 and L2 are nowadays both integrated on the same chip as the processor, and there is often an L3 cache (slower and larger still), which may be on the same chip or on a separate one. In general, storage devices are slower, larger, and cheaper the farther away they are from the processor. For example, main memory can be thousands of times larger and tens of times slower than an on-chip cache. A hard drive can be hundreds of thousands of times slower than main memory.
Caches are interesting because they are invisible to the ISA and the programmer, meaning that you generally do not know that they are there, and do not specifically have to manage them (with the exception of instructions such as ldwio/stwio which avoid the caches for the purpose of communicating directly with I/O devices). Caches are hence a case of processor designers providing greater performance in the underlying implementation without changing the ISA. For example, with the NIOS environment you can generate systems that either have an L1 cache or do not have an L1 cache, and either way the ISA and the programs you write look the same. The introduction of the cache only improves the performance (hopefully) of your program as it runs.
    for (i = 0; i < n; i++) { sum += A[i]; }

What forms of locality does it contain? There is spatial locality in the data accesses of the array, since we are accessing the array elements sequentially (one item at a time in consecutive order), and these elements will be next to each other in memory. Hence if we access A[5], spatial locality says that we are very likely to soon access A[6] and A[7] as well. There is also much locality present in the instructions for the loop. The body of the loop will be a sequence of instructions that will be fetched (and executed) consecutively, which is a form of spatial locality. The loop body will also be executed repeatedly as the loop goes round and round, which is a form of temporal locality (the instructions of the loop body are "re-used").
Caches exploit locality by doing these two things. First, by storing a subset of memory: the subset that is most likely to be re-used. Second, by grouping its contents into blocks to exploit spatial locality---so if you access one part of a block, the entire block is moved into the cache, and you are likely to soon request a different part of that same block that is now already present in the cache.
To access a cache we break up the bits of a memory address into three portions: the tag, set index, and offset. The offset portion indicates which byte within a cache block you are referring to; in other words, it can be thought of as an index into the small array of bytes which comprises a cache block. The set index portion indicates how to locate a block within the cache. In other words it is used to tell us which set we are referring to, and we can think of it as an index into an array of sets in our cache. Finally, since several memory locations can map to each block in the cache, we use the tag portion to identify which memory location this block corresponds to. In other words, we use the tag as a unique identifier that tells us which of the several possible blocks we currently have in the corresponding set in the cache.
Note that we cannot simply use a tag of 0x0 to indicate an empty or invalid cache block---the memory addresses with a tag of 0x0 are themselves valid addresses! Therefore every block in the cache has an associated valid bit. This bit is used to decide whether the block currently holds valid data: the valid bit is a 'one' if the cache block is holding valid data, and a 'zero' otherwise.
Assuming an address space size of 2^n bytes (byte addressable), the size of a block in this cache is B = 2^b bytes, the number of sets is S = 2^s, and S is also equal to the total number of cache blocks in the cache; in other words, there is one block per set. The capacity of a direct mapped cache can be computed as B*S = 2^(b+s) bytes.
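The address breakdown described above can be sketched in C. This is a minimal illustration, not part of the original notes: the type `addr_fields` and the function `split_address` are hypothetical names, and the sketch assumes b + s is smaller than 32.

```c
#include <stdint.h>

/* Hypothetical helper type: the three fields of a memory address. */
typedef struct {
    uint32_t tag;
    uint32_t set_index;
    uint32_t offset;
} addr_fields;

/* Split addr into tag / set index / offset, given b offset bits and
   s set-index bits. Assumes b + s < 32. */
addr_fields split_address(uint32_t addr, unsigned b, unsigned s)
{
    addr_fields f;
    f.offset    = addr & ((1u << b) - 1u);          /* low b bits    */
    f.set_index = (addr >> b) & ((1u << s) - 1u);   /* next s bits   */
    f.tag       = addr >> (b + s);                  /* whatever is left */
    return f;
}
```

For example, `split_address(0xface, 4, 4)` yields tag 0xfa, set index 0xc, and offset 0xe. Passing s = 0 models a cache with a single set, where every bit above the offset belongs to the tag.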
    movia r8,0xface; ldb r8,0(r8)
    movia r8,0xface; ldb r8,0(r8)
    movia r8,0xfac0; ldb r8,0(r8)
    movia r8,0xab00; ldb r8,0(r8)
    movia r8,0xcd00; ldb r8,0(r8)

Also assume that all memory locations are initialized to zero, except for the following memory locations:

    0xab00: 0x12
    0xcd00: 0x25
    0xfac0: 0x56
    0xface: 0x78

First, let's determine the capacity of this cache, which is 2^(s+b) = 2^(4+4) = 256 bytes. Second, the number of sets is S = 2^s = 2^4 = 16 sets. Third, the block size is B = 2^b = 2^4 = 16 bytes.
Breakdown of an address:
Tag | Set Index | Offset |
8 bits | 4 bits | 4 bits |
Our cache is initially empty, so we can think of it as looking like this:
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
0x0 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x1 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x2 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x3 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x4 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x5 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x6 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x7 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x8 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x9 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xa | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xb | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xc | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xd | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xe | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xf | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
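Let's go through the steps for accessing the cache for the first load:

    movia r8,0xface; ldb r8,0(r8)

We will perform the following steps:

1. Split the address 0xface into tag 0xfa, set index 0xc, and offset 0xe.
2. Check set 0xc: its valid bit is 0, so this access is a miss.
3. Load the 16-byte block at addresses 0xfac0 through 0xfacf from memory into set 0xc, set the tag to 0xfa, and set the valid bit to 1.
4. Return the byte at offset 0xe of the block, which is 0x78.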
Here we show the contents of the cache after the first load instruction:
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
0x0 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x1 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x2 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x3 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x4 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x5 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x6 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x7 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x8 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x9 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xa | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xb | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xc | 1 | 0xfa | 00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 56 |
0xd | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xe | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xf | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
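Let's go through the steps for accessing the cache for the second load:

    movia r8,0xface; ldb r8,0(r8)

We will perform the following steps:

1. Split the address 0xface into tag 0xfa, set index 0xc, and offset 0xe, as before.
2. Check set 0xc: the valid bit is 1 and the stored tag 0xfa matches, so this access is a hit.
3. Return the byte at offset 0xe of the block, which is 0x78.

The previous load did not change the state of the cache. Let's go through the steps for accessing the cache for the third load:

    movia r8,0xfac0; ldb r8,0(r8)

We will perform the following steps:

1. Split the address 0xfac0 into tag 0xfa, set index 0xc, and offset 0x0.
2. Check set 0xc: the valid bit is 1 and the stored tag 0xfa matches, so this access is also a hit.
3. Return the byte at offset 0x0 of the block, which is 0x56.

The previous load did not change the state of the cache. Let's go through the steps for accessing the cache for the fourth load:

    movia r8,0xab00; ldb r8,0(r8)

We will perform the following steps:

1. Split the address 0xab00 into tag 0xab, set index 0x0, and offset 0x0.
2. Check set 0x0: its valid bit is 0, so this access is a miss.
3. Load the 16-byte block at addresses 0xab00 through 0xab0f from memory into set 0x0, set the tag to 0xab, and set the valid bit to 1.
4. Return the byte at offset 0x0 of the block, which is 0x12.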
Here we show the contents of the cache after the fourth load instruction:
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
0x0 | 1 | 0xab | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12 |
0x1 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x2 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x3 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x4 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x5 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x6 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x7 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x8 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x9 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xa | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xb | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xc | 1 | 0xfa | 00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 56 |
0xd | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xe | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xf | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
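Let's go through the steps for accessing the cache for the fifth load:

    movia r8,0xcd00; ldb r8,0(r8)

We will perform the following steps:

1. Split the address 0xcd00 into tag 0xcd, set index 0x0, and offset 0x0.
2. Check set 0x0: the valid bit is 1, but the stored tag 0xab does not match 0xcd, so this access is a miss.
3. Replace the block in set 0x0 with the 16-byte block at addresses 0xcd00 through 0xcd0f, and set the tag to 0xcd.
4. Return the byte at offset 0x0 of the block, which is 0x25.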
Here we show the final contents of the cache (after the fifth load instruction):
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
0x0 | 1 | 0xcd | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 25 |
0x1 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x2 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x3 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x4 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x5 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x6 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x7 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x8 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0x9 | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xa | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xb | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xc | 1 | 0xfa | 00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 56 |
0xd | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xe | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0xf | 0 | 0x00 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
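The five loads traced above can be replayed with a small direct-mapped cache model. This is only a sketch under the example's parameters (16 sets, 16-byte blocks, 16-bit addresses); the names `dm_line`, `dm_cache`, `dm_memory`, and `dm_load_byte` are hypothetical and not taken from the notes.

```c
#include <stdint.h>
#include <string.h>

#define DM_SETS       16   /* S = 2^4 sets */
#define DM_BLOCK_SIZE 16   /* B = 2^4 bytes per block */

typedef struct {
    int     valid;
    uint8_t tag;                    /* 8-bit tag */
    uint8_t data[DM_BLOCK_SIZE];
} dm_line;

static dm_line dm_cache[DM_SETS];    /* starts out all invalid */
static uint8_t dm_memory[1 << 16];   /* 16-bit byte-addressable memory */

/* Load one byte through the cache; *hit is set to 1 on a hit, 0 on a miss. */
uint8_t dm_load_byte(uint16_t addr, int *hit)
{
    unsigned offset = addr & 0xF;
    unsigned set    = (addr >> 4) & 0xF;
    uint8_t  tag    = (uint8_t)(addr >> 8);
    dm_line *line   = &dm_cache[set];

    *hit = line->valid && line->tag == tag;
    if (!*hit) {
        /* Miss: fetch the whole block and replace whatever was in the set. */
        memcpy(line->data, &dm_memory[addr & ~0xFu], DM_BLOCK_SIZE);
        line->tag   = tag;
        line->valid = 1;
    }
    return line->data[offset];
}
```

After storing the four non-zero bytes of the example into `dm_memory`, calling `dm_load_byte` on 0xface, 0xface, 0xfac0, 0xab00, and 0xcd00 in that order gives miss, hit, hit, miss, miss, which matches the tables above.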
To access a fully-associative cache we break up the bits of a memory address into only two portions: the tag and the offset (since there is effectively only one set, we do not need to use any bits to index it). The offset portion indicates which byte within a cache block you are referring to. We use the tag portion to identify which memory location a block corresponds to.
    movia r8,0xface; ldb r8,0(r8)
    movia r8,0xab00; ldb r8,0(r8)
    movia r8,0xcd00; ldb r8,0(r8)

Also assume that all memory locations are initialized to zero, except for the following memory locations:

    0xab00: 0x12
    0xcd00: 0x25
    0xface: 0x78
Since this is a fully-associative cache, there is only one set, and in the breakdown of the address no bits are designated as set index bits. Each cache block is 16 bytes, hence 4 bits of the address are used as offset bits. Therefore the remaining 12 bits in the address are used as tag bits, giving the following breakdown:
Breakdown of an address:
Tag | Offset |
12 bits | 4 bits |
The resulting cache looks like this:
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
---|---|---|---|
Set 0 | 0 | 0x000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
Set 0 | 0 | 0x000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
Set 0 | 0 | 0x000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
Set 0 | 0 | 0x000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
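Let's go through the steps for accessing the cache for the first load:

    movia r8,0xface; ldb r8,0(r8)

We will perform the following steps:

1. Split the address 0xface into tag 0xfac and offset 0xe.
2. Compare the tag against every block in the cache: no block is valid, so this access is a miss.
3. Load the 16-byte block at addresses 0xfac0 through 0xfacf into a free block, set its tag to 0xfac, and set its valid bit to 1.
4. Return the byte at offset 0xe of the block, which is 0x78.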
Here we show the contents of the cache after the first load instruction:
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
---|---|---|---|
Set 0 | 1 | 0xfac | 00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
Set 0 | 0 | 0x000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
Set 0 | 0 | 0x000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
Set 0 | 0 | 0x000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
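Let's go through the steps for accessing the cache for the second load:

    movia r8,0xab00; ldb r8,0(r8)

We will perform the following steps:

1. Split the address 0xab00 into tag 0xab0 and offset 0x0.
2. Compare the tag against every block in the cache: the only valid block has tag 0xfac, so this access is a miss.
3. Load the 16-byte block at addresses 0xab00 through 0xab0f into a free block, set its tag to 0xab0, and set its valid bit to 1.
4. Return the byte at offset 0x0 of the block, which is 0x12.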
Here we show the contents of the cache after the second load instruction:
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
---|---|---|---|
Set 0 | 1 | 0xfac | 00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
Set 0 | 1 | 0xab0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12 |
Set 0 | 0 | 0x000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
Set 0 | 0 | 0x000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
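Let's go through the steps for accessing the cache for the third load:

    movia r8,0xcd00; ldb r8,0(r8)

We will perform the following steps:

1. Split the address 0xcd00 into tag 0xcd0 and offset 0x0.
2. Compare the tag against every block in the cache: neither valid block (tags 0xfac and 0xab0) matches, so this access is a miss.
3. Load the 16-byte block at addresses 0xcd00 through 0xcd0f into a free block, set its tag to 0xcd0, and set its valid bit to 1. No block needs to be evicted.
4. Return the byte at offset 0x0 of the block, which is 0x25.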
Here we show the contents of the cache after the third load instruction:
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
---|---|---|---|
Set 0 | 1 | 0xfac | 00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
Set 0 | 1 | 0xab0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12 |
Set 0 | 1 | 0xcd0 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 25 |
Set 0 | 0 | 0x000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
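The key difference in a fully-associative cache is that a lookup must compare the tag against every block. The sketch below illustrates that search for the four-block cache shown in the tables. The names `fa_line`, `fa_find`, and `fa_fill` are hypothetical, data bytes are omitted for brevity, and filling the first invalid entry is an assumed placement choice.

```c
#include <stdint.h>

#define FA_WAYS 4   /* four blocks, as in the tables above */

typedef struct {
    int      valid;
    uint16_t tag;     /* 12-bit tag: address >> 4 */
} fa_line;

/* Return the index of the block holding addr, or -1 on a miss. */
int fa_find(const fa_line cache[FA_WAYS], uint16_t addr)
{
    uint16_t tag = (uint16_t)(addr >> 4);
    for (int i = 0; i < FA_WAYS; i++)
        if (cache[i].valid && cache[i].tag == tag)
            return i;
    return -1;
}

/* On a miss, place the block in the first invalid entry. Returns its index,
   or -1 if the cache is full (a replacement policy would be needed then). */
int fa_fill(fa_line cache[FA_WAYS], uint16_t addr)
{
    for (int i = 0; i < FA_WAYS; i++)
        if (!cache[i].valid) {
            cache[i].valid = 1;
            cache[i].tag   = (uint16_t)(addr >> 4);
            return i;
        }
    return -1;
}
```

Filling 0xface, 0xab00, and 0xcd00 in order leaves tags 0xfac, 0xab0, and 0xcd0 in the first three entries, as in the final table; none of the three blocks conflicts with another.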
To access a set-associative cache we break up the bits of a memory address into three portions: the tag, set index, and offset. The offset portion indicates which byte within a cache block you are referring to. The set index portion indicates which set of the cache the block belongs to. Finally we use the tag portion to identify which memory location a block corresponds to within a given set.
Assuming an address space size of 2^n bytes (byte addressable), the size of a block in this cache is B = 2^b bytes, the number of sets is S = 2^s, and the total number of cache blocks in the cache is W * S; in other words, there are W blocks per set. The capacity of a set-associative cache can be computed as B*S*W bytes. If W is a power of 2, say W = 2^w (which it might not be), then the capacity can also be computed as 2^(b+s+w) bytes.
    movia r8,0xface; ldb r8,0(r8)
    movia r8,0xab00; ldb r8,0(r8)
    movia r8,0xcd00; ldb r8,0(r8)

Also assume that all memory locations are initialized to zero, except for the following memory locations:

    0xab00: 0x12
    0xcd00: 0x25
    0xface: 0x78
First, let's determine the address breakdown. The capacity is 256 bytes, the block size is 16 bytes, and W is 2 ways. We know that capacity = B*S*W, hence 256 = 16 * S * 2. Solving for S gives 256/(16*2) = 8 sets. Hence we have b = 4 offset bits, s = 3 set index bits, and, since addresses are 16 bits, t = 16 - 3 - 4 = 9 tag bits. Hence the breakdown of an address is:
Tag | Set Index | Offset |
9 bits | 3 bits | 4 bits |
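The arithmetic behind this breakdown can be sketched as a small helper. This is an illustration only: `cache_geometry`, `compute_geometry`, and `log2_exact` are hypothetical names, and the sketch assumes the block size and the number of sets are powers of two.

```c
/* Number of bits needed to index x items, assuming x is a power of two. */
static unsigned log2_exact(unsigned x)
{
    unsigned bits = 0;
    while (x > 1) { x >>= 1; bits++; }
    return bits;
}

typedef struct {
    unsigned sets;          /* S */
    unsigned offset_bits;   /* b */
    unsigned index_bits;    /* s */
    unsigned tag_bits;      /* t */
} cache_geometry;

/* Derive S, b, s and t from capacity = B*S*W and the address width. */
cache_geometry compute_geometry(unsigned capacity, unsigned block_size,
                                unsigned ways, unsigned addr_bits)
{
    cache_geometry g;
    g.sets        = capacity / (block_size * ways);
    g.offset_bits = log2_exact(block_size);
    g.index_bits  = log2_exact(g.sets);
    g.tag_bits    = addr_bits - g.index_bits - g.offset_bits;
    return g;
}
```

For the example here, `compute_geometry(256, 16, 2, 16)` gives 8 sets with b = 4, s = 3, and t = 9. Setting `ways` to 1 reproduces the direct-mapped breakdown of 8, 4, and 4 bits.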
Because s and t are not multiples of four, it takes more effort to represent those values using hex, so instead we show those values using binary. We can visualize this cache as:
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
---|---|---|---|
0b000 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b000 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b001 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b001 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b010 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b010 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b011 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b011 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b100 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b100 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b101 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b101 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b110 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b110 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b111 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b111 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
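Let's go through the steps for accessing the cache for the first load:

    movia r8,0xface; ldb r8,0(r8)

We will perform the following steps:

1. Split the address 0xface into tag 0b111110101, set index 0b100, and offset 0b1110.
2. Check both blocks of set 0b100: neither is valid, so this access is a miss.
3. Load the 16-byte block at addresses 0xfac0 through 0xfacf into the first block of set 0b100, set its tag to 0b111110101, and set its valid bit to 1.
4. Return the byte at offset 0b1110 of the block, which is 0x78.
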
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
---|---|---|---|
0b000 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b000 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b001 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b001 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b010 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b010 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b011 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b011 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b100 | 1 | 0b111110101 | 00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b100 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b101 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b101 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b110 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b110 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b111 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b111 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
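Let's go through the steps for accessing the cache for the second load:

    movia r8,0xab00; ldb r8,0(r8)

We will perform the following steps:

1. Split the address 0xab00 into tag 0b101010110, set index 0b000, and offset 0b0000.
2. Check both blocks of set 0b000: neither is valid, so this access is a miss.
3. Load the 16-byte block at addresses 0xab00 through 0xab0f into the first block of set 0b000, set its tag to 0b101010110, and set its valid bit to 1.
4. Return the byte at offset 0b0000 of the block, which is 0x12.
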
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
---|---|---|---|
0b000 | 1 | 0b101010110 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12 |
0b000 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b001 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b001 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b010 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b010 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b011 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b011 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b100 | 1 | 0b111110101 | 00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b100 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b101 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b101 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b110 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b110 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b111 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b111 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
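Let's go through the steps for accessing the cache for the third load:

    movia r8,0xcd00; ldb r8,0(r8)

We will perform the following steps:

1. Split the address 0xcd00 into tag 0b110011010, set index 0b000, and offset 0b0000.
2. Check both blocks of set 0b000: the first is valid but its tag 0b101010110 does not match, and the second is not valid, so this access is a miss.
3. Load the 16-byte block at addresses 0xcd00 through 0xcd0f into the second block of set 0b000, set its tag to 0b110011010, and set its valid bit to 1. Unlike in the direct-mapped cache, the block for 0xab00 is not evicted.
4. Return the byte at offset 0b0000 of the block, which is 0x25.
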
Set Index | Valid? | Tag | Hex data values (for bytes 15..0) 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00 |
---|---|---|---|
0b000 | 1 | 0b101010110 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12 |
0b000 | 1 | 0b110011010 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 25 |
0b001 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b001 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b010 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b010 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b011 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b011 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b100 | 1 | 0b111110101 | 00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b100 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b101 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b101 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b110 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b110 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b111 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
0b111 | 0 | 0b000000000 | 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |
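A set-associative lookup combines the two earlier ideas: the set index selects one set, and the tag is then compared against every way in that set. The sketch below models only valid bits and tags for the 8-set, 2-way example. The names `sa_line` and `sa_access` are hypothetical, and replacement is deliberately left out.

```c
#include <stdint.h>

#define SA_SETS 8
#define SA_WAYS 2

typedef struct {
    int      valid;
    uint16_t tag;     /* 9-bit tag */
} sa_line;

/* Look up addr; on a miss, fill the first invalid way of its set.
   Returns 1 on a hit, 0 on a miss that was filled, and -1 when the set is
   full and a victim would have to be chosen (not modelled here). */
int sa_access(sa_line cache[SA_SETS][SA_WAYS], uint16_t addr)
{
    unsigned set = (addr >> 4) & 0x7;        /* 3 set-index bits */
    uint16_t tag = (uint16_t)(addr >> 7);    /* 9 tag bits */

    for (int w = 0; w < SA_WAYS; w++)
        if (cache[set][w].valid && cache[set][w].tag == tag)
            return 1;

    for (int w = 0; w < SA_WAYS; w++)
        if (!cache[set][w].valid) {
            cache[set][w].valid = 1;
            cache[set][w].tag   = tag;
            return 0;
        }
    return -1;
}
```

Accessing 0xface, 0xab00, and 0xcd00 in order produces three misses and leaves 0xab00 and 0xcd00 side by side in set 0b000, whereas in the direct-mapped example 0xcd00 evicted 0xab00.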
There are two main options for handling writes to caches. The first is called write-through, where for every store both the cache and the next level of the memory hierarchy (e.g., memory itself) are updated. This is good, because both memory and the cache are kept consistent and will always hold the same values for corresponding locations. However, it can be bad because every store results in communication with memory, reducing the potential benefits of locality.
A second option is called write-back, and for this option only the cache is updated on stores: memory is only updated when a cache block is evicted from the cache. This way the amount of communication between the cache and memory is reduced, but the cache and memory are no longer always consistent. Implementing write-back also requires adding one bit per cache block to track whether that cache block has been modified, typically called a dirty bit. Whenever a store writes to a cache block, we set the dirty bit for that cache block. When we replace a cache block whose dirty bit is set, we must first write that cache block back to memory, otherwise we might lose the only up-to-date copy of that cache block.
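The role of the dirty bit can be sketched for a single cache block. This is a simplified illustration with hypothetical names (`wb_line`, `wb_store_byte`, `wb_evict`); for brevity it stores the block's starting address directly instead of a tag, which is an assumption made only for this sketch.

```c
#include <stdint.h>
#include <string.h>

#define WB_BLOCK_SIZE 16

typedef struct {
    int      valid;
    int      dirty;                  /* set once the block has been modified */
    uint16_t block_addr;             /* address of the block's first byte */
    uint8_t  data[WB_BLOCK_SIZE];
} wb_line;

/* Store hit under write-back: update only the cache and mark the block dirty. */
void wb_store_byte(wb_line *line, unsigned offset, uint8_t value)
{
    line->data[offset] = value;
    line->dirty = 1;
}

/* Evict a block: copy it back to memory only if it was modified.
   Returns 1 if a write-back to memory happened, 0 otherwise. */
int wb_evict(wb_line *line, uint8_t *memory)
{
    int wrote = 0;
    if (line->valid && line->dirty) {
        memcpy(&memory[line->block_addr], line->data, WB_BLOCK_SIZE);
        wrote = 1;
    }
    line->valid = 0;
    line->dirty = 0;
    return wrote;
}
```

Between the store and the eviction, memory still holds the old value, which is exactly the temporary inconsistency described above. A write-through cache would instead update `memory` inside the store itself and would need no dirty bit.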
Implementing the LRU algorithm requires that you either (i) keep blocks in sorted order (according to LRU), or (ii) encode and track the order of blocks. The first option requires no additional storage, but can be fairly slow. The second option is more common and fast, but requires extra bits to encode the LRU state (i.e., to track the relative LRU order of all blocks in the set); specifically, it requires W * log2(W) bits. For example, a 4-way set-associative cache would require 4 * log2(4) bits = 4 * 2 bits = 8 bits of tracking state per cache set.
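Option (ii) can be sketched by storing a recency rank for each way, which takes log2(W) bits per way and so W * log2(W) bits per set. This rank encoding is one possible scheme chosen for illustration, not necessarily what any particular processor uses; `lru_state`, `lru_init`, `lru_touch`, and `lru_victim` are hypothetical names.

```c
#define LRU_WAYS 4

/* rank[w] is the recency rank of way w: 0 = most recently used,
   LRU_WAYS - 1 = least recently used. Each rank fits in log2(W) bits. */
typedef struct {
    unsigned char rank[LRU_WAYS];
} lru_state;

/* Start with an arbitrary but valid order: way 0 newest, last way oldest. */
void lru_init(lru_state *s)
{
    for (int w = 0; w < LRU_WAYS; w++)
        s->rank[w] = (unsigned char)w;
}

/* Record an access to way `used`: it becomes rank 0, and every way that
   was more recent than it ages by one. */
void lru_touch(lru_state *s, int used)
{
    unsigned char old = s->rank[used];
    for (int w = 0; w < LRU_WAYS; w++)
        if (s->rank[w] < old)
            s->rank[w]++;
    s->rank[used] = 0;
}

/* The victim is the way holding the highest rank. */
int lru_victim(const lru_state *s)
{
    for (int w = 0; w < LRU_WAYS; w++)
        if (s->rank[w] == LRU_WAYS - 1)
            return w;
    return 0;   /* not reached while the ranks form a permutation */
}
```

A cache would call `lru_touch` on every hit or fill and `lru_victim` when a set is full and a block must be replaced.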
There are other schemes to decide replacement such as random, where you pick a random block---this scheme works surprisingly well and requires no extra state. A scheme commonly used in modern processors is called random-but-not-MRU, where MRU stands for Most-Recently Used. In this scheme you replace a random block so long as it is not the MRU block, in which case you pick again. To track MRU requires only a total of log2 W bits (to track whichever is the MRU block). For example, for a 4-way set-associative cache it would require log2 4 bits = 2 bits per cache set of tracking state.
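The random-but-not-MRU choice can be sketched in a few lines. The function name `pick_victim_not_mru` is hypothetical, the sketch assumes at least two ways, and the C library's `rand()` stands in for whatever random source real hardware would use.

```c
#include <stdlib.h>

/* Pick a random victim way in [0, ways) that is not the MRU way.
   Assumes ways >= 2; rand() is used for illustration only. */
int pick_victim_not_mru(int ways, int mru_way)
{
    int victim;
    do {
        victim = rand() % ways;
    } while (victim == mru_way);   /* picked the MRU block: pick again */
    return victim;
}
```

The only state needed per set is `mru_way`, which is why this scheme costs log2(W) bits (rounded up) rather than the W * log2(W) bits of full LRU tracking.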
TOTAL_CACHE_SIZE = TOTAL_CAPACITY + TOTAL_OVERHEAD
TOTAL_OVERHEAD = TAG_BITS + VALID_BITS + DIRTY_BITS + LRU_BITS
TOTAL_CAPACITY = 6kB = 6*1024B = 6*1024*8 bits = 49152 bits
num_blocks = TOTAL_CAPACITY / block_size = 6kB / 32B = 192 blocks
num_sets S = num_blocks / W ways = 192 blocks / 3 ways = 64 sets => s = 6 set index bits
block_size B = 32 bytes => b = 5 offset bits
Hence the breakdown of an address is:
Tag | Set Index | Offset |
5 bits | 6 bits | 5 bits |
TAG_BITS = num_blocks * 5 bits = 192 blocks * 5 bits = 960 bits
VALID_BITS = num_blocks * 1 bit = 192 bits
DIRTY_BITS = num_blocks * 1 bit = 192 bits
LRU_BITS (here the random-but-not-MRU state) = num_sets * log2(W) bits = 64 * log2(3) = 64 * 2 (rounding up) = 128 bits
TOTAL_CACHE_SIZE = 49152 bits + 960 bits + 192 bits + 192 bits + 128 bits = 50624 bits = 6328 bytes ≈ 6.18kB
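The same calculation can be written as a function, which makes it easy to try other configurations. This is a sketch with hypothetical names (`bits_for`, `total_cache_bits`); it assumes a write-back cache with one valid bit and one dirty bit per block and one MRU pointer per set, and the 16-bit address width passed in the example is inferred from the 5 + 6 + 5 bit breakdown above.

```c
/* Smallest number of bits that can encode n distinct values. */
static unsigned bits_for(unsigned n)
{
    unsigned bits = 0;
    while ((1u << bits) < n)
        bits++;
    return bits;
}

/* Total storage in bits: data plus tag, valid, dirty and MRU-tracking bits. */
unsigned long total_cache_bits(unsigned capacity_bytes, unsigned block_bytes,
                               unsigned ways, unsigned addr_bits)
{
    unsigned num_blocks = capacity_bytes / block_bytes;
    unsigned num_sets   = num_blocks / ways;
    unsigned tag_bits   = addr_bits - bits_for(num_sets) - bits_for(block_bytes);

    unsigned long data  = 8ul * capacity_bytes;
    unsigned long tags  = (unsigned long)num_blocks * tag_bits;
    unsigned long valid = num_blocks;                 /* 1 bit per block */
    unsigned long dirty = num_blocks;                 /* 1 bit per block */
    unsigned long mru   = (unsigned long)num_sets * bits_for(ways);

    return data + tags + valid + dirty + mru;
}
```

Calling `total_cache_bits(6 * 1024, 32, 3, 16)` reproduces the 50624 bits computed above.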
    for (i = 0; i < 1024; i++) { sum += A[i]; }

Assume that each element of the array A[i] is 4B.

First, we will note that each cache miss will cause a 32B block to be loaded from memory, hence 8 array elements will be loaded (since each array element is 4B). Specifically, if we read A[0] and it is a miss, a block containing A[0] through A[7] will be loaded into the cache. Therefore one in eight accesses will miss in the cache, so the number of misses is 1024/8 = 128.
miss_rate = num_misses / num_accesses = 128/1024 = 12.5%
hit_rate = num_hits/num_accesses = 1 - miss_rate = 87.5%
We need to reason about the total size of the array relative to the total size of the cache. Since the array is 1024*4B = 4KB, while the cache is only 1KB, we know that the entire array does not fit: only 1/4 of the array fits at one time. After the first execution of the loop, the last quarter of the array (A[768] through A[1023]) will be in the cache when we start the second execution of the loop, which will first access A[0] and so on. Therefore the cache exploits no locality across executions of the loop, and the second execution misses just as often as the first. If instead the cache were 4KB or larger, then subsequent executions of the loop would produce all cache hits.
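The miss counts discussed above can be checked with a small simulation. The sketch below assumes a direct-mapped cache with 32B blocks and an array that starts at address 0; neither detail is stated in the text, and `count_array_misses` is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

#define AM_BLOCK_BYTES 32
#define AM_MAX_SETS    1024

/* Count misses when A[0..n_elems-1] (4-byte elements, starting at address 0)
   is read sequentially `passes` times through a direct-mapped cache of
   cache_bytes bytes. misses_per_pass[p] receives the miss count of pass p. */
void count_array_misses(unsigned cache_bytes, unsigned n_elems,
                        unsigned passes, unsigned misses_per_pass[])
{
    unsigned sets = cache_bytes / AM_BLOCK_BYTES;   /* must be <= AM_MAX_SETS */
    int      valid[AM_MAX_SETS];
    uint32_t tags[AM_MAX_SETS];
    memset(valid, 0, sizeof valid);

    for (unsigned p = 0; p < passes; p++) {
        misses_per_pass[p] = 0;
        for (unsigned i = 0; i < n_elems; i++) {
            uint32_t block = (i * 4u) / AM_BLOCK_BYTES;  /* block holding A[i] */
            unsigned set   = block % sets;
            uint32_t tag   = block / sets;
            if (!valid[set] || tags[set] != tag) {       /* miss: load block */
                valid[set] = 1;
                tags[set]  = tag;
                misses_per_pass[p]++;
            }
        }
    }
}
```

With a 1KB cache, two passes over the 1024-element array give 128 misses each, a 12.5% miss rate both times. With a 4KB cache, the first pass still has 128 misses but the second pass has none.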
    for (i = 0; i < 1024; i++) { sum += A[i] + B[i]; }

Also assume that A[i] maps to the same set in the cache as B[i].

Let's trace through what happens for the first couple of accesses.

    for (...) { sum += A[i]; }
    for (...) { sum += B[i]; }
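The effect of A[i] and B[i] sharing a set can be illustrated with a simulation that compares the combined loop against the two separate loops. Everything specific here is an assumption made for the sketch: a 1KB direct-mapped cache with 32B blocks, 4-byte elements, and hypothetical base addresses chosen so that A[i] and B[i] fall in the same set.

```c
#include <stdint.h>
#include <string.h>

#define CF_BLOCK_BYTES 32
#define CF_SETS        32         /* 1KB direct-mapped cache (assumed) */
#define CF_N           1024
#define CF_A_BASE      0x0000u    /* hypothetical base addresses chosen so */
#define CF_B_BASE      0x1000u    /* that A[i] and B[i] share a set        */

static int      cf_valid[CF_SETS];
static uint32_t cf_tag[CF_SETS];

/* Returns 1 if the access misses (and loads the block), 0 if it hits. */
static unsigned cf_access(uint32_t addr)
{
    uint32_t block = addr / CF_BLOCK_BYTES;
    unsigned set   = block % CF_SETS;
    uint32_t tag   = block / CF_SETS;
    if (cf_valid[set] && cf_tag[set] == tag)
        return 0;
    cf_valid[set] = 1;
    cf_tag[set]   = tag;
    return 1;
}

/* Misses for one loop that reads A[i] and B[i] in the same iteration. */
unsigned misses_interleaved(void)
{
    unsigned misses = 0;
    memset(cf_valid, 0, sizeof cf_valid);
    for (unsigned i = 0; i < CF_N; i++) {
        misses += cf_access(CF_A_BASE + 4u * i);
        misses += cf_access(CF_B_BASE + 4u * i);
    }
    return misses;
}

/* Misses for a loop over A followed by a separate loop over B. */
unsigned misses_split(void)
{
    unsigned misses = 0;
    memset(cf_valid, 0, sizeof cf_valid);
    for (unsigned i = 0; i < CF_N; i++)
        misses += cf_access(CF_A_BASE + 4u * i);
    for (unsigned i = 0; i < CF_N; i++)
        misses += cf_access(CF_B_BASE + 4u * i);
    return misses;
}
```

Under these assumptions the combined loop misses on every one of its 2048 accesses, because each load of A's block evicts B's block and vice versa, while the two separate loops miss only once per block, 256 times in total.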
average_access_time = 0.875 * (1 cycle) + 0.125 * (10 cycles) = 2.125 cycles
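The weighted average above generalizes to any hit rate and any pair of access costs. A minimal sketch, with the hypothetical function name `average_access_time`:

```c
/* Average access time in cycles, given the hit rate and the cost in cycles
   of a hit and of a miss. */
double average_access_time(double hit_rate, double hit_cycles, double miss_cycles)
{
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}
```

For the numbers used here, `average_access_time(0.875, 1.0, 10.0)` evaluates to 2.125 cycles.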