Caches

Storage in a computer system comes in a variety of capacities, speeds, and costs. However, any given storage device can only be two of the following: (i) fast (low access latency), (ii) large (high capacity), and (iii) cheap. In other words, any storage device is either slow, small, or expensive. For example, a hard drive has a very large capacity and is fairly cheap (when considering dollars per megabyte of storage)---however, a hard drive is relatively very slow, taking thousands of processor clock cycles to return requested data.

A cache is a small and fast memory, which typically contains a subset of the current contents of main memory. There are often several levels of caches: for example, in a modern desktop there are typically three levels of cache, referred to as L1, L2, and L3. L1 is the smallest, fastest, and closest to the processor. L2 is larger but slower than L1. L1 and L2 are nowadays both integrated on the same chip as the processor, and there is often an L3 cache (slower and larger still) on a separate chip. In general, storage devices are slower, larger, and cheaper the farther away they are from the processor. For example, main memory can be thousands of times larger and tens of times slower than an on-chip cache. A hard drive can be hundreds of thousands of times slower than main memory.

Caches are interesting because they are invisible to the ISA and to the programmer: you generally do not know that they are there, and you do not have to manage them explicitly (with the exception of instructions such as ldwio/stwio, which bypass the caches for the purpose of communicating directly with I/O devices). Caches are hence a case of processor designers providing greater performance in the underlying implementation without changing the ISA. For example, in the NIOS environment you can generate systems with or without an L1 cache, and either way the ISA and the programs you write look the same. The introduction of the cache only improves the performance (hopefully) of your program as it runs.

Memory Metrics

When measuring memories and other storage devices, we will normally consider two metrics of performance. The first is latency: how long it takes from the start of a request until the response is received. For example, we might measure the latency of a load from memory, which would be measured in hundreds of CPU cycles (typically tens of nanoseconds). The second is bandwidth: the data transfer rate, in bits per second. Bandwidth is mostly determined by the width (in number of wires) and speed (in MHz or GHz) of the bus or other physical connection to the storage device, and by the design of the storage device itself. For example, DDR-SDRAM can deliver 2.1 GB/s over a data bus that is 133 MHz and 64 bits wide (64 wires).
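The 2.1 GB/s figure can be reproduced with a short calculation. A sketch in Python, assuming DDR's two data transfers per clock cycle (the "double" in double data rate):

```python
# Reproduce the DDR-SDRAM bandwidth figure quoted above.
bus_width_bits = 64            # 64 wires
clock_hz = 133_000_000         # 133 MHz bus clock
transfers_per_cycle = 2        # "double data rate": two transfers per cycle

bandwidth_bytes_per_s = clock_hz * transfers_per_cycle * (bus_width_bits // 8)
print(bandwidth_bytes_per_s / 1e9)  # about 2.1 GB/s
```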

Why Caches?

Caches exploit a common tendency of programs to exhibit locality. Locality refers to the fact that programs tend to re-use data and instructions near those they have used recently. Formally, there are two types of locality. The first is temporal locality (re-use), which implies that recently-referenced items are likely to be referenced again soon. The second is spatial locality (nearby), which implies that items with nearby addresses tend to be referenced close together in time. For example, consider the following simple code:

for (i = 0; i < n; i++) {
    sum += A[i];
}
What forms of locality does it contain? There is spatial locality in the data accesses of the array, since we access the array elements sequentially (one item at a time, in consecutive order), and these elements are next to each other in memory. Hence if we access A[5], spatial locality says that we are very likely to soon access A[6] and A[7] as well. There is also much locality in the instructions of the loop. The body of the loop is a sequence of instructions that are fetched (and executed) consecutively, which is a form of spatial locality. The loop body is also executed repeatedly as the loop iterates, which is a form of temporal locality (the instructions of the loop body are "re-used").

Caches exploit locality in two ways. First, by storing a subset of memory: the subset that is most likely to be re-used. Second, by grouping their contents into blocks to exploit spatial locality---if you access one part of a block, the entire block is moved into the cache, and you are likely to soon request a different part of that same block, which is now already present in the cache.

Cache Reality

A typical processor chip has separate L1 caches for instructions and data. This is because instructions and data tend to behave quite differently, and separating them allows the designer to build two different caches that each exploit the behaviour of what they hold. Having two caches also simplifies the design of the processor itself, since instructions are fetched from a separate structure than the one from which data is loaded and stored. A typical processor chip also has a unified L2 cache, which means that the L2 cache holds both instructions and data. At this level it is more economical to have only a single, larger cache that holds both.

Cache Mechanisms and Terminology

Before we delve into the design of caches, it is helpful to understand some terminology.

How a cache is used

The following describes the operation of a cache. In particular, we walk through the events that occur when the CPU performs a load of a certain address.
  1. the CPU performs a load from address $A
  2. if it is a "hit", return the value of location $A stored in the cache (DONE)
  3. if it is a "miss", retrieve the block containing location $A from memory
  4. place that block in the cache, replacing an existing block
  5. return the value of location $A that is now stored in the cache (DONE)
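The steps above can be sketched in Python. This is an idealized model keyed by block start address (it ignores limited capacity and replacement, which the cache organizations below add); the 16-byte block size is an assumption matching the later examples:

```python
BLOCK_SIZE = 16  # bytes per cache block (an assumption for this sketch)

def load(cache, memory, addr):
    # Step 1: the CPU performs a load from address addr.
    block_addr = addr & ~(BLOCK_SIZE - 1)   # start address of the block containing addr
    if block_addr in cache:
        # Step 2: "hit" -- return the value already stored in the cache.
        return cache[block_addr][addr - block_addr]
    # Step 3: "miss" -- retrieve the whole block from memory.
    block = memory[block_addr:block_addr + BLOCK_SIZE]
    # Step 4: place the block in the cache (a real cache might replace an existing block).
    cache[block_addr] = block
    # Step 5: return the value now stored in the cache.
    return cache[block_addr][addr - block_addr]

memory = bytearray(256)
memory[0x2e] = 0x78
cache = {}
print(hex(load(cache, memory, 0x2e)))  # miss, then returns 0x78
```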

Cache Implementations

Direct Mapped Cache

A direct mapped cache is the simplest type of cache. In a direct mapped cache, each memory location maps to a single specific location in the cache. Since the cache is much smaller than memory, this means that many memory locations map to any given location in the cache.

To access a cache we break up the bits of a memory address into three portions: the tag, set index, and offset. The offset portion indicates which byte within a cache block you are referring to---in other words it can be thought of as an index into the small array of bytes which comprises a cache block. The set index portion indicates how to locate a block within the cache. In other words it is used to tell us which set we are referring to, and we can think of it as an index into an array of sets in our cache. Finally, since several memory locations can map to each block in the cache, we use the tag portion to identify which memory location this block corresponds to. In other words, we use the tag as a unique identifier that tells us which of the several possible blocks we currently have in the corresponding set in the cache.
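As a sketch, the three fields can be extracted with shifts and masks, given the field widths t, s, and b in bits (the direct-mapped example below uses t = 8, s = 4, b = 4):

```python
def split_address(addr, t, s, b):
    """Split an address into (tag, set_index, offset) fields of t, s, and b bits."""
    offset = addr & ((1 << b) - 1)            # low b bits
    set_index = (addr >> b) & ((1 << s) - 1)  # next s bits
    tag = addr >> (b + s)                     # remaining high bits
    return tag, set_index, offset

tag, set_index, offset = split_address(0xface, 8, 4, 4)
print(hex(tag), hex(set_index), hex(offset))  # 0xfa 0xc 0xe
```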

Note that we cannot simply use a tag of 0x0 to indicate an empty or invalid cache block---the memory addresses with a tag of 0x0 are themselves valid addresses! Therefore every block in the cache has an associated valid bit. This bit is used to decide whether the block currently holds valid data: the valid bit is a 'one' if the cache block is holding valid data, and a 'zero' otherwise.

Assuming an address space of 2^n bytes (byte addressable), the size of a block in this cache is B = 2^b bytes and the number of sets is S = 2^s. For a direct mapped cache, S is also the total number of cache blocks in the cache---in other words, there is one block per set. The capacity of a direct mapped cache can be computed as B*S = 2^(b+s) bytes.
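A one-line check of the capacity formula, using the s = 4, b = 4 cache from the example that follows:

```python
def dm_capacity(b, s):
    # Capacity of a direct mapped cache: B * S = 2**b * 2**s = 2**(b + s) bytes.
    return 2 ** (b + s)

print(dm_capacity(4, 4))  # 256 bytes
```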


Example (Direct Mapped Cache)

Given a 16-bit address space, implement a direct-mapped cache where t = 8 bits, s = 4 bits, and b = 4 bits. Execute the following code using that cache, assuming that the cache is initially empty:

movia r8,0xface; ldb r8,0(r8) 
movia r8,0xface; ldb r8,0(r8) 
movia r8,0xfac0; ldb r8,0(r8)
movia r8,0xab00; ldb r8,0(r8)
movia r8,0xcd00; ldb r8,0(r8)
Also assume that all memory locations are initialized to zero, except for the following memory locations:
0xab00: 0x12
0xcd00: 0x25
0xfac0: 0x56
0xface: 0x78
First, let's determine the capacity of this cache, which is 2^(s+b) = 2^(4+4) = 256 bytes. Second, the number of sets is S = 2^s = 2^4 = 16 sets. Third, the block size is B = 2^b = 2^4 = 16 bytes.

Breakdown of an address:
Tag      Set Index   Offset
8 bits   4 bits      4 bits

Our cache is initially empty, so we can think of it as looking like this:
Set   V  Tag   Hex data values (bytes 15..00)
               15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0x0   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x2   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x3   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x4   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x5   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x6   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x8   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x9   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xa   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xb   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xc   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xd   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xe   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xf   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Let's go through the steps for accessing the cache for the first load:

movia r8,0xface; ldb r8,0(r8) 
We will perform the following steps:
  1. We break the load address 0xface into its components: tag = 0xfa, set-index = 0xc = 12, offset = 0xe = 14.
  2. We use the set-index to index the cache, which points us to set number 0xc (of sets 0 through 0xf).
  3. We first check the valid bit for set number 0xc. In this case the valid bit is zero, hence the cache block in set number 0xc is invalid, and we have a cache miss.
  4. We request the corresponding cache block from memory, i.e., the cache block that starts at address 0xfac0 and contains 16 bytes of data at addresses 0xfac0 through 0xfacf.
  5. After a delay, memory returns the requested cache block and we copy it into cache set 0xc. According to the initialization, our block will then have the value 0x00780000000000000000000000000056 (assuming little-endian, like NIOS).
  6. We also set the valid bit for set 0xc, and set the tag for set 0xc equal to 0xfa.
  7. Finally, we use the offset value 14 to index into the cache block and return the byte requested by the load, which is the byte at address 0xface, i.e., the byte at offset 0xe, i.e., the 14th byte in the block at set index 0xc---hence the load returns the value 0x78.

Here we show the contents of the cache after the first load instruction:
Set   V  Tag   Hex data values (bytes 15..00)
               15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0x0   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x1   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x2   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x3   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x4   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x5   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x6   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x8   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x9   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xa   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xb   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xc   1  0xfa  00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 56
0xd   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xe   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xf   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Let's go through the steps for accessing the cache for the second load:

movia r8,0xface; ldb r8,0(r8) 
We will perform the following steps:
  1. We break the load address 0xface into its components: tag = 0xfa, set-index = 0xc = 12, offset = 0xe = 14.
  2. We use the set-index to index the cache, which points us to set number 0xc (of sets 0 through 0xf).
  3. We first check the valid bit for set number 0xc, which is a one. We also check the tag for set number 0xc to see if it matches the tag from the load address 0xfa, and it does. Hence we have a cache hit.
  4. Finally, we use the offset value to index into the cache block and return the byte requested by the load, which is the byte at address 0xface, i.e., the byte at offset 0xe, i.e., the 14th byte in the block at set index 0xc---hence the load returns the value 0x78.

The previous load did not change the state of the cache. Let's go through the steps for accessing the cache for the third load:

movia r8,0xfac0; ldb r8,0(r8) 
We will perform the following steps:
  1. We break the load address 0xfac0 into its components: tag = 0xfa, set-index = 0xc = 12, offset = 0x0 = 0.
  2. We use the set-index to index the cache, which points us to set number 0xc (of sets 0 through 0xf).
  3. We first check the valid bit for set number 0xc, which is a one. We also check the tag for set number 0xc to see if it matches the tag from the load address 0xfa, and it does. Hence we have a cache hit.
  4. Finally, we use the offset value to index into the cache block and return the byte requested by the load, which is the byte at address 0xfac0, i.e., the byte at offset 0x0, i.e., the 0th byte in the block at set index 0xc---hence the load returns the value 0x56.

The previous load did not change the state of the cache. Let's go through the steps for accessing the cache for the fourth load:

movia r8,0xab00; ldb r8,0(r8)
We will perform the following steps:
  1. We break the load address 0xab00 into its components: tag = 0xab, set-index = 0x0 = 0, offset = 0x0 = 0.
  2. We use the set-index to index the cache, which points us to set number 0x0 (of sets 0 through 0xf).
  3. We first check the valid bit for set number 0x0. In this case the valid bit is zero, hence the cache block in set number 0x0 is invalid, and we have a cache miss.
  4. We request the corresponding cache block from memory, i.e., the cache block that starts at address 0xab00 and contains 16 bytes of data at addresses 0xab00 through 0xab0f.
  5. After a delay, memory returns the requested cache block and we copy it into cache set 0x0. According to the initialization, our block will then have the value 0x00000000000000000000000000000012 (assuming little-endian, like NIOS).
  6. We also set the valid bit for set 0x0, and set the tag for set 0x0 equal to 0xab.
  7. Finally, we use the offset value to index into the cache block and return the byte requested by the load, which is the byte at address 0xab00, i.e., the byte at offset 0x0, i.e., the 0th byte in the block at set index 0x0---hence the load returns the value 0x12.

Here we show the contents of the cache after the fourth load instruction:
Set   V  Tag   Hex data values (bytes 15..00)
               15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0x0   1  0xab  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12
0x1   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x2   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x3   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x4   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x5   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x6   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x8   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x9   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xa   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xb   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xc   1  0xfa  00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 56
0xd   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xe   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xf   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Let's go through the steps for accessing the cache for the fifth load:

movia r8,0xcd00; ldb r8,0(r8)
We will perform the following steps:
  1. We break the load address 0xcd00 into its components: tag = 0xcd, set-index = 0x0 = 0, offset = 0x0 = 0.
  2. We use the set-index to index the cache, which points us to set number 0x0 (of sets 0 through 0xf).
  3. We first check the valid bit for set number 0x0, which is a one. We also check the tag for set number 0x0 (currently 0xab) against the tag field of the load address (which is 0xcd)---since the tags do not match, this is a cache miss.
  4. We request the corresponding cache block from memory, i.e., the cache block that starts at address 0xcd00 and contains 16 bytes of data at addresses 0xcd00 through 0xcd0f.
  5. After a delay, memory returns the requested cache block and we copy it into cache set 0x0, replacing the current cache block (the block that started at address 0xab00). According to the initialization, our block will then have the value 0x00000000000000000000000000000025 (assuming little-endian, like NIOS).
  6. We also set the valid bit for set 0x0, and set the tag for set 0x0 equal to 0xcd.
  7. Finally, we use the offset value to index into the cache block and return the byte requested by the load, which is the byte at address 0xcd00, i.e., the byte at offset 0x0, i.e., the 0th byte in the block at set index 0x0---hence the load returns the value 0x25.

Here we show the final contents of the cache (after the fifth load instruction):
Set   V  Tag   Hex data values (bytes 15..00)
               15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0x0   1  0xcd  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 25
0x1   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x2   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x3   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x4   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x5   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x6   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x7   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x8   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x9   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xa   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xb   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xc   1  0xfa  00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 56
0xd   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xe   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0xf   0  0x00  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
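The whole worked example can be replayed with a small direct-mapped cache simulator. A minimal sketch in Python (the ldb name merely echoes the NIOS instruction; memory is a dict that defaults to zero, matching the example's initialization):

```python
S, B = 16, 16    # 16 sets, 16-byte blocks (t = 8, s = 4, b = 4)

memory = {0xab00: 0x12, 0xcd00: 0x25, 0xfac0: 0x56, 0xface: 0x78}
cache = [{"valid": False, "tag": 0, "data": [0] * B} for _ in range(S)]

def ldb(addr):
    offset = addr & 0xF
    index = (addr >> 4) & 0xF
    tag = addr >> 8
    line = cache[index]
    hit = line["valid"] and line["tag"] == tag
    if not hit:
        # Miss: fetch the 16-byte block from memory, replacing the current block.
        base = addr & ~0xF
        line["data"] = [memory.get(base + i, 0) for i in range(B)]
        line["tag"] = tag
        line["valid"] = True
    return line["data"][offset], "hit" if hit else "miss"

results = [ldb(a) for a in (0xface, 0xface, 0xfac0, 0xab00, 0xcd00)]
print(results)  # miss, hit, hit, miss, miss -- values 0x78, 0x78, 0x56, 0x12, 0x25
```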


Fully-Associative Cache

A fully-associative cache allows any cache block to reside in any cache entry. You can think of a fully-associative cache as having one large set. A fully-associative cache is very flexible, but has the drawback of having to compare with all of its tags to decide whether a given access is a hit or a miss.

To access a fully-associative cache we break up the bits of a memory address into only two portions: the tag and the offset (since there is effectively only one set, we do not need any bits to index it). The offset portion indicates which byte within a cache block you are referring to. We use the tag portion to identify which memory location a block corresponds to.
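A sketch of the two-field breakdown, with b the number of offset bits (16-byte blocks, as in the example below):

```python
def split_fa(addr, b):
    # A fully-associative address has only a tag and an offset.
    return addr >> b, addr & ((1 << b) - 1)

tag, offset = split_fa(0xface, 4)
print(hex(tag), hex(offset))  # 0xfac 0xe
```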

Example (fully-associative cache)

Given a 16-bit address space, implement a fully-associative cache with 16-byte blocks and a total capacity of 64 bytes. Execute the following code using that cache, assuming that the cache is initially empty:

movia r8,0xface; ldb r8,0(r8) 
movia r8,0xab00; ldb r8,0(r8)
movia r8,0xcd00; ldb r8,0(r8)
Also assume that all memory locations are initialized to zero, except for the following memory locations:
0xab00: 0x12
0xcd00: 0x25
0xface: 0x78

Since this is a fully-associative cache, there is only one set, and in the breakdown of the address no bits are designated as set-index bits. Each cache block is 16 bytes, hence 4 bits of the address are used as offset bits. The remaining 12 bits of the address are used as tag bits, giving the following breakdown:

Breakdown of an address:
Tag       Offset
12 bits   4 bits

The resulting cache looks like this:
V  Tag    Hex data values (bytes 15..00)
          15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
Set 0:
0  0x000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0  0x000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0  0x000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0  0x000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Let's go through the steps for accessing the cache for the first load:

movia r8,0xface; ldb r8,0(r8) 
We will perform the following steps:
  1. We break the load address 0xface into its components: tag = 0xfac, offset = 0xe = 14.
  2. For every block in the cache whose valid bit is set, we must also check the tag for a match. Since our cache is empty and all valid bits are zero, we know that this is a cache miss.
  3. We request the corresponding cache block from memory, i.e., the cache block that starts at address 0xfac0 and contains 16 bytes of data at addresses 0xfac0 through 0xfacf.
  4. After a delay, memory returns the requested cache block and we copy it into the first available cache block. According to the initialization, our block will then have the value 0x00780000000000000000000000000000 (assuming little-endian, like NIOS).
  5. We also set the valid bit for this cache block, and set the tag to be 0xfac.
  6. Finally, we use the offset value 14 to index into the cache block and return the byte requested by the load, which is the byte at address 0xface, i.e., the byte at offset 0xe, i.e., the 14th byte in the block---hence the load returns the value 0x78.

Here we show the contents of the cache after the first load instruction:
V  Tag    Hex data values (bytes 15..00)
          15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
Set 0:
1  0xfac  00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0  0x000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0  0x000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0  0x000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Let's go through the steps for accessing the cache for the second load:

movia r8,0xab00; ldb r8,0(r8) 
We will perform the following steps:
  1. We break the load address 0xab00 into its components: tag = 0xab0, offset = 0x0 = 0.
  2. For every block in the cache whose valid bit is set, we must also check the tag for a match. The only block with its valid bit set has a different tag (0xfac), hence this is a miss.
  3. We request the corresponding cache block from memory, i.e., the cache block that starts at address 0xab00 and contains 16 bytes of data at addresses 0xab00 through 0xab0f.
  4. After a delay, memory returns the requested cache block and we copy it into the first available cache block. According to the initialization, our block will then have the value 0x00000000000000000000000000000012 (assuming little-endian, like NIOS).
  5. We also set the valid bit for this cache block, and set the tag to be 0xab0.
  6. Finally, we use the offset value 0x0 to index into the cache block and return the byte requested by the load, which is the byte at address 0xab00, i.e., the byte at offset 0x0, i.e., the 0th byte in the block---hence the load returns the value 0x12.

Here we show the contents of the cache after the second load instruction:
V  Tag    Hex data values (bytes 15..00)
          15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
Set 0:
1  0xfac  00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1  0xab0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12
0  0x000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0  0x000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Let's go through the steps for accessing the cache for the third load:

movia r8,0xcd00; ldb r8,0(r8)
We will perform the following steps:
  1. We break the load address 0xcd00 into its components: tag = 0xcd0, offset = 0x0 = 0.
  2. For every block in the cache whose valid bit is set, we must also check the tag for a match. The two blocks with their valid bits set have different tags (0xfac and 0xab0), hence this is also a miss.
  3. We request the corresponding cache block from memory, i.e., the cache block that starts at address 0xcd00 and contains 16 bytes of data at addresses 0xcd00 through 0xcd0f.
  4. After a delay, memory returns the requested cache block and we copy it into the first available cache block. According to the initialization, our block will then have the value 0x00000000000000000000000000000025 (assuming little-endian, like NIOS).
  5. We also set the valid bit for this cache block, and set the tag to be 0xcd0.
  6. Finally, we use the offset value 0x0 to index into the cache block and return the byte requested by the load, which is the byte at address 0xcd00, i.e., the byte at offset 0x0, i.e., the 0th byte in the block---hence the load returns the value 0x25.

Here we show the contents of the cache after the third load instruction:
V  Tag    Hex data values (bytes 15..00)
          15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
Set 0:
1  0xfac  00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00
1  0xab0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12
1  0xcd0  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 25
0  0x000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
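The fully-associative example can likewise be replayed in a few lines of Python. A sketch, modelling the cache as a list of at most four (tag, data) pairs filled in order (no eviction is needed for these three loads):

```python
memory = {0xab00: 0x12, 0xcd00: 0x25, 0xface: 0x78}
cache = []   # up to 4 blocks of 16 bytes (64-byte capacity)

def ldb(addr):
    tag, offset = addr >> 4, addr & 0xF
    for t, data in cache:        # compare the tag against every valid block
        if t == tag:
            return data[offset], "hit"
    # Miss: fetch the whole 16-byte block and fill the first available entry.
    base = addr & ~0xF
    data = [memory.get(base + i, 0) for i in range(16)]
    cache.append((tag, data))
    return data[offset], "miss"

results = [ldb(a) for a in (0xface, 0xab00, 0xcd00)]
print(results)  # three misses, returning 0x78, 0x12, 0x25
```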


Set-Associative Cache

A set-associative cache is a hybrid between a fully-associative cache and a direct-mapped cache. Its goal is to be somewhat flexible but still fast to access. A set-associative cache has a number of ways (say, W ways) within each set, meaning that there are W cache blocks within each set, and any memory block that maps to a certain set can be placed in any of the cache blocks inside that set. Hence such a cache is called a W-way set-associative cache.

To access a set-associative cache we break up the bits of a memory address into three portions: the tag, set index, and offset. The offset portion indicates which byte within a cache block you are referring to. The set index portion indicates how to locate a set within the cache. Finally, we use the tag portion to identify which memory location a block corresponds to within a given set.

Assuming an address space of 2^n bytes (byte addressable), the size of a block in this cache is B = 2^b bytes, the number of sets is S = 2^s, and the total number of cache blocks is W * S---in other words, there are W blocks per set. The capacity of a set-associative cache can be computed as B*S*W. If W is a power of 2 (which it might not be), say W = 2^w, then the capacity can also be computed as 2^(b+s+w) bytes.
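The geometry can be computed from the capacity equation. A sketch (note that W = 1 recovers the direct-mapped case):

```python
import math

def sa_geometry(capacity, block_size, ways, addr_bits):
    S = capacity // (block_size * ways)   # capacity = B * S * W, solved for S
    b = int(math.log2(block_size))        # offset bits
    s = int(math.log2(S))                 # set-index bits
    t = addr_bits - s - b                 # the rest of the address is the tag
    return S, b, s, t

print(sa_geometry(256, 16, 2, 16))  # (8, 4, 3, 9): the 2-way example below
print(sa_geometry(256, 16, 1, 16))  # (16, 4, 4, 8): the direct-mapped example above
```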

Example (set-associative cache)

Given a 16-bit address space, implement a 2-way set-associative cache with 16-byte blocks and a total capacity of 256 bytes. Execute the following code using that cache, assuming that the cache is initially empty:

movia r8,0xface; ldb r8,0(r8) 
movia r8,0xab00; ldb r8,0(r8)
movia r8,0xcd00; ldb r8,0(r8)
Also assume that all memory locations are initialized to zero, except for the following memory locations:
0xab00: 0x12
0xcd00: 0x25
0xface: 0x78

First, let's determine the address breakdown. The capacity is 256 bytes, the block size is 16 bytes, and W is 2 ways. We know that capacity = B*S*W, hence 256 = 16 * S * 2. Solving for S gives 256/(16*2) = 8 sets. Hence we have b = 4 offset bits, s = 3 set-index bits, and therefore t = 9 tag bits. The breakdown of an address is:
Tag      Set Index   Offset
9 bits   3 bits      4 bits

Because s and t are not multiples of four, it would take more effort to represent those values in hex---hence we instead show them in binary. We can visualize this cache as:
Set     V  Tag          Hex data values (bytes 15..00)
                        15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0b000
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b001
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b010
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b011
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b100
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b101
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b110
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b111
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Let's go through the steps for accessing the cache for the first load:

movia r8,0xface; ldb r8,0(r8) 
We will perform the following steps:
  1. We break the load address 0xface (0b1111 1010 1100 1110) into its components: tag = 0b1111 1010 1; set index = 0b100; offset = 0b1110 = 0xe.
  2. We use the set-index to index the cache, which points us to set number 0b100 (of sets 0 through 0b111).
  3. We first check the valid bits and tags for set number 0b100. In this case the valid bits for all cache blocks in that set are zero, hence we have a cache miss.
  4. We request the corresponding cache block from memory, i.e., the cache block that starts at address 0xfac0 and contains 16 bytes of data at addresses 0xfac0 through 0xfacf.
  5. After a delay, memory returns the requested cache block and we copy it into the first available block in cache set 0b100. According to the initialization, our block will then have the value 0x00780000000000000000000000000000 (assuming little-endian, like NIOS).
  6. We also set the valid bit for set 0b100, and set the tag for set 0b100 equal to 0b111110101.
  7. Finally, we use the offset value 0xe to index into the cache block and return the byte requested by the load, which is the byte at address 0xface, i.e., the byte at offset 0xe, i.e., the 14th byte in the block at set index 0b100---hence the load returns the value 0x78.
Here we show the contents of the cache after the first load instruction:
Set     V  Tag          Hex data values (bytes 15..00)
                        15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0b000
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b001
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b010
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b011
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b100
        1  0b111110101  00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b101
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b110
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b111
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
        0  0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Let's go through the steps for accessing the cache for the second load:

movia r8,0xab00; ldb r8,0(r8)
We will perform the following steps:
  1. We break the load address 0xab00 (0b1010 1011 0000 0000) into its components: tag = 0b1010 1011 0; set index = 0b000; offset = 0b0000 = 0x0.
  2. We use the set-index to index the cache, which points us to set number 0b000 (of sets 0 through 0b111).
  3. We first check the valid bits and tags for set number 0b000. In this case the valid bits are zero, hence we have a cache miss.
  4. We request the corresponding cache block from memory, i.e., the cache block that starts at address 0xab00 and contains 16 bytes of data at addresses 0xab00 through 0xab0f.
  5. After a delay, memory returns the requested cache block and we copy the cache block into the first available block in cache set 0b000. According to the initialization, our block will then have the value: 0x00000000000000000000000000000012 (assuming little-endian like NIOS).
  6. We also set the valid bit for set 0b000, and set the tag for set 0b000 equal to 0b101010110.
  7. Finally, we use the offset value 0x0 to index into the cache block and return the byte requested by the load, which is the byte at address 0xab00, i.e., the byte at offset 0x0, i.e., the 0th byte in the block at set index 0b000---hence the load returns the value 0x12.
Here we show the contents of the cache after the second load instruction:
Set Index  Valid?  Tag          Hex data values (for bytes 15..0)
                                15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0b000      1       0b101010110  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b001      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b010      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b011      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b100      1       0b111110101  00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b101      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b110      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b111      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Let's go through the steps for accessing the cache for the third load:

movia r8,0xcd00; ldb r8,0(r8)
We will perform the following steps:
  1. We break the load address 0xcd00 (0b1100 1101 0000 0000) into its components: tag = 0b1100 1101 0; set index = 0b000; offset = 0b0000 = 0x0.
  2. We use the set-index to index the cache, which points us to set number 0b000 (of sets 0 through 0b111).
  3. We first check the valid bits and tags for set number 0b000. The cache block with a set valid bit has a tag that does not match (0b101010110), hence this is a cache miss.
  4. We request the corresponding cache block from memory, i.e., the cache block that starts at address 0xcd00 and contains 16 bytes of data at addresses 0xcd00 through 0xcd0f.
  5. After a delay, memory returns the requested cache block and we copy the cache block into the first available block in cache set 0b000. According to the initialization, our block will then have the value: 0x00000000000000000000000000000025 (assuming little-endian like NIOS).
  6. We also set the valid bit for the cache block and set the tag equal to 0b110011010.
  7. Finally, we use the offset value 0x0 to index into the cache block and return the byte requested by the load, which is the byte at address 0xcd00, i.e., the byte at offset 0x0, i.e., the 0th byte in the block at set index 0b000---hence the load returns the value 0x25.
Here we show the contents of the cache after the third load instruction:
Set Index  Valid?  Tag          Hex data values (for bytes 15..0)
                                15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
0b000      1       0b101010110  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 12
           1       0b110011010  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 25
0b001      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b010      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b011      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b100      1       0b111110101  00 78 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b101      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b110      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0b111      0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
           0       0b000000000  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00


Handling Writes

In cache design an important question is how to handle writes to the cache. This can be tricky, because the cache essentially holds a copy of the corresponding blocks from memory. If you allow stores to modify the blocks in the cache, then they may not be the same as the corresponding blocks in memory (i.e., a load might return two different results depending on whether it accessed the data in the cache or memory).

There are two main options for handling writes to caches. The first is called write-through, where for every store both the cache and the next level of the memory hierarchy are updated (e.g., memory itself). This is good because both memory and the cache are kept consistent and will always hold the same values for corresponding locations. However, it can be bad because every store results in communication to the memory, reducing the potential benefits of locality.

A second option is called write-back, and for this option only the cache is updated on stores: memory is only updated whenever a cache block is evicted from the cache. This way the amount of communication between the cache and memory is reduced; however, the cache and memory are no longer always consistent. Implementing write-back also requires the addition of a bit per cache block to track whether that cache block has been modified, typically called a dirty bit. Whenever a cache block is stored to, we set the dirty bit for that cache block. When we replace a cache block with its dirty bit set, we must first write back that cache block to memory, otherwise we would lose the only up-to-date copy of that cache block.


Replacement Strategies

For fully- and set-associative caches, a key design question is which block to replace when there is a miss. The ideal algorithm for deciding which block to replace would choose the block which is not going to be used for the longest time (i.e., will be used at the farthest point in the future). Unfortunately, such a scheme requires knowing the future, which is generally impossible. A common realistic algorithm is called Least-Recently Used (LRU), where you replace the block which has been used least recently (the longest ago in time). In practice this algorithm provides a good approximation to the ideal algorithm.

To implement the LRU algorithm requires that you either (i) keep blocks in sorted order (according to LRU), or (ii) encode and track the order of blocks. The first option requires no additional storage, but can be fairly slow. The second option is faster and more common, but requires extra bits to encode the LRU state (i.e., to track the relative LRU order of all blocks in a W-way set, which requires W * log2 W bits per set). For example, a 4-way set-associative cache would require 4 * log2 4 bits = 4 * 2 bits = 8 bits per cache set of tracking state.

There are other schemes to decide replacement, such as random, where you pick a random block---this scheme works surprisingly well and requires no extra state. A scheme commonly used in modern processors is called random-but-not-MRU, where MRU stands for Most-Recently Used. In this scheme you replace a random block so long as it is not the MRU block; if you happen to pick the MRU block, you pick again. Tracking the MRU block requires only log2 W bits per set (to identify which block is currently the MRU block). For example, a 4-way set-associative cache would require log2 4 bits = 2 bits per cache set of tracking state.


Cache Storage Requirements

It is important to reason about the total storage requirements for the design of a cache in bits, including the bits for raw data as well as all of the different overhead bits for its implementation. For example, assuming a set-associative cache with LRU replacement we have the following:

TOTAL_CACHE_SIZE = TOTAL_CAPACITY + TOTAL_OVERHEAD

TOTAL_OVERHEAD = TAG_BITS + VALID_BITS + DIRTY_BITS + LRU_BITS

Example

For a 16-bit address space machine, how many bits total are required to implement a 6kB 3-way set-associative, write-back cache with 32-byte blocks and random-but-not-MRU replacement?

TOTAL_CAPACITY = 6kB = 6*1024B = 6*1024*8 bits = 49152 bits
num_blocks B = TOTAL_CAPACITY / block_size = 6kB / 32B = 192 blocks
num_sets S = 192 blocks / W ways = 192 blocks / 3 ways = 64 sets => s = 6 set bits
B = 32bytes => b = 5 offset bits

Hence the breakdown of an address is:
Tag     Set Index  Offset
5 bits  6 bits     5 bits

TAG_BITS = num_blocks * 5 bits = 192 blocks * 5 bits = 960 bits
VALID_BITS = num_blocks * 1 bit = 192 bits
DIRTY_BITS = num_blocks * 1 bit = 192 bits
random-but-not-MRU_BITS = num_sets * log2 W bits = 64 * log2 3 bits = 64 * 2 bits (rounding up) = 128 bits

TOTAL_CACHE_SIZE = 49152 bits + 960 bits + 192 bits + 192 bits + 128 bits = 50624 bits = 6328 bytes =~ 6.18kB


Types of Cache Misses

A given cache miss may occur for a number of different reasons. First, the cache may be initially empty (i.e., all blocks are invalid) when a program starts. Such cache misses are called cold misses, because in a sense the cache is cold when the program first starts. Once the cache is warmed up, the program might access much more data than can fit in the cache: for example, think of a program that accesses a 1MB array over and over, when the cache is only 1kB. Such misses are called capacity misses, and can only be addressed by building a larger cache. Finally, when a program accesses two memory blocks that map to the same cache block, these two blocks are said to conflict, and depending on the design of the cache this may result in conflict misses. For example, in a direct-mapped cache, if you repeatedly access two memory blocks that map to the same cache set then you will definitely suffer from conflict misses, because a direct-mapped cache can only hold one cache block per set. For a 2-way set-associative cache, you have to repeatedly access at least three memory blocks that map to the same cache set to suffer conflict misses.


Access Patterns

As a programmer it is important to be able to quickly reason about how a given part of a program will behave with a given cache. Here are some examples of different code and the corresponding cache behaviour.

Example1

Assuming a 1KB direct-mapped cache with 32B cache blocks that is initially empty, how many misses will there be for this loop?
for (i=0;i<1024;i++){
  sum += A[i];
}
Assume that each element of the array A[i] is 4B.

First, we will note that each cache miss will cause a 32B block to be loaded from memory, hence 8 array elements will be loaded (since each array element is 4B). Specifically, if we read A[0] and it is a miss, a block containing A[0] through A[7] will be loaded into the cache. Therefore one in eight accesses will miss in the cache, therefore the number of misses is 1024/8 = 128.

miss_rate = num_misses / num_accesses = 128/1024 = 12.5%
hit_rate = num_hits/num_accesses = 1 - miss_rate = 87.5%

Example2

Assume the same code and setup as Example1; what happens if the loop is executed again (i.e., the cache is already warmed up after one execution of the loop)?

We need to reason about the total size of the array relative to the total size of the cache. Since the array is 1024*4B = 4KB, while the cache is only 1KB, we know that the entire array does not fit---only 1/4 of the array fits at one time. From the first execution of the loop, the last quarter of the array (A[768] through A[1023]) will be in the cache when we start the second execution of the loop, which will first access A[0] and so on. Therefore there will be no locality across executions of the loop for the cache to exploit, and the second execution will miss just as often as the first. If instead the cache was 4KB or larger, then subsequent executions of the loop would produce all cache hits.

Example3

Assume the same cache from example1, and the following code:
for (i=0;i<1024;i++){
  sum += A[i] + B[i];
}
Also assume that A[i] maps to the same set in the cache as B[i]

Let's trace through what happens for the first couple of accesses. The access to A[0] misses, and the block holding A[0] through A[7] is loaded into its set. The access to B[0] maps to the same set, misses, and evicts A's block. The access to A[1] then misses again (its block was just evicted), loading A's block and evicting B's block; the access to B[1] misses for the same reason, and so on for every access.

Hence the miss rate will be 100%! This behaviour of constant conflicts is called thrashing. There are three ways we could fix this pathological behaviour:
  1. move B[] elsewhere in memory such that B[i] does not map to the same set in the cache as A[i]
  2. break the loop into two loops, such as:
    for(...){sum+=A[i];}  for(...){sum+=B[i];}
  3. use set associative cache with 2-ways or more

Cache Performance

Finally, we can think about what impact caches might have on performance. Given the hit or miss rate and some information about the latencies inside the memory system of the processor, we can compute the average memory access time of a program. In particular we have:

average_access_time = hit_rate * hit_latency + miss_rate * miss_latency

Example

Assume the result from Example1 above (a miss rate of 12.5%). Assume a 1-cycle L1 hit latency and a 10-cycle L2 hit latency, and that no access misses the L2 cache. What is the average access time in processor cycles?

average_access_time = 0.875 * (1 cycle) + 0.125 * (10 cycles) = 2.125 cycles


Greg Steffan
Last modified: Mon Feb 25 14:52:41 EST 2008