University of Toronto

ECE1388F – VLSI Design Methodology

Final Project: Cache Chip Design

Jennifer Pham
Cintia Man
Shahriar Shahramian
Oleksiy Tyshchenko

January 31, 2005


1.          Introduction

This document describes the design of a synchronous write-through cache memory that implements a least recently used (LRU) replacement algorithm for efficient data access.  In the LRU algorithm, the content of the least recently used cell is replaced whenever a new block must be written into the cache, while the content of the most recently used cells remains untouched.  The cache chip consists of an address path and a data path. The address path maps the global memory address (typically an address in external DRAM) to the physical location of the data in the cache. The data path returns the content of that physical location to the requesting device (typically a processor or controller).  The address path is implemented as a CAM (Content Addressable Memory) block, which can search its entire content in a single clock cycle.  The data path is implemented as an SRAM (Static Random Access Memory) block.  The overall architecture integrates the two blocks, which reduces hardware and improves the performance of the cache chip.

 

2.          System Functionality

Figure 2.1 shows the overall block diagram of the cache chip. The processor is assumed to have a 16-bit address space. The address bits (15:0) are latched by flip-flops before they are sent to the CAM and Decoder units. The 14 most significant address bits are sent to the CAM unit, and the remaining 2 bits are decoded into 4 select lines that choose the desired SRAM block through the SRAM Column MUX/DEMUX block. For instance, suppose the data for address “AB72” (hexadecimal) is to be stored in the Cache. The upper 14 bits identify the block starting at “AB70”, and the remaining 2 bits, “10” (binary), are decoded. However, not only the word at “AB72” is fetched; 4 words are brought in at a time. In this case, the data from “AB70”, “AB71”, “AB72” and “AB73” are all brought into the Cache. Thus one row of the CAM (out of 256 rows) points to 4 blocks of SRAM holding consecutive words, selected by the 2 least significant address bits. The Column MUX/DEMUX block is in charge of accessing these SRAM blocks. Since the DATA port of the Cache is bidirectional, tri-state buffers are used to interface the Cache to the processor. To reduce hardware, the SENSE AMPS for the SRAM are placed after the COLUMN MUX/DEMUX. For read operations, the CAM (which holds the 14 most significant bits of the address) selects the appropriate SRAM row, and the 2 least significant bits choose the appropriate word, which is returned to the processor. The total CAM and SRAM sizes are therefore:

 

·         CAM: 256 Rows x 14 Bits

·         SRAM: 256 Rows x 4 Blocks x 32 Bits

 

Figure 2.1: System Block Diagram
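
The address split and the array sizes described above can be checked with a short Python sketch. This is only an illustration of the arithmetic, not the chip's logic: the function names and the ordering of the one-hot select lines (offset 0 drives the lowest line) are assumptions made for the example.

```python
TAG_BITS = 14       # upper address bits stored in the CAM
OFFSET_BITS = 2     # lower address bits selecting one of 4 words
ROWS = 256
WORDS_PER_ROW = 1 << OFFSET_BITS
WORD_BITS = 32


def split_address(addr):
    """Split a 16-bit address into (tag, word offset, one-hot column select)."""
    if not 0 <= addr < (1 << (TAG_BITS + OFFSET_BITS)):
        raise ValueError("address must fit in 16 bits")
    tag = addr >> OFFSET_BITS                 # 14 bits sent to the CAM
    offset = addr & (WORDS_PER_ROW - 1)       # 2 bits sent to the decoder
    one_hot = 1 << offset                     # assumed select-line ordering
    return tag, offset, one_hot


def block_addresses(addr):
    """Return the 4 consecutive addresses fetched together with addr."""
    base = addr & ~(WORDS_PER_ROW - 1)
    return [base + i for i in range(WORDS_PER_ROW)]


tag, offset, one_hot = split_address(0xAB72)
print(f"tag={tag:014b} offset={offset:02b} select={one_hot:04b}")
print([f"{a:04X}" for a in block_addresses(0xAB72)])   # AB70 .. AB73
print("CAM bits: ", ROWS * TAG_BITS)                   # 3584
print("SRAM bits:", ROWS * WORDS_PER_ROW * WORD_BITS)  # 32768
```

For address “AB72” the sketch gives offset “10” and the block “AB70” to “AB73”, matching the example in the text. The SRAM total of 32768 bits corresponds to 4096 bytes of data storage.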

 

 

To replace an element in the Cache, an LRU algorithm is employed. A custom LRU CAM block has been designed whose CAM elements are T flip-flops connected to form a saturating 5-bit counter. This ensures that data accessed within roughly the last 32 clock cycles is not replaced ahead of data that has gone unaccessed for longer. The LRU CAM block also has 256 rows and keeps track of the accesses made to the Cache. When an element in the Cache needs to be replaced, the LRU CAM is searched with bit patterns starting from “11111”. If a row contains this value, its counter has saturated: the row has not been accessed by the processor for at least 31 cycles and can be replaced. If no hits are detected, the next search pattern is “1111X”, which matches any row that has not been accessed for at least 30 cycles. This continues until a hit is found, and that row is replaced. It is possible, however, that multiple rows have gone unaccessed for the same number of cycles, in which case multiple hits come from the CAM although only one is needed. To resolve this, an LRU DECISION block has been designed. This block is a small state machine that scans the HIT LINES of the LRU CAM and selects the first hit it encounters. Although this is a multi-clock operation, it only takes place when main memory must be accessed, which is itself a multi-clock-cycle operation. Thus, the speed of the Cache is not compromised.
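
The pattern search can be sketched in Python as follows. This is a behavioural illustration under stated assumptions, not the circuit itself: it assumes the counters age by one on every cycle, that the pattern sequence continues as 11111, 1111X, 111XX and so on down to XXXXX, and that "first hit" means the lowest row index. The function names are hypothetical.

```python
COUNTER_BITS = 5
COUNTER_MAX = (1 << COUNTER_BITS) - 1   # 0b11111, the saturation value


def tick(counters, accessed_row=None):
    """Age every row by one cycle (saturating); reset the accessed row to 0."""
    aged = [min(c + 1, COUNTER_MAX) for c in counters]
    if accessed_row is not None:
        aged[accessed_row] = 0
    return aged


def search_patterns():
    """Yield (value, mask) pairs for 11111, 1111X, 111XX, ..., XXXXX."""
    for dont_care in range(COUNTER_BITS + 1):
        mask = (COUNTER_MAX >> dont_care) << dont_care   # 1 = bit must match
        yield COUNTER_MAX & mask, mask


def find_lru_row(counters):
    """Return the row to replace: first hit of the first pattern that hits."""
    for value, mask in search_patterns():
        hits = [(c & mask) == value for c in counters]   # LRU CAM hit lines
        if any(hits):
            return hits.index(True)                      # LRU DECISION: first hit
    raise ValueError("no rows to search")


print(find_lru_row([3, 31, 17, 30, 31]))   # 11111 hits rows 1 and 4; row 1 is chosen
print(find_lru_row([3, 12, 30, 9]))        # 11111 misses; 1111X hits row 2
```

Because each pattern groups several counter values together, the row selected is the first member of the oldest group rather than the single oldest row.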

 

The overall algorithmic behavior of the Cache chip during READ/WRITE cycles is as follows:

 

READ – HIT

 

1.      The CAM is searched for the address given by the processor, and the address is found in one of the rows.

2.      The 2 least significant bits of the address are used to select the appropriate SRAM block.

3.      The data is read from SRAM and sent back to the processor.

4.      The corresponding LRU CAM ROW COUNTER is reset to “00000”.

 

READ – MISS

 

1.      The CAM is searched for the address given by the processor; the address is not found.

2.      The processor is notified and the LRU CAM is searched with patterns to find the LRU element.

3.      In the meantime, the 4 words of data corresponding to the 14 most significant bits of the address are brought from main memory.

4.      The data from the main memory is also sent to the processor.

5.      The LRU DECISION block selects the LRU CAM row; the new address is written into that row while the 4 new words are stored in the same row of the SRAM.

6.      The corresponding LRU CAM ROW COUNTER is reset to “00000”.

 

WRITE – HIT

 

1.      The CAM is searched for the address given by the processor, and the address is found in one of the rows.

2.      The 2 least significant bits of the address are used to select the appropriate SRAM block.

3.      The new data coming from the processor is written to the SRAM.

4.      In the meantime, the data in main memory is also updated. This is necessary because the Cache is a write-through Cache.

5.      The corresponding LRU CAM ROW COUNTER is reset to “00000”.

 

WRITE – MISS

 

1.      The CAM is searched for the address given by the processor; the address is not found.

2.      The processor is notified and the LRU CAM is searched with patterns to find the LRU element.

3.      In the meantime, the 4 words of data corresponding to the 14 most significant bits of the address are updated in main memory and brought into the Cache.

4.      The LRU DECISION block selects the LRU CAM row; the new address is written into that row while the 4 new words are stored in the same row of the SRAM.

5.      The corresponding LRU CAM ROW COUNTER is reset to “00000”.
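
The four cases above can be summarized in a small behavioural model in Python. It is a sketch under assumptions, not a description of the hardware: the class and method names are hypothetical, main memory is modelled as a plain list of words, one access counts as one aging step, empty rows start with saturated counters so they are filled first, and the victim is simply the first row with the largest counter (a simplification of the pattern search and LRU DECISION block).

```python
class WriteThroughCache:
    """Behavioural model: `tags` plays the CAM, `data` plays the SRAM."""

    def __init__(self, memory, rows=256, words_per_row=4):
        self.memory = memory                    # hypothetical main memory
        self.words = words_per_row
        self.tags = [None] * rows               # CAM rows (None = empty)
        self.data = [[0] * words_per_row for _ in range(rows)]
        self.age = [31] * rows                  # LRU counters, start saturated

    def _lookup(self, addr):
        tag, offset = divmod(addr, self.words)  # upper bits / lower 2 bits
        row = self.tags.index(tag) if tag in self.tags else None
        return tag, offset, row

    def _tick(self, row):
        self.age = [min(a + 1, 31) for a in self.age]
        self.age[row] = 0                       # counter reset to 00000

    def _fill(self, tag):
        row = self.age.index(max(self.age))     # simplified victim choice
        base = tag * self.words
        self.tags[row] = tag
        self.data[row] = list(self.memory[base:base + self.words])
        return row

    def read(self, addr):
        tag, offset, row = self._lookup(addr)
        hit = row is not None
        if not hit:
            row = self._fill(tag)               # READ MISS: fetch 4 words
        self._tick(row)
        return self.data[row][offset], hit

    def write(self, addr, value):
        tag, offset, row = self._lookup(addr)
        hit = row is not None
        self.memory[addr] = value               # write-through in both cases
        if hit:
            self.data[row][offset] = value      # WRITE HIT: update SRAM
        else:
            row = self._fill(tag)               # WRITE MISS: fetch updated block
        self._tick(row)
        return hit


memory = list(range(64))
cache = WriteThroughCache(memory, rows=2)
print(cache.read(6))        # (6, False): miss, block 4..7 loaded
print(cache.read(5))        # (5, True): hit in the same row
print(cache.write(6, 99))   # True: hit, SRAM and memory both updated
print(memory[6])            # 99
```

A two-row cache is used in the usage lines only to keep the trace short; the defaults correspond to the 256-row, 4-word organization described earlier.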

 

In terms of division of tasks, here is an estimate:

·         LRU Block Circuitry: Shahriar

·         CAM Block Circuitry: Oleksiy

·         SRAM Column Circuitry: Jen

·         Peripheral Circuitry for LRU, CAM, SRAM: Cintia

·         System Integration: All team members

 


3.          LRU Design

a)  Circuit Schematic

Figure 3.1: LRU CAM Cell

Figure 3.2: LRU CAM Row

Figure 3.3: LRU Decision Block Schematic

b)  Cell Layout

Figure 3.4: LRU CAM Cell Layout

Figure 3.5: LRU Decision Block Layout

c)  Simulation Results


4.          CAM Design

a)  Circuit Schematic

Figure 4.1: CAM Architecture

Figure 4.2: CAM 10T Cell

Figure 4.3: CAM Match Line Sense Amplifier (MLSA)

b)  Cell Layout

Figure 4.4: CAM Cell Layout

Figure 4.5: MLSA Layout

c)  Simulation Results


5.          SRAM Design

a)  Circuit Schematic

Figure 5.1: SRAM Column Circuitry

Figure 5.2: SRAM 6T Cell

Figure 5.3: Column Decoder

Figure 5.4: Column Multiplexer

Figure 5.5: Read Tri-state Circuitry

Figure 5.6: SRAM Sense Amplifier (SA)

Figure 5.7: Write Tri-state Circuitry

Figure 5.8: Write Bit-line Driver

Figure 5.9: Input/Output Flip Flop

b)  Cell Layout

Figure 5.10: SRAM Cell Layout

Figure 5.11: Column Decoder Layout

Figure 5.12: Column Multiplexer Layout

Figure 5.13: Read Tri-state Layout

Figure 5.14: SRAM Sense Amplifier Layout

Figure 5.15: Write Tri-state and Bit-line Driver Layout

Figure 5.16: Input/Output Flip Flop Layout

c)  Simulation Results

Read “1” from SRAM:  d = data @ SRAM, I/O = data out @ PAD, CLK = 200 MHz

Read Access Time (Read_Precharge → I/O) = 5.34 ns

Read “0” from SRAM:  d = data @ SRAM, I/O = data out @ PAD, CLK = 200 MHz

Write “1” and “0” to SRAM:  I/O = data in @ PAD, d = data @ SRAM, CLK = 200 MHz

Write Time (CLK → d) = 2.28 ns

 


6.          System Integration

Figure 6.1: Cache Chip Layout Architecture

Figure 6.2: Overall Chip With Pads

Figure 6.3: Overall Chip