This section describes the details of the NUMAchine cache coherence protocol. Since the protocol is highly complex, we cannot describe all possible coherence operations here, but enough examples are presented to enable a knowledgeable reader to understand how NUMAchine's cache coherence operates.
NUMAchine's cache coherence scheme is optimized specifically for the NUMAchine architecture. In particular, it leverages the natural multicast mechanism available via the rings, it exploits the fact that the rings provide a unique path from one node to any other node, and it is designed to localize coherence traffic to within a single level of the ring hierarchy whenever possible. The protocol is enforced simultaneously at two levels, as illustrated in Figure 4. Network-level coherence is enforced between the home memory module for a given cache line and the network caches in other stations holding copies of this cache line. Station-level coherence for a given cache line is enforced between the processor caches and the home memory on the same station, or between the processor caches and the network cache on the same station if the home memory of the cache line is on a remote station.
To maintain cache coherence at both the network and station levels, a hierarchical, two-level directory exists. The directory is stored in SRAM located in the memory modules and network caches. At the network level, the home memory maintains a full directory of routing masks for each cache line. The routing mask can identify a single station or multiple stations, as described in Section 2.2. In the directory, the routing mask indicates which stations may have copies of a cache line. At the station level, the directory consists of a simple bitmask, or processor mask, for each cache line. Since there are only a few processors per station, each processor has a dedicated bit in the processor mask. These bits indicate which processors on the station have a copy of the cache line. Processor masks stored in a memory module indicate which processors within the local station have a copy of a cache line. The processor masks for copies of cache lines on remote stations are maintained in their respective network caches.
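The two-level directory entry just described can be sketched schematically in Python. This is purely illustrative: the class, field, and method names are ours, not NUMAchine's, and we assume four processors per station as in the examples later in this section.

```python
from dataclasses import dataclass

# Assumption: four processors per station, one dedicated bit each.
PROCESSORS_PER_STATION = 4

@dataclass
class DirectoryEntry:
    routing_mask: set      # network level: stations that may hold copies
    processor_mask: int    # station level: one bit per local processor

    def add_processor(self, proc_id: int) -> None:
        # OR the requesting processor's bit into the processor mask
        self.processor_mask |= 1 << proc_id

    def holders(self) -> list:
        # local processors whose bit is set, i.e. possible copy holders
        return [p for p in range(PROCESSORS_PER_STATION)
                if self.processor_mask & (1 << p)]

entry = DirectoryEntry(routing_mask={"Y"}, processor_mask=0)
entry.add_processor(2)
entry.add_processor(0)
# entry.holders() -> [0, 2]
```

The home memory module holds such an entry for every cache line it owns; a network cache holds the processor mask for the remote lines it caches.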
In addition to the directory, the memory and network caches contain a valid/invalid (V/I) bit per cache line, which indicates whether the copy they have is valid. The network caches also contain a local/global (L/G) bit, which indicates whether the only valid copies of the cache line are on the local station. In the memory module, a separate L/G bit is not needed because this information is provided by the routing mask in the directory.
While three basic states (dirty, shared and invalid) are defined for the secondary cache in the standard way for write-back invalidate protocols, four basic states are defined for a cache line in a memory module or a network cache. The L/G and V/I bits together indicate the state of the cache line: local valid (LV), local invalid (LI), global valid (GV) and global invalid (GI). The LV and LI states indicate that valid copies of the cache line exist only on this station. In the LV state, the memory (or network cache) as well as the secondary caches indicated by the processor mask have a valid copy. In the LI state, only one of the local secondary caches has a copy (which would be dirty), and that particular cache is identified by the processor mask. In GV, the memory (or network cache) has a valid copy of the cache line, which may be shared by several stations, as indicated by the routing mask in the directory. The meaning of the GI state differs slightly between the memory module and the network cache. In both cases, the GI state means that there is no valid copy on this station. However, the GI state in the memory module also indicates that there exists a remote network cache (identified by the routing mask) with a copy in the LV or LI state. Each of the basic states also has a locked version. The locked versions are used to prevent accesses to a cache line while the line is undergoing some transition. Any request for a cache line in a locked state is negatively acknowledged, and the requester will try again.
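The derivation of the four basic states from the L/G and V/I bits, and the negative acknowledgment of requests to locked lines, can be sketched as follows (the bit and state names follow the text; the code structure is ours):

```python
def line_state(local_bit: bool, valid_bit: bool) -> str:
    # The L/G bit selects the first letter, the V/I bit the second,
    # yielding the four basic states LV, LI, GV, GI.
    return ("L" if local_bit else "G") + ("V" if valid_bit else "I")

def handle_request(state: str, locked: bool) -> str:
    # Each basic state also has a locked version; requests for a locked
    # line are negatively acknowledged and the requester retries.
    if locked:
        return "NACK"
    return "PROCEED"   # normal coherence handling for LV/LI/GV/GI
```

For example, `line_state(True, False)` yields `"LI"`: the only valid copy is a dirty line in one local secondary cache.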
Figure 5: State Transition Tables for memory.
The NUMAchine cache coherence protocol employs a write-back/invalidate scheme at both levels. The protocol is illustrated using four basic examples: local write, local read, remote read and remote write. The first three of these examples illustrate the basic operation of the protocol by indicating how the directories and states are manipulated. The fourth example provides additional details by showing some of the actual steps taken in the hardware. For readers who are interested in the entire protocol, full state-transition tables for a cache line in memory and for a cache line in a network cache are given in Figures 5 and 6.
Figure 6: State Transition Tables for network cache.
Let us first consider a local write request by a processor on station Y, for a cache line whose home location is also on station Y. Let us assume that there are valid copies of the cache line on station Y and that the cache line is shared on another station, Z; therefore, the cache line is in the GV state in both the memory on station Y and the network cache on station Z. After the processor issues a write to memory on station Y, the memory controller will send an invalidate request to the remote station Z indicated by the routing mask, and to the local processors indicated by the processor mask in the directory. All the bits in the processor mask are reset except for the bit corresponding to the processor requesting the write. Also, the routing mask bits in the directory are set to indicate the local station. The new state of the cache line will be LI indicating that the memory no longer has a valid copy, but that the copy is in one of the secondary caches on the local station.
Upon receiving an invalidation packet, the remote network cache controller on station Z invalidates copies on the station according to its processor mask (if the cache line has been ejected from the NC, then the invalidation message is broadcast to all four processors), which is then cleared. The state of the cache line is set to GI, indicating that neither the network cache nor any of the secondary caches contain a valid copy of the cache line.
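The local-write actions at the home memory described above can be modeled schematically as follows. This is a sketch, not the hardware implementation: the function and field names are ours, stations are represented by strings, and we assume four processors per station.

```python
def local_write(entry, writer, home_station):
    # Invalidate targets: remote stations named by the routing mask,
    # and local sharers other than the writing processor.
    remote_targets = entry["routing_mask"] - {home_station}
    local_targets = [p for p in range(4)
                     if entry["proc_mask"] & (1 << p) and p != writer]
    # Directory update: only the writer's bit stays set, the routing
    # mask indicates the local station only, and the line becomes LI
    # (memory no longer has a valid copy; one local cache does).
    entry["proc_mask"] = 1 << writer
    entry["routing_mask"] = {home_station}
    entry["state"] = "LI"
    return remote_targets, local_targets

# Line shared by processors 0 and 2 on station Y and by station Z.
line = {"state": "GV", "routing_mask": {"Y", "Z"}, "proc_mask": 0b0101}
remotes, locals_ = local_write(line, writer=0, home_station="Y")
# remotes -> {"Z"}; locals_ -> [2]; line["state"] -> "LI"
```

The returned targets correspond to the invalidation packets the memory controller sends out; the remote network cache then clears its own processor mask and moves to GI as described above.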
Let us now consider a read by a processor on station Y for the same cache line, which is in the LI state in the memory module on station Y. The memory controller determines which processor has the dirty copy, and that processor then forwards a copy of the cache line to the requesting processor and to the memory module. Upon receiving the data, the memory controller writes it to DRAM and ORs the bit corresponding to the requesting processor into the processor mask in the directory. The new state of the cache line will be LV, indicating that copies of the cache line are located on this station only. The memory and the processors indicated by the processor mask have valid copies of the cache line.
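The local-read handling of an LI line can be sketched in the same schematic style (names ours; in LI exactly one processor-mask bit is set, identifying the dirty copy):

```python
def local_read_li(entry, requester):
    # In LI exactly one bit is set: the processor with the dirty copy.
    owner = entry["proc_mask"].bit_length() - 1
    # That owner forwards the line to the requester and back to memory;
    # the controller writes it to DRAM, ORs in the requester's bit,
    # and the line becomes LV.
    entry["proc_mask"] |= 1 << requester
    entry["state"] = "LV"
    return owner

line = {"state": "LI", "proc_mask": 0b1000}   # processor 3 holds dirty copy
owner = local_read_li(line, requester=1)
# owner -> 3; line["proc_mask"] -> 0b1010; line["state"] -> "LV"
```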
Next we consider the case where a shared read request issued by a processor on station X arrives at a memory module on station Y, where the cache line is in the GI state. In this example, we assume that the cache line is dirty on another station, Z. We also assume that on station Z, the network cache entry for this cache line is in the LI state. The home memory module sends a read request message (identifying the requesting processor on station X) to station Z using the routing mask. Using the information in its processor mask, the network cache on station Z obtains the dirty copy from the secondary cache, causing the state to change to GV in the network cache. The dirty data is forwarded to station X, and a copy is also sent to the home memory module (in a separate transmission). When the data arrives at station X, a copy is written to both the network cache and the requesting processor. In the network cache the state of the cache line is changed to GV and the processor mask is set to indicate the requesting processor. When the data arrives at the home memory module, it is written into DRAM. The existing routing mask in the memory is ORed with the bits corresponding to stations X and Y, and the state of the cache line is changed to GV.
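The final directory and state updates of this remote read can be summarized in one schematic function (again an illustrative model, not the hardware; stations are strings, entries are dicts):

```python
def remote_shared_read(home, z_nc, x_nc, requester_proc):
    # Station Z's network cache gives up the dirty copy: LI -> GV.
    z_nc["state"] = "GV"
    # The data reaches station X: the NC stores it, moves to GV, and
    # sets its processor mask to the requesting processor only.
    x_nc["state"] = "GV"
    x_nc["proc_mask"] = 1 << requester_proc
    # The home (station Y) writes the copy to DRAM and ORs stations X
    # and Y into its routing mask: GI -> GV.
    home["routing_mask"] |= {"X", "Y"}
    home["state"] = "GV"

home = {"state": "GI", "routing_mask": {"Z"}}
z_nc = {"state": "LI", "proc_mask": 0b0001}
x_nc = {"state": "GI", "proc_mask": 0}
remote_shared_read(home, z_nc, x_nc, requester_proc=2)
# home["routing_mask"] -> {"X", "Y", "Z"}; all three entries end in GV
```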
Figure 7: Coherence actions for a remote write.
As a final example, we consider a write request by a processor on station X for a cache line whose home location is on station Y. This example also illustrates the locking mechanism that ensures cache coherence and provides support for different consistency models. Figure 7 illustrates the necessary actions. Let us assume that the network cache state on station X is GI (i.e., there is no valid copy in the NC or in any of the processors on the station), and that the cache line is in the GV state in the home memory. The processor's request goes first to the network cache on station X. The network cache locks this location and sends a write request packet to station Y (a write request means that the memory module should provide the data and give write permission). When the request reaches the home memory module, the data is sent to station X and all other copies are invalidated. The invalidation scheme is implemented as suggested previously. The home memory location is locked when the invalidate request packet is issued. The invalidate packet rises to the highest level of the (sub)hierarchy needed to multicast it to all stations with copies; it is then distributed according to the routing mask, which identifies all stations with valid copies, plus station X. When the invalidate packet returns to station Y (where it originated), the memory location is unlocked and placed in the GI state, and the routing mask is updated to indicate station X as the only station with a copy. It is important to note that the invalidation requests do not have to be acknowledged by the caches that invalidate their copies of the cache line.
When the cache line reaches station X, the network cache writes it into its DRAM and waits for the invalidate packet to arrive. It is guaranteed that the data packet will arrive before the invalidate message, because the memory module sends the data first and the ring hierarchy preserves the ordering of messages. Upon arrival of the invalidate packet, the network cache sends the data from its DRAM to the requesting processor and puts the cache line into the LI state. Also, the processor mask is set to indicate which processor on the station has the copy.
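The ordering guarantee at station X's network cache can be sketched by modeling the ring as a FIFO queue (an illustrative model with names of our choosing; the real mechanism is the in-order delivery of the ring hierarchy):

```python
from collections import deque

def nc_remote_write(ring, nc, requester):
    # The home sends the data first, and the rings deliver messages in
    # order, so the data packet is guaranteed to arrive before the
    # invalidate. The NC writes the line to DRAM, then waits.
    assert ring.popleft() == "data"
    assert ring.popleft() == "invalidate"
    # On the invalidate's arrival, forward the data to the requester;
    # the line becomes LI with the requester's bit in the processor mask.
    nc["state"] = "LI"
    nc["proc_mask"] = 1 << requester

ring = deque(["data", "invalidate"])   # in-order arrival at station X
nc = {"state": "GI", "proc_mask": 0}
nc_remote_write(ring, nc, requester=1)
# nc -> {"state": "LI", "proc_mask": 0b0010}
```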
Some further aspects of the NUMAchine coherence protocol are summarized below.
In summary, the basic coherence mechanism for the NUMAchine cache coherence protocol is write-back/invalidate. This mechanism is well suited to today's applications, which are designed to exhibit as much locality in data accesses as possible. Since we efficiently enforce the strictest memory consistency model (sequential consistency), our implementation can also support any other consistency model provided by the processor/secondary-cache subsystem. To make the protocol efficient, communication overhead must be minimized, which we have achieved. At the same time, we have kept latency low in all cases except the optimistic one described above, which trades a slight increase in latency for lower communication overhead. Since latency can be tolerated using techniques such as prefetching, we have concentrated more strongly on reducing communication overhead. Finally, the NUMAchine cache coherence protocol is conducive to a low-cost hardware implementation, because the amount of memory required for the directories is small and the logic circuitry needed to manipulate them is reasonable.