The network cache (NC) is shared by all processors in a station and is used to cache data originating from other stations. The cache lines are maintained in DRAM so that very large network caches can be implemented at reasonable cost. The NC should be at least as large as the combined capacities of the secondary caches on the station, and can be made larger. SRAM is used to hold the status and control information for each cache line so that it can be accessed quickly.
The NC serves a number of useful purposes. It acts as a shared tertiary cache for the station, as a target for broadcasts, and as a target for prefetching when the processor does not support prefetching directly into its primary or secondary caches. It also performs a function akin to snooping, which is usually found in bus-based systems. In this section, uses of the NC are described. Section 4 will show the effectiveness of the NC, based on simulations of our NUMAchine prototype.
A read request to non-local memory is always directed to the local NC. If the cache line is present in the NC, then the NC responds with the data. If the NC knows that the cache line is dirty in a secondary cache on the station, it causes the data to be transferred to the requester. Otherwise, the NC sends a read request to the home memory module. When the data arrives from the remote station, it is forwarded to the requesting processor and a copy is kept in the NC. Subsequent read requests for the same cache line by another processor on the station are satisfied by the network cache, avoiding remote memory accesses. In effect, the NC replicates shared data from remote memories into the station. This feature is referred to as the migration effect of the NC.
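The read path described above can be sketched as a small model. This is a minimal illustration under stated assumptions, not the NUMAchine hardware logic: the class `NetworkCache`, its fields, and the dictionaries standing in for the home memory and secondary caches are all hypothetical names, and the model assumes an address never appears in both `lines` and `dirty_owner` at once.

```python
from dataclasses import dataclass, field


@dataclass
class NetworkCache:
    """Toy model of the NC read path; every name here is illustrative."""
    lines: dict = field(default_factory=dict)        # addr -> valid data held in the NC
    dirty_owner: dict = field(default_factory=dict)  # addr -> local processor holding a dirty copy
    remote_reads: int = 0                            # read requests sent to a home memory module

    def read(self, addr, home_memory, secondary_caches):
        # Case 1: the line is present in the NC, which responds with the data.
        if addr in self.lines:
            return self.lines[addr]
        # Case 2: the line is dirty in a secondary cache on this station;
        # the NC has that cache supply the data to the requester.
        if addr in self.dirty_owner:
            return secondary_caches[self.dirty_owner[addr]][addr]
        # Case 3: miss. Send a read to the home memory module, forward the
        # data to the requester, and keep a copy in the NC.
        self.remote_reads += 1
        data = home_memory[addr]
        self.lines[addr] = data
        return data
```

In this sketch, two reads of the same remote address from different local processors increment `remote_reads` only once, which is the migration effect in miniature.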
The NC retains shared data that is overwritten in a processor's secondary cache, provided that the new cache line does not map into the same location in the NC as the cache line that is overwritten. Also, dirty data ejected from a processor's secondary cache due to limited capacity or conflicts is written back into the network cache, but not necessarily to the remote memory. If such data is needed again, it will be available from the network cache. This feature of the NC is referred to as the caching effect.
The NC ``combines'' concurrent read requests to the same remote memory location into a single request that propagates through the network to the remote home memory module. This occurs as a direct consequence of locking the location reserved for the desired cache line; subsequent requests for the locked line are negatively acknowledged, forcing the processors to try again. After the response to the initial request arrives, subsequent requests are satisfied by the NC. In this respect, the NC reduces network traffic and alleviates contention at the remote memory module. This feature is referred to as the combining effect of the NC.
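The lock-and-retry behaviour behind the combining effect can be illustrated with a short model. It is a sketch only: the class `CombiningNC`, the reply tags `"DATA"`, `"NACK"` and `"PENDING"`, and the method names are assumptions made for illustration and do not come from the NUMAchine design.

```python
class CombiningNC:
    """Toy model of request combining; all names are illustrative."""

    def __init__(self):
        self.lines = {}           # addr -> data held in the NC
        self.locked = set()       # addrs with an outstanding remote request
        self.remote_requests = 0  # requests that actually entered the network

    def read(self, addr):
        if addr in self.lines:
            return ("DATA", self.lines[addr])
        if addr in self.locked:
            # The location is locked by an earlier request: negatively
            # acknowledge, so the processor must try again later.
            return ("NACK", None)
        # First request for this line: lock the location and send a single
        # request to the remote home memory module.
        self.locked.add(addr)
        self.remote_requests += 1
        return ("PENDING", None)

    def remote_response(self, addr, data):
        # The response to the initial request arrives: fill and unlock.
        self.lines[addr] = data
        self.locked.discard(addr)
```

However many processors request the locked line while the first request is outstanding, `remote_requests` stays at one; their retries are served from `lines` once `remote_response` has run.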
The NC localizes coherence traffic for cache lines used only within a station but whose home location is in a remote memory module. Such lines exist in either LV or LI states in the NC, and all coherence actions for these lines involve only the NC and not the remote home memory module. For example, assume that a dirty cache line exists in a secondary cache and that its state in the network cache is LI. If another processor on the same station reads this cache line, then the NC determines from its processor mask which processor has the dirty copy and that processor sends the data to both the requesting processor and the NC. The state in the NC now becomes LV. If one of these two processors later requests exclusive access to the same cache line, the line becomes dirty again, and the NC invalidates the other copy. The state of the line in the NC becomes LI. All this is done locally, without having to send any messages to the home memory, which maintains the cache line in the GI state. This feature is referred to as the coherence localization effect of the NC.
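The example in this paragraph can be traced with a small state model. The sketch below is an assumption-laden illustration rather than the actual protocol: `LocalLine`, `local_read` and `local_exclusive` are hypothetical names, the processor mask is modelled as a Python set, and only the LV and LI transitions mentioned above are covered.

```python
class LocalLine:
    """Toy model of one remote-home line used only within a station."""

    def __init__(self, owner):
        self.state = "LI"    # a local secondary cache holds the dirty copy
        self.mask = {owner}  # processor mask: local processors with a copy

    def local_read(self, requester):
        if self.state == "LI":
            # The owner named by the mask sends the data to both the
            # requester and the NC, so the NC copy is now usable.
            self.state = "LV"
        self.mask.add(requester)

    def local_exclusive(self, requester):
        # The NC invalidates every other local copy; the line becomes
        # dirty again in the requester's secondary cache.
        invalidated = self.mask - {requester}
        self.mask = {requester}
        self.state = "LI"
        return invalidated
```

Neither method contains any interaction with the home memory module, which mirrors the point of the paragraph: both transitions complete inside the station.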
The network cache is a convenient target for broadcasts (multicasts). Data produced by one processor and needed by other processors can be broadcast to avoid hot-spot contention at memory modules and in the interconnection network. Other possibilities for broadcast targets are less attractive: broadcasting into secondary caches requires complicated hardware on each processor and can eject data in use by the processor; broadcasting into memory modules is impractical for addressing reasons.
The NC can also be used for prefetching data if the processor does not support prefetching directly. Prefetching can be implemented easily as a ``write'' request to a special memory address, which causes the hardware to initiate the prefetch. The disadvantage of prefetching into the network cache is that the data is not placed as close as possible to the requesting processor.
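One way to picture the write-triggered prefetch is the sketch below. The text does not say how the target of the prefetch is encoded, so this model assumes that the value written to the special address names the line to fetch; the constant `PREFETCH_TRIGGER`, its value, and the class `PrefetchNC` are hypothetical and chosen only for illustration.

```python
PREFETCH_TRIGGER = 0xFFFF0000  # hypothetical special address, not from the source


class PrefetchNC:
    """Toy model of prefetching into the NC via a special write."""

    def __init__(self, home_memory):
        self.home_memory = home_memory  # addr -> data at the remote home
        self.lines = {}                 # addr -> data held in the NC

    def store(self, addr, value):
        if addr == PREFETCH_TRIGGER:
            # Assumption: the written value carries the address to prefetch.
            target = value
            self.lines[target] = self.home_memory[target]
            return "PREFETCH"
        return "WRITE"  # ordinary writes are outside the scope of this sketch

    def read(self, addr):
        # Returns None when the line is not in the NC.
        return self.lines.get(addr)
```

After such a store, a later read of the prefetched line is served by the NC, although, as noted above, the data still sits one level further from the processor than a prefetch into its own caches would place it.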
The use of the NC obviates the need for snooping on the station bus, saving cost and reducing hardware complexity. A processor can obtain remote data from the NC instead of obtaining it from another processor's secondary cache through snooping. In fact, the NC provides complete snooping functionality. It responds with shared data directly, and causes forwarding of dirty data as explained above. This functionality may not be easy to achieve using snooping, because many modern processors make it difficult to implement a mechanism that allows shared data to be obtained from a secondary cache.