Related Work

Over the past few years, a number of scalable multiprocessors that support a single coherent view of memory have been designed and/or built. In this section, some of the features of recent machines are considered, in order to show how NUMAchine compares to other approaches.

The Stanford DASH multiprocessor [15] uses clusters of processors that share a single bus, with the clusters interconnected by a mesh. It uses a directory-based hardware cache coherence protocol in which a write to a shared cache line requires that a separate invalidate be sent for each copy, and that each invalidate be acknowledged. In the NUMAchine protocol, only a single invalidate message is used, and no acknowledgments are required. DASH employs a small cache in each cluster called a Remote Access Cache (RAC). NUMAchine's network cache includes the functionality of the RAC; however, the key to the effectiveness of NUMAchine's network cache is its large capacity, which is at least as large as the combined capacities of the secondary caches on a station.
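
To make the difference in coherence traffic concrete, the following C sketch counts the messages generated by a write to a line shared by n caches under the two schemes. It is purely illustrative: the functions msgs_per_copy and msgs_single_inv are hypothetical and model message counts only, not either protocol's actual implementation.

    #include <stdio.h>

    /* Messages generated by one write to a cache line shared by n caches. */

    /* DASH-style: one point-to-point invalidate per copy, each acknowledged. */
    static int msgs_per_copy(int n)   { return 2 * n; }

    /* NUMAchine-style: a single invalidate delivered over the ordered ring
       hierarchy reaches every copy, and no acknowledgments are needed.      */
    static int msgs_single_inv(int n) { (void)n; return 1; }

    int main(void)
    {
        for (int n = 1; n <= 16; n *= 2)
            printf("%2d sharers: per-copy + ack = %2d messages, "
                   "single invalidate = %d message\n",
                   n, msgs_per_copy(n), msgs_single_inv(n));
        return 0;
    }

The message count of the per-copy scheme grows linearly with the number of sharers, whereas the single-invalidate scheme is independent of it.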

The FLASH multiprocessor [13], under development at Stanford University, will provide a single address space with integrated message-passing support. A programmable co-processor, called MAGIC, serves as a memory and I/O controller, a network interface, and a communication and coherence protocol processor. Through this programmable co-processor, FLASH provides a high degree of flexibility. NUMAchine takes a different approach to providing flexibility: the basic protocols, such as coherence, are implemented in hardware to ensure good performance, but software can override the hardware when different protocols are desirable.

The Alewife machine from MIT [1] shares the FLASH approach of integrating a single address space with message passing. Its approach to flexibility is to implement common-case, critical-path operations in hardware and let software handle exceptional or unusual conditions. For example, it uses limited directories [8] to implement cache coherence: hardware directly supports a small number of sharer pointers, and software handles the case in which a cache line is shared by a larger number of processors. An important difference between Alewife and NUMAchine is that Alewife relies on a great deal of custom hardware, which makes it harder for Alewife to track the rapid improvements in workstation technology.
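
The limited-directory idea can be sketched as follows. This is a minimal illustration of the hardware/software split described above, not Alewife's actual directory format; the four-pointer limit and the names dir_entry and add_sharer are assumptions made here for clarity.

    #include <stdbool.h>
    #include <stdio.h>

    #define HW_POINTERS 4   /* sharer pointers held directly in hardware (assumed) */

    /* Hypothetical limited-directory entry: a few hardware sharer pointers
       plus an overflow flag that hands the line over to a software handler. */
    struct dir_entry {
        int  sharers[HW_POINTERS];
        int  count;
        bool overflow;
    };

    /* Record a new sharer; return true if hardware handled it on its own. */
    static bool add_sharer(struct dir_entry *e, int node)
    {
        if (e->overflow)
            return false;                  /* line already in software hands */
        if (e->count < HW_POINTERS) {
            e->sharers[e->count++] = node; /* common case: pure hardware     */
            return true;
        }
        e->overflow = true;                /* rare case: trap to software    */
        return false;
    }

    int main(void)
    {
        struct dir_entry e = { {0}, 0, false };
        for (int node = 0; node < 6; node++)
            printf("node %d: %s\n", node,
                   add_sharer(&e, node) ? "tracked in hardware"
                                        : "handled by software");
        return 0;
    }

As long as a line has few sharers, coherence stays entirely in hardware; only widely shared lines pay the cost of a software handler.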

The KSR multiprocessors [12] from Kendall Square Research use a ring-based hierarchical network to interconnect up to 1088 processing cells. These systems implement a Cache Only Memory Architecture (COMA), which automatically replicates data to requesting cells. Although NUMAchine uses a similar interconnection topology, there are a number of fundamental differences between the two networks. In the KSR systems, each processing cell must snoop on ring traffic to maintain cache coherence; this snooping effectively involves a directory lookup and slows the operation of the ring. Furthermore, a combined cache directory is needed at each level of the interconnect hierarchy, containing all of the directory information in the levels below, which severely limits the scalability of the architecture. The replication of data in the COMA memory is effective in reducing memory and network contention [6]. NUMAchine captures most of these benefits with its network caches, but without affecting scalability and at considerably reduced cost.
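
The scalability issue with the inclusive combined directories mentioned above can be seen with a small back-of-the-envelope calculation. The cell cache size and ring fan-out used below are placeholder values chosen for illustration, not KSR parameters.

    #include <stdio.h>

    /* Placeholder values only -- not KSR parameters. */
    #define LINES_PER_CELL  (32u * 1024u)  /* cache lines held by one cell */
    #define CELLS_PER_LEVEL 32u            /* fan-out at each ring level   */

    int main(void)
    {
        /* An inclusive combined directory at a given level must hold an
           entry for every line cached anywhere below it, so its size grows
           with the whole subtree rather than with a single cell.           */
        unsigned long entries = LINES_PER_CELL;
        for (int level = 1; level <= 3; level++) {
            entries *= CELLS_PER_LEVEL;
            printf("level-%d combined directory covers %lu entries\n",
                   level, entries);
        }
        return 0;
    }

The directory state required at each level grows multiplicatively with the fan-out, which is why inclusion at every level of the hierarchy limits how far the design can scale.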

Other interesting multiprocessor projects include the ASURA multiprocessor [17] being developed at Kyoto University in Japan, Typhoon [20] from the University of Wisconsin, the Cray T3D system [18] from Cray Research, and the Exemplar from Convex [9]. ASURA has many similarities to NUMAchine, but its equivalent of the network cache uses very long cache lines (1 Kbyte), which may lead to considerable false sharing. Typhoon has flexibility goals similar to those of FLASH, and also depends on a programmable co-processor to implement its coherence policy. The T3D does not support cache coherence in hardware. The Exemplar uses a crossbar to interconnect the processors within a cluster and SCI rings to interconnect clusters and maintain inter-cluster coherence. The distributed directory-based protocol implemented by SCI uses linked lists, which can introduce considerable cache coherence latency overhead.
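
A rough sketch of why the SCI sharing lists add latency is given below. The per-hop cost, the list length, and the structure names are illustrative assumptions, and only the forward pointer of SCI's doubly linked sharing list is modelled.

    #include <stdio.h>
    #include <stdlib.h>

    /* Hypothetical sharing-list element: SCI chains the caches that share a
       line into a list distributed across the nodes themselves.             */
    struct sharer {
        int            node_id;
        struct sharer *next;
    };

    /* Purging the list before a write walks it one hop at a time, so the
       latency seen by the writer grows with the number of sharers.          */
    static int purge_latency_ns(const struct sharer *head, int hop_cost_ns)
    {
        int latency = 0;
        for (const struct sharer *s = head; s != NULL; s = s->next)
            latency += hop_cost_ns;        /* one network traversal per sharer */
        return latency;
    }

    int main(void)
    {
        /* Build a list of 8 sharers; assume 500 ns per hop (illustrative). */
        struct sharer *head = NULL;
        for (int i = 7; i >= 0; i--) {
            struct sharer *s = malloc(sizeof *s);
            s->node_id = i;
            s->next    = head;
            head       = s;
        }
        printf("serial purge of 8 sharers: about %d ns\n",
               purge_latency_ns(head, 500));

        while (head != NULL) {             /* release the list */
            struct sharer *next = head->next;
            free(head);
            head = next;
        }
        return 0;
    }

Because the sharing list must be traversed serially, the cost of invalidating a widely shared line grows with the number of sharers, in contrast to a single broadcast-style invalidate.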


