In order to be successful, future multiprocessor systems must be cost effective, modular, and easy to program for efficient parallel execution. The NUMAchine project seeks to address these issues by developing a cost-effective high-performance hardware platform supported by software to ease the task of developing parallel applications and maximizing parallel performance. In this report we have provided an overview of the NUMAchine hardware architecture and presented simulation results to demonstrate some of the implications of the architecture on performance.
The NUMAchine ring hierarchy gives the desired simplicity of implementation. Since there are only three connections to each node, it is possible to use wide datapaths. We have developed a simple routing mechanism that allows the rings to be clocked at high speed. As shown in the evaluation section, the bisection bandwidth of our network is sufficient for typical applications running on the target system size. In addition, the high-speed operation results in low latency for remote accesses.
The hierarchical nature of the NUMAchine rings allows for a natural implementation of multicasts. This feature is exploited by the coherence mechanism to invalidate multiple cache lines using a single packet. It is also exploited to implement an efficient multicast interrupt mechanism and to implement, in hardware, support for efficient barrier synchronization.
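To illustrate how a single packet can address several destinations in a ring hierarchy, the following is a minimal sketch and not the NUMAchine packet format. It assumes a two-level hierarchy in which a destination is a (ring, station) pair and a packet carries one bit field per level; the names `encode_targets` and `decode_targets` and the `fanout` parameter are hypothetical. Under this assumed encoding, the set of stations reached is the cross product of the selected rings and stations, so it may be a superset of the intended targets.

```python
from itertools import product


def encode_targets(targets, fanout):
    """Build (ring_mask, station_mask) covering every (ring, station) target.

    Assumed encoding: one bit per ring and one bit per station position.
    """
    ring_mask = 0
    station_mask = 0
    for ring, station in targets:
        if not (0 <= ring < fanout and 0 <= station < fanout):
            raise ValueError("target outside the assumed hierarchy")
        ring_mask |= 1 << ring
        station_mask |= 1 << station
    return ring_mask, station_mask


def decode_targets(ring_mask, station_mask, fanout):
    """Return every (ring, station) pair the two masks select."""
    rings = [r for r in range(fanout) if (ring_mask >> r) & 1]
    stations = [s for s in range(fanout) if (station_mask >> s) & 1]
    return sorted(product(rings, stations))


# Two stations on the same ring: the masks select exactly those stations.
same_ring = encode_targets([(0, 1), (0, 3)], fanout=4)
print(decode_targets(*same_ring, fanout=4))

# Stations on different rings: the masks select a superset of the targets.
cross_ring = encode_targets([(0, 1), (2, 3)], fanout=4)
print(decode_targets(*cross_ring, fanout=4))
```

In this sketch the mask width is fixed by the hierarchy rather than by the number of targets, which is what lets one packet carry an invalidation to many stations.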
The cache coherence support in NUMAchine is highly optimized for applications where most sharing is localized within a single station, in which case coherence is controlled by the local memory or network cache and no remote interactions are required. A two-level directory structure is used, in which the number of directory bits per cache line grows only logarithmically with the number of processors in the system.
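The logarithmic growth can be made concrete with a small calculation. The sketch below compares a full bit-vector directory (one bit per processor) with a hierarchical mask that keeps one fixed-width field per level of the hierarchy. The field layout, the uniform `fanout` at every level, and the function names are illustrative assumptions, not the actual NUMAchine directory format.

```python
def full_bit_vector_bits(num_processors: int) -> int:
    """Directory bits per cache line with one presence bit per processor."""
    return num_processors


def hierarchical_mask_bits(num_processors: int, fanout: int) -> int:
    """Directory bits per cache line with one fanout-wide field per level.

    Assumes a uniform hierarchy in which every level has `fanout` children.
    """
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    levels = 0
    capacity = 1
    # Count the levels needed to address num_processors leaves.
    while capacity < num_processors:
        capacity *= fanout
        levels += 1
    return levels * fanout


for n in (16, 64, 256, 1024):
    print(n, full_bit_vector_bits(n), hierarchical_mask_bits(n, fanout=4))
```

With a fanout of 4, each fourfold increase in processor count adds one level and therefore 4 bits, whereas the full bit vector grows by a factor of four.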
In addition to localizing coherence traffic, the network cache serves as a larger shared tertiary cache for the processors on the station. It is implemented in DRAM, which will allow us to experiment with very large cache sizes in order to avoid remote accesses. Also, the network cache serves as a target for such operations as multicast writes; system software can cause cache lines to be multicast to a set of stations where it is expected that the data will soon be required.
The NUMAchine architecture is one component of the larger NUMAchine project, which involves development of a new operating system, parallelizing compilers, a number of tools to aid in debugging correctness and parallel performance, and a large set of applications. For this reason, our prototype will include extensive monitoring support. It will also allow system software to take control of the low-level features of the hardware, facilitating experimentation with hardware-software interaction.