Multiprocessors have existed for many years, but they have not achieved the level of success that many experts initially expected. The lack of stronger acceptance of multiprocessors is due in part to the following reasons: (1) an over-reliance on custom hardware solutions, making it difficult to track the rapid improvements in mainstream workstation technology, (2) a focus on scalability to thousands of processors, involving considerable up-front costs that preclude reasonably priced small configurations, and (3) a lack of adequate system software, impeding development of application programs that can exploit the performance potential of the machines. These three factors have influenced our approach to multiprocessor architecture, as discussed below.
Multiprocessor systems designed using workstation technology can provide large computing capability at a reasonable cost. Future demand is likely to be the greatest for machines that give good performance and are modular, cost-effective, scalable to a reasonable size, and easy to use efficiently. A key requirement is that a multiprocessor system be viable and affordable in a relatively small configuration, which precludes a large up-front cost. However, it also must be easy to expand the system, necessitating a modular design. While scalability is an important issue, and has strongly influenced research in recent years, it is apparent that demand for huge machines (with thousands of processors) will continue to be low. Commercial interest is likely to be concentrated on designs that are scalable in the range of hundreds of processors.
From a user's perspective, it is desirable that a machine provide high performance and be easy to use, requiring little effort to structure programs for parallel execution. One way to facilitate ease of use is to provide a shared-memory programming model with a single flat address space for all processors. This allows parallel programs to communicate through normal memory reads and writes, rather than through software message passing with its attendant overhead. In addition, providing hardware-based cache coherence for the shared memory simplifies the task of developing parallel programs, both because programmers are given a familiar abstraction for accessing memory and because it is simpler to create compilers that can automatically parallelize programs.
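The shared-memory style of communication described above can be illustrated with a minimal sketch. It uses Python threads purely to show the programming model: the function name `parallel_sum`, the `partial` array, and the choice of four workers are hypothetical and are not taken from NUMAchine or its software. Python threads also do not run compute-bound code in parallel on most interpreters, so this shows only how workers cooperate through ordinary reads and writes to one address space, not any performance behavior.

```python
import threading


def parallel_sum(values, num_workers=4):
    """Sum `values` with workers that communicate only through shared memory."""
    # Shared array visible to every worker: one result slot per worker.
    partial = [0] * num_workers

    def worker(rank):
        # Read a strided slice of the shared input and write the result into
        # this worker's own slot. No explicit send/receive calls are needed.
        partial[rank] = sum(values[rank::num_workers])

    threads = [
        threading.Thread(target=worker, args=(rank,))
        for rank in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait until every worker has written its slot

    # The main thread reads the workers' results with ordinary memory reads.
    return sum(partial)


total = parallel_sum(list(range(1000)))
```

In a message-passing version of the same computation, each worker would instead have to package its partial result and send it explicitly to a collector, which is the software overhead the shared-memory model avoids.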
For multiprocessor technology to reach a much greater level of commercial success than it currently enjoys, it is crucial that system software for multiprocessors evolve considerably beyond the current state of the art. For this to occur, multiprocessor machines must become available for use as software research platforms. Such a machine should be flexible enough to let software control the hardware resources available in the machine.
This report presents the architecture of the NUMAchine multiprocessor and describes a 64-processor prototype that is being constructed. This hardware is part of a larger NUMAchine project that includes development of a new operating system, parallelizing compilers, a number of tools to aid in correctness debugging and parallel performance debugging, and a large set of applications. The overall objective of the NUMAchine project is to design a multiprocessor system that meets the criteria discussed above and is scalable in the range of hundreds of processors.
The NUMAchine architecture has many interesting features, the most important of which are listed below:
The overall NUMAchine project is still in an early phase. The hardware for an initial prototype using MIPS R4400 processors is currently being fabricated. A detailed behavioral simulator is being used to evaluate architectural tradeoffs and the expected performance for a prototype implementation. The final version of the prototype system, targeted for completion in 1996, will consist of 64 processors, connected by a two-level hierarchy of rings. Initial implementations of much of the system software for NUMAchine have been developed on hardware simulators and existing multiprocessor platforms.
The rest of this document provides more details on the NUMAchine architecture (and prototype) and is organized as follows: Section 2 provides an overview of the NUMAchine architecture, Section 4 presents the results of simulations to evaluate the architecture for a variety of parallel applications, Section 5 refers to some examples of related work, and Section 6 concludes.