NUMAchine is a shared-memory multiprocessor with memory distributed across the stations. A flat physical addressing scheme is used, with a specific address range assigned to each station. All processors access all memory locations in the same manner. The time needed by a processor to access a given memory location depends upon the distance between the processor and the memory. Thus, the architecture is of the NUMA (Non-Uniform Memory Access) type.
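The flat addressing scheme can be sketched as follows. This is illustrative only: the actual NUMAchine address layout, station count, and bit widths are not specified here, so the constants below are assumptions. The idea is simply that some fixed field of the physical address identifies the station whose memory holds that address.

```python
# Hypothetical parameters (not from the paper): assume the high-order
# bits of a physical address select the owning station.
STATION_BITS = 6          # up to 64 stations in this sketch
ADDR_BITS = 32            # assumed 32-bit physical address space

def home_station(addr: int) -> int:
    """Return the station whose memory module holds this physical address."""
    return addr >> (ADDR_BITS - STATION_BITS)

# Every processor applies the same mapping, so all processors access all
# memory locations in the same manner; only the access latency differs
# with the distance to the home station.
```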
NUMAchine uses a ring-based hierarchical interconnection network. At the lowest level of the hierarchy are stations that contain several processors. The stations are interconnected by bit-parallel rings, as shown in Figure 1. For simplicity, the figure shows only two levels of rings: local rings connected by a central ring. Our prototype machine will have 4 processors in each station, 4 stations per local ring, and 4 local rings connected by a central ring.
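The prototype's hierarchy sizing can be captured in a few lines. This is a trivial sketch of the configuration stated above (4 processors per station, 4 stations per local ring, 4 local rings), giving 64 processors in total; the id-flattening scheme is purely illustrative.

```python
# Prototype configuration from the text.
PROCS_PER_STATION = 4
STATIONS_PER_RING = 4
LOCAL_RINGS = 4

def processor_id(ring: int, station: int, proc: int) -> int:
    """Flatten a (local ring, station, processor) position into a unique
    id -- a hypothetical numbering, used here only to show the hierarchy."""
    return (ring * STATIONS_PER_RING + station) * PROCS_PER_STATION + proc

total_processors = LOCAL_RINGS * STATIONS_PER_RING * PROCS_PER_STATION  # 64
```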
Figure 1: The NUMAchine hierarchy.
The use of ring-based interconnection networks provides numerous advantages: (1) there is a unique path between any two points on the network, so the ordering of packets is always maintained; (2) information can be sent from one point in the network to one or more other points, providing a natural multicast mechanism; and (3) a simple routing scheme can be used, allowing for high-speed operation of the rings. One of the key design features of NUMAchine is that these strengths of ring-based networks are fully exploited to provide an efficient implementation of our cache coherence protocol, as described later. Finally, rings engender a modular design that minimizes the cost of small machine configurations while allowing for relatively large systems.
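The unique-path and multicast properties can be illustrated with a small simulation. This is not the actual NUMAchine router; it is a hedged sketch in which a packet on a unidirectional ring carries a bitmask of destination stations, each station takes a copy if its bit is set, and the packet is removed once every destination has been served. Because the ring is unidirectional, the path (and hence the delivery order) is unique.

```python
def ring_multicast(src: int, dest_mask: int, n_stations: int):
    """Simulate one packet on a unidirectional ring of n_stations.

    dest_mask: bit i set means station i is a destination.
    Yields (station, hops) pairs in the order destinations are reached.
    """
    hops, station, mask = 0, src, dest_mask
    while mask:
        station = (station + 1) % n_stations   # one hop downstream
        hops += 1
        bit = 1 << station
        if mask & bit:                         # this station is a target
            yield station, hops                # station copies the packet
            mask &= ~bit                       # destination serviced
```

For example, a packet from station 0 addressed to stations 1 and 3 on a 4-station ring reaches them in a fixed order, which is why packet ordering between any two points is always maintained.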
The hierarchical structure in Figure 1 supports high throughput when communicating nodes lie within a localized part of the hierarchy, because many concurrent transfers can take place. Such is the case when there is a high degree of locality in data accesses, so that most transfers are within a station or between stations on the same local ring. The longest transfers traverse all levels of the hierarchy, but these transfer times are considerably shorter than if all stations were connected by a single ring. An obvious drawback of the hierarchical structure is its limited bisection bandwidth, which means that software that does not exhibit locality may perform poorly. While there are some applications in which locality is inherently low, we believe that with sufficient operating system, compiler, and program development support, data locality can be high for a large class of applications.
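A rough hop-count comparison makes the latency argument concrete. The model below is an assumption for illustration (uniform per-hop cost, unidirectional rings, contention ignored; real transfer times also depend on the ring interfaces): with the prototype's 16 stations on a single ring, the worst-case path is 15 hops, while in the two-level hierarchy the worst case traverses part of the source local ring, part of the central ring, and part of the destination local ring.

```python
def single_ring_worst(n_stations: int) -> int:
    """Worst-case hops if all stations sat on one unidirectional ring."""
    return n_stations - 1

def hierarchy_worst(stations_per_ring: int, local_rings: int) -> int:
    """Worst-case hops across the two-level hierarchy: up the source local
    ring, around the central ring, down the destination local ring --
    each segment at most (size - 1) hops in this simplified model."""
    return (stations_per_ring - 1) + (local_rings - 1) + (stations_per_ring - 1)
```

Under this simplified model, the longest hierarchical transfer (9 hops for the 4/4/4 prototype) is considerably shorter than the single-ring worst case (15 hops), and the gap widens as the system grows.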
Figure 2: Station Organization.
Within each station, modules are interconnected by a single bus, as shown in Figure 2. A processor module contains a processor with an on-chip primary cache and an external secondary cache. Each memory module includes DRAM to store data and SRAM to hold status information about each cache line for use by the cache coherence protocol. The network cache is relatively large and, unlike the secondary caches, uses DRAM to store data, allowing for larger cache sizes at a reasonable cost. It also includes SRAM to store the tags and status information needed for cache coherence. The local ring interface contains the buffers and circuitry needed to handle packets flowing between the station and the ring. The I/O module contains standard interfaces for connecting disks and other I/O devices.
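The per-cache-line status held in the memory module's SRAM might look like the following. This is a hypothetical sketch: the actual NUMAchine state encoding and coherence protocol are described later and are not given in this section, so the state names and the coarse station-mask representation below are assumptions chosen to show the kind of bookkeeping involved.

```python
from dataclasses import dataclass
from enum import Enum

class LineState(Enum):
    # Hypothetical state names, not the actual NUMAchine encoding.
    VALID_LOCAL = "local"       # only cached within the home station
    VALID_GLOBAL = "global"     # copies may exist at remote stations
    INVALID = "invalid"

@dataclass
class LineStatus:
    """Status a memory module might keep in SRAM for one cache line."""
    state: LineState
    station_mask: int = 0   # bit i set => station i may hold a copy

    def record_sharer(self, station: int) -> None:
        self.station_mask |= 1 << station

    def remote_copies_exist(self) -> bool:
        # More than one station may hold the line, so an invalidation
        # would need to reach several stations -- a natural fit for the
        # ring's multicast capability.
        return bin(self.station_mask).count("1") > 1
```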
The following subsections provide additional details on various aspects of the NUMAchine architecture, including the memory hierarchy, the communication scheme, the cache coherence protocol, and the procedure by which flow control is maintained and deadlock is avoided in NUMAchine.