NUMAchine includes considerable hardware dedicated to monitoring system performance in a non-intrusive fashion. Monitoring hardware is distributed amongst all major sub-systems of the multiprocessor, including the memory system, the processor modules, the ring-interfaces, and the I/O subsystem. For convenience, monitoring hardware is collectively referred to in this section as ``the monitor''.
The monitor serves five purposes: 1) to investigate and evaluate architectural features of NUMAchine, 2) to provide real-time feedback on the utilization of system resources, so that application programs can be tuned, 3) to accumulate trace history that aids hardware and software debugging, 4) to validate our NUMAchine hardware simulator, and 5) to characterize application workloads for high-level performance modelling or network simulations.
A key feature of the monitor is that it is implemented in high-capacity programmable logic devices (PLDs). Because the PLDs being used (Altera MAX7000 Complex PLDs and FLEX8000 FPGAs) are re-programmable, the same monitoring circuits can be re-configured to perform different functions. This offers tremendous flexibility because a wide variety of measurements can be made without incurring excessive cost.
In general, the monitor comprises a number of dedicated hardware counters, flexible SRAM-based counters, and trace memory. The dedicated hardware counters monitor critical resources such as FIFO buffer depths and network utilization. For example, bus and ring-link utilization are important overall performance metrics that can be monitored by dedicated counters. The SRAM-based counters are used to categorize and measure events in a table format. A good example is the memory module, where transactions can be categorized by transaction type and by originator; the monitor maintains a table with one count for each combination of transaction type and originator. This information can help identify resource ``hogs'', or even program bottlenecks. In addition to the counters, trace memory (DRAM) records history information, for example about bus traffic. This allows non-intrusive probing into activity just before or after an important event such as a hardware error or software barrier.
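The table-style counters and the trace memory described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not a description of the NUMAchine circuits: the transaction type names, the number of originators, the trace depth, and the names `MonitorModel`, `observe`, and `busiest_originator` are all hypothetical and chosen for illustration.

```python
from collections import deque

# Hypothetical transaction types and originator count (not taken from the paper).
TXN_TYPES = ("read", "read_exclusive", "write_back")
NUM_ORIGINATORS = 4


class MonitorModel:
    """Software model of a counter table plus a bounded trace memory."""

    def __init__(self, trace_depth=8):
        # One counter per (transaction type, originator) pair, like a table in SRAM.
        self.table = [[0] * NUM_ORIGINATORS for _ in TXN_TYPES]
        # The trace keeps only the most recent trace_depth transactions.
        self.trace = deque(maxlen=trace_depth)

    def observe(self, txn_type, originator):
        """Count one transaction and append it to the trace history."""
        row = TXN_TYPES.index(txn_type)
        self.table[row][originator] += 1
        self.trace.append((txn_type, originator))

    def busiest_originator(self):
        """Return the originator with the most transactions (a resource 'hog')."""
        totals = [sum(row[o] for row in self.table) for o in range(NUM_ORIGINATORS)]
        return max(range(NUM_ORIGINATORS), key=totals.__getitem__)


monitor = MonitorModel(trace_depth=3)
for txn_type, originator in [("read", 0), ("read", 2), ("write_back", 2),
                             ("read_exclusive", 2), ("read", 1)]:
    monitor.observe(txn_type, originator)
```

In this model, reading `monitor.table` corresponds to reading out the counter table, and `monitor.trace` holds the last few transactions seen before the point of inspection, which is the kind of history one would examine after an error or barrier.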
A novel feature of the monitor is that the information it gathers can be correlated with the execution of specific segments of code on particular processors. This is implemented with a small register, called a phase identifier, at each processor. As executing code enters a region that should be distinguishable for monitoring purposes, the code writes into the phase identifier register; the register's value is appended to each transaction from the processor and is used by the monitor throughout the system.
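The phase identifier mechanism can be sketched in software as follows. This is a minimal model under assumptions: the class and field names (`Processor`, `Transaction`, `set_phase`, `record`), the phase values, and the choice to count per (processor, phase) pair are hypothetical and only illustrate how a tag carried on each transaction lets a monitor attribute events to code regions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    processor: int
    phase: int   # copy of the processor's phase identifier at issue time
    kind: str


class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.phase_id = 0   # models the phase identifier register

    def set_phase(self, phase):
        """Software writes the register when it enters a new code region."""
        self.phase_id = phase

    def issue(self, kind):
        """Every outgoing transaction carries the current phase identifier."""
        return Transaction(self.pid, self.phase_id, kind)


counts = Counter()


def record(txn):
    """Monitor side: attribute the event to (processor, phase)."""
    counts[(txn.processor, txn.phase)] += 1


PHASE_INIT, PHASE_SOLVE = 1, 2   # hypothetical phase numbers
cpu = Processor(pid=1)

cpu.set_phase(PHASE_INIT)
for _ in range(2):
    record(cpu.issue("read"))

cpu.set_phase(PHASE_SOLVE)
for _ in range(3):
    record(cpu.issue("write_back"))
```

Because the tag travels with the transaction, the monitoring side needs no knowledge of what the processor is executing; it simply groups events by the tag it sees.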
In this paper, we discuss in more detail only those monitoring circuits associated with the memory subsystem. The reason for this focus is that memory system performance is a key aspect of shared-memory multiprocessor design, and offers many opportunities for improving performance. A memory module in NUMAchine, as mentioned earlier, consists of an incoming FIFO, DRAM for data, SRAM for state information, and an outgoing FIFO. The monitor measures the way in which the memory is being used by all processors in the system; to accomplish this, it monitors the incoming and outgoing FIFOs, and some of the state information for accessed memory locations. There are two main types of monitoring circuits in the memory module: multipurpose counters, and histogram tables. The purpose of each of these is discussed below.