Figure 12: Simulation environment. The parameter file is a text file containing
all information on timing and geometry.
The performance of the prototype has been investigated by means of an
execution-driven simulation using the SPLASH-2 [3]
benchmark suite as input. The simulator itself uses
Mint [22] as a front-end to interpret the native MIPS
binaries produced for the suite.
The back-end performs behavioral modelling of the system at the
cycle level. This includes all timing details (e.g. bus arbitration,
DRAM and SRAM access times) as well as functional details,
such as L1 and L2 data and instruction caches, and a packet-by-packet
model of the rings.
Figure 12 illustrates
the NUMAchine simulation environment. A single binary running on either
SGI or SUN workstations simulates both the multi-threaded application and the
NUMAchine configuration, all of whose details are specified in the text
parameter file. Simulation speed is good given the level of detail:
simulated execution is 100 to 300 times slower than native execution when
running on an SGI Challenge machine. Although aspects such as
instruction fetching and serial code execution can be modelled in the
simulator, they are time-consuming to simulate and do not significantly
affect results.
For this reason, the results in the
rest of this report assume that only data caches and data fetches are
modelled, and that only the parallel section of the code is modelled in
detail. (The serial section of code still executes, but does not generate
events.) Results from more detailed simulations will be reported
in [4].
Table 1: Contention-free request latencies in the simulated
prototype. Reads and interventions
involve 64-byte cache line fills. Upgrades contain no data, only
permission to write.
Table 1 gives the contention-free latencies for different types of accesses as a yardstick for comparison with results in later sections. For this data, we manually calculated the number of clock cycles required in the hardware to perform each type of access (i.e., these numbers do not reflect such architectural features as caches). The two types of remote accesses are requests that traverse only a single lower-level ring and requests that span the whole network. (Note that due to the single-path nature of a ring, the distance between any two stations that are not on the same ring is equal to the span of the network, regardless of the position of the two stations.) Even without the effect of the Network Cache, these numbers indicate that the prototype behaves as a mildly NUMA architecture.
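The position-independence noted in the parenthetical remark can be illustrated with a small hop-count sketch. This is a simplified model and not the simulator's code: it assumes unidirectional rings, a two-level hierarchy, a response that returns in the same ring direction as the request, and a hypothetical geometry (four stations plus one inter-ring interface per local ring, four local rings on the global ring). The function names and the node numbering are likewise illustrative assumptions. The sketch counts ring hops only, not the clock cycles reported in Table 1.

```python
def ring_hops(src: int, dst: int, size: int) -> int:
    """Hops from src to dst on a unidirectional ring with `size` nodes."""
    return (dst - src) % size


def round_trip_same_ring(a: int, b: int, size: int) -> int:
    """Request a -> b plus response b -> a on a single unidirectional ring."""
    return ring_hops(a, b, size) + ring_hops(b, a, size)


def round_trip_cross_ring(a: int, ring_a: int, b: int, ring_b: int,
                          local_size: int, global_size: int) -> int:
    """Round trip between stations on two different local rings.

    Assumption: node 0 of every local ring is its inter-ring interface,
    and local ring r attaches to the global ring at global node r.
    """
    iface = 0
    request = (ring_hops(a, iface, local_size)             # up to the interface
               + ring_hops(ring_a, ring_b, global_size)    # across the global ring
               + ring_hops(iface, b, local_size))          # down to the target
    response = (ring_hops(b, iface, local_size)
                + ring_hops(ring_b, ring_a, global_size)
                + ring_hops(iface, a, local_size))
    return request + response


# Hypothetical geometry, chosen only for illustration.
LOCAL_SIZE = 5    # 4 stations + 1 inter-ring interface
GLOBAL_SIZE = 4   # 4 local rings
stations = range(1, LOCAL_SIZE)  # node 0 is the interface

# Collect every distinct round-trip hop count over all station pairs.
same = {round_trip_same_ring(a, b, LOCAL_SIZE)
        for a in stations for b in stations if a != b}
cross = {round_trip_cross_ring(a, ra, b, rb, LOCAL_SIZE, GLOBAL_SIZE)
         for a in stations for b in stations
         for ra in range(GLOBAL_SIZE) for rb in range(GLOBAL_SIZE) if ra != rb}

print(same, cross)  # one value in each set: {5} {14}
```

Under these assumptions each set contains a single value: a round trip on one ring always covers that whole ring, and a round trip between rings always covers both local rings and the global ring, wherever the two stations sit.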