Simulation Environment




  
Figure 12: Simulation environment. The parameter file is a text file containing all information on timing and geometry.

The performance of the prototype has been investigated by means of an execution-driven simulation that uses the SPLASH-2 [3] benchmark suite as input. The simulator uses Mint [22] as a front-end to interpret the native MIPS binaries produced for the suite, while the back-end performs cycle-level behavioral modelling of the system. This includes all timing details (e.g., bus arbitration, DRAM and SRAM access times) as well as functional details, such as the L1 and L2 data and instruction caches and a packet-by-packet model of the rings. Figure 12 illustrates the NUMAchine simulation environment. A single binary running on either SGI or SUN workstations simulates both the multi-threaded application and the NUMAchine configuration, all of whose details are specified in the text parameter file. Run-time is quite good given the level of detail: when the simulator runs on an SGI Challenge machine, simulated execution is 100 to 300 times slower than native execution. Although aspects such as instruction fetching and serial code execution can be modelled in the simulator, modelling them is time consuming and does not significantly affect the results. For this reason, the results in the rest of this report assume that only data caches and data fetches are modelled, and that only the parallel section of the code is modelled in detail. (The serial section of the code still executes, but does not generate events.) Results from more detailed simulations will be reported in [4].
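
The division of labour between a front-end that produces references and a back-end that assigns them timing can be illustrated with a toy execution-driven loop. This is a minimal sketch under stated assumptions, not the NUMAchine simulator or Mint's interface: the generator `thread_refs` is a hypothetical stand-in for the front-end, and `is_hit`, `HIT_CYCLES`, `MISS_CYCLES` and the single shared bus are hypothetical stand-ins for the cycle-level back-end. The property it shows is that the back-end's timing decides which thread advances next, so contention feeds back into the interleaving of the threads.

```python
import heapq

HIT_CYCLES = 1    # hypothetical cache-hit latency
MISS_CYCLES = 20  # hypothetical bus + memory latency for a miss


def is_hit(addr: int) -> bool:
    # Toy cache stand-in (assumption): even-numbered 64-byte lines hit, odd ones miss.
    return (addr // 64) % 2 == 0


def thread_refs(tid: int, n_refs: int):
    # Hypothetical front-end stand-in: each yielded value is one data reference.
    for i in range(n_refs):
        yield tid * 4096 + i * 64


def simulate(n_threads: int, n_refs: int) -> dict:
    """Return the cycle at which each thread finishes."""
    finish = {}
    streams = {t: thread_refs(t, n_refs) for t in range(n_threads)}
    ready = [(0, t) for t in range(n_threads)]  # (local clock, thread id)
    heapq.heapify(ready)
    bus_free_at = 0  # single shared bus (assumption): misses are serialized on it
    while ready:
        # Always advance the thread whose local clock is furthest behind.
        now, t = heapq.heappop(ready)
        addr = next(streams[t], None)
        if addr is None:
            finish[t] = now  # thread has no more references
            continue
        if is_hit(addr):
            done = now + HIT_CYCLES
        else:
            start = max(now, bus_free_at)  # wait until the bus is free
            done = start + MISS_CYCLES
            bus_free_at = done
        heapq.heappush(ready, (done, t))
    return finish
```

With one thread issuing four references, the run finishes at cycle 42; with two threads, the second thread's misses wait for the bus, so the threads finish at cycles 61 and 81 rather than both at 42.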

  
Table 1: Contention-free request latencies in the simulated prototype. Reads and interventions involve 64-byte cache line fills. Upgrades contain no data, only permission to write.

Table 1 gives the contention-free latencies for different types of accesses as a yardstick for comparison with results in later sections. To obtain these figures, we manually calculated the number of clock cycles the hardware requires to perform each type of access (i.e., these numbers do not reflect the effect of architectural features such as caches). The two types of remote access correspond to requests that traverse only a single lower-level ring and requests that span the whole network. (Note that due to the single-path nature of a ring, the distance between any two stations that are not on the same ring is equal to the span of the network, regardless of the position of the two stations.) Even without the effect of the Network Cache, these numbers indicate that the prototype behaves as a mildly NUMA architecture.
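
The single-path argument can be checked by brute force for round-trip traffic (a request plus its response). On a unidirectional ring, the outbound and return legs together go around the ring exactly once, so the total hop count does not depend on where the two stations sit. The sketch below is illustrative only: the sizes (4 local rings of 4 stations), the placement of one inter-ring interface (IRI) per local ring, and the metric of one hop per ring slot traversed are assumptions, not the prototype's actual cycle counts.

```python
from itertools import product

STATIONS_PER_RING = 4  # assumed: stations on each local ring, excluding the IRI
LOCAL_RINGS = 4        # assumed: local rings attached to the central ring


def hops(src: int, dst: int, ring_size: int) -> int:
    """Hops from src to dst on a unidirectional ring with ring_size slots."""
    return (dst - src) % ring_size


def round_trip(a, b) -> int:
    """Hops for a request a->b plus its response b->a.

    a and b are (local ring, position) pairs. The IRI is assumed to sit at
    position STATIONS_PER_RING on every local ring, and the central ring is
    assumed to have one slot per local ring.
    """
    n = STATIONS_PER_RING + 1  # local ring slots, including the IRI
    iri = STATIONS_PER_RING
    (ra, pa), (rb, pb) = a, b
    if ra == rb:  # both stations on the same local ring
        return hops(pa, pb, n) + hops(pb, pa, n)
    request = hops(pa, iri, n) + hops(ra, rb, LOCAL_RINGS) + hops(iri, pb, n)
    response = hops(pb, iri, n) + hops(rb, ra, LOCAL_RINGS) + hops(iri, pa, n)
    return request + response


stations = list(product(range(LOCAL_RINGS), range(STATIONS_PER_RING)))
same_ring = {round_trip(a, b) for a in stations for b in stations
             if a != b and a[0] == b[0]}
cross_ring = {round_trip(a, b) for a in stations for b in stations
              if a[0] != b[0]}
print(same_ring, cross_ring)  # -> {5} {14}
```

Every same-ring pair costs one trip around a local ring (5 hops here), and every cross-ring pair costs one trip around the source ring, the central ring, and the destination ring (5 + 4 + 5 = 14 hops), which is consistent with there being only two classes of remote access.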






Stephen D. Brown
Wed Jun 28 18:34:27 EDT 1995