To gauge the overall performance of the NUMAchine prototype, the SPLASH-2 suite was run through the simulator to measure parallel speedup; for this data we consider only the parallel section of the code and ignore the sequential section. In the SPLASH-2 suite, the parallel section is defined as the time from the creation of the master thread until the master thread has successfully completed a wait() call for all of its children. This is not a true speedup, but it is in line with other performance measurements of this nature (e.g., see [splash2.suite]). To be conservative, all fixed hardware latencies are set to their actual values in the hardware where those values are known, and to pessimistic values otherwise. In addition, the results shown use a simple round-robin page placement scheme, which is expected to perform more poorly than intelligent placement. (For example, pages containing data used by only one processor, also called private pages, are not placed in the memory local to that processor, which would be simple to ensure in a real system.) For these reasons, we expect the actual prototype hardware to perform at least as well as the results shown here indicate.
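The parallel-section measurement described above can be sketched as follows. This is a minimal illustration, not the simulator's actual instrumentation: the worker body and thread count are placeholders, and Python's threading API stands in for whatever thread package the benchmarks use. The timed interval runs from the master's creation of its children until its wait (join) on all of them completes.

```python
import threading
import time

def worker():
    # Placeholder for a benchmark's per-thread parallel work.
    sum(range(10**6))

NTHREADS = 4  # illustrative thread count
threads = [threading.Thread(target=worker) for _ in range(NTHREADS)]

t0 = time.perf_counter()      # parallel section starts: master creates children ...
for t in threads:
    t.start()
for t in threads:             # ... and ends when the master has waited for all of them
    t.join()
t1 = time.perf_counter()

print(f"parallel section: {t1 - t0:.3f} s")
```

Note that this interval excludes the sequential initialization and cleanup phases, which is why the resulting speedup is not a "true" whole-program speedup.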
Figure 13: Parallel speedup for SPLASH-2 kernels.
Figure 14: Parallel speedup for SPLASH-2 applications.
Table 2: Problem sizes used for the SPLASH-2 benchmarks.
Figures 13 and 14 show the parallel speedups for the SPLASH-2 benchmarks. All benchmarks are unmodified, except for LU-Contig, which used a slightly modified block-allocation scheme to improve workload balance. Table 2 gives the problem sizes used for generating the speedup curves.
Highly parallelizable applications such as Barnes and Water show excellent speedups, as high as 57. Of more interest is NUMAchine's performance on code with a higher degree of data sharing. For FFT and LU, examples of such code, the speedups are still good, especially given the small problem sizes. These results compare favorably with measurements of the SPLASH-2 suite in [splash2.suite] using a perfect memory system. This leads us to believe that with suitable tuning of both hardware and software, performance will be on par with the existing state of the art.
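For reference, the speedup figures quoted above are ratios of parallel-section times. A minimal sketch, using purely illustrative timing values (not measured NUMAchine data):

```python
# Hypothetical parallel-section times (seconds) on 1..p processors.
# These numbers are illustrative only; a run whose 64-processor time
# is 1/57th of its uniprocessor time yields a speedup of about 57.
parallel_times = {1: 64.0, 4: 16.8, 16: 4.4, 64: 1.12}

def speedup(times, p):
    """Speedup on p processors relative to the single-processor run."""
    return times[1] / times[p]

for p in sorted(parallel_times):
    print(f"{p:3d} processors: speedup {speedup(parallel_times, p):.1f}")
```

Because the sequential section is excluded from both numerator and denominator, this ratio measures scalability of the parallel phase only.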
Figure 15: Network cache total hit rate.
Figure 16: Network cache combining rate.