Alan Adamson (IBM):

SPEC

SPEC is an industry consortium devoted to promoting 'fair' performance benchmarking of computer systems. Alan Adamson is a member of the SPEC Board of Directors and is actively involved in several of its benchmark subcommittees as an IBM representative. The talk will cover the origins of SPEC and its processes for developing benchmarks, assuring fairness, and enforcing compliance by members. It will then discuss in more detail the development of SPECjbb2005, in which the speaker was involved, and how the primary JVM vendors have responded to this new benchmark.

Reza Azimi:

Online Performance Analysis using Hardware Performance Counters

Hardware Performance Counters (HPCs) can potentially play an important role in analyzing performance and identifying the root causes of performance problems. However, HPCs are difficult to use for several reasons. First, there are too few of them, considering that any meaningful analysis requires simultaneous monitoring of many hardware events. Second, HPCs primarily count low-level micro-architectural events, from which it is difficult to extract the high-level insight required for identifying the causes of performance problems. We describe two techniques that help overcome these limitations. First, we use high-frequency multiplexing of HPCs to make a larger set of "logical" HPCs available for analysis. Second, we show how a stall breakdown model can be built online as a starting point for identifying performance bottlenecks. We have implemented our techniques on two IBM PowerPC processors (the PowerPC 970 and the POWER5).
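
To make the multiplexing idea concrete, here is a minimal, self-contained C sketch of time-sharing a few physical counters among a larger set of hardware events and scaling each event's count by its sampling fraction to estimate the corresponding "logical" HPC. This is an illustration under stated assumptions, not the actual implementation (which runs inside the kernel against the PowerPC PMU); in particular, read_physical_hpc() is a hypothetical stand-in that fabricates counts so the sketch compiles and runs on its own.

    /*
     * Sketch of high-frequency HPC multiplexing: NUM_PHYSICAL counters are
     * rotated among NUM_LOGICAL events, and each event's raw count is scaled
     * by the inverse of the fraction of intervals during which its group was
     * actually programmed, yielding an estimated full-run count.
     */
    #include <stdio.h>
    #include <stdlib.h>

    #define NUM_PHYSICAL  4                       /* counters the PMU exposes at once */
    #define NUM_LOGICAL   16                      /* events we actually want to track */
    #define NUM_GROUPS    (NUM_LOGICAL / NUM_PHYSICAL)
    #define INTERVALS     10000                   /* multiplexing rounds (e.g., 1 kHz timer) */

    /* Hypothetical counter read: pretend each event fires at its own rate. */
    static unsigned long read_physical_hpc(int logical_event)
    {
        return 100 + (unsigned long)(rand() % (10 * (logical_event + 1)));
    }

    int main(void)
    {
        unsigned long raw[NUM_LOGICAL]       = {0};  /* counts accumulated while scheduled */
        unsigned long scheduled[NUM_LOGICAL] = {0};  /* intervals each event was scheduled */

        for (long t = 0; t < INTERVALS; t++) {
            int group = t % NUM_GROUPS;              /* rotate event groups every interval */
            for (int slot = 0; slot < NUM_PHYSICAL; slot++) {
                int event = group * NUM_PHYSICAL + slot;
                raw[event]       += read_physical_hpc(event);
                scheduled[event] += 1;
            }
        }

        /* Scale by the sampling fraction to estimate the full-run count. */
        for (int event = 0; event < NUM_LOGICAL; event++) {
            double fraction = (double)scheduled[event] / INTERVALS;
            double estimate = raw[event] / fraction;
            printf("logical HPC %2d: raw=%10lu  scheduled=%3.0f%%  estimate=%12.0f\n",
                   event, raw[event], 100.0 * fraction, estimate);
        }
        return 0;
    }

The same scaled, per-event estimates are what a stall breakdown model would consume, attributing stall cycles to sources such as cache misses or branch mispredictions.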

David Tam:

Cache-Aware Thread Scheduling for Server Workloads on an SMP-CMP-SMT System

Multiprocessors that combine parallelism at the SMT, CMP, and SMP levels will be common in the near future. Two main differences from previous-generation multiprocessors are (1) shared on-chip L2 caches and (2) lower task-migration costs within a chip. Current operating system schedulers are largely unaware of the topology of the memory hierarchy, and as a result they may distribute threads across processors in a way that causes many unnecessary long-latency cache misses. For certain types of multi-threaded server workloads, we observe that clustering threads according to their sharing patterns significantly improves cache locality and performance. Using an 8-way POWER5 SMP-CMP-SMT multiprocessor running Linux on commercial server workloads (SPECjbb, Volano, etc.), we examine the maximum potential performance improvement of such a cache-aware OS scheduling policy. Using POWER5 HPCs, we analyze the impact of the new scheduling policy both on the utilization of caches at the various levels and on the actual performance, measured in terms of IPC.
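
As a rough illustration of the clustering idea (not the in-kernel policy described in the talk, which derives sharing patterns from POWER5 HPCs inside the Linux scheduler), the following user-level C sketch groups threads by a made-up sharing matrix and binds each cluster to the CPUs of one chip, so that threads which share data also share an on-chip L2. The topology (2 chips, CPUs 0-3 on chip 0 and 4-7 on chip 1) and the sharing values are assumptions for illustration; the affinity calls are standard Linux/glibc ones. Compile with gcc -pthread.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    #define NTHREADS       4
    #define CPUS_PER_CHIP  4                 /* assumed: 8 CPUs, 2 chips */

    /* Hypothetical sharing matrix: sharing[i][j] ~ how much data threads
     * i and j touch in common (in the real system, estimated from HPCs). */
    static const int sharing[NTHREADS][NTHREADS] = {
        { 0, 9, 1, 0 },
        { 9, 0, 0, 1 },
        { 1, 0, 0, 8 },
        { 0, 1, 8, 0 },
    };

    static void *worker(void *arg)
    {
        long id = (long)arg;
        /* Placeholder for real server-workload work. */
        printf("thread %ld running on CPU %d\n", id, sched_getcpu());
        return NULL;
    }

    int main(void)
    {
        int chip_of[NTHREADS];
        pthread_t tid[NTHREADS];
        int next_chip = 0;

        /* Naive clustering: pair each thread with the peer it shares the
         * most data with, and place each pair on its own chip.          */
        for (int i = 0; i < NTHREADS; i++) chip_of[i] = -1;
        for (int i = 0; i < NTHREADS; i++) {
            if (chip_of[i] != -1) continue;
            int best = -1;
            for (int j = 0; j < NTHREADS; j++)
                if (j != i && chip_of[j] == -1 &&
                    (best == -1 || sharing[i][j] > sharing[i][best]))
                    best = j;
            chip_of[i] = next_chip;
            if (best != -1) chip_of[best] = next_chip;
            next_chip = (next_chip + 1) % 2;
        }

        /* Bind each thread to all CPUs of its chosen chip, leaving the
         * finer core/SMT placement to the stock scheduler.              */
        for (long i = 0; i < NTHREADS; i++) {
            cpu_set_t mask;
            pthread_attr_t attr;

            CPU_ZERO(&mask);
            for (int c = 0; c < CPUS_PER_CHIP; c++)
                CPU_SET(chip_of[i] * CPUS_PER_CHIP + c, &mask);

            pthread_attr_init(&attr);
            pthread_attr_setaffinity_np(&attr, sizeof(mask), &mask);
            pthread_create(&tid[i], &attr, worker, (void *)i);
            pthread_attr_destroy(&attr);
            printf("thread %ld -> chip %d\n", i, chip_of[i]);
        }
        for (int i = 0; i < NTHREADS; i++) pthread_join(tid[i], NULL);
        return 0;
    }

In the scenario sketched here, threads 0 and 1 (heavy sharers) land on chip 0 and threads 2 and 3 on chip 1, so each pair's shared working set stays in a single on-chip L2 instead of bouncing between chips.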


Greg Steffan
Last modified: Fri Aug 25 18:42:42 EDT 2006