.
Title: A Comparison of Blocking and Non-blocking Packet Switching Techniques in Hierarchical Ring Networks

Authors: G. Ravindran and M. Stumm

Where: IEICE Trans. Inf. & Syst., vol. E79-D, No. 8, August 1996

Keywords: Networks, Switching, Wormhole, Virtual Cut-through, Hierarchical Ring Networks, Slotted Rings

Abstract: This paper presents the results of a simulation study of blocking and non-blocking switching for hierarchical ring networks. The switching techniques include wormhole, virtual cut-through, and slotted ring. We conclude that the slotted ring network performs better than the more popular wormhole and virtual cut-through networks. We also show that the size of the node buffers is an important parameter and that choosing buffers that are too large can hurt performance in some cases. Slotted rings have the advantage that the choice of buffer size is easier, in that larger-than-necessary buffers do not hurt performance; hence a single choice of buffer size performs well for all system configurations. In contrast, the optimal buffer size for virtual cut-through and wormhole switching nodes varies depending on the system configuration and the level of the hierarchy in which the switching node lies.

.
Title: Hierarchical Ring Topologies and the effect of their Bisection Bandwidth Constraints

Authors: G. Ravindran and M. Stumm

Where: Proc. Intl. Conf. on Parallel Processing, pp.I/51-55, 1995

Keywords: Multiprocessor architectures, Interconnection networks, Hierarchical rings, Bisection bandwidth

Abstract: Ring-based hierarchical networks are interesting alternatives to popular direct networks such as 2D meshes or tori. They allow for simple router designs, wider communication paths, and faster networks than their direct network counterparts. However, they have a constant bisection bandwidth, regardless of system size. In this paper, we present the results of a simulation study to determine how large hierarchical ring networks can become before their performance deteriorates due to their bisection bandwidth constraint. We show that a system with up to 128 processors can sustain most memory access behaviors, but that larger systems can be sustained only if their bisection bandwidth is increased.

.
Title: Processor Pool-Based Scheduling for Large-Scale NUMA Multiprocessors

Authors: Songnian Zhou and Timothy Brecht

Where: Proceedings of the 1991 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, May 1991, pp. 133-142.

Keywords: NUMA, scheduling, multiprocessor performance

Abstract:

Large-scale Non-Uniform Memory Access (NUMA) multiprocessors are gaining increased attention due to their potential for achieving high performance through the replication of relatively simple components. Because of the complexity of such systems, scheduling algorithms for parallel applications are crucial in realizing the performance potential of these systems. In particular, scheduling methods must consider the scale of the system, with the increased likelihood of creating bottlenecks, along with the NUMA characteristics of the system, and the benefits to be gained by placing threads close to their code and data.

We propose a class of scheduling algorithms based on processor pools. A processor pool is a software construct for organizing and managing a large number of processors by dividing them into groups called pools. The parallel threads of a job are run in a single processor pool, unless there are performance advantages for a job to span multiple pools. Several jobs may share one pool. Our simulation experiments show that processor pool-based scheduling may effectively reduce the average job response time. The performance improvements attained by using processor pools increase with the average parallelism of the jobs, the load level of the system, the differentials in memory access costs, and the likelihood of having system bottlenecks. As the system size increases, while maintaining the workload composition and intensity, we observed that processor pools can be used to provide significant performance improvements. We therefore conclude that processor pool-based scheduling may be an effective and efficient technique for scalable systems.

.
Title: On the Importance of Parallel Application Placement in NUMA Multiprocessors

Authors: Timothy Brecht

Where: Proceedings of the Fourth Symposium on Experiences with Distributed and Multiprocessor Systems (SEDMS IV), San Diego, CA, September, 1993.

Keywords: NUMA, multiprocessor scheduling, multiprocessor performance

Abstract:

The thesis of this paper is that scheduling decisions in large-scale, shared-memory, NUMA (Non-Uniform Memory Access) multiprocessors must consider not only how many processors, but also which processors to allocate to each application. We call the problem of assigning parallel processes of an application to processors application placement.

We explore the importance of placement decisions by measuring the execution time of several parallel applications using different placements on a shared-memory NUMA multiprocessor. The results of these experiments lead us to conclude that, as expected, in small-scale, mildly NUMA multiprocessors, placement decisions have only a minor effect on the execution time of parallel applications. However, the results also show that placement decisions in large-scale multiprocessors are critical, and that localization that considers the architectural clusters inherent in these systems is essential. Our experiments also show that the importance of placement decisions increases substantially with the size and NUMAness of the system, and that the placement of individual processes of an application within the subset of chosen processors also significantly impacts performance.

.
Title: Generalized Unimodular Loop Transformations for Distributed Memory Multiprocessors (does not contain figures)

Authors: K G Kumar*, D Kulkarni+ and A Basu

Center for Development of Advanced Computing 2/1 Brunton Road, Bangalore 560 025, India
* Now at IBM T.J. Watson, Yorktown Heights, NY 10598
+ Now at Dept of Computer Science, University of Toronto, Toronto, ON M5S 1A4

Where: International Conference on Parallel Processing, 1991

Keywords: Parallelizing Compilers, Restructuring Transformations, Loop Partitioning, Iteration Spaces, Dependence Vectors.

Abstract:

In this paper, we present a generalized unimodular loop transformation as a simple, systematic and elegant method for partitioning the iteration spaces of nested loops for execution on distributed memory multiprocessors. We present a methodology for deriving the transformations that internalize multiple dependences in a multidimensional iteration space without resulting in a deadlocking situation. We then derive the general expression for the bounds of the transformed loops in terms of the bounds of the original space and the transformation matrix elements.
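As a minimal illustration of the ideas the abstract names (not the paper's own derivation), the sketch below applies a unimodular matrix to a 2D iteration space and checks legality: a transformation is legal only if every transformed dependence vector remains lexicographically positive. The skewing matrix and dependence vectors are assumed examples, not taken from the paper.

```python
# Hypothetical sketch: a unimodular transformation T acting on a 2D
# iteration space and its dependence vectors. T is unimodular when it
# is an integer matrix with determinant +/- 1, so it maps the integer
# iteration space one-to-one onto another integer space.

def apply_matrix(T, v):
    """Multiply a 2x2 integer matrix T by a 2-vector v."""
    return (T[0][0] * v[0] + T[0][1] * v[1],
            T[1][0] * v[0] + T[1][1] * v[1])

def lex_positive(v):
    """True if v is lexicographically positive (first nonzero entry > 0)."""
    for x in v:
        if x > 0:
            return True
        if x < 0:
            return False
    return False

def is_legal(T, deps):
    """A transformation is legal if it keeps every dependence vector
    lexicographically positive (no dependence is reversed)."""
    return all(lex_positive(apply_matrix(T, d)) for d in deps)

# Loop skewing: T = [[1, 0], [1, 1]], det = 1, hence unimodular.
skew = ((1, 0), (1, 1))
deps = [(1, 0), (0, 1), (1, -1)]   # example dependence vectors
assert is_legal(skew, deps)
# Skewing maps (1, -1) to (1, 0): after the transformation the inner
# loop carries no dependence, so its iterations can run in parallel.
assert apply_matrix(skew, (1, -1)) == (1, 0)
```

The paper's contribution goes further, deriving transformations that internalize multiple dependences and computing the transformed loop bounds; the legality test above is only the basic building block.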

.
Title: Deriving Good Transformations for Mapping Nested Loops on Hierarchical Parallel Machines in Polynomial Time

Authors: K G Kumar*, D Kulkarni+ and A Basu

Center for Development of Advanced Computing 2/1 Brunton Road, Bangalore 560 025, India
* Now at IBM T.J. Watson, Yorktown Heights, NY 10598
+ Now at Dept of Computer Science, University of Toronto, Toronto, ON M5S 1A4

Where: International Conference on Supercomputing, 1992

Keywords: Parallelizing Compilers, Restructuring Transformations, Loop Partitioning, Iteration Spaces, Dependence Vectors.

Abstract:

We present a computationally efficient method for deriving the most appropriate transformation and mapping of a nested loop for a given hierarchical parallel machine. This method is in the context of our systematic and general theory of unimodular loop transformations for the problem of iteration space partitioning \cite{kandk6}. Finding an optimal mapping or an optimal associated unimodular transformation is NP-complete. We present a polynomial time method for obtaining a 'good' transformation using a simple parameterized model of the hierarchical machine. We outline a systematic methodology for obtaining the most appropriate mapping.

.
Title: Locality and Loop Scheduling on NUMA Multiprocessors

Authors: Hui Li, Sudarsan Tandri, Michael Stumm, and Kenneth C. Sevcik

Where: International Conference on Parallel Processing 93

Keywords: NUMA multiprocessors, Locality, Scheduling

Abstract:

An important issue in the parallel execution of loops is how to partition and schedule the loops onto the available processors. While most existing dynamic scheduling algorithms manage load imbalances well, they fail to take locality into account and therefore perform poorly on parallel systems with non-uniform memory access times. In this paper, we propose a new loop scheduling algorithm, Locality-based Dynamic Scheduling (LDS), that exploits locality, and dynamically balances the load.

.
Title: The shared regions approach to software cache coherence on multiprocessors

Authors: Harjinder Sandhu, Benjamin Gamsa and Songnian Zhou

Where: Proceedings of the 1993 ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, May (1993).

Keywords: NUMA, cache coherence, multiprocessor performance

Abstract:

The effective management of caches is critical to the performance of applications on shared-memory multiprocessors. In this paper, we discuss a technique for software cache coherence that is based upon the integration of a program-level abstraction for shared data with software cache management. The program-level abstraction, called Shared Regions, explicitly relates synchronization objects with the data they protect. Cache coherence algorithms are presented which use the information provided by shared region primitives, and ensure that shared regions are always cacheable by the processors accessing them. Measurements and experiments of the Shared Regions approach on a shared-memory multiprocessor are shown. Comparisons with other software-based coherence strategies, including a user-controlled strategy and an operating system-based strategy, show that this approach is able to deliver better performance, with relatively low corresponding overhead and only a small increase in programming effort. Compared to a compiler-based coherence strategy, the Shared Regions approach still performs better than a compiler that can achieve 90% accuracy in allowing caching, as long as the regions are a few hundred bytes or larger, or they are re-used a few times in the cache.

.
Title: Architectural Support for Block Transfers in a Shared-Memory Multiprocessor

Authors: Steven J.E. Wilton and Zvonko G. Vranesic

Where: Fifth IEEE Symposium on Parallel and Distributed Processing, Irving, Texas, December 1993

Keywords: Shared-memory multiprocessor, block transfer support

Abstract:

This paper examines how the performance of a shared-memory multiprocessor can be improved by including hardware support for block transfers. A system similar to the Hector multiprocessor developed at the University of Toronto is used as a base architecture. It is shown that such hardware support can improve the performance of initialization code by as much as 50%, but that the amount of improvement depends on the memory access behavior of the program and the way in which the operating system issues block transfer requests.

.
Title: Performance Benefits and Limitations of Large NUMA Multiprocessors

Authors: Kenneth C. Sevcik and Songnian Zhou

Where: Proceedings of Performance '93, Rome, Italy, September 27 to October 1, 1993, pp. 183-204, Elsevier Science Publ.

Abstract: Please see the postscript file.

.
Title: Hot Spot Analysis in Large Scale Shared Memory Multiprocessors

Authors: Karim Harzallah and Kenneth C. Sevcik

Where: Proceedings of the Supercomputing '93 Conference, November, 1993, Portland, Oregon.

Abstract: Please see the postscript file.

.
Title: Application Scheduling and Processor Allocation in Multiprogrammed Parallel Processing Systems

Authors: Kenneth C. Sevcik

Where: (Journal of) Performance Evaluation, vol. 19 (1994), pp. 107-140 (Special issue on the performance evaluation of parallel systems)

Abstract: Please see the postscript file.

.
Title: Performance Evaluation of Hierarchical Ring-Based Shared Memory Multiprocessors

Authors:
Mark Holliday
Dept. of Computer Science, Duke University, Durham, NC 27706

Michael Stumm
Dept. of Electrical and Computer Engineering
University of Toronto, Toronto, Canada M5S 1A4

Date: November 1992; revised April 1993

Where: Technical Report CS-1992-18, Duke University
IEEE Transactions on Computers

Keywords: communication locality; hierarchical ring-based networks; hot spots; large scale parallel systems; memory banks; performance evaluation; prefetching; shared memory multiprocessors; simulation.

Abstract:

This paper investigates the performance of word-packet, slotted unidirectional ring-based hierarchical direct networks in the context of large-scale shared memory multiprocessors. Slotted unidirectional rings are attractive because their electrical characteristics and simple interfaces allow for fast cycle times and large bandwidths. For large-scale systems, it is necessary to use multiple rings for increased aggregate bandwidth. Hierarchies are attractive because the topology ensures unique paths between nodes, simple node interfaces and simple inter-ring connections.

To ensure that a realistic region of the design space is examined, the architecture of the network used in the Hector prototype is adopted as the initial design point. A simulator of that architecture has been developed and validated with measurements from the prototype. The system and workload parameterization reflects conditions expected in the near future.

The results of our study show the importance of system balance on performance. Large-scale systems inherently have large communication delays for distant accesses, so processor efficiency will be low unless the processors can operate with multiple outstanding transactions using techniques such as prefetching, asynchronous writes and multiple hardware contexts. However, with multiple outstanding transactions and only one memory bank per processing module, memory quickly saturates. Memory saturation can be alleviated by having multiple memory banks per processing module, but this shifts the bottleneck to the ring subsystem. While the topology of the ring hierarchy affects performance, the ring subsystem will inherently limit the throughput of the system. Hence, increasing the number of outstanding transactions per processor beyond a certain point only has a limiting effect on performance, since it causes some of the rings to become congested. An adaptive maximum number of outstanding transactions appears necessary to adjust for the appropriate tradeoff between concurrency and contention as the communication locality changes. We show the relationships between processor, ring and memory speeds, and their effects on performance. Using block transfers for prefetching seems unlikely to be advantageous, in that the improvement in the cache hit ratio needed to compensate for the increased network utilization is substantial.

.
Title: A Comparison of basic CPU Scheduling Algorithms for Multiprocessor Unix

Authors: Stephen Curran and Michael Stumm

Where: Computer Systems, 3(4), Oct., 1990, pp. 551--579.

Abstract:

In this paper, we present the results of a simulation study comparing three basic algorithms that schedule independent tasks in multiprocessor versions of Unix. Two of these algorithms, namely Central Queue and Initial Placement, are obvious extensions to the standard uniprocessor scheduling algorithm and are in use in a number of multiprocessor systems. A third algorithm, Take, is a variation on Initial Placement, where processors are allowed to raid the task queues of the other processors. Our simulation results show the difference between the performance of the three algorithms to be small when scheduling a typical Unix workload running on a small, bus-based, shared memory multiprocessor. They also show that the Take algorithm performs best for those multiprocessors on which tasks incur overhead each time they migrate. In particular, the Take algorithm appears to be more stable than the other two algorithms under extreme conditions.
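The core mechanisms of Initial Placement and the Take variation described above can be sketched in a few lines. This is an assumed toy model, not the paper's simulator: task names, the choice of victim (longest peer queue), and taking from the far end of the victim's queue are all illustrative assumptions.

```python
from collections import deque

class Processor:
    def __init__(self, pid):
        self.pid = pid
        self.queue = deque()   # local run queue (Initial Placement)

def next_task(me, processors, allow_take=True):
    """Pick the next task for `me`: run from the local queue first;
    under the Take discipline, an idle processor raids the peer with
    the longest queue instead of sitting idle."""
    if me.queue:
        return me.queue.popleft()
    if allow_take:
        victim = max(processors, key=lambda p: len(p.queue))
        if victim.queue:
            return victim.queue.pop()    # take from the far end
    return None                          # plain Initial Placement: idle

procs = [Processor(i) for i in range(3)]
procs[0].queue.extend(["t1", "t2", "t3"])
# Processor 2 is idle; with Take it raids processor 0's queue.
assert next_task(procs[2], procs) == "t3"
# Without Take (plain Initial Placement) it remains idle.
assert next_task(procs[2], procs, allow_take=False) is None
```

The sketch also makes the paper's migration-cost observation concrete: a taken task runs on a different processor than the one it was placed on, which is exactly where per-migration overhead would be charged in the simulation.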

.
Title: Hierarchical Clustering: A Structure for Scalable Multiprocessor Operating System Design

Authors: Michael Stumm, Ron Unrau, and Orran Krieger

Where: Extended version of Clustering Micro-Kernels for Scalability, Proc. of the Usenix Workshop on Micro-Kernels and Other Kernel Architectures, April, 1992.

Abstract: Please see the postscript file.

.
Title: Experience with the Hector Multiprocessor

Authors: Michael Stumm, Zvonko Vranesic, Ron White

Where: Extended version of paper with same title in Proc. Intl. Parallel Processing Symposium Parallel Systems Fair, 1993, pp. 9-16.

Abstract: Please see the postscript file.

.
Title: The Alloc Stream Facility: A redesign of application-level Stream I/O

Authors: O. Krieger, M. Stumm, and R. Unrau

Where: IEEE Computer, 27(3), March, 1994, pp. 75--83.

Abstract:

This paper introduces a new application-level I/O facility called the Alloc Stream Facility (ASF). ASF has several key advantages. First, performance is substantially improved as a result of a) the structure of the facility, which allows it to take advantage of system-specific features like mapped files, and b) a reduction in data copying and in the number of I/O system calls. Second, the facility is designed for multi-threaded applications running on multiprocessors and allows for a high degree of concurrency. Finally, the facility can support a variety of I/O interfaces, including stdio, emulated Unix I/O, ASI, and C++ streams, in a way that allows applications to freely intermix calls to the different interfaces, resulting in improved code re-usability. We show that on several Unix workstation platforms, I/O-intensive applications perform substantially better when linked with ASF instead of the native facilities -- in the best case, up to twice as fast. Modifying the applications to use a new interface provided with ASF can improve performance even more.

.
Title: HFS: A Flexible File System for Large-Scale Multiprocessors

Authors: Orran Krieger and Michael Stumm

Where: Proceedings of the 1993 DAGS/PC Symposium

Abstract:

The Hurricane File System (HFS) is a new file system being developed for large-scale shared memory multiprocessors with distributed disks. The main goal of this file system is scalability; that is, the file system is designed to handle demands that are expected to grow linearly with the number of processors in the system. To achieve this goal, HFS is designed using a new structuring technique called Hierarchical Clustering. HFS is also designed to be flexible in supporting a variety of policies for managing file data and for managing file system state. This flexibility is necessary to support in a scalable fashion the diverse workloads we expect for a multiprocessor file system.

.
Title: A Fair Fast Scalable Reader-Writer Lock

Authors: O. Krieger, M. Stumm, R. Unrau, and J. Hanna

Where: Proc. Intl. Conf. on Parallel Processing, 1993.

Abstract:

A reader-writer lock allows either multiple readers to inspect shared data or a single writer exclusive access to that data. On shared memory multiprocessors, the cost of acquiring and releasing these locks can have a large impact on the performance of parallel applications. Other researchers have shown how to implement scalable locks, that is, locks that can become contended without resulting in memory or interconnection network contention. This paper describes a new algorithm for a reader-writer lock that, while being scalable in the contended case, has a low overhead in the uncontended case. This is important because most parallel applications are written so that locks are typically uncontended. The only atomic operation required by this algorithm is fetch_and_store and hence it can be used on most current multiprocessor systems. Experimental results are provided.
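The paper's scalable reader-writer algorithm queues readers and writers and is considerably more involved than what fits here; as a minimal sketch of the primitive it relies on, the code below builds a simple spin lock whose only atomic operation is fetch_and_store (atomic exchange). The `Cell` class is an assumption that simulates a hardware atomic word in Python; this is not the paper's algorithm, only an illustration of locking from fetch_and_store alone.

```python
import threading

class Cell:
    """Simulated memory word with an atomic fetch_and_store (exchange).
    The internal guard stands in for the hardware atomicity of the
    fetch_and_store instruction; it is not part of the technique."""
    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()

    def fetch_and_store(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old

class SpinLock:
    """Exchange-based spin lock: acquire swaps True into the flag and
    retries until the old value it gets back is False."""
    def __init__(self):
        self._held = Cell(False)

    def acquire(self):
        while self._held.fetch_and_store(True):
            pass                         # spin until we observe False

    def release(self):
        self._held.fetch_and_store(False)

lock = SpinLock()
counter = 0

def worker():
    global counter
    for _ in range(1000):
        lock.acquire()
        counter += 1                     # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert counter == 4000                   # no lost updates
```

A reader-writer lock built on this primitive additionally has to distinguish reader and writer queue elements so that a batch of readers can proceed concurrently, which is where the scalability argument of the paper comes in.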

.
Title: Loop and Data Transformations: A tutorial

Authors: Dattatraya Kulkarni and Michael Stumm

Where: CSRI Tech Report 337, University of Toronto, June 1993.

Abstract:

Hierarchically structured machines appear to be becoming the dominant parallel computing structure. These systems have non-uniform access times. We address the problem of restructuring a possibly sequential program to execute efficiently on such parallel machines. This restructuring involves transforming and partitioning the loop structures and the data so as to improve parallelism, static and dynamic locality, and load balance. The objective of this paper is to present previous and ongoing work on loop and data transformations, and to motivate a unified framework for restructuring a sequence of loops and their data so as to execute efficiently on parallel machines with several levels of hierarchy.

.
Title: Data reorganization in parallel database systems

Authors: Chaitanya Baru & Daniel C. Zilio

Where: Proc. of the IEEE Workshop on Advances in Parallel and Distributed Systems, Princeton, NJ, pp. 102-107, Oct. 1993.

Abstract:

Parallel database systems are suitable for use in applications with high capacity, performance, and availability requirements. The trend in such systems is to provide efficient on-line capability for performing various system administration functions, such as index creation and maintenance, backup/restore, reorganization, and the gathering of statistics. For some of these functions, the on-line capability can be efficiently supported by the use of "incremental algorithms", i.e., algorithms that achieve the function in several relatively small (i.e., less time-consuming) steps, rather than in a single, large step. Incremental algorithms ensure that only small parts of the database become inaccessible, and only for short durations, as opposed to non-incremental algorithms, which may lock large portions of the database, or the entire database, for a longer duration. In this paper, we discuss issues in providing concurrent data reorganization capability using incremental algorithms in parallel database systems.

.
Title: Computational Alignment: A new, unified program transformation for local and global optimization

Authors: Dattatraya Kulkarni and Michael Stumm

Where: CSRI Tech report 292, ISSN 0834-1648

Abstract:

Computational Alignment is a new class of program transformations suitable for both local and global optimization. Computational Alignment transforms all of the computations of a portion of the loop body in order to align them to other computations either in the same loop or in another loop. It extends along a new dimension and is significantly more powerful than linear transformations because i) it can transform subsets of dependences and references; ii) it is sensitive to the location of data in that it can move the computation relative to data; iii) it applies to imperfect loop nests; and iv) it is the first loop transformation that can change access vectors. Linear transformations are just a special case of Computational Alignment. Computational Alignment is highly suitable for global optimization because it can transform given loops to access data in similar ways. Two important subclasses of Computational Alignment are presented as well, namely, Freeing and Isomerizing Computational Alignment.

.
Title: Multiprogrammed Parallel Application Scheduling in NUMA Multiprocessors

Authors: Timothy B. Brecht

Where: Ph.D. Dissertation - CSRI Technical Report CSRI-303

Abstract:

The invention, acceptance, and proliferation of multiprocessors are primarily a result of the quest to increase computer system performance. The most promising features of multiprocessors are their potential to solve problems faster than previously possible and to solve larger problems than previously possible. Large-scale multiprocessors offer the additional advantage of being able to execute multiple parallel applications simultaneously.

The execution time of a parallel application is directly related to the number of processors it is allocated and, in shared-memory non-uniform memory access time (NUMA) multiprocessors, which processors it is allocated. As a result, efficient and effective scheduling becomes critical to overall system performance. In fact, it is likely to be a contributing factor in ultimately determining the success or failure of shared-memory NUMA multiprocessors.

The subjects of this dissertation are the problems of processor allocation and application placement. The processor allocation problem involves determining the number of processors to allocate to each of several simultaneously executing parallel applications and possibly dynamically adjusting those allocations to improve overall system performance. The performance metric used is mean response time. We show that by differentiating between applications based on the amount of remaining work they have to execute, performance can be improved significantly. Then we propose techniques for estimating an application's expected remaining work along with policies for using these estimates to make improved processor allocation decisions. An experimental evaluation demonstrates the promise of this approach.

The placement problem involves determining which of the many processors to assign to each application. Using experiments conducted on a representative system, we demonstrate that in large-scale NUMA multiprocessors the execution time of parallel applications is significantly affected by the placement of the application. This motivates the need for new techniques designed explicitly for NUMA multiprocessors. We introduce such a technique, called processor pool-based scheduling, that is designed to localize the execution of parallel applications within a NUMA architecture and to isolate different parallel applications from each other. An experimental evaluation of this scheduling method shows that it can be used to significantly reduce mean response time over methods that do not consider the placement of parallel applications.

.
Title: Region-Oriented Main Memory Management in Shared-Memory NUMA Multiprocessors

Authors: Benjamin Gamsa

Where: M.Sc. Thesis

Abstract:

In Non-Uniform Memory Access time (NUMA) multiprocessors, distribution of the memory modules facilitates architectural scaling, but creates complications for the programmers who must be concerned with the physical distribution of their data in order to obtain good performance.

In order to reduce the impact of remote accesses, in this thesis we propose that data be partitioned into Shared Regions that reflect the granularity of data sharing in programs, and that special synchronization calls be added to enforce proper ordering of accesses to the shared data as well as to manage replication and consistency transparently to the programmer.

Results from measurements on a 16-processor NUMA multiprocessor and from a model of the system indicate that the Shared Regions approach is successful in obtaining the necessary locality critical to performance, while incurring only minimal overhead. Data distribution methods are also observed to have a significant impact on the performance of the system, especially in the larger multiprocessors modeled.

.
Title: Scalable Memory Management through Hierarchical Symmetric Multiprocessing

Authors: Ronald C. Unrau

Where: Ph.D. Dissertation

Abstract:

This dissertation examines scalability issues in the design of operating systems for large-scale, shared-memory multiprocessors. In particular, the thesis focuses on structuring issues as they relate to memory management.

From a set of simple, well-known queuing network formulas, we derive a set of properties that describe sufficient conditions for an operating system to scale. From these properties we first develop a set of guidelines for designing scalable systems, and then develop a new structuring philosophy for shared-memory multiprocessor operating systems, called Hierarchical Symmetric Multiprocessing (HSM).

HSM manages the system resources in clusters, using tight coupling within a cluster, and loose coupling across clusters. Distributed systems principles are applied by distributing and replicating system services and data objects to increase locality, increase concurrency, and to avoid centralized bottlenecks, thus making the system scalable. However, tight coupling is used within a cluster, so the system performs well for local interactions. HSM maximizes locality, which is key to good performance in large systems, and systems based on HSM can easily be adapted to different hardware configurations and architectures by changing the size of the clusters. Finally, HSM leads to a modular system composed of easy-to-design and hence efficient building blocks.

Memory management is a particularly challenging service to implement within the HSM framework, because it must provide the applications with an integrated and coherent view of a single system, while distributing and replicating services in order to fully exploit the hardware potential. We describe in detail the implementation of an HSM structured memory management subsystem, and evaluate the performance of our implementation on Hector, a prototype scalable shared memory multiprocessor.

.
Title: Processor Scheduling in Multiprogrammed Shared Memory NUMA Multiprocessors

Authors: Chee-Shong Wu

Where: M.Sc. Thesis

Abstract:

In a multiprogrammed multiprocessor, the scheduler is not only responsible for deciding when to activate an application and when to suspend it, but also for determining how many processors to allocate to each application. In a scalable Non-Uniform Memory Access (NUMA) multiprocessor, it must further resolve the problem of which processor(s) to allocate to which application, since memory reference times are not the same for all processor-memory pairs.

In this thesis, we study the problem of how to characterize parallel applications and how to apply this knowledge in scheduling for NUMA systems. We also study the performance of several scheduling algorithms in a NUMA environment. These algorithms differ in the frequency of reallocations. We propose two policies, the Static policy and the Immediate Start Static policy, that utilize application characteristics when making scheduling decisions. The performance of these two policies is compared with that of the Dynamic policy on a NUMA multiprocessor, Hector.

.
Title: Hierarchical clustering: A structure for scalable multiprocessor operating system design

Authors: R. Unrau, O. Krieger, B. Gamsa, and M. Stumm

Where: Journal of Supercomputing, to appear, 1995.

Abstract: We introduce the concept of Hierarchical Clustering as a way to structure shared memory multiprocessor operating systems for scalability. As the name implies, the concept is based on clustering and hierarchical system design. Hierarchical Clustering leads to a modular system, composed of easy-to-design and efficient building blocks. The resulting structure is scalable because it i) maximizes locality, which is key to good performance in NUMA systems, and ii) provides for concurrency that increases linearly with the number of processors. At the same time, there is tight coupling within a cluster, so the system performs well for local interactions, which are expected to constitute the common case. A clustered system can easily be adapted to different hardware configurations and architectures by changing the size of the clusters.

We show how this structuring technique is applied to the design of a microkernel-based operating system called Hurricane. This prototype system is the first complete and running implementation of its kind, and demonstrates the feasibility of a hierarchically clustered system. We present performance results based on the prototype, demonstrating the characteristics and behavior of a clustered system. In particular, we show how clustering trades off the efficiencies of tight coupling for the advantages of replication, increased locality, and decreased lock contention. We describe some of the lessons we learned from our implementation efforts and close with a discussion of our future work.

.
Title: Optimizing IPC Performance for Shared-Memory Multiprocessors

Authors: B. Gamsa, O. Krieger, M. Stumm

Where: Proc. Intl. Conf. on Parallel Processing, 1994

Abstract:

We assert that in order to perform well, a shared-memory multiprocessor inter-process communication (IPC) facility must avoid a) accessing any shared data, and b) acquiring any locks. In addition, such a multiprocessor IPC facility must preserve the locality and concurrency of the applications themselves so that the high performance of the IPC facility can be fully exploited.

In this paper we describe the design and implementation of a new shared-memory multiprocessor IPC facility that in the common case internally requires no accesses to shared data and no locking. In addition, the model of IPC we support and our implementation ensure that local resources are made available to the server to allow it to exploit any locality and concurrency available in the service. To the best of our knowledge, this is the first IPC subsystem with these attributes.

The performance data we present demonstrate that i) the end-to-end performance of our multiprocessor IPC facility is competitive with the fastest uniprocessor IPC times, and ii) our implementation can sustain this performance with perfect speedup regardless of the degree of concurrency, even if all requests are directed to the same server.

.
Title: Multiprocessor Scheduling for High-Variability Service Time Distributions

Authors: Eric W. Parsons and Kenneth C. Sevcik

Where: IPPS '95 Workshop on Job Scheduling Strategies for Parallel Processing; reprinted in Springer-Verlag Lecture Notes in Computer Science, Vol. 949, pp. 127-145.

Abstract: Many disciplines have been proposed for scheduling and processor allocation in multiprogrammed multiprocessors for parallel processing. These have, for the most part, been designed and evaluated for workloads having relatively low variability in service demand. But with reports that variability in service demands at high-performance computing centers can actually be quite high, these disciplines must be reevaluated. In this paper, we examine the performance of two well-known static scheduling disciplines and propose preemptive versions of them that offer much better mean response times when the variability in service demand is high. We argue that, in systems in which dynamic repartitioning of applications is expensive or impossible, these preemptive disciplines are well suited for handling high variability in service demand.

.
Title: Experiences with Locking in a NUMA Multiprocessor Operating System Kernel

Authors: Ron Unrau, Orran Krieger, Ben Gamsa and Michael Stumm

Where: OSDI-94

Abstract:

We describe the locking architecture of a new operating system, Hurricane, designed for large scale shared-memory multiprocessors. Many papers already describe kernel locking techniques, and some of the techniques we use have been previously described by others. However, our work is novel in the particular combination of techniques used, as well as several of the individual techniques themselves. Moreover, it is the way the techniques work together that is the source of our performance advantages and scalability. Briefly, we use:

1. a hybrid coarse-grain/fine-grain locking strategy that has the low latency and space overhead of a coarse-grain locking strategy while having the high concurrency of a fine-grain locking strategy;
2. replication of data structures to increase access bandwidth and improve concurrency;
3. a clustered kernel that bounds the number of processors that can compete for a lock so as to reduce second order effects such as memory and interconnect contention;
4. distributed locks to further reduce second order effects, with modifications that reduce the uncontended latency of these locks to close to that of spin locks.

.
Title: HFS: A flexible file system for shared-memory multiprocessors

Authors: O. Krieger,

Where: PhD Dissertation, Department of Electrical and Computer Engineering, University of Toronto

Abstract:

The Hurricane File System (HFS) is designed for large-scale, shared-memory multiprocessors. Its architecture is based on the principle that a file system must support a wide variety of file structures, file system policies and I/O interfaces to maximize performance for a wide variety of applications. HFS uses a novel, object-oriented building-block approach to provide the flexibility needed to support this variety of file structures, policies, and I/O interfaces. File structures can be defined in HFS that optimize for sequential or random access, read-only, write-only or read/write access, sparse or dense data, large or small file sizes, and different degrees of application concurrency. Policies that can be defined on a per-file or per-open instance basis include locking policies, prefetching policies, compression/decompression policies and file cache management policies. In contrast, most existing file systems have been designed to support a single file structure and a small set of policies. We have implemented large portions of HFS as part of the Hurricane operating system running on the Hector shared-memory multiprocessor. We demonstrate that the flexibility of HFS comes with little processing or I/O overhead. Also, we show that HFS is able to deliver the full I/O bandwidth of the disks on our system to the applications.

.
Title: Hector -- A hierarchically structured shared memory multiprocessor

Authors: Z. Vranesic, M. Stumm, D. Lewis and R. White

Where: IEEE Computer, 24(1): 72-80, January 1991.

.
Title: A Generalized Theory of Linear Loop Transformations

Authors: Dattatraya Kulkarni, Michael Stumm, Ron Unrau and Wei Li

Where: CSRI Tech report 317, ISSN 0834-1648

Abstract:

In this paper we present a new theory of linear loop transformations called {\em Computation Decomposition and Alignment\/} (CDA). A CDA transformation has two components: {\em Computation Decomposition\/} first decomposes the computations in the loop into computations of finer granularity, from iterations to instances of subexpressions. {\em Computation Alignment\/} then linearly transforms each of these sets of computations, possibly using a different transformation for each set. This framework subsumes all existing linear transformation frameworks in that it reduces to a conventional linear loop transformation when the smallest granularity is an iteration, and it reduces to some of the more recently extended frameworks when the smallest granularity is a statement instance. The ability to align computations at arbitrary granularities adds a new dimension to performance optimization on high performance computing platforms. We describe Computation Decomposition and Alignment and provide examples of CDA transformations. We present some heuristics to derive appropriate CDA transformations, given a desired optimization objective. We present the results of experiments run on the KSR1 multiprocessor and various RS6000 and Sparc platforms that demonstrate that CDA can result in substantial performance improvements.

.
Title: Linear Loop Transformations in Optimizing Compilers for Parallel Machines

Authors: Dattatraya Kulkarni and Michael Stumm

Where: To appear in the Australian Computer Journal

Abstract:

We present the linear loop transformation framework, which is the formal basis for state-of-the-art optimization techniques in restructuring compilers for parallel machines. The framework unifies most existing transformations and provides a systematic set of code generation techniques for arbitrary compound loop transformations. The algebraic representation of the loop structure and its transformation gives rise to quantitative techniques for optimizing performance on parallel machines. We discuss in detail the techniques for generating the transformed loop and deriving the desired linear transformation.

.
Title: Fusion of Loops for Parallelism and Locality

Authors: Naraig Manjikian and Tarek Abdelrahman

Where: CSRI Tech Report 315

Abstract:

Loop fusion improves data locality and reduces synchronization in data-parallel applications. However, loop fusion is not always legal. Even when legal, fusion may introduce loop-carried dependences which reduce parallelism. In addition, performance losses result from cache conflicts in fused loops. We present new, systematic techniques which: (1) allow fusion of loop nests in the presence of fusion-preventing dependences, (2) allow parallel execution of fused loops with minimal synchronization, and (3) eliminate cache conflicts in fused loops. We evaluate our techniques on a 56-processor KSR2 multiprocessor, and show performance improvements of up to 20% for representative loop nest sequences. The results also indicate a performance tradeoff as more processors are used, suggesting a careful evaluation of the profitability of fusion.

.
Title: CDA Loop Transformations

Authors: Dattatraya Kulkarni and Michael Stumm

Where: Proceedings of the Third Workshop on Languages, Compilers and Run-time Systems for Scalable Computers, Troy, NY, May 1995, Kluwer Academic.

Abstract:

In this paper we present a new loop transformation technique called {\em Computation Decomposition and Alignment\/} (CDA). {\em Computation Decomposition\/} first decomposes the iteration space into finer computation spaces. {\em Computation Alignment\/} then linearly transforms each computation space independently. CDA is a general framework in that linear transformations and their recent extensions are just special cases of CDA. CDA's fine-grained loop restructuring can incur considerable computational effort, but can exploit optimization opportunities that earlier frameworks cannot. We present four optimization contexts in which CDA can be useful. Our initial experiments demonstrate that CDA adds a new dimension to performance optimization.

.
Title: Implementing Flexible Computation Rules with Subexpression-level Loop Transformations

Authors: Dattatraya Kulkarni, Michael Stumm and Ronald C. Unrau

Where: Proceedings of Euro-Par '95, Stockholm, Aug. 28-31, 1995.

Abstract:

Computation Decomposition and Alignment (CDA) is a new loop transformation framework that extends the linear loop transformation framework and the more recently proposed Computation Alignment frameworks by linearly transforming computations at the granularity of subexpressions. It can be applied to achieve a number of optimization objectives, including the removal of data alignment constraints, the elimination of ownership tests, the reduction of cache conflicts, and improvements in data access locality. In this paper we show how CDA can be used to effectively implement flexible computation rules with the objective of minimizing communication and, whenever possible, eliminating intrinsics that test whether computations need to be executed or not. We describe CDA, show how it can be used to implement flexible computation rules, and present an algorithm for deriving appropriate CDA transformations.

.
Title: On the Scalability of Demand-Driven Parallel Systems

Authors: Ronald C. Unrau and Michael Stumm and Orran Krieger

Where: Proceedings of Euro-Par '95, Stockholm, Aug. 28-31, 1995.

Abstract:

Demand-driven systems follow the model where customers enter the system, request some service, and then depart. Examples are databases, transaction processing systems and operating systems, which form the system software layer between the applications and the hardware. Achieving scalability at the system software layer is critical for the scalability of the system as a whole, and yet this layer has largely been ignored. In this paper, we characterize the scalability of the system software layer of demand-driven parallel systems based on fundamental metrics of quantitative system performance analysis. We develop a set of sufficient conditions so that if a system satisfies these conditions, then the system is scalable. We further argue that in practice these conditions are also necessary. In the remainder of the paper, we use the necessary and sufficient conditions to develop a set of practical design guidelines, to study the effect of application workloads, and to examine the scalability behavior of a system with only a limited number of processors.

.
Title: (De-)Clustering Objects for Multiprocessor System Software

Authors: Eric Parsons, Ben Gamsa, Orran Krieger, Michael Stumm

Where: IWOOS95 (Fourth International Workshop on Object Orientation in Operating Systems, 1995)

Abstract:

Designing system software for large-scale shared-memory multiprocessors is challenging because of the level of performance demanded by the application workload and the distributed nature of the system. Adopting an object-oriented approach for our system, we have developed a framework for de-clustering objects, where each object may migrate, replicate, and distribute all or part of its data across the system memory using the policies that will best meet the locality requirements for that data. The mechanism for object invocation hides the internal structure of an object, allowing a request to be made directly to the most suitable part of the object on a per-processor basis without any knowledge of how the object is de-clustered. Method invocation is very efficient, both within and across address spaces, involving no remote memory accesses in the common case. We describe the design and implementation of this framework in Tornado, our multiprocessor operating system.

.
Title: The Importance of Performance-Oriented Flexibility in System Software for Large-Scale Shared-Memory Multiprocessors

Authors: Orran Krieger, Ben Gamsa, Karen Reid, Paul Lu, Eric Parsons and Michael Stumm

Where: OOPSLA Workshop on Flexible System Software. October 1994.

Abstract:

See paper for abstract.

.
Title: Exploiting Mapped Files for Parallel I/O

Authors: Orran Krieger, Karen Reid and Michael Stumm

Where: SPDP Workshop on Modeling and Specification of I/O (MSIO), October 1995

Abstract:

Harnessing the full I/O capabilities of a large-scale multiprocessor is difficult and requires a great deal of cooperation between the application programmer, the compiler and the operating (/file) system. Hence, the parallel I/O interface used by the application to communicate with the system is crucial in achieving good performance. We present a set of properties we believe a good I/O interface should have and consider current parallel I/O interfaces from the perspective of these properties. We describe the advantages and disadvantages of mapped-file I/O and argue that, if properly implemented, it can be a good basis for a parallel I/O interface that can fulfill the suggested properties. To demonstrate that such an implementation is feasible, we describe the methodology used in our previous work on the Hurricane operating system and in our current work on the Tornado operating system to implement mapped files.

.
Title: Computation and Data Partitioning on Scalable Shared Memory Multiprocessors

Authors: Sudarsan Tandri and Tarek S. Abdelrahman

Where: International Conference on Parallel and Distributed Processing Techniques and Applications, Athens, Georgia, November 1995

Abstract: In this paper we identify the factors that affect the derivation of computation and data partitions on scalable shared memory multiprocessors (SSMMs). We show that these factors necessitate an SSMM-conscious approach. In addition to remote memory access, which is the sole factor on distributed memory multiprocessors, cache affinity, memory contention and false sharing are important factors that must be considered. Experimental evidence is presented to demonstrate the impact of these factors on performance using three applications on the KSR1 and the Hector multiprocessors.