Automatic Data and Computation Partitioning

Scalable Shared Memory Multiprocessors (SSMMs) are becoming increasingly popular as platforms for parallel scientific computing. Recent commercial systems such as the Convex Exemplar and the Cray T3E offer not only the scalability previously exclusive to distributed memory multiprocessors, but also the convenience of a single coherent view of memory. The presence of shared memory initially suggests that parallelizing compilers for SSMMs need not be concerned with the data management issues that compilers for distributed memory machines must contend with. However, the non-uniformity of memory accesses and the limited data management policies of operating systems suggest that compilers should play a more active role in data management on SSMMs. A data-partitioning-based approach to data management can improve application performance on SSMMs.

We developed a framework for deriving data and computation partitions on SSMMs. We showed that communication cost alone is not adequate to assess the appropriateness of data and computation partitions; shared memory effects such as cache affinity, false sharing, synchronization and contention must also be taken into account. Furthermore, the presence of shared memory hardware makes the use of the owner-computes rule unnecessary; the performance of some applications benefits from relaxing this rule. We described an algorithm for deriving data and computation partitions on SSMMs that takes shared memory effects into account. The algorithm is computationally efficient compared to previous approaches and does not rely on run-time profiling. Experimental results from a prototype implementation of the algorithm demonstrated its effectiveness in parallelizing standard benchmarks and the necessity of taking shared memory effects into account. The results also demonstrated the computational efficiency of our framework.


Description

Computation and data partitions are specified by the selection of a processor geometry, the assignment of a distribution attribute to dimensions of arrays (and to parallel loops), and a mapping of array dimensions (and parallel loops) onto the dimensions of the processor geometry. In addition to the well-known distribution attributes (Block, Cyclic and BlockCyclic), we introduce six additional distribution attributes (RBlock, RCyclic, RBlockCyclic, BlockRBlock, CyclicRCyclic and BlockCyclicRBlockCyclic) to resolve alignment conflicts between data and computations. The explicit association between the dimensions of the arrays and the dimensions of the processor geometry is used to obtain new data partitions that can reduce contention and synchronization in programs on SSMMs.
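
To make the specification concrete, the following minimal sketch shows one possible representation of a partition: a processor geometry, a distribution attribute per array dimension (or parallel loop), and a mapping onto geometry dimensions. The names and data structures are ours for illustration, not the compiler's.

    from dataclasses import dataclass
    from enum import Enum, auto
    from typing import List, Optional, Tuple


    class Distribution(Enum):
        # Well-known attributes.
        BLOCK = auto()
        CYCLIC = auto()
        BLOCK_CYCLIC = auto()
        # Additional attributes introduced to resolve alignment conflicts.
        RBLOCK = auto()
        RCYCLIC = auto()
        RBLOCK_CYCLIC = auto()
        BLOCK_RBLOCK = auto()
        CYCLIC_RCYCLIC = auto()
        BLOCK_CYCLIC_RBLOCK_CYCLIC = auto()


    @dataclass
    class DimPartition:
        # Hypothetical representation, not the compiler's internal form.
        attribute: Optional[Distribution]   # None means not distributed ("*")
        geometry_dim: Optional[int]         # processor-geometry dimension it maps to
        block_size: Optional[int] = None    # block size for (block-)cyclic attributes


    @dataclass
    class Partition:
        geometry: Tuple[int, ...]           # e.g. (4, 2) for a 4x2 processor grid
        dims: List[DimPartition]            # one entry per array dimension or parallel loop


    # Example: a 2-D array on a 4x2 grid; the first dimension is distributed
    # Block onto geometry dimension 0, the second is not distributed ("*").
    p = Partition(geometry=(4, 2),
                  dims=[DimPartition(Distribution.BLOCK, geometry_dim=0),
                        DimPartition(None, geometry_dim=None)])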


Some Data Partitioning Examples.

The algorithm derives data and computation partitions for programs in which parallel loops have been identified. The Polaris compiler is used for parallelism detection.


An example program used to illustrate the algorithm.

First, computation partitions and static data partitions are derived from the data access patterns in the program using the NUMA-CAG. Each node in this graph represents either an array dimension or a parallel loop. An unweighted edge connects an array dimension node to a loop node if a subscript expression in the array dimension contains the loop iterator. An array dimension node has two sub-nodes, a forward sub-node and a reverse sub-node. If the coefficient of the loop iterator in the subscript expression of an array dimension is positive, then the undirected edge connects the loop node to the forward sub-node. Similarly, there is an edge between a loop node and the reverse sub-node of an array dimension node if the coefficient of the loop iterator in the subscript expression is negative.
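
A minimal sketch of this construction is shown below, assuming the subscripts have already been reduced to per-dimension loop-iterator coefficients; the representation is ours for illustration, not the compiler's.

    from collections import defaultdict


    def build_numa_cag(accesses):
        """Build the NUMA-CAG as an undirected adjacency map.

        accesses: iterable of (array, dim, loop, coeff) tuples, where coeff is
        the coefficient of the loop iterator in that array dimension's subscript.
        Array-dimension nodes are split into forward ("fwd") and reverse ("rev")
        sub-nodes according to the sign of the coefficient.
        """
        edges = defaultdict(set)
        for array, dim, loop, coeff in accesses:
            if coeff == 0:
                continue                      # the iterator does not appear
            sub = "fwd" if coeff > 0 else "rev"
            dim_node = (array, dim, sub)
            loop_node = ("loop", loop)
            edges[dim_node].add(loop_node)
            edges[loop_node].add(dim_node)
        return edges


    # Hypothetical access pattern: A(i, j) and A(n-i, k) inside parallel loops
    # i, j and k give A's first dimension both a forward and a reverse edge.
    cag = build_numa_cag([("A", 0, "i", +1), ("A", 1, "j", +1),
                          ("A", 0, "i", -1), ("A", 1, "k", +1)])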


NUMA-CAG for the example program.

Array access patterns and load balancing requirements are used to assign an initial distribution attribute to each node in the NUMA-CAG. An iterative algorithm that utilizes our new distribution attributes then ensures that any two connected nodes in the NUMA-CAG have the same distribution attribute.
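
One way to realize this step is sketched below; the merge table and the component-at-a-time formulation are illustrative stand-ins for the iterative algorithm, not its exact rules.

    # Illustrative conflict-resolution table: a forward/reverse mismatch between
    # connected nodes is resolved with one of the combined attributes.
    MERGE = {
        frozenset({"Block", "RBlock"}): "BlockRBlock",
        frozenset({"Cyclic", "RCyclic"}): "CyclicRCyclic",
        frozenset({"BlockCyclic", "RBlockCyclic"}): "BlockCyclicRBlockCyclic",
    }


    def merge(a, b):
        if a == b:
            return a
        return MERGE.get(frozenset({a, b}), a)   # fall back to the first attribute


    def unify_attributes(edges, attr):
        """Give every connected component of the NUMA-CAG a single attribute.

        edges: node -> set of neighbour nodes; attr: node -> initial attribute
        chosen from access patterns and load-balancing requirements.
        """
        seen = set()
        for start in edges:
            if start in seen:
                continue
            stack, component = [start], []
            while stack:                      # simple DFS over the component
                node = stack.pop()
                if node in seen:
                    continue
                seen.add(node)
                component.append(node)
                stack.extend(edges[node])
            combined = attr[component[0]]
            for node in component[1:]:
                combined = merge(combined, attr[node])
            for node in component:            # every connected node now agrees
                attr[node] = combined
        return attr


    # Hypothetical usage: two connected nodes with a forward/reverse conflict
    # both end up with the combined "BlockRBlock" attribute.
    attrs = unify_attributes({"n1": {"n2"}, "n2": {"n1"}},
                             {"n1": "Block", "n2": "RBlock"})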


Selection of the distribution attribute for the example program.

In addition, the algorithm determines block sizes that minimize false sharing while trading off communication and load balancing requirements. The algorithm also maps connected nodes in the NUMA-CAG onto the same dimension of the processor geometry. This enhances locality and minimizes communication by allocating data and the computation that uses it to the same processor.
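
As an illustration of both decisions, the sketch below rounds block sizes up to cache-line multiples and maps each connected NUMA-CAG component onto a single geometry dimension. The heuristic and the names are our assumptions, not the algorithm's exact rules.

    import math


    def false_sharing_free_block(extent, num_procs, line_elems):
        """Illustrative heuristic: start from an evenly balanced block size and
        round it to a multiple of the cache-line length (in array elements) so
        that adjacent blocks do not share cache lines."""
        balanced = math.ceil(extent / num_procs)
        return max(line_elems, line_elems * round(balanced / line_elems))


    def assign_geometry_dims(components):
        """Map every node of a connected NUMA-CAG component onto the same
        processor-geometry dimension, so aligned array dimensions and loops
        are allocated to the same processors."""
        mapping = {}
        for geom_dim, component in enumerate(components):
            for node in component:
                mapping[node] = geom_dim
        return mapping


    # e.g. false_sharing_free_block(1000, 16, 16) -> 64 elements per block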


Processor dimension assignment for the example program.

The data partitions derived by the NUMA-CAG are static partitions. Such partitions may not be the best choice for arrays that require different partitions in different loop nests; these arrays have the same array dimension mapped onto several dimensions of the processor geometry. The partitions of only these arrays are re-evaluated by considering all possible partitions for each array individually, using a depth-first search with pruning. In the example program, the first dimension of the array should be partitioned in the first loop nest and the second dimension in the second loop nest to enhance locality. The possible partitions evaluated are: replication, partitioning the first dimension, partitioning the second dimension, and the compromise data partition selected by the NUMA-CAG. The cost of a given data partition is computed using knowledge of cache, local and remote memory access costs, as well as the costs of contention and synchronization (incurred when a loop is executed as a wavefront, either because of data dependences or to avoid contention). These costs are determined by a combination of analytical and empirical evaluation of the target machine.
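
A hedged sketch of the pruned search follows; the machine cost model is abstracted behind a placeholder cost_of function and is not the actual model used.

    import math


    def search_partitions(arrays, candidates, cost_of, chosen=None, best=(math.inf, None)):
        """Depth-first search over candidate partitions with cost-based pruning.

        arrays: the arrays being re-evaluated; candidates[a]: the candidate
        partitions of array a (e.g. replication, partition dim 0, partition
        dim 1, the NUMA-CAG compromise). cost_of(chosen) stands in for the
        machine cost model (cache, local/remote memory, contention and
        synchronization); it must not decrease as more arrays are assigned,
        otherwise pruning on partial cost is unsafe.
        """
        chosen = chosen if chosen is not None else {}
        partial = cost_of(chosen)
        if partial >= best[0]:
            return best                        # prune: cannot beat the best complete choice
        if len(chosen) == len(arrays):
            return (partial, dict(chosen))     # complete assignment
        array = arrays[len(chosen)]
        for p in candidates[array]:
            chosen[array] = p
            best = search_partitions(arrays, candidates, cost_of, chosen, best)
            del chosen[array]
        return best


    # Hypothetical usage with a trivial additive cost model.
    cost, choice = search_partitions(
        ["A"], {"A": ["replicate", "dim0", "dim1", "compromise"]},
        cost_of=lambda chosen: sum(1.0 if p == "replicate" else 0.5
                                   for p in chosen.values()))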


Pruned tree search to determine data partitions.

The number of processors along each dimension of the processor geometry is selected to enhance cache affinity, by examining the possible processor geometry combinations for the given total number of processors and the selected dimensionality of the processor geometry.
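
The enumeration itself is simple, as the sketch below shows; the cache-affinity evaluation is abstracted behind a placeholder affinity_cost function that we assume is supplied by the cost model.

    def factorizations(p, ndims):
        """Yield every way of factoring p processors across ndims geometry dimensions."""
        if ndims == 1:
            yield (p,)
            return
        for d in range(1, p + 1):
            if p % d == 0:
                for rest in factorizations(p // d, ndims - 1):
                    yield (d,) + rest


    def select_geometry(num_procs, ndims, affinity_cost):
        """Pick the factorization preferred by the (placeholder) affinity cost model."""
        return min(factorizations(num_procs, ndims), key=affinity_cost)


    # Example: 16 processors on a 2-D geometry yields the candidates
    # (1, 16), (2, 8), (4, 4), (8, 2), (16, 1).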

The computational complexity of the framework is kept low by re-evaluating only those arrays whose partitioning requirements conflict across loop nests and by pruning the search over their candidate partitions.


Experimental Evaluation

Impact of shared memory effects.

The Alternating Direction Integration (ADI) program is used to illustrate that the shared address space provides flexibility in the choice of computation partition, reducing contention and synchronization overhead and improving performance significantly. A single iteration of an outer sequentially iterated loop consists of a forward and a backward sweep phase along the rows of three arrays, followed by another forward and backward sweep phase along the columns of the arrays. OS page placement policies fail to enhance memory locality for this application. The (*, Block{1}) data partition in conjunction with the owner-computes computation partition results in a wavefront-type pipelined computation, leading to synchronization overheads. The synchronization overhead can be eliminated by relaxing the owner-computes rule in the second phase and allowing processors to write their results to remote memory modules. However, performance does not improve because the relaxed compute rule combined with the (*, Block{1}) partition results in heavy contention. The (Block{1}, Block{1}) data partition eliminates both contention and synchronization and hence results in the best overall performance.
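
For reference, the sweep structure can be sketched as below (an illustrative skeleton, not the benchmark source; the coefficients and array names are made up). Whichever sweep direction crosses the distributed dimension of a one-dimensional partition carries a dependence across processor boundaries, which is what forces the pipelined wavefront under owner-computes.

    import numpy as np

    # Illustrative ADI-like skeleton, not the benchmark source.
    n = 8
    x = np.random.rand(n, n)
    a = np.random.rand(n, n)
    b = np.random.rand(n, n) + 1.0          # keep divisors away from zero

    for it in range(2):                      # outer, sequentially iterated loop
        # Phase 1: forward and backward sweeps along the rows.
        for i in range(n):                   # parallelizable across rows
            for j in range(1, n):            # sequential: x[i, j] depends on x[i, j-1]
                x[i, j] -= a[i, j] * x[i, j - 1] / b[i, j - 1]
            for j in range(n - 2, -1, -1):   # backward sweep
                x[i, j] = (x[i, j] - a[i, j + 1] * x[i, j + 1]) / b[i, j]
        # Phase 2: forward and backward sweeps along the columns.
        for j in range(n):                   # parallelizable across columns
            for i in range(1, n):            # sequential: x[i, j] depends on x[i-1, j]
                x[i, j] -= a[i, j] * x[i - 1, j] / b[i - 1, j]
            for i in range(n - 2, -1, -1):   # backward sweep
                x[i, j] = (x[i, j] - a[i + 1, j] * x[i + 1, j]) / b[i, j]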

Performance of some applications.

Data and computation partitions for vpenta and mxm are obtained directly from the NUMA-CAG (i.e., without re-evaluation), while those for ADI are obtained after examining 116 possible partitions.


Publications