Application programs for scalable multiprocessors contain sequences of parallel loop nests which reuse a common set of data. In addition to exploiting the available loop-level parallelism, parallelizing compilers must also convert enhance cache locality in order to obtain high levels of performance on scalable multiprocessors. Existing compilers employ loop transformations which seek to exploit data reuse within individual loop nests. Unfortunately, the benefit of these transformations is limited because there is often little reuse within a loop nest, and today's large caches are capable of capturing what little reuse exists. Our contribution is the recognition that there exists considerable data reuse across sequences of parallel loop nests which is not converted into locality. Despite the large cache sizes available today, the volume of data accessed in one loop nest causes the data that is reused in subsequent loop nests to be ejected from the cache. As a result, costly cache misses must be incurred to reload the data into the cache each time it is reused.
Loop fusion may be used to combine a sequence of parallel loop nests into a single loop nest and enhance locality by exploiting the data reuse across the loop nests. In addition, fusion increases the granularity of parallelism if the resulting loop nest is parallel, and also eliminates the synchronization between the original loop nests. However, data dependences between iterations from different loop nests may become loop-carried in the fused loop nest, and these dependences may make the resulting fusion illegal, or may prevent parallelization of the fused loop.
We have developed a novel technique called the shift-and-peel transformation
that enables parallelizing compilers to apply fusion and enhance both locality
and parallelism, despite the presence of dependences that would otherwise
render fusion illegal or prevent parallelization. We have derived algorithms
to enable compilers to automatically derive and implement our transformation.
We have implemented our shift-and-peel transformation in the Jasmine
compiler, and evaluated its effectiveness for a number of applications
on state-of-the-art scalable shared-memory multiprocessors. Performance
measurements indicate that our techniques improve parallel performance
The shift-and-peel transformation consists of two operations which overcome fusion-preventing and serializing data dependences between the loop nests being fused. The following figures illustrate the application of shifting and peeling for a pair of one-dimensional parallel loops with dependences flowing between iterations in different loops.
Shifting of iteration spaces enables legal fusion.
Peeling of iterations enables subsequent parallelization.
Our algorithms embody the analysis required to determine the appropriate amounts of shifting and peeling in order to enable legal fusion and parallelization for a sequence of parallel loop nests. Both the analysis and implementation may be applied to inner loops as well as the outermost loop.
Speedup for spem on the Convex SPP-1000
spem is a large ocean modelling application that contains 11 loop nest sequences which require application of our shift-and-peel transformation in order to enable legal fusion and subsequent parallelization. The longest sequence consisted of 10 loop nests, and most of the sequences contain 4 or more loop nests. The array size used in these experiments is 60x65x65, and the total data size for the application is 70 MBytes. Since this size exceeds the available cache capacity, locality enhancement with our shift-and-peel transformation is required to improve performance.
In the experiment above, we have applied our cache partitioning data transformation in order to avoid conflicts which eject reusable data from the cache and diminish locality. Without cache conflict avoidance, the speedups for both the original and locality-enhanced versions of the codes are degraded significantly.