Application programs for scalable multiprocessors contain loop nests in which the same data elements are reused in successive iterations of the outermost loop. This data reuse is converted to cache locality when the data remains in the cache between uses. However, if the number of loop iterations between uses of the same data is large, the likelihood of retaining the data in the cache diminishes significantly. The volume of other data accessed between uses of the same data overflows the available cache capacity and ejects reusable from the cache. As a result, costly cache misses must be incurred to reload the data into the cache each time it is reused.
Tiling may be used to enhance locality for a loop nest by exploiting data reuse carried by an outer loop. Iterations from the original loop nest are reordered and blocked into units of tiles in order to reduce the number of iterations between uses of the same data. In this manner, the data used by one tile is loaded once into the cache and retained there for subsequent reuse during the execution of the tile. Unfortunately, data dependences in a loop nest are often such that direct application of tiling violates the semantics of the original iteration ordering. In such cases, a transformation known as loop skewing is used to transform the loop and its dependences into a form for which tiling is legal. However, loop skewing introduces dependences in the outer loops of a tiled loop nest, and these dependences limit the available parallelism to wavefronts in the tiled iteration space. Loop skewing also modifies the data access patterns such that there is data reuse between adjacent tiles, as well as the data reuse that is captured within individual tiles. Execution of tiled loop nests with wavefront parallelism requires appropriate scheduling such that the dependences are satisfied, and data reuse is exploited to enhance locality while maintaining sufficient parallelism for a large number of processors.
We have adapted a number of scheduling strategies in order to exploit wavefront parallelism in tiled loop nests on scalable shared-memory multiprocessors, and we have evaluated their effectiveness both analytically and experimentally. Our strategies enable the use of small tiles to increase the degree of parallelism on each wavefront in the tiled iteration space. At the same time, our strategies effectively exploit data reuse between adjacent tiles to ensure that locality is also enhanced. Performance results indicate that by enhancing locality for adjacent tiles, our scheduling strategies enable performance to scale up to a large number of processors.
We illustrate the impact of skewing on data access patterns in the following figure which shows the relationship between the iterations in a skewed, tiled iteration space, and the data which they access. The overlapping portions of data within a tile capture the data reuse carried by the outer loop in the original loop nest. Loop skewing introduces overlap between adjacent tiles, and this reuse must also be exploited for locality.
Skewing introduces data reuse between tiles.
Once a loop nest has been tiled, the execution of the tiles must respect the dependences introduced by loop skewing. In order to provide sufficient parallelism while effectively exploiting data reuse between adjacent tiles for locality, we employ the scheduling of tiles illustrated below. Blocks of adjacent tiles are executed by the same processor to exploit intertile reuse. The wavefronts are oriented such that there is sufficient parallelism when small tiles are used for a large number of processors.
Scheduling of tiles to enhance intertile locality.
Speedup for Jacobi the Convex SPP-1000
Jacobi is the kernel of a PDE solver, consisting of two inner loop nests which access two large arrays of size 2048x2048, and an outer loop which carries reuse for these arrays. Our shift-and-peel transformation is required in order to fuse the two inner loop nests. Not only does this exploit reuse between the inner loop nests, but it also enables tiling to exploit the reuse carried by the outer loop. Due to the dependences, loop skewing is required prior to tiling. As a result, there is wavefront parallelism and intertile data reuse. The consistent improvement in performance for a large number of processors is made possible by our scheduling strategies for wavefront parallelism.
In the experiment above, we have applied our cache partitioning data transformation in order to avoid conflicts which eject reusable data from the cache and diminish locality. Without cache conflict avoidance, the speedups for both the original and locality-enhanced versions of the codes are degraded significantly.