Cache conflicts among arrays during the execution of loop nests cause useful data to be ejected from the cache, subsequently leading to superfluous cache misses to reload the data into the cache. Conflicts are particularly undesirable when locality-enhancing transformations are used because the failure to retain reused data the cache diminishes the effectiveness of these transformations. The occurrence of cache conflicts is sensitive to cache hardware parameters, array sizes, and data access patterns in a loop nest. It is difficult to guarantee consistently high levels of performance since small changes in these factors may cause a sharp increase in the number of conflicts.
We have developed a new technique called cache partitioning for eliminating cache conflicts among arrays in a loop nest with the specific intent of retaining reused data in the cache and ensuring the effectiveness of locality-enhancing loop transformations. The key idea behind the technique is to logically divide the cache into a number of partitions and adjust the array layout in memory such that data from different arrays is mapped into different partitions of the cache throughout the execution of the loop nest. The partitioning is done entirely in software only to guide the memory layout; no hardware support is required. Hence, there is no additional hardware complexity which may affect performance.
The aim of cache partitioning is to eliminate conflicts between reused portions of different arrays. Reuse which extends across many loop iterations requires large portions of data to be retained in the cache. Conflicts among these reusable portions of data lead to unnecessary cache misses.
Conflicts occur data from different arrays map into overlapping regions of the cache.
Cache partitioning derives a data layout which ensures a conflict-free mapping of data from different arrays into the cache.
Cache partitioning is most effective when data access patterns are compatible, which means data is accessed from different arrays with the same stride and direction. Compatible data access patterns are common in loop nests in which there is data reuse. The regularity in such data access patterns implies that once conflicts begin to occur, they continue to occur in frequent, regular patterns. Hence, it is important to address the case of compatible data accesses.
To apply cache partitioning, it is first necessary to analyze loop nests which reuse a common set of arrays to determine subsets of arrays which must not conflict with each other. Once these subsets are known, a greedy data layout algorithm is employed to adjust the arrays starting addresses with the insertion of gaps with an appropriate size to enforce the desired conflict-free mapping of data into the cache. The array starting addresses are selected to coincide with starting addresses of partitions in the cache. The algorithm makes a greedy choice in the selection of the partition to assign to each array in order to minimize the size of the gap which must be introduced.
The graph above depicts the measured number of cache misses on a Convex SPP-1000 from applying fusion to the LL18 kernel of three loop nests. These loop nests references a total of 9 arrays of size 512x512. The shift-and-peel transformation is required to enable the fusion, which requires data from several arrays to remain cached for reuse. Cache partitioning is compared with the common technique of array padding in which the array dimensions are arbitrarily increased in an effort to perturb the mapping of data into the cache such that conflicts are reduced. Whereas considerable trial-and-error is required for array padding to reduce the number of misses, cache partitioning directly results in the minimum number of misses. In fact, there are values of padding for which fusion has little or no benefit because the number of cache misses is not reduced, i.e., there is no locality.