Whole Packet Forwarding: Efficient Design of Fully Adaptive Routing Algorithms for Networks-on-Chip
Routing algorithms for networks-on-chip (NoCs) typically only have a small number of virtual channels (VCs) at their disposal. Limited VCs pose several challenges to the design of fully adaptive routing algorithms. First, fully adaptive routing algorithms based on previous deadlock avoidance theories require a conservative VC re-allocation scheme: a VC can only be re-allocated when it is empty, which limits performance. We propose a novel VC reallocation scheme, whole packet forwarding (WPF), which allows a non-empty VC to be re-allocated. WPF leverages the observation that the majority of packets in NoCs are short. We prove that WPF does not induce deadlock if the routing algorithm is deadlock-free using conservative VC re-allocation. WPF is an important extension of previous deadlock-avoidance theories. Second, to efficiently utilize WPF in VC-limited networks, we design a novel fully adaptive routing algorithm which maintains packet adaptivity without significant hardware cost. Compared with conservative VC re-allocation, WPF achieves an average 88.9% saturation throughput improvement in synthetic traffic patterns and an average 21.3% and maximal 37.8% speedup for PARSEC applications with heavy network loads. Our design also offers higher performance than several partially adaptive and deterministic routing algorithms.
Architectural Support for Synchronization-Free Deterministic Parallel Programming
We propose a novel synchronization mechanism called versioning. It dynamically establishes a deterministic order of memory accesses in parallel programs that have serial semantics, in a way that is transparent to the programmer. This order is created in a distributed manner and is enforced by monitoring memory accesses and stalling threads if necessary. Versioning gives rise to parallel programming models in which programmers need not explicitly synchronize threads and only need to specify shared data, which greatly simplifies parallel programming. However, versioning introduces overheads and thus demands architectural support. We describe versioning and the architectural support it needs. We also propose one parallel programming model that utilizes versioning and use it to parallelize 13 benchmark applications. We build an FPGA prototype of a multiprocessor system with versioning support and show that good parallel speedups are obtained. Our analysis shows minimal impact of versioning, both in terms of timing overheads and in terms of additional hardware.