My primary research interests lie in on-chip interconnection networks and cache coherence protocols for many-core architectures. As the number of cores integrated on a die continues to increase, communication can limit performance and consume a substantial fraction of power. Novel communication fabrics are required to enable continued scaling. My research spans all aspects of interconnection networks as well as cross-layer optimizations. My current research re-examines and challenges design assumptions that hold true for shared-memory multiprocessors but may not hold for chip multiprocessors. As we migrate many of these design choices on-chip, it is worthwhile to examine their suitability and present novel solutions that are attractive in the unique environment of a many-core architecture. The goal of this research is to carefully consider communication requirements (as dictated by the coherence protocol and software), design the interconnect to better serve the coherence protocol, and improve cache coherence protocols and high-level communication mechanisms so that they better leverage the functionality of the on-chip interconnection network.
For an overview and introduction to on-chip network research, I refer the interested reader to the Computer Architecture Synthesis Lecture on On-Chip Networks.
Application-Aware On-Chip Networks
As architectures scale to many cores, it becomes increasingly difficult to scale individual programs to fully utilize the available cores. As a result, multiple workloads are being consolidated on a single chip to maximize utilization. Existing routing algorithms, both deterministic and adaptive, largely overlook the issues associated with workload consolidation. Ideally, the performance of each application should be the same whether it is running in isolation or is co-scheduled with other applications. Significant research has focused on maintaining isolation and effectively sharing on-chip resources such as caches and memory controllers. Recently, we have proposed DBAR, a destination-based adaptive routing scheme [ISCA 2011]. DBAR dynamically filters network congestion information to prevent the traffic patterns and congestion of one workload from impacting the routing decisions of a separate workload.
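The idea of filtering congestion information by destination can be illustrated with a small Python sketch. It is a simplified illustration under stated assumptions, not the DBAR implementation from [ISCA 2011]: the 2D mesh, the `congestion` dictionary, and the function names `path_congestion` and `choose_dimension` are all hypothetical. The router counts congestion only on routers between itself and the destination's row or column, so load beyond the destination (possibly caused by another workload) does not sway the choice.

```python
# Illustrative sketch only (hypothetical names and metric, not the DBAR design).
# congestion maps a router coordinate (x, y) to an assumed occupancy value.

def path_congestion(congestion, start, dest, axis):
    """Sum congestion along one axis from the next hop up to the
    destination's column (axis "x") or row (axis "y").
    Routers beyond the destination are deliberately ignored."""
    x, y = start
    dx, dy = dest
    total = 0
    if axis == "x":
        step = 1 if dx > x else -1
        for cx in range(x + step, dx + step, step):
            total += congestion.get((cx, y), 0)
    else:
        step = 1 if dy > y else -1
        for cy in range(y + step, dy + step, step):
            total += congestion.get((x, cy), 0)
    return total


def choose_dimension(congestion, current, dest):
    """Pick the next dimension for a minimal route in a 2D mesh.
    Returns "x", "y", or None if the packet has arrived."""
    x, y = current
    dx, dy = dest
    if (x, y) == (dx, dy):
        return None
    if x == dx:          # only the y dimension still needs hops
        return "y"
    if y == dy:          # only the x dimension still needs hops
        return "x"
    cost_x = path_congestion(congestion, current, dest, "x")
    cost_y = path_congestion(congestion, current, dest, "y")
    return "x" if cost_x <= cost_y else "y"


# Hypothetical load: routers (2, 0) and (3, 0) are busy with another workload.
example = {(1, 0): 1, (2, 0): 50, (3, 0): 100, (0, 1): 5}
next_dim = choose_dimension(example, (0, 0), (1, 1))
```

In this example the packet travelling from (0, 0) to (1, 1) goes in the x dimension first, because only router (1, 0) lies on its path in that row. A scheme that summed congestion over the entire row would see the unrelated load at (2, 0) and (3, 0) and steer the packet away.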
To adapt to changing communication demands both within and across applications, we are exploring adaptive and configurable networks. Our most recent result explores the benefits of implementing bi-directional channels whose direction can be configured at a fine granularity [NOCS 2012]. This router microarchitecture can switch the directionality of channels with low overhead to provide additional bandwidth in the event of heavy traffic flows between particular source-destination pairs. By enabling intelligent network resources, we can potentially reduce the footprint of the network without loss in performance.
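As a rough illustration of direction switching, the following Python sketch splits a small set of reversible channels between two neighbouring routers according to pending traffic. It is a toy policy under stated assumptions, not the arbitration used in [NOCS 2012]: the function name `assign_directions`, the use of pending flit counts as the demand metric, and the proportional split are all hypothetical.

```python
# Toy policy (assumed, not the published microarchitecture): divide reversible
# channels between routers A and B in proportion to pending flits.

def assign_directions(demand_ab, demand_ba, num_channels=2):
    """Return (channels A->B, channels B->A).
    Assumes num_channels >= 2 so both directions can be served at once."""
    total = demand_ab + demand_ba
    if total == 0:
        # Idle link: fall back to an even split.
        ab = (num_channels + 1) // 2
        return ab, num_channels - ab
    ab = round(num_channels * demand_ab / total)
    # Never starve a direction that has pending traffic.
    if demand_ab > 0:
        ab = max(ab, 1)
    if demand_ba > 0:
        ab = min(ab, num_channels - 1)
    return ab, num_channels - ab
```

With two channels, a one-sided flow such as `assign_directions(10, 0)` receives both channels, while any traffic in the opposite direction keeps one channel reserved for it.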
Routing and Flow Control Optimizations for Cache-Coherent Many-Core Architectures
Shared memory models continue to dominate many-core architecture proposals. Providing an on-chip network that efficiently handles cache coherence traffic is imperative for these types of systems. In recent work, we have proposed several optimizations for cache coherence traffic including routing algorithms to handle multicast and reduction traffic [ISCA 2008][HPCA 2012]. Due to abundant wiring on chip, cache coherence traffic contains a large fraction of short (single-flit) packets. We propose a novel flow control technique called Whole Packet Forwarding [HPCA 2012] specifically designed to exploit and optimize short packets in the on-chip network. These optimizations offer performance improvements and energy reductions for cache-coherence traffic. Alternatively, they can be used to reduce the required resources of the network while maintaining performance.
Modern multi-cores and systems-on-chip have increasingly used packet-switched networks-on-chip (NoCs) to meet the growing demand for on-chip communication bandwidth as more cores are incorporated into each chip. NoC designs are sensitive to many parameters such as topology, buffer sizes, routing algorithms, and flow control mechanisms. Detailed NoC simulation is essential to accurate full-system evaluation. We are exploring various techniques to improve simulation time and fidelity. First, we propose DART, a fast and flexible FPGA-based NoC simulation architecture [NOCS 2011][TC 2012]. Rather than laying the NoC out in hardware on the FPGA like previous approaches, our design virtualizes the NoC by mapping its components to a generic NoC simulation engine, composed of a fully-connected collection of fundamental components (e.g., routers and flit queues). This approach has two main advantages: (i) since the FPGA implementation is decoupled, it can simulate any NoC; and (ii) any NoC can be mapped to the engine without resynthesizing it, which can take time for a large FPGA design. DART is available for download. Second, we are exploring new evaluation methodologies that will allow early-stage analysis of the impact and requirements of cache coherence protocols for on-chip networks.
Writing efficient, high-performance parallel programs represents a significant challenge to the adoption of many-core architectures. Communication consumes a significant fraction of on-chip resources and can become a bottleneck in scaling programs. Using the Intel Single-chip Cloud Computer (SCC), we are exploring the impact of programming models and algorithms on communication. This chip can be programmed using message passing; various communication libraries and primitives are being developed to leverage the on-chip network to exploit the computation capacities of this device. Most recently, we have explored optimized broadcasting implementations on the SCC and demonstrated up to a 35X speedup [ISPASS 2013].
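To give a feel for why the broadcast algorithm matters under message passing, the Python sketch below builds a textbook binomial-tree schedule, in which every core that already holds the message forwards it in each round. This is a generic illustration, not the optimized implementation evaluated in [ISPASS 2013], and it says nothing about where the reported speedup comes from. The function name `binomial_broadcast_schedule` is hypothetical, and the core count of 48 used in the example is a property of the SCC that is not stated in the text above.

```python
# Generic binomial-tree broadcast schedule (textbook technique, assumed here
# purely for illustration; not the SCC-specific implementation).

def binomial_broadcast_schedule(num_cores, root=0):
    """Return a list of rounds; each round is a list of (sender, receiver)
    core IDs. Sends within one round can proceed in parallel."""
    rounds = []
    have = 1  # cores with relative rank < have already hold the message
    while have < num_cores:
        sends = []
        for i in range(have):
            j = i + have  # relative rank of the core that i sends to
            if j < num_cores:
                sends.append(((root + i) % num_cores, (root + j) % num_cores))
        rounds.append(sends)
        have *= 2
    return rounds


schedule = binomial_broadcast_schedule(48)
num_rounds = len(schedule)                 # parallel communication rounds
num_messages = sum(len(r) for r in schedule)
```

For 48 cores the schedule needs 6 rounds to deliver 47 messages, whereas a root that sends to every other core one at a time needs 47 sequential sends.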
Semantically-rich Interconnection Networks
Today's on-chip interconnection networks are largely oblivious to the needs of the components they connect and serve the sole purpose of shuffling bits around the die. In this project, we propose to embed additional functionality in the interconnection network. In particular, we are providing hardware support within the network for various communication primitives such as multicasting [ISCA 2008][HPCA 2012], which is used in cache coherence protocols and leveraged by programming models such as MPI. By handling communication primitives more efficiently in hardware, we improve performance and reduce the dynamic power consumption of the on-chip network. Furthermore, we are exploring on-chip network designs that can effectively match the demands of different applications.
In addition to these projects, I am currently recruiting outstanding Master's and PhD students to work on new projects. These projects explore various aspects of the memory system and of on-chip network design and optimization.
If you have applied to graduate school at the University of Toronto and feel your interests align with mine, please email me. You are more likely to receive a response if you can demonstrate that you have read at least one of my papers. E-mails that address me as "Dear Sir:" will be ignored. Bonus points if you figure out that my full last name is "Enright Jerger". Undergraduates at the University of Toronto looking for summer research positions are also welcome to contact me. Unfortunately, I am unable to accept summer undergraduate students from international universities at this time.
We are grateful for the funding and in-kind contributions for these projects provided by the following: Natural Sciences and Engineering Research Council of Canada (NSERC), Connaught Foundation, University of Toronto, Canada Foundation for Innovation (CFI), Ontario Ministry of Research and Innovation, Intel, Qualcomm, AMD and Fujitsu.
Last updated: February 2013