Research Overview
In the last decade, there has been a fundamental shift in the design of computer systems. Despite continued device scaling as predicted by Moore's Law, building larger and faster single-processor chips has become infeasible due to their power consumption; power concerns have held clock frequencies relatively steady in recent years (compared to the increases enjoyed through the 1990s). Architects now exploit growing transistor counts by integrating multiple cores per chip to provide increased performance with every new technology generation. However, doubling the number of cores does not naively double performance. Extracting performance from multi-core systems poses several challenges, including communication, synchronization, and the extraction of parallelism. Communication is needed in these architectures to feed computational units with data (both from off-chip memory and from other areas on the chip) and to maintain a coherent view of globally shared memory, which parallel programs rely on for correct execution. Mechanisms that facilitate efficient communication are therefore imperative for future architectures.

Communication at all levels of the system (datacenter, board-level, on-chip, etc.) plays a critical role in both the performance and energy consumption of the system, and it represents a significant bottleneck to scalability. Our research focuses on hardware and software innovations to improve on-chip communication, in particular on-chip networks (OCNs) that provide scalable communication. Our work has produced several high-impact innovations in OCN router architectures, routing algorithms, flow control, and communication-aware scheduling and mapping. In addition to designing high-performance, scalable communication fabrics, we are also pursuing research in microarchitectural techniques to improve performance and energy efficiency.
For an overview and introduction to on-chip network research, I refer the interested reader to the Computer Architecture Synthesis Lecture on On-Chip Networks.
Current Projects
Approximate Computing
Details coming soon!
Application-Aware On-Chip Networks
As architectures scale to many cores, it becomes increasingly difficult to scale individual programs to fully utilize the available cores. As a result, multiple workloads are consolidated on a single chip to maximize utilization. Existing routing algorithms, both deterministic and adaptive, largely overlook the issues associated with workload consolidation. Ideally, the performance of each application should be the same whether it runs in isolation or is co-scheduled with other applications. Significant research has focused on maintaining isolation and effectively sharing on-chip resources such as caches and memory controllers. Recently, we proposed DBAR, a destination-based adaptive routing scheme [ISCA 2011][TC 2012]. DBAR dynamically filters network congestion information to prevent the traffic patterns and congestion of one workload from impacting the routing decisions of a separate workload.
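The destination-based idea can be illustrated with a minimal sketch (this is not the published DBAR microarchitecture; the port names, coordinate scheme, and congestion table below are invented for illustration): when adaptively choosing between two productive output ports in a 2D mesh, only congestion measured toward the packet's own destination is consulted, so congestion caused by a co-scheduled workload elsewhere on the chip cannot bias the decision.

```python
# Hedged sketch of destination-based adaptive routing in a 2D mesh.
# congestion maps ((x, y), port) -> a load estimate; port is one of
# 'E', 'W', 'N', 'S'. All names here are illustrative, not DBAR's.

def dbar_like_route(cur, dst, congestion):
    """Pick an output port at router `cur` for a packet headed to `dst`."""
    dx, dy = dst[0] - cur[0], dst[1] - cur[1]
    if dx == 0 and dy == 0:
        return 'LOCAL'                      # packet has arrived
    x_port = 'E' if dx > 0 else 'W'
    y_port = 'N' if dy > 0 else 'S'
    if dx == 0:                             # only one productive dimension
        return y_port
    if dy == 0:
        return x_port
    # Both dimensions are productive: compare congestion, but only on
    # the two links that lead toward the destination quadrant (the
    # "filtering" step -- unrelated traffic never enters the decision).
    x_load = congestion.get((cur, x_port), 0)
    y_load = congestion.get((cur, y_port), 0)
    return x_port if x_load <= y_load else y_port
```

Because only minimal (productive) ports are ever returned, the sketch stays deadlock-free under the usual minimal adaptive routing assumptions.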
To adapt to changing communication demands both within and across applications, we are exploring adaptive and configurable networks. Our most recent result explores the benefits of implementing bi-directional channels whose direction can be configured at a fine granularity [NOCS 2012]. This router microarchitecture can switch the directionality of channels with low overhead to provide additional bandwidth in the event of heavy traffic flows between particular source-destination pairs. By making network resources intelligent, we can potentially reduce the footprint of the network without loss in performance.
Routing and Flow Control Optimizations for Cache-Coherent Many-Core Architectures
Shared memory models continue to dominate many-core architecture proposals. Providing an on-chip network that efficiently handles cache coherence traffic is imperative for these types of systems. In recent work, we have proposed several optimizations for cache coherence traffic, including routing algorithms to handle multicast and reduction traffic [ISCA 2008][HPCA 2012]. Due to abundant on-chip wiring, cache coherence traffic consists largely of short, single-flit packets. We propose a novel flow control technique called Whole Packet Forwarding [HPCA 2012][TPDS 2013], specifically designed to exploit and optimize short packets in the on-chip network. These optimizations offer performance improvements and energy reductions for cache coherence traffic. Alternatively, they can be used to reduce the required resources of the network while maintaining performance.
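The core admission check can be sketched as follows (a simplified illustration of the whole-packet idea, not the published router pipeline; the function names and baseline policy are invented for contrast): a short packet may enter a virtual channel that is already partially occupied, provided the VC has room for the entire packet, so it can never stall mid-packet inside a shared buffer.

```python
# Hedged sketch of a Whole-Packet-Forwarding-style admission check.

def wpf_can_admit(vc_depth, vc_occupancy, packet_len):
    """Admit a packet of packet_len flits into a VC with vc_depth total
    slots, vc_occupancy of which are in use, only if the WHOLE packet
    fits -- this is what lets short packets share non-empty VCs safely."""
    free_slots = vc_depth - vc_occupancy
    return packet_len <= free_slots

def baseline_can_admit(vc_depth, vc_occupancy, packet_len):
    """Conventional policy for contrast: a VC must be completely empty
    before a new packet may enter, wasting slots when most coherence
    packets are a single flit."""
    return vc_occupancy == 0 and packet_len <= vc_depth
```

With 4-slot VCs holding 2 flits each, the baseline rejects an incoming single-flit packet while the whole-packet check accepts it, which is where the buffer-utilization win comes from.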
Simulation Methodologies and Acceleration
Modern multi-cores and systems-on-chip increasingly use packet-switched networks-on-chip (NoCs) to meet the growing demand for on-chip communication bandwidth as more cores are incorporated into each chip. NoC designs are sensitive to many parameters such as topology, buffer sizes, routing algorithms, and flow control mechanisms. Detailed NoC simulation is essential to accurate full-system evaluation. We are exploring various techniques to improve simulation time and fidelity. Recently, we proposed SynFull, a novel methodology to generate realistic traffic patterns that allow for fast, yet accurate design space exploration of NoCs for shared memory architectures [ISCA 2014]. SynFull captures both application and cache coherence behaviour to rapidly evaluate NoCs, allowing designers to carry out detailed performance simulations without the cost of long-running full-system simulation. By capturing a full range of application and coherence behaviour, architects can avoid the over- or under-design of the network that may occur when using traditional synthetic traffic patterns such as uniform random. SynFull has errors as low as 0.3% and provides a 50x speedup on average over full-system simulation. SynFull is available for download and use by the research community (SynFull).
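The flavour of such a model can be sketched with a toy phase-based traffic generator (SynFull derives its states, rates, and transitions from full-system traces; the two phases, rates, and probabilities below are invented purely for illustration): a Markov chain switches between application phases, each injecting packets at its own per-cycle rate.

```python
import random

# Hedged, invented two-phase Markov traffic model in the spirit of
# phase-based synthetic traffic generation. 'rate' is the per-cycle
# injection probability; 'stay' is the probability of remaining in
# the current phase on the next cycle.
PHASES = {
    'low':  {'rate': 0.01, 'stay': 0.95},
    'high': {'rate': 0.20, 'stay': 0.90},
}

def generate_traffic(cycles, seed=0):
    """Return the sorted list of cycles at which a packet is injected."""
    rng = random.Random(seed)   # seeded for reproducible experiments
    phase = 'low'
    injections = []
    for cycle in range(cycles):
        if rng.random() < PHASES[phase]['rate']:
            injections.append(cycle)
        # Possibly transition to the other phase for the next cycle.
        if rng.random() >= PHASES[phase]['stay']:
            phase = 'high' if phase == 'low' else 'low'
    return injections
```

A real model would additionally choose destinations and packet types per state to mimic coherence traffic, but even this toy version shows why phase-aware traffic stresses a NoC differently than uniform random injection.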
Second, we propose DART, a fast and flexible FPGA-based NoC simulation architecture [NOCS 2011][TC 2012]. Rather than laying the NoC out in hardware on the FPGA like previous approaches, our design virtualizes the NoC by mapping its components to a generic NoC simulation engine, composed of a fully-connected collection of fundamental components (e.g., routers and flit queues). This approach has two main advantages: (i) because the FPGA implementation is decoupled from the simulated network, the engine can simulate any NoC; and (ii) any NoC can be mapped to the engine without resynthesizing it, which can be time-consuming for a large FPGA design. DART is available for download (DART).
Finally, we are exploring the application of sampling to network-on-chip simulation to accelerate full-system simulation. This work examines both statistical sampling and phase-based sampling. Both techniques achieve high accuracy while reducing simulation time by an order of magnitude.
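The statistical-sampling side rests on standard estimation theory, which a short sketch makes concrete (the helper name and the 95% z-value are illustrative, not tied to our tools): simulate only a handful of detailed windows, then estimate the full-run mean latency together with a confidence interval from the window means.

```python
import math
import statistics

def sampled_estimate(window_means, z=1.96):
    """Estimate full-run mean packet latency from per-window means.

    window_means: mean latency measured in each detailed simulation window.
    Returns (estimate, half_width) where half_width is the ~95% CI
    half-width for z=1.96, assuming roughly independent windows.
    """
    n = len(window_means)
    mean = statistics.fmean(window_means)
    if n < 2:
        return mean, float('inf')   # one window: no error estimate
    std_err = statistics.stdev(window_means) / math.sqrt(n)
    return mean, z * std_err
```

If the interval is too wide, more windows are simulated; the rest of the run is fast-forwarded, which is where the order-of-magnitude time saving comes from.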
Programming Many-Core Architectures: A Communication Perspective
Writing efficient, high-performance parallel programs represents a significant challenge to the adoption of many-core architectures. Communication consumes a significant fraction of on-chip resources and can become a bottleneck in scaling programs. Using the Intel Single-chip Cloud Computer (SCC), we are exploring the impact of programming models and algorithms on communication. This chip can be programmed using message passing; various communication libraries and primitives are being developed to leverage the on-chip network and exploit the computational capabilities of this device. Most recently, we have explored optimized broadcast implementations on the SCC and demonstrated up to a 35x speedup [ISPASS 2013].
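One classic source of such speedups can be sketched abstractly (our SCC study evaluates several implementations; the binomial-tree schedule below is a standard textbook technique, not necessarily the exact one we used): instead of the root sending to all N-1 cores sequentially, every core that already holds the message forwards it in parallel, completing in ceil(log2(N)) rounds.

```python
# Hedged sketch of a binomial-tree broadcast schedule among n cores,
# with core 0 as the root. Each round is a list of (src, dst) sends
# that can proceed in parallel over the on-chip network.

def binomial_broadcast_schedule(n):
    rounds = []
    have = 1  # number of cores currently holding the message
    while have < n:
        # Every holder src forwards to src + have, doubling coverage.
        sends = [(src, src + have) for src in range(have)
                 if src + have < n]
        rounds.append(sends)
        have *= 2
    return rounds
```

For 48 cores (the SCC's core count), this finishes in 6 rounds instead of 47 sequential sends, which is the kind of structural win that message-passing broadcast optimizations exploit.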
GPUs are used to speed up many scientific computations; however, to use several networked GPUs concurrently, the programmer must explicitly partition work and transmit data between devices. We propose DistCL [MASCOTS 2013], a novel framework that distributes the execution of OpenCL kernels across a GPU cluster. DistCL makes multiple distributed compute devices appear to be a single compute device. It abstracts and manages many of the challenges associated with distributing a kernel across multiple devices, including: (1) partitioning work into smaller parts, (2) scheduling these parts across the network, (3) partitioning memory so that each part of memory is written to by at most one device, and (4) tracking and transferring these parts of memory. Converting an OpenCL application to DistCL is straightforward and requires little programmer effort, making it a powerful and valuable tool for exploring the distributed execution of OpenCL kernels. We compare DistCL to SnuCL, which also facilitates the distribution of OpenCL kernels. We also offer some insights: distributed execution favours compute-bound problems and large contiguous memory accesses. DistCL is available for download.
Semantically-rich Interconnection Networks
Today's on-chip interconnection networks are largely oblivious to the needs of the components they connect and serve the sole purpose of shuffling bits around the die. In this project, we propose to embed additional functionality in the interconnection network. In particular, we are providing hardware support within the network for communication primitives such as multicast [ISCA 2008][HPCA 2012], which is used in cache coherence protocols and leveraged by programming models such as MPI. By handling communication primitives more efficiently in hardware, we improve performance and reduce the dynamic power consumption of the on-chip network. Furthermore, we are exploring on-chip network designs that effectively match the demands of different applications.
Architectures for Deep Neural Networks
Details coming soon!
Microarchitectural Optimizations
Details coming soon!
Additional Projects
In addition to these projects, I am currently recruiting outstanding Masters and PhD students to work on new projects. These projects explore various aspects of the memory system, on-chip network design and optimization.
If you have applied to graduate school at the University of Toronto and feel your interests align with mine, please email me. You are more likely to receive a response if you can demonstrate that you have read at least one of my papers. E-mails that address me as "Dear Sir:" will be ignored. Bonus points if you figure out that my full last name is "Enright Jerger". Undergraduates at the University of Toronto looking for summer research positions are also welcome to contact me. Unfortunately, I am unable to accept summer undergraduate students from international universities at this time.
We are grateful for the funding and in-kind contributions for these projects provided by the following: Natural Sciences and Engineering Research Council (NSERC), Ontario Centres of Excellence (OCE), Connaught Foundation, University of Toronto, Canada Foundation for Innovation (CFI), Ontario Ministry of Research and Innovation, Intel, Qualcomm, AMD and Fujitsu.
Last updated: June 2014