


Natalie Enright Jerger  

Professor
Percy Edward Hart Professor of Electrical and Computer Engineering

About

I am a professor in The Edward S. Rogers Sr. Department of Electrical and Computer Engineering at the University of Toronto, which I joined in 2009. I am currently the Percy Edward Hart Professor of Electrical and Computer Engineering.

Prior to joining the University of Toronto, I received my PhD from the University of Wisconsin-Madison studying Computer Architecture. I was part of the PHARM research group and I was co-advised by Mikko Lipasti and Li-Shiuan Peh. I received my Bachelor of Science degree in 2002 from Purdue University in West Lafayette, IN.

My research interests include multi- and many-core architectures, on-chip networks, cache coherence protocols, memory systems and approximate computing. Please see my research page for more details.

In 2015, I was awarded an Alfred P. Sloan Research Fellowship and the Borg Early Career Award. I am the recipient of the 2014 Ontario Professional Engineers Young Engineer Medal. In 2012, I received the Ontario Ministry of Research and Innovation Early Researcher Award. I recently served as program co-chair of the 7th Network-on-Chip Symposium (NOCS) and as program chair of the 20th International Symposium on High Performance Computer Architecture (HPCA). In 2009, I co-authored a book on On-Chip Networks with Li-Shiuan Peh. My research is supported by NSERC, Intel, CFI, AMD and Qualcomm.

Beyond research, I am also involved in outreach activities for women in computer architecture. If you are a woman studying or working in the area of computer architecture, please consider joining our group: WICArch.

My CV can be found here: CV (updated May 2017)

News

Bunker Cache

Congrats to Josh and Jorge on their accepted MICRO 2016 paper!

Anytime Automaton

Congrats to Josh on his accepted ISCA 2016 paper!

Cnvlutin

Congrats to Jorge on his accepted ISCA 2016 paper!

Top Picks

Congrats to Ajay on his IEEE MICRO Top Picks Paper: Exploiting Interposer Technologies to Disintegrate and Reintegrate Multi-core Processors for Performance and Cost

Runahead NoC

Congrats to Zimo and Josh on their 2016 HPCA paper!

MICRO 2015

Congrats to Josh, Jorge and Ajay on their MICRO 2015 papers!

Research Overview

In the last two decades, there has been a fundamental shift in the design of computer systems. Despite continued device scaling as predicted by Moore’s Law, building larger and faster single-processor chips has become increasingly difficult due to their power consumption. Concerns about power have held clock frequencies relatively steady in recent years (compared to the increases enjoyed through the 1990s). Architects now exploit growing transistor counts to integrate multiple cores per chip in order to provide increased performance with every new technology generation. In this computing landscape, computation is essentially free; we can build as many arithmetic and logic units (ALUs) as we would like.

Despite the ability to integrate an astounding amount of computation onto a chip, we face a number of key challenges. First, doubling the number of compute resources in the form of more cores does not naïvely result in a doubling of performance. Communication remains a bottleneck to extracting more performance from these systems; it is needed to feed computational units with data (both from off-chip memory and from other areas on the chip) and to maintain a coherent view of globally shared memory. Second, power consumption remains a significant concern, and growing transistor counts have led to the era of Dark Silicon: an operating paradigm in which chips contain more transistors than can be switched simultaneously due to power and thermal limitations.

To address these challenges, my research group continues to work on on-chip interconnects and has also expanded to make research contributions to energy-efficient design through approximate computing and hardware acceleration to mitigate the impacts of dark silicon. More details on recent and on-going projects can be found below.

Approximate Computing

Machine Learning Acceleration

Simulation Methodologies and Acceleration

Application-Aware On-Chip Networks

Routing and Flow Control Optimizations

Die-Stacked Architectures

Interconnection Network Optimizations

Additional Projects

Publications

Complete list of papers.  See also: Google Scholar Profile

Book


On-Chip Networks, 2nd Edition
Natalie Enright Jerger, Tushar Krishna and Li-Shiuan Peh

Abstract: This book targets engineers and researchers familiar with basic computer architecture concepts who are interested in learning about on-chip networks. This work is designed to be a short synthesis of the most critical concepts in on-chip network design. It is a resource for both understanding on-chip network basics and for providing an overview of state-of-the-art research in on-chip networks. We believe that an overview that teaches both fundamental concepts and highlights state-of-the-art designs will be of great value to both graduate students and industry engineers. While not an exhaustive text, we hope to illuminate fundamental concepts for the reader as well as identify trends and gaps in on-chip network research. With the rapid advances in this field, we felt it was timely to update and review the state of the art in this second edition. We introduce two new chapters at the end of the book. We have updated the book throughout with the latest research of recent years and also expanded our coverage of fundamental concepts to include several research ideas that have now made their way into products and, in our opinion, should be textbook concepts that all on-chip network practitioners should know. For example, these fundamental concepts include message passing, multicast routing, and bubble flow control schemes.
BibTeX:
@book{EnrightJerger2017,
  author = {Natalie {Enright Jerger} and Tushar Krishna and Li-Shiuan Peh},
  editor = {Margaret Martonosi},
  title = {On-Chip Networks, Second Edition},
  publisher = {Morgan \& Claypool},
  year = {2017}
}
On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
Abstract: With the ability to integrate a large number of cores on a single chip, research into on-chip networks to facilitate communication becomes increasingly important. On-chip networks seek to provide a scalable and high-bandwidth communication substrate for multi-core and many-core architectures. High bandwidth and low latency within the on-chip network must be achieved while fitting within tight area and power budgets. In this lecture, we examine various fundamental aspects of on-chip network design and provide the reader with an overview of the current state-of-the-art research in this field.
BibTeX:
@book{EnrightJerger2009,
  author = {Natalie {Enright Jerger} and Li-Shiuan Peh},
  editor = {Mark Hill},
  title = {On-Chip Networks},
  publisher = {Morgan \& Claypool},
  year = {2009}
}

Refereed Conference and Journal Publications


The Bunker Cache for Spatio-Value Approximation
Joshua San Miguel, Jorge Albericio, Natalie Enright Jerger and Aamer Jaleel
In Proceedings of the International Symposium on Microarchitecture, 2016. [PDF]

Abstract: The cost of moving and storing data is still a fundamental concern for computer architects. Inefficient handling of data can be attributed to conventional architectures being oblivious to the nature of the values that these data bits carry. We observe the phenomenon of spatio-value similarity, where data elements that are approximately similar in value exhibit spatial regularity in memory. This is inherent to 1) the data values of real-world applications, and 2) the way we store data structures in memory. We propose the Bunker Cache, a design that maps similar data to the same cache storage location based solely on their memory address, sacrificing some application quality loss for greater efficiency. The Bunker Cache enables performance gains (ranging from 1.08x to 1.19x) via reduced cache misses and energy savings (ranging from 1.18x to 1.39x) via reduced off-chip memory accesses and lower cache storage requirements. The Bunker Cache requires only modest changes to cache indexing hardware, integrating easily into commodity systems.
BibTeX:
@inproceedings{Josh2016b,
  author = {Joshua {San Miguel} and Jorge Albericio and Natalie {Enright Jerger} and Aamer Jaleel},
  title = {The Bunker Cache for Spatio-Value Approximation},
  booktitle = {Proceedings of the International Symposium on Microarchitecture},
  year = {2016}
}
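The core mechanism described in the abstract, mapping approximately similar blocks to the same storage based solely on their addresses, can be sketched in a few lines. This is an illustrative toy, not the paper's hardware design; the function names, the 16-slot cache, and the `row_len` knob are my own assumptions:

```python
# Toy sketch of the Bunker Cache idea (illustrative only). Data laid out in
# rows of `row_len` elements is often approximately similar across adjacent
# rows, so we collapse pairs of rows onto the same slots and let similar
# blocks share one storage location.

def exact_index(addr, num_slots):
    """Conventional indexing: every block address gets its own slot."""
    return addr % num_slots

def bunker_index(addr, num_slots, row_len=8):
    """Map addresses in adjacent rows of a 2D layout to the same slot,
    halving storage at some cost in output quality."""
    row, col = divmod(addr, row_len)
    return ((row // 2) * row_len + col) % num_slots

# Same column, adjacent rows of an image-like array:
a, b = 3, 11
assert exact_index(a, 16) != exact_index(b, 16)     # normally distinct slots
assert bunker_index(a, 16) == bunker_index(b, 16)   # now share one slot
```

Because the mapping depends only on the address arithmetic, no value comparison hardware is needed, which is why the paper can claim only modest changes to cache indexing.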
CNVLUTIN: Ineffectual-Neuron-Free Deep Neural Network Computing
Jorge Albericio, Patrick Judd, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger and Andreas Moshovos
In Proceedings of the International Symposium on Computer Architecture, 2016. [PDF]

Abstract: This work observes that a large fraction of the computations performed by Deep Neural Networks (DNNs) are intrinsically ineffectual as they involve a multiplication where one of the inputs is zero. This observation motivates Cnvlutin (CNV), a value-based approach to hardware acceleration that eliminates most of these ineffectual operations, improving performance and energy over a state-of-the-art accelerator with no accuracy loss. CNV uses hierarchical data-parallel units, allowing groups of lanes to proceed mostly independently enabling them to skip over the ineffectual computations. A co-designed data storage format encodes the computation elimination decisions taking them off the critical path while avoiding control divergence in the data parallel units. Combined, the units and the data storage format result in a data-parallel architecture that maintains wide, aligned accesses to its memory hierarchy and that keeps its data lanes busy. By loosening the ineffectual computation identification criterion, CNV enables further performance and energy efficiency improvements, and more so if a loss in accuracy is acceptable. Experimental measurements over a set of state-of-the-art DNNs for image classification show that CNV improves performance over a state-of-the-art accelerator from 1.24x to 1.55x and by 1.37x on average without any loss in accuracy by removing zero-valued operand multiplications alone. While CNV incurs an area overhead of 4.49%, it improves overall EDP (Energy Delay Product) and ED^2P (Energy Delay Squared Product) on average by 1.47x and 2.01x, respectively. The average performance improvements increase to 1.52x without any loss in accuracy with a broader ineffectual identification policy. Further improvements are demonstrated with a loss in accuracy.
BibTeX:
@inproceedings{Albericio2016,
  author = {Jorge Albericio and Patrick Judd and Tayler Hetherington and Tor Aamodt and Natalie {Enright Jerger} and Andreas Moshovos},
  title = {{CNVLUTIN}: Ineffectual-Neuron-Free Deep Neural Network Computing},
  booktitle = {Proceedings of the International Symposium on Computer Architecture},
  year = {2016}
}
The Anytime Automaton
Joshua San Miguel and Natalie Enright Jerger
In Proceedings of the International Symposium on Computer Architecture, 2016. [PDF]

Abstract: Approximate computing is an emerging paradigm enabling tradeoffs between accuracy and efficiency. However, a fundamental challenge persists: state-of-the-art techniques lack the ability to enforce runtime guarantees on accuracy. The convention is to 1) employ offline or online accuracy models, or 2) present experimental results that demonstrate empirically low error. Unfortunately, these approaches are still unable to guarantee acceptability of all application outputs at runtime. We offer a solution that revisits concepts from anytime algorithms. Originally explored for real-time decision problems, anytime algorithms have the property of producing results with increasing accuracy over time. We propose the Anytime Automaton, a new computation model that executes applications as a parallel pipeline of anytime approximations. An automaton produces approximate versions of the application output with increasing accuracy, guaranteeing that the final precise version is eventually reached. The automaton can be stopped whenever the output is deemed acceptable; otherwise, it is a simple matter of letting it run longer. We present an in-depth analysis of the model and demonstrate attractive runtime-accuracy profiles on various applications. Our anytime automaton is the first step towards systems where the acceptability of an application's output directly governs the amount of time and energy expended.
BibTeX:
@inproceedings{Josh2016a,
  author = {Joshua {San Miguel}  and Natalie {Enright Jerger}},
  title = {The Anytime Automaton},
  booktitle = {Proceedings of the International Symposium on Computer Architecture},
  year = {2016}
}
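The anytime property itself is easy to demonstrate in software. Below is a minimal sketch using a prefix-sampling scheme of my own invention (the paper's model is a parallel pipeline of anytime approximations, not this): each stage yields a more accurate estimate, the final one is exact, and the consumer can stop whenever the output is acceptable.

```python
# Minimal sketch (illustrative, not the paper's implementation) of the
# anytime property: a computation that yields successively more accurate
# approximations of its output and is guaranteed to end at the precise one.

def anytime_mean(values, stages=4):
    """Yield estimates of mean(values) from progressively larger prefixes;
    the last yield is the exact result."""
    n = len(values)
    for s in range(1, stages + 1):
        k = max(1, (n * s) // stages)       # grow the sampled prefix
        yield sum(values[:k]) / k           # approximate (exact when k == n)

data = list(range(1, 101))                  # true mean is 50.5
estimates = list(anytime_mean(data))
assert estimates[-1] == 50.5                # final version is precise
# Earlier estimates are available sooner and may be accepted at any time.
```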

Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks
Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger and Andreas Moshovos
In Proceedings of the International Conference on Supercomputing, 2016.

Abstract: This work exploits the tolerance of Deep Neural Networks (DNNs) to reduced precision numerical representations and specifically, their recently demonstrated ability to tolerate representations of different precision per layer while maintaining accuracy. This flexibility enables improvements over conventional DNN implementations that use a single, uniform representation. This work proposes Proteus, which reduces the data traffic and storage footprint needed by DNNs, resulting in reduced energy and improved area efficiency for DNN implementations. Proteus uses a different representation per layer for both the data (neurons) and the weights (synapses) processed by DNNs. Proteus is a layered extension over existing DNN implementations that converts between the numerical representation used by the DNN execution engines and the shorter, layer-specific fixed-point representation used when reading and writing data values to memory be it on-chip buffers or off-chip memory. Proteus uses a novel memory layout for DNN data, enabling a simple, low-cost and low-energy conversion unit. We evaluate Proteus as an extension to a state-of-the-art accelerator [7] which uses a uniform 16-bit fixed-point representation. On five popular DNNs Proteus reduces data traffic among layers by 43% on average while maintaining accuracy within 1% even when compared to a single precision floating-point implementation. As a result, Proteus improves energy by 15% with no performance loss. Proteus also reduces the data footprint by at least 38% and hence the amount of on-chip buffering needed resulting in an implementation that requires 20% less area overall. This area savings can be used to improve cost by building smaller chips, to process larger DNNs for the same on-chip area, or to incorporate an additional three execution engines increasing peak performance bandwidth by 18%.
BibTeX:
@inproceedings{Judd2016,
  author = {Patrick Judd and Jorge Albericio and Tayler Hetherington and Tor Aamodt and Natalie {Enright Jerger} and Andreas Moshovos},
  title = {Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks},
  booktitle = {Proceedings of the International Conference on Supercomputing}, 
  year = {2016}
}
Exploiting Interposer Technologies to Disintegrate and Reintegrate Multi-core Processors for Performance and Cost
Ajaykumar Kannan, Natalie Enright Jerger and Gabriel H. Loh
In IEEE MICRO Top Picks from Computer Architecture (to appear), 2016. [PDF]
Abstract: Silicon interposers enable the integration of multiple stacks of in-package memory to provide higher bandwidth or lower energy for memory accesses. Once the interposer has been paid for, there are new opportunities to exploit the interposer. In this work, we exploit the interposer to "disintegrate" a multi-core CPU chip into smaller chips that collectively cost less to manufacture than a single large chip. We study the performance-cost trade-offs of interposer-based, multi-chip, multi-core systems and propose new interposer network-on-chip (NoC) organizations to mitigate the performance challenges while preserving the cost benefits. While this work focuses on NoC support for disintegrated systems, it paves the way for a range of new research problems in interposer-based disintegrated systems.
BibTeX:
@article{Kannan2016,
  author = {Ajaykumar Kannan and Natalie {Enright Jerger} and Gabriel H. Loh},
  title = {Exploiting Interposer Technologies to Disintegrate and Reintegrate Multi-core Processors for Performance and Cost},
  journal = {IEEE Micro Top Picks from Computer Architecture},
  year = {2016}
}

Efficient Synthetic Traffic Models for Large, Complex SoCs
Jieming Yin, Onur Kayiran, Matthew Poremba, Gabriel Loh and Natalie Enright Jerger
In Proceedings of the International Symposium on High Performance Computer Architecture, 2016. [PDF]

Abstract: The interconnect or network on chip (NoC) is an increasingly important component in processors. As systems scale up in size and functionality, the ability to efficiently model larger and more complex NoCs becomes increasingly important to the design and evaluation of such systems. Recent work proposed the "SynFull" methodology that performs statistical analysis of a workload's NoC traffic to create compact traffic generators based on Markov models. While the models generate synthetic traffic, the traffic is statistically similar to the original trace and can be used for fast NoC simulation. However, the original SynFull work only evaluated multi-core CPU scenarios with a very simple cache coherence protocol (MESI). We find the original SynFull methodology to be insufficient when modeling the NoC of a more complex system on a chip (SoC). We identify and analyze the shortcomings of SynFull in the context of a SoC consisting of a heterogeneous architecture (CPU and GPU), a more complex cache hierarchy including support for full coherence between CPU, GPU, and shared caches, and heterogeneous workloads. We introduce new techniques to ad- dress these shortcomings. Furthermore, the original SynFull methodology can only model a NoC with N nodes when the original application analysis is performed on an identically-sized N-node system, but one typically wants to model larger future systems. Therefore, we introduce new techniques to enable SynFull-like analysis to be extrapolated to model such larger systems. Finally, we present a novel synthetic memory reference model to replace SynFull's fixed latency model; this allows more realistic evaluation of the memory subsystem and its interaction with the NoC. The result is a robust NoC simulation methodology that works for large, heterogeneous SoC architectures.
BibTeX:
@inproceedings{Yin2016,
  author = {Jieming Yin and Onur Kayiran and Matthew Poremba and Gabriel Loh  and Natalie {Enright Jerger}},
  title = {Efficient Synthetic Traffic Models for Large Complex {SoCs}},
  booktitle = {Proceedings of the International Symposium on High Performance Computer Architecture},
  year = {2016}
}
The Runahead Network-on-Chip
Zimo Li, Joshua San Miguel and Natalie Enright Jerger
In Proceedings of the International Symposium on High Performance Computer Architecture, 2016. [PDF]

Abstract: With increasing core counts and higher memory de- mands from applications, it is imperative that networks-on-chip (NoCs) provide low-latency, power-efficient communication. Conventional NoCs tend to be over- provisioned for worst-case bandwidth demands leading to ineffective use of network resources and significant power inefficiency; average channel utilization is typically less than 5% in real-world applications. In terms of performance, low-latency techniques often introduce power and area overheads and incur significant complexity in the router microarchitecture. We find that both low latency and power efficiency are possible by relaxing the constraint of lossless communication. This is inspired from internetworking where best effort delivery is commonplace. We propose the Runahead NoC, a lightweight, lossy network that provides single-cycle hops. Allowing for lossy delivery enables an extremely simple bufferless router microarchitecture that performs routing and arbitration within the same cycle as link traversal. The Runahead NoC operates either as a power-saver that is integrated into an existing conventional NoC to improve power efficiency, or as an accelerator that is added on top to provide ultra-low latency communication for select packets. On a range of PARSEC and SPLASH-2 workloads, we find that the Runahead NoC reduces power consumption by 1.81x as a power-saver and improves runtime and packet latency by 1.08x and 1.66x as an accelerator.
BibTeX:
@inproceedings{Li2016,
  author = {Zimo Li and Joshua {San Miguel}  and Natalie {Enright Jerger}},
  title = {The Runahead Network-on-Chip},
  booktitle = {Proceedings of the International Symposium on High Performance Computer Architecture},
  year = {2016}
}
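The enabling relaxation, lossy best-effort delivery, can be illustrated with a toy arbiter (my own sketch, not the router microarchitecture): contending packets are never buffered; one wins the output port and the rest are simply dropped, relying on the conventional lossless NoC for eventual delivery.

```python
# Illustrative sketch (not the hardware design) of lossy, best-effort
# arbitration: when two packets want the same output port in the same cycle,
# one wins and the other is dropped rather than buffered, which is what lets
# the router perform routing, arbitration and link traversal in one cycle.

def arbitrate(requests):
    """requests: list of (packet, output_port). Returns (delivered, dropped)."""
    winners, dropped = {}, []
    for pkt, port in requests:
        if port not in winners:
            winners[port] = pkt          # first claimant wins the port
        else:
            dropped.append(pkt)          # loser is dropped, not buffered
    return list(winners.values()), dropped

delivered, dropped = arbitrate([("A", 0), ("B", 0), ("C", 1)])
assert delivered == ["A", "C"] and dropped == ["B"]
```

Dropping instead of buffering eliminates the buffers and multi-stage pipeline of a conventional router, which is where the power savings in the abstract come from.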
Doppelganger: A Cache for Approximate Computing
Joshua San Miguel, Jorge Albericio, Andreas Moshovos, Natalie Enright Jerger
In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48), 2015.

Abstract: Modern processors contain large last level caches (LLCs) that consume substantial energy and area yet are imperative for high performance. Cache designs have improved dramatically by considering reference locality. Data values are also a source of optimization. Compression and deduplication exploit data values to use cache storage more efficiently resulting in smaller caches without sacrificing performance. In multi-megabyte LLCs, many identical or similar values may be cached across multiple blocks simultaneously. This redundancy effectively wastes cache capacity. We observe that a large fraction of cache values exhibit approximate similarity. More specifically, values across cache blocks are not identical but are similar. Coupled with approximate computing which observes that some applications can tolerate error or inexactness, we leverage approximate similarity to design a novel LLC architecture: the Doppelganger cache. The Doppelganger cache associates the tags of multiple similar blocks with a single data array entry to reduce the amount of data stored. Our design achieves 1.55x, 2.55x and 1.41x reductions in LLC area, dynamic energy and leakage energy without harming performance or incurring high application error.
BibTeX:
@inproceedings{SanMiguel2015b,
  author = {Joshua {San Miguel} and Jorge Albericio and Andreas Moshovos  and Natalie {Enright Jerger}},
  title = {Doppelganger: A Cache for Approximate Computing},
  booktitle = {Proceedings of the International Symposium on Microarchitecture},
  year = {2015}
}
Enabling Interposer-based Disintegration of Multi-core Processors
Ajaykumar Kannan, Natalie Enright Jerger, Gabriel H. Loh
In Proceedings of the 48th International Symposium on Microarchitecture (MICRO-48), 2015.

Abstract: Silicon interposers enable the integration of multiple stacks of in-package memory to provide higher bandwidth or lower energy for memory accesses. Once the interposer has been paid for, there are new opportunities to exploit the interposer. Recent work considers using the routing resources of the interposer to improve the network-on-chip's (NoC) capabilities. In this work, we exploit the interposer to disintegrate a multi-core CPU chip into smaller chips that individually and collectively cost less to manufacture than a single large chip. However, this fragments the overall NoC, which decreases performance as core-to-core messages between chips must now route through the interposer. We study the performance-cost trade-offs of interposer-based, multi-chip, multi-core systems and propose new interposer NoC organizations to mitigate the performance challenges while preserving the cost benefits.
BibTeX:
@inproceedings{Kannan2015,
  author = {Ajaykumar Kannan and Natalie {Enright Jerger} and Gabriel Loh},
  title = {Enabling Interposer-based Disintegration of Multi-core Processors},
  booktitle = {Proceedings of the International Symposium on Microarchitecture},
  year = {2015}
}

Interconnect-Memory Challenges for Multi-chip Silicon Interposer Systems
Gabriel Loh, Natalie Enright Jerger, Ajaykumar Kannan and Yasuko Eckert
In Proceedings of the International Symposium on Memory Systems, 2015.

Abstract: Silicon interposer technology is promising for large-scale integration of memory within a processor package. While past work on vertical, 3D-stacked memory allows a stack of memory to be placed directly on top of a processor, the total amount of memory that could be integrated is limited by the size of the processor die. With silicon interposers, multiple memory stacks can be integrated inside the processor package, thereby increasing both the capacity and the bandwidth provided by the 3D memory. However, the full potential of all of this integrated memory may be squandered if the in-package interconnect architecture cannot keep up with the data rates provided by the multiple memory stacks. This position paper describes key issues in providing the interconnect support for aggressive interposer-based memory integration, and argues for additional research efforts to address these challenges to enable integrated memory to deliver its full value.
BibTeX:
@inproceedings{Loh2015,
  author = {Gabriel Loh and Natalie {Enright Jerger} and Ajaykumar Kannan and Yasuko Eckert},
  title = {Interconnect-Memory Challenges for Multi-chip Silicon Interposer Systems},
  booktitle = {Proceedings of the International Symposium on Memory Systems},
  year = {2015}
}
Improving DVFS in NoCs with Coherence Prediction
Robert Hesse, Natalie Enright Jerger
In Proceedings of the 9th International Symposium on Networks-on-Chip (NOCS), 2015.

Abstract: As Networks-on-Chip (NoCs) continue to consume a large fraction of the total chip power budget, dynamic voltage and frequency scaling (DVFS) has evolved into an integral part of NoC designs. Efficient DVFS relies on accurate predictions of future network state. Most previous approaches are reactive and based on network-centric metrics, such as buffer occupation and channel utilization. However, we find that there is little correlation between those metrics and subsequent NoC traffic, which leads to suboptimal DVFS decisions. In this work, we propose to utilize highly predictable properties of cache-coherence communication to derive more specific and reliable NoC traffic predictions. A DVFS mechanism based on our traffic predictions reduces power by 41% compared to a baseline without DVFS and by 21% on average when compared to a state-of-the-art DVFS implementation, while only degrading performance by 3%.
BibTeX:
@inproceedings{Hesse2015,
  author = {Robert Hesse and Natalie {Enright Jerger}},
  title = {Improving {DVFS} in {NoCs} with Coherence Prediction},
  booktitle = {Proceedings of the International Symposium on Networks on Chip},
  year = {2015}
}
Data Criticality in Network-on-Chip Design
Joshua San Miguel, Natalie Enright Jerger
In Proceedings of the 9th International Symposium on Networks-on-Chip (NOCS), 2015.

Abstract: Many network-on-chip (NoC) designs focus on maximizing performance, delivering data to each core no later than needed by the application. Yet to achieve greater energy efficiency, we argue that it is just as important that data is delivered no earlier than needed. To address this, we explore data criticality in CMPs. Caches fetch data in bulk (blocks of multiple words). Depending on the application's memory access patterns, some words are needed right away (critical) while other data are fetched too soon (non-critical). On a wide range of applications, we perform a limit study of the impact of data criticality in NoC design. Criticality-oblivious designs can waste up to 37.5% energy, compared to an idealized NoC that fetches each word both no later and no earlier than needed. Furthermore, 62.3% of energy is wasted fetching data that is not used by the application. We present NoCNoC, a practical, criticality-aware NoC design that achieves up to 60.5% energy savings with no loss in performance. Our work moves towards an ideally-efficient NoC, delivering data both no later and no earlier than needed.
BibTeX:
@inproceedings{SanMiguel2015,
  author = {Joshua {San Miguel} and Natalie {Enright Jerger}},
  title = {Data Criticality in Network-on-Chip Design},
  booktitle = {Proceedings of the International Symposium on Networks on Chip},
  year = {2015}
}
Wormhole: Wisely predicting multidimensional branches
Jorge Albericio, Joshua San Miguel, Natalie Enright Jerger and Andreas Moshovos
In Proceedings of the International Symposium on Microarchitecture (MICRO-47), 2014. [PDF]

Abstract: Improving branch prediction accuracy is essential in enabling high-performance processors to find more concurrency and to improve energy efficiency by reducing wrong path instruction execution, a paramount concern in today's power-constrained computing landscape. Branch prediction traditionally considers past branch outcomes as a linear, continuous bit stream through which it searches for patterns and correlations. The state-of-the-art TAGE predictor and its variants follow this approach while varying the length of the global history fragments they consider.
This work identifies a construct, inherent to several applications that challenges existing, linear history based branch prediction strategies. It finds that applications have branches that exhibit multi-dimensional correlations. These are branches with the following two attributes: 1) they are enclosed within nested loops, and 2) they exhibit correlation across iterations of the outer loops. Folding the branch history and interpreting it as a multidimensional piece of information, exposes these cross-iteration correlations allowing predictors to search for more complex correlations in the history space with lower cost. We present wormhole, a new side-predictor that exploits these multidimensional histories. Wormhole is integrated alongside ISL-TAGE and leverages information from its existing side-predictors. Experiments show that the wormhole predictor improves accuracy more than existing side-predictors, some of which are commercially available, with a similar hardware cost. Considering 40 diverse application traces, the wormhole predictor reduces MPKI by an average of 2.53% and 3.15% on top of 4KB and 32KB ISL-TAGE predictors respectively. When considering the top four workloads that exhibit multi-dimensional history correlations, Wormhole achieves 22% and 20% MPKI average reductions over 4KB and 32KB ISL-TAGE.
BibTeX:
@inproceedings{Albericio2014,
  author = {Jorge Albericio and Joshua {San Miguel} and Natalie {Enright Jerger} and Andreas Moshovos},
  title = {Wormhole: Wisely Predicting Multidimensional Branches},
  booktitle = {Proceedings of the International Symposium on Microarchitecture},
  year = {2014}
}
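The history-folding idea can be shown with a toy example (illustrative only; the actual wormhole predictor is a side-predictor integrated with ISL-TAGE, and these helper names are my own). Folding the flat outcome stream into rows of one inner-loop iteration exposes the cross-iteration correlation: the best predictor for a bit is the one directly above it, not the most recent bits.

```python
# Toy sketch of folding a linear branch history into two dimensions.
# For a branch inside an inner loop of length `inner`, the outcome often
# correlates with the same position in the previous outer iteration.

def fold_history(history, inner):
    """Reshape a flat outcome stream into rows of one inner-loop iteration."""
    return [history[i:i + inner] for i in range(0, len(history), inner)]

def predict_same_column(history, inner):
    """Predict the next outcome from the bit directly 'above' it: the same
    inner-loop position, one outer iteration ago."""
    return history[-inner]

# Outcome pattern 1,0,0 repeating across outer iterations (inner loop of 3):
hist = [1, 0, 0] * 5
assert fold_history(hist, 3) == [[1, 0, 0]] * 5
assert predict_same_column(hist, 3) == 1   # next outcome starts a new row: 1
```

A linear predictor would need to track a history at least `inner` bits long to see this pattern; the folded view finds it with a single lookup at a fixed offset.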
NoC Architectures for Silicon Interposer Systems
Natalie Enright Jerger, Ajaykumar Kannan, Zimo Li and Gabriel H. Loh
In Proceedings of the International Symposium on Microarchitecture (MICRO-47), 2014. [PDF]

Abstract: Silicon interposer technology ("2.5D" stacking) enables the integration of multiple memory stacks with a processor chip, thereby greatly increasing in-package memory capacity while largely avoiding the thermal challenges of 3D stacking DRAM on the processor. Systems employing interposers for memory integration use the interposer to provide point-to-point interconnects between chips. However, these interconnects only utilize a fraction of the interposer's overall routing capacity, and in this work we explore how to take advantage of this otherwise unused resource. We describe a general approach for extending the architecture of a network-on-chip (NoC) to better exploit the additional routing resources of the silicon interposer. We propose an asymmetric organization that distributes the NoC across both a multi-core chip and the interposer, where each sub-network is different from the other in terms of the traffic types, topologies, the use or non-use of concentration, direct vs. indirect network organizations, and other network attributes. Through experimental evaluation, we show that exploiting the otherwise unutilized routing resources of the interposer can lead to significantly better performance.
BibTeX:
@inproceedings{EnrightJerger2014,
  author = {Natalie {Enright Jerger} and Ajaykumar Kannan and Zimo Li and Gabriel Loh},
  title = {{NoC} Architectures for Silicon Interposer Systems},
  booktitle = {Proceedings of the International Symposium on Microarchitecture},
  year = {2014}
}
Load Value Approximation
Joshua San Miguel, Mario Badr and Natalie Enright Jerger
In Proceedings of the International Symposium on Microarchitecture (MICRO-47) [PDF]

Abstract: Approximate computing explores opportunities that emerge when applications can tolerate error or inexactness. These applications, which range from multimedia processing to machine learning, operate on inherently noisy and imprecise data. We can trade off some loss in output value integrity for improved processor performance and energy-efficiency. As memory accesses consume substantial latency and energy, we explore load value approximation, a microarchitectural technique to learn value patterns and generate approximations for the data. The processor uses these approximate data values to continue executing without incurring the high cost of accessing memory, removing load instructions from the critical path. Load value approximation can also inhibit approximated loads from accessing memory, resulting in energy savings. On a range of PARSEC workloads, we observe up to 28.6% speedup (8.5% on average) and 44.1% energy savings (12.6% on average), while maintaining low output error. By exploiting the approximate nature of applications, we draw closer to the ideal latency and energy of accessing memory.
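A minimal sketch of the mechanism (names invented for the example; the paper's approximator is a trained microarchitectural structure, not this toy): a per-load last-value-plus-stride approximator lets dependent instructions proceed with a guess while the true load completes off the critical path:

```python
class LoadValueApproximator:
    """Toy per-load approximator: last observed value plus last stride."""

    def __init__(self):
        self.last = 0.0
        self.stride = 0.0

    def approximate(self):
        # Guess handed to dependent instructions instead of stalling.
        return self.last + self.stride

    def train(self, actual):
        # Update once the real memory access completes.
        self.stride = actual - self.last
        self.last = actual

lva = LoadValueApproximator()
for value in [2.0, 4.0, 6.0]:  # loads walking a regular stride
    lva.train(value)
print(lva.approximate())  # 8.0: exact for perfectly strided data
```

For regularly strided data the guess is exact; for noisy data the application's error tolerance absorbs the difference.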
BibTeX:
@inproceedings{SanMiguel2014b,
  author = {Joshua {San Miguel} and Mario Badr and Natalie {Enright Jerger}},
  title = {Load Value Approximation},
  booktitle = {Proceedings of the International Symposium on Microarchitecture},
  year = {2014}
}
Dodec: Random-Link, Low-Radix On-Chip Networks
Haofan Yang, Jyoti Tripathi, Natalie Enright Jerger and Dan Gibson
In Proceedings of the International Symposium on Microarchitecture (MICRO-47) [PDF]

Abstract: Network topology plays a vital role in chip design; it largely determines network cost (power and area) and significantly impacts communication performance in many-core architectures. Conventional topologies such as a 2D mesh have drawbacks including high diameter as the network scales and poor load balancing for the center nodes. We propose a methodology to design random topologies for on-chip networks. Random topologies provide better scalability in terms of network diameter and provide inherent load balancing. As a proof-of-concept for random on-chip topologies, we explore a novel set of networks -- dodecs -- and illustrate how they reduce network diameter with randomized low-radix router connections. While a 4x4 mesh has a diameter of 6, our dodec has a diameter of 4 with lower cost. By introducing randomness, dodec networks exhibit more uniform message latency. By using low-radix routers, dodec networks simplify the router microarchitecture and attain 20% area and 22% power reduction compared to mesh routers while delivering the same overall application performance for PARSEC.
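The mesh-diameter figure quoted above is easy to check with a breadth-first search; this sketch (our own, not the paper's tooling) computes the diameter of a 4x4 mesh:

```python
from collections import deque

def diameter(adj):
    """Longest shortest path over all node pairs, via BFS from each node."""
    def bfs(src):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(bfs(n) for n in adj)

# Build a 4x4 mesh: each node links to its north/south/east/west neighbours.
mesh = {(x, y): [] for x in range(4) for y in range(4)}
for (x, y) in mesh:
    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
        if (nx, ny) in mesh:
            mesh[(x, y)].append((nx, ny))

print(diameter(mesh))  # 6: opposite corners are 3 + 3 hops apart
```

The same `diameter` routine applied to candidate random low-radix graphs is how one would verify a dodec-style topology's smaller diameter.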
BibTeX:
@inproceedings{Yang2014,
  author = {Haofan Yang and Jyoti Tripathi and Natalie {Enright Jerger} and Dan Gibson},
  title = {Dodec: Random-Link, Low-Radix On-Chip Networks},
  booktitle = {Proceedings of the International Symposium on Microarchitecture},
  year = {2014}
}
Efficient and Programmable Ethernet Switching with a NoC-Enhanced FPGA
Andrew Bitar, Jeffrey Cassidy, Natalie Enright Jerger and Vaughn Betz
In Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communication Systems, October 2014 [PDF]

Abstract: Communications systems make heavy use of FPGAs; their programmability allows system designers to keep up with emerging protocols and their high-speed transceivers enable high bandwidth designs. While FPGAs are extensively used for packet parsing, inspection and classification, they have seen less use as the switch fabric between network ports. However, recent work has proposed embedding a network-on-chip (NoC) as a new "hard" resource on FPGAs and we show that by properly leveraging such a NoC one can create a very efficient yet still highly programmable network switch.
We compare a NoC-based 16x16 network switch for 10-Gigabit Ethernet traffic to a recent innovative FPGA-based switch fabric design. The NoC-based switch not only consumes 5.8% less logic area, but also reduces latency by 8.1%. We also show that using the FPGA's programmable interconnect to adjust the packet injection points into the NoC leads to significant performance improvements. A routing algorithm tailored to this application is shown to further improve switch performance and scalability. Overall, we show that an FPGA with a low-cost hard 64-node mesh NoC with 64-bit links can support a 16x16 switch with up to 948 Gbps in aggregate bandwidth, roughly matching the transceiver bandwidth on the latest FPGAs.
BibTeX:
@inproceedings{Bitar2014,
  author = {Andrew Bitar and Jeffrey Cassidy and Natalie {Enright Jerger} and Vaughn Betz},
  title = {Efficient and Programmable Ethernet Switching with a {NoC}-Enhanced {FPGA}},
  booktitle = {Proceedings of the ACM/IEEE Symposium on Architectures for Networking and Communication Systems},
  year = {2014}
}
Sampling-based Approaches to Accelerating Network-on-Chip Simulation
Wenbo Dai and Natalie Enright Jerger
In Proceedings of the International Symposium on Networks on Chip, September 2014 [PDF]

Abstract: Architectural complexity continues to grow as we consider the large design space of multiple cores, cache architectures, networks-on-chip (NoC) and memory controllers. Simulators are growing in complexity to reflect these system components. However, many full-system simulators fail to utilize the underlying hardware resources such as multiple cores; consequently, simulation times have grown significantly. Long turnaround times limit the range and depth of design space exploration.
Communication has emerged as a first class design consideration and has led to significant research into NoCs. NoC is yet another component of the architecture that must be faithfully modeled in simulation. Here, we focus on accelerating NoC simulation through the use of sampling techniques. We propose NoCLabs and NoCPoint, two sampling methodologies utilizing statistical sampling theory and traffic phase behavior, respectively. Experimental results show that NoCLabs and NoCPoint estimate NoC performance with an average error of 7% while achieving one order of magnitude speedup.
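The statistical-sampling half of the approach can be conveyed in a few lines (a generic illustration with synthetic numbers, not the NoCLabs implementation): estimate mean packet latency by measuring only a small random subset of a long trace:

```python
import random

random.seed(42)  # fixed seed so the example is deterministic

# Synthetic "full trace" of per-packet latencies with a periodic phase pattern.
trace = [10 + (i % 7) for i in range(10_000)]
true_mean = sum(trace) / len(trace)

# Simple random sampling: simulate/measure only 5% of the trace.
sample = random.sample(trace, 500)
estimate = sum(sample) / len(sample)

error = abs(estimate - true_mean) / true_mean
print(f"true {true_mean:.2f}, estimate {estimate:.2f}, error {error:.1%}")
```

Statistical sampling theory bounds the expected error as the sample grows, which is what lets a sampled NoC simulation trade a small, quantifiable inaccuracy for an order-of-magnitude speedup.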
BibTeX:
@inproceedings{Dai2014b,
  author = {Wenbo Dai and Natalie {Enright Jerger}},
  title = {Sampling-Based Approaches to Accelerate Network-on-Chip Simulation},
  booktitle = {Proceedings of the International Symposium on Networks on Chip},
  year = {2014}
}
QuT: A Low-Power All Optical Architecture for a Next Generation of Network-on-Chip
Parisa Khadem Hamedani, Natalie Enright Jerger and Shaahin Hessabi
In Proceedings of the International Symposium on Networks on Chip, September 2014 [PDF]

Abstract: To enable the adoption of optical Networks-on-Chip (NoCs) and allow them to scale to large systems, they must be designed to consume less power and energy. Therefore, optical NoCs must use a small number of wavelengths, avoid excessive insertion loss and reduce the number of microring resonators. We propose the Quartern Topology (QuT), a novel low-power all-optical NoC. We also propose a deterministic wavelength routing algorithm based on Wavelength Division Multiplexing that allows us to reduce the number of wavelengths and microring resonators in optical routers. The key advantages of the QuT network are simplicity and lower power consumption. We compare QuT against three alternative all-optical NoCs: optical Spidergon, λ-router and Corona under different synthetic traffic patterns. QuT demonstrates good scalability with significantly lower power and competitive latency. Our optical topology reduces power by 23%, 86.3% and 52.7% compared with 128-node optical Spidergon, λ-router and Corona, respectively.
BibTeX:
@inproceedings{Hamedani2014,
  author = {Parisa Khadem Hamedani and Natalie {Enright Jerger} and Shaahin Hessabi},
  title = {{QuT}: A Low-Power Optical Network-on-Chip},
  booktitle = {Proceedings of the International Symposium on Networks on Chip},
  year = {2014}
}
SynFull: Synthetic Traffic Models Capturing a Full Range of Cache Coherent Behaviour
Mario Badr and Natalie Enright Jerger
In Proceedings of the International Symposium on Computer Architecture, June 2014. [PDF]

SynFull is available for download here


Abstract: Modern and future many-core systems represent complex architectures. The communication fabrics of these large systems heavily influence their performance and power consumption. Current simulation methodologies for evaluating networks-on-chip (NoCs) are not keeping pace with the increased complexity of our systems; architects often want to explore many different design knobs quickly. Methodologies that capture workload trends with faster simulation times are highly beneficial at early stages of architectural exploration. We propose SynFull, a synthetic traffic generation methodology that captures both application and cache coherence behaviour to rapidly evaluate NoCs. SynFull allows designers to quickly indulge in detailed performance simulations without the cost of long-running full-system simulation. By capturing a full range of application and coherence behaviour, architects can avoid the over- or under-design of the network as may occur when using traditional synthetic traffic patterns such as uniform random. SynFull has errors as low as 0.3% and provides 50x speedup on average over full-system simulation.
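The flavour of the approach can be sketched with a tiny two-phase Markov traffic model (the phase names, rates and transition probabilities are invented for illustration; SynFull's models are trained from real coherence traffic):

```python
import random

random.seed(0)  # deterministic for the example

# Two hypothetical application phases with different injection rates.
rates = {"compute": 0.05, "comm": 0.60}  # packets per cycle
trans = {"compute": {"compute": 0.9, "comm": 0.1},
         "comm":    {"comm": 0.8, "compute": 0.2}}

def next_state(state):
    """Sample the next phase from the transition distribution."""
    r, acc = random.random(), 0.0
    for nxt, p in trans[state].items():
        acc += p
        if r < acc:
            return nxt
    return state

state, packets = "compute", 0
for _cycle in range(10_000):
    packets += random.random() < rates[state]  # inject at the phase's rate
    state = next_state(state)

print(packets)  # roughly 0.23 packets/cycle in steady state
```

Because the model is tiny, it regenerates bursty, phase-structured traffic in seconds, instead of replaying a full-system simulation to obtain it.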
BibTeX:
@inproceedings{Badr2014,
  author = {Mario Badr and Natalie {Enright Jerger}},
  title = {{SynFull}: Synthetic Traffic Models Capturing Cache Coherent Behaviour},
  booktitle = {Proceedings of the International Symposium on Computer Architecture},
  year = {2014}
}
Evaluating the Memory System Behavior of Smartphone Workloads
Goran Narancic, Patrick Judd, Di Wu, Islam Atta, Michel El Nacouzi, Jason Zebchuk, Jorge Albericio, Natalie Enright Jerger, Andreas Moshovos, Kyros Kutulakos and Serag Gadelrab
In Proceedings of the International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation [PDF]
Abstract: Modern smartphones comprise several processing and input/output units that communicate mostly through main memory. As a result, memory represents a critical performance bottleneck for smartphones. This work introduces a set of emerging workloads for smartphones and characterizes the performance of several memory controller policies and address-mapping schemes for those workloads. The workloads include high-resolution video conferencing, computer vision algorithms such as upper-body detection and feature extraction, computational photography techniques such as high dynamic range imaging, and web browsing. This work also considers combinations of these workloads that represent possible use cases of future smartphones such as detecting and focusing on people or other objects in live video. While some of these workloads have been characterized before, this is the first work that studies address mapping and memory controller scheduling for these workloads. Experimental analysis demonstrates: (1) Most of the workloads are either memory throughput or latency bound, straining a conventional smartphone main memory system. (2) The address mapping schemes that balance row locality with concurrency among different banks and ranks are best. (3) The FR-FCFS with write drain memory scheduler performs best, outperforming some more recently proposed schedulers targeted at multi-threaded workloads on general purpose processors. These results suggest that there is potential to improve memory performance and that existing schedulers developed for other platforms ought to be revisited and tuned to match the demands of such smartphone workloads.
BibTeX:
@inproceedings{Narancic2014,
  author = {Goran Narancic and Patrick Judd and Di Wu and Islam Atta and Michel El Nacouzi and Jason Zebchuk and Jorge Albericio and Natalie {Enright Jerger} and Andreas Moshovos and Kyros Kutulakos and Serag Gadelrab},
  title = {Evaluating the Memory System Behavior of Smartphone Workloads},
  booktitle = {International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation},
  year = {2014},
}
Leaving One Slot Empty: Flit Bubble Flow Control for Torus Cache-Coherent NoCs
Sheng Ma, Zhiying Wang, Zonglin Liu, and Natalie Enright Jerger
IEEE Transactions on Computers (Preprint) [PDF]
BibTeX:
@article{Ma2013a,
  author = {Sheng Ma and Zhiying Wang and Zonglin Liu and Natalie {Enright Jerger}},
  title = {Leaving One Slot Empty: Flit Bubble Flow Control for Torus Cache-coherent NoCs},
  journal = {IEEE Transactions on Computers},
  year = {2013},
  volume = {99},
  pages = {1}
}
Novel Flow Control for Fully Adaptive Routing in Cache-Coherent NoCs
Sheng Ma, Zhiying Wang, Natalie Enright Jerger, Li Shen and Nong Xiao
IEEE Transactions on Parallel and Distributed Systems (Preprint) [PDF][Supplement]
Abstract: Routing algorithms for cache-coherent NoCs only have limited VCs at their disposal, which poses challenges to the design of routing algorithms. Existing fully adaptive routing algorithms apply conservative VC re-allocation: only empty VCs can be re-allocated, which limits performance. We propose two novel flow control designs. First, whole packet forwarding (WPF) re-allocates a non-empty VC if the VC has enough free buffers for an entire packet. WPF does not induce deadlock if the routing algorithm is deadlock-free using conservative VC re-allocation. It is an important extension to several deadlock avoidance theories. Second, we extend Duato's theory [11] to apply aggressive VC re-allocation on escape VCs without deadlock. Finally, we propose a design which maintains maximal routing flexibility with low hardware cost. For synthetic traffic, our design performs on average 88.9% better than existing fully adaptive routing. Our design is superior to partially adaptive and deterministic routing.
BibTeX:
@article{Ma2013,
  author = {Sheng Ma and Zhiying Wang and Natalie {Enright Jerger} and Li Shen and Nong Xiao},
  title = {Novel Flow Control for Fully Adaptive Routing in Cache-Coherent NoCs},
  journal = {IEEE Transactions on Parallel and Distributed Systems},
  year = {2013},
  volume = {99},
  pages = {1}
}
DistCL: A Framework for Distributed Execution of OpenCL Kernels
Tahir Diop, Steven Gurfinkel, Jason Anderson and Natalie Enright Jerger
In Proceedings of the International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) August 2013. [PDF]

DistCL is available for download here


Abstract: GPUs are used to speed up many scientific computations; however, to use several networked GPUs concurrently, the programmer must explicitly partition work and transmit data between devices. We propose DistCL, a novel framework that distributes the execution of OpenCL kernels across a GPU cluster. DistCL makes multiple distributed compute devices appear to be a single compute device. DistCL abstracts and manages many of the challenges associated with distributing a kernel across multiple devices including: (1) partitioning work into smaller parts, (2) scheduling these parts across the network, (3) partitioning memory so that each part of memory is written to by at most one device, and (4) tracking and transferring these parts of memory. Converting an OpenCL application to DistCL is straightforward and requires little programmer effort. This makes it a powerful and valuable tool for exploring the distributed execution of OpenCL kernels. We compare DistCL to SnuCL, which also facilitates the distribution of OpenCL kernels. We also offer some insights: distributed tasks favour more compute-bound problems and large contiguous memory accesses. DistCL achieves a maximum speedup of 29.1x and an average speedup of 7.3x when distributing kernels among 32 peers over an Infiniband cluster.
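Step (1), partitioning work into contiguous subranges for each device, can be sketched as follows (an illustrative stand-in; DistCL's real partitioning also tracks the memory dependences described above):

```python
def partition_range(global_size, num_devices):
    """Split the index range [0, global_size) into contiguous,
    near-equal subranges, one per device."""
    base, rem = divmod(global_size, num_devices)
    parts, start = [], 0
    for d in range(num_devices):
        size = base + (1 if d < rem else 0)  # spread the remainder
        parts.append((start, start + size))
        start += size
    return parts

print(partition_range(100, 32)[:3])  # [(0, 4), (4, 8), (8, 12)]
```

Contiguous subranges matter because, as the abstract notes, distributed execution favours large contiguous memory accesses: each device then reads and writes one dense region rather than a scatter of elements.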
BibTeX:
@inproceedings{Diop2013,
  author = {Tahir Diop and Steven Gurfinkel and Jason Anderson and Natalie {Enright Jerger}},
  title = {DistCL: A Framework for Distributed Execution of OpenCL Kernels},
  booktitle = {International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)},
  year = {2013}
}
Performance Analysis of Broadcasting Algorithms on the Intel Single Chip Cloud Computer
John Matienzo and Natalie Enright Jerger
In Proceedings of the International Symposium on Performance Analysis of Systems and Software (ISPASS). April 2013. [PDF]
Abstract: Efficient broadcasting is essential for good performance on distributed or multiprocessor systems. Broadcasts are commonly used to implement message passing synchronization primitives, such as barriers, and also appear frequently in the set-up stage of scientific applications. The Intel Single-Chip Cloud Computer (SCC), an experimental processor, uses synchronous message passing to facilitate communication between its 48 cores. RCCE, the SCC's message passing library, implements broadcasting in a traditional way: sending n-1 unicast messages, where n is the number of cores participating in the broadcast. This implementation can hinder performance as the number of cores participating in the broadcast increases and if the data being sent to each core is large. Also, in the RCCE implementation, the broadcasting core is blocked from doing any useful work until all cores receive the broadcast.
This paper explores several broadcasting schemes that take advantage of the resources of the SCC and the RCCE library. For example, we explore a scheme that propagates a broadcast to multiple cores in parallel and a scheme that parallelizes off-chip memory accesses which would otherwise need to be done sequentially. Our best broadcast scheme achieves a 35x speedup over the RCCE implementation. We also demonstrate that our improved broadcasting substantially reduces the time spent on communication in some benchmarks. While the broadcast schemes presented in this paper are implemented specifically for the SCC, they provide insight into the more general problem of broadcast communication and could be adapted to other types of distributed and multiprocessor systems.
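The core advantage of propagating a broadcast through multiple cores in parallel, rather than RCCE's sequential unicasts, is easy to quantify (a generic round-count sketch, not the paper's SCC-specific schemes):

```python
import math

def unicast_rounds(n):
    """RCCE-style broadcast: the root sends n-1 messages one after another."""
    return n - 1

def binomial_rounds(n):
    """Tree-based broadcast: every core that already holds the data forwards
    it each round, doubling the number of holders."""
    return math.ceil(math.log2(n))

# On the SCC's 48 cores: 47 sequential sends vs. 6 parallel rounds.
print(unicast_rounds(48), binomial_rounds(48))
```

The gap between n-1 and ceil(log2 n) rounds also explains why the root core is freed much earlier under a tree scheme: it participates in only the first few rounds instead of all n-1 sends.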
BibTeX:
@inproceedings{Matienzo2013,
  author = {John Matienzo and Natalie {Enright Jerger}},
  title = {Performance Analysis of Broadcasting Algorithms on the Intel Single Chip Cloud Computer},
  booktitle = {International Symposium on Performance Analysis of Systems and Software (ISPASS)},
  year = {2013}
}
A Dual-Grain Hit/Miss Detector for Large Stacked-Die DRAM Caches
Michel El Nacouzi, Islam Atta, Myrto Papadopoulou, Jason Zebchuk, Natalie Enright Jerger and Andreas Moshovos
In the Design, Automation and Test in Europe Conference. March 2013. [PDF]
Abstract: Die-Stacked DRAM caches offer the promise of improved performance and reduced energy by capturing a larger fraction of an application's working set than on-die SRAM caches. However, given that their latency is only 50% lower than that of main memory, DRAM caches considerably increase latency for misses. They also incur a significant energy overhead for remote lookups in snoop-based multi-socket systems. Ideally, it would be possible to detect in advance that a request will miss in the DRAM cache and thus selectively bypass it. This work proposes a dual grain filter which successfully predicts whether an access is a hit or a miss in most cases. Experimental results with commercial and scientific workloads show that a 158KB dual-grain filter can correctly predict data block residency for 85% of all accesses to a 256MB DRAM cache. As a result, off-die latency with our filter is nearly identical to that possible with an impractical, perfect filter.
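A toy two-level presence tracker conveys the dual-grain idea (the data structure and region size are invented for the example; the paper's filter is a compact hardware predictor, not exact software sets):

```python
class DualGrainFilter:
    """Track residency at coarse region granularity, falling back to
    per-block state only for regions that are partially cached."""

    BLOCKS_PER_REGION = 64  # e.g. 4KB regions of 64-byte blocks

    def __init__(self):
        self.regions = {}  # region id -> set of resident block addresses

    def insert(self, block):
        self.regions.setdefault(block // self.BLOCKS_PER_REGION, set()).add(block)

    def probe(self, block):
        resident = self.regions.get(block // self.BLOCKS_PER_REGION)
        if not resident:
            return False  # region never cached: predict miss, bypass the DRAM cache
        if len(resident) == self.BLOCKS_PER_REGION:
            return True   # fully resident region: predict hit cheaply
        return block in resident  # mixed region: fine-grain check

f = DualGrainFilter()
f.insert(5)
print(f.probe(5), f.probe(6), f.probe(9999))  # True False False
```

Most regions are either fully present or fully absent, so the coarse level answers the common case and the fine-grain state stays small, which is what makes a 158KB filter sufficient for a 256MB cache.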
BibTeX:
@inproceedings{ElNacouzi2013,
  author = {Michel El Nacouzi and Islam Atta and Myrto Papadopoulou and Jason Zebchuk and Natalie {Enright Jerger} and Andreas Moshovos},
  title = {A Dual-Grain Hit/Miss Detector for Large Stacked Die DRAM Caches},
  booktitle = {Conference on Design, Automation and Test in Europe (DATE)},
  year = {2013},
  pages = {89-92}
}
Moths: Mobile Threads for On-Chip Networks
Matthew Misler and Natalie Enright Jerger
ACM Transactions on Embedded Computing Systems (TECS) - Special section on ESTIMedia'12, LCTES'11, rigorous embedded systems design, and multiprocessor system-on-chip for cyber-physical systems, 2013

Abstract: As the number of cores integrated on a single chip continues to increase, communication has the potential to become a severe bottleneck to overall system performance. The presence of thread sharing and the distribution of data across cache banks on the chip can result in long-distance communication. Long-distance communication incurs substantial latency that impacts performance; furthermore, this communication consumes significant dynamic power when packets are switched over many Network-on-Chip (NoC) links and routers. Thread migration can mitigate problems created by long-distance communication. This article presents Moths, an efficient runtime algorithm that responds automatically to dynamic NoC traffic patterns, providing beneficial thread migration to decrease overall traffic volume and average packet latency. Moths reduces on-chip network latency by up to 28.4% (18.0% on average) and traffic volume by up to 24.9% (20.6% on average) across a variety of commercial and scientific benchmarks.
BibTeX:
@article{Misler2013,
  author = {Matthew Misler and Natalie {Enright Jerger}},
  title = {Moths: Mobile Threads for On-Chip Networks},
  journal = {ACM Transactions on Embedded Computing},
  year = {2013},
  volume = {12},
  pages = {56:1-22}
}
Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs
Sheng Ma, Natalie Enright Jerger, Zhiying Wang, Mingche Lai and Libo Huang
IEEE Transactions on Computers (Preprint) [PDF]
Abstract: To provide efficient, high-performance routing algorithms, a holistic approach should be taken. The key aspects of routing algorithm design include adaptivity, path selection strategy, VC allocation, isolation and hardware implementation cost; these design aspects are not independent. The key contribution of this work lies in the design of a novel selection strategy, Destination-Based Selection Strategy (DBSS) which targets interference that can arise in many-core systems running consolidation workloads. In the process of this design, we holistically consider all aspects to ensure an efficient design. Existing routing algorithms largely overlook issues associated with workload consolidation. Locally adaptive algorithms do not consider enough status information to avoid network congestion. Globally adaptive routing algorithms attack this issue by utilizing network status beyond neighboring nodes. However, they may suffer from interference, coupling the behavior of otherwise independent applications. To address these issues, DBSS leverages both local and non-local network status to provide more effective adaptivity. More importantly, by integrating the destination into the selection procedure, DBSS mitigates interference and offers dynamic isolation among applications. Results show that DBSS offers better performance than the best baseline selection strategy and improves the energy-delay product for medium and high injection rates; it is well suited for workload consolidation.
BibTeX:
@article{Ma2012b,
  author = {Sheng Ma and Natalie {Enright Jerger} and Zhiying Wang and Mingche Lai and Libo Huang},
  title = {Holistic Routing Algorithm Design to Support Workload Consolidation in NoCs},
  journal = {IEEE Transactions on Computers},
  year = {2012},
  volume = {99},
  pages = {1}
}
DART: A Programmable Architecture for NoC Simulation on FPGAs
Danyao Wang, Charles Lo, Jasmina Vasiljevic, Natalie Enright Jerger and J. Gregory Steffan
IEEE Transactions on Computers (Preprint) [PDF]
Abstract: The increased demand for on-chip communication bandwidth as a result of the multi-core trend has made packet-switched networks-on-chip (NoCs) a more compelling choice for the communication backbone in next-generation systems [1]. However, NoC designs have many power, area, and performance trade-offs in topology, buffer sizes, routing algorithms and flow control mechanisms; hence the study of new NoC designs can be very time-intensive. To address these challenges, we propose DART, a fast and flexible FPGA-based NoC simulation architecture. Rather than laying the NoC out in hardware on the FPGA like previous approaches [2], [3], our design virtualizes the NoC by mapping its components to a generic NoC simulation engine, composed of a fully-connected collection of fundamental components (e.g., routers and flit queues). This approach has two main advantages: (i) since it is virtualized it can simulate any NoC; and (ii) any NoC can be mapped to the engine without rebuilding it, which can take significant time for a large FPGA design. We demonstrate (i) that an implementation of DART on a Virtex-II Pro FPGA can achieve over 100x speedup over the cycle-based software simulator Booksim [4], while maintaining the same level of simulation accuracy, and (ii) that a more modern Virtex-6 FPGA can accommodate a 49-node DART implementation.
BibTeX:
@article{Wang2012,
  author = {Danyao Wang and Charles Lo and Jasmina Vasiljevic and Natalie {Enright Jerger} and J. Gregory Steffan},
  title = {DART: A Programmable Architecture for NoC Simulation on FPGAs},
  journal = {IEEE Transactions on Computers},
  year = {2012},
  volume = {99},
  pages = {1-1}
}
Fine-Grained Bandwidth Adaptivity in Networks-on-Chip Using Bidirectional Channels
Robert Hesse, Jeff Nicholls and Natalie Enright Jerger
In Proceedings of the 6th International Symposium on Networks-on-Chip. May 2012. [PDF]
Abstract: Networks-on-Chip (NoC) serve as efficient and scalable communication substrates for many-core architectures. Currently, the bandwidth provided in NoCs is overprovisioned for their typical usage case. In real-world multi-core applications, less than 5% of channels are utilized on average. Large bandwidth resources serve to keep network latency low during periods of peak communication demands. Increasing the average channel utilization through narrower channels could improve the efficiency of NoCs in terms of area and power; however, in current NoC architectures this degrades overall system performance. Based on thorough analysis of the dynamic behaviour of real workloads, we design a novel NoC architecture that adapts to changing application demands. Our architecture uses fine-grained bandwidth-adaptive bidirectional channels to improve channel utilization without negatively affecting network latency. Running PARSEC benchmarks on a cycle-accurate full-system simulator, we show that fine-grained bandwidth adaptivity can save up to 75% of channel resources while achieving 92% of overall system performance compared to the baseline network; no performance is sacrificed in our network design configured with 50% of the channel resources used in the baseline.
BibTeX:
@inproceedings{Hesse2012,
  author = {Robert Hesse and Jeff Nicholls and Natalie {Enright Jerger}},
  title = {Fine-Grained Bandwidth Adaptivity in Networks-on-Chip Using Bidirectional Channels},
  booktitle = {International Network on Chip Symposium (NOCS)},
  year = {2012},
  pages = {132-141}
}
Supporting Efficient Collective Communication in NoCs
Sheng Ma, Natalie Enright Jerger and Zhiying Wang
In Proceedings of the 18th International Symposium on High Performance Computer Architecture. February 2012. [PDF]
Abstract: Across many architectures and parallel programming paradigms, collective communication plays a key role in performance and correctness. Hardware support is necessary to prevent important collective communication from becoming a system bottleneck. Support for multicast communication in Networks-on-Chip (NoCs) has achieved substantial throughput improvements and power savings. In this paper, we explore support for reduction or many-to-one communication operations. As a case study, we focus on acknowledgement messages (ACK) that must be collected in a directory protocol before a cache line may be upgraded to or installed in the modified state. This paper makes two primary contributions: an efficient framework to support the reduction of ACK packets and a novel Balanced, Adaptive Multicast (BAM) routing algorithm. The proposed message combination framework complements several multicast algorithms. By combining ACK packets during transmission, this framework not only reduces packet latency by 14.1% for low-to-medium network loads, but also improves the network saturation throughput by 9.6% with little overhead. The balanced buffer resource configuration of BAM improves the saturation throughput by an additional 13.8%. For the PARSEC benchmarks, our design offers an average speedup of 12.7% and a maximal speedup of 16.8%.
BibTeX:
@inproceedings{Ma2012a,
  author = {Sheng Ma and Natalie {Enright Jerger} and Zhiying Wang},
  title = {Supporting Efficient Collective Communication in NoCs},
  booktitle = {International Symposium on High Performance Computer Architecture (HPCA)},
  year = {2012},
  pages = {165-177}
}
Whole Packet Forwarding: Efficient Design of Fully Adaptive Routing Algorithms for Networks-on-Chip
Sheng Ma, Natalie Enright Jerger and Zhiying Wang
In Proceedings of the 18th International Symposium on High Performance Computer Architecture. February 2012. [PDF]
Abstract: Routing algorithms for networks-on-chip (NoCs) typically only have a small number of virtual channels (VCs) at their disposal. Limited VCs pose several challenges to the design of fully adaptive routing algorithms. First, fully adaptive routing algorithms based on previous deadlock-avoidance theories require a conservative VC re-allocation scheme: a VC can only be re-allocated when it is empty, which limits performance. We propose a novel VC re-allocation scheme, whole packet forwarding (WPF), which allows a non-empty VC to be re-allocated. WPF leverages the observation that the majority of packets in NoCs are short. We prove that WPF does not induce deadlock if the routing algorithm is deadlock-free using conservative VC re-allocation. WPF is an important extension of previous deadlock-avoidance theories. Second, to efficiently utilize WPF in VC-limited networks, we design a novel fully adaptive routing algorithm which maintains packet adaptivity without significant hardware cost. Compared with conservative VC re-allocation, WPF achieves an average 88.9% saturation throughput improvement in synthetic traffic patterns and an average 21.3% and maximal 37.8% speedup for PARSEC applications with heavy network loads. Our design also offers higher performance than several partially adaptive and deterministic routing algorithms.
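The contrast between the two re-allocation rules fits in a few lines (a distilled sketch of the condition only, not the routers' full allocation logic; names are invented for the example):

```python
def can_allocate_vc(vc_occupied, vc_free_slots, packet_len, wpf=False):
    """Conservative rule: only an empty VC may be (re)allocated.
    WPF extension: a non-empty VC also qualifies if its remaining
    buffers can hold the entire incoming packet."""
    if not vc_occupied:
        return True
    return wpf and vc_free_slots >= packet_len

# A VC still holding flits of a previous packet, with 2 free slots,
# offered a 2-flit packet (most NoC packets are short):
print(can_allocate_vc(True, 2, 2))            # False: conservative rule stalls
print(can_allocate_vc(True, 2, 2, wpf=True))  # True: the whole packet fits
```

Requiring room for the whole packet is what preserves the deadlock-freedom argument: a packet admitted under WPF can always drain into the VC without waiting on downstream resources mid-packet.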
BibTeX:
@inproceedings{Ma2012,
  author = {Sheng Ma and Natalie {Enright Jerger} and Zhiying Wang},
  title = {Whole Packet Forwarding: Efficient Design of Fully Adaptive Routing Algorithms for Networks-on-Chip},
  booktitle = {International Symposium on High Performance Computer Architecture (HPCA)},
  year = {2012},
  pages = {467-479}
}
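The contrast between conservative VC re-allocation and whole packet forwarding can be sketched as a single admission check. This is a toy version of the condition, with illustrative parameter names: under WPF, a non-empty VC may be re-allocated as long as the entire incoming packet fits in the VC's free slots, so the packet can never stall partway inside the VC.

```python
def can_reallocate(vc_free_slots, vc_occupied, packet_len, conservative=True):
    """VC re-allocation check (toy model).
    Conservative scheme: the VC must be completely empty.
    Whole Packet Forwarding (WPF): a non-empty VC may be re-allocated
    if the whole packet fits in the remaining free slots."""
    if conservative:
        return vc_occupied == 0
    return vc_free_slots >= packet_len

# A short packet can enter a VC that still holds another packet's flits:
assert not can_reallocate(vc_free_slots=3, vc_occupied=2, packet_len=1)
assert can_reallocate(3, 2, 1, conservative=False)
assert not can_reallocate(3, 2, 5, conservative=False)  # long packet must wait
```

Since most NoC packets are short (the observation WPF leverages), the WPF branch accepts far more re-allocations in practice.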
Exploration of Temperature Constraints for Thermal-Aware Mapping of 3D Networks-on-Chip
Parisa Khadem Hamedani, Shaahin Hessabi, Hamid Sarbazi-Azad, Natalie Enright Jerger
In Proceedings of the 20th Euromicro International Conference on Parallel, Distributed and Network-Based Processing. February 2012. [PDF]
Abstract: This paper proposes three ILP-based static thermal-aware mapping algorithms for 3D Networks on Chip (NoC) to explore the thermal constraints and their effects on temperature and performance. Through complexity analysis, we show that the first algorithm, an optimal one, is not suitable for 3D NoC. Therefore, we develop two approximation algorithms and analyze their algorithmic complexities to demonstrate their efficiency. As the simulation results show, the mapping algorithms that employ direct thermal calculation to minimize the temperature reduce the peak temperature by up to 24% and 22%, for the benchmarks that have the highest communication rate and largest number of tasks, respectively. This comes at the price of a higher power-delay product. This exploration shows that considering power balancing early in the mapping algorithms does not affect the chip temperature. Moreover, it shows that considering the explicit performance constraint in the thermal mapping has no major effect on performance.
BibTeX:
@inproceedings{Hamedani2012,
  author = {Parisa Khadem Hamedani and Shaahin Hessabi and Hamid Sarbazi-Azad and Natalie {Enright Jerger}},
  title = {Exploration of Temperature Constraints for Thermal-Aware Mapping of 3D Networks-on-Chip},
  booktitle = {Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)},
  year = {2012},
  pages = {499-506}
}
DBAR: an efficient routing algorithm to support multiple concurrent applications in networks-on-chip
Sheng Ma, Natalie Enright Jerger, Zhiying Wang
ISCA '11 Proceeding of the 38th annual international symposium on Computer architecture, 2011

Abstract: With the emergence of many-core architectures, it is quite likely that multiple applications will run concurrently on a system. Existing locally and globally adaptive routing algorithms largely overlook issues associated with workload consolidation. The shortsightedness of locally adaptive routing algorithms limits performance due to poor network congestion avoidance. Globally adaptive routing algorithms attack this issue by introducing a congestion propagation network to obtain network status information beyond neighboring nodes. However, they may suffer from intra- and inter-application interference during output port selection for consolidated workloads, coupling the behavior of otherwise independent applications and negatively affecting performance.
To address these two issues, we propose Destination-Based Adaptive Routing (DBAR). We design a novel low-cost congestion propagation network that leverages both local and non-local network information for more accurate congestion estimates. Thus, DBAR offers effective adaptivity for congestion beyond neighboring nodes. More importantly, by integrating the destination into the selection function, DBAR mitigates intra- and inter-application interference and offers dynamic isolation among regions. Experimental results show that DBAR can offer better performance than the best baseline algorithm for all measured configurations; it is well suited for workload consolidation. The wiring overhead of DBAR is low and DBAR provides improvement in the energy-delay product for medium and high injection rates.
BibTeX:
@inproceedings{Ma2011,
  author = {Sheng Ma and Natalie {Enright Jerger} and Zhiying Wang},
  title = {DBAR: An Efficient Routing Algorithm to Support Multiple Concurrent Applications in Networks-on-Chip},
  booktitle = {International Symposium on Computer Architecture},
  year = {2011},
  pages = {413-424}
}
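The destination-based selection function at the heart of DBAR can be sketched as follows. This is an illustrative simplification (names and data layout are hypothetical): each candidate output port is scored only by the congestion of nodes actually on the path to the destination, so congestion in other applications' regions cannot influence the choice.

```python
def dbar_select(cong, hops_to_dst):
    """Toy destination-based port selection: `cong[port]` lists
    congestion values for successive nodes along that dimension, and
    `hops_to_dst[port]` says how many of those hops lead toward the
    destination. Only on-path congestion is summed, isolating the
    decision from traffic beyond the destination."""
    best, best_cost = None, float("inf")
    for port, hops in hops_to_dst.items():
        cost = sum(cong[port][:hops])  # ignore nodes past the destination
        if cost < best_cost:
            best, best_cost = port, cost
    return best

# Port E reaches the destination in one hop; the heavy congestion
# beyond it (another application's region) is never consulted.
choice = dbar_select({"E": [1, 9, 9], "N": [2, 2, 0]}, {"E": 1, "N": 2})
```

Here `choice` is `"E"` (on-path cost 1 versus 4), even though port E's dimension is heavily congested further out.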
DART: a programmable architecture for NoC simulation on FPGAs
Danyao Wang, Natalie Enright Jerger, J. Gregory Steffan
NOCS '11 Proceedings of the Fifth ACM/IEEE International Symposium on Networks-on-Chip, 2011

Abstract: The increased demand for on-chip communication bandwidth as a result of the multi-core trend has made networks-on-chip (NoCs) a compelling choice for the communication backbone in next-generation systems [3]. However, NoC designs have many power, area, and performance trade-offs in topology, buffer sizes, routing algorithms and flow control mechanisms; hence the study of new NoC designs can be very time-intensive. To address this challenge we propose DART, a fast and flexible FPGA-based NoC simulation architecture. Rather than laying the NoC out in hardware on the FPGA like previous approaches [8, 6], our design virtualizes the NoC by mapping its components to a generic NoC simulation engine, composed of a fully-connected collection of fundamental components (e.g., routers and flit queues). This approach has two main advantages: (i) since the FPGA implementation is decoupled, it can simulate any NoC; and (ii) any NoC can be mapped to the engine without resynthesizing it, which can take time for a large FPGA design. We demonstrate that an implementation of DART can achieve over 100x speedup relative to a cycle-based software simulator, while maintaining the same level of simulation accuracy.
BibTeX:
@inproceedings{Wang2011,
  author = {Danyao Wang and Natalie {Enright Jerger} and J. Gregory Steffan},
  title = {DART: A Programmable Architecture for NoC Simulation on FPGAs},
  booktitle = {International Network on Chip Symposium (NOCS)},
  year = {2011},
  pages = {145-152}
}
SigNet: Network-on-Chip Filtering for Coarse Vector Directories
Natalie Enright Jerger
In Proceedings of the Conference on Design, Automation and Test in Europe. March 2010. [PDF][SLIDES]
Abstract: Scalable cache coherence is imperative as systems move into the many-core era with core counts numbering in the hundreds. Directory protocols are often favored as more scalable in terms of bandwidth requirements than broadcast protocols; however, directories incur storage overheads that can become prohibitive with large systems. In this paper, we explore the impact that reducing directory overheads has on the network-on-chip and propose SigNet to mitigate these issues. SigNet utilizes signatures within the network fabric to filter out extraneous requests prior to reaching their destination. Overall, we demonstrate average reductions in interconnect activity of 21% and latency improvements of 20% over a coarse vector directory while utilizing as little as 25% of the area of a full-map directory.
BibTeX:
@inproceedings{EnrightJerger2010,
  author = {Natalie {Enright Jerger}},
  title = {SigNet: Network-on-Chip Filtering for Coarse Vector Directories},
  booktitle = {International Conference on Design Automation and Test in Europe (DATE)},
  year = {2010},
  pages = {1378-1383}
}
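The in-network filtering idea can be sketched with a minimal Bloom-filter-style signature. This is a hypothetical simplification of SigNet (the hash choice and sizes are illustrative): a compact bit-field summarizes which cores may hold a region, and requests to cores the signature rules out can be dropped in the network rather than travelling to a dead end.

```python
class Signature:
    """Toy SigNet-style signature: a small bit-field that answers
    'may this core be a sharer?' with possible false positives but
    no false negatives (standard Bloom-filter property)."""
    def __init__(self, bits=64):
        self.bits, self.field = bits, 0

    def add(self, core):
        """Record that `core` may share the region."""
        self.field |= 1 << (hash(core) % self.bits)

    def may_contain(self, core):
        """False means the request to `core` is provably extraneous
        and can be filtered before reaching its destination."""
        return bool((self.field >> (hash(core) % self.bits)) & 1)
```

A request filtered by `may_contain` returning `False` never consumes link bandwidth beyond the filtering router, which is the source of the interconnect activity reductions cited above.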
SCARAB: a single cycle adaptive routing and bufferless network
Mitchell Hayenga, Natalie Enright Jerger, Mikko Lipasti
MICRO 42 Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009

Abstract: As technology scaling drives the number of processor cores upward, current on-chip routers consume substantial portions of chip area and power budgets. Since existing research has greatly reduced router latency overheads and capitalized on available on-chip bandwidth, power constraints dominate interconnection network design. Recent research has proposed bufferless routers as a means to alleviate these constraints, but to date all designs exhibit poor operational frequency, throughput, or latency. In this paper, we propose an efficient bufferless router which lowers average packet latency by 17.6% and dynamic energy by 18.3% over existing bufferless on-chip network designs. In order to maintain the energy and area benefit of bufferless routers while delivering ultra-low latencies, our router utilizes an opportunistic processor-side buffering technique and an energy-efficient circuit-switched network for delivering negative acknowledgments for dropped packets.
BibTeX:
@inproceedings{Hayenga2009,
  author = {Mitch Hayenga and Natalie {Enright Jerger} and Mikko Lipasti},
  title = {SCARAB: A Single Cycle Adaptive Routing and Bufferless Network},
  booktitle = {International Symposium on Microarchitecture},
  year = {2009},
  pages = {244-254}
}

Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs
Dennis Abts, Natalie Enright Jerger, John Kim, Mikko Lipasti, Dan Gibson
In Proceedings of the International Symposium on Computer Architecture (ISCA), 2009.
Abstract: In the near term, Moore's law will continue to provide an increasing number of transistors and therefore an increasing number of on-chip cores. Limited pin bandwidth prevents the integration of a large number of memory controllers on-chip. With many cores, and few memory controllers, where to locate the memory controllers in the on-chip interconnection fabric becomes an important and as yet unexplored question. In this paper we show how the location of the memory controllers can reduce contention (hot spots) in the on-chip fabric and lower the variance in reference latency. This in turn provides predictable performance for memory-intensive applications regardless of the processing core on which a thread is scheduled. We explore the design space of on-chip fabrics to find optimal memory controller placement relative to different topologies (i.e. mesh and torus), routing algorithms, and workloads.
BibTeX:
@inproceedings{Abts2009,
  author = {Dennis Abts and Natalie {Enright Jerger} and John Kim and Mikko Lipasti and Dan Gibson},
  title = {Achieving Predictable Performance through Better Memory Controller Placement in Many-Core CMPs},
  booktitle = {International Symposium on Computer Architecture},
  year = {2009},
  pages = {451-461}
}
Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives
Radu Marculescu, Umit Y. Ogras, Li-Shiuan Peh, Natalie Enright Jerger, Yatin Hoskote
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 28(1), January, 2009. [PDF]
Abstract: To alleviate the complex communication problems that arise as the number of on-chip components increases, network-on-chip (NoC) architectures have been recently proposed to replace global interconnects. In this paper, we first provide a general description of NoC architectures and applications. Then, we enumerate several related research problems organized under five main categories: Application characterization, communication paradigm, communication infrastructure, analysis, and solution evaluation. Motivation, problem description, proposed approaches, and open issues are discussed for each problem from system, microarchitecture, and circuit perspectives. Finally, we address the interactions among these research problems and put the NoC design process into perspective.
BibTeX:
@article{Marculescu2009,
  author = {Radu Marculescu and Umit Ogras and Li-Shiuan Peh and Natalie {Enright Jerger} and Yatin Hoskote},
  title = {Outstanding Research Problems in NoC Design: System, Microarchitecture, and Circuit Perspectives},
  journal = {IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems},
  year = {2009},
  volume = {28},
  pages = {3-21}
}
Virtual Tree Coherence: Leveraging Regions and In-Network Multicast Trees for Scalable Cache Coherence
Natalie Enright Jerger, Li-Shiuan Peh, Mikko Lipasti
In Proceedings of the International Symposium on Microarchitecture (MICRO-41), Lake Como, Italy, November 2008. (accepted 40/210). [PDF]
Abstract: Scalable cache coherence solutions are imperative to drive the many-core revolution forward. To fully realize the massive computation power of these many-core architectures, the communication substrate must be carefully examined and streamlined. There is tension between the need for an ordered interconnect to simplify coherence and the need for an unordered interconnect to provide scalable communication. In this work, we propose a coherence protocol, Virtual Tree Coherence (VTC), that relies on a virtually ordered interconnect. Our virtual ordering can be overlaid on any unordered interconnect to provide scalable, high-bandwidth communication. Specifically, VTC keeps track of sharers of a coarse-grained region, and multicasts requests to them through a virtual tree, employing properties of the virtual tree to enforce ordering amongst coherence requests. We compare VTC against a commonly used directory-based protocol and a greedy-order protocol extended onto an unordered interconnect. VTC outperforms both of these by averages of 25% and 11% in execution time respectively across a suite of scientific and commercial applications on 16 cores. For a 64-core system running server consolidation workloads, VTC outperforms directory and greedy protocols with average runtime improvements of 31% and 12%.
BibTeX:
@inproceedings{EnrightJerger2008,
  author = {Natalie {Enright Jerger} and Li-Shiuan Peh and Mikko Lipasti},
  title = {Virtual Tree Coherence: Leveraging Regions and In-Network Multicast Trees for Scalable Cache Coherence},
  booktitle = {International Symposium on Microarchitecture},
  year = {2008},
  pages = {35-46}
}
Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support
Natalie Enright Jerger, Li-Shiuan Peh, Mikko Lipasti
ISCA '08 Proceedings of the 35th Annual International Symposium on Computer Architecture, 2008

Abstract: Current state-of-the-art on-chip networks provide efficiency, high throughput, and low latency for one-to-one (unicast) traffic. The presence of one-to-many (multicast) or one-to-all (broadcast) traffic can significantly degrade the performance of these designs, since they rely on multiple unicasts to provide one-to-many communication. This results in a burst of packets from a single source and is a very inefficient way of performing multicast and broadcast communication. This inefficiency is compounded by the proliferation of architectures and coherence protocols that require multicast and broadcast communication. In this paper, we characterize a wide array of on-chip communication scenarios that benefit from hardware multicast support. We propose Virtual Circuit Tree Multicasting (VCTM) and present a detailed multicast router design that improves network performance by up to 90% while reducing network activity (hence power) by up to 53%. Our VCTM router is flexible enough to improve interconnect performance for a broad spectrum of multicasting scenarios, and achieves these benefits with straightforward and inexpensive extensions to a state-of-the-art packet-switched router.
BibTeX:
@inproceedings{EnrightJerger2008a,
  author = {Natalie {Enright Jerger} and Li-Shiuan Peh and Mikko Lipasti},
  title = {Virtual Circuit Tree Multicasting: A Case for On-Chip Hardware Multicast Support},
  booktitle = {International Symposium on Computer Architecture},
  year = {2008},
  pages = {229-240}
}
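The tree-reuse mechanism behind VCTM can be illustrated with a small sketch. This is a hypothetical simplification, not the router microarchitecture: the first multicast to a destination set pays the setup cost (falling back to one packet per destination), and later multicasts to the same set send a single packet down the cached virtual tree.

```python
class VctmTable:
    """Toy Virtual Circuit Tree Multicasting lookup: cache a tree id
    per destination set so repeated multicasts inject one packet
    instead of one unicast per destination (illustrative names)."""
    def __init__(self):
        self.trees = {}    # destination set -> virtual tree id
        self.next_id = 0

    def send_multicast(self, dests):
        key = frozenset(dests)
        if key not in self.trees:           # tree setup on first use
            self.trees[key] = self.next_id
            self.next_id += 1
            return ("setup", len(dests))    # one packet per destination
        return ("tree", 1)                  # one packet follows the tree
```

Because coherence traffic repeatedly multicasts to the same sharer sets, the amortized cost quickly approaches one packet per multicast.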
Circuit-Switched Coherence
Natalie Enright Jerger, Li-Shiuan Peh, Mikko H. Lipasti
In 2nd Annual IEEE Network on Chip Symposium, Newcastle-Upon-Tyne, UK, April 2008. [PDF][PPT]
Abstract: Our characterization of a suite of commercial and scientific workloads on a 16-core cache-coherent chip multiprocessor (CMP) shows that overall system performance is sensitive to on-chip communication latency, and can degrade by 20% or more due to long interconnect latencies. On the other hand, communication bandwidth demand is low. These results prompt us to explore circuit-switched networks. Circuit-switched networks can significantly lower the communication latency between processor cores, when compared to packet-switched networks, since once circuits are set up, communication latency approaches pure interconnect delay. However, if circuits are not frequently reused, the long setup time can hurt overall performance, as is demonstrated by the poor performance of traditional circuit-switched networks -- all applications saw a slowdown rather than a speedup with a traditional circuit-switched network.
To combat this problem, we propose hybrid circuit switching (HCS), a network design which removes the circuit setup time overhead by intermingling packet-switched flits with circuit-switched flits. Additionally, we co-design a prediction-based coherence protocol that leverages the existence of circuits to optimize pair-wise sharing between cores. The protocol allows pair-wise sharers to communicate directly with each other via circuits and drives up circuit reuse. Circuit-switched coherence provides up to 23% savings in network latency which leads to an overall system performance improvement of up to 15%. In short, we show HCS delivering the latency benefits of circuit switching, while sustaining the throughput benefits of packet switching, in a design realizable with low area and power overhead.
BibTeX:
@inproceedings{EnrightJerger2008b,
  author = {Natalie {Enright Jerger} and Li-Shiuan Peh and Mikko Lipasti},
  title = {Circuit-Switched Coherence},
  booktitle = {International Network on Chip Symposium (NOCS)},
  year = {2008},
  pages = {193-202}
}
An Evaluation of Server Consolidation Workloads for Multi-core Designs
Natalie Enright Jerger, Dana Vantrease, Mikko H. Lipasti
In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC), Boston MA, September 2007. [PDF][PPT]
Abstract: While chip multiprocessors with ten or more cores will be feasible within a few years, the search for applications that fully exploit their attributes continues. In the meantime, one sure-fire application for such machines will be to serve as consolidation platforms for sets of workloads that previously occupied multiple discrete systems. Such server consolidation scenarios will simplify system administration and lead to savings in power, cost, and physical infrastructure. This paper studies the behavior of server consolidation workloads, focusing particularly on sharing of caches across a variety of configurations. Noteworthy interactions emerge within a workload, and notably across workloads, when multiple server workloads are scheduled on the same chip. These workloads present an interesting design point and will help designers better evaluate trade-offs as we push forward into the many-core era.
BibTeX:
@inproceedings{EnrightJerger2007,
  author = {Natalie {Enright Jerger} and Dana Vantrease and Mikko Lipasti},
  title = {An Evaluation of Server Consolidation Workloads for Multi-Core Designs},
  booktitle = {International Symposium on Workload Characterization},
  year = {2007},
  pages = {47-56}
}
Circuit-Switched Coherence
Natalie Enright Jerger, Mikko H. Lipasti, Li-Shiuan Peh
In IEEE Computer Architecture Letters, vol. 6, no. 1, Jan-Jun, 2007. [PDF]
Abstract: Circuit-switched networks can significantly lower the communication latency between processor cores, when compared to packet-switched networks, since once circuits are set up, communication latency approaches pure interconnect delay. However, if circuits are not frequently reused, the long set up time and poorer interconnect utilization can hurt overall performance. To combat this problem, we propose a hybrid router design which intermingles packet-switched flits with circuit-switched flits. Additionally, we co-design a prediction-based coherence protocol that leverages the existence of circuits to optimize pair-wise sharing between cores. The protocol allows pair-wise sharers to communicate directly with each other via circuits and drives up circuit reuse. Circuit-switched coherence provides overall system performance improvements of up to 17% with an average improvement of 10% and reduces network latency by up to 30%.
BibTeX:
@article{EnrightJerger2007a,
  author = {Natalie {Enright Jerger} and Mikko Lipasti and Li-Shiuan Peh},
  title = {Circuit-Switched Coherence},
  journal = {IEEE Computer Architecture Letters},
  year = {2007},
  volume = {6}
}
Women and Girls in Science and Engineering: Understanding the Barriers to Recruitment, Retention and Persistence Across the Educational Trajectory.
Elizabeth M. O'Callaghan and Natalie D. Enright Jerger
In Journal of Women and Minorities in Science and Engineering. Volume 12, 2006.
BibTeX:
@article{OCallaghan2006,
  author = {Elizabeth M. O'Callaghan and Natalie {Enright Jerger}},
  title = {Women and Girls in Science and Engineering: Understanding the Barriers to Recruitment, Retention and Persistence across the Educational Trajectory},
  journal = {Journal of Women and Minorities in Science and Engineering},
  year = {2006},
  volume = {12},
  pages = {209-232}
}
Friendly Fire: Understanding the Effects of Multiprocessor Prefetches
Natalie Enright Jerger, Eric L. Hill, Mikko H. Lipasti
In Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS-2006), Austin, TX, March 2006. [PDF]
Abstract: Modern processors attempt to overcome increasing memory latencies by anticipating future references and prefetching those blocks from memory. The behavior and possible negative side effects of prefetching schemes are fairly well understood for uniprocessor systems. However, in a multiprocessor system a prefetch can steal read and/or write permissions for shared blocks from other processors, leading to permission thrashing and overall performance degradation. In this paper, we present a taxonomy that classifies the effects of multiprocessor prefetches. We also present a characterization of the effects of four different hardware prefetching schemes--sequential prefetching, content-directed data prefetching, wrong path prefetching and exclusive prefetching--in a bus-based multiprocessor system. We show that accuracy and coverage are inadequate metrics for describing prefetching in a multiprocessor; rather, we also need to understand what fraction of prefetches interfere with remote processors. We present an upper bound on the performance of various prefetching algorithms if no harmful prefetches are issued, and suggest prefetch filtering schemes that can accomplish this goal.
BibTeX:
@inproceedings{EnrightJerger2006,
  author = {Natalie {Enright Jerger} and Eric Hill and Mikko Lipasti},
  title = {Friendly Fire: Understanding the Effects of Multiprocessor Prefetches},
  booktitle = {International Symposium on Performance Analysis of Systems and Software (ISPASS)},
  year = {2006},
  pages = {177-188}
}

Workshop Papers and Posters


Hierarchical Clustering for On-Chip Networks
Robert Hesse and Natalie Enright Jerger
Workshop on Advanced Interconnect Solutions and Technologies for Emerging Computing Systems, 2016. [PDF]
Abstract: Hierarchy and communication locality are a must for many-core systems. As systems scale to dozens or hundreds of cores, we simply cannot afford the power consumption and latency of random communication that spans the entire chip. Existing hierarchical Networks-on-Chip (NoCs) support communication locality only for a fixed cluster of nodes; providing a fixed hierarchy is too restrictive in terms of parallelism and data placement. Therefore, we propose a new, more flexible class of hierarchical NoCs: Elastic Hierarchical NoCs. Elastic Hierarchical NoCs dynamically adjust the number and size of clusters during runtime according to the system's communication demands. The interconnect can adapt to changes in communication locality across different application phases, between applications and in the presence of server consolidation. Our design improves overall system performance by up to 46% and 13% on average over a conventional 2D mesh and by up to 16% and 6% on average over an existing hierarchical NoC implementation. Power consumption is reduced by 45% and 7% respectively on average.
BibTeX:
@inproceedings{Hesse2016,
  author={Robert Hesse and Natalie {Enright Jerger}},
  title={Hierarchical Clustering for On-Chip Networks},
  booktitle={Workshop on Advanced Interconnect Solutions and Technologies for Emerging Computing Systems},
  year={2016}
}
Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks
Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor Aamodt, Natalie Enright Jerger and Andreas Moshovos
Workshop on Approximate Computing, 2016. [PDF]
Abstract: This work exploits the tolerance of Deep Neural Networks (DNNs) to reduced precision numerical representations and specifically, their ability to use different representations per layer while maintaining accuracy. This flexibility provides an additional opportunity to improve performance and energy compared to conventional DNN implementations that use a single, uniform representation for all layers throughout the network. This work exploits this property by proposing PROTEUS, a layered extension over existing DNN implementations that converts between the numerical representation used by the DNN execution engines and a shorter, layer-specific fixed-point representation when reading and writing data values to memory, be it on-chip buffers or off-chip memory. When used with a modified layout of data in memory, PROTEUS can use a simple, low-cost and low-energy conversion unit.
On five popular DNNs, PROTEUS can reduce data traffic among layers by 41% on average and up to 44% compared to a baseline that uses a 16-bit fixed-point representation, while maintaining accuracy within 1% even when compared to a single-precision floating-point implementation. When incorporated into a state-of-the-art accelerator, PROTEUS improves energy by 14% while maintaining the same performance. When incorporated on a graphics processor, PROTEUS improves performance by 1%, energy by 4% and reduces off-chip DRAM accesses by 46%.
BibTeX:
@inproceedings{Judd2016,
  author={Patrick Judd and Jorge Albericio and Tayler Hetherington and Tor Aamodt and Natalie {Enright Jerger} and Andreas Moshovos},
  title={Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks},
  booktitle={Workshop on Approximate Computing},
  year={2016}
}
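The per-layer fixed-point conversion that Proteus applies at the memory boundary can be sketched numerically. This is a minimal illustration under assumed parameters (bit widths and values are made up), not the paper's conversion unit: each layer picks its own number of fractional bits, trading representation error for shorter stored values.

```python
def to_fixed(x, frac_bits, total_bits=8):
    """Quantize `x` to a layer-specific signed fixed-point code,
    saturating at the representable range (sketch)."""
    scale = 1 << frac_bits
    lo, hi = -(1 << (total_bits - 1)), (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, round(x * scale)))

def from_fixed(q, frac_bits):
    """Convert a fixed-point code back to a float on the way to the
    execution engine."""
    return q / (1 << frac_bits)

# A layer that tolerates 4 fractional bits stores a coarser value
# than one configured for 6 fractional bits:
coarse = from_fixed(to_fixed(0.7188, 4), 4)
fine   = from_fixed(to_fixed(0.7188, 6), 6)
```

The finer representation recovers the value more closely; choosing the smallest `frac_bits` each layer tolerates is what reduces the traffic between layers.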
Texture Cache Approximation on GPUs
Mark Sutherland, Joshua San Miguel and Natalie Enright Jerger
Workshop on Approximate Computing Across the Stack, 2015. [PDF]
Abstract: We present texture cache approximation as a method for using existing hardware on GPUs to eliminate costly global memory accesses. We develop a technique for using a GPU's texture fetch units to generate approximate values, and argue that this technique is applicable to a wide variety of GPU kernels. Applying texture cache approximation to an image blur kernel on an NVIDIA 780GTX, we obtain a 12% reduction in kernel execution time while only introducing 0.4% output error in the final image.
BibTeX:
@inproceedings{Sutherland2015b,
  author={Mark Sutherland and Joshua {San Miguel} and Natalie {Enright Jerger}},
  title={Texture Cache Approximation on GPUs},
  booktitle={Workshop on Approximate Computing Across the Stack},
  year={2015}
}
Offloading to the GPU: An Objective Approach
Ajaykumar Kannan, Mario Badr, Parisa Khadem Hamedani and Natalie Enright Jerger
PRISM: 3rd Annual Workshop on Parallelism in Mobile Platforms, 2015. [PDF]
Abstract:
BibTeX:
@inproceedings{Kannan2015,
  author={Ajaykumar Kannan and Mario Badr and Parisa {Khadem Hamedani} and Natalie {Enright Jerger}},
  title={Offloading to the GPU: An Objective Approach},
  booktitle={Workshop on Parallelism in Mobile Platforms},
  year={2015}
}
Not Quite My Tempo: Matching Prefetches to Memory Access Times
Mark Sutherland, Ajaykumar Kannan and Natalie Enright Jerger
Data Prefetching Championship Workshop, 2015. [PDF]
Abstract: Modern prefetchers can generally be divided into two categories, spatial and temporal, based on the type of correlations they attempt to exploit. Although these two types have different advantages, and perform well on different application sets, a design that utilizes both types of information will be able to achieve greater prefetch accuracy. We address the lack of temporal information in the state-of-the-art Spatial Memory Streaming (SMS) prefetcher by proposing Tempo, a novel banked implementation of SMS that further classifies cache accesses within the same physical page based on the repetitive miss latency, or tempo, present in the local access stream. Evaluated on the SPEC CPU2006 benchmark suite, Tempo reduces useless prefetches by 17.6% and achieves IPC increases of 1.45% and 2.57% on high- and low-bandwidth memory configurations, respectively, over a purely spatial SMS design.
BibTeX:
@inproceedings{Sutherland2015a,
  author={Mark Sutherland and Ajaykumar Kannan and Natalie {Enright Jerger}},
  title={Not Quite My Tempo: Matching Prefetches to Memory Access Times},
  booktitle={Data Prefetching Championship Workshop},
  year={2015}
}
Wormhole Branch Prediction using Multi-dimensional Histories
Jorge Albericio, Joshua San Miguel, Natalie Enright Jerger and Andreas Moshovos
Championship Branch Prediction Workshop, 2014.

Accelerating Network-on-Chip Simulation via Sampling
Wenbo Dai and Natalie Enright Jerger
International Symposium on Performance Analysis of Systems and Software (poster), 2014. [PDF]
Abstract: Architectural complexity continues to grow as we consider the large design space of multiple cores, cache architectures, networks-on-chip and memory controllers for emerging architectures. Simulators are growing in complexity to reflect each of these system components. However, many full-system simulators fail to take advantage of the underlying hardware resources such as multiple cores; as a result, simulation times have grown significantly in recent years. Long turnaround times limit the range and depth of design space exploration that is tractable. Communication has emerged as a first class design consideration and has led to significant research into networks-on-chip (NoC). The NoC is yet another component of the architecture that must be faithfully modeled in simulation. Given its importance, we focus on accelerating simulation of the NoC through the use of sampling techniques; sampling can provide both accurate results and fast evaluation. We propose NoCLabs and NoCPoint, two sampling methodologies utilizing statistical sampling theory and traffic phase behavior, respectively. Experimental results show that our proposed NoCLabs and NoCPoint estimate NoC performance with an average error of 5% while achieving one order of magnitude speedup on average.
BibTeX:
@inproceedings{Dai2014,
  author={Wenbo Dai and Natalie {Enright Jerger}},
  title={Accelerating Network-on-Chip Simulation via Sampling},
  booktitle={International Symposium on Performance Analysis of Systems and Software},
  year={2014}
}
Load Value Approximation: Approaching the Ideal Memory Access Latency
Joshua San Miguel and Natalie Enright Jerger
Workshop on Approximate Computing Across the System Stack (WACAS), 2014. [PDF]
Abstract: Approximate computing recognizes that many applications can tolerate inexactness. These applications, which range from multimedia processing to machine learning, operate on inherently noisy and imprecise data. As a result, we can trade-off some loss in output value integrity for improved processor performance and energy-efficiency. In this paper, we introduce load value approximation. In modern processors, upon a load miss in the private cache, the data must be retrieved from main memory or from the higher-level caches. These data accesses are costly both in terms of latency and energy. We implement load value approximators, which are hardware structures that learn value patterns and generate approximations of the data. The processor can then use these approximate data values to continue executing without incurring the high cost of accessing memory. We show that load value approximators can achieve high coverage while maintaining very low error in the application's output. By exploiting the approximate nature of applications, we can draw closer to the ideal memory access latency.
BibTeX:
@inproceedings{SanMiguel2014,
   author={Joshua {San Miguel} and Natalie {Enright Jerger}},
   title={Load Value Approximation: Approaching the Ideal Memory Access Latency},
   booktitle={Workshop on Approximate Computing Across the System Stack (WACAS)},
   year={2014}
}
				      
Power Modeling for Heterogeneous Processors
Tahir Diop, Natalie Enright Jerger, Jason Anderson
GPGPU-7 Proceedings of Workshop on General Purpose Processing Using GPUs, 2014

The microbenchmark framework used in this study is available for download here


Abstract: As power becomes an ever more important design consideration, there is a need for accurate power models at all stages of the design process. While power models are available for CPUs and GPUs, only simple models are available for heterogeneous processors. We present a micro-benchmark-based modeling technique that can be used for chip multiprocessors (CMPs) and accelerated processing units (APUs). We use our approach to model power on an Intel Xeon CPU and an AMD Fusion heterogeneous processor. The resulting error rate for the Xeon's model is below 3% and is only 7% for the Fusion. We also present a method to reduce the number of benchmarks required to create these models. Instead of running micro-benchmarks for every combination of factors (e.g., different operations or memory access patterns), we cluster similar micro-benchmarks to avoid unnecessary simulations. We show that it is possible to eliminate as many as 93% of the compute micro-benchmarks while still producing power models with less than 10% error.
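The benchmark-reduction step can be sketched as follows (hypothetical Python with made-up feature vectors; the paper's features and clustering method may differ): greedily group micro-benchmarks whose feature vectors lie close together, so only one representative per group needs to be run.

```python
def cluster_benchmarks(features, threshold):
    """Greedily assign each benchmark to an existing cluster whose
    representative is within `threshold` (Euclidean distance),
    otherwise start a new cluster with it as representative."""
    reps = {}        # representative name -> feature vector
    assignment = {}  # benchmark name -> its representative
    for name, vec in features.items():
        for rep, rvec in reps.items():
            dist = sum((a - b) ** 2 for a, b in zip(vec, rvec)) ** 0.5
            if dist <= threshold:
                assignment[name] = rep
                break
        else:
            reps[name] = vec
            assignment[name] = name
    return assignment

# Hypothetical per-benchmark features, e.g. (IPC, memory intensity):
feats = {
    "add_int": (1.0, 0.1), "sub_int": (1.05, 0.12),
    "mem_stream": (0.3, 0.9), "mem_rand": (0.28, 0.95),
}
groups = cluster_benchmarks(feats, threshold=0.2)
# Only two representatives remain, halving the benchmarks to run.
```

In this toy example the two ALU-bound and two memory-bound benchmarks collapse into one representative each, mirroring how the paper prunes redundant compute micro-benchmarks.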
BibTeX:
@inproceedings{Diop2014,
  author={Tahir Diop and Natalie {Enright Jerger} and Jason Anderson},
  title={Power Modeling for Heterogeneous Processors},
  booktitle={7th Workshop on General Purpose Processing with GPUs},
  year={2014}
}
				      
DART: Fast and Flexible NoC Simulation using FPGAs
Danyao Wang, Natalie Enright Jerger and J. Gregory Steffan
Workshop on Architectural Research Prototyping 2010 (held in conjunction with ISCA 2010). [PDF]

PhD Thesis


Chip Multiprocessor Coherence and Interconnect System Design (December 2008)

[PDF][PPT]


US Patent


Natalie D. Enright, Jamison Collins, Perry Wang, Hong Wang, Xinmin Tian, John Shen, Gad Sheaffer, Per Hammarlund.

Mechanism to exploit synchronization overhead to improve multithread performance.

Patent no: 7487502. Sept 2009.

Research Group

Current Students

Mario Badr

PhD Student

Shehab Elsayed

PhD Student

Josh San Miguel

PhD Student

Karthik Ganesan

MASc Student

Former Students

PhD Students and Post Docs


Wenbo Dai (PhD, 2017)

    First Employment: Intel

Parisa Khadem Hamedani (PhD 2017)

    First Employment: TBD

Jorge Albericio (Post Doc 2016)

    First Employment: NVIDIA

Robert Hesse (PhD, 2015)

    First Employment: Intel

Sheng Ma (Visiting PhD, 2013)

    First Employment: Assistant Professor, NUDT


MASc and MEng Students


Mark Sutherland (MASc, 2016)

    First Employment: PhD student, EPFL

Zimo Li (MASc, 2015)

    First Employment: Amazon

Ajaykannan Kumar (MASc 2015)

    First Employment: Intel

Steven Gurfinkel (MASc, 2014)

    Co-supervised with Jason Anderson

    First Employment: NVIDIA

Haofan Yang (MASc, 2014)

    First Employment: NVIDIA

Tahir Diop (MASc, 2013)

    Co-supervised with Jason Anderson

    First Employment: TXIO

Remi Dufour (MEng, 2013)

    First Employment: Apple

John Matienzo (MEng, 2012)

    First Employment: Apple

Tony Feng (MASc, 2012)

    First Employment: Chinese Electronics Corporation

Danyao Wang (MASc, 2010)

    Co-supervised with Greg Steffan

    First Employment: Google

Matthew Misler (MASc, 2010)

    First Employment: TD Bank Financial Group


Undergraduate Students


Daniel Lee, ECE

Alex Liu, ECE

Jeff Nicholls, EngSci

    First Employment: MASc Student, UofT

Carl Noel, ECE

    First Employment: Microsoft

Jyoti Tripathi, ECE

Camille Wingson, ECE

Tian Fang Yu, ECE

Victor Feng, EngSci

    First Employment: AMD

Ruolong Lian, ECE

    First Employment: MASc Student, UofT

Thierry Moreau, ECE

    First Employment: PhD Student, University of Washington


Downloads

DART: An FPGA-Based Network-on-Chip Simulation Acceleration Engine


DART is an FPGA implementation of a network-on-chip simulator, where the topology and parameters of the simulated network can be modified without rebuilding the FPGA image. DART is described in our NOCS 2011 paper and is available for download.


DistCL: A Framework for Distributed Execution of OpenCL Kernels


DistCL is a novel framework that distributes the execution of OpenCL kernels across a GPU cluster. DistCL makes multiple distributed compute devices appear to be a single compute device. DistCL is described in our MASCOTS 2013 paper and is available for download: DistCL (version 1.0, released August 12, 2013).


Power Modeling for Heterogeneous Processors


This microbenchmark framework was used in the design of the heterogeneous power model described in our GPGPU 2014 paper and is available for download: Microbenchmarks (version 1.0, released April 11, 2014).


SynFull: Synthetic Traffic Models Capturing Cache Coherent Behaviour


SynFull is a traffic generation methodology that captures both application and cache coherence behaviour to rapidly evaluate NoCs. SynFull allows designers to quickly run detailed NoC performance simulations without the cost of long-running full-system simulation. The model generated and explored in our ISCA 2014 paper is available for download. (version 1.0, released May 14, 2014)
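The phase-based generation idea behind synthetic traffic models can be sketched with a toy two-phase Markov chain (hypothetical Python; SynFull's actual models capture coherence traffic and far richer state): each phase has its own injection probability, and the chain stochastically switches phases to reproduce bursty application behaviour.

```python
import random

def markov_traffic(steps, trans, rates, seed=0):
    """Generate per-cycle injection decisions from a two-phase Markov
    model: trans[s] is the probability of leaving phase s each cycle,
    and rates[s] is the packet injection probability while in phase s."""
    rng = random.Random(seed)
    state, injected = 0, []
    for _ in range(steps):
        injected.append(rng.random() < rates[state])  # inject this cycle?
        if rng.random() < trans[state]:               # maybe switch phase
            state = 1 - state
    return injected

# Long high-injection phases (leave w.p. 0.01) alternating with
# shorter low-injection phases (leave w.p. 0.05).
inj = markov_traffic(10_000, trans=(0.01, 0.05), rates=(0.8, 0.1))
```

Even this two-state toy produces correlated bursts rather than uniform random traffic, which is the property that makes phase-aware synthetic models more faithful to applications than simple uniform injection.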

Teaching

Spring 2017: ECE 342 Computer Hardware -- Course material available through Blackboard

ECE 1755: Parallel Computer Architecture and Programming -- Course material available through Blackboard


Spring 2015: ECE243 Computer Organization -- Course material available through CoursePeer


Fall 2014: ECE 552 Computer Architecture -- Course material available through CoursePeer

ECE 1749H: Interconnection Networks -- Course material available through CoursePeer


Spring 2014: ECE 243 Computer Organization -- Course material available through CoursePeer

ECE 1749H: Interconnection Networks -- Course material available through CoursePeer


Fall 2013: ECE 552 Computer Architecture -- Course material available through CoursePeer


Spring 2013: ECE 243 Computer Organization


Fall 2012: ECE 552 Computer Architecture -- Course material available through CoursePeer

ECE 1749H: Interconnection Networks -- Course material available through Blackboard


Spring 2012: ECE 243: Computer Organization


Fall 2011: ECE 552: Computer Architecture

ECE 1749H: Interconnection Networks for Parallel Computer Architectures


Spring 2011: ECE 1749H: Interconnection Networks for Parallel Computer Architectures

ECE 243: Computer Organization


Fall 2010: ECE 452/ECE 1718H: Computer Architecture


Spring 2010: ECE 1749H: Interconnection Networks for Parallel Computer Architectures


Fall 2009: ECE 452: Computer Architecture