

Deep Learning Acceleration

We are designing, developing, and demonstrating a novel class of hardware accelerators for Deep Learning networks whose key feature is that they are value-based. Conventional accelerators rely mostly on the structure of the computation, that is, which calculations are performed and how they communicate. Value-based accelerators further boost performance by taking advantage of expected properties of the value stream calculated at runtime, such as dynamically redundant or ineffectual computations, the distribution of values, or even their bit content. In short, our accelerator designs reduce the amount of work that needs to be performed for existing neural models, and do so transparently to the model designer. Why are we pursuing these designs? Because Deep Learning is transforming our world by leaps and bounds. One of the three drivers behind Deep Learning's success is the computing hardware that enabled its first practical applications. While algorithmic improvements will allow Deep Learning to evolve, much hinges on hardware's ability to keep delivering ever higher performance and data storage and processing capability. As Dennard scaling has ceased, the only viable way to do so is through architecture specialization.

The figure below highlights the potential of some of the methods we have developed:

A: remove zero activations, W: remove zero weights, Ap: use dynamic per-group precision for activations, Ae: skip ineffectual terms after Booth-encoding the activations.

These IEEE Micro and IEEE Computer (draft) articles present our rationale and summarize some of our designs. The most recently publicly disclosed design is Bit-Tactical, which targets, but does not require, a sparse network.

Here is an older talk that describes some of our work (given in late February 2017): CNN Inference Accelerators.

Our accelerators take advantage of the unique properties of DL and have the potential to outperform even the fastest accelerator to date by 50% to 400%, and commodity hardware by 2 to 3 orders of magnitude. We hope to unlock the next wave of innovation in DL by developing and refining additional value-based acceleration techniques for a broader class of DL applications. Such accelerators will 1) enable the practical training and deployment of larger and more sophisticated DL models, 2) provide additional degrees of freedom and room for exploration for DL researchers and developers to tune network performance and accuracy, 3) enable faster response times under constrained scenarios such as embedded or mobile applications, and 4) ultimately lead to a unified accelerator architecture that can be used for many different DL applications, for both training and inference, and that can be configured for data center or mobile/embedded deployments.

Cnvlutin: Ineffectual-Neuron-Free Convolutional Neural Networks: Further innovation in Deep Learning hinges upon the hardware's ability to support deeper, larger, and more computationally demanding neural networks. Cnvlutin is our first value-aware hardware accelerator for the convolutional layers of Convolutional Neural Networks (CNNs). CNNs are the state-of-the-art image classification method, with applications in many fields including medical diagnosis, security, cataloguing, and, in general, image-based search.

We made a key observation: a large portion of the operations performed in these layers, the dominant computation at the core of Convolutional Neural Networks, are ineffectual due to the nature of CNNs. Ineffectual operations do not affect the final output of the network, yet processing them consumes valuable computing resources, wasting time and power. However, which operations are ineffectual is data dependent and can only be determined at runtime. In a conventional architecture, by the time we figure out that an operation is ineffectual, it is typically too late: checking whether an operation is ineffectual takes about as long as simply performing it, so checking in order to avoid operations would result in a slowdown.

Cnvlutin exploits the organization of CNNs to eliminate ineffectual operations on-the-fly, completely transparently, improving speed and energy consumption. We demonstrated Cnvlutin as a very wide single-instruction multiple-data CNN accelerator that dynamically eliminates most ineffectual multiplications. A configuration that computes 4K terms in parallel per cycle improves performance over an equivalently configured state-of-the-art accelerator, DaDianNao, by 24% to 55%, and by 37% on average, by targeting zero-valued operands alone. While Cnvlutin incurs an area overhead of 4.49%, it improves overall Energy Delay Squared (ED^2) and Energy Delay (ED) by 2.01x and 1.47x on average. By loosening the ineffectual-operand identification criterion to also eliminate neurons below a per-layer, pre-specified threshold, performance improvements increase to 1.52x on average with no loss of accuracy. Raising these thresholds further allows for larger performance gains by trading off accuracy. Concurrently, groups from Harvard and MIT exploited multiplications with zero to save energy, and a group from Stanford did so to also improve performance. We both save energy and improve performance, and target additional computations by using a looser ineffectual-computation criterion. A minimal sketch of the zero-skipping idea follows the citation below.

  • J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. Enright Jerger, and A. Moshovos, Cnvlutin: Ineffectual-Neuron-Free Deep Learning Computing, ACM/IEEE International Symposium on Computer Architecture, June 2016.
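A minimal sketch of the zero-skipping idea, in Python rather than hardware: activations are compressed into (value, offset) pairs so that only non-zero activations ever reach a multiplier. The function and variable names are illustrative, not taken from the paper.

<code python>
# Illustrative sketch of Cnvlutin-style zero-activation skipping (not the actual design).
# Activations are stored as (value, offset) pairs so the multipliers only see non-zeros.

def compress_activations(activations):
    """Drop zero activations, keeping each survivor's original offset."""
    return [(a, i) for i, a in enumerate(activations) if a != 0]

def dot_zero_skip(activations, weights):
    """Dot product that only issues multiplications for non-zero activations."""
    total = 0
    for a, offset in compress_activations(activations):
        total += a * weights[offset]   # the matching weight is found via the stored offset
    return total

def dot_dense(activations, weights):
    """Dense reference: every term is multiplied, even the ineffectual ones."""
    return sum(a * w for a, w in zip(activations, weights))

acts = [0, 3, 0, 0, 7, 0, 0, 1]   # post-ReLU activations are often mostly zero
wts  = [2, 4, 1, 5, 3, 6, 2, 8]
assert dot_zero_skip(acts, wts) == dot_dense(acts, wts)
print("multiplications issued:", len(compress_activations(acts)), "instead of", len(acts))
</code>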

Stripes: Exploiting Precision Variability: In Stripes, we developed a CNN accelerator whose performance improves as the numerical precision used is reduced. Typical hardware uses bit-parallel units whose performance is independent of numerical precision, with 16-bit fixed-point being common for CNNs. In Stripes, performance scales with 16/p, where p is the precision in bits; every bit matters. Stripes improves performance over DaDianNao by 1.92x on average, measured over the full CNNs and under very pessimistic assumptions. Stripes enables trading off accuracy for performance and energy on-the-fly. It opens up a new direction for CNN research for both inference and training, some of which we plan to investigate. For example, it may accelerate training by enabling incremental and dynamic adjustment of precision, or it may enable incremental classification, where a small precision yields a quick, rough result and a decision on whether to increase precision for a more accurate one. In more recent work we proposed TARTAN, which boosts performance for fully-connected layers as well. A toy model of the bit-serial principle follows the citations below.

  • P. Judd, J. Albericio, and A. Moshovos, Stripes: Bit-Serial Deep Learning Computing, Computer Architecture Letters, Accepted April 2016, Appears Aug 2016.
  • P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, Stripes: Bit-Serial Deep Learning Computing, ACM/IEEE International Symposium on Microarchitecture, Oct 2016.
  • A. Delmas Lascorz, S. Sharify, P. Judd, A. Moshovos, TARTAN: Accelerating Fully-Connected and Convolutional Layers in Deep Neural Networks, OpenReview, Oct 2016.
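A toy software model of the bit-serial principle, under the simplifying assumption of unsigned fixed-point activations: each activation is fed one bit per cycle, so a p-bit activation takes p cycles, and trimming precision translates directly into fewer cycles. Names and values are illustrative.

<code python>
# Toy bit-serial multiply-accumulate in the spirit of Stripes: cycles grow with precision p.
# Assumes unsigned fixed-point activations that fit in p bits.

def bit_serial_dot(activations, weights, p):
    """Dot product computed one activation bit-plane per 'cycle'; takes p cycles."""
    total = 0
    for bit in range(p):                                   # one cycle per activation bit
        plane = sum(((a >> bit) & 1) * w for a, w in zip(activations, weights))
        total += plane << bit                              # shift the partial sum into place
    return total, p                                        # result and the cycles it took

acts = [3, 5, 2, 7]                                        # these fit in 3 bits
wts  = [1, 2, 3, 4]
ref  = sum(a * w for a, w in zip(acts, wts))

res16, cycles16 = bit_serial_dot(acts, wts, p=16)          # conventional 16-bit precision
res3,  cycles3  = bit_serial_dot(acts, wts, p=3)           # per-layer trimmed precision
assert res16 == res3 == ref
print(f"speedup from trimming precision: {cycles16 / cycles3:.1f}x")   # i.e., 16/p
</code>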

Bit-Pragmatic Deep Learning Computing: We observe that, on average, more than 90% of the work done when multiplying activations and weights in Convolutional Neural Networks is ineffectual. In Pragmatic, execution time for convolutional layers depends only on the essential bit content of the runtime values. Even after reducing precision to the absolute minimum, many ineffectual computations remain; Pragmatic eliminates any remaining computations that are certainly ineffectual. It boosts both energy efficiency and performance over all previous proposals. Compared to DaDianNao it improves performance by more than 4x, and for narrower configurations performance approaches nearly 8x over an equivalent DaDianNao configuration.
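A rough sketch of the essential-bit idea. The actual design operates on Booth-encoded terms and has many hardware-level details; here, as a simplification, each non-zero bit of an unsigned activation becomes one shift-and-add of the corresponding weight, and zero bits generate no work at all.

<code python>
# Sketch of Pragmatic-style processing: only the non-zero ("essential") activation bits
# generate work; each one becomes a single shift-and-add of the corresponding weight.
# Assumes unsigned activations; the actual design works on Booth-encoded terms.

def essential_bits(value):
    """Positions of the set bits of an unsigned value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]

def pragmatic_dot(activations, weights):
    total, terms = 0, 0
    for a, w in zip(activations, weights):
        for bit in essential_bits(a):        # zero bits are skipped outright
            total += w << bit                # one shift-and-add per essential bit
            terms += 1
    return total, terms

acts = [0b00000010, 0b00010001, 0, 0b01000000]   # sparse bit content is common in practice
wts  = [3, 5, 7, 2]
ref  = sum(a * w for a, w in zip(acts, wts))
res, terms = pragmatic_dot(acts, wts)
assert res == ref
print("essential terms processed:", terms, "vs.", 16 * len(acts), "for a 16-bit bit-serial engine")
</code>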

Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks: Having observed that the precision requirements of Deep Neural Networks vary per layer, we proposed a simple extension to existing computing engines that reduces memory bandwidth and footprint. The key concept is to use two different representations: one that is storage efficient and one that is used for computation. We also propose a new storage container format that drastically reduces the cost of converting between the two representations, as it avoids the large crossbar that would otherwise be necessary. A software analogue of the idea is sketched after the citation below.

  • Patrick Judd, Jorge Albericio, Tayler Hetherington, Tor M. Aamodt, Natalie Enright Jerger, and Andreas Moshovos. 2016. Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16).
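A software analogue of the two-representation idea, assuming unsigned values and a single per-layer precision: values are kept packed at the layer's precision in memory to save footprint and bandwidth, and expanded back to the 16-bit compute format when read. The packing scheme shown is illustrative, not the container format of the paper.

<code python>
# Software analogue of Proteus' two representations: a packed, per-layer-precision
# storage format and a 16-bit compute format, with conversion between them.
# Assumes unsigned values; the layout below is illustrative only.

def pack(values, bits):
    """Pack unsigned values into one integer bit stream, 'bits' bits per value."""
    stream = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << bits), "value does not fit in the layer's precision"
        stream |= v << (i * bits)
    return stream

def unpack(stream, bits, count):
    """Expand the packed stream back into compute-format values."""
    mask = (1 << bits) - 1
    return [(stream >> (i * bits)) & mask for i in range(count)]

layer_bits = 5                                # this layer only needs 5 bits per value
acts = [12, 3, 31, 0, 17, 8]
packed = pack(acts, layer_bits)
assert unpack(packed, layer_bits, len(acts)) == acts
print("storage:", layer_bits * len(acts), "bits packed vs.", 16 * len(acts), "bits at 16-bit")
</code>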

Dynamic Stripes: It is well known that the precision that activations need can be tailored per network layer. Several hardware approaches exploit this precision variability to boost performance and energy efficiency. Here we show that much is left on the table when precisions are assigned at the layer level: in practice the required precision varies with the input and at a much finer granularity. An accelerator only needs to consider as many activations as it can process in a single cycle. In this work we show how to adapt precision at runtime, at the granularity at which activations are processed. We also show how to boost performance and energy efficiency for fully-connected layers.
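A small sketch of detecting precision per group at runtime, assuming unsigned activations: a bit-serial engine only needs as many cycles as the widest activation in the group it is currently processing, rather than a single precision fixed per layer. The group size and values are made up for the example.

<code python>
# Sketch of dynamic, per-group precision detection: a bit-serial engine only needs as
# many cycles as the widest activation in the group it is currently processing.
# Assumes unsigned activations; the group size of 4 is arbitrary for the example.

def group_precision(group):
    """Bits actually needed by this group: the width of its largest value."""
    return max((v.bit_length() for v in group), default=1) or 1

def cycles_dynamic(activations, group_size):
    """Total bit-serial cycles when precision is chosen per group at runtime."""
    groups = [activations[i:i + group_size] for i in range(0, len(activations), group_size)]
    return sum(group_precision(g) for g in groups)

GROUP = 4
acts = [1, 3, 0, 2,  120, 90, 64, 7,  1, 0, 1, 2]      # required precision varies with the data
layer_bits = max(a.bit_length() for a in acts)         # a single, per-layer precision
print("cycles with per-layer precision:", layer_bits * (len(acts) // GROUP))
print("cycles with per-group precision:", cycles_dynamic(acts, GROUP))
</code>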

An Accelerator for Embedded Devices: When compute needs are modest, the design described in the work below exploits both activation and weight precisions.

  • S. Sharify, A. Delmas Lascorz, P. Judd and Andreas Moshovos, Arxiv

Bit-Tactical: Exploiting weight sparsity if available.

  • A. Delmas, P. Judd, D. Malone Stuart, Z. Poulos, S. Sharify, M. Mahmoud, M. Nikolic, K. Siu, and A. Moshovos, Arxiv

Data Management Systems

Boosting Transaction Processing Rates for Free: Data management systems (DMS) have evolved into an essential component of numerous applications such as financial transactions and retailing, and are the backbone of many “big data” applications. They are usually deployed over large data centers with multi-million-dollar equipment and operational costs. Equally important, the environmental footprint of such facilities is large, often exceeding that of small countries. Improving performance per watt of energy not only improves capability and profitability but also greatly reduces the environmental impact of these facilities. Data management systems are often used to run “transactional” workloads comprising transactions, that is, requests that look up or update information based on some selection criteria, e.g., selecting sensor data over specific periods of time or above certain thresholds, purchasing items, or completing financial transactions. For such systems, the higher the throughput (the number of transactions serviced per unit of time), the better and often the more efficient the system.

Our work greatly improves the energy efficiency and performance of such systems. An analogy would be increasing the fuel efficiency of airplanes: our work would be the equivalent of enabling longer-haul flights with more passengers and less fuel. We observed how transactions execute on modern hardware and demonstrated that they behave selfishly. This behavior ends up penalizing all transactions, as it thrashes the instruction caches, a key performance-enhancing mechanism. Instead of having transactions act selfishly, we developed a system that allows them to cooperate seamlessly. Over the course of the work, we developed two novel cooperative policies that boost throughput. The policies judiciously decide where, when, and for how long transactions should execute.
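A toy illustration of the underlying intuition, not of the STREX or SLICC mechanisms themselves: with a small direct-mapped instruction-cache model, grouping transactions that are about to execute the same code phase lets the instruction footprint fetched by one transaction be reused by the next, whereas interleaving keeps evicting and refetching it. All sizes, block ranges, and counts are made up for the example.

<code python>
# Toy direct-mapped instruction-cache model showing why grouping transactions by code
# phase improves instruction reuse. Illustrates the intuition only, not STREX or SLICC;
# cache size, phase footprints, and transaction counts are made up for the example.

class ICache:
    def __init__(self, lines):
        self.lines = lines
        self.tags = [None] * lines
        self.misses = 0

    def fetch(self, block):
        idx = block % self.lines          # direct-mapped placement
        if self.tags[idx] != block:
            self.misses += 1
            self.tags[idx] = block

# Each transaction runs two code phases; each phase touches its own set of code blocks,
# and each footprint is larger than the cache.
PHASE_A = list(range(0, 96))
PHASE_B = list(range(200, 296))

def run(schedule, cache_lines=64):
    cache = ICache(cache_lines)
    for phase in schedule:
        for block in phase:
            cache.fetch(block)
    return cache.misses

TXNS = 8
selfish     = [PHASE_A, PHASE_B] * TXNS              # each transaction runs A then B on its own
cooperative = [PHASE_A] * TXNS + [PHASE_B] * TXNS    # transactions in the same phase run back to back

print("i-cache misses, selfish    :", run(selfish))
print("i-cache misses, cooperative:", run(cooperative))
</code>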

This work has led to several top-tier international conference publications, of which I list the two most impactful below, to invitations for presentations at various companies, and to immediate job offers for my Ph.D. student, Mr. Islam Atta. He respectfully declined all these offers as he preferred to complete his Ph.D. first. The interest transcends processor companies and includes services companies such as Amazon, which Dr. Atta ultimately decided to join.

  • I. Atta, P. Tozun, A. Ailamaki and Andreas Moshovos, STREX: Boosting Instruction Cache Reuse in OLTP Workloads Through Stratified Transaction Execution, ACM/IEEE International Symposium on Computer Architecture, June 2013.
  • I. Atta, P. Tozun, A. Ailamaki and Andreas Moshovos, SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads, ACM/IEEE International Symposium on Microarchitecture, December 2012.