Deep Learning Acceleration

We are designing, developing, and demonstrating a novel class of hardware accelerators for Deep Learning networks whose key feature is that they are value-based. Conventional accelerators rely mostly on the structure of the computation, that is, on which calculations are performed and how they communicate. Value-based accelerators further boost performance by taking advantage of expected properties of the value stream computed at runtime, such as dynamically redundant or ineffectual computations, the distribution of values, or even their bit content. In short, our accelerator designs reduce the amount of work that needs to be performed for existing neural models, and do so transparently to the model designer.

Why are we pursuing these designs? Because Deep Learning is transforming our world by leaps and bounds. One of the three drivers behind Deep Learning's success is the computing hardware that enabled its first practical applications. While algorithmic improvements will allow Deep Learning to evolve, much hinges on hardware's ability to keep delivering ever higher performance and data storage and processing capability. With Dennard scaling having ceased, the only viable way to do so is through architecture specialization.

The figure below highlights the potential and motivation of some of the methods we have developed:

A: remove zero Activations, W: remove zero Weights, Ap: use dynamic per-group precision for Activations, Ae: skip ineffectual terms after Booth-encoding the Activations. We also have designs that exploit Weight precision (see LOOM below) and yet-to-be-released designs that exploit further properties :)

These IEEE Micro and IEEE Computer articles present our rationale and summarize some of our designs. The most recently publicly disclosed design is Bit-Tactical, which targets, but does not require, a sparse network.

The tables below summarize key characteristics of some of our designs:

Here is an older talk that describes some of our work (given in late February 2017): CNN Inference Accelerators.

Cnvlutin: Ineffectual-Neuron-Free Convolutional Neural Networks:

We made a key observation: a large portion of the operations performed in convolutional layers, the dominant computation at the core of Convolutional Neural Networks, are ineffectual (e.g., the activation value is zero or close enough to zero) due to the nature of CNNs. Ineffectual operations do not affect the final output of the network. Moreover, processing such ineffectual information consumes valuable computing resources, wasting time and power. However, which operations are ineffectual is data dependent and can only be discovered at runtime. In a conventional architecture, by the time we figure out that an operation is ineffectual, it is typically too late: the time it takes to check that an operation is ineffectual is typically as long as the time it would have taken to just perform the operation, i.e., checking in order to avoid operations would result in a slowdown.

Cnvlutin exploits the organization of CNNs to eliminate ineffectual operations on-the-fly, completely transparently, improving speed and energy consumption. We demonstrated Cnvlutin as a very wide single-instruction multiple-data CNN accelerator that dynamically eliminates most ineffectual multiplications. A configuration that computes 4K terms in parallel per cycle improves performance over an equivalently configured state-of-the-art accelerator, DaDianNao, by 24% to 55%, and by 37% on average, by targeting zero-valued operands alone. While Cnvlutin incurs an area overhead of 4.49%, it improves overall Energy Delay Squared (ED^2) and Energy Delay (ED) by 2.01x and 1.47x on average. By loosening the ineffectual-operand identification criterion to also eliminate neurons below a per-layer pre-specified threshold, performance improvements increase to 1.52x on average with no loss of accuracy. Raising these thresholds further allows for larger performance gains by trading off accuracy. Concurrently, groups from Harvard and MIT exploited multiplications with zero to save energy, and a group from Stanford did so to also improve performance. We both save energy and improve performance, and we cover additional computations by using a looser ineffectual-computation criterion. A small software sketch of the zero-skipping idea follows the reference below.

  • J. Albericio, P. Judd, T. Hetherington, T. Aamodt, N. Enright Jerger, and A. Moshovos, Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing, ACM/IEEE International Symposium on Computer Architecture, June 2016.
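
To make the zero-skipping idea concrete, here is a minimal Python sketch, not the actual hardware pipeline: activations are kept as (offset, value) pairs so that zero entries are never fetched or multiplied, and the work performed scales with the number of effectual activations. All names and values are illustrative.

    # Toy sketch of zero-activation skipping (illustrative only; the real
    # design is a wide SIMD pipeline, not software). Activations are stored
    # as (offset, value) pairs so zero entries are never fetched or multiplied.

    def compress_activations(activations):
        """Keep only nonzero activations along with their original positions."""
        return [(i, a) for i, a in enumerate(activations) if a != 0]

    def dot_product_skipping_zeros(activations, weights):
        """Dot product that only touches effectual (nonzero) activations."""
        total = 0
        for offset, a in compress_activations(activations):
            total += a * weights[offset]   # weight fetched via the stored offset
        return total

    acts = [0, 3, 0, 0, 7, 0, 0, 1]        # ReLU outputs are often zero
    wts  = [2, 4, 1, 5, 3, 6, 8, 9]
    print(dot_product_skipping_zeros(acts, wts))   # 3*4 + 7*3 + 1*9 = 42
    # Only 3 of the 8 multiplications are performed; the rest are ineffectual.

In the hardware design the compressed form is produced on-the-fly as the preceding layer's output is generated, so no separate compression pass is needed.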

Stripes: Exploiting Precision Variability: In Stripes, we developed a CNN accelerator whose performance improves as the numerical precision used decreases. Typical hardware uses bit-parallel units whose performance is independent of numerical precision, with 16-bit fixed-point being common for CNNs. In Stripes, performance scales with 16/p, where p is the precision in bits: every bit matters. Stripes improves performance over DaDianNao by 1.92x on average, for full CNNs and under very pessimistic assumptions. Stripes enables trading off accuracy for performance and energy on-the-fly. It opens up a new direction for CNN research for both inference and training, some of which we plan to investigate. For example, it may accelerate training by enabling incremental and dynamic adjustment of precision, or it may enable incremental classification, where a small precision yields a quick, rough result and a decision on whether to increase precision for a more accurate one. In more recent work we proposed TARTAN, which boosts performance also for fully-connected layers. We also observed that the typical value distribution is such that profile-derived precisions, as used in the original Stripes work, are pessimistic. A simple, surgical extension to Stripes can further boost performance by trimming the precisions on-the-fly to a much smaller width at a much smaller granularity; see Dynamic Stripes below. A small model of the bit-serial idea follows the references below.

  • P. Judd, J. Albericio, and A. Moshovos, Stripes: Bit-Serial Deep Learning Computing, Computer Architecture Letters, accepted April 2016, appeared Aug. 2016.
  • P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, Stripes: Bit-Serial Deep Learning Computing, ACM/IEEE International Symposium on Microarchitecture, Oct. 2016.
  • A. Delmas Lascorz, S. Sharify, P. Judd, and A. Moshovos, TARTAN: Accelerating Fully-Connected and Convolutional Layers in Deep Neural Networks, OpenReview, Oct. 2016; arXiv.
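
The following is a small Python model, not the Stripes datapath, of why bit-serial processing ties execution time to precision: an activation group of precision p is fed one bit position per cycle, so the multiply-accumulate finishes in p cycles instead of the fixed 16 of a bit-parallel unit, for an ideal speedup of 16/p. Names and values are illustrative.

    # Toy model of bit-serial processing (illustrative; not the Stripes pipeline).
    # Activations of precision p are processed one bit plane per cycle, so a
    # multiply-accumulate takes p cycles instead of a fixed 16: speedup ~ 16/p.

    def bit_serial_dot(activations, weights, precision):
        """Compute sum(a*w) by processing one activation bit position per cycle."""
        total, cycles = 0, 0
        for bit in range(precision):                 # one bit plane per cycle
            plane = sum(((a >> bit) & 1) * w for a, w in zip(activations, weights))
            total += plane << bit                    # weight the partial sum
            cycles += 1
        return total, cycles

    acts, wts = [5, 3, 2], [1, 2, 4]                 # non-negative activations
    ref = sum(a * w for a, w in zip(acts, wts))      # 5 + 6 + 8 = 19
    for p in (16, 8, 4):                             # per-layer precision in bits
        val, cyc = bit_serial_dot(acts, wts, p)
        print(p, val == ref, f"speedup ~ {16 / cyc:.1f}x")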

Bit-Pragmatic Deep Learning Computing: We observe that, on average, more than 90% of the work done when multiplying activations and weights in Convolutional Neural Networks is ineffectual. In Pragmatic, execution time for convolutional layers depends only on the essential bit content of the runtime values. Even after reducing precision to the absolute minimum, there are still many ineffectual computations; Pragmatic eliminates the remaining computations that are certainly ineffectual. It boosts both energy efficiency and performance over all previous proposals. Compared to DaDianNao it improves performance by more than 4x, and for narrower configurations performance approaches nearly 8x over an equivalent DaDianNao configuration.
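
As a rough illustration of term skipping, consider the Python sketch below (not the Pragmatic hardware): an activation is recoded into a few signed power-of-two terms using a Booth-like non-adjacent-form recoding, and the multiplication is carried out as one shift-and-add per essential term, so work tracks the number of nonzero terms rather than the full 16-bit width. All names are illustrative, and the recoding assumes non-negative activations (e.g., post-ReLU).

    # Toy illustration of essential-term skipping (not the Pragmatic datapath).

    def signed_digit_terms(x):
        """Non-adjacent-form (Booth-like) recoding of a non-negative integer:
        returns (shift, sign) pairs such that x == sum(sign << shift)."""
        terms, shift = [], 0
        while x:
            if x & 1:
                sign = 2 - (x & 3)       # +1 if x % 4 == 1, -1 if x % 4 == 3
                terms.append((shift, sign))
                x -= sign
            x >>= 1
            shift += 1
        return terms

    def multiply_by_terms(activation, weight):
        """Multiply via shift-and-add over only the essential terms."""
        terms = signed_digit_terms(activation)
        product = sum(sign * (weight << shift) for shift, sign in terms)
        return product, len(terms)       # len(terms) ~ work actually needed

    a, w = 0b0111_0000_1111_0001, 3      # many 1-bits, few essential terms
    product, work = multiply_by_terms(a, w)
    print(product == a * w, work, "terms vs 16 bit-parallel steps")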

Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks: Having observed that the precision requirements of Deep Neural Networks vary per layer, we proposed a simple extension to existing computing engines that reduces memory bandwidth and footprint. The key concept is to use two different representations: one that is storage efficient and one that is used for computation. We also propose a new storage container format that drastically reduces the cost of converting from one format to the other, as it avoids the large crossbar that would otherwise be necessary. A small sketch of the packing idea follows the reference below.

  • P. Judd, J. Albericio, T. Hetherington, T. M. Aamodt, N. Enright Jerger, and A. Moshovos, Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks, ACM International Conference on Supercomputing (ICS), June 2016.
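
The Python sketch below illustrates the dual-representation idea with a deliberately simple packing scheme; it is not the container format proposed in the paper. Values live in memory packed at a reduced per-layer precision p and are expanded back to full-width values for computation, trading a little conversion work for a much smaller footprint and lower bandwidth.

    # Toy sketch of the dual-representation idea (illustrative packing scheme,
    # not the Proteus container format): values are stored at a reduced
    # per-layer precision p and expanded to 16-bit fixed point for compute.

    def pack(values, p):
        """Pack p-bit unsigned values back to back into one integer 'memory'."""
        word = 0
        for i, v in enumerate(values):
            assert 0 <= v < (1 << p)
            word |= v << (i * p)
        return word

    def unpack(word, p, count):
        """Expand the packed stream back to full-width values for compute."""
        mask = (1 << p) - 1
        return [(word >> (i * p)) & mask for i in range(count)]

    vals = [3, 12, 7, 0, 9]
    p = 4                                    # per-layer precision chosen offline
    packed = pack(vals, p)
    print(unpack(packed, p, len(vals)) == vals)          # lossless round trip
    print(f"footprint: {p * len(vals)} bits vs {16 * len(vals)} bits")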

Dynamic Stripes: It has been known that the precision activations need can be tailored per network layer, and several hardware approaches exploit this precision variability to boost performance and energy efficiency. Here we show that assigning precisions at the layer level leaves much on the table: in practice the needed precision varies with the input and at a much finer granularity. An accelerator only needs to consider as many activations as it can process per cycle. In the work below we show how to adapt precision at runtime, at the processing granularity, as sketched below. We also show how to boost performance and energy efficiency for fully-connected layers.
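
A minimal Python sketch of the per-group idea, with illustrative names and numbers: for each group of activations that would be processed together in a cycle, the precision actually needed is set by the largest value in that group, so groups of small values finish in fewer bit-serial cycles than a per-layer profile would allow.

    # Toy sketch of dynamic per-group precision detection (illustrative only).
    # Activations are assumed non-negative (e.g., post-ReLU).

    def group_precision(group):
        """Bits needed to represent the largest activation in the group."""
        return max(v.bit_length() for v in group) or 1   # all-zero group: 1 cycle

    def dynamic_cycles(activations, group_size):
        """Total bit-serial cycles when precision adapts per group."""
        groups = [activations[i:i + group_size]
                  for i in range(0, len(activations), group_size)]
        return sum(group_precision(g) for g in groups)

    acts = [0, 3, 1, 0,   120, 5, 0, 2,   1, 0, 0, 1]
    print("fixed 16-bit:      ", 16 * (len(acts) // 4), "cycles")
    print("dynamic per-group: ", dynamic_cycles(acts, 4), "cycles")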

LOOM: An Accelerator for Embedded Devices: When compute needs are modest, the design described below exploits both activation and weight precisions (a back-of-the-envelope model of the potential follows the references below).

  • S. Sharify, A. Delmas Lascorz, P. Judd, and A. Moshovos, arXiv.
  • S. Sharify, A. Delmas, P. Judd, K. Siu, and A. Moshovos, Design Automation Conference, June 2018 (here we use dynamic precision detection and also study the effects of off-chip traffic and on-chip buffering).
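
As a back-of-the-envelope model only (not the Loom datapath): if a multiplication is processed serially over both the activation bits Pa and the weight bits Pw, the work per product scales with Pa x Pw, so the ideal speedup over a 16x16 bit-parallel unit is 256/(Pa x Pw). The precision pairs below are illustrative, not measured profiles.

    # Rough model of serial-by-both-operands processing (illustrative numbers).

    def ideal_speedup(pa_bits, pw_bits, baseline_bits=16):
        """Ideal speedup over a bit-parallel baseline when work ~ Pa * Pw."""
        return (baseline_bits * baseline_bits) / (pa_bits * pw_bits)

    # Hypothetical per-layer activation/weight precision pairs.
    for pa, pw in [(16, 16), (9, 8), (6, 5)]:
        print(f"Pa={pa:2d}, Pw={pw:2d} -> ~{ideal_speedup(pa, pw):.1f}x")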

DPRed: This builds on the Dynamic Stripes work and shows that dynamic, per-group precision adaptation can yield benefits. Stay tuned for an update:

  • A. Delmas, S. Sharify, P. Judd, M. Nikolic, and A. Moshovos, DPRed: Making Typical Activation Values Matter In Deep Learning Computing, arXiv.

Bit-Tactical: Exploiting weight sparsity when it is available (a small sketch of the idea follows the reference below).

  • A. Delmas, P. Judd, D. Malone Stuart, Z. Poulos, S. Sharify, M. Mahmoud, M. Nikolic, K. Siu, and A. Moshovos, arXiv.
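
A small Python analogy of exploiting weight sparsity, with illustrative names; the actual design schedules weights onto multiplier lanes in hardware. Because weights are known in advance, the zero ones can be dropped statically, and each surviving weight remembers which activation it pairs with, so only effectual products are computed.

    # Toy analogy of weight-sparsity skipping (software-level illustration only).

    def schedule_nonzero_weights(weights):
        """Static pass over the (known) weights: keep only effectual ones."""
        return [(i, w) for i, w in enumerate(weights) if w != 0]

    def dot_product_sparse_weights(schedule, activations):
        """Only effectual weight/activation pairs are multiplied."""
        return sum(w * activations[i] for i, w in schedule)

    wts  = [0, 2, 0, 0, -1, 0, 3, 0]        # pruned/sparse weights
    acts = [5, 1, 7, 2,  4, 9, 8, 6]
    sched = schedule_nonzero_weights(wts)   # done once, offline
    print(dot_product_sparse_weights(sched, acts))   # 2*1 + (-1)*4 + 3*8 = 22
    print(f"{len(sched)} of {len(wts)} multiplications performed")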

Data Management Systems

Boosting Transaction Processing Rates for Free: Data management systems (DMS) have evolved into an essential component of numerous applications, such as financial transactions and retailing, and are the backbone of many "big data" applications. They are usually deployed over large data centers with multi-million-dollar equipment and operational costs. Equally important, the environmental footprint of such facilities is large, often exceeding that of small countries. Improving performance per watt of energy not only improves capability and profitability but also greatly reduces the environmental impact of these facilities. Data management systems are often used to run "transactional" workloads comprising transactions, that is, requests that look up or update information based on some selection criteria, e.g., selecting sensor data over specific periods of time or above certain thresholds, purchasing items, or completing financial transactions. For such systems, the higher the throughput (the number of transactions serviced per unit of time), the better and often the more efficient the system.

Our work greatly improves the energy efficiency and performance of such systems. An analogy would be increasing the fuel efficiency of airplanes: our work would be the equivalent of enabling longer-haul flights with more passengers and less fuel. We observed how transactions execute on modern hardware and demonstrated that transactions operate selfishly. This behavior ends up penalizing all transactions, as it thrashes the instruction caches, a key performance-enhancing mechanism. Instead of having transactions act selfishly, we developed a system that allows them to cooperate seamlessly. Over the course of the work, we developed two novel cooperative policies that boost throughput. The policies judiciously decide where, when, and for how long transactions should execute.
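
The toy Python model below, with an LRU cache standing in for the instruction cache, illustrates why selfish, interleaved execution hurts: transactions of different types evict each other's code, while grouping transactions that run the same code lets later ones reuse what the first one brought in. The cache size, footprints, and transaction types are made up for illustration; this is not the actual STREX or SLICC policy.

    # Toy instruction-cache model (illustrative; not the STREX/SLICC mechanisms).
    from collections import OrderedDict

    def run(trace, cache_lines):
        """Count misses for a trace of instruction-block ids under LRU."""
        cache, misses = OrderedDict(), 0
        for block in trace:
            if block in cache:
                cache.move_to_end(block)
            else:
                misses += 1
                if len(cache) >= cache_lines:
                    cache.popitem(last=False)
                cache[block] = True
        return misses

    def footprint(txn_type, blocks_per_txn=40):
        """Instruction blocks touched by one transaction of a given type."""
        return [f"{txn_type}:{b}" for b in range(blocks_per_txn)]

    types = ["new_order", "payment", "delivery"]       # hypothetical OLTP mix
    interleaved = [b for _ in range(10) for t in types for b in footprint(t)]
    batched     = [b for t in types for _ in range(10) for b in footprint(t)]

    CACHE = 64   # lines; smaller than the combined footprint of all types
    print("interleaved misses:", run(interleaved, CACHE))
    print("batched misses:    ", run(batched, CACHE))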

This work has led to several top-tier international conference publications, of which I list the two most impactful below, to invitations for presentations at various companies, and to immediate job offers for my Ph.D. student, Mr. Islam Atta. He respectfully declined all these offers, as he preferred to complete his Ph.D. first. The interest transcends processor companies and includes services companies such as Amazon, which Dr. Atta ultimately decided to join.

  • I. Atta, P. Tozun, A. Ailamaki and Andreas Moshovos, STREX: Boosting Instruction Cache Reuse in OLTP Workloads Through Stratified Transaction Execution, ACM/IEEE International Symposium on Computer Architecture, June 2013.
  • I. Atta, P. Tozun, A. Ailamaki and Andreas Moshovos, SLICC: Self-Assembly of Instruction Cache Collectives for OLTP Workloads, ACM/IEEE International Symposium on Microarchitecture, December 2012.