  
These [[https://ieeexplore.ieee.org/document/8259428/|IEEE MICRO]] and IEEE Computer articles present our rationale and summarize some of our designs. The most recently publicly disclosed design is [[https://arxiv.org/abs/1803.03688|Bit-Tactical]], which targets, but does not require, a sparse network.

The tables below summarize key characteristics of some of our designs:

{{ :wiki:summary2.gif?800 |}}{{ :wiki:summary1.gif?800 |}}
  
Here is an older talk that describes some of our work (talk given in late February 2017): {{ :wiki:moshovos_cnn_acceleration.pdf | CNN Inference Accelerators}}.
  
**Cnvlutin: Ineffectual Neuron Free Convolutional Neural Networks:** Further innovation in Deep Learning hinges upon the hardware's ability to support deeper, larger, and more computationally demanding neural networks. Cnvlutin is our first value-aware hardware accelerator for the convolutional layers of Convolutional Neural Networks (CNNs). CNNs are the state-of-the-art image classification method, with applications in many fields including medical diagnosis, security, cataloguing, and image-based search in general.
  
We made a key observation: a large portion of the operations performed in these layers, the dominant computation at the core of Convolutional Neural Networks, are ineffectual (e.g., an activation value that is zero or close enough to zero) due to the nature of CNNs. Ineffectual operations do not affect the final output of the network. Moreover, processing such ineffectual information consumes valuable computing resources, wasting time and power. However, which operations are ineffectual is data dependent and can only be determined at runtime. In a conventional architecture, by the time we figure out that an operation is ineffectual, it is typically too late: checking whether an operation is ineffectual typically takes as long as simply performing it, i.e., checking in order to avoid operations would result in a slowdown.
  
Cnvlutin exploits the organization of CNNs to eliminate ineffectual operations on-the-fly, completely transparently, improving speed and energy consumption. We demonstrated Cnvlutin as a very wide single-instruction multiple-data CNN accelerator that dynamically eliminates most ineffectual multiplications. A configuration that computes 4K terms in parallel per cycle improves performance over an equivalently configured state-of-the-art accelerator, DaDianNao, by 24% to 55%, and by 37% on average, by targeting zero-valued operands alone. While Cnvlutin incurs an area overhead of 4.49%, it improves overall Energy Delay Squared (ED^2) and Energy Delay (ED) by 2.01x and 1.47x on average. By loosening the ineffectual-operand identification criterion to eliminate neurons below a pre-specified per-layer threshold, performance improvements increase to 1.52x on average with no loss of accuracy. Raising these thresholds further allows for larger performance gains by trading off accuracy. Concurrently, groups from Harvard and MIT exploited multiplications with zero to save energy, and a group from Stanford to also improve performance. We both save energy and improve performance, and target additional computations by using a looser ineffectual-computation criterion.
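
As a minimal illustration of the idea (a behavioral sketch of ours, not the actual hardware design; all function and variable names are invented), the Python below contrasts a dense dot product with one that skips ineffectual activations up front:

<code python>
# Behavioral sketch of zero-skipping, using a plain dot product to
# stand in for the work of a convolutional layer. In Cnvlutin the
# skipping happens in hardware; here it is modeled in software.

def dense_dot(activations, weights):
    # Conventional datapath: every multiply is performed,
    # including those with zero-valued activations.
    return sum(a * w for a, w in zip(activations, weights))

def zero_skipping_dot(activations, weights, threshold=0.0):
    # Value-aware datapath: pairs whose activation magnitude is at or
    # below the threshold are never issued. threshold=0.0 targets
    # exact zeros; a small positive per-layer threshold is the looser
    # ineffectual criterion described above.
    return sum(a * w for a, w in zip(activations, weights)
               if abs(a) > threshold)

acts = [0.0, 1.5, 0.0, 0.2, 3.0, 0.0, 0.0, 0.9]  # post-ReLU: many zeros
wts  = [0.4, -1.0, 2.2, 0.7, 0.1, -0.3, 1.8, 0.5]
assert dense_dot(acts, wts) == zero_skipping_dot(acts, wts)
</code>

With threshold=0.0 the result is identical to the dense computation; only the amount of work changes, which is why the elimination is transparent.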
**Stripes: Bit-Serial Deep Learning Computing:**
  * P. Judd, J. Albericio, Andreas Moshovos, Stripes: Bit-Serial Deep Learning Computing, Computer Architecture Letters, accepted April 2016, appears Aug 2016.
  * P. Judd, J. Albericio, T. Hetherington, T. Aamodt, Andreas Moshovos, Stripes: Bit-Serial Deep Learning Computing, ACM/IEEE International Conference on Microarchitecture, Oct 2016.
  * A. Delmas Lascorz, S. Sharify, P. Judd, A. Moshovos, TARTAN: Accelerating Fully-Connected and Convolutional Layers in Deep Neural Networks, [[https://openreview.net/forum?id=Hy-lMNqex|OpenReview]], Oct 2016. [[https://arxiv.org/abs/1707.09068|Arxiv]].
  
**Bit-Pragmatic Deep Learning Computing:** We observe that //on average more than 90% of the work done when multiplying activations and weights in Convolutional Neural Networks is ineffectual//. In Pragmatic, execution time depends only on the essential bit content of the runtime values for convolutional layers. Even after reducing precision to the absolute minimum, there are still many ineffectual computations; Pragmatic eliminates any remaining computations that are certainly ineffectual. It boosts both energy efficiency and performance over all previous proposals. Compared to DaDianNao it improves performance by more than 4x, and for narrower configurations performance approaches nearly 8x over an equivalent DaDianNao configuration.
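
As a rough software analogy of the essential-bit idea (our sketch, not the Pragmatic datapath), a multiplication can be decomposed into one shift-and-add per set bit of the activation, so zero bits contribute no work at all:

<code python>
# Sketch: multiply by processing only the set ("essential") bits of a
# non-negative integer activation, one shift-and-add per set bit.
# A hardware analog processes activations serially and skips zero
# bits entirely.

def essential_bit_multiply(activation: int, weight: int):
    # Returns (product, steps), where steps counts one unit of work
    # per set bit rather than one per bit position.
    product, steps, position = 0, 0, 0
    while activation:
        if activation & 1:
            product += weight << position  # shift-and-add for this bit
            steps += 1
        activation >>= 1
        position += 1
    return product, steps

p, s = essential_bit_multiply(0b00010010, 7)  # only 2 of 8 bits are set
assert p == 0b00010010 * 7 and s == 2
</code>

Here an 8-bit activation with two set bits costs 2 steps instead of 8, which is why execution time can track essential bit content rather than bit width.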
**Dynamic Stripes:** It has been known that the precision that activations need can be tailored per network layer. Several hardware approaches exploit this precision variability to boost performance and energy efficiency. Here we show that much is left on the table by assigning precisions at the layer level: in practice the precision varies with the input and at a much finer granularity. An accelerator only needs to consider as many activations as it can process per cycle. In the work below we show how to adapt to this precision variability at runtime, at the granularity at which the accelerator processes activations (a small sketch follows the reference below). We also show how to boost performance and energy efficiency for fully-connected layers.
  
  * Alberto Delmas, Patrick Judd, Sayeh Sharify, Andreas Moshovos, [[https://arxiv.org/abs/1706.00504|Dynamic Stripes: Exploiting the Dynamic Precision Requirements of Activation Values in Neural Networks]], Arxiv.
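
The sketch below (ours, with invented names) shows the basic runtime measurement: for each group of activations that would be processed together in one cycle, compute the smallest bit width that still represents every value in the group, and charge only that many bit-serial steps:

<code python>
# Sketch: per-group dynamic precision detection. For each group of
# activations processed together, find the fewest bits that losslessly
# represent every value, instead of using one precision per layer.

def needed_bits(group):
    # Unsigned, integer-quantized activations assumed for simplicity.
    return max(v.bit_length() for v in group) or 1

def bit_serial_cycles(activations, group_size, layer_precision):
    groups = [activations[i:i + group_size]
              for i in range(0, len(activations), group_size)]
    per_layer = len(groups) * layer_precision        # static, per-layer
    per_group = sum(needed_bits(g) for g in groups)  # dynamic, per-group
    return per_layer, per_group

acts = [3, 0, 1, 2, 120, 90, 7, 5]  # needs vary from group to group
print(bit_serial_cycles(acts, group_size=4, layer_precision=8))  # (16, 9)
</code>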
  
**LOOM: An Accelerator for Embedded Devices:** When compute needs are modest, the design described below exploits both activation and weight precisions; a back-of-the-envelope sketch follows the publication list.
  * S. Sharify, A. Delmas Lascorz, P. Judd and Andreas Moshovos, [[https://arxiv.org/abs/1706.07853|Arxiv]].
  * S. Sharify, A. Delmas, P. Judd, K. Siu, and A. Moshovos, Design Automation Conference, June 2018 (here we use dynamic precision detection and also study the effects of off-chip traffic and on-chip buffering).
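
As a back-of-the-envelope model (ours, not from the papers), a design that is bit-serial in both operands spends work proportional to the product of the two precisions, so trimming activations and weights compounds:

<code python>
# Sketch: relative work for a multiply when both operands are handled
# bit-serially. Against a 16-bit fixed baseline, work scales with
# activation_bits * weight_bits, so the two savings multiply.

BASELINE_BITS = 16

def ideal_speedup(activation_bits: int, weight_bits: int) -> float:
    return (BASELINE_BITS * BASELINE_BITS) / (activation_bits * weight_bits)

# Hypothetical per-layer precisions: 8-bit activations, 4-bit weights.
print(ideal_speedup(8, 4))  # 8.0x the baseline, in the ideal case
</code>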


**DPRed:** This builds on the Dynamic Stripes work and shows that dynamic precision and per-group precision adaptation can yield benefits (a toy illustration follows the reference below). Stay tuned for an update:
  * A. Delmas, S. Sharify, P. Judd, M. Nikolic, A. Moshovos, DPRed: Making Typical Activation Values Matter In Deep Learning Computing, [[https://arxiv.org/abs/1804.06732|Arxiv]].
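
To make the per-group idea concrete on the storage side, here is a toy packing scheme of our own (not DPRed's actual encoding): each group stores a small width field plus its values at that width, so typical, small activation values cost far fewer bits:

<code python>
# Toy sketch: store each group of activations at its own precision,
# prefixed by a 4-bit field giving that precision. Typical activation
# values are small, so most groups pack well below the worst case.

def packed_bits(activations, group_size=16, header_bits=4):
    total = 0
    for i in range(0, len(activations), group_size):
        group = activations[i:i + group_size]
        width = max(v.bit_length() for v in group) or 1
        total += header_bits + width * len(group)
    return total

acts = [1, 0, 2, 3] * 8 + [200] * 4   # mostly tiny values, one hot group
print(packed_bits(acts), len(acts) * 16)  # 108 packed vs. 576 flat bits
</code>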
  
**Bit-Tactical:** Exploiting weight sparsity if available.
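
The one-line summary above admits a simple software analogy (ours; the actual design's weight scheduling is more involved): pre-filter the weights once, offline, so only nonzero weights and their positions are ever fetched and multiplied:

<code python>
# Sketch: skip zero weights by storing only nonzero (index, value)
# pairs, prepared once offline since weights are static at inference.

def compress_weights(weights):
    return [(i, w) for i, w in enumerate(weights) if w != 0]

def sparse_dot(activations, nonzeros):
    # Work scales with the number of nonzero weights, not layer size.
    return sum(activations[i] * w for i, w in nonzeros)

wts = [0, 0, 3, 0, -2, 0, 0, 1]   # a sparse (e.g., pruned) filter
acts = [5, 1, 2, 7, 4, 9, 8, 6]
nz = compress_weights(wts)        # 3 of 8 weights survive
assert sparse_dot(acts, nz) == sum(a * w for a, w in zip(acts, wts))
</code>

If a network happens not to be sparse, nothing is skipped and nothing breaks, matching the "targets but does not require a sparse network" framing above.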