{{ :
**A:** remove zero Activations,
This [[https://

The tables below summarize key characteristics of some of our designs:
{{ :
Here is an older talk that describes some of our work (given in late February 2017): {{ :
**Cnvlutin: Ineffectual Neuron Free Convolutional Neural Networks:
We made a key observation that a large portion of the operations performed in convolutional layers, the dominant computation at the core of Convolutional Neural Networks, are ineffectual due to the nature of CNNs. Ineffectual operations do not affect the final output of the network, yet processing them consumes valuable computing resources, wasting time and power. However, which operations are ineffectual is data-dependent and can only be determined at runtime. In a conventional architecture,
Cnvlutin exploits the organization of CNNs to eliminate ineffectual operations on the fly, completely transparently, improving speed and reducing energy consumption. We demonstrated Cnvlutin as a very wide single-instruction, multiple-data CNN accelerator that dynamically eliminates most ineffectual multiplications. A configuration that computes 4K terms in parallel per cycle improves performance over an equivalently configured state-of-the-art accelerator,
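To make the zero-skipping idea concrete, here is a minimal Python sketch. It is purely illustrative: the function names and the (value, offset) encoding are simplifications chosen for this example, not the actual Cnvlutin datapath or storage format.

<code python>
# Illustrative sketch of zero-skipping: activations are kept in a "zero-free"
# form as (value, offset) pairs so that zero entries never reach a multiplier.
# This is a software analogy, not the Cnvlutin hardware design.

def encode_zero_free(activations):
    """Keep only the non-zero activations, paired with their original positions."""
    return [(a, i) for i, a in enumerate(activations) if a != 0]

def dot_product_skipping_zeros(activations, weights):
    """Compute sum(a * w) while multiplying only the non-zero activations."""
    total = 0
    for a, offset in encode_zero_free(activations):
        total += a * weights[offset]  # the stored offset selects the matching weight
    return total

acts = [0, 3, 0, 0, 5, 0, 1, 0]   # many zeros, as is typical after ReLU
wts  = [2, 4, 6, 8, 1, 3, 5, 7]
print(dot_product_skipping_zeros(acts, wts))  # 3*4 + 5*1 + 1*5 = 22
</code>

The point of the (value, offset) pairs is that the matching weight can still be fetched even though the zero activations are never stored or multiplied.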
* P. Judd, J. Albericio, Andreas Moshovos, Stripes: Bit-Serial Deep Learning Computing, Computer Architecture Letters, accepted April 2016, appears August 2016.
* P. Judd, J. Albericio, T. Hetherington,
* A. Delmas Lascorz, S. Sharify, P. Judd, A. Moshovos, TARTAN: Accelerating Fully-Connected and Convolutional Layers in Deep Neural Networks, OpenReview, Oct 2016.
**Bit-Pragmatic Deep Learning Computing:
* J. Albericio, P. Judd, Andreas Moshovos, [[https://
* J. Albericio, A. Delmas Lascorz, P. Judd, S. Sharify,
** Proteus: Exploiting Numerical Precision Variability in Deep Neural Networks:** Having observed that the precision requirements of Deep Neural Networks vary per layer, we proposed a simple extension to existing computing engines that reduces memory bandwidth and footprint. The key concept is to use two different representations: one that is storage efficient and one that is used for computation. A new storage container format drastically reduces the cost of converting between the two representations, as it avoids the large crossbar that would otherwise be necessary. A sketch of the idea follows the reference below.
* Patrick Judd, Jorge Albericio, Tayler Hetherington,
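Below is a minimal Python sketch of the two-representation idea, assuming a reduced per-layer precision for storage and a 16-bit compute format. The pack/unpack functions and the example precision are illustrative only, not the proposed container format.

<code python>
# Illustrative sketch: values are stored packed at a reduced per-layer precision
# and expanded back to the wider compute representation when read.

def pack(values, bits):
    """Pack non-negative fixed-point values into one integer, 'bits' bits each."""
    word = 0
    for i, v in enumerate(values):
        assert 0 <= v < (1 << bits), "value does not fit in the reduced precision"
        word |= v << (i * bits)
    return word

def unpack(word, bits, count):
    """Recover the values for the wider compute engine."""
    mask = (1 << bits) - 1
    return [(word >> (i * bits)) & mask for i in range(count)]

# Example: a layer that only needs 5 bits stores 16/5 = ~3.2x more values
# per 16 bits of storage than a fixed 16-bit format would.
layer_precision = 5
acts = [3, 17, 0, 9]
packed = pack(acts, layer_precision)
assert unpack(packed, layer_precision, len(acts)) == acts
</code>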
** Dynamic Stripes **: It is known that the precision activations need can be tailored per network layer, and several hardware approaches exploit this precision variability to boost performance and energy efficiency. Here we show that much is left on the table when precisions are assigned at the layer level: in practice, the precision needed varies with the input and at a much finer granularity. An accelerator only needs to consider as many activations as it can process per cycle. In the work below we show how to adapt precision at runtime, at the granularity of the group of activations being processed, and how to boost performance and energy efficiency for fully-connected layers as well. A sketch of per-group precision detection follows the reference below.
* Alberto Delmas, Patrick Judd, Sayeh Sharify, Andreas Moshovos, [[https://
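Here is a minimal Python sketch of per-group precision detection. It assumes non-negative integer activations (e.g., post-ReLU, quantized) and an illustrative group size; it shows the idea only, not the accelerator's detection logic.

<code python>
# Illustrative sketch of dynamic, per-group precision detection: the precision
# needed by a group of activations processed together is set by the largest
# value actually present in that group.

def needed_bits(value):
    """Bits needed to represent a non-negative integer activation (at least 1)."""
    return max(1, value.bit_length())

def group_precisions(activations, group_size):
    """Required precision of each group = precision of its largest member."""
    precisions = []
    for start in range(0, len(activations), group_size):
        group = activations[start:start + group_size]
        precisions.append(max(needed_bits(a) for a in group))
    return precisions

# Example: a fixed 16-bit (or even per-layer) precision would be pessimistic;
# per-group detection shows most groups need far fewer bits, which a
# precision-proportional engine can exploit.
acts = [3, 0, 1, 2,   40, 7, 0, 5,   0, 0, 1, 0]
print(group_precisions(acts, group_size=4))  # [2, 6, 1]
</code>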
** LOOM: An Accelerator for Embedded Devices **: When compute needs are modest, the design described below exploits both activation and weight precisions. A rough sketch of the potential benefit follows the references below.
* S. Sharify, A. Delmas Lascorz, P. Judd and Andreas Moshovos, [[https://
* S. Sharify, A. Delmas, P. Judd, K. Siu, and A. Moshovos, Design Automation Conference, June 2018 (here we use dynamic precision detection and also study the effects of off-chip traffic and on-chip buffering).
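As a rough illustration only, the sketch below estimates the ideal benefit under the assumption that execution time scales with the product of the weight and activation precisions and that the baseline is a 16-bit bit-parallel engine. The numbers are hypothetical and are not results from the papers above.

<code python>
# Back-of-the-envelope estimate (assumption-laden, not reported results):
# if execution time scales with weight_bits * activation_bits, the ideal
# speedup over a 16-bit x 16-bit bit-parallel baseline is
# (16 * 16) / (weight_bits * activation_bits).

BASELINE_BITS = 16  # assumed bit-parallel baseline precision

def ideal_speedup(weight_bits, activation_bits):
    return (BASELINE_BITS * BASELINE_BITS) / (weight_bits * activation_bits)

# Example: a layer that tolerates 8-bit weights and 9-bit activations.
print(ideal_speedup(8, 9))  # ~3.6x under these assumptions
</code>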
+ | |||
+ | |||
+ | ** DPRed: ** This builds on the Dynamic Stripes work and shows that dynamic precision and per group precision adaptation can yield benefits. Stay tuned for an update: | ||
+ | * A. Delmas, A. Sharify, P. Judd, M. Nikolic, A. Moshovos, DPRed: Making Typical Activation Values Matter In Deep Learning Computing, [[https:// | ||
+ | |||
** Bit-Tactical: | ** Bit-Tactical: |