Here is an older talk that describes some of our work (talk given in late February 2017): {{ :
**Cnvlutin: Ineffectual Neuron Free Convolutional Neural Networks**:
We made a key observation: a large portion of the operations performed in these layers, the dominant computation at the core of Convolutional Neural Networks, are ineffectual due to the nature of CNNs. Ineffectual operations do not affect the final output of the network, yet processing them consumes valuable computing resources, wasting time and power. Which operations are ineffectual is data dependent, however, and can only be discovered at runtime, so a conventional architecture ends up performing them regardless.
Cnvlutin exploits the organization of CNNs to eliminate ineffectual operations on the fly, completely transparently, improving speed and energy consumption. We demonstrated Cnvlutin as a very wide single-instruction multiple-data CNN accelerator that dynamically eliminates most ineffectual multiplications. A configuration that computes 4K terms in parallel per cycle improves performance over an equivalently configured state-of-the-art accelerator.
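To make the zero-skipping idea concrete, here is a minimal Python sketch. It is our own illustration under simplifying assumptions (a 1D dot product and a simple value/offset packing), not Cnvlutin's actual zero-free encoding or dispatch hardware:

<code python>
import numpy as np

def zero_free_encode(acts):
    # Keep only non-zero activations, remembering where they came from.
    nz = np.flatnonzero(acts)
    return acts[nz], nz                      # (values, offsets)

def dot_skip_zeros(acts, weights):
    # Multiply only the effectual (non-zero) terms; work now scales
    # with the count of non-zero activations, not the vector length.
    values, offsets = zero_free_encode(acts)
    return float(sum(v * weights[o] for v, o in zip(values, offsets)))

# ReLU makes many activations exactly zero, so many products drop out.
acts = np.maximum(np.random.randn(16), 0.0)
wts = np.random.randn(16)
assert np.isclose(dot_skip_zeros(acts, wts), float(np.dot(acts, wts)))
</code>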
* P. Judd, J. Albericio, and A. Moshovos, Stripes: Bit-Serial Deep Learning Computing, Computer Architecture Letters, accepted April 2016, appears August 2016.
* P. Judd, J. Albericio, T. Hetherington, T. Aamodt, and A. Moshovos, Stripes: Bit-Serial Deep Neural Network Computing, IEEE/ACM International Symposium on Microarchitecture (MICRO), October 2016.
* A. Delmas Lascorz, S. Sharify, P. Judd, and A. Moshovos, TARTAN: Accelerating Fully-Connected and Convolutional Layers in Deep Neural Networks, OpenReview, October 2016.
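The Stripes papers above process activations bit-serially, so execution time scales with the precision actually used rather than a fixed worst case. A toy, unsigned Python model of a bit-serial dot product (an illustration of the principle, not the hardware datapath):

<code python>
def bit_serial_dot(activations, weights, precision):
    # Feed one activation bit per "cycle"; weights stay bit-parallel.
    # Cycles taken == precision, so trimming a layer from 16 bits to
    # p bits gives an ideal 16/p speedup in this model.
    acc = 0
    for bit in range(precision):
        for a, w in zip(activations, weights):
            if (a >> bit) & 1:               # current serial bit of a
                acc += w << bit              # add weight, shifted
    return acc

acts = [3, 5, 0, 7]   # all fit in 3 bits, so precision=3 suffices
wts  = [2, 1, 4, 3]
assert bit_serial_dot(acts, wts, precision=3) == sum(a * w for a, w in zip(acts, wts))
</code>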
**Bit-Pragmatic Deep Learning Computing**:
** Dynamic Stripes **: It is well known that the precision activations need can be tailored per network layer, and several hardware approaches exploit this precision variability to boost performance and energy efficiency. Here we show that assigning precisions at the layer level leaves much on the table: in practice, precision needs vary with the input and at a much finer granularity. An accelerator only needs to accommodate as many activations as it processes per cycle. In the work below we show how to adapt precision at runtime, at the granularity of processing (see the sketch after the reference). We also show how to boost performance and energy efficiency for fully-connected layers.
* A. Delmas, P. Judd, S. Sharify, and A. Moshovos, [[https://
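Detecting the needed precision at runtime is cheap: for each group of activations processed together in a cycle, it is set by the highest significant bit across the group. A sketch under simplifying assumptions (unsigned values, group size 16, a 16-bit baseline; all illustrative choices of ours):

<code python>
import numpy as np

def group_precision(acts_group):
    # Bits needed by the largest magnitude in one processing group.
    m = int(np.max(acts_group))
    return max(1, m.bit_length())

def dynamic_precisions(acts, group=16):
    # Per-group precisions across a layer; a bit-serial engine would
    # take p cycles per group instead of a fixed worst-case 16.
    acts = np.asarray(acts).ravel()
    return [group_precision(acts[i:i + group])
            for i in range(0, len(acts), group)]

layer = np.random.randint(0, 256, size=256)   # stand-in activations
ps = dynamic_precisions(layer, group=16)
ideal_speedup = 16 * len(ps) / sum(ps)        # vs. a fixed 16-bit baseline
print(ps, ideal_speedup)
</code>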
** LOOM: An Accelerator for Embedded Devices **: When compute needs are modest, the design described below exploits both activation and weight precisions (a back-of-the-envelope sketch follows the references).
* S. Sharify, A. Delmas Lascorz, P. Judd, and A. Moshovos, [[https://
* S. Sharify, A. Delmas, P. Judd, K. Siu, and A. Moshovos, Design Automation Conference, June 2018 (here we use dynamic precision detection and also study the effects of off-chip traffic and on-chip buffering).
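Because the design is serial in both operands, the ideal cycle count for a tile scales with the product of the two precisions. A back-of-the-envelope helper (the 16-bit baseline and the example precisions are assumptions for illustration):

<code python>
def ideal_speedup(pa, pw, baseline_bits=16):
    # Ideal speedup of an engine bit-serial in both activations (pa bits)
    # and weights (pw bits) over a fixed-precision bit-parallel baseline.
    return (baseline_bits * baseline_bits) / (pa * pw)

# e.g., a layer that tolerates 8-bit activations and 4-bit weights:
print(ideal_speedup(8, 4))   # -> 8.0x over a 16x16-bit baseline
</code>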
+ | |||
+ | |||
** DPRed: ** This builds on the Dynamic Stripes work and shows that dynamic precision and per-group precision adaptation can yield further benefits (a toy illustration of the intuition follows the reference). Stay tuned for an update:
* A. Delmas, S. Sharify, P. Judd, M. Nikolic, and A. Moshovos, DPRed: Making Typical Activation Values Matter In Deep Learning Computing, [[https://
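The intuition is that typical activation values are small, so a single rare outlier should not force high precision on an entire layer. A toy comparison of layer-level versus per-group precision (our own illustrative model, not the paper's method):

<code python>
import numpy as np

def bits_needed(x):
    return max(1, int(x).bit_length())

def layer_vs_group_bits(acts, group=16):
    # One precision for the whole layer vs. one per group: a rare
    # large value inflates only its own group, not everything.
    acts = np.asarray(acts).ravel()
    layer_p = bits_needed(acts.max())
    group_p = [bits_needed(acts[i:i + group].max())
               for i in range(0, len(acts), group)]
    return layer_p, sum(group_p) / len(group_p)

acts = np.random.randint(0, 16, size=256)  # mostly small "typical" values
acts[7] = 255                              # one outlier
lp, gp = layer_vs_group_bits(acts)
print(lp, gp)   # layer needs 8 bits; the average group needs ~4
</code>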
+ | |||
** Bit-Tactical: **