wiki:aca2017:assignments [2018/04/22 19:30] (current) – Andreas Moshovos
====== Assignment #5 ======
Please read the following publication:

Then answer the following questions:

  * A DRAM chip contains several independent banks. At a high level, what is the sequence of operations that needs to be issued to a bank to read and write data? Briefly explain what a row is and what purpose it serves.
  * Briefly explain what the FR-FCFS policy is.
  * What is a Markov Decision Process? Please define it formally.
  * Why is a discounted cumulative reward function more appropriate for infinite-horizon problems?
  * Explain what Q-values are.
  * What is epsilon-greedy action selection? Why is it needed?
  * What is CMAC and why is it used here? What goal does it serve?
  * In Fig. 6(a), why are there two parallel vertical pipes?

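For the questions on Q-values and epsilon-greedy selection, a minimal tabular Q-learning sketch may help fix the ideas. The toy 2-state MDP below is invented purely for illustration; the paper itself uses CMAC-based function approximation rather than a plain table:

```python
import random

random.seed(0)  # for reproducibility

# Toy 2-state, 2-action MDP, invented for illustration.
STATES = (0, 1)
ACTIONS = (0, 1)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

# Q[s][a] estimates the expected discounted cumulative reward of taking
# action a in state s and then following the greedy policy.
Q = {s: {a: 0.0 for a in ACTIONS} for s in STATES}

def step(state, action):
    """Toy environment: only action 1 in state 1 is rewarded."""
    reward = 1.0 if (state == 1 and action == 1) else 0.0
    return reward, random.choice(STATES)

def select_action(state):
    """Epsilon-greedy: explore with probability EPSILON, otherwise exploit."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

state = 0
for _ in range(10_000):
    action = select_action(state)
    reward, next_state = step(state, action)
    # Q-learning update: move Q toward reward + GAMMA * max_a' Q[next_state][a'].
    target = reward + GAMMA * max(Q[next_state].values())
    Q[state][action] += ALPHA * (target - Q[state][action])
    state = next_state

# The greedy policy in state 1 now prefers the rewarding action.
assert Q[1][1] > Q[1][0]
```

Note that with all Q-values initialized to zero, a purely greedy agent would always break the tie toward action 0 and never discover the reward; this is exactly why the epsilon exploration term is needed.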
====== Assignment #4 ======
You are asked to think about the on-chip and off-chip memory system for our DaDianNao-like accelerator. Let's restrict attention to CNNs, where the layers are convolutions.

Let us assume you have an external memory interface which can provide X bytes/sec; say, 13,107,200 bytes/sec. Assuming a 1GHz operating frequency for our accelerator, that would translate into just 12.5 bytes/cycle.
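The conversion from an interface's bytes/sec to bytes available per accelerator cycle is just a division by the clock frequency. A one-line sketch (the 12.5 GB/sec interface figure below is an assumed example, not a number from the assignment):

```python
# Sketch: bytes/sec delivered by the memory interface, divided by the
# accelerator clock, gives the bytes available per accelerator cycle.
bytes_per_sec = 12.5e9  # assumed example: a 12.5 GB/sec interface
clock_hz = 1e9          # 1 GHz accelerator clock, as in the text
bytes_per_cycle = bytes_per_sec / clock_hz
print(bytes_per_cycle)  # 12.5 bytes available per cycle
```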
You are asked to think about strategies for allocating your on-chip memory so as to reduce off-chip traffic as much as possible and keep the execution cores fed as much as possible. We will provide you with the architecture of a few CNNs shortly. You will have to calculate how much traffic will be needed.

Here are some starting pointers. Option #1 is to assume that everything can fit on chip. Your traffic then is just to read the input activations for the first layer and then to write the output activations from the last layer.
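A sketch of the traffic in this everything-fits case (the layer shapes below are made-up placeholders, not one of the CNNs we will hand out):

```python
# Sketch: when all weights and all activations fit on chip, only the first
# layer's input activations and the last layer's output activations cross
# the chip boundary. Shapes below are invented placeholders.
BYTES_PER_VALUE = 2  # e.g., 16-bit values

def tensor_bytes(channels, height, width):
    """Size in bytes of a channels x height x width activation tensor."""
    return channels * height * width * BYTES_PER_VALUE

first_input_bytes = tensor_bytes(3, 224, 224)  # e.g., an RGB input image
last_output_bytes = tensor_bytes(1000, 1, 1)   # e.g., a 1000-way classifier

total_traffic = first_input_bytes + last_output_bytes
print(total_traffic)  # 303056 bytes: the only off-chip traffic in this case
```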
What if you cannot fit all of the filters and all of the activations on chip? What are the options? One approach would be to treat weights and activations separately. Let's assume that all activations do fit on chip. What happens when you cannot fit all the weights for all layers on chip? What other options exist? Try to identify cases and report what you find.

Then switch the problem around. What if you can fit the weights but not the activations? What are the choices there?

Finally, if you can fit neither the weights nor the activations, can you suggest good ways to compute the layers?

For all options, calculate the bandwidth that would be needed from off-chip. That is the number of bytes needed to read the activations and weights, plus the number of bytes needed to write the output activations, with the sum divided by the number of cycles the accelerator takes to do the computation. Can you compare and contrast different on-chip memory allocation policies between weights and activations with regard to the overall off-chip bandwidth?

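The calculation just described can be sketched as follows. The traffic and cycle counts below are hypothetical placeholders; plug in the numbers you derive for each CNN and each allocation policy:

```python
# Sketch: required off-chip bandwidth for one memory-allocation policy,
# expressed in bytes/cycle so it can be compared against what the
# external interface provides.
def required_bytes_per_cycle(input_act_bytes, weight_bytes,
                             output_act_bytes, compute_cycles):
    # Bytes read (input activations + weights) plus bytes written
    # (output activations), divided by the cycles the computation takes.
    total_traffic = input_act_bytes + weight_bytes + output_act_bytes
    return total_traffic / compute_cycles

# Hypothetical example: 2.4 MB of total traffic over 500,000 cycles.
need = required_bytes_per_cycle(300_000, 2_000_000, 100_000, 500_000)
print(need)  # 4.8 bytes/cycle needed from the off-chip interface
```

A policy is sustainable only if this figure does not exceed the bytes/cycle the external interface actually provides.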
[[wiki:
====== Assignment #3 ======