==== Warning :) ====

I have not written a simulator for the accelerator in question, so I do not have strong opinions on what needs to be done and how. Moreover, I do not have experience with the potential problems that may arise for this specific case. However, I do have experience with processor simulators, which I will happily share as needed. So approach this more as "how would you go about developing this simulator" rather than "do these precise steps". At the end, what I care about the most is that you document your approach, your thinking, the problems you encountered and the solutions you developed, along with a commentary on what you would do next if you had more time and on the lessons learned.

==== Goal: ====

You are going to build a simulator for the DaDianNao accelerator. This should be a cycle-accurate, behavioral simulator similar to what Simplescalar's sim-outorder is for general purpose dynamically-scheduled (OoO) processors. The simulator will have to model the functional units and the various memory buffers and should be configurable. It should support only inference. Your simulator should be able to accept as input a neural network (weights) and a set of data inputs (e.g., images for an image classification network) and model its execution cycle by cycle over DaDN. The inputs will be read from files; the specifics are given below. The simulator will have to print out various statistics about its execution.

==== Components that need to be modeled: ====

DaDianNao (from here on referred to as DaDN) consisted of 16 tiles and a 4MB Activation Memory (AM) (referred to as Neuron Memory in the publication). The AM provided 16 activations per cycle which it broadcast to all 16 tiles. These input activations were temporarily buffered in an input buffer (NBin) per tile. Each tile contained 16 filter lanes, each processing 16 weight (called synapses in the publication) and activation pairs. A local 2MB per-tile eDRAM Weight Memory (WM -- Synapse Buffer in the original publication) provided the 256 weights each tile needed per cycle. For each filter lane there were 16 multipliers feeding into a 16-input adder tree. The resulting sum passed through an activation function and was finally stored into an output buffer (NBout) prior to eventually being written back to AM. Writes to AM were performed by at most one tile per cycle and were for 16 output activations.

In summary, a complete model would include the following: tiles, WMs (per tile), AM, NBin (per tile), NBout (per tile), external memory, the link between AM and the tiles, the link between each WM and its tile, and the link(s) between external memory and the WMs and the AM. At a bare minimum, the simulator will have to model the tiles and the accesses to/from the WMs and the AM.

Finally, there is the control unit, the one that instructs all others what to do. The DaDN paper has a description of a simple "instruction" set, which is essentially a set of memory access engines that feed to and from the datapath. A significant part of this assignment is to think about how to go about developing this control unit. You do not have to worry about how it would look in hardware, only what functionality it should provide: what it should do, not how it should do it.

==== What to implement the simulator in ====

For speed reasons, I would favor C or C++ implementations. That said, feel free to use any programming language you wish as long as I can reasonably test at the end that the simulator works.
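As a rough, hypothetical illustration of the datapath described above (not a required interface), here is a minimal C++ sketch of how one tile could be modeled per cycle: 16 filter lanes, each multiplying 16 weight/activation pairs and reducing them through an adder tree into a per-lane partial sum, with ReLU applied once an output activation is complete. All names are placeholders, and fixed-point rescaling is ignored.

<code cpp>
#include <cstdint>

// Hypothetical per-tile datapath model; the sizes follow the DaDN defaults.
struct Tile {
    static const int FILTERS_PER_TILE = 16;  // filter lanes per tile
    static const int TERMS_PER_FILTER = 16;  // weight/activation pairs per lane per cycle

    // One partial sum per filter lane, accumulated across the cycles of a window.
    int32_t partial_sum[FILTERS_PER_TILE] = {0};

    // Model one cycle: consume the 16 broadcast activations and a 16x16 block of
    // weights (one row per filter lane) and multiply-accumulate them.
    void cycle(const int16_t activations[TERMS_PER_FILTER],
               const int16_t weights[FILTERS_PER_TILE][TERMS_PER_FILTER]) {
        for (int f = 0; f < FILTERS_PER_TILE; ++f) {
            int32_t lane_sum = 0;  // result of the 16-input adder tree
            for (int t = 0; t < TERMS_PER_FILTER; ++t)
                lane_sum += (int32_t)weights[f][t] * activations[t];
            partial_sum[f] += lane_sum;
        }
    }

    // Activation function (ReLU); rescaling/saturation of the 32-bit sum is omitted here.
    static int16_t relu(int32_t x) { return x > 0 ? (int16_t)x : 0; }
};
</code>

A full model would instantiate 16 such tiles, clock all of them every cycle, and decide when each partial sum is complete; that sequencing is exactly the functionality the control unit has to provide.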
==== Parameters to be supported: ====

For our first model let's try to keep things simple: ignore, for the most part, the latencies of the various components and assume that all weights and activations can be safely stored on chip. Accordingly, the simulator should support the following parameters:

  * Tile:
    * Number: default 16.
    * Filters/Tile: default 16.
    * Terms/Filter: how many weight and activation pairs to multiply-accumulate per cycle. Default 16.
  * For AM and WM assume that their sizes are 4MB and 2MB (per tile) respectively. You can for the time being ignore NBin and NBout. There will be separate address spaces for AM and WM.

==== Modeling Time ====

Have a global variable sim_cycle which refers to the current cycle (start from 0). By adding a "ready_cycle" field to each data value as it flows through the various units you can implement a rudimentary timing model. For example, when accessing a memory, you could have it return a latency or the absolute sim_cycle time at which the response will be available. This is a simplistic timing model but could be a good starting point beyond assuming unit latencies for everything. More accurate timing can be built on top of this by modeling the links, buffers, etc.

==== Modeling the Memories: ====

DaDN has several memory components: WM, AM, external memory, NBin and NBout. For our first simulator you do not have to model their timing behavior. However, we want to keep track of the memory reference counts and their source. For example, for the external memory we would want to know how many weights and activations were read and written. Furthermore, all accesses to these memories should be implemented via function calls even if these do nothing other than return or update data values. For each memory have one call that you consistently use to access it. For example, for a C implementation, this could be something like this:

''int AM_access(D_addr_t addr, uint_t nb, D_act_t *data, D_time_t *ready_cycle)''

Where:
  * addr: the address at which to initiate the access.
  * nb: how much data to read or write. Let's assume this is in bytes for the time being.
  * data: pointer to a buffer. For reads, the function will write the data read into it. For writes, it will read the data to write from it.
  * ready_cycle: at what cycle the request will complete.

These are suggestions; you may have to revise them as you develop the simulator. For the first version most of these will be left unused.

==== Statistics: ====

The simulator should print out relevant statistics at the end. These include: total cycle count, read/write breakdowns per memory component, a breakdown of the utilization of the tiles and filter lanes, and the input layer configuration. If possible you may want to reuse simplescalar's statistics package; however, you are not required to do so. Feel free to revise as you see fit, and explain your choices in your final report. Use this as your guide: what is the goal, and what measurements would you like to see when trying to build the hardware? For example, cycle count is semi-obvious: we want to know which option is faster. But today energy efficiency is sometimes more important than pure performance. What would you like to see measured to be able to estimate energy? For example, you may want to see how often individual units/memories are used.
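Putting the timing, memory-access, and statistics suggestions above together, here is one possible C++ sketch of a memory model that could sit behind a call such as AM_access(); the Memory struct, its field names, and the fixed one-cycle latency are assumptions for illustration, not a required design.

<code cpp>
#include <cstdint>
#include <cstring>
#include <vector>

typedef uint32_t D_addr_t;
typedef uint64_t D_time_t;

D_time_t sim_cycle = 0;  // global simulated cycle, advanced by the main loop

// Backing store plus reference counts for one memory component (AM, a WM, ...).
struct Memory {
    std::vector<uint8_t> bytes;              // 4MB for AM, 2MB for each WM, etc.
    uint64_t reads = 0, writes = 0;          // reference counts for the final statistics
    uint64_t bytes_read = 0, bytes_written = 0;
    D_time_t latency = 1;                    // assumed fixed latency for now

    explicit Memory(size_t size) : bytes(size, 0) {}

    // Every access goes through this one call so counting and timing stay in one place.
    // Returns 0 on success; *ready_cycle is the cycle at which the data is available.
    int access(D_addr_t addr, uint32_t nb, uint8_t *data, bool is_write,
               D_time_t *ready_cycle) {
        if ((size_t)addr + nb > bytes.size()) return -1;  // out-of-range access
        if (is_write) {
            std::memcpy(&bytes[addr], data, nb);
            ++writes;
            bytes_written += nb;
        } else {
            std::memcpy(data, &bytes[addr], nb);
            ++reads;
            bytes_read += nb;
        }
        if (ready_cycle) *ready_cycle = sim_cycle + latency;
        return 0;
    }
};
</code>

A thin AM_access() wrapper could then forward to a global Memory instance for the AM (and similarly one instance per WM), and the counters directly give the per-memory read/write breakdown requested under Statistics.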
==== Modeling the actual calculations: ====

For simplicity assume that all calculations are to be done using the datatype used in the input data files that we provide. For keeping track of time, assume that all calculations take unit time. Feel free to make simplifications or to improve the model as you see fit, and explain your choices in your report. For example, a real implementation may use multiple cycles to multiply-accumulate each set of inputs. For the activation function assume that you can use ReLU. Ideally, you will have an option to choose the activation function; for the time being you can provide support only for ReLU.

==== What to hand-in: ====

A tarball with your code and a report describing your major choices and how you addressed them, plus a description of your statistics.

==== How to start: ====

Use simplescalar as a guide on how to structure your simulator. Start with a main function containing a loop; each loop iteration can be a simulated cycle. Have a global sim_cycle variable to keep track of which cycle you are currently in. Use functions to model accesses to the various components; this will allow you to expand the simulator later on if needed. You will have to think about how to map the weights and activations onto the memory address spaces.

==== Sample Inputs/Networks ====

**This will be revised soon. Ignore for the time being.**

Milos Nikolic was kind enough to prepare the following sample inputs which you can use to test your simulator. We will clarify some of the information below later on; for the time being this should be sufficient to get you started. The link below contains weights (*_w) and biases (*_b) for LeNet and Network in Network (NiN). Additionally, I added the inputs and outputs for 100 images, prototxts and prototxt visualizations from caffe. I have attached the scripts used to obtain these. All are contained in separate npy files as signed 16-bit ints.

Number of bits above the binary point, excluding the sign bit:

  * Lenet activations: [1,3,3,3]
  * Lenet weights: [0,0,0,0]
  * Nin activations: [10,10,9,12,12,11,11,11,10,10,10,9]
  * Nin weights: [0,1,1,0,0,0,0,0,0,0,0,0]

These are in order and cover only the convolution and fully connected layers. I used 7 bits above the binary point for the result.

[[ https://www.dropbox.com/s/hkt4x9kq9ref0st/per_layer_weights.zip?dl=0|Get the networks and inputs from dropbox.]] This was updated on March 23.

Here are two python scripts that can read the above files: {{ :wiki:aca2017:scripts.zip |}}

Here's further info from Milos: The values are all stored as 16-bit ints (if you load them in numpy, you will get an int). If you look at them in binary, the first n+1 bits are the integer part, while the rest are the fraction bits, where n is the precision I included in the lists above. For lenet layer one, input activations will be 2.14 and weights are 1.15. The npy files contain python dictionaries with the values; the layer names are the keys. After loading a file into a variable var, each layer's parameters are accessed as var['layer name'], which will be a 4D array. I never used it, but there is a git repo with code to load npy in C/C++ (https://github.com/rogersce/cnpy).

The numbers should be interpreted as fixed point values with the following format. The layers are in the same order as the prototxt and the diagram and include only convolution and inner product layers.

  * Lenet layers: conv1 conv2 ip1 ip2
  * Lenet activations: 2.14 4.12 4.12 4.12 8.8
  * Lenet weights: 1.15 1.15 1.15 1.15
  * Nin layers: conv1 cccp1 cccp2 conv2 cccp3 cccp4 conv3 cccp5 cccp6 conv4-1024 cccp7-1024 cccp8-1024
  * Nin activations: 11.5 11.5 10.6 13.3 13.3 12.4 12.4 12.4 11.5 11.5 11.5 10.6 8.8
  * Nin weights: 1.15 2.14 2.14 1.15 1.15 1.15 1.15 1.15 1.15 1.15 1.15 1.15

Activations show the format of the input into the layer.
The bias and output of the layer should follow the format of the following layer. Weights follow the format of the current layer.
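As a small, hypothetical C++ helper for interpreting these files: with n bits above the binary point (excluding the sign bit), a signed 16-bit value has 15 - n fraction bits, so a 2.14 activation uses n = 1 and a 1.15 weight uses n = 0. The function and example values below are for checking results only; the simulator itself can keep computing on the raw 16-bit integers as stated earlier.

<code cpp>
#include <cstdint>
#include <cstdio>

// Convert a raw signed 16-bit fixed-point value to a double for verification.
// n = number of bits above the binary point, excluding the sign bit
// (n = 1 for 2.14 LeNet conv1 activations, n = 0 for 1.15 weights).
double fixed_to_double(int16_t raw, int n) {
    int fraction_bits = 15 - n;  // 16 bits total = sign bit + n integer bits + fraction bits
    return (double)raw / (double)(1 << fraction_bits);
}

int main() {
    // 0x4000 is 1.0 in 2.14 format and 0.5 in 1.15 format.
    std::printf("%f %f\n", fixed_to_double(0x4000, 1), fixed_to_double(0x4000, 0));
    return 0;
}
</code>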