Find out more about packet-switched networks, especially in NoC applications
What are the tradeoffs in NoC system design?
Next week:
Natalie joined the meeting today
Most appropriate canonical router model seems to be a 5-stage VC router
Mesh and 2D torus are the most commonly used topologies
Next week:
Danyao presented an overview of the network simulator project [PDF]
Next:
Brainstormed on project details:
The Plan
Implemented a mesh element with credits
Timestamp problem fixed (credit stalls and bandwidth limits are correctly accounted for)
How to properly warm up the network?
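The credit mechanism behind the mesh element above can be sketched as follows. This is a minimal Python sketch and not the simulator's code: the class name `CreditLink`, its method names, and the zero-latency credit return are all assumptions made for illustration.

```python
from collections import deque


class CreditLink:
    """Upstream view of one link with credit-based flow control (hypothetical)."""

    def __init__(self, buffer_depth):
        # One credit per free slot in the downstream buffer.
        self.credits = buffer_depth
        self.downstream = deque()

    def can_send(self):
        return self.credits > 0

    def send(self, flit):
        """Send a flit if a credit is available; return False on a credit stall."""
        if not self.can_send():
            return False
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def consume(self):
        """Downstream drains one flit and returns a credit (zero latency here)."""
        flit = self.downstream.popleft()
        self.credits += 1
        return flit
```

In the real design the credit travels back with its own timestamp, so the returned credit would only become usable after the credit channel delay.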
Next
Added dedicated credit channels in the FlitQueues so that credits can be potentially handled in a separate interconnect parallel to the flit network
Merged output buffers of the Router into a single one and added a dedicated credit channel. This prevents credit flits with earlier timestamps from being blocked by data flits with later timestamps.
Implemented time-multiplexed Router that emulates simultaneous switching of multiple ports by switching one port each clock cycle
Head-of-line blocking is avoided by an input arbiter with rotating priority
Privatized TG/FQ → Router links so it's easier to implement the arbiter. The Router → TG/FQ communication still goes through the global interconnect.
Limitation is that now we're stuck with 5 ports per Router and we can only simulate topologies where the max fan-out degree is 5 (including the local node)
Removed credits between the local node and the router (assuming the local node can always consume incoming flits)
When we send credits, the data flits from the host node may arrive at the local router 1 FPGA cycle later in the crossbar interconnect than in the ideal interconnect. This makes it hard to implement deterministic arbitration among input ports in the receiving Router.
Progress summary 1 (/doc/summary_0809.ppt)
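The rotating-priority input arbiter mentioned above can be sketched as below. This is an illustrative Python model, assuming the highest-priority port simply advances by one each clock cycle; the function name `rotating_priority_grant` and its arguments are hypothetical and do not come from the simulator source.

```python
def rotating_priority_grant(requests, cycle):
    """Pick one requesting input port; priority rotates with the clock cycle.

    requests: list of booleans, one per input port (True = port has a flit).
    cycle:    current clock cycle; port (cycle % n) has top priority.
    Returns the granted port index, or None if nothing is requesting.
    """
    n = len(requests)
    start = cycle % n
    for offset in range(n):
        port = (start + offset) % n
        if requests[port]:
            return port
    return None
```

Calling the same function with `cycle` fixed at 0 gives a static-priority arbiter, which makes it easy to compare the two schemes on the same request pattern.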
Next
Can VCs be simply modeled using a separate FlitQueue? - Difficulty in implementing physical port bandwidth (Done. Implemented as separate FIFOs within a single FQ with unified bandwidth controls)
Wormhole routing (Done)
Push router output buffer flits to downstream FQs (Done. Rev. 48)
Resource estimates
Build multiple FIFOs in a single block RAM
Do we measure flit or packet latency? (Packet)
Typical lifetime of a flit (for timestamp bit width)? (~300 so far with 16-node torus)
| | LUTs | FFs | BlockRAM |
| Router | ~300 | 211 | 1 (128 x 8 bits = 1 Kb) |
| FlitQueue (2 VCs) | 306 | 342 | Flit FIFO: 2 x 32 x 36 bits = 2 Kb; Credit FIFO: 2 x 32 x 16 bits = 1 Kb |
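The flit lifetime of ~300 noted above bounds how wide a timestamp must be. The sketch below assumes timestamps are stored in a fixed number of bits and wrap around, and that two timestamps are ordered by their signed modular difference; both are assumptions about the design, and the function names are hypothetical.

```python
def min_timestamp_bits(max_lifetime):
    """Smallest width w such that 2**(w-1) > max_lifetime.

    With wrapping timestamps, two values can be ordered unambiguously only
    if their true difference is below half the timestamp range.
    """
    bits = 1
    while (1 << (bits - 1)) <= max_lifetime:
        bits += 1
    return bits


def is_earlier(a, b, bits):
    """True if wrapped timestamp a precedes wrapped timestamp b."""
    mask = (1 << bits) - 1
    diff = (a - b) & mask
    # A difference in the upper half of the range is negative in two's complement.
    return diff >= (1 << (bits - 1))
```

Under these assumptions a lifetime of 300 needs 10 bits, which fits in the 16-bit credit FIFO entries listed in the table.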
Bubble-avoidance in ideal IC/mesh-ideal router brings sim steps/clock to 1.
Bubbles in dual-crossbar and regular mesh Router are caused by the Router::can_increment() function. By allowing a Router to increment if only 1 available port is uninspected, sim steps/clock is improved from 0.31 to 0.44 at 50% load and from 0.23 to 0.31 at 100% load.
The Router with the most ready ports in a given time step determines the least number of clock cycles required to simulate this step. The average for this metric in our simulation is about 2.9 over 100 steps, which is consistent with the measured values of sim steps/clock.
Partitioned Crossbar: the grouping of FQs/TGs into destination partitions in the 16-node torus benchmark reduces simulator performance by 2.5x (2x from grouping 2 RTs per source partition, and probably some inefficiencies in crossbar allocation)
Added 16-node mesh network benchmark. Need more.
Changed input port arbitration scheme from rotating priority (based on CLK) to static priority because the iport.inspected flag already prevents head-of-line blocking, which the rotating priority scheme was introduced to address. Static priority gives deterministic results irrespective of the global interconnect model used.
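The link between ready ports and sim steps/clock noted above can be checked with a small calculation. The sketch assumes each simulated step costs exactly as many clocks as the busiest Router has ready ports (one port per clock, minimum one clock); the function name and the sample trace are hypothetical, not measured data.

```python
def sim_steps_per_clock(max_ready_ports_per_step):
    """Estimate sim steps/clock from the per-step maximum of ready ports.

    Assumes a step takes max(1, busiest Router's ready ports) clocks,
    since the time-multiplexed Router switches one port per clock.
    """
    clocks = sum(max(1, m) for m in max_ready_ports_per_step)
    return len(max_ready_ports_per_step) / clocks


# Hypothetical trace of the busiest Router's ready-port count over 4 steps.
example_trace = [2, 3, 4, 3]
estimate = sim_steps_per_clock(example_trace)
```

With the reported average of about 2.9 ready ports, the same formula gives 1/2.9, or roughly 0.34 steps per clock.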
booksim vs. sim:
booksim does not seem to implement initial credits for buffers. When wait_for_tail_credit is 1 and the packet size is smaller than the VC buffer size, booksim stalls after buffering the first packet, waiting for a credit.
When wait_for_tail_credit is 0, measured average latencies from booksim and icsim for 4×4 torus are the same (8.12921 vs. 8.13)
writing/introduction
Need to process flits in their arrival order (so a flit stalled from the previous cycle should get priority over new flits that just arrived)
Model the fact that body/tail flits bypass the routing stage. No effect now because the routing delay is 1 for both head and other flits
Latency discrepancy
Cause 1: If only one port is left uninspected at a router, global time counter would be allowed to increment, hence we have to prioritize processing of this port first
Cause 2: the credit should have the same timestamp as the outgoing flit
Cause 3: Incoming flits at the router inputs should be used for global time counter throttling too
Eye-balling of traces from icsim seems fine (most flits arrive on-time and the late ones are due to two packets being injected close to each other). RNG seems random enough.
Sweeping over 5 seeds, the difference between the two simulators is smaller than the seed sweep standard deviation
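The seed-sweep check above can be written as a short comparison. This is a sketch only: the function name is hypothetical, the choice to compare against the smaller of the two standard deviations is an assumption, and the latency lists are placeholder values, not measurements from either simulator.

```python
from statistics import mean, stdev


def within_seed_noise(latencies_a, latencies_b):
    """True if the gap between mean latencies is below the seed-to-seed spread.

    Each list holds one average latency per RNG seed for one simulator.
    The smaller of the two sample standard deviations is used as the
    noise level, which is the stricter choice.
    """
    gap = abs(mean(latencies_a) - mean(latencies_b))
    noise = min(stdev(latencies_a), stdev(latencies_b))
    return gap < noise


# Placeholder values for illustration only (5 seeds each).
sim_a = [8.10, 8.14, 8.12, 8.16, 8.13]
sim_b = [8.11, 8.15, 8.12, 8.14, 8.13]
agree = within_seed_noise(sim_a, sim_b)
```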

booksim with DOR will randomly select a direction if both directions are equally close
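The tie case described above occurs in a ring with an even number of nodes when the destination is exactly halfway around. The sketch below shows one way to pick the direction along a single torus dimension with a random tie-break; it is not booksim's code, and the function name and signature are hypothetical.

```python
import random


def ring_direction(src, dst, k, rng=random):
    """Direction of the next hop along one torus dimension of size k.

    Returns +1 or -1 for the shorter way around the ring, 0 if src == dst,
    and a random choice of +1 or -1 when both ways are equally long.
    """
    if src == dst:
        return 0
    forward = (dst - src) % k   # hops in the +1 direction
    backward = k - forward      # hops in the -1 direction
    if forward < backward:
        return +1
    if backward < forward:
        return -1
    return rng.choice((+1, -1))
```

In a 4×4 torus this tie happens whenever source and destination are 2 hops apart in a dimension, so two simulators with different tie-break rules can route the same packet differently.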
Implemented RAMFIFO. Compared LFSR-based and counter-based implementations. In ISE, the LFSR-based version is 50% cheaper for a 36-bit x 64-deep FIFO based on block RAM (6-bit head/tail pointers)
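An LFSR-based pointer replaces the binary incrementer with a shift and one XOR. The Python model below sketches a 6-bit LFSR pointer sequence; the tap positions (bits 6 and 5) are an assumption chosen because they give a maximal-length sequence, and the function names are hypothetical rather than taken from the RAMFIFO source.

```python
def lfsr6_next(state):
    """Advance a 6-bit Fibonacci LFSR by one step (taps at bits 6 and 5)."""
    feedback = ((state >> 5) ^ (state >> 4)) & 1
    return ((state << 1) & 0x3F) | feedback


def lfsr6_sequence(seed=1):
    """List the pointer values visited before the sequence returns to seed."""
    seen = []
    state = seed
    while True:
        seen.append(state)
        state = lfsr6_next(state)
        if state == seed:
            return seen
```

A plain 6-bit LFSR never leaves or enters the all-zero state, so it cycles through 63 addresses rather than 64; using every entry of a 64-deep FIFO would need extra logic to include the zero state.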
Trying to get Timing Analyzer to work (need to update Debian to get libmotif3)
Read a few papers on on-chip networks and trace-based simulation (see realistic_traffic)