Weekly Status

WW3, Jan. 13, 2009

  • Next week:
    • Elias: summary on the FAST project
    • Danyao: summary on the ProtoFlex project

WW4, Jan. 20, 2009

  • Raised issue: How detailed should these be modeled in functional simulation?
    • wrong-path execution
    • out-of-order execution
  • Next week:
    • Elias: summary on the Princeton Orion - Interconnect network simulator
    • Danyao: A-Port and ASim

WW5, Jan. 27, 2009

  • Next week:
    • Elias: interconnects
    • Danyao: interconnect modeling in GEMS and look for papers on what's important in CMP architectures

WW6, Feb. 3, 2009

  • Are cache coherence protocols a 'soft' layer on top of underlying interconnects?
    • GEMS' cache coherence protocols build directly on top of the underlying network, with protocol messages being the traffic in the network
  • Next week:
    • Elias: modeling nodes in interconnect networks
    • Danyao: GEMS interconnect model in more detail

WW7, Feb. 10, 2009

  • Find out more about packet-switched networks, especially in NoC applications
  • What are the tradeoffs in NoC system design?
  • Next week:
    • Elias: more about switches
    • Danyao: 1) typical topology used by GEMS; 2) Xilinx Microblaze & FSL

WW8, Feb. 17, 2009

  • No meeting due to reading week

WW9, Feb. 24, 2009

  • Natalie joined the meeting today
  • Most appropriate canonical router model seems to be a 5-stage VC router
    • What routing strategy?
  • Mesh and 2D torus are the most commonly used topologies
  • Next week:

WW10, Mar. 3, 2009

  • Elias presented slides on chapter 16 (router architecture) from Dally
  • Next:
    • Elias: chapter 17 from Dally

WW11, Mar. 10, 2009

  • No meeting. Andreas was away.

WW12, Mar. 17, 2009

  • Elias presented slides on chapter 17 (router datapath components) from Dally
  • Next:
    • Narrow down on the specifics of the simulator: what functions to model? how? design constraints?
    • Study an example network?

WW13, Mar. 24, 2009

  • Danyao presented an overview on the network simulator project [ PDF ]
  • Next:
    • Brain-storm on what to build
    • Generic interconnect for mesh-like topologies?

WW14, Mar. 31, 2009

  • Next:
    • Performance study of a full-system simulator (Flexus or GEMS)
    • Look for open cores for VC routers

WW15, Apr. 7, 2009

  • NoCem: interconnect simulator on FPGA
  • Netmaker: synthesizable VC router
  • Next:
    • Try SimFlex

WW16, Apr. 14, 2009

  • Got SimFlex running

WW18, Apr. 28, 2009

  • No meeting (due to exams)

WW19, May 5, 2009

  • No meeting

WW20, May 12, 2009

  • Bochs project presentation
  • We want to simulate up to 256 nodes on a single chip

WW21, May 19, 2009

  • No meeting

WW22, May 28, 2009

  • Brainstormed on project details: The Plan

WW23, Jun. 2, 2009

  • Playing with SimFlex to get aggregated bandwidth numbers between the processors and the interconnect
    • Asked Jason & Myrto for data they may have

WW28, Jul. 7, 2009

  • Simulator and SVN up
  • Added RouterIdeal class to model an unpipelined router (1 cycle routing delay).
  • We still need PQs to model the Router input buffers
    • Deadlock risk: Router does not know which PQs have packets for it. It could repeatedly route for a particular output port and overflow that PQ, at which point the simulation must be stalled, even though another PQ may have an earlier packet for this Router that will go to an idle output port…
    • How to allow the Router to choose a different input port if some output ports are blocked?
  • To-Do
    • Some scripts to generate the topology config files (hard part is the deadlock free routing table)
    • Buffer allocation scheme (one PQ per VC? per port?)
    • Non-uniform packet delivery latency

WW29, Jul. 14, 2009

  • Unsuccessful attempt at implementing credits in PQs instead of Routers
  • Functional unpipelined mesh element with credits
  • To-do:
    • Fix timestamp error when a packet is stalled due to credits

WW30, Jul. 20, 2009

  • Implemented a mesh element with credits
  • Timestamp problem fixed (credit stalls and bandwidth limits are correctly accounted for)
  • How to properly warm up the network?
  • Next
    • Pipeline stages in Router
    • Flow control in the FPGA simulator (experiment with buffer sizes)
    • Organizational chart
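
A minimal sketch of credit-based flow control on one link, as background for the mesh element with credits noted above. This is not the simulator's code: the class name CreditLink, its methods, and the buffer depth are assumptions for illustration only.

```python
from collections import deque


class CreditLink:
    """Sender-side view of one buffered link under credit-based flow control.

    Hypothetical illustration; names and structure are not from the simulator.
    """

    def __init__(self, buffer_depth):
        # Initial credits equal the number of free slots downstream.
        self.credits = buffer_depth
        self.downstream = deque()

    def send(self, flit):
        """Send a flit if a credit is available; return False on a credit stall."""
        if self.credits == 0:
            return False
        self.credits -= 1
        self.downstream.append(flit)
        return True

    def drain_one(self):
        """Downstream consumes one flit and returns one credit upstream."""
        flit = self.downstream.popleft()
        self.credits += 1
        return flit


link = CreditLink(buffer_depth=2)
sent = [link.send(f) for f in ("f0", "f1", "f2")]  # third send stalls
link.drain_one()                                    # frees one slot
retry = link.send("f2")                             # now succeeds
```

The stalled flit keeps its place at the sender until a credit returns, which is the situation where the timestamp of the stalled packet has to be adjusted.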

WW31, Jul. 27, 2009

  • On-chip network paper survey (see Research Survey)
    • Existing attempts at emulating NoC on FPGAs are limited to small networks (up to 4×4 mesh) and they are slow (Wolkotte)
    • VC routers may be expensive to implement (Schelle) - need to verify

WW32, Aug. 3, 2009

  • Added dedicated credit channels in the FlitQueues so that credits can be potentially handled in a separate interconnect parallel to the flit network
  • Merged output buffers of the Router into a single one and added a dedicated credit channel. This prevents credit flits with earlier timestamps from being blocked by data flits with later timestamps.
  • Implemented time-multiplexed Router that emulates simultaneous switching of multiple ports by switching one port each clock cycle
    • Head-of-line blocking is avoided by an input arbiter with rotating priority
    • Privatized TG/FQ → Router links so it's easier to implement the arbiter. The Router → TG/FQ communication still goes through the global interconnect.
    • Limitation is that now we're stuck with 5 ports per Router and we can only simulate topologies where the max fan-out degree is 5 (including the local node)
  • Removed credits between the local node and the router (assuming the local node can always consume incoming flits)
    • When we send credits, the data flits from the host node may arrive at the local router 1 FPGA cycle later in the crossbar interconnect than in the ideal interconnect. This makes it hard to implement deterministic arbitration among input ports in the receiving Router
    • Progress summary 1 (/doc/summary_0809.ppt)
  • Next
    • Can VCs be simply modeled using a separate FlitQueue? - Difficulty in implementing physical port bandwidth (Done. Implemented as separate FIFOs within a single FQ with unified bandwidth controls)
    • Wormhole routing (Done)
    • Push router output buffer flits to downstream FQs (Done. Rev. 48)
    • Resource estimates
    • Build multiple FIFOs in a single block RAM
    • Do we measure flit or packet latency? (Packet)
    • Typical lifetime of a flit (for timestamp bit width)? (~300 so far with 16-node torus)
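
A rough sketch of the time-multiplexed Router idea from this week's notes: one input port is served per clock, and the arbiter's starting port rotates each clock. The function names and the choice to rotate by one port per clock are assumptions; the notes do not spell out the exact rotation rule.

```python
NUM_PORTS = 5  # four neighbours plus the local node, as in the notes


def rotating_priority_pick(ready, start):
    """Return the first ready port at or after `start`, wrapping around.

    Returns None when no port is ready.
    """
    for offset in range(NUM_PORTS):
        port = (start + offset) % NUM_PORTS
        if ready[port]:
            return port
    return None


def serve_one_step(ready, start):
    """Serve every ready port, one per clock; return (service order, clocks used)."""
    ready = list(ready)
    order = []
    clk = 0
    while any(ready):
        # Assumed rotation rule: priority advances by one port each clock.
        port = rotating_priority_pick(ready, (start + clk) % NUM_PORTS)
        order.append(port)
        ready[port] = False
        clk += 1
    return order, clk


order, clocks = serve_one_step([True, False, True, True, False], start=3)
```

With three ready ports the simulated time step costs three clocks, which is why the number of ready ports per step drives simulator throughput.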

WW33, Aug. 10, 2009

  • Separated the output interfaces for credits and flits at the Routers. Performance improved from 5.9 clks/ts to 3.47 clks/ts (Rev. 45)
  • Added wormhole routing to RouterMesh (Rev. 52)
  • Added VC support in FlitQueue and RouterMesh (Rev. 55, see Flit Queue for a description)

WW34, Aug. 17, 2009

  • Make simulator model register stages and RAM latencies that will be in the hardware model
  • Working on Verilog modules to estimate hardware utilization
                       LUTs   FFs   BlockRAM
    Router             ~300   211   1 (128 x 8 bits = 1 Kb)
    FlitQueue (2 VCs)  306    342   Flit FIFO: 2 x 32 x 36 bits = 2 Kb
                                    Credit FIFO: 2 x 32 x 16 bits = 1 Kb
  • 1 RouterMesh = 1 Router + 5 FlitQueues ~= 1830 LUTs + 1921 FFs + 16 Kb blockRAMs
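
The RouterMesh total can be re-derived from the per-module figures above. This is only a small arithmetic check; the dictionary layout and function name are assumptions made for the example.

```python
# Per-module estimates copied from the figures above (Router LUTs are approximate).
ROUTER = {"luts": 300, "ffs": 211, "bram_kb": 1}
FLIT_QUEUE = {"luts": 306, "ffs": 342, "bram_kb": 2 + 1}  # flit FIFO + credit FIFO
FQS_PER_ROUTER_MESH = 5


def router_mesh_cost(router=ROUTER, fq=FLIT_QUEUE, n_fq=FQS_PER_ROUTER_MESH):
    """Sum one Router and n_fq FlitQueues for each resource type."""
    return {key: router[key] + n_fq * fq[key] for key in router}


cost = router_mesh_cost()
print(cost)
```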

WW35, Aug. 24, 2009

  • Small FIFO implementation
    • 16 x 8 bit SRL FIFO (Open Cores): 5 FFs, 21 LUTs
    • Coregen 16 x 8 bit shift register: 33 FFs, 24 LUTs
  • Scripts to run sweeps on simulator parameters

WW36, Aug. 31, 2009

  • Sweep on Interconnect and Router models over various traffic load: interconnect
  • Investigating performance bottlenecks
    • Minimum clocks required per sim step = max (2, packet_size, FQ bandwidth) with single-cycle injection of multi-flit packets
      • = packet_size without single-cycle injection
    • RouterMesh's performance = IdealRouter's performance / avg. number of ready ports per time step
    • Single-cycle injection of multi-flit packet helps ideal IC/mesh-ideal combination, but not others

WW37, Sep. 7, 2009

  • On vacation this week

WW38, Sep. 14, 2009

  • Bubble-avoidance in ideal IC/mesh-ideal router brings sim steps/clock to 1.
  • Bubbles in dual-crossbar and regular mesh Router are caused by the Router::can_increment() function. By allowing a Router to increment if only 1 available port is uninspected, sim steps/clock is improved from 0.31 to 0.44 at 50% load and from 0.23 to 0.31 at 100% load.
    • The can_increment condition is met if the ready signals from the input ports are all 0s or one-hot. Checking this can be done with 2 LUTs for 5 ports.
  • The Router with the most ready ports in a given time step determines the least number of clock cycles required to simulate this step. The average for this metric in our simulation is about 2.9 over 100 steps, which is consistent with the measured values of sim steps/clock.
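
The "all 0s or one-hot" test on the ready signals mentioned above can be sketched with the usual bit trick. The Python function below only mirrors the condition; it is not the actual Router::can_increment() code, and the integer encoding of the ready vector is an assumption.

```python
NUM_PORTS = 5


def can_increment(ready_bits):
    """True if at most one bit of the 5-bit ready vector is set.

    Clearing the lowest set bit with x & (x - 1) leaves zero exactly when
    x is zero or has a single bit set (one-hot).
    """
    ready_bits &= (1 << NUM_PORTS) - 1
    return (ready_bits & (ready_bits - 1)) == 0


# Of the 32 possible ready vectors, only all-zeros and the 5 one-hot
# patterns allow the time counter to advance.
allowed = [bits for bits in range(1 << NUM_PORTS) if can_increment(bits)]
```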

WW39, Sep. 21, 2009

  • Partitioned Crossbar: the grouping of FQs/TGs into destination partitions in the 16-node torus benchmark reduces simulator performance by 2.5x (2x from grouping 2 RTs per source partition, and probably some inefficiencies in crossbar allocation)
    • Optimal crossbar allocation (using bipartite matching) improves performance by ~2%
    • Placement of nodes on partition has bigger impact. Looking for a good cost function…
  • Added 16-node mesh network benchmark. Need more.
  • Changed input port arbitration scheme from rotating priority (based on CLK) to static priority because the iport.inspected flag already prevents head-of-line blocking, which the rotating priority scheme was introduced to address. Static priority gives deterministic results irrespective of the global interconnect model used.
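
For reference, optimal crossbar allocation via bipartite matching (mentioned above) can be sketched with the standard augmenting-path method. The request format and function name are assumptions; this is not the allocator used in the simulator.

```python
def max_crossbar_matching(requests, num_outputs):
    """Maximum bipartite matching between crossbar inputs and outputs.

    requests[i] lists the output partitions that input i has traffic for.
    Returns (match_out, granted), where match_out[o] is the input granted
    output o (or None) and granted is the number of matched pairs.
    """
    match_out = [None] * num_outputs

    def try_assign(i, visited):
        for o in requests[i]:
            if o in visited:
                continue
            visited.add(o)
            # Take a free output, or re-route the input currently holding it.
            if match_out[o] is None or try_assign(match_out[o], visited):
                match_out[o] = i
                return True
        return False

    granted = sum(try_assign(i, set()) for i in range(len(requests)))
    return match_out, granted


requests = [[0, 1], [0], [1, 2]]
match_out, granted = max_crossbar_matching(requests, num_outputs=3)
```

In this example a first-come greedy allocator would grant only two of the three inputs (input 0 takes output 0 and blocks input 1), while the matching grants all three.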

WW40, Sep. 28, 2009

  • Project positioning
    • Where does our project fit in the design space?
    • What kind of new applications can we enable?
  • Writing an abstract and a summary of related work
  • Trying to run SICOSYS and booksim and use them as a baseline for correctness and software simulator performance
  • Next:
    • Write introduction (add section on RAMP, ProtoFlex etc.)
    • SICOSYS/booksim measurement

WW41, Oct. 5, 2009

  • booksim vs. sim:
    • booksim does not seem to implement initial credits for buffers. When wait_for_tail_credit is 1 and packet size is smaller than VC buffer size, booksim stalls after buffering the first packet waiting for credit
    • When wait_for_tail_credit is 0, measured average latencies from booksim and icsim for 4×4 torus are the same (8.12921 vs. 8.13)
  • writing/introduction

WW42, Oct. 12, 2009

  • Need to process flits in their arrival order (so a flit stalled from the previous cycle should get priority over new flits that just arrived)
    • Saturation counter for age?
  • Model the fact that body/tail flits by-pass the routing stage. No effect now because routing delay is 1 for both head and other flits
  • Latency discrepancy
    • Cause 1: If only one port is left uninspected at a router, global time counter would be allowed to increment, hence we have to prioritize processing of this port first
    • Cause 2: credit should have same timestamp as out-going flit
    • Cause 3: Incoming flits at the router inputs should be used for global time counter throttling too
    • Eye-balling of traces from icsim seems fine (most flits arrive on-time and the late ones are due to two packets being injected close to each other). RNG seems random enough.
    • Sweeping over 5 seeds, the difference between the two simulators is smaller than the seed sweep standard deviation :-)
    • booksim with DOR will randomly select a direction if both directions are equally close

WW43, Oct. 19, 2009

  • Verify:
    • Distribution of packet latency (Done. See Verification)
    • Running longer to reduce random variation (Similar results as before)
    • Routing functions of booksim (turn-based) and how to implement in a table
      • Only do deterministic static routing for now. Butterfly networks fall into this category (flattened butterfly uses a small set of deterministic routes, so we can do that as well)

WW44, Oct. 26, 2009

  • Implemented RAMFIFO. Compared LFSR-based and counter-based implementations. In ISE, the LFSR-based version is 50% cheaper for a 36 bit x 64 deep FIFO based on block RAM (6-bit head/tail pointers)
  • Trying to get Timing Analyzer to work (need to update Debian to get libmotif3)
  • Read a few papers on on-chip network and trace-based simulation (see realistic_traffic)
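
To illustrate the LFSR-based versus counter-based pointer comparison for RAMFIFO above, here is a sketch of both 6-bit pointer update rules. The feedback taps (bit 5 XOR bit 4) are an assumed maximal-length choice; the notes do not say which polynomial the hardware uses.

```python
def lfsr6_next(state):
    """Advance a 6-bit Fibonacci LFSR (feedback = bit5 XOR bit4, assumed taps)."""
    feedback = ((state >> 5) ^ (state >> 4)) & 1
    return ((state << 1) | feedback) & 0x3F


def counter6_next(state):
    """Advance a plain 6-bit binary counter (needs a carry chain in hardware)."""
    return (state + 1) & 0x3F


def cycle_length(step, start):
    """Count updates until the pointer returns to its starting value."""
    state, n = step(start), 1
    while state != start:
        state = step(state)
        n += 1
    return n


lfsr_period = cycle_length(lfsr6_next, 1)
counter_period = cycle_length(counter6_next, 0)
print(lfsr_period, counter_period)
```

An XOR-feedback LFSR never leaves or enters the all-zero state, so it visits one address fewer than the counter unless extra logic is added; how RAMFIFO handles this is not stated in the notes.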

WW45, Nov. 2, 2009

  • Let TG injection lag behind global clock if output queue is full. This approach is used in booksim. No performance hit. (Rev. 124)
  • Working on Verilog (see hdl for progress)
  • Weird problem: FF counter decreases significantly after MAP (something must be wrong… @@). Only occurs for V2P, not V5.
    • Map packs output registers into I/O blocks in V2P. Not a problem when module is incorporated into higher level component. Just reporting issue…

WW46, Nov. 9, 2009

  • TGBernoulliFSM/TGBernoulli, FQCtrl module (hdl)
  • RAMFIFO: 2-cycle dequeue latency when switching contexts

WW47, Nov. 16, 2009

  • FlitQueue. TrafficGen.
  • Working on FlitQueue testbench
    • DPI on itchy.eecg: ModelSim uses gcc-4.0.2, which is incompatible with libc.so.6 (libc-2.7.so) on itchy. This causes a linker error (unrecognized file format) when compiling work/_dpi/export_wrapper.so during the simulation startup.
    • A workaround is to manually compile export_wrapper.so using the system gcc (4.3) and ModelSim seems to be able to use the library to start the simulation.
meeting_minutes.txt · Last modified: 2009/11/18 19:03 by danyao