On-Chip Interconnect

The global on-chip interconnect carries all communication among the simulator nodes (Traffic Generators, Flit Queues and Routers). The preferred interconnect architecture should be simple and provide high throughput and low latency. Conceptually, the on-chip interconnect is a black box that delivers flits from source nodes to destination nodes in the order of their timestamps. To maintain simulation correctness, the flit stream for each source-destination pair must be delivered strictly in order of increasing timestamp; the global ordering across different pairs may be relaxed.
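The ordering requirement can be stated as a small check. The following is a minimal sketch, not code from the simulator: the function name and the `(src, dst, timestamp)` tuple layout are assumptions made for illustration, and the sketch tolerates equal timestamps within a pair, which the real design may or may not allow.

```python
def pairwise_in_order(delivered):
    """Check the per-pair ordering rule on a delivery trace.

    delivered: list of (src, dst, timestamp) tuples in the order the
    interconnect delivered them (hypothetical layout).
    Returns True if timestamps never go backwards within any
    (src, dst) pair. Order across different pairs is not checked,
    because the global ordering may be relaxed.
    """
    last_seen = {}
    for src, dst, ts in delivered:
        pair = (src, dst)
        if pair in last_seen and ts < last_seen[pair]:
            return False
        last_seen[pair] = ts
    return True
```

A trace such as `[(0, 1, 1), (2, 1, 0), (0, 1, 2)]` passes even though timestamp 0 arrives after timestamp 1, because the two flits belong to different source-destination pairs.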

All flits in the simulator flow either from TG/FQs to Routers or vice versa. When the Router node has a fixed number of ports (e.g. with RouterMesh), the TG/FQ to Router interconnect can be privatized by providing direct connections between each TG/FQ and the Router it connects to.

We experiment with a few different approaches in the simulator. The Interconnect models can be found in sim/interconnect.[h/cpp].

Ideal

The ideal interconnect (InterconnectIdeal) has infinite bandwidth. In any given time step, all flits that are ready are delivered. Since the simulator nodes may take more than one FPGA cycle to produce all the flits of a single simulation time step, the Ideal interconnect increments the global simulation time counter only when all ready flits have a future timestamp. This results in a one-cycle bubble.
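One way to read the time-advance rule is sketched below. This is an illustrative model under stated assumptions, not the InterconnectIdeal source: `ideal_step` and the `(timestamp, payload)` flit layout are hypothetical, and a flit is treated as deliverable when its timestamp is not later than the current simulation time.

```python
def ideal_step(pending, now):
    """Model one FPGA cycle of an ideal (infinite-bandwidth) interconnect.

    pending: list of (timestamp, payload) flits the nodes have produced
             so far (hypothetical layout).
    now:     current value of the global simulation time counter.
    Returns (delivered, remaining, new_now).
    """
    due = [f for f in pending if f[0] <= now]
    remaining = [f for f in pending if f[0] > now]
    if due:
        # Infinite bandwidth: every due flit is delivered this cycle.
        # Time is held, since nodes may still produce flits for `now`.
        return due, remaining, now
    # Every pending flit lies in the future: advance the counter.
    # Nothing is delivered in this cycle, which is the bubble.
    return [], remaining, now + 1
```

Stepping this model with two flits at time 0 and one at time 1 takes one cycle to deliver the first two and a second, empty cycle to advance the counter, so time step 0 costs two cycles in total.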

Crossbar

The Crossbar (InterconnectCrossbar) implements a full crossbar connecting all Routers. Each input and output port of the crossbar has a bandwidth of one flit per cycle. Hence, the maximum throughput of the Crossbar is N flits per cycle in a network with N Routers. In the Crossbar and all its variants, the FlitQueue/TrafficGen → Router portion of the interconnect is implemented as parallel wires connecting the FQ/TGs directly to the corresponding input ports on the Router. This is because the Router architecture assumes a specific number of ports, fixed at design time. While a full interconnect that enables arbitrary FQ/TG → Router connections would be more flexible, and could be configured into a greater number of irregular topologies given the same number and composition of FQ/TG/Router nodes implemented on the FPGA, it may not justify the area cost and performance degradation (data?).
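The bandwidth-one constraint can be illustrated with a single arbitration cycle. This is a minimal sketch, not the InterconnectCrossbar implementation: `crossbar_cycle` is a hypothetical name, and the fixed-priority policy (lowest input index wins) is an assumption, since the text does not describe the arbitration scheme.

```python
def crossbar_cycle(heads):
    """Grant crossbar connections for one cycle.

    heads: dict mapping an input port index to the output port that the
           flit at the head of that input wants to reach. Inputs with no
           flit are simply absent.
    Returns a dict of granted input -> output connections. Each input
    and each output carries at most one flit per cycle.
    """
    granted = {}
    busy_outputs = set()
    for inp in sorted(heads):  # fixed priority: an assumption
        out = heads[inp]
        if out not in busy_outputs:
            granted[inp] = out
            busy_outputs.add(out)
    return granted
```

When every input targets a distinct output, all N flits are granted in one cycle, which is the maximum throughput mentioned above. When two inputs target the same output, one of them waits for a later cycle.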

Similar to the Ideal interconnect, the Crossbar creates a one-cycle bubble when incrementing the global simulation counter. Somehow the Crossbar hides this bubble better…

DualCrossbar

The DualCrossbar (InterconnectDualCrossbar) doubles the throughput of the Crossbar by duplicating the interconnect and dedicating one copy to the credit traffic stream and the other to the flit traffic stream.

DualCrossbarPort

Both the Crossbar and the DualCrossbar provide only one crossbar input for each Router, so that the size of the interconnect scales linearly with the number of Router nodes in the simulator, and not quadratically with both the number of Routers and the number of ports in each Router. This means that the outgoing flits for all output ports of a Router are serialized, which can reduce the throughput of the interconnect. To measure the performance impact of this serialization, the DualCrossbarPort (InterconnectDualCrossbarRTPort) gives each Router output port its own input port on the crossbar.
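The trade-off between crossbar size and serialization can be made concrete with a rough sketch. Both helper functions below are hypothetical, the five-port Router in the usage note is an example value rather than a figure from the simulator, and contention at the crossbar outputs is ignored.

```python
def crossbar_inputs(num_routers, ports_per_router, per_port):
    """Number of crossbar input ports needed.

    per_port=False: one shared input per Router (Crossbar, DualCrossbar).
    per_port=True:  one input per Router output port (DualCrossbarPort).
    """
    return num_routers * (ports_per_router if per_port else 1)


def drain_cycles(outgoing_flits, per_port):
    """Cycles a Router needs to hand one time step's outgoing flits to
    the crossbar, assuming at most one flit per output port per time
    step and no contention at the crossbar outputs."""
    if outgoing_flits == 0:
        return 0
    return 1 if per_port else outgoing_flits
```

For 16 Routers with, say, 5 output ports each, the shared design needs 16 crossbar inputs and the per-port design needs 80. In exchange, a Router with 4 outgoing flits in a time step needs 4 cycles to drain them through a shared input but only 1 cycle with per-port inputs.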

Performance

The left figure below shows the simulator performance, measured as the number of time steps that can be simulated per simulator clock cycle (higher is better). The 16node-torus benchmark is used (see benchmarks).

In addition to the three different types of interconnect models, we also tried two different types of Router models. See router for a description of the Router architectures. In short, the blue line (Ideal IC, Mesh-Ideal) represents the best possible performance (with the simulation time counter bubble). The yellow line (Dual-Crossbar-full, Mesh-Ideal) represents the performance when each Router output port gets its own crossbar input port. Since the crossbar hides the bubble better than the Ideal interconnect, the yellow line is slightly better than the blue one. The drop from the yellow line to the orange line (Dual-Crossbar, Mesh-Ideal) represents the performance degradation due to the serialization of outgoing flits at the Router. The further drop from the orange line to the green line (Dual-Crossbar, Mesh Router), which overlaps with the purple line (Ideal IC, Mesh Router), represents the performance loss due to the serialization of input ports at the Router. The overlap of the green and purple lines indicates that the serialization of input and output ports at the Router dominates the performance, and that the bubble in the interconnect is less important. Finally, the light blue line (Dual-Crossbar-full, Mesh-full) uses a Router model where only the input ports are serialized and not the output ports. This line is identical to the green and purple lines, indicating that input port serialization is indeed responsible for all the performance loss from the yellow line.

The top right figure shows the number of flits (data and credit) delivered over the global interconnect from Routers to FQ/TGs per clock cycle. In the 16node-torus benchmark at 100% offered traffic, 23 flits traverse a network link in each time step during steady state. For each flit traversing a link, a credit will eventually cross the link in the opposite direction, for a total of 46 flits per time step. However, since we can only simulate a time step every 2 clock cycles (blue line in the top left figure), the maximum throughput of the simulator is 23 flits/clock, which is represented by the blue line. The mesh Router (green, purple and light blue lines) can only process one input port per clock cycle, so the throughput it can deliver depends on how many input ports hold flits to be routed in each time step. In the 16node-torus benchmark, the bottleneck Router has to route 4 flit streams from 4 input ports; each port has a 50% probability of having a ready flit at any given time step (since the maximum channel loading is 2, the maximum offered traffic is 0.5 flit/link/time step). Hence, we expect the throughput of the mesh Router to be about 11.5 flits/clock, or half the throughput of the ideal Router. We observe a throughput of 11 flits/clock.
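The arithmetic in this paragraph can be reproduced with a few lines. The sketch below only restates the numbers given in the text; the function names are hypothetical, and the "implied clocks per time step" is simply the flits per time step divided by a throughput, not a measured quantity.

```python
def throughput(flits_per_step, clocks_per_step):
    """Flits delivered per clock, given flits per time step and the
    number of clocks needed to simulate one time step."""
    return flits_per_step / clocks_per_step


def expected_busy_ports(num_ports, p_ready):
    """Expected number of input ports holding a flit in a time step."""
    return num_ports * p_ready


def implied_clocks(flits_per_step, flits_per_clock):
    """Clocks per time step implied by a given throughput."""
    return flits_per_step / flits_per_clock


DATA_FLITS = 23                # link traversals per time step (from the text)
TOTAL_FLITS = 2 * DATA_FLITS   # one credit per data flit

ideal = throughput(TOTAL_FLITS, 2)            # 23.0 flits/clock
busy = expected_busy_ports(4, 0.5)            # 2.0 ports at the bottleneck Router
expected_mesh = ideal / 2                     # 11.5 flits/clock, as estimated in the text
mesh_clocks = implied_clocks(TOTAL_FLITS, expected_mesh)  # 4.0 clocks per time step
```

Read this way, the expected mesh Router figure corresponds to about 4 clocks per time step, and the observed 11 flits/clock corresponds to slightly more than 4.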

Note

  1. The minimum number of clocks required to simulate a time step is equal to the number of flits in a packet, because TGs can only inject one flit per simulator clock cycle. This can be ameliorated if the injection process understands the bandwidth of the injection link and adjusts the timestamps of the injected flits accordingly.
  2. Increasing the Router input buffer space from 1 to 2 or higher does not significantly affect simulator performance.
interconnect.txt · Last modified: 2009/08/31 03:35 by danyao