This page contains a summary of ongoing research projects that use FPGAs to accelerate computer architecture simulation.
Genko and Atienza focused on the design of TG (Traffic Generator) and TR (Traffic Receptor) nodes. Switches are generated using the Xpipes compiler. Because this is an emulation, the timing, resource contention and flow control details in the interconnects are not modeled. Consequently, the behaviour of the network presented may not be representative of real systems. Program traces are presented as an alternative to synthetic traffic for loading the network; however, it is unclear how the traces are obtained (the paper only mentions that a trace is a set of packet descriptors), what kind of traces they are, and how trace playback is timed. In addition, only small test cases are implemented in this system (3×2 and 2×2).
Wolkotte et al. trade off simulation time against on-chip resources by time-multiplexing all N routers in an NoC onto a single HDL model on the FPGA, which implements a five-port, 4-VC wormhole router. Because the simulation is completely serialized, each system clock requires 2N FPGA clocks to simulate. An on-board ARM processor generates and feeds the stimulus flits to the FPGA, collects results after a fixed number of simulated cycles and computes statistics from the results. The software components run in parallel with the hardware simulation to amortize the communication cost. The final FPGA router runs at 6.6 MHz while the ARM processor runs at 84 MHz. The size of the simulation is limited only by the amount of on-chip block RAM available, which is used to store the “contexts” of the routers. Even though only synthetic traffic is presented in the paper, it should not be hard to use a network trace instead. However, it is unclear how inter-router communication is handled during a simulation period. This could potentially lead to buffer overflow on congested links, which is not addressed in the paper.
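The time-multiplexing idea can be sketched in software as follows. This is only an illustration, not the authors' design: the `RouterContext` fields, the toy "router logic" and the split into an evaluate pass and a commit pass (one plausible reading of the 2N factor) are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class RouterContext:
    """Toy per-router state, standing in for one context entry in block RAM."""
    flits_seen: int = 0       # committed state
    next_flits_seen: int = 0  # value computed in the evaluate pass


def simulate_system_cycle(contexts, flits_in):
    """Advance all N routers by one system cycle with a single shared logic block.

    Returns the number of FPGA clocks consumed (2N under the assumed schedule).
    """
    fpga_clocks = 0
    # Pass 1 (assumed): evaluate every router in turn from the old state.
    for ctx, arriving in zip(contexts, flits_in):
        ctx.next_flits_seen = ctx.flits_seen + arriving
        fpga_clocks += 1
    # Pass 2 (assumed): commit the new state, one router per FPGA clock.
    for ctx in contexts:
        ctx.flits_seen = ctx.next_flits_seen
        fpga_clocks += 1
    return fpga_clocks


def effective_rate_hz(fpga_clock_hz, num_routers):
    """Simulated system clock rate when each system clock costs 2N FPGA clocks."""
    return fpga_clock_hz / (2 * num_routers)
```

If the 6.6 MHz figure is taken as the FPGA clock, a hypothetical 36-router network would advance at about 6.6 MHz / 72 ≈ 91.7 kHz of simulated time.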
Marescaux et al. describe an on-chip network for heterogeneous systems-on-chip. The main concept is the use of three different networks to carry control, configuration and data communications. This seems rather straightforward.
NoCem (an OpenCores project) offers a synthesizable model of an on-chip network. It supports configurable topology, bus width, VCs, etc. Resource utilization is the bottleneck; the XC2VP30 part only fits a 4×4 mesh with a 32-bit data width. Traffic stimuli are provided through MicroBlaze processors to emulate real coherence traffic. The applications used are unknown.
Schelle and Grunwald emulate NoCs by building VC switches on the FPGA. Resource utilization limits the size of their NoC to 4×4 (on an XC2VP30). It appears that VC-based routers are expensive because of the control logic for arbitrating and allocating VC resources and the crossbar. One way to reduce the resource usage of a VC router may be to spread the parallel switching for all output ports across multiple clock cycles. This avoids the crossbar, and the other VC logic may be pipelined. This approach is not explored by the authors and needs to be verified.
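A minimal behavioural sketch of that serialization idea, under the assumption that one output port is served per FPGA clock through a single shared multiplexer. The function name and the `grants` representation are hypothetical and do not come from the paper.

```python
def serialized_switch(grants, input_flits, num_ports):
    """Serve one output port per FPGA clock instead of using a full crossbar.

    grants: dict mapping output port -> input port granted for this router cycle.
    input_flits: list of the flit at the head of each input port (or None).
    Returns (outputs, fpga_clocks), where outputs maps output port -> flit.
    """
    outputs = {}
    fpga_clocks = 0
    for out_port in range(num_ports):
        fpga_clocks += 1  # one shared multiplexer selection per clock
        in_port = grants.get(out_port)
        if in_port is not None:
            outputs[out_port] = input_flits[in_port]
    return outputs, fpga_clocks
```

The cost is that one router cycle now takes `num_ports` FPGA clocks, in exchange for replacing the crossbar with a single multiplexer.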
RASoC proposes a composable router that is built from individual input port and output port modules. The switching logic (crossbar, flow control) is distributed across the port modules. The ports implement small buffers (2 or 4 flits, 32-bit flits).
Krasteva et al. use partial reconfiguration to build an NoC emulation framework that can be reconfigured without resynthesis. DRNoC contains a fixed mesh network that connects reconfigurable blocks. Hard core libraries are used to populate the reconfigurable blocks to model routers and TGs. The fixed mesh network has cross links so it can emulate non-mesh topologies (such as a star). It is mentioned that independent routers can be mapped to the same router slot to save area; however, the paper does not elaborate on how this is done. The final system built is a 2×2 on an XC2VP30 (although they claim a 4×4 system can be mapped to the same FPGA). A software tool is developed to configure and read data back from the emulator, which is similar to our approach.
Designing NoCs on FPGA
Uses communication traces to evaluate the interconnection network in a multicomputer system. Parallel programs are first simulated in a virtual multicomputer (seems to be a functional processor simulator) to generate the trace. The trace is a sequence of messages. Each message is either independent or dependent. An independent message carries an absolute injection timestamp. A dependent message carries a pointer to a trigger message and the delay between receipt of the trigger message and injection. This seems to be a simple way to provide some feedback without actually executing the program during network simulation.
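The trace format described above might be represented and replayed roughly as follows. This is a sketch under stated assumptions: the field names, the `replay` function and the `latency` callback (a stand-in for the network simulator) are hypothetical and are not taken from the paper.

```python
import heapq
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceMessage:
    msg_id: int
    src: int
    dst: int
    inject_time: Optional[int] = None  # set for independent messages
    trigger_id: Optional[int] = None   # set for dependent messages
    delay: int = 0                     # wait after the trigger is received


def replay(trace, latency):
    """Return {msg_id: (inject_time, receive_time)} for every reachable message.

    latency(src, dst) stands in for the delay the network simulator would report.
    """
    messages = {m.msg_id: m for m in trace}
    dependents = {}
    ready = []  # heap of (inject_time, msg_id)
    for m in trace:
        if m.trigger_id is None:
            heapq.heappush(ready, (m.inject_time, m.msg_id))
        else:
            dependents.setdefault(m.trigger_id, []).append(m.msg_id)
    times = {}
    while ready:
        t_inject, msg_id = heapq.heappop(ready)
        m = messages[msg_id]
        t_recv = t_inject + latency(m.src, m.dst)
        times[msg_id] = (t_inject, t_recv)
        # Receipt of this message releases any messages that depend on it.
        for dep_id in dependents.pop(msg_id, []):
            heapq.heappush(ready, (t_recv + messages[dep_id].delay, dep_id))
    return times
```

Because a dependent message is injected relative to when its trigger actually arrives, a slower network pushes back later traffic, which is the feedback a trace of absolute timestamps alone cannot capture.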
Uses Limes (an execution-driven simulator for multiprocessors, sounds like an SMP version of SimpleScalar) to generate memory requests to be simulated by an interconnect simulator (NetSim). Limes has a memory system simulator (MemSim) and can run SPLASH-2. Instruction execution time is calculated at the end of each basic block, so it is a rough estimate that ignores microarchitectural details such as out-of-order execution. The memory model is not cache coherent. Results show that an adaptive routing algorithm does result in shorter execution time for the two applications examined (OCEAN and FFT).
Execution-driven SMP simulator that models a cache-coherent shared-memory system. It runs the SPLASH-2 parallel benchmarks. Parallel execution is simulated one thread at a time, running natively on the PC, until the thread encounters a memory reference, at which point the operation is deferred until the global simulation time reaches its timestamp. Simulation time is calculated from instruction counts at the end of each basic block. A centralized simulation kernel makes sure memory references from all threads are interleaved in the correct order and supplies this trace to the back-end memory simulator.
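The scheduling scheme described above can be illustrated with a small sketch. It is not the simulator's actual code: threads are modelled here as lists of hypothetical `(instructions_in_block, address)` pairs, time advances by a fixed assumed cost per instruction, and `mem_latency` is a constant standing in for the back-end memory simulator.

```python
import heapq


def run_kernel(programs, cycles_per_instr=1, mem_latency=0):
    """Interleave memory references from all threads in timestamp order.

    programs: one list per thread of (instructions_in_block, address) pairs,
    meaning "execute this many instructions, then reference this address".
    Returns the ordered trace as a list of (timestamp, thread_id, address).
    """
    threads = [iter(blocks) for blocks in programs]
    pending = []  # heap of (timestamp, thread_id, address)

    def schedule_next(tid, now):
        # Run the thread up to its next memory reference and queue it.
        item = next(threads[tid], None)
        if item is not None:
            instrs, addr = item
            heapq.heappush(pending, (now + instrs * cycles_per_instr, tid, addr))

    for tid in range(len(threads)):
        schedule_next(tid, 0)

    trace = []
    while pending:
        # The reference with the smallest timestamp is the next global event.
        t, tid, addr = heapq.heappop(pending)
        trace.append((t, tid, addr))
        schedule_next(tid, t + mem_latency)
    return trace
```

A thread is only resumed after its own reference has been emitted, so no thread can run ahead of a reference that should have been ordered before it.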
Adaptive bubble router: a design to improve performance in torus networks. Puente, Beivide et al. Parallel Processing, 2009.
DRAMsim: a memory system simulator. Wang, Ganesh et al. Computer Architecture News, 2005.
How to simulate 1000 cores. Monchiero, Ahn et al. Computer Architecture News, 2005.
COTSon: infrastructure for full system simulation. Argollo, Falcon et al. Operating Systems Review, 2009.