Each benchmark circuit is ``implemented'' in each FPGA using the following procedure: First, each memory in the memory configuration must be implemented using one or more of the physical memory arrays. For example, a 4Kx2 user memory can be implemented using four 2-kilobit arrays each configured as a 2Kx1 memory with appropriate decoding. We call this process the logical-to-physical mapping. Because there can be many ways to perform this mapping, it is an optimization problem. We use an algorithm described in .
Next, the mapped circuits are then placed on an appropriately-sized FPGA using a simulated annealing-based placement program, and the detailed routing is performed using a multi-pass maze router (both tools are described in ). The routing is repeated for different values of W (the number of tracks per channel) to determine the minimum W that gives a 100% routable solution. The router considers all input pins of a lookup-table equivalent, as well as all address pins of a memory array. Data pins are also considered equivalent with the constraint that a pin assignment for a specific bit in the data-out port fixes the corresponding assignment in the data-in port (and vice versa).
The size of the FPGA used in the place and route step depends on the number of lookup tables in the circuit to be implemented, as well as the number of memory arrays to be included. For a given number of arrays, we place the required number of logic blocks above and below the memory row, creating a roughly square chip. This will result in different values of r (the number of logic blocks per memory block in the horizontal dimension) for different circuits. If, however, the resulting r is less than 3, we force r to be 3, resulting in a non-square aspect ratio. For some circuits, the number of input/outputs will determine the chip area; for these circuits, a square array with the required number of input/output blocks will be used. Table 1 shows the average values of r and the average aspect ratio (ratio of horizontal logic blocks to vertical logic blocks) for circuits implemented on the four architectures.
Finally, a timing analyzer is used to measure the critical path through the circuit. The timing analyzer finds each net delay by constructing an RC-tree. The memory access time is measured from a modified version of CACTI, a detailed cache access time model . The CLB delay is assumed to be constant. A 0.5um CMOS process was assumed throughout.