|
|
|||||||||||||||||||||||
|
Adding Large Embedded Block Support
to Versatile Place and Route Mark Fang (fang AT eecg DOT toronto DOT edu) Mark Jarvin (jarvin AT eecg DOT toronto DOT edu) Andrew Ling (aling AT eecg DOT toronto DOT edu) IntroductionField Programmable Gate Arrays (FPGAs) are programmable chips that can be used to quickly implement any digital circuit. FPGAs have seen a growth in popularity in recent years due to lower product development costs achieved through design flexibility. Correspondingly, the need for high-performance CAD tools for FPGAs has never been greater.
The CAD flow for FPGAs starts with a description of the circuit, usually
in the form of a hardware description language (HDL) such as VHDL. Once
the description is verified through simulation, synthesis occurs where
the description is synthesized into a gate-level network consisting of
primitive gates. Next, technology mapping occurs where the gate-level
network produced in synthesis is converted into a network of
programmable logic blocks (PLBs). After technology mapping, PLBs are
grouped together into regular clusters called clustered logic blocks
(CLBs). Clustering PLBs into CLBs has proven to
improve the performance of FPGAs both in terms of area and speed
[Ahmed2000].
After clustering, placement and routing are then performed to physically
arrange the CLBs on the FPGA and assign routing tracks and switch
connections for CLB connections. Figure 1 shows a generic FPGA
architecture in which CLBs are connected by switch blocks and routing wires.
Figure 1: Generic FPGA architecture. In this investigation, we examine the placement and routing phase of the CAD flow where we attempt to upgrade Versatile Place and Route (VPR) [Betz1997], a competitive placement and routing tool used in academic research. The FPGA Placement ProblemFPGA placement is the task of finding a valid placement for each CLB, while optimizing placement quality. Placement algorithms determine where to physically arrange CLBs on the FPGA. Optimization goals are typically associated with compact placement of CLBs to either minimize overall wire-length or maximize circuit speed. These two optimization criteria are commonly referred to as wire-length-driven placement and timing-driven placement. A third optimization criterion, whereby logic blocks are placed to balance wiring density over the FPGA is known as routability-driven placement. There are three major classes of placement tools used today: min-cut, analytic and simulated annealing. The min-cut approach uses recursive partitioning to divide a circuit into increasingly smaller sub-circuits and optimizes sub-circuits. Analytic placement treats the placement problem globally and finds the optimized placement by solving a system of quadratic equations [Kleinhans1991]. Analytic placement has generally been used for ASIC placement and has not been successfully applied in the FPGA domain. Simulated annealing uses random logic block movement and converges towards an optimized placement. We base our work on VPR which uses simulated annealing to perform placement. The FPGA Routing Problem
The goal of routing is to create physical connections between CLBs such
that timing or wire-length is minimized. In this work, we focus on
wire-length as our measure of routing performance.
One of the earliest pieces of work dealing with the routing problem is
[Lee1961]. In [Lee1961], Lee developed a
routing algorithm called the Maze Router. The routing algorithm applies
dijkstra's algorithm to find the minimum path between two connections.
This is applied successively for all connections until all nets have
been routed.
For example, consider the Figure 2a where block S is to be
connected to block T. Assuming that white blocks are feasible routing segments
and black blocks are not feasible routing segments. The maze router algorithm starts by exploring all
adjacent segments to block S and labeling their distance from S. This
continues for segments adjacent to the previous segments explored. This
creates a breadth first search of the wire segments where the algorithm
terminates when block T is found (blocks 6 and 7 have been shown just
for illustration of labeling). Following this,
the maze router will connect segments as it backtracks back to S
as is shown in Figure 2b. ![]() (a) Illustration of breadth first traversal of wire segments and labeling phase. (b) Illustration of connection after labeling phase. Figure 2. Example of labeling and connection phase in the maze router algorithm. FPGA routing architectures generally consist of channels, switch blocks and connection blocks. Channels are made up of several routing tracks which are individual wires used to transmit signals. CLBs are connected to channels via connection blocks and adjacent channels are connected together via switch blocks. It is up to the FPGA router to assign specific tracks to connections and connect segments and CLBs together which will maximize performance. Traditionally, this problem has been solved in either two phases [Lemieux1997] or one phase [McMurchie1995]. In two phase routing, a global router is responsible for assigning connections to specific FPGA channels while it leaves the task of assigning connections to specific tracks to a detailed router. In contrast, one phase routers does both the channel and track assignment all at once. The benefits of the two phase routing approach is that it is generally much faster than the one phase approach. However, two phase routers hurt FPGA performance when compared to one phase approaches. VPR currently supports a one phase routing approach as described in the following sections. Versatile Place and RouteVPR is an FPGA placement and routing tool, developed at the University of Toronto by Vaughn Betz as part of his PhD thesis. VPR accepts a clustered netlist produced by t-vpack as input. This netlist consists of CLBs and IOs. VPR places this circuit onto the FPGA while observing one of the aforementioned optimization criteria. Finally, VPR finds an optimized circuit routing. In our upgrades to VPR, we focus on VPR’s placement and routing function and ignore the clustering. VPR performs simulated annealing using an automatic annealing schedule. A scheduler gathers statistics regarding placement quality as the placement is being performed and uses them to control the rate of cooling. The performance of the scheduler greatly affects the final placement quality. The source code for the academic version of VPR is publicly available here. We use this code base as the starting point for our work. Optimization by Simulated AnnealingSimulated annealing mimics the annealing process used to gradually cool molten metal to produce metal objects in low-energy crystalline states. The simulated annealing algorithm is based on random movement of logic blocks, but also on a smart cost function. The cost function is used to evaluate the quality of any placement of CLBs. For example, a common cost for wire-length-driven placement is the sum of the sizes of the bounding boxes, where a bounding box is a box that encapsulates all logic blocks attached to a single net. A net can be thought as a hyper-edge that connects a set of blocks together. Figure 3 illustrates a net and its associated bounding box.
Figure 3: An example of a net and its associated bounding box, with size measured as L+W (half-perimeter). Simulated annealing begins with an initial placement obtained by assigning CLBs randomly to the available programmable logic block locations in the FPGA. This is then followed by a large number of moves, or local improvements, to gradually improve placement. A move consists of randomly selecting a CLB and then randomly selecting a destination. The change in the cost function that would result from performing the randomly generated move is then computed. If the cost decreases, the move is always accepted. However, if the cost increases, there is still a chance of the move being accepted. This allows for hill-climbing moves that make the placement worse, but enable the simulated annealing process to escape local minima of the cost function. Figure 4 illustrates an example cost function and the process of hill-climbing to escape local minima.
Figure 4: Hill-climbing used to escape local minima. The probability of accepting a bad move is a function of an annealing parameter known as temperature. When temperature is high, almost all moves are accepted. However, as temperature decreases during simulated annealing, placement is refined so that the probability of accepting a bad movie decreases. The distance a CLB may move is also a function of temperature. At high temperatures, the radius of allowable moves, Dlimit, is large. For example, in Figure 5a, block x can move to all locations within the highlighted radius. As temperature decreases, kinetic energy of the CLBs also decreases, and freedom of movement is constrained; thus, the radius of movement decreases, as illustrated in Figure 5b.
(a) Radius of movement at high-temperatures
(b) Radius of movement for low-temperatures. Figure 5: Illustration of radius of movement decrease. Finally, When temperature is reduced to zero, the circuit is in a static state and placement is considered complete. In the previous description, we only mentioned CLBs in the annealing process. However, other blocks can also be included into the anneal process such as IOs. For these blocks, the same set of rules on accepting move locations and cost calculations apply. FPGA Routing with PathFinderVPR routing is based on the PathFinder algorithm [McMurchie1995, Swartz1998]. PathFinder is very similar to the Maze Route algorithm described previously, however, several iterations are applied to route connections where routed nets are ripped up and retried. This allows the router to fix any "mistakes" during the routing since nets can be rerouted to potentially better tracks.
In PathFinder, the routing architecture is represented by a directed graph
G=(V,E), called
routing resource graph. Unlike the
basic Maze Route algorithm, nodes are given an associated cost. When
considering only congestion, the cost is as follows:
VPR routing improves on PathFinder by taking delay into account. At the end
of each routing iteration timing analysis is performed. From timing analysis
the timing criticality Critij of every connection (i,j) is computed as follows:
Here, cn is the previous congestion only cost function. By incorporating a delay sensitive term in the cost function, timing-critical connections (high Critij) will use low delay nodes even if they are congested. On the other hand, a non-timing-critical connection will use higher delay nodes if low delay nodes are overused/congested, to alleviate congestion. Therefore VPR routing is a negotiated congestion-delay routing algorithm based on PathFinder. Adding Large Embedded Block SupportMotivationThe fact that CLBs and IOs are all uniform in shape and can be treated homogeneously allows for a relatively simple and elegant implementation of placement in VPR. However, most modern day FPGAs are not homogeneous and contain elements such as DSPs and RAM blocks which must be treated much differently than clusters. Thus, VPR currently only supports traditional FPGA architectures which only contain clusters and IOs. It would be beneficial to the academic community to support heterogeneous blocks in VPR. Relatively little work has been done in heterogeneous FPGA architecture exploration [Jamieson2005], and no published work known to this author has covered the problem of placing and routing heterogeneous FPGA blocks. The goal of this project is to create VPR place and route support for heterogeneous blocks which we will refer as Large Embedded Blocks (LEBs). For simplicity, we will support only one type of LEB can exist in the FPGA architecture. Furthermore, we will assume the FPGA architecture will have column wide support of LEBs where a column will either support CLBS or LEBs. The details of the LEB architectural assumptions can be found in the Architecture Specifications section. There is hope that these assumption can be relaxed in future revisions of LEB support. Primary ModificationsThe modifications to VPR can be categorized into three areas: architecture definition, placement, and routing. Like CLB characteristics, the user will need to specify the LEB architecture and properties of individual LEBs, such as their dimensions and physical pin layout. After the FPGA architecture is specified, the placement of LEBs must occur in conjunction with CLBs such that placement quality is not impacted. Finally, the router needs to support multi-dimensional blocks and return a final cost of the placed and routed circuit. This page has been accessedLast Updated Jan 2006 |