Application Overview

The generator creates hardware to accelerate the LU Factorization method to solve linear system of equations on any FPGA. A system of N linear equations with N unknown variables can be represented in a matrix-vector form as Ax = b. The coefficient of the variables in each linear equation are represented in each row of an N x N matrix (A) and they are multiplied by the N-element vector of unknown variables (x). A solver must determine the values of x for which the product generates the N-dimensional constant (b).

The LU factorization method solves for x by breaking the coefficient matrix A into two matrices, A = LU. One of those matrices, called L, is a lower triangular matrix which has the diagonal elements equal to 1 and all elements above the diagonal equal to 0; the other matrix, called U, is an upper triangular matrix which has the elements below the diagonal equal to 0. Using LUx = b, if we set y = Ux, a forward substitution can be performed to compute y from Ly = b. Then a backward substitution can be performed to compute x from Ux = y. The most time consuming computation in this algorithm is the factorization of the coefficient matrix, which is the determination of the matrices L and U such that A = LU. It is this factorization that the generator will create hardware to accelerate on FPGA.

 

System Architecture Overview

The generator will create synthesizable Verilog HDL code based on a set of parameters. The parameters specify the capabilities and resources available on the target FPGA platform. Thus, a customized hardware can be created to take full advantage on the FPGA. A full list of the parameters for the generator can be find in the user manual.

The generated Verilog HDL code has to perform two tasks. One of the tasks is data marshalling, which involves loading and storing data from off-chip memory to on-chip memory in the FPGA. The other task is the actual computation of the LU factorization algorithm. A high level block diagram of the hardware is shown in the figure below. The coefficient matrix is stored in off-chip memory so that the matrix dimension can be any value. The data marshaller module brings blocks of the matrix onto the FPGA for the computation module to work on.

High level block diagram

To implement other algorithms, only the computation section of the hardware has to be changed. The data marshalling component can be reused to handle interactions with the off-chip memory.

 

Hardware Execution Outline

To perform LU factorization, the coefficient matrix A has to be placed in the off-chip memory. Some padding is necessary based on the block size and number of processing elements. More details on the necessary padding are described in the Thesis found here.

Additional inputs are sent through I/O pins to the FPGA to initiate computation: the size of the matrix (N), the starting address of the matrix in off-chip memory and a start signal. The solution of the LU factorization is stored back into the off-chip memory in the same location as the coefficient matrix A. A done output signal is asserted when the algorithm completes.