The generator creates hardware to accelerate the LU Factorization method to
solve linear system of equations on any FPGA. A system of N linear equations
with N unknown variables can be represented in a matrix-vector form as *Ax = b*.
The coefficient of the variables in each linear equation are represented in each
row of an N x N matrix (*A*) and they are multiplied by the N-element
vector of unknown variables (*x*). A solver must determine the values of *
x* for which the product generates the N-dimensional constant (*b*).

The LU factorization method solves for *x* by breaking the coefficient
matrix *A* into two matrices, *A = LU*. One of those matrices, called
*L*, is a lower triangular matrix
which has the diagonal elements equal to 1 and all elements above the diagonal
equal to 0; the other matrix, called *U*, is an upper triangular matrix
which has the elements below the diagonal equal to 0. Using *LUx = b*, if
we set *y = Ux*, a forward substitution can be performed to compute *y*
from *Ly = b*. Then a backward substitution can be performed to compute *
x* from *Ux = y*. The most time consuming computation in this algorithm
is the factorization of the coefficient matrix, which is the determination of
the matrices *L* and *U* such that *A = LU*. It is this
factorization that the generator will create hardware to accelerate on FPGA.

The generator will create synthesizable Verilog HDL code based on a set of parameters. The parameters specify the capabilities and resources available on the target FPGA platform. Thus, a customized hardware can be created to take full advantage on the FPGA. A full list of the parameters for the generator can be find in the user manual.

The generated Verilog HDL code has to perform two tasks. One of the tasks is data marshalling, which involves loading and storing data from off-chip memory to on-chip memory in the FPGA. The other task is the actual computation of the LU factorization algorithm. A high level block diagram of the hardware is shown in the figure below. The coefficient matrix is stored in off-chip memory so that the matrix dimension can be any value. The data marshaller module brings blocks of the matrix onto the FPGA for the computation module to work on.

High level block diagram

To implement other algorithms, only the computation section of the hardware has to be changed. The data marshalling component can be reused to handle interactions with the off-chip memory.

To perform LU factorization, the coefficient matrix A has to be placed in the
off-chip memory. Some padding is necessary based on the block size and number
of processing elements. More details on the necessary padding are described in
the Thesis found here.

Additional inputs are sent through I/O pins to the FPGA to initiate computation:
the size of the matrix (N), the starting address of the matrix in off-chip memory
and a start signal. The solution of the LU factorization is stored back into the
off-chip memory in the same location as the coefficient matrix A. A done output
signal is asserted when the algorithm completes.