The S-Box is the part of each stage that takes up the most resources. Each S-Box is a 6-input, 4-output LUT, which can be represented as four independent 6-input, 1-output LUTs.
    Since only 4-input LUTs are available, each 6-input LUT is constructed from 4-input LUTs in two parts. The lower four input bits go to four 4-LUTs, and the upper two bits drive a 4-way multiplexer that selects one of the four outputs. A 4-way multiplexer is usually built with two levels of LUTs, because its four data inputs and two select inputs do not fit into a single 4-input LUT. However, by using the cascade chain, this can be reduced to one level of LUTs plus the cascade chain. This is faster, since the cascade chain has less delay than an additional LUT level and is available in the same LEs as the two multiplexing LUTs. It also takes up fewer resources, since no additional LEs are needed.
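
The decomposition can be checked with a small software model. The following Python sketch is only an illustration, not the actual hardware description: the function names (lut4, split_lut6, mux_half, lut6) are hypothetical, the test truth table is arbitrary rather than a real S-Box, and the cascade chain is assumed to combine the two multiplexing LUTs with an OR.

```python
def lut4(table, a, b, c, d):
    """Evaluate a 4-input LUT; `table` holds 16 entries of 0 or 1."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]


def split_lut6(table64):
    """Split a 64-entry truth table into four 16-entry tables,
    one for each value of the upper two input bits."""
    return [table64[16 * hi:16 * (hi + 1)] for hi in range(4)]


def mux_half(d0, d1, s0, s1, want_s1):
    """One multiplexing LUT (4 inputs: d0, d1, s0, s1).
    Passes d0 or d1 (selected by s0) when s1 == want_s1, else outputs 0."""
    return (d1 if s0 else d0) if s1 == want_s1 else 0


def lut6(table64, x):
    """Evaluate a 6-input, 1-output function for input x in 0..63."""
    bits = [(x >> i) & 1 for i in range(6)]
    lo, s0, s1 = bits[:4], bits[4], bits[5]
    # First level: four 4-LUTs all see the lower four input bits.
    d = [lut4(t, *lo) for t in split_lut6(table64)]
    # Second level: two multiplexing LUTs joined by the cascade chain
    # (modelled here as an OR, which is an assumption).
    return mux_half(d[0], d[1], s0, s1, 0) | mux_half(d[2], d[3], s0, s1, 1)


# Arbitrary test function (not a real S-Box output bit).
table = [((x * 37 + 11) >> 3) & 1 for x in range(64)]
assert all(lut6(table, x) == table[x] for x in range(64))
```

In this model each 1-output function uses four 4-LUTs plus two multiplexing LUTs, and the final OR stands in for the cascade chain rather than a third LUT level.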
