## Solving Quadratic Equations with XL on Parallel Architectures

### Introduction

Multivariate quadratic systems can be solved using the XL algorithm:
The system is e**x**tended by multiplying each equation with all monomials up to a
certain degree \(D\);
then the resulting system is **l**inearized by treating all monomials as
individual variables.
The resulting large and sparse linear system \(B\) is solved;
if \(D\) was chosen large enough, the kernel space becomes small enough so that
we find a solution for the original quadratic system.

We are using Coppersmith's block variant of the Wiedemann algorithm (block Wiedemann: BW) for solving the sparse linear system. In the first step of BW, a sequence of matrices \(\{a^{(i)}\}, a_i=x B^{i+1} z\) for some matrices \(x\) and \(z\) is computed. In the second step, the block Berlekamp-Massey algorithm is used in order to compute the minimal polynomial \(\lambda\) of the sequence. Finally, the polynomial is evaluated at \(B\) giving a solution \(s\) such that \(Bs = 0\).

We describe this project and the implementation in detail in our paper
"Solving Quadratic Equations with XL on Parallel Architectures"
and in Chapter 4 of my thesis.

We used this code to solve Type III systems of the Fukuoka MQ Challenge. Here is more information on solving the challenges.

### Source Code

To download the latest version of the software do the following:

The field and the size of the system must be defined at compile time, there are defines (Q, M and N) in the Makefile; currently, there are implementations for \(\mathbb{F}_2,\) \(\mathbb{F}_{16},\) and \(\mathbb{F}_{31}\); the code is in the gf/ directory. After compiling, you can run the program using:

The code generates a random system, prints the expected solution, and then uses XL to do the computation. A Fukuoka challenge can be loaded using

The top (comment) lines from the challenge must have been removed from the file.

By default, the code compiles with OpenMP; there are several implementations for MPI which you can choose in the Makefile:

- The most useful one is probably "BW1_two_blocks" which splits the workload of the first step of BW into two blocks in order to communicate while computing.
- Background communication works best if you have Infiniband. Using "BW1_two_blocks_ibv", the Infiniband cards are programmed directly using the IB Verbs API, reducing overhead and providing 'real' (CPU less) background data transfer.
- "BW1_one_block" uses one block only and works well if there is no DMA capable network hardware (i.e., there is no real background send anyway).
- "BW1_mpi_size_blocks" scales very well for the first step of BW — but due to the increasing block size, for many MPI nodes the runtime of Berlekamp-Massey goes up (the second step of BW).

### Tweaking BW3

The code implements an unpublished idea to improve the runtime of the last step of BW which is not in the paper: Usually, when solving a large linear system, one is interested in a solution for all variables. However, for XL, we actually need only a solution for the original system (before it was extended and linearized). Therefore, we do not need to repeat the computations of the first BW step — but can just compute the solution from the sequence that was computed in the first step (using a tweak on how the sequence is computed). That reduces the cost of the last step to a marginal amount (e.g., 3 minutes out of 50 days in the Type III n=35 case).