SuperLU FAQ
- ATLAS: http://math-atlas.sourceforge.net/
- Goto BLAS: http://www.tacc.utexas.edu/resources/software/#blas
Windows users
This was tested in MS Visual Studio; however, the configuration depends
heavily on which compiler you are using.
Normally there is an IDE (Integrated Development Environment) editor
associated with your compiler. You can build and use SuperLU in two steps:
- Step I: Create SuperLU library file
- Create a new project, then include all the .c and .h files in SRC
directory (they can be put in two folders of the IDE).
- Change the property of the project to make the output as Library file
.lib (not .exe or .dll file).
- Compile the project to produce the library file, e.g. superlu.lib.
(after you successfully compile it, you can build a release version
without the debug information).
- Step II: Build your own application
- Create a new project with your own source files which call the
SuperLU routines.
- Add the SRC directory and the directory where superlu.lib is located
to the include path and library searching path respectively.
- Add superlu.lib as the link optional library.
- Compile your own .dll or .exe file. Then you are done.
If you are using a command-line-only compiler, you will have to work
with the makefile, or with compiler options such as -I, -L, -O and -c.
Since SuperLU calls BLAS routines but BLAS is not a native library of
MS Visual Studio, you have to build your own BLAS library in a similar
way to the SuperLU library. The SuperLU distribution includes a C version
of BLAS in the SuperLU/CBLAS directory. This version is functional but
not fast. For speed, it is better to use a vendor-supplied BLAS (e.g.,
Intel MKL) or a public domain version (e.g., ATLAS or Goto BLAS).
Further remarks on SuperLU_MT Windows compilation:
- On Cygwin, both the Pthreads and OpenMP versions compile without trouble,
just as on Linux.
- MinGW works with both Pthreads and OpenMP, too. However, MinGW users must enter each directory (SRC, CBLAS, EXAMPLE, etc.) and run mingw32-make separately; it is not easy to get the top-level Makefile to work with MinGW.
- Visual Studio only works with the OpenMP version.
The user needs to define the macro __OPENMP by editing the properties of the VS project.
Symmetric or diagonally dominant problems
SuperLU cannot take advantage of symmetry, but it can still solve
the linear system as long as you input both the lower and upper parts
of the matrix A. If off-diagonal pivoting does not occur, the U matrix
in A = L*U can be rewritten as D*L1', with L1 ~= L, modulo roundoff errors.
In many applications, matrix A may be diagonally dominant or nearly so.
In this case, pivoting on the diagonal is sufficient for stability and
is preferable for sparsity to off-diagonal pivoting.
To do this, the user can set a small (less-than-one) diagonal pivot threshold
(e.g., 0.0, 0.01, ...) and choose an (A' + A)-based column permutation
algorithm. We call this setting Symmetric Mode.
To use this (in serial SuperLU), you need to set:
options.SymmetricMode = YES;
options.ColPerm = MMD_AT_PLUS_A;
options.DiagPivotThresh = 0.0; /* or 0.001, 0.01, etc. */
For serial SuperLU, an example of using symmetric mode can be found in EXAMPLE/dlinsol1.c
(Note that when a diagonal entry is too small relative to the
threshold, the code will still choose an off-diagonal pivot;
that is, the row permutation P_r may not be the identity.)
For SuperLU_MT, an example can be found in EXAMPLE/pdlinsolx1.c
The algorithm in SuperLU_DIST is already in the spirit of symmetric mode.
Complex types
SuperLU supports single complex and double complex data types.
The double complex is defined as a pair of doubles, as follows:
typedef struct { double r, i; } doublecomplex;
For a variable "v" of doublecomplex type, you need to use
"v.r" and "v.i" to refer to the real and imaginary parts, respectively.
For an array "a[]" of doublecomplex type, you can refer to the k-th
element by "a[k].r" or "a[k].i".
Compressed row storage
The factorization and solve routines in SuperLU are designed to handle
column-wise storage only. If the input matrix A is in row-oriented storage, i.e.,
in SLU_NR format, then the driver routines (dgssv() and dgssvx() in serial SuperLU, and
pdgssv() and pdgssvx() in SuperLU_MT) actually perform the LU decomposition on A^T,
which is column-wise, and solve the system using the L^T and U^T factors.
The data structures holding L and U on output are different (swapped) from the data
structures you get from column-wise input. For more detailed descriptions about
this process, please refer to the leading comments of the routines
dgssv/dgssvx and pdgssv/pdgssvx.
Convert supernodal L & U to CCS format
The L matrix is stored in a *supernodal* format, defined by the
structure SCformat{...} in supermatrix.h.
In serial SuperLU/MATLAB/mexsuperlu.c, a routine called LUextract()
converts L & U into the respective CCS formats (for MATLAB to use).
In SuperLU_MT, the data structure for U is the same. The difference
for L is that its columns are permuted, that is, the supernodes are
stored out of order. It is defined in the structure SCPformat{...}
in supermatrix.h.
Section 3.2 of the Users Guide explains this.
SuperLU_MT: exceeding storage error
SuperLU_MT does not support dynamic memory expansion. It pre-allocates storage
based on the number of nonzeros in the original matrix A and the estimated
fill ratios given by the #6, #7, #8 parameters in SRC/sp_ienv.c. If the
estimate is too small, you may get the following error message:
Storage for L subscripts exceeded; Current column xxxx; Need at least yyyyyy;
This particular message corresponds to the 8-th parameter in routine
sp_ienv(); you need to set that parameter to a larger value.
Section 3.5.2 of the Users' Guide explains this in more detail.
SuperLU_DIST: 2D process grid
For best performance, the process grid should be arranged as square as
possible. When a square grid is not possible, it is better to set the row
dimension (nprow) slightly smaller than the column dimension (npcol).
For example, the following are good options: 2x3, 2x4, 4x4, 4x8, etc.
SuperLU_DIST: ReplaceTinyPivot
SuperLU_DIST uses a "static pivoting" strategy, which permutes large elements
of A onto the diagonal before numerical factorization. No dynamic pivoting is
performed during factorization. This is less rigorous than classical
partial pivoting, but it makes the code much more scalable. During
factorization, if we encounter a pivot whose magnitude is smaller than
sqrt(epsilon) * norm(A), we bump it up to sqrt(epsilon) * norm(A). This
corresponds to roughly a single-precision perturbation; that is, we compute
the factorization of a nearby matrix whose distance from the original is
about half the digits of double precision. This trick avoids division by
small pivots, and works well for many problems whose condition numbers are
not too large.
For very ill-conditioned problems this perturbation may be too large;
you can turn it off by setting:
options.ReplaceTinyPivot = NO;
SuperLU_DIST: support for 64-bit integers
Sequential SuperLU does not explicitly support 64-bit integers; all
integers are declared as "int". The parallel SuperLU_DIST
declares integers as "int_t", which can be defined as "int" or "long
int" at compile time. This is explained in the top-level README file.
If you use 64-bit integers and (Par)METIS for the sparsity-preserving
ordering, you need to install (Par)METIS with 64-bit integers
as well.
SuperLU_DIST: symbolic factorization
The default setting uses a serial symbolic factorization routine, which
can be used together with any fill-reducing ordering. The disadvantage
is that it gathers the entire graph of A onto one process,
which may become a memory bottleneck.
To use parallel symbolic factorization routine, you need to set the following:
options.ParSymbFact = YES;
options.ColPerm = PARMETIS;
Note that the parallel symbolic factorization must be used together
with a parallel ordering algorithm, such as ParMETIS, which may
produce more fill than a serial ordering algorithm.
SuperLU_DIST: reproducibility
SuperLU_DIST does not guarantee bitwise reproducible results. Results may
be reproducible with one MPI task and one OpenMP thread, but with multiple
MPI tasks the communication order and computation order may differ from run
to run: the algorithm uses asynchronous pipelined execution with
overlapping communication and computation, so there is no guarantee of the
same order of floating-point operations.
For ill-conditioned problems, a different order of operations may cause
a large change in the solution. For problems that are not so
ill-conditioned, the difference may be only in the last few digits.
You can check the returned error bound to help confirm
whether the difference between runs is as large as may be expected.
GPU support
Only SuperLU_DIST has GPU support, currently on Nvidia/CUDA. In the near future,
we will support AMD and Intel GPUs.
Serial SuperLU uses partial pivoting, with row interchanges during elimination
and fully dynamic memory management, so it does not benefit from GPU
parallelization. SuperLU_DIST uses static pivoting, which is potentially less
stable than serial SuperLU, but its pivoting and symbolic factorization are
separate phases that precede the numerical factorization, so each phase can
be tackled one by one.