Sparse Matrix Methods
 • Day 1: Overview

 • Day 2: Direct methods

 • Day 3: Iterative methods

    •   The conjugate gradient algorithm
    •   Parallel conjugate gradient and graph partitioning
    •   Preconditioning methods and graph coloring
    •   Domain decomposition and multigrid
    •   Krylov subspace methods for other problems
    •   Complexity of iterative and direct methods
SuperLU-dist:
Iterative refinement to improve solution

   Iterate:
   •   r = b – A·x
   •   backerr = max_i ( |r_i| / (|A|·|x| + |b|)_i )
   •   if backerr < ε or backerr > lasterr/2 then stop iterating
   •   solve L·U·dx = r
   •   x = x + dx
   •   lasterr = backerr
   •   repeat

           Usually 0 – 3 steps are enough
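
A minimal SciPy sketch of this loop (an illustration, not SuperLU-dist's actual code; splu here computes accurate factors, so refinement stops almost immediately, but the loop is what matters when L·U is only approximate):

import numpy as np
from scipy.sparse import csc_matrix
from scipy.sparse.linalg import splu

def refine(A, b, eps=1e-12, maxsteps=5):
    """Iterative refinement; A is a SciPy sparse matrix."""
    lu = splu(csc_matrix(A))                       # L*U ~ A
    x = lu.solve(b)
    lasterr = np.inf
    for _ in range(maxsteps):
        r = b - A @ x                              # r = b - A*x
        scale = abs(A) @ np.abs(x) + np.abs(b)     # (|A|*|x| + |b|)_i
        backerr = np.max(np.abs(r) / scale)        # componentwise backward error
        if backerr < eps or backerr > lasterr / 2:
            break                                  # converged, or stalled
        x = x + lu.solve(r)                        # solve L*U*dx = r;  x = x + dx
        lasterr = backerr
    return x
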
(Matlab demo)
 • iterative refinement
Convergence analysis of iterative refinement

  Let C = I – A·(LU)^-1    [ so A = (I – C)·(LU) ]

  x_1 = (LU)^-1 b
  r_1 = b – A·x_1 = (I – A·(LU)^-1) b = C·b
  dx_1 = (LU)^-1 r_1 = (LU)^-1 C·b
  x_2 = x_1 + dx_1 = (LU)^-1 (I + C) b
  r_2 = b – A·x_2 = (I – (I – C)·(I + C)) b = C^2 b
  ...
  In general, r_k = b – A·x_k = C^k b

  Thus r_k → 0 if |largest eigenvalue of C| < 1.
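
A quick NumPy check of this derivation (illustrative; the deliberately perturbed "LU" below is hypothetical, chosen so that C = I – A·(LU)^-1 is nonzero):

import numpy as np

rng = np.random.default_rng(0)
n = 6
A = 4 * np.eye(n) + 0.5 * rng.standard_normal((n, n))
LU = A + 0.1 * rng.standard_normal((n, n))     # an inexact factorization of A
b = rng.standard_normal(n)
C = np.eye(n) - A @ np.linalg.inv(LU)

x = np.linalg.solve(LU, b)                     # x_1 = (LU)^-1 b
for k in range(1, 4):
    r = b - A @ x
    print(np.allclose(r, np.linalg.matrix_power(C, k) @ b))   # True: r_k = C^k b
    x = x + np.linalg.solve(LU, r)             # dx_k = (LU)^-1 r_k
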
The Landscape of Sparse Ax=b Solvers

                        Direct              Iterative
                        A = LU              y’ = Ay
                                                                 ↑ More General
  Nonsymmetric          Pivoting LU         GMRES, QMR, …

  Symmetric
  positive definite     Cholesky            Conjugate gradient
                                                                 ↓ More Robust

     ← More Robust                          Less Storage →
Conjugate gradient iteration

 x_0 = 0,  r_0 = b,  p_0 = r_0
 for k = 1, 2, 3, . . .
    α_k = (r_{k-1}^T r_{k-1}) / (p_{k-1}^T A p_{k-1})    step length
    x_k = x_{k-1} + α_k p_{k-1}                          approx solution
    r_k = r_{k-1} – α_k A p_{k-1}                        residual
    β_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})              improvement
    p_k = r_k + β_k p_{k-1}                              search direction


 • One matrix-vector multiplication per iteration
 • Two vector dot products per iteration
 • Four n-vectors of working storage
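
A direct NumPy transcription of the iteration above (a sketch assuming A is symmetric positive definite; variable names follow the slide):

import numpy as np

def conjugate_gradient(A, b, tol=1e-10, maxit=1000):
    x = np.zeros_like(b)                 # x_0 = 0
    r = b.copy()                         # r_0 = b
    p = r.copy()                         # p_0 = r_0
    rr = r @ r
    for _ in range(maxit):
        Ap = A @ p                       # the one matvec per iteration
        alpha = rr / (p @ Ap)            # step length
        x = x + alpha * p                # approximate solution
        r = r - alpha * Ap               # residual
        rr_new = r @ r                   # second dot product
        beta = rr_new / rr               # improvement this step
        p = r + beta * p                 # next search direction
        rr = rr_new
        if np.sqrt(rr) < tol:
            break
    return x

Note the promised costs: one A @ p, two dot products (p @ Ap and r @ r), and four working n-vectors (x, r, p, Ap).
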
Conjugate gradient: Krylov subspaces
 • Eigenvalues: Au = λu,  spectrum { λ_1, λ_2, . . ., λ_n }
 • Cayley-Hamilton theorem:
          (A – λ_1 I)·(A – λ_2 I) · · · (A – λ_n I) = 0

          Therefore  Σ_{i=0}^{n} c_i A^i = 0  for some c_i
          (with c_0 ≠ 0, since A is nonsingular)

          so  A^-1 = Σ_{i=1}^{n} (–c_i /c_0) A^{i–1}

 • Krylov subspace:
        Therefore if Ax = b, then x = A^-1 b and
     x ∈ span (b, Ab, A^2 b, . . ., A^{n-1} b) = K_n(A, b)
Conjugate gradient: Orthogonal sequences

• Krylov subspace: K_i(A, b) = span (b, Ab, A^2 b, . . ., A^{i-1} b)
• Conjugate gradient algorithm:
      for i = 1, 2, 3, . . .
              find x_i ∈ K_i(A, b)
              such that r_i = (A x_i – b) ⊥ K_i(A, b)

• Notice r_i ∈ K_{i+1}(A, b), so r_i ⊥ r_j for all j < i

• Similarly, the “directions” are A-orthogonal:
              (x_i – x_{i-1})^T · A · (x_j – x_{j-1}) = 0
• The magic: Short recurrences. . .
      A is symmetric => can get next residual and direction
         from the previous one, without saving them all.
Conjugate gradient: Convergence

 • In exact arithmetic, CG converges in n steps
                  (completely unrealistic!!)

 • Accuracy after k steps of CG is governed by a polynomial problem:
    • consider polynomials of degree k that are equal to 1 at 0
    • how small can such a polynomial be at all the eigenvalues of A?

 • Thus, eigenvalues close together are good.

 • Condition number: κ(A) = ||A||_2 ||A^-1||_2 = λ_max(A) / λ_min(A)

 • Residual is reduced by a constant factor by
         O(κ(A)^1/2) iterations of CG.
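
An illustrative experiment (not from the slides): on the 1D Poisson matrix κ(A) grows like n^2, so the O(κ(A)^1/2) bound predicts iteration counts growing roughly linearly with n:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg

for n in [100, 200, 400]:
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
    b = np.ones(n)
    iters = []
    x, info = cg(A, b, callback=lambda xk: iters.append(1))   # count iterations
    print(n, len(iters))    # count roughly doubles each time n doubles
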
(Matlab demo)
 • CG on grid5(15) and bcsstk08
 • n steps of CG on bcsstk08
Conjugate gradient: Parallel implementation

• Lay out matrix and vectors by rows

• Hard part is the matrix-vector product y = A*x

  [Figure: block-row layout of A, x, and y across processors P0–P3]

• Algorithm
   Each processor j:
    Broadcast x(j)
    Compute y(j) = A(j,:)*x

• May send more of x than needed

• Partition / reorder matrix to reduce
  communication
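
A serial NumPy sketch of this layout: each "processor" owns a block of rows of A and the matching slice of x, and computes its slice of y = A*x after the broadcast/all-gather of x (a real code would use MPI; the partition here is hypothetical):

import numpy as np

rng = np.random.default_rng(2)
n, p = 8, 4                                   # n rows over p processors
A = rng.standard_normal((n, n))
x = rng.standard_normal(n)
rows = np.array_split(np.arange(n), p)        # block-row partition

pieces = [x[idx] for idx in rows]             # each processor's x(j)
x_full = np.concatenate(pieces)               # the "Broadcast x(j)" step
y = np.concatenate([A[idx, :] @ x_full for idx in rows])   # y(j) = A(j,:)*x

print(np.allclose(y, A @ x))                  # True
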
(Matlab demo)
 • 2-way partition of eppstein mesh
 • 8-way dice of eppstein mesh
Preconditioners

 •   Suppose you had a matrix B such that:
     1. condition number κ(B^-1 A) is small
     2. By = z is easy to solve

 •   Then you could solve (B^-1 A)x = B^-1 b instead of Ax = b

 •   B = A is great for (1), not for (2)
 •   B = I is great for (2), not for (1)
 •   Domain-specific approximations sometimes work
 •   B = diagonal of A sometimes works

 •   Or, bring back the direct methods technology. . .
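
A sketch of B = diagonal of A with SciPy's CG: the preconditioner argument M must apply B^-1, which for a diagonal B is componentwise division. The badly scaled test matrix below is hypothetical, chosen so that diagonal scaling actually pays off:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import cg, LinearOperator

n = 200
P = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format='csr')
s = np.logspace(0, 2, n)                      # widely varying scales
D = sp.diags(s)
A = (D @ P @ D).tocsr()                       # SPD but badly scaled
b = np.ones(n)

d = A.diagonal()
M = LinearOperator(A.shape, matvec=lambda z: z / d)    # z -> B^-1 z

for name, prec in [("none", None), ("diag", M)]:
    it = []
    cg(A, b, M=prec, callback=lambda xk: it.append(1))
    # far fewer iterations with the diagonal preconditioner; the
    # unpreconditioned run may even hit SciPy's default iteration cap
    print(name, len(it))
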
(Matlab demo)
 • bcsstk08 with diagonal precond
Incomplete Cholesky factorization (IC, ILU)


  [Figure: nonzero structure of A and its incomplete factors R^T, R with A ≈ R^T·R]
 • Compute factors of A by Gaussian elimination,
   but ignore fill

 • Preconditioner B = R^T R ≈ A, not formed explicitly

 • Compute B-1z by triangular solves (in time nnz(A))

 • Total storage is O(nnz(A)), static data structure

 • Either symmetric (IC) or nonsymmetric (ILU)
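
A sketch using SciPy's ILU (spilu, which supports a drop tolerance) as the preconditioner B ≈ A inside CG; applying B^-1 z is exactly the pair of triangular solves inside ilu.solve. SciPy ships no incomplete Cholesky, so this uses ILU even though A (the 2D model problem here) is symmetric:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import spilu, LinearOperator, cg

k = 30                                        # k-by-k grid, n = k^2
I = sp.identity(k)
T = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(k, k))
S = sp.diags([-1.0, -1.0], [-1, 1], shape=(k, k))
A = (sp.kron(I, T) + sp.kron(S, I)).tocsc()   # 5-point Laplacian
b = np.ones(k * k)

ilu = spilu(A, drop_tol=1e-2, fill_factor=10)      # incomplete factors of A
M = LinearOperator(A.shape, matvec=ilu.solve)      # z -> B^-1 z
it = []
x, info = cg(A, b, M=M, callback=lambda xk: it.append(1))
print(info, len(it))     # markedly fewer iterations than unpreconditioned CG
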
(Matlab demo)
 • bcsstk08 with ic precond
Incomplete Cholesky and ILU: Variants

  [Figure: levels of fill illustrated on a small 4-node graph]

• Allow one or more “levels of fill”
   • unpredictable storage requirements

• Allow fill whose magnitude exceeds a “drop tolerance”
   • may get better approximate factors than levels of fill
   • unpredictable storage requirements
   • choice of tolerance is ad hoc

• Partial pivoting (for nonsymmetric A)

• “Modified ILU” (MIC): Add dropped fill to diagonal of U or R
   • A and R^T R have same row sums
   • good in some PDE contexts
Incomplete Cholesky and ILU: Issues

  • Choice of parameters
     • good: smooth transition from iterative to direct methods
     • bad: very ad hoc, problem-dependent
     • tradeoff: time per iteration (more fill => more time)
                  vs # of iterations (more fill => fewer iters)

  • Effectiveness
     • condition number usually improves (only) by constant factor
       (except MIC for some problems from PDEs)
     • still, often good when tuned for a particular class of problems

  • Parallelism
     • Triangular solves are not very parallel
     • Reordering for parallel triangular solve by graph coloring
(Matlab demo)
 • 2-coloring of grid5(15)
Sparse approximate inverses




  [Figure: sparsity of A and of the explicitly computed approximate inverse B^-1]

 • Compute B^-1 ≈ A^-1 explicitly

 • Minimize || B^-1 A – I ||_F    (in parallel, by columns)

 • Variants: factored form of B^-1, more fill, …
 • Good: very parallel
 • Bad: effectiveness varies widely
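
A dense NumPy sketch of the idea. Common formulations minimize ||A·B^-1 – I||_F, which splits into one small least-squares problem per column of B^-1 (hence the parallelism); the sparsity pattern chosen here (that of A itself) is one simple option:

import numpy as np

def spai(A):
    """Approximate inverse M ~ A^-1 restricted to the sparsity pattern of A."""
    n = A.shape[0]
    M = np.zeros((n, n))
    for j in range(n):                            # columns are independent
        pattern = np.nonzero(A[:, j])[0]          # allowed nonzeros in column j
        e = np.zeros(n); e[j] = 1.0
        mj, *_ = np.linalg.lstsq(A[:, pattern], e, rcond=None)
        M[pattern, j] = mj                        # minimizes ||A m_j - e_j||_2
    return M

rng = np.random.default_rng(3)
A = 4 * np.eye(8) + 0.3 * rng.standard_normal((8, 8))
A[np.abs(A) < 0.2] = 0.0                          # make A genuinely sparse
M = spai(A)
print(np.linalg.norm(A @ M - np.eye(8)))          # small, pattern-limited error
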
Support graph preconditioners: example                           [Vaidya]




  [Figure: graph G(A) and its maximum-weight spanning tree G(B)]
   • A is symmetric positive definite with negative off-diagonal nonzeros
   • B is a maximum-weight spanning tree for A
      (with diagonal modified to preserve row sums)
   • factor B in O(n) space and O(n) time
   • applying the preconditioner costs O(n) time per iteration
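
A SciPy sketch of constructing B (assumptions: G(A) is connected and all off-diagonals of A are negative). A maximum-weight spanning tree comes from minimum_spanning_tree after flipping the edge weights, and the diagonal is then fixed to preserve row sums:

import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import minimum_spanning_tree

def vaidya_tree(A):
    W = (-sp.triu(A, k=1)).tocsr()                # positive edge weights of G(A)
    flipped = W.copy()
    flipped.data = W.data.max() + 1.0 - W.data    # max-weight tree = min tree here
    T = minimum_spanning_tree(flipped)            # only the tree's pattern matters
    off = -W.multiply(T.astype(bool))             # original (negative) tree entries
    off = off + off.T
    d = np.asarray(A.sum(axis=1)).ravel() - np.asarray(off.sum(axis=1)).ravel()
    return off + sp.diags(d)                      # B has the same row sums as A

k = 4                                             # quick check on the 2D model problem
I = sp.identity(k)
T2 = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(k, k))
S = sp.diags([-1.0, -1.0], [-1, 1], shape=(k, k))
A = (sp.kron(I, T2) + sp.kron(S, I)).tocsr()
B = vaidya_tree(A)
print(np.allclose(B.sum(axis=1), A.sum(axis=1)))  # True: row sums preserved
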
Support graph preconditioners: example




  [Figure: each edge of G(A) supported by a path in the tree G(B)]
   • support each edge of A by a path in B
   • dilation(A edge) = length of supporting path in B
   • congestion(B edge) = # of supported A edges
   • p = max congestion, q = max dilation
   • condition number κ(B^-1 A) bounded by p·q (at most O(n^2))
Support graph preconditioners: example




  [Figure: G(B) augmented with a few strategically chosen extra edges]
   • can improve congestion and dilation by adding a few
     strategically chosen edges to B
   • cost of factor+solve is O(n^1.75), or O(n^1.2) if A is planar
   • in recent experiments [Chen & Toledo], often better than
     drop-tolerance MIC for 2D problems, but not for 3D.
Domain decomposition (introduction)

                        [ B     0     E  ]
                  A  =  [ 0     C     F  ]
                        [ E^T   F^T   G  ]

 • Partition the problem (e.g. the mesh) into subdomains
 • Use solvers for the subdomains (i.e., apply B^-1 and C^-1) to
   precondition an iterative solver on the interface

 • Interface system is the Schur complement:
               S = G – E^T B^-1 E – F^T C^-1 F
 • Parallelizes naturally by subdomains
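
A dense NumPy sketch of the Schur complement (illustrative block sizes; in practice B and C are factored once per subdomain and S is applied, never formed). The block identity checked at the end is what lets the interface iteration work with S alone:

import numpy as np

rng = np.random.default_rng(4)
def spd(k):                                   # random SPD block
    M = rng.standard_normal((k, k))
    return M @ M.T + k * np.eye(k)

nb, nc, ng = 4, 4, 2
B, C, G = spd(nb), spd(nc), spd(ng)
E = rng.standard_normal((nb, ng))
F = rng.standard_normal((nc, ng))

A = np.block([[B, np.zeros((nb, nc)), E],
              [np.zeros((nc, nb)), C, F],
              [E.T, F.T, G]])

S = G - E.T @ np.linalg.solve(B, E) - F.T @ np.linalg.solve(C, F)
# the interface block of A^-1 is exactly S^-1:
print(np.allclose(np.linalg.inv(A)[-ng:, -ng:], np.linalg.inv(S)))   # True
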
(Matlab demo)
 • grid and matrix structure for overlapping 2-way
   partition of eppstein
 Multigrid (introduction)

  [Figure: hierarchy of successively coarser meshes]

• For a PDE on a fine mesh, precondition using a solution on
  a coarser mesh
• Use idea recursively on hierarchy of meshes
• Solves the model problem (Poisson’s eqn) in linear time!
• Often useful when hierarchy of meshes can be built
• Hard to parallelize coarse meshes well

• This is just the intuition – lots of theory and technology
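
A minimal two-grid cycle for the 1D model problem, showing the smooth / restrict / coarse-solve / interpolate / correct structure (real multigrid recurses on the whole mesh hierarchy instead of solving the coarse grid exactly):

import numpy as np

def poisson(n):                               # 1D model problem
    return 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def interpolation(nc, nf):                    # linear interpolation, coarse -> fine
    P = np.zeros((nf, nc))
    for j in range(nc):
        i = 2 * j + 1                         # coarse point j = fine point 2j+1
        P[i, j] = 1.0
        P[i - 1, j] += 0.5
        P[i + 1, j] += 0.5
    return P

def two_grid(A, b, x, sweeps=3, w=2.0/3.0):
    d = np.diag(A)
    for _ in range(sweeps):                   # pre-smooth (weighted Jacobi)
        x = x + w * (b - A @ x) / d
    nf = A.shape[0]
    P = interpolation((nf - 1) // 2, nf)
    R = 0.5 * P.T                             # full-weighting restriction
    Ac = R @ A @ P                            # Galerkin coarse-grid operator
    x = x + P @ np.linalg.solve(Ac, R @ (b - A @ x))   # coarse correction
    for _ in range(sweeps):                   # post-smooth
        x = x + w * (b - A @ x) / d
    return x

n = 2**7 - 1
A, b, x = poisson(n), np.ones(n), np.zeros(n)
for _ in range(5):
    x = two_grid(A, b, x)
    print(np.linalg.norm(b - A @ x))          # residual drops fast every cycle
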
Other Krylov subspace methods
• Nonsymmetric linear systems:
   • GMRES:
     for i = 1, 2, 3, . . .
       find x_i ∈ K_i(A, b) such that r_i = (A x_i – b) ⊥ K_i(A, b)
     But, no short recurrence => save old vectors => lots more space

   • BiCGStab, QMR, etc.:
     Two spaces K_i(A, b) and K_i(A^T, b) w/ mutually orthogonal bases
     Short recurrences => O(n) space, but less robust

   • Convergence and preconditioning more delicate than CG
   • Active area of current research

• Eigenvalues: Lanczos (symmetric), Arnoldi (nonsymmetric)
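
Several of these methods are available directly in SciPy; a minimal GMRES call on a small nonsymmetric system (illustrative, not from the slides):

import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import gmres

n = 100
A = sp.diags([-1.0, 2.0, -0.5], [-1, 0, 1], shape=(n, n), format='csr')  # nonsymmetric
b = np.ones(n)
x, info = gmres(A, b)
print(info, np.linalg.norm(b - A @ x))   # info == 0 on convergence
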
The Landscape of Sparse Ax=b Solvers

                        Direct              Iterative
                        A = LU              y’ = Ay
                                                                 ↑ More General
  Nonsymmetric          Pivoting LU         GMRES, QMR, …

  Symmetric
  positive definite     Cholesky            Conjugate gradient
                                                                 ↓ More Robust

     ← More Robust                          Less Storage →
Complexity of direct methods

Time and space to solve any problem on any well-shaped finite element mesh:

  [Figure: 2D mesh with side n^1/2; 3D mesh with side n^1/3]

                            2D                3D
    Space (fill):        O(n log n)        O(n^4/3)
    Time (flops):        O(n^3/2)          O(n^2)
 Complexity of linear solvers

 Time to solve the model problem (Poisson’s equation) on a regular mesh:

   [Figure: 2D mesh with side n^1/2; 3D mesh with side n^1/3]

                                   2D                3D
 Sparse Cholesky:               O(n^1.5)           O(n^2)
 CG, exact arithmetic:          O(n^2)             O(n^2)
 CG, no precond:                O(n^1.5)           O(n^1.33)
 CG, modified IC:               O(n^1.25)          O(n^1.17)
 CG, support trees:             O(n^1.20)          O(n^1.75)
 Multigrid:                     O(n)               O(n)
