Sparse Matrix Methods

• Day 1: Overview
• Day 2: Direct methods
• Day 3: Iterative methods
  • The conjugate gradient algorithm
  • Parallel conjugate gradient and graph partitioning
  • Preconditioning methods and graph coloring
  • Domain decomposition and multigrid
  • Krylov subspace methods for other problems
  • Complexity of iterative and direct methods

SuperLU-dist: Iterative refinement to improve solution

Iterate:
• r = b - A*x
• backerr = max_i ( r_i / (|A|*|x| + |b|)_i )
• if backerr < ε or backerr > lasterr/2 then stop iterating
• solve L*U*dx = r
• x = x + dx
• lasterr = backerr
• repeat
Usually 0-3 steps are enough.

(Matlab demo) • iterative refinement

Convergence analysis of iterative refinement

Let C = I - A(LU)^-1  [so A = (I - C)·(LU)]
  x_1 = (LU)^-1 b
  r_1 = b - Ax_1 = (I - A(LU)^-1)b = Cb
  dx_1 = (LU)^-1 r_1 = (LU)^-1 Cb
  x_2 = x_1 + dx_1 = (LU)^-1 (I + C)b
  r_2 = b - Ax_2 = (I - (I - C)(I + C))b = C^2 b
  . . .
In general, r_k = b - Ax_k = C^k b.
Thus r_k → 0 if |largest eigenvalue of C| < 1.

The Landscape of Sparse Ax=b Solvers

                                  Direct (A = LU)    Iterative (y' = Ay)
  More general / nonsymmetric:    Pivoting LU        GMRES, QMR, …
  Symmetric positive definite:    Cholesky           Conjugate gradient
  (upward: more robust; direct methods are more robust, iterative methods need less storage)

Conjugate gradient iteration

  x_0 = 0, r_0 = b, p_0 = r_0
  for k = 1, 2, 3, . . .
    α_k = (r_{k-1}^T r_{k-1}) / (p_{k-1}^T A p_{k-1})   [step length]
    x_k = x_{k-1} + α_k p_{k-1}                         [approx solution]
    r_k = r_{k-1} - α_k A p_{k-1}                       [residual]
    β_k = (r_k^T r_k) / (r_{k-1}^T r_{k-1})             [improvement]
    p_k = r_k + β_k p_{k-1}                             [search direction]

• One matrix-vector multiplication per iteration
• Two vector dot products per iteration
• Four n-vectors of working storage

Conjugate gradient: Krylov subspaces

• Eigenvalues: Au = λu, with eigenvalues {λ_1, λ_2, …, λ_n}
• Cayley-Hamilton theorem: (A - λ_1 I)·(A - λ_2 I) · · · (A - λ_n I) = 0
  Therefore Σ_{i=0}^{n} c_i A^i = 0 for some coefficients c_i, so
  A^-1 = Σ_{i=1}^{n} (-c_i/c_0) A^{i-1}
• Krylov subspace: therefore if Ax = b, then x = A^-1 b and
  x ∈ span(b, Ab, A^2 b, …, A^{n-1} b) = K_n(A, b)

Conjugate gradient: Orthogonal sequences

• Krylov subspace: K_i(A, b) = span(b, Ab, A^2 b, …, A^{i-1} b)
• Conjugate gradient algorithm:
  for i = 1, 2, 3, … find x_i ∈ K_i(A, b) such that r_i = (Ax_i - b) ⊥ K_i(A, b)
• Notice r_i ∈ K_{i+1}(A, b), so r_i ⊥ r_j for all j < i
• Similarly, the "directions" are A-orthogonal: (x_i - x_{i-1})^T · A · (x_j - x_{j-1}) = 0
• The magic: short recurrences. A is symmetric => can get the next residual and direction from the previous one, without saving them all.

Conjugate gradient: Convergence

• In exact arithmetic, CG converges in n steps (completely unrealistic!)
• Accuracy after k steps of CG is related to:
  • consider polynomials of degree k that are equal to 1 at 0
  • how small can such a polynomial be at all the eigenvalues of A?
• Thus, eigenvalues close together are good
• Condition number: κ(A) = ||A||_2 ||A^-1||_2 = λ_max(A) / λ_min(A)
• The residual is reduced by a constant factor by O(κ(A)^{1/2}) iterations of CG

(Matlab demo) • CG on grid5(15) and bcsstk08 • n steps of CG on bcsstk08

Conjugate gradient: Parallel implementation

• Lay out the matrix and vectors by rows: processor j owns the row block A(j,:) and the pieces x(j), y(j)
• The hard part is the matrix-vector product y = A*x
• Algorithm, for each processor j:
  • Broadcast x(j)
  • Compute y(j) = A(j,:)*x
• May send more of x than needed
• Partition / reorder the matrix to reduce communication

(Matlab demo) • 2-way partition of eppstein mesh • 8-way dice of eppstein mesh

Preconditioners

• Suppose you had a matrix B such that:
  1. the condition number κ(B^-1 A) is small
  2. By = z is easy to solve
• Then you could solve (B^-1 A)x = B^-1 b instead of Ax = b
• B = A is great for (1), not for (2)
• B = I is great for (2), not for (1)
• Domain-specific approximations sometimes work
• B = diagonal of A sometimes works
• Or, bring back the direct methods technology. . .
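The CG recurrences above translate almost line for line into code. A minimal dense NumPy sketch (the function name and interface are illustrative; a real solver would use a sparse matrix-vector product):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, maxiter=None):
    """Plain CG for symmetric positive definite A, following the
    slide's recurrences: x0 = 0, r0 = b, p0 = r0."""
    n = len(b)
    maxiter = maxiter if maxiter is not None else n
    x = np.zeros(n)
    r = b.astype(float).copy()   # r0 = b - A*x0 = b
    p = r.copy()                 # p0 = r0
    rr = r @ r
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rr / (p @ Ap)    # step length
        x += alpha * p           # approximate solution
        r -= alpha * Ap          # residual
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        beta = rr_new / rr       # improvement
        p = r + beta * p         # search direction
        rr = rr_new
    return x
```

As the convergence slide warns, finite termination in n steps is a fiction in floating point; in practice one iterates until the residual norm passes a tolerance, which for ill-conditioned matrices may take more than n steps.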
(Matlab demo) • bcsstk08 with diagonal precond

Incomplete Cholesky factorization (IC, ILU)

• Compute factors of A by Gaussian elimination, but ignore fill
• Preconditioner B = R^T R ≈ A, not formed explicitly
• Compute B^-1 z by triangular solves (in time nnz(A))
• Total storage is O(nnz(A)), static data structure
• Either symmetric (IC) or nonsymmetric (ILU)

(Matlab demo) • bcsstk08 with ic precond

Incomplete Cholesky and ILU: Variants

• Allow one or more "levels of fill"
  • unpredictable storage requirements
• Allow fill whose magnitude exceeds a "drop tolerance"
  • may get better approximate factors than levels of fill
  • unpredictable storage requirements
  • choice of tolerance is ad hoc
• Partial pivoting (for nonsymmetric A)
• "Modified ILU" (MIC): add dropped fill to the diagonal of U or R
  • A and R^T R have the same row sums
  • good in some PDE contexts

Incomplete Cholesky and ILU: Issues

• Choice of parameters
  • good: smooth transition from iterative to direct methods
  • bad: very ad hoc, problem-dependent
  • tradeoff: time per iteration (more fill => more time) vs. number of iterations (more fill => fewer iterations)
• Effectiveness
  • condition number usually improves (only) by a constant factor (except MIC for some problems from PDEs)
  • still, often good when tuned for a particular class of problems
• Parallelism
  • triangular solves are not very parallel
  • reordering for parallel triangular solve by graph coloring

(Matlab demo) • 2-coloring of grid5(15)

Sparse approximate inverses

• Compute B^-1 ≈ A^-1 explicitly
• Minimize ||B^-1 A - I||_F (in parallel, by columns)
• Variants: factored form of B^-1, more fill, …
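Preconditioned CG changes the plain loop only slightly: each iteration applies B^-1 once, and the residual inner products become r^T z with z = B^-1 r. A sketch using the simplest choice from the preconditioners slide, B = diagonal of A (names are illustrative):

```python
import numpy as np

def preconditioned_cg(A, b, apply_Binv, tol=1e-10, maxiter=None):
    """CG on (B^-1 A)x = B^-1 b.  apply_Binv(z) solves By = z;
    for B = diag(A) that is just an elementwise division."""
    n = len(b)
    maxiter = maxiter if maxiter is not None else n
    x = np.zeros(n)
    r = b.astype(float).copy()
    z = apply_Binv(r)            # preconditioned residual
    p = z.copy()
    rz = r @ z
    for _ in range(maxiter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_Binv(r)
        rz_new = r @ z
        beta = rz_new / rz
        p = z + beta * p
        rz = rz_new
    return x

# The slide's B = diagonal of A: cheap to apply, sometimes effective.
def jacobi_Binv(A):
    d = np.diag(A).copy()
    return lambda z: z / d
```

Any of the later preconditioners (IC/ILU triangular solves, sparse approximate inverses, support trees) plugs in through the same `apply_Binv` hook.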
• Good: very parallel
• Bad: effectiveness varies widely

Support graph preconditioners: example [Vaidya]

[figure: the graphs G(A) and G(B)]
• A is symmetric positive definite with negative off-diagonal nonzeros
• B is a maximum-weight spanning tree for A (with the diagonal modified to preserve row sums)
• factor B in O(n) space and O(n) time
• applying the preconditioner costs O(n) time per iteration

Support graph preconditioners: example

• support each edge of A by a path in B
• dilation(A edge) = length of its supporting path in B
• congestion(B edge) = number of A edges it supports
• p = max congestion, q = max dilation
• condition number κ(B^-1 A) is bounded by p·q (at most O(n^2))

Support graph preconditioners: example

• can improve congestion and dilation by adding a few strategically chosen edges to B
• cost of factor + solve is O(n^1.75), or O(n^1.2) if A is planar
• in recent experiments [Chen & Toledo], often better than drop-tolerance MIC for 2D problems, but not for 3D

Domain decomposition (introduction)

      [ B    0    E ]
  A = [ 0    C    F ]
      [ E^T  F^T  G ]

• Partition the problem (e.g. the mesh) into subdomains
• Use solvers for the subdomains, B^-1 and C^-1, to precondition an iterative solver on the interface
• The interface system is the Schur complement: S = G - E^T B^-1 E - F^T C^-1 F
• Parallelizes naturally by subdomains

(Matlab demo) • grid and matrix structure for overlapping 2-way partition of eppstein

Multigrid (introduction)

• For a PDE on a fine mesh, precondition using a solution on a coarser mesh
• Use the idea recursively on a hierarchy of meshes
• Solves the model problem (Poisson's equation) in linear time!
• Often useful when a hierarchy of meshes can be built
• Hard to parallelize the coarse meshes well
• This is just the intuition; there is lots of theory and technology behind it

Other Krylov subspace methods

• Nonsymmetric linear systems:
  • GMRES: for i = 1, 2, 3, …
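The Schur-complement reduction above can be checked numerically. A small dense sketch (the function name and block sizes are illustrative; in practice S is applied within an iterative solver rather than formed, and the B and C solves run in parallel, one per subdomain):

```python
import numpy as np

def schur_interface_solve(B, C, E, F, G, f1, f2, g):
    """Solve the block system
        [ B    0    E  ] [x1]   [f1]
        [ 0    C    F  ] [x2] = [f2]
        [ E^T  F^T  G  ] [y ]   [g ]
    by eliminating the subdomain unknowns x1, x2 first."""
    # Interface system: S = G - E^T B^-1 E - F^T C^-1 F
    S = G - E.T @ np.linalg.solve(B, E) - F.T @ np.linalg.solve(C, F)
    rhs = g - E.T @ np.linalg.solve(B, f1) - F.T @ np.linalg.solve(C, f2)
    y = np.linalg.solve(S, rhs)           # interface unknowns
    x1 = np.linalg.solve(B, f1 - E @ y)   # subdomain 1 back-substitution
    x2 = np.linalg.solve(C, f2 - F @ y)   # subdomain 2 back-substitution
    return x1, x2, y
```

Note how the structure mirrors the slide: only the (small) interface system couples the subdomains, and everything else is independent per-subdomain work.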
    find x_i ∈ K_i(A, b) such that r_i = (Ax_i - b) ⊥ K_i(A, b)
    But there is no short recurrence => must save the old vectors => lots more space
  • BiCGStab, QMR, etc.: two spaces K_i(A, b) and K_i(A^T, b) with mutually orthogonal bases
    Short recurrences => O(n) space, but less robust
• Convergence and preconditioning are more delicate than for CG
• An active area of current research
• Eigenvalues: Lanczos (symmetric), Arnoldi (nonsymmetric)

The Landscape of Sparse Ax=b Solvers

                                  Direct (A = LU)    Iterative (y' = Ay)
  More general / nonsymmetric:    Pivoting LU        GMRES, QMR, …
  Symmetric positive definite:    Cholesky           Conjugate gradient
  (upward: more robust; direct methods are more robust, iterative methods need less storage)

Complexity of direct methods

Time and space to solve any problem on any well-shaped finite element mesh
(cross-sections have about n^{1/2} vertices in 2D, n^{1/3} in 3D):

                   2D            3D
  Space (fill):    O(n log n)    O(n^{4/3})
  Time (flops):    O(n^{3/2})    O(n^2)

Complexity of linear solvers

Time to solve the model problem (Poisson's equation) on a regular mesh:

                          2D           3D
  Sparse Cholesky:        O(n^1.5)     O(n^2)
  CG, exact arithmetic:   O(n^2)       O(n^2)
  CG, no precond:         O(n^1.5)     O(n^1.33)
  CG, modified IC:        O(n^1.25)    O(n^1.17)
  CG, support trees:      O(n^1.20)    O(n^1.75)
  Multigrid:              O(n)         O(n)
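GMRES builds its Krylov basis with the Arnoldi process, which makes the "no short recurrence" point concrete: for nonsymmetric A, each new basis vector must be orthogonalized against all previous ones, so the old vectors must be kept. A minimal sketch (the function name is illustrative; breakdown handling is omitted):

```python
import numpy as np

def arnoldi(A, b, k):
    """Build an orthonormal basis V of K_{k+1}(A, b) and the (k+1) x k
    upper Hessenberg matrix H satisfying A @ V[:, :k] = V @ H."""
    n = len(b)
    V = np.zeros((n, k + 1))
    H = np.zeros((k + 1, k))
    V[:, 0] = b / np.linalg.norm(b)
    for j in range(k):
        w = A @ V[:, j]
        for i in range(j + 1):        # orthogonalize against ALL previous
            H[i, j] = V[:, i] @ w     # vectors: this is the storage cost
            w -= H[i, j] * V[:, i]    # that CG's symmetry avoids
        H[j + 1, j] = np.linalg.norm(w)
        V[:, j + 1] = w / H[j + 1, j]
    return V, H
```

When A is symmetric, H becomes tridiagonal and the inner loop collapses to a three-term recurrence; that is the Lanczos process underlying both CG and the symmetric eigenvalue method named on the last Krylov slide.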