Docstoc

lukas

Document Sample
lukas Powered By Docstoc
					       Parallel Fast BEM
for Distributed Memory Systems

        BEM on the Saar 2012, May 15

       as        ar            ar a
 D. Luk´ˇ, P. Kov´ˇ, and T. Kov´ˇov´


     Department of Applied Mathematics
     Centre of Excellence IT4Innovations
 ˇ
VSB–Technical University of Ostrava, Czech Rep.
          email: dalibor.lukas@vsb.cz
   Parallel Fast BEM for Distributed Memory Systems

Motivation: railway wheel noise elimination by profiling




                                                  ˇ
 Courtesy of J. Szweda, Department of Mechanics, VSB–TU Ostrava
                                Outline

• FEM, BEM for time–harmonic elasticity
• Galerkin BEM for acoustics
• Parallel H–matrices, adaptive cross approximation (ACA)
• Scalability on a distributed memory system
• Outlook
                                Outline

• FEM, BEM for time–harmonic elasticity
• Galerkin BEM for acoustics
• Parallel H–matrices, adaptive cross approximation (ACA)
• Scalability on a distributed memory system
• Outlook
                 FEM for time–harmonic elasticity

Time–harmonic boundary value problem with mixed b.c.
 Given a material density ρ, elastic modulus E, Poisson ratio ν, damping β, and a load
 f applied at a perimeter point a with an angular frequency ω, we search for
                                u(x, t) := R u(x)eiωt
 such that
                  − (1 + iωβ(ω)) divσ(u) − ρ ω 2 u = 0     in Ω,
                                                 u = 0     on ΓD,
                                         σ(u) · n = f n δa on ΓN,
 where
                                                         Eν                    E
    σij (u) = λδij ∇ · u + µ(∂iuj + ∂j ui),   λ :=                 ,   µ :=          .
                                                   (1 + ν)(1 − 2ν)          2(1 + ν)

 What kind of singularity uP = u − uH does the Dirac in the Neumann data pose?
                  3                     3
 I.e. uH ∈ H 1(Ω) , but uP, u ∈ H 1(Ω) .
 FEM analysis using ANSYS
                    BEM for viscoelastodynamics

Viscoelastodynamics [Chaillat ’08]
                         −divσ(u) − ρ ω 2 u = 0     in Ω,
                                          u = 0     on ΓD,
                                  σ(u) · n = f n δa on ΓN,
 where
                                                 Eν                              E
 σij (u) = λδij ∇·u+µ(∂iuj +∂j ui),   λ :=                 (1 + iωβ),    µ :=          (1 + iωβ
                                           (1 + ν)(1 − 2ν)                    2(1 + ν)

Regularized Galerkin BEM approach [Kielhorn & Schanz ’08]

   τ , V t ΓD − τ , Ku    ΓD   = − τ , V (f nδa) ΓD
                                                               on H−1/2(ΓD) × H1/2(ΓN)
   K ′t, v ΓN + V u, v    ΓN   = (1/2I − K ′)(f nδa), v   ΓN

 The regularized hypersingular operator D consists of 14 parts, while 4 are elastostatic.
 Numerical simulation on 426 elems., 215 nodes: V , K, D assembled in 26, 703, 2501 s
                                Outline

• FEM, BEM for time–harmonic elasticity
• Galerkin BEM for acoustics
• Parallel H–matrices, adaptive cross approximation (ACA)
• Scalability on a distributed memory system
• Outlook
                       Galerkin BEM for acoustics

Neumann boundary value problem
 Given the angular frequency ω, the boundary displacement field u(x, t) := R(u(x)eiωt)
 and the material density ρ, denote r := |x − y|, κ := ω/c, c := 340 ms−1, we solve for
                                p(x, t) := R p(x)eiωt
 such that
             −△p(x) − κ2p(x) = 0                         in Ωe := R3 \ Ω,
                  ∂p(x)/∂n(x) = vn(x) := ω 2ρu(x) · n(x) on Γ,
         |∇p(x) x/|x| − iκp(x)| = O(|x|−2)               as |x| → ∞.


Fundamental solution, representation formula

             eiκr
 P (x, y) :=      ,   x ∈ Ωe : p(x) = −       vn(y)P (x, y) dS(y)+       p(y) (∂P (x, y)/∂n(y)) dS
             4πr
                                          Γ                          Γ
                                 Galerkin BEM for acoustics

Boundary integral equation: the Galerkin direct approach
                                                         ′
                                Dκp, v   Γ   = (−1/2I + Kκ)vn, v            Γ   on H 1/2(Γ)

Regularized BI operators

  Dκp, v       Γ   :=           P (x, y) (n(x) × ∇v(x)) · (n(y) × ∇p(y)) − κ2n(x)v(x) · n(y)p(y) dS
                        Γ   Γ

                                                  ′
  vn , v   Γ   :=       vn(x)v(x) dS(x),         Kκvn, v   Γ   :=               (∂P (x, y)/∂n(y)) vn(y)v(x) dS(y) dS
                    Γ                                               Γ   Γ


Boundary element discretization
 p, v ∈ H1/2(Γ), resp. vn ∈ H−1/2(Γ), by pcw. linear, resp. constant, functions,
 Semi–analytic integration due to Steinbach & Rjasanow ’07, Dκ partly by Zapletal ’11.
                    Galerkin BEM for acoustics

Numerical simulation: vn and p for 341 Hz




 15112 triangles, 22668 nodes,
 ACA assembling of Kκ (compressed to 12%) in 25 minutes, of Dκ (15%) in 3.4 hours,
 142 GMRES iters. in 223 s
                    Galerkin BEM for acoustics

Numerical simulation: vn and p for 2706 Hz




 15112 triangles, 22668 nodes,
 ACA assembling of Kκ (compressed to 12%) in 25 minutes, of Dκ (15%) in 4 hours,
 700 GMRES iters.
                                Outline

• FEM, BEM for time–harmonic elasticity
• Galerkin BEM for acoustics
• Parallel H–matrices, adaptive cross approximation (ACA)
• Scalability on a distributed memory system
• Outlook
Parallel H–matrices, adaptive cross approximation (ACA)

Geometric cluster bisection
   1                                     1                                     1

  0.9                                   0.9                                   0.9

  0.8                                   0.8                                   0.8

  0.7                                   0.7                                   0.7

  0.6                                   0.6                                   0.6

  0.5                                   0.5                                   0.5

  0.4                                   0.4                                   0.4

  0.3                                   0.3                                   0.3

  0.2                                   0.2                                   0.2

  0.1                                   0.1                                   0.1

   0                                     0                                     0
        0   0.2   0.4   0.6   0.8   1         0   0.2   0.4   0.6   0.8   1         0   0.2   0.4   0.6   0.8   1




Admissible pairs of clusters

  min{diam Cx, diam Cy } ≤
             2 min {rad Cx, rad Cy ) ≤ η |xCx − xCy | − rad Cx − rad Cy
                                                                   ≤ η dist(Cx, Cy ),
 where η ∈ (0, 1), xC the center of C, rad C := max |xC − xC |.
                                                      k
                                                                                                                    k
Parallel H–matrices, adaptive cross approximation (ACA)

Quad–tree of cluster pairs, H–matrices

                                          C ×C



            C1 × C1            C1 × C2                 C2 × C1     C2 × C2

                                         111
                                         000       000
                                                   111
                                                   111
                                                   000
                                         000
                                         111
                                         000
                                         111       111
                                                   000
                                         111
                                         000       111
                                                   000
                                                   000
                                                   111
                                         000
                                         111
                                         000
                                         111       000
                                                   111
                                               11111
                                               00000
                                               00000
                                               11111
                                               11111
                                               00000
                                               11111
                                               00000
                                               11111
                                               00000
                                               00000
                                               11111
                                               11111
                                               00000
                                               11111
                                               00000
                                               11111
                                               00000




                                 00000
                                 11111
                                 11111
                                 00000
                                 11111
                                 00000
                                 11111
                                 00000
                                 00000
                                 11111
                                 11111
                                 00000
                                 11111
                                 00000
                                 11111
                                 00000



 Nonadmissible blocks assembled as full, admissible approximated by low–rank matrices.
Parallel H–matrices, adaptive cross approximation (ACA)

Asymptotically smooth functions
 Assume (A)i,j := γi γj f (x, y) dSy dSx, where γi ⊂ Cx, γj ⊂ Cy ; Cx, Cy admissible.
 f : Cx × Cy → R is asymptotically smooth if
 ∃c1, c2 > 0 ∃g ≤ 0 ∀α ∈ Nd : |∂x f (x, y)| , ∂y f (x, y) ≤ c1p!(c2)p|x−y|g−p,
                          0
                                α              α
                                                                                         p = |α|.


Adaptive cross approximation (ACA)

                 A11 A12         A11     A12                  A11
 PCx A PT y =:
        C                    ≈                         =                 A−1 A11, A12
                                                                          11
                 A21 A22         A21 A21 A−1 A12
                                          11                  A21
                                                       =: (u1, . . . , ur ) (v1, . . . , vr )T .
 The rank r := r(ε), where A11 ∈ Cr×r , is adaptively controlled by ε. The pivots,
 stored in PCx , PCy , are chosen as to maximize |det Ak | with a wish to minimize
                                                       11
  Rk ≡ Ak − Ak (Ak )−1 Ak .
            22     21     11     12
Parallel H–matrices, adaptive cross approximation (ACA)
Parallel implementation on a shared memory system

           processor 1    processor 2                     processor N
                                              ...
             master          slave                           slave


          Nonadm/adm     Nonadm/adm           ...        Nonadm/adm
            blocks 1       blocks 2                        blocks N


              shared
              memory             Quad-tree



                         nodes               triangles
Parallel H–matrices, adaptive cross approximation (ACA)

Laplace problem with Dir. datum u(x) := x1 +x2 +x3 on B1 := {x : |x| ≤ 1}
                    compr.           scheduling+assembling times of V [s]
     n      err.     of V   N := 2   N := 4     N := 8 N := 16 N := 32 N := 46
    40      3e-4     100%    0+0       0+0        0+0       0+0       0+0 0+0
    160    2.3e-4    100%    0+0       0+0        0+0       0+0       0+0 0+0
    640    9.5e-5     99%    0+4       0+2        0+1       0+0       0+1 0+0
   2560    4.3e-5     65%   0+43      2+23       0+12       0+6       0+3 0+3
   10240   2.1e-5     27%   2+282    1+143       1+72      1+35      1+19 0+13
   40960   1.1e-5     10%  46+1572   24+792     15+399 17+201 20+102 21+72
  163840   6.4e-6     3% 3041+8219 1457+4162 828+2084 492+1061 409+543 397+377
                                      V (u − uh), u − uh   Γ
                            err. :=
                                             V u, u   Γ

                                          n log n
 Towards parallel scalability: CP U = O      N      , but M em = O(N n log n).
Parallel H–matrices, adaptive cross approximation (ACA)

Prof. Bebendorf ’s advice: ACA for the double-layer operator
 Apply ACA to the kernel of K, rather than to the matrix:
                                                 1
                         (x − y) · n(y) ·
                                             |x − y|3
                             rank-4 function
                                               ACA       rank-k function
                                         rank-4k function

 Assembling:
 Step 1. ACA for    Ti Tj   1/|x − y|2                    → pivots (i1, j1), . . . , (ik , jk )
                                          (i,j)∈CX ×CY
                                              k
 Step 2. Assemble M := 1/|xis −       yjt |3 s,t=1
 Step 3. Assemble 4 ’column’ matrices U1 :=              Ti x1/|x − yjl |3                   , . . . U4
                                                                               i∈CX ,l=1:k
 Step 4. Assemble 4 ’row’ matrices V1 :=          Tj n1/|xil − y|3                      , . . . V4
                                                                           l=1:k,j∈CY
                    4
 Action: K · w ≈    r=1 Ur   · M−T · (Vj · w)
Parallel H–matrices, adaptive cross approximation (ACA)

Helmholtz, Dirich. u(x) := eıκ|x−xs |/(4π|x − xs|), κ := 2.8, xs := (2, 2, 2) on B1
                    compr.       scheduling+assembling times of Vκ [s]
     n      err.     of Vκ N := 2 N := 4 N := 8 N := 16            N := 32
    40     3.3e-1    100%   0+0      0+0      0+0        0+0         0+0
    160    1.2e-1    100%   0+1      0+1      0+1        0+1         0+1
    640    3.6e-2    100%  0+10      0+4      0+3        0+2         0+2
   2560    9.9e-3    100% 0+142 0+72          0+38       0+20         0+9
   10240   2.8e-3     65% 66+1388 27+673 7+335          7+168        5+88
   40960   9.0e-4     26%                   452+3600 280+1823 233+929
  163840   3.3e-4     8%                                         4011+19892
                                       M (u − uh), u − uh   Γ
                            err. :=
                                             M u, u   Γ

                                          n log n
 Towards parallel scalability: CP U = O      N      , but M em = O(N n log n).
                                Outline

• FEM, BEM for time–harmonic elasticity
• Galerkin BEM for acoustics
• Parallel H–matrices, adaptive cross approximation (ACA)
• Scalability on a distributed memory system
• Outlook
         Scalability on a distributed memory system

The idea
 N processes, N × N submatrices
  • Each diagonal block with the related geometry data assigned to one process
    ⇒ both memory and CPU balanced, since most nonadmissible blocks are dis-
      tributed efficiently.
  • Each geometrically closely related N −1 off–diagonal blocks assigned to one process
                                       n log n
     ? memory balanced: M em = O          N      + √n
                                                    N
     ? CPU balanced
         Scalability on a distributed memory system

Finding optimal distributions by brute force fails

  • N = 2: 2 cases,
  • N = 4: 34650 cases,
  • N = 8: 4 · 1042 cases.
                                (N − 1) N       (N − 2) N         2N
            number of cases =               ·               ···
                                   N               N               N
          Scalability on a distributed memory system

Cyclic decomposition of undirected graphs
            N := 3                      N := 7                    N := 21




 It is equivalent to perfect difference sets [Singer, 1934]: decompositions available for
                                N (N − 1) p(p − 1)
                                          =        ,
                                   2N         2
 where p + 1 is a power of a prime number.
           Scalability on a distributed memory system

ACA for Laplace 1-layer matrix on a cube

         compr.              average memory [MB],                 CPU [s] per process
    n      V    N := 1 N := 7 N := 31 N := 57                     N := 73 N := 91       N := 133
   3072 21.5% 160, 8 148, 1       170, 0   194, 0                  177, 0   197, 0         ?, 0
  12288 13.1% 267, 59 163, 7      175, 1   176, 1                  167, 1   200, 1        207, 1
  49152 5.2% 884, 367 263, 51 201, 10 194, 8                       195, 6   214, 5        220, 4
  196608 1.8%          705, 226 353, 53 274, 32                   254, 25 280, 25        276, 18
  786432 0.7%                    999, 294 668, 172                599, 119 570, 110      535, 99
 3145728 0.3%                                                                        1911 MB, 596 s
 ACA: η := 1.1, ε := 10−4, . . . , 10−9, nmin := 10, . . . , 60

                                      n log n                 n log n
 Parallel scalability: CP U = O          N
                                                , M em = O       N
                                                                        + √n .
                                                                           N
         Scalability on a distributed memory system

Int. Laplace problem with Dir. datum u(x) := 1/|x−(2, 2, 2)| on Ω := (0, 1)3

    #elems                       assemble time: CPU(V)/CPU(K) [s]
  error, #CG       memory [MB] per process: compression of V/compression of K [%]
                   N := 7       N := 31       N := 57      N := 73       N := 133
      3072        114:2/84      42:2/24       24:0/21      20:1/17
   2.6e-2, 59    159:41/84     173:40/93     176:42/99   192:46/100
     12288       545:11/396    153:2/81       95:0/54      77:2/75        47:0/30
   1.3e-2, 78    247:19/41     213:19/45     210:18/49    206:20/53     202:23/67
     49152      2752:69/2209 819:13/474      601:6/280    446:8/292      241:7/176
  6.5e-3, 102     803:8/16     347:8/17      291:8/19      277:8/20      258:9/25
    196608                   3171:83/2521 2122:45/1282 1885:39/1348 1016:31/790
  3.3e-3, 129                  1025:3/6       717:3/7      646:3/7        529:3/8
    786432                                                           4247 s:161/4085
  1.7e-3, 167                                                          1885 MB:1/3
                                Outline

• FEM, BEM for time–harmonic elasticity
• Galerkin BEM for acoustics
• Parallel H–matrices, adaptive cross approximation (ACA)
• Scalability on a distributed memory system
• Outlook
   Parallel Fast BEM for Distributed Memory Systems

Conclusion

  • parallel fast BEM on a cluster,
  • ACA for the double-layer operator

Outlook

  • ACA–BEM for viscoelastodynamics,
  • time–domain acoustics

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:12
posted:10/16/2012
language:English
pages:27