Algorithms for Dynamical Fermions

A D Kennedy
School of Physics, The University of Edinburgh

NCTS 2006 School on Modern Numerical Methods in Mathematics and Physics

Sunday, 13 January 2013
Outline

         Monte Carlo integration
         Importance sampling
         Markov chains
         Detailed balance, Metropolis algorithm
         Symplectic integrators
         Hybrid Monte Carlo
         Pseudofermions
         RHMC
         QCD Computers


Monte Carlo methods: I

       Monte Carlo integration is based on the
       identification of probabilities with measures

       There are much better methods of carrying out
       low dimensional quadrature
            All other methods become hopelessly expensive for large
            dimensions

            In lattice QFT there is one integration per degree of freedom
                We are approximating an infinite dimensional functional integral


Monte Carlo methods: II
       Generate a sequence of random field configurations
       $(\phi_1, \phi_2, \ldots, \phi_t, \ldots, \phi_N)$ chosen from the
       probability distribution
           $$P(\phi_t)\,d\phi_t = \frac{1}{Z}\, e^{-S(\phi_t)}\,d\phi_t$$
       Measure the value of $\Omega$ on each configuration and compute the
       average
           $$\bar\Omega \equiv \frac{1}{N}\sum_{t=1}^{N}\Omega(\phi_t)$$
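As a concrete illustration (not from the slides), the sample average and its naive $\sqrt{C_2/N}$ error can be sketched in a few lines of Python; the function names and the Gaussian toy distribution are my own choices:

```python
import math
import random

def mc_average(phi, draw, N, rng):
    # <phi> estimated as (1/N) sum_t phi(x_t) over configurations x_t drawn
    # from the target distribution; the error estimate sqrt(C2/N) uses the
    # sample variance C2.
    values = [phi(draw(rng)) for _ in range(N)]
    mean = sum(values) / N
    c2 = sum((v - mean) ** 2 for v in values) / (N - 1)
    return mean, math.sqrt(c2 / N)

# Toy example: <x^2> = 1 for x drawn from a unit Gaussian.
rng = random.Random(1)
mean, err = mc_average(lambda x: x * x, lambda r: r.gauss(0.0, 1.0), 100000, rng)
```

With $N = 10^5$ samples the estimate should sit within a few error bars of 1, and the error itself is $\approx\sqrt{2/N}$.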
Central Limit Theorem: I
       Law of Large Numbers
           $$\lim_{N\to\infty} \bar\Omega = \langle\Omega\rangle$$
       Central Limit Theorem
           $$\bar\Omega = \langle\Omega\rangle + O\!\left(\sqrt{\frac{C_2}{N}}\right)$$
       where the variance of the distribution of $\Omega$ is
           $$C_2 = \big\langle\left(\Omega - \langle\Omega\rangle\right)^2\big\rangle$$
       The Laplace–DeMoivre Central Limit theorem is an asymptotic expansion
       for the probability distribution of $\bar\Omega$
           Distribution of values for a single sample $\omega = \Omega(\phi)$:
               $$P_\Omega(\omega) = \int d\phi\, P(\phi)\,
                 \delta\big(\Omega(\phi) - \omega\big)$$
Central Limit Theorem: II
        Generating function for connected moments
            $$W_\Omega(k) = \ln \int d\omega\, P_\Omega(\omega)\, e^{ik\omega}
              = \ln \int d\phi\, P(\phi)\, e^{ik\Omega(\phi)}
              = \ln \big\langle e^{ik\Omega} \big\rangle
              = \sum_{n=0}^{\infty} \frac{(ik)^n}{n!}\, C_n$$
        The first few cumulants are
            $$C_0 = 0, \qquad C_1 = \langle\Omega\rangle, \qquad
              C_2 = \big\langle(\Omega - \langle\Omega\rangle)^2\big\rangle$$
            $$C_3 = \big\langle(\Omega - \langle\Omega\rangle)^3\big\rangle, \qquad
              C_4 = \big\langle(\Omega - \langle\Omega\rangle)^4\big\rangle - 3C_2^2$$
        Note that this is an asymptotic expansion
Central Limit Theorem: III
            Distribution of the average of N samples
                $$P_{\bar\Omega}(\bar\omega) = \int d\phi_1 \cdots d\phi_N\,
                  P(\phi_1)\cdots P(\phi_N)\,
                  \delta\!\left(\frac{1}{N}\sum_{t=1}^{N}\Omega(\phi_t)
                  - \bar\omega\right)$$
            Connected generating function
                $$W_{\bar\Omega}(k) = \ln \int d\bar\omega\,
                  P_{\bar\Omega}(\bar\omega)\, e^{ik\bar\omega}$$
                $$= \ln \int d\phi_1 \cdots d\phi_N\, P(\phi_1)\cdots P(\phi_N)\,
                  \exp\!\left(\frac{ik}{N}\sum_{t=1}^{N}\Omega(\phi_t)\right)$$
                $$= \ln \left[\int d\phi\, P(\phi)\, e^{ik\Omega(\phi)/N}\right]^N
                  = N \ln \big\langle e^{ik\Omega/N} \big\rangle$$
                $$= N\, W_\Omega\!\left(\frac{k}{N}\right)
                  = \sum_{n=1}^{\infty} \frac{(ik)^n}{n!}\, \frac{C_n}{N^{n-1}}$$
Central Limit Theorem: IV
            Take the inverse Fourier transform to obtain the distribution
            $P_{\bar\Omega}$
                $$P_{\bar\Omega}(\bar\omega) = \frac{1}{2\pi} \int dk\,
                  e^{W_{\bar\Omega}(k)}\, e^{-ik\bar\omega}$$
                $$= \exp\!\left(-\frac{C_3}{3!\,N^2}\frac{d^3}{d\bar\omega^3}
                  + \frac{C_4}{4!\,N^3}\frac{d^4}{d\bar\omega^4} - \cdots\right)
                  \frac{1}{2\pi}\int dk\, e^{ik(C_1 - \bar\omega)}\,
                  e^{-k^2 C_2/2N}$$
                $$= \exp\!\left(-\frac{C_3}{3!\,N^2}\frac{d^3}{d\bar\omega^3}
                  + \frac{C_4}{4!\,N^3}\frac{d^4}{d\bar\omega^4} - \cdots\right)
                  \frac{e^{-N(\bar\omega - C_1)^2/2C_2}}{\sqrt{2\pi C_2/N}}$$
Central Limit Theorem: V
      Re-scale to show convergence to the Gaussian distribution
          $$P_{\bar\Omega}(\bar\omega) = F_{\bar\Omega}(\xi)\,
            \frac{d\xi}{d\bar\omega}$$
      where $\xi \equiv \sqrt{N}\,\big(\bar\omega - \langle\Omega\rangle\big)$
      and
          $$F_{\bar\Omega}(\xi) = \left[1
            + \frac{C_3\,\xi\,(\xi^2 - 3C_2)}{6\,C_2^3\,\sqrt{N}}
            + O\!\left(\frac{1}{N}\right)\right]
            \frac{e^{-\xi^2/2C_2}}{\sqrt{2\pi C_2}}$$
Importance Sampling: I
           Integral
               $$I = \int dx\, f(x)$$
           Sample from distribution $\rho(x)$
               Probability $\rho(x) \ge 0$ a.e.
               Normalisation $N = \int dx\, \rho(x) = 1$
           Estimator of integral
               $$I = \int \rho(x)\, dx\, \frac{f(x)}{\rho(x)}$$
           Estimator of variance
               $$V = \int \rho(x)\, dx \left(\frac{f(x)}{\rho(x)} - I\right)^2
                   = \int dx\, \frac{f(x)^2}{\rho(x)} - I^2$$
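The estimator above can be sketched directly (my own naming; the toy integral $\int_0^1 e^x\,dx = e - 1$ with a uniform weight $\rho = 1$ is an illustrative choice, not from the slides):

```python
import math
import random

def importance_estimate(f, rho, draw, N, rng):
    # I = ∫ f(x) dx estimated as (1/N) Σ f(x_t)/rho(x_t) with x_t ~ rho,
    # together with the naive error on the mean computed from the sample
    # variance of the weights f/rho.
    w = [f(x) / rho(x) for x in (draw(rng) for _ in range(N))]
    mean = sum(w) / N
    var = sum((v - mean) ** 2 for v in w) / (N - 1)
    return mean, math.sqrt(var / N)

# Toy example: ∫_0^1 e^x dx = e - 1, sampled with the uniform density rho = 1.
rng = random.Random(2)
est, err = importance_estimate(math.exp, lambda x: 1.0,
                               lambda r: r.random(), 200000, rng)
```

Choosing $\rho$ closer to $|f|$ shrinks the spread of the weights and hence the variance, which is the point of the next slides.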
Importance Sampling: II
   Minimise the variance
      Constraint $N = 1$
      Lagrange multiplier $\lambda$
          $$\frac{\delta(V + \lambda N)}{\delta\rho(y)}
            = -\frac{f(y)^2}{\rho(y)^2} + \lambda = 0$$
   Optimal measure
       $$\rho_{\rm opt}(x) = \frac{|f(x)|}{\int dx\, |f(x)|}$$
   Optimal variance
       $$V_{\rm opt} = \left(\int dx\, |f(x)|\right)^2
         - \left(\int dx\, f(x)\right)^2$$
Importance Sampling: III
        Example: $f(x) = \sin(x)$, so $I = \int_0^{2\pi} dx\, \sin(x) = 0$
        Optimal weight
            $$\rho_{\rm opt}(x) = \tfrac14 \left|\sin(x)\right|$$
        Optimal variance
            $$V_{\rm opt} = \left(\int_0^{2\pi} dx\, |\sin(x)|\right)^2
              - \left(\int_0^{2\pi} dx\, \sin(x)\right)^2 = 4^2 - 0^2 = 16$$
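The value $V_{\rm opt} = 16$ can be checked numerically by sampling exactly from $\rho_{\rm opt}(x) = |\sin x|/4$; the inverse-CDF construction below is my own (each half-period carries probability $1/2$, with CDF $(1-\cos x)/4$ on the first), and the estimator $f/\rho = \pm 4$ so the sample variance should come out at 16:

```python
import math
import random

def draw_abs_sin(rng):
    # Inverse-CDF sample from rho(x) = |sin x|/4 on [0, 2*pi]: each
    # half-period has probability 1/2, with CDF (1 - cos x)/4 on [0, pi].
    u = rng.random()
    if u < 0.5:
        return math.acos(1.0 - 4.0 * u)
    return math.pi + math.acos(1.0 - 4.0 * (u - 0.5))

rng = random.Random(3)
N = 100000
w = []
for _ in range(N):
    x = draw_abs_sin(rng)
    w.append(math.sin(x) / (abs(math.sin(x)) / 4.0))  # f/rho = ±4
mean = sum(w) / N                                     # estimates I = 0
var = sum((v - mean) ** 2 for v in w) / (N - 1)       # estimates V_opt = 16
```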
Importance Sampling: IV
    Example (continued): $\int_0^{2\pi} dx\, \sin x$
    1  Construct a cheap approximation to |sin(x)|
    2  Calculate the relative area within each rectangle
    3  Choose a random number uniformly
    4  Select the corresponding rectangle
    5  Choose another random number uniformly
    6  Select the corresponding x value within the rectangle
    7  Compute |sin(x)|
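The seven steps above can be sketched as follows (my own naming; for a clean check the code estimates $\int_0^{2\pi}|\sin x|\,dx = 4$, whose weights are nearly constant under the rectangle density):

```python
import bisect
import math
import random
from itertools import accumulate

# Steps 1-2: piecewise-constant approximation to |sin x| on [0, 2*pi]
# (heights at rectangle midpoints) and the cumulative rectangle areas.
n = 100
width = 2.0 * math.pi / n
heights = [abs(math.sin((i + 0.5) * width)) for i in range(n)]
cum = list(accumulate(h * width for h in heights))
total = cum[-1]

def sample_x(rng):
    # Steps 3-6: pick a rectangle with probability proportional to its area
    # (binary search on the cumulative areas), then a uniform x inside it;
    # rho(x) is the normalised rectangle height.
    i = min(bisect.bisect_left(cum, rng.random() * total), n - 1)
    return (i + rng.random()) * width, heights[i] / total

rng = random.Random(4)
N = 50000
acc = 0.0
for _ in range(N):
    x, rho = sample_x(rng)
    acc += abs(math.sin(x)) / rho   # step 7: the weighted observable
est = acc / N                       # estimates ∫_0^{2π} |sin x| dx = 4
```

Because the rectangles closely match $|\sin x|$, the weights $|\sin x|/\rho(x)$ barely fluctuate, so the estimate is very precise even for modest $N$.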
Importance Sampling: V
         Example (continued): $\int_0^{2\pi} dx\, \sin x$
         With 100 rectangles we have V = 16.02328561
             But we can do better!
             $$I = \int_0^{2\pi} dx\, \frac{|\sin(x)| + \sin(x)}{2}
                 - \int_0^{2\pi} dx\, \frac{|\sin(x)| - \sin(x)}{2}$$
             Each integrand is non-negative, so for each piece
             $V_{\rm opt} = 0$
         With 100 rectangles we have V = 0.011642808
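A minimal sketch of why the split helps (my own setup, not the slides' exact code): apply the rectangle scheme to the non-negative piece, here realised as $\sin x$ on $[0,\pi]$ with $I_+ = 2$. Since the integrand never changes sign and the rectangles track it closely, the weights are nearly constant and the measured variance is tiny:

```python
import bisect
import math
import random
from itertools import accumulate

# Rectangles approximating the non-negative integrand sin(x) on [0, pi]
# (the positive part of sin on [0, 2*pi]); sampling from them makes the
# weights f/rho cluster near a constant, so the variance is tiny.
n = 100
width = math.pi / n
heights = [math.sin((i + 0.5) * width) for i in range(n)]
cum = list(accumulate(h * width for h in heights))
total = cum[-1]

rng = random.Random(5)
N = 50000
w = []
for _ in range(N):
    i = min(bisect.bisect_left(cum, rng.random() * total), n - 1)
    x = (i + rng.random()) * width
    w.append(math.sin(x) * total / heights[i])  # weights cluster near I_+ = 2
mean = sum(w) / N
var = sum((v - mean) ** 2 for v in w) / (N - 1)
```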
Markov Chains: I
       State space $\Omega$
       (Ergodic) stochastic transitions $P': \Omega \to \Omega$
       Deterministic evolution of the probability distribution,
       $P: Q \to Q$
       The distribution converges to a unique fixed point $Q_\infty$
Convergence of Markov Chains: I
      Define a metric
          $$d(Q_1, Q_2) = \int dx\, \big|Q_1(x) - Q_2(x)\big|$$
      on the space of (equivalence classes of) probability distributions
      Prove that
          $$d(PQ_1, PQ_2) \le (1 - \alpha)\, d(Q_1, Q_2)$$
      with $\alpha > 0$, so the Markov process P is a contraction mapping
      The sequence Q, PQ, P²Q, P³Q, … is Cauchy
      The space of probability distributions is complete, so the sequence
      converges to a unique fixed point
          $$Q_\infty = \lim_{n\to\infty} P^n Q$$
Convergence of Markov Chains: II
  $$d(PQ_1, PQ_2) = \int dx\, \big|PQ_1(x) - PQ_2(x)\big|
    = \int dx \left| \int dy\, P(x \leftarrow y)\, Q_1(y)
    - \int dy\, P(x \leftarrow y)\, Q_2(y) \right|$$
  Define $Q(y) \equiv Q_1(y) - Q_2(y)$ and split it into its positive and
  negative parts, $Q(y) = Q_+(y) - Q_-(y)$ with $Q_\pm \ge 0$:
  $$= \int dx \left| \int dy\, P(x \leftarrow y)\,
    \big[Q_+(y) - Q_-(y)\big] \right|$$
  Using $|a - b| = a + b - 2\min(a, b)$ for $a, b \ge 0$:
  $$= \int dx \int dy\, P(x \leftarrow y)\, \big[Q_+(y) + Q_-(y)\big]
   - 2 \int dx\, \min_\pm \int dy\, P(x \leftarrow y)\, Q_\pm(y)$$
Convergence of Markov Chains: III
  Using $\int dx\, P(x \leftarrow y) = 1$:
  $$d(PQ_1, PQ_2) \le \int dy\, |Q(y)|
    - 2 \int dx\, \inf_y P(x \leftarrow y)\, \min_\pm \int dy\, Q_\pm(y)$$
  Since $Q_1$ and $Q_2$ are both normalised,
  $$\int dy\, Q_+(y) - \int dy\, Q_-(y)
    = \int dy\, Q_1(y) - \int dy\, Q_2(y) = 1 - 1 = 0$$
  $$\Rightarrow\quad \int dy\, Q_\pm(y) = \tfrac12 \int dy\, |Q(y)|$$
  Hence
  $$d(PQ_1, PQ_2) \le \left[1 - \int dx\, \inf_y P(x \leftarrow y)\right]
    \int dy\, |Q(y)| = (1 - \alpha)\, d(Q_1, Q_2)$$
  with
  $$0 < \alpha \equiv \int dx\, \inf_y P(x \leftarrow y)$$
Markov Chains: II

      Use Markov chains to sample from Q
           Suppose we can construct an ergodic Markov process P which has
           distribution Q as its fixed point
           Start with an arbitrary state (“field configuration”)
           Iterate the Markov process until it has converged (“thermalized”)
           Thereafter, successive configurations will be distributed according to Q
               But in general they will be correlated
           To construct P we only need relative probabilities of states
               Do not know the normalisation of Q
               Cannot use Markov chains to compute integrals directly
               We can compute ratios of integrals



Markov Chains: III
       How do we construct a Markov process with a specified fixed point
           $$Q(x) = \int dy\, P(x \leftarrow y)\, Q(y)\ ?$$
       Detailed balance
           $$P(y \leftarrow x)\, Q(x) = P(x \leftarrow y)\, Q(y)$$
            Integrate w.r.t. y to obtain the fixed point condition
            Sufficient but not necessary for a fixed point
       Metropolis algorithm
           $$P(x \leftarrow y) = \min\!\left(1, \frac{Q(x)}{Q(y)}\right)$$
            Consider the cases $Q(x) > Q(y)$ and $Q(x) < Q(y)$ separately to
            obtain the detailed balance condition
            Sufficient but not necessary for detailed balance
            Other choices are possible, e.g.,
                $$P(x \leftarrow y) = \frac{Q(x)}{Q(x) + Q(y)}$$
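The Metropolis rule translates into very little code. A minimal sketch (my own naming) for a single degree of freedom with $Q(q) \propto e^{-S(q)}$; note that only the ratio $Q(q')/Q(q) = e^{S(q)-S(q')}$ appears, so the unknown normalisation $Z$ never enters:

```python
import math
import random

def metropolis(S, q0, step, n_steps, rng):
    # Random-walk Metropolis for Q(q) ∝ exp(-S(q)): propose a symmetric
    # uniform shift, accept with probability min(1, Q(q')/Q(q)).
    q = q0
    chain = []
    for _ in range(n_steps):
        q_new = q + step * (2.0 * rng.random() - 1.0)
        # min(0, ...) before exp avoids overflow for very unlikely proposals
        if rng.random() < math.exp(min(0.0, S(q) - S(q_new))):
            q = q_new
        chain.append(q)
    return chain

# Toy example: S(q) = q^2/2, so Q is a unit Gaussian.
rng = random.Random(6)
chain = metropolis(lambda q: 0.5 * q * q, 0.0, 1.5, 200000, rng)
burn = chain[1000:]                      # discard thermalisation
mean = sum(burn) / len(burn)
var = sum((q - mean) ** 2 for q in burn) / len(burn)
```

Successive states are correlated, as the later autocorrelation slides quantify, so the effective number of samples is smaller than the chain length.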
Markov Chains: IV

       Composition of Markov steps
           Let P1 and P2 be two Markov steps which have the desired fixed
           point distribution
                They need not be ergodic
           Then the composition of the two steps P2P1 will also have the
           desired fixed point
                And it may be ergodic
           This trivially generalises to any (fixed) number of steps
                For the case where P1 is not ergodic but (P1)^n is, the
                terminology weakly and strongly ergodic is sometimes used


Markov Chains: V

         This result justifies “sweeping” through a
         lattice performing single site updates
             Each individual single site update has the desired fixed point
             because it satisfies detailed balance
             The entire sweep therefore has the desired fixed point, and
             is ergodic
             But the entire sweep does not satisfy detailed balance
             Of course it would satisfy detailed balance if the sites were
             updated in a random order
                  But this is not necessary
                  And it is undesirable because it puts too much randomness into
                  the system

Markov Chains: VI
       Coupling from the Past
            Propp and Wilson (1996)
            Use a fixed set of random numbers
            Flypaper principle: if states coalesce they stay together forever
             –   Eventually, all states coalesce to some state φ with
                 probability one
             –   Any state from t = −∞ will coalesce to φ
             –   φ is a sample from the fixed point distribution
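A minimal monotone coupling-from-the-past sketch, on a toy chain rather than the Ising heatbath of the next slides (a lazy reflecting random walk on {0, …, n−1}; all names are my own). The essential points are that the same random numbers drive every copy of the chain, and that each restart further into the past reuses those numbers:

```python
import random

def update(state, u, n):
    # One coupled update of the reflecting random walk on {0, ..., n-1};
    # the shared uniform u drives every copy, and the rule is monotone:
    # it preserves the ordering of states.
    if u < 0.5:
        return max(state - 1, 0)
    return min(state + 1, n - 1)

def cftp(n, rng):
    # Propp-Wilson: run the largest and smallest states from time -T to 0,
    # REUSING the same random numbers on every restart; once they coalesce,
    # all intermediate states have too (flypaper principle), and the common
    # value is an exact sample from the fixed point distribution.
    T = 1
    randoms = []
    while True:
        while len(randoms) < T:
            randoms.append(rng.random())   # randoms[t] drives time -(t+1)
        lo, hi = 0, n - 1
        for t in range(T - 1, -1, -1):     # evolve from time -T up to 0
            lo = update(lo, randoms[t], n)
            hi = update(hi, randoms[t], n)
        if lo == hi:
            return lo
        T *= 2                             # not coalesced: go further back

rng = random.Random(7)
samples = [cftp(4, rng) for _ in range(4000)]
counts = [samples.count(s) for s in range(4)]
```

This walk is doubly stochastic, so its fixed point is uniform and the four counts should be roughly equal.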
Markov Chains: VII
         Suppose we have a lattice
             That is, a partial ordering with a largest and smallest element
         And an update step that preserves it
         Then once the largest and smallest states have coalesced all the
         others must have too
         [Figure: Hasse diagram of such a partial order, with largest
         element a, smallest element g, and intermediate elements b–f]
Markov Chains: VIII
         A non-trivial example where this is possible is the ferromagnetic
         Ising model
             $$H = -\beta \sum_{\langle i, j \rangle} s_i s_j$$
         with β ≥ 0
             The ordering is on spin configurations: A ≥ B iff for every
             site $A_s \ge B_s$
             The update is a local heatbath
Autocorrelations: I
       Exponential autocorrelations
            The unique fixed point of an ergodic Markov process corresponds
            to the unique eigenvector with eigenvalue 1
            All its other eigenvalues must lie within the unit circle
            In particular, the largest subleading eigenvalue satisfies
            $|\lambda_{\max}| < 1$
            This corresponds to the exponential autocorrelation time
                $$N_{\exp} = -\frac{1}{\ln |\lambda_{\max}|} \ge 0$$
Autocorrelations: II
          Integrated autocorrelations
                  Consider the autocorrelation of some operator Ω
                  Without loss of generality we may assume
                  $\langle\Omega\rangle = 0$
          $$\langle\bar\Omega^2\rangle
            = \frac{1}{N^2} \sum_{t=1}^{N} \sum_{t'=1}^{N}
              \langle \Omega_t \Omega_{t'} \rangle
            = \frac{1}{N^2} \left[ \sum_{t=1}^{N} \langle \Omega_t^2 \rangle
              + 2 \sum_{t=1}^{N-1} \sum_{t'=t+1}^{N}
              \langle \Omega_t \Omega_{t'} \rangle \right]$$
          Define the autocorrelation function
              $$C_\Omega(\ell) = \frac{\langle \Omega_t \Omega_{t+\ell}
                \rangle}{\langle \Omega^2 \rangle}$$
          Then
              $$\langle\bar\Omega^2\rangle = \frac{\langle\Omega^2\rangle}{N}
                \left[ 1 + 2 \sum_{\ell=1}^{N-1}
                \left(1 - \frac{\ell}{N}\right) C_\Omega(\ell) \right]$$
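A sketch of how $C_\Omega(\ell)$ and its sum are estimated in practice (my own naming; the AR(1) toy series is an assumption chosen because its autocorrelation $C(\ell) = a^\ell$, and hence $A = a/(1-a)$, is known exactly):

```python
import random

def autocorrelation(chain, max_lag):
    # Sample estimate of C(l) = <O_t O_{t+l}> / <O^2> for a mean-subtracted
    # time series.
    n = len(chain)
    mean = sum(chain) / n
    d = [x - mean for x in chain]
    c0 = sum(v * v for v in d) / n
    return [sum(d[t] * d[t + l] for t in range(n - l)) / ((n - l) * c0)
            for l in range(1, max_lag + 1)]

def integrated_autocorrelation(chain, max_lag):
    # A = sum_{l >= 1} C(l), truncated at max_lag; the error on the chain
    # mean is inflated by a factor sqrt(1 + 2A).
    return sum(autocorrelation(chain, max_lag))

# Toy example: AR(1) process x' = a*x + noise has C(l) = a^l exactly,
# so A = a/(1 - a) = 1 for a = 0.5.
rng = random.Random(8)
a = 0.5
x = 0.0
chain = []
for _ in range(200000):
    x = a * x + rng.gauss(0.0, 1.0)
    chain.append(x)
A = integrated_autocorrelation(chain, 30)
```

In practice the truncation window must be chosen with care: too short biases $A$ downward, too long adds noise.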
Autocorrelations: III
             The autocorrelation function must fall faster than the
             exponential autocorrelation
                 $$C_\Omega(\ell) \le |\lambda_{\max}|^{\ell}
                   = e^{-\ell/N_{\exp}}$$
             For a sufficiently large number of samples
                 $$\langle\bar\Omega^2\rangle = \frac{\langle\Omega^2\rangle}{N}
                   \left[1 + 2\sum_{\ell=1}^{\infty} C_\Omega(\ell)\right]
                   \left[1 + O\!\left(\frac{N_{\exp}}{N}\right)\right]$$
             Define the integrated autocorrelation function
                 $$A_\Omega \equiv \sum_{\ell=1}^{\infty} C_\Omega(\ell)$$
             so that
                 $$\langle\bar\Omega^2\rangle = \frac{\langle\Omega^2\rangle}{N}
                   \left(1 + 2A_\Omega\right)
                   \left[1 + O\!\left(\frac{N_{\exp}}{N}\right)\right]$$
Hybrid Monte Carlo: I
           In order to carry out Monte Carlo computations including the
           effects of dynamical fermions we would like to find an algorithm
           which
               Updates the fields globally
                    Because single link updates are not cheap if the action
                    is not local
               Takes large steps through configuration space
                    Because small-step methods carry out a random walk which
                    leads to critical slowing down with a dynamical critical
                    exponent z = 2
                          z relates the autocorrelation to the correlation
                          length of the system, $A_\Omega \propto \xi^z$
               Does not introduce any systematic errors
Hybrid Monte Carlo: II
        A useful class of algorithms with these properties is the
        (Generalised) Hybrid Monte Carlo (HMC) method
            Introduce a "fictitious" momentum p corresponding to each
            dynamical degree of freedom q
            Find a Markov chain with fixed point $\propto \exp[-H(q,p)]$
            where H is the "fictitious" Hamiltonian $\tfrac12 p^2 + S(q)$
                 The action S of the underlying QFT plays the rôle of the
                 potential in the "fictitious" classical mechanical system
                 This gives the evolution of the system in a fifth dimension,
                 "fictitious" or computer time
            This generates the desired distribution $\exp[-S(q)]$ if we
            ignore the momenta p (i.e., the marginal distribution)
Hybrid Monte Carlo: III

         The HMC Markov chain alternates two Markov
         steps
             Molecular Dynamics Monte Carlo (MDMC)
             (Partial) Momentum Refreshment
         Both have the desired fixed point
         Together they are ergodic




MDMC: I
       If we could integrate Hamilton’s equations
       exactly we could follow a trajectory of
       constant fictitious energy
            This corresponds to a set of equiprobable fictitious
            phase space configurations
            Liouville’s theorem tells us that this also preserves the
             functional integral measure dp ∧ dq as required
       Any approximate integration scheme which
       is reversible and area preserving may be
       used to suggest configurations to a
       Metropolis accept/reject test
             With acceptance probability $\min[1, \exp(-\delta H)]$
MDMC: II
       We build the MDMC step out of three parts
            Molecular Dynamics (MD), an approximate integrator
            $U(\tau): (q, p) \to (q', p')$ which is exactly
                Area preserving,
                    $$\det U_* = \det\!\left[\frac{\partial(q', p')}
                      {\partial(q, p)}\right] = 1$$
                Reversible,
                    $$F \circ U(\tau) \circ F \circ U(\tau) = 1$$
            A momentum flip $F: p \to -p$
            A Metropolis accept/reject step
       The composition of these gives
           $$\begin{pmatrix} q' \\ p' \end{pmatrix}
             = F\,U(\tau) \begin{pmatrix} q \\ p \end{pmatrix}
               \theta\!\left(e^{-\delta H} - y\right)
             + \begin{pmatrix} q \\ p \end{pmatrix}
               \theta\!\left(y - e^{-\delta H}\right)$$
            with y being a uniformly distributed random number in [0, 1]
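The three parts assemble into the following sketch of an HMC update for a single degree of freedom (my own naming; the Gaussian toy action is an assumption, and full momentum refreshment corresponds to θ = π/2 in the next slide's language). The integrator is the standard leapfrog, which the later symplectic-integrator slides derive; it is reversible and area preserving, so the Metropolis step removes its O(δτ²) energy error exactly:

```python
import math
import random

def leapfrog(q, p, dSdq, tau, n_steps):
    # Reversible, area-preserving integrator for H = p^2/2 + S(q):
    # half step in p, alternating full steps in q and p, half step in p.
    dt = tau / n_steps
    p -= 0.5 * dt * dSdq(q)
    for i in range(n_steps):
        q += dt * p
        p -= (dt if i < n_steps - 1 else 0.5 * dt) * dSdq(q)
    return q, p

def hmc(S, dSdq, q0, tau, n_steps, n_traj, rng):
    # Each trajectory: refresh p from a Gaussian, integrate approximately,
    # then accept with probability min(1, exp(-dH)); on rejection keep the
    # old q, so the fixed point exp(-H) is exact.
    q = q0
    chain = []
    for _ in range(n_traj):
        p = rng.gauss(0.0, 1.0)
        H_old = 0.5 * p * p + S(q)
        q_new, p_new = leapfrog(q, p, dSdq, tau, n_steps)
        dH = 0.5 * p_new * p_new + S(q_new) - H_old
        if rng.random() < math.exp(min(0.0, -dH)):
            q = q_new
        chain.append(q)
    return chain

# Toy example: S(q) = q^2/2, so the q-marginal is a unit Gaussian.
rng = random.Random(9)
chain = hmc(lambda q: 0.5 * q * q, lambda q: q, 0.0, 1.0, 10, 20000, rng)
mean = sum(chain) / len(chain)
var = sum((q - mean) ** 2 for q in chain) / len(chain)
```

The momentum flip F is omitted here because with full refreshment it has no observable effect, as the next slide notes.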
Partial Momentum Refreshment
       This mixes the Gaussian distributed momenta p with Gaussian noise ξ
           $$\begin{pmatrix} p' \\ \xi' \end{pmatrix}
             = \begin{pmatrix} \cos\theta & \sin\theta \\
                               -\sin\theta & \cos\theta \end{pmatrix}
               F \begin{pmatrix} p \\ \xi \end{pmatrix}$$
            The Gaussian distribution of p is invariant under F
            The extra momentum flip F ensures that for small θ the momenta
            are reversed after a rejection rather than after an acceptance
            For θ = π/2 all momentum flips are irrelevant
Symplectic Integrators: I
       Baker-Campbell-Hausdorff (BCH) formula
            If A and B belong to any (non-commutative) algebra then
            $e^A e^B = e^{A + B + \delta}$, where δ is constructed from
            commutators of A and B (i.e., is in the Free Lie Algebra
            generated by {A, B})
            More precisely, $\ln\left(e^A e^B\right) = \sum_{n \ge 1} c_n$
            where $c_1 = A + B$ and
                $$c_{n+1} = \frac{1}{n+1} \left\{
                  -\tfrac12 \big[c_n, A - B\big]
                  + \sum_{\substack{m \ge 1 \\ 2m \le n}}
                    \frac{B_{2m}}{(2m)!}
                    \sum_{\substack{k_1, \ldots, k_{2m} \ge 1 \\
                          k_1 + \cdots + k_{2m} = n}}
                    \big[c_{k_1}, [\cdots, [c_{k_{2m}}, A + B] \cdots]\big]
                  \right\}$$
            where the $B_{2m}$ are Bernoulli numbers
Symplectic Integrators: II
            Explicitly, the first few terms are
                     
             ln e Ae B  A  B          1
                                           2    A , B   12  A ,  A , B   B ,  A , B   24 B ,  A ,  A , B  
                                                            1
                                                                                             
                                                                                                     1
                                                                                                                          
                                   A ,  A ,  A ,  A , B     4 B ,  A ,  A ,  A , B    
                                                                                          
                             1                                                                          
                           720  6  A , B  ,  A ,  A , B    4 B , B ,  A ,  A , B     
                                                                 
                                                                                                
                                                                                                        
                                 2  A , B  , B ,  A , B    B , B , B ,  A , B    
                                                                                              
                                     
            In order to construct reversible integrators we use symmetric
            symplectic integrators
            The following identity follows directly from the BCH formula
                              
              ln e A 2e B e A 2  A  B           1
                                                     24    A,  A, B   2 B ,  A, B 
                                                                                        
                                           7  A ,  A ,  A ,  A , B     28 B ,  A ,  A ,  A , B    
                                                                                                     
                                      1                                                                            
                                    5760  12  A , B  ,  A ,  A , B    32 B , B ,  A ,  A , B     
                                                                            
                                                                                                            
                                                                                                                   
                                           16  A , B  , B ,  A , B    8 B , B , B ,  A , B    
                                                                             
                                                                                                           
Symplectic Integrators: III
        We are interested in finding the classical trajectory in
        phase space of a system described by the Hamiltonian

            H(q, p) = T(p) + S(q) = ½ p² + S(q)

        The basic idea of such a symplectic integrator is to write the
        time evolution operator as

            exp( τ d/dt ) = exp[ τ ( (dq/dt) ∂/∂q + (dp/dt) ∂/∂p ) ]
                          = exp[ τ ( (∂H/∂p) ∂/∂q − (∂H/∂q) ∂/∂p ) ] ≡ e^{τĤ}
                          = exp[ τ ( T′(p) ∂/∂q − S′(q) ∂/∂p ) ]
Symplectic Integrators: IV

        Define Q ≡ T′(p) ∂/∂q and P ≡ −S′(q) ∂/∂p, so that Ĥ = P + Q
        Since the kinetic energy T is a function only of p and the
        potential energy S is a function only of q, it follows that the
        action of e^{τP} and e^{τQ} may be evaluated trivially

            e^{τQ}: f(q, p) ↦ f(q + τ T′(p), p)
            e^{τP}: f(q, p) ↦ f(q, p − τ S′(q))
Symplectic Integrators: V
        From the BCH formula we find that the PQP symmetric
        symplectic integrator is given by

            U₀(δτ)^{τ/δτ} = ( e^{(δτ/2)P} e^{δτQ} e^{(δτ/2)P} )^{τ/δτ}
                          = exp[ τ ( P + Q − (δτ²/24)( [P,[P,Q]] + 2[Q,[P,Q]] ) + O(δτ⁴) ) ]
                          = e^{τĤ} + O(δτ²)
        In addition to conserving energy to O(δτ²) such symmetric
            symplectic integrators are manifestly area preserving and
            reversible
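The PQP update rules above can be sketched in a few lines; a minimal illustration, assuming the harmonic oscillator S(q) = ½q² as the potential (not a lattice action): the trajectory is reversible up to rounding, while the energy error at fixed trajectory length scales as O(δτ²).

```python
def leapfrog(q, p, dS, dt, n):
    """PQP leapfrog for H = p^2/2 + S(q): half kick, alternating drifts/kicks, half kick."""
    p -= 0.5 * dt * dS(q)
    for _ in range(n - 1):
        q += dt * p
        p -= dt * dS(q)
    q += dt * p
    p -= 0.5 * dt * dS(q)
    return q, p

dS = lambda q: q                        # harmonic oscillator, S(q) = q^2/2
H = lambda q, p: 0.5 * (p * p + q * q)

q0, p0 = 1.0, 0.3
def dH(dt):
    # energy error after a trajectory of fixed length tau = 1
    q, p = leapfrog(q0, p0, dS, dt, round(1.0 / dt))
    return abs(H(q, p) - H(q0, p0))

print(dH(0.05) / dH(0.025))             # close to 4: the error is O(dt^2)

# Reversibility: integrate forward, flip the momentum, integrate back
q1, p1 = leapfrog(q0, p0, dS, 0.05, 20)
q2, p2 = leapfrog(q1, -p1, dS, 0.05, 20)
print(abs(q2 - q0), abs(-p2 - p0))      # both at machine-precision level
```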
Symplectic Integrators: VI
        For each symplectic integrator there exists a
        Hamiltonian H′ which is exactly conserved
            This is given by substituting Poisson brackets for commutators
            in the BCH formula

            H′ = T + S − (δτ²/24)( {S,{S,T}} + 2{T,{S,T}} ) + O(δτ⁴)

            where the O(δτ⁴) term is a linear combination (with coefficients
            of magnitude 7/5760, 28/5760, 12/5760, 32/5760, 16/5760, and
            8/5760) of the nested brackets {T,{T,{T,{T,S}}}}, {S,{T,{T,{T,S}}}},
            {{T,S},{T,{T,S}}}, {S,{S,{T,{T,S}}}}, {{T,S},{S,{T,S}}}, and
            {S,{S,{S,{T,S}}}}
Symplectic Integrators: VII

       Poisson brackets form a Lie algebra

            {A, B} = (∂A/∂p)(∂B/∂q) − (∂A/∂q)(∂B/∂p),   {A, B} = −{B, A}

            Homework exercise: verify the Jacobi identity

                {{A,B}, C} + {{B,C}, A} + {{C,A}, B} = 0
Symplectic Integrators: VIII

        The explicit result for H′ is

            H′ = H + (δτ²/24)( 2p² S″(q) − S′(q)² ) + O(δτ⁴)

        where the O(δτ⁴) correction involves p⁴S⁗, p²S′S‴, S″², and S′²S″

            Note that H’ cannot be written as the sum of a p-dependent
            kinetic term and a q-dependent potential term

             As H′ is conserved, δH is of O(δτ²) for trajectories of
             arbitrary length
                 Even if τ = O(δτ⁻ᵏ) with k > 1
Symplectic Integrators: IX
       Multiple timescales
           Split the Hamiltonian into pieces: H(q, p) = T(p) + S₁(q) + S₂(q)
           Define Q ≡ T′(p) ∂/∂q and Pᵢ ≡ −Sᵢ′(q) ∂/∂p, so that Ĥ = P₁ + P₂ + Q
           Introduce a symmetric symplectic integrator of the form (with n₁ | n₂)

               U_SW(δτ)^{τ/δτ} = [ e^{(δτ/2n₁)P₁} ( e^{(δτ/2n₂)P₂} e^{(δτ/n₂)Q} e^{(δτ/2n₂)P₂} )^{n₂/n₁} e^{(δτ/2n₁)P₁} ]^{n₁ τ/δτ}

                If ‖P₁‖/n₁ ≈ ‖P₂‖/n₂ then the instability in the integrator is tickled
                equally by each sub-step
               This helps if the most expensive force computation does not
               correspond to the largest force
Dynamical fermions: I

        Pseudofermions
             Direct simulation of Grassmann fields is not feasible
                  The problem is not that of manipulating anticommuting values
                  in a computer
                  It is that e^{−S_F} = e^{−ψ̄Mψ} is not positive, and thus we get poor
                  importance sampling
             We therefore integrate out the fermion fields to obtain the
             fermion determinant  ∫ dψ̄ dψ e^{−ψ̄Mψ} = det M
                  ψ̄ and ψ always occur quadratically
                 The overall sign of the exponent is unimportant
Dynamical fermions: II
             Any operator Ω can be expressed solely in terms of the
             bosonic fields
             Use Schwinger’s source method and integrate out the fermions

                 ⟨Ω(ψ̄, ψ)⟩ = Ω(∂/∂η, ∂/∂η̄) e^{η̄ M⁻¹ η} |_{η = η̄ = 0}

                 E.g., the fermion propagator is

                     G(x, y) = ⟨ψ(x) ψ̄(y)⟩ = M⁻¹(x, y)
Dynamical fermions: III
             Including the determinant as part of the observable to be
             measured is not feasible

                 ⟨Ω⟩ = ⟨Ω det M⟩_{S_B} / ⟨det M⟩_{S_B}

                 The determinant is extensive in the lattice volume, thus again we
                 get poor importance sampling
             Represent the fermion determinant as a bosonic Gaussian
             integral with a non-local kernel:  det M = ∫ dφ* dφ e^{−φ* M⁻¹ φ}
             The fermion kernel must be positive definite (all its
             eigenvalues must have positive real parts) otherwise the
             bosonic integral will not converge
             The new bosonic fields are called pseudofermions
Dynamical fermions: IV
           It is usually convenient to introduce two flavours of fermion
           and to write

               (det M)² = det(M†M) = ∫ dφ† dφ e^{−φ†(M†M)⁻¹φ}

               This not only guarantees positivity, but also allows us to generate
               the pseudofermions from a global heatbath by applying M† to a
               random Gaussian distributed field, φ = M†ξ
           The equations of motion for the boson (gauge) fields are

               q̇ = p,   ṗ = −∂S_B/∂q + φ† (M†M)⁻¹ [∂(M†M)/∂q] (M†M)⁻¹ φ

           The evaluation of the pseudofermion action and the
           corresponding force then requires us to find the solution of a
           (large) set of linear equations,  (M†M) χ = φ
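The force expression above can be checked against a finite difference of the pseudofermion action on a toy model; the 4×4 complex matrices and the parametrisation M(θ) = M₀ + θD below are illustrative assumptions, not lattice operators:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4
M0 = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n)) + 3 * np.eye(n)
D = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
phi = rng.normal(size=n) + 1j * rng.normal(size=n)

def action(theta):
    M = M0 + theta * D
    K = M.conj().T @ M                                   # fermion kernel M†M
    return (phi.conj() @ np.linalg.solve(K, phi)).real   # S = φ†(M†M)⁻¹φ

# Analytic force at θ = 0:  dS/dθ = −χ† [d(M†M)/dθ] χ  with  χ = (M†M)⁻¹φ
K = M0.conj().T @ M0
chi = np.linalg.solve(K, phi)
dK = D.conj().T @ M0 + M0.conj().T @ D
analytic = -(chi.conj() @ dK @ chi).real

eps = 1e-5
finite_diff = (action(eps) - action(-eps)) / (2 * eps)
print(abs(finite_diff - analytic))  # small: the two agree
```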
Dynamical fermions: V

         It is not necessary to carry out the inversions
         required for the equations of motion exactly
             There is a trade-off between the cost of computing the force
             and the acceptance rate of the Metropolis MDMC step
          The inversions required to compute the
          pseudofermion action for the accept/reject
          step do, however, need to be carried out
          exactly
              We usually take “exactly” to be synonymous with “to
              machine precision”
Reversibility: I
      Are HMC trajectories reversible and area
      preserving in practice?
           The only fundamental source of irreversibility is the rounding
           error caused by using finite precision floating point arithmetic
                For fermionic systems we can also introduce irreversibility by
                choosing the starting vector for the iterative linear equation solver
                time-asymmetrically
                      We do this if we use a Chronological Inverter, which takes (some
                      extrapolation of) the previous solution as the starting vector
                Floating point arithmetic is not associative
                It is more natural to store compact variables as scaled integers
                (fixed point)
                     Saves memory
                     Does not solve the precision problem
Reversibility: II
      Data for SU(3) gauge theory
      and QCD with heavy quarks
      show that rounding errors are
      amplified exponentially
           The underlying continuous
           time equations of motion are
           chaotic
           Ляпунов exponents
           characterise the divergence of
           nearby trajectories
            The instability in the integrator
            occurs when δH ≫ 1
                Zero acceptance rate anyhow
Reversibility: III
        In QCD the Ляпунов exponents appear
        to scale with β as the system approaches
        the continuum limit β → ∞
            ν ≈ constant
           This can be interpreted as saying that the
           Ляпунов exponent characterises the
           chaotic nature of the continuum classical
           equations of motion, and is not a lattice
           artefact
           Therefore we should not have to worry
           about reversibility breaking down as we
           approach the continuum limit
           Caveat: data is only for small lattices, and
           is not conclusive
Polynomial approximation

    What is the best polynomial
    approximation p(x) to a continuous
    function f(x) for x in [0,1] ?
          Best with respect to the appropriate norm

                     ‖p − f‖ₙ = [ ∫₀¹ dx |p(x) − f(x)|ⁿ ]^{1/n}

          where n ≥ 1
Weierstraß’ theorem

       Taking n → ∞ this is the minimax
       norm

           ‖p − f‖∞ = max_{0 ≤ x ≤ 1} |p(x) − f(x)|
   Weierstraß: Any continuous
   function can be arbitrarily well
   approximated by a polynomial
Бернштейн polynomials

                               The explicit solution is
                               provided by the Бернштейн
                               polynomials

                                   pₙ(x) = Σ_{k=0}^{n} f(k/n) C(n,k) xᵏ (1 − x)ⁿ⁻ᵏ
Чебышев’s theorem

    Чебышев: There is always a
    unique polynomial of any
    degree d which minimises

        ‖p − f‖∞ = max_{0 ≤ x ≤ 1} |p(x) − f(x)|

    The error |p(x) − f(x)| reaches its
    maximum at exactly d+2 points on
    the unit interval
Чебышев’s theorem: Necessity
      Suppose p-f has less than d+2 extrema of equal magnitude
      Then at most d+1 maxima exceed some magnitude
      This defines a “gap”
      We can construct a polynomial q of degree d which has the opposite sign to p-f at
      each of these maxima (Lagrange interpolation)
      And whose magnitude is smaller than the “gap”
      The polynomial p+q is then a better approximation than p to f
  Чебышев’s theorem: Sufficiency
       Suppose there is a polynomial p′ with ‖p′ − f‖∞ ≤ ‖p − f‖∞
       Then |p′(xᵢ) − f(xᵢ)| ≤ |p(xᵢ) − f(xᵢ)| at each of the d+2 extrema
       Therefore p′ − p must change sign between consecutive extrema, and so has d+1 zeros on the unit interval
       Thus p′ − p = 0, as it is a polynomial of degree d
Чебышев polynomials

      Convergence is often exponential in d
           The best approximation of degree d − 1 over [−1, 1] to
           x^d is

                p_{d−1}(x) = x^d − 2^{1−d} T_d(x)

               where the Чебышев polynomials are
                              T_d(x) = cos( d cos⁻¹ x )
               The notation T is an old transliteration of Чебышев !

           The error is |x^d − p_{d−1}(x)| = 2^{1−d} |T_d(x)| ≤ 2^{1−d} = 2e^{−d ln 2}
Чебышев rational functions
      Чебышев’s theorem is easily extended to
      rational approximations
          Rational functions with nearly equal degree numerator and
          denominator are usually best
          Convergence is still often exponential
          Rational functions usually give a much better approximation


                          A simple (but somewhat slow) numerical
                          algorithm for finding the optimal Чебышев
                          rational approximation was given by Ремез
Чебышев rationals: Example
                A realistic example of a rational approximation is

      1/√x ≈ 0.3904603901 (x + 2.3475661045)(x + 0.1048344600)(x + 0.0073063814)
                          / [ (x + 0.4105999719)(x + 0.0286165446)(x + 0.0012779193) ]

                    This is accurate to within almost 0.1% over the range [0.003, 1]

                Using a partial fraction expansion of such rational functions
                allows us to use a multishift linear equation solver, thus
                reducing the cost significantly.

                The partial fraction expansion of the rational function above is

      1/√x ≈ 0.3904603901 + 0.0511093775/(x + 0.0012779193)
                          + 0.1408286237/(x + 0.0286165446)
                          + 0.5964845033/(x + 0.4105999719)

                    This appears to be numerically stable.
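Both forms can be cross-checked against each other and against 1/√x directly, using the coefficients quoted above:

```python
from math import sqrt

def r_product(x):
    return (0.3904603901
            * (x + 2.3475661045) * (x + 0.1048344600) * (x + 0.0073063814)
            / ((x + 0.4105999719) * (x + 0.0286165446) * (x + 0.0012779193)))

def r_partial(x):
    return (0.3904603901
            + 0.0511093775 / (x + 0.0012779193)
            + 0.1408286237 / (x + 0.0286165446)
            + 0.5964845033 / (x + 0.4105999719))

xs = [0.003 * (1 / 0.003) ** (i / 400) for i in range(401)]  # log-spaced on [0.003, 1]
agree = max(abs(r_product(x) - r_partial(x)) / r_partial(x) for x in xs)
relerr = max(abs(r_product(x) * sqrt(x) - 1) for x in xs)
print(agree)   # tiny: the two forms are the same function
print(relerr)  # ~1e-3: accurate to about 0.1% over the range
```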
Polynomials versus rationals
                                                          n
          Золотарев’s formula has L∞ error Δ ∼ e^{−n/ln κ}

          The optimal L₂ approximation to sgn(x) with weight 1/√(1 − x²)
          is
                    Σ_{j=0}^{n} (−1)ʲ [ 4 / (2j + 1)π ] T_{2j+1}(x)

              This has L₂ error of O(1/n)

          Optimal L∞ approximation cannot be too
          much better (or it would lead to a better L₂
          approximation)
Non-linearity of CG solver
         Suppose we want to solve A2x=b for
         Hermitian A by CG
             It is better to solve Ax=y, Ay=b successively
              Condition number κ(A²) = κ(A)²
              Cost is thus 2√κ(A) < √κ(A²) in general
         Suppose we want to solve Ax=b
             Why don’t we solve A1/2x=y, A1/2y=b successively?
         The square root of A is uniquely defined if A>0
             This is the case for fermion kernels
         All this generalises trivially to nth roots
             No tuning needed to split condition number evenly
         How do we apply the square root of a matrix?
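A small numerical illustration of why squaring is bad (the 3×3 SPD matrix below is an arbitrary example): squaring squares the condition number, while two successive solves give the same answer as one solve with A²:

```python
import numpy as np

B = np.array([[2.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 4.0]])
A = B @ B.T + np.eye(3)            # symmetric positive definite

kA = np.linalg.cond(A)
kA2 = np.linalg.cond(A @ A)
print(kA2 / kA**2)                 # close to 1: kappa(A^2) = kappa(A)^2

# Solving A^2 x = b by two successive solves with A
b = np.array([1.0, 2.0, 3.0])
x_two = np.linalg.solve(A, np.linalg.solve(A, b))
x_one = np.linalg.solve(A @ A, b)
print(np.allclose(x_two, x_one))   # True
```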
Rational matrix approximation

         Functions on matrices
             Defined for a Hermitian matrix by diagonalisation
             H = U D U -1
             f (H) = f (U D U -1) = U f (D) U -1
         Rational functions do not require diagonalisation
               α Hᵐ + β Hⁿ = U ( α Dᵐ + β Dⁿ ) U⁻¹
              H⁻¹ = U D⁻¹ U⁻¹
         Rational functions have nice properties
             Cheap (relatively)
             Accurate
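As a sketch, one can reuse the partial-fraction coefficients for 1/√x quoted earlier to apply H^(−1/2) to an illustrative 4×4 Hermitian matrix whose eigenvalues lie in [0.003, 1], using only shifted inverses (no diagonalisation):

```python
import numpy as np

# Hermitian H with a known spectrum inside [0.003, 1] (illustrative construction)
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)) + 1j * rng.normal(size=(4, 4)))
lam = np.array([0.004, 0.05, 0.3, 0.9])
H = Q @ np.diag(lam) @ Q.conj().T

# "Exact" f(H) = H^(-1/2) via diagonalisation
exact = Q @ np.diag(lam**-0.5) @ Q.conj().T

# Partial-fraction rational approximation: only shifted inverses are needed
a0 = 0.3904603901
res = [0.0511093775, 0.1408286237, 0.5964845033]
pole = [0.0012779193, 0.0286165446, 0.4105999719]
approx = a0 * np.eye(4) + sum(a * np.linalg.inv(H + b * np.eye(4))
                              for a, b in zip(res, pole))

rel = np.linalg.norm(approx - exact, 2) / np.linalg.norm(exact, 2)
print(rel)  # ~1e-3, matching the scalar accuracy of the approximation
```

In practice the shifted inverses would of course be applied to a vector with a multishift Krylov solver rather than formed explicitly.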
No Free Lunch Theorem

         We must apply the rational approximation
         with each CG iteration
              M^{1/n} → r(M)
              The condition number for each term in the partial fraction
              expansion is approximately κ(M)
              So the cost of applying M^{1/n} is proportional to κ(M)
              Even though the condition number κ(M^{1/n}) = κ(M)^{1/n}
              And even though κ(r(M)) = κ(M)^{1/n}
         So we don’t win this way…
Pseudofermions

         We want to evaluate a functional integral
         including the fermionic determinant det M

          We write this as a bosonic functional integral
          over a pseudofermion field with kernel M⁻¹

                           det M = ∫ dφ* dφ e^{−φ* M⁻¹ φ}
Multipseudofermions
          We are introducing extra noise into the system by
          using a single pseudofermion field to sample this
          functional integral
               This noise manifests itself as fluctuations in the force exerted by
               the pseudofermions on the gauge fields
               This increases the maximum fermion force
               This triggers the integrator instability
               This requires decreasing the integration step size


           A better estimate is det M = [det M^{1/n}]ⁿ

                           det M = [ ∫ dφ* dφ e^{−φ* M^{−1/n} φ} ]ⁿ
Violation of NFL Theorem
         So let’s try using our nth root trick to implement
         multipseudofermions
              Condition number κ(r(M)) = κ(M)^{1/n}
              So maximum force is reduced by a factor of n κ(M)^{(1/n)−1}
              This is a good approximation if the condition number is dominated
              by a few isolated tiny eigenvalues
              This is so in the case of interest
          Cost reduced by a factor of n κ(M)^{(1/n)−1}
              Optimal value n_opt ≈ ln κ(M)
              So optimal cost reduction is (e ln κ)/κ
         This works!
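The last two bullets follow from minimising the cost factor n·κ^((1/n)−1) over n; a quick check with an illustrative κ = 10⁴:

```python
import math

kappa = 1.0e4
cost = lambda n: n * kappa ** (1.0 / n - 1.0)   # cost reduction factor

n_opt = math.log(kappa)                          # ~9.21
print(min(range(1, 51), key=cost))               # 9: integer minimum near ln(kappa)
print(cost(n_opt), math.e * math.log(kappa) / kappa)  # both ~2.5e-3: (e ln k)/k
```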
Rational Hybrid Monte Carlo: I

         RHMC algorithm for fermionic kernel (M†M)^{1/2n}

             Generate pseudofermion from Gaussian heatbath: draw ξ with

                 P(ξ) ∝ e^{−½ ξ†ξ}

             and set φ = (M†M)^{1/4n} ξ, so that

                 P(φ) ∝ e^{−½ φ†(M†M)^{−1/2n} φ}

             Use accurate rational approximation r(x) ≈ x^{1/4n} for the heatbath
             Use less accurate approximation r̄(x) ≈ x^{−1/2n} for MD
                    r̄(x) ≠ r(x)^{−2}, so there are no double poles
             Use accurate approximation for Metropolis acceptance step
Rational Hybrid Monte Carlo: II
         Reminders
             Apply rational approximations using their partial fraction
             expansions
             Denominators are all just shifts of the original fermion kernel
                  All poles of optimal rational approximations are real and positive for
                  cases of interest (Miracle #1)
                  Only simple poles appear (by construction!)
             Use multishift solver to invert all the partial fractions using a single
             Krylov space
                  Cost is dominated by Krylov space construction, at least for O(20)
                  shifts
             Result is numerically stable, even in 32-bit precision
                  All partial fractions have positive coefficients (Miracle #2)
             MD force term is of the usual form for each partial fraction
             Applicable to any kernel
Comparison with R algorithm: I
 Binder cumulant of chiral condensate,
 B4, and RHMC acceptance rate A from
 a finite temperature study (2+1 flavour
 naïve staggered fermions, Wilson
 gauge action, V = 83×4, mud = 0.0076,
 ms = 0.25, τ= 1.0)

    Algorithm      δt       A       B₄

       R         0.0019     –     1.56(5)
       R         0.0038     –     1.73(4)
      RHMC       0.055     84%    1.57(2)
Comparison with R algorithm: II
The different masses at which domain wall
results were gathered, together with the
step-sizes δt, acceptance rates A, and
plaquettes P (V = 163×32×8, DBW2
gauge action, β = 0.72)
 Algorithm   mud     ms      δt        A          P

    R        0.04    0.04    0.01      –      0.60812(2)
    R        0.02    0.04    0.01      –      0.60829(1)
    R        0.02    0.04    0.005     –      0.60817
  RHMC       0.04    0.04    0.02     65.5%   0.60779(1)
  RHMC       0.02    0.04    0.0185   69.3%   0.60809(1)

 [Figure: the step-size variation of the plaquette with mud = 0.02]
Comparison with R algorithm: III

The integrated
autocorrelation time of
the 13th time-slice of
the pion propagator
from the domain wall
test, with mud = 0.04
Multipseudofermions with multiple timescales
   Semiempirical observation:
   The largest force from a
   single pseudofermion does
   not come from the smallest
   shift
        For example, look at the
        numerators in the partial
        fraction expansion we
        exhibited earlier

   Make use of this by using a
   coarser timescale for the
   more expensive smaller
   shifts

   [Chart: residue α, L² force, α/(β + 0.125), and CG iterations versus shift ln β]

    1/√x ≈ 0.3904603901 + 0.0511093775/(x + 0.0012779193)
                        + 0.1408286237/(x + 0.0286165446)
                        + 0.5964845033/(x + 0.4105999719)
Berlin Wall for Wilson fermions

   HMC Results
        C Urbach, K Jansen, A Shindler,
       and U Wenger, hep-lat/0506011,
       hep-lat/0510064
   Comparable performance
   to Lüscher’s SAP algorithm
   RHMC?
       t.b.a.
Conclusions (RHMC)

         Advantages of RHMC
             Exact
                   No step-size errors; no step-size extrapolations
             Significantly cheaper than the R algorithm
             Allows easy implementation of Hasenbusch
             (multipseudofermion) acceleration
             Further improvements possible
                   Such as multiple timescales for different terms in the partial
                   fraction expansion

         Disadvantages of RHMC
             ???


QCD Machines: I

         We want a cost-effective computer to solve
         interesting scientific problems
             In fact we wanted a computer to solve lattice QCD
                  But it turned out that there is almost nothing that was not applicable to
                  many other problems too
             Not necessary to solve all problems with one architecture
                  Demise of the general-purpose computer?
             Development cost « hardware cost for one large machine
             Simple OS and software model
                  Interleave a few long jobs without time- or space-sharing




QCD Machines: II

     Take advantage of mass market components
           It is not cost- or time-effective to compete with the PC market in
           designing custom chips
          Use standard software and tools whenever possible

      Do not expect compilers or optimisers to do anything
      particularly smart
          Parallelism has to be built into algorithms and programs from the start

     Hand code critical kernels in assembler
          And develop these along with the hardware design



QCD Machines: III

       Parallel applications
            Many real-world applications are intrinsically parallel
                Because they are approximations to continuous systems
             Lattice QCD is a good example
                Lattice is a discretisation of four dimensional space-time
                Lots of arithmetic on small complex matrices and vectors
                Relatively tiny amount of I/O required
       Amdahl’s law
             Amount of parallel work may be increased by working on a larger
             volume
            Relevant parameter is σ, the number of sites per processor
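The bound Amdahl's law refers to can be sketched in a couple of lines (Python; the numbers are illustrative, not measurements):

```python
def speedup(n_procs, serial_fraction):
    """Amdahl's law: the serial part runs once, the parallel part splits n ways."""
    s = serial_fraction
    return 1.0 / (s + (1.0 - s) / n_procs)

# Even a 1% serial fraction caps the speedup below 100x, however many processors:
assert speedup(10**6, 0.01) < 100

# Growing the lattice volume shrinks the effective serial fraction
# (weak scaling), which is why sites-per-processor sigma is the
# relevant parameter rather than the raw processor count.
assert speedup(1024, 0.001) > 500
```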

Strong Scaling
         The amount of computation V^δ required to
         equilibrate a system of volume V increases faster
         than linearly
             If we are to have any hope of equilibrating large systems the value
             of δ cannot be much larger than one
             For lattice QCD we have algorithms with δ = 5/4
         We are therefore driven to as small a value of σ as
         possible
             This corresponds to “thin nodes,” as opposed to “fat nodes” as in
             PC clusters
         Clusters are competitive in price/performance up to a
         certain maximum problem size
             This borderline increases with time, of course
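The super-linear growth is easy to make concrete (Python sketch with illustrative numbers): doubling the volume multiplies the equilibration cost by 2^(5/4) ≈ 2.38, not 2.

```python
delta = 5.0 / 4.0  # the exponent quoted for lattice QCD algorithms

def equilibration_cost(volume):
    """Cost grows as V**delta, i.e. faster than linearly in the volume."""
    return volume ** delta

# Doubling the volume costs more than twice as much:
ratio = equilibration_cost(2.0) / equilibration_cost(1.0)
assert abs(ratio - 2 ** 1.25) < 1e-12
assert ratio > 2

# At fixed sites-per-processor sigma, wall-clock time per sweep still
# grows as V**(delta - 1) = V**(1/4), hence the push toward thin nodes.
```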


Data Parallelism

        All processors run the same code
             Not necessarily SIMD, where they share a common clock
             Synchronization on communication, or at explicit barriers

        Type of data parallel operations
             Pointwise arithmetic
             Nearest neighbour shifts
                 Perhaps simultaneously in several directions
             Global operations
                 Broadcasts, sums or other reductions
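The three classes of operation can be mimicked on a toy one-dimensional periodic lattice (pure-Python sketch, one list element per site; a real machine does the same with one processor per site or block of sites):

```python
L = 8  # number of lattice sites
phi = [float(i) for i in range(L)]

# Pointwise arithmetic: every site computes independently
sq = [x * x for x in phi]

# Nearest-neighbour shift with periodic boundary conditions
shifted = [phi[(i + 1) % L] for i in range(L)]

# Global operation: a sum reduction over all sites
total = sum(phi)

assert shifted[L - 1] == phi[0]   # the shift wraps around
assert total == L * (L - 1) / 2
```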



Alternative Paradigms

         Multithreading
             Parallelism comes from running many separate
             more-or-less independent threads
             Recent architectures propose running many light-
             weight threads on each processor to overcome
             memory latency
                   What are the threads that need almost no memory?
                           Calculating zillions of digits of π?
                          Cryptography?



Computational Grids


    In the future carrying out large scale
    computations using the Grid will be as
    easy as plugging into an electric socket




Hardware Choices

       Cost/MFlop
            Goal is to be about 10 times more cost-effective than
            commercial machines
                Otherwise it is not worth the effort and risk of building our own
                machine
            Our current Teraflops machines cost about $1/MFlop
            For a Petaflops machine we will need to reach about $1/GFlop
       Power/cooling
            Most cost effective to use low-power components and high-
            density packaging
            Life is much easier if the machine can be air cooled


Clock Speed

       Peak/Sustained speed
           The sustained performance should be about 20%—50% of peak
           Otherwise there is either too much or too little floating point
           hardware
                Real applications rarely have equal numbers of adds and multiplies

       Clock speed
           “Sweet point” is currently at 0.5—1 GHz chips
                High-performance chips running at 3 GHz are both hot and
                expensive
                Using moderate clock speed makes electrical design issues such as
                clock distribution much simpler



Memory Systems
         Memory bandwidth
             This is currently the main bottleneck in most
             architectures
             There are two obvious solutions
                  Data prefetching
                          Vector processing is one way of doing this
                          Requires more sophisticated software
                          Feasible for our class of applications because the control
                          flow is essentially static (almost no data dependencies)
                  Hierarchical memory system (NUMA)
                  We make use of both approaches


Memory Systems

         On-chip memory
             For QCD the memory footprint is small enough that we can put all
             the required data into an on-chip embedded DRAM memory
         Cache
             Whether the on-chip memory is managed as a cache or as directly
             addressable memory is not too important
                  Cache flushing for communications DMA is a nuisance
                  Caches are built into most μ-processors, not worth designing our own!
         Off-chip memory
             The amount of off-chip memory is determined by cost
                  If the cost of the memory is » 50% of the total cost then buy more
                  processing nodes instead
                          After all, the processors are almost free!




Communications Network: I

         This is where a massively parallel computer
         differs from a desktop or server machine
         In the future the network will become the
         principal bottleneck
             For large data parallel applications
             We will end up designing networks and decorating them
             with processors and memories which are almost free




Communications Network

         Topology
             Grid
                  This is easiest to build
                  How many dimensions?
                          QCDSP had 4, Blue Gene/L has 3, and QCDOC has 6
                          Extra dimensions allow easier partitioning of the machine
             Hypercube/Fat Tree/Ω network/Butterfly network
                  These are all essentially an infinite dimensional grid
                  Good for FFTs
             Switch
                  Expensive, and does not scale well



Global Operations

     Combining network
          A tree which can perform arithmetic at each node
          Useful for global reductions
          But global operations can be performed using grid wires
               It is all a matter of cost
               Used in BGL, not in QCDx
      Grid wires
           Not very good error propagation
           O(N^(1/4)) latency (grows as the perimeter of the grid)
           O(N) hardware

[Diagram: global sum over a 2×4 grid using grid wires — the rows 1 2 3 4 and 5 6 7 8 combine into column sums 6, 8, 10, 12, which accumulate to the total 36]



Global Operations

      Combining tree
          Good error propagation
          O(log N) latency
          O(N log N) hardware

[Diagram: combining tree summing the leaves 1…8 — partial sums 3, 7, 11, 15 at the first level, 10 and 26 at the second, root 36]

      Bit-wise operations
          Allow arbitrary precision
          Used on QCDSP
          Not used on QCDOC or BGL
               Because data sent byte-wise (for ECC)
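A log-depth combining tree amounts to repeated pairwise sums; with the eight leaves 1…8 shown on the slide, the partial results are exactly 3, 7, 11, 15, then 10 and 26, then 36. A Python sketch (assumes a power-of-two number of leaves):

```python
def tree_reduce(values):
    """Pairwise combining tree: O(log N) depth; returns every level of partial sums."""
    levels = [list(values)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] + prev[i + 1] for i in range(0, len(prev), 2)])
    return levels

levels = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])
assert levels[1] == [3, 7, 11, 15]
assert levels[2] == [10, 26]
assert levels[3] == [36]
assert len(levels) - 1 == 3   # log2(8) combining steps, versus O(N^(1/4)) for grid wires
```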



Global Operations
[Diagram: bit-serial global operations. Sum: 001011 + 000111 = 010010, formed LSB first, each node keeping a one-bit carry. Broadcast/Max: words streamed MSB first; at each node a comparator selects the upper input at the first differing bit, e.g. yielding 110100.]
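A toy model (Python; the function names are mine, not QCDSP's) of why the streaming direction matters: addition wants LSB first so the carry can propagate forward one bit per cycle, while a maximum wants MSB first so the winner is decided at the first differing bit:

```python
def serial_add(a_bits, b_bits):
    """LSB-first bit-serial addition: one full adder plus a carry flip-flop."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    out.append(carry)  # final carry-out bit
    return out

def serial_max(a_bits, b_bits):
    """MSB-first bit-serial maximum: latch a selection at the first differing bit."""
    winner = None
    for a, b in zip(a_bits, b_bits):
        if winner is None and a != b:
            winner = 'a' if a > b else 'b'
    return list(a_bits) if winner != 'b' else list(b_bits)

def to_lsb(n, width):
    """Integer -> LSB-first bit list."""
    return [(n >> i) & 1 for i in range(width)]

def from_lsb(bits):
    return sum(b << i for i, b in enumerate(bits))

assert from_lsb(serial_add(to_lsb(19, 6), to_lsb(11, 6))) == 30
# MSB-first comparison: 110100 beats 011110 at the very first bit
assert serial_max(to_lsb(52, 6)[::-1], to_lsb(30, 6)[::-1]) == to_lsb(52, 6)[::-1]
```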



Communications Network: II

         Parameters
             Bandwidth
             Latency
             Packet size
                   The ability to send small [O(10²) bytes] packets between
                   neighbours is vital if we are to run efficiently with a small
                   number of sites per processor (σ)
             Control (DMA)
                  The DMA engine needs to be sophisticated enough to
                  interchange neighbouring faces without interrupting the CPU
                  Block-strided moves
                  I/O completion can be polled (or interrupt driven)



Packet Size




Block-strided Moves
[Diagram: a block-strided move gathering blocks from a linear source array 0…15 into a contiguous 4×4 target array]

         For each direction separately we specify in
         memory-mapped registers:
              Source starting address
              Target starting address
              Block size
              Stride
              Number of blocks
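Such a DMA descriptor can be modelled directly (Python sketch; the parameter names follow the register list above, the semantics are illustrative rather than the actual DMA microcode):

```python
def block_strided_move(src, dst, src_start, dst_start, block_size, stride, n_blocks):
    """Copy n_blocks blocks of block_size words each: consecutive blocks start
    `stride` words apart in the source but land contiguously in the target."""
    for b in range(n_blocks):
        for i in range(block_size):
            dst[dst_start + b * block_size + i] = src[src_start + b * stride + i]

# Example: gather one column of a 4x4 row-major array (block size 1, stride 4),
# the sort of face exchange needed when trading lattice boundaries.
src = list(range(16))
dst = [0] * 4
block_strided_move(src, dst, src_start=0, dst_start=0,
                   block_size=1, stride=4, n_blocks=4)
assert dst == [0, 4, 8, 12]
```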

Hardware Design Process
         Optimise machine for applications of interest
             We run simulations of lattice QCD to tune our design
                  Most design tradeoffs can be evaluated using a spreadsheet as
                  the applications are mainly static (data-independent)
             It also helps to debug the machine if you use the actual
             kernels you will run in production as test cases!
                  But running QCD even on a RTL simulator is painfully slow

         Circuit design done using hardware
         description languages (e.g., VHDL)
         Time to completion is critical
             Trade-off between risk of respin and delay in tape-out


VHDL
entity lcount5 is
  port ( signal clk, res:   in    vlbit;
         signal reset, set: in    vlbit;
         signal count:      inout vlbit_1d(4 downto 0) );
end lcount5;

architecture bhv of lcount5 is
begin
  process
  begin
    wait until prising(clk) or res='1';

    if res='1' then
      count <= b"00000";
    elsif reset='1' then
      count <= b"00000";
    elsif set='1' then
      count(4) <= count(4) xor (count(0) and count(1) and count(2) and count(3));
      count(3) <= count(3) xor (count(0) and count(1) and count(2));
      count(2) <= count(2) xor (count(0) and count(1));
      count(1) <= count(1) xor count(0);
      count(0) <= not count(0);
    end if;
  end process;
end bhv;
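A behavioural model of the xor/and chain (Python, not part of the slides) confirms it is a binary increment modulo 32: every right-hand side reads the old signal value, matching VHDL signal-assignment semantics, so bit 0 toggles and each higher bit flips exactly when all lower bits were 1.

```python
def step(count):
    """One clocked 'set' update of the 5-bit counter above."""
    # All right-hand sides read the OLD value, as VHDL signals do.
    c = [(count >> i) & 1 for i in range(5)]
    n = [0] * 5
    n[0] = c[0] ^ 1
    n[1] = c[1] ^ c[0]
    n[2] = c[2] ^ (c[0] & c[1])
    n[3] = c[3] ^ (c[0] & c[1] & c[2])
    n[4] = c[4] ^ (c[0] & c[1] & c[2] & c[3])
    return sum(b << i for i, b in enumerate(n))

# The chain is exactly +1 mod 32:
assert all(step(v) == (v + 1) % 32 for v in range(32))
```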




QCDSP




      Columbia University                    BNL
          0.4 TFlops                      0.6 TFlops
QCDOC ASIC Design
      PEC = Prefetching EDRAM Controller
      EDRAM = Embedded DRAM
      DRAM = Dynamic RAM
      RAM = Random Access Memory

[Block diagram: the QCDOC ("QCD computer On a Chip") ASIC, a complete processor node fabricated by IBM in its 0.18μ SA27E embedded-DRAM process. Shown are a PowerPC 440 RISC core with FPU; embedded DRAM behind a prefetching EDRAM controller (PEC) with 2.6 Gbyte/sec EDRAM/SDRAM bandwidth and 12 Gbit/sec memory-processor communications; an external SDRAM controller; 24 off-node serial links (HSSL) with their own link DMA engines and 4KB receive FIFOs; 100 Mbit/sec Fast Ethernet with DMA (MCMAL, EMACS, MII); PLB and OFB buses with bridge and arbiters; interrupt controller, boot/clock support, and serial/location number interfaces. The legend distinguishes IBM ASIC library components from custom-designed logic.]
QCDOC Components




QCDOC




Edinburgh ACF

                                        UKQCDOC is
                                        located in the
                                        ACF near
                                        Edinburgh


                                        Our insurers will
                                        not let us show
                                        photographs of
                                        the building for
                                        security reasons!

Blue Gene/L




Vicious Design Cycle
     1. Communications DMA controller required
             Otherwise control by CPU needed
                  Requires interrupts at each I/O completion
                  CPU becomes bottleneck
     2. Introduce more sophisticated DMA instructions
             Block-strided moves for QCDSP
             Sequences of block-strided moves for QCDOC
                  Why not? It is cheap
     3. Use CPU as DMA engine
             Second CPU in Blue Gene/L
     4. Add FPU
                   Why not? It is cheap, and you double your peak Flops
     5. Use second FPU for computation
             Otherwise sustained fraction of peak is embarrassingly small
     6. Go back to step 1


Power Efficiencies
[Chart: power efficiency in GFlops/Watt (log scale) versus year, 1997–2005. The purpose-built machines QCDSP (late 1990s) and QCDOC and Blue Gene/L (2004–5) sit well above contemporary general-purpose machines such as ASCI White, ASCI Q, the Earth Simulator, and the NCSA Intel Xeon cluster.]
What Next?

         We need a Petaflop computer for QCD
             New fermion formulations solve a lot of problems, but cost
             significantly more in computer time
         Algorithms are improving
             Slowly
             By factors of about 2
         Can we build a machine about 10 times more
         cost effective than commercial machines?
             Otherwise not worth the effort & risk of building our own
             Blue Gene/L and its successors provide a new opportunity



				