# Algorithms for Dynamical Fermions

A D Kennedy
School of Physics, The University of Edinburgh

NCTS 2006 School on Modern Numerical Methods in Mathematics and Physics
Outline

Monte Carlo integration
Importance sampling
Markov chains
Detailed balance, Metropolis algorithm
Symplectic integrators
Hybrid Monte Carlo
Pseudofermions
RHMC
QCD Computers

Monte Carlo methods: I

Monte Carlo integration is based on the identification of probabilities with measures
There are much better methods of carrying out low-dimensional integrals
All other methods become hopelessly expensive for large dimensions
In lattice QFT there is one integration per degree of freedom
We are approximating an infinite-dimensional functional integral
Monte Carlo methods: II

Generate a sequence of random field configurations \(\phi_1, \phi_2, \dots, \phi_t, \dots, \phi_N\) chosen from the probability distribution
\[ P(\phi_t)\, d\phi_t = \frac{1}{Z}\, e^{-S(\phi_t)}\, d\phi_t \]
Measure the value of \(\Omega\) on each configuration and compute the average
\[ \bar{\Omega} \equiv \frac{1}{N} \sum_{t=1}^{N} \Omega(\phi_t) \]
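As a concrete illustration (not part of the original slides), a minimal Python sketch of this sample average with a naive error estimate, assuming independent samples and a hypothetical user-supplied sampler `sample_phi` that draws configurations from \(e^{-S}/Z\):

```python
import numpy as np

def monte_carlo_average(sample_phi, Omega, N=10_000, rng=None):
    """Estimate <Omega> = (1/N) sum_t Omega(phi_t) from N samples.

    sample_phi(rng) is a hypothetical user-supplied sampler returning a
    field configuration drawn from P(phi) = exp(-S(phi)) / Z.
    """
    rng = rng or np.random.default_rng()
    values = np.array([Omega(sample_phi(rng)) for _ in range(N)])
    mean = values.mean()
    # Naive error bar, valid only for independent samples (see the CLT slides)
    error = values.std(ddof=1) / np.sqrt(N)
    return mean, error
```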
Central Limit Theorem: I

Law of Large Numbers: \(\lim_{N\to\infty} \bar{\Omega} = \langle\Omega\rangle\)
Central Limit Theorem: \(\bar{\Omega} = \langle\Omega\rangle + O\!\big(\sqrt{C_2/N}\big)\)
where the variance of the distribution of \(\Omega\) is \(C_2 = \big\langle (\Omega - \langle\Omega\rangle)^2 \big\rangle\)
The Laplace–DeMoivre Central Limit theorem is an asymptotic expansion for the probability distribution of \(\bar{\Omega}\)
Distribution of values for a single sample \(\omega = \Omega(\phi)\):
\[ P_\Omega(\omega) = \int d\phi\, P(\phi)\, \delta\big(\omega - \Omega(\phi)\big) \]
Central Limit Theorem: II

Generating function for connected moments
\[ W_\Omega(k) \equiv \ln \int d\omega\, P_\Omega(\omega)\, e^{ik\omega} = \ln \big\langle e^{ik\Omega} \big\rangle = \sum_{n \ge 0} \frac{(ik)^n}{n!}\, C_n \]
The first few cumulants are
\[ C_0 = 0, \qquad C_1 = \langle\Omega\rangle, \qquad C_2 = \big\langle(\Omega - \langle\Omega\rangle)^2\big\rangle, \]
\[ C_3 = \big\langle(\Omega - \langle\Omega\rangle)^3\big\rangle, \qquad C_4 = \big\langle(\Omega - \langle\Omega\rangle)^4\big\rangle - 3 C_2^2 \]
Note that this is an asymptotic expansion
Central Limit Theorem: III

Distribution of the average of N samples
\[ \bar{P}(\bar{\omega}) \equiv \int d\omega_1 \cdots d\omega_N\, P(\omega_1) \cdots P(\omega_N)\, \delta\!\left( \bar{\omega} - \frac{1}{N} \sum_{t=1}^{N} \omega_t \right) \]
Connected generating function
\[ \bar{W}(k) \equiv \ln \int d\bar{\omega}\, \bar{P}(\bar{\omega})\, e^{ik\bar{\omega}}
 = \ln \int d\omega_1 \cdots d\omega_N\, P(\omega_1) \cdots P(\omega_N)\, \exp\!\left( \frac{ik}{N} \sum_{t=1}^{N} \omega_t \right) \]
\[ = N \ln \int d\omega\, P(\omega)\, e^{ik\omega/N} = N \ln \big\langle e^{ik\Omega/N} \big\rangle
 = N\, W_\Omega\!\left( \frac{k}{N} \right) = \sum_{n \ge 1} \frac{(ik)^n}{n!} \frac{C_n}{N^{\,n-1}} \]
Central Limit Theorem: IV

Take the inverse Fourier transform to obtain the distribution \(\bar{P}\)
\[ \bar{P}(\bar{\omega}) = \frac{1}{2\pi} \int dk\, e^{\bar{W}(k)}\, e^{-ik\bar{\omega}} \]
\[ = \exp\!\left[ \frac{C_3}{3!\,N^2} \frac{d^3}{d\bar{\omega}^3} + \frac{C_4}{4!\,N^3} \frac{d^4}{d\bar{\omega}^4} + \cdots \right]
 \frac{1}{2\pi} \int dk\, e^{ik(\langle\Omega\rangle - \bar{\omega})}\, e^{-\frac{C_2 k^2}{2N}} \]
\[ = \exp\!\left[ \frac{C_3}{3!\,N^2} \frac{d^3}{d\bar{\omega}^3} + \frac{C_4}{4!\,N^3} \frac{d^4}{d\bar{\omega}^4} + \cdots \right]
 \sqrt{\frac{N}{2\pi C_2}}\; e^{-\frac{N (\bar{\omega} - \langle\Omega\rangle)^2}{2 C_2}} \]
Central Limit Theorem: V

Re-scale to show convergence to a Gaussian distribution
\[ \bar{P}(\bar{\omega})\, d\bar{\omega} = F(\xi)\, d\xi \]
where \(\xi \equiv \sqrt{N}\, (\bar{\omega} - \langle\Omega\rangle)\) and
\[ F(\xi) = \left[ 1 + \frac{C_3\, \xi\, (\xi^2 - 3 C_2)}{6\, C_2^3\, \sqrt{N}} + O\!\left( \frac{1}{N} \right) \right]
 \frac{e^{-\xi^2 / 2 C_2}}{\sqrt{2\pi C_2}} \]
Importance Sampling: I

Integral \(I \equiv \int dx\, f(x)\)
Sample from distribution \(\rho(x)\)
Probability \(\rho(x) \ge 0\) a.e.
Normalisation \(N \equiv \int dx\, \rho(x) = 1\)
Estimator of integral \(I = \int \rho(x)\, dx\, \dfrac{f(x)}{\rho(x)}\)
Estimator of variance
\[ V = \int \rho(x)\, dx \left[ \frac{f(x)}{\rho(x)} - I \right]^2 = \int dx\, \frac{f(x)^2}{\rho(x)} - I^2 \]
Importance Sampling: II

Minimise variance
Constraint \(N = 1\)
Lagrange multiplier \(\lambda\)
\[ \frac{\delta (V + \lambda N)}{\delta \rho(y)} = -\frac{f(y)^2}{\rho(y)^2} + \lambda = 0 \]
Optimal measure \(\rho_{\mathrm{opt}}(x) = \dfrac{|f(x)|}{\int dx\, |f(x)|}\)
Optimal variance \(V_{\mathrm{opt}} = \left( \int dx\, |f(x)| \right)^2 - \left( \int dx\, f(x) \right)^2\)
Importance Sampling: III

Example: \(\displaystyle \int_0^{2\pi} dx\, \sin x\) with \(f(x) = \sin(x)\)
Optimal weight \(\rho_{\mathrm{opt}}(x) = \tfrac{1}{4}\, |\sin(x)|\)
Optimal variance
\[ V_{\mathrm{opt}} = \left( \int_0^{2\pi} dx\, |\sin(x)| \right)^2 - \left( \int_0^{2\pi} dx\, \sin(x) \right)^2 = 16 \]
Importance Sampling: IV

Example: \(\displaystyle \int_0^{2\pi} dx\, \sin x\)

1 Construct a cheap rectangle approximation to |sin(x)|
2 Calculate the relative area within each rectangle
3 Choose a random number uniformly
4 Select the corresponding rectangle
5 Choose another random number uniformly
6 Select the corresponding x value within the rectangle
7 Compute sin(x) and accumulate the weighted estimator
(a code sketch of this procedure follows below)
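A minimal Python sketch of the seven-step procedure above (an illustration under stated assumptions, not the original code); the rectangle heights are taken at bin midpoints, which is one possible choice:

```python
import numpy as np

def importance_sample_sin(n_samples=100_000, n_rect=100, rng=None):
    """Estimate I = integral of sin(x) over [0, 2*pi] with a piecewise-constant
    importance weight rho(x) built from a rectangle approximation to |sin(x)|."""
    rng = rng or np.random.default_rng()
    edges = np.linspace(0.0, 2 * np.pi, n_rect + 1)
    width = edges[1] - edges[0]
    # Steps 1-2: cheap approximation to |sin(x)| and relative area of each rectangle
    heights = np.abs(np.sin(0.5 * (edges[:-1] + edges[1:])))
    areas = heights * width
    prob = areas / areas.sum()          # relative areas
    rho_height = heights / areas.sum()  # normalised density inside each rectangle
    # Steps 3-4: choose rectangles with probability proportional to their area
    idx = rng.choice(n_rect, size=n_samples, p=prob)
    # Steps 5-6: choose x uniformly within the selected rectangle
    x = edges[idx] + width * rng.random(n_samples)
    # Step 7: accumulate the estimator f(x) / rho(x)
    estimator = np.sin(x) / rho_height[idx]
    return estimator.mean(), estimator.var(ddof=1)
```

With 100 rectangles the returned variance should come out close to the value of order 16 quoted on the next slide.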
Importance Sampling: V

Example: \(\displaystyle \int_0^{2\pi} dx\, \sin x\)

With 100 rectangles we have V = 16.02328561
But we can do better!
\[ I = \int_0^{2\pi} dx\, \tfrac{1}{2}\big[ \sin(x) + |\sin(x)| \big] + \int_0^{2\pi} dx\, \tfrac{1}{2}\big[ \sin(x) - |\sin(x)| \big] \]
Each integrand has a definite sign, for which \(V_{\mathrm{opt}} = 0\)
With 100 rectangles we have V = 0.011642808
Markov Chains: I

State space \(\Omega\)
(Ergodic) stochastic transitions \(P': \Omega \to \Omega\)
Deterministic evolution of probability distribution \(P: Q \to Q\)
Distribution converges to a unique fixed point \(\bar{Q}\)
Convergence of Markov Chains: I

Define a metric \(d(Q_1, Q_2) \equiv \int dx\, \big| Q_1(x) - Q_2(x) \big|\) on the space of (equivalence classes of) probability distributions
Prove that \(d(PQ_1, PQ_2) \le (1 - \alpha)\, d(Q_1, Q_2)\) with \(\alpha > 0\), so the Markov process P is a contraction mapping
The sequence Q, PQ, P²Q, P³Q, … is Cauchy
The space of probability distributions is complete, so the sequence converges to a unique fixed point \(\bar{Q} = \lim_{n\to\infty} P^n Q\)
Convergence of Markov Chains: II

\[ d(PQ_1, PQ_2) = \int dx\, \big| PQ_1(x) - PQ_2(x) \big|
 = \int dx\, \Big| \int dy\, P(x \leftarrow y)\, Q_1(y) - \int dy\, P(x \leftarrow y)\, Q_2(y) \Big| \]
Define \(\Delta Q(y) \equiv Q_1(y) - Q_2(y)\) and split it into positive and negative parts, \(\Delta Q = \Delta Q_+ - \Delta Q_-\) with \(\Delta Q_\pm \ge 0\):
\[ = \int dx\, \Big| \int dy\, P(x \leftarrow y)\, \big[ \Delta Q_+(y) - \Delta Q_-(y) \big] \Big| \]
Using \(|a - b| = a + b - 2 \min(a, b)\) for \(a, b \ge 0\):
\[ \le \int dx \int dy\, P(x \leftarrow y)\, \big[ \Delta Q_+(y) + \Delta Q_-(y) \big]
 - 2 \int dx\, \min\!\Big( \int dy\, P(x \leftarrow y)\, \Delta Q_+(y),\; \int dy\, P(x \leftarrow y)\, \Delta Q_-(y) \Big) \]
Convergence of Markov Chains: III

Since \(\int dx\, P(x \leftarrow y) = 1\),
\[ d(PQ_1, PQ_2) \le \int dy\, \big[ \Delta Q_+(y) + \Delta Q_-(y) \big]
 - 2 \int dx\, \inf_y P(x \leftarrow y)\, \min\!\Big( \int dy\, \Delta Q_+(y),\; \int dy\, \Delta Q_-(y) \Big) \]
Both \(Q_1\) and \(Q_2\) are normalised, so
\[ \int dy\, \Delta Q_+(y) - \int dy\, \Delta Q_-(y) = \int dy\, Q_1(y) - \int dy\, Q_2(y) = 1 - 1 = 0 \]
\[ \Rightarrow\quad \int dy\, \Delta Q_+(y) = \int dy\, \Delta Q_-(y) = \tfrac{1}{2} \int dy\, \big| \Delta Q(y) \big| = \tfrac{1}{2}\, d(Q_1, Q_2) \]
Therefore
\[ d(PQ_1, PQ_2) \le \Big[ 1 - \int dx\, \inf_y P(x \leftarrow y) \Big]\, d(Q_1, Q_2) = (1 - \alpha)\, d(Q_1, Q_2),
 \qquad \alpha \equiv \int dx\, \inf_y P(x \leftarrow y) > 0 \]
Markov Chains: II

Use Markov chains to sample from Q
Suppose we can construct an ergodic Markov process P which has
distribution Q as its fixed point
Start with an arbitrary state (“field configuration”)
Iterate the Markov process until it has converged (“thermalized”)
Thereafter, successive configurations will be distributed according to Q
But in general they will be correlated
To construct P we only need relative probabilities of states
Do not know the normalisation of Q
Cannot use Markov chains to compute integrals directly
We can compute ratios of integrals

Markov Chains: III

How do we construct a Markov process with a specified fixed point \(Q(x) = \int dy\, P(x \leftarrow y)\, Q(y)\)?
Detailed balance: \(P(y \leftarrow x)\, Q(x) = P(x \leftarrow y)\, Q(y)\)
Integrate w.r.t. y to obtain the fixed point condition
Sufficient but not necessary for a fixed point
Metropolis algorithm: \(P(x \leftarrow y) = \min\!\left( 1, \dfrac{Q(x)}{Q(y)} \right)\)
Consider the cases \(Q(x) > Q(y)\) and \(Q(x) < Q(y)\) separately to obtain the detailed balance condition
Sufficient but not necessary for detailed balance
Other choices are possible, e.g., \(P(x \leftarrow y) = \dfrac{Q(x)}{Q(x) + Q(y)}\)
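A minimal Python sketch of a Metropolis update with acceptance probability min(1, Q(x)/Q(y)), added here as an illustration (not from the slides); the symmetric Gaussian proposal and its width are arbitrary choices:

```python
import numpy as np

def metropolis_step(y, Q, proposal_width=0.5, rng=None):
    """One Metropolis update: propose x from a symmetric kernel around y,
    accept with probability min(1, Q(x)/Q(y)).  Only relative probabilities
    are needed, so Q may be unnormalised."""
    rng = rng or np.random.default_rng()
    x = y + proposal_width * rng.standard_normal(np.shape(y))
    if rng.random() < min(1.0, Q(x) / Q(y)):
        return x      # accept the proposal
    return y          # reject: keep (and count again) the old state
```

For example, iterating this with `Q = lambda x: np.exp(-0.5 * x**2)` samples a unit Gaussian once the chain has thermalized.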
Markov Chains: IV

Composition of Markov steps
Let P1 and P2 be two Markov steps which have the desired fixed
point distribution
They need not be ergodic
Then the composition of the two steps P2P1 will also have the
desired fixed point
And it may be ergodic
This trivially generalises to any (fixed) number of steps
For the case where P1 is not ergodic but (P1)ⁿ is, the terminology "weakly" and "strongly" ergodic is sometimes used
Markov Chains: V

This result justifies “sweeping” through a
lattice performing single site updates
Each individual single site update has the desired fixed point
because it satisfies detailed balance
The entire sweep therefore has the desired fixed point, and
is ergodic
But the entire sweep does not satisfy detailed balance
Of course it would satisfy detailed balance if the sites were
updated in a random order
But this is not necessary
And it is undesirable because it puts too much randomness into
the system

Markov Chains: VI

Coupling from the Past
Propp and Wilson (1996)
Use a fixed set of random numbers
Flypaper principle: if states coalesce they stay together forever
– Eventually, all states coalesce to some state \(\varphi\) with probability one
– Any state from \(t = -\infty\) will coalesce to \(\varphi\)
– \(\varphi\) is a sample from the fixed point distribution
Markov Chains: VII

Suppose we have a lattice
That is, a partial ordering with a largest and smallest element
And an update step that preserves it
Then once the largest and smallest states have coalesced all the others must have too
[Figure: Hasse diagram of a partially ordered set with top and bottom elements]
Markov Chains: VIII

A non-trivial example of where this is possible is the ferromagnetic Ising model
\[ H = \beta \sum_{\langle ij \rangle} s_i s_j, \qquad \beta \ge 0 \]
Ordering is on spin configurations: A ≥ B iff for every site \(A_s \ge B_s\)
Update is a local heatbath
Autocorrelations: I

Exponential autocorrelations
The unique fixed point of an ergodic Markov process corresponds to the unique eigenvector with eigenvalue 1
All its other eigenvalues must lie within the unit circle
In particular, the largest subleading eigenvalue satisfies \(|\lambda_{\max}| < 1\)
This corresponds to the exponential autocorrelation time
\[ N_{\exp} \equiv -\frac{1}{\ln |\lambda_{\max}|} \ge 0 \]
Autocorrelations: II

Integrated autocorrelations
Consider the autocorrelation of some operator \(\Omega\)
Without loss of generality we may assume \(\langle\Omega\rangle = 0\)
\[ \big\langle \bar{\Omega}^2 \big\rangle = \frac{1}{N^2} \Big\langle \Big( \sum_{t=1}^{N} \Omega_t \Big)^{\!2} \Big\rangle
 = \frac{1}{N^2} \Big[ \sum_{t=1}^{N} \big\langle \Omega_t^2 \big\rangle + 2 \sum_{t=1}^{N} \sum_{t'=t+1}^{N} \big\langle \Omega_t\, \Omega_{t'} \big\rangle \Big] \]
Define the autocorrelation function
\[ C_\Omega(\ell) \equiv \frac{\langle \Omega_{t+\ell}\, \Omega_t \rangle}{\langle \Omega^2 \rangle} \]
\[ \big\langle \bar{\Omega}^2 \big\rangle = \frac{\langle \Omega^2 \rangle}{N} \Big[ 1 + 2 \sum_{\ell=1}^{N-1} \Big( 1 - \frac{\ell}{N} \Big)\, C_\Omega(\ell) \Big] \]
Autocorrelations: III

The autocorrelation function must fall faster than the exponential autocorrelation, \(C_\Omega(\ell) \le |\lambda_{\max}|^{\ell} = e^{-\ell / N_{\exp}}\)
For a sufficiently large number of samples
\[ \big\langle \bar{\Omega}^2 \big\rangle = \frac{\langle \Omega^2 \rangle}{N} \Big[ 1 + 2 \sum_{\ell=1}^{\infty} C_\Omega(\ell) \Big] \Big[ 1 + O\!\Big( \frac{N_{\exp}}{N} \Big) \Big] \]
Define the integrated autocorrelation function \(A_\Omega \equiv \sum_{\ell=1}^{\infty} C_\Omega(\ell)\)
\[ \big\langle \bar{\Omega}^2 \big\rangle = \frac{\langle \Omega^2 \rangle}{N} \big( 1 + 2 A_\Omega \big) \Big[ 1 + O\!\Big( \frac{N_{\exp}}{N} \Big) \Big] \]
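As an illustration (not part of the slides), a small Python sketch that estimates C(ℓ) and the integrated autocorrelation A from a measured time series; the hard window cutoff is a simplifying assumption:

```python
import numpy as np

def integrated_autocorrelation(omega, max_lag=None):
    """Estimate C(l) and A = sum_{l>=1} C(l) from a 1-D series of measurements.

    A fixed window max_lag is used for simplicity; a self-consistent window
    would be preferable in production analyses.
    """
    omega = np.asarray(omega, dtype=float)
    omega = omega - omega.mean()                 # enforce <Omega> = 0
    N = len(omega)
    max_lag = max_lag or N // 10
    var = np.dot(omega, omega) / N
    C = np.array([np.dot(omega[:N - l], omega[l:]) / ((N - l) * var)
                  for l in range(1, max_lag + 1)])
    A = C.sum()
    # Error on the mean, inflated by the (1 + 2A) factor from the slide above
    err = np.sqrt(var * (1 + 2 * A) / N)
    return A, err
```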
Hybrid Monte Carlo: I

In order to carry out Monte Carlo computations including the effects of dynamical fermions we would like to find an algorithm which
Updates the fields globally
Because single link updates are not cheap if the action is not local
Takes large steps through configuration space
Because small-step methods carry out a random walk which leads to critical slowing down with a dynamical critical exponent z = 2
z relates the autocorrelation to the correlation length of the system, \(A_\Omega \propto \xi^z\)
Does not introduce any systematic errors
Hybrid Monte Carlo: II

A useful class of algorithms with these properties is the (Generalised) Hybrid Monte Carlo (HMC) method
Introduce a "fictitious" momentum p corresponding to each dynamical degree of freedom q
Find a Markov chain with fixed point \(\propto \exp[-H(q,p)]\) where H is the "fictitious" Hamiltonian \(\tfrac{1}{2} p^2 + S(q)\)
The action S of the underlying QFT plays the rôle of the potential in the "fictitious" classical mechanical system
This gives the evolution of the system in a fifth dimension, "fictitious" or computer time
This generates the desired distribution \(\exp[-S(q)]\) if we ignore the momenta p (i.e., the marginal distribution)
Hybrid Monte Carlo: III

The HMC Markov chain alternates two Markov
steps
Molecular Dynamics Monte Carlo (MDMC)
(Partial) Momentum Refreshment
Both have the desired fixed point
Together they are ergodic

MDMC: I

If we could integrate Hamilton's equations exactly we could follow a trajectory of constant fictitious energy
This corresponds to a set of equiprobable fictitious phase space configurations
Liouville's theorem tells us that this also preserves the functional integral measure \(dp \wedge dq\) as required
Any approximate integration scheme which is reversible and area preserving may be used to suggest configurations to a Metropolis accept/reject test
With acceptance probability \(\min\big[ 1, e^{-\delta H} \big]\)
MDMC: II

We build the MDMC step out of three parts
Molecular Dynamics (MD), an approximate integrator \(U(\tau): (q, p) \mapsto (q', p')\) which is exactly
Area preserving, \(\det U_* \equiv \det\!\left[ \dfrac{\partial (q', p')}{\partial (q, p)} \right] = 1\)
Reversible, \(F \circ U(\tau) \circ F \circ U(\tau) = 1\)
A momentum flip \(F: p \mapsto -p\)
A Metropolis accept/reject step
The composition of these gives
\[ \begin{pmatrix} q \\ p \end{pmatrix} \mapsto
 \big[ F \circ U(\tau) \big] \begin{pmatrix} q \\ p \end{pmatrix} \theta\big( e^{-\delta H} - y \big)
 + \begin{pmatrix} q \\ p \end{pmatrix} \theta\big( y - e^{-\delta H} \big) \]
with y being a uniformly distributed random number in [0, 1]
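A hedged Python sketch (not from the slides) of this composition, assuming a user-supplied reversible, area-preserving integrator `U` and fictitious Hamiltonian `H`:

```python
import numpy as np

def mdmc_step(q, p, U, H, tau, rng=None):
    """One MDMC step: integrate with U, flip the momentum, then
    Metropolis accept/reject with probability min(1, exp(-dH)).

    U(q, p, tau) is assumed reversible and area preserving
    (for example the leapfrog integrator sketched later)."""
    rng = rng or np.random.default_rng()
    q_new, p_new = U(q, p, tau)
    p_new = -p_new                       # momentum flip F
    dH = H(q_new, p_new) - H(q, p)
    if rng.random() < np.exp(-dH):       # implements min(1, e^{-dH})
        return q_new, p_new
    return q, p                          # reject: keep the old phase space point
```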
Partial Momentum Refreshment

This mixes the Gaussian-distributed momenta p with Gaussian noise \(\xi\)
\[ \begin{pmatrix} p' \\ \xi' \end{pmatrix} =
 \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}
 \begin{pmatrix} F p \\ \xi \end{pmatrix} \]
The Gaussian distribution of p is invariant under F
The extra momentum flip F ensures that for small \(\theta\) the momenta are reversed after a rejection rather than after an acceptance
For \(\theta = \pi/2\) all momentum flips are irrelevant
Symplectic Integrators: I

Baker-Campbell-Hausdorff (BCH) formula
If A and B belong to any (non-commutative) algebra then \(e^A e^B = e^{A + B + \delta}\), where \(\delta\) is constructed from commutators of A and B (i.e., it lies in the Free Lie Algebra generated by {A, B})
More precisely, \(\ln\big( e^A e^B \big) = \sum_{n \ge 1} c_n\), where \(c_1 = A + B\) and
\[ c_{n+1} = \frac{1}{n+1} \left\{ \tfrac{1}{2} \big[ c_n, A - B \big]
 + \sum_{m \ge 1,\ 2m \le n} \frac{B_{2m}}{(2m)!}
 \sum_{\substack{k_1, \dots, k_{2m} \ge 1 \\ k_1 + \cdots + k_{2m} = n}}
 \big[ c_{k_1}, \big[ \cdots, \big[ c_{k_{2m}}, A + B \big] \cdots \big] \big] \right\} \]
where the \(B_{2m}\) are Bernoulli numbers
Symplectic Integrators: II

Explicitly, the first few terms are
\[ \ln\big( e^A e^B \big) = A + B + \tfrac{1}{2} [A, B]
 + \tfrac{1}{12} \big( [A, [A, B]] - [B, [A, B]] \big) - \tfrac{1}{24} \big[ B, [A, [A, B]] \big] + \cdots \]
(the next terms, with denominator 720, are nested commutators of five operators)
In order to construct reversible integrators we use symmetric symplectic integrators
The following identity follows directly from the BCH formula
\[ \ln\big( e^{A/2}\, e^B\, e^{A/2} \big) = A + B - \tfrac{1}{24} \big( [A, [A, B]] + 2\, [B, [A, B]] \big) + \cdots \]
(again the next terms, with denominator 5760, are nested commutators of five operators)
Symplectic Integrators: III

We are interested in finding the classical trajectory in phase space of a system described by the Hamiltonian
\[ H(q, p) = T(p) + S(q) = \tfrac{1}{2} p^2 + S(q) \]
The basic idea of such a symplectic integrator is to write the time evolution operator as
\[ \exp\!\left( \tau \frac{d}{dt} \right)
 = \exp\!\left( \tau \left[ \frac{dp}{dt} \frac{\partial}{\partial p} + \frac{dq}{dt} \frac{\partial}{\partial q} \right] \right)
 = \exp\!\left( \tau \left[ -\frac{\partial H}{\partial q} \frac{\partial}{\partial p} + \frac{\partial H}{\partial p} \frac{\partial}{\partial q} \right] \right)
 \equiv e^{\tau \hat{H}}
 = \exp\!\left( \tau \left[ -S'(q) \frac{\partial}{\partial p} + T'(p) \frac{\partial}{\partial q} \right] \right) \]
Symplectic Integrators: IV

Define \(\hat{Q} \equiv T'(p) \dfrac{\partial}{\partial q}\) and \(\hat{P} \equiv -S'(q) \dfrac{\partial}{\partial p}\), so that \(\hat{H} = \hat{P} + \hat{Q}\)
Since the kinetic energy T is a function only of p and the potential energy S is a function only of q, it follows that the action of \(e^{\tau \hat{Q}}\) and \(e^{\tau \hat{P}}\) may be evaluated trivially
\[ e^{\tau \hat{Q}}: f(q, p) \mapsto f\big( q + \tau\, T'(p),\, p \big) \]
\[ e^{\tau \hat{P}}: f(q, p) \mapsto f\big( q,\, p - \tau\, S'(q) \big) \]
Symplectic Integrators: V

From the BCH formula we find that the PQP symmetric symplectic integrator is given by
\[ U_0(\delta\tau)^{\tau/\delta\tau} = \Big[ e^{\frac{1}{2}\delta\tau \hat{P}}\, e^{\delta\tau \hat{Q}}\, e^{\frac{1}{2}\delta\tau \hat{P}} \Big]^{\tau/\delta\tau} \]
\[ = \exp\!\Big( \delta\tau\, (\hat{P} + \hat{Q}) - \tfrac{\delta\tau^3}{24} \big( [\hat{P}, [\hat{P}, \hat{Q}]] + 2\, [\hat{Q}, [\hat{P}, \hat{Q}]] \big) + O(\delta\tau^5) \Big)^{\tau/\delta\tau} \]
\[ = \exp\!\Big( \tau\, (\hat{P} + \hat{Q}) - \tfrac{\tau\, \delta\tau^2}{24} \big( [\hat{P}, [\hat{P}, \hat{Q}]] + 2\, [\hat{Q}, [\hat{P}, \hat{Q}]] \big) + O(\delta\tau^4) \Big)
 = e^{\tau \hat{H}} + O(\delta\tau^2) \]
In addition to conserving energy to \(O(\delta\tau^2)\), such symmetric symplectic integrators are manifestly area preserving and reversible
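A minimal Python sketch (an illustration, not the original code) of the PQP leapfrog update for H = ½p² + S(q), assuming a user-supplied force `dSdq`:

```python
import numpy as np

def leapfrog_pqp(q, p, dSdq, dtau, n_steps):
    """PQP (kick-drift-kick) leapfrog: exp(dtau P/2) exp(dtau Q) exp(dtau P/2),
    repeated n_steps times.  Reversible and area preserving by construction."""
    q, p = np.array(q, dtype=float), np.array(p, dtype=float)
    for _ in range(n_steps):
        p -= 0.5 * dtau * dSdq(q)   # half-step momentum update  (P)
        q += dtau * p               # full-step position update  (Q), using T'(p) = p
        p -= 0.5 * dtau * dSdq(q)   # half-step momentum update  (P)
    return q, p
```

Measuring δH over a trajectory generated this way should show the O(δτ²) behaviour expected from the conserved shadow Hamiltonian H' discussed next.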
Symplectic Integrators: VI

For each symplectic integrator there exists a Hamiltonian H' which is exactly conserved
This is given by substituting Poisson brackets for commutators in the BCH formula
\[ H' = T + S - \frac{\delta\tau^2}{24} \Big( \{S, \{S, T\}\} + 2\, \{T, \{S, T\}\} \Big) + O(\delta\tau^4) \]
(the \(O(\delta\tau^4)\) corrections are further nested Poisson brackets of S and T, with denominator 5760)
Symplectic Integrators: VII

Poisson brackets form a Lie algebra
\[ \{A, B\} \equiv \frac{\partial A}{\partial q} \frac{\partial B}{\partial p} - \frac{\partial A}{\partial p} \frac{\partial B}{\partial q} \]
\[ \{A, B\} = -\{B, A\} \]
Homework exercise: verify the Jacobi identity
\[ \big\{ \{A, B\}, C \big\} + \big\{ \{B, C\}, A \big\} + \big\{ \{C, A\}, B \big\} = 0 \]
Symplectic Integrators: VIII

The explicit result for H' is
\[ H' = H + \frac{\delta\tau^2}{24} \big( 2\, p^2 S'' - S'^2 \big) + O(\delta\tau^4) \]
where the \(O(\delta\tau^4)\) correction (with denominator 720) involves the terms \(p^4 S''''\), \(p^2 S' S'''\), \(p^2 S''^2\), and \(S'^2 S''\)
Note that H' cannot be written as the sum of a p-dependent kinetic term and a q-dependent potential term
As H' is conserved, \(\delta H\) is of \(O(\delta\tau^2)\) for trajectories of arbitrary length
Even if \(\tau = O(\delta\tau^{-k})\) with \(k > 1\)
Symplectic Integrators: IX

Multiple timescales
Split the Hamiltonian into pieces \(H(q, p) = T(p) + S_1(q) + S_2(q)\)
Define \(\hat{Q} \equiv T'(p) \dfrac{\partial}{\partial q}\) and \(\hat{P}_i \equiv -S_i'(q) \dfrac{\partial}{\partial p}\), so that \(\hat{H} = \hat{P}_1 + \hat{P}_2 + \hat{Q}\)
Introduce a nested symmetric symplectic integrator of the (Sexton-Weingarten) form
\[ U_{SW}(\delta\tau)^{\tau/\delta\tau} = \left\{ \left[ e^{\frac{\delta\tau}{2 n_1} \hat{P}_1}
 \left( e^{\frac{\delta\tau}{4 n_1 n_2} \hat{Q}}\, e^{\frac{\delta\tau}{2 n_1 n_2} \hat{P}_2}\, e^{\frac{\delta\tau}{4 n_1 n_2} \hat{Q}} \right)^{2 n_2}
 e^{\frac{\delta\tau}{2 n_1} \hat{P}_1} \right]^{n_1} \right\}^{\tau/\delta\tau} \]
If \(\hat{P}_1 / n_1 = \hat{P}_2 / (2 n_2)\) then the instability in the integrator is tickled equally by each sub-step
This helps if the most expensive force computation does not correspond to the largest force
Dynamical fermions: I

Pseudofermions
Direct simulation of Grassmann fields is not feasible
The problem is not that of manipulating anticommuting values in a computer
It is that \(e^{-S_F} = e^{-\bar{\psi} M \psi}\) is not positive, and thus we get poor importance sampling
We therefore integrate out the fermion fields to obtain the fermion determinant
\[ \int d\bar{\psi}\, d\psi\, e^{-\bar{\psi} M \psi} = \det M \]
\(\bar{\psi}\) and \(\psi\) always occur quadratically
The overall sign of the exponent is unimportant
Dynamical fermions: II

Any operator \(\Omega\) can be expressed solely in terms of the bosonic fields
Use Schwinger's source method and integrate out the fermions
\[ \big\langle \Omega(\bar{\psi}, \psi, \phi) \big\rangle
 = \Big\langle \Omega\Big( \frac{\partial}{\partial \eta}, \frac{\partial}{\partial \bar{\eta}}, \phi \Big)\, e^{\bar{\eta} M^{-1} \eta} \Big\rangle \Big|_{\eta = \bar{\eta} = 0} \]
E.g., the fermion propagator is
\[ G(x, y) = \big\langle \psi(x)\, \bar{\psi}(y) \big\rangle = M^{-1}(x, y) \]
Dynamical fermions: III

Including the determinant as part of the observable to be measured is not feasible
\[ \langle \Omega \rangle = \frac{\big\langle \Omega\, \det M \big\rangle_{S_B}}{\big\langle \det M \big\rangle_{S_B}} \]
The determinant is extensive in the lattice volume, thus again we get poor importance sampling
Represent the fermion determinant as a bosonic Gaussian integral with a non-local kernel
\[ \det M = \int d\phi^{*}\, d\phi\, e^{-\phi^{*} M^{-1} \phi} \]
The fermion kernel must be positive definite (all its eigenvalues must have positive real parts) otherwise the bosonic integral will not converge
The new bosonic fields are called pseudofermions
Dynamical fermions: IV

It is usually convenient to introduce two flavours of fermion and to write
\[ (\det M)^2 = \det\big( M^\dagger M \big) = \int d\phi^{*}\, d\phi\, e^{-\phi^{*} ( M^\dagger M )^{-1} \phi} \]
This not only guarantees positivity, but also allows us to generate the pseudofermions from a global heatbath by applying \(M^\dagger\) to a random Gaussian distributed field
The equations of motion for the boson (gauge) fields are
\[ \dot{p} = -\frac{\partial S_B}{\partial q}
 + \phi^{*} \big( M^\dagger M \big)^{-1} \frac{\partial \big( M^\dagger M \big)}{\partial q} \big( M^\dagger M \big)^{-1} \phi \]
The evaluation of the pseudofermion action and the corresponding force then requires us to find the solution of a (large) set of linear equations, \(\big( M^\dagger M \big) \chi = \phi\)
Dynamical fermions: V

It is not necessary to carry out the inversions required for the equations of motion exactly
There is a trade-off between the cost of computing the force and the acceptance rate of the Metropolis MDMC step
The inversions required to compute the pseudofermion action for the accept/reject step do need to be computed exactly, however
We usually take "exactly" to be synonymous with "to machine precision"
Reversibility: I

Are HMC trajectories reversible and area preserving in practice?
The only fundamental source of irreversibility is the rounding error caused by using finite precision floating point arithmetic
For fermionic systems we can also introduce irreversibility by choosing the starting vector for the iterative linear equation solver time-asymmetrically
We do this if we use a Chronological Inverter, which takes (some extrapolation of) the previous solution as the starting vector
Floating point arithmetic is not associative
It is more natural to store compact variables as scaled integers (fixed point)
Saves memory
Does not solve the precision problem
Reversibility: II

Data for SU(3) gauge theory and QCD with heavy quarks show that rounding errors are amplified exponentially
The underlying continuous time equations of motion are chaotic
Ляпунов exponents characterise the divergence of nearby trajectories
The instability in the integrator occurs when \(\delta H \gg 1\)
Zero acceptance rate anyhow
Reversibility: III

In QCD the Ляпунов exponents appear to scale as the system approaches the continuum limit \(\beta \to \infty\), with \(\nu\, \xi\) = constant
This can be interpreted as saying that the Ляпунов exponent characterises the chaotic nature of the continuum classical equations of motion, and is not a lattice artefact
Therefore we should not have to worry about reversibility breaking down as we approach the continuum limit
Caveat: data is only for small lattices, and is not conclusive
Polynomial approximation

What is the best polynomial approximation p(x) to a continuous function f(x) for x in [0, 1]?
Best with respect to the appropriate norm
\[ \| p - f \|_n \equiv \left[ \int_0^1 dx\, \big| p(x) - f(x) \big|^n \right]^{1/n}, \qquad n \ge 1 \]
Weierstraß' theorem

Taking n → ∞ this is the minimax norm
\[ \| p - f \|_\infty = \min_p \max_{0 \le x \le 1} \big| p(x) - f(x) \big| \]
Weierstraß: Any continuous function can be arbitrarily well approximated by a polynomial
Бернштейн polynomials

The explicit solution is provided by Бернштейн polynomials
\[ p_n(x) = \sum_{k=0}^{n} f\!\left( \frac{k}{n} \right) \binom{n}{k}\, x^k (1 - x)^{n-k} \]
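A small Python sketch (illustrative, not from the slides) of the Бернштейн approximation above:

```python
import numpy as np
from math import comb

def bernstein_approx(f, n, x):
    """Evaluate the degree-n Bernstein polynomial approximation to f at x in [0, 1]."""
    x = np.asarray(x, dtype=float)
    k = np.arange(n + 1)
    coeffs = np.array([f(i / n) * comb(n, i) for i in k])
    # sum_k f(k/n) C(n,k) x^k (1-x)^(n-k)
    basis = x[..., None] ** k * (1.0 - x[..., None]) ** (n - k)
    return basis @ coeffs
```

For example, `bernstein_approx(np.sin, 50, np.linspace(0, 1, 5))` approximates sin(x) on [0, 1], though (as the next slides explain) convergence in n is slow compared with the minimax polynomial.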
Чебышев's theorem

Чебышев: There is always a unique polynomial of any degree d which minimises
\[ \| p - f \|_\infty = \max_{0 \le x \le 1} \big| p(x) - f(x) \big| \]
The error |p(x) − f(x)| reaches its maximum at exactly d + 2 points on the unit interval
Чебышев's theorem: Necessity

Suppose p − f has fewer than d + 2 extrema of equal magnitude
Then at most d + 1 maxima exceed some magnitude
This defines a "gap"
We can construct a polynomial q of degree d which has the opposite sign to p − f at each of these maxima (Lagrange interpolation)
And whose magnitude is smaller than the "gap"
The polynomial p + q is then a better approximation than p to f
Чебышев's theorem: Sufficiency

Suppose there is a polynomial p' with \(\| p' - f \|_\infty \le \| p - f \|_\infty\)
Then \(\big| p'(x_i) - f(x_i) \big| \le \big| p(x_i) - f(x_i) \big|\) at each of the d + 2 extrema
Therefore p' − p must have d + 1 zeros on the unit interval
Thus p' − p = 0 as it is a polynomial of degree d
Чебышев polynomials

Convergence is often exponential in d
The best approximation of degree d − 1 over [−1, 1] to \(x^d\) is
\[ p_{d-1}(x) = x^d - \frac{1}{2^{d-1}}\, T_d(x) \]
where the Чебышев polynomials are \(T_d(x) = \cos\big( d \cos^{-1}(x) \big)\)
The notation T is an old transliteration of Чебышев!
The error is
\[ \big| x^d - p_{d-1}(x) \big| = \frac{1}{2^{d-1}} \big| T_d(x) \big| \le 2\, e^{-d \ln 2} \]
Чебышев rational functions
Чебышев’s theorem is easily extended to
rational approximations
Rational functions with nearly equal degree numerator and
denominator are usually best
Convergence is still often exponential
Rational functions usually give a much better approximation

A simple (but somewhat slow) numerical
algorithm for finding the optimal Чебышев
rational approximation was given by Ремез

Чебышев rationals: Example

A realistic example of a rational approximation is
\[ \frac{1}{\sqrt{x}} \approx 0.3904603901\,
 \frac{(x + 2.3475661045)\, (x + 0.1048344600)\, (x + 0.0073063814)}
      {(x + 0.4105999719)\, (x + 0.0286165446)\, (x + 0.0012779193)} \]
This is accurate to within almost 0.1% over the range [0.003, 1]

Using a partial fraction expansion of such rational functions allows us to use a multishift linear equation solver, thus reducing the cost significantly.

The partial fraction expansion of the rational function above is
\[ \frac{1}{\sqrt{x}} \approx 0.3904603901
 + \frac{0.0511093775}{x + 0.0012779193}
 + \frac{0.1408286237}{x + 0.0286165446}
 + \frac{0.5964845033}{x + 0.4105999719} \]
This appears to be numerically stable.
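A quick, illustrative Python check (not from the slides) that the partial fraction form above does approximate 1/√x on [0.003, 1]:

```python
import numpy as np

SHIFTS   = np.array([0.0012779193, 0.0286165446, 0.4105999719])
RESIDUES = np.array([0.0511093775, 0.1408286237, 0.5964845033])
CONST    = 0.3904603901

def r_inv_sqrt(x):
    """Partial-fraction form of the rational approximation to 1/sqrt(x)."""
    x = np.asarray(x, dtype=float)
    return CONST + (RESIDUES / (x[..., None] + SHIFTS)).sum(axis=-1)

x = np.linspace(0.003, 1.0, 1000)
rel_err = np.abs(r_inv_sqrt(x) * np.sqrt(x) - 1.0)
print(rel_err.max())   # of order 1e-3, consistent with "almost 0.1%"
```

Applied to a matrix kernel, each denominator x + β becomes a shifted solve (M†M + β)⁻¹φ, which is exactly what a multishift solver provides from a single Krylov space.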
Polynomials versus rationals

Золотарев's formula for the optimal rational approximation has L∞ error \(\Delta \propto e^{-n / \ln(1/\varepsilon)}\)
The optimal L₂ polynomial approximation with weight \(\dfrac{1}{\sqrt{1 - x^2}}\) is
\[ \frac{4}{\pi} \sum_{j=0}^{n} \frac{(-1)^j}{2j + 1}\, T_{2j+1}(x) \]
This has L₂ error of O(1/n)
The optimal L∞ polynomial approximation cannot be too much better (or it would lead to a better L₂ approximation)
Non-linearity of CG solver

Suppose we want to solve A²x = b for Hermitian A by CG
It is better to solve Ax = y, Ay = b successively
Condition number \(\kappa(A^2) = \kappa(A)^2\)
Cost is thus \(\propto 2\sqrt{\kappa(A)} < \sqrt{\kappa(A^2)} = \kappa(A)\) in general
Suppose we want to solve Ax = b
Why don't we solve \(A^{1/2} x = y\), \(A^{1/2} y = b\) successively?
The square root of A is uniquely defined if A > 0
This is the case for fermion kernels
All this generalises trivially to nth roots
No tuning needed to split the condition number evenly
How do we apply the square root of a matrix?
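An illustrative numpy-only sketch (not from the slides) of the claim that two successive solves with A beat one solve with A²; the test matrix and tolerances are arbitrary choices:

```python
import numpy as np

def cg(A, b, tol=1e-8, max_iter=20_000):
    """Plain conjugate gradient for symmetric positive definite A; returns (x, iterations)."""
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rr = r @ r
    for it in range(1, max_iter + 1):
        Ap = A @ p
        alpha = rr / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rr_new = r @ r
        if np.sqrt(rr_new) < tol * np.linalg.norm(b):
            return x, it
        p = r + (rr_new / rr) * p
        rr = rr_new
    return x, max_iter

# Compare: solve A^2 x = b directly vs. two successive solves with A
rng = np.random.default_rng(0)
Q = np.linalg.qr(rng.standard_normal((200, 200)))[0]
A = Q @ np.diag(np.linspace(1e-3, 1.0, 200)) @ Q.T   # SPD, condition number 1000
b = rng.standard_normal(200)

x_direct, n_direct = cg(A @ A, b)
y, n1 = cg(A, b)
x_split, n2 = cg(A, y)
print(n_direct, n1 + n2)   # the split solve typically needs far fewer iterations
```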
Rational matrix approximation

Functions on matrices
Defined for a Hermitian matrix by diagonalisation, H = U D U⁻¹
f(H) = f(U D U⁻¹) = U f(D) U⁻¹
Rational functions do not require diagonalisation
\(\alpha H^m + \beta H^n = U (\alpha D^m + \beta D^n) U^{-1}\)
\(H^{-1} = U D^{-1} U^{-1}\)
Rational functions have nice properties
Cheap (relatively)
Accurate
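A small numpy sketch (illustrative only) contrasting f(H) by diagonalisation with the rational, partial-fraction form, which needs only shifted solves:

```python
import numpy as np

def matrix_function_eig(H, f):
    """f(H) = U f(D) U^{-1} for Hermitian H, via diagonalisation."""
    d, U = np.linalg.eigh(H)
    return (U * f(d)) @ U.conj().T

def matrix_rational(H, phi, const, residues, shifts):
    """Apply r(H) phi = const*phi + sum_k residues[k] * (H + shifts[k])^{-1} phi.
    In production the shifted solves would come from a multishift Krylov solver."""
    n = H.shape[0]
    out = const * phi
    for a, b in zip(residues, shifts):
        out += a * np.linalg.solve(H + b * np.eye(n), phi)
    return out
```

Using the 1/√x coefficients from the earlier example, `matrix_rational(H, phi, 0.3904603901, RESIDUES, SHIFTS)` approximates H^(-1/2) φ for Hermitian H with spectrum in [0.003, 1].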
No Free Lunch Theorem

We must apply the rational approximation with each CG iteration
\(M^{1/n} \approx r(M)\)
The condition number for each term in the partial fraction expansion is approximately \(\kappa(M)\)
So the cost of applying \(M^{1/n}\) is proportional to \(\kappa(M)\)
Even though the condition number \(\kappa(M^{1/n}) = \kappa(M)^{1/n}\)
And even though \(\kappa(r(M)) = \kappa(M)^{1/n}\)
So we don't win this way…
Pseudofermions

We want to evaluate a functional integral including the fermionic determinant det M
We write this as a bosonic functional integral over a pseudofermion field with kernel \(M^{-1}\)
\[ \det M = \int d\phi^{*}\, d\phi\, e^{-\phi^{*} M^{-1} \phi} \]
Multipseudofermions

We are introducing extra noise into the system by using a single pseudofermion field to sample this functional integral
This noise manifests itself as fluctuations in the force exerted by the pseudofermions on the gauge fields
This increases the maximum fermion force
This triggers the integrator instability
This requires decreasing the integration step size
A better estimate is \(\det M = \big[ \det M^{1/n} \big]^n\)
\[ \det M = \left[ \int d\phi^{*}\, d\phi\, e^{-\phi^{*} M^{-1/n} \phi} \right]^{n} \]
Violation of NFL Theorem

So let's try using our nth root trick to implement multipseudofermions
Condition number \(\kappa(r(M)) = \kappa(M)^{1/n}\)
So the maximum force is reduced by a factor of \(n\, \kappa(M)^{(1/n) - 1}\)
This is a good approximation if the condition number is dominated by a few isolated tiny eigenvalues
This is so in the case of interest
Cost reduced by a factor of \(n\, \kappa(M)^{(1/n) - 1}\)
Optimal value \(n_{\mathrm{opt}} \approx \ln \kappa(M)\)
So the optimal cost reduction is \((e \ln \kappa) / \kappa\)
This works!
Rational Hybrid Monte Carlo: I

RHMC algorithm for the fermionic kernel \(\big( M^\dagger M \big)^{\frac{1}{2n}}\), with pseudofermion action \(\phi^\dagger \big( M^\dagger M \big)^{-\frac{1}{2n}} \phi\)
Generate the pseudofermion from a Gaussian heatbath
\[ P(\xi) \propto e^{-\frac{1}{2} \xi^\dagger \xi}, \qquad \phi = \big( M^\dagger M \big)^{\frac{1}{4n}}\, \xi
 \;\;\Rightarrow\;\; P(\phi) \propto e^{-\frac{1}{2}\, \phi^\dagger ( M^\dagger M )^{-\frac{1}{2n}} \phi} \]
Use an accurate rational approximation \(r(x) \approx x^{1/4n}\) for the heatbath
Use a less accurate rational approximation \(\bar{r}(x) \approx x^{-1/2n}\) for the MD evolution
\(\bar{r}(x) \ne r(x)^{-2}\), so there are no double poles
Use the accurate approximation for the Metropolis acceptance step
Rational Hybrid Monte Carlo: II
Reminders
Apply rational approximations using their partial fraction
expansions
Denominators are all just shifts of the original fermion kernel
All poles of optimal rational approximations are real and positive for
cases of interest (Miracle #1)
Only simple poles appear (by construction!)
Use multishift solver to invert all the partial fractions using a single
Krylov space
Cost is dominated by Krylov space construction, at least for O(20)
shifts
Result is numerically stable, even in 32-bit precision
All partial fractions have positive coefficients (Miracle #2)
MD force term is of the usual form for each partial fraction
Applicable to any kernel

Comparison with R algorithm: I

Binder cumulant of the chiral condensate, B4, and RHMC acceptance rate A from a finite temperature study (2+1 flavour naïve staggered fermions, Wilson gauge action, V = 8³×4, m_ud = 0.0076, m_s = 0.25, τ = 1.0)

Algorithm     δt        A       B4
R             0.0019    —       1.56(5)
R             0.0038    —       1.73(4)
RHMC          0.055     84%     1.57(2)
Comparison with R algorithm: II

The different masses at which domain wall results were gathered, together with the step-sizes δt, acceptance rates A, and plaquettes P (V = 16³×32×8, DBW2 gauge action, β = 0.72)

Algorithm   m_ud    m_s     δt        A        P
R           0.04    0.04    0.01      —        0.60812(2)
R           0.02    0.04    0.01      —        0.60829(1)
R           0.02    0.04    0.005     —        0.60817
RHMC        0.04    0.04    0.02      65.5%    0.60779(1)
RHMC        0.02    0.04    0.0185    69.3%    0.60809(1)

[Figure: the step-size variation of the plaquette with m_ud = 0.02]
Comparison with R algorithm: III

The integrated
autocorrelation time of
the 13th time-slice of
the pion propagator
from the domain wall
test, with mud = 0.04

Multipseudofermions with multiple timescales

Semiempirical observation: the largest force from a single pseudofermion does not come from the smallest shift
For example, look at the numerators in the partial fraction expansion we exhibited earlier
Make use of this by using a coarser timescale for the more expensive smaller shifts
[Figure: residue α, L² force, α/(β+0.125), and CG iterations for each partial fraction term, plotted against the shift ln(β)]
\[ \frac{1}{\sqrt{x}} \approx 0.3904603901
 + \frac{0.0511093775}{x + 0.0012779193}
 + \frac{0.1408286237}{x + 0.0286165446}
 + \frac{0.5964845033}{x + 0.4105999719} \]
Berlin Wall for Wilson fermions

HMC Results
C Urbach, K Jansen, A Shindler, and U Wenger, hep-lat/0506011, hep-lat/0510064
Comparable performance to Lüscher's SAP algorithm
RHMC?
t.b.a.
Conclusions (RHMC)

Exact
No step-size errors; no step-size extrapolations
Significantly cheaper than the R algorithm
Allows easy implementation of Hasenbusch
(multipseudofermion) acceleration
Further improvements possible
Such as multiple timescales for different terms in the partial
fraction expansion

???

QCD Machines: I

We want a cost-effective computer to solve
interesting scientific problems
In fact we wanted a computer to solve lattice QCD
But it turned out that there is almost nothing that was not applicable to
many other problems too
Not necessary to solve all problems with one architecture
Demise of the general-purpose computer?
Development cost « hardware cost for one large machine
Simple OS and software model
Interleave a few long jobs without time- or space-sharing

QCD Machines: II

Take advantage of mass market components
It is not cost- or time-effective to compete with PC market in designing
custom chips
Use standard software and tools whenever possible

Do not expect compilers or optimisers to do anything particularly smart
Parallelism has to be built into algorithms and programs from the start

Hand code critical kernels in assembler
And develop these along with the hardware design

QCD Machines: III

Parallel applications
Many real-world applications are intrinsically parallel
Because they are approximations to continuous systems
Lattice QCD is a good example
Lattice is a discretisation of four dimensional space-time
Lots of arithmetic on small complex matrices and vectors
Relatively tiny amount of I/O required
Amdahl's law
Amount of parallel work may be increased by working on a larger volume
Relevant parameter is σ, the number of sites per processor
Strong Scaling

The amount of computation \(V^\delta\) required to equilibrate a system of volume V increases faster than linearly
If we are to have any hope of equilibrating large systems the value of δ cannot be much larger than one
For lattice QCD we have algorithms with δ = 5/4
We are therefore driven to as small a value of σ as possible
This corresponds to "thin nodes," as opposed to the "fat nodes" of PC clusters
Clusters are competitive in price/performance up to a certain maximum problem size
This borderline increases with time, of course
Data Parallelism

All processors run the same code
Not necessarily SIMD, where they share a common clock
Synchronization on communication, or at explicit barriers

Types of data parallel operations
Pointwise arithmetic
Nearest neighbour shifts
Perhaps simultaneously in several directions
Global operations
Broadcasts, sums or other reductions


Parallelism comes from running many separate threads
Recent architectures propose running many light-weight threads on each processor to overcome memory latency
What are the threads that need almost no memory?
Calculating zillions of digits of π?
Cryptography?
Computational Grids

In the future carrying out large scale
computations using the Grid will be as
easy as plugging into an electric socket

Hardware Choices

Cost/MFlop
Goal is to be about 10 times more cost-effective than commercial machines
Otherwise it is not worth the effort and risk of building our own machine
Our current Teraflops machines cost about $1/MFlop
For a Petaflops machine we will need to reach about $1/GFlop
Power/cooling
Most cost effective to use low-power components and high-density packaging
Life is much easier if the machine can be air cooled
Clock Speed

Peak/Sustained speed
The sustained performance should be about 20%—50% of peak
Otherwise there is either too much or too little floating point
hardware
Real applications rarely have equal numbers of adds and multiplies

Clock speed
“Sweet point” is currently at 0.5—1 GHz chips
High-performance chips running at 3 GHz are both hot and
expensive
Using moderate clock speed makes electrical design issues such as
clock distribution much simpler

Memory Systems
Memory bandwidth
This is currently the main bottleneck in most
architectures
There are two obvious solutions
Data prefetching
Vector processing is one way of doing this
Requires more sophisticated software
Feasible for our class of applications because the control
flow is essentially static (almost no data dependencies)
Hierarchical memory system (NUMA)
We make use of both approaches

Memory Systems

On-chip memory
For QCD the memory footprint is small enough that we can put all
the required data into an on-chip embedded DRAM memory
Cache
Whether the on-chip memory is managed as a cache or as directly
addressable memory is not too important
Cache flushing for communications DMA is a nuisance
Caches are built into most μ-processors, not worth designing our own!
Off-chip memory
The amount of off-chip memory is determined by cost
If the cost of the memory is » 50% of the total cost then buy more
After all, the processors are almost free!

Communications Network: I

This is where a massively parallel computer
differs from a desktop or server machine
In the future the network will become the
principal bottleneck
For large data parallel applications
We will end up designing networks and decorating them
with processors and memories which are almost free

Communications Network

Topology
Grid
This is easiest to build
How many dimensions?
QCDSP had 4, Blue Gene/L has 3, and QCDOC has 6
Extra dimensions allow easier partitioning of the machine
Hypercube/Fat Tree/Ω network/Butterfly network
These are all essentially an infinite dimensional grid
Good for FFTs
Switch
Expensive, and does not scale well

Global Operations

Combining network
A tree which can perform arithmetic at each node
Useful for global reductions
But global operations can be performed using grid wires
It is all a matter of cost
Used in BGL, not in QCDx
Grid wires
Not very good error propagation
O(N^1/4) latency (grows as the perimeter of the grid)
O(N) hardware
[Figure: global sum over a 2-D grid of processors using grid wires]
Global Operations

Combining tree
Good error propagation
O(log N) latency
O(N log N) hardware
Bit-wise operations
Allow arbitrary precision
Used on QCDSP
Not used on QCDOC or BGL
Because data are sent byte-wise (for ECC)
[Figure: binary combining tree summing the values on eight processors]
Global Operations

[Figure: bit-serial combining examples, showing a global sum performed least-significant-bit first (propagating carries) and a global maximum performed most-significant-bit first (selecting the larger operand)]
Communications Network: II

Parameters
Bandwidth
Latency
Packet size
The ability to send small [O(10²) byte] packets between neighbours is vital if we are to run efficiently with a small number of sites per processor (σ)
Control (DMA)
The DMA engine needs to be sophisticated enough to interchange neighbouring faces without interrupting the CPU
Block-strided moves
I/O completion can be polled (or interrupt driven)
Packet Size

Block-strided Moves

For each direction separately we specify in memory-mapped registers:
 Source starting address
 Target starting address
 Block size
 Stride
 Number of blocks
[Figure: a block-strided move gathering a face of a 4×4 array, with block size, stride, and number of blocks illustrated on elements 0-15]
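A tiny Python model (illustrative only, with hypothetical parameter names) of what such a DMA descriptor does:

```python
def block_strided_move(src, dst, src_start, dst_start, block_size, stride, n_blocks):
    """Copy n_blocks blocks of block_size elements from src to dst.
    Successive source blocks start `stride` elements apart; the destination
    is packed contiguously, as when gathering a lattice face for communication."""
    for b in range(n_blocks):
        s = src_start + b * stride
        d = dst_start + b * block_size
        dst[d:d + block_size] = src[s:s + block_size]
    return dst

# Example: gather the first column of a row-major 4x4 array (block 1, stride 4)
src = list(range(16))
face = block_strided_move(src, [0] * 4, src_start=0, dst_start=0,
                          block_size=1, stride=4, n_blocks=4)
print(face)   # [0, 4, 8, 12]
```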
Hardware Design Process
Optimise machine for applications of interest
We run simulations of lattice QCD to tune our design
Most design tradeoffs can be evaluated using a spreadsheet as
the applications are mainly static (data-independent)
It also helps to debug the machine if you use the actual
kernels you will run in production as test cases!
But running QCD even on an RTL simulator is painfully slow

Circuit design done using hardware
description languages (e.g., VHDL)
Time to completion is critical
Trade-off between risk of respin and delay in tape-out

VHDL

entity lcount5 is
  port ( signal clk, res:    in    vlbit;
         signal reset, set:  in    vlbit;
         signal count:       inout vlbit_1d(4 downto 0)
       );
end lcount5;

architecture bhv of lcount5 is
begin
  process
  begin
    wait until prising(clk) or res='1';

    if res='1' then count<=b"00000";
    else if reset='1'
      then count<=b"00000";
    elsif set='1' then
      count(4)<=count(4) xor (count(0) and count(1) and count(2) and count(3));
      count(3)<=count(3) xor (count(0) and count(1) and count(2));
      count(2)<=count(2) xor (count(0) and count(1));
      count(1)<=count(1) xor count(0);
      count(0)<=not count(0);
    end if;end if;
  end process;
end bhv;
QCDSP

Columbia University: 0.4 TFlops
BNL: 0.6 TFlops
QCDOC ASIC Design

PEC = Prefetching EDRAM Controller; EDRAM = Embedded DRAM; DRAM = Dynamic RAM; RAM = Random Access Memory

[Figure: block diagram of the QCDOC ("QCD computer on a Chip") ASIC, fabricated by IBM in its 0.18 μm SA27E embedded-DRAM process. The chip combines a 440 processor core with a double-precision FPU, 24 Mbytes of embedded DRAM behind a prefetching EDRAM controller (2.6 Gbyte/sec memory-processor bandwidth), an external SDRAM controller, DMA engines, 100 Mbit/sec Ethernet (EMACS/MII), and HSSL serial off-node links providing 12 Gbit/sec of communications bandwidth]
QCDOC Components

QCDOC
Edinburgh ACF

UKQCDOC is
located in the
ACF near
Edinburgh

Our insurers will
not let us show
photographs of
the building for
security reasons!

Blue Gene/L

Vicious Design Cycle

1. Communications DMA controller required
Otherwise control by CPU needed
Requires interrupts at each I/O completion
CPU becomes bottleneck
2. Introduce more sophisticated DMA instructions
Block-strided moves for QCDSP
Sequences of block-strided moves for QCDOC
Why not? It is cheap
3. Use CPU as DMA engine
Second CPU in Blue Gene/L
Why not? It is cheap, and you double your peak Flops
4. Use second FPU for computation
Otherwise the sustained fraction of peak is embarrassingly small
5. Go back to step 1
Power Efficiencies

[Figure: power efficiency in GFlops/Watt (log scale, 0.001 to 1) versus year (1997-2005) for QCDSP, QCDOC, Blue Gene/L, NCSA Intel Xeon clusters, ASCI White, ASCI Q, and the Earth Simulator]
What Next?

We need a Petaflop computer for QCD
New fermion formulations solve a lot of problems, but cost
significantly more in computer time
Algorithms are improving
Slowly
By factors of about 2
Can we build a machine about 10 times more
cost effective than commercial machines?
Otherwise not worth the effort & risk of building our own
Blue Gene/L and its successors provide a new opportunity

