Algorithms for Dynamical Fermions
A D Kennedy, School of Physics, The University of Edinburgh
NCTS 2006 School on Modern Numerical Methods in Mathematics and Physics
(Slides posted Sunday, 13 January 2013)

Outline
- Monte Carlo integration
- Importance sampling
- Markov chains
- Detailed balance, Metropolis algorithm
- Symplectic integrators
- Hybrid Monte Carlo
- Pseudofermions
- RHMC
- QCD Computers

Monte Carlo methods: I
- Monte Carlo integration is based on the identification of probabilities with measures
- There are much better methods of carrying out low-dimensional quadrature, but all other methods become hopelessly expensive for large dimensions
- In lattice QFT there is one integration per degree of freedom: we are approximating an infinite-dimensional functional integral

Monte Carlo methods: II
- Generate a sequence of random field configurations φ₁, φ₂, …, φ_t, …, φ_N chosen from the probability distribution
      P(φ_t) dφ_t = (1/Z) e^{−S(φ_t)} dφ_t
- Measure the value of Ω on each configuration and compute the average
      Ω̄ = (1/N) Σ_{t=1}^{N} Ω(φ_t)

Central Limit Theorem: I
- Law of Large Numbers: lim_{N→∞} Ω̄ = ⟨Ω⟩
- Central Limit Theorem: Ω̄ = ⟨Ω⟩ + O(√(C₂/N)), where C₂ = ⟨(Ω − ⟨Ω⟩)²⟩ is the variance of the distribution of Ω
- The Laplace–DeMoivre Central Limit theorem is an asymptotic expansion for the probability distribution of Ω̄
- The distribution of values for a single sample ω = Ω(φ) is P(ω)

Central Limit Theorem: II
- Generating function for connected moments:
      W(k) = ln ∫ dω P(ω) e^{ikω} = ln ⟨e^{ikΩ}⟩ = Σ_{n≥0} C_n (ik)ⁿ/n!
- The first few cumulants are
      C₀ = 0,  C₁ = ⟨Ω⟩,  C₂ = ⟨(Ω − ⟨Ω⟩)²⟩,  C₃ = ⟨(Ω − ⟨Ω⟩)³⟩,  C₄ = ⟨(Ω − ⟨Ω⟩)⁴⟩ − 3C₂²
- Note that this is an asymptotic expansion

Central Limit Theorem: III
- Distribution of the average of N samples:
      P̄(Ω̄) = ∫ dω₁ ⋯ dω_N P(ω₁) ⋯ P(ω_N) δ(Ω̄ − (1/N) Σ_t ω_t)
- Connected generating function:
      W̄(k) = ln ⟨e^{ikΩ̄}⟩ = N ln ⟨e^{ikΩ/N}⟩ = N W(k/N) = Σ_n C_n (ik)ⁿ/(n! N^{n−1})
Central Limit Theorem: IV
- Take the inverse Fourier transform to obtain the distribution:
      P̄(Ω̄) = (1/2π) ∫ dk e^{−ikΩ̄} e^{W̄(k)}
           = exp[ (C₃/3!N²)(−d/dΩ̄)³ + (C₄/4!N³)(−d/dΩ̄)⁴ + ⋯ ] √(N/2πC₂) exp[ −N(Ω̄ − ⟨Ω⟩)²/2C₂ ]

Central Limit Theorem: V
- Re-scale to show convergence to the Gaussian distribution: P̄(Ω̄) dΩ̄ = F(ξ) dξ, where ξ = √N (Ω̄ − ⟨Ω⟩) and
      F(ξ) = (1/√(2πC₂)) e^{−ξ²/2C₂} [ 1 + (C₃/(6C₂³√N))(ξ³ − 3C₂ξ) + O(1/N) ]

Importance Sampling: I
- Integral: I = ∫ dx f(x)
- Sample from a distribution with probability density ρ(x) ≥ 0 (a.e.) and normalisation N = ∫ dx ρ(x) = 1
- Estimator of the integral: I = ⟨f/ρ⟩_ρ = ∫ dx (f(x)/ρ(x)) ρ(x)
- Estimator of the variance:
      V = ∫ dx (f(x)/ρ(x) − I)² ρ(x) = ∫ dx f(x)²/ρ(x) − I²

Importance Sampling: II
- Minimise the variance subject to the constraint N = 1, using a Lagrange multiplier λ:
      δ(V + λN)/δρ(y) = −f(y)²/ρ(y)² + λ = 0
- Optimal measure: ρ_opt(x) = |f(x)| / ∫ dx |f(x)|
- Optimal variance: V_opt = (∫ dx |f(x)|)² − (∫ dx f(x))²

Importance Sampling: III (∫₀^{2π} dx sin x)
- Example: f(x) = sin(x)
- Optimal weight: ρ_opt(x) = |sin(x)|/4
- Optimal variance:
      V_opt = (∫₀^{2π} dx |sin x|)² − (∫₀^{2π} dx sin x)² = 16 − 0 = 16

Importance Sampling: IV (∫₀^{2π} dx sin x)
1. Construct a cheap approximation to |sin(x)|
2. Calculate the relative area within each rectangle
3. Choose a random number uniformly
4. Select the corresponding rectangle
5. Choose another random number uniformly
6. Select the corresponding x value within the rectangle
7. Compute |sin(x)|

Importance Sampling: V (∫₀^{2π} dx sin x)
- With 100 rectangles we have V = 16.02328561
- But we can do better!
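This example is easy to check numerically. The sketch below samples from the exact optimal weight by inverse-transform sampling rather than from a table of rectangles (an implementation shortcut, not the method on the slide): the estimator for the signed integrand has variance 16, while the same samples applied to the positive integrand |sin x| have zero variance.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100_000

# Draw x on [0, 2*pi] from the optimal weight rho(x) = |sin x|/4.
# Inverse-transform sampling per half-period (a shortcut replacing the
# slide's table of rectangles).
u = rng.random(N)
x = np.arccos(1.0 - 2.0 * u)                 # density sin(x)/2 on [0, pi]
mirror = rng.random(N) < 0.5
x = np.where(mirror, 2.0 * np.pi - x, x)     # half the samples go to [pi, 2*pi]

rho = np.abs(np.sin(x)) / 4.0

# Estimator for I = integral of sin(x): f/rho = 4*sgn(sin x), so V = 16
est_signed = np.sin(x) / rho
print(est_signed.mean(), est_signed.var())   # ~0 and ~16

# Estimator for I = integral of |sin(x)|: f/rho = 4 identically, so V = 0
est_abs = np.abs(np.sin(x)) / rho
print(est_abs.mean(), est_abs.var())         # 4.0 and 0.0
```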
- Sample the same way, but estimate instead
      I = ∫₀^{2π} dx sin(x) · sgn(sin(x)) = ∫₀^{2π} dx |sin(x)|
  for which V_opt = 0
- With 100 rectangles we have V = 0.011642808

Markov Chains: I
- State space Ω
- (Ergodic) stochastic transitions P′: Ω → Ω
- Deterministic evolution of the probability distribution P: Q → Q
- The distribution converges to a unique fixed point Q̄

Convergence of Markov Chains: I
- Define a metric on the space of (equivalence classes of) probability distributions:
      d(Q₁, Q₂) = ∫ dx |Q₁(x) − Q₂(x)|
- Prove that d(PQ₁, PQ₂) ≤ (1 − α) d(Q₁, Q₂) with α > 0, so the Markov process P is a contraction mapping
- The sequence Q, PQ, P²Q, P³Q, … is Cauchy
- The space of probability distributions is complete, so the sequence converges to a unique fixed point Q̄ = lim_{n→∞} PⁿQ

Convergence of Markov Chains: II
- Write ΔQ = Q₁ − Q₂, which satisfies ∫ dy ΔQ(y) = 0, and split off the part of the kernel that does not depend on the source state: P(x ← y) = inf_z P(x ← z) + R(x, y), with R(x, y) ≥ 0
- Then ∫ dy P(x ← y) ΔQ(y) = ∫ dy R(x, y) ΔQ(y), so
      d(PQ₁, PQ₂) = ∫ dx |∫ dy R(x, y) ΔQ(y)| ≤ ∫ dx ∫ dy R(x, y) |ΔQ(y)| = (1 − α) d(Q₁, Q₂)
  (the identity |a − b| = a + b − 2 min(a, b) gives the same bound expressed in terms of min(Q₁, Q₂))

Convergence of Markov Chains: III
- Here α ≡ ∫ dx inf_y P(x ← y); ergodicity guarantees α > 0
- Hence d(PQ₁, PQ₂) ≤ (1 − α) d(Q₁, Q₂): P is a contraction

Markov Chains: II
- Use Markov chains to sample from Q: suppose we can construct an ergodic Markov process P which has the distribution Q as its fixed point
- Start with an arbitrary state ("field configuration") and iterate the Markov process until it has converged ("thermalized")
- Thereafter, successive configurations will be distributed according to Q, but in general they will be correlated
- To construct P we only need relative probabilities of states; we do not know the normalisation of Q
- We cannot use Markov chains to compute integrals directly, but we can compute ratios of integrals

Markov Chains: III
How do we construct a
Markov process with a specified fixed point Q(x) = ∫ dy P(x ← y) Q(y)?
- Detailed balance: P(y ← x) Q(x) = P(x ← y) Q(y)
  - Integrate w.r.t. y to obtain the fixed point condition
  - Sufficient but not necessary for a fixed point
- Metropolis algorithm: P(x ← y) = min[1, Q(x)/Q(y)]
  - Consider the cases Q(x) ≥ Q(y) and Q(x) < Q(y) separately to obtain the detailed balance condition
  - Sufficient but not necessary for detailed balance
- Other choices are possible, e.g., P(x ← y) = Q(x)/(Q(x) + Q(y))

Markov Chains: IV
- Composition of Markov steps: let P₁ and P₂ be two Markov steps which have the desired fixed point distribution; they need not be ergodic
- Then the composition of the two steps P₂P₁ will also have the desired fixed point, and it may be ergodic
- This trivially generalises to any (fixed) number of steps
- For the case where P₁ is not ergodic but (P₁)ⁿ is, the terminology weakly and strongly ergodic is sometimes used

Markov Chains: V
- This result justifies "sweeping" through a lattice performing single site updates
- Each individual single site update has the desired fixed point because it satisfies detailed balance
- The entire sweep therefore has the desired fixed point, and is ergodic
- But the entire sweep does not satisfy detailed balance
- It would satisfy detailed balance if the sites were updated in a random order, but this is not necessary, and it is undesirable because it puts too much randomness into the system

Markov Chains: VI
- Coupling from the Past [Propp and Wilson (1996)]
- Use a fixed set of random numbers
- Flypaper principle: if states coalesce they stay together forever
  - Eventually, all states coalesce to some state α with probability one
  - Any state from t = −∞ will coalesce to α
  - α is a sample from the fixed point distribution

Markov Chains: VII
Suppose we have a lattice, that is, a partial ordering with a largest and a smallest element, and an update step that
preserves this ordering. Then once the largest and smallest states have coalesced, all the others must have too.

Markov Chains: VIII
- A non-trivial example where this is possible is the ferromagnetic Ising model with β ≥ 0:
      H = β Σ_{⟨ij⟩} s_i s_j
- The ordering: spin configuration A ≥ B iff A_s ≥ B_s for every site s
- The update is a local heatbath

Autocorrelations: I
- Exponential autocorrelations
- The unique fixed point of an ergodic Markov process corresponds to the unique eigenvector with eigenvalue 1; all its other eigenvalues must lie within the unit circle
- In particular, the largest subleading eigenvalue satisfies |λ_max| < 1
- This corresponds to the exponential autocorrelation time
      N_exp = −1/ln|λ_max| ≥ 0

Autocorrelations: II
- Integrated autocorrelations: consider the autocorrelation of some operator Ω; without loss of generality we may assume ⟨Ω⟩ = 0
- Variance of the sample average:
      ⟨Ω̄²⟩ = (1/N²) ⟨(Σ_t Ω_t)²⟩ = (1/N²) [ Σ_t ⟨Ω_t²⟩ + 2 Σ_{t<t′} ⟨Ω_t Ω_{t′}⟩ ]
- Define the autocorrelation function
      C(ℓ) = ⟨Ω_t Ω_{t+ℓ}⟩ / ⟨Ω²⟩
  so that
      ⟨Ω̄²⟩ = (⟨Ω²⟩/N) [ 1 + 2 Σ_{ℓ=1}^{N−1} (1 − ℓ/N) C(ℓ) ]

Autocorrelations: III
- The autocorrelation function must fall faster than the exponential autocorrelation: C(ℓ) ≤ |λ_max|^ℓ = e^{−ℓ/N_exp}
- For a sufficiently large number of samples
      ⟨Ω̄²⟩ = (⟨Ω²⟩/N) (1 + 2A_Ω) [1 + O(N_exp/N)]
- where the integrated autocorrelation function is A_Ω = Σ_{ℓ=1}^{∞} C(ℓ)

Hybrid Monte Carlo: I
In order to carry out Monte Carlo computations including the effects of dynamical fermions, we would like an algorithm which:
- Updates the fields globally, because single link updates are not cheap if the action is not local
- Takes large steps through configuration space, because small-step methods carry out a random walk, which leads to critical slowing down with dynamical critical exponent z = 2
  - z relates the autocorrelation to the correlation length of the system, N_A ∝ ξ^z
- Does not introduce any systematic errors

Hybrid Monte Carlo: II
A
useful class of algorithms with these properties is the (Generalised) Hybrid Monte Carlo (HMC) method:
- Introduce a "fictitious" momentum p corresponding to each dynamical degree of freedom q
- Find a Markov chain with fixed point ∝ exp[−H(q,p)], where H is the "fictitious" Hamiltonian ½p² + S(q); the action S of the underlying QFT plays the rôle of the potential in the "fictitious" classical mechanical system
- This gives the evolution of the system in a fifth dimension, "fictitious" or computer time
- This generates the desired distribution exp[−S(q)] if we ignore the momenta p (i.e., the marginal distribution)

Hybrid Monte Carlo: III
- The HMC Markov chain alternates two Markov steps:
  - Molecular Dynamics Monte Carlo (MDMC)
  - (Partial) Momentum Refreshment
- Both have the desired fixed point; together they are ergodic

MDMC: I
- If we could integrate Hamilton's equations exactly we could follow a trajectory of constant fictitious energy: this corresponds to a set of equiprobable fictitious phase space configurations
- Liouville's theorem tells us that this also preserves the functional integral measure dp dq, as required
- Any approximate integration scheme which is reversible and area preserving may be used to suggest configurations to a Metropolis accept/reject test, with acceptance probability min[1, exp(−δH)]

MDMC: II
We build the MDMC step out of three parts:
- Molecular Dynamics (MD): an approximate integrator U(τ): (q,p) → (q′,p′) which is exactly
  - Area preserving: det[∂U(τ)(q,p)/∂(q,p)] = 1
  - Reversible: F ∘ U(τ) ∘ F ∘ U(τ) = 1
- A momentum flip F: p → −p
- A Metropolis accept/reject step:
      (q′,p′) = F∘U(τ)(q,p) if y < min[1, e^{−δH}],  (q,p) otherwise
  with y a uniformly distributed random number in [0,1]

Partial Momentum Refreshment
- This mixes the Gaussian-distributed momenta p with Gaussian noise ξ:
      p′ = cos θ · (Fp) + sin θ · ξ,   ξ′ = −sin θ · (Fp) + cos θ · ξ
- The Gaussian distribution of p is invariant
under F.
- The extra momentum flip F ensures that for small θ the momenta are reversed after a rejection rather than after an acceptance
- For θ = π/2 all momentum flips are irrelevant

Symplectic Integrators: I
- Baker-Campbell-Hausdorff (BCH) formula: if A and B belong to any (non-commutative) algebra then e^A e^B = e^{A+B+δ}, where δ is constructed from commutators of A and B (i.e., δ is in the Free Lie Algebra generated by {A, B})
- More precisely, ln(e^A e^B) = Σ_{n≥1} c_n, where c₁ = A + B and the higher c_n are determined recursively in terms of nested commutators of A and B

Symplectic Integrators: II
- Explicitly, the first few terms are
      ln(e^A e^B) = A + B + ½[A,B] + (1/12)([A,[A,B]] + [B,[B,A]]) − (1/24)[B,[A,[A,B]]] + ⋯
  with the next correction, of coefficient 1/720, built from four-fold nested commutators of A and B (coefficients 1, 4, 6, 4, 2, 1)
- In order to construct reversible integrators we use symmetric symplectic integrators
- The following identity follows directly from the BCH formula:
      ln(e^{A/2} e^B e^{A/2}) = A + B − (1/24)([A,[A,B]] + 2[B,[A,B]]) + ⋯
  where the next correction, of coefficient 1/5760, is built from four-fold nested commutators (coefficients 7, 28, 12, 32, 16, 8)

Symplectic Integrators: III
- We are interested in finding the classical trajectory in phase space of a system described by the Hamiltonian
      H(q,p) = T(p) + S(q) = ½p² + S(q)
- The basic idea of such a symplectic integrator is to write the time evolution operator as
      exp(τ d/dt) = exp[ τ ( q̇ ∂/∂q + ṗ ∂/∂p ) ]
                 = exp[ τ ( (∂H/∂p) ∂/∂q − (∂H/∂q) ∂/∂p ) ] ≡ e^{τĤ}

Symplectic Integrators: IV
- Define Q̂ ≡ T′(p) ∂/∂q and P̂ ≡ −S′(q) ∂/∂p, so that Ĥ = P̂ + Q̂
- Since the kinetic energy T is a function only of p, and the potential energy S is a function only of q, the action of e^{τP̂} and e^{τQ̂} may be evaluated trivially:
      e^{τQ̂}: f(q,p) ↦ f(q + τT′(p), p)
      e^{τP̂}: f(q,p) ↦ f(q, p − τS′(q))

Symplectic Integrators: V
From the BCH formula we
find that the PQP symmetric symplectic integrator is given by
      U₀(δτ)^{τ/δτ} = [ e^{(δτ/2)P̂} e^{δτQ̂} e^{(δτ/2)P̂} ]^{τ/δτ}
                   = [ exp( δτ(P̂+Q̂) − (δτ³/24)([P̂,[P̂,Q̂]] + 2[Q̂,[P̂,Q̂]]) + O(δτ⁵) ) ]^{τ/δτ}
                   = exp( τ(P̂+Q̂) − (τδτ²/24)([P̂,[P̂,Q̂]] + 2[Q̂,[P̂,Q̂]]) + O(δτ⁴) )
                   = e^{τĤ} [ 1 + O(δτ²) ]
- In addition to conserving energy to O(δτ²), such symmetric symplectic integrators are manifestly area preserving and reversible

Symplectic Integrators: VI
- For each symplectic integrator there exists a Hamiltonian H′ which is exactly conserved
- It is given by substituting Poisson brackets for commutators in the BCH formula:
      H′ = T + S − (δτ²/24)( {S,{S,T}} + 2{T,{S,T}} ) + O(δτ⁴)
  where the O(δτ⁴) correction, of coefficient 1/5760, involves four-fold nested Poisson brackets of S and T

Symplectic Integrators: VII
- Poisson brackets form a Lie algebra:
      {A,B} = (∂A/∂q)(∂B/∂p) − (∂A/∂p)(∂B/∂q)
      {A,B} = −{B,A}
- Homework exercise: verify the Jacobi identity
      {{A,B},C} + {{B,C},A} + {{C,A},B} = 0

Symplectic Integrators: VIII
- The explicit result for H′ is
      H′ = H + (δτ²/24)( 2p²S″(q) − S′(q)² ) + O(δτ⁴)
  where the O(δτ⁴) correction, of coefficient 1/720, involves terms such as p⁴S⁗ and S′²S″
- Note that H′ cannot be written as the sum of a p-dependent kinetic term and a q-dependent potential term
- As H′ is conserved, δH is of O(δτ²) for trajectories of arbitrary length, even if τ = O(δτ^{−k}) with k > 1

Symplectic Integrators: IX
- Multiple timescales: split the Hamiltonian into pieces
      H(q,p) = T(p) + S₁(q) + S₂(q)
- Define Q̂ ≡ T′(p) ∂/∂q and P̂ᵢ ≡ −Sᵢ′(q) ∂/∂p, so that Ĥ = P̂₁ + P̂₂ + Q̂
- Introduce a nested symmetric symplectic integrator of the (schematic) form
      U_SW(δτ) = [ e^{(δτ/2n₂)P̂₂} ( e^{(δτ/2n₁n₂)P̂₁} e^{(δτ/n₁n₂)Q̂} e^{(δτ/2n₁n₂)P̂₁} )^{n₁} e^{(δτ/2n₂)P̂₂} ]^{n₂}
- Choose n₁ and n₂ so that the instability in the integrator is tickled equally by each sub-step
- This helps if the most expensive force computation does not correspond to the largest force

Dynamical fermions: I
- Pseudofermions: direct simulation of Grassmann fields is not feasible
- The problem is
not that of manipulating anticommuting values in a computer; it is that e^{−S_F} = e^{−ψ̄Mψ} is not positive, and thus we get poor importance sampling
- We therefore integrate out the fermion fields to obtain the fermion determinant
      ∫ dψ̄ dψ e^{−ψ̄Mψ} = det M
- ψ̄ and ψ always occur quadratically; the overall sign of the exponent is unimportant

Dynamical fermions: II
- Any operator can be expressed solely in terms of the bosonic fields
- Use Schwinger's source method and integrate out the fermions; e.g., the fermion propagator is
      G(x,y) = ⟨ψ(x) ψ̄(y)⟩ = M^{−1}(x,y)

Dynamical fermions: III
- Including the determinant as part of the observable to be measured, ⟨Ω⟩ = ⟨Ω det M e^{−S_B}⟩ / ⟨det M e^{−S_B}⟩, is not feasible
- The determinant is extensive in the lattice volume, thus again we get poor importance sampling
- Instead represent the fermion determinant as a bosonic Gaussian integral with a non-local kernel:
      det M = ∫ dφ† dφ e^{−φ†M^{−1}φ}
- The fermion kernel must be positive definite (all its eigenvalues must have positive real parts), otherwise the bosonic integral will not converge
- The new bosonic fields are called pseudofermions

Dynamical fermions: IV
- It is usually convenient to introduce two flavours of fermion and to write
      det M · det M = det(M†M) = ∫ dφ† dφ e^{−φ†(M†M)^{−1}φ}
- This not only guarantees positivity, but also allows us to generate the pseudofermions from a global heatbath, by applying M† to a random Gaussian-distributed field
- The equations of motion for the boson (gauge) fields pick up the pseudofermion force:
      δ[ φ†(M†M)^{−1}φ ] = −φ†(M†M)^{−1} δ(M†M) (M†M)^{−1}φ
- The evaluation of the pseudofermion action and the corresponding force then requires us to find the solution of a (large) set of linear equations, (M†M)χ = φ

Dynamical fermions: V
- It is not necessary to carry out the inversions required for the equations of motion exactly: there is a trade-off between the cost of computing the force and the acceptance rate of the Metropolis MDMC step
- The inversions
required to compute the pseudofermion action for the accept/reject step do need to be computed exactly, however; we usually take "exactly" to be synonymous with "to machine precision"

Reversibility: I
- Are HMC trajectories reversible and area preserving in practice?
- The only fundamental source of irreversibility is the rounding error caused by using finite-precision floating point arithmetic
  - For fermionic systems we can also introduce irreversibility by choosing the starting vector for the iterative linear equation solver time-asymmetrically; we do this if we use a Chronological Inverter, which takes (some extrapolation of) the previous solution as the starting vector
  - Floating point arithmetic is not associative
  - It is more natural to store compact variables as scaled integers (fixed point); this saves memory but does not solve the precision problem

Reversibility: II
- Data for SU(3) gauge theory and QCD with heavy quarks show that rounding errors are amplified exponentially
  - The underlying continuous-time equations of motion are chaotic; Lyapunov exponents characterise the divergence of nearby trajectories
- The instability in the integrator occurs when δH ≫ 1, where the acceptance rate is zero anyhow

Reversibility: III
- In QCD the Lyapunov exponents appear to scale as the system approaches the continuum limit: ν ≈ constant
- This can be interpreted as saying that the Lyapunov exponent characterises the chaotic nature of the continuum classical equations of motion, and is not a lattice artefact
- Therefore we should not have to worry about reversibility breaking down as we approach the continuum limit
- Caveat: the data is only for small lattices, and is not conclusive

Polynomial approximation
- What is the best polynomial approximation p(x) to a continuous function f(x) for x in [0,1]?
- "Best" with respect to the appropriate norm
      ‖p − f‖_n = [ ∫₀¹ dx |p(x) − f(x)|ⁿ ]^{1/n},  where n ≥ 1

Weierstraß' theorem
- Taking n → ∞ this becomes the minimax norm
      ‖p − f‖_∞ = max_{0≤x≤1} |p(x) − f(x)|
- Weierstraß: any continuous function can be arbitrarily well approximated by a polynomial

Bernstein polynomials
- An explicit solution is provided by the Bernstein polynomials
      p_n(x) = Σ_{k=0}^{n} f(k/n) (n choose k) x^k (1 − x)^{n−k}

Chebyshev's theorem
- Chebyshev: there is always a unique polynomial of any degree d which minimises ‖p − f‖_∞ = max_{0≤x≤1} |p(x) − f(x)|
- The error |p(x) − f(x)| reaches its maximum at exactly d+2 points on the unit interval

Chebyshev's theorem: Necessity
- Suppose p − f has fewer than d+2 extrema of equal magnitude
- Then at most d+1 maxima exceed some magnitude; this defines a "gap"
- We can construct a polynomial q of degree d which has the opposite sign to p − f at each of these maxima (Lagrange interpolation), and whose magnitude is smaller than the "gap"
- The polynomial p + q is then a better approximation than p to f

Chebyshev's theorem: Sufficiency
- Suppose there is a polynomial p′ with ‖p′ − f‖_∞ ≤ ‖p − f‖_∞
- Then |p′(xᵢ) − f(xᵢ)| ≤ |p(xᵢ) − f(xᵢ)| at each of the d+2 extrema xᵢ
- Therefore p′ − p must have d+1 zeros on the unit interval
- Thus p′ − p = 0, as it is a polynomial of degree d

Chebyshev polynomials
- Convergence is often exponential in d
- The best approximation of degree d−1 over [−1,1] to x^d is
      p_{d−1}(x) = x^d − 2^{1−d} T_d(x)
  where the Chebyshev polynomials are T_d(x) = cos(d cos^{−1} x)
- The notation T_d comes from an old transliteration, "Tchebyscheff"!
- The error is
      |x^d − p_{d−1}(x)| = 2^{1−d} |T_d(x)| ≤ 2^{1−d} = 2 e^{−d ln 2}

Chebyshev rational functions
- Chebyshev's theorem is easily extended to rational approximations
- Rational functions with nearly equal degree numerator and denominator are usually best
- Convergence is still often exponential, and rational functions usually give a much better approximation
- A simple (but somewhat slow) numerical algorithm for finding the optimal Chebyshev rational approximation was given by Remez

Chebyshev rationals: Example
- A realistic example of a rational approximation is
      1/√x ≈ 0.3904603901 (x + 2.3475661045)(x + 0.1048344600)(x + 0.0073063814) / [(x + 0.4105999719)(x + 0.0286165446)(x + 0.0012779193)]
- This is accurate to within almost 0.1% over the range [0.003, 1]
- Using a partial fraction expansion of such rational functions allows us to use a multishift linear equation solver, thus reducing the cost significantly; the partial fraction expansion of the rational function above is
      1/√x ≈ 0.3904603901 + 0.0511093775/(x + 0.0012779193) + 0.1408286237/(x + 0.0286165446) + 0.5964845033/(x + 0.4105999719)
- This appears to be numerically stable

Polynomials versus rationals
- Zolotarev's formula for the optimal rational approximation has an L∞ error that falls exponentially in the degree n
- The optimal L₂ approximation to the sign function with weight 1/√(1 − x²) is
      (4/π) Σ_{j=0}^{n} (−1)^j T_{2j+1}(x)/(2j+1)
- This has L₂ error of O(1/n)
- The optimal L∞ polynomial approximation cannot be too much better (or it would lead to a better L₂ approximation)

Non-linearity of the CG solver
- Suppose we want to solve A²x = b for Hermitian A by CG
  - It is better to solve Ax = y, Ay = b successively
  - Condition number κ(A²) = κ(A)², so the cost is 2κ(A) < κ(A²) in general
- Suppose we want to solve Ax = b: why don't we solve A^{1/2}x = y, A^{1/2}y = b successively?
  - The square root of A is uniquely defined if A > 0; this is the case for fermion kernels
  - All this generalises trivially to nth roots; no tuning is needed to split the condition number evenly
- But how do we apply the square root of a matrix?
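Before turning to matrices, the scalar approximation quoted above is easy to check numerically: the factored form and its partial fraction expansion agree (up to the rounding of the quoted coefficients), and the relative error stays at the level of about 10⁻³ across [0.003, 1].

```python
import numpy as np

# Coefficients of the (3,3) approximation to 1/sqrt(x) quoted above
def r_factored(x):
    num = (x + 2.3475661045) * (x + 0.1048344600) * (x + 0.0073063814)
    den = (x + 0.4105999719) * (x + 0.0286165446) * (x + 0.0012779193)
    return 0.3904603901 * num / den

def r_partial_fractions(x):
    return (0.3904603901
            + 0.0511093775 / (x + 0.0012779193)
            + 0.1408286237 / (x + 0.0286165446)
            + 0.5964845033 / (x + 0.4105999719))

x = np.linspace(0.003, 1.0, 2001)
form_gap = np.max(np.abs(r_factored(x) / r_partial_fractions(x) - 1.0))
rel_err = np.max(np.abs(r_factored(x) * np.sqrt(x) - 1.0))
print(form_gap)   # the two forms agree up to coefficient rounding
print(rel_err)    # maximum relative error, on the order of 1e-3
```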
Rational matrix approximation
- Functions of matrices are defined for a Hermitian matrix by diagonalisation, H = U D U^{−1}:
      f(H) = f(U D U^{−1}) = U f(D) U^{−1}
- Rational functions do not require diagonalisation:
      H^m + H^n = U (D^m + D^n) U^{−1},  H^{−1} = U D^{−1} U^{−1}
- Rational functions have nice properties: they are cheap (relatively) and accurate

No Free Lunch Theorem
- We must apply the rational approximation M^{1/n} ≈ r(M) with each CG iteration
- The condition number for each term in the partial fraction expansion is approximately κ(M), so the cost of applying M^{1/n} is proportional to κ(M)
  - Even though κ(M^{1/n}) = κ(M)^{1/n}, and even though κ(r(M)) = κ(M)^{1/n}
- So we don't win this way…

Pseudofermions
- We want to evaluate a functional integral including the fermionic determinant det M
- We write this as a bosonic functional integral over a pseudofermion field with kernel M^{−1}:
      det M = ∫ dφ† dφ e^{−φ†M^{−1}φ}

Multipseudofermions
- We are introducing extra noise into the system by using a single pseudofermion field to sample this functional integral
- This noise manifests itself as fluctuations in the force exerted by the pseudofermions on the gauge fields
  - This increases the maximum fermion force
  - This triggers the integrator instability
  - This requires decreasing the integration step size
- A better estimate is det M = [det M^{1/n}]ⁿ:
      det M = ∏_{i=1}^{n} ∫ dφᵢ† dφᵢ e^{−φᵢ†M^{−1/n}φᵢ}

Violation of the NFL Theorem
- So let's try using our nth root trick to implement multipseudofermions
- Condition number κ(r(M)) = κ(M)^{1/n}, so the maximum force is reduced by a factor of n κ(M)^{(1/n)−1}
  - This is a good approximation if the condition number is dominated by a few isolated tiny eigenvalues, which is so in the case of interest
- The cost is reduced by a factor of n κ(M)^{(1/n)−1}
- The optimal value is n_opt ≈ ln κ(M), so the optimal cost reduction is (e ln κ)/κ
- This works!
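The claimed optimum can be checked with a few lines of arithmetic. For an illustrative condition number κ = 10⁴ (an assumed value, not one from the slides), the reduction factor n·κ^{(1/n)−1} is minimised very close to n = ln κ, and the minimum is close to e·ln(κ)/κ:

```python
import math

kappa = 1.0e4   # illustrative condition number (assumed value)

# Force/cost reduction factor n * kappa^((1/n) - 1) for n pseudofermions
reduction = {n: n * kappa ** (1.0 / n - 1.0) for n in range(1, 21)}
n_opt = min(reduction, key=reduction.get)

print(n_opt, math.log(kappa))       # integer minimiser sits near ln(kappa) ~ 9.2
print(reduction[n_opt])             # close to e * ln(kappa) / kappa
```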
Rational Hybrid Monte Carlo: I
RHMC algorithm for a fermionic kernel (M†M)^{1/2n}:
1. Generate the pseudofermion from a Gaussian heatbath: draw ξ with P(ξ) ∝ e^{−ξ†ξ/2} and set φ = (M†M)^{1/4n} ξ, so that
      P(φ) ∝ exp[ −½ φ†(M†M)^{−1/2n}φ ]
2. Use an accurate rational approximation r(x) ≈ x^{1/4n} for the heatbath
3. Use a less accurate approximation r̄(x) ≈ x^{−1/2n} for the MD evolution, approximating x^{−1/2n} directly rather than using r(x)^{−2}, so there are no double poles
4. Use an accurate approximation for the Metropolis acceptance step

Rational Hybrid Monte Carlo: II
Reminders:
- Apply rational approximations using their partial fraction expansions; the denominators are all just shifts of the original fermion kernel
- All poles of optimal rational approximations are real and positive for the cases of interest (Miracle #1), and only simple poles appear (by construction!)
- Use a multishift solver to invert all the partial fractions using a single Krylov space; the cost is dominated by the Krylov space construction, at least for O(20) shifts
- The result is numerically stable, even in 32-bit precision: all partial fractions have positive coefficients (Miracle #2)
- The MD force term is of the usual form for each partial fraction
- Applicable to any kernel

Comparison with the R algorithm: I
Binder cumulant of the chiral condensate, B₄, and RHMC acceptance rate A from a finite temperature study (2+1 flavour naïve staggered fermions, Wilson gauge action, V = 8³×4, m_ud = 0.0076, m_s = 0.25, τ = 1.0):

    Algorithm   δτ       A      B₄
    R           0.0019   —      1.56(5)
    R           0.0038   —      1.73(4)
    RHMC        0.055    84%    1.57(2)

Comparison with the R algorithm: II
The different masses at which domain wall results were gathered, together with the step-sizes δτ, acceptance rates A, and plaquettes P (V = 16³×32×8, DBW2 gauge action, β = 0.72); the pair of R rows with m_ud = 0.02 shows the step-size variation of the plaquette:

    Algorithm   m_ud   m_s    δτ       A       P
    R           0.04   0.04   0.01     —       0.60812(2)
    R           0.02   0.04   0.01     —       0.60829(1)
    R           0.02   0.04   0.005    —       0.60817
    RHMC        0.04   0.04   0.02     65.5%   0.60779(1)
    RHMC        0.02   0.04   0.0185   69.3%   0.60809(1)
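The partial-fraction application of r(M) ≈ M^{−1/2} described above can be sketched with dense linear algebra. Here each shifted system is solved independently with NumPy (a production code would use a single multishift Krylov solver so all shifts share one Krylov space), and the result is compared against M^{−1/2}φ computed by diagonalisation; the test kernel is an assumed stand-in, not a lattice Dirac operator:

```python
import numpy as np

rng = np.random.default_rng(0)

# A Hermitian positive-definite stand-in for the fermion kernel, with
# spectrum inside [0.003, 1], the validity window of the approximation above
n = 50
evals = np.linspace(0.003, 1.0, n)
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
M = (Q * evals) @ Q.T

phi = rng.standard_normal(n)

# Apply r(M) ~ M^(-1/2) via the partial fraction expansion:
# one shifted solve per (simple, real, positive) pole
shifts   = [0.0012779193, 0.0286165446, 0.4105999719]
residues = [0.0511093775, 0.1408286237, 0.5964845033]
y = 0.3904603901 * phi
for a, b in zip(residues, shifts):
    y = y + a * np.linalg.solve(M + b * np.eye(n), phi)

# Exact M^(-1/2) phi by diagonalisation: f(H) = U f(D) U^T
w, U = np.linalg.eigh(M)
exact = U @ ((U.T @ phi) / np.sqrt(w))

rel_err = np.linalg.norm(y - exact) / np.linalg.norm(exact)
print(rel_err)   # bounded by the scalar approximation error, ~1e-3
```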
Comparison with the R algorithm: III
[Figure: the integrated autocorrelation time of the 13th time-slice of the pion propagator from the domain wall test, with m_ud = 0.04]

Multipseudofermions with multiple timescales
- Semiempirical observation: the largest force from a single pseudofermion does not come from the smallest shift
  - For example, look at the numerators in the partial fraction expansion we exhibited earlier:
        1/√x ≈ 0.3904603901 + 0.0511093775/(x + 0.0012779193) + 0.1408286237/(x + 0.0286165446) + 0.5964845033/(x + 0.4105999719)
- Make use of this by using a coarser timescale for the more expensive smaller shifts
[Figure: residue α, L² force, and CG iterations for each shift β]

Berlin Wall for Wilson fermions
- HMC results: C. Urbach, K. Jansen, A. Schindler, and U. Wenger, hep-lat/0506011, hep-lat/0510064
- Comparable performance to Lüscher's SAP algorithm
- RHMC? t.b.a.

Conclusions (RHMC)
- Advantages of RHMC:
  - Exact: no step-size errors, no step-size extrapolations
  - Significantly cheaper than the R algorithm
  - Allows easy implementation of Hasenbusch (multipseudofermion) acceleration
  - Further improvements possible, such as multiple timescales for different terms in the partial fraction expansion
- Disadvantages of RHMC: ???

QCD Machines: I
- We want a cost-effective computer to solve interesting scientific problems
- In fact we wanted a computer to solve lattice QCD, but it turned out that there is almost nothing that was not applicable to many other problems too
- Not necessary to solve all problems with one architecture
- Demise of the general-purpose computer?
- Development cost « hardware cost for one large machine
- Simple OS and software model: interleave a few long jobs without time- or space-sharing

QCD Machines: II
- Take advantage of mass market components: it is not cost- or time-effective to compete with the PC market in designing custom chips
- Use standard software and tools whenever possible
- Do not expect compilers or optimisers to do anything particularly smart: parallelism has to be built into algorithms and programs from the start
- Hand code critical kernels in assembler, and develop these along with the hardware design

QCD Machines: III
- Parallel applications: many real-world applications are intrinsically parallel, because they are approximations to continuous systems
- Lattice QCD is a good example
  - The lattice is a discretisation of four-dimensional space-time
  - Lots of arithmetic on small complex matrices and vectors
  - A relatively tiny amount of I/O required
- Amdahl's law: the amount of parallel work may be increased by working on a larger volume
- The relevant parameter is σ, the number of sites per processor

Strong Scaling
- The amount of computation ∝ V^δ required to equilibrate a system of volume V increases faster than linearly
- If we are to have any hope of equilibrating large systems the value of δ cannot be much larger than one; for lattice QCD we have algorithms with δ = 5/4
- We are therefore driven to as small a value of σ as possible: this corresponds to "thin nodes," as opposed to the "fat nodes" of PC clusters
- Clusters are competitive in price/performance up to a certain maximum problem size; this borderline increases with time, of course

Data Parallelism
- All processors run the same code; not necessarily SIMD, where they would share a common clock
- Synchronization on communication, or at explicit barriers
- Types of data parallel operations:
  - Pointwise arithmetic
  - Nearest neighbour shifts, perhaps
simultaneously in several directions
  - Global operations: broadcasts, sums, and other reductions

Alternative Paradigms
- Multithreading: parallelism comes from running many separate, more-or-less independent threads
- Recent architectures propose running many light-weight threads on each processor to overcome memory latency
- But what are the threads that need almost no memory? Calculating zillions of digits of π? Cryptography?

Computational Grids
- In the future, carrying out large scale computations using the Grid will be as easy as plugging into an electric socket

Hardware Choices
- Cost/MFlop: the goal is to be about 10 times more cost-effective than commercial machines; otherwise it is not worth the effort and risk of building our own machine
  - Our current Teraflops machines cost about $1/MFlop; for a Petaflops machine we will need to reach about $1/GFlop
- Power/cooling: most cost-effective to use low-power components and high-density packaging; life is much easier if the machine can be air cooled

Clock Speed
- Peak/sustained speed: the sustained performance should be about 20%-50% of peak, otherwise there is either too much or too little floating point hardware
  - Real applications rarely have equal numbers of adds and multiplies
- Clock speed: the "sweet point" is currently at 0.5-1 GHz chips; high-performance chips running at 3 GHz are both hot and expensive
- Using a moderate clock speed makes electrical design issues such as clock distribution much simpler

Memory Systems
- Memory bandwidth: this is currently the main bottleneck in most architectures
- There are two obvious solutions:
  - Data prefetching: vector processing is one way of doing this; it requires more sophisticated software, but is feasible for our class of applications because the control flow is essentially static (almost no data dependencies)
  - Hierarchical memory system
(NUMA)
We make use of both approaches

Memory Systems
On-chip memory: for QCD the memory footprint is small enough that we can put all the required data into an on-chip embedded DRAM
Cache: whether the on-chip memory is managed as a cache or as directly addressable memory is not too important; cache flushing for communications DMA is a nuisance
Caches are built into most μ-processors, so it is not worth designing our own!
Off-chip memory: the amount of off-chip memory is determined by cost; if the cost of the memory is » 50% of the total cost then buy more processing nodes instead. After all, the processors are almost free!

Communications Network: I
This is where a massively parallel computer differs from a desktop or server machine
In the future the network will become the principal bottleneck for large data-parallel applications: we will end up designing networks and decorating them with processors and memories, which are almost free

Communications Network Topology
Grid: this is easiest to build
How many dimensions?
QCDSP had 4, Blue Gene/L has 3, and QCDOC has 6
Extra dimensions allow easier partitioning of the machine
Hypercube/fat tree/Ω network/butterfly network: these are all essentially an infinite-dimensional grid; good for FFTs
Switch: expensive, and does not scale well

Global Operations
Combining network: a tree which can perform arithmetic at each node; useful for global reductions
But global operations can also be performed using the grid wires; it is all a matter of cost
A combining network is used in BGL, not in QCDx
Grid wires: not very good error propagation; O(N^(1/4)) latency (grows as the perimeter of the grid); O(N) hardware
[Figure: global sum over grid wires of an array of nodes holding 1–8, giving column sums 6, 8, 10, 12 and total 36]

Global Operations
Combining tree: good error propagation; O(log N) latency; O(N log N) hardware
Bit-wise operations allow arbitrary precision
Used on QCDSP, but not on QCDOC or BGL, because data is sent byte-wise (for ECC)
[Figure: combining tree summing leaves 1–8 via partial sums 3, 7, 11, 15 and 10, 26 to the total 36]

Global Operations
Broadcast: the root's word (e.g. 110100) is streamed to every node unchanged
Sum: streamed LSB first, with a one-bit carry held between steps
Max: streamed MSB first; at the first differing bit, "select upper", i.e. keep the operand whose bit is 1
[Figure: worked bit-serial examples of broadcast, LSB-first sum with carry, and MSB-first max on six-bit words]
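The bit-serial global operations described above can be mimicked in software. A minimal Python sketch (the function names are illustrative; this models the tree's per-node logic, not any actual QCDSP hardware):

```python
def bitserial_sum(a, b, nbits):
    """Add two non-negative integers bit-serially, least-significant bit
    first, keeping a one-bit carry between steps (as a combining-tree
    node would)."""
    carry, result = 0, 0
    for i in range(nbits):
        x = (a >> i) & 1
        y = (b >> i) & 1
        result |= (x ^ y ^ carry) << i          # sum bit
        carry = (x & y) | (carry & (x ^ y))     # carry into next step
    return result

def bitserial_max(a, b, nbits):
    """Maximum computed bit-serially, most-significant bit first: at the
    first bit where the operands differ, 'select upper' -- keep the
    operand whose bit is 1 -- and copy it from then on."""
    selected, result = None, 0
    for i in reversed(range(nbits)):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if selected is None and x != y:
            selected = a if x > y else b
        bit = ((selected >> i) & 1) if selected is not None else x
        result |= bit << i
    return result
```

Because the operands stream through one bit at a time, the same node logic supports arbitrary precision, which is the point made on the combining-tree slide.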
Communications Network: II
Parameters: bandwidth, latency, and packet size
The ability to send small [O(10²) byte] packets between neighbours is vital if we are to run efficiently with a small number of sites per processor (σ)
Control (DMA): the DMA engine needs to be sophisticated enough to interchange neighbouring faces without interrupting the CPU
Block-strided moves; I/O completion can be polled (or interrupt driven)

Packet Size
[Figure: communications performance as a function of packet size]

Block-strided Moves
For each direction separately we specify in memory-mapped registers: the source starting address, the target starting address, the block size, the stride, and the number of blocks
[Figure: an example block-strided move on a 4×4 array of words numbered 0–15]

Hardware Design Process
Optimise the machine for the applications of interest: we run simulations of lattice QCD to tune our design
Most design trade-offs can be evaluated using a spreadsheet, as the applications are mainly static (data-independent)
It also helps to debug the machine if you use the actual kernels you will run in production as test cases!
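Returning to the block-strided moves above: a minimal Python sketch of what such a DMA transfer does. In hardware the parameters live in memory-mapped registers; here plain function arguments stand in for them, and the 4×4 gather example is illustrative, not the actual register interface:

```python
def block_strided_move(memory, src, dst, blocksize, stride, nblocks):
    """Copy nblocks blocks of blocksize words, advancing the source by
    stride words per block, packing the target contiguously -- e.g.
    gathering one face of a local lattice for a neighbour exchange."""
    for b in range(nblocks):
        for i in range(blocksize):
            memory[dst + b * blocksize + i] = memory[src + b * stride + i]

# Example: gather the left column of a 4x4 row-major array of words
# numbered 0..15 (block size 1, stride 4, 4 blocks).
mem = list(range(16)) + [0] * 4
block_strided_move(mem, src=0, dst=16, blocksize=1, stride=4, nblocks=4)
# mem[16:20] is now [0, 4, 8, 12]
```

A sequence of such moves (as on QCDOC) lets the DMA engine exchange all the faces of a local sub-lattice without interrupting the CPU.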
But running QCD even on an RTL simulator is painfully slow
Circuit design is done using hardware description languages (e.g., VHDL)
Time to completion is critical: there is a trade-off between the risk of a respin and delay in tape-out

VHDL

entity lcount5 is
  port (
    signal clk, res:   in    vlbit;
    signal reset, set: in    vlbit;
    signal count:      inout vlbit_1d(4 downto 0)
  );
end lcount5;

architecture bhv of lcount5 is
begin
  process
  begin
    wait until prising(clk) or res='1';
    if res='1' then
      count <= b"00000";
    else
      if reset='1' then
        count <= b"00000";
      elsif set='1' then
        count(4) <= count(4) xor (count(0) and count(1) and count(2) and count(3));
        count(3) <= count(3) xor (count(0) and count(1) and count(2));
        count(2) <= count(2) xor (count(0) and count(1));
        count(1) <= count(1) xor count(0);
        count(0) <= not count(0);
      end if;
    end if;
  end process;
end bhv;

QCDSP
Columbia University: 0.4 TFlops; BNL: 0.6 TFlops
[Photographs of the QCDSP machines]

QCDOC ASIC Design
QCDOC = QCD (computer) On a Chip; PEC = Prefetching EDRAM Controller; EDRAM = Embedded DRAM; DRAM = Dynamic RAM; RAM = Random Access Memory
[Block diagram: 440 processor core with double-precision FPU; prefetching EDRAM controller with 2.6 Gbyte/sec bandwidth to embedded DRAM/SDRAM; DMA controllers; serial (HSSL) off-node links; 100 Mbit/sec Ethernet; custom-designed logic plus IBM ASIC library components, fabricated by IBM in its 0.18μ SA27E embedded-DRAM process]
QCDOC Components
[Photographs of QCDOC components]

QCDOC
[Photograph of the QCDOC machine]

Edinburgh ACF
UKQCDOC is located in the ACF near Edinburgh
Our insurers will not let us show photographs of the building, for security reasons!

Blue Gene/L
[Photograph of the Blue Gene/L machine]

Vicious Design Cycle
1. Communications DMA controller required: otherwise control by the CPU is needed, requiring an interrupt at each I/O completion, and the CPU becomes the bottleneck
2. Introduce more sophisticated DMA instructions: block-strided moves for QCDSP, sequences of block-strided moves for QCDOC. Why not? It is cheap
3. Use a CPU as the DMA engine: the second CPU in Blue Gene/L
4. Add an FPU to it. Why not? It is cheap, and you double your peak Flops
5. Use the second FPU for computation: otherwise the sustained fraction of peak is embarrassingly small
6. Go back to step 1

Power Efficiencies
[Chart: power efficiency in GFlops/Watt against year (1997–2005) for QCDSP, QCDOC, Blue Gene/L, NCSA clusters, Intel Xeon, ASCI Q, ASCI White, and the Earth Simulator, on a logarithmic scale from 0.001 to 1 GFlops/Watt]

What Next?
We need a Petaflop computer for QCD
New fermion formulations solve a lot of problems, but cost significantly more in computer time
Algorithms are improving, slowly, by factors of about 2
Can we build a machine about 10 times more cost-effective than commercial machines? Otherwise it is not worth the effort and risk of building our own
Blue Gene/L and its successors provide a new opportunity
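As a rough numerical illustration of why Petaflop machines are needed, recall the strong-scaling argument from earlier: equilibration cost grows as V^δ with δ = 5/4. A sketch (the reference lattice size L0 = 16 is an assumption for illustration):

```python
delta = 5.0 / 4.0  # delta = 5/4 for the best current lattice-QCD algorithms

def relative_cost(L, L0=16):
    """Cost of equilibrating an L**4 lattice relative to an L0**4 one,
    assuming cost proportional to V**delta with V = L**4."""
    return (float(L) ** 4 / float(L0) ** 4) ** delta

# Doubling the linear lattice size multiplies the cost by
# 2**(4 * 5/4) = 32, i.e. relative_cost(32) == 32.0
```

This super-linear growth is why δ must stay close to one, and why each modest increase in lattice size demands a new generation of machines.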