Posted: 8/14/2012. Public Domain.
The H-principle of mathematical modelling

Agnar Höskuldsson, IPL, DTU, Bldg 358, 2800 Kgs. Lyngby
Ph: +45 4525 5643, Fax: +45 4593 1577
Email: ah@ipl.dtu.dk  Home page: www.predict.dk

Modelling industrial data: background and basic issues

Industry requires:
• Simple models. Handled by different people; different models for different situations; graphic tools for calibration/testing.
• Stable predictions. Small uncertainties associated with predictions derived from new samples.
• Robust situation. Small measurement errors have small effects on the predictions; detection of abnormalities; easy or simple procedures to 'calibrate instruments'.
• Reliable data. The data used adequately reflect the measurement situation (operating conditions) and the modelling task.

Empirical approach to data or model analysis

Score vector: t = Xw = w₁x₁ + … + w_K x_K

Requirements to t:
• Explain Y. Projection of Y onto t: [(Yᵀt)/(tᵀt)] t
• As large a t as possible (advice from experimental design).
• Reflect features in X that are important in describing Y.

Results from empirical approach to data analysis

X is N×K, the score matrix T is N×A, Y is N×M. Typical values: M ≤ 5, A < 10; for IR & NIR instruments K is around 1050.
Examples: NIR: K=926, A=3. IR: K=1050, A=4. Furnace: K=21, A=5. Process: K=12, A=4.
T is the latent structure in X. Regression is based on T. Features in T are related to X.

The H-principle, formulation

It is a recommendation of how the modelling procedure should be carried out for any mathematical model:
1. Carry out the modelling in steps. At each step you specify how you want to look at the data by formulating how the weights are computed.
2. At each step compute expressions for
   a) the improvement in fit, ΔFit
   b) the associated precision, ΔPrecision.
3. Compute the solution that maximizes the product ΔFit · ΔPrecision.
4. If the computed solution improves the prediction abilities of the model, it is accepted. If it does not provide this improvement, stop the modelling.
5.
The data is adjusted for what has been selected, and the procedure starts again at step 1.

Ref: Prediction Methods in Science and Technology, Vol. 1. Thor Publishing, 1996, pp 405 + disk. ISBN 87-985941-0-9.

The task of modelling data

In industry the primary focus is on the prediction associated with new samples. Under the standard assumptions of linear regression, the variance of the estimated response value ŷ(x₀) for a new sample x₀ is

  Var(ŷ(x₀)) = s² x₀ᵀ(XᵀX)⁻¹x₀,  where  s² = yᵀ(I − X(XᵀX)⁻¹Xᵀ)y / (N − A).

The task of modelling is
a) good fit: a small value of s², i.e. of yᵀ(I − X(XᵀX)⁻¹Xᵀ)y;
b) a low value of the model variation, x₀ᵀ(XᵀX)⁻¹x₀.

Application to linear regression

Multivariate regression, Y ~ N(XB, Σ). In step 1) we look for a score vector t, computed as t = Xw, that gives a good description of the data for the response variables Y. The two aspects of this modelling task, step 2), are
a) improvement in fit: |Yᵀt|²/(tᵀt)
b) associated precision in terms of variance: |Σ|/(tᵀt)
Step 3) suggests maximizing the product
  {|Yᵀt|²/(tᵀt)} · {1/[|Σ|/(tᵀt)]} = |Yᵀt|²/|Σ|.
Assuming |Σ| constant, we maximize |Yᵀt|² = wᵀ(XᵀYYᵀX)w for |w| = 1.
Solution: find w as the eigenvector of the leading eigenvalue of (XᵀYYᵀX)w = λw. (PLS regression)

Background for the H-principle. 1

The measure of error of fit,
  YᵀY − YᵀX(XᵀX)⁻¹XᵀY = Yᵀ(I − X(XᵀX)⁻¹Xᵀ)Y,
and the measure of precision, (XᵀX)⁻¹, are stochastically independent. This means that knowledge of the measure of error of fit does not provide any information on the precision. (The estimate of the mean value of the normal distribution does not provide any information on the variance.)

Background for the H-principle. 2

Significance testing: The F-test of significance of a regression coefficient is equivalent to testing the significance of the correlation coefficient between the present residual response values and the score vector of the coefficient. The correlation coefficient is invariant to the size of the score vector, t_{A+1}, in question.
From significance testing there is no information on the precision of the obtained model. The predictions derived from a 'significant model' can be good or bad. (Statistical significance testing in industrial data with many variables tends to lead to severe overfitting.)

Background for the H-principle. 3

Assume a standard linear regression model, y ~ N(Xβ, σ²I). The response value of a new sample x₀ is estimated as ŷ(x₀) = b₁ᵀx₀. The variance of ŷ(x₀) after A components is

  Var(ŷ(x₀))_A = σ² (1 + x₀ᵀ(X₁ᵀX₁)⁺x₀),  with σ² estimated by
  s²_A = [yᵀy − {(yᵀt₁)²/(t₁ᵀt₁) + … + (yᵀt_A)²/(t_Aᵀt_A)}] / (N − A − 1),
  x₀ᵀ(X₁ᵀX₁)⁺x₀ = (x₀ᵀr₁)²/(t₁ᵀt₁) + … + (x₀ᵀr_A)²/(t_Aᵀt_A).

Result: Var(ŷ(x₀))_{A+1} < Var(ŷ(x₀))_A if and only if
  (yᵀt_{A+1})² > f / [g/(x₀ᵀr_{A+1})² + 1/(t_{A+1}ᵀt_{A+1})].
Conclusion: maximize (yᵀt_{A+1})² and check whether the prediction variance has been reduced for all or most samples.

Background for the H-principle. 4

Assume a standard linear regression model, y ~ N(Xβ, σ²I). The variance of the estimated response ŷ(x₀) = b₁ᵀx₀ for a new sample x₀ consists of two parts, and when the model is expanded:
• the error of fit, yᵀy − yᵀX₁(X₁ᵀX₁)⁺X₁ᵀy, always decreases;
• the model variation, 1 + x₀ᵀ(X₁ᵀX₁)⁺x₀, always increases.
In terms of orthogonal components:
  decrease in error of fit: (yᵀt_A)²/(t_Aᵀt_A)
  increase in model variation: (x₀ᵀr_A)²/(t_Aᵀt_A) = [(yᵀt_A)²/(t_Aᵀt_A)] · (x₀ᵀr_A)²/(yᵀt_A)²
Conclusion: balancing these two terms, seek to maximize (yᵀt_A)².

Schematic interpretation of the optimization task

Find a weight vector w with t = Xw and q = Yᵀt: maximize |q|².
Find weight vectors w and q with t = Xw and v = Yq: maximize tᵀv.
(Common sizes for optical instruments: X is 50 × 1050, Y is 50 × 5.)
Ref: PLS Regression Methods, Journal of Chemometrics, 2 (1988) 211-228.

Mathematical expansions

Data. Simultaneous decomposition of X and X⁺:
  X  = d₁t₁p₁ᵀ + d₂t₂p₂ᵀ + … + d_A t_A p_Aᵀ + … + d_K t_K p_Kᵀ
  X⁺ = d₁r₁s₁ᵀ + d₂r₂s₂ᵀ + … + d_A r_A s_Aᵀ + … + d_K r_K s_Kᵀ

Covariance matrix.
Simultaneous decomposition of S and S⁺:
  S  = d₁p₁p₁ᵀ + d₂p₂p₂ᵀ + … + d_A p_A p_Aᵀ + … + d_K p_K p_Kᵀ
  S⁺ = d₁r₁r₁ᵀ + d₂r₂r₂ᵀ + … + d_A r_A r_Aᵀ + … + d_K r_K r_Kᵀ
Ref: Data analysis, matrix decompositions and generalized inverse, SIAM J. Sci. Comput., 15 (1994) 239-262.

Mean squared error

The squared bias, J_A = |X₁b₁ − Xβ|², has mean value
  E(J_A)/σ² = A + [β_{A+1}²(t_{A+1}ᵀt_{A+1}) + … + β_K²(t_Kᵀt_K)]/σ² = A + β₂ᵀD₂β₂/σ².
The task of modelling is to find a balance between as low values as possible of
1) the dimension of the model, A;
2) the relative squared bias, β₂ᵀD₂β₂/σ².
Mallows' C_p value is used to measure this balance.

McMaster Process Control Consortium, McMaster University, Canada

In charge: Prof. John MacGregor. Participation of 16 large companies in the USA and Canada. Success stories of implementing chemometric methods on the factory floor. John MacGregor has received many international honours for the results obtained at the consortium.
http://www.chemeng.mcmaster.ca/macc/MACC.HTM

Process data

[Figure: the 12 x-variables plotted against variable index; 289 samples (time points); Y = output variables.] Variables 3 and 8 have large variance (variation); some variables have small variance.

Process data, squared correlations

[Figure: squared correlation coefficients between the x's and the response variable.] Variable no. 3 has a high value; four variables have low values.

Process data. Analysis, 1.

PLS regression. Explained variation (per cent):

  No   d(X)      X (cum.)   d(y)      y (cum.)
  1    11.8432   11.8432    88.9710   88.9710
  2    14.3286   26.1718     6.5648   95.5358
  3    16.5351   42.7070     1.6504   97.1862
  4    15.0060   57.7129     0.5906   97.7768

Four components explain 97.78% of the Y-variation; 57.71% of the total variation in X has been used.

Process data. Analysis, 2.

[Figure: computed y-values according to the model versus observed y-values.]

Process data. Analysis, 3.
[Figure: reduced y-values, the u's (u₁ = y), versus score vectors 1-4.]

Process data. Analysis, 4.

[Figure: scatter plots of the first four score vectors against each other.]

Process data. Analysis, 5.

[Figure: plot of the y-values versus time.]

Process data. Analysis, 6.

[Figure: scatter plots of the first four loading vectors, with the 12 variables labelled.]

Process data. Analysis, 7.

[Figure: scatter plots of the first four causal vectors, the r's, tᵢ = Xrᵢ.]

Foss-Electric, Hillerød, Denmark

Produces IR instruments for the chemical industry. Yearly sales around 200 million euros. Around 1600 employees.
An instrument is calibrated against the concrete measurement situation, e.g. the quality of milk (fat, etc.). The instruments are calibrated by chemometric methods: a certain number of IR spectra (samples, say 50) are used to identify the model that should be used in the measurement situation. The model is then developed using chemometric methods.
http://www.foss-electric.dk

IR (Infra-Red) data (dairy production)

[Figure: IR data, 1050 x-variables; Y = quality parameter.] Some variables have small variance, others large.

IR (Infra-Red) data, squared correlations

[Figure: squared correlation coefficients between the 1050 x-variables and Y.] Six intervals of variables show high values.
Ref: Variable and subset selection in PLS regression. Chemometrics and Intelligent Laboratory Systems, 55 (2001) 23-38.

The CovProc procedures

Search for intervals having high covariance:
a) Restrict X to columns showing high covariance.
b) Sort the variables according to the size of the covariance.
1) For the original X.
2) Select intervals of X at each step.
Ref: Covariance procedures for subset selection in regression analysis. Journal of Chemometrics, submitted.

Furnace data. Analysis, 1.

Percentage variation extracted of X and Y:

  No   d(X)      X (cum.)   d(Y)      Y (cum.)
  1    78.7264   78.7264    57.4918   57.4918
  2    10.4655   89.1920    41.0052   98.4970
  3     5.9258   95.1178     0.4263   98.9233
  4     2.4266   97.5444     0.2015   99.1248

99.12% of Y's variation is extracted using 97.54% of X.

Furnace data. Analysis, 2.

[Figure: plots of the reduced y's, the u's, versus score vectors 1-4.]

Furnace data. Analysis, 3.
[Figure: scatter plots of the first four score vectors.]

Furnace data. Analysis, 4.

[Figure: scatter plots of the first four causal vectors, with the 21 variables labelled.]

Beer data. Analysis, 1.

Percentage variation extracted of X and Y:

  No   d(X)      X (cum.)   d(Y)      Y (cum.)
  1    86.7429   86.7429    89.1011   89.1011
  2     4.5285   91.2714     9.7844   98.8855
  3     0.6658   91.9372     0.3781   99.2636
  4     0.7473   92.6845     0.0973   99.3609

Three components extract 99.26% of the variation in Y and use 91.94% of X.

Beer data. Analysis, 2.

[Figure: plots of the first four reduced y's, the u's, versus score vectors.]

Comparison of methods

Procedure for cross-validation:
1. Leave out 10% of the samples. Use the 90% to estimate the parameters of the model, and use the estimates to compute the response values of the 10% left out. Register the average squared difference between observed and computed response values of the 10%, Σᵢ(yᵢ − ŷ₋ᵢ,ⱼ)²/I.
2. Repeat step 1 several times, J times, each time selecting 10% of the samples.
Register the average total squared difference, Σⱼ[Σᵢ(yᵢ − ŷ₋ᵢ,ⱼ)²/I]/J.

Comparison of methods

CovProc. The weight vector is computed as w = Xᵀy/|Xᵀy|. The absolute values of w are sorted; only the largest values are kept, for as long as the fit of the score vector improves, and the others are set to zero.
PLS regression. The weight vector is chosen as w = Xᵀy/|Xᵀy|.
Forward regression. The weight vector is w = (0, 0, …, 1, 0, 0, …), where the index of the 1 corresponds to the variable that gives the largest value of (yᵀxᵢ)²/(xᵢᵀxᵢ); xᵢ is the ith column of the reduced X.
Principal Component Regression (PCR). The regression is based on the first components (those having the largest eigenvalues). The weight vector w is an eigenvector of XᵀX.
PCR with sorted components. The same as the previous method, except that the components are sorted according to the squared covariance between y and t, (yᵀt)².
Ridge Regression (RR). The ridge constant is estimated by the leave-one-out procedure, i.e. it is the constant that gives the smallest value of Σᵢ(yᵢ − ŷ₋ᵢ)², where ŷ₋ᵢ is the estimated response value for the ith sample when it is excluded from the analysis.
RR with sorted components. The same as standard RR, except that the components are sorted according to the squared covariance between y and t, (yᵀt)².
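The repeated hold-out procedure and the PLS weight vector w = Xᵀy/|Xᵀy| described above can be sketched in code. This is a minimal illustration only, not the author's implementation: the function names (`pls1_fit`, `cv_rmse`) and all numerical defaults are this sketch's own assumptions.

```python
import numpy as np

def pls1_fit(X, y, A):
    """PLS1 with A components: at each step w = X'y/|X'y|, t = Xw,
    after which X and y are deflated, as described in the text."""
    Xr, yr = X.copy(), y.copy()
    W, P, c = [], [], []
    for _ in range(A):
        w = Xr.T @ yr
        w /= np.linalg.norm(w)
        t = Xr @ w
        tt = t @ t
        p = Xr.T @ t / tt        # X-loadings
        q = (yr @ t) / tt        # regression of y on t
        Xr -= np.outer(t, p)     # deflate X
        yr -= q * t              # deflate y
        W.append(w); P.append(p); c.append(q)
    W, P = np.column_stack(W), np.column_stack(P)
    # regression coefficients b such that yhat = X b
    return W @ np.linalg.solve(P.T @ W, np.array(c))

def cv_rmse(X, y, A, frac=0.1, J=20, seed=1):
    """Repeated hold-out: leave out `frac` of the samples J times,
    average the squared prediction error over the left-out samples,
    and take the square root of the overall average."""
    rng = np.random.default_rng(seed)
    N = len(y)
    n_out = max(1, int(frac * N))
    mse = 0.0
    for _ in range(J):
        out = rng.choice(N, size=n_out, replace=False)
        mask = np.ones(N, dtype=bool)
        mask[out] = False
        b = pls1_fit(X[mask], y[mask], A)
        mse += np.mean((y[out] - X[out] @ b) ** 2)
    return np.sqrt(mse / J)
```

With the full number of components and a full-rank X, `pls1_fit` reproduces ordinary least squares; the cross-validated RMSE as a function of A is the quantity tabulated in the comparison below.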
Comparison of methods

The square root of the cross-validation measure, {Σⱼ[Σᵢ(yᵢ − ŷ₋ᵢ,ⱼ)²/I]/J}^½, by model dimension:

  Dim  CovProc  PLS     Var Sel  PCR     PCR, sorted  RR      RR, sorted
  1    0.0263   0.1162  0.0276   0.1271  0.1271       0.1271  0.1271
  2    0.0081   0.0134  0.0208   0.0136  0.0136       0.0139  0.0139
  3    0.0065   0.0143  0.0162   0.0142  0.0122       0.0142  0.0154
  4    0.0061   0.0148  0.0150   0.0142  0.0105       0.0142  0.0175
  5    0.0061   0.0165  0.0135   0.0146  0.0093       0.0145  0.0195
  6    0.0062   0.0181  0.0130   0.0150  0.0083       0.0147  0.0204
  7    0.0062   0.0191  0.0126   0.0154  0.0080       0.0151  0.0205
  8    0.0064   0.0194  0.0126   0.0136  0.0072       0.0144  0.0207
  9    0.0065   0.0194  0.0128   0.0141  0.0065       0.0138  0.0204
  10   0.0067   0.0191  0.0129   0.0140  0.0060       0.0137  0.0197

[Figure: solutions considered as curves — variation of the solution vector bᵢ versus dimension (0 to 25), with annotations for prediction, significance, and numerical precision in the data.]

Spurious significance in industrial data, an example

A too detailed model often leads to spurious significance in the case of low-rank industrial data.
Illustration: Principal Component Regression of the Furnace data. X is 119 × 21. SVD decomposition of X: X = USVᵀ. The matrix of score vectors, T = US, has the same size as X. Score vector no. 17, t₁₇, is the seventeenth column of T. It has no correlation with the response variable, but it has a correlation of 0.55 with the reduced response variable, once the effects of the preceding score vectors have been subtracted from the response variable.

[Figure: scatter plot of the response variable versus the 17th score vector.]
[Figure: scatter plot of the reduced response variable versus the 17th score vector.]
Note that all but one of the score values lie between ±0.01.

New methodologies

• Weighing procedures. At each step appropriate weights are found for variables and samples (use all data, but 'zoom in' on important parts).
• Multi-way data analysis. Analysis of matrices (two-way data, x_ij) is extended to multi-way data (data with many indices, x_ijklmn). This includes weighing schemes, multiplication, inverses, precision and other concepts. It differs from normal matrix analysis because linearity, orthogonality and similar concepts are only valid within each mode.
• Path modelling. The concept of regression analysis, X → Y, is extended to a large network of data blocks, where there can be many input (X) and output (Y) data blocks.
• Non-linear modelling. There are two approaches. One is to extend the linear model to a low-dimensional surface; this accounts for many situations in industry, where non-linearity appears as weak curvature. The other is to replace exact solutions at the iterations (or adjusted ones such as Ridge) by low-rank, stable solutions.
• Dynamic models. There are typically three types of objectives: good fit, small changes in solution vectors over time, and target values. Stable solutions are found that meet these objectives.
• Stable solutions to statistical models. Significance testing becomes more efficient when it is based on stable solutions.

Weighing schemes

Find w such that the score vector t = Xw is good. Find v such that the loading vector p = Xᵀv better meets the objective.
Ref: Weighing schemes in Multivariate Data Analysis, Journal of Chemometrics, 15 (2001) 371-396.

Multi-way data analysis

X = (x_ijk). One weight vector generates a score matrix T, two weight vectors generate a score vector t, and three weight vectors generate a single value.
Ref: Multi-way data analysis, Journal of Chemometrics, submitted.

Three-way regression analysis

Find weight vectors w₁ and w₂ for X and weight vectors v₁ and v₂ for Y such that the score vectors t and u show maximal covariance: maximize tᵀu subject to |w₁| = |w₂| = |v₁| = |v₂| = 1.

Path modelling, types (paths and their interpretation)

1. X — one block; we get PCA or PCA-type solutions.
2. X1, X2 — two blocks.
We get linear regression.
3. X1, X2, X3 — multi-block extensions of linear regression.
4. X1, X2 — multi-block extensions of PCA-type solutions, with the roles of variables and samples exchanged.
5. X1 → X2, X3 — here we want the regression to be done with components generated from X1.
6. X1, X2, X3 — here we want to study possible changes in the response values even when we have not been able to observe or measure the corresponding X-values.
7. X1, X3 → X2 — two 'sources', X1 and X3, influence the data block X2.
Ref: Causal and Path Modelling, Chemometrics and Intelligent Laboratory Systems, 58 (2001) 287-311.

Path modelling, three data blocks

X → Y → Z. When new X-samples, X0, become available, we want to know the estimated Y-samples and how the estimated Y-samples project onto Z. X is projected onto Y, and the projection is projected further onto Z.

Path modelling, two input and two output blocks

Input blocks X1 and X2 feed output blocks Z1 and Z2 through Y. X1: chemical measurements, X2: physical; Z1: quality variables, Z2: technical.

Non-linear modelling, illustration

[Figure: adjusted response variable versus the first four score vectors. The first plot shows quadratic curvature; the third and fourth show third-order curvature.]
Ref: The Heisenberg Modelling Procedure and Application to non-linear modelling, Chemometrics and Intelligent Laboratory Systems, 44 (1998) 15-30.

Non-linear modelling, quadratic surface

  y = βᵀx + xᵀFx + ε
Transformations:
  y = βᵀPRᵀx + xᵀRPᵀFPRᵀx + ε
Variables:
  y = (Pᵀβ)ᵀ(Rᵀx) + (Rᵀx)ᵀ(PᵀFP)(Rᵀx) + ε,  with t = Rᵀx
Parameters:
  y = αᵀt + tᵀGt + ε,  where α = Pᵀβ and G = PᵀFP
A quadratic function in x is transformed into a quadratic function in t. The parameters α and G are identified by the estimation procedure.
When a new sample x₀ is available, the associated score vector is computed as t₀ = Rᵀx₀, and the values of the score vector are used in computing the corresponding response value.

Summary

• The H-principle provides (almost) optimal results with respect to prediction.
• It is applicable to most mathematical modelling of data.
• Advanced graphic facilities for mathematical models.
• 'Safe play' modelling of data: no risk of spurious significance.
• It can be used to validate/judge the predictive ability of specific algorithms.
• The criteria of the H-principle extend to different mathematical areas and generate new methodologies/mathematics.
• The numerical procedures are very fast (on the level of ordinary regression analysis).
• It can efficiently handle linear/non-linear models with thousands of variables; models can include a network of (multi-way) data blocks.
• Automatic/online/remote procedures can be established.
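As a closing sketch, steps 1)-5) of the H-principle can be illustrated for the PLS1 case. This is a minimal illustration under stated assumptions: the acceptance check of step 4 is replaced here by a simple fit-gain threshold `tol` rather than a full prediction test, and the function name and defaults are this sketch's own.

```python
import numpy as np

def h_principle_pls1(X, y, max_A=10, tol=1e-3):
    """Stepwise modelling in the spirit of the H-principle (PLS1 case).
    Step 1: the weights w = X'y/|X'y| define the score vector t = Xw.
    Steps 2-3: this w maximizes the fit improvement (y't)^2/(t't)
               for |w| = 1 (the precision factor taken as constant).
    Step 4: a component is accepted only while its fit gain exceeds a
            fraction `tol` of the total y-variation (a stand-in for
            the prediction check described in the text).
    Step 5: X and y are adjusted (deflated) and the step is repeated."""
    Xr, yr = X.copy(), y.copy()
    total = y @ y
    components = []
    for _ in range(max_A):
        c = Xr.T @ yr
        if np.linalg.norm(c) < 1e-12:    # nothing left in the data
            break
        w = c / np.linalg.norm(c)
        t = Xr @ w
        tt = t @ t
        gain = (yr @ t) ** 2 / tt        # decrease in error of fit
        if gain <= tol * total:          # step 4: stop modelling
            break
        p = Xr.T @ t / tt
        q = (yr @ t) / tt
        Xr -= np.outer(t, p)             # step 5: adjust the data
        yr -= q * t
        components.append((w, p, q))
    return components, yr                # accepted components, residual
```

The threshold stands in for the cross-validated prediction check only for brevity; in practice step 4 would compare prediction variances as described under "Background for the H-principle. 3".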