Dept. of Chemical Engineering
CH 544 Multivariate Data Analysis
END SEM EXAM
TAKE HOME EXAM: 5PM on 01/05/09 - 9AM on 04/05/09
This is a take-home exam. YOU ARE EXPECTED TO WORK ON THIS ON YOUR
OWN WITHOUT CONSULTING ANY OTHER LIVING BEING. You are free to
consult your notes, textbooks, and research papers to solve the problems. If your
solutions indicate that you have consulted or copied from your classmates, be prepared to
spend another year learning the meaning of ethics.
Use MATLAB to solve the problems. Submit your answers summarizing the results
obtained, along with explanations where required. For each problem, attach a printout
of the MATLAB code. Document the MATLAB code so that it is easy for me to
check. Well-documented code carries bonus marks. Sign the declaration
below and attach this sheet to your submission. The answers should be
submitted no later than 9AM on Monday, May 4, 2009 in my office.
I have neither assisted nor taken any assistance from anyone to solve this exam.
Name and Roll No:
1. The NIR data set (nirdata.mat) consists of near infrared (NIR) spectra for three-component
mixtures containing toluene, chlorobenzene, and heptane. The data set was
obtained for 31 samples over the range 400-2500 nm at an interval of 2 nm. The
concentrations in the data set vary between 20 and 70 weight percent for toluene and
chlorobenzene and between 2 and 10 weight percent for heptane. The corresponding
concentrations of the mixtures are stored in the variable concs contained in the data set.
Write MATLAB code implementing principal component regression (PCR) to build a
multivariate calibration model to predict concentrations given the absorbance spectrum
of a mixture containing the three components (assume that the number of pure species
in the mixtures is known to be three). Use a leave-one-score-out validation procedure and
compute the RMSE (root mean squared difference between the predicted and measured
concentrations of the sample whose score has been left out) to assess the quality of the
calibration model.
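Illustration only (the submission itself must be in MATLAB): the PCR and leave-one-out steps described above can be sketched in Python/NumPy on synthetic spectra. The Gaussian-shaped pure-component spectra, the noise level, and the function names pcr_fit, pcr_predict, and loo_rmse are assumptions made for this sketch, not properties of nirdata.mat. The loop here refits both the PCA step and the regression without the left-out sample; leaving out only the score in the regression step is a possible alternative reading of the problem.

```python
import numpy as np

def pcr_fit(X, Y, n_pc):
    """Fit PCR: project centred spectra on n_pc loadings, regress Y on the scores."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    # Rows of Vt are the principal directions (loadings) of the centred spectra
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_pc].T                                  # loadings, (n_wavelengths, n_pc)
    T = Xc @ P                                       # scores, (n_samples, n_pc)
    B, *_ = np.linalg.lstsq(T, Yc, rcond=None)       # concentrations regressed on scores
    return x_mean, y_mean, P, B

def pcr_predict(model, X):
    """Predict concentrations for the spectra in the rows of X."""
    x_mean, y_mean, P, B = model
    return (X - x_mean) @ P @ B + y_mean

def loo_rmse(X, Y, n_pc):
    """Leave-one-sample-out RMSE, one value per component."""
    n = X.shape[0]
    errors = np.empty_like(Y, dtype=float)
    for i in range(n):
        keep = np.arange(n) != i                     # all samples except sample i
        model = pcr_fit(X[keep], Y[keep], n_pc)
        errors[i] = pcr_predict(model, X[i:i + 1])[0] - Y[i]
    return np.sqrt(np.mean(errors**2, axis=0))

# Synthetic stand-in for nirdata.mat (assumed shapes and noise level)
rng = np.random.default_rng(0)
n_samples, n_wavelengths = 31, 200
wl = np.linspace(0.0, 1.0, n_wavelengths)
pure = np.stack([np.exp(-((wl - c) / 0.08) ** 2) for c in (0.25, 0.5, 0.75)])
concs = rng.uniform([20, 20, 2], [70, 70, 10], size=(n_samples, 3))
spectra = concs @ pure + 0.01 * rng.standard_normal((n_samples, n_wavelengths))

rmse = loo_rmse(spectra, concs, n_pc=3)
print("Leave-one-out RMSE per component:", rmse)
```

The synthetic concentrations are drawn independently, so three principal components are appropriate here; real mixture data may behave differently if the weight fractions are constrained to sum to 100.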
2. Consider the flow network given in Figure 1 below. The flows of streams 6, 9, 13, 15,
16, 20, 25, and 27 are not measured while all the rest are measured.
(a) Classify the measured flows as redundant/non-redundant and the unmeasured flows as
observable/unobservable using graph-theoretic concepts.
(b) A set of measurements for the measured flows is given in Table 1 along with the
standard deviation of errors in the measurements. Determine the reconciled values of the
measured flows and the estimated values of the observable unmeasured flows.
(c) It is suspected that one of the measurements has a bias. Apply the global statistical
test to determine whether this suspicion is justified. If a bias is detected, apply the GLR
test to identify the measurement having the bias.
Figure 1. Flow network for problem 2 (diagram not reproduced).
Table 1. Measured values of streams
Stream    Measured value    Std. dev. of error
1         108.761           2.7488
2         107.5951          2.8068
3         52.5742           1.3102
4         14.9669           0.3715
5         108.0808          2.7818
7         24.3428           0.5910
8         32.6992           0.8182
10        7.9847            0.1988
11        10.451            0.2625
12        98.8535           2.1818
14        49.1856           1.1660
17        74.1563           1.8058
18        0.8913            0.0223
19        0.9976            0.0250
21        53.6923           1.3325
22        2.2908            0.0593
23        166.9796          4.1012
24        0.8845            0.0212
26        61.3074           1.5005
28        82.055            2.1365
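Illustration only (the submission itself must be in MATLAB): the reconciliation, global test, and GLR test of parts (b) and (c) can be sketched in Python/NumPy for a small hypothetical network. The four-node, six-stream network below is an assumption made for this sketch and is not the network of Figure 1. All six streams are assumed measured, so unmeasured flows would first have to be eliminated from the balances before applying these formulas. The value 9.488 is the 95% chi-square critical value for 4 degrees of freedom.

```python
import numpy as np

def reconcile(A, y, sigma):
    """Weighted least-squares reconciliation of y subject to A x = 0."""
    S = np.diag(sigma**2)                   # measurement error covariance
    r = A @ y                               # balance residuals
    V = A @ S @ A.T                         # covariance of the residuals
    x_hat = y - S @ A.T @ np.linalg.solve(V, r)
    gamma = r @ np.linalg.solve(V, r)       # global test statistic, chi-square, rank(A) dof
    return x_hat, r, V, gamma

def glr_bias(A, r, V):
    """GLR statistic and bias estimate, hypothesising a bias in each stream in turn."""
    Vinv_r = np.linalg.solve(V, r)
    Vinv_A = np.linalg.solve(V, A)          # column j is V^-1 f_j with f_j = A e_j
    d = A.T @ Vinv_r                        # f_j' V^-1 r
    c = np.sum(A * Vinv_A, axis=0)          # f_j' V^-1 f_j
    return d**2 / c, d / c

# Hypothetical network: splitter (1 -> 2, 3), two units (2 -> 4, 3 -> 5), mixer (4, 5 -> 6)
A = np.array([[1, -1, -1,  0,  0,  0],
              [0,  1,  0, -1,  0,  0],
              [0,  0,  1,  0, -1,  0],
              [0,  0,  0,  1,  1, -1]], dtype=float)
x_true = np.array([100, 60, 40, 60, 40, 100], dtype=float)
sigma = 0.025 * x_true                      # assumed 2.5% standard deviations

rng = np.random.default_rng(1)
y = x_true + sigma * rng.standard_normal(6)
y[2] += 8.0                                 # simulated bias in stream 3

x_hat, r, V, gamma = reconcile(A, y, sigma)
CHI2_95_4DOF = 9.488
print("Reconciled flows:", x_hat)
print("Global test statistic:", gamma, "bias suspected:", gamma > CHI2_95_4DOF)

T, b = glr_bias(A, r, V)
j = int(np.argmax(T))
print("GLR points to stream", j + 1, "with estimated bias", b[j])
```

Streams whose bias signatures A e_j are parallel cannot be told apart by the GLR test; the hypothetical network above was chosen so that no two signatures are parallel.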
3. For the flow process given in problem 2, a sample of 1000 measurements
(corresponding to the 20 measured variables) has been generated and stored in the file
networkdata.mat. Apply PCA to obtain an estimate of the constraint matrix relating the
measured variables. Justify your selection of the number of PCs and evaluate how good
your estimated constraint matrix is compared to the true constraint matrix (for this
purpose, find the angle between the row spaces of the true and estimated constraint
matrices using the subspace function in MATLAB).
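Illustration only (the submission itself must be in MATLAB): estimating a constraint matrix by PCA and comparing row spaces can be sketched in Python/NumPy. The small six-stream network, the equal noise level on all variables, and the function names are assumptions made for this sketch; with unequal error variances, as in Table 1, the data would normally be scaled by the error standard deviations first. The function largest_principal_angle is intended to mirror what MATLAB's subspace reports when given the transposed matrices.

```python
import numpy as np

def estimate_constraints(X, n_constraints):
    """Rows are the right singular vectors of X with the smallest singular values."""
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[-n_constraints:], s

def largest_principal_angle(A, B):
    """Largest principal angle (radians) between the row spaces of A and B (equal rank)."""
    Qa, _ = np.linalg.qr(A.T)               # orthonormal basis of the row space of A
    Qb, _ = np.linalg.qr(B.T)               # orthonormal basis of the row space of B
    resid = Qb - Qa @ (Qa.T @ Qb)           # part of B's row space lying outside A's
    return float(np.arcsin(min(1.0, np.linalg.norm(resid, 2))))

# Hypothetical network: splitter (1 -> 2, 3), two units (2 -> 4, 3 -> 5), mixer (4, 5 -> 6)
A_true = np.array([[1, -1, -1,  0,  0,  0],
                   [0,  1,  0, -1,  0,  0],
                   [0,  0,  1,  0, -1,  0],
                   [0,  0,  0,  1,  1, -1]], dtype=float)

rng = np.random.default_rng(2)
n = 1000
f2 = 60 + 5 * rng.standard_normal(n)        # two independent flows drive the network
f3 = 40 + 5 * rng.standard_normal(n)
X_true = np.column_stack([f2 + f3, f2, f3, f2, f3, f2 + f3])
X = X_true + 0.5 * rng.standard_normal(X_true.shape)   # equal noise on all variables

# The balances are homogeneous (A x = 0), so the data are not mean-centred here
A_hat, s = estimate_constraints(X, n_constraints=4)
print("Singular values:", s)                # a gap after the second value suggests 2 PCs
angle = largest_principal_angle(A_true, A_hat)
print("Largest principal angle (rad):", angle)
```

The number of retained PCs is read from the gap in the singular values; the remaining directions span the estimated constraint row space.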
4. In order to assess the effectiveness of different approaches for treating missing data in
building PCA models, consider the problem of determining the constraint matrix from the
data set given in problem 3 using PCA. Randomly delete 10% of the data from that
data set and use the following methods to build PCA models (develop your own
MATLAB codes for this purpose).
(a) Impute the missing values using the mean of the remaining measured/estimated
values for each variable and apply PCA iteratively until the estimates and the model
converge.
(b) Impute the missing values using reconciled estimates (assuming that all unmeasured
variables are observable) and iteratively apply PCA until the model and estimates
converge.
(c) Use functional PCA (FPCA) to handle missing data. For this purpose, fit a high-order
polynomial to the sequence of measurements of each variable and apply FPCA to
obtain the model and the estimates.
Compare the quality of the models obtained using the subspace angle as in problem 3 and
provide possible explanations for the results you obtain.
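Illustration only (the submission itself must be in MATLAB): one possible reading of method (a) can be sketched in Python/NumPy. The missing entries start at the column means and are then repeatedly replaced by their PCA reconstruction. The synthetic six-stream data, the tolerance, and the function name iterative_pca_impute are assumptions made for this sketch.

```python
import numpy as np

def iterative_pca_impute(X, n_pc, tol=1e-8, max_iter=500):
    """Fill NaNs with column means, then iterate PCA reconstruction of the missing entries."""
    missing = np.isnan(X)
    Z = np.where(missing, np.nanmean(X, axis=0), X)       # initial mean imputation
    for _ in range(max_iter):
        mu = Z.mean(axis=0)
        _, _, Vt = np.linalg.svd(Z - mu, full_matrices=False)
        P = Vt[:n_pc].T                                   # retained loadings
        recon = (Z - mu) @ P @ P.T + mu                   # rank-n_pc reconstruction
        change = np.max(np.abs(recon[missing] - Z[missing])) if missing.any() else 0.0
        Z[missing] = recon[missing]                       # update only the missing entries
        if change < tol:
            break
    return Z, Vt[n_pc:]                                   # filled data, estimated constraint rows

# Synthetic data from a hypothetical six-stream network (two independent flows)
rng = np.random.default_rng(3)
n = 1000
f2 = 60 + 5 * rng.standard_normal(n)
f3 = 40 + 5 * rng.standard_normal(n)
X_true = np.column_stack([f2 + f3, f2, f3, f2, f3, f2 + f3])
X_full = X_true + 0.5 * rng.standard_normal(X_true.shape)

mask = rng.random(X_full.shape) < 0.10                    # delete about 10% at random
X_miss = X_full.copy()
X_miss[mask] = np.nan

X_filled, A_hat = iterative_pca_impute(X_miss, n_pc=2)

col_means = np.broadcast_to(np.nanmean(X_miss, axis=0), X_full.shape)
rmse_mean = np.sqrt(np.mean((col_means[mask] - X_full[mask]) ** 2))
rmse_iter = np.sqrt(np.mean((X_filled[mask] - X_full[mask]) ** 2))
print("RMSE of mean imputation:     ", rmse_mean)
print("RMSE of iterative PCA fill-in:", rmse_iter)
```

The returned constraint rows can be compared with the true constraint matrix through the subspace angle, in the same way as in problem 3.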
5. The data file speech.mat contains five measured noisy signals, which are linear
mixtures of two source signals that are also included in the data set.
(a) Use ICA to estimate the two sources (you can use the ICALAB toolbox for this purpose,
but indicate in your answers the steps and parameter choices you made), by first using
PCA to remove noise and estimate the source subspace. Determine the correlation
between the estimated sources and the true sources.
(b) Assuming that the sources are non-Gaussian and independent, while the noise sources
are all Gaussian, describe a method that uses ICA only (without PCA
preprocessing) to extract the sources. Apply your method to the same data set and
determine the correlation between the estimated and true sources.
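Illustration only (the submission itself must be in MATLAB, optionally with ICALAB): the PCA-then-ICA route of part (a) can be sketched in Python/NumPy. The code below is a small hand-written symmetric FastICA with a tanh nonlinearity, not the ICALAB implementation. The synthetic Laplacian and uniform sources, the mixing matrix, the noise level, and all function names are assumptions made for this sketch and do not describe speech.mat.

```python
import numpy as np

def pca_whiten(X, n_src):
    """Project centred signals (rows = channels) on the n_src dominant PCs and whiten."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    n = X.shape[1]
    K = (U[:, :n_src] / s[:n_src]).T * np.sqrt(n)    # whitening matrix, (n_src, n_channels)
    return K @ Xc                                    # unit-covariance signals in the subspace

def sym_decorrelate(W):
    """Symmetric decorrelation: W <- (W W')^(-1/2) W."""
    d, E = np.linalg.eigh(W @ W.T)
    return E @ np.diag(1.0 / np.sqrt(d)) @ E.T @ W

def fastica(Z, n_iter=200, tol=1e-10, seed=0):
    """Symmetric FastICA (tanh nonlinearity) on whitened data Z of shape (n_src, n_samples)."""
    rng = np.random.default_rng(seed)
    k, n = Z.shape
    W = sym_decorrelate(rng.standard_normal((k, k)))
    for _ in range(n_iter):
        Y = W @ Z
        G = np.tanh(Y)                               # g(y)
        Gp = 1.0 - G**2                              # g'(y)
        # Fixed-point update: w+ = E[z g(w'z)] - E[g'(w'z)] w, for every row of W
        W_new = sym_decorrelate(G @ Z.T / n - Gp.mean(axis=1)[:, None] * W)
        lim = np.max(np.abs(np.abs(np.sum(W_new * W, axis=1)) - 1.0))
        W = W_new
        if lim < tol:
            break
    return W @ Z

# Synthetic stand-in for speech.mat: two non-Gaussian sources, five noisy mixtures
rng = np.random.default_rng(4)
n = 5000
S = np.vstack([rng.laplace(size=n), rng.uniform(-1, 1, size=n)])
S = (S - S.mean(axis=1, keepdims=True)) / S.std(axis=1, keepdims=True)
A_mix = np.array([[1.0, 0.5], [0.6, 1.0], [0.3, -0.8], [-0.7, 0.4], [0.9, 0.9]])
X = A_mix @ S + 0.1 * rng.standard_normal((5, n))

S_est = fastica(pca_whiten(X, n_src=2))
# Absolute correlations: rows = estimated sources, columns = true sources
C = np.abs(np.corrcoef(np.vstack([S_est, S]))[:2, 2:])
best = C.max(axis=0)
print("Best |correlation| for each true source:", best)
```

ICA recovers sources only up to order, sign, and scale, which is why the comparison uses absolute correlations and takes the best match for each true source.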
ALL THE BEST