Examining a Multivariate Database

- Issues to be examined
- Tools for examining a multivariate database
- The problem of missing data
- The problem of outliers

Examining a Multivariate Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University

Key Concepts

- Dangers of analyzing data without theory or a thorough understanding of the data
- Reliability & validity
- Missing data
- Outliers
- Distributional dynamics of the variables
- Ratio of cases to variables
- Statistical assumptions about the data
- Analytic tools for examining data:
  - Histogram
  - Stem & leaf diagram
  - Scatterplot
  - Box-Whisker plot
  - Bar graph
  - Normal probability plot
  - Cross-tabulation table
  - Descriptive statistics
- Concept of skew:
  - Right or positive skew
  - Left or negative skew
- Concept of kurtosis:
  - Platykurtic
  - Mesokurtic
  - Leptokurtic
- The problem of missing data in multivariate analysis:
  - The impact of eliminating subjects
  - The impact of eliminating variables
- Causes of missing data
- Missing at random (MAR) v. missing completely at random (MCAR)
- Techniques for determining MAR v. MCAR
- Remedies for missing data:
  - Deletion of cases
  - Deletion of variables
  - Imputation
  - Model-based solutions
- Problems with deleting cases
- Problems with deleting variables

Key Concepts (cont.)

- The concept of imputation:
  - Complete case approach
  - All-available approach
- Techniques for imputation:
  - Case substitution
  - Mean/median substitution
  - Cold deck imputation
  - Regression imputation
  - Multiple imputation
- Advantages & disadvantages of different imputation techniques
- Model-based procedures for missing data
- The problem of outliers & fringeliers
- Univariate v. multivariate outliers
- Sources of outliers
- Critical questions about outliers
- Techniques for identifying outliers:
  - Histogram
  - Stem & leaf diagram
  - Scatterplots: 2 or 3 dimensional
  - Box-Whisker plot
  - Trend or time series plot
  - Descriptive statistics
  - Converting data to standard scores
  - Multivariate tools
- Ways of dealing with outliers
- Problems with deleting outliers

Lecture Outline

- Issues to examine
- Tools for examining data
- Problems with missing data
- Problems with outliers

Blind Analysis of Multivariate Data

Blind analysis of a multivariate database without theory and a good understanding of the data is hazardous. Research should be theory-driven, with a thorough understanding of:

- The reliability and validity of the data
- The extent and impact of missing data
- The presence and impact of outliers
- The distributional characteristics of the variables
- The ratio of cases to the number of variables
- Whether the data meet the assumptions of the statistical methods to be used

Examples of Tools for Examining Data

- Histogram
- Stem and Leaf Diagram
- Scatterplot
- Box-Whisker Plot
- Bar Graph
- Normal Probability Plot
- Cross-Tabulation Table
- Descriptive Statistics

Histogram

Useful in determining the symmetry or skew of a metric variable, and in determining the presence of extreme values or outliers.

Example: Distribution of sentences received by 70 felony offenders

Stem and Leaf Diagram

Useful in determining the symmetry or skew of a metric variable, and in determining the presence of extreme values or outliers.

Example: Distribution of sentences received by 70 felony offenders

    Frequency    Stem &  Leaf
      8.00        0 *    11111111
     19.00        0 t    2222222222333333333
     14.00        0 f    44444444555555
     10.00        0 s    6666667777
      7.00        0 .    8888889
      3.00        1 *    001
      2.00        1 t    22
      3.00        1 f    455
      4.00     Extremes  (17), (18), (20), (25)

    Stem width: 10.0
    Each leaf:  1 case(s)

Bivariate Scatterplot

Useful in determining:

- Whether there is a relationship between two metric variables, its direction and relative magnitude
- The presence of bivariate outliers
- Whether the relationship is linear or nonlinear

Example: Scatterplot of sentence and number of prior convictions

Box-Whisker Plot

Useful in determining the symmetry or skew of a metric variable, and in determining the presence of extreme values or outliers.

Example: Distribution of sentences given to 70 felony offenders

Bar Graph

Useful for determining the frequency of cases in the various categories of a nonmetric variable, and as a reference for collapsing categories if necessary.

Example: Distribution of race/ethnicity among 70 offenders

Normal Probability Plot

Useful in determining if a variable is normally distributed.

Example: Sentences received by 70 convicted felons

Since the points are not on the line and "bow" to the right, the distribution of sentences is skewed to the right.

Cross-Tabulation Table

Useful in determining whether there is a relationship between two nonmetric variables, and whether any cells have low frequencies or contain no cases at all.

Example: Cross-classification of race by gender among 70 felons

    Race/Ethnicity      Male   Female   Total
    White                  7       18      25
    African American      13        9      22
    Hispanic              15        8      23
    Total                 35       35      70

Descriptive Statistics

Useful in profiling the central tendency, variability, skew and kurtosis of a metric variable.

Example: Descriptive statistics on the sentences received by 70 felons

    Valid cases: 70.0    Missing cases: .0    Percent missing: .0

    Mean      5.9571    Std Err   .5920    Min     1.0000    Skewness  1.6771
    5% Trim   5.4286    Std Dev  4.9532    Range  24.0000    Kurtosis  3.0632
    95% CI for Mean (4.7761, 7.1382)       IQR     6.0000    S E Kurt   .5663

The Problem of Missing Data

A multivariate database is an N x k matrix (N = subjects, k = variables).

    Subjects   X1   X2   .....   Xk
    S1
    S2
    .....
    Sn

- A complete data set is required to analyze the interrelationships among all the variables.
- If one or more values are missing, the associated subject(s) or variable(s) must be eliminated from the analysis,
- Or the missing data must be imputed (estimated) by some means.

Impact of Missing Data

    Subjects    X1    X2    X3    X4
     1          12     2   253    64
     2          18     5   (?)    85
     3         (?)     6   163    94
     4          22     9   315    77
     5          16   (?)   286    64
     6          28     3   173    83
     7          11     2   311    94
     8          19     4   289    81
     9          25     8   198    69
    10          20     4   274    75

This is a 10 x 4 matrix, 40 data points, with 3 missing values.

If the variables with missing data are eliminated, 75% of the variables (3 of 4) are lost for the analysis.

If subjects with missing data are eliminated, 30% of the subjects (3 of 10) are lost for the analysis.

Eliminating Subjects or Variables

Elimination of Subjects

- Reduces power (1 - β), which may lead to a Type II error
- Reduces df, which may lead to a Type II error
- May reduce the representativeness of the sample
- May affect the external validity of the study
- May result in inaccurate estimates of population variances and covariances

Elimination of Variables

- May result in a specification error
- The model may over-fit the data & not cross-validate
- May produce a larger error term due to unexplained variance in the dependent variable
- May lead to a Type II error
- May reduce the explanatory power of the model

Causes of Missing Data

- Recording error
- Data entry error
- Morbidity of subjects
- Missing record
- Missing data field
- Change in record keeping procedure
- Change in definition of a variable
- Refusal to answer a survey question
- Ignorance of the meaning of a survey question
- Agency disclosure policy
- Survey response alternatives not applicable
- Computer crash

Types of Missing Data

Missing At Random (MAR)
The pattern of the missing values in a variable (Y) is related to the values, or to the pattern of missing values, in one or more other variables (Xk).

Missing Completely at Random (MCAR)
The pattern of the missing values in a variable (Y) is not related to the values, or to the pattern of missing values, in one or more other variables (Xk).

Diagnosing the Pattern of the Missing Data: MAR v. MCAR

Technique 1: for metric variables

- For the variable with the missing data, create two groups of subjects:
  - Group 0 = Subjects with missing data
  - Group 1 = Subjects with complete data
- Conduct a t-test to see if the groups differ significantly on the other variables in the database, assuming they are metric.

Technique 2: for nonmetric variables

- For the variable with the missing data, create a dummy variable with two groups of subjects:
  - Group 0 = Subjects with missing data
  - Group 1 = Subjects with complete data
- Conduct a chi-square test to see if there is any association between the dummy variable and other nonmetric variables.

An Example of a Multivariate Database with Missing Data
(Missing data are shown as --)

    Sentence   Prior Convictions   Drug Score
        3              2                1
        1              0               --
        2             --                5
        3              1                7
        5              0                4
        1              1               --
        1              2               --
        2             --                1
        4              2                3
        2              1               --
        8              3                8
       10              4                7
       10              1                4
       20             --                9
       14              3                2
       14              2                5
        7              4                7
       23             --                6
       12              0                8
       15              3                6

- Prior convictions has 4 missing values
- Drug score has 4 missing values

Is the pattern of missing values in either of these two variables related to sentence? Is the pattern MAR or MCAR?

Creating Dummy Variables to Represent the Pattern of Missing Data
(0 = data missing, 1 = data not missing)

    Sentence   Prior Convictions   Missing Priors   Drug Score   Missing Drug Score
        3              2                 1               1               1
        1              0                 1              --               0
        2             --                 0               5               1
        3              1                 1               7               1
        5              0                 1               4               1
        1              1                 1              --               0
        1              2                 1              --               0
        2             --                 0               1               1
        4              2                 1               3               1
        2              1                 1              --               0
        8              3                 1               8               1
       10              4                 1               7               1
       10              1                 1               4               1
       20             --                 0               9               1
       14              3                 1               2               1
       14              2                 1               5               1
        7              4                 1               7               1
       23             --                 0               6               1
       12              0                 1               8               1
       15              3                 1               6               1

Is the Pattern of Missing Data in the Variable Priors Related to the Variable Sentence?

Step 1: Using the dummy variable "missing priors", compute the average sentence for the subjects coded 0 and those coded 1.

    Group                        Mean Sentence
    Missing data group (0)       11.75 years
    Not missing data group (1)    6.88 years

Step 2: Run a t-test on the difference between the means of the two groups.

    t = 1.33, df = 18, p = 0.1986

Since the difference between means is not significant, the missing data process is MCAR. The process is not related to the length of sentence.

Is the Pattern of Missing Data in the Variable Drug Score Related to the Variable Sentence?

Step 1: Using the dummy variable "missing drug score", compute the average sentence for the subjects coded 0 and those coded 1.

    Group                        Mean Sentence
    Missing data group (0)        1.25 years
    Not missing data group (1)    9.50 years

Step 2: Run a t-test on the difference between the means of the two groups.

    t = 2.50, df = 18, p = 0.022

Since the difference between means is significant, the missing data process is MAR. The process that produced the missing data is related to the length of sentence.

Is the Pattern of Missing Data in the Variable Drug Score Related to the Pattern of Missing Data in the Variable Priors?

Step 1: Using the dummy variables "missing drug score" and "missing priors", construct a 2 x 2 cross-tabulation table.

                              Priors
    Drug Score         Missing (0)   Not missing (1)
    Missing (0)             0               4
    Not missing (1)         4              12

Step 2: Run a chi-square test on the cell frequencies. Since one cell has zero frequency, run Fisher's exact probability test as well.

    χ² = 1.25, df = 1, p = 0.264
    Fisher's p = 0.538

Since the results are not significant, the missing data process is MCAR.

Remedies for Missing Data

- Delete the cases with missing data
- Delete the variables with missing data
- Imputation of the missing values
- Model-based procedures

Case Deletion

Probably the most commonly used method. Depending upon the number of cases deleted, the deletion of cases:

- May reduce the power (1 - β) of the subsequent statistical tests, which may lead to a Type II error
- Will reduce the df of subsequent statistical tests, which may lead to a Type II error
- May reduce the representativeness of the sample, reducing the external validity of the study
- If the missing data process is MAR, may lead to incorrect generalizations of the results
- May bias the estimates of the variables' population variances and covariances, resulting in biased estimates of the statistical model's parameters and their associated standard errors

Variable Deletion

A poor strategy if the purpose of the study is multivariate in nature.

Deletion of one or more variables may:

- Result in a specification error in the model
- Result in a model that over-fits the data and does not cross-validate
- Produce too large an error term due to the unexplained variance in the dependent variable
- Lead to a Type II error
- Reduce the explanatory power of the model

Imputation of Missing Values

Imputation refers to estimating the missing values in one variable from the relationship between that variable and other variables in the database.

Complete Case Approach
Uses only cases with complete data across all the variables in the database (called casewise approach in SPSS).

All-Available Approach
Uses any cases with complete data on a pair of variables (called pairwise approach in SPSS).

Imputation Techniques

- Case substitution
- Mean/median substitution
- Cold deck imputation
- Regression imputation
- Multiple imputation

Case Substitution

- Identify the case with missing data
- Find the case in the database that is most similar to the case with the missing data
- Impute the missing values from the corresponding values of the case with complete data

If this procedure is used on too many cases it may:

- Reduce the external validity of the study
- Result in misrepresentation of the population variances and covariances
- Produce biased estimates of the model's parameters and associated standard errors

Mean Substitution

- Identify the variable with missing data
- Compute some measure of central tendency of the variable, e.g. arithmetic mean or median
- Substitute the average of the variable for the missing values

If a variable has too many missing values:

- The average will be a biased estimate of the true average
- The population variance will be underestimated
- The relationship between the variable with the missing data and other variables in the database will be underestimated, risking a Type II error

Cold Deck Imputation

Substitute an estimate of the missing value from an external source:

- From a pilot study
- From a similar research study found in the literature
- From expert opinion or judgment

Disadvantages:

- An external source for the missing data may not be available
- In other ways the disadvantages are comparable to those associated with mean/median substitution

Regression Imputation

If the variables in the database are highly interrelated, then it may be possible to estimate the missing values by making the variable with the missing values (x_m) a dependent variable, and regressing it on the other variables (x_1 ... x_k) in the database:

    x_m = a + b_1 x_1 + b_2 x_2 + ... + b_k x_k

By substituting the known values of the case with missing data into the model, the missing value can be estimated.

The efficiency of this technique depends upon:

- The extent of the missing data
- The magnitude of the relationship between x_m and the other variables used in the regression model

Caveats in Regression Imputation

- The interrelationship among the variables used in the model must be high to produce accurate estimates of the missing values
- If the number of missing values in the imputed variable is large:
  - The imputation will reinforce sample-specific relationships which will not cross-validate
  - The population variances & covariances will be underestimated
- Problems may be encountered if the imputed variable is an independent variable, since regression assumes no collinearity
- The imputed estimates may go beyond the bounds of a possible value, since regression analysis is not constrained by units of measurement

Multiple Imputation

- Involves the use of several different imputation techniques to produce a set of estimated values for the missing value.
- The different estimates are then averaged to derive the imputed value.
- The goal is to derive an estimate of the missing value(s) by using several different techniques and averaging the various estimates, hopefully canceling out or offsetting the disadvantages of the different techniques employed.

Caveats in Multiple Imputation

- Multiple imputation may or may not cancel out the disadvantages of the techniques employed.
- The success of the technique will depend upon the peculiarities of the database.
- It may in fact compound the disadvantages of the techniques used.
- Using the average of various estimates may reinforce relationships peculiar to the sample, which may not cross-validate.
- As with other techniques, the more missing values imputed, the less reliable the imputed values.

Model-Based Imputation

This involves a variety of techniques that either:

- Incorporate the missing data process into the analysis as a separate variable, to assess the amount of variance accounted for by the missing data
- Or use maximum likelihood estimation to model the missing data process and, based upon the results, make the most accurate estimates of the missing values

Outliers & Fringeliers

- An outlier is an extreme value or case.
- A fringelier is a marginally extreme value or case.
- Such values can significantly affect and distort multivariate analysis, leading to:
  - Type I errors
  - Type II errors
  - Underestimation of significant findings
  - Reversal of results
- Of particular concern is the fact that a case can be a multivariate outlier while not being a univariate outlier on any individual variable.

Sources of Outliers

- Recording error or data entry error
- An unusual event which causes a one-time change in a variable
- The beginning of a new phenomenon with few of the cases represented in the database
- Short-term change in the way the variable is defined
- Differences in the way agencies or jurisdictions define a variable
- "Apparent" outliers resulting from a sample that is too small

[Figure: A case that is not an outlier on each individual variable, but is an outlier across several variables, i.e. a multivariate outlier]

Critical Questions About Outliers

- How extreme must a case be to be an outlier?
- Is the apparent outlier an error or a reliable value?
- Are there cases that are univariate "reasonable" yet multivariate outliers?
- What impact might the outlier have on the analysis of the data?
- Is the outlier part of an extreme trend for which there are few cases in the database, or is it simply a very exceptional case?
- How will the outliers be dealt with?

Ways to Identify Outliers

- Histogram
- Stem & leaf diagram
- Scatterplot: 2 or 3 dimensions
- Box-Whisker plot
- Trend or time series plot
- Descriptive statistics:
  - Mean v. median
  - Minimum & maximum values
  - Skew & kurtosis
  - Interquartile range & standard deviation
- Convert data to standard scores (Z) & examine cases where Z is beyond ±1.96
- Multivariate tools

Identifying Outliers with a Histogram

Example: Years served in prison by paroled felons

[Histogram, with the outliers marked]

Identifying Outliers with a Stem & Leaf Diagram

Example: Years served in prison by paroled felons

    Frequency    Stem &  Leaf
     16.00        0 *    0000001111111111
     20.00        0 t    22222222222233333333
     16.00        0 f    4444444455555555
      9.00        0 s    666667777
      2.00        0 .    89
      3.00        1 *    001
      4.00     Extremes  (12), (15), (16), (18)    <- Outliers

Identifying Bivariate Outliers with a Scatterplot

Example: Years served in prison as a function of length of sentence

[Scatterplot, with one case marked "Outlier?"]

Identifying Outliers with a Box-Whisker Plot

Example: Years served in prison by paroled felons

[Box-whisker plot, with the outliers marked]

Identifying Outliers in a Time Series Plot

Example: Number of arrested parolees in a county jail

[Time series plot: number of jailed parolees (0 to 1,200) by year, 1995 to 2001]

Some change in policy in 2/97 caused a substantial and permanent change in the number of jailed parolees.

Identifying Outliers with Descriptive Statistics

Example: Years served in prison by paroled felons

    Valid cases: 70.0    Missing cases: .0    Percent missing: .0

    Mean      4.6786    Std Err    .4383    Min      .4000    Skewness  1.6584
    Median    3.7500    Variance 13.4455    Max    18.2000    S E Skew   .2868
    5% Trim   4.2929    Std Dev   3.6668    Range  17.8000    Kurtosis  3.1904
    95% CI for Mean (3.8043, 5.5529)        IQR     4.0500    S E Kurt   .5663

- The mean > median, therefore the distribution is skewed right
- The skew (1.658) is positive
- The most extreme value (18.2 years) may be an outlier

Identifying Outliers by Converting the Data to Standard Scores (Z)

Example: Years served in prison by paroled felons (Mean = 4.6786 years, S = 3.6668)

    Time Served   Z Score
        7.3        +0.71
        5.2        +0.14
       11.3        +1.81
        …            …
        8.6        +1.07
       12.2        +2.05
       16.3        +3.17
       14.6        +2.71
       18.2        +3.69
        1.5        -0.87
        …            …
        2.7        -0.54

Cases with a Z score beyond ±1.96 may be considered outliers.

Multivariate Outliers

    Subject   X1   X2   X3
       1       1    3   18
       2       7   14   10
       3       3    5   17
       4       4    4   19
       5      15   20    3
       6      11   16    2
       7       2   18    2
       8      10   12    4
       9       5    7   18
      10       8    9    9

One of the 10 cases above is a multivariate outlier: a case which may appear univariate "reasonable", but which is extreme relative to the interrelationship among all three variables.

Multivariate Outliers (cont.)

[Three-dimensional scatterplot of X1, X2 and X3]

Case 7 is the multivariate outlier. Relative to the three variables, its values are X1 = 2, X2 = 18, X3 = 2.

Is Case 7 a Univariate Outlier?

For variable X1, Case 7 = 2. It is not an outlier.

    Frequency    Stem &  Leaf
      4.00        0 *    1234
      3.00        0 .    578
      2.00        1 *    01
      1.00        1 .    5

[Box-whisker plot of X1, N = 10]

Is Case 7 a Univariate Outlier? (cont.)

For variable X2, Case 7 = 18. It is not an outlier.

    Frequency    Stem &  Leaf
      2.00        0 *    34
      3.00        0 .    579
      2.00        1 *    24
      2.00        1 .    68
      1.00        2 *    0

[Box-whisker plot of X2, N = 10]

Is Case 7 a Univariate Outlier? (cont.)

For variable X3, Case 7 = 2. It is not an outlier.

    Frequency    Stem &  Leaf
      4.00        0 *    2234
      1.00        0 .    9
      1.00        1 *    0
      4.00        1 .    7889

[Box-whisker plot of X3, N = 10]

Is Case 7 a Bivariate Outlier with Respect to X1 and X2?

[Scatterplot of X1 and X2, with Case 7 marked]

Case 7: X1 = 2 and X2 = 18. In this bivariate relationship, Case 7 is an outlier.

Is Case 7 a Bivariate Outlier with Respect to X1 and X3?

[Scatterplot of X1 and X3, with Case 7 marked]

Case 7: X1 = 2 and X3 = 2. In this bivariate relationship, Case 7 is an outlier.

Is Case 7 a Bivariate Outlier with Respect to X2 and X3?

[Scatterplot of X2 and X3, with Case 7 marked]

Case 7: X2 = 18 and X3 = 2. In this bivariate relationship, Case 7 is not an outlier.

Identifying Multivariate Outliers among More than Three Variables

Graphical techniques cannot be used to identify multivariate outliers when more than three variables are involved. In this case:

- The model is estimated
- Predictions (Y') are made with the model using the original data
- Then the prediction errors (Y' - Y), called residuals, are plotted against the predictions (Y')
- Likely multivariate outliers are identified in the resulting scatterplot

Example: Sentence as a function of age, prior convictions, and drug dependency

    Sentence = -17.42 + 0.9 age + 0.28 drugs + 0.5 priors

Identifying Multivariate Outliers among More than Three Variables (cont.)

[Plot of the residuals (Y' - Y) against the predictions (Y'), with possible multivariate outliers marked]

What To Do With Outliers?

There is no "silver bullet". It is a matter of judgment.

- If the outlier is an error, correct it
- Analyze the data with and without the outlier and see if it makes a difference
- Transform the data to reduce the influence of the outlier or skew in the data, assuming that the problem is due to sampling error
- Increase the sample size if the "apparent" outlier resulted from too small a sample
- Use a parameter estimating algorithm that is less sensitive to outliers (maximum likelihood estimation v. OLS)
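
The standard-score screen from the slide on converting data to Z scores can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mean (4.6786) and standard deviation (3.6668) are the summary values reported in the deck, the ten cases are the ones the deck's table shows (the elided rows are not reconstructed), and the names `z_score` and `flag_outliers` are hypothetical helpers, not part of the lecture.

```python
# Summary statistics reported in the deck for years served (N = 70).
MEAN_YEARS = 4.6786
STD_YEARS = 3.6668
Z_CUTOFF = 1.96  # cutoff used in the deck; other cutoffs are a judgment call

# Only the cases listed in the deck's table; the remaining cases are elided there.
time_served = [7.3, 5.2, 11.3, 8.6, 12.2, 16.3, 14.6, 18.2, 1.5, 2.7]


def z_score(value, mean=MEAN_YEARS, std=STD_YEARS):
    """Convert a raw value to a standard score."""
    return (value - mean) / std


def flag_outliers(values, cutoff=Z_CUTOFF):
    """Return (value, z) pairs whose z lies beyond +/- cutoff."""
    scored = [(v, round(z_score(v), 2)) for v in values]
    return [(v, z) for v, z in scored if abs(z) > cutoff]


flagged = flag_outliers(time_served)
for value, z in flagged:
    print(f"{value:5.1f} years  Z = {z:+.2f}")
```

Flagging a case this way only marks it for inspection. Whether it is an error, a legitimate extreme value, or part of a trend remains the judgment call described above.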

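Returning to the missing-data diagnostics (MAR v. MCAR) earlier in the deck, the dummy-variable procedure can be sketched with the Python standard library. This is a minimal sketch, not the author's computation: the 20 cases are transcribed from the deck's example table with `None` marking a missing value, the helper names (`pooled_t`, `crosstab`, `chi_square`, `fisher_two_sided`) are hypothetical, and the t statistic assumes the pooled-variance form, which is consistent with the df = 18 reported in the deck. P-values for t are not computed here because the standard library has no t distribution.

```python
import math
from statistics import mean

# (sentence, prior convictions, drug score); None marks a missing value.
cases = [
    (3, 2, 1), (1, 0, None), (2, None, 5), (3, 1, 7), (5, 0, 4),
    (1, 1, None), (1, 2, None), (2, None, 1), (4, 2, 3), (2, 1, None),
    (8, 3, 8), (10, 4, 7), (10, 1, 4), (20, None, 9), (14, 3, 2),
    (14, 2, 5), (7, 4, 7), (23, None, 6), (12, 0, 8), (15, 3, 6),
]
sentence = [c[0] for c in cases]
# Dummy variables, coded as in the deck: 0 = data missing, 1 = data not missing.
miss_priors = [0 if c[1] is None else 1 for c in cases]
miss_drug = [0 if c[2] is None else 1 for c in cases]


def pooled_t(values, dummy):
    """Pooled-variance t for mean(group 0) - mean(group 1); returns (t, df)."""
    g0 = [v for v, d in zip(values, dummy) if d == 0]
    g1 = [v for v, d in zip(values, dummy) if d == 1]
    ss0 = sum((v - mean(g0)) ** 2 for v in g0)
    ss1 = sum((v - mean(g1)) ** 2 for v in g1)
    df = len(g0) + len(g1) - 2
    se = math.sqrt((ss0 + ss1) / df * (1 / len(g0) + 1 / len(g1)))
    return (mean(g0) - mean(g1)) / se, df


def crosstab(a, b):
    """2 x 2 table of counts; rows follow a (0, 1), columns follow b (0, 1)."""
    table = [[0, 0], [0, 0]]
    for i, j in zip(a, b):
        table[i][j] += 1
    return table


def chi_square(table):
    """Pearson chi-square statistic for a 2 x 2 table (no continuity correction)."""
    n = sum(map(sum, table))
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            total += (table[i][j] - expected) ** 2 / expected
    return total


def fisher_two_sided(table):
    """Two-sided Fisher exact p: sum of table probabilities <= the observed one."""
    r0, r1 = sum(table[0]), sum(table[1])
    c0 = table[0][0] + table[1][0]

    def prob(a):
        return math.comb(r0, a) * math.comb(r1, c0 - a) / math.comb(r0 + r1, c0)

    observed = prob(table[0][0])
    lo, hi = max(0, c0 - r1), min(r0, c0)
    return sum(prob(a) for a in range(lo, hi + 1) if prob(a) <= observed * (1 + 1e-9))


t_priors, df_priors = pooled_t(sentence, miss_priors)
t_drug, df_drug = pooled_t(sentence, miss_drug)
table = crosstab(miss_drug, miss_priors)
print(f"priors: t = {t_priors:.2f}, df = {df_priors}")
print(f"drug score: t = {t_drug:.2f}, df = {df_drug}")
print(f"chi-square = {chi_square(table):.2f}, Fisher p = {fisher_two_sided(table):.3f}")
```

The sign of t depends only on which group is subtracted from which; the deck reports the magnitude. In practice a statistics package would supply the p-values directly, so the hand-rolled functions here are for showing the arithmetic rather than for production use.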