
The very short talk on outliers


• Outlier: an observation not explained by the model we're using
• Outliers introduce bias (some estimators are more sensitive to outliers than others)
• Many possible sources - noise in the observation process, differences between the model and the real-world environment, etc.

• Removing outliers is statistically questionable, but, since it improves results, we do it anyway

Typeset by FoilTEX


1-D methods

• 1-D is easy(-ish)
• Many problems can be recast as a search for 1-D outliers
• Most classical approaches work by assuming we deal with normal distributions (everything's normal, after all)

• Many, many methods to choose from



A few examples

• Interquartile range-based fences

  Q25 = 25th percentile, Q75 = 75th percentile
  IQR = Q75 - Q25
  Mild outlier: x < Q25 - 1.5*IQR or x > Q75 + 1.5*IQR
  Extreme outlier: x < Q25 - 3*IQR or x > Q75 + 3*IQR

• Chauvenet's criterion

  - Calculate µ and σ of the data
  - Calculate p = P(x) assuming the data is ∼ N(µ, σ)
  - If p < 1/(2n), reject x as an outlier
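Both rules can be sketched in a few lines (a minimal example assuming NumPy and SciPy are available; the data set is made up for illustration):

```python
import numpy as np
from scipy.stats import norm

def iqr_fences(x, k=1.5):
    """(low, high) fences; k=1.5 gives the mild fences, k=3 the extreme ones."""
    q25, q75 = np.percentile(x, [25, 75])
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr

def chauvenet_mask(x):
    """True where Chauvenet's criterion rejects the point: the two-sided
    tail probability under N(mu, sigma) is below 1/(2n)."""
    x = np.asarray(x, dtype=float)
    mu, sigma = x.mean(), x.std()
    p = 2 * norm.sf(np.abs(x - mu) / sigma)  # P(|X - mu| >= |x_i - mu|)
    return p < 1.0 / (2 * len(x))

data = np.array([9.8, 10.1, 9.9, 10.0, 10.2, 9.7, 25.0])
low, high = iqr_fences(data)      # 25.0 falls outside the fences
rejected = chauvenet_mask(data)   # and is rejected by Chauvenet
```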

• Mahalanobis Distance-based approaches
• Kurtosis (good for single outliers in normal data)
• Minimum Variance approaches
• Extreme Deviate Removal (and related approaches)
• etc.



A few issues

• Many methods don't deal well with groups of outliers. Large numbers of outliers also cause methods to fail

• Assumptions about normality are not well suited to non-normal data



Higher dimensions

• Can sometimes extend 1-D approaches, but not obvious for some distributions
• Some approaches extend directly, but acquire additional issues - e.g. Mahalanobis distance is (x − µ)^T Σ^{-1} (x − µ), but becomes more sensitive to masking
• Robust approaches tend to be slow - Minimum Covariance Determinant or EDR expensive for more than a few hundred points

• Difficulties arise as the number of outliers → N/2
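The masking effect can be shown numerically: a single gross outlier has a huge Mahalanobis distance, but a group of identical outliers inflates the sample mean and covariance enough to hide each of its members (a sketch with made-up data; the 13.8 cutoff is roughly the 99.9th percentile of chi-squared with 2 degrees of freedom):

```python
import numpy as np

def mahalanobis_sq(X):
    """Squared Mahalanobis distance of each row from the sample mean,
    using the sample covariance of X itself (the non-robust estimate)."""
    d = X - X.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return np.einsum('ij,jk,ik->i', d, cov_inv, d)

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(50, 2))

# A lone outlier at (10, 10) is flagged easily.
one = np.vstack([clean, np.full((1, 2), 10.0)])

# Ten identical outliers drag the mean and covariance toward
# themselves, masking each other: each now looks unremarkable.
ten = np.vstack([clean, np.full((10, 2), 10.0)])
```

Robust estimators such as Minimum Covariance Determinant exist precisely to break this feedback loop, at the computational cost noted above.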




RANSAC

• RANdom SAmple Consensus
• Assumption 1: We have some model of the data, with unknown parameters
• Assumption 2: We estimate the parameters from comparatively few samples
• Algorithm is roughly:

  - Select sample
  - Calculate model
  - Test all points against model
  - If enough fit the model, accept all points that fit the model as inliers
    * Re-estimate model from inliers
    * Calculate error of inliers against new model
  - Repeat process until a suitable stopping condition is met, and return the model with the lowest error

• Stopping condition is usually a maximum number of iterations, although sometimes people use the current best error to allow early exit when a good fit is found
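The steps above can be sketched for a toy problem, fitting a line y = a·x + b to data with gross outliers (a minimal illustration; the helper name and all parameter values are made up and problem-specific):

```python
import numpy as np

def ransac_line(x, y, n_iters=200, inlier_tol=0.5, min_inliers=50,
                rng=None):
    """Minimal RANSAC sketch for y = a*x + b.
    Returns the (a, b) with the lowest inlier error, or None."""
    rng = rng or np.random.default_rng(0)
    best, best_err = None, np.inf
    for _ in range(n_iters):
        # select a minimal sample and calculate the model from it
        i, j = rng.choice(len(x), size=2, replace=False)
        a = (y[j] - y[i]) / (x[j] - x[i])
        b = y[i] - a * x[i]
        # test all points against the model
        inliers = np.abs(y - (a * x + b)) < inlier_tol
        if inliers.sum() >= min_inliers:
            # re-estimate from the inliers and score the refit
            a, b = np.polyfit(x[inliers], y[inliers], deg=1)
            err = np.mean((y[inliers] - (a * x[inliers] + b)) ** 2)
            if err < best_err:
                best, best_err = (a, b), err
    return best

# line y = 2x + 1, with 30 of 100 points grossly corrupted
rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 100)
y = 2 * x + 1 + rng.normal(0, 0.1, 100)
y[:30] += rng.uniform(5, 20, 30)
a, b = ransac_line(x, y)   # recovers a close to 2, b close to 1
```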




• Very simple to implement
• Theoretically capable of handling N/2 − 1 outliers
• Cost mainly dependent on the number of iterations and the number of points used to estimate the model
• Based on finding an inlier subset, so gives a robust result even with quite sensitive estimators
• Model is arbitrary, so no concerns about non-normal data

• Does require enough knowledge of the problem to set relevant parameters (number of iterations, number of elements to use as inliers, model to use, etc.)
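One of those parameters, the iteration count, can at least be derived rather than guessed: if w is the (estimated) inlier fraction and s the sample size, the probability that N samples contain at least one all-inlier sample is 1 − (1 − w^s)^N, so one picks the smallest N making this at least p. A small sketch:

```python
import math

def ransac_iterations(p, w, s):
    """Smallest N such that at least one of N minimal samples is
    all-inlier with probability >= p; w = inlier fraction, s = sample size."""
    return math.ceil(math.log(1.0 - p) / math.log(1.0 - w ** s))

n = ransac_iterations(p=0.99, w=0.5, s=2)   # 17 iterations suffice
```

Note that w must itself be estimated, which is part of the problem knowledge the bullet above refers to.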


