
The very short talk on outliers


• Outlier: An observation not explained by the model we're using
• Introduce bias (some estimators more sensitive to outliers than others)
• Many possible sources: noise in the observation process, differences between model and real world environment, etc.
• Removing outliers is statistically questionable, but, since it improves results, we do it anyway

 Typeset by FoilTEX 


1-D methods

• 1-D is easy(-ish)
• Many problems can be recast as a search for 1-D outliers
• Most classical approaches work by assuming we deal with normal distributions (everything's normal, after all)

• Many, many methods to choose from


A few examples

• Interquartile range-based fences
  - Q25 = 25th percentile, Q75 = 75th percentile
  - IQR = Q75 − Q25
  - Mild outlier: x < Q25 − 1.5*IQR or x > Q75 + 1.5*IQR
  - Extreme outlier: x < Q25 − 3*IQR or x > Q75 + 3*IQR

• Chauvenet's criterion
  - Calculate µ and σ of data
  - Calculate p = P(x), assuming data is ∼ N(µ, σ)
  - If p < 1/(2n), reject as an outlier

• Mahalanobis Distance based approaches
• Kurtosis (good for single outliers in normal data)
• Minimum Variance Approaches
• Extreme Deviate Removal (and related approaches)
• etc.
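
As a rough sketch of the first example above, the interquartile fences can be computed with Python's standard library. The function name `iqr_fences` and the sample values are made up for illustration, and `method="inclusive"` is only one of several percentile conventions (others shift the fences slightly). The sketch reports each value in one class only, so extreme outliers are not also listed as mild.

```python
import statistics

def iqr_fences(data):
    """Split values into mild and extreme outliers using IQR fences."""
    # quantiles(n=4) returns the three quartile cut points (Q25, median, Q75).
    q25, _, q75 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q75 - q25
    mild, extreme = [], []
    for x in data:
        if x < q25 - 3 * iqr or x > q75 + 3 * iqr:
            extreme.append(x)          # outside the outer (3*IQR) fences
        elif x < q25 - 1.5 * iqr or x > q75 + 1.5 * iqr:
            mild.append(x)             # outside the inner (1.5*IQR) fences only
    return mild, extreme

# Hypothetical sample: a tight cluster plus two suspicious values.
sample = [9, 10, 10, 11, 11, 12, 12, 13, 18, 40]
mild, extreme = iqr_fences(sample)
print(mild, extreme)
```

Because the fences are built from quartiles rather than from µ and σ, a few gross errors barely move them.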


A few issues

• Many methods don't deal well with groups of outliers. Large numbers of outliers also cause methods to fail

• Assumptions about normality not well suited to non-normal data
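
To make the first issue concrete, here is a minimal sketch using Chauvenet's criterion from the earlier slide. It assumes p is the two-sided tail probability of a deviation at least as large as that of x (the slide only writes P(x)); the function name and data are invented for illustration. A single gross error is rejected, but a group of three inflates µ and σ enough that nothing is flagged.

```python
import statistics

def chauvenet_outliers(data):
    """Return the values rejected by Chauvenet's criterion (p < 1/(2n))."""
    n = len(data)
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)     # sample standard deviation
    dist = statistics.NormalDist(mu, sigma)
    flagged = []
    for x in data:
        # Assumed two-sided tail: P(|X - mu| >= |x - mu|) under N(mu, sigma).
        p = 2 * dist.cdf(mu - abs(x - mu))
        if p < 1 / (2 * n):
            flagged.append(x)
    return flagged

# Hypothetical data: the same cluster with one, then three, gross errors.
single = [10, 10, 11, 9, 10, 11, 9, 10, 10, 30]
group = [10, 10, 11, 9, 10, 11, 9, 30, 30, 30]
print(chauvenet_outliers(single))
print(chauvenet_outliers(group))
```

In the second data set the three large values hide one another, since they are part of the µ and σ estimates they are tested against.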


Higher dimensions

• Can sometimes extend 1-D approaches, but not obvious for some distributions
• Some approaches extend directly, but acquire additional issues, e.g. the squared Mahalanobis distance is (x − µ)ᵀ Σ⁻¹ (x − µ), but becomes more sensitive to masking
• Robust approaches tend to be slow: Minimum Covariance Determinant or EDR expensive for more than a few hundred points
• Difficulties arise as number of outliers → N/2
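
For illustration, the squared Mahalanobis distance (the quadratic form in the inverse covariance) can be sketched in plain Python for the 2-D case. This is a non-robust version under stated assumptions: µ and Σ are the ordinary sample mean and sample covariance of the same data being tested, which is exactly what makes the approach vulnerable to masking. The function name and points are hypothetical.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # The inverse of the 2x2 covariance is [[syy, -sxy], [-sxy, sxx]] / det.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances

# Hypothetical data: points near a line, plus one that breaks the correlation.
points = [(1, 1), (2, 2.1), (3, 2.9), (4, 4.2), (5, 4.8), (6, 6.1), (3, 6)]
d2 = mahalanobis_sq_2d(points)
worst = d2.index(max(d2))
print(points[worst])
```

The last point is unremarkable in each coordinate taken alone; only the covariance-aware distance singles it out.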


RANSAC

• RANdom SAmple Consensus
• Assumption 1: We have some model of the data, with unknown parameters
• Assumption 2: We can estimate the parameters from comparatively few samples
• Algorithm is roughly:
  - Select sample
  - Calculate model
  - Test all points against model
  - If enough fit the model, accept all points that fit the model as inliers
    ∗ Re-estimate model from inliers
    ∗ Calculate error of inliers against new model
  - Repeat process until suitable stopping condition is met, and return model with the lowest error
• Stopping condition is usually maximum number of iterations, although sometimes people use current best error to allow early exit when a good fit is found
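
The steps above can be sketched for a straight-line model y = a*x + b. This is a minimal illustration, not a reference implementation: the least-squares helper, the parameter names (`iterations`, `threshold`, `min_inliers`, `seed`) and their defaults are all assumptions chosen for the toy data, and the only stopping condition is a fixed iteration count.

```python
import random

def fit_line(points):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    a = sxy / sxx
    return a, my - a * mx

def ransac_line(points, iterations=100, threshold=0.5, min_inliers=6, seed=0):
    rng = random.Random(seed)
    best_model, best_error, best_inliers = None, float("inf"), []
    for _ in range(iterations):
        sample = rng.sample(points, 2)               # select sample
        if sample[0][0] == sample[1][0]:
            continue                                 # vertical pair: no y = a*x + b fit
        a, b = fit_line(sample)                      # calculate model
        inliers = [(x, y) for x, y in points         # test all points against model
                   if abs(y - (a * x + b)) <= threshold]
        if len(inliers) >= min_inliers:              # enough points fit the model
            a, b = fit_line(inliers)                 # re-estimate model from inliers
            error = sum((y - (a * x + b)) ** 2 for x, y in inliers) / len(inliers)
            if error < best_error:                   # keep the lowest-error model
                best_model, best_error, best_inliers = (a, b), error, inliers
    return best_model, best_inliers

# Hypothetical data: ten points on y = 2x + 1 plus three gross outliers.
points = [(x, 2 * x + 1) for x in range(10)] + [(2, 20), (5, -10), (7, 30)]
model, inliers = ransac_line(points)
print(model, len(inliers))
```

An early exit on a sufficiently small `best_error` could be added inside the loop; it is left out here to keep the sketch short.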


Comments

• Very simple to implement
• Theoretically capable of handling N/2 − 1 outliers
• Cost mainly dependent on number of iterations and number of points used to estimate the model
• Based on finding inlier subset, so gives robust result even with quite sensitive estimators
• Model is arbitrary, so no concerns about non-normal data
• Does require enough knowledge of the problem to set relevant parameters (number of iterations, number of elements to use as inliers, model to use, etc.)
