Detection of Outliers
TNM033 - Data Mining

by Anton Auoja, Albert Backenhof & Mikael Dalkvist
   Holy Outliers, Batman!!


“An outlying observation, or outlier, is one that
appears to deviate markedly from other
members of the sample in which it occurs.”
- Frank E. Grubbs
    Holy Causes, Batman!!

Apparatus malfunction.

Fraudulent behavior.

Human error.

Natural deviations.

Contamination.
Holy Applications, Batman!!

Fraud Detection

Medicine

Public Health

Sports statistics

Detecting measurement errors
    Holy WEKA, Batman!!


Interquartile Range

One Class Classifier

DBScan
  Holy Common Methods,
         Batman!!

Statistical

Distance

Kernel

High Dimensional
Holy Statistical Methods,
        Batman!!
An outlier is an object with
low probability with respect to
the probability distribution
model of the data.

Model Based.

Assume Gaussian distribution.
Calculate the mean and
standard deviation of the data.
The probability of each object
under the distribution can
then be calculated.
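
The model-based steps above can be sketched in Python as follows. This is a minimal illustration, assuming one-dimensional data; the function name `gaussian_outliers` and the cutoff parameter `z_threshold` are hypothetical and do not come from the slides.

```python
import math
import statistics

def gaussian_outliers(values, z_threshold=3.0):
    """Flag values that are unlikely under a Gaussian fitted to the data.

    Returns a list of (value, z_score, density) tuples for every value
    whose absolute z-score exceeds z_threshold (an assumed cutoff).
    """
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    flagged = []
    for x in values:
        z = (x - mean) / std
        # Gaussian probability density of x under the fitted model
        density = math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))
        if abs(z) > z_threshold:
            flagged.append((x, z, density))
    return flagged
```

In a sample of n values the largest possible z-score is (n - 1) / sqrt(n), so a cutoff of 3.0 may never trigger on very small data sets; a lower cutoff may be needed there.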
Holy Examples, Batman!!


             Box Plots

             Trimmed Means

             Grubbs’ Test
Holy Box and Whisker Plots,
         Batman!!

               Interquartile Range
               Q3 - Q1

               Lower Inner Fence: Q1 - 1.5*IQR

               Upper Inner Fence: Q3 + 1.5*IQR

               Lower Outer Fence: Q1 - 3*IQR

               Upper Outer Fence: Q3 + 3*IQR
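
A small Python sketch of the fences listed above. It assumes the quartiles are computed with `statistics.quantiles` and its default method; other quartile conventions give slightly different fences. The labels "mild" and "extreme" are names chosen here for illustration.

```python
import statistics

def iqr_fences(values):
    """Return the inner and outer fences as (lower, upper) pairs."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return {
        "inner": (q1 - 1.5 * iqr, q3 + 1.5 * iqr),
        "outer": (q1 - 3.0 * iqr, q3 + 3.0 * iqr),
    }

def classify(values):
    """Split outliers into those between the fences and those beyond the outer fences."""
    fences = iqr_fences(values)
    in_lo, in_hi = fences["inner"]
    out_lo, out_hi = fences["outer"]
    mild = [x for x in values if out_lo <= x < in_lo or in_hi < x <= out_hi]
    extreme = [x for x in values if x < out_lo or x > out_hi]
    return mild, extreme
```

For example, `classify([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 25, 40])` puts 25 between the inner and outer fences and 40 beyond the upper outer fence.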
    Holy Trimmed Means,
          Batman!!


Delete a fixed percentage of the most extreme values.

Calculate mean.

Use new mean for comparison.
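
The trimming steps can be sketched as below. The sketch assumes the same proportion is removed from both ends of the sorted data; the function name and the `proportion` parameter are hypothetical.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values dropped per tail
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)
```

With `[2, 3, 4, 5, 6, 7, 8, 9, 10, 100]` and a proportion of 0.1, the values 2 and 100 are dropped and the trimmed mean is 6.5, compared with an ordinary mean of 15.4.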
       Holy Test, Grubbs!!


Calculate the normal logarithm.

Sort data.

Calculate Z.

Compare Z to the critical Z value.
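
A hedged Python sketch of these steps. It assumes Z is the distance of the most extreme value from the sample mean, measured in sample standard deviations, and that the critical value is looked up in a table and passed in by the caller. The optional `use_log` flag applies the natural logarithm, which is an assumption about the logarithm meant above. All names are hypothetical.

```python
import math
import statistics

def grubbs_statistic(values, use_log=False):
    """Return (z, suspect): the test statistic and the most extreme value."""
    # Optional log transform (natural log assumed; requires positive values)
    data = [math.log(x) for x in values] if use_log else list(values)
    mean = statistics.mean(data)
    std = statistics.stdev(data)
    # Index of the observation farthest from the mean
    idx = max(range(len(data)), key=lambda i: abs(data[i] - mean))
    return abs(data[idx] - mean) / std, values[idx]

def is_outlier(values, critical_z, use_log=False):
    """Compare Z to a caller-supplied critical value (taken from a table)."""
    z, suspect = grubbs_statistic(values, use_log)
    return z > critical_z, suspect, z
```

Sorting is not needed to find the most extreme value in code, since `max` scans all observations; it matters mainly when the test is carried out by hand.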
    Holy Issues, Batman!!


Identifying distribution of data set.

The number of attributes

Mixtures of distribution
     Holy Distance Based
     Methods, Batman!!


DB(p,D)

k-Nearest Neighbor

Local Distance Based
Holy DB(p,D), Knorr & Ng,
       Batman!!



An object o is an outlier if at least a fraction p of
all objects in the database lie at a distance greater
than D from o.
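
The definition translates almost directly into code. The sketch below assumes Euclidean distance and points given as tuples of numbers; it uses the naive pairwise comparison, so it is quadratic in the number of objects. The function name is hypothetical.

```python
import math

def db_outliers(points, p, d):
    """Return the DB(p, D) outliers: points for which at least a
    fraction p of all points lie farther away than d."""
    n = len(points)
    outliers = []
    for o in points:
        far = sum(1 for q in points if math.dist(o, q) > d)
        if far / n >= p:
            outliers.append(o)
    return outliers
```

For the five points `(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)` with p = 0.75 and D = 3, only `(10, 10)` is reported, since four of the five objects lie farther than 3 from it.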
Holy Distance to k-Nearest
   Neighbors, Batman!!
Outlier score.

Assign each object a score in [0, ∞) based on the
distance to its k nearest neighbors.

Highly dependent on the choice of k.

Can be modified to use the mean distance from a
point to its 1st, 2nd, ..., k-th nearest neighbors as
the outlier score.
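
Both variants of the score can be sketched in a few lines of Python. The sketch assumes Euclidean distance and k smaller than the number of points; the `average` flag is a hypothetical switch between the k-th neighbor distance and the mean over the k nearest neighbors.

```python
import math

def knn_scores(points, k, average=False):
    """Outlier score per point: distance to the k-th nearest neighbor,
    or the mean distance to the k nearest neighbors if average=True."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        nearest = dists[:k]
        scores.append(sum(nearest) / k if average else nearest[-1])
    return scores
```

Running it with several values of k on the same data is a quick way to see how strongly the ranking depends on that choice.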
Holy Local distance-based
  algorithms, Batman!!

Determine how much an object differs from
its nearest neighbors (its outlier factor).

A threshold value is set.

All objects whose outlier factors
exceed this value are considered to be outliers.

Local Outlier Factor (LOF).
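
A compact, unoptimized sketch of LOF in Python. It assumes Euclidean distance, exactly k neighbors per point (ties broken by index), and no duplicate points, since duplicates would lead to a division by zero. The function name is hypothetical.

```python
import math

def lof_scores(points, k):
    """Local Outlier Factor for every point (values near 1 mean inlier)."""
    n = len(points)
    neighbours, k_dist = [], []
    for i in range(n):
        # Other points ranked by distance to point i
        ranked = sorted((math.dist(points[i], points[j]), j)
                        for j in range(n) if j != i)
        neighbours.append([j for _, j in ranked[:k]])
        k_dist.append(ranked[k - 1][0])  # distance to the k-th neighbour

    def reach_dist(a, b):
        # Reachability distance of a with respect to b
        return max(k_dist[b], math.dist(points[a], points[b]))

    # Local reachability density: inverse of the mean reachability distance
    lrd = []
    for a in range(n):
        mean_reach = sum(reach_dist(a, b) for b in neighbours[a]) / k
        lrd.append(1.0 / mean_reach)

    # LOF: mean density of the neighbours relative to the point's own density
    return [sum(lrd[b] for b in neighbours[a]) / (k * lrd[a]) for a in range(n)]
```

Objects are then flagged by comparing each score with the chosen threshold, for example `[p for p, s in zip(points, lof_scores(points, 2)) if s > 1.5]`, where 1.5 is only an illustrative value.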
Holy Advantages, Batman!!


More general and easier to apply than statistical
approaches

No probabilistic model needed

Can find local outliers
  Unholy Disadvantages,
        Batman!!

Methods are typically O(n^2)

Sensitive to choice of parameters

Dependent on pre-defined parameters

Can’t handle datasets with regions that have
widely differing density
Holy Kernel Based Methods,
         Batman!!
[Figure: mapping from the original space X to the Hilbert (feature) space H]
 Holy Implicitly, Batman!!



No additional memory or computation cost.
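
A small Python illustration of working in the feature space implicitly. It assumes a homogeneous degree-2 polynomial kernel on two-dimensional inputs, chosen only because its explicit feature map is easy to write down; the slides do not name a particular kernel. The kernel value equals the inner product of the mapped vectors, so distances in H can be computed without ever building those vectors.

```python
import math

def poly_kernel(x, y):
    """Degree-2 polynomial kernel k(x, y) = (x . y)^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def explicit_map(x):
    """Explicit feature map for 2-D input: phi(x) . phi(y) == (x . y)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def feature_space_distance(x, y, kernel=poly_kernel):
    """Distance ||phi(x) - phi(y)|| in H, using kernel evaluations only."""
    sq = kernel(x, x) - 2 * kernel(x, y) + kernel(y, y)
    return math.sqrt(max(sq, 0.0))  # guard against tiny negative rounding
```

`explicit_map` is included only to check the identity; a kernel method would call `poly_kernel` and `feature_space_distance` and never construct the mapped vectors.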
Holy High Dimensional,
       Batman!!
    Curse of Dimensionality
One way is to create subspaces of
        original space.
Another is Angle Based Outlier Degree.
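
A simplified Python sketch of an angle-based score. For a point A it takes every pair of other points B and C, computes the scalar product of AB and AC divided by the squared lengths of both vectors, and returns the variance of these values. This is an unweighted simplification, not the full published method, and it assumes at least three points with no duplicates. A low variance suggests an outlier, because all other points are seen from A in a narrow range of directions.

```python
import itertools
import statistics

def angle_based_score(points, index):
    """Variance of distance-weighted angles seen from points[index]."""
    a = points[index]
    others = [p for i, p in enumerate(points) if i != index]
    values = []
    for b, c in itertools.combinations(others, 2):
        ab = [bi - ai for ai, bi in zip(a, b)]
        ac = [ci - ai for ai, ci in zip(a, c)]
        dot = sum(u * v for u, v in zip(ab, ac))
        norm_sq = sum(u * u for u in ab) * sum(v * v for v in ac)
        values.append(dot / norm_sq)
    return statistics.pvariance(values)
```

Scoring every point with `[angle_based_score(points, i) for i in range(len(points))]` and sorting in ascending order puts the most outlying candidates first. The loop over all pairs makes this cubic in the number of points overall.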
Holy References, Batman!!
Outlier Detection Techniques. Hans-Peter Kriegel, Peer Kröger and Arthur Zimek.
Ludwig-Maximilians-Universität München, Munich, Germany.

A Review of Statistical Outlier Methods. Steven Walfish. Pharmaceutical Technology.

Outlier Detection Algorithms in Data Mining Systems. M. I. Petrovskiy. Department of
Computational Mathematics and Cybernetics, Moscow State University, Vorob’evy gory, Moscow.

Detection and Accommodation of Outliers in Normally Distributed Data Sets. Agata Fallon and
Christine Spada.

Outlier Detection with Kernel Density Functions. L. J. Latecki, A. Lazarevic, D. Pokrajac. 2008.

Classification by Support Vector Machines. F. Markowetz. Max-Planck-Institute for Molecular
Genetics. 2002.

Introduction to Data Mining. Pang-Ning Tan, Michael Steinbach, Vipin Kumar. 2005.

				