# ICS 278: Data Mining Lecture 1: Introduction to Data Mining

Document Sample

```
ICS 278: Data Mining

Lecture 2: Measurement and Data

Data Mining Lectures             Lecture 1: Introduction   Padhraic Smyth, UC Irvine
Today’s lecture

• Feedback on quiz
– Supplementary reading material: package being prepared

• Update on projects

• Office hours tomorrow: 9 to 10am

• Outline of today’s lecture:
– Finish material from Lecture 1
– Chapter 2: Measurement and Data
• Types of measurement
• Distance measures
• Data quality issues

Measurement
Mapping domain entities to symbolic representations.

[Figure: entities and relationships in the real world are mapped to data; a good
measurement process preserves relationships in the real world as relationships
in the data.]
Nominal variable

Here, numerical values just "name" the attribute uniquely; no ordering is implied.
E.g., jersey numbers in basketball: a player with number 30 is not more of
anything than a player with number 15, and certainly not twice whatever
number 15 is.

http://trochim.human.cornell.edu/kb/measlevl.htm

Measurements, cont.
ordinal measurement - attributes can be rank-ordered, but distances
between attributes have no meaning.
E.g., on a survey you might code Educational Attainment as
0 = less than H.S.; 1 = some H.S.; 2 = H.S. degree; 3 = some college;
4 = college degree; 5 = post-college.
In this measure, higher numbers mean more education.
But is the distance from 0 to 1 the same as from 3 to 4? No.
The interval between values is not interpretable in an ordinal measure.

interval measurement - distances between attributes do have meaning.
E.g., when we measure temperature in Fahrenheit, the distance from
30 to 40 is the same as the distance from 70 to 80; the interval between
values is interpretable. Averages make sense, but ratios do not: 80 degrees
is not twice as hot as 40 degrees.

Measurements, cont.
ratio measurement - has a meaningful absolute zero. This means
that you can construct a meaningful fraction (or ratio) with a ratio
variable. Weight is a ratio variable. In applied social research most
"count" variables are ratio, for example, the number of clients in past
six months. Why? Because you can have zero clients and because it
is meaningful to say that "...we had twice as many clients in the past
six months as we did in the previous six months."

Hierarchy of Measurements

Scales

scale          legal transforms                            example

nominal        any one-to-one mapping                      hair color, employment

ordinal        any order-preserving transform              severity, preference

interval       multiply by a constant, add a constant      temperature, calendar time

ratio          multiply by a constant                      weight, income

Why is this important?

• As we will see….
– Many models require data to be represented in a specific form
– e.g., real-valued vectors
• Linear regression, neural networks, support vector machines, etc
• These models implicitly assume interval-scale data (at least)

– What do we do with non-real valued inputs?
• Nominal with M values:
– Not appropriate to “map” to 1 to M (maps to an interval scale)
– Why? A term like w_1 × employment_type + w_2 × city_name is meaningless
– Could use M binary “indicator” variables
» But what if M is very large? (e.g., cluster into groups of values)
• Ordinal?

Mixed data

• Many real-world data sets have multiple types of variables,
– e.g., demographic data sets for marketing
– Nominal: employment type, ethnic group
– Ordinal: education level
– Interval: income, age

• Unfortunately, many data analysis algorithms are suited to only one
type of data (e.g., interval)

• Exception: decision trees
– Trees operate by subgrouping variable values at internal nodes
– Can operate effectively on binary, nominal, ordinal, interval
– We will see more details later…..

Other Kinds of Measurements

• “Derived variables”
– An operational or non-representational measurement: both
defines the property and assigns a number to it.
– Examples: quality of life in medicine, effort in software
engineering
a = # of unique operators in the program
b = # of unique operands
n = total # of operator occurrences
m = total # of operand occurrences
Programming effort: e = a·m·(n + m)·log(a + b) / (2b)

Distance Measures

•       Many data mining techniques are based on similarity or distance
measures between objects.

•       Two methods for computing similarity or distance:
1. Explicit similarity measurement for each pair of objects
2. Similarity obtained indirectly from vectors of object attributes.

•       Metric: d(i,j) is a metric iff
1. d(i,j) ≥ 0 for all i, j and d(i,j) = 0 iff i = j
2. d(i,j) = d(j,i) for all i and j
3. d(i,j) ≤ d(i,k) + d(k,j) for all i, j and k (triangle inequality)

Vector data and distance matrices

• Data may be available as n “vectors” each p-dimensional

• Or the “data” itself may be an n x n matrix of similarities or distances

Distance

• Notation: n objects with p measurements
x(i) = ( x_1(i), x_2(i), …, x_p(i) )

• Most common distance metric is Euclidean distance:
d_E(i, j) = [ Σ_{k=1}^p ( x_k(i) − x_k(j) )² ]^{1/2}

• Makes sense in the case where the different measurements are
commensurate; each variable measured in the same units.
• If the measurements are of different kinds, say length and weight, it is
not clear how to combine them on a common scale.

Standardization
When variables are not commensurate, we can standardize them by
dividing by the sample standard deviation. This makes them all equally
important.
The estimate for the standard deviation of x_k:

σ̂_k = [ (1/n) Σ_{i=1}^n ( x_k(i) − x̄_k )² ]^{1/2}

where x̄_k is the sample mean:

x̄_k = (1/n) Σ_{i=1}^n x_k(i)

(When might standardization *not* be such a good idea?
Hint: think of extremely skewed data and outliers, e.g., Bill Gates’ income.)

Weighted Euclidean distance

Finally, if we have some idea of the relative importance of
each variable, we can weight them:

d_WE(i, j) = [ Σ_{k=1}^p w_k ( x_k(i) − x_k(j) )² ]^{1/2}

Other Distance Metrics

• Minkowski or L metric:                                    1
 p                      

d(i, j)    ( x k (i)  x k ( j)) 
                         
 k 1                    
• Manhattan, city block or L1 metric:
p
d (i, j)   x k (i)  x k ( j)
k 1

• L

d(i, j)  max x k (i)  x k ( j)
k


• Each variable contributes independently to the measure of distance.
• May not always be appropriate…

[Figure: objects i and j, each described by diameter(i), height(i), height2(i),
…, height100(i): one diameter variable alongside 100 height-type variables, so
the redundant height measurements dominate any distance that adds each
variable's contribution independently.]
Dependence among Variables

• Covariance and correlation measure linear dependence

• Assume we have two variables or attributes X and Y and n objects
taking on values x(1), …, x(n) and y(1), …, y(n). The sample
covariance of X and Y is:
Cov(X, Y) = (1/n) Σ_{i=1}^n ( x(i) − x̄ )( y(i) − ȳ )

• The covariance is a measure of how X and Y vary together.
– it will be large and positive if large values of X are associated
with large values of Y, and small values of X with small values of Y

Sample correlation coefficient

• Covariance depends on ranges of X and Y
• Standardize by dividing by standard deviation
• Sample correlation coefficient

ρ(X, Y) = Σ_{i=1}^n ( x(i) − x̄ )( y(i) − ȳ ) / [ Σ_{i=1}^n ( x(i) − x̄ )² · Σ_{i=1}^n ( y(i) − ȳ )² ]^{1/2}
Sample Correlation Matrix
[Figure: sample correlation matrix (color scale −1 to +1) for data on
characteristics of Boston suburbs; variables include nitrous oxide level,
average # rooms, median house value, and percentage of large residential lots.]
Mahalanobis distance

d_MH(i, j) = [ ( x(i) − x(j) )ᵀ Σ⁻¹ ( x(i) − x(j) ) ]^{1/2}

where Σ is the covariance matrix.

1. It automatically accounts for the scaling of the coordinate axes
2. It corrects for correlation between the different features

Price:
1. The covariance matrices can be hard to determine accurately
2. The memory and time requirements grow quadratically rather
than linearly with the number of features.


[Figure: scatterplot of Y against X. Are X and Y dependent? What is ρ(X, Y)?
Covariance and correlation measure only linear dependence, so a nonlinear
relationship between X and Y can show near-zero correlation.]
Binary Vectors

                j = 1            j = 0

i = 1           n11              n10

i = 0           n01              n00

• matching coefficient: ( n11 + n00 ) / ( n11 + n10 + n01 + n00 )

• Jaccard coefficient: n11 / ( n11 + n10 + n01 )

Other distance metrics

• Nominal variables
– Number of matches divided by number of dimensions

• Distances between strings of different lengths
– e.g., “Patrick J. Smyth” and “Padhraic Smyth”
– Edit distance

• Distances between images and waveforms
– Shift-invariant, scale-invariant
– e.g., d(x, y) = min_{a,b} || (a·x + b) − y ||

Transforming Data

• Duality between form of the data and the model
– Useful to bring data onto a “natural scale”
– Some variables are very skewed, e.g., income

• Common transforms: square root, reciprocal, logarithm, raising to a
power.

• Logit: transforms p from (0, 1) to the real line:

logit(p) = log( p / (1 − p) )

Data Quality

•     Individual measurements
– Random noise in individual measurements
•   Variance (precision)
•   Bias
•   Random data entry errors
•   Noise in label assignment (e.g., class labels in medical data sets)
– Systematic errors
• E.g., all ages > 99 recorded as 99
• More individuals aged 20, 30, 40, etc than expected
– Missing information
• Missing at random
– Questions on a questionnaire that people randomly forget to fill in
• Missing systematically
– Questions that people don’t want to answer
– Patients who are too ill for a certain test

Data Quality

•     Collections of measurements
– Ideal case = random sample from population of interest
– Real case = often a biased sample of some sort
– Key point: patterns or models built on the training data may only be
valid on future data that comes from the same distribution

•     Examples of non-randomly sampled data
– Medical study where subjects are all students
– Geographic dependencies
– Temporal dependencies
– Stratified samples
• E.g., 50% healthy, 50% ill
– Hidden systematic effects
• E.g., market basket data from the weekend of a large sale in the store
• E.g., Web log data during finals week

Next Lecture

• Discussion of class projects

• Chapter 3
– Exploratory data analysis and visualization


```