Outline
Introduction
Descriptive Data Summarization
Data Cleaning
Missing value
Noise data
Data Integration
Redundancy
Data Transformation
Data Cleaning
Importance
“Data cleaning is one of the three biggest
problems in data warehousing”—Ralph
Kimball
“Data cleaning is the number one problem
in data warehousing”—DCI survey
Data Cleaning
Data cleaning tasks
Fill in missing values
Identify outliers and smooth out noisy data
Missing Data
Missing data may be due to
equipment malfunction
inconsistent with other recorded data and thus deleted
data not entered due to misunderstanding
certain data may not be considered important at the time of
entry
not register history or changes of the data
It is important to note that, a missing value may not
always imply an error. (for example, Null-allow attri. )
How to Handle Missing Data?
Ignore the tuple: usually done when class label is
missing (assuming the tasks in classification—not
effective when the percentage of missing values per
attribute varies considerably.
Fill in the missing value manually: tedious +
infeasible
How to Handle Missing Data?
Fill in it automatically with
a global constant : e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to
the same class: smarter
the most probable value: inference-based such as
Bayesian formula or decision tree
Outline
Introduction
Descriptive Data Summarization
Data Cleaning
Missing value
Noise data
Data Integration
Redundancy
Data Transformation
Noisy Data
Noise: random error or variance in a
measured variable
How to Handle Noisy Data?
Binning
Regression
Clustering
Binning
Binnig methods smooth a sorted data value
by consulting its “neighborhood”
First of all, we sort all the values
Then, the sorted values are distributed into a
number of “buckets”, or “bins”
Then we smooth the values by
Means (bin value is replace by mean value), or
Medium (bin value is replace by medium value), or
Boundaries (bin value is replace by the closest boundary
value)
Simple Discretization Methods:
Binning
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
Regression y
Y1
Y1’ y=x+1
X1 x
Cluster Analysis
Outline
Introduction
Descriptive Data Summarization
Data Cleaning
Missing value
Noise data
Data Integration
Redundancy
Data Transformation
Data integration
Data integration:
Combines data from multiple sources into a
coherent store
Data integration problems
Schema integration:
e.g., A.cust-id B.cust-#
Integrate metadata from different sources
Detecting and resolving data value conflicts
For the same real world entity, attribute values
from different sources are different
Possible reasons: different representations,
different scales, e.g., metric vs. British units
Redundant data
Redundant data occur often when integration
of multiple databases
Object identification: The same attribute or object
may have different names in different databases
Derivable data: One attribute may be a “derived”
attribute in another table, e.g., annual revenue
Redundant data
Redundant attributes may be able to be
detected by correlation analysis
Careful integration of the data from multiple
sources may help reduce/avoid redundancies
and inconsistencies and improve mining
speed and quality
Pearson’s product moment
coefficient
Correlation coefficient (also called Pearson’s
product moment coefficient)
rA, B
( A A)(B B) ( AB) n AB
(n 1)AB (n 1)AB
where n is the number of tuples, and are the
respective means of A and B, σA and σB are the respective
standard deviation of A and B, and Σ(AB) is the sum of
the AB cross-product.
Pearson’s product moment
coefficient
The correlation coefficient is always between
-1 and +1. The closer the correlation is to +/-
1, the closer to a perfect linear relationship.
Here is how I tend to interpret correlations.
-1.0 to -0.7 strong negative association.
-0.7 to -0.3 weak negative association.
-0.3 to +0.3 little or no association.
+0.3 to +0.7 weak positive association.
+0.7 to +1.0 strong positive association.
Chi-Square
Χ2 (chi-square) test
The larger the Χ2 value, the more likely the variables
are related
Chi-Square Calculation: An
Example
Suppose a group of 1500 people was
surveyed.
The gender of each person was noted
Male: 300
Female: 1200
We have two attributes:
Gender
Prefer-reading
Chi-Square Calculation: An
Example
i
Male Female Sum
j
(row)
Like science fiction 250(90) 200(360) 450
Not like science 50(210) 1000(840) 1050
fiction
Sum(col.) 300 1200 1500
E11 = count (male)*count(fiction)/N = 300 * 450 / 1500 =90
E12 = count (male)*count(not_fiction)/N = 300 * 1050/ 1500 =90
(250 90 ) 2 (50 210 ) 2 (200 360 ) 2 (1000 840 ) 2
2
507 .93
90 210 360 840
Chi-Square Calculation: An
Example
For this 2 by 2 table, the degree of freedom are (2-
1)(2-1)=1
For 1 degree of freedom, the Chi-Square value
needed to reject the hypothesis at the 0.001
significance is 10.828
Since our value is above this, we can conclude that
the gender and prefer_reading are (strongly)
correlated for the given group of people
Outline
Introduction
Descriptive Data Summarization
Data Cleaning
Missing value
Noise data
Data Integration
Redundancy
Data Transformation
Data Transformation
Data Transformation can involve the
following:
Smoothing: remove noise from the data,
including binning, regression and clustering
Aggregation
Generalization
Normalization
Attribute construction
Normalization
Normalization
Normalization
Min-max normalization
Z-score normalization
Decimal normalization
Min-max normalization
Min-max normalization: to [new_minA,
new_maxA]
v minA
v' (new _ maxA new _ minA) new _ minA
maxA minA
Ex. Let income range $12,000 to $98,000
normalized to [0.0, 1.0]. Then $73,000 is
mapped to 73,600 12 ,000
(1.0 0) 0 0.716
98,000 12 ,000
Z-score normalization
Z-score normalization (μ: mean, σ:
standard deviation):
v A
v'
A
Ex. Let μ = 54,000, σ = 16,000. Then
73,600 54 ,000
1.225
16 ,000
Decimal normalization
Normalization by decimal scaling
v
v' j Where j is the smallest integer such that Max(|ν’|) < 1
10
Suppose the recorded value of A range from
-986 to 917, the max absolute value is 986,
so j = 3