Embed
Email

Outline

Document Sample

Shared by: gegeshandong
Categories
Tags
Stats
views:
1
posted:
12/26/2011
language:
pages:
31
Outline

 Introduction

 Descriptive Data Summarization

 Data Cleaning

 Missing value

 Noise data

 Data Integration

 Redundancy

 Data Transformation

Data Cleaning

 Importance

 “Data cleaning is one of the three biggest

problems in data warehousing”—Ralph

Kimball

 “Data cleaning is the number one problem

in data warehousing”—DCI survey

Data Cleaning

 Data cleaning tasks

 Fill in missing values

 Identify outliers and smooth out noisy data

Missing Data

 Missing data may be due to

 equipment malfunction

 inconsistent with other recorded data and thus deleted

 data not entered due to misunderstanding

 certain data may not be considered important at the time of

entry

 not register history or changes of the data





 It is important to note that, a missing value may not

always imply an error. (for example, Null-allow attri. )

How to Handle Missing Data?

 Ignore the tuple: usually done when class label is

missing (assuming the tasks in classification—not

effective when the percentage of missing values per

attribute varies considerably.





 Fill in the missing value manually: tedious +

infeasible

How to Handle Missing Data?

 Fill in it automatically with

 a global constant : e.g., “unknown”, a new class?!

 the attribute mean

 the attribute mean for all samples belonging to

the same class: smarter

 the most probable value: inference-based such as

Bayesian formula or decision tree

Outline

 Introduction

 Descriptive Data Summarization

 Data Cleaning

 Missing value

 Noise data

 Data Integration

 Redundancy

 Data Transformation

Noisy Data

 Noise: random error or variance in a

measured variable



 How to Handle Noisy Data?

 Binning

 Regression

 Clustering

Binning

 Binnig methods smooth a sorted data value

by consulting its “neighborhood”

 First of all, we sort all the values

 Then, the sorted values are distributed into a

number of “buckets”, or “bins”

 Then we smooth the values by

 Means (bin value is replace by mean value), or

 Medium (bin value is replace by medium value), or

 Boundaries (bin value is replace by the closest boundary

value)

Simple Discretization Methods:

Binning

 Sorted data for price (in dollars):

 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34



* Partition into equal-frequency (equi-depth) bins:

- Bin 1: 4, 8, 9, 15

- Bin 2: 21, 21, 24, 25

- Bin 3: 26, 28, 29, 34



* Smoothing by bin means:

- Bin 1: 9, 9, 9, 9

- Bin 2: 23, 23, 23, 23

- Bin 3: 29, 29, 29, 29



* Smoothing by bin boundaries:

- Bin 1: 4, 4, 4, 15

- Bin 2: 21, 21, 25, 25

- Bin 3: 26, 26, 26, 34

Regression y



Y1





Y1’ y=x+1





X1 x

Cluster Analysis

Outline

 Introduction

 Descriptive Data Summarization

 Data Cleaning

 Missing value

 Noise data

 Data Integration

 Redundancy

 Data Transformation

Data integration

 Data integration:

 Combines data from multiple sources into a

coherent store

Data integration problems

 Schema integration:

e.g., A.cust-id  B.cust-#

 Integrate metadata from different sources





 Detecting and resolving data value conflicts

 For the same real world entity, attribute values

from different sources are different

 Possible reasons: different representations,

different scales, e.g., metric vs. British units

Redundant data

 Redundant data occur often when integration

of multiple databases

 Object identification: The same attribute or object

may have different names in different databases

 Derivable data: One attribute may be a “derived”

attribute in another table, e.g., annual revenue

Redundant data

 Redundant attributes may be able to be

detected by correlation analysis







 Careful integration of the data from multiple

sources may help reduce/avoid redundancies

and inconsistencies and improve mining

speed and quality

Pearson’s product moment

coefficient

 Correlation coefficient (also called Pearson’s

product moment coefficient)





rA, B 

 ( A  A)(B  B)   ( AB)  n AB

(n  1)AB (n  1)AB



where n is the number of tuples, and are the

respective means of A and B, σA and σB are the respective

standard deviation of A and B, and Σ(AB) is the sum of

the AB cross-product.

Pearson’s product moment

coefficient

 The correlation coefficient is always between

-1 and +1. The closer the correlation is to +/-

1, the closer to a perfect linear relationship.

Here is how I tend to interpret correlations.



 -1.0 to -0.7 strong negative association.

 -0.7 to -0.3 weak negative association.

 -0.3 to +0.3 little or no association.

 +0.3 to +0.7 weak positive association.

 +0.7 to +1.0 strong positive association.

Chi-Square

 Χ2 (chi-square) test









 The larger the Χ2 value, the more likely the variables

are related

Chi-Square Calculation: An

Example

 Suppose a group of 1500 people was

surveyed.



 The gender of each person was noted

 Male: 300

 Female: 1200



 We have two attributes:

 Gender

 Prefer-reading

Chi-Square Calculation: An

Example

i

Male Female Sum

j

(row)

Like science fiction 250(90) 200(360) 450



Not like science 50(210) 1000(840) 1050

fiction

Sum(col.) 300 1200 1500



 E11 = count (male)*count(fiction)/N = 300 * 450 / 1500 =90

 E12 = count (male)*count(not_fiction)/N = 300 * 1050/ 1500 =90



(250  90 ) 2 (50  210 ) 2 (200  360 ) 2 (1000  840 ) 2

 

2

    507 .93

90 210 360 840

Chi-Square Calculation: An

Example

 For this 2 by 2 table, the degree of freedom are (2-

1)(2-1)=1



 For 1 degree of freedom, the Chi-Square value

needed to reject the hypothesis at the 0.001

significance is 10.828



 Since our value is above this, we can conclude that

the gender and prefer_reading are (strongly)

correlated for the given group of people

Outline

 Introduction

 Descriptive Data Summarization

 Data Cleaning

 Missing value

 Noise data

 Data Integration

 Redundancy

 Data Transformation

Data Transformation

 Data Transformation can involve the

following:

 Smoothing: remove noise from the data,

including binning, regression and clustering

 Aggregation

 Generalization

 Normalization

 Attribute construction

Normalization

Normalization

Normalization

 Min-max normalization

 Z-score normalization

 Decimal normalization

Min-max normalization

 Min-max normalization: to [new_minA,

new_maxA]

v  minA

v'  (new _ maxA  new _ minA)  new _ minA

maxA  minA

 Ex. Let income range $12,000 to $98,000

normalized to [0.0, 1.0]. Then $73,000 is

mapped to 73,600  12 ,000

(1.0  0)  0  0.716

98,000  12 ,000

Z-score normalization

 Z-score normalization (μ: mean, σ:

standard deviation):

v  A

v' 

 A





 Ex. Let μ = 54,000, σ = 16,000. Then

73,600  54 ,000

 1.225

16 ,000

Decimal normalization

 Normalization by decimal scaling



v

v' j Where j is the smallest integer such that Max(|ν’|) < 1

10



 Suppose the recorded value of A range from

-986 to 917, the max absolute value is 986,

so j = 3



Related docs
Other docs by gegeshandong
A_E_KY-4PSE30WUSeries-Rev1012A
Views: 0  |  Downloads: 0
688_xls
Views: 0  |  Downloads: 0
2-1 辫措康
Views: 0  |  Downloads: 0
VINPR Lit Order Form New Jan 09
Views: 0  |  Downloads: 0
WRECKED - Torino Film Festival
Views: 0  |  Downloads: 0
project2btestcases
Views: 0  |  Downloads: 0
Fund Account transfer form9.2011
Views: 0  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!