Embed
Email

Data Cleansing

Document Sample
Data Cleansing
Shared by: HC11121515320
Categories
Tags
Stats
views:
0
posted:
12/15/2011
language:
pages:
26
Data Cleansing









> >>

“A company’s most important asset is

information. A corporation’s ability to

compete, adapt, and grow in a business

climate of rapid change is dependent in

large measure on how well the company

uses information to make decisions …

Sharing information that isn’t clean and

consolidated to the fullest extent can

substantially reduce the effectiveness of a

system of significant investment and

considerable pay-off potential.”

Stoker, 1999

> >>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Today’s Coverage

INTRODUCTION  Data Cleansing and Data Quality

 Steps in Data Cleansing



• Why is “Dirty” Data a Problem?

WHY “DIRTY” DATA

 Why is Legacy Data “Dirty”?

 To Cleanse or Not To Cleanse



CLEANSING STEPS

 Parsing • Matching

 Correcting • Consolidating

 Standardizing



CONCLUSION • Conclusion

 Demonstration

 Questions

>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Data Cleansing and Data Quality

 Data is a product that can be characterized

as either “quality” or “non-quality.” The

ability to make quality decisions depends in

part on the decision-maker’s ability to

access quality data.

 Data cleansing is the process that insures

that the same piece of information is

referred to in only ONE way. When data is

clean, its users can focus on its use and not

its credibility.



>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Steps in Data Cleansing

 Parsing

 Correcting

 Standardizing

 Matching

 Consolidating





>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION





Why is Data “Dirty”

and Why is This a Problem?

 Simply put, dirty data for data warehouses is

the product of relying on data from legacy

systems.

 But if company’s have relied on this data for

decades, why is it a problem today?

 Because a data warehouse “promises” to

deliver “a single version of the truth.”

Unfortunately integrating data from different

sources magnifies its problems.





>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Why is Legacy Data “Dirty” ?

 Dummy Values,

 Absence of Data,

 Multipurpose Fields,

 Cryptic Data,

 Contradicting Data,

 Inappropriate Use of Address Lines,

 Violation of Business Rules,

 Reused Primary Keys,

 Non-Unique Identifiers, and

 Data Integration Problems

>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









To Cleanse or Not to Cleanse

 CAN the legacy data be cleansed?

 Sometimes the answer is “NO”

 Then, SHOULD it be cleansed?

 Again, sometimes “NO”

 Next, WHERE should it be cleansed?

 Finally, HOW should it be cleansed?





>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Steps in Cleansing Data

 Parsing



 Correcting



 Standardizing



 Matching



 Consolidating



>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Parsing





Parsing locates and identifies individual

data elements in the source files and then

isolates these data elements in the target

files.









>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Parsing



Parsed Data in Target File

First Name: Beth

Middle Name: Christine

Input Data from Source File Last Name: Parker

Beth Christine Parker, SLS MGR Title: SLS MGR

Regional Port Authority Firm: Regional Port Authority

Federal Building Location: Federal Building

12800 Lake Calumet Number: 12800

Hedgewisch, IL Street: Lake Calumet

City: Hedgewisch

State: IL









>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Correcting







Corrects parsed individual data

components using sophisticated data

algorithms and secondary data sources.









>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Correcting

Corrected Data

Parsed Data First Name: Beth

First Name: Beth Middle Name: Christine

Middle Name: Christine Last Name: Parker

Last Name: Parker Title: SLS MGR

Title: SLS MGR Firm: Regional Port Authority

Firm: Regional Port Authority Location: Federal Building

Location: Federal Building Number: 12800

Number: 12800 Street: South Butler Drive

Street: Lake Calumet City: Chicago

City: Hedgewisch State: IL

State: IL Zip: 60633

Zip+Four: 2398









>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Standardizing





Standardizing applies conversion routines

to transform data into its preferred (and

consistent) format using both standard

and custom business rules.









>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Standardizing

Corrected Data

Corrected Data Pre-name: Ms.

First Name: Beth First Name: Beth

Middle Name: Christine 1st Name Match

Last Name: Parker Standards: Elizabeth, Bethany, Bethel

Title: SLS MGR Middle Name: Christine

Firm: Regional Port Authority Last Name: Parker

Location: Federal Building Title: Sales Mgr.

Number: 12800 Firm: Regional Port Authority

Street: South Butler Drive Location: Federal Building

City: Chicago Number: 12800

State: IL Street: S. Butler Dr.

Zip: 60633 City: Chicago

Zip+Four: 2398 State: IL

Zip: 60633

Zip+Four: 2398







>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Parsing, Correcting, Standardizing

TITLE FIRST CONC. LAST GENER.

NAME

LINE

Mr. William St. John III

Bill

HSNO ST-DIR ST-NM ST-TYPE



STREET

LINE 101 S. Main St.

Strete

CITY STATE POST



GEOG.

LINE St. 63118

Sant. Louis, MO 63181

>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Matching





Searching and matching records within and

across the parsed, corrected and

standardized data based on predefined

business rules to eliminate duplications.









>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Match Patterns



Business Street Branch Customer City Vendor Pattern Pattern

Name Type #/Tax ID Code I.D.



Exact Exact Exact Exact Exact Exact AAAAAA P110



Exact VClose Exact VClose Exact Blanks ABAAA- P115



Exact VClose Exact Blanks Exact Exact ABA-AA P120



Exact VClose Close Close Exact Exact ABCCAA S300



VClose VClose Exact Close Exact Exact BBACAA S310





>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Matching

Corrected Data (Data Source #2)

Corrected Data (Data Source #1) Pre-name: Ms.

Pre-name: Ms. First Name: Elizabeth

First Name: Beth 1st Name Match

1st Name Match Standards: Beth, Bethany, Bethel

Standards: Elizabeth, Bethany, Bethel Middle Name: Christine

Middle Name: Christine Last Name: Parker-Lewis

Last Name: Parker Title:

Title: Sales Mgr. Firm: Regional Port Authority

Firm: Regional Port Authority Location: Federal Building

Location: Federal Building Number: 12800

Number: 12800 Street: S. Butler Dr., Suite 2

Street: S. Butler Dr. City: Chicago

City: Chicago State: IL

State: IL Zip: 60633

Zip: 60633 Zip+Four: 2398

Zip+Four: 2398 Phone: 708-555-1234

Fax: 708-555-5678





>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Consolidating





Analyzing and identifying relationships

between matched records and

consolidating/merging them into ONE

representation.









>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Consolidating



Consolidated Data

Name: Ms. Beth (Elizabeth)

Corrected Data (Data Source #1) Christine Parker-Lewis

Title: Sales Mgr.

Firm: Regional Port Authority

Location: Federal Building

Address: 12800 S. Butler Dr., Suite 2

Chicago, IL 60633-2398

Corrected Data (Data Source #2)

Phone: 708-555-1234

Fax: 708-555-5678









>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Consolidating









William Janet Karen William

Jones Jones Jones Jones Jr.







>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Recommended Best Practices

1. Use metadata to document rules





2. Determine data cleansing schedule





3. Build quality into new and existing

systems



>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Legacy Systems View (3 Clients)



Account No.

83451234 Policy No.

ME309451-2









Transaction

B498/97



>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









The Reality – ONE Client



Account No.

83451234 Policy No.

ME309451-2









Transaction

B498/97





>>

INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION









Demonstration



 Vality

http://www.vality.com

 Trillium Software

http://www.trilliumsoft.com

 First Logic

http://www.firstlogic.com





>>


Related docs
Other docs by HC11121515320
Sanctuary Cove
Views: 0  |  Downloads: 0
Sheet1
Views: 0  |  Downloads: 0
Slide 1
Views: 0  |  Downloads: 0
XIV-USA-Sambo-Open-Results
Views: 6  |  Downloads: 0
De onde v�m os Beb�s ?
Views: 1  |  Downloads: 0
Ethan Frome
Views: 4  |  Downloads: 0
informe concejo 2008 cultura
Views: 0  |  Downloads: 0
Rhumatologie
Views: 4  |  Downloads: 0
By registering with docstoc.com you agree to our
privacy policy

You are almost ready to download!

You are almost ready to download!