Data Cleansing
> >>
“A company’s most important asset is
information. A corporation’s ability to
compete, adapt, and grow in a business
climate of rapid change is dependent in
large measure on how well the company
uses information to make decisions …
Sharing information that isn’t clean and
consolidated to the fullest extent can
substantially reduce the effectiveness of a
system of significant investment and
considerable pay-off potential.”
Stoker, 1999
> >>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Today’s Coverage
INTRODUCTION Data Cleansing and Data Quality
Steps in Data Cleansing
• Why is “Dirty” Data a Problem?
WHY “DIRTY” DATA
Why is Legacy Data “Dirty”?
To Cleanse or Not To Cleanse
CLEANSING STEPS
Parsing • Matching
Correcting • Consolidating
Standardizing
CONCLUSION • Conclusion
Demonstration
Questions
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Data Cleansing and Data Quality
Data is a product that can be characterized
as either “quality” or “non-quality.” The
ability to make quality decisions depends in
part on the decision-maker’s ability to
access quality data.
Data cleansing is the process that insures
that the same piece of information is
referred to in only ONE way. When data is
clean, its users can focus on its use and not
its credibility.
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Why is Data “Dirty”
and Why is This a Problem?
Simply put, dirty data for data warehouses is
the product of relying on data from legacy
systems.
But if company’s have relied on this data for
decades, why is it a problem today?
Because a data warehouse “promises” to
deliver “a single version of the truth.”
Unfortunately integrating data from different
sources magnifies its problems.
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Why is Legacy Data “Dirty” ?
Dummy Values,
Absence of Data,
Multipurpose Fields,
Cryptic Data,
Contradicting Data,
Inappropriate Use of Address Lines,
Violation of Business Rules,
Reused Primary Keys,
Non-Unique Identifiers, and
Data Integration Problems
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
To Cleanse or Not to Cleanse
CAN the legacy data be cleansed?
Sometimes the answer is “NO”
Then, SHOULD it be cleansed?
Again, sometimes “NO”
Next, WHERE should it be cleansed?
Finally, HOW should it be cleansed?
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Steps in Cleansing Data
Parsing
Correcting
Standardizing
Matching
Consolidating
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Parsing
Parsing locates and identifies individual
data elements in the source files and then
isolates these data elements in the target
files.
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Parsing
Parsed Data in Target File
First Name: Beth
Middle Name: Christine
Input Data from Source File Last Name: Parker
Beth Christine Parker, SLS MGR Title: SLS MGR
Regional Port Authority Firm: Regional Port Authority
Federal Building Location: Federal Building
12800 Lake Calumet Number: 12800
Hedgewisch, IL Street: Lake Calumet
City: Hedgewisch
State: IL
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Correcting
Corrects parsed individual data
components using sophisticated data
algorithms and secondary data sources.
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Correcting
Corrected Data
Parsed Data First Name: Beth
First Name: Beth Middle Name: Christine
Middle Name: Christine Last Name: Parker
Last Name: Parker Title: SLS MGR
Title: SLS MGR Firm: Regional Port Authority
Firm: Regional Port Authority Location: Federal Building
Location: Federal Building Number: 12800
Number: 12800 Street: South Butler Drive
Street: Lake Calumet City: Chicago
City: Hedgewisch State: IL
State: IL Zip: 60633
Zip+Four: 2398
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Standardizing
Standardizing applies conversion routines
to transform data into its preferred (and
consistent) format using both standard
and custom business rules.
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Standardizing
Corrected Data
Corrected Data Pre-name: Ms.
First Name: Beth First Name: Beth
Middle Name: Christine 1st Name Match
Last Name: Parker Standards: Elizabeth, Bethany, Bethel
Title: SLS MGR Middle Name: Christine
Firm: Regional Port Authority Last Name: Parker
Location: Federal Building Title: Sales Mgr.
Number: 12800 Firm: Regional Port Authority
Street: South Butler Drive Location: Federal Building
City: Chicago Number: 12800
State: IL Street: S. Butler Dr.
Zip: 60633 City: Chicago
Zip+Four: 2398 State: IL
Zip: 60633
Zip+Four: 2398
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Parsing, Correcting, Standardizing
TITLE FIRST CONC. LAST GENER.
NAME
LINE
Mr. William St. John III
Bill
HSNO ST-DIR ST-NM ST-TYPE
STREET
LINE 101 S. Main St.
Strete
CITY STATE POST
GEOG.
LINE St. 63118
Sant. Louis, MO 63181
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Matching
Searching and matching records within and
across the parsed, corrected and
standardized data based on predefined
business rules to eliminate duplications.
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Match Patterns
Business Street Branch Customer City Vendor Pattern Pattern
Name Type #/Tax ID Code I.D.
Exact Exact Exact Exact Exact Exact AAAAAA P110
Exact VClose Exact VClose Exact Blanks ABAAA- P115
Exact VClose Exact Blanks Exact Exact ABA-AA P120
Exact VClose Close Close Exact Exact ABCCAA S300
VClose VClose Exact Close Exact Exact BBACAA S310
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Matching
Corrected Data (Data Source #2)
Corrected Data (Data Source #1) Pre-name: Ms.
Pre-name: Ms. First Name: Elizabeth
First Name: Beth 1st Name Match
1st Name Match Standards: Beth, Bethany, Bethel
Standards: Elizabeth, Bethany, Bethel Middle Name: Christine
Middle Name: Christine Last Name: Parker-Lewis
Last Name: Parker Title:
Title: Sales Mgr. Firm: Regional Port Authority
Firm: Regional Port Authority Location: Federal Building
Location: Federal Building Number: 12800
Number: 12800 Street: S. Butler Dr., Suite 2
Street: S. Butler Dr. City: Chicago
City: Chicago State: IL
State: IL Zip: 60633
Zip: 60633 Zip+Four: 2398
Zip+Four: 2398 Phone: 708-555-1234
Fax: 708-555-5678
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Consolidating
Analyzing and identifying relationships
between matched records and
consolidating/merging them into ONE
representation.
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Consolidating
Consolidated Data
Name: Ms. Beth (Elizabeth)
Corrected Data (Data Source #1) Christine Parker-Lewis
Title: Sales Mgr.
Firm: Regional Port Authority
Location: Federal Building
Address: 12800 S. Butler Dr., Suite 2
Chicago, IL 60633-2398
Corrected Data (Data Source #2)
Phone: 708-555-1234
Fax: 708-555-5678
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Consolidating
William Janet Karen William
Jones Jones Jones Jones Jr.
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Recommended Best Practices
1. Use metadata to document rules
2. Determine data cleansing schedule
3. Build quality into new and existing
systems
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Legacy Systems View (3 Clients)
Account No.
83451234 Policy No.
ME309451-2
Transaction
B498/97
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
The Reality – ONE Client
Account No.
83451234 Policy No.
ME309451-2
Transaction
B498/97
>>
INTRODUCTION WHY “DIRTY” DATA CLEANSING STEPS CONCLUSION
Demonstration
Vality
http://www.vality.com
Trillium Software
http://www.trilliumsoft.com
First Logic
http://www.firstlogic.com
>>