Docstoc

wwwsfdamaorgPresentationsdata_profiling_presen

Document Sample
wwwsfdamaorgPresentationsdata_profiling_presen Powered By Docstoc
					                        Data Profiling
                       The Path to More Accurate Data




Jack Olson
CTO, Evoke Software
jolson@evokesoft.com
                                  is Data Profiling
                             What What is Data Profiling




The use of analytical techniques on data for the
purpose of developing a thorough knowledge of its
content, structure and quality.



A process of developing information about data
instead of information from data.




                                                     2
                                                What is Data Profiling



Information About Data: (Data Profiling)

  30% of entries in SUPPLIER_ID are blank
  the range of values in UNIT_PRICE is 5.99 to 4599.99
  there are 14 ORDER_HEADER rows with no ORDER_DETAIL rows



Information FROM Data: (not Data Profiling)

  Texas auto buyers buy more Cadillacs per capita than any other state
  The average mortgage amount increased last year by 6%
  10% of last year's customers did not buy anything this year




                                                                         3
                           Examples of Data Content Issues


• understanding explicit column properties
    –   datatype
    –   length
    –   range
    –   permitted values
    –   null
    –   unique
    –   patterns
•   empty columns
•   inconsistency in value representation
•   representation of missing values
•   change points in granularity
•   unused data values
•   over-used data values


                                                       4
                                 Examples of Data Structure Issues


• understanding table structure
   – primary key
   – derived columns
   – foreign keys
      • inclusive rule
      • one-one, one-many rule
• denormalized column sets within table
• overloaded columns (columns having more than one
  data fact recorded in them)
• duplicate data columns
• synonym column pairs
• homonym column pairs




                                                               5
                                 Examples of Data Quality Issues


•   inaccurate data values
•   inconsistent representation of same value
•   missing values
•   violation of structure rules
    – non-unique primary keys
    – primary key/foreign key pair orphans
    – synonym discrepancies
• violation of simple data rules
    – date ordering
• violation of business data rules




                                                             6
                         Data Profiling Model



metadata
accurate &
inaccurate               accurate
                         metadata


               data
             profiling

  data                   facts about
accurate &               inaccurate
inaccurate                  data




                                           7
                                             Data Profiling Challenge



> Metadata does not exist

> Metadata is incomplete                      The quality of metadata
                                              is generally much worse
> Metadata is inaccurate                      than the quality of the data

> Metadata is difficult to gather


                                    > Data deviates from expectations

   The quality of the data          > Data has inaccuracies
   is generally unknown
                                    > Data inconsistent with metadata



                                                                             8
                                     The Need For Data Profiling


• Ad hoc methods of gathering the information fail
   – too much effort required
   – lack of adequate tools causes failure to expose important
     information
   – lack of organized methodology leads to missing important
     information
   – not enough time to complete
• Complete and accurate data profiles are critical to
  success of many projects
   –   data quality assessment
   –   data migration
   –   data integration
   –   data warehousing
   –   crm

                                                                 9
                                             Data Profile Uses


• data quality assessments
• justifications for data quality remedies
• designing data movement processes
   –   target system design
   –   transformations needed
   –   data cleansing required
   –   data purging/fixing required
• designing data migration processes
• designing data integration processes
• writing new programs to access data




                                                           10
                            Elements of a Data Profiling Environment


Metadata                        A Methodology
   no matter how bad               a rigorous, repeatable process

Data                            A Repository
   must include real data          a place to store information as it develops

                                Discovery Processes
                                   to find rules you don't know

                                Validation Processes
                                    to find violations of rules you do know

    data analyst
       to execute the process

    business analysts
      to interpret the findings and make decisions
                                                                                 11
THE DATA PROFILING PROCESS




                             12
                                       Data Profiling Steps




Gather metadata

  Identify and Prepare Data for Profiling

     Value Analysis

        Structure Analysis

           Single Object Data Rule Analysis

              Multiple Object Data Rule Analysis




                                                       13
                                            Gather Metadata


• Copy Information From Metadata Repositories
    – enterprise repositories
    – data dictionaries
• Extract Data from Data Definitions
    – COBOL Copybooks
    – PL/1 Data
    – RDBMS Catalogs
•   Extract Rules from Program Logic
•   Extract Rules from Screen Definitions
•   Extract Rules from User Procedures
•   Consult with Data Management Staff
    – DBA
    – Data Architect
    – Business Analysts


                                                        14
                                           Identify and Prepare Data


• Need Data in Tabular Form (First Normal or better)
   – must resolve REDEFINE structures
   – must resolve OCCURS structures
   – must resolve overloaded attributes
      •   bit level assignments
      •   combination values
      •   keywords
      •   multiple tokens
• Need access to data
   – direct
   – indirect through offload to staging database or flat files
   – honor security requirements
• May need to sample
   – extremely large data sources need to be broken down
   – get only more recent data

                                                                  15
                                                                           Value Analysis


                                                     documented properties

•   Attribute Properties
     –   Business meaning
     –   Datatype
     –   Length/precision                            discovered properties
     –   List of valid values      discovery
                                                                                 analysis
     –   Range of valid values
     –   Null/empty rule         data
     –   Unique rule
                                                     accurate properties
     –   Consecutive rule
     –   Formatting rules               validation
     –   Encoding patterns


                                                     invalid values



                                                                                        16
                                       Value Analysis - Discovery


• Discovery software can compute most properties for
  columns
   – some is easy - min/max lengths
   – some is complex - datatype, patterns
• Cannot identify local conventions for NULL
• Cannot tell you the Business Meaning
• Computes "what is" not "what is possible" or "what is
  valid"




                                                             17
                                             Value Analysis - Analysis


•   Compare documented properties with discovered properties
•   Investigate discrepancies
    – caused by inaccurate data
    – caused by inaccurate metadata
•   Determine correct property list
•   Document Accurate Property List


                     collaborative process

                     data analyst
                     business analyst
                     subject matter expert
                     dba
                     data architect


                                                                  18
                           Value Analysis - Property Extensions


• Find Other Facts About Content
  – empty columns (no values)
  – low usage columns (high percentage of blanks or
    zeros)
  – inconsistent representation of values
  – unused data value
  – methods used to indicate "no value"
     • null, blank, ?, don't know, did not provide
  – indications of change points
     • excessive number of blanks
     • excessive frequencies on some values




                                                           19
                              Value Analysis - Visual Inspection


• Visually Inspect data values
  –   value/frequency list
  –   boundary points (high/low values), (short/long values)
  –   values with special characters
  –   random walk through data
• Determine if consistent with business meaning
• Determine if quality errors exist that cannot be
  extracted from property list
  – names
       • DONALD DUCK, CHICAOG, BLAKC
  – Frequencies
       • 50% color BLUE

                                                               20
                                        Value Analysis -Documentation


• You must document everything you find in the repository
   –   documented properties
   –   discovered properties
   –   accurate properties
   –   characteristics found within valid values
   –   data quality issues
        • bad practices
        • values found in violation of properties
        • values found in error but within property definitions
• This information is needed for project execution
• This information will be helpful to next project that uses
  this data source



                                                                  21
                                                             Structure Analysis


                                              documented structure
•   Structural Properties
     –   Functional dependencies
     –   Primary keys
     –   Denormalization
                                              discovered structure
     –   Derived values           discovery
     –   Synonyms                                                     analysis
     –   Foreign keys            data
     –   Duplicate Data
     –   Third-normal form model              accurate structure
                                 validation




                                              invalid structure



                                                                             22
                                 Structure Analysis - Discovery


• Discovery software is very complex
• Discovery software can find many things for you
   –   functional dependencies
   –   primary keys
   –   derived columns
   –   denormalization sets
• Discovery software can aid in identifying others
   – foreign keys
   – synonyms
   – duplicate data
• Data inaccuracies at value level can reduce discovery
  software's effectiveness



                                                           23
         Structure Analysis - Functional Dependency Analysis

ORDER_NUMBER              determines         CUSTOMER-ID
ORDER_NUMBER, LINE_NUMBER determines         PRICE
PART_NUMBER               determines         DESCRIPTION
QTY,UNIT_PRICE            determines         AMOUNT
CUSTOMER_CATEGORY         determines         DISCOUNT_AMT

• Functional dependencies exist when the value of a
  column is constant for each value of one or more other
  columns
• Functional dependencies identify
   –   primary keys
   –   denormalized sets and their sub-key
   –   derived columns
   –   coincidences



                                                            24
                     Structure Analysis - Primary Key Analysis


• Primary keys are functional dependencies that cover all
  columns in a table
• Usually obvious
• May have multiple candidates
   – encoded key     (customer_id)
   – natural key     (customer name, address)
• Testing natural keys may uncover data quality problems
   – two employees with same social security number
   – same part under two part-numbers
• Single column primary keys are easy to find
   – cardinality equals number of rows
• Multiple column primary keys are hard to find without
  functional dependency discovery

                                                            25
                                Structure Analysis - Denormalization


• Many operational systems are denormalized
   – done for performance reasons
   – two types
      • inner denormalization
          PART_NUMBER > DESCRIPTION, UNIT_OF_MEASURE
      • outer denormalization
          ORDER_NO, CUST_NAME, LINE_NO > QTY, PRICE
      • outers often encountered on unloaded data
• Almost never documented
• Functional Dependency Discovery will identify them
   – unless data inaccuracies hide them
   – unless the sub-key is not included




                                                                26
                          Structure Analysis - Derived Columns


• Relationship exists between two or more columns
   – computation
      • PRICE = QTY * UNIT_PRICE
   – business rule
      • CUSTOMER_TYPE > DISCOUNT
• Functional dependency discovery will find them
• Usually not documented in metadata
   – documentation may exist in program logic
   – documentation may exist in application procedures




                                                           27
                                       Structure Analysis - Synonyms


• Synonyms are column pairs that connect data in one
  table to data in another table
   – may be same table
        • employee-id - manager-id
   – may be more than one column to each side of pair
        • building,room - building,meeting_room
• Discovery software can find candidates
   – cannot find exclusive relationships (split tables)
   – cannot find pairs with different data representation
• Synonyms may reflect
   –   Foreign key pairs
   –   Duplicate data
   –   Shared Domain sets
   –   Coincidences


                                                                 28
       Structure Analysis- Primary-Key/Foreign-Key Analysis


• Foreign keys are columns that connect data in one row
  to the primary key of another row of the same or a
  different table. Match is done on column value.
• Often not documented or enforced
• Can identify some by names
   – employee-id, emp-id
• Can identify some through value overlap analysis
   – a discovery technique that finds pairs of columns across
     tables that have a high degree of value overlap
• Need to establish rules for each pair
   – which is primary/ which is foreign
   – inclusive rule (yes or no)
   – one-to-one or one-to-many


                                                                29
                              Structure Analysis - Duplicate Data


• Duplicate data is a synonym pair that
   – does not define a primary/foreign key pair
   – each side is dependent on a keys that are primary/foreign
     key pairs
      • INVENTORY.DESCRIPTION - ORDER_DETAIL.DESCRIPTION
      • INVENTORY.PARTID      - ORDER-DETAIL.PART-ID
• Generally used for performance improvements
• May be false indicator of duplication
      • CUSTOMER.DISCOUNT    - ORDER.DISCOUNT
      • CUSTOMER.CUSTOMERID - ORDER.CUSTOMER-ID




                                                                 30
               Structure Analysis - Third-Normal-Form Model


• Software can generate TNF from functional
  dependencies, primary key, and synonym information
• Generation of TNF model will help validate conclusions
  on structural issues




                                                           31
                                           Single Object Data Rule Analysis



  •   Single Object Data Rules                       documented rules
       – Physical Rules
       – Business Rules


                                                          analysis

An Object is data that                  validation
documents a single instance
of a business object.
                                 data                accurate rules
examples:
order
personnel record
                                        validation
inventory record



                                                     invalid objects
                                                                        32
                                 Single Object Data Rule - Analysis


• Rules involve more than 1 column

• Discovery is not very useful
   – ordering of date columns
   – patterns of blanks, nulls

• Validate twice
   – first to determine if rule holds to aid in establishing the rule
   – second to find data quality violations

• Violations do not identify the specific offending value(s)

• Violations do not necessarily mean an incorrect value
   – may be a process violation
   – may be an authorized exception
                                                                        33
                      Single Object Data Rule - Physical Rules


• Rules that are inherent in the data definitions
   –   DELIVERY_DATE > ORDER_DATE
   –   HIRE_DATE > BIRTHDATE
   –   SHIPPED_AMT >= ORDER_AMT
   –   If RETURN_FLAG = NO THEN RETURN_DATE = NULL
   –   IF MARITAL_STATUS = 'S' THEN SPOUSE NAME = BLANK


• Analyst can develop them with common sense examination of
  business meaning of columns

• Rules and results need to be reviewed by business analysts




                                                               34
                       Single Object Data Rule - Business Rules


• Rules that require business knowledge
   – If PAYTYPE='PARTTIME' THEN SALARY_TYPE='HOURLY'
   – If CUSTOMER_TYPE='RETAIL' THEN DISCOUNT=0
   – If PRODUCT_TYPE='ENGINE' THEN QTY < 100


• Business analysts must develop

• Analysis should determine hardness of the rule
   – must always be true
   – violations may exist and be correct




                                                            35
                      Single Object Data Rule - Final Validation


• All data rules should be documented in the repository
• All results of validation tests should be documented in
  the repository
• Rules should be classified
   – violation level (none, small, moderate, large)
   – importance to business if not true (no impact, moderate,
     large)
   – monitoring status (currently monitored, not monitored)
• Can analyze further to find inaccurate data values within
  violations
   – some are obvious from inspection
   – clustering analysis may find source of errors




                                                                36
                                     Multiple Object Data Rule Analysis


                                               documented rules
•   Multiple Object Data Rules
     – Same object rules
     – Different object rules
     – Computed values
                                                    analysis
                                 validation


                             data              accurate rules


                                 validation




                                              invalid sets of objects


                                                                        37
                                Multiple Object Data Rule Analysis


• Discovery is not possible

• Validate twice
   – first to determine if rule holds to aid in establishing the rule
   – second to find data quality violations

• Violations do not identify the specific offending value(s)
   • set within which offending values exist is much larger
     than in the case of single object data rules

• Violations do not necessarily mean an incorrect value
   – may be a process violation
   – may be an authorized exception



                                                                        38
    Multiple Object Data Rule Analysis - Same Object Rules


• Rules that cross multiple occurrences of the same data
  object type
   – CHECKOUT_DATES CANNOT OVERLAP FOR SAME
     ASSET_ID
   – IF TWO PRODUCTS SHARE SAME BIN-NUMBER THEN ONE
     MUST BE A SUBSTITUTE PART_NO FOR THE OTHER
   – IF A PRODUCT IS PROVIDED BY MORE THAN ONE
     SUPPLIER-ID THEN ONE MUST BE IDENTIFIED AS THE
     PRIMARY SUPPLIER




                                                           39
 Multiple Object Data Rule Analysis - Different Object Rules


• Rules that deal with interaction between different object
  types
   – PRODUCT.SHIPPING_MODE = ORDER.SHIPPING_MODE
   – PRODUCT.TYPE_CODE MUST BE VALID FOR
     SUPPLIER.VALID_TYPES
   – EMPLOYEES MUST NOT BE CONVICTED FELONS




                                                              40
             Multiple Object Data Rule Analysis - Value Tests


• Computations that may indicate problems in the data
   – value/frequency pairs
   – comparison of order numbers/$amounts by quarter
   – average sales per quarter
• Not true rules in that they do not clearly define boundary
  between accurate and inaccurate
• Looking for values that are clearly out of bounds
  (reasonable tests)
• Good for finding incomplete sets of data
• Good for finding bad practices




                                                               41
                                    END OF DATA PROFILING


• Repository contains complete metadata descriptions
   – Column Properties
   – Structure Definitions
   – Data Rules
• Repository contains data on content
• Repository contains identification of inaccurate data
   – column property violations
   – structure rule violations
   – data rule violations




                                                          42
                                             Data Profiling Value




                                             $
                                             $
                          $                  $
      $                   $                  $
________________    ________________    ________________
Complete Projects   Better Quality      Better able to
Faster and at       Systems             Respond to New
Lower Cost          Delivered Earlier   Business
                                        Opportunities

35% savings in
cost and time



                                                             43
                                                  Data Profiling




project data analysts

                                     business analysts


        data quality assurance analysts




      Data Profiling should be a foundation technology
      for all their activities.



                                                            44
                                                          Bibliography



Jack Olson, Data Profiling: The Accuracy Dimension,
Morgan Kaufmann, 2002

Jack Olson, "Data Profiling: The First Step in Creating
a Data Warehouse", Enterprise Systems Journal,
Vol 14 No 5, May 1999, pp34-36

Jack Olson, "Data Profiling, The Key to Success in
Integration Projects", EAI Journal, Vol 4 No 2,
February 2002, pp 22-26

Philip Russom, "Data Profiling FAQ", Intelligent Enterprise,
August 1, 2002




                                                                  45

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:8
posted:3/10/2010
language:English
pages:45