Document by dfhdhdhdhjr


									Data Cleaning 101:
The Ron Cody Story

       April 26, 2007
     Katherine Semrau
So, who is Ron Cody anyway?

    Professor at RWJ Medical School,
   expert SAS programmer, SAS book
       writer and he loves to cycle
5 rules of data management

1.   What can go wrong, will.
2.   Nothing is ever as simple as it
     first appears.
3.   Everything takes longer than
     you expect.
4.   One size does NOT fit all.
5.   A calm sea does not make a
     skilled sailor -African proverb
                  -Suzette Levenson’s 5 Rules

   Purpose and Flow
   Data cleaning from A to Z
   Types of Data Cleaning
       SAS Coding & examples
   Reporting Errors & Corrections
   Conclusions
Purpose of Data Cleaning

   To verify that the dataset to be
    used for analysis accurately
    reflects the truth
Flow of Data Cleaning

         Data       Data
         Entry     Queries

     Data          Data Checking/
   Collection       Verification

First things first…
Data cleanliness starts with:
 A good form
       Is it clear? Readable?
        Understandable? In the
        appropriate language?


   Legible handwriting
Once data is collected…
   Reviewing by the data collector
   Checked by 2nd pair of eyes
   Both asking the following:
        Is the form complete (i.e. all

        Are there blanks?

        Are there strange answers?

        Are multiple choices made
         when only one choice is
On to data entry…

   Double Data Entry (CSPro,
    Access, EpiInfo...)
   TeleForms

   Data entry specialist will come
    across queries initially missed
On to data entry…
   Queries sent to field site for
    variables that don’t make sense
    or are illegible
       Original data collector should be
        asked about the query

   Continue with data entry or wait
    for the query return?
Now to data checking &

   So now you have a database
    with data entered…now what?

   Where to begin….
     Start small and basic
     Then move to more complex

   Get your data dictionary out
Types of data checking

       Validity Checks

       Range Checks

       Logic Checks

       Missing Data
What’s in the database?
(proc contents)
   “proc contents” should be run
    first to make sure all the variable
    names expected

     Check variable names
     Format of variable
           Character vs. Numeric
Validity Checks (proc freq)

   “proc freq” is your friend
   Use the standard “proc freq” just
    to get the possible answers
     Do the answers make sense?
     Are there a bunch of missing
     Are there strange outliers from
      what is expected?
   Example: Gender: M, F
Range Checks
(proc freq & proc means)
   Do the variables fall within
    expected limits:
     Age: 0-100 years
     Weight: 2500g-4000g

   Proc freq, proc means (example)
   Proc Tabulate
Logic Checks

   Use “if..then” statements in the
    data step
   “Where” statements to ensure
    the skip patterns were met
Duplicate Records

   Records mistakenly entered twice
   Two options:
       NODUPKEY
          Eliminates duplicate record by
           valuables you identify…
          But you have to be very careful
          Proc sort data=name out=name

       Proc sort NODUP; by _ALL_;
          Eliminates records that are
           EXACTLY identical for all variables
What to do when you find an
   Generate report
   Send to field staff or review from
    original charts
   Receive answers
   Correction of answers
     In database or
     In coding

   Document, document, document
Back to the flow of data cleaning

        Data       Data
        Entry     Queries

    Data          Data Checking/
  Collection       Verification

When is data cleaning done?

   Batches of forms are processed
   Time Schedule (i.e. monthly)

   NOT just before the analytic
    dataset is created
When is the data cleaning just
as good as it is going to get…

   How many rounds do you go?

   Who decides how clean it needs
    to be?
       Conversation between data
        analyst, PI, field staff…

   Data cleaning is very important

   Planning is key

   Think of it as CSI of the data

To top