Creating _ Examining a Database

W
Shared by: hcj
Categories
Tags
-
Stats
views:
9
posted:
2/23/2010
language:
English
pages:
33
Document Sample
scope of work template
							Creating & Examining
a Database
Creating an SPSS database

Database nomenclature

Graphic and statistical tools for
examining a database
                                                                                                                   2


                                            KEY CONCEPTS
                                                 *****
                                            Organizing Data


Data set, database, spreadsheet
Elements to be included in a database code sheet
Concept of a variable
Different ways a variable can change
Independent and dependent variables (X & Y)
Unit of analysis
A case
Datum, data, a value
Sample size (N)
Measurement (metric) v categorical (nonmetric) variables
Scales of measurement
       Nominal
       Ordinal
       Interval
       Ration
Examining variables:
       Missing data
       Outliers
       Central tendency
       Variability
       Frequency (f)
       Skew
       Kurtosis
       Modality
Problems created by missing data
How to identify missing data
Ways to correct the problem of missing data
Problems created by outliers
Techniques for identifying outliers
       Mean, minimum & maximum values
       Box Whisker plot
       Stem & Leaf plot
       Scatterplot
Frequencies (f), proportions (p), and percentages (P)
Histogram
Pie chart
Creating intervals for a measurement variable




  Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                   3



Organizing data ( con'd )

Optimal number of intervals in a frequency table
Skew: right (+) and left (-)
Tails of a distribution
Kurtosis
        Leptokurtosis
        Mesokurtosis
        Platykurtosis
Modality
        Unimodal
        Bimodal
        Poly- or multimodal
Reliability of a measure
Validity of a measure
Different types of statistics
        Univariate
        Bivariate
        Multivariate
Population v sample
Parameter v statistic

                                                     *****




  Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  4


                                 Lecture Outline

 Organizing the data and the concept
  of a database

 Database code sheet

 Database nomenclature

 Types of variables and scales of
  measurement

 Examining the data: graphic and
  statistical tools

 Reliability & validity of the measurements

 Population & sample

 Levels of statistics

                                                  *****




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  5


                Getting the Data Organized
Step 1
The first step in the analytic process involves
getting the data organized
      Normally this involves creating an electronic
      spreadsheet in a computer

      Variously called a data set or database

      Creation of a database code sheet

Step 2
Prior to the analysis of the data it is critical to
examine the database to determine

      The quality of the data, and

      The distributional dynamics of the variables

The validity of the statistical results of
subsequent analyses will be no better than the
quality of the data and the analyst’s
understanding of the variables involved




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  6


                An Example of a Database

           Criminal History Database
As It Might Appear in an Electronic Spreadsheet


  SS                    SENT                      OFF                     PA                      GEN
   1                       2                       F                       2                       0
   2                       7                       B                       5                     -9999
   3                      15                       D                       7                       0
   4                       5                       A                       1                       0
   5                      10                       R                       6                       0
   6                       3                       F                       12                      1
   7                      25                       M                       8                       0
   8                      15                       R                       4                       0
   9                       4                       A                       1                       0
  10                      30                       M                     -9999                     1
    .                       .                      .                        .                      .
    .                       .                      .                        .                      .
 100                      12                       D                       9                       0

Without a database code sheet

      This spreadsheet is uninterpretable.



 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  7


                       Database Code Sheet

An essential tool in creating a database is the
maintenance of a database code sheet

The code sheet should contain …

      The name of the database

      The source of the data

      The date created

      Code used for missing data

      The name of each variable in the database
                The code name of the variable in the
                database
                How the variable is operationally defined
                          Unit of measurement if a
                          measurement variable
                          How coded if a categorical variable




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  8


                   Sample Database Code Sheet

Name:           Criminal History Database

Source: Random sample from the state prison population
Date Created:                       January 14, 1999

Missing Data Code:                                      -9999


 Variable                Code                          Operational Definition
                         Name

Prisoner               SS                  Arbitrarily numbered
                                           sequentially from 1 to 100

Length of              SENT                Measured in years
sentence

Offense                OFF                 Most serious offense involved in
                                           the current conviction

                                           F=fraud, B=burglary, D=drugs,
                                           etc.

Prior                  PA                  Actual number of priors
Arrests

Gender                 GEN                 Male=0, Female=1




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  9


                    Database Nomenclature

The goal of science is to explain change

A variable
      Anything that can or does change

Different ways a variable can change
      Within a subject
      Between subjects

Symbols for variables
      X         independent variable (X1, X2, ... Xk)
      Y         dependent variable

Unit of analysis
      The “thing” studied, e.g. police precincts
                The variables: number of homicides,
                robberies, assaults, etc.




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                   10


Database ( con'd )

A case or subject
                 An individual unit of analysis,
                           e.g. 2nd precinct

A value or datum ( pl. data)
       The specific value of a variable for a given
       case, e.g. 6 homicides in the 2nd precinct

Sample size

       The total number of cases or subjects (N)

An example

       The study of homicide
                 Variable: annual number of homicides
                 Unit of analysis: police precincts
                 Case: 2nd precinct
                 Value or datum:
                           Frequency of homicides in the
                           2nd precinct,
                           f = 6 homicides



  Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                 11


                            Sample Database

                Annual Frequency of Homicides



              Precincts (subjects)                              Frequency of
                                                                Homicides (f)

                               1st                                          3

                               2nd                                          6

                               3rd                                          4

                               4th                                          0

                               5th                                          5

                               6th                                          2

                               7th                                          1

                               8th                                          2

                             Total                                         23




Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  12


                            Types of Variables

Measurement variables
      Also called metric variables
      Change across a continuum
      Examples: age, number of prior arrests, time
      to disposition of a case, length of sentence

Categorical variables
      Also called nonmetric variables
      Change from one categorical condition to
      another
      Examples: gender, verdict, race, type of
      offence, marital status

Measurement and categorical variables can be
measured on different scales of measurement:
      Nominal
      Ordinal
      Interval
      Ratio



 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  13


                     Scales of Measurement
Nominal Scale of Measurement
      Used to measure a categorical variable in which there
      is no hierarchical distinction among the categories
      Examples: gender, race, names of states or
      people, zip codes, numbers of precincts

Ordinal Scale of Measurement
      Used to measure a categorical variable in
      which there is a hierarchical distinction among the
      categories
      Examples: grades A, B, & C, academic and military
      ranks, trustee classifications

Interval Scale of Measurement
      Used to measure a measurement variable
      where zero is an arbitrary value
      Examples: Fahrenheit temperature, IQ, numeric grade
      on a test

Ratio Scale of Measurement

      Used to measure a measurement variable
      where zero means “nothing”

      Examples: income, time, weight, length, speed




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  14


                           Examining the Data
Each variable should be explored prior to
analyzing the data


Q What does one look for?
                Missing data

                Outliers

                Central tendency: what is typical?

                Variability: how different are the cases?

                Frequency of different case values

                Skew: degree of asymmetry

                Kurtosis: degree of “peakedness”

                Modality: where is the center of the data




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  15


             The Problem of Missing Data

Missing data can be very troublesome in the
analysis of the data

If the data on a single variable are missing on
one or more cases

      The cases must be eliminated from the
      analysis

If a database with more than one variable has
missing data on various subjects across several
variables

      Either all cases or all variables with one or
      more missing values must be eliminated
      from the analysis

      Depending upon the pattern of the missing
      data, relatively few missing values can
      render the database relatively unanalyzable




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  16


               Database with Missing Data
 10 x 3 matrix: 30 cells, 6 missing values (20%)


     SS                               X1                           X2                            X3
      1                                                            14                            32
      2                               2                            19
      3                               5                            23                            34
      4                              12                            18                            39
      5                               9                                                          41
      6                               1                            12                            36
      7                               9                            15
      8                               6                                                          48
      9                              10                            10                            42
     10                                                            17                            31

If variables with missing data are eliminated,
there is nothing left to analyze

If cases with missing data are eliminated, there
will be a 20% reduction in the database
jeopardizing the external validity, reducing the
power of the statistics used, and risking Type I
or II errors.




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                   17


                          Finding Missing Data

Calculate the number of cases for each variable

In this example, the database is a 70 x 3 matrix
with 210 cells.

70 subjects and 3 variables, sentence, age, & IQ

If subjects with missing data are eliminated,
there are only 58 valid cases (82.9%).


Number of valid observations (listwise) =                                  58.00


Variable                   Mean          Std Dev Minimum Maximum                                   N Label

SENTENCE                     5.87           5.00                 1.00          25.00                       63
AGE                         23.09           3.99                18.00         36.00                        64
IQ                          93.36           15.91               69.00         188.00                       70




If the variables with missing data are eliminated,
there is only one variable left, IQ (33.3%).



  Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  18


                Correcting for Missing Data

 Go back to the original source of the data and
  retrieve the missing data

 Impute the missing values in a variable by
  substituting the mean or median of the
  variable

 Find another case identical or very similar to
  the case with missing values. Use the known
  values of the former case for the missing
  values of the latter case

 From cases with complete data, build a
  statistical model to predict values of the
  variable with missing data use the model to
  predict the missing data




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  19


                     The Problem of Outliers

An outlier is any extreme value of a variable

An outlier may be a valid value, or …
   It could be a recording error at the source
    of the data, or
   A data entry error

Consider the data entry error on the age of a
juvenile delinquent …
   Instead of entering age 17, 71 is mistakenly
    entered
   This type of error can cause havoc in any
    subsequent analysis of the data

While a case may not be a univariate outlier
   It can be a multivariate outlier
   Being 10 years old or being a Ph.D. may not
    be extreme cases
                But being a 10-year-old Ph.D. is




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                   20


                                  Finding Outliers

Univariate Analysis


       Compute the mean, minimum and maximum
       values of the variable

Number of valid observations (listwise) =                                  64.00

Variable                    Mean          Std Dev Minimum Maximum                                   N

SENTENCE                     6.72           9.54             -5.00             70.00              64




       A mean sentence of -5 year is impossible.

       A sentence of 70 years, given a mean of 6.72
       years, is suspicious




  Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                   21


Finding outliers: univariate ( con'd )

             Box whisker plot of sentences
  30


                                                            60




  20                                                        59
                                                            58
                                                            57




  10




    0




  -10
        N=                                             70

                                                   SENTENCE




             Stem & leaf plot of sentences


Frequency        Stem & Leaf

8.00          0 * 11111111
19.00         0 t 2222222222333333333
14.00         0 f 44444444555555
10.00         0 s 6666667777
7.00          0 . 8888889
3.00          1 * 001
2.00          1 t 22
1.0        1 f 455
2.0
4.00 Extremes (17), (18), (20), (25)
Stem width: 10.00
Each leaf:     1 case(s)




  Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                   22



Finding outliers ( con'd )


Bivariate Analysis: scatterplot of one variable
against another, e.g. sentence by age
         Senterce

    28


    22


    16


    10


     4


    -2
         -2              2                 6                 10                14               18                  22
                                                       Age

                       Sample A: no apparent outliers
         Sentence



    28


    22


    16


    10


     4


    -2
         16              20                24                28                32               36                  40



                              Sample B: possible outlier




  Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  23


             Examining Central Tendency
                   And Variability

Statistical procedures make assumptions about
the central tendency and variability of the
variables used in analysis

                        Sentences of 70 offenders

                                    Valid cases:                 70

                                    Missing cases:                  0

                                    Mean               5.9571

                                    Median             4.5000

                                    Variance 24.5344

                                    Std Dev            4.9532

                                    Range               24.0000

                                    IQR 6.0000




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  24


 Examining the Frequency of a Variable
Frequency: the number of times an event occurs
in a given category, e.g. 6 homicides in the 2nd
precinct

            Frequency of homicides by precinct

                        Precincts                                 Frequency of
                                                                  Homicides (f)

                               1st                                            3

                               2nd                                            6

                               3rd                                            4

                               4th                                            0

                               5th                                            5

                               6th                                            2

                               7th                                            1

                               8th                                            2

                            Total                                           23




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                       25




Graphic Depictions of a Frequency Table
    Nu mb er o f Precin cts
3




2




1




0
           0                  1                2               3                4                   5              6
                                                          HOMICIDE

                              Histogram of the number of homicides




                                      Precin ct 8 : 9%
                                                                         Precin ct 1 : 13 %
                      Precin ct 7 : 4%


                 Precin ct 6 : 9%




                                                                                              Precin ct 2 : 26 %




                 Precin ct 5 : 22 %




                                                         Precin ct 3 : 17 %


                                  Pie chart of homicide by precinct




Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  26


                            Examining for Skew
Skew: the degree of asymmetry in the
distribution
      Skew may be either right (+) or left (-)

 12   Freq uen cy

 10

  8

  6

  4

  2

  0
      1       2     3   4     5     6        7     8       9   10       11   12   14   15   17   18   20    25
                                                       SENTENCE

                                     Right skewed distribution



 12   Freq uen cy


 10

  8


  6


  4


  2


  0
          1         2         3          4             5            6         7        8         9         10
                                                 Drug Dependency


                                        Left skewed distribution
                                                   Drug De pen den cy




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                   27


Examining for skew ( con'd)




Stem & Leaf Plot
       Combines both a numeric and a graphic
       description of the variable

Distribution of sentences of convicted felons


Frequency               Stem & Leaf

    8.00                   0*     11111111
   19.00                   0t     2222222222333333333
   14.00                   0f     44444444555555
   10.00                   0s     6666667777
    7.00                   0.     8888889
    3.00                   1*     001
    2.00                   1t     22
   3.0                     1f     455

   4.00 Extremes                     (17), (18), (20), (25)

Stem width:                10.0
Each leaf:                 1 case(s)




  Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                    28


Examining for skew ( con'd )


Box and Whisker Plot




 30
                                                                                                 Max (25 )
                                                                                                 75 % (8)
                                                                                                 Med ia n (4 .5 )
 24                                                                                              25 % (2.5)
                                                                                                 Mi n (1 )


 18




 12




  6




  0
                                           SENTENCE

Distribution of sentences of convicted felons

The box represents the middle 50% of sentences

The point in the box represent the median sentence

The whiskers above and below the box represent the
cases that are closest to being 1.5 times the
interquartile range outside the box




  Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  29


                      Examining the Kurtosis
Kurtosis: the peakedness of the distribution
      Platykurtic: very flat, great variance
      Leptokurtic: very peaked, little variance
      Mesokurtic: in between platy- & leptokurtic

 24   Freq uen cy

 20

 16


 12


  8


  4


  0
        0        1        2        3         4      5         6       7        8         9       10        11

                                       Leptokurtic distribution

                                        Platykurtic distribution
 14   Freq uen cy

 12

 10

  8

  6

  4

  2

  0
            1             2              3               4                5            6               7




 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  30


            Examining the Modality of the
                    Distribution
Modality: the number of peaks or “high points” in
the distribution

      This distribution has several modalities




 6
     Freq uen cy

 5

 4

 3

 2

 1

 0




Strictly speaking, the mode of a distribution is
the value that appears most frequently.
      However, when there are multiple peaks,
      distributions may be called bimodal
      (2 peaks), trimodal, multimodal…


 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  31


               Reliability & Validity of Data

Reliability
      Are the data accurate? Is there error in the
      measurements?
      Sources of errors
               Human error                               Recording error

               Instrument error                          Data entry error

               Record keeping      Missing data
              error: incomplete,
              untimely information


Validity

      What is being measured?

      Consider the grade on a test. To what extent
      does it measure:

                Course content                           Study habits

                Test taking ability  Whether working

                Anxiety                                  IQ



 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                 32


                     Population and Sample

     The Population
                                              = parameter = the population mean




                                   




     sampling                                                   inference or
     method                                                     generalization




               The Sample
                                                                          X
               X = statistic = sample mean




Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
                                                                                                                  33


                            Levels of Statistics

Univariate Statistics

      Tools to examine a single variable (X)

      Examples: frequency, percentage,
      mean, standard deviation, histogram

Bivariate Statistics

      Tools for determining whether one variable
      is related to another (X    Y)

      Examples: correlation coefficient, chi-
      square, t-test, one-way analysis of variance

Multivariate Statistics

      Tools for determining whether two or more
      predictor variables are related to a
      dependent variable (X1, X2, … Xk       Y)

      Examples: multiple regression, discriminant
      analysis, log-linear analysis



 Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University

						
Related docs
Other docs by hcj