Creating _ Examining a Database
Document Sample


Creating & Examining
a Database
Creating an SPSS database
Database nomenclature
Graphic and statistical tools for
examining a database
2
KEY CONCEPTS
*****
Organizing Data
Data set, database, spreadsheet
Elements to be included in a database code sheet
Concept of a variable
Different ways a variable can change
Independent and dependent variables (X & Y)
Unit of analysis
A case
Datum, data, a value
Sample size (N)
Measurement (metric) v categorical (nonmetric) variables
Scales of measurement
Nominal
Ordinal
Interval
Ration
Examining variables:
Missing data
Outliers
Central tendency
Variability
Frequency (f)
Skew
Kurtosis
Modality
Problems created by missing data
How to identify missing data
Ways to correct the problem of missing data
Problems created by outliers
Techniques for identifying outliers
Mean, minimum & maximum values
Box Whisker plot
Stem & Leaf plot
Scatterplot
Frequencies (f), proportions (p), and percentages (P)
Histogram
Pie chart
Creating intervals for a measurement variable
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
3
Organizing data ( con'd )
Optimal number of intervals in a frequency table
Skew: right (+) and left (-)
Tails of a distribution
Kurtosis
Leptokurtosis
Mesokurtosis
Platykurtosis
Modality
Unimodal
Bimodal
Poly- or multimodal
Reliability of a measure
Validity of a measure
Different types of statistics
Univariate
Bivariate
Multivariate
Population v sample
Parameter v statistic
*****
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
4
Lecture Outline
Organizing the data and the concept
of a database
Database code sheet
Database nomenclature
Types of variables and scales of
measurement
Examining the data: graphic and
statistical tools
Reliability & validity of the measurements
Population & sample
Levels of statistics
*****
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
5
Getting the Data Organized
Step 1
The first step in the analytic process involves
getting the data organized
Normally this involves creating an electronic
spreadsheet in a computer
Variously called a data set or database
Creation of a database code sheet
Step 2
Prior to the analysis of the data it is critical to
examine the database to determine
The quality of the data, and
The distributional dynamics of the variables
The validity of the statistical results of
subsequent analyses will be no better than the
quality of the data and the analyst’s
understanding of the variables involved
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
6
An Example of a Database
Criminal History Database
As It Might Appear in an Electronic Spreadsheet
SS SENT OFF PA GEN
1 2 F 2 0
2 7 B 5 -9999
3 15 D 7 0
4 5 A 1 0
5 10 R 6 0
6 3 F 12 1
7 25 M 8 0
8 15 R 4 0
9 4 A 1 0
10 30 M -9999 1
. . . . .
. . . . .
100 12 D 9 0
Without a database code sheet
This spreadsheet is uninterpretable.
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
7
Database Code Sheet
An essential tool in creating a database is the
maintenance of a database code sheet
The code sheet should contain …
The name of the database
The source of the data
The date created
Code used for missing data
The name of each variable in the database
The code name of the variable in the
database
How the variable is operationally defined
Unit of measurement if a
measurement variable
How coded if a categorical variable
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
8
Sample Database Code Sheet
Name: Criminal History Database
Source: Random sample from the state prison population
Date Created: January 14, 1999
Missing Data Code: -9999
Variable Code Operational Definition
Name
Prisoner SS Arbitrarily numbered
sequentially from 1 to 100
Length of SENT Measured in years
sentence
Offense OFF Most serious offense involved in
the current conviction
F=fraud, B=burglary, D=drugs,
etc.
Prior PA Actual number of priors
Arrests
Gender GEN Male=0, Female=1
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
9
Database Nomenclature
The goal of science is to explain change
A variable
Anything that can or does change
Different ways a variable can change
Within a subject
Between subjects
Symbols for variables
X independent variable (X1, X2, ... Xk)
Y dependent variable
Unit of analysis
The “thing” studied, e.g. police precincts
The variables: number of homicides,
robberies, assaults, etc.
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
10
Database ( con'd )
A case or subject
An individual unit of analysis,
e.g. 2nd precinct
A value or datum ( pl. data)
The specific value of a variable for a given
case, e.g. 6 homicides in the 2nd precinct
Sample size
The total number of cases or subjects (N)
An example
The study of homicide
Variable: annual number of homicides
Unit of analysis: police precincts
Case: 2nd precinct
Value or datum:
Frequency of homicides in the
2nd precinct,
f = 6 homicides
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
11
Sample Database
Annual Frequency of Homicides
Precincts (subjects) Frequency of
Homicides (f)
1st 3
2nd 6
3rd 4
4th 0
5th 5
6th 2
7th 1
8th 2
Total 23
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
12
Types of Variables
Measurement variables
Also called metric variables
Change across a continuum
Examples: age, number of prior arrests, time
to disposition of a case, length of sentence
Categorical variables
Also called nonmetric variables
Change from one categorical condition to
another
Examples: gender, verdict, race, type of
offence, marital status
Measurement and categorical variables can be
measured on different scales of measurement:
Nominal
Ordinal
Interval
Ratio
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
13
Scales of Measurement
Nominal Scale of Measurement
Used to measure a categorical variable in which there
is no hierarchical distinction among the categories
Examples: gender, race, names of states or
people, zip codes, numbers of precincts
Ordinal Scale of Measurement
Used to measure a categorical variable in
which there is a hierarchical distinction among the
categories
Examples: grades A, B, & C, academic and military
ranks, trustee classifications
Interval Scale of Measurement
Used to measure a measurement variable
where zero is an arbitrary value
Examples: Fahrenheit temperature, IQ, numeric grade
on a test
Ratio Scale of Measurement
Used to measure a measurement variable
where zero means “nothing”
Examples: income, time, weight, length, speed
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
14
Examining the Data
Each variable should be explored prior to
analyzing the data
Q What does one look for?
Missing data
Outliers
Central tendency: what is typical?
Variability: how different are the cases?
Frequency of different case values
Skew: degree of asymmetry
Kurtosis: degree of “peakedness”
Modality: where is the center of the data
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
15
The Problem of Missing Data
Missing data can be very troublesome in the
analysis of the data
If the data on a single variable are missing on
one or more cases
The cases must be eliminated from the
analysis
If a database with more than one variable has
missing data on various subjects across several
variables
Either all cases or all variables with one or
more missing values must be eliminated
from the analysis
Depending upon the pattern of the missing
data, relatively few missing values can
render the database relatively unanalyzable
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
16
Database with Missing Data
10 x 3 matrix: 30 cells, 6 missing values (20%)
SS X1 X2 X3
1 14 32
2 2 19
3 5 23 34
4 12 18 39
5 9 41
6 1 12 36
7 9 15
8 6 48
9 10 10 42
10 17 31
If variables with missing data are eliminated,
there is nothing left to analyze
If cases with missing data are eliminated, there
will be a 20% reduction in the database
jeopardizing the external validity, reducing the
power of the statistics used, and risking Type I
or II errors.
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
17
Finding Missing Data
Calculate the number of cases for each variable
In this example, the database is a 70 x 3 matrix
with 210 cells.
70 subjects and 3 variables, sentence, age, & IQ
If subjects with missing data are eliminated,
there are only 58 valid cases (82.9%).
Number of valid observations (listwise) = 58.00
Variable Mean Std Dev Minimum Maximum N Label
SENTENCE 5.87 5.00 1.00 25.00 63
AGE 23.09 3.99 18.00 36.00 64
IQ 93.36 15.91 69.00 188.00 70
If the variables with missing data are eliminated,
there is only one variable left, IQ (33.3%).
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
18
Correcting for Missing Data
Go back to the original source of the data and
retrieve the missing data
Impute the missing values in a variable by
substituting the mean or median of the
variable
Find another case identical or very similar to
the case with missing values. Use the known
values of the former case for the missing
values of the latter case
From cases with complete data, build a
statistical model to predict values of the
variable with missing data use the model to
predict the missing data
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
19
The Problem of Outliers
An outlier is any extreme value of a variable
An outlier may be a valid value, or …
It could be a recording error at the source
of the data, or
A data entry error
Consider the data entry error on the age of a
juvenile delinquent …
Instead of entering age 17, 71 is mistakenly
entered
This type of error can cause havoc in any
subsequent analysis of the data
While a case may not be a univariate outlier
It can be a multivariate outlier
Being 10 years old or being a Ph.D. may not
be extreme cases
But being a 10-year-old Ph.D. is
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
20
Finding Outliers
Univariate Analysis
Compute the mean, minimum and maximum
values of the variable
Number of valid observations (listwise) = 64.00
Variable Mean Std Dev Minimum Maximum N
SENTENCE 6.72 9.54 -5.00 70.00 64
A mean sentence of -5 year is impossible.
A sentence of 70 years, given a mean of 6.72
years, is suspicious
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
21
Finding outliers: univariate ( con'd )
Box whisker plot of sentences
30
60
20 59
58
57
10
0
-10
N= 70
SENTENCE
Stem & leaf plot of sentences
Frequency Stem & Leaf
8.00 0 * 11111111
19.00 0 t 2222222222333333333
14.00 0 f 44444444555555
10.00 0 s 6666667777
7.00 0 . 8888889
3.00 1 * 001
2.00 1 t 22
1.0 1 f 455
2.0
4.00 Extremes (17), (18), (20), (25)
Stem width: 10.00
Each leaf: 1 case(s)
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
22
Finding outliers ( con'd )
Bivariate Analysis: scatterplot of one variable
against another, e.g. sentence by age
Senterce
28
22
16
10
4
-2
-2 2 6 10 14 18 22
Age
Sample A: no apparent outliers
Sentence
28
22
16
10
4
-2
16 20 24 28 32 36 40
Sample B: possible outlier
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
23
Examining Central Tendency
And Variability
Statistical procedures make assumptions about
the central tendency and variability of the
variables used in analysis
Sentences of 70 offenders
Valid cases: 70
Missing cases: 0
Mean 5.9571
Median 4.5000
Variance 24.5344
Std Dev 4.9532
Range 24.0000
IQR 6.0000
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
24
Examining the Frequency of a Variable
Frequency: the number of times an event occurs
in a given category, e.g. 6 homicides in the 2nd
precinct
Frequency of homicides by precinct
Precincts Frequency of
Homicides (f)
1st 3
2nd 6
3rd 4
4th 0
5th 5
6th 2
7th 1
8th 2
Total 23
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
25
Graphic Depictions of a Frequency Table
Nu mb er o f Precin cts
3
2
1
0
0 1 2 3 4 5 6
HOMICIDE
Histogram of the number of homicides
Precin ct 8 : 9%
Precin ct 1 : 13 %
Precin ct 7 : 4%
Precin ct 6 : 9%
Precin ct 2 : 26 %
Precin ct 5 : 22 %
Precin ct 3 : 17 %
Pie chart of homicide by precinct
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
26
Examining for Skew
Skew: the degree of asymmetry in the
distribution
Skew may be either right (+) or left (-)
12 Freq uen cy
10
8
6
4
2
0
1 2 3 4 5 6 7 8 9 10 11 12 14 15 17 18 20 25
SENTENCE
Right skewed distribution
12 Freq uen cy
10
8
6
4
2
0
1 2 3 4 5 6 7 8 9 10
Drug Dependency
Left skewed distribution
Drug De pen den cy
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
27
Examining for skew ( con'd)
Stem & Leaf Plot
Combines both a numeric and a graphic
description of the variable
Distribution of sentences of convicted felons
Frequency Stem & Leaf
8.00 0* 11111111
19.00 0t 2222222222333333333
14.00 0f 44444444555555
10.00 0s 6666667777
7.00 0. 8888889
3.00 1* 001
2.00 1t 22
3.0 1f 455
4.00 Extremes (17), (18), (20), (25)
Stem width: 10.0
Each leaf: 1 case(s)
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
28
Examining for skew ( con'd )
Box and Whisker Plot
30
Max (25 )
75 % (8)
Med ia n (4 .5 )
24 25 % (2.5)
Mi n (1 )
18
12
6
0
SENTENCE
Distribution of sentences of convicted felons
The box represents the middle 50% of sentences
The point in the box represent the median sentence
The whiskers above and below the box represent the
cases that are closest to being 1.5 times the
interquartile range outside the box
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
29
Examining the Kurtosis
Kurtosis: the peakedness of the distribution
Platykurtic: very flat, great variance
Leptokurtic: very peaked, little variance
Mesokurtic: in between platy- & leptokurtic
24 Freq uen cy
20
16
12
8
4
0
0 1 2 3 4 5 6 7 8 9 10 11
Leptokurtic distribution
Platykurtic distribution
14 Freq uen cy
12
10
8
6
4
2
0
1 2 3 4 5 6 7
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
30
Examining the Modality of the
Distribution
Modality: the number of peaks or “high points” in
the distribution
This distribution has several modalities
6
Freq uen cy
5
4
3
2
1
0
Strictly speaking, the mode of a distribution is
the value that appears most frequently.
However, when there are multiple peaks,
distributions may be called bimodal
(2 peaks), trimodal, multimodal…
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
31
Reliability & Validity of Data
Reliability
Are the data accurate? Is there error in the
measurements?
Sources of errors
Human error Recording error
Instrument error Data entry error
Record keeping Missing data
error: incomplete,
untimely information
Validity
What is being measured?
Consider the grade on a test. To what extent
does it measure:
Course content Study habits
Test taking ability Whether working
Anxiety IQ
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
32
Population and Sample
The Population
= parameter = the population mean
sampling inference or
method generalization
The Sample
X
X = statistic = sample mean
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
33
Levels of Statistics
Univariate Statistics
Tools to examine a single variable (X)
Examples: frequency, percentage,
mean, standard deviation, histogram
Bivariate Statistics
Tools for determining whether one variable
is related to another (X Y)
Examples: correlation coefficient, chi-
square, t-test, one-way analysis of variance
Multivariate Statistics
Tools for determining whether two or more
predictor variables are related to a
dependent variable (X1, X2, … Xk Y)
Examples: multiple regression, discriminant
analysis, log-linear analysis
Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
Get documents about "