# Database Sample Problems

Document Sample

```					Creating & Examining
a Database
Creating an SPSS database

Database nomenclature

Graphic and statistical tools for
examining a database
2

KEY CONCEPTS
*****
Organizing Data

Elements to be included in a database code sheet
Concept of a variable
Different ways a variable can change
Independent and dependent variables (X & Y)
Unit of analysis
A case
Datum, data, a value
Sample size (N)
Measurement (metric) v categorical (nonmetric) variables
Scales of measurement
Nominal
Ordinal
Interval
Ration
Examining variables:
Missing data
Outliers
Central tendency
Variability
Frequency (f)
Skew
Kurtosis
Modality
Problems created by missing data
How to identify missing data
Ways to correct the problem of missing data
Problems created by outliers
Techniques for identifying outliers
Mean, minimum & maximum values
Box Whisker plot
Stem & Leaf plot
Scatterplot
Frequencies (f), proportions (p), and percentages (P)
Histogram
Pie chart
Creating intervals for a measurement variable

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
3

Organizing data ( con'd )

Optimal number of intervals in a frequency table
Skew: right (+) and left (-)
Tails of a distribution
Kurtosis
Leptokurtosis
Mesokurtosis
Platykurtosis
Modality
Unimodal
Bimodal
Poly- or multimodal
Reliability of a measure
Validity of a measure
Different types of statistics
Univariate
Bivariate
Multivariate
Population v sample
Parameter v statistic

*****

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
4

Lecture Outline

 Organizing the data and the concept
of a database

 Database code sheet

 Database nomenclature

 Types of variables and scales of
measurement

 Examining the data: graphic and
statistical tools

 Reliability & validity of the measurements

 Population & sample

 Levels of statistics

*****

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
5

Getting the Data Organized
Step 1
The first step in the analytic process involves
getting the data organized
Normally this involves creating an electronic

Variously called a data set or database

Creation of a database code sheet

Step 2
Prior to the analysis of the data it is critical to
examine the database to determine

The quality of the data, and

The distributional dynamics of the variables

The validity of the statistical results of
subsequent analyses will be no better than the
quality of the data and the analyst’s
understanding of the variables involved

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
6

An Example of a Database

Criminal History Database
As It Might Appear in an Electronic Spreadsheet

SS                    SENT                      OFF                     PA                      GEN
1                       2                       F                       2                       0
2                       7                       B                       5                     -9999
3                      15                       D                       7                       0
4                       5                       A                       1                       0
5                      10                       R                       6                       0
6                       3                       F                       12                      1
7                      25                       M                       8                       0
8                      15                       R                       4                       0
9                       4                       A                       1                       0
10                      30                       M                     -9999                     1
.                       .                      .                        .                      .
.                       .                      .                        .                      .
100                      12                       D                       9                       0

Without a database code sheet

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
7

Database Code Sheet

An essential tool in creating a database is the
maintenance of a database code sheet

The code sheet should contain …

The name of the database

The source of the data

The date created

Code used for missing data

The name of each variable in the database
The code name of the variable in the
database
How the variable is operationally defined
Unit of measurement if a
measurement variable
How coded if a categorical variable

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
8

Sample Database Code Sheet

Name:           Criminal History Database

Source: Random sample from the state prison population
Date Created:                       January 14, 1999

Missing Data Code:                                      -9999

Variable                Code                          Operational Definition
Name

Prisoner               SS                  Arbitrarily numbered
sequentially from 1 to 100

Length of              SENT                Measured in years
sentence

Offense                OFF                 Most serious offense involved in
the current conviction

F=fraud, B=burglary, D=drugs,
etc.

Prior                  PA                  Actual number of priors
Arrests

Gender                 GEN                 Male=0, Female=1

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
9

Database Nomenclature

The goal of science is to explain change

A variable
Anything that can or does change

Different ways a variable can change
Within a subject
Between subjects

Symbols for variables
X         independent variable (X1, X2, ... Xk)
Y         dependent variable

Unit of analysis
The “thing” studied, e.g. police precincts
The variables: number of homicides,
robberies, assaults, etc.

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
10

Database ( con'd )

A case or subject
An individual unit of analysis,
e.g. 2nd precinct

A value or datum ( pl. data)
The specific value of a variable for a given
case, e.g. 6 homicides in the 2nd precinct

Sample size

The total number of cases or subjects (N)

An example

The study of homicide
Variable: annual number of homicides
Unit of analysis: police precincts
Case: 2nd precinct
Value or datum:
Frequency of homicides in the
2nd precinct,
f = 6 homicides

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
11

Sample Database

Annual Frequency of Homicides

Precincts (subjects)                              Frequency of
Homicides (f)

1st                                          3

2nd                                          6

3rd                                          4

4th                                          0

5th                                          5

6th                                          2

7th                                          1

8th                                          2

Total                                         23

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
12

Types of Variables

Measurement variables
Also called metric variables
Change across a continuum
Examples: age, number of prior arrests, time
to disposition of a case, length of sentence

Categorical variables
Also called nonmetric variables
Change from one categorical condition to
another
Examples: gender, verdict, race, type of
offence, marital status

Measurement and categorical variables can be
measured on different scales of measurement:
Nominal
Ordinal
Interval
Ratio

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
13

Scales of Measurement
Nominal Scale of Measurement
Used to measure a categorical variable in which there
is no hierarchical distinction among the categories
Examples: gender, race, names of states or
people, zip codes, numbers of precincts

Ordinal Scale of Measurement
Used to measure a categorical variable in
which there is a hierarchical distinction among the
categories
ranks, trustee classifications

Interval Scale of Measurement
Used to measure a measurement variable
where zero is an arbitrary value
Examples: Fahrenheit temperature, IQ, numeric grade
on a test

Ratio Scale of Measurement

Used to measure a measurement variable
where zero means “nothing”

Examples: income, time, weight, length, speed

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
14

Examining the Data
Each variable should be explored prior to
analyzing the data

Q What does one look for?
Missing data

Outliers

Central tendency: what is typical?

Variability: how different are the cases?

Frequency of different case values

Skew: degree of asymmetry

Kurtosis: degree of “peakedness”

Modality: where is the center of the data

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
15

The Problem of Missing Data

Missing data can be very troublesome in the
analysis of the data

If the data on a single variable are missing on
one or more cases

The cases must be eliminated from the
analysis

If a database with more than one variable has
missing data on various subjects across several
variables

Either all cases or all variables with one or
more missing values must be eliminated
from the analysis

Depending upon the pattern of the missing
data, relatively few missing values can
render the database relatively unanalyzable

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
16

Database with Missing Data
10 x 3 matrix: 30 cells, 6 missing values (20%)

SS                               X1                           X2                            X3
1                                                            14                            32
2                               2                            19
3                               5                            23                            34
4                              12                            18                            39
5                               9                                                          41
6                               1                            12                            36
7                               9                            15
8                               6                                                          48
9                              10                            10                            42
10                                                            17                            31

If variables with missing data are eliminated,
there is nothing left to analyze

If cases with missing data are eliminated, there
will be a 20% reduction in the database
jeopardizing the external validity, reducing the
power of the statistics used, and risking Type I
or II errors.

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
17

Finding Missing Data

Calculate the number of cases for each variable

In this example, the database is a 70 x 3 matrix
with 210 cells.

70 subjects and 3 variables, sentence, age, & IQ

If subjects with missing data are eliminated,
there are only 58 valid cases (82.9%).

Number of valid observations (listwise) =                                  58.00

Variable                   Mean          Std Dev Minimum Maximum                                   N Label

SENTENCE                     5.87           5.00                 1.00          25.00                       63
AGE                         23.09           3.99                18.00         36.00                        64
IQ                          93.36           15.91               69.00         188.00                       70

If the variables with missing data are eliminated,
there is only one variable left, IQ (33.3%).

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
18

Correcting for Missing Data

 Go back to the original source of the data and
retrieve the missing data

 Impute the missing values in a variable by
substituting the mean or median of the
variable

 Find another case identical or very similar to
the case with missing values. Use the known
values of the former case for the missing
values of the latter case

 From cases with complete data, build a
statistical model to predict values of the
variable with missing data use the model to
predict the missing data

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
19

The Problem of Outliers

An outlier is any extreme value of a variable

An outlier may be a valid value, or …
 It could be a recording error at the source
of the data, or
 A data entry error

Consider the data entry error on the age of a
juvenile delinquent …
 Instead of entering age 17, 71 is mistakenly
entered
 This type of error can cause havoc in any
subsequent analysis of the data

While a case may not be a univariate outlier
 It can be a multivariate outlier
 Being 10 years old or being a Ph.D. may not
be extreme cases
But being a 10-year-old Ph.D. is

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
20

Finding Outliers

Univariate Analysis

Compute the mean, minimum and maximum
values of the variable

Number of valid observations (listwise) =                                  64.00

Variable                    Mean          Std Dev Minimum Maximum                                   N

SENTENCE                     6.72           9.54             -5.00             70.00              64

A mean sentence of -5 year is impossible.

A sentence of 70 years, given a mean of 6.72
years, is suspicious

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
21

Finding outliers: univariate ( con'd )

Box whisker plot of sentences
30

60

20                                                        59
58
57

10

0

-10
N=                                             70

SENTENCE

Stem & leaf plot of sentences

Frequency        Stem & Leaf

8.00          0 * 11111111
19.00         0 t 2222222222333333333
14.00         0 f 44444444555555
10.00         0 s 6666667777
7.00          0 . 8888889
3.00          1 * 001
2.00          1 t 22
1.0        1 f 455
2.0
4.00 Extremes (17), (18), (20), (25)
Stem width: 10.00
Each leaf:     1 case(s)

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
22

Finding outliers ( con'd )

Bivariate Analysis: scatterplot of one variable
against another, e.g. sentence by age
Senterce

28

22

16

10

4

-2
-2              2                 6                 10                14               18                  22
Age

Sample A: no apparent outliers
Sentence

28

22

16

10

4

-2
16              20                24                28                32               36                  40

Sample B: possible outlier

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
23

Examining Central Tendency
And Variability

the central tendency and variability of the
variables used in analysis

Sentences of 70 offenders

Valid cases:                 70

Missing cases:                  0

Mean               5.9571

Median             4.5000

Variance 24.5344

Std Dev            4.9532

Range               24.0000

IQR 6.0000

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
24

Examining the Frequency of a Variable
Frequency: the number of times an event occurs
in a given category, e.g. 6 homicides in the 2nd
precinct

Frequency of homicides by precinct

Precincts                                 Frequency of
Homicides (f)

1st                                            3

2nd                                            6

3rd                                            4

4th                                            0

5th                                            5

6th                                            2

7th                                            1

8th                                            2

Total                                           23

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
25

Graphic Depictions of a Frequency Table
Nu mb er o f Precin cts
3

2

1

0
0                  1                2               3                4                   5              6
HOMICIDE

Histogram of the number of homicides

Precin ct 8 : 9%
Precin ct 1 : 13 %
Precin ct 7 : 4%

Precin ct 6 : 9%

Precin ct 2 : 26 %

Precin ct 5 : 22 %

Precin ct 3 : 17 %

Pie chart of homicide by precinct

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
26

Examining for Skew
Skew: the degree of asymmetry in the
distribution
Skew may be either right (+) or left (-)

12   Freq uen cy

10

8

6

4

2

0
1       2     3   4     5     6        7     8       9   10       11   12   14   15   17   18   20    25
SENTENCE

Right skewed distribution

12   Freq uen cy

10

8

6

4

2

0
1         2         3          4             5            6         7        8         9         10
Drug Dependency

Left skewed distribution
Drug De pen den cy

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
27

Examining for skew ( con'd)

Stem & Leaf Plot
Combines both a numeric and a graphic
description of the variable

Distribution of sentences of convicted felons

Frequency               Stem & Leaf

8.00                   0*     11111111
19.00                   0t     2222222222333333333
14.00                   0f     44444444555555
10.00                   0s     6666667777
7.00                   0.     8888889
3.00                   1*     001
2.00                   1t     22
3.0                     1f     455

4.00 Extremes                     (17), (18), (20), (25)

Stem width:                10.0
Each leaf:                 1 case(s)

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
28

Examining for skew ( con'd )

Box and Whisker Plot

30
Max (25 )
75 % (8)
Med ia n (4 .5 )
24                                                                                              25 % (2.5)
Mi n (1 )

18

12

6

0
SENTENCE

Distribution of sentences of convicted felons

The box represents the middle 50% of sentences

The point in the box represent the median sentence

The whiskers above and below the box represent the
cases that are closest to being 1.5 times the
interquartile range outside the box

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
29

Examining the Kurtosis
Kurtosis: the peakedness of the distribution
Platykurtic: very flat, great variance
Leptokurtic: very peaked, little variance
Mesokurtic: in between platy- & leptokurtic

24   Freq uen cy

20

16

12

8

4

0
0        1        2        3         4      5         6       7        8         9       10        11

Leptokurtic distribution

Platykurtic distribution
14   Freq uen cy

12

10

8

6

4

2

0
1             2              3               4                5            6               7

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
30

Examining the Modality of the
Distribution
Modality: the number of peaks or “high points” in
the distribution

This distribution has several modalities

6
Freq uen cy

5

4

3

2

1

0

Strictly speaking, the mode of a distribution is
the value that appears most frequently.
However, when there are multiple peaks,
distributions may be called bimodal
(2 peaks), trimodal, multimodal…

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
31

Reliability & Validity of Data

Reliability
Are the data accurate? Is there error in the
measurements?
Sources of errors
 Human error                               Recording error

 Instrument error                          Data entry error

 Record keeping      Missing data
error: incomplete,
untimely information

Validity

What is being measured?

Consider the grade on a test. To what extent
does it measure:

 Course content                           Study habits

 Test taking ability  Whether working

 Anxiety                                  IQ

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
32

Population and Sample

The Population
 = parameter = the population mean



sampling                                                   inference or
method                                                     generalization

The Sample
X
X = statistic = sample mean

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University
33

Levels of Statistics

Univariate Statistics

Tools to examine a single variable (X)

Examples: frequency, percentage,
mean, standard deviation, histogram

Bivariate Statistics

Tools for determining whether one variable
is related to another (X    Y)

Examples: correlation coefficient, chi-
square, t-test, one-way analysis of variance

Multivariate Statistics

Tools for determining whether two or more
predictor variables are related to a
dependent variable (X1, X2, … Xk       Y)

Examples: multiple regression, discriminant
analysis, log-linear analysis

Creating & Examining a Database: Charles M. Friel Ph.D., Criminal Justice Center, Sam Houston State University

```
DOCUMENT INFO
Shared By:
Categories:
Stats:
 views: 22 posted: 1/19/2011 language: English pages: 33
Description: Database Sample Problems document sample
How are you planning on using Docstoc?