# Stat 305 - Lab 1

```
Stat 305 - Lab 1

Goal: Learn how to compute basic summary statistics using STATA and to
interpret the results. Learn how to plot basic graphs and interpret the results.

Qm* This is how the problems you must answer are designated: they are
labeled 'Qm*' and shown in bold, where m is the problem number.

Information about the dataset used for this lab:
158 fish of 7 species were caught and measured. There are 3 variables:
species, weight and length. All the fish were caught from the same lake
(Laengelmavesi) near Tampere in Finland.

Species
Code:
1 Bream
2 Whitefish
3 Roach
4 Parkki
5 Smelt
6 Pike
7 Perch
Weight is measured in grams (g).
Length is measured in centimeters (cm).

*You may want to open a new Word document to record your results and graphs

0. The STATA environment:

Open STATA.
Familiarize yourself with the STATA workspace.
The following windows open:
• Stata Results: displays results
• Commands: for typing in commands; we won't use this. If you want to
  use the command window, 'PageUp' and 'PageDown' scroll through
  the previous commands.
• Review: gives a history of your commands and actions
• Variables: displays the variables of your data set

For this class, the only windows that you will need to use are the 'Stata Results'
and the 'Variables' windows because we will focus on using the pull-down
menus. The commands that you could also use will appear in the 'Stata Results'
window. You may use these commands in the 'Commands' window if you like.

1. Entering Data:

Save spreadsheet as 'text (Tab delimited)', e.g. as fishcatch1.txt
Import data into STATA:
In the pull down menus, click on 'File / Import / ASCII data created by
Enter filename, fishcatch1.txt
Choose the tab delimited option, then click 'OK'

The variables in your spreadsheet will appear in the "Variables" window. The left
column gives the variable names in STATA and the right gives the original
variable names.

2. Finding Summary Statistics: (mean, std.dev., min, max,
percentiles)

From the pull-down menus, go to 'Statistics / Summaries, tables, & tests /
Summary statistics / Summary statistics'.
You will only need to use the 'Main' tab for now:
enter the variables of interest, choose the options in which you are
interested, and click 'OK'.

The results from the analysis will appear in the 'Stata Results' window.

Q1* Find the mean, standard deviation, and the Five Number Summary
(minimum, maximum, and quartiles) for the variables weight and length.

Q2* Which is a better measure of center for weight? For length? Explain.
Does it make sense to find the same statistics from Q1 for the Species
variable? Explain. What other statistic could you use as a measure of
center?

3. Graphing:
Go to 'Graphics'

a) Histogram:
Select 'Histogram'
You will mostly need to use the main tab:
In the options:
Type in the variable of interest (e.g. weight)
Under 'Y - axis', select 'Frequency' or 'Density'
You can designate number of bins or width of bins, but not
both (because one is determined by the other). If you do not
designate either, STATA chooses a default method (not
Sturges' rule) for deciding the number of bins.
     The title tab allows you to give the graph a title; there are also
tabs to label the x and y axes.

The histogram is a way of summarizing quantitative data (either discrete or
continuous). It is often used in exploratory data analysis to illustrate the major
features of the distribution of the data.

A histogram is constructed by dividing up the range of possible values in a data
set into nonoverlapping intervals or classes called bins and then counting the
number of observations that fall into each bin. The length of the interval is called
the bin width and the number of observations that fall into a bin is called the bin
count. A good rule of thumb for choosing the number of bins (and thereby
determining the bin width) is Sturges' rule. This rule gives the number of bins, K,
as K = 1 + 3.322*log10(n), where n is the number of observations. Making a
histogram is a bit more art than science, so while there is a theoretical
foundation for histograms, rules like Sturges' rule are meant as guides. There are
many other rules for bin width. In general, it is best to round up the number of
bins (and/or the bin width) to be sure to include all the data points.

Once the number of bins is chosen, we can determine the bin width, h, where

    h = (x(n) - x(1)) / K.

The maximum data point is denoted by x(n), and the minimum data point is
denoted by x(1). A rectangle is constructed over each bin, with a base length
equal to the bin width; unlike in a bar chart, there is no space between the
rectangles. The height of each rectangle is proportional to the number of
observations falling into that bin. You can either use the bin count (also
called the frequency) or the percentage of observations that fall in the bin (the
relative frequency) as the height of the bin. In either case (using frequency or
relative frequency), the graph of the histogram looks the same; the difference is
in the scale of and the information contained in the Y-axis. For data from a
continuous random variable, a histogram using the relative frequency gives a
graphical estimate of the probability distribution.

A histogram can also help detect any unusual observations (outliers), or any
gaps in the data set; these outliers may result in some empty bins. An outlier is
an observation in a data set which is far removed in value from the others in the
data set. It is an unusually large or an unusually small value compared to the
others. An outlier might be the result of an error in measurement, in which case
it will distort the interpretation of the data, having undue influence on many
summary statistics, for example, the mean. If an outlier is a genuine result, it is
important because it might indicate an extreme of behavior of the process under
study. For this reason, all outliers must be examined carefully before beginning
any formal analysis. Outliers should not routinely be removed without further
justification.

Q3* Using a calculator, calculate the number of bins for a histogram of the
'length' and 'weight' variables using Sturges' rule. Calculate the bin width.

Symmetry and skewness of data sets:

Symmetry is implied when data values are distributed in the same way above
and below the middle of the sample. You can identify a symmetric probability
distribution when the part of the distribution on one side of the midpoint looks like
the mirror image of the other.

[Figure: Example of a symmetric probability distribution (f(x) plotted against x)]

Skewness is defined as asymmetry in the distribution of the sample data values.
Values on one side of the distribution tend to be further from the 'middle' than
values on the other side. For skewed data, the usual measures of location will
give different values; for example, median < mean would indicate positive (or
right) skewness. In other words, a distribution is skewed to the right if the right tail
(larger values) is much longer than the left one (smaller values). Similarly, a
distribution is skewed to the left if the left tail is much longer than the right one.
Positive (or right) skewness is more common than negative (or left) skewness.
In the case of skewed data, one would most likely report the mean in addition to
the median as the measures of center. In the case of symmetric data, the mean
and median are nearly or exactly equal.

[Figure: Example of a (right) skewed probability distribution (f(x) plotted against x), with the right tail of the distribution labeled]

[Figure: Example of a (left) skewed probability distribution (f(x) plotted against x), with the left tail of the distribution labeled]

Q4* Plot a histogram for weight and for length using the relative frequency
(choose 'Density' in the options); use Sturges' rule to enter the number of
bins in the options. (Copy and paste each graph before going on to the
next because Stata copies over them.) Give titles to your histograms, e.g.
"Histogram of Weight, using Sturges' rule".

Q5* Plot a histogram for weight and for length using the relative frequency
(choose 'Density' in the options); use the Stata default (don't check or enter
the number of bins in the options). (Copy and paste each graph before
going on to the next because Stata copies over them.) How many bins
does the default give?

In the future, when plotting histograms, you may use the STATA default or
Sturges' rule, whichever you prefer.

Q6* Describe the distributions of the data. How are they different/similar?
What is the overall shape of each one? Where are the mean and the
median with respect to the bulk of the data? Are there any outliers in the
data? What do the outliers signify (in terms of the variable; remember
there are seven different species of fish)?

Q7* In light of the discussion of symmetry and skewness, which is a better
measure of center for weight? For length? Explain.

b) Box plot
Select 'Box Plot' under the Graphics menu.
You will mostly need to use the main tab:
In the options:
• Type in the variable of interest (e.g. weight)
• You don't need to specify any other options (keep the defaults; the default
  'Median Type' choice is 'line').
• The title tab allows you to give the graph a title.

A box plot is a graphical representation of the five-number summary, the set of
statistics that consists of the maximum, the minimum, and the quartiles of a data
set. A central box spans the 1st and 3rd quartiles (denoted by q1 and q3), a line in
the box marks the median, and lines extend from the box out to the smallest and
largest observations.

Q8* Plot box plots for the variables weight and length. Do not include
outliers: on the 'Outsides' tab, select 'do not plot outside values'. Judging
from the box plots, which variable's distribution is skewed, and why?

Sometimes a box plot is modified to plot suspected outliers individually. In a
modified box plot, the lines extend out from the central box only to the smallest
and largest observations that are not suspected outliers. Observations that are
more than 1.5*(q3-q1) below q1 or above q3 are plotted as individual points.

Q9* Plot modified box plots for the variables weight and length that include
outliers: on the 'Outsides' tab, unselect 'do not plot outside values'; leave
all the 'markers' at default. According to Stata, how many outliers are
detected for each variable?

c) Scatter plot
Under the Graphics menu, select 'Easy Graphs' then 'Scatter Plot'.
You will mostly need to use the main tab:
In the options:
• Type in the variable for each of the x and y axes.
• You don't need to specify any other options.
• The title tab allows you to give the graph a title.

A scatter plot shows the relationship between two quantitative variables measured
on the same subjects; one variable appears on each axis. Each subject is represented
by a point at its pair of values. The x-axis is used to denote the independent variable and the
y-axis, the dependent variable. In this plot, points are plotted but not joined. The
resulting pattern indicates the type and strength of the relationship between the
two variables.

Patterns that may arise:
• The more the points tend to cluster around a straight line, the stronger the
  linear relationship between the two variables. A scatter plot will also reveal
  a non-linear relationship between the two variables and whether or not
  there are any outliers in the data.

[Figure: Nonlinear association (scatter plot of y against x)]

• If the line around which the points tend to cluster runs from lower left to
  upper right, the relationship between the two variables is a positive
  association (direct).
[Figure: Positive association (scatter plot of y against x)]

• If the line around which the points tend to cluster runs from upper left to
  lower right, the relationship between the two variables is a negative
  association (inverse).
[Figure: Negative association (scatter plot of y against x)]

• If the points form a random scatter with no pattern, there is no apparent
  relationship between the two variables.
[Figure: No association (scatter plot of y against x)]

Q10* Plot a scatter plot of the length and weight variables; use length as
the independent variable (on the x-axis). Describe the plot. What is the
overall pattern? Are there any deviations from the pattern? If so, how
might the deviations be explained? What kind of relationship is there
between the two variables (i.e. a positive or a negative association)? Does
this relationship make sense? Are there any outliers?

* some of the text of this lab is from
http://www.stats.gla.ac.uk/steps/glossary/index.html

```
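
The Sturges' rule calculation from the histogram section of the lab (K = 1 + 3.322*log10(n), then h = (x(n) - x(1)) / K) can also be checked outside STATA. The following is a minimal Python sketch, not part of the lab itself; the function names `sturges_bins` and `bin_width` and the sample values are made up for illustration and are not the fish data.

```python
import math


def sturges_bins(n):
    """Number of bins K from Sturges' rule, rounded up to a whole number."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return math.ceil(1 + 3.322 * math.log10(n))


def bin_width(values, k):
    """Bin width h = (x(n) - x(1)) / K, i.e. the data range divided by K."""
    return (max(values) - min(values)) / k


# Made-up illustrative values (NOT the lab's fish measurements).
sample = [10, 14, 21, 27, 33, 50]

k = sturges_bins(len(sample))  # 1 + 3.322*log10(6) is about 3.59, rounded up
h = bin_width(sample, k)       # (50 - 10) / k
print(k, h)                    # prints: 4 10.0
```

Rounding K up, as the lab recommends, ensures that the bins cover the whole range of the data.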
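
The modified box plot rule from the box plot section (points more than 1.5*(q3-q1) beyond the quartiles are plotted individually) can be sketched the same way. This is a hedged illustration only: the helper names `outlier_fences` and `outside_values` are hypothetical, the numbers are made up, and the quartiles are passed in as arguments (for example, as read from the summary statistics output) because software packages compute quartiles with slightly different conventions.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) cutoffs q1 - 1.5*IQR and q3 + 1.5*IQR."""
    iqr = q3 - q1  # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outside_values(values, q1, q3):
    """Return the values lying beyond the fences (suspected outliers)."""
    low, high = outlier_fences(q1, q3)
    return [v for v in values if v < low or v > high]


# Made-up illustrative values and quartiles (NOT the lab's fish data).
values = [2, 11, 13, 15, 18, 19, 41]
print(outlier_fences(10, 20))          # prints: (-5.0, 35.0)
print(outside_values(values, 10, 20))  # prints: [41]
```

A flagged value is only a suspected outlier; as the lab notes, it should be examined rather than removed routinely.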