

									                   An Introductory Course for Stata
                                     By Dallas J. Bateman
I. Introduction

What is Stata?
         Stata is a statistical package used mostly by businesses and academic institutions. It is
widely used in economics, sociology, political science, and epidemiology. Stata is popular
in these fields because its simple point-and-click features can carry out complex statistical
analyses and produce publication-quality graphics. Stata can perform data management and
statistical analysis, produce graphics, run simulations, and even support custom programming.
For a full list of the capabilities of Stata, please refer to the following website:
         Although Stata has a point-and-click interface, this paper works with typed commands
(Stata code) rather than menus.
         There are a few different versions of Stata, depending on the type of data with which one
may work: a version for multiprocessor computers, a version for large databases, a standard
version, and a smaller version for students.
         Since Stata is not free software, the software and a license must be purchased in order to
install it on a personal computer. The software can cost around $600.00. Student versions can be
significantly cheaper.
         For additional help files on getting acquainted with Stata, please visit

Reading data into Stata
      Stata has built-in datasets with which we may work. To locate a dataset:
      1. Select File > Example Datasets….
      2. Click on Example datasets installed with Stata.
      3. Choose the dataset you would like to work with.
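
      The same datasets can also be reached from the Command window. The lines below are a
minimal sketch; they assume the auto dataset that ships with Stata, which is used here only
for illustration and is not one of the datasets analyzed later in this paper.

```stata
* List the example datasets installed with Stata
sysuse dir

* Load the auto dataset, discarding any data currently in memory
sysuse auto, clear
```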

        For outside data files, Stata can read a file from a directory on the computer or a file
from the internet. Both methods use the use command, followed by either the file path or the
web address. A path that contains spaces must be enclosed in double quotes. For example:
             use "H:\School\STAT 582\logit.dta", clear
             use, clear
       lookfor finds variables whose names or labels contain a specified keyword. This is especially
useful in large data sets with many variables. Abbreviated keywords are often the most helpful:
to find a poverty variable, type lookfor pov.
         describe reports the storage type, display format, and label of the variables you name, e.g.,
describe xvar yvar. codebook xvar yvar produces a nicely formatted codebook for those
variables, which is especially useful if you have added variable labels with the label variable
command. codebook by itself covers every variable in your data and generates a lot of output.
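
         As a hedged illustration of these three commands, the sketch below again assumes the auto
dataset shipped with Stata. The variable names (price, mpg, weight) belong to that dataset, and
the label text is made up for the example.

```stata
sysuse auto, clear

* Find variables whose name or label contains "weight"
lookfor weight

* Storage type, display format, and label for two variables
describe price mpg

* Attach a variable label, then produce a codebook entry that shows it
label variable mpg "Miles per gallon"
codebook price mpg
```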
         Once you have opened your data and are ready to begin, Stata can open help files
specific to the commands that you would like to use. For example, say you want to begin
using simple linear regression analysis, but you cannot remember the syntax for the regress
command. Typing help regress in the Command window opens the help file that explains the
syntax and options of the regress command. If you do not know the name of the command you
need, findit followed by a keyword (e.g., findit regression) searches the help files and
online resources for that keyword.
II. Common Statistical Analyses in Stata

Descriptive Statistics
       summarize gives basic descriptive statistics (number of observations, mean, standard
deviation, minimum, and maximum) for each variable listed. This is mostly useful for
continuous variables.
             summarize xvar yvar
       tabulate (or simply tab) gives a frequency distribution for your variable. This is useful
for categorical variables.
               tabulate xvar
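
       A short sketch of both commands, assuming the auto dataset shipped with Stata (price and
mpg are continuous; rep78 is categorical and has a few missing values):

```stata
sysuse auto, clear

* Number of observations, mean, standard deviation, minimum, and maximum
summarize price mpg

* The detail option adds percentiles, skewness, and kurtosis
summarize price, detail

* Frequency table for a categorical variable
tabulate rep78

* The missing option counts missing values as their own category
tabulate rep78, missing
```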

Linear Regression
         To run a linear model in Stata, we are going to use the crime dataset. The variables are
state id (sid), state name (state), violent crimes per 100,000 people (crime), percent of the
population living under the poverty line (poverty), and percent of the population that are single
parents (single). There are other variables in the dataset, but these are the ones that we will
refer to for this example.
         To load the data into Stata, type the following commands in the Command window:

use, clear
drop if sid == 51

The drop command removes Washington, DC (sid 51), since it is not a state.
         To fit a regression model, we will treat crime as the response and poverty and single as
the predictors. Typing regress crime poverty single in the Command window will
produce regression output with an ANOVA table, model fit statistics (R2, Adj R2, Root
MSE, etc.), and a table with the coefficients, their standard errors, significance tests, and
confidence intervals.
         Now suppose that an additional predictor, race, a categorical variable, were added to the
model. To let Stata know that you want indicator variables for this categorical variable, add
“i.” before the variable name in the model above:
               regress crime poverty single i.race
       It is common to obtain residuals and fitted values in order to check model assumptions such
as normality of the errors. Stata makes this simple. The following code stores the residuals and
the fitted values:
                predict res, residuals
                predict yhat
        The predict statements must come after the regress statement. The first line stores the
residuals in a new variable called res (the residuals option may be abbreviated to r). The
second line stores the fitted values, which predict computes by default, in a variable called
yhat. To look at residual plots:
                scatter res yhat
                scatter res poverty
                scatter res single
                scatter res race
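
         Because the crime data must be loaded from an outside file, the sketch below repeats the
same steps on the auto dataset shipped with Stata. The choice of price as the response and mpg,
weight, and foreign as predictors is an assumption made only for illustration.

```stata
sysuse auto, clear

* price is the response; mpg and weight are continuous predictors,
* and i.foreign enters the categorical variable foreign as an indicator
regress price mpg weight i.foreign

* predict uses the most recent fit, so it must follow regress
predict res, residuals
predict yhat, xb

* Residuals against fitted values, with a reference line at zero
scatter res yhat, yline(0)
```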
         This is a very basic run-through of regression analysis. For more information on
  checking model assumptions, checking model fit, and searching for outliers please refer to the
  following website:

  Categorical Variable Analysis
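          Tabulating two categorical variables together gives you a cross-tabulation of those
  variables. The options below add row percentages, column percentages, and a chi-squared test:
                 tabulate xvar yvar, row col chi2
          pwcorr xvar yvar, sig gives you the pairwise correlation of two continuous variables,
  along with its significance level.
          oneway xvar yvar, tabulate gives you a one-way ANOVA of a continuous variable (xvar)
  over a categorical factor (yvar), along with a table of group summary statistics.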
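
          The three commands above can be tried as follows. This is a sketch that assumes the auto
  dataset shipped with Stata, in which foreign and rep78 are categorical and price and mpg are
  continuous.

```stata
sysuse auto, clear

* Cross-tabulation with row and column percentages and a chi-squared test
tabulate foreign rep78, row col chi2

* Pairwise correlation with its significance level
pwcorr price mpg, sig

* One-way ANOVA of price (continuous) over rep78 (categorical),
* with a table of group summary statistics
oneway price rep78, tabulate
```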
          As an example of logistic regression, we are going to use a hypothetical dataset about
  getting into graduate school. The data can be loaded into Stata via the following command:
                 use, clear

          This hypothetical data set has a binary response variable called admit denoting whether
  or not a student was admitted into graduate school. There are three predictor variables: gre, gpa,
  and topnotch, a binary predictor where 1 indicates that the undergraduate institution
  was "top notch" and 0 indicates that it was not.

         tab admit topnotch       will produce a crosstab of admit and topnotch:
             |       topnotch
       admit |         0          1 |     Total
           0 |       238         35 |       273
           1 |        97         30 |       127
       Total |       335         65 |       400

  None of the cells is too small or empty (has no cases), so it is safe to run a logistic model.

                 logit admit gre topnotch gpa

         Note again that the first variable listed after the command name is automatically
  treated as the response, and all variables listed afterwards are the predictors. The logit
  command reports coefficients on the log-odds scale (the logistic command fits the same model
  but reports odds ratios instead). The logit command above will produce the following output
  (similar to that for linear regression):

Logistic regression                                             Number of obs        =          400
                                                                LR chi2(3)           =        21.85
                                                                Prob > chi2          =       0.0001
Log likelihood = -239.06481                                     Pseudo R2            =       0.0437

       admit |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
         gre |   .0024768   .0010702     2.31   0.021     .0003792    .0045744
    topnotch |   .4372236   .2918532     1.50   0.134    -.1347983    1.009245
         gpa |   .6675556   .3252593     2.05   0.040     .0300592    1.305052
       _cons | -4.600814    1.096379    -4.20   0.000    -6.749678   -2.451949
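
        The difference between coefficients and odds ratios can be seen in a small sketch. It
assumes the auto dataset shipped with Stata and treats its 0/1 variable foreign as the response,
a choice made only for illustration.

```stata
sysuse auto, clear

* logit reports coefficients on the log-odds scale
logit foreign mpg weight

* logistic fits the same model but reports odds ratios
logistic foreign mpg weight

* The or option makes logit display odds ratios as well
logit foreign mpg weight, or
```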
        Again, this is a very basic run-through of logistic regression and producing contingency
tables. For more information on this particular example please refer to the following website:

General Data Manipulations
      To restrict a command to a portion of the dataset, add an if condition:
                   scatter y x if x < 10
          This example will produce a scatterplot of y against x using only the observations
with x-values less than 10.
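
          The if qualifier leaves the data untouched, whereas keep and drop change the dataset in
memory. A minimal sketch of the difference, assuming the auto dataset shipped with Stata:

```stata
sysuse auto, clear

* The if qualifier restricts a single command; the data are unchanged
scatter price mpg if mpg < 30
summarize price if foreign == 1

* drop if and keep if remove observations from the data in memory
drop if missing(rep78)
keep if foreign == 1

* Number of observations that remain
count
```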

Sample Size and Power Calculations
        For this problem, we are going to be given only some results (no data). First, we are told
that there are four groups in the study. Second, the largest group mean is 646 and the smallest
group mean is 550 (for simplicity, the other two group means are taken to equal the grand mean).
Third, the standard deviation is the same in all four groups and equals the population
standard deviation of 80.
        We will make use of the command fpower to do the power analysis. fpower is a user-written
command rather than part of official Stata, so it must be installed first (findit fpower will
locate it). The fpower command needs the following information in order to do the power analysis:
    1. the number of levels (or groups)
    2. the effect size (called delta)
    3. the alpha level
        From the information given above, we know that there are four groups, a=4. We will set
alpha = 0.05, and we will compute the effect size:
\[
\delta = \frac{\max\{\mu_1, \dots, \mu_4\} - \min\{\mu_1, \dots, \mu_4\}}{\sigma}
       = \frac{646 - 550}{80}
\]
Hence, \(\delta = 1.2\).
        Now, we can apply fpower and get the corresponding output:
                   fpower, a(4) delta(1.2) alpha(0.05)

a =   4      b =     1   c =   1    r =   1    rho =    0    delta =     1.2

      nobs            power                                  nobs         power
         2         .0906746                                    16       .795521
         3         .1438119                                    18      .8478578
         4         .2013958                                    20      .8884002
         5         .2614601                                    25      .9512783
         6         .3224192                                    30      .9800673
         7         .3829314                                    35      .9922693
         8         .4419005                                    40      .9971333
         9           .49847                                    45       .998977
        10         .5520059                                    50      .9996469
        12         .6484047                                   100             1
        14         .7294912

      If we wanted to obtain 80% power, then our sample size (nobs, the number of observations
per group) falls somewhere between 16 and 18, since the power is about 0.80 at 16 and 0.85 at
18. The same table answers the reverse question: suppose that we have 40 subjects per group.
We would then see that we have a power of 99.71%.
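
      As a cross-check that does not rely on fpower, Stata 13 and later include an official power
command. The sketch below is an assumption-laden alternative, not the method used above: it
takes the two middle group means to be 598 (the midpoint of 550 and 646) and the error variance
to be 80 squared. Its output is laid out differently and uses a different definition of effect
size than fpower does.

```stata
* Sample size needed for 80% power with four group means
* (598 for the two middle groups is an assumption) and error variance 80^2
power oneway 550 598 598 646, varerror(6400) power(0.8)

* Reverse question: power when there are 40 subjects per group
power oneway 550 598 598 646, varerror(6400) npergroup(40)
```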
III. Working with Graphics in Stata
       histogram xvar will give you a nice display of one variable.
     histogram xvar, by(yvar) may be useful for comparing the distribution of xvar over the
categories of yvar.
     histogram xvar, percent will scale the y-axis more intuitively in terms of percentages.
     histogram xvar, discrete gives a nicer display for categorical variables.
     twoway scatter yvar xvar gives you a twoway scatterplot of your data.
     sunflower yvar xvar gives you a sunflower plot of your data.
     twoway lfit yvar xvar will give you a linear fit graph.

The two syntaxes may be combined, e.g., twoway (scatter yvar xvar) (lfit yvar xvar).
graph bar xvar, over(yvar) is useful for creating a bar graph of the mean of xvar within each
category of a categorical variable yvar.
        For all graphs, options after a comma will be helpful in titling your graph, for example:
               twoway lfit yvar xvar, title("…") xtitle("…") ytitle("…")
               scatter y x
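
        The graph commands in this section can be tried with the sketch below. It assumes the auto
dataset shipped with Stata, and the title text is invented for the example.

```stata
sysuse auto, clear

* Histogram of mpg with the y-axis in percentages
histogram mpg, percent

* Separate histograms of mpg for each category of foreign
histogram mpg, by(foreign)

* Scatterplot with a linear fit overlaid, plus titles
twoway (scatter price mpg) (lfit price mpg), title("Price versus mileage") xtitle("Mileage (mpg)") ytitle("Price")

* Mean price within each category of foreign
graph bar price, over(foreign)
```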
        A more detailed report on the graphics capabilities in Stata can be found at:
The code for such graphs is not provided with this list; they were produced through the
point-and-click GUI. I am not personally familiar with Stata's options for customizing
graphics, but this link seems to show several different ways to customize any
publication-ready graph.

      Much of the information for this write up has been taken from the following resources:
   1. (accessed 4/13/2010).
   2. (accessed 4/14/2010).
   3. (accessed 4/13/2010).
