					                      STATA Quick Introduction
                                     (January 2007)

This handout provides a short introduction to Stata. To build your skills further,
consult the manuals and use the built-in help whenever you need it.


      Starting and stopping Stata
      Reading data into Stata:       Typing         Getting external files
      Useful commands: List Summarize Tabulate                   Logical operators
               Functions and expressions Generating variables Graphics
               Simple linear regression       Subsetting the data Linear restrictions
               Time series (lags and differences, DW test and Q-stats, autocorrelation,
               Dickey-Fuller test, Goldfeld-Quandt test)

Starting & Stopping Stata

             Starting STATA on the PC.
              Click Start > Programs > Stata > Intercooled Stata

              You will find four windows:
              (a) 'Review' window on the upper left, past commands will appear there;
              (b) 'Variables' window on the lower left, the variable list will appear there;
              (c) 'Results' window, results will be displayed there;
              (d) 'Command' window where you can type commands.

              You will also see several buttons above the windows. Hold the mouse
              pointer over a button for a moment and a box will appear with a
              description of that button. The buttons used most frequently are:
              a. Open: open a Stata dataset;
              b. Save: save to disk the Stata dataset currently in memory;
              c. Do-file Editor: open the do-file editor or bring the do-file editor to the
                 front of the other Stata windows;
              d. Data Editor: open the data editor or bring the data editor to the front of
                 the other Stata windows.

             Stopping Stata.
              Type exit in the command window, or just click the close (X) button in the upper right
              corner of the Stata window.

   Reading Data into Stata

   Comma/Tab separated file with variable names on line 1

   Suppose you have the following data in Excel.

                 Name            HOUR            GRADE
                 mary            10              90
                 john            11              92
                 jose            5               60
                 lee             6               71
                 mike            9               80

          Here GRADE is the score in percentage points. Two common file formats for raw data are
          comma separated files and tab separated files. Such files are commonly made
          from spreadsheet programs like Excel.

          For example, suppose you have saved the Excel data as a comma or tab separated file
          (.csv or .txt format), stored as C:\temp\study.csv, with
          variable names on the first row.

          This file has two characteristics. (1) The first line has the names of the
          variables separated by comma/tabs. (2) The following lines have the
          values for the variables, also separated by commas/tabs.

          This kind of file can be read using the insheet command:

          insheet using C:\temp\study.csv     (comma separated)
          or
          insheet using C:\temp\study.txt     (tab separated)

          However, the insheet command cannot handle a file that uses a mixture of commas
          and tabs as delimiters.

Comma/Tab separated file without variable names on line 1

          Suppose we have the same data as above, except that there are no variable names on the
          first row. Where should Stata get the variable names?

          If Stata does not have names for the variables, it names them v1, v2, v3, etc., as we can
          see in the Variables window.

           We can of course tell Stata the names of the variables in the insheet
           command itself, such as:
           insheet name hour grade using C:\temp\study.csv

Space separated file
          In this case, the file can be read with the infile command as shown below:
          infile str4 name hour grade using C:\temp\study.txt
          (str4 means that the variable name is a character variable (a string) that can be
          up to 4 characters wide.)

Fixed format file
          If a file uses fixed column data, i.e., each variable is defined by the
          column(s) in which it is located, then it can be read with the infix
          command as shown below:
          infix str name 1-4 hour 5-6 grade 8-9 using C:\temp\study.csv
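
The outline also lists typing as a way to get data into Stata. As a minimal sketch, assuming the five-row study data shown earlier, the input command reads values typed directly into a do-file or the Command window:

```stata
* Type the example data straight into memory (no external file needed)
clear
input str4 name hour grade
mary 10 90
john 11 92
jose  5 60
lee   6 71
mike  9 80
end

* Check what was read: variable types first, then the observations
describe
list
```

The later sketches in this handout reuse this small data set so that each one can be run on its own.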

   Creating a command file using do-file editor
          Let's introduce the do-file editor window. Click the do-file editor icon, the
          fifth from the right, and a blank window will appear. You can now type commands
          such as use filename, list, summ, etc., and save them as a do-file. Saving your
          commands this way is what keeps them after you quit Stata.

   Writing Comments

   You can write comments in three different ways:
   1. Type your comment after a star '*' (single-line comment)
   * My project for Econ 509
   2. <CODES> // comment on the same line as the code
   3. A comment spread over multiple lines, enclosed in /* and */
   e.g. /*My project for Econ 509 999999999999999 00000000000000 8888888888888888
   77777777777777777777777 */

   With /without Delimiter

   You can work in Stata with or without a delimiter such as ';'. Always be consistent.
   If you tend to forget the ';' at the end of a command, it is better to avoid the
   delimiter from the beginning, because once you have specified it, any command that
   lacks it will produce an error. If you are comfortable with it, then go for it.
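
As a small illustration of the point above, the sketch below switches the delimiter on for one long command and then switches it off again. The numbers are the first three study hours from the example data. Note that #delimit works only inside a do-file, not in the Command window.

```stata
* Run this from a do-file: #delimit cannot be typed interactively
#delimit ;

* while ; is the delimiter, a star comment must end with a semicolon too ;
local total = 10
    + 11
    + 5 ;
display "total study hours in the first three rows: `total'" ;

#delimit cr
* from here on, each line is one command again
display "back to one command per line"
```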

   Typing your codes

   Open the do-file window and type the following commands. Any line that starts with '*' is just a
   comment. Once #delimit ; is in effect, a '*' comment must also end with a semi-colon;
   otherwise Stata treats the next command as part of the comment.

*clearing memory every time
clear

*expects semi-colon at the end of each command line
#delimit ;

*storing output file in: filename.log;
log using e:\teaching\409\stexp0.log, replace;

*reading excel .CSV file with the variable names on line 1;
insheet using e:\teaching\409\hour.csv;

*alternatively, reading a stata .DTA file (clear replaces the data in memory);
use a:\study.dta, clear;

*listing the observations;
list;

*graphing x-axis vs y-axis;
graph hour grade, xlabel ylabel title("graph of study_hour vs grades");

*Running Regression;
regress grade hour;

*predicting fitted value of dependent variable;
predict fgrade;

*calculating residuals and printing them;
gen res = grade-fgrade;
list res;

*graph observed versus fitted against the independent variable;
graph grade fgrade hour, connect(.l);

*closing the output file;
log close;

Do not worry about the commands; let's take a look at the steps of the procedure.
a) First, we use 'clear' to clear the memory;
b) Second, it is better to use #delimit to declare ";" as the end of each command;
c) Third, in order to save the output into a specified file, we create a log file at the
beginning, such as
log using a:\409\study0.log, replace;
which is paired with log close at the end. Then we can either view or copy and paste the
output file easily. Just click the start log icon, which is the fourth icon from the left on the
menu, then open the log file you already saved.

If you read in the data correctly, you should see the variable names appear in the Variables
window and some lines scroll through the Results window with no red letters (red letters indicate an error).

Some useful commands

a. Descriptive statistics (the simplest and most powerful commands are):

list ;          //List all or some of the observations.

For example,
list in 1/5 ; //list the first five observations for all the variables

list hour in f/l ; //list "hour" from the first obs to the last one

*You will have the following in the results window:

    1.            10
    2.            11
    3.             5
    4.             6
    5.             9
list name in -2/l ;     //list the last two observations of "name"

*you will see the following in the results window:

    4.           lee
    5.          mike

Attention: in -2/l, the "l" stands for "last"; do not confuse it with the digit one (1) in 1/5.

b. summarize (summ for short)
summ //gives you the number of observations, mean, standard deviation, minimum and
      //maximum value for all numerical variables.

For the study data, the results should look like this (name is a string variable, so it
shows no numeric observations):
    Variable |     Obs        Mean   Std. Dev.       Min        Max
        hour |       5         8.2   2.588436          5         11
       grade |       5        78.6   13.37161         60         92
        name |       0

    summ hour ;         //gives you the summary statistics for study hour only
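
The numbers that summarize prints are also kept as stored results, which is handy when a later command needs the mean or the median. A minimal sketch, assuming the study data from above (the scalar name mhour is made up for this example):

```stata
clear
input hour grade
10 90
11 92
 5 60
 6 71
 9 80
end

* quietly suppresses the table; the results are still stored in r()
quietly summarize hour
scalar mhour = r(mean)
display "mean study hour = " mhour "   sd = " r(sd)

* the detail option adds percentiles, for example the median in r(p50)
quietly summarize grade, detail
display "median grade = " r(p50)
```

Typing return list after summarize shows everything that is stored.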


tab ;         // with one variable gives you a frequency distribution.
*tab with two variables gives you a cross-tab

Example: tab grade gives you the number of people (and the percentage) at each
grade level. The results are shown in the following table:

      grade |      Freq.     Percent        Cum.
------------+-----------------------------------
         60 |          1       20.00       20.00
         71 |          1       20.00       40.00
         80 |          1       20.00       60.00
         90 |          1       20.00       80.00
         92 |          1       20.00      100.00
------------+-----------------------------------
      Total |          5      100.00

tab hour grade;        //gives you the number of people for each combination of study hour and grade

Be careful not to ask for a tab of a continuous variable such as income: you will get
hundreds of values.

c. Logical operators
They are used to evaluate an expression and then do something depending on the
outcome. The operators are:
  == equal to (you must use a double = sign)
  >= greater than or equal to
  <= less than or equal to
  >  greater than
  <  less than
  ~= not equal to (!= also works)
  &  and: both conditions hold
  |  or: at least one condition holds

summ hour if grade >=80 & grade <=100   //summarizes study hour for students whose
                                        //grade is between 80 and 100

list if (hour > 5) & (hour !=.) //lists all the variables if study hour is greater than 5
                                //and is not missing
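
The check for missing values in the last example matters because Stata treats a missing value (.) as larger than any number. The sketch below assumes the study data with the last hour deliberately set to missing for illustration; the scalar names are hypothetical.

```stata
clear
input hour grade
10 90
11 92
 5 60
 6 71
 . 80
end

* a missing hour counts as "greater than 5"
count if hour > 5
scalar n_naive = r(N)

* excluding missing values gives the intended count
count if hour > 5 & !missing(hour)
scalar n_valid = r(N)

display "with missing: " n_naive "   without missing: " n_valid
```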

              Functions and expressions

           Taking log:                 gen lhour = log(hour)
           Raising to powers:          gen hournew = hour^3
           Taking square root:         gen sqgrade = sqrt(grade)
           Taking absolute value:      gen hourabs = abs(hour)

        Taking lags:               gen xlag = x[_n-1]     or
                                   gen xlagged = L.x      (lag 1 of x; requires tsset first)
                                   gen xlag2 = x[_n-2]    (lag 2 of x)

        Of course, the arithmetic operators +, -, *, / and ^ are available as well.

           Generating variables

        New variables may be generated by using the commands generate or egen.

           The command generate(gen for short) simply equates a new variable to an
            expression which is evaluated for each observation.

            generate minutes = hour* 60 if grade>80

            Generating dummy variables

            Suppose we want to construct a dummy variable from a
            variable x, such as grade: 1 for grade > 79, 0
            otherwise. Then the appropriate Stata command is:

            gen xdummy = (x > 79)

            Stata assigns 1 when x > 79 and 0 otherwise. Remember to put the
            condition in parentheses. A missing x also counts as greater than 79,
            so add "if x != ." when the data contain missing values.

            The command egen provides an extension to generate. One advantage of
            egen is that some of its functions accept a variable list as an argument,
            whereas the functions for gen can only take simple expressions as
            arguments. For example (rmean is named rowmean in Stata 9 and later):

            egen average = rmean(x y z)

           Another command that works in the same way as generate is replace, which
            allows an existing variable to be changed.
            replace grade = 0 if grade < 60
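
The three commands of this section can be tried together. A minimal sketch, assuming the study data; the new variable names (pass, meangrade, minutes) are made up for this example:

```stata
clear
input hour grade
10 90
11 92
 5 60
 6 71
 9 80
end

* dummy variable: 1 if grade > 79, 0 otherwise, missing if grade is missing
gen byte pass = (grade > 79) if !missing(grade)

* egen computes a statistic over all observations
egen meangrade = mean(grade)

* generate a new variable, then change it with replace
gen minutes = hour * 60
replace minutes = 0 if grade < 70

list
```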

           Graphics

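        The command graph may be used to plot a large number of different graphs.
        The basic syntax is [graph varlist, options] where varlist is the list of
        variables and options are used to specify the type of graph. (This is the
        syntax of Stata 7 and earlier; from Stata 8 on, the same commands run
        under the name graph7.)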

        For example, graph hour, normal title("histogram of study hour")
        [the normal option overlays a normal curve on our histogram; the text in the
        title option has to be quoted]

        [Figure: histogram of study hour, x-axis running from 5 to 11]

Now consider the case where we want to plot more than
one variable. For example, we are interested in the relationship between
grades and study hours.

graph grade hour, xlabel ylabel title("graph of study_hour vs grades")

[the xlabel and ylabel options cause the x- and y-axes to be labeled using
round values; without them, only the minimum and maximum values are
labeled]

[Figure: graph of study_hour vs grades, x-axis labeled 4, 6, 8, 10, 12]

   Of course, if you want to save the graph, you have to add the option saving(filename).
   There are many options for graphics in Stata. If you want to explore more,
   please check the manual, Stata Graphics, for details.
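
For readers on Stata 8 or later, the two plots of this section can be sketched with the newer graph commands. This is an assumption about the reader's version, not the syntax used in this handout; the file name hour_grade is made up.

```stata
clear
input hour grade
10 90
11 92
 5 60
 6 71
 9 80
end

* histogram of study hour with a normal curve overlaid
histogram hour, normal title("histogram of study hour")

* scatter plot: the first variable goes on the y-axis, the second on the x-axis
twoway scatter grade hour, title("graph of study_hour vs grades")

* save the graph currently on screen as hour_grade.gph
graph save hour_grade, replace
```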

Simple linear regression

   Let's begin by showing some examples of simple linear regression using
   Stata. In this type of regression, we have only one predictor variable. This
   variable can be either continuous or discrete. There is only one response or
   dependent variable, and it is continuous.

   In Stata, the dependent variable is listed immediately after the regress
   command, followed by one or more predictor variables: regress [dep.
   var.] [predictors].

   Let's examine the relationship between study hours and grades. From the
   plot above, there is an obvious positive relationship between these two
   variables: the more hours students spend studying, the better
   their grades. For this example, the model we are interested in is
   grade = α + β*hour + e, where grade is the dependent variable and hour is the predictor:

   regress grade hour
      Source |       SS       df       MS               Number of obs   =        5
-------------+------------------------------            F( 1,      3)   =    65.93
       Model | 684.073134      1 684.073134             Prob > F        =   0.0039
    Residual | 31.1268657      3 10.3756219             R-squared       =   0.9565
-------------+------------------------------            Adj R-squared   =   0.9420
       Total |      715.20     4      178.80            Root MSE        =   3.2211

       grade |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
        hour |   5.052239   .6222138     8.12   0.004     3.072077    7.032401
       _cons |   37.17164   5.301613     7.01   0.006     20.29954    54.04374

   In addition to getting the regression table, we can also get the predicted
   values and plot them.

   For example,

   predict fgrade     [fgrade is the name of the new variable that holds the predicted values]
   list grade fgrade  [take a look at the predicted values]

                            grade           fgrade
                  1.           90         87.69403
                  2.           92         92.74627
                  3.           60         62.43283
                  4.           71         67.48508
                  5.           80         82.64179

         graph grade fgrade hour, connect(.l)
         [graph the original and fitted values against hour, connecting the fitted values with a line]

         [Figure: grade and Fitted values plotted against hour, x-axis running from 5 to 11]
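
The regression above can be reproduced from scratch. The sketch below assumes the study data and also shows two conveniences not used in the text: the coefficients are available as _b[varname], and predict with the residuals option computes the residuals directly (the name res is made up).

```stata
clear
input hour grade
10 90
11 92
 5 60
 6 71
 9 80
end

regress grade hour

* coefficients of the most recent regression
display "slope = " _b[hour] "   intercept = " _b[_cons]

* fitted values and residuals
predict fgrade
predict res, residuals
list grade fgrade res
```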

Subsetting data

     We can subset data by keeping or dropping variables, and we can also subset data
     by keeping or dropping observations. Now let's focus on keeping and dropping
     observations.

     For example, suppose we have the following data, either entered with the data editor or read
     from an external file. For gender, 0=female, 1=male; income is divided by $100;
     year is the number of years of experience in this kind of job; for degree,
     0=no degree, 1=B.S.(B.A.), 2=M.S.(M.A.).

            gender       income         degree           year
            1            10             0                5
            0            30             1                7
            1            15             0                9
            1            19             0                10
            0            27             2                12
            0            23             1                10
            1            45             2                5
            1            27             1                9
            0            29             2                7
            0            30             2                8
            -999         22             1                15
            -999         19             1                9

      Please note that there are two -999 values in gender, which stand for missing values.
      We can turn them into Stata missing values using
             replace gender=. if gender ==-999 // [double equal ==]

      If we only care about non-missing values, we can drop the observations with
      missing gender, using drop if (gender==.).

      Now we want to split the sample into two parts, one for females and the other for
      males, and run a regression on each part separately using the model:
      income = α + β1*degree + β2*year. Here is one way to handle this.

      To extract only females, use keep if gender==0 and save the result into a new data file such as
      D:\pr_stata\incomef.dta; then we run the regression using the new data set.

             *extract only females and save them into a new data file

             keep if gender==0
             save D:\pr_stata\incomef.dta, replace

             *run the regression using the new data

             use D:\pr_stata\incomef.dta, clear
             reg income degree year

             *back to the original data
             use D:\pr_stata\income.dta, clear

             *extract only males, and save them into another new data file
             keep if gender==1
             save D:\pr_stata\incomem.dta, replace

             *run the regression using the new male data

             use D:\pr_stata\incomem.dta, clear
             reg income degree year
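
Another way to get the same two regressions, without writing extra files, is the if qualifier. This is a sketch under the assumption that the data are the twelve rows listed above; it is an alternative to the keep-and-save approach of the text, not a replacement for it.

```stata
clear
input gender income degree year
   1 10 0  5
   0 30 1  7
   1 15 0  9
   1 19 0 10
   0 27 2 12
   0 23 1 10
   1 45 2  5
   1 27 1  9
   0 29 2  7
   0 30 2  8
-999 22 1 15
-999 19 1  9
end

* recode -999 to Stata's missing value
replace gender = . if gender == -999

* regression for females only; the full data set stays in memory
regress income degree year if gender == 0

* regression for males only
regress income degree year if gender == 1
```

Observations with missing gender satisfy neither condition, so they are left out of both regressions automatically.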

Linear Restrictions

      Often we want to test linear hypotheses after model estimation. The most useful
      command is test. It performs F or χ² tests of linear restrictions on the estimated
      parameters from the most recently estimated model using a Wald test.

      There are several versions of the syntax; we introduce only two. For details,
      please check the Stata Manual [Reference: test].
             1) test expression = expression
             2) test coefficientlist {note: coefficientlist can simply be a list of variable
                names}


Suppose we have estimated a model of 1980 Census data on the 50 states
recording the birth rate in each state (brate), the median age (medage) and its
square (medagesq), and the region of the country in which each state is located.
The variable reg1 =1 if the state is located in the Northeast, otherwise 0; whereas
reg2 = 1 if the state is located in the North Central, reg3 marks the South, and
reg4 marks the West.

First we estimate the following regression:
       reg brate medage medagesq reg2-reg4

*reg2-reg4 is the abbreviation for reg2, reg3 and reg4

If we want to test (F) if the coefficient on medage is zero, just type:
       test medage = 0

If we want to test whether the coefficient on reg2 is the same as that on reg4, we can do
       test reg2 = reg4

Of course, we can put more complicated linear restrictions here, like
       test 2*reg2-3*reg4 = reg3

However, the real power of test shows when we test joint hypotheses. Suppose we
wish to test whether the region variables, taken as a whole, are significant. To
perform tests of this kind, specify each constraint and accumulate it with the
previous constraints:
       test reg2 = 0
       test reg3 = 0, accumulate
       test reg4 = 0, accumulate

Typing separate test commands for each constraint, as above, can be tedious. The
second syntax allows us to perform our last test more conveniently:
       test reg2 reg3 reg4

Dealing with time series Data
         a) Setting the time span
              We have to let Stata know the time span variable at the beginning, and the
              command is tsset.

              There are two cases. In the first, our data already provide the time span
              information. For example, we have annual exchange rate data consisting
              of two variables, year and exchrate, where "year" is the variable that identifies
              the time span. We simply say
              tsset year;
              and Stata will treat the data as an annual time series dataset.

              The other case is that we have to generate the time span by ourselves since
              we do not have it in our data.

              For example, if we want to generate annual data starting in 1985,
              gen t = y(1985) + _n-1;      //1985 is the start time, y() stands for yearly()
              tsset t;

              if we want to generate quarterly data starting in 1973:II, then
              gen t = q(1973q2) +_n-1;      /*1973q2 means the start time: the 2nd
                                            quarter in 1973, q() stands for quarterly()*/
              format t %tq;                 //assign Stata format for quarterly data to t
              tsset t;

              if we want to generate monthly data starting in July 1995, then
              gen t = m(1995m7)+_n-1;      /*1995m7 means the start time, m() stands for monthly()*/
              format t %tm;                /*assign Stata format for monthly data to t*/
              tsset t;

              if we want to generate weekly data starting from the 1st week of 1995, then
              gen t = w(1995w1)+_n-1;      /*1995w1 indicates the start time, w() stands for weekly()*/
              format t %tw;                /*assign Stata format for weekly data to t*/
              tsset t;
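
To see what such a time variable actually contains, the sketch below builds eight quarters starting in 1973:II. It uses the function yq(year, quarter) in place of q(1973q2); both give the number of quarters since 1960q1, and the choice of yq() here is only for illustration. The data set is empty apart from t.

```stata
clear
set obs 8

* quarters are stored as integers counted from 1960q1 (which is 0)
gen t = yq(1973, 2) + _n - 1

* show the raw integers, then the same values with a quarterly format
list t in 1/3
format t %tq
list t in 1/3

tsset t
```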

Generating lags and differences
             Suppose x and y are random time series variables.
             If we want to create 1 lag of y, then we can generate a new variable ylag1:
             gen ylag1 = y[_n-1];

              Applying the same idea if we want to create 2 lags of y, then we can
              generate another new variable ylag2 as:
              gen ylag2 = y[_n-2];

               We can also generate lags of x:
               gen xlag1 = x[_n-1];

               To generate the differences of variables, we use the D. operator.

               If we want to generate the 1st order difference of y, we can create a new
               variable dy1 as:
               gen dy1 = D.y;

               If we want to generate the 1st order difference of dy1 (the 2nd order
               difference of y), we can create another new variable dy2 as:
               gen dy2 = D.D.y;

               To generate a seasonal difference, we can create the lag first and then take the
               difference. For example, to create a new variable sdy4 for the
               seasonal difference of quarterly data,
               gen sdy4 = y-y[_n-4];
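
Once the data are tsset, the time-series operators L., D. and S. do the same jobs as the subscripts above, and they also respect gaps in the time variable. A minimal sketch on a made-up series (y is simply the square of the observation number):

```stata
clear
set obs 12
gen t = yq(2000, 1) + _n - 1
format t %tq
tsset t

* hypothetical series: 1, 4, 9, 16, ...
gen y = _n^2

gen ylag1 = L.y      // same as y[_n-1]
gen dy1   = D.y      // first difference, y - L.y
gen dy2   = D2.y     // second difference, same as D.D.y
gen sdy4  = S4.y     // seasonal difference, y - L4.y

list t y ylag1 dy1 dy2 sdy4
```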

Durbin-Watson test and Q-stats

               The Stata commands are dwstat and wntestq. These commands are
               applied after estimation: dwstat directly after the regression, wntestq
               to the stored residuals.

               For example:
               reg y ylag1 ylag2 xlag1;
               dwstat;                               /*Durbin-Watson statistic for the regression*/
               predict uhat, residuals;              /*store residuals into uhat*/

               wntestq uhat, lag(20);                /*Q-stat on residuals up to 20 lags*/

             prais estimates a linear regression of depvar on varlist that is corrected for
             first-order serially-correlated residuals using the Prais-Winsten
             transformed regression estimator, the Cochrane-Orcutt transformed
             regression estimator, or a version of the search method suggested by
             Hildreth and Lu.

               Please note that prais is for use with time-series data. You must
               tsset your data before using prais.

               For example,

               prais y x1 x2, corc;
               /*corc specifies that the Cochrane-Orcutt transformation be used to
               estimate the equation*/

               prais y x1 x2, corc ssesearch;
               /*ssesearch specifies that a search is performed for the value of rho that
                minimizes the sum of squared errors of the transformed equation*/

Dickey-Fuller and Unit Root
              Let's still use the variables created earlier. There are two equivalent ways to set up
              the D-F regression:

               reg dy1 ylag1;         /*test whether the coefficient is zero*/
               reg y ylag1;           /*test whether the coefficient is 1*/

               In either case, compare the t statistic with Dickey-Fuller critical values
               rather than with the usual t table.

               To perform the augmented Dickey-Fuller test, we can use the dfuller command.
               This test performs a regression of the differenced variable on its lag and
               the user-specified number of lagged differences of the variable.

               For example,
               dfuller y, lags(5) regress;

Goldfeld-Quandt test
             To perform the Goldfeld-Quandt test, first we need to split the observations into
             two parts. Usually we sort the data by the variable of interest first, then run two
             regressions on the separate parts.

               For example, we have data about expenditure (exp) and income (inc), and
               we are interested in performing the Goldfeld-Quandt test for inc. What we can do is as
               follows:

               sort inc;       /*sorting the data first*/
               list inc exp in 1/5;    /*checking the sorted data*/

               reg exp inc in 1/10; /*run OLS on the 1st part of the obs*/
               reg exp inc in 21/40; /*run OLS on the 2nd part of the obs*/

               Then take the residual sum of squares (Residual SS) from each regression and
               form their ratio, each divided by its degrees of freedom, to perform the test.
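
The last step can be sketched with the results that regress stores: e(rss) is the residual sum of squares and e(df_r) the residual degrees of freedom. The data below are made up so that the spread of exp grows with inc; the observation ranges are the ones used in the text, and the scalar names are hypothetical.

```stata
clear
set obs 40
gen inc = 10 * _n
* made-up expenditure whose deviations from the line grow with income
gen exp = 5 + 0.8 * inc + inc * (mod(_n, 3) - 1) / 10
sort inc

* first part of the sorted sample
quietly regress exp inc in 1/10
scalar rss1 = e(rss)
scalar df1  = e(df_r)

* second part of the sorted sample
quietly regress exp inc in 21/40
scalar rss2 = e(rss)
scalar df2  = e(df_r)

* F ratio of the two residual variances and its upper-tail p-value
scalar gq = (rss2 / df2) / (rss1 / df1)
display "GQ F(" df2 ", " df1 ") = " gq "   p-value = " Ftail(df2, df1, gq)
```

A large F (small p-value) points to a larger error variance in the high-income part of the sample.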

Some Useful Extras

When you are dealing with complex data sets, your data might be spread over several different
files. Before analyzing such data, you need to create a common file with
the relevant variables from each of the files that you are interested in. One way of
combining such files is the merge command.

To merge two files, you need a common ID variable in both files. If that is the case, then do the
following:

sort ID                // ID is the name of the common ID in both files

*after sorting, save the file under a different name
*Read the second file and sort it by ID. Now you are ready to merge the two data sets.

merge ID using <file name that you just saved>

*Now your two data sets are in one file.
*To make sure you merged the files appropriately, inspect the _merge variable that
*merge creates, for example with: tab _merge
* If you have more files, drop _merge and repeat those steps
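
The steps above can be sketched end to end with two tiny made-up files. The sketch keeps the merge syntax used in this handout; Stata 11 and later prefer the form merge 1:1 id using filename, so treat the exact syntax as version dependent. The variable and file names are hypothetical.

```stata
* first file: study hours, sorted by id and saved to a temporary file
clear
input id hour
1 10
2 11
3  5
end
sort id
tempfile hours
save `hours'

* second file: grades, sorted by id
clear
input id grade
1 90
2 92
4 71
end
sort id

* combine the two files on id
merge id using `hours'

* _merge: 1 = only in the file in memory, 2 = only in the using file, 3 = in both
tab _merge
list
```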

Your data might be in one of the following forms:
                      (wide form)

                i      ....... x_ij ........
               id    sex   inc80   inc81   inc82
                1      0    5000    5500    6000
                2      1    2000    2200    3300
                3      0    3000    2000    1000

                     (long form)

                i      j           x_ij
               id   year    sex     inc
                1     80      0    5000
                1     81      0    5500
                1     82      0    6000
                2     80      1    2000
                2     81      1    2200
                2     82      1    3300
                3     80      0    3000
                3     81      0    2000
                3     82      0    1000

*To reshape the data from wide to long, do the following
reshape long inc, i(id) j(year) // goes from top-form to bottom

*to reshape the data from long to wide, do the following
reshape wide inc, i(id) j(year) //goes from bottom-form to top
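
Both directions can be tried on the small table shown above. A minimal sketch, assuming exactly those three rows:

```stata
clear
input id sex inc80 inc81 inc82
1 0 5000 5500 6000
2 1 2000 2200 3300
3 0 3000 2000 1000
end

* wide to long: one row per id and year
reshape long inc, i(id) j(year)
list

* long to wide: back to one row per id
reshape wide inc, i(id) j(year)
list
```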


*collapse converts the dataset in memory into a dataset of means, sums, medians, etc.

Example: You have a data set across 50 states and you want to get the summary
statistics of each state. Use the following command:

collapse age educ income, by(state) // here age, educ and income are 3 variables

*coll2, an alternative to collapse, converts the data in memory into a data set
*of means, sums, medians, etc.
coll2 age educ income, by(state)

*in the above cases, you will get the mean values of age, educ and income for each state. If
*you want something different such as the mode or the sum, you have to specify it as:

coll2 (mode) age educ income, by(state)
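
With collapse itself, the statistic is named in parentheses before the variables it applies to. The sketch below uses five made-up rows for two states and asks for means of age and educ and the sum of income; the name total_income is hypothetical.

```stata
clear
input str2 state age educ income
"TX" 30 12 40
"TX" 50 16 60
"NY" 40 14 50
"NY" 20 12 30
"NY" 60 16 70
end

* one row per state: mean age, mean educ, and total income
collapse (mean) age educ (sum) total_income=income, by(state)
list
```

Keep in mind that collapse replaces the data in memory, so save the original file first if you still need it.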

Weighting Issue

Most Stata commands can deal with weighted data. This is especially useful with survey data.
Stata allows four kinds of weights:

  1. fweights, or frequency weights, are weights that indicate the number
     of duplicated observations.

  2. pweights, or sampling weights, are weights that denote the inverse of
     the probability that the observation is included due to the sampling
     design.
  3. aweights, or analytic weights, are weights that are inversely
     proportional to the variance of an observation; i.e., the variance of
     the j-th observation is assumed to be sigma^2/w_j, where w_j are the
     weights. Typically, the observations represent averages and the
     weights are the number of elements that gave rise to the average. For
     most Stata commands, the recorded scale of aweights is irrelevant;
     Stata internally rescales them to sum to N, the number of observations
     in your data, when it uses them.

  4. iweights, or importance weights, are weights that indicate the
     "importance" of the observation in some vague sense. iweights have no
     formal statistical definition; any command that supports iweights will
     define exactly how they are treated. In most cases, they are intended
     for use by programmers who want to produce a certain computation.

Suppose you are running a regression of y on x1, x2, x3, and you have a variable called
pop that you want to use as a weight. Then do the following:

regress y x1 x2 x3 [aw=pop]          // remember the square brackets; 'aw' refers to
                                     // analytic weight.
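
The difference between weight types is easiest to see on a tiny made-up example. In the sketch below, two observations carry weights 1 and 3; both weight types give the same weighted mean, but frequency weights count the second observation as three separate observations. The scalar name wmean is hypothetical.

```stata
clear
input y pop
10 1
20 3
end

* analytic weights: weighted mean, still 2 observations
summarize y [aw=pop]
scalar wmean = r(mean)

* frequency weights: same mean, but the data now count as 4 observations
summarize y [fw=pop]
display "weighted mean = " wmean "   N with fweights = " r(N)
```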
