A Gentle Introduction to STATA

 Jose Ramon G. Albert
 Research Division Chief
 Statistical Research & Training Center (SRTC)
 email: srtcres@srtc.gov.ph
      SIAP-SRTC Training Course on Sampling
            Acceed Center, AIM, Makati
                   Philippines
                   4 April 2002


               OUTLINE

   Statistical Computing Resources
   Data Management with Stata
   Table Generation
    • Tab and Table Commands
    • Survey Commands


                   SIAP-SRTC Training on Sampling
Computing Resources


        Computing Resources

   The Age of ICT has brought
    about a synergy of computing
    and communications
   Implications:
    • More DATA collected
    • More DATA stored
    • More DATA accessible and distributed



       Computing Resources

   A host of statistical software
    packages provide pre-programmed
    analytical and data management
    capabilities. These packages may
    be classified according to use
    and cost.


         Computing Resources

    Types of Stat Software by usage
   General purpose: SAS, SPSS, R,
    S-PLUS, Statistica, Stata
   Special purpose: econometric
    modeling (EViews), seasonal
    adjustment (X12), Bayesian
    modeling (WinBUGS), survey data
    tabulation & variance estimation
    (IMPS, CENVAR)


       Computing Resources

    Types of Stat Software by cost
   Commercial Software - SAS, SPSS,
    Stata, S-plus
   Freeware - R, IMPS, X12






     Computing Resources
FOR SURVEY DATA
 Bascula from Statistics Netherlands.

 CENVAR (& IMPS) from U.S. Bureau of
  the Census.
 CLUSTERS from University of Essex.

 Epi Info from Centers for Disease
  Control.
 Generalized Estimation System (GES)
  from Statistics Canada.
 IVEWare (beta version) from University
  of Michigan.


     Computing Resources
FOR SURVEY DATA
 PCCARP from Iowa State University.

 SAS/STAT from SAS Institute.

 Stata from Stata Corporation.

 SUDAAN from Research Triangle
  Institute.
 VPLX from U.S. Bureau of the Census.

 WesVar from Westat, Inc.






         Computing Resources

   Lists of Statistical Software
    http://members.aol.com/johnp71/javasta2.html
    http://www.stir.ac.uk/Departments/HumanSciences/SocInfo/Statistical.htm
    http://www.fas.harvard.edu/~stats/survey-soft/
    http://www.feweb.vu.nl/econometriclinks/software.html


   Computing Resources

This afternoon, we will provide a
demonstration on how to use
STATA for accomplishing some
of the most common tasks of
data management, statistical
computing and analysis of
survey data.


     Computing Resources
Stata
The survey commands estimate means,
totals, ratios, and proportions, and fit
linear regression, logistic regression, and
probit models.
Point estimates, associated standard
errors, confidence intervals, and design
effects are displayed for the full
population or for subpopulations.


     Computing Resources
Stata
Auxiliary commands display information
on linear combinations (e.g.,
differences) of estimators and conduct
hypothesis tests.
New in Stata: contingency tables with
Rao-Scott corrections of chi-squared tests;
new survey-corrected regression commands,
including tobit, interval, censored,
instrumental variables, multinomial logit,
ordered logit and probit, and Poisson.


        Computing Resources
    Stata accommodates
   stratified designs;
   cluster sampling;
   finite population corrections (FPCs),
    which can be calculated for simple random
    sampling without replacement of sampling
    units within strata;
   variance estimation for multistage sample
    data, carried out through the customary
    between-PSU squared-differences
    calculation.


     Computing Resources
Stata
Variance estimation is done through
Taylor-series linearization in the survey
analysis commands. There are also commands
for jackknife and bootstrap variance
estimation, but these are not specifically
oriented toward survey data.


    Computing Resources
Note:
We will demonstrate the use of STATA
version 6. The current version is version 7,
which also comes in a Special Edition (SE)
that can handle up to 32,766 variables,
strings of up to 244 characters, and
matrices of up to 11,000 x 11,000.




Data Management with STATA


        Data Management
STARTING UP
 Go to Start, Programs, Stata,
  Intercooled Stata
 Alternatively, from Windows
  Explorer, go to folder
      c:\stata
  and double-click
      wstata.exe


       Data Management
CREATING A NEW DATASET
 Open the STATA spreadsheet editor






        Data Management
CREATING A NEW DATASET
 Enter data into the editor; when
  done, close the editor.


       Data Management
CREATING A NEW DATASET
 In the STATA COMMAND window
  enter the command
     save newfile






        Data Management
NOTE
 A STATA dataset has the file extension
  .dta. That is, newfile is actually
  newfile.dta
 Public use files of some surveys, e.g.
  VLSS (Vietnam Living Standards
  Survey), are in Stata format.


       Data Management
INSPECTING DATA BASE
 In the STATA COMMAND window
   enter the following commands
   describe
  list
  summarize
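
 NOTE (sketch): A minimal illustration of the three commands, assuming a recent
  Stata release and auto.dta, the example dataset shipped with Stata, as a
  stand-in for the dataset created in this course. It is meant to be run from
  a do-file, where // comments are allowed.

```stata
* Load an example dataset that ships with Stata (an assumption:
* the course file newfile.dta is not available here).
sysuse auto, clear

describe            // variable names, storage types, and labels
list in 1/5         // first five observations only
summarize           // obs, mean, std. dev., min, max for every variable
```

  Restricting list with in 1/5 keeps the output short on large datasets.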





       Data Management
NOTE:
 Stata is case sensitive; commands are
  typed in lowercase.
 Stata commands may be abbreviated,
  e.g. d for describe, sum for
  summarize, etc.
 We may use the Page Up/Down keys or
  the mouse to re-select commands in
  the Review window.


        Data Management
NOTE:
 Commands and output are shown in the
  Results window. Windows may be
  resized.
 Commands and output may be logged
  into a log file by pressing the
  Open Log button.


        Data Management
RENAMING VARIABLES
 ONE WAY: (From the Data Editor) Double
  click anywhere in the variable's column
  to bring up a dialogue box


        Data Management
RENAMING VARIABLES
 SECOND WAY: (In the STATA
   COMMAND window) enter
  rename var1 domain
  rename var2 hcn
  rename var3 age
  label variable age "HH head age"
  d


       Data Management
SAVING EDITED DATABASE
 In the STATA COMMAND window enter
  the following commands
  save newfile, replace
  Note: typing only
  save newfile
  will result in an error message
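
  NOTE (sketch): The error arises because newfile.dta already exists and
  Stata will not overwrite a file unless told to. A minimal do-file sketch,
  assuming a recent Stata release and the shipped example data in place of
  the course data:

```stata
sysuse auto, clear            // example data shipped with Stata (assumption)
save newfile, replace         // works whether or not newfile.dta exists
capture noisily save newfile  // refused: newfile.dta already exists
display _rc                   // a nonzero return code signals the error
```

  capture noisily shows the error message but lets the do-file continue.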



        Data Management
READING PRE-EXISTING
STATA DATASET
 If the dataset is in folder c:\fies2000 and
  the filename is "fies00small.dta", enter
  clear
  set mem 64m
  cd c:\fies2000
  use fies00small
 NOTE: set mem is important for memory
  management


        Data Management
IMPORTING DATA
 Suppose we have a dataset try.txt in
  the c:\fies2000 folder
 NOTE: Missing data are coded as "."


            Data Management
IMPORTING DATA
 Suppose we have a dataset try.txt in
   the c:\fies2000 folder
 Use the infile command with syntax
   infile variable-list using filename.raw
 In particular, enter
   cd c:\fies2000
   infile domain hcn age using try.txt,
      automatic


         Data Management
TRIVIA ON STRING VARIABLES
 When using the infile command for
  character (string) variables, we need to
  identify these variables. For instance
  infile domain hcn str30 prov using tr.txt
 For more details regarding infile, enter
  help infile1


              Data Management
IMPORTING DATA
 Suppose we have a dataset try2.txt in
  the c:\fies2000 folder with the data in
  specific fields
 NOTE: Assumes the last line is a
  blank line


        Data Management
IMPORTING DATA
 Suppose we have a dataset try2.txt in
   the c:\fies2000 folder with the data in
   specific fields
 Use the infix command
   infix domain 1 hcn 2 age 3-4 using
    try2.txt, clear


            Data Management

Thus, Stata can read text files with
 infile (if the data in the text file are
  separated by spaces and have no strings,
  or if strings are just one word, or if all
  strings are enclosed in quotes)
 infix (fixed-format text)
 insheet (if the text file was created by
  a spreadsheet or database program)
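
 NOTE (sketch): The difference between free-format and fixed-format reading
  can be tried without the course files. The do-file sketch below first
  writes two small text files (the file names try_demo.txt and try2_demo.txt
  and all values are hypothetical), then reads them back. It assumes a
  Stata release that has the file command.

```stata
* Write two small text files so the example is self-contained
* (file names and values are hypothetical, not the course data).
capture file close fh
file open fh using try_demo.txt, write text replace
file write fh "1 101 45" _n
file write fh "2 102 ." _n
file write fh "1 103 38" _n
file close fh

file open fh using try2_demo.txt, write text replace
file write fh "1145" _n
file write fh "2238" _n
file close fh

* Free format: values separated by spaces, "." read as missing
infile domain hcn age using try_demo.txt, clear
list

* Fixed format: domain in column 1, hcn in column 2, age in columns 3-4
infix domain 1 hcn 2 age 3-4 using try2_demo.txt, clear
list
```

  In the fixed-format file no separators are needed, since each variable is
  identified by its column position.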

           Data Management

NOTE:
 The commands infile, infix, and insheet
  read data from ASCII files; outfile
  saves data in ASCII.
 There are third-party programs, esp.
  Stat/Transfer and DBMS/COPY, that
  perform translations from one data
  format (e.g., dBASE, Excel, SAS, SPSS,
  Stata) to another.


          Data Management
OTHER USEFUL COMMANDS
 To sort the dataset by age
  sort age
 To get a listing of the dataset
  list
 To list the 2nd to 4th observations
  list in 2/4


         Data Management
OTHER USEFUL COMMANDS
 To summarize the restricted dataset
  of HHs whose head’s age is less
  than or equal to 50
  summarize if age <= 50
 For HH head age strictly between 35 and 50
  summarize if age < 50 & age > 35


       Data Management

Comparison operators
>       >=      ==
<       <=      !=
Logical operators
& (and)     ! (not)
| (or)      ~ (not)
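
NOTE (sketch): Stata treats a numeric missing value (.) as larger than any
number, so a condition such as age > 35 is also true for missing ages. The
do-file sketch below illustrates this with auto.dta, the example dataset
shipped with Stata (an assumption; it stands in for the course data), whose
variable rep78 has some missing values.

```stata
sysuse auto, clear          // example data shipped with Stata (assumption)

* rep78 (repair record) has some missing values in auto.dta
count if rep78 > 3                      // also counts missing rep78
count if rep78 > 3 & !missing(rep78)    // excludes missing values

* inrange() includes both endpoints and is false for missing values
summarize price if inrange(rep78, 3, 5)

* != and ~= both mean "not equal"
count if foreign != 1
count if foreign ~= 1
```

The two count commands on rep78 differ by the number of missing values.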




         Data Management
OTHER USEFUL COMMANDS
 To tabulate domain

  tab domain
 To generate contingency tables

  tab domain hcn if age>35
 To get the correlation matrix

  correlate x y z



         Data Management
GENERATING & REPLACING VARIABLES
 Suppose we want to obtain per capita
  income (pci) of FIES 2000 households
  clear
  cd d:\fies00
  use fies00small
  gen pci=toinc/hsize



         Data Management
GENERATING & REPLACING VARIABLES
 Now tag the household as poor (1) if
  pci < some threshold, say 13823,
  and determine the percent of HHs
  that are poor.
  gen poor=1 if pci < 13823
  replace poor=0 if poor==.
  sum poor [aw=rfact]
  save fies00small, replace
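
 NOTE (sketch): One caution on the two-step recipe: replace poor=0 if
  poor==. also sets poor to 0 for any household whose pci is itself
  missing. A minimal do-file sketch of an alternative, using a small
  hypothetical dataset (the values of pci and rfact are made up) and
  assuming a Stata release that has the missing() function:

```stata
clear
input pci rfact
 9000 120
15000  80
    . 100
13000 200
end

* 1 if pci is below the threshold, 0 otherwise, missing if pci is missing
generate byte poor = (pci < 13823) if !missing(pci)

* Weighted proportion poor = (120 + 200) / (120 + 80 + 200) = 0.8
summarize poor [aw=rfact]
```

  The household with missing pci is left out of the weighted mean rather
  than being counted as nonpoor.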


         Data Management
NOTE
 A small portion of the FIES 2000 data set
  was used. The Family Income and
  Expenditure Survey (FIES) is
  conducted by the National Statistics
  Office (NSO) every 3 years. Data may
  be purchased through the NSO
  website:
     www.census.gov.ph

Introduction to STATA
(cont’d)

 Jose Ramon G. Albert
 Research Division Chief
 Statistical Research & Training Center (SRTC)
 email: srtcres@srtc.gov.ph
      SIAP-SRTC Training Course on Sampling
            Acceed Center, AIM, Makati
                   Philippines
                   5 April 2002


           Data Management
RECALL
 Recall that to use our fies2000 data set
  set mem 64m
  cd c:\fies2000
  use fies00small
  sum poor [aw=rfact]
 Note: the poverty line we provided is a weighted
  average of the variable poverty lines in the
  Philippines (for urban-rural areas across the
  different regions)

Digression …
 Official Poverty
 Measurement & Latest
 Poverty Statistics


Estimating Food Poverty Line
   The food poverty line is estimated from low-cost
    one-day menus (breakfast, lunch, supper,
    snack) constructed for each urban-rural
    area of a region by the Food and Nutrition
    Research Institute (FNRI). The menus meet
    100% sufficiency in energy and protein
    requirements and 80% sufficiency in other
    nutrients and vitamins.
    • RDA for energy: 2000 kcal per person
    • RDA for protein: 50 grams per person
   29 such menus were constructed on the basis of
    the 1988 Food Consumption Survey

Annual Per Capita Food Line
    Urban, by Region





Annual Per Capita Food Line
     Rural, by Region






     Estimating Poverty Line
   Poverty Line = Food Threshold / Engel’s
    Coefficient
   Engel’s coefficient is estimated by
    analyzing the consumption pattern of
    families having incomes within plus or
    minus 10 percentage points of the food
    threshold.
   Engel’s coefficient = Food Expenditure /
    Total Basic Expenditure
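
   NOTE (sketch): The arithmetic can be checked in Stata with scalars. The
    figures below are hypothetical, chosen only to illustrate the formula;
    they are not official thresholds or coefficients.

```stata
* Hypothetical figures, chosen only to illustrate the arithmetic
scalar food_threshold = 9000    // annual per capita food line (assumed)
scalar engel          = 0.60    // food share of total basic expenditure (assumed)
scalar poverty_line   = food_threshold / engel
display "Poverty line = " poverty_line
```

    With these assumed values the poverty line comes to about 15,000: the
    smaller the food share, the higher the poverty line relative to the food
    threshold.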


Annual Per Capita Poverty
 Line Urban, by Region





Annual Per Capita Poverty
  Line Rural, by Region






  Poverty Statistics (Family)
Measures              2000 [SE]        1997

Poverty Incidence     33.6% [0.3%]     31.8%
Poverty Gap           10.7% [0.1%]     10.0%
Severity Index         4.6% [0.1%]      4.3%

[SE] = standard error

 Poverty Incidence
All Areas, by Region






      Small Area Poverty Stats?

   Stata has some add-ons for
    generating SEs for poverty stats
   If we generate provincial poverty
    statistics, we will find that the
    SEs are too high, i.e. the figures
    are unreliable

Back to STATA


         Data Management
NOTE:
 STATA uses several types of weights

  fw frequency weights
   aw analytic weights
  iw importance weights
   pw probability weights
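
 NOTE (sketch): How the weight types differ can be seen on a tiny
  hypothetical dataset (the values of y and w are made up). The sketch
  assumes a Stata release that has the mean command; summarize itself does
  not accept pw.

```stata
clear
input y w
10 2
20 3
40 5
end

summarize y [fw=w]    // treats the data as 2 + 3 + 5 = 10 observations
summarize y [aw=w]    // same weighted mean, but the number of obs stays 3
mean y [pw=w]         // same point estimate, design-based standard error
```

  All three give the weighted mean (10*2 + 20*3 + 40*5)/10 = 28; what
  changes is how the sample size and the standard error are treated.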





         Data Management
NOTE:
 Within the command generate or
  replace, we may transform or create
  variables by using functions, e.g.,
  generate loginc=ln(toinc)
  generate y=cos(x*_pi/180)
  replace newvar=normd(z)
  generate rvar=uniform()
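
  NOTE (sketch): Two details are worth trying out. First, replace only
  works on a variable that already exists. Second, normd() and uniform()
  are the function names of the Stata version used in this course; the
  sketch below assumes a more recent release, where the corresponding
  functions are normalden() and runiform(). All variable names are
  hypothetical.

```stata
clear
set obs 5
set seed 2002                  // arbitrary seed, for reproducibility

generate x    = 30 * _n        // 30, 60, ..., 150 (degrees)
generate y    = cos(x * _pi / 180)
generate z    = (_n - 3) / 2   // -1, -0.5, 0, 0.5, 1
generate dens = normalden(z)   // standard normal density
replace  dens = ln(dens)       // replace works because dens already exists
generate u    = runiform()     // uniform random numbers
list
```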


         Data Management
DELETING VARIABLES/DATA
 To drop a variable, say age

  drop age
 To drop some observations

  drop in 2/3
  Try also the command keep.
 To drop all data in memory

  clear


         Data Management
NOTE:
 So far we have used STATA
  interactively. We can also do batch
  processing through the DO FILE
  editor.
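
 NOTE (sketch): A do-file is a plain text file of commands run in
  sequence. A minimal example, assuming it is saved under the hypothetical
  name demo.do and that the shipped example dataset auto.dta stands in for
  the course data:

```stata
* demo.do : a hypothetical do-file; run it with   do demo
capture log close              // close any log left open by an earlier run
log using demo, text replace   // writes demo.log in the current folder

sysuse auto, clear             // example data shipped with Stata (assumption)
describe
summarize price mpg

log close
```

  Keeping the log commands inside the do-file means every run leaves a
  record of both commands and output.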






         Data Management
NOTE:
 The STATA toolbar has 13 buttons.




The first three are to
 OPEN a Stata dataset
 SAVE to the disk the resident
 dataset
 PRINT a graph or log


           Data Management


   The next five are for
    Starting/stopping/suspending a LOG
    Bringing the Log to the Front
    Bringing the Dialog to Front
    Bringing the Results to Front
    Bringing the Graph to Front


            Data Management


   The last five are for
    Opening the DO FILE editor
    Opening the DATA editor
    Opening the DATA Browser
    Telling Stata to continue when it has
     paused in the middle of long output
    Stopping the current task


               Exercise
   What is the average income of
    families that are below or above
    the mean family expenditure?






                Exercise
   Compare correlation of food
    expenditures (fexp) and nonfood
    expenditures for families in rural
    & urban areas.
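
   HINT (sketch): One possible approach is to repeat correlate within each
    group using the by prefix. The sketch below shows the pattern on
    auto.dta, the example dataset shipped with Stata (an assumption), with
    foreign standing in for the urban/rural variable and price and weight
    standing in for the two expenditure variables.

```stata
sysuse auto, clear                      // stand-in data (assumption)
bysort foreign: correlate price weight  // one correlation matrix per group
```

    bysort sorts the data by the grouping variable before running the
    command for each group.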






                                 Extra
   Enter
    graph food nfood
    [Figure: scatterplot of food (vertical axis) against nfood (horizontal axis)]


                                                                     Extra
   Now try
    sort urb
    graph food nfood, by(urb)
    graph food nfood, by(urb) total
    [Figure: scatterplots of food against nfood in panels urb==1, urb==2 and Total; graphs by urb]


                                      Extra
   Matrix plots
    graph toinc food nfood, matrix
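
    NOTE (sketch): The graph commands on the Extra slides use the syntax of
    the Stata version demonstrated in this course. In later releases (an
    assumption: Stata 8 or newer) scatterplots are drawn with scatter and
    scatterplot matrices with graph matrix. A sketch using auto.dta, the
    example dataset shipped with Stata, as stand-in data:

```stata
sysuse auto, clear                           // stand-in data (assumption)
scatter price weight                         // cf. graph food nfood
scatter price weight, by(foreign)            // cf. graph food nfood, by(urb)
scatter price weight, by(foreign, total)     // adds a panel for all observations
graph matrix price mpg weight                // cf. graph toinc food nfood, matrix
```

    In the newer syntax total is a suboption inside by(), and the data need
    not be sorted first.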
    [Figure: scatterplot matrix of toinc, food and nfood]

Table Generation


      Table Generation w/ tab
   Earlier, we showed the use of
    the tab(ulate) command. Try
    tab   urb
    tab   urb [aw=rfact]
    tab   urb [iw=rfact]
    tab   urb regn





                    Tab
   The tab command has options for
    generating one-way tables of summary
    statistics
    tab urb, summ(toinc)
   and two-way tables
    tab urb sex
    tab urb sex, row
    tab urb sex, row col chi2
    tab urb sex, all exact


      Table Generation w/ table
   Aside from the tab command, we
    can generate tables of statistics
    with the table command. Compare
    tab urb
    with
    table urb





                  Table
   To generate the average (family)
    income and average (family)
    expenditure across urban and rural
    areas, enter
    table urb, c(mean toinc mean toexp)
   Using weights
    table urb [aw=rfact], c(mean toinc mean toexp)
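
    NOTE (sketch): The contents() option belongs to the version of table
    demonstrated in this course, and the syntax of table may differ in
    later releases, so it is worth checking help table on your
    installation. As a cross-check that does not depend on table, group
    means can also be obtained with tabstat. The sketch assumes a Stata
    release that has tabstat and uses the shipped example data, with
    foreign standing in for urb and the variable weight used as an
    illustrative analytic weight.

```stata
sysuse auto, clear                            // stand-in data (assumption)
tabstat price mpg, by(foreign) statistics(mean)
tabstat price mpg [aw=weight], by(foreign) statistics(mean)
```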


                  Table
   The contents option may specify at most
    five of the following statistics:
freq            (for frequency)
mean varname    (for mean of varname)
sd varname      (for standard deviation)
sum varname     (for sum)
rawsum varname  (for sums ignoring
                 optionally specified weight)
count varname   (for count of nonmissing
                 data)


                  Table
   The contents option may specify at most
    five of the following statistics:
n varname       (same as count)
max varname     (for maximum)
min varname     (for minimum)
median varname  (for median)
p1 varname      (for 1st percentile)
p2 varname      (for 2nd percentile)
      ...
iqr varname     (for interquartile range)


         Exercise Using Table
   Obtain the average and median per
    capita income of households by sex
    of household head
    table sex, c(mean pci median pci)
   Obtain the “weighted” frequency of
    poor and nonpoor households
    across regions
    table poor regn [iw=rfact]


      Using Survey Commands
   STATA has a family of commands
    designed especially for sample
    surveys. These commands all begin
    with svy
    svyset      set the survey design variables
    svydes      describe strata and PSUs
    svymean     estimate popn & subpop means
    svytotal    estimate popn & subpop totals


      Using Survey Commands
   Svy commands
    svyprop estimate popn & subpop props
    svyratio estimate popn & subpop ratios
    svytab for two way tables
    svyreg for regression
    svyivreg for instrumental variables reg
    svylogit for logit reg
    svyprobit for probit reg


      Using Survey Commands
   Svy commands
    svytest     for hypothesis testing
    svylc       for estimating linear combs
    svymlog    for multinomial logistic reg
    svyolog    for ordered logistic reg
    svyoprob    for ordered probit reg
    svypois    for poisson reg
    svyintrg   for censored & interval reg


      Using Survey Commands
   Before issuing any svy estimation
    command, we identify the weight,
    strata and PSU identifier variables
    svyset pweight rfact
    svyset strata domain
    svyset psu hcn
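
    NOTE (sketch): The three separate svyset calls follow the syntax of the
    Stata version demonstrated in this course. In later releases (an
    assumption: Stata 9 or newer) the design is declared in a single svyset
    and estimation commands take the svy: prefix. With the course variables
    this would read svyset hcn [pweight=rfact], strata(domain). The sketch
    below uses a small hypothetical dataset (stratum, psu, w, y and x are
    made-up names and values) so that it can be run on its own.

```stata
clear
input stratum psu w y x
1 1 10 5 1
1 1 10 7 2
1 2 12 6 1
1 2 12 9 3
2 1  8 4 2
2 1  8 3 1
2 2  9 8 4
2 2  9 6 2
end

* Declare PSU, sampling weight and strata in one step
svyset psu [pweight=w], strata(stratum)
svydescribe          // strata and PSU counts, cf. svydes

svy: mean y          // cf. svymean
svy: total y         // cf. svytotal
```

    Each stratum needs at least two PSUs for the variance to be estimated,
    which is why the toy data have two PSUs per stratum.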





      Using Survey Commands
   To obtain the average family
    income & average family
    expenditure
    svymean toinc toexp
   To obtain the total family income and
    total family expenditure by region
    svytotal toinc toexp, by(regn)


      Using Survey Commands
   To obtain the per capita income &
    per capita expenditure
    svyratio toinc/fsize toexp/fsize
   pci & pce by urban/rural
    svyratio toinc/fsize toexp/fsize, by(urb)






      Using Survey Commands
   Linear regression of ln(pci)
    gen loginc=ln(pci)
    svyreg loginc age fsize sex prov urb
    Compare the results with the
    regular regression command
    reg loginc age fsize sex prov urb
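
    NOTE (sketch): The same comparison can be tried in later releases (an
    assumption: Stata 9 or newer), where svyreg is written svy: regress.
    The small dataset below is hypothetical (stratum, psu, w, y and x are
    made-up names and values). With probability weights the coefficients
    from the two commands agree; the standard errors differ because only
    the survey version accounts for the strata and PSUs.

```stata
clear
input stratum psu w y x
1 1 10 5 1
1 1 10 7 2
1 2 12 6 1
1 2 12 9 3
2 1  8 4 2
2 1  8 3 1
2 2  9 8 4
2 2  9 6 2
end

svyset psu [pweight=w], strata(stratum)

svy: regress y x             // design-based standard errors
regress y x [pweight=w]      // same coefficients, design ignored in the SEs
```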





      Using Survey Commands
   Two way tables
    svytab urb poor, row se
    compared with
    tab urb poor [aw=rfact], nofreq row




Alternatives to STATA


     Learning More about Stata
   Online tutorial, type
    tutorial intro
   List of Tutorials
Tutorial       Description
-----------------------------------------------------
intro        An introduction to Stata
graphics     How to make graphs
tables       How to make tables
regress      Estimating regression models, inc 2SLS
anova        Estimating one-, two- and N-way ANOVA
                and ANCOVA models


     Learning More about Stata
Tutorial       Description
-----------------------------------------------------
logit    Estimating maximum-likelihood logit and
           probit models
survival Estimating ML survival models
factor   Estimating factor and principal
            component models
ourdata Description of the data we provide
yourdata How to input your own data into Stata






     Learning More about Stata
   Email distribution list.
    Send email to
    Majordomo@hsphsun2.harvard.edu
    In the body of your email message
    type the message
      subscribe statalist email@address
    or for a daily summary
      subscribe statalist-digest email@address

Thank you very much for listening.

       END OF TALK
        Introduction to STATA

								