A Gentle Introduction to STATA

Jose Ramon G. Albert
Research Division Chief
Statistical Research & Training Center (SRTC)
email: srtcres@srtc.gov.ph

SIAP-SRTC Training Course on Sampling
Acceed Center, AIM, Makati, Philippines
4 April 2002

OUTLINE
Statistical Computing Resources
Data Management with Stata
Table Generation
• Tab and Table Commands
• Survey Commands

SIAP-SRTC Training on Sampling

Computing Resources

The Age of ICT has brought about a synergy of computing and communications.
Implications:
• More DATA collected
• More DATA stored
• More DATA accessible and distributed

There is a host of statistical software packages that provide preprogrammed analytical and data management capabilities. These packages may be classified according to use and cost.

Types of Stat Software by usage
• General Purpose: SAS, SPSS, R, S-plus, Statistica, Stata
• Special Purposes: econometric modeling (Eviews), seasonal adjustment (X12), Bayesian modeling (WINBUGS), survey data tabulation & variance estimation (IMPS, CENVAR)

Types of Stat Software by cost
• Commercial Software: SAS, SPSS, Stata, S-plus
• Freeware: R, IMPS, X12

FOR SURVEY DATA
• Bascula from Statistics Netherlands
• CENVAR (& IMPS) from U.S. Bureau of the Census
• CLUSTERS from University of Essex
• Epi Info from Centers for Disease Control
• Generalized Estimation System (GES) from Statistics Canada
• IVEWare (beta version) from University of Michigan
• PCCARP from Iowa State University
• SAS/STAT from SAS Institute
• Stata from Stata Corporation
• SUDAAN from Research Triangle Institute
• VPLX from U.S. Bureau of the Census
• WesVar from Westat, Inc.

Lists of Statistical Software
http://members.aol.com/johnp71/javasta2.html
http://www.stir.ac.uk/Departments/HumanSciences/SocInfo/Statistical.htm
http://www.fas.harvard.edu/~stats/survey-soft/
http://www.feweb.vu.nl/econometriclinks/software.html

This afternoon, we will demonstrate how to use STATA for some of the most common tasks of data management, statistical computing and analysis of survey data.

Stata
• Estimation of means, totals, ratios, and proportions; linear regression, logistic regression, and probit.
• Point estimates, associated standard errors, confidence intervals, and design effects are displayed for the full population or for subpopulations.
• Auxiliary commands display various information for linear combinations (e.g., differences) of estimators, and conduct hypothesis tests.
• New in Stata: contingency tables with Rao-Scott corrections of chi-squared tests; new survey-corrected regression commands including tobit, interval, censored, instrumental variables, multinomial logit, ordered logit and probit, and Poisson.
• Stratified designs and cluster sampling are handled. FPCs can be calculated for simple random sampling without replacement of sampling units within strata.
• Variance estimation for multistage sample data is carried out through the customary between-PSU-squared-differences calculation.
• Variance estimation is done through Taylor-series linearization in the survey analysis commands. There are also commands for jackknife and bootstrap variance estimation, but these are not specifically oriented toward survey data.
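
As a sketch of what the "between-PSU squared differences" calculation refers to, the usual textbook form of the variance estimator for an estimated total is written out below. The notation (h for strata, i for PSUs, j for sampled units, z for weighted PSU totals, f for the stratum sampling fraction) is introduced here for illustration only and does not come from the slides.

```latex
\hat{Y} = \sum_{h=1}^{L} \sum_{i=1}^{n_h} z_{hi}, \qquad
z_{hi} = \sum_{j} w_{hij}\, y_{hij}, \qquad
\bar{z}_h = \frac{1}{n_h} \sum_{i=1}^{n_h} z_{hi}

\widehat{V}(\hat{Y}) = \sum_{h=1}^{L} (1 - f_h)\, \frac{n_h}{n_h - 1}
  \sum_{i=1}^{n_h} \left( z_{hi} - \bar{z}_h \right)^2
```

Here L is the number of strata, n_h the number of sampled PSUs in stratum h, w_hij the sampling weight of unit j, and f_h is taken as 0 when no finite population correction is supplied. For means and ratios, Taylor-series linearization applies the same formula to a linearized variable in place of y.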

Note: We will demonstrate the use of STATA version 6. The current version is version 7, which is also available in a Special Edition (SE) that can handle up to 32,766 variables with strings of up to 244 characters, and matrices of up to 11,000 x 11,000.

Data Management with STATA

Data Management

STARTING UP
Go to Start, Programs, Stata, Intercooled Stata.
Alternatively, from Windows Explorer, go to folder c:\stata and double click wstata.exe.

CREATING A NEW DATASET
Open the STATA spreadsheet editor.
Enter data into the editor; when done, close the editor.
In the STATA COMMAND window enter the command
    save newfile

NOTE
A STATA dataset has the extension dta. That is, newfile is actually newfile.dta.
Public use files of some surveys, e.g. VLSS (Vietnam Living Standards Survey), are in Stata format.

INSPECTING DATA BASE
In the STATA COMMAND window enter the following commands
    describe
    list
    summarize

NOTE:
• Stata is case sensitive; commands are typed in lowercase.
• Stata commands may be abbreviated, e.g. d for describe, sum for summarize, etc.
• We may use the Page Up/Down keys or the mouse to re-select commands in the Review window.
• Commands and output are shown in the Results window.
• Windows may be resized.
• Commands and output may be logged into a log file by pressing the Open Log button.
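
The inspection commands and the Open Log button described above also have a typed equivalent. The lines below are a minimal sketch of such a session, assuming the dataset newfile.dta saved earlier sits in the current folder; the log name session1 is a hypothetical choice made for this illustration.

```stata
* Sketch only: record an inspection session to a log file.
* "session1" is a hypothetical log name, not one used in the talk.
log using session1, replace

* Load the dataset saved earlier and inspect it.
use newfile, clear
describe
list
summarize

* Stop recording.
log close
```

Typing the commands instead of clicking the button makes it easy to repeat the same session later.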

RENAMING VARIABLES
ONE WAY (from the Data Editor): Double click anywhere in the variable's column; a dialogue box appears.
SECOND WAY (in the STATA COMMAND window): enter
    rename var1 domain
    rename var2 hcn
    rename var3 age
    label variable age "HH head age"
    d

SAVING EDITED DATABASE
In the STATA COMMAND window enter the following command
    save newfile, replace
Note: typing only save newfile will result in an error message, because newfile.dta already exists.

READING PRE-EXISTING STATA DATASET
If the dataset is in folder c:\fies2000 and the filename is "fies00small.dta", enter
    clear
    set mem 64m
    cd c:\fies2000
    use fies00small
NOTE: set mem is important for MEMORY MANAGEMENT.

IMPORTING DATA
Suppose we have a dataset try.txt in the c:\fies2000 folder.
NOTE: Missing data are coded as "."
Use the infile command with syntax
    infile variable-list using filename.raw
In particular, enter
    cd c:\fies2000
    infile domain hcn age using try.txt, automatic

TRIVIA ON STRING VARIABLES
When using the infile command for character (string) variables, we need to identify these variables.
For instance
    infile domain hcn str30 prov using tr.txt
For more details regarding infile, enter
    help infile1

IMPORTING DATA
Suppose we have a dataset try2.txt in the c:\fies2000 folder with the data in specific fields. (This assumes the last line of the file is a blank line.)
Use the infix command
    infix domain 1 hcn 2 age 3-4 using try2.txt, clear

Thus, Stata can read text files with
• infile (if the data in the text file are separated by spaces and do not have strings, or if strings are just one word, or if all strings are enclosed in quotes)
• infix (fixed format text)
• insheet (if the text file was created by a spreadsheet or database program)

NOTE:
The commands infile, infix and insheet read data from ASCII files. outfile is a way to save the data in ASCII.
There are third party programs, esp. Stat/Transfer and DBMS/COPY, that perform translations from one data format (e.g., dBASE, Excel, SAS, SPSS, Stata) to another.

OTHER USEFUL COMMANDS
To sort the dataset by age
    sort age
To get a listing of the dataset
    list
To get a listing of the 2nd to 4th observations
    list in 2/4
To summarize the restricted dataset of HHs whose head's age is less than or equal to 50
    summarize if age <=50
To summarize HHs whose head's age is between 35 and 50
    summarize if age <50 & age >35

Comparison operators
    >   >=   ==   <   <=   !=
Logical operators
    & (and)   | (or)   ! (not)   ~ (not)

To tabulate domain
    tab domain
To generate contingency tables
    tab domain hcn if age>35
To get the correlation matrix
    correlate x y z

GENERATING & REPLACING VARIABLES
Suppose we want to obtain per capita income (pci) of FIES 2000 households
    clear
    cd d:\fies00
    use fies00small
    gen pci=toinc/hsize
Now tag the household as poor (1) if pci is below some threshold, say 13823, and determine the percent of HHs that are poor.
    gen poor=1 if pci < 13823
    replace poor=0 if poor==.
    sum poor [aw=rfact]
    save fies00small, replace

NOTE
Only a small portion of the FIES 2000 data set was used. The Family Income and Expenditure Survey (FIES) is conducted by the National Statistics Office (NSO) every 3 years. Data may be purchased through the NSO website: www.census.gov.ph

Introduction to STATA (cont'd)

Jose Ramon G. Albert
Research Division Chief
Statistical Research & Training Center (SRTC)
email: srtcres@srtc.gov.ph

SIAP-SRTC Training Course on Sampling
Acceed Center, AIM, Makati, Philippines
5 April 2002

RECALL
If we use our fies2000 data set
    set mem 64m
    cd c:\fies2000
    use fies00small
    sum poor [aw=rfact]
Note: the poverty line we provided is a weighted average of the variable poverty lines in the Philippines (for urban-rural areas across the different regions).

Digression … Official Poverty Measurement & Latest Poverty Statistics

Estimating Food Poverty Line
The food poverty line is estimated from low cost one-day menus (breakfast, lunch, supper, snack) constructed for each urban-rural area of a region by the Food and Nutrition Research Institute (FNRI), which meet 100% sufficiency in energy and protein requirements and 80% sufficiency of other nutrients and vitamins.
• RDA’s for energy: 2000 Kcal per person
• RDA’s for protein: 50 grams per person
29 such menus were constructed on the basis of the 1988 Food Consumption Survey.

[Figure: Annual Per Capita Food Line, Urban, by Region]
[Figure: Annual Per Capita Food Line, Rural, by Region]

Estimating Poverty Line
Poverty Line = Food Threshold / Engel’s Coefficient
Engel’s Coefficient = Food Exp / Total Basic Exp
Engel’s coefficient is estimated by analyzing the consumption pattern of families having incomes within plus or minus 10 percentage points of the food threshold.

[Figure: Annual Per Capita Poverty Line, Urban, by Region]
[Figure: Annual Per Capita Poverty Line, Rural, by Region]

Poverty Statistics (Family)
Measures             2000     1997     [Standard Error]
Poverty Incidence    33.6%    31.8%    [0.3%]
Poverty Gap          10.7%    10.0%    [0.1%]
Severity Index        4.6%     4.3%    [0.1%]

[Figure: Poverty Incidence, All Areas, by Region]

Small Area Poverty Stats?
Stata has some add-ons for generating SEs for poverty stats.
If we wish to generate provincial poverty statistics, we will find that the SEs are too high, i.e. the figures are unreliable.

Back to STATA

RECALL
If we use our fies2000 data set
    set mem 64m
    cd c:\fies2000
    use fies00small
    sum poor [aw=rfact]
Note: the poverty line we provided is a weighted average of the variable poverty lines in the Philippines (for urban-rural areas across the different regions).

NOTE: STATA uses several types of weights
    fw   frequency weights
    aw   analytic weights
    iw   importance weights
    pw   probability weights

NOTE: Within the command generate or replace, we may transform or create variables by using functions, e.g.,
    generate loginc=ln(toinc)
    generate y=cos(x*_pi/180)
    replace newvar=normd(z)
    generate rvar=uniform()

DELETING VARIABLES/DATA
To drop a variable, say age
    drop age
To drop some observations
    drop in 2/3
Try also the command keep.
To drop all data in memory
    clear

NOTE: So far we have used STATA interactively. We can also do batch processing through the DO FILE editor.

NOTE: The STATA toolbar has 13 buttons.
The first three are to
• OPEN a Stata dataset
• SAVE to the disk the resident dataset
• PRINT a graph or log
The next five are for
• Starting/stopping/suspending a LOG
• Bringing the Log to the Front
• Bringing the Dialog to Front
• Bringing the Results to Front
• Bringing the Graph to Front
The last five are for
• Opening the DO FILE editor
• Opening the DATA editor
• Opening the DATA Browser
• Telling Stata to continue when it has paused in the middle of long output
• Stopping the current task

Exercise
What is the average income of families that are below or above the mean family expenditure?
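
One possible way to approach this exercise is sketched below. It is not the presenter's solution. It assumes fies00small.dta contains toinc (family income), toexp (family expenditure) and rfact (weight), as in the commands used elsewhere in this talk; the variable name above is a hypothetical helper introduced here.

```stata
* Sketch only: average income below vs. above the mean expenditure.
use fies00small, clear

* Weighted mean of family expenditure; summarize leaves it in r(mean).
summarize toexp [aw=rfact]

* "above" is a hypothetical helper: 1 if expenditure exceeds the mean, 0 otherwise.
* The if condition keeps families with missing expenditure out of both groups.
generate above = toexp > r(mean) if toexp != .

* Weighted average income within each group.
table above [aw=rfact], c(mean toinc)
```

The generate line has to follow summarize directly, since r(mean) is overwritten by the next command that stores results.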

Exercise
Compare the correlation of food expenditures (fexp) and nonfood expenditures for families in rural & urban areas.

Extra
Enter
    graph food nfood
[Figure: scatterplot of food against nfood]

Now try
    sort urb
    graph food nfood, by(urb)
    graph food nfood, by(urb) total
[Figure: scatterplots of food against nfood for urb==1, urb==2 and Total; graphs by urb]

Matrix plots
    graph toinc food nfood, matrix
[Figure: scatterplot matrix of toinc, food and nfood]

Table Generation

Table Generation w/ tab
Earlier, we showed the use of the tab(ulate) command. Try
    tab urb
    tab urb [aw=rfact]
    tab urb [iw=rfact]
    tab urb regn

Tab
The tab command has options for generating 1-way tables of freqs
    tab urb, summ(toinc)
and two way tables
    tab urb sex
    tab urb sex, row
    tab urb sex, row col chi2
    tab urb sex, all exact

Table Generation w/ table
Aside from the tab command, we can generate tables of statistics with the table command.
Compare
    tab urb
with
    table urb

Table
To generate the average (family) income and average (family) expenditure across urban and rural areas, enter
    table urb, c(mean toinc mean toexp)
Using weights
    table urb [aw=rfact], c(mean toinc mean toexp)

The contents option may specify at most five of the following statistics:
    freq              (for frequency)
    mean varname      (for mean of varname)
    sd varname        (for standard deviation)
    sum varname       (for sum)
    rawsum varname    (for sums ignoring optionally specified weight)
    count varname     (for count of nonmissing data)
    n varname         (same as count)
    max varname       (for maximum)
    min varname       (for minimum)
    median varname    (for median)
    p1 varname        (for 1st percentile)
    p2 varname        (for 2nd percentile)
    ...
    iqr varname       (for interquartile range)

Exercise Using Table
Obtain the average and median per capita income of households by sex of household head
    table sex, c(mean pci median pci)
Obtain the "weighted" frequency of poor and nonpoor households across regions
    table poor regn [iw=rfact]

Using Survey Commands
STATA has designed a family of commands especially for sample surveys.
These commands all begin with svy
    svyset      setting variables
    svydes      describe strata and PSUs
    svymean     estimate popn & subpop means
    svytotal    estimate popn & subpop totals
    svyprop     estimate popn & subpop props
    svyratio    estimate popn & subpop ratios
    svytab      for two way tables
    svyreg      for regression
    svyivreg    for instrumental variables reg
    svylogit    for logit reg
    svyprobit   for probit reg
    svytest     for hypothesis testing
    svylc       for estimating linear combs
    svymlog     for multinomial logistic reg
    svyolog     for ordered logistic reg
    svyoprob    for ordered probit reg
    svypois     for poisson reg
    svyintrg    for censored & interval reg

Before issuing any svy estimation command, we identify the weight, strata and PSU identifier variables
    svyset pweight rfact
    svyset strata domain
    svyset psu hcn

To obtain the average family income & average family expenditure
    svymean toinc toexp
To obtain the total family income and total family expenditure by region
    svytotal toinc toexp, by(regn)
To obtain the per capita income & per capita expenditure
    svyratio toinc/fsize toexp/fsize
To obtain pci & pce by urban/rural
    svyratio toinc/fsize toexp/fsize, by(urb)

Linear regression of ln(pci)
    gen loginc=ln(pci)
    svyreg loginc age fsize sex prov urb
Compare the results with the regular regression command
    reg loginc age fsize sex prov urb

Two way tables
    svytab urb poor, row se
compared with
    tab urb poor [aw=rfact], nofreq row

Alternatives to STATA

Learning More about Stata
Online tutorial, type
    tutorial intro

List of Tutorials
Tutorial    Description
-----------------------------------------------------
intro       An introduction to Stata
graphics    How to make graphs
tables      How to make tables
regress     Estimating regression models, inc 2SLS
anova       Estimating one-, two- and N-way ANOVA and ANCOVA models
logit       Estimating maximum-likelihood logit and probit models
survival    Estimating ML survival models
factor      Estimating factor and principal component models
ourdata     Description of the data we provide
yourdata    How to input your own data into Stata

Email distribution list. Send email to
    Majordomo@hsphsun2.harvard.edu
In the body of your email message type
    subscribe statalist email@address
or for a daily summary
    subscribe statalist-digest email@address

Maraming Salamat sa inyong pakikinig.
(Thank you for your attention)

END OF TALK