Stat 421 – Survey Sampling Techniques
Lab 2: Introduction to SAS 9.2 (part 2)

In this lab, we continue working on the objectives stated in Lab 1 by introducing a number of PROC steps that we will use throughout the semester.

Last week we learned how to:
   read data into SAS and create new variables using IF-THEN statements in the DATA step;
   create formats for our variables and associate variables with formats;
   view our data and our data manipulations using the Explorer tab and PROC PRINT;
   check for problems in our SAS program using the Log window;
   produce frequency and percentage tabulations for our data using PROC FREQ; and
   use comments to annotate our SAS program and TITLE statements to label our output.

This week we introduce two exploratory graphical methods in SAS:
   PROC GPLOT will be used to examine the relationship between two continuous variables.
   PROC BOXPLOT will be used to compare the distribution of a variable across the levels of a categorical grouping variable. Later in the course, that grouping variable will most often be the stratum.

We will also introduce three PROC steps in SAS designed specifically for surveys:
   PROC SURVEYMEANS will be used to make statistically valid inferences about population means and totals.
   PROC SURVEYFREQ performs a similar function to PROC FREQ from last week, but we will use it to make statistically valid inferences about population proportions in a survey setting. (If we don't have an equal probability design, the percentages from PROC FREQ are not valid estimates.)
   PROC SURVEYSELECT is used to draw a sample from a frame under a chosen sampling scheme.

We will use the program posted on the course webpage, Lab02.sas, to learn about these procedures. Explanations of the PROCs are embedded in the comments of the program, so review those comments carefully to help you understand how to write your own SAS code.
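For reference, here is a sketch of what the PROC SURVEYMEANS output amounts to in the simplest case. It assumes a simple random sample with no weights, strata, or clusters, and no population total or sampling rate supplied to the procedure, so no finite population correction is applied. The symbols n (sample size), y_i (observed values), and s (sample standard deviation) are notation introduced here for illustration, not taken from Lab02.sas.

```latex
\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2,
\qquad
\widehat{\mathrm{SE}}(\bar{y}) = \frac{s}{\sqrt{n}},
\qquad
\bar{y} \pm t_{n-1,\,1-\alpha/2}\,\widehat{\mathrm{SE}}(\bar{y})
```

Under these assumptions the mean and standard error should agree with those from a non-survey procedure such as PROC UNIVARIATE; the survey procedures start to differ once weights, strata, clusters, or a population total enter the analysis.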
Code from Lab02.sas

/* This data set contains 12 variables (from a 1994 AAUP salary survey):
   FICE    = federal ID number
   name    = college name
   state   = state abbreviation
   fullsal = average salary for full professors
   assosal = average salary for associate professors
   assisal = average salary for assistant professors
   allsal  = average salary for all ranks
   numfull = number of full professors
   numasso = number of associate professors
   numassi = number of assistant professors
   numinst = number of instructors
   numfac  = number of all faculty
*/

/* Filename option dropped */
DATA salary;
   INFILE '\\iastate.edu\cyfiles\lbbb\Documents\Downloads\aaup2.dat';
      /* change the path to wherever you downloaded the .dat file */
   INPUT FICE 1-5 name $ 6-37 state $ 38-39
         fullsal 44-48 assosal 49-52 assisal 53-56 allsal 57-60
         numfull 79-82 numasso 83-86 numassi 87-90 numinst 91-94
         numfac 95-99;
   IF numfac >= 200 THEN sizeind = 1;   /* large school */
   ELSE sizeind = 0;                    /* small school */
RUN;

PROC FORMAT;   /* Create format */
   VALUE sz 0 = "Small School"
            1 = "Large School";
RUN;

DATA salary;   /* Associate the format with the variable sizeind */
   SET salary;
   FORMAT sizeind sz.;
RUN;

/* PROC GPLOT can make scatterplots. The PLOT statement gives the
   variables to plot in the form (Y-axis)*(X-axis). The WHERE
   statement makes SAS use only the portion of the data where the
   condition is true, so this plot is only for large schools.
   The VALUE= (V=) option in the SYMBOL statement gives the shape
   of the plotted points. Other choices are square, triangle,
   diamond, and so on.
*/
PROC GPLOT DATA=salary;
   TITLE 'Comparison of Full vs Associate Salaries in Large Schools';
   PLOT fullsal*assosal;            /* Y vs X, or Vertical vs Horizontal */
   SYMBOL VALUE=circle COLOR=red;
   WHERE sizeind=1;                 /* use WHERE to plot part of the data */
RUN;

/* When you want to look at statistics broken down by some
   categories, SAS often wants the data set sorted by those
   categories, so we need to use PROC SORT.
*/
PROC SORT DATA=salary;
   BY sizeind;   /* increasing order */
RUN;

/* Sometimes we will want to look at side-by-side boxplots of our
   data. To do this we will use PROC BOXPLOT. The form of the PLOT
   statement is (analysis variables)*(group variable).
   The plots from GPLOT and BOXPLOT all stack in the same graphics
   window, so you will need to scroll up and down to see the
   different plots.
*/
PROC BOXPLOT DATA=salary;
   TITLE 'Associate Professor Salary by Sizeind';
   PLOT assosal*sizeind;   /* analysis variables*group variable */
RUN;

/* PROC SURVEYMEANS will be used in this class to get means and
   proportions, along with their standard errors, based on the
   sample designs and analyses we are interested in.
   The VAR statement tells SAS which variables to compute means
   and proportions for.
*/
PROC SURVEYMEANS DATA=salary;
   TITLE 'Mean Salary for All Faculty Levels';
   VAR allsal fullsal;
RUN;

/* PROC UNIVARIATE can be used in a non-survey setting to produce
   summary statistics for a single variable.
   The VAR statement tells SAS which variable to analyze.
*/
PROC UNIVARIATE DATA=salary;
   TITLE 'Mean Salary for All Faculty Levels';
   VAR allsal;   /* same mean for allsal as from SURVEYMEANS above */
RUN;

/* There are two ways of calculating proportions and standard
   errors of proportions. The first is to use PROC SURVEYMEANS
   with a CLASS statement, and the second is to use PROC SURVEYFREQ.
   The CLASS statement is common to many SAS PROCs, where it tells
   SAS to perform the calculations for each level of the variable
   listed. In SURVEYMEANS it works a little differently: the CLASS
   statement tells SAS that sizeind is a categorical variable, so
   SAS estimates the proportion in each category.
   The MEAN and STDERR options tell SAS to calculate the mean and
   the standard error of the mean. The CLM option tells SAS to
   construct confidence limits, and ALPHA=.05 tells SAS that we
   want 95% confidence limits.
*/
PROC SURVEYMEANS DATA=salary MEAN STDERR CLM ALPHA=.05;
   TITLE 'Proportion of Large Schools';
   CLASS sizeind;   /* categorical, not continuous, so calculate proportions */
   VAR sizeind;
RUN;

/* The PROC SURVEYFREQ step looks similar to the PROC FREQ step
   from Lab 1. Here the CL option is used to obtain confidence
   limits. Omitting the ALPHA= option gives the default 95%
   confidence limits.
*/
PROC SURVEYFREQ DATA=salary;
   TITLE 'Proportion of Large Schools';
   TABLE sizeind / CL;   /* std error is printed by default, but CL is not */
RUN;

/* If we treat our data as a sampling frame (population), we can
   use SAS to draw a sample for us under a variety of sample
   designs. To do this we will use PROC SURVEYSELECT.
   DATA= tells SAS which input frame to use.
   OUT= tells SAS where to store the selected sample.
   N= tells SAS the size of the sample we want.
   SEED= gives the starting value for the random number generator.
   The seed should be chosen at random; it is fixed in lab only so
   that we all get the same sample. If SEED= is omitted, SAS takes
   the seed from the system clock, so the sample changes from run
   to run.
   METHOD= tells SAS which design to use to select the sample from
   the frame. METHOD=SRS in SAS is simple random sampling without
   replacement (SRSWOR).
*/
PROC SURVEYSELECT DATA=salary OUT=salsamp N=100 SEED=31218 METHOD=SRS;
RUN;

PROC PRINT DATA=salsamp;
   TITLE 'Sample of 100 Schools';
RUN;

In-Lab Survey Computing Assignment 2
Due at the end of the lab period

Writing your own SAS code

Continue working with the data set counties.dat. Provide an appropriate title for each piece of code that produces output.

1) Plot the number of physicians (Y) against the county population (X).
2) Estimate the average number of physicians, along with its standard error and a 95% confidence interval for the mean.
3) Produce a frequency table of vietind for Texas ('TX') and Florida ('FL') together without manipulating the data set (i.e., do this entirely within the SAS procedure).
   Hint: logical operators such as AND and OR can be used in many statements, such as WHERE.
4) Obtain side-by-side boxplots of the unemployment counts for the two values of vietind.
5) What, if anything, can be learned from the boxplot in question (4)?
6) What could be done to the code to make the plot in (4) more useful? (Hint: use a WHERE statement to exclude some outliers.)
7) Use your answer to (6) to modify your program and make new plots for (1) and (4). What do you see now?
8) Draw an SRSWOR (use METHOD=SRS) of size 25 from the counties using a random seed of 56856, and print the resulting data set.

For this lab homework, turn in your answers to questions (5), (6), and (7) and your program code in a Word document to: lbbb@iastate.edu. Be sure to use your name and lab number as the title (e.g., Bin_Liu_Lab02.docx).
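
To illustrate the hint in question 3 about logical operators, here is a minimal sketch that uses the salary data set from Lab02.sas rather than counties.dat. It assumes the salary data set has already been created by the DATA step in Lab02.sas and that the state codes 'IA' and 'IL' occur in it; those two codes are chosen only for illustration.

```sas
/* Sketch only: restrict a frequency table to two states by
   combining conditions with OR in a WHERE statement.
   Assumes the salary data set from Lab02.sas exists and that
   'IA' and 'IL' appear in the state variable. */
PROC FREQ DATA=salary;
   TITLE 'School Size for Iowa and Illinois Schools';
   TABLES sizeind;
   WHERE state = 'IA' OR state = 'IL';   /* keep rows from either state */
RUN;
```

The same subset can also be written as WHERE state IN ('IA', 'IL'); which is often easier to read when several values are listed. Because the WHERE statement only filters the rows the procedure reads, the data set itself is left unchanged.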