# Lab02 by keralaguest

```
Stat 421 – Survey Sampling Techniques
Lab 2: Introduction to SAS 9.2 (part 2)
In this lab, we will continue to work on the objectives stated in Lab 1 by introducing
a number of PROC steps that we will use throughout the semester.
Last week we learned how to:
   • read data into SAS and create new variables using IF-THEN statements in
     the DATA step;
   • create formats for our variables and associate variables with formats;
   • view our data and our data manipulations using the Explorer tab and PROC
     PRINT;
   • check for problems in our SAS program using the Log window;
   • produce frequency and percentage tabulations for our data using PROC
     FREQ; and
   • use comments to annotate our SAS program and TITLE statements to label
     our output.
This week we introduce two exploratory graphical methods in SAS:
   • PROC GPLOT will be used to examine the relationship between two
     continuous variables.
   • PROC BOXPLOT will be used to compare the marginal distributions of a
     variable across the levels of a categorical grouping variable. Later in
     the course, that grouping variable will most often be the stratum.
We will also introduce three PROC steps in SAS specifically for surveys: PROC
SURVEYMEANS, PROC SURVEYFREQ, and PROC SURVEYSELECT.
   • PROC SURVEYMEANS will be used to calculate statistically valid inferences
     for the means and totals of a population.
   • PROC SURVEYFREQ performs a similar function to PROC FREQ from last
     week, but we will use it to produce statistically valid inferences for
     population proportions in a survey setting. (If we don’t have an equal
     probability design, the percentages in PROC FREQ are invalid estimates.)
   • PROC SURVEYSELECT is used to draw a sample from a frame using a chosen
     sampling scheme.
We will use the program posted on the course webpage, Lab02.sas, to learn about
these procedures. Explanations of the PROCs are embedded in the comments of the
program, so review the comments carefully.

Code from Lab02.sas
/* This data set contains 12 variables (from a 1994 AAUP salary survey):
FICE = federal ID number
name = college name
state = State abbreviation
fullsal = average salary for full professors
assosal   =   average salary for associate professors
assisal   =   average salary for assistant professors
allsal    =   average salary for all ranks
numfull   =   number of full professors
numasso   =   number of associate professors
numassi   =   number of assistant professors
numinst   =   number of instructors
numfac    =   number of all faculty
*/

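/* Filename option dropped: supply the location of the data file
(for example, with an INFILE statement) before running this step */
DATA salary;
INPUT FICE 1-5 name $ 6-37 state $ 38-39 fullsal 44-48 assosal 49-52
assisal 53-56 allsal 57-60
numfull 79-82 numasso 83-86 numassi 87-90 numinst 91-94 numfac 95-99;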

IF numfac >=200 then sizeind = 1; /*large school*/
ELSE sizeind = 0;
RUN;

PROC FORMAT; /*Create format*/
VALUE sz 0 = "Small School"
1 = "Large School";
RUN;

DATA salary; /*Associate format to variable sizeind*/
SET salary;
FORMAT sizeind sz.;
RUN;

/* PROC GPLOT can make scatterplots.
The PLOT statement gives the variables to plot in the
form (Y-axis)*(X-axis).
The WHERE statement makes SAS only use the portion of the data where
the statement is true, so this plot is only for large schools.
The VALUE= option (V= for short) in the SYMBOL statement gives the shape
for plotting points. Other choices are square, triangle, diamond, and so on.
*/

PROC GPLOT DATA=salary;
TITLE 'Comparison of Full vs Associate Salaries in Large Schools';
PLOT fullsal*assosal; /*Y vs X, or Vertical vs Horizontal*/
SYMBOL Value=circle color=red;
WHERE sizeind=1; /*use WHERE to plot partial data*/
RUN;
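
/* A sketch, not part of Lab02.sas: a third variable can be added to the
PLOT statement in the form (Y-axis)*(X-axis)=(group variable) to show
small and large schools on the same plot, so no WHERE statement is used.
SYMBOL1 and SYMBOL2 are assumed here to be assigned to the levels of
sizeind in order (small, then large).
*/

PROC GPLOT DATA=salary;
TITLE 'Full vs Associate Salaries by School Size';
PLOT fullsal*assosal=sizeind; /*one plotting symbol per level of sizeind*/
SYMBOL1 VALUE=circle COLOR=red;
SYMBOL2 VALUE=square COLOR=blue;
RUN;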

/* When you want to look at statistics broken down by
some categories, SAS often wants the data set sorted
by the categories, so we need to use PROC SORT.
*/
PROC SORT DATA=salary;
BY sizeind;   /*increasing order*/
RUN;
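
/* A sketch, not part of Lab02.sas: DESCENDING before a BY variable
reverses the sort order, and OUT= writes the sorted rows to a new data
set so that salary itself stays sorted by sizeind. The data set name
salary_big is made up for this illustration.
*/
PROC SORT DATA=salary OUT=salary_big;
BY DESCENDING numfac;   /*schools with the most faculty first*/
RUN;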

/* Sometimes we will want to look at side-by-side boxplots of our
data. To do this we will use PROC BOXPLOT. The form of the PLOT
statement is (analysis variables)*group variable.
The plots from GPLOT and BOXPLOT all stack in the same graphics
window, so you will need to scroll up and down to see the different
plots.
*/

PROC BOXPLOT DATA=salary;
TITLE 'Associate Professor Salary by Sizeind';
PLOT assosal*sizeind; /* analysis variables*group variable */
RUN;
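
/* A sketch, not part of Lab02.sas: several analysis variables can be
listed in parentheses in the PLOT statement, which gives one set of
side-by-side boxplots per analysis variable. The data set must still be
sorted by sizeind.
*/

PROC BOXPLOT DATA=salary;
TITLE 'Associate and Full Professor Salary by Sizeind';
PLOT (assosal fullsal)*sizeind; /* two analysis variables, one group variable */
RUN;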

/* PROC SURVEYMEANS will be used in this class to get means
and proportions, along with their standard errors, based on the
sample designs and analyses we are interested in. The VAR
statement tells SAS which variables to get means and proportions
for.
*/

PROC SURVEYMEANS DATA=salary;
TITLE 'Mean Salary for All Faculty Levels';
VAR allsal fullsal;
RUN;
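
/* A sketch, not part of Lab02.sas: statistic keywords on the PROC
SURVEYMEANS statement choose what is reported. Here SUM, STD, and CLSUM
request the estimated total, the standard deviation of that total, and
confidence limits for the total. No WEIGHT statement is given, so every
school counts once and the total is simply the sum over the data set.
*/

PROC SURVEYMEANS DATA=salary SUM STD CLSUM;
TITLE 'Total Number of Faculty';
VAR numfac;
RUN;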

/*
PROC UNIVARIATE can be used in a non-survey setting
to produce summary statistics for a single variable.
The VAR statement tells SAS which variable to analyze.
*/

PROC UNIVARIATE data=salary;
Title 'Mean Salary for All Faculty Levels';
var allsal;   /*same mean for allsal as from PROC SURVEYMEANS*/
run;

/* There are two ways of calculating proportions and standard
errors of proportions. The first method is to use PROC SURVEYMEANS
with a CLASS statement and the second method is to use
PROC SURVEYFREQ.
In many SAS PROCs, the CLASS statement tells SAS to perform the
calculations for each level of the variable listed in the CLASS
statement.
In SURVEYMEANS it works a little differently: the CLASS statement
tells SAS that sizeind is a categorical variable, so SAS estimates
the proportion in each category.
The MEAN and STDERR options tell SAS to calculate the mean and the
standard error of the mean.
The CLM option tells SAS to construct confidence limits for the mean,
and the ALPHA=.05 option tells SAS that we want 95% confidence for the
limits.
*/

PROC SURVEYMEANS DATA=salary MEAN STDERR CLM ALPHA=.05;
TITLE 'Proportion of Large Schools';
class sizeind;   /*categorical, not continuous, so calculate the
proportion*/
VAR sizeind;
RUN;

/* PROC SURVEYFREQ looks similar to PROC FREQ from Lab 1. Here the CL
option is used to obtain confidence limits. Omitting the ALPHA= option
gives the default 95% confidence limits.
*/

PROC SURVEYFREQ DATA=salary;
TITLE 'Proportion of Large Schools';
TABLE sizeind /CL ; /*std error is default, but CL is not*/
RUN;
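
/* A sketch, not part of Lab02.sas: adding ALPHA= to the TABLE statement
options changes the confidence level, here to 90%.
*/

PROC SURVEYFREQ DATA=salary;
TITLE 'Proportion of Large Schools, 90% Confidence Limits';
TABLE sizeind / CL ALPHA=0.10; /*ALPHA=0.10 gives 90% limits*/
RUN;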

/* If we treat our data as a sampling frame (population), we can use
SAS to draw a sample for us under a variety of sample
designs. To do this we will use PROC SURVEYSELECT.
The DATA= option tells SAS what input frame to use.
The OUT= option tells SAS where to store the selected sample.
The n= option tells SAS the size of the sample we want.
The SEED= option is a number used to start the random sampling. This seed
should be randomly chosen and is fixed in lab only so that we all get the
same sample.
If SEED= is omitted, SAS picks a seed based on the system time, so the
sample changes from run to run.
The METHOD= option tells SAS what design to use to select the sample from
the frame.
METHOD=SRS in SAS is simple random sampling without replacement (SRSWOR).
*/

PROC SURVEYSELECT DATA=salary OUT=salsamp n=100 SEED=31218 METHOD=SRS;
RUN;

PROC PRINT DATA=salsamp;
TITLE 'Sample of 100 Schools';
RUN;
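
/* A sketch, not part of Lab02.sas: the STATS option asks SURVEYSELECT to
add the selection probability and sampling weight to the output data
set, and the WEIGHT statement in SURVEYMEANS then uses that weight when
analyzing the sample. The data set name salsamp_w is made up for this
illustration, and SamplingWeight is assumed to be the name of the weight
variable that SURVEYSELECT writes.
*/

PROC SURVEYSELECT DATA=salary OUT=salsamp_w n=100 SEED=31218 METHOD=SRS
STATS;
RUN;

PROC SURVEYMEANS DATA=salsamp_w MEAN STDERR CLM;
TITLE 'Mean Salary Estimated from the Sample of 100 Schools';
WEIGHT SamplingWeight;
VAR allsal;
RUN;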

In-Lab Survey Computing Assignment 2
Due at the end of the lab period

Continue working with the dataset counties.dat. Provide appropriate titles for each
piece of code that will produce output.
1) Plot the number of physicians (Y) in relation to the county population (X).
2) Estimate the average number of physicians, as well as the standard error and
a 95% confidence interval for the mean.
3) Produce a frequency table of vietind for Texas (‘TX’) and Florida (‘FL’)
together without manipulating the data set (i.e., complete the task within
the SAS procedure). Hint: logical operators such as AND and OR can be
used in many statements, such as WHERE.
4) Obtain a side-by-side box plot for the unemployment counts by the two
values of vietind.
5) What, if anything, can be learned from the boxplot in question (4)?
6) What could be done to the code to make the plot in (4) more useful? (Hint:
use a WHERE statement to exclude some outliers.)