                  Stat 421 – Survey Sampling Techniques
                 Lab 2: Introduction to SAS 9.2 (part 2)
In this lab, we will continue to work on the objectives stated in Lab 1 by introducing
a number of PROC steps that we will use throughout the semester.
Last week we learned how to:
      read data into SAS and create new variables using IF-THEN statements in the
       DATA step;
      create formats for our variables and associate variables with formats;
      view our data and our data manipulations using the Explorer tab and PROC
       PRINT;
      check for problems in our SAS program using the Log window;
      produce frequency and percentage tabulations for our data using PROC
       FREQ; and
      use comments to annotate our SAS program and TITLE statements to label
       our output.
This week we introduce two exploratory graphical methods in SAS:
      PROC GPLOT will be used to determine the relationship between two
       continuous variables.
      PROC BOXPLOT will be used to compare the distribution of a variable
       across the levels of a categorical grouping variable. Later, that
       grouping variable will most often be the stratum.
We will also introduce three PROC steps in SAS designed specifically for surveys:
      PROC SURVEYMEANS will be used to calculate statistically valid inferences for
       means and totals for a population.
      PROC SURVEYFREQ performs a similar function to PROC FREQ from last
       week, but we will use it to produce statistically valid inference for
       proportions in a population in a survey setting. (If we don’t have an
       equal probability design, the percentages from PROC FREQ are invalid
       estimates.)
      PROC SURVEYSELECT is used to draw a sample from a frame with a chosen
       sampling scheme.
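To see why the design matters for proportions, here is a brief sketch of the usual weighted estimator. The notation is introduced here for illustration only and is not taken from the course program: S is the sample, y_i is a 0/1 indicator for unit i, and pi_i is the probability that unit i is included in the sample.

```latex
\hat{p}_w = \frac{\sum_{i \in S} w_i \, y_i}{\sum_{i \in S} w_i},
\qquad w_i = \frac{1}{\pi_i}
```

When every pi_i is the same, the weights cancel and the estimator reduces to the unweighted sample proportion that PROC FREQ reports. When the inclusion probabilities differ, the two generally disagree.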
We will use the program posted on the course webpage to learn about
these procedures. Explanations of the PROCs are embedded in the
comments of the program, so review the comments carefully to help you
understand how to write your own SAS code.
Code from the posted program:
/* This data set contains 12 variables (from a 1994 AAUP salary survey):
   FICE = federal ID number
   name = college name
   state = State abbreviation
   fullsal = average salary for full professors
   assosal = average salary for associate professors
   assisal = average salary for assistant professors
   allsal = average salary for all ranks
   numfull = number of full professors
   numasso = number of associate professors
   numassi = number of assistant professors
   numinst = number of instructors
   numfac = number of all faculty
*/

/* FILENAME option dropped */
DATA salary;
   /* Change the path to wherever you saved the .dat file */
   INFILE '\\\cyfiles\lbbb\Documents\Downloads\aaup2.dat';
   INPUT FICE 1-5 name $ 6-37 state $ 38-39 fullsal 44-48 assosal 49-52
         assisal 53-56 allsal 57-60
         numfull 79-82 numasso 83-86 numassi 87-90 numinst 91-94 numfac 95-99;

   IF numfac >= 200 THEN sizeind = 1; /* large school */
   ELSE sizeind = 0;                  /* small school */
RUN;

PROC FORMAT; /*Create format*/
   VALUE sz 0 = "Small School"
            1 = "Large School";

DATA salary; /*Associate format to variable sizeind*/
SET salary;
FORMAT sizeind sz.;

/* PROC GPLOT can make scatterplots.
   The PLOT statement gives the variables to plot in the
   form (Y-axis)*(X-axis).
   The WHERE statement makes SAS use only the portion of the data where
   the condition is true, so this plot is only for large schools.
   The VALUE= (or V=) option of the SYMBOL statement gives the shape for
   plotting points. Other choices are square, triangle, diamond, and so on.
*/
PROC GPLOT DATA=salary;
   TITLE 'Comparison of Full vs Associate Salaries in Large Schools';
   PLOT fullsal*assosal; /* Y vs X, or vertical vs horizontal */
   SYMBOL VALUE=circle COLOR=red;
   WHERE sizeind=1; /* use WHERE to plot part of the data */
RUN;
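
As a variation on the WHERE and SYMBOL statements described above, the following sketch draws the same scatterplot for the small schools only. The triangle shape, the blue color, and the title text are arbitrary choices made for illustration.

```sas
/* Sketch: same plot restricted to small schools (sizeind=0). */
PROC GPLOT DATA=salary;
   TITLE 'Comparison of Full vs Associate Salaries in Small Schools';
   PLOT fullsal*assosal;              /* Y vs X */
   SYMBOL VALUE=triangle COLOR=blue;  /* replaces the earlier circle/red choice */
   WHERE sizeind=0;                   /* keep only small schools */
RUN;
```

Because SYMBOL is a global statement, the new shape and color stay in effect for later plots until another SYMBOL statement changes them.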

/* When you want to look at statistics broken down by
   some categories, SAS often wants the data set sorted
   by the categories, so we need to use PROC SORT.
*/
PROC SORT DATA=salary;
   BY sizeind;   /* increasing order */
RUN;

/* Sometimes we will want to look at side-by-side boxplots of our
   data. To do this we will use PROC BOXPLOT. The form of the PLOT
   statement is (analysis variables)*group variable.
   The plots from GPLOT and BOXPLOT all stack in the same graphics
   window, so you will need to scroll up and down to see the different
   plots.
*/
PROC BOXPLOT DATA=salary;
   TITLE 'Associate Professor Salary by Sizeind';
   PLOT assosal*sizeind; /* analysis variables*group variable */
RUN;

/* PROC SURVEYMEANS will be used in this class to get means
   and proportions, as well as their standard errors, based
   on the sample designs and analyses we are interested in. The VAR
   statement tells SAS which variables to get means and proportions
   for.
*/
PROC SURVEYMEANS DATA=salary;
   TITLE 'Mean Salary for All Faculty Levels';
   VAR allsal fullsal;
RUN;

/* PROC UNIVARIATE can be used in a non-survey setting
   to produce summary statistics for a single variable.
   The VAR statement tells SAS which variable we are working on.
*/
PROC UNIVARIATE DATA=salary;
   TITLE 'Mean Salary for All Faculty Levels';
   VAR allsal;   /* same result as before */
RUN;

/* There are two ways of calculating proportions and standard
   errors of proportions. The first method is to use PROC SURVEYMEANS
   with a CLASS statement, and the second method is to use
   PROC SURVEYFREQ.
   The CLASS statement is a common statement in SAS PROCs that
   tells SAS to perform the calculations for each level of the
   variable listed in the CLASS statement.
   Here it is a little confusing: in SURVEYMEANS, the CLASS statement
   tells SAS that sizeind is a categorical variable, so SAS will
   estimate the proportion in each level.
   The MEAN and STDERR options tell SAS to calculate the mean and standard
   error of the mean.
   The CLM option tells SAS to construct confidence limits,
   and ALPHA=.05 tells SAS that we want 95% confidence for the
   limits.
*/
PROC SURVEYMEANS DATA=salary MEAN STDERR CLM ALPHA=.05;
   TITLE 'Proportion of Large Schools';
   CLASS sizeind;   /* categorical, not continuous, so SAS calculates the
                       proportion in each level */
   VAR sizeind;
RUN;

/* The PROC SURVEYFREQ step looks similar to the PROC FREQ step
   from Lab 1. Here CL is used to obtain confidence limits. Omitting
   the ALPHA= option results in the default 95% confidence limits.
*/
PROC SURVEYFREQ DATA=salary;
   TITLE 'Proportion of Large Schools';
   TABLE sizeind / CL; /* std error is default, but CL is not */
RUN;
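
To illustrate the ALPHA= option mentioned above, the sketch below requests 90% confidence limits instead of the default 95%. The value 0.10 and the title text are arbitrary choices for illustration.

```sas
/* Sketch: same table, but with 90% confidence limits. */
PROC SURVEYFREQ DATA=salary;
   TITLE 'Proportion of Large Schools (90% Confidence Limits)';
   TABLES sizeind / CL ALPHA=0.10;  /* ALPHA=0.10 gives 90% limits */
RUN;
```

The point estimates and standard errors are unchanged; only the width of the confidence limits differs.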

/* If we treat our data as a sampling frame (population), we can use
   SAS to draw a sample for us under a variety of sample
   designs. To do this we will use PROC SURVEYSELECT.
   The DATA= option tells SAS what input frame to use.
   The OUT= option tells SAS where to store the selected sample.
   The N= option tells SAS the size of the sample we want.
   The SEED= option is a number used to start the random sampling. This
   seed should be randomly chosen and is fixed in lab only so that we all
   get the same sample.
   If SEED= is omitted, the system time is used to choose a seed, which
   acts as a random choice.
   The METHOD= option tells SAS what design to use to select the sample from
   the frame.
   The SRS method in SAS is SRSWOR (simple random sampling without
   replacement).
*/


   TITLE 'Sample of 100 Schools';
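
A minimal sketch of a PROC SURVEYSELECT call matching the description above. The output data set name srs100 and the seed value 12345 are placeholders chosen for illustration; they are not the values used in the course program.

```sas
/* Sketch only: srs100 and SEED=12345 are placeholder choices. */
PROC SURVEYSELECT DATA=salary   /* input frame */
                  OUT=srs100    /* where the selected sample is stored */
                  METHOD=SRS    /* simple random sampling without replacement */
                  N=100         /* sample size */
                  SEED=12345;   /* fixed seed so the draw can be repeated */
RUN;

/* Print the selected sample to check it. */
PROC PRINT DATA=srs100;
RUN;
```

Rerunning the step with the same SEED= value reproduces the same 100 schools, while changing or omitting the seed gives a different draw.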
                   In-Lab Survey Computing Assignment 2
                        Due at the end of the lab period

Writing your own SAS code
Continue working with the dataset counties.dat. Provide appropriate titles for each
piece of code that will produce output.
   1) Plot the number of physicians (Y) in relation to the county population (X).
   2) Estimate the average number of physicians, as well as the standard error and
      a 95% confidence interval for the mean.
   3) Produce a frequency table of vietind for Texas (‘TX’) and Florida (‘FL’)
      together without manipulating the data set (i.e., complete within the SAS
      procedure). Hint: logical operators such as AND and OR can be
      used in many statements, such as WHERE.
   4) Obtain a side-by-side box plot for the unemployment counts by the two
      values of vietind.
   5) What, if anything, can be learned from the boxplot in question (4)?
   6) What could be done to the code to make the plot in (4) more useful? (Hint:
      use a WHERE statement to exclude some outliers.)
   7) Use your answer in (6) to modify your program and make new plots for (1)
      and (4). What do you see now?
   8) Draw an SRSWOR (use METHOD=SRS) of size 25 from the counties using a
      random seed of 56856 and print the data set.
For this lab homework, turn in your answers to questions (5), (6), and (7) and
your program code in a Word document. Be sure to use your
name and lab number as the title (e.g., Bin_Liu_Lab02.docx).
