PubH 6450 -- Biostatistics I
Directions for Lab 6
Oct. 19-21, 2009


Guidelines for each lab:

Each lab consists of three parts: Part I will teach you new SAS procedures and steps with
much guidance, using the class data set we created together. Part II will teach the same SAS
procedures and steps with less guidance, requiring you to tap into your previous knowledge and
labs to fill in intermediate steps, and using various health-related data sets. Part III will point
you towards some of this week’s homework questions; to complete these, you will need to
choose and implement the appropriate SAS procedures and steps based on what you learned in
this and previous labs and lectures.

You may skip Part I if you wish to proceed directly to the more challenging Part II.
Your lab TA will work through this Lab in front of the class, step by step. Feel free to interrupt
with questions at any time. If the TA is working through the lab too slowly for you, work
ahead at your own pace.

The link below takes you to the SAS online manual. Here you can find explanations and syntax
for all SAS functions and procedures.

                      http://support.sas.com/onlinedoc/913/docMainpage.jsp

Instructions on how to save your SAS work can be found on the SAS labs page:
http://www.biostat.umn.edu/~ph6450/FALL09PH6450LAB.html by clicking on the link
“Instructions for saving your SAS work”. Previous labs can also be found there, if you need to
remind yourself of some syntax learned in an earlier lab.

Purpose of this lab – One sample tests:

Today’s lab will demonstrate how to perform one sample z-tests and t-tests in SAS using proc
means, proc univariate, proc ttest, and the cdf function.




Part I-A, Z-tests:

Part I-A will show you how to perform hypothesis tests for inference on the mean of a normal
random sample when the standard deviation σ is assumed to be known. (Part II-A uses the same
example.)

A study of the pay of corporate Chief Executive Officers (CEOs) for health insurance companies
examined the increase in cash compensation of the CEOs of 36 such companies, adjusted for
inflation, in a recent year. The public wants to know if there is good evidence to suggest that the
mean compensation of all health insurance company CEOs increased that year. The dataset
“ceo_pay” provides the data with percentage increase in CEO pay. Let us assume that percent
increase follows a normal distribution with mean μ and known standard deviation σ = 9.

We want to test the null hypothesis of no mean change in CEO pay:

                       Ho: μ = 0

What does μ represent here? The public is only interested in an increase in CEO pay, therefore
the alternative hypothesis should be one-sided:

                       Ha: μ > 0


1) Getting the lab data: Go to the course labs page:
http://www.biostat.umn.edu/~ph6450/FALL09PH6450LAB.html.

       a) First left-click on the link “ceo pay” and look at the dataset. Notice that the data
       values are separated across the columns by commas in each row. This is a comma-
       delimited file format and has extension “.csv”. (We saw this file type in Lab 2 and Lab
       5.)

       b) Go back to the course labs page and now right-click on the link “ceo pay” and click on
       “Save Link As…” When the “Save As” dialog box opens, click on “Desktop” on the left
       and then on “Save” on the lower right. This should save the file as “ceo_pay.csv” on your
       computer’s desktop.


2) Starting SAS: Open the PC SAS application by clicking on “Start” (lower left corner of your
screen) and then on “Programs” and then on “Class Applications” and then on “SAS 9.1.”


3) Importing the lab data (comma-delimited): Copy and paste the following code in your SAS
editor to import the CEO pay dataset. (Important: change the infile statement to correspond
to where you saved ceo_pay.csv). The SAS dataset is named “pay”.
data pay;
   infile 'ceo_pay.csv'
     dlm=',' firstobs=2;
   input company increase;
run;



4) We introduced the means procedure in Lab 1 and used it again in Lab 5. Recall that the
MEANS procedure computes descriptive statistics. The following code provides by default the
number of observations (N), mean, standard deviation, minimum, and maximum for the variable ‘increase’.

proc means data=pay;
  var increase;
run;

Sometimes we want other descriptive statistics. Change the code to obtain the mean, median,
min, and max. Then look at the output to find the sample mean which will be used in step 5.
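
As a sketch of one way to do this (not necessarily the only one), statistic keywords can be listed in the PROC MEANS statement. The keywords mean, median, min, and max are standard PROC MEANS keywords; the dataset and variable names are the ones created in step 3.

```sas
* Request specific statistics by listing keywords in the PROC MEANS statement;
proc means data=pay mean median min max;
  var increase;
run;
```

Listing any keyword replaces the default set, so statistics that are not listed (such as the standard deviation) no longer appear in the output.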

What is the range of percent increase in CEO pay? Is the range from the minimum to the mean
the same as the range from the mean to the maximum? What about for the median instead of the
mean?


5) Since we have a one-sided alternative hypothesis of μ > 0, the p-value is the right tail
probability: Pr(X>x) = 1 - Pr(X <= x). The key function we need is cdf('NORMAL',x,μ,σ).
Recall from lecture notes and your past homeworks that the cdf function is for computing
Pr(X <= x). Here we assume CEO pay increases come from a Normal distribution. (We have
also seen the Binomial cdf function and the t cdf function in class.)

Which values should we use for x, μ, σ in the cdf function?
      a) The null hypothesis is μ = 0.
      b) We are given that σ = 9.
      c) The sample mean is 3.171.
      d) There are n=36 observations.

x =        , μ =         , σ =


Copy and paste the following code into your SAS editor, fill in the appropriate values in the cdf
function, and run it. See 7) below if you are not sure what to fill in.
data p1;
  p = 1-cdf('NORMAL',x,μ,σ);  *fill in values for x, μ, and σ before running;
run;
proc print data=p1;
run;


What is the p-value for this test? I.e., what is the probability of observing a sample mean of
3.171 or larger when percent increase in CEO pay follows a Normal distribution with mean 0 and
standard deviation 9?



6) We can compute the same p-value by standardizing x to get the test statistic and then
comparing the test statistic to a Normal(0,1) distribution.

Copy and paste the following code into your SAS editor and run it.
data p2;
  *parameters used here for the cdf function are z test statistic,0,1;
  p = 1-cdf('NORMAL',(3.171-0)/(9/sqrt(36)),0,1);
run;
proc print data=p2;
run;

Do you think the study gives strong evidence that the mean compensation of all CEOs went up?


7) The missing arguments for the cdf function in 5) above are 3.171, 0, and 9/sqrt(36).
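
The two approaches in 5) and 6) can be checked side by side in a single data step. The sketch below uses the values given in 5); the dataset name pcheck and the variable names xbar, mu0, sigma, n, se, z, p_direct, and p_standard are chosen here only for illustration.

```sas
data pcheck;
  xbar  = 3.171;            * sample mean from proc means;
  mu0   = 0;                * mean under the null hypothesis;
  sigma = 9;                * known population standard deviation;
  n     = 36;               * number of companies;
  se = sigma/sqrt(n);       * standard deviation of the sample mean;
  z  = (xbar - mu0)/se;     * z test statistic;
  p_direct   = 1 - cdf('NORMAL', xbar, mu0, se);  * approach in step 5;
  p_standard = 1 - cdf('NORMAL', z, 0, 1);        * approach in step 6;
run;
proc print data=pcheck;
run;
```

Here se is 9/6 = 1.5 and z is 3.171/1.5 = 2.114. The two p-value columns should agree, since standardizing does not change the tail probability.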



Part I-B, T-tests:

Part I-B introduces two important procedures for conducting inference on the mean when σ is
unknown: proc univariate and proc ttest. These procedures as well as proc means can be
used to complete problems for Homework 6.

A study is designed to test whether video games have a positive impact on motor skills. Data
were collected on 50 middle school students before and after six months of playing video games
on a regular basis. The data represent change in a measure of motor skill for each of the 50
students. A positive value indicates an improvement. (The example in Part II-B is different from
this one.)


1) Entering data directly using datalines: In previous labs, we have read in datasets from files.
Here we will enter data directly in the data step of the SAS program using a datalines statement
instead of an infile statement; we have seen this in class notes. In the input statement we
provide the variable name “change” and then “@@” (two at signs). The trailing @@ tells SAS to
keep reading values from the same line, so we can type many observations per row instead of one
very long column. Notice
that there is a semi-colon after datalines and another after the data have been typed. Copy and
paste the code below into your SAS editor. Then run the code to create the “games” dataset.

data games;
  input change @@;
  datalines;
0 2 -3 3 -3 -5 -1 3 1 4 4 4 -4 4 0 1 1 -3 0 1 -1 3 -1 -3 1 1
-3 0 2 -1 3 -6 1 -1 1 1 2 3 0 -1 0 0 -1 4 -2 0 1 1 -3 1
;
run;




We will test whether average change in motor skills for the population of middle school students
is different from 0. Write down the null and alternative hypotheses.


2) The univariate procedure (which we used previously in Lab 2) provides many statistical
measures including descriptive statistics based on moments (including variance, skewness, and
kurtosis), quantiles or percentiles (such as the median), and extreme values. We have already
seen the stem-and-leaf and quantile-quantile plots (Q-Q plots) produced when we add the plots
option; histograms are produced when we add the histogram statement. Confidence intervals for
the mean are produced when we add the cibasic option. Run the following code and look at the
output.

proc univariate data=games cibasic plots;
   var change;
   histogram;
run;

Do the plots suggest that the data are normally distributed? Next find the output table labeled
‘Tests for Location: Mu0=0’ and find within that table the t-test of the mean. What is the p-value
for that t-test? The p-value computed by SAS corresponds to a two-sided test of whether or not
the mean is equal to zero. Is there strong evidence that mean change in motor skill was different
from zero?

By default SAS produces t-tests of whether or not the mean is equal to zero. If we wanted to test
some other value (such as 0.5, which doesn’t make sense in this context) we would need to
specify 0.5 in the mu0 option.
proc univariate data=games mu0=0.5 cibasic plots;*Note: mu0=0 is the default;
   var change;
   histogram;
run;


3) Several types of t-tests can all be carried out by the TTEST procedure; it performs t-tests for
one sample (is the mean zero), two samples (is the mean difference zero), and paired
observations (is the mean difference zero); so far in class we have only learned about one sample
t-tests. The underlying assumption of the t-test in all three cases is that the observations are
random samples drawn from normally distributed populations, or the sample size is large enough
that the sample mean is approximately normally distributed. This assumption can be checked
using the univariate procedure, as we did above in 2). Here we use the one sample t-test to
compare the mean of the sample to zero. Copy and paste this code into your SAS editor and run
it. Then look at the output.
proc ttest data=games;
   var change;
run;

Is the p-value shown the same as what you saw above from the univariate procedure? (It
should be.) Are the data convincing enough to reject the null hypothesis?
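
To see where this p-value comes from, the t statistic can also be computed by hand from the sample mean, standard deviation, and sample size, and compared to a t distribution with n-1 degrees of freedom. The following is a sketch; the dataset names stats and tcheck and the variable names xbar, s, n, t, and p_two_sided are illustrative choices, not part of the lab.

```sas
* Save the summary statistics for change in an output dataset;
proc means data=games noprint;
  var change;
  output out=stats mean=xbar std=s n=n;
run;

data tcheck;
  set stats;
  t = (xbar - 0)/(s/sqrt(n));                  * one sample t statistic for H0: mean = 0;
  p_two_sided = 2*(1 - cdf('T', abs(t), n-1)); * two-sided p-value, n-1 degrees of freedom;
run;

proc print data=tcheck;
  var xbar s n t p_two_sided;
run;
```

The values of t and p_two_sided should match the t-test results reported by proc univariate and proc ttest.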


By default SAS produces t-tests of whether or not the mean is equal to zero. If we wanted to test
some other value (such as 0.5, which doesn’t make sense in this context) we would need to
specify 0.5 in the h0 option.
proc ttest data=games h0=0.5; *Note: h0=0 is the default;
   var change;
run;




Part II-A, Z-tests:

Part II-A is based on the same example as Part I-A. Here we will show you how to perform
hypothesis tests for inference on the mean of a normal random sample when the standard
deviation σ is assumed to be known.

A study of the pay of corporate Chief Executive Officers (CEOs) for health insurance companies
examined the increase in cash compensation of the CEOs of 36 such companies, adjusted for
inflation, in a recent year. The public wants to know if there is good evidence to suggest that the
mean compensation of all health insurance company CEOs increased that year. The dataset
“ceo_pay” provides the data with percentage increase in CEO pay. Let us assume that percent
increase follows a normal distribution with mean μ and known standard deviation σ = 9.

We want to test the null hypothesis of no mean change in CEO pay:

                       Ho: μ = 0

What does μ represent here? The public is only interested in an increase in CEO pay, therefore
the alternative hypothesis should be one-sided:

                       Ha: μ > 0


1) Getting the lab data: Go to the course labs page:
http://www.biostat.umn.edu/~ph6450/FALL09PH6450LAB.html and save the “ceo_pay” file as
“ceo_pay.csv” on your computer’s desktop.


2) Starting SAS: Open the PC SAS application.


3) Importing the lab data (comma-delimited): Import the comma-delimited CEO pay dataset into
SAS; it has two numeric variables: company and increase. (We saw the .csv file type in Lab 2 and
Lab 5; look back to those labs if you need a reminder.)


4) We introduced the means procedure in Lab 1 and used it again in Lab 5. Obtain the mean,
median, min, and max of the variable increase. Then look at the output to find the sample mean
which will be used in step 5.

What is the range of percent increase in CEO pay? Is the range from the minimum to the mean
the same as the range from the mean to the maximum? What about for the median instead of the
mean?




5) Since we have a one-sided alternative hypothesis of μ > 0, the p-value is the right tail
probability: Pr(X>x) = 1 - Pr(X <= x). The key function we need is cdf('NORMAL',x,μ,σ).
Recall from lecture notes and your past homeworks that the cdf function is for computing
Pr(X <= x). Here we assume CEO pay increases come from a Normal distribution. (We have
also seen the Binomial cdf function and the t cdf function in class.)

Which values should we use for x, μ, σ in the cdf function?
      a) The null hypothesis is μ = 0.
      b) We are given that σ = 9.
      c) The sample mean is 3.171.
      d) There are n=36 observations.

x =        , μ =         , σ =


Copy and paste the following code into your SAS editor, fill in the appropriate values in the cdf
function, and run it. See 7) below if you are not sure what to fill in.
data p1;
  p = 1-cdf('NORMAL',x,μ,σ);  *fill in values for x, μ, and σ before running;
run;
proc print data=p1;
run;


What is the p-value for this test? I.e., what is the probability of observing a sample mean of
3.171 or larger when percent increase in CEO pay follows a Normal distribution with mean 0 and
standard deviation 9?


6) We can compute the same p-value by standardizing x to get the test statistic and then
comparing the test statistic to a Normal(0,1) distribution.

Copy and paste the following code into your SAS editor, fill in the appropriate value for x in the
cdf function, and run it. See 7) below if you are not sure what to fill in.
data p2;
  p = 1-cdf('NORMAL',x,0,1);
run;
proc print data=p2;
run;

Do you think the study gives strong evidence that the mean compensation of all CEOs went up?


7) The missing arguments for the cdf function in 5) above are 3.171, 0, and 9/sqrt(36). The
missing argument for the cdf function in 6) above is (3.171-0)/(9/sqrt(36)).



Part II-B, T-tests:


Part II-B introduces two important procedures for conducting inference on the mean when σ is
unknown: proc univariate and proc ttest. These procedures as well as proc means can be
used to complete problems for Homework 6. We will again use the class dataset. (The example
in Part I-B is different from this one.)


1) Structure of the dataset: We will use a dataset based on the class survey. This dataset is
identical to the data used in Lab 5.

Variables in this data file are:

       SUBJECT: The response system ID (code)
       AGE: Self-reported age (numeric, in years)
       GENDER: Gender (1: Male, 2: Female)
       SMOKED_100: Smoked at least 100 cigarettes in lifetime (Yes, No)
       SOCKS: Wearing socks right now (1 : Yes, 2 : No)
       HEIGHT: Self-reported height (numeric, in inches or meters)
       HEIGHT_UNITS: (Meters, Inches)
       WEIGHT: Self-reported weight (numeric, in pounds or kilograms)
       WEIGHT_UNITS: (Kilograms, Pounds)
       RESTLESS_DAYS: During the past 30 days, for about how many days have you felt you
        did not get enough rest or sleep? (numeric)
       LEFT_HAND: Are you left-handed (Yes, No)
       LEFT_HAND_PARENT: Is either of your birth parents left-handed (Yes, No, Don’t
        know)
       LANGUAGES: How many languages can you speak fluently or somewhat fluently?
        (numeric)
       HEART_RATE: How many times does your heart beat in 15 seconds? (numeric)
       PUSHUP_HEART_RATE: After doing 5 push-ups, how many times does your heart beat
        in 15 seconds? (numeric)
       FLU_SHOT: During the past 12 months, have you had a flu shot (or nasal flu vaccine)?
        (Yes, No)
       CHILDREN: How many children less than 18 years of age live in your household?
        (numeric)
       COMMUTE_TIME: On average, how long in minutes is your commute to campus?
        (numeric, in minutes)
       YRS_SINCE_MATHSTAT: How many years ago was your most recent math or stats
        class? (numeric)
       RESEARCH_TEAM: Have you ever worked as part of a clinical research study team?
        (Yes, No)
       CALCULATION: Solution to two linear equations (Y=2,Y=3,Y=4,Y=5,Don’t know)
       FRUIT: A character string of the fruit-rankings with each rank separated by ‘:’,
        fruit1:fruit2:…:fruit6. Each value consists of 47 characters. For example, the value for
        the first subject is

       Apple:Pineapple:Mango:Strawberry:Banana:Coconut

2) Getting the lab data (csv): Go to the course labs page:
http://www.biostat.umn.edu/~ph6450/FALL09PH6450LAB.html and save the
“responsedatasubset3” csv file as responsedatasubset3.csv on your computer’s desktop.

3) Starting SAS: Open the PC SAS application.

4) Importing the dataset (comma delimited): Use this code to import responsedatasubset3 into
SAS; this is the same code as we used in Lab 5. (Important: alter the infile statement to
correspond to where you saved responsedatasubset3.csv.)

options ls=78 nodate nocenter pageno=1;
data classdata;
infile "responsedatasubset3.csv" delimiter="," firstobs=2;
input subject age gender smoked_100 $ socks $ height height_units $ weight
      weight_units $ restless_days left_hand $ left_hand_parent $ languages
      heart_rate pushup_heart_rate flu_shot $ children commute_time
      yrs_since_mathstat research_team $ calculation $ fruit $47.;
run;


5) We will use the variable for resting heart rate (heart_rate). This variable represents the
number of heartbeats in a 15 second period; heart rate is usually recorded as heartbeats per
minute, so create a new variable bpm that represents heartbeats per minute (heart_rate times 4).
A healthy heart rate is one around 70 beats per minute. We will test whether average beats per
minute for the population of students who take this class is larger than 70 beats per minute. Write
down the null and alternative hypotheses.
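
A new variable can be created in a data step with an assignment statement. The sketch below overwrites classdata with a copy that includes bpm; writing to a new dataset name instead would work equally well.

```sas
data classdata;
  set classdata;
  bpm = heart_rate*4;   * beats per 15 seconds converted to beats per minute;
run;
```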


6) The univariate procedure (which we used previously in Lab 2) provides many statistical
measures including descriptive statistics based on moments (including variance, skewness, and
kurtosis), quantiles or percentiles (such as the median), and extreme values. We have already
seen the stem-and-leaf and quantile-quantile plots (Q-Q plots) produced when we add the plots
option; histograms are produced when we add the histogram statement. Confidence intervals for
the mean are produced when we add the cibasic option.

By default SAS produces t-tests of whether or not the mean is equal to zero. If we wanted to test
some other value (such as 70) we would need to specify 70 in the mu0 option. Run the following
code and look at the output.

proc univariate data=classdata cibasic mu0=70 plots;
   var bpm;
   histogram;
   run;

Do the plots suggest that the data are normally distributed? Next find the output table labeled
‘Tests for Location: Mu0=70’ and find within that table the t-test of the mean. What is the
p-value for that t-test? The p-value computed by SAS corresponds to a two-sided test of whether or
not the mean is equal to 70. Convert that p-value to a one-sided p-value appropriate for the
direction in your alternative hypothesis. Is there strong evidence that mean beats per minute was
greater than 70?
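
For an alternative of the form mean > 70, the one-sided p-value is half the two-sided p-value when the t statistic is positive, and one minus half the two-sided p-value when it is negative. Equivalently, it is the upper tail probability of the t statistic, which can be computed directly with the t cdf function. This sketch assumes bpm has been created as in step 5; the names stats, onesided, xbar, s, n, t, and p_upper are illustrative.

```sas
* Save the summary statistics for bpm in an output dataset;
proc means data=classdata noprint;
  var bpm;
  output out=stats mean=xbar std=s n=n;
run;

data onesided;
  set stats;
  t = (xbar - 70)/(s/sqrt(n));       * t statistic for H0: mean = 70;
  p_upper = 1 - cdf('T', t, n-1);    * one-sided p-value for Ha: mean > 70;
run;

proc print data=onesided;
run;
```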


7) Several types of t-tests can all be carried out by the TTEST procedure; it performs t-tests for
one sample (is the mean zero), two samples (is the mean difference zero), and paired
observations (is the mean difference zero); so far in class we have only learned about one sample
t-tests. The underlying assumption of the t-test in all three cases is that the observations are
random samples drawn from normally distributed populations, or the sample size is large enough
that the sample mean is approximately normally distributed. This assumption can be checked
using the univariate procedure, as we did above in 6).

Here we use the one sample t-test for beats per minute. By default SAS produces t-tests of
whether or not the mean is equal to zero, but again we want to test some other value. We need to
specify 70 in the h0 option. Copy and paste this code into your SAS editor and run it. Then look
at the output.
proc ttest data=classdata h0=70;
   var bpm;
run;

Is the p-value shown the same as what you saw above from the univariate procedure? (It
should be.) Are the data convincing enough to reject the null hypothesis?


8) There is one unusually large value for beats per minute, 260. This is biologically implausible,
and perhaps represents someone who entered beats per minute, rather than beats per 15 seconds,
while completing the survey. Delete this observation and re-run the t-test. Do your conclusions
change?
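
One way to drop the observation is a subsetting data step followed by the same t-test. The sketch below assumes bpm was created as in step 5 and writes to a new dataset, here called classdata2 (a name chosen for illustration), so that the original data are kept.

```sas
data classdata2;
  set classdata;
  if bpm = 260 then delete;   * remove the implausible heart rate;
run;

proc ttest data=classdata2 h0=70;
  var bpm;
run;
```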


Part III:

Follow this link to the Homework 6 page:
http://www.biostat.umn.edu/~ph6450/homework/hw6.html. There are two problems from MMC
that require SAS. Use what you have learned in lab to answer these problems.



