
PubH 6450 -- Biostatistics I
Directions for Lab 6
Oct. 19-21, 2009

Guidelines for each lab:

Each lab consists of three parts:

Part I will teach you new SAS procedures and steps with much guidance, using the class data set we created together.

Part II will teach the same SAS procedures and steps with less guidance, requiring you to tap into your previous knowledge and labs to fill in intermediate steps, and using various health-related data sets.

Part III will point you towards some of this week's homework questions; to complete these, you will need to choose and implement the appropriate SAS procedures and steps based on what you learned in this and previous labs and lectures.

You may skip Part I if you wish to proceed directly to the more challenging Part II.

Your lab TA will work through this Lab in front of the class, step by step. Feel free to interrupt with questions at any time. If the TA is working through the lab too slowly for you, work ahead at your own pace.

The link below takes you to the SAS online manual. Here you can find explanations and syntax for all SAS functions and procedures.
http://support.sas.com/onlinedoc/913/docMainpage.jsp

Instructions on how to save your SAS work can be found on the SAS labs page: http://www.biostat.umn.edu/~ph6450/FALL09PH6450LAB.html by clicking on the link “Instructions for saving your SAS work”. Previous labs can also be found there, if you need to remind yourself of some syntax learned in an earlier lab.

Purpose of this lab – One sample tests:

Today's lab will demonstrate how to perform one sample z-tests and t-tests in SAS using proc means, proc univariate, proc ttest, and the cdf function.

3adc3913-4c07-4ebd-b6e4-5477d5912372.doc Page 1 of 11

Part I-A, Z-tests:

Part I-A will show you how to perform hypothesis tests for inference on the mean of a normal random sample when the standard deviation σ is assumed to be known. (Part II-A uses the same example.)
A study of the pay of corporate Chief Executive Officers (CEOs) for health insurance companies examined the increase in cash compensation of the CEOs of 36 such companies, adjusted for inflation, in a recent year. The public wants to know if there is good evidence to suggest that the mean compensation of all health insurance company CEOs increased that year. The dataset “ceo_pay” provides the data with percentage increase in CEO pay. Let us assume that percent increase follows a normal distribution with mean μ and known standard deviation σ = 9.

We want to test the null hypothesis of no mean change in CEO pay:

H0: μ = 0

What does μ represent here?

The public is only interested in an increase in CEO pay, therefore the alternative hypothesis should be one-sided:

Ha: μ > 0

1) Getting the lab data: Go to the course labs page: http://www.biostat.umn.edu/~ph6450/FALL09PH6450LAB.html.

a) First left-click on the link “ceo pay” and look at the dataset. Notice that the data values are separated across the columns by commas in each row. This is a comma-delimited file format and has extension “.csv”. (We saw this file type in Lab 2 and Lab 5.)

b) Go back to the course labs page and now right-click on the link “ceo pay” and click on “Save Link As…” When the “Save As” dialog box opens, click on “Desktop” on the left and then on “Save” on the lower right. This should save the file as “ceo_pay.csv” on your computer's desktop.

2) Starting SAS: Open the PC SAS application by clicking on “Start” (lower left corner of your screen) and then on “Programs” and then on “Class Applications” and then on “SAS 9.1.”

3) Importing the lab data (comma-delimited): Copy and paste the following code in your SAS editor to import the CEO pay dataset. (Important: change the infile statement to correspond to where you saved ceo_pay.csv). The SAS dataset is named “pay”.
data pay;
   infile 'ceo_pay.csv' dlm=',' firstobs=2;
   input company increase;
run;

4) We introduced the means procedure in Lab 1 and used it again in Lab 5. Recall that the MEANS procedure computes descriptive statistics. The following code provides by default the mean, standard deviation, minimum, and maximum for the variable 'increase'.

proc means data=pay;
   var increase;
run;

Sometimes we want other descriptive statistics. Change the code to obtain the mean, median, min, and max. Then look at the output to find the sample mean which will be used in step 5. What is the range of percent increase in CEO pay? Is the range from the minimum to the mean the same as the range from the mean to the maximum? What about for the median instead of the mean?

5) Since we have a one-sided alternative hypothesis of μ > 0, the p-value is the right tail probability: Pr(X>x) = 1 - Pr(X <= x). The key function we need is cdf('NORMAL',x,μ,σ). Recall from lecture notes and your past homeworks that the cdf function is for computing Pr(X <= x). Here we assume CEO pay increases come from a Normal distribution. (We have also seen the Binomial cdf function and the t cdf function in class.) Which values should we use for x, μ, and σ in the cdf function?

a) The null hypothesis is μ = 0.
b) We are given that σ = 9.
c) The sample mean is 3.171.
d) There are n=36 observations.

x = ____, μ = ____, σ = ____

Copy and paste the following code into your SAS editor, fill in the appropriate values in the cdf function, and run it. See 7) below if you are not sure what to fill in.

data p1;
   p = 1-cdf('NORMAL',x,μ,σ);
run;
proc print data=p1;
run;

What is the p-value for this test? I.e., what is the probability of observing a sample mean of 3.171 or larger when percent increase in CEO pay follows a Normal(0,9) distribution?
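To see how the three cdf arguments relate to one another without using the lab's own numbers, here is a minimal sketch with made-up values. The dataset name zsketch, the variable names, and the values 2, 10, and 25 are all hypothetical and are not taken from the CEO pay data. The point it illustrates is that the distribution being evaluated is that of the sample mean, so its standard deviation is the population standard deviation divided by the square root of n.

```sas
* Hypothetical values for illustration only, not the lab data;
data zsketch;
   xbar  = 2;                 * hypothetical sample mean;
   mu0   = 0;                 * mean under the null hypothesis;
   sigma = 10;                * hypothetical known population standard deviation;
   n     = 25;                * hypothetical sample size;
   se    = sigma/sqrt(n);     * standard deviation of the sample mean;
   p     = 1 - cdf('NORMAL', xbar, mu0, se);   * right tail probability for the sample mean;
run;
proc print data=zsketch;
run;
```

With these made-up values se is 2, so the sample mean sits one standard error above the null mean and p is about 0.159.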
6) We can compute the same p-value by standardizing x to get the test statistic and then comparing the test statistic to a Normal(0,1) distribution. Copy and paste the following code into your SAS editor and run it.

data p2;
   *parameters used here for the cdf function are z test statistic,0,1;
   p = 1-cdf('NORMAL',(3.171-0)/(9/sqrt(36)),0,1);
run;
proc print data=p2;
run;

Do you think the study gives strong evidence that the mean compensation of all CEOs went up?

7) The missing arguments for the cdf function in 5) above are 3.171, 0, and 9/sqrt(36).

Part I-B, T-tests:

Part I-B introduces two important procedures for conducting inference on the mean when σ is unknown: proc univariate and proc ttest. These procedures as well as proc means can be used to complete problems for Homework 6.

A study is designed to test whether video games have a positive impact on motor skills. Data were collected on 50 middle school students before and after six months of playing video games on a regular basis. The data represent change in a measure of motor skill for each of the 50 students. A positive value indicates an improvement. (The example in Part II-B is different from this one.)

1) Entering data directly using datalines: In previous labs, we have read in datasets from files. Here we will enter data directly in the data step of the SAS program using a datalines statement instead of an infile statement; we have seen this in class notes. After the input keyword we provide the variable name “change” and then “@@” (two at signs). To use less space in the code, the @@ allows us to type the data into rows instead of into one very long column. Notice that there is a semi-colon after datalines and another after the data have been typed. Copy and paste the code below into your SAS editor. Then run the code to create the “games” dataset.
data games;
   input change @@;
   datalines;
0 2 -3 3 -3 -5 -1 3 1 4
4 4 -4 4 0 1 1 -3 0 1
-1 3 -1 -3 1 1 -3 0 2 -1
3 -6 1 -1 1 1 2 3 0 -1
0 0 -1 4 -2 0 1 1 -3 1
;
run;

We will test whether average change in motor skills for the population of middle school students is different from 0. Write down the null and alternative hypotheses.

2) The univariate procedure (which we used previously in Lab 2) provides many statistical measures including descriptive statistics based on moments (including variance, skewness, and kurtosis), quantiles or percentiles (such as the median), and extreme values. We have already seen the stem-and-leaf and quantile-quantile plots (Q-Q plots) produced when we add the plots option; histograms are produced when we add the histogram statement. Confidence intervals for the mean are produced when we add the cibasic option. Run the following code and look at the output.

proc univariate data=games cibasic plots;
   var change;
   histogram;
run;

Do the plots suggest that the data are normally distributed? Next find the output table labeled 'Tests for Location: Mu0=0' and find within that table the t-test of the mean. What is the p-value for that t-test? The p-value computed by SAS corresponds to a two-sided test of whether or not the mean is equal to zero. Is there strong evidence that mean change in motor skill was different from zero?

By default SAS produces t-tests of whether or not the mean is equal to zero. If we wanted to test some other value (such as 0.5, which doesn't make sense in this context) we would need to specify 0.5 in the mu0 option.
proc univariate data=games mu0=0.5 cibasic plots; *Note: mu0=0 is the default;
   var change;
   histogram;
run;

3) Several types of t-tests can all be carried out by the TTEST procedure; it performs t-tests for one sample (is the mean zero), two samples (is the mean difference zero), and paired observations (is the mean difference zero); so far in class we have only learned about one sample t-tests. The underlying assumption of the t-test in all three cases is that the observations are random samples drawn from normally distributed populations, or the sample size is large enough that the sample mean is approximately normally distributed. This assumption can be checked using the univariate procedure, as we did above in 2). Here we use the one sample t-test to compare the mean of the sample to zero. Copy and paste this code into your SAS editor and run it. Then look at the output.

proc ttest data=games;
   var change;
run;

Is the p-value shown the same as what you saw above from the univariate procedure? (It should be.) Are the data convincing enough to reject the null hypothesis?

By default SAS produces t-tests of whether or not the mean is equal to zero. If we wanted to test some other value (such as 0.5, which doesn't make sense in this context) we would need to specify 0.5 in the h0 option.

proc ttest data=games h0=0.5; *Note: h0=0 is the default;
   var change;
run;

Part II-A, Z-tests:

Part II-A is based on the same example as Part I-A. Here we will show you how to perform hypothesis tests for inference on the mean of a normal random sample when the standard deviation σ is assumed to be known.

A study of the pay of corporate Chief Executive Officers (CEOs) for health insurance companies examined the increase in cash compensation of the CEOs of 36 such companies, adjusted for inflation, in a recent year.
The public wants to know if there is good evidence to suggest that the mean compensation of all health insurance company CEOs increased that year. The dataset “ceo_pay” provides the data with percentage increase in CEO pay. Let us assume that percent increase follows a normal distribution with mean μ and known standard deviation σ = 9.

We want to test the null hypothesis of no mean change in CEO pay:

H0: μ = 0

What does μ represent here?

The public is only interested in an increase in CEO pay, therefore the alternative hypothesis should be one-sided:

Ha: μ > 0

1) Getting the lab data: Go to the course labs page: http://www.biostat.umn.edu/~ph6450/FALL09PH6450LAB.html and save the “ceo_pay” file as “ceo_pay.csv” on your computer's desktop.

2) Starting SAS: Open the PC SAS application.

3) Importing the lab data (comma-delimited): Import the comma-delimited CEO pay dataset into SAS; it has two numeric variables: company increase. (We saw the .csv file type in Lab 2 and Lab 5; look back to those labs if you need a reminder.)

4) We introduced the means procedure in Lab 1 and used it again in Lab 5. Obtain the mean, median, min, and max of the variable increase. Then look at the output to find the sample mean which will be used in step 5. What is the range of percent increase in CEO pay? Is the range from the minimum to the mean the same as the range from the mean to the maximum? What about for the median instead of the mean?

5) Since we have a one-sided alternative hypothesis of μ > 0, the p-value is the right tail probability: Pr(X>x) = 1 - Pr(X <= x). The key function we need is cdf('NORMAL',x,μ,σ). Recall from lecture notes and your past homeworks that the cdf function is for computing Pr(X <= x). Here we assume CEO pay increases come from a Normal distribution. (We have also seen the Binomial cdf function and the t cdf function in class.) Which values should we use for x, μ, and σ in the cdf function?

e) The null hypothesis is μ = 0.
f) We are given that σ = 9.
g) The sample mean is 3.171.
h) There are n=36 observations.

x = ____, μ = ____, σ = ____

Copy and paste the following code into your SAS editor, fill in the appropriate values in the cdf function, and run it. See 7) below if you are not sure what to fill in.

data p1;
   p = 1-cdf('NORMAL',x,μ,σ);
run;
proc print data=p1;
run;

What is the p-value for this test? I.e., what is the probability of observing a sample mean of 3.171 or larger when percent increase in CEO pay follows a Normal(0,9) distribution?

6) We can compute the same p-value by standardizing x to get the test statistic and then comparing the test statistic to a Normal(0,1) distribution. Copy and paste the following code into your SAS editor, fill in the appropriate value for x in the cdf function, and run it. See 7) below if you are not sure what to fill in.

data p2;
   p = 1-cdf('NORMAL',x,0,1);
run;
proc print data=p2;
run;

Do you think the study gives strong evidence that the mean compensation of all CEOs went up?

7) The missing arguments for the cdf function in 5) above are 3.171, 0, and 9/sqrt(36). The missing argument for the cdf function in 6) above is (3.171-0)/(9/sqrt(36)).

Part II-B, T-tests:

Part II-B introduces two important procedures for conducting inference on the mean when σ is unknown: proc univariate and proc ttest. These procedures as well as proc means can be used to complete problems for Homework 6. We will again use the class dataset. (The example in Part I-B is different from this one.)

1) Structure of the dataset: We will use a dataset based on the class survey. This dataset is identical to the data used in Lab 5.
Variables in this data file are:

SUBJECT: The response system ID (code)
AGE: Self-reported age (numeric, in years)
GENDER: Gender (1: Male, 2: Female)
SMOKED_100: Smoked at least 100 cigarettes in lifetime (Yes, No)
SOCKS: Wearing socks right now (1: Yes, 2: No)
HEIGHT: Self-reported height (numeric, in inches or meters)
HEIGHT_UNITS: (Meters, Inches)
WEIGHT: Self-reported weight (numeric, in pounds or kilograms)
WEIGHT_UNITS: (Kilograms, Pounds)
RESTLESS_DAYS: During the past 30 days, for about how many days have you felt you did not get enough rest or sleep? (numeric)
LEFT_HAND: Are you left-handed (Yes, No)
LEFT_HAND_PARENT: Is either of your birth parents left-handed (Yes, No, Don't know)
LANGUAGES: How many languages can you speak fluently or somewhat fluently? (numeric)
HEART_RATE: How many times does your heart beat in 15 seconds? (numeric)
PUSHUP_HEART_RATE: After doing 5 push-ups, how many times does your heart beat in 15 seconds? (numeric)
FLU_SHOT: During the past 12 months, have you had a flu shot (or nasal flu vaccine)? (Yes, No)
CHILDREN: How many children less than 18 years of age live in your household? (numeric)
COMMUTE_TIME: On average, how long in minutes is your commute to campus? (numeric, in minutes)
YRS_SINCE_MATHSTAT: How many years ago was your most recent math or stats class? (numeric)
RESEARCH_TEAM: Have you ever worked as part of a clinical research study team? (Yes, No)
CALCULATION: Solution to two linear equations (Y=2, Y=3, Y=4, Y=5, Don't know)
FRUIT: A character string of the fruit-rankings with each rank separated by ':', fruit1:fruit2:…:fruit6. Each value consists of 47 characters.
For example, the value for the first subject is
Apple:Pineapple:Mango:Strawberry:Banana:Coconut

2) Getting the lab data (csv): Go to the course labs page: http://www.biostat.umn.edu/~ph6450/FALL09PH6450LAB.html and save the “responsedatasubset3” csv file as responsedatasubset3.csv on your computer's desktop.

3) Starting SAS: Open the PC SAS application.

4) Importing the dataset (comma delimited): Use this code to import responsedatasubset3 into SAS; this is the same code as we used in Lab 5. (Important: alter the infile statement to correspond to where you saved responsedatasubset3.csv.)

options ls=78 nodate nocenter pageno=1;
data classdata;
   infile "responsedatasubset3.csv" delimiter="," firstobs=2;
   input subject age gender smoked_100 $ socks $ height height_units $
         weight weight_units $ restless_days left_hand $ left_hand_parent $
         languages heart_rate pushup_heart_rate flu_shot $ children
         commute_time yrs_since_mathstat research_team $ calculation $
         fruit $47.;
run;

5) We will use the variable for resting heart rate (heart_rate). This variable represents the number of heartbeats in a 15 second period; heart rate is usually recorded as heartbeats per minute, so create a new variable bpm that represents heartbeats per minute (heart_rate times 4). A healthy heart rate is one around 70 beats per minute. We will test whether average beats per minute for the population of students who take this class is larger than 70 beats per minute. Write down the null and alternative hypotheses.

6) The univariate procedure (which we used previously in Lab 2) provides many statistical measures including descriptive statistics based on moments (including variance, skewness, and kurtosis), quantiles or percentiles (such as the median), and extreme values. We have already seen the stem-and-leaf and quantile-quantile plots (Q-Q plots) produced when we add the plots option; histograms are produced when we add the histogram statement.
Confidence intervals for the mean are produced when we add the cibasic option. By default SAS produces t-tests of whether or not the mean is equal to zero. If we wanted to test some other value (such as 70) we would need to specify 70 in the mu0 option. Run the following code and look at the output.

proc univariate data=classdata cibasic mu0=70 plots;
   var bpm;
   histogram;
run;

Do the plots suggest that the data are normally distributed? Next find the output table labeled 'Tests for Location: Mu0=70' and find within that table the t-test of the mean. What is the p-value for that t-test? The p-value computed by SAS corresponds to a two-sided test of whether or not the mean is equal to 70, the value given in the mu0 option. Convert that p-value to a one-sided p-value appropriate for the direction in your alternative hypothesis. Is there strong evidence that mean beats per minute was greater than 70?

7) Several types of t-tests can all be carried out by the TTEST procedure; it performs t-tests for one sample (is the mean zero), two samples (is the mean difference zero), and paired observations (is the mean difference zero); so far in class we have only learned about one sample t-tests. The underlying assumption of the t-test in all three cases is that the observations are random samples drawn from normally distributed populations, or the sample size is large enough that the sample mean is approximately normally distributed. This assumption can be checked using the univariate procedure, as we did above in 6). Here we use the one sample t-test for beats per minute. By default SAS produces t-tests of whether or not the mean is equal to zero, but again we want to test some other value. We need to specify 70 in the h0 option. Copy and paste this code into your SAS editor and run it. Then look at the output.

proc ttest data=classdata h0=70;
   var bpm;
run;

Is the p-value shown the same as what you saw above from the univariate procedure? (It should be.)
Are the data convincing enough to reject the null hypothesis?

8) There is one unusually large value for beats per minute, 260. This is biologically implausible, and perhaps represents someone who entered beats per minute, rather than beats per 15 seconds, while completing the survey. Delete this observation and re-run the t-test. Do your conclusions change?

Part III:

Follow this link to the Homework 6 page: http://www.biostat.umn.edu/~ph6450/homework/hw6.html. There are two problems from MMC that require SAS. Use what you have learned in lab to answer these problems.
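For Part II-B, steps 5, 6, and 8 ask you to create bpm, convert a two-sided p-value to a one-sided one, and drop the implausible observation. One possible way to do this is sketched below; it is not the only way. The dataset names classdata2 and onesided are hypothetical, and the values of t and df in the last data step are made-up placeholders to be replaced with the values from your own proc ttest output. The sketch uses the t cdf function mentioned in Part I-A, step 5.

```sas
* Sketch only - the dataset names classdata2 and onesided are hypothetical;
data classdata2;
   set classdata;
   bpm = heart_rate*4;          * beats per minute from beats per 15 seconds;
   if bpm = 260 then delete;    * drop the implausible value described in step 8;
run;

proc ttest data=classdata2 h0=70;
   var bpm;
run;

data onesided;
   t  = 1.5;                    * hypothetical t statistic - replace with your output;
   df = 40;                     * hypothetical degrees of freedom, n minus 1;
   p_upper = 1 - cdf('T', t, df);   * right tail p-value for an alternative of mean greater than 70;
run;
proc print data=onesided;
run;
```

To keep the observation for the first pass through steps 6 and 7, leave out the delete statement and add it only when you reach step 8.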
