SAS: Statistical Programming Language
Ignacio Correas, University of Colorado at Boulder
NOTES FROM: Jonathan Hill, Dept. of Economics, University of California-San Diego

I would like to thank my friend Dr. Jonathan Hill for letting me use his excellent SAS notes and exercises. Jonathan's caliber as an econometrician is further reflected in the ease of exposition and the clarity with which he presents material as didactically complex as teaching econometrics with a computer package.

CONTENTS

I. What is SAS? Getting Around, Saving Files, Printing
    1. Introduction
    2. Booting-up SAS
    3. The SAS Environment: Getting Around
    4. Opening Files, Saving Files, Printing Output
II. Basic SAS Programming Elements: Data, Proc's, Macros, IML
    1. Data Step
    2. Proc Step
III. Data: Entering and Examining Economic Information
    1. Internal Data Entry: DATALINES
    2. External Data Entry: INFILE, FILENAME, OBS
    3. Creating New Variables
        3.1 Arithmetic Operations
        3.2 Logical Operations: IF, THEN, ELSE, AND, OR
    4. Creating Datasets from Existing Datasets
        MERGE
        SET
    5. Describing Data: Simple Data Inspection
        PROC PRINT
        PROC SORT
        PROC CONTENTS
    6. Describing Data: Simple Data Analysis
        PROC MEANS
        PROC CORR
        PROC UNIVARIATE
IV. SAS and Econometric Analysis I: Basic Regression with PROC REG
    PROC REG
    PROC REG: Commands and Options
        MODEL
        BY
        TEST (F-Tests)
    EXAMPLES
V. SAS and Econometric Analysis II: Multiple Regression with PROC AUTOREG
    1. PROC AUTOREG
    2. PROC AUTOREG: Commands and Options [model,by,…]
    3. The Jarque-Bera Test of Normality: NORMAL
VI. SAS and Econometric Analysis III: Multiple Regression and Inference
    1. Classical F-test of Model Correctness: PROC REG
    2. General F-test of Multiple Restrictions: TEST
    3. The RESET Test of Model Specification Correctness
    4. Tests of Heteroskedasticity

I. What is SAS? Getting Around, Saving Files, Printing

1.
INTRODUCTION

The statistical software we will employ in this course is SAS (Statistical Analysis System), a language which is used world-wide in economics, sociology, political science, and biology, and in major universities, governments and private research organizations. The SAS language, a multi-purpose statistical package, is particularly useful for manipulating and creating large datasets, and for fast, simple statistical analysis. Other software that would be appropriate for more advanced and refined statistical analysis includes LIMDEP (Limited Dependent Variables), GAUSS, MATLAB, and FORTRAN (Formula Translation).

2. BOOTING-UP SAS

In Windows, click-on START, click-on PROGRAMS, then click-on SAS. When the software loads, depending on which version you are using, the screen should be split into two parts (three parts in Version 8). You can fill the entire screen with the software by clicking on the open-box in the upper right-hand corner.

3. THE SAS ENVIRONMENT: Getting Around

SAS incorporates three (3) primary windows for viewing program text, output and error messages.

Program Editor Window

The Program Editor allows us to create SAS programs directly. The editor screen is simply titled "Editor", and you can place the cursor in the editor by clicking anywhere on the editor screen, or by clicking-on Window, then Editor. Also, if your version of SAS is recent, on the bottom of the screen will be three bars denoting Editor, Output and Log: click-on the appropriate bar to go to the specific screen.

Output Window

Displays program output, including the printing of datasets, statistical output like sample statistics (mean, variance), and econometric output (e.g. regression output). The Output window can be cleared, and should be cleared before you run a program: be sure the Output window is up, then click-on Edit, then Clear All [1].
Log Window

This window displays SAS's comments while it translates your program text. To view the log window, click-on Window, then Log. If your program is error free, messages will be in blue; if you have errors which SAS believes it can override, ignore, or correct, a message will appear in green; if an error is terminal such that the program crashes, error messages in red will appear. As with the output window, always clear the log window before you run a program: be sure the log window is up, click-on Edit, then click-on Clear All.

[1] The reason for clearing the output and log screens is simple: after you run a program twice, the output and error messages will simply be stacked, with the first program run on top and the second program run on the bottom. It can be very confusing to decipher which comments are for which program run. Always clear before each program run.

Example (type the following code into the editor window, and follow my instructions, below)
______________________________________________
DATA example;
    INPUT age gender $ income;
    DATALINES;
    54 m 45000
    19 f 37500
    37 f 67000
RUN;

PROC PRINT DATA = example;
RUN;

PROC MEANS DATA = example;
RUN;
______________________________________________

This program creates a simple dataset of three people and their respective ages, gender (male = m and female = f) and income in dollar units. The program then prints out the entire dataset, and calculates sample statistics including the mean, standard deviation, minimum and maximum [2] of the numerical data.

Once you type the program code in the editor, click-on the icon of the running person at the top right of the screen (this runs the program), or simply click-on RUN, then SUBMIT. SAS will automatically present the output in the output window.
The output should look like this:

    The SAS System                      Monday, May 7, 2001   6

    Obs    age    gender    income
      1     54      m        45000
      2     19      f        37500
      3     37      f        67000

    The SAS System                13:18 Monday, May 7, 2001   7

    The MEANS Procedure

    Variable    N          Mean       Std Dev    Minimum    Maximum
    age         3    36.6666667    17.5023808        19.        54.
    income      3      49833.33      15332.43     37500.     67000.

[2] All of these commands are detailed in subsequent sections, below.

Now, view the log window to see how SAS comments: we do not have any errors, thus SAS displays only blue messages, and black is used for the code you typed in. The log window should look like this:

    44   DATA example;
    45   input age gender $ income;
    46   datalines;

    NOTE: The data set WORK.EXAMPLE has 3 observations and 3 variables.
    NOTE: DATA statement used:
          real time 0.00 seconds

    50   run;
    51   proc print data = example;
    52   run;

    NOTE: There were 3 observations read from the data set WORK.EXAMPLE.
    NOTE: PROCEDURE PRINT used:
          real time 0.11 seconds

    53   proc means data = example;
    54   run;

    NOTE: There were 3 observations read from the data set WORK.EXAMPLE.
    NOTE: PROCEDURE MEANS used:
          real time 0.04 seconds

Be sure to clear both log and output windows.

4. OPENING FILES, SAVING FILES, PRINTING OUTPUT

Loading/Opening Files

If SAS is not presently loaded, the fastest way to load a program is to boot-up SAS, click-on the editor window or click-on Window then Editor, then click-on File and Open. In this class, your files will most likely be on a floppy-disk: once you click-on Open, scroll down the "look-in" box until you find the floppy "A"-drive, and proceed. All SAS program files have the file type ".sas". Our data files will be of the file type ".dat".

Saving Program Code

Recall that programs are coded in the editor window. Once you type in a program (you should save any text roughly once every 5 minutes!), click-on File, Save, then scroll-down the "save-in" box until you reach the drive that suits your needs (e.g. the A-drive for floppy disks).
Be sure to use file names that are reasonably short and intuitive (for example, do not use "file1.sas"). All SAS programs are automatically saved as ".sas" type files.

Warning: be sure you are actually in the EDITOR screen when you save: otherwise, SAS will simply save whatever contents are on the screen, be it output or error messages.

Saving Output

The easiest way to summarize your empirical project results is to save the SAS output to a file and load the file into EXCEL [3] or WORD. To save SAS output, run your program, be sure you are presently in the output window after the program finishes running (if you have any doubt, click-on Window, then Output), then click-on File, Save, scroll-down the "save-in" box, find the appropriate drive, and give your output a useful name. For example, if your SAS program is named "income.sas", then title the output file "income_out".

[3] See the section below on using EXCEL to create various types of graphs based on SAS output.

Printing Output

Once you run a program, simply click-on File, Print, or just click-on the printer icon located at the top of the screen, in the middle.

II. Basic SAS Programming Elements: Data, Proc's, Macros, IML

Any SAS program incorporates steps for entering data and steps for analyzing data. This short section will briefly discuss each step without any details on how actually to code a program. The subsequent section presents specific information on how to enter and look at data.

1. DATA STEP

Any SAS program must employ data from some source. In this class, we will usually enter data from a floppy-disk; however, you can save data to a hard-drive (Drive "C", for instance) and enter it from there. Data statements are always of the form [4]

    DATA [dataset name];
    …….
    RUN;

Each data step requires the command "DATA", a dataset name, code which actually enters the data, and the command "RUN". Dataset names can incorporate any alpha-numeric characters.
For example:

    DATA d1;
        INPUT x y;
        DATALINES;
        1 4
        10 -8
    RUN;

This code dictates that a dataset named "d1" is created with two variables, named "x" and "y", and two observations: x = (1, 10) and y = (4, -8). We can build as many datasets as we like, as well as merge datasets: see the subsequent section.

[4] I will use brackets "[ ]" to denote information that the programmer enters: you never actually type these brackets in SAS code.

2. PROC STEP

Usually SAS programmers use "proc" statements for data analysis. Other means for analyzing data will be briefly mentioned below: in this class, we will always use proc's. The term "proc" is short for "procedure", which denotes any built-in array of commands. For example, the MEANS procedure in SAS will automatically calculate data means, variances, etc., while the REG procedure performs basic regression analysis. You, yourself, do not need to program in SAS how a sample mean is calculated: SAS already has all the details programmed within itself. (We could program such calculations ourselves, if we liked, by using the built-in sub-language called IML, which we will not use in this class.)

SAS proc's are used to print data, find sample statistics, perform econometric analysis, create graphs, charts, etc. Proc's are coded much like DATA statements. For any proc, we need to specify which data is to be analyzed. For example, in order to print the entire contents of the dataset created above, we code:

    DATA d1;
        INPUT x y;
        DATALINES;
        1 4
        10 -8
    RUN;

    PROC PRINT data = d1;
    RUN;

The statement "data = d1" dictates which dataset is to be printed. As with the use of datasets, we can use as many proc's as we like: the following code creates two datasets, prints both, and displays sample statistics of one dataset:
    DATA d1;
        INPUT x y;
        DATALINES;
        1 4
        10 -8
    RUN;

    DATA d2;
        INPUT w z;
        DATALINES;
        10 -100
        9 0
    RUN;

    PROC PRINT data = d1;
    RUN;

    PROC PRINT data = d2;
    RUN;

    PROC MEANS data = d1;
    RUN;
____________________________________

MACROS and SAS-IML

Although SAS's power is derived from its ability to manage and create large datasets, as well as its ability to analyze any dataset easily with any one of its several hundred built-in procedures, there are other means of programming that require substantial effort on the part of the programmer.

SAS-IML

The SAS language has built into it a sub-language for matrix-oriented mathematics. This software is called the Integrated Matrix Language [IML] and can be used to code substantially sophisticated econometric commands. SAS's built-in procedures are very useful; however, they are ultimately of limited use: recent advances in economic/econometric/statistical theory are NOT programmed into SAS, thus if you require a means of data analysis that lies outside the range of SAS's present abilities, then you must program the procedure yourself. IML allows the programmer literally to create his/her own procedures that can be called from any SAS program. The IML language has its own syntax and employs matrix algebra, and therefore requires extra time to learn and a background in higher mathematics.

SAS MACROS

SAS's IML is literally a built-in sub-language useful for creating your own hand-written econometric analysis. A "macro", by contrast, is a routine that is programmed into SAS along with standard DATA and PROC steps. A macro requires its own syntax, and can be used to create routines that perform sophisticated tasks. Moreover, a macro can be written simply to group together standard SAS commands: once this kind of macro is written, the programmer simply needs to refer to it by name, and all of the SAS commands associated with the macro name are performed.

III.
Data: Entering and Examining Economic Information

In this section, we will learn the basic techniques for entering data into SAS: the two primary techniques entail writing the data directly into the program, or loading data into SAS from an external source (e.g. a floppy disk). Additionally, we will learn several procedures for performing basic statistical analysis of our data.

For this, and all subsequent documents, to familiarize yourself with new SAS commands and programming techniques, be sure to boot-up SAS and practice the examples I give below. Always feel free to experiment.

NOTE: Because SAS runs under Windows, you can simply copy examples of code from this and any other document and paste the text directly into SAS. In fact, many of the examples below were written in SAS and copy/pasted into WORD! Do as we all do: take the code wherever you can find it, study it, and learn to re-write it yourself.

1. Internal Data Entry: DATALINES

SAS allows the programmer to enter any data directly. For large datasets this is impractical; however, there will be times when the programmer wants to have the data physically present in the program. Recall that we enter data in a DATA STEP. For direct data entry, we use the code:

    DATA [dataset name];
        INPUT var1 var2 [more variable names] varN;
        DATALINES;
        [data would be typed here]
    RUN;

Notice that there is no semi-colon ";" after the last line of data; however, we use a semi-colon after every line of code. Variable names can use any alpha-numeric symbols; however, a name can be no more than 8 characters long in SAS Version 6.0. We do not put commas between variable names. The INPUT command dictates variable names and the order in which the data will be entered. DATALINES dictates that actual data follows.
For example, if we want to enter income and ages for 5 people, we write:

    DATA income1;
        INPUT income age;
        DATALINES;
        10000 50
        75000 43
        23000 67
        10000 19
        100000 56
    RUN;

SAS understands that the data is read as "income age", and only requires one space between data entries: you can, however, place as many spaces between data entries as you like. Also, you do not need to indent code the way I do; however, indented code is much easier to read: you will need my help from time to time, so you should write your code in a manner that is easy to understand.

SAS differentiates between numerical and character variables. For data that is non-numerical, use the dollar-sign "$" after (to the right of) the variable name, separated by one space. For example, suppose that the above dataset "income1" includes gender information in the form of "m" for male and "f" for female. We can write:

    DATA income1;
        INPUT income age sex $;
        DATALINES;
        10000 50 m
        75000 43 m
        23000 67 f
        10000 19 f
        100000 56 m
    RUN;

We now have a dataset named "income1" with five observations (5 people), and income, age and gender information.

Example: We want to create a dataset with monthly GNP (in $trillions) information; however, not all months are present in our sample. We have information for 4 months.

    DATA gnp_mon;
        INPUT gnp month $;
        DATALINES;
        2 jan
        2.01 march
        1.99 july
        2.00 dec
    RUN;

Thus, we have data for January, March, July and December.

2. External Data Entry: INFILE, FILENAME, OBS

By far the most useful approach to data entry is the method of entering data directly from a drive, be it hard ("C") or floppy ("A"). We use the INFILE command for such basic entry:

    DATA [dataset name];
        INFILE 'drive:\folder\folder\…\filename.type';
        INPUT var1 var2 … varn;
    RUN;

The INFILE command directs SAS to some drive and sequence of folders. The file directory and name must be enclosed in single quotation marks. The file type may be .dat or .txt, depending on the files I give you, and ultimately depending on how you yourself make your data files.
I will comment later on the nature of .dat and .txt files. For example, if our income data exists on a floppy in a file named "income_data.dat", we can write:

    DATA income1;
        INFILE 'a:\income_data.dat';
        INPUT income age sex $;
    RUN;

If you plan on entering data from the same drive and file over and over again, you can simply re-name the file as follows:

    DATA [dataset name];
        FILENAME [file name] 'drive:\folder\folder\…\filename.type';
        INFILE [file name];
        INPUT var1 var2 … varn;
    RUN;

Notice that only spaces are placed between the new file name and the actual directory and drive specifications. For example,

    DATA income1;
        FILENAME inc_file 'a:\income_data.dat';
        INFILE inc_file;
        INPUT income age sex $;
    RUN;

Thus, SAS understands that "inc_file" refers to the location "a:\income_data.dat". Because the data comes from the external file, no DATALINES statement is needed. You can access the same simple file name in subsequent datasets. For example

    DATA income1;
        FILENAME inc_file 'a:\income_data.dat';
        INFILE inc_file;
        INPUT income age sex $;
    RUN;

    DATA income2;
        INFILE inc_file;
        INPUT income age sex $;
    RUN;

This simple program re-names the file for SAS's use, reads in the data, and re-reads the data in a second data step: the second data step does not require the file location specification (i.e. a:\income_data.dat) because SAS interprets "inc_file" as that location.

In many cases, we will not want to use an entire dataset: many datasets contain more than 50000 observations and more than 200 variables. While a program is still being developed, and we are running it only in order to find and remove errors, we may want to use only a few observations, and use the entire dataset only when all errors ("bugs") have been corrected. A simple way to control how many observations are read into a dataset is to use the OBS option on the INFILE statement. Suppose the file a:\income_data.dat has 10,000 observations, but we want only the first 100.
Then, we write

    DATA income1;
        INFILE 'a:\income_data.dat' OBS = 100;
        INPUT income age sex $;
    RUN;

3. Creating New Variables

3.1 Arithmetic Operations

During the data entry stage of any data step, we can create new variables using basic arithmetic and logic commands. For example:

    DATA income1;
        FILENAME inc_file 'a:\income_data.dat';
        INFILE inc_file;
        INPUT income age sex $;
        income_sq = income*income;
    RUN;

The code "income_sq = income*income" creates a new variable named "income_sq" which equals income squared (i.e. income_sq = income^2). SAS understands that the operation is to be performed for all data observations. Mathematical symbols include

    *      times
    **     to the power of
    /      divide
    +      plus
    -      minus
    log    natural log
    exp    the exponential function (i.e. exp(x) = e^x, e = 2.7183)

Thus, we could have written "income_sq = income**2". For example, if we read in variables x and y, and we want ln(x), x^4, x - y and x/y as new variables, we can write

    DATA d1;
        INPUT x y;
        x_4 = x**4;
        ln_x = log(x);
        xmy = x - y;
        xdy = x/y;
        DATALINES;
        10000 50
        75000 43
        23000 67
    RUN;

Note that SAS will now understand that the dataset "d1" has 6 variables: x, y, ln_x, x_4, xmy and xdy.

3.2 Logical Operations

Many variables should only be constructed when a condition is satisfied, or perhaps a variable's value depends not on specific values of other variables (e.g. ln_x = log(x)), but rather on value ranges. For such derivations, we use IF, THEN, ELSE logical operations with connectors AND and OR.

Consider, for example, that we have a variable "ed" that denotes the number of years of education. In the U.S., if ed >= 12, we would understand that the individual graduated from high school. Likewise, if ed >= 16, we might conclude that the individual has a basic degree from a university. In econometric analysis, we often want to know both what impact the number of years of education has on income, as well as whether graduating from high school has an impact on income [5].
For such information, we will want to create a "dummy", or "binary", variable [6] that equals 1 if the individual graduated from high school, and 0 otherwise: all we want from these variables is the simple information of whether they graduated or not.

[5] After all, 11 years is not much less than 12 years (and 11.75 years does not mean the individual graduated from high school!), but a high school diploma will signal to many employers a certain skill level in the laborer, a certain degree of dedication that people who quit high school early may not have.

[6] We will study the use and implications of dummy variables throughout the semester.

For example, suppose we read in data on income, education and age, and we want to create variables that represent whether that individual has a high school or college education or not:

    DATA income1;
        INPUT income ed age;
        IF ed GE 12 THEN hs = 1;
        ELSE IF ed LT 12 THEN hs = 0;
        IF ed GE 16 THEN college = 1;
        ELSE IF ed LT 16 THEN college = 0;
        DATALINES;
        10000 15 45
        24000 18 54
        31000 9 69
    RUN;

The code literally states that if the education level of an individual is greater than or equal to [GE] 12, then a new variable, named "hs", is set equal to 1. However [ELSE], if years of education is less than [LT] 12, then the variable "hs" is set to 0. Likewise, if education is greater than or equal to [GE] 16, a new variable, named "college", is set equal to 1. However [ELSE], if the number of years of education is less than [LT] 16, then "college" is set to zero. Clearly, the first person has a high school education but not a college education, so hs = 1 and college = 0 for the first individual. If we run the above program and print the dataset, then the output looks like this:

    The SAS System            21:10 Wednesday, May 9, 2000

    Obs    income    ed    age    hs    college
      1     10000    15     45     1       0
      2     24000    18     54     1       1
      3     31000     9     69     0       0

The dataset now has 5 variables: the three original variables and the two new dummy variables.
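The two IF/ELSE pairs above can also be written more compactly. What follows is only a minimal sketch, assuming the same three observations: in SAS a comparison such as (ed GE 12) evaluates to 1 when it is true and to 0 when it is false, so the result can be assigned directly to the dummy variable. The dataset name "income2" and the guard for missing education values are illustrative additions and are not part of the original notes.

```sas
/* Sketch only: income2 and the missing-value guard are assumptions    */
/* added for illustration; they do not appear in the original notes.   */
DATA income2;
    INPUT income ed age;
    /* A comparison evaluates to 1 when true and 0 when false. */
    hs      = (ed GE 12);
    college = (ed GE 16);
    /* SAS orders a missing numeric value below every number, so a     */
    /* missing ed would be coded 0 above; reset such cases to missing. */
    IF ed = . THEN DO;
        hs = .;
        college = .;
    END;
    DATALINES;
    10000 15 45
    24000 18 54
    31000 9 69
RUN;

PROC PRINT DATA = income2;
RUN;
```

For these three observations, printing "income2" should give the same hs and college columns as the output above. The guard matters only when education is unknown: without it, a person with a missing value of ed would silently be coded as a non-graduate.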
The logical operators available are as follows:

    Operator    Definition                  Symbol
    EQ          equal to                    =
    GT          greater than                >
    LT          less than                   <
    GE          greater than or equal to    >=
    LE          less than or equal to       <=
    NE          not equal to                ^=
    NOT         not                         ^
    AND         and                         &
    OR          or                          |

Consider a more complicated piece of information. Suppose we want a variable for people aged 50 or over who have at least 14 years of education (i.e. they are high school graduates from before the 1980's with at least some college education). We can use the AND and OR operators as follows:

    DATA income1;
        INPUT income ed age;
        IF ed GE 14 AND age GE 50 THEN coll_50 = 1;
        ELSE IF ed LT 14 OR age LT 50 THEN coll_50 = 0;
        DATALINES;
        10000 15 45
        24000 18 54
        31000 9 69
    RUN;

Thus, only if a person is at least 50 years old and [AND] they have at least 14 years of education will the new variable "coll_50" be set to 1. However, if they are too young (age < 50) or [OR] if they have too little education, then they do not satisfy our compound criteria, and the new variable "coll_50" is set to 0. If we print the dataset, we find

    The SAS System            21:22 Wednesday, May 9, 2000

    Obs    income    ed    age    coll_50
      1     10000    15     45       0
      2     24000    18     54       1
      3     31000     9     69       0

Only the second individual satisfies both criteria: she is both at least 50 years old AND has at least 14 years of education.

4. Creating Datasets from Existing Datasets: MERGE, SET

Often, we will want to use the information in one dataset in order to build another dataset quickly. For example, we may read in information for 1000 people concerning wages, hours worked, and taxes paid, and read in from another source basic demographic information on the same 1000 people: education, marital status, age, gender, and number of children. Or, we may find in one data source on the web information on a country's GNP, interest rates, unemployment rate and inflation rate for the period 1970-1979, and from another data source the same information for the period 1980-1989.
In order to use all of the data at once during the stage of econometric analysis, we will want to build one dataset containing all relevant information (all variables concerning one person, or all time periods concerning several economic quantifiers). Two simple techniques utilized for such dataset blending are the MERGE and SET commands employed during any data step.

4.1 MERGE

Consider the following code, which builds two datasets containing, respectively, economic and demographic data about the same group of people:

    DATA income1;
        INPUT income taxes hours;
        DATALINES;
        10000 100 54
        75000 23000 38
        23000 3000 40
    RUN;

    DATA demog1;
        INPUT age gender; /* gender = 1 if male, gender = 0 if female */
        DATALINES;
        27 1
        64 0
        43 0
    RUN;

Note that any text between the items /* */ is treated as a comment, and ignored by SAS. The variable gender is simply a dummy variable representing male if the value is 1 and female if the value is 0. To merge these datasets, we code a third data step as follows:

    DATA inc_dem;
        MERGE income1 demog1;
    RUN;

Now, the new dataset "inc_dem" contains 5 variables: income, taxes, hours, age and gender. If we print the dataset, we observe

    The SAS System            10:59 Thursday, May 10, 2000

    Obs    income    taxes    hours    age    gender
      1     10000      100       54     27       1
      2     75000    23000       38     64       0
      3     23000     3000       40     43       0

SAS literally places the two datasets side-by-side.

WARNING: your datasets must have the observations arranged in the same order to ensure that information for the same individual is merged.

WARNING: in order to merge datasets with different information concerning the same people, no variable names can be shared between datasets; otherwise the values from the dataset listed last overwrite those from the dataset listed first.

4.2 SET

The command SET is used to stack (i.e. concatenate) different datasets which have the same variables. This is particularly useful for combining different datasets with time-series information.
Consider the example given above: suppose we find in one data source on the web information on a country's GNP and unemployment rate for the period 1970-1974, and from another data source the same information for the period 1975-1979. We will want to combine the data; however, we do not want to perform a side-by-side merge in the manner performed above. We want the data to be stacked vertically, with the years 1970-1974 on top, and the years 1975-1979 on the bottom:

    DATA data_70; /* contains data for the years 1970-1974 */
        /* GNP is in billions; unemployment rate is a percent: e.g. 6 denotes 6% = .06 */
        INPUT gnp ue_rate;
        DATALINES;
        3000 4
        3100 3.9
        3120 3.92
        3110 4.1
        2900 4.3
    RUN;

    DATA data_75; /* contains data for the years 1975-1979 */
        INPUT gnp ue_rate;
        DATALINES;
        2910 4.2
        3000 4.1
        3000 4
        3100 3.7
        3300 3.2
    RUN;

    DATA data_70_75;
        SET data_70 data_75;
    RUN;

Notice that we have created a third dataset named "data_70_75" containing all the information from the years 1970-1979. The SET command will automatically concatenate (stack) the data with the dataset stated first (i.e. data_70) on top, and the second dataset on the bottom. If we print the dataset, we observe:

    The SAS System            10:59 Thursday, May 10, 2000

    Obs     gnp    ue_rate
      1    3000     4.00
      2    3100     3.90
      3    3120     3.92
      4    3110     4.10
      5    2900     4.30
      6    2910     4.20
      7    3000     4.10
      8    3000     4.00
      9    3100     3.70
     10    3300     3.20

Thus, all of the relevant data was stacked with 1970 on top and 1979 on the bottom.

5. Describing Data: Simple Data Inspection

In this section, we will learn the following procedures for basic visual inspection of our data:

    PRINT
    SORT
    CONTENTS

5.1 PROC PRINT

This procedure is used to print entire or partial datasets. Consider the examples:

    DATA income1;
        INPUT income taxes hours;
        DATALINES;
        10000 100 54
        75000 23000 38
        23000 3000 40
    RUN;

    PROC PRINT DATA = income1;
    RUN;

Notice that we must specify which dataset is to be printed. Unless we state otherwise, SAS will print the entire set.
The output window will contain:

    The SAS System            10:59 Thursday, May 10, 2000

    Obs    income    taxes    hours
      1     10000      100       54
      2     75000    23000       38
      3     23000     3000       40

Consider delineating specific variables to be printed.

    DATA income1;
        INPUT income taxes hours;
        DATALINES;
        10000 100 54
        75000 23000 38
        23000 3000 40
    RUN;

    PROC PRINT DATA = income1;
        VAR income;
    RUN;

Here, we specify that we want only the variable [VAR] "income" to be printed. The output window contains:

    The SAS System            10:59 Thursday, May 10, 200

    Obs    income
      1     10000
      2     75000
      3     23000

Finally, consider printing several variables, but not all that exist in the dataset:

    DATA income1;
        INPUT income taxes hours;
        DATALINES;
        10000 100 54
        75000 23000 38
        23000 3000 40
    RUN;

    PROC PRINT data = income1;
        VAR taxes hours;
    RUN;

We can delineate as many or as few variables as we like: as with other SAS command structures, we do not use commas between the variable names. The output window contains:

    The SAS System            10:59 Thursday, May 10, 200

    Obs    taxes    hours
      1      100       54
      2    23000       38
      3     3000       40

5.2 PROC SORT

Sorting data is intuitive and simple. Consider sorting the above dataset "income1" according to income (i.e. we want to sort all individuals and all variables with individuals who have the smallest incomes at the "top" of the dataset, and individuals with the largest incomes at the "bottom" of the dataset). We write the following code:

    DATA income1;
        INPUT income taxes hours;
        DATALINES;
        10000 100 54
        75000 23000 38
        23000 3000 40
    RUN;

    PROC SORT DATA = income1;
        BY income;
    RUN;

    PROC PRINT DATA = income1;
    RUN;

The syntax here is the same as with PROC PRINT: we must tell SAS which dataset is to be sorted. Moreover, whenever we sort data, the sort must be according to, or BY, some criterion. The output window contains:

    The SAS System            13:32 Thursday, May 10, 2001   1

    Obs    income    taxes    hours
      1     10000      100       54
      2     23000     3000       40
      3     75000    23000       38

Note that the dataset is now permanently changed.
Whenever you refer to this dataset, SAS will interpret it as sorted according to income.

We can use the DESCENDING command to dictate that the data is to be sorted from the highest value of the BY variable to the lowest value:

    DATA income1;
        INPUT income taxes hours;
        DATALINES;
        10000 100 54
        75000 23000 38
        23000 3000 40
    RUN;

    PROC SORT DATA = income1;
        BY DESCENDING income;
    RUN;

    PROC PRINT DATA = income1;
    RUN;

The output window contains

    The SAS System            13:32 Thursday, May 10, 200

    Obs    income    taxes    hours
      1     75000    23000       38
      2     23000     3000       40
      3     10000      100       54

We can also sort according to several criteria. For example, suppose we have data on stock traders' names, and the net number of stock shares traded (positive values denote net purchases; negative values denote net sales). Our dataset contains the information:

    First NAME    Last NAME    STOCK SHARES
    Frank         Smith          10
    Betty         Jones           5
    Betty         Jones          10
    Frank         Smith         100
    Frank         Albert         40
    Betty         Jones          50
    Frank         Albert         20
    Betty         Jones          45

We want to read in this data, and sort by last name, first name, and finally by number of shares traded. By last name, Albert comes first, with stock shares traded in volumes of 40 and 20: Albert's observations will therefore appear first, sorted with 20 and then 40 shares traded. We code as follows:

    DATA trades;
        INPUT fname $ lname $ shares;
        DATALINES;
        Frank Smith 10
        Betty Jones 5
        Betty Jones 10
        Frank Smith 100
        Frank Albert 40
        Betty Jones 50
        Frank Albert 20
        Betty Jones 45
    RUN;

    PROC SORT DATA = trades;
        BY lname fname shares;
    RUN;

    PROC PRINT DATA = trades;
    RUN;

The output window displays:

    The SAS System            13:32 Thursday, May 10, 2001   19

    Obs    fname    lname     shares
      1    Frank    Albert        20
      2    Frank    Albert        40
      3    Betty    Jones          5
      4    Betty    Jones         10
      5    Betty    Jones         45
      6    Betty    Jones         50
      7    Frank    Smith         10
      8    Frank    Smith        100

5.3 PROC CONTENTS

If we simply want to know basic structural (i.e. non-statistical) information about a dataset, we can use the CONTENTS procedure.
This is especially helpful when our econometric results do not appear the way we expected them to: we may have damaged data, and one easy way to detect the damage is to inspect the basic dataset properties. The procedure CONTENTS details the number of variables and observations, together with the type, length and position of each variable (if your dataset is too large to inspect visually in EXCEL, then CONTENTS can provide a quick peek). It does not count missing values (some variables may not be recorded for some people or during some periods); for that, use the NMISS keyword of PROC MEANS. For example, consider the dataset "income1" with income, taxes and hours worked for three people.

    DATA income1;
    INPUT income taxes hours;
    DATALINES;
    10000 100 54
    75000 23000 38
    23000 3000 40
    RUN;
    PROC CONTENTS DATA = income1;
    RUN;

As usual, we need to dictate which dataset is to be inspected by CONTENTS. The output window contains:

    The SAS System          17:49 Friday, May 11, 2000

    The CONTENTS Procedure

    Data Set Name: WORK.INCOME1                       Observations:         3
    Member Type:   DATA                               Variables:            3
    Engine:        V8                                 Indexes:              0
    Created:       17:49 Friday, May 11, 2000         Observation Length:   24
    Last Modified: 17:49 Friday, May 11, 2000         Deleted Observations: 0
    Protection:                                       Compressed:           NO
    Label:

    -----Engine/Host Dependent Information-----

    Data Set Page Size:         4096
    Number of Data Set Pages:   1
    First Data Page:            1
    Max Obs per Page:           168
    Obs in First Data Page:     3
    Number of Data Set Repairs: 0
    File Name:                  C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\SAS Temporary Files\_TD1148\income1.sas7bdat
    Release Created:            8.0101M0
    Host Created:               WIN_PRO

    -----Alphabetic List of Variables and Attributes-----

    #    Variable    Type    Len    Pos
    -----------------------------------
    3    hours       Num       8     16
    1    income      Num       8      0
    2    taxes       Num       8      8

Like most procedures, this procedure permits many internal commands that direct SAS to display specific information that is not displayed by default: consult SAS's help screen (footnote 7).
Footnote 7: If your version of SAS is 6.0 or greater, then a very useful help screen should be installed. For all commands and procedures we employ, you should always search the help screen for further information. Simply click on the "book" icon to the upper-right.

For example, SAS permits many optional commands that can be entered after the "DATA =" statement:

    PROC CONTENTS DATA = dataset [option] [option] … [option];
    RUN;

Two such optional commands are:

    OUT =     Specify the output data set
    VARNUM    Print a list of the variables by their position in the data set

Thus, the programmer can save the CONTENTS output to another dataset, as well as list variables in the order in which they appear in the dataset, as opposed to in alphabetical order (see the example above). Such a variable listing can be helpful if you have many (e.g. 50, 100, 200) variables, and you want to check whether you are reading the data in the right order (e.g. does "income" come before "taxes"? If you have the wrong order in your program, then your income variable will contain numerical information about taxes).

6. Describing Data: Simple Data Analysis

In this section, we will learn the following procedures for basic statistical inspection of our data:

    MEANS
    CORR
    UNIVARIATE

6.1 PROC MEANS

PROC MEANS creates and displays basic sample statistics, confidence interval and simple hypothesis test information, including the sample mean, variance, standard deviation, the minimum and maximum values of specified variables, and t-tests for the null hypothesis that the mean of a variable is zero. If no specifications are provided, SAS will automatically display results for all numeric variables. For example:

    DATA income1;
    INPUT income taxes hours;
    DATALINES;
    10000 100 54
    75000 23000 38
    23000 3000 40
    RUN;
    PROC MEANS DATA = income1;
    RUN;

The output window contains:

    The SAS System          17:49 Friday, May 11, 200

    The MEANS Procedure

    Variable    N          Mean       Std Dev        Minimum       Maximum
    ----------------------------------------------------------------------
    income      3      36000.00      34394.77       10000.00      75000.00
    taxes       3       8700.00      12468.76    100.0000000      23000.00
    hours       3    44.0000000     8.7177979     38.0000000    54.0000000
    ----------------------------------------------------------------------

If you want information for select variables, use the VAR statement:

    PROC MEANS DATA = income1;
    VAR income taxes;
    RUN;

    The SAS System          17:49 Friday, May 11, 200

    The MEANS Procedure

    Variable    N          Mean       Std Dev        Minimum       Maximum
    ----------------------------------------------------------------------
    income      3      36000.00      34394.77       10000.00      75000.00
    taxes       3       8700.00      12468.76    100.0000000      23000.00
    ----------------------------------------------------------------------

The proper syntax for PROC MEANS includes options and statistic keywords that dictate which information is to be displayed. If you use such keywords, SAS will only provide that information, and omit all other statistics.

    PROC MEANS [option(s)] [statistic-keyword(s)]

Statistic keywords include:

    ALL        all statistics listed
    CLM        100(1 - α)% confidence limits for the MEAN, where α is determined by the "ALPHA=" option, and the default is α = .05
    CLSUM      100(1 - α)% confidence limits for the SUM, where α is determined by the "ALPHA=" option, and the default is α = .05
    CV         coefficient of variation
    DF         degrees of freedom for the t test
    KURTOSIS   the kurtosis of the data
    MAX        maximum value
    MEAN       sample mean of a numeric variable
    MIN        minimum value
    NMISS      number of missing observations
    NOBS       number of non-missing observations (footnote 8)
    PRT        two-tailed p-value of the t test: the probability that a t random variable exceeds, in absolute value, the t-statistic we have derived
    RANGE      range, MAX - MIN
    STD        sample standard deviation
    STDERR     standard error of the MEAN
    SUM        sum (weighted, when weights are supplied)
    SKEWNESS   the skew of the data
    T          t value for H0: population MEAN = 0, with n - 1 degrees of freedom
    VAR        sample variance
    VARSUM     variance of the SUM

Note that ALL, CLSUM, DF, NOBS and VARSUM are statistic keywords of the related SURVEYMEANS procedure rather than of PROC MEANS; in PROC MEANS, the keyword N gives the number of non-missing observations.

Footnote 8: Sometimes datasets do not contain complete information: some people in the dataset may not have recorded values of some data, like age, education, etc.

Comments

All of the above statistics are derived as sample statistics. Consult the textbook, or any introductory-level textbook in statistics. The formulas below are the raw sample moments; the skewness and kurtosis that SAS prints are standardized, bias-adjusted versions of them.

KURTOSIS:
\[ \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^4 \]
derived as a sample analogue of $E[(x-\mu_x)^4]$.

MEAN:
\[ \bar{x} = \frac{1}{n}\sum_{i=1}^{n}x_i \]
derived as a sample analogue (estimate) of $E[x]$.

SKEWNESS:
\[ \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^3 \]
derived as a sample estimate of $E[(x-\mu_x)^3]$.

STD:
\[ s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2} \]
the estimate of the standard deviation of the population, $\sigma$, provided the data is i.i.d.

STDERR:
\[ s_{\bar{x}} = \sqrt{\frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}{n}} = \sqrt{\frac{\mathrm{STD}^2}{n}} = \frac{s}{\sqrt{n}} \]
usually referred to as the standard error of the mean, or $s_{\bar{x}}$. This is an estimator of the standard deviation of the sample mean $\bar{x}$, whose variance is
\[ V[\bar{x}] = V\left[\frac{1}{n}\sum_{i=1}^{n}x_i\right] = \frac{1}{n^2}\sum_{i=1}^{n}V[x_i] = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n} \]
provided the data is i.i.d.

VAR:
\[ s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 = \mathrm{STD}^2 \]
the (sample) estimate of the variance of the population, $\sigma^2$, provided the data are i.i.d.

T:
\[ t = \frac{\bar{x}}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \,/\, n}} \]
The sample mean of an i.i.d. process $x_i$, divided by the standard deviation of that sample mean, converges in distribution to a standard normal random variable under the null hypothesis that the true mean of the process is zero. Therefore we know:
\[ H_0: E[x] = \mu_x = 0 \quad\Rightarrow\quad \frac{\bar{x}-\mu_x}{\sqrt{V[\bar{x}]}} = \frac{\bar{x}}{\sqrt{\sigma^2/n}} = Z \sim N(0,1) \ \text{ if the null is true.} \]
This Z statistic is accompanied by a two-tailed p-value. Consider the case where $\bar{x} = 10$.
Then, a p-value for our null is the probability statement
\[ P(|\bar{x}| \ge 10) = 2P(\bar{x} \ge 10) = 2P\left(\frac{\bar{x}-0}{\sqrt{\sigma^2/n}} \ge \frac{10-0}{\sqrt{\sigma^2/n}}\right). \]
Because the random variable
\[ \frac{\bar{x}}{\sqrt{\sigma^2/n}} \sim N(0,1) \ \text{ if the null is true,} \]
we can use the standard normal table to look up the probability that a standard normal variable exceeds the cut-off value
\[ \frac{10-0}{\sqrt{\sigma^2/n}}. \]
Of course, we do not know the true variance $\sigma^2$; thus, employing a sample estimate of the variance, the resulting random variable will roughly be t-distributed with n - 1 degrees of freedom (footnote 9):
\[ t = \frac{\bar{x}}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2 \,/\, n}} \sim t_{n-1}. \]

Footnote 9: The t-statistic will be exactly t-distributed if the data is normally distributed. This is a fundamental reason why many economists assume their data is made up of normal random variables.

Example: Consider a dataset with information on stock returns:

    DATA stocks;
    INPUT return;
    DATALINES;
    1
    2
    -4
    5
    0
    RUN;
    /* Then we run the following three PROC MEANS */
    PROC MEANS DATA = stocks CLM ALPHA = .01;
    RUN;
    PROC MEANS DATA = stocks CLM ALPHA = .05;
    RUN;
    PROC MEANS DATA = stocks T PRT SKEWNESS KURTOSIS MEAN VAR;
    RUN;

The output will be stacked in order of the MEANS statements. The first output page contains the results of a 99% Confidence Interval:

    The SAS System          17:49 Friday, May 11, 2000

    The MEANS Procedure

    Analysis Variable : return

    Lower 99%       Upper 99%
    CL for Mean     CL for Mean
    ----------------------------
    -5.9352101      7.5352101
    ----------------------------

Notice that the "ALPHA = .01" command dictates a 1 - .01 = .99 Confidence Interval. The second output page contains the results of a 95% Confidence Interval:

    The SAS System          17:49 Friday, May 11, 2000

    The MEANS Procedure

    Analysis Variable : return
    Lower 95%       Upper 95%
    CL for Mean     CL for Mean
    ----------------------------
    -3.2615890      4.8615890
    ----------------------------

The third output page contains the results of the t-test, together with the sample skewness, kurtosis, mean and variance:

    The SAS System          12:00 Sunday, May 13, 2000

    The MEANS Procedure

    Analysis Variable : return

    t Value    Pr > |t|      Skewness     Kurtosis         Mean      Variance
    --------------------------------------------------------------------------
       0.55      0.6135    -0.4199926    1.2201939    0.8000000    10.7000000
    --------------------------------------------------------------------------

Notice that we cannot reject the null hypothesis: the data are consistent with a mean-zero random variable. Recall that we reject a null hypothesis when the associated p-value is less than the size of the test. In this case, if we choose the size to be 5%, then clearly 61% > 5%, hence we cannot reject. When the p-value is less than the size (e.g. suppose the p-value were .02), data like ours would be too unlikely under a mean-zero random variable; consequently, we reject the hypothesis.

6.2 PROC CORR

We employ PROC CORR to derive sample correlation coefficients for variables in a dataset. Consider data on income, wages, gender, etc., derived from the 1978 Current Population Survey [CPS], a U.S. dataset built by the U.S. Bureau of Labor Statistics [BLS]. You will use this dataset for several projects in this course. For simple correlation coefficients between several variables (footnote 10), we write:

    DATA cps78;
    INFILE 'a:\cps78.dat';
    INPUT ED SOUTH NONWHITE HISPANIC FEMALE MARRIED MARRFE TENURE TENURE_2 UNION ln_wage AGE NUM_DEP;
    RUN;
    PROC CORR DATA = cps78;
    VAR ED FEMALE MARRIED TENURE UNION NUM_DEP;   /* the six variables shown in the output */
    RUN;

Here, I specify only a subset of the available variables for correlation analysis. SAS automatically prints basic statistical information, including the sample means, standard deviations, minima and maxima.
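If only the correlation table is of interest, the block of simple statistics can be switched off. The following is a small sketch assuming the cps78 dataset created above; NOSIMPLE is a standard PROC CORR option, but this variant is not part of the original notes.

```sas
/* Sketch: print the correlation table only, without the          */
/* "Simple Statistics" block (means, standard deviations, etc.).  */
PROC CORR DATA = cps78 NOSIMPLE;
    VAR ED FEMALE MARRIED TENURE UNION NUM_DEP;
RUN;
```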
Footnote 10: ED = years of education; SOUTH = 1 if the person lives in a southern state; NONWHITE = 1 if the person is black, Asian or Hispanic; FEMALE = 1 if female; MARRIED = 1 if married; MARRFE = 1 if the person is a married female; TENURE = years in their present job; TENURE_2 = tenure squared; UNION = 1 if a member of a union; ln_wage = ln(wage); NUM_DEP = number of children and other dependents in the household.

Notice that PROC MEANS would be more useful for hypothesis testing and confidence interval creation, as well as the generation of higher moments, like the skewness and kurtosis. The output, by default, includes sample statistics, correlation coefficients between all variables, and the p-value for the null hypothesis that the true correlations are zero. Like any standard hypothesis test at the 5%-level, if the resulting p-value is less than .05, we reject the null hypothesis that the true correlation is zero, and conclude that, irrespective of the actual sample value, we have reasonable evidence that the true correlation is non-zero.
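The p-value that PROC CORR attaches to a Pearson correlation can be checked by hand: under H0: ρ = 0, the statistic t = r·sqrt(n - 2)/sqrt(1 - r²) is compared with a t distribution with n - 2 degrees of freedom. The sketch below does this for r = 0.06365 and n = 550, the ED/FEMALE entry reported for this dataset; the variable names r, n, t and p are chosen here purely for illustration.

```sas
/* Sketch: reproduce a PROC CORR p-value from r and n by hand. */
DATA _NULL_;
    r = 0.06365;                             /* sample correlation (ED, FEMALE) */
    n = 550;                                 /* number of observations          */
    t = r * SQRT(n - 2) / SQRT(1 - r**2);    /* t statistic under H0: rho = 0   */
    p = 2 * (1 - PROBT(ABS(t), n - 2));      /* two-tailed p-value              */
    PUT t= 8.4 p= 8.4;                       /* written to the SAS log          */
RUN;
```

The value written to the log should land close to the 0.1360 that PROC CORR reports for this pair, so the null of zero correlation between education and gender is not rejected at the 5% level.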
    The SAS System          12:00 Sunday, May 13, 2000

    The CORR Procedure

    6 Variables: ED FEMALE MARRIED TENURE UNION NUM_DEP

    Simple Statistics

    Variable      N        Mean     Std Dev          Sum     Minimum     Maximum
    ED          550    12.53636     2.77209         6895     1.00000    18.00000
    FEMALE      550     0.37636     0.48491    207.00000           0     1.00000
    MARRIED     550     0.65273     0.47654    359.00000           0     1.00000
    TENURE      550    18.71818    13.34653        10295     1.00000    55.00000
    UNION       550     0.30545     0.46102    168.00000           0     1.00000
    NUM_DEP     550     0.98909     1.28600    544.00000           0     8.00000

    Pearson Correlation Coefficients, N = 550
    Prob > |r| under H0: Rho=0

                     ED      FEMALE     MARRIED      TENURE       UNION     NUM_DEP
    ED          1.00000     0.06365    -0.08212    -0.34708    -0.12273    -0.06171
                             0.1360      0.0543      <.0001      0.0039      0.1483
    FEMALE      0.06365     1.00000    -0.24526    -0.11727    -0.12408    -0.08687
                 0.1360                  <.0001      0.0059      0.0036      0.0417
    MARRIED    -0.08212    -0.24526     1.00000     0.29188     0.14378     0.24051
                 0.0543      <.0001                  <.0001      0.0007      <.0001
    TENURE     -0.34708    -0.11727     0.29188     1.00000     0.19045    -0.04401
                 <.0001      0.0059      <.0001                  <.0001      0.3029
    UNION      -0.12273    -0.12408     0.14378     0.19045     1.00000     0.09780
                 0.0039      0.0036      0.0007      <.0001                  0.0218
    NUM_DEP    -0.06171    -0.08687     0.24051    -0.04401     0.09780     1.00000
                 0.1483      0.0417      <.0001      0.3029      0.0218

Comments:

1. The true correlation and sample correlation coefficients are, respectively,
\[ \rho_{x,y} = \frac{\mathrm{cov}(x,y)}{\sigma_x\sigma_y} = \frac{E[(x-\mu_x)(y-\mu_y)]}{\sqrt{V[x]\,V[y]}} \]
\[ \hat{\rho}_{x,y} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}} \]

2. In each pair of rows, the second row holds the p-values (SAS does not set them apart typographically): notice that these are p-values for the test of the hypothesis that the true correlation is zero.

3. The correlation between any variable and itself is always one (can you prove this by using the above formulas?)

4. The symbol "<" of course means "less than", hence "<.0001" means the p-value is smaller than .0001. This, of course, is a very small p-value, implying that the null hypothesis that the true correlation is zero should be strongly rejected.

5.
Notice the relationship between education and number of dependents, union membership and work tenure: more education for Americans in the 1970s implied for many people less time for child bearing/rearing, especially for females, while more educated Americans tended not to participate in labor organizations. Moreover, not surprisingly, more education tended to be associated with fewer years in the labor force due to the time required to go to school.

6. What are the means of binary (i.e. dummy) random variables? How do we interpret the sample mean of "female", or "married", or "union"?

7. If you do not want correlations between all variables specified in the VAR command, use the WITH command to dictate which variables are to be analyzed with [WITH] the VAR variables. For example:

    PROC CORR DATA = cps78;
    VAR MARRIED TENURE UNION NUM_DEP;   /* the four variables shown in the output */
    WITH ED;
    RUN;

    Pearson Correlation Coefficients, N = 550
    Prob > |r| under H0: Rho=0

             MARRIED      TENURE       UNION     NUM_DEP
    ED      -0.08212    -0.34708    -0.12273    -0.06171
              0.0543      <.0001      0.0039      0.1483

Thus, SAS displays the correlation coefficients between ED, specified in the WITH statement, and the various variables listed in the VAR command.

6.3 UNIVARIATE

This procedure is essentially a combination of MEANS and CONTENTS: each variable specified (all variables are analyzed by default) is statistically and physically analyzed in manners similar to MEANS and CONTENTS.

IV. SAS and Econometric Analysis I: Basic Regression

This section details basic steps for performing least squares regression analysis in SAS using standard OLS theory. We will use SAS to regress some y on the available information x, and to perform basic tasks of inference and model improvement.

1. PROC REG

We will use the procedure REG to perform basic regression analysis. There are many other procs in SAS that can be used for least squares estimation depending on the sophistication of the problem (e.g.
dependent error terms, error terms with non-constant variance, regression of many models simultaneously, etc.). The following definition is what SAS's help screen (roughly) says about PROC REG under the assumption that there may be more than one regressor (i.e. the X's) available:

PROC REG: Syntax

The following statements are available in PROC REG.

    PROC REG OPTIONS;
    Label: MODEL Y = X1 X2 … Xk / OPTIONS;
    BY variables;
    OUTPUT OUT = dataset OPTIONS;
    PLOT Yvar*Xvar / OPTIONS;
    Label: TEST test specifications / OPTIONS;

We will study the various options and commands below. Consider, first, a simple example.

Example 1

Consider the CPS dataset detailed above, and suppose it is contained in the file data_1_1.dat on a floppy disk. The data contains information on age, education, log-wages, gender, union status, number of children, etc. Suppose we want to see if the level of education provides an adequate explanation for log-wages. Define Y = ln_wage and X = ed, and suppose we want to estimate

\[ (1) \qquad E[Y_i \mid X_i] = \beta_1 + \beta_2 X_i, \qquad Y_i = \beta_1 + \beta_2 X_i + e_i \]

where the errors $e_i$ satisfy the usual assumptions (i.e. zero mean, constant variance, zero correlation, normally distributed). We write:

    DATA cps;   /* CPS data */
    INFILE 'a:\data_1_1.dat';
    INPUT ED SOUTH NONWHITE HISPANIC FEMALE MARRIED MARRFE TENURE TENURE_2 UNION ln_wage AGE NUM_DEP MANUF CONSTRUCT MANAG SALES CLER SERV PROF;
    RUN;
    PROC REG DATA = cps;
    MODEL ln_wage = ed;
    RUN;

SAS understands that the variable on the left-hand side of the equality in the MODEL statement is the dependent variable Y, and anything on the right-hand side is understood to be the independent variables X. Notice that we have not used any options: the above code is the simplest possible way to run a bivariate regression.
The output is as follows:

    The SAS System          11:02 Sunday, May 27, 2001   1

    The REG Procedure
    Model: MODEL1
    Dependent Variable: ln_wage

    Analysis of Variance
                                 Sum of          Mean
    Source              DF      Squares        Square    F Value    Pr > F
    Model                1     11.36093      11.36093      51.65    <.0001
    Error              548    120.53845       0.21996
    Corrected Total    549    131.89938

    Root MSE           0.46900    R-Square    0.0861
    Dependent Mean     1.68100    Adj R-Sq    0.0845
    Coeff Var         27.90001

    Parameter Estimates
                       Parameter    Standard
    Variable     DF     Estimate       Error    t Value    Pr > |t|
    Intercept     1      1.03044     0.09270      11.12      <.0001
    ED            1      0.05189     0.00722       7.19      <.0001

The Analysis of Variance information will be studied in chapters 4 and 5; the Parameter Estimates will be studied in chapters 3 and 5. Hence, much of the above information will not be understandable until we study the chapters that follow chapter 3, although we can use the above information to gain insight into how well our regression model describes the data. For now, note that under Parameter Estimates, SAS lists the employed "variables", and calls them "Intercept" and "ED". Under "Parameter Estimate" (footnote 11: "Estimate" without an "s"), SAS lists the OLS estimates of the model in (1):

\[ \hat{\beta}_1 = 1.03044, \qquad \hat{\beta}_2 = .05189 \]

Moreover, SAS automatically performs tests of the two two-sided hypotheses

\[ H_0: \beta_1 = 0 \ \text{ vs. } \ H_1: \beta_1 \neq 0, \qquad H_0: \beta_2 = 0 \ \text{ vs. } \ H_1: \beta_2 \neq 0 \]

and presents the results under "t Value" and "Pr > |t|". The p-value of the test, itself, is contained in "Pr > |t|". If the p-value is less than our chosen size of the test, say 5% = .05, then we reject the null; equivalently, if the absolute value of the t-statistic is greater than 1.96 for a sufficiently large sample (e.g. n > 100), we reject the null:

    p-value < .05  =>  reject,   or   |t-value| > 1.96  =>  reject (if n > 100)

In the present case, both null hypotheses are rejected (each p-value is below .0001): this suggests that the true intercept is non-zero, and that there truly exists a relationship between education and wage (footnote 12).

2.
PROC REG: Commands and Options

The PROC REG statement presented above employs many auxiliary commands (not all are presented above) and allows for many options. Here, we will list and explain a few. Examples are provided below.

A. PROC REG options

After the "PROC REG DATA = dataset" statement, several options can be used:

    CORR       displays the correlations for all variables listed in the MODEL statement
    ALPHA =    sets the probability level for confidence intervals with respect to the OLS estimators

B. MODEL / options

After the MODEL statement and the stated Y and X variables, use a slash "/", and any of the following options:

    ALPHA =    sets the probability level for confidence intervals with respect to the OLS estimators
    CLB        dictates to SAS that confidence intervals are to be created for all regression model estimators
    CORRB      displays the correlations between the various OLS estimators
    COVB       displays the variances and covariances for the estimators
    NOINT      dictates to SAS that the intercept parameter is assumed to be zero

C. BY

The BY command here performs the same task as in PROC MEANS. SAS will perform separate regressions for each category within the BY variables: SAS expects the dataset to be sorted by the employed variables. SAS only recognizes one BY command at a time, hence if you want to estimate various regression models according to various sub-group divisions, use several PROC REGs, and change the BY variables for each.

D. OUTPUT OUT = dataset

If you want to save observation-level regression output (e.g. predicted values, residuals, etc.) to another dataset, use this command. Note: the dataset that you save the regression to does not need to exist: SAS will simply create a new dataset with the assigned name. In order to tell SAS which elements to send to the output dataset, use the following keywords (there are far more than the ones below) after the "OUTPUT OUT = dataset", and without a slash "/":

    P = variable name    denotes the predicted values of Y; you need to assign a name for this variable, like "y_hat"
    R = variable name    denotes the residuals; you need to assign a name for this variable, like "e_hat"

Thus, you can easily derive the predicted dependent variables and the regression residuals.

Footnote 12: Indeed, a non-zero intercept means that when education is zero, the individual's wages will not be zero: $E[Y_i \mid X_i = 0] = \beta_1 + \beta_2 \cdot 0 = \beta_1$, hence the intercept represents the minimum wage a person can earn based on having zero years of education. Not surprisingly, it is not zero: people can always find work even if they are uneducated. Moreover, a non-zero slope implies the marginal impact of a new year of education on wages is non-zero: $\partial E[Y_i \mid X_i] / \partial X_i = \beta_2 \neq 0$; thus, additional years of education will improve one's earning potential, on average.

E. PLOT

The PLOT statement in PROC REG displays scatter plots with yvariable on the vertical axis and xvariable on the horizontal axis. If you want to plot the residuals or the predicted values of Y, use the "RESIDUAL." and "PREDICTED." keywords: notice that there are dots, or periods, after the words RESIDUAL and PREDICTED. Also, notice that we specify the variable that goes on the Y-axis first, and the variable for the X-axis is stated second, with a "*" in between.

F. TEST

We will study this command in depth in the subsequent sections.

Example 2

We want to regress the log of wages Y on education X from the CPS data.

    DATA cps;   /* CPS data */
    INFILE 'a:\data_1_1.dat';
    INPUT ED SOUTH NONWHITE HISPANIC FEMALE MARRIED MARRFE TENURE TENURE_2 UNION ln_wage AGE NUM_DEP MANUF CONSTRUCT MANAG SALES CLER SERV PROF;
    RUN;
    PROC REG DATA = cps;
    MODEL ln_wage = ed;
    RUN;

    Parameter Estimates
                       Parameter    Standard
    Variable     DF     Estimate       Error    t Value    Pr > |t|
    Intercept     1      1.03044     0.09270      11.12      <.0001
    ED            1      0.05189     0.00722       7.19      <.0001

Example 3

We want to regress the log of wages Y on education X from the CPS data without an intercept term.
    PROC REG DATA = cps;
    MODEL ln_wage = ed / NOINT;
    RUN;

    Parameter Estimates
                       Parameter    Standard
    Variable     DF     Estimate       Error    t Value    Pr > |t|
    ED            1      0.13027     0.00172      75.61      <.0001

Example 4

We want to regress the log of wages Y on education X, and regress the log of wages Y on job tenure (years in the labor force) X. We can use two separate PROC REGs, or simply use two separate MODEL statements. In order to clarify the output for our own sake, we can use labels for each MODEL command. Note: we do not need to use labels, and we can always use labels even when we only estimate one model.

    PROC REG DATA = cps;
    Wage_Ed: MODEL ln_wage = ed;          /* "Wage_Ed" will be used to signify the output of this regression */
    Tenure_Ed: MODEL ln_wage = tenure;    /* "Tenure_Ed" will be used to signify the output of this regression */
    RUN;

    The SAS System          19:17 Sunday, May 27, 2001  21

    The REG Procedure
    Model: Wage_Ed
    Dependent Variable: ln_wage

    Parameter Estimates
                       Parameter    Standard
    Variable     DF     Estimate       Error    t Value    Pr > |t|
    Intercept     1      1.03044     0.09270      11.12      <.0001
    ED            1      0.05189     0.00722       7.19      <.0001

    The SAS System          19:17 Sunday, May 27, 2001  22

    The REG Procedure
    Model: Tenure_Ed
    Dependent Variable: ln_wage

    Parameter Estimates
                       Parameter    Standard
    Variable     DF     Estimate       Error    t Value    Pr > |t|
    Intercept     1      1.50849     0.03490      43.22      <.0001
    TENURE        1      0.00922     0.00152       6.07      <.0001

Example 5

We want to regress the log of wages Y on education X separately for females with children, and for those who are not female or do not have children.
    DATA cps;   /* CPS data */
    INFILE 'a:\data_1_1.dat';
    INPUT ED SOUTH NONWHITE HISPANIC FEMALE MARRIED MARRFE TENURE TENURE_2 UNION ln_wage AGE NUM_DEP MANUF CONSTRUCT MANAG SALES CLER SERV PROF;
    IF NUM_DEP > 0 THEN DEP = 1;          /* DEP = 1 if the person has dependents */
    ELSE IF NUM_DEP EQ 0 THEN DEP = 0;
    FEM_DEP = FEMALE*DEP;                 /* 1 only for females with dependents */
    RUN;
    PROC SORT DATA = cps;
    BY fem_dep;
    RUN;
    PROC REG DATA = cps;
    MODEL ln_wage = ed;
    BY fem_dep;
    RUN;

    The SAS System          19:17 Sunday, May 27, 2001  23

    ------------------------------ FEM_DEP=0 ------------------------------

    The REG Procedure
    Model: MODEL1
    Dependent Variable: ln_wage

    Parameter Estimates
                       Parameter    Standard
    Variable     DF     Estimate       Error    t Value    Pr > |t|
    Intercept     1      1.14712     0.09393      12.21      <.0001
    ED            1      0.04765     0.00728       6.55      <.0001

    The SAS System          19:17 Sunday, May 27, 2001  24

    ------------------------------ FEM_DEP=1 ------------------------------

    The REG Procedure
    Model: MODEL1
    Dependent Variable: ln_wage

    Parameter Estimates
                       Parameter    Standard
    Variable     DF     Estimate       Error    t Value    Pr > |t|
    Intercept     1      0.48333     0.26942       1.79      0.0762
    ED            1      0.07009     0.02160       3.25      0.0017

Example 6

We want to regress the log of wages Y on education X, and display confidence intervals for the OLS estimators.

    PROC REG DATA = cps;
    MODEL ln_wage = ed / CLB ALPHA = .05;
    RUN;

    The SAS System          19:17 Sunday, May 27, 2001  25

    The REG Procedure
    Model: MODEL1
    Dependent Variable: ln_wage

    Parameter Estimates
                       Parameter    Standard
    Variable     DF     Estimate       Error    t Value    Pr > |t|    95% Confidence Limits
    Intercept     1      1.03044     0.09270      11.12      <.0001     0.84835    1.21254
    ED            1      0.05189     0.00722       7.19      <.0001     0.03771    0.06608

Example 7

We want to regress the number of children Y on education X, and perform a variety of tasks.
    DATA cps;   /* CPS data */
    INFILE 'c:\Program Files\WS_FTP\econometrics\data_1_1.dat';   /* contains the CPS data */
    INPUT ED SOUTH NONWHITE HISPANIC FEMALE MARRIED MARRFE TENURE TENURE_2 UNION ln_wage AGE NUM_DEP MANUF CONSTRUCT MANAG SALES CLER SERV PROF;
    MALE_PRO = (1-FEMALE)*PROF;   /* 1 only for male professionals */
    RUN;
    PROC SORT DATA = cps;
    BY male_pro;
    RUN;
    PROC REG DATA = cps CORR;
    MODEL num_dep = ed / CLB ALPHA = .05 CORRB NOINT;
    BY male_pro;
    RUN;

    The SAS System          19:17 Sunday, May 27, 2001  30

    ------------------------------ MALE_PRO=0 ------------------------------

    The REG Procedure

    Uncorrected Correlation
    Variable         ED    NUM_DEP
    ED           1.0000     0.5811
    NUM_DEP      0.5811     1.0000

    The SAS System          19:17 Sunday, May 27, 2001  31

    ------------------------------ MALE_PRO=0 ------------------------------

    The REG Procedure
    Model: MODEL1
    Dependent Variable: NUM_DEP

    NOTE: No intercept in model. R-Square is redefined.

    Parameter Estimates
                       Parameter    Standard
    Variable     DF     Estimate       Error    t Value    Pr > |t|    95% Confidence Limits
    ED            1      0.07349     0.00466      15.77      <.0001     0.06433    0.08264

V. SAS and Econometric Analysis II: Multiple Regression with PROC AUTOREG

This section will provide the basic details for using SAS's PROC AUTOREG. This procedure performs the same tasks as PROC REG if the regression assumptions are standard, and can employ more sophisticated techniques if basic assumptions do not hold (e.g. correlated regression errors, regression errors with non-constant variance, etc.). A nice feature of this procedure is its ability to test a myriad of important hypotheses, including the hypothesis that the regression errors are normal random variables, and the hypotheses that the error variance is constant or that the errors are uncorrelated with each other; not all of these tests are available in PROC REG. In order to handle non-standard estimation environments, we will employ PROC AUTOREG for estimation when variance is non-constant and/or errors are correlated.

1.
PROC AUTOREG

The basic syntax of PROC AUTOREG is as follows:

    PROC AUTOREG options ;
    BY variables ;
    MODEL Y = X1 X2 … Xk / options ;
    TEST / options ;
    OUTPUT OUT = dataset options ;

We will study the various options below.

Example 1

The following code enters the coffee data from Project 2, performs basic regression with AUTOREG, and sends the regression output to new datasets. Notice that the

    OUTPUT OUT = q_out1 P = q_hat R = e_hat;

statement creates a new dataset called "q_out1". The statement "P = q_hat" tells SAS to place the predicted values into the new dataset, and call the new variable "q_hat". The statement "R = e_hat" tells SAS to place the regression residuals into the new dataset, and call the new variable "e_hat". We can then print the datasets, save the SAS output, and use EXCEL to make graphs: we will learn these tasks over the next few weeks.

    data coffee;
    infile 'c:\Program Files\WS_FTP\econometrics\data_2_1.dat';
    input q p;
    ln_q = log(q);
    ln_p = log(p);
    run;
    proc autoreg data = coffee;
    model q = p;
    output out = q_out1 P = q_hat R = e_hat;
    model ln_q = ln_p;
    output out = q_out2 P = q_hat R = e_hat;
    run;
    proc print data = q_out1;
    run;

    The AUTOREG Procedure
    Dependent Variable    q

    Ordinary Least Squares Estimates
    SSE                 0.14907972    DFE                        9
    MSE                    0.01656    Root MSE             0.12870
    SBC                 -11.300425    AIC               -12.096215
    Regress R-Square        0.6628    Total R-Square        0.6628
    Durbin-Watson           0.7266

                                 Standard                 Approx
    Variable     DF    Estimate     Error    t Value    Pr > |t|
    Intercept     1      2.6911    0.1216      22.13      <.0001
    p             1     -0.4795    0.1140      -4.21      0.0023

    The AUTOREG Procedure
    Dependent Variable    ln_q

    Ordinary Least Squares Estimates
    SSE                 0.02263302    DFE                        9
    MSE                    0.00251    Root MSE             0.05015
    SBC                 -32.036211    AIC               -32.832001
    Regress R-Square        0.7448    Total R-Square        0.7448
    Durbin-Watson           0.6801

                                 Standard                 Approx
    Variable     DF    Estimate     Error    t Value    Pr > |t|
    Intercept     1      0.7774    0.0152      51.00      <.0001
    ln_p          1     -0.2530    0.0494      -5.13      0.0006

    Obs      q_hat       e_hat       q       p       ln_q        ln_p
      1    2.32189     0.24811    2.57    0.77    0.94391    -0.26136
      2    2.33627     0.16373    2.50    0.74    0.91629    -0.30111
      3    2.34586     0.00414    2.35    0.72    0.85442    -0.32850
      4    2.34107    -0.04107    2.30    0.73    0.83291    -0.31471
      5    2.32668    -0.07668    2.25    0.76    0.81093    -0.27444
      6    2.33148    -0.13148    2.20    0.75    0.78846    -0.28768
      7    2.17323    -0.06323    2.11    1.08    0.74669     0.07696
      8    1.82318     0.11682    1.94    1.81    0.66269     0.59333
      9    2.02458    -0.05458    1.97    1.39    0.67803     0.32930
     10    2.11569    -0.05569    2.06    1.20    0.72271     0.18232
     11    2.13007    -0.11007    2.02    1.17    0.70310     0.15700

The bottom of the output presents the dataset called "q_out1": notice that SAS automatically places all the data from the original dataset in the output dataset. In addition, SAS places the regression predicted values, named "q_hat", and the regression residuals, named "e_hat", in this dataset.

Notice the different output arrangement when compared to PROC REG. SAS places the basic goodness-of-fit measures at the top of the output, including "SSE" (sum of squared errors, i.e. residuals), "MSE" (mean squared error, footnote 13) and the coefficient of determination, R².

2. PROC AUTOREG: Commands and Options

A. MODEL / options

After the MODEL statement and the stated Y and X variables, use a slash "/", and any of the following options:

    CORRB     displays the correlations between the various OLS estimators
    NOINT     dictates to SAS that the intercept parameter is assumed to be zero
    NORMAL    requests the Jarque-Bera normality test statistic for the regression residuals

B. BY

The BY command here performs the same task as in PROC REG. SAS will perform basic OLS tasks for each group specified by the BY variable. SAS expects the data to be sorted according to the BY variable.

C. OUTPUT OUT = dataset

If you want to save observation-level regression output (e.g. predicted values, residuals, etc.) to another dataset, use this command. Note: the dataset that you save the regression to does not need to exist: SAS will simply create a new dataset with the assigned name.
In order to tell SAS which elements to send to the output dataset, use the following keywords (there are far more than the ones below) after the "OUTPUT OUT = dataset", and without a slash "/":

P = variable name : denotes the predicted values of Y; you need to assign a name for this variable, like "y_hat"
R = variable name : denotes the residuals; you need to assign a name for this variable, like "e_hat"

3. Test of Normality: The NORMAL Command

As detailed above, PROC AUTOREG can perform the Jarque-Bera test of normality on the regression errors by employing the option NORMAL after the slash in the MODEL statement. Recall that the test statistic employs the skewness of the residuals (a measure of distribution symmetry) and the kurtosis (a measure of the thickness of the tails of the distribution: thicker tails mean that extreme values are more likely than under normality).

Under the null hypothesis that the true regression errors are normally distributed, the Jarque-Bera test statistic has a chi-squared distribution with 2 degrees of freedom (one for skewness and one for kurtosis), regardless of the number of variables used in the regression. Thus, if $H_0: e \sim N(0, \sigma^2)$ is true, then $JB \sim \chi^2(2)$. SAS automatically displays the p-value for the chi-squared test statistic.

[13] The MSE, or "mean squared error", is simply the estimated regression error variance: $\hat{\sigma}^2 = \frac{1}{n-2} \sum_i \hat{e}_i^2$.

Example 2

    proc autoreg data = coffee;
        model q = p / NORMAL;
    run;

    The SAS System          09:41 Wednesday, June 13, 2001  19

    The AUTOREG Procedure
    Dependent Variable    q

    Ordinary Least Squares Estimates
    SSE                0.14907972    DFE                       9
    MSE                   0.01656    Root MSE            0.12870
    SBC                -11.300425    AIC              -12.096215
    Regress R-Square       0.6628    Total R-Square       0.6628
    Normal Test            1.7466    Pr > ChiSq           0.4176
    Durbin-Watson          0.7266

                                 Standard              Approx
    Variable     DF   Estimate      Error   t Value   Pr > |t|
    Intercept     1     2.6911     0.1216     22.13     <.0001
    p             1    -0.4795     0.1140     -4.21     0.0023

SAS prints the Jarque-Bera statistic as "Normal Test 1.7466", and displays the associated p-value to the right, "Pr > ChiSq 0.4176".
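As a quick check on the output above, the reported p-value can be reproduced by hand. The sketch below assumes the usual textbook form of the Jarque-Bera statistic, with $S$ the sample skewness and $\kappa$ the sample kurtosis of the residuals and $n$ the sample size (these symbols are introduced here for illustration and do not appear in the notes). It also uses the fact that a chi-squared variable with 2 degrees of freedom has upper-tail probability $e^{-x/2}$.

```latex
\[
JB = \frac{n}{6}\left( S^2 + \frac{(\kappa - 3)^2}{4} \right),
\qquad
P\left( \chi^2_{(2)} > 1.7466 \right) = e^{-1.7466/2} = e^{-0.8733} \approx 0.4176 .
\]
```

The result agrees with the value SAS prints next to "Pr > ChiSq".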
The JB statistic always has two degrees of freedom (one for skewness, one for kurtosis), so under the null hypothesis it is a chi-squared random variable with 2 degrees of freedom: the 5% cutoff value is 5.99, and since 1.7466 < 5.99 we cannot reject the null. However, we can always simply refer to the p-value: the p-value = .4176 > .05, hence we cannot reject the null. For this sample, the regression residuals are reasonably similar to a normal random variable, hence we can maintain the assumption that the errors are, in fact, normal.

VI. SAS and Econometric Analysis III: Multiple Regression and Inference

This section explains how to use SAS to perform tests of model specification hypotheses. In particular, we will review how to use PROC REG for the classical F-test, PROC REG and PROC AUTOREG for general F-tests of multiple restrictions, and PROC AUTOREG for the RESET test of model correctness.

1. Classical F-test of Model Correctness

A. Theory

The classical F-test of model correctness is used to test the hypothesis that all slope parameters are simultaneously zero (i.e. none of the explanatory variables is linearly related to Y; the entire linear model is inappropriate). For the model

(1)  $Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \dots + \beta_K X_{Kt} + e_t$

the null hypothesis is

$H_0: \beta_2 = 0, \dots, \beta_K = 0$

Observe that we only test the slopes: the point of the hypothesis is to see whether any explanatory variables belong at all, not whether an intercept is appropriate. The F-statistic for a test of the above hypothesis is exactly

(2)  $F = \dfrac{(SST - SSE)/(K-1)}{SSE/(N-K)} \sim F(K-1, N-K)$ if the null hypothesis is true.

If the null is true, the F-statistic will tend to be small, whereas if the null is false, the statistic will tend to be very large: for a test at the 5% level, we reject if $F > F_c$, where the cutoff value is derived from the F-distribution with K - 1 and N - K degrees of freedom:

$P(F \ge F_c) = .05$, where $F \sim F(K-1, N-K)$

B. SAS

Use PROC REG.
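A minimal sketch of such a call, assuming the "coffee" dataset created in Example 1 of Section V is still available in the current SAS session:

```sas
/* Classical F-test: PROC REG prints it in the Analysis of Variance table */
proc reg data = coffee;
    model q = p;    /* H0: the slope on p is zero */
run;
quit;               /* PROC REG is interactive; QUIT ends the procedure */
```

With a single regressor the classical F-statistic equals the square of the slope t-statistic, so for these data the "F Value" should be about (-4.21)^2, or roughly 17.7.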
SAS automatically reports the F-statistic and associated p-value: reject the null hypothesis if the p-value < .05 (or whatever the test size is; e.g. .01, .05, .10).

2. General F-test of Multiple Restrictions

A. Theory

The general F-test of multiple restrictions is used to test complicated hypotheses that concern more than one parameter at a time. The hypothesis may place restrictions on any regression parameter (the intercept; any slope), may involve any number of parameters at one time, and may involve functions of parameters. Examples of null hypotheses testable by the F-test method include

i.   $H_0: \beta_1 = 0,\ \beta_3 = 0$
ii.  $H_0: \beta_2 = 2,\ \beta_3 = 3\beta_4,\ \beta_5 = 4$
iii. $H_0: \beta_2 + \beta_3 + \beta_4 + \beta_5 = 1$

The F-statistic for a test of the above hypotheses is based on running two separate regressions, one without any restrictions, and one with the hypothesized restrictions enforced (see footnote 14). The Sum of Squared Errors [SSE] is collected from the unrestricted model ($SSE_U$) and the restricted model ($SSE_R$). The F-statistic is exactly

(3)  $F = \dfrac{(SSE_R - SSE_U)/J}{SSE_U/(N-K)} \sim F(J, N-K)$ if the null hypothesis is true,

where J denotes the number of restrictions. For example, for the three examples (i) to (iii) above, the numbers of restrictions are respectively

i.   J = 2
ii.  J = 3
iii. J = 1

If the null is true, the restricted and unrestricted models will perform roughly identically, hence the SSE's will be nearly identical and the F-statistic will be close to zero. If the null is false, the model with the restrictions enforced will perform very poorly compared to the unrestricted model, hence the SSE from the restricted model will be comparatively large, and the statistic will be very large.

B. SAS

Use PROC REG or PROC AUTOREG. The test instructions are placed below the MODEL statement on separate lines of code. By way of example, consider an income model

(4)  $INCOME_t = \beta_1 + \beta_2 ED_t + \beta_3 AGE_t + \beta_4 NUM\_CHILD_t + e_t$

Suppose we want to test the two hypotheses

i.   $H_0: \beta_1 = 0,\ \beta_4 = 0$
ii.  $H_0: \beta_2 = .5\beta_3$

We write (see footnote 15)

    PROC REG DATA = d1;
        MODEL INCOME = ED AGE NUM_CHILD;
        TEST intercept = 0, NUM_CHILD = 0;   /* hypothesis i  */
        TEST ED = .5*AGE;                    /* hypothesis ii */
    RUN;

Notice that we refer to the estimated intercept literally as "intercept". SAS will report the results of each test on separate screens (i.e. you need to scroll down). SAS displays numerical values associated with the numerator and denominator of the F-statistic, the F-statistic itself, labeled "F Value", and the p-value, labeled "Pr > F". As usual, for a 5%-sized test we reject the null hypothesis if the p-value is below .05. Examples of SAS output follow:

    Test 1 Results for Dependent Variable INCOME

                                Mean
    Source           DF       Square   F Value   Pr > F
    Numerator         2    127493742      7.67   0.0005
    Denominator     424     16613992

    The REG Procedure
    Model: MODEL1

    Test 2 Results for Dependent Variable INCOME

                                Mean
    Source           DF       Square   F Value   Pr > F
    Numerator         1    683580130     41.14   <.0001
    Denominator     424     16613992

Observe that both tests reject the null hypothesis at the 5% level: the data do not support either hypothesis.

[14] In the course, if there is time we will study how to use SAS to perform "constrained least squares", the method of OLS when restrictions on the parameters are required.
[15] PROC AUTOREG will perform the same task: recall, however, that PROC AUTOREG will not report the classical F-test of model correctness.

3. The RESET Test of Model Specification Correctness

A. Theory

The RESET test of model specification correctness tests whether the hypothesized model is correct, against an alternative hypothesis that suggests a better model. Consider the following regression model with K = 4:

(5)  $Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + e_t$

Examples of null hypotheses and the resulting alternatives are

i.   $H_0: Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + e_t$
     $H_1: Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + e_t$
ii.  $H_0: Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + e_t$
     $H_1: Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + \beta_6 \hat{Y}_t^3 + e_t$

Other alternatives would simply add more power-functions of the predicted Y's.
The alternative hypothesis is based on the logic that if the original model is not adequate (i.e. poor performance based on t-tests, coefficient of determination, classical F-test), then a reasonable model improvement entails adding non-linear functions of the available data. To see this, notice that the alternative models include power-functions of the predicted Y's:

$\hat{Y}_t^2,\ \hat{Y}_t^3$

Now, recall that the predicted values are exactly

$\hat{Y}_t = \hat{\beta}_1 + \hat{\beta}_2 X_{2t} + \hat{\beta}_3 X_{3t} + \hat{\beta}_4 X_{4t}$

Thus, for example, $\hat{Y}_t^2$ will be a function of squares of the X's and "interaction" terms, like

$X_{2t} X_{3t},\quad X_{2t} X_{4t},\quad X_{3t} X_{4t}$

as well as functions of all the estimated parameters, which, in turn, are all random functions of the available data. In other words, the alternative model, for example

$Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + e_t$,

will now include the original explanatory variables, and simple non-linear functions of all the explanatory variables.

Why would we ever simply add power-functions of the explanatory variables? In general, although through F-tests and t-tests we can ascertain that some information (i.e. some explanatory variable) is not statistically relevant, we never know exactly how to build a better model. Removing an explanatory variable won't necessarily lead to a better model: recall that the R2 will drop in value as we remove variables! Indeed, even if the answer is simply "add more data; find better explanatory variables", we, of course, do not have more data, and if we could find better explanatory variables, we would already be using them. In other words, the sample of data we have is not going to improve magically. If the present model (5) is not performing well, we have little choice but to use the available data in a way different from the original linear specification. Power-functions are simply a convenient non-linear way to build "new" explanatory variables in a world of limited data.
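To make the point about squares and interaction terms concrete, the squared predicted value from model (5) can be expanded term by term. This is only an algebraic identity applied to the fitted values; the summation indices $j$ and $k$ are notation introduced here for compactness.

```latex
\[
\hat{Y}_t^{\,2}
= \Big( \hat\beta_1 + \sum_{j=2}^{4} \hat\beta_j X_{jt} \Big)^2
= \hat\beta_1^2
+ 2 \hat\beta_1 \sum_{j=2}^{4} \hat\beta_j X_{jt}
+ \sum_{j=2}^{4} \hat\beta_j^2 X_{jt}^2
+ 2 \sum_{2 \le j < k \le 4} \hat\beta_j \hat\beta_k X_{jt} X_{kt}
\]
```

The third sum contains the squares of the X's and the last sum contains the three interaction terms, so adding $\hat{Y}_t^2$ to the regression brings in all of them at the cost of only one extra parameter.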
If we reject the null hypothesis (if the F-statistic used for the RESET test is too large), we conclude that we have evidence that a better model would be the one specified in the alternative hypothesis.

B. SAS

Use PROC AUTOREG. For model (4), say, we write

    PROC AUTOREG DATA = d1;
        MODEL INCOME = ED AGE NUM_CHILD / RESET;
    RUN;

SAS does not know how many power-functions of the predicted values to include for the test, so it reports RESET test statistics for a variety of tests (based on using different power functions of the predicted values). The output is

    The SAS System          16:40 Monday, June 25, 2001  1

    The AUTOREG Procedure
    Dependent Variable    INCOME

    Ordinary Least Squares Estimates
    SSE                7044332479    DFE                     424
    MSE                  16613992    Root MSE               4076
    SBC                8350.65254    AIC              8334.41605
    Regress R-Square       0.1084    Total R-Square       0.1084
    Durbin-Watson          1.9012

    Ramsey's RESET Test
    Power      RESET    Pr > F
        2     7.4417    0.0066     (uses Y-hat^2)
        3     3.8428    0.0222     (uses Y-hat^2, Y-hat^3)
        4     2.7202    0.0441     (uses Y-hat^2, Y-hat^3, Y-hat^4)

                                 Standard              Approx
    Variable     DF   Estimate      Error   t Value   Pr > |t|
    Intercept     1      -2612       1618     -1.61     0.1071
    ED            1   573.6493    87.0455      6.59     <.0001
    AGE           1    18.6558    27.1506      0.69     0.4924
    NUM_CHILD     1      -1708   538.6768     -3.17     0.0016

Notice that SAS prints RESET statistics for tests that include only the power of 2, the powers of 2 and 3, and the powers of 2, 3 and 4. As usual, we reject if the associated p-value is less than .05. In this case, we reject in all three tests: there is substantial evidence that the original specification in (4) is not accurate, and that including power-functions of the predicted values will improve the performance of the model. Literally, any of the models

$Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + e_t$
$Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + \beta_6 \hat{Y}_t^3 + e_t$
$Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + \beta_6 \hat{Y}_t^3 + \beta_7 \hat{Y}_t^4 + e_t$

will perform better than the original specification in (4). Until we find a better way to analyze the model, we should estimate one of the above augmented model specifications for the sake of more statistically accurate forecasts.

C. Using the RESET Result to Build a Better Model

In order to estimate the augmented model suggested by the alternative hypothesis, we need to save the predicted values to a new dataset by using the OUTPUT OUT command. SAS will automatically place in the new dataset all explanatory variables and the Y variable. For example, to perform the RESET test and save the predicted values, use

    PROC AUTOREG DATA = d1;
        MODEL INCOME = ED AGE NUM_CHILD / RESET;
        OUTPUT OUT = reg_out P = y_hat;
    RUN;

SAS will create a new dataset called "reg_out", and place in it INCOME, ED, AGE, and NUM_CHILD as well as the predicted values of INCOME. Notice the command: we use "P =" to signify that we want the predicted values to be written to the dataset; we then create a variable name for the predicted values. Here, I simply called them "y_hat".

We need, however, power-functions of the predicted values. For this task, we can create yet another dataset, place everything in "reg_out" into the new dataset, and create the powers.
Consider the following code:

    /* Step 1: original model; save the predicted values as y_hat */
    PROC AUTOREG DATA = d1;
        MODEL INCOME = ED AGE NUM_CHILD / RESET;
        OUTPUT OUT = reg_out P = y_hat;
    RUN;

    /* Step 2: copy reg_out and add powers of the predicted values */
    DATA reg_out2;
        SET reg_out;    /* SET places "reg_out" into this dataset */
        y_hat_2 = y_hat**2;
        y_hat_3 = y_hat**3;
        y_hat_4 = y_hat**4;
    RUN;

    /* Step 3: augmented model, with a new RESET test */
    PROC AUTOREG DATA = reg_out2;
        MODEL INCOME = ED AGE NUM_CHILD y_hat_2 y_hat_3 y_hat_4 / RESET;
    RUN;

The SAS output for the second regression, augmented with the power functions of the predicted values, is

    The SAS System          16:40 Monday, June 25, 2001  7

    The AUTOREG Procedure
    Dependent Variable    income

    Ordinary Least Squares Estimates
    SSE                6910381237    DFE                     421
    MSE                  16414207    Root MSE               4051
    SBC                8360.61292    AIC              8332.19905
    Regress R-Square       0.1254    Total R-Square       0.1254
    Durbin-Watson          1.9005

    Ramsey's RESET Test
    Power      RESET    Pr > F
        2     1.7439    0.1874
        3     0.8737    0.4181
        4     0.6407    0.5892

                                  Standard              Approx
    Variable     DF    Estimate      Error   t Value   Pr > |t|
    Intercept     1       13052      16875      0.77     0.4397
    ED            1       -1597       2765     -0.58     0.5639
    AGE           1    -58.5699    94.1374     -0.62     0.5342
    NUM_CHILD     1        4301       8255      0.52     0.6027
    y_hat_2       1    0.001178   0.001857      0.63     0.5262
    y_hat_3       1   -1.864E-7  2.9402E-7     -0.63     0.5265
    y_hat_4       1   1.126E-11  1.618E-11      0.70     0.4868

Comments:

1. Now that we have included power-functions of the predicted values from the original estimated model, the RESET tests all fail to reject the hypothesis that the specification

(6)  $Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + \beta_6 \hat{Y}_t^3 + \beta_7 \hat{Y}_t^4 + e_t$

is correct: in other words, adding the power-functions seems to have created a regression model that cannot be improved yet again in the same way.

2. However, notice how all the estimated slope signs have changed, the sizes of the estimated parameters are substantially different (education has a negative impact?!), and all t-tests fail to reject the hypotheses that the true slopes are zero. Somewhat contradictorily, the classical F-test rejects the hypothesis that the entire linear model is irrelevant (I do not present the F-test above; however, the p-value < .0001).
In other words, the entire model works well, but the actual individual parameters seem to be very volatile, and therefore not trustworthy.

3. This confusing phenomenon is often due to excessive correlation (linear dependence) between the regressors (see footnote 16), which we refer to as "multi-collinearity". In model (6), the augmented power functions of the predicted values will themselves be functions of the X's, and therefore all the data is likely to be highly correlated in the new regression model (6). That the RESET test can produce such a poor result is one reason why econometricians over the past 20 years have attempted to produce better model specification tests.

[16] Recall, for multiple regression, we assume the regressors are not exact linear functions of each other. If they were, SAS could not perform least squares estimation. However, when the explanatory variables are highly but not perfectly correlated, SAS can still perform OLS, but the results may be difficult to interpret, or simply nonsensical.
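One hedged way to look for the multicollinearity described in Comment 3 is to inspect the correlations among the regressors of model (6). The sketch below assumes the dataset "reg_out2" from the code above still exists in the session. The VIF option on the PROC REG MODEL statement is not covered in these notes and is shown only as an optional extra.

```sas
/* Pairwise correlations among the regressors used in model (6) */
proc corr data = reg_out2;
    var ED AGE NUM_CHILD y_hat_2 y_hat_3 y_hat_4;
run;

/* Optional: variance inflation factors for the same regressors */
proc reg data = reg_out2;
    model INCOME = ED AGE NUM_CHILD y_hat_2 y_hat_3 y_hat_4 / vif;
run;
quit;
```

Correlations close to 1 in absolute value, or very large variance inflation factors, would be consistent with the unstable coefficient estimates seen in the augmented regression.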