# SAS- Statistical Programming Language

Ignacio Correas
NOTES FROM: Jonathan Hill, Dept. of Economics, University of California-San Diego

I would like to thank my friend Dr. Jonathan Hill for letting me use his excellent SAS notes and exercises.
Jonathan's caliber as an econometrician is further reflected in the clarity and ease of exposition with
which he presents something as didactically complex as teaching econometrics with a computer package.

CONTENTS

I.    What is SAS? Getting Around, Saving Files, Printing
      1. Introduction
      2. Booting-up SAS
      3. The SAS Environment: Getting Around
      4. Opening Files, Saving Files, Printing Output
II.   Basic SAS Programming Elements: Data, Proc's, Macros, IML
      1. Data Step
      2. Proc Step
III.  Data: Entering and Examining Economic Information
      1. Internal Data Entry: DATALINES
      2. External Data Entry: INFILE, FILENAME, OBS
      3. Creating New Variables
         3.1 Arithmetic Operations
         3.2 Logical Operations: IF, THEN, ELSE, AND, OR
      4. Creating Datasets from Existing Datasets: MERGE, SET
      5. Describing Data: Simple Data Inspection: PROC PRINT, PROC SORT, PROC CONTENTS
      6. Describing Data: Simple Data Analysis: PROC MEANS, PROC CORR, PROC UNIVARIATE
IV.   SAS and Econometric Analysis I: Basic Regression with PROC REG
      1. PROC REG
      2. PROC REG: Commands and Options: MODEL, BY, TEST (F-Tests)
      3. Examples
V.    SAS and Econometric Analysis II: Multiple Regression with PROC AUTOREG
      1. PROC AUTOREG
      2. PROC AUTOREG: Commands and Options [MODEL, BY, …]
      3. The Jarque-Bera Test of Normality: NORMAL
VI.   SAS and Econometric Analysis III: Multiple Regression and Inference
      1. Classical F-test of Model Correctness: PROC REG
      2. General F-test of Multiple Restrictions: TEST
      3. The RESET Test of Model Specification Correctness
      4. Tests of Heteroskedasticity

I.         What is SAS? Getting Around, Saving Files, Printing

1.         INTRODUCTION
The statistical software we will employ in this course is SAS (Statistical Analysis System™),
a language which is used world-wide in economics, sociology, political science, and biology, and in major
universities, governments and private research organizations. The SAS language, a multi-purpose statistical
package, is particularly useful for large data-set manipulations, data-set creation, and fast, simple statistical
analysis. Other software available that would be appropriate for more advanced and refined statistical
analysis includes LIMDEP (Limited Dependent Variables), GAUSS, MATLAB, and FORTRAN (Formula
Translation).

2.         BOOTING-UP SAS
In Windows, click-on START, click-on PROGRAMS, then click-on SAS. When the software loads,
depending on which version you are using, the screen should be split into two parts (three parts in
Version 8). You can fill the entire screen with the software by clicking-on the maximize box in the upper
right-hand corner.

3.         THE SAS ENVIRONMENT: Getting Around
SAS incorporates three (3) primary windows for viewing program text, output and error messages.
     Program Editor Window
The Program Editor allows us to create SAS programs directly. The editor screen is
simply titled "Editor", and you can place the cursor in the editor by clicking anywhere on the
editor screen, or by clicking-on Window, then Editor. Also, if your version of SAS is recent, on
the bottom of the screen will be three bars denoting Editor, Output and Log: click-on the
appropriate bar to go to the specific screen.
     Output Window
Displays program output, including the printing of data-sets, statistical output like sample
statistics (mean, var), and econometric output (e.g. regression output). The Output window can be
cleared, and should be cleared before you run a program: be sure the Output window is up, then
click-on Edit, then Clear All [1].

     Log Window

[1] The reason for clearing the output and log screens is simple: after you run a program twice, the output and error messages will
simply be stacked, with the first program-run on top and the second program-run on the bottom. It can be very confusing to decipher
which comments belong to which program run. Always clear before each program run.

This window displays SAS's comments while it translates your program text. To view
this log window, click-on Window then Log. If your program is error free, messages will be in
blue; if you have errors which SAS believes it can override, ignore, or correct, a message will
appear in green; if an error is terminal such that the program crashes, error messages in red will
appear. As with the output window, always clear the log window before you run a program: be
sure the log-window is up, click-on Edit, then click-on Clear All.

Example (type the following code into the editor window, and follow my instructions, below)

______________________________________________
DATA example;
INPUT age gender $ income;
DATALINES;
54 m 45000
19 f 37500
37 f 67000
RUN;
PROC PRINT DATA = example;
RUN;
PROC MEANS DATA = example;
RUN;
______________________________________________

This program creates a simple dataset of three people and their respective ages, gender (male = m
and female = f) and income in dollar units. The program then prints out the entire dataset, and calculates
sample statistics including the mean, standard deviation, minimum and maximum [2] of the numerical data.
Once you type the program code in the editor, click-on the icon of the running person at the top
right of the screen (this runs the program), or simply click-on RUN, then SUBMIT.
SAS will automatically present the output in the output window. The output should look like this:
The SAS System                              Monday, May 7, 2001              6

Obs         age          gender     income

1            54              m      45000
2            19              f      37500
3            37              f      67000

The SAS System                        13:18 Monday, May 7, 2001              7

The MEANS Procedure

[2] All of these commands are detailed in subsequent sections, below.

Variable      N            Mean                Std Dev                   Minimum                Maximum

age           3            36.6666667          17.5023808                19.0000000             54.0000000
income        3            49833.33            15332.43                  37500.00               67000.00

Now, view the log window to see SAS's comments: we do not have any errors, thus SAS displays only
blue messages, and black is used for the code you typed in. The log window should look like this:

44   DATA example;
45   input age gender \$ income;
46   datalines;

NOTE: The data set WORK.EXAMPLE has 3 observations and 3 variables.
NOTE: DATA statement used:
real time           0.00 seconds

50   run;
51   proc print data = example;
52   run;

NOTE: There were 3 observations read from the data set WORK.EXAMPLE.
NOTE: PROCEDURE PRINT used:
real time           0.11 seconds

53   proc means data = example;
54   run;

NOTE: There were 3 observations read from the data set WORK.EXAMPLE.
NOTE: PROCEDURE MEANS used:
real time           0.04 seconds

Be sure to clear both log and output windows.

4.       OPENING FILES, SAVING FILES, PRINTING OUTPUT
If SAS is not presently loaded, the fastest way to load a program is to boot-up SAS, click-
on the editor window or click-on Window then Edit, then click-on File and Open. In this
class, your files will most likely be on a floppy-disk: once you click-on Open, scroll
down the "look-in" box until you find the floppy "A"-drive, and proceed. All SAS files
have the file type “.sas”. Our data files will be of the file type “.dat”.

Saving Program Code
Recall that programs are coded in the edit window. Once you type in a program (you
should save any text roughly once every 5 minutes!), click-on File, Save, then scroll-
down the "save-in" box until you reach the drive that suits your needs (e.g. the A-drive
for floppy disks). Be sure to use file names that are reasonably short and intuitive (for
example, do not use "file1.sas"). All SAS programs are automatically saved as ".sas" type
files.

Warning: be sure you are actually in the EDITOR screen when you save: otherwise,
SAS will simply save whatever contents are on the screen, be it output or error messages.

Saving Output
The easiest way to summarize your empirical project results is to save the SAS output to
a file and load the file into EXCEL [3], or WORD. To save SAS output, run your program,
be sure you are presently in the output window after the program finishes running (if you
have any doubt, click-on Window, then Output), then click-on File, Save, scroll-down the
"save-in" box, find the appropriate drive, and give your output a useful name. For
example, if your SAS program is named "income.sas", then title the output file
"income_out".

Printing Output
Once you run a program, simply click-on File, Print, or just click-on the printer icon
located at the top of the screen, in the middle.

[3] See the section below on using EXCEL to create various types of graphs based on SAS output.

II.          Basic SAS Programming Elements: Data, Proc's, Macros, IML

Any SAS program incorporates steps for entering data and steps for analyzing data. This short
section will briefly discuss each step without any details on how actually to code a program. The
subsequent section presents specific information on how to enter and look at data.

1.       DATA STEP
Any SAS program must employ data from some source. In this class, we will usually enter data
from a floppy-disk; however, you can save data to a hard-drive (Drive "C", for instance) and enter it from
there. Data statements are always of the form [4]

DATA [dataset name];
…….
RUN;

Each data step requires the command "DATA", a dataset name, code which actually enters the data, and the
command "RUN". Dataset names can incorporate any alpha-numeric characters.
For example:

DATA d1;
INPUT x y;
DATALINES;
1 4
10 -8
RUN;

This code dictates that a dataset named "d1" is created with two variables, named "x" and "y", and two
observations: x = (1, 10) and y = (4, -8). We can build as many datasets as we like, as well as merge
datasets: see the subsequent section.

2.       PROC STEP
Usually SAS programmers use "proc" statements for data analysis. Other means for analyzing data
will be briefly mentioned below: in this class, we will always use proc's. The term "proc" is short for
"procedure", which denotes any built-in array of commands. For example, the MEANS procedure in SAS
will automatically calculate data means, variances, etc., while the REG procedure performs basic regression
analysis. You, yourself, do not need to program in SAS how a sample mean is calculated: we can do that,
however, if we like by using the built-in sub-language called IML (which we will not use in this class). SAS
already has all the details programmed within itself. SAS proc's are used to print data, find sample statistics,
perform econometric analysis, create graphs, charts, etc.
Proc's are coded much like DATA statements. For any proc, we need to specify which data is to be
analyzed. For example, in order to print the entire contents of the dataset created above, we code:

DATA d1;
INPUT x y;
DATALINES;
1 4
10 -8
RUN;

PROC PRINT data = d1;
RUN;
The statement "data = d1" dictates which dataset is to be printed. As with the use of datasets, we can use as
many proc's as we like: the following code creates two datasets, prints both, and displays sample statistics
of one dataset:

[4] I will use brackets "[ ]" to denote information that the programmer enters: you never actually type these brackets in SAS code.

DATA d1;
INPUT x y;
DATALINES;
1 4
10 -8
RUN;

DATA d2;
INPUT w z;
DATALINES;
10 -100
9 0
RUN;

PROC PRINT data = d1;
RUN;

PROC PRINT data = d2;
RUN;

PROC MEANS data = d1;
RUN;

____________________________________

MACROS and SAS-IML
Although SAS's power is derived from its ability to manage and create large datasets, as well as its
ability to analyze any dataset easily by incorporating any one of its several hundred built-in procedures,
there are other means for programming that require substantial effort on the part of the programmer.

SAS-IML
The SAS language has built into it a sub-language for matrix-oriented mathematics. This
software is called the Interactive Matrix Language [IML] and can be used to code
substantially sophisticated econometric commands. SAS's built-in procedures are very
useful, however they are, ultimately, of limited use: recent advances in
economic/econometric/statistical theory are NOT programmed into SAS, thus if you
require a means of data analysis that lies outside of the range of SAS's present abilities,
then you must program the procedure yourself. IML allows the programmer literally to
create his/her own procedures that can be called from any SAS program. The IML
language requires its own syntax, employs matrix algebra and therefore requires extra
time to learn and a background in higher mathematics.
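Purely for illustration (IML is beyond the scope of this class), here is a minimal sketch of what IML code looks like. The dataset name "d1" and the choice of an OLS computation are my own for this example, not part of these notes:

```sas
/* Minimal IML sketch (not required for this class): compute OLS
   coefficients by matrix algebra from a dataset d1 with variables x, y. */
PROC IML;
   USE d1;                            /* open an existing SAS dataset  */
   READ ALL VAR {x} INTO xcol;       /* read columns into matrices    */
   READ ALL VAR {y} INTO ycol;
   n = NROW(xcol);                   /* number of observations        */
   xmat = J(n,1,1) || xcol;          /* design matrix: intercept || x */
   b = INV(xmat`*xmat) * xmat`*ycol; /* the OLS formula (X'X)^(-1)X'y */
   PRINT b;                          /* display the coefficients      */
QUIT;
```

The point is only that IML works with whole matrices at a time, which is why it demands some background in matrix algebra.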

SAS MACROS
SAS's IML is literally a built-in sub-language useful for creating your own hand-written
econometric analysis. A "macro", by contrast, is a routine that is programmed into SAS
along with standard DATA and PROC steps. A "macro" requires its own syntax, and can
be used to create routines that perform sophisticated tasks. Moreover, a macro can be
written simply to group together standard SAS commands: once this kind of macro is
written, the programmer simply needs to refer to it by name, and all of the subsequent
SAS commands associated with the macro name are performed.
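As an illustration only, a macro of the simple grouping kind might look like this; the macro name "inspect" and its dataset argument are hypothetical choices of mine:

```sas
/* Sketch of a macro that groups standard commands: print a dataset
   and compute its sample statistics in one named routine. */
%MACRO inspect(dset);
   PROC PRINT DATA = &dset;   /* &dset expands to the argument */
   RUN;
   PROC MEANS DATA = &dset;
   RUN;
%MEND inspect;

/* Once defined, the programmer simply refers to it by name: */
%inspect(income1)
```

Every time `%inspect` is invoked, both proc steps run against the named dataset.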

III.     Data: Entering and Examining Economic Information

In this section, we will learn the basic techniques for entering data directly into SAS: the two
primary techniques entail writing the data directly into the program, or loading data into SAS from an
external source (e.g. a floppy disk). Additionally, we will learn several procedures for performing basic
statistical analysis of our data.
For this, and all subsequent documents, to familiarize yourself with new SAS commands and
programming techniques, be sure to boot-up SAS and practice the examples I give below. Always feel free
to experiment.
NOTE: Because SAS is a Windows product, you can simply copy examples of code from this and
any other document and paste the text directly into SAS. In fact, many of the examples below were written in
SAS and copy/pasted into WORD! Do as we all do: take the code wherever you can find it, study it, and learn
to re-write it yourself.

1.       Internal Data Entry: DATALINES
SAS allows the programmer to enter any data directly. For large data sets, this is
impractical, however, there will be times when the programmer wants to have the data physically
present in the program. Recall, we enter data in a DATA STEP. For direct data entry, we use the
code:
DATA [dataset name];
INPUT var1 var2 [more variable names] varN;
DATALINES;
[data would be typed here]
RUN;

Notice, there is not a semi-colon ";" after the last line of data; however, we use a semi-colon after
every line of code. Variable names can use any alpha-numerical symbols, but they can be no
more than 8 characters long in SAS Version 6.0. We do not put commas between variable names. The
INPUT command dictates variable names and the order in which the data will be entered.
DATALINES dictates that actual data follows. For example, if we want to enter income and ages
for 5 people, we write:

DATA income1;
INPUT income age;
DATALINES;
10000 50
75000 43
23000 67
10000 19
100000 56
RUN;

SAS understands that the data is read as "income age", and only requires one space between data
entries: you can, however, place as many spaces between data entries as you like. Also, you do not
need to indent code the way I do, however it is much easier to read: you will need my help from
time-to-time, so you should write your code in a manner that is easy to understand.
SAS differentiates between numerical and character variables. For data that is non-
numerical, place the dollar-sign "$" after (to the right of) the variable name, separated by one space. For
example, suppose that the above dataset "income1" includes gender information in the form of
"M" for male and "F" for female. We can write:

DATA income1;
INPUT income age sex $;
DATALINES;
10000 50 m
75000 43 m
23000 67 f
10000 19 f
100000 56 m
RUN;

We now have a dataset named "income1" with five observations (5 people), and income, age and
gender information.

Example: We want to create a dataset with monthly GNP (in \$trillions) information, however not
all months are present in our sample. We have information for 4 months.

DATA gnp_mon;
INPUT gnp month $;
DATALINES;
2 jan
2.01 march
1.99 july
2.00 dec
RUN;

Thus, we have data for January, March, July and December.

2.   External Data Entry: INFILE, FILENAME, OBS
By far the most useful approach to data entry is the method of entering data directly from
a drive, be it hard ("C") or floppy ("A"). We use the INFILE command for such basic entry:

DATA [dataset name];
INFILE 'drive:\folder\folder\…\filename.type';
INPUT var1 var2 … varn;
RUN;

The INFILE command directs SAS to some drive and sequence of folders. The file directory and
name require single quotation marks. The file type may be .dat or .txt, depending on the files I give
you, and ultimately depending on how you yourself make your data files. I will comment later on
the nature of .dat and .txt files. For example, if our income data exists on a floppy in a file named
"income_data.dat", we can write:

DATA income1;
INFILE 'a:\income_data.dat';
INPUT income age sex $;
RUN;

If you plan on entering data from the same drive and file over and over again, you can
simply re-write the file-name as follows:

DATA [dataset name];
FILENAME [file name] 'drive:\folder\folder\…\filename.type';
INFILE [file name];
INPUT var1 var2 … varn;
RUN;

Notice, only a space is placed between the new file name and the actual directory and drive
specification. For example,

DATA income1;
FILENAME inc_file 'a:\income_data.dat';
INFILE inc_file;
INPUT income age sex $;
RUN;

Thus, SAS understands that "inc_file" refers to the location "a:\income_data.dat". You can access
the same simple file name in subsequent datasets. For example

DATA income1;
FILENAME inc_file 'a:\income_data.dat';
INFILE inc_file;
INPUT income age sex $;
RUN;
DATA income2;
INFILE inc_file;
INPUT income age sex $;
RUN;

This simple program re-names the file for SAS's use, reads in the data, and re-reads the data in a
second data step: the second data step does not require the file location specification (i.e.
a:\income_data.dat) because SAS interprets “inc_file” as that location.
In many cases, we will not want to use an entire dataset: many datasets contain more than
50000 observations and more than 200 variables. Simply in order to maintain a program during
the coding development stage, and to run the program in order to find and remove errors, we may
want to use only a few observations, and use the entire dataset only when all errors ("bugs") have
been corrected.
A simple way to control how many observations are read into a dataset is to use the
OBS option. Suppose the file a:\income_data.dat has 10,000 observations, but we want only
the first 100. Then, we write

DATA income1;
INFILE 'a:\income_data.dat' OBS = 100;
INPUT income age sex $;
RUN;

3.   Creating New Variables

3.1      Arithmetic Operations
During the data entry stage of any data step, we can create new variables using basic
arithmetic and logic commands. For example:

DATA income1;
FILENAME inc_file 'a:\income_data.dat';
INFILE inc_file;
INPUT income age sex $;
income_sq = income*income;
RUN;

The code "income_sq = income*income" creates a new variable named "income_sq" which
equals income squared (i.e. income_sq = income²). SAS understands that the operation is to be
performed for all data observations. Mathematical symbols include

* times                           "log" natural log
** to the power of                "exp" the exponential function (i.e. exp(x) = eˣ, e ≈ 2.7183)
- minus
+ plus
/ divide

Thus, we could have written "income_sq = income**2".
For example, if we read in variables x and y, and we want ln(x), x⁴, x - y and x/y as new
variables, we can write

DATA d1;
INPUT x y;
x_4 = x**4;
ln_x = log(x);
xmy = x - y;
xdy = x/y;
DATALINES;
10000 50
75000 43
23000 67
RUN;

Note that SAS will now understand that the dataset "d1" has 6 variables: x, y, ln_x, x_4, xmy and
xdy.

3.2        Logical Operations
Many variables should only be constructed when a condition is satisfied, or
perhaps a variable's value depends not on specific values of other variables (e.g. ln_x =
log(x)), rather on value ranges. For such derivations, we use IF, THEN, ELSE logical
operations with connectors AND and OR.
Consider, for example, that we have a variable "ed" that denotes the number of
years of education. In the U.S., if ed ≥ 12, we would understand that the individual
graduated from high school. Likewise, if ed ≥ 16, we might conclude that the individual
has a basic degree from a university. In econometric analysis, we often want to know
both what impact the number of years of education has on income, as well as whether
graduating from high school has an impact on income [5]. For such information, we will
want to create a "dummy", or "binary", variable [6] that equals 1 if the individual graduated
from high school, and 0 otherwise: all we want from these variables is the simple
information of whether they graduated or not.
For example, suppose we read in data on income, education and age, and we
want to create variables that represent whether that individual has a high school or
college education or not:

[5] After all, 11 years is not much less than 12 years (and 11.75 years does not mean the individual graduated from high school!), but a
high school diploma will signal to many employers a certain skill level in the laborer, a certain degree of dedication that people who
quit high school early may not have.
[6] We will study the use and implications of dummy variables throughout the semester.

DATA income1;
INPUT income ed age;

IF ed GE 12 THEN hs = 1;
ELSE IF ed LT 12 THEN hs = 0;
IF ed GE 16 THEN college = 1;
ELSE IF ed LT 16 THEN college = 0;
DATALINES;
10000 15 45
24000 18 54
31000 9 69
RUN;

The code literally states that if the education level of an individual is greater than or
equal to [GE] 12, then a new variable, named “hs”, is set equal to 1. However [ELSE], if
years of education is less-than [LT] 12, then the variable “hs” is set to 0. Likewise, if
education is greater than or equal to [GE] 16, a new variable, named “college” is set
equal to 1. However [ELSE], if the number of years of education is less than [LT] 16, the
“college” is set to zero. Clearly, the first person has a high school education but not a
college education, so hs = 1 and college = 0 for the first individual. If we run the above
program and print the dataset, then the output looks like this:

The SAS System              21:10 Wednesday, May 9, 2000
Obs     income      ed    age    hs       college

1       10000      15        45     1       0
2       24000      18        54     1       1
3       31000       9        69     0       0

As usual, the dataset has 5 variables: the three original variables and the two new dummy
variables.
The logical operators available are as follows:

Operator: Definition                       Symbol
EQ: equal to                               =
NE: not equal to                           ^=
GT: greater than                           >
GE: greater than or equal to               >=
LT: less than                              <
LE: less than or equal to                  <=
NOT: not                                   ^
AND: logical and                           &
OR: logical or                             |
Consider a more complicated piece of information. Suppose we want a variable
for people at least 50 years old who have at least 14 years of education (i.e. they are high
school graduates from before the 1980's with at least some college education). We can
use the AND and OR operators as follows:

DATA income1;

INPUT income ed age;

IF ed GE 14 AND age GE 50 THEN coll_50 = 1;
ELSE IF ed LT 14 OR age LT 50 THEN coll_50 = 0;
DATALINES;
10000 15 45
24000 18 54
31000 9 69
RUN;

Thus, only if a person is at least 50 years old and [AND] has at least 14 years of
education will the new variable "coll_50" be set to 1. However, if they are too young (age
< 50) or [OR] if they have too little education, then they do not satisfy our compound
criteria, and the new variable "coll_50" is set to 0. If we print the dataset, we find

The SAS System            21:22 Wednesday, May 9, 2000
Obs     income      ed    age    coll_50

1       10000      15     45        0
2       24000      18     54        1
3       31000       9     69        0

Only the second individual satisfies both criteria: she is both at least 50 years old AND
has at least 14 years of education.

4.   Creating Datasets from Existing Datasets: MERGE, SET

Often, we will want to use the information in one dataset in order to quickly build another
dataset. For example, we may read in information for 1000 people concerning wages, hours
worked, and taxes paid, and read in from another source information concerning the same 1000
people concerning basic demographic information: education, marital status, age, gender, and
number of children. Or, we may find in one data source on the web information on a country's
GNP, interest rates, unemployment rate and inflation rate for the period 1970-1979, and from
another data source the same information for the period 1980-1989. In order to use all of the data
at once during the stage of econometric analysis, we will want to build one dataset containing all
relevant information (all variables concerning one person, or all time periods concerning several
economic quantifiers).
Two simple techniques utilized for such dataset blending are the MERGE and SET
commands employed during any data step.

4.1    MERGE
Consider the following code which builds two datasets containing, variously, economic
and demographic data, about the same group of people:

DATA income1;
INPUT income taxes hours;

DATALINES;
10000 100 54
75000 23000 38
23000 3000 40
RUN;
DATA demog1;
INPUT age gender;
/* gender = 1 if male, gender = 0 if female */
DATALINES;
27 1
64 0
43 0
RUN;

Note that any text between the markers /* */ is treated as a comment, and ignored by SAS. The
variable gender is simply a dummy variable representing male if the value is 1 and female if the
value is 0. To merge these datasets, we code a third data step as follows:

DATA inc_dem;
MERGE income1 demog1;
RUN;

Now, the new dataset "inc_dem" contains 5 variables: income, taxes, hours, age and gender. If we
print the dataset, we observe

The SAS System            10:59 Thursday, May 10, 2000
Obs    income      taxes    hours    age     gender

1       10000       100           54   27       1
2       75000     23000           38   64       0
3       23000      3000           40   43       0

SAS literally places the two datasets side-by-side.

WARNING: your datasets must have their observations arranged in the same order, to ensure
that information for the same individual is merged.
WARNING: in order to merge datasets with different information concerning the same people,
no variable names can be shared between the datasets.
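Going slightly beyond these notes, a common safeguard against the ordering problem is to match-merge BY an identifier variable. This sketch assumes each dataset carries a shared person identifier, which I call "id" for illustration:

```sas
/* Safeguard sketch (beyond these notes): if both datasets contain an
   identifier "id", sort each BY id, then merge BY id so that rows for
   the same person are paired even if the original orders differ. */
PROC SORT DATA = income1;
   BY id;
RUN;
PROC SORT DATA = demog1;
   BY id;
RUN;

DATA inc_dem;
   MERGE income1 demog1;
   BY id;   /* match observations on the identifier, not on position */
RUN;
```

Without the BY statement, SAS pairs observations purely by their position in each dataset, which is exactly what the warning above is about.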

4.2      SET
The command SET is used to stack (i.e. concatenate) different datasets which have the
same variable types. This is particularly useful for merging different datasets with time-series
information. Consider the example given above: suppose we find in one data source on the
web information on a country's GNP and unemployment rate for the period 1970-1974, and from
another data source the same information for the period 1975-1979. We will want to merge the
data; however, we do not want to perform a side-by-side merge in the manner performed
above. We want the data to be stacked vertically, with the years 1970-1974 on top, and the years
1975-1979 on the bottom:

DATA data_70;
/* contains data for the years 1970-1974 */
/* GNP is in billions; unemployment rate is a percent: e.g. 6 denotes 6% = .06 */

INPUT gnp ue_rate;
DATALINES;
3000 4
3100 3.9
3120 3.92
3110 4.1
2900 4.3
RUN;
DATA data_75;
/* contains data for the years 1975-1979 */
INPUT gnp ue_rate;
DATALINES;
2910 4.2
3000 4.1
3000 4
3100 3.7
3300 3.2
RUN;

DATA data_70_75;
SET data_70 data_75;
RUN;

Notice that we have created a third dataset named "data_70_75" containing all the information
from the years 1970-1979. The SET command will automatically concatenate (stack) the data with
the dataset stated first (i.e. data_70) on top, and the second dataset on the bottom. If we print the
dataset, we observe:

The SAS System              10:59 Thursday, May 10, 2000

Obs      gnp     ue_rate

1    3000       4.00
2    3100       3.90
3    3120       3.92
4    3110       4.10
5    2900       4.30
6    2910       4.20
7    3000       4.10
8    3000       4.00
9    3100       3.70
10    3300       3.20

Thus, all of the relevant data was stacked with 1970 on top and 1979 on the bottom.

5.   Describing Data: Simple Data Inspection
In this section, we will learn the following procedures for basic visual inspection of our
data:
PRINT
SORT
CONTENTS

5.1      PROC PRINT
This procedure is used to print entire or partial datasets. Consider the examples:

DATA income1;
INPUT income taxes hours;

DATALINES;
10000 100 54
75000 23000 38
23000 3000 40
RUN;

PROC PRINT DATA = income1;
RUN;

Notice that we must specify which dataset is to be printed. Unless we state otherwise, SAS will
print the entire set. The output window will contain:

The SAS System              10:59 Thursday, May 10, 2000

Obs     income        taxes     hours

1       10000          100      54
2       75000        23000      38
3       23000         3000      40

Consider delineating specific variables to be printed.

DATA income1;
INPUT income taxes hours;
DATALINES;
10000 100 54
75000 23000 38
23000 3000 40
RUN;

PROC PRINT DATA = income1;
VAR income;
RUN;

Here, we specify that we want only the variable [VAR] "income" to be printed. The output
window contains:

The SAS System              10:59 Thursday, May 10, 2000

Obs      income

1       10000
2       75000
3       23000

Finally, consider printing several variables, but not all that exist in the dataset:

DATA income1;
INPUT income taxes hours;
DATALINES;
10000 100 54
75000 23000 38
23000 3000 40
RUN;
PROC PRINT data = income1;

VAR taxes hours;
RUN;

We can delineate as many or as few variables as we like: as with other SAS command structures,
we do not use commas between the variable names. The output window contains:

The SAS System                     10:59 Thursday, May 10, 2000

Obs      taxes        hours

1         100             54
2       23000             38
3        3000             40

5.2      PROC SORT

Sorting data is intuitive and simple. Consider sorting the above dataset "income1"
according to income (i.e. we want to sort all individuals and all variables with individuals who
have the smallest incomes at the "top" of the dataset, and individuals with the largest incomes at
the "bottom" of the dataset). We write the following code:

DATA income1;
INPUT income taxes hours;
DATALINES;
10000 100 54
75000 23000 38
23000 3000 40
RUN;
PROC SORT DATA = income1;
BY income;
RUN;
PROC PRINT DATA = income1;
RUN;

The syntax here is the same as with PROC PRINT: we must tell SAS which dataset is to be
sorted. Moreover, whenever we sort data, the sort must be according to, or BY, some criterion.
The output window contains:
The SAS System                     13:32 Thursday, May 10, 2001   1

Obs        income        taxes        hours

1          10000          100         54
2          23000         3000         40
3          75000        23000         38

Note that the dataset is now permanently changed. Whenever you refer to this dataset, SAS will
interpret it as sorted according to income.
We can use the DESCENDING command to dictate that the data is to be sorted from the
highest value of the BY variable to the lowest value:

DATA income1;

INPUT income taxes hours;
DATALINES;
10000 100 54
75000 23000 38
23000 3000 40
RUN;

PROC SORT DATA = income1;
BY DESCENDING income;
RUN;
PROC PRINT DATA = income1;
RUN;

The output window contains

The SAS System             13:32 Thursday, May 10, 2000

Obs     income        taxes   hours

1       75000        23000    38
2       23000         3000    40
3       10000          100    54

We can also sort according to several criteria. For example, suppose we have data on
stock traders' names and the net number of stock shares traded (positive values denote net
purchases; negative values denote net sales). Our dataset contains the information:

First NAME         Last NAME          STOCK SHARES
Frank              Smith              10
Betty              Jones              5
Betty              Jones              10
Frank              Smith              100
Frank              Albert             40
Betty              Jones              50
Frank              Albert             20
Betty              Jones              45

We want to read in this data and sort by last name, then by first name, and finally by the number
of shares traded. By last name, Albert comes first, with stock shares traded in volumes of 40 and
20: Albert will come first, sorted with 20 then 40 shares traded. We code as follows:

DATA stocks;
INPUT fname $ lname $ shares;
DATALINES;
Frank Smith 10
Betty Jones 5
Betty Jones 10
Frank Smith 100
Frank Albert 40
Betty Jones 50
Frank Albert 20
Betty Jones 45
RUN;

PROC SORT DATA = stocks;
BY lname fname shares;
RUN;
PROC PRINT DATA = stocks;
RUN;

The output window displays:

The SAS System                 13:32 Thursday, May 10, 2001   19

Obs       fname    lname        shares

1       Frank    Albert          20
2       Frank    Albert          40
3       Betty    Jones            5
4       Betty    Jones           10
5       Betty    Jones           45
6       Betty    Jones           50
7       Frank    Smith           10
8       Frank    Smith          100

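The multi-key BY sort behaves like an ordinary lexicographic sort on the tuple (lname, fname, shares). A short Python sketch (illustrative only, not part of the original notes) reproduces the ordering shown above:

```python
# Sort the trader records by (last name, first name, shares),
# mirroring "BY lname fname shares;" in PROC SORT.
rows = [
    ("Frank", "Smith", 10),
    ("Betty", "Jones", 5),
    ("Betty", "Jones", 10),
    ("Frank", "Smith", 100),
    ("Frank", "Albert", 40),
    ("Betty", "Jones", 50),
    ("Frank", "Albert", 20),
    ("Betty", "Jones", 45),
]
rows_sorted = sorted(rows, key=lambda r: (r[1], r[0], r[2]))
for fname, lname, shares in rows_sorted:
    print(fname, lname, shares)
```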
5.3       PROC CONTENTS

If you simply want to know basic structural (i.e. non-statistical) information about a
dataset, you can use the CONTENTS procedure. This is especially helpful when our econometric
results do not appear the way we expected them to: we may have damaged data, and one easy way
to detect the damage is to inspect the basic dataset properties. The procedure CONTENTS details
the number of variables, observations, and missing observations (some variables may not exist for
some people or during some periods: if your dataset is too large to inspect visually in EXCEL,
then CONTENTS can provide a quick peek). For example, consider the dataset “income1” with
income, taxes and hours worked for three people.

DATA income1;
INPUT income taxes hours;
DATALINES;
10000 100 54
75000 23000 38
23000 3000 40
RUN;
PROC CONTENTS DATA = income1;
RUN;

As usual, we need to dictate which dataset is to be inspected by CONTENTS. The
output window contains:
The SAS System                   17:49 Friday, May 11, 2000

The CONTENTS Procedure

Data Set Name:   WORK.INCOME1                              Observations:           3
Member Type:     DATA                                      Variables:              3
Engine:          V8                                        Indexes:                0
Created:         17:49 Friday, May 11, 2000                Observation Length:     24
Protection:                                                Compressed:             NO
Label:

-----Engine/Host Dependent Information-----

Data Set Page Size:          4096
Number of Data Set Pages:    1
First Data Page:             1

Max Obs per Page:                     168
Obs in First Data Page:               3
Number of Data Set Repairs:           0
Files\_TD1148\income1.sas7bdat
Release Created:                      8.0101M0
Host Created:                         WIN_PRO

-----Alphabetic List of Variables and Attributes-----

#    Variable    Type    Len    Pos
-----------------------------------
3    hours       Num       8     16
1    income      Num       8      0
2    taxes       Num       8      8

Like most procedures, this procedure permits many internal commands that direct SAS to display
specific information that is not displayed by default: consult SAS's help screen [7]. For example, SAS permits
many optional commands that can be entered after the "DATA =" statement:

PROC CONTENTS DATA = dataset [option] [option] … [option];
RUN;

For example, three such optional commands include:

Specify the output data set                                                              OUT =
Print a list of the variables by their position in the data set                          VARNUM

Thus, the programmer can save the CONTENTS output to another dataset, as well as list variables
in the order in which they appear in the dataset, as opposed to alphabetical order (see the example
above). Such a variable listing can be helpful if you have many (e.g. 50, 100, 200) variables and you want
to check that you are reading the data in in the right order (e.g. does "income" come before "taxes"?).

6.         Describing Data: Simple Data Analysis
In this section, we will learn the following procedures for basic statistical inspection of
our data:
MEANS
CORR
UNIVARIATE

6.1       PROC MEANS
PROC MEANS creates and displays basic sample statistics, confidence interval and
simple hypothesis test information, including the sample mean, variance, standard deviation, the
minimum and maximum values of specified variables, and t-tests for the null hypothesis that the
mean of a variable is zero. If no specifications are provided, SAS will automatically display results
for all variables. For example:

DATA income1;
INPUT income taxes hours;
DATALINES;
10000 100 54
75000 23000 38
23000 3000 40

[7] If your version of SAS is 6.0 or greater, then a very useful help screen should be installed. For all commands and procedures we
employ, you should always search the help screen for further information. Simply click on the "book" icon to the upper right.

RUN;
PROC MEANS DATA = income1;
RUN;

The output window contains:

The SAS System                    17:49 Friday, May 11, 2000

The MEANS Procedure

Variable    N            Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
income      3        36000.00        34394.77        10000.00        75000.00
taxes       3         8700.00        12468.76     100.0000000        23000.00
hours       3      44.0000000       8.7177979      38.0000000      54.0000000

-------------------------------------------------------------------------------

If you want information for select variables, use the VAR option:

PROC MEANS DATA = income1;
VAR income taxes;
RUN;

The SAS System                      17:49 Friday, May 11, 2000
The MEANS Procedure

Variable    N             Mean         Std Dev         Minimum         Maximum
-------------------------------------------------------------------------------
income      3         36000.00        34394.77        10000.00        75000.00
taxes       3          8700.00        12468.76     100.0000000        23000.00
-------------------------------------------------------------------------------

The proper syntax for PROC MEANS includes output options, like VAR, and statistic keywords
that dictate which information is to be displayed. If you use such keywords, SAS will only provide
that information, and omit all other statistics.

PROC MEANS [option(s)] [statistic-keyword(s)]

Statistical keywords include:
ALL               all statistics listed
CLM               100(1 - α)% confidence limits for the MEAN, where α is determined by the "ALPHA= option", and
the default is α = .05
CLSUM             100(1 - α)% confidence limits for the SUM, where α is determined by the "ALPHA=
option" and the default is α = .05.
CV                coefficient of variation
DF                degrees of freedom for the t test

KURTOSIS          The kurtosis of the data
MAX               maximum value
MEAN              mean for a numeric variable, or the proportion in each category for a categorical variable
MIN               minimum value
NMISS             number of missing observations
NOBS              number of non-missing observations [8]
PRT               probability that a true t-random variable is greater than the t-statistic we have derived
RANGE             range, MAX-MIN
STD               standard deviation
STDERR            standard error of the MEAN. When you request MEAN, the procedure computes STDERR by
default.
SUM               weighted sum, or estimated population total when the appropriate sampling weights are used
SKEWNESS          the skew of the data
T                 t value for H0: population MEAN = 0, and its two tailed p-value with DF
degrees of freedom
VAR               variance
VARSUM            variance of the SUM

All of the above statistics are computed as sample statistics. Consult the textbook, or any
introductory statistics textbook, for details:

KURTOSIS = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^4

derived as a sample analogue to E(x - \mu_x)^4

MEAN = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i

derived as a sample analogue (estimate) of E[x]

SKEWNESS = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^3

derived as a sample estimate of E(x - \mu_x)^3

STD = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = s

s, the estimate of the standard deviation of the population σ, provided the data are i.i.d.

STDERR = s_{\bar{X}} = \frac{STD}{\sqrt{n}} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} \Big/ \sqrt{n}

usually referred to as the standard error of the mean, s_{\bar{X}}.

This is an estimator of the standard deviation of the sample mean \bar{X}:
[8] Sometimes datasets do not contain complete information: some people in the dataset may not have
recorded values of some data, like age, education, etc.

V(\bar{x}) = V\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) = \frac{1}{n^2}\sum_{i=1}^{n} V(x_i) = \frac{1}{n^2}\, n\sigma^2 = \frac{\sigma^2}{n}

provided the data are i.i.d.

VAR = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = STD^2

s^2, the (sample) estimate of the variance of the population σ^2, provided the data are i.i.d.

T = \bar{x} \Big/ \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \Big/ n}

The sample mean of an i.i.d. process x_i, divided by the standard deviation of that
sample mean, converges to a standard normal random variable under the null
hypothesis that the true mean of the process x is zero.
Therefore we know:

H_0 : E[x] = \mu_x = 0

Z = \frac{\bar{x}}{\sqrt{V(\bar{x})}} = \frac{\bar{x}}{\sqrt{\sigma^2/n}} \sim N(0,1) \text{ if the null is true}

This Z statistic is accompanied by a two-tailed p-value. Consider the case
where \bar{x} = 10. Then a p-value for our null is the probability statement

P(|\bar{x}| > 10) = 2P(\bar{x} > 10) = 2P\!\left(\frac{\bar{x} - 0}{\sqrt{\sigma^2/n}} > \frac{10 - 0}{\sqrt{\sigma^2/n}}\right)

Because the random variable

\frac{\bar{x}}{\sqrt{\sigma^2/n}} \sim N(0,1) \text{ if the null is true,}

we can use the standard normal table to look up the probability that a standard
normal variable exceeds the cut-off value

\frac{10 - 0}{\sqrt{\sigma^2/n}}
Of course, we do not know the true variance σ²; thus, employing a sample
estimate of the variance, the resulting random variable will be roughly t-distributed
with n - 1 degrees of freedom [9]:

t = \bar{x} \Big/ \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 \Big/ n} \;\sim\; t_{n-1}
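Rather than consulting a table, the two-tailed normal p-value can also be computed directly from the standard normal CDF. The Python sketch below (an editor's illustration, not SAS) builds the CDF from the error function:

```python
import math

def normal_two_tailed_p(z):
    """Two-tailed p-value for a standard normal test statistic z:
    P(|Z| > |z|) = 2 * (1 - Phi(|z|)), with Phi built from erf."""
    phi = 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0)))
    return 2.0 * (1.0 - phi)

# At the classic 5% cut-off z = 1.96, the p-value is (almost exactly) .05
print(round(normal_two_tailed_p(1.96), 4))   # prints 0.05
```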
Example:

Consider a dataset with information on stock returns:

DATA stocks;
INPUT return;
DATALINES;
1
2
-4
5
0
RUN;

/* Then we run the following three PROC MEANS */
PROC MEANS DATA = stocks CLM ALPHA = .01;
RUN;
PROC MEANS DATA = stocks CLM ALPHA = .05;
RUN;
PROC MEANS DATA = stocks T PRT SKEWNESS KURTOSIS MEAN VAR;
RUN;

The output will be stacked in order of the MEANS statements. The first output
page contains the results of a 99% Confidence Interval:

The SAS System                              17:49 Friday, May 11, 2000
The MEANS Procedure
Analysis Variable : return

Lower 99%       Upper 99%
CL for Mean     CL for Mean
----------------------------
-5.9352101       7.5352101
----------------------------
Notice that the "ALPHA = .01" command dictates a 1 - .01 = .99 Confidence Interval.
The second output page contains the results of a 95% Confidence Interval:

The SAS System                              17:49 Friday, May 11, 2000
The MEANS Procedure
Analysis Variable : return

[9] The t-statistic will be exactly t-distributed if the data is normally distributed. This is a fundamental reason why many economists
assume their data is made up of normal random variables.

Lower 95%       Upper 95%
CL for Mean     CL for Mean
----------------------------
-3.2615890       4.8615890
----------------------------
The third output page contains the results of a sample t-test, mean and variance
of the mean:

The SAS System                       12:00 Sunday, May 13, 2000
The MEANS Procedure
Analysis Variable : return

t Value    Pr > |t|    Skewness        Kurtosis          Mean          Variance
-------------------------------------------------------------------------------------
0.55      0.6135      -0.4199926       1.2201939       0.8000000      10.7000000
-------------------------------------------------------------------------------------

Notice that we cannot reject the null hypothesis: the actual data sufficiently
represents a mean-zero random variable. Recall that we reject when the associated p-value
is less than the size of the test. In this case, if we choose the size to be 5%, then
clearly 61% > 5%, hence we cannot reject. When the p-value is less than the size (e.g.
suppose the p-value were .02), then the odds that our data could have been generated by a
mean-zero random variable are too low; consequently, we reject the hypothesis.
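These numbers can be reproduced by hand from the formulas of section 6.1. In the Python sketch below (an editor's illustration; the critical values 4.604 and 2.776 are t-table entries for 4 degrees of freedom, stated as assumptions rather than computed), we recompute the mean, variance, t-statistic, and both confidence intervals for the same five returns:

```python
import math

returns = [1, 2, -4, 5, 0]
n = len(returns)

mean = sum(returns) / n                                  # = 0.8
var = sum((x - mean) ** 2 for x in returns) / (n - 1)    # = 10.7 (up to rounding)
stderr = math.sqrt(var / n)                              # approx. 1.4629
t = mean / stderr                                        # approx. 0.55

# Two-sided t critical values with n-1 = 4 degrees of freedom (from a t-table)
t_995, t_975 = 4.604, 2.776
ci99 = (mean - t_995 * stderr, mean + t_995 * stderr)    # approx. (-5.94, 7.54)
ci95 = (mean - t_975 * stderr, mean + t_975 * stderr)    # approx. (-3.26, 4.86)

print(round(mean, 4), round(var, 4), round(t, 2))
print(ci99, ci95)
```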

6.2      PROC CORR
We employ PROC CORR to derive sample correlation coefficients for variables in a
dataset. Consider data on income, wages, gender, etc., derived from the 1978 Current Population
Survey [CPS], a U.S. dataset built by the U.S. Bureau of Labor Statistics [BLS]. You will use this
dataset for several projects in this course. For simple correlation coefficients between several
variables [10], we write:

DATA cps78;
INFILE 'a:\cps78.dat';
INPUT ED SOUTH NONWHITE HISPANIC FEMALE MARRIED
MARRFE TENURE TENURE_2 UNION ln_wage AGE NUM_DEP;
RUN;

PROC CORR DATA = cps78;
VAR ED FEMALE MARRIED TENURE UNION NUM_DEP;
RUN;

Here, I specify only a subset of the available variables for correlation analysis. SAS
automatically prints basic statistical information, including the sample means, standard deviations,
minima and maxima. Notice that PROC MEANS would be more useful for hypothesis testing and
confidence interval creation, as well as the generation of higher moments, like the skewness and
kurtosis.
The output, by default, includes sample statistics, correlation coefficients between all
variables, and the p-value for the null hypothesis that the true correlations are zero. Like any
standard hypothesis test at the 5%-level, if the resulting p-value is less than .05, we reject the null

[10] ED = years of education; SOUTH = 1 if the person lives in a southern state; NONWHITE = 1 if the person is black, Asian or
Hispanic; FEMALE = 1 if female; MARRIED = 1 if married; MARRFE = 1 if the person is a married female; TENURE = years in
their present job; TENURE_2 = tenure²; UNION = 1 if a member of a union; ln_wage = ln(wage); NUM_DEP = number of children
and other dependents in the household.

hypothesis that the true correlation is zero, and conclude that irrespective of the actual sample
value, we have reasonable evidence that the true correlation is less than or greater than zero.

The SAS System      12:00 Sunday, May 13, 2000
The CORR Procedure

6     Variables:     ED          FEMALE   MARRIED     TENURE    UNION      NUM_DEP

Simple Statistics

Variable              N               Mean       Std Dev              Sum           Minimum      Maximum

ED                  550        12.53636          2.77209             6895           1.00000      18.00000
FEMALE              550         0.37636          0.48491        207.00000                 0       1.00000
MARRIED             550         0.65273          0.47654        359.00000                 0       1.00000
TENURE              550        18.71818         13.34653            10295           1.00000      55.00000
UNION               550         0.30545          0.46102        168.00000                 0       1.00000
NUM_DEP             550         0.98909          1.28600        544.00000                 0       8.00000

Pearson Correlation Coefficients, N = 550
Prob > |r| under H0: Rho=0

ED         FEMALE          MARRIED          TENURE             UNION        NUM_DEP

ED               1.00000        0.06365           -0.08212        -0.34708       -0.12273         -0.06171
0.1360             0.0543          <.0001         0.0039           0.1483

FEMALE           0.06365        1.00000           -0.24526        -0.11727       -0.12408         -0.08687
0.1360                            <.0001          0.0059         0.0036           0.0417

MARRIED          -0.08212       -0.24526           1.00000         0.29188           0.14378       0.24051
0.0543         <.0001                            <.0001            0.0007        <.0001

TENURE           -0.34708       -0.11727           0.29188         1.00000           0.19045      -0.04401
<.0001         0.0059            <.0001                            <.0001        0.3029

UNION            -0.12273       -0.12408           0.14378         0.19045           1.00000       0.09780
0.0039         0.0036            0.0007          <.0001                          0.0218

NUM_DEP          -0.06171       -0.08687           0.24051        -0.04401           0.09780       1.00000
0.1483         0.0417            <.0001          0.3029            0.0218

1.                  The true correlation and sample correlation coefficients are respectively

\rho_{x,y} = \frac{cov(x,y)}{\sigma_x \sigma_y} = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sqrt{V[x]}\,\sqrt{V[y]}}

\hat{\rho}_{x,y} = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2}}

2.             I put the p-values in bold; SAS does not. Notice that these are
p-values for the test of the hypothesis that the true correlation is zero.
3.    The correlation between any variable and itself is always one (can you prove
this by using the above formulas?)
4.             The symbol "<" of course means "less than", hence "<.0001" means
the p-value is smaller than .0001. This, of course, is a very small p-value,
implying that the null hypothesis that the true correlation is zero should be
strongly rejected.
5.             Notice the relationship between education and number of dependents,
union membership and work tenure: more education for Americans in the
1970s implied for many people less time for child bearing/rearing, especially
for females, while more educated Americans tend not to participate in labor
organizations. Moreover, not surprisingly, more education tended to be
associated with fewer years in the labor force due to the time required to go to
school.
6.             What are the means of binary (i.e. dummy) random variables? How do
we interpret the sample mean of “female”, or “married”, or “union”?
7.             If you do not want correlations between all variables specified in the
VAR command, use the WITH command to dictate which variables are to be
analyzed with [WITH] the VAR variables. For example:

PROC CORR DATA = cps78;
VAR MARRIED TENURE UNION NUM_DEP;
WITH ED;
RUN;

Pearson Correlation Coefficients, N = 550
Prob > |r| under H0: Rho=0

MARRIED      TENURE                 UNION             NUM_DEP

ED   -0.08212     -0.34708          -0.12273               -0.06171
0.0543       <.0001            0.0039                 0.1483

Thus, SAS displays the correlation coefficients between ED, specified in the
WITH statement, and the various variables denoted with the VAR command.
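The correlation formulas can be exercised by hand on the small "income1" dataset from section 5. The Python sketch below (an illustration; these are not CPS numbers) computes a sample correlation and also confirms note 3: the correlation of a variable with itself is one, up to rounding.

```python
import math

def sample_corr(x, y):
    """Sample correlation: covariance over the product of standard
    deviations, each computed with the 1/(n-1) convention."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

income = [10000, 75000, 23000]
taxes = [100, 23000, 3000]

print(round(sample_corr(income, taxes), 4))       # strong positive correlation
print(round(sample_corr(income, income), 12))     # a variable with itself: 1.0
```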

6.3     UNIVARIATE

This procedure is essentially a combination of MEANS and CONTENTS: each variable specified
(all variables are analyzed by default) is statistically and structurally analyzed in manners similar
to MEANS and CONTENTS.

IV.      SAS and Econometric Analysis I: Basic Regression

This section details basic steps for performing least squares regression analysis in SAS using
standard OLS theory. We will use SAS to regress some y on the available information x, perform basic
tasks of inference and model improvement.

1.       PROC REG

We will use the procedure REG to perform basic regression analysis. There are many other PROCs
in SAS that can be used for least squares estimation, depending on the sophistication of the problem (e.g.
dependent error terms, error terms with non-constant variance, regression of many models simultaneously,
etc.). The following definition is what SAS's help screen (roughly) says about PROC REG under the
assumption that there may be more than one regressor (i.e. the X's) available:

PROC REG: Syntax
The following statements are available in PROC REG.

PROC REG options;
label: MODEL Y = X1 X2 … Xk / options;
BY variables;
OUTPUT OUT = dataset options;
PLOT yvar*xvar / options;
label: TEST test-specification / options;

We will study the various options and commands below. Consider, first, a simple example.

Example 1
Consider the CPS dataset detailed above, and suppose it is contained in the file data_1_1.dat on a
floppy disk. The data contains information on age, education, log-wages, gender, union status, number of
children, etc. Suppose we want to see if the level of education provides an adequate explanation for log-
wages. Define Y = ln_wage and X = ed, and suppose we want to estimate

(1)     E[Y_i | X_i] = \beta_1 + \beta_2 X_i
        \;\Leftrightarrow\; Y_i = \beta_1 + \beta_2 X_i + e_i

where the errors e_i satisfy the usual assumptions (i.e. zero mean, constant variance, zero correlation,
normally distributed). We write:

DATA cps;   /* CPS data */
INFILE 'a:\data_1_1.dat';
INPUT ED SOUTH NONWHITE HISPANIC FEMALE   MARRIED     MARRFE
TENURE TENURE_2 UNION ln_wage AGE NUM_DEP MANUF CONSTRUCT
MANAG SALES CLER SERV PROF;
RUN;

PROC REG DATA = cps;
MODEL ln_wage = ed;

RUN;

SAS understands that the variable on the left hand side of the equality in the MODEL statement is the
dependent Y, and anything on the right hand side is understood to be the independent variables X. Notice
that we have not used any options: the above code is the simplest possible way to run a bivariate regression.
The output is as follows:

The SAS System                     11:02 Sunday, May 27, 2001      1

The REG Procedure
Model: MODEL1
Dependent Variable: ln_wage

Analysis of Variance

Sum of               Mean
Source                       DF           Squares             Square    F Value     Pr > F

Model                         1          11.36093          11.36093        51.65    <.0001
Error                       548         120.53845           0.21996
Corrected Total             549         131.89938

Root MSE                 0.46900      R-Square        0.0861
Dependent Mean           1.68100      Adj R-Sq        0.0845
Coeff Var               27.90001

Parameter Estimates

Parameter        Standard
Variable       DF           Estimate           Error      t Value     Pr > |t|

Intercept       1            1.03044         0.09270        11.12        <.0001
ED              1            0.05189         0.00722         7.19        <.0001

The Analysis of Variance information will be studied in chapters 4 and 5; the Parameter Estimates will be
studied in chapters 3 and 5. Hence, much of the above information will not be understandable until we
study the chapters that follow chapter 3, although we can use the above information to gain insight into
how well our regression model describes the data.
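Even before studying chapters 4 and 5, the Analysis of Variance block can be cross-checked arithmetically: R-Square is the model sum of squares over the corrected total, and the F value is the ratio of the mean squares. A quick Python check (an editor's illustration using the numbers printed above):

```python
# Values copied from the REG output above
model_ss, error_ss, total_ss = 11.36093, 120.53845, 131.89938
model_df, error_df = 1, 548

r_square = model_ss / total_ss                         # explained share of variation
f_value = (model_ss / model_df) / (error_ss / error_df)

print(round(r_square, 4))   # prints 0.0861, as in the output
print(round(f_value, 2))    # prints 51.65, as in the output
```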
For now, note that under Parameter Estimates, SAS lists the employed "variables", and calls them
"Intercept" and "ED". Under the Parameter Estimate column [11], SAS lists the OLS estimates of the model in (1):

\hat{\beta}_1 = 1.03044        \hat{\beta}_2 = .05189
Moreover, SAS automatically performs tests of the two two-sided hypotheses

H_0 : \beta_1 = 0          H_0 : \beta_2 = 0
H_1 : \beta_1 \neq 0       H_1 : \beta_2 \neq 0

and presents the results under “t Value” and “Pr > |t|”. The p-value of the test, itself, is contained in

“Pr > |t|”

If the p-value is less than our chosen size of the test, say 5% = .05, then we reject the null; synonymously, if
the absolute t-statistic is greater than 1.96 for a sufficiently large sample (e.g. n > 100), we reject the null:

[11] "Estimate" without an "s".

p  value  .05  reject
or
t  value  1.96  reject (if n  100)

In the present case, both null hypotheses are rejected: this suggests that the true intercept is non-zero, and
that there truly exists a relationship between education and wages [12].
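The OLS estimates themselves come from closed-form formulas developed in chapter 3. The Python sketch below applies them to a tiny made-up dataset (hypothetical numbers, not the CPS data):

```python
# OLS for y = b1 + b2*x on a toy dataset (hypothetical numbers):
#   b2 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2)
#   b1 = ybar - b2 * xbar
x = [1, 2, 3, 4]
y = [2, 4, 5, 7]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

b2 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
     sum((xi - xbar) ** 2 for xi in x)
b1 = ybar - b2 * xbar

print(round(b1, 10), round(b2, 10))   # prints 0.5 1.6
```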

2.            PROC REG: Commands and Options

The PROC REG statement presented above employs many auxiliary commands (not all are
presented above) and allows for many options. Here, we will list and explain a few. Examples are
provided below.

A.         PROC REG options

After the “PROC REG DATA = dataset” statement, several options can be used:

CORR :                 displays the correlations for all variables listed in the MODEL
statement.
ALPHA = :              sets the probability level for confidence intervals with
respect to the OLS estimators
B.         MODEL / options

After the MODEL statement and the stated Y and X variables, use a slash “/”, and any of
the following options:

ALPHA = :              sets the probability level for confidence intervals with
respect to the OLS estimators
CLB :                  dictates to SAS that confidence intervals are to be created for
all regression model estimators
CORRB:                 displays the correlations between the various OLS estimators
COVB:                  displays the variances & co-variances for the estimators
NOINT :                dictates to SAS that the intercept parameter is assumed to be
zero

C.         BY

The BY command here performs the same task as in PROC MEANS. SAS will perform
separate regressions for each category within the BY variables; SAS expects the dataset
to be sorted by the employed variables. SAS only recognizes one BY command at a
time, hence if you want to estimate various regression models according to various
sub-group divisions, use several PROC REGs, and change the BY variables for each.

D.         OUTPUT OUT = dataset

If you want to save the regression output (e.g. parameter estimates, test statistics, etc.) to
another dataset, use this command. Note: the dataset that you save the regression to does
[12] Indeed, a non-zero intercept means that when education is zero, the individual's wages will not be zero:
E[Y_i | X_i = 0] = \beta_1 + \beta_2 \cdot 0 = \beta_1, hence the intercept represents the minimum wage a person can earn based on having zero years
of education. Not surprisingly, it is not zero: people can always find work even if they are uneducated. Moreover, a nonzero slope
implies the marginal impact of a new year of education on wages is non-zero:

\frac{\partial E[Y_i | X_i]}{\partial X_i} = \beta_2

thus, additional years of education will improve one's earning potential, on average.

not need to exist: SAS will simply create a new dataset with the assigned name. In order
to tell SAS which elements to send to the output dataset, use the following keywords
(there are far more than the ones below) after the “OUTPUT OUT = dataset”, and
without a slash “/”:

P = variable-name :  denotes the predicted values of Y; you need to assign a name for
this variable, like "y_hat"
R = variable-name :  denotes the residuals; you need to assign a name for this
variable, like "e_hat"

Thus, you can easily derive the predicted dependent variables and the regression
residuals.
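The quantities that P= and R= save can be written down directly: the predicted value is b1 + b2*x and the residual is y minus that prediction. A Python sketch (toy numbers, not the CPS data; b1 and b2 are the OLS estimates for this toy dataset) also exhibits the textbook property that OLS residuals sum to zero when an intercept is included:

```python
# Predicted values (P=) and residuals (R=) for a fitted line
# y_hat = b1 + b2*x, using toy numbers (not the CPS data).
x = [1, 2, 3, 4]
y = [2, 4, 5, 7]
b1, b2 = 0.5, 1.6          # OLS estimates for this toy dataset

y_hat = [b1 + b2 * xi for xi in x]
e_hat = [yi - yh for yi, yh in zip(y, y_hat)]

print([round(v, 4) for v in y_hat])        # prints [2.1, 3.7, 5.3, 6.9]
print(round(abs(sum(e_hat)), 10))          # residuals sum to (essentially) zero
```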

E.       PLOT

The PLOT statement in PROC REG displays scatter plots with the y-variable on the vertical
axis and the x-variable on the horizontal axis. If you want to plot the residuals or predicted
values of Y, use the "RESIDUAL." and "PREDICTED." keywords: notice that there
are dots, or periods, after the words RESIDUAL and PREDICTED. Also, notice that we
specify the variable that goes on the Y-axis first, and the variable for the X-axis is stated
second with a "*" in between.

F.       TEST

We will study this command in depth in the subsequent sections.

Example 2
We want to regress the log of wages Y on education X from the CPS data.

DATA cps;     /* CPS data */
INFILE 'a:\data_1_1.dat';
INPUT ED SOUTH NONWHITE HISPANIC FEMALE       MARRIED
MARRFE TENURE TENURE_2 UNION ln_wage AGE NUM_DEP
MANUF CONSTRUCT MANAG SALES CLER SERV PROF;
RUN;
PROC REG DATA = cps;
MODEL ln_wage = ed;
RUN;

Parameter Estimates

Parameter    Standard
Variable    DF    Estimate      Error      t Value   Pr > |t|

Intercept 1       1.03044       0.09270     11.12 <.0001
ED        1      0.05189       0.00722      7.19 <.0001

Example 3
We want to regress the log of wages Y on education X from the CPS data without an intercept
term.

PROC REG DATA = cps;
MODEL ln_wage = ed / NOINT;
RUN;

Parameter Estimates

Parameter    Standard
Variable    DF    Estimate      Error      t Value   Pr > |t|

ED         1       0.13027     0.00172     75.61     <.0001

Example 4
We want to regress the log of wages Y on education X, and regress the log of wages Y on job tenure
(years in the labor force) X. We can use two separate PROC REG's, or simply use two separate MODEL
statements. In order to clarify the output for our own sake, we can use labels for each MODEL
command. Note: we do not need to use labels, and we can always use labels even when we only estimate
one model.

PROC REG DATA = cps;
Wage_Ed: MODEL ln_wage = ed;                     /* “Wage_Ed” will be
used to signify the
output of this
regression */
Tenure_Ed: MODEL ln_wage = tenure;                    /* “Tenure_Ed” will be
used to signify the output of this
regression */
RUN;

The SAS System          19:17 Sunday, May 27, 2001 21

The REG Procedure

Model: Wage_Ed
Dependent Variable: ln_wage

Parameter Estimates

Parameter    Standard
Variable    DF    Estimate      Error      t Value   Pr > |t|

Intercept 1         1.03044     0.09270      11.12  <.0001
ED        1        0.05189     0.00722       7.19 <.0001


The REG Procedure
Model: Tenure_Ed
Dependent Variable: ln_wage

Parameter Estimates

Parameter    Standard
Variable    DF    Estimate      Error      t Value   Pr > |t|

Intercept 1   1.50849    0.03490    43.22    <.0001
TENURE      1    0.00922    0.00152     6.07    <.0001

Example 5
We want to regress the log of wages Y on education X, separately for females with children and for
everyone else (those who are not female, or who have no children).

DATA cps;        /* CPS data */
INFILE 'a:\data_1_1.dat';
INPUT ED SOUTH NONWHITE HISPANIC FEMALE       MARRIED
MARRFE TENURE TENURE_2 UNION ln_wage AGE NUM_DEP
MANUF CONSTRUCT MANAG SALES CLER SERV PROF;

IF NUM_DEP > 0 THEN DEP = 1;
ELSE IF NUM_DEP EQ 0 THEN DEP = 0;
FEM_DEP = FEMALE*DEP;

RUN;

PROC SORT DATA = cps;
BY fem_dep;
RUN;
PROC REG DATA = cps;
MODEL ln_wage = ed;
BY fem_dep;
RUN;


-------------------------------------------- FEM_DEP=0 ----------------------------------

The REG Procedure
Model: MODEL1
Dependent Variable: ln_wage

Parameter Estimates

Parameter    Standard
Variable     DF    Estimate      Error        t Value    Pr > |t|

Intercept 1        1.14712        0.09393       12.21 <.0001
ED        1       0.04765        0.00728       6.55 <.0001


-------------------------------------------- FEM_DEP=1 ----------------------------------

The REG Procedure
Model: MODEL1
Dependent Variable: ln_wage

Parameter Estimates

Parameter    Standard
Variable   DF    Estimate      Error    t Value       Pr > |t|

Intercept 1      0.48333       0.26942     1.79 0.0762
ED        1     0.07009       0.02160     3.25 0.0017

Example 6
We want to regress the log of wages Y on education X, and display confidence intervals for the
OLS estimators.

PROC REG DATA = cps;
MODEL ln_wage = ed/ CLB ALPHA = .05;
RUN;


The REG Procedure
Model: MODEL1
Dependent Variable: ln_wage

Parameter Estimates

Parameter   Standard
Variable      DF   Estimate    Error        t Value    Pr > |t|      95% Confidence Limits

Intercept    1     1.03044     0.09270      11.12      <.0001          0.84835      1.21254
ED           1     0.05189     0.00722      7.19       <.0001          0.03771      0.06608

Example 7
We want to regress the number of children Y on education X, and perform a variety of tasks.

DATA cps;     /* CPS data */
INFILE 'c:\Program Files\WS_FTP\econometrics\data_1_1.dat';
/* contains the CPS data */
INPUT ED SOUTH NONWHITE HISPANIC FEMALE                  MARRIED MARRFE
TENURE TENURE_2 UNION ln_wage AGE NUM_DEP MANUF CONSTRUCT
MANAG SALES CLER SERV PROF;
MALE_PRO = (1-FEMALE)*PROF;
RUN;

PROC SORT DATA = cps;
BY male_pro;
RUN;

PROC REG DATA = cps CORR;
MODEL num_dep = ed/ CLB ALPHA = .05 CORRB NOINT;
BY male_pro;
RUN;


-------------------------------------------- MALE_PRO=0 ---------------------------------

The REG Procedure
Uncorrected Correlation
Variable         ED        NUM_DEP

ED              1.0000      0.5811
NUM_DEP              0.5811      1.0000


-------------------------------------------- MALE_PRO=0 ---------------------------------

The REG Procedure
Model: MODEL1
Dependent Variable: NUM_DEP

NOTE: No intercept in model. R-Square is redefined.

Parameter Estimates

Parameter           Standard
Variable           DF        Estimate            Error               t Value   Pr > |t|     95% Confidence Limits

ED                 1         0.07349             0.00466             15.77     <.0001        0.06433    0.08264

V.       SAS and Econometric Analysis II: Multiple Regression with PROC AUTOREG

This section will provide the basic details for using SAS's PROC AUTOREG. This procedure
performs the same tasks as PROC REG if the regression assumptions are standard, and can employ more
sophisticated techniques if basic assumptions do not hold (e.g. correlated regression errors, regression
errors with non-constant variance, etc.). A nice feature of this procedure is its ability to test a myriad of
important hypotheses, including the hypothesis that the regression errors are normal random variables,
that the error variance is constant, or that the errors are uncorrelated with each other: PROC REG
cannot perform these tests.
In order to handle non-standard estimation environments, we will employ PROC AUTOREG for
estimation when variance is non-constant and/or errors are correlated.

1.       PROC AUTOREG

The basic syntax of PROC AUTOREG is as follows:

PROC AUTOREG options ;
BY variables ;
MODEL Y = X1 X2 … Xk / options ;
TEST / options ;
OUTPUT OUT = dataset options ;

We will study the various options below.

Example 1
The following code enters the coffee data from Project 2, performs basic regression with
AUTOREG, and sends the regression output to new datasets. Notice that the

OUTPUT OUT = q_out1 P = q_hat R = e_hat;

statement creates a new dataset called “q_out1”. The statement “P = q_hat” tells SAS to place the predicted
values into the new dataset, and call the new variable “q_hat”. The statement “R = e_hat” tells SAS to place
the regression residuals into the new dataset, and call the new variable “e_hat”. We can then print the
datasets, save the SAS output, and use EXCEL to make graphs: we will learn these tasks over the next few
weeks.

data coffee;
infile 'c:\Program Files\WS_FTP\econometrics\data_2_1.dat';
input q p;
ln_q = log(q);
ln_p = log(p);
run;

proc autoreg data = coffee;
model q = p;
output out = q_out1 P = q_hat R = e_hat;
model ln_q = ln_p;
output out = q_out2 P = q_hat R = e_hat;
run;

proc print data = q_out1;
run;

The AUTOREG Procedure

Dependent Variable           q

Ordinary Least Squares Estimates

SSE                     0.14907972        DFE                                9
MSE                        0.01656        Root MSE                     0.12870
SBC                     -11.300425        AIC                       -12.096215
Regress R-Square            0.6628        Total R-Square                0.6628
Durbin-Watson               0.7266

Standard                       Approx
Variable           DF      Estimate             Error        t Value      Pr > |t|

Intercept           1        2.6911           0.1216           22.13        <.0001
p                   1       -0.4795           0.1140           -4.21        0.0023

The AUTOREG Procedure

Dependent Variable           ln_q

Ordinary Least Squares Estimates

SSE                     0.02263302        DFE                                9
MSE                        0.00251        Root MSE                     0.05015
SBC                     -32.036211        AIC                       -32.832001
Regress R-Square            0.7448        Total R-Square                0.7448
Durbin-Watson               0.6801

Standard                       Approx
Variable           DF      Estimate             Error        t Value      Pr > |t|

Intercept           1        0.7774           0.0152           51.00        <.0001
ln_p                1       -0.2530           0.0494           -5.13        0.0006

Obs       q_hat         e_hat         q         p            ln_q         ln_p

1     2.32189         0.24811      2.57     0.77        0.94391        -0.26136
2     2.33627         0.16373      2.50     0.74        0.91629        -0.30111
3     2.34586         0.00414      2.35     0.72        0.85442        -0.32850
4     2.34107        -0.04107      2.30     0.73        0.83291        -0.31471
5     2.32668        -0.07668      2.25     0.76        0.81093        -0.27444
6     2.33148        -0.13148      2.20     0.75        0.78846        -0.28768
7     2.17323        -0.06323      2.11     1.08        0.74669         0.07696
8     1.82318         0.11682      1.94     1.81        0.66269         0.59333
9     2.02458        -0.05458      1.97     1.39        0.67803         0.32930
10     2.11569        -0.05569      2.06     1.20        0.72271         0.18232
11     2.13007        -0.11007      2.02     1.17        0.70310         0.15700

The bottom of the output presents the dataset called “q_out1”: notice that SAS automatically
places all the data from the original dataset in the output dataset. In addition, SAS places the regression
predicted values, named “q_hat”, and the regression residuals, named “e_hat”, in this dataset.

Notice the different output arrangement when compared to PROC REG. SAS places the basic
goodness-of-fit measures at the top of the output, including “SSE” (the sum of squared residuals), “MSE”
(the mean squared error [13]) and the coefficient of determination, R2.

2.         PROC AUTOREG: Commands and Options

A.        MODEL / options

After the MODEL statement and the stated Y and X variables, use a slash “/”, and any of
the following options:

CORRB :           displays the correlations between the various OLS estimators
NOINT :           dictates to SAS that the intercept parameter is assumed to be
zero
NORMAL :          computes the Jarque-Bera normality test statistic for the
regression residuals.

B.        BY

The BY command here performs the same task as in PROC REG. SAS will perform
basic OLS tasks for each group specified by the BY variable. SAS expects the data to be
sorted according to the BY variable.
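A sketch mirroring the BY usage shown earlier for PROC REG (the fem_dep variable follows
Example 5 above; remember to run PROC SORT first):

PROC SORT DATA = cps;
BY fem_dep;
RUN;
PROC AUTOREG DATA = cps;
MODEL ln_wage = ed;
BY fem_dep;                  /* one regression per FEM_DEP group */
RUN;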

C.        OUTPUT OUT = dataset

If you want to save the regression output (e.g. parameter estimates, test statistics, etc.) to
another dataset, use this command. Note: the dataset that you save the regression to does
not need to exist: SAS will simply create a new dataset with the assigned name. In order
to tell SAS which elements to send to the output dataset, use the following keywords
(there are far more than the ones below) after the “OUTPUT OUT = dataset”, and without
a slash “/”:

     P = variable name : denotes the predicted values of Y; you need to assign a name for
this variable, like “y_hat”
     R = variable name : denotes the residuals; you need to assign a name for this
variable, like “e_hat”

3.         Test of Normality: The NORMAL Command

As detailed above, PROC AUTOREG can perform the Jarque-Bera test of normality on the
regression errors by employing the command NORMAL after the MODEL statement. Recall that the test
statistic employs the skewness of the residuals (a measure of distribution symmetry), and the kurtosis (a
measure of the flatness of the distribution: a flatter distribution means the tails are larger, which implies
greater variance). Under the null hypothesis that the true regression errors are normally distributed, the
Jarque-Bera test statistic has a chi-squared distribution with 2 degrees of freedom. Thus, if
H0: e ~ N(0, σ2) is true, then

JB ~ χ2(2).

SAS automatically displays the p-value for the chi-squared test statistic.
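For reference, the test statistic itself (the standard textbook definition; it is not derived in these
notes) combines the sample skewness S and sample kurtosis K of the residuals:

JB = (n/6) * [ S^2 + (K - 3)^2 / 4 ]

where n is the number of observations: a normal distribution has S = 0 and K = 3, so large
departures of either from those values inflate JB.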

[13] The MSE, or “mean squared error”, is simply the estimated regression error variance:
σ̂^2 = (Σ ê_i^2)/(n - 2).

Example 2

proc autoreg data = coffee;
model q = p / NORMAL;
run;

The SAS System         09:41 Wednesday, June 13, 2001 19

The AUTOREG Procedure

Dependent Variable    q

Ordinary Least Squares Estimates

SSE                     0.14907972         DFE                       9
MSE                     0.01656            Root MSE                   0.12870
SBC                     -11.300425         AIC                       -12.096215
Regress R-Square        0.6628             Total R-Square            0.6628
Normal Test             1.7466             Pr > ChiSq                0.4176
Durbin-Watson            0.7266

Standard        Approx
Variable         DF    Estimate      Error t Value Pr > |t|

Intercept        1     2.6911      0.1216        22.13   <.0001
p                1    -0.4795      0.1140        -4.21   0.0023

SAS prints the Jarque-Bera statistic as “Normal Test 1.7466”, and displays the corresponding p-
value to the right, Pr > ChiSq 0.4176. Under the null hypothesis, the JB statistic is a chi-squared random
variable with 2 degrees of freedom: the 5% cutoff value is 5.99, and 1.7466 < 5.99, thus we cannot reject
the null. However, we can always simply refer to the p-value: the p-value = .4176 > .05, hence we cannot
reject the null. For this sample, the regression errors are reasonably similar to normal random variables,
hence we can maintain the assumption that they are, in fact, normal.

VI.      SAS and Econometric Analysis III: Multiple Regression and Inference

This section will provide information for using SAS to perform tests of model specification
hypotheses. In particular, we will review how to use PROC REG for the classical F-test, PROC REG and
PROC AUTOREG for general F-tests of multiple restrictions, and PROC AUTOREG for the RESET test
of model correctness.

1.       Classical F-test of Model Correctness

A.       Theory

The classical F-test of model correctness is used to test the hypothesis that all slope parameters
are simultaneously zero (i.e. no explanatory variable is linearly related to Y; the entire linear model is
inappropriate). For the model

(1)       Yt = β1 + β2*X2t + β3*X3t + ... + βK*XKt + et

the null hypothesis is

H0: β2 = 0, ..., βK = 0

Observe that we only test the slopes: the nature of the hypothesis is to see
whether any explanatory variables at all belong, not whether an intercept is
appropriate.
The F-statistic for a test of the above hypothesis is exactly

(2)       F = [(SST - SSE)/(K - 1)] / [SSE/(N - K)]   ~   F(K - 1, N - K) if the null hypothesis is true.

If the null is true, the F-statistic will be close to zero, whereas if the null is false, the statistic will be very
large: for a test at the 5% level, we reject if F > Fc, where the cutoff value Fc is derived from the F-
distribution with K - 1 and N - K degrees of freedom:

P[F > Fc] = .05,   where F ~ F(K - 1, N - K)

B.       SAS

Use PROC REG. SAS automatically reports the F-statistic and associated p-value: reject the null
hypothesis if the p-value < .05 (or, whatever the test size is; e.g. .01, .05, .10).
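A minimal sketch (borrowing the income model used later in this section; the overall F-statistic
and its p-value appear in the analysis-of-variance table that PROC REG prints by default):

PROC REG DATA = d1;
MODEL INCOME = ED AGE NUM_CHILD;   /* the ANOVA table reports “F Value” and “Pr > F” */
RUN;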

2.       General F-test of Multiple Restrictions

A.        Theory

The general F-test of multiple restrictions is used to test complicated hypotheses concerning more than
one parameter at a time. The hypothesis may place restrictions on any regression parameter (the
intercept; any slope), may test any number of parameters at one time, and may test functions of
parameters. Examples of null hypotheses testable by the F-test method include

i.          H0: β1 = 0, β3 = 0
ii.         H0: β2 = 2, β3 = 3*β4, β5 = -β4
iii.        H0: β2 + β3 + β4 + β5 = 1

The F-statistic for a test of the above hypothesis is based on running two separate regressions, one
without any restrictions, and one with the hypothetical restrictions enforced [14]. The Sum of Squared
Errors [SSE] are collected from the unrestricted model (SSEU) and the restricted model (SSER). The
F-statistic is exactly

(3)       F = [(SSER - SSEU)/J] / [SSEU/(N - K)]   ~   F(J, N - K) if the null hypothesis is true.

where J denotes the number of restrictions. For example, using the above three examples (i) – (iii), the
number of restrictions are respectively

i.        J=2
ii.       J=3
iii.      J=1

If the null is true, the restricted and unrestricted models will perform roughly identically, hence the SSE's
will be nearly identical and the F-statistic will be close to zero. If the null is false, when the restrictions are
enforced the resulting model will perform very poorly compared to the unrestricted model, hence the SSE
from the restricted model will be comparatively large, and the statistic will be very large.
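As a worked illustration with made-up numbers (purely hypothetical, not from the CPS or income
data): suppose SSER = 120, SSEU = 100, J = 2, and N - K = 50. Then

F = [(120 - 100)/2] / [100/50] = 10/2 = 5

which we would compare with the 5% cutoff value from the F(2, 50) distribution.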

B.       SAS

Use PROC REG or PROC AUTOREG. The test instructions are placed below the MODEL
statement, on separate lines of code. By way of example, consider an income model

(4)       INCOMEt = β1 + β2*EDt + β3*AGEt + β4*NUM_CHILDt + et

Suppose we want to test the two hypotheses

i.        H0: β1 = 0, β4 = 0
ii.       H0: β2 = .5*β3

We write [15]

PROC REG DATA = d1;
MODEL INCOME = ED AGE NUM_CHILD;
TEST intercept = 0, NUM_CHILD = 0;
TEST ED = .5*AGE;
RUN;

Notice that we refer to the estimated intercept literally as “intercept”.
SAS will report on separate screens (i.e. you need to scroll down) the results of each test. SAS
displays numerical values associated with the numerator and denominator of the F-statistic, the F-statistic
itself, labeled “F Value”, and the p-value, labeled “Pr > F”. As usual, for a 5%-sized test we reject the null
hypothesis if the p-value is below .05. Examples of SAS output follow:

Test 1 Results for Dependent Variable INCOME

                                          Mean
Source                      DF          Square              F Value               Pr > F

Numerator                    2       127493742                 7.67               0.0005
Denominator                424        16613992

[14] In the course, if there is time, we will study how to use SAS to perform “constrained least squares”,
the method of OLS when restrictions on the parameters are required.
[15] PROC AUTOREG will perform the same task: recall, however, that PROC AUTOREG will not report
the classical F-test of model correctness.

The REG Procedure
Model: MODEL1

Test 2 Results for Dependent Variable INCOME

Mean
Source                     DF        Square              F Value               Pr > F

Numerator                   1         683580130          41.14                 <.0001
Denominator                424       16613992

Observe that both tests reject the null hypothesis at the 5%-level: we do not have statistical evidence to
support either hypothesis.

3.       The RESET Test of Model Specification Correctness

A.       Theory
The RESET test of model specification correctness tests to see if the hypothesized model is
correct, with an alternative hypothesis that suggests a better model. Consider the following regression
model with K = 4:

(5)      Yt = β1 + β2*X2t + β3*X3t + β4*X4t + et

Examples of null hypotheses and resulting alternatives are

i.       H0: Yt = β1 + β2*X2t + β3*X3t + β4*X4t + et
         H1: Yt = β1 + β2*X2t + β3*X3t + β4*X4t + β5*Ŷt^2 + et

ii.      H0: Yt = β1 + β2*X2t + β3*X3t + β4*X4t + et
         H1: Yt = β1 + β2*X2t + β3*X3t + β4*X4t + β5*Ŷt^2 + β6*Ŷt^3 + et

Other alternatives would simply add more power-functions of the predicted Y's.
The alternative hypothesis is based on the logic that if the original model is not adequate (i.e. poor
performance based on t-tests, coefficient of determination, classical F-test), then a reasonable model
improvement entails adding non-linear functions of the available data. To see this, notice that the
alternative models include power-functions of the predicted Y's, Ŷt^2 and Ŷt^3.
Now, recall that the predicted values are exactly

Ŷt = β̂1 + β̂2*X2t + β̂3*X3t + β̂4*X4t

Thus, for example, Ŷt^2 will be a function of squares of the X's and “interaction” terms, like

X2t*X3t
X2t*X4t
X3t*X4t

as well as functions of all the estimated parameters, which, in turn, are all random functions of the
available data. In other words, the alternative model, for example

Yt = β1 + β2*X2t + β3*X3t + β4*X4t + β5*Ŷt^2 + et,

will now include the original explanatory variables, and simple non-linear functions of all the explanatory
variables.
Why would we ever simply add power-functions of the explanatory variables? In general,
although through F-tests and t-tests we can ascertain that some information (i.e. some explanatory variable)
is not statistically relevant, we never know exactly how to build a better model: removing an explanatory
variable won't necessarily lead to a better model: recall that the R2 will drop in value as we remove
variables! Indeed, even if the answer is simply “add more data; find better explanatory variables”, we, of
course, do not have more data, and if we could find better explanatory variables, we would have already
been using them. In other words, the sample of data we have is not going to improve magically. If the
present model (5) is not performing well, we have little choice but to use the available data in a way
different than the original linear specification. Power-functions are simply a convenient non-linear way to
build “new” explanatory variables in a world of limited data.
If we reject the test (i.e. if the F-statistic used for the RESET test is too large), we conclude that we have
evidence that a better model would be the one specified in the alternative hypothesis.

B.       SAS

Use PROC AUTOREG. For model (4), say, we write

PROC AUTOREG DATA = d1;
MODEL INCOME = ED AGE NUM_CHILD / RESET;
RUN;

SAS does not know how many power-functions of the predicted values to include for the test, so it reports
RESET test statistics for a variety of tests (based on using different power functions of the predicted
values). The output is

The SAS System              16:40 Monday, June 25, 2001 1

The AUTOREG Procedure

Dependent Variable      INCOME

Ordinary Least Squares Estimates

SSE          7044332479 DFE               424
MSE            16613992 Root MSE           4076
SBC          8350.65254 AIC          8334.41605
Regress R-Square    0.1084 Total R-Square     0.1084
Durbin-Watson      1.9012

Ramsey's RESET Test

Power               RESET                Pr > F

2                  7.4417               0.0066          (uses Ŷ^2)
3                  3.8428               0.0222          (uses Ŷ^2, Ŷ^3)
4                  2.7202               0.0441          (uses Ŷ^2, Ŷ^3, Ŷ^4)

Standard                      Approx
Variable            DF       Estimate             Error               t Value   Pr > |t|

Intercept 1                  -2612                1618                -1.61     0.1071
ED        1                  573.6493              87.0455            6.59      <.0001
AGE       1                  18.6558              27.1506              0.69     0.4924
NUM_CHILD 1                   -1708                538.6768            -3.17    0.0016

Notice that SAS prints RESET statistics for tests that include only the power of 2, the powers of 2 and 3,
and the powers of 2, 3 and 4. As usual, we reject if the associated p-value is less than .05. In this case, we
reject the null in all three tests: there is substantial evidence that the original specification in (4) is not
accurate, and that including power-functions of the predicted values will improve the performance of the
model: literally, any of the models

Yt = β1 + β2*X2t + β3*X3t + β4*X4t + β5*Ŷt^2 + et

Yt = β1 + β2*X2t + β3*X3t + β4*X4t + β5*Ŷt^2 + β6*Ŷt^3 + et

Yt = β1 + β2*X2t + β3*X3t + β4*X4t + β5*Ŷt^2 + β6*Ŷt^3 + β7*Ŷt^4 + et

will perform better than the original specification in (4). Until we find a better way to analyze the model,
we should estimate one of the above augmented model specifications for the sake of more statistically
accurate forecasts.

C.       Using the RESET Result to Build a Better Model

In order to estimate the augmented model suggested by the alternative hypothesis, we need to save
the predicted values to a new dataset by using the OUTPUT OUT command. SAS will automatically place
in the new dataset all explanatory variables and the Y variable. For example, to perform the RESET test
and save the predicted values, use

PROC AUTOREG DATA = d1;
MODEL INCOME = ED AGE NUM_CHILD / RESET;
OUTPUT OUT = reg_out P = y_hat;
RUN;

SAS will create a new dataset called “reg_out”, and place in it INCOME, ED, AGE, and NUM_CHILD as
well as the predicted values of INCOME. Notice the command: we use “P = “ to signify that we want the
predicted values to be printed to the dataset; we then create a variable name for the predicted values. Here, I
simply called them “y_hat”. We need, however, power-functions of the predicted values. For this task, we
can create yet another dataset, place everything in “reg_out” into the new dataset, and create the powers.
Consider the following code:

PROC AUTOREG DATA = d1;
MODEL INCOME = ED AGE NUM_CHILD / RESET;
OUTPUT OUT = reg_out P = y_hat;
RUN;

DATA reg_out2;
SET reg_out;                            /* SET places “reg_out” into this dataset */
y_hat_2 = y_hat**2;
y_hat_3 = y_hat**3;
y_hat_4 = y_hat**4;
RUN;

PROC AUTOREG DATA = reg_out2;
MODEL INCOME = ED AGE NUM_CHILD y_hat_2 y_hat_3 y_hat_4 / RESET;
RUN;

The SAS output for the second regression with the augmented predicted value power functions is


The AUTOREG Procedure

Dependent Variable        income

Ordinary Least Squares Estimates

SSE          6910381237 DFE               421
MSE            16414207 Root MSE           4051
SBC          8360.61292 AIC          8332.19905
Regress R-Square    0.1254 Total R-Square     0.1254
Durbin-Watson      1.9005

Ramsey's RESET Test

Power         RESET      Pr > F

2         1.7439   0.1874
3         0.8737   0.4181
4         0.6407   0.5892

Standard            Approx
Variable           DF       Estimate             Error             t Value        Pr > |t|

Intercept          1        13052                 16875            0.77           0.4397
ED                 1        -1597                2765               -0.58         0.5639
AGE                1        -58.5699             94.1374           -0.62           0.5342
NUM_CHILD          1        4301                 8255              0.52           0.6027
y_hat_2            1         0.001178            0.001857          0.63           0.5262
y_hat_3            1        -1.864E-7            2.9402E-7         -0.63          0.5265
y_hat_4            1        1.126E-11            1.618E-11          0.70          0.4868

1.       Now that we have included power-functions of the predicted values from the original
estimated model, the RESET tests all fail to reject the hypothesis that the specification

(6)      Yt = β1 + β2*X2t + β3*X3t + β4*X4t + β5*Ŷt^2 + β6*Ŷt^3 + β7*Ŷt^4 + et

is correct: in other words, adding the power-functions seems to have created a
regression model that cannot be yet again improved.
2.       However, notice how all the estimated slope signs have changed, the sizes of the estimated
parameters are substantially different (education has a negative impact?!), and all t-tests fail to
reject the hypotheses that the true slopes are zero. Somewhat contradictorily, the classical F-test
rejects the hypothesis that the entire linear model is irrelevant (I do not present the F-test above,
however the p-value < .0001). In other words, the entire model works well, but the actual individual
parameters seem to be very volatile, and therefore not trustworthy.

3.       This confusing phenomenon is often due to excessive correlation (linear dependence)
between the regressors [16], which we refer to as “multi-collinearity”. In model (6), the augmented
power functions of the predicted values will themselves be functions of the X's, and therefore all
the regressors are likely to be highly correlated in the new regression model (6). That the RESET test
can produce such a poor result is one reason why econometricians over the past 20 years have
attempted to produce better model specification tests.

[16] Recall, for multiple regression, we assume the regressors are not linear functions of each other. If they were, SAS could not
perform least squares estimation. However, when the explanatory variables are only somewhat correlated (indeed, simply not
perfectly correlated), SAS can perform OLS, but the results may be difficult to interpret, or simply nonsensical.
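A common follow-up diagnostic for multi-collinearity (not covered in these notes) is to request
variance inflation factors with the VIF option on the MODEL statement in PROC REG; a sketch
using the augmented model's dataset:

PROC REG DATA = reg_out2;
MODEL INCOME = ED AGE NUM_CHILD y_hat_2 y_hat_3 y_hat_4 / VIF;   /* large VIF values flag near-linear dependence */
RUN;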


```