					                                 SAS- Statistical Programming Language
                                                    Ignacio Correas
                                      University of Colorado at Boulder
NOTES FROM: Jonathan Hill, Dept. of Economics, University of California-San Diego


I would like to thank my friend Dr. Jonathan Hill for letting me use his excellent SAS notes and exercises.
Jonathan's caliber as an econometrician is further reflected in the ease of exposition and the clarity with
which he presents material as didactically demanding as teaching econometrics with a computer package.

CONTENTS

I.        What is SAS? Getting Around, Saving Files, Printing
          1.       Introduction
          2.       Booting-up SAS
          3.       The SAS Environment: Getting Around
          4.       Opening Files, Saving Files, Printing Output
II.       Basic SAS Programming Elements: Data, Proc's, Macros, IML
          1.       Data Step
          2.       Proc Step
III.      Data: Entering and Examining Economic Information
          1.       Internal Data Entry: DATALINES
          2.       External Data Entry: INFILE, FILENAME, OBS
          3.       Creating New Variables
                   3.1      Arithmetic Operations
                   3.2      Logical Operations: IF, THEN, ELSE, AND, OR
          4.       Creating Datasets from Existing Datasets
                            MERGE
                            SET
          5.       Describing Data: Simple Data Inspection
                            PROC PRINT
                            PROC SORT
                            PROC CONTENTS
          6.       Describing Data: Simple Data Analysis
                            PROC MEANS
                            PROC CORR
                            PROC UNIVARIATE
IV.       SAS and Econometric Analysis I: Basic Regression with PROC REG
                            PROC REG
                            PROC REG: Commands and Options
                                      MODEL
                                      BY
                                      TEST (F-Tests)
                            EXAMPLES
V.        SAS and Econometric Analysis II: Multiple Regression with PROC AUTOREG
          1.       PROC AUTOREG
          2.       PROC AUTOREG: Commands and Options [model,by,…]
          3.       The Jarque-Bera Test of Normality: NORMAL
VI.       SAS and Econometric Analysis III: Multiple Regression and Inference
          1.       Classical F-test of Model Correctness: PROC REG
          2.       General F-test of Multiple Restrictions: TEST
          3.       The RESET Test of Model Specification Correctness
          4.       Tests of Heteroskedasticity




I.         What is SAS? Getting Around, Saving Files, Printing

1.         INTRO
           The statistical software we will employ in this course is SAS (the name was originally an acronym for
Statistical Analysis System), a language which is used world-wide in economics, sociology, political science,
and biology, and in major universities, governments and private research organizations. The SAS language, a
multi-purpose statistical package, is particularly useful for manipulating and creating large datasets, and
for fast, simple statistical analysis. Other software that is appropriate for more advanced and refined
statistical analysis includes LIMDEP (Limited Dependent Variables), GAUSS, MATLAB, and FORTRAN (Formula
Translation).


2.         BOOTING-UP SAS
           In Windows, click-on START, click-on PROGRAMS, then click-on SAS. When the software loads,
the screen should be split into two parts or, in Version 8, three parts, depending on which version you are
using. You can fill the entire screen with the software by clicking on the open-box in the upper right-hand
corner.


3.         THE SAS ENVIRONMENT: Getting Around
           SAS incorporates three (3) primary windows for viewing program text, output and error messages.

          Program Editor Window
                      The Program Editor allows us to create SAS programs directly. The editor screen is
           simply titled "Editor", and you can place the cursor in the editor by clicking anywhere on the
           editor screen, or by clicking-on Window, then Editor. Also, if your version of SAS is recent, on
           the bottom of the screen will be three bars denoting Editor, Output and Log: click-on the
           appropriate bar to go to the specific screen.

          Output Window
                      This window displays program output, including the printing of datasets, statistical output
           like sample statistics (mean, variance), and econometric output (e.g. regression output). The Output
           window can be cleared, and should be cleared before you run a program: be sure the Output window is
           up, then click-on Edit, then Clear All (see note 1).

          Log Window
                      This window displays SAS's comments while it translates your program text. To view
           this log window, click-on Window, then Log. If your program is error free, messages will be in
           blue; if you have errors which SAS believes it can override, ignore, or correct, a message will
           appear in green; if an error is terminal such that the program crashes, error messages in red will
           appear. As with the output window, always clear the log window before you run a program: be
           sure the log window is up, click-on Edit, then click-on Clear All.

           Note 1: The reason for clearing the output and log screens is simple: after you run a program twice,
           the output and error messages will simply be stacked, with the first program run on top and the second
           program run on the bottom. It can be very confusing to decipher which comments belong to which
           program run. Always clear before each program run.


Example (type the following code into the editor window, and follow my instructions, below)


______________________________________________
             DATA example;
                        INPUT age gender $ income;
                        DATALINES;
                        54 m 45000
                        19 f 37500
                        37 f 67000
             RUN;
             PROC PRINT DATA = example;
             RUN;
             PROC MEANS DATA = example;
             RUN;
______________________________________________



             This program creates a simple dataset of three people with their respective ages, genders (male = m
and female = f) and incomes in dollars. The program then prints out the entire dataset, and calculates
sample statistics of the numerical data, including the mean, standard deviation, minimum and maximum (see note 2).
             Once you type the program code in the editor, click-on the icon of the running person at the top
of the screen to the right (this runs the program), or simply click-on RUN, then SUBMIT.
             SAS will automatically present the output in the output window. The output should look like this:
                                     The SAS System                              Monday, May 7, 2001              6

                                         Obs         age          gender     income

                                         1            54              m      45000
                                         2            19              f      37500
                                         3            37              f      67000

                                     The SAS System                        13:18 Monday, May 7, 2001              7

                                                   The MEANS Procedure

Variable      N            Mean                Std Dev                   Minimum                Maximum

age           3            36.6666667          17.5023808                19.                   54.
income        3            49833.33            15332.43                  37500.                67000.

Note 2: All of these commands are detailed in subsequent sections, below.

Now, view the log window to see how SAS comments: we do not have any errors, thus SAS displays only
blue messages, and black is used for the code you typed in. The log window should look like this:


         44   DATA example;
         45   input age gender $ income;
         46   datalines;

         NOTE: The data set WORK.EXAMPLE has 3 observations and 3 variables.
         NOTE: DATA statement used:
              real time           0.00 seconds

         50   run;
         51   proc print data = example;
         52   run;

         NOTE: There were 3 observations read from the data set WORK.EXAMPLE.
         NOTE: PROCEDURE PRINT used:
                 real time           0.11 seconds

         53   proc means data = example;
         54   run;

         NOTE: There were 3 observations read from the data set WORK.EXAMPLE.
         NOTE: PROCEDURE MEANS used:
                 real time           0.04 seconds



Be sure to clear both log and output windows.
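
The clearing can also be done from inside the program instead of through the Edit menu. The line below is only a sketch and is not part of the original notes: it assumes SAS is running in its interactive windowing environment, where the DM statement passes commands to the windows. Placed as the first line of a program, it should clear the log window and then the output window each time the program is submitted.

         /* assumption: interactive windowing session; clears the log, then the output */
         DM 'LOG; CLEAR; OUTPUT; CLEAR;';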


4.       OPENING FILES, SAVING FILES, PRINTING OUTPUT
         Loading/Opening Files
                  If SAS is not presently loaded, the fastest way to load a program is to boot-up SAS, click-on
                  the editor window (or click-on Window, then Editor), then click-on File and Open. In this
                  class, your files will most likely be on a floppy-disk: once you click-on Open, scroll
                  down the "look-in" box until you find the floppy "A"-drive, and proceed. All SAS program
                  files have the file type “.sas”. Our data files will be of the file type “.dat”.


         Saving Program Code
                  Recall that programs are coded in the editor window. Once you type in a program (you
                  should save any text roughly once every 5 minutes!), click-on File, Save, then scroll-down
                  the "save-in" box until you reach the drive that suits your needs (e.g. the A-drive
                  for floppy disks). Be sure to use file names that are reasonably short and intuitive (for
                  example, do not use "file1.sas"). All SAS programs are automatically saved as ".sas" type
                  files.



                    Warning: be sure you are actually in the EDITOR screen when you save: otherwise,
                   SAS will simply save whatever contents are on the screen, be it output or error messages.


           Saving Output
                   The easiest way to summarize your empirical project results is to save the SAS output to
                   a file and load the file into EXCEL (see note 3) or WORD. To save SAS output, run your program,
                   be sure you are presently in the output window after the program finishes running (if you
                   have any doubt, click-on Window, then Output), then click-on File, Save, scroll-down the
                   “save-in” box, find the appropriate drive, and give your output a useful name. For
                   example, if your SAS program is named "income.sas", then title the output file
                   "income_out".


           Printing Output
                   Once you run a program, simply click-on File, Print, or just click-on the printer icon
                   located at the top of the screen, in the middle.




Note 3: See the section below on using EXCEL to create various types of graphs based on SAS output.

II.          Basic SAS Programming Elements: Data, Proc's, Macros, IML

         Any SAS program incorporates steps for entering data and steps for analyzing data. This short
section will briefly discuss each step without any details on how actually to code a program. The subsequent
section presents specific information on how to enter and look at data.

1.       DATA STEP
         Any SAS program must employ data from some source. In this class, we will usually enter data
from a floppy-disk; however, you can save data to a hard-drive (Drive “C”, for instance) and enter it from
there. Data steps are always of the form (see note 4)

             DATA [dataset name];
                   …….
             RUN;

Each data step requires the command "DATA", a dataset name, code which actually enters the data, and the
command "RUN". Dataset names can incorporate letters, digits and underscores, and must begin with a letter
or an underscore.
        For example:

             DATA d1;
                   INPUT x y;
                   DATALINES;
                   1 4
                   10 -8
             RUN;

This code dictates that a dataset named "d1" is created with two variables, named "x" and "y", and two
observations: x = (1, 10) and y = (4, -8). We can build as many datasets as we like, as well as merge
datasets: see the subsequent section.

2.       PROC STEP
         Usually SAS programmers use "proc" statements for data analysis. Other means for analyzing data
will be briefly mentioned below: in this class, we will always use proc's. The term "proc" is short for
"procedure", which denotes any built-in set of commands. For example, the MEANS procedure in SAS
will automatically calculate data means, variances, etc., while the REG procedure performs basic regression
analysis. You do not need to program in SAS how a sample mean is calculated, because SAS already has all
the details programmed within itself; we could program it ourselves, if we liked, by using the built-in
sub-language called IML (which we will not use in this class). SAS proc's are used to print data, find
sample statistics, perform econometric analysis, create graphs, charts, etc.
         Proc's are coded much like DATA steps. For any proc, we need to specify which data is to be
analyzed. For example, in order to print the entire contents of the dataset created above, we code:

             DATA d1;
                   INPUT x y;
                   DATALINES;
                   1 4
                   10 -8
             RUN;

             PROC PRINT data = d1;
             RUN;

The statement "data = d1" dictates which dataset is to be printed. As with datasets, we can use as
many proc's as we like: the following code creates two datasets, prints both, and displays sample statistics
of one dataset:

         DATA d1;
               INPUT x y;
               DATALINES;
               1 4
               10 -8
         RUN;

         DATA d2;
               INPUT w z;
               DATALINES;
               10 -100
               9 0
         RUN;

         PROC PRINT data = d1;
         RUN;

         PROC PRINT data = d2;
         RUN;

         PROC MEANS data = d1;
         RUN;

Note 4: I will use brackets "[ ]" to denote information that the programmer enters: you never actually type
these brackets in SAS code.

____________________________________

MACROS and SAS-IML
          Although SAS's power is derived from its ability to manage and create large datasets as well as its
ability easily to analyze any dataset by incorporating any one of its several hundred built-in procedures,
there are other means for programming that require substantial effort on the part of the programmer.

         SAS-IML
                The SAS language has built-in to it a sub-language for matrix-oriented mathematics. This
                software is called the Integrated Matrix Language [IML] and can be used to code
                substantially sophisticated econometric commands. SAS's built-in procedures are very
                useful, however they are, ultimately, of limited use: recent advances in
                economic/econometric/statistical theory are NOT programmed into SAS, thus if you
                require a means of data analysis that lies outside of the range of SAS's present abilities,
                then you must program the procedure yourself. IML allows the programmer literally to
                create his/her own procedures that can be called from any SAS program. The IML
                language requires its own syntax, employs matrix algebra and therefore requires extra
                time to learn and a background in higher mathematics.
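
                We will not use IML in this class, but a very small sketch may give a feel for its
                matrix-oriented style. The code below is my own illustration, not part of the original
                notes, and it assumes the SAS/IML product is installed. It computes by hand the sample
                mean of the two x values (1 and 10) used in dataset "d1" above, so the printed value
                should be 5.5.

                PROC IML;
                      x = {1, 10};           /* column vector holding the two x values */
                      n = NROW(x);           /* number of observations */
                      xbar = SUM(x) / n;     /* sample mean computed by hand */
                      PRINT xbar;
                QUIT;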

         SAS MACROS
               SAS's IML is literally a built-in sub-language useful for creating your own hand-written
               econometric analysis. A "macro", by contrast, is a routine that is programmed into SAS
               along with standard DATA and PROC steps. A "macro" requires its own syntax, and can
               be used to create routines that perform sophisticated tasks. Moreover, a macro can be
               written simply to group together standard SAS commands: once this kind of macro is
               written, the programmer simply needs to refer to it by name, and all of the subsequent
               SAS commands associated with the macro name are performed.
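
               As a sketch of the second kind of macro (one that simply groups standard commands), the
               code below wraps PROC PRINT and PROC MEANS in a macro. The macro name "describe" and its
               parameter "ds" are hypothetical names chosen for this illustration; they are not part of
               SAS or of the original notes.

               %MACRO describe(ds);
                     /* print the dataset, then display its sample statistics */
                     PROC PRINT DATA = &ds;
                     RUN;
                     PROC MEANS DATA = &ds;
                     RUN;
               %MEND describe;

               DATA d1;
                     INPUT x y;
                     DATALINES;
                     1 4
                     10 -8
               RUN;

               %describe(d1)

               Once the macro is defined, the single line "%describe(d1)" runs both procedures on "d1",
               and the same macro can be called again with any other dataset name.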




III.     Data: Entering and Examining Economic Information

          In this section, we will learn the basic techniques for entering data into SAS: the two
primary techniques entail writing the data directly into the program, or loading data into SAS from an
external source (e.g. floppy disk). Additionally, we will learn several procedures for performing basic
statistical analysis of our data.
          For this, and all subsequent documents, to familiarize yourself with new SAS commands and
programming techniques, be sure to boot-up SAS and practice the examples I give below. Always feel free
to experiment.
          NOTE: Because SAS is a Windows product, you can simply copy examples of code in this and
any other document and paste the text directly into SAS. In fact, many of the examples, below, were written in
SAS and copy/pasted into WORD! Do as we all do: take code wherever you can find it, study it, and learn
to re-write it yourself.

1.       Internal Data Entry: DATALINES
                  SAS allows the programmer to enter any data directly. For large data sets, this is
         impractical; however, there will be times when the programmer wants to have the data physically
         present in the program. Recall, we enter data in a DATA STEP. For direct data entry, we use the
         code:
                  DATA [dataset name];
                           INPUT var1 var2 [more variable names] varN;
                           DATALINES;
                           [data would be typed here]
                  RUN;

         Notice, there is not a semi-colon ";" after the last line of data; however, we use a semi-colon after
         every line of code. Variable names can use letters, digits and underscores, but must begin with a
         letter or an underscore, and can be no more than 8 characters long in SAS Version 6 (later versions
         allow up to 32 characters). We do not put commas between variable names. The
         INPUT command dictates variable names and the order in which the data will be entered.
         DATALINES dictates that actual data follows. For example, if we want to enter income and ages
         for 5 people, we write:

                  DATA income1;
                         INPUT income age;
                         DATALINES;
                         10000 50
                         75000 43
                         23000 67
                         10000 19
                         100000 56
                  RUN;

         SAS understands that the data is read as "income age", and only requires one space between data
         entries: you can, however, place as many spaces between data entries as you like. Also, you do not
         need to indent code the way I do; however, indented code is much easier to read: you will need my
         help from time-to-time, so you should write your code in a manner that is easy to understand.
                   SAS differentiates between numerical and character variables. For data that is non-numerical,
         use the dollar-sign "$" after (to the right of) the variable name, separated by one space. For
         example, suppose that the above dataset "income1" includes gender information in the form of
         "m" for male and "f" for female. We can write:




              DATA income1;
                     INPUT income age sex $;
                     DATALINES;
                     10000 50 m
                     75000 43 m
                     23000 67 f
                     10000 19 f
                     100000 56 m
              RUN;

     We now have a dataset named "income1" with five observations (5 people), and income, age and
     gender information.

     Example: We want to create a dataset with monthly GNP (in $trillions) information, however not
     all months are present in our sample. We have information for 4 months.

              DATA gnp_mon;
                    INPUT gnp month $;
                    DATALINES;
                    2 jan
                    2.01 march
                    1.99 july
                    2.00 dec
              RUN;

     Thus, we have data for January, March, July and December.
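
     One caveat worth a sketch: when a character variable is read in this simple way, SAS stores at most
     8 characters by default, so a longer month name such as "september" would be cut short. The code
     below is my own illustration with made-up GNP values, not part of the original notes; it uses a
     LENGTH statement before INPUT to reserve 9 characters for "month".

              DATA gnp_mon;
                    LENGTH month $ 9;          /* reserve 9 characters so "september" fits */
                    INPUT gnp month $;
                    DATALINES;
                    2.01 march
                    1.98 september
              RUN;

              PROC PRINT DATA = gnp_mon;
              RUN;

     Because "month" is now declared first, it will appear before "gnp" when the dataset is printed.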

2.   External Data Entry: INFILE, FILENAME, OBS
               By far the most useful approach to data entry is the method of entering data directly from
     a drive, be it hard ("C") or floppy ("A"). We use the INFILE command for such basic entry:

              DATA [dataset name];
                     INFILE 'drive:\folder\folder\…\filename.type';
                     INPUT var1 var2 … varn;
              RUN;

     The INFILE command directs SAS to some drive and sequence of folders. The file directory and
     name must be enclosed in single quotation marks. The file type may be .dat or .txt, depending on the
     files I give you, and ultimately depending on how you yourself make your data files. I will comment
     later on the nature of .dat and .txt files. For example, if our income data exists on a floppy in a file
     named "income_data.dat", we can write:

              DATA income1;
                     INFILE 'a:\income_data.dat';
                     INPUT income age sex $;
              RUN;

              If you plan on entering data from the same drive and file over and over again, you can
     simply give the file a short name as follows:

              DATA [dataset name];
                     FILENAME [file name] 'drive:\folder\folder\…\filename.type';
                     INFILE [file name];
                     INPUT var1 var2 … varn;
              RUN;




     Notice, only spaces are placed between the new file name and the actual directory and drive
     specifications. For example,

              DATA income1;
                     FILENAME inc_file 'a:\income_data.dat';
                     INFILE inc_file;
                     INPUT income age sex $;
              RUN;

     Thus, SAS understands that "inc_file" refers to the location "a:\income_data.dat". You can access
     the same simple file name in subsequent datasets. For example

              DATA income1;
                     FILENAME inc_file 'a:\income_data.dat';
                     INFILE inc_file;
                     INPUT income age sex $;
              RUN;
              DATA income2;
                     INFILE inc_file;
                     INPUT income age sex $;
              RUN;

     This simple program re-names the file for SAS's use, reads in the data, and re-reads the data in a
     second data step: the second data step does not require the file location specification (i.e.
     a:\income_data.dat) because SAS interprets “inc_file” as that location.
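
     If no floppy is at hand, the same idea can be tried with a file that SAS creates itself. The sketch
     below is my own illustration, not part of the original notes: it assumes the TEMP device of the
     FILENAME statement is available in your version of SAS, it writes three made-up records to that
     temporary file, and it then reads the file twice through the name "inc_file". Here FILENAME is placed
     before the data steps rather than inside one, which is a common alternative placement.

              FILENAME inc_file TEMP;        /* assumption: temporary file standing in for a:\income_data.dat */

              DATA _NULL_;                   /* writes the file without creating a dataset */
                     FILE inc_file;
                     PUT '10000 50 m';       /* made-up records */
                     PUT '75000 43 m';
                     PUT '23000 67 f';
              RUN;

              DATA income1;
                     INFILE inc_file;
                     INPUT income age sex $;
              RUN;

              DATA income2;
                     INFILE inc_file;
                     INPUT income age sex $;
              RUN;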
                In many cases, we will not want to use an entire dataset: many datasets contain more than
     50000 observations and more than 200 variables. While a program is still being developed, and is
     being run simply in order to find and remove errors, we may
     want to use only a few observations, and use the entire dataset only when all errors ("bugs") have
     been corrected.
                A simple way to control how many observations are read into a dataset is to use the
     OBS option of the INFILE command. Suppose the file a:\income_data.dat has 10,000 observations, but
     we want only the first 100. Then, we write

              DATA income1;
                     INFILE 'a:\income_data.dat' OBS = 100;
                     INPUT income age sex $;
              RUN;
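
     To try OBS without an external file, the sketch below applies it to data typed into the program. It
     is my own illustration, not part of the original notes, and it relies on the special name DATALINES
     in the INFILE command, which lets INFILE options act on in-program data. With OBS = 2, the printed
     dataset should contain only the first two of the four people.

              DATA income1;
                     INFILE DATALINES OBS = 2;     /* stop reading after the second data line */
                     INPUT income age sex $;
                     DATALINES;
                     10000 50 m
                     75000 43 m
                     23000 67 f
                     10000 19 f
              RUN;

              PROC PRINT DATA = income1;
              RUN;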

3.   Creating New Variables

     3.1      Arithmetic Operations
              During the data entry stage of any data step, we can create new variables using basic
     arithmetic and logic commands. For example:

              DATA income1;
                     FILENAME inc_file 'a:\income_data.dat';
                     INFILE inc_file;
                     INPUT income age sex $;
                             income_sq = income*income;
              RUN;




           The code "income_sq = income*income" creates a new variable named "income_sq" which
           equals income squared (i.e. income_sq = income^2). SAS understands that the operation is to be
           performed for all data observations. Mathematical symbols and functions include

                      *      times
                      **     to the power of
                      -      minus
                      +      plus
                      /      divide
                      log    natural log
                      exp    the exponential function (i.e. exp(x) = e^x, e = 2.71828...)

           Thus, we could have written "income_sq = income**2".
                    For example, if we read in variables x and y, and we want ln(x), x^4, x - y and x/y as new
           variables, we can write

                      DATA d1;
                            INPUT x y;
                                    x_4 = x**4;
                                    ln_x = log(x);
                                    xmy = x - y;
                                    xdy = x/y;
                            DATALINES;
                            10000 50
                            75000 43
                            23000 67
                      RUN;

           Note that SAS will now understand that the dataset "d1" has 6 variables: x, y, ln_x, x_4, xmy and
           xdy.
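
           A few further cases are sketched below. This is my own illustration with made-up data, and the
           dataset name "arith" is hypothetical. It shows a fractional power used as a square root, the
           exponential function undoing the natural log, and a division by zero, for which SAS stores a
           missing value (printed as a period) and writes a note in the log window.

                      DATA arith;
                            INPUT x y;
                                    root_x = x**0.5;       /* square root of x via the power operator */
                                    ln_x   = log(x);       /* natural log of x */
                                    back_x = exp(ln_x);    /* exp undoes log: equals x up to rounding */
                                    ratio  = x/y;          /* missing (.) when y = 0 */
                            DATALINES;
                            100 4
                            25 0
                      RUN;

                      PROC PRINT DATA = arith;
                      RUN;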

           3.2        Logical Operations
                                Many variables should only be constructed when a condition is satisfied, or
                      perhaps a variable's value depends not on specific values of other variables (e.g. ln_x =
                      log(x)), but rather on value ranges. For such derivations, we use IF, THEN, ELSE logical
                      operations with connectors AND and OR.
                                Consider, for example, that we have a variable “ed” that denotes the number of
                      years of education. In the U.S., if ed >= 12, we would understand that the individual
                      graduated from high school. Likewise, if ed >= 16, we might conclude that the individual
                      has a basic degree from a university. In econometric analysis, we often want to know
                      both what impact the number of years of education has on income, as well as whether
                      graduating from high school has an impact on income (see note 5). For such information,
                      we will want to create a “dummy”, or “binary”, variable (see note 6) that equals 1 if the
                      individual graduated from high school, and 0 otherwise: all we want from these variables is
                      the simple information of whether they graduated or not.

                      Note 5: After all, 11 years is not much less than 12 years (and 11.75 years does not mean
                      the individual graduated from high school!), but a high school diploma will signal to many
                      employers a certain skill level in the laborer, a certain degree of dedication that people
                      who quit high school early may not have.
                      Note 6: We will study the use and implications of dummy variables throughout the semester.

                                For example, suppose we read in data on income, education and age, and we
                      want to create variables that represent whether that individual has a high school or
                      college education or not:

         DATA income1;
                INPUT income ed age;

                  IF ed GE 12 THEN hs = 1;
                          ELSE IF ed LT 12 THEN hs = 0;
                  IF ed GE 16 THEN college = 1;
                          ELSE IF ed LT 16 THEN college = 0;
                  DATALINES;
                  10000 15 45
                  24000 18 54
                  31000 9 69
         RUN;

The code literally states that if the education level of an individual is greater than or
equal to [GE] 12, then a new variable, named “hs”, is set equal to 1. However [ELSE], if
years of education is less-than [LT] 12, then the variable “hs” is set to 0. Likewise, if
education is greater than or equal to [GE] 16, a new variable, named “college” is set
equal to 1. However [ELSE], if the number of years of education is less than [LT] 16, then
“college” is set to 0. Clearly, the first person has a high school education but not a
college education, so hs = 1 and college = 0 for the first individual. If we run the above
program and print the dataset, then the output looks like this:

                                 The SAS System              21:10 Wednesday, May 9, 2000
            Obs     income      ed    age    hs       college

              1       10000      15        45     1       0
              2       24000      18        54     1       1
              3       31000       9        69     0       0

As usual, the dataset has 5 variables: the three original variables and the two new dummy
variables.
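
The same dummies can be built more compactly, because SAS evaluates a comparison to 1 when it is true and to 0 when it is false. The following is a minimal sketch of that alternative, not the form used in these notes; the dataset name income1b is hypothetical and chosen only to avoid overwriting income1:

```sas
DATA income1b;                 /* hypothetical name for illustration */
    INPUT income ed age;
    hs      = (ed GE 12);      /* 1 if at least 12 years of education, 0 otherwise */
    college = (ed GE 16);      /* 1 if at least 16 years of education, 0 otherwise */
    DATALINES;
10000 15 45
24000 18 54
31000 9 69
;
RUN;

PROC PRINT DATA = income1b;
RUN;
```

A missing value of ed compares as less than any number, so both this form and the IF/ELSE form assign 0 in that case.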
         The comparison and logical operators available are as follows:


      Operator: Definition                       Symbol
 EQ: equal to                               =
 NE: not equal to                           ^=
 GT: greater than                           >
 LT: less than                              <
 GE: greater than or equal to               >=
 LE: less than or equal to                  <=
 NOT: not                                   ^
 AND: and                                   &
 OR: or                                     |
         Consider a more complicated piece of information. Suppose we want a variable
for people aged 50 or older who have at least 14 years of education (i.e. they are high
school graduates from before the 1980's with at least some college education). We can
use the AND and OR operators as follows:




         DATA income1;
                INPUT income ed age;

                IF ed GE 14 AND age GE 50 THEN coll_50 = 1;
                        ELSE IF ed LT 14 OR age LT 50 THEN coll_50 = 0;
                DATALINES;
                10000 15 45
                24000 18 54
                31000 9 69
         RUN;

              Thus, only if a person is at least 50 years old and [AND] they have at least 14 years of
              education will the new variable “coll_50” be set to 1. However, if they are too young (age
              < 50) or [OR] if they have too little education, then they do not satisfy our compound
              criteria, and the new variable “coll_50” is set to 0. If we print the dataset, we find


                                                The SAS System            21:22 Wednesday, May 9, 2000
                              Obs     income      ed    age    coll_50

                                1       10000      15     45        0
                                2       24000      18     54        1
                                3       31000       9     69        0



              Only the second individual satisfies both criteria: she is both at least 50 years old AND
              has at least 14 years of education.

4.   Creating Datasets from Existing Datasets: MERGE, SET

               Often, we will want to use the information in one dataset to build another dataset
     quickly. For example, we may read in information on wages, hours worked, and taxes paid for
     1000 people, and read in from another source basic demographic information on the same 1000
     people: education, marital status, age, gender, and
     number of children. Or, we may find in one data source on the web information on a country's
     GNP, interest rates, unemployment rate and inflation rate for the period 1970-1979, and from
     another data source the same information for the period 1980-1989. In order to use all of the data
     at once during the stage of econometric analysis, we will want to build one dataset containing all
     relevant information (all variables concerning one person, or all time periods concerning several
     economic quantifiers).
               Two simple techniques for such dataset blending are the MERGE and SET
     commands, which can be employed during any data step.

     4.1    MERGE
            Consider the following code which builds two datasets containing, variously, economic
     and demographic data, about the same group of people:




              DATA income1;
                     INPUT income taxes hours;
                     DATALINES;
                     10000 100 54
                     75000 23000 38
                     23000 3000 40
              RUN;
         DATA demog1;
               INPUT age gender;
                      /* gender = 1 if male, gender = 0 if female */
               DATALINES;
               27 1
               64 0
               43 0
         RUN;

Note that any text between the items /* */ is treated as a comment, and ignored by SAS. The
variable gender is simply a dummy variable representing male if the value is 1 and female if the
value is 0. To merge these datasets, we code a third data step as follows:

         DATA inc_dem;
                MERGE income1 demog1;
         RUN;

Now, the new dataset "inc_dem" contains 5 variables: income, taxes, hours, age and gender. If we
print the dataset, we observe

                                         The SAS System            10:59 Thursday, May 10, 2000
                   Obs    income      taxes    hours    age     gender

                    1       10000       100           54   27       1
                    2       75000     23000           38   64       0
                    3       23000      3000           40   43       0

SAS literally places the two datasets side-by-side.

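WARNING: your datasets must have the observations arranged in the same order to
ensure that information for the same individual is merged: without a BY statement, SAS pairs the
first observation of each dataset, then the second, and so on.
WARNING: in order to merge datasets with different information concerning the same people,
no variable names can be shared between datasets: when a name is shared, the values from the
dataset listed last overwrite the others.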
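
One way to avoid relying on row order is a match-merge on a key variable. The sketch below assumes each person carries an identifier, here called id; that variable and the dataset names income2, demog2 and inc_dem2 are hypothetical and do not appear in the datasets above:

```sas
DATA income2;                       /* hypothetical dataset with an id key */
    INPUT id income taxes hours;
    DATALINES;
3 23000 3000 40
1 10000 100 54
2 75000 23000 38
;
RUN;

DATA demog2;                        /* same people, listed in a different order */
    INPUT id age gender;
    DATALINES;
1 27 1
2 64 0
3 43 0
;
RUN;

/* A match-merge requires both datasets to be sorted by the key */
PROC SORT DATA = income2;
    BY id;
RUN;
PROC SORT DATA = demog2;
    BY id;
RUN;

DATA inc_dem2;
    MERGE income2 demog2;
    BY id;                          /* rows are paired by id, not by position */
RUN;
```

Here id is the only shared variable, and because it is the BY variable it is used for matching rather than overwritten.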

4.2      SET
         The command SET is used to stack (i.e. concatenate) different datasets which have the
same variables and variable types. This is particularly useful for combining different datasets with
time-series information. Consider the example given above: suppose we find in one data source on the
web information on a country's GNP and unemployment rate for the period 1970-1974, and from
another data source the same information for the period 1975-1979. We will want to combine the
data, however we do not want to perform a side-by-side merge in the manner that was performed
above. We want the data to be stacked vertically, with the years 1970-1974 on top, and the years
1975-1979 on the bottom:




              DATA data_70;
                    /* contains data for the years 1970-1974 */
                    /* GNP is in billions; unemployment rate is a percent: e.g. 6 denotes 6% = .06 */
                    INPUT gnp ue_rate;
                    DATALINES;
                    3000 4
                    3100 3.9
                    3120 3.92
                    3110 4.1
                    2900 4.3
              RUN;
              DATA data_75;
                    /* contains data for the years 1975-1979 */
                    INPUT gnp ue_rate;
                    DATALINES;
                    2910 4.2
                    3000 4.1
                    3000 4
                    3100 3.7
                    3300 3.2
              RUN;

              DATA data_70_75;
                    SET data_70 data_75;
              RUN;

     Notice that we have created a third dataset named "data_70_75" containing all the information
     from the years 1970-1979. The SET command will automatically concatenate (stack) the data with
     the dataset stated first (i.e. data_70) on top, and the second dataset on the bottom. If we print the
     dataset, we observe:

                                               The SAS System              10:59 Thursday, May 10, 2000

                                       Obs      gnp     ue_rate

                                         1    3000       4.00
                                         2    3100       3.90
                                         3    3120       3.92
                                         4    3110       4.10
                                         5    2900       4.30
                                         6    2910       4.20
                                         7    3000       4.10
                                         8    3000       4.00
                                         9    3100       3.70
                                        10    3300       3.20

     Thus, all of the relevant data was stacked with 1970 on top and 1979 on the bottom.
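
The stacked dataset carries no variable recording which year each row belongs to. A minimal sketch of one way to add it, assuming data_70 and data_75 exist as created above and hold one row per year in chronological order; the dataset name data_70_79 and the variable year are hypothetical:

```sas
DATA data_70_79;                /* hypothetical name for the labelled dataset */
    SET data_70 data_75;        /* stack 1970-1974 on top of 1975-1979 */
    year = 1969 + _N_;          /* _N_ counts DATA step iterations: 1, 2, ..., 10 */
RUN;

PROC PRINT DATA = data_70_79;
RUN;
```

This relies entirely on the row order being correct, so it is only as reliable as the order in which the data were entered.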

5.   Describing Data: Simple Data Inspection
             In this section, we will learn the following procedures for basic visual inspection of our
     data:
                       PRINT
                       SORT
                       CONTENTS


     5.1      PROC PRINT
     This procedure is used to print entire or partial datasets. Consider the examples:

         DATA income1;
                INPUT income taxes hours;
                DATALINES;
                10000 100 54
                75000 23000 38
                23000 3000 40
         RUN;

         PROC PRINT DATA = income1;
         RUN;

Notice that we must specify which dataset is to be printed. Unless we state otherwise, SAS will
print the entire set. The output window will contain:

                                             The SAS System              10:59 Thursday, May 10, 2000

                              Obs     income        taxes     hours

                               1       10000          100      54
                               2       75000        23000      38
                               3       23000         3000      40

Consider delineating specific variables to be printed.

         DATA income1;
                INPUT income taxes hours;
                DATALINES;
                10000 100 54
                75000 23000 38
                23000 3000 40
         RUN;

         PROC PRINT DATA = income1;
               VAR income;
         RUN;

Here, we specify that we want only the variable [VAR] "income" to be printed. The output
window contains:

                                              The SAS System              10:59 Thursday, May 10, 200

                                        Obs      income

                                         1       10000
                                         2       75000
                                         3       23000

Finally, consider printing several variables, but not all that exist in the dataset:




         DATA income1;
                INPUT income taxes hours;
                DATALINES;
                10000 100 54
                75000 23000 38
                23000 3000 40
         RUN;
         PROC PRINT DATA = income1;
               VAR taxes hours;
         RUN;

We can delineate as many or as few variables as we like: as with other SAS command structures,
we do not use commas between the variable names. The output window contains:

                                             The SAS System                     10:59 Thursday, May 10, 200

                                   Obs      taxes        hours

                                    1         100             54
                                    2       23000             38
                                    3        3000             40
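
Printing can also be restricted by rows rather than by variables. A short sketch, assuming the dataset income1 from above already exists: the OBS= dataset option limits how many observations are read, and the NOOBS option suppresses the "Obs" column.

```sas
/* Print only the first two observations, without the Obs column */
PROC PRINT DATA = income1 (OBS = 2) NOOBS;
    VAR income taxes;
RUN;
```

This is convenient for a quick look at the top of a large dataset without filling the output window.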

5.2      PROC SORT

         Sorting data is intuitive and simple. Consider sorting the above dataset "income1"
according to income (i.e. we want to sort all individuals and all variables with individuals who
have the smallest incomes at the "top" of the dataset, and individuals with the largest incomes at
the "bottom" of the dataset). We write the following code:

         DATA income1;
               INPUT income taxes hours;
               DATALINES;
               10000 100 54
               75000 23000 38
               23000 3000 40
         RUN;
         PROC SORT DATA = income1;
               BY income;
         RUN;
         PROC PRINT DATA = income1;
         RUN;

The syntax here is the same as with PROC PRINT: we must tell SAS which dataset is to be
sorted. Moreover, whenever we sort data, the sort must be according to, or BY, some criterion.
The output window contains:
                                        The SAS System                     13:32 Thursday, May 10, 2001   1

                             Obs        income        taxes        hours

                              1          10000          100         54
                              2          23000         3000         40
                              3          75000        23000         38

Note that the dataset is now permanently changed. Whenever you refer to this dataset, SAS will
interpret it as sorted according to income.
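
If the original ordering should be preserved, PROC SORT can write the sorted rows to a separate dataset with the OUT= option. A minimal sketch, assuming income1 exists as above; the name income_sorted is hypothetical:

```sas
/* income1 is left untouched; the sorted copy goes to income_sorted */
PROC SORT DATA = income1 OUT = income_sorted;
    BY income;
RUN;

PROC PRINT DATA = income_sorted;
RUN;
```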
          We can use the DESCENDING option to dictate that the data are to be sorted from the
highest value of the BY variable to the lowest value:

         DATA income1;
                INPUT income taxes hours;
                DATALINES;
                10000 100 54
                75000 23000 38
                23000 3000 40
         RUN;
         PROC SORT DATA = income1;
               BY DESCENDING income;
         RUN;
         PROC PRINT DATA = income1;
         RUN;

The output window contains

                                          The SAS System             13:32 Thursday, May 10, 200

                            Obs     income        taxes   hours

                             1       75000        23000    38
                             2       23000         3000    40
                             3       10000          100    54

          We can also sort according to several criteria. For example, suppose we have data on
stock traders' names, and the net number of stock shares traded (positive values denote net
purchases; negative values denote net sales). Our dataset contains the information:

          First NAME         Last NAME          STOCK SHARES
          Frank              Smith              10
          Betty              Jones              5
          Betty              Jones              10
          Frank              Smith              100
          Frank              Albert             40
          Betty              Jones              50
          Frank              Albert             20
          Betty              Jones              45
We want to read in this data, and sort by last name, then first name, and finally by number of
shares traded. By last name, Albert comes first, with stock shares traded in volumes of 40 and 20:
Albert's records will therefore appear first, sorted with 20 then 40 shares traded. We code as follows:




         DATA trades;
               /* fname and lname are character variables (marked by $); shares is numeric */
               INPUT fname $ lname $ shares;
               DATALINES;
               Frank        Smith                                       10
               Betty        Jones                                       5
               Betty        Jones                                       10
               Frank        Smith                                       100
               Frank        Albert                                      40
               Betty        Jones                                       50
               Frank        Albert                                      20
               Betty        Jones                                       45
         RUN;
         PROC SORT DATA = trades;
               BY lname fname shares;
         RUN;
         PROC PRINT DATA = trades;
         RUN;

        The output window displays:

                                              The SAS System                 13:32 Thursday, May 10, 2001   19

                                    Obs       fname    lname        shares

                                      1       Frank    Albert          20
                                      2       Frank    Albert          40
                                      3       Betty    Jones            5
                                      4       Betty    Jones           10
                                      5       Betty    Jones           45
                                      6       Betty    Jones           50
                                      7       Frank    Smith           10
                                      8       Frank    Smith          100

        5.3       PROC CONTENTS

                  If we simply want to know basic structural (i.e. non-statistical) information about a
        dataset, we can use the CONTENTS procedure. This is especially helpful when our econometric
        results do not appear the way we expected them to: we may have damaged data, and one easy way
        to detect the damage is to inspect the basic dataset properties. The procedure CONTENTS details
        the number of variables, observations, and deleted observations, together with the type and
        length of each variable. It does not count missing values (some variables may not exist for
        some people or during some periods); for those, use the NMISS keyword of PROC MEANS. If your
        dataset is too large to inspect visually in EXCEL, then CONTENTS can provide a quick peek.
        For example, consider the dataset “income1” with
        income, taxes and hours worked for three people.

                  DATA income1;
                        INPUT income taxes hours;
                        DATALINES;
                        10000 100 54
                        75000 23000 38
                        23000 3000 40
                  RUN;
                  PROC CONTENTS DATA = income1;
                  RUN;

                As usual, we need to dictate which dataset is to be inspected by CONTENTS. The
        output window contains:
                                                  The SAS System                   17:49 Friday, May 11, 2000

                                          The CONTENTS Procedure

Data Set Name:   WORK.INCOME1                              Observations:           3
Member Type:     DATA                                      Variables:              3
Engine:          V8                                        Indexes:                0
Created:         17:49 Friday, May 11, 2000                Observation Length:     24
Last Modified:   17:49 Friday, May 11, 2000                Deleted Observations:   0
Protection:                                                Compressed:             NO
Label:


                              -----Engine/Host Dependent Information-----

Data Set Page Size:          4096
Number of Data Set Pages:    1
First Data Page:             1



Max Obs per Page:                     168
Obs in First Data Page:               3
Number of Data Set Repairs:           0
File Name:                            C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\SAS Temporary
                                      Files\_TD1148\income1.sas7bdat
Release Created:                      8.0101M0
Host Created:                         WIN_PRO


                               -----Alphabetic List of Variables and Attributes-----

                                           #    Variable    Type    Len    Pos
                                           -----------------------------------
                                           3    hours       Num       8     16
                                           1    income      Num       8      0
                                           2    taxes       Num       8      8

          Like most procedures, this procedure permits many internal commands that direct SAS to display
specific information that is not displayed by default: consult SAS's help screen[7]. For example, SAS permits
many optional commands that can be entered after the “DATA =” statement:

           PROC CONTENTS DATA = dataset [option] [option] … [option];
           RUN;

For example, two such optional commands are:

           Specify the output data set                                                              OUT =
           Print a list of the variables by their position in the data set                          VARNUM

          Thus, the programmer can save the CONTENTS output to another dataset, as well as list variables
in the order in which they appear in the dataset, as opposed to in alphabetical order (see the example
above). Such a variable listing can be helpful if you have many (e.g. 50, 100, 200) variables, and you want
to check whether you are reading the data in the right order (e.g. does “income” come before “taxes”? If you
have the wrong order in your program, then your income variable will contain numerical information about
taxes).
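
A short sketch of both options together, assuming income1 exists as above; the output dataset name income1_vars is hypothetical. The OUT= dataset holds one row per variable, with columns such as NAME, TYPE, LENGTH and VARNUM:

```sas
/* List variables in dataset order and save the description to a dataset */
PROC CONTENTS DATA = income1 VARNUM OUT = income1_vars;
RUN;

/* Inspect the saved description: one row per variable */
PROC PRINT DATA = income1_vars;
    VAR name type length varnum;
RUN;
```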

6.         Describing Data: Simple Data Analysis
                     In this section, we will learn the following procedures for basic statistical inspection of
           our data:
                               MEANS
                               CORR
                               UNIVARIATE

           6.1       PROC MEANS
                     PROC MEANS creates and displays basic sample statistics, confidence interval and
           simple hypothesis test information, including the sample mean, variance, standard deviation, the
           minimum and maximum values of specified variables, and t-tests for the null hypothesis that the
           mean of a variable is zero. If no specifications are provided, SAS will automatically display results
           for all variables. For example:

                      DATA income1;
                            INPUT income taxes hours;
                            DATALINES;
                            10000 100 54
                            75000 23000 38
                            23000 3000 40
                      RUN;
                      PROC MEANS DATA = income1;
                      RUN;

[7] If your version of SAS is 6.0 or greater, then a very useful help-screen should be installed. For all commands and procedures we
employ, you should always search the help screen for further information. Simply click on the “book” icon to the upper-right.

The output window contains:

                                             The SAS System                    17:49 Friday, May 11, 200

                                     The MEANS Procedure


     Variable    N            Mean         Std Dev         Minimum         Maximum
     -----------------------------------------------------------------------------
     income      3        36000.00        34394.77        10000.00        75000.00
     taxes       3         8700.00        12468.76     100.0000000        23000.00
     hours       3      44.0000000       8.7177979      38.0000000      54.0000000
     -----------------------------------------------------------------------------

If you want information for select variables, use the VAR statement:

          PROC MEANS DATA = income1;
                VAR income taxes;
          RUN;




                                           The SAS System                      17:49 Friday, May 11, 200
                                     The MEANS Procedure


     Variable    N             Mean         Std Dev         Minimum         Maximum
     -----------------------------------------------------------------------------
     income      3         36000.00        34394.77        10000.00        75000.00
     taxes       3          8700.00        12468.76     100.0000000        23000.00
     -----------------------------------------------------------------------------

The PROC MEANS statement accepts options and statistic keywords that dictate which information
is to be displayed, while statements such as VAR select the variables. If you use such keywords,
SAS will only provide that information, and omit all other statistics.

          PROC MEANS [option(s)] [statistic-keyword(s)];

Statistic keywords include:
CLM               100(1 - α)% confidence limits for the mean, where α is determined by the “ALPHA=” option, and
                  the default is α = .05
CV                coefficient of variation
KURTOSIS          the kurtosis of the data
MAX               maximum value
MEAN              sample mean
MIN               minimum value
N                 number of non-missing observations[8]
NMISS             number of missing observations
PRT               two-tailed p-value for T: the probability that a t-distributed random variable exceeds, in
                  absolute value, the t-statistic we have derived
RANGE             range, MAX - MIN
SKEWNESS          the skew of the data
STD               standard deviation
STDERR            standard error of the mean
SUM               sum
T                 t value for H0: population mean = 0, with n - 1 degrees of freedom
VAR               variance

The related procedure PROC SURVEYMEANS, designed for survey-sample data, accepts further keywords
that PROC MEANS does not:
ALL               all statistics listed
CLSUM             100(1 - α)% confidence limits for the SUM, where α is determined by the “ALPHA=”
                  option and the default is α = .05
DF                degrees of freedom for the t test
NOBS              number of non-missing observations
VARSUM            variance of the SUM
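
As an illustration of requesting specific statistics, the following sketch assumes income1 exists as above. Only the keywords listed on the PROC MEANS statement are reported; MAXDEC= limits the number of decimal places displayed.

```sas
/* Report only N, the mean, the standard deviation and the standard error */
PROC MEANS DATA = income1 N MEAN STD STDERR MAXDEC = 2;
    VAR income taxes;
RUN;
```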

        Comments
              All of the above statistics are derived as sample statistics. For details, consult the
              textbook, or any introductory-level textbook in statistics:


                 KURTOSIS = $\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^4$
                          derived as a sample analogue of $E\left[(x - \mu_x)^4\right]$

                 MEAN = $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
                          derived as a sample analogue (estimate) of $E[x]$

                 SKEWNESS = $\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^3$
                          derived as a sample estimate of $E\left[(x - \mu_x)^3\right]$

                 STD = $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$
                          s, the estimate of the standard deviation of the population, $\sigma$,
                          provided the data are i.i.d.

                 STDERR = $s_{\bar{x}} = \sqrt{\dfrac{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}{n}} = \sqrt{\dfrac{STD^2}{n}}$
                          usually referred to as the standard error of the mean, or $s_{\bar{x}}$.

                          This is an estimator of the standard deviation of the sample mean $\bar{x}$, whose variance is

                          $$V[\bar{x}] = V\left[\sum_{i=1}^{n}\frac{x_i}{n}\right] = V\left[\frac{1}{n}\sum_{i=1}^{n} x_i\right] = \frac{1}{n^2}V\left[\sum_{i=1}^{n} x_i\right] = \frac{1}{n^2}\,n\sigma^2 = \frac{\sigma^2}{n}$$

                          provided the data are i.i.d.

[8] Sometimes datasets do not contain complete information: some people in the dataset may not have
recorded values of some data, like age, education, etc.


VAR = $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2 = STD^2$
        $s^2$, the (sample) estimate of the variance of the population, $\sigma^2$,
        provided the data are i.i.d.


T=              x
                          1 n
                              xi  x
                        n  1 i 1
                                                     2



                                   n

        The sample mean of an iid process xi divided by the standard deviation of that
        sample mean, converges to a mean-zero normal random variable under null
        hypothesis that the true mean of the process x is zero.
        Therefore we know:

        H 0 : E[ x]   x  0
                        x                 x
           Z                                  N (0,1) if null is true
                    V x              2
                                          n

        This Z statistic is accompanied with a two-tailed p-value. Consider the case
        where x = 10. Then, a p-value for our null is the probability statement

                                     
                    P | x | 10  2 P x  10              
                                                             
                                                             
                                                x  0 10  0 
                                           2 P             
                                                 2     2 
                                                             
                                                n        n 

        Because the random variable
                        x
                                  N (0,1) if the null is true
                        2
                            n

        we can use the standard normal table to look up the probability that a standard
        normal variable exceeds the cut-off value




                                              23
                                              10  0
                                                 2
                                                   n
                                  Of course, we do not know the true variance 2, thus, employing a sample
                                  estimate of the variance, the resulting random variable with roughly be t-
                                  distributed with n –1 degrees of freedom9:



        \[ t = \frac{\bar{x}}{\sqrt{\left[ \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 \right] / n}} \sim t_{n-1} \]

                      Example:

                                  Consider a dataset with information on stock returns:

                      DATA stocks;
                            INPUT return;
                            /* the lone semicolon after the last value ends the data lines */
                            DATALINES;
                            1
                            2
                            -4
                            5
                            0
                            ;
                      RUN;

                            /* Then we run the following three PROC MEANS */
                      PROC MEANS DATA = stocks CLM ALPHA = .01;
                      RUN;
                      PROC MEANS DATA = stocks CLM ALPHA = .05;
                      RUN;
                      PROC MEANS DATA = stocks T PRT SKEWNESS KURTOSIS MEAN VAR;
                      RUN;

                              The output will be stacked in order of the MEANS statements. The first output
                      page contains the results of a 99% Confidence Interval:

                                   The SAS System          17:49 Friday, May 11, 2000
                                 The MEANS Procedure
                              Analysis Variable : return

                                Lower 99%       Upper 99%
                              CL for Mean     CL for Mean
                              ----------------------------
                               -5.9352101       7.5352101
                              ----------------------------
                      Notice that the “ALPHA = .01” option dictates a 1 - .01 = .99 Confidence Interval.
                               The second output page contains the results of a 95% Confidence Interval:

                                   The SAS System          17:49 Friday, May 11, 2000
                                 The MEANS Procedure
                              Analysis Variable : return

                                Lower 95%       Upper 95%
                              CL for Mean     CL for Mean
                              ----------------------------
                               -3.2615890       4.8615890
                              ----------------------------

                      Footnote 9. The t-statistic will be exactly t-distributed if the data is normally distributed. This is a
                      fundamental reason why many economists assume their data is made up of normal random variables.

                             The third output page contains the results of the one-sample t-test, together with
                    the sample skewness, kurtosis, mean and variance:




                                   The SAS System          12:00 Sunday, May 13, 2000
                                 The MEANS Procedure
                              Analysis Variable : return

t Value    Pr > |t|      Skewness      Kurtosis          Mean      Variance
---------------------------------------------------------------------------
   0.55      0.6135    -0.4199926     1.2201939     0.8000000    10.7000000
---------------------------------------------------------------------------

                              Notice that we cannot reject the null hypothesis: the data are consistent with
                    a mean-zero random variable. Recall that we reject a null hypothesis when the associated
                    p-value is less than the size of the test. In this case, if we choose the size to be 5%, then
                    clearly 61% > 5%, hence we cannot reject. When the p-value is less than the size (e.g.
                    suppose the p-value were .02), a sample mean this far from zero would be too unlikely if
                    the data had been generated by a mean-zero random variable; consequently, we reject the hypothesis.
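
                    As a rough check on the output above, the t-statistic, p-value and 95% confidence
                    limits can be reproduced by hand in a DATA step. This is only a sketch: the dataset
                    name ttest_check and its variable names are made up for illustration, and the mean
                    and variance are typed in from the PROC MEANS output rather than computed.

```sas
DATA ttest_check;              /* hypothetical dataset name */
      n    = 5;                /* number of observations in stocks */
      xbar = 0.8;              /* sample mean from PROC MEANS */
      s2   = 10.7;             /* sample variance from PROC MEANS */
      se   = SQRT(s2 / n);     /* estimated std. deviation of the sample mean */
      t    = xbar / se;        /* t-statistic for H0: mean = 0 */
      p    = 2 * (1 - PROBT(ABS(t), n - 1));      /* two-tailed p-value */
      lower95 = xbar - TINV(.975, n - 1) * se;    /* 95% confidence limits */
      upper95 = xbar + TINV(.975, n - 1) * se;
RUN;

PROC PRINT DATA = ttest_check;
RUN;
```

                    The printed t, p, lower95 and upper95 should agree, up to rounding, with the
                    t Value, Pr > |t| and 95% confidence limits reported by PROC MEANS.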

          6.2      PROC CORR
                   We employ PROC CORR to derive sample correlation coefficients for variables in a
          dataset. Consider data on income, wages, gender, etc., derived from the 1978 Current Population
          Survey [CPS], a U.S. dataset built by the U.S. Bureau of Labor Statistics [BLS]. You will use this
          dataset for several projects in this course. For simple correlation coefficients between several
          variables (defined in footnote 10), we write:

          DATA cps78;
                INFILE 'a:\cps78.dat';
                INPUT ED SOUTH NONWHITE HISPANIC FEMALE MARRIED
                      MARRFE TENURE TENURE_2 UNION ln_wage AGE NUM_DEP;
          RUN;

          PROC CORR DATA = cps78;
                VAR ED FEMALE MARRIED TENURE UNION NUM_DEP;
          RUN;

                    Here, I specify only a subset of the available variables for correlation analysis. SAS
          automatically prints basic statistical information, including the sample means, standard deviations,
          minima and maxima. Notice that PROC MEANS would be more useful for hypothesis testing and
          confidence interval creation, as well as the generation of higher moments, like the skewness and
          kurtosis.
                    The output, by default, includes sample statistics, correlation coefficients between all
          variables, and the p-value for the null hypothesis that the true correlations are zero. Like any
          standard hypothesis test at the 5%-level, if the resulting p-value is less than .05, we reject the null
          hypothesis that the true correlation is zero, and conclude that, irrespective of the actual sample
          value, we have reasonable evidence that the true correlation is different from zero.

          Footnote 10. ED = years of education; SOUTH = 1 if the person lives in a southern state; NONWHITE = 1 if the
          person is black, Asian or Hispanic; FEMALE = 1 if female; MARRIED = 1 if married; MARRFE = 1 if the person is a
          married female; TENURE = years in their present job; TENURE_2 = TENURE squared; UNION = 1 if a member of a
          union; ln_wage = ln(wage); NUM_DEP = number of children and other dependents in the household.




                                                                                         The SAS System      12:00 Sunday, May 13, 2000
                                              The CORR Procedure

             6     Variables:     ED          FEMALE   MARRIED     TENURE    UNION      NUM_DEP

                                              Simple Statistics

 Variable              N               Mean       Std Dev              Sum           Minimum      Maximum

 ED                  550        12.53636          2.77209             6895           1.00000      18.00000
 FEMALE              550         0.37636          0.48491        207.00000                 0       1.00000
 MARRIED             550         0.65273          0.47654        359.00000                 0       1.00000
 TENURE              550        18.71818         13.34653            10295           1.00000      55.00000
 UNION               550         0.30545          0.46102        168.00000                 0       1.00000
 NUM_DEP             550         0.98909          1.28600        544.00000                 0       8.00000


                            Pearson Correlation Coefficients, N = 550
                                    Prob > |r| under H0: Rho=0

                       ED         FEMALE          MARRIED          TENURE             UNION        NUM_DEP

ED               1.00000        0.06365           -0.08212        -0.34708       -0.12273         -0.06171
                                 0.1360             0.0543          <.0001         0.0039           0.1483

FEMALE           0.06365        1.00000           -0.24526        -0.11727       -0.12408         -0.08687
                  0.1360                            <.0001          0.0059         0.0036           0.0417

MARRIED          -0.08212       -0.24526           1.00000         0.29188           0.14378       0.24051
                   0.0543         <.0001                            <.0001            0.0007        <.0001

TENURE           -0.34708       -0.11727           0.29188         1.00000           0.19045      -0.04401
                   <.0001         0.0059            <.0001                            <.0001        0.3029

UNION            -0.12273       -0.12408           0.14378         0.19045           1.00000       0.09780
                   0.0039         0.0036            0.0007          <.0001                          0.0218

NUM_DEP          -0.06171       -0.08687           0.24051        -0.04401           0.09780       1.00000
                   0.1483         0.0417            <.0001          0.3029            0.0218




     Comments:
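           1.    The true correlation and sample correlation coefficients are respectively

                 \[ \rho_{x,y} = \frac{\mathrm{cov}(x,y)}{\sigma_x \sigma_y} = \frac{E[(x - \mu_x)(y - \mu_y)]}{\sqrt{V[x]}\,\sqrt{V[y]}} \]

                 \[ \hat{\rho}_{x,y} = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2}} \]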

             2.    I put in bold the p-values (SAS does not put these in bold); each one appears
                   directly beneath its correlation coefficient. Notice that these are p-values for the
                   test of the hypothesis that the true correlation is zero.
             3.    The correlation between any variable and itself is always one (can you prove
                   this by using the above formulas?)
             4.    The symbol “<” of course means “less than”, hence “<.0001” means
                   the p-value is smaller than .0001. This, of course, is a very small p-value,
                   implying that the null hypothesis that the true correlation is zero should be
                   strongly rejected.
             5.    Notice the negative correlations between education and number of dependents,
                   union membership and work tenure: more education for Americans in the
                   1970s implied for many people less time for child bearing/rearing, especially
                   for females, while more educated Americans tended not to participate in labor
                   organizations. Moreover, not surprisingly, more education tended to be
                   associated with fewer years in the labor force due to the time required to go to
                   school. Note, however, that the correlation between ED and NUM_DEP
                   (p-value .1483) is not significantly different from zero at the 5% level.
             6.             What are the means of binary (i.e. dummy) random variables? How do
                   we interpret the sample mean of “female”, or “married”, or “union”?
             7.             If you do not want correlations between all variables specified in the
                   VAR command, use the WITH command to dictate which variables are to be
                   analyzed with [WITH] the VAR variables. For example:

                   PROC CORR DATA = cps78;
                         VAR MARRIED TENURE UNION
                             NUM_DEP;
                         WITH ED;
                   RUN;

                        Pearson Correlation Coefficients, N = 550
                               Prob > |r| under H0: Rho=0

       MARRIED      TENURE                 UNION             NUM_DEP

ED   -0.08212     -0.34708          -0.12273               -0.06171
      0.0543       <.0001            0.0039                 0.1483

                   Thus, SAS displays the correlation coefficients between ED, specified in the
                   WITH statement, and the various variables listed in the VAR statement.

     6.3     UNIVARIATE




         This procedure is essentially a combination of MEANS and CONTENTS: each variable specified
         (all variables are analyzed by default) is statistically and physically analyzed in manners similar
         to MEANS and CONTENTS.
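
         A minimal sketch of the procedure, applied to the stocks dataset from the PROC MEANS example
         above. The VAR statement restricts the analysis to the listed variables, and the NORMAL option
         (an addition not discussed in these notes) requests tests of normality:

```sas
PROC UNIVARIATE DATA = stocks NORMAL;
      VAR return;          /* omit the VAR statement to analyze all numeric variables */
RUN;
```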




IV.      SAS and Econometric Analysis I: Basic Regression

          This section details basic steps for performing least squares regression analysis in SAS using
standard OLS theory. We will use SAS to regress some y on the available information x, and perform basic
tasks of inference and model improvement.

1.       PROC REG

        We will use the procedure REG to perform basic regression analysis. There are many other procs
in SAS that can be used for least squares estimation depending on the sophistication of the problem (e.g.
dependent error terms, error terms with non-constant variance, regression of many models simultaneously,
etc.). The following syntax is what SAS's help screen (roughly) says about PROC REG under the
assumption that there may be more than one regressor (i.e. the X's) available:

PROC REG: Syntax
The following statements are available in PROC REG.

PROC REG OPTIONS;
      label: MODEL Y = X1 X2 … Xk / OPTIONS;
      BY variables;
      OUTPUT OUT = dataset             OPTIONS;
      PLOT Yvar*Xvar                 / OPTIONS;
      label: TEST test specifications / OPTIONS;

We will study the various options and commands below. Consider, first, a simple example.

Example 1
         Consider the CPS dataset detailed above, and suppose it is contained in the file data_1_1.dat on a
floppy disk. The data contains information on age, education, log-wages, gender, union status, number of
children, etc. Suppose we want to see if the level of education provides an adequate explanation for log-
wages. Define Y = ln_wage and X = ed, and suppose we want to estimate

(1)
         \[ E[Y_i \mid X_i] = \beta_1 + \beta_2 X_i \]
         \[ Y_i = \beta_1 + \beta_2 X_i + e_i \]

where the errors e_i satisfy the usual assumptions (i.e. zero mean, constant variance, zero correlation,
normally distributed). We write:

DATA cps;   /* CPS data */
      INFILE 'a:\data_1_1.dat';
      INPUT ED SOUTH NONWHITE HISPANIC FEMALE   MARRIED     MARRFE
            TENURE TENURE_2 UNION ln_wage AGE NUM_DEP MANUF CONSTRUCT
            MANAG SALES CLER SERV PROF;
RUN;

PROC REG DATA = cps;
      MODEL ln_wage = ed;
RUN;

SAS understands that the variable on the left hand side of the equality in the MODEL statement is the
dependent Y, and anything on the right hand side is understood to be the independent variables X. Notice
that we have not used any options: the above code is the simplest possible way to run a bivariate regression.
The output is as follows:

                The SAS System                     11:02 Sunday, May 27, 2001      1

                                                   The REG Procedure
                                                     Model: MODEL1
                                              Dependent Variable: ln_wage

                                                      Analysis of Variance

                                                            Sum of               Mean
                 Source                       DF           Squares             Square    F Value     Pr > F

                 Model                         1          11.36093          11.36093        51.65    <.0001
                 Error                       548         120.53845           0.21996
                 Corrected Total             549         131.89938


                                  Root MSE                 0.46900      R-Square        0.0861
                                  Dependent Mean           1.68100      Adj R-Sq        0.0845
                                  Coeff Var               27.90001


                                                      Parameter Estimates

                                                    Parameter        Standard
                        Variable       DF           Estimate           Error      t Value     Pr > |t|

                        Intercept       1            1.03044         0.09270        11.12        <.0001
                        ED              1            0.05189         0.00722         7.19        <.0001

The Analysis of Variance information will be studied in chapters 4 and 5; the Parameter Estimates will be
studied in chapters 3 and 5. Hence, much of the above information will not be understandable until we
study the chapters that follow chapter 3, although we can use the above information to gain insight into
how well our regression model describes the data.
         For now, note that under Parameter Estimates, SAS lists the employed “variables”, and calls them
“Intercept” and “ED”. Under “Parameter Estimate” (see footnote 11), SAS lists the OLS estimates of the model in (1):

         \[ \hat{\beta}_1 = 1.03044 \qquad \hat{\beta}_2 = .05189 \]

Moreover, SAS automatically performs tests of the two two-sided hypotheses

              \[ H_0 : \beta_1 = 0 \qquad H_0 : \beta_2 = 0 \]
              \[ H_1 : \beta_1 \neq 0 \qquad H_1 : \beta_2 \neq 0 \]

and presents the results under “t Value” and “Pr > |t|”. The p-value of the test, itself, is contained in

             “Pr > |t|”

If the p-value is less than our chosen size of the test, say 5% = .05, then we reject the null; equivalently, if
the absolute value of the t-statistic is greater than 1.96 for a sufficiently large sample (e.g. n > 100), we
reject the null:

              \[ \text{p-value} < .05 \;\Rightarrow\; \text{reject} \]
                   or
              \[ |t\text{-value}| > 1.96 \;\Rightarrow\; \text{reject (if } n > 100\text{)} \]

Footnote 11. “Estimate” without an “s”.

In the present case, both null hypotheses are rejected (both p-values are below .0001): this suggests that the
true intercept is non-zero, and that there truly exists a relationship between education and wage (see footnote 12).
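
The link between the columns of the Parameter Estimates table can be checked by hand: each t Value is the
estimate divided by its standard error, and Pr > |t| is the two-tailed probability from a t distribution with the
Error degrees of freedom (548 here). A small sketch, with the numbers typed in from the output above and a
made-up dataset name t_check:

```sas
DATA t_check;                     /* hypothetical dataset name */
      LENGTH variable $ 9;        /* long enough to hold "Intercept" */
      INPUT variable $ estimate stderr;
      df = 548;                   /* Error DF from the Analysis of Variance table */
      t  = estimate / stderr;     /* t Value = estimate / standard error */
      p  = 2 * (1 - PROBT(ABS(t), df));   /* two-tailed p-value */
      reject = (p < .05);         /* 1 if H0: parameter = 0 is rejected at the 5% level */
      DATALINES;
Intercept 1.03044 0.09270
ED        0.05189 0.00722
;
RUN;

PROC PRINT DATA = t_check;
RUN;
```

The computed t values should match the t Value column up to rounding, since the typed-in estimates and
standard errors are themselves rounded.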

2.            PROC REG: Commands and Options

              The PROC REG statement presented above employs many auxiliary commands (not all are
              presented above) and allows for many options. Here, we will list and explain a few. Examples are
              provided below.

              A.         PROC REG options

                         After the “PROC REG DATA = dataset” statement, several options can be used:

                                    CORR :                 displays the correlations for all variables listed in the MODEL
                                                           statement.
                                    ALPHA = :              sets the probability level for confidence intervals with
                                                           respect to the OLS estimators
              B.         MODEL / options

                         After the MODEL statement and the stated Y and X variables, use a slash “/”, and any of
                         the following options:

                                    ALPHA = :              sets the probability level for confidence intervals with
                                                           respect to the OLS estimators
                                    CLB :                  dictates to SAS that confidence intervals are to be created for
                                                           all regression model estimators
                                    CORRB:                 displays the correlations between the various OLS estimators
                                    COVB:                  displays the variances & co-variances for the estimators
                                    NOINT :                dictates to SAS that the intercept parameter is assumed to be
                                                           zero

              C.         BY

                         The BY command here performs the same task as in PROC MEANS. SAS will perform
                         separate regressions for each category within the BY variables: SAS expects the dataset
                         to be sorted by the employed variables. SAS only recognizes one BY command at a
                         time, hence if you want to estimate various regression models according to various
                         sub-group divisions, use several PROC REGs, and change the BY variables for each.

              Footnote 12. Indeed, a non-zero intercept means that when education is zero, the individual's expected (log) wage will
              not be zero:
              \[ E[Y_i \mid X_i = 0] = \beta_1 + \beta_2 \cdot 0 = \beta_1 \]
              hence the intercept represents the expected (log) wage of a person with zero years of education. Not surprisingly, it
              is not zero: people can always find work even if they are uneducated. Moreover, a nonzero slope implies the marginal
              impact of a new year of education on wages is non-zero:
              \[ \frac{\partial}{\partial X_i} E[Y_i \mid X_i] = \beta_2 \]
              thus, additional years of education will improve one's earning potential, on average.

              D.         OUTPUT OUT = dataset

                         If you want to save observation-level regression output (e.g. predicted values and
                         residuals) to another dataset, use this command. Note: the dataset that you save the
                         regression output to does not need to exist: SAS will simply create a new dataset with
                         the assigned name. In order to tell SAS which elements to send to the output dataset,
                         use the following keywords (there are far more than the ones below) after the
                         “OUTPUT OUT = dataset”, and without a slash “/”:

                         •  P = variable name: denotes the predicted values of Y; you need to assign a name for
                            this variable, like “y_hat”
                         •  R = variable name: denotes the residuals; you need to assign a name for this
                            variable, like “e_hat”

                 Thus, you can easily derive the predicted dependent variables and the regression
                 residuals.
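
                 For instance, a minimal sketch of the OUTPUT statement, assuming the cps dataset from
                 Example 1 is available; the names reg_out, y_hat and e_hat are arbitrary choices:

```sas
PROC REG DATA = cps;
      MODEL ln_wage = ed;
      OUTPUT OUT = reg_out P = y_hat R = e_hat;   /* predicted values and residuals */
RUN;

PROC PRINT DATA = reg_out (OBS = 10);   /* look at the first 10 observations */
      VAR ln_wage ed y_hat e_hat;
RUN;
```

                 The new dataset reg_out contains the variables of cps plus the two new columns.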

        E.       PLOT

                 The PLOT statement in PROC REG displays scatter plots with yvariable on the vertical
                 axis and xvariable on the horizontal axis. If you want to plot the residuals or predicted
                 values of Y, use the “RESIDUAL.” and “PREDICTED.” keywords: notice that there
                 are dots, or periods, after the words RESIDUAL and PREDICTED. Also, notice that we
                 specify the variable that goes on the Y-axis first, and the variable for the X-axis is stated
                 second with a “*” in between.
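
                 A short sketch of the PLOT statement, again assuming the cps dataset from Example 1
                 (how the plots are rendered depends on the SAS version and graphics settings):

```sas
PROC REG DATA = cps;
      MODEL ln_wage = ed;
      PLOT ln_wage*ed;                 /* Y on the vertical axis, X on the horizontal axis */
      PLOT RESIDUAL.*PREDICTED.;       /* residuals against predicted values */
RUN;
```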

        F.       TEST

                 We will study this command in depth in the subsequent sections.

Example 2
       We want to regress the log of wages Y on education X from the CPS data.

        DATA cps;     /* CPS data */
              INFILE 'a:\data_1_1.dat';
              INPUT ED SOUTH NONWHITE HISPANIC FEMALE       MARRIED
                      MARRFE TENURE TENURE_2 UNION ln_wage AGE NUM_DEP
                      MANUF CONSTRUCT MANAG SALES CLER SERV PROF;
        RUN;
        PROC REG DATA = cps;
              MODEL ln_wage = ed;
        RUN;

                                           Parameter Estimates

                           Parameter    Standard
         Variable     DF    Estimate       Error    t Value    Pr > |t|

         Intercept     1     1.03044     0.09270      11.12      <.0001
         ED            1     0.05189     0.00722       7.19      <.0001

Example 3
       We want to regress the log of wages Y on education X from the CPS data without an intercept
term.

        PROC REG DATA = cps;
              MODEL ln_wage = ed / NOINT;
        RUN;

                                           Parameter Estimates

                           Parameter    Standard
         Variable     DF    Estimate       Error    t Value    Pr > |t|

         ED            1     0.13027     0.00172      75.61      <.0001


Example 4
          We want to regress the log of wages Y on education X and regress log of wages Y on job tenure
(years in the labor force) X. We can use two separate PROC REGs, or simply use two separate MODEL
statements. In order to clarify the output for our own sake, we can use labels for each MODEL
command. Note: we do not need to use the labels, and we can always use labels even when we only estimate
one model.

        PROC REG DATA = cps;
              Wage_Ed: MODEL ln_wage = ed;          /* "Wage_Ed" will be used to signify
                                                       the output of this regression */
              Tenure_Ed: MODEL ln_wage = tenure;    /* "Tenure_Ed" will be used to signify
                                                       the output of this regression */
        RUN;


                                                   The SAS System          19:17 Sunday, May 27, 2001 21

                          The REG Procedure

                            Model: Wage_Ed
                        Dependent Variable: ln_wage

                          Parameter Estimates

                           Parameter    Standard
         Variable     DF    Estimate       Error    t Value    Pr > |t|

         Intercept     1     1.03044     0.09270      11.12      <.0001
         ED            1     0.05189     0.00722       7.19      <.0001



                                                   The SAS System          19:17 Sunday, May 27, 2001 22

                          The REG Procedure
                           Model: Tenure_Ed
                        Dependent Variable: ln_wage

                          Parameter Estimates

                           Parameter    Standard
         Variable     DF    Estimate       Error    t Value    Pr > |t|

         Intercept     1     1.50849     0.03490      43.22      <.0001
         TENURE        1     0.00922     0.00152       6.07      <.0001

Example 5
          We want to regress the log of wages Y on education X separately for two groups: females with
children (FEM_DEP = 1), and everyone else, i.e. those who are not female or do not have children (FEM_DEP = 0).

                   DATA cps;        /* CPS data */
                         INFILE 'a:\data_1_1.dat';
                         INPUT ED SOUTH NONWHITE HISPANIC FEMALE       MARRIED
                               MARRFE TENURE TENURE_2 UNION ln_wage AGE NUM_DEP
                               MANUF CONSTRUCT MANAG SALES CLER SERV PROF;

                         /* DEP = 1 if the person has any dependents, 0 otherwise */
                         IF NUM_DEP > 0 THEN DEP = 1;
                               ELSE IF NUM_DEP EQ 0 THEN DEP = 0;
                         /* FEM_DEP = 1 for females with dependents, 0 for everyone else */
                         FEM_DEP = FEMALE*DEP;
                   RUN;

                   PROC SORT DATA = cps;
                         BY fem_dep;
                   RUN;
                   PROC REG DATA = cps;
                         MODEL ln_wage = ed;
                         BY fem_dep;
                   RUN;

                                                       The SAS System              19:17 Sunday, May 27, 2001 23

-------------------------------------------- FEM_DEP=0 ----------------------------------

                          The REG Procedure
                           Model: MODEL1
                        Dependent Variable: ln_wage


                           Parameter Estimates

                           Parameter    Standard
         Variable     DF    Estimate       Error    t Value    Pr > |t|

         Intercept     1     1.14712     0.09393      12.21      <.0001
         ED            1     0.04765     0.00728       6.55      <.0001

                                                       The SAS System              19:17 Sunday, May 27, 2001 24

-------------------------------------------- FEM_DEP=1 ----------------------------------

                          The REG Procedure
                           Model: MODEL1
                        Dependent Variable: ln_wage




                          Parameter Estimates

                           Parameter    Standard
         Variable     DF    Estimate       Error    t Value    Pr > |t|

         Intercept     1     0.48333     0.26942       1.79      0.0762
         ED            1     0.07009     0.02160       3.25      0.0017

Example 6
        We want to regress the log of wages Y on education X , and display confidence intervals for the
OLS estimators.

            PROC REG DATA = cps;
                       MODEL ln_wage = ed/ CLB ALPHA = .05;
            RUN;

                                                   The SAS System                19:17 Sunday, May 27, 2001 25

                         The REG Procedure
                          Model: MODEL1
                       Dependent Variable: ln_wage

                         Parameter Estimates

                   Parameter   Standard
Variable      DF   Estimate    Error        t Value    Pr > |t|      95% Confidence Limits

Intercept    1     1.03044     0.09270      11.12      <.0001          0.84835      1.21254
ED           1     0.05189     0.00722      7.19       <.0001          0.03771      0.06608


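Example 7
       We want to regress the number of dependents Y (NUM_DEP) on education X, and perform a variety of
tasks at once: display the correlations between the model variables (CORR), drop the intercept (NOINT), display
confidence intervals (CLB) and the correlations between the estimators (CORRB), and run a separate regression
for each value of MALE_PRO (BY).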

DATA cps;     /* CPS data */
      INFILE 'c:\Program Files\WS_FTP\econometrics\data_1_1.dat';
                       /* contains the CPS data */
      INPUT ED SOUTH NONWHITE HISPANIC FEMALE                  MARRIED MARRFE
              TENURE TENURE_2 UNION ln_wage AGE NUM_DEP MANUF CONSTRUCT
              MANAG SALES CLER SERV PROF;
      MALE_PRO = (1-FEMALE)*PROF;
RUN;

PROC SORT DATA = cps;
      BY male_pro;
RUN;

PROC REG DATA = cps CORR;
           MODEL num_dep = ed/ CLB ALPHA = .05 CORRB NOINT;
                    BY male_pro;
RUN;




                                                      The SAS System              19:17 Sunday, May 27, 2001 30

-------------------------------------------- MALE_PRO=0 ---------------------------------

                           The REG Procedure
                         Uncorrected Correlation

                   Variable             ED       NUM_DEP

                   ED               1.0000        0.5811
                   NUM_DEP          0.5811        1.0000


                                                      The SAS System              19:17 Sunday, May 27, 2001 31

-------------------------------------------- MALE_PRO=0 ---------------------------------

                          The REG Procedure
                           Model: MODEL1
                        Dependent Variable: NUM_DEP

                NOTE: No intercept in model. R-Square is redefined.

                           Parameter Estimates

                             Parameter           Standard
Variable           DF        Estimate            Error               t Value   Pr > |t|     95% Confidence Limits

ED                 1         0.07349             0.00466             15.77     <.0001        0.06433    0.08264




V.       SAS and Econometric Analysis II: Multiple Regression with PROC AUTOREG

          This section provides the basic details for using SAS's PROC AUTOREG. This procedure
performs the same tasks as PROC REG if the regression assumptions are standard, and can employ more
sophisticated techniques if the basic assumptions do not hold (e.g. correlated regression errors, regression
errors with non-constant variance, etc.). A nice feature of this procedure is its ability to test a myriad of
important hypotheses, including the hypothesis that the regression errors are normal random variables, the
hypothesis that the error variance is constant, and the hypothesis that the errors are uncorrelated with one
another. PROC REG does not offer all of these tests (for example, it has no normality test option).
          In order to handle non-standard estimation environments, we will employ PROC AUTOREG for
estimation when the error variance is non-constant and/or the errors are correlated.

1.       PROC AUTOREG

         The basic syntax of PROC AUTOREG is as follows:

         PROC AUTOREG options ;
               BY variables ;
               MODEL Y = X1 X2 … Xk / options ;
               TEST / options ;
               OUTPUT OUT = dataset options ;

         We will study the various options below.

Example 1
       The following code enters the coffee data from Project 2, performs basic regression with
AUTOREG, and sends the predicted values and residuals to new datasets. Notice that the

                  OUTPUT OUT = q_out1 P = q_hat R = e_hat;

statement creates a new dataset called “q_out1”. The statement “P = q_hat” tells SAS to place the predicted
values into the new dataset, and call the new variable “q_hat”. The statement “R = e_hat” tells SAS to place
the regression residuals into the new dataset, and call the new variable “e_hat”. We can then print the
datasets, save the SAS output, and use EXCEL to make graphs: we will learn these tasks over the next few
weeks.

         data coffee;
               infile 'c:\Program Files\WS_FTP\econometrics\data_2_1.dat';
               input q p;
               ln_q = log(q);
               ln_p = log(p);
         run;

         proc autoreg data = coffee;
               model q = p;
                     output out = q_out1 P = q_hat R = e_hat;
               model ln_q = ln_p;
                     output out = q_out2 P = q_hat R = e_hat;
         run;



         proc print data = q_out1;
         run;




                                              The AUTOREG Procedure

                                            Dependent Variable           q

                                      Ordinary Least Squares Estimates

                     SSE                     0.14907972        DFE                                9
                     MSE                        0.01656        Root MSE                     0.12870
                     SBC                     -11.300425        AIC                       -12.096215
                     Regress R-Square            0.6628        Total R-Square                0.6628
                     Durbin-Watson               0.7266

                                                               Standard                       Approx
                  Variable           DF      Estimate             Error        t Value      Pr > |t|

                  Intercept           1        2.6911           0.1216           22.13        <.0001
                  p                   1       -0.4795           0.1140           -4.21        0.0023



                                              The AUTOREG Procedure

                                           Dependent Variable           ln_q


                                      Ordinary Least Squares Estimates

                     SSE                     0.02263302        DFE                                9
                     MSE                        0.00251        Root MSE                     0.05015
                     SBC                     -32.036211        AIC                       -32.832001
                     Regress R-Square            0.7448        Total R-Square                0.7448
                     Durbin-Watson               0.6801


                                                               Standard                       Approx
                  Variable           DF      Estimate             Error        t Value      Pr > |t|

                  Intercept           1        0.7774           0.0152           51.00        <.0001
                  ln_p                1       -0.2530           0.0494           -5.13        0.0006




                   Obs       q_hat         e_hat         q         p            ln_q         ln_p

                     1     2.32189         0.24811      2.57     0.77        0.94391        -0.26136
                     2     2.33627         0.16373      2.50     0.74        0.91629        -0.30111
                     3     2.34586         0.00414      2.35     0.72        0.85442        -0.32850
                     4     2.34107        -0.04107      2.30     0.73        0.83291        -0.31471
                     5     2.32668        -0.07668      2.25     0.76        0.81093        -0.27444
                     6     2.33148        -0.13148      2.20     0.75        0.78846        -0.28768
                     7     2.17323        -0.06323      2.11     1.08        0.74669         0.07696
                     8     1.82318         0.11682      1.94     1.81        0.66269         0.59333
                     9     2.02458        -0.05458      1.97     1.39        0.67803         0.32930
                    10     2.11569        -0.05569      2.06     1.20        0.72271         0.18232
                    11     2.13007        -0.11007      2.02     1.17        0.70310         0.15700

         The bottom of the output presents the dataset called “q_out1”: notice that SAS automatically
places all the data from the original dataset in the output dataset. In addition, SAS places the regression
predicted values, named “q_hat”, and the regression residuals, named “e_hat”, in this dataset.



        Notice the different output arrangement when compared to PROC REG. SAS places the basic
goodness-of-fit measures at the top of the output, including “SSE” (sum of squared residuals), “MSE”
(mean squared error; see footnote 13) and the coefficient of determination, $R^2$.


2.         PROC AUTOREG: Commands and Options

           A.        MODEL / options

                     After the MODEL statement and the stated Y and X variables, use a slash “/”, and any of
                     the following options:


                                CORRB:            displays the correlations between the various OLS estimators
                                NOINT:            tells SAS that the intercept parameter is assumed to be
                                                  zero
                                NORMAL:           requests the Jarque-Bera normality test statistic for the
                                                  regression residuals
           B.        BY

                     The BY command here performs the same task as in PROC REG. SAS will perform
                     basic OLS tasks for each group specified by the BY variable. SAS expects the data to be
                     sorted according to the BY variable.

           C.        OUTPUT OUT = dataset

                     If you want to save observation-level regression output (e.g. predicted values and
                     residuals) to another dataset, use this command. Note: the dataset that you save the
                     output to does not need to exist: SAS will simply create a new dataset with the assigned
                     name. In order to tell SAS which elements to send to the output dataset, use the
                     following keywords (there are far more than the ones below) after the
                     “OUTPUT OUT = dataset”, and without a slash “/”:

                           P = variable name: denotes the predicted values of Y; you need to assign a name for
                            this variable, like “y_hat”
                           R = variable name: denotes the residuals; you need to assign a name for this
                            variable, like “e_hat”
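
           The statements above can be combined in one run. The following is only a minimal sketch: it
           assumes the “cps” dataset created for the PROC REG examples is still available, and the names
           “dep_out”, “dep_hat” and “dep_res” are made up for illustration.

```sas
/* Sketch only: reuses the "cps" dataset and variable names from the CPS example. */
proc sort data = cps;
      by male_pro;                    /* BY-group processing expects sorted data */
run;

proc autoreg data = cps;
      by male_pro;                    /* one regression per value of male_pro */
      model num_dep = ed / normal;    /* NORMAL adds the Jarque-Bera statistic */
      output out = dep_out p = dep_hat r = dep_res;   /* hypothetical names */
run;

proc print data = dep_out (obs = 10); /* inspect the first 10 saved observations */
run;
```

           The output dataset should contain the original variables plus the predicted values and residuals
           for every BY group.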

3.         Test of Normality: The NORMAL Command

           As detailed above, PROC AUTOREG can perform the Jarque-Bera test of normality on the
regression errors by employing the option NORMAL after the slash in the MODEL statement. Recall that the
test statistic employs the skewness of the residuals (a measure of distribution symmetry) and the kurtosis (a
measure of the thickness of the tails of the distribution relative to the normal distribution). Under the null
hypothesis that the true regression errors are normally distributed, the Jarque-Bera test statistic has an
asymptotic chi-squared distribution with 2 degrees of freedom, one for the skewness and one for the
kurtosis, regardless of the number of regressors. Thus, if $H_0: e_t \sim N(0, \sigma^2)$ is true, then

           $JB \sim \chi^2(2)$.

SAS automatically displays the p-value for the chi-squared test statistic.

Example 2


13
     The MSE, or “mean squared error”, is simply the estimated regression error variance: $\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i} \hat{e}_i^2$.

         proc autoreg data = coffee;
                 model q = p / NORMAL;
         run;

                        The SAS System         09:41 Wednesday, June 13, 2001 19

                        The AUTOREG Procedure

                        Dependent Variable    q


                      Ordinary Least Squares Estimates

            SSE                     0.14907972         DFE                       9
            MSE                     0.01656            Root MSE                   0.12870
            SBC                     -11.300425         AIC                       -12.096215
            Regress R-Square        0.6628             Total R-Square            0.6628
            Normal Test             1.7466             Pr > ChiSq                0.4176
            Durbin-Watson            0.7266

                                             Standard                Approx
          Variable         DF    Estimate      Error      t Value    Pr > |t|

          Intercept        1     2.6911      0.1216        22.13     <.0001
          p                1    -0.4795      0.1140        -4.21     0.0023


          SAS prints the Jarque-Bera statistic as “Normal Test 1.7466”, and displays the associated
p-value to the right, “Pr > ChiSq 0.4176”. Under the null hypothesis, the JB statistic is a chi-squared random
variable with 2 degrees of freedom (the degrees of freedom do not depend on the number of regressors): the
5% cutoff value is 5.99, and 1.7466 < 5.99, thus we cannot reject the null. Equivalently, we can simply
refer to the p-value: the p-value = .4176 > .05, hence we cannot reject the null. For this sample, the
regression residuals are reasonably consistent with normality, hence we can maintain the assumption that
the errors are, in fact, normal.
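
          The cutoff and the p-value quoted above can be checked by hand in a short DATA step. This is
only a sketch: it uses the SAS functions CINV and PROBCHI, and the variable names are made up for
illustration.

```sas
data _null_;
      jb     = 1.7466;               /* "Normal Test" value from the output above */
      cutoff = cinv(0.95, 2);        /* 5% critical value of chi-squared(2) */
      pvalue = 1 - probchi(jb, 2);   /* upper-tail probability of the statistic */
      put cutoff= pvalue=;           /* results are written to the SAS log */
run;
```

The log should show a cutoff of about 5.99 and a p-value of about .4176, in agreement with the output.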




VI.      SAS and Econometric Analysis III: Multiple Regression and Inference




        This section will provide information for using SAS to perform tests of model specification
hypotheses. In particular, we will review how to use PROC REG for the classical F-test, PROC REG and
PROC AUTOREG for general F-tests of multiple restrictions, and PROC AUTOREG for the RESET test
of model correctness.

1.       Classical F-test of Model Correctness

A.       Theory

         The classical F-test of model correctness is used to test the hypothesis that all slope parameters
are simultaneously zero (i.e. no explanatory variable is linearly related to Y; the entire linear model is
inappropriate). For the model

(1)       $Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \dots + \beta_K X_{Kt} + e_t$

the null hypothesis is

          $H_0: \beta_2 = 0, \dots, \beta_K = 0$

Observe that we only test the slopes: the purpose of the hypothesis is to see whether any explanatory
variables at all belong in the model, not whether an intercept is appropriate.
      The F-statistic for a test of the above hypothesis is exactly

(2)       $F = \frac{(SST - SSE)/(K-1)}{SSE/(N-K)} \sim F(K-1, N-K)$ if the null hypothesis is true.

If the null is true, the F-statistic will tend to be small, whereas if the null is false, the statistic will tend to
be large: for a test at the 5% level, we reject if $F > F_c$, where the cutoff value $F_c$ is derived from the
F-distribution with K – 1 and N – K degrees of freedom:

          $P(F > F_c) = .05$, where $F \sim F(K-1, N-K)$.

B.       SAS

        Use PROC REG. SAS automatically reports the F-statistic and associated p-value: reject the null
hypothesis if the p-value < .05 (or, whatever the test size is; e.g. .01, .05, .10).
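
        As a small illustration, the sketch below assumes the “coffee” dataset from the PROC AUTOREG
examples is still available (N = 11 observations, K = 2 parameters). The second step computes the 5%
cutoff value with the SAS function FINV; the variable name “fc” is made up for illustration.

```sas
/* Sketch only: reuses the "coffee" dataset from the PROC AUTOREG examples. */
proc reg data = coffee;
      model q = p;      /* "F Value" and "Pr > F" appear in the Analysis of Variance table */
run;

data _null_;
      fc = finv(0.95, 1, 9);   /* 5% cutoff of F(K-1, N-K) with K = 2, N = 11 */
      put fc=;                 /* result is written to the SAS log */
run;
```

We reject the null if the reported “F Value” exceeds the cutoff or, equivalently, if “Pr > F” is below .05.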

2.       General F-test of Multiple Restrictions

A.        Theory
          The general F-test of multiple restrictions is used to test hypotheses that concern more than one
parameter at a time. The hypothesis may place restrictions on any regression parameter (the intercept; any
slope), may involve any number of parameters at one time, and may involve linear functions of parameters.
Examples of null hypotheses testable by the F-test method include

         i.          $H_0: \beta_1 = 0, \beta_3 = 0$
         ii.         $H_0: \beta_2 = 2, \beta_3 = 3\beta_4, \beta_5 = -\beta_4$
         iii.        $H_0: \beta_2 + \beta_3 + \beta_4 + \beta_5 = 1$




         The F-statistic for a test of the above hypotheses is based on running two separate regressions, one
without any restrictions, and one with the hypothesized restrictions enforced (see footnote 14). The Sum of
Squared Errors [SSE] is collected from the unrestricted model ($SSE_U$) and the restricted model ($SSE_R$).
The F-statistic is exactly

(3)       $F = \frac{(SSE_R - SSE_U)/J}{SSE_U/(N-K)} \sim F(J, N-K)$ if the null hypothesis is true.

where J denotes the number of restrictions. For example, for the three examples (i) – (iii) above, the
numbers of restrictions are, respectively,

         i.        J = 2
         ii.       J = 3
         iii.      J = 1

If the null is true, the restricted and unrestricted models will perform roughly identically, hence the SSEs
will be nearly identical and the F-statistic will be small. If the null is false, then when the restrictions are
enforced the resulting model will perform poorly compared to the unrestricted model, hence the SSE
from the restricted model will be comparatively large, and the statistic will be large.

B.       SAS

        Use PROC REG or PROC AUTOREG. The TEST statements are placed below the MODEL
statement, each on a separate line of code. By way of example, consider an income model

(4)       $INCOME_t = \beta_1 + \beta_2 ED_t + \beta_3 AGE_t + \beta_4 NUM\_CHILD_t + e_t$

Suppose we want to test the two hypotheses

         i.        $H_0: \beta_1 = 0, \beta_4 = 0$
         ii.       $H_0: \beta_2 = .5 \beta_3$

We write (see footnote 15)
        PROC REG DATA = d1;
              MODEL INCOME = ED AGE NUM_CHILD;
                   TEST intercept = 0, NUM_CHILD = 0;
                   TEST ED = .5*AGE;
        RUN;

Notice that we refer to the estimated intercept literally as “intercept”.
           SAS will report on separate screens (i.e. you need to scroll down) the results of each test. SAS
displays numerical values associated with the numerator and denominator of the F-statistic, the F-statistic
itself, labeled “F Value”, and the p-value, labeled “Pr > F”. As usual, for a 5%-sized test we reject the null
hypothesis if the p-value is below .05. Examples of SAS output follow:




14
   In the course, if there is time we will study how to use SAS to perform “constrained least squares”, the
method of OLS when restrictions on the parameters are required.
15
   PROC AUTOREG will perform the same task: recall, however, that PROC AUTOREG will not report
the classical F-test of model correctness.


                              Test 1 Results for Dependent Variable INCOME

                                                     Mean
                Source                      DF       Square              F Value               Pr > F

                Numerator                  2         127493742           7.67                  0.0005
                Denominator                424       16613992

                               The REG Procedure
                                Model: MODEL1



                                  Test 2 Results for Dependent Variable INCOME

                                                      Mean
                Source                     DF        Square              F Value               Pr > F

                Numerator                   1         683580130          41.14                 <.0001
                Denominator                424       16613992


Observe that both tests reject the null hypothesis at the 5%-level: we do not have statistical evidence to
support either hypothesis.
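
        The “F Value” is simply the numerator mean square divided by the denominator mean square, and
“Pr > F” is the upper-tail probability of the F-distribution with the listed degrees of freedom. The sketch
below redoes this arithmetic for Test 2 using the numbers printed above; the variable names are made up for
illustration.

```sas
data _null_;
      num_ms = 683580130;             /* Test 2 numerator mean square, DF = 1 */
      den_ms = 16613992;              /* denominator mean square, DF = 424 */
      f = num_ms / den_ms;            /* the "F Value" */
      p = 1 - probf(f, 1, 424);       /* the "Pr > F" */
      put f= p=;                      /* results are written to the SAS log */
run;
```

The computed F should agree with the reported value of 41.14 up to rounding.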

3.       The RESET Test of Model Specification Correctness

A.       Theory
         The RESET test of model specification correctness tests to see if the hypothesized model is
correct, with an alternative hypothesis that suggests a better model. Consider the following regression
model with K = 4:

(5)      $Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + e_t$

Examples of null hypotheses and resulting alternatives are

         i.
                        $H_0: Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + e_t$
                        $H_1: Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + e_t$

         ii.
                        $H_0: Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + e_t$
                        $H_1: Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + \beta_6 \hat{Y}_t^3 + e_t$

Other alternatives would simply add more power-functions of the predicted Y's.
          The alternative hypothesis is based on the logic that if the original model is not adequate (i.e. poor
performance based on t-tests, coefficient of determination, classical F-test), then a reasonable model
improvement entails adding non-linear functions of the available data. To see this, notice that the
alternative models include power-functions of the predicted Y's

         $\hat{Y}_t^2, \quad \hat{Y}_t^3$

Now, recall that the predicted values are exactly

         $\hat{Y}_t = \hat{\beta}_1 + \hat{\beta}_2 X_{2t} + \hat{\beta}_3 X_{3t} + \hat{\beta}_4 X_{4t}$

Thus, for example, $\hat{Y}_t^2$ will be a function of squares of the X's and “interaction” terms, like

         $X_{2t} X_{3t}, \quad X_{2t} X_{4t}, \quad X_{3t} X_{4t}$

as well as functions of all the estimated parameters, which, in turn, are all random functions of the
available data. In other words, the alternative model, for example

         $Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + e_t$,

will now include the original explanatory variables, and simple non-linear functions of all the explanatory
variables.
          Why would we ever simply add power-functions of the explanatory variables? In general,
although through F-tests and t-tests we can ascertain that some information (i.e. some explanatory variable)
is not statistically relevant, we never know exactly how to build a better model: removing an explanatory
variable won't necessarily lead to a better model (recall that the $R^2$ will drop in value as we remove
variables!). Indeed, even if the answer is simply “add more data; find better explanatory variables”, we, of
course, do not have more data, and if we could find better explanatory variables, we would have already
been using them. In other words, the sample of data we have is not going to improve magically. If the
present model (5) is not performing well, we have little choice but to use the available data in a way
different than the original linear specification. Power-functions are simply a convenient non-linear way to
build “new” explanatory variables in a world of limited data.
          If we reject the null hypothesis (i.e. if the F-statistic used for the RESET test is too large), we
conclude that we have evidence that a better model would be the one specified in the alternative hypothesis.

B.       SAS
         Use PROC AUTOREG: the RESET option belongs to its MODEL statement. For model (4), say, we write

         PROC AUTOREG DATA = d1;
               MODEL INCOME = ED AGE NUM_CHILD / RESET;
         RUN;

SAS does not know how many power-functions of the predicted values to include for the test, so it reports
RESET test statistics for a variety of tests (based on using different power functions of the predicted
values). The output is




         The SAS System              16:40 Monday, June 25, 2001 1




                             The AUTOREG Procedure

                            Dependent Variable      INCOME


                           Ordinary Least Squares Estimates

              SSE          7044332479 DFE               424
              MSE            16613992 Root MSE           4076
              SBC          8350.65254 AIC          8334.41605
              Regress R-Square    0.1084 Total R-Square     0.1084
              Durbin-Watson      1.9012

                              Ramsey's RESET Test

                                Power               RESET                Pr > F

                                2                   7.4417               0.0066      (uses $\hat{Y}^2$)
                                3                   3.8428               0.0222      (uses $\hat{Y}^2, \hat{Y}^3$)
                                4                   2.7202               0.0441      (uses $\hat{Y}^2, \hat{Y}^3, \hat{Y}^4$)

                                                              Standard                      Approx
            Variable            DF       Estimate             Error               t Value   Pr > |t|

            Intercept           1        -2612                1618                -1.61     0.1071
            ED                  1        573.6493             87.0455             6.59      <.0001
            AGE                 1        18.6558              27.1506             0.69      0.4924
            NUM_CHILD           1        -1708                538.6768            -3.17     0.0016

Notice that SAS prints RESET statistics for tests that include only the power of 2, the powers of 2 and 3,
and the powers of 2, 3 and 4. As usual, we reject if the associated p-value is less than .05. In this case, we
reject in all three tests: there is evidence that the original specification in (4) is not adequate, and that
including power-functions of the predicted values improves the fit of the model: according to the tests, any
of the models

            $Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + e_t$

            $Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + \beta_6 \hat{Y}_t^3 + e_t$

            $Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + \beta_6 \hat{Y}_t^3 + \beta_7 \hat{Y}_t^4 + e_t$

is preferred to the original specification in (4). Until we find a better way to specify the model,
we should estimate one of the above augmented model specifications for the sake of more statistically
accurate forecasts.



C.       Using the RESET Result to Build a Better Model




         In order to estimate the augmented model suggested by the alternative hypothesis, we need to save
the predicted values to a new dataset by using the OUTPUT OUT command. SAS will automatically place
in the new dataset all explanatory variables and the Y variable. For example, to perform the RESET test
and save the predicted values, use

         PROC AUTOREG DATA = d1;
               MODEL INCOME = ED AGE NUM_CHILD / RESET;
               OUTPUT OUT = reg_out P = y_hat;
         RUN;

SAS will create a new dataset called “reg_out”, and place in it INCOME, ED, AGE, and NUM_CHILD as
well as the predicted values of INCOME. Notice the command: we use “P =” to signify that we want the
predicted values to be written to the dataset; we then supply a variable name for the predicted values. Here, I
simply called them “y_hat”. We need, however, power-functions of the predicted values. For this task, we
can create yet another dataset, place everything in “reg_out” into the new dataset, and create the powers.
Consider the following code:

         PROC AUTOREG DATA = d1;
               MODEL INCOME = ED AGE NUM_CHILD / RESET;
               OUTPUT OUT = reg_out P = y_hat;         /* save the predicted values as y_hat */
         RUN;

         DATA reg_out2;
               SET reg_out;                            /* SET places “reg_out” into this dataset */
                        y_hat_2 = y_hat**2;
                        y_hat_3 = y_hat**3;
                        y_hat_4 = y_hat**4;
         RUN;

         PROC AUTOREG DATA = reg_out2;
               MODEL INCOME = ED AGE NUM_CHILD y_hat_2 y_hat_3 y_hat_4 / RESET;
         RUN;




The SAS output for the second regression with the augmented predicted value power functions is



                           The SAS System           16:40 Monday, June 25, 2001 7

                        The AUTOREG Procedure

                      Dependent Variable        income


                     Ordinary Least Squares Estimates

           SSE          6910381237 DFE               421
           MSE            16414207 Root MSE           4051
           SBC          8360.61292 AIC          8332.19905
           Regress R-Square    0.1254 Total R-Square     0.1254
           Durbin-Watson      1.9005


                           Ramsey's RESET Test

                     Power         RESET      Pr > F

                       2         1.7439   0.1874
                       3         0.8737   0.4181
                       4         0.6407   0.5892


                                                          Standard                          Approx
         Variable           DF       Estimate             Error             t Value        Pr > |t|

         Intercept          1        13052                16875              0.77          0.4397
         ED                 1        -1597                2765              -0.58          0.5639
         AGE                1        -58.5699             94.1374           -0.62          0.5342
         NUM_CHILD          1        4301                 8255               0.52          0.6027
         y_hat_2            1        0.001178             0.001857           0.63          0.5262
         y_hat_3            1        -1.864E-7            2.9402E-7         -0.63          0.5265
         y_hat_4            1        1.126E-11            1.618E-11          0.70          0.4868
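
As a cross-check, the RESET statistic for power 4 reported for the original model (2.7202) can be recovered
from formula (3): the original model plays the role of the restricted model (SSE = 7044332479), the
augmented model above is the unrestricted model (SSE = 6910381237, DFE = 421), and J = 3 power
terms are added. The following is only a sketch of that arithmetic; the variable names are made up for
illustration.

```sas
data _null_;
      sse_r = 7044332479;                       /* SSE of the original (restricted) model */
      sse_u = 6910381237;                       /* SSE of the augmented (unrestricted) model */
      j     = 3;                                /* y_hat_2, y_hat_3, y_hat_4 */
      dfe   = 421;                              /* N - K of the augmented model */
      f = ((sse_r - sse_u) / j) / (sse_u / dfe);
      p = 1 - probf(f, j, dfe);                 /* upper-tail probability of F(J, N-K) */
      put f= p=;                                /* results are written to the SAS log */
run;
```

The computed F should land close to the 2.7202 printed in the first RESET table, with a p-value near .0441.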

Comments:
1.      Now that we have included power-functions of the predicted values from the original
        estimated model, the RESET tests all fail to reject the hypothesis that the specification

(6)     $Y_t = \beta_1 + \beta_2 X_{2t} + \beta_3 X_{3t} + \beta_4 X_{4t} + \beta_5 \hat{Y}_t^2 + \beta_6 \hat{Y}_t^3 + \beta_7 \hat{Y}_t^4 + e_t$

        is correct: in other words, adding the power-functions seems to have created a
        regression model that cannot be improved yet again in the same way.
2.      However, notice how all the estimated slope signs have changed, the sizes of the estimated
        parameters are substantially different (education has a negative impact?!), and all t-tests fail to
        reject the hypotheses that the true slopes are zero. Somewhat paradoxically, the classical F-test
        rejects the hypothesis that the entire linear model is irrelevant (I do not present the F-test above;
        its p-value is < .0001). In other words, the entire model works well, but the individual
        parameter estimates seem to be very volatile, and therefore not trustworthy.




3.      This confusing phenomenon is often due to excessive correlation (linear dependence)
        between the regressors (see footnote 16), which we refer to as “multi-collinearity”. In model (6), the
        augmented power functions of the predicted values are themselves functions of the X's, and therefore
        the regressors are likely to be highly correlated in the new regression model (6). That the RESET test
        can produce such a poor result is one reason why econometricians over the past 20 years have
        attempted to produce better model specification tests.
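
One way to look for this kind of correlation is sketched below. It assumes the dataset “reg_out2” created
in the code above, prints the pairwise correlations between the regressors with PROC CORR, and asks
PROC REG for variance inflation factors through the VIF option of the MODEL statement. This is an
illustration, not part of the RESET procedure itself.

```sas
/* Sketch only: reuses "reg_out2" and its variables from the RESET example above. */
proc corr data = reg_out2;
      var ed age num_child y_hat_2 y_hat_3 y_hat_4;   /* pairwise correlations */
run;

proc reg data = reg_out2;
      model income = ed age num_child y_hat_2 y_hat_3 y_hat_4 / vif;
run;
```

Correlations close to 1 in absolute value, or very large variance inflation factors (a common rule of thumb
treats values above 10 as a warning sign), would point to multi-collinearity.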




16
  Recall, for multiple regression, we assume the regressors are not exact linear functions of each other. If they
were, SAS could not perform least squares estimation. When the explanatory variables are highly but not
perfectly correlated, SAS can perform OLS, but the results may be difficult to interpret, or simply nonsensical.


