Group Presentation - PowerPoint Presentation

1 – Intro & Hist. - Na Chan
2 – Basics of ANOVA - Alla Tashlitsky
3 - Data Collection - Bryan Rong
4 - Checking Assumptions in SAS - Junying Zhang
5 - 1-Way ANOVA derivation - Yingying Lin and Wenyi Dong
6 - 1-Way ANOVA in SAS - Yingying Lin and Wenyi Dong
7 - 2-Way ANOVA derivation - Peng Yang
8 - 2-Way ANOVA in SAS - Phil Caffrey and Yin Diao
9 - Multi-Way ANOVA Derivation - Michael Biro
10 - ANOVA and Regression – Cris (Jiangyang) Liu

             USES OF T-TEST
• A one-sample location test of whether the
  mean of a normally distributed population has
  a value specified in a null hypothesis.

• A two-sample location test of the null
  hypothesis that the means of two normally
  distributed populations are equal


             USES OF T-TEST
• A test of the null hypothesis that the
  difference between two responses measured
  on the same statistical unit has a mean value
  of zero

• A test of whether the slope of a regression
  line differs significantly from 0


             BACKGROUND
• If comparing means among k > 2 groups, k(k-1)/2 pairwise
  t-tests (3 or more) are needed

  -Time-consuming (Number of t-tests increases)

 -Inherently flawed (Probability of making a
 Type I error increases)
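
A rough sketch of why repeated t-tests are flawed. It assumes, for simplicity, that the pairwise tests are independent (they are not exactly independent in practice, so the figures are only approximate); the function name is illustrative.

```python
from math import comb

def familywise_error_rate(k_groups: int, alpha: float = 0.05):
    """Return (number of pairwise t-tests, P(at least one Type I error))."""
    m = comb(k_groups, 2)                 # k(k-1)/2 pairwise comparisons
    fwer = 1.0 - (1.0 - alpha) ** m       # assumes independent tests
    return m, fwer

for k in (3, 4, 5, 10):
    m, fwer = familywise_error_rate(k)
    print(f"k={k:2d} groups -> {m:2d} t-tests, familywise error ~ {fwer:.3f}")
```

With 3 groups and alpha = 0.05 per test, the chance of at least one false rejection is already about 14% under this assumption.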


               RONALD A. FISHER
•   Biologist
•   Eugenicist
•   Geneticist
•   Statistician

    ANOVA was informally used by researchers in the 1800s and
    formally proposed by Ronald A. Fisher in 1918.

    “A genius who almost single-handedly created
    the foundations for modern statistical science”
                                   - Anders Hald
    “The greatest of Darwin's successors”
                                   - Richard Dawkins
                     HISTORY
• Fisher proposed a formal analysis of variance in
  his paper The Correlation Between Relatives on
  the Supposition of Mendelian Inheritance in 1918.

• His first application of the analysis of variance
  was published in 1921.

• ANOVA became widely known after being included in
  Fisher's 1925 book Statistical Methods for
  Research Workers.
              DEFINITION
• An abbreviation for: ANalysis Of VAriance

• A procedure for comparing the means of k
  independent groups, where k is 2 or
  greater.



          ANOVA and T-TEST
• ANOVA and T-Test are similar
  -Compare means between groups

• With 2 groups, both work

• With 3 or more groups, ANOVA is better



                  TYPES
• ANOVA - analysis of variance
  – One way (F-ratio for 1 factor )
  – Two way (F-ratio for 2 factors)

• ANCOVA - analysis of covariance

• MANOVA - multivariate analysis of variance
              APPLICATION
•   Biology
•   Microbiology
•   Medical Science
•   Computer Science
•   Industry
•   Finance

                   Definition
• ANOVA can determine whether there is a significant
  relationship between variables. It is also used to
  determine whether a measurable difference exists
  between two or more sample means.
• Objective: To identify important independent variables
  (predictor variables, xi’s) and determine how they
  affect the response variable (y).
• Whether one-way, two-way, or multi-way ANOVA is used depends on the
  number of independent variables (factors) in the
  experiment whose effects are tested.

Model & Assumptions




         Classes of ANOVA
1. Fixed Effects: concrete (e.g. sex,
   age)
2. Random Effects: representative
   sample (e.g. treatments, locations,
   tests)
3. Mixed Effects: combination of fixed
   and random
                 Procedure
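• H0: µ1=µ2=…=µk vs
  Ha: at least one of the equalities doesn’t hold

• Test statistic: $F = \mathrm{MSR}/\mathrm{MSE}$, compared with the critical
  value $f_{k,\,n-(k+1),\,\alpha}$; $F = t^2$ when there are only 2 means
   – Where mean square regression: MSR = SSR/1 and
     mean square error: MSE = SSE/(n-2) in the 2-mean case

• The rejection region for a given significance level $\alpha$
  is $F > f_{k,\,n-(k+1),\,\alpha}$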
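
The identity F = t² for two means can be checked numerically. The sketch below uses made-up numbers (not the stock data) and hypothetical helper names; it computes the pooled two-sample t statistic and the one-way F statistic by hand.

```python
from statistics import mean

def pooled_t(a, b):
    """Two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss_within = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    s2 = ss_within / (na + nb - 2)
    return (ma - mb) / (s2 * (1 / na + 1 / nb)) ** 0.5

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group MS over within-group MS."""
    n = sum(len(g) for g in groups)
    k = len(groups)
    grand = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Illustrative samples only.
a = [4.1, 5.0, 5.6, 4.8]
b = [6.2, 5.9, 7.1, 6.5, 6.0]
t = pooled_t(a, b)
f = one_way_f([a, b])
print(f"t^2 = {t ** 2:.6f}, F = {f:.6f}")
```

The two printed values agree up to floating-point rounding, which is why the t-test and ANOVA give the same decision when only 2 groups are compared.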

                  Regression
• SST (sum of squares total) = SSR (sum of
  squares regression) + SSE (sum of squares
  error)

• $\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$




• Sample variance: $S^2 = \mathrm{MSE} = \mathrm{SSE}/(n-k)$ →
  unbiased estimator for $\sigma^2$
 Mean
Variation




             Data Collection
• 3 industries – Application Software, Credit
  Service, Apparel Stores
• Sample 15 stocks from each industry
• For each stock, we observed the last 30 days
  and calculated
  – Mean daily percentage change
  – Mean daily percentage range
  – Mean Volume

                   Application software

•   CA, Inc. [CA]
•   Compuware Corporation [CPWR]
•   Deltek, Inc. [PROJ]
•   Epicor Software Corporation [EPIC]
•   Fundtech Ltd. [FNDT]
•   Intuit Inc. [INTU]
•   Lawson Software, Inc. [LWSN]
•   Microsoft Corporation [MSFT]
•   MGT Capital Investments, Inc. [MGT]
•   Magic Software Enterprises Ltd. [MGIC]
•   SAP AG [SAP]
•   Sonic Foundry, Inc. [SOFO]
•   RealPage, Inc. [RP]
•   Red Hat, Inc. [RHT]
•   VeriSign, Inc. [VRSN]



                          Credit Service

•   Advance America, Cash Advance Centers, Inc. [AEA]
•   Alliance Data Systems Corporation [ADS]
•   American Express Company [AXP]
•   Asset Acceptance Capital Corp. [AACC]
•   Capital One Financial Corporation [COF]
•   CapitalSource Inc. [CSE]
•   Cash America International, Inc. [CSH]
•   Discover Financial Services [DFS]
•   Equifax Inc. [EFX]
•   Global Cash Access Holdings, Inc. [GCA]
•   Federal Agricultural Mortgage Corporation [AGM]
•   Intervest Bancshares Corporation [IBCA]
•   Manhattan Bridge Capital, Inc. [LOAN]
•   MicroFinancial Incorporated [MFI]
•   Moody's Corporation [MCO]



                         APPAREL STORES

•   Abercrombie & Fitch Co. [ANF]
•   American Eagle Outfitters, Inc. [AEO]
•   bebe stores, inc. [BEBE]
•   DSW Inc. [DSW]
•   Express, Inc. [EXPR]
•   J. Crew Group, Inc. [JCG]
•   New York & Company, Inc. [NWY]
•   Nordstrom, Inc. [JWN]
•   Pacific Sunwear of California, Inc. [PSUN]
•   The Gap, Inc. [GPS]
•   The Buckle, Inc. [BKE]
•   The Children's Place Retail Stores, Inc. [PLCE]
•   The Dress Barn, Inc. [DBRN]
•   The Finish Line, Inc. [FINL]
•   Urban Outfitters, Inc. [URBN]



Final Data look




   Major Assumptions of Analysis of
             Variance
• The Assumptions
   – Normal populations
   – Independent samples
   – Equal (unknown) population variances



• Our Purpose
   – Examine these assumptions by graphical analysis of residuals




                     Residual plot
•   Violations of the basic assumptions and model adequacy can
    be easily investigated by the examination of residuals.
•   We define the residual for observation j in treatment i as

                      $e_{ij} = y_{ij} - \hat{y}_{ij}$

•   If the model is adequate, the residuals should be
    structureless; that is, they should contain no obvious
    patterns.
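
As a small illustration of the residual definition, the sketch below takes the fitted value of each observation to be its treatment mean, which is what a one-way model fits. The numbers are made up for the example and are not the stock data; the function name is hypothetical.

```python
from statistics import mean

# Hypothetical responses for three treatments (illustrative values only).
data = {
    "treatment_1": [0.4, 0.1, -0.2, 0.3],
    "treatment_2": [0.9, 0.6, 0.7, 1.0],
    "treatment_3": [-0.1, 0.2, 0.0, 0.3],
}

def residuals(groups):
    """Return e_ij = y_ij - yhat_ij, with yhat_ij the mean of treatment i."""
    out = {}
    for name, ys in groups.items():
        fitted = mean(ys)                       # fitted value for every y in this treatment
        out[name] = [y - fitted for y in ys]
    return out

resid = residuals(data)
for name, es in resid.items():
    print(name, [round(e, 3) for e in es])
```

By construction the residuals within each treatment sum to zero, so any structure seen in a residual plot reflects spread or pattern, not location.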




                          Normality

• Why normal?
   – ANOVA is an Analysis of Variance
   – Analysis of two variances, more specifically, the ratio of two variances
   – Statistical inference is based on the F distribution, which is the distribution of
     the ratio of two independent chi-squared variables, each divided by its degrees of freedom
   – No surprise that each variance in the ANOVA ratio comes from a parent
     normal distribution
• Normality is only needed for statistical inference.




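       SAS code for getting residuals

/* Read the spreadsheet into the data set STOCK */
PROC IMPORT datafile = 'C:\Users\junyzhang\Desktop\mydata.xls' out = stock;
RUN;
PROC PRINT DATA=stock;
RUN;
/* One-way model: response ADPCDATA, factor INDU;
   save fitted values (yhat) and residuals (resid) in STOCK1 */
PROC GLM DATA=stock;
  CLASS indu;
  MODEL adpcdata=indu;
  OUTPUT OUT=stock1 p=yhat r=resid;
RUN;
PROC PRINT DATA=stock1;
RUN;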



              Normality test

The normal plot of the residuals is used to check
  the normality assumption.


                  proc univariate data= stock1
                  normal plot;
                    var resid;
                  run;




                 Normality Tests

Tests for Normality (left panel)

  Test                  --Statistic---       -----p Value------
  Shapiro-Wilk          W       0.731203     Pr < W       <0.0001
  Kolmogorov-Smirnov    D       0.206069     Pr > D       <0.0100
  Cramer-von Mises      W-Sq    1.391667     Pr > W-Sq    <0.0050
  Anderson-Darling      A-Sq    7.797847     Pr > A-Sq    <0.0050

  [Normal Probability Plot, left panel: vertical axis from 0.25 to 8.25]

Tests for Normality (right panel)

  Test                  --Statistic---       -----p Value------
  Shapiro-Wilk          W       0.989846     Pr < W        0.6521
  Kolmogorov-Smirnov    D       0.057951     Pr > D       >0.1500
  Cramer-von Mises      W-Sq    0.03225      Pr > W-Sq    >0.2500
  Anderson-Darling      A-Sq    0.224264     Pr > A-Sq    >0.2500

  [Normal Probability Plot, right panel: vertical axis from -2.1 to 2.3, horizontal axis (normal quantiles) from -2 to +2]
Normality
  Tests




                   Independence

• Independent observations
  – No correlation between error terms
  – No correlation between independent variables and error

• Positively correlated data inflates the true standard
  error
  – The treatment means are estimated less accurately than the
    usual (independence-based) standard error shows.




  SAS code for independence test

The plot of the residuals against the factor is used
  to check independence.


                      proc plot data=stock1;
                        plot resid*indu;
                      run;




Independence Tests




      Homogeneity of Variances
• Eisenhart (1947) describes the problem of unequal
  variances as follows
   – The ANOVA model is based on the ratio of the mean
     squares of the factors to the residual mean square
   – The residual mean square is the unbiased estimator of $\sigma^2$, the
     variance of a single observation
   – The between-treatment mean square takes into account not only
     the differences between observations, $\sigma^2$, just like the residual
     mean square, but also the variance between treatments
   – If there were non-constant variance among treatments, we would
     replace the residual mean square with some overall variance, $\sigma_a^2$,
     and a treatment variance, $\sigma_t^2$, which is some weighted version of
     $\sigma_a^2$
   – The “neatness” of ANOVA is lost



SAS code for Homogeneity of Variances test
The plot of residuals against the fitted values is
  used to check the constant variance assumption.


                proc plot data=stock1;
                  plot resid*yhat;
                run;




Data with homogeneity of Variances




Tests for Homogeneity of Variances




     Result about our data

– Normal populations

– Nearly independent samples

– Equal (unknown) population variances

So we can employ ANOVA to analyze our data.


  Derivation – 1-Way ANOVA
• Hypotheses
   – H0: μ= μ1 = μ2 = μ3 = … = μn
   – H1: μi ≠ μj for some i,j
• We assume that the jth observation in group i is
  related to the mean by xij = μ+ (μi – μ) + εij, where εij
  is a random noise term.
• We wish to separate the variability of the individual
  observations into parts due to differences between
  groups and individual variability

Derivation – 1-Way ANOVA – Cont’




 Derivation – 1-Way ANOVA – Cont’

• Using the above equation, we define




• We can show that




  Derivation – 1-Way ANOVA – Cont’

• Given the distributions of the MSS values, we
  can reject the null hypothesis if the between
  group variance is significantly higher than the
  within group variance. That is,




 • We reject the null hypothesis if $F > f_{n-1,\,N-n,\,\alpha}$
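
The equations on these derivation slides did not survive extraction. As a hedged reconstruction, the standard one-way decomposition is sketched below; it is the textbook form, not necessarily the exact notation the slides used. It assumes n groups, $n_i$ observations in group i, $N = \sum_i n_i$ observations in total, group means $\bar{x}_i$ and grand mean $\bar{x}$.

```latex
\begin{aligned}
\underbrace{\sum_{i=1}^{n}\sum_{j=1}^{n_i} (x_{ij}-\bar{x})^2}_{SS_{\text{total}}}
  &= \underbrace{\sum_{i=1}^{n} n_i\,(\bar{x}_i-\bar{x})^2}_{SS_{\text{between}}}
   + \underbrace{\sum_{i=1}^{n}\sum_{j=1}^{n_i} (x_{ij}-\bar{x}_i)^2}_{SS_{\text{within}}} \\[4pt]
MSS_{\text{between}} &= \frac{SS_{\text{between}}}{n-1}, \qquad
MSS_{\text{within}} = \frac{SS_{\text{within}}}{N-n} \\[4pt]
F &= \frac{MSS_{\text{between}}}{MSS_{\text{within}}} \sim F_{n-1,\,N-n}
  \quad \text{under } H_0
\end{aligned}
```

The degrees of freedom n-1 and N-n match the critical value $f_{n-1,\,N-n,\,\alpha}$ used in the rejection rule.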

      Brief Summary Statistics

• Code
proc means data=stock maxdec=5 n mean std;
  by industry;   /* BY requires stock to be sorted by industry */
  var ADPC;
run;


        Get simple summary statistics(sample size,
        sample mean and SD of each industry) with
        max of 5 decimal places
       Brief Summary Statistics

• Output



     Industry               N    Mean      Std Dev
     Apparel Stores         15   0.00253   0.00356
     Application Software   15   0.00413   0.00742
     Credit Service         15   0.00135   0.00443
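
The one-way F statistic can be approximated from these summary statistics alone. The sketch below is not part of the original SAS workflow; it uses the rounded values printed in the table, so the result only agrees with SAS up to rounding. The function name is illustrative.

```python
# (n, mean, standard deviation) per industry, copied from the PROC MEANS output.
summary = {
    "Apparel Stores":       (15, 0.00253, 0.00356),
    "Application Software": (15, 0.00413, 0.00742),
    "Credit Service":       (15, 0.00135, 0.00443),
}

def f_from_summary(stats):
    """Return (MS between, MS within, F) from group sizes, means and SDs."""
    k = len(stats)
    n_total = sum(n for n, _, _ in stats.values())
    grand_mean = sum(n * m for n, m, _ in stats.values()) / n_total
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in stats.values())
    ss_within = sum((n - 1) * s ** 2 for n, _, s in stats.values())
    ms_between = ss_between / (k - 1)        # k - 1 = 2 degrees of freedom
    ms_within = ss_within / (n_total - k)    # N - k = 42 degrees of freedom
    return ms_between, ms_within, ms_between / ms_within

ms_b, ms_w, f_stat = f_from_summary(summary)
print(f"MS between = {ms_b:.8f}, MS within = {ms_w:.8f}, F = {f_stat:.2f}")
```

An F value close to 1 means the spread of the three industry means is about what within-industry noise alone would produce.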




                 Data Plot
• Code
proc plot data=stock;
plot industry*ADPC;
run;




         Produce crude graphical output

                                   Data Plot
• Output
Plot of industry*ADPC. Legend: A = 1 obs, B = 2 obs, D = 4 obs.

industry
    |
CreditSe +       A            A B A AAA AABA A A

Applicat +                A    D A AAAAA A             AA                  A

ApparelS +                  AA B A B B B A BA
    |
    -+---------+---------+---------+---------+---------+---------+---------+-----
   -0.015 -0.010 -0.005 0.000 0.005 0.010 0.015 0.020                               0.025

                                  ADPC

                One Way ANOVA Test
•   Code
proc anova data=stock;
  class industry;                  /* CLASS statement indicates that "industry" is a factor */
  model ADPC=industry;             /* assumes "industry" influences average daily percentage change */
  means industry / tukey cldiff;   /* multiple comparison by Tukey's method: get actual confidence intervals */
  means industry / tukey lines;    /* get pictorial display of comparisons */
run;
                   GLM analysis
• Code
proc glm data=stock;
class industry;
model ADPC=industry;
output out=stockfit p=yhat r=resid;
run;

           This procedure is similar to 'proc anova', but
           'glm' can save residuals for plotting, at the cost of
           more verbose output.


            One Way ANOVA Test
• Output
Dependent Variable: ADPC
                              Sum of
Source            DF         Squares    Mean Square   F Value   Pr > F
Model              2      0.00005833     0.00002916      1.00   0.3757
Error             42      0.00122217     0.00002910
Corrected Total   44      0.00128050

R-Square   Coeff Var   Root MSE   ADPC Mean
0.045552   201.8054    0.005394   0.002673

Source            DF        Anova SS    Mean Square   F Value   Pr > F
industry           2      0.00005833     0.00002916      1.00   0.3757
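
The fit statistics in this output follow directly from the sums of squares. The sketch below recomputes them from the printed (rounded) figures; variable names are illustrative, and small differences from the SAS values are due to rounding.

```python
from math import sqrt

# Values copied from the ANOVA table above.
ss_model = 0.00005833
ss_error = 0.00122217
ss_total = 0.00128050
df_error = 42
adpc_mean = 0.002673

r_square = ss_model / ss_total               # share of variation explained by industry
root_mse = sqrt(ss_error / df_error)         # estimate of the error standard deviation
coeff_var = 100 * root_mse / adpc_mean       # Root MSE as a percentage of the mean

print(f"R-Square  = {r_square:.6f}")
print(f"Root MSE  = {root_mse:.6f}")
print(f"Coeff Var = {coeff_var:.2f}")
```

An R-Square near 0.046 says industry accounts for under 5% of the variation in ADPC, consistent with the non-significant F value.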


          One Way ANOVA Test

Tukey's Studentized Range (HSD) Test for ADPC
Alpha                                   0.05
Error Degrees of Freedom                  42
Error Mean Square                    .000029
Critical Value of Studentized Range  3.43582
Minimum Significant Difference         .0048


           One Way ANOVA Test


                      Difference
Industry                Between     Simultaneous 95%
Comparison              Means      Confidence Limits
Applicat - ApparelS     0.001601   -0.003184 0.006387
Applicat - CreditSe     0.002778   -0.002008 0.007563
ApparelS - Applicat    -0.001601    -0.006387 0.003184
ApparelS - CreditSe     0.001177    -0.003609 0.005962
CreditSe - Applicat    -0.002778    -0.007563 0.002008
CreditSe - ApparelS    -0.001177    -0.005962 0.003609
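
The Tukey limits above can be reproduced from the reported critical value. This sketch assumes the equal-sample-size form of the minimum significant difference, q·sqrt(MSE/n), with n = 15 stocks per industry; inputs are the rounded values from the output and the helper name is hypothetical.

```python
from math import sqrt

q_crit = 3.43582        # critical value of the studentized range (from the output)
mse = 0.00002910        # error mean square with 42 degrees of freedom
n_per_group = 15        # stocks sampled per industry

# Minimum significant difference for equal group sizes.
msd = q_crit * sqrt(mse / n_per_group)

def tukey_interval(diff, half_width):
    """Simultaneous confidence limits for one difference between means."""
    return diff - half_width, diff + half_width

lower, upper = tukey_interval(0.001601, msd)   # Applicat - ApparelS
print(f"MSD = {msd:.4f}, limits = ({lower:.6f}, {upper:.6f})")
```

Every interval in the table contains 0, so no pair of industries differs significantly at the 0.05 level.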
         Univariate Procedure
• Code
proc univariate data=stockfit plot normal;
  var resid;
run;

                 We use the proc univariate to produce
                     the stem-and-leaf and normal
                 probability plots and we use the stem-
                    leaf plot to visualize the overall
                       distribution of a variable.

           Univariate Procedure
• Output
                       Moments
N                 45            Sum Weights        45
Mean              0             Sum Observations   0
Std Deviation     0.00527035    Variance           0.00002778
Skewness          1.33008795    Kurtosis           5.46395169
Uncorrected SS    0.00122217    Corrected SS       0.00122217
Coeff Variation   .             Std Error Mean     0.00078566



      Tests for Location: Mu0=0
 Test       -Statistic- -----p Value------
 Student's t t       0 Pr > |t| 1.0000
 Sign        M -1.5 Pr >= |M| 0.7660
Signed Rank S -43.5 Pr >= |S| 0.6288




       Basic Statistical Measures


      Location                  Variability
Mean      0.00000    Std Deviation         0.00527
Median   -0.00048    Variance              0.0000278
Mode      .          Range                 0.03389
                     Interquartile Range   0.00623



               Tests for Normality

Test             --Statistic--- -----p Value------
Shapiro-Wilk      W 0.904256     Pr < W 0.0013
Kolmogorov-Smirnov D 0.112584 Pr > D >0.1500
Cramer-von Mises W-Sq 0.096018 Pr > W-Sq 0.1266
Anderson-Darling A-Sq 0.781507 Pr > A-Sq 0.0410




 Quantiles
Quantile    Estimate

100% Max     0.021509105
99%      0.021509105
95%      0.007261567
90%      0.005106613
75% Q3     0.002667399
50% Median -0.000477723
25% Q1    -0.003565176
10%     -0.004824061
5%     -0.005444811
1%     -0.012376248
0% Min    -0.012376248
Extreme Observations

 -------Lowest-------    -------Highest------

    Value     Obs          Value     Obs

 -0.01237625        41     0.00510661        6
 -0.00807339        25     0.00596875        34
 -0.00544481        13     0.00726157        29
 -0.00483936        3     0.00814126        27
 -0.00482406        28     0.02150911        22



Stem Leaf Plot and Boxplot
Stem Leaf                  #   Boxplot
   20 5                     1    *
   18
   16
   14
   12
   10
    8   1                  1     |
    6   03                  2    |
    4   4561                4     |
    2   0027922              7  +-----+
    0   334669               6  | + |
   -0   9809753              7  *-----*
   -2   97688551             8  +-----+
   -4   4888772              7     |
   -6                            |
   -8   1                   1    |
  -10                            |
  -12 4                     1    |
        ----+----+----+----+
  Multiply Stem.Leaf by 10**-3
                           Plot
•   Code
proc plot data=stockfit;
  plot resid*industry;
  plot resid*yhat;
run;

          Plot the residuals against industry, and the
           residuals against the fitted ADPC value.

           Normal Probability Plot
    0.021+                                *
      |
      |
      |
      |                               +++
      |                             ++++
      |                           ++*
      |                         ++++*
      |                      ++*****
      |                   +*****
      |                +****
      |               *****
      |           ******
      |      * ******+
      |       ++++
      |     *++
      | ++++
-0.013++++*
       +----+----+----+----+----+----+----+----+----+----+
         -2      -1      0      +1      +2



                         Graph
[Plot of resid*industry: residuals (vertical axis, about -0.015 to 0.025) against industry (ApparelS, Applicat, CreditSe). Legend: A = 1 obs, B = 2 obs, D = 4 obs.]




                                      Plot of resid*yhat
[Plot of resid*yhat: residuals (vertical axis, about -0.015 to 0.025) against fitted values yhat (horizontal axis, 0.0010 to 0.0035). Legend: A = 1 obs, B = 2 obs, D = 4 obs.]

                 Conclusion
• The one-way ANOVA test gives F = 1.00 and
  p = 0.3757. Since the p-value is larger than the
  0.05 significance level, we fail to reject the null
  hypothesis: the data show no significant
  difference between the mean daily percentage
  change of stocks in the three industries. On this
  evidence, it makes no detectable difference which
  of these industries' stocks we buy.
We now have two factors (A & B)




Dot Notation
Linear Model




Least Square Method

SST = SSA + SSB + SSAB + SSE
    Test Criteria
Rejection Conditions




Pivotal Quantity




Pivotal Quantity (Cont’)




Two-Way ANOVA in SAS


    By: Philip Caffrey
            &
        Yin Diao

                        Model
• An extension of one-way ANOVA. It provides more
  insight into how the two IVs interact and individually
  affect the DV. Thus, the main effects and the interaction
  effect of the two IVs on the DV need to be tested.

• Model:

• Null hypothesis:
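
The model and hypotheses were shown as images on the original slide. In the usual effects notation (assumed here: τ for the first IV, β for the second, (τβ) for their interaction), they would read:

```latex
\begin{align*}
y_{ijk} &= \mu + \tau_i + \beta_j + (\tau\beta)_{ij} + \varepsilon_{ijk}, \\
H_{0A}  &: \tau_1 = \tau_2 = \dots = \tau_a = 0, \\
H_{0B}  &: \beta_1 = \beta_2 = \dots = \beta_b = 0, \\
H_{0AB} &: (\tau\beta)_{ij} = 0 \ \text{ for all } i, j .
\end{align*}
```

Each null hypothesis is tested against the alternative that at least one of the corresponding effects is nonzero.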




                                                            79
                  Sum of Squares


Each mean square, divided by the error mean square, follows an F
 distribution under its null hypothesis. In this way, we can conclude
 whether there is a main effect or an interaction effect.

 SSTOTAL = SSA + SSB + SSINTERACTION + SSERROR
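
To make the decomposition concrete, here is a minimal Python sketch that computes each sum of squares for a small balanced layout and checks that they add up to the total. The data values and the names `cells` and `two_way_ss` are hypothetical, chosen only for illustration; they are not the stock data from the presentation.

```python
# Hypothetical balanced 2 x 2 layout with 2 replicates per cell (illustrative numbers only).
cells = {
    ("a1", "b1"): [4.0, 6.0],
    ("a1", "b2"): [7.0, 9.0],
    ("a2", "b1"): [5.0, 7.0],
    ("a2", "b2"): [12.0, 14.0],
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def two_way_ss(cells):
    """Sums of squares for a balanced two-way layout with interaction."""
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))  # replicates per cell (balanced design assumed)

    grand = mean(y for ys in cells.values() for y in ys)
    a_mean = {a: mean(y for b in b_levels for y in cells[(a, b)]) for a in a_levels}
    b_mean = {b: mean(y for a in a_levels for y in cells[(a, b)]) for b in b_levels}
    cell_mean = {key: mean(ys) for key, ys in cells.items()}

    ssa = len(b_levels) * n * sum((a_mean[a] - grand) ** 2 for a in a_levels)
    ssb = len(a_levels) * n * sum((b_mean[b] - grand) ** 2 for b in b_levels)
    ssab = n * sum(
        (cell_mean[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
        for a in a_levels
        for b in b_levels
    )
    sse = sum((y - cell_mean[key]) ** 2 for key, ys in cells.items() for y in ys)
    sst = sum((y - grand) ** 2 for ys in cells.values() for y in ys)
    return {"SSA": ssa, "SSB": ssb, "SSAB": ssab, "SSE": sse, "SST": sst}


ss = two_way_ss(cells)
print(ss)
print("identity holds:", abs(ss["SST"] - (ss["SSA"] + ss["SSB"] + ss["SSAB"] + ss["SSE"])) < 1e-9)
```

The formulas inside `two_way_ss` rely on every cell having the same number of observations; an unbalanced design would need a different approach.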




                                                       80
                Example



Using the same data from the One-Way
analysis, we will now separate the data further
by introducing a second factor, Average Daily
Volume.



                                              81
                Example

Factor 1: Industry
     • Apparel Stores
     • Application Software
     • Credit Services


Factor 2: Average Daily Volume
     • Low
     • Medium
     • High

                                 82
              Two-Way Design
    3 x 3 layout, each cell repeated 5 times:

                         INDUSTRY
    VOLUME       Credit     Apparel    Software
    High         5 obs      5 obs      5 obs
    Medium       5 obs      5 obs      5 obs
    Low          5 obs      5 obs      5 obs

                                                                 83
                    Using SAS

SAS code:

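   /* Read the Excel sheet into the SAS data set TWOWAY */
   PROC IMPORT
     DATAFILE='G:\Stony Brok Univ Text Books\AMS Project\Data.xls'
     OUT=TWOWAY;
   RUN;

   /* Two-way ANOVA; INDUSTRY | VOLUME expands to
      INDUSTRY VOLUME INDUSTRY*VOLUME */
   PROC ANOVA DATA=TWOWAY;
     TITLE "ANALYSIS OF STOCK DATA";
     CLASS INDUSTRY VOLUME;
     MODEL ADPC = INDUSTRY | VOLUME;
     /* Tukey comparisons with confidence limits for the differences */
     MEANS INDUSTRY | VOLUME / TUKEY CLDIFF;
   RUN;
   QUIT;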

                                                 84
                     Using SAS

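/* PLOT THE CELL MEANS */

/* NWAY keeps only the INDUSTRY-by-VOLUME cells; MEAN= stores each
   cell mean under the original variable name ADPC.
   Data set and variable names follow the PROC ANOVA step. */
PROC MEANS DATA=TWOWAY NWAY NOPRINT;
  CLASS INDUSTRY VOLUME;
  VAR ADPC;
  OUTPUT OUT=MEANS MEAN=;
RUN;

/* Mean ADPC against volume, one series per industry */
PROC GPLOT DATA=MEANS;
  PLOT ADPC*VOLUME=INDUSTRY;
RUN;
QUIT;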

                                    85
                       ANOVA Table
             Tests of Between-Subjects Effects
                     Sum of                 Mean
Source               Squares      df        Square      F       Sig.
Corrected Model         .000a        8    3.335E-5    1.184    .335
Industry            6.906E-5         2    3.453E-5    1.226    .305
Volume              9.534E-5         2    4.767E-5    1.693    .198
Industry * Volume   7.950E-5         4    1.988E-5     .706    .593
Error                   .001        36    2.816E-5
Corrected Total         .001        44

No significant results.
                                                                   86
               Using SAS
To test the main effect of one IV, we pool the
data over all levels of the other IV, as is done
in a one-way ANOVA.
The ANOVA table shows no significant main
effects and no significant interaction effect
of the two IVs.

To check for an interaction effect visually,
we can plot the mean of each cell formed by
combining the levels of the two IVs.
                                                 87
      PLOT OF CELL MEANS
Industry by Average Daily Volume




                                   88
         Interpreting the Output

Given that the F tests were not significant we would
    normally stop our analysis here.
If the F test is significant, we would want to know
    exactly which means are different from each other.

Use Tukey’s Test.
  MEANS INDUSTRY | VOLUME / TUKEY CLDIFF;




                                                         89
                   Interpreting the Output

                       Comparing Means
Comparison                Diff. between Means    95% CI
Software - Apparel             0.001601      [-0.003184, 0.006387]
Software - Credit              0.002778      [-0.002008, 0.007563]
Credit - Apparel              -0.001177      [-0.005962, 0.003609]
Med.Vol. - LowVol.            -0.003698      [-0.008435, 0.001038]
Med.Vol. - HighVol.           -0.001252      [-0.005989, 0.003484]
HighVol. - LowVol.            -0.002446      [-0.007182, 0.002290]
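
One way to read the table: a pairwise difference is significant at the 5% level only if its 95% confidence interval excludes zero. The short Python sketch below applies that rule to the six intervals copied from the table; the helper name `is_significant` is introduced here for illustration.

```python
# (comparison, difference between means, lower 95% limit, upper 95% limit), copied from the table
comparisons = [
    ("Software - Apparel", 0.001601, -0.003184, 0.006387),
    ("Software - Credit", 0.002778, -0.002008, 0.007563),
    ("Credit - Apparel", -0.001177, -0.005962, 0.003609),
    ("Med.Vol. - LowVol.", -0.003698, -0.008435, 0.001038),
    ("Med.Vol. - HighVol.", -0.001252, -0.005989, 0.003484),
    ("HighVol. - LowVol.", -0.002446, -0.007182, 0.002290),
]


def is_significant(lower, upper):
    """The difference is significant only if the interval does not contain zero."""
    return lower > 0 or upper < 0


significant = [name for name, _, lower, upper in comparisons if is_significant(lower, upper)]

for name, diff, lower, upper in comparisons:
    verdict = "significant" if is_significant(lower, upper) else "not significant"
    print(f"{name:20s} diff={diff:+.6f}  CI=[{lower:+.6f}, {upper:+.6f}]  {verdict}")

print("significant comparisons:", significant)
```

Every interval straddles zero, so the list of significant comparisons comes out empty, in line with the non-significant F tests.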


                                                             90
                Conclusion

• We cannot conclude that there is a significant
  difference between any of the group means.

• We find no evidence that the two IVs affect the DV.




                                                   91
92
                   M-way ANOVA
                    (Derivation)
• Let us have n factors, $A_1, A_2, \dots, A_n$, each with 2 or
  more levels, $a_1, a_2, \dots, a_n$, respectively. Then there
  are $N = a_1 a_2 \cdots a_n$ types of treatment to conduct,
  with each treatment having sample size $n_i$. Let
  $x_{i_1 i_2 \dots i_n k}$ be the kth observation from treatment
  $i_1 i_2 \dots i_n$.
• By the assumption for ANOVA, $x_{i_1 i_2 \dots i_n k}$ is a
  random variable that follows the normal
  distribution. We use the model
  $x_{i_1 i_2 \dots i_n k} = \mu_{i_1 i_2 \dots i_n} + \varepsilon_{i_1 i_2 \dots i_n k}$,
  where the residuals $\varepsilon_{i_1 i_2 \dots i_n k}$ are i.i.d. and
  follow $N(0, \sigma^2)$.

                                                         93
                                   M-way ANOVA
                                    (Derivation)
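Using “dot notation”, let a dot in place of an index denote the average over that index. This gives the
mean of each treatment, the mean of each level of each factor, …, and the grand mean.

Let μ be the grand mean, and let the effect of each level of a factor be the mean at that level minus the
grand mean. Then we can model the above as a linear equation in the grand mean, these effects, their
interactions and the residual.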
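
The symbols on this slide were lost in extraction. As an illustration only, take three factors A, B and C with a, b and c levels and n observations per treatment (this notation is assumed here, not taken from the slide). Some of the dot-notation means and the linear model could then be written as:

```latex
\begin{align*}
\bar{x}_{ijk\cdot} &= \frac{1}{n}\sum_{l=1}^{n} x_{ijkl}, \qquad
\bar{x}_{i\cdot\cdot\cdot} = \frac{1}{bcn}\sum_{j=1}^{b}\sum_{k=1}^{c}\sum_{l=1}^{n} x_{ijkl}, \qquad
\bar{x}_{\cdot\cdot\cdot\cdot} = \frac{1}{abcn}\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}\sum_{l=1}^{n} x_{ijkl}, \\
x_{ijkl} &= \mu + \alpha_i + \beta_j + \gamma_k
          + (\alpha\beta)_{ij} + (\alpha\gamma)_{ik} + (\beta\gamma)_{jk}
          + (\alpha\beta\gamma)_{ijk} + \varepsilon_{ijkl}.
\end{align*}
```

Here μ is the grand mean, α, β and γ are the main effects (level mean minus grand mean), and the bracketed terms are the two-way and three-way interactions. The general m-way model has one such term for every non-empty collection of factors.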




                                                                                               94
                         M-way ANOVA
                          (Derivation)
Applying Least Square Estimation we get




Which is the ANOVA Identity,




                                          95
              M-way ANOVA
               (Derivation)
• These are all distributed as independent χ2
  random variables (when multiplied by the
  correct constants and when some hypotheses
  hold) with d.f. satisfying the equation:
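
The equation itself was not preserved. As a sketch for the three-factor balanced case (a, b, c levels and n observations per treatment, all assumed notation), the ANOVA identity and the matching degrees-of-freedom equation are:

```latex
\begin{align*}
SST &= SSA + SSB + SSC + SSAB + SSAC + SSBC + SSABC + SSE, \\
abcn - 1 &= (a-1) + (b-1) + (c-1) \\
         &\quad + (a-1)(b-1) + (a-1)(c-1) + (b-1)(c-1) \\
         &\quad + (a-1)(b-1)(c-1) + abc(n-1).
\end{align*}
```

The effect terms sum to abc - 1 because the product of (1 + (a-1)), (1 + (b-1)) and (1 + (c-1)) expands into exactly these terms plus 1; adding the error degrees of freedom abc(n-1) gives abcn - 1.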




                                                96
                 M-way ANOVA
                  (Derivation)
• There are a total of $2^m$ hypotheses in an m-way ANOVA.
  – The null hypothesis, which states that there is no
    difference or interaction between factors.
  – For k from 1 to m, there are $\binom{m}{k}$ alternative
    hypotheses about the interaction between every
    collection of k factors.
  – Then we have $1 + \binom{m}{1} + \binom{m}{2} + \dots + \binom{m}{m} = 2^m$ by
    a well-known combinatorial identity.
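
The count can be checked numerically. This small Python sketch (the function name `count_hypotheses` is introduced here for illustration) adds one null hypothesis to one alternative per non-empty collection of factors and compares the result with the power of two:

```python
from math import comb


def count_hypotheses(m):
    """One null hypothesis plus one alternative per collection of k factors, k = 1..m."""
    return 1 + sum(comb(m, k) for k in range(1, m + 1))


for m in range(1, 6):
    print(f"m={m}: counted {count_hypotheses(m)}, 2**m = {2 ** m}")
```

For a two-way ANOVA this gives 4: the null hypothesis, the two main effects and the single interaction.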

                                                         97
               M-way ANOVA
                (Derivation)
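• These hypotheses are:
  – For each factor: H0: all of its main effects are zero,
    versus Ha: at least one of them is nonzero.
                                 ...
  – For each pair of factors: H0: all of their interaction
    effects are zero, versus Ha: at least one of them is nonzero.
                                 ...
  – Test in the same way for every combination of factors.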


                                                                     98
                M-way ANOVA
                 (Derivation)
• We want to see if the variability between
  groups is larger than the variability within the
  groups.

• To do this, we use a pivotal quantity that follows
  the F distribution, and then we can derive the
  proper tests, very similar to the 1-way and 2-way tests.

                                                     99
       M-way ANOVA
        (Derivation)


[Test statistics for each main effect, each two-factor interaction, ...]

Continue in the same way to test every combination of factors.



                                             100
RELATIONSHIP BETWEEN

     ANOVA and Regression




         Presenter: Cris J.Y. Liu
                                    101
• What we know:
   – Regression is the statistical model used to predict
     a continuous outcome on the basis of one or more
     continuous predictor variables.
   – ANOVA compares several groups (usually defined by
     categorical predictor variables) in terms of a
     continuous dependent variable.
     (If the predictors are a mixture of categorical and
     continuous variables, ANCOVA is an alternative method.)
• Take a second look:
  They are just different sides of the same coin!




                                                            102
            Review of ANOVA

• Compare the means of different groups
• n groups, $n_i$ elements in the ith group, N elements
  in total.
• SST = SSbetween + SSwithin

                               What about only
                              two groups, X and Y,
                              each with n data points?

                                                   103
  Review of Simple Linear Regression

• We try to find a line $y = \beta_0 + \beta_1 x$ that best fits our
  data so that we can calculate the best estimate of y
  from x.
• We find the $\beta_0$ and $\beta_1$ that minimize the distance Q between the actual
  and estimated scores:

      $Q = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$      (minimize this)

• Let the predicted values form one group, while
  the other group consists of all the original values.
• It is a special (and also simple) case of ANOVA!
                                                                          104
        Review of Regression
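Total (SST)        =     Model (Between, SSR)     +     Error (Within, SSE)

$\sum_{i}(y_i - \bar{y})^2$    =    $\sum_{i}(\hat{y}_i - \bar{y})^2$    +    $\sum_{i}(y_i - \hat{y}_i)^2$

d.f.: n-1                d.f.: 2-1 = 1                  d.f.: n-2
                                                                    105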
ANOVA table of Regression




                            106
            How are they alike?


• If we use the group means as the X values from
  which we predict Y, we can see that ANOVA and
  regression are the same!

• The group mean is the best prediction of a Y-score.
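
The claim can be verified on a toy example. The Python sketch below uses hypothetical scores for three groups (not data from the presentation). It regresses each observation on its own group mean by least squares and compares the regression sums of squares with the one-way ANOVA sums of squares.

```python
from math import isclose

# Hypothetical scores for three groups (illustrative numbers only).
groups = {"g1": [2.0, 4.0, 6.0], "g2": [5.0, 7.0, 9.0], "g3": [10.0, 11.0, 12.0]}


def mean(values):
    return sum(values) / len(values)


# ANOVA side: between-group and within-group sums of squares.
y = [value for ys in groups.values() for value in ys]
grand_mean = mean(y)
ss_between = sum(len(ys) * (mean(ys) - grand_mean) ** 2 for ys in groups.values())
ss_within = sum((value - mean(ys)) ** 2 for ys in groups.values() for value in ys)

# Regression side: the predictor x is the mean of the group each observation belongs to.
x = [mean(ys) for ys in groups.values() for _ in ys]
x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * xi for xi in x]
ssr = sum((fi - y_bar) ** 2 for fi in fitted)
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

print(f"slope={slope:.6f}, intercept={intercept:.6f}")
print(f"SSR={ssr:.6f}  SSbetween={ss_between:.6f}")
print(f"SSE={sse:.6f}  SSwithin={ss_within:.6f}")
```

The fitted line has slope 1 and intercept 0, so each fitted value is simply the group mean, which is why SSR matches SSbetween and SSE matches SSwithin.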


                                                        107
         Term comparison
Regression                          ANOVA
             Dependent variable
             Explanatory variable
             Total mean

   SSR                                SSbetween
   SSE                                SSwithin


                                              108
              Term comparison


If there is more than one predictor:

 Regression                         ANOVA


Multiple Regression             Multi-way ANOVA

dummy variable                  categorical variable

interaction effect                   covariance

  ………………….                       ……………             109
                   Notes:

• Both of them are applicable only when
  outcome variables are continuous.

• They share basically the same procedure for
  checking the underlying assumptions.




                                               110
Robust ANOVA

-Taguchi Method




                  111
            What is Robustness?
• The term “robustness” is often used to refer to methods
  designed to be insensitive to distributional assumptions
  (such as normality) in general, and unusual observations
  (“outliers”) in particular.


• Why Robust ANOVA?
• There is always the possibility that some observations may
  contain excessive noise.
• Excessive noise during experiments might lead to incorrect
  inferences.
• Robust ANOVA is widely used in quality control.

                                                             112
             Robust ANOVA


• What do we want from robust ANOVA?
   Robust ANOVA methods should withstand non-ideal
  conditions while being no more difficult to
  perform than ordinary ANOVA.

• The standard technique, the least squares method, is
  highly sensitive to unusual observations.
                                              113
             Robust ANOVA

Our aim is to choose β to minimize $\sum_{i=1}^{n} \rho(y_i - x_i^{T}\beta)$.


 In standard ANOVA, we let $\rho(x) = x^2$ (least squares).


 We can also try some other ρ(x).


                                        114
         Least absolute deviation
• It is well known that the median is much more robust
  to outliers than the mean.
• The least absolute deviation (LAD) estimate takes $\rho(x) = |x|$.

• How is LAD related to the median?
   The LAD estimator determines the “center” of the data
   set by minimizing the sum of the absolute deviations
   from the estimate of the center, which turns out to be
   the median.
• It has been shown to be quite effective in the presence
  of fat-tailed data.
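
The link between LAD and the median can be seen numerically. This Python sketch uses a hypothetical five-point sample with one outlier and a simple grid search (both are illustrative choices, not part of the presentation) to find the center that minimizes the sum of absolute deviations.

```python
from statistics import mean, median


def sum_abs_dev(data, center):
    """LAD objective: sum of absolute deviations from a candidate center."""
    return sum(abs(x - center) for x in data)


data = [2.0, 3.0, 4.0, 5.0, 100.0]  # hypothetical sample; the last value is an outlier

# Search candidate centers 0.0, 0.1, ..., 100.0 for the smallest LAD objective.
candidates = [i / 10 for i in range(0, 1001)]
best_center = min(candidates, key=lambda c: sum_abs_dev(data, c))

print("LAD minimizer on the grid:", best_center)
print("median:", median(data))
print("mean:", mean(data))
print("LAD objective at median:", sum_abs_dev(data, median(data)))
print("LAD objective at mean:  ", sum_abs_dev(data, mean(data)))
```

The grid minimizer coincides with the median, which stays near the bulk of the data, while the mean is pulled far toward the outlier.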
                                                         115
               M-estimation
• M-estimation is based on replacing ρ(.) with a
  function that is less sensitive to unusual
  observations than is the quadratic.
• The “M” indicates that the estimator is of
  maximum-likelihood type.
• LAD, with $\rho(x) = |x|$, is an example of a robust
  M-estimator.
• Another popular choice of ρ is the Tukey bisquare:
  $\rho(r; c) = 1 - \left(1 - (r/c)^2\right)^3$ for $|r| \le c$,
  and $\rho(r; c) = 1$ otherwise, where r is the residual
  and c is a constant.
                                                   116
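
To show how the bisquare limits the influence of large residuals, here is a minimal Python sketch. It assumes the normalized form that equals 1 beyond the cutoff c; the value c = 4.685 is a commonly quoted tuning constant for unit-scale residuals and is an assumption here, not a value from the presentation.

```python
def tukey_bisquare(r, c):
    """Normalized Tukey bisquare: 0 at r = 0, rising to 1 at |r| = c, and 1 beyond."""
    if abs(r) <= c:
        return 1.0 - (1.0 - (r / c) ** 2) ** 3
    return 1.0


def quadratic(r):
    """Least squares loss, shown for comparison."""
    return r ** 2


c = 4.685  # assumed tuning constant (hypothetical choice for this illustration)

for r in [0.0, 1.0, 2.0, 4.685, 10.0, 100.0]:
    print(f"r={r:8.3f}  bisquare={tukey_bisquare(r, c):.4f}  quadratic={quadratic(r):.1f}")
```

The quadratic loss grows without bound, so a single extreme residual can dominate the fit, whereas the bisquare loss is capped at 1 for any residual larger than c in absolute value.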
                Suggestion

• These robust analyses may not take the place
  of standard ANOVA analyses in this context.
• Rather, we believe that the robust analyses
  should be undertaken as an adjunct to the
  standard analyses.




                                                 117
118
119

				