Robust Estimators

Estimating Population Parameters

• The mean (X̄) and variance (s²) are the
  descriptive statistics most commonly
  used to represent the data points of some
  sample.
• The real reason they are the preferred
  measures of central tendency and variability is
  that they have certain properties as
  estimators of their corresponding population
  parameters, μ and σ².
      Estimating Population
       Parameters (cont.)
• Four properties are considered desirable
  in a population estimator: sufficiency,
  unbiasedness, efficiency, and resistance.
• Both the mean and the variance are the
  best estimators in their class in terms of
  the first three of these four properties.
• To understand these properties, you first
  need to understand a concept in statistics
  called the sampling distribution.
Sampling Distribution Demo
Basically, the idea is this: to examine the properties
of a statistic, we take repeated samples from
some population of data and calculate the relevant statistic on
each sample. We can then look at the distribution of the
statistic across these samples and ask a variety of questions
about it.
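
The repeated-sampling idea can be sketched in a few lines of Python. This is only an illustration: the population (10,000 normal draws with mean 50 and SD 10), the sample size of 25, and the helper name `sampling_distribution` are assumptions made for the demo, not part of the original slides.

```python
import random
import statistics

def sampling_distribution(statistic, population, sample_size, n_samples, seed=0):
    """Draw repeated samples and return the statistic computed on each one."""
    rng = random.Random(seed)
    return [statistic(rng.sample(population, sample_size))
            for _ in range(n_samples)]

# Hypothetical population: 10,000 draws from a normal distribution.
pop_rng = random.Random(1)
population = [pop_rng.gauss(50, 10) for _ in range(10_000)]

# Sampling distribution of the mean for samples of size 25.
sample_means = sampling_distribution(statistics.mean, population,
                                     sample_size=25, n_samples=2000)

print("population mean:     ", round(statistics.mean(population), 2))
print("mean of sample means:", round(statistics.mean(sample_means), 2))
print("SD of sample means:  ", round(statistics.stdev(sample_means), 2))
```

The mean of the sample means should land close to the population mean, and their spread should be close to 10 / √25 = 2. Passing a different function as `statistic` (for example `statistics.median`) gives the sampling distribution of that statistic instead.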

     Properties of a Statistic
• 1) Sufficiency

 A sufficient statistic is one that makes use
 of all of the information in the sample to
 estimate its corresponding parameter. For
 example, this property makes the mean
 more attractive as a measure of central
 tendency compared to the mode or
 median.
       Estimating Population
            Parameters
• 2) Unbiasedness

 A statistic is said to be an unbiased estimator if
 its expected value (i.e., the mean of the statistic
 across a very large number of samples) is equal to
 the population parameter it is estimating.

• Using the resampling procedure, the mean can
  be shown to be an unbiased estimator.
    Assessing the Bias of an
           Estimator
Explanation of N-1 in the s² formula.

• If the more intuitive formula for s² is used:

  $$s_X^2 = \frac{\sum (X - \bar{X})^2}{N}$$

• it turns out to underestimate σ².
      Assessing the Bias of an
         Estimator (cont.)
• This bias toward underestimation is caused by the act
  of sampling: deviations are measured from the sample
  mean rather than from the population mean. It can be
  shown that the bias is removed if N-1 is used in the
  denominator instead of N.

• Note that this applies only when calculating s². If
  you have a measurable population and you
  want to calculate σ², you use N in the
  denominator, not N-1.
  $$s_X^2 = \frac{\sum_{i=1}^{N} (X_i - \bar{X})^2}{N - 1}$$
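
A small simulation makes the effect of the denominator visible. The sketch below assumes a normal population with variance 4 and samples of size 5; both values are arbitrary choices for the demo. It relies on the standard library's `statistics.pvariance` (divides by N) and `statistics.variance` (divides by N - 1).

```python
import random
import statistics

rng = random.Random(0)
N = 5            # small samples make the bias easy to see
TRUE_VAR = 4.0   # assumed population variance (SD = 2)

biased, unbiased = [], []
for _ in range(20_000):
    sample = [rng.gauss(0, 2) for _ in range(N)]
    biased.append(statistics.pvariance(sample))    # denominator N
    unbiased.append(statistics.variance(sample))   # denominator N - 1

print("true variance:                    ", TRUE_VAR)
print("average with N in denominator:    ", round(statistics.mean(biased), 2))
print("average with N - 1 in denominator:", round(statistics.mean(unbiased), 2))
```

With N in the denominator the long-run average should settle near (N - 1)/N × σ² = 3.2, while the N - 1 version should settle near the true value of 4.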
    Degrees of Freedom
• The mean of 6, 8, & 10 = 8.

• If I allow you to change as many of these
  numbers as you want BUT the mean
  must stay 8, how many of the numbers
  are you free to vary?
      Degrees of Freedom
• The point of this exercise is that when the
  mean is fixed, it removes a degree of
  freedom from your sample: you can vary any two
  of the three numbers, but the third is then
  determined. This is like subtracting 1 from the
  number of observations in your sample.
• It is for this reason that we use N-1 in the
  denominator when we calculate s².
  • The calculation requires that the mean be
    fixed first, which effectively fixes
    one of the data points.
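
The exercise can be checked with a tiny sketch. The helper name `last_value` is made up for this illustration: once the mean and all but one of the values are chosen, the remaining value is forced.

```python
def last_value(free_values, mean, n):
    """Return the one value that is forced once the other n - 1 are chosen."""
    return mean * n - sum(free_values)

# Choose any two numbers; the third is forced if the mean must stay 8.
print(last_value([6, 8], mean=8, n=3))      # 10
print(last_value([1, 100], mean=8, n=3))    # -77
```

Only two of the three numbers are free to vary, which is the N - 1 that appears in the denominator of s².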
Estimating Population Parameters
• 3) Efficiency

 The efficiency of a statistic is reflected in
 the variance that is observed when one
 examines the statistic over a bunch of
 independently chosen samples. The
 smaller the variance, the more efficient the
 statistic is said to be.
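
As a rough sketch of what efficiency means in practice, the simulation below compares the mean and the median as estimators of the center of a normal population. The standard normal population, the sample size of 25, and the 5000 repetitions are assumptions chosen for the demo.

```python
import random
import statistics

rng = random.Random(0)
means, medians = [], []
for _ in range(5000):
    sample = [rng.gauss(0, 1) for _ in range(25)]
    means.append(statistics.mean(sample))
    medians.append(statistics.median(sample))

# Variance of each statistic across the independently drawn samples.
var_mean = statistics.variance(means)
var_median = statistics.variance(medians)
print("variance of sample means:  ", round(var_mean, 4))
print("variance of sample medians:", round(var_median, 4))
```

For normal data the sample means should vary less from sample to sample than the sample medians, which is what it means for the mean to be the more efficient of the two here.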
Estimating Population Parameters

• 4) Resistance
 The resistance of an estimator refers to
 the degree to which that estimate is
 affected by extreme values.
 As mentioned previously, both X̄ and s²
 are highly sensitive to extreme values.
Estimating Population Parameters

• 4) Resistance (cont.)
 Despite this, they are still the most
 commonly used estimates of the
 corresponding population parameters,
 mostly because of their superiority over
 other measures in terms of sufficiency,
 unbiasedness, and efficiency. However…
            J. W. Tukey
“Statistics is a science in my opinion, and it is
no more a branch of mathematics than are
physics, chemistry and economics; for if its
methods fail the test of experience – not the
test of logic – they will be discarded”
                   Outline
• Different measures
  – trimmed mean, winsorized mean, M-
    estimators
  – Median absolute deviation
• Estimation
  – Bootstrap (a re-sampling procedure)
• End result
  – Increase your chances of getting significant
    results
Two (sometimes) related questions

• What is the best measure to estimate for
  the scientific purpose?
  – Mean v median
• What is the best way to measure the
  estimate?
These questions both concern robustness,
 but each also concerns something else:

Which measure: measurement level and
 meaningfulness of a statistic.

How measured: the distribution of the
 data.
           Robust Estimation
• What is Robust?
  – Influence plots, breakdown points (resistance)
• Robust Estimators
  – Trimmed, winsorized, and the M-Estimators
• Computer intensive methods
  – The bootstrap
• Mostly on central tendency, but briefly on others.
What to measure: Central Tendency
How many hours of TV watched yesterday


[Figure: histogram of hours of TV watched yesterday (x-axis: hours, 0 to 10; y-axis: frequency, 0 to 6). Mean = 3.95, Std. Dev. = 2.11449, N = 20.]
           Trimmed Mean
• Simply cut off the ends
• Example: 10% trim
• N=20
• Cut out 2 values (10%), lowest and highest
• Mean = 3.95
• Median = 3.50
• 10% trimmed mean = 3.78

Original data: 1 2 2 2 3 3 3 3 3 3 4 4 4 4 4 5 5 6 8 10
Trimmed data:    2 2 2 3 3 3 3 3 3 4 4 4 4 4 5 5 6 8
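
The figures on this slide can be reproduced with a short sketch. The list `tv_hours` holds the 20 values read from the slide's data column, and `trimmed_mean` is a hypothetical helper written for this illustration. Note that the slide's 10% trim removes 2 values in total, one from each end. Some libraries instead take the proportion to cut from each tail, so the same "10%" label can mean different things.

```python
import statistics

# The 20 observations from the slide (hours of TV watched yesterday).
tv_hours = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 8, 10]

def trimmed_mean(values, k):
    """Drop the k smallest and k largest values, then average the rest."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return statistics.mean(kept)

print(statistics.mean(tv_hours))              # 3.95
print(statistics.median(tv_hours))            # 3.5
print(round(trimmed_mean(tv_hours, 1), 2))    # 3.78
```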
         Winsorized Mean
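• Make some percentage of the most extreme values
  the same as the nearest value that is kept
• Doing this to 10% of our data (the lowest and
  the highest value)…
• 10% Winsorized mean = 3.90

Original data:   1 2 2 2 3 3 3 3 3 3 4 4 4 4 4 5 5 6 8 10
Winsorized data: 2 2 2 2 3 3 3 3 3 3 4 4 4 4 4 5 5 6 8 8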
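
A minimal sketch of the Winsorizing step, using the same 20 values. The helper `winsorized_mean` is a name made up for this illustration; `k` is the number of values replaced at each end, so `k=1` corresponds to the slide's example.

```python
# The 20 observations from the slide (hours of TV watched yesterday).
tv_hours = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 8, 10]

def winsorized_mean(values, k):
    """Replace the k lowest and k highest values with their nearest
    remaining neighbours, then average all n values."""
    ordered = sorted(values)
    n = len(ordered)
    low, high = ordered[k], ordered[n - k - 1]
    clipped = [min(max(v, low), high) for v in ordered]
    return sum(clipped) / n

print(winsorized_mean(tv_hours, 1))   # 3.9
```

Unlike trimming, Winsorizing keeps the sample size at 20: the 1 becomes a 2 and the 10 becomes an 8.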
             M-Estimators
• More mathematically complex, but that’s
  what computers are good for
• Calculate the center of the distribution and
  give less weight to values that are further
  away from center
• Different M-Estimators give different
  weights for deviating values
[Figure: weight (y-axis, 0.0 to 1.0) as a function of distance from the center (x-axis, -2 to 8) for the Hampel, Huber, Cauchy, Andrews, Bisquare, and Fair weight functions.]
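
To make the down-weighting idea concrete, here is a minimal sketch of one M-estimator, Huber's, computed by iteratively reweighted averaging. The tuning constant c = 1.345, the use of a rescaled MAD as the scale estimate, and the function name `huber_location` are assumptions of this sketch. Statistical packages use their own constants and scale estimates, so the result need not match any particular package's output exactly.

```python
import statistics

def huber_location(values, c=1.345, tol=1e-8, max_iter=100):
    """Huber M-estimate of location via iteratively reweighted averaging.

    Values within c scale units of the current center get weight 1;
    values further away get weight c / distance, so they count for less.
    """
    center = statistics.median(values)
    mad = statistics.median([abs(v - center) for v in values])
    scale = mad / 0.6745          # rescaled MAD (assumed scale estimate)
    if scale == 0:
        return center
    for _ in range(max_iter):
        weights = []
        for v in values:
            distance = abs(v - center) / scale
            weights.append(1.0 if distance <= c else c / distance)
        new_center = sum(w * v for w, v in zip(weights, values)) / sum(weights)
        if abs(new_center - center) < tol:
            return new_center
        center = new_center
    return center

tv_hours = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 8, 10]
print(round(huber_location(tv_hours), 2))
```

With these settings the estimate for the TV data falls between the median (3.50) and the mean (3.95), much closer to the median, because the values 8 and 10 receive small weights.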
            All together now
•   Mean =                  3.95
•   Median =                3.50
•   10% trimmed mean =      3.78
•   10% Winsorized mean =   3.90
•   Huber’s =               3.51
•   Tukey’s biweight =      3.33
•   Hampel’s =              3.37
•   Andrews’ =              3.33
  s vs. AD vs. Median Absolute Deviation

• s = the sample standard deviation; AD = the average
  absolute deviation from the mean
• MAD = the median absolute deviation from the
  median

• Efficiency: MAD > s > AD

• s vs. AD
  – For a normal distribution, AD ≈ .80(s)

• It turns out that the efficiency advantage of s over AD
  is nullified, and even reversed, for data that depart
  even slightly from normality.
• MAD
  – For a normal distribution, MAD ≈ .67(s) or .85(AD)
  – More efficient than either s or AD
  – Also more resistant than either
• Tukey
  – MAD is “the best of an inferior lot”
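
The three measures of spread can be computed for the TV data with a short sketch. The two helper names are made up for this illustration, and AD is taken here to be the average absolute deviation from the mean.

```python
import statistics

# The 20 observations from the TV example.
tv_hours = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 8, 10]

def average_absolute_deviation(values):
    """Mean of the absolute deviations from the mean (AD)."""
    m = statistics.mean(values)
    return sum(abs(v - m) for v in values) / len(values)

def median_absolute_deviation(values):
    """Median of the absolute deviations from the median (MAD)."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

print(round(statistics.stdev(tv_hours), 2))             # 2.11
print(round(average_absolute_deviation(tv_hours), 2))   # 1.45
print(median_absolute_deviation(tv_hours))              # 0.5
```

The two large values (8 and 10) inflate s and, to a lesser degree, AD, while MAD is determined entirely by the middle of the data.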
              Classical Statistics
• Parameter Estimates (“Point” Estimates)
   – Sample mean and sample standard deviation
   – Sample correlation and covariance estimates
   – Linear least squares model fits

• Statistical Inference
   – t-statistic and t-interval for an unknown mean
   – Standard errors and t-values for regression coefficients
   – F-tests for regression model hypotheses
    Classical stats are often not robust

 Outliers have “unbounded influence” on classical
 statistics, resulting in:

• Inaccurate parameter estimates and predictions
• Inaccurate statistical inference
   – Standard errors are too large
   – Confidence intervals are too wide
   – t-statistics lack power

• In short, unreliable when outliers are present
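
A quick way to see "unbounded influence" is to move a single observation further and further out and watch what happens to the mean and the median. The sketch below reuses the 20 TV values from the earlier example; the replacement values 100 and 1000 are arbitrary.

```python
import statistics

tv_hours = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 8, 10]

def with_extreme(values, extreme):
    """Return a copy of the data with its largest value replaced."""
    return sorted(values)[:-1] + [extreme]

for extreme in (10, 100, 1000):
    data = with_extreme(tv_hours, extreme)
    print(extreme, statistics.mean(data), statistics.median(data))
# 10 3.95 3.5
# 100 8.45 3.5
# 1000 53.45 3.5
```

One bad value can drag the mean anywhere, while the median does not move at all.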
                  Resistance
• A Fundamental Continuity Concept
  - Small changes in the data result in only small
    changes in estimate
  - “Change a few, so what” J.W. Tukey


• Valuable Consequence
  - A good fit to the bulk of the data
  - Reliable, automatic outlier detection
         J. W. Tukey
            (1979)
“… just which robust/resistant
methods you use is not important –
what is important is that you use
some. It is perfectly proper to use
both classical and robust/resistant
methods routinely, and only worry
when they differ enough to matter.
But when they differ, you should think
hard.”
                 Bootstrap
• More voodoo
• The basic idea involves sampling with
  replacement to produce random samples
  of size n from the original data
  – Each of these samples provides an estimate
    of the parameter of interest.
  – Repeating the sampling a large number of
    times provides information on the variability of
    the estimator.
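
The procedure can be sketched with the standard library alone. The helper name `bootstrap`, the seed, and the choice of 1000 resamples are assumptions of this sketch. The data are the 20 TV values from the earlier example, and `random.Random.choices` does the sampling with replacement.

```python
import random
import statistics

def bootstrap(values, statistic, n_resamples=1000, seed=0):
    """Resample with replacement and compute the statistic on each resample."""
    rng = random.Random(seed)
    n = len(values)
    return [statistic(rng.choices(values, k=n)) for _ in range(n_resamples)]

tv_hours = [1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 6, 8, 10]

boot_means = bootstrap(tv_hours, statistics.mean)
print("mean of bootstrap means: ", round(statistics.mean(boot_means), 2))
print("bootstrap standard error:", round(statistics.stdev(boot_means), 2))
```

The standard deviation of the bootstrap means serves as an estimate of the standard error of the mean. Swapping in another statistic, such as `statistics.median`, gives the corresponding estimate for an estimator that has no simple standard-error formula.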
                   Bootstrap
• From previous data example
  – 1000 samples
• Distribution of means of each sample
• Mean = 3.94

[Figure: density plot of the 1000 bootstrap sample means (x-axis: value, 2.5 to 5.0; y-axis: density, 0.0 to 1.0).]
                 Bootstrap
• With smaller data sets we should be
  concerned about bias
  – Does our statistic reliably estimate the
    population parameter?


• Bootstrapped estimates and subsequent
  inferential use may be preferable in such
  small-N situations

						