					Exploratory Data Analysis

            Spring 2010
The four features of distributions

   Central Location – where are most of the observations?
   Spread – how far apart are the observations?
   Shape – Symmetric or skewed?
   Outliers – are any observations very far from
    the rest?
What type of data do you have?

   Nominal – observations are in categories.
    Examples are gender (male, female) or eye
    color (blue, green, brown, other)
   Ordinal – observations can be ranked (i.e.
    greater than or less than makes sense).
    Examples are education level (less than high
    school, high school, bachelor’s, graduate
    degree) or agreement with a survey question
    (strongly disagree, disagree, ambivalent,
    agree, strongly agree)
Scales, continued

   Interval – an observation is on an interval
    scale if the difference between two numbers
    has meaning. In measuring temperature, 95
    degrees is 5 degrees higher than 90 degrees, 30
    degrees is 5 degrees higher than 25 degrees, etc.
   This is not true of all data. For example, is
    the difference between “agree” and “strongly
    agree” the same as the difference between
    “ambivalent” and “agree”?
Scales, continued

   Ratio – The strongest form of scale. It
    indicates that ratios (division) of numbers have
    meaning. If your income is 20,000 then you
    have twice as much income as someone
    who makes 10,000.
   Temperature in Fahrenheit or Celsius is NOT
    ratio scaled, since 10 degrees is not “twice
    as hot” as 5 degrees.
   In a ratio scale, 0 has meaning as “nothing”.
    In those temperature scales, 0 is arbitrary.
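The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below is only an illustration: the function name and the example incomes are made up here, and the only fact used is the standard Celsius-to-Fahrenheit conversion. It shows that "twice as hot" depends on the unit chosen, while "twice the income" does not.

```python
def celsius_to_fahrenheit(celsius):
    """Standard conversion: multiply by 9/5 and add 32."""
    return celsius * 9 / 5 + 32

warm_c, cool_c = 10.0, 5.0

# In Celsius the ratio looks like "twice as hot"...
ratio_celsius = warm_c / cool_c

# ...but the same two temperatures in Fahrenheit (50 and 41) give a different ratio.
ratio_fahrenheit = celsius_to_fahrenheit(warm_c) / celsius_to_fahrenheit(cool_c)

# Differences keep their meaning on an interval scale: 5 Celsius degrees
# is always 9 Fahrenheit degrees, wherever you start.
difference_fahrenheit = celsius_to_fahrenheit(warm_c) - celsius_to_fahrenheit(cool_c)

# Income has a true zero, so the ratio survives a change of unit
# (hypothetical incomes, converted from dollars to cents).
ratio_income_dollars = 20_000 / 10_000
ratio_income_cents = (20_000 * 100) / (10_000 * 100)

print(ratio_celsius, ratio_fahrenheit, ratio_income_dollars, ratio_income_cents)
```

Because the ratio of the temperatures changes with the unit while the ratio of the incomes does not, only income behaves like a ratio scale.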
Back to the four features

   Central location – where are most of the observations?
   When the observations are in categories, the
    most relevant “statistics” are the number
    and/or the relative frequency in each category.
    For example, “50.5% of live births are male,
    while 49.5% are female”. Alternatively, “in
    our town, we have 934 men and 982 women”
   It doesn’t make sense to talk about “the
    average person’s gender” since you really
    can’t be somewhere in the middle.

   One statistic mentioned often for categorical
    data (ordinal or nominal) is the mode, which is
    the category with the most observations.
   The mode is most meaningful when one of the
    categories has most of the observations, as in
    “most faculty at UK have doctoral degrees”
   If the data is spread among many categories,
    knowing the mode doesn’t provide a full picture.
    For example, “the largest department in Arts and
    Sciences at UK is Psychology” does not say
    anything about the majority of faculty.
   Summary – the mode often isn’t that useful.
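As a rough sketch of the points above, counts, percentages, and the mode for categorical data can be tallied in a few lines. The eye-color observations below are invented purely for illustration and are not data from the course.

```python
from collections import Counter

# Hypothetical nominal observations (illustration only).
eye_colors = ["brown", "blue", "brown", "green",
              "brown", "blue", "other", "brown"]

counts = Counter(eye_colors)              # number in each category
total = sum(counts.values())
percentages = {color: 100 * n / total for color, n in counts.items()}

# The mode is the category with the most observations.
mode_category, mode_count = counts.most_common(1)[0]
mode_share = mode_count / total           # fraction of the data in the mode

print(counts)
print(percentages)
print(mode_category, mode_share)
```

Reporting `mode_share` next to the mode shows whether the mode really describes "most" of the data or is merely the largest of many small categories.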
Central Location for Interval/Ratio

   For interval/ratio data, the most common
    measures of central location are the mean
    and median.
   The mean is defined as the arithmetic
    average of the observations. You find this by
    adding them up and dividing by the total
    number. If your observations are (1,5,12), the
    mean is (1+5+12)/3 = 6.
Mean/Median continued

   The median is the “middle” observation of the
    SORTED data. If your observations are
    (1,5,12), the median is 5. If your observations
    are (4,10,2,8,9), the median is 8.
   If there is an even amount of data, average
    the two middle values. So if the data are
    (6,10,4,3), the middle values are 4 and 6,
    and (4+6)/2 = 5. The median is 5.
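The mean and median recipes above translate directly into code. This is a minimal sketch with function names chosen here for illustration; it reproduces the examples from these slides.

```python
def mean(values):
    """Add the observations up and divide by how many there are."""
    return sum(values) / len(values)

def median(values):
    """Middle observation of the SORTED data (average of the two middle
    values when the number of observations is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(mean([1, 5, 12]))          # (1+5+12)/3
print(median([1, 5, 12]))        # odd n: the middle value
print(median([4, 10, 2, 8, 9]))  # must sort first: 2, 4, 8, 9, 10
print(median([6, 10, 4, 3]))     # even n: average of 4 and 6
```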
Differences between the mean and median

   The median is robust, which means that
    outliers have little or no effect on it. The mean is not robust.
   Suppose we have data (1,4,6,10,12). The
    mean is 33/5 = 6.6 while the median is 6.
   Suppose we change the 12 to 14000. The
    median is still 6, but the mean changes to
    14021/5 = 2804.2. Note also that the median
    is still close to most of the data, but the mean
    is nowhere close to any data point.
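The robustness example can be checked with Python's standard `statistics` module. This is just a sketch of the comparison described above; the variable names are chosen here for illustration.

```python
import statistics

original = [1, 4, 6, 10, 12]
with_outlier = [1, 4, 6, 10, 14000]   # the 12 replaced by 14000

mean_original = statistics.mean(original)
mean_with_outlier = statistics.mean(with_outlier)
median_original = statistics.median(original)
median_with_outlier = statistics.median(with_outlier)

# The mean is dragged toward the outlier; the median does not move.
print(mean_original, mean_with_outlier)
print(median_original, median_with_outlier)
```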

   For ordinal/nominal data, we do NOT have a
    measure of spread in this class.
   There are measures of spread, not
    discussed in this class, for ordinal/nominal
    data. Essentially, these measures indicate
    whether the data is spread evenly into all the
    categories or whether one or a few
    categories contain almost all the data.
   The notion is called entropy. Not required in
    our class, but look it up if you need it.
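For the curious, here is a hedged sketch of one common version of entropy. The slides only name the idea, so the specific formula used here (Shannon entropy, measured in bits) is an assumption about what is meant, and the example data are invented.

```python
import math
from collections import Counter

def shannon_entropy(observations):
    """Shannon entropy in bits: 0 when every observation falls in one
    category, larger when the data are spread evenly over many categories."""
    counts = Counter(observations)
    total = len(observations)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())

# Hypothetical data: evenly spread over four categories vs. all in one.
evenly_spread = ["blue", "green", "brown", "other"] * 5
concentrated = ["brown"] * 20

print(shannon_entropy(evenly_spread))
print(shannon_entropy(concentrated))
```

With four equally likely categories the value is 2 bits, which is the largest possible for four categories; a dataset sitting entirely in one category gives 0.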
Spread for interval/ratio

   Some common measures of spread for
    interval/ratio data are the range, the
    interquartile range, and the standard deviation.
   The range is simply the distance between the
    smallest and largest observations. It is
    obviously not robust to outliers, and seldom
    used except when the spread is very small
    (e.g. if all the scores on an exam happened to
    be between 76 and 78, which doesn’t
    happen very often).
Interquartile range

   First, we have to define the quartiles. Recall
    when we compute the median, we are
    dividing the data in half. The quartiles divide
    each of the halves in half again (this divides
    the data into four parts, hence the term quartiles).
   To find the quartiles, first sort the data as if
    you were finding the median.
Quartiles continued

   If n is even, divide the data in half, thus
    creating a first half and a second half
   If n is odd, remove the median, and then
    divide the data in half to produce a first half
    and a second half.
   The first quartile, Q1, is the median of the first
    half. The third quartile, Q3, is the median of
    the second half. (Q2 is the median).
Example of computing quartiles

   Suppose our sorted data was 12, 14, 23, 36,
    40, 42, 44, 61, and 78.
   There are n=9 numbers, so find the median
    M=40 and remove it. The first half is
    (12,14,23,36) and the second half is (42, 44,
    61, and 78).
   The median of the first half is Q1=(14+23)/2 =
    18.5 while the median of the second half is
    Q3 = (44+61)/2 = 52.5
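The quartile recipe from these slides (sort, drop the median when n is odd, take the median of each half) can be sketched as follows. The function names are chosen here for illustration. Many statistics packages use slightly different quartile conventions, so their answers may not match this method on every dataset.

```python
def median_of_sorted(ordered):
    """Median of a list that is already sorted."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    first_half = ordered[:half]
    # When n is odd the median itself is removed before splitting.
    second_half = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return (median_of_sorted(first_half),
            median_of_sorted(ordered),
            median_of_sorted(second_half))

data = [12, 14, 23, 36, 40, 42, 44, 61, 78]
q1, m, q3 = quartiles(data)
print(q1, m, q3)
```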
Interquartile range

   The interquartile range is Q3 – Q1. It is not
    sensitive to outliers.
   We used the data 12, 14, 23, 36, 40, 42, 44,
    61, and 78. If we changed the 78 to 100,000
    then the interquartile range (IQR) does not change.
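A quick check of this claim, as a sketch only. It uses `statistics.quantiles` from the Python standard library with `method="exclusive"`, which is one of several quartile conventions; it is an assumption of this example, chosen because it gives Q1 = 18.5 and Q3 = 52.5 on this particular dataset, and it may differ from the median-of-halves method on other data.

```python
import statistics

def interquartile_range(values):
    """Q3 - Q1, with quartiles from the standard library's 'exclusive' method."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="exclusive")
    return q3 - q1

original = [12, 14, 23, 36, 40, 42, 44, 61, 78]
altered = [12, 14, 23, 36, 40, 42, 44, 61, 100_000]   # 78 changed to 100,000

iqr_original = interquartile_range(original)
iqr_altered = interquartile_range(altered)
range_original = max(original) - min(original)
range_altered = max(altered) - min(altered)

# The IQR ignores the outlier; the range is completely determined by it.
print(iqr_original, iqr_altered)
print(range_original, range_altered)
```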
Standard deviation

   The standard deviation is based on measuring the
    average squared distance from the mean. It is defined as

    $$ s = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}} $$
Standard deviation continued

   The standard deviation is sensitive to
    outliers. If one of the observations is very
    large, then the standard deviation will be
    large as well.
   Unless there are strong outliers, the standard
    deviation is the most commonly used
    measure of spread.
   This is because the standard deviation is
    directly related to normal distributions (bell
    curves), which we will study later.
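The standard deviation can be computed step by step: find the mean, sum the squared distances from it, divide by n - 1, and take the square root. The sketch below follows that recipe; the function name is chosen here, and the data are the small examples used earlier in these slides.

```python
import math

def sample_standard_deviation(values):
    """Square root of the sum of squared deviations from the mean,
    divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    sum_of_squares = sum((x - mean) ** 2 for x in values)
    return math.sqrt(sum_of_squares / (n - 1))

small = [1, 5, 12]                    # mean 6, squared deviations 25 + 1 + 36
original = [1, 4, 6, 10, 12]
with_outlier = [1, 4, 6, 10, 14000]   # one very large observation

sd_small = sample_standard_deviation(small)
sd_original = sample_standard_deviation(original)
sd_with_outlier = sample_standard_deviation(with_outlier)

# A single outlier inflates the standard deviation enormously.
print(sd_small, sd_original, sd_with_outlier)
```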
Interlude – review of central
location and spread

   For nominal/ordinal data, we simply report
    the percentages in each category.
   For interval/ratio data, central location is
    usually measured by the mean (not robust)
    or the median (robust).
   For interval/ratio data, spread is usually
    measured by the standard deviation (not
    robust) or the Interquartile Range (robust)
   The mode (central location) and the range
    (spread) are rarely used for inference.

   Look at the “tails”. If the tails are equal
    length, then the distribution is symmetric
   If the tail for lower values is longer, the
    distribution is left skewed
   If the tail for higher values is longer, the
    distribution is right skewed.
   “Symmetric” gets the benefit of the doubt in
    describing a distribution. “Roughly
    symmetric” is fine. I will not put judgment
    calls on homework or exams.
Symmetric Data – Ideally and
Right skewed data – ideally and
Left skewed data – ideally and

   Recall outliers are any points that appear
    separate from the rest.
   Often this is a judgment call. Saying “mild
    outlier” is fine; I don’t intend to police
    judgment calls.
   Outliers often occur with skewed data in the
    direction of the long tail.

   A boxplot is intended to be a SIMPLE plot
    which allows you to quickly see all the
    features of the distribution.
   In PS372 you will NOT be expected to draw
    a boxplot from scratch, but you will be
    expected to interpret a boxplot drawn on a
Step 1 for boxplot – The Box

   Box extends from Q1 to Q3, with a line for the median.
   Thus, you can immediately see the median (central
    location) and the IQR (spread).
   Note the box contains 50% of the data.
   [Figure: the box, with Q1 and Q3 labeled.]
Step 2 for boxplot – The fences

   Construct the “fences”. These are NOT in the final product.
    They are just used to make decisions on outliers.
   Inner fences are 1.5 IQR from the box, outer fences are 3.0
    IQR from the box.
   [Figure: the box (Q3 and median labeled) with bands of 1.5 IQR
    marked above and below it.]
Step 2 for boxplot – Inner Fences

   [Figure: the same diagram with the inner fences labeled,
    1.5 IQR beyond each end of the box.]
Step 2 for boxplot – Outer fences

   [Figure: the same diagram with the outer fences labeled,
    3.0 IQR beyond each end of the box.]
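The fence positions follow directly from Q1, Q3, and the IQR. This is a minimal sketch; the function and key names are chosen here, and the quartiles are the ones from the earlier example (Q1 = 18.5, Q3 = 52.5).

```python
def fences(q1, q3):
    """Inner fences sit 1.5 IQR beyond the box, outer fences 3.0 IQR."""
    iqr = q3 - q1
    return {
        "lower_outer": q1 - 3.0 * iqr,
        "lower_inner": q1 - 1.5 * iqr,
        "upper_inner": q3 + 1.5 * iqr,
        "upper_outer": q3 + 3.0 * iqr,
    }

example_fences = fences(18.5, 52.5)   # IQR = 34
for name, value in example_fences.items():
    print(name, value)
```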
Step 3 for boxplot – Whiskers

   The whiskers extend from the box to the point closest,
    but still inside, the inner fence.
   Remember, the whiskers end at a data point, not the
    inner fences.
   [Figure: the boxplot diagram with the whiskers labeled.]
Step 4 for boxplot – Mild outliers

   Mild outliers for a boxplot are defined to be points
    located between the inner and outer fences.
   They are denoted by open circles.
   [Figure: the boxplot diagram with the mild outliers labeled.]
Step 5 for boxplot – Extreme outliers

   Extreme outliers for a boxplot are defined to be points
    located beyond the outer fences.
   They are denoted by filled circles.
   [Figure: the boxplot diagram with the extreme outliers labeled.]
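Steps 3 to 5 can be sketched together: find where the whiskers end, then sort the remaining points into mild and extreme outliers. The function name is chosen here, and treating a point that lands exactly on a fence as "inside" is an assumption of this sketch. The two datasets are variations on the quartile example (the 78 changed to 120 or to 100,000), which leaves Q1 = 18.5 and Q3 = 52.5 unchanged under the median-of-halves method.

```python
def classify_boxplot_points(values, q1, q3):
    """Return the whisker ends plus the mild and extreme outliers."""
    iqr = q3 - q1
    lower_inner, upper_inner = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    lower_outer, upper_outer = q1 - 3.0 * iqr, q3 + 3.0 * iqr

    # Whiskers end at the most extreme DATA POINTS still inside the inner fences.
    inside = [x for x in values if lower_inner <= x <= upper_inner]
    # Mild outliers lie between the inner and outer fences.
    mild = [x for x in values
            if lower_outer <= x < lower_inner or upper_inner < x <= upper_outer]
    # Extreme outliers lie beyond the outer fences.
    extreme = [x for x in values if x < lower_outer or x > upper_outer]
    return {"whiskers": (min(inside), max(inside)),
            "mild": mild, "extreme": extreme}

with_mild = classify_boxplot_points(
    [12, 14, 23, 36, 40, 42, 44, 61, 120], q1=18.5, q3=52.5)
with_extreme = classify_boxplot_points(
    [12, 14, 23, 36, 40, 42, 44, 61, 100_000], q1=18.5, q3=52.5)

print(with_mild)
print(with_extreme)
```

In both cases the upper whisker stops at 61, the largest observation inside the inner fence at 103.5, rather than at the fence itself.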
Final boxplot

   Remember, the
    fences are not
    actually drawn.
   You can see the
    four features of
    distributions easily
    with a boxplot.
    Outliers, for
    example, are
    explicitly drawn.
Using Boxplots

   Central location is
    shown through the
    median (some
    boxplots will show
    the mean as a
    separate line).
Using Boxplots

   Spread is shown
    through the IQR
    (you cannot get the
    standard deviation
    from a boxplot).
   You can also see
    the range of the
    data, but remember
    the range is often
    not that useful.
Using Boxplots

   Shape can be seen through the box and the whiskers. If one
    side of the box and the corresponding whisker are longer,
    then the data is skewed in that direction (here, left skewed).
Using boxplots

   Sometimes the box “leans” one way and the whiskers
    the other. Then you can’t tell that much about shape
    from the boxplot. This happens most often in small
    datasets, where there isn’t much information about
    shape in the entire dataset anyway.
   Remember that symmetric always gets the benefit of
    the doubt, so a slight “lean” isn’t enough to conclude
    that the data is skewed.
   Outliers are of course drawn explicitly on the plot,
    and while you don’t have to take their definitions of
    “mild” and “extreme” as absolute truth, it can be
Some variants

   Some people and/or computer programs add
    some “bells and whistles” to this basic boxplot.
   For example, Stata will often put a “+” in the
    boxplot showing the location of the mean.
Side by side boxplots

   When comparing multiple groups of people
    (or anything else), boxplots provide a handy
    method for comparison.
   By placing the boxplots side by side, you can
    immediately see similarities and differences
    in central location, spread, and shape.
1970 Draft Lottery – months on x axis,
draft number on y axis.

   There is clear evidence the later months,
    especially December, fared far worse in the
    draft lottery than other months.
   The randomization procedure was revised for later
    draft lotteries after the unfairness was noted by
    many sources.

   There are four features of distributions –
    central location, spread, shape, and outliers
   Central location can be measured by the
    mode (nominal or ordinal data) or the median
    or mean (interval/ratio data)
   In interval/ratio data, spread can be
    measured by the range (rarely useful), the
    IQR, or the standard deviation.
More review

   Outliers are any points far from the other
    points. This definition is deliberately vague.
    Two people may disagree over whether a
    point is an outlier.
   There is an explicit definition of outlier for a
    boxplot (any point more extreme than Q1 –
    1.5 IQR or Q3 + 1.5 IQR), but that is NOT
    etched in stone
More review

   Shape is in “the tails”. If the tails are equal
    length, then the distribution is symmetric
   If the tail for lower values is longer, the
    distribution is left skewed
   If the tail for higher values is longer, the
    distribution is right skewed.
Describing a single distribution

   When describing a distribution, or comparing two
    distributions, you need to mention all four features of
    the distributions, noting where they are similar and
    where they are different.
   For example, “all the distributions have the same
    spread (IQR is around 5, standard deviation is
    around 7), but distribution A is, on average, much
    higher than distribution B (mean for A is 78 while the
    mean for B is 70). Both distributions are symmetric
    and have no outliers”.

   Two classrooms were observed, with one
    classroom (n=21) using “new directed
    reading activities” and another classroom
    (n=23) not using the activities.
   This might be useful for an exploratory study,
    but cannot provide conclusive evidence of
    anything, as the classrooms differ on far
    more than just “activities” or “no activities”
    (for example, the teachers differ).
Example continued
Descriptive statistics

   For the controls, n=23, mean=41.52, M=42, s=17.15, IQR=26
   For the treatment group, n=21, mean=51.48,
    M=53, s=11.00, IQR=14
An example paragraph summary

   The two groups vary most on spread, both in
    terms of standard deviation (17.15 for the
    controls and 11.00 for the treatment group)
    and IQR (26 for the control group and 14 for
    the treatment group). The difference in
    spread is sufficient that the control group
    extends beyond the treated group both for
    high and low scoring students.
Paragraph summary continued

   On average, scores are higher in the
    treatment group. The mean of the treatment
    group is 51.48 compared to a mean of 41.52
    for the controls (the respective medians are
    53 and 42). Both groups appear
    approximately symmetric (perhaps a slight
    right skew for the control group) and have no
    outliers.
