Exploratory Data Analysis by 483h36A

Exploratory Data Analysis
The goal of data analysis is to gain information from the data.

Exploratory data analysis: set of methods to display and summarize the data.

Data on just one variable: the distribution of the observations is analyzed by

I.    Displaying the data in a graph that shows overall patterns and unusual
      observations (bar chart, histogram, density curve)

II.   Computing descriptive statistics that summarize specific aspects of the
      data (center and spread).
Review of Histograms
   • A histogram represents percent by area.
   • The height of each block represents the frequency or percentage of
     the observations falling in the interval.
   • The total area under a histogram is ______ if the height is in
     frequencies.
   • The total area under a histogram is ______ if the height is in
     percentages.
   • There is no fixed choice for the number of classes in a
     histogram:
        • If class intervals are too small, the histogram will have spikes;
        • If class intervals are too large, some information will be missed.
        • Use your judgment!
   • Typically, statistical software will choose the class intervals for
     you, but you can modify them.
Center and Spread

[Figure: histogram, "Distribution of city fuel consumption". x-axis: miles per gallon; y-axis: frequency]

     Measuring Centers
The most common measures are the mean (or average) and the median.

1.   The Mean or Average $\bar{x}$
     To calculate the average $\bar{x}$ of a set of observations, add their values and divide by the
          number of observations:

                     $\bar{x} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$

     Data: Number of home runs hit by Babe Ruth as a Yankee
                 54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22
     The mean number of home runs hit in a year is:

                 $\bar{x} = \frac{54 + 59 + 35 + 41 + 46 + \dots + 41 + 34 + 22}{15} = \frac{659}{15} \approx 43.9$
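
The calculation of the mean can be checked with a few lines of Python. This is only a minimal sketch: the helper name `mean` and the variable names are chosen here for illustration and do not come from the slides.

```python
# Home runs hit by Babe Ruth in each of his 15 seasons as a Yankee (data from the text).
ruth_home_runs = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]


def mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


ruth_mean = mean(ruth_home_runs)
print(f"sum = {sum(ruth_home_runs)}, n = {len(ruth_home_runs)}, mean = {ruth_mean:.1f}")
# prints: sum = 659, n = 15, mean = 43.9
```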
2.   The median
     The median M is the midpoint of a distribution, the number such that half the
     observations are smaller and the other half are larger.

     To find the median:
     1. Sort all the observations in order of size from smallest to largest
     2. If the number of observations n is odd, the median M is the center
          observation in the ordered list, i.e., M is the ((n+1)/2)-th observation.
     3. If the number of observations n is even, the median M is the mean of the two
          center observations in the ordered list.


     Example 1: Ordered list of home run hits by Babe Ruth:

     22 25 34 35 41 41 46 46 46 47 49 54 54 59 60

     n = 15, so the median is the 8th observation: M = 46.

     Example 2: Ordered list of home run hits by Roger Maris:

     8 13 14 16 23 26 28 33 39 61

     n = 10, so the median is the mean of the 5th and 6th observations: M = (23+26)/2 = 24.5.
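
The three steps for finding the median can be sketched in Python as follows. The function name `median` is an illustrative choice; the standard library's `statistics.median` implements the same rule.

```python
def median(values):
    """Return the median following the three steps given in the text."""
    ordered = sorted(values)          # step 1: sort from smallest to largest
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # step 2: n odd, take the (n+1)/2-th observation (index n//2 when counting from 0)
        return ordered[mid]
    # step 3: n even, take the mean of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2


ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

print(median(ruth))   # 46
print(median(maris))  # 24.5
```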

Mean versus Median
1. The mean and median of a symmetric distribution are close together.
2. In skewed distributions, the mean is farther out in the long tail
   than is the median. The mean is more sensitive to extreme values.

[Figures: three density curves, each with 50% of the area on either side of the median. Symmetric distribution: mean and median coincide. Right-skewed distribution: the mean lies to the right of the median. Left-skewed distribution: the mean lies to the left of the median.]
Mean or Median?
   • The mean is a good measure for the
     center of a symmetric distribution.
   • The median is a resistant measure and
     should be used for skewed distributions.
     Its value is only slightly affected by the
     presence of extreme observations, no
     matter how large these observations are.
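
A small experiment illustrates what "resistant" means. The sketch below uses the Roger Maris data from the median example and replaces his best season (61) with 610, a made-up extreme value that is purely hypothetical and used only for illustration.

```python
import statistics

maris = [8, 13, 14, 16, 23, 26, 28, 33, 39, 61]

# Hypothetical data set: the largest value 61 is replaced by an invented extreme value 610.
maris_extreme = maris[:-1] + [610]

for label, data in [("original", maris), ("with extreme value", maris_extreme)]:
    print(f"{label}: mean = {statistics.mean(data)}, median = {statistics.median(data)}")
# The mean jumps from 26.1 to 81, while the median stays at 24.5.
```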
The Mode

[Figure: histogram, "Distribution of city fuel consumption". x-axis: miles per gallon; y-axis: frequency]

Descriptive statistics for city fuel consumption:

   City
   Mean                     18.9
   Standard Error       1.629717
   Median                     18
   Mode                       17
   Standard Deviation   8.926327
   Sample Variance      79.67931
   Kurtosis             17.87193
   Skewness             3.710471
   Range                      53
   Minimum                     8
   Maximum                    61
   Sum                       567
   Count                      30
   Largest(5)                 22
   Smallest(5)                13

On average, the cars under study get 18.9 miles per gallon, and 50% of the
cars under study get at least 18 miles per gallon.
The mode is the observation value with the highest frequency; here the mode is 17 miles per gallon.
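
Since the individual fuel consumption values are not listed in these notes, the following sketch finds the mode of the Babe Ruth home run data instead. It simply counts how often each value occurs; the variable names are illustrative.

```python
from collections import Counter

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

counts = Counter(ruth)                      # value -> number of occurrences
highest = max(counts.values())              # the highest frequency
modes = sorted(v for v, c in counts.items() if c == highest)

print(modes, highest)  # [46] 3
```

A list is returned because a data set can have more than one value with the highest frequency.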
  Spread of a Distribution
Two measures of spread:

1. The Quartiles:
The first quartile Q1 is the value such
that 25% of the observations fall at or
below it
(Q1 is often called the 25th percentile).
The third quartile Q3 is the value such
that 75% of the observations fall at or
below it (Q3 is often called the 75th
percentile).
Quartiles are typically used if the distribution of
the observations is skewed.

[Figure: density curve with Q1, M and Q3 marked; 25% of the area lies to the left of Q1]

[Figure: histogram, "Distribution of city fuel consumption". x-axis: miles per gallon; y-axis: frequency]

First quartile (Q1) = 16, third quartile (Q3) = 21

What does this mean in terms of the data?
Percentiles (also called Quantiles):
In general, the pth percentile is a value such that p% of the observations
fall at or below it.

[Figure: density curve with the area p% shaded to the left of the pth percentile]

 In the example before:
 5th percentile = 10.35    95th percentile = 24.1
 10th percentile = 11      90th percentile = 22
 Hence about 80% of the cars get between 11 and 22 miles per gallon.
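
The same reasoning can be tried on the Babe Ruth data, since the individual car values are not listed here. This is a sketch that assumes one particular convention: Python's `statistics.quantiles` with `method="inclusive"`, which interpolates between observations. Other software may use slightly different rules and give slightly different percentiles.

```python
import statistics

ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]

# n=10 returns the 9 cut points that split the data into 10 groups,
# i.e. the 10th, 20th, ..., 90th percentiles.
deciles = statistics.quantiles(ruth, n=10, method="inclusive")
p10, p90 = deciles[0], deciles[-1]

# Count the observations that fall between the 10th and 90th percentiles.
inside = sum(p10 <= x <= p90 for x in ruth)
share = inside / len(ruth)

print(f"10th percentile = {p10:.1f}, 90th percentile = {p90:.1f}, share between = {share:.0%}")
```

With only 15 observations the share between the two percentiles is only roughly 80%; the approximation improves with larger data sets.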
 Descriptive measures for skewed distributions

If the histogram of the data is skewed, use the following descriptive
    statistics (the five-number summary) to describe the distribution of the observed variable:

                    Min, Q1, Median, Q3, Max

In our example,
           Min=8, Q1=16, Median=18, Q3=21, Max=61
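
A five-number summary can be computed with the Python standard library, as in the sketch below. The function name `five_number_summary` is hypothetical, and the quartiles use the `"inclusive"` interpolation rule of `statistics.quantiles`. Quartile conventions differ between textbooks and software, so values may differ slightly from those of other tools.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile rule."""
    q1, med, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, med, q3, max(values)


# Babe Ruth home run data from the earlier example.
ruth = [54, 59, 35, 41, 46, 25, 47, 60, 54, 46, 49, 46, 41, 34, 22]
print(five_number_summary(ruth))  # (22, 38.0, 46.0, 51.5, 60)
```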
     The Standard Deviation
If a distribution is symmetric, use the average to measure the center and
     the standard deviation to measure the spread.
The standard deviation s (or SD) measures how far the observations are from the
average.
Example: A person’s metabolic rate is the rate at which the body consumes energy.
Rates of 7 men in a study on dieting: 1792, 1666, 1614, 1460, 1867, 1439, 1362.
The mean is $\bar{x} = 1600$ and the standard deviation is s = 189.24.

[Figure: the seven rates plotted on an axis labeled "Metabolic rate" running from 1300 to 1900, with the mean $\bar{x} = 1600$ marked. Two deviations from the mean are shown: 1600 - 1439 = 161 and 1867 - 1600 = 267.]
 Formula for the SD
In symbols, the standard deviation s of n observations $x_1, x_2, \dots, x_n$ is

                   $s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1}}$

The variance of an observed variable is defined as the square of the standard
deviation.

         Variance = $s^2$
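
The formula can be applied step by step to the metabolic rate data. The sketch below follows the formula literally, dividing by n - 1; the variable names are illustrative.

```python
import math

# Metabolic rates of the 7 men in the dieting study (data from the text).
rates = [1792, 1666, 1614, 1460, 1867, 1439, 1362]

n = len(rates)
x_bar = sum(rates) / n                              # the mean
squared_devs = [(x - x_bar) ** 2 for x in rates]    # squared deviations from the mean
variance = sum(squared_devs) / (n - 1)              # s^2, dividing by n - 1
s = math.sqrt(variance)                             # the standard deviation

print(f"mean = {x_bar:.0f}, variance = {variance:.2f}, s = {s:.2f}")
# prints: mean = 1600, variance = 35811.67, s = 189.24
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.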
Properties of the SD
   • It measures the spread about the mean.

   • It is only used in association with the mean. It is a good descriptive measure for
     symmetric distributions.

   • If s = 0, all the observations have the same value.

   • It is never negative; the larger s is, the more spread out the
     observations are around the mean.

   • It is NOT a resistant measure: a few extreme observations may affect
     its value (make it very large).

   • The variance is the square of the s.d.
Interpreting the SD
For many lists of observations, especially if their histogram is bell-shaped:

1.   Roughly 68% of the observations in the list lie within 1 standard
     deviation of the average.
2.   Roughly 95% of the observations lie within 2 standard deviations of the average.

[Figure: bell-shaped curve with Ave-2s.d., Ave-s.d., Average, Ave+s.d. and Ave+2s.d. marked; 68% of the area lies between Ave-s.d. and Ave+s.d., and 95% between Ave-2s.d. and Ave+2s.d.]
Example
In a large university, data were collected to study the academic achievements
of computer science majors. We’ll consider the SAT math scores of 224 first
year CS students.

The average SATM score is 595.28 with s.d. s = 86.40.
Are the average and s.d. good descriptions of the SATM scores
distribution?

[Figure: histogram of the SATM scores]

Roughly 68% of the students have
scores between 510 and 680 (within 1 s.d. of the average).

Roughly 95% of the students have
scores between 422 and 768 (within 2 s.d. of the average).
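
The two intervals come from adding and subtracting one or two standard deviations from the average. A minimal sketch of that arithmetic, with a hypothetical helper named `sd_interval`:

```python
def sd_interval(average, sd, k):
    """Return the interval (average - k*sd, average + k*sd)."""
    return average - k * sd, average + k * sd


satm_average = 595.28
satm_sd = 86.40

low1, high1 = sd_interval(satm_average, satm_sd, 1)   # roughly 68% of scores
low2, high2 = sd_interval(satm_average, satm_sd, 2)   # roughly 95% of scores

print(f"within 1 s.d.: {low1:.2f} to {high1:.2f}")
print(f"within 2 s.d.: {low2:.2f} to {high2:.2f}")
```

The exact endpoints are 508.88 to 681.68 and 422.48 to 768.08, which the text rounds to 510 to 680 and 422 to 768.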
CS students example: Descriptive statistics
  Mean = 595.28       Std Deviation = 86.40    Min = 300    Max = 800
  Q1 = 540            Median = 600.00          Q3 = 650     IQR = 110    1.5 x IQR = 165
  5th percentile = 460        95th percentile = 750

[Figure: histogram of the SATM scores, with the interval from 422 to 768 marked as containing 95% of scores]
Analysis of the scores for male and female students:

[Figures: histogram of SATM scores for men; histogram of SATM scores for women]
 Exploratory Data Analysis:
1. Always plot your data

2. Look for overall patterns & striking deviations such as
   outliers

3. Calculate a numerical summary to describe the center and
   the spread

4. NEXT STEP: sometimes the overall pattern is so regular that
   we can describe it through a smooth curve, called a density
   curve
     Computing descriptive statistics in Excel
There are two ways:
1. Use the formula palette: click on
   the fx button

OR

2.   Use the Data Analysis ToolPak and
     select Descriptive Statistics
The descriptive statistics tool
Input range: the sequence of cells containing the data
Labels in first row: check this box if the first row of the input range holds the column labels
Output range: tells Excel where to put the output
Summary statistics: must be checked
Formulas for 5-number summary
        Five-number summary
                      City     Highway
        Min              8       13
        Q1              16       22.25
        Median          18       25.5
        Q3           20.75       28
        Max             61       68
Select an empty cell and type the function you want to compute, or
use the function palette for the list of available functions.
For instance, to compute the minimum of the city fuel consumption data,
type
         =min(b2:b31)
 Normal distributions
Normal curves provide a simple, compact way to describe
symmetric, bell-shaped distributions.
[Figure: histogram of SAT math scores for CS students with a normal curve superimposed]
[Figure: histogram of money spent in a supermarket]

Is the normal curve a good approximation?
The area under the histogram, i.e. the percentage of the observations, can be
approximated by the corresponding area under the normal curve.

If the histogram is symmetric and bell-shaped, we say that the data are approximately normal
(or normally distributed).
In that case, we need to know only the average and the standard deviation of the
observations to describe the distribution.
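
As a rough check of this idea, areas under a normal curve with the SATM average and standard deviation can be computed with `statistics.NormalDist` from the Python standard library. The sketch below assumes the scores are approximately normal; the helper name `share_between` is hypothetical.

```python
from statistics import NormalDist

# Normal curve with the average and s.d. of the SATM scores (values from the text).
satm = NormalDist(mu=595.28, sigma=86.40)


def share_between(dist, low, high):
    """Area under the normal curve between low and high."""
    return dist.cdf(high) - dist.cdf(low)


within_1sd = share_between(satm, satm.mean - satm.stdev, satm.mean + satm.stdev)
within_2sd = share_between(satm, satm.mean - 2 * satm.stdev, satm.mean + 2 * satm.stdev)
below_460 = satm.cdf(460)   # area at or below the observed 5th percentile

print(f"within 1 s.d.: {within_1sd:.1%}")
print(f"within 2 s.d.: {within_2sd:.1%}")
print(f"at or below 460: {below_460:.1%}")
```

The first two areas reproduce the 68% and 95% figures. The third can be compared with the observed 5th percentile of 460: the closer the normal-curve area is to 5%, the better the approximation in that tail.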

								