Displaying and Summarising Data by liuhongmei


									Workshop 2: Basic statistics

Instructor: Dr Patrick Johanns
Office: 239

Phone:       9349-8162
Fax:         9349-8133
E-mail: p.johanns@mbs.edu
Types of data
Displaying data
Summarising data
 Intent of workshop

Give an overview of basic measures of
  – Central tendency
  – Variability or dispersion
Explore and review different ways to
 display data via graphs and tables
 Dangers with Quantifying

False sense of “accuracy”.
May not be able to be compared to other
Estimates/data used to quantify a factor
 may be wrong or insufficient.
Given these dangers, why quantify?
Types of Data
Discrete vs. Continuous
Categorical vs. Quantitative
Cross-Sectional vs. Time-Ordered
Univariate vs. Bivariate and Multivariate
Discrete and Continuous Variables
Discrete Variable: Must take one of a set of
 fixed values. Values are counted. (Usually
 whole numbers.)
  – Number of children in family: can’t have 3.4
Continuous Variable: Can take any fractional
 value at all. Values are measured.
  – Height of a tree. Almost never find a tree that is
    exactly 12 feet high.
Categorical (Qualitative) and
Quantitative Variables

 Element    Categorical     Quantitative
 Person     Religion        Height
 State      Flower          Area
 House      Style           Size
 Firm       Industry        Sales
 Worker     Gender          Age
 Computer   Brand           Size of Memory
Cross-Sectional and Time-Ordered Data

 Cross-Sectional Data: DATA FOR ONE TIME PERIOD
    – Absenteeism Data for Four Departments in January
    – Yearly Sales for Three Divisions
 Time-Ordered Data: DATA FROM MULTIPLE TIME PERIODS
    – Monthly Absenteeism Data over a One-Year Period
    – Daily IBM Stock Prices over a One-Month Period
 Univariate and Bivariate Data
    ELEMENT   Univariate      Bivariate
    Worker    Gender          Promotion Status (yes, no)
    Product   Market Share    Market Share
    Group     Productivity    Job Switching (yes, no)
Observation: single object or time-period. Can
 have more than one piece of information.
 Spot Check on types of data
Average weights and lengths of newborn
 babies on March 3 at Royal Women’s Hospital
Mean marks by departments for academic year
 1999 at University of Melbourne
The Age Poll on gambling views
Hourly defect numbers for a product on a
 production line
Critical Questions that
Business Professionals Ask

On average, how are we doing?
Is there much variation?
Are there outliers?
Typical Values
What do we mean when we use the word “average”?
If a country had 3 people who earned a
 million dollars each and 100 people who
 earned $20,000, what would the average
 person in that country earn?
“Average” or “typical” depends on context:
 there’s no single definition
Suburban Melbourne
In a suburb of Melbourne, five houses were
  sold last month for $1,208,000, $759,000,
  $795,000, $990,000 and $579,000.
What kind of community do you think it is?
What would the average house in this
  community be like?
Is this community average?
What kind of statistics would you use to
  describe the data?
The Sample
Draw 5 observations from a population.
They are:
    8, 14, 3, 7, 23
Want to describe this sample: its typical
 values and its spread.
Sample Mean

$\bar{x} = \frac{\sum_i x_i}{n}$, where $x_i$ are the data values and $n$ is the sample size

$\bar{x} = \frac{8 + 14 + 3 + 7 + 23}{5} = 11$
Sample Median
Take all observations
Order them
Count observations until you are exactly
 half way: this is the median value
(If there are an even number of
 observations then the median is half-way
 between the two middle values.)

           Median = 8
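
As a quick check of the two "typical value" calculations above, here is a minimal Python sketch using only the standard library. The variable names are illustrative and not part of the workshop material.

```python
from statistics import mean, median

# The five observations drawn from the population in the slides.
sample = [8, 14, 3, 7, 23]

sample_mean = mean(sample)      # sum of the values divided by the sample size
sample_median = median(sample)  # middle value once the data are ordered

print("ordered:", sorted(sample))
print("mean:", sample_mean)
print("median:", sample_median)
```

The mean (11) sits above the median (8) because the single large value 23 pulls the mean upward.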
Sample Mode
What is the mode?
What are its advantages and disadvantages?
Trimmed Mean
Rank order data.
Eliminate the lowest x% and the highest
 x% of data.
Compute sample mean for remaining data.
Reduces effect of outliers.
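
The trimmed-mean steps above can be sketched as follows. The function name `trimmed_mean` is hypothetical, and rounding the number of trimmed values down to a whole number is an assumption; the slides do not say how to handle a percentage that does not cut a whole number of observations.

```python
def trimmed_mean(data, percent):
    """Mean after dropping the lowest and highest `percent` % of the ordered data."""
    ordered = sorted(data)                       # step 1: rank order the data
    k = int(len(ordered) * percent // 100)       # values cut from each end (assumption: round down)
    kept = ordered[k:len(ordered) - k]           # step 2: eliminate both tails
    if not kept:
        raise ValueError("trimming removed every observation")
    return sum(kept) / len(kept)                 # step 3: mean of what remains


sample = [8, 14, 3, 7, 23]
# A 20% trim drops one value from each end (3 and 23) and averages 7, 8, 14.
print(trimmed_mean(sample, 20))
```

With the extremes removed, the result (about 9.67) is closer to the median than the untrimmed mean of 11 is.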
Measures of Spread
Measures of spread help answer the
 following questions:
How closely grouped are the data?
Is there a lot of variation?
How far apart are the lowest and highest
 values?
The Range
Rank order the data.
Subtract smallest from the largest data
 value:
  Range = 23 - 3 = 20
Use only two values to calculate range.
Doesn’t measure spread around mean.
Quartiles and Inter-Quartile Range
Quartiles (Upper and Lower) are like
 Median except that they are the quarter and
 three-quarter points in the data
Inter-quartile range is the distance between
 the quartiles
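
A small sketch of the quartiles and inter-quartile range for the five-observation sample. Several quartile conventions exist and the slides do not name one; the `method="inclusive"` option of Python's `statistics.quantiles` is used here as an assumption, so other software (including Excel) may report slightly different values on other data sets.

```python
from statistics import quantiles

sample = [8, 14, 3, 7, 23]

# n=4 splits the ordered data into quarters and returns the three cut points.
# method="inclusive" is an assumed convention, not one stated in the slides.
lower_q, med, upper_q = quantiles(sample, n=4, method="inclusive")
iqr = upper_q - lower_q  # distance between the quartiles

print("lower quartile:", lower_q)
print("median:", med)
print("upper quartile:", upper_q)
print("inter-quartile range:", iqr)
```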
  Sample Standard Deviation

$s = \sqrt{\frac{\sum_i (x_i - \bar{x})^2}{n - 1}}$, where $\bar{x}$ is the sample mean and $n$ is the sample size

$s = \sqrt{\frac{(8-11)^2 + (14-11)^2 + (3-11)^2 + (7-11)^2 + (23-11)^2}{5 - 1}} = \sqrt{60.5} \approx 7.78$
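
To make the arithmetic behind the sample standard deviation explicit, the sketch below computes it step by step and compares the result with the standard library's `statistics.stdev`, which also divides by n - 1. Variable names are illustrative.

```python
from math import sqrt
from statistics import stdev

sample = [8, 14, 3, 7, 23]
n = len(sample)
x_bar = sum(sample) / n  # sample mean

# Squared distance of each observation from the mean.
squared_devs = [(x - x_bar) ** 2 for x in sample]

# Divide by n - 1 (not n) because this is a sample, then take the square root.
s_manual = sqrt(sum(squared_devs) / (n - 1))
s_library = stdev(sample)

print("squared deviations:", squared_devs)
print("s (by hand):", round(s_manual, 2))
print("s (statistics.stdev):", round(s_library, 2))
```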
Variance and Coefficient of Variation
Variance is just the square of the standard deviation:
$V = s_x^2$

Coefficient of Variation is standard deviation
 divided by the mean:
$CV = \frac{s}{\bar{x}}$
CV is useful because it is unit-less (doesn’t
 depend on the units or scale of the data).
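
The following sketch illustrates both quantities on the five-observation sample, and checks the unit-less property by rescaling the data. The factor of 100 (for example, converting dollars to cents) is an arbitrary choice made for illustration.

```python
from statistics import mean, stdev, variance

sample = [8, 14, 3, 7, 23]

v = variance(sample)                 # sample variance, the square of s
cv = stdev(sample) / mean(sample)    # coefficient of variation

# Re-express the same data in a unit 100 times smaller (illustrative assumption).
rescaled = [x * 100 for x in sample]
cv_rescaled = stdev(rescaled) / mean(rescaled)

print("variance:", v)
print("CV:", round(cv, 3))
print("CV after rescaling:", round(cv_rescaled, 3))
```

The standard deviation and the mean both grow by the same factor when the units change, so their ratio is unchanged.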
Quick Problems
What is the mean and standard deviation of
 the following set of numbers: {1, 1, 1}?
Mean = 1
Standard deviation = 0
Quick Problems
The hourly wages of three employees are
 $4.00, $4.50, and $5.00. What would
 happen to the mean and standard deviation
 of these wages if the following occurred:
each got a raise of $.50 per hour?
Mean goes up $0.50
Standard deviation stays the same
Quick Problems
The hourly wages of three employees are
 $4.00, $4.50, and $5.00. What would
 happen to the mean and standard deviation
 of these wages if the following occurred:
the hourly wage of each doubled?
Mean doubles
Standard deviation doubles
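
The two wage problems above can be verified numerically. This is only a check of the stated answers; the list names are illustrative.

```python
from statistics import mean, stdev

wages = [4.00, 4.50, 5.00]

raised = [w + 0.50 for w in wages]   # everyone gets a $0.50 raise
doubled = [w * 2 for w in wages]     # every wage doubles

for label, data in [("original", wages), ("raised", raised), ("doubled", doubled)]:
    print(f"{label:9s} mean = {mean(data):.2f}  std dev = {stdev(data):.2f}")
```

Adding a constant shifts every value by the same amount, so the distances from the mean (and hence the standard deviation) do not change. Multiplying by a constant stretches those distances by the same factor.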
Quick Problems
 Pocket Money ($)   School A (frequency)   School B (frequency)
        50                  10                     20
       100                  20                     10
       150                  20                     10
       200                  10                     20

Which school has the greater mean? the
 greater standard deviation?
Excel - Descriptive Statistics

             Go to “Tools” menu
             Go to “Data Analysis…”
              Choose “Descriptive Statistics”
             Enter the cell range of the data
              in the “Input Range” box
             Tick “Summary statistics” box
Describing Data
 Visualise Data:   STEM and LEAF, FREQUENCY HIST., OGIVE, BOX PLOT   |  LINE GRAPH, MOVING AVERAGE, RESIDUAL GRAPH
 Summarise Data:   MEAN and STD. DEV., MEDIAN and IQR, TRIMMED MEAN  |  MOVING AVERAGE VALUES
 Detect:           EMPIRICAL RULE, TUKEY’S RULE                      |  RESIDUALS AND THE EMPIRICAL RULE
                   SCATTER PLOT, CROSS-TABS TABLE
Example: Golf Scores

  82   77   88   87   78   84   85
  82   72   91   82   85   82   79
  83   75   81   87   86   82   82
  80   84   89   83   78   84   87
   Stem-and-Leaf Display
   Stems          Leaves
 (tens digit)
      7           2
      7           57889            (stem displays data values between 75 and 79)
      8           0122222233444
      8           55677789         (stem displays data values between 85 and 89)
      9           1
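
A minimal sketch of how such a display could be produced from the golf scores. The function name `stem_and_leaf` is hypothetical; splitting each stem into a 0-4 row and a 5-9 row follows the layout of the display above.

```python
from collections import defaultdict

golf_scores = [
    82, 77, 88, 87, 78, 84, 85,
    82, 72, 91, 82, 85, 82, 79,
    83, 75, 81, 87, 86, 82, 82,
    80, 84, 89, 83, 78, 84, 87,
]


def stem_and_leaf(values):
    """Return (stem, leaves) rows, with each stem split into leaves 0-4 and 5-9."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)       # tens digit and units digit
        half = 0 if leaf < 5 else 1      # which of the two rows for this stem
        rows[(stem, half)].append(str(leaf))
    return [(stem, "".join(leaves)) for (stem, _), leaves in sorted(rows.items())]


for stem, leaves in stem_and_leaf(golf_scores):
    print(stem, leaves)
```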
Frequency Table

     Golf Scores   Frequency
72 to 75               2
76 to 79               4
80 to 83               10
84 to 87               9
88 to 91               3
     Histogram for Golf Data
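 [Histogram of golf scores by class: 72-75, 76-79, 80-83, 84-87, 88-91. The tallest bar, with frequency 10, is the 80-83 class. Annotation: mean is a golf score between 80 and 83.]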

Mean of Frequency Histogram is its Balance Point.
Relative Frequency Table
  Golf Scores   Frequency   Relative Frequency
   72 to 75        2        2/28 = 7.1%
   76 to 79        4        14.3%
   80 to 83        10       35.7%
   84 to 87        9        32.1%
   88 to 91        3        10.7%
   Total           28       100.0%
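
The frequency and relative frequency columns can be recomputed from the raw golf scores. This sketch assumes the five classes used in the tables above; the variable names are illustrative.

```python
golf_scores = [
    82, 77, 88, 87, 78, 84, 85,
    82, 72, 91, 82, 85, 82, 79,
    83, 75, 81, 87, 86, 82, 82,
    80, 84, 89, 83, 78, 84, 87,
]

# Class limits are inclusive at both ends, as in "72 to 75".
class_bounds = [(72, 75), (76, 79), (80, 83), (84, 87), (88, 91)]

frequencies = [sum(lo <= s <= hi for s in golf_scores) for lo, hi in class_bounds]
total = sum(frequencies)
relative = [round(100 * f / total, 1) for f in frequencies]  # percent, one decimal

for (lo, hi), f, r in zip(class_bounds, frequencies, relative):
    print(f"{lo} to {hi}   {f:2d}   {r:5.1f}%")
```

Because each percentage is rounded separately, the rounded column may not add up to exactly 100%.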
Cumulative Frequency Table
for Golf Data
Golf Scores   Frequency   Golf Scores   Cumulative Frequency
 72 to 75         2          < 76                2
 76 to 79         4          < 80                6
 80 to 83        10          < 84               16
 84 to 87         9          < 88               25
 88 to 91         3          < 92               28
Cumulative Relative Frequency
Table for Golf Data
  Golf Scores   Cumulative Frequency   Cumul. Relative Frequency
     < 76                2                      7.1%
     < 80                6                     21.4%
     < 84               16                     57.1%
     < 88               25                     89.3%
     < 92               28                      100%
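
The cumulative columns are running totals of the class frequencies. A short sketch, taking the class frequencies from the frequency table as given:

```python
from itertools import accumulate

upper_limits = [76, 80, 84, 88, 92]   # "< 76", "< 80", ...
frequencies = [2, 4, 10, 9, 3]        # class frequencies from the frequency table

cumulative = list(accumulate(frequencies))  # running total of the frequencies
total = cumulative[-1]
cumulative_relative = [round(100 * c / total, 1) for c in cumulative]

for limit, c, r in zip(upper_limits, cumulative, cumulative_relative):
    print(f"< {limit}   {c:2d}   {r:5.1f}%")
```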
A cumulative frequency histogram is called
 an ogive.
The procedure for constructing one is exactly
 the same as for the frequency histogram seen
 earlier.
Box and Whisker Plots
 [Box-and-whisker plot on a scale from 10 to 70. Labels, left to right: Lower Fence, 25th percentile, Median, 75th percentile, Upper Fence.]

In this QA class the lower and upper fences
 will correspond to the lowest and highest
 data values.
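
The five numbers a box-and-whisker plot is drawn from can be sketched as below, using the golf scores from the earlier example as illustrative data. Two assumptions apply: the fences are taken to be the lowest and highest data values, as stated for this class, and the quartiles use the `method="inclusive"` convention of `statistics.quantiles`, which the slides do not specify. The function name is hypothetical.

```python
from statistics import quantiles

golf_scores = [
    82, 77, 88, 87, 78, 84, 85,
    82, 72, 91, 82, 85, 82, 79,
    83, 75, 81, 87, 86, 82, 82,
    80, 84, 89, 83, 78, 84, 87,
]


def five_number_summary(data):
    """Return (lower fence, 25th percentile, median, 75th percentile, upper fence)."""
    q1, med, q3 = quantiles(data, n=4, method="inclusive")  # assumed convention
    return (min(data), q1, med, q3, max(data))              # fences = extremes


summary = five_number_summary(golf_scores)
print(summary)
```

The box spans the 25th to the 75th percentile, the line inside it marks the median, and the whiskers run out to the two fences.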
Line Graph
For time-ordered data, construct a line
 graph by using time as the x-axis, plotting
 values on the y-axis and joining them up with
 lines.
Use to detect trends, cycles, seasonality and
 unusual observations.
Scatter Plots

 [Scatter plot: Golf Score on the y-axis (ticks at 85 and 90); x-axis unlabelled, running from 0 to 80.]
Used with bivariate (or multivariate) data
Each point is an observation.
What about categorical data?
Are there any ways to represent these?
 One-Way Table
Displays the impact of one categorical
 explanatory variable on a quantitative
 dependent variable.
An explanatory variable affects the variation
 in another variable. It is also called an
 independent variable.
A dependent variable is a variable whose
 variation depends upon another variable.
Bivariate Data for Productivity
and Job Switching
Group   Productivity (%)   Switch
 1           106           Yes
 2           95            No
 3           103           Yes
 …           …             ...
 34          98            Yes
 35          97            Yes
 36          94            No
   One Way Table of Impact of
   Job Switching on Productivity
                   No Job Switch   Job Switch
       Number of
        groups          14           22
        Mean         95.5%         103.5%
       Std. Dev.      2.8%          5.8%
        Median        95.0%        102.5%
         IQR          2.0%          6.0%

 Job switch groups have higher mean and median productivity.
 Job switch groups are more variable producers.
     Try Job Switching Throughout Firm
          to Increase Productivity.
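
A one-way table is a summary of the quantitative variable computed separately for each category. The sketch below groups only the six rows that are visible in the bivariate data listing (groups 1, 2, 3, 34, 35 and 36); the remaining 30 rows are elided in the slides, so these figures will not match the full table above. The function name `one_way_table` is hypothetical.

```python
from statistics import mean, median

# (group, productivity %, job switch) for the six rows shown in the slides only.
visible_rows = [
    (1, 106, "Yes"), (2, 95, "No"), (3, 103, "Yes"),
    (34, 98, "Yes"), (35, 97, "Yes"), (36, 94, "No"),
]


def one_way_table(rows):
    """Summarise productivity separately for each job-switch category."""
    groups = {}
    for _, productivity, switch in rows:
        groups.setdefault(switch, []).append(productivity)
    return {
        switch: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for switch, vals in groups.items()
    }


table = one_way_table(visible_rows)
for switch, stats in table.items():
    print(switch, stats)
```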
Multiple Box and Whisker Plots
For multivariate or bivariate data where one
 of the variables is categorical.
Get a visual comparison between different
 categories by drawing a box plot of the
 values of the non-categorical variable.
Use one box plot for each category and use
 the same scale.
Constructing a Cross-Tabs Table for
Productivity and Job Switching Data
1. Select Sample.
2. Cross-Classify Each Element by Two Categorical Variables.
3. Subdivide Each Categorical Variable into Mutually Exclusive and Exhaustive Categories.
4. Arrange Sample Data in Cross-Tabs Table.

                   NO JOB SWITCH   JOB SWITCH   Row Totals
   PROD                 13              4           17
   PROD                  1             18           19
   Column Totals        14             22           36 (Entire Population)
Percentage Tables
A joint percentage table shows the percentage of
 the entire population found in each box.

                   NO JOB SWITCH   JOB SWITCH   Row Totals
   PROD                36.1%          11.1%        47.2%
   PROD                 2.8%          50.0%        52.8%
   Column Totals       38.9%          61.1%        100% (Entire Population)
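
Each joint percentage is a cell count divided by the size of the entire population. A sketch using the counts from the cross-tabs table (13, 4, 1, 18); the two productivity row labels are incomplete in the slides, so the rows are identified here only by position.

```python
# Rows: the two productivity categories, in the order shown in the cross-tabs table.
# Columns: no job switch, job switch.
counts = [
    [13, 4],
    [1, 18],
]

total = sum(sum(row) for row in counts)  # entire population


def pct(count):
    """Share of the entire population, as a percentage to one decimal place."""
    return round(100 * count / total, 1)


joint = [[pct(c) for c in row] for row in counts]
row_totals = [pct(sum(row)) for row in counts]
column_totals = [pct(sum(col)) for col in zip(*counts)]

print("joint %:", joint)
print("row totals %:", row_totals)
print("column totals %:", column_totals)
```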
 The Shape of Histograms

   SYMMETRIC                             SKEWED TO

 Symmetric Histogram: Classes to the Left and
Right of the Mean are Mirror Images.
 Skewed Histogram: the Distribution Falls Off
More Slowly on One Side of the Mean.
If our data set was highly skewed, would we
 prefer to use the mean and standard
 deviation, or the median and the quartiles?
Median and the quartiles
Summarising Skewed Cross-Sectional Data

 Trimmed Mean
 Upper Quartile and Lower Quartile and
  the Interquartile Range
If the histogram for a data set is:
Approximately bell-shaped
   – a data value more than three standard deviations
     away from the mean should be considered an
     outlier. (Empirical Rule)
Not bell-shaped
   – use Chebyshev’s rule. (Not discussed in this
     workshop or required for MBA degree)
  The Empirical Rule Applied to
  Golf Data
        $\bar{x} \pm 1s$        ABOUT 68% OF DATA
        $\bar{x} \pm 2s$        ABOUT 95% OF DATA
        $\bar{x} \pm 3s$        ALMOST ALL DATA

           $\bar{x} = 82.7$ and $s = 4.29$
The Mean ± 1 Standard Deviation    Score 78.41 to 86.99
The Mean ± 2 Standard Deviations   Score 74.12 to 91.28
The Mean ± 3 Standard Deviations   Score 69.83 to 95.57
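
To see how closely the golf data follow the Empirical Rule, the sketch below recomputes the mean and standard deviation and counts the scores that fall within one, two and three standard deviations of the mean. The helper name `count_within` is hypothetical. The intervals are built from the unrounded mean and standard deviation, so their end points differ slightly from the rounded ones listed above.

```python
from statistics import mean, stdev

golf_scores = [
    82, 77, 88, 87, 78, 84, 85,
    82, 72, 91, 82, 85, 82, 79,
    83, 75, 81, 87, 86, 82, 82,
    80, 84, 89, 83, 78, 84, 87,
]

x_bar = mean(golf_scores)
s = stdev(golf_scores)


def count_within(k):
    """Number of scores within k standard deviations of the mean."""
    lo, hi = x_bar - k * s, x_bar + k * s
    return sum(lo <= x <= hi for x in golf_scores)


print(f"mean = {x_bar:.1f}, s = {s:.2f}")
for k in (1, 2, 3):
    inside = count_within(k)
    print(f"within {k} std dev: {inside} of {len(golf_scores)} "
          f"({100 * inside / len(golf_scores):.1f}%)")

# Scores more than three standard deviations from the mean would be flagged as outliers.
outliers = [x for x in golf_scores if abs(x - x_bar) > 3 * s]
print("outliers:", outliers)
```

On these 28 scores the one-standard-deviation share comes out somewhat below 68%, which is a reminder that the rule is approximate, and no score is flagged as an outlier.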
Problem Diagnosis
Would like to determine the root causes of an
 outlier.
What is unique about the outlier group or
 time period?
What has changed that might account for
 the outliers?
Is it a mistake?
What did we do?
Explored the use of statistics, graphs and
 tables to summarise and describe data.
Learned how to identify outliers.
Used tables to explore relationships
 between variables.
Talked about how graphs and charts can lie.
