# Displaying and Summarising Data by liuhongmei

```
Workshop 2: Basic statistics

Instructor: Dr Patrick Johanns
Office: 239

Phone:       9349-8162
Fax:         9349-8133
E-mail: p.johanns@mbs.edu
Today
Types of data
Displaying data
Summarising data
Intent of workshop

Give an overview of basic measures of
– Central tendency
– Variability or dispersion
Explore and review different ways to
display data via graphs and tables
Dangers with Quantifying

False sense of “accuracy”.
May not be able to be compared to other
factors.
Estimates/data used to quantify a factor
may be wrong or insufficient.
Given these dangers, why quantify
anything?
Types of Data
Discrete vs. Continuous
Categorical vs. Quantitative
Cross-Sectional vs. Time-Ordered
Univariate vs. Bivariate and Multivariate
Discrete and Continuous Variables
Discrete Variable: Must take one of a set of
fixed values. Values are counted. (Usually
whole numbers).
– Number of children in family: can’t have 3.4
children
Continuous Variable: Can take any fractional
value at all. Values are measured.
– Height of a tree. Almost never find a tree that is
exactly 12 feet high.
Categorical (Qualitative) and Quantitative Variables
Element      CATEGORICAL    QUANTITATIVE

Person       Religion       Height
State        Flower         Area
House        Style          Size
Firm         Industry       Sales
Worker       Gender         Age
Computer     Brand          Size of Memory
Cross-Sectional and Time-Ordered Data

Cross-Sectional Data: DATA FOR ONE TIME
PERIOD TAKEN ON DIFFERENT PERSONS OR
PLACES
– Absenteeism Data for Four Departments in January
– Yearly Sales for Three Divisions
Time-Ordered Data: DATA FROM MULTIPLE
TIME PERIODS
– Monthly Absenteeism Data over a One-Year Period
– Daily IBM Stock Prices over a One-Month Period
Univariate and Bivariate Data
ELEMENT   Univariate      Bivariate
Worker    Gender          Gender, Promotion Status (yes, no)
Product   Market Share    Market Share, Price
Group     Productivity    Productivity, Job Switching (yes, no)
Observation: single object or time-period. Can
have more than one piece of information.
Spot Check on types of data
Average weights and lengths of newborn
babies on March 3 at Royal Women’s Hospital
Mean marks by departments for academic year
1999 at University of Melbourne
The Age Poll on gambling views
Hourly defect numbers for a product on a
production line
Critical Questions that

On average, how are we doing?
Is there much variation?
Are there outliers?
Typical Values
What do we mean when we use the word
“average”?
If a country had 3 people who earned a
million dollars each and 100 people who
earned \$20,000, what would the average
person in that country earn?
“Average” or “typical” depends on context:
there’s no single definition
Suburban Melbourne
In a suburb of Melbourne, five houses were
sold last month for \$1,208,000, \$759,000,
\$795,000, \$990,000 and \$579,000.
What kind of community do you think it is?
What would the average house in this
community be like?
Is this community average?
What kind of statistics would you use to
describe the data?
The Sample
Draw 5 observations from a population.
They are:
8, 14, 3, 7, 23
Want to describe this sample: its typical
Sample Mean

x̄ = (Σ xi) / n      where xi are the data values
                     n is the sample size

x̄ = (8 + 14 + 3 + 7 + 23) / 5 = 11
Sample Median
Take all observations
Order them
Count observations until you are exactly
half way: this is the median value
(If there are an even number of
observations then the median is half-way
between the two middle values.)

Ordered data: 3, 7, 8, 14, 23
Median = 8
Sample Mode
What is the mode?
Trimmed Mean
Rank order data.
Eliminate the lowest x% and the highest
x% of data.
Compute sample mean for remaining data.
Reduces effect of outliers.
questions:
How closely grouped are the data?
Is there a lot of variation?
How far apart are the lowest and highest
observations?
The Range
Rank order the data.
Subtract smallest from the largest data
value.
Range = 23 − 3 = 20
Use only two values to calculate range.
Quartiles and Inter-Quartile
Range
Quartiles (Upper and Lower) are like
Median except that they are the quarter and
three-quarter points in the data
Inter-quartile range is the distance between
the quartiles
Sample Standard Deviation

s = √( Σ (xi − x̄)² / (n − 1) )      where x̄ is the sample mean
                                     n is the sample size

s = √( [(8 − 11)² + (14 − 11)² + (3 − 11)² + (7 − 11)² + (23 − 11)²] / (5 − 1) )
s = √(242 / 4) ≈ 7.78
Variance and Coefficient of
Variation
Variance is just the square of the standard
deviation:
V = s²

Coefficient of Variation is standard deviation
divided by the mean:
CV = (s / x̄) × 100%

CV is useful because it is unit-less (doesn’t
depend on the units in which the data are measured).
Quick Problems
What is the mean and standard deviation of
the following set of numbers: {1, 1, 1}?
Mean = 1
Standard deviation = 0
Quick Problems
The hourly wages of three employees are
\$4.00, \$4.50, and \$5.00. What would
happen to the mean and standard deviation
of these wages if the following occurred:
each got a raise of \$.50 per hour?
Mean goes up \$0.50
Standard deviation stays the same
Quick Problems
The hourly wages of three employees are
\$4.00, \$4.50, and \$5.00. What would
happen to the mean and standard deviation
of these wages if the following occurred:
the hourly wage of each doubled?
Mean doubles
Standard deviation doubles
Quick Problems
[Figure: two frequency histograms of pocket money, one per school]
Pocket Money (\$)   School A frequency   School B frequency
50                  10                   20
100                 20                   10
150                 20                   10
200                 10                   20

Which school has the greater mean? the
greater standard deviation?
Excel - Descriptive Statistics

Go to “Data Analysis…”
 Choose “Descriptive
Statistics”
Enter the cell range of the data
in the “Input Range” box
Tick “Summary statistics” box
Describing Data
                  CROSS-SECTIONAL DATA   TIME-ORDERED DATA
Visualise Data    STEM and LEAF          LINE GRAPH
                  FREQUENCY HIST.        MOVING AVERAGE
                  OGIVE                  RESIDUAL GRAPH
                  BOX PLOT
Summarise Data    MEAN and STD. DEV.     MOVING AVERAGE VALUES
                  MEDIAN and IQR
                  TRIMMED MEAN
                  RANGE
Detect Outliers   EMPIRICAL RULE         RESIDUALS AND THE EMPIRICAL RULE
                  TUKEY’S RULE
                  CHEBYSHEV’S RULE
Compare Data      SCATTER PLOT
                  CROSS-TABS TABLE
                  PERCENTAGE TABLE
Example: Golf Scores

82   77   88   87   78   84   85
82   72   91   82   85   82   79
83   75   81   87   86   82   82
80   84   89   83   78   84   87
Stem-and-Leaf Display
Stems          Leaves
(tens digit)
7              2
7              57889            (stem displays data values between 75 and 79)
8              0122222233444
8              55677789         (stem displays data values between 85 and 89)
9              1
Frequency Table

Golf Scores   Frequency
72 to 75          2
76 to 79          4
80 to 83         10
84 to 87          9
88 to 91          3
Total            28
Histogram for Golf Data
[Figure: frequency histogram of golf scores with classes 72-75, 76-79, 80-83, 84-87 and 88-91; annotation: the mean golf score is between 80 and 83]

Mean of Frequency Histogram is its Balance Point.
Relative Frequency Table
Golf Scores   Frequency   Relative Frequency
72 to 75          2       2/28 = 7.1%
76 to 79          4       14.3%
80 to 83         10       35.7%
84 to 87          9       32.1%
88 to 91          3       10.7%
Total            28       100.0% (up to rounding)
Cumulative Frequency Table
for Golf Data
Golf Scores   Frequency   Golf Scores   Cumulative Frequency
72 to 75          2       < 76               2
76 to 79          4       < 80               6
80 to 83         10       < 84              16
84 to 87          9       < 88              25
88 to 91          3       < 92              28
Cumulative Relative Frequency
Table for Golf Data
Golf Scores   Cumulative Frequency   Cumul. Relative Frequency
< 76                2                    7.1%
< 80                6                   21.4%
< 84               16                   57.1%
< 88               25                   89.3%
< 92               28                  100%
Ogive
A cumulative frequency histogram is called
an ogive.
Procedure for constructing one is exactly
the same as the frequency histogram seen
previously.
Box and Whisker Plots
[Figure: box and whisker plot on a scale from 10 to 70, labelled with the lower fence, 25th percentile, median, 75th percentile and upper fence]

In this QA class the lower and upper fences
will correspond to the lowest and highest
values.
Line Graph
For time-ordered data, construct a line
graph by using time as x-axis, plotting
values on y-axis and joining them up with
lines.
Use to detect trends, cycles, seasonality and
unusual observations.
Scatter Plots
[Figure: scatter plot of golf score (y-axis, 70 to 95) against age (x-axis, 0 to 80)]
Used with bivariate (or multivariate) data
Each point is an observation.
variables?
Are there any ways to represent these?
One-Way Table
Displays the impact of one categorical
explanatory variable on a quantitative
dependent variable.
An explanatory variable affects the variation
in another variable. It is also called an
independent variable.
A dependent variable is a variable whose
variation depends upon another variable.
Bivariate Data for Productivity
and Job Switching
Group   Productivity (%)   Switch
1           106           Yes
2           95            No
3           103           Yes
…           …             ...
34          98            Yes
35          97            Yes
36          94            No
One Way Table of Impact of
Job Switching on Productivity
                   No Job Switch   Job Switch
Number of groups        14             22
Mean                  95.5%         103.5%
Std. Dev.              2.8%           5.8%
Median                95.0%         102.5%
IQR                    2.0%           6.0%

– Job switch groups have higher mean and median productivity.
– Productivity is more variable among job switch groups.
Try Job Switching Throughout Firm
to Increase Productivity.
Multiple Box and Whisker Plots
For multivariate or bivariate data where one
of the variables is categorical.
Get a visual comparison between different
categories by drawing a box plot of the
values of the non-categorical variable.
Use one box plot for each category and use
the same scale.
Constructing a Cross-Tabs Table for
Productivity and Job Switching Data
1. Select Sample.
2. Cross-Classify Each Element by Two Categorical Variables.
3. Subdivide Each Categorical Variable into Mutually Exclusive and
   Exhaustive Categories.
4. Arrange Sample Data in Cross-Tabs Table.

                NO JOB SWITCH   JOB SWITCH   Row Totals
LOW PROD             13              4           17
HIGH PROD             1             18           19
Column Totals        14             22           36 (Entire Population)
Percentage Tables
A joint percentage table shows the percentage of
the entire population found in each box.

                NO JOB SWITCH   JOB SWITCH   Row Totals
LOW PROD            36.1%          11.1%        47.2%
HIGH PROD            2.8%          50.0%        52.8%
Column Totals       38.9%          61.1%       100% (Entire Population)
The Shape of Histograms
[Figure: two histograms, one symmetric and one skewed to the right, each with its approximate mean marked]

– Symmetric Histogram: Classes to the Left and
Right of the Mean are Mirror Images.
– Skewed Histogram: the Distribution Falls Off
More Slowly on One Side of the Mean.
Question
If our data set was highly skewed, would we
prefer to use the mean and standard
deviation, or the median and the quartiles?
Median and the quartiles
Summarising Skewed Cross-
Sectional Data

Median
Trimmed Mean
Upper Quartile and Lower Quartile and
the Interquartile Range
Outliers
If the histogram for a data set is:
Approximately bell-shaped
– a data value more than three standard deviations
away from the mean should be considered an
outlier. (Empirical Rule)
Skewed
– use Chebyshev’s rule. (Not discussed in this
workshop or required for MBA degree)
The Empirical Rule Applied to
Golf Data
NUMBER OF STD. DEV. FROM MEAN     EMPIRICAL RULE
±3                                ALMOST ALL DATA

x̄ = 82.7 and s = 4.29
The Mean ± 1 Standard Deviation    Score 78.41 to 86.99
The Mean ± 2 Standard Deviations   Score 74.12 to 91.28
The Mean ± 3 Standard Deviations   Score 69.83 to 95.57
Problem Diagnosis
Would like to determine the root causes of an
outlier.
What is unique about the outlier group or
time period?
What has changed that might account for
the outliers?
Is it a mistake?
What did we do?
Explored the use of statistics, graphs and
tables to summarise and describe data.
Learned how to identify outliers.
Used tables to explore relationships
between variables.
Talked about how graphs and charts can lie.

```
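
The worked figures in the slides can be reproduced with a few lines of Python. The sketch below is not part of the workshop, which uses Excel; it relies only on the standard `statistics` module, and the helper names `summarise`, `frequency_table` and `empirical_rule_bounds` are hypothetical, chosen here for illustration. It recomputes the summary measures for the five-observation sample, rebuilds the golf frequency table using the same class boundaries as the slides, and derives the mean ± k standard deviation bounds used for the Empirical Rule.

```python
import statistics

# Five-observation sample from the "The Sample" slide.
sample = [8, 14, 3, 7, 23]

# Golf scores from the "Example: Golf Scores" slide.
golf = [
    82, 77, 88, 87, 78, 84, 85,
    82, 72, 91, 82, 85, 82, 79,
    83, 75, 81, 87, 86, 82, 82,
    80, 84, 89, 83, 78, 84, 87,
]


def summarise(data):
    """Return basic measures of central tendency and dispersion."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)  # sample standard deviation (n - 1 divisor)
    return {
        "mean": mean,
        "median": statistics.median(data),
        "range": max(data) - min(data),
        "std_dev": s,
        "variance": statistics.variance(data),  # square of std_dev
        "cv_percent": s / mean * 100,  # coefficient of variation
    }


def frequency_table(data, start=72, width=4, classes=5):
    """Build rows of (low, high, count, relative, cumulative) per class.

    The defaults reproduce the slide's classes: 72 to 75, ..., 88 to 91.
    """
    rows = []
    cumulative = 0
    for k in range(classes):
        low = start + k * width
        high = low + width - 1
        count = sum(1 for x in data if low <= x <= high)
        cumulative += count
        rows.append((low, high, count, count / len(data), cumulative))
    return rows


def empirical_rule_bounds(data, k):
    """Return (mean - k*s, mean + k*s) for the given data."""
    mean = statistics.mean(data)
    s = statistics.stdev(data)
    return mean - k * s, mean + k * s


if __name__ == "__main__":
    for name, value in summarise(sample).items():
        print(f"{name}: {value:.2f}")
    for low, high, count, rel, cum in frequency_table(golf):
        print(f"{low} to {high}: {count:2d}  {rel:6.1%}  cumulative {cum}")
    lower, upper = empirical_rule_bounds(golf, 3)
    outliers = [x for x in golf if not lower <= x <= upper]
    print(f"3 std. dev. bounds: {lower:.2f} to {upper:.2f}, outliers: {outliers}")
```

The bounds computed this way differ slightly from the slide's 69.83 to 95.57 because the slide rounds the mean and standard deviation to 82.7 and 4.29 before multiplying.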