# Week1


```
Descriptive vs. Inferential Statistics

   Descriptive
•   Methods for summarizing data
•   Summaries usually consist of graphs and numerical
summaries of the data
   Inferential
•   Methods of making decisions or predictions about a
population based on sample information.
Data Vocabulary
   We will refer to Data as plural and data set
as a particular collection of data as a whole.
   Observation – each data value.
   Subject (or individual) – an item for study
(e.g., an employee in your company).
   Variable – a characteristic about the subject
or individual (e.g., employee’s income).
Data Vocabulary

Consider a multivariate data set with
5 variables and 8 subjects:
5 x 8 = 40 observations
Data Vocabulary – Data Types
   A data set may have a mixture of data types.

Types of Data

Attribute (qualitative)
    Verbal label (e.g., X = economics)
    Coded (e.g., X = 3)
Numerical (quantitative)
    Discrete (e.g., X = 2)
    Continuous (e.g., X = 3.15)
Data Vocabulary – Attribute Data
    Also called categorical, nominal or qualitative
data.
   Values are described by words rather than
numbers.
   For example,
• Automobile style (e.g., X = full, midsize,
compact, subcompact).
Data Vocabulary – Data Coding
    Coding refers to using numbers to represent
categories to facilitate statistical analysis.
   Coding an attribute as a number does not make
the data numerical.
•   For example,
1 = Bachelor’s, 2 = Master’s, 3 = Doctorate
1 = Liberal, 2 = Moderate, 3 = Conservative
Data Vocabulary – Binary Data

    A binary variable has only two values,
1 = presence, 0 = absence of a characteristic of
interest (codes themselves are arbitrary).
   For example,
1 = employed, 0 = not employed
1 = married, 0 = not married
1 = male, 0 = female
1 = female, 0 = male
   The coding itself has no numerical value so binary
variables are attribute data.
Data Vocabulary – Numerical
Data
 Numerical or quantitative data arise from counting or some
kind of mathematical operation.
   For example,
- Number of auto insurance claims filed in
March (e.g., X = 114 claims).
- Ratio of profit to sales for last quarter
(e.g., X = 0.0447).
   Can be broken down into two types – discrete or continuous
data.
Data Vocabulary – Discrete Data

    A numerical variable with a countable number of
values that can be represented by an integer (no
fractional values).
   For example,
- Number of Medicaid patients (e.g., X = 2).
- Number of takeoffs at O’Hare (e.g., X = 37)
Data Vocabulary – Continuous
Data
   A numerical variable that can have any value
within an interval (e.g., length, weight, time, sales,
price/earnings ratios).
   Any continuous interval contains infinitely many
possible values (e.g., 426 < X < 428).
Data Vocabulary - Rounding
   Ambiguity is introduced when continuous data are
rounded to whole numbers.
   Underlying measurement scale is continuous.
   Precision of measurement depends on instrument.
    Sometimes discrete data are treated as
continuous when the range is very large (e.g., SAT
scores) and small differences (e.g., 604 or 605)
aren’t of much importance.
Four Levels of Measurement

Level of
Measurement   Characteristics          Example
Nominal       Categories only          Eye color (blue, brown, green, hazel)
Ordinal       Rank has meaning         Bond ratings (Aaa, Aab, C, D, F, etc.)
Interval      Distance has meaning     Temperature (57° Celsius)
Ratio         Meaningful zero exists   Accounts payable (\$21.7 million)
Nominal Level of Measurement
   Nominal data merely identify a category.

   Nominal data are qualitative, attribute, categorical or
classification data (e.g., Apple, Compaq, Dell, HP).

   Nominal data are usually coded numerically, codes are
arbitrary (e.g., 1 = Apple, 2 = Compaq, 3 = Dell, 4 = HP).

   Only mathematical operations are counting (e.g.,
frequencies) and simple statistics.
Ordinal Level of Measurement
   Ordinal data codes can be ranked
(e.g., 1 = Frequently, 2 = Sometimes, 3 = Rarely,
4 = Never).

   Distance between codes is not meaningful
(e.g., distance between 1 and 2, or between 2 and 3, or
between 3 and 4 lacks meaning).

Many useful statistical tests exist for ordinal data.
Especially useful in social science, marketing and human
resource research.
Interval Level of Measurement
    Data can not only be ranked, but also have meaningful
intervals between scale points (e.g., the difference between
60°F and 70°F is the same as the difference between 20°F and
30°F).

   Since intervals between numbers represent distances,
mathematical operations can be performed (e.g., average).

   Zero point of interval scales is arbitrary, so ratios are not
meaningful (e.g., 60°F is not twice as warm as 30°F).
Level of Measurement – Likert
Scales
    A special case of interval data frequently used in survey
research.
   The coarseness of a Likert scale refers to the number of
scale points (typically 5 or 7).
“College-bound high school students should be required to study
a foreign language.” (check one)
[ ] Strongly Agree   [ ] Somewhat Agree   [ ] Neither Agree Nor Disagree   [ ] Somewhat Disagree   [ ] Strongly Disagree
Likert Scales
   Careful choice of verbal anchors results in measurable
intervals (e.g., the distance from 1 to 2 is “the same” as the
interval, say, from 3 to 4).

   Ratios are not meaningful (e.g., here 4 is not
twice 2).

   Many statistical calculations can be performed (e.g.,
averages, correlations, etc.).
Time Series vs. Cross-sectional
Data – Time Series
•    Each observation in the sample represents a different
equally spaced point in time (e.g., years, months, days).
•   Periodicity may be annual, quarterly, monthly, weekly,
daily, hourly, etc.
•   We are interested in trends and patterns over time (e.g.,
annual growth in
consumer debit card use
from 1999 to 2008).
Time Series vs. Cross-sectional
Data – Cross-sectional
•   Each observation represents a different individual unit (e.g.,
person) at the same point in time (e.g., monthly VISA
balances).

•   We are interested in
- variation among observations or in
- relationships.

•   We can combine the two data types to get pooled cross-
sectional and time series data.
Population and Sample

   Population: All subjects of interest

   Sample: Subset of the population for whom we
have data
Populations and Samples
[Figure: a population and a sample drawn from it]
Example: The Sample and the
Population for an Exit Poll

   In California in 2003, a special election was held to
consider whether Governor Gray Davis should be
recalled from office.
   An exit poll sampled 3160 of the 8 million people
who voted.
Example: The Sample and the Population for an Exit Poll

   What’s the sample and the population for this
exit poll?

   The population was the 8 million people who
voted in the election.
   The sample was the 3160 voters who were
interviewed in the exit poll.
Parameter and Statistic
   A parameter is a numerical summary of the
population

   A statistic is a numerical summary of a
sample taken from the population
Sampling Methods
Probability Samples

Simple Random Sample   Use random numbers to select items from a list (e.g., VISA cardholders).
Systematic Sample      Select every kth item from a list or sequence (e.g., restaurant customers).
Stratified Sample      Select randomly within defined strata (e.g., by age, occupation, gender).
Cluster Sample         Like stratified sampling except strata are geographical areas (e.g., zip codes).
Sampling Methods
Nonprobability Samples
Judgment Sample      Use expert knowledge to choose “typical” items (e.g., which employees to interview).
Convenience Sample   Use a sample that happens to be available (e.g., asking for opinions at lunch).
Simple Random Sample
 Every item in the population of N items has the same
chance of being chosen in the sample of n items.
 We rely on random numbers to select a name. In Excel,
=RANDBETWEEN(1,48) returns a random integer from 1 to 48.
Graphical Summaries
   Describe the main features of a variable
   For Quantitative variables: key features are
•   Center: Where are the data values concentrated? What seem to be typical or middle data values?
•   Spread: How much variation is there in the data? How spread out are the data values? Are there unusual values?
•   Shape: Are the data values distributed symmetrically? Skewed? Sharply peaked? Flat? Bimodal?
   For Categorical variables: key feature is the
percentage in each of the categories
Frequency Table

   A method of organizing data

   Lists all possible values for a variable along with
the number of observations for each value

   Natural categories exist for qualitative variables

   For quantitative variables artificial “bins” are
created
Example: Shark Attacks
   What is the variable?

   Is it categorical or quantitative?

   How is the proportion for Florida calculated?

   How is the % for Florida calculated?
Example: Shark Attacks

   Insights – what the data tells us about shark
attacks
Graphs for Categorical Data
   Pie Chart: A circle having a “slice of pie” for
each category. Center angle of slice represents
relative frequency/percentage.

   Bar Graph: A graph that displays a vertical bar
for each category. Length of bars represents
frequency.
Example: Sources of Electricity Use
Pie Chart

•   A pie chart can only convey a general idea of the data.

•   Pie charts should be used to portray data which sum to
a total (e.g., percent market shares).

•   A pie chart should only have a few (i.e., 3 to 5) slices.

•   Each slice should be labeled with data values or
percents.
[Figures: Pie Chart; Bar Chart]
Pie Charts Are Often Abused

•   Consider the following charts used to illustrate an article
from the Wall Street Journal.
Which type is better? Why?

2-D Pie Chart                     Bar Chart

•    Exploded and 3-D pie charts add strong visual impact
but slices are hard to assess.

Exploded Pie Chart           Exploded 3-D Pie Chart
Summarizing Quantitative Data
   Example: Price/Earnings Ratios

•   P/E ratios are current stock price divided by earnings per
share in the last 12 months.
Graphs for Quantitative Data
 Dot Plot: shows a dot for each observation
 Histogram: uses bars to portray the data

Which is Best?
 Dot-plot
•   More useful for small data sets
•   Data values are retained

   Histogram
•   More useful for large data sets
•   Most compact display
•   More flexibility in defining intervals
Dot Plot

•   A dot plot is the simplest graphical display of n individual
values of numerical data.
- Easy to understand
- Not good for large samples (e.g., > 5,000).
•   Make a scale that covers the data range
•   Mark the axes and label them
•   Plot each data value as a dot above the scale at its
approximate location
•   If more than one data value lies at about the same axis
location, the dots are piled up vertically.
Dot Plot

•   Range of data shows dispersion.
•   Clustering shows central tendency.
•   Dot plots do not reveal much about the shape of the distribution.

•   Can add annotations (text boxes) to call attention to
specific features.
Frequency Distributions and
Histograms
•   A frequency distribution is a table formed by classifying
n data values into k classes (bins).
•   Bin limits define the values to be included in each bin.
Widths must all be the same.
•   Frequencies are the number of observations within each
bin.
•   Express as relative frequencies (frequency divided by
the total) or percentages (relative frequency times 100).
Constructing a Frequency
Distribution
1.    Sort data in ascending order (e.g., P/E ratios)
2.    Choose the number of bins (k)
- k should be much smaller than n.
-     Too many bins results in sparsely populated bins, too
few and dissimilar data values are lumped together.

8   10   10   10   13   13   14   14   15   15
16   16   17   18   19   19   20   20   21   22
23   26   26   27   29   29   34   48   55   68
Constructing a Frequency
Distribution – Sturges’ Rule
Sample Size (n)   Number of Bins (k)
16                5
32                6
64                7
128               8
256               9
512               10
1024              11
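Sturges’ Rule as a formula (an added note; the formula itself is not shown on the slide): the table values are consistent with the common statement of the rule, $k = 1 + 3.3 \log_{10}(n)$, rounded to the nearest whole number. A quick check:

n = 16:     1 + 3.3 × 1.204 = 4.97, so k = 5
n = 64:     1 + 3.3 × 1.806 = 6.96, so k = 7
n = 1024:   1 + 3.3 × 3.010 = 10.93, so k = 11

In Excel, assuming the sample size is stored in cell A1 (a hypothetical cell reference):
=ROUND(1+3.3*LOG10(A1),0)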
Constructing a Frequency
Distribution
3.   Set the bin limits according to k from Sturges’ Rule:

$\text{Bin width} \approx \frac{x_{\max} - x_{\min}}{k}$

For example, for k = 7 bins, the approximate bin width is:

$\text{Bin width} \approx \frac{68 - 8}{7} = \frac{60}{7} \approx 8.57$

To obtain “nice” limits, round the width to 10 and start
the first bin at 0 to yield: 0, 10, 20, 30, 40, 50, 60, 70
Constructing a Frequency
Distribution
4.    Put the data values in the appropriate bin
In general, the lower limit is included in the bin while
the upper limit is excluded.
5.    Create the table: you can include
Frequencies – counts for each bin
Relative frequencies – absolute frequency divided by
total number of data values.
Cumulative frequencies – accumulated relative
frequency values as bin limits increase.

Bin Limits for the P/E Ratio Data

Bin Range               Frequency   Relative Frequency   Cumulative Relative Frequency
0 ≤ P/E Ratio < 10      1           0.0333               0.0333
10 ≤ P/E Ratio < 20     15          0.5000               0.5333
20 ≤ P/E Ratio < 30     10          0.3333               0.8667
30 ≤ P/E Ratio < 40     1           0.0333               0.9000
40 ≤ P/E Ratio < 50     1           0.0333               0.9333
50 ≤ P/E Ratio < 60     1           0.0333               0.9667
60 ≤ P/E Ratio < 70     1           0.0333               1.0000

Frequency Distributions and
Histograms
•   A histogram is a graphical representation of a
frequency distribution.
•   Y-axis shows frequency within each bin.
•   A histogram is a bar chart with no gaps between bars
•   X-axis ticks show end points of each bin.

Frequency Distributions and
Histograms
•   Consider 3 histograms for the P/E ratio data with
different bin widths. What do they tell you?
Frequency Distributions and
Histograms – Modal Class
•   A histogram bar that is higher than those on either side
is called the modal class.
•   Monomodal – a single modal class.
•   Bimodal – two modal classes.
•   Multimodal – more than two modal classes.
•   Modal classes may be artifacts of the way bin limits are
chosen.

Shape of Histograms

•   A histogram suggests the shape of the population.
•   It is influenced by number of bins and bin limits.
•   Skewness – indicated by the direction of the longer tail
of the histogram.
•   Left-skewed – (negatively skewed) a longer left tail.
•   Right-skewed – (positively skewed) a longer right tail.
•   Symmetric – both tail areas approximately the same.

Line Charts

•   Used to display a time series or spot trends, or to compare
time periods.
•   Can display several
variables at once.
Scatter Plots for Bi-variate Data

•   A scatter plot shows n pairs of observations as dots (or
some other symbol) on an XY graph.

•   A starting point for bivariate data analysis.

•   Allows observations about the relationship between two
variables.

•   Answers the question: Is there an association between
the two variables and if so, what kind of association?
Scatter Plot Example: Birth Rates
vs. Life Expectancy
Nation          Birth Rate   Life Expectancy
Afghanistan       41.03           46.60
Finland           10.60           77.80
Guatemala         34.17           66.90
Japan             10.03           80.90
Mexico            22.36           72.00
Pakistan          30.40           62.70
Spain              9.29           79.10
United States     14.10           77.40
Scatter Plot Example: Birth Rates
vs. Life Expectancy

•   Here is a scatter plot with life expectancy on the
X-axis and birth rates on the Y-axis.
•   Is there an association between the two variables?
•   Is there a cause-and-effect relationship?
Scatter Plot Example: Aircraft
Fuel Consumption
•   Consider five observations on flight time and fuel
consumption for a twin-engine Piper Cheyenne aircraft.
•   A causal relationship is assumed since a longer flight
would consume more fuel.
Trip Leg   Flight Time (hours)   Fuel Used (pounds)
1          2.3                   145
2          4.2                   258
3          3.6                   219
4          4.7                   276
5          4.9                   283
Scatter Plot Example: Aircraft
Fuel Consumption
•   Here is the scatter plot with flight time (explanatory) on
the X-axis and fuel use (response) on the Y-axis. Is there
an association between the variables?
Scatter Plots for Bi-variate Data

Very strong association   Strong association

Moderate association      Little or no association
Scatter Plots and Policy Making

•   Scatter plots can be helpful when policy decisions need
to be made.

•   For example, compare traffic fatalities resulting from
crashes per million vehicles sold between 1995 and
1999.

•   Do SUVs create a greater risk to the drivers of both
cars?
Numerical Descriptive Statistics

How Can We describe the Center of
Quantitative Data?
Measures of Central Tendency

Statistic: Mean
    Formula:         $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
    Excel formula:   =AVERAGE(Data)
    Pro:             Familiar and uses all the sample information.
    Con:             Influenced by extreme values.
Statistic: Median
    Formula:         Middle value in sorted array
    Excel formula:   =MEDIAN(Data)
    Pro:             Robust when extreme data values exist.
    Con:             Ignores extremes and can be affected by gaps in data values.
Measures of Central Tendency

Statistic: Mode
    Formula:         Most frequently occurring data value
    Excel formula:   =MODE(Data)
    Pro:             Useful for attribute data or data with a small range.
    Con:             May not be unique, and is not helpful for continuous data.
Statistic: Midrange
    Formula:         $\frac{x_{\min} + x_{\max}}{2}$
    Excel formula:   =0.5*(MIN(Data)+MAX(Data))
    Pro:             Easy to understand and calculate.
    Con:             Influenced by extreme values and ignores most data values.
Measures of Central Tendency
Statistic: Geometric mean (G)
    Formula:         $G = \sqrt[n]{x_1 x_2 \cdots x_n}$
    Excel formula:   =GEOMEAN(Data)
    Pro:             Useful for growth rates and mitigates high extremes.
    Con:             Less familiar and requires positive data.
Statistic: Trimmed mean
    Formula:         Same as the mean except omit highest and lowest k% of data values (e.g., 5%)
    Excel formula:   =TRIMMEAN(Data, %)
    Pro:             Mitigates effects of extreme values.
    Con:             Excludes some data values that could be relevant.
Measures of Central Tendency -
Mean

•   A familiar measure of central tendency.

Population formula: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$

Sample formula: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$

•   In Excel, use function =AVERAGE(Data) where
Data is an array of data values.
Characteristics of the Mean

•   Arithmetic mean is the most familiar average.
•   Affected by every sample item.
•   The balancing point or fulcrum for the data.
Characteristics of the Median

•    For n = 8, the median is between the fourth and
fifth observations in the data array.
Characteristics of the Median

•    For n = 9, the median is the fifth observation in the
data array.
Comparison Among Mean,
Median, and Mode
•   Consider the following quiz scores for 4 students:
Lee’s scores:  60, 70, 70, 70, 80     Mean = 70, Median = 70, Mode = 70
Pat’s scores:  45, 45, 70, 90, 100    Mean = 70, Median = 70, Mode = 45
Sam’s scores:  50, 60, 70, 80, 90     Mean = 70, Median = 70, Mode = none
Xiao’s scores: 50, 50, 70, 90, 90     Mean = 70, Median = 70, Modes = 50, 90

•    What does the mode for each student tell you?
Relationships Among Mean,
Median and Mode
Measures of Variation

Variation is the spread of data values around the
center of the distribution in a sample. Consider
the following measures of dispersion:
Statistic: Range
    Formula:   $x_{\max} - x_{\min}$
    Excel:     =MAX(Data)-MIN(Data)
    Pro:       Easy to calculate.
    Con:       Sensitive to extreme data values.
Statistic: Variance ($s^2$)
    Formula:   $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
    Excel:     =VAR(Data)
    Pro:       Plays a key role in mathematical statistics.
    Con:       Non-intuitive meaning.
Measures of Variation

Statistic: Standard deviation (s)
    Formula:   $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
    Excel:     =STDEV(Data)
    Pro:       Most common measure. Uses same units as the raw data (\$ , £, ¥, etc.).
    Con:       Non-intuitive meaning.
Statistic: Coefficient of variation (CV)
    Formula:   $CV = 100 \times \frac{s}{\bar{x}}$
    Excel:     None
    Pro:       Measures relative variation in percent so can compare data sets.
    Con:       Requires non-negative data.
Measures of Variation

Statistic: Mean absolute deviation
    Formula:   $\frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
    Excel:     =AVEDEV(Data)
    Pro:       Easy to understand.
    Con:       Lacks “nice” theoretical properties.
The Range
Range = largest measurement - smallest measurement

Example:
Internists’ Salaries (in thousands of dollars)
127 132 138 141 144 146 152 154 165 171 177 192 241
Range = 241 - 127 = 114 (\$114,000)
The Variance
Population: X1, X2, …, XN        Sample: x1, x2, …, xn

Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}$

Sample variance: $s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}$
The Standard Deviation

Population standard deviation: $\sigma = \sqrt{\sigma^2}$

Sample standard deviation: $s = \sqrt{s^2}$
Example: Population
Variance/Standard Deviation

Population of annual returns for five junk bond
mutual funds:     10.0%, 9.4%, 9.1%, 8.3%, 7.8%

$\mu = \frac{10.0 + 9.4 + 9.1 + 8.3 + 7.8}{5} = \frac{44.6}{5} = 8.92\%$

$\sigma^2 = \frac{(10.0 - 8.92)^2 + (9.4 - 8.92)^2 + (9.1 - 8.92)^2 + (8.3 - 8.92)^2 + (7.8 - 8.92)^2}{5}$
$= \frac{1.1664 + 0.2304 + 0.0324 + 0.3844 + 1.2544}{5} = \frac{3.068}{5} = 0.6136$

$\sigma = \sqrt{\sigma^2} = \sqrt{0.6136} \approx 0.7833$
Sample Variance Example

Sample: 2, 3, 5, 6. Here n = 4 and $\bar{x} = 4$.
$x_i$   $x_i - \bar{x}$   $(x_i - \bar{x})^2$
2       2 – 4 = -2        4
3       3 – 4 = -1        1
5       5 – 4 = +1        1
6       6 – 4 = +2        4
Sum = 10
$s^2 = 10/(4-1) = 3.33$
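A possible check of this example (added for illustration, not part of the original slide): the sample standard deviation is the square root of the sample variance, $s = \sqrt{3.33} \approx 1.83$. Using the Excel functions named in the Measures of Variation tables, and assuming the four values are typed directly as arguments:
=VAR(2,3,5,6)      returns approximately 3.33 (sample variance, divides by n - 1)
=STDEV(2,3,5,6)    returns approximately 1.83 (sample standard deviation)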
Example: Sample
Variance/Standard Deviation

Sample of five car mileages
30.8, 31.7, 30.1, 31.6, 32.1

$\bar{x} = 31.26$

$s^2 = \frac{\sum_{i=1}^{5}(x_i - \bar{x})^2}{5-1}$

$s^2 = \frac{(30.8 - 31.26)^2 + (31.7 - 31.26)^2 + (30.1 - 31.26)^2 + (31.6 - 31.26)^2 + (32.1 - 31.26)^2}{4}$

$s^2 = 2.572 \div 4 = 0.643$, so $s = \sqrt{s^2} = \sqrt{0.643} \approx 0.8019$
Coefficient of Variation

•   Useful for comparing variables measured in
different units or with different means.
•   A unit-free measure of dispersion
•   Expressed as a percent of the mean:
$CV = 100 \times \frac{s}{\bar{x}}$
•   Only appropriate for nonnegative data. It is
undefined if the mean is zero or negative.
Coefficient of Variation Examples

$CV = 100 \times \frac{s}{\bar{x}}$

Defect rates (n = 37):    s = 22.89, $\bar{x}$ = 125.38 gives CV = 100 × (22.89)/(125.38) = 18%
ATM deposits (n = 100):   s = 280.80, $\bar{x}$ = 233.89 gives CV = 100 × (280.80)/(233.89) = 120%
P/E ratios (n = 68):      s = 14.08, $\bar{x}$ = 22.72 gives CV = 100 × (14.08)/(22.72) = 62%
Mean Absolute Deviation

•   The Mean Absolute Deviation (MAD) reveals the
average distance from an individual data point to
the mean (center of the distribution).
•   Uses absolute values of the deviations around the
mean:
$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$
•   Excel’s function is =AVEDEV(Array)
Central Tendency vs. Dispersion

•       Consider the histograms of hole diameters
drilled in a steel plate during manufacturing.

Machine A                Machine B
•   The desired distribution is outlined in red.
Central Tendency vs. Dispersion

Machine A                  Machine B
Desired mean (5mm) but       Acceptable variation but mean
too much variation.          is less than 5 mm.
•   Take frequent samples to monitor quality.
Central Tendency vs. Dispersion
Job Performance
•   A high mean (better rating) and low standard
deviation (more consistency) is preferred. Which
professor do you think is best?
Section 2.6 – 2.7

Interpreting Standard Deviation and Measures
of Relative Standing
Empirical Rule
For bell-shaped data sets:

   Approximately 68% of the observations fall within 1 standard
deviation of the mean

   Approximately 95% of the observations fall within 2 standard
deviations of the mean

   Approximately 100% of the observations fall within 3 standard
deviations of the mean
[Figure: bell-shaped curve with the horizontal scale in standard deviation units; $\mu = 9.12$, $\sigma = 0.15$]
Empirical Rule: Detecting
Unusual Observations
•   The P/E ratio data contains several large data
values. Are they unusual or outliers?

7   8   8 10 10    10 10 12 13 13 13 13
13 13 13 14 14      14 15 15 15 15 15 16
16 16 17 18 18      18 18 19 19 19 19 19
20 20 20 21 21      21 22 22 23 23 23 24
25 26 26 26 26      27 29 29 30 31 34 36
37 40 41 45 48      55 68 91
Empirical Rule: Detecting
Unusual Observations
•   If the sample came from a normal distribution, then
the Empirical rule states

$\bar{x} \pm 1s$ = 22.72 ± 1(14.08) = (8.6, 36.8)

$\bar{x} \pm 2s$ = 22.72 ± 2(14.08) = (-5.4, 50.9)

$\bar{x} \pm 3s$ = 22.72 ± 3(14.08) = (-19.5, 65.0)
Empirical Rule: Detecting
Unusual Observations
•   Are there any unusual values or outliers?
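[Figure: number line marked at -19.5, -5.4, 8.6, 22.72, 36.8, 50.9 and 65.0, with the data values 7, 8, ..., 48, 55, 68, 91 plotted. Values more than 2 standard deviations from the mean are labeled unusual; values more than 3 standard deviations from the mean are labeled outliers.]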
Defining a Standardized Variable or
Z-Score

•   A standardized variable (Z) redefines each
observation in terms of the number of standard
deviations from the mean.

Standardization formula for a population: $z_i = \frac{x_i - \mu}{\sigma}$

Standardization formula for a sample: $z_i = \frac{x_i - \bar{x}}{s}$
Z-Score Example

•   zi tells how far away the observation is from the mean. A
negative z value indicates the observation is below the
mean while positive z value indicates the observation is
above the mean.
•   For example, for the P/E data, the first value x1 = 7.
The associated z value is

$z_i = \frac{x_i - \bar{x}}{s} = \frac{7 - 22.72}{14.08} = -1.12$
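As a further illustration (not on the original slide), the same formula applied to the largest P/E value, x = 91, gives
$z = \frac{91 - 22.72}{14.08} = \frac{68.28}{14.08} \approx 4.85$
so this observation lies almost 5 standard deviations above the mean. In Excel, =STANDARDIZE(91, 22.72, 14.08) should return the same value; the choice of this function is an assumption, since the slides do not mention it.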
Percentiles, Deciles and Quartiles

•   Percentiles are data that have been divided into 100
groups.
•   For example, you score in the 83rd percentile on a
standardized test. That means that 83% of the
test-takers scored below you.
•   Deciles are data that have been divided into
10 groups.
•   Quintiles are data that have been divided into
5 groups.
•   Quartiles are data that have been divided into
4 groups.
Use of Percentiles and Quartiles

•   Percentiles are used to establish benchmarks for
comparison purposes (e.g., health care, manufacturing
and banking industries use 5, 25, 50, 75 and 90
percentiles).
•   Percentiles are used in employee merit evaluation and
salary benchmarking.
•   Quartiles (25, 50, and 75 percent) are commonly used to
assess financial performance and stock portfolios.
Quartiles

•   Quartiles are scale points that divide the sorted data into
four groups of approximately equal size.

Lower 25%   | Q1 |   Second 25%   | Q2 |   Third 25%   | Q3 |   Upper 25%

•   The three values that separate the four groups are
called Q1, Q2, and Q3, respectively.
Quartiles

•    The second quartile Q2 is the median, an
important indicator of central tendency.

Lower 50%   | Q2 |   Upper 50%
•   Q1 and Q3 measure dispersion since the interquartile
range Q3 – Q1 measures the degree of spread in the
middle 50 percent of data values.
Lower 25%   | Q1 |   Middle 50%   | Q3 |   Upper 25%
Calculating Quartiles

•   For small data sets, find quartiles using method of
medians:
Step 1. Sort the observations.

Step 2. Find the median Q2.
Step 3. Find the median of the data values that lie
below Q2.
Step 4. Find the median of the data values that lie
above Q2.
Calculating Quartiles

•   Use Excel function =QUARTILE(Array, k) to return the
kth quartile.
•   Excel treats quartiles as a special case of percentiles.
For example, to calculate Q3
=QUARTILE(Array, 3)
=PERCENTILE(Array, 0.75)
Excel calculates the quartile positions as:
Position of Q1         0.25n + 0.75
Position of Q2         0.50n + 0.50
Position of Q3         0.75n + 0.25
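A sketch of how these positions work out for the 68 sorted P/E ratios listed earlier (n = 68), added here for illustration:
Position of Q1 = 0.25(68) + 0.75 = 17.75   (between the 17th and 18th sorted values, both 14, so Q1 = 14)
Position of Q2 = 0.50(68) + 0.50 = 34.50   (between the 34th and 35th sorted values, both 19, so Q2 = 19)
Position of Q3 = 0.75(68) + 0.25 = 51.25   (between the 51st and 52nd sorted values, both 26, so Q3 = 26)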
Central Tendency Using Quartiles

Statistic: Midhinge
    Formula:   $\frac{Q_1 + Q_3}{2}$
    Excel:     =0.5*(QUARTILE(Data,1)+QUARTILE(Data,3))
    Pro:       Robust to presence of extreme data values.
    Con:       Less familiar to most people.
Dispersion Using Quartiles

Statistic: Midspread
    Formula:   $Q_3 - Q_1$
    Excel:     =QUARTILE(Data,3)-QUARTILE(Data,1)
    Pro:       Stable when extreme data values exist.
    Con:       Ignores magnitude of extreme data values.
Statistic: Coefficient of quartile variation (CQV)
    Formula:   $CQV = 100 \times \frac{Q_3 - Q_1}{Q_3 + Q_1}$
    Excel:     None
    Pro:       Relative variation in percent so we can compare data sets.
    Con:       Less familiar to non-statisticians.
Box Plots

•   A useful tool of exploratory data analysis (EDA).
•   Also called a box-and-whisker plot.
•   Based on a five-number summary:

Xmin, Q1, Q2, Q3, Xmax

•   Consider the five-number summary for the
68 P/E ratios:
Xmin   Q1   Q2   Q3   Xmax
7      14   19   26   91
Box Plots

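[Figure: box plot. The box spans Q1 to Q3 with the median (Q2) marked inside; the center of the box is the midhinge; whiskers extend to the minimum and maximum. The example shown is right-skewed.]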
Detecting Unusual Observations and
Potential Outliers

IQR = Q3 – Q1

   An observation is considered unusual if it falls
more than 1.5 x IQR below the first quartile or
more than 1.5 x IQR above the third quartile
   An observation is a potential outlier if it falls
more than 3 x IQR below the first quartile or more
than 3 x IQR above the third quartile
Box - Whiskers Plots
Box Plots
       Fences and Unusual Data Values
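•    Truncate the whisker at the fences and display
unusual values and outliers as dots.

[Figure: box plot with inner and outer fences; points between the inner and outer fences are labeled unusual, points beyond the outer fence are labeled outliers]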

•    Based on these fences, there are three unusual
P/E values and two outliers.
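
The fence arithmetic behind that count can be sketched as follows, using Q1 = 14 and Q3 = 26 from the five-number summary and the 1.5 x IQR and 3 x IQR rules stated earlier (a worked check added for illustration):
IQR = 26 – 14 = 12
Inner fences: 14 – 1.5(12) = -4 and 26 + 1.5(12) = 44
Outer fences: 14 – 3(12) = -22 and 26 + 3(12) = 62
Unusual (between the inner and outer fences): 45, 48, 55
Outliers (beyond the outer fences): 68, 91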
Probability Concepts

An experiment is any process of observation with
an uncertain outcome.

The possible outcomes for an experiment are called
the experimental outcomes.

Probability is a measure of the chance that an
experimental outcome will occur when an
experiment is carried out
Probability

If E is an experimental outcome, then P(E) denotes the
probability that E will occur and
$0 \le P(E) \le 1$

Conditions
If E can never occur, then P(E) = 0
If E is certain to occur, then P(E) = 1
The probabilities of all the experimental outcomes must
sum to 1.
Assigning Probabilities to
Experimental Outcomes
   Classical Method
•   For equally likely outcomes
   Relative frequency or Empirical Approach
•   In the long run
   Subjective
•   Assessment based on experience, expertise, or intuition
The Sample Space
The sample space of an experiment is the set of all
experimental outcomes.
Example: Genders of Two Children
Computing Probabilities of Events

An event is a set (or collection) of experimental
outcomes.
The probability of an event is the sum of the
probabilities of the experimental outcomes that
belong to the event.
Probabilities: Equally Likely
Outcomes

If the sample space outcomes (or experimental
outcomes) are all equally likely, then the
probability that an event will occur is equal to the
ratio
$\frac{\text{number of sample space outcomes that correspond to the event}}{\text{total number of sample space outcomes}}$
Example: Computing
Probabilities

Events
P(one boy and one girl) =
P(BG) + P(GB) = ¼ + ¼ = ½
P(at least one girl) =
P(BG) + P(GB) + P(GG) = ¼ + ¼ + ¼ = ¾

Note: Experimental Outcomes: BB, BG, GB, GG
All outcomes equally likely: P(BB) = … = P(GG) = ¼
Event Relations
The complement Ā (or A′) of an event A is the set of
all sample space outcomes not in A.
Further, P(Ā) = 1 − P(A)

Union of A and B, A ∪ B
Elementary events that belong to
either A or B (or both.)
Intersection of A and B, A ∩ B
Elementary events that belong to
both A and B.

The probability that A or B (the union of A and B) will
occur is
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

A and B are mutually exclusive if they have no sample
space outcomes in common, or equivalently if
P(A ∩ B) = 0
Conditional Probability

The probability of an event A, given that the event B has
occurred is called the “conditional probability of A
given B” and is denoted as P(A | B). Further,

P(A | B) = P(A ∩ B) / P(B)
Independence of Events

Two events A and B are said to be independent if
and only if:
P(A|B) = P(A) or, equivalently,
P(B|A) = P(B)
Multiplication Rule for Intersections

The probability that A and B (the intersection of
A and B) will occur is
P(A ∩ B) = P(A) P(B | A)
         = P(B) P(A | B)

If A and B are independent, then the probability
that A and B (the intersection of A and B) will
occur is P(A ∩ B) = P(A) P(B) = P(B) P(A)
Applications of Independence

•    To illustrate system reliability, suppose a Web site has 2
independent file servers. Each server has 99%
reliability. What is the total system reliability? Let,
F1 be the event that server 1 fails
F2 be the event that server 2 fails
•    P(F1 ∩ F2) = P(F1) P(F2) = (.01)(.01) = .0001

So, the probability that both servers are down is .0001.
•    The probability that at least one server is “up” is:
•    1 - .0001 = .9999 or 99.99%
Applications of Independence –
the Five Nines Rule
Contingency Tables
             C1     C2     Total
R1           .4     .2      .6
R2           .1     .3      .4
Total        .5     .5     1.00

Marginal probabilities: P(R1) = .6, P(C2) = .5
Joint probabilities: P(R1 ∩ C1) = .4, P(R2 ∩ C2) = .3
Contingency Tables Example:
Salary Gains & MBA Tuition

•    Are large salary gains more likely to accrue to graduates
of high-tuition MBA programs?
•    For example, find the marginal probability of a small
salary gain (P(S1)).
•    The marginal probability of a single event is found by
dividing a row or column total by the total sample size.
•    P(S1) = 17/67 = 0.2537
•    Conclude that about 25% of salary gains at the top-tier
schools were under $50,000.
Contingency Tables Example:
Salary Gains & MBA Tuition
•    Find the marginal probability of a low tuition P(T1).

P(T1) = 16/67 = 0.2388
There is a 24% chance that a top-tier school’s
MBA tuition is under $40,000.
Contingency Tables Example:
Salary Gains & MBA Tuition
•    Find the joint probability of a low tuition and large salary
gains P(T1 ∩ S3)

P(T1 ∩ S3) = 1/67 = 0.0149
• There is less than a 2% chance that a top-tier school
has both low tuition and large salary gains.
Contingency Tables Example:
Salary Gains & MBA Tuition
•    Find the conditional probability that the salary gains are
small (S1) given that the MBA tuition is high (T3).

P(S1 | T3) = 5/32 = 0.1563
• There is about a 16% chance that a top-tier school
has small salary gains given that the tuition is high.
Salary Gains & MBA Tuition -
Independence

•     To check for independent events in a contingency table,
compare the conditional to the marginal probabilities.
•     For example, if small salary gains (S1) were independent
of high tuition (T3), then P(S1 | T3) = P(S1).

Conditional                     Marginal
P(S1 | T3)= 5/32 = .1563        P(S1) = 17/67 = .2537

•        What do you conclude about events S1 and T3?
•        They are dependent (not independent), because P(S1 | T3) ≠ P(S1).
Contingency Tables: Relative
Frequencies
•    Calculate the relative frequencies below for each cell of
the cross-tabulation table to facilitate probability
calculations.

•    Symbolic notation for relative frequencies:

```
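
The contingency-table rules in the slides (marginal, joint, and conditional probabilities, plus the independence check) can be sketched in a few lines of Python. This is a minimal illustration, not part of the original deck. The joint probabilities are the R1/R2 by C1/C2 values from the slides, the two MBA fractions are the ones quoted in the slides, and the function names `marginal`, `conditional`, and `independent` are hypothetical helpers chosen for this example.

```python
from math import isclose

# Joint probabilities P(row ∩ column) taken from the R/C table in the slides.
joint = {
    ("R1", "C1"): 0.4, ("R1", "C2"): 0.2,
    ("R2", "C1"): 0.1, ("R2", "C2"): 0.3,
}

def marginal(joint, label):
    """Marginal probability: sum every cell in the given row or column."""
    return sum(p for cell, p in joint.items() if label in cell)

def conditional(joint, row, col):
    """Conditional probability P(row | col) = P(row ∩ col) / P(col)."""
    return joint[(row, col)] / marginal(joint, col)

def independent(joint, row, col):
    """Independence check: does P(row | col) equal P(row), up to rounding?"""
    return isclose(conditional(joint, row, col), marginal(joint, row))

p_r1 = marginal(joint, "R1")                     # row total for R1
p_c1 = marginal(joint, "C1")                     # column total for C1
p_r1_given_c1 = conditional(joint, "R1", "C1")   # P(R1 ∩ C1) / P(C1)

print(f"P(R1)      = {p_r1:.2f}")
print(f"P(C1)      = {p_c1:.2f}")
print(f"P(R1 | C1) = {p_r1_given_c1:.2f}")
print("R1 and C1 independent?", independent(joint, "R1", "C1"))

# Same comparison for the MBA example, using the fractions quoted in the slides:
# P(S1 | T3) = 5/32 and P(S1) = 17/67.
mba_independent = isclose(5 / 32, 17 / 67)
print("S1 and T3 independent?", mba_independent)
```

The comparison uses `isclose` rather than `==` because sums of decimal fractions pick up small floating-point rounding errors. With real survey counts, an exact match between the conditional and marginal probabilities is unlikely even for unrelated events, so a formal test of independence would be the next step.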
DOCUMENT INFO
Stats:
 views: 18 posted: 11/9/2011 language: English pages: 134