# Introduction to Statistics

```					Introduction to Statistics

Data description and summary
Statistics
   Derived from the word state; it originally meant the
collection of facts of interest to the state
   The art of learning from data
   Statistics are no substitute for judgment.
   A scientific discipline used to collect, describe,
summarize, and analyze data
   Descriptive vs. inferential
   We usually expect to draw a meaningful conclusion that goes
beyond a merely descriptive figure or table of the collected
data
   An extrapolative inference, a method of induction
(from the sample to the population)
Probability
   Some assumptions about the chances
of obtaining the different data values,
used for drawing certain logical conclusions
   The totality of these assumptions is
referred to as a probability model
   A deductive approach (from the model to the data)
Statistics vs. probability

Source: http://ocw.mit.edu/OcwWeb/Sloan-School-of-Management/
Data: A set of measurements
   Character (categorical)
      Nominal, e.g., color: red, green, blue
      Binary, e.g., (M, F), (H, T), (0, 1)
      Ordinal, e.g., attitude to war: agree, neutral, disagree
   Numeric
      Discrete, e.g., number of children
      Continuous, e.g., distance, time, temperature
      Interval, e.g., Fahrenheit/Celsius temperature
      Ratio (true zero), e.g., distance, number of children
   Population: The set of all units of interest (finite
or infinite).
   E.g., all students at NCNU
   Sample: A subset/subgroup of the population
actually observed.
   E.g., students in this room.
   Variable: A property or attribute of each unit,
   e.g., age, height (a column field within a table)
   Observation: Values of all variables for an
individual unit (a row record in the table)
Matrix form of raw data
[Figure: data matrix for the sample; one column per variable, one row per observation]
Properties of measurements
   Parameter:
   Numerical characteristic of population, defined for each
variable, e.g. proportion opposed to war
   Statistic:
   Numerical function of sample used to estimate population
parameter.
   Precision:
   Spread of estimator of a parameter
   Accuracy:
   How close estimator is to true value
   Bias:
   Systematic deviation of estimate from true value
Accuracy vs. Precision

Source: http://ocw.mit.edu/OcwWeb/Sloan-School-of-Management/
Is it a good sample?
   Is it a representative sample from the
population of interest?
   Pre-existing bias?
   Unavoidable errors?
Describing data sets
   Frequency tables and graphs
   Scatter plot, bar/pie chart (for visual appeal)
   Relative frequency tables and graphs
   Grouped data with
   histograms,
   Ogive (cumulative frequency), e.g., the Lorenz
curve for national wealth distribution
   Stem-and-leaf plot
   Always plot your data appropriately - try several
ways!
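Scatter plot
[Figure: scatter plot; y-axis: variable Y or observation, x-axis: variable X or observation number]
Line graph (chart)
Bar chart
Relative frequency
[Figure: relative frequency of a class = class count / n, e.g., 42/200 with n = 200]
Pie chart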
Histogram
   Class intervals: a trade-off between too few
and too many classes
   Class boundaries: left-end inclusion
convention
   E.g., the interval 20-30 contains all values that
are both greater than or equal to 20 and less than
30
   cf. the right-end inclusion convention (MS Excel)
   Pareto histogram: a bar chart with categories
arranged from the highest to the lowest frequency
The life hours of lamps
Interpretation of histogram
   Area under the histogram represents sample
proportion
   If too many intervals, too jagged (polygon graph)
   If too few, too smooth
   Detecting the data distribution (chart)
   Symmetric or skewed
   Uni-modal or bi-modal
   Only used for categorizing the numerical data
Ogive (cumulative relative frequency graph)
Stem-and-leaf plot
The case of city minimum temperatures
[Figure: stem-and-leaf plot; stem: the tens digit, leaf: the ones digit]
   The length of a leaf row gives the frequency of its stem (interval)
   Sort the data from the smallest to the largest before
the stem-and-leaf assignment
Run chart
   For time series data, it is often useful to plot the
data in time sequence.
[Figure: run chart of electric cost by month; x-axis: month (points numbered 1 to 12), y-axis: electric cost (0 to 50000)]
Summarizing data sets
   Measures of location & central tendency
   Sample mean, sample median, sample mode
   Measures of dispersion
   Sample variance, sample standard deviation
   Sample percentile (quartiles, quantiles)
   Box (and whiskers) plots, QQ plots
Mean
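   Simple average: \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
   Weighted average: \bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}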
Median
The middle value when the data are arranged in increasing
(or decreasing) order; with an even number of values, the
average of the two middle values.
Mode
   The value that occurs most frequently
   If no single value occurs most
frequently, all the values that occur at
the highest frequency are called modal
values.
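Skewness
[Figure: two panels showing a transformation applied to skewed data; the first panel label survives only as "transformation", the second reads "squared transformation"]
Exercise: try it and justify it yourselves
A case of a bimodal histogram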
Mean or median?
   Appropriate summary of the center of the data?
   Mean: if the data have a symmetric distribution with
light tails (i.e., a relatively small proportion of the
observations lie far from the center of the data).
   Median: if the distribution has heavy tails or is
asymmetric.
   Extreme values that are far removed from the
main body of the data are called outliers.
   Outliers have a large influence on the mean but not on the median.
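Sample variance
   s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
   \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2
(Check it!)
Linear computation of sample
variance
   s_y^2 = b^2 s_x^2   if   y_i = a + b x_i,  i = 1, ..., n
Sample standard deviation
   s = \sqrt{s^2}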
Percentiles, Quartiles
   The sample 100p percentile (p quantile) is the
data value such that at least 100p percent of the data
are less than or equal to it and at least 100(1-p)
percent are greater than or equal to it.
   The sample 25th percentile is called the first
quartile, Q1; the sample 50th percentile is called
the sample median or the second quartile, Q2;
the sample 75th percentile is called the third
quartile, Q3.
Finding the sample percentiles
   To determine the sample 100p percentile of a
data set of size n, Xp, we need to determine the
data value such that
(1) At least np of the values are less than or equal to
it.
(2) At least n(1-p) of the values are greater than or
equal to it.
   If np is NOT an integer, round up to the next integer
and take the corresponding ordered value as Xp.
   If np is an integer K, average the Kth and (K+1)st
ordered values. This average is then Xp.
   E.g., n = 10 and p = 0.25: np = 2.5, so Xp is the 3rd
ordered value.
Five number summary
   The minimum,
   The maximum,
   and three quartiles, Q1, Q2, Q3
Box (and Whiskers) plots
     A “box” starts at Q1 and continues to
Q3; the length of the box is the
interquartile range (the middle 50% of the distribution).
     The value of Q2 is indicated by a vertical line inside the box.
     A straight line segment (i.e., the whiskers) stretching
from the smallest to the largest data value
(i.e., the range) is drawn on a horizontal axis.
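Case 1.
[Figure: horizontal box plot with Min., Q1, Q2, Q3, and Max. marked]
Lower fence and upper fence
Case 2.
[Figure: vertical box plot with Min., Q1, Median, and Max. marked; points beyond a fence (*) are possible outliers]
   Upper whisker extends to the highest value within the
upper fence = Q3 + 1.5 (Q3 - Q1)
   Lower whisker extends to the lowest value within the lower
fence = Q1 - 1.5 (Q3 - Q1)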
Normal sample distribution

   For normal data and large samples, approximately
   50% of the data values fall between mean ± 0.67s
   68% of the data values fall between mean ± 1s
   95% of the data values fall between mean ± 2s
   99.7% of the data values fall between mean ± 3s
QQ (normal) plots
   Sequentially compare the sample data to the quantiles of a
theoretical (normal) distribution
   The ith ordered data value is the pth quantile, p=(i-0.5)/n
[Figure: QQ plot; axes: raw data vs. quantiles of the standard normal]
Paired data sets (X, Y) and
the sample correlation coefficient, r

   r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1) s_x s_y}
Illustrations of correlation
r vs. Linear relation
   If the paired data sets x and y
possess a linear relation, y=a+bx, with
b>0, then r=1.
   If the paired data sets x and y
possess a linear relation, y=a+bx, with
b<0, then r=-1.
   r is just an indicator of how close to perfectly
linear the relation between x and y is
Properties of r
   |r| ≤ 1 (why? See 2.6.1)
   If r is positive, x and y tend to change in the same
direction.
   If r is negative, x and y tend to change in
opposite directions.
   Correlation measures association, not causation
   Causation still needs other necessary conditions:
time sequence, exclusion of other explanations
   E.g., wealth and health problems both go up with age. Does
wealth cause health problems?
Chebyshev’s inequality

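   Let \bar{x} and s be the sample mean and sample standard
deviation of the data x_1, ..., x_n
   Set S_k = \{ i : |x_i - \bar{x}| < ks \}, and let N(S_k) be the
number of elements in S_k. Then, for any k \ge 1,
   \frac{N(S_k)}{n} \ge 1 - \frac{n-1}{nk^2} > 1 - \frac{1}{k^2}
(The lower bound)
Proof
   (n-1)s^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2 \ge \sum_{i \notin S_k} (x_i - \bar{x})^2 \ge k^2 s^2 (n - N(S_k))
Dividing both
sides by nk^2 s^2:
   \frac{n-1}{nk^2} \ge 1 - \frac{N(S_k)}{n}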
The next step?
And the upper bound
of N(k)/n
Categorizing the bi-variate data
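   Lurking variables excluded from
consideration can change or reverse a
relation between two categorical variables

                                   Male     Female
   First table    Rej.              30        10
   Second table   Rej.              45        40
                  (ratio)          35/80     20/60
   Third table    Rej.              15        30
                  (ratio)           5/20     10/40

   The rejected counts of the first and third tables add up to
those of the second (30 + 15 = 45, 10 + 30 = 40); the other row
labels and the ratios of the first table were not recovered.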
Homework #1
   Chapter 1: Problem 2, 6
   Chapter 2: Problem 15 (You should use
Excel or the software included with the book to compute the
data.)
Graphical Excellence
   “Complex ideas communicated with clarity,
precision, and efficiency”
   Shows the data
   Makes you think about substance rather than
method, graphic design, or something else
   Many numbers in a small space
   Makes large data sets coherent
   Encourages the eye to compare different
pieces of the data
ACCENT Principles for
effective graphical display
   Apprehension:
   Ability to correctly perceive relations among variables.
   Does the graph maximize apprehension of the relations among
variables?
   Clarity:
   Ability to visually distinguish all the elements of a graph.
   Are the most important elements or relations visually most
prominent?
   Consistency:
   Ability to interpret a graph based on similarity to previous
graphs.
   Are the elements, symbol shapes and colors consistent with their
use in previous graphs?
ACCENT Principles for effective
graphical display (Cont.)
   Efficiency:
   Ability to portray a possibly complex relation in as simple a way as possible
   Are the elements of the graph economically used?
   Is the graph easy to interpret?
   Necessity:
   The need for the graph, and the graphical elements.
   Is the graph a more useful way to represent the data than alternatives (table,
text)?
   Are all the graph elements necessary to convey the relations?
   Truthfulness:
   Ability to determine the true value represented by any graphical element
by its magnitude relative to the implicit or explicit scale.
   Are the graph elements accurately positioned and scaled?
Source: http://www.math.yorku.ca/SCS/Gallery/, Adapted from: D. A. Burn (1993), "Designing
Effective Statistical Graphs", in C. R. Rao, ed., Handbook of Statistics, vol. 9, Chapter 22.
Lies on graphical display (1)
Lies on graphical display (2)
Lies on graphical display (3)
Lies on graphical display (4)
   Changes in the scale of the graphic
should always correspond to changes in
the data being represented
   Avoid confusing dimensions
   Beware of misreadings caused by
goosed-up (exaggerated) graphics
   Don’t quote data out of context
```
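
The rule on the "Finding the sample percentiles" slide can be sketched in Python as follows. This is a minimal illustration, not the course's own code: the function name `sample_percentile`, the small tolerance used to decide whether np is an integer, and the example data are all assumptions made here.

```python
import math

def sample_percentile(data, p):
    """Sample 100p percentile by the slide's rule (hypothetical helper)."""
    if not 0 < p < 1:
        raise ValueError("p must be strictly between 0 and 1")
    xs = sorted(data)                 # ordered values x(1) <= ... <= x(n)
    n = len(xs)
    target = n * p
    k = round(target)
    # np is an integer K: average the K-th and (K+1)-st ordered values.
    # The tolerance guards against floating-point error in n * p.
    if abs(target - k) < 1e-9 and 1 <= k < n:
        return (xs[k - 1] + xs[k]) / 2
    # np is not an integer: round up and take that ordered value.
    k = math.ceil(target)
    return xs[k - 1]

data = [3, 7, 8, 5, 12, 14, 21, 13, 18, 1]   # made-up example, n = 10
q1 = sample_percentile(data, 0.25)   # np = 2.5 -> 3rd ordered value
q2 = sample_percentile(data, 0.50)   # np = 5   -> average of 5th and 6th
q3 = sample_percentile(data, 0.75)   # np = 7.5 -> 8th ordered value
print(q1, q2, q3)
```

Spreadsheet and library quantile functions often use other interpolation conventions, so their quartiles may differ slightly from this rule.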
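
The fence formulas on the "Lower fence and upper fence" slide, Q1 - 1.5 (Q3 - Q1) and Q3 + 1.5 (Q3 - Q1), can be tried out with a short sketch. The helper names and the data are hypothetical; Q1 and Q3 are passed in directly so the example does not depend on any particular quartile convention.

```python
def box_plot_fences(q1, q3):
    """Return (lower fence, upper fence) from the first and third quartiles."""
    iqr = q3 - q1                      # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def whiskers_and_outliers(data, q1, q3):
    """Whisker ends are the extreme values inside the fences;
    values outside the fences are flagged as possible outliers."""
    lower, upper = box_plot_fences(q1, q3)
    inside = [x for x in data if lower <= x <= upper]
    outliers = [x for x in data if x < lower or x > upper]
    return (min(inside), max(inside)), outliers

data = [1, 3, 5, 7, 8, 12, 13, 14, 18, 60]   # made-up example
fences = box_plot_fences(q1=5, q3=14)
whiskers, outliers = whiskers_and_outliers(data, q1=5, q3=14)
print(fences, whiskers, outliers)
```

Here the value 60 lies above the upper fence, so the upper whisker stops at 18 and 60 is drawn as a possible outlier.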
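
The formulas on the sample variance slides did not survive extraction. Assuming the usual definition with divisor n - 1, the shortcut identity for the sum of squares, and the linear-transformation rule (if y = a + bx, the variance of y is b squared times the variance of x), the following sketch checks them numerically. Function names and data are illustrative only.

```python
import math

def sample_variance(xs):
    """Definition: sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)

def sample_variance_shortcut(xs):
    """Shortcut form: (sum of x_i^2 - n * mean^2) / (n - 1)."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x * x for x in xs) - n * mean ** 2) / (n - 1)

xs = [2.0, 4.0, 4.0, 5.0, 7.0, 8.0]      # made-up example
a, b = 10.0, 3.0
ys = [a + b * x for x in xs]             # linear transformation y = a + b x

s2_x = sample_variance(xs)
s2_y = sample_variance(ys)
s_x = math.sqrt(s2_x)                    # sample standard deviation
print(s2_x, sample_variance_shortcut(xs), s2_y, s_x)
```

The shift a drops out and only the scale b matters, which is why recoding data (for example, subtracting a constant before squaring) can simplify hand computation.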
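
The claims on the "r vs. Linear relation" slide (r = 1 when y = a + bx with b > 0, r = -1 when b < 0) can be checked with a small sketch. It assumes the standard definition of the sample correlation coefficient; the function name and the data are made up for illustration.

```python
import math

def sample_correlation(xs, ys):
    """Sample correlation coefficient r of paired data (x_i, y_i)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # The (n - 1) factors in the numerator and denominator cancel.
    return sxy / math.sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
r_up = sample_correlation(xs, [2 + 3 * x for x in xs])     # b > 0
r_down = sample_correlation(xs, [2 - 3 * x for x in xs])   # b < 0
r_noisy = sample_correlation(xs, [1, 3, 2, 5, 4])          # not exactly linear
print(r_up, r_down, r_noisy)
```

The third data set rises overall but not along a straight line, so its r is positive and strictly less than 1.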
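
The Chebyshev slide lost its formulas. Assuming it states the data-set form of the inequality (for k >= 1, the fraction of values with |x_i - mean| < k s is at least 1 - (n - 1)/(n k^2)), a quick numerical check looks like this. The function name and the data are hypothetical.

```python
import math

def chebyshev_check(xs, k):
    """Return (observed fraction within k sample std devs of the mean,
    assumed Chebyshev lower bound 1 - (n - 1) / (n k^2))."""
    n = len(xs)
    mean = sum(xs) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))
    within = sum(1 for x in xs if abs(x - mean) < k * s)
    return within / n, 1 - (n - 1) / (n * k * k)

data = [1, 3, 5, 7, 8, 12, 13, 14, 18, 60]   # made-up, one extreme value
results = {k: chebyshev_check(data, k) for k in (1, 1.5, 2, 3)}
for k, (fraction, bound) in results.items():
    print(k, fraction, round(bound, 4), fraction >= bound)
```

The bound holds for any data set, which is why it is usually far from tight; for roughly normal data the observed fractions are much closer to the 68/95/99.7 figures.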
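
The QQ plot slide pairs the ith ordered data value with the p = (i - 0.5)/n quantile of the standard normal. A minimal sketch of that pairing, using `statistics.NormalDist` from the Python standard library, is shown below; the function name `qq_points` and the data are assumptions, and plotting is left out.

```python
from statistics import NormalDist

def qq_points(data):
    """Pair each ordered data value with the standard normal quantile
    at p = (i - 0.5) / n, for i = 1, ..., n."""
    xs = sorted(data)
    n = len(xs)
    std_normal = NormalDist()          # mean 0, standard deviation 1
    return [(std_normal.inv_cdf((i - 0.5) / n), x)
            for i, x in enumerate(xs, start=1)]

data = [4.1, 5.0, 6.2, 4.8, 5.5]       # made-up example
points = qq_points(data)
for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

If the data are roughly normal, plotting the observed values against the theoretical quantiles gives points close to a straight line.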