Statistics Cheat Sheet - DOC

                                                                   s.   Boxplot: Min, Q1, M, Q3, Max
Statistics Cheat Sheet
Mr. Roth , Mar 2004
1. Fundamentals                                                    t.   Variance: $s^2 = \sum (x - \bar{x})^2/(n - 1) = SS_x/(n - 1)$
a.       Population – Everybody to be analysed                     u.   p78: standard deviation, $s = \sqrt{s^2}$
          Parameter - # summarizing Pop                           v.   $SS_x = \sum (x - \bar{x})^2 = \sum x^2 - (\sum x)^2/n$
b.       Sample – Subset of Pop we collect data on
          Statistics - # summarizing Sample                   w.   Density curve – relative proportion within classes –
c.       Quantitative Variables – a number                          area under curve = 1
          Discrete – countable (# cars in family)             x.   Normal Distribution: 68, 95, 99.7 % within 1, 2, 3 std
          Continuous – Measurements – always # between             deviations.
d.       Qualitative                                           y.   p98: z-score $z = (x - \bar{x})/s$ or $z = (x - \mu)/\sigma$
          Nominal – just a name
                                                               z.   Standard Normal: N(0,1), obtained when N(μ,σ) is converted to z-scores
          Ordinal – Order matters (low, mid, high)
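
The variance, standard deviation, and z-score formulas in section 2 can be checked numerically. The following is a minimal sketch, assuming a small made-up sample; the function names and the data are illustrative and not part of the original sheet.

```python
import math

def sum_of_squares(xs):
    """SS_x via the shortcut formula: sum(x^2) - (sum x)^2 / n."""
    n = len(xs)
    return sum(x * x for x in xs) - sum(xs) ** 2 / n

def sample_variance(xs):
    """s^2 = SS_x / (n - 1)."""
    return sum_of_squares(xs) / (len(xs) - 1)

def z_scores(xs):
    """z = (x - mean) / s for every observation."""
    mean = sum(xs) / len(xs)
    s = math.sqrt(sample_variance(xs))
    return [(x - mean) / s for x in xs]

sample = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data
print("SS_x =", sum_of_squares(sample))
print("s^2  =", sample_variance(sample))
print("s    =", math.sqrt(sample_variance(sample)))
print("z    =", [round(z, 2) for z in z_scores(sample)])
```

The z-scores of any sample sum to zero, which is a quick sanity check on the arithmetic.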
                                                               3. Bivariate - Scatterplots & Correlation
Choosing a Sample
                                                               a.   Explanatory – independent variable
        Sample Frame – list of pop we choose sample from
                                                               b.   Response – dependent variable
        Biased – sampling differs from pop characteristics.
                                                               c.   Scatterplot: form, direction, strength, outliers
        Volunteer Sample – any of below three types may
                                                               d.   – form is linear negative, …
         end up as volunteer if people choose to respond.
                                                               e.   – to add categorical use different color/symbol
Sample Designs
                                                               f.   p147: Linear Correlation- direction & strength of
e.       Judgement Samp: Choose what we think represents            linear relationship
          Convenience Sample – easily accessed people         g.   Pearson's Coeff: {-1 ≤ r ≤ 1} 1 is perfectly linear +
f.       Probability Samp: Elements selected by Prob                slope, -1 is perfectly linear – slope.
          Simple random sample – every element = chance
          Systematic sample – almost random but we            h.   $r = \frac{1}{n-1} \sum \frac{(x - \bar{x})}{s_x} \frac{(y - \bar{y})}{s_y} = \frac{SS_{xy}}{\sqrt{SS_x SS_y}}$
           choose by method
g.       Census – data on everyone/everything in pop
                                                               i.   $r = \sum z_x z_y / (n - 1)$,
Stratified Sampling
Divide pop into subpop based upon characteristics              j.   $SS_{xy} = \sum xy - (\sum x)(\sum y)/n$
h.       Proportional: in proportion to total pop
i.       Stratified Random: select random within substrata     4. Regression
j.       Cluster: Selection within representative clusters     k.   least squares – sum of squares of vertical error
                                                                    minimized
Collect the Data
                                                               l.   p154: $\hat{y} = b_0 + b_1 x$, or $\hat{y} = a + bx$,
k.       Experiment: Control the environment
l.       Observation:                                          m.   (same as y = mx + b)

2. Single Variable Data - Distributions
                                                               n.   $b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{SS_{xy}}{SS_x} = r (s_y / s_x)$
m.       Graphing Categorical: Pie & bar chart
n.       Histogram (classes, count within each class)          o.   Then solve for the intercept, knowing the line passes thru the centroid
o.       – shape, center, spread. Symmetric, skewed right,          $(\bar{x}, \bar{y})$: $a = \bar{y} - b\bar{x}$
         skewed left
p.       Stemplots
                                                               p.   $b_0 = \frac{\sum y - b_1 \sum x}{n}$
             0    11222          0    112233
             1    011333         0    56677
                                                               q.   r^2 is proportion of variation described by linear
             2    etc            1
                                                                    relationship
q.       Mean: $\bar{x} = \sum x_i / n$
                                                               r.   residual = $y - \hat{y}$ = observed – predicted.
r.       Median: M: If odd – center, if even - mean of 2 center values
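
The correlation and least-squares formulas in sections 3 and 4 fit together through the sums of squares. As a rough sketch, assuming a small made-up data set (the names `ss`, `fit_line`, `xs`, and `ys` are illustrative, not from the sheet), r, b1, b0, r^2, and the residuals can be computed like this:

```python
import math

def ss(xs, ys):
    """SS_xy = sum(xy) - (sum x)(sum y)/n; ss(xs, xs) gives SS_x."""
    n = len(xs)
    return sum(x * y for x, y in zip(xs, ys)) - sum(xs) * sum(ys) / n

def fit_line(xs, ys):
    """Return (r, b0, b1) for the least-squares line y-hat = b0 + b1*x."""
    n = len(xs)
    ss_x, ss_y, ss_xy = ss(xs, xs), ss(ys, ys), ss(xs, ys)
    r = ss_xy / math.sqrt(ss_x * ss_y)   # Pearson correlation
    b1 = ss_xy / ss_x                    # slope
    b0 = (sum(ys) - b1 * sum(xs)) / n    # intercept (line passes thru centroid)
    return r, b0, b1

xs = [1, 2, 3, 4, 5]  # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]  # hypothetical responses
r, b0, b1 = fit_line(xs, ys)
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]  # observed - predicted

print("r =", round(r, 4), " r^2 =", round(r * r, 4))
print("y-hat =", round(b0, 4), "+", round(b1, 4), "x")
print("residuals:", [round(e, 4) for e in residuals])
```

The residuals of a least-squares line sum to zero (up to rounding), which is a convenient check.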
s.   Outliers: in y direction -> large residuals, in x        d.   Event: outcome of random phenomenon
     direction -> often influential to least squares line.    e.   n(S) – number of points in sample space
t.   Extrapolation – predict beyond domain studied            f.   n(A) – number of points that belong to A
u.   Lurking variable                                         g.   p 183: Empirical: P'(A) = n(A)/n = #observed/
v.   Association doesn't imply causation                           #attempted.
5. Data – Sampling                                            h.   p 185: Law of large numbers – Exp -> Theoret.
                                                              i.   p. 194: Theoretical P(A) = n(A)/n(S) ,
a.   Population: entire group
                                                                   favorable/possible
b.   Sample: part of population we examine
                                                              j.   0 ≤ P(A) ≤ 1, ∑ (all outcomes) P(A) = 1
c.   Observation: measures but does not influence
                                                              k.   p. 189: S = Sample space, n(S) - # sample points.
     response
                                                                   Represented as listing {(, ), …}, tree diagram, or grid
d.   Experiment: treatments controlled & responses
     observed                                                 l. p. 197 Complementary Events $P(A) + P(\bar{A}) = 1$
e.   Confounded variables (explanatory or lurking) when       m. p200: Mutually exclusive events: both can't happen
     effects on response variable cannot be distinguished        at the same time
f.   Sampling types: Voluntary response – biased to           n. p203. Addition Rule: P(A or B) = P(A) + P(B) – P(A
     opinionated, Convenience – easiest                          and B) [which = 0 if exclusive]
g.   Bias: systematically favors outcomes                     o. p207: Independent Events: Occurrence (or not) of A

h.   Simple Random Sample (SRS): every set of n                  does not impact P(B) & vice versa.
     individuals has equal chance of being chosen             p. Conditional Probability: P(A|B) – Probability of A

i.   Probability sample: chosen by known probability             given that B has occurred. P(B|A) – Probability of B
                                                                 given that A has occurred.
j.   Stratified random: SRS within strata divisions
                                                              q. Independent Events iff P(A|B) = P(A) and P(B|A) =
k.   Response bias – lying/behavioral influence
                                                                 P(B)
6. Experiments                                                r. Special Multiplication Rule: P(A and B) = P(A)*P(B) [if A and B independent]
a.   Subjects: individuals in experiment                      s. General mult. Rule: P(A and B) = P(A)*P(B|A) =
b.   Factors: explanatory variables in experiment                P(B)*P(A|B)
c.   Treatment: combination of specific values for each       t. Odds / Permutations
     factor                                                   u. Order important vs not (Prob of picking four
d.   Placebo: treatment to nullify confounding factors           numbers)
e.   Double-blind: treatments unknown to subjects &           v. Permutations: nPr, n!/(n – r)! , number of ways to
     individual investigators                                    pick r item(s) from n items if order is important :
f.   Control Group: control effects of lurking variables         Note: with repetitions p alike and q alike = n!/(p!q!).
g.   Completely Randomized design: subjects allocated         w. Combinations: nCr, n!/((n – r)!r!) , number of ways
     randomly among treatments                                   to pick r item(s) from n items if order is NOT
h.   Randomized comparative experiments: similar                 important
     groups – nontreatment influences operate equally         x. Replacement vs not (AAKKKQQJJJJ10) (a) Pick an

i.   Experimental design: control effects of lurking             A, replace, then pick a K. (b) Pick a K, keep it, pick
     variables, randomize assignments, use enough                another.
     subjects to reduce chance                                y. Fair odds - If odds are 1/1000 and 1000 payout. May

j.   Statistical significance: observations rare by chance       take 3000 plays to win, may win after 200.
k.   Block design: randomization within a block of            8. Probability Distribution
     individuals with similarity (men vs women)               a.   Refresh on Numb heads from tossing 3 coins. Do
7. Probability & odds                                              grid {HHH,….TTT} then #Heads vs frequency
                                                                   chart{(0,1), (1,3), (2,3), (3,1)} – Note Pascal's triangle
a.   2 definitions:
                                                              b.   Random variable – circle #Heads on graph above.
b.   1) Experimental: Observed likelihood of a given
                                                                   "Assumes unique numerical value for each outcome
     outcome within an experiment
                                                                   in sample space of probability experiment".
c.   2) Theoretical: Relative frequency/proportion of a
                                                              c.   Discrete – countable number
     given event given all possible outcomes (Sample
     Space)                                                   d.   Continuous – Infinite possible values.
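
The counting rules and the replacement example in section 7 can be worked through with exact fractions. This is a minimal sketch: the hand AAKKKQQJJJJ10 comes from the sheet, while picking 4 items from 10 and the word "AABBB" are assumed examples chosen only for illustration.

```python
import math
from fractions import Fraction

# Order important vs not: pick 4 items from 10 (assumed example sizes).
n_perm = math.perm(10, 4)  # nPr = n!/(n - r)!
n_comb = math.comb(10, 4)  # nCr = n!/((n - r)! r!)

# Arrangements with repetition: n!/(p! q!) for "AABBB" (2 alike, 3 alike).
n_arrangements = math.factorial(5) // (math.factorial(2) * math.factorial(3))

# Replacement vs not, using the hand from the sheet.
hand = ["A"] * 2 + ["K"] * 3 + ["Q"] * 2 + ["J"] * 4 + ["10"]
total = len(hand)

# (a) Pick an A, replace it, then pick a K: independent events.
p_a_then_k = Fraction(hand.count("A"), total) * Fraction(hand.count("K"), total)

# (b) Pick a K, keep it, pick another K: P(A and B) = P(A) * P(B|A).
p_two_kings = Fraction(hand.count("K"), total) * Fraction(hand.count("K") - 1, total - 1)

print(n_perm, n_comb, n_arrangements)
print(p_a_then_k, p_two_kings)
```

Case (a) uses the special multiplication rule because the first card is replaced; case (b) needs the general rule because the second draw depends on the first.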


f0acc37b-be55-453e-be4c-8c9709fb5712.doc                -2-                                       Printed 4/8/2009
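
The three-coin example in section 8 can be enumerated directly. The sketch below lists the sample space, counts heads, and applies the probability-distribution formulas for the mean and variance; variable names are illustrative assumptions.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space for tossing 3 coins: HHH ... TTT (8 points).
outcomes = list(product("HT", repeat=3))

# Random variable: number of heads in each outcome.
head_counts = Counter(outcome.count("H") for outcome in outcomes)

# Probability distribution P(x) = frequency / n(S).
dist = {x: Fraction(f, len(outcomes)) for x, f in sorted(head_counts.items())}

# Mean and variance of a discrete random variable.
coin_mean = sum(x * p for x, p in dist.items())
coin_var = sum((x - coin_mean) ** 2 * p for x, p in dist.items())

for x, p in dist.items():
    print(x, p)
print("mean =", coin_mean, " variance =", coin_var)
```

The frequencies 1, 3, 3, 1 are the row of Pascal's triangle for n = 3.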
e. Probability Distribution: Add next to coins frequency                  11. Confidence Intervals
   chart a P(x) with 1/8, 3/8, 3/8, 1/8 values                            a.   Statistical Inference: methods for inferring data
f. Probability Function: Obey two properties of prob.                          about population from a sample
   (0 ≤ P(A) ≤ 1, ∑ (all outcomes) P(A) = 1).                             b.   If $\bar{x}$ is unbiased, use it to estimate μ
g. Parameter: Unknown # describing population
                                                                          c.   Confidence Interval: Estimate+/- error margin
h. Statistic: # computed from sample data
                                                                          d.   Confidence Level C: probability interval captures
                    Sample          Population                                 true parameter value in repeated samples
   Mean             $\bar{x}$       μ - mu                                e.   Given SRS of n & normal population, C confidence
   Variance         $s^2$           $\sigma^2$                                 interval for μ is: $\bar{x} \pm z^* \sigma/\sqrt{n}$
   Standard         s               σ - sigma
   deviation                                                            f.   Sample size for desired margin of error – set +/-
                                                                               value above & solve for n.
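
The confidence interval and sample-size items in section 11 can be sketched with Python's standard library. This assumes σ is known and the population is normal, as the sheet states; the numbers (x-bar = 50, σ = 10, n = 25, margin = 2) are made up for illustration.

```python
import math
from statistics import NormalDist

def z_star(conf):
    """Critical value z* with central area `conf` under N(0,1)."""
    return NormalDist().inv_cdf(1 - (1 - conf) / 2)

def confidence_interval(xbar, sigma, n, conf=0.95):
    """x-bar +/- z* * sigma / sqrt(n)."""
    margin = z_star(conf) * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

def sample_size(sigma, margin, conf=0.95):
    """Smallest n with z* * sigma / sqrt(n) <= margin."""
    return math.ceil((z_star(conf) * sigma / margin) ** 2)

ci_low, ci_high = confidence_interval(xbar=50, sigma=10, n=25)  # hypothetical inputs
n_needed = sample_size(sigma=10, margin=2)

print("95% CI:", round(ci_low, 2), "to", round(ci_high, 2))
print("n for margin of 2:", n_needed)
```

Rounding n up rather than to the nearest integer keeps the margin of error at or below the target.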
                                              (x  x)
                                                         2

i.   Base:      x  x / n , s        2
                                                                         12. Tests of significance
                                               (n  1)                    g.   Assess evidence supporting a claim about popu.
             Frequency Dist                                  Probability Distribution
     Mean    $\bar{x} = \sum xf / \sum f$                    $\mu = \sum [x P(x)]$
     Var     $s^2 = \sum (x - \bar{x})^2 f / (\sum f - 1)$   $\sigma^2 = \sum [(x - \mu)^2 P(x)]$
     Std Dv  $s = \sqrt{s^2}$                                $\sigma = \sqrt{\sigma^2}$

j.   Probability P(x) acts as $f / \sum f$. Lose the -1
                                                                          h.   Idea – outcome that would rarely happen if claim
                                                                               were true is evidence that claim is not true
                                                                          i.   Ho – Null hypothesis: test designed to assess
                                                                               evidence against Ho. Usually statement of no effect
                                                                          j.   Ha – alternative hypothesis about population
                                                                               parameter to null
                                                                          k.   Two sided: Ho: μ = 0, Ha: μ ≠ 0
                                                                          l.   P-value: probability, assuming Ho is true, that test
                                                                               statistic would be as or more extreme (smaller
                                                                               P-value is stronger evidence against Ho)
9. Sampling Distribution                                                  m.   $z = (\bar{x} - \mu_0) / (\sigma/\sqrt{n})$
a.   By law of large #'s, as n -> population, $\bar{x} \to \mu$
                                                                          n.   Significance level α : if α = .05, then happens no
b.   Given $\bar{x}$ as mean of SRS of size n, from pop with μ                more than 5% of time. "Results were significant (P <
     and σ. Mean of sampling distribution of $\bar{x}$ is μ and               .01)"
     standard deviation is $\sigma/\sqrt{n}$                              o.   Level α 2-sided test rejects Ho: μ = μo when μo falls
                                                                               outside a level 1 – α confidence int.
c.   If individual observations have normal distribution
                                                                          a.   Complicating factors: not complete SRS from
     N(μ,σ) – then $\bar{x}$ of n has N(μ, $\sigma/\sqrt{n}$)                  population, multistage & many factor designs,
d.   Central Limit Theorem: Given SRS of n from a                              outliers, non-normal distribution, σ unknown.
     population with μ and σ. When n is large, the                        b.   Undercoverage and nonresponse often more
     sample mean $\bar{x}$ is approx normal.                                   serious than the random sampling error accounted
                                                                               for by confidence interval
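
The sampling-distribution facts in section 9 (mean of x-bar is μ, standard deviation is σ/√n) can be illustrated with a small simulation. This is only a sketch under assumed settings: a Uniform(0, 1) population (μ = 0.5, σ = √(1/12)), samples of size 30, 2000 repetitions, and a fixed seed so the run is repeatable.

```python
import math
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

sample_n = 30   # size of each SRS (assumed)
reps = 2000     # number of repeated samples (assumed)

# Population: Uniform(0, 1), so mu = 0.5 and sigma = sqrt(1/12).
pop_mu = 0.5
pop_sigma = math.sqrt(1 / 12)

# Draw many samples and record each sample mean.
sample_means = [
    statistics.fmean([random.random() for _ in range(sample_n)])
    for _ in range(reps)
]

expected_sd = pop_sigma / math.sqrt(sample_n)
print("mean of sample means:", round(statistics.fmean(sample_means), 4), "vs mu =", pop_mu)
print("sd of sample means:  ", round(statistics.stdev(sample_means), 4),
      "vs sigma/sqrt(n) =", round(expected_sd, 4))
```

Even though the population is uniform rather than normal, a histogram of `sample_means` would look roughly bell-shaped, which is what the Central Limit Theorem predicts for large n.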
10. Binomial Distribution
                                                                          c.   Type I error: reject Ho when it's true – α gives
a.   Binomial Experiment. Emphasize Bi – two possible                          probability of this error
     outcomes (success,failure). n repeated identical
                                                                          d.   Type II error: fail to reject (accept) Ho when Ha is true
     trials that have complementary P(success) +
     P(failure) = 1. binomial is count of successful trials               e.   Power is 1 – probability of Type II error
     where 0≤x≤n
b.   p : probability of success of each observation
c.   Binomial Coefficient: nCk = n!/((n – k)!k!)
d.   Binomial Prob: $P(x = k) = \binom{n}{k} p^k (1 - p)^{n-k}$
e.   Binomial μ = np
f.   Binomial $\sigma = \sqrt{np(1 - p)}$
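
The binomial formulas in section 10 can be tried on the three-coin case (n = 3, p = 0.5), which should reproduce the 1/8, 3/8, 3/8, 1/8 distribution. A minimal sketch; `binom_pmf` and the variable names are illustrative assumptions.

```python
import math

def binom_pmf(k, n, p):
    """P(x = k) = nCk * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

trials, p_success = 3, 0.5  # three fair coin tosses
binom_probs = [binom_pmf(k, trials, p_success) for k in range(trials + 1)]

binom_mu = trials * p_success                                   # mu = np
binom_sigma = math.sqrt(trials * p_success * (1 - p_success))   # sigma = sqrt(np(1-p))

print("P(x = k):", binom_probs)
print("mu =", binom_mu, " sigma =", round(binom_sigma, 4))
```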
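
The z statistic and two-sided P-value from section 12 can be sketched as follows, assuming σ is known. The inputs (x-bar = 52, μo = 50, σ = 10, n = 100) and the name `z_test` are made up for illustration.

```python
import math
from statistics import NormalDist

def z_test(xbar, mu0, sigma, n):
    """Return (z, two-sided P-value) for Ho: mu = mu0 vs Ha: mu != mu0."""
    z = (xbar - mu0) / (sigma / math.sqrt(n))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

alpha = 0.05  # significance level
z_stat, p_val = z_test(xbar=52, mu0=50, sigma=10, n=100)  # hypothetical inputs

print("z =", round(z_stat, 3), " P-value =", round(p_val, 4))
print("reject Ho" if p_val < alpha else "fail to reject Ho")
```

With these numbers the P-value falls just under .05, so the test rejects Ho at α = .05 but would not at α = .01.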