VIEWS: 4 PAGES: 66 CATEGORY: Business POSTED ON: 11/18/2010 Public Domain
Business Statistics I
MGT 515

Intro Statistics: Simple examples
• Should male drivers be charged higher auto insurance premiums than female drivers?
  – Types of cars driven
• What is the probability of mortgage default?
  – How is it conditioned on the macroeconomic environment, such as unemployment, GDP, ...?
• Do lower taxes attract businesses and create employment growth?

Population versus Sample
• Observing the entire population is too costly
  – Drawing a small sample out of the population usually reduces the cost significantly
• Population
  – All observations
    • All US drivers constitute the population of US drivers
• Sample
  – A small selection drawn from the population
    • You, the students in this class, constitute a small sample of US drivers
• Observation
  – Each driver represents an observation
  – Each observation contains information on one or more variables
    • Driver information: gender, age, type of car driven, ...
  – Variables: GENDER, AGE, TYPE OF CAR

Population versus Sample
• Parameter
  – Population characteristic
  – Population mean is a parameter
    • Average age of US drivers
• Statistic
  – Sample characteristic
  – Sample mean is a statistic
    • Average age of the drivers in our class

Types of variables
• Continuous
  – Age of the driver
  – Miles driven per year
• Discrete Variables (integer count)
  – Number of cars in the household
• Categorical (Qualitative)
  – Driver Gender
  – Type of vehicle (sedan versus SUV)
    • Treated as binary variables
  – Quantifying qualitative variables
• Ranked (ordered) data
  – Censored data
    • Income (given in brackets rather than in exact values)

US States
• Population: 50 States + DC
• Each State is an Observation
• Sample would be a selection of several states

Columns: Obs = observation number; Rate = high income tax rate; Bracket = high income tax rate bracket; Sales = sales tax rate; Emp99 and Emp08 = private employment in 1999 and 2008 (000); Growth = employment growth %; NoTax = no income tax. A blank Bracket means no value is given.

Obs  State                 Rate            Bracket   Sales   Emp99     Emp08     Growth      NoTax
1    ALABAMA               5               3,000     4       1568.7    1610.9    2.6901256   0
2    ALASKA                0                         0       204.1     239.5     17.344439   1
3    ARIZONA               4.54            150,000   5.6     1809      2182.8    20.66335    0
4    ARKANSAS              7               31,000    6       954.3     989.9     3.7304831   0
5    CALIFORNIA            9.3             44,815    7.25    11752.4   12474.8   6.1468296   0
6    COLORADO              4.63                      2.9     1804.3    1965.2    8.9175858   0
7    CONNECTICUT           5               10,000    6       1434      1447.1    0.9135286   0
8    DELAWARE              5.95            60,000    0       357.8     370.5     3.549469    0
9    FLORIDA               0                         6       5850.4    6635.7    13.423014   1
10   GEORGIA               6               7,000     4       3265      3409      4.4104135   0
11   HAWAII                8.25            48,000    4       422.2     494.2     17.053529   0
12   IDAHO                 7.8             24,736    6       433.7     529.2     22.019829   0
13   ILLINOIS              3                         6.25    5132.9    5093.3    -0.7714937  0
14   INDIANA               3.4                       6       2572.4    2518.3    -2.1030944  0
15   IOWA                  8.98            62,055    5       1229.1    1270.3    3.3520462   0
16   KANSAS                6.45            30,000    5.3     1088.8    1130.9    3.8666422   0
17   KENTUCKY              6               75,000    6       1494.3    1531.6    2.496152    0
18   LOUISIANA             6               25,000    4       1523.7    1576.7    3.478375    0
19   MAINE                 8.5             19,450    5       489.6     511.8     4.5343137   0
20   MARYLAND              5.5             500,000   6       1947.5    2111.1    8.4005135   0
21   MASSACHUSETTS         5.3                       5       2814.5    2847.9    1.1867117   0
22   MICHIGAN              4.35                      6       3917.7    3511.3    -10.373433  0
23   MINNESOTA             7.85            71,591    6.5     2225.6    2340.7    5.1716391   0
24   MISSISSIPPI           5               10,000    7       926.1     898.8     -2.9478458  0
25   MISSOURI              6               9,000     4.225   2305.5    2346.3    1.7696812   0
26   MONTANA               6.9             14,900    0       301.4     358.5     18.944924   0
27   NEBRASKA              6.84            27,001    5.5     742.5     800.8     7.8518519   0
28   NEVADA                0                         6.5     865.6     1104.7    27.622458   1
29   NEW HAMPSHIRE         0                         0       524.2     550.9     5.0934758   1
30   NEW JERSEY            8.97            500,000   7       3323.5    3407.1    2.5154205   0
31   NEW MEXICO            5.3             16,000    5       549.4     649.2     18.165271   0
32   NEW YORK              6.85            20,000    4       7013.6    7282.7    3.8368313   0
33   NORTH CAROLINA        7.75            60,000    4.25    3252.1    3422.5    5.2396913   0
34   NORTH DAKOTA          5.54            349,701   5       252.7     290.9     15.116739   0
35   OHIO                  6.24            200,000   5.5     4791.4    4571.5    -4.5894728  0
36   OKLAHOMA              5.5             8,701     4.5     1170.3    1270.1    8.5277279   0
37   OREGON                9               7,300     0       1313.6    1422      8.2521315   0
38   PENNSYLVANIA          3.07                      6       4870.5    5051.6    3.7183041   0
39   RHODE ISLAND          25% of federal            7       402.1     418.3     4.0288485   0
40   SOUTH CAROLINA        7               13,350    6       1515      1583.3    4.5082508   0
41   SOUTH DAKOTA          0                         4       300       335.4     11.8        1
42   TENNESSEE             0                         7       2295.1    2350      2.3920526   1
43   TEXAS                 0                         6.25    7625.5    8839.8    15.924202   1
44   UTAH                  5                         4.65    869       1043.2    20.04603    0
45   VERMONT               9.5             357,700   6       243.9     252.1     3.3620336   0
46   VIRGINIA              5.75            17,000    5       2801.1    3062.9    9.3463282   0
47   WASHINGTON            0                         6.5     2174.4    2414      11.019132   1
48   WEST VIRGINIA         6.5             60,000    6       585.1     614.4     5.007691    0
49   WISCONSIN             6.75            145,460   5       2385.1    2449.7    2.7084818   0
50   WYOMING               0                         4       173.6     229       31.912442   1
51   DIST. OF COLUMBIA     8.5             40,000    5.75    404.9     470.2     16.127439   0

Visually Presenting The Data
[Figure: Private Employment Growth by State, 1999 – 2008, as % of the State's starting employment values. Vertical axis: employment growth %; horizontal axis: state observation number, 1 to 51.]

States by Growth Category
[Figure: Frequency chart with large categories: growth below 0, between 0 and 10, above 10.]

Frequency Distribution

Category  Count  P.D.F. (Proportion)  C.D.F. (Cumulative Proportion)
-10..-5   1      0.019608             0.019608
-5..0     4      0.078431             0.098039
0..5      20     0.392157             0.490196
5..10     11     0.215686             0.705882
10..15    3      0.058824             0.764706
15..20    7      0.137255             0.901961
20..25    3      0.058824             0.960784
25..30    1      0.019608             0.980392
30..50    1      0.019608             1
Total     51     1

[Figure: P.D.F. (proportion) by growth category.]

P.D.F.: Point (Probability) Density Function
C.D.F.: Cumulative Distribution Function

P.D.F. and C.D.F.
[Figure: Count and cumulative proportion by growth category.]

Uniform Distribution

Raw Data             Sorted Data          P.D.F.              C.D.F.
Observation ID  X    Observation ID  X    Count  Proportion
1               4    2               3    1      0.142857    0.142857
2               3    1               4    1      0.142857    0.285714
3               5    3               5    1      0.142857    0.428571
4               7    5               6    1      0.142857    0.571429
5               6    4               7    1      0.142857    0.714286
6               9    7               8    1      0.142857    0.857143
7               8    6               9    1      0.142857    1

Range of the distribution: Maximum – Minimum = 9 – 3 = 6
Number of Observations: 7
P.D.F. = 1/7 = 1/N, as each observation has an equal weight of one
C.D.F. = (X – min + 1)/(max – min + 1) (discrete variable case)

P.D.F.
of the Normal Distribution

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Descriptive Measures
• The Mean
  – Arithmetic Mean
    • Grade Point Average
  – Weighted Mean
    • Consumer Price Index
• The Median
  – Center of the sorted (ranked) distribution
  – If the number of observations is odd, the middle ranked value is the median
  – If the number of observations is even, the median is the average of the two middle observations
• Mode
  – Most frequent occurrence

Geometric Mean

$$\bar{x} = (x_1 \times x_2 \times \dots \times x_n)^{1/n}$$

One application is in the computation of the average rate of return:

$$\bar{i} = \left[(1+i_1) \times \dots \times (1+i_n)\right]^{1/n} - 1$$

interest        1+int
0.05            1.05
0.06            1.06
0.05            1.05
0.06            1.06
0.055           1.055
Product         1.306901
Geometric Mean  1.054991

Splitting the data into QUARTILES

$$Q_i = \frac{i(n+1)}{4}$$

where i represents the ith quartile and n represents the number of RANKED observations. With n = 51 states, the quartile positions are ranks 13, 26 and 39.

Interquartile Range: Q3 – Q1: the middle 50% of the ranked observations

Ranked observations (Observation: Employment growth %)
First Quartile: 22: -10.37, 35: -4.59, 24: -2.95, 14: -2.10, 13: -0.77, 7: 0.91, 21: 1.19, 25: 1.77, 42: 2.39, 17: 2.50, 30: 2.52, 1: 2.69, 49: 2.71
Second Quartile: 15: 3.35, 45: 3.36, 18: 3.48, 8: 3.55, 38: 3.72, 4: 3.73, 32: 3.84, 16: 3.87, 39: 4.03, 10: 4.41, 40: 4.51, 19: 4.53, 48: 5.01
Third Quartile: 29: 5.09, 23: 5.17, 33: 5.24, 5: 6.15, 27: 7.85, 37: 8.25, 20: 8.40, 36: 8.53, 6: 8.92, 46: 9.35, 47: 11.02, 41: 11.80, 9: 13.42
Fourth Quartile: 34: 15.12, 43: 15.92, 51: 16.13, 11: 17.05, 2: 17.34, 31: 18.17, 26: 18.94, 44: 20.05, 3: 20.66, 12: 22.02, 28: 27.62, 50: 31.91

Shape of the Distribution
• Mean
• Median
• Spread

          Frequency                           PDF
Category  INTC  HPQ   T     VZ    CAT   SO    INTC      HPQ       T         VZ        CAT       SO
-80       1     0     0     0     0     0     0.008333  0         0         0         0         0
-75       0     0     0     0     0     0     0         0         0         0         0         0
-70       0     0     0     0     0     0     0         0         0         0         0         0
-65       0     0     0     0     0     0     0         0         0         0         0         0
-60       0     0     0     0     0     0     0         0         0         0         0         0
-55       0     0     0     0     0     0     0         0         0         0         0         0
-50       1     0     0     0     1     0     0.008333  0         0         0         0.008333  0
-45       0     1     0     0     0     0     0         0.008333  0         0         0         0
-40       0     1     0     0     1     0     0         0.008333  0         0         0.008333  0
-35       1     0     0     0     0     0     0.008333  0         0         0         0         0
-30       1     0     0     0     0     0     0.008333  0         0         0         0         0
-25       3     1     0     1     1     0     0.025     0.008333  0         0.008333  0.008333  0
-20       2     4     2     1     1     0     0.016667  0.033333  0.016667  0.008333  0.008333  0
-15       5     5     5     0     4     0     0.041667  0.041667  0.041667  0         0.033333  0
-10       7     7     8     10    6     7     0.058333  0.058333  0.066667  0.083333  0.05      0.058333
-5        15    13    16    18    13    7     0.125     0.108333  0.133333  0.15      0.108333  0.058333
0         22    24    31    30    28    31    0.183333  0.2       0.258333  0.25      0.233333  0.258333
5         25    30    29    34    24    54    0.208333  0.25      0.241667  0.283333  0.2       0.45
10        14    18    21    20    28    17    0.116667  0.15      0.175     0.166667  0.233333  0.141667
15        13    8     5     3     7     3     0.108333  0.066667  0.041667  0.025     0.058333  0.025
20        9     4     2     2     4     1     0.075     0.033333  0.016667  0.016667  0.033333  0.008333
25        0     3     1     0     1     0     0         0.025     0.008333  0         0.008333  0
30        1     1     0     1     1     0     0.008333  0.008333  0         0.008333  0.008333  0
total     120   120   120   120   120   120   1         1         1         1         1         1

Shape of the Distribution: monthly stock price change
[Figure: six histograms of monthly stock price change, one panel each for HPQ, T, INTC, SO, VZ and CAT. Horizontal axis: change category from -80 to 30; vertical axis: frequency.]
Data source: Yahoo.com. Data: monthly stock price changes from July 1999 – July 2009
[Figure: p.d.f. of monthly stock price change for INTC, HPQ, T, VZ, CAT and SO. Horizontal axis: change category from -80 to 30; vertical axis: proportion from 0 to 0.5.]

Spread
• In the case of stocks, spread is a measure of risk and should be captured in option pricing

SPREAD

Difference: $x_i - \bar{x}$

Sum of Squares: $\sum_i (x_i - \bar{x})^2$

Sample Variance:

$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

Population Variance:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample Standard Deviation:

$$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

Population Standard Deviation:

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

Descriptive statistics of the monthly changes

Statistic           INTCchange  HPQchange  Tchange    VZchange   CATchange  SOchange
Mean                -1.27715    -0.51072   -0.62194   -0.48819   0.010754   0.831933
Standard Error      1.308779    1.07527    0.744948   0.694313   0.992893   0.492301
Median              0.542984    0.592682   -0.24774   0.097096   0.91357    1.030665
Mode                #N/A        #N/A       #N/A       #N/A       #N/A       #N/A
Standard Deviation  14.33695    11.77899   8.160501   7.605816   10.8766    5.392892
Sample Variance     205.5482    138.7446   66.59378   57.84844   118.3003   29.08328
Kurtosis            7.822148    2.740998   0.612623   2.054396   6.289378   1.683112
Skewness            -1.98331    -0.91781   -0.39071   0.065181   -1.56712   -0.29751
Range               105.4057    73.20115   45.68781   54.74623   80.40594   33.39245
Minimum             -80.1469    -47.0547   -23.0226   -26.5565   -54.464    -13.9316
Maximum             25.2588     26.14648   22.66521   28.18969   25.9419    19.46083
Sum                 -153.258    -61.2867   -74.6324   -58.5828   1.290532   99.83195
Count               120         120        120        120        120        120

Coefficient of Variation

$$CV = \frac{\text{Standard Deviation}}{\text{Mean}} \times 100\% = \frac{S}{\bar{x}} \times 100\%$$

      INTC   HPQ    T      VZ     CAT    SO
CV    41.79  38.67  25.48  15.95  52.25  33.92

Skewness of the Distribution

INTCchange: Mean -1.27715, Median 0.542984, Skewness -1.98331
[Figure: histogram of INTC monthly change. Horizontal axis: change category from -80 to 30; vertical axis: frequency.]
• Negative, or left skewed
• Mean < Median (the mean is pushed to the left by "outlying" observations on the left end of the distribution)
• Skewness < 0

Tchange: Mean -0.62194, Median -0.24774, Skewness -0.39071
[Figure: histogram of T monthly change. Horizontal axis: change category from -80 to 30; vertical axis: frequency.]
• This distribution is only slightly skewed to the left
• Mean is slightly less than the median
• Skewness measure is close to zero

Chebyshev Rule

$$\left(1 - \frac{1}{k^2}\right) \times 100\%$$

The Rule: no matter what the distribution is, the Rule defines the minimum percentage of observations that will be found within k standard deviations of the mean (for k > 1).

Chebyshev Rule
INTCchange: Mean -1.27715, Standard Deviation 14.33695, Count 120
Based on the Rule, at least 75% of all observations should be found within 2 standard deviations of the mean.
Defining the 75% confidence interval:

$$\bar{x} \pm 2S = -1.27715 \pm 2 \times 14.33695$$

At least 75% of the time the stock of INTC is likely to exhibit a monthly change between -29.95% and +27.40%.
In reality, in 116 of the 120 months of the past decade (96.7%), the monthly change of INTC stock fell within two standard deviations of the mean.
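The Chebyshev bound and the two-standard-deviation interval for INTC can be checked with a few lines of Python. This is only a minimal sketch: the function names `chebyshev_min_share` and `k_sigma_interval` are illustrative and do not come from the slides, while the mean, the standard deviation and the 116-of-120 count are the INTC figures quoted above.

```python
def chebyshev_min_share(k: float) -> float:
    """Minimum share of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1.0 - 1.0 / k**2


def k_sigma_interval(mean: float, sd: float, k: float) -> tuple[float, float]:
    """Interval from mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd


# INTC monthly change (%), taken from the descriptive statistics in the slides.
intc_mean = -1.27715
intc_sd = 14.33695

min_share = chebyshev_min_share(2)
low, high = k_sigma_interval(intc_mean, intc_sd, 2)
observed_share = 116 / 120  # months that actually fell inside the interval

print(f"Chebyshev minimum share for k = 2: {min_share:.0%}")
print(f"Interval: {low:.2f}% to {high:.2f}%")
print(f"Observed share inside the interval: {observed_share:.1%}")
```

The observed share is well above the 75% floor, which is what one would expect: the Rule is a lower bound that holds for any distribution, so it is usually conservative.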
Covariance and Correlation
Stock Price Distribution

$$COV(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1}$$

Covariance matrix

      INTC     HPQ      T        VZ       CAT      SO
INTC  103.47
HPQ   45.48    134.69
T     18.517   56.478   43.2
VZ    19.14    32.899   28.279   24.377
CAT   -79.73   108.82   38.508   7.0726   396.84
SO    -54      21.263   -0.342   -8.726   133.16   63.539

$$r = \frac{cov(x, y)}{S_x S_y}$$

Correlation matrix

      INTC       HPQ        T          VZ         CAT        SO
INTC  1
HPQ   0.385253   1
T     0.276967   0.740395   1
VZ    0.381103   0.574143   0.871444   1
CAT   -0.39346   0.470676   0.294101   0.071908   1
SO    -0.66604   0.229844   -0.00653   -0.22172   0.838598   1

• SO (Southern Power) is strongly correlated with CAT (r = 0.84) and negatively correlated with INTC (r = -0.67)
• VZ and T are strongly correlated (r = 0.87)

Regression

SUMMARY OUTPUT

Regression Statistics
Multiple R         0.871444
R Square           0.759415
Adjusted R Square  0.757394
Standard Error     3.250819
Observations       121

ANOVA
            df   SS        MS        F         Significance F
Regression  1    3969.577  3969.577  375.6287  1.28E-38
Residual    119  1257.571  10.56782
Total       120  5227.148

           Coefficients  Standard Error  t Stat    P-value   Lower 95%  Upper 95%  Lower 95.0%  Upper 95.0%
Intercept  -10.1602      1.88385         -5.3933   3.57E-07  -13.8904   -6.42996   -13.8904     -6.42996
VZ         1.160089      0.059857        19.38114  1.28E-38  1.041567   1.27861    1.041567     1.27861

Covariance (revisited)

$$\sigma_{XY} = \sum_{i=1}^{N} (X_i - E[X])(Y_i - E[Y]) \, P(X_i Y_i)$$

• Measure of the relationship between the X and Y variables
  – Zero indicates no linear relationship between the variables (independent variables have zero covariance, but zero covariance alone does not prove independence)

X     Y     X - E[X]  Y - E[Y]  (X - E[X])(Y - E[Y]) P(XY)
2     4     -1.75     -0.75     0.328125
3     6     -0.75     1.25      -0.23438
6     5     2.25      0.25      0.140625
4     4     0.25      -0.75     -0.04688
E[.]  3.75  4.75                0.1875

(Each of the four observations has probability 1/4; in the last row, E[X] = 3.75, E[Y] = 4.75 and the covariance is 0.1875.)

Positive relationship: higher values of X correspond to higher values of Y

Properties of the sum of two random variables

$$E[X + Y] = E[X] + E[Y]$$

$$Var(X + Y) = \sigma^2_{X+Y} = \sigma^2_X + \sigma^2_Y + 2\sigma_{XY}$$

Expected Return
• A lottery sells 20 tickets, one of which is the winning ticket. The winning ticket pays 10 dollars. What is the expected return from purchasing one ticket?
  – Probability that the ticket wins: 1/20 = 0.05
  – Expected Return: payoff times its probability
    • From holding only one ticket: Probability(Winning) times Prize = 0.05 * $10 = $0.50

Expected Return II
• What if the lottery has three winning tickets:
  – One ticket paying 10 dollars
  – Two tickets paying 5 dollars each
• What is the expected return now?

$$E[X] = \sum_{i=1}^{N} \Pr(X_i) \, X_i$$

Expected Return of an investment portfolio

       Daily Stock Return ($)  Weighted Return
       INTC    T               INTC     T        Expected
       -0.17   -0.11           -0.119   -0.033   -0.152
       0.45    -0.02           0.315    -0.006   0.309
       0.02    0.08            0.014    0.024    0.038
       0.73    0.08            0.511    0.024    0.535
       -0.36   -0.01           -0.252   -0.003   -0.255
E[.]   0.134   0.004           0.0938   0.0012   0.095

Weights: INTC 0.7, T 0.3
Weighted Average Return: 0.095

Standard Deviation of the Portfolio Return as the Measure of Risk of the Portfolio

$$\sigma_p = \left(w_x^2 \sigma_x^2 + w_y^2 \sigma_y^2 + 2 w_x w_y \sigma_{xy}\right)^{1/2}$$

PROBABILITY
• Likelihood
• A coin has two sides: reverse, obverse
  – Number of all possible outcomes of a coin toss
    • 2
  – Probability that the obverse shows:
    • One particular outcome (obverse)
  – Probability = number of outcomes of interest / number of all possible outcomes = 1/2 = 0.5 = 50%
• Mutually Exclusive events
  – Obverse OR Reverse
    • Cannot occur simultaneously
• Collectively Exhaustive
  – A set of events where one event must occur; an event drawn from the set has 100% probability
    • Set of events: Obverse, Reverse. One of these must occur. Between these two events all possibilities are covered.

Growth Category (%)  INTC  Probability (p.d.f.)
-80                  1     0.008333
-75                  0     0
-70                  0     0
-65                  0     0
-60                  0     0
-55                  0     0
-50                  1     0.008333
-45                  0     0
-40                  0     0
-35                  1     0.008333
-30                  1     0.008333
-25                  3     0.025
-20                  2     0.016667
-15                  5     0.041667
-10                  7     0.058333
-5                   15    0.125
0                    22    0.183333
5                    25    0.208333
10                   14    0.116667
15                   13    0.108333
20                   9     0.075
25                   0     0
30                   1     0.008333
total                120

• Probability that the stock of Intel will lose value during a month: the sum of the probabilities over the loss categories
• Probability that the stock of Intel will lose at least 20% would be: 9/120 = 7.5%
• Probability that the stock of Intel will gain value during a month: the sum of the probabilities over the gain categories
• What is the probability that the stock of Intel will gain between 5 and 15%?

Conditional Probability
What is the probability that the stock of INTC will increase after falling?
Possible States of the World (120 outcomes in total: 58 declines and 62 increases)
Joint Probabilities:
1) Stock declines after an increase: 34 outcomes; 28.33%
2) Stock declines after a decline: 24 outcomes; 20%
3) Stock increases after an increase: 27 outcomes; 22.5%
4) Stock increases after a decline: 35 outcomes; 29.17%

Conditional Probability (applied to the subset on which the conditioning is being made):
Pr(A|B) = Pr(A and B)/Pr(B)

The conditioning event is a decline in the previous month, which occurred in 24 + 35 = 59 of the 120 outcomes (49.2%). The probability that the stock of Intel will increase after falling in the previous month is:
Pr(Increase|Decline) = 35/59 = 59.3%
Pr(Decline|Decline) = 24/59 = 40.7%
For comparison, the unconditional probabilities are:
Pr(Decline) = 58/120 = 48.3%
Pr(Increase) = 62/120 = 51.7%

Two Events are Independent if: Pr(A|B) = Pr(A)
Since Pr(Increase|Decline) = 59.3% differs from Pr(Increase) = 51.7%, we can argue that in the case of INTC, the behavior of the stock in a given month depends on its behavior in the prior month.
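The conditional probabilities above can be reproduced directly from the four joint counts. The sketch below is illustrative only: the dictionary layout and the function names are assumptions made for this example, and the counts are the ones listed on the slide. It conditions on the previous month's move, so the denominator is the number of outcomes whose previous month was a decline (24 + 35 = 59).

```python
# Counts of (previous month, current month) outcomes for INTC, from the slide.
pair_counts = {
    ("increase", "decline"): 34,   # declines after an increase
    ("decline", "decline"): 24,    # declines after a decline
    ("increase", "increase"): 27,  # increases after an increase
    ("decline", "increase"): 35,   # increases after a decline
}
total = sum(pair_counts.values())  # 120 outcomes


def prob_previous(previous: str) -> float:
    """Pr(previous month's move == previous)."""
    matching = sum(n for (prev, _), n in pair_counts.items() if prev == previous)
    return matching / total


def prob_current(current: str) -> float:
    """Unconditional Pr(current month's move == current)."""
    matching = sum(n for (_, cur), n in pair_counts.items() if cur == current)
    return matching / total


def prob_current_given_previous(current: str, previous: str) -> float:
    """Pr(A | B) = Pr(A and B) / Pr(B), with B the previous month's move."""
    joint = pair_counts[(previous, current)] / total
    return joint / prob_previous(previous)


p_up_given_down = prob_current_given_previous("increase", "decline")
p_up = prob_current("increase")

print(f"Pr(Increase | Decline) = {p_up_given_down:.1%}")
print(f"Pr(Increase)           = {p_up:.1%}")
# Independence would require the two numbers to be equal.
print("Looks independent:", abs(p_up_given_down - p_up) < 0.01)
```

The 0.01 tolerance in the last line is an arbitrary choice for illustration; a formal test of independence would need a statistical test rather than a fixed threshold.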
Multiplication Rule

$$\Pr(A \mid B) = \frac{\Pr(A \text{ and } B)}{\Pr(B)}$$

Solve for the joint probability:

$$\Pr(A \text{ and } B) = \Pr(A \mid B) \Pr(B)$$

Pr(Increase and Decline) = 29.17%
Pr(Decline) = 49.2%
Pr(Increase | Decline) = 59.3%

Multiplication Rule for INDEPENDENT EVENTS
Events A and B are statistically independent iff Pr(A|B) = Pr(A)
The multiplication rule simplifies: Pr(A and B) = Pr(A|B) Pr(B) = Pr(A) Pr(B)

Additional Probability Rules

$$\Pr(A) = \sum_{i=1}^{N} \Pr(A \mid B_i) \Pr(B_i)$$

where B_1, ..., B_N represent N mutually exclusive and collectively exhaustive events.

Pr(INTC increases) = Pr(increase|decline) Pr(decline) + Pr(increase|increase) Pr(increase), where the conditioning events refer to the previous month:
62/120 = (35/59)(59/120) + (27/61)(61/120)

Bayes' Theorem

Conditional probability:

$$\Pr(A \mid B) = \frac{\Pr(A \text{ and } B)}{\Pr(B)} \quad\Rightarrow\quad \Pr(A \text{ and } B) = \Pr(A \mid B) \Pr(B)$$

Similarly:

$$\Pr(B \mid A) = \frac{\Pr(A \text{ and } B)}{\Pr(A)} = \frac{\Pr(A \mid B) \Pr(B)}{\Pr(A)}$$

Note (B and C being mutually exclusive and collectively exhaustive):

$$\Pr(A) = \Pr(A \mid B) \Pr(B) + \Pr(A \mid C) \Pr(C)$$

Combining these:

$$\Pr(B \mid A) = \frac{\Pr(A \mid B) \Pr(B)}{\Pr(A \mid B) \Pr(B) + \Pr(A \mid C) \Pr(C)}$$

Bayes' Theorem
• Consider the following scenario: on average the stock of INTC posts monthly declines 40% of the time. The stock is rated by a number of analysts. In the past, in 70% of the months in which the stock posted an increase, the analysts had assigned it an accumulate rating. However, they had also assigned an accumulate rating in 30% of the months in which the stock declined. Recently, the stock of INTC again received an average rating of accumulate. What is the probability that the stock will increase given this rating?
• Pr(Increase) = 60%
• Pr(Decline) = 40%
• Pr(Accumulate|Increase) = 70%
• Pr(Accumulate|Decline) = 30%
• The question is: what is Pr(Increase|Accumulate)?

$$\Pr(Increase \mid Accumulate) = \frac{\Pr(Accumulate \mid Increase) \Pr(Increase)}{\Pr(Accumulate \mid Increase) \Pr(Increase) + \Pr(Accumulate \mid Decline) \Pr(Decline)} = \frac{0.7 \times 0.6}{0.7 \times 0.6 + 0.3 \times 0.4} = \frac{0.42}{0.54} \approx 0.78$$

Practice example
• In the past the student failed, on average, every fifth exam he took. Furthermore, 50% of the time when he failed his exams he had studied hard for them.
However, in 90% of the exams that he passed he had also studied hard for them. For the upcoming exam the student has been studying hard. What is the probability that he will pass this exam?

Solution
• Pr(F) = 0.2
• Pr(P) = 0.8
  These two are mutually exclusive and exhaustive events
• Pr(S|F) = 0.5: half of the time when the exam was failed, the student had been studying
• Pr(S|P) = 0.9: 90% of the time when passing occurred, the student had been studying
• Pr(P|S) = ?

$$\Pr(P \mid S) = \frac{\Pr(S \mid P) \Pr(P)}{\Pr(S \mid P) \Pr(P) + \Pr(S \mid F) \Pr(F)} = \frac{0.9 \times 0.8}{0.9 \times 0.8 + 0.5 \times 0.2} = \frac{0.72}{0.82} \approx 0.88$$

Discrete Random Variable
• Discrete Variable: variables generated from a counting process
  – Enrollment in class
  – Enrollment in econ 101 depending on the time of day it is offered
  – Number of accidents on Buffalo highways on a given day
  – Daily number of listings on eBay of a given item
  – These variables are numerical but NOT continuous

Characteristics of The Distribution
• Each observed value has its own probability.

Raw data                  Sorted data               Distribution
Day  Number of listings   Day  Number of listings   Number of listings  Count  Probability
1    12                   4    8                    8                   1      0.05
2    15                   12   9                    9                   1      0.05
3    10                   3    10                   10                  3      0.15
4    8                    7    10                   11                  2      0.1
5    14                   17   10                   12                  5      0.25
6    17                   10   11                   13                  2      0.1
7    10                   18   11                   14                  3      0.15
8    14                   1    12                   15                  1      0.05
9    12                   9    12                   16                  1      0.05
10   11                   11   12                   17                  1      0.05
11   12                   15   12
12   9                    19   12
13   16                   16   13
14   14                   20   13
15   12                   5    14
16   13                   8    14
17   10                   14   14
18   11                   2    15
19   12                   13   16
20   13                   6    17

Computing the average:

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N}$$

Mean (expected value):

$$E[x] = \sum_{i=1}^{N} x_i \Pr(x_i)$$

Variance of a Discrete Random Variable:

$$\sigma^2 = \sum_{i=1}^{N} (x_i - E[x])^2 \Pr(x_i)$$

Standard Deviation:

$$\sigma = \sqrt{\sigma^2}$$

Binary ("dummy") Variables
• Assume two values only, usually 0, 1
• Weather: good/bad
  – Good weather = 1 if good, = 0 otherwise
  – Passing grade/failing grade
  – Winning/losing the lottery
• Possible Combinations
Consider a simple setup: on average we have good weather 80% of the time in summer time. Given that, what is the probability that next weekend we will have two good weather days and one bad weather day?
• ASSUMPTION: weather is independent from day to day! That is, the probability of good weather on any day is 0.8, independent of the weather on the previous day
• One possibility: Friday good; Saturday good; Sunday bad. What is the probability of that?
  – Pr(Friday=1) * Pr(Sat=1) * Pr(Sun=0) = 0.8 * 0.8 * 0.2 = 0.128
  – Notation: Pr(day=1) = p, Pr(day=0) = (1-p)
    • p: probability of success
  – The above is just one particular draw.
• Another possibility: Friday bad; Saturday good; Sunday good, etc.

Using Factorials to determine the number of possible combinations
Drawing X objects from a set of n objects:

$${}_nC_X = \frac{n!}{X!\,(n-X)!}$$

$${}_3C_2 = \frac{3!}{2!\,(3-2)!} = \frac{3 \times 2 \times 1}{(2 \times 1)(1)} = 3$$

Fri  Sat  Sun  Probability
G    G    B    p * p * (1-p) = p^2 (1-p)^1
G    B    G    p * (1-p) * p = p^2 (1-p)^1
B    G    G    (1-p) * p * p = p^2 (1-p)^1

General form of each draw: p^X (1-p)^(n-X)
Pr(2G) = 0.128 * 3 = 0.384

Simple Practice
• What is the probability that we will have two bad days in a weekend?
• What is the probability that we will have AT LEAST two bad days in a weekend?
• What is the probability that two consecutive days in a weekend will be good?

Example 2
• What if there are only about 20 people in the US who are used as referees by journals that publish papers in a certain area of research? Each journal assigns 2 referees to a paper. If you consider submitting two papers, one to each of two journals, what is the probability that at least one of the referees will be the same?

Binomial Distribution

$$\Pr(X) = \frac{n!}{X!\,(n-X)!} \, p^X (1-p)^{n-X}$$

Pr(X): probability of X successes drawn from a sample of n observations, with p representing the probability of success of each observation
Assumptions:
• Probability p stays constant with each draw (effectively this assumes that the population being sampled is large)
• Outcomes are independent
• The variable is binary

$$E[X] = np$$

$$\sigma^2 = np(1-p)$$

Binomial Distribution Review Example
• Let's say that historically, in the first 10 days of September it rains on only two days. What is the probability that it will not rain during the 4-day Labor Day weekend?
  – Note how the probability simplifies to the simple multiplication of individual probabilities
• What is the probability that it will not rain on any three of the four days of the weekend?

Continuous Distributions
UNIFORM DISTRIBUTION

Unsorted X (42 observations; the following sequence of 21 values appears twice): 2, 4, 3, 1, 5, 7, 8, 6, 9, 10, 0, 12, 11, 15, 13, 14, 20, 18, 19, 17, 16
Sorted X: each integer from 0 to 20, twice (0, 0, 1, 1, 2, 2, and so on up to 20, 20)

X   Count  PDF       CDF
0   2      0.047619  0.047619
1   2      0.047619  0.095238
2   2      0.047619  0.142857
3   2      0.047619  0.190476
4   2      0.047619  0.238095
5   2      0.047619  0.285714
6   2      0.047619  0.333333
7   2      0.047619  0.380952
8   2      0.047619  0.428571
9   2      0.047619  0.47619
10  2      0.047619  0.52381
11  2      0.047619  0.571429
12  2      0.047619  0.619048
13  2      0.047619  0.666667
14  2      0.047619  0.714286
15  2      0.047619  0.761905
16  2      0.047619  0.809524
17  2      0.047619  0.857143
18  2      0.047619  0.904762
19  2      0.047619  0.952381
20  2      0.047619  1

$$\mu = \frac{min + max}{2}$$

$$\sigma^2 = \frac{(max - min)^2}{12}$$

$$\text{PDF: } f(x) = \frac{1}{max - min}, \quad x \in [min, max]$$

Normal Distribution
• Sometimes called the Gaussian Distribution
• The Bell Curve
  – Symmetrical
  – Greater mass at the center
    • PDF diminishes as you move away from the center
• Measures of central tendency are equal
  – Mean, median, mode
• Interquartile Range is approximately 4/3 standard deviations
  – 50% of all observations are contained between the mean plus or minus approximately 2/3 of a standard deviation
• Infinite Range

Why Study Normal Distribution
• Central Limit Theorem Property
• Extensive use of the Normal Distribution in Statistics/Econometrics

Standardizing the Normal Distribution

Normal Distribution, P.D.F.(X):

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad X \sim N(\mu, \sigma)$$

Standardized Normal Distribution, P.D.F.(Z):

$$Z = \frac{X - \mu}{\sigma}, \qquad f(Z) = \frac{1}{\sqrt{2\pi}} \, e^{-\frac{1}{2}Z^2}, \qquad Z \sim N(0, 1)$$

Computing Normal CDF, CDF as the probability
• In Excel use NORMDIST
  – NORMDIST(X, μ, σ, cumulative) = CDF(X)
    • TRUE if the cumulative value is computed (CDF up to the X value)
    • FALSE if the PDF is computed instead (PDF at X)
  – NORMINV(prob, μ, σ) = X
  – NORMSDIST(Z) = CDF(Z)
    • Computes the CDF of the standardized normal distribution
  – NORMSINV(probability) = Z
• For the Standardized Normal Distribution
  – Use Textbook Z tables

Example using the textbook Z-dist. tables
• Consider that X ~ N(10, 4)
  – What is the probability that we will draw at random a value of X that is less than 8?
    • Do it using the tables first
    • Do it using Excel
  – What is the probability that we will draw at random a value of X that is greater than 11?
  – What is the probability that we will draw at random a value of X between 7 and 9?

Simple examination of Data for Normality
1. Comparison of mean, mode and median
2. Kurtosis and Skewness
3. Quantile-Quantile Plot

Period_Year  Period_Month  UF_RetailAs%Total
2009         8             13.15379
2009         7             13.21556
2009         6             13.20713
2009         5             13.09604
2009         4             13.29532
2009         3             13.48186
2009         2             13.61643
2009         1             13.9316
2008         12            14.12047
2008         11            13.9948
2008         10            13.57838
2008         9             13.40963
2008         8             13.45029
2008         7             13.42442
2008         6             13.44574
2008         5             13.3493
2008         4             13.33186
2008         3             13.47661
2008         2             13.49046
2008         1             13.9232
2007         12            14.25173
2007         11            13.955

Confidence Interval

Distribution of X:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Distribution of the means
• Samples of size n
• Distribution of the sample means
• The sample mean is an unbiased estimator of the population mean

Standard Error of the mean:

$$\sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}$$

Central Limit Theorem implies:

$$\bar{X} \sim N(\mu, \sigma_{\bar{X}})$$

Confidence Interval:

$$\bar{X} - z\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z\frac{\sigma}{\sqrt{n}}$$

Increased sample size concentrates the distribution of the sample mean around the population mean.
As the sample size approaches the population size, the standard error of the mean approaches zero, because the sample becomes equal to the population.

Confidence Interval
• General Form
  – If St. Dev. is known:

$$\bar{x} - z\sigma \le \mu \le \bar{x} + z\sigma, \qquad \text{i.e. } \bar{x} \pm z\sigma$$

  – If St. Dev. is unknown:

$$\bar{x} - t_{n-1}S \le \mu \le \bar{x} + t_{n-1}S, \qquad \text{i.e. } \bar{x} \pm tS$$

• Confidence Interval for the mean
  – The Standard Error for the distribution of the means is used:

$$\bar{x} - t_{n-1}\frac{S}{\sqrt{n}} \le \mu \le \bar{x} + t_{n-1}\frac{S}{\sqrt{n}}$$

• As n increases the z and t distributions converge
• Increasing the sample size narrows the confidence interval

Sampling
• Probability versus non-probability sampling
  – Random sample
  – Stratified sample
    • Dividing data (strata)
      – Common characteristics, gender, ...
    • For instance, if Alaska represents 0.5% of the US population, a US population sample should allocate a 0.5% weight to Alaska. In this case each state can be considered a stratum, and a number of random draws proportional to its share of the population can be taken from it
  – Clustering Data
    • Clusters represent the population
      – Geographical sub samples

Sampling Problems
• Selection Bias
  – Not all of the population is represented in the frame
  – Non-response (omitted) bias in surveys
• Measurement Error
  – Function of the sample size
    • As the sample size grows, the standard error diminishes
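The effect of the sample size on the standard error and on the width of the confidence interval can be illustrated with a short Python sketch. It covers only the case where the standard deviation is known, so it uses z rather than t; the t-based interval is not shown because the Python standard library has no t distribution. The function names are illustrative, and the sample mean of 10 and σ of 4 are hypothetical inputs chosen for this example, not data from the slides.

```python
from math import sqrt
from statistics import NormalDist


def standard_error(sigma: float, n: int) -> float:
    """Standard error of the mean: sigma / sqrt(n)."""
    return sigma / sqrt(n)


def z_confidence_interval(xbar: float, sigma: float, n: int,
                          confidence: float = 0.95) -> tuple[float, float]:
    """Interval xbar -/+ z * sigma / sqrt(n) for a known sigma."""
    # Two-sided critical value z with Pr(-z <= Z <= z) = confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * standard_error(sigma, n)
    return xbar - half_width, xbar + half_width


# Hypothetical inputs (not taken from the slides).
xbar, sigma = 10.0, 4.0

for n in (25, 100, 400):
    low, high = z_confidence_interval(xbar, sigma, n)
    print(f"n = {n:3d}: SE = {standard_error(sigma, n):.3f}, "
          f"95% CI = [{low:.3f}, {high:.3f}]")
```

Because the standard error is proportional to 1/sqrt(n), quadrupling the sample size halves the width of the interval.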