STAT312: Applied Regression Methods
http://www.mysmu.edu/faculty/zlyang/
STAT312, Term II, 10/11, Zhenlin Yang, SMU

Course Contents
1 Linear Regression
2 Transformed Linear Regression
3 Logistic Regression
4 Multinomial Logistic Regression
5 Loglinear Model

Chapter 1: Introduction

Contents
• History of regression analysis
• Basic concepts
• Applications - Examples
• Sampling models
• Sampling distributions
• Statistical inference
• Large sample inference
• Matrix algebra

History of Regression Analysis

The earliest form of regression was the method of least squares, which was published by Legendre in 1805, and by Gauss in 1809. The term “least squares” is from Legendre’s term, moindres carrés. However, Gauss claimed that he had known the method since 1795.

Legendre and Gauss both applied the method to the problem of determining, from astronomical observations, the orbits of bodies about the Sun. Euler had worked on the same problem (1748) without success. Gauss published a further development of the theory of least squares in 1821, including a version of the Gauss–Markov theorem.

The term "regression" was coined in the nineteenth century to describe a biological phenomenon, namely that the progeny of exceptional individuals tend on average to be less exceptional than their parents and more like their more distant ancestors. Francis Galton, a cousin of Charles Darwin, studied this phenomenon and applied the slightly misleading term "regression towards mediocrity" to it.

For Galton, regression had only this biological meaning, but his work was later extended by Udny Yule and Karl Pearson to a more general statistical context. Nowadays the term "regression" is often synonymous with "least squares curve fitting".

Some Basic Concepts

• A variable represents some characteristic of a social phenomenon.
• A continuous variable is a variable that assumes values in an interval.
• A discrete variable is a variable that assumes distinct values.
• A random variable is a variable that is used to represent possible outcomes of some random phenomenon, e.g., results of tossing a coin, results of rolling a die, etc.
• Response/dependent variable: variable of interest in the study.
• Explanatory/independent variable: variable used to “explain” or “study” the variable of interest.
• Regression Analysis: a method for investigating functional relationships among variables.
• Categorical variables having ordered scales are called ordinal variables.
• Categorical variables having unordered scales are called nominal variables.
• Categorical variables are often referred to as qualitative variables, and numerical-valued variables are called quantitative variables.

Applications - Examples

Example 1.1. Where to locate a new motor inn? La Quinta Motor Inns is planning an expansion. Management wishes to predict which sites are likely to be profitable. Several areas where predictors of profitability can be identified are:
• Competition
• Market awareness
• Demand generators
• Demographics
• Physical quality

Predictors of Profitability (profitability is measured by Operating Margin):

Area               Variable             Description
Competition        Rooms                Number of hotel/motel rooms within 3 miles from the site
Market awareness   Nearest              Distance to the nearest La Quinta inn
Customers          Office space
Customers          College enrollment
Community          Income               Median household income
Physical           Disttwn              Distance to downtown
Data: EX1-01.XLS, EX1-01.TXT

Example 1.2. The table below cross-classifies 1091 respondents to the 1991 General Social Survey by their gender and their belief in an afterlife.

Table 1.1 Cross Classification of Belief in Afterlife by Gender

              Belief in Afterlife
Gender        Yes     No or undecided
Females       435     147
Males         375     134

Purpose of the Study: to determine whether an association exists between gender and belief in an afterlife. Is one sex more likely than the other to believe in an afterlife, or is belief in an afterlife independent of gender?

This is a typical 2×2 contingency table, where an analysis of association between two factors (variables) is of interest.

Example 1.3. The data given below come from a randomized, double-blind clinical trial investigating a new treatment for rheumatoid arthritis. Investigators compared the new treatment with a placebo. The response measured was whether there was no, some, or marked improvement in the symptoms of rheumatoid arthritis.

Table 1.2 Rheumatoid Arthritis Data

                          Improvement
Gender    Treatment     None   Some   Marked   Total
Female    Test Drugs       6      5       16      27
Female    Placebo         19      7        6      32
          Total           25     12       22      59
Male      Test Drugs       7      2        5      14
Male      Placebo         10      0        1      11
          Total           17      2        6      25

This is a 2×2×3 contingency table, where interest lies in the association between treatment and degree of improvement, adjusting for gender effect.

Example 1.4. (Education Expenditure) Per capita expenditure on public education can be affected by (1) per capita personal income, (2) number of residents per thousand under 18 years of age, (3) number of people per thousand residing in urban areas, and (4) the geographical region. The data have been collected for each of the 50 states in the U.S., in 1960, 1970, and 1975.
The problems of interest can be:
• How is education expenditure related to the factors listed above?
• Does the geographical region make a difference to education expenditure?
• Is the relationship between education expenditure and the other variables constant over time?

Example 1.5. (Probability of Bankruptcies) Detecting ailing financial and business establishments is an important function of audit and control. Systematic failure of audit and control can lead to grave consequences, such as the savings-and-loan fiasco of the 1980s in the United States, and the current financial crises. The data P322.txt give some of the operating financial ratios of 33 firms that went bankrupt after 2 years and 33 that remained solvent during the same period:

X1 = Retained Earnings/Total Assets
X2 = Earnings Before Interest and Taxes/Total Assets
X3 = Sales/Total Assets
Y = 0 if bankrupt after two years, and 1 if solvent after two years.

Question: given a firm’s characteristics, what is the chance that this firm remains solvent after two years?

Example 1.6. (Chemical Diabetes) To determine the treatment and management of diabetes it is necessary to determine whether the patient has chemical diabetes or overt diabetes. The data P331.txt are from a study to determine the nature of chemical diabetes. The measurements were taken on 145 non-obese volunteers who were subject to the same regimen. Many variables were measured, but only three are considered: insulin response (IR), the steady state plasma glucose (SSPG), which measures insulin resistance, and relative weight (RW). The diabetes status of each subject was recorded. The clinical classification (CC) categories were overt diabetes (1), chemical diabetes (2), and normal (3).
Question: given a subject’s characteristics, what is the chance that he/she has overt diabetes, or has chemical diabetes, or is normal?

Example 1.7. (Political Ideology and Party Affiliation) The data given below, from a General Social Survey, relate political ideology to political party affiliation.

Table 1.3 Political Ideology and Party Affiliation Data

                                Political Ideology
          Political     Very      Slightly              Slightly       Very
Gender    Party         Liberal   Liberal    Moderate   Conservative   Conservative
Female    Democratic      44        47         118         23             32
Female    Republican      18        28          86         39             48
Male      Democratic      36        34          53         18             23
Male      Republican      12        18          62         45             51

Question: Who is more conservative: Democrats or Republicans, males or females?

Sampling Models

Normal Distribution: A random variable X is said to have a normal distribution, denoted by $X \sim N(\mu, \sigma^2)$, if its probability density function (pdf) is of the form

\[ f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(x-\mu)^2}{2\sigma^2} \right\}. \]

This is a continuous distribution with mean $\mu$ and variance $\sigma^2$.

[Figure: normal density curves for several parameter values.]

Binomial Distribution: A random variable X is said to have a binomial distribution, denoted by $X \sim \mathrm{Bin}(n, \pi)$, if its probability function is

\[ p(x) = \frac{n!}{x!\,(n-x)!}\, \pi^x (1-\pi)^{n-x}, \quad x = 0, 1, 2, \ldots, n, \]

a discrete distribution with mean $n\pi$ and variance $n\pi(1-\pi)$.

[Figure: binomial probability functions p(x) against x for n = 20 with success probability 0.5 and 0.3.]

Poisson Distribution: A random variable X is said to have a Poisson distribution, denoted by $X \sim \mathrm{Poi}(\lambda)$, if its probability function is

\[ p(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots, \]

a discrete distribution with mean $\lambda$, called the Poisson rate, and variance $\lambda$.
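The mean and variance statements for the sampling models above can be checked numerically. The following is a minimal sketch in Python using only the standard library; Python is used here purely for illustration, and the function names and the parameter values (n = 20, success probability 0.3, Poisson rate 6) are assumptions chosen for the example, not part of the course material.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def binomial_pmf(x, n, pi):
    """P(X = x) for X ~ Bin(n, pi)."""
    return math.comb(n, x) * pi ** x * (1 - pi) ** (n - x)


def poisson_pmf(x, lam):
    """P(X = x) for X ~ Poi(lam)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)


def mean_and_variance(pmf, support):
    """Mean and variance of a discrete distribution summed over `support`."""
    probs = [(x, pmf(x)) for x in support]
    mean = sum(x * p for x, p in probs)
    var = sum((x - mean) ** 2 * p for x, p in probs)
    return mean, var


# Illustrative (assumed) parameter values.
bin_mean, bin_var = mean_and_variance(lambda x: binomial_pmf(x, 20, 0.3), range(21))

# The Poisson support is infinite; truncating at 99 is an approximation
# whose neglected tail is negligible for a rate of 6.
poi_mean, poi_var = mean_and_variance(lambda x: poisson_pmf(x, 6.0), range(100))

print(bin_mean, bin_var)   # compare with n*pi and n*pi*(1 - pi)
print(poi_mean, poi_var)   # both should be close to the rate
```

The binomial results can be compared with $n\pi = 6$ and $n\pi(1-\pi) = 4.2$, and both Poisson results with the rate 6.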
[Figure: Poisson probability functions p(x) against x for mean = 10 and mean = 6.]

Chi-squared Distribution: A random variable X is said to have a Chi-squared distribution with $\nu$ degrees of freedom if its pdf is of the form

\[ f(x) = \frac{x^{\nu/2 - 1} \exp(-x/2)}{2^{\nu/2}\, \Gamma(\nu/2)}, \quad x > 0, \]

a continuous distribution with mean $\nu$ and variance $2\nu$, denoted by $\chi^2_\nu$. It can be represented as

\[ \chi^2_\nu = Z_1^2 + Z_2^2 + \cdots + Z_\nu^2, \]

where $Z_1, Z_2, \ldots, Z_\nu$ are iid standard normal random variables. It is a distribution used in statistical inference.

[Figure: Chi-squared density curves for 3, 6, 12 and 20 degrees of freedom.]

Student’s t distribution: A continuous r.v. T is said to have a Student’s t distribution with v df, denoted by $T \sim t_v$, if its pdf has the form

\[ f(t) = \frac{\Gamma[(v+1)/2]}{\sqrt{v\pi}\; \Gamma(v/2)} \left( 1 + \frac{t^2}{v} \right)^{-(v+1)/2}, \quad -\infty < t < \infty, \; v > 0. \]

• The t distribution is symmetric around zero.
• If $Z \sim N(0, 1)$ and $Y \sim \chi^2(v)$, and if Z and Y are independent, then $T = Z / \sqrt{Y/v} \sim t_v$.
• It approaches N(0, 1) as $v \to \infty$.

F Distribution: If $X_1 \sim \chi^2(v_1)$ and $X_2 \sim \chi^2(v_2)$ are independent, then the r.v.

\[ Y = \frac{X_1 / v_1}{X_2 / v_2} \]

follows an F distribution, with $v_1$ df in the numerator and $v_2$ df in the denominator.

• This distributional result is often used to construct F tests in linear regression models.

Sampling Distributions

The Central Limit Theorem

If a random sample X1, …, Xn is drawn from any population, the sampling distribution of the sample mean $\bar{x}$ is approximately normal for a sufficiently large sample size. The larger the sample size, the more closely the sampling distribution of $\bar{x}$ will resemble a normal distribution.

In more detail: Let X represent a population, and X1, …, Xn be a random sample drawn from this population.
Then,
1. $\mu_{\bar{X}} = \mu_X$
2. $\sigma^2_{\bar{X}} = \sigma^2_X / n$
3. If X is normal, $\bar{X}$ is normal. If X is nonnormal, $\bar{X}$ is approximately normally distributed for a sufficiently large sample size ($n \ge 30$).

Sampling Distribution of a Sample Proportion

Let $\hat{p}$ be the proportion of “successes” in a sequence of n Bernoulli trials. From the laws of expected value and variance, it can be shown that $E(\hat{p}) = p$ and $\mathrm{Var}(\hat{p}) = p(1-p)/n$. If both $np \ge 5$ and $n(1-p) \ge 5$, then

\[ Z = \frac{\hat{p} - p}{\sqrt{p(1-p)/n}} \]

is approximately standard normally distributed.

Sampling Distribution of the Difference Between Two Sample Means

The distribution of $\bar{x}_1 - \bar{x}_2$ is normal if
• the two samples are independent, and
• the parent populations are normally distributed.

If the two populations are not both normally distributed, but the sample sizes are 30 or more, the distribution of $\bar{x}_1 - \bar{x}_2$ is approximately normal.

Applying the laws of expected value and variance we have:

\[ E(\bar{X}_1 - \bar{X}_2) = E(\bar{X}_1) - E(\bar{X}_2) = \mu_1 - \mu_2, \]
\[ \mathrm{Var}(\bar{X}_1 - \bar{X}_2) = \mathrm{Var}(\bar{X}_1) + \mathrm{Var}(\bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}. \]

We can define:

\[ Z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}. \]

Sampling Distribution of the Difference Between Two Sample Proportions

From the laws of expected value and variance, it can be shown that

\[ E(\hat{p}_1 - \hat{p}_2) = E(\hat{p}_1) - E(\hat{p}_2) = p_1 - p_2, \]
\[ \mathrm{Var}(\hat{p}_1 - \hat{p}_2) = \mathrm{Var}(\hat{p}_1) + \mathrm{Var}(\hat{p}_2) = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}. \]

If $n_1 p_1 \ge 5$, $n_1(1-p_1) \ge 5$, $n_2 p_2 \ge 5$ and $n_2(1-p_2) \ge 5$, then

\[ Z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}}} \]

is approximately standard normally distributed.
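As an illustration of the two-proportion result above, the sketch below applies it to the counts of Example 1.2 (435 of 582 females and 375 of 509 males answering "yes"). It is written in Python for illustration only. Two choices are assumptions of this sketch rather than anything stated in the slides: the unknown p1 and p2 in the variance are replaced by the sample proportions, and the hypothesised difference p1 − p2 is taken to be 0. The function names are hypothetical.

```python
import math


def normal_approx_ok(n, p, threshold=5):
    """Rule of thumb from the slides: n*p >= 5 and n*(1 - p) >= 5."""
    return n * p >= threshold and n * (1 - p) >= threshold


def two_proportion_z(x1, n1, x2, n2, diff0=0.0):
    """Z statistic for the difference between two sample proportions.

    Assumption of this sketch: the unknown p1, p2 in the variance formula
    are replaced by their sample estimates (a plug-in standard error).
    """
    p1_hat = x1 / n1
    p2_hat = x2 / n2
    se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
    return (p1_hat - p2_hat - diff0) / se


# Counts from Table 1.1: "yes" answers and group sizes (yes + no/undecided).
females_yes, females_n = 435, 435 + 147
males_yes, males_n = 375, 375 + 134

conditions_hold = (
    normal_approx_ok(females_n, females_yes / females_n)
    and normal_approx_ok(males_n, males_yes / males_n)
)
z = two_proportion_z(females_yes, females_n, males_yes, males_n)
print(conditions_hold, round(z, 2))
```

Under these assumptions the statistic comes out at roughly 0.4, so on this rough check the data give little evidence of a difference between the two proportions.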

Sampling Distribution of Sample Variance

The statistic $(n-1)s^2/\sigma^2$ has a Chi-squared distribution with df = n−1, if the population is normally distributed:

\[ \chi^2 = \frac{(n-1)s^2}{\sigma^2}, \quad \text{d.f.} = n - 1. \]

[Figure: Chi-squared density curves for d.f. = 5 and d.f. = 10.]

Sampling Distribution of the Ratio of Two Sample Variances

Define the statistic

\[ F = \frac{s_1^2 / \sigma_1^2}{s_2^2 / \sigma_2^2}, \]

where the two samples are drawn from two normal populations. The sampling distribution of this statistic is an F distribution with df $\nu_1 = n_1 - 1$ for the numerator and df $\nu_2 = n_2 - 1$ for the denominator.

Statistical Inference

Any numerical feature of a population, such as the mean or the variance, is called a parameter. Statistical inference deals with drawing generalizations about population parameters from an analysis of the information contained in the sample data. Studying the whole population is usually impractical; that is why we study a part of it.

Inferences include:
• Point Estimation: obtain a guess or an estimate for the unknown true parameter value.
• Interval Estimation: obtain an interval of plausible values for the parameter, and determine the accuracy of the procedure.
• Testing hypothesis: decide whether the value of the parameter is equal to some pre-assumed value.

• Point Estimation

Let $f(x; \theta)$ be the pdf with parameter (vector) $\theta$. Let X1, X2, …, Xn be a random sample drawn from $f(x; \theta)$. Point estimation of $\theta$ is to find a statistic whose value computed from the sample data reflects the value of $\theta$ as closely as possible. Such a statistic is called an estimator of $\theta$, and a specific value of the estimator computed from sample data is called an estimate of $\theta$.
Maximum Likelihood Estimation Method: The joint pdf of X1, X2, …, Xn, when regarded as a function of $\theta$, is called the likelihood function of $\theta$:

\[ L(\theta) = f(x_1; \theta)\, f(x_2; \theta) \cdots f(x_n; \theta). \]

The value of $\theta$, denoted by $\hat{\theta}$, that maximizes $L(\theta)$ is called the maximum likelihood estimator, or the MLE.

Example 1.8. Bernoulli sampling:

\[ f(x; p) = p^x (1-p)^{1-x}, \quad x = 0, 1, \]
\[ L(p) = p^{\sum x_i} (1-p)^{n - \sum x_i}, \quad 0 \le p \le 1, \]
\[ \hat{p} = \frac{\sum x_i}{n}. \]

Example 1.9. Normal sampling: $X_i \sim N(\mu, \sigma^2)$,

\[ L(\mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(x_i - \mu)^2}{2\sigma^2} \right\}, \]
\[ \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} X_i = \bar{X}, \quad \text{and} \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})^2. \]

Other methods: least squares, method of moments, Bayesian estimation, etc.
Properties: unbiasedness, relative efficiency, etc.

• Confidence Interval

Let L(X) and U(X) be functions of X = (X1, X2, …, Xn) such that

\[ P[L(X) < \theta < U(X)] = 1 - \alpha. \]

Then the interval (L(X), U(X)) is called a 100(1−α)% confidence interval (CI) for $\theta$, L(X) and U(X) are the lower and the upper confidence limits, and (1−α) is the confidence coefficient associated with the interval. It is an approximate CI if the above equality holds only approximately.

Bernoulli sampling: an approximate CI for p is

\[ \hat{p} \pm z_{\alpha/2} \sqrt{\hat{p}(1-\hat{p})/n}. \]

Normal sampling: an exact CI for $\mu$ is

\[ \hat{\mu} \pm t_{\alpha/2}\, \frac{\hat{\sigma}}{\sqrt{n-1}}, \]

where $t_{\alpha/2}$ is taken from the t distribution with n−1 df.

• Test Statistical Hypothesis

Null Hypothesis (H0): A theory about the values of population parameter(s), representing the status quo, accepted until proven false.

Alternative Hypothesis (Ha): A theory that contradicts H0, which is favored when there is sufficient evidence.

Test Statistic: A sample statistic used to decide whether to reject H0; it is a measure of the difference between the data and what is expected under the null hypothesis.

Rejection Region: The numerical values of the test statistic for which H0 is rejected.
This region is chosen so that the probability is α that it will contain the test statistic when H0 is true, thereby leading to a wrong rejection (Type I error). The probability α is also referred to as the level of significance.

Conclusion: If the numerical value of the test statistic falls in the rejection region, we reject H0 and conclude that Ha is true. If the test statistic does not fall in the rejection region, we reserve judgment about which hypothesis is true. An incorrect acceptance of H0 leads to a Type II error.

p–value: the probability (assuming H0 is true) of observing a value of the test statistic that is at least as contradictory to the null hypothesis as the one computed from the data.

Power of the test: the probability of rejecting a wrong null hypothesis.

Example 1.10. From past sales records, it is known that 30% of the population buys Brand X toothpaste. A new advertising campaign is completed, and to test its effectiveness, 1000 people are asked whether they buy Brand X toothpaste now. If 334 answer yes, does this indicate that the advertising campaign has been successful?

H0: p = 0.30, Ha: p > 0.30.
n = 1000, $\hat{p}$ = 0.334, $Z_{0.05}$ = 1.65. Rejection region: Z > 1.65.
Test statistic:

\[ Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}} = \frac{0.334 - 0.3}{0.01449} = 2.35. \]

Since 2.35 > 1.65, the test statistic falls in the rejection region, so H0 is rejected at the 5% level and the data indicate that the campaign has been successful.

Large Sample Inference Methods

When the exact sampling distribution of an estimator is unknown, statistical inference can only be made approximately, based on large sample properties of the estimator. Common methods include:

Wald Statistic:

\[ \frac{\hat{\theta} - \theta_0}{\mathrm{ASE}(\hat{\theta})} \;\overset{H_0}{\underset{\text{approx.}}{\sim}}\; N(0, 1). \]

Score Statistic:

\[ \frac{S(\theta_0)}{\mathrm{ASE}[S(\theta_0)]} \;\overset{H_0}{\underset{\text{approx.}}{\sim}}\; N(0, 1), \quad \text{where } S(\theta_0) = \left. \frac{d \log L(\theta)}{d\theta} \right|_{\theta = \theta_0}. \]

LR Statistic:

\[ \Lambda = \frac{\text{maximum likelihood when parameters satisfy } H_0}{\text{maximum likelihood when parameters are unrestricted}}, \qquad -2 \log \Lambda \;\overset{H_0}{\underset{\text{approx.}}{\sim}}\; \chi^2_{df}. \]

Matrix Algebra

Vector:

\[ a = \begin{pmatrix} a_1 \\ a_2 \\ \vdots \\ a_n \end{pmatrix}, \qquad b = \begin{pmatrix} b_1 & b_2 & \cdots & b_n \end{pmatrix} \]

Matrix:

\[ A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix} \]

Transpose:

\[ a' = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \end{pmatrix} \]

Matrix Multiplication:

\[ A = \begin{pmatrix} 10 & 1 \\ 3 & 6 \end{pmatrix}, \quad b = \begin{pmatrix} 2 \\ 3 \end{pmatrix}, \quad \text{then } Ab = \begin{pmatrix} 23 \\ 24 \end{pmatrix}. \]

Matrix Inverse:

\[ A^{-1} = \begin{pmatrix} 0.10526316 & -0.01754386 \\ -0.05263158 & 0.17543860 \end{pmatrix} \]

Computer Software: R

‘R’ is a computer package which does statistical analysis in a rather simple way. R is an open source software project and can be freely downloaded from:
http://info.smu.edu.sg/rsite/
http://cran.r-project.org/
http://cran.bic.nus.edu.sg/
http://www.mysmu.edu/faculty/zlyang/

Other popular software includes: Excel, Minitab, Matlab, SPSS, SAS, Gauss, S-Plus.
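
The matrix multiplication and matrix inverse shown in the Matrix Algebra slide can be verified with a few lines of code. The sketch below is written in plain Python for illustration only (the course itself uses R); the helper names `mat_vec` and `inverse_2x2` are hypothetical, and the inverse uses the standard closed form for a 2×2 matrix, which is an assumption of this sketch rather than a method given in the slides.

```python
def mat_vec(A, b):
    """Multiply a matrix (list of rows) by a column vector (list of numbers)."""
    return [sum(a_ij * b_j for a_ij, b_j in zip(row, b)) for row in A]


def inverse_2x2(A):
    """Inverse of a 2x2 matrix via the closed form (1/det) * [[d, -b], [-c, a]]."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Values taken from the Matrix Algebra slide.
A = [[10, 1], [3, 6]]
b = [2, 3]

Ab = mat_vec(A, b)
A_inv = inverse_2x2(A)

print(Ab)
for row in A_inv:
    print([round(value, 8) for value in row])
```

The printed product can be compared with Ab = (23, 24)' and the rounded inverse with the entries listed in the slide.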
