Topics for 9/6/2006
• Discussion: Problem Set #1
• Continued: Descriptive Statistics
  – Variability: Range, Variance, and Standard Deviation
  – Measures of Position: Percentile, z scores
• Graphical Presentations of Data
• Types of Graphs
  – Bar charts, Pie Charts, etc.
• Correlation Between 2 Variables
• STATA Lab

Variability
• The concept of central tendency concerns the middle, the central location, or the most frequent value of a distribution; it gives an account of what’s common within the distribution.
• The concept of variability, on the other hand, concerns what’s different, or how different the values in a distribution are.
• Grades of two classes: same mean (both 80), but…
  – A: 100, 100, 100, 100, 0
  – B: 90, 85, 80, 75, 70

Range
• Difference between the maximum and the minimum:

$$R = \max(x) - \min(x)$$

• Example:
  – Scores of students A & B in six quizzes:

    A    B
    2    2
    3    5
    4    5
    6    5
    7    5
    8    8

What are the ranges for A and B?

Variance
• Variance is another measure of the degree of dispersion in a series of data.
• Like the mean deviation (MD), variance is based on deviations from the mean.
• The variance clears the signs by squaring the deviations.
• Because the deviations are squared, outliers have more influence than values closer to the mean.

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N} \qquad \text{compared with} \qquad MD = \frac{\sum_{i=1}^{N} |x_i - \bar{x}|}{N}$$

Example

    Student A                          Student B
    x     x - x̄    (x - x̄)²           x     x - x̄    (x - x̄)²
    2      -3        9                 2      -3        9
    3      -2        4                 5       0        0
    4      -1        1                 5       0        0
    6       1        1                 5       0        0
    7       2        4                 5       0        0
    8       3        9                 8       3        9
    Sum     0       28                 Sum     0       18

$$\sigma_A^2 = \frac{28}{6} = 4.67 \qquad \sigma_B^2 = \frac{18}{6} = 3$$

Standard Deviation
• Square root of the variance, which resolves the problem of squared units.

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}}$$

$$\sigma_A = \sqrt{\sigma_A^2} = \sqrt{4.67} \approx 2.161 \qquad \sigma_B = \sqrt{\sigma_B^2} = \sqrt{3} \approx 1.732$$

Variance and Standard Deviation
• Never negative! Why?
• Variance takes care of the canceling effect but is in squared units
• Standard deviation is in the same units as the data, so it is easier to interpret
• Which one is larger?
• When will they be equal?

Summation Notation

$$\sum_{i=1}^{3} X_i = X_1 + X_2 + X_3 \qquad \sum_{i=1}^{N} X_i = X_1 + X_2 + \dots + X_N$$

$$\left( \sum_{i=1}^{3} X_i \right)^2 = (X_1 + X_2 + X_3)^2 \qquad \sum_{i=1}^{3} X_i^2 = X_1^2 + X_2^2 + X_3^2$$

$$\sum_{i=1}^{N} C = C + C + \dots + C = NC$$

$$\sum_{i=1}^{N} C X_i = C X_1 + C X_2 + \dots + C X_N = C (X_1 + X_2 + \dots + X_N) = C \sum_{i=1}^{N} X_i$$

Summary: Measures of Variability

Range: Maximum - Minimum

Variance (population and sample):

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N} \qquad S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$

Standard Deviation (population and sample):

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} \qquad S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$$

Adding a constant to X

$$\mu_{X+C} = \frac{\sum_{i=1}^{N} (X_i + C)}{N} = \frac{\sum_{i=1}^{N} X_i + NC}{N} = \mu_X + C$$

$$\sigma_{X+C} = \sqrt{\frac{\sum_{i=1}^{N} \left[ (X_i + C) - (\mu_X + C) \right]^2}{N}} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu_X)^2}{N}} = \sigma_X$$

Example:

    X        10, 20, 30       mean = 20     sd = 8.16
    X+100    110, 120, 130    mean = 120    sd = 8.16

Multiplying by a constant

$$\mu_{CX} = \frac{\sum_{i=1}^{N} C X_i}{N} = C \, \frac{\sum_{i=1}^{N} X_i}{N} = C \mu_X$$

$$\sigma_{CX} = \sqrt{\frac{\sum_{i=1}^{N} (C X_i - C \mu_X)^2}{N}} = \sqrt{\frac{C^2 \sum_{i=1}^{N} (X_i - \mu_X)^2}{N}} = |C| \, \sigma_X$$

(Note the absolute value sign. If C is -10, for example, it still increases the standard deviation by a factor of positive 10. Variances and standard deviations are never negative, by definition.)

Examples:

    X        10, 20, 30          mean = 20      sd = 8.16
    10X      100, 200, 300       mean = 200     sd = 81.6
    -10X     -100, -200, -300    mean = -200    sd = 81.6

Standardized (Z) score
• Standardizing a score refers to expressing a raw value in terms of its deviation from the mean, expressed in units of standard deviation.
  – Any raw score or raw value can be converted to a standardized value, provided you know the mean and standard deviation of the distribution from which it came.

Z score (Example)
Consider the following example of scores on an American Government quiz. All 102 students in the class took a quiz worth 17 points, and scored between 4 and 16. The distribution below depicts those 102 scores.

    x     f
    4     1
    5     1
    6     4
    7     5
    8     6
    9     10
    10    48
    11    10
    12    6
    13    5
    14    4
    15    1
    16    1
    N = 102

Mode:
Median:
Mean:
Variance:
Standard deviation:

Why do we need a z score? We want to know how well an individual score did.

Z score (Example)
• Let’s say I got a score of 14 on my test, and a score of 15 on another 17-point test. What might I want to know in order to compare “how well I did” on the two tests?
  – how most of the class did
  – how well I did compared to the mean
• It turns out there is a way we can “re-compute” a given score value to express it in such terms.
It’s called the standardized score, and technically it represents a given score’s departure from the mean in units of standard deviation.

Z score (Example)

    x     f     x - mean    z
    4     1     -6          -3.0
    5     1     -5          -2.5
    6     4     -4          -2.0
    7     5     -3          -1.5
    8     6     -2          -1.0
    9     10    -1          -0.5
    10    48     0           0.0
    11    10     1          +0.5
    12    6      2          +1.0
    13    5      3          +1.5
    14    4      4          +2.0
    15    1      5          +2.5
    16    1      6          +3.0
    N = 102

In a sense, then, we really are standardizing the score. We can now compare my score on this test to my score on the other test.

Ex: x = 14, z = (14 - 10) / 2 = 2

Suppose there is another test: x = 15, mean = 12, variance = 4. z = ?

Compare the two cases.

Z Scores: Comparing Across Distributions
A z score is the observation for a single person, normalized by the mean and standard deviation for the whole distribution. What is the relevant distribution? That depends on the question you are asking.
* The mean of a set of z scores is 0. (Why?)
* The standard deviation of a set of z scores is 1. (Why?)

Example (data are approximate; jump, mean, and sd in feet):

                   year    jump                mean    sd     z
    Bob Beamon     1968    29' 2.5" (29.2)     23      1.5    4.1
    Mike Powell    1994    29' 4" (29.3)       26      1.5    2.2

Beamon's jump was more spectacular in comparison to his contemporaries.

Percentile
• Another measure of relative standing
• The pth percentile is the value of x that exceeds p% of the measurements and is less than the remaining (100-p)%.
• Ex) Dr. Minsky said that Eileen’s weight is at the 90th percentile. What does it mean?

[Figure: distribution split at the 90th percentile, with 90% of the measurements below and 10% above]

Lower and Upper Quartiles
• The lower quartile (first quartile), Q1, is the value of x that exceeds one-fourth of the measurements and is less than the remaining three-fourths.
• The upper quartile (third quartile), Q3, is the value of x that exceeds three-fourths of the measurements and is less than the remaining one-fourth.
• The value of the second quartile, Q2?
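The measures of position above can be checked numerically. The following is a minimal Python sketch (Python is an assumption here; the deck itself works in Stata) that rebuilds the 102 quiz scores from the frequency table, applies the population formulas (divide by N), and computes z scores for the two tests. The quartile step assumes one common convention, in which the quantile for proportion p sits at position p(n+1) in the ordered data, with linear interpolation between neighbors; this is what `statistics.quantiles` does with `method="exclusive"`, and other conventions can give slightly different values. The names `freq`, `scores`, and `z_score` are illustrative only.

```python
from statistics import fmean, pstdev, pvariance, quantiles

# Frequency table from the American Government quiz (score -> count).
freq = {4: 1, 5: 1, 6: 4, 7: 5, 8: 6, 9: 10, 10: 48,
        11: 10, 12: 6, 13: 5, 14: 4, 15: 1, 16: 1}

# Expand the table into the individual scores, in ascending order.
scores = [x for x, f in sorted(freq.items()) for _ in range(f)]

mean = fmean(scores)
variance = pvariance(scores)  # population variance: divide by N
sd = pstdev(scores)           # population standard deviation


def z_score(x, mean, sd):
    """Deviation from the mean, in units of standard deviation."""
    return (x - mean) / sd


z_first = z_score(14, mean, sd)       # the 14 on the quiz above
z_second = z_score(15, 12, 4 ** 0.5)  # other test: mean 12, variance 4

# Quartiles under the assumed p*(n+1) position convention.
q1, q2, q3 = quantiles(scores, n=4, method="exclusive")
iqr = q3 - q1

print(f"N={len(scores)} mean={mean} variance={variance} sd={sd}")
print(f"z(14)={z_first} z(15 on other test)={z_second}")
print(f"Q1={q1} Q2={q2} Q3={q3} IQR={iqr}")
```

On these data the z scores are 2.0 for the 14 and 1.5 for the 15, so the 14 is the stronger result relative to its own class, even though 15 is the higher raw score.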
[Figure: relative frequency distribution divided by the quartiles into four parts of 25% each]

The interquartile range (IQR) for a set of measurements is the difference between the upper and lower quartiles: IQR = Q3 - Q1

Calculating Quartiles
When the measurements are arranged in order of magnitude, the lower quartile, Q1, is the value of x in position 0.25(n+1) and the upper quartile, Q3, is the value of x in position 0.75(n+1).

Ex: The following data represent the scores for a sample of 10 students on a 20-point Statistics quiz: 16, 14, 2, 8, 12, 12, 9, 10, 15, and 13. Calculate the lower and upper quartiles and the IQR for these data.

The position of Q1 = 0.25(10+1) = 2.75; Q1 =
The position of Q3 = 0.75(10+1) = 8.25; Q3 =
IQR = Q3 - Q1 =

Some Findings from the Gender Dataset

    . gen wage = salary/(hours*weeks)
    . format wage %7.2f
    . tab gender, sum(wage)

                |          Summary of wage
         Gender |        Mean   Std. Dev.       Freq.
    ------------+------------------------------------
           Male |       14.01       10.12         488
         Female |       10.72        7.03         462
    ------------+------------------------------------
          Total |       12.41        8.91         950

    . tab edatt gender

       Educational |        Gender
        Attainment |      Male     Female |     Total
    ---------------+----------------------+----------
       HS Drop Out |        87         48 |       135
       HS Graduate |       235        231 |       466
       Assoc. Deg. |        39         61 |       100
    Bachelors Deg. |        88         86 |       174
     Advanced Deg. |        39         36 |        75
    ---------------+----------------------+----------
             Total |       488        462 |       950

Alternative Graphing Techniques

[Figure: pie chart of educational attainment: HS Drop Out 14%, HS Graduate 49%, Assoc. Deg. 11%, Bachelors Deg. 18%, Advanced Deg. 8%]
[Figure: pie chart of gender: Male 51%, Female 49%]
[Figure: "Histograms by Educational Attainment": one panel per attainment level, bars for Male and Female; y-axis: Frequency, 0 to 235]
[Figure: "Histograms by Gender": one panel each for Male and Female, bars for the five attainment levels; y-axis: Frequency, 0 to 235]
[Figure: "Stacked Bar Graph"]

What is Wrong With This Graphic?
[Figure: bar chart titled "Wage Gap"; x-axis: Gender (Men, Women); y-axis runs from 10.00 to 15.00; bars labeled 14.01 and 10.72]

What is Wrong With This Graphic?

[Figure: bar chart titled "Economic Status of Workers in the Market Economy and the Role of Gender"; x-axis: Men, Women; y-axis runs from 0.00 to 20.00]

Mean Wage of Employed Persons by Gender

[Figure: bar chart; x-axis: Men, Women; y-axis: Hourly Wage, 0.00 to 15.00; bars labeled 14.01 and 10.72]
Source: Sample from the Current Population Survey, 1995.
Note: includes employed persons 15 years of age or older.

On average, men have higher wages.

[Figure: bar chart of mean hourly wage; x-axis: Male, Female; y-axis: Hourly Wage, 0 to 15]
Source: Sample from the Current Population Survey, 1995.
Note: includes employed persons 15 years of age or older.

    graph wage, bar mean by(gender) ylabel l1("Hourly Wage")

"Box and Whiskers" Plot

    graph wage, by(female) box ylabel l1("Hourly Wage")

[Figure: box plots of hourly wage for Male and Female; y-axis: Hourly Wage, 0 to 80]
Source: Sample from the Current Population Survey, 1995.
Note: includes employed persons 15 years of age or older.

Controlling for Age Changes the Picture

Correlation
• Correlation refers to the degree of association between two variables.
  – It does not just imply that there is a relationship; it tells us how strong that relationship is.
• One way social science researchers look at two variables at the same time is to employ a scatter plot.
  – A scatter plot represents each case’s score on each variable on a pair of axes.
• Consider the following scenario: 10 students, showing for each student the number of hours spent studying and their grade on an exam.

    Student    Hours    Grade
    1          2.50     55
    2          2.75     60
    3          3.50     65
    4          3.75     70
    5          4.50     75
    6          4.75     80
    7          5.50     85
    8          6.25     90
    9          6.50     95
    10         7.25     100

[Figure: scatter plot of Grade (y-axis, 50 to 100) against Hours (x-axis)]

The scatter plot depicts the joint distribution of grade and hours spent studying.

A simple visual inspection of this scatter plot would lead us to suspect that there’s a relationship between studying and test performance. In general, the more time spent studying, the better the grade on the exam. This visual inspection would suggest that there is a positive correlation between studying and grade.
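The visual impression can be backed with a number. The following is a minimal Python sketch (an assumption: the deck works in Stata and does not state a formula at this point) that computes Pearson's correlation coefficient r for the ten students in the table. Pearson's r is the sum of the products of paired deviations from the means, divided by the square root of the product of the two sums of squared deviations; it ranges from -1 to +1. The helper name `pearson_r` is illustrative only.

```python
from math import sqrt

# Hours studied and exam grade for the ten students in the table.
hours = [2.50, 2.75, 3.50, 3.75, 4.50, 4.75, 5.50, 6.25, 6.50, 7.25]
grades = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]


def pearson_r(xs, ys):
    """Pearson's r for two equal-length sequences of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of paired deviations from the means.
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return sp_xy / sqrt(ss_x * ss_y)


r = pearson_r(hours, grades)
print(f"r = {r:.3f}")
```

A value of r close to +1 matches what the scatter plot suggests: a strong positive relationship between hours studied and grade.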
We say that the correlation is positive because as scores on one variable get higher, so do scores on the other variable. In other words, high values on one variable are associated with high values on the other, and low values on one variable are associated with low values on the other.

Consider another scenario showing the number of classes skipped and performance on the exam for another 10 students.

    Student    Skipped Classes    Grade
    1          15                 55
    2          14                 60
    3          12                 65
    4          11                 70
    5          10                 75
    6          8                  80
    7          7                  85
    8          4                  90
    9          3                  95
    10         2                  100

[Figure: scatter plot of Grade (y-axis, 50 to 100) against Missed Classes (x-axis, 0 to 20)]

The scatter plot illustrates the relationship between class attendance and grade. In this case, however, we’re looking at a negative correlation.

In a negative correlation, low values on one variable are associated with high values on the other and vice versa. In this example, low values on missed classes are associated with high values on exam grade.

So, the slope of the line reveals the direction of the relationship (positive or negative).

[Figure: five scatter-plot panels. "Weak positive correlation", "Strong positive correlation", and "No correlation" plot Grade against Hours Studying; "Weak negative correlation" and "Strong negative correlation" plot Grade against Missed Classes]

Correlation Coefficients, continued

    . corr y x x2 x3
    (obs=500)

             |        y        x       x2       x3
    ---------+------------------------------------
           y |   1.0000
           x |   0.7114   1.0000
          x2 |  -0.7114  -1.0000   1.0000
          x3 |   0.0119   0.0645  -0.0645   1.0000

Correlation of age and wage, controlling for gender
The correlation coefficients show that wage is positively correlated with age for both men and women. However, the correlation is much stronger for men. The scatterplots below give a sense of what the correlations mean.

    . sort gender
    . by gender: corr wage age

    -> gender= Male
    (obs=488)

             |     wage      age
    ---------+------------------
        wage |   1.0000
         age |   0.3816   1.0000

    -> gender= Female
    (obs=462)

             |     wage      age
    ---------+------------------
        wage |   1.0000
         age |   0.1053   1.0000

[Figure: scatter plots of wage (y-axis, 0 to 40) against Age in Years (x-axis, 15 to 90), one panel each for Male and Female]

    graph wage age if wage<40, by(gender)