# Topics for 9/6/2006 by hollypiet

```
Topics for 9/6/2006
• Discussion: Problem Set # 1
• Continued: Descriptive Statistics
– Variability: Range, Variance, and Standard Deviation
– Measures of Position: Percentile, z scores
• Graphical Presentations of Data
• Types of Graphs
– Bar charts, Pie Charts, etc.
• Correlation Between 2 Variables
• STATA Lab
Variability
• Central tendency concerns the middle, the central location,
or the most frequent value of a distribution; it gives an
account of what’s common within the distribution.
• Variability, on the other hand, concerns how different the
values in a distribution are from one another.
• Grades of two classes: Same mean, but…
– A: 100, 100, 100, 100, 0
– B: 90, 85, 80, 75, 70
Range
• Difference between the maximum and the minimum
$R = \max(x) - \min(x)$
• Example:
– Scores of students A & B in six quizzes:
A            B
2            2
3            5
4            5
6            5
7            5
8            8

What are the Ranges for A and B?
Variance
• Variance is another measure of the degree of dispersion in
a series of data.
• The variance is also based on deviations from the mean.
• The variance removes the signs by squaring the deviations.
• Because the deviations are squared, large outliers have more
influence than values closer to the mean.
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$   compared with   $MD = \frac{\sum_{i=1}^{N} |x_i - \bar{x}|}{N}$
Example
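           Student A                          Student B
  x    x-mean   (x-mean)^2           x    x-mean   (x-mean)^2
  2      -3         9                2      -3         9
  3      -2         4                5       0         0
  4      -1         1                5       0         0
  6       1         1                5       0         0
  7       2         4                5       0         0
  8       3         9                8       3         9
Sum       0        28                        0        18

$\sigma_A^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N} = 28/6 = 4.67$

$\sigma_B^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N} = 18/6 = 3$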
Standard Deviation
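• The square root of the variance, which deals with the problem
of squared units

$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}}$

$\sigma_A = \sqrt{\sigma_A^2} = \sqrt{4.67} = 2.161$

$\sigma_B = \sqrt{\sigma_B^2} = \sqrt{3} = 1.732$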
Variance and Standard Deviation

• Never negative (and zero only when every value is the same)! Why?
• Variance takes care of the canceling effect but is
in squared units
• Standard deviation is in the same units as the data, so it is
easier to interpret
• Which one is larger?
• When will they be equal?
Summation Notation
$\sum_{i=1}^{3} X_i = X_1 + X_2 + X_3$
$\sum_{i=1}^{N} X_i = X_1 + X_2 + \dots + X_N$

$\sum_{i=1}^{3} X_i^2 = X_1^2 + X_2^2 + X_3^2$
$\left(\sum_{i=1}^{3} X_i\right)^2 = (X_1 + X_2 + X_3)^2$

$\sum_{i=1}^{N} C = C + C + \dots + C = NC$

$\sum_{i=1}^{N} C X_i = C X_1 + C X_2 + \dots + C X_N = C (X_1 + X_2 + \dots + X_N) = C \sum_{i=1}^{N} X_i$
Summary: Measures of Variability

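Range                 Maximum - Minimum

                      Population                                                  Sample
Variance              $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$                      $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

Standard Deviation    $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$                 $S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$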

Adding a constant
$\mu_{(X+C)} = \frac{\sum_{i=1}^{N} (X_i + C)}{N} = \frac{\sum_{i=1}^{N} X_i + NC}{N} = \mu + C$

$\sigma_{(X+C)} = \sqrt{\frac{\sum_{i=1}^{N} \left[(X_i + C) - (\mu + C)\right]^2}{N}} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} = \sigma$

Example:
X        10, 20, 30        mean = 20     sd = 8.16
X+100    110, 120, 130     mean = 120    sd = 8.16
Multiplying by a constant
$\mu_{CX} = \frac{\sum_{i=1}^{N} C X_i}{N} = C \frac{\sum_{i=1}^{N} X_i}{N} = C\mu$

$\sigma_{CX} = \sqrt{\frac{\sum_{i=1}^{N} (C X_i - C\mu)^2}{N}} = \sqrt{\frac{C^2 \sum_{i=1}^{N} (X_i - \mu)^2}{N}} = |C|\sigma$

(Note the absolute value sign. If C is -10, for example, it still
increases the standard deviation by a factor of positive 10.
Variances and standard deviations are never negative, by
definition.)
Examples:
X       10, 20, 30           mean = 20      sd = 8.16
10X     100, 200, 300        mean = 200     sd = 81.6
-10X    -100, -200, -300     mean = -200    sd = 81.6
Standardized (Z) score
• Standardizing a score refers to expressing a raw
value in terms of its deviation from the mean,
expressed in units of standard deviation.
– Any raw score or raw value can be converted to a
standardized value, provided you know the mean and
standard deviation of the distribution from which it
came.
Z score (Example)
Consider the following example of scores on an American Government quiz.
All students in the class (102) took a quiz worth 17 points, and scored
between 4 and 16. The distribution below depicts those 102 scores.

 x     f
 4     1
 5     1
 6     4
 7     5
 8     6
 9    10
10    48
11    10
12     6
13     5
14     4
15     1
16     1
N=102

Mode:
Median:
Mean:
Variance:
Standard deviation:

Why do we need a Z score?
We want to know how well an individual score did relative to the rest of the class.
Z score (Example)
• Let’s say I got a score of 14 on my test, and a score of
15 on another 17 point test. What might I want to know
in order to compare “how well I did” on the two tests?
– how most of the class did
– how well I did compared to the mean
• It turns out there is a way we can “re-compute” a given
score value to express it in such terms. It’s called the
standardized score, and technically represents a given
score’s departure from the mean in units of standard
deviation.
Z score (Example)
 x     f   x-mean      z
 4     1       -6   -3.0
 5     1       -5   -2.5
 6     4       -4   -2.0
 7     5       -3   -1.5
 8     6       -2   -1.0
 9    10       -1   -0.5
10    48        0    0.0
11    10        1   +0.5
12     6        2   +1.0
13     5        3   +1.5
14     4        4   +2.0
15     1        5   +2.5
16     1        6   +3.0
N=102

In a sense, then, we really are standardizing the score. We can now
compare my score on this test to my score on the other test.

Ex: x = 14, z = (14 - 10) / 2 = 2

Suppose there is another test: x = 15, mean = 12, variance = 4
z = ?
How do the two cases compare?
Z Scores: Comparing Across Distributions
A z score is the observation for a
single person, normalized by the
mean and standard deviation for
the whole distribution. What is the
relevant distribution? That depends
on the question you are asking.
* The mean of a set of z scores is 0. (Why?)
* The standard deviation of a set of z scores is 1. (Why?)
Example (data are approximate):

              year   jump               mean   sd    z
Bob Beamon    1968   29' 2.5" (29.2)    23     1.5   4.1
Mike Powell   1991   29' 4" (29.3)      26     1.5   2.2

Beamon's jump was more spectacular in comparison to his contemporaries.
Percentile
• Another measure of relative standing
• The pth percentile means the value of x that exceeds
p% of the measurements and is less than the remaining
(100-p)%.
• Ex) Dr. Minsky said that Eileen’s weight is at the 90th
percentile. What does it mean?

[Figure: distribution split at the 90th percentile into 90% and 10% of the measurements]
Lower and Upper Quartiles
• The lower quartile (first quartile), Q1 is the value of x that
exceeds one-fourth of the measurements and is less than the
remaining three-fourths.
• The upper quartile (third quartile), Q3 is the value of x that
exceeds three-fourths of the measurements and is less than one-
fourth.
• What is the value of the second quartile, Q2?

[Figure: relative frequency distribution divided into four segments of 25% each]

The interquartile range (IQR) for a set of measurements is the
difference between the upper and lower quartiles: IQR = Q3 - Q1

Calculating Quartile
When the measurements are arranged in order of magnitude, the
lower quartile, Q1, is the value of x in position 0.25(n+1) and the
upper quartile, Q3, is the value of x in position 0.75(n+1).

Ex: The following data represent the scores for a sample of 10
students on a 20-point Statistics quiz: 16, 14, 2, 8, 12, 12, 9, 10, 15,
and 13. Calculate the lower and upper quartiles and the IQR for these
data.

The position of Q1 = 0.25(10+1) = 2.75; Q1 =
The position of Q3 = 0.75(10+1) = 8.25; Q3 =
IQR = Q3 - Q1 =
Some Findings from the Gender Dataset
. gen wage = salary/(hours*weeks)

. format wage %7.2f

. tab gender, sum(wage)

            |           Summary of wage
     Gender |        Mean   Std. Dev.       Freq.
------------+------------------------------------
       Male |       14.01       10.12         488
     Female |       10.72        7.03         462
------------+------------------------------------
      Total |       12.41        8.91         950

. tab edatt gender

   Educational |        Gender
    Attainment |      Male     Female |     Total
---------------+----------------------+----------
   HS Drop Out |        87         48 |       135
   HS Graduate |       235        231 |       466
   Assoc. Deg. |        39         61 |       100
Bachelors Deg. |        88         86 |       174
 Advanced Deg. |        39         36 |        75
---------------+----------------------+----------
         Total |       488        462 |       950
Alternative Graphing Techniques
[Figure: pie charts; surviving labels: Male, Female; edatt==HS Drop Out 14%, edatt==Assoc. Deg. 11%, edatt==Bachelors Deg. 18%; Male 51%, Female 49%]
[Figure: Histograms by Educational Attainment. Panels include HS Drop Out, HS Graduate, Assoc. Deg., and Bachelors Deg.; bars: Male, Female; vertical axis: Frequency, 0 to 235]
[Figure: Histograms by Gender. Panels: Male, Female; vertical axis: Frequency, 0 to 235]
[Figure: Stacked Bar Graph]
What is Wrong With This Graphic?

[Figure: bar chart titled “Wage Gap”. Bars: Men 14.01, Women 10.72. Horizontal axis: Gender. Vertical axis runs from 10.00 to 15.00.]
What is Wrong With This Graphic?

[Figure: bar chart titled “Economic Status of Workers in the Market Economy and the Role of Gender”. Bars: Men, Women. Vertical axis runs from 0.00 to 20.00.]
Mean Wage of Employed Persons by Gender
[Figure: bar chart. Bars: Men 14.01, Women 10.72. Vertical axis: Hourly Wage, 0.00 to 15.00.]
Source: Sample from the Current Population Survey, 1995.
Note: includes employed persons 15 years of age or older.
On average, men have higher wages.

[Figure: bar chart. Bars: Male, Female. Vertical axis: Hourly Wage, 0 to 15.]
Source: Sample from the Current Population Survey, 1995.
Note: includes employed persons 15 years of age or older.

graph wage, bar mean by(gender) ylabel l1("Hourly Wage")
"Box and Whiskers" Plot
graph wage, by(female) box ylabel l1("Hourly Wage")

[Figure: box plots for Male and Female. Vertical axis: Hourly Wage, 0 to 80.]

Source: Sample from the Current Population Survey, 1995.
Note: includes employed persons 15 years of age or older.
Controlling for Age Changes the Picture
Correlation
• Correlation refers to the degree of association between
two variables.
– It does not just imply that there is a relationship; it tells us
how strong that relationship is.
• One way social science researchers look at two
variables at the same time is to use a scatter plot.
– A scatter plot represents each case’s score on each variable on
a pair of axes.
• Consider the following scenario: 10 students, showing
for each student the number of hours spent studying and
the grade on the exam.

Student   Hours   Grade
   1      2.50      55
   2      2.75      60
   3      3.50      65
   4      3.75      70
   5      4.50      75
   6      4.75      80
   7      5.50      85
   8      6.25      90
   9      6.50      95
  10      7.25     100

[Figure: scatter plot of exam grade (50 to 100) against Hours (2 to 6)]

The scatter plot depicts the joint distribution of grade and hours spent
studying. A simple visual inspection of this scatter plot would lead us to
suspect that there’s a relationship between studying and test performance.
In general, the more time spent studying, the better the grade on the exam.
This visual inspection suggests that the relationship is positive.

We say that the correlation is positive because as scores on one variable get
higher, so do scores on the other variable. In other words, high values of one variable
are associated with high values on the other, and low values on one variable are
associated with low values on the other.
Consider another scenario showing the number of classes skipped and
performance on the exam for another 10 students.

Student   Missed Classes   Grade
   1            15           55
   2            14           60
   3            12           65
   4            11           70
   5            10           75
   6             8           80
   7             7           85
   8             4           90
   9             3           95
  10             2          100

[Figure: scatter plot of exam grade (50 to 100) against Missed Classes (0 to 20)]

The scatter plot illustrates the relationship between class attendance
and grade. In this case, however, we’re looking at a negative correlation.

In a negative correlation, low values on one variable are associated with high values
on the other and vice versa. In this example, low values on missed classes are
associated with high values on exam grade. So, the slope of the line reveals the
direction of the relationship (positive or negative).
[Figure: five scatter plots of exam grade (50 to 100).
 Against Hours Studying: Weak positive correlation, Strong positive correlation, No correlation.
 Against Missed Classes: Weak negative correlation, Strong negative correlation.]
Correlation Coefficients, continued
. corr y x x2 x3
(obs=500)

         |        y        x       x2       x3
---------+------------------------------------
       y |   1.0000
       x |   0.7114   1.0000
      x2 |  -0.7114  -1.0000   1.0000
      x3 |   0.0119   0.0645  -0.0645   1.0000
Correlation of age and wage, controlling for gender
The correlation coefficients show that wage is positively correlated with age for
both men and women. However, the correlation is much stronger for men. The
scatterplots below give a sense of what the correlations mean.

. sort gender
. by gender: corr wage age

-> gender=     Male   (obs=488)

         |     wage      age
---------+------------------
    wage |   1.0000
     age |   0.3816   1.0000

-> gender=   Female   (obs=462)

         |     wage      age
---------+------------------
    wage |   1.0000
     age |   0.1053   1.0000

graph wage age if wage<40, by(gender)
[Figure: scatter plots of wage (0 to 40) against Age in Years (15 to 90), by gender: Male, Female]

```
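
The variability and z score examples in the slides can be re-computed in a few lines of code. What follows is a minimal Python sketch rather than anything from the lecture itself (which uses STATA). The helper names `pop_variance`, `pop_sd`, and `z_score` are hypothetical, and the population formulas (dividing by N) are assumed because the worked examples divide by N.

```python
import math

def pop_variance(xs):
    """Population variance: mean of the squared deviations from the mean."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def pop_sd(xs):
    """Population standard deviation: square root of the variance."""
    return math.sqrt(pop_variance(xs))

def z_score(x, mean, sd):
    """Deviation of x from the mean, in units of standard deviation."""
    return (x - mean) / sd

# Quiz scores of students A and B from the variance example.
a = [2, 3, 4, 6, 7, 8]
b = [2, 5, 5, 5, 5, 8]
print(round(pop_variance(a), 2), round(pop_sd(a), 3))  # 4.67 2.16
print(round(pop_variance(b), 2), round(pop_sd(b), 3))  # 3.0 1.732

# Adding a constant leaves the sd unchanged; multiplying by C scales it by |C|.
x = [10, 20, 30]
shifted = [v + 100 for v in x]
scaled = [-10 * v for v in x]
print(round(pop_sd(x), 2), round(pop_sd(shifted), 2), round(pop_sd(scaled), 1))  # 8.16 8.16 81.6

# Expand the frequency table of the 102 quiz scores into a list of raw scores.
freq = {4: 1, 5: 1, 6: 4, 7: 5, 8: 6, 9: 10, 10: 48,
        11: 10, 12: 6, 13: 5, 14: 4, 15: 1, 16: 1}
scores = [score for score, count in freq.items() for _ in range(count)]
mean = sum(scores) / len(scores)
sd = pop_sd(scores)
print(mean, sd)                       # 10.0 2.0
print(z_score(14, mean, sd))          # 2.0
print(z_score(15, 12, math.sqrt(4)))  # 1.5 (the second test: mean 12, variance 4)
```

The code takes the square root of the unrounded variance 28/6, so it prints 2.16 for student A, whereas the slide's 2.161 is the square root of the rounded value 4.67.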
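
The quartile position rule and the two scatter plot scenarios can also be sketched in Python. Two assumptions go beyond what the slides state: a fractional position such as 2.75 is resolved by linear interpolation between the two neighbouring ordered values, and correlation is measured with the standard Pearson coefficient, whose formula the slides do not spell out. The function names `value_at_position`, `quartiles`, and `pearson_r` are hypothetical.

```python
import math

def value_at_position(sorted_xs, pos):
    """Value at a 1-based, possibly fractional position in ordered data.
    Assumption: interpolate linearly between the neighbouring values."""
    n = len(sorted_xs)
    if not 1 <= pos <= n:
        raise ValueError("position falls outside the ordered data")
    lower = math.floor(pos)
    frac = pos - lower
    left = sorted_xs[lower - 1]
    right = sorted_xs[min(lower, n - 1)]
    return left + frac * (right - left)

def quartiles(xs):
    """Q1 and Q3 at positions 0.25(n+1) and 0.75(n+1) of the ordered data."""
    s = sorted(xs)
    n = len(s)
    return value_at_position(s, 0.25 * (n + 1)), value_at_position(s, 0.75 * (n + 1))

def pearson_r(xs, ys):
    """Pearson correlation: co-deviation divided by the product of spreads."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    ssx = sum((x - mx) ** 2 for x in xs)
    ssy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(ssx * ssy)

# Statistics quiz scores from the quartile exercise.
quiz = [16, 14, 2, 8, 12, 12, 9, 10, 15, 13]
q1, q3 = quartiles(quiz)
print(q1, q3, q3 - q1)  # 8.75 14.25 5.5

# The two scatter plot scenarios: hours studied and classes missed vs. grade.
grade = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
hours = [2.50, 2.75, 3.50, 3.75, 4.50, 4.75, 5.50, 6.25, 6.50, 7.25]
missed = [15, 14, 12, 11, 10, 8, 7, 4, 3, 2]
print(pearson_r(hours, grade))   # close to +1: strong positive correlation
print(pearson_r(missed, grade))  # close to -1: strong negative correlation
```

Other software may use a different interpolation convention for quartiles, so results for small samples can differ slightly from the (n+1) rule used here.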