Topics for 962006 by hollypiet


									            Topics for 9/6/2006
• Discussion: Problem Set # 1
• Continued- Descriptive Statistics
  – Variability: Range, Variance, and Standard Deviation
  – Measures of Position: Percentile, z scores
• Graphical Presentations of Data
• Types of Graphs
  – Bar charts, Pie Charts, etc.
• Correlation Between 2 Variables
• STATA Lab
                      Variability
• The concept of central tendency concerns the middle,
  the central location, or the most frequent value of a
  distribution; it describes what is common within the
  distribution.
• The concept of variability, on the other hand, is related
  to what’s different or how different the values in a
  distribution are.
• Grades of two classes: Same mean, but…
   – A: 100, 100, 100, 100, 0
   – B: 90, 85, 80, 75, 70
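As a quick check of the two-class example, the sketch below (Python, standard library only; the variable names are illustrative and not from the slides) confirms that both classes share the same mean while their spread differs sharply. It uses the population standard deviation (dividing by N) as one possible measure of spread.

```python
from statistics import mean, pstdev

# Grades of the two classes from the slide.
class_a = [100, 100, 100, 100, 0]
class_b = [90, 85, 80, 75, 70]

mean_a, mean_b = mean(class_a), mean(class_b)

# Population standard deviation (divide by N) as a measure of spread.
sd_a, sd_b = pstdev(class_a), pstdev(class_b)

print(f"A: mean={mean_a}, sd={sd_a:.2f}")
print(f"B: mean={mean_b}, sd={sd_b:.2f}")
```

The means match, but class A's grades are far more spread out than class B's, which is exactly what a measure of central tendency alone cannot show.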
                         Range
• difference between the minimum and maximum
         $R = \max(x) - \min(x)$
• Example:
  – Scores of students A & B in six quizzes:
     A            B
     2            2
     3            5
     4            5
     6            5
     7            5
     8            8

   What are the Ranges for A and B?
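The question above can be checked with a minimal Python sketch. The helper name `value_range` is hypothetical, chosen only for this illustration.

```python
# Quiz scores of students A and B from the slide.
quiz_a = [2, 3, 4, 6, 7, 8]
quiz_b = [2, 5, 5, 5, 5, 8]

def value_range(values):
    """Range = maximum minus minimum."""
    return max(values) - min(values)

range_a = value_range(quiz_a)
range_b = value_range(quiz_b)

print("Range A:", range_a)
print("Range B:", range_b)
```

Both students have the same range, even though B's middle four scores are identical. The range depends only on the two extreme values, which is why the next slides turn to variance.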
                                   Variance
• Variance is another measure of the degree of dispersion in
  a series of data.
• Variance is also based on deviations from the mean.
• The variance removes the signs by squaring the deviations.
• Because the deviations are squared, values far from the mean
  (outliers) have more influence than values closer to the mean.
     $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$     Compared with     $MD = \frac{\sum_{i=1}^{N} |x_i - \bar{x}|}{N}$
                                   Example
                 Student A                                     Student B
  $x$    $x - \bar{x}$    $(x - \bar{x})^2$         $x$    $x - \bar{x}$    $(x - \bar{x})^2$
   2         -3                 9                    2         -3                 9
   3         -2                 4                    5          0                 0
   4         -1                 1                    5          0                 0
   6          1                 1                    5          0                 0
   7          2                 4                    5          0                 0
   8          3                 9                    8          3                 9
  Sum         0                28                               0                18
     $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$

     $\sigma_A^2 = 28/6 = 4.67$

     $\sigma_B^2 = 18/6 = 3$
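The worked example can be reproduced with a short Python sketch. The function name `population_variance` is an assumption made for this illustration; it divides by N, as the slide's formula does.

```python
from statistics import mean

def population_variance(values):
    """Mean of the squared deviations from the mean (divide by N)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

quiz_a = [2, 3, 4, 6, 7, 8]
quiz_b = [2, 5, 5, 5, 5, 8]

var_a = population_variance(quiz_a)  # 28 / 6
var_b = population_variance(quiz_b)  # 18 / 6

print(f"Variance A: {var_a:.2f}")
print(f"Variance B: {var_b:.2f}")
```

Unlike the range, the variance separates the two students: A's scores are more dispersed than B's.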
                 Standard Deviation
• The square root of the variance; taking the root returns
  the measure to the original units instead of squared units
     $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}}$

     $\sigma_A = \sqrt{\sigma_A^2} = \sqrt{4.67} = 2.161$

     $\sigma_B = \sqrt{\sigma_B^2} = \sqrt{3} = 1.732$
  Variance and Standard Deviation

• Never negative! Why?
• Variance takes care of the canceling effect but is
  in squared units
• Standard deviation is in the same units as the data,
  so it is easier to interpret
• Which one is larger?
• When will they be equal?
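One way to explore the last two questions is to compare a few variances with their square roots. The sketch below is only an illustration; the helper `compare` and the sample values are hypothetical.

```python
from math import sqrt

def compare(variance):
    """Report whether the variance or its square root (the sd) is larger."""
    sd = sqrt(variance)
    if variance > sd:
        return "variance is larger"
    if variance < sd:
        return "sd is larger"
    return "equal"

for variance in (0.0, 0.25, 1.0, 3.0):
    print(variance, round(sqrt(variance), 3), compare(variance))
```

The pattern suggests the answer: the two are equal only when the variance is 0 or 1, the standard deviation is larger when the variance lies between 0 and 1, and the variance is larger otherwise.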
                     Summation Notation
  $\sum_{i=1}^{3} X_i = X_1 + X_2 + X_3$          $\sum_{i=1}^{N} X_i = X_1 + X_2 + \dots + X_N$

  $\sum_{i=1}^{3} X_i^2 = X_1^2 + X_2^2 + X_3^2$          $\left( \sum_{i=1}^{3} X_i \right)^2 = (X_1 + X_2 + X_3)^2$

  $\sum_{i=1}^{N} C = C + C + \dots + C = NC$

  $\sum_{i=1}^{N} C X_i = C X_1 + C X_2 + \dots + C X_N = C (X_1 + X_2 + \dots + X_N) = C \sum_{i=1}^{N} X_i$
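These summation rules can be checked numerically. The sketch below uses made-up values for X and C (they are not from the slides) and shows in particular that the sum of squares differs from the square of the sum.

```python
# Illustrative values only; any numbers would do.
x = [2, 5, 8]
c = 3
n = len(x)

sum_of_squares = sum(v ** 2 for v in x)   # X1^2 + X2^2 + X3^2
square_of_sum = sum(x) ** 2               # (X1 + X2 + X3)^2

sum_of_constant = sum(c for _ in range(n))  # C + C + ... + C
sum_of_scaled = sum(c * v for v in x)       # CX1 + CX2 + ... + CXN

print(sum_of_squares, square_of_sum)
print(sum_of_constant, n * c)
print(sum_of_scaled, c * sum(x))
```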
            Summary: Measures of Variability

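Range                 Maximum - Minimum

                      Population                                                  Sample

Variance              $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$                     $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$

Standard              $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$                $S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
Deviation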
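The difference between the two columns of the summary (dividing by N versus n - 1) can be seen with Python's standard `statistics` module, which provides both versions. Reusing student A's quiz scores here is a choice made only for illustration.

```python
from statistics import pvariance, variance, pstdev, stdev

scores = [2, 3, 4, 6, 7, 8]  # student A's quiz scores from the earlier example

pop_var = pvariance(scores)     # population formula: divide by N
sample_var = variance(scores)   # sample formula: divide by n - 1
pop_sd = pstdev(scores)         # square root of the population variance
sample_sd = stdev(scores)       # square root of the sample variance

print(f"population: variance={pop_var:.2f}, sd={pop_sd:.3f}")
print(f"sample:     variance={sample_var:.2f}, sd={sample_sd:.3f}")
```

The sample versions are always a little larger for the same data, because the same sum of squares is divided by a smaller number.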
                         Adding a constant to X

  $\mu_{X+C} = \frac{\sum_{i=1}^{N} (X_i + C)}{N} = \frac{\sum_{i=1}^{N} X_i + NC}{N} = \mu + C$

  $\sigma_{X+C} = \sqrt{\frac{\sum_{i=1}^{N} [(X_i + C) - (\mu + C)]^2}{N}} = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} = \sigma$


 Example:
     X                     10, 20, 30                         mean=20                 sd=8.16
     X+100                 110, 120, 130                      mean=120                sd=8.16
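The example can be verified with a few lines of Python. This is a sketch using the population standard deviation (dividing by N), which is what reproduces the slide's sd of 8.16.

```python
from statistics import mean, pstdev

x = [10, 20, 30]
c = 100
shifted = [v + c for v in x]  # add the constant to every value

print("X:    ", mean(x), round(pstdev(x), 2))
print("X+100:", mean(shifted), round(pstdev(shifted), 2))
```

Adding a constant moves the mean by that constant but leaves every deviation from the mean, and therefore the standard deviation, unchanged.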
                       Multiplying by a constant
  $\mu_{CX} = \frac{\sum_{i=1}^{N} C X_i}{N} = C \frac{\sum_{i=1}^{N} X_i}{N} = C\mu$

  $\sigma_{CX} = \sqrt{\frac{\sum_{i=1}^{N} (C X_i - C\mu)^2}{N}} = \sqrt{C^2 \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}} = |C| \sigma$

         (Note the absolute value sign. If C is -10, for example, the
         standard deviation is still multiplied by a factor of positive 10.
         Variances and standard deviations are never negative by
         definition.)
Examples:
    X    10, 20, 30                           Mean = 20                                 sd= 8.16
    10X 100, 200, 300                         Mean = 200                                sd= 81.6
    -10X -100, -200, -300                     Mean = -200                               sd= 81.6
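The same check works for multiplication. In this sketch the helper `scaled_stats` is a hypothetical name, and the population standard deviation (dividing by N) is assumed, as in the slide's examples.

```python
from statistics import mean, pstdev

def scaled_stats(values, c):
    """Mean and population sd after multiplying every value by c."""
    scaled = [c * v for v in values]
    return mean(scaled), pstdev(scaled)

x = [10, 20, 30]
for c in (10, -10):
    m, sd = scaled_stats(x, c)
    print(f"C={c}: mean={m}, sd={sd:.2f}")
```

The mean picks up the sign of C, while the standard deviation is multiplied by the absolute value of C.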
       Standardized (Z) score
• Standardizing a score refers to expressing a raw
  value in terms of its deviation from the mean,
  expressed in units of standard deviation.
  – Any raw score or raw value can be converted to a
    standardized value, provided you know the mean and
    standard deviation of the distribution from which it
    came.
        Z score (Example)
Consider the following example of scores on an American Government quiz.
All students in the class (102) took a quiz worth 17 points, and scored
between 4 and 16. The distribution below depicts those 102 scores.

  x       f
  4       1
  5       1
  6       4
  7       5
  8       6
  9      10
 10      48
 11      10
 12       6
 13       5
 14       4
 15       1
 16       1
      N=102

   Mode:
   Median:
   Mean:
   Variance:
   Standard deviation:

Why do we need a Z score?
We want to know how well an individual score did.
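The blanks above can be filled in from the frequency table. The sketch below expands the table into the 102 individual scores and treats them as the whole population (dividing by N); that choice is an assumption, made because the class is the full group of interest.

```python
from statistics import mean, median, mode, pvariance, pstdev

# Frequency table from the slide: score -> number of students.
freq = {4: 1, 5: 1, 6: 4, 7: 5, 8: 6, 9: 10, 10: 48,
        11: 10, 12: 6, 13: 5, 14: 4, 15: 1, 16: 1}

# Expand the table into one entry per student.
scores = [x for x, f in freq.items() for _ in range(f)]

quiz_mode = mode(scores)
quiz_median = median(scores)
quiz_mean = mean(scores)
quiz_variance = pvariance(scores)
quiz_sd = pstdev(scores)

print("N:", len(scores))
print("Mode:", quiz_mode)
print("Median:", quiz_median)
print("Mean:", quiz_mean)
print("Variance:", quiz_variance)
print("Standard deviation:", quiz_sd)
```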
               Z score (Example)
• Let’s say I got a score of 14 on my test, and a score of
  15 on another 17 point test. What might I want to know
  in order to compare “how well I did” on the two tests?
   – how most of the class did
   – how well I did compared to the mean
• It turns out there is a way we can “re-compute” a given
  score value to express it in such terms. It’s called the
  standardized score, and technically represents a given
  score’s departure from the mean in units of standard
  deviation.
                  Z score (Example)
  x       f   x-mean      z
  4       1       -6   -3.0
  5       1       -5   -2.5
  6       4       -4   -2.0
  7       5       -3   -1.5
  8       6       -2   -1.0
  9      10       -1   -0.5
 10      48        0    0.0
 11      10        1   +0.5
 12       6        2   +1.0
 13       5        3   +1.5
 14       4        4   +2.0
 15       1        5   +2.5
 16       1        6   +3.0
      N=102

In a sense, then, we really are standardizing the score. We can now
compare my score on this test to my score on the other test.

Ex: x = 14, z = (14 - 10) / 2 = 2

Suppose there is another test with x = 15, mean = 12, variance = 4.
   z = ?

How do the two cases compare?
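The comparison can be sketched in Python. The function name `z_score` is chosen for this illustration; note that the second test is described by its variance, so the standard deviation is taken as its square root.

```python
from math import sqrt

def z_score(x, mean, sd):
    """Deviation from the mean, expressed in units of standard deviation."""
    return (x - mean) / sd

# First test: score 14, mean 10, standard deviation 2.
z_first = z_score(14, 10, 2)

# Second test: score 15, mean 12, variance 4 (so sd = sqrt(4)).
z_second = z_score(15, 12, sqrt(4))

print("z on first test: ", z_first)
print("z on second test:", z_second)
```

Although 15 is the higher raw score, the 14 sits further above its class mean in standard deviation units, so it is the stronger result relative to the class.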
    Z Scores: Comparing Across Distributions
     A z score is the observation for a single person, normalized by the
     mean and standard deviation for the whole distribution. What is the
     relevant distribution? That depends on the question you are asking.
     * The mean of a set of z scores is 0. (Why?)
     * The standard deviation of a set of z scores is 1. (Why?)
Example (data are approximate):

                 year    jump              mean   sd       z

  Bob Beamon     1968    29' 2.5" (29.2)   23     1.5      4.1
  Mike Powell    1991    29' 4" (29.3)     26     1.5      2.2

Beamon's jump was more spectacular in comparison to his contemporaries.
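The z column can be recomputed from the approximate figures in the table, and the two "Why?" properties can be checked on a small data set. In the sketch below the data set 10, 20, 30 is borrowed from the earlier slides purely for illustration, and the population standard deviation (dividing by N) is assumed.

```python
from statistics import mean, pstdev

def z_score(x, mu, sigma):
    """Deviation from the mean in units of standard deviation."""
    return (x - mu) / sigma

# (jump in feet, mean of contemporaries, sd) as given on the slide.
jumps = {"Bob Beamon": (29.2, 23, 1.5), "Mike Powell": (29.3, 26, 1.5)}
z_jumps = {name: z_score(*values) for name, values in jumps.items()}
for name, z in z_jumps.items():
    print(f"{name}: z = {z:.1f}")

# Standardize a whole data set and inspect the resulting z scores.
values = [10, 20, 30]
mu, sigma = mean(values), pstdev(values)
z_values = [z_score(v, mu, sigma) for v in values]
print("mean of z scores:", round(mean(z_values), 6))
print("sd of z scores:  ", round(pstdev(z_values), 6))
```

Subtracting the mean centers the z scores at 0, and dividing by the standard deviation rescales their spread to 1.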
                    Percentile
• Another measure of relative standing
• The pth percentile means the value of x that exceeds
  p% of the measurements and is less than the remaining
  (100-p)%.
• Ex) Dr. Minsky said that Eileen’s weight is at the 90th
  percentile. What does it mean?



  [Figure: distribution split at the 90th percentile, with 90% of the measurements below it and 10% above]
        Lower and Upper Quartiles
• The lower quartile (first quartile), Q1 is the value of x that
  exceeds one-fourth of the measurements and is less than the
  remaining three-fourths.
• The upper quartile (third quartile), Q3 is the value of x that
  exceeds three-fourths of the measurements and is less than one-
  fourth.
• The value of second quartile, Q2?
  [Figure: relative frequency distribution divided by the quartiles into four parts of 25% each]

The interquartile range (IQR) for a set of measurements is the
difference between the upper and lower quartiles; IQR=Q3-Q1

Calculating Quartiles
When the measurements are arranged in order of magnitude, the
lower quartile, Q1, is the value of x in position 0.25(n+1) and the
upper quartile, Q3, is the value of x in position 0.75(n+1).

Ex: The following data represent the scores for a sample of 10
students on a 20-point Statistics quiz: 16, 14, 2, 8, 12, 12, 9, 10, 15,
and 13. Calculate the lower and upper quartiles and the IQR for these
data.

      The position of Q1=0.25(10+1)=2.75; Q1=
      The position of Q3=0.75(10+1)=8.25; Q3=
      IQR=Q3-Q1=
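The positions above are not whole numbers, so a rule is needed for reading a value at a fractional position. The sketch below assumes linear interpolation between the two neighboring ordered values, which is a common convention but is not stated on the slide; the helper `value_at_position` is a hypothetical name.

```python
def value_at_position(sorted_values, position):
    """Value at a 1-based position; fractional positions are
    linearly interpolated between the two neighboring values."""
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part of the position
    if lower < 1:
        return sorted_values[0]
    if lower >= len(sorted_values):
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)

# Quiz scores from the slide, arranged in order of magnitude.
scores = sorted([16, 14, 2, 8, 12, 12, 9, 10, 15, 13])
n = len(scores)

q1 = value_at_position(scores, 0.25 * (n + 1))
q3 = value_at_position(scores, 0.75 * (n + 1))
iqr = q3 - q1

print("Q1 =", q1)
print("Q3 =", q3)
print("IQR =", iqr)
```

Under this interpolation rule, Q1 falls three quarters of the way from the 2nd to the 3rd ordered score and Q3 one quarter of the way from the 8th to the 9th. Other software may use a different convention and give slightly different quartiles.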
Some Findings from the Gender Dataset
. gen wage = salary/(hours*weeks)

. format wage %7.2f

. tab gender, sum(wage)
            |           Summary of wage
     Gender |        Mean   Std. Dev.       Freq.
------------+------------------------------------
       Male |       14.01       10.12         488
     Female |       10.72        7.03         462
------------+------------------------------------
      Total |       12.41        8.91         950

. tab edatt gender
   Educational |        Gender
    Attainment |      Male     Female |     Total
---------------+----------------------+----------
   HS Drop Out |        87         48 |       135
   HS Graduate |       235        231 |       466
   Assoc. Deg. |        39         61 |       100
Bachelors Deg. |        88         86 |       174
 Advanced Deg. |        39         36 |        75
---------------+----------------------+----------
         Total |       488        462 |       950
                 Alternative Graphing Techniques
[Figure: pie charts. Legend for educational attainment: 14% HS Drop Out,
 49% HS Graduate, 11% Assoc. Deg., 18% Bachelors Deg., 8% Advanced Deg.
 Legend for gender: 51% Male, 49% Female.]

[Figure: Histograms by Educational Attainment. One panel per attainment
 level (HS Drop Out, HS Graduate, Assoc. Deg., Bachelors Deg., Advanced Deg.),
 with bars for Male and Female; y-axis: Frequency, 0 to 235.]

[Figure: Histograms by Gender. Panels: Male, Female, with one bar per
 attainment level; y-axis: Frequency, 0 to 235.]
Stacked Bar Graph
 What is Wrong With This Graphic?


[Figure: bar chart titled "Wage Gap". Bars: Men 14.01, Women 10.72.
 y-axis runs from 10.00 to 15.00; x-axis: Gender.]
What is Wrong With This Graphic?

[Figure: bar chart titled "Economic Status of Workers in the Market Economy
 and the Role of Gender". Bars: Men, Women. y-axis runs from 0.00 to 20.00.]
    Mean Wage of Employed Persons by Gender
[Figure: bar chart. Bars: Men 14.01, Women 10.72. y-axis: Hourly Wage, 0.00 to 15.00.]
Source: Sample from the Current Population Survey, 1995.
Note: includes employed persons 15 years of age or older.
          On average, men have higher wages.

[Figure: bar chart. Bars: Male, Female. y-axis: Hourly Wage, 0 to 15.]
Source: Sample from the Current Population Survey, 1995.
Note: includes employed persons 15 years of age or older.


           graph wage, bar mean by(gender) ylabel l1("Hourly Wage")
                        "Box and Whiskers" Plot
                        graph wage, by(female) box ylabel l1("Hourly Wage")

[Figure: box plots of hourly wage for Male and Female. y-axis: Hourly Wage, 0 to 80.]

Source: Sample from the Current Population Survey, 1995.
Note: includes employed persons 15 years of age or older.
Controlling for Age Changes the Picture
                      Correlation
• Correlation refers to the degree of association between
  two variables.
   – Correlation does not just imply that there is a relationship; it
     tells us how strong that relationship is.
• One way social science researchers look at two
  variables at the same time is to use a scatter plot.
   – A scatter plot represents each case’s score on each variable on
     a pair of axes.
• Consider the following scenario: 10 students, showing
  for each student the number of hours spent studying and
  their grade on an exam.
 Student   Hours   Grade
       1    2.50      55
       2    2.75      60
       3    3.50      65
       4    3.75      70
       5    4.50      75
       6    4.75      80
       7    5.50      85
       8    6.25      90
       9    6.50      95
      10    7.25     100

[Figure: scatter plot of Grade (y-axis, 50 to 100) against Hours (x-axis, 2 to 6).]

The scatter plot depicts the joint distribution of grade and hours spent
studying.

A simple visual inspection of this scatter plot would lead us to suspect
that there’s a relationship between studying and test performance. In
general, the more time spent studying, the better the grade on the exam.
This visual inspection would suggest that there is a positive correlation
between studying and grade.
We say that the correlation is positive because as scores on one variable get
higher, so do scores on the other variable. In other words, high values of one variable
are associated with high values on the other, and low values on one variable are
associated with low values on the other.
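To put a number on the visual impression, the sketch below computes a correlation coefficient for the ten students. It assumes Pearson's r, which the slides do not define explicitly; the function name `pearson_r` is chosen for this illustration.

```python
from math import sqrt

# Hours studied and exam grade for the 10 students on the slide.
hours = [2.50, 2.75, 3.50, 3.75, 4.50, 4.75, 5.50, 6.25, 6.50, 7.25]
grade = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

def pearson_r(xs, ys):
    """Pearson correlation: co-variation of x and y divided by the
    product of their separate variations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r = pearson_r(hours, grade)
print(f"r = {r:.3f}")
```

The coefficient is positive and close to 1, matching the strong upward pattern in the scatter plot.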
Consider another scenario showing the number of classes skipped and
performance on the exam for another 10 students.

 Student   Skipped Classes   Grade
       1                15      55
       2                14      60
       3                12      65
       4                11      70
       5                10      75
       6                 8      80
       7                 7      85
       8                 4      90
       9                 3      95
      10                 2     100

[Figure: scatter plot of Grade (y-axis, 50 to 100) against Missed Classes (x-axis, 0 to 20).]

The scatter plot illustrates the relationship between class attendance
and grade. In this case, however, we’re looking at a negative correlation.




In a negative correlation, low values on one variable are associated with high values
on the other and vice versa. In this example, low values on missed classes are
associated with high values on exam grade. So, the slope of the line reveals the
direction of the relationship (positive or negative).
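The point about the slope can be illustrated by fitting a straight line to each data set. The sketch below assumes the ordinary least-squares line, which the slides do not name; the helper `ls_slope` is hypothetical.

```python
def ls_slope(xs, ys):
    """Slope of the least-squares line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

grade = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]
hours = [2.50, 2.75, 3.50, 3.75, 4.50, 4.75, 5.50, 6.25, 6.50, 7.25]
skipped = [15, 14, 12, 11, 10, 8, 7, 4, 3, 2]

slope_hours = ls_slope(hours, grade)      # grade points per extra hour studied
slope_skipped = ls_slope(skipped, grade)  # grade points per extra class skipped

print(f"slope for hours studied:   {slope_hours:.2f}")
print(f"slope for classes skipped: {slope_skipped:.2f}")
```

The sign of each slope matches the direction of the correlation: positive for hours studied, negative for classes skipped.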
[Figure: five scatter plots of Grade (y-axis, 50 to 100).
 Against Hours Studying: Weak positive correlation, Strong positive correlation, No correlation.
 Against Missed Classes: Weak negative correlation, Strong negative correlation.]
Correlation Coefficients, continued
 . corr y x x2 x3
 (obs=500)

          |        y        x       x2       x3
 ---------+------------------------------------
        y |   1.0000
        x |   0.7114   1.0000
       x2 |  -0.7114  -1.0000   1.0000
       x3 |   0.0119   0.0645  -0.0645   1.0000
Correlation of age and wage, controlling for gender
   The correlation coefficients show that wage is positively correlated with age for
   both men and women. However, the correlation is much stronger for men. The
   scatterplots below give a sense of what the correlations mean.

. sort gender
. by gender: corr wage age

-> gender=     Male   (obs=488)

         |     wage      age
---------+------------------
    wage |   1.0000
     age |   0.3816   1.0000

-> gender=   Female   (obs=462)

         |     wage      age
---------+------------------
    wage |   1.0000
     age |   0.1053   1.0000

[Figure: scatter plots of wage (y-axis, 0 to 40) against Age in Years (x-axis, 15 to 90); panels: Male, Female.]
graph wage age if wage<40, by(gender)

								