# Sociology 709 _Martin_

```
Sociology 709 (Martin)
Lecture 12: April 23, 2009
• Statistical models for rates: part 2 of 2.
– Practice with rates and survivor proportions
– Possible problems with rate models

– Vaupel and Yashin. 1985. “Heterogeneity’s Ruses.”
American Statistician 39(3):176-185.
• Exponential models…
  – Usually do not match observed data very well

• Cox models…
  – Match the data extremely well at little cost, but
  – Do not allow us to predict survivor proportions from
    model output.
  – Give us no insights into duration as a variable.
Other forms of hazard functions:

•   Piecewise exponential
•   Gompertz
•   Weibull
•   Splined Piecewise Gompertz

• I am not going to explain how to do these models. I will
show the equation, relative advantages, and graphic
representations for each.

• You might encounter such models in the literature.
Piecewise constant exponential model
To specify a piecewise constant model, set aside one duration
as the comparison interval, then define other intervals
with a time-varying covariate.
Once you have done this, you can estimate a coefficient for an
interval just as you estimate a coefficient for any other
covariate.

h(t) = exp(β_0 + β_1 t_1 + β_2 t_2 + β_3 x_1 + ... + β_k x_{k-2})

In this example, β_1 and β_2 are coefficients for time
intervals, not x variables, and β_0 is the intercept for
the omitted time interval.
Data set-up for a piecewise constant model
id   age at      starting   ending     ending   x1   x2   x3   x4   x5
     1st birth   duration   duration   state
1    18          0          8          0        1    0    0    0    0
1    18          9          20         0        0    1    0    0    0
1    18          21         29         1        0    0    1    0    0
2    26          0          8          0        1    0    0    0    0
2    26          9          19         1        0    1    0    0    0
3    21          0          8          0        1    0    0    0    0
3    21          9          20         0        0    1    0    0    0
3    21          21         32         0        0    0    1    0    0
3    21          33         68         0        0    0    0    1    0
3    21          69         117        0        0    0    0    0    1
Data set-up for a piecewise constant model
A few notes about the data set-up:

Starting duration = 9 means 9.0000
Ending duration = 20 means 20.9999

It is possible to start in month 21 and have a second
birth in month 21.
It is not possible to start in month 0 and have a second
birth in month 0. (In other words, STATA will
drop cases with a twin first birth.)
How on earth do we set up data like this?
Answer: tell STATA to do it (or do it in SAS).

. * set up for a piecewise exponential model

. stset dur, fail(birth2) id(id)
. stsplit durcat, at(9 21 33 69)
(6790 observations created)

. egen durgroup = group(durcat)
(34 missing values generated)

. gen dur0008 = durgroup==1
. gen dur0920 = durgroup==2
. gen dur2132 = durgroup==3
. gen dur3368 = durgroup==4
. gen dur69p  = durgroup==5
. * event history model with piecewise exponential baseline for duration
. * since first birth
. streg age1_15 age1_16 age1_18 age1_25 hispanic nhblack nhother dur0008 dur092
> 0 dur2132 dur69p, dist(exp) nohr

Exponential regression -- log relative-hazard form

No. of subjects =           2918                     Number of obs   =     9708
No. of failures =           1741
Time at risk    =         124953
LR chi2(11)     =    882.26
Log likelihood =    -2785.6657                     Prob > chi2     =    0.0000
------------------------------------------------------------------------------
_t |      Coef.   Std. Err.       z     P>|z|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
age1_15 | -.1538871    .1253363     -1.228   0.220      -.3995418    .0917676
age1_16 |   .0472311   .0719714      0.656   0.512      -.0938302    .1882924
age1_18 |   .0901565   .0592117      1.523   0.128      -.0258964    .2062093
age1_25 |   -.016424   .0898077     -0.183   0.855      -.1924439    .1595959
hispanic |   .1417642   .0705131      2.010   0.044       .0035609    .2799674
nhblack |   .0476215   .0676632      0.704   0.482      -.0849961     .180239
nhother |   .3124815    .109814      2.846   0.004       .0972501     .527713
dur0008 | -3.444608    .2618926    -13.153   0.000      -3.957908   -2.931308
dur0920 | -.2045279    .0635638     -3.218   0.001      -.3291106   -.0799452
dur2132 |   .3671508   .0596125      6.159   0.000       .2503124    .4839892
dur69p | -.8130398     .097524     -8.337   0.000      -1.004183   -.6218962
_cons |    -4.0599   .0517029    -78.524   0.000      -4.161236   -3.958565
------------------------------------------------------------------------------
[Figure: log rate, piecewise exponential. y-axis: log rate; x-axis: duration]
[Figure: rate, piecewise exponential. y-axis: rate; x-axis: duration]
Gompertz model

• In a Gompertz model, the log of the rate increases or
decreases linearly with t.
• This means that the rate increases or decreases
monotonically with t.
• ln(h(t)) = β_0 + γ t + β_1 x_1 + β_2 x_2 + …
• Df = 2: one for intercept, one for slope.
• Usually fewer degrees of freedom than a piecewise model
• Good for monotonically changing rates (this happens quite
often).
• Calculating survivor probabilities is quite doable.
• You interpret the coefficients as usual.
[Figure: log rate, Gompertz. y-axis: log rate; x-axis: duration]
[Figure: rate, Gompertz. y-axis: rate; x-axis: duration]
Weibull model

• In a Weibull model, the log of the rate increases or decreases
linearly with the log of t.
• The rate increases or decreases monotonically and rapidly
with small t, more slowly with large t
• ln(h(t)) = β_0 + γ ln(t) + β_1 x_1 + β_2 x_2 + …
• Df = 2: one for intercept, one for slope (in log form).
• Usually fewer degrees of freedom than a piecewise model
• Good for monotonically changing rates that look Weibull-
like (this happens quite often). Bad otherwise.
• Calculating survivor probabilities is still doable.
• You interpret the coefficients as usual.
[Figure: log rate by log time, Weibull model. y-axis: log rate; x-axis: log duration]
[Figure: log rate, Weibull model. y-axis: log rate; x-axis: duration]
[Figure: rate, Weibull model. y-axis: rate; x-axis: duration]
Splined piecewise Gompertz model

• In a splined piecewise Gompertz model, you join several
Gompertz segments end to end across durations.

• (equation suppressed)
• Df = df for comparable piecewise +1, but you get an
unbroken function with a much better fit than a comparable
piecewise.
• Good for fitting models to complex rates.
• Calculating survivor probabilities is difficult.
• You interpret the coefficients as usual, but it is also possible
to specify complex nonproportional coefficient effects.
[Figure: log rate, splined piecewise Gompertz model. y-axis: log rate; x-axis: duration]
[Figure: rate, splined piecewise Gompertz model. y-axis: rate; x-axis: duration]
Practice with rates and survivor proportions in a
piecewise model
• In event history models, you should help the reader
distinguish whether and when the event occurs by turning
model coefficients into meaningful predictions.

Some options:
1.) read coefficients to calculate the log rate
2.) calculate the survivor function from the rate
3.) calculate the proportion experiencing the event from a
survivor function.
4.) calculate the median duration to the event, using the
survivor function.
Next Topic: Problems that arise in event history
research
Censoring: can have various effects on your model
interpretations – some benign, some severe

Right censoring

Left Censoring

Left Truncation
Right censoring

• Causes of right censoring:
– #1: experiencing the event, which makes the respondent no longer
at risk of the event.
– #2: experiencing some competing event which makes the
respondent no longer at risk of the event.
– #3: interview occurs, and no further information is available for
that respondent
• Effects of right censoring:
– increases standard errors of estimates, particularly survivor
estimates.
– makes models subject to bias if the baseline is improperly
specified.
Losing data at the left side of the duration
function
• Left-censoring: we do not have information dating
back to the starting duration for a person, but we
know the duration at which we begin to have valid
observations.

Example: a health study which examines death rates due to
tuberculosis. The key explanatory variable is a genetic
marker. The death rate due to tuberculosis is duration-
dependent, and the survey asks the respondents to recall the
duration since first symptoms.
A more serious problem: left-truncation

• Left-truncation: we do not have information dating
back to the starting duration for a person, and we do
not know the duration at which we begin to have
valid observations.

Example: a health study which examines death rates due to
tuberculosis. The key explanatory variable is a genetic
marker. The death rate due to tuberculosis is duration-
dependent, and there is no reliable measure of the duration
since first symptoms.

• The problem: If we don’t know the starting duration, we can
never know the baseline hazard function. Furthermore, if
genetic markers make a difference in survival probabilities,
then respondents with different genetic markers will tend to
have different durations at first observation because of
unequal death probabilities before first observation.

• Solution: There is no solution. You must drop all left-
truncated cases.
Sample data set with left- and right-censoring
and left-truncation

start of obs               end of obs
0    0   0   1
0     0    0   1
0   0  0     0    0   0   0   0   0    …?
0     0    0   0   0   0
?… 0     0    0   0   1
0  0     0    0   1
?… 0     0    0   0   0   0   0    …?
0   1
0   0   0   1
Next possible problem: nonproportional hazards

• The problem: What happens if a covariate has a certain
effect at some durations, but a different or no effect at other
durations?

• If this happens, estimates for the coefficient depend on the
duration of observation. This leads to confusion and the
possibility of manipulation of the results.

• Example: 2nd birth rates for teen mothers may indeed be
higher at certain durations than at others.
Graphic representation of nonproportional
hazards
[Figure: Nonproportional association between age at 1st birth and 2nd
birth rate (fictitious). y-axis: 2nd birth rate; x-axis: duration in
months; series: teen mothers, nonteen mothers]
A solution to the problem of nonproportional
hazards
Interact the covariate of interest with each duration
interval.

In this case, you can estimate an “overall” coefficient
for a teen first birth (in the omitted interval 33 to 68
months) plus interaction terms for teen * shorter and
longer intervals.
Nonproportional piecewise hazard models:
First example: no interactions: single coefficient for teen first birth

. streg teen1st hispanic nhblack nhother dur0008 dur0920 dur2132 dur69p, dist(e
> xp) nohr

Exponential regression -- log relative-hazard form
No. of subjects =         2918                      Number of obs   =      9708
No. of failures =         1741
Time at risk    =       124953
LR chi2(8)       =    878.35
Log likelihood =    -2787.6193                     Prob > chi2      =    0.0000
------------------------------------------------------------------------------
_t |      Coef.   Std. Err.       z     P>|z|        [95% Conf. Interval]
---------+--------------------------------------------------------------------
teen1st |   .0569005   .0497184      1.144   0.252       -.0405458    .1543468
hispanic |    .138043   .0704231      1.960   0.050        .0000162    .2760697
nhblack |   .0376364   .0673827      0.559   0.576       -.0944313    .1697041
nhother |   .3049904   .1096829      2.781   0.005        .0900158    .5199651
dur0008 | -3.446043    .2615858    -13.174   0.000       -3.958742   -2.933344
dur0920 | -.2052949    .0629295     -3.262   0.001       -.3286344   -.0819555
dur2132 |   .3674214   .0594054      6.185   0.000         .250989    .4838539
dur69p | -.8321701    .0970686     -8.573   0.000       -1.022421   -.6419191
_cons | -4.059757    .0506448    -80.161   0.000       -4.159019   -3.960495
------------------------------------------------------------------------------
Nonproportional piecewise hazard models:
Second example: interactions of teen first birth with duration parameters

. streg teen1st teen0008 teen0920 teen2132 teen69p hispanic nhblack nhother dur
> 0008 dur0920 dur2132 dur69p, dist(exp) nohr

Exponential regression -- log relative-hazard form
No. of subjects =         2918                      Number of obs   =      9708
No. of failures =         1741
Time at risk    =       124953
LR chi2(12)      =    897.71
Log likelihood =    -2777.9392                     Prob > chi2      =    0.0000
------------------------------------------------------------------------------
_t |      Coef.   Std. Err.       z     P>|z|        [95% Conf. Interval]
---------+--------------------------------------------------------------------
teen1st | -.0748238    .0834548     -0.897   0.370       -.2383922    .0887447
teen0008 | -.3460412    .5897907     -0.587   0.557        -1.50201    .8099273
teen0920 |    .273927   .1264363      2.167   0.030        .0261165    .5217375
teen2132 |   .2873234   .1190987      2.412   0.016        .0538943    .5207525
teen69p | -.5030413     .204028     -2.466   0.014       -.9029288   -.1031537
hispanic |   .1338842     .07044      1.901   0.057       -.0041757     .271944
nhblack |   .0293305   .0674051      0.435   0.663       -.1027811    .1614422
nhother |   .2961435   .1096925      2.700   0.007        .0811501    .5111368
dur0008 | -3.357439    .3067545    -10.945   0.000       -3.958667   -2.756211
dur0920 | -.3262495    .0839718     -3.885   0.000       -.4908312   -.1616678
dur2132 |   .2385884   .0800161      2.982   0.003        .0817596    .3954171
dur69p | -.4192567    .1643321     -2.551   0.011       -.7413418   -.0971716
_cons | -3.994679    .0588621    -67.865   0.000       -4.110047   -3.879311
------------------------------------------------------------------------------
Conclusions from nonproportional models:

It appears that while teen mothers have second births
at the same overall rates as women with a first birth
at age 20 or older, there are some differences in the
spacing of second births.

Teen mothers may have higher rates of closely spaced
second births than other women.
Practical difficulties with piecewise
nonproportional models:
Piecewise models are a rather clumsy and inefficient
way to model smoothly varying baseline hazard
rates.
The results can often depend on the choice of number
and duration for the intervals.
Some intervals may have few or no events in them.
Standard errors for such intervals will be hugely
inflated.
You must explain the results very carefully to the reader.
A final problem with rates: unmeasured
heterogeneity.
• What happens if there is an important explanatory variable
that we do not use in the model?
• We will have groups in the data set with different rates.

• OLS models frequently are missing variables that explain
some variance but are irrelevant to the questions we are
asking. Why is this a special problem for rate models?
• In the presence of unmeasured heterogeneity, rate models
produce #1: distorted baseline hazards and #2 biased
covariate estimates.
Examples of unmeasured heterogeneity:

[Figure: two panels, "Observed average hazard" and "Underlying true
hazards"; x-axis in both panels: duration]
The other problem with unmeasured
heterogeneity: biased coefficients
• In some cases, we may not be interested at all in the
duration dependencies of rates, or even in whether covariate
effects are nonproportional over time. Even if we are only
interested in the average covariate effect across generations,
unmeasured heterogeneity can trip us up.

• We will look at an example using a mover-stayer
specification, the simplest form of heterogeneity. The rate
is 0 for some unmeasured part of the population.
Mover-stayer example:

• In a fictitious study, we want to find out whether vitamin C
protects children from a strain of chicken pox.
• Some children have already had chicken pox and are immune.

• N = 4000:          1000 Vit C = 1, immune = 1
                     1000 Vit C = 1, immune = 0
                     1000 Vit C = 0, immune = 1
                     1000 Vit C = 0, immune = 0
β_0 = -2.3 (rate for no vitamin C group = .10/month)
β_1 = -.7 (rate for vitamin C group = .05/month)
Mover-stayer example:
 Vit C     Immune t = 0   t=1    t=2     t=3     t=4

0         0       1000   900    810     729     656

0         1       1000   1000   1000    1000    1000

average rate             .05    .0474   .0448   .0422

1         0       1000   950    903     858     815

1         1       1000   1000   1000    1000    1000

average rate             .025   .0241   .0236   .0231

relative rate            50%    51%     53%     55%
Solutions for unmeasured heterogeneity:

• #1: Find the unmeasured variables and measure them!

• #2: Use hifalutin’ statistics that estimate effects for
unmeasured heterogeneity.
• There is always an identification problem when you do this.

• #3: Think about the possible effects of unmeasured
heterogeneity and discuss them in your findings.
• You always want to do this anyway
Practice with hazard models: How women’s
suffrage movements succeeded

• “An event history analysis provides evidence that
gendered opportunity structures helped bring about
the political successes of the suffragists. Results
suggest the need for a broader understanding of
opportunity structure than one rooted simply in
formal political opportunities.”
– McCammon et al., 2001
How women’s suffrage movements succeeded

• Outcome of interest:
– State or territory passes major suffrage legislation

• Gendered opportunity structures include…
– new-woman index (college students, doctors, lawyers,
and women’s movements)
– proportion of neighboring states with women’s suffrage
– World War I years (lagged)
Interpreting the coefficients
from the Women’s Suffrage Paper.
• By how much does the rate of passage of suffrage laws
change in a state if (all else equal):
– The new-woman index increases by a point?
• What is the new-woman index?
–   2 of 5 neighboring states have granted suffrage?
–   The state is a Western state?
–   There is a state prohibition law?
–   Suffragists are using “separate spheres” arguments?
–   The year is 1905 instead of 1915?
Possible problems:

• Problems of causal inference: How do we know that
x caused y?
–   new woman index
–   proportion of neighboring states
–   World War I years
–   State prohibition laws and other time-forward variables

• If we accept a causal inference, do we accept the
substantive interpretation?
– Is proportion of neighboring states a “gendered
opportunity structure”?
Possible problems:

• Whether versus when
– “Whether” is a problem, because key explanatory
variables are strongly correlated with time.
• How much does the “new woman index” vary across states in a
single year?
– “When” is a problem, because of a lack of variation
•   4 states enacted suffrage from 1869 to 1909,
•   8 from 1910 to 1914
•   17 from 1917 to 1919
•   Do controls for decade help?
Possible problems:

• Sample size and degrees of freedom
– Table 2 lists 25 covariates, and the text mentions more that were
– Number of cases varies from 1,161 to 2,358. (in State*months)
– However, Table 1 lists only 31 “events”, with 2 outside the time
frame of the study, so this model is severely overparameterized
(25 covariates for 29 events).
– (What p-values did the authors use?)

• Multiple events from the same case.
– Should we drop duplicate events?
– If so, should we drop the first or the second event?

```
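
The four options listed under "Practice with rates and survivor proportions in a piecewise model" can be sketched in Python as follows. This is a minimal sketch, not part of the original lecture. It assumes the duration coefficients from the first streg output above, durations measured in months with cut points at 9, 21, 33 and 69, and a baseline woman for whom every other covariate equals 0. The function and variable names are hypothetical.

```python
import math

# Interval start points in months, taken from: stsplit durcat, at(9 21 33 69)
CUTS = [0, 9, 21, 33, 69]

# Coefficients copied from the first streg output (log relative-hazard form).
CONS = -4.0599
# One entry per interval; months 33-68 are the omitted interval, so 0.0 there.
DUR_COEFS = [-3.444608, -0.2045279, 0.3671508, 0.0, -0.8130398]


def interval_rates(xb=0.0):
    """Option 1: log rate = _cons + duration coefficient + xb; rate = exp(log rate)."""
    return [math.exp(CONS + b + xb) for b in DUR_COEFS]


def cumulative_hazard(t, rates):
    """Sum of rate * time spent in each interval up to duration t."""
    total = 0.0
    for j, rate in enumerate(rates):
        start = CUTS[j]
        end = CUTS[j + 1] if j + 1 < len(CUTS) else math.inf
        if t <= start:
            break
        total += rate * (min(t, end) - start)
    return total


def survivor(t, rates):
    """Option 2: S(t) = exp(-cumulative hazard at t)."""
    return math.exp(-cumulative_hazard(t, rates))


def median_duration(rates):
    """Option 4: duration at which S(t) = 0.5, i.e. cumulative hazard = ln 2."""
    target = math.log(2)
    total = 0.0
    for j, rate in enumerate(rates):
        start = CUTS[j]
        end = CUTS[j + 1] if j + 1 < len(CUTS) else math.inf
        step = rate * (end - start)
        if total + step >= target:
            return start + (target - total) / rate
        total += step
    return math.inf


rates = interval_rates()  # baseline woman: all other covariates equal 0
for start, rate in zip(CUTS, rates):
    print(f"interval starting at month {start}: rate = {rate:.5f}")
print(f"S(33) = {survivor(33, rates):.3f}")
# Option 3: proportion experiencing the event = 1 - S(t)
print(f"proportion with a second birth by month 33 = {1 - survivor(33, rates):.3f}")
print(f"median duration = {median_duration(rates):.1f} months")
```

To evaluate a different covariate profile, pass the sum of the relevant coefficients as `xb`, for example `interval_rates(xb=0.3124815)` for the nhother coefficient.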
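
The mover-stayer table can be reproduced with a few lines of Python. This is a sketch under the slide's assumptions: each vitamin C group has 1,000 children at risk and 1,000 immune children, and the monthly event probability among children at risk is .10 without vitamin C and .05 with it. The function name is hypothetical, and the code keeps fractional counts rather than rounding to whole children, so the last decimal can differ slightly from the slide.

```python
def average_rates(mover_rate, n_movers=1000.0, n_stayers=1000.0, months=4):
    """Observed average monthly rate when immune children (stayers) are unmeasured.

    Each month the events come only from movers still at risk, but the
    denominator also counts the stayers, who never experience the event.
    """
    observed = []
    movers = n_movers
    for _ in range(months):
        events = mover_rate * movers
        observed.append(events / (movers + n_stayers))
        movers -= events
    return observed


no_vit_c = average_rates(0.10)
vit_c = average_rates(0.05)

# Among children at risk the true relative rate is .05 / .10 = 50% in every
# month; the observed relative rate compares the two group averages instead.
relative = [v / n for v, n in zip(vit_c, no_vit_c)]

for month, (n, v, r) in enumerate(zip(no_vit_c, vit_c, relative), start=1):
    print(f"t = {month}: no vit C = {n:.4f}, vit C = {v:.4f}, relative = {r:.0%}")
```

The observed relative rate starts at the true value and then rises, because children at risk leave the no-vitamin-C group faster, so that group comes to contain a larger share of immune children.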