# The Improved Iterative Scaling Algorithm: A Gentle Introduction

Adam Berger
School of Computer Science
Carnegie Mellon University
December, 1997

This note concerns the improved iterative scaling algorithm for computing maximum-likelihood estimates of the parameters of exponential models. The algorithm was invented by members of the machine translation group at IBM's T.J. Watson Research Center in the early 1990s. The goal here is to motivate the improved iterative scaling algorithm for conditional models in a way that is as complete and self-contained as possible, yet minimizes the mathematical burden on the reader.^1
## Parametric form
The task is to come up with an accurate encapsulation of a random process. This random process produces, at each time step, some output value $y$, a member of a (necessarily finite) set of possible output values. The value of the random variable $y$ is influenced by some conditioning information (or "context") $x$. The language modelling problem, for example, is to assign a probability $p(y \mid x)$ to the event that the next word in a sequence of text will be $y$, given $x$, the values of the previous words.
We adopt a conditional exponential model

$$
p_\lambda(y \mid x) = \frac{1}{Z_\lambda(x)} \exp\left( \sum_{i=1}^{n} \lambda_i f_i(x, y) \right) \tag{1}
$$

where

- $f_i(x, y)$ is a binary-valued function, called a "feature", of $(x, y)$. Associated with the model $p_\lambda$ is some finite collection of $n$ such functions. How one might select these features is not the topic of this note. There exist methods for automatically discovering "good" features from within a large collection of candidates; see [De97] or [Be96] for details.
- $\lambda_i$ is a real-valued weight associated with $f_i$. Technically, $\lambda_i$ is the Lagrange multiplier corresponding to the function $f_i$ in a certain constrained optimization problem. In this sense, the absolute value of $\lambda_i$ is a measure of the "importance" of the feature $f_i$. We denote by $\lambda$ the vector of weights $\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$.

^1 This and other material related to exponential models, mostly of a survey nature, is available online at http://www.cs.cmu.edu/~aberger/maxent.html. Comments on and suggestions for this document should be sent to aberger@cs.cmu.edu.

- $Z_\lambda(x)$ is a normalizing factor, required to make $p_\lambda$ a probability distribution:

$$
Z_\lambda(x) = \sum_y \exp\left( \sum_{i=1}^{n} \lambda_i f_i(x, y) \right) \tag{2}
$$
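For concreteness, the model (1)-(2) can be sketched directly in code. This is our own minimal illustration, not from the note; the function and argument names are hypothetical, and it assumes a small, explicitly enumerable output set.

```python
import math

def p_lambda(x, y, features, lambdas, Y):
    """Conditional exponential model p_lambda(y | x), eqs. (1) and (2).

    features: list of n binary feature functions f_i(x, y)
    lambdas:  list of n real-valued weights lambda_i
    Y:        the (necessarily finite) set of possible outputs
    """
    def unnormalized(y_):
        return math.exp(sum(l * f(x, y_) for l, f in zip(lambdas, features)))
    Z = sum(unnormalized(y_) for y_ in Y)  # normalizer Z_lambda(x), eq. (2)
    return unnormalized(y) / Z
```

With all weights at zero the model is uniform over the outputs; raising $\lambda_i$ shifts probability mass toward outputs where $f_i$ fires.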

## Maximum likelihood
Given a joint empirical distribution $\tilde{p}(x, y)$, the log-likelihood of $\tilde{p}(x, y)$ according to a conditional model $p_\lambda(y \mid x)$ is defined as

$$
L_{\tilde{p}}(\lambda) \equiv \sum_{x, y} \tilde{p}(x, y) \log p_\lambda(y \mid x) \tag{3}
$$

We employ the log-likelihood as a measure of the quality of the model $p_\lambda$. From (3) we can immediately see that

- $L_{\tilde{p}}(\lambda) \le 0$ always;
- $L_{\tilde{p}}(\lambda) = 0$ is optimal, attained by a model $p_\lambda$ which is "perfect" with respect to $\tilde{p}$; that is, $p_\lambda(y \mid x) = 1$ if and only if $\tilde{p}(x, y) > 0$.

Given the set of features $\{f_1, f_2, \ldots, f_n\}$, the exponential form (1), and an empirical distribution $\tilde{p}(x, y)$, the maximum likelihood problem is to discover $\lambda^\star \equiv \arg\max_\lambda L_{\tilde{p}}(\lambda)$, which is a search in $\mathbb{R}^n$. In the following section we will describe how to perform this search efficiently.

The log-likelihood of the exponential model (1) is

$$
L_{\tilde{p}}(\lambda) = \sum_{x, y} \tilde{p}(x, y) \sum_i \lambda_i f_i(x, y) - \sum_x \tilde{p}(x) \log \sum_y \exp \sum_i \lambda_i f_i(x, y)
$$
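For small discrete problems the log-likelihood (3) is straightforward to compute. The sketch below is our own (names hypothetical), representing $\tilde{p}(x, y)$ as a dictionary of probabilities.

```python
import math

def log_likelihood(samples, features, lambdas, Y):
    """L_ptilde(lambda) of eq. (3): sum of ptilde(x, y) * log p_lambda(y | x).

    samples: dict mapping (x, y) pairs to empirical probabilities ptilde(x, y)
    """
    total = 0.0
    for (x, y), p in samples.items():
        scores = [math.exp(sum(l * f(x, y_) for l, f in zip(lambdas, features)))
                  for y_ in Y]
        s_xy = math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
        total += p * math.log(s_xy / sum(scores))  # log p_lambda(y | x)
    return total
```

As the text notes, the value is never positive, and equals zero only for a model that is perfect with respect to $\tilde{p}$.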

Differentiating with respect to an individual parameter $\lambda_i$, we get

$$
\frac{\partial L_{\tilde{p}}(\lambda)}{\partial \lambda_i} = \sum_{x, y} \tilde{p}(x, y) f_i(x, y) - \sum_{x, y} \tilde{p}(x) \, p_\lambda(y \mid x) f_i(x, y) = \langle f_i \rangle_{\tilde{p}} - \langle f_i \rangle_{p_\lambda} \tag{4}
$$

Here $\langle f_i \rangle_{\tilde{p}}$ denotes the expectation of $f_i(x, y)$ with respect to the empirical distribution $\tilde{p}$, and $\langle f_i \rangle_{p_\lambda}$ the expectation of $f_i(x, y)$ with respect to the distribution $\tilde{p}(x) \, p_\lambda(y \mid x)$.
Setting (4) to zero yields the condition for an extremum of the log-likelihood with respect to the single parameter $\lambda_i$. And the resulting condition, that the empirical expected value of the feature $f_i$ be equal to the model expected value, is a very natural condition.

When $f_i(x, y)$ is binary-valued, as is the case here, this condition has an especially intuitive interpretation: the expected fraction of events $(x, y)$ for which $f_i$ is "on" (non-zero) should be the same according to the empirical distribution $\tilde{p}(x, y)$ and the model distribution $\tilde{p}(x) \, p_\lambda(y \mid x)$.
## Finding $\lambda^\star$

Say we have a model of the form (1) with some arbitrary set of parameters $\lambda \equiv \{\lambda_1, \lambda_2, \ldots, \lambda_n\}$. We'd like to find a new set of parameters $\lambda + \Delta \equiv \{\lambda_1 + \delta_1, \lambda_2 + \delta_2, \ldots, \lambda_n + \delta_n\}$ which yields a model of higher log-likelihood. If we can find a procedure (a growth transformation) $\lambda \mapsto \lambda + \Delta$ which takes one set of parameters as input and produces a new set as output which is not inferior, we can apply the transformation until we reach its fixed point^2, a stationary point for $L_{\tilde{p}}$.
^2 For those familiar with the EM algorithm, this is reminiscent of how one iterates until reaching a stationary point of the auxiliary function $Q(\lambda' \mid \lambda)$. Unlike the EM algorithm, which guarantees only a locally optimal solution, the IIS algorithm converges to the unique maximum.

With respect to a given empirical distribution $\tilde{p}(x, y)$, the change in log-likelihood from the model $\lambda$ to the model $\lambda + \Delta$ is

$$
\begin{aligned}
L_{\tilde{p}}(\lambda + \Delta) - L_{\tilde{p}}(\lambda)
&= \sum_{x, y} \tilde{p}(x, y) \log p_{\lambda + \Delta}(y \mid x) - \sum_{x, y} \tilde{p}(x, y) \log p_\lambda(y \mid x) \\
&= \sum_{x, y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) - \sum_x \tilde{p}(x) \log \frac{Z_{\lambda + \Delta}(x)}{Z_\lambda(x)}
\end{aligned}
$$
We now make use of the inequality $-\log \alpha \ge 1 - \alpha$ (true for all $\alpha > 0$) to establish a lower bound on the change in likelihood:

$$
\begin{aligned}
L_{\tilde{p}}(\lambda + \Delta) - L_{\tilde{p}}(\lambda)
&\ge \sum_{x, y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \frac{Z_{\lambda + \Delta}(x)}{Z_\lambda(x)} \\
&= \sum_{x, y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \frac{\sum_y \exp \sum_i (\lambda_i + \delta_i) f_i(x, y)}{\sum_y \exp \sum_i \lambda_i f_i(x, y)} \\
&= \underbrace{\sum_{x, y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \sum_y p_\lambda(y \mid x) \exp \sum_i \delta_i f_i(x, y)}_{\text{Call this } A(\Delta \mid \lambda)}
\end{aligned} \tag{5}
$$
Since $L_{\tilde{p}}(\lambda + \Delta) - L_{\tilde{p}}(\lambda) \ge A(\Delta \mid \lambda)$, we know that if we can find a $\Delta$ for which $A(\Delta \mid \lambda) > 0$, then the model $\lambda + \Delta$ is an improvement (in terms of log-likelihood) over the model $\lambda$. A greedy strategy for optimizing the parameters of a log-linear model of the form (1), then, is to find the $\Delta$ which maximizes $A(\Delta \mid \lambda)$, set $\lambda \leftarrow \lambda + \Delta$, and repeat. So long as $A(\Delta \mid \lambda) > 0$, we're guaranteed an improvement in likelihood by this technique.
The straightforward approach, then, would be to maximize $A(\Delta \mid \lambda)$ with respect to each $\delta_i$. Unfortunately, this doesn't quite work: differentiating $A(\Delta \mid \lambda)$ with respect to $\delta_i$ yields an equation containing $\{\delta_1, \delta_2, \ldots, \delta_n\}$; in other words, the constraint equations for the $\delta_i$ will be coupled. To get around this, we'll need the quantity

$$
f^{\#}(x, y) \equiv \sum_i f_i(x, y)
$$

If the $f_i$ are binary-valued, $f^{\#}(x, y)$ has the simple interpretation of the number of features which "apply" (are non-zero) at $(x, y)$. We can rewrite $A(\Delta \mid \lambda)$ as

$$
A(\Delta \mid \lambda) = \sum_{x, y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \sum_y p_\lambda(y \mid x) \exp\left( f^{\#}(x, y) \sum_i \frac{f_i(x, y)}{f^{\#}(x, y)} \, \delta_i \right) \tag{6}
$$

Notice that $\frac{f_i(x, y)}{f^{\#}(x, y)}$ is a p.d.f. over $i$, since it is always non-negative and sums to one. This means we can apply Jensen's inequality, namely, for a p.d.f. $p(x)$,

$$
\exp \sum_x p(x) q(x) \le \sum_x p(x) \exp q(x)
$$

to rewrite (6) as

$$
A(\Delta \mid \lambda) \ge \underbrace{\sum_{x, y} \tilde{p}(x, y) \sum_i \delta_i f_i(x, y) + 1 - \sum_x \tilde{p}(x) \sum_y p_\lambda(y \mid x) \sum_i \frac{f_i(x, y)}{f^{\#}(x, y)} \, e^{\delta_i f^{\#}(x, y)}}_{\text{Call this } B(\Delta \mid \lambda)} \tag{7}
$$
$B(\Delta \mid \lambda)$ is a new, not as tight, lower bound on the change in log-likelihood. That is,

$$
L_{\tilde{p}}(\lambda + \Delta) - L_{\tilde{p}}(\lambda) \ge B(\Delta \mid \lambda)
$$

Differentiating $B(\Delta \mid \lambda)$ with respect to $\delta_i$ gives

$$
\frac{\partial B(\Delta \mid \lambda)}{\partial \delta_i} = \sum_{x, y} \tilde{p}(x, y) f_i(x, y) - \sum_x \tilde{p}(x) \sum_y p_\lambda(y \mid x) f_i(x, y) \, e^{\delta_i f^{\#}(x, y)} \tag{8}
$$

What's nice about (8) is that $\delta_i$ appears alone, without any other free parameters. Thus we can solve for each of the $n$ free parameters $\{\delta_1, \delta_2, \ldots, \delta_n\}$ individually, by differentiating $B(\Delta \mid \lambda)$ with respect to each $\delta_i$ in turn. This suggests an iterative algorithm for finding the optimal values of $\lambda_1, \lambda_2, \ldots, \lambda_n$:

## IIS Algorithm

- Start with some (arbitrary) value for each $\lambda_i$.
- Repeat until convergence:
    - Solve $\frac{\partial B(\Delta \mid \lambda)}{\partial \delta_i} = 0$ in (8) for $\delta_i$.
    - Set $\lambda_i \leftarrow \lambda_i + \delta_i$.
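Putting the pieces together, here is a minimal sketch of the whole procedure, assuming small discrete sets of contexts and outputs. The helper names and the use of bisection to solve (8) for each $\delta_i$ are our choices for illustration; the note itself does not prescribe a root-finding method.

```python
import math

def iis(samples, features, Y, iterations=50):
    """Improved iterative scaling for the exponential model (1).

    samples:  dict mapping (x, y) pairs to empirical probabilities ptilde(x, y)
    features: list of n binary feature functions f_i(x, y)
    Y:        the finite set of possible outputs
    Returns the fitted weights lambda_1 ... lambda_n.
    """
    n = len(features)
    lambdas = [0.0] * n                       # arbitrary starting values
    px = {}                                   # marginal ptilde(x)
    for (x, y), p in samples.items():
        px[x] = px.get(x, 0.0) + p
    f_sharp = lambda x, y: sum(f(x, y) for f in features)   # f#(x, y)
    emp = [sum(p * f(x, y) for (x, y), p in samples.items()) for f in features]

    def p_cond(x):
        # model distribution p_lambda(. | x) under the current weights
        s = [math.exp(sum(l * f(x, y) for l, f in zip(lambdas, features)))
             for y in Y]
        Z = sum(s)
        return [v / Z for v in s]

    for _ in range(iterations):
        deltas = []
        for i in range(n):
            def g(d):
                # dB/ddelta_i of eq. (8); monotone decreasing in d
                total = 0.0
                for x, p in px.items():
                    pc = p_cond(x)
                    total += p * sum(pc[j] * features[i](x, y) *
                                     math.exp(d * f_sharp(x, y))
                                     for j, y in enumerate(Y))
                return emp[i] - total
            lo, hi = -10.0, 10.0              # bisection for the root of g
            for _ in range(60):
                mid = (lo + hi) / 2.0
                if g(mid) > 0.0:
                    lo = mid
                else:
                    hi = mid
            deltas.append((lo + hi) / 2.0)
        # update all weights at once, as the algorithm prescribes
        lambdas = [l + d for l, d in zip(lambdas, deltas)]
    return lambdas
```

When $f^{\#}(x, y)$ equals the same constant $M$ for every $(x, y)$, equation (8) has the closed-form solution $\delta_i = \frac{1}{M} \log \left( \langle f_i \rangle_{\tilde{p}} / \langle f_i \rangle_{p_\lambda} \right)$, and no numerical root-finding is needed.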

## References

[Be96] A. Berger, S. Della Pietra, and V. Della Pietra (1996). A maximum entropy approach to natural language processing. Computational Linguistics, 22(1), 39-71.

[De97] S. Della Pietra, V. Della Pietra, and J. Lafferty (1997). Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4), 380-393.
