

Local Parametric Modeling via U-Divergence

Shinto Eguchi
Institute of Statistical Mathematics, Japan and
Department of Statistical Science, Graduate University of Advanced Studies
Minami-Azabu 4-6-1, Minato,
Tokyo 106-8569, Japan
Email: eguchi@ism.ac.jp


This paper discusses local parametric modeling by the use of U-divergence in statistical pattern
recognition. The class of U-divergence measures admits an empirical loss function of a common simple
form, and includes the Kullback-Leibler divergence, the power divergence and the mean squared error. We propose
a minimization algorithm for parametric models of sequentially increasing dimension by incorporating
kernel localization into the loss; this is a boosting algorithm with spatial information. The objective of this
paper is to accommodate local and global fitting simultaneously in statistical pattern recognition,
and the approach extends to non-parametric estimation of density and regression functions.

Key words: Statistical pattern recognition, kernel function, local likelihood

   Let p(x, θ) be a statistical model of probability density functions with a parameter θ in a space
Θ. In statistical inference the idea of the likelihood function is fundamental, common even across
different paradigms. The maximum likelihood method is theoretically supported in the
framework of the exponential family through the notion of minimal sufficiency, and the Kullback-Leibler
divergence is closely related to maximum likelihood estimation. The extended
Kullback-Leibler divergence between density functions p(x) and q(x) is defined by

(1)        D_{KL}(p, q) = \int \Big\{ p(x) \log \frac{p(x)}{q(x)} + q(x) - p(x) \Big\} \Lambda(dx),
where Λ is a sigma-finite measure on the data space. Note that if p and q are probability
density functions, then DKL reduces to the usual one. The maximum log-likelihood function
approximates minus the Kullback-Leibler divergence up to a constant, that is,

           \frac{1}{n} \sum_{i=1}^{n} \log p(x_i, \hat{\theta}) = -D_{KL}(p, p(\cdot, \hat{\theta})) + \mathrm{const.},

where \hat{\theta} is the maximum likelihood estimator and \{x_1, \ldots, x_n\} is a set of observations generated
from the true density function p(x). This observation yields the important property of
consistency for the maximum likelihood estimator, because the Kullback-Leibler divergence has a
discriminative property, that is, D_{KL}(p, q) \ge 0 with equality if and only if p = q a.e. \Lambda. This
property has been extended to local likelihood estimation by introducing a local version
of the extended Kullback-Leibler divergence, cf. [6]. The local likelihood function is

           L(\theta, x, h) = \frac{1}{n} \sum_{i=1}^{n} K_h(x_i, x) \log p(x_i, \theta) - \int K_h(x', x)\, p(x', \theta)\, \Lambda(dx'),

where K_h(\cdot, x) is a kernel function with center x and bandwidth h. The derivation is
based on weighting the integrand on the right-hand side of (1) by the kernel function K_h(\cdot, x). The
maximizer of L(\theta, x, h) with respect to \theta depends on x and h, so that only the information
from data around the target point x is highly utilized, with the locality controlled by h.
Several proposals of local likelihood have been studied in [3], [9]; see also the related references therein.
    In statistical pattern recognition the AdaBoost algorithm has recently been proposed and shown
to perform efficiently, see [5]. The method is apparently different from maximum
likelihood in a logistic regression model. However, the exponential loss function from which
AdaBoost derives is closely connected with the log-likelihood function: the exponential loss
can be viewed as an approximation of the extended Kullback-Leibler divergence from an exponential
family, cf. [7]. This view extends further to the U-divergence, which includes the Kullback-Leibler
divergence as a special case with the choice U = exp. See [8] for the U-Boost algorithm, including
a method most robust against outliers in the feature space, and [10] for Eta-Boost,
which is robust against misclassified observations.
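The connection between the exponential loss and the logistic model can be checked pointwise: at a fixed x with p = p(y = +1 | x), the population exponential loss p e^{-F} + (1-p) e^{F} is minimized at F* = (1/2) log{p/(1-p)}, half the logistic log-odds. A small numerical confirmation (our own sketch, not from the paper):

```python
import numpy as np

# At a fixed x, let p = p(y = +1 | x). The pointwise exponential loss is
# l(F) = p * exp(-F) + (1 - p) * exp(F); its minimizer is half the log-odds.
p = 0.8
F_grid = np.linspace(-3.0, 3.0, 60001)
loss = p * np.exp(-F_grid) + (1.0 - p) * np.exp(F_grid)
F_num = F_grid[np.argmin(loss)]                 # numerical minimizer
F_star = 0.5 * np.log(p / (1.0 - p))            # half the logistic log-odds
```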
    In this talk I consider a localized version of the U-divergence and propose the
local U-Boost algorithm. The algorithm aims at local learning in the feature space using the
kernel function. The U-Boost algorithm requires a 'stagewise' sub-step of functional optimization,
in which all the classification machines are tested on the training data set and the
best-performing machine is selected. The kernel weighting, on the other hand, requires fixing an arbitrary
target point x, which must be held until the final classification machine is applied to predict
the class label in the feature space. To reconcile these different requirements we make use of
resampling, in which the target point x is sequentially generated uniformly from the training
data set, and we selectively combine all the machines in the algorithm with the kernel weight
centered at the target point x. The performance relative to the U-Boost algorithm will be elucidated
from the information-geometric point of view.

Minimum U-divergence method
    Let M be a positive cone of density functions on a data space D in \mathbb{R}^p and let F be a linear space
of square-integrable functions. We choose a convex function U on the real axis and denote by u its
derivative function and by \xi the inverse function of u, assuming that u is positive.
Typical examples are (U(t), u(t), \xi(t)) = (\exp t, \exp t, \log t) or

(2)        (U(t), u(t), \xi(t)) = \left( \frac{(1+\beta t)^{(\beta+1)/\beta}}{1+\beta},\ (1+\beta t)^{1/\beta},\ \frac{t^{\beta}-1}{\beta} \right).
A canonical functional is defined by \varphi_U(f) = \int_D U(f(z)) \Lambda(dz), which is convex because
of the convexity of U. The convexity leads to the conjugate convex functional

(3)        \varphi_U^*(p) = \sup_{f \in F} \{ \langle f, p \rangle - \varphi_U(f) \},

where \langle f, p \rangle = \int f(z) p(z) \Lambda(dz). By a variational argument the supremum in (3) is attained at f^* =
\xi(p), or equivalently p = u(f^*), so that \varphi_U^*(p) = \langle \xi(p), p \rangle - \varphi_U(\xi(p)). Hence the positive function
u connects F with M. The U-divergence is defined by

           D_U(p, f) = \varphi_U(f) + \varphi_U^*(p) - \langle f, p \rangle.

By definition, D_U(p, f) \ge 0 with equality if and only if f = \xi(p). The U-function given by (2)
associates with the power divergence, cf. [1]. For a given data set \{z_1, \ldots, z_n\} the minimum U-divergence
method is formally given as follows. Let F_0 = \{f(z, \theta) : \theta \in \Theta\}, in which we usually
model f(z, \theta) so that p(z, \theta) = u(f(z, \theta)) is a probability density function. Then
the empirical loss function is

(4)        L^{(U)}(\theta) = -\frac{1}{n} \sum_{i=1}^{n} f(z_i, \theta) + \int_D U(f(z, \theta)) \Lambda(dz),

so that the minimum U-divergence estimator is \hat{\theta}^{(U)} = \arg\min_{\theta \in \Theta} L^{(U)}(\theta). For example,
if U = exp and f(z, \theta) is linear in \theta, then the estimator reduces to the maximum likelihood
estimator under the exponential family p(z, \theta) = \exp\{\theta^T f(z) - \psi(\theta)\} with canonical statistic
f(z) and cumulant function \psi(\theta). See [4] for several applications with U-functions other
than exp.
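The triple in (2) can be verified numerically: u should be the derivative of U, and ξ the inverse of u. A quick sketch with an arbitrary β (our own code, not from the paper):

```python
import numpy as np

# Numerical check of the beta-power family in (2): u = U' and xi = u^{-1}.
beta = 0.5

def U(t):
    return (1.0 + beta * t) ** ((beta + 1.0) / beta) / (1.0 + beta)

def u(t):
    return (1.0 + beta * t) ** (1.0 / beta)

def xi(t):
    return (t ** beta - 1.0) / beta

t = 0.3
eps = 1e-6
deriv_gap = abs((U(t + eps) - U(t - eps)) / (2.0 * eps) - u(t))  # central difference vs u
inverse_gap = abs(xi(u(t)) - t)                                  # xi inverts u
# As beta -> 0 the family tends to (exp t, exp t, log t), the Kullback-Leibler case.
```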

Local learning algorithm
    We propose a localization of the empirical U-loss function (4) for statistical pattern recog-
nition. For this we set up a simple framework as follows. Let X and Y be an input vector and a
binary label with values ±1, whose joint distribution is decomposed as

            r(x, y) = p(y|x)q(x),

with the conditional distribution p(y|x) of Y given X = x and the marginal density q(x) of X.
We thus confine our discussion to binary classification, but the arguments
extend almost directly to multiclass classification.
    Let F_0 be the linear space F_0 = \{f(x, \theta) = \sum_{j=1}^{J} \theta_j f_j(x) \mid \theta = (\theta_1, \ldots, \theta_J)\} spanned
by learning machines f_j(x), where each f_j itself defines a decision machine x \mapsto y
by y = \mathrm{sgn}(f_j(x)) for the prediction of the class label. We propose a local version of the U-loss
function

           L_U(\theta, x, h) = \frac{1}{n} \sum_{i=1}^{n} U(-y_i K_h(x_i, x) f(x_i, \theta))

for a given data set \{(x_1, y_1), \ldots, (x_n, y_n)\}. Note that the contribution of (x_i, y_i) to the loss
function is controlled by the weight K_h(x_i, x), which assigns less weight to the i-th
observation (x_i, y_i) the larger the distance between x_i and x.
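For U = exp the local U-loss can be evaluated directly. The sketch below (our own illustrative names and synthetic data) checks that a score aligned with the labels yields a smaller local loss than the reversed score.

```python
import numpy as np

# Local U-loss for U = exp:
#   L_U(theta, x0, h) = mean_i exp(-y_i K_h(x_i, x0) f(x_i, theta)).
def kernel(row, x0, h):
    return np.exp(-0.5 * np.sum((row - x0) ** 2) / h ** 2)

def local_u_loss(theta, X, y, x0, h):
    f = X @ theta                                    # linear score f(x_i, theta)
    w = np.array([kernel(row, x0, h) for row in X])  # kernel weights K_h(x_i, x0)
    return float(np.mean(np.exp(-y * w * f)))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=200))    # noisy labels from 1st coordinate
theta = np.array([1.0, 0.0])                         # score aligned with the rule
loss_good = local_u_loss(theta, X, y, X[0], 1.0)
loss_bad = local_u_loss(-theta, X, y, X[0], 1.0)     # reversed score
```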
    The sequential optimization is defined by

(5)        \hat{\theta}_j(f, x_{(j)}, h) = \arg\min_{\theta \in \mathbb{R}} \frac{1}{n} \sum_{i=1}^{n} U\big( -y_i \{ F_{j-1}(x_i) + \theta K_h(x_i, x_{(j)}) f(x_i) \} \big),

where x_{(j)} is generated uniformly from \{x_1, \ldots, x_n\} and F_{j-1}(x) = \sum_{k=1}^{j-1} \hat{\theta}_k K_h(x, x_{(k)}) f_k(x). The key
of the local U-Boost algorithm is kernel localization in the weighted error rate at the j-th step,

(6)        \mathrm{err}_j(f, h) = \frac{\sum_{i=1}^{n} I(y_i \ne f(x_i)) w_j(i)}{\sum_{i=1}^{n} w_j(i)},

where I(\cdot) is the indicator function and w_j(i) = u(-y_i F_{j-1}(x_i)), with u the derivative of U. If
U = exp, we call the algorithm local AdaBoost, in which (5) has the closed form
\hat{\theta}_j(f, x_{(j)}, h) = \frac{1}{2} \log\{(1 - \mathrm{err}_j(f, h))/\mathrm{err}_j(f, h)\}.
     Let us begin by setting the initial weights over the data set to w_1(i) = 1/n. Using the
weighted error rate \mathrm{err}_j(f, h) defined by (6), the following three sub-steps are iterated for
j = 1, \ldots, J:
1. Select f^*_{(j)} = \arg\min_{f \in F_0} \mathrm{err}_j(f, h), with a preliminarily fixed class F_0 of classifiers.
2. Update F_{j-1}(x) to F_j(x) = F_{j-1}(x) + \hat{\theta}^*_j K_h(x, x_{(j)}) f^*_{(j)}(x), where \hat{\theta}^*_j = \hat{\theta}_j(f^*_{(j)}, x_{(j)}, h) as in (5).
3. Update \mathrm{err}_{j+1}(f, h) through w_{j+1}(i) = u(-y_i F_j(x_i)), and iterate while j < J.
Finally, the classifier h_J(x) = \mathrm{sign}(F_J(x)) is defined by

           F_J(x) = \sum_{j=1}^{J} K_h(x, x_{(j)}) \hat{\theta}^*_j f^*_{(j)}(x).
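The three sub-steps can be sketched as follows for U = exp (local AdaBoost), using the closed-form coefficient for U = exp. This is a hedged reconstruction with our own illustrative base machines, data, and names, not the paper's implementation.

```python
import numpy as np

# Hedged sketch of the local AdaBoost loop (U = exp, so u = exp as well).
rng = np.random.default_rng(2)
n, J, h = 300, 20, 1.5
X = rng.normal(size=(n, 2))
y = np.sign(X[:, 0] + 0.3 * rng.normal(size=n))      # noisy labels from 1st coordinate

# a small class F_0 of coordinate-sign classifiers f(x) = s * sign(x_k)
machines = [lambda Z, k=k, s=s: s * np.sign(Z[:, k])
            for k in range(2) for s in (1.0, -1.0)]

def kernel(Z, x0, h):
    return np.exp(-0.5 * np.sum((Z - x0) ** 2, axis=1) / h ** 2)

F = np.zeros(n)                                      # F_0 = 0, so w_1(i) is uniform
for j in range(J):
    x0 = X[rng.integers(n)]                          # target point x_(j), resampled
    w = np.exp(-y * F)                               # w_j(i) = u(-y_i F_{j-1}(x_i))
    # sub-step 1: machine minimizing the weighted error rate (6)
    errs = [np.sum(w * (m(X) != y)) / np.sum(w) for m in machines]
    best = int(np.argmin(errs))
    err = min(max(errs[best], 1e-12), 1.0 - 1e-12)
    theta = 0.5 * np.log((1.0 - err) / err)          # closed form under U = exp
    # sub-step 2: kernel-localized update of the score
    F = F + theta * kernel(X, x0, h) * machines[best](X)
    # sub-step 3 is implicit: w is recomputed from F at the top of the loop

train_acc = float(np.mean(np.sign(F) == y))
```

The kernel factor makes each selected machine contribute mainly in a neighborhood of its target point, which is the "spatial information" distinguishing the algorithm from plain U-Boost.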
Remark 1
In the local likelihood method the parameter θ of interest depends on the target point, or
kernel center, x at which the density or conditional expectation is to be predicted.
However, the boosting algorithm requires the functional minimization in
sub-step 1 above, which is said to be 'stagewise', see [3]. For this we generate x_{(j)} uniformly
from the set \{x_1, \ldots, x_n\}. The local loss highlighting a neighborhood of x_{(j)} leads to the optimal
decision machine f^*_{(j)} with combining coefficient \hat{\theta}^*_j. Thus the selected machine f^*_{(j)} can be
expected to perform well for prediction over the neighborhood of x_{(j)}.
Remark 2
Let us pursue the meaning of the random generation of the target point x in the local
U-Boost algorithm. If the number of iterations were infinite, the local version of the U-loss
function would become

(7)        \frac{1}{n} \sum_{i=1}^{n} U(-y_i \bar{K}_h(x_i) f(x_i, \theta)),

where \bar{K}_h(x_i) = \frac{1}{n} \sum_{l=1}^{n} K_h(x_i, x_l). However, the behaviors of the local U-Boost algorithm and
the sequential minimization algorithm for the averaged U-loss function (7) are clearly different,
similarly to the comparison between the boosting and bagging algorithms. In this
sense the proposed algorithm is a fusion of boosting and bagging, with biased resampling based
on kernel weighting.
Remark 3
The selection of the bandwidth h is an important issue for this proposal. In several empirical
simulation studies we observe that an inappropriate choice of h leads to poor performance in the
classification tasks. We implement the selection of h by K-fold cross-validation,
in which the data set to be trained is partitioned into K sub-samples; the learning
algorithm is run on the K − 1 joined sub-samples and the U-loss function is evaluated on
the remaining sub-sample, over all possible partitions. The method gave a sensible choice
of h in our simulation study.
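The cross-validation scheme can be sketched as follows. In this minimal stand-in of our own, a kernel-weighted vote plays the role of the trained machine, and the exponential U-loss is evaluated on each held-out fold; names and data are illustrative.

```python
import numpy as np

# Illustrative K-fold cross-validation for the bandwidth h. The "trained
# machine" is a simple kernel-weighted vote, standing in for local U-Boost.
rng = np.random.default_rng(3)
n, K = 200, 5
X = rng.normal(size=(n, 1))
y = np.sign(np.sin(2.0 * X[:, 0]) + 0.2 * rng.normal(size=n))

idx = rng.permutation(n)
folds = np.array_split(idx, K)                      # fixed K-fold partition

def score(Xtr, ytr, Xte, h):
    """Kernel-weighted vote F(x) = sum_i K_h(x, x_i) y_i / sum_i K_h(x, x_i)."""
    d2 = (Xte[:, None, 0] - Xtr[None, :, 0]) ** 2
    Kmat = np.exp(-0.5 * d2 / h ** 2)
    return Kmat @ ytr / Kmat.sum(axis=1)

def cv_loss(h):
    losses = []
    for k in range(K):
        te = folds[k]
        tr = np.concatenate([folds[m] for m in range(K) if m != k])
        F = score(X[tr], y[tr], X[te], h)
        losses.append(np.mean(np.exp(-y[te] * F)))  # held-out U-loss, U = exp
    return float(np.mean(losses))

candidates = [0.05, 0.2, 0.5, 2.0, 10.0]
h_best = min(candidates, key=cv_loss)               # bandwidth with smallest CV loss
```

An oversmoothing bandwidth averages the labels across regions where the decision boundary changes, so the cross-validated U-loss discourages it.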
[1] Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. Robust and efficient estimation by minimising
    a density power divergence. Biometrika 85 (1998), 549-559.
[2] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. Least angle regression Ann. Statist. 32
    (2004), 407-499.
[3] Eguchi, S., and Copas, J.B. A class of local likelihood methods and near-parametric asymptotics.
    J. Royal Statist. Soc. B, 60 (1998), 709-724.
[4] Eguchi, S., and Copas, J. B. A class of logistic type discriminant functions.
     Biometrika 89 (2002), 1–22.
[5] Freund, Y., and Schapire, R. E. A decision-theoretic generalization of on-line learning and an
    application to boosting. J. Computer and System Sciences 55 (1997), 119–139.
[6] Hjort, N. L. and Jones, M. C. Locally parametric nonparametric density estimation. Ann. Statist.
    24 (1996), 1619-1647.
[7] Lebanon, G. and Lafferty, J. Boosting and maximum likelihood for exponential models. Advances
    in Neural Information Processing Systems 14 (2002).
[8] Murata, N., Takenouchi, T., Kanamori, T. and Eguchi. S. Information geometry of U-Boost and
    Bregman divergence. Neural Computation 16 (2004), 1437-1481.
[9] Park, B. U. Kim, W. C. and Jones, M. C., On local likelihood of density estimation. Ann. Statist.
    30 (2002), 1480-1495.
[10] Takenouchi, T. and Eguchi, S. Robustifying AdaBoost by adding the naive error rate. Neural
    Computation 16 (2004), 767-787.
[11] Kawakita, M. and Eguchi, S. Boosting method for local learning in statistical pattern recognition.
    In preparation.
