Local Parametric Modeling via U-Divergence

Shinto Eguchi
Institute of Statistical Mathematics, Japan, and Department of Statistical Science, Graduate University for Advanced Studies, Minami-Azabu 4-6-1, Minato, Tokyo 106-8569, Japan
Email: eguchi@ism.ac.jp

Abstract

This paper discusses local parametric modeling by the use of U-divergence in statistical pattern recognition. Every member of the class of U-divergence measures, which includes the Kullback-Leibler divergence, the power divergence and mean squared error, yields an empirical loss function of the same simple form. We propose a minimization algorithm for parametric models of sequentially increasing dimension by incorporating kernel localization; this is a boosting algorithm with spatial information. The objective of this paper is to accommodate local and global fitting simultaneously in statistical pattern recognition, and the approach extends to nonparametric estimation of density functions and regression functions.

Key words: Statistical pattern recognition, kernel function, local likelihood

Introduction

Let p(x, θ) be a statistical model of probability functions with a parameter θ in the space Θ. In statistical inference the idea of the likelihood function is fundamental, even across different paradigms. The maximum likelihood method is theoretically supported in the framework of the exponential family through the notion of minimal sufficiency, and the Kullback-Leibler divergence is closely related to maximum likelihood estimation. The extended Kullback-Leibler divergence between density functions p(x) and q(x) is defined by

(1)    D_KL(p, q) = ∫ { p(x) log [p(x)/q(x)] + q(x) − p(x) } Λ(dx),

where Λ is a sigma-finite measure on the data space. Note that if p and q are probability density functions, then D_KL reduces to the usual Kullback-Leibler divergence. The maximum log-likelihood approximates minus the Kullback-Leibler divergence up to a constant, that is,

(1/n) Σ_{i=1}^n log p(x_i, θ̂) = −D_KL(p, p(·, θ̂)) + const.,

where θ̂ is the maximum likelihood estimator and {x_1, ..., x_n} is a set of observations generated from the true density function p(x). This observation yields the important consistency property of the maximum likelihood estimator, because the Kullback-Leibler divergence is discriminative, that is, D_KL(p, q) ≥ 0 with equality if and only if p = q a.e. Λ. This property has been extended to local likelihood estimation by introducing a local version of the extended Kullback-Leibler divergence, cf. [6]. The local likelihood function is

L(θ, x, h) = (1/n) Σ_{i=1}^n K_h(x_i, x) log p(x_i, θ) − ∫ K_h(x′, x) p(x′, θ) Λ(dx′),

where K_h(·, x) is a kernel function with center x and bandwidth h. The derivation is based on weighting the integrand on the right-hand side of (1) by the kernel function K_h(·, x). The maximizer of L(θ, x, h) with respect to θ depends on x and h, so that mainly the information from data around the target point x is utilized, with the degree of locality controlled by h. Several proposals for local likelihood have been studied in [3] and [9]; see also the references therein.

In statistical pattern recognition the AdaBoost algorithm has recently been proposed and shown to perform efficiently, see [5]. The method appears quite different from maximum likelihood in a logistic regression model; however, the exponential loss function from which AdaBoost is derived is closely connected with the log-likelihood function. The exponential loss can be viewed as an approximation to the extended Kullback-Leibler divergence from an exponential family, cf. [7]. This view is further extended to the U-divergence, which includes the Kullback-Leibler divergence as a special case under the choice U = exp. See [8] for the U-Boost algorithm, including a method robust against outliers in the feature space, and [10] for Eta-Boost, which is robust against misclassified observations.
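Returning to the local likelihood above, the kernel-weighting idea can be sketched in a few lines of Python. The example below is a minimal illustration rather than the full local likelihood of the text (the integral correction term is omitted): it assumes a Gaussian kernel K_h and a Gaussian working model N(μ, σ²), for which the kernel-weighted log-likelihood is maximized in closed form by the weighted mean and variance; the function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(xi, x, h):
    """K_h(xi, x): a Gaussian kernel centred at the target point x with bandwidth h."""
    return np.exp(-0.5 * ((xi - x) / h) ** 2)

def local_normal_fit(data, x, h):
    """Maximise the kernel-weighted Gaussian log-likelihood around the target x.

    For the N(mu, sigma^2) working model the weighted maximiser has a closed
    form: the kernel-weighted sample mean and variance.
    """
    w = gaussian_kernel(data, x, h)
    w = w / w.sum()                       # normalised weights K_h(x_i, x)
    mu = np.sum(w * data)                 # weighted mean
    var = np.sum(w * (data - mu) ** 2)    # weighted variance
    return mu, var

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=500)
mu_loc, var_loc = local_normal_fit(data, x=2.0, h=1.0)
```

Because the kernel downweights observations far from the target, the local variance estimate is smaller than the global one; shrinking h strengthens the localization, mirroring the role of the bandwidth h in L(θ, x, h).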
In this talk I would like to consider a localized version of the U-divergence and to propose the local U-Boost algorithm, which aims at local learning in the feature space using a kernel function. The U-Boost algorithm needs a 'stagewise' sub-step for functional optimization, in which all the classification machines are tested on the training data set and the best-performing machine is selected. The kernel weighting, on the other hand, requires fixing an arbitrary target point x, which must be retained until the final classification machine is applied to predict the class label in the feature space. To reconcile these different requirements we make use of resampling: the target point x is sequentially generated uniformly from the training data set, and all the machines produced by the algorithm are selectively combined with the kernel weights from their target points. The performance relative to the U-Boost algorithm will be elucidated from the information-geometric point of view.

Minimum U-divergence method

Let M be a positive cone of density functions on a data space in R^p and let F be a linear space of square-integrable functions. We choose a convex function U on the real axis, denote its derivative by u and the inverse function of u by ξ, and assume that u is positive. Typical examples are

(U(t), u(t), ξ(t)) = (exp t, exp t, log t)

and

(2)    (U(t), u(t), ξ(t)) = ( (1 + βt)^{(β+1)/β} / (1 + β), (1 + βt)^{1/β}, (t^β − 1)/β ).

A canonical functional is given by

ϕ_U(f) = ∫_D U(f(z)) Λ(dz),

which is convex because of the convexity assumption on U. The convexity leads to the conjugate convex functional

(3)    ϕ*_U(p) = sup_f { ⟨f, p⟩ − ϕ_U(f) },

where ⟨f, p⟩ = ∫ f(z) p(z) Λ(dz). By a variational argument the supremum in (3) is attained at f* = ξ(p), or equivalently p = u(f*), so that ϕ*_U(p) = ⟨ξ(p), p⟩ − ϕ_U(ξ(p)). Hence the positive function u connects F with M. The U-divergence is defined by

D_U(p, f) = ϕ_U(f) + ϕ*_U(p) − ⟨f, p⟩.
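The triples (U, u, ξ) can be checked numerically. The Python sketch below (the helper name make_power_triple is an illustrative assumption) encodes the exp triple and the power-divergence triple of (2), and verifies that ξ inverts u and that u is the derivative of U.

```python
import numpy as np

# U = exp case: (U, u, xi) = (exp, exp, log)
U_exp, u_exp, xi_exp = np.exp, np.exp, np.log

def make_power_triple(beta):
    """(U, u, xi) of equation (2) for the power divergence with index beta > 0."""
    def U(t):
        return (1.0 + beta * t) ** ((beta + 1.0) / beta) / (1.0 + beta)
    def u(t):   # derivative of U
        return (1.0 + beta * t) ** (1.0 / beta)
    def xi(t):  # inverse function of u
        return (t ** beta - 1.0) / beta
    return U, u, xi

U_b, u_b, xi_b = make_power_triple(beta=0.5)
t = 1.3
print(xi_b(u_b(t)))                                    # xi(u(t)) recovers t, up to rounding
print((U_b(t + 1e-6) - U_b(t - 1e-6)) / 2e-6, u_b(t))  # numerical dU/dt agrees with u
```

As β → 0 the power triple tends to (exp t, exp t, log t), recovering the Kullback-Leibler case U = exp.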
By definition, D_U(p, f) ≥ 0 with equality if and only if f = ξ(p). The U-function given by (2) is associated with the power divergence, cf. [1]. For a given data set {z_1, ..., z_n} the minimum U-divergence method is formally given as follows. Let F_0 = {f(z, θ) : θ ∈ Θ}, in which we usually model f(z, θ) so that p(z, θ) = u(f(z, θ)) is a probability density function. Then the empirical loss function is

(4)    L^(U)(θ) = −(1/n) Σ_{i=1}^n f(z_i, θ) + ∫_D U(f(z, θ)) Λ(dz),

so that the minimum U-divergence estimator is θ̂^(U) = argmin_{θ∈Θ} L^(U)(θ). For example, if U = exp and f(z, θ) is linear in θ, then the estimator reduces to the maximum likelihood estimator under the exponential family p(z, θ) = exp{θ^T f(z) − ψ(θ)} with canonical statistic f(z) and cumulant function ψ(θ). See [4] for several applications with U functions other than exp.

Local learning algorithm

We propose a localization of the empirical U-loss function (4) for statistical pattern recognition. For this we adopt the following simple framework. Let X and Y be an input vector and a binary label with values ±1, whose joint distribution is decomposed as r(x, y) = p(y|x) q(x), with the conditional distribution p(y|x) of Y given X = x and the marginal density q(x) of X. We thus confine our discussion to binary classification, but the arguments extend almost unchanged to multiclass classification. Let F_0 be the linear space

F_0 = { f(x, θ) = Σ_{j=1}^J θ_j f_j(x) | θ = (θ_1, ..., θ_J) }

spanned by learning machines f_j(x), each of which itself gives a decision rule x → y by y = sgn(f_j(x)) for predicting the class label. We propose a local version of the U-loss function,

L_U(θ, x, h) = (1/n) Σ_{i=1}^n U(−y_i K_h(x_i, x) f(x_i, θ)),

for a given data set {(x_1, y_1), ..., (x_n, y_n)}.
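The local U-loss L_U(θ, x, h) can be transcribed almost directly into code. In the Python sketch below the Gaussian kernel, the sign-based machine and all names are illustrative assumptions; U defaults to exp as in AdaBoost.

```python
import numpy as np

def gaussian_kernel_vec(X, x0, h):
    """K_h(x_i, x0) for every row x_i of X."""
    return np.exp(-0.5 * np.sum((X - x0) ** 2, axis=1) / h ** 2)

def local_u_loss(theta, machines, X, y, x0, h, U=np.exp):
    """L_U(theta, x0, h) = (1/n) sum_i U(-y_i K_h(x_i, x0) f(x_i, theta))."""
    f_vals = sum(t * m(X) for t, m in zip(theta, machines))  # f(x_i, theta)
    K = gaussian_kernel_vec(X, x0, h)
    return float(np.mean(U(-y * K * f_vals)))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 1))
y = np.sign(X[:, 0])                 # toy labels, fitted exactly by the stump
stump = lambda Z: np.sign(Z[:, 0])   # one learning machine f_1
good = local_u_loss([1.0], [stump], X, y, x0=np.zeros(1), h=0.5)
bad = local_u_loss([-1.0], [stump], X, y, x0=np.zeros(1), h=0.5)
```

A correctly oriented machine gives y_i f(x_i) > 0, so `good` falls below U(0) = 1 while `bad` exceeds it; the kernel factor makes observations near the target x0 dominate both values.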
Note that the contribution of (x_i, y_i) to the loss function is controlled by the weight K_h(x_i, x), which assigns less weight to the i-th observation the larger the distance between x_i and x. The sequential optimization is defined by

(5)    θ̂_j(f, x_(j), h) = argmin_{θ ∈ R} (1/n) Σ_{i=1}^n U( −y_i { F_{j−1}(x_i) + θ K_h(x_i, x_(j)) f(x_i) } ),

where x_(j) is generated uniformly from {x_1, ..., x_n} and F_{j−1}(x) = Σ_{k=1}^{j−1} θ̂_k K_h(x, x_(k)) f_k(x). The key of the local U-Boost algorithm is kernel localization in the weighted error rate at the j-th step,

(6)    err_j(f, h) = Σ_{i=1}^n I(y_i ≠ f(x_i)) w_j(i) / Σ_{i=1}^n w_j(i),

where I(·) is the indicator function and w_j(i) = u(−y_i F_{j−1}(x_i)) with u the derivative of U. If U = exp we call the algorithm local AdaBoost, in which case (5) has the closed form

θ̂_j(f, x_(j), h) = (1/2) log{ (1 − err_j(f, h)) / err_j(f, h) }.

We begin by setting the initial weights over the data set to w_1(i) = 1/n. Using the weighted error rate err_j(f, h) defined by (6), the following three sub-steps are iterated for j = 1, ..., J:

1. Select f*_(j) = argmin_{f ∈ F_0} err_j(f, h) over a preliminarily fixed class F_0 of classifiers.
2. Update F_{j−1}(x) to F_j(x) = F_{j−1}(x) + θ̂*_j K_h(x, x_(j)) f*_(j)(x), where θ̂*_j = θ̂_j(f*_(j), x_(j), h) as in (5).
3. Update the weights by w_{j+1}(i) = u(−y_i F_j(x_i)), and hence err_{j+1}(f, h), and iterate while j < J.

Finally, the classifier h_J(x) = sign(F_J(x)) is defined by

F_J(x) = Σ_{j=1}^J K_h(x, x_(j)) θ̂*_j f*_(j)(x).

Remark 1. In the local likelihood method the parameter θ of interest depends on the target point, that is, the kernel center x at which the density or the conditional expectation is to be predicted. For the boosting algorithm, however, it is necessary to prepare the functional minimization in sub-step 1 above, which is said to be 'stagewise', see [3].
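Under the choice U = exp (local AdaBoost) the three sub-steps can be sketched as a small implementation. The Python sketch below makes several illustrative assumptions not fixed by the text: a Gaussian kernel, a pool F_0 of axis-aligned decision stumps built from sample quantiles, and inclusion of the kernel factor K_h(x, x_(j)) in each update, matching the final combination F_J.

```python
import numpy as np

def gaussian_kernel(X, x0, h):
    """K_h(x_i, x0) for every row x_i of X."""
    return np.exp(-0.5 * np.sum((X - x0) ** 2, axis=1) / h ** 2)

def make_stumps(X):
    """A small pool F_0 of axis-aligned decision stumps (illustrative choice)."""
    stumps = []
    for dim in range(X.shape[1]):
        for thr in np.quantile(X[:, dim], [0.25, 0.5, 0.75]):
            for sgn in (+1.0, -1.0):
                stumps.append((dim, thr, sgn))
    return stumps

def stump_predict(stump, X):
    dim, thr, sgn = stump
    return sgn * np.sign(X[:, dim] - thr + 1e-12)

def local_adaboost(X, y, h, J=25, seed=0):
    """Local U-Boost with U = exp, so w_j(i) = exp(-y_i F_{j-1}(x_i))."""
    rng = np.random.default_rng(seed)
    n = len(y)
    stumps = make_stumps(X)
    F = np.zeros(n)              # F_{j-1}(x_i) on the training set
    model = []                   # triples (x_(j), theta_j, stump)
    for _ in range(J):
        xj = X[rng.integers(n)]  # target point drawn uniformly from the data
        w = np.exp(-y * F)
        w = w / w.sum()
        # sub-step 1: machine minimising the weighted error rate (6)
        errs = [np.sum(w * (stump_predict(s, X) != y)) for s in stumps]
        best = stumps[int(np.argmin(errs))]
        err = float(np.clip(min(errs), 1e-12, 1 - 1e-12))
        theta = 0.5 * np.log((1.0 - err) / err)   # closed form for U = exp
        # sub-step 2: kernel-localised update of the score function
        F = F + theta * gaussian_kernel(X, xj, h) * stump_predict(best, X)
        model.append((xj, theta, best))
    return model

def predict(model, X, h):
    """h_J(x) = sign(F_J(x)), F_J(x) = sum_j K_h(x, x_(j)) theta_j f_(j)(x)."""
    F = np.zeros(len(X))
    for xj, theta, s in model:
        F += gaussian_kernel(X, xj, h) * theta * stump_predict(s, X)
    return np.sign(F)
```

On a toy sample whose label is the sign of the first coordinate, the fitted score recovers the labels almost perfectly; the bandwidth h here plays the role discussed in Remark 3 below.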
For this we generate x_(j) uniformly from the set {x_1, ..., x_n}. The local loss, highlighting a neighborhood of x_(j), leads to the optimal decision machine f*_(j) with the combining coefficient θ̂*_(j); thus the selected machine f*_(j) can be expected to perform well for prediction only over the neighborhood of x_(j).

Remark 2. Let us pursue the meaning of the random generation of the target point x in the local U-Boost algorithm. If we took an infinite number of iterations, the local version of the U-loss function would become

(7)    (1/n) Σ_{i=1}^n U(−y_i K̄_h(x_i) f(x_i, θ)),

where K̄_h(x_i) = (1/n) Σ_{j=1}^n K_h(x_i, x_(j)). However, the behaviors of the local U-Boost algorithm and of the sequential minimization algorithm for the averaged U-loss function (7) are clearly different, in a way similar to the comparison of boosting with bagging. In this sense our proposed algorithm is a fusion of boosting and bagging, with biased resampling based on kernel weighting.

Remark 3. The selection of the bandwidth h is an important issue for this proposal. In several empirical simulation studies we observe that an inappropriate choice of h is reflected in poor performance on classification tasks. We implement the selection of h by the K-fold cross-validation method: the training data set is partitioned into K sub-samples; for each of the K possible partitions the learning algorithm is run on the K − 1 joined sub-samples and the U-loss function is evaluated on the remaining sub-sample. The method gave a sensible choice of h in our simulation study.

References

[1] Basu, A., Harris, I. R., Hjort, N. L. and Jones, M. C. Robust and efficient estimation by minimising a density power divergence. Biometrika 85 (1998), 549-559.
[2] Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. Least angle regression. Ann. Statist. 32 (2004), 407-499.
[3] Eguchi, S. and Copas, J. B. A class of local likelihood methods and near-parametric asymptotics. J.
Royal Statist. Soc. B 60 (1998), 709-724.
[4] Eguchi, S. and Copas, J. B. A class of logistic type discriminant functions. Biometrika 89 (2002), 1-22.
[5] Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences 55 (1997), 119-139.
[6] Hjort, N. L. and Jones, M. C. Locally parametric nonparametric density estimation. Ann. Statist. 24 (1996), 1619-1647.
[7] Lebanon, G. and Lafferty, J. Boosting and maximum likelihood for exponential models. Advances in Neural Information Processing Systems 14 (2002).
[8] Murata, N., Takenouchi, T., Kanamori, T. and Eguchi, S. Information geometry of U-Boost and Bregman divergence. Neural Computation 16 (2004), 1437-1481.
[9] Park, B. U., Kim, W. C. and Jones, M. C. On local likelihood density estimation. Ann. Statist. 30 (2002), 1480-1495.
[10] Takenouchi, T. and Eguchi, S. Robustifying AdaBoost by adding the naive error rate. Neural Computation 16 (2004), 767-787.
[11] Kawakita, M. and Eguchi, S. Boosting method for local learning in statistical pattern recognition. In preparation.