UMR CNRS 6168 Laboratoire LSIS

Approximation of Linear Discriminant Analysis for Word Dependent Visual Features Selection

Hervé Glotin, Sabrina Tollari, Pascale Giraudet
Université du Sud Toulon-Var, UMR CNRS 6168 LSIS
Research report LSIS.RR.2005.002, 2005

Contents

Abstract
1 Introduction
2 LDA approximation and adaptive visual features
3 Experimentations on the COREL image database
  3.1 F estimation for COLOR, TEXTURE, SHAPE and POSITION
  3.2 Hierarchical Ascendant Classifications improved by ALDA
4 Conclusion
5 Acknowledgments

Abstract

Automatically determining a set of keywords that describes the content of a given image is a difficult problem, because of (i) the huge dimension of the visual space and (ii) the unsolved object segmentation problem. In order to address issue (i), we present a novel method based on an Approximation of Linear Discriminant Analysis (ALDA), from both the theoretical and the practical point of view. ALDA is more generic than usual LDA because it does not require explicit class labelling of each training sample, yet it allows efficient estimation of the discrimination power of the visual features. This is particularly interesting because of (ii) and of the expensive manual object segmentation and labelling tasks on large visual databases. In the first step of ALDA, for each word w_k, the training set is split in two according to whether images are labelled by w_k or not. Then, under weak assumptions, we show theoretically that the Between and Within variances of these two sets give good estimates of the best discriminative features for w_k. Experiments conducted on the COREL database show an efficient word-adaptive feature selection and a large improvement (+37%) of an image Hierarchical Ascendant Classification (HAC), for which ALDA also saves computational cost by reducing the visual feature space by 90%.
Keywords

Feature selection, Fisher LDA, visual segmentation, image auto-annotation, high dimension problem, word prediction, CBIR, HAC, COREL database, PCA.

1 Introduction

The need for efficient content-based image retrieval has increased in many application areas such as biomedicine, military, and Web image classification and searching. Many approaches have been devised and discussed over more than a decade. While the technology to search text has been available for some time, searching images (or videos) is much more challenging. Most content-based image retrieval systems require the user to formulate a query in terms of image content, but in general the user asks semantic queries using textual descriptions. Some systems aim to enhance image word search using visual information [13]. In any case, one needs a fast system that robustly auto-annotates large un-annotated image databases. The general idea of image auto-annotation systems is to associate a class of 'similar' images with semantic keywords, e.g. to index a new image by a few keywords according to a reference training set. This problem has been pursued with various approaches, such as neural networks, statistical classification, etc. One major issue in these models is the huge dimension of the visual space, and "it remains an interesting open question to construct feature sets that (...) offer very good performance for a particular vision task" [1].

Some recent works use user feedback to estimate the most discriminant features. This exploration process before or during classification, as in Active Learning, requires a lot of manual interaction, many hundreds of interactions for only 10 words [6]. Therefore these methods cannot be applied to large image databases or large lexicons. In this paper we propose to answer the previous question by automatically reducing the high-dimensional visual space to the most efficient usual features for a considered word. The most famous method of dimensionality reduction is Principal Components Analysis (PCA). But PCA does not use the label information of the data. Although PCA finds components that are useful for representing data, there is no reason to assume that these components must be useful for discriminating between data in different classes. Where PCA seeks directions that are efficient for representation, Fisher Linear Discriminant Analysis (LDA) seeks directions that are efficient for discrimination ([3], p. 117).

Indeed, recent works in audio-visual classification show that LDA is efficient on well-labelled databases to determine the most discriminant features, reducing the visual space [4, 10, 7]. Unfortunately, most large image databases are not correctly labelled, and do not provide a one-to-one relation between keywords and image segments (see the COREL image samples with their captions in Fig. 1). Consequently usual LDA cannot be applied to real image databases. Moreover, because of the unsolved visual scene segmentation problem (see Fig. 1), real applications or training of image auto-annotation systems from web pages would require a robust visual feature selection method working from uncertain data. Therefore, we present a novel Approximation of LDA (ALDA), in a theoretical and practical analysis.

ALDA is simpler than usual LDA, because it does not need explicit labelling of the training samples to generate a good estimation of the most discriminant features. The first stage of ALDA consists, for each word w_k, in splitting the training set in two, according to whether images are labelled by w_k or not. Then, under a weak assumption, we show that for a given w_k the Between and Within variances between these two sets give good estimates of the best discriminative features. Experiments illustrate the dependency of the features on each word, and significant classification enhancements.

2 LDA approximation and adaptive visual features

Major databases are not manually segmented and segment-labelled. Thus, given a set of training images Φ = {φ_j}_{j∈{1,..,J}} and a lexicon λ = {w_k}_{k∈{1,..,K}}, each image φ_j is labelled with some words of λ (e.g. φ_j has a global legend constructed with λ, as shown in Fig. 1). In order to extract the visual features of each object included in each φ_j, one can automatically segment each image into many areas called blobs. Unfortunately, blobs generally do not match the shape of each object. Even if they do, there is no way to relate each blob to the corresponding word. Nevertheless, we show below that, despite the fact that each word class w_k is not associated to a unique blob (and vice-versa), one can estimate for each w_k which are the most discriminant visual features.

To this purpose we need to define four sets: S, T, TG and G. Let S be the theoretical set of values of one feature x, calculated on all the blobs that exactly represent the word w_k. For any feature set E, we note c_E its cardinal, µ_E the average of all values x_i of x ∈ E, and v_E their variance. Let T be the set of x values of all blobs included in all images labelled by w_k (of course T includes S). Let TG be such that T = TG ∪ S, with empty intersection between TG and S. We assume c_TG ≠ 0 (otherwise each image labelled by w_k would contain only the corresponding blobs). Let G be the set containing all values of x from all blobs contained in images that are not labelled by w_k.

In the following, we only make the weak assumption (hyp. 1) that µ_TG = µ_G and v_TG = v_G, which is related to the simple assumption of context independency provided by any large enough image database. We note B_DE (resp. W_DE) the Between variance (resp. the Within variance) between any two sets D and E. The usual LDA is based on the calculation, for each feature x, of the theoretical discrimination power F(x; w_k) = 1 / (1 + V(x; w_k)), where V(x; w_k) = W_SG / B_SG. We show below that V̂(x; w_k) = W_TG / B_TG is a good approximation of V(x; w_k), and that if one uses V to order all the features x for a given word w_k, then this order is the same when using V̂, at least for the most discriminant features x. Therefore the selection of the features with the highest theoretical discriminative power F can be carried out from the calculation of the practical values F̂(x; w_k) = 1 / (1 + V̂(x; w_k)).

Figure 1: Examples of an automatic segmentation (Normalized Cuts algorithm [11]) of two COREL images [1]. The image captions are (left image) "Windmill Shore Water Harbor" and (right) "Dolphin Bottlenosed Closeup Water". Each blob of each image is labelled by all the words of its image caption. Notice also that the dolphin, like many other objects, is split into two parts by the Normalized Cuts algorithm.

Let p_S = c_S / c_T and q_S = 1 - p_S = (c_T - c_S) / c_T = c_TG / c_T. We have µ_T = q_S µ_TG + p_S µ_S. Therefore:

  µ_T = q_S µ_G + p_S µ_S.   (1)

Let us derive v_T from v_S and v_G, denoting, for any x ∈ T, by p_i the probability of the event 'x = x_i':

  v_T = Σ_{x_i ∈ T} (x_i - µ_T)² p_i = Σ_{x_i ∈ T} (x_i - q_S µ_G - p_S µ_S)² p_i
      = Σ_{x_i ∈ TG} [(x_i - µ_G) + p_S (µ_G - µ_S)]² p_i + Σ_{x_i ∈ S} [(x_i - µ_S) + q_S (µ_S - µ_G)]² p_i
      = Σ_{x_i ∈ TG} (x_i - µ_TG)² p_i + 2 p_S (µ_G - µ_S) Σ_{x_i ∈ TG} (x_i - µ_G) p_i + p_S² (µ_G - µ_S)² Σ_{x_i ∈ TG} p_i
        + Σ_{x_i ∈ S} (x_i - µ_S)² p_i + 2 q_S (µ_S - µ_G) Σ_{x_i ∈ S} (x_i - µ_S) p_i + q_S² (µ_S - µ_G)² Σ_{x_i ∈ S} p_i
      = q_S v_TG + 2 p_S (µ_G - µ_S) (Σ_{x_i ∈ TG} x_i p_i - µ_G Σ_{x_i ∈ TG} p_i) + p_S² (µ_G - µ_S)² q_S
        + p_S v_S + 2 q_S (µ_S - µ_G) (Σ_{x_i ∈ S} x_i p_i - µ_S Σ_{x_i ∈ S} p_i) + q_S² (µ_S - µ_G)² p_S
      = q_S v_G + 2 p_S (µ_G - µ_S)(q_S µ_TG - µ_G q_S) + p_S² q_S (µ_G - µ_S)²
        + p_S v_S + 2 q_S (µ_S - µ_G)(p_S µ_S - µ_S p_S) + q_S² p_S (µ_S - µ_G)²

then
  v_T = q_S v_G + p_S v_S + p_S q_S (µ_G - µ_S)².   (2)

We are now able to derive and link B_TG and B_SG:

  B_TG = [c_T / (c_T + c_G)] (µ_T - (c_T µ_T + c_G µ_G)/(c_T + c_G))² + [c_G / (c_T + c_G)] (µ_G - (c_T µ_T + c_G µ_G)/(c_T + c_G))²
       = [c_T / (c_T + c_G)] ((c_G µ_T - c_G µ_G)/(c_T + c_G))² + [c_G / (c_T + c_G)] ((c_T µ_G - c_T µ_T)/(c_T + c_G))²

  B_TG = c_T c_G (µ_T - µ_G)² / (c_T + c_G)²   (3)
       = c_T c_G (q_S µ_G + p_S µ_S - µ_G)² / (c_T + c_G)² = c_T c_G p_S² (µ_S - µ_G)² / (c_T + c_G)² = c_G c_S² (µ_S - µ_G)² / (c_T (c_T + c_G)²).

Similarly to Eq. (3) we have:

  B_SG = c_S c_G (µ_S - µ_G)² / (c_S + c_G)².   (4)

Thus from Eq. (3) and (4):

  B_TG = [c_S (c_S + c_G)² / (c_T (c_T + c_G)²)] B_SG.   (5)

Figure 2: Maximum values of the normalised estimated discrimination power Hn(x; w_k) = F̂(x; w_k) / Σ_x F̂(x; w_k) for the COLOR, TEXTURE, SHAPE and POSITION feature sets, for the 14 most frequent words of the database (other words are represented by a simple dot). The results are intuitively correct: TREE, ROCK, FLOWER and PLANTS are mostly discriminated by color, while BUILDING and STREET are more discriminated by texture. SHAPE is on average not very competitive compared to COLOR, and neither is POSITION. BIRD is the word most discriminated by POSITION; indeed, most COREL images containing a bird show it in the image center.

We also derive the Within variances W_TG and W_SG:

  W_TG = (c_T v_T + c_G v_G) / (c_T + c_G)
       = [c_T (q_S v_G + p_S v_S + p_S q_S (µ_G - µ_S)²) + c_G v_G] / (c_T + c_G)
       = [(q_S c_T + c_G) v_G + p_S c_T v_S + p_S q_S c_T (µ_G - µ_S)²] / (c_T + c_G)

then
  W_TG = [(c_T - c_S + c_G) v_G + c_S v_S + p_S q_S c_T (µ_G - µ_S)²] / (c_T + c_G).   (6)

By definition W_SG = (c_S v_S + c_G v_G) / (c_S + c_G), so v_G = [(c_S + c_G)/c_G] W_SG - (c_S/c_G) v_S. Hence

  W_TG = [(c_T - c_S + c_G) ((c_S + c_G)/c_G W_SG - (c_S/c_G) v_S) + c_S v_S + p_S q_S c_T (µ_G - µ_S)²] / (c_T + c_G)
       = [(c_T - c_S + c_G)(c_S + c_G) / (c_G (c_T + c_G))] W_SG - [c_S (c_T - c_S) / (c_G (c_T + c_G))] v_S + [c_S (c_T - c_S) / (c_T (c_T + c_G))] (µ_G - µ_S)².   (7)

Dividing Eq. (7) by Eq. (5) gives:

  V̂(x; w_k) = W_TG / B_TG
            = { [(c_T - c_S + c_G)(c_S + c_G) / (c_G (c_T + c_G))] W_SG - [c_S (c_T - c_S) / (c_G (c_T + c_G))] v_S + [c_S (c_T - c_S) / (c_T (c_T + c_G))] (µ_G - µ_S)² } / { [c_S (c_S + c_G)² / (c_T (c_T + c_G)²)] B_SG }
            = [c_T (c_T - c_S + c_G)(c_T + c_G) / (c_G c_S (c_S + c_G))] (W_SG / B_SG) + [(c_T - c_S)(c_T + c_G) / (c_S c_G)] [1 - (c_T / c_G) v_S / (µ_G - µ_S)²]

thus
  V̂(x; w_k) = A(w_k) V(x; w_k) + B(w_k) [1 - C(x; w_k)]   (8)

where A(w_k) = c_T (c_T - c_S + c_G)(c_T + c_G) / (c_G c_S (c_S + c_G)), B(w_k) = (c_T - c_S)(c_T + c_G) / (c_S c_G), and C(x; w_k) = c_T v_S / (c_G (µ_G - µ_S)²). A and B are positive constants independent of x, depending only on the numbers of blobs in the sets T, S and G (experiments on the COREL database show that for all words A and B are close to 10). Therefore, for any given word w_k, V̂(x; w_k) is a linear function of V(x; w_k) if C(x; w_k) is negligible compared to 1. This is the case if (hyp. 2) c_T / c_G is small, which is true in the COREL database since it is close to 0.01 for most words and never exceeds 0.2 (actually one can build any database such that c_T << c_G), and if (hyp. 3) v_S is tiny compared to (µ_G - µ_S)², which is the case when x is a reasonably good feature to discriminate G and S (e.g. w_k is represented by a rather stationary feature value different from the mean contextual value). Then the orders of the V and V̂ values are the same. Finally, for each word w_k, even without knowing which blob of the image it labels, one can estimate the most discriminant features by simply ranking the F̂ values.
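To make this first stage of ALDA concrete, here is a minimal sketch of the feature scoring just described, assuming equiprobable blobs; the function name, the data layout (one row of feature values per blob) and the caption structure are illustrative choices, not taken from this report. Note that only image-level captions are needed, never blob-level labels.

```python
import numpy as np

def alda_feature_scores(X, image_ids, captions, wk):
    """Estimate F_hat(x; wk) for every feature x, in the spirit of Eqs. (3), (6), (8).

    X         : (n_blobs, n_features) array, one row of feature values per blob
    image_ids : (n_blobs,) image index of each blob
    captions  : dict mapping image index -> set of caption words
    wk        : word whose most discriminant features are sought
    """
    # Split the blobs in two: T = blobs of images captioned by wk, G = the rest.
    in_T = np.array([wk in captions[i] for i in image_ids])
    T, G = X[in_T], X[~in_T]
    cT, cG = len(T), len(G)
    muT, muG = T.mean(axis=0), G.mean(axis=0)
    vT, vG = T.var(axis=0), G.var(axis=0)
    # Between and Within variances of the partition {T, G}, as in Eqs. (3) and (6):
    B_TG = cT * cG * (muT - muG) ** 2 / (cT + cG) ** 2
    W_TG = (cT * vT + cG * vG) / (cT + cG)
    V_hat = W_TG / B_TG            # approximation of V = W_SG / B_SG
    return 1.0 / (1.0 + V_hat)     # F_hat, one score in (0, 1) per feature

# np.argsort(-alda_feature_scores(...)) then lists the features from the most
# to the least discriminant for the word wk.
```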
In order to estimate how many and which of the features X_n, n ∈ {1,..,δ}, are really discriminant for each word w_k, we simply sort all the F̂(X_n; w_k) in decreasing order, and compute N < δ, where δ is the dimension of the visual space and N is the smallest integer such that:

  Σ_{n=1}^{N} F̂(X_n; w_k) ≥ (1/2) Σ_{n=1}^{δ} F̂(X_n; w_k).

Thus X_1,..,X_N are considered as the N best discriminative features for w_k.

3 Experimentations on the COREL image database

To test the efficiency of ALDA, extensive experiments are carried out on the COREL image database [9], made of 10,000 images with approximately 100,000 segments, preprocessed by K. Barnard et al. [1]. Each image is labelled by 3.6 words on average, from a lexicon of 267 different words, and has an average of 10 visual segments ('blobs') produced by the Normalized Cuts algorithm [11], which tends to produce small ones. Each blob is described by a set of δ = 40 features, listed below by their dimension index. First, POSITION and SHAPE: (1,2) horizontal and vertical blob position; (3) the proportion of the blob in its image; (4) the ratio of the blob's area to its perimeter squared; (5) the moment of inertia; (6) the ratio of the blob's area to that of its convex hull. COLOURS (7,..,24) are represented by the average and standard deviation of (R,G,B), (r,g,S) and (L,a,b). TEXTURES (25,..,40) are extracted by gaussian filters [1].

3.1 F estimation for COLOR, TEXTURE, SHAPE and POSITION

We run ALDA on 6,000 COREL images, and measure for each word the maximum value of F̂ over the SHAPE, COLOR and TEXTURE feature sets. These values, represented in Fig. 2 for the 14 most frequent words, are intuitively correct and show the word dependence of ALDA. The repartition analysis, over the words of all 6,000 images of the training set, of the selected N best features gives respectively 3% POSITION features, 8% SHAPE features, 65% COLOR features and 24% TEXTURE features. COLOR features are confirmed to be the most discriminant ones (see also Fig. 2). The simple TEXTURE features (16 gaussian filters) are better than the SHAPE ones, certainly because the blob segmentation is imprecise (see Fig. 1).

3.2 Hierarchical Ascendant Classifications improved by ALDA

To demonstrate the efficiency of ALDA on a classification task, we now run on COREL a Hierarchical Ascendant Classification (HAC) of visual features into word categories [12]. As in [2], we measure the system performance using the Normalised Score NS = sensitivity + specificity - 1 [1, 8]. Compared to the raw visual input space, good results are obtained by reducing the HAC visual feature inputs to the N best ALDA discriminant features as defined at the end of section 2 (method called NADAPT0.5). The NS values for HAC on the 40 usual visual dimensions and on the word-adaptive features are shown in Fig. 3. Classification of the 3,000 images of the test set shows a gain of +37% in NS, and simultaneously, on average over all words, a dimension reduction from δ = 40 to the 4 best features (see [12] for more details on the HAC experiments).

Figure 3: Word visual consistency representation for the 40DIM method (X-coordinate) and for the NADAPT0.5 method (Y-coordinate). The NADAPT0.5 method gives better results than 40DIM, except for the words closeup, garden, street, forest and horse.

4 Conclusion

In this paper we present ALDA, based on an approximation of the Fisher LDA. We showed that, under weak assumptions (hyp. 1 to 3), ALDA estimates the N best features, which enhance the HAC task while reducing the visual space dimension by a factor of 10. The main contributions of this paper are summarized as follows: (a) for the first time, a theoretical demonstration of ALDA is given in the first section; (b) we implement ALDA on a reference image database and we analyse the word dependent feature sets constructed using ALDA; (c) we integrate ALDA in a simple HAC model, leading to significant improvements. Further auto-annotation experiments are currently being done on COREL with a bayesian system (the DIMATEX model [5]), with promising first results.

5 Acknowledgments

We thank K. Barnard and J. Wang [14] for providing the COREL image database.

References

[1] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107-1135, 2003.

[2] K. Barnard, P. Duygulu, R. Guru, P. Gabbur, and D. Forsyth. The effects of segmentation and feature choice in a translation model of object recognition. In Computer Vision and Pattern Recognition, pages 675-682, 2003.

[3] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, Inc., 2000.

[4] G. Potamianos, J. Luettin, and C. Neti. Hierarchical discriminant features for audio-visual LVCSR. In Proc. of IEEE Int. Conf. ASSP, 2001.

[5] Hervé Glotin and Sabrina Tollari. Image auto-annotation method using dichotomic visual clustering for CBIR. In Proc. of IEEE EURASIP Fourth International Workshop on Content-Based Multimedia Indexing (CBMI2005), June 2005.

[6] Philippe H. Gosselin and Matthieu Cord. A comparison of active classification methods for content-based image retrieval. In Proc. of the 1st International Workshop on Computer Vision Meets Databases (CVDB2004), in conjunction with ACM SIGMOD 2004, pages 51-58, Paris, France, 2004.

[7] Q. S. Liu, R. Huang, H. Q. Lu, and S. D. Ma. Face recognition using kernel based Fisher discriminant analysis. In Proc. of Int. Conf. Automatic Face and Gesture Recognition, pages 197-201, May 2002.

[8] F. Monay and D. Gatica-Perez. On image auto-annotation with latent space models. In Proc. ACM Int. Conf. on Multimedia (ACM MM), pages 275-278, 2003.

[9] H. Muller, S. Marchand-Maillet, and T. Pun. The truth about Corel - evaluation in image retrieval. In The Challenge of Image and Video Retrieval (CIVR02), 2002.

[10] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, and D. Vergyri. Large-vocabulary audio-visual speech recognition: A summary of the Johns Hopkins Summer 2000 Workshop. In Proc. IEEE Workshop on Multimedia Signal Processing, 2001.

[11] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.

[12] Sabrina Tollari and Hervé Glotin. Keyword dependent selection of visual features and their heterogeneity for image content-based interpretation. Technical Report LSIS.RR.2005.003, LSIS, 2005.

[13] Sabrina Tollari, Hervé Glotin, and Jacques Le Maitre. Enhancement of textual images classification using segmented visual contents for image search engine. Multimedia Tools and Applications, 25(3):405-417, March 2005.

[14] J. Z. Wang, J. Li, and G. Wiederhold. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947-963, 2001.
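As a closing numerical remark, a sketch not taken from the report: the decomposition behind Eq. (2) is the exact mixture identity v_T = q_S v_TG + p_S v_S + p_S q_S (µ_TG - µ_S)², which hyp. 1 turns into Eq. (2) by substituting the statistics of G for those of TG. It can be checked on arbitrary data with equiprobable samples (p_i = 1/c_T); the set sizes and distributions below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
S  = rng.normal(4.0, 0.5, size=30)    # blob values exactly matching the word (set S)
TG = rng.normal(0.0, 1.0, size=270)   # other blobs of the same images (set TG)
T  = np.concatenate([TG, S])          # T = TG U S, disjoint union

pS = len(S) / len(T)                  # pS = cS / cT
qS = 1.0 - pS                         # qS = cTG / cT
# Exact two-group variance decomposition; hyp. 1 then replaces the TG
# statistics by the G statistics to obtain Eq. (2):
vT_decomposed = (qS * TG.var() + pS * S.var()
                 + pS * qS * (TG.mean() - S.mean()) ** 2)
assert np.isclose(T.var(), vT_decomposed)
```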