Approximation of Linear Discriminant Analysis for Word Dependent Visual Features Selection

Hervé Glotin            Sabrina Tollari            Pascale Giraudet
Université du Sud Toulon-Var
UMR CNRS 6168 Laboratoire LSIS
Research report LSIS.RR.2005.002


Abstract

To automatically determine a set of keywords that describes the content of a given image is a difficult problem, because of (i) the huge dimension of the visual space and (ii) the unsolved object segmentation problem. In order to address (i), we present a novel method based on an Approximation of Linear Discriminant Analysis (ALDA), from both a theoretical and a practical point of view. ALDA is more generic than usual LDA because it does not require explicit class labelling of each training sample, yet it still allows efficient estimation of the discrimination power of the visual features. This is particularly interesting because of (ii) and because manual object segmentation and labelling are expensive on large visual databases. In the first step of ALDA, for each word wk, the train set is split in two, according to whether images are labelled by wk or not. Then, under weak assumptions, we show theoretically that the Between and Within variances of these two sets give good estimates of the best discriminative features for wk. Experiments conducted on the COREL database show an efficient word-adaptive feature selection, and a great enhancement (+37%) of an image Hierarchical Ascendant Classification (HAC), for which ALDA also saves computational cost by reducing the visual feature space by 90%.

Keywords: feature selection, Fisher LDA, visual segmentation, image auto-annotation, high dimension problem, word prediction, CBIR, HAC, COREL database, PCA.

Contents

1 Introduction
2 LDA approximation and adaptive visual features
3 Experimentations on COREL image database
  3.1 F estimation for COLOR, TEXTURE, SHAPE and POSITION
  3.2 Hierarchical Ascendant Classifications improved by ALDA
4 Conclusion
5 Acknowledgments


1    Introduction

The need for efficient content-based image retrieval has increased in many application areas such as biomedicine, military, and Web image classification and search.

Many approaches have been devised and discussed over more than a decade. While the technology to search text has been available for some time, searching images (or videos) is much more challenging. Most content-based image retrieval systems require the user to give a query based on image concepts, but in general the user asks semantic queries using textual descriptions. Some systems aim to enhance image word search using visual information [13]. In any case, one needs a fast system that robustly auto-annotates large un-annotated image databases. The general idea of image auto-annotation systems is to associate a class of 'similar' images with semantic keywords, e.g. to index a new image by a few keywords according to a reference train set. This problem has been pursued with various approaches, such as neural networks, statistical classification, etc. One major issue in these models is the huge dimension of the visual space, and "it remains an interesting open question to construct feature sets that (...) offer very good performance for a particular vision task" [1].

Some recent works use user feedback to estimate the most discriminant features. This exploration process, before or during classification as in Active Learning, requires a lot of manual interaction, many hundreds of interactions for only 10 words [6]. Therefore these methods cannot be applied to large image databases or large lexicons. In this paper we propose to answer the previous question by automatically reducing the high-dimensional visual space to the most efficient usual features for a considered word. The most famous dimensionality reduction method is Principal Components Analysis (PCA). But PCA does not include the label information of the data. Although PCA finds components that are useful for representing data, there is no reason to assume that these components are useful for discriminating between data in different classes. Where PCA seeks directions that are efficient for representation, Fisher Linear Discriminant Analysis (LDA) seeks directions that are efficient for discrimination ([3], p. 117).

Indeed, recent works in audio-visual classification show that LDA is efficient on well-labelled databases to determine the most discriminant features, reducing the visual space [4, 10, 7]. Unfortunately, most large image databases are not correctly labelled, and do not provide a one-to-one relation between keywords and image segments (see the COREL image samples with their captions in Fig. 1). Consequently usual LDA cannot be applied to real image databases.

Moreover, because of the unsolved visual scene segmentation problem (see Fig. 1), real applications or the training of image auto-annotation systems from web pages would require a robust visual feature selection method working from uncertain data. Therefore, we present a novel Approximation of LDA (ALDA), in a theoretical and practical analysis. ALDA is simpler than usual LDA, because it does not need explicit labelling of the training samples to generate a good estimation of the most discriminant features. The first stage of ALDA consists, for each word wk, in splitting the train set in two, according to whether images are labelled by wk or not. Then, under a weak assumption, we show that for a given wk the Between and Within variances between these two sets give good estimates of the best discriminative features. Experiments illustrate the dependency of the features on each word, and significant classification enhancements.

2    LDA approximation and adaptive visual features

Major databases are not manually segmented and segment-labelled. Thus, given a set of training images Φ = {φj}, j ∈ {1,..,J}, and a lexicon λ = {wk}, k ∈ {1,..,K}, each image φj is labelled with some words of λ (e.g. φj has a global legend constructed with λ, as shown in Fig. 1). In order to extract visual features of each object included in each φj, one can automatically segment each image into many areas called blobs. Unfortunately, blobs generally do not match the shape of each object. Even if they do, there is no way to relate each blob to the corresponding word.

Nevertheless, we show below that, despite the fact that each word class wk is not associated to a unique blob, and vice versa, one can estimate for each wk which are the most discriminant visual features.

To this purpose we need to define four sets: S, T, TG and G. Let S be the theoretical set of values of one feature x, calculated on all the blobs that exactly represent the word wk. For any feature set E, we note cE its cardinal, μE the average of all the values xi of x ∈ E, and vE their variance. Let T be the set of x values of all blobs included in all images labelled by wk (of course T includes S). Let TG be such that T = TG ∪ S, with empty intersection between TG and S. We assume cTG ≠ 0 (otherwise each image labelled by wk would contain only the corresponding blobs).

Let G be the set containing all values of x from all blobs contained in images that are not labelled by wk. In the following, we only make the weak assumption (hyp. 1) μTG = μG and vTG = vG, which is related to the simple assumption of context independency provided by any large enough image database. We note BDE (resp. WDE) the Between variance (resp. the Within variance) between any sets D and E. The usual LDA is based on the calculation, for each feature x, of the theoretical discrimination power F(x; wk) = 1/(1 + V(x; wk)), where V(x; wk) = WSG/BSG. We show below that V̂(x; wk) = WTG/BTG is a good approximation of V(x; wk), and that if one applies V to rank all x for a given word wk, then this order is the same when applying V̂, at least for the most discriminant features x. Therefore the selection of the features with the highest theoretical discriminative power F can be carried out from the calculation of the practical values F̂(x; wk) = 1/(1 + V̂(x; wk)).
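The practical score F̂(x; wk) = 1/(1 + WTG/BTG) can be computed directly from two pools of blob feature values. The following sketch is illustrative rather than the authors' implementation: it assumes the values of one feature are gathered in two NumPy arrays (T: blobs of images labelled by wk; G: blobs of the other images), and uses population variances, consistent with the definitions of the Within and Between variances.

```python
import numpy as np

def alda_score(x_T, x_G):
    """Practical discrimination power F^(x; wk) for one feature x.

    x_T: feature values of all blobs from images labelled by wk (set T)
    x_G: feature values of all blobs from the remaining images (set G)
    """
    cT, cG = len(x_T), len(x_G)
    muT, muG = np.mean(x_T), np.mean(x_G)
    vT, vG = np.var(x_T), np.var(x_G)
    # Between variance of the two-set partition {T, G}
    B_TG = cT * cG * (muT - muG) ** 2 / (cT + cG) ** 2
    # Within variance: count-weighted mean of the two variances
    W_TG = (cT * vT + cG * vG) / (cT + cG)
    V_hat = W_TG / B_TG
    return 1.0 / (1.0 + V_hat)
```

Ranking the features of a word by decreasing `alda_score` gives the word-dependent feature order used in the rest of the paper.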


Figure 1: Examples of an automatic segmentation (Normalized Cuts algorithm [11]) of two COREL images [1]. The image captions are (left image) "Windmill Shore Water Harbor" and (right) "Dolphin Bottlenosed Closeup Water". Each blob of each image is labelled by all the words of its image caption. Notice also that the dolphin, like many other objects, is split in two parts by the Normalized Cuts algorithm.

Let $p_S = \frac{c_S}{c_T}$ and $q_S = 1 - p_S = \frac{c_T - c_S}{c_T} = \frac{c_{T_G}}{c_T}$. We have $\mu_T = q_S \mu_{T_G} + p_S \mu_S$. Therefore, using hyp. 1:

$$\mu_T = q_S \mu_G + p_S \mu_S. \quad (1)$$

Let us derive $v_T$ from $v_S$ and $v_G$, noting $p_i$ the probability of the event '$x = x_i$' for any $x_i \in T$:

$$v_T = \sum_{x_i \in T} (x_i - \mu_T)^2 p_i = \sum_{x_i \in T} (x_i - q_S\mu_G - p_S\mu_S)^2 p_i$$

$$= \sum_{x_i \in T_G} \big[(x_i - \mu_G) + p_S(\mu_G - \mu_S)\big]^2 p_i + \sum_{x_i \in S} \big[(x_i - \mu_S) + q_S(\mu_S - \mu_G)\big]^2 p_i$$

$$= \sum_{x_i \in T_G} (x_i - \mu_{T_G})^2 p_i + 2p_S(\mu_G - \mu_S)\sum_{x_i \in T_G}(x_i - \mu_G)p_i + p_S^2(\mu_G - \mu_S)^2 \sum_{x_i \in T_G} p_i$$
$$\quad + \sum_{x_i \in S} (x_i - \mu_S)^2 p_i + 2q_S(\mu_S - \mu_G)\sum_{x_i \in S}(x_i - \mu_S)p_i + q_S^2(\mu_S - \mu_G)^2 \sum_{x_i \in S} p_i$$

$$= q_S v_{T_G} + 2p_S(\mu_G - \mu_S)\Big(\sum_{x_i \in T_G} x_i p_i - \mu_G \sum_{x_i \in T_G} p_i\Big) + p_S^2(\mu_G - \mu_S)^2 q_S$$
$$\quad + p_S v_S + 2q_S(\mu_S - \mu_G)\Big(\sum_{x_i \in S} x_i p_i - \mu_S \sum_{x_i \in S} p_i\Big) + q_S^2(\mu_S - \mu_G)^2 p_S$$

$$= q_S v_G + 2p_S(\mu_G - \mu_S)(q_S\mu_{T_G} - \mu_G q_S) + p_S^2 q_S(\mu_G - \mu_S)^2$$
$$\quad + p_S v_S + 2q_S(\mu_S - \mu_G)(p_S\mu_S - \mu_S p_S) + q_S^2 p_S(\mu_S - \mu_G)^2$$

Since $\mu_{T_G} = \mu_G$ (hyp. 1), the cross terms vanish, and $p_S^2 q_S + q_S^2 p_S = p_S q_S$; then

$$v_T = q_S v_G + p_S v_S + p_S q_S (\mu_G - \mu_S)^2. \quad (2)$$

We are now able to derive and link $B_{TG}$ and $B_{SG}$:

$$B_{TG} = \frac{c_T}{c_T + c_G}\Big(\mu_T - \frac{c_T\mu_T + c_G\mu_G}{c_T + c_G}\Big)^2 + \frac{c_G}{c_T + c_G}\Big(\mu_G - \frac{c_T\mu_T + c_G\mu_G}{c_T + c_G}\Big)^2$$
$$= \frac{c_T}{c_T + c_G}\Big(\frac{c_G\mu_T - c_G\mu_G}{c_T + c_G}\Big)^2 + \frac{c_G}{c_T + c_G}\Big(\frac{c_T\mu_G - c_T\mu_T}{c_T + c_G}\Big)^2$$

$$B_{TG} = \frac{c_T c_G (\mu_T - \mu_G)^2}{(c_T + c_G)^2} \quad (3)$$

$$= \frac{c_T c_G (q_S\mu_G + p_S\mu_S - \mu_G)^2}{(c_T + c_G)^2} = \frac{c_T c_G\, p_S^2 (\mu_S - \mu_G)^2}{(c_T + c_G)^2} = \frac{c_G\, c_S^2 (\mu_S - \mu_G)^2}{c_T (c_T + c_G)^2}.$$

Similarly to Eq. (3) we have:

$$B_{SG} = \frac{c_S c_G (\mu_S - \mu_G)^2}{(c_S + c_G)^2}. \quad (4)$$

Thus from Eq. (3) and (4):

$$B_{TG} = \frac{c_S (c_S + c_G)^2}{c_T (c_T + c_G)^2} \, B_{SG}. \quad (5)$$
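Before hyp. 1 is applied, the decomposition behind Eq. (2) is an exact identity, $v_T = q_S v_{T_G} + p_S v_S + p_S q_S (\mu_{T_G} - \mu_S)^2$, which is easy to check numerically. The snippet below uses synthetic data (the normal distributions are arbitrary choices, not from the paper) and NumPy's population variance:

```python
import numpy as np

rng = np.random.default_rng(0)
x_TG = rng.normal(0.0, 1.0, size=400)   # blobs of T that do not represent wk
x_S = rng.normal(3.0, 0.5, size=100)    # blobs that exactly represent wk
x_T = np.concatenate([x_TG, x_S])       # T = TG ∪ S

p_S = len(x_S) / len(x_T)               # p_S = c_S / c_T
q_S = 1.0 - p_S                         # q_S = c_TG / c_T
lhs = np.var(x_T)
rhs = (q_S * np.var(x_TG) + p_S * np.var(x_S)
       + p_S * q_S * (np.mean(x_TG) - np.mean(x_S)) ** 2)
assert abs(lhs - rhs) < 1e-10           # exact up to floating-point error
```

Hyp. 1 only replaces the TG statistics by those of G in this exact decomposition.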


[Figure 2: three scatter plots of max Hn(COLOR; wk) (Y-axis) versus max Hn(TEXTURE; wk), max Hn(SHAPE; wk) and max Hn(POSITION; wk) (X-axes), with word labels such as TREE, ROCK, FLOWER, PLANTS, LEAF, BIRD, WATER, SKY, SNOW, MOUNTAIN, GRASS, STREET, BUILDING, PEOPLE.]
Figure 2: Maximum values of the normalised estimated discrimination power Hn(x; wk) = F̂(x; wk) / Σx F̂(x; wk) for the COLOR, TEXTURE, SHAPE, and POSITION feature sets, for the 14 most frequent words of the database (other words are represented by a simple dot). The results are intuitively correct: TREE, ROCK, FLOWER, PLANTS are mostly discriminated by color, while BUILDING and STREET are more discriminated by texture. On average, SHAPE is not very competitive compared to COLOR, nor is POSITION. BIRD is the word most discriminated by POSITION; indeed, most COREL images with a bird show the bird in the image center.

We also derive the Within variances $W_{TG}$ and $W_{SG}$:

$$W_{TG} = \frac{c_T v_T + c_G v_G}{c_T + c_G} = \frac{c_T\big(q_S v_G + p_S v_S + p_S q_S(\mu_G - \mu_S)^2\big) + c_G v_G}{c_T + c_G}$$
$$= \frac{(q_S c_T + c_G)\, v_G + p_S c_T\, v_S + p_S q_S c_T (\mu_G - \mu_S)^2}{c_T + c_G}$$

then

$$W_{TG} = \frac{(c_T - c_S + c_G)\, v_G + c_S v_S + p_S q_S c_T (\mu_G - \mu_S)^2}{c_T + c_G}. \quad (6)$$

By definition $W_{SG} = \frac{c_S v_S + c_G v_G}{c_S + c_G}$, so $v_G = \frac{c_S + c_G}{c_G} W_{SG} - \frac{c_S v_S}{c_G}$. Hence

$$W_{TG} = \frac{(c_T - c_S + c_G)\big(\frac{c_S + c_G}{c_G} W_{SG} - \frac{c_S v_S}{c_G}\big) + c_S v_S + p_S q_S c_T (\mu_G - \mu_S)^2}{c_T + c_G}$$

$$= \frac{(c_T - c_S + c_G)(c_S + c_G)}{c_G (c_T + c_G)}\, W_{SG} - \frac{c_S (c_T - c_S)}{c_G (c_T + c_G)}\, v_S + \frac{c_S (c_T - c_S)}{c_T (c_T + c_G)}\, (\mu_G - \mu_S)^2. \quad (7)$$

Dividing Eq. (7) by $B_{TG}$ as given by Eq. (5), and using Eq. (4) to express $B_{SG}$ with $(\mu_S - \mu_G)^2$:

$$\hat V(x; w_k) = \frac{W_{TG}}{B_{TG}} = \frac{\frac{(c_T - c_S + c_G)(c_S + c_G)}{c_G (c_T + c_G)} W_{SG} - \frac{c_S (c_T - c_S)}{c_G (c_T + c_G)} v_S + \frac{c_S (c_T - c_S)}{c_T (c_T + c_G)} (\mu_G - \mu_S)^2}{\frac{c_S (c_S + c_G)^2}{c_T (c_T + c_G)^2}\, B_{SG}}$$

$$= \frac{c_T (c_T - c_S + c_G)(c_T + c_G)}{c_G\, c_S (c_S + c_G)} \frac{W_{SG}}{B_{SG}} + \frac{(c_T - c_S)(c_T + c_G)}{c_S\, c_G} \Big(1 - \frac{c_T}{c_G} \frac{v_S}{(\mu_G - \mu_S)^2}\Big)$$

thus

$$\hat V(x; w_k) = A(w_k)\, V(x; w_k) + B(w_k)\,\big(1 - C(x; w_k)\big) \quad (8)$$


where A and B are positive constants independent of x, depending only on the number of blobs in the sets T, S, G (experiments on the COREL database show that, for all words, A and B are close to 10). Therefore, for any given word wk, V̂(x; wk) is a linear function of V(x; wk) if C(x; wk) is negligible compared to 1. This is the case if (hyp. 2) cT/cG is small, which is true in the COREL database since it is close to 0.01 for most words and never exceeds 0.2 (actually one can build any database such that cT << cG), and if (hyp. 3) vS is tiny compared to (μG − μS)², which is the case when x is a reasonably good feature to discriminate G and S (e.g. wk is represented by a rather stationary feature value different from the mean contextual value). Then the orders of the V and V̂ values are the same. Finally, for each word wk, even without knowing which blob of the image it labels, one can estimate the most discriminant features by simply ranking the F̂ values.
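Eq. (8) and the constants A, B and C read off from its derivation can also be checked numerically. In the sketch below, hyp. 1 is enforced exactly by giving TG the same values as G (an artificial device for the check, not a realistic database), in which case Eq. (8) holds up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(1)
x_S = rng.normal(4.0, 0.3, size=50)     # blobs exactly representing wk
x_G = rng.normal(0.0, 1.0, size=500)    # blobs of images not labelled by wk
x_TG = x_G.copy()                       # enforce hyp. 1 exactly: mu_TG = mu_G, v_TG = v_G
x_T = np.concatenate([x_TG, x_S])
cS, cG, cT = len(x_S), len(x_G), len(x_T)

def within(a, b):   # W_{AB}: count-weighted mean of the two variances
    return (len(a) * np.var(a) + len(b) * np.var(b)) / (len(a) + len(b))

def between(a, b):  # B_{AB}: Between variance of the two-set partition
    return len(a) * len(b) * (np.mean(a) - np.mean(b)) ** 2 / (len(a) + len(b)) ** 2

V_hat = within(x_T, x_G) / between(x_T, x_G)    # practical ratio W_TG / B_TG
V = within(x_S, x_G) / between(x_S, x_G)        # theoretical ratio W_SG / B_SG
A = cT * (cT - cS + cG) * (cT + cG) / (cG * cS * (cS + cG))
B = (cT - cS) * (cT + cG) / (cS * cG)
C = cT * np.var(x_S) / (cG * (np.mean(x_G) - np.mean(x_S)) ** 2)
assert abs(V_hat - (A * V + B * (1 - C))) / V_hat < 1e-9   # Eq. (8)
```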
Thereby, in order to estimate how many and which of the features Xn, n ∈ {1, .., δ}, are really discriminant for each word wk, we simply sort all the F̂(Xn; wk) in decreasing order, and calculate N < δ, where δ is the dimension of the visual space and N is defined as the smallest integer such that

$$\sum_{n=1}^{N} \hat F(X_n; w_k) \geq \frac{1}{2} \sum_{n=1}^{\delta} \hat F(X_n; w_k).$$

Thus X1, .., XN are considered as the N best discriminative features for wk.

[Figure 3: scatter plot of per-word NS scores, NS 40DIM (X-axis) versus NS NADAPT0.5 (Y-axis), with word labels such as SNOW, FIELD, GROUND, CAT, WOMAN, GRASS, LEAF, WATER, RUINS, BIRD, SAND, SKY, HILLS, FOREST, HOUSE, PLANTS, BUILDING, PEOPLE, CLOSEUP.]

Figure 3: Word visual consistency representation for the 40DIM method (in X-coordinate) and for the NADAPT0.5 method (in Y-coordinate). The NADAPT0.5 method gives better results than 40DIM except for closeup, garden, street, forest, horse.
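The selection of the N best features defined above takes only a few lines; this sketch (function name illustrative) takes N as the smallest count whose cumulative F̂ mass reaches half of the total mass:

```python
import numpy as np

def select_n_best(f_hat):
    """Indices of the N best features for one word wk.

    f_hat: the delta values F^(X_n; wk); N is the smallest integer such
    that the top-N scores sum to at least half of the total score mass.
    """
    f_hat = np.asarray(f_hat, dtype=float)
    order = np.argsort(f_hat)[::-1]         # sort features by decreasing F^
    cum = np.cumsum(f_hat[order])
    N = int(np.searchsorted(cum, cum[-1] / 2.0)) + 1
    return order[:N]
```

On COREL the paper reports that this rule keeps on average about 4 of the δ = 40 dimensions per word (see section 3.2).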
3    Experimentations on COREL image database

To test the efficiency of ALDA, extensive experiments are done on the COREL image database [9], made of 10 000 images with approximately 100 000 segments preprocessed by K. Barnard et al. [1]. Each image is labelled by an average of 3.6 words from a lexicon of 267 different words, and has an average of 10 visual segments ('blobs') from the Normalized Cuts algorithm [11], which tends to produce small ones. Each blob is described by a set of δ = 40 features, listed below by their dimension index. First, POSITION and SHAPE: (1,2) horizontal and vertical blob position; (3) the proportion of the blob in its image; (4) ratio of the blob's area to the perimeter squared; (5) moment of inertia; (6) ratio of the blob's area to its convex hull. COLOURS (7,..,24) are represented by the average and standard deviation of (R,G,B), (r,g,S) and (L,a,b). TEXTURES (25,..,40) are extracted by gaussian filters [1].

3.1    F estimation for COLOR, TEXTURE, SHAPE and POSITION

We run ALDA on 6 000 COREL images, and measure for each word the maximum value of F̂ for the SHAPE, COLOR and TEXTURE feature sets. These values, represented in Fig. 2 for the 14 most frequent words, are intuitively correct and show the word dependence of ALDA. The repartition analysis, over the words of all the 6 000 images of the train set, of the selected N best features gives respectively 3% for POSITION, 8% for SHAPE, 65% for COLOR and 24% for TEXTURE features. COLOR features are confirmed to be the most discriminant ones (see also Fig. 2). The simple TEXTURE features (16 gaussian filters) are better than the SHAPE ones, certainly because the blob segmentation is imprecise (see Fig. 1).

3.2    Hierarchical Ascendant Classifications improved by ALDA

To demonstrate ALDA efficiency on a classification task, we now run on COREL a Hierarchical Ascendant Classification (HAC) of visual features into word categories [12]. As in [2], we measure the system performance using the Normalised Score NS = sensitivity + specificity − 1 [1, 8]. Compared to the raw visual input space, good results have been obtained by reducing the HAC visual feature inputs to the ALDA N best discriminant features as defined at the end of section 2 (method called NADAPT0.5). NS values for HAC on the 40 usual visual dimensions or on the word-adaptive features are shown in Fig. 3. Classification of the 3 000 images of the test set shows a gain of +37% in NS, and simultaneously, averaged over all words, a dimension reduction from δ = 40 to the 4 best features (see [12] for more details on the HAC experiments).

4    Conclusion

In this paper we present ALDA, based on an approximation of the Fisher LDA. We showed that, under weak assumptions (hyp. 1 to 3), ALDA estimates the N best features, which enhance the HAC task while reducing the visual space dimension by a factor of 10. The main contributions of this paper are summarized as follows: (a) For the first time


a theoretical demonstration of ALDA is given in the first     In The Challenge of Image and Video Retrieval
section. (b) We implement ALDA on a reference image          (CIVR02), 2002.
database and we analyse word dependant features sets
constructed using ALDA. (c) We integrate ALDA in a sim- [10] C. Neti, G. Potamianos, J. Luettin, I. Matthews,
ple HAC model, leading to significant improvements. Fur-      H. Glotin, and D. Vergyri. Large-vocabulary audio-
ther auto-annotation experiments are currently being done    visual speech recognition: A summary of the Johns
on COREL with a bayesian system (DIMATEX model               Hopkins Summer 2000 Workshop. In Proc. IEEE
[5]), yielding to promising first results.                    Work. Multimedia Signal Process., 2001.
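
As an illustration of the per-word selection summarized in Sec. 3.1, the sketch below computes, for one keyword, a per-dimension two-class Fisher ratio and keeps the N best visual dimensions. This is a minimal sketch, not the paper's exact estimator: it assumes the two classes for a word are simply the images whose annotation contains that word versus the rest, and uses the standard Fisher ratio computed independently per dimension (the F actually used by ALDA is the approximation defined in Sec. 2).

```python
import numpy as np

def fisher_ratio(features, has_word):
    """Per-dimension Fisher ratio: between-class over within-class scatter,
    for the two-class split induced by one keyword.
    features: (n_images, n_dims) array; has_word: boolean (n_images,)."""
    pos, neg = features[has_word], features[~has_word]
    mu = features.mean(axis=0)
    # Between-class scatter (two classes, weighted by class size)
    sb = (len(pos) * (pos.mean(axis=0) - mu) ** 2
          + len(neg) * (neg.mean(axis=0) - mu) ** 2)
    # Within-class scatter, summed over both classes
    sw = (((pos - pos.mean(axis=0)) ** 2).sum(axis=0)
          + ((neg - neg.mean(axis=0)) ** 2).sum(axis=0))
    return sb / sw

def n_best_features(features, has_word, n=4):
    """Indices of the n most discriminant visual dimensions for a word."""
    f = fisher_ratio(features, has_word)
    return np.argsort(f)[::-1][:n]
```

Applied per keyword over the δ = 40 visual dimensions, such a ranking yields a small word-dependent subset (here n = 4, matching the average reduction reported in Sec. 3.2) on which the HAC is then run.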

5     Acknowledgments

We thank K. Barnard and J. Wang [14] for providing the COREL image database.


References

 [1] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.

 [2] K. Barnard, P. Duygulu, R. Guru, P. Gabbur, and D. Forsyth. The effects of segmentation and feature choice in a translation model of object recognition. In Computer Vision and Pattern Recognition, pages 675–682, 2003.

 [3] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. John Wiley and Sons, Inc., 2000.

 [4] G. Potamianos, J. Luettin, and C. Neti. Hierarchical discriminant features for audio-visual LVCSR. In Proc. of IEEE Int. Conf. ASSP, 2001.

 [5] Hervé Glotin and Sabrina Tollari. Image auto-annotation method using dichotomic visual clustering for CBIR. In Proc. of IEEE EURASIP Fourth International Workshop on Content-Based Multimedia Indexing (CBMI2005), June 2005.

 [6] Philippe H. Gosselin and Matthieu Cord. A comparison of active classification methods for content-based image retrieval. In Proc. of the 1st International Workshop on Computer Vision Meets Databases (CVDB2004), in conjunction with ACM SIGMOD 2004, pages 51–58, Paris, France, 2004.

 [7] Q. S. Liu, R. Huang, H. Q. Lu, and S. D. Ma. Face recognition using kernel based Fisher discriminant analysis. In Proc. of Int. Conf. Automatic Face and Gesture Recognition, pages 197–201, May 2002.

 [8] F. Monay and D. Gatica-Perez. On image auto-annotation with latent space models. In Proc. ACM Int. Conf. on Multimedia (ACM MM), pages 275–278, 2003.

 [9] H. Muller, S. Marchand-Maillet, and T. Pun. The truth about Corel - evaluation in image retrieval. In The Challenge of Image and Video Retrieval (CIVR02), 2002.

[10] C. Neti, G. Potamianos, J. Luettin, I. Matthews, H. Glotin, and D. Vergyri. Large-vocabulary audio-visual speech recognition: A summary of the Johns Hopkins Summer 2000 Workshop. In Proc. IEEE Work. Multimedia Signal Process., 2001.

[11] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.

[12] Sabrina Tollari and Hervé Glotin. Keyword dependent selection of visual features and their heterogeneity for image content-based interpretation. Technical Report LSIS.RR.2005.003, LSIS, 2005.

[13] Sabrina Tollari, Hervé Glotin, and Jacques Le Maitre. Enhancement of textual images classification using segmented visual contents for image search engine. Multimedia Tools and Applications, 25(3):405–417, March 2005.

[14] J. Z. Wang, J. Li, and G. Wiederhold. SIMPLIcity: Semantics-sensitive integrated matching for picture libraries. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(9):947–963, 2001.