GHUNT: A SEMANTIC INDEXING SYSTEM BASED ON CONCEPT SPACE
Qing He, Ziyan Jia, Jiayou Li, Haijun Zhang, Qingyong Li, Zhongzhi Shi

Key Laboratory of Intelligent Information Processing, Institute of Computing Technology,
Chinese Academy of Sciences, Beijing 100080, China
                                                lico 'ti.ics.iCt.:s.cn



ABSTRACT

How to get text information from this huge information space becomes an ever more important problem with the rapid growth of the Internet. In this paper, a semantic indexing system, GHUNT, based on concept space is proposed to solve the problem. Some new technologies are integrated in GHUNT to obtain good performance. GHUNT is an all-sided solution for information retrieval on the Internet.

Keywords: Semantic indexing, Concept space

1. INTRODUCTION

Large quantities of textual data, available for example on the Internet, pose a continuing challenge to applications that help users in making sense of the data.
   The problem is that users want to retrieve on the basis of conceptual content, and individual words provide unreliable evidence about the conceptual topic or meaning of a document. There are usually many ways to express a given concept, so the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user's query will literally match terms in documents that are not of interest to the user. The proposed approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. We assume there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval.
   The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 introduces the functions of GHUNT. Section 4 introduces the document classifier and the document clustering method used in constructing the concept space. Next, Section 5 presents retrieval with semantic association. Finally, conclusions are given in Section 6.

2. RELATED WORK

Much work on constructing concept space has been done around the world since the 1990s. At the Neural Network Research Centre, Helsinki University of Technology, a project called WEBSOM was carried out during 1995-2000, led by Academician Teuvo Kohonen. In the WEBSOM method the self-organizing map (SOM) algorithm is used to automatically organize very large and high-dimensional collections of text documents onto two-dimensional map displays, i.e., a concept space.
   An automatic indexing and concept space approach to a multilingual (Chinese and English) bibliographic database is presented by H. Chen. He introduced a multi-linear term-phrasing technique to extract concept descriptors (terms or keywords) from a Chinese-English bibliographic database.




0-7803-7902-0/03/$17.00 © 2003 IEEE.
3. THE ARCHITECTURE OF GHUNT

GHUNT gets documents from the Internet and organizes these documents automatically by combining a directory structure with a semantic index, in order to improve the accuracy and recall of retrieval and to characterize the semantics of retrieval results. The documents collected by the spider are processed by a text analyzer, which includes parsing and extracting the informative words as concepts for classification, concept clustering, semantic index generation and so on.

4. CONSTRUCTING CONCEPT SPACE

After the web pages are crawled by Spiders, they are classified automatically to construct the first level of the concept space. This is based on Chinese word segmentation. Most approaches to text classification adopt the classical Vector Space Model (VSM). In this model, the content of a document is formalized as a point in a multi-dimensional space and represented by a vector. We can then decide the corresponding classes of a given vector by calculating and comparing the distances among the vectors. The frequently used document representation in VSM is the so-called TF.IDF-vector representation [3], i.e., the calculation of term weight is mainly based on term frequency and inverse document frequency. In order to deal with the limitation of TF.IDF, we adopt an improved approach named TF.IDF.IG, which combines TF.IDF with the information gain from Information Theory. The approach is validated to be feasible and effective by experiments. Furthermore, an approach of multi-hierarchy text classification based on VSM is applied.

4.1. The improvement on the formula of calculating term weight
To calculate term weight, the TF.IDF approach considers two factors: TF (term frequency) and IDF (inverse document frequency). The formula is:

$$ w_{ik} = \frac{tf_{ik} \times \log(N/n_k + 0.01)}{\sqrt{\sum_{k=1}^{m} (tf_{ik})^2 \times [\log(N/n_k + 0.01)]^2}} \qquad (1) $$

where $tf_{ik}$ is the frequency of occurrence of the term $T_k$ in the document $D_i$, $w_{ik}$ is the corresponding term weight, $k = 1, 2, \ldots, m$ ($m$ is the number of terms), $N$ is the number of documents in the collection and $n_k$ is the number of documents containing $T_k$.
   The amount of information of term $T_k$ can be calculated using the following formula: $IG_k = H(D) - H(D \mid T_k)$, where $H(D)$ is the entropy of the document collection $D$:

$$ H(D) = -\sum_{d_i \in D} P(d_i) \times \log_2 P(d_i) $$

$H(D \mid T_k)$ is the conditional entropy of term $T_k$:

$$ H(D \mid T_k) = -\sum_{d_i \in D} P(d_i \mid T_k) \times \log_2 P(d_i \mid T_k) $$

$P(d_i)$ is the probability of the document $d_i$:

$$ P(d_i) = \frac{wordfreq(d_i)}{\sum_{d_j \in D} wordfreq(d_j)} $$

where $wordfreq(d_i)$ is the sum of all the term frequencies in $d_i$. Then formula (1) can be revised as follows:

$$ w_{ik} = \frac{tf_{ik} \times \log(N/n_k + 0.01) \times IG_k}{\sqrt{\sum_{k=1}^{m} (tf_{ik})^2 \times [\log(N/n_k + 0.01) \times IG_k]^2}} \qquad (2) $$

4.2. Multi-hierarchy text classification
The frequently used approaches in text classification let all classes share a single classifier or assign each class its own classifier, and all classes are in the same hierarchy, i.e., in the same "flattened" class space. When the set of classes is large, not only does it cost much time to construct the class models, but every new document also has to be matched against all the class models to be assigned its proper classes.
   To overcome this problem, we propose an approach of multi-hierarchy text classification based on VSM, i.e., all classes are organized as a tree according to some given hierarchical relations. The basic insight supporting our approach is that classes attached to the same node have a lot in common.
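To make the tree-structured matching concrete, the following is a minimal sketch, not the authors' implementation. It assumes each class model is a sparse term-weight vector and that a document descends the tree by following the most similar child at each node. The tree, the toy vectors, the cosine measure and the names `cosine` and `classify_top_down` are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def classify_top_down(doc_vec, node):
    """Walk the class tree from the root, following the most similar child."""
    path = []
    while node["children"]:
        node = max(node["children"], key=lambda c: cosine(doc_vec, c["model"]))
        path.append(node["name"])
    return path

# Hypothetical two-level class tree; every "model" is a toy term-weight vector.
tree = {"name": "root", "model": {}, "children": [
    {"name": "Sports", "model": {"match": 1.0, "team": 0.8}, "children": [
        {"name": "Basketball", "model": {"dunk": 1.0, "team": 0.5}, "children": []},
        {"name": "Tennis", "model": {"serve": 1.0, "racket": 0.7}, "children": []},
    ]},
    {"name": "Computer", "model": {"software": 1.0, "chip": 0.9}, "children": []},
]}

doc = {"team": 0.6, "match": 0.4, "dunk": 0.9}
path = classify_top_down(doc, tree)
print(path)
```

At each level the document is compared only with the children of the current node, so the number of comparisons grows with the depth and branching of the tree rather than with the total number of classes.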




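The term weights that such class models compare come from formula (2) of Section 4.1. As a hedged illustration, the sketch below computes TF.IDF.IG weights for a tiny document-term count matrix. It assumes that $P(d_i \mid T_k)$ is the share of the occurrences of $T_k$ that fall in $d_i$, which the text does not spell out, and it uses the natural logarithm in the IDF factor. The function names and the toy matrix are hypothetical.

```python
import math

def information_gain(tf, k):
    """IG_k = H(D) - H(D|T_k) for term column k of a document-term count matrix."""
    totals = [sum(doc) for doc in tf]  # wordfreq(d_i)
    grand = sum(totals)
    h_d = -sum((t / grand) * math.log2(t / grand) for t in totals if t)
    column = [doc[k] for doc in tf]
    col_sum = sum(column)
    if col_sum == 0:
        return 0.0
    # Assumption: P(d_i | T_k) is the share of T_k's occurrences found in d_i.
    h_cond = -sum((c / col_sum) * math.log2(c / col_sum) for c in column if c)
    return h_d - h_cond

def tfidf_ig_weights(tf):
    """Return one normalized TF.IDF.IG weight vector per document, as in formula (2)."""
    n_docs, n_terms = len(tf), len(tf[0])
    n_k = [sum(1 for doc in tf if doc[k] > 0) for k in range(n_terms)]
    ig = [information_gain(tf, k) for k in range(n_terms)]
    weights = []
    for doc in tf:
        raw = [
            doc[k] * math.log(n_docs / n_k[k] + 0.01) * ig[k] if n_k[k] else 0.0
            for k in range(n_terms)
        ]
        norm = math.sqrt(sum(x * x for x in raw))
        weights.append([x / norm if norm else 0.0 for x in raw])
    return weights

# Toy counts: rows are documents, columns are terms.
tf = [[3, 0, 1], [0, 2, 1], [1, 1, 1]]
weights = tfidf_ig_weights(tf)
for row in weights:
    print([round(w, 3) for w in row])
```

The third term occurs once in every document, so both its IDF factor and its information gain are close to zero and it contributes almost nothing to any weight vector.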
Because classes attached to the same node have more in common with each other than with other classes, the models of these classes can be based on a small set of features.
   Our lab has developed a text classification system for a corporation. The main idea is described as follows. Firstly, the class architecture is designed according to the needs of the corporation. Then the class models are constructed by training on documents classified by hand according to the classification hierarchy. Next, the text contents are extracted from the web pages crawled by Spiders. Furthermore, the text contents are analyzed after word segmentation. Finally, the web pages are assigned the proper class using the automatic classification algorithm. The ideal classification performance was achieved after several rounds of revision.
   In the test phase, we collected 21,430 documents downloaded from some famous web sites in China, such as SINA and FM365, as the training set. The corporation collected 11,062 documents as the testing set. There are 34 classes. The experiment results are shown in Table 1. We will further research the classification precision when a web page is assigned the best N classes. More details can be found in [7].

Table 1. The classification precision

Class                           Precision
Internet                        69.23%
Software                        81.66%
Hardware                        78.57%
Games                           ?5.23%
Education                       83.76%
Economy                         76.08%
Science                         83.56%
Society                         81.9%
China Football                  [illegible]
[illegible]                     91.76%
[illegible]                     81.96%
[illegible]                     79.67%
Winter sports                   60.89%
Basketball                      98.83%
Volleyball                      81.66%
Table tennis                    69.76%
Else of sports                  54.78%
Chess and card                  80.0%
Boxing                          98.2%
Racing car                      [illegible]
Gym                             93.9%
Track and field                 87.1%
Tennis                          81.81%
Swimming                        95.4%
Badminton                       85.34%
International Football          83.67%
Entertainment                   90.69%
Politics, Law and [illegible]   84.75%
[remaining rows illegible in the source]

4.3. Deeper level of concept space
A deeper level of the concept space can be obtained by clustering. We propose a document clustering algorithm based on Swarm Intelligence and k-means: CSIM. In this paper, Swarm Intelligence is defined as any attempt to design algorithms or distributed problem-solving devices inspired by the collective behavior of social insect colonies and other animal societies. CSIM combines swarm intelligence with the k-means clustering technique. It is a two-phase process. Firstly, an initial set of clusters is formed by a swarm intelligence based clustering method, which is derived from a basic model interpreting the ant colony organization of cemeteries. Secondly, an iterative partitioning phase is employed to further optimize the results.
   The main idea of the swarm intelligence based clustering method is that data objects are initially projected onto a plane at random; the artificial ants then perform random walks on the plane and pick up or drop projected data items with a probability that is converted from the swarm similarity within a local region by a probability conversion function. Clusters are visually formed on the plane by the collective actions of the ant colony in the absence of central control. The method is also applied to document clustering through the vector space model. Self-organizing clusters are formed by this method, and the number of clusters is acquired adaptively. Moreover, it is insensitive to outliers and to the order of input. It offsets the weakness of the partitioning method and reduces the number of iterations in the second phase.
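The pick-up and drop decisions of the first phase can be illustrated with the classic ant cemetery model. This is only a sketch under stated assumptions: the probability conversion function used by CSIM is not given in the text, so the squared-ratio forms below follow the basic ant-clustering model, and the constants `k1`, `k2`, `alpha` and the averaging in `local_similarity` are illustrative choices.

```python
import math

def local_similarity(item, neighbors, alpha=0.5):
    """Average similarity of `item` to the items in its local region, clipped at 0."""
    if not neighbors:
        return 0.0
    total = sum(1.0 - math.dist(item, other) / alpha for other in neighbors)
    return max(0.0, total / len(neighbors))

def pick_probability(f, k1=0.1):
    """An unloaded ant picks up an item more readily when local similarity f is low."""
    return (k1 / (k1 + f)) ** 2

def drop_probability(f, k2=0.15):
    """A loaded ant drops its item more readily when local similarity f is high."""
    return (f / (k2 + f)) ** 2

# Two items already lying close together on the plane.
region = [(0.12, 0.10), (0.10, 0.13)]
f_similar = local_similarity((0.10, 0.10), region)   # item that fits the region
f_outlier = local_similarity((0.90, 0.90), region)   # item far from the region

print(round(pick_probability(f_similar), 3), round(drop_probability(f_similar), 3))
print(round(pick_probability(f_outlier), 3), round(drop_probability(f_outlier), 3))
```

An item that resembles its neighborhood is rarely picked up and readily dropped there, while a dissimilar item is picked up almost surely. Repeating such decisions during random walks is what lets the clusters form without central control.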
Actually, the swarm intelligence based clustering method can be applied independently. But in the second phase, the outliers, which are single points on the ant-work plane, are merged into their nearest neighbor clusters, and the clusters that are piled too close together to be collected correctly on the plane by chance are also split. The k-means clustering phase softens the chanciness of the swarm intelligence based method, which originates from a probabilistic model. A description of the clustering algorithm and more details can be found in [9]. Table 2 shows the results of CSIM. By using CSIM we can obtain a deeper level of the concept space.

5. RETRIEVAL WITH SEMANTIC INDEXING

In the following sections, we describe how to use the concept space to facilitate querying and information retrieval.

5.1. Concept space generation using co-occurrence analysis
Now we introduce how to generate the link weights in the concept space of a specific domain automatically. Before generating the concept space, we must identify the concepts of that domain. By using the following formula we can compute the information gain of each term for classification:

$$ InfGain(F) = P(F) \sum_i P(\psi_i \mid F) \log \frac{P(\psi_i \mid F)}{P(\psi_i)} + P(\bar{F}) \sum_i P(\psi_i \mid \bar{F}) \log \frac{P(\psi_i \mid \bar{F})}{P(\psi_i)} $$

where $F$ is a term, $P(F)$ is the probability that term $F$ occurs, $\bar{F}$ means that term $F$ doesn't occur, $P(\psi_i)$ is the probability of the i-th class value, and $P(\psi_i \mid F)$ is the conditional probability of the i-th class value given that word $F$ occurred. If $InfGain(F) > \theta$, we choose term $F$ as a concept. Although the thesaurus generated by this method is not as thorough and precise as a thesaurus generated manually from scientific literature, it is an acceptable thesaurus. This is not an ideal method, but it is a feasible one.
   After we have recognized the concepts of a class, we can generate the concept space of that class automatically. Here, we adopt Chen's method.
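The information-gain selection of concepts described above can be sketched as follows. This is an illustrative sketch, not the system's code: probabilities are estimated as simple document proportions, the logarithm is taken to base 2, and the threshold value `theta` and the toy corpus are assumptions.

```python
import math

def info_gain(docs, term):
    """Information gain of `term` for classification.

    docs is a list of (set_of_terms, class_label) pairs; probabilities are
    estimated as document proportions (an assumption of this sketch).
    """
    n = len(docs)
    classes = {label for _, label in docs}
    prior = {c: sum(1 for _, label in docs if label == c) / n for c in classes}
    with_f = [label for terms, label in docs if term in terms]
    without_f = [label for terms, label in docs if term not in terms]

    def part(subset):
        # P(subset) * sum_i P(class_i | subset) * log(P(class_i | subset) / P(class_i))
        if not subset:
            return 0.0
        total = 0.0
        for c in classes:
            p_c_given = subset.count(c) / len(subset)
            if p_c_given > 0:
                total += p_c_given * math.log2(p_c_given / prior[c])
        return (len(subset) / n) * total

    return part(with_f) + part(without_f)

# Toy labeled corpus (hypothetical).
docs = [
    ({"goal", "match"}, "sports"),
    ({"goal", "team"}, "sports"),
    ({"chip", "match"}, "computer"),
    ({"chip", "code"}, "computer"),
]
theta = 0.5  # assumed threshold
vocabulary = sorted(set().union(*(terms for terms, _ in docs)))
concepts = [t for t in vocabulary if info_gain(docs, t) > theta]
print(concepts)
```

In this toy corpus "goal" and "chip" separate the two classes perfectly and are kept as concepts, while "match" occurs equally in both classes and is discarded.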
This method uses co-occurrence analysis and a Hopfield net [5]. We now introduce it briefly.
   In this method, using co-occurrence analysis, we compute the term association weight between two terms $T_j$ and $T_k$:

$$ ClusterWeight(T_j, T_k) = \frac{\sum_i d_{ijk}}{\sum_i d_{ij}} \times WeightingFactor(T_k) $$

Notice that this is an asymmetric function, i.e., $ClusterWeight(T_j, T_k)$ is different from $ClusterWeight(T_k, T_j)$. Each term of this function is computed as follows.

$$ d_{ij} = tf_{ij} \times \log\left(\frac{N}{df_j} \times w_j\right) $$

where $tf_{ij}$ is the number of occurrences of term $j$ in document $i$; $df_j$ is the number of documents containing term $j$ in a collection of $N$ documents; $w_j$ is the length of term $j$.

$$ d_{ijk} = tf_{ijk} \times \log\left(\frac{N}{df_{jk}} \times w_j\right) $$

where $tf_{ijk}$ is the smaller number of occurrences of term $j$ and term $k$ in document $i$, and $df_{jk}$ is the number of documents in which terms $j$ and $k$ both occur.
   After we have computed the asymmetric association between terms, we can activate related terms in response to the user's input. This process is accomplished by a single-layered Hopfield network. Each term is treated as a neuron, and the association weight is assigned to the network as the synaptic weight between nodes. At time 0, the outputs $\mu_i(0)$ of the nodes corresponding to the user's input terms are set to 1, and the outputs of the other nodes are set to 0. After the initialization phase, we repeat the following iteration until convergence:

$$ \mu_j(t+1) = f_s(net_j), \qquad net_j = \sum_i t_{ij}\, \mu_i(t) $$

$$ f_s(net_j) = \frac{1}{1 + \exp\left[-(net_j - \theta_j)/\theta_0\right]} $$

where $t_{ij}$ is the synaptic weight from node $i$ to node $j$ and $f_s$ is the sigmoid function.
   By using the above method, many important semantic relations between concepts are discovered.

Fig. 1. The Retrieval Result of Italy

   For example, Fig. 1 shows that the concept of Italy has many important related concepts that agree with common sense. According to the definitions of concepts and the relations between sememes, we can extract the relations between concepts. These relations include hypernym, hyponym, synonym, antonym, part-whole, attribute-host, etc. We only use some of these relations, and we give different weights to different relations.

[Relation weights: the relation subscripts and the first value are illegible in the source; the legible values are 0.7, 0.4, 0.2 and 0.2.]

If the user has selected a term, its activation spreads following this equation.

[Equation illegible in the source.]

If the resulting value is greater than a threshold value, this node is activated and the activation spreads; otherwise, it is not activated.
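The co-occurrence weights and the Hopfield activation described above can be combined in a short sketch. Several points are assumptions rather than details from the paper: `WeightingFactor` is not defined in the text and is fixed to 1 here, the length of a term is taken as its character count, the sigmoid parameters `theta_j` and `theta_0` are arbitrary, and the input nodes are kept clamped to 1 during iteration. The toy corpus is hypothetical.

```python
import math

def cluster_weight(docs, j, k):
    """Asymmetric ClusterWeight(T_j, T_k); docs is a list of {term: frequency} dicts."""
    n = len(docs)
    df_j = sum(1 for d in docs if d.get(j, 0) > 0)
    df_jk = sum(1 for d in docs if d.get(j, 0) > 0 and d.get(k, 0) > 0)
    if df_jk == 0:
        return 0.0
    w_j = len(j)  # assumption: term length measured in characters
    num = sum(min(d.get(j, 0), d.get(k, 0)) * math.log(n / df_jk * w_j) for d in docs)
    den = sum(d.get(j, 0) * math.log(n / df_j * w_j) for d in docs)
    # WeightingFactor(T_k) is not defined in the text; it is taken as 1 here.
    return num / den if den else 0.0

def sigmoid(net, theta_j=0.5, theta_0=0.1):
    return 1.0 / (1.0 + math.exp(-(net - theta_j) / theta_0))

def hopfield_activate(weights, terms, query, max_iter=50, eps=1e-6):
    """Spread activation from the query terms until the outputs stop changing."""
    mu = {t: 1.0 if t in query else 0.0 for t in terms}
    for _ in range(max_iter):
        new = {j: sigmoid(sum(weights.get((i, j), 0.0) * mu[i] for i in terms))
               for j in terms}
        for t in query:
            new[t] = 1.0  # assumption: input nodes stay clamped
        change = sum(abs(new[t] - mu[t]) for t in terms)
        mu = new
        if change < eps:
            break
    return mu

docs = [
    {"italy": 2, "rome": 1},
    {"italy": 1, "rome": 1, "pasta": 1},
    {"italy": 1, "pasta": 2},
    {"linux": 3, "kernel": 2},
]
terms = ["italy", "rome", "pasta", "linux", "kernel"]
weights = {(j, k): cluster_weight(docs, j, k) for j in terms for k in terms if j != k}
activation = hopfield_activate(weights, terms, {"italy"})
print({t: round(a, 3) for t, a in activation.items()})
```

With the query term "italy", the co-occurring terms "rome" and "pasta" end up strongly activated, while "linux" and "kernel", which never co-occur with the query, stay near zero.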
If the number of activated nodes is greater than an expected value, activation stops spreading.

5.2. Text query
First, we ask the user to select a specific domain; the query is restricted to that domain. Then we ask the user to input a keyword. The system returns all the documents which contain the keyword. At the same time, using the automatically generated concept space and HowNet, the system can prompt the user with related words as well. By matching the keyword with the classes, we can return the related classes, too.

6. CONCLUSION

In this paper, GHUNT, a solution for information retrieval on the Internet, is proposed. Concept search is a trend in information retrieval. GHUNT partially realizes concept-associated search through efficient information organization. Moreover, GHUNT could work better if improvements were made in system integration. For example, although there are differences between feature selection for document classification and for clustering, some procedures are similar and need to be processed only once.

7. ACKNOWLEDGEMENTS

The research work in this paper is supported by the National Science Foundation of China (No. 60173017, 90104021) and the Nature Science Foundation of Beijing (No. 4011003).

References
[1] Dong Mingkai, Tian Qi-na, Shi Zhongzhi, Web Spider Based on Intelligent Agent. SCI2001, Orlando, pages 292-296, 2001.
[2] E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems, Oxford Univ. Press, New York, 1999.
[3] G. Salton, C. Buckley, Term-weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 1988, 24(5), pages 513-523.
[4] H. Chen and D. T. Ng, An algorithmic approach to concept exploration in a large knowledge network (automatic thesaurus consultation): symbolic branch-and-bound vs. connectionist Hopfield net activation. Journal of the American Society for Information Science, 46(5), pages 348-369, June 1995.
[5] H. Chen, J. Martinez, T. D. Ng, and B. R. Schatz, A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System, Journal of the American Society for Information Science, Volume 48, Number 1, pages 17-31, January 1997.
[6] Samuel Kaski, Timo Honkela, Krista Lagus, Teuvo Kohonen, WEBSOM - Self-organizing maps of document collections, Neurocomputing 21 (1998), pages 101-117.
[7] Shaohui Liu, Mingkai Dong, Haijun Zhang, Rong Li, Zhongzhi Shi, An Approach of Multi-hierarchy Text Classification, 2001 International Conferences on Info-tech and Info-net Proceedings, Conference C, pages 95-100.
[8] Teuvo Kohonen, Samuel Kaski, Self-Organization of a Massive Document Collection. IEEE Transactions on Neural Networks, Vol. 11, No. 3, May 2000.
[9] Wu Bin, Zheng Yi, Liu Shaohui, Shi Zhongzhi, CSIM: A Document Clustering Algorithm Based on Swarm Intelligence, In Proceedings of Congress on Evolutionary Computation, 2002.
[10] Xiaoli Li, Data Mining Research in Web Information Retrieval and Classification, Ph.D. thesis, May 2001. R. Basili and A. Moschitti and M. Pazienza. 1999. A text classifier based on linguistic processing. In Proceedings of IJCAI-99, Machine Learning for Information Filtering.

						