GHUNT: A SEMANTIC INDEXING SYSTEM BASED ON CONCEPT SPACE
Qing He, Ziyan Jia, Jiayou Li, Haijun Zhang, Qingyong Li, Zhongzhi Shi
Key Laboratory of Intelligent Information Processing, Institute of Computing Technology,
Chinese Academy of Sciences, Beijing 100080, China
lico 'ti.ics.iCt.:s.cn
ABSTRACT
How to get text information from the huge information space of the Internet has become an increasingly important problem with the rapid growth of the Internet. In this paper, a semantic indexing system based on concept space, GHUNT, is proposed to solve the problem. Several new technologies are integrated in GHUNT to obtain good performance. GHUNT is an all-round solution for information retrieval on the Internet.
Keywords: Semantic indexing, Concept space
1. INTRODUCTION
Large quantities of textual data, available for example on the Internet, pose a continuing challenge to applications that help users make sense of the data.
The problem is that users want to retrieve on the basis of conceptual content, while individual words provide unreliable evidence about the conceptual topic or meaning of a document. There are usually many ways to express a given concept, so the literal terms in a user's query may not match those of a relevant document. In addition, most words have multiple meanings, so terms in a user's query will literally match terms in documents that are not of interest to the user. The proposed approach tries to overcome the deficiencies of term-matching retrieval by treating the unreliability of observed term-document association data as a statistical problem. We assume there is some underlying latent semantic structure in the data that is partially obscured by the randomness of word choice with respect to retrieval.
The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 introduces the functions of GHUNT. Section 4 introduces the document classifier and the document clustering method used to construct the concept space. Next, Section 5 presents retrieval with semantic association. Finally, conclusions are given in Section 6.
2. RELATED WORK
Much work on constructing concept spaces has been done around the world since the 1990s. At the Neural Network Research Centre of Helsinki University of Technology, a project called WEBSOM was carried out during 1995-2000, led by Academician Teuvo Kohonen. In the WEBSOM method, the self-organizing map (SOM) algorithm is used to automatically organize very large and high-dimensional collections of text documents onto two-dimensional map displays, i.e. a concept space.
An automatic indexing and concept space approach to a multilingual (Chinese and English) bibliographic database was presented by H. Chen, who introduced a multi-linear term-phrasing technique to extract concept descriptors (terms or keywords) from a Chinese-English bibliographic database.
0-7803-7902-0/03/$17.00 © 2003 IEEE.
3. THE ARCHITECTURE OF GHUNT
GHUNT gets documents from the Internet and organizes them automatically by combining a directory structure with a semantic index, in order to improve the accuracy and recall of retrieval and to characterize the semantics of the retrieval results. The documents collected by the spider are processed by a text analyzer, which parses them and extracts the informative words as concepts for classification, concept clustering, semantic index generation and so on.
4. CONSTRUCTING CONCEPT SPACE
After the web pages are crawled by Spiders, they are classified automatically to construct the first level of the concept space. The classification is based on Chinese word segmentation. Most approaches to text classification adopt the classical Vector Space Model (VSM). In this model, the content of a document is formalized as a point in a multi-dimensional space and represented by a vector. The classes of a given vector can then be decided by calculating and comparing the distances among the vectors. The most frequently used document representation in VSM is the so-called TF.IDF vector representation [3], i.e., the term weight is calculated mainly from term frequency and inverse document frequency. To deal with the limitations of TF.IDF, we adopt an improved approach named TF.IDF.IG, which incorporates the information gain from Information Theory. Experiments validate that the approach is feasible and effective. Furthermore, an approach of multi-hierarchy text classification based on VSM is applied.
4.1. The improvement on the formula of calculating term weight
To calculate term weight, the TF.IDF approach considers two factors: TF (term frequency) and IDF (inverse document frequency). The formula is:
$$W_{ik} = \frac{tf_{ik} \times \log(N/n_k + 0.01)}{\sqrt{\sum_{k=1}^{m} (tf_{ik})^2 \times [\log(N/n_k + 0.01)]^2}} \qquad (1)$$
where $tf_{ik}$ is the frequency of occurrence of the term $T_k$ in the document $D_i$, $W_{ik}$ is the corresponding term weight, $N$ is the number of documents in the collection, $n_k$ is the number of documents containing $T_k$, and $k = 1, 2, \ldots, m$ ($m$ is the number of terms).
The amount of information of term $T_k$ can be calculated using the following formula: $IG_k = H(D) - H(D|T_k)$, where $H(D)$ is the entropy of the document collection,
$$H(D) = -\sum_{d_i \in D} P(d_i) \times \log_2 P(d_i),$$
and $H(D|T_k)$ is the conditional entropy given term $T_k$:
$$H(D|T_k) = -\sum_{d_i \in D} P(d_i|T_k) \times \log_2 P(d_i|T_k).$$
The probability of the document $d_i$ is computed from $wordfreq(d_i)$, the sum of all the term frequencies in $d_i$. Formula (1) can then be revised as follows:
$$W_{ik} = \frac{tf_{ik} \times \log(N/n_k + 0.01) \times IG_k}{\sqrt{\sum_{k=1}^{m} (tf_{ik})^2 \times [\log(N/n_k + 0.01) \times IG_k]^2}} \qquad (2)$$
4.2. Multi-hierarchy text classification
The frequently used approaches in text classification either let all classes share a single classifier or assign each class its own classifier, and all classes are at the same level, i.e., in the same "flattened" class space. When the set of classes is large, not only does it cost much time to construct the class models, but a new document also has to be matched against all the class models to be assigned its proper classes.
To overcome this problem, we propose an approach of multi-hierarchy text classification based on VSM, in which all classes are organized as a tree according to some given hierarchical relations. The basic insight supporting our approach is that classes attached to the same node have a lot more in common with each other than with other classes, so the models of these classes can be based on a small set of features.
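The TF.IDF.IG weighting of formula (2) in Section 4.1 can be illustrated with a minimal Python sketch. The source does not preserve the formulas for the document probabilities, so the sketch assumes that P(d_i) is the share of all term occurrences found in d_i and that P(d_i|T_k) is the share of the occurrences of T_k found in d_i; it also assumes the natural logarithm in the IDF factor. The function names `entropy`, `information_gain` and `tfidf_ig_weights` are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; zero probabilities contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(tf, k):
    """IG_k = H(D) - H(D|T_k) for the term with column index k.

    Assumption (not given in the source): P(d_i) is proportional to the
    total term count of d_i, and P(d_i|T_k) to the count of T_k in d_i.
    """
    doc_totals = [sum(row) for row in tf]
    grand_total = sum(doc_totals)
    p_doc = [t / grand_total for t in doc_totals]
    term_total = sum(row[k] for row in tf)
    if term_total == 0:
        return 0.0
    p_doc_given_term = [row[k] / term_total for row in tf]
    return entropy(p_doc) - entropy(p_doc_given_term)


def tfidf_ig_weights(tf):
    """Return the matrix W[i][k] of formula (2) for a tf matrix tf[i][k]."""
    n_docs = len(tf)
    n_terms = len(tf[0])
    # n_k: number of documents that contain term k
    df = [sum(1 for row in tf if row[k] > 0) for k in range(n_terms)]
    ig = [information_gain(tf, k) for k in range(n_terms)]
    # log(N / n_k + 0.01) * IG_k, shared by all documents
    factor = [
        math.log(n_docs / df[k] + 0.01) * ig[k] if df[k] else 0.0
        for k in range(n_terms)
    ]
    weights = []
    for row in tf:
        raw = [row[k] * factor[k] for k in range(n_terms)]
        norm = math.sqrt(sum(x * x for x in raw))
        weights.append([x / norm if norm else 0.0 for x in raw])
    return weights


if __name__ == "__main__":
    # Three documents, three terms; the third term occurs evenly everywhere.
    example_tf = [[2, 0, 1], [0, 2, 1], [1, 1, 1]]
    for row in tfidf_ig_weights(example_tf):
        print([round(x, 3) for x in row])
```

Under these assumptions a term spread evenly over all documents has zero information gain and therefore zero weight, which is the effect the IG factor is meant to add to plain TF.IDF.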
Our lab has developed a text classification system for a corporation. The main idea is as follows. First, the class architecture is designed according to the needs of the corporation. Then the class models are constructed by training on documents classified by hand according to the classification hierarchy. Next, the text contents are extracted from the web pages crawled by Spiders. The text contents are then analyzed after word segmentation. Finally, the web pages are assigned their proper classes by the automatic classification algorithm. The desired classification performance was achieved after several rounds of revision.
In the test phase, we collected 21,430 documents downloaded from some well-known web sites in China, such as SINA and FM365, as the training set. The corporation collected 11,062 documents as the testing set. There are 34 classes. The experimental results are shown in Table 1. We will further study the classification precision when a web page is assigned its best N classes. More details can be found in [7].
Table 1. The classification precision (only the entries legible in the source are listed)
Class          Precision     Class                    Precision
Internet       69.23%        Winter sports            60.89%
Software       81.66%        Basketball               98.83%
Hardware       78.57%        Volleyball               81.66%
Games          95.23%        Table tennis             69.76%
Education      83.76%        Other sports             54.78%
Economy        76.08%        Chess and card           80.0%
Science        83.56%        Boxing                   98.22%
Society        81.90%        Gym                      93.9%
Tennis         81.81%        Track and field          87.14%
Swimming       95.4%         Badminton                85.34%
                             International Football   83.67%
4.3. Deeper level of concept space
A deeper level of the concept space can be built by clustering. We propose a document clustering algorithm based on Swarm Intelligence and k-means: CSIM. In this paper, Swarm Intelligence is defined as any attempt to design algorithms or distributed problem-solving devices inspired by the collective behavior of social insect colonies and other animal societies. CSIM combines swarm intelligence with the k-means clustering technique. It is a two-phase process. First, an initial set of clusters is formed by a swarm intelligence based clustering method, which is derived from a basic model interpreting the ant colony organization of cemeteries. Second, an iterative partitioning phase is employed to further optimize the results.
The main idea of the swarm intelligence based clustering method is that data objects are initially projected onto a plane at random; the artificial ants then perform random walks on the plane and pick up or drop projected data items with a probability that is converted from the swarm similarity within a local region by a probability conversion function. Clusters are visually formed on the plane by the collective actions of the ant colony in the absence of central control. The method is also applied to document clustering through the vector space model. Self-organizing clusters are formed by this method, and the number of clusters is acquired adaptively. Moreover, it is insensitive to outliers and to the order of input. It offsets the weakness of the partitioning method and reduces the number of iterations in the second phase.
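The pick-up and drop decision of the artificial ants can be sketched as follows. The paper does not give the swarm similarity or the probability conversion function, so this sketch assumes a similarity in the style of the classical ant-clustering models (the average of 1 - distance/alpha over the items in the local region) and a simple clipped linear conversion. The names `swarm_similarity`, `pick_up_probability` and `drop_probability`, and the parameters `alpha` and `slope`, are hypothetical and not taken from CSIM.

```python
import math


def swarm_similarity(item, neighbors, alpha):
    """Average similarity of `item` to the items in its local region.

    Each neighbor contributes 1 - distance / alpha (an assumed form);
    a negative average is clipped to 0, and an empty region gives 0.
    """
    if not neighbors:
        return 0.0
    total = sum(1.0 - math.dist(item, other) / alpha for other in neighbors)
    return max(0.0, total / len(neighbors))


def pick_up_probability(similarity, slope=2.0):
    """Assumed linear conversion: dissimilar surroundings -> likely pick-up."""
    return 1.0 - min(1.0, slope * similarity)


def drop_probability(similarity, slope=2.0):
    """Assumed linear conversion: similar surroundings -> likely drop."""
    return min(1.0, slope * similarity)


if __name__ == "__main__":
    item = (0.0, 0.0)
    close_region = [(0.1, 0.0), (0.0, 0.1)]
    distant_region = [(5.0, 5.0), (6.0, 6.0)]
    for region in (close_region, distant_region):
        s = swarm_similarity(item, region, alpha=1.0)
        print(s, pick_up_probability(s), drop_probability(s))
```

An ant carrying an item surrounded by similar items tends to drop it, and an ant passing an item that does not fit its surroundings tends to pick it up, which is how clusters emerge without central control.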
Actually, the swarm intelligence based clustering method can be applied independently. But in the second phase, the outliers, which are single points on the ant-work plane, are merged into their nearest neighbor clusters, and clusters that by chance were piled too close together to be collected correctly on the plane are split. The k-means clustering phase softens the randomness of the swarm intelligence based method, which originates from a probabilistic model. A description of the clustering algorithm and more details can be found in [9]. Table 2 shows the results of CSIM. By using CSIM we can obtain a deeper level of the concept space.
5. RETRIEVAL WITH SEMANTIC INDEXING
In the following sections, we describe how to use the concept space to facilitate querying and information retrieval.
5.1 Concept space generation using co-occurrence analysis
We now introduce how to generate the link weights in the concept space of a specific domain automatically. Before generating the concept space, we must identify the concepts of that domain. Using the following formula we can compute the information gain of each term for classification:
$$InfGain(F) = P(F) \sum_i P(\psi_i|F) \log \frac{P(\psi_i|F)}{P(\psi_i)} + P(\bar{F}) \sum_i P(\psi_i|\bar{F}) \log \frac{P(\psi_i|\bar{F})}{P(\psi_i)}$$
where $F$ is a term, $P(F)$ is the probability that term $F$ occurs, $\bar{F}$ means that term $F$ does not occur, $P(\psi_i)$ is the probability of the i-th class value, and $P(\psi_i|F)$ is the conditional probability of the i-th class value given that term $F$ occurs. If $InfGain(F) > \theta$ for a threshold $\theta$, we choose term $F$ as a concept. Although the thesaurus generated by this method is not as thorough and precise as the manually generated thesauri used in scientific literature, it is an acceptable thesaurus. The method is not ideal, but it is feasible.
After we have recognized the concepts of a class, we can generate the concept space of that class automatically.
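The information-gain selection of concepts in Section 5.1 can be sketched in Python as below. The sketch assumes that each document is represented as a set of terms with a single class label, that probabilities are estimated by simple counting, and that the logarithm is base 2 (the source does not state the base). The function names `info_gain` and `select_concepts` and the toy data are hypothetical.

```python
import math
from collections import Counter


def info_gain(docs, labels, term):
    """InfGain(F) summed over the 'term present' and 'term absent' cases."""
    n = len(docs)
    class_counts = Counter(labels)
    present = [lab for doc, lab in zip(docs, labels) if term in doc]
    absent = [lab for doc, lab in zip(docs, labels) if term not in doc]

    def side(subset):
        # P(side) * sum_i P(class_i | side) * log(P(class_i | side) / P(class_i))
        if not subset:
            return 0.0
        p_side = len(subset) / n
        total = 0.0
        for cls, count in Counter(subset).items():
            p_cls_given_side = count / len(subset)
            p_cls = class_counts[cls] / n
            total += p_cls_given_side * math.log2(p_cls_given_side / p_cls)
        return p_side * total

    return side(present) + side(absent)


def select_concepts(docs, labels, threshold):
    """Return the terms whose information gain exceeds the threshold."""
    vocabulary = sorted(set().union(*docs))
    return [t for t in vocabulary if info_gain(docs, labels, t) > threshold]


if __name__ == "__main__":
    docs = [
        {"goal", "match"},
        {"goal", "team"},
        {"stock", "market"},
        {"stock", "team"},
    ]
    labels = ["sports", "sports", "finance", "finance"]
    print(select_concepts(docs, labels, threshold=0.5))
```

A term that appears in only one class separates the classes well and scores high, while a term spread evenly over the classes scores zero and is not chosen as a concept.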
Here, we adopt Chen's method, which uses co-occurrence analysis and a Hopfield net [5]. We now introduce this method briefly.
In this method, using co-occurrence analysis, we compute the term association weight between two terms $T_j$ and $T_k$:
$$ClusterWeight(T_j, T_k) = \frac{\sum_{i=1}^{n} d_{ijk}}{\sum_{i=1}^{n} d_{ij}} \times WeightingFactor(T_k)$$
Notice that this is an asymmetric function, i.e. $ClusterWeight(T_j, T_k)$ is different from $ClusterWeight(T_k, T_j)$. The terms of this function are computed as follows:
$$d_{ij} = tf_{ij} \times \log\left(\frac{N}{df_j} \times w_j\right)$$
where $tf_{ij}$ is the number of occurrences of term $j$ in document $i$; $df_j$ is the number of documents, in a collection of $N$ documents, in which term $j$ occurs; and $w_j$ is the length of term $j$.
$$d_{ijk} = tf_{ijk} \times \log\left(\frac{N}{df_{jk}} \times w_j\right)$$
where $tf_{ijk}$ is the smaller of the numbers of occurrences of term $j$ and term $k$ in document $i$, and $df_{jk}$ is the number of documents in which both term $j$ and term $k$ occur.
After we have computed the asymmetric association between terms, we can activate related terms in response to the user's input. This process is accomplished by a single-layered Hopfield network. Each term is treated as a neuron, and the association weight is assigned to the network as the synaptic weight between nodes. At time 0, the outputs $\mu_j(0)$ of the nodes corresponding to the user's input terms are set to 1, and the outputs of the other nodes are set to 0. After the initialization phase, we repeat the following iteration until convergence:
$$\mu_j(t+1) = f_s(net_j), \qquad net_j = \sum_i t_{ij}\,\mu_i(t), \qquad f_s(net_j) = \frac{1}{1 + \exp[-(net_j - \theta_j)/\theta_0]}$$
where $t_{ij}$ is the synaptic weight from node $i$ to node $j$ and $f_s$ is the SIGMOID function.
By using the above method, many important semantic relations between concepts are discovered.
Fig. 1. The Retrieval Result of Italy
For example, Fig. 1 shows that the concept of Italy is related to many important concepts in a way that agrees with common sense. According to the definitions of concepts and the relations between sememes, we can extract the relations between concepts. These relations include hypernym, hyponym, synonym, antonym, part-whole, attribute-host, etc. We only use some of these relations, and we give different weights to different relations. The relation weights legible in the source are 0.7, 0.4, 0.2 and 0.2; their assignment to individual relations is not recoverable.
If the user has selected a term, its activation spreads to related concepts according to these weights. If the activation of a node is greater than a threshold value, the node is activated and the activation spreads further. Otherwise, it is not activated.
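The co-occurrence weight and the Hopfield activation described above can be sketched in Python as follows. This is a minimal illustration under several assumptions: the WeightingFactor term is left out because its definition is not given in this paper, the natural logarithm is used, all nodes are updated synchronously, and the values of `theta_j`, `theta_0`, `eps` and `max_iter` are arbitrary illustrative choices. The function names `cluster_weight` and `hopfield_activate` are hypothetical.

```python
import math


def cluster_weight(tf, lengths, j, k):
    """Asymmetric association of term j with term k (WeightingFactor omitted).

    tf[i][t] is the number of occurrences of term t in document i and
    lengths[t] is the length w_t of term t.
    """
    n_docs = len(tf)
    df_j = sum(1 for row in tf if row[j] > 0)
    df_jk = sum(1 for row in tf if row[j] > 0 and row[k] > 0)
    if df_j == 0 or df_jk == 0:
        return 0.0
    sum_d_ij = sum(
        row[j] * math.log(n_docs / df_j * lengths[j]) for row in tf
    )
    sum_d_ijk = sum(
        min(row[j], row[k]) * math.log(n_docs / df_jk * lengths[j])
        for row in tf
    )
    return sum_d_ijk / sum_d_ij if sum_d_ij else 0.0


def hopfield_activate(weights, seeds, theta_j=0.5, theta_0=0.1,
                      eps=1e-4, max_iter=100):
    """Iterate mu_j(t+1) = f_s(sum_i t_ij * mu_i(t)) until the change is small.

    weights[i][j] is the synaptic weight from node i to node j;
    seeds is the set of node indices for the user's input terms.
    """
    n = len(weights)
    mu = [1.0 if j in seeds else 0.0 for j in range(n)]
    for _ in range(max_iter):
        new_mu = []
        for j in range(n):
            net_j = sum(weights[i][j] * mu[i] for i in range(n))
            new_mu.append(1.0 / (1.0 + math.exp(-(net_j - theta_j) / theta_0)))
        change = sum(abs(a - b) for a, b in zip(new_mu, mu))
        mu = new_mu
        if change < eps:
            break
    return mu


if __name__ == "__main__":
    tf = [[2, 1], [1, 0], [0, 3]]
    print(cluster_weight(tf, [1, 1], 0, 1), cluster_weight(tf, [1, 1], 1, 0))
    # Terms 0 and 1 are associated; term 2 is isolated. Self-weights of 1
    # are kept here so that the seed term stays active between iterations.
    w = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(hopfield_activate(w, {0}))
```

In this toy network the activation of the seed term spreads to its associated term, while the unrelated term stays close to zero; a threshold on the final outputs then decides which terms are suggested.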
If the number of activated nodes is greater than an expected value, activation stops spreading.
5.2. Text query
First, we ask the user to select a specific domain; the query is restricted to that domain. Then we ask the user to input a keyword. The system returns all the documents that contain the keyword. At the same time, using the automatically generated concept space and HowNet, the system can prompt the user with related words as well. By matching the keyword with the classes, we can also return the related classes.
6. CONCLUSION
In this paper, GHUNT, a solution for information retrieval on the Internet, is proposed. Concept search is a trend in information retrieval. GHUNT partially realizes a concept-associated search through efficient information organization. Moreover, GHUNT could work better if improvements are made in system integration. For example, although there are differences between feature selection for document classification and for clustering, some procedures are similar and need to be processed only once.
7. ACKNOWLEDGEMENTS
The research work in this paper is supported by the National Science Foundation of China (No. 60173017, 90104021) and the Nature Science Foundation of Beijing (No. 4011003).
References
[1] Dong Mingkai, Tian Qi-jia, Shi Zhongzhi, Web Spider Based on Intelligent Agent, SCI2001, Orlando, pages 292-296, 2001.
[2] E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems, Oxford Univ. Press, New York, 1999.
[3] G. Salton, C. Buckley, Term-weighting Approaches in Automatic Text Retrieval, Information Processing and Management, 24(5), pages 513-523, 1988.
[4] H. Chen and D. T. Ng, An algorithmic approach to concept exploration in a large knowledge network (automatic thesaurus consultation): symbolic branch-and-bound vs. connectionist Hopfield net activation, Journal of the American Society for Information Science, 46(5), pages 348-369, June 1995.
[5] H. Chen, J. Martinez, T. D. Ng, and B. R. Schatz, A Concept Space Approach to Addressing the Vocabulary Problem in Scientific Information Retrieval: An Experiment on the Worm Community System, Journal of the American Society for Information Science, Volume 48, Number 1, pages 17-31, January 1997.
[6] Samuel Kaski, Timo Honkela, Krista Lagus, Teuvo Kohonen, WEBSOM - Self-Organizing maps of document collections, Neurocomputing 21 (1998), pages 101-117.
[7] Shaohui Liu, Mingkai Dong, Haijun Zhang, Rong Li, Zhongzhi Shi, An Approach of Multi-hierarchy Text Classification, 2001 International Conferences on Info-tech and Info-net Proceedings, Conference C, pages 95-100.
[8] Teuvo Kohonen, Samuel Kaski, Self-Organization of a Massive Document Collection, IEEE Transactions on Neural Networks, Vol. 11, No. 3, May 2000.
[9] Wu Bin, Zheng Yi, Liu Shaohui, Shi Zhongzhi, CSIM: A Document Clustering Algorithm Based On Swarm Intelligence, In Proceedings of Congress on Evolutionary Computation, 2002.
[10] Xiaoli Li, Data Mining Research in Web Information Retrieval and Classification, Ph.D. thesis, May 2001. R. Basili, A. Moschitti and M. Pazienza, A text classifier based on linguistic processing, In Proceedings of IJCAI-99, Machine Learning for Information Filtering, 1999.