									AUTO-CONSTRUCTION OF A LIVE THESAURUS FROM SEARCH TERM LOGS FOR
                   INTERACTIVE WEB SEARCH

                                             Shui-Lung Chuang1, Hsiao-Tieh Pu2,
                                              Wen-Hsiang Lu1, Lee-Feng Chien1

                           1. Institute of Information Science, Academia Sinica, Taiwan, R.O.C.
                                       2. Department of Library & Information Studies,
                                             Shih Hsin University, Taiwan, R.O.C.

                        E-mail: {slchuang, whlu, lfchien}@iis.sinica.edu.tw, htpu@cc.shu.edu.tw

ABSTRACT

The purpose of this paper is to present ongoing research that aims to construct
a live thesaurus directly from the search term logs of real-world search
engines. Such a thesaurus can contain representative search terms, their
frequency of use, the corresponding subject categories, the associated and
relevant terms, and the popular Web sites/pages that the search terms may lead
to.

1. INTRODUCTION

Web users' queries are often too short to contain enough key terms to
discriminate among ambiguous documents during information retrieval [4]. To
alleviate this short-query problem, a high-performance search engine needs a
thesaurus to support efficient interactive search and/or term suggestion
techniques.

A thesaurus is a set of items (phrases or words) plus a set of relations
between these items. Automatic construction of a thesaurus from online text
resources is a necessary but challenging research topic [2]. Unlike previous
work on automatic thesaurus construction, the ongoing research attempts to
construct a live thesaurus from the search term logs of real-world search
engines rather than via term extraction from online documents. In addition,
the information to be collected in such a thesaurus is mainly the
representative search terms with their associated terms. The proposed approach
combines human and machine effort. As shown in the live thesaurus construction
block of Fig. 1, it is a three-step process for constructing a live thesaurus:
search term log analysis, new term categorization, and similar term
clustering. The human effort mainly goes into the analysis of search term
logs: extracting "Core Terms", structuring subject taxonomies for these terms,
and classifying each term into appropriate subject categories. The machine
effort automatically categorizes each important new search term into
appropriate subject categories and clusters similar search terms within each
category. The proposed approach is being developed and tested with the logs of
two primarily Chinese-language search engines. The preliminary results suggest
that a thesaurus suited to interactive web search can be automatically
constructed and updated as the search term logs change. Moreover, based on the
live thesaurus, certain kinds of users' information behavior could be
characterized and a more effective search engine might be developed.

2. OVERALL RESEARCH DESIGN

As mentioned above, the main focus of the ongoing research is to enable a live
thesaurus to be constructed automatically from the search term logs of
real-world search engines. On the basis of the live thesaurus, the research
will further attempt to build a front-end search engine capable of automatic
term suggestion and meta search. This front-end engine is designed to be
integrated with existing search engines, which serve as back-end engines in
our research. As depicted in Fig. 1, for each input query term the front-end
engine under development will suggest related terms by accessing the live
thesaurus and retrieve web results via meta search. For each unknown query
term, it will assign appropriate subject categories and extract similar search
terms. All user query inputs, and the web pages visited through the queries
and the suggested relevant terms, will be recorded in the search term log and
used to update the corresponding information in the thesaurus. As a result,
the thesaurus will record up-to-date information
including search term frequency, and even the frequency distribution of each
subject category. The following sections describe the construction of the live
thesaurus in more detail.

Search Term Log Analysis and Core Term Extraction

Query logs from two search engines in Taiwan, Dreamer and GAIS, were collected
as the basis for analysis. The Dreamer log contained 228,566 distinct search
terms with a total frequency of 2,184,256 over a period of more than 3 months
in 1998, and the GAIS log contained 114,182 distinct queries with a total
frequency of 475,564 over a period of 2 weeks in 1999. Each query log consists
of a series of requests, and each request includes the search term, the
corresponding timestamp (when the query was submitted), etc. The research
consists of three major steps of analysis. The first-order analysis involves
structuring subject taxonomies to build a classification scheme, and
categorizing the search terms from the Dreamer log into subject categories
according to users' possible intentions. This work was done mostly by hand
over three months by our team of five Library & Information Science students
with substantial experience of surfing the net and a reference librarian. They
first extracted nearly twenty thousand of the most frequent search terms from
the Dreamer log. Though these top twenty thousand terms represented only 8% of
the distinct queries, they accounted for fully 81% of the total number of
search terms submitted. To build a proper scheme for categorizing search
terms, we used a bottom-up methodology rather than the top-down designs
commonly found in Internet search services like Yahoo or in library
classification schemes like the Dewey Decimal Classification. Fourteen major
categories with one hundred subcategories were developed. Each search term was
then analyzed and, based on a careful judgement of the users' intentions (that
is, the purpose for which the corresponding search results would be used),
classified into an appropriate subject category according to its major
intention, together with an indication of whether it is a proper noun. Since a
search term is very likely to carry multiple user intentions, each search term
was assigned one major category and, if needed, a secondary category for
future cross-reference.

The second-order analysis derives various statistics that describe users'
information needs and seeking patterns from both the Dreamer and GAIS logs.
The third-order analysis is to identify the limitations of search term log
analysis and to apply the findings to electronic commerce (EC) applications
and IR research. Exploring the distribution of users' needs in each subject
category yielded some valuable findings. First, the distribution of
information needs across the subject categories can be obtained. Second, some
"core" search terms persist regardless of time, which implies that research
into users' essential needs is possible and that the development of a live
thesaurus can take these terms as its base. Third, over 40% of the searches
were found to be proper nouns, such as company or personal names, which
provide a useful source for further investigation of EC activities in
different subject domains.

New Term Categorization

The manually categorized core terms described above were then taken as the
seed vocabulary set of the live thesaurus. To keep the thesaurus incremental
and adaptive, we need a mechanism that automatically assigns each new term
(a term that does not yet appear in the thesaurus) to appropriate subject
categories and extracts possible similar terms in the same categories. The new
term categorization problem therefore needs to be addressed. Suppose that
there exists a set of categories C that covers all topics requested in
Internet users' queries. Our new term classification problem can be described
briefly as follows: given a new term t and a term vocabulary set V in which
each term has its corresponding categories, determine the most proper class
for t in terms of the intentions behind users' information needs. For example,
the search term "Microsoft" should be categorized as an instance of the
Company sub-category in the Computer category.

The basic idea of our solution is to take the co-occurrence search terms of t,
that is, the search terms that appear together with t in some Web documents,
as the feature vector of t. The approach is very much like classifying a
document based on the terms it contains. The first step in classifying a
document is to determine how the terms in the vocabulary set contribute to
each candidate category, which can be done in a prior training step on a
training corpus. The confidence that the given document belongs to each
category is then determined by all the terms appearing in that document.

We now specify the new term classification problem more precisely. Given a web
document collection, the classification process under development ranks all
candidate categories in C to find the most probable class that t belongs to.
The proposed rank function for category c is defined as follows:

    R_c = \sum_{w \in T} M_{t,w} \, f_w \, [C_w \equiv c]

where t is the term to be categorized, w is a co-occurrence term of t, T is
the set of t's co-occurrence terms, C_w is the category of w, and f_w is the
frequency of w. The bracketed condition [C_w \equiv c] restricts the sum to
the co-occurrence terms whose category is c. The weight M_{t,w} is defined as:

    M_{t,w} = \frac{|D_w \cap D_t|}{|D_t|}

where D_t and D_w are the sets of documents in the web collection that contain
t and w respectively, |D_t| is the number of documents in D_t, and
|D_w \cap D_t| is the number of documents in their intersection. M_{t,w} is
included so that co-occurrence terms are not all treated equally in
determining the target class.

A preliminary experiment has been carried out to test the performance of the
above method. The experimental data was collected from the Dreamer log
mentioned previously. It contains about 200k distinct query terms, of which
the top 20k terms, which cover about 80% of the query requests and are mostly
core terms, have been classified manually into 100 classes. We randomly
selected 1,000 terms from the classified term set as the testing set and
treated the rest as the vocabulary V. The selected terms were then submitted
as queries to a well-known search engine to collect the required documents.
The resulting documents are assumed to be highly correlated with t, so the
co-occurrence term set T can be extracted from them. The correct rates
obtained with the proposed method are shown in Table 1, where top n means that
the n highest-ranked candidate categories contain the correct category. The
results are promising and indicate that certain kinds of new terms given by
Internet users can be categorized automatically.

Table 1. Correct rates obtained with the proposed method in the new term
classification experiment.

    Top n    Correct rate of categorization
    Top 1    48.30%
    Top 2    63.50%
    Top 3    68.90%
    Top 4    74.50%
    Top 5    77.10%

Similar Term Clustering

Some research has used head-modifier relationships or descriptions of entities
to determine similar words [1]. Other work makes use of lexical co-occurrence
information to build sets of related words [3]. Our research on this topic is
just beginning. Unlike previous work, however, we can extract similar terms
from among the search terms in the same subject domain, which are obtained
using the above new term categorization technique and the search term logs.
This can greatly reduce the computational cost and ease some of the difficulty
caused by word sense ambiguity.

The similar terms to be clustered in our research fall into two types: terms
similar in content, e.g., abbreviations, and terms different in content but
similar in concept. Terms of the first type share common character strings and
have been found relatively easy to cluster using some heuristic rules and
mutual-information-based co-occurrence analysis of the terms as they appear in
web documents. Terms of the second type have no obvious common sub-strings,
and more in-depth study is still necessary.

REFERENCES

[1] Dekang Lin, Automatic Retrieval and Clustering of Similar Words,
    COLING'98, 1998.
[2] Yufeng Jing and W. Bruce Croft, An Association Thesaurus for Information
    Retrieval, UMass Technical Report 94-17, 1994.
[3] Hinrich Schütze and Jan O. Pedersen, A Cooccurrence-Based Thesaurus and
    Two Applications to Information Retrieval, Information Processing &
    Management, 33(3): 307-318, 1997.
[4] C. Silverstein et al., Analysis of a Very Large AltaVista Query Log, SRC
    Technical Note, Oct. 26, 1998. (http://www.research.digital.com/SRC/)
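
The rank function R_c and the weight M_{t,w} defined in the New Term
Categorization section can be illustrated with a minimal Python sketch. It is
not the authors' implementation. The function names, the toy document ids,
the frequencies and the category labels are all hypothetical, and f_w is
assumed here to be the frequency of w in the search term log, which the text
does not state explicitly.

```python
from collections import defaultdict


def cooccurrence_weight(docs_t, docs_w):
    """M_{t,w} = |D_w & D_t| / |D_t|; returns 0.0 when D_t is empty."""
    if not docs_t:
        return 0.0
    return len(docs_t & docs_w) / len(docs_t)


def rank_categories(docs_t, cooccurring, top_n=5):
    """Rank candidate categories for a new term t.

    docs_t      -- set of ids of the documents containing t (D_t)
    cooccurring -- maps each co-occurrence term w to (D_w, f_w, C_w)
    Returns up to top_n (category, R_c) pairs, highest score first.
    """
    scores = defaultdict(float)
    for docs_w, freq_w, category_w in cooccurring.values():
        # Each w adds M_{t,w} * f_w to the score of its own category only,
        # which is the effect of the [C_w == c] condition in R_c.
        scores[category_w] += cooccurrence_weight(docs_t, docs_w) * freq_w
    # Sort by descending score; ties are broken by category name.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical toy data: document ids, frequencies and labels are invented.
docs_t = {1, 2, 3, 4}
cooccurring = {
    "windows": ({1, 2, 3}, 50, "Computer/Company"),
    "software": ({2, 4, 9}, 30, "Computer/Software"),
    "office": ({1, 4}, 20, "Computer/Company"),
}
ranking = rank_categories(docs_t, cooccurring)
print(ranking)
```

With this toy input, "Computer/Company" is ranked first because two of its
terms co-occur with t in most of the documents containing t. A term counts as
correctly categorized at top n, in the sense of Table 1, when its manually
assigned category appears among the first n entries of such a ranking.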




[Figure: block diagram. Input queries enter an Interactive Search Interface,
which returns suggested terms and retrieved results; the interface uses Term
Suggestion and Meta Search over back-end search engines. The Live Thesaurus
Construction block contains New Term Categorization, Related Terms/Similar
Term Extraction & Clustering, Real-Name Identification, and Log Analysis &
Seed Term Extraction, fed by new search terms, the initial search term log and
the search term log, and maintaining the Live Thesaurus.]

Fig. 1. An abstract diagram showing the overall design of the on-going research.

								