Application of a clustering algorithm to recover
topic content in an unstructured text-based

Víctor González Laria, Richard Griffiths, Graham Winstanley
School of Computing and Mathematical Sciences
University of Brighton
Brighton, UK


The majority of current text-based search engines do not consider the semantic content of a
document when they index it. As a result, retrieval is based on Boolean matches between the
user's query and a set of loose terms that do not convey semantic meaning. In this paper we
use a partitioning-type clustering algorithm to study the feasibility of recovering 'topic'
information from an unstructured collection of terms.

1. Introduction

Most current Internet search engines rely on traditional methods of indexing and filtering.
During the process of indexing, the semantic content of a document is lost. Typically,
individual keywords are extracted from the body of the document and stored in an inverted file
to be used when the user's query is compared against the index. In the Internet environment,
complex methods of filtering are difficult to implement. Internet search engines spend most of
their time indexing the vast "territory" they are dealing with and less time matching user
queries against the indexed information.

However, traditional Boolean search engines provide an initial procedure upon which more
sophisticated approaches can be devised. MetaCrawler, for example, is a "parasite" agent
that, built on top of several search engines, provides a unified approach to them and
enhances their filtering capabilities [ETZIONI96], [SELBERG95]. Although it does not introduce
a new retrieval model (it basically collates results and prevents duplicates), it provides us
with a good example of how to improve the capabilities of search engines while coexisting with
them.

The number of hits returned by a search engine in response to an ordinary query is very low
compared to the number of documents available on the Internet. It usually consists of a few
hundred references; however, this is still a high number for manual inspection. Besides, the
documents hold valuable unexploited information such as:

- Common links

A high percentage of the documents returned share common links, which may give insight into
the kind of information they store [CHIA-HUI97]. For example, documents that are stored on
the same server and are returned in response to a certain query are likely to be more closely
related to each other than those residing on different servers.

- Connected links

Documents that contain links pointing to common documents may be related.

- Cross-document similarities

Cross-document comparison may lead to the discovery of terms in addition to those present in
the user's query. These new terms may be helpful not only for query expansion purposes, but
also for grouping similar documents.

We have investigated this last issue. By accessing the body of the documents instead of their
representation, we have studied how they can be grouped into clusters of similar documents
(which respond to a common topic) and how this approach can be used to assist the retrieval
process (section 2). In section 3 we describe the information retrieval model used. Section 4
describes the data used in our experiments and the results, and section 5 presents the
conclusions.

2. Clustering

Document clustering has been widely investigated as a technique to improve effectiveness
and efficiency in information retrieval (see [VANRIJSBERGEN79] for an introduction to
clustering). The assumption made in document clustering is that similar documents tend to be
relevant to the same query. Therefore, if documents are clustered off-line, comparisons of
documents against the user’s query are only needed with certain clusters and not with the
whole collection of documents. Although some authors have reported higher performance of
Cluster Based Search (CBS) combined with Full Search (FS) [CAN94], clustering methods
have not become widely adopted. It seems that the major criticism is that they require
calculations of quadratic order.

To use clustering for IR purposes several issues should be tackled. Firstly, the question of
whether clustering is a useful assistant when dealing with classification in small data sets
should be answered. This situation arises, for example, when trying to improve the results
provided by a normal Web search engine. Traditionally, clustering has been applied to
massive data environments, but the question of whether the semantics of a smaller data set
are more closely related to the statistical occurrence of terms does not appear to have been
widely studied. Minker [MINKER73] suggests that groupings produced by clustering methods
upon small databases should be examined for semantic consistency.

Clustering should also allow a deeper understanding of the data presented to the user by a
search engine. Suggesting keywords from each of the clusters created can help the user
comprehend what kind of information each group contains.

As a further step, the question of the effectiveness of using a clustering method as the

filtering strategy could also be approached. If the user selects the cluster that he thinks suits

him, and that cluster includes all of the useful documents for a particular information need, the

clustering method can be viewed as a good assistant which works by improving precision in


3. The Information Retrieval Model

Our system does not rely on others to obtain the indexes of the documents; it performs its
own indexing. This decision is due first to the fact that we need to explore the effects that
the indexing parameters have on the performance of the system, and second to the need to have
more extensive document representations than those provided by the search engines. The
conceptual model of the system is represented in figure 1.

    Footnote to section 2: the small data sets we refer to consist of several hundred documents.

[Figure 1: diagram relating the Search Engine, the Documents, Indexing, the User’s Searcher
and the Clusters G1, G2, ..., Gn]

Figure 1. Conceptual IR model

The user’s searcher accesses, indexes and applies its filtering scheme to the data whose
location has been provided by the search engine. The result is a set of clusters, each of
which groups documents that should respond to the same topic.

3.1. The Indexing Vocabulary

In our model we have avoided relying directly on the query approach due to the problems it
causes [BORGMAN89], [PINKERTON94]. The unavailability of a pattern against which to
compare all the documents creates the need for the development of an indexing vocabulary.
This will be made up of terms which exhibit a high frequency of repetition in the set of
document representatives under consideration. The process of its development is depicted in
figure 2.

[Figure 2: diagram with the labels "common terms", "new common terms" and "low frequent terms"]


Figure 2. Development of the indexing vocabulary

The indexing vocabulary is stored in a vector of common terms whose size will normally be
smaller than the sum of the sizes of all the document representatives. Therefore, to avoid
losing relevant terms, those whose frequency of repetition is greater than one are promoted to
a higher position in the indexing vocabulary vector. However, it is still likely that terms
with a low frequency of repetition are lost during the process (figure 2). The effect of this
will be commented on when we present our results later in this paper (section 4.3).
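
The development of a bounded indexing vocabulary can be illustrated with a minimal sketch. It is not the implementation used in the experiments reported here. It assumes that each document representative is a list of stemmed terms, that the frequency of repetition of a term is the number of representatives containing it, and that the vector holds at most `max_size` terms. The function name and the alphabetical tie-breaking rule are hypothetical.

```python
from collections import Counter


def build_indexing_vocabulary(representatives, max_size):
    """Return up to max_size terms, the most widely shared terms first.

    Assumption: a term's frequency of repetition is the number of
    document representatives that contain it.
    """
    counts = Counter()
    for terms in representatives:
        counts.update(set(terms))  # count each term once per document
    # Sort by descending frequency; ties are broken alphabetically so the
    # result is reproducible (a hypothetical choice).
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return ranked[:max_size]


docs = [
    ["patent", "law", "office"],
    ["patent", "trade", "law"],
    ["design", "engine", "patent"],
]
vocabulary = build_indexing_vocabulary(docs, max_size=3)
print(vocabulary)
```

With a vector of three slots, the terms that occur in only one representative compete for the last position, and most of them are lost, which is the effect described above.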

    The size of the individual document representatives is not known beforehand.

The indexing vocabulary within each cluster (centroid update) is finally created by choosing
those terms which appear at least k/2 times in the cluster (k = number of elements in the
cluster).
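
A minimal sketch of this centroid update follows. It reads "appears at least k/2 times in the cluster" as "appears in at least k/2 of the k documents of the cluster", which is an assumption, and the function name is hypothetical.

```python
from collections import Counter


def update_centroid(cluster_docs):
    """Keep the terms found in at least k/2 of the k documents of a cluster.

    Assumption: a term is counted once per document, however often it
    occurs inside that document.
    """
    k = len(cluster_docs)
    counts = Counter()
    for terms in cluster_docs:
        counts.update(set(terms))
    # 2 * c >= k avoids floating point when k is odd.
    return sorted(t for t, c in counts.items() if 2 * c >= k)


cluster = [
    ["labour", "job", "health"],
    ["labour", "employee", "job"],
    ["labour", "policy"],
    ["health", "policy", "family"],
]
centroid = update_centroid(cluster)
print(centroid)
```

Here k = 4, so a term must occur in at least two documents to remain in the centroid; "employee" and "family" are dropped.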


3.2. The Cover-Coefficient Clustering Algorithm

The cover-coefficient (CC) clustering algorithm is a seed-based algorithm, that is, it selects
some initial seeds to start the algorithm [CAN90]. The seeds therefore become the initial
representatives of the classifications. The rest of the documents in the collection are
compared against each seed and assigned to one of the classifications according to the value
of the comparison result.

A common criticism of seed-based clustering algorithms is that they create clusters whose
structure varies depending on which seeds are initially chosen. The CC clustering algorithm
does not exhibit this limitation since the seeds are chosen by using a deterministic algorithm
and therefore they are always the same under the following assumptions:

         1. The indexing vocabulary is constant.

         2. All the document representations are available.
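
As an illustration only, deterministic seed selection can be sketched as follows. The formulas are a simplified reading of the cover-coefficient concept of [CAN90] for a binary document-term matrix: the cover-coefficient matrix C, the decoupling coefficients on its diagonal, the number of clusters estimated as their sum, and a seed power equal to decoupling times coupling times the number of terms in the document. The function names are hypothetical, refinements of the full algorithm (such as the treatment of documents with nearly identical seed power) are omitted, and this is not the implementation used in the experiments reported here.

```python
def cover_coefficients(D):
    """Return the m x m cover-coefficient matrix C of a document-term matrix D.

    c[i][j] = alpha_i * sum_k d[i][k] * beta_k * d[j][k], where alpha_i is
    the reciprocal of row sum i and beta_k the reciprocal of column sum k.
    Assumes every document has a term and every term occurs in a document.
    """
    m, n = len(D), len(D[0])
    alpha = [1.0 / sum(row) for row in D]
    beta = [1.0 / sum(D[i][k] for i in range(m)) for k in range(n)]
    return [
        [alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(n))
         for j in range(m)]
        for i in range(m)
    ]


def select_seeds(D):
    """Pick seed documents deterministically by seed power (binary D)."""
    C = cover_coefficients(D)
    delta = [C[i][i] for i in range(len(D))]   # decoupling coefficients
    n_clusters = max(1, round(sum(delta)))     # estimated number of clusters
    power = [d * (1.0 - d) * sum(row) for d, row in zip(delta, D)]
    # Ties are broken by document index, so the outcome never varies.
    order = sorted(range(len(D)), key=lambda i: (-power[i], i))
    return sorted(order[:n_clusters])


D = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 1, 1],
]
seeds = select_seeds(D)
print(seeds)
```

Because nothing in the computation is random, repeated runs on the same matrix return the same seeds, which is the property the two assumptions above are meant to guarantee.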

Another advantage of the CC clustering algorithm is that it produces stable clusters. An
application of the CC algorithm has been proposed by Can [CAN89], [CAN93] as a promising
method for incremental index updating (without reindexing the whole database) under
conditions of high novelty of documents. The existing database may more than double in size
without heavy deterioration. The stability of this algorithm makes it a good candidate for
distributed or constrained implementation.

3.3. Term Weighting

Term weighting is a way to point out the importance of individual terms according to the
frequency with which they appear in the corpus. We use the scheme proposed by Salton
[SALTON88]:

$$w = f \times \ln\left(\frac{N}{n}\right) \times \text{normalizing factor}$$

The weight of a term is directly proportional to the number of times f it appears in a
document (Salton proposes raw frequency when the vocabulary is varied). To decrease the
importance of terms that are frequent in the corpus, a second factor is introduced. N is
usually the number of documents in the collection and n the number of documents in which the
term being weighted appears.

In our case:

- N is the number of clusters created,

- n is the number of clusters in which the term under consideration appears

At the end of the indexing process the document representations are normalised.
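
The weighting scheme can be illustrated with a short sketch. It assumes that the normalizing factor scales each representation to unit Euclidean length; the exact factor is not specified above, so this is an assumption, as are the function and argument names.

```python
import math


def weight_terms(frequencies, clusters_with_term, num_clusters):
    """Weight each term as f * ln(N / n), then scale the vector to unit length.

    frequencies: term -> f, occurrences of the term in the document.
    clusters_with_term: term -> n, number of clusters containing the term.
    num_clusters: N, the number of clusters created.
    """
    raw = {
        term: f * math.log(num_clusters / clusters_with_term[term])
        for term, f in frequencies.items()
    }
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return raw  # every term occurs in all clusters; nothing to scale
    return {term: w / norm for term, w in raw.items()}


weights = weight_terms(
    frequencies={"patent": 3, "law": 2, "inform": 4},
    clusters_with_term={"patent": 2, "law": 1, "inform": 4},
    num_clusters=4,
)
print(weights)
```

In this example "inform" is the most frequent term in the document but appears in every cluster, so ln(N/n) = 0 and its weight vanishes, whereas "law", found in a single cluster, receives the highest weight.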

3.4. Similarity measure

Calculation of similarity between a document and a seed is performed by using the traditional

cosine measure in the vector space model:

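$$\mathrm{doc} \cdot \mathrm{seed} = \cos(\mathrm{doc}, \mathrm{seed}) \times \lVert \mathrm{doc} \rVert \times \lVert \mathrm{seed} \rVert$$

where $\cdot$ is the dot product, doc is the vector representation of the document, seed is
the vector representation of the seed and $\lVert \cdot \rVert$ denotes the length of a vector.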

Documents whose similarity values fall below a certain threshold are moved to a “rag-bag” cluster.
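
The assignment step can be sketched as follows, assuming that vectors are stored as sparse dictionaries mapping terms to weights. The names `cosine`, `assign` and `RAG_BAG` are hypothetical, and the sketch is illustrative rather than the implementation used in the experiments.

```python
import math

RAG_BAG = -1  # hypothetical label for the rag-bag cluster


def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign(doc, seeds, threshold):
    """Return the index of the most similar seed, or RAG_BAG if the best
    similarity does not reach the threshold."""
    sims = [cosine(doc, seed) for seed in seeds]
    best = max(range(len(seeds)), key=lambda i: sims[i])
    return best if sims[best] >= threshold else RAG_BAG


seeds = [{"law": 1.0, "patent": 1.0}, {"job": 1.0, "health": 1.0}]
first = assign({"patent": 2.0, "code": 1.0}, seeds, threshold=0.1)
second = assign({"engine": 1.0}, seeds, threshold=0.1)
print(first, second)
```

The first document shares "patent" with the first seed and is assigned to it; the second shares no term with any seed, so its similarity is zero and it falls into the rag-bag cluster.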

4. Experimentation

4.1. Description of the data set

The data has been extracted from the manufacturing domain. It consists of Web abstracts of
documents drawn from the Network for Excellence in Manufacturing (NEM on-line) [NEM].
These documents have been classified by human experts and may contain concepts that overlap.
For example, documents about labour legislation can be assigned either to the Labour or to
the Legal category, depending on the person doing the classification.
We have chosen 4 categories and have downloaded 15 document abstracts for each category
(table 1).

    Footnote to section 3.2: normally the seeds are chosen randomly.

    Category           Number of Documents
    Legal                            15
    Design                           15
    Government                       15
    Labour                           15
    Total                            60

Table 1. Documents organised by category
Obviously the data will exhibit cluster tendency: we know in advance the ideal organisation of
the data, which is 4 clusters with 15 documents each. Cluster tendency of a collection is a
prerequisite for applying a clustering method [DUBES79].
The data set is made up of 60 document abstracts. They share 337 terms and there are
around 2000 distinct terms.

4.2. Parameters

We use the following parameters:

a) Depth of indexing (thtw)

It controls the exhaustivity of the indexing. Smaller values mean that more terms are added to
the document representative and therefore the depth of indexing is higher.

b) Minimum Similarity Value (thts)

It is the minimum value of similarity between a document and a cluster representative; below
this value the document is not considered relevant.

c) Size of the Indexing Vocabulary (SIV)

Number of terms that make up the indexing vocabulary.

d) Data Slot Size (DSS)

Size of the vector used to perform cross-document comparisons.

e) Rand’s Value of Similarity (R)

Rand’s value of similarity between two classifications. All the values of similarity shown lie
outside the band E(R)±Var(R), where E(R) is the mean of R and Var(R) its variance. This is a
necessary condition for ensuring non-randomness of the classification.

  We include here the errors of the stemmer.
  We use Rand’s measure of similarity [FOWLKES83]. This value indicates the probability that two documents are
treated alike in the clusters under comparison.
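
Rand's measure can be computed directly from its definition as the fraction of document pairs that two classifications treat alike. The sketch below is illustrative; the function name and the sample labels are hypothetical, and it requires at least two documents.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of document pairs on which two classifications agree.

    A pair agrees when both classifications put the two documents in the
    same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("classifications must cover the same documents")
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)


ideal = ["Legal", "Legal", "Design", "Design", "Labour", "Labour"]
found = [0, 0, 1, 1, 1, 2]
r = rand_index(ideal, found)
print(r)
```

In this example 12 of the 15 pairs are treated alike, so R = 0.8. The cluster labels themselves do not need to match, only the grouping they induce.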

4.3. Results

The first result that emerges when running the algorithm is that the measure of similarity
between the pursued classification (see 4.1) and the one obtained does not depend on the size
of the indexing vocabulary (figure 3). We get comparable values of similarity with SIV=86 and
with SIV=337. This means both that we do not need to know in advance the exact size of the
document representatives, and that we can keep the DSS value as a fraction of the maximum
value (the sum of the individual DSSs). In our case, values of DSS in the range of 200 give
values of similarity comparable to those obtained with values of DSS in the range of 2000.


[Figure 3: plot against the Size of Indexing Vocabulary (horizontal axis, 0 to 400)]

Figure 3. Dependence of the algorithm on the size of the indexing vocabulary (SIV)

Another important parameter to consider is thtw, or depth of indexing. It controls the
exhaustivity of the document representations. We are interested in how the depth of indexing
influences similarity values and, therefore, retrieval performance. In figure 4 we show
values of similarity, averaged over several runs of the experiment, for several values of
thtw and thts.


[Figure 4: plot over the Parameters axis, with the settings (thts=0.01, thtw=0.1),
(thts=0.1, thtw=0.1), (thts=0.01, thtw=0.01) and (thts=0.1, thtw=0.01)]

Figure 4. Effect of thtw and thts

In figure 4 we can see that the best results are obtained when the depth of indexing is high.
This result seems to contradict the previous one, namely that larger indexing vocabularies
bring no improvement. The explanation lies in the way the indexing vocabulary is created. If
we recall how it works (section 3.1), the most frequent terms are promoted to higher positions
and the least frequent ones may be lost after the centroid is updated. The important thing is
therefore to find the most frequent terms, and this will not be achieved if the depth of
indexing is too low at the beginning of the process. Once the most frequent terms are
detected, the loss of the least significant ones does not affect performance.
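
The role of the depth of indexing can be illustrated with a small sketch. It assumes that thtw acts as a lower bound on the normalised weight a term needs in order to enter the document representative; this interpretation, the function name and the sample weights are hypothetical.

```python
def document_representative(weights, thtw):
    """Keep the terms whose normalised weight is at least thtw."""
    return {term: w for term, w in weights.items() if w >= thtw}


weights = {"patent": 0.80, "law": 0.55, "office": 0.20, "notice": 0.05}
deep = document_representative(weights, thtw=0.01)
shallow = document_representative(weights, thtw=0.1)
print(sorted(deep), sorted(shallow))
```

With thtw=0.01 all four terms enter the representative, whereas with thtw=0.1 the lowest-weighted term is excluded before the indexing vocabulary is built, so it can never be counted as a common term.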

Finally, we present some of the terms that the system uses to cluster, arranged by category (table 2).

 Design           design, engine

 Labour           federation, regulation, labor, policy, deals, courses, employee, job, health,
                  government, medic, notice, family, act, social, security, disabilities.

 Legal            code, patent, office, protect, sector, nation, unite, states, law, civil, code,
                  trade, inform, property.

 Government       government, inform, policy, export, bank, patent, DOE (Department of
                  Transportation), federation, grants, contract, funds, whitehouse, iso.

Table 2. Suggested terms, ordered by category

5. Conclusions

In this paper we have presented a method for detecting topic information from an
unstructured set of terms obtained by indexing a collection of text documents using a
frequency-based method. We have shown that the proposed method produces meaningful groups of
documents with high values of similarity compared to the ideal classification. The values are
in the range of 80-90%, which indicates the probability that two documents cluster together in
both groups. Compared to Can’s values we find a deterioration within the range of 10%. This is
due to the fact that we do not use a static indexing vocabulary but a dynamically developed
one.

The clusters created by the system are not random and the terms in each group suggest that
they incorporate information about topic content. From the inspection of each group in table 2
we can anticipate the information embedded within. For example, it shows that in the Legal
group there is information about “health”, “law” and “social security”. These are terms that
indicate how to continue and steer the search.

We have also shown how to dimension the algorithm to produce meaningful results under
restrictions on the size of the indexing vocabulary and depth of indexing.

The results provided here cannot be extended to other collections of documents that do not
comply with the data description of section 4.1. We conjecture that data sets based on
full-text documents instead of abstracts should use other methods of term extraction. This is
because, as a document's size grows, its vector representation flattens and term weights tend
to become similar to each other. Under these circumstances it becomes difficult to distinguish
between useful and non-useful documents. Restrictions of the algorithm in this situation
should be explored as future work.

6. References

[BORGMAN89]. Borgman C. L. All users of Information Retrieval Systems are not created
equal: An exploration into individual preferences. Information Processing and Management.
Vol. 25, no. 3, pp. 237-251, 1989.

[CAN89]. Can F. Dynamic Cluster Maintenance. Information Processing and Management.
Vol. 25, no. 3, pp. 275-291, 1989.

[CAN90]. Can F., Ozkarahan E. A. Concepts and Effectiveness of the Cover-Coefficient-Based
Clustering Methodology for Text Databases. ACM Transactions on Database Systems.
Vol. 15, no. 4, pp. 483-517, December 1990.

[CAN93]. Can F. Incremental Clustering for Dynamic Information Processing. ACM
Transactions on Information Systems. Vol. 11, no. 2, pp. 143-164, April 1993.

[CAN94]. Can F. On the Efficiency of Best-Match Cluster Searches. Information Processing
and Management. Vol. 30, no. 3, pp. 343-361, 1994.

[CHIA-HUI97]. Chia-Hui C., Ching-Chi H. Customisable Multi-Engine Search Tool with
Clustering. Proceedings of the Sixth International World Wide Web Conference, 1997
(20 November 1997).

[DUBES79]. Dubes R., Jain A. K. Validity studies in Clustering Methodologies. Pattern
Recognition. Vol. 11, pp. 235-254, 1979.

[ETZIONI96]. Etzioni O. Moving Up the Information Food Chain: Deploying Softbots on the
World Wide Web. AAAI-96, pp. 1322-1326.

[FOWLKES83]. Fowlkes E. B., Mallows C. L. A Method for Comparing two Hierarchical
Clusterings. Journal of the American Statistical Association. Vol. 78, no. 383, pp. 553-568,
September 1983.

[MINKER73]. Minker J. Document Retrieval Experiments Using Cluster Analysis. Journal of
the American Society for Information Science. July-August, 1973.

[NEM]. NEM On-line (3 February 1998).

[PINKERTON94]. Pinkerton B. Finding What People Know: Experiences with the
WebCrawler. Second International WWW Conference. July 1994.

[SALTON88]. Salton G., Buckley C. Term-Weighting Approaches In Automatic Text Retrieval.
Information Processing and Management. Vol. 24, no. 5, pp. 513-523, 1988.

[SELBERG95]. Selberg E., Etzioni O. Multi-Service Search and Comparison Using the
MetaCrawler. Department of Computer Science and Engineering. University of Washington.
October 1995.

[VANRIJSBERGEN79]. Van Rijsbergen C.J. Information Retrieval. Butterworth 1979.

