

									                                                        (IJCSIS) International Journal of Computer Science and Information Security,
                                                        Vol. 9, No. 6, June 2011

       Text Clustering Based on Frequent Items Using
                    Zoning and Ranking
                  S. Suneetha1, Dr. M. Usha Rani2, Yaswanth Kumar.Avulapati3
                                            Research Scholar 1, Associate Professor2,
                                     Department of Computer Science, SPMVV, Tirupati.
                             Research Scholar3, Dept of Computer Science, S.V.University, Tirupati
         suneethanaresh@yahoo.com1, musha_rohan@yahoo.com2, Yaswanthkumar_1817@yahoo.co.in3

Abstract: In today’s information age, the amount of textual information available in electronic form is growing continuously. This growth has led to the task of mining useful or interesting frequent itemsets (words/terms) from very large unstructured text databases, a task that remains quite challenging. The use of such frequent itemsets for text clustering has received a great deal of attention in research communities, since the mined frequent itemsets reduce the dimensionality of the documents drastically. In this work, an effective approach for text clustering is developed based on frequent itemsets, which provide significant dimensionality reduction. The Apriori algorithm, a well-known method for mining frequent itemsets, is used. A set of non-overlapping partitions is then obtained using these frequent itemsets, and the resultant clusters are generated within each partition of the document collection. An extensive analysis of the frequent item-based text clustering approach is conducted with a real-life text dataset, Reuters-21578. The experimental results of the approach for 100 documents of the Reuters-21578 dataset are given, and its performance is evaluated with Precision, Recall and F-measure. The results indicate that the proposed approach groups the documents into clusters effectively and, in most cases, provides better precision for the dataset taken for experimentation.

Keywords: Text Mining, Text Clustering, Text Documents, Frequent Itemsets, Apriori Algorithm, Reuters-21578.

                        I. INTRODUCTION

   The current age is referred to as the “Information Age”. In this information age, information leads to power and success only if one can “Get the Right Information, To the Right People, At the Right Time, On the Right Medium, In the Right Language, With the Right Level of Detail”. The abundance of data, coupled with the need for powerful data analysis tools, is described as a “Data Rich but Information Poor” situation. To relieve this dilemma, a new discipline named data mining emerged, which devotes itself to extracting knowledge from huge volumes of data with the help of ubiquitous modern computing devices. The term “Data Mining”, also known as Knowledge Discovery in Databases (KDD), is formally defined as “the non-trivial extraction of implicit, previously unknown, and potentially useful information from large amounts of data” [13]. Data mining is not specific to one type of media or data. It is applicable to any kind of information repository. Generally, data mining is performed on data represented in quantitative, textual, or multimedia forms. In recent times, there has been an increasing flood of unstructured textual information.

   The area of text mining is growing rapidly, mainly because of the strong need for analyzing this vast amount of textual data. As the most natural form of storing and exchanging information is written words, text mining has a very high commercial potential [9], [11]. It is therefore regarded as the next wave of knowledge discovery. Traditional document and text management tools are inadequate to meet these needs. Document management systems work well with homogeneous documents but not with a heterogeneous mix. Even the best internet search tools suffer from poor precision and recall. The ability to distil this untapped source of information, free text documents, provides substantial competitive advantages in the era of a knowledge-based economy. Thus, Text Mining provides a competitive edge for a company to process and take advantage of massive textual information.

   Text Mining, also known as Text Data Mining or Knowledge Discovery from Textual Databases, is defined as “the nontrivial extraction of implicit, previously unknown, and potentially useful information from textual data” [3] or “the process of extracting interesting and non-trivial patterns or knowledge from unstructured text documents”. ‘High Quality’ in text mining refers to some combination of relevance, novelty, and interestingness [6].

   ‘Text Clustering’ or ‘Document Clustering’ is ‘the organization of a collection of text documents into clusters based on similarity. Intuitively, documents within a valid cluster are more similar to each other than those belonging to a different cluster’. In other words, documents in one cluster share similar topics. Thus, the goal of a text clustering scheme is to minimize intra-cluster distances between documents while maximizing inter-cluster distances [12]. It is the most common form of unsupervised learning, and it is an efficient way of sorting documents to assist users to sift, summarize, and arrange text documents [4], [24], [14].

   In this paper, an effective approach for frequent itemset-based text clustering using zoning and ranking is proposed. First, the text documents in the document set are preprocessed. Then, the top-p frequent words are extracted from each document, and the binary mapped database is formed from these extracted words. Then, the Apriori algorithm is applied to discover frequent itemsets of different lengths.

                                                                 208                               http://sites.google.com/site/ijcsis/
                                                                                                   ISSN 1947-5500

For every length, the mined frequent itemsets are sorted in descending order of their support level. Subsequently, the documents are split into partitions using the sorted frequent itemsets. Finally, the resultant clusters are formed within each partition using the derived keywords.

                  II. PROPOSED APPROACH

   Text mining is an increasingly important research field because of the necessity of obtaining knowledge from an enormous number of unstructured text documents [23]. Text clustering is one of the fundamental functions in text mining. Its purpose is to group a collection of documents into different category groups so that documents in the same category group describe the same subject. Many researchers [6], [7], [16], [17], [19], [24], [25] have investigated possible ways to improve the performance of text or document clustering based on the popular clustering algorithms (partitional and hierarchical clustering) and on frequent term-based clustering. In the current work, an effective approach for clustering a text corpus with the help of frequent itemsets is proposed.

A. Algorithm: Text Clustering Process.
  The effective approach for clustering a text corpus with the help of frequent itemsets is furnished below:
1. Collect the set of documents, i.e. D = {d1, d2, d3, ..., dn}, to make clusters.
2. Apply the text preprocessing method on D.
3. Create the binary database B.
4. Mine the frequent itemsets using the Apriori algorithm on B.
5. Organize the output of the first stage of Apriori into sets of frequent itemsets of different lengths.
6. Partition the text documents based on the frequent itemsets.
7. Cluster the text documents within the zone based on their rank.
8. Output the resultant clusters.

  The devised approach consists of the following major steps:
(1) Text PreProcessing
(2) Mining of Frequent Itemsets
(3) Partitioning the text documents based on frequent itemsets
(4) Clustering of text documents within the partition
  The steps of the algorithm are explained in detail below.

B. Text PreProcessing.
  Let D be the set of text documents, $D = \{d_1, d_2, d_3, \ldots, d_n\}$; $1 \le i \le n$, where n is the number of documents in the text dataset D. The text document set D is converted from an unstructured format into a common representation using text preprocessing techniques: words/terms are extracted (tokenization), and the input data set D is then preprocessed by removing stop words and applying a stemming algorithm.

  1) Stop Word Removal: This is the process of removing non-information-bearing words from the documents to reduce noise and to save a large amount of space, and thus to make the later processing more effective and efficient. Stop words are dependent on natural language [20].

  Stop Words for Reuters-21578: a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, w, x, y, z, that, the, these, this, those, who, whom, what, where, which, why, of, is, are, when, will, was, were, be, as, their, been, have, has, had, from, may, might, there, should, their, it, its, it's, find, out, with, the, native, status, all, live, in, who, me, get, who, who’s, whom, the, this, there, is, at, was, or, are, then, that, when, why, what, want, have, had, has, and, an, you, our, on, of, with, for, can, to, be, used, all, they, from, so, as, in, if, where, into, by, were, more, about, said, talk, my, mine, me, you, your, yours, we, us, our, ours, he, she, it, her, him, his, they, them, their, there.

  2) Stemming Algorithm: A stemming algorithm is a computational procedure that reduces all words with the same root to a common form by stripping each word of its derivational and inflectional suffixes.
  The approach to stemming employed here is a two-phase stemming system. The first phase, the stemming algorithm ‘proper’, retrieves the stem of a word by removing its longest possible ending that matches one on a stored list of endings. The second phase handles “spelling exceptions” [18].

C. Mining of Frequent ItemSets.
  This section describes the mining of frequent itemsets from the preprocessed text document set D. For every document $d_i$, the frequency of the words/terms extracted in the preprocessing step is computed, and the top-p frequent words of each document $d_i$ are taken.

$$K_w = \{\, d_i \mid p(d_i) \;;\; \forall\, d_i \in D \,\}$$

$$p(d_i) = Tw_j \;;\; 1 \le j \le p$$

  From the set of top-p frequent words, the binary database B is formed by collecting the unique words. Let $B_T$ be a binary database consisting of n transactions T (the number of documents) and q attributes (the unique words) $U = [u_1, u_2, \ldots, u_q]$. The binary database $B_T$ records whether each unique word is present in the document $d_i$ or not.

$$B_T = \begin{cases} 0 & \text{if } u_j \notin d_i \\ 1 & \text{if } u_j \in d_i \end{cases} \;;\quad 1 \le j \le q,\; 1 \le i \le n$$

  Then, the binary database $B_T$ is fed to the Apriori algorithm for mining the frequent itemsets (words/terms) $F_s$.

  1) Apriori Algorithm: Apriori is a traditional algorithm for mining association rules that was first introduced in [2]. Mining association rules involves two steps: (1) finding frequent or large itemsets, and (2) generating association rules from the frequent itemsets. Frequent itemsets are themselves generated in two steps: first, candidate itemsets are generated; second, the frequent itemsets are mined from these candidate itemsets.
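As an illustration of this section, the following Python sketch builds the binary database from the top-p words of each document and then mines frequent itemsets level by level. It is a minimal sketch under stated assumptions, not the authors' Java implementation: the function names (`top_p_words`, `binary_database`, `apriori`), the toy documents, and the use of a fractional `min_sup` threshold are all hypothetical choices made for the example, and candidate generation is a simplified join-and-prune rather than the exact apriori-gen procedure.

```python
from collections import Counter
from itertools import combinations


def top_p_words(tokens, p):
    """Return the p most frequent words of one preprocessed document."""
    return [word for word, _ in Counter(tokens).most_common(p)]


def binary_database(docs_top_words):
    """Build one 0/1 row per document over the sorted unique words."""
    vocab = sorted({w for words in docs_top_words for w in words})
    rows = []
    for words in docs_top_words:
        present = set(words)
        rows.append([1 if u in present else 0 for u in vocab])
    return vocab, rows


def apriori(transactions, min_sup):
    """Level-wise frequent itemset mining (assumed fractional min_sup).

    transactions: list of sets of words, one set per document.
    Returns {length: [(itemset, support), ...]}, each list sorted by
    descending support.
    """
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    result, k = {}, 1
    while current:
        frequent = [(c, support(c)) for c in current]
        frequent = [(c, s) for c, s in frequent if s >= min_sup]
        if not frequent:
            break
        frequent.sort(key=lambda cs: (-cs[1], sorted(cs[0])))
        result[k] = frequent
        k += 1
        freq_sets = {c for c, _ in frequent}
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in freq_sets for b in freq_sets
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = [c for c in candidates
                   if all(frozenset(s) in freq_sets
                          for s in combinations(c, k - 1))]
    return result


if __name__ == "__main__":
    # Toy, already-preprocessed documents (hypothetical data).
    docs = [
        ["oil", "price", "oil", "crude", "price", "oil"],
        ["oil", "crude", "barrel", "oil", "crude"],
        ["bank", "rate", "bank", "loan", "rate"],
        ["oil", "price", "bank", "oil", "price"],
    ]
    top = [top_p_words(d, 2) for d in docs]
    vocab, rows = binary_database(top)
    print(vocab)
    print(rows)
    print(apriori([set(t) for t in top], min_sup=0.5))
```

Returning the itemsets grouped by length and sorted by support matches the form in which the later partitioning step consumes them.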


The itemsets whose support is greater than the minimum support given by the user are referred to as ‘frequent itemsets’. In the proposed approach, only the frequent itemsets are used for further processing, so only the first step (generation of frequent itemsets) of the Apriori algorithm is performed. The pseudo code for the Apriori algorithm [1] is:

    I_1 = {large 1-itemsets};
    for (k = 2; I_{k-1} ≠ ∅; k++) do begin
        C_k = apriori-gen(I_{k-1});        // New candidates
        forall transactions T ∈ D do begin
            C_T = subset(C_k, T);          // Candidates contained in T
            forall candidates c ∈ C_T do
                c.count++;
        end
        I_k = {c ∈ C_k | c.count ≥ minsup}
    end
    Answer = ∪_k I_k;

D. Partitioning the Text Documents based on Frequent ItemSets.
  This section describes the partitioning of the text documents D based on the mined frequent itemsets F. A ‘Frequent Itemset’ is a set of words that occur together in some minimum fraction of documents in a cluster. The Apriori algorithm generates a set of frequent itemsets with varying length (l) from 1 to k. First, the frequent itemsets of each length (l) are sorted in descending order according to their support level.

$$F_s = \{ f_1, f_2, f_3, \ldots, f_k \} \;;\; 1 \le l \le k$$

$$f_l = \{ f_{l(i)} \;;\; 1 \le i \le t \}$$

where $\sup(f_{l(1)}) \ge \sup(f_{l(2)}) \ge \ldots \ge \sup(f_{l(t)})$ and t denotes the number of frequent itemsets in the set $f_l$.

  From the sorted list $f_{(k/2)}$, the first frequent itemset $f_{(k/2)(1)}$ is selected, and an initial partition $c_1$ containing all the documents that have this itemset is constructed. Then, the second element $f_{(k/2)(2)}$, whose support is less than that of $f_{(k/2)(1)}$, is taken to form a new partition $c_2$. This new partition is formed by identifying all the documents having the itemset $f_{(k/2)(2)}$ while taking into account the documents already in the initial partition $c_1$, so that no document is assigned to more than one partition. This procedure is repeated until every text document in the input dataset D has been moved into a partition $C_{(i)}$. If the procedure does not terminate with the sorted list $f_{(k/2)}$, then the subsequent sorted lists (those for the lengths next to k/2, and so on) are taken and the same step is performed on them. This results in a set of partitions c, where each partition $C_{(i)}$ contains a set of documents $D_{c(i)}^{(x)}$.

$$c = \{\, c_{(i)} \mid c_{(i)} \in f_{l(i)} \,\} \;;\; 1 \le i \le m,\; 1 \le l \le k$$

$$C_{(i)} = Doc[\, f_{l(i)} \,]$$

$$C_{(i)} = \{\, D_{c(i)}^{(x)} \;;\; D_{c(i)}^{(x)} \in D,\; 1 \le x \le r \,\}$$

where m denotes the number of partitions and r denotes the number of documents in each partition.

  The initial partitions (or clusters) are constructed from the mined frequent itemsets, which significantly reduce the dimensionality of the text document set, so clustering in the reduced dimensionality is considerably more efficient and scalable. Some researchers [15], [22] generated overlapping clusters in accordance with the frequent itemsets and then removed the overlapping documents. In the proposed research, non-overlapping partitions are generated directly from the frequent itemsets. The initial partitions are disjoint because the proposed approach keeps each document only within the best initial partition.

E. Clustering the Text Documents within the Partition.
  In this section, the process of clustering the set of partitions obtained from the previous step is described. This step is necessary to form sub-clusters (each describing a sub-topic) of a partition (describing one topic), and the resulting clusters can detect outlier documents. An ‘outlier document’ in a partition is defined as a document that is different from the remaining documents in the partition. Furthermore, the proposed approach does not require a pre-specified number of clusters. The devised procedure for clustering the text documents available in the set of partitions c is discussed below.

  In this phase, first the documents $D_{c(i)}^{(x)}$ and the familiar words $f_{c(i)}$ (the frequent itemset used for constructing the partition) of each partition $C_{(i)}$ are identified. Then, the derived keywords $K_d[D_{c(i)}^{(x)}]$ of document $D_{c(i)}^{(x)}$ are obtained by taking the complement of the familiar words $f_{c(i)}$ with respect to the top-p frequent words of the document $D_{c(i)}^{(x)}$.

$$K_d[D_{c(i)}^{(x)}] = \{\, Tw_j \setminus f_{c(i)} \,\} \;;\; Tw_j \in D_{c(i)}^{(x)},\; 1 \le i \le m,\; 1 \le j \le p,\; 1 \le x \le r$$

$$Tw_j \setminus f_{c(i)} = \{\, x \in Tw_j \mid x \notin f_{c(i)} \,\}$$

  The set of unique derived keywords of each partition $C_{(i)}$ is obtained, and the support of each unique derived keyword is computed within the partition.
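The partitioning of Section D and the derived keywords of Section E can be sketched together in Python. This is a hedged illustration, not the authors' implementation: `partition_documents` and `derived_keywords` are hypothetical names, the toy data are invented, and the order in which the sorted itemset lists are tried is left to the caller as an assumption of the sketch. Each document is assigned to the first itemset (highest support first) that it contains and is then removed from consideration, which keeps the partitions disjoint.

```python
def partition_documents(doc_words, sorted_itemset_lists):
    """Assign each document to the first frequent itemset it contains.

    doc_words: {doc_id: set of top-p words of that document}.
    sorted_itemset_lists: lists of itemsets, each list sorted by
        descending support, in the order in which they should be tried
        (an assumption of this sketch).
    Returns a list of (itemset, [doc_ids]); the document lists are disjoint.
    Documents that contain none of the itemsets stay unassigned.
    """
    unassigned = dict(doc_words)
    partitions = []
    for itemsets in sorted_itemset_lists:
        for itemset in itemsets:
            members = [d for d, words in unassigned.items()
                       if itemset <= words]
            if members:
                partitions.append((itemset, members))
                for d in members:
                    del unassigned[d]
            if not unassigned:
                return partitions
    return partitions


def derived_keywords(top_words, familiar_words):
    """Top-p words of a document minus the itemset that built its partition."""
    return set(top_words) - set(familiar_words)


if __name__ == "__main__":
    # Toy data (hypothetical): top-p words per document.
    doc_words = {
        "d1": {"oil", "price"},
        "d2": {"oil", "crude"},
        "d3": {"bank", "rate"},
        "d4": {"oil", "price"},
    }
    lists = [
        [frozenset({"oil", "price"})],                 # tried first
        [frozenset({"oil"}), frozenset({"bank"})],     # tried next
    ]
    for itemset, members in partition_documents(doc_words, lists):
        for d in members:
            print(d, sorted(itemset),
                  sorted(derived_keywords(doc_words[d], itemset)))
```

Removing a document from `unassigned` as soon as it is placed is what makes the partitions non-overlapping without a separate overlap-removal pass.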


satisfying the cluster support ( cl _ sup ) are formed as                                 obtained. Then, the initial partition is constructed using these
                                                                                          frequent itemsets, as shown in Table I. After that, a
representative words of the partition C(i ) . ‘The cluster                                representative of each partition is computed based on both the
support’ of a keyword in C(i ) is the percentage of the                                   top 10 and familiar words of the partition. The similarity
                                                                                          measure is calculated for each document in the partition, as
documents in C(i ) that contains the keyword.                                             shown in Table II. The resultant cluster is formed, only if the
                                                                                          similarity value of the documents within the partition is below
                               Rw [ c (i) ]  { x : p ( x) }
                                                                                          0.4. So, finally 19 clusters are obtained from 14 partitions, as
                                                     ( x)
                where, p ( x)  [ K d [ D            c(i ) ] ]    cl _ sup               shown in Table III.
                                                                              ( x)                                 TABLE I
Subsequently, the similarity of the documents Dc ( i ) with
                                                                                                   GENERATED PARTITIONS OF TEXT DOCUMENTS
respect to the representative words Rw [ c(i ) ] is found. The
                                                                                                Partition                Text Documents
definition of the similarity measure of text documents is                                                    d2, d3, d4, d5, d6, d7, d8, d9,    d10, d11,
strictly important for obtaining effective and meaningful                                          P1        d12, d13, d58, d64, d65, d66,      d67, d68,
clusters. The similarity measure of each document S m is                                                     d69, d72, d76, d77
                                                                                                             d14, d16, d36, d42, d43, d44,      d45, d46,
computed                      as                       follows:                                    P2        d49, d50 , d60, d73, d75, d85,     d90, d93,
        (
                                      (
S Kd [ D cxi)) ] , Rw[ c(i) ]  Kd [ D cxi))]  Rw[ c (i)]
           (                             (                                                                   d98, d100
                                                                                                             d39, d40, d41, d47, d48, d51,      d78, d82,
                                                              
                                   S Kd[ D (xi)) ], Rw[ c(i) ]                                     P3
                                                                                                             d83, d88
Sm Kd[ D (xi)) ], Rw[ c(i) ]
                                           Rw[ c(i) ]                                              P4
                                                                                                             d26, d27, d28, d29, d31, d33,      d35, d37,
                                                                                                             d57, d95, d96
                                                                                                   P5        d19, d55, d62
The documents within the partition are sorted according to
                                                                                                             d56, d63, d70, d71, d74, d80, d81, d87,
their similarity measure and the documents form a new cluster                                      P6
separately, when the similarity measure exceeds the minimum                                                  d17, d18, d20, d22, d23, d24, d30, d32
                                                                                                   P8        d25, d34, d38
  III. EXPERIMENTAL RESULTS AND PERFORMANCE                                                        P9        d79
                                                                                                   P10       d15, d91, d92, d94, d97
   The proposed approach is implemented using Java (JDK
                                                                                                   P11       d21
1.6). This implementation of the proposed algorithm is applied
on the text dataset that is collected from Reuters-21578 text                                      P12       d1
database. The dataset consists of 21,578 different text                                            P13       d52, d53, d54, d61
documents, which is used mainly by researchers. Out of these                                                 d59, d84, d86
many documents, 100 sample documents are taken to evaluate
the developed algorithm. The performance of the proposed                                                           TABLE III
approach is evaluated on these 100 text documents of Reuters-                                       SIMILARITY MEASURE OF TEXT DOCUMENTS
21578 [21] using Precision, Recall & F-Measure.                                                                             Text Document
   Reuters-21578 Text Database: The documents in the                                            Partition
                                                                                                                         (Similarity Measure)
Reuters-21578 collection appeared on the Reuters newswire in                                                 d2(0.125), d3(0.25), d4(0.125), d5(0.125),
1987. The documents were assembled and indexed with                                                          d6(0.125), d7(0.125), d8(0.125), d9 (0.25),
categories, by personnel from Reuters Ltd. and Carnegie                                             P1
                                                                                                             d10(0.125), d11(0.25), d12(0.125),
Group, Inc. in 1987. In 1990, the documents were made                                                        d13(0.125), d58(0.0), d64(0.5), d65(0.5),
available by Reuters and CGI for research purposes. Further                                                  d66(0.625), d67(0.375), d68(0.5), d69(0.625),
formatting and data file production was done in 1991 and                                                     d72(0.0), d76(0.375), d77(0.25)
1992 by David D. Lewis and Peter Shoemaker. Steve Finch                                                      d14(0.3333), d16(0.0), d36(0.1666),
                                                                                                             d42(0.3333), d43(0.5), d44(0.1666),
and David D. Lewis did cleanup of the collection in1996. The                                                 d45(0.3333), d46(0.5), d49(0.5), d50(0.6666),
new collection has only 21,578 documents, and thus the name                                         P2
                                                                                                             d60(0.0), d73(0.0), d75(0.0), d85(0.0),
Reuters-21578 collection.                                                                                    d90(0.3333), d93(0.3333), d98(0.1666),
A. Experimental Results.                                                                                     d100(0.1666)
   For experimental results, 100 documents are taken from                                                    d39(0.3846), d40(0.5385), d41(0.5385),
various topics and the top 10 words are extracted from each                                         P3
                                                                                                             d47(0.5385), d48(0.3846), d51(0.3077),
document. Then, a binary database is constructed with 452                                                    d78(0.2308), d82(0.1538), d83(0.0),
attributes. The frequent itemsets are mined from the binary                                                  d88(0.2308)
database and these itemsets are sorted based on their support                                                d26(0.3333), d27(0.6666), d28(0.4444),
level. Thus, 31 frequent itemsets of varying length are                                             P4       d29(0.4444), d31(0.5555), d33(0.6666),
                                                                                                             d35(0.5555), d37(0.7777), d57(0.1111),


                 d95(0.2222), d96(0.0)
        P5       d19(0.36), d55(0.36), d62(0.36)
        P6       d56(0.2272), d63(0.1818), d70(0.2727),
                 d71(0.0909), d74(0.3182), d80(0.3636),
                 d81(0.3182), d87(0.3182), d89(0.2727)
        P7       d17(0.4737), d18(0.4737), d20(0.4211),
                 d22(0.3684), d23(0.3684), d24(0.4211),
                 d30(0.1579), d32(0.1579)
        P8       d25(0.375), d34(0.375), d38(0.375)
        P9       d79(1.0)
        P10      d15(0.2093), d91(0.2093), d92(0.2093),
                 d94(0.2093), d97(0.2093)
        P11      d21(1.0)
        P12      d1(1.0)
        P13      d52(0.25), d53(0.25), d54(0.25), d61(0.25)
        P14      d59(0.4211), d84(0.5385), d86(0.5385)

                        TABLE III
                   RESULTANT CLUSTERS
    Partition   Cluster   Text Documents
        P1        C1      d64, d65, d66, d68, d69
                  C2      d2, d3, d4, d5, d6, d7, d8, d9, d10,
                          d11, d12, d13, d58, d67, d72, d76, d77
        P2        C3      d14, d16, d36, d42, d44, d45, d60, d73,
                          d75, d85, d90, d93, d98, d100
                  C4      d43, d46, d49, d50
        P3        C5      d39, d48, d51, d78, d82, d83, d88
                  C6      d40, d41, d47
        P4        C7      d27, d28, d29, d31, d33, d35, d37
                  C8      d26, d57, d95, d96
        P5        C9      d19, d55, d62
        P6        C10     d56, d63, d70, d71, d74, d80, d81, d87,
                          d89
        P7        C11     d17, d18, d20, d24
                  C12     d22, d23, d30, d32
        P8        C13     d25, d34, d38
        P9        C14     d79
        P10       C15     d15, d91, d92, d94, d97
        P11       C16     d21
        P12       C17     d1
        P13       C18     d52, d53, d54, d61
        P14       C19     d59, d84, d86

B. Performance Evaluation.
   The evaluation metrics Precision, Recall and F-measure
are used for evaluating the performance of the proposed
approach. The definitions of the evaluation metrics are given
below:

   \[ \mathrm{Precision}(i, j) = \frac{C_{ij}}{C_j} \]

   \[ \mathrm{Recall}(i, j) = \frac{C_{ij}}{C_i} \]

   \[ F(i, j) = \frac{2 \cdot \mathrm{Recall}(i, j) \cdot \mathrm{Precision}(i, j)}{\mathrm{Precision}(i, j) + \mathrm{Recall}(i, j)} \]

where
   $C_{ij}$ is the number of members of topic $i$ in cluster $j$,
   $C_j$ is the number of members of cluster $j$, and
   $C_i$ is the number of members of topic $i$.

   In order to evaluate the proposed approach on the Reuters-
21578 database, 100 documents are taken from 8 different
topics (acq, cocoa, coffee, cpi, crude, earn, money-fx, trade).
The proposed approach takes these documents as input text and
produces 19 clusters. For each cluster, the Precision,
Recall and F-measure are computed using the
above-mentioned definitions. The obtained results are shown
in Table IV.

                        TABLE IV
        CLUSTERING PERFORMANCES OF TEXT DOCUMENTS
    Partition   Cluster   Precision   Recall   F-measure
        P1        C1        1.0        0.36      0.53
                  C2        0.71       0.92      0.8
        P2        C3        1.0        0.31      0.47
                  C4        0.33       0.45      0.38
        P3        C5        1.0        0.23      0.37
                  C6        0.57       0.33      0.42
        P4        C7        1.0        0.5       0.67
                  C8        0.5        0.18      0.26
        P5        C9        1.0        0.25      0.4
        P6        C10       0.44       0.25      0.32
        P7        C11       1.0        0.36      0.53
                  C12       0.5        0.18      0.26
        P8        C13       1.0        0.21      0.35
        P9        C14       1.0        0.08      0.15
        P10       C15       0.8        0.36      0.5
        P11       C16       1.0        0.09      0.17
        P12       C17       1.0        0.08      0.15
        P13       C18       1.0        0.33      0.5
        P14       C19       0.66       0.17      0.27
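
   The Precision, Recall and F-measure definitions of this section can be
sketched in Python as follows. This is only an illustrative sketch, not the
implementation used in the experiments: the function names, the dictionary
input format, and the toy document labels are assumptions made for the
example, and C_i is counted here over the clustered documents only.

```python
from collections import Counter


def f_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * recall * precision / (precision + recall)


def cluster_scores(topic_of, cluster_of):
    """Return {(topic i, cluster j): (precision, recall, F)}.

    topic_of   maps document id -> true topic    (hypothetical input format)
    cluster_of maps document id -> assigned cluster
    """
    # C_i: number of members of topic i (among the clustered documents)
    c_i = Counter(topic_of[d] for d in cluster_of)
    # C_j: number of members of cluster j
    c_j = Counter(cluster_of.values())
    # C_ij: number of members of topic i in cluster j
    c_ij = Counter((topic_of[d], cluster_of[d]) for d in cluster_of)

    scores = {}
    for (i, j), n in c_ij.items():
        p = n / c_j[j]  # Precision(i, j) = C_ij / C_j
        r = n / c_i[i]  # Recall(i, j)    = C_ij / C_i
        scores[(i, j)] = (p, r, f_measure(p, r))
    return scores


# Toy data, invented for illustration only (not taken from Reuters-21578).
topic_of = {"d1": "earn", "d2": "earn", "d3": "earn", "d4": "trade"}
cluster_of = {"d1": "C1", "d2": "C1", "d3": "C2", "d4": "C2"}

for key, (p, r, f) in sorted(cluster_scores(topic_of, cluster_of).items()):
    print(key, round(p, 2), round(r, 2), round(f, 2))
```

   As a consistency check on the formula, f_measure(1.0, 0.36) evaluates to
about 0.53, which agrees with the F-measure reported for cluster C1 in
Table IV.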
                                                                                                            IV. CONCLUSION


   Text clustering is one of the important techniques of text
mining. The use of frequent associations for text clustering has
received a great deal of attention in research communities
since the mined frequent itemsets reduce the dimensionality
of the documents drastically.
   In this paper, an effective approach for text clustering is
developed based on frequent itemsets. In the
proposed work, the text documents are first preprocessed
and the Apriori algorithm is then applied to discover
frequent itemsets of different lengths. Next,
the documents are split into partitions using the sorted
frequent itemsets. Finally, the resultant clusters are
formed within each partition using the derived keywords. The
real-life dataset Reuters-21578 is used for analysing the frequent
itemset based text clustering approach, and the
evaluation metrics Precision, Recall and F-measure are used
for evaluating its performance. The high Precision values obtained
indicate the effectiveness of the proposed frequent-item based text
clustering approach. Furthermore, the proposed approach does
not require a pre-specified number of clusters.
   In conclusion, the importance of document clustering will
continue to grow along with the massive volumes of
unstructured data generated. Developing effective and
efficient methods for document clustering remains an essential
direction for research in text mining, especially text clustering.

                      V. FUTURE WORK

  Future study on text clustering using frequent items has the
following possible avenues:
• In the proposed approach, the Apriori algorithm is used to find
  the frequent itemsets in the data set. Apriori has to scan the
  database repeatedly if the number of frequent 1-itemsets is
  high or if the frequent patterns are long. A possible direction
  for future research is to use the FP-Growth algorithm in place
  of Apriori, since FP-Growth mines the frequent itemsets
  without candidate generation and is typically faster.
• The proposed approach does not take outliers into
  consideration. As a part of future work, outliers can also be
  handled.
• The current implementation is not suited to maintaining
  the clusters in a dynamic environment. So, another possible
  research direction is to develop an incremental clustering
  approach, which makes use of frequent itemsets, in order to
  avoid the complete re-clustering of the entire database each
  time a change is made in the database [5], [8], [10].

                        REFERENCES

[1]  Agrawal R and Srikant R, “Fast algorithms for mining association rules”,
     In Proceedings of 20th International Conference on Very Large Data
     Bases, Santiago, Chile, pp. 487-499, September 1994.
[2]  Agrawal R, Imielinski T and Swami A, “Mining association rules
     between sets of items in large databases”, In Proceedings of the
     International Conference on Management of Data, ACM SIGMOD, pp.
     207-216, Washington, DC, May 1993.
[3]  Bjornar Larsen and Chinatsu Aone, “Fast and Effective Text Mining
     Using Linear-time Document Clustering”, In Proceedings of the fifth
     ACM SIGKDD international conference on Knowledge discovery and
     data mining, San Diego, California, United States, pp. 16-22, 1999.
[4]  Congnan Luo, Yanjun Li and Soon M. Chung, “Text document
     clustering based on neighbors”, Data & Knowledge Engineering, Vol.
     68, No. 11, pp. 1271-1288, November 2009.
[5]  Domingos P and Hulten G, “Mining High-Speed Data Streams”, In
     Knowledge Discovery and Data Mining, pp. 71-80, 2000.
[6]  Feldman R., Sanger J., “The Text Mining Handbook”, Cambridge
     University Press, 2007.
[7]  Florian Beil, Martin Ester, Xiaowei Xu, “Frequent Term-based Text
     Clustering”, In KDD '02: Proceedings of the eighth ACM SIGKDD
     International conference on Knowledge discovery and data mining,
     pp. 436-442, 2002, doi:10.1145/775047.775110.
[8]  Guha S, Mishra N, Motwani R and O'Callaghan L, “Clustering Data
     Streams”, In IEEE Symposium on Foundations of Computer Science,
     pp. 359-366, 2000.
[9]  Haralampos Karanikas, Christos Tjortjis, Babis Theodoulidis, “An
     Approach to Text Mining using Information Extraction”, Proc.
     Knowledge Management Theory Applications Workshop (KMTA
     2000), Lyon, France, pp. 165-178, Sep 2000.
[10] Hulten G, Spencer L and Domingos P, “Mining time-changing data
     streams”, In Proceedings of the Seventh ACM SIGKDD International
     Conference on Knowledge Discovery and Data Mining, pp. 97-106,
     San Francisco, CA, 2001. ACM Press.
[11] Ah-Hwee Tan, “Text Mining: The state of art and the challenges”,
     In Proceedings, PAKDD'99 Workshop on Knowledge discovery from
     Advanced Databases (KDAD'99), Beijing, pp. 71-76, April 1999.
[12] Jain A K and Dubes R C, “Algorithms for Clustering Data”, Prentice
     Hall, Englewood Cliffs, 1988.
[13] Jiawei Han, Micheline Kamber, “Data Mining: Concepts and
     Techniques”, Morgan Kaufmann Publishers, 2006.
[14] Jain A.K., Murty M.N., Flynn P.J., “Data Clustering: A Review”,
     ACM Computing Surveys, Vol. 31, No. 3, pp. 264-323, 1999.
[15] Law M.H.C., Figueiredo M.A.T., Jain A.K., “Simultaneous feature
     selection and clustering using mixture models”, IEEE Transactions on
     Pattern Analysis and Machine Intelligence, 26(9), pp. 1154-1166, 2004.
[16] W.-L. Liu and X.-S. Zheng, “Documents Clustering based on Frequent
     Term Sets”, Intelligent Systems and Control, 2005.
[17] Le Wang, Li Tian, Yan Jia, Weihong Han, “A Hybrid Algorithm For
     Web Document Clustering Based On Frequent Term Sets And K-
     Means”, Advances in Web and Network Technologies and Information
     Management, Lecture Notes in Computer Science, 2007, Vol.
     4537/2007, pp. 198-203, DOI: 10.1007/978-3-540-72909-9_20.
[18] Lovins J.B., “Development of a stemming algorithm”, Mechanical
     Translation and Computational Linguistics, Vol. 11, pp. 22-31, 1968.
[19] Murali Krishna S., Durga Bhavani S., “An Efficient Approach For Text
     Clustering Based On Frequent Itemsets”, European Journal of
     Scientific Research, Euro Journals Publishing, Inc., ISSN 1450-216X,
     Vol. 42, No. 3, pp. 399-410, 2010.
[20] Pant G., Srinivasan P. and Menczer F., “Crawling the Web”, Web
     Dynamics: Adapting to Change in Content, Size, Topology and Use,
     edited by M. Levene and A. Poulovassilis, Springer-Verlag, pp. 153-
     178, November 2004.
[21] Reuters-21578, Text Categorization Collection, UCI KDD Archive.
[22] Shenzhi Li, Tianhao Wu, William M. Pottenger, “Distributed Higher
     Order Association Rule Mining Using Information Extracted from
     Textual Data”, ACM SIGKDD Explorations Newsletter, Natural
     language processing and text mining, Vol. 7, No. 1, pp. 26-35, 2005.
[23] Un Yong Nahm, Raymond J Mooney, “Text mining with information
     extraction”, CM, pp. 218, 2004.
[24] Xiangwei Liu, Pilian, “A Study On Text Clustering Algorithms Based
     On Frequent Term Sets”, Advanced Data Mining and Applications,
     Lecture Notes in Computer Science, 2005, Vol. 3584/2005, pp. 347-354,
     DOI: 10.1007/11527503_42.
[25] Zhou Chong, Lu Yansheng, Zou Lei, Hu Rong, “FICW: Frequent
     Itemset Based Text Clustering with Window Constraint”, Vol. 11, No. 5,
     pp. 1345-1351, 2006, DOI: 10.1007/BF02829264.


About the Authors

S. Suneetha pursued her Bachelor’s Degree in Science and in Education,
and her Master’s Degree in Computer Applications (MCA) from Sri
Venkateswara University, Tirupati, Andhra Pradesh, India. She completed
her M.Phil. in Computer Science from Sri Padmavati Mahila
Visvavidyalayam, Tirupati. She has presented and published papers in
International and National Conferences. Her main research interests
include Data Mining and Software Engineering. She is a life member of
ISTE. She served Narayana Engineering College, Nellore, Andhra Pradesh
as Sr. Asst. Professor, heading the departments of IT and

Dr. M. Usha Rani is an Associate Professor in the Department of Computer
Science and HOD for CSE&IT, Sri Padmavathi Mahila Visvavidyalayam
(Women’s University), Tirupati. She did her Ph.D. in Computer Science in
the area of Artificial Intelligence and Expert Systems. She has been
teaching since 1992. She has presented more than 34 papers at National
and International Conferences and published 19 articles in national and
international journals. She has also written 4 books: Data Mining -
Applications: Opportunities and Challenges, Superficial Overview of Data
Mining Tools, Data Warehousing & Data Mining, and Intelligent Systems &
Communications. She is guiding M.Phil. and Ph.D. scholars in areas such
as Artificial Intelligence, Data Warehousing and Data Mining, Computer
Networks and Network Security.

Mr. Yaswanth Kumar Avulapati received his MCA degree with First class
from Sri Venkateswara University, Tirupati. He received his M.Tech
Computer Science and Engineering degree with Distinction from Acharya
Nagarjuna University, Guntur. He is a research scholar in S.V. University,
Tirupati, Andhra Pradesh. He has presented a number of papers in national
and international conferences and seminars. He has attended a number of
workshops in different fields.
