International Journal of Research in Computer Science
eISSN 2249-8265 Volume 2 Issue 4 (2012) pp. 7-12
© White Globe Publications
www.ijorcs.org


PRIVACY PRESERVING MFI BASED SIMILARITY MEASURE FOR HIERARCHICAL DOCUMENT CLUSTERING

P. Rajesh1, G. Narasimha2, N. Saisumanth3
1,3 Department of CSE, VVIT, Nambur, Andhra Pradesh, India
Email: rajesh.pleti@gmail.com, saisumanth.nanduri@gmail.com
2 Department of CSE, JNTUH, Hyderabad, Andhra Pradesh, India
Email: narasimha06@gmail.com

Abstract: The growth of the World Wide Web has imposed great challenges for researchers in improving search efficiency over the internet. Nowadays, web document clustering has become an important research topic for providing the most relevant documents from the huge volume of results returned in response to a simple query. In this paper we first propose a novel approach that precisely defines clusters based on maximal frequent item sets (MFI) computed by the Apriori algorithm, and then uses the same MFI based similarity measure for hierarchical document clustering. By considering maximal frequent item sets, the dimensionality of the document set is decreased. Secondly, we provide privacy preservation of open web documents by avoiding duplicate documents, thereby protecting the individual copyrights of documents. This is achieved using an equivalence relation.

Keywords: Maximal Frequent Item set, Apriori algorithm, Hierarchical document clustering, equivalence relation.

I. INTRODUCTION

Document clustering has been studied intensively because of its wide applicability in areas such as web mining, search engines, text mining and information retrieval. The rapid growth of databases in every aspect of human activity has resulted in an enormous demand for efficient algorithms for turning data into valuable knowledge.

Document clustering has passed through various methods, yet it is still inefficient at providing exactly the information the user needs. Suppose a user makes an incorrect selection while browsing the documents in a hierarchy and does not notice the mistake until browsing deep into the hierarchy. This decreases the efficiency of search and increases the number of navigation steps needed to find relevant documents. We therefore need a hierarchical clustering that is relatively flat, reducing the number of navigation steps; hence there is a great need for new document clustering algorithms that are more efficient than conventional clustering algorithms [1, 2].

The growth of the World Wide Web has also posed great challenges for researchers in clustering similar documents on the internet and thereby improving search efficiency. Search engine users are getting more and more confused in selecting the relevant documents among the huge volumes of search results returned for a simple query. A potential solution to this problem is to cluster similar web documents, which helps the user identify relevant data easily and effectively [3].

The outline of this paper is as follows. Section II briefly discusses related work. Section III explains the proposed algorithm, including the common preprocessing steps and the pseudo code; it also covers precisely defining clusters based on maximal frequent item sets (MFI) via the Apriori algorithm. Section IV describes exploiting the same MFI based similarity measure for hierarchical document clustering, with a running example. Section V provides privacy preservation of open web documents using an equivalence relation to protect the individual copyrights of a document. Section VI presents the conclusion and future scope.

II. RELATED WORK

The related work on using maximal frequent item sets in web document clustering is as follows. Ling Zhuang and Honghua Dai [4] introduced a new criterion to specifically locate the initial points using maximal frequent item sets; these initial points are then used as centers for the k-means algorithm.
However, k-means clustering is a completely unstructured approach; it is sensitive to noise and produces an unorganized collection of clusters that is not favorable to interpretation [5, 6]. To minimize the overlapping of documents, Beil, Ester and Xu [7] proposed HFTC (Hierarchical Frequent Text Clustering), another frequent item set based approach, which chooses the next frequent item sets in turn. However, the clustering result depends on the order in which the next frequent item sets are chosen, and the resulting hierarchy in HFTC usually contains many clusters at the first level. As a result, documents of the same class are distributed into different branches of the hierarchy, which decreases the overall clustering accuracy.

C.M. Fung [8] introduced the FIHC (Frequent Item set based Hierarchical Clustering) method for document clustering, in which a cluster topic tree is constructed based on the similarity among clusters. FIHC uses efficient child pruning when the number of clusters is large and applies the more elaborate sibling merging only when the number of clusters is small. Experimental results show that FIHC outperforms other algorithms (bisecting k-means, UPGMA) in accuracy for most numbers of clusters.

The Apriori algorithm [9] is a well-known method for computing frequent item sets in a transaction database. Documents under the same topic share more common frequent item sets (terms) than documents of different topics. The main advantage of using frequent item sets is that they can identify relations among more than two documents at a time in a document collection, unlike a similarity measure between two documents [10, 11]. By means of maximal frequent item sets, the dimensionality of the document set is reduced; moreover, maximal frequent item sets capture the most related document sets. Hierarchical clustering, on the other hand, is most suitable for browsing, as it maps the most specific documents to generalized documents across the whole collection.

A conventional hierarchical clustering method constructs the hierarchy by subdividing a parent cluster or merging similar child clusters. It usually suffers from an inability to perform tuning once a merge or split decision has been executed, and this rigidity may lower the clustering accuracy. Furthermore, because a parent cluster in such a hierarchy always contains all objects of its children, this kind of hierarchy is not suitable for browsing: the user may have difficulty locating the intended object in such a large cluster.

Our hierarchical clustering method is completely different. First we form all the clusters by assigning documents to the most similar cluster using maximal frequent item sets obtained by the Apriori algorithm, and then we construct the hierarchical document clustering based on inter-cluster similarities via the same maximal frequent item set (MFI) based similarity measure. The clusters in the resulting hierarchy are non-overlapping, and a parent cluster contains only the general documents.

III. ALGORITHM DESCRIPTION

In this section we explain the proposed algorithm, including the common preprocessing steps and the pseudo code. It covers precisely defining clusters based on maximal frequent item sets (MFI) by the Apriori algorithm. First, we describe some common preprocessing steps for representing each document by item sets (terms). Second, we introduce the vector space model by assigning weights to the terms in all document sets. Finally, we explain the initialization of cluster seeds using MFI to perform hierarchical clustering. Let Ds represent the set of all documents in the database collection:

Ds = {d1, d2, d3, …, dM}, 1 ≤ i ≤ M

A. Pre-Processing

The document set Ds is converted from unstructured format into a common representation using text preprocessing techniques, in which words or terms are extracted (tokenization). The input documents in Ds are preprocessed by first removing HTML tags, then applying a stop word list and a stemming algorithm:

a) HTML tags: parse the documents and remove HTML tags.
b) Stop words: remove stop words such as conjunctions, connectives and prepositions.
c) Stemming: we utilize the Porter2 stemming algorithm in our approach.

A minimal sketch of these three steps is given below.
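The following sketch is illustrative rather than the authors' implementation: HTML is stripped with a plain regular expression, the stop word list is a small assumed subset, and NLTK's SnowballStemmer("english"), which implements the Porter2 algorithm named above, stands in for the stemmer.

    import re
    from nltk.stem import SnowballStemmer  # Snowball English = Porter2

    # Illustrative subset only; a real system would use a full list.
    STOP_WORDS = {"a", "an", "and", "are", "as", "at", "be", "by", "for",
                  "in", "into", "is", "it", "of", "on", "or", "the", "to"}
    stemmer = SnowballStemmer("english")

    def preprocess(html_doc):
        text = re.sub(r"<[^>]+>", " ", html_doc)       # a) strip HTML tags
        tokens = re.findall(r"[a-z]+", text.lower())   # tokenize
        tokens = [t for t in tokens if t not in STOP_WORDS]  # b) stop words
        return [stemmer.stem(t) for t in tokens]       # c) Porter2 stemming

For example, preprocess("<p>Connecting Java beans</p>") yields ['connect', 'java', 'bean'].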
B. Vector representation of documents

The vector space model is the most commonly used document representation model in text mining, web mining and information retrieval. In this model each document is represented as an N-dimensional term vector, where the value of each term reflects its importance in the corresponding document. Let N be the total number of terms and M the number of documents; each document can then be denoted as

di = (tfi1, tfi2, …, tfiN), 1 ≤ i ≤ M

Only a term tj whose document frequency df(tj) is less than a threshold value is considered; this avoids the problem that the more often a term appears throughout all documents in the whole collection, the more poorly it discriminates between documents [12]. The term frequency tf is the number of times a term appears in a document; the document frequency df of a term is the number of documents that contain the term. We also construct weighted document vectors Wi = (wi1, wi2, …, wiN), where

wj = tfj * IDF(j) and IDF(j) = log(M / dfj), 1 ≤ j ≤ N

and IDF is the inverse document frequency.
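As an illustration, the tf, df and IDF weighting just described might be computed as follows; this is a sketch under the stated definitions, with the df threshold left as a free parameter.

    import math
    from collections import Counter

    def weight_vectors(docs, df_threshold):
        """docs: list of token lists produced by preprocessing."""
        M = len(docs)
        df = Counter()
        for terms in docs:
            df.update(set(terms))           # df: number of docs with term
        # keep only terms whose df is below the threshold (Section III.B)
        vocab = sorted(t for t in df if df[t] < df_threshold)
        vectors = []
        for terms in docs:
            tf = Counter(terms)             # tf within this document
            vectors.append([tf[t] * math.log(M / df[t]) for t in vocab])
        return vocab, vectors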
Table 1: Transactional database representation of documents

    Terms      Doc 1   Doc 2   Doc 3   .....   Doc M
    Java         1       1       0     .....     1
    Beans        0       1       0     .....     0
    .....      .....   .....   .....   .....   .....
    Servlets     1       0       1     .....     1

By representing documents in vector form, we can easily identify which documents contain the same features; the more features documents have in common, the more related they are. Thus it is realistic to find well related documents. Assume that each document is an item in the transactional database and that each term corresponds to a transaction. Our aim is to search for highly related documents "appearing" together with the same features (the documents whose MFI features are closed). The discovery of maximal frequent item sets in the transaction database thus serves the purpose of finding sets of documents appearing together in many transactions, i.e., document sets which have a large number of features in common.

C. Apriori for maximal frequent item sets

Mining frequent item sets is a core task of data mining that focuses on finding the relations between different items in a large database. Mining frequent patterns is a crucial problem in many data mining applications, such as the discovery of association rules, correlations, multidimensional patterns, and numerous other patterns inferred from consumer market basket analysis, web access logs, etc. The association mining problem is formulated as follows: given a large database of transactions over a set of items, find all frequent item sets, where a frequent item set is one that occurs in at least a user-specified fraction of the database. Many of the proposed item set mining algorithms are variants of Apriori, which employs a bottom-up, breadth-first search that enumerates every single frequent item set. Apriori is a conventional algorithm that was first introduced for mining association rules. Association mining can be viewed as a two-step process:

(1) identifying all frequent item sets, and
(2) generating strong association rules from the frequent item sets.

First, candidate item sets are generated, and afterwards the frequent item sets are mined with the help of these candidate item sets. In the proposed approach we use only the frequent item sets for further processing, so we carry out only the first step (generation of maximal frequent item sets) of the Apriori algorithm.

A frequent item set is a set of words which occur frequently together; such sets are good candidates for clusters and are denoted by FI. An item set X is closed if there does not exist an item set X1 such that X ⊂ X1 and t(X) = t(X1), where t(X) is defined as the set of transactions that contain item set X; the closed frequent item sets are denoted by FCI. If X is frequent and no superset of X is frequent among the set of items I in the transactional database, then we say that X is a maximal frequent item set, denoted by MFI. Thus MFI ⊆ FCI ⊆ FI. Whenever very long patterns are present in the data, it is often impractical to generate the entire set of frequent item sets or closed item sets [16]; in that case the maximal frequent item sets are adequate for such applications. We employ the maximal frequent item set algorithm from [17], which uses Apriori. These maximal frequent item sets become the initial seeds for hierarchical document clustering, as sketched below.
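A compact levelwise Apriori that keeps only the maximal frequent item sets might look as follows. This is an illustrative sketch, not the exact algorithm of [17]; following Table 1, each term plays the role of a transaction and each document the role of an item, and minsup is an assumed absolute support count.

    from itertools import combinations

    def maximal_frequent_itemsets(transactions, minsup):
        """transactions: list of sets of items (here, document ids)."""
        def support(itemset):
            return sum(1 for t in transactions if itemset <= t)

        items = {i for t in transactions for i in t}
        level = [s for s in (frozenset([i]) for i in items)
                 if support(s) >= minsup]
        frequent = set(level)
        while level:
            # join k-item sets pairwise to build (k+1)-item candidates
            candidates = {a | b for a, b in combinations(level, 2)
                          if len(a | b) == len(a) + 1}
            level = [c for c in candidates if support(c) >= minsup]
            frequent.update(level)
        # keep only sets with no frequent proper superset: MFI ⊆ FCI ⊆ FI
        return [s for s in frequent
                if not any(s < t for t in frequent)]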
D. Pseudo code of the algorithm: MFI based similarity measure for hierarchical document clustering

Input: document set Ds.
Definitions: MFI: maximal frequent item set; tf: term frequency; df: document frequency.

Step 1. For each document in Ds, remove the HTML tags and perform stop word removal and stemming.

Step 2. Calculate the term frequency (tf) and document frequency (df) and represent each document as di = (tfi1, tfi2, …, tfiN), 1 ≤ i ≤ M, keeping only the terms tj with df(tj) < threshold value.

Step 3. Construct the weighted document vectors Wi = (wi1, wi2, …, wiN) for all documents, where wj = tfj * IDF(j) and IDF(j) = log(M / dfj), 1 ≤ j ≤ N.

Step 4. Represent each document by the keywords whose tf > support, di = {t1, t2, t3, …, tk}, and calculate the maximal frequent item sets (MFI) of terms using the Apriori algorithm, where each Mi = {t1, t2, t3, …, tm}.

Step 5. If a document dj is in more than one maximal frequent item set, let Sj be the set of maximal frequent item sets containing dj and initially assign Cj = M0. For each maximal frequent item set Mi in Sj, if jaccard(sim(Mi, dj)) > jaccard(sim(Cj, dj)), then assign Cj = Mi. Assign the document dj to Cj and discard dj from the other maximal frequent item sets. Repeat this process for all documents that occur in more than one maximal frequent item set (a sketch of this step follows the pseudo code).

Step 6. Apply hierarchical document clustering to make these maximal frequent item sets Mi clusters: combine the documents in each Mi into a single new document and represent it by the center of the maximal frequent item set, obtained by combining the features of the maximal frequent item set of terms that groups the documents.

Step 7. Repeat the same process of hierarchical document clustering based on maximal frequent item sets for all levels of the hierarchy; stop if the total number of documents equals one, else go to step 4.
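Step 5 can be sketched as follows; this is one illustrative reading of the pseudo code, in which mfi_terms[i] is assumed to hold the term set (center) of Mi and doc_terms[d] the term set of document d.

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a or b) else 0.0

    def assign_documents(mfis, mfi_terms, doc_terms):
        """mfis: list of document sets; returns non-overlapping clusters."""
        clusters = [set(m) for m in mfis]
        for d in {d for m in mfis for d in m}:
            holders = [i for i, m in enumerate(mfis) if d in m]
            if len(holders) > 1:                  # document is repeated
                best = max(holders, key=lambda i:
                           jaccard(mfi_terms[i], doc_terms[d]))
                for i in holders:                 # discard elsewhere
                    if i != best:
                        clusters[i].discard(d)
        return clusters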
IV. HIERARCHICAL CLUSTERS BASED ON MAXIMAL FREQUENT ITEM SETS

After finding the maximal frequent item sets (MFI) using the Apriori algorithm, we turn to the creation of the hierarchical document clustering using the same MFI based similarity measure. A simple example is provided to demonstrate the entire process. The set of maximal frequent item sets over the whole collection of documents Ds obtained by the Apriori algorithm is MFI = {M1, M2, M3, …, Mk}, where each maximal frequent item set consists of a set of documents, Mi = {d1, d2, d3, …, dp}. Consider the following collection of documents occurring in the maximal frequent item sets:

Ds = {d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, d15}

M1 = {d2, d4, d6}
M2 = {d3, d4, d8}
M3 = {d1, d5, d7}
M4 = {d4, d2, d14}
M5 = {d10, d12, d15}
M6 = {d9, d11, d13}

The clusters in the resulting hierarchy are non-overlapping. This is achieved through the following cases.

Case 1: If Mi and Mj are the same, choose one at random to form the cluster.

Case 2: If Mi and Mj are different, form clusters of the documents contained in Mi and Mj independently. In our example, the maximal frequent item sets M3, M5 and M6 contain distinct documents, so we form a cluster from the documents contained in M3 = {d1, d5, d7} as one cluster in the hierarchy and represent it by its center (as in step 6), and likewise for M5 and M6.

Case 3: Mi and Mj contain some of the same documents among the document lists obtained from the MFI. Consider the case of document d2, which is repeated in more than one maximal frequent item set, {M1, M4}; similarly, d4 is repeated in {M1, M2, M4}. For document d4, choose S = {M1, M2, M4} = {M0, M1, M2} and initially assign C = M0 = M1. For each maximal frequent item set Mi containing document d4, from M0 to M2, evaluate the test jaccard(sim(Mi, d4)) > jaccard(sim(C, d4)). Using this Jaccard measure we can identify which maximal frequent item set, among those containing d4, the document d4 is closest to, and then assign C = Mi.

Suppose that d4 is closest to the maximal frequent item set M4. Assign document d4 to C = M4 and discard d4 from the other maximal frequent item sets. After this step each document belongs to exactly one cluster; similarly, d2 belongs to M1. Repeat this process for all documents that occur in more than one maximal frequent item set. Since the documents d2 and d4 are repeated across M1, M2 and M4, the clusters that form at the first level of the hierarchy by applying steps 5 and 6 are as follows:

C1 = {d2, d6}
C2 = {d3, d8}
C3 = {d1, d5, d7}
C4 = {d4, d14}
C5 = {d10, d12, d15}
C6 = {d9, d11, d13}

The hierarchy for the above maximal frequent item set clusters can be represented as shown in Figure 1. Repeat the same process of hierarchical document clustering based on maximal frequent item sets for all levels of the hierarchy, stopping when the total number of documents equals one, else go to step 4.

Figure 1: Hierarchical document clustering using MFI
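For concreteness, the first-level assignment in this example can be replayed in a few lines. The choices of d2 staying in M1 and d4 staying in M4 are taken from the paper's Jaccard test above rather than recomputed here.

    mfis = {"M1": {"d2", "d4", "d6"}, "M2": {"d3", "d4", "d8"},
            "M3": {"d1", "d5", "d7"}, "M4": {"d4", "d2", "d14"},
            "M5": {"d10", "d12", "d15"}, "M6": {"d9", "d11", "d13"}}
    keep = {"d2": "M1", "d4": "M4"}   # closest MFI per the Jaccard test
    clusters = {name: {d for d in docs if d not in keep or keep[d] == name}
                for name, docs in mfis.items()}
    # clusters: C1={d2,d6}, C2={d3,d8}, C3={d1,d5,d7}, C4={d4,d14},
    #           C5={d10,d12,d15}, C6={d9,d11,d13}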
Represent each new document d2j in the hierarchy by its maximal frequent item set of terms as the center (as in step 6). These maximal frequent item sets are obtained by combining the features of the maximal frequent item sets of terms that group the documents. Each new document also carries the corresponding updated weights of its maximal frequent item set of terms; here dij denotes the jth document at level Li of the hierarchy. In Figure 1, {d12 = d21} means that the maximal frequent item set of terms of the 2nd document at level L1 did not match the MFI sets of the other documents at the same level L1, so the document is carried over unchanged to the next level; the same holds for the document {d13 = d22}. The documents {d11, d15} and {d14, d16} in the first level are combined using MFI based hierarchical clustering and are represented in the second level as d23 and d24.

V. PRIVACY PRESERVING OF WEB DOCUMENTS USING EQUIVALENCE RELATION

Most internet web documents are publicly available for providing the services required by users, and such documents contain no confidential or sensitive data (they are open to all). How, then, can we provide privacy for such documents? Nowadays the same information often exists in duplicate form in more than one document. Our way of providing privacy preservation of documents is to avoid duplicate documents, thereby protecting the individual copyrights of documents. Many duplicate document detection techniques are available, such as syntactic, URL based and semantic approaches, and each carries the processing overhead of maintaining shinglings, signatures or fingerprints [13, 14, 15, 18]. In this paper we propose a new technique for avoiding duplicate documents using an equivalence relation.

Let Ds be the input duplicate document set, a subset of the web document collection. First find the Jaccard similarity measure for every pair of documents in Ds, using the weighted feature representation of maximal frequent item sets discussed in steps 2 and 3 of the algorithm. If the similarity measure of two documents equals 1, the two documents are most similar; if the measure is 0, they are not duplicates. The Jaccard index, or Jaccard similarity coefficient, is a statistical measure of similarity between sample sets: for two sets it is the cardinality of their intersection divided by the cardinality of their union. Mathematically,

J(d1, d2) = |d1 ∩ d2| / |d1 ∪ d2|
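A pairwise Jaccard computation over Ds might be sketched as follows. Here each document is reduced to its set of retained MFI features, a deliberate simplification of the weighted representation above.

    def jaccard_matrix(doc_features):
        """doc_features: list of feature sets, one per document."""
        n = len(doc_features)
        return [[len(doc_features[i] & doc_features[j]) /
                 max(1, len(doc_features[i] | doc_features[j]))
                 for j in range(n)] for i in range(n)]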


For every pair of documents calculate the Jaccard measure. All the diagonal elements of the resulting matrix are ones, because every document is most related to itself; when classifying the documents into equivalence classes we do not consider these entries and set them to zero. The Jaccard similarity coefficient matrix for four documents can be represented as follows:

                d1    d2    d3    d4
          d1  [ 1     0.4   0.8   0.5 ]
    Rα =  d2  [ 0.4   1     0.8   0.4 ]
          d3  [ 0.8   0.8   1     0.9 ]
          d4  [ 0.5   0.4   0.9   1   ]

where α is the threshold. Define a relation R on Ds = {d1, d2, d3, d4} as the collection of document pairs whose similarity measure μ is above some threshold value, i.e. R = {(di, dj) : μ(di, dj) ≥ threshold value}. Then:

1. R is reflexive on Ds iff μ(di, di) = 1, i.e. every document is most related to itself.
2. R is symmetric on Ds iff μ(di, dj) = μ(dj, di), i.e. if document di is similar to dj then document dj is also similar to di.
3. R is transitive on Ds iff μ(di, dk) ≥ maxj { min{ μ(di, dj), μ(dj, dk) } }; by this definition R is transitive.

Then R is an equivalence relation on Ds, which partitions the input document set Ds into a set of equivalence classes. An equivalence relation is thus a natural technique for duplicate document categorization: any two documents in the same equivalence class are related, and documents are different if they come from two different equivalence classes. The set of all equivalence classes induces a partition of the document set Ds. Pairs of documents with high syntactic similarity are typically referred to as duplicates or near duplicates, the diagonal elements excepted. Using the equivalence relation we can easily identify duplicate documents, or perform clustering on the duplicate documents. Apart from the representation of the document feature vector by MFI, considering who the author of a document is, when the document was created, and where it is available also helps in effectively finding duplicate documents. Each document in the input Ds must belong to a unique equivalence class. If R is an equivalence relation on Ds = {d1, d2, d3, …, dn}, then the number of pairs in R always lies between n ≤ |R| ≤ n², i.e. the time complexity of calculating the equivalence relation on Ds is O(n²).

Choose the threshold α in the equivalence relation as 0.8, i.e. μ(di, dj) ≥ 0.8. Since the matrix is symmetric, the document pairs {(d3, d1), (d3, d2), (d4, d3)} are most related; hence these documents are near duplicates, and grouping them into clusters thereby provides privacy for the individual copyrights of the documents. The corresponding thresholded relation matrix (with the diagonal set to zero) is:

                          0    0    1   0                          Data mining 2002 (KDD-2002), Edmonton, Alberta,
                          0    0    1   0
                                                                     Canada.
                  R 0.8 =                                     [8] BenjaminFung, C.M., Wang, Ke., Ester, Martin. (2003).
                          1    1    0   1                          “Hierarchical Document Clustering using Frequent Item
                                                                   Sets”. In Proceedings SIAM International Conference
                          0    0    1   0                          on Data Mining 2003 (SIAM DM-2003), pp:59-70.
                                                                [9] Agrawal, R., Srikant, R. (1994). “Fast Algorithms for
VI. CONCLUSION AND FUTURE SCOPE

Cluster analysis can be used as a powerful stand-alone data mining technique for gaining insight and knowledge from huge unstructured databases. Most conventional clustering methods do not satisfy document clustering requirements such as high dimensionality, huge volume, and easy access to meaningful cluster labels. In this paper we presented a novel approach, a maximal frequent item set (MFI) based similarity measure for hierarchical document clustering, to address these issues. Dimensionality reduction is achieved through MFI, and by using the same MFI similarity measure in hierarchical document clustering the number of levels is decreased, which makes browsing easy. Clustering has its roots in many areas, including data mining, statistics, biology, and machine learning, and by applying MFI based techniques to clustering in these areas we can obtain high quality clusters. Moreover, by means of maximal frequent item sets we can predict the most influential objects of clusters in entire datasets for applications such as business, marketing, the world wide web, and social network analysis.

VII. REFERENCES

[1] Rui Xu, Donald Wunsch, "A Survey of Clustering Algorithms". IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
[2] Jain, A.K., Murty, M.N., Flynn, P.J., "Data Clustering: A Review". ACM Computing Surveys, Vol. 31, No. 3, 1999, pp: 264-323.
[3] Kleinberg, J.M., "Authoritative Sources in a Hyperlinked Environment". Journal of the ACM, Vol. 46, No. 5, 1999, pp: 604-632.
[4] Ling Zhuang, Honghua Dai (2004). "A Maximal Frequent Item Set Approach for Web Document Clustering". In Proceedings of the IEEE Fourth International Conference on Computer and Information Technology (CIT-2004).
[5] Michael W. Trosset (2008). "Representing Clusters: k-Means Clustering, Self-Organizing Maps and Multidimensional Scaling". Technical Report, Department of Statistics, Indiana University, Bloomington, 2008.
[6] Michael Steinbach, George Karypis, Vipin Kumar (2000). "A Comparison of Document Clustering Techniques". In Proceedings of the Workshop on Text Mining (KDD-2000), Boston, pp: 109-111.
[7] Beil, F., Ester, M., Xu, X. (2002). "Frequent Term-Based Text Clustering". In Proceedings of the 8th International Conference on Knowledge Discovery and Data Mining (KDD-2002), Edmonton, Alberta, Canada.
[8] Fung, B.C.M., Wang, Ke., Ester, Martin (2003). "Hierarchical Document Clustering using Frequent Item Sets". In Proceedings of the SIAM International Conference on Data Mining (SIAM DM-2003), pp: 59-70.
[9] Agrawal, R., Srikant, R. (1994). "Fast Algorithms for Mining Association Rules". In Proceedings of the 20th International Conference on Very Large Data Bases, Santiago, Chile, 1994, pp: 487-499.
[10] Liu, W.L., Zeng, X.S. (2005). "Document Clustering Based on Frequent Term Sets". In Proceedings of Intelligent Systems and Control, 2005.
[11] Zamir, O., Etzioni, O. (1998). "Web Document Clustering: A Feasibility Demonstration". In Proceedings of ACM SIGIR 1998, pp: 46-54.
[12] Kjersti Aas (1997). "A Survey on Personalized Information Filtering Systems for the World Wide Web". Technical Report 922, Norwegian Computing Center, 1997.
[13] Prasannakumar, J., Govindarajulu, P., "Duplicate and Near Duplicate Documents Detection: A Review". European Journal of Scientific Research, ISSN 1450-216X, Vol. 32, No. 4, 2009, pp: 514-527.
[14] Syed Mudhasir, Y., Deepika, J., "Near Duplicate Detection and Elimination Based on Web Provenance for Efficient Web Search". International Journal on Internet and Distributed Computing Systems, Vol. 1, No. 1, 2011.
[15] Alsulami, B.S., Abulkhair, F., Essa, E., "Near Duplicate Document Detection Survey". International Journal of Computer Science and Communications Networks, Vol. 2, No. 2, pp: 147-151.
[16] Doug Burdick, Manuel Calimlim, Johannes Gehrke (2001). "A Maximal Frequent Itemset Algorithm for Transactional Databases". In Proceedings of the 17th International Conference on Data Engineering (ICDE-2001).
[17] Murali Krishna, S., Durga Bhavani, S., "An Efficient Approach for Text Clustering Based On Frequent Item Sets". European Journal of Scientific Research, ISSN 1450-216X, Vol. 42, No. 3, 2010, pp: 399-410.
[18] Lopresti, D.P. (1999). "Models and Algorithms for Duplicate Document Detection". In Proceedings of the Fifth International Conference on Document Analysis and Recognition (ICDAR-1999), 20-22 Sep, pp: 297-300.

				