The rapid growth of the World Wide Web poses great challenges for researchers trying to improve search efficiency over the internet. Web document clustering has therefore become an important research topic, because it helps surface the most relevant documents among the huge volumes of results returned in response to a simple query. In this paper, we first propose a novel approach that precisely defines clusters based on maximal frequent item sets (MFI) mined by the Apriori algorithm. We then use the same MFI based similarity measure for hierarchical document clustering. Considering only maximal frequent item sets reduces the dimensionality of the document set. Second, we provide privacy preservation for open web documents by avoiding duplicate documents, thereby protecting the individual copyrights of documents. This is achieved using an equivalence relation.
International Journal of Research in Computer Science, eISSN 2249-8265, Volume 2 Issue 4 (2012) pp. 7-12, © White Globe Publications, www.ijorcs.org

PRIVACY PRESERVING MFI BASED SIMILARITY MEASURE FOR HIERARCHICAL DOCUMENT CLUSTERING

P. Rajesh1, G. Narasimha2, N. Saisumanth3
1,3 Department of CSE, VVIT, Nambur, Andhra Pradesh, India. Email: rajesh.pleti@gmail.com, saisumanth.nanduri@gmail.com
2 Department of CSE, JNTUH, Hyderabad, Andhra Pradesh, India. Email: narasimha06@gmail.com

Keywords: Maximal Frequent Item set, Apriori algorithm, Hierarchical document clustering, equivalence relation.

I. INTRODUCTION

Document clustering has been studied intensively because of its wide applicability in areas such as web mining, search engines, text mining and information retrieval. The rapid growth of databases in every aspect of human activity has created an enormous demand for efficient algorithms that turn data into valuable knowledge.

Document clustering has been approached with various methods, yet it is still inefficient at providing the information the user needs, whether exactly or approximately. Suppose the user makes an incorrect selection while browsing the documents in a hierarchy. If the user does not notice the mistake until he has browsed deep into the hierarchy, the efficiency of the search decreases and the number of navigation steps needed to find relevant documents increases. We therefore need a hierarchical clustering that is relatively flat, which reduces the number of navigation steps. There is thus a great need for new document clustering algorithms that are more efficient than conventional clustering algorithms [1, 2].

The rapid growth of the World Wide Web has imposed great challenges for researchers who want to cluster similar documents over the internet and thereby improve the efficiency of search. Search engine users are increasingly confused when selecting the relevant documents among the huge volumes of search results returned for a simple query. A potential solution to this problem is to cluster similar web documents, which helps the user identify the relevant data easily and effectively [3].

This paper is divided into six sections. Section II briefly discusses related work. Section III describes our proposed algorithm, including common preprocessing steps and the pseudo code of the algorithm. It also explains how clusters are precisely defined based on maximal frequent item sets (MFI) mined by the Apriori algorithm. Section IV describes how the same MFI based similarity measure is used for hierarchical document clustering, with a running example. Section V provides privacy preservation for open web documents by using an equivalence relation to protect the individual copyrights of a document. Section VI presents the conclusion and future scope.

II. RELATED WORK

The related work on using maximal frequent item sets in web document clustering is as follows. Ling Zhuang and Honghua Dai [4] introduced a new criterion for locating the initial points using maximal frequent item sets. These initial points are then used as centers for the k-means algorithm. However, k-means clustering is a completely unstructured approach, is sensitive to noise, and produces an unorganized collection of clusters that is not favorable to interpretation [5, 6]. To minimize the overlapping of documents, Beil et al. [7] proposed HFTC (Hierarchical Frequent Text Clustering), another frequent item set based approach, which chooses the next frequent item set at each step. However, the clustering result depends on the order in which the next frequent item sets are chosen. The resulting hierarchy in HFTC usually contains many clusters at the first level. As a result, documents of the same class are distributed into different branches of the hierarchy, which decreases the overall clustering accuracy.

C.M. Fung [8] introduced the FIHC (Frequent Item set based Hierarchical Clustering) method for document clustering, in which a cluster topic tree is constructed based on the similarity among clusters. FIHC uses efficient child pruning when the number of clusters is large and applies the more elaborate sibling merging only when the number of clusters is small. The reported experimental results indicate that FIHC outperforms the compared algorithms (bisecting k-means, UPGMA) in accuracy for most numbers of clusters.

The Apriori algorithm [9] is a well-known method for computing frequent item sets in a transaction database. Documents under the same topic share more common frequent item sets (terms) than documents of different topics. The main advantage of using frequent item sets is that they can identify the relation among more than two documents at a time in a document collection, unlike a similarity measure between two documents [10, 11]. By means of maximal frequent item sets, the dimensionality of the document set is reduced. Moreover, maximal frequent item sets capture the most related document sets. Hierarchical clustering, on the other hand, is most relevant for browsing and maps the most specific documents to generalized documents in the whole collection.

A conventional hierarchical clustering method constructs the hierarchy by subdividing a parent cluster or merging similar child clusters. It usually suffers from its inability to perform any adjustment once a merge or split decision has been made. This rigidity may lower the clustering accuracy. Furthermore, because a parent cluster in the hierarchy always contains all objects of its children, this kind of hierarchy is not suitable for browsing. The user may have difficulty locating the intended object in such a large cluster.

Our hierarchical clustering method is completely different. We first form all the clusters by assigning documents to the most similar cluster using maximal frequent item sets mined by the Apriori algorithm, and then construct the hierarchical document clustering based on the inter-cluster similarities, using the same maximal frequent item set (MFI) based similarity measure. The clusters in the resulting hierarchy are non-overlapping. The parent cluster contains only the general documents.

III. ALGORITHM DESCRIPTION

In this section, we describe our proposed algorithm, including common preprocessing steps and the pseudo code of the algorithm. We also explain how clusters are precisely defined based on maximal frequent item sets (MFI) mined by the Apriori algorithm. First, we discuss some common preprocessing steps for representing each document by item sets (terms). Second, we introduce the vector space model by assigning weights to the terms in all documents. Finally, we explain how cluster seeds are initialized using MFI to perform hierarchical clustering. Let $D_s$ represent the set of all documents in the collection:

$D_s = \{d_1, d_2, d_3, \dots, d_M\}$, with documents $d_i$, $1 \le i \le M$.

A. Pre-Processing

The document set $D_s$ is converted from an unstructured format into a common representation using text preprocessing techniques, in which words or terms are extracted (tokenization). The input documents in $D_s$ are preprocessed by first removing HTML tags and then applying a stop word list and a stemming algorithm.

a) HTML tags: parse and remove the HTML tags.
b) Stop words: remove stop words such as conjunctions, connectives and prepositions.
c) Stemming algorithm: we use the Porter 2 stemmer algorithm in our approach.

B. Vector representation of document

The vector space model is the most commonly used document representation model in text mining, web mining and information retrieval. In this model, each document is represented as an n-dimensional term vector. The value of each term in the n-dimensional vector reflects the importance of that term in the corresponding document. Let $n$ be the total number of terms and $M$ be the number of documents. Each document can be denoted as

$d_i = (term_{i1}, term_{i2}, \dots, term_{in})$, $1 \le i \le M$, where $df(term_{ij}) < \text{threshold}$.

Only terms whose document frequency $df(term_{ij})$ is below the threshold value are considered, because the more often a term appears throughout all documents in the whole collection, the more poorly it discriminates between documents [12]. The term frequency tf is the number of times a term appears in a document. The document frequency df of a term is the number of documents that contain the term. We also construct the weighted document vectors

$d_i = (w_{i1}, w_{i2}, w_{i3}, \dots, w_{in})$, where $w_{ij} = tf_{ij} \cdot IDf(j)$ and $IDf(j) = \log\left(\frac{M}{df_j}\right)$, $1 \le j \le n$,

where IDf is the inverse document frequency.

Table 1: Representation of the transactional database of documents

Terms      Doc 1   Doc 2   Doc 3   .....   Doc 4
Java       1       1       0       .....   1
Beans      0       1       0       .....   0
.....      .....   .....   .....   .....   .....
Servlets   1       0       1       .....   1

By representing documents in vector form, we can easily identify which documents contain the same features. The more features documents have in common, the more related they are. Thus, it is realistic to find well related documents. Assume that each document is an item in the transactional database and each term corresponds to a transaction. Our aim is to search for highly related documents "appearing" together with the same features (the documents whose MFI features are close). Similarly, the maximal frequent item set discovery in the transaction database serves the purpose of finding documents that appear together in many transactions, i.e., document sets which have a large number of features in common.

C. Apriori for maximal frequent item sets

Mining frequent item sets is a core topic of data mining that focuses on finding relations among different items in a large database. Mining frequent patterns is a crucial problem in many data mining applications, such as the discovery of association rules, correlations, multidimensional patterns, and numerous other important patterns inferred from consumer market basket analysis, web access logs, etc. The association mining problem is formulated as follows: given a large database of item transactions, find all frequent item sets, where a frequent item set is one that occurs in at least a user-specified fraction of the database. Many of the proposed item set mining algorithms are variants of Apriori, which employs a bottom-up, breadth-first search that enumerates every single frequent item set. Apriori is a conventional algorithm that was first introduced for mining association rules. Association rule mining can be viewed as a two-step process:

(1) Identifying all frequent item sets
(2) Generating strong association rules from the frequent item sets

At first, candidate item sets are generated, and afterwards the frequent item sets are mined with the help of these candidate item sets. In the proposed approach, we use only the frequent item sets for further processing, so we perform only the first step of the Apriori algorithm (generation of maximal frequent item sets).

A frequent item set is a set of words which occur frequently together; such sets are good candidates for clusters and are denoted by FI. An item set $X$ is closed if there does not exist an item set $X_1$ such that $X \subset X_1$ and $t(X) = t(X_1)$, where $t(X)$ is defined as the set of transactions that contain item set $X$; closed frequent item sets are denoted by FCI (frequent closed item sets). If $X$ is frequent and no superset of $X$ is frequent among the set of items $I$ in the transactional database, then we say that $X$ is a maximal frequent item set, denoted by MFI. Thus $MFI \subset FCI \subset FI$. Whenever very long patterns are present in the data, it is often impractical to generate the entire set of frequent item sets or closed item sets [16]. In that case, maximal frequent item sets are adequate for such applications. We employed the maximal frequent item set algorithm from [17] using Apriori. These maximal frequent item sets are the initial seeds for hierarchical document clustering.

D. Pseudo code: Algorithm for MFI Based Similarity Measure for Hierarchical Document Clustering

Input: Document set $D_s$.
Definitions: MFI: maximal frequent item set; tf: term frequency; df: document frequency.

Step 1. For each document in $D_s$, remove the HTML tags and apply the stop word list and stemming.

Step 2. Calculate the term frequency (tf) and document frequency (df), and represent each document as $d_i = (term_{i1}, term_{i2}, \dots, term_{in})$, $1 \le i \le M$, where $df(term_{ij}) < \text{threshold}$.

Step 3. Construct the weighted document vectors for all the documents: $d_i = (w_{i1}, w_{i2}, w_{i3}, \dots, w_{in})$, where $w_{ij} = tf_{ij} \cdot IDf(j)$ and $IDf(j) = \log\left(\frac{M}{df_j}\right)$, $1 \le j \le n$.

Step 4. Represent each document by the keywords whose tf > support. Calculate the maximal frequent item sets (MFI) of terms using the Apriori algorithm, $MFI = \{F_1, F_2, F_3, \dots, F_n\}$, where each $F_i = \{d_1, d_2, d_3, \dots, d_n\}$ is the set of documents containing that item set.

Step 5. If a document $d_i$ is in more than one maximal frequent item set, choose $F_i = \{F_{i0}, F_{i1}, \dots\}$ as the set consisting of the maximal frequent item sets that contain $d_i$, and take $F_{i0}$ as the initial candidate. For each maximal frequent item set $F_{ij}$ containing $d_i$, compare the Jaccard measure between $d_i$ and the center of $F_{ij}$ with that of the current candidate, and keep the set whose center is closest to $d_i$. Assign the document $d_i$ to that set and discard $d_i$ from the other maximal frequent item sets. Repeat this process for all documents that occur in more than one maximal frequent item set among the document lists obtained from MFI.

Step 6. Apply hierarchical document clustering to make these maximal frequent item sets $F_i$ into clusters: combine the documents in each $F_i$ into a single new document and represent it by the center of the maximal frequent item set. The centers are obtained by combining the features of the maximal frequent item set of terms that groups the documents.

Step 7. Repeat the same process of hierarchical document clustering based on maximal frequent item sets for all levels in the hierarchy. Stop if the total number of documents equals one; otherwise go to Step 4.

IV. HIERARCHICAL CLUSTERS BASED ON MAXIMAL FREQUENT ITEM SETS

After finding the maximal frequent item sets (MFI) using the Apriori algorithm, we turn to describing the creation of the hierarchical document clustering using the same MFI based similarity measure. A simple example is also provided to demonstrate the entire process. The set of maximal frequent item sets found by the Apriori algorithm among the whole collection of documents $D_s$ is $MFI = \{F_1, F_2, F_3, \dots, F_n\}$, where each maximal frequent item set consists of a set of documents, $F_i = \{d_1, d_2, d_3, \dots, d_n\}$. Consider the documents which occur in the maximal frequent item sets,

$\{d_1, d_2, d_3, d_4, d_5, d_6, d_7, d_8, d_9, d_{10}, d_{11}, d_{12}, d_{13}, d_{14}, d_{15}\}$,

with the maximal frequent item sets as follows:

$F_1 = \{d_2, d_4, d_6\}$
$F_2 = \{d_3, d_4, d_8\}$
$F_3 = \{d_1, d_5, d_7\}$
$F_4 = \{d_4, d_2, d_{14}\}$
$F_5 = \{d_{10}, d_{12}, d_{15}\}$
$F_6 = \{d_9, d_{11}, d_{13}\}$

The same process of hierarchical document clustering based on maximal frequent item sets is repeated for all levels in the hierarchy, stopping when the total number of documents equals one and otherwise returning to Step 4. The clusters in the resulting hierarchy are non-overlapping. This is achieved through the following cases.

Case 1: If $F_i$ and $F_j$ are the same, choose one at random to form a cluster.

Case 2: If $F_i$ and $F_j$ are different, form clusters of the documents contained in $F_i$ and $F_j$ independently. In our example, the maximal frequent item sets $F_3$, $F_5$ and $F_6$ are different. So we form clusters according to the documents they contain; for example, $F_3 = \{d_1, d_5, d_7\}$ becomes one cluster in the hierarchy and is represented by its center (as in Step 6).

Case 3: If $F_i$ and $F_j$ contain some of the same documents, consider the case where document $d_2$ is repeated in more than one maximal frequent item set, $\{F_1, F_4\}$. Similarly, $d_4$ is repeated in $\{F_1, F_2, F_4\}$. Then choose $F_i = \{F_1, F_2, F_4\} = \{F_{i0}, F_{i1}, F_{i2}\}$ for document $d_4$ and take $F_{i0} = F_1$ as the initial candidate. For each of the maximal frequent item sets $F_{i0}$ to $F_{i2}$ containing $d_4$, calculate the Jaccard measure between $d_4$ and the center of that set. Using this Jaccard measure, we can identify which maximal frequent item set, among those containing $d_4$, is closest to $d_4$, and make it the current candidate. Suppose that $d_4$ is closest to the maximal frequent item set $F_4$. Assign the document $d_4$ to $F_4$ and discard $d_4$ from the other maximal frequent item sets. After this step, each document belongs to exactly one cluster. Similarly, $d_2$ belongs to $F_1$. Repeat this process for all documents that occur in more than one maximal frequent item set.

Since the documents $d_2$ and $d_4$ are repeated in $F_1$ and $F_4$, the clusters formed at the first level of the hierarchy by applying Step 5 and Step 6 are as follows:

$F_1 = \{d_2, d_6\}$
$F_2 = \{d_3, d_8\}$
$F_3 = \{d_1, d_5, d_7\}$
$F_4 = \{d_4, d_{14}\}$
$F_5 = \{d_{10}, d_{12}, d_{15}\}$
$F_6 = \{d_9, d_{11}, d_{13}\}$

The hierarchical diagram for these maximal frequent item set clusters can be represented as follows.

Figure 1: Hierarchical document clustering using MFI

Each new document $d_{ij}$ in the hierarchy is represented by a maximal frequent item set of terms as its center (as in Step 6). These maximal frequent item sets are obtained by combining the features of the maximal frequent item sets of terms that group the documents. Each new document also carries the corresponding updated weights of the maximal frequent item set of terms. Here $d_{ij}$ denotes the $j$th document in level $L_i$ of the hierarchy. In the figure, $\{d_{12} = d_{21}\}$ means that the maximal frequent item set of terms in the 2nd document of level $L_1$ does not match the MFI set of any other document in the same level $L_1$, so it is carried over unchanged to the next level. The same holds for the document $\{d_{13} = d_{22}\}$. The documents $\{d_{11}, d_{15}\}$ and $\{d_{14}, d_{16}\}$ in the first level are combined using MFI based hierarchical clustering and are represented in the second level as $d_{23}$ and $d_{24}$.

V. PRIVACY PRESERVING OF WEB DOCUMENTS USING EQUIVALENCE RELATION

Most internet web documents are publicly available in order to provide services required by the user. Such documents contain no confidential or sensitive data (they are open to all). How, then, can we provide privacy for such documents? Nowadays the same information often exists in more than one document, in duplicate forms. Our way of preserving the privacy of documents is to avoid duplicate documents. We can thereby protect the individual copyrights of documents. Many duplicate document detection techniques are available, such as syntactic, URL based and semantic approaches. Each technique carries the processing overhead of maintaining shingles, signatures or fingerprints [13, 14, 15, 18]. In this paper, we propose a new technique for avoiding duplicate documents using an equivalence relation.

Let $D_s$ be the input duplicate document set, a subset of the web document collection. First, find the Jaccard similarity measure for every pair of documents in $D_s$ using the weighted feature representation of maximal frequent item sets discussed in Step 2 and Step 3 of the algorithm. If the similarity measure of two documents is equal to 1, then the two documents are most similar. If the measure is 0, then they are not duplicates. The Jaccard index, or Jaccard similarity coefficient, is a statistical measure of similarity between sample sets. For two sets, it is defined as the cardinality of their intersection divided by the cardinality of their union. Mathematically,

$J(d_1, d_2) = \dfrac{|d_1 \cap d_2|}{|d_1 \cup d_2|}$

For every pair of documents $d_1$, $d_2$, calculate the Jaccard measure. All the diagonal elements in the matrix are ones, because every document is most related to itself. When we classify the documents into equivalence classes, we do not consider these ones and put zeros instead. The Jaccard similarity coefficient matrix for four documents (rows and columns ordered $d_1, d_2, d_3, d_4$) can be represented as follows:

$$R_\alpha = \begin{pmatrix} 1 & 0.4 & 0.8 & 0.5 \\ 0.4 & 1 & 0.8 & 0.4 \\ 0.8 & 0.8 & 1 & 0.9 \\ 0.5 & 0.4 & 0.9 & 1 \end{pmatrix}$$

where $\alpha$ is the threshold. Define a relation $R$ on $D_s = \{d_1, d_2, d_3, d_4\}$ as the collection of document pairs whose similarity measure is at or above some threshold value, i.e., $R = \{(d_i, d_j) \mid J(d_i, d_j) \ge \alpha\}$.

1. R is reflexive on $D_s$ iff $J(d_i, d_i) = 1$, i.e., every document is most related to itself.
2. R is symmetric on $D_s$ iff $J(d_i, d_j) = J(d_j, d_i)$, i.e., if the document $d_i$ is similar to $d_j$ then the document $d_j$ is also similar to $d_i$.
3. R is transitive on $D_s$ iff $J(d_i, d_k) \ge \max_j \{\min\{J(d_i, d_j), J(d_j, d_k)\}\}$. Then R is transitive by definition.

Then R is an equivalence relation on $D_s$, which partitions the input document set $D_s$ into a set of equivalence classes. An equivalence relation seems a natural technique for duplicate document categorization. Any two documents in the same equivalence class are related, and documents from two different equivalence classes are different. The set of all equivalence classes covers the document set $D_s$. Pairs of documents with high syntactic similarity, excluding the diagonal elements, are typically referred to as duplicates or near duplicates. Using the equivalence relation, we can easily identify the duplicate documents or perform clustering on the duplicate documents. Apart from the MFI based feature vector of a document, it also helps to consider who the author of the document is, when the document was created and where it is available, in order to find duplicate documents effectively. Each document in the input $D_s$ must belong to a unique equivalence class. If R is an equivalence relation on $D_s = \{d_1, d_2, d_3, d_4, \dots, d_n\}$, then the number of ordered pairs in R always satisfies $n \le |R| \le n^2$, i.e., the time complexity of computing the equivalence relation on $D_s$ is $O(n^2)$.

Choose the threshold $\alpha$ in the equivalence relation as 0.8, i.e., $J(d_i, d_j) \ge 0.8$. Since the matrix is symmetric, the document pairs $\{(d_3, d_1), (d_3, d_2), (d_4, d_3)\}$ are the most related. Hence these documents are near duplicates, and grouping them into clusters provides privacy for the individual copyrights of documents.

$$R_{0.8} = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{pmatrix}$$

VI. CONCLUSION AND FUTURE SCOPE

Cluster analysis can be used as a powerful, stand-alone data mining technique that gains insight and knowledge from huge unstructured databases. Most conventional clustering methods do not satisfy the requirements of document clustering, such as handling high dimensionality and huge volumes, and providing easy access to meaningful cluster labels. In this paper, we presented a novel approach, a maximal frequent item set (MFI) based similarity measure for hierarchical document clustering, to address these issues. Dimensionality reduction can be achieved through MFI. By using the same MFI similarity measure in hierarchical document clustering, the number of levels is decreased, which makes browsing easier. Clustering has applications in many areas, including data mining, statistics, biology and machine learning, and by applying MFI based techniques to clusters we can obtain clusters of high quality. Moreover, by means of maximal frequent item sets, we can predict the most influential objects of clusters in the entire dataset for applications such as business, marketing, the World Wide Web and social network analysis.

VII. REFERENCES

[1] Rui Xu, Donald Wunsch, "A Survey of Clustering Algorithms". In the Proceedings of IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
[2] Jain, A.K., Murty, M.N., Flynn, P.J., "Data Clustering: A Review". In the Proceedings of ACM Computing Surveys, Vol. 31, No. 3, 1999, pp. 264-323.
[3] Kleinberg, J.M., "Authoritative Sources in a Hyperlinked Environment". In the Journal of the ACM, Vol. 46, No. 5, 1999, pp. 604-632.
[4] Ling Zhuang, Honghua Dai. (2004). "A Maximal Frequent Item Set Approach for Web Document Clustering". In Proceedings of the IEEE Fourth International Conference on Computer and Information Technology 2004 (CIT-2004).
[5] Trosset, Michael W. (2008). "Representing Clusters: k-Means Clustering, Self-Organizing Maps and Multidimensional Scaling". Technical Report, Department of Statistics, Indiana University, Bloomington, 2008.
[6] Michael Steinbach, George Karypis, Vipin Kumar. (2000). "A Comparison of Document Clustering Techniques". In Proceedings of the Workshop on Text Mining, 2000 (KDD-2000), Boston, pp. 109-111.
[7] Beil, F., Ester, M., Xu, X. (2002). "Frequent Term-Based Text Clustering". In Proceedings of 8th International Conference on Knowledge Discovery and Data Mining 2002 (KDD-2002), Edmonton, Alberta, Canada.
[8] Fung, Benjamin C.M., Wang, Ke, Ester, Martin. (2003). "Hierarchical Document Clustering using Frequent Item Sets". In Proceedings SIAM International Conference on Data Mining 2003 (SIAM DM-2003), pp. 59-70.
[9] Agrawal, R., Srikant, R. (1994). "Fast Algorithms for Mining Association Rules". In the Proceedings of 20th International Conference on Very Large Data Bases, 1994, Santiago, Chile, pp. 487-499.
[10] Liu, W.L., Zeng, X.S. (2005). "Document Clustering Based on Frequent Term Sets". Proceedings of Intelligent Systems and Control, 2005.
[11] Zamir, O., Etzioni, O. (1998). "Web Document Clustering: A Feasibility Demonstration". In the Proceedings of ACM, 1998 (SIGIR-98), pp. 46-54.
[12] Kjersti, (1997). "A Survey on Personalized Information Filtering Systems for the World Wide Web". Technical Report 922, Norwegian Computing Center, 1997.
[13] Prasannakumar, J., Govindarajulu, P., "Duplicate and Near Duplicate Documents Detection: A Review". European Journal of Scientific Research, ISSN 1450-216X, Vol. 32, No. 4, 2009, pp. 514-527.
[14] Syed Mudhasir, Y., Deepika, J., "Near Duplicate Detection and Elimination Based on Web Provenance for Efficient Web Search". In the Proceedings of International Journal on Internet and Distributed Computing Systems, Vol. 1, No. 1, 2011.
[15] Alsulami, B.S., Abulkhair, F., Essa, E., "Near Duplicate Document Detection Survey". In the Proceedings of International Journal of Computer Science and Communications Networks, Vol. 2, No. 2, pp. 147-151.
[16] Doug Burdick, Manuel Calimlim, Johannes Gehrke. (2001). "A Maximal Frequent Itemset Algorithm for Transactional Databases". In the Proceedings of ICDE, 17th International Conference on Data Engineering 2001 (ICDE-2001).
[17] Murali Krishna, S., Durga Bhavani, S., "An Efficient Approach for Text Clustering Based On Frequent Item Sets". European Journal of Scientific Research, ISSN 1450-216X, Vol. 42, No. 3, 2010, pp. 399-410.
[18] Lopresti, D.P. (1999). "Models and Algorithms for Duplicate Document Detection". In the Proceedings of Fifth International Conference on Document Analysis and Recognition 1999 (ICDAR-1999), 20th-22th Sep, pp. 297-300.
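As a rough illustration of the duplicate-detection step in Section V, the following Python sketch thresholds the paper's 4 x 4 Jaccard matrix at α = 0.8 and groups the documents into classes. It is only a sketch under stated assumptions, not the authors' implementation: the function names (`jaccard`, `alpha_cut`, `duplicate_classes`) are hypothetical, documents are treated as plain sets of terms for the Jaccard computation, and the classes are obtained by taking the transitive closure of the thresholded relation with a union-find structure, which is one possible way to obtain a partition from the pairs listed in the paper.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient |a ∩ b| / |a ∪ b| (taken as 1.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def alpha_cut(sim, alpha):
    """0/1 relation matrix for sim[i][j] >= alpha, with the diagonal set to zero."""
    n = len(sim)
    return [[1 if i != j and sim[i][j] >= alpha else 0 for j in range(n)]
            for i in range(n)]


def duplicate_classes(sim, alpha):
    """Group document indices by the transitive closure of the thresholded relation.

    Assumption of this sketch: chains of near duplicates are merged into one
    class using a simple union-find structure.
    """
    n = len(sim)
    parent = list(range(n))

    def find(x):
        # Follow parent links to the class representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if sim[i][j] >= alpha:
            parent[find(i)] = find(j)

    classes = {}
    for i in range(n):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())


# Jaccard similarity matrix for d1..d4, copied from Section V.
R_ALPHA = [
    [1.0, 0.4, 0.8, 0.5],
    [0.4, 1.0, 0.8, 0.4],
    [0.8, 0.8, 1.0, 0.9],
    [0.5, 0.4, 0.9, 1.0],
]

R_CUT = alpha_cut(R_ALPHA, 0.8)
CLASSES = duplicate_classes(R_ALPHA, 0.8)
```

With the matrix above, `R_CUT` reproduces the 0/1 matrix given for the 0.8 threshold in Section V, and because d3 is related to each of d1, d2 and d4, the closure places all four documents in a single class. The pairwise loop visits every document pair once, in line with the O(n²) cost stated in the paper.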