Document Categorization Using Fuzzy Clustering
Masoud Makrehchi, Maryam Shokri
Term paper of SD-622, Machine Intelligence Course
Submitted to: Prof. M. Kamel
SYDE, University of Waterloo, 5/2/2003

Table of Contents
• Document Categorization
• Pattern Recognition Systems
• Data Clustering
• Fuzzy Clustering
• Proposed Method
• References

Document Categorization
• Introduction
• Architecture of a Document Classification System

Introduction
• This presentation reports the results of an experimental study of document clustering techniques and gives a survey of the state of the art in text categorization.
• Because the Internet presents people with a vast amount of information, an appropriate method for organizing this wide variety of material is needed.
• Our goal is to organize large sets of documents using a clustering method.

Architecture of a Document Classification System
[Figure: unlabeled documents pass through preprocessing into a labeling process (clustering, or training/classification supported by a thesaurus), which builds a knowledge base and produces classified/labeled data; in recall, preprocessing feeds a decision stage that assigns a class.]

Pattern Recognition Systems
• Pattern Recognition Systems Life Cycle
• Classification or Clustering?
• Classification Methods

Pattern Recognition Systems Life Cycle
• Knowledge engineer / field engineer layer (real world): data sampling, data conditioning, feature extraction, feature selection.
• AI engineer layer (computer world): training (the learning process, with internal learning feedback), then recall (the pattern recognition system in use).
• Customer layer (real world): true data.

Classification or Clustering?
• Is there labeled data? If no: unsupervised learning, i.e., clustering.
• If yes, and the number of groups is known: supervised learning, i.e., classification.
• Between the two extremes lies semi-supervised learning.

Classification Methods
• Rocchio's algorithm
• Naive Bayes
• K-nearest neighbor
• Decision trees
• Support vector machines
• Voted classification
• Neural networks
• Fuzzy-logic-based learning
• Aggregation of multiple classifiers

Data Clustering
• Introduction
• A Clustering System
• A Taxonomy of Clustering Techniques
• Hard C-means Clustering

Introduction
• Cluster analysis partitions a collection of data points into subgroups, where the objects inside a cluster (a subgroup) show a certain degree of closeness or similarity.
• Hard clustering assigns each data point (or feature vector) to one and only one cluster: the degree of membership is either one or zero, which assumes well-defined boundaries between clusters (no overlapping).
• Clustering is an unsupervised learning approach.
• A similarity measure is required; it is generally taken as the Euclidean distance in feature space.

A Clustering System
[Figure: patterns → feature extraction/selection → interpattern similarity → grouping → clusters, with a learning feedback loop and a design feedback loop.]
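As a concrete sketch of the interpattern similarity and grouping steps above, the following Python fragment assigns toy document vectors to the nearest of two hypothetical cluster centers using Euclidean distance; the vectors and centers are invented for illustration, not taken from the paper's data.

    import math

    def euclidean(x, y):
        # Euclidean distance between two feature vectors of equal length.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    # Toy document feature vectors (hypothetical word counts).
    docs = [[3, 0, 1], [2, 0, 0], [0, 4, 5]]
    centers = [[2.5, 0.0, 0.5], [0.0, 4.0, 5.0]]

    # Grouping: each pattern joins the cluster of its nearest center.
    labels = [min(range(len(centers)), key=lambda i: euclidean(d, centers[i]))
              for d in docs]
    print(labels)  # [0, 0, 1]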
A Taxonomy of Clustering Techniques
• Clustering
  – Hierarchical: single link, complete link
  – Partitional: square error (e.g., C-means), graph theoretic, mixture resolving (e.g., expectation maximization), mode seeking
(From: A.K. Jain, M.N. Murty, P.J. Flynn)

Hard C-means Clustering
• Initialize cluster centers: choose a value of c representing the number of desired clusters.
  – Each cluster is represented by a centroid (the mean of all cluster members).
  – The number of centroids equals the number of final clusters.
• Assign each data point to the nearest cluster center; this requires a similarity or distance measure, for example the Euclidean distance.
• Evaluate the clusters (the average distance from the centers); if it is within a given limit, stop.
• Otherwise, update the cluster centers and reassign the data points.

Hard C-means Algorithm
[Figure: two iterations of hard c-means on sample points X1–X21, with cluster centers k1–k4; from Ekkasit Tiamkaew, Jirakhom Ruttanavakul.]

Centroids:

    V_i = \frac{\sum_{k=1}^{n} u_{ik} \, x_k}{\sum_{k=1}^{n} u_{ik}}

Membership:

    u_{ik} = \begin{cases} 1 & \|x_k - V_i\| \le \|x_k - V_j\| \text{ for all } j \\ 0 & \text{otherwise} \end{cases}
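The update rules above can be turned into a small working loop. This is a minimal hard c-means sketch in Python/NumPy under a few assumptions not in the slides: initial centroids are sampled from the data, ties go to the first nearest centroid, and the stopping test compares successive centroid positions.

    import numpy as np

    def hard_c_means(X, c, max_iter=100, tol=1e-6, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize: pick c distinct data points as initial centroids.
        V = X[rng.choice(len(X), size=c, replace=False)].astype(float)
        for _ in range(max_iter):
            # Membership u_ik in {0, 1}: each point goes to its nearest centroid.
            d = np.linalg.norm(X[:, None, :] - V[None, :, :], axis=2)
            labels = d.argmin(axis=1)
            # Centroid update: mean of all members of each cluster.
            V_new = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else V[i] for i in range(c)])
            if np.linalg.norm(V_new - V) < tol:  # centers have settled; stop
                break
            V = V_new
        return V, labels

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
    centers, labels = hard_c_means(X, c=2)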
Fuzzy Clustering
• Why Fuzzy Clustering?
• What Is Fuzzy Clustering?
• Types of Fuzzy Clustering
• Objective function-based fuzzy clustering algorithms
• Bottlenecks in Fuzzy C-Means
• Fuzzy Clustering in Document Categorization
• Fuzzy c-means

Why Fuzzy Clustering?
• In several applications the clusters have no clear, well-defined boundaries, for example in document categorization.
• Each example belongs to multiple clusters, to different degrees.
• If an example does not clearly fit into either of two clusters, this knowledge can be captured.

What Is Fuzzy Clustering?
• Fuzzification of hard c-means.
• The membership matrix U takes on real values between 0 and 1, and U sums to 1 across the rows:

    Hard C-Means          Fuzzy C-Means
    Example  C1   C2      Example  C1    C2
    1        0    1       1        .2    .8
    2        0    1       2        .01   .99
    3        0    1       3        .45   .55
    4        1    0       4        .9    .1

• In fuzzy clustering, the result is represented by grades of membership of every pattern to the classes.
• Unlike the binary evaluation of crisp clustering, the membership grades in fuzzy clustering are evaluated within the [0, 1] interval.
• The necessity of fuzzy clustering lies in the reality that a pattern can be assigned to different classes (categories).
• The objective function method is one of the major techniques in fuzzy clustering.

Types of Fuzzy Clustering
1. Fuzzy clustering based on a fuzzy relation.
2. Fuzzy clustering based on an objective function and a fuzzy covariance matrix.
3. Nonparametric classifiers, that is, the fuzzy generalized k-nearest neighbor rule.
4. Neuro-fuzzy clustering:
   – Self-Organizing Maps
   – Fuzzy Learning Vector Quantization
   – Fuzzy Adaptive Resonance Theory
   – Growing Neural Gas
   – Fully Self-Organizing Simplified Adaptive Resonance Theory
   – Fuzzy Competitive Learning

Objective Function-Based Fuzzy Clustering Algorithms
• Fuzzy c-means: spherical clusters of approximately the same size.
• Gustafson-Kessel: ellipsoidal clusters of approximately the same size; axis-parallel variants exist; can also be used to detect lines (to some extent).
• Gath-Geva / Gaussian mixture decomposition: ellipsoidal clusters of varying size; axis-parallel variants exist; can also be used to detect lines (to some extent).
• Fuzzy c-varieties: detection of linear manifolds (infinite lines in 2D).
• Adaptive fuzzy c-varieties: detection of line segments in 2D data.
• Fuzzy c-shells: detection of circles (no closed-form solution for the prototypes).
• Fuzzy c-spherical shells: detection of circles.
• Fuzzy c-rings: detection of circles.
• Fuzzy c-quadric shells: detection of ellipsoids.
• Fuzzy c-rectangular shells: detection of rectangles.

Bottlenecks in Fuzzy C-Means
There are three major bottlenecks in fuzzy clustering of real data:
• The number of clusters: in most cases it is not defined a priori, so we need a criterion to stop the algorithm. When the target number of clusters is known, the learning becomes semi-supervised.
• The centroid: the location and character of the centroid, which is the representative of its cluster, is not necessarily predefined. We have to make an initial estimate of the centroid locations.
• There is large variability in cluster shapes, cluster densities, and the maximum number of data points in each cluster.

Fuzzy Clustering in Document Categorization
G. Keswani, L.O. Hall
• Application: text categorization.
• Method: semi-supervised fuzzy c-means (ssFCM), combined with Naive Bayes. ssFCM estimates the class labels of the unlabeled data using the labeled data; the results of this clustering feed a Naive Bayes classifier (NBC) to classify unseen documents.
• The result was compared with a combination of NBC and Expectation-Maximization clustering.

O. Nasraoui, et al.
• Application: mining web access logs.
• Method: relational competitive fuzzy clustering, a fuzzy version of agglomerative clustering using a new non-Euclidean dissimilarity function.

R. Krishnapuram, et al.
• Application: web document clustering.
• Method: k-medoids fuzzy clustering, a relational clustering method in the same family as SAHN, CLARA, PAM, and CLARANS.
• Results: an 83.05% recognition rate is reported.

M.E.S. Mendes, L. Sacks
• Application: RFC documents.
• Method: fuzzy c-means, with a proposed similarity function instead of the Euclidean distance.
• Results: an 85% recognition rate is reported.

K.S. Leung, et al.
• Application: content-based indexing.
• Method: fuzzy competitive clustering.
• Results: a 79% recognition rate (average efficiency) is reported.

Fuzzy C-means
The fuzzy c-means algorithm is based on minimizing the following objective function:

    J_q(U, V) = \sum_{j=1}^{N} \sum_{i=1}^{K} (u_{ij})^q \, d^2(X_j, V_i)

where
• d is any inner-product metric, so d^2(X_j, V_i) is the squared distance between X_j and V_i;
• V is the set of K prototypes (centroids), and V_i is the centroid of the i-th cluster;
• U is a fuzzy K-partition of the data set, and u_{ij} is the degree of membership of X_j in the i-th cluster;
• q is any real number greater than 1;
• X_j is the j-th m-dimensional feature vector;
• K is the number of clusters, and N is the number of data points.
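To make the objective concrete, here is a short sketch (Python/NumPy) that evaluates J_q(U, V) for given data, prototypes, and memberships; the squared Euclidean distance stands in for d^2, and the toy values and q = 2 are illustrative assumptions.

    import numpy as np

    def fcm_objective(U, V, X, q=2.0):
        # U[j, i] is u_ij, the membership of X_j in cluster i.
        # d2[j, i] is d^2(X_j, V_i) with d the Euclidean metric.
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
        return ((U ** q) * d2).sum()

    X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
    V = np.array([[0.5, 0.0], [5.0, 5.0]])
    U = np.array([[0.9, 0.1], [0.8, 0.2], [0.05, 0.95]])
    print(fcm_objective(U, V, X))  # smaller J_q means a better partition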
Fuzzy C-means Algorithm
• Step 1: Choose primary random centroids (v_i) as the prototypes.
• Eliminate all centroid vectors that are too close to each other.
• Eliminate all clusters that have fewer than p vectors.
• Assign the feature vectors (X) to the nearest centroid vector.
[Figures: the four steps above illustrated on sample data points with candidate centroids Z1–Z25; from E. Tiamkaew, J. Ruttanavakul.]
• Step 3: Compute the degree of membership of all feature vectors in all clusters:

    u_{ij} = \frac{\left( \frac{1}{d^2(x_j, v_i)} \right)^{1/(q-1)}}{\sum_{k=1}^{K} \left( \frac{1}{d^2(x_j, v_k)} \right)^{1/(q-1)}}

• Step 4: Compute the new centroids:

    \hat{V}_i = \frac{\sum_{j=1}^{N} (u_{ij})^q X_j}{\sum_{j=1}^{N} (u_{ij})^q}

• Step 5: Recalculate the degrees of membership, \hat{u}_{ij}.
• Step 6: Do the termination test:

    \text{if } \max_{ij} |\hat{u}_{ij} - u_{ij}| < \varepsilon \text{, then stop; else go to Step 4 and compute new centroid vectors.}
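Steps 3 through 6 can be sketched as a compact loop. In this Python/NumPy version the initialization is simplified to random prototypes drawn from the data, the pruning of close centroids and small clusters is omitted, and q, epsilon, and the sample points are illustrative assumptions.

    import numpy as np

    def fuzzy_c_means(X, K, q=2.0, eps=1e-5, max_iter=200, seed=0):
        rng = np.random.default_rng(seed)
        # Step 1 (simplified): random prototypes chosen among the data points.
        V = X[rng.choice(len(X), size=K, replace=False)].astype(float)
        U = np.zeros((len(X), K))
        for _ in range(max_iter):
            # Step 3: memberships from inverse squared distances, exponent 1/(q-1).
            d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)
            inv = (1.0 / np.maximum(d2, 1e-12)) ** (1.0 / (q - 1.0))
            U_new = inv / inv.sum(axis=1, keepdims=True)
            # Step 4: new centroids, weighted by memberships raised to q.
            W = U_new ** q
            V = (W.T @ X) / W.sum(axis=0)[:, None]
            # Step 6: termination test on the largest membership change.
            if np.abs(U_new - U).max() < eps:
                return U_new, V
            U = U_new  # Step 5: keep the recalculated memberships and iterate
        return U, V

    X = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 4.0], [4.1, 3.9]])
    U, V = fuzzy_c_means(X, K=2)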
Proposed Method
• Problem Statement
• Modeling and Data Representation
• Feature Selection Strategy
• Clustering Approach

Problem Statement
• Our data set is a set of documents from the Reuters corpus; all documents have been parsed and processed.
• Each term has been mapped to an integer word ID, and each category tag to an integer category ID.
• Preprocessing: remove HTML (or other) tags, remove stop words, and perform word stemming.
• Dimensions: 445 primary categories, 29,108 primary terms (the feature space), and 18,551 items in the data collection.

Modeling and Data Representation
• Documents are represented by vectors of words: the collection is represented in a word-by-document matrix, which is usually sparse.
• The number of rows of the matrix corresponds to the number of words in the dictionary; each component of the matrix is the weight of word i in document k.
• The more times a word occurs in a document, the more relevant it is to the topic of that document; the more times it occurs throughout all documents in the collection, the more poorly it discriminates between documents.
• A major characteristic of text categorization problems is the high dimensionality of the feature space.

Data Collection
• doc_w_db.dat: the database of documents; holds all the documents and the categories of each document.
• category.dic: the dictionary of categories; maps between category IDs and category names.
• word.dic: the dictionary of words; maps between word IDs and words.

Document Vector Weighting
Each document is represented by a vector; each dimension of the vector is associated with a word/term, and its value is the weight of that word in the document.
• Boolean weighting
• Word frequency weighting
• tf x idf weighting
• tfc weighting
• ltc weighting
• Entropy weighting

tf x idf Weighting
This well-known approach for computing word weights assigns a weight to each word in a document in proportion to the number of occurrences of the word in the document, and in inverse proportion to the number of documents in the collection in which the word occurs at least once:

    a_{ik} = f_{ik} \times \log\left( \frac{N}{n_i} \right)

where f_{ik} is the frequency of word i in document k, N is the number of documents in the collection, and n_i is the number of documents in which word i occurs at least once.

Feature Selection Strategy
Goals:
• Remove non-informative words from documents.
• Improve categorization effectiveness.
• Reduce computational complexity.
Result: dimensionality reduction. [Figure: a simple 2D feature space.]

Why Dimension Reduction?
• Pattern recognition is, in essence, dimension reduction: from the feature space (order of n) through the sample space (order of m) to the class space (order of k), with n >> m >> k.

Dimensionality Reduction
1. Reducing the number of target categories (classes), keeping only categories to which at least 50 documents belong (<< 0.1 of the total documents). [Figure: distribution of documents in categories, before and after thresholding.]
2. Term selection (feature space reduction). In the first step, we remove the terms with no information content: terms seen in all categories, such as "Reuter" (234 terms), and terms seen in none of the categories (669 terms). That still leaves 28,205 terms, so more reduction is needed.

Feature Selection Methods
• Document frequency thresholding
• Information gain
• χ²-statistic
• Mutual information
• Term strength

Document Frequency Thresholding
• The document frequency of a word is the number of documents in which the word occurs.
• In document frequency thresholding, the document frequency of each word in the training corpus is computed, and words whose document frequency is less than a predetermined threshold are removed.
• The assumption is that rare words are either non-informative for category prediction, or not influential in global performance.

Information Gain
Information gain measures the number of bits of information obtained for category prediction by knowing the presence or absence of a word in a document. The information gain of a word w is

    IG(w) = -\sum_{j=1}^{k} P(c_j) \log P(c_j)
            + P(w) \sum_{j=1}^{k} P(c_j \mid w) \log P(c_j \mid w)
            + P(\bar{w}) \sum_{j=1}^{k} P(c_j \mid \bar{w}) \log P(c_j \mid \bar{w})

where c_1, ..., c_k denote the set of possible categories and \bar{w} denotes the absence of w. Information gain is computed for each word of the training set, and words whose information gain is less than some predetermined threshold are removed.

Clustering Approach
• Use the fuzzy c-means (FCM) clustering technique and its variants, and examine different similarity functions to find a more efficient strategy.
• Knowing the number of final clusters makes the algorithm semi-supervised.
• With labeled data, we can evaluate the clustering accuracy.
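To tie the weighting and feature-selection pieces of this section together, here is a minimal sketch of tf x idf weighting combined with document frequency thresholding; the three-document corpus and the threshold of 2 are toy assumptions, not the Reuters setup described above.

    import math
    from collections import Counter

    docs = [["fuzzy", "cluster", "cluster"],
            ["fuzzy", "bayes"],
            ["bayes", "tree", "tree"]]
    N = len(docs)

    # Document frequency n_i: the number of documents containing word i.
    df = Counter(w for d in docs for w in set(d))

    # Document frequency thresholding: drop rare words (threshold is illustrative).
    vocab = sorted(w for w, n in df.items() if n >= 2)

    # tf x idf weight a_ik = f_ik * log(N / n_i) for each kept word.
    weights = [{w: d.count(w) * math.log(N / df[w]) for w in vocab} for d in docs]
    print(weights)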
References
• K.S. Leung, I. King, and H.Y. Yue. Fuzzy Clustering Method for Content-Based Indexing.
• M.E.S. Mendes and L. Sacks. Assessment of the Performance of Fuzzy Cluster Analysis in the Classification of RFC Documents.
• Girish Keswani and Lawrence O. Hall. Text Classification with Enhanced Semi-Supervised Fuzzy Clustering.
• A. Baraldi and P. Blonda. A Survey of Fuzzy Clustering for Pattern Recognition.
• I. Gath and A.B. Geva. Unsupervised Optimal Fuzzy Clustering.
• R. Krishnapuram, A. Joshi, and L. Yi. A Fuzzy Relative of the K-Medoids Algorithm with Application to Web Document and Snippet Clustering.
• Kjersti Aas and Line Eikvil. Text Categorisation: A Survey. June 1999.