Solving the Set Covering Problem based on a new Clustering Heuristic

Nikolaos Mastrogiannis, Ioannis Giannikos, Basilis Boutsinas
Department of Business Administration, University of Patras, Greece

The Set Covering Problem

The Set Covering Problem (SCP) is the problem of covering the rows of an m-row, n-column, zero-one matrix $(a_{ij})$ by a subset of the columns at minimum cost. Defining

$x_j = \begin{cases} 1, & \text{if column } j \text{ is in the solution (with cost } c_j > 0\text{)} \\ 0, & \text{otherwise} \end{cases}$

the SCP is:

Minimize $\sum_{j=1}^{n} c_j x_j$

subject to $\sum_{j=1}^{n} a_{ij} x_j \geq 1, \; i = 1, \dots, m$, and $x_j \in \{0, 1\}, \; j = 1, \dots, n$.

Clustering Heuristics: General Description

Given the set covering problem as stated above, a clustering heuristic must specify:
1. A method for partitioning the column set into "homogeneous" clusters formed by similar columns.
2. A rule for selecting a best column for each cluster.

If the set J of all the selected columns forms a cover, then a prime cover $P \subseteq J$ is extracted from J. Otherwise, the current partition is modified and the process is repeated.

The proposed clustering heuristic is based on the general principles of:
- the k-means clustering algorithm (MacQueen, 1967)
- the k-modes clustering algorithm (Huang, 1998)
- the ELECTRE methods, and especially the ELECTRE I multicriteria method (Roy, 1968)

The Clustering Heuristic: Introduction (1)

The Clustering Heuristic consists of the following steps:
1. Select k initial centroids, one for each cluster.
2. Assign each column to the proper cluster according to:
   - the distance of the column to be clustered from the best of the centroids, for each of the rows (attributes) that describe both the column and the centroid;
   - the importance of the best of the centroids in terms of its rows' (attributes') weights;
   - the possible veto that some of the rows (attributes) might raise against the result of the assignment.
   Update the centroid of each cluster after each assignment.

The Clustering Heuristic: Introduction (2)

3. After all columns have been assigned to clusters, retest the dissimilarity of the columns against the current centroids according to Step 2. If a column is found "nearest" to a cluster other than its current one, re-assign the column to that cluster and update the centroids of both clusters.
4. Repeat Step 3 until no column has changed clusters after a full-cycle test of the whole dataset.

The Clustering Heuristic: Introduction (3)

After partitioning the set of columns into "homogeneous" clusters, the best column for each cluster is chosen according to Chvátal's selector (sketched in code below):

$f_s = \min_{j \in S} \frac{c_j}{|N_j|}$, where $N_j = \{i : a_{ij} = 1\}$ and $S \subseteq N = \{1, \dots, n\}$.

If the selected columns form a cover, we solve the SCP restricted to them as an integer programming problem in MS Excel Solver in order to extract a prime cover from the above partition. Otherwise, we modify the partition (by changing the number of clusters) and repeat the process.
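As a concrete illustration of the selector, the following Python sketch picks, within each cluster, the column with the smallest cost per covered row. It is a minimal sketch of the rule as stated above, not the authors' implementation; the function names and the toy data are ours.

```python
# Minimal sketch of Chvatal's selector as used above: within each cluster,
# pick the column j minimizing c_j / |N_j|, where N_j is the set of rows
# that column j covers. Names and data are illustrative only.

from typing import List, Optional

def chvatal_selector(cluster: List[int], costs: List[float],
                     a: List[List[int]]) -> Optional[int]:
    """Return the column of `cluster` with the smallest cost per covered row."""
    best_j, best_ratio = None, float("inf")
    for j in cluster:
        n_j = sum(row[j] for row in a)   # |N_j|: number of rows covered by j
        if n_j == 0:                     # a column covering nothing never helps
            continue
        if costs[j] / n_j < best_ratio:
            best_j, best_ratio = j, costs[j] / n_j
    return best_j

def selected_columns(clusters: List[List[int]], costs: List[float],
                     a: List[List[int]]) -> List[Optional[int]]:
    """One best column per cluster; the union J of these columns is then
    tested for feasibility before a prime cover is extracted."""
    return [chvatal_selector(s, costs, a) for s in clusters]

# Toy 3 x 4 instance with two clusters {0, 1} and {2, 3}.
a = [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 1, 0, 1]]
costs = [2.0, 3.0, 1.0, 1.0]
print(selected_columns([[0, 1], [2, 3]], costs, a))  # -> [0, 2]
```

In this toy instance the selected set {0, 2} leaves the third row uncovered, so it is not a cover; the partition would then be modified and the process repeated, exactly as described above.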
Description of the Clustering Heuristic (1)

Step 1: The initial centroids $K_t$, $t = 1, \dots, k$, are selected among the columns to be clustered.

Step 2.1: For every row (attribute) $l = 1, \dots, m$ common to column $y_j$, $j = 1, \dots, n$, and each of the centroids $K_t$, $t = 1, \dots, k$, we calculate the distance (sketched in code further below)

$D(K_{tl}, y_{jl}) = \delta(K_{tl}, y_{jl}) \cdot \frac{m_{y_{jl}} + m_{K_{tl}}}{m_{y_{jl}} \cdot m_{K_{tl}}}$, where $\delta(K_{tl}, y_{jl}) = \begin{cases} 1, & \text{if } K_{tl} \neq y_{jl} \\ 0, & \text{if } K_{tl} = y_{jl} \end{cases}$

and $m_{K_{tl}}$, $m_{y_{jl}}$ are the relative frequencies of the values $K_{tl}$ and $y_{jl}$.

If $D(K_{tl}, y_{jl}) \leq q_l$, where $q_l$ is an indifference threshold, then row (attribute) l belongs to the concordance coalition, that is, it supports the proposition "column $y_j$ is similar to centroid $K_t$ on row (attribute) l". Otherwise, row (attribute) l belongs to the discordance coalition.

Observation: The indifference threshold $q_l$ is a variable automatically valued according to the number of distinct non-zero distances $D(K_{tl}, y_{jl})$.

Description of the Clustering Heuristic (2)

Step 2.2: In order to choose the best of the centroids $K_t$, $t = 1, \dots, k$, and then assign the column to be clustered to its cluster, we calculate the concordance index ($CI_t$, $t = 1, \dots, k$) and threshold ($CT_t$, $t = 1, \dots, k$) as follows:

$CI_t = \frac{\sum_c w_c}{\sum_f w_f} + bonus \cdot \frac{\sum_p w_p}{\sum_f w_f}$, $\qquad CT_t = m_1 + bonus \cdot \frac{\sum_p w_p}{\sum_f w_f}$,

where $w_c$, $w_f$, $w_p$ are the weights of the rows (attributes) in the concordance coalition, of all rows, and of the rows with zero distance, respectively, and bonus is an automatically valued parameter, set to reinforce those clusters ($t = 1, \dots, k$) that contain as many zero distances as possible over the rows.

We sort the concordance indices ($CI_t$, $t = 1, \dots, k$) and their corresponding thresholds ($CT_t$, $t = 1, \dots, k$) in descending order over the clusters.

Description of the Clustering Heuristic (3)

If $CI_b(K_t, y_j) \geq CT_b(K_t, y_j)$, where $b = 1, \dots, k$ denotes the positions of the k centroids (and of the corresponding indices and thresholds) in the descending order stated above, then column j is assigned to cluster b. Otherwise, the process is repeated for the next position until the column is clustered.

Observation 1: Parameter $m_1 \in [0.6, 1]$ is the only parameter defined by the user. If $m_1 = 0.7$, this means that, for each column, the best of the centroids incorporates 70% of the strength of the weights of the attributes that belong to both the concordance and the discordance coalitions.

Observation 2: The weighting of the rows $l = 1, \dots, m$ in the proposed algorithm is based on the density of 1's in matrix A. Thus:

$w_l = \frac{|Z_l|}{Z_{total}}$, where $Z_l = \{j : a_{lj} = 1\}$ and $Z_{total} = \sum_{l=1}^{m} |Z_l|$.

Description of the Clustering Heuristic (4)

Step 2.3: This step confirms or rejects the allocation of a column to a cluster. If j is the column to be clustered, t is the cluster assigned to the column according to Step 2.2, and l ranges over the rows that belong to the discordance coalition, then:

If $D(K_{tl}, x_{jl}) = \delta(K_{tl}, x_{jl}) \cdot \frac{m_{x_{jl}} + m_{K_{tl}}}{m_{x_{jl}} \cdot m_{K_{tl}}} < U_l$ for every row l that belongs to the discordance coalition (that is, every row with $D(K_{tl}, x_{jl}) > q_l$), then the clustering is confirmed, the centroid of the cluster is updated, and we proceed to the next column. Otherwise, we return to Step 2.2.

Observation: Parameter $U_l$ is called the veto threshold; it is automatically valued and, for every row l, the veto threshold exceeds the indifference threshold: $U_l > q_l$.

Description of the Clustering Heuristic (5)

Step 3: We retest the dissimilarity of the columns against the current centroids according to Step 2, re-assign every column that needs it to the proper cluster, and update the centroids of both clusters.

Step 4: We repeat Step 3 until no column has changed clusters after a full-cycle test.

Finding a new centroid for a cluster: Column $x_j$ is a centroid for the zero-one matrix $(a_{ij})$ if it minimizes

$DIS(x_j, a_{ij}) = \sum_{i=1}^{m} D(a_{ij}, x_j)$.

If $n_{c_{k,l}}$ is the number of columns with a distinct value $c_{k,l}$ on row l, and $fr(A_l = c_{k,l} \mid a_{ij}) = \frac{n_{c_{k,l}}}{n}$ is the relative frequency of appearance of $c_{k,l}$ in the set of columns, then DIS is minimized iff $fr(A_l = g_l \mid a_{ij}) \geq fr(A_l = c_{k,l} \mid a_{ij})$ for $g_l \neq c_{k,l}$, for every $l = 1, \dots, m$.
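To make the Step 2.1 distance and the centroid update concrete, here is a minimal Python sketch under the formulas as reconstructed above. The helper names (relative_frequencies, frequency_distance, update_centroid) and the guard for unseen values are ours, not part of the original method.

```python
# Minimal sketch of the Step 2.1 distance and of the mode-style centroid
# update, assuming the formulas as reconstructed above. Helper names and
# the guard for unseen values are illustrative, not the authors' code.

from collections import Counter
from typing import List

def relative_frequencies(columns: List[List[int]], l: int) -> Counter:
    """Relative frequency m_v of each value v appearing on row l."""
    counts = Counter(col[l] for col in columns)
    n = len(columns)
    return Counter({v: c / n for v, c in counts.items()})

def frequency_distance(k_l: int, y_l: int, freqs: Counter) -> float:
    """D(K_tl, y_jl): zero on matching values, frequency-weighted otherwise."""
    if k_l == y_l:                        # delta = 0: identical values
        return 0.0
    m_k, m_y = freqs[k_l], freqs[y_l]     # relative frequencies of both values
    if m_k == 0 or m_y == 0:              # unseen value: maximally dissimilar
        return float("inf")               # (a guard for this sketch only)
    return (m_y + m_k) / (m_y * m_k)      # reconstructed frequency factor

def update_centroid(columns: List[List[int]], m: int) -> List[int]:
    """New centroid: on each row l, take the most frequent value g_l, which
    minimizes DIS per the frequency condition stated above."""
    return [Counter(col[l] for col in columns).most_common(1)[0][0]
            for l in range(m)]
```

A full Step 2 would then aggregate these per-row distances into concordance and discordance coalitions via the thresholds $q_l$, compare each $CI_t$ against $CT_t$, and apply the veto thresholds $U_l$ before confirming an assignment.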
Computational Experimentation

The algorithm was tested using datasets from the OR-Library of J.E. Beasley. The datasets include problems of size 50 × 500 and 200 × 1000. In 80% of the tested datasets, the optimal solution was found using $k \in [40, 45]$ in the 50 × 500 problems and $k \in [55, 65]$ in the 200 × 1000 problems. The final SCP was solved using Premium Solver V.8 and, in particular, the XPRESS-MP Solver Engine.

Conclusions

The new clustering heuristic:
1. combines three different scientific fields (set covering, data mining, multicriteria analysis);
2. takes into consideration the weight of each row (attribute) in the clustering process, rather than considering every row equally weighted;
3. calculates these weights according to the density of 1's in the set;
4. analyzes the dataset in detail through pairwise comparisons of each column with each centroid on each row;
5. takes into consideration the possible objections of a minority of the rows (attributes) to the clustering result;
6. yields covering results as well as processing times that are very promising.

Bibliography

Beasley, J.E. (1987), An algorithm for set covering problem, European Journal of Operational Research, 31, pp. 85-93.
Huang, Z. (1998), Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283-304.
Huang, Z., Ng, M.K., Rong, H. & Li, Z. (2005), Automated variable weighting in k-means type clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668.
MacQueen, J.B. (1967), Some methods for classification and analysis of multivariate observations, in Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.
Roy, B. (1968), Classement et choix en présence de points de vue multiples : la méthode ELECTRE, R.I.R.O., 8, pp. 57-75.
Roy, B. (1991), The outranking approach and the foundations of ELECTRE methods, Theory and Decision, 31, pp. 49-73.