Solving the Set Covering Problem
based on a new Clustering Heuristic

Nikolaos Mastrogiannis
Ioannis Giannikos
Basilis Boutsinas

Department of Business Administration, University of Patras, Greece
The Set Covering Problem

-  The Set Covering Problem (SCP) is the problem of covering the rows of an
   m-row, n-column, zero-one matrix $(a_{ij})$ by a subset of the columns at
   minimum cost. Defining

   $x_j = \begin{cases} 1, & \text{if column } j \text{ is in the solution (with cost } c_j > 0\text{)} \\ 0, & \text{otherwise} \end{cases}$

   the SCP is:

   Minimize $\sum_{j=1}^{n} c_j x_j$

   subject to

   $\sum_{j=1}^{n} a_{ij} x_j \ge 1, \quad i = 1, \dots, m$

   $x_j \in \{0, 1\}, \quad j = 1, \dots, n$
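A minimal sketch of this formulation, modeled here with the open-source PuLP library rather than the solver mentioned later in the slides; the tiny cost vector and coverage matrix below are illustrative assumptions only.

    import pulp

    c = [3, 2, 4, 1]                    # column costs c_j (illustrative)
    a = [[1, 0, 1, 0],                  # a_ij = 1 if column j covers row i (illustrative)
         [0, 1, 0, 1],
         [1, 1, 0, 0]]
    m, n = len(a), len(c)

    prob = pulp.LpProblem("SCP", pulp.LpMinimize)
    x = [pulp.LpVariable(f"x{j}", cat="Binary") for j in range(n)]   # x_j in {0, 1}

    prob += pulp.lpSum(c[j] * x[j] for j in range(n))                # minimize total cost
    for i in range(m):                                               # cover every row at least once
        prob += pulp.lpSum(a[i][j] * x[j] for j in range(n)) >= 1

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    print([int(v.value()) for v in x])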
Clustering Heuristics: General Description

-  Given the set covering problem as stated above, a clustering heuristic must
   specify:
1. A method for partitioning the column set into “homogeneous” clusters formed
   by similar columns.
2. A rule for selecting a best column for each cluster.

-  If the set J of all the selected columns forms a cover, then a prime cover
   P_J is extracted from J. Otherwise, the current partition is modified and
   the process is repeated (a high-level sketch of this outer loop is given
   below).

-  The proposed Clustering Heuristic is based on the general principles of:
-  the k-means clustering algorithm (MacQueen 1967)
-  the k-modes clustering algorithm (Huang 1998)
-  the ELECTRE methods, and especially the ELECTRE I multicriteria method
   (Roy, 1968)
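A high-level sketch of the outer loop described above. The helper names cluster_columns, best_column_per_cluster, is_cover and extract_prime_cover are hypothetical placeholders for the steps detailed in the following slides, and increasing k is only one possible way of modifying the partition.

    def clustering_heuristic(a, c, k):
        while True:
            clusters = cluster_columns(a, c, k)        # 1. partition the columns into k clusters
            J = [best_column_per_cluster(a, c, cl)     # 2. pick one representative column
                 for cl in clusters]                   #    per cluster
            if is_cover(a, J):                         # do the selected columns cover every row?
                return extract_prime_cover(a, c, J)    # yes: reduce J to a prime cover
            k += 1                                     # no: modify the partition and repeat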
The Clustering Heuristic: Introduction (1)

-  The Clustering Heuristic consists of the following steps:
1. Select k initial centroids, one for each cluster.
2. Assign each column to the proper cluster according to:
   - the distance of the column to be clustered from the best of the
     centroids, for each of the rows (attributes) that describe both the
     column and the centroid
   - the importance of the best of the centroids in terms of its rows’
     (attributes’) weights
   - the possible veto that some of the rows (attributes) might impose on the
     outcome of the assignment process
   Update the centroid of each cluster after each assignment.
The Clustering Heuristic: Introduction (2)

3. After all columns have been assigned to clusters, retest the dissimilarity
   of the columns against the current centroids according to Step 2. If a
   column is found to be “nearest” to another cluster rather than to its
   current one, re-assign the column to that cluster and update the centroids
   of both clusters.
4. Repeat Step 3 until no column has changed clusters after a full cycle test
   of the whole dataset.
The Clustering Heuristic: Introduction (3)

-  After partitioning the set of columns into “homogeneous” clusters, the best
   column for each cluster is chosen according to Chvatal’s selector (a sketch
   follows at the end of this slide):

   $f(S) = \min_{j \in S} \frac{c_j}{|N_j|}$, where $N_j = \{i : a_{ij} = 1\}$ and $S \subseteq N = \{1, \dots, n\}$
-  When the best column for each cluster has been identified, we solve the SCP
   as an integer programming problem in MS Excel Solver in order to extract a
   prime cover from the above partition. Otherwise, we modify the partition
   (by changing the number of clusters) and repeat the process.
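A minimal sketch of this selection rule, assuming a is the zero-one matrix as a list of rows, c the cost vector and S the list of column indices in one cluster (names are illustrative):

    def best_column_per_cluster(a, c, S):
        def ratio(j):
            Nj = sum(a[i][j] for i in range(len(a)))   # |N_j| = number of rows covered by column j
            return c[j] / Nj if Nj else float("inf")   # a column covering nothing is never chosen
        return min(S, key=ratio)                       # column minimizing c_j / |N_j|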
Description of the Clustering Heuristic (1)

-  Step 1: The initial centroids $K_t$, $t = 1, \dots, k$, are selected from among the
   columns to be clustered.
-  Step 2.1: For every row (attribute) $l = 1, \dots, m$ common between column $y_j$,
   $j = 1, \dots, n$, and each of the centroids $K_t$, $t = 1, \dots, k$, we calculate the
   distance (sketched at the end of this slide)

   $D(K_{tl}, y_{jl}) = \frac{m_{y_{jl}} - m_{K_{tl}}}{m_{y_{jl}} + m_{K_{tl}}} \, \delta(K_{tl}, y_{jl})$, where $\delta(K_{tl}, y_{jl}) = \begin{cases} 1, & \text{if } K_{tl} \neq y_{jl} \\ 0, & \text{if } K_{tl} = y_{jl} \end{cases}$

   and $m_{K_{tl}}$, $m_{y_{jl}}$ are the relative frequencies of the values $K_{tl}$ and $y_{jl}$.
-  If $D(K_{tl}, y_{jl}) \le q_l$, where $q_l$ is an indifference threshold, then row
   (attribute) l belongs to the concordance coalition, that is, it supports the
   proposition “column $y_j$ is similar to centroid $K_t$ on row (attribute) l”.
   Otherwise, row (attribute) l belongs to the discordance coalition.
-  Observation: The indifference threshold $q_l$ is a variable valued
   automatically according to the number of distinct non-zero distances
   $D(K_{tl}, y_{jl})$.
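A minimal sketch of Step 2.1, assuming the relative value frequencies per row have been precomputed; freq[l][v] stands for the relative frequency of value v on row l and q[l] for the indifference threshold of row l (illustrative names):

    def row_distance(K_tl, y_jl, m_K, m_y):
        if K_tl == y_jl:                               # identical values: zero distance
            return 0.0
        return (m_y - m_K) / (m_y + m_K)               # frequency-based distance defined above

    def split_coalitions(K_t, y_j, freq, q):
        concord, discord = [], []
        for l, (K_tl, y_jl) in enumerate(zip(K_t, y_j)):
            d = row_distance(K_tl, y_jl, freq[l][K_tl], freq[l][y_jl])
            (concord if d <= q[l] else discord).append(l)   # D <= q_l supports similarity
        return concord, discord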
Description of the Clustering Heuristic (2)

-  Step 2.2: In order to choose the best of the centroids $K_t$, $t = 1, \dots, k$, and
   then assign the column to be clustered to its cluster, we calculate the
   concordance index $CI_t$ and the concordance threshold $CT_t$, $t = 1, \dots, k$, as
   follows:

   $CI_t = \frac{\sum_{c} w_c}{\sum_{f} w_f} + bonus \cdot \frac{\sum_{p} w_p}{\sum_{f} w_f}$,   $CT_t = m_1 + bonus \cdot \frac{\sum_{p} w_p}{\sum_{f} w_f}$

-  where $w_c$, $w_f$, $w_p$ are weights of rows (attributes) and bonus is an
   automatically valued parameter, set to reinforce those clusters
   ($t = 1, \dots, k$) that contain as many zero distances as possible in every row.
-  We sort the concordance indices $CI_t$ and their corresponding thresholds
   $CT_t$, $t = 1, \dots, k$, in descending order.
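A minimal sketch of Step 2.2 for one candidate centroid. The index sets are assumptions read from the description above: C_t collects the rows of the concordance coalition, P_t is interpreted here as the rows with zero distance (the ones rewarded by the bonus), and the f-summation is taken over all rows; w, bonus and m1 are as defined on this and the next slide.

    def concordance(C_t, P_t, w, bonus, m1):
        total = sum(w)                                 # f-summation: all row weights
        CI_t = sum(w[l] for l in C_t) / total + bonus * sum(w[l] for l in P_t) / total
        CT_t = m1 + bonus * sum(w[l] for l in P_t) / total
        return CI_t, CT_t

The candidate centroids are then examined in descending order of CI_t; by the rule on the next slide, the column goes to the first centroid whose index reaches its threshold.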
Description of the Clustering Heuristic (3)

-  If $CI_b(K_i, y_j) \ge CT_b(K_i, y_j)$, where $b = 1, \dots, k$ denotes the positions of the
   k centroids and of the corresponding indices and thresholds in the
   descending order stated above, then column j is assigned to cluster b.
   Otherwise, the process is repeated until the column is clustered.
-  Observation 1: The parameter $m_1 \in [0.6, 1]$ is the only parameter defined
   by the user. If $m_1$ is 0.7, this means that, for each column, the best of
   the centroids incorporates 70% of the strength of the weights of the
   attributes that belong to both the concordance and the discordance
   coalition.
-  Observation 2: The weighting of the rows $l = 1, \dots, m$ in the proposed
   algorithm is based on the density of 1’s in matrix A. Thus:

   $w_l = \frac{Z_l}{Z_{total}}$, where $Z_l = |\{j : a_{lj} = 1\}|$ and $Z_{total} = \sum_{l=1}^{m} Z_l$
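A minimal sketch of this weighting rule, with the zero-one matrix a given as a list of rows:

    def row_weights(a):
        Z = [sum(row) for row in a]                    # Z_l = number of 1's in row l
        Z_total = sum(Z)
        return [Z_l / Z_total for Z_l in Z]            # w_l = Z_l / Z_total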
Description of the Clustering Heuristic (4)

-  Step 2.3: This step confirms or rejects the allocation of a column to a
   cluster (see the sketch at the end of this slide).
-  If j is the column to be clustered, t is the cluster assigned to the column
   according to Step 2.2, and l ranges over the rows that belong to the
   discordance coalition, then:

-  If $D(K_{tl}, y_{jl}) = \frac{m_{y_{jl}} - m_{K_{tl}}}{m_{y_{jl}} + m_{K_{tl}}} \, \delta(K_{tl}, y_{jl}) \le U_l$ for every row l that belongs to

   the discordance coalition, that is, for every row l with $D(K_{tl}, y_{jl}) > q_l$,
   then the clustering is confirmed, the centroid of the cluster is updated and
   we proceed to the next column. Otherwise, we return to Step 2.2.

-  Observation: The parameter $U_l$ is called the veto threshold; it is valued
   automatically and, for every row l, the veto threshold $U_l$ is greater than
   the indifference threshold $q_l$.
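A minimal sketch of this confirmation test, assuming distances[l] holds the Step 2.1 distance of the column from the chosen centroid on row l, discord lists the rows of the discordance coalition and U the per-row veto thresholds (illustrative names):

    def passes_veto(discord, distances, U):
        return all(distances[l] <= U[l] for l in discord)   # any D > U_l vetoes the assignment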
Description of the Clustering Heuristic (5)

-  Step 3: We retest the dissimilarity of the columns against the current
   centroids according to Step 2, re-assign any column that requires it to the
   proper cluster, and update the centroids of both clusters involved.
-  Step 4: We repeat Step 3 until no column has changed clusters after a full
   cycle test.
-  Finding a new centroid for cluster k:
   Column $x_j$ is a centroid for the zero-one matrix $(a_{ij})$ if it minimizes

   $DIS(x_j, a_{ij}) = \sum_{i=1}^{m} D(a_{ij}, x_j)$

-  If $n_{c_{k,l}}$ is the number of columns with a discrete value $c_{k,l}$ on row l, and

   $fr(A_l = c_{k,l} \mid a_{ij}) = \frac{n_{c_{k,l}}}{n}$

   is the relative frequency of appearance of $c_{k,l}$ in the set of columns, then
   DIS is minimized iff $fr(A_l = g_l \mid a_{ij}) \le fr(A_l = c_{k,l} \mid a_{ij})$ for every
   $g_l \neq c_{k,l}$ and every $l = 1, \dots, m$.
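This is the usual k-modes mode-update rule: taking, row by row, the most frequent value among the cluster’s columns minimizes DIS. A minimal sketch, with a the zero-one matrix as a list of rows and cluster_cols the indices of the columns currently in the cluster (illustrative names):

    from collections import Counter

    def update_centroid(a, cluster_cols):
        centroid = []
        for row in a:                                      # one pass per row (attribute) l
            values = [row[j] for j in cluster_cols]        # values of the cluster's columns on row l
            centroid.append(Counter(values).most_common(1)[0][0])   # most frequent value
        return centroid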
Computational Experimentation

-  The algorithm was tested using datasets from the OR Library of
   J.E. Beasley.
-  The datasets include problems of size 50 x 500 and 200 x 1000.
-  In 80% of the tested datasets, the optimal solution was found using
   $k \in [40, 45]$ for the 50 x 500 problems and $k \in [55, 65]$ for the 200 x 1000
   problems.
-  The final SCP was solved using Premium Solver V8, in particular the
   XPRESS-MP Solver Engine.
Conclusions

-  The new clustering heuristic:
1. combines three different scientific fields (set covering, data mining,
   multicriteria analysis).
2. takes into consideration the weight of each row (attribute) in the
   clustering process, rather than treating every row as equally weighted.
3. calculates these weights according to the density of 1’s in the dataset.
4. analyzes the dataset in detail through the pairwise comparisons of each
   column with each centroid for each row.
5. takes into consideration the possible objections of a minority of the rows
   (attributes) to the clustering decision.
6. yields covering results as well as processing times which are very
   promising.
Bibliography

-  Beasley, J.E. (1987), An algorithm for set covering problem, European
   Journal of Operational Research, 31, pp. 85-93.
-  Huang, Z. (1998), Extensions to the k-means algorithm for clustering large
   data sets with categorical values, Data Mining and Knowledge Discovery,
   vol. 2, no. 3, pp. 283-304.
-  Huang, Z., Ng, M.K., Rong, H. & Li, Z. (2005), Automated variable weighting
   in k-means type clustering, IEEE Transactions on Pattern Analysis and
   Machine Intelligence, vol. 27, no. 5, pp. 657-668.
-  MacQueen, J.B. (1967), Some methods for classification and analysis of
   multivariate observations, In Proceedings of the 5th Berkeley Symposium on
   Mathematical Statistics and Probability, pp. 281-297.
-  Roy, B. (1968), Classement et choix en présence de points de vue multiples:
   La méthode ELECTRE, R.I.R.O., 8, pp. 57-75.
-  Roy, B. (1991), The outranking approach and the foundations of ELECTRE
   methods, Theory and Decision, 31, pp. 49-73.