# Solving the Set Covering Problem based on a new Clustering Heuristic

Nikolaos Mastrogiannis
Ioannis Giannikos
Basilis Boutsinas
University of Patras, Greece
## The Set Covering Problem
The Set Covering Problem (SCP) is the problem of covering the rows of an m-row, n-column, zero-one matrix $(a_{ij})$ by a subset of the columns at minimum cost. Defining:

$$x_j = \begin{cases} 1, & \text{if column } j \text{ is in the solution (with cost } c_j > 0\text{)} \\ 0, & \text{otherwise} \end{cases}$$

the SCP is:

$$\min \sum_{j=1}^{n} c_j x_j \quad \text{subject to} \quad \sum_{j=1}^{n} a_{ij} x_j \ge 1, \; i = 1, \dots, m, \qquad x_j \in \{0,1\}, \; j = 1, \dots, n$$
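As a minimal illustration of this formulation, the Python sketch below builds a hypothetical toy instance (the matrix, costs, and candidate solution are invented for illustration) and checks whether a 0/1 column-selection vector is a cover and what it costs.

```python
import numpy as np

# Hypothetical toy instance: a zero-one matrix A (m = 3 rows, n = 4 columns)
# and a positive cost vector c; both are invented for illustration.
A = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 1]])   # a_ij = 1 iff column j covers row i
c = np.array([3, 2, 4, 1])     # c_j > 0 is the cost of column j

def is_cover(x):
    """A 0/1 vector x over the columns is a cover iff every row is hit at least once."""
    return bool(np.all(A @ x >= 1))

def cost(x):
    return int(c @ x)

x = np.array([0, 1, 1, 0])     # candidate: select the second and third columns
print(is_cover(x), cost(x))    # True 6
```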
## Clustering Heuristics: General Description

Given the set covering problem as stated above, a clustering heuristic must specify:

1. A method for partitioning the column-set into "homogeneous" clusters formed by similar columns.
2. A rule for selecting a best column from each cluster.

If the set J of all the selected columns forms a cover, then a prime cover PJ is extracted from J. Otherwise, the current partition is modified and the process is repeated.

The proposed Clustering Heuristic is based on the general principles of:

- the k-means clustering algorithm (MacQueen 1967)
- the k-modes clustering algorithm (Huang 1998)
- the ELECTRE methods, especially the ELECTRE I multicriteria method (Roy 1968)
## The Clustering Heuristic: Introduction (1)

The Clustering Heuristic consists of the following steps:

1. Select k initial centroids, one for each cluster.
2. Assign each column to the proper cluster according to:
   - the distance of the column to be clustered from the best of the centroids, for each of the rows (attributes) that describe both the column and the centroid;
   - the importance of the best of the centroids in terms of its rows' (attributes') weights;
   - the possible veto that some of the rows (attributes) might raise against the result of the assignment process.

   Update the centroid of each cluster after each assignment.
## The Clustering Heuristic: Introduction (2)

3. After all columns have been assigned to clusters, retest the dissimilarity of the columns against the current centroids according to Step 2. If a column is found "nearest" to another cluster rather than to its current one, re-assign the column to that cluster and update the centroids of both clusters.
4. Repeat Step 3 until no column has changed clusters after a full-cycle test of the whole dataset.
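A rough skeleton of this assign-and-retest loop might look as follows. The three callables (`init_centroids`, `assign_column`, `new_centroid`) are hypothetical stand-ins for the Step 2 rules and the centroid update described later; none of these names come from the slides.

```python
def cluster_columns(A, k, init_centroids, assign_column, new_centroid):
    """Steps 2-4 as a loop: assign every column, then keep re-testing until a
    full pass over the dataset changes nothing (a sketch, not the paper's code)."""
    centroids = init_centroids(A, k)
    members = [assign_column(A, j, centroids) for j in range(A.shape[1])]
    changed = True
    while changed:                        # Step 4: stop after a clean full cycle
        changed = False
        for j in range(A.shape[1]):       # Step 3: re-test each column
            t = assign_column(A, j, centroids)
            if t != members[j]:
                old, members[j] = members[j], t
                for g in (old, t):        # update the centroids of both clusters
                    cols = [i for i, m in enumerate(members) if m == g]
                    centroids[g] = new_centroid(A, cols)
                changed = True
    return members, centroids
```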
## The Clustering Heuristic: Introduction (3)

After partitioning the set of columns into "homogeneous" clusters, the best column for each cluster is chosen according to Chvátal's selector:

$$f_s = \min_{j \in S} \frac{c_j}{|N_j|}, \quad \text{where } N_j = \{ i : a_{ij} = 1 \} \text{ and } S \subseteq N = \{1, \dots, n\}$$

When the best column of each cluster has been identified, we solve the resulting SCP as an integer programming problem in MS Excel Solver in order to extract a prime cover from the above partition. If this fails to produce a cover, we modify the partition (by changing the number of clusters) and repeat the process.
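A minimal sketch of the selector, reusing the toy `A` and `c` from the earlier snippet (the two-cluster partition below is invented): for each cluster S it returns the column minimizing $c_j / |N_j|$.

```python
import numpy as np

A = np.array([[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]])
c = np.array([3, 2, 4, 1])

def chvatal_best(S):
    """Chvatal's selector over a cluster S: argmin of c_j / |N_j|,
    where |N_j| is the number of rows covered by column j."""
    ratios = {j: c[j] / A[:, j].sum() for j in S if A[:, j].sum() > 0}
    return min(ratios, key=ratios.get)

clusters = [[0, 3], [1, 2]]                  # a hypothetical 2-cluster partition
print([chvatal_best(S) for S in clusters])   # [3, 1]
```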
## Description of the Clustering Heuristic (1)

Step 1: The initial centroids $K_t$, t = 1,…,k are selected among the columns to be clustered.

Step 2.1: For every row (attribute) l = 1,…,m common between column $y_j$, j = 1,…,n and each of the centroids $K_t$, t = 1,…,k, we calculate the distance

$$D(K_{tl}, y_{jl}) = \frac{m_{y_{jl}} - m_{K_{tl}}}{m_{y_{jl}} + m_{K_{tl}}} \, \delta(K_{tl}, y_{jl}), \quad \text{where } \delta(K_{tl}, y_{jl}) = \begin{cases} 1, & \text{if } K_{tl} \neq y_{jl} \\ 0, & \text{if } K_{tl} = y_{jl} \end{cases}$$

and $m_{K_{tl}}$, $m_{y_{jl}}$ are the relative frequencies of the values $K_{tl}$ and $y_{jl}$.

If $D(K_{tl}, y_{jl}) \le q_l$, where $q_l$ is an indifference threshold, then row (attribute) l belongs to the concordance coalition, that is, it supports the proposition "column $y_j$ is similar to centroid $K_t$ on row (attribute) l". Otherwise, row (attribute) l belongs to the discordance coalition.

Observation: The indifference threshold $q_l$ is a variable valued automatically according to the number of the discrete non-zero distances $D(K_{tl}, y_{jl})$.
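The sketch below implements the Step 2.1 distance as reconstructed above, under the assumption (not spelled out in the slides) that $m_v$ is the relative frequency of value v within row l across all columns; the split into coalitions then follows directly.

```python
import numpy as np

def rel_freq(A, l, v):
    """Relative frequency of value v on row l, taken over all columns
    (our assumption for m_v; the slides do not spell this out)."""
    return float(np.mean(A[l, :] == v))

def distance(A, l, K_tl, y_jl):
    """D(K_tl, y_jl): zero when the values match, otherwise a
    frequency-normalized difference, as in the reconstructed formula."""
    if K_tl == y_jl:
        return 0.0
    m_y, m_K = rel_freq(A, l, y_jl), rel_freq(A, l, K_tl)
    return (m_y - m_K) / (m_y + m_K)

def coalitions(A, centroid, column, q):
    """Concordance coalition: rows with D <= q_l; the rest are discordant.
    q is a vector of indifference thresholds, one per row."""
    conc, disc = [], []
    for l in range(A.shape[0]):
        d = distance(A, l, centroid[l], column[l])
        (conc if d <= q[l] else disc).append(l)
    return conc, disc
```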
## Description of the Clustering Heuristic (2)

Step 2.2: In order to choose the best of the centroids $K_t$, t = 1,…,k and then assign the column to be clustered to its cluster, we calculate the concordance index ($CI_t$, t = 1,…,k) and threshold ($CT_t$, t = 1,…,k) as follows:

$$CI_t = \frac{\sum_{c} w_c}{\sum_{f} w_f} + bonus \cdot \frac{\sum_{p} w_p}{\sum_{f} w_f}, \qquad CT_t = m_1 + bonus \cdot \frac{\sum_{p} w_p}{\sum_{f} w_f}$$

where $w_c$, $w_f$, $w_p$ are weights of rows (attributes) and bonus is a parameter valued automatically, set to favour those clusters (1,…,k) that contain as many zero distances as possible in every row.

We sort the concordance indices ($CI_t$, t = 1,…,k) and their corresponding thresholds ($CT_t$, t = 1,…,k) in descending order.
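The sketch below encodes one plausible reading of these formulas (our interpretation, since the slide text is garbled): c ranges over the concordance coalition, p over the rows with zero distance, and f over all rows. `best_cluster` then applies the assignment rule stated on the next slide, scanning the centroids in descending order of CI.

```python
import numpy as np

def ci_ct(w, conc, zero_rows, bonus, m1):
    """Concordance index and threshold for one centroid:
    CI_t = sum(w over concordance rows)/sum(w) + bonus * sum(w over zero-distance rows)/sum(w)
    CT_t = m1 + bonus * sum(w over zero-distance rows)/sum(w)
    (our reading of the slide; w is the vector of row weights)."""
    total = w.sum()
    p = w[zero_rows].sum() / total if zero_rows else 0.0
    ci = w[conc].sum() / total + bonus * p
    ct = m1 + bonus * p
    return ci, ct

def best_cluster(cis_cts):
    """Assignment rule: in descending CI order, pick the first cluster b
    with CI_b >= CT_b. cis_cts is a list of (CI_t, CT_t, t) triples."""
    for ci, ct, t in sorted(cis_cts, reverse=True):
        if ci >= ct:
            return t
    return None
```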
## Description of the Clustering Heuristic (3)

If $CI_b(K_t, y_j) \ge CT_b(K_t, y_j)$, where b = 1,…,k denotes the k positions of the k centroids and the corresponding indices and thresholds in the descending order stated above, then column j is assigned to cluster b. Otherwise, the process is repeated until the column is clustered.

Observation 1: Parameter $m_1 \in [0.6, 1]$ is the only parameter defined by the user. If $m_1$ is 0.7, this means that for each column, the best of the centroids incorporates 70% of the strength of the weights of the attributes that belong to both the concordance and the discordance coalition.

Observation 2: The weighting of the rows l = 1,…,m in the proposed algorithm is based on the density of 1's in matrix A. Thus:

$$w_l = \frac{|Z_l|}{Z_{total}}, \quad \text{where } Z_l = \{ j : a_{lj} = 1 \} \text{ and } Z_{total} = \sum_{l=1}^{m} |Z_l|$$
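A direct transcription of this weighting into code, reusing the toy matrix from earlier:

```python
import numpy as np

def row_weights(A):
    """w_l = |Z_l| / Z_total: each row's share of the 1's in the matrix."""
    z = A.sum(axis=1)        # |Z_l| = number of 1's on row l
    return z / z.sum()       # Z_total = total number of 1's

A = np.array([[1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 1]])
print(row_weights(A))        # [0.2857... 0.2857... 0.4285...]
```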
## Description of the Clustering Heuristic (4)

Step 2.3: This step confirms or rejects the allocation of a column to a cluster. If j is the column to be clustered, t is the cluster assigned to the column according to Step 2.2, and l ranges over the rows that belong to the discordance coalition (that is, rows with $D(K_{tl}, y_{jl}) > q_l$), then the clustering is confirmed if

$$D(K_{tl}, y_{jl}) = \frac{m_{y_{jl}} - m_{K_{tl}}}{m_{y_{jl}} + m_{K_{tl}}} \, \delta(K_{tl}, y_{jl}) \le U_l$$

for every row l in the discordance coalition; the centroid of the cluster is then updated and we proceed to the next column.

Observation: Parameter $U_l$ is called the veto threshold; it is valued automatically, and for every row l, the veto threshold satisfies $U_l > q_l$.
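Using the `distance` helper sketched earlier, the veto test reduces to one pass over the discordant rows (U is a vector of veto thresholds, one per row):

```python
def veto_ok(A, centroid, column, disc, U):
    """Step 2.3: the assignment stands only if no discordant row exercises its
    veto, i.e. D(K_tl, y_jl) <= U_l for every row l in the discordance coalition."""
    return all(distance(A, l, centroid[l], column[l]) <= U[l] for l in disc)
```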
## Description of the Clustering Heuristic (5)

Step 3: We retest the dissimilarity of the columns against the current centroids according to Step 2, re-assign every column that needs it to the proper cluster, and update the centroids of both clusters.

Step 4: We repeat Step 3 until no column has changed clusters after a full-cycle test.

Finding a new centroid for cluster k: Column $x_j$ is a centroid for the zero-one matrix $(a_{ij})$ if it minimizes

$$DIS(x_j, a_{ij}) = \sum_{i=1}^{m} D(a_{ij}, x_j)$$

If $n_{c_{k,l}}$ is the number of columns with a discrete value $c_{k,l}$ on row l, and

$$fr(A_l = c_{k,l} \mid a_{ij}) = \frac{n_{c_{k,l}}}{n}$$

is the relative frequency of appearance of $c_{k,l}$ on the set of columns, then DIS is minimized iff $fr(A_l = g_l \mid a_{ij}) \le fr(A_l = c_{k,l} \mid a_{ij})$ for $g_l \neq c_{k,l}$, for every l = 1,…,m.
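By this frequency argument, the minimizing centroid takes, on each row, the most frequent value among the cluster's columns (the row-wise mode), exactly as in k-modes. For a zero-one matrix this is a one-liner; breaking ties toward 1 is an arbitrary choice of ours:

```python
import numpy as np

def new_centroid(A, cluster):
    """Row-wise mode of the cluster's columns: the value with maximal relative
    frequency on each row minimizes DIS. Ties (exactly half 1's) go to 1."""
    sub = A[:, cluster]                      # restrict to the cluster's columns
    return (sub.mean(axis=1) >= 0.5).astype(int)
```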
## Computational Experimentation

- The algorithm was tested using datasets from the OR Library of J.E. Beasley.
- The datasets include problems of size 50 × 500 and 200 × 1000.
- In 80% of the tested datasets, the optimal solution was found using $k \in [40, 45]$ in the 50 × 500 problems and $k \in [55, 65]$ in the 200 × 1000 problems.
- The final SCP was solved using Premium Solver V8, in particular the XPRESS-MP Solver Engine.
## Conclusions

The new clustering heuristic:

1. combines three different scientific fields (set covering, data mining, multicriteria analysis);
2. takes into consideration the weight of each row (attribute) in the clustering process, rather than treating all rows as equally weighted;
3. calculates these weights according to the density of 1's in the matrix;
4. analyzes the dataset in detail through pairwise comparisons of each column with each centroid for each row;
5. takes into consideration the possible objections of a minority of the rows (attributes) to the clustering process;
6. yields covering results as well as processing times that are very promising.
## Bibliography

- Beasley, J.E. (1987), An algorithm for set covering problem, European Journal of Operational Research, 31, pp. 85-93.
- Huang, Z. (1998), Extensions to the k-means algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery, vol. 2, no. 3, pp. 283-304.
- Huang, Z., Ng, M.K., Rong, H. & Li, Z. (2005), Automated variable weighting in k-means type clustering, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 5, pp. 657-668.
- MacQueen, J.B. (1967), Some methods for classification and analysis of multivariate observations, Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, pp. 281-297.
- Roy, B. (1968), Classement et choix en présence de points de vue multiples : la méthode ELECTRE, R.I.R.O., 8, pp. 57-75.
- Roy, B. (1991), The outranking approach and the foundations of ELECTRE methods, Theory and Decision, 31, pp. 49-73.
