Supervised Clustering for Spam Detection in Data Streams

                     Ulf Brefeld                                       Peter Haider and Tobias Scheffer
         Technical University Berlin, Germany                  MPI for Computer Science, Saarbrücken, Germany

Introduction. We address the problem of detecting batches of emails that have been created according to the same
template. This problem is motivated by the desire to filter spam more effectively by exploiting collective information
about entire batches of jointly generated messages. The application matches the problem setting of supervised cluster-
ing, because examples of correct clusterings can be collected. Known decoding procedures for supervised clustering
are cubic in the number of instances. When decisions cannot be reconsidered once they have been made – owing to
the streaming nature of the data – the decoding problem can be solved in linear time. We devise a sequential
decoding procedure and derive the corresponding optimization problem of supervised clustering. We study the impact
of collective attributes of email batches on the effectiveness of recognizing spam emails.
A natural approach to identifying batches in incoming messages is to cluster groups of similar instances. Finley and
Joachims (2005) make explicit use of the ground-truth of correct clusterings and adapt the similarity measure of
correlation clustering by structural support vector machines. However, this solution ignores the temporal structure of
the data. And although training can be performed offline, the correlation clustering procedure has to make a decision
for each incoming message in real time as to whether it is part of a batch. Being cubic in the number of instances, this
solution leads to intractable problems in practice. We devise a sequential clustering technique that overcomes these
drawbacks. Exploiting the temporal nature of the data, it is linear in the number of instances.
Preliminaries. Training data consists of n sets of training emails x^(1), ..., x^(n) with T^(1), ..., T^(n) elements. Each
set x^(i) represents a snapshot of a window of observable messages. For each training set we are given the correct
partitioning into batches and singleton emails by means of adjacency matrices y^(1), ..., y^(n), where y_jk = 1 if x_j
and x_k are elements of the same batch, and y_jk = 0 otherwise.
A set of pairwise feature functions φ_d : (x_j, x_k) → r ∈ R with d = 1, ..., D is available. The feature functions
implement aspects of the correspondence between x_j and x_k, such as the TFIDF similarity of email bodies or the
edit distance of subject lines. It is natural to find a similarity value sim_w(x_j, x_k) by linearly combining the pairwise
feature functions with a weight vector w, forging the parameterized similarity measure of Equation 1.

                                sim_w(x_j, x_k) = Σ_{d=1}^{D} w_d φ_d(x_j, x_k) = w⊤ Φ(x_j, x_k)                               (1)
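As an illustration, a minimal sketch of Equation 1 in Python. The two feature functions below (a word-overlap score on bodies and a subject-length comparison) are hypothetical stand-ins for the TFIDF similarity and edit distance that the text names but does not specify in code form:

```python
import numpy as np

def pairwise_features(x_j, x_k):
    """Illustrative feature map Phi(x_j, x_k) with D = 2 features.

    Feature 1: Jaccard overlap of body words (stand-in for TFIDF
    similarity); feature 2: a score decreasing with the subject
    length difference (stand-in for edit distance).
    """
    body_j, body_k = set(x_j["body"].split()), set(x_k["body"].split())
    overlap = len(body_j & body_k) / max(len(body_j | body_k), 1)
    subj_diff = abs(len(x_j["subject"]) - len(x_k["subject"]))
    return np.array([overlap, 1.0 / (1.0 + subj_diff)])

def sim(w, x_j, x_k):
    # Equation 1: sim_w(x_j, x_k) = w^T Phi(x_j, x_k)
    return w @ pairwise_features(x_j, x_k)
```

Mails generated from the same template then score higher than unrelated pairs, provided w weights the features sensibly.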

Applying the similarity function to all pairs of emails in a set yields a similarity matrix. The problem of creating a
consistent clustering of instances from a similarity matrix is equivalent to the problem of correlation clustering (Bansal
et al., 2002). Given the parameters w, the objective of correlation clustering is to maximize the intra-cluster similarity.
This can be rewritten in terms of a joint model in input output space:
            f(x, y) = Σ_{t=1}^{T} Σ_{k=1}^{t−1} y_tk sim_w(x_t, x_k) = w⊤ ( Σ_{t=1}^{T} Σ_{k=1}^{t−1} y_tk Φ(x_t, x_k) ) = w⊤ Ψ(x, y),             (2)

where Ψ(x, y) jointly represents input x and output y. Thus, the underlying learning task in supervised clustering is
twofold. Given parameters w and a set of instances x, the decoding problem is to find the highest-scoring clustering
                         ŷ = argmax_y f(x, y)       s.t. ∀ j, k, l : (1 − y_jk) + (1 − y_kl) ≥ (1 − y_jl)
                                                         ∀ j, k : y_jk ∈ {0, 1},
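The transitivity constraints can be checked directly on a candidate adjacency matrix. A brute-force sketch, cubic in the number of instances and for illustration only (the decoding procedures discussed in the text avoid enumerating all triples):

```python
import numpy as np
from itertools import permutations

def is_consistent(y):
    """Verify, for all j, k, l: (1 - y[j,k]) + (1 - y[k,l]) >= (1 - y[j,l]).

    Equivalently: batch membership is transitive. If j and k share a
    batch and k and l share a batch, then j and l must share one too.
    """
    n = y.shape[0]
    return all((1 - y[j, k]) + (1 - y[k, l]) >= (1 - y[j, l])
               for j, k, l in permutations(range(n), 3))
```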
and the learning problem can be stated as finding a function f (respectively a vector w) that minimizes the expected
risk E_{p(x,y)}[Δ(y, argmax_ȳ f(x, ȳ))], where Δ : (y, ŷ) → r ∈ R_{≥0} measures the number of incorrect assignments
between the true clustering y and the prediction ŷ.
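One natural instance of Δ, sketched here as an assumption consistent with the definition above, counts the pairwise membership decisions on which two adjacency matrices disagree:

```python
import numpy as np

def delta(y_true, y_pred):
    """Number of pairs (j, k), j < k, on which the true and predicted
    adjacency matrices disagree about batch membership. The upper
    triangle (k=1 excludes the diagonal) counts each pair once."""
    return int(np.triu(np.abs(y_true - y_pred), k=1).sum())
```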
[Figure 1 omitted: plot residue. The panels compare the compact, iterative, sequential, and pairwise learning formulations under LP, agglomerative, and sequential decoding; panel c additionally compares classification with true batch information (with noise), decoded batch information, and no batch information; panel d plots runtime in seconds against the number of emails in the window.]
                    Figure 1: From left to right: a) average loss, b) number of erroneous edges, c) classification accuracy, and d) execution time.

Clustering of Streaming Data. In our batch detection application, incoming emails are processed sequentially. The
decision on the cluster assignment has to be made immediately, within an SMTP session, and cannot be altered
thereafter. We therefore impose the constraint that cluster membership cannot be reconsidered once a decision has
been made in the decoding procedure. When the partitioning of all previous emails in the window is fixed, a new mail
is processed by either assigning it to one of the existing clusters, or creating a new singleton batch. Finding the new
optimal clustering can be expressed as the following maximization problem.

Decoding Strategy 1 Given T^(i) instances x_1, ..., x_{T^(i)}, a similarity measure sim_w : (x_j, x_k) → r ∈ R, and a
clustering of the instances x_1, ..., x_{T^(i)−1}; the sequential decoding problem is defined as

                              ŷ = argmax_{ȳ ∈ Y_T} Σ_{k=1}^{T^(i)−1} ȳ_{T^(i),k} sim_w(x_{T^(i)}, x_k).
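A sketch of Decoding Strategy 1, assuming the pairwise similarities for the current window have been precomputed into a matrix (the data layout and function name are illustrative): the newest mail joins the existing cluster with the highest summed similarity, or starts a singleton batch when no cluster scores above zero.

```python
import numpy as np

def sequential_decode(sim_matrix, clusters):
    """Assign the newest instance (last row of sim_matrix) to a cluster.

    sim_matrix: (T, T) array of precomputed sim_w values.
    clusters:   list of lists of indices 0..T-2 partitioning the
                previous instances; modified in place and returned.

    Each previous instance is inspected exactly once, so the decision
    is linear in the number of instances in the window.
    """
    t = sim_matrix.shape[0] - 1           # index of the new mail
    best_score, best_cluster = 0.0, None  # score 0 = new singleton batch
    for c in clusters:
        score = sum(sim_matrix[t, k] for k in c)
        if score > best_score:
            best_score, best_cluster = score, c
    if best_cluster is None:
        clusters.append([t])              # no cluster beats a singleton
    else:
        best_cluster.append(t)
    return clusters
```

Because earlier assignments are never revisited, the partitioning of the window remains fixed once made, matching the streaming constraint.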

Now, we derive an optimization problem that requires the sequential clustering to produce the correct output for all
training data. Optimization Problem 1 constitutes a compact formulation for finding the desired optimal weight vector
by treating every message as the most recent message once, in order to exploit the available training data as effectively
as possible. Note that Optimization Problem 1 has at most Σ_{i=1}^{n} (T^(i))² constraints and can be solved efficiently
with standard techniques.

Optimization Problem 1 Given n labeled clusterings and C > 0, the sequential learning problem is defined as

                        min_w  (1/2) ‖w‖² + C Σ_{i,t} ξ_t^(i)
                        s.t.   w⊤ Ψ(x^(i), y^(i)) + ξ_t^(i) ≥ w⊤ Ψ(x^(i), ȳ) + Δ(y^(i), ȳ)

for all 1 ≤ i ≤ n, 1 ≤ t ≤ T^(i), and ȳ ∈ Y_t.
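For intuition, the slack ξ required by a single constraint of Optimization Problem 1 can be evaluated from Ψ. A sketch under the assumption that the pairwise feature vectors Φ(x_t, x_k) have been precomputed into a (T, T, D) array; the array layout and helper names are illustrative, not the paper's implementation:

```python
import numpy as np

def joint_feature(x_feats, y):
    """Psi(x, y) = sum over t and k < t of y[t, k] * Phi(x_t, x_k),
    with x_feats a (T, T, D) array of pairwise feature vectors."""
    T = y.shape[0]
    return sum(y[t, k] * x_feats[t, k]
               for t in range(T) for k in range(t))

def margin_violation(w, x_feats, y_true, y_cand, loss):
    """Slack needed so that the true clustering outscores a candidate
    clustering y_cand by the margin loss = Delta(y_true, y_cand):
    max(0, w^T Psi(x, y_cand) + loss - w^T Psi(x, y_true))."""
    score_true = w @ joint_feature(x_feats, y_true)
    score_cand = w @ joint_feature(x_feats, y_cand)
    return max(0.0, score_cand + loss - score_true)
```

A violation of zero means the constraint for that candidate ȳ is already satisfied with the required margin.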

Conclusion. We devised a sequential clustering algorithm for learning a similarity measure to be used with correlation
clustering. Starting from the assumption that decisions for already processed emails cannot be reconsidered, we
devised an efficient clustering algorithm with computational complexity linear in the number of emails in the window.
From this algorithm we derived a second integrated method for learning the weight vector.
Our empirical results indicate that there are no significant differences between the learning or decoding methods in
terms of accuracy (Figure 1a). Yet the integrated learning formulations optimize the weight vector more directly
to yield consistent partitionings (Figure 1b). Using the batch information obtained from decoding with the learned
models, email spam classification performance increases substantially over the baseline with no batch information
(Figure 1c). The efficiency of the sequential clustering algorithm makes supervised batch detection feasible at
enterprise scale, with millions of emails per hour and thousands of recent emails as reference.

Acknowledgments. We gratefully acknowledge support from STRATO AG and from the German Science Foundation DFG.

Bansal, N., Blum, A., & Chawla, S. (2002). Correlation clustering. Proceedings of the Symposium on Foundations of Computer
   Science.
Finley, T., & Joachims, T. (2005). Supervised clustering with support vector machines. Proceedings of the International Conference
   on Machine Learning.
