Supervised Clustering for Spam Detection in Data Streams

Ulf Brefeld (Technical University Berlin, Germany; brefeld@cs.tu-berlin.de)
Peter Haider and Tobias Scheffer (MPI for Computer Science, Saarbrücken, Germany; {haider,scheffer}@mpi-inf.mpg.de)

Introduction. We address the problem of detecting batches of emails that have been created according to the same template. This problem is motivated by the desire to filter spam more effectively by exploiting collective information about entire batches of jointly generated messages. The application matches the problem setting of supervised clustering, because examples of correct clusterings can be collected. Known decoding procedures for supervised clustering are cubic in the number of instances. When decisions cannot be reconsidered once they have been made, owing to the streaming nature of the data, the decoding problem can be solved in linear time. We devise a sequential decoding procedure and derive the corresponding optimization problem of supervised clustering. We study the impact of collective attributes of email batches on the effectiveness of recognizing spam emails.

A natural approach to identifying batches in incoming messages is to cluster groups of similar instances. Finley and Joachims (2005) make explicit use of the ground truth of correct clusterings and adapt the similarity measure of correlation clustering by means of structural support vector machines. However, this solution ignores the temporal structure of the data. And although training can be performed offline, the correlation clustering procedure has to decide in real time, for each incoming message, whether it is part of a batch. Being cubic in the number of instances, this solution leads to intractable problems in practice. We devise a sequential clustering technique that overcomes these drawbacks: exploiting the temporal nature of the data, it is linear in the number of instances.

Preliminaries.
Training data consists of $n$ sets of training emails $x^{(1)}, \ldots, x^{(n)}$ with $T^{(1)}, \ldots, T^{(n)}$ elements. Each set $x^{(i)}$ represents a snapshot of a window of observable messages. For each training set we are given the correct partitioning into batches and singleton emails by means of adjacency matrices $y^{(1)}, \ldots, y^{(n)}$, where $y^{(i)}_{jk} = 1$ if $x_j$ and $x_k$ are elements of the same batch, and $0$ otherwise. A set of pairwise feature functions $\phi_d : (x_j, x_k) \mapsto r \in \mathbb{R}$ with $d = 1, \ldots, D$ is available. The feature functions implement aspects of the correspondence between $x_j$ and $x_k$, such as the TFIDF similarity of the email bodies or the edit distance of the subject lines. It is natural to obtain a similarity value $\mathrm{sim}_w(x_j, x_k)$ by linearly combining the pairwise feature functions with a weight vector $w$, forging the parameterized similarity measure of Equation 1.

$$\mathrm{sim}_w(x_j, x_k) = \sum_{d=1}^{D} w_d \, \phi_d(x_j, x_k) = w^\top \Phi(x_j, x_k) \qquad (1)$$

Applying the similarity function to all pairs of emails in a set yields a similarity matrix. The problem of creating a consistent clustering of instances from a similarity matrix is equivalent to the problem of correlation clustering (Bansal et al., 2002). Given the parameters $w$, the objective of correlation clustering is to maximize the intra-cluster similarity. This can be rewritten in terms of a joint model in input-output space:

$$f(x, y) = \sum_{t=1}^{T} \sum_{k=1}^{t-1} y_{tk} \, \mathrm{sim}_w(x_t, x_k) = w^\top \sum_{t=1}^{T} \sum_{k=1}^{t-1} y_{tk} \, \Phi(x_t, x_k) = w^\top \Psi(x, y), \qquad (2)$$

where $\Psi(x, y)$ jointly represents input $x$ and output $y$. Thus, the underlying learning task in supervised clustering is twofold. Given parameters $w$ and a set of instances $x$, the decoding problem is to find the highest-scoring clustering $\hat{y} = \mathrm{argmax}_{y} f(x, y)$ subject to
the transitivity and integrality constraints

$$\forall j,k,l: \; (1 - y_{jk}) + (1 - y_{kl}) \ge (1 - y_{jl}), \qquad \forall j,k: \; y_{jk} \in \{0, 1\},$$

and the learning problem can be stated as finding a function $f$ (respectively a vector $w$) that minimizes the expected risk $E_{p(x,y)}[\Delta(y, \mathrm{argmax}_{\bar{y}} f(x, \bar{y}))]$, where $\Delta : (y, \bar{y}) \mapsto r \in \mathbb{R}^+$ measures the number of incorrect assignments between the true clustering $y$ and the prediction $\hat{y}$.

Figure 1: From left to right: a) average normalized loss per mail of the compact, iterative, sequential, and pairwise formulations; b) number of erroneous edges; c) classification accuracy (area under the ROC curve) over the percentage of wrongly assigned emails, with and without batch information; d) execution time of LP, agglomerative, and sequential decoding for growing window sizes.

Clustering of Streaming Data. In our batch detection application, incoming emails are processed sequentially. The decision on the cluster assignment has to be made immediately, within an SMTP session, and cannot be altered thereafter. We therefore impose the constraint that cluster membership cannot be reconsidered once a decision has been made in the decoding procedure. When the partitioning of all previous emails in the window is fixed, a new mail is processed by either assigning it to one of the existing clusters or by creating a new singleton batch. Finding the new optimal clustering can be expressed as the following maximization problem.

Decoding Strategy 1. Given $T^{(i)}$ instances $x_1, \ldots, x_{T^{(i)}}$, a similarity measure $\mathrm{sim}_w : (x_j, x_k) \mapsto r \in \mathbb{R}$, and a clustering of the instances $x_1, \ldots, x_{T^{(i)}-1}$, the sequential decoding problem is defined as

$$\hat{y} = \mathrm{argmax}_{\bar{y} \in Y^{(i)}_{T^{(i)}}} \; \sum_{k=1}^{T^{(i)}-1} \bar{y}_{T^{(i)} k} \, \mathrm{sim}_w(x_{T^{(i)}}, x_k).$$

Now we derive an optimization problem that requires the sequential clustering to produce the correct output for all training data. Optimization Problem 1 constitutes a compact formulation for finding the desired optimal weight vector; it treats every message as the most recent message once, in order to exploit the available training data as effectively as possible. Note that Optimization Problem 1 has at most $\sum_{i=1}^{n} (T^{(i)})^2$ constraints and can be solved efficiently with standard techniques.

Optimization Problem 1. Given $n$ labeled clusterings and $C > 0$, the sequential learning problem is defined as

$$\min_{w, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_{i,t} \xi^{(i)}_t \quad \text{s.t.} \quad w^\top \Psi(x^{(i)}, y^{(i)}) + \xi^{(i)}_t \ge w^\top \Psi(x^{(i)}, \bar{y}) + \Delta_N(y^{(i)}, \bar{y})$$

for all $1 \le i \le n$, $1 \le t \le T^{(i)}$, and $\bar{y} \in Y^{(i)}_t$.

Conclusion. We devised a sequential clustering algorithm for learning a similarity measure to be used with correlation clustering. Starting from the assumption that decisions for already processed emails cannot be reconsidered, we devised an efficient clustering algorithm whose computational complexity is linear in the number of emails in the window. From this algorithm we derived a second, integrated method for learning the weight vector. Our empirical results indicate that there are no significant differences between the learning or decoding methods in terms of accuracy (Figure 1a). Yet the integrated learning formulations optimize the weight vector more directly to yield consistent partitionings (Figure 1b). Using the batch information obtained from decoding with the learned models, email spam classification performance increases substantially over the baseline with no batch information (Figure 1c). The efficiency of the sequential clustering algorithm makes supervised batch detection feasible at enterprise scale, with millions of emails per hour and thousands of recent emails as reference.
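As a concrete illustration, the parameterized similarity measure of Equation 1 and the sequential decoding step of Decoding Strategy 1 can be sketched as follows. The two pairwise features used here (subject-line equality and Jaccard overlap of body tokens) and the weight vector are illustrative assumptions standing in for the paper's TFIDF and edit-distance features and the learned weights; a new mail joins the batch with the highest positive total similarity, or opens a singleton batch otherwise.

```python
import numpy as np

def pairwise_features(mail_j, mail_k):
    """Pairwise feature vector Phi(x_j, x_k); the two features here
    (subject equality, Jaccard overlap of body tokens) are illustrative
    stand-ins for the paper's TFIDF and edit-distance features."""
    subj = 1.0 if mail_j["subject"] == mail_k["subject"] else 0.0
    tok_j, tok_k = set(mail_j["body"].split()), set(mail_k["body"].split())
    overlap = len(tok_j & tok_k) / max(1, len(tok_j | tok_k))
    return np.array([subj, overlap])

def sim(w, mail_j, mail_k):
    # Equation 1: sim_w(x_j, x_k) = w^T Phi(x_j, x_k)
    return float(w @ pairwise_features(mail_j, mail_k))

def sequential_decode(w, clusters, new_mail):
    """Assign new_mail to the existing cluster with the highest total
    similarity, or open a new singleton batch if no cluster scores > 0
    (the all-zero row of y-bar). Linear in the window size; earlier
    assignments are never reconsidered."""
    best_score, best_idx = 0.0, None
    for idx, cluster in enumerate(clusters):
        score = sum(sim(w, new_mail, m) for m in cluster)
        if score > best_score:
            best_score, best_idx = score, idx
    if best_idx is None:
        clusters.append([new_mail])          # new singleton batch
    else:
        clusters[best_idx].append(new_mail)  # join existing batch
    return clusters

w = np.array([1.0, 2.0])  # illustrative weights; learned in the paper
stream = [
    {"subject": "cheap meds", "body": "buy cheap meds now"},
    {"subject": "meeting",    "body": "agenda for the meeting"},
    {"subject": "cheap meds", "body": "buy cheap meds today"},
]
clusters = []
for mail in stream:
    clusters = sequential_decode(w, clusters, mail)
print([len(c) for c in clusters])  # → [2, 1]
```

Each incoming mail is compared against all mails in the window once, giving the linear per-message cost the abstract claims; the threshold of zero for opening a singleton follows directly from the argmax in Decoding Strategy 1, since the all-zero assignment scores exactly zero.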
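Optimization Problem 1 is a quadratic program; the sketch below only illustrates its constraint structure (every message treated as the most recent one, its true assignment required to out-score every alternative assignment by the loss) using a simple margin-rescaled, perceptron-style update instead of solving the QP. The feature functions, toy batch ids, learning rate, and update rule are all assumptions for illustration, not the paper's method.

```python
import numpy as np

def phi(mail_j, mail_k):
    """Illustrative pairwise feature vector (the role of Phi in Eq. 1):
    subject-line equality and Jaccard overlap of body tokens."""
    subj = 1.0 if mail_j["subject"] == mail_k["subject"] else 0.0
    tj, tk = set(mail_j["body"].split()), set(mail_k["body"].split())
    jac = len(tj & tk) / max(1, len(tj | tk))
    return np.array([subj, jac])

def perceptron_step(w, mails, batch_of, t, lr=0.1):
    """One margin-rescaled, perceptron-style update for message t,
    treated as the most recent message. batch_of[k] is the true batch
    id of mail k; the candidates are joining the batch of any earlier
    mail, or remaining a singleton (None)."""
    candidates = {None} | {batch_of[k] for k in range(t)}

    def psi(b):  # Psi restricted to row t: sum of Phi over joined mails
        return sum((phi(mails[t], mails[k]) for k in range(t)
                    if b is not None and batch_of[k] == b),
                   np.zeros(2))

    def loss(b):  # number of wrong edges in row t (the Delta term)
        return sum((batch_of[k] == b) != (batch_of[k] == batch_of[t])
                   for k in range(t))

    true_b = (batch_of[t]
              if any(batch_of[k] == batch_of[t] for k in range(t))
              else None)
    # most violated candidate assignment under the current weights
    worst = max(candidates, key=lambda b: float(w @ psi(b)) + loss(b))
    if float(w @ psi(worst)) + loss(worst) > float(w @ psi(true_b)):
        w = w + lr * (psi(true_b) - psi(worst))
    return w

# toy window: mails 0 and 2 follow the same (made-up) template
mails = [
    {"subject": "cheap meds", "body": "buy cheap meds now"},
    {"subject": "meeting",    "body": "agenda for the meeting"},
    {"subject": "cheap meds", "body": "buy cheap meds today"},
]
batch_of = [0, 1, 0]   # ground-truth batch ids (encodes the matrix y)

w = np.zeros(2)
for _ in range(20):                    # a few passes over the window
    for t in range(1, len(mails)):     # every mail is "newest" once
        w = perceptron_step(w, mails, batch_of, t)
```

After a few passes the weights separate same-template pairs from unrelated ones; the number of constraints visited matches the abstract's bound, since message t has at most t candidate assignments and the totals sum to at most $(T^{(i)})^2$ per window.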
Acknowledgments. We gratefully acknowledge support from STRATO AG and from the German Science Foundation DFG.

References

Bansal, N., Blum, A., & Chawla, S. (2002). Correlation clustering. Proceedings of the Symposium on Foundations of Computer Science.

Finley, T., & Joachims, T. (2005). Supervised clustering with support vector machines. Proceedings of the International Conference on Machine Learning.
