



    ALPACAS: A Large-scale
Privacy-aware Collaborative
         Anti-spam System
  Z. Zhong, L. Ramaswamy and K. Li, IEEE INFOCOM 2008




      Intelligent E-Commerce System Lab.
                                Aettie, Ji


                                                  INHA UNIVERSITY
                                                  INCHEON, KOREA
                                                http://eslab.inha.ac.kr/




    OUTLINE
INTRODUCTION
PRIOR WORK
THE ALPACAS ANTI-SPAM FRAMEWORK
   Feature-Preserving Fingerprint
   Privacy-Preserving Collaboration Protocol
   System Structure

EXPERIMENTS & RESULTS
DISCUSSION
CONCLUSION





    INTRODUCTION
Motivations
   Recent spam attacks pose strong challenges to the
    statistical filters that have been popular.
   Collaborative spam filtering, in which information about
    spam is shared, is a natural defense paradigm, since
    spammers send similar emails to many target receivers.
   However, protecting the privacy of the participants in the
    collaboration is an important challenge.
   To protect privacy, digest-based approaches have been
    proposed, but they are not sufficient.






    INTRODUCTION
Contributions
   ALPACAS: A Large-scale Privacy-Aware Collaborative
    Anti-spam System.
     •   A resilient fingerprint generation technique, the
         “feature-preserving transformation”, is proposed.
     •   A privacy-preserving protocol is designed to control the
         amount of information that is shared.
   The experimental results demonstrate that ALPACAS
    outperforms traditional stand-alone statistical filters.








    PRIOR WORK
Drawbacks of the existing collaborative anti-spam
 schemes (using DCC).
   How does it work?
     •   Participating servers in DCC share email digests
         computed through hash functions such as MD5.
     •   The DCC system replies with recent statistics about
         the digests.
   Drawbacks
     •   Hashing schemes like MD5 generate a completely different
         hash value even if a single byte is altered.
     •   The DCC scheme does not completely address the privacy
         issue: it remains open to inference-based privacy breaches.
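
The first drawback can be seen with a short sketch. The two message strings below are made up for illustration, and the helper name md5_digest is hypothetical; only the standard hashlib module is used. Changing a single character yields an unrelated digest, so a digest match fails for near-duplicate spam.

```python
import hashlib


def md5_digest(text: str) -> str:
    """Return the hexadecimal MD5 digest of a text message."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


# Two made-up spam messages that differ in exactly one character.
original = "Buy cheap watches now, limited offer!"
altered = "Buy cheap watches now, limited offer?"

d1 = md5_digest(original)
d2 = md5_digest(altered)

# Count how many of the 32 hex digits differ between the two digests.
differing = sum(1 for a, b in zip(d1, d2) if a != b)

print(d1)
print(d2)
print(f"{differing} of {len(d1)} hex digits differ")
```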



    THE ALPACAS ANTI-SPAM
    FRAMEWORK(1/2)
Challenges
   To protect email privacy,
     •   The messages have to be encrypted.
     •   The encryption should retain the important features of the
         messages.
   To avoid inference-based privacy breaches,
     •   It is necessary to minimize the information revealed during
         the collaboration.
ALPACAS framework components
   Feature-preserving fingerprint
   Privacy-preserving protocol
   DHT-based architecture



THE ALPACAS ANTI-SPAM
FRAMEWORK(2/2)




 (a) ALPACAS Network                       (b) Internal mechanism of EA4


                 Fig. 1: ALPACAS System Overview





    Feature-Preserving Fingerprint(1/4)
Shingle-based Message Transformation
   Shingle: a contiguous sequence of tokens in a document. If two
    documents vary by a small amount, their shingle sets also
    differ by only a small amount.
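
A minimal sketch of this property, assuming whitespace tokenization and a window of 3 tokens (both choices, the function name shingles, and the two sample messages are illustrative assumptions, not taken from the paper):

```python
def shingles(text: str, w: int = 3) -> set:
    """Return the set of w-token shingles (as tuples) of a message."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


# Two made-up messages that differ in a single word.
a = "click here to claim your free prize today only"
b = "click here to claim your free gift today only"

sa, sb = shingles(a), shingles(b)
shared = sa & sb

print(f"shingles per message: {len(sa)}")
print(f"shared shingles:      {len(shared)}")
```

Only the shingles whose window covers the changed word differ. In a longer message the same one-word change would affect the same small number of shingles, so the differing fraction shrinks further.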




        Fig. 2: ALPACAS Feature Sets, DCC and Razor Digests for 2 spam emails
        (Texts in bold font indicate differences)




    Feature-Preserving Fingerprint(2/4)
Shingle-based Message Transformation
   Generation of the transformed feature set of message
    Ma, TFSet(Ma)
     •   Compute the Rabin fingerprint [11] of the consecutive tokens
         in a sliding window of length W.
     •   Each fingerprint is in the range (0, 2^K − 1).
     •   For a message with X tokens, X − W + 1 fingerprints are
         obtained.
     •   The smallest Y fingerprints are retained.
     •   The similarity between Ma and Mb can be calculated as

$$
\frac{\left| TFSet_{(W,Y)}(M_a) \cap TFSet_{(W,Y)}(M_b) \right|}{\left| TFSet_{(W,Y)}(M_a) \cup TFSet_{(W,Y)}(M_b) \right|}
$$
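
The generation steps can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: a truncated SHA-256 hash stands in for the Rabin fingerprint, tokens are split on whitespace, the similarity is taken to be the size of the intersection of the two feature sets divided by the size of their union, and the names fingerprint, tfset and similarity as well as the values K = 32, W = 3, Y = 4 are hypothetical.

```python
import hashlib

K = 32  # fingerprints fall in the range 0 .. 2**K - 1 (illustrative value)


def fingerprint(window_tokens) -> int:
    """Stand-in for a Rabin fingerprint (assumption): truncated SHA-256."""
    data = " ".join(window_tokens).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % (2 ** K)


def tfset(text: str, w: int, y: int) -> set:
    """Fingerprint every w-token window and keep the y smallest values."""
    tokens = text.lower().split()
    prints = {fingerprint(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}
    return set(sorted(prints)[:y])


def similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


ma = "click here to claim your free prize today only"
mb = "click here to claim your free gift today only"

fa = tfset(ma, w=3, y=4)
fb = tfset(mb, w=3, y=4)
print("similarity:", similarity(fa, fb))
```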




    Feature-Preserving Fingerprint(3/4)
Shingle-based Message Transformation
   With regard to privacy preservation,
     •   The Rabin fingerprint algorithm is a one-way hash function,
         so it is infeasible to reverse.
     •   However, it is still possible to infer a word or a group of
         words from an individual feature value.








    Feature-Preserving Fingerprint(4/4)
Term-level Privacy Preservation
   Controlled shuffling
     •   The email text is divided into h consecutive chunks of z
         consecutive tokens each.
     •   The tokens in each chunk are shuffled in a pre-defined
         manner, while the ordering of the chunks is retained.
     •   Each chunk is divided into y sub-chunks (y is a factor of z).
     •   The tokens in chunk CKh are shuffled such that the token
         at the r-th position in the s-th sub-chunk is moved to the
         (r × y + s)-th position in CKh.
     •   Even if two messages contain an identical term, shuffling
         can make the feature sets derived from it differ.
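
The shuffling rule can be sketched as below. The sketch assumes 0-based positions for r and s, and it leaves a trailing chunk shorter than z tokens unshuffled; both are assumptions made for illustration, as are the function names shuffle_chunk and controlled_shuffle.

```python
def shuffle_chunk(chunk, y):
    """Move the token at position r of sub-chunk s to position r * y + s."""
    z = len(chunk)
    if z % y != 0:
        raise ValueError("y must be a factor of the chunk length z")
    sub_len = z // y  # number of tokens in each sub-chunk
    out = [None] * z
    for s in range(y):
        for r in range(sub_len):
            out[r * y + s] = chunk[s * sub_len + r]
    return out


def controlled_shuffle(tokens, z, y):
    """Shuffle tokens inside each chunk of z tokens; chunk order is kept."""
    out = []
    for start in range(0, len(tokens), z):
        chunk = tokens[start:start + z]
        if len(chunk) == z:
            out.extend(shuffle_chunk(chunk, y))
        else:
            out.extend(chunk)  # assumption: a short final chunk is left as is
    return out


print(shuffle_chunk(list("abcdef"), 2))
# → ['a', 'd', 'b', 'e', 'c', 'f']
```

With z = 6 and y = 2 the two sub-chunks "abc" and "def" are interleaved, so tokens that were adjacent in the email no longer fall into the same fingerprint window.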




      Privacy-Preserving
      Collaboration Protocol (1/3)
Spam/ham dichotomy
      Revealing the contents of a spam email does not affect the
      privacy, whereas revealing information about a ham email
      constitutes a privacy breach.


Protocol
     EAj receives Ma, then computes TFSet(Ma).
     EAj sends a query containing a subset of TFSet(Ma) to other agents.
     EAk receives the query, then checks its spam and ham knowledge
      bases (KBs).
     For each matching entry in the spam KB, EAk sends back the
      complete transformed feature set.
     For each matching entry in the ham KB, EAk sends back a small,
      randomly selected part of the transformed feature set.
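
A sketch of how the responding agent EAk might build its reply. The slides do not define what counts as a "matching entry" or how large the revealed ham part is, so this sketch assumes that an entry matches when it shares at least one feature with the query, and it uses a hypothetical parameter ham_reveal for the number of ham features sent back. The function name respond and the integer feature values are also illustrative.

```python
import random


def respond(query_features, spam_kb, ham_kb, ham_reveal=2, rng=None):
    """Build EAk's reply: full sets for spam matches, small samples for ham."""
    rng = rng or random.Random()
    query = set(query_features)

    # Spam entries carry no privacy concern, so the full set is returned.
    spam_replies = [set(entry) for entry in spam_kb if query & set(entry)]

    # For ham entries, only a small random part of the set is revealed.
    ham_replies = []
    for entry in ham_kb:
        if query & set(entry):
            k = min(ham_reveal, len(entry))
            ham_replies.append(set(rng.sample(sorted(entry), k)))
    return spam_replies, ham_replies


spam_kb = [{1, 2, 3, 4}, {10, 11}]
ham_kb = [{2, 5, 6, 7, 8}]

spam_replies, ham_replies = respond({2, 99}, spam_kb, ham_kb, ham_reveal=2)
print("spam replies:", spam_replies)
print("ham replies: ", ham_replies)
```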


Privacy-Preserving
Collaboration Protocol (2/3)




        Fig. 3: ALPACAS Protocol: Query and Response


      Privacy-Preserving
      Collaboration Protocol (3/3)
Protocol (cont'd)
     EAj now compares MaxSpamOvlp(Ma) with MaxHamOvlp(Ma) through
      the following score and decides whether Ma is spam or ham.

$$
Score = \frac{1 + MaxSpamOvlp(M_a) - MaxHamOvlp(M_a)}{2}
$$

     If the score is greater than a threshold λ, Ma is classified as spam,
      otherwise as ham.
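
A small worked sketch of the decision step. It assumes the score is read as (1 + MaxSpamOvlp − MaxHamOvlp) / 2 with both overlaps already normalized to the range 0 to 1; the function names and the threshold value 0.5 used in the example are illustrative assumptions, since the slides do not give a value for λ.

```python
def spam_score(max_spam_ovlp: float, max_ham_ovlp: float) -> float:
    """Score in [0, 1]: high when the spam overlap dominates the ham overlap."""
    return (1 + max_spam_ovlp - max_ham_ovlp) / 2


def classify(max_spam_ovlp: float, max_ham_ovlp: float, threshold: float) -> str:
    """Label the message as spam if the score exceeds the threshold."""
    score = spam_score(max_spam_ovlp, max_ham_ovlp)
    return "spam" if score > threshold else "ham"


# Hypothetical overlaps: strong match with known spam, weak match with ham.
print(spam_score(0.8, 0.1))
print(classify(0.8, 0.1, threshold=0.5))
print(classify(0.1, 0.6, threshold=0.5))
```

With equal spam and ham overlaps the score is exactly 0.5, so the choice of λ decides which way ambiguous messages fall.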








    System Structure (1/2)
Design principle
    A query should be sent to an email agent only if it has a
    reasonable chance of containing information about the
    email that is being verified. Contacting any other email
    agent not only introduces inefficiencies but also leads to
    unnecessary exposure of data.


DHT-based Architecture
   EAj is responsible for maintaining information about all
    the emails whose TFSet has at least one feature element in
    the range allocated to it.







    System Structure (2/2)
DHT-based Architecture (cont'd)
   There are N email agents.
   All feature elements lie within (0, 2^K − 1).
   The range (0, 2^K − 1) is divided into N non-overlapping regions
    {(MinF0, MaxF0), (MinF1, MaxF1), . . . , (MinF_{N-1}, 2^K − 1)}.
   (MinFj, MaxFj) denotes the sub-range allocated to EAj.
         •   For spam, EAj stores the entire TFSet.
         •   For ham, EAj stores only a subset of the TFSet.
   If MinFj ≤ Ft ≤ MaxFj, then EAj is called the rendezvous
    agent of feature element Ft.
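
The rendezvous lookup can be sketched as follows. The slides do not say how the sub-ranges are sized, so this sketch assumes equal-width, non-overlapping sub-ranges; the function names build_ranges, rendezvous_agent and rendezvous_agents are hypothetical.

```python
def build_ranges(k: int, n: int):
    """Split 0 .. 2**k - 1 into n consecutive sub-ranges (MinFj, MaxFj)."""
    size = 2 ** k
    bounds = [(j * size) // n for j in range(n + 1)]
    return [(bounds[j], bounds[j + 1] - 1) for j in range(n)]


def rendezvous_agent(feature: int, ranges) -> int:
    """Return the index j of the agent with MinFj <= feature <= MaxFj."""
    for j, (lo, hi) in enumerate(ranges):
        if lo <= feature <= hi:
            return j
    raise ValueError("feature lies outside the fingerprint range")


def rendezvous_agents(tfset, ranges) -> set:
    """Agents that must be contacted for a transformed feature set."""
    return {rendezvous_agent(f, ranges) for f in tfset}


ranges = build_ranges(k=4, n=4)
print(ranges)
# → [(0, 3), (4, 7), (8, 11), (12, 15)]
print(rendezvous_agents({1, 5, 14}, ranges))
```

Only the agents returned by rendezvous_agents need to receive a query, which matches the design principle of contacting an agent only when it may hold information about the email.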







    EXPERIMENTS & RESULTS
Benchmarked algorithms
   Bogofilter, based on Bayesian filtering
     •   Calculates a spamminess score for the email.
   DCC, based on simple hash-based collaborative filtering
     •   Counts the number of times the hash value of the email
         has been reported as spam.








    Experimental Setup
Dataset
   TREC email corpus & SpamAssassin email corpus
   The TREC corpus is divided into 67 email sets according to
    their target addresses (67 agents).
   Half of each email set, including ham and spam, is used
    for training and the remainder for testing.
   Each individual agent has a pre-classified email
    corpus (SpamAssassin) as its initial knowledge base.








    Performance Metrics
Spam filtering accuracy
   A ham email that the filtering scheme classifies as spam
    is termed a false positive.
Privacy of collaborative anti-spam system
   The message-level privacy breach percentage is defined as
    the ratio of the number of test ham messages suffering privacy
    compromises to the total number of test ham messages.
Communication overhead of the system
   The per-test communication cost metric is defined as the
    total number of messages circulated in the system
    during the entire experiment.
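
The first two metrics reduce to simple ratios, sketched below with made-up labels. Taking the number of test ham messages as the denominator of the false positive percentage is an assumption, and the function names are hypothetical.

```python
def false_positive_pct(true_labels, predicted_labels) -> float:
    """Percentage of ham test messages that were classified as spam."""
    ham_total = sum(1 for t in true_labels if t == "ham")
    fp = sum(
        1
        for t, p in zip(true_labels, predicted_labels)
        if t == "ham" and p == "spam"
    )
    return 100.0 * fp / ham_total if ham_total else 0.0


def privacy_breach_pct(breached_flags) -> float:
    """Percentage of test ham messages that suffered a privacy compromise."""
    if not breached_flags:
        return 0.0
    return 100.0 * sum(breached_flags) / len(breached_flags)


# Made-up evaluation data for five test emails.
truth = ["ham", "ham", "spam", "ham", "spam"]
predicted = ["ham", "spam", "spam", "ham", "ham"]

print(false_positive_pct(truth, predicted))
print(privacy_breach_pct([False, True, False, False]))
```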





     SPAM Filtering Effectiveness




Fig. 4: False Positive Percentages of ALPACAS, BogoFilter and DCC
Fig. 5: False Negative Percentages of ALPACAS, BogoFilter and DCC
Fig. 6: System Overall Accuracy (DCC is not displayed because its FP is 0)








Robustness Against Attacks




Fig. 7: System Robustness Against Good-Word Attacks
Fig. 8: System Robustness Against Character Replacement Attacks








Privacy Awareness




     Fig. 9: Privacy Breach in ALPACAS (Varying Number of Agents)








Communication Overheads




   Fig. 10: Communication Overheads of the ALPACAS and the DCC systems






           Message Transformation
           Algorithm Analysis




Fig. 11: False Positive of ALPACAS for Various Parameter Setup
Fig. 12: False Negative of ALPACAS for Various Parameter Setup
Fig. 13: Effectiveness of Controlled Shuffling Strategy








    DISCUSSION
 Approaches such as statistical filtering could be combined with
  the feature-preserving transformation scheme.
 Supporting the dynamic nature of email agents in the system
  using replication and finger-table-based routing.
 Approaches for defending against malicious email agents.








    CONCLUSION
In this paper, the design and evaluation of
 ALPACAS are presented.
The two novel features:
   A feature-preserving transformation technique
   A privacy-preserving protocol

Our initial experiments show that ALPACAS
   Is very effective in filtering spam.
   Has high resilience towards various attacks.
   Provides strong privacy protection for the participating
    entities.


				