chris spam

							SPAM
Christian Loza
Srikanth Palla
 Liqin Zhang




     presented by Christian Loza, Srikanth Palla and Liqin Zhang
Overview

- Introduction
- Background
- Measurement
- Methods
- Comparison of different methods
- Conclusions
                                                   Introduction


If you use email, you have probably received spam recently: an
unsolicited, unwanted message sent to you without your permission.
Sending spam violates the Acceptable Use Policy (AUP) of almost all ISPs
and can lead to the termination of the sender's account. Because the
recipient directly bears the cost of delivery, storage, and processing,
one could regard spam as the electronic equivalent of "postage-due" junk
mail.
                                                   Introduction


Spammers frequently engage in deliberate fraud to send out their
messages. Spammers often use false names, addresses, phone numbers,
and other contact information to set up "disposable" accounts at various
Internet service providers. They also often use falsified or stolen credit
card numbers to pay for these accounts. This allows them to move
quickly from one account to the next as the host ISPs discover and shut
down each one.




Introduction

- In recent years, spam has shown no sign of slowing its growth.
- This is mainly because it works.
- Its advantage for the sender is that it is a cheap way to increase the
  customer base.

Spammers frequently go to great lengths to conceal the origin of their
messages. They do this by spoofing e-mail addresses: because basic SMTP
does not verify the sender, the spammer can forge the sender fields so
that a message appears to originate from another e-mail address. Some
ISPs and domains require the use of SMTP AUTH, allowing positive
identification of the specific account from which an e-mail originates.

One cannot completely spoof the chain of servers an e-mail has passed
through, since the receiving mail server records the actual connection
from the last mail server's IP address; however, spammers can forge the
rest of the history of mail servers that the e-mail has ostensibly
traversed. Spammers frequently seek out and make use of vulnerable
third-party systems such as open mail relays and open proxy servers.

                              Address Collection



Spammers may harvest e-mail addresses from a number of sources. A
popular method uses e-mail addresses which their owners have published
for other purposes. Usenet posts, especially those in archives such as
Google Groups, frequently yield addresses. Simply searching the Web for
pages with addresses, such as corporate staff directories, can yield
thousands of addresses, most of them deliverable.

                                Address Collection


Spammers have also subscribed to discussion mailing lists for the purpose
of gathering the addresses of posters. The DNS and WHOIS systems require
the publication of technical contact information for all Internet
domains, and spammers have illegally crawled these resources for e-mail
addresses. Many spammers use programs called web spiders to find e-mail
addresses on web pages. Because spammers offload the bulk of their costs
onto others, however, they can use even more computationally expensive
means to generate addresses.

                                 Address Collection


A dictionary attack consists of an exhaustive attempt to gain access to a
resource by trying all possible credentials, usually usernames and
passwords. Spammers have applied this principle to guessing e-mail
addresses, for example by taking common names and generating likely
e-mail addresses for them at each of thousands of domain names.

Spammers sometimes use various means to confirm addresses as deliverable.
For instance, including a Web bug in a spam message written in HTML may
cause the recipient's mail client to transmit the recipient's address, or
any other unique key, to the spammer's Web site.

                                                     Terminology
To better understand the concepts in this presentation let us consider the
following terminology.

Mail User Agent (MUA). This refers to the program the client uses to
send and receive e-mail. It is usually referred to as the "mail client."
Examples are Pine and Eudora.

Mail Transfer Agent (MTA). This refers to the program running on the
server that stores and forwards e-mail messages. It is usually referred
to as the "mail server program." Examples are sendmail and the Microsoft
Exchange server.

      The Mail Queue




In a normal configuration, sendmail sits in the background waiting for
new messages. When a new connection arrives, a child process is invoked
to handle the connection, while the parent process goes back to listening
for new connections.

When a message is received, the sendmail child process puts it into the
mail queue (usually stored in /var/spool/mqueue). If it is immediately
deliverable, it is delivered and removed from the queue. If it is not
immediately deliverable, it will be left in the queue and the process will
terminate.

Messages left in the queue will stay there until the next time the queue is
processed. The parent sendmail will usually fork a child process to
attempt to deliver anything left in the queue at regular intervals.
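
The queue behaviour described above can be sketched in a few lines of Python. This is only an illustration of the idea, not sendmail's actual implementation: the function `process_queue`, the `try_deliver` callback and the host names are all hypothetical.

```python
from collections import deque

def process_queue(queue, try_deliver):
    """Attempt each queued message once; undelivered messages stay queued."""
    remaining = deque()
    delivered = []
    while queue:
        msg = queue.popleft()
        if try_deliver(msg):
            delivered.append(msg)      # delivered and removed from the queue
        else:
            remaining.append(msg)      # left for the next queue run
    queue.extend(remaining)
    return delivered

# Hypothetical delivery check: only hosts that are currently reachable accept mail.
reachable = {"staff.uiuc.edu"}

def try_deliver(msg):
    return msg["host"] in reachable

queue = deque([
    {"id": "A1", "host": "staff.uiuc.edu"},
    {"id": "B2", "host": "down.example.org"},
])

first_run = process_queue(queue, try_deliver)    # B2 cannot be delivered yet
reachable.add("down.example.org")                # the remote host comes back
second_run = process_queue(queue, try_deliver)   # the next queue run delivers B2
```

Each call to `process_queue` plays the role of one periodic queue run by the forked child process.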
            Structure of E-mail Message


Email messages are composed of two parts:

  1. Headers (lines of the form "field: value" which contain information
about the message, such as "To:", "From:", "Date:", and "Message-ID:")

  2. Body (the text of the message)

                                                                         Example
From johndoe@students.uiuc.edu Mon Jul 5 23:46:19 1999
Received: (from johndoe@localhost)
     by students.uiuc.edu (8.9.3/8.9.3) id LAA05394;
     Mon, 5 Jul 1999 23:46:18 -0500
Received: from staff.uiuc.edu (staff.uiuc.edu [128.174.5.59])
     by students.uiuc.edu (8.9.3/8.9.3) id XAA24214;
     Mon, 5 Jul 1999 23:46:25 -0500
Date: Mon, 5 Jul 1999 23:46:18 -0500
From: John Doe <johndoe@students.uiuc.edu>
To: John Smith <jsmith@staff.uiuc.edu>
Message-Id: <199907052346.LAA05394@students.uiuc.edu>
Subject: This is a subject header.

This is the message body. It is separated from the headers by a blank
line.

The message body can span multiple lines.
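
As a rough illustration of the header/body split, Python's standard `email` package can parse a message like the one above. The message text below is a shortened copy of the example; the variable names are illustrative.

```python
from email import message_from_string
from email.utils import parseaddr

raw = (
    "Received: from staff.uiuc.edu (staff.uiuc.edu [128.174.5.59])\n"
    "     by students.uiuc.edu (8.9.3/8.9.3) id XAA24214;\n"
    "     Mon, 5 Jul 1999 23:46:25 -0500\n"
    "Date: Mon, 5 Jul 1999 23:46:18 -0500\n"
    "From: John Doe <johndoe@students.uiuc.edu>\n"
    "To: John Smith <jsmith@staff.uiuc.edu>\n"
    "Subject: This is a subject header.\n"
    "\n"
    "This is the message body.\n"
)

msg = message_from_string(raw)

subject = msg["Subject"]                     # header lookup is case-insensitive
sender_name, sender_addr = parseaddr(msg["From"])
received = msg.get_all("Received")           # one entry per Received: header
body = msg.get_payload()                     # everything after the blank line
```

The continuation lines of the `Received:` header (those starting with whitespace) are treated as part of the same header field.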
Here is an example SMTP transaction:

  1. Client connects to server's SMTP port (25).
  2. Server: 220 staff.uiuc.edu ESMTP Sendmail 8.10.0/8.10.0 ready; Mon, 13 Mar 2000 14:54:08 -0600
  3. Client: helo students.uiuc.edu
  4. Server: 250 staff.uiuc.edu Hello root@students.uiuc.edu [128.174.5.62], pleased to meet you
  5. Client: mail from: johndoe@students.uiuc.edu
  6. Server: 250 2.1.0 johndoe@students.uiuc.edu... Sender ok
  7. Client: rcpt to: jsmith@staff.uiuc.edu
  8. Server: 250 2.1.5 jsmith@staff.uiuc.edu... Recipient ok
  9. Client: data
 10. Server: 354 Enter mail, end with "." on a line by itself
 11. Client:

Received: (from johndoe@localhost)
     by students.uiuc.edu (8.9.3/8.9.3) id LAA05394;
     Mon, 5 Jul 1999 23:46:18 -0500
Date: Mon, 5 Jul 1999 23:46:18 -0500
From: John Doe <johndoe@students.uiuc.edu>
To: John Smith <jsmith@staff.uiuc.edu>
Message-Id: <199907052346.LAA05394@students.uiuc.edu>
Subject: This is a subject header.

This is the message body. It is separated from the headers by a blank
line.

The message body can span multiple lines.
.
  12. Server: 250 2.0.0 e2DKuDw34528 Message accepted for delivery
  13. Client: quit
  14. Server: 221 2.0.0 staff.uiuc.edu closing connection

The sender and recipient addresses used in the SMTP transaction are called the Message Envelope. Note that these
addresses do not need to have any similarity to the addresses in the message headers!
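
Because the envelope and the headers are independent, a receiving system can compare them. The following is a minimal sketch of such a check using Python's standard `email` package; the function name `envelope_matches_header` is hypothetical, and a mismatch is only a hint, since legitimate mail (for example from mailing lists) can also differ.

```python
from email import message_from_string
from email.utils import parseaddr

def envelope_matches_header(envelope_from, raw_message):
    """Return True if the SMTP envelope sender equals the From: header address."""
    msg = message_from_string(raw_message)
    _, header_addr = parseaddr(msg.get("From", ""))
    return envelope_from.lower() == header_addr.lower()

raw = (
    "From: John Doe <johndoe@students.uiuc.edu>\n"
    "To: John Smith <jsmith@staff.uiuc.edu>\n"
    "Subject: This is a subject header.\n"
    "\n"
    "This is the message body.\n"
)

same = envelope_matches_header("johndoe@students.uiuc.edu", raw)
different = envelope_matches_header("someone@example.org", raw)
```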
               Delivering Spam messages


Early on, spammers discovered that if they sent large quantities of spam
directly from their ISP accounts, recipients would complain and ISPs
would shut their accounts down. Thus, one of the basic techniques of
sending spam has become to send it from someone else's computer and
network connection. By doing this, spammers protect themselves in
several ways: they hide their tracks, get others' systems to do most of the
work of delivering messages, and direct the efforts of investigators
towards the other systems rather than the spammers themselves.



                                                          Mail filters



A mail filter is a piece of software which takes an input of an email
message. For its output, it might pass the message through unchanged for
delivery to the user's mailbox, it might redirect the message for delivery
elsewhere, or it might even throw the message away. Some mail filters are
able to edit messages during processing.
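
A mail filter of this kind can be sketched as a function that returns an action together with the (possibly edited) message. The keyword list, the thresholds and the "[SUSPECT]" subject tag are assumptions made for the example, not part of any real filter.

```python
def mail_filter(message, spam_words=("mortgage", "click here")):
    """Return (action, message); action is 'deliver', 'redirect' or 'discard'."""
    text = message["body"].lower()
    hits = sum(word in text for word in spam_words)
    if hits >= 2:
        return "discard", message              # throw the message away
    if hits == 1:
        # Edit the message during processing: tag the subject, then redirect.
        tagged = dict(message, subject="[SUSPECT] " + message["subject"])
        return "redirect", tagged
    return "deliver", message                  # pass through unchanged

a = mail_filter({"subject": "Offer", "body": "New mortgage rates, click here"})
b = mail_filter({"subject": "Statement", "body": "Your mortgage statement is attached"})
c = mail_filter({"subject": "Lunch", "body": "Lunch tomorrow?"})
```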




Introduction

- Spam filtering is an application of text categorization.
- Spam classification is defined as a binary problem: an e-mail either
  is spam or is not spam.
- Automatic text categorization assigns e-mails to one of these two
  categories, using different methods.
- One of these methods is centroid-based classification.

[Figure: a sample message ("Hello, Hi, this is your opportunity to buy a
house with new mortgage rates. To find more about this, just click
here.") is assigned to one of two categories, SPAM or NOT SPAM.]

Background

- Text classification: classify documents into categories
  - Spam
  - Not spam
- Classification process
  - Preprocess the message
    - Remove tags
    - Remove stop words
    - Stem words
  - Training: build the classification model
  - Testing: evaluate the model

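The preprocessing steps can be sketched as follows. The stop-word list and the suffix-stripping stemmer are deliberately tiny stand-ins chosen for this example; a real system would use a full stop-word list and a proper stemming algorithm.

```python
import re

STOP_WORDS = {"the", "and", "at", "of", "a", "to", "is"}  # small illustrative list

def crude_stem(word):
    """Strip one common suffix; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    text = re.sub(r"<[^>]+>", " ", text)          # remove tags
    words = re.findall(r"[a-z]+", text.lower())   # tokenize
    return [crude_stem(w) for w in words if w not in STOP_WORDS]

tokens = preprocess("<p>The rabbits jumped and rested</p>")
print(tokens)  # ['rabbit', 'jump', 'rest']
```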
Methodologies

- Naive Bayes
- Centroid-based
- Content-based

Bayesianism

Bayesianism is the philosophical tenet that the mathematical theory of
probability applies to the degree of plausibility of a statement, that
is, to the degree of belief a rational agent holds in the truth of that
statement. Applying Bayes' theorem to update such a degree of belief is
called Bayesian inference.

Bayes' Rule

If A and B are two separate but possibly dependent random events, then:

The probability of A and B occurring together = Pr[A,B]

The conditional probability of A, given that B occurs = Pr[A|B]

The conditional probability of B, given that A occurs = Pr[B|A]

From elementary rules of probability:

\[ \Pr[A,B] = \Pr[A \mid B]\,\Pr[B] = \Pr[B \mid A]\,\Pr[A] \]

Dividing the right-hand pair of expressions by Pr[B] gives Bayes' rule:

\[ \Pr[A \mid B] = \frac{\Pr[B \mid A]\,\Pr[A]}{\Pr[B]} \]

In problems of probabilistic inference, we are often trying to estimate the
most probable underlying model for a random process, based on some
observed data or evidence. If A represents a given set of model
parameters, and B represents the set of observed data values, then the
terms in Bayes' rule are given the following names:

➢ Pr[A] is the prior probability of the model A (in the absence of any
evidence)
➢ Pr[B] is the probability of the evidence B
➢ Pr[B|A] is the likelihood that the evidence B was produced, given that
the model was A
➢ Pr[A|B] is the posterior probability of the model being A, given that the
evidence is B.

Mathematically, Bayes' rule states

\[ \text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}} \]
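
A small numeric sketch may help. The probabilities below are invented for the example (they do not come from any data set): suppose half of all mail is spam, a given word appears in 80% of spam and in 10% of legitimate mail.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Bayes' rule for a two-way split: model A (spam) versus not A."""
    # Marginal likelihood Pr[B], summed over both models.
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers, chosen only to illustrate the formula.
p_spam_given_word = posterior(prior=0.5, likelihood=0.8, likelihood_if_not=0.1)
```

With these assumed numbers the posterior is 0.4 / 0.45, or 8/9: seeing the word raises the probability of spam from 50% to about 89%.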




Representing E-mail for Statistical Algorithms

All statistical algorithms for spam filtering begin with a vector
representation of individual e-mail messages, called a term vector.
The length of the term vector is the number of distinct words in all the
e-mail messages in the training data. The entry for a particular word in
the term vector for a particular e-mail message is usually the number of
occurrences of the word in that message.

Training data comprising four labeled e-mail messages

The table below presents toy training data comprising four e-mail messages.
These data contain ten distinct words: the, quick, brown, fox, rabbit, ran,
and, run, at, and rest.

#   Message                         Spam

1   The quick brown fox             no
2   The quick rabbit ran and ran    yes
3   rabbit run run run              no
4   rabbit at rest                  yes

Term vectors corresponding to training data

#   and  at  brown  fox  quick  rabbit  ran  rest  run  the

1    0    0    1     1     1      0      0    0     0    1
2    1    0    0     0     1      1      2    0     0    1
3    0    0    0     0     0      1      0    0     3    0
4    0    1    0     0     0      1      0    1     0    0
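
The term vectors can be computed directly from the four toy messages. This is a minimal sketch using only the standard library; splitting on whitespace and sorting the vocabulary alphabetically are choices made for the example.

```python
from collections import Counter

messages = [
    "The quick brown fox",
    "The quick rabbit ran and ran",
    "rabbit run run run",
    "rabbit at rest",
]

tokenized = [m.lower().split() for m in messages]
# One vector position per distinct word in the training data.
vocabulary = sorted({word for words in tokenized for word in words})
counts = [Counter(words) for words in tokenized]
# Each entry is the number of occurrences of the term in the message.
vectors = [[c[term] for term in vocabulary] for c in counts]

for number, vector in enumerate(vectors, start=1):
    print(number, vector)
```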




If the training data comprise thousands of e-mail messages, the number of
distinct words often exceeds 10,000. Two simple strategies to reduce the
size of the term vector somewhat are to remove “stop words” (words like
and, of, the, etc.) and to reduce words to their root form, a process known
as stemming (so, for example, “ran” and “run” reduce to “run”). Table 3
shows the reduced term vectors along with the spam label.




Term vectors after stemming and stop word removal, spam label coded as
0=no, 1=yes

     X1     X2   X3     X4      X5    X6   Y

#    brown  fox  quick  rabbit  rest  run  Spam
1    1      1    1      0       0     0    0
2    0      0    1      1       0     2    1
3    0      0    0      1       0     3    0
4    0      0    0      1       1     0    1




Naive Bayes for Spam

Let X = (X_1, ..., X_d) denote the term vector for a random e-mail message,
where d is the number of distinct words in the training data, after
stemming and stop-word removal. Let Y denote the corresponding spam
label. The Naive Bayes model seeks to build a model for:

\[ \Pr(Y = 1 \mid X_1 = x_1, \ldots, X_d = x_d). \]

From Bayes' theorem, we have:

\[ \Pr(Y = 1 \mid X_1 = x_1, \ldots, X_d = x_d) = \frac{\Pr(Y = 1)\,\Pr(X_1 = x_1, \ldots, X_d = x_d \mid Y = 1)}{\Pr(X_1 = x_1, \ldots, X_d = x_d)} \]
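
The following sketch trains a Naive Bayes classifier on the four toy messages (term counts after stemming and stop-word removal). It assumes one common variant, a multinomial model with add-one (Laplace) smoothing, which the slides do not specify; the function names are illustrative. Because the denominator is the same for both labels, the code compares only the numerators.

```python
import math

# Columns: brown, fox, quick, rabbit, rest, run (counts from the toy messages).
X = [
    [1, 1, 1, 0, 0, 0],   # "The quick brown fox"
    [0, 0, 1, 1, 0, 2],   # "The quick rabbit ran and ran"
    [0, 0, 0, 1, 0, 3],   # "rabbit run run run"
    [0, 0, 0, 1, 1, 0],   # "rabbit at rest"
]
y = [0, 1, 0, 1]          # 0 = not spam, 1 = spam

def train(X, y):
    model = {}
    n_terms = len(X[0])
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        counts = [sum(col) for col in zip(*rows)]
        total = sum(counts)
        model[label] = {
            "log_prior": math.log(len(rows) / len(X)),
            # Add-one smoothing so unseen terms do not give probability zero.
            "log_lik": [math.log((c + 1) / (total + n_terms)) for c in counts],
        }
    return model

def predict(model, x):
    def score(label):
        m = model[label]
        return m["log_prior"] + sum(xi * ll for xi, ll in zip(x, m["log_lik"]))
    return max(model, key=score)

model = train(X, y)
label_a = predict(model, [0, 0, 0, 1, 1, 0])   # a message with "rabbit" and "rest"
label_b = predict(model, [1, 1, 0, 0, 0, 0])   # a message with "brown" and "fox"
```

The "naive" part is the assumption that terms are independent given the label, which lets the joint likelihood be written as a product of per-term probabilities.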



Centroid-based method

- The documents are represented using a vector-space model.
- Each document is represented as a term frequency (TF) vector.

[Figure: documents d1, d2, d3 and d4 plotted as vectors in a plane with
term axes t1 and t2.]

Centroid-based method

- A refinement of this model is the inverse document frequency (IDF).
- Its purpose is to limit the discrimination power of frequent terms and
  stop words, and to emphasize words that appear in specific documents.
- The IDF of term i is log(N/df_i), where N is the total number of
  documents and df_i is the number of documents that contain term i.
- The length of each document vector is normalized.

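As a sketch of the TF-IDF weighting, the function below multiplies each term count by log(N/df_i) and then scales every document vector to unit length. The input counts are the toy term vectors (brown, fox, quick, rabbit, rest, run); the function name `tf_idf` and the choice of the natural logarithm are assumptions for the example.

```python
import math

def tf_idf(doc_term_counts):
    n_docs = len(doc_term_counts)
    n_terms = len(doc_term_counts[0])
    # df[i]: number of documents that contain term i.
    df = [sum(1 for doc in doc_term_counts if doc[i] > 0) for i in range(n_terms)]
    vectors = []
    for doc in doc_term_counts:
        weighted = [
            tf * math.log(n_docs / df[i]) if df[i] else 0.0
            for i, tf in enumerate(doc)
        ]
        norm = math.sqrt(sum(w * w for w in weighted))
        # Normalize to unit length so long documents do not dominate.
        vectors.append([w / norm for w in weighted] if norm else weighted)
    return vectors

counts = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 2],
    [0, 0, 0, 1, 0, 3],
    [0, 0, 0, 1, 1, 0],
]
weighted_docs = tf_idf(counts)
```

In the first document, "brown" (found in one document) ends up with a larger weight than "quick" (found in two), which is the emphasis on specific words that IDF is meant to give.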
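Centroid-based method

- The distance between two document vectors d_i and d_j is measured with
  the cosine function:

  \[ \cos(d_i, d_j) = \frac{d_i \cdot d_j}{\lVert d_i \rVert \, \lVert d_j \rVert} \]

- Finally, one centroid vector C is defined for each category (spam / not
  spam) as the average of the document vectors in that category S:

  \[ C = \frac{1}{|S|} \sum_{d \in S} d \]
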
Centroid-based method

- We can measure the similarity between a document d and the centroid C
  of a category with the following function:

  \[ \cos(d, C) = \frac{d \cdot C}{\lVert d \rVert \, \lVert C \rVert} \]

Steps: Centroid-based Method

1. TRAINING
   Determine the document vectors using TF-IDF.

[Figure: training documents d1 to d8 plotted against term axes t1 and t2.]

Steps: Centroid-based Method

1. TRAINING
   Calculate the centroid for the categories SPAM and NOT SPAM.

[Figure: the same plot with the centroids C_SPAM and C_NOT SPAM marked.]

Steps: Centroid-based Method

2. CLASSIFICATION
   Given a new document dn, calculate its document vector representation
   (as in the training stage).

[Figure: the new document dn plotted against term axes t1 and t2.]

Steps: Centroid-based Method

2. CLASSIFICATION
   Measure the distance between the vector dn and the centroids of the
   categories SPAM / NOT SPAM.

[Figure: dn plotted together with the centroids C_SPAM and C_NOT SPAM.]

Steps: Centroid-based Method

3. FINAL RESULT
   Assign the document to the category whose centroid is most similar
   to it:

   \[ \arg\max_{i} \cos(d_n, C_i) \]

   for i=1,2 where 1=SPAM and 2=NOT SPAM

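The training and classification steps can be put together in a short sketch. For brevity it uses raw term counts from the toy messages instead of TF-IDF weights, which is a simplification made for the example; the function names are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise average of the document vectors of one category."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def train(vectors, labels):
    return {
        lab: centroid([v for v, l in zip(vectors, labels) if l == lab])
        for lab in set(labels)
    }

def classify(centroids, vector):
    # Pick the category whose centroid has the highest cosine similarity.
    return max(centroids, key=lambda lab: cosine(vector, centroids[lab]))

# Columns: brown, fox, quick, rabbit, rest, run.
docs = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 2],
    [0, 0, 0, 1, 0, 3],
    [0, 0, 0, 1, 1, 0],
]
labels = ["not spam", "spam", "not spam", "spam"]

centroids = train(docs, labels)
result_a = classify(centroids, [0, 0, 0, 1, 1, 0])
result_b = classify(centroids, [1, 1, 0, 0, 0, 0])
```

Training reduces each category to a single vector, so classifying a new document needs only one similarity computation per category.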
Analysis of Results

- The standard measures of performance for text classification methods
  are precision and recall:

  \[ P = \frac{\text{number of correctly predicted positives}}{\text{number of predicted positive examples}} \]

  \[ R = \frac{\text{number of correctly predicted positives}}{\text{number of all positive examples}} \]

Analysis of Results

- Neither precision nor recall gives a good measure by itself. To get an
  idea of the performance, we have to combine them:

  \[ F = \frac{2PR}{P + R} \]

[Figure: recall R plotted against precision P, with the "better" region
marked.]

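Precision, recall and their combination F can be computed from a list of true labels and a list of predicted labels. The labels below are made up for the example, and "spam" is taken as the positive class.

```python
def precision_recall_f(actual, predicted, positive="spam"):
    true_pos = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    predicted_pos = sum(p == positive for p in predicted)
    actual_pos = sum(a == positive for a in actual)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical labels: three spam and two legitimate messages.
actual    = ["spam", "spam", "spam", "ham", "ham"]
predicted = ["spam", "spam", "ham",  "spam", "ham"]

p, r, f = precision_recall_f(actual, predicted)
```

Here two of the three predicted positives are correct and two of the three actual positives are found, so P, R and F all equal 2/3.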
Some results

- Compared against kNN and Naive Bayes, the centroid method performs
  better.

Content Based Approach

- Spam can be detected before the message is read (non-content based):
  - Based on a special protocol [3] (VoIP protocol)
  - Based on the address book [1] (build an e-mail network)
  - Based on the IP address [4]
  - ...
- Or after the content of the e-mail has been processed (content based)

Content-based Approach

- Non-content based approach
  - removes a spam message before it is read if it contains viruses or
    worms
  - leaves some messages unlabeled
- Content based method
  - widely used
  - may need many pre-labeled messages
  - labels a message based on its content
  - Zdziarski [5] said that it is possible to stop spam, and that
    content-based filters are the way to do it
- The rest of this presentation focuses on content based methods

Content-based methods

- Bayesian based method [6]
- Centroid-based method [7]
- Machine learning method [8]
  - Latent Semantic Indexing (LSI)
  - Contextual Network Graphs (CNG)
- Rule based method [9]
  - ripper rule: a list of predefined rules that can be changed by hand
- Memory based method [10]
  - saves cost

Measurement

- Accuracy: the percentage of messages classified correctly,
  correct / (correct + incorrect)
- False positive: a legitimate message that is misclassified as spam
- Goal:
  - Improve accuracy
  - Prevent false positives

[Figure: diagram of the "No spam" and "Spam" classes with the correct,
incorrect and false positive regions marked.]

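A minimal sketch of these two measurements follows. It assumes the usual convention that a false positive is a legitimate message labeled as spam; the label lists are invented for the example.

```python
def accuracy_and_false_positives(actual, predicted):
    correct = sum(a == p for a, p in zip(actual, predicted))
    # Assumed convention: legitimate mail that the filter marked as spam.
    false_positives = sum(
        a == "not spam" and p == "spam" for a, p in zip(actual, predicted)
    )
    return correct / len(actual), false_positives

actual    = ["spam", "not spam", "not spam", "spam",     "not spam"]
predicted = ["spam", "spam",     "not spam", "not spam", "not spam"]

accuracy, false_positives = accuracy_and_false_positives(actual, predicted)
```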
Measurement

[Figure: the entire document collection divided into relevant and
irrelevant documents and into retrieved and not retrieved documents,
giving four groups: retrieved and relevant, retrieved and irrelevant,
not retrieved but relevant, not retrieved and irrelevant.]

\[ \text{recall} = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents}} \]

\[ \text{precision} = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}} \]

Rule-based method

- A list of predefined rules that can be changed by hand
  - ripper rule
- Each rule/test is associated with a score
- If an e-mail fails a rule, its score is increased
- After all rules are applied, if the score is above a certain threshold,
  the e-mail is classified as spam

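The scoring scheme can be sketched as a list of (description, test, points) entries. The specific rules, point values and threshold below are invented for illustration and are not taken from any real rule set.

```python
# Each rule: (description, test function, points added when the rule is triggered).
RULES = [
    ("mentions 'click here'", lambda m: "click here" in m["body"].lower(), 2.0),
    ("more than 3 pictures",  lambda m: m["n_pictures"] > 3,               1.5),
    ("body over 100 kB",      lambda m: len(m["body"]) > 100_000,          1.0),
]
THRESHOLD = 3.0   # hypothetical cut-off

def score(message):
    return sum(points for _, test, points in RULES if test(message))

def is_spam(message):
    return score(message) > THRESHOLD

offer = {"body": "Just click here for cheap mortgage rates", "n_pictures": 5}
note = {"body": "Meeting at noon", "n_pictures": 0}

offer_is_spam = is_spam(offer)   # triggers two rules: 2.0 + 1.5 = 3.5
note_is_spam = is_spam(note)     # triggers none: 0.0
```

Changing the filter means editing `RULES` by hand, which is the maintenance cost this approach carries.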
Rule-based method

- Advantages:
  - able to employ diverse and specific rules to check for spam
    - check the size of the e-mail
    - check the number of pictures it contains
  - no training messages are needed
- Disadvantage:
  - rules have to be entered and maintained by hand; they cannot be
    updated automatically

Latent Semantic Indexing

- Keyword
  - an important word for text classification
  - a word that occurs with high frequency in a message
  - can be used as an indicator for the message
- Why LSI?
  - Polysemy: a word can be used in more than one category (e.g., "play")
  - Synonymy: two words can have identical meaning
- Based on a nearest-neighbors algorithm

Latent Semantic Indexing

- Considers semantic links between words
  - searches for keywords over the semantic space
  - two words that have the same meaning are treated as one word, which
    eliminates synonymy
- Considers the overlap between different messages; this overlap may
  indicate:
  - polysemy or a stop word
  - two messages in the same category

Latent Semantic Indexing

- Step 1: build a term-document matrix X from the input documents

  Doc1: computer science department
  Doc2: computer science and engineering science
  Doc3: engineering school

        computer  science  department  and  engineering  school
Doc1    1         1        1           0    0            0
Doc2    1         2        0           1    1            0
Doc3    0         0        0           0    1            1
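
The matrix in Step 1 can be reproduced with a few lines of Python. The sketch keeps one row per document, as in the table above; the variable names are illustrative.

```python
docs = {
    "Doc1": "computer science department",
    "Doc2": "computer science and engineering science",
    "Doc3": "engineering school",
}
terms = ["computer", "science", "department", "and", "engineering", "school"]

# One row per document, one column per term; entries are occurrence counts.
matrix = {name: [text.split().count(t) for t in terms] for name, text in docs.items()}

for name, row in matrix.items():
    print(name, row)
```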




Latent Semantic Indexing

- Step 2: Singular Value Decomposition (SVD) is performed on the matrix X
  - to extract a set of linearly independent factors that describe the
    matrix
  - to generalize terms that have the same meaning
  - three new matrices T, S and D are produced, which reduce the size of
    the vocabulary

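In matrix notation, the decomposition in Step 2 is commonly written as below. This is the standard SVD form; the rank k of the truncation is a parameter the slides do not specify, and the symbols X_k, T_k, S_k and D_k are introduced here only for illustration.

```latex
\[
X = T S D^{\mathsf{T}}, \qquad
X \approx X_k = T_k S_k D_k^{\mathsf{T}}
\]
```

Here T holds the term vectors, S is the diagonal matrix of singular values, and D holds the document vectors. Keeping only the k largest singular values gives the reduced representation.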
Latent Semantic Indexing

- Two documents can be compared by finding the distance between their
  document vectors, stored in matrix X1
- Text classification is done by finding the nearest neighbors: the
  message is assigned to the category that contains the most of them

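The nearest-neighbor vote can be sketched as follows. The two-dimensional vectors stand in for reduced document vectors and are invented for the example, as are the choice of Euclidean distance and k = 3.

```python
import math
from collections import Counter

def knn_classify(training, vector, k=3):
    """training: list of (vector, label) pairs; returns the majority label
    among the k training vectors closest to `vector`."""
    ranked = sorted(training, key=lambda item: math.dist(vector, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical reduced document vectors.
training = [
    ([1.0, 0.0], "spam"),
    ([0.9, 0.1], "spam"),
    ([0.0, 1.0], "not spam"),
    ([0.1, 0.9], "not spam"),
    ([0.2, 0.8], "not spam"),
]

label_a = knn_classify(training, [0.15, 0.85])
label_b = knn_classify(training, [0.95, 0.05])
```

`math.dist` requires Python 3.8 or later.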
[Figure: spam, not spam and test messages plotted as points.] The nearest
neighbors method classifies the test message as NOT SPAM.

Latent Semantic Indexing

- Advantages:
  - the entire training set can be learned at the same time
  - no intermediate model needs to be built
  - good when the training set is predefined
- Disadvantages:
  - when a new document is added, the matrix X changes and T, S and D
    need to be recalculated
  - time consuming
  - a real classifier needs the ability to change the training set

Contextual Network Graphs

- A weighted, bipartite, undirected graph of term and document nodes

[Figure: document nodes d1 and d2 connected to term nodes t1, t2 and t3
by weighted edges w11, w12, w13 (from d1) and w21, w22, w23 (from d2).
For example, at t1: w11 + w21 = 1, and at d1: w11 + w12 + w13 = 1.]

- At any time, for each node, the sum of the weights is 1

Contextual Network Graphs

- When a new document d is added, the weights at node d are energized,
  and the weights at the connected nodes may need to be rebalanced
- The document is assigned to the class with the maximum average energy
  (weight)

Comparison: Bayesian, LSI, CNG, centroid, rule-based

                   Bayesian        LSI               CNG               Centroid            Rule-based

Classification     Statistical/    Generalization/   Energy            Cosine similarity   Predefined
                   probabilities   contextual data   re-balancing      (TF-IDF)            rules
Learning           Update          Recalculation     Addition of       Recalculation       Test against
                   statistics      of matrices       nodes to graph    of centroid         rules
Semantic           No              Yes               Yes               No                  No
Automatic          Yes             Yes               Yes               Yes                 No
update of model
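
As a rough illustration of the "Centroid" column (cosine similarity over TF-IDF, learning by recalculating the centroid), here is a minimal sketch. It assumes a smoothed IDF of log(N/df) + 1 and whitespace-tokenized toy messages; the function names and data are hypothetical and are not taken from [7].

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs: list of token lists. Returns (list of {term: weight}, idf table)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n / df[term]) + 1.0 for term in df}  # smoothing is an assumption
    vectors = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return vectors, idf


def centroid(vectors):
    """Average a list of sparse vectors component by component."""
    total = Counter()
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(a, b):
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns ({label: centroid}, idf)."""
    vectors, idf = tfidf_vectors([tokens for tokens, _ in labeled_docs])
    by_class = {}
    for vec, (_, label) in zip(vectors, labeled_docs):
        by_class.setdefault(label, []).append(vec)
    return {label: centroid(vs) for label, vs in by_class.items()}, idf


def classify_by_centroid(centroids, idf, tokens):
    # Terms never seen in training carry no weight.
    vec = {t: c * idf[t] for t, c in Counter(tokens).items() if t in idf}
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))


training = [
    (["cheap", "pills"], "spam"),
    (["cheap", "offer"], "spam"),
    (["meeting", "agenda"], "ham"),
    (["project", "meeting"], "ham"),
]
centroids, idf = train(training)
print(classify_by_centroid(centroids, idf, ["cheap", "offer", "today"]))
```

Updating the model with new mail means calling `train` again, which matches the "Recalculation of centroid" entry in the table.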
Result and conclusion:

• LSI and CNG surpass the Bayesian approach by 5% in accuracy, and reduce false positives and false negatives by up to 71%
• LSI and CNG show better performance even with a small document set




Comparison: content-based and non-content-based

• Non-content-based:
    Disadvantages:
        Depends on special factors such as email address, IP address, or a special protocol
        Leaves some messages unclassified
    Advantage:
        Detects spam before reading the message, with high accuracy

• Content-based:
    Disadvantages:
        Needs training messages
        Classification is not 100% correct, because spammers also know the anti-spam techniques
    Advantage:
        Leaves no message unclassified




Improvement for spam filtering

• Combine both methods
    [1] proposes an email-network-based algorithm that reaches 100% accuracy but leaves 47% of messages unclassified; combining it with a content-based method can improve performance
• Build up multiple layers [11]

[11] Chris Miller, A Layered Approach to Enterprise Antispam
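
One way to read "combine both methods" is a cascade: ask the non-content-based classifier first and fall back to a content-based one only when it abstains. The sketch below illustrates that reading under stated assumptions; the sender table, the keyword list and all function names are hypothetical stand-ins, not the algorithms from [1] or [11].

```python
from typing import Callable, Optional

Message = dict  # assumed shape: {"sender": str, "body": str}


def cascade(message: Message,
            non_content: Callable[[Message], Optional[str]],
            content: Callable[[Message], str]) -> str:
    """Use the non-content verdict when there is one, else the content verdict."""
    verdict = non_content(message)
    return verdict if verdict is not None else content(message)


# Hypothetical stand-in for a non-content method: a table of known senders.
KNOWN_SENDERS = {"alice@example.org": "ham", "promo@bulk.example": "spam"}


def sender_classifier(message: Message) -> Optional[str]:
    # Returns None for unknown senders, i.e. the message stays unclassified.
    return KNOWN_SENDERS.get(message["sender"])


# Hypothetical stand-in for a content-based method: a keyword check.
SPAM_WORDS = {"cheap", "winner", "free"}


def keyword_classifier(message: Message) -> str:
    words = set(message["body"].lower().split())
    return "spam" if words & SPAM_WORDS else "ham"


inbox = [
    {"sender": "alice@example.org", "body": "free lunch on Friday"},
    {"sender": "unknown@example.net", "body": "cheap pills today"},
    {"sender": "unknown@example.net", "body": "meeting at noon"},
]
results = [cascade(m, sender_classifier, keyword_classifier) for m in inbox]
print(results)
```

The ordering keeps the high-accuracy verdicts of the first stage and uses the second stage only to cover the messages it leaves unclassified.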




Data set for spam

• Non-content-based:
    Email network: one author's email corpus, consisting of 5,486 messages
    IP address: none
• Content-based:




Data set for spam

• LSI and CNG:
    Corpus of varying size (250 to 4000)
    Spam and non-spam emails in equal amounts
• Bayesian-based:
    Corpus of 1789 emails
    211 spam, 1578 non-spam
• Centroid-based:
    200 email messages in total
    90 spam, 110 non-spam


Most recently used benchmarks

• Reuters:
    About 7,700 training and 3,000 test documents, 30,000 terms, 135 categories, 21 MB
    Each category has about 57 instances
    Collection of newswire stories
• 20NG (20 Newsgroups):
    About 18,800 documents in total, 94,000 terms, 20 topics, 25 MB
    Each category has about 1,000 instances
• WebKB:
    About 8,300 documents, 7 categories, 26 MB
    Each category has about 1,200 instances
    Data from 4 university websites

• The three benchmarks above are well known in recent IR work, are small in size, and are used to test performance and CPU scalability
Benchmarks

• OHSUMED:
    348,566 documents, 230,000 terms and 308,511 topics, 400 MB
    Each category has about 1 instance
    Abstracts from medical journals
• Dmoz:
    482 topics, 300 training documents for each topic, 271 MB
    Each category has less than 1 instance
    Taken from the Dmoz (http://dmoz.org/) topic tree

• Large datasets, used to test the memory scalability of a model
Some facts

• Spam is a growing problem, and research on this topic has become more relevant in recent years
• Spam grows because it works
• Many commercial products try to fight spam. Most of them rely on the techniques presented here, or on combinations of them
• Spam damages the economy more than hackers or viruses
Some facts

• Damages attributed to spam are estimated at around 10.4 billion in 2003 and 58 to 112 billion in 2004, and are projected to exceed 200 billion worldwide in 2005
• 1.6 trillion unsolicited messages were sent in 2004




Conclusions

• Spam is a problem with a great impact on global business
• We presented three methods for spam classification
• The benchmarks on these three methods suggest that a combination of the methods performs better than any method alone


Conclusions

• Spam classifiers can be content-based or non-content-based
• Content-based: rules, Naïve Bayes, centroid
• Non-content-based classifiers work without reading the content of the mail




Conclusions

• Researchers have found ways to increase the accuracy of all the methods, using heuristics and combining them
• Spammers also learn how to avoid spam filters
• No single method is perfect in all situations




Sources
• Slide 1, image: http://www.ecommerce-guide.com
• Slide 1, image: http://www.email-firewall.jp/products/das.html




References
• Anti-spam Filtering: A Centroid-based Classification Approach, Nuanwan Soonthornphisaj, Kanokwan Chaikulseriwat, Piyan Tang-On, 2002
• Centroid-Based Document Classification: Analysis & Experimental Results, Eui-Hong (Sam) Han and George Karypis, 2000
• Multi-dimensional Text Classification, Thanaruk Theeramunkong, 2002
• Improving centroid-based text classification using term-distribution-based weighting system and clustering, Thanaruk Theeramunkong and Verayuth Lertnattee
• Combining Homogeneous Classifiers for Centroid-based Text Classification, Verayuth Lertnattee and Thanaruk Theeramunkong




References
[1] P. Oscar Boykin and Vwani Roychowdhury, Personal Email Networks: An Effective Anti-Spam Tool, IEEE Computer, volume 38, 2004
[2] Andras A. Benczur, Karoly Csalogany, Tamas Sarlos and Mate Uher, SpamRank - Fully Automatic Link Spam Detection, citeseer.ist.psu.edu/benczur05spamrank.html
[3] R. Dantu, P. Kolan, "Detecting Spam in VoIP Networks", Proceedings of USENIX, SRUTI (Steps for Reducing Unwanted Traffic on the Internet) workshop, July 05 (accepted)
[4] IP addresses in email clients, http://www.ceas.cc/papers-2004/162.pdf
[5] Plan for Spam, http://www.paulgraham.com/spam.html




                                                             References

[6] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz. 1998, “A Bayesian
    Approach to Filtering Junk E-Mail”, Learning for Text Categorization – Papers from
    the AAAI Workshop, pages 55–62, Madison Wisconsin. AAAI Technical Report
    WS-98-05
[7] N. Soonthornphisaj, K. Chaikulseriwat, P. Tang-On, “Anti-Spam Filtering: A
    Centroid Based Classification Approach”, IEEE proceedings ICSP 02
[8] Spam Filtering Using Contextual Networking Graphs
    www.cs.tcd.ie/courses/csll/dkellehe0304.pdf
[9] W.W. Cohen, “Learning Rules that Classify e-mail”, In Proceedings of the AAAI
    Spring Symposium on Machine Learning in Information Access, 1996
[10] G. Sakkis, I. Androutsopoulos, G. Paliouras, V. Karkaletsis, C.D. Spyropoulos, P.
   Stamatopoulos, “A memory based approach to anti-spam filtering for mailing lists”,
   Information Retrieval 2003



