A Suffix Tree Approach to Text Classification Applied to Email Filtering

Rajesh Pampapathi, Boris Mirkin, Mark Levene
School of Computer Science and Information Systems
Birkbeck College, University of London
        Introduction – Outline

 • Motivation: Examples of Spam
 • Suffix Tree construction
 • Document scoring and classification
 • Experiments and results
 • Conclusion
1. Standard spam mail
Buy cheap medications online, no prescription needed.
We have Viagra, Pherentermine, Levitra, Soma, Ambien, Tramadol and many more
products.
No embarrasing trips to the doctor, get it delivered directly to your door.

Experienced reliable service.
Most trusted name brands.


For your solution click here: http://www.webrx-doctor.com/?rid=1000
4. Word salads
Buy meds online and get it shipped to your door Find out more here
<http://www.gowebrx.com/?rid=1001>

a publications website accepted definition. known are can Commons the be
definition. Commons UK great public principal work Pre-Budget but an can
Majesty's many contains statements statements titles (eg includes have website.
health, these Committee Select undertaken described may publications
5. Embedded message (plus word salad)
 zygotes zoogenous zoometric zygosphene zygotactic zygoid zucchettos zymolysis zoopathy
 zygophyllaceous zoophytologist zygomaticoauricular zoogeologist zymoid zoophytish
 zoospores zygomaticotemporal zoogonous zygotenes zoogony zymosis zuza zoomorphs
 zythum zoonitic zyzzyva zoophobes zygotactic zoogenous zombies zoogrpahy zoneless
 zoonic zoom zoosporic zoolatrous zoophilous zymotically zymosterol

 FreeHYSHKRODMonthQGYIHOCSupply.IHJBUMDSTIPLIBJTJUBIYYXFN

 * GetJIIXOLDViagraPWXJXFDUUTabletsNXZXVRCBX
 <http://healthygrow.biz/index.php?id=2>

 zonally zooidal zoospermia zoning zoonosology zooplankton zoochemical zoogloeal
 zoological zoologist zooid zoosphere zoochemical

 & Safezoonal andNGASXHBPnatural
 & TestedQLOLNYQandEAVMGFCapproved

 zonelike zoophytes zoroastrians zonular zoogloeic zoris zygophore zoograft zoophiles
 zonulas zygotic zymograms zygotene zootomical zymes zoodendrium zygomata zoometries
 zoographist zygophoric zoosporangium zygotes zumatic zygomaticus zorillas zoocurrent
 zooxanthella zyzzyvas zoophobia zygodactylism zygotenes zoopathological
 noZFYFEPBmas <http://healthygrow.biz/remove.php>
Creating a Suffix Tree

[Figure: the suffix trees of "MEET" and "FEET", and the combined tree under a
common ROOT; each node is labelled with a character and its frequency, e.g.
the root's E node has frequency 4 because "E" begins two suffixes of each
word.]
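The construction in the figure fits in a few lines. Below is a minimal Python sketch (all names are mine, and the depth cap is a practical choice the slides do not specify): every suffix of every training string is walked down from the root, incrementing a frequency counter at each node.

    from collections import defaultdict

    class Node:
        """One character position in the tree; freq counts how many
        suffixes pass through this node."""
        def __init__(self):
            self.freq = 0
            self.children = defaultdict(Node)

    def build_tree(documents, max_depth=8):
        """Insert every suffix of every training document, capping depth."""
        root = Node()
        for doc in documents:
            for i in range(len(doc)):
                node = root
                for ch in doc[i:i + max_depth]:
                    node = node.children[ch]
                    node.freq += 1
        return root

    # The slide's toy example: one tree built from both "MEET" and "FEET".
    tree = build_tree(["MEET", "FEET"])
    print(tree.children["E"].freq)   # 4: "E" begins two suffixes of each word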
Levels of Information

 • Characters: the alphabet (and their frequencies) of a class.
 • Matches: between query strings and a class.
     s = nviaXgraU>Tabl$$$ets
     t = xv^ia$graTab£££lets
     Matches(s, t) = {v, ia, gra, Tab, l, ets, $}
     - But what about overlapping matches?
 • Trees: properties of the class as a whole.
     - size
     - density (complexity)
Document Similarity Measure

The score for a document, d, is the sum of the scores for each suffix:

    SCORE(d, T) = (1/τ) · Σ_{i=0}^{n} score(d(i), T)

- d(i) is the suffix of d beginning at the ith letter
- τ (tau) is a tree normalisation coefficient
Substring Similarity Measure

The score for a match, m = m0 m1 m2 … mn, is score(m):

    score(m) = v(m|T) · Σ_{t=0}^{n} Φ[p(m_t)]

- T is the tree profile of the class.
- v(m|T) is a normalisation coefficient based on the properties of T.
- p(m_t) is the probability of the character, m_t, of the match m.
- Φ[p] is a significance function.
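Continuing the sketch above, one way to read the two formulas in code: walk each suffix into the tree for as long as it matches, taking at each step the character's probability among its siblings (my reading of p(m_t); the paper's exact estimator may differ). v(m|T) and τ are fixed at 1 here, i.e. the unnormalised case.

    def score_match(m, root, phi=lambda p: p):
        """score(m) = v(m|T) * sum_t phi[p(m_t)], with v(m|T) = 1.
        p(m_t) is the node's frequency over its siblings' total."""
        node, total = root, 0.0
        for ch in m:
            if ch not in node.children:
                break                          # the match ends here
            siblings = sum(c.freq for c in node.children.values())
            node = node.children[ch]
            total += phi(node.freq / siblings)
        return total

    def score_document(d, root, tau=1.0, phi=lambda p: p):
        """SCORE(d, T) = (1/tau) * sum over every suffix d(i) of d."""
        return sum(score_match(d[i:], root, phi) for i in range(len(d))) / tau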
Decision Mechanism

    SCORE_H(d, T_H) / SCORE_S(d, T_S) > threshold  ⇒  HAM

    SCORE_H(d, T_H) / SCORE_S(d, T_S) < threshold  ⇒  SPAM
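As a sketch of the decision rule (the zero-score guard and tie-breaking toward SPAM are my choices, not specified by the slides):

    def classify(d, ham_tree, spam_tree, threshold=1.0):
        """Label a document by the ratio of its class scores."""
        spam = score_document(d, spam_tree)
        ham = score_document(d, ham_tree)
        if spam == 0:                   # no resemblance to spam at all
            return "HAM"
        return "HAM" if ham / spam > threshold else "SPAM"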
Specifications of Φ[p]
(character level)

Constant: 1
  Linear: p
  Square: p²
    Root: p^0.5
   Logit: ln(p) − ln(1−p)
 Sigmoid: (1 + exp(−p))⁻¹

Note: Logit and Sigmoid need to be adjusted to fit in the range [0,1]
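The six Φ[p] choices written out as plain functions, to be passed as the phi argument of the scoring sketch above (e.g. phi=SIGNIFICANCE["root"]). The [0,1] adjustment the slide mentions is left out, since the slides do not say how it is done:

    import math

    # Logit is unbounded and sigmoid maps [0,1] to about [0.5, 0.73],
    # so both would still need rescaling into [0,1] as the slide notes.
    SIGNIFICANCE = {
        "constant": lambda p: 1.0,
        "linear":   lambda p: p,
        "square":   lambda p: p ** 2,
        "root":     lambda p: math.sqrt(p),
        "logit":    lambda p: math.log(p) - math.log(1.0 - p),
        "sigmoid":  lambda p: 1.0 / (1.0 + math.exp(-p)),
    }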
Threshold Variation
~ Significance functions ~

[Figure: classification errors as the decision threshold is varied, one plot
per significance function.]
Match normalisation

Match unnormalised:             v(m|T) = 1

Match permutation normalised:   v(m|T) = f(m|T) / Σ_{i∈(m*|T)} f(i)

Match length normalised:        v(m|T) = f(m|T) / Σ_{i∈(m′|T)} f(i)

- f(s|T) is the frequency of string s in the tree T
- m* is the set of all strings formed by permutations of m
- m′ is the set of all strings of length equal to the length of m
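A sketch of the two normalised variants against the tree built earlier (the enumeration strategies are mine; permutation normalisation is factorial in the match length, so it is only workable because matches are depth-capped and short):

    from itertools import permutations

    def f(s, root):
        """f(s|T): frequency at the node reached by walking s (0 if absent)."""
        node = root
        for ch in s:
            if ch not in node.children:
                return 0
            node = node.children[ch]
        return node.freq

    def v_permutation(m, root):
        """f(m|T) over the summed frequencies of all permutations of m
        that occur in the tree."""
        denom = sum(f("".join(p), root) for p in set(permutations(m)))
        return f(m, root) / denom if denom else 0.0

    def v_length(m, root):
        """f(m|T) over the summed frequencies of every node at depth
        len(m), i.e. over all strings in the tree of m's length."""
        level = [root]
        for _ in range(len(m)):
            level = [c for n in level for c in n.children.values()]
        denom = sum(n.freq for n in level)
        return f(m, root) / denom if denom else 0.0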
Match normalisation

[Figure: classification results under the three match normalisations.]

MUN: match unnormalised; MPN: permutation normalised; MLN: length normalised
Threshold Variation
~ match normalisation ~

[Figure: errors against threshold with a constant significance function,
unnormalised (left) and match normalised (right).]
Specifications of τ
(tree normalisation)

Unnormalised: 1
     Size(T): the total number of nodes
  Density(T): the average number of children of internal nodes
   AvFreq(T): the average frequency of nodes
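The three τ candidates can be collected in a single walk of the tree (a sketch; whether the root itself is counted is my choice):

    def tau(root, kind="size"):
        """Size(T), Density(T) or AvFreq(T) from one traversal."""
        nodes = internal = child_links = freq_sum = 0
        stack = [root]
        while stack:
            n = stack.pop()
            nodes += 1
            freq_sum += n.freq
            if n.children:
                internal += 1
                child_links += len(n.children)
            stack.extend(n.children.values())
        if kind == "size":
            return nodes
        if kind == "density":
            return child_links / internal if internal else 1.0
        if kind == "avfreq":
            return freq_sum / nodes
        return 1.0                      # "unnormalised"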
Androutsopoulos et al. (2000)
~ Ling-Spam Corpus ~

                     Pre-processing           Features    Spam Recall Error   Spam Precision Error
Naïve Bayes (NB)     Lemmatizer + Stop-List   100         17.22%              0.51%
Suffix Tree (ST)     None                     N/A         2.50%               0.21%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List   Unlimited   0.84%               2.86%

                     Pre-processing           Features    Spam Recall Error   Spam Precision Error
Naïve Bayes (NB)     Lemmatizer + Stop-List   300         36.95%              0%
Suffix Tree (ST)     None                     N/A         3.96%               0%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List   Unlimited   10.42%              0%
~ SpamAssassin Corpus ~

                     Pre-processing           False Positive Rate   False Negative Rate
Suffix Tree (ST)     None                     3.50%                 3.25%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List   10.50%                1.50%

~ Ling-BKS Corpus ~

                     Pre-processing           False Positive Rate   False Negative Rate
Suffix Tree (ST)     None                     0%                    0%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List   0%                    12.25%
Conclusions

 • Good overall classifier
    - improvement on naïve Bayes
    - but there's still room for improvement
 • Can one method ever maintain 100% accuracy?
 • Extending the classifier
 • Applications to other domains
    - web page classification (future work: ODP)
Computational Performance

Data Set             Training (s)   Av. Spam (ms)   Av. Ham (ms)   Av. Peak Mem.
LS-FULL (7.40MB)     63             843             659            765MB
LS-11 (1.48MB)       36             221             206            259MB
SAeh-11 (5.16MB)     155            504             2528           544MB
BKS-LS-11 (1.12MB)   41             161             222            345MB
Experimental Data Sets

 • Ling-Spam (LS)
    - Spam (481) collected by Androutsopoulos et al.
    - Ham (2412) from an online linguists' bulletin board
 • Spam Assassin
    - Easy (SAe)
    - Hard (SAh)
    - Spam (1876) and ham (4176) examples donated
 • BBK
    - Spam (652) collected by Birkbeck
Androutsopoulos et al. (2000)
~ Ling-Spam Corpus ~

Classifier Configuration   Threshold   No. of Attrib.   Spam Recall   Spam Precision
Bare                       0.5         50               81.10%        96.85%
Stop-List                  0.5         50               82.35%        97.13%
Lemmatizer                 0.5         100              82.35%        99.02%
Lemmatizer + Stop-List     0.5         100              82.78%        99.49%
Bare                       0.9         200              76.94%        99.46%
Stop-List                  0.9         200              76.11%        99.47%
Lemmatizer                 0.9         100              77.57%        99.45%
Lemmatizer + Stop-List     0.9         100              78.41%        99.47%
Bare                       0.999       200              73.82%        99.43%
Stop-List                  0.999       200              73.40%        99.43%
Lemmatizer                 0.999       300              63.67%        100.00%
Lemmatizer + Stop-List     0.999       300              63.05%        100.00%
~ SpamAssassin Corpus ~

                     Classifier Configuration   Spam Recall   Spam Precision
Naïve Bayes (NB)     Lemmatizer + Stop-List     82.78%        99.49%
Suffix Tree (ST)     N/A                        97.50%        99.79%
Naïve Bayes* (NB*)   Lemmatizer + Stop-List     99.16%        97.14%
Vector Space Model

"What then?" sang Plato's ghost, "What then?"
                                                  W. B. Yeats

book   ghost   host   plate   Plato   sang   then   what
  0      1      0       0       1      1      2      2
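The slide's count vector can be reproduced directly (a sketch; the tokenisation, which splits the possessive "Plato's" into "plato" and "s", is my choice):

    import re
    from collections import Counter

    line = "\"What then?\" sang Plato's ghost, \"What then?\""
    tokens = re.findall(r"[a-z]+", line.lower())   # "plato's" -> "plato", "s"
    counts = Counter(tokens)
    vocab = ["book", "ghost", "host", "plate", "plato", "sang", "then", "what"]
    print([counts[w] for w in vocab])              # [0, 1, 0, 0, 1, 1, 2, 2]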
               Word Probability

       P(w = ‘what’) = 50/1000 = 0.05
Creating Profiles

Profiles:

Mark Levene:
    engines   databases   information   search   data

Mike Hu:
    police   intelligence   criminal   computational   data
Classification

[Diagram: a new document is scored against the profiles of Boris Mirkin,
Mark Levene and Mike Hu, producing scores S_BM, S_ML and S_MH.]
Naïve Bayes
(similarity measure)

For a document d = {d1 d2 d3 … dm} and a set of classes c = {c1, c2, …, cJ}:

    P(cj|d) ∝ P(cj) · Π_{i=1}^{m} P(di|cj)                      (1)

where:

    P̃(cj) = Nj / N                                              (2)

    P̃(di|cj) = (1 + nij) / (M + Σ_{k=1}^{M} nkj)                (3)

- Nj is the number of training documents in class cj, of N in total; nij is
  the frequency of word di in class cj; M is the size of the vocabulary.
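Equations (1)-(3) in code, as a minimal multinomial naïve Bayes sketch with Laplace smoothing (function names and the log-space evaluation are mine):

    import math
    from collections import Counter

    def train_nb(docs_by_class):
        """docs_by_class maps a class name to a list of token lists.
        Returns per-class priors (2) and smoothed likelihoods (3)."""
        n_total = sum(len(docs) for docs in docs_by_class.values())
        vocab = {w for docs in docs_by_class.values() for d in docs for w in d}
        model = {}
        for c, docs in docs_by_class.items():
            n = Counter(w for d in docs for w in d)       # n_ij counts
            denom = len(vocab) + sum(n.values())          # M + sum_k n_kj
            prior = len(docs) / n_total                   # N_j / N
            model[c] = (prior, {w: (1 + n[w]) / denom for w in vocab})
        return model

    def classify_nb(model, doc):
        """argmax_j of log P(c_j) + sum_i log P(d_i|c_j)  -- eq. (1)."""
        def log_posterior(c):
            prior, likelihood = model[c]
            return math.log(prior) + sum(
                math.log(likelihood[w]) for w in doc if w in likelihood)
        return max(model, key=log_posterior)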
Criticisms

 • Pre-processing:
    - Stop-word removal
    - Word stemming/lemmatisation
    - Punctuation and formatting
 • Smallest unit of consideration is a word.
 • Classes (and documents) are bags of words, i.e. each word is independent
   of all others.
Word Dependencies

Boris Mirkin:
    means   intelligence   clustering   computational   data

Mike Hu:
    means   intelligence   criminal   computational   data
Word Inflections

Intelligent
Intelligence
Intelligentsia      →   Intellig-   OR   intelligent
Intelligible
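A toy illustration of the stemming side of the slide (this naive suffix-stripping is only for the four words shown; real systems use a proper stemmer or lemmatizer):

    def naive_stem(word):
        """Maps all four slide words to the stem 'intellig'."""
        for suffix in ("entsia", "ence", "ible", "ent"):
            if word.lower().endswith(suffix):
                return word.lower()[: -len(suffix)]
        return word.lower()

    print({w: naive_stem(w) for w in
           ["Intelligent", "Intelligence", "Intelligentsia", "Intelligible"]})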
Success measures

 • Recall is the proportion of correctly classified examples of a class:

       SR = #(S→S) / (#(S→S) + #(S→H))

   If SR is spam recall, then (1 − SR) gives the proportion of false
   negatives.

 • Precision is the proportion assigned to a class which are true members of
   that class; it is a measure of the number of true positives:

       SP = #(S→S) / (#(S→S) + #(H→S))

   If SP is spam precision, then (1 − SP) gives the proportion of false
   positives.
Success measures

 • True Positive Rate (TPR) is the proportion of correctly classified
   examples of the 'positive' class. Spam is typically taken as the positive
   class, so TPR is then the number of spam classified as spam over the
   total number of spam:

       TPR = #(Spam→Spam) / TotalSpam

 • False Positive Rate (FPR) is the proportion of the 'negative' class
   erroneously assigned to the 'positive' class. Ham is typically taken as
   the negative class, so FPR is then the number of ham classified as spam
   over the total number of ham:

       FPR = #(Ham→Spam) / TotalHam
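All four measures follow from the outcome counts of a test run (a sketch; the argument names, e.g. s_to_h for spam classified as ham, are mine):

    def success_measures(s_to_s, s_to_h, h_to_s, h_to_h):
        """Spam recall/precision and TPR/FPR from the four outcome counts."""
        return {
            "SR/TPR": s_to_s / (s_to_s + s_to_h),   # spam recall = TPR here
            "SP":     s_to_s / (s_to_s + h_to_s),   # spam precision
            "FPR":    h_to_s / (h_to_s + h_to_h),   # ham labelled as spam
        }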
Classifier Structure

[Diagram: spam and ham training data feed the pipeline below; a new document
(?) comes out labelled Spam or Ham.]

 - Profiling Method
 - Profile Representation
 - Similarity/Comparison Measure
 - Decision Mechanism or Classification Criterion
 - Decision
Classification using a suffix tree

 • Method of profiling is construction of the tree
   (no pre-processing, no post-processing)
 • The tree is a profile of the class.
 • Similarity measure?
 • Decision mechanism?
Threshold Variation
~ match normalisation ~

[Figure: constant significance function, unnormalised (left) and match
normalised (right).]

SPE = spam precision error; HPE = ham precision error
Threshold Variation
~ Significance functions ~

[Figure: root function, no normalisation (left); logit function, no
normalisation (right).]

SPE = spam precision error; HPE = ham precision error
Threshold Variation

[Figure: constant significance function (unnormalised).]

SPE = spam precision error; HPE = ham precision error