Using Text Mining for Spam Filtering

Supercomputing 2003

Loretta Auvil
Duane Searsmith
Automated Learning Group
National Center for Supercomputing Applications
University of Illinois

• Introduction to Text Mining
      •   Text Mining Process
      •   Text Characteristics

• Hands-On Exercise
      •   Spam Filtering

alg | Automated Learning Group
Text Mining Definition

Many definitions in the literature
•   “The nontrivial extraction of implicit, previously unknown, and
    potentially useful information from (large amounts of) textual data”
•   An exploration and analysis of textual (natural-language) data by
    automatic and semi-automatic means to discover new knowledge

•   What is “previously unknown” information?
      •   Strict definition
            – Information that not even the writer knows
      •   Lenient definition
            – Rediscover the information that the author encoded in the text

Text Mining Process

•   Text Preprocessing
     •   Syntactic/Semantic
         Text Analysis

•   Features Generation
     •   Bag of Words

•   Feature Selection
     •   Simple Counting
     •   Statistics

•   Text/Data Mining
     •   Classification-
         Supervised Learning
     •   Clustering-
         Unsupervised Learning

•   Analyzing Results

Text Characteristics (1)

•   Large textual database
      •   Web is growing
      •   Publications are electronic

•   High dimensionality
      •   Consider each word/phrase as a dimension

•   Dependency
      •   Relevant information is a complex conjunction of words/phrases
            – e.g., Document categorization and Pronoun disambiguation

•   Ambiguity
      •   Word ambiguity
            – Pronouns (he, she …)
            – Synonyms (buy, purchase)
            – Words with multiple meanings (bat: a baseball bat or the mammal)
      •   Semantic ambiguity
            – The king saw the rabbit with his glasses. (multiple meanings)

Text Characteristics (2)

• Noisy data
      •   Spelling mistakes
      •   Abbreviations
      •   Acronyms

• Not well structured text
      •   Email/Chat rooms
            – “r u available ?”
            – “Hey whazzzzzz up”
      •   Speech

Text Characteristics (3)

• Order of words in the query
      •   hot dog stand in the amusement park
      •   hot amusement stand in the dog park

• User dependency for the data
      •   direct feedback
      •   indirect feedback

• Authority of the source
      •   IBM is more likely to be an authorized source than my second far
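The word-order point above can be illustrated with a small sketch. It assumes a plain whitespace tokenizer and uses word pairs (bigrams) as one possible order-sensitive feature; the function names are illustrative and are not part of T2K or D2K.

```python
from collections import Counter

def bag_of_words(text):
    """Return word counts, ignoring word order."""
    return Counter(text.lower().split())

def bigrams(text):
    """Return the set of adjacent word pairs, which does depend on order."""
    words = text.lower().split()
    return set(zip(words, words[1:]))

query_a = "hot dog stand in the amusement park"
query_b = "hot amusement stand in the dog park"

# The two queries contain exactly the same words...
same_bag = bag_of_words(query_a) == bag_of_words(query_b)
# ...but not in the same order.
same_order = query_a.split() == query_b.split()
shared_pairs = bigrams(query_a) & bigrams(query_b)

print(same_bag, same_order, len(shared_pairs))
```

A pure bag-of-words representation cannot tell these two queries apart, while the bigram sets overlap only partially.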

Text Mining: Views from T2K and ThemeWeaver

Text Mining: Themescape and ThemeRiver

•   Visualizing Relationships Between Documents

Images from Pacific Northwest Laboratory
Text Mining General Application Areas

•   Information Retrieval
      •   Indexing and retrieval of textual documents
      •   Finding a set of (ranked) documents that are relevant to the query

•   Information Extraction
      •   Extraction of partial knowledge in the text

•   Web Mining
      •   Indexing and retrieval of textual documents and extraction of partial knowledge
          using the web

•   Classification
      •   Predict a class for each text document

•   Clustering
      •   Generating collections of similar text documents

Text Mining Applications

•   Email: Spam filtering
•   News Feeds: Discover what is
•   Medical: Identify relationships and link
    information from different medical
•   Homeland Security
•   Marketing: Discover distinct groups of
    potential buyers and make suggestions
    for other products
•   Industry: Identifying groups of
    competitors' web pages
•   Job Seeking: Identify parameters in
    searching for jobs
alg | Automated Learning Group
What is Information Extraction?

• Given:
      •   Source of textual documents
      •   Well defined limited query (text based)

• Find:
      •   Sentences with relevant
      •   Extract the relevant information and ignore non-relevant information
      •   Link related information and output in a predetermined format

Example source document:

Advisory Programmer - Oracle (Austin, TX) Response Code: 1008-0074-97-iexc-jcn

Responsibilities: This is an exciting opportunity with Siemens Wireless Terminals; a
start-up venture fully capitalized by a Global Leader in Advanced Technologies.
Qualified candidates will: Responsible for assisting with requirements definition,
analysis, design and implementation that meet objectives, codes difficult and
sophisticated routines. Develops project plans, schedules and cost data. Develop
test plans and implement physical design of databases. Develop shell scripts for
administrative and background tasks, stored procedures and triggers. Using Oracles
Designer 2000, assist with Data Model maintenance and assist with applications
development using Oracle Forms. Qualifications: BSCS, BSMIS or closely related
field or related equivalent knowledge normally obtained through technical
education programs. 5-8 years of professional experience in development, system
design analysis, programming, installation using Oracle development…

Example from Dan Roth Web Page
Web Mining

• Enormous wealth of textual information on the Web
      •   Book/CD/Video stores (e.g., Amazon)
      •   Restaurant information (e.g., Zagats)
      •   Car prices (e.g., Carpoint)
      •   Hyper-link information

• Web is very dynamic
      •   Web pages are constantly being generated (removed)
      •   Web pages are generated from database queries

• Lots of data on user access patterns
      •   Web logs contain sequence of URLs accessed by users
      •   Access and usage information

Text PreProcessing: Syntactic / Semantic Text Analysis

• Part Of Speech (PoS) Tagging
      •   Find the corresponding PoS for each word
      •   e.g., John (noun) gave (verb) the (det) ball (noun)
      •   ~98% accurate

• Word Sense Disambiguation
      •   Context based or proximity based
      •   Very accurate

• Parsing
      •   Generates a parse tree (graph) for each sentence
      •   Each sentence is a stand alone graph

Feature Generation: Bag of Words

• Text document is represented by the words it contains (and
    their occurrences)
      •   e.g., “Lord of the rings” → {“the”, “Lord”, “rings”, “of”}
      •   Highly efficient
      •   Makes learning far simpler and easier
      •   Order of words is not that important for certain applications

• Stemming
      •   Reduce dimensionality
      •   Identifies a word by its root
      •   e.g., flying, flew → fly

• Stop words
      •   Identifies the most common words that are unlikely to help with text
      •   e.g., “the”, “a”, “an”, “you”
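The three steps on this slide can be sketched together as follows. This is a minimal illustration, not the T2K implementation: the stop list is just the four words given above, and `naive_stem` is a toy suffix stripper standing in for a real stemmer, so it handles "flying" but not irregular forms such as "flew".

```python
import re
from collections import Counter

# Stop list taken from the slide's examples only; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "you"}

def naive_stem(word):
    """Strip a few common suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, stem, and count occurrences."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [naive_stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(kept)

bag = bag_of_words("The Lord of the Rings")
print(bag)
```

Stop-word removal and stemming both shrink the vocabulary, which is what reduces the dimensionality of the resulting feature space.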

Feature Selection

• Reduce Dimensionality
      •   Learners have difficulty addressing tasks with high dimensionality

• Irrelevant Features
      •   Not all features help!
      •   Remove features that occur in only a few documents
      •   Reduce features that occur in too many documents
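One common way to apply the two rules above is to filter on document frequency. The sketch below assumes each document is already a list of tokens; the function name and both thresholds (`min_docs`, `max_fraction`) are illustrative choices, not values from the tutorial, and terms above the upper threshold are simply dropped.

```python
from collections import Counter

def select_features(docs, min_docs=2, max_fraction=0.9):
    """Keep terms found in at least min_docs documents and in at most
    max_fraction of all documents. Both thresholds are illustrative."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document
    limit = max_fraction * len(docs)
    return {t for t, df in doc_freq.items() if min_docs <= df <= limit}

docs = [
    ["free", "money", "now", "email"],
    ["mono", "compiler", "bug", "email"],
    ["jini", "service", "bug", "email"],
    ["free", "offer", "email"],
]
selected = select_features(docs)
print(sorted(selected))
```

Here "email" is dropped because it appears in every document, and terms seen in a single document are dropped as too rare to generalize.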

Text Mining: Supervised vs. Unsupervised Learning

• Supervised learning (Classification)
      •   Data (observations, measurements, etc.) are accompanied by labels
          indicating the class of the observations
      •   Split into training data and test data for model building process
      •   New data is classified based on the model built with the training data

• Unsupervised learning (Clustering)
      •   Class labels of training data are unknown
      •   Given a set of measurements, observations, etc. with the aim of
          establishing the existence of classes or clusters in the data

Text Mining: Classification Definition

•   Given: Collection of labeled records
      •   Each record contains a set of features (attributes), and the true class (label)
      •   Create a training set to build the model
      •   Create a testing set to test the model

•   Find: Model for the class as a function of the values of the features
•   Goal: Assign a class (as accurately as possible) to previously unseen records
•   Evaluation: What Is Good Classification?
      •   Correct classification
            – Known label of test example is identical to the predicted class from the model
      •   Accuracy ratio
            – Percent of test set examples that are correctly classified by the model
      •   Distance measure between classes can be used
            – e.g., classifying “football” document as a “basketball” document is not as bad as
              classifying it as “crime”
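The accuracy ratio described above, together with a confusion matrix of the kind reported later in the exercise, can be computed as in this sketch. The labels and predictions are made-up illustration data, not results from the tutorial.

```python
from collections import Counter

def accuracy(actual, predicted):
    """Fraction of test examples whose predicted class equals the known label."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)

def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs."""
    return Counter(zip(actual, predicted))

# Hypothetical test-set labels and model outputs.
actual    = ["spam", "spam", "mono", "jini", "mono", "jini"]
predicted = ["spam", "mono", "mono", "jini", "mono", "spam"]

acc = accuracy(actual, predicted)
matrix = confusion_matrix(actual, predicted)
print(round(acc, 3), matrix[("spam", "mono")])
```

The off-diagonal entries of the matrix show which classes are confused with each other, which is the information a class-distance measure would weight.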

Text Mining: Clustering Definition
•   Given: Set of documents and a similarity measure among
•   Find: Clusters such that
     •   Documents in one cluster are more similar to one another
     •   Documents in separate clusters are less similar to one another

•   Goal:
     •   Finding a correct set of documents

•   Similarity Measures:
     •   Euclidean distance if attributes are continuous
     •   Other problem-specific measures
            – e.g., how many words are common in these documents

•   Evaluation: What Is Good Clustering?
     •   Produce high quality clusters with
            – high intra-class similarity
            – low inter-class similarity
     •   Quality of a clustering method is also measured by its ability to
         discover some or all of the hidden patterns
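The two similarity measures named on this slide can be sketched as below. The function names are illustrative, and the common-word count assumes simple whitespace tokenization.

```python
import math

def euclidean_distance(u, v):
    """Distance between two equal-length vectors of continuous attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def common_words(doc_a, doc_b):
    """Problem-specific measure: number of distinct words the documents share."""
    return len(set(doc_a.split()) & set(doc_b.split()))

d = euclidean_distance([1.0, 2.0], [4.0, 6.0])
shared = common_words("buy cheap pills now", "cheap pills for sale")
print(d, shared)
```

Note that Euclidean distance is smaller for more similar items, while the common-word count is larger, so a clustering method has to treat the two with opposite signs.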

Classification Techniques

•   Bayesian classification
•   Decision trees
•   Neural networks
•   Instance-Based Methods

Bayesian Classification

• Idea: assign to example X
    the class label C such that
    P(C|X) is maximal
• Computes the distribution
    of an input associated with
    each class, for example,
    given the variable X with a
    value at xi the probability
    of it being in Class A is
    greater than it being in
    Class B

Mathematically speaking: if P(X | C) and the densities P(xi) and
P(cj) (prior probabilities) are known, then the classifier is one
which assigns class cj to datum xi if cj has the highest posterior
probability given the data.
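Written out with Bayes' rule, the decision described above takes the following form. The symbols x_i and c_j follow the slide; the hat notation for the chosen class is added here for illustration.

```latex
P(c_j \mid x_i) = \frac{P(x_i \mid c_j)\, P(c_j)}{P(x_i)},
\qquad
\hat{c}(x_i) = \arg\max_{c_j} P(c_j \mid x_i)
             = \arg\max_{c_j} P(x_i \mid c_j)\, P(c_j)
```

The denominator P(x_i) is the same for every class, so it can be dropped when only the best class is needed.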

Bayesian Classification: Why?

• Probabilistic learning: Calculates explicit probabilities for
    hypotheses; among the most practical approaches to certain
    types of learning problems
• Incremental: Each training example can incrementally
    increase/decrease the probability that a hypothesis is correct
• Prior knowledge: Can be combined with observed data
• Standard:
      •   Provide a standard of optimal decision making against which other
          methods can be measured
      •   In a simpler form, provide a baseline against which other methods can
          be measured

Naïve Bayesian Classification

• Naïve assumption
      •   Feature independence

• P(xi |C) is estimated as the relative frequency of examples
    having value xi as feature in class C
• Computationally easy!!!
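A minimal sketch of a naïve Bayes text classifier follows. It is one common variant, not the D2K module: it estimates P(xi | C) from word counts per class, works in log space to avoid underflow, and uses add-one smoothing, which is an assumption not stated in the slides. The function names and training data are made up for illustration.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label). Returns class counts, word counts, vocabulary."""
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab

def predict(tokens, class_counts, word_counts, vocab):
    """Return the class with the highest log posterior under feature independence."""
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)  # log prior P(C)
        for w in tokens:
            if w not in vocab:
                continue  # ignore words never seen in training
            # Add-one smoothing: an assumption, not stated in the slides.
            p = (word_counts[label][w] + 1) / (total_words + len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    ("free money offer".split(), "spam"),
    ("free pills offer now".split(), "spam"),
    ("mono compiler bug report".split(), "mono"),
    ("mono runtime patch".split(), "mono"),
]
model = train(training)
label_a = predict("free offer today".split(), *model)
label_b = predict("compiler patch".split(), *model)
print(label_a, label_b)
```

Training is a single counting pass over the data, which is why the method is computationally cheap.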

Classification by Decision Tree

• Decision tree
      •   Flow-chart-like tree structure
      •   Internal node denotes a test on an attribute
      •   Branch represents an outcome of the test
      •   Leaf nodes represent class labels or class distribution

• Decision tree generation consists of two phases:
      •   Tree construction
      •   Tree pruning
            – Identify and remove branches that reflect noise or outliers

• Use of decision tree
      •   Classifying an unknown example
      •   Test the attribute of the example against the decision tree
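Classifying an unknown example with an already built tree can be sketched as below. The tree is hand-written for illustration (no construction or pruning is shown), and the attribute names are hypothetical.

```python
def classify(node, example):
    """Walk from the root to a leaf, following the branch that matches
    the example's value for each tested attribute."""
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node  # a leaf holds the class label

# Hypothetical hand-built tree over two boolean features.
tree = {
    "attribute": "contains_free",
    "branches": {
        True: {
            "attribute": "from_mailing_list",
            "branches": {True: "not spam", False: "spam"},
        },
        False: "not spam",
    },
}

label = classify(tree, {"contains_free": True, "from_mailing_list": False})
print(label)
```

Internal nodes are dictionaries that name the tested attribute, branches are keyed by test outcome, and leaves are plain class labels, mirroring the structure listed above.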

Text Mining in D2K

Email Classification
Naïve Bayesian

Email Classification

• Input:
      •   Multiple mailboxes where each mailbox represents a class

• Output:
      •   Results of the model on the testing set
      •   Model that classifies future email

Mailbox Files

• MONO.mbx
      •   Mono Developer Discussion List
      •   216 messages

• SPAM.mbx
      •   Spam Mailbox
      •   100 messages

• JINI.mbx
      •   JINI-Users mail list
      •   104 messages

Opening the Itinerary

• Click on the
    “Itinerary” Pane in
    the Resource Panel
• Expand the “T2K”
    directory with a single
• Double click on

D2K – A Few Features

• Properties indicate that a module has settings that can be
    changed before execution
      •   Indicated by a “P” in the lower left corner of a module
      •   e.g., filename, maximum iterations, etc.

• Resource Manager
      •   Load data that is accessible by all modules

NaiveBayesEmail Itinerary

• Use of D2K’s Resource
   Manager to store data that
   will serve as a dictionary
     •   Contextual Rule File
     •   Lexical Rule File
     •   Stop words
     •   Lexicon

 NaiveBayesEmail Itinerary (2)

Load the mailbox data
• Input File Name
     •   Specify a directory by
         changing properties of

• ReadFileNames
     •   Sends each filename in this
         directory as output

• MBX Email Parser
     •   Parses the mailbox files

• Email -> Document
     •   Converts the email document
         to the standard document
     •   Flags on whether to include
         sender/receiver info
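Outside D2K, the email-to-document step with its sender/receiver flag could look roughly like this sketch. It uses Python's standard `email` package and assumes a single-part, RFC 822 style message; the function name and the `include_addresses` flag are illustrative and do not correspond to the D2K module's actual properties.

```python
from email import message_from_string

def email_to_document(raw, include_addresses=False):
    """Turn one raw message into plain document text.
    include_addresses mimics the sender/receiver flag (hypothetical name)."""
    msg = message_from_string(raw)
    parts = [msg["Subject"] or "", msg.get_payload()]
    if include_addresses:
        parts = [msg["From"] or "", msg["To"] or ""] + parts
    return "\n".join(parts)

raw = (
    "From: alice@example.org\n"
    "To: list@example.org\n"
    "Subject: Build failure\n"
    "\n"
    "The nightly build fails on Linux.\n"
)
plain = email_to_document(raw)
with_addresses = email_to_document(raw, include_addresses=True)
print(plain)
```

Including addresses can make mailing-list traffic trivially separable from spam, which is worth keeping in mind when comparing accuracy across settings.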

    NaiveBayesEmail Itinerary (3)
Pre-Process text data
•     Tokenizer
        •   Forms word tokens for each
            word or symbol

•     Brill Pre-Tagger
        •   Assigns part of speech tag to
            each token
        •   Can be used without following 2

•     Brill Lexical Tagger
        •   Adjusts tag based on lexical
        •   Lexical must precede Contextual

•     Brill Contextual Tagger
        •   Adjusts tag based on contextual

    NaiveBayesEmail Itinerary (4)

More Text Pre-Processing
•     Filter Stop Words
        •   Removes Stop Words

•     Stemmer
        •   Transforms words into their word
        •   Removes plurals, etc.

•     Select Tokens By Speech Tag
        •   Removes tokens that do not
            match speech tag of interest

•     Document->TermList
        •   Counts the frequency of terms in
            the document
        •   Adjusts counts for title weighting
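Term counting with title weighting might be sketched as follows. How the D2K module actually adjusts the counts is not described here, so this assumes the simplest scheme: each title occurrence counts `title_weight` times. The function name and default weight are illustrative.

```python
from collections import Counter

def term_list(title_tokens, body_tokens, title_weight=2):
    """Count body terms once each and title terms title_weight times each.
    The weighting scheme is an assumption for illustration."""
    counts = Counter(body_tokens)
    for token in title_tokens:
        counts[token] += title_weight
    return counts

terms = term_list(["build", "failure"], ["build", "fails", "on", "linux"])
print(terms.most_common(2))
```

For email, the subject line plays the role of the title, so a higher weight makes subject words count more toward the classification.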

 NaiveBayesEmail Itinerary (5)

Creation of Sparse Table for
• Add Series of Ints
     •   Counts all values it receives
     •   Outputs sum

• TermLists -> SparseTable
     •   Creates Sparse Table to be used
         for mining
     •   Term counts across documents
         are sparse
     •   Conserves memory

• Feature Filter
     •   Eliminates terms that occur in
         only one document
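The idea behind the sparse table and the feature filter can be sketched with a dictionary that stores only nonzero counts. This illustrates the data structure only and is not D2K's SparseTable class; all names and the sample term lists are made up.

```python
from collections import Counter

def build_sparse_table(term_lists):
    """Store only nonzero counts, keyed by (document index, term)."""
    return {
        (doc_id, term): count
        for doc_id, terms in enumerate(term_lists)
        for term, count in terms.items()
    }

def filter_single_document_terms(table):
    """Drop every term that occurs in only one document."""
    doc_freq = Counter(term for _, term in table)
    return {key: c for key, c in table.items() if doc_freq[key[1]] > 1}

term_lists = [
    Counter({"free": 2, "offer": 1}),
    Counter({"mono": 1, "bug": 1}),
    Counter({"free": 1, "bug": 3}),
]
table = build_sparse_table(term_lists)
filtered = filter_single_document_terms(table)

# A dense table would need one cell per (document, term) combination.
dense_cells = len(term_lists) * len({term for _, term in table})
print(len(table), dense_cells, len(filtered))
```

Even in this tiny example the sparse form stores half as many cells as a dense table would, and the gap grows quickly with vocabulary size.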

NaiveBayesEmail Itinerary (6)

Selecting input and output
   attributes for classification
• Choose Attributes
      •   Select input attributes
      •   Select output attribute –

NaiveBayesEmail Itinerary (7)

Setting testing and training sets
• Simple Train Test
      •   Property window to set train and
          test percentages
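A percentage-based split like the one configured here can be sketched as below. Whether the D2K module shuffles before splitting is not stated, so the seeded shuffle is an assumption, as are the function name and the 70 percent default.

```python
import random

def simple_train_test(rows, train_percent=70, seed=0):
    """Shuffle a copy of the rows, then cut it at the requested percentage."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]

# 216 + 100 + 104 messages in the three example mailboxes.
rows = list(range(216 + 100 + 104))
train, test = simple_train_test(rows)
print(len(train), len(test))
```

Every row lands in exactly one of the two sets, so the test accuracy is measured only on messages the model has not seen.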

NaiveBayesEmail Itinerary (8)

Model building and testing
• Naïve Bayes Text Model
     •   Builds a Naïve Bayesian
         classification model

• Model Predict
     •   Applies the model to the
         testing data

NaiveBayesEmail Itinerary (9)

• Results of model on testing
• Prediction Table Report
      •   Shows classification error
      •   Shows confusion matrix

NaiveBayesEmail Itinerary (10)

• Table Viewer
      •   Shows original data
      •   Shows the predicted column

Execute Itinerary

• Check Properties
    for Input FileName
• Click the Run


• Results of model on testing
• Prediction Table Report
      •   Shows classification error
      •   Shows confusion matrix

• Original table had 6718 attributes
• After filtering the table had
    2830 attributes

Other Scenarios

• Change the weight for words in titles
• Change whether or not sender/receiver info is included
• Take the filter module out

• What happens to accuracy of the model??
• What happens to performance??

Scenario – Verbs only

• Original table had 1020 attributes
• After filtering the table had
    636 attributes

T2K Components

• Many of the components used in this example could be used for
    text mining applications
• We have used these components in a variety of different
    applications of text clustering

Future Work

• Information Extraction approaches built into D2K
• Question and Answer Applications

The ALG Team
Staff
      Loretta Auvil
      Ruth Aydt
      Peter Bajcsy
      Colleen Bushell
      Dora Cai
      David Clutter
      Lisa Gatzke
      Vered Goren
      Chris Navarro
      Greg Pape
      Tom Redman
      Duane Searsmith
      Andrew Shirk
      Anca Suvaiala
      David Tcheng
      Michael Welge

Students
      Tyler Alumbaugh
      Peter Groves
      Olubanji Iyun
      Sang-Chul Lee
      Xiaolei Li
      Brian Navarro
      Jeff Ng
      Scott Ramon
      Sunayana Saha
      Martin Urban
      Bei Yu
      Hwanjo Yu

Licensing D2K
•   Faculty, staff and students at US academic institutions will be able to
    license and use D2K for free by downloading from
•   Private Sector Partners who have provided funding for projects related
    to D2K will be able to license and use D2K for free
•   Private Sector Partners who have not provided funding will be able to
    license and use D2K for a discounted fee

Contact John McEntire
Office of Technology Management
308 Ceramics Building, MC-243
105 South Goodwin Avenue
Urbana, Illinois 61801-2901
(217) 333-3715
