SIMS 290-2 Applied Natural Langu by liwenting


									SIMS 290-2:
Applied Natural Language Processing

Marti Hearst
Oct 23, 2006

(Slides developed by Preslav Nakov)


 Feature selection
 TF.IDF Term Weighting
 Weka Input File Format

Features for Text Categorization
Linguistic features
     – lowercase? (should we convert to?)
     – normalized? (e.g. “texts”  “text”)
      Word-level n-grams
      Character-level n-grams
      Part of Speech

Non-linguistic features
      document formatting
      informative character sequences (e.g. &lt)

When Do We Need
Feature Selection?

If the algorithm cannot handle all possible features
     – e.g. language identification for 100 languages using all words
     – text classification using n-grams

Good features can result in higher accuracy
     What if we just keep all features?
     – Even the unreliable features can be helpful.
     – But we need to weight them:
          In the extreme case, the bad features can have a weight
           of 0 (or very close), which is… a form of feature selection!

Why Feature Selection?
Not all features are equally good!
      Bad features: best to remove
     – Infrequent
          unlikely to be seen again
          co-occurrence with a class can be due to chance
     – Too frequent
          mostly function words
     – Uniform across all categories
      Good features: should be kept
     – Co-occur with a particular category
     – Do not co-occur with other categories
      The rest: good to keep

 Types Of Feature Selection?

 Feature selection reduces the number of features
     Usually:
       Eliminating features
       Weighting features
       Normalizing features
     Sometimes by transforming parameters
       e.g. Latent Semantic Indexing using Singular Value
 Method may depend on problem type
     For classification and filtering, may want to use
    information from example documents to guide selection.

Feature Selection

 Task independent methods
   Document Frequency (DF)
   Term Strength (TS)

 Task-dependent methods
   Information Gain (IG)
   Mutual Information (MI)
   2 statistic (CHI)

      Empirically compared by Yang & Pedersen (1997)

Pedersen & Yang Experiments

 Compared feature selection methods for text
   5 feature selection methods:
     – DF, MI, CHI, (IG, TS)
     – Features were just words, not phrases
   2 classifiers:
     – kNN: k-Nearest Neighbor
     – LLSF: Linear Least Squares Fit
   2 data collections:
     – Reuters-22173
     – OHSUMED: subset of MEDLINE (1990&1991 used)

 Document Frequency (DF)
DF: number of documents a term appears in
 Based on Zipf’s Law
 Remove the rare terms: (seen 1-2 times)
         Unreliable – can be just noise
         Unlikely to appear in new documents
         Easy to compute
         Task independent: do not need to know the classes
          Ad hoc criterion
          For some applications, rare terms can be good
        discriminators (e.g., in IR)

Stop Word Removal
Common words from a predefined list
     Mostly from closed-class categories:
     – unlikely to have a new word added
     – include: auxiliaries, conjunctions, determiners, prepositions,
       pronouns, articles
     But also some open-class words like numerals

Bad discriminators
     uniformly spread across all classes
     can be safely removed from the vocabulary
     – Is this always a good idea? (e.g. author identification)

2 statistic (CHI)
2 statistic (pronounced “kai square”)
    A commonly used method of comparing proportions.
    Measures the lack of independence between a term and
  a category (Yang & Pedersen)

 2 statistic (CHI)
Is “jaguar” a good predictor for the “auto” class?

                        Term = jaguar   Term  jaguar

         Class = auto        2              500

         Class  auto        3              9500

We want to compare:
 the observed distribution above; and
 null hypothesis: that jaguar and auto are independent

 2 statistic (CHI)
Under the null hypothesis: (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?
       If independent: Pr(j,a) = Pr(j)  Pr(a)
       So, there would be: N  Pr(j,a), i.e. N  Pr(j)  Pr(a)
        Pr(j) = (2+3)/N;
        Pr(a) = (2+500)/N;
       Which = N(5/N)(502/N)=2510/N=2510/10005  0.25

                       Term = jaguar   Term  jaguar

        Class = auto         2            500

        Class  auto         3           9500

 2 statistic (CHI)
Under the null hypothesis: (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?

                       Term = jaguar   Term  jaguar
                                                       expected: fe

        Class = auto       2 (0.25)      500

                                                       observed: fo
        Class  auto       3            9500

 2 statistic (CHI)
Under the null hypothesis: (jaguar and auto – independent):
How many co-occurrences of jaguar and auto do we expect?

                       Term = jaguar   Term  jaguar
                                                       expected: fe

        Class = auto       2 (0.25)      500   (502)

                                                       observed: fo
        Class  auto       3 (4.75)     9500 (9498)

 2 statistic (CHI)
2 is interested in (fo – fe)2/fe summed over all table entries:
   2 ( j, a)   (O  E ) 2 / E  (2  .25) 2 / .25  (3  4.75) 2 / 4.75
                   (500  502) 2 / 502  (9500  9498) 2 / 9498  12.9 ( p  .001)

The null hypothesis is rejected with confidence .999,
since 12.9 > 10.83 (the value for .999 confidence).

                              Term = jaguar         Term  jaguar
                                                                             expected: fe

           Class = auto              2 (0.25)          500     (502)

                                                                             observed: fo
           Class  auto              3 (4.75)         9500 (9498)

 2 statistic (CHI)
There is a simpler formula for 2:

                 A = #(t,c)    C = #(¬t,c)

                B = #(t,¬c)   D = #(¬t, ¬c)


 2 statistic (CHI)
How to use 2 for multiple categories?
Compute 2 for each category and then combine:
       To require a feature to discriminate well across all
     categories, then we need to take the expected value of 2:

       Or to weight for a single category, take the maximum:

2 statistic (CHI)

     normalized and thus comparable across terms
     2(t,c) is 0, when t and c are independent
     can be compared to 2 distribution, 1 degree of freedom

     unreliable for low frequency terms

Information Gain

 A measure of importance of the feature for predicting
 the presence of the class.
   Has an information theoretic justification
 Defined as:
   The number of “bits of information” gained by
   knowing the term is present or absent
   Based on Information Theory
     – We won’t go into this in detail here.

 Information Gain (IG)
IG: number of bits of information gained by knowing the
term is present or absent
                                               entropy: H(c)

                                                 entropy H(c|t)

     t is the term being scored,               entropy H(c|¬t)
     ci is a class variable

 Mutual Information (MI)
 The probability of seeing x and y together
 The probably of seeing x anywhere times the probability of
seeing y anywhere (independently).

     MI = log ( P(x,y) / P(x)P(y) )
       = log(P(x,y)) – log(P(x)P(y))

          From Bayes law: P(x,y) = P(x|y)P(y)

          = log(P(x|y)P(y)) – log(P(x)P(y))

     MI = log(P(x|y) – log(P(x))

 Mutual Information (MI)                   rare terms
                                           get higher scores


                              A = #(t,c)    C = #(¬t,c)

                              B = #(t,¬c) D = #(¬t, ¬c)

                 N=A+B +C+D
                                            does not use
                                            term absence

Using Mutual Information
Compute MI for each category and then combine
    If we want to discriminate well across all categories, then
   we need to take the expected value of MI:

     To discriminate well for a single category, then we take
   the maximum:

Mutual Information

   I(t,c) is 0, when t and c are independent
   Has a sound information-theoretic interpretation

   Small numbers produce unreliable results
   Does not use term absence

                                                      CHI max, IG, DF

                                                Term strength

                           Mutual information

From Yang & Pedersen ‘97                                                26
Feature Comparison

DF, IG and CHI are good and strongly correlated
       thus using DF is good, cheap and task independent
       can be used when IG and CHI are too expensive
 MI is bad
       favors rare terms (which are typically bad)

Term Weighting

 In the study just shown, terms were (mainly)
 treated as binary features
    If a term occurred in a document, it was
    assigned 1
    Else 0
 Often it us useful to weight the selected
 Standard technique: tf.idf

TF.IDF Term Weighting
TF: term frequency
     definition: TF = tij
     – frequency of term i in document j
    purpose: makes the frequent words for the document
   more important
IDF: inverted document frequency
     definition: IDF = log(N/ni)
     – ni : number of documents containing term i
     – N : total number of documents
     purpose: makes rare words across documents more
TF.IDF (for term i in document j)
     definition: tij  log(N/ni)

Term Normalization

 Combine different words into a single representation
   Stemming/morphological analysis
     – bought, buy, buys -> buy
   General word categories
     –   $23.45, 5.30 Yen -> MONEY
     –   1984, 10,000      -> DATE, NUM
     –   PERSON
            (Covered in Information Extraction segment)
   Generalize with lexical hierarchies
     – WordNet, MeSH
         (Covered later in the semester)

What Do People Do In Practice?

1. Feature selection
   infrequent term removal
       infrequent across the whole collection (i.e. DF)
       seen in a single document
   most frequent term removal (i.e. stop words)

2. Normalization:
     1. Stemming. (often)
     2. Word classes (sometimes)
3. Feature weighting: TF.IDF or IDF
4. Dimensionality reduction (sometimes)


Java-based tool for large-scale machine-learning
Tailored towards text analysis

Weka Input Format
 Expects a particular input file format
    Called ARFF: Attribute-Relation File Format
    Consists of a Header and a Data section


 WEKA File Format: ARFF

@relation heart-disease-simplified
                                          Numerical attribute
@attribute    age numeric
@attribute    sex { female, male}
                                            Nominal attribute
@attribute    chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute    cholesterol numeric
@attribute    exercise_induced_angina { no, yes}
                                                          Other attribute types:
@attribute    class { present, not_present}
                                                          • String
@data                                                     • Date
67,male,asympt,229,yes,present                 Missing value

Slide adapted from Eibe Frank's                                                    34
WEKA Sparse File Format
Value 0 is not represented explicitly
Same header (i.e @relation and @attribute tags)
the @data section is different
     Instead of
                   0, X, 0, Y, "class A"
                   0, 0, W, 0, "class B"
     We have
                    {1 X, 3 Y, 4 "class A"}
                    {2 W, 4 "class B"}
This saves LOTS of space for text applications.

Next Time

 Wed: Guest lecture by Peter Jackson:
   Pure and Applied Research in NLP: The Good, the
   Bad, and the Lucky.
 Following week:
   Text Categorization Algorithms
   How to use Weka


To top