					SIMS 290-2:
Applied Natural Language Processing



Marti Hearst
Oct 23, 2006

(Slides developed by Preslav Nakov)




                                      1
Today

 Feature selection
 TF.IDF Term Weighting
 Weka Input File Format




                          2
Features for Text Categorization
Linguistic features
      Words
     – lowercase? (should we convert everything to lowercase?)
     – normalized? (e.g. “texts” → “text”)
      Phrases
      Word-level n-grams
      Character-level n-grams
      Punctuation
      Part of Speech


Non-linguistic features
      document formatting
      informative character sequences (e.g. <)

                                                   3
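Both word-level and character-level n-grams can be extracted with a few lines of code. A minimal sketch in Python (toy input; the helper names are made up for illustration):

def word_ngrams(text, n=2):
    # Word-level n-grams as tuples (hypothetical helper for illustration).
    tokens = text.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def char_ngrams(text, n=3):
    # Character-level n-grams (hypothetical helper for illustration).
    return [text[i:i + n] for i in range(len(text) - n + 1)]

doc = "The texts were normalized"
print(word_ngrams(doc, 2))   # [('the', 'texts'), ('texts', 'were'), ...]
print(char_ngrams(doc, 3))   # ['The', 'he ', 'e t', ...]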
When Do We Need Feature Selection?

If the algorithm cannot handle all possible features
     – e.g. language identification for 100 languages using all words
     – text classification using n-grams


Good features can result in higher accuracy
     What if we just keep all features?
     – Even the unreliable features can be helpful.
     – But we need to weight them:
          In the extreme case, the bad features can have a weight
           of 0 (or very close), which is… a form of feature selection!




                                                                      4
Why Feature Selection?
Not all features are equally good!
      Bad features: best to remove
     – Infrequent
          unlikely to be seen again
          co-occurrence with a class can be due to chance
     – Too frequent
          mostly function words
     – Uniform across all categories
      Good features: should be kept
     – Co-occur with a particular category
     – Do not co-occur with other categories
      The rest: good to keep




                                                             5
 Types Of Feature Selection?

 Feature selection reduces the number of features
     Usually:
       Eliminating features
       Weighting features
       Normalizing features
     Sometimes by transforming parameters
       e.g. Latent Semantic Indexing using Singular Value
        Decomposition
 Method may depend on problem type
     For classification and filtering, may want to use
    information from example documents to guide selection.



                                                             6
Feature Selection

 Task independent methods
   Document Frequency (DF)
   Term Strength (TS)


 Task-dependent methods
   Information Gain (IG)
   Mutual Information (MI)
   χ² statistic (CHI)



      Empirically compared by Yang & Pedersen (1997)



                                                       7
Yang & Pedersen Experiments

 Compared feature selection methods for text
 categorization
   5 feature selection methods:
     – DF, MI, CHI, (IG, TS)
     – Features were just words, not phrases
   2 classifiers:
     – kNN: k-Nearest Neighbor
     – LLSF: Linear Least Squares Fit
   2 data collections:
     – Reuters-22173
     – OHSUMED: subset of MEDLINE (1990&1991 used)




                                                     8
 Document Frequency (DF)
DF: number of documents a term appears in
 Based on Zipf’s Law
 Remove the rare terms: (seen 1-2 times)
         Spurious
         Unreliable – can be just noise
         Unlikely to appear in new documents
 Plus
         Easy to compute
         Task independent: do not need to know the classes
 Minus
          Ad hoc criterion
          For some applications, rare terms can be good
        discriminators (e.g., in IR)

                                                             9
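A minimal sketch of DF-based selection in Python (toy corpus and threshold chosen only for illustration): count how many documents each term appears in, then drop the rare terms.

from collections import Counter

# Toy corpus; in practice these would be tokenized documents.
docs = [
    ["the", "car", "engine", "failed"],
    ["the", "jaguar", "is", "a", "car"],
    ["the", "match", "was", "exciting"],
]

# Document frequency: number of documents each term appears in.
df = Counter(term for doc in docs for term in set(doc))

# Keep terms seen in at least 2 documents (threshold picked for illustration).
MIN_DF = 2
vocabulary = {term for term, count in df.items() if count >= MIN_DF}
print(vocabulary)  # {'the', 'car'}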
Stop Word Removal
Common words from a predefined list
     Mostly from closed-class categories:
     – unlikely to have a new word added
     – include: auxiliaries, conjunctions, determiners, prepositions,
       pronouns, articles
     But also some open-class words like numerals


Bad discriminators
     uniformly spread across all classes
     can be safely removed from the vocabulary
     – Is this always a good idea? (e.g. author identification)




                                                                        10
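A minimal sketch of stop-word removal in Python, using a tiny hand-picked list (real predefined lists contain a few hundred closed-class words):

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "was", "of", "and", "to", "in"}

def remove_stop_words(tokens):
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "match", "was", "exciting"]))  # ['match', 'exciting']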
χ² statistic (CHI)
χ² statistic (pronounced “kai square”)
    A commonly used method of comparing proportions.
    Measures the lack of independence between a term and
  a category (Yang & Pedersen)




                                                           11
 χ² statistic (CHI)
Is “jaguar” a good predictor for the “auto” class?

                        Term = jaguar   Term ≠ jaguar

         Class = auto        2              500

         Class ≠ auto        3             9500


We want to compare:
 the observed distribution above; and
 null hypothesis: that jaguar and auto are independent

                                                         12
 χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?
       If independent: Pr(j,a) = Pr(j) × Pr(a)
       So, there would be: N × Pr(j,a), i.e. N × Pr(j) × Pr(a)
        Pr(j) = (2+3)/N;
        Pr(a) = (2+500)/N;
        N = 2+3+500+9500
       Which = N × (5/N) × (502/N) = 2510/N = 2510/10005 ≈ 0.25

                       Term = jaguar   Term ≠ jaguar

        Class = auto         2            500

        Class ≠ auto         3           9500


                                                                 13
 χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?

                       Term = jaguar   Term ≠ jaguar

        Class = auto       2 (0.25)      500

        Class ≠ auto       3            9500

   (plain numbers: observed counts fo; values in parentheses: expected counts fe)



                                                                14
 χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?

                       Term = jaguar   Term ≠ jaguar

        Class = auto       2 (0.25)      500   (502)

        Class ≠ auto       3 (4.75)     9500 (9498)

   (plain numbers: observed counts fo; values in parentheses: expected counts fe)



                                                                15
 χ² statistic (CHI)
χ² is interested in (fo – fe)²/fe summed over all table entries:

   χ²(j,a) = Σ (O − E)²/E
           = (2 − 0.25)²/0.25 + (3 − 4.75)²/4.75 + (500 − 502)²/502 + (9500 − 9498)²/9498
           ≈ 12.9   (p < 0.001)

The null hypothesis is rejected with confidence .999,
since 12.9 > 10.83 (the value for .999 confidence).

                              Term = jaguar         Term ≠ jaguar

           Class = auto              2 (0.25)          500   (502)

           Class ≠ auto              3 (4.75)         9500 (9498)

   (plain numbers: observed counts fo; values in parentheses: expected counts fe)



                                                                                      16
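A minimal sketch in Python that recomputes χ² for the jaguar/auto table directly from observed and expected counts. Note that the hand calculation above rounds the expected counts (0.25, 4.75, 502, 9498) and gets 12.9; the unrounded value is about 12.85.

# Observed counts from the 2x2 table above (jaguar / auto example).
observed = {
    ("jaguar", "auto"): 2,
    ("jaguar", "not_auto"): 3,
    ("not_jaguar", "auto"): 500,
    ("not_jaguar", "not_auto"): 9500,
}
N = sum(observed.values())  # 10005

# Marginal probabilities, used to compute expected counts under independence.
p_term = {"jaguar": (2 + 3) / N, "not_jaguar": (500 + 9500) / N}
p_class = {"auto": (2 + 500) / N, "not_auto": (3 + 9500) / N}

# Chi-square: sum of (observed - expected)^2 / expected over all cells.
chi2 = 0.0
for (t, c), fo in observed.items():
    fe = N * p_term[t] * p_class[c]
    chi2 += (fo - fe) ** 2 / fe

print(round(chi2, 2))  # ~12.85 (12.9 in the slide, which rounds the expected counts)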
 χ² statistic (CHI)
There is a simpler formula for χ²:

      χ²(t,c) = N (AD − CB)² / [ (A+C) (B+D) (A+B) (C+D) ]


                 A = #(t,c)    C = #(¬t,c)

                B = #(t,¬c)   D = #(¬t, ¬c)


                  N=A+B+C+D


                                              17
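The shortcut form avoids computing expected counts explicitly. A minimal check in Python against the jaguar/auto counts (assuming the standard closed form for a 2×2 table, as written above):

def chi2_2x2(A, B, C, D):
    # Chi-square for a 2x2 term/class table.
    # A = #(t,c), B = #(t,not c), C = #(not t,c), D = #(not t,not c).
    N = A + B + C + D
    return N * (A * D - C * B) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D))

# Jaguar / auto counts from the earlier table.
print(chi2_2x2(A=2, B=3, C=500, D=9500))  # ~12.85, agreeing with the cell-by-cell sum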
 χ² statistic (CHI)
How to use χ² for multiple categories?
Compute χ² for each category and then combine:
       To require a feature to discriminate well across all
     categories, take the expected value of χ²:

          χ²_avg(t) = Σ_i P(c_i) χ²(t, c_i)

       Or, to weight toward a single category, take the maximum:

          χ²_max(t) = max_i χ²(t, c_i)

                                                               18
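A minimal sketch of the two combination schemes in Python (the per-category scores and priors are made-up numbers, only for illustration):

# Hypothetical per-category chi-square scores and category priors for one term.
chi2_per_class = {"auto": 12.85, "sports": 0.4, "politics": 1.1}
prior = {"auto": 0.3, "sports": 0.5, "politics": 0.2}

chi2_avg = sum(prior[c] * chi2_per_class[c] for c in chi2_per_class)
chi2_max = max(chi2_per_class.values())

print(chi2_avg, chi2_max)  # weighted average vs. best single category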
χ² statistic (CHI)


Pluses
     normalized and thus comparable across terms
     χ²(t,c) is 0 when t and c are independent
     can be compared to the χ² distribution with 1 degree of freedom


Minuses
     unreliable for low frequency terms




                                                           19
Information Gain

 A measure of importance of the feature for predicting
 the presence of the class.
   Has an information theoretic justification
 Defined as:
   The number of “bits of information” gained by
   knowing the term is present or absent
   Based on Information Theory
     – We won’t go into this in detail here.




                                                         20
 Information Gain (IG)
IG: number of bits of information gained by knowing the
term is present or absent

   IG(t) = H(c) − P(t) H(c|t) − P(¬t) H(c|¬t)

         = − Σ_i P(c_i) log P(c_i)                       entropy: H(c)
           + P(t) Σ_i P(c_i|t) log P(c_i|t)              specific conditional entropy H(c|t)
           + P(¬t) Σ_i P(c_i|¬t) log P(c_i|¬t)           specific conditional entropy H(c|¬t)

     t is the term being scored,
     c_i is a class variable




                                                                 21
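A minimal sketch in Python of information gain computed from the same kind of 2×2 counts used earlier (jaguar/auto numbers), following the H(c) − P(t)H(c|t) − P(¬t)H(c|¬t) form above:

from math import log2

def entropy(probs):
    # Entropy in bits, ignoring zero-probability outcomes.
    return -sum(p * log2(p) for p in probs if p > 0)

# 2x2 counts: A = #(t,c), B = #(t,not c), C = #(not t,c), D = #(not t,not c)
A, B, C, D = 2, 3, 500, 9500
N = A + B + C + D

h_class = entropy([(A + C) / N, (B + D) / N])          # H(c)
h_given_t = entropy([A / (A + B), B / (A + B)])        # H(c | t)
h_given_not_t = entropy([C / (C + D), D / (C + D)])    # H(c | not t)

ig = h_class - (A + B) / N * h_given_t - (C + D) / N * h_given_not_t
print(ig)  # information gain of the term, in bits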
 Mutual Information (MI)
 The probability of seeing x and y together
     vs.
 The probability of seeing x anywhere times the probability of
seeing y anywhere (independently).

     MI = log ( P(x,y) / P(x)P(y) )
        = log(P(x,y)) – log(P(x)P(y))

          From Bayes’ law: P(x,y) = P(x|y)P(y)

          = log(P(x|y)P(y)) – log(P(x)P(y))

     MI = log(P(x|y)) – log(P(x))

                                                              22
 Mutual Information (MI)

   I(t,c) = log [ P(t,c) / (P(t) P(c)) ]                 (rare terms get higher scores)

Approximation:

   I(t,c) ≈ log [ (A × N) / ((A+C) × (A+B)) ]            (does not use term absence)

                              A = #(t,c)    C = #(¬t,c)

                              B = #(t,¬c)   D = #(¬t, ¬c)

                 N = A + B + C + D

                                                          23
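A minimal sketch in Python of the approximation from the 2×2 counts, again using the jaguar/auto numbers:

from math import log2

def mutual_information(A, B, C, D):
    # Pointwise MI estimate: I(t,c) ~ log( A*N / ((A+C)*(A+B)) ).
    N = A + B + C + D
    return log2(A * N / ((A + C) * (A + B)))

# Jaguar / auto counts: A = #(t,c), B = #(t,not c), C = #(not t,c), D = #(not t,not c)
print(mutual_information(A=2, B=3, C=500, D=9500))
# Positive: "jaguar" co-occurs with "auto" more often than independence would predict.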
Using Mutual Information
Compute MI for each category and then combine:
     If we want to discriminate well across all categories, then
   we take the expected value of MI:

          I_avg(t) = Σ_i P(c_i) I(t, c_i)

      To discriminate well for a single category, we take
   the maximum:

          I_max(t) = max_i I(t, c_i)




                                                                24
Mutual Information


 Pluses
   I(t,c) is 0, when t and c are independent
   Has a sound information-theoretic interpretation


 Minuses
   Small numbers produce unreliable results
   Does not use term absence




                                                      25
[Figure from Yang & Pedersen ’97 comparing the feature selection methods:
 CHI max, IG, and DF perform best; Term Strength is intermediate;
 Mutual Information performs worst.]

                                                                        26
Feature Comparison




DF, IG and CHI are good and strongly correlated
       thus using DF is good, cheap and task independent
       can be used when IG and CHI are too expensive
 MI is bad
       favors rare terms (which are typically bad)




                                                           27
Term Weighting

 In the study just shown, terms were (mainly)
 treated as binary features
    If a term occurred in a document, it was
    assigned 1
    Else 0
 Often it is useful to weight the selected
 features
 Standard technique: tf.idf




                                                28
TF.IDF Term Weighting
TF: term frequency
     definition: TF = tij
     – frequency of term i in document j
    purpose: makes the frequent words for the document
   more important
IDF: inverse document frequency
     definition: IDF = log(N/ni)
     – ni : number of documents containing term i
     – N : total number of documents
     purpose: makes words that are rare across documents more
   important
TF.IDF (for term i in document j)
     definition: tij × log(N/ni)


                                                         29
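A minimal sketch of tf.idf weighting in Python, following the definition above (toy corpus chosen only for illustration):

from collections import Counter
from math import log

# Toy corpus; each document is a list of tokens.
docs = [
    ["car", "engine", "car", "repair"],
    ["car", "race", "driver"],
    ["election", "vote", "candidate"],
]
N = len(docs)

# Document frequency n_i for each term.
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    # tf.idf = t_ij * log(N / n_i) for each term i in document j.
    tf = Counter(doc)
    return {term: tf[term] * log(N / df[term]) for term in tf}

print(tfidf(docs[0]))
# "car" is frequent in the document but also common across documents,
# so its weight is moderated by the idf factor.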
Term Normalization

 Combine different words into a single representation
   Stemming/morphological analysis
     – bought, buy, buys -> buy
   General word categories
     –   $23.45, 5.30 Yen -> MONEY
     –   1984, 10,000      -> DATE, NUM
     –   PERSON
     –   ORGANIZATION
            (Covered in Information Extraction segment)
   Generalize with lexical hierarchies
     – WordNet, MeSH
         (Covered later in the semester)




                                                           30
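A minimal sketch of this kind of normalization, assuming the NLTK package is installed. A rule-based stemmer handles regular suffixes; irregular forms such as "bought" need a lemmatizer or fuller morphological analysis:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["buys", "buying", "bought"]:
    print(word, "->", stemmer.stem(word))
# buys -> buy, buying -> buy, but bought stays "bought":
# irregular forms need morphological analysis, not just suffix stripping.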
What Do People Do In Practice?

1. Feature selection
   infrequent term removal
       infrequent across the whole collection (i.e. DF)
       seen in a single document
   most frequent term removal (i.e. stop words)

2. Normalization:
     1. Stemming. (often)
     2. Word classes (sometimes)
3. Feature weighting: TF.IDF or IDF
4. Dimensionality reduction (sometimes)


                                                          31
Weka

Java-based tool for large-scale machine-learning
problems
Tailored towards text analysis
http://weka.sourceforge.net/wekadoc/




                                                   32
Weka Input Format
 Expects a particular input file format
    Called ARFF: Attribute-Relation File Format
    Consists of a Header and a Data section




                               http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)



                                                                                      33
 WEKA File Format: ARFF

@relation heart-disease-simplified

% numeric attribute
@attribute age numeric
% nominal attribute
@attribute sex { female, male }
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina }
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes }
@attribute class { present, not_present }
% other attribute types: string, date

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
% "?" marks a missing value
38,female,non_anginal,?,no,not_present
...

http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)
Slide adapted from Eibe Frank's                                                    34
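A minimal sketch in Python of writing a tiny ARFF file of this shape for a text task: one numeric attribute per selected term plus a nominal class attribute (the terms, documents, and class labels are made up for illustration):

# Hypothetical selected terms and labeled documents (term counts per document).
terms = ["car", "engine", "election"]
documents = [
    ({"car": 2, "engine": 1}, "auto"),
    ({"election": 3}, "politics"),
]

lines = ["@relation text_categorization", ""]
for term in terms:
    lines.append(f"@attribute {term} numeric")
lines.append("@attribute class { auto, politics }")
lines.extend(["", "@data"])
for counts, label in documents:
    row = [str(counts.get(term, 0)) for term in terms] + [label]
    lines.append(",".join(row))

with open("texts.arff", "w") as f:
    f.write("\n".join(lines) + "\n")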
WEKA Sparse File Format
Value 0 is not represented explicitly
Same header (i.e. @relation and @attribute tags)
Only the @data section is different
     Instead of
               @data
                   0, X, 0, Y, "class A"
                   0, 0, W, 0, "class B"
     We have
               @data
                    {1 X, 3 Y, 4 "class A"}
                    {2 W, 4 "class B"}
This saves LOTS of space for text applications.
     Why?

                                                  35
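Continuing the sketch above, a sparse @data row lists only the non-zero attributes as index-value pairs; the indices are 0-based and follow the attribute order in the header. A minimal sketch in Python:

def sparse_row(counts, terms, label):
    # Format one sparse ARFF data row: {index value, ...} for non-zero entries.
    entries = [f"{i} {counts[t]}" for i, t in enumerate(terms) if counts.get(t, 0) != 0]
    entries.append(f"{len(terms)} {label}")  # the class attribute is the last index
    return "{" + ", ".join(entries) + "}"

terms = ["car", "engine", "election"]
print(sparse_row({"car": 2, "engine": 1}, terms, "auto"))      # {0 2, 1 1, 3 auto}
print(sparse_row({"election": 3}, terms, "politics"))          # {2 3, 3 politics}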
Next Time

 Wed: Guest lecture by Peter Jackson:
   Pure and Applied Research in NLP: The Good, the
   Bad, and the Lucky.
 Following week:
   Text Categorization Algorithms
   How to use Weka




                                                     36

				