I256: Applied Natural Language Processing
Marti Hearst
Oct 9, 2006

Today
• Finish Conditional Probabilities and Bayesian Learning
• Intro to Classification: identification of language and author

Conditional Probability
A way to reason about the outcome of an experiment based on partial information.
• In a word guessing game the first letter of the word is a “t”. What is the likelihood that the second letter is an “h”?
• How likely is it that a person has a disease given that a medical test was negative?
• A spot shows up on a radar screen. How likely is it that it corresponds to an aircraft?
Slide adapted from Dan Jurafsky's

Conditional Probability
Conditional probability specifies the probability of an event given that the values of some other random variables are known.
P(Sneeze | Cold) = 0.8
P(Cold | Sneeze) = 0.6
The probability of a sneeze given a cold is 80%. The probability of a cold given a sneeze is 60%.
Slides adapted from Mary Ellen Califf

More precisely
Given an experiment, a corresponding sample space S, and the probability law:
• Suppose we know that the outcome is within some given event B (the first letter was “t”).
• We want to quantify the likelihood that the outcome also belongs to some other given event A (the second letter will be “h”).
• We need a new probability law that gives us the conditional probability of A given B: P(A|B), “the probability of A given B”.
Slide adapted from Dan Jurafsky's

Joint Probability Distribution
The joint probability distribution for a set of random variables X1…Xn gives the probability of every combination of values: P(X1,...,Xn)

          Sneeze   ¬Sneeze
Cold      0.08     0.01
¬Cold     0.01     0.9

The probability of all possible cases can be calculated by summing the appropriate subset of values from the joint distribution. All conditional probabilities can therefore also be calculated, for example:
P(Cold | ¬Sneeze) = P(Cold, ¬Sneeze) / P(¬Sneeze) = 0.01 / (0.01 + 0.9) ≈ 0.011
Slides adapted from Mary Ellen Califf

An intuition
• Let’s say A is “it’s raining”.
• Let’s say P(A) in dry California is .01
• Let’s say B is “it was sunny ten minutes ago”
• P(A|B) means “what is the probability of it raining now if it was sunny 10 minutes ago”
• P(A|B) is probably way less than P(A)
• Perhaps P(A|B) is .0001
• Intuition: The knowledge about B should change our estimate of the probability of A.
Slide adapted from Dan Jurafsky's

Conditional Probability
Let A and B be events.
P(A,B) and P(A ∩ B) both mean “the probability that BOTH A and B occur”.
p(B|A) = the probability of event B occurring given that event A occurs.
Definition: p(A|B) = p(A ∩ B) / p(B)
$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$
P(A, B) = P(A|B) * P(B) (simple arithmetic)
P(A, B) = P(B, A)
Slide adapted from Dan Jurafsky's

Bayes Theorem
We start with the definition of conditional probability:
$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$
So say we know how to compute P(A|B). What if we want to figure out P(B|A)? We can rearrange the formula using Bayes Theorem:
$$P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}$$

Deriving Bayes Rule
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} \qquad P(B \mid A) = \frac{P(A \cap B)}{P(A)}$$
$$P(A \mid B)\, P(B) = P(A \cap B) \qquad P(B \mid A)\, P(A) = P(A \cap B)$$
$$P(A \mid B)\, P(B) = P(B \mid A)\, P(A)$$
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$
Slide adapted from Dan Jurafsky's

How to compute probabilities?
• We don’t have the probabilities for most NLP problems.
• We can try to estimate them from data (that’s the learning part).
• Usually we can’t directly estimate the probability that something belongs to a given class given the information about it,
• BUT we can estimate the probability that something in a given class has particular values.
Slides adapted from Mary Ellen Califf

Simple Bayesian Reasoning
If we assume there are n possible disjoint tags, t1 … tn:
$$P(t_i \mid w) = \frac{P(w \mid t_i)\, P(t_i)}{P(w)}$$
Want to know the probability of the tag given the word.
P(w|ti) = number of times we see this tag with this word divided by how often we see the tag
P(w|ti) = (count of word w with tag i) / (count of tag i in corpus)
P(ti) = (count of tag i in corpus) / (count of all tags)
P(w) = (count of word w in corpus) / (count of all words)
Slides adapted from Mary Ellen Califf

Some notation
$$\prod_{i} P(f_i \mid \text{Sentence})$$
This means that you multiply all the feature probabilities together:
P(f1|S) * P(f2|S) * … * P(fn|S)
There is a similar notation for summation.

Naïve Bayes Classifier
The simpler version of Bayes was:
P(B|A) = P(A|B) P(B) / P(A)
P(Sentence | feature) = P(feature | Sentence) P(Sentence) / P(feature)
Using Naïve Bayes, we expand the number of features by defining a joint probability distribution:
$$P(\text{Sentence}, f_1, f_2, \ldots, f_n) = P(\text{Sentence}) \prod_{i=1}^{n} P(f_i \mid \text{Sentence})$$
We learn P(Sentence) and P(fi | Sentence) in training.
Test: we need P(Sentence | f1, f2, … fn):
P(Sentence | f1, f2, … fn) = P(Sentence, f1, f2, … fn) / P(f1, f2, … fn)

Bayes Independence Example
If there are many kinds of evidence, we need to combine them. By assuming independence, we ignore the possible interactions.
Imagine there are diagnoses ALLERGY, COLD, and WELL, and symptoms SNEEZE, COUGH, and FEVER.

Prob           Well    Cold    Allergy
P(d)           0.9     0.05    0.05
P(sneeze | d)  0.1     0.9     0.9
P(cough | d)   0.1     0.8     0.7
P(fever | d)   0.01    0.7     0.4

Slides adapted from Mary Ellen Califf

Bayes Independence Example
If symptoms are sneeze & cough & no fever (evidence e):
P(well | s, c, not(f)) = P(e | well) P(well) / P(e)
  = P(s | well) * P(c | well) * (1 - P(f | well)) * P(well) / P(e)
  = (0.1)(0.1)(0.99)(0.9) / P(e) = 0.0089 / P(e)
P(cold | e) = (0.05)(0.9)(0.8)(0.3) / P(e) ≈ 0.01 / P(e)
P(allergy | e) = (0.05)(0.9)(0.7)(0.6) / P(e) ≈ 0.019 / P(e)
P(e) = 0.0089 + 0.01 + 0.019 = 0.0379
P(well | e) = .23
P(cold | e) = .26
P(allergy | e) = .50
Diagnosis: allergy
Slides adapted from Mary Ellen Califf

Kupiec et al. Feature Representation
Fixed-phrase feature: certain phrases indicate a summary, e.g.
“in summary”
Paragraph feature: paragraph-initial and paragraph-final sentences are more likely to be important.
Thematic word feature: repetition is an indicator of importance.
Uppercase word feature: uppercase often indicates named entities. (Taylor)
Sentence length cut-off: a summary sentence should be > 5 words.

Details: Bayesian Classifier
$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{P(F_1, F_2, \ldots, F_k \mid s \in S)\, P(s \in S)}{P(F_1, F_2, \ldots, F_k)}$$
Assuming statistical independence:
$$P(s \in S \mid F_1, F_2, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid s \in S)\; P(s \in S)}{\prod_{j=1}^{k} P(F_j)}$$
• P(s ∈ S | F1, F2, ... Fk): probability that sentence s is included in summary S, given that sentence’s feature-value pairs
• P(Fj | s ∈ S): probability of a feature-value pair occurring in a source sentence which is also in the summary
• P(s ∈ S): compression rate
• P(Fj): probability of a feature-value pair occurring in a source sentence

Language Identification

Language identification
Tutti gli esseri umani nascono liberi ed eguali in dignità e diritti. Essi sono dotati di ragione e di coscienza e devono agire gli uni verso gli altri in spirito di fratellanza.
Alle Menschen sind frei und gleich an Würde und Rechten geboren. Sie sind mit Vernunft und Gewissen begabt und sollen einander im Geist der Brüderlichkeit begegnen.
Universal Declaration of Human Rights, UN, in 363 languages
http://www.unhchr.ch/udhr/navigate/alpha.htm

Language identification
égaux eguali iguales edistämään Ü ¿
How do we determine, for a stretch of text, which language it is from?
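
One possible way to approach this question is sketched below, in the spirit of the character-bigram and Kullback-Leibler approach attributed to Sibun & Reynar (96) in these slides. It is a minimal illustration, not their implementation: the function names, the add-one smoothing, and the use of the two quoted UDHR sentences as the only training text are all assumptions made here to keep the example self-contained.

```python
import math
from collections import Counter

# Tiny illustrative training texts: the two UDHR sentences quoted on the slide.
# A real system would train on far more text per language (assumption).
TRAINING_TEXT = {
    "italian": "Tutti gli esseri umani nascono liberi ed eguali in dignità e "
               "diritti. Essi sono dotati di ragione e di coscienza e devono "
               "agire gli uni verso gli altri in spirito di fratellanza.",
    "german": "Alle Menschen sind frei und gleich an Würde und Rechten geboren. "
              "Sie sind mit Vernunft und Gewissen begabt und sollen einander "
              "im Geist der Brüderlichkeit begegnen.",
}


def bigram_counts(text):
    """Count overlapping character bigrams in lowercased text."""
    text = text.lower()
    return Counter(text[i:i + 2] for i in range(len(text) - 1))


def kl_divergence(test_counts, lang_counts, vocab_size, alpha=1.0):
    """Relative entropy D(test || language) over character bigrams.

    The language model is add-alpha smoothed (an assumption of this sketch)
    so that bigrams unseen in training do not give an infinite distance.
    """
    test_total = sum(test_counts.values())
    lang_total = sum(lang_counts.values())
    distance = 0.0
    for bigram, count in test_counts.items():
        p = count / test_total
        q = (lang_counts[bigram] + alpha) / (lang_total + alpha * vocab_size)
        distance += p * math.log(p / q)
    return distance


def identify_language(text, training_text=TRAINING_TEXT):
    """Return the language whose bigram distribution is closest to the text."""
    models = {lang: bigram_counts(t) for lang, t in training_text.items()}
    test_counts = bigram_counts(text)
    if not test_counts:
        raise ValueError("need at least two characters of text")
    # Shared vocabulary: every bigram seen in any training text or in the test.
    vocab = set(test_counts)
    for counts in models.values():
        vocab.update(counts)
    distances = {
        lang: kl_divergence(test_counts, counts, len(vocab))
        for lang, counts in models.items()
    }
    return min(distances, key=distances.get)
```

Calling `identify_language("Alle Menschen sind frei und gleich")` compares the test distribution against each language model and picks the smallest distance. Smoothing only the language side keeps the distance finite while leaving the test distribution as observed.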

Language Identification
• Turns out to be really simple
• Just a few character bigrams can do it (Sibun & Reynar 96)
• Used Kullback-Leibler distance (relative entropy)
• Compare the probability distribution of the test set to those of the languages trained on
• Smallest distance determines the language
• Using special character sets helps a bit, but barely

Language Identification (Sibun & Reynar 96)

Confusion Matrix
A table that shows, for each class, which items your algorithm got right and which it got wrong.
Axes: gold standard, algorithm’s guess

Author Identification (Stylometry)

Author Identification
• Also called Stylometry in the humanities
• An example of a Classification Problem
• Classifiers: decide which of N buckets to put an item in (some classifiers allow for multiple buckets)

The Disputed Federalist Papers
In 1787-1788, Jay, Madison, and Hamilton wrote a series of anonymous essays to convince the voters of New York to ratify the new U.S. Constitution.
Scholars agree that:
• 5 authored by Jay
• 51 authored by Hamilton
• 14 authored by Madison
• 3 jointly by Hamilton and Madison
• 12 remain in dispute … Hamilton or Madison?

Author identification: Federalist papers
• In 1963 Mosteller and Wallace solved the problem
• They identified function words as good candidates for authorship analysis
• Using statistical inference they concluded the author was Madison
• Since then, other statistical techniques have supported this conclusion.

Function vs. Content Words
• High rates for “by” favor M, low favor H
• High rates for “from” favor M, low says little
• High rates for “to” favor H, low favor M

Function vs. Content Words
No consistent pattern for “war”

Federalist Papers Problem
Fung, The Disputed Federalist Papers: SVM Feature Selection Via Concave Minimization, ACM TAPIA’03

Discussion
Can Pseudonymity Really Guarantee Privacy? Rao and Rohatgi, 2000

Next Time
Guest lecture by Elizabeth Charnock and Steve Roberts of Cataphora
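
Returning to the function-word slides above, the "rates" for words such as “by”, “from”, and “to” can be computed with a few lines of code. The sketch below is illustrative only: the function name, the simple regular-expression tokenizer, and the normalization to occurrences per 1,000 words are assumptions made here, since the slides say only "rates" and give no thresholds.

```python
import re
from collections import Counter

# The three function words named on the slide.
FUNCTION_WORDS = ("by", "from", "to")


def function_word_rates(text, words=FUNCTION_WORDS, per=1000):
    """Return occurrences of each function word per `per` word tokens.

    Tokenization is a plain lowercase letters-and-apostrophes regex
    (an assumption of this sketch, not the method used in the studies cited).
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = len(tokens)
    if total == 0:
        return {w: 0.0 for w in words}
    counts = Counter(tokens)
    return {w: per * counts[w] / total for w in words}


if __name__ == "__main__":
    sample = "To be or not to be by the sea"
    for word, rate in function_word_rates(sample).items():
        print(f"{word}: {rate:.1f} per 1000 words")
```

Rates computed this way for a disputed paper could then be compared with each candidate author's typical rates. How that comparison is made (for example, the statistical inference used by Mosteller and Wallace) is outside the scope of this sketch.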