Slide 1: COMP791A: Statistical Language Processing / Collocations (Chap. 5)

Slide 2: What is a collocation?
- A collocation is an expression of two or more words that corresponds to a conventional way of saying things.
  - broad daylight (why not ?bright daylight or ?narrow darkness?)
  - big mistake, but not ?large mistake
- Collocations overlap with the concepts of terms, technical terms, and terminological phrases: collocations extracted from technical domains. Ex: hydraulic oil filter, file transfer protocol.

Slide 3: Examples of collocations
- strong tea
- weapons of mass destruction
- to make up
- to check in
- heard it through the grapevine
- he knocked at the door
- I made it all up

Slide 4: Definition of a collocation (Choueka, 1988)
- [A collocation is defined as] "a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components."
- Criteria: non-compositionality, non-substitutability, non-modifiability, not translatable word for word.

Slide 5: Non-compositionality
- A phrase is compositional if its meaning can be predicted from the meanings of its parts.
- Collocations have limited compositionality: there is usually an element of meaning added to the combination. Ex: strong tea.
- Idioms are the most extreme examples of non-compositionality. Ex: to hear it through the grapevine.

Slide 6: Non-substitutability
- We cannot substitute near-synonyms for the components of a collocation.
- strong is a near-synonym of powerful, yet: strong tea but ?powerful tea
- yellow is as good a description of the color of white wines, yet: white wine but ?yellow wine

Slide 7: Non-modifiability
- Many collocations cannot be freely modified with additional lexical material or through grammatical transformations:
  - weapons of mass destruction --> ?weapons of massive destruction
  - to be fed up to the back teeth --> ?to be fed up to the teeth in the back

Slide 8: Not translatable word for word
- English: make a decision, but not ?take a decision
- French: prendre une décision, but not ?faire une décision
- A test of whether a group of words is a collocation: translate it into another language; if we cannot translate it word by word, then it probably is a collocation.

Slide 9: Linguistic subclasses of collocations
- Phrases with light verbs (verbs with little semantic content in the collocation): make, take, do, …
- Verb-particle / phrasal-verb constructions: to go down, to check out, …
- Proper nouns: John Smith
- Terminological expressions (concepts and objects in technical domains): hydraulic oil filter

Slide 10: Why study collocations?
- In NLG or MT: the output should be natural (make a decision, not ?take a decision).
- In lexicography: identify collocations to list them in a dictionary, and to distinguish the usage of synonyms or near-synonyms.
- In parsing: to give preference to the most natural attachments (plastic (can opener) vs. ?(plastic can) opener).
- In corpus linguistics and psycholinguistics: e.g. to study social attitudes towards different types of substances (strong cigarettes/tea/coffee but powerful drug).

Slide 11: A note on (near-)synonymy
- To determine whether two words are synonyms, use the principle of substitutability: two words are synonyms if they can be substituted for one another in some?/any? sentence without changing the meaning or acceptability of the sentence.
  - How big/large is this plane?
  - Would I be flying on a big/large or small plane?
  - Miss Nelson became a kind of big / ??large sister to Tom.
  - I think I made a big / ??large mistake.

Slide 12: A note on (near-)synonymy (cont'd)
- True synonyms are rare…
- Whether a near-synonym can substitute depends on:
  - shades of meaning: words may share a central core meaning but have different sense accents
  - register / social factors: speaking to a 4-year-old vs. to graduate students!
  - collocations: conventional ways of saying something / fixed expressions

Slide 13: Approaches to finding collocations
- Frequency
- Mean and variance
- Hypothesis testing: t-test, chi-square test
- Mutual information

Slide 14: Approaches to finding collocations (--> Frequency)

Slide 15: Frequency (Justeson & Katz, 1995)
- Hypothesis: if two words occur together very often, they must be interesting candidates for a collocation.
- Method: select the most frequently occurring bigrams (sequences of 2 adjacent words).

Slide 16: Results
- Not very interesting… except for "New York", all the top bigrams are pairs of function words.
- So, let's pass the results through a part-of-speech filter:

  Tag pattern   Example
  A N           linear function
  N N           regression coefficient
  A A N         Gaussian random variable
  A N N         cumulative distribution function
  N A N         mean squared error
  N N N         class probability function
  N P N         degrees of freedom

Slide 17: Frequency + POS filter
- A simple method that works very well.

Slide 18: "Strong" versus "powerful"
- On a 14-million-word corpus from the New York Times (Aug.-Nov.
1990).

Slide 19: Frequency: conclusion
- Advantages: works well for fixed phrases; a simple method with accurate results; requires little linguistic knowledge.
- But: many collocations consist of two words in more flexible relationships:
  - she knocked on his door
  - they knocked at the door
  - 100 women knocked on Donaldson's door
  - a man knocked on the metal front door

Slide 20: Approaches to finding collocations (--> Mean and variance)

Slide 21: Mean and variance (Smadja et al., 1993)
- Looks at the distribution of distances between two words in a corpus, searching for pairs of words with low variance.
- A low variance means that the two words usually occur at about the same distance; a low variance --> a good candidate for a collocation.
- We need a collocational window to capture collocations of variable distances (knock … door).

Slide 22: Collocational window
- Example of a three-word window. To capture 2-word collocations in the sentence "this is an example of a three word window", we collect the pairs:
  this is, this an, is an, is example, an example, an of, example of, example a, of a, of three, a three, a word, three word, three window, word window

Slide 23: Mean and variance (cont'd)
- The mean is the average offset (signed distance) between the two words in the corpus.
- The variance measures how much the individual offsets deviate from the mean:

  s² = Σ_{i=1..n} (dᵢ − d̄)² / (n − 1)

  where n is the number of times the two candidate words co-occur, dᵢ is the offset of the i-th pair, and d̄ is the mean offset over all pairs.
- If the offsets dᵢ are the same in all co-occurrences --> the variance is zero --> definitely a collocation.
- If the offsets dᵢ are randomly distributed --> the variance is high --> not a collocation.

Slide 24: An example
- Window size = 11 around knock (5 left, 5 right):
  - she knocked on his door
  - they knocked at the door
  - 100 women knocked on Donaldson's door
  - a man knocked on the metal front door
- Mean: d̄ = (3 + 3 + 5 + 5) / 4 = 4.0
- Std.
deviation: s = sqrt( ((3 − 4.0)² + (3 − 4.0)² + (5 − 4.0)² + (5 − 4.0)²) / 3 ) ≈ 1.15

Slide 25: Position histograms
- "strong … opposition": the variance is low --> an interesting collocation.
- "strong … support" and "strong … for": the variance is high --> not interesting collocations.

Slide 26: Mean and variance versus frequency
- std. dev. ≈ 0 and mean offset ≈ 1 --> would be found by the frequency method
- std. dev. ≈ 0 and high mean offset --> very interesting, but would not be found by the frequency method
- high deviation --> not interesting

Slide 27: Mean & variance: conclusion
- Good for finding collocations that have a looser relationship between the words: intervening material and variable relative positions.

Slide 28: Approaches to finding collocations (--> Hypothesis testing)

Slide 29: Hypothesis testing
- If two words are frequent, they will frequently occur together; frequent bigrams and low variance can be accidental (two words can co-occur by chance).
- We want to determine whether the co-occurrence is random or whether it occurs more often than chance would predict.
- This is a classical problem in statistics called hypothesis testing: when two words co-occur, hypothesis testing measures how confident we can be that this was, or was not, due to chance.

Slide 30: Hypothesis testing (cont'd)
- We formulate a null hypothesis H0: no real association (just chance). H0 states what should be true if the two words do not form a collocation.
- If two words w1 and w2 do not form a collocation, then w1 and w2 are independent of each other: P(w1 w2) = P(w1) P(w2).
- We need a statistical test that tells us how probable or improbable it is that a certain combination occurs. Statistical tests: the t-test and the chi-square test.

Slide 31: Approaches to finding collocations (--> t-test)

Slide 32: Hypothesis testing: the t-test (or Student's t-test)
- H0 states that P(w1 w2) = P(w1) P(w2).
- We calculate the probability (p-value) that the observed co-occurrence would arise if H0 were true.
- If the p-value is too low, we reject H0. Typically we reject at a significance level of p
< 0.05, 0.01, or 0.001; otherwise, we retain H0 as possible.

Slide 33: Some intuition
- Assume we want to compare the heights of men and women. We cannot measure the height of every adult, so we take a sample of the population and make inferences about the whole population by comparing the sample means and the variation around each mean.
- H0: women and men are equally tall, on average.
- We gather data from 10 men and 10 women.

Slide 34: Some intuition (cont'd)
- The t-test compares the sample mean (computed from the observed values) to an expected mean, and determines the likelihood (p-value) that the difference between the two means occurs by chance.
- A p-value close to 1 --> it is very likely that the expected and sample means are the same.
- A small p-value (e.g. 0.01) --> it is unlikely (only a 1-in-100 chance) that such a difference would occur by chance.
- So the lower the p-value, the more certain we are that there is a significant difference between the observed and expected means, and we reject H0.

Slide 35: Some intuition (cont'd)
- The t-test assigns a probability describing the likelihood that the null hypothesis is true: a high p-value --> accept H0; a low p-value --> reject H0.
- [Figure: 1-tailed and 2-tailed t distributions, showing the critical value c (the value of t beyond which we decide to reject H0) and the confidence level a, the probability that the t-score exceeds the critical value c.]

Slide 36: Some intuition (cont'd)
1. Compute the t-score.
2. Consult the table of critical values with df = 18 (10 + 10 − 2).
3.
If t > the critical value in the table, then the two samples are significantly different at the probability level listed.
- Assume t = 2.7. If there is no difference in height between women and men (H0 is true), then the probability of finding t = 2.7 is between 0.025 and 0.01. That is not much, so we reject the null hypothesis H0 and conclude that there is a difference in height between men and women. [Probability table based on the t distribution (2-tailed test) omitted.]

Slide 37: The t-test
- Looks at the mean and variance of a sample of measurements; the null hypothesis is that the sample is drawn from a distribution with mean μ.
- The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely one is to get a sample of that mean and variance assuming the sample is drawn from a normal distribution with mean μ.

Slide 38: The t-statistic
- The difference between the observed mean x̄ and the expected mean μ:

  t = (x̄ − μ) / sqrt(s² / N)

  where x̄ is the sample mean, μ is the expected mean of the distribution, s² is the sample variance, and N is the sample size.
- The higher the value of t, the greater the confidence that there is a significant difference, that it is not due to chance, and that the two words are not independent.

Slide 39: t-test for finding collocations
- Is w1 w2 a collocation?
- Think of a corpus of N words as a long sequence of N bigrams, and randomly generate one bigram: if the bigram is w1 w2 --> success; if the bigram is not w1 w2 --> failure. This is in effect a Bernoulli trial.

Slide 40: Binomial distribution
- A sequence of such Bernoulli trials follows a binomial distribution:
  - each trial has only two outcomes (success or failure)
  - the trials are independent
  - there is a fixed number of trials
- The distribution has 2 parameters: the number of trials n and the probability of success p in one trial.
- Example: flipping a coin 10 times and counting the number of heads that occur. We can only get a head or a tail (2 outcomes); the probability of success for each trial is p = ½; the coin flips do not affect each other (independence); there are 10 coin flips (n = 10).

Slide 41: Properties of the binomial distribution
- Mean (or expectation): E(X) = μ = np. Ex: flipping a coin 10 times --> E(heads) = μ = 10 × ½ = 5.
- Variance: σ² = np(1 − p). Ex: flipping a coin 10 times --> σ² = 10 × ½ × ½ = 2.5.
- A binomial distribution is built from a sequence of independent trials (n > 1). If we have only 1 trial (n = 1), we have a single Bernoulli trial, and s² = p(1 − p), where p is the probability of success and (1 − p) is the probability of failure.

Slide 42: t-test: example with collocations
- In a corpus: new occurs 15,828 times; companies occurs 4,675 times; new companies occurs 8 times; there are 14,307,668 tokens overall.
- Is new companies a collocation? Apply t = (x̄ − μ) / sqrt(s² / N).

Slide 43: Example (cont'd)
- x̄: the observed mean is 8 / 14,307,668 ≈ 5.591 × 10⁻⁷.
- μ: if the null hypothesis is true, then by the independence assumption P(new companies) = P(new) P(companies), so the expected probability of new companies is (15,828 / 14,307,668) × (4,675 / 14,307,668) ≈ 3.615 × 10⁻⁷.
- s²: the sample variance is p(1 − p), where p is the probability of success according to the observations (i.e. of getting the bigram new companies). Since p is small for most bigrams, s² ≈ p ≈ 8 / 14,307,668 ≈ 5.591 × 10⁻⁷.
- N: the total number of bigrams = 14,307,668.

Slide 44: Example (cont'd)
- Applying the t-test:

  t = (x̄ − μ) / sqrt(s² / N) = (5.591 × 10⁻⁷ − 3.615 × 10⁻⁷) / sqrt(5.591 × 10⁻⁷ / 14,307,668) ≈ 1

- With a confidence level a = 0.005, the critical value is 2.576 (t should be at least 2.576).
- Since t = 1 < 2.576, we cannot reject H0, so we cannot claim that new and companies form a collocation.

Slide 45: t-test: some results
- The t-test applied to 10 bigrams that occur with frequency = 20:

  t        C(w1)   C(w2)   C(w1 w2)   w1 w2
  4.4721   42      20      20         Ayatollah Ruhollah
  4.4721   41      27      20         Bette Midler
  1.2176   14093   14776   20         like people
  0.8036   15019   15629   20         time last

- Bigrams with t > 2.576 pass the t-test: we can reject the null hypothesis, so they form a collocation. Bigrams with t < 2.576 fail the t-test: we cannot reject the null hypothesis, so they do not form a collocation.
- Notes: a frequency-based method could not have seen the difference between these bigrams, because they all have the same frequency. The t-test takes into account the frequency of a bigram relative to the frequencies of its component words: if a high proportion of the occurrences of both words occur in the bigram, then its t is high. The t-test is mostly used to rank collocations.

Slide 46: Hypothesis testing of differences
- Used to see whether two words (near-synonyms) are used in the same contexts or not ("strong" vs. "powerful"); this can be useful in lexicography.
- We want to test whether there is a difference between 2 populations (e.g. heights of women / heights of men); the null hypothesis is that there is no difference, i.e. the average difference is 0 (μ = 0).

  t = (x̄1 − x̄2) / sqrt(s1²/n1 + s2²/n2)

  where x̄1 and x̄2 are the sample means, s1² and s2² the sample variances, and n1 and n2 the sample sizes of populations 1 and 2.

Slide 47: Difference test example
- Is there a difference in how we use "powerful" and how we use "strong"?
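Before turning to the difference test, the bigram t-score computation worked through above can be sketched in Python. This is a minimal illustration, not part of the original slides: the helper name t_score is my own, and it uses the s² ≈ p approximation from the example.

```python
import math

def t_score(c1, c2, c12, N):
    """t-score of bigram w1 w2 from corpus counts.
    c1 = C(w1), c2 = C(w2), c12 = C(w1 w2), N = corpus size."""
    x_bar = c12 / N            # observed mean: relative frequency of the bigram
    mu = (c1 / N) * (c2 / N)   # expected mean under H0 (independence)
    s2 = x_bar                 # sample variance p(1 - p), approximated by p
    return (x_bar - mu) / math.sqrt(s2 / N)

N = 14_307_668
print(round(t_score(15828, 4675, 8, N), 2))   # new companies -> 1.0
print(round(t_score(42, 20, 20, N), 4))       # Ayatollah Ruhollah -> 4.4721
```

Both values reproduce the slides: "new companies" scores about 1 (below the 2.576 critical value, so not a collocation), while "Ayatollah Ruhollah" scores 4.4721 (a collocation).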
  t        C(w)    C(strong w)   C(powerful w)   Word
  3.1622   933     0             10              computers
  2.8284   2377    0             8               computer
  2.4494   289     0             6               symbol
  2.2360   2266    0             5               Germany
  7.0710   3685    50            0               support
  6.3257   3616    58            7               enough
  4.6904   986     22            0               safety
  4.5825   3741    21            0               sales

Slide 48: Approaches to finding collocations (--> chi-square test)

Slide 49: Hypothesis testing: the χ²-test
- A problem with the t-test is that it assumes that probabilities are approximately normally distributed; the χ²-test does not make this assumption.
- The essence of the χ²-test is the same as that of the t-test: compare the observed frequencies with the frequencies expected under independence. If the difference is large, then we can reject the null hypothesis of independence.

Slide 50: χ²-test
- In its simplest form, it is applied to a 2×2 table of observed frequencies.
- The χ² statistic sums the differences between the observed frequencies (in the table) and the values expected under independence, scaled by the magnitude of the expected values:

  X² = Σ_{i,j} (Obs_ij − Exp_ij)² / Exp_ij

  where i ranges over rows, j ranges over columns, Obs_ij is the observed value for cell (i, j), and Exp_ij is the expected value.

Slide 51: χ²-test: example
- Observed frequencies Obs_ij:

  Observed         w1 = new             w1 ≠ new              TOTAL
  w2 = companies   8                    4,667                 4,675
                   (new companies)      (ex: old companies)   c(companies)
  w2 ≠ companies   15,820               14,287,181            14,303,001
                   (ex: new machines)   (ex: old machines)    c(~companies)
  TOTAL            15,828               14,291,848            14,307,676
                   c(new)               c(~new)               N

  N = 4,675 + 14,303,001 = 15,828 + 14,291,848

Slide 52: χ²-test: example (cont'd)
- Expected frequencies Exp_ij under independence, computed from the marginal probabilities (the totals of the rows and columns converted into proportions):

  Expected         w1 = new                             w1 ≠ new
  w2 = companies   5.17                                 4,669.83
                   c(new) × c(companies) / N            c(~new) × c(companies) / N
                   = 15,828 × 4,675 / 14,307,676        = 14,291,848 × 4,675 / 14,307,676
  w2 ≠ companies   15,822.83                            14,287,178.17
                   c(new) × c(~companies) / N           c(~new) × c(~companies) / N
                   = 15,828 × 14,303,001 / 14,307,676   = 14,291,848 × 14,303,001 / 14,307,676

- Ex: the expected frequency for cell (1,1)
(new companies) is the marginal probability of new occurring as the first word of a bigram, times the marginal probability of companies occurring as the second word of a bigram, times N:

  Exp_11 = ((8 + 15,820) / N) × ((8 + 4,667) / N) × N = (15,828 / N) × (4,675 / N) × N ≈ 5.17

- If new and companies occurred completely independently of each other, we would expect 5.17 occurrences of new companies on average.

Slide 53: χ²-test: example (cont'd)
- But is the difference significant?

  χ² = (8 − 5.17)²/5.17 + (4,667 − 4,669.83)²/4,669.83 + (15,820 − 15,822.83)²/15,822.83 + (14,287,181 − 14,287,178.17)²/14,287,178.17 ≈ 1.55

- Degrees of freedom in an r×c table: df = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1.
- At the probability level a = 0.05, the critical value is 3.84. Since 1.55 < 3.84, we cannot reject H0 (that new and companies occur independently of each other), so new companies is not a good candidate for a collocation.

Slide 54: χ²-test: conclusion
- The differences between the t statistic and the χ² statistic do not seem to be large.
- But: the χ²-test is appropriate for large probabilities, where the t-test fails because of the normality assumption; the χ²-test is not appropriate with sparse data (if the numbers in the 2×2 table are small).
- The χ²-test has been applied to a wider range of problems: machine translation, corpus similarity.

Slide 55: χ²-test for machine translation (Church & Gale, 1991)
- Used to identify translation word pairs in aligned corpora. Ex: number of aligned sentence pairs containing "cow" in English and "vache" in French:

  Observed freq.   "cow"   ~"cow"    TOTAL
  "vache"          59      6         65
  ~"vache"         8       570,934   570,942
  TOTAL            67      570,940   571,007

- χ² = 456,400 >> 3.84 (with a = 0.05), so "vache" and "cow" are not independent… and so are translations of each other.

Slide 56: χ²-test for corpus similarity (Kilgarriff & Rose, 1998)
- Ex:

  Observed freq.   Corpus 1   Corpus 2   Ratio
  Word1            60         9          60/9 = 6.7
  Word2            500        76         6.6
  Word3            124        20         6.2
  …                …          …          …
  Word500          …          …          …

- Compute χ² for the 2 populations (corpus 1 and corpus 2). H0: the 2 corpora have the same word distribution.

Slide 57: Collocations across corpora
- Ratios of relative frequencies between two or more different corpora can be used to
discover collocations that are characteristic of one corpus when compared to another:

  Likelihood ratio   NY Times (1990)   NY Times (1989)   w1 w2
  0.0241             2                 68                Karim Obeid
                     = (2/14,307,668) / (68/11,731,564)
  0.0372             2                 44                East Berliners
  0.0372             2                 44                Miss Manners
  0.0399             2                 41                17 earthquake
  …                  …                 …                 …
  TOTAL              14,307,668        11,731,564

Slide 58: Collocations across corpora (cont'd)
- Most useful for the discovery of subject-specific collocations: compare a general text with a subject-specific text; words and phrases that (on a relative basis) occur most often in the subject-specific text are likely to be part of the vocabulary that is specific to the domain.

Slide 59: Approaches to finding collocations (--> Mutual information)

Slide 60: Pointwise mutual information
- Uses a measure from information theory.
- The pointwise mutual information between 2 events x and y (in our case, the occurrences of 2 words) is roughly a measure of how much one event (e.g. a word) tells us about the other, or a measure of the independence of the 2 events (or 2 words).
- If the 2 events x and y are independent, then I(x, y) = 0.

Slide 61: Essential information theory (back to section 2.2 of the book)
- Developed by Shannon in the 1940s, to maximize the amount of information that can be transmitted over an imperfect communication channel (the noisy channel).
- The notion of entropy (informational content): how informative is a piece of information? Ex: how informative is the answer to a question? If you already have a good guess about the answer, the actual answer is less informative: low entropy.

Slide 62: Entropy: intuition
- Ex: betting $1 on the flip of a coin.
- If the coin is fair: the expected gain is ½(+1) + ½(−1) = $0, so you'd be willing to pay up to $1 for advance information ($1 − $0 average win).
- If the coin is rigged, with P(head) = 0.99 and P(tail) = 0.01, assuming you bet on heads (!)
- The expected gain is then 0.99(+1) + 0.01(−1) = $0.98, so you'd be willing to pay up to 2¢ for advance information ($1 − $0.98 average win).
- The entropy of the fair coin ($1) is greater than the entropy of the rigged coin ($0.02).

Slide 63: Entropy
- Let X be a discrete random variable (e.g. the outcome of tossing a coin, with outputs xᵢ). The entropy (or self-information) is

  H(X) = − Σ_{i=1..n} p(xᵢ) log₂ p(xᵢ)

- It measures the amount of information in a RV: the average uncertainty of a RV; the average length of the message needed to transmit an outcome xᵢ of that variable; the size of the search space consisting of the possible values of a RV and its associated probabilities. It is measured in bits.
- Properties: H(X) ≥ 0; if H(X) = 0, then it provides no new information.

Slide 64: Example: the coin flip
- Fair coin: H(X) = −(½ log₂ ½ + ½ log₂ ½) = 1 bit
- Rigged coin: H(X) = −((99/100) log₂ (99/100) + (1/100) log₂ (1/100)) ≈ 0.08 bits
- [Figure: entropy as a function of P(head).]

Slide 65: Example: simplified Polynesian
- In simplified Polynesian, we have 6 letters with frequencies:

  p     t     k     a     i     u
  1/8   1/4   1/8   1/4   1/8   1/8

- The per-letter entropy is

  H(P) = − Σ_{i ∈ {p,t,k,a,i,u}} p(i) log₂ p(i) = −(4 × ⅛ log₂ ⅛ + 2 × ¼ log₂ ¼) = 2.5 bits

- We can design a code that on average takes 2.5 bits to transmit a letter:

  p     t    k     a    i     u
  100   00   101   01   110   111

- Entropy can be viewed as the average number of yes/no questions you need to ask to identify the outcome (ex: is it a 't'? is it a 'p'?).

Slide 66: Entropy in NLP
- Entropy is a measure of uncertainty: the more we know about something, the lower its entropy.
- So if a language model captures more of the structure of the language, then its entropy should be lower; in NLP, language models are compared using their entropy. Ex: given 2 grammars and a corpus, we use entropy to determine which grammar better matches the corpus.

Slide 67: Mutual information
- I(X; Y) = H(X) − H(X|Y) = H(Y) − H(Y|X): the reduction in uncertainty of one RV from knowing about another RV. E.g.
if you see "Merry", how surprised are you if it is followed by:
  - "hippopotamus" --> very surprised, so I(Merry; hippopotamus) ≈ 0
  - "Christmas" --> not surprised, so I(Merry; Christmas) is very high
- Also known as pointwise mutual information:

  I(x; y) = log₂ [ p(x, y) / (p(x) p(y)) ]

Slide 68: Example: finding collocations
- Assume: c(Ayatollah) = 42, c(Ruhollah) = 20, c(Ayatollah Ruhollah) = 20, N = 14,307,668. Then:

  I(Ayatollah; Ruhollah) = log₂ [ (20 / 14,307,668) / ((42 / 14,307,668) × (20 / 14,307,668)) ] ≈ 18.38

- So? The amount of information we have about the occurrence of "Ayatollah" at position i increases by 18.38 bits if "Ruhollah" occurs at position i+1.
- PMI works particularly badly with sparse data.

Slide 69: Pointwise mutual information (cont'd)
- With pointwise mutual information:

  I(w1; w2)   C(w1)   C(w2)   C(w1 w2)   w1 w2
  18.38       42      20      20         Ayatollah Ruhollah
  17.98       41      27      20         Bette Midler
  0.46        14093   14776   20         like people
  0.29        15019   15629   20         time last

- With the t-test (see the t-test results table earlier):

  t           C(w1)   C(w2)   C(w1 w2)   w1 w2
  4.4721      42      20      20         Ayatollah Ruhollah
  4.4721      41      27      20         Bette Midler
  1.2176      14093   14776   20         like people
  0.8036      15019   15629   20         time last

- Same ranking as the t-test.

Slide 70: Pointwise mutual information (cont'd)
- A good measure of independence: values close to 0 indicate independence.
- A bad measure of dependence, because PMI does not depend on frequency: all things being equal, bigrams of low-frequency words will receive a higher score than bigrams of high-frequency words.
- So sometimes we use C(w1 w2) × I(w1; w2) instead.

Slide 71: Automatic vs. manual detection of collocations
- Manual detection finds a wider variety of grammatical patterns. Ex: in the BBI Combinatory Dictionary of English, the entries for strength and power:

  strength              power
  to build up ~         to assume ~
  to find ~             emergency ~
  to save ~             discretionary ~
  to sap somebody's ~   fire ~
  brute ~               supernatural ~
  tensile ~             to turn off the ~
  the ~ to [do X]       the ~ to [do X]

- The quality of manually detected collocations is better than that of computer-generated ones, but manual detection takes long and requires expertise.
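As a closing illustration, the pointwise mutual information computation above can be sketched in Python. This is a minimal sketch, not part of the original slides; the helper name pmi is my own.

```python
import math

def pmi(c1, c2, c12, N):
    """Pointwise mutual information (in bits) of a bigram w1 w2,
    given corpus counts c1 = C(w1), c2 = C(w2), c12 = C(w1 w2),
    and corpus size N."""
    p_x = c1 / N       # P(w1)
    p_y = c2 / N       # P(w2)
    p_xy = c12 / N     # P(w1 w2)
    return math.log2(p_xy / (p_x * p_y))

N = 14_307_668
print(round(pmi(42, 20, 20, N), 2))        # Ayatollah Ruhollah -> 18.38
print(round(pmi(14093, 14776, 20, N), 2))  # like people -> 0.46
```

Both values reproduce the table above: the rare, strongly associated bigram gets a very high score, while the frequent but loosely associated pair scores near 0.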