Acquisition of Collocations



Document Number          Working Paper 5.3
Project ref.             IST-2001-34460
Project Acronym          MEANING
Project full title       Developing Multilingual Web-scale Language Technologies
Project URL              http://www.lsi.upc.es/~nlp/meaning/meaning.html
Availability             Public
Authors: Xinglong Wang (Sussex), John Carroll (Sussex)




                 INFORMATION SOCIETY TECHNOLOGIES


    Project ref.                               IST-2001-34460
    Project Acronym                            MEANING
    Project full title                         Developing Multilingual Web-scale Language Technologies
    Security (Distribution level)              Public
    Contractual date of delivery               March 2002
    Actual date of delivery                    March 20, 2003
    Document Number                            Working Paper 5.3
    Type                                       Report
    Status & version                           FINAL
    Number of pages                            6
    WP contributing to the deliverable         WP 5
    WP/Task responsible                        John Carroll
    Authors                                    Xinglong Wang (Sussex), John Carroll (Sussex)
    Other contributors
    Reviewer
    EC Project Officer                         Evangelia Markidou

    Keywords: collocations, lexical acquisition
    Abstract: This working paper describes experiments in extracting collocations
    from large amounts of English text.






Contents
1 Introduction and Background

2 Source Data

3 Extracting Collocations

4 Evaluation

5 Conclusions and Future Work






1       Introduction and Background
The goal of the collocation extraction task in MEANING is to detect noun-noun (N-N)
and adjective-noun (A-N) collocations in large amounts of text, focusing on British
English. This is a sub-task of WP5 (Acquisition), and the collocations detected will be
used in WP6 (Word Sense Disambiguation) as potential seeds for bootstrapping
algorithms [Yarowsky, 1993].
    Researchers have proposed several techniques for collocation extraction. Church and
Hanks [Church and Hanks, 1990] and Kita and Ogata [Kita and Ogata, 1997] used
techniques based on N-gram statistics derived from large corpora. Lin [Lin, 1998] used an
information-theoretic approach over parse dependency triples. Blaheta and Johnson
[Blaheta and Johnson, 2001] used log-linear models to measure the association between
verbs and particles. These techniques are oriented towards different applications and
extract different types of collocations: for example, Blaheta and Johnson’s method
extracts multi-word verbs, whereas Lin’s retrieves subject-verb and verb-object
combinations.
    In our experiments, we adopted a new technique developed by Pearce [Pearce, 2001],
based on analysing possible synonym substitutions within candidate phrases. For example,
emotional baggage (a collocation) occurs more frequently than the phrase emotional
luggage formed when baggage is replaced by its synonym luggage. The technique uses
WordNet as its source of synonymic information.


2       Source Data
The source data for the experiments is the part-of-speech tagged version of the British
National Corpus (BNC).[1]
    To reduce the amount of noise in the text, the BNC data was filtered. Filtering
introduces a risk of eliminating collocational information (reducing “recall”), but it also
makes the remaining collocational information more prominent (increasing “precision”).
Details of the filtering process and its side-effects are discussed in the next section.

[1] In these experiments, we used the BNC tagged with the CLAWS-5 tagset. We could
also run our system on text tagged by Carroll et al.’s [Carroll et al., 1998] robust
statistical parser, which uses the CLAWS-2 tagset and additionally provides
head-dependency information; this may help improve the performance of our software.


3       Extracting Collocations
The process of collocation extraction consists of eliminating noise from the source data
and then applying Pearce’s technique to the result. The steps are briefly described below.

Step 1: Noun-group chunking

Input: The BNC corpus tagged using CLAWS-5 tagset.
Output: Multi-word noun groups in the BNC.
Description: This step filtered the BNC with Carroll’s finite-state noun group chunker;
only noun groups containing more than one word were retained.
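
As an illustration only (Carroll’s chunker itself is a finite-state machine over CLAWS-5
tags and is not reproduced here), a minimal stand-in that keeps maximal runs of
adjective/noun tokens might look like the following Python sketch; the tag set used is
our assumption of the relevant CLAWS-5 adjective and noun tags:

    # Minimal stand-in for a noun-group chunker over CLAWS-5 tagged text.
    # Illustrative sketch only -- not Carroll's finite-state chunker.
    NG_TAGS = {"AJ0", "AJC", "AJS", "NN0", "NN1", "NN2"}  # assumed relevant tags

    def noun_groups(tagged_sentence):
        """Yield maximal runs of adjective/noun tokens longer than one word.

        tagged_sentence: a list of (word, claws5_tag) pairs.
        """
        group = []
        for word, tag in tagged_sentence:
            if tag in NG_TAGS:
                group.append((word, tag))
            else:
                if len(group) > 1:   # retain only multi-word noun groups
                    yield group
                group = []
        if len(group) > 1:
            yield group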

Step 2: Extraction of Bigrams

Input: Multi-word noun groups in the BNC.
Output: A-N and N-N bigrams in the BNC.
Description: This step filtered the multi-word noun groups using text2wngram, an
N-gram extractor from the CMU-Cambridge toolkit. We then eliminated bigrams that do
not match the A-N or N-N patterns; digits and words containing upper-case letters were
also removed. Finally, inflected forms were reduced to their base forms by a
morphological processing tool [Minnen et al., 2000].
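
The following sketch illustrates this bigram filtering; the tag patterns are our
assumptions, and a crude plural stripper stands in for the morphological tool of
[Minnen et al., 2000] that we actually used:

    # Illustrative A-N / N-N bigram filter over multi-word noun groups.
    ADJ_TAGS = {"AJ0", "AJC", "AJS"}
    NOUN_TAGS = {"NN0", "NN1", "NN2"}

    def lemmatise(word, tag):
        # Crude stand-in for the analyser of [Minnen et al., 2000]:
        # strip the plural -s from plural nouns only.
        if tag == "NN2" and word.endswith("s"):
            return word[:-1]
        return word

    def an_nn_bigrams(noun_group):
        """Yield (word1, word2) pairs matching A-N or N-N from one noun group."""
        for (w1, t1), (w2, t2) in zip(noun_group, noun_group[1:]):
            # Discard digits and words containing upper-case letters.
            if not (w1.isalpha() and w2.isalpha()):
                continue
            if w1 != w1.lower() or w2 != w2.lower():
                continue
            if (t1 in ADJ_TAGS or t1 in NOUN_TAGS) and t2 in NOUN_TAGS:
                yield (lemmatise(w1, t1), lemmatise(w2, t2))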

Step 3: Applying Pearce’s Collocation Extraction Algorithm

Input: A-N and N-N bigrams in the BNC, and the WordNet source files.
Output: A list of candidate collocations.
Description: Pearce’s algorithm is based on the assumption that a pair of words is a
collocation if one of the words significantly prefers a particular lexical realisation of the
concept the other represents. Using WordNet as a source of synonymic information, the
technique compares the number of occurrences of a bigram, say α, with the numbers of
occurrences of the bigrams obtained by substituting synonyms for the words composing α.
If α occurs more frequently than these variants, it is selected as a candidate collocation.
For details of the algorithm, see [Pearce, 2001]. In this step, we ran an implementation
of the algorithm on the A-N and N-N bigrams extracted in the previous steps, producing
a list of candidate collocations.
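
As a rough sketch of the core comparison (omitting the frequency thresholds and other
refinements of Pearce’s actual formulation):

    from collections import Counter

    def pearce_candidates(bigram_counts, synonyms):
        """Select candidate collocations by synonym substitution (sketch).

        bigram_counts: Counter mapping (w1, w2) bigrams to corpus frequency.
        synonyms:      dict mapping each word to a set of WordNet synonyms.
        A bigram is kept if at least one synonym-substituted variant occurs
        in the corpus and the original outnumbers every occurring variant.
        """
        candidates = []
        for (w1, w2), n in bigram_counts.items():
            variants = [(s, w2) for s in synonyms.get(w1, set()) if s != w1]
            variants += [(w1, s) for s in synonyms.get(w2, set()) if s != w2]
            seen = [bigram_counts[v] for v in variants if v in bigram_counts]
            if seen and all(n > c for c in seen):
                candidates.append((w1, w2))
        return candidates

    # Example:
    #   counts = Counter({("emotional", "baggage"): 130,
    #                     ("emotional", "luggage"): 2})
    #   syns = {"baggage": {"luggage"}, "luggage": {"baggage"}}
    #   pearce_candidates(counts, syns)  ->  [("emotional", "baggage")]

Note that under this scheme a bigram whose words have no WordNet synonyms, or whose
synonym variants never occur in the corpus, can never be selected; this is exactly the
limitation discussed in Section 4 for prime minister.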


4     Evaluation
Evaluating collocations is difficult, as there is no accepted gold standard. In our
experiments, we used the BBI Dictionary of English Word Combinations, widely recognised
as the best collocation dictionary available, as a gold standard. After the extraction
phase, our system generated 103,980 candidate collocations. Since manually evaluating
such a large number of collocations is infeasible, we took three sets as samples and ran
three rounds of evaluation on them to estimate the performance of our system. We also
compared the samples with multi-word entries in WordNet.
    In the first round of evaluation, we selected the 216 candidate collocations which occur most
frequently on the output list as a sample. The precision of these candidates was expected
to be much higher than the average. After manually looking them up in the BBI
Dictionary, 113 out of the 216 collocations are correct[2], giving a precision of 52.3%. At
the same time, 7 out of 216 are recognised as multi-word entries in WordNet, giving a
precision of 3.2%.

[2] A candidate collocation is considered correct if it is found as a lexical collocation
or idiom in the dictionary.
    In the second round, we randomly selected 200 candidate collocations as a sample. The
precision of these candidates was expected to reflect the average. Unfortunately, the
result is rather disappointing: out of the 200 items, only 9 are correct and none of them
are WordNet entries.
    In the final round, we randomly picked a noun, voice, in order to measure the
performance of the system when extracting collocations for a given word. We then selected
the 52 N-N or A-N combinations under the entry for voice in the BBI Dictionary (set A),
the 6 multi-word entries for voice in WordNet (set B), and the 174 candidate collocations
containing the selected word (set C). There are 20 items in the intersection of sets A
and C, which means that when extracting collocations for the word voice and using the
BBI Dictionary as a gold standard, our system achieves a precision of 11.5% and a
recall[3] of 38.5%. Comparing sets B and C, we found only one overlap, which means our
system achieves a precision of 16.7% when using WordNet as a gold standard. These
results are summarised in Table 1.

[3] We assume the BBI Dictionary covers all two-word N-N or A-N collocations containing
the word voice.

          Samples of Evaluation                  Precision (BBI)       Precision (WordNet)
          Most frequent collocations                  52.3%                     3.2%
          Randomly selected collocations               4.5%                      0%
          Collocations containing voice               11.5%                    16.7%

                   Table 1: Results of Collocation Extraction Experiments.
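
The BBI figures for the third round are simple set arithmetic; a minimal rendering in
Python, with the numbers as reported above:

    # Precision and recall of the third-round sample as set arithmetic.
    def precision_recall(gold, system):
        """gold and system are sets of candidate bigrams."""
        hits = len(gold & system)
        return hits / len(system), hits / len(gold)

    # With |A| = 52 BBI entries for "voice", |C| = 174 system candidates,
    # and |A & C| = 20 items in common:
    #   precision = 20 / 174 = 11.5%
    #   recall    = 20 / 52  = 38.5%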
    During the evaluation process, we found that the BBI Dictionary is probably not a
suitable gold standard for our task. Firstly, it includes not only N-N and A-N
combinations but also other types such as verb-particle combinations and prepositional
phrases. Secondly, it includes many free combinations, such as low voice or pleasant
voice, which we are not interested in. Finally, the dictionary is somewhat out of date
and does not include recently emerged collocations such as network administrator.
    At the same time, we discovered several drawbacks of our system. First of all, the
algorithm lacks the ability to distinguish collocations from free combinations, and this
accounts for the majority of incorrect outputs. For example, new company is just a free
combination, but it was extracted as a collocation by the algorithm. In fact, bigrams
containing a very frequent adjective such as new or good are almost always picked as
collocations, because the algorithm cannot tell a collocation from two words that simply
co-occur often by chance. A solution would be a ranking module that applies a statistical
test, such as a t-test or a χ²-test, and eliminates items with low scores. We could also
rank candidates by their mutual information. Further investigation is needed to decide
which method to use.
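
As an illustration of the mutual information option, the following sketch ranks
candidates by pointwise mutual information estimated from bigram and unigram counts; a
t-test or χ² ranking would slot in the same way with a different score function:

    import math

    def pmi(bigram, bigram_counts, word_counts, n_tokens):
        """Pointwise mutual information of a bigram:

        PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ),
        with probabilities estimated from corpus counts and n_tokens the
        total number of tokens in the corpus.
        """
        w1, w2 = bigram
        p_joint = bigram_counts[bigram] / n_tokens
        p1 = word_counts[w1] / n_tokens
        p2 = word_counts[w2] / n_tokens
        return math.log2(p_joint / (p1 * p2))

    # Rank the candidate list and drop the low-scoring tail:
    # ranked = sorted(candidates,
    #                 key=lambda b: pmi(b, bigram_counts, word_counts, n_tokens),
    #                 reverse=True)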
    Secondly, the algorithm will not recognise a bigram as a collocation if its components
have no synonyms in WordNet 1.6, or if bigrams containing its components’ synonyms do
not occur in the BNC. That explains why the bigram prime minister, which occurs 8,805
times in the BNC, was not picked as a collocation. In addition, if two collocations have
components that are synonyms of each other, such as shaking voice and trembling voice,
the algorithm will always select the one with the higher frequency and reject the other.
Applying the algorithm to a larger corpus is a possible solution to this problem.
    Thirdly, morphological processing wrongly eliminates a few collocations. For example,
it changes foreign affairs, a good collocation, into foreign affair, a bad one. A
possible solution is to apply the morphological analysis selectively: in the above
example, affairs in foreign affairs would not be lemmatised, because foreign affairs
occurs more frequently than foreign affair does.
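
A minimal sketch of this selective lemmatisation, assuming corpus frequencies for both
the inflected and the lemmatised forms of a bigram are available:

    def selective_lemmatise(inflected, lemmatised, freq):
        """Keep the inflected bigram when it outnumbers its lemmatised form.

        inflected:  e.g. ("foreign", "affairs")
        lemmatised: e.g. ("foreign", "affair")
        freq:       mapping from bigram to corpus frequency.
        """
        if freq.get(inflected, 0) > freq.get(lemmatised, 0):
            return inflected     # e.g. keep "foreign affairs" as-is
        return lemmatised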


5     Conclusions and Future Work
Overall, and with respect to the questionable gold standard of the BBI Dictionary, the
performance of our collocation extraction software is not very satisfactory. However, we
still believe the algorithm has potential. In the future, we will improve the system as
described above and run the revised software on the finance and sport corpora extracted
from the Reuters News Corpus.
    Our future work will also include developing evaluation techniques. A promising
direction is task-based evaluation; an interesting task would be to apply the collocation
candidates to Word Sense Disambiguation, following Yarowsky’s idea of one sense per
collocation [Yarowsky, 1993].
    From these experiments, we also see that WordNet seriously lacks multi-word entries.
For example, in the first round of evaluation, out of 216 candidate collocations
extracted, 113 are correct according to the BBI Dictionary, but only 7 are correct
according to WordNet 1.6. The coverage of WordNet is clearly much narrower than that of
the BBI Dictionary, and work needs to be done to enrich its multi-word entries.


References
[Blaheta and Johnson, 2001] Don Blaheta and Mark Johnson. Unsupervised learning of
  multi-word verbs. In 39th Annual Meeting and 10th Conference of the European Chapter
  of the Association for Computational Linguistics (ACL’01), pages 54–60, Toulouse,
  France, 2001.
[Carroll et al., 1998] John Carroll, Guido Minnen, and Ted Briscoe. Can subcategorisation
  probabilities help a statistical parser? In 6th ACL/SIGDAT Workshop on Very Large
  Corpora, pages 118–126, Montreal, Canada, 1998.
[Church and Hanks, 1990] Kenneth Ward Church and Patrick Hanks. Word association
  norms, mutual information, and lexicography. Computational Linguistics, 16(1):22–29,
  1990.
[Kita and Ogata, 1997] Kenji Kita and Hiroaki Ogata. Collocations in language learning:
  Corpus-based automatic compilation of collocations and bilingual collocation concor-
  dancer. Computer Assisted Language Learning: An International Journal, 10(3):229–238,
  1997.
[Lin, 1998] Dekang Lin. Extracting collocations from text corpora. In First Workshop on
  Computational Terminology, Montreal, Canada, 1998.
[Minnen et al., 2000] G. Minnen, J. Carroll, and D. Pearce. Robust, applied morphological
  generation. In Proceedings of the 1st International Natural Language Generation
  Conference, pages 201–208, Mitzpe Ramon, Israel, 2000.
[Pearce, 2001] Darren Pearce. Synonymy in collocation extraction. In WordNet and Other
  Lexical Resources: Applications, Extensions and Customizations (NAACL 2001 Workshop),
  pages 41–46, Carnegie Mellon University, Pittsburgh, 2001.
[Yarowsky, 1993] David Yarowsky. One sense per collocation. In Proceedings of the ARPA
  Human Language Technology Workshop, pages 266–271, Princeton, New Jersey, 1993.