The University of Texas at Dallas
CS 6320
Natural Language Processing
Fall 2006
Class Project
A Maximum Likelihood Approach to Phrase Chunking
[Cover figure: Large Text Corpus, Chunked Phrases]
By:
Mahbubur Rahman Haque
Yusuf Bhagat
Course Instructor: Dr. Sanda Harabagiu
Table of Contents
1. The Problem
1.1 Background Information
1.2 Train and Test Data
2. Methodology:
3. Implementation:
3.1 Training
3.2 Testing
4. Experimental Results
5. Discussion
6. Conclusion
7. References
1. The Problem
Phrase chunking consists of dividing a text into its phrases. For example, the sentence
“He reckons the current account deficit will narrow to only # 1.8 billion in September .”
can be divided as follows:
[NP He] [VP reckons] [NP the current account deficit] [VP will narrow] [PP to] [NP only
# 1.8 billion] [PP in] [NP September].
Phrase chunking is an intermediate step towards full parsing, and it has therefore
received a lot of research attention. The main task in phrase chunking is to train a
learning algorithm on a training corpus and then, using the information learned from
training, to chunk test data and measure the accuracy. Several machine learning
approaches have been employed in phrase chunking, including SVM (Support Vector
Machine), Maximum Entropy and MLE (Maximum Likelihood Estimation). For our
project, we selected the MLE approach.
1.1 Background Information
In 1991, Steven Abney proposed approaching parsing by first finding correlated
chunks of words [Abn91]. Lance Ramshaw and Mitch Marcus approached chunking
with a machine learning method [RM95]. Their work has inspired many others to
study the application of learning methods to noun phrase chunking. Other chunk types
have not received the same attention as NP chunks. The most complete work is [BVD99],
which presents results for NP, VP, PP, ADJP and ADVP chunks. [Vee99] works with
NP, VP and PP chunks. [RM95] recognized arbitrary chunks but classified every
non-NP chunk as a VP chunk. [Rat98] recognized arbitrary chunks as part of a parsing
task but did not report on the chunking performance.
1.2 Train and Test Data
For our project, we used the training and test data provided for the “Conference on
Computational Natural Language Learning” (CoNLL-2000). The training and test data
consist of three columns separated by spaces. Each word is on a separate line
and there is an empty line after each sentence. The first column contains the current
word, the second its part-of-speech tag as derived by the Brill tagger, and the third its
chunk tag as derived from the WSJ corpus. The chunk tags contain the name of the chunk
type, for example I-NP for noun phrase words and I-VP for verb phrase words. Most
chunk types have two types of chunk tags: B-CHUNK for the first word of the chunk and
I-CHUNK for each other word in the chunk.
Here is an example of the file format:
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O
The O chunk tag is used for tokens which are not part of any chunk. Instead of using the
part-of-speech tags of the WSJ corpus, the data set uses tags generated by the Brill
tagger. Performance with the corpus tags would be better, but it would be unrealistic,
since no perfect part-of-speech tags are available for novel text [2].
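The relationship between the three-column file format and the bracketed notation of
Section 1 can be illustrated with a short Python sketch. This is not the program used
in the project; the names read_sentences and to_brackets are hypothetical, and the
sketch assumes that every chunk starts with a B- tag, as in the sample sentence.

```python
from typing import Iterator, List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, chunk tag)

# The sample sentence from the report, in the three-column format.
SAMPLE = """\
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O
"""


def read_sentences(text: str) -> Iterator[List[Token]]:
    """Yield sentences; an empty line ends the current sentence."""
    sentence: List[Token] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            if sentence:
                yield sentence
                sentence = []
            continue
        word, pos, chunk = fields
        sentence.append((word, pos, chunk))
    if sentence:
        yield sentence


def to_brackets(sentence: List[Token]) -> str:
    """Render B-/I-/O chunk tags as '[TYPE word ...]' groups."""
    groups = []  # list of (chunk type or None, list of words)
    for word, _pos, tag in sentence:
        prefix, _, ctype = tag.partition("-")
        if prefix == "I" and groups and groups[-1][0] == ctype:
            groups[-1][1].append(word)       # continue the open chunk
        elif tag == "O":
            groups.append((None, [word]))    # token outside any chunk
        else:
            groups.append((ctype, [word]))   # B- tag (or stray I-) opens a chunk
    return " ".join(
        "[{} {}]".format(ctype, " ".join(words)) if ctype else " ".join(words)
        for ctype, words in groups
    )


for sent in read_sentences(SAMPLE):
    print(to_brackets(sent))
```

For the sample sentence this prints the bracketed form shown in Section 1, with the
final period left outside any chunk because its tag is O.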
2. Methodology:
For our project we followed the Maximum Likelihood approach to predict the phrase
chunks for the sentences in the test corpus. The Maximum Likelihood approach simply
selects the chunk label that is most likely in a given context. Each word of the test
corpus is assigned a label, and a script then computes the accuracy, precision, recall
and F values to measure how good the implementation is. More about the methodology we
used for our chunking can be found in the paper “A Context Sensitive Maximum
Likelihood Approach to Chunking” by Christer Johansson, published in the proceedings
of CoNLL-2000 [1].
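The selection rule can be written compactly. The notation below is our own shorthand
and is not taken from the cited paper: c stands for a context, l for a chunk label,
and count(c, l) for the number of times label l was observed with context c in the
training data.

```latex
\hat{l}(c)
  \;=\; \arg\max_{l} \hat{P}(l \mid c)
  \;=\; \arg\max_{l} \frac{\mathrm{count}(c, l)}{\mathrm{count}(c)}
  \;=\; \arg\max_{l} \mathrm{count}(c, l)
```

Because count(c) does not depend on l, the most likely label for a context is simply
the label seen most often with that context.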
3. Implementation:
The Maximum Likelihood Estimation method lets us predict the most likely chunk label
for each word of a sentence. However, this prediction is not always correct, and hence
the accuracy needs to be calculated. We can divide the chunking task into two steps:
(a) Training and (b) Testing.
3.1 Training
The main task of the training process is to let the learning algorithm find the most
frequent label for a word in a particular context. Only the POS (part-of-speech) tag
information was used for this. During the training phase, we created a list of the most
likely chunk labels in each context using the POS tag information of the training file.
This list was created as follows:
We constructed a symmetric n-context for each of the words in the training corpus. A
1-context is simply the tag of the word under consideration, kept with the most frequent
chunk label for that tag. A 3-context is the tag of the word under consideration, the
tag of the preceding word and the tag of the following word, and is kept as
“[t-1 t0 t+1] label” in the list. Similarly, for a 5-context we keep
“[t-2 t-1 t0 t+1 t+2] label” in the list. When we later assigned a label to a word in
some context, we used the most frequent label for that context.
For our project, we used only three context sizes for determining chunk labels:
1-context, 3-context and 5-context. Also, we added a “” tag at the beginning and
end of each sentence so that sentence boundaries are considered in the context. In
brief, the steps below were followed in the training phase:
1. Obtain 5-context, 3-context and 1-context information from the training file.
2. For each set of n-contexts, associate the most frequent label with the context
information as the predicted label for any word occurring in the same context.
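The two training steps can be sketched in Python as follows. This is a minimal
illustration, not the program used in the project. The boundary symbol "<PAD>" and the
names context and train are hypothetical, and the sketch assumes that the boundary tag
is repeated as often as needed to fill a 5-context at the edges of a sentence.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

PAD = "<PAD>"  # hypothetical boundary tag; the tag string used in the project is not known

Sentence = List[Tuple[str, str]]  # one (POS tag, chunk label) pair per word
Tables = Dict[int, Dict[Tuple[str, ...], str]]


def context(tags: List[str], i: int, n: int) -> Tuple[str, ...]:
    """Symmetric n-context (n odd) of POS tags centred on position i."""
    half = n // 2
    padded = [PAD] * half + tags + [PAD] * half
    return tuple(padded[i:i + n])


def train(sentences: List[Sentence], sizes=(1, 3, 5)) -> Tables:
    """Map every observed n-context to its most frequent chunk label."""
    counts = {n: defaultdict(Counter) for n in sizes}
    for sent in sentences:
        tags = [tag for tag, _label in sent]
        for i, (_tag, label) in enumerate(sent):
            for n in sizes:
                counts[n][context(tags, i, n)][label] += 1  # step 1
    return {
        n: {ctx: c.most_common(1)[0][0] for ctx, c in table.items()}  # step 2
        for n, table in counts.items()
    }


# Toy training data (invented for illustration only).
TOY = [
    [("DT", "B-NP"), ("NN", "I-NP"), ("VBZ", "B-VP")],
    [("NN", "B-NP"), ("VBZ", "B-VP"), ("DT", "B-NP"), ("NN", "I-NP")],
]
tables = train(TOY)
print(tables[1][("NN",)])               # most frequent label for NN alone
print(tables[3][(PAD, "NN", "VBZ")])    # NN at sentence start, followed by VBZ
```

In the toy data the 1-context for NN maps to I-NP, while the 3-context for a
sentence-initial NN maps to B-NP, which shows why the longer contexts are worth
storing.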
3.2 Testing
Once training is completed, the information learned in the training phase is applied to
the test file. For each word, the 5-context is considered first. If the same 5-context
is found in the list of 5-contexts created during the training phase, the label given by
the 5-context list is assigned to that word. If the 5-context is not found for a word,
we check whether the 3-context is available for that word. If it is, the 3-context label
is assigned to the word. If the 3-context is unavailable, the 1-context is used.
Therefore, in the testing phase, we follow the steps below:
1. Compute the longest context used in the training phase (5-context in our case) for
each word in the test corpus.
2. For each of the contexts computed in step 1, look at the n-context lists created in
the training phase and return the label that corresponds to the longest surviving
n-context. That is, look up [t-2 t-1 t0 t+1 t+2], ..., [t0] in the lists obtained in the
training phase and use the longest context available in the lists to predict the
label.
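The back-off from the 5-context to the 3-context to the 1-context can be sketched in
Python as below. Again this is only an illustration under stated assumptions: the
tables are hand-made toy data, "<PAD>" is a hypothetical boundary tag, and the fallback
label "O" for a tag that was never seen in training is our assumption, since the report
does not describe that case.

```python
from typing import Dict, List, Tuple

PAD = "<PAD>"  # hypothetical boundary tag
Tables = Dict[int, Dict[Tuple[str, ...], str]]


def context(tags: List[str], i: int, n: int) -> Tuple[str, ...]:
    """Symmetric n-context (n odd) of POS tags centred on position i."""
    half = n // 2
    padded = [PAD] * half + tags + [PAD] * half
    return tuple(padded[i:i + n])


def predict(tags: List[str], tables: Tables, default: str = "O") -> List[str]:
    """Label each word with the entry of the longest context found in the tables."""
    sizes = sorted(tables, reverse=True)  # e.g. [5, 3, 1]
    labels = []
    for i in range(len(tags)):
        for n in sizes:
            label = tables[n].get(context(tags, i, n))
            if label is not None:
                labels.append(label)
                break
        else:
            labels.append(default)  # assumption: no context of any size matched
    return labels


# Hand-made toy tables (invented for illustration only).
TABLES: Tables = {
    5: {(PAD, PAD, "DT", "NN", "VBZ"): "B-NP"},
    3: {("DT", "NN", "VBZ"): "I-NP", ("NN", "VBZ", PAD): "B-VP"},
    1: {("DT",): "B-NP", ("NN",): "I-NP", ("VBZ",): "B-VP", ("JJ",): "I-NP"},
}

print(predict(["DT", "NN", "VBZ"], TABLES))  # uses the 5-, 3- and 3-context
print(predict(["JJ", "UH"], TABLES))         # falls back to 1-context, then default
```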
4. Experimental Results
We used the training and test files provided for CoNLL-2000. Because training on the
entire training file takes a lot of time, we trained our program on portions of the
CoNLL-2000 training file (“train.txt”) and then used the test file (“test.txt”) to
measure the accuracy of our program. We trained our program on increasing sizes of the
training corpus, and the following tables report the accuracy, precision, recall and F
values that we achieved for two different training file sizes. Please note that in both
cases the test file was the same (“test.txt” from CoNLL-2000).
Table 4-1: Test Results-1
Accuracy on Test File: 86.66%; processed 47377 tokens with 23852 phrases;
found: 25149 phrases; correct: 20207
Training File Size: 160 KB

Chunk type   Precision   Recall    F
ADJP         34.35%      25.80%    29.47
ADVP         56.33%      62.70%    59.34
CONJP         0.00%       0.00%     0.00
INTJ          0.00%       0.00%     0.00
LST           0.00%       0.00%     0.00
NP           82.85%      87.55%    85.14
PP           85.52%      91.33%    88.33
PRT          28.87%      26.42%    27.59
SBAR         40.26%      23.18%    29.42
VP           80.75%      88.64%    84.52
All          80.35%      84.72%    82.48
Table 4-2: Test Results-2
Accuracy on Test File: 90.63%; processed 47377 tokens with 23852 phrases;
found: 24871 phrases; correct: 20642
Training File Size: 442 KB

Chunk type   Precision   Recall    F
ADJP         48.99%      44.29%    46.52
ADVP         65.31%      74.36%    69.55
CONJP         0.00%       0.00%     0.00
INTJ          0.00%       0.00%     0.00
LST           0.00%       0.00%     0.00
NP           84.58%      88.67%    86.58
PP           86.51%      92.54%    89.42
PRT          39.13%      33.96%    36.36
SBAR         55.56%      21.50%    31.00
VP           83.49%      89.87%    86.56
All          83.00%      86.54%    84.73
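The “All” rows follow directly from the phrase counts given above each table:
precision is correct/found, recall is correct/phrases in the test file, and F is their
harmonic mean. The short Python sketch below recomputes them; the function name prf is
hypothetical and the script is not the evaluation script used in the project.

```python
from typing import Tuple


def prf(correct: int, found: int, gold: int) -> Tuple[float, float, float]:
    """Precision, recall and F (all in percent) from phrase counts."""
    precision = 100.0 * correct / found if found else 0.0
    recall = 100.0 * correct / gold if gold else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f


GOLD_PHRASES = 23852  # phrases in the test file, from the table headers

for name, correct, found in [("Table 4-1", 20207, 25149),
                             ("Table 4-2", 20642, 24871)]:
    p, r, f = prf(correct, found, GOLD_PHRASES)
    print(f"{name}: precision {p:.2f}%  recall {r:.2f}%  F {f:.2f}")
```

Rounded to two decimals, this gives 80.35%, 84.72% and 82.48 for the first table and
83.00%, 86.54% and 84.73 for the second, matching the “All” rows.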
5. Discussion
The resulting accuracy shows that Maximum Likelihood Estimation works well for
chunking. In our test results, accuracy, precision, recall and F-score all increased
with the size of the training file: overall F rose from 82.48 to 84.73 when the
training file grew from 160 KB to 442 KB. This suggests that our implementation would
achieve a higher accuracy if it were trained on the whole training corpus provided for
CoNLL-2000, which we could not do due to time constraints. Accuracy can also be
further increased by human intervention. It can easily be checked where the “inside
phrase” chunk labels do not match the “beginning phrase” chunk labels. Correcting
these labels would raise the accuracy. The mismatches will not be many and will
require very little work on the human side. Checking by a human can also help in
discovering new rules to be added to the chunking algorithm, which may improve
performance.
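The mismatch check mentioned above can be sketched in a few lines of Python. The
function name find_mismatches is hypothetical, and the sketch assumes that every chunk
must start with a B- tag, as in the sample data of Section 1.2, so an I-X tag is
flagged whenever the previous token is not part of a chunk of type X.

```python
from typing import List


def find_mismatches(chunk_tags: List[str]) -> List[int]:
    """Return positions of I-X tags that do not continue a chunk of type X."""
    bad = []
    prev_type = None  # chunk type of the previous token; None for O or sentence start
    for i, tag in enumerate(chunk_tags):
        prefix, _, ctype = tag.partition("-")
        if prefix == "I" and ctype != prev_type:
            bad.append(i)
        prev_type = ctype if prefix in ("B", "I") else None
    return bad


predicted = ["B-NP", "I-NP", "I-VP", "O", "I-NP", "B-PP"]
print(find_mismatches(predicted))  # positions a human reviewer should look at
```

For the toy prediction above the flagged positions are 2 (I-VP inside an NP chunk) and
4 (I-NP directly after an O tag).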
Generally, sentence chunks contain up to five words. Hence 5-context chunking was
enough to give a sufficiently high accuracy. Depending on the data, adding a 7-context
might improve the accuracy. However, it does not seem to add much to the present
accuracy.
6. Conclusion
Maximum Likelihood Estimation is a simple but effective technique for text
chunking. It creates straightforward mappings from word tags to chunk labels. Looking
up these mappings to chunk new text requires very little computation and gives
good accuracy. The accuracy could be increased further if all the groups of labels
between “beginning phrase” labels were recorded and the label occurring the highest
number of times in each group were taken to identify the phrase.
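One possible reading of this suggestion is sketched below in Python. It is our
interpretation and was not part of the experiments: each group runs from a B- tag up to
the next B- or O tag, and the whole group is relabelled with the chunk type that occurs
most often in it. The function name majority_relabel is hypothetical.

```python
from collections import Counter
from typing import List


def majority_relabel(chunk_tags: List[str]) -> List[str]:
    """Relabel each B-... group with the chunk type that is most frequent in it."""
    out: List[str] = []
    group: List[str] = []  # chunk types of the tokens in the current group

    def flush() -> None:
        if group:
            winner = Counter(group).most_common(1)[0][0]
            out.append("B-" + winner)
            out.extend("I-" + winner for _ in group[1:])
            group.clear()

    for tag in chunk_tags:
        prefix, _, ctype = tag.partition("-")
        if prefix == "B":
            flush()                # a new B- tag closes the previous group
            group.append(ctype)
        elif prefix == "I" and group:
            group.append(ctype)    # any I- tag extends the open group
        else:
            flush()                # O tag (or stray I- tag) is copied unchanged
            out.append(tag)
    flush()
    return out


print(majority_relabel(["B-NP", "I-NP", "I-VP", "I-NP", "O", "B-VP", "I-VP"]))
```

In the toy input the isolated I-VP inside the first group is outvoted and becomes
I-NP, while the remaining tags are unchanged.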
7. References
1. Christer Johansson. A Context Sensitive Maximum Likelihood Approach to
Chunking. In: Proceedings of CoNLL-2000 and LLL-2000, Lisbon, Portugal, 2000.
2. Papers, train and test data from the “Conference on Computational Natural Language
Learning” (CoNLL-2000). Web address: http://www.cnts.ua.ac.be/conll2000/chunking/