SENSEVAL-3: Third International Workshop on the Evaluation of Systems
for the Semantic Analysis of Text, Barcelona, Spain, July 2004
Association for Computational Linguistics

Semantic Role Labeling with
Boosting, SVMs, Maximum Entropy, SNOW, and Decision Lists

Grace NGAI†1, Dekai WU‡2, Marine CARPUAT‡, Chi-Shing WANG†, Chi-Yung WANG†

† Dept. of Computing, HK Polytechnic University, Hong Kong
‡ HKUST, Dept of Computer Science, Human Language Technology Center, Hong Kong

Abstract

This paper describes the HKPolyU-HKUST systems which were entered into the Semantic Role Labeling task in Senseval-3. Results show that these systems, which are based upon common machine learning algorithms, all manage to achieve good performance on the non-restricted Semantic Role Labeling task.

1 Introduction

This paper describes the HKPolyU-HKUST systems which participated in the Senseval-3 Semantic Role Labeling task. The systems represent a diverse array of machine learning algorithms, from decision lists to SVMs to Winnow-type networks.

Semantic Role Labeling (SRL) is a task that has recently received a lot of attention in the NLP community. The SRL task in Senseval-3 used the FrameNet (Baker et al., 1998) corpus: given a sentence instance from the corpus, a system's job would be to identify the phrase constituents and their corresponding roles.

The Senseval-3 task was divided into restricted and non-restricted subtasks. In the non-restricted subtask, any and all of the gold standard annotations contained in the FrameNet corpus could be used. Since this includes information on the boundaries of the parse constituents which correspond to some frame element, this effectively reduces the SRL task to a role-labeling classification task: given a constituent parse, identify the frame element that it belongs to.

Due to the lack of time and resources, we chose to participate only in the non-restricted subtask. This enabled our systems to take the classification approach mentioned in the previous paragraph.

1 The author would like to thank the Hong Kong Polytechnic University for supporting this research in part through research grants A-PE37 and 4-Z03S.
2 The author would like to thank the Hong Kong Research Grants Council (RGC) for supporting this research in part through research grants RGC6083/99E, RGC6256/00E, and DAG03/04.EG09.

2 Experimental Features

This section describes the features that were used for the SRL task. Since the non-restricted SRL task is essentially a classification task, each parse constituent that was known to correspond to a frame element was considered to be a sample.

The features that we used for each sample have been previously shown to be helpful for the SRL task (Gildea and Jurafsky, 2002). Some of these features can be obtained directly from the FrameNet annotations:

  • The name of the frame.

  • The lexical unit of the sentence — i.e. the lexical identity of the target word in the sentence.

  • The general part-of-speech tag of the target word.

  • The “phrase type” of the constituent — i.e. the syntactic category (e.g. NP, VP) that the constituent falls into.

  • The “grammatical function” (e.g. subject, object, modifier, etc.) of the constituent, with respect to the target word.

  • The position (e.g. before, after) of the constituent, with respect to the target word.

In addition to the above features, we also extracted a set of features which required the use of some statistical NLP tools:

  • Transitivity and voice of the target word — The sentence was first part-of-speech tagged and chunked with the fnTBL transformation-based learning tools (Ngai and Florian, 2001). Simple heuristics were then used to deduce the transitivity and voice of the target word.

  • Head word (and its part-of-speech tag) of the constituent — After POS tagging, a syntactic parser (Collins, 1997) was used to obtain the parse tree for the sentence. The head word (and the POS tag of the head word) of
the syntactic parse constituent whose span corresponded most closely to the candidate constituent was then assumed to be the head word of the candidate constituent.

The resulting training data set consisted of 51,366 constituent samples with a total of 151 frame element types. These ranged from “Descriptor” (3520 constituents) to “Baggage” and “Carrier” (1 constituent each). This training data was randomly partitioned into an 80/20 “development training” and “validation” set.

3 Methodology

The previous section described the features that were extracted for each constituent. This section will describe the experiment methodology as well as the learning systems used to construct the models.

Our systems had originally been trained on the entire development training (devtrain) set, generating one global model per system. However, on closer examination of the task, it quickly became evident that distinguishing between 151 possible outcomes was a difficult task for any system. It was also not clear that there was going to be a lot of information that could be generalized across frame types. We therefore partitioned the data by frame, so that one model would be trained for each frame. (This was also the approach taken by Gildea and Jurafsky (2002).) Some of our individual systems tried both approaches; the results are compared in the following subsections. For comparison purposes, a baseline model was constructed by simply classifying all constituents with the most frequently-seen (in the training set) frame element for the frame.

In total, five individual systems were trained for the SRL task, and four ensemble models were generated by using various combinations of the individual systems. With one exception, all of the individual systems were constructed using off-the-shelf machine learning software. The following subsections describe each system; however, it should be noted that some of the individual systems were not officially entered as competing systems; therefore, their scores are not listed in the final rankings.

3.1 Boosting

The most successful of our individual systems is based on boosting, a powerful machine learning algorithm which has been shown to achieve good results on NLP problems in the past. Our system was constructed around the Boostexter software (Schapire and Singer, 2000), which implements boosting on top of decision stumps (decision trees of one level), and was originally designed for text classification. The same system also participated in the Senseval-3 lexical sample tasks for Chinese and English, as well as the Multilingual lexical sample task (Carpuat et al., 2004).

Table 1 compares the results of training one single overall boosting model (Single) versus training separate models for each frame (Frame). It can be seen that training frame-specific models produces a small improvement over the single model. The frame-specific model was used in all of the ensemble systems, and was also entered into the competition as an individual system (hkpust-boost).

   Model             Prec.   Recall   Attempted
   Single Model      0.891   0.795    89.2%
   Frame Separated   0.894   0.798    89.2%
   Baseline          0.444   0.396    89.2%

Table 1: Boosting Models: Validation Set Results

3.2 Support Vector Machines

The second of our individual systems was based on support vector machines (Boser et al., 1992), and implemented using the TinySVM software package.

Since SVMs are binary classifiers, we used a one-against-all method to reduce the SRL task to a binary classification problem. One model is constructed for each possible frame element, and the task of the model is to decide, for a given constituent, whether it should be classified with that frame element. Since it is possible for all the binary classifiers to decide on “NOT-<element>”, the model is effectively allowed to pass on samples that it is not confident about. This results in a very precise model, but unfortunately at a significant hit to recall.

A number of kernel parameter settings were investigated, and the best performance was achieved with a polynomial kernel of degree 4. The rest of the parameters were left at the default values. Table 2 shows the results of the best SVM model on the validation set. This model participated in all of the ensemble systems, and was also entered into the competition as an individual system.

   System     Prec.   Recall   Attempted
   SVM        0.945   0.669    70.8%
   Baseline   0.444   0.396    89.2%

Table 2: SVM Models: Validation Set Results
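The one-against-all reduction with abstention described above can be sketched as follows. This is a minimal illustration in Python, not the actual TinySVM setup: the toy scorers and the feature encoding are invented for the example, standing in for the per-frame-element SVMs.

```python
# Sketch of the one-against-all reduction with abstention.
# Hypothetical toy scorers stand in for the per-frame-element SVMs.

def make_labeler(binary_scorers):
    """binary_scorers: dict mapping frame element -> function(sample) -> score.
    Returns a labeler that predicts the highest-scoring positive element,
    or None (abstain) when every binary classifier says NOT-<element>."""
    def predict(sample):
        scored = {elem: f(sample) for elem, f in binary_scorers.items()}
        best_elem, best_score = max(scored.items(), key=lambda kv: kv[1])
        return best_elem if best_score > 0 else None
    return predict

def evaluate(labeler, data):
    """Precision is computed over attempted samples only, which is why
    an abstaining model can be very precise yet suffer on recall."""
    predictions = [(labeler(x), gold) for x, gold in data]
    made = [(pred, gold) for pred, gold in predictions if pred is not None]
    correct = sum(1 for pred, gold in made if pred == gold)
    precision = correct / len(made) if made else 0.0
    recall = correct / len(data)
    return precision, recall, len(made) / len(data)

# Toy example: samples are feature dicts; scorers key off one feature.
scorers = {
    "Agent": lambda s: 1.0 if s.get("gf") == "subject" else -1.0,
    "Theme": lambda s: 1.0 if s.get("gf") == "object" else -1.0,
}
data = [
    ({"gf": "subject"}, "Agent"),
    ({"gf": "object"}, "Theme"),
    ({"gf": "modifier"}, "Manner"),   # both classifiers abstain here
]
labeler = make_labeler(scorers)
precision, recall, attempted = evaluate(labeler, data)
```

On this toy data the model attempts only two of the three samples, giving perfect precision over the attempted set but a recall of only 2/3, mirroring the precision/recall trade-off shown in Table 2.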
3.3 Maximum Entropy

The third of our individual systems was based on the maximum entropy model, and implemented on top of the YASMET package (Och, 2002). Like the boosting model, the maximum entropy system also participated in the Senseval-3 lexical sample tasks for Chinese and English, as well as the Multilingual lexical sample task (Carpuat et al., 2004).

Our maximum entropy models can be classified into two main approaches. Both approaches used the frame-partitioned data. The more conventional approach (“multi”) trained one model per frame; that model would be responsible for classifying a constituent belonging to that frame with one of several possible frame elements. The second approach (“binary”) used the same approach as the SVM models, and trained one binary one-against-all classifier for each frame type-frame element combination. (Unlike the boosting models, a single maximum entropy model could not be trained for all possible frame types and elements, since YASMET crashed on the sheer size of the feature space.)

   System     Prec.   Recall   Attempted
   multi      0.856   0.764    89.2%
   binary     0.956   0.539    56.4%
   Baseline   0.444   0.396    89.2%

Table 3: Maximum Entropy Models: Validation Set Results

Table 3 shows the results for the maximum entropy models. As would be expected, the binary model achieves very high precision, but at considerable expense of recall. Both systems were eventually used in some of the ensemble models but were not submitted as individual contestants.

3.4 SNOW

The fourth of our individual systems is based on SNOW — Sparse Network Of Winnows (Muñoz et al., 1999).

The development approach for the SNOW models was similar to that of the boosting models. Two main model types were generated: one which generated a single overall model for all the possible frame elements, and one which generated one model per frame type. Due to a bug in the coding which was not discovered until the last minute, however, the results for the frame-separated model were invalidated. The single model system was eventually used in some of the ensemble systems, but not entered as an official contestant. Table 4 shows the results.

   System         Prec.   Recall   Attempted
   Single Model   0.764   0.682    89.2%
   Baseline       0.444   0.396    89.2%

Table 4: SNOW Models: Validation Set Results

3.5 Decision Lists

The final individual system was a decision list implementation contributed by the Swarthmore College team (Wicentowski et al., 2004), which participated in some of the lexical sample tasks.

The Swarthmore team followed the frame-separated approach in building the decision list models. Table 5 shows the result on the validation set. This system participated in some of the final ensemble systems as well as being an official participant (hkpust-swat-dl).

   System     Prec.   Recall   Attempted
   DL         0.837   0.747    89.2%
   Baseline   0.444   0.396    89.2%

Table 5: Decision List Models: Validation Set Results

3.6 Ensemble Systems

Classifier combination, where the results of different models are combined in some way to make a new model, has been well studied in the literature. A successful combined classifier can result in the combined model outperforming the best base models, as the advantages of one model make up for the shortcomings of another.

Classifier combination is most successful when the base models are biased differently. That condition applies to our set of base models, and it was reasonable to make an attempt at combining them.

Since the performances of our systems spanned a large range, we did not want to use a simple majority vote in creating the combined system. Rather, we used a set of heuristics which trusted the most precise systems (the SVM and the binary maximum entropy) when they made a prediction, and a combination of the others when they did not.

Table 6 shows the results of the top-scoring combined systems which were entered as official contestants. As expected, the best of our combined systems outperformed the best base model.

4 Test Set Results

Table 7 shows the test set results for all systems which participated in some way in the official competition, either as part of a combined system or as an individual contestant.
   Model                                                          Prec.   Recall   Attempted
   svm, boosting, maxent (binary) (hkpolyust-all(a))              0.874   0.867    99.2%
   boosting (hkpolyust-boost)                                     0.859   0.852    99.2%
   svm, boosting, maxent (binary), DL (hkpolyust-swat(a))         0.902   0.849    94.1%
   svm, boosting, maxent (binary), DL, snow (hkpolyust-swat(b))   0.908   0.846    93.2%
   svm, boosting, maxent (multi), DL, snow (hkpolyust-all(b))     0.905   0.846    93.5%
   decision list (hkpolyust-swat-dl)                              0.819   0.812    99.2%
   maxent (multi)                                                 0.827   0.735    88.8%
   svm (hkpolyust-svm)                                            0.926   0.725    76.1%
   snow                                                           0.713   0.499    70.0%
   maxent (binary)                                                0.935   0.454    48.6%
   Baseline                                                       0.438   0.388    88.6%

Table 7: Test set results for all our official systems, as well as the base models used in the ensemble systems.
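The precision-first backoff used to build the combined systems (Section 3.6) can be sketched roughly as follows. The paper describes the heuristics only at a high level, so the voting fallback among the remaining models, the system names, and the tie-breaking are our assumptions for illustration.

```python
# Rough sketch of the ensemble backoff: trust the high-precision models
# (SVM, binary maxent) when they predict; otherwise fall back to a vote
# among the remaining base models. Names and tie-breaking are assumed.
from collections import Counter

HIGH_PRECISION = ["svm", "maxent_binary"]   # may abstain (return None)
FALLBACK = ["boosting", "maxent_multi", "dl", "snow"]

def combine(predictions):
    """predictions: dict mapping system name -> frame element or None."""
    # 1. If a high-precision system makes a prediction, trust it.
    for system in HIGH_PRECISION:
        if predictions.get(system) is not None:
            return predictions[system]
    # 2. Otherwise take the most common label among the fallback systems.
    votes = [predictions[s] for s in FALLBACK if predictions.get(s) is not None]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

# Example: the SVM abstains, the binary maxent fires, so its label wins
# even though the fallback systems disagree with it.
print(combine({"svm": None, "maxent_binary": "Agent",
               "boosting": "Theme", "dl": "Theme", "snow": "Goal"}))
```

This kind of cascade lets the high-precision, low-recall models decide whenever they are confident, while the higher-recall models fill in the samples they pass on.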

   Base Models                               Prec.   Recall   Attempted
   svm, boosting, maxent (bin)               0.901   0.803    89.2%
   svm, boosting, maxent (bin), snow         0.938   0.800    85.2%
   svm, boosting, maxent (bin), DL           0.926   0.783    84.6%
   svm, boosting, maxent (multi), DL, snow   0.935   0.797    85.2%
   Baseline                                  0.444   0.396    89.2%

Table 6: Combined Models: Validation Set Results

The top-performing system is the combined system that uses the SVM, boosting and the binary implementation of maximum entropy. Of the individual systems, boosting performs the best, even outperforming 3 of the combined systems. The SVM suffers from its high-precision approach, as does the binary implementation of maximum entropy. The rest of the systems fall somewhere in between.

5 Conclusion

This paper presented the HKPolyU-HKUST systems for the non-restricted Semantic Role Labeling task for Senseval-3. We mapped the task to a simple classification task, and used features and systems which were easily extracted and constructed. Our systems achieved good performance on the SRL task, easily beating the baseline.

6 Acknowledgments

The “hkpolyust-swat-*” systems are the result of joint work between our team and Richard Wicentowski’s team at Swarthmore College. The authors would like to express their immense gratitude to the Swarthmore team for providing their decision list system as one of our models.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Christian Boitet and Pete Whitelock, editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 86–90, San Francisco, California. Morgan Kaufmann Publishers.

Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. 1992. A training algorithm for optimal margin classifiers. In Computational Learning Theory.

Marine Carpuat, Weifeng Su, and Dekai Wu. 2004. Augmenting Ensemble Classification for Word Sense Disambiguation with a Kernel PCA Model. In Proceedings of Senseval-3, Barcelona.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL (jointly with the 8th Conference of the EACL), Madrid.

Daniel Gildea and Dan Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):256–288.

Marcia Muñoz, Vasin Punyakanok, Dan Roth, and Dav Zimak. 1999. A learning approach to shallow parsing. In Proceedings of EMNLP-WVLC’99, pages 168–178, College Park. Association for Computational Linguistics.

G. Ngai and R. Florian. 2001. Transformation-based learning in the fast lane. In Proceedings of the 39th Conference of the Association for Computational Linguistics, Pittsburgh, PA.

Franz Josef Och. 2002. Yet Another Small Maxent Toolkit: Yasmet. http://www-i6.informatik.rwth-

Robert E. Schapire and Yoram Singer. 2000. Boostexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135–168.

Richard Wicentowski, Emily Thomforde, and Adrian Packel. 2004. The Swarthmore College Senseval-3 system. In Proceedings of Senseval-3, Barcelona.
