SENSEVAL-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text, Barcelona, Spain, July 2004. Association for Computational Linguistics.

Semantic Role Labeling with Boosting, SVMs, Maximum Entropy, SNOW, and Decision Lists

Grace Ngai†1, Dekai Wu‡2, Marine Carpuat‡, Chi-Shing Wang†, Chi-Yung Wang†

† Dept. of Computing, Hong Kong Polytechnic University, Hong Kong
‡ Dept. of Computer Science, Human Language Technology Center, HKUST, Hong Kong

firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org

1 The author would like to thank the Hong Kong Polytechnic University for supporting this research in part through research grants A-PE37 and 4-Z03S.
2 The author would like to thank the Hong Kong Research Grants Council (RGC) for supporting this research in part through research grants RGC6083/99E, RGC6256/00E, and DAG03/04.EG09.

Abstract

This paper describes the HKPolyU-HKUST systems which were entered into the Semantic Role Labeling task in Senseval-3. Results show that these systems, which are based upon common machine learning algorithms, all manage to achieve good performance on the non-restricted Semantic Role Labeling task.

1 Introduction

This paper describes the HKPolyU-HKUST systems which participated in the Senseval-3 Semantic Role Labeling task. The systems represent a diverse array of machine learning algorithms, from decision lists to SVMs to Winnow-type networks.

Semantic Role Labeling (SRL) is a task that has recently received a lot of attention in the NLP community. The SRL task in Senseval-3 used the FrameNet (Baker et al., 1998) corpus: given a sentence instance from the corpus, a system's job is to identify the phrase constituents and their corresponding roles.

The Senseval-3 task was divided into restricted and non-restricted subtasks. In the non-restricted subtask, any and all of the gold standard annotations contained in the FrameNet corpus could be used. Since this includes information on the boundaries of the parse constituents which correspond to some frame element, it effectively maps the SRL task to a role-labeling classification task: given a constituent parse, identify the frame element that it belongs to.

Due to the lack of time and resources, we chose to participate only in the non-restricted subtask. This enabled our systems to take the classification approach mentioned in the previous paragraph.
2 Experimental Features

This section describes the features that were used for the SRL task. Since the non-restricted SRL task is essentially a classification task, each parse constituent that was known to correspond to a frame element was considered to be a sample.

The features that we used for each sample have previously been shown to be helpful for the SRL task (Gildea and Jurafsky, 2002). Some of these features can be obtained directly from the FrameNet annotations:

• The name of the frame.

• The lexical unit of the sentence — i.e. the lexical identity of the target word in the sentence.

• The general part-of-speech tag of the target word.

• The "phrase type" of the constituent — i.e. the syntactic category (e.g. NP, VP) that the constituent falls into.

• The "grammatical function" (e.g. subject, object, modifier, etc.) of the constituent, with respect to the target word.

• The position (e.g. before, after) of the constituent, with respect to the target word.

In addition to the above features, we also extracted a set of features which required the use of some statistical NLP tools:

• Transitivity and voice of the target word — The sentence was first part-of-speech tagged and chunked with the fnTBL transformation-based learning tools (Ngai and Florian, 2001). Simple heuristics were then used to deduce the transitivity and voice of the target word.

• Head word (and its part-of-speech tag) of the constituent — After POS tagging, a syntactic parser (Collins, 1997) was used to obtain the parse tree for the sentence. The head word (and the POS tag of the head word) of the syntactic parse constituent whose span corresponded most closely to the candidate constituent was then assumed to be the head word of the candidate constituent.

The resulting training data set consisted of 51,366 constituent samples with a total of 151 frame element types. These ranged from "Descriptor" (3,520 constituents) to "Baggage" and "Carrier" (1 constituent each). This training data was randomly partitioned into an 80/20 "development training" and "validation" set.
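The feature set above amounts to a simple attribute-value map per constituent. The following sketch is purely illustrative; the field names, data layout, and example values are our own assumptions, not the authors' code:

```python
# Illustrative sketch of assembling the feature map for one
# (frame, target word, constituent) sample. All field names and the
# example values below are hypothetical.
def extract_features(frame, target, constituent):
    """Build the feature dictionary for one candidate constituent."""
    return {
        "frame": frame["name"],                     # name of the frame
        "lexical_unit": target["lexical_unit"],     # lexical identity of the target word
        "target_pos": target["pos"],                # general POS tag of the target
        "phrase_type": constituent["phrase_type"],  # e.g. NP, VP
        "gram_function": constituent["gf"],         # e.g. subject, object
        # position of the constituent relative to the target word
        "position": "before" if constituent["start"] < target["index"] else "after",
        # the features below would come from the POS tagger, chunker, and parser
        "voice": target.get("voice", "unknown"),
        "transitivity": target.get("transitivity", "unknown"),
        "head_word": constituent.get("head", "unknown"),
        "head_pos": constituent.get("head_pos", "unknown"),
    }

sample = extract_features(
    {"name": "Transportation"},
    {"lexical_unit": "carry.v", "pos": "V", "index": 3, "voice": "active"},
    {"phrase_type": "NP", "gf": "object", "start": 4, "head": "baggage", "head_pos": "NN"},
)
```

Each such dictionary would then be handed, together with its gold frame element label, to the learners described in the next section.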
3 Methodology

The previous section described the features that were extracted for each constituent. This section describes the experiment methodology, as well as the learning systems used to construct the models.

Our systems had originally been trained on the entire development training (devtrain) set, generating one global model per system. However, on closer examination of the task, it quickly became evident that distinguishing between 151 possible outcomes was a difficult task for any system. It was also not clear that there was going to be a lot of information that could be generalized across frame types. We therefore partitioned the data by frame, so that one model would be trained for each frame. (This was also the approach taken by Gildea and Jurafsky (2002).) Some of our individual systems tried both approaches; the results are compared in the following subsections. For comparison purposes, a baseline model was constructed by simply classifying all constituents with the most frequently seen (in the training set) frame element for the frame.

In total, five individual systems were trained for the SRL task, and four ensemble models were generated by using various combinations of the individual systems. With one exception, all of the individual systems were constructed using off-the-shelf machine learning software. The following subsections describe each system; however, it should be noted that some of the individual systems were not officially entered as competing systems, and their scores are therefore not listed in the final rankings.
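The per-frame most-frequent-element baseline can be sketched in a few lines. This is a hypothetical rendering for illustration, not the actual evaluation code:

```python
# Minimal sketch of the baseline described above: label every constituent
# of a frame with that frame's most frequent frame element in training.
from collections import Counter, defaultdict

def train_baseline(training_samples):
    """training_samples: iterable of (frame, frame_element) pairs.
    Returns a map from frame to its most frequent frame element."""
    counts = defaultdict(Counter)
    for frame, element in training_samples:
        counts[frame][element] += 1
    return {frame: c.most_common(1)[0][0] for frame, c in counts.items()}

# Toy training data (hypothetical frames and elements):
baseline = train_baseline([
    ("Motion", "Theme"), ("Motion", "Theme"), ("Motion", "Path"),
    ("Placing", "Goal"),
])
# The baseline now predicts "Theme" for every Motion constituent.
```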
3.1 Boosting

The most successful of our individual systems is based on boosting, a powerful machine learning algorithm which has been shown to achieve good results on NLP problems in the past. Our system was constructed around the Boostexter software (Schapire and Singer, 2000), which implements boosting on top of decision stumps (decision trees of one level) and was originally designed for text classification. The same system also participated in the Senseval-3 lexical sample tasks for Chinese and English, as well as the Multilingual lexical sample task (Carpuat et al., 2004).

Table 1 compares the results of training one single overall boosting model (Single) versus training separate models for each frame (Frame). It can be seen that training frame-specific models produces a small improvement over the single model. The frame-specific model was used in all of the ensemble systems, and was also entered into the competition as an individual system (hkpust-boost).

Model            Prec.   Recall   Attempted
Single Model     0.891   0.795    89.2%
Frame Separated  0.894   0.798    89.2%
Baseline         0.444   0.396    89.2%

Table 1: Boosting Models: Validation Set Results
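To make the decision-stump idea concrete, here is a generic AdaBoost-over-stumps sketch on categorical features. It illustrates the general technique only; Boostexter's actual algorithm (confidence-rated AdaBoost.MH) differs in detail, and all names and data below are hypothetical:

```python
# Generic AdaBoost with one-level decision stumps over categorical
# features; a sketch of the technique, not the Boostexter implementation.
import math

def train_adaboost(X, y, n_rounds=10):
    """X: list of feature dicts; y: labels in {-1, +1}. Returns weighted stumps."""
    n = len(X)
    w = [1.0 / n] * n
    keys = sorted({k for x in X for k in x})
    ensemble = []
    for _ in range(n_rounds):
        # pick the stump (feature == value test, with a sign) of lowest weighted error
        best = None
        for k in keys:
            for v in {x.get(k) for x in X}:
                for sign in (1, -1):
                    preds = [sign if x.get(k) == v else -sign for x in X]
                    err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                    if best is None or err < best[0]:
                        best = (err, k, v, sign, preds)
        err, k, v, sign, preds = best
        err = max(err, 1e-10)
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, k, v, sign))
        # re-weight: boost the weight of misclassified samples
        w = [wi * math.exp(-alpha * p * yi) for wi, p, yi in zip(w, preds, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x.get(k) == v else -s) for a, k, v, s in ensemble)
    return 1 if score >= 0 else -1
```

In the frame-separated setting, one such model would be trained per frame, with one-vs-rest labels per frame element.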
3.2 Support Vector Machines

The second of our individual systems was based on support vector machines, and implemented using the TinySVM software package (Boser et al., 1992).

Since SVMs are binary classifiers, we used a one-against-all method to reduce the SRL task to a binary classification problem. One model is constructed for each possible frame element, and the task of the model is to decide, for a given constituent, whether it should be classified with that frame element. Since it is possible for all the binary classifiers to decide on "NOT-<element>", the model is effectively allowed to pass on samples that it is not confident about. This results in a very precise model, but unfortunately at a significant hit to recall.

A number of kernel parameter settings were investigated, and the best performance was achieved with a polynomial kernel of degree 4. The rest of the parameters were left at their default values. Table 2 shows the results of the best SVM model on the validation set. This model participated in all of the ensemble systems, and was also entered into the competition as an individual system.

System    Prec.   Recall   Attempted
SVM       0.945   0.669    70.8%
Baseline  0.444   0.396    89.2%

Table 2: SVM Models: Validation Set Results
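The one-against-all scheme with abstention can be sketched as follows. The stand-in scoring functions are hypothetical; the real system would use TinySVM decision values:

```python
# Sketch of one-against-all prediction with abstention: one binary scorer
# per frame element; if every scorer says "NOT-<element>" (negative score),
# the model passes on the sample.
def predict_one_vs_all(classifiers, sample):
    """classifiers: dict mapping frame element -> scoring function.
    Returns the best-scoring element, or None (abstain) if all scores < 0."""
    scores = {elem: clf(sample) for elem, clf in classifiers.items()}
    best_elem, best_score = max(scores.items(), key=lambda kv: kv[1])
    return best_elem if best_score >= 0 else None

# Hypothetical stand-ins for trained per-element binary classifiers:
clfs = {
    "Agent": lambda s: 1.0 if s["phrase_type"] == "NP" else -1.0,
    "Path":  lambda s: 0.5 if s["phrase_type"] == "PP" else -0.5,
}
```

Abstaining whenever every binary classifier votes "no" is precisely what trades recall for precision in Table 2.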
3.3 Maximum Entropy

The third of our individual systems was based on the maximum entropy model, and implemented on top of the YASMET package (Och, 2002). Like the boosting model, the maximum entropy system also participated in the Senseval-3 lexical sample tasks for Chinese and English, as well as the Multilingual lexical sample task (Carpuat et al., 2004).

Our maximum entropy models can be classified into two main approaches. Both approaches used the frame-partitioned data. The more conventional approach ("multi") trained one model per frame; that model would be responsible for classifying a constituent belonging to that frame with one of several possible frame elements. The second approach ("binary") used the same approach as the SVM models, and trained one binary one-against-all classifier for each frame type and frame element combination. (Unlike the boosting models, a single maximum entropy model could not be trained for all possible frame types and elements, since YASMET crashed on the sheer size of the feature space.)

Table 3 shows the results for the maximum entropy models. As would have been expected, the binary model achieves very high levels of precision, but at considerable expense of recall. Both systems were eventually used in some of the ensemble models, but were not submitted as individual contestants.

System    Prec.   Recall   Attempted
multi     0.856   0.764    89.2%
binary    0.956   0.539    56.4%
Baseline  0.444   0.396    89.2%

Table 3: Maximum Entropy Models: Validation Set Results
3.4 SNOW

The fourth of our individual systems is based on SNOW — Sparse Network Of Winnows (Muñoz et al., 1999).

The development approach for the SNOW models was similar to that of the boosting models. Two main model types were generated: one which generated a single overall model for all the possible frame elements, and one which generated one model per frame type. Due to a bug in the code which was not discovered until the last minute, however, the results for the frame-separated model were invalidated. The single-model system was eventually used in some of the ensemble systems, but not entered as an official contestant. Table 4 shows the results.

System        Prec.   Recall   Attempted
Single Model  0.764   0.682    89.2%
Baseline      0.444   0.396    89.2%

Table 4: SNOW Models: Validation Set Results

3.5 Decision Lists

The final individual system was a decision list implementation contributed by the Swarthmore College team (Wicentowski et al., 2004), which participated in some of the lexical sample tasks. The Swarthmore team followed the frame-separated approach in building the decision list models. Table 5 shows the results on the validation set. This system participated in some of the final ensemble systems, as well as being an official participant (hkpust-swat-dl).

System    Prec.   Recall   Attempted
DL        0.837   0.747    89.2%
Baseline  0.444   0.396    89.2%

Table 5: Decision List Models: Validation Set Results
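The Swarthmore system's internals are not described in this paper; as a generic illustration only, a decision list scores candidate (feature, value) -> label rules (here by plain training accuracy), sorts them best-first, and lets the first matching rule fire:

```python
# Generic decision-list sketch (not the Swarthmore implementation):
# rules are ranked by score and tried in order; the first match wins.
from collections import Counter

def train_decision_list(samples):
    """samples: list of (feature_dict, label). Scores each (feature, value,
    label) rule by its training accuracy and returns the rules best-first."""
    rule_counts = Counter()
    cond_counts = Counter()
    for feats, label in samples:
        for k, v in feats.items():
            rule_counts[(k, v, label)] += 1
            cond_counts[(k, v)] += 1
    rules = [((k, v), label, rule_counts[(k, v, label)] / cond_counts[(k, v)])
             for (k, v, label) in rule_counts]
    return sorted(rules, key=lambda r: -r[2])

def classify(rules, feats, default=None):
    for (k, v), label, _score in rules:
        if feats.get(k) == v:
            return label
    return default
```

A real system would smooth the rule scores and handle unseen feature values more carefully; this sketch only shows the first-match control flow.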
3.6 Ensemble Systems

Classifier combination, where the results of different models are combined in some way to make a new model, has been well studied in the literature. A successful combined classifier can result in the combined model outperforming the best base models, as the advantages of one model make up for the shortcomings of another.

Classifier combination is most successful when the base models are biased differently. That condition applies to our set of base models, and it was reasonable to make an attempt at combining them.

Since the performances of our systems spanned a large range, we did not want to use a simple majority vote in creating the combined system. Rather, we used a set of heuristics which trusted the most precise systems (the SVM and the binary maximum entropy) when they made a prediction, or a combination of the others when they did not.

Table 6 shows the results of the top-scoring combined systems which were entered as official contestants. As expected, the best of our combined systems outperformed the best base model.

Base Models                              Prec.   Recall   Attempted
svm, boosting, maxent (bin)              0.901   0.803    89.2%
svm, boosting, maxent (bin), snow        0.938   0.800    85.2%
svm, boosting, maxent (bin), DL          0.926   0.783    84.6%
svm, boosting, maxent (multi), DL, snow  0.935   0.797    85.2%
Baseline                                 0.444   0.396    89.2%

Table 6: Combined Models: Validation Set Results
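The combination heuristic can be sketched as a precedence rule. The exact heuristics and their ordering are not spelled out in the paper, so everything below, including which precise model is consulted first, is an assumption made for illustration:

```python
# Hypothetical sketch of the precedence heuristic: trust the high-precision
# models (SVM, binary maxent) when they commit to a label, and fall back to
# a vote among the remaining models when both abstain.
from collections import Counter

def combine(svm_pred, maxent_bin_pred, other_preds):
    """svm_pred / maxent_bin_pred: label or None (model abstained).
    other_preds: list of labels (or None) from the remaining base models."""
    for precise in (svm_pred, maxent_bin_pred):
        if precise is not None:
            return precise
    votes = Counter(p for p in other_preds if p is not None)
    return votes.most_common(1)[0][0] if votes else None
```

Because the precise models abstain on hard cases rather than guessing, deferring to them only when they commit is what lets the ensemble recover the recall they give up individually.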
4 Test Set Results

Table 7 shows the test set results for all systems which participated in some way in the official competition, either as part of a combined system or as an individual contestant.

Model                                                         Prec.   Recall   Attempted
svm, boosting, maxent (binary) (hkpolyust-all(a))             0.874   0.867    99.2%
boosting (hkpolyust-boost)                                    0.859   0.852    0.846%
svm, boosting, maxent (binary), DL (hkpolyust-swat(a))        0.902   0.849    94.1%
svm, boosting, maxent (binary), DL, snow (hkpolyust-swat(b))  0.908   0.846    93.2%
svm, boosting, maxent (multi), DL, snow (hkpolyust-all(b))    0.905   0.846    93.5%
decision list (hkpolyust-swat-dl)                             0.819   0.812    99.2%
maxent (multi)                                                0.827   0.735    88.8%
svm (hkpolyust-svm)                                           0.926   0.725    76.1%
snow                                                          0.713   0.499    70.0%
maxent (binary)                                               0.935   0.454    48.6%
Baseline                                                      0.438   0.388    88.6%

Table 7: Test set results for all our official systems, as well as the base models used in the ensemble systems.

The top-performing system is the combined system that uses the SVM, boosting, and the binary implementation of maximum entropy. Of the individual systems, boosting performs the best, even outperforming three of the combined systems. The SVM suffers from its high-precision approach, as does the binary implementation of maximum entropy. The rest of the systems fall somewhere in between.

5 Conclusion

This paper presented the HKPolyU-HKUST systems for the non-restricted Semantic Role Labeling task of Senseval-3. We mapped the task to a simple classification task, and used features and systems which were easily extracted and constructed. Our systems achieved good performance on the SRL task, easily beating the baseline.

6 Acknowledgments

The "hkpolyust-swat-*" systems are the result of joint work between our team and Richard Wicentowski's team at Swarthmore College. The authors would like to express their immense gratitude to the Swarthmore team for providing their decision list system as one of our models.

References

Collin F. Baker, Charles J. Fillmore, and John B. Lowe. 1998. The Berkeley FrameNet project. In Christian Boitet and Pete Whitelock, editors, Proceedings of the Thirty-Sixth Annual Meeting of the Association for Computational Linguistics and Seventeenth International Conference on Computational Linguistics, pages 86-90, San Francisco, California. Morgan Kaufmann Publishers.

Bernhard E. Boser, Isabelle Guyon, and Vladimir Vapnik. 1992. A training algorithm for optimal margin classifiers. In Computational Learning Theory, pages 144-152.

Marine Carpuat, Weifeng Su, and Dekai Wu. 2004. Augmenting ensemble classification for word sense disambiguation with a kernel PCA model. In Proceedings of Senseval-3, Barcelona.

Michael Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th Annual Meeting of the ACL (jointly with the 8th Conference of the EACL), Madrid.

Daniel Gildea and Dan Jurafsky. 2002. Automatic labeling of semantic roles. Computational Linguistics, 28(3):256-288.

Marcia Muñoz, Vasin Punyakanok, Dan Roth, and Dav Zimak. 1999. A learning approach to shallow parsing. In Proceedings of EMNLP-WVLC'99, pages 168-178, College Park. Association for Computational Linguistics.

G. Ngai and R. Florian. 2001. Transformation-based learning in the fast lane. In Proceedings of the 39th Conference of the Association for Computational Linguistics, Pittsburgh, PA.

Franz Josef Och. 2002. Yet Another Small Maxent Toolkit: YASMET. http://www-i6.informatik.rwth-aachen.de/Colleagues/och.

Robert E. Schapire and Yoram Singer. 2000. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2/3):135-168.

Richard Wicentowski, Emily Thomforde, and Adrian Packel. 2004. The Swarthmore College Senseval-3 system. In Proceedings of Senseval-3, Barcelona.