IWLeL 2004: An Interactive Workshop on Language e-Learning 15 - 23

Computer Assisted Language Learning Based on Corpora and Natural Language Processing: The Experience of Project CANDLE

Jason S. CHANG 1 and Yu-Chia CHANG 2
1 Department of Computer Science, National Tsing Hua University, Taiwan email@example.com
2 Department of Computer Science, National Tsing Hua University, Taiwan firstname.lastname@example.org

Abstract

This paper describes Project CANDLE, an ongoing 3-year project that uses various corpora and NLP technologies to construct an online English learning environment for learners in Taiwan. This report focuses on the interim results obtained in the first eighteen months. First, an English-Chinese parallel corpus, Sinorama, was used as the main course material for reading, writing, and culture-based learning courses. Second, an online bilingual concordancer, TotalRecall, and a collocation reference tool, TANGO, were developed based on Sinorama and other corpora. Third, many online lessons, including extensive reading, verb-noun collocations, and vocabulary, were designed to be used alone or together with TotalRecall and TANGO. Fourth, an online collocation check program, MUST, was developed for detecting verb-noun (V-N) miscollocations in students’ writing and suggesting appropriate collocates, based on the hypothesis of L1 interference and on data from the BNC and the bilingual Sinorama corpus. Other computational scaffoldings are under development. It is hoped that this project will help intermediate learners in Taiwan enhance their English proficiency with effective pedagogical approaches and versatile language reference tools.
Keywords: computer assisted language learning, corpus, concordance, natural language processing, collocation

1 Introduction

Researchers and teachers in the field of Computer Assisted Language Learning (CALL) have been working on harnessing speech and natural language processing technology and Internet resources to revitalize traditional language learning. They have also explored new pedagogy made possible by computers and the Internet. The first goal can be met by an adaptive CALL system that provides a learning environment making systematic and ongoing adjustments based on the individual differences of learners. Drill sessions for different language proficiency levels, with feedback and guidance, are accessible online to facilitate the learning of structural knowledge.

As a new pedagogy, digitized corpora are used to facilitate inductive, data-driven language learning in ways that have not been possible in the past. Language data is important for learning because it activates learners’ mental mechanisms and becomes essential input for second or foreign language acquisition. Various language learning activities or tasks that include listening, speaking, reading, writing, translation, or a combination of two or more skills can be constructed based on various corpora and adaptive/automatic tools to achieve the goal of computational scaffolding.

Computers, corpora, and computational linguistics have an increasing role to play in ELT and e-Learning. This new combination of texts and technology has forever changed the way we study, teach, and learn a language. We can take full advantage of the “New Way” only if we embrace new pedagogy and Natural Language Processing technology. NLP technology enables us to compile better dictionaries, concordancers, and collocation aids from corpora of authentic text.
NLP technology also allows a close look at the trouble spots of L2 learning in a learner corpus and offers remedies. NLP technology even helps teachers prepare better reading comprehension tests in less time. And most importantly, NLP technology makes possible a new pedagogy that encourages student-centered, inductive, and culture-based learning.

This paper will describe the first-year experience we have with Project CANDLE under the National Science and Technology Program for Digital Learning. This three-year project for advancing Computer Assisted Language Learning (CALL) is based upon corpora and NLP tools and provides a digital English learning environment for intermediate learners in Taiwan. It is thus named Corpora And NLP for Digital Learning of English (CANDLE). The project integrates expertise in four research areas: (1) NLP technologies and applications, (2) an intelligent self-access English reading environment, (3) English learning through writing and translation, and (4) bilingual corpus and culture-based English learning. Visit http://candle.cs.nthu.edu.tw/candle/ to try out an array of language resources, computational tools, and web-based learning units.

The project is unique, in Taiwan and internationally, in that it draws its language resources mainly from a bilingual parallel corpus, the Sinorama corpus, and builds on learners’ first-language background knowledge to empower learners with culture-based materials. Both the first language and its culture provide a scaffold for learners to use while learning a new language. The project aims to provide various types of English learning activities (reading, writing, listening, speaking, and translation) that emphasize structural knowledge and complex problem-solving learning (Chan, et al., 2001). Its major features include online practice that adapts to learners’ levels, automatic assessment and monitoring of learners’ progress, and profiling of learners’ preferences and levels of proficiency.
The rest of the paper gives a detailed description of the CANDLE project and its four subprojects, describes the unique features of its corpora processing and the development of its NLP tools, and ends with a conclusion.

Successful research on digital language learning requires close collaboration between computer engineers, teachers, and content experts. In this three-year CANDLE project, ten researchers from the areas of computer science (specifically NLP and speech recognition) and English teaching (CALL) have been working together with the aim of leveraging cutting-edge NLP tools and corpora processing tools to advance English learning for students in Taiwan. The CANDLE website (http://candle.cs.nthu.edu.tw/) will be used by students from six participating institutions associated with the research team. Attempts are being made to build and assess an English learning environment that meets the needs of local students.

2 Description of the CANDLE project and its initial achievements

The main goal of the CANDLE project is achieved through the collaboration of four subprojects: (1) Natural Language Processing and Assessment Tools, (2) An Intelligent Self-Access Reading Environment, (3) Learning a Foreign Language Through Written Exercises and Translation, and (4) Bilingual Corpus and Culture-based Language Learning. The first subproject provides essential and advanced NLP tools and activity tracking mechanisms for the other three subprojects to facilitate and monitor online learners. The second subproject focuses on the construction and assessment of an intelligent self-access reading environment that adapts to learners’ English levels. The third subproject explores the potential of learning English through writing and translation exercises. The fourth subproject uses the bilingual corpus to enhance culture-based English learning, an area that has not been fully explored yet. All four subprojects are innovative for natural language processing and English learning.
The first subproject also produces the much-needed digital and content-related advanced technology for the other three subprojects, which have to do with fundamental research on digital learning strategy and behavior and with assessing the usefulness of the proposed approach to advanced digital learning of English (see Figure 1 for an illustration of the role of the first subproject).

Our objectives and anticipated progress in the three-year project are planned in three stages: in the first year, we will work on web-based learning material development; in the second year, formative evaluation; in the third year, summative evaluation.

Figure 1 Integration of the Four Sub-projects with Their Respective Modules
[Diagram. Center: CANDLE Interface, Computer assisted management system [1st year]. NLP: TotalRecall, Tango, Collocation checker, Text grader, Speech recognizer. Reading: Self-access module, Speedy reading, Strategy trainer; supported by TotalRecall. Writing: Writing with TotalRecall, Collocation practice; supported by TotalRecall, Tango, & Collocation checker. Culture: Culture courses, Candle talk; supported by TotalRecall.]

3 Corpora Processing and NLP Tools

Corpus linguists study real texts, using explicit algorithms to extract linguistic knowledge from corpora. An important function of corpora in the language classroom is to provide learners with concentrated exposure to particular patterns of repetition. With the use of corpus tools, language learners can avoid unhelpful reliance on oversimplified 'rules' prepackaged by the teacher; instead they develop proficiency through focused, purposeful exposure to, and use of, language in specific contexts (Teubert, 1996).

3.1 Sinorama, Main Material for Reading, Writing, and Culture-based Learning Courses

A parallel corpus is a collection of "parallel" texts in different languages or in different varieties of a language.
There is a bilingual Chinese-English electronic corpus in Taiwan: Sinorama, a monthly magazine published for over three decades by the Government Information Office (GIO) in Taiwan. Among the several magazines published by the GIO, Sinorama remains the most popular because it includes insightful reports on the lifestyles, society, economy, and cultures of the people in Taiwan. The topics of the articles in Sinorama include the following: art achievement, literature, painting and calligraphy, film, dance, music, architecture, museums, traditional opera, handicrafts, clothing and accessories, stories told in Chinese paper cuts, drama, Taiwanese culture, and the influx of Western culture.

There are many reasons for reading articles in Sinorama; the most important is that students can interact with authentic texts. Using Sinorama also gives students opportunities to read about their own culture through texts that are specific to the Taiwanese context. In addition, it familiarizes students with the writing style of Sinorama and prepares them to use a concordancer that is based on this corpus.

Figure 2 The results of searching for “example”
[Screenshot legend. A: Database selection; B: English query; C: Mandarin query; D: Number of entries per page; E: Normal; F: Clustered summary according to translation; H: Order by; I: Submit button; J: Page index; K: English citation; L: Mandarin citation; N: All citations in the cluster; O: Full text context; P: Paragraph context.]

The current project, Corpora and Natural Language Processing for Digital Learning of English (CANDLE), aims to use a range of corpora and advanced natural language processing tools to revitalize traditional English learning for intermediate learners in Taiwan via activities of reading, translation, and culture.
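
To make concrete what a concordancer built on a sentence-aligned parallel corpus returns, the following is a minimal sketch. The sample sentence pairs, the function name `concordance`, the `width` parameter, and the whole-word matching rule are all illustrative assumptions; they are not Sinorama data and not how TotalRecall is implemented.

```python
import re
from typing import List, Tuple

# Hypothetical sentence-aligned English/Mandarin pairs (invented, not from Sinorama).
ALIGNED_PAIRS: List[Tuple[str, str]] = [
    ("This temple is a fine example of traditional architecture.", "這座廟是傳統建築的好例子。"),
    ("The museum opened last year.", "這座博物館去年開幕。"),
    ("For example, paper cutting is still taught in schools.", "例如，學校仍然教剪紙。"),
]


def concordance(query: str, pairs: List[Tuple[str, str]], width: int = 20) -> List[Tuple[str, str]]:
    """Return (keyword-in-context line, aligned translation) for each whole-word hit."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    hits = []
    for english, mandarin in pairs:
        for match in pattern.finditer(english):
            # Keep up to `width` characters of context on each side of the keyword.
            left = english[max(0, match.start() - width):match.start()]
            right = english[match.end():match.end() + width]
            hits.append((f"{left}[{match.group(0)}]{right}", mandarin))
    return hits


for kwic, translation in concordance("example", ALIGNED_PAIRS):
    print(kwic, "|", translation)
```

Each hit pairs an English keyword-in-context line with the Mandarin sentence aligned to it, which is the basic unit a learner browses in a bilingual concordance. This sketch works at sentence level only; it does not attempt the word- and phrase-level alignment described for TotalRecall.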
3.2 Online Bilingual Concordancer and Collocation Reference Tool

3.2.1 TotalRecall

TotalRecall is a Chinese-English bilingual concordancer built on the Sinorama parallel corpus (Wu, et al., 2003). This work involves bilingual sentence, word, and phrase alignment; sophisticated queries that meet learners’ various search needs; and ranked display of the output. It can display (see Figures 2 and 3):
a) Collocation information in a concordance
b) Four levels of context for a citation: sub-sentential, sentential, paragraph, and text
c) Highlighting of accurate word- and phrase-level alignment of translation equivalents.

3.2.2 TANGO

With the collocation types and instances extracted from the corpus, we built an online collocational concordancer called TANGO for looking up collocation instances and translations (Jian, 2004). A user can type in any English word as a query and select the expected part of speech of the accompanying words. For example, in Figure 4, after the query “influence” is submitted, the possible collocates are displayed on the return page. The user can even select different adjacent collocates for further investigation. Moreover, using bilingual collocation alignment and sentence alignment, the system displays the target collocation and its translation equivalents, highlighted in different sentential contexts. Through this browser-based interface, translators or learners can easily gain access to the usage of each collocation with relevant instances. This may help learners speed up their internalization process. This bilingual collocational concordancer could be a very useful tool for self-inductive learning tailored to intermediate or advanced English learners.

Figure 3.
Chinese-English Bilingual Concordance, TotalRecall

3.3 Online Lessons that work with TotalRecall and TANGO

In addition to the development of NLP tools and the planning of a learner corpus compilation, four master’s theses in English teaching have been completed, and more are under way. They shed light on the development of NLP tools or online instructional units and provide a curriculum model that other English teachers can use to infuse CANDLE into their own instructional contexts. The topics of this thesis research include:
(1) Subsentential alignment of bilingual corpus by interleaving text and punctuation matches.
(2) Automatic acquisition of VN and other types of collocation in free texts.
(3) Effects of automatic essay grading and bilingual concordancing on college students’ EFL writing: This thesis investigates how a commercial online essay grader, My Access, and TotalRecall can help college English students’ writing, revision, and error-correction. It will provide a curriculum model. Post-tests have shown that the students improved their scores after using these tools.
(4) Effects of CALL approaches on learning of college students’ English verb-noun collocation: This thesis develops eight online instructional units for teaching English verb-noun collocations and uses TotalRecall to investigate whether the two CALL approaches can help college students learn collocations. It has both development and pedagogical implications.
(5) Effects of online extensive reading of English texts with controlled vocabulary on college students’ incidental learning: This thesis uses text processing programs to control the vocabulary difficulty level and the number of new words appearing in a group of texts, and investigates whether such an arrangement of text selection can help college English student readers acquire more new words. It has both development and pedagogical implications.
(6) The feasibility of using the Sinorama bilingual concordance in a culture awareness language course for non-English-major students: This thesis integrates the use of TotalRecall into a college English course and explores the learning process and product. It will provide a curriculum model.

Figure 4 Web-based Collocational Concordance, TANGO

Preparation of online cultural materials for English learning and pilot testing of the use of Sinorama and TotalRecall in English teaching contexts in participating colleges are under way.

3.4 Online Collocation Check Program

We have also developed a web-based automatic collocation-detection system as an online aid for EFL writers; it especially targets learners’ miscollocations attributable to L1 translation interference on the verb collocate (Chang, 2004). The system provides appropriate collocations as feedback, based on the mutual translations between the learner’s L1 and L2. When a user inputs a V-N collocation, the system checks it and derives a list of candidate English verbs that share the same Chinese translations by processing bilingual corpora. After combining the noun with those candidate verbs to form V-N pairs, the system uses a reference English corpus to exclude the inappropriate V-N pairs and so single out the proper collocations. An example of correcting the misused collocation “publish album” is shown in Figure 5. The system can promptly suggest the collocation that the learner intended to write but got wrong. It is hoped that this online assistant can facilitate EFL learner-writers’ use of collocations and help them transfer this knowledge to their future writing.

Figure 5 The Interface of Online Collocation Check Program, MUST

4 Conclusion

In the first year of Project CANDLE, we have built several NLP tools for CALL. These tools have been used in various ELT research and teaching activities with promising results.
We plan to develop more tools based on NLP technologies to explore areas such as semi-automatic test generation and grading. By emphasizing advanced NLP technologies, sound English pedagogical theories, and empirical assessment of usability with real learners, we are confident that we will reach our ultimate goal of creating a digital learning environment that meets the real needs of English learners in Taiwan.

In the coming years, we will pursue the following goals via the CANDLE website:
(1) Providing access to the CANDLE learning center to as many students as we can reach.
(2) Providing empirical evidence or usability testing data to demonstrate the usefulness or effectiveness of CANDLE.
(3) Exploring the possibilities of curriculum infusion in various universities or colleges for different learners.

By putting natural language processing technologies to work with sound English pedagogical theories and usability studies on real learners, we hope to advance the state of the art of English Language Teaching. A bilingual corpus, bilingual concordancers, and browser-based interactive tests or exercises are among the advanced technologies we have developed. Additionally, we will make the CANDLE environment meet the English learning needs of local learners by attending to their specific difficulties. Evaluations using methods such as psychometric measures in a comparison design, discourse analysis, or portfolios will be conducted in the third year to advance the understanding of learners’ behavior when they work online. We envision that learners will be capable of the complex problem solving needed to network with foreign language users in other countries. We hope to achieve the goal of promoting learner autonomy and life-long learning, so as to enable learners to fully participate in the English-speaking discourse community.
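
As an illustration of the collocation check procedure described in Section 3.4 (derive candidate verbs that share a Chinese translation with the learner's verb, pair them with the noun, then filter against a reference English corpus), the following is a minimal sketch. The dictionaries, counts, threshold `min_count`, and function name `suggest_collocations` are invented for illustration; they are assumptions standing in for the bilingual corpus and the reference corpus, not the data or implementation of MUST.

```python
from typing import Dict, List, Set, Tuple

# Toy stand-in for verb translations harvested from a bilingual corpus (invented values).
VERB_TRANSLATIONS: Dict[str, Set[str]] = {
    "publish": {"出版", "發行", "發表"},
    "release": {"發行", "釋放"},
    "issue": {"發行", "發表"},
    "free": {"釋放"},
}

# Toy stand-in for verb-noun frequencies in a reference English corpus (invented values).
REFERENCE_VN_COUNTS: Dict[Tuple[str, str], int] = {
    ("release", "album"): 120,
    ("issue", "album"): 3,
    ("publish", "book"): 400,
}


def suggest_collocations(verb: str, noun: str, min_count: int = 5) -> List[str]:
    """Suggest alternative verbs for a learner's V-N pair, most frequent first."""
    source = VERB_TRANSLATIONS.get(verb, set())
    # Step 1: candidate verbs that share at least one Chinese translation with the input verb.
    candidates = [v for v, zh in VERB_TRANSLATIONS.items() if v != verb and zh & source]
    # Step 2: pair each candidate with the noun and look up its reference-corpus frequency.
    scored = [(v, REFERENCE_VN_COUNTS.get((v, noun), 0)) for v in candidates]
    # Step 3: drop pairs that are rare or unattested, then rank the rest by frequency.
    kept = sorted((pair for pair in scored if pair[1] >= min_count), key=lambda pair: -pair[1])
    return [v for v, _ in kept]


print(suggest_collocations("publish", "album"))
```

With this toy data the call prints `['release']`, because "issue album" falls below the frequency threshold and "free" shares no translation with "publish". The result reflects only the invented entries above and says nothing about what the actual system returns.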
Acknowledgements

This paper is supported by a research grant from the National Science Council under the projects NSC92-2524-S007-002 and NSC 93-2524-S-007-002.

References

Chan, T. W., Hue, C. W., Chou, C. Y., & Tzeng, O. J. L. 2001. Four spaces of network learning models. Computers & Education, 37, 141-161.
Chang, J.S., David Yu, Chun-Jun Lee. Statistical Translation Model for Phrases, Vol. 6, No. 2, pp. 43-64, 2001.
Chang, Richard, T-P Chen, Jason S. Chang. 2004. An Automatic Collocation Writing Assistant for Taiwanese EFL Learners: Using Corpora for language teaching and learning based on NLP Technology, EUROCALL.
Chuang, Thomas C., and Jason S Chang, 2002. Adaptive Sentence Alignment based on Length and Lexical Information, In Proceedings of the 40th Annual Meeting of Association for Computational Linguistics, Comp. Volume, 91-92.
Chuang, Thomas C., Jian-Cheng Wu, Tracy Lin, Web-Chie Shei and Jason S. Chang, “Bilingual Sentence Alignment Based on Punctuation Statistics and Lexicon,” Proceedings of the first International Joint Conference on Natural Language, IJCNLP-04, pp. 644-651, Hainan Island, China, Jan 2004.
Chuang, Thomas C., NG You, and Jason S Chang, 2002. Adaptive Sentence Alignment, Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas, AMTA'2002, Tiburon, California.
Conzett, J. (2000). Integrating collocation into a reading and writing course. In Lewis, M. (Ed.), Teaching collocation: Further developments in the lexical approach (pp. 70-86). London: Language Teaching Publications.
Farghal, M. & Obiedat, H. (1995). Collocations: A neglected variable in EFL. International Review of Applied Linguistics, 33, 313-331.
Gale, W. & K. W. Church, "A Program for Aligning Sentences in Bilingual Corpora," Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics, Berkeley, CA, 1991.
Granger, S. 2003. Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal, 20(3), 465-480.
Jian, J. Y., Chang, Y. C., & Chang, J. S. 2004. Collocational Translation Memory Extraction Based on Statistical Linguistic Information, Paper presented in ROCLING 2004, Conference on Computational Linguistics and Speech Processing, Taipei.
Jian, Jia-Yan, Yu-Chia Chang, Jason S. Chang. “TANGO: Bilingual Collocational Concordancer,” Proceedings of the 42nd Annual Meeting of Association for Computational Linguistics, Comp. Vol., 2004.
Lee, S. H. (2003). ESL learners’ vocabulary use in writing and the effects of explicit vocabulary instruction, System, 31, 537-561.
Lewis, M. (2000). Teaching collocation: Further developments in the lexical approach. London: Language Teaching Publications.
Lin, T., C.J. Wu, and J.S. Chang. 2003. Word Transliteration Alignment, Proceedings of the fifteenth Research on Computational Linguistics Conference, ROCLING XV, Hsinchu.
Liou, H. C., et al. 2003. Using corpora and computational scaffolding to construct an advanced digital English learning environment: The CANDLE project. The 7th Int’l Conference on Multimedia Language Education, Chia-Yi, December 19-21.
Liu, C. P. (1999). An analysis of collocation errors in EFL writings. The proceedings of the Eighth International Symposium on English Teaching (pp. 483-494). Taipei: Crane.
Liu, C. P. (2000). A study of strategy use in producing lexical collocations. Selected Papers from the Ninth International Symposium on English Teaching (pp. 481-492). Taipei: Crane.
Liu, L. E. (2002). A corpus-based lexical semantic investigation of verb-noun miscollocations in Taiwan learners’ English. Unpublished master’s thesis, Tamkang University, Taipei, January.
Macklovitch, E., Simard, M., Langlais, P.: TransSearch: A Free Translation Memory on the World Wide Web. Proc. LREC 2000 III, 1201-1208 (2000).
Melamed, I. D. 1997. A Word-to-Word Model of Translational Equivalence. Proc. of the ACL97, pp. 490-497. Madrid, Spain, 1997.
Mitkov, Ruslan, and Le An Ha 2003, Computer-Aided Generation of Multiple-Choice Tests, In Proceedings of HLT-NAACL 2003.
Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.
Nattinger, J. R., & DeCarrico, J. D. (1992). Lexical phrases and language teaching. Oxford: Oxford University Press.
Nesselhauf, N. (2003). The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics, 24, 223-242.
Shei, C. C., & Pain, H. (2000). An ESL writer’s collocation aid. Computer Assisted Language Learning, 13, 167-182.
Simard, M., G. Foster & P. Isabelle (1992), Using cognates to align sentences in bilingual corpora. In Proceedings of TMI92, Montreal, Canada, pp. 67-81.
Teubert, W. 1996. Comparable or Parallel Corpora? International Journal of Lexicography Vol. 9, Number 3, pp. 238-265.
Teubert, W. 1996. Why corpus linguistics? International Journal of Corpus Linguistics, 1(1).
Teubert, W. 2003. Parallel Corpora and Language Learning, 12th International Symposium on English Teaching, Taipei.
Wang, Yi-Chia, Jian-Cheng Wu, Tyne Liang, Jason S. Chang. 2004. Using the Web as Corpus for Un-supervised Learning in Question Answering, to appear in ROCLING XVI: Conference on Computational Linguistics and Speech Processing, September 2-3, 2004, Howard Pacific Green Bay, Taipei, Taiwan, ROC.
Whitelock, P., & Edmonds, P. 2000. The Sharp intelligent dictionary. Proceedings of the 9th EURALEX, pp. 871-876.
Wu, Chien-Cheng, and Jason S. Chang. Bilingual Collocation Extraction Based on Syntactic and Statistical Analyses, Computational Linguistics and Chinese Language Processing, Vol. 9, No. 1, 2004, pp. 1-20.
Wu, CJ and J.S. Chang. Alignment of Collocation via Syntactic and Statistical Analyses. Proceedings of the fifteenth Research on Computational Linguistics Conference, ROCLING XV, Hsinchu.
Wu, CJ, K. Yeh, T.C. Chuang, W.C. Shei and Jason S. Chang. 2003. ‘TotalRecall: A Bilingual Concordance for Computer Assisted Translation and Language Learning,’ In Proceedings of the 41st Annual Meeting of Association for Computational Linguistics, Comp. Volume, 201-204.
Wu, Dekai (1994), Aligning a parallel English-Chinese corpus statistically with lexical criteria. In The Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico, USA, pp. 80-87.
Wu, J.C., Thomas C. Chuang, Wen-Chi Shei and Jason S. Chang. “Subsentential Translation Memory for Computer Assisted Writing and Translation,” Proceedings of the 42nd Annual Meeting of Association for Computational Linguistics, Comp. Vol. 2004.
Zhang, X. (1993). English collocations and their effect on the writing of native and non-native college freshmen. Ph.D. thesis, Indiana University of Pennsylvania.