

									              IWLeL 2004: An Interactive Workshop on Language e-Learning            15 - 23                 15

    Computer Assisted Language Learning Based on Corpora and
      Natural Language Processing: The Experience of Project CANDLE
                              Jason S. CHANG1 and Yu-Chia CHANG 2
                Department of Computer Science, National Tsing Hua University, Taiwan
                Department of Computer Science, National Tsing Hua University, Taiwan

    This paper describes Project CANDLE, an ongoing 3-year project which uses various corpora
    and NLP technologies to construct an online English learning environment for learners in
    Taiwan. This report focuses on the interim results obtained in the first eighteen months. First, an
    English-Chinese parallel corpus, Sinorama, was used as the main course material for reading,
    writing, and culture-based learning courses. Second, an online bilingual concordancer,
    TotalRecall, and a collocation reference tool, TANGO, were developed based on Sinorama and
    other corpora. Third, many online lessons, including extensive reading, verb-noun collocations,
    and vocabulary, were designed to be used alone or together with TotalRecall and TANGO.
    Fourth, an online collocation check program, MUST, was developed for detecting V-N
    miscollocations and suggesting appropriate collocates in students’ writing based on the hypothesis
    of L1 interference and the databases of the BNC and the bilingual Sinorama Corpus. Other
    computational scaffoldings are under development. It is hoped that this project will help
    intermediate learners in Taiwan enhance their English proficiency with effective pedagogical
    approaches and versatile language reference tools.

    Keywords: computer assisted language learning, corpus, concordance, natural language
              processing, collocation

1     Introduction
Researchers and teachers in the field of Computer Assisted Language Learning (CALL) have been
working on harnessing speech and natural language processing technology and Internet resources to
revitalize traditional language learning. They have also explored new pedagogy made possible by
computers and the Internet. The first goal can be met by an adaptive CALL system which provides a
learning environment that makes systematic and ongoing adjustment based on individual differences of
learners. Drill sessions for different language proficiency levels with feedback and guidance are
accessible online to facilitate learning of structural knowledge. As a new pedagogy, digitalized corpora
are used to facilitate inductive data-driven language learning in ways that have not been possible in the
past. Language data is important for learning because it activates learners’ mental mechanisms and
becomes essential input for second or foreign language acquisition. Various language learning activities
or tasks that include listening, speaking, reading, writing, translation or a combination of two or more
skills can be constructed based on various corpora and adaptive/automatic tools to achieve the goal of
computational scaffolding.
    Computers, corpora, and computational linguistics have an increasing role to play in ELT and
e-Learning. This new combination of texts and technology has forever changed the way we study, teach,
and learn a language. We can take full advantage of the “New Way” only if we embrace new pedagogy
and Natural Language Processing technology. NLP technology enables us to compile better dictionaries,
     concordancers, and collocation aids from corpora of authentic text. NLP technology offers a
     microscopic look at trouble spots of L2 learning in a learner corpus and suggests remedies. NLP
     technology even helps teachers prepare better reading comprehension tests in less time.
     Most importantly, NLP technology makes possible a new pedagogy that encourages
     student-centered, inductive, and culture-based learning. This paper will describe the first-year
     experience we have with Project CANDLE under the National Science and Technology Program for
     Digital Learning. This three-year project for advancing Computer Assisted Language Learning (CALL)
     uses corpora and NLP tools to provide a digital English learning environment for intermediate
     learners in Taiwan; hence its name, Corpora And NLP for Digital Learning of English (CANDLE). The
     project integrates expertise in four research areas: (1) NLP technologies and applications, (2) an
     intelligent self-access English reading environment, (3) English learning through writing and translation,
     and (4) bilingual corpus and culture-based English learning. The CANDLE website offers
     an array of language resources, computational tools, and web-based learning units to try out.
          The project is unique, in Taiwan and internationally, in drawing its language resources mainly from a
     bilingual parallel corpus, the Sinorama corpus, and in building on learners’ first language background
     knowledge to empower learners with culture-based materials. Both the first language and its culture provide
     a scaffold for learners to use while learning a new language. The project aims to provide various
     types of English learning activities (reading, writing, listening, speaking, and translation) that
     emphasize structural knowledge and complex problem-solving learning (Chan, et al., 2001). Its major
     features include online practice that adapts to learners’ levels, automatic assessment and monitoring of
     learners’ progress, and profiling of learners’ preferences and levels of proficiency. The rest of the paper
     gives a detailed description of the CANDLE project and its four sub-projects, describes the unique features
     of corpora processing and the development of NLP tools, and ends with a conclusion.
          A successful research effort on digital language learning requires close collaboration between
     computer engineers, teachers, and content experts. In this three-year CANDLE project, ten researchers
     from the research areas of computer science (specifically NLP and speech recognition) and English
     teaching (CALL) have been working together with the aim of leveraging cutting-edge NLP tools and
     corpora processing tools to advance English learning for students in Taiwan. The CANDLE website
     will be used by students from six participating institutions associated
     with the research team. Attempts are being made to build and assess an English learning environment
     that meets the needs of local students.

     2     Description of the CANDLE project and its initial achievements
     The main goal of the CANDLE project is achieved through collaboration of four subprojects:

         (1)   Natural Language Processing and Assessment Tools,
         (2)   An Intelligent Self-Access Reading Environment,
         (3)   Learning a Foreign Language Through Written Exercises and Translation,
         (4)   Bilingual Corpus and Culture-based Language Learning.

         The first subproject provides essential and advanced NLP tools and activity tracking mechanisms for
     the other three subprojects to facilitate and monitor online learning. The second subproject focuses on
     construction and assessment of an intelligent self-access reading environment that adapts to learners’
     English levels. The third subproject explores the potential of learning English through
     writing and translation exercises. The fourth subproject uses the bilingual corpus to enhance
     culture-based English learning, an area that has not yet been fully explored. All four subprojects are
     innovative for natural language processing and English learning. The first subproject also supplies the
     much-needed digital and content-related technology for the other three subprojects, which
     carry out fundamental research on digital learning strategies and behavior and assess the
     usefulness of the proposed approach to digital learning of English (see Figure 1 for an
     illustration of the role of the first subproject).
     Our objectives and anticipated progress in the three-year project are planned in three stages: in the
first year, we will work on web-based learning material development; in the second year, formative
evaluation; in the third year, summative evaluation.

           Center: Computer assisted management system [1st year]
           NLP: TotalRecall, TANGO, collocation checker, speech recognizer
           Reading: Self-access module, text grader, speedy reading, strategy trainer; supported by TotalRecall
           Writing: Writing with TotalRecall, collocation practice; supported by TotalRecall, TANGO, and the collocation checker
           Culture: Culture courses, Candle talk; supported by TotalRecall

               Figure 1 Integration of the Four Sub-projects with their Respective Modules

3     Corpora Processing and NLP Tools
Corpus linguists study real texts, using explicit algorithms to extract linguistic knowledge from corpora.
An important function of corpora in the language classroom is to provide the learners with concentrated
exposure to particular patterns of repetition. With the use of corpus tools, language learners can avoid
unhelpful reliance on oversimplified 'rules' prepackaged by the teacher; instead they develop
proficiency through focused, purposeful exposure to, and use of, language in specific contexts (Teubert,

3.1   Sinorama, Main Material for Reading, Writing, and Culture-based Learning Courses
A parallel corpus is a collection of "parallel" texts in different languages or in different varieties of a
language. One bilingual Chinese-English electronic corpus in Taiwan is Sinorama, a
monthly magazine published over three decades by the Government Information Office (GIO) in Taiwan. Among
the several magazines published by the GIO, Sinorama remains the most popular because it includes
insightful reports on the lifestyles, society, economy, and cultures of the people in Taiwan. The
topics of the articles in Sinorama include the following: art achievement, literature, painting and
calligraphy, film, dance, music, architecture, museums, traditional opera, handicrafts, clothing and
accessories, stories told in Chinese paper cuts, drama, Taiwanese culture, and influx of Western culture.
The reasons for reading articles in Sinorama are many; the most important is that students are allowed to
interact with the authentic texts that they are reading. Using Sinorama also gives the students
opportunities to read articles about their own culture through the texts that are specific to the Taiwanese
context. In addition, this can familiarize the students with the writing style of Sinorama and prepare
them to utilize a concordancer that is based on this corpus.

           A: Database selection B: English query C: Mandarin query D: Number of entries per page
           E: Normal F: Clustered summary according to translation
           H: Order by I: Submit button J: Page index K: English citation L: Mandarin citation
           N: All citations in the cluster O: Full text context P: Paragraph context

                                 Figure 2 The results of searching for “example”

         The current project, Corpora and Natural language processing for Digital Learning of English
     (CANDLE) aims to use a range of corpora and advanced natural language processing tools to revitalize
     traditional English learning for intermediate learners in Taiwan via activities of reading, translation and

     3.2    Online Bilingual Concordancer and Collocation Reference Tool

     3.2.1 TotalRecall
     TotalRecall is a Chinese-English bilingual concordancer built on the Sinorama parallel corpus (Wu, et al.,
     2003). It involves bilingual sentence, word, and phrase alignment, sophisticated queries that
     meet learners’ various search needs, and ranked display of the output. It can display (see Figures 2
     and 3):

           a) Collocation information in a concordance
           b) Four levels of context for citation: sub-sentential, sentential, paragraph, and text
           c) Highlighting of accurate word and phrase level alignment of translation equivalent.
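
     As a rough illustration of what a bilingual concordance lookup involves, the following minimal Python
     sketch searches a toy list of aligned sentence pairs and returns keyword-in-context lines together with
     their translations. The sentence pairs, the `concordance` function, and the `width` parameter are all
     hypothetical; TotalRecall itself relies on sentence, word, and phrase alignment of Sinorama, which this
     sketch does not attempt to reproduce.

```python
import re

# Toy aligned sentence pairs (English, Chinese); invented for illustration,
# not taken from Sinorama.
ALIGNED = [
    ("She set a good example for the class.", "她為全班樹立了好榜樣。"),
    ("For example, tea is grown in the hills.", "例如，茶種植在山區。"),
    ("The museum opens at nine.", "博物館九點開門。"),
]


def concordance(query, pairs, width=15):
    """Return (left, keyword, right, translation) for each whole-word hit."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    hits = []
    for english, chinese in pairs:
        for m in pattern.finditer(english):
            # Keep up to `width` characters of context on each side.
            left = english[max(0, m.start() - width):m.start()]
            right = english[m.end():m.end() + width]
            hits.append((left, m.group(0), right, chinese))
    return hits


if __name__ == "__main__":
    for left, kw, right, zh in concordance("example", ALIGNED):
        print(f"{left:>15} [{kw}] {right:<15} | {zh}")
```

     Showing each hit next to its aligned translation is the sentential level of context; the sub-sentential
     highlighting listed in (c) would additionally require word and phrase alignment.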

     3.2.2 TANGO
     With the collocation types and instances extracted from the corpus, we built an online collocational
     concordancer called TANGO for looking up collocation instances and translations (Jian, 2004). A user
     can type in any English word as a query and select the expected part of speech of the accompanying
     words. For example, in Figure 4, after the query “influence” is submitted, the possible collocates
     are displayed on the results page. The user can then select different adjacent collocates for further
investigation. Moreover, using the technique of bilingual collocation alignment and sentence alignment,
the system will display the target collocation and its translation equivalents highlighted in different
sentential contexts. Translators or learners, through this browser-based interface, can easily gain access
to the usage of each collocation with relevant instances. This may help learners speed up their
internalization process. This bilingual collocational concordancer could be a very useful tool for
self-inductive learning tailored to intermediate or advanced English learners.
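
The lookup step described above can be sketched in a few lines of Python: given POS-tagged sentences, count
the words of a chosen part of speech that occur within a small window of the query word. The tagged toy
sentences, the tag names, the `collocates` function, and the window size are assumptions made for
illustration; TANGO's actual extraction and bilingual alignment methods (Jian, 2004) are not reproduced here.

```python
from collections import Counter

# Toy POS-tagged sentences as (word, tag) pairs; invented for illustration.
TAGGED = [
    [("exert", "V"), ("a", "DET"), ("strong", "ADJ"), ("influence", "N")],
    [("have", "V"), ("an", "DET"), ("influence", "N"), ("on", "P"), ("art", "N")],
    [("exert", "V"), ("influence", "N"), ("over", "P"), ("policy", "N")],
]


def collocates(query, pos, sentences, window=3):
    """Count words tagged `pos` that occur within `window` tokens of `query`."""
    counts = Counter()
    for sent in sentences:
        for i, (word, _) in enumerate(sent):
            if word != query:
                continue
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                other, tag = sent[j]
                if j != i and tag == pos:
                    counts[other] += 1
    return counts


if __name__ == "__main__":
    # Verb collocates of "influence", most frequent first.
    print(collocates("influence", "V", TAGGED).most_common())
```

Raw co-occurrence counts are used here only to keep the sketch short; a real collocation extractor would
normally apply a statistical association measure rather than frequency alone.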

                     Figure 3. Chinese-English Bilingual Concordance, TotalRecall

3.3    Online Lessons that work with TotalRecall and TANGO
In addition to the development of NLP tools and planning of a learner corpus compilation, four English
teaching master’s theses have been completed and more are under way. They shed light on the development
of NLP tools or online instructional units and provide a curriculum model that other
English teachers can use to infuse CANDLE into their own instructional contexts. The topics of these
theses include:

      (1) Subsentential alignment of bilingual corpus by interleaving text and punctuation matches.
      (2) Automatic acquisition of VN and other types of collocation in free texts.
      (3) Effects of automatic essay grading and bilingual concordancing on college students’ EFL
          writing: This thesis investigates how a commercial online essay grader, My Access, and
          TotalRecall can help college English students’ writing, revision, and error-correction. It will
          provide a curriculum model. Post-tests have shown that the students improved their score after
          using these tools.
      (4) Effects of CALL approaches on learning of college students’ English verb-noun collocation:
          This thesis develops 8 online instructional units of teaching English verb-noun collocation and
          uses TotalRecall to investigate whether the two CALL approaches can help college students’
          learning of collocation. It has both development and pedagogical implications.
      (5) Effects of online extensive reading of English texts with controlled vocabulary on college
          students’ incidental learning: This thesis uses some text processing programs to control
               vocabulary difficulty level and number of new words appearing in a group of texts and
               investigates whether such arrangement of text selection can help college English student
               readers to acquire more new words. It has both development and pedagogical implications.
           (6) The feasibility of using the Sinorama bilingual concordance in a culture awareness language
               course for non-English-major students: This thesis integrates the use of TotalRecall into a
               college English course and explores the learning process and product. It will provide a
               curriculum model.
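
     To make the idea of vocabulary control in thesis (5) concrete, the sketch below ranks candidate texts by
     the number of word types that fall outside a list of words the learner is assumed to know. The word list,
     the sample texts, the simple tokenization, and the function names are hypothetical; they are not the
     text processing programs used in the thesis.

```python
import re


def new_words(text, known):
    """Return the set of word types in `text` that are not in `known`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return set(tokens) - known


def rank_by_difficulty(texts, known):
    """Sort texts from fewest to most unknown word types."""
    return sorted(texts, key=lambda t: len(new_words(t, known)))


if __name__ == "__main__":
    # Hypothetical learner vocabulary and candidate texts.
    known = {"the", "a", "is", "in", "tea", "grown", "hills"}
    texts = [
        "Calligraphy is a treasured tradition in the museum.",
        "The tea is grown in the hills.",
    ]
    for t in rank_by_difficulty(texts, known):
        print(len(new_words(t, known)), t)
```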

                             Figure 4 Web-based Collocational Concordance, TANGO

     Development of online cultural materials for English learning and pilot testing of
     Sinorama and TotalRecall in English teaching contexts at participating colleges are under way.

     3.4    Online Collocation Check Program
     We have also developed a web-based automatic collocation-detection system, MUST, as an online aid for EFL
     writers. It especially tackles learners’ miscollocations attributable to L1 translation interference on the
     verb collocate (Chang, 2004). The system provides appropriate collocations as feedback
     according to the mutual translations between the learner’s L1 and L2. When a user inputs a V-N collocation,
     the system derives a list of candidate English verbs that share the same Chinese translations
     by processing bilingual corpora. After combining the noun with those candidate verbs to form V-N pairs, the
     system uses a reference English corpus to exclude inappropriate V-N pairs and so single
     out the proper collocations. An example of correcting the misused collocation “publish album” is shown
     in Figure 5. The system can promptly suggest the collocation that the learner intended
     to write but misused. It is hoped that this online assistant can facilitate EFL learner-writers’ use of
     collocations and help them transfer this knowledge to their future writing.
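
     The procedure described above can be sketched as follows. This is a minimal illustration under stated
     assumptions: the tiny translation table and the reference counts are invented stand-ins for the bilingual
     corpora and the reference English corpus, and `suggest`, `VERB_TRANSLATIONS`, `REFERENCE_COUNTS`, and the
     frequency threshold are hypothetical names and values rather than those of the actual system. The
     suggestion produced for “publish album” comes from the toy data only and is not quoted from Figure 5.

```python
# Invented stand-in for verb translations derived from a bilingual corpus.
VERB_TRANSLATIONS = {
    "publish": {"發行", "出版"},
    "release": {"發行", "釋放"},
    "issue": {"發行", "發布"},
    "free": {"釋放"},
}

# Invented stand-in for V-N pair frequencies in a reference English corpus.
REFERENCE_COUNTS = {
    ("release", "album"): 120,
    ("issue", "album"): 1,
    ("publish", "book"): 300,
}


def suggest(verb, noun, min_count=5):
    """Suggest verbs that share a translation with `verb` and collocate with `noun`."""
    if REFERENCE_COUNTS.get((verb, noun), 0) >= min_count:
        return []  # the input pair is already attested often enough
    senses = VERB_TRANSLATIONS.get(verb, set())
    # Candidate verbs share at least one Chinese translation with the input verb.
    candidates = [v for v, zh in VERB_TRANSLATIONS.items()
                  if v != verb and zh & senses]
    # Keep only V-N pairs that the reference counts support, most frequent first.
    attested = [(v, REFERENCE_COUNTS.get((v, noun), 0)) for v in candidates]
    ranked = sorted((p for p in attested if p[1] >= min_count),
                    key=lambda p: -p[1])
    return [v for v, _ in ranked]


if __name__ == "__main__":
    print(suggest("publish", "album"))
```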

                 Figure 5 The Interface of Online Collocation Check Program, MUST

4     Conclusion
In the first year of project CANDLE, we have built several NLP tools for CALL. These tools have been
used in various ELT research and teaching activities with promising results. We plan to develop more
tools based on NLP technologies to explore areas such as semi-automatic test generation and grading.
By emphasizing advanced NLP technologies, sound English pedagogical theories, and empirical
assessment of usability based on real learners, we are confident that we will reach our ultimate goal
of creating a digital learning environment that meets the real needs of English learners in Taiwan.

    In the coming years, we will achieve the following goals via the CANDLE website:

    (1) Providing access to the CANDLE learning center to as many students as we can reach.
    (2) Providing empirical evidence or usability testing data to prove CANDLE usefulness or
    (3) Exploring the possibilities of curriculum infusion in various universities or colleges for different

    By putting natural language processing technologies to work with sound English pedagogical
theories and usability study on real learners, we hope to advance the state of the art of English Language
Teaching. A bilingual corpus, bilingual concordancers, and browser-based interactive tests and exercises are
among the advanced technologies we have developed. Additionally, we will make the CANDLE
environment meet the English learning needs of local learners by attending to their specific difficulties.
    Evaluation methods such as psychometric means in a comparison design, discourse analysis, or
portfolio assessment will be applied in the third year to advance the understanding of learners’ behavior when
they work online. We envision that learners will be capable of the complex problem solving needed to
network with foreign language users in other countries. We hope to achieve the goal of promoting
learner autonomy and life-long learning, so as to enable learners to fully participate in the English
speaking discourse community.

     Acknowledgements
     This work is supported by research grants from the National Science Council under projects
     NSC92-2524-S007-002 and NSC 93-2524-S-007-002.

     References

     Chan, T. W., Hue, C. W., Chou, C. Y., & Tzeng, O. J. L. 2001. Four spaces of network learning models.
         Computers & Education, 37, 141-161.
     Chang, J.S., David Yu, Chun-Jun Lee. Statistical Translation Model for Phrases, Vol. 6, No. 2, pp. 43-64, 2001.
     Chang, Richard, T-P Chen, Jason S. Chang. 2004. An Automatic Collocation Writing Assistant for Taiwanese
         EFL Learners: Using Corpora for language teaching and learning based on NLP Technology, EUROCALL.
     Chuang, Thomas C., and Jason S Chang, 2002. Adaptive Sentence Alignment based on Length and Lexical
         Information, In Proceedings of the 40th Annual Meeting of Association for Computational Linguistics, Comp.
         Volume, 91-92.
     Chuang, Thomas C., Jian-Cheng Wu, Tracy Lin, Wen-Chi Shei and Jason S. Chang, “Bilingual Sentence
         Alignment Based on Punctuation Statistics and Lexicon,” Proceedings of the First International Joint
         Conference on Natural Language Processing, IJCNLP-04, pp. 644-651, Hainan Island, China, Jan 2004.
     Chuang, Thomas C., NG You, and Jason S Chang, 2002. Adaptive Sentence Alignment, Proceedings of the Fifth
         Conference of the Association for Machine Translation in the Americas, AMTA'2002, Tiburon, California.
     Conzett, J. (2000). Integrating collocation into a reading and writing course. In Lewis, M. (Ed.), Teaching
         collocation: Further developments in the lexical approach (pp. 70-86). London: Language Teaching
     Farghal, M. & Obiedat, H. (1995). Collocations: A neglected variable in EFL. International Review of Applied
         Linguistics, 33, 313-331.
     Gale, W. & K. W. Church, "A Program for Aligning Sentences in Bilingual Corpora" Proceedings of the 29th
        Annual Meeting of the Association for Computational Linguistics, Berkeley, CA, 1991.
     Granger, S. 2003. Error-tagged learner corpora and CALL: A promising synergy. CALICO Journal, 20(3),
     Jian, J. Y., Chang, Y. C., & Chang, J. S. 2004. Collocational Translation Memory Extraction Based on Statistical
          Linguistic Information, Paper presented in ROCLING 2004, Conference on Computational Linguistics and
          Speech Processing, Taipei.
     Jian, Jia-Yan, Yu-Chia Chang, Jason S. Chang. “TANGO: Bilingual Collocational Concordancer,” Proceedings of
         the 42nd Annual Meeting of Association for Computational Linguistics, Comp. Vol., 2004.
     Lee, S. H. (2003). ESL learners’ vocabulary use in writing and the effects of explicit vocabulary instruction,
        System, 31, 537-561.
     Lewis, M. (2000). Teaching collocation: Further development in lexical approach. London: Language Teaching
     Lin, T., C.J. Wu, and J.S. Chang. 2003 Word Transliteration Alignment, Proceedings of the fifteenth Research on
         Computational Linguistics Conference, ROCLING XV, Hsinchu.
     Liou, H. C., et al. 2003. Using corpora and computational scaffolding to construct an advanced digital English
         learning environment: The CANDLE project. The 7th Int’l Conference on Multimedia Language Education,
         Chia-Yi, December 19-21.
     Liu, C. P. (1999). An analysis of collocation errors in EFL writings. The proceedings of the Eighth International
         Symposium on English Teaching (pp. 483-494). Taipei: Crane.
     Liu, C. P. (2000). A study of strategy use in producing lexical collocations. Selected Papers from the Ninth
         International Symposium on English Teaching (pp. 481-492). Taipei: Crane.
     Liu, L. E. (2002). A corpus-based lexical semantic investigation of verb-noun miscollocations in Taiwan learners’
         English. Unpublished master’s thesis, Tamkang University, Taipei, January.
     Macklovitch, E., Simard, M., Langlais, P.: TransSearch: A Free Translation Memory on the World Wide Web.
        Proc. LREC 2000 III, 1201--1208 (2000).
     Melamed, I. D. 1997. A Word-to-Word Model of Translational Equivalence. Proc. of the ACL97. pp 490-497.
            Madrid Spain, 1997.
     Mitkov, Ruslan, and Le An Ha 2003, Computer-Aided Generation of Multiple-Choice Tests, In Proceedings of
         HLT-NAACL 2003.
     Nation, I. S. P. (2001). Learning vocabulary in another language. Cambridge: Cambridge University Press.
     Nattinger, J. R., & DeCarrico, J. D. (1992). Lexical phrase and language teaching. Oxford: Oxford University
Nesselhauf, N (2003). The use of collocations by advanced learners of English and some implications for teaching.
   Applied Linguistics, 24, 223-242.
Shei, C. C., & Pain, H. (2000). An ESL writer’s collocation aid. Computer Assisted Language Learning, 13,
Simard, M., G. Foster & P. Isabelle (1992), Using cognates to align sentences in bilingual corpora. In Proceedings
  of TMI92, Montreal, Canada, pp. 67-81.
Teubert, W. 1996. Comparable or Parallel Corpora? International Journal of Lexicography Vol. 9, Number 3, pp.
Teubert, W. 1996. Why corpus linguistics? International Journal of Corpus Linguistics, 1(1).
Teubert, W. 2003. Parallel Corpora and Language Learning, 12th International Symposium on English Teaching,
Wang, Yi-Chia, Jian-Cheng Wu, Tyne Liang, Jason S. Chang. 2004. Using the Web as Corpus for Un-supervised
   Learning in Question Answering, to appear in ROCLING XVI: Conference on Computational Linguistics and
   Speech Processing, September 2-3, 2004, Howard Pacific Green Bay, Taipei, Taiwan, ROC.
Whitelock, P., & Edmonds, P. 2000. The Sharp intelligent dictionary. Proceedings of the 9th EURALEX, pp.
Wu, Chien-Cheng, and Jason S. Chang. Bilingual Collocation Extraction Based on Syntactic and Statistical
   Analyses, Computational Linguistics and Chinese Language Processing, Vol. 9, No. 1, 2004, pp. 1-20.
Wu, CJ and J.S. Chang. Alignment of Collocation via Syntactic and Statistical Analyses. Proceedings of the
   fifteenth Research on Computational Linguistics Conference, ROCLING XV, Hsinchu.
Wu, CJ, K. Yeh, T.C. Chuang, W.C. Shei and Jason S. Chang. 2003. ‘TotalRecall: A Bilingual Concordance for
   Computer Assisted Translation and Language Learning,’ In Proceedings of the 41st Annual Meeting of
   Association for Computational Linguistics, Comp. Volume, 201-204.
Wu, Dekai (1994), Aligning a parallel English-Chinese corpus statistically with lexical criteria. In The
  Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics, New Mexico, USA,
  pp. 80-87.
Wu, J.C., Thomas C. Chuang, Wen-Chi Shei and Jason S. Chang. “Subsentential Translation Memory for
   Computer Assisted Writing and Translation, " Proceedings of the 42nd Annual Meeting of Association for
   Computational Linguistics, Comp. Vol. 2004.
Zhang, X. (1993). English collocations and their effect on the writing of native and non-native college freshmen.
   Ph.D. thesis, Indiana University of Pennsylvania.
