									                                              INFuture 2007
                                              The Future of Information Sciences
                                              7-9 November 2007
                                                Zagreb, Croatia



                                           Digital Information and Heritage



Sentence Alignment as the Basis
For Translation Memory Database
                        Sanja Seljan
Faculty of Humanities and Social Sciences – University of Zagreb
              Department of Information Sciences
                      sanja.seljan@ffzg.hr

                      Angelina Gašpar
                       SOA Centre Split
                     ginasplit@yahoo.com

                       Damir Pavuna
                         Integra d.o.o.
                   damir.pavuna@integra.hr

Overview

I     Introduction
II    When to use TMs?
          Text preparation
III   Corpus used
          Text characteristics
IV    Research
          Tools used
          Automatic and manual alignment
          Comparison of TMs
          Results
V     Conclusion

Sentence alignment (SA)

•   basis for computer-assisted translation (CAT)
•   terminology management
•   term extraction
•   word alignment
•   cross-linguistic information retrieval

Sentence alignment (SA) -> translation memory (TM):
basis for further research in translation equivalencies

Problems in automatic SA:
• robustness
• discrepancies in layout and omissions
• -> these influence alignment accuracy and the resulting TM

Research:
• SA on Cro-Eng parallel texts (laws,
  regulations, acts, decisions)
• alignment tool WinAlign 7.5.0 by
  SDL Trados 2006 Professional




Aim:
• to assess the impact of the SA process on the creation of a TM
• to compare 3 types of TMs, which differ:
  – in the level of expert intervention in the setup of
    the alignment program
  – in the preparation of the source text for
    segmentation

II When to use TMs?

• Fast and consistent translation (e.g. EU,
  multinational agencies)
• Voluminous texts
• Highly repetitive types of texts
• Use of specialized and consistent terminology
• Several languages

•   Sharing of common resources (cooperation)
•   Time-saving (Speed up the translation process)
•   Cost-saving
•   Consistent translation




Creation of TM:
• directly through translation
• use of already translated material (alignment process)

III Corpus used

• 9 parallel legislative Croatian-English texts or
  bitexts related to: acts, laws, regulations, decisions
  and ordinances;

• For the sake of uniformity: standard presentation and
  standard formulas;

• 33.15% - percentage ratio for word count in English
  translations;

•   Reasons:
    – English - an analytic type of language, use of
      passive voice,
    – Croatian - a highly inflected language, use of
      active voice;

•   Repetitive legal terms, phrases, sentences

•   Main components of a regulation: the title,
    preamble, enacting terms, addressee, place,
    date and signature.

• Enacting terms - strict rules of presentation:
  - subject matter and scope,
  - definitions,
  - provisions conferring implementing power,
  - penalties or legal remedies,
  - transitional and final provisions.

• Standard form prescribes the layout on the
  page: spacing, paragraphing, punctuation and
  even typographic characteristics (capitalisation,
  typeface, boldface and italics)

• Use of verbs in enacting terms
  - Binding Croatian legislation:
    - declarative terms (definitions, amendments)
    - imperative terms (commands, prohibitions)
  - English “shall” = Croatian present tense or modals
    (morati, trebati)
  - English “may” for prohibition, permission and
    authorisation = Croatian present tense (“ne može
    se”, “može se”).

• Bitext similarities:
  – punctuation, numbers, dates, foreign words;

• Differences:
  – capital letters, hyphens, compound words,
    synonyms (avoided in target language);

• Common points:
  – consistent terminology, a uniform manner,
    gender-neutral language;

IV Alignment research

• Texts:
  – translations of Croatian legislative acts (Cro->Eng);

• Tools:
  – AnyCount 4.0 (version 405) – for document
    structure analysis;
  – SDL Trados 2006 Professional (WinAlign
    7.5.0) – for the alignment process;


• PREPARATORY ACTIVITIES:
  – comparison of the source and target texts (checking whether
    all text has been translated)
  – definition of end and skip rules (delimiters,
    creation of an abbreviation user list)
  – preparation of the source text for better segmentation
    (spelling, automatic bullets and numbering, deletion of
    soft returns, hyphens, certain punctuation, tables
    created with tabs, and revision marks)
  – modification of the setup rules
  – verification of the alignment (especially 1:2 and 2:1
    pairs and commitment of pairs)
  – creation of the translation memory and its verification
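
The end rules, skip rules and abbreviation list mentioned above can be illustrated with a minimal Python sketch. It is not WinAlign's implementation: the function names, the sample abbreviation list, the choice of '.', '!' and '?' as end delimiters, and the treatment of a soft return as a vertical-tab character are all assumptions made only for illustration.

```python
import re

# Hypothetical abbreviation user list; a real project would build its own.
ABBREVIATIONS = {"art.", "no.", "para.", "e.g.", "i.e."}


def normalize_soft_returns(text):
    """Replace soft returns (assumed here to be vertical tabs) with spaces."""
    return re.sub(r"\s*\x0b\s*", " ", text)


def split_segments(text, abbreviations=ABBREVIATIONS):
    """Split after '.', '!' or '?' followed by whitespace (end rule),
    unless the token ending there is a listed abbreviation (skip rule)."""
    segments = []
    start = 0
    for match in re.finditer(r"[.!?](?=\s)", text):
        end = match.end()
        words = text[start:end].split()
        last_token = words[-1].lower() if words else ""
        if last_token in abbreviations:
            continue  # skip rule: do not break after an abbreviation
        segment = text[start:end].strip()
        if segment:
            segments.append(segment)
        start = end
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments


sample = ("Pursuant to Art. 5 of the Act, the Minister shall adopt rules. "
          "This Ordinance enters into force on the day of publication.")
for segment in split_segments(normalize_soft_returns(sample)):
    print(segment)
```

Without the abbreviation list, the sample would be cut after "Art.", producing a spurious extra segment on the source side and, later, a misaligned pair.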

• Automatic alignment

WinAlign has language-independent
algorithms that count:
  – translation units by quality, which can
    have three levels (low, medium, high)
  – translation units aligned as 1:2 or 2:1 pairs
  – unconnected target segments
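
To show how 1:1, 1:2 and 2:1 pairs and unconnected segments can arise from a language-independent procedure, here is a simplified length-based alignment by dynamic programming. It is a generic textbook-style technique swapped in for illustration, not WinAlign's algorithm, which the slides do not describe; the cost function and the penalty values are arbitrary assumptions.

```python
# Penalty per pairing type (assumed values): 1:1 is preferred,
# 1:2 and 2:1 are allowed, unconnected segments are expensive.
BEAD_PENALTY = {(1, 1): 0.0, (1, 2): 0.15, (2, 1): 0.15,
                (1, 0): 0.8, (0, 1): 0.8}


def length_cost(src_len, tgt_len):
    """Relative difference in character length, between 0 and 1."""
    return abs(src_len - tgt_len) / max(src_len + tgt_len, 1)


def align_by_length(src, tgt):
    """Return a list of (source indices, target indices) pairs
    that minimises the total length cost plus pairing penalties."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for (di, dj), penalty in BEAD_PENALTY.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                cost = best[i][j] + penalty + length_cost(src_len, tgt_len)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the pairs.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    pairs.reverse()
    return pairs


# Synthetic segments: the second source segment was translated as two.
source = ["a" * 10, "b" * 40, "c" * 20]
target = ["A" * 10, "B" * 20, "B" * 20, "C" * 20]
print(align_by_length(source, target))
```

Because such a procedure relies only on segment lengths, a segmentation error or an omission in one text shifts every later pair, which is why the 1:2 and 2:1 pairs in particular need manual verification.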

• Manual alignment
  – checking that each source segment corresponds to the
    translated target segment (Aligned TM)
  – setup of the alignment program (Aligned TM
    + setup rules, e.g. segment and skip rules,
    abbreviation user list)
  – segmentation of the source text (e.g.
    changes of soft returns, check of colon
    segmentation)

Match       Raw TM   Aligned TM   + Setup rules   ++ Segmented source
100%           121          106             112                   120
95%-99%          0            0               0                     0
85%-94%          2            5               0                     0
75%-84%          2            2               1                     0
50%-74%          1            1               2                     0
No match         6           18              11                     0
Total          132          132             126                   120
Percent     91.67%       80.30%          88.89%                  100%
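
The Percent row appears to be the share of 100% matches in each column's total. The short Python check below recomputes it from the counts in the table; the dictionary layout and function name are illustrative assumptions, while the numbers are copied from the table.

```python
BANDS = ["100%", "95%-99%", "85%-94%", "75%-84%", "50%-74%", "No match"]

# Counts per match band, in the order of BANDS, copied from the table.
COUNTS = {
    "Raw TM": [121, 0, 2, 2, 1, 6],
    "Aligned TM": [106, 0, 5, 2, 1, 18],
    "+ Setup rules": [112, 0, 0, 1, 2, 11],
    "++ Segmented source": [120, 0, 0, 0, 0, 0],
}


def exact_match_share(counts):
    """Percentage of 100% matches among all units in one column."""
    return round(100 * counts[0] / sum(counts), 2)


for name, counts in COUNTS.items():
    print(f"{name}: total {sum(counts)}, 100% matches {exact_match_share(counts)}%")
```

The totals (132, 132, 126, 120) and the percentages (91.67, 80.30, 88.89, 100) both match the table.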

• Conclusion
  – The translation memories created in this study
    through different types of alignment
    process give different results with regard to
    the quality of the translated material.

  – The results show that expert intervention is
    necessary when defining the setup
    rules, in preparatory activities for
    source text segmentation and in the
    verification of suggested translation units.