LC-STAR_D3.0_v1.6

							LC-STAR Deliverable D3.0.


Project ref. no.                IST-2001-32216
Project acronym                 LC-STAR
Project full title              Lexica and Corpora for Speech-to-Speech Translation
                                Technologies

Security (distribution level)   Public
Contractual date of delivery    New Deliverable
Actual date of delivery
Deliverable number              D3.0
Deliverable name                Specifications of lexicon interchange format
Type                            Report
Status & version                Final +2 Version 1.6
Number of pages                 16
WP contributing to the          WP3
deliverable
WP / Task / Deliverable         WP3 – Task 3.1 – D3.0 UPC
responsible
Other contributors
Author(s)                       Asuncion Moreno (UPC)
EC Project Officer              Domenico Perrotta
Project Coordinator             Name: Harald Höge
                                Company:        Siemens AG, CT IC 5
                                Address:        Otto-Hahn-Ring 6, 81739 München, Germany
                                Phone: +49-89-636-53374
                                Fax: +49-89-636-49802
                                E-mail: harald.h.hoege@mchp.siemens.de
                                Project web site:       http://www.lc-star.com
Keywords                        Specifications, Formats
Abstract                        This document provides specifications for the formats of the
(for dissemination)             language resources generated in the LC-STAR project
                                concerning LR for Recognition and Synthesis.
IST-2001-32216 D3.0                                                                           1



Document evolution:

 Version   Date            Security           Notes
 v1.0      Aug 28, 2003    Project internal   First draft version: work document, to be discussed
                                              by all partners
 v1.1      Sep 19, 2003    Internal           Pre-final.
 v1.2      Dec 11, 2003    Internal           Pre-final. Includes decisions from the Helsinki
                                              meeting: language and country codes, README.TXT
                                              description, language-specific guidelines, lexicon
                                              split into several files, foreign words:
                                              phonetization, tagging…
 v1.3      Mar 12, 2004    Internal           Pre-final. After prevalidation A and B
 v1.4      Apr 1, 2004     Final              Few details in Table 1, tokenization, and foreign
                                              phonemes. Language-specific DTD is included
 v1.5      May 10, 2004    Final+1            CC for Arabic, abbreviations
 v1.6      Aug 31, 2004    Final +2           Files for exchange. After the Saint Petersburg
                                              meeting

1  Introduction
2  Storage media
3  Character Coding
4  Directory structure
5  Data files
 5.1     <database>\LISTS directory
   5.1.1 Common Words List WORDLIST.TXT
   5.1.2 Special Application Word List: SAP.TXT
   5.1.3 NGram list: NGRAM.TXT
 5.2     <database>\NAMES directory
 5.3     <database>\CLOSED directory
 5.4     <database>\LEXICON directory
   5.4.1 Lexicon Files LEXIC<nn>.XML
   5.4.2 DTD File: LEXICON.DTD
6  Documentation files
 6.1     Root Directory
   6.1.1 README.TXT
   6.1.2 COPYRIGH.TXT
 6.2     <database>\DOC directory
   6.2.1 DESIGN.DOC and DESIGN.PDF
   6.2.2 SAMPALEX.PS
   6.2.3 VALREP.DOC
7  References

1    Introduction
The purpose of this document is to provide specifications for the formats of the language
resources generated in the LC-STAR project concerning LR for Recognition and Synthesis.
That includes:

1. Format specifications for word lists for prevalidation A purposes
2. Format specifications for mini lexicon for prevalidation B purposes
3. Format specifications for Lexicon for Speech recognition and Synthesis

The document pays particular attention to the following aspects: media, database format and
documentation. It is closely related to LC-STAR project deliverables D1.1, D2.1,
D2.2 and D2.3.

This document is structured as follows: first, a general description of storage media, data files
and documentation files is given. Then, for each specific purpose (word list, mini lexicon and
full lexicon), the directory structure and the mandatory and optional files to be
included in the database are described.


2    Storage media
LC-STAR Language Resources will be stored on CD-ROMs. Discs will be written according to
the ISO 9660 Interchange Level 1 specifications.



3    Character Coding
For the wordlists, the ISO 8859-X character sets [http://czyborra.com/charsets/iso8859.html]
will be used for European languages, Arabic and Hebrew. For the Mandarin wordlists, GB2312
will be used for the characters and ISO-8859-1 for Pinyin and non-native words. The final lexica
will be delivered in UTF-16 character encoding [http://www.unicode.org/]. For each language, the
character set used should be documented.
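
The move from an ISO 8859 word list to a UTF-16 lexicon can be illustrated with a minimal sketch. The function name and the choice of ISO 8859-1 in the example are assumptions made for illustration only; the character set actually used is whatever is documented for the language.

```python
def transcode_to_utf16(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes in the documented source character set and re-encode as UTF-16.

    Python's "utf-16" codec prepends a byte order mark (BOM), which lets
    an XML parser detect the byte order of the file.
    """
    text = data.decode(source_encoding)
    return text.encode("utf-16")


# Example entry stored in ISO 8859-1 (an assumed choice for Spanish).
latin1_bytes = "Asunción\r\n".encode("iso8859_1")
utf16_bytes = transcode_to_utf16(latin1_bytes, "iso8859_1")
```

For Mandarin wordlists the same function could be called with "gb2312" as the source encoding, again assuming the data really is in that character set.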


4    Directory structure
The following directory structure and nomenclature have been chosen:

\COPYRIGH.TXT                             Copyright notice in the root directory
\README.TXT                               Readme file in the root directory
\<database>\DOC\
   DESIGN.{PDF|DOC}                       Documentation file
   SAMPALEX.PDF                           List of SAMPA symbols
   VALREP.{DOC|PDF}                       Validation report by SPEX
\<database>\LISTS\
   WORDLIST.TXT                           Wordlist file
   SAP.TXT                                Special Application Words list
   NGRAM.TXT                              NGram file
\<database>\NAMES                         Names directory
     PERSONS.TXT                          Includes first and last person names
     PLACES.TXT                           Includes place names
     ORGNAMES.TXT                         Includes organizations, companies and brand names
     NAMES.TXT                            Includes a merge of the above three files
\<database>\CLOSED                        Closed sets directory
     <FILENAME1>.TXT                      Closed set 1 (t.b.s.)
     <FILENAME2>.TXT                      Closed set 2 (t.b.s.)
         ……                                 ………..
\<database>\LEXICON
  LEXIC<nn>.XML                           Lexicon files nn
  LEXICON.DTD                             DTD File

Table 1. Directory structure

Where:

<database>        Defined as: <name><-><LL><CC>.
                  Where:
                  <name> is LC for LC-STAR databases
                  <LL> is the 2-letter ISO 639 language code
                  <CC> is the 2-letter ISO 3166 code of the country where the language is spoken


Language code is ISO 639 available at http://ftp.ics.uci.edu/pub/ietf/http/related/iso639.txt
Country code is ISO 3166 available at http://object-net.com/object-net/countrycodes.html

The following list shows some examples of database names

                              Lexicon            LL   CC     Database
                              Catalan            ca   es     LC-caes
                              Finnish            fi   fi     LC-fifi
                              German             de   de     LC-dede
                              Greek              el   gr     LC-elgr
                              Hebrew             he   il     LC-heil
                              Italian            it   it     LC-itit
                              Mandarin Chinese   zh   cn     LC-zhcn
                              Spanish            es   es     LC-eses
                              Standard Arabic    ar   --     LC---
                              Russian            ru   ru     LC-ruru
                              Turkish            tr   tr     LC-trtr
                              US-English         en   us     LC-enus
                              Slovenian          sl   si     LC-slsi

                              Table 2. List of language/country codes
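
The naming rule can be sketched as a small helper. This is only an illustration: the function names are hypothetical, the check covers the shape of the name rather than membership in the ISO 639 and ISO 3166 lists, and the Standard Arabic row, which has no country code, is not handled.

```python
import re

# Shape of <name>-<LL><CC> with <name> fixed to "LC"; lowercase codes follow Table 2.
DATABASE_NAME = re.compile(r"LC-(?P<language>[a-z]{2})(?P<country>[a-z]{2})")


def make_database_name(language: str, country: str) -> str:
    """Build a database name such as LC-caes from two 2-letter codes."""
    name = f"LC-{language.lower()}{country.lower()}"
    if DATABASE_NAME.fullmatch(name) is None:
        raise ValueError(f"not a valid LC-STAR database name: {name!r}")
    return name


def split_database_name(name: str) -> tuple[str, str]:
    """Return (language, country) for a name such as LC-zhcn."""
    match = DATABASE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"not a valid LC-STAR database name: {name!r}")
    return match["language"], match["country"]
```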


The directory structure shown in Table 1 is common to the final lexicon and the prevalidation
steps. The mandatory files to be included in each step are specified in Table 3.


Directory            Filename           Word list       Minilexicon     Lexicon final             Exchange
                                        Prevalidation   Prevalidation
                                        A               B
(root)               COPYRIGH.TXT       Mandatory       Mandatory       Mandatory                 Mandatory
                     README.TXT         Mandatory       Mandatory       Mandatory                 Mandatory
\<database>\DOC      DESIGN.{PDF|DOC}   Mandatory       Mandatory       Mandatory                 Mandatory (Lexicon)
                     SAMPALEX.PDF       -               Optional        Optional                  Optional
                     VALREP.DOC         -               -               Optional (final release)  Mandatory
\<database>\LISTS    WORDLIST.TXT       Mandatory                                                 Mandatory
                     SAP.TXT            Mandatory       -               -                         Mandatory
                     NGRAM.TXT                                                                    Mandatory
\<database>\NAMES    PERSONS.TXT        Mandatory       -               -
                     PLACES.TXT         Mandatory       -               -
                     ORGNAMES.TXT       Mandatory       -               -
                     NAMES.TXT          Mandatory       -               -
\<database>\CLOSED   <FILENAME1>.TXT    Mandatory       -               -
                     <FILENAME2>.TXT    Mandatory       -               -
\<database>\LEXICON  LEXIC<nn>.XML      -               Mandatory       Mandatory                 Mandatory
                     LEXICON.DTD        -               Mandatory       Mandatory                 Mandatory

                      Table 3. Summary of files to be included in each step

Specific details of content and formatting for each file are described below.



5     Data files
This section describes the format of the files containing data.

\<database>\LISTS
   WORDLIST.TXT                          Wordlist file
   SAP.TXT                               Special Application Words list
   NGRAM.TXT                             NGram file
 \<database>\NAMES                       Names directory
      PERSONS.TXT                        Includes first and last person names
      PLACES.TXT                         Includes place names
      ORGNAMES.TXT                       Includes organizations, companies and brand names
      NAMES.TXT                          Includes a merge of the above three files
 \<database>\CLOSED                      Closed sets directory
     <FILENAME1>.TXT                     Closed set 1 (t.b.s.)
     <FILENAME2>.TXT                     Closed set 2 (t.b.s.)
          ……                               ………..
\<database>\LEXICON
  LEXIC<nn>.XML                          Set of lexicon files
         ……
  LEXICON.DTD                            DTD File



5.1    <database>\LISTS directory

5.1.1    Common Words List WORDLIST.TXT

The common word list provides the following information: the entry word; counts of the entry
word computed over all domains and counts in each of the six individual domains (C1 - C6) as
defined in D1.1 [1]. The count field should get the value 0 if an entry does not occur in a given
domain.

For a given language let us define

n_{C_j}(w_i) counts (or number of occurrences) of a given word w_i in the domain C_j, j = 1, …, 6
n(w_i) counts of a given word w_i in the overall corpora

The i-th entry of the common word list provides the following information:

w_i <tab> n(w_i) <tab> n_{C_1}(w_i) <tab> n_{C_2}(w_i) <tab> … n_{C_6}(w_i)

- there is no specific heading row;
- each row is delimited by the sequence <CR><LF> (ASCII 13 10);
- each field is delimited by "Hard Tab", briefly <HT>, in C-language '\t' (ASCII 9);
- the list is sorted according to descending values of the second column n(w_i).

Example:

Word     Overall      Sports        News        Finance      Culture       Consumer     Personal
                      Games                                  Entert.       Information  Comm.

de       2340295      226553        1272472     283436       277035        109855       170944
la       1647387      179589        909858      186869       186926        71754        112391
el       1451500      224297        773405      159046       152306        52379        90067
que      1294574      136388        671902      124241       132074        42641        187328
en       1137760      153250        592280      128426       139175        49514        75115
y        1015769      130420        493995      89707        149854        37894        113899
a        803137       101300        421936      76146        81813         26867        95075
los      741347       74841         397583      89048        70058         32575        77242
del      511193       65805         282252      63595        54823         18488        26230
se       496751       62434         252140      51808        53488         24975        51906

                               Figure 1. Example of WORDLIST.TXT
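
The row format described above can be checked with a short parser. This is a minimal sketch, not a reference validator: the class and function names are hypothetical, and only the rules stated in this section (eight tab-separated fields, CRLF row delimiter, descending overall count) are enforced.

```python
from typing import NamedTuple


class WordCount(NamedTuple):
    word: str
    overall: int
    per_domain: tuple[int, ...]  # counts for domains C1..C6


def parse_wordlist(text: str) -> list[WordCount]:
    """Parse WORDLIST.TXT content: CRLF-delimited rows, tab-delimited fields."""
    rows = []
    for line_no, line in enumerate(text.split("\r\n"), start=1):
        if not line:  # skip empty rows, e.g. the empty string after a trailing CRLF
            continue
        fields = line.split("\t")
        if len(fields) != 8:
            raise ValueError(f"row {line_no}: expected 8 fields, got {len(fields)}")
        counts = [int(field) for field in fields[1:]]
        rows.append(WordCount(fields[0], counts[0], tuple(counts[1:])))
    # The list must be sorted by descending overall count.
    for previous, current in zip(rows, rows[1:]):
        if current.overall > previous.overall:
            raise ValueError(f"{current.word!r} is out of order")
    return rows


sample = (
    "de\t2340295\t226553\t1272472\t283436\t277035\t109855\t170944\r\n"
    "la\t1647387\t179589\t909858\t186869\t186926\t71754\t112391\r\n"
)
entries = parse_wordlist(sample)
```

In the rows of Figure 1 the overall count happens to equal the sum of the six domain counts. The specification does not state this as a requirement, so the sketch does not enforce it.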

5.1.2    Special Application Word List: SAP.TXT

The special application word list contains two parts: (a) numbers, letters and abbreviations
extracted from the corpora, and (b) a specific vocabulary. For the specific vocabulary, a reference
list of US-English terms was defined and further subcategorized into semantic domains [1]. The
vocabulary represents lexical entries specific to the listed applications and was translated into
the target languages. For translation purposes, the basic word list was provided in US-English
with a short description of the scenarios and one or two typical examples of usage for the verbs.
For verbs, the infinitive and the inflectional forms from the examples are provided. For all other
categories, the nominative, singular, masculine forms or equivalent dictionary forms are
provided. Synonyms should be provided wherever appropriate. If a specific word does not exist
in a language, a synonymous term or phrase may be used.

Complex common word entries will be broken up in the final lexicon where reasonable (e.g.
change_password). However, those whose meaning changes when split should be kept
together (e.g. web_spider).
SAP.TXT contains the information in a table format. Each row of the table is associated with an
English term. Synonyms and inflectional forms are added on new lines and coded as explained
below. The table has the following fields:

ID:             Identifier of the semantic domain.

Nr.             Is a consecutive number that labels each English term in each semantic domain.
                Synonyms and inflectional verb forms are labelled by adding an additional
                alphabetical numbering to the Nr.

English term: Is the basis of the translation.

POS:            Part of speech of the English term. Used only to help the translation

Translation: Is the result of the translation in the target language. Synonyms and inflectional
              forms are added in new lines.

Examples:       For each verb two examples of typical usage are given in the example column.

Comments:       Comment field. If a specific word does not exist at all in a language, the value is
                “NE”. For synonyms the value is “Synonym”


- the heading row contains the names of the labels;
- each row is delimited by the sequence <CR><LF> (ASCII 13 10);
- each field is delimited by "Hard Tab", briefly <HT>, in C-language '\t' (ASCII 9);
- the list is sorted according to increasing values of ID and Nr.;
- examples are written in a single column separated by spaces.

ID                Nr.   English term   POS       Translation        Examples            Comments
1.1.1.            1     meter          NOU       Meter
1.1.1.            2     mile           NOU       Meile
1.1.2.            1     pound          NOU       Pfund
1.4.              60    to_stop        VER       beenden            Stop the program.
                                                                    Program stopped.
1.4.              60a   stop           VER       beende
1.4.              60b   stopped        VER       beendet
1.4.              60c                  VER       stoppen
1.4.              60d                  VER       stoppe
1.4.              60e                  VER       gestoppt
4.1.4.            14    boss           NOU       Chef
4.1.4.            14a                  NOU       Vorgesetzter                           Synonym
6.2.1.            1     text_only      NP        nur_Text

Figure 2. Example of SAP.TXT
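
A reader for this table could look like the sketch below. Several points are assumptions rather than part of the specification: the exact header strings are taken from Figure 2, empty cells are assumed to be empty strings between tabs, and the helper names are hypothetical.

```python
import re

# Header labels as they appear in Figure 2 (assumed to be the literal heading row).
SAP_FIELDS = ["ID", "Nr.", "English term", "POS", "Translation", "Examples", "Comments"]


def parse_sap(text: str) -> list[dict[str, str]]:
    """Parse SAP.TXT content into one dictionary per row, keyed by header label."""
    lines = [line for line in text.split("\r\n") if line]
    if lines[0].split("\t") != SAP_FIELDS:
        raise ValueError("unexpected heading row")
    rows = []
    for line_no, line in enumerate(lines[1:], start=2):
        fields = line.split("\t")
        if len(fields) != len(SAP_FIELDS):
            raise ValueError(f"row {line_no}: expected {len(SAP_FIELDS)} fields")
        rows.append(dict(zip(SAP_FIELDS, fields)))
    return rows


def split_nr(nr: str) -> tuple[int, str]:
    """Split a label such as '60a' into (60, 'a'); plain numbers get an empty suffix."""
    match = re.fullmatch(r"(\d+)([a-z]*)", nr)
    if match is None:
        raise ValueError(f"unexpected Nr. value: {nr!r}")
    return int(match[1]), match[2]


sample = (
    "ID\tNr.\tEnglish term\tPOS\tTranslation\tExamples\tComments\r\n"
    "4.1.4.\t14\tboss\tNOU\tChef\t\t\r\n"
    "4.1.4.\t14a\t\tNOU\tVorgesetzter\t\tSynonym\r\n"
)
rows = parse_sap(sample)
```

Splitting the Nr. field this way gives a numeric part and a letter suffix, which is one way to sort synonyms and inflectional forms directly after the term they belong to.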

5.1.3     NGram list: NGRAM.TXT

To be discussed, finished and approved

Unigram, bigram and trigram information is provided for each word in the WORDLIST.TXT file.
The format of the NGram file will follow the specifications at
http://www.w3.org/TR/ngram-spec/



5.2     <database>\NAMES directory

The proper names task [1] is divided into three main domains: First and Last Names, Place
Names and Organizations. The proper names are included in the following files:

       PERSONS.TXT                -Includes first and last person names
       PLACES.TXT                 -Includes place names
       ORGNAMES.TXT               -Includes organizations, companies and brand names
       NAMES.TXT                  -Includes a merge of the above files

Format:

- List of lexical entries. Each entry in a different row
- There is no specific heading row
- Each row is delimited by the sequence <CR><LF> (ASCII 13 10);
- The list is alphabetically sorted.

EXAMPLE

Antonio
Asunción
Nuria_Castell
Santiago_Segura

Figure 3. PERSONS.TXT file
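
The list format above is simple enough to check with a few lines. The sketch below is an illustration with hypothetical function names; in particular, the specification does not define the collation behind "alphabetically sorted", so the default sort key used here (code-point order of case-folded text) is an assumption that a language-specific key may need to replace.

```python
from typing import Callable


def read_name_list(text: str) -> list[str]:
    """Return the entries of a CRLF-delimited list file, one entry per row."""
    return [row for row in text.split("\r\n") if row]


def is_sorted(entries: list[str], key: Callable[[str], str] = str.casefold) -> bool:
    """Check that the entries are in non-decreasing order under the given sort key."""
    keys = [key(entry) for entry in entries]
    return all(a <= b for a, b in zip(keys, keys[1:]))


persons = read_name_list("Antonio\r\nAsunción\r\nNuria_Castell\r\nSantiago_Segura\r\n")
```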


5.3     <database>\CLOSED directory

The number and names of the closed sets are language dependent. Each set is included in a separate
file. Names have up to eight characters and the extension is .txt. The following list is a
non-exhaustive set of recommended names:
        ADPOSITI.TXT              - Adpositions
        ARTICLE.TXT               - Definite and Indefinite articles
        VER_AUXI.TXT              - Modal Verbs
        CON_COOR.TXT              - Coordinative Conjunctions
        CON_SUBO.TXT              - Subordinate Conjunctions
        DET_DEMO.TXT              - Demonstrative determiners
        DET_POSS.TXT              - Possessive determiners
        PRO_DEMO.TXT              - Demonstrative pronouns
        PRO_EXCL.TXT              - Exclamative pronouns
        PRO_INTE.TXT              - Interrogative pronouns
        PRO_PERS.TXT              - Personal pronouns
        PRO_POSS.TXT              - Possessive pronouns
        PRO_RELA.TXT              - Relative pronouns

The filenames have to be documented.
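
The naming constraint for closed-set files (up to eight characters plus extension) can be sketched as a simple check. The allowed character set used here (uppercase letters, digits and underscore) and the uppercase extension are assumptions based on ISO 9660 Interchange Level 1 and on the recommended names listed above; the function name is hypothetical.

```python
import re

# Up to eight characters before the extension. The character set and the
# uppercase ".TXT" are assumptions (ISO 9660 level 1 style names).
CLOSED_SET_NAME = re.compile(r"[A-Z0-9_]{1,8}\.TXT")


def is_valid_closed_set_name(filename: str) -> bool:
    """Return True if the file name fits the eight-character naming rule."""
    return CLOSED_SET_NAME.fullmatch(filename) is not None
```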

Format:

- List of lexical entries. Each entry in a different row
- There is no specific heading row
- Each row is delimited by the sequence <CR><LF> (ASCII 13 10);
- The list is alphabetically sorted.

EXAMPLE

a
ante
bajo
cabe
con

Figure 4. <CLOSED>.TXT file



5.4     <database>\LEXICON directory

5.4.1     Lexicon Files LEXIC<nn>.XML

An XML-based mark-up language was chosen to represent the linguistic information in a
formal, unambiguous and easy-to-read manner. Moreover, the information can be processed by
as many parties as possible. Any parser compliant with XML version 1.0 can be used to parse
the lexica. The parser must be able to deal with UTF-16.

As defined in [2], lexica consist of a set of entry group elements.
   - An entry group refers to a generic entry in a vocabulary. For each entry group, it is
     mandatory to specify:
        - orthography;
        - zero or more alternative spelling elements;
        - one or more entry, compound entry or abbreviation elements.
   - An entry refers to one specific grammatical/morphological meaning of a vocabulary
     entry. For each entry, it is mandatory to specify:
        - One POS, together with its attributes. In case of multiple POS, or in case of
          multiple attributes of the same POS, multiple entries have to be specified within
          the same entry group. A description of all mandatory features and values for a
          given language will be provided in DESIGN.DOC.
        - One lemma. It contains string data representing the entry lemma. In case of
          multiple lemmas, multiple entries have to be specified within the same entry
          group.
        - One phonetic transcription. It contains string data representing the entry
          phonetic transcription and syllabification. In case of multiple phonetic
          transcriptions, multiple entries have to be specified within the same entry
          group.
        - For application words, one APP tag has to be specified. The structure of the APP
          tag is as follows:
               Subdomain_type1 No_of_entry1
               …
               Subdomain_typeN No_of_entryN
   - Compound entries will have the following structure:
        - phonetic transcription;
        - two or more entry elements (a subset of an entry), which are links to other
          entries. Each entry element must be characterized by an orthography and must
          contain one POS tag, together with all of its attributes.
   - Abbreviations from application wordlists will be tagged using the ABB tag. Each
     abbreviation must contain one or more EXP tags, and each EXP tag contains:
        - string data representing the actual expansion (optional);
        - one entry or compound entry element (mandatory).
   - Each attribute has the default value NS (= Not Specified), which is always optional in
     the DTD (IMPLIED). NS should be used only if the attribute is not mandatory in a
     given language.

Comments can also be inserted in any part of the lexica using the XML comment syntax
<!-- insert your comment here -->.
No.  Spelling        POS                                                                 Lemma      Phonetisation +
                                                                                                    syllabification
1    capitano (It)   NOM. Class: common. Number: singular. Gender: masculine.            capitano   k a - p i - " t a - n o
                     VER. Number: plural. Person: 3. Tense: present. Mood: indicative.   capitare   " k a - p i - t a - n o
                     Voice: active.
2    dai (It)        VER. Number: singular. Person: 2. Tense: present. Mood:             dare       " d a - r e
                     imperative. Voice: active.
3    الحمراء (Ar)    NOM. Class: common. Number: singular. Gender: feminine.             حمراء      not available
                     ADJ. Number: singular. Gender: feminine. Case: genitive. Degree:    حمراء      not available
                     positive.

Table 4. Logical structure for some entries in a Lexicon.

<?xml version="1.0" encoding="UTF-16"?>
<!DOCTYPE LEXICA SYSTEM "NewLexica7.dtd" >
<LEXICA xml:lang="IT">
      <ENTRYGROUP orthography="capitano">
            <ENTRY>
                  <NOM class="common" gender="masculine"
                       number="singular" />
                  <LEMMA>capitano</LEMMA>
                  <PHONETIC>k a - p i - " t a - n o</PHONETIC>
            </ENTRY>
            <ENTRY>
                  <VER tense="present" number="plural" person="3"
                       mood="indicative" voice="active" />
                  <LEMMA>capitare</LEMMA>
                  <PHONETIC>" k a - p i - t a - n o</PHONETIC>
            </ENTRY>
      </ENTRYGROUP>
      <ENTRYGROUP orthography="الحمراء" xml:lang="AR">
            <ENTRY>
                  <NOM class="common" gender="feminine"
                       number="singular" />
                  <LEMMA>حمراء</LEMMA>
                  <PHONETIC>not available</PHONETIC>
            </ENTRY>
            <ENTRY>
                  <ADJ case="genitive" degree="positive"
                       gender="feminine" number="singular" />
                  <LEMMA>حمراء</LEMMA>
                  <PHONETIC>not available</PHONETIC>
            </ENTRY>
      </ENTRYGROUP>
</LEXICA>
Table 5. XML-based coding of the examples listed in Table 4
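
To show how such a file can be consumed, the following minimal sketch reads the (POS, lemma, phonetic transcription) triples of each entry with Python's standard xml.etree.ElementTree module. The function name is hypothetical, and the code assumes that the POS element is the only child of ENTRY other than LEMMA and PHONETIC; entries carrying an APP tag, compound entries and abbreviations are not handled.

```python
import xml.etree.ElementTree as ET

SAMPLE = """\
<LEXICA xml:lang="IT">
  <ENTRYGROUP orthography="capitano">
    <ENTRY>
      <NOM class="common" gender="masculine" number="singular" />
      <LEMMA>capitano</LEMMA>
      <PHONETIC>k a - p i - " t a - n o</PHONETIC>
    </ENTRY>
    <ENTRY>
      <VER tense="present" number="plural" person="3" mood="indicative" voice="active" />
      <LEMMA>capitare</LEMMA>
      <PHONETIC>" k a - p i - t a - n o</PHONETIC>
    </ENTRY>
  </ENTRYGROUP>
</LEXICA>
"""


def read_triples(xml_text):
    """Yield (orthography, POS tag, POS attributes, lemma, phonetic) per ENTRY."""
    root = ET.fromstring(xml_text)
    for group in root.iter("ENTRYGROUP"):
        for entry in group.findall("ENTRY"):
            # Assumption: the POS element is the child that is neither LEMMA nor PHONETIC.
            pos = next(child for child in entry if child.tag not in ("LEMMA", "PHONETIC"))
            yield (
                group.get("orthography"),
                pos.tag,
                dict(pos.attrib),
                entry.findtext("LEMMA"),
                entry.findtext("PHONETIC"),
            )


triples = list(read_triples(SAMPLE))
```

For a lexicon file on disk, ET.parse can be given the file name instead; the parser then takes the encoding from the XML declaration and byte order mark.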

Packaging: The lexicon has to be split into two parts: proper names and common nouns. These
parts should be split further into a set of smaller and more manageable files. Splitting is
language dependent and must be done on an alphabetical basis.

Filenames are
LEXIC<nn>.XML
where <nn> is a two-digit number from 00 to 99.

Splitting criteria, filenames and the mapping between filenames and content should be
documented.
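
The file naming scheme and the documented mapping can be sketched as follows. The function names and the three-way split in the example are hypothetical; the real splitting criteria are language dependent and chosen by each partner.

```python
def lexicon_filename(index: int) -> str:
    """Return LEXIC<nn>.XML for an index between 0 and 99."""
    if not 0 <= index <= 99:
        raise ValueError("index must be between 0 and 99")
    return f"LEXIC{index:02d}.XML"


def build_file_map(parts: list[str]) -> dict[str, str]:
    """Map each file name to a description of its content, in the given order."""
    return {lexicon_filename(i): part for i, part in enumerate(parts)}


# Hypothetical split used only for illustration.
file_map = build_file_map(["common nouns A-M", "common nouns N-Z", "proper names A-Z"])
```

A mapping of this kind is the sort of information that the documentation is expected to record.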

5.4.2     DTD File: LEXICON.DTD

A formally specified grammar (Document Type Definition or DTD) containing all the
linguistic information described so far makes it possible to validate the XML-based lexica
automatically. The LEXICON.DTD file contains the DTD implementing the linguistic information
described in [2]. All lexica generated in LC-STAR use a common DTD, which is available on the
project web page. For validation purposes, according to the language-specific guidelines, each
partner should provide in the documentation (DESIGN.DOC) a language-specific DTD to
allow validation of mandatory attributes.

It should be noted [2] that:
- Each lexicon and entry group has an optional attribute specifying the language of the content
  and the attribute values of its sub-elements: we chose the standard XML attribute xml:lang, whose
  possible values are defined by [IETF RFC 1766], Tags for the Identification of Languages,
  or its successor on the IETF Standards Track.

- For the sake of simplicity, each entry is associated with as many triples (POS, Lemma, Phonetic
  Transcription) as it can belong to, thereby allowing for repetitions in a triple.

- The character range supported by the XML Standard
  [http://www.w3.org/TR/2000/REC-xml-20001006] is any Unicode character, excluding the
  surrogate blocks, FFFE, and FFFF. Moreover, all XML processors must accept the UTF-16 encoding.

- However, special characters must be escaped by pre-defined strings when they are
  contained either in elements of type PCDATA (i.e. Parsed Character Data) or in attribute
  values of type CDATA (i.e. Character Data). The following table illustrates the situation:


    Special Character         Escaping String           Must be escaped in        Must be escaped in
                                                        PCDATA Elements           CDATA Attributes
    &                         &amp;                     Yes                       Yes
    <                         &lt;                      Yes                       Yes
    >                         &gt;                      Yes (1)                   Yes (1)
    ]]>                       NOT AVAILABLE             No                        Yes
    '                         &apos;                    No                        No
    "                         &quot;                    No                        Yes

Table 6. Special characters that have to be escaped in PCDATA elements content and/or
in CDATA attribute values.
Notes:
(1) According to the XML Standard, "the right angle bracket (>) may be represented using the string
"&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it
appears in the string "]]>" in content, when that string is not marking the end of a CDATA section."
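
When lexica are generated by a program, the escaping above does not need to be done by hand. The sketch below uses the escape function from Python's standard xml.sax.saxutils module; the sample strings are invented for illustration and do not come from any LC-STAR lexicon.

```python
from xml.sax.saxutils import escape

# Element content (PCDATA): escape() replaces &, < and > by default.
lemma = "R&D <dept>"
element = f"<LEMMA>{escape(lemma)}</LEMMA>"

# Attribute value delimited by double quotes: additionally escape the quote character.
orthography = 'rock "n" roll'
escaped_value = escape(orthography, {'"': "&quot;"})
attribute = f'orthography="{escaped_value}"'
```

Writing the file through an XML library instead of string formatting would apply the same escaping automatically.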


6    Documentation files
A number of documentation files containing the overall database description will be provided on
the CD-ROM. They can be in ASCII (.TXT), MS Word (.DOC), PostScript (.PS), Portable Document
Format (.PDF) or Hypertext Markup Language (.HTM) format, and are stored in the root. All files
are mandatory unless they are explicitly marked as optional.

     README.TXT
     COPYRIGH.TXT
     DESIGN.{DOC|PDF}
     SAMPALEX.PS
     VALREP.DOC (supplied by the Validation Centre – only in the final release of a database)

6.1     Root Directory

6.1.1     README.TXT

README.TXT is an ASCII text file that must list and describe all the files present in the
database. The file includes most of the introduction of the DESIGN.DOC file, i.e.:
        Mention wordlist/lexicon collection within the framework of LC-STAR
        Collectors and owners of wordlist/lexicon
        Short description: types of entries, number of words, number of entries, information
        contained in the lexicon
        Brief summary of contents of disk
        Layout of the media file system
        File nomenclature


6.1.2     COPYRIGH.TXT

COPYRIGH.TXT is an ASCII text file with some short sentences, e.g. “This material is copyright.
The copyright belongs to … Date …”


6.2     <database>\DOC directory

6.2.1     DESIGN.DOC and DESIGN.PDF

DESIGN.DOC or DESIGN.PDF is the documentation file. It is written in English and should
contain the relevant part of the specifications, expanded to include more detail.

For Prevalidation Part A, it should include:

     Introduction
           Mention wordlist/lexicon collection within the framework of LC-STAR
           Collectors and owners of wordlist/lexicon
           Short description: Number of words, number of entries
           Brief summary of contents of disk
           Layout of the media file system; reference to README.TXT file
     Common word entries
           Describe the corpus domains and subdomains.
           Overview of corpora used per domain, including year of publication
           Description of procedure used to extract words from corpora
           Description of tokenization procedure.
           Provide corpus size (indicate at which step of the tokenization-cleaning procedure)
           Provide the corpus size after cleaning (without numbers, names, abbreviations,
      singletons…)
           Provide corpus size after cleaning for each domain.
        Provide word list size and coverage.
        Describe media used for the corpus collection
   Closed sets
        Define closed sets included.
        Criteria for the language-specific definition and contents of closed sets
        Filename of each set
   Proper name task
        Describe the corpus size
        Describe the corpus domains and subdomains
       Describe name tags and principles used for name tagging (what is considered as TOU,
       GEO, …)
        Define Name, e.g. New_York = one entry; rose Rose…
        Provide corpus size for each domain
        Describe media used for the corpus collection
        Describe collection strategies
   Special application domain
        Purpose
        Describe the corpus domains and subdomains
        Provide corpus size for each domain in the table
        Describe media used for the corpus collection
   Formats
         The character coding format
        Character set
       Common Word List
       Special Application Word List

A template named LC_wordlist_template.doc can be found on the LC-STAR web page.

For Prevalidation Part B and the Final lexicon, DESIGN.DOC or DESIGN.PDF should
include:

   Introduction
        Mention wordlist/lexicon collection within the framework of LC-STAR
        Collectors and owners of wordlist/lexicon
        Contact person: name, address, affiliation
        Short description
        Contents of the CD
        Layout of the CD-ROM
        File Nomenclature and formats
        Contents of the lexicon
         Types of entries: # of common Words, # names, # special application words
         Information contained in the lexicon (fields)
   Common word entries
         Describe the corpus domains and subdomains
         Provide corpus size for each domain and subdomain (words and lemmas)
         Overview of corpora used per domain, including year of publication
         Description of procedure used to extract words from corpora
         Description of Tokenization procedure.
        Number of entries before cleaning overall and per domain
        Number of entries after cleaning, overall and per domain.
        Mention that closed sets are included
        Define closed sets and the language-specific criteria for closed sets definition and contents
   Proper names
        Describe the corpus domains and subdomains
        Provide name categories size for each domain and subdomain
       Describe name tags and principles used for name tagging (what is considered as TOU,
       GEO, …)
        Describe collection strategies
        Define Name, e.g. New_York = one entry; rose Rose…
        Indicate whether or not capitalization is used to differentiate proper names from common
        words
   Special application words
         Purpose
         Describe application domains and subdomains
         Describe strategy employed to obtain the words
         Description of synonym selection
   Format of the lexicon
         Description of XML format used: (Entry_group, Entry, Entry_el…)
        Description of fields
        Splitting criteria
         Description of DTD file. Mention LEXICON.DTD file
       Examples
     Morphological and syntactic information
         POS set used, plus attributes and values
         Description of multiple tagging possibilities
       Language specific guidelines
                 Principles of POS tagging and their attributes
                 Any systematic underspecification of POS attributes
                 Morphological boundaries
                 Principles of multiple tags
                 Choice of lemma
                 Instructions to the coder
                 Language Specific DTD
   Orthographical information
         Character set used
         Orthographical conventions used
        Language specific orthography guidelines
                 Treatment of multi-token entries
                  Treatment of numeric entries
                  Treatment of acronym spellings
                 Treatment of abbreviations
                  Treatment of multiple spellings
   Phonemic transcriptions
         Procedures used to obtain phonemic forms from orthographic input
         SAMPA set used; state that blanks separate individual phoneme symbols in the
        transcriptions
         Use of syllable markers, stress markers (and tone markers or morphological markers, if
        provided)
         Assimilation processes accounted for in the transcriptions
         Treatment of multiple pronunciations
         Treatment of foreign words: phonetization, tagging…
       Language specific phonetic transcription guidelines
                Stress markers and syllable boundary markers
                Principles of assigning stress
                Treatment of stress in multi-word entries
                 List of usually unstressed words (e.g. Turkish)
                 Tone markers
                 Morphological markers
                 Extra SAMPA symbols to cope with foreign languages
                 Mapping foreign phonemes to language specific phonemes
                 Phonetic transcription in names (nativization of foreign names, etc)
                 Dialect considered as 'canonical' (e.g. Arabic)
                 Syllabification conventions (e.g. geminates)
        Quality control
         Procedures for uniformity in spelling, transcriptions and tags
         Double check procedures
        UTF-16 character set
        Tokenisation procedures

A template named LC_lexicon_template.doc can be found on the LC-STAR project web page to
produce the document for Prevalidation Part B and the Final lexicon.


6.2.2    SAMPALEX.PS

SAMPALEX.PS contains the SAMPA table used to transcribe the lexicon. The table should include
the SAMPA phonemes for the given language, SAMPA phonemes from other languages (if used) and a
table for mapping foreign phonemes to the language specific phonemes.
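
As an illustration of how such a mapping table could be applied to blank-separated transcriptions, here is a minimal Python sketch. The symbol inventory and the foreign-to-native pairs below are invented for the example and do not come from any LC-STAR SAMPA table; stress, syllable and tone markers are ignored for brevity.

```python
from typing import Dict, List, Set

# Hypothetical data: the real tables are the ones documented in SAMPALEX.PS.
NATIVE_SYMBOLS: Set[str] = {"a", "i", "o", "s", "d", "t"}
FOREIGN_TO_NATIVE: Dict[str, str] = {"T": "s", "D": "d"}


def split_transcription(transcription: str) -> List[str]:
    """Split a transcription into phoneme symbols (blanks are the separators)."""
    return transcription.split()


def nativize(transcription: str) -> str:
    """Replace foreign phonemes by their language-specific counterparts."""
    result = []
    for symbol in split_transcription(transcription):
        mapped = FOREIGN_TO_NATIVE.get(symbol, symbol)
        if mapped not in NATIVE_SYMBOLS:
            raise ValueError(f"symbol not in the SAMPA table: {symbol!r}")
        result.append(mapped)
    return " ".join(result)
```

A symbol that is neither native nor listed in the mapping raises an error, which is one way to detect transcriptions that use symbols missing from the documented table.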


6.2.3    VALREP.DOC

VALREP.DOC is the validation report documentation file. It must be included in the final release
of each database, before the database is distributed to the project's partners. The document will
be produced by the project's Validation Centre during the validation phase, according to the
specifications.
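
The documentation layout described in this section can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the function name missing_documentation and the database directory name passed to it are hypothetical, README.TXT and COPYRIGH.TXT are expected in the root (Section 6.1), and the remaining files in the DOC directory of the database (Section 6.2).

```python
from pathlib import Path
from typing import List

ROOT_FILES = ["README.TXT", "COPYRIGH.TXT"]


def missing_documentation(root: Path, database: str,
                          final_release: bool = False) -> List[str]:
    """Return the mandatory documentation files that are not present."""
    doc_dir = root / database / "DOC"
    missing = [name for name in ROOT_FILES if not (root / name).is_file()]
    # Either the Word or the PDF version of the design document is accepted.
    if not any((doc_dir / name).is_file()
               for name in ("DESIGN.DOC", "DESIGN.PDF")):
        missing.append("DESIGN.DOC or DESIGN.PDF")
    if not (doc_dir / "SAMPALEX.PS").is_file():
        missing.append("SAMPALEX.PS")
    # The validation report is only required in the final release.
    if final_release and not (doc_dir / "VALREP.DOC").is_file():
        missing.append("VALREP.DOC")
    return missing
```

An empty list means that all mandatory documentation files were found; for a final release, call the function with final_release=True so that VALREP.DOC is checked as well.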



7    References
[1] Ziegenhain, U. et al. Specification of corpora and word lists in 12 languages. LC-STAR
    Project Deliverable D1.1.

[2] Maltese, G., Montecchio, C. et al. General and language specific specification of contents of
    lexica. LC-STAR Project Deliverables D2. 2003.

						