LC-STAR_D3.0_v1.6


LC-STAR Deliverable D3.0.
Project ref. no. IST-2001-32216
Project acronym LC-STAR
Project full title Lexica and Corpora for Speech-to-Speech Translation Technologies
Security (distribution level) Public
Contractual date of delivery New Deliverable
Actual date of delivery
Deliverable number D3.0
Deliverable name Specifications of lexicon interchange format
Type Report
Status & version Final +2 Version 1.6
Number of pages 16
WP contributing to the deliverable WP3
WP / Task / Deliverable responsible WP3 – Task 3.1 – D3.0 UPC
Other contributors
Author(s) Asuncion Moreno (UPC)
EC Project Officer Domenico Perrotta
Project Coordinator Name: Harald Höge
Company: Siemens AG, CT IC 5
Address: Otto-Hahn-Ring 6, 81739 München, Germany
Phone: +49-89-636-53374
Fax: +49-89-636-49802
E-mail: harald.h.hoege@mchp.siemens.de
Project web site: http://www.lc-star.com
Keywords Specifications, Formats
Abstract (for dissemination) This document provides specifications for the formats of the
language resources generated in the LC-STAR project concerning LR for Recognition and Synthesis.
IST-2001-32216 D3.0 1
Document evolution:
Version  Date            Security          Notes
v1.0     Aug 28, 2003    Project internal  First draft version, work document, to be discussed by all partners
v1.1     Sep 19, 2003    Internal          Pre-final.
v1.2     Dec 11, 2003    Internal          Pre-final. Includes decisions from Helsinki meeting: Language and country codes, README.TXT description, Language specific guidelines, Lexicon split in several files, Foreign words: phonetization, tagging…
v1.3     March 12, 2004  Internal          Pre-final. After prevalidation A and B
v1.4     April 1, 2004   Final             Few details in Table 1, tokenization, and foreign phonemes. Language specific DTD is included
v1.5     May 10, 2004    Final+1           CC for Arabic, Abbreviations
v1.6     Aug 31, 2004    Final +2          Files for Exchange. After Saint Petersburg meeting
1 Introduction
2 Storage media
3 Character Coding
4 Directory structure
5 Data files
5.1 <database>\LISTS directory
5.1.1 Common Words List WORDLIST.TXT
5.1.2 Special Application Word List: SAP.TXT
5.1.3 NGram list: NGRAM.TXT
5.2 <database>\NAMES directory
5.3 <database>\CLOSED directory
5.4 <database>\LEXICON directory
5.4.1 Lexicon Files LEXIC<nn>.XML
5.4.2 DTD File: LEXICON.DTD
6 Documentation files
6.1 Root Directory
6.1.1 README.TXT
6.1.2 COPYRIGH.TXT
6.2 <database>\DOC directory
6.2.1 DESIGN.DOC and DESIGN.PDF
6.2.2 SAMPALEX.PS
6.2.3 VALREP.DOC
7 References
1 Introduction
The purpose of this document is to provide specifications for the formats of the language
resources generated in the LC-STAR project concerning LR for Recognition and Synthesis.
That includes:
1. Format specifications for word lists for prevalidation A purposes
2. Format specifications for the mini lexicon for prevalidation B purposes
3. Format specifications for the lexicon for speech recognition and synthesis
The document takes into particular account the following aspects: media, database format and
documentation. It is closely related to LC-STAR project deliverables D1.1, D2.1, D2.2 and D2.3.
This document is structured as follows: first, storage media, data files and documentation files
are described in general. Then, for each specific purpose (word list, mini lexicon and full
lexicon), the directory structure and the mandatory and optional files to be included in the
database are listed.
2 Storage media
LC-STAR Language Resources will be stored on CD-ROMs. Disks will be printed according to
the ISO 9660 Interchange level 1 specifications.
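The 8.3 naming constraint implied by ISO 9660 Interchange Level 1 can be checked before a disk is mastered. The following Python sketch is illustrative only and not part of the specification: the function name is hypothetical, and it assumes the usual Level 1 rule for file names (up to eight characters, plus an extension of up to three, drawn from A-Z, 0-9 and underscore).

```python
import re

# Assumed Level 1 rule: 1-8 name characters and an optional 1-3 character
# extension, all taken from upper-case letters, digits and underscore.
_LEVEL1_NAME = re.compile(r"[A-Z0-9_]{1,8}(\.[A-Z0-9_]{1,3})?")


def is_level1_name(filename: str) -> bool:
    """Return True if the file name fits the assumed 8.3 pattern."""
    return _LEVEL1_NAME.fullmatch(filename) is not None
```

Under this assumption, COPYRIGH.TXT passes while the longer COPYRIGHT.TXT would not, which is consistent with the truncated file name used throughout this document.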
3 Character Coding
For the wordlists, ISO 8859-X [http://czyborra.com/charsets/iso8859.html] character sets will be
used for European languages, Arabic and Hebrew. For the Mandarin wordlists, GB2312 will be used
for the characters and ISO-8859-1 for Pinyin and non-native words. The final lexica will
be delivered in UTF-16 character encoding [http://www.unicode.org/]. For each language, the
character set used should be documented.
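Moving from a wordlist encoding to the UTF-16 encoding of the final lexica amounts to decoding with the documented source character set and re-encoding. A minimal Python sketch follows; the function name is hypothetical, and the choice of byte order and byte order mark is an assumption, since this document does not specify them.

```python
def transcode_to_utf16(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes in the documented source encoding and re-encode as UTF-16.

    Assumption: Python's generic "utf-16" codec is acceptable; it writes a
    byte order mark followed by the platform's native byte order.
    """
    text = data.decode(source_encoding)
    return text.encode("utf-16")
```

For example, a Spanish wordlist would be passed with source_encoding="iso-8859-1" and the characters of a Mandarin wordlist with source_encoding="gb2312".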
4 Directory structure
The following directory structure and nomenclature has been chosen:
\COPYRIGH.TXT Copyright notice in the root directory
\README.TXT Readme file in the root directory
\<database>\DOC\
DESIGN.{PDF|DOC} Documentation file
SAMPALEX.PDF List of SAMPA symbols
VALREP.{DOC|PDF} Validation report by SPEX
\<database>\LISTS\
WORDLIST.TXT Wordlist file
SAP.TXT Special Application Words list
NGRAM.TXT NGram file
\<database>\NAMES Names directory
PERSONS.TXT Includes first and last person names
PLACES.TXT Includes place names
ORGNAMES.TXT Includes organizations, companies and brand names
NAMES.TXT Includes a merge of the above three files
\<database>\CLOSED Closed sets directory
<FILENAME1>.TXT Closed set 1 (t.b.s.)
<FILENAME2>.TXT Closed set 2 (t.b.s.)
…… ………..
\<database>\LEXICON
LEXIC<nn>.XML Lexicon files nn
LEXICON.DTD DTD File
Table 1. Directory structure
Where:
<database> Defined as: <name><-><LL><CC>.
Where:
<name> is LC for LC-STAR databases
<LL> 2-letter for language code ISO 639
<CC> 2-letter for country code ISO 3166 where language is spoken
Language code is ISO 639 available at http://ftp.ics.uci.edu/pub/ietf/http/related/iso639.txt
Country code is ISO 3166 available at http://object-net.com/object-net/countrycodes.html
The following list shows some examples of database names
Lexicon LL CC Database
Catalan ca es LC-caes
Finnish fi fi LC-fifi
German de de LC-dede
Greek el gr LC-elgr
Hebrew he il LC-heil
Italian it it LC-itit
Mandarin Chinese zh cn LC-zhcn
Spanish es es LC-eses
Standard Arabic ar -- LC---
Russian ru ru LC-ruru
Turkish tr tr LC-trtr
US-English en us LC-enus
Slovenian sl si LC-slsi
Table 2. List of language/country codes
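The naming rule <name>-<LL><CC> can be expressed as a small helper. This Python sketch is only an illustration with a hypothetical function name; it assumes lower-case two-letter codes as used in Table 2 and does not cover the Standard Arabic row, which has no country code.

```python
import re


def database_name(language: str, country: str, name: str = "LC") -> str:
    """Build <name>-<LL><CC> from a 2-letter language and country code."""
    for label, code in (("language", language), ("country", country)):
        # Assumption: codes are written in lower case, as in Table 2.
        if not re.fullmatch(r"[a-z]{2}", code):
            raise ValueError(f"expected a 2-letter {label} code, got {code!r}")
    return f"{name}-{language}{country}"
```

Calling database_name("ca", "es") reproduces the Catalan entry of Table 2.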
The directory structure shown in Table 1 is common for the final lexicon and the prevalidation
steps. The mandatory files to be included in each step are specified in Table 3
Directory  Filename  Word list (Prevalidation A)  Minilexicon (Prevalidation B)  Lexicon (final)  Exchange
(root)  COPYRIGH.TXT  Mandatory  Mandatory  Mandatory  Mandatory
(root)  README.TXT  Mandatory  Mandatory  Mandatory  Mandatory
\<database>\DOC  DESIGN.{PDF|DOC}  Mandatory  Mandatory  Mandatory  Mandatory (Lexicon)
\<database>\DOC  SAMPALEX.PDF  -  Optional  Optional  Optional
\<database>\DOC  VALREP.DOC  -  -  Optional  Mandatory (final release)
\<database>\LISTS  WORDLIST.TXT  Mandatory  Mandatory
\<database>\LISTS  SAP.TXT  Mandatory  -  -  Mandatory
\<database>\LISTS  NGRAM.TXT  Mandatory
\<database>\NAMES  PERSONS.TXT  Mandatory  -  -
\<database>\NAMES  PLACES.TXT  Mandatory  -  -
\<database>\NAMES  ORGNAMES.TXT  Mandatory  -  -
\<database>\NAMES  NAMES.TXT  Mandatory  -  -
\<database>\CLOSED  <FILENAME1>.TXT  Mandatory  -  -
\<database>\CLOSED  <FILENAME2>.TXT  Mandatory  -  -
\<database>\LEXICON  LEXIC<nn>.XML  -  Mandatory  Mandatory  Mandatory
\<database>\LEXICON  LEXICON.DTD  -  Mandatory  Mandatory  Mandatory
Table 3. Summary of files to be included in each step
Specific details of content and formatting for each file are described below.
5 Data files
This section describes the format of the files containing data.
\<database>\LISTS
WORDLIST.TXT Wordlist file
SAP.TXT Special Application Words list
NGRAM.TXT NGram file
\<database>\NAMES Names directory
PERSONS.TXT Includes first and last person names
PLACES.TXT Includes place names
ORGNAMES.TXT Includes organizations, companies and brand names
NAMES.TXT Includes a merge of the above three files
\<database>\CLOSED Closed sets directory
<FILENAME1>.TXT Closed set 1 (to.be.specified.)
<FILENAME2>.TXT Closed set 2 (t.b.s.)
…… ………..
\<database>\LEXICON
LEXIC<nn>.XML Set of lexicon files
……
LEXICON.DTD DTD File
5.1 <database>\LISTS directory
5.1.1 Common Words List WORDLIST.TXT
The common word list provides the following information: the entry word; counts of the entry
word computed over all domains; and counts in each of the six individual domains (C1 - C6) as
defined in D1.1 [1]. The count field should get the value 0 if an entry does not occur in a given
domain.
For a given language, let us define:
n_Cj(w_i) counts (or number of occurrences) of a given word w_i in the domain Cj, j = 1, …, 6
n(w_i) counts of a given word w_i in the overall corpora
The i-th entry of the common word list provides the following information:
w_i <tab> n(w_i) <tab> n_C1(w_i) <tab> n_C2(w_i) <tab> … <tab> n_C6(w_i)
There is no specific heading row.
Each row is delimited by the sequence <CR><LF> (ASCII 13 10).
Each field is delimited by a "Hard Tab", briefly <HT>, in C-language '\t' (ASCII 9).
The list is sorted according to descending values of the second column n(w_i).
Example:
Word  Overall  C1  C2  C3  C4  C5  C6
(The six domain columns correspond to the D1.1 domains: Sports/Games, News, Finance, Culture/Entert., Consumer Information and Personal Comm.)
de 2340295 226553 1272472 283436 277035 109855 170944
la 1647387 179589 909858 186869 186926 71754 112391
el 1451500 224297 773405 159046 152306 52379 90067
que 1294574 136388 671902 124241 132074 42641 187328
en 1137760 153250 592280 128426 139175 49514 75115
y 1015769 130420 493995 89707 149854 37894 113899
a 803137 101300 421936 76146 81813 26867 95075
los 741347 74841 397583 89048 70058 32575 77242
del 511193 65805 282252 63595 54823 18488 26230
se 496751 62434 252140 51808 53488 24975 51906
Figure 1. Example of WORDLIST.TXT
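The format rules above lend themselves to an automatic check. The following Python sketch is an illustration, not a project tool: the function names are hypothetical, and the test that the overall count equals the sum of the six domain counts is an assumption inferred from the rows of Figure 1 rather than a stated rule.

```python
from typing import List, Tuple

Row = Tuple[str, int, List[int]]  # (word, overall count, six domain counts)


def parse_wordlist(text: str) -> List[Row]:
    """Parse CRLF-delimited rows of 8 tab-separated fields (no heading row)."""
    rows: List[Row] = []
    for number, line in enumerate(text.split("\r\n"), start=1):
        if not line:  # empty string left after the final <CR><LF>
            continue
        fields = line.split("\t")
        if len(fields) != 8:
            raise ValueError(f"row {number}: expected 8 fields, got {len(fields)}")
        rows.append((fields[0], int(fields[1]), [int(f) for f in fields[2:]]))
    return rows


def check_wordlist(rows: List[Row]) -> List[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems: List[str] = []
    for i, (word, overall, domains) in enumerate(rows):
        # Assumption: the overall count is the sum of the domain counts.
        if overall != sum(domains):
            problems.append(f"{word}: overall count differs from domain sum")
        # The list must be sorted by descending overall count.
        if i > 0 and overall > rows[i - 1][1]:
            problems.append(f"{word}: not sorted by descending overall count")
    return problems
```

Applied to the first two rows of Figure 1, check_wordlist reports no problems.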
5.1.2 Special Application Word List: SAP.TXT
The special application word list contains two parts: numbers, letters and abbreviations
extracted from the corpora, and a specific vocabulary. For the specific vocabulary, a reference
list of US-English terms was defined and further subcategorized into semantic domains [1]. The
vocabulary represents lexical entries specific to the listed applications and was translated into
the target languages. For translation purposes, the basic word list was provided in US-English
with a short description of the scenarios and one or two typical examples of usage for the verbs.
For verbs, the infinitive and the inflectional forms from the examples are provided. For all other
categories, the nominative, singular, masculine forms or equivalent dictionary forms are
provided. Synonyms should be provided wherever appropriate. If a specific word does not exist
in a language, a synonymous term or phrase may be used.
Complex common word entries will be broken up in the final lexicon if reasonable (e.g.
change_password). However, those which change their meaning when split should be kept
together (e.g. web_spider).
SAP.TXT contains the information in a table format. Each row of the table is associated with an
English term. Synonyms and inflectional forms are added in new lines and coded as explained
below. The table has the following fields:
ID: Identifier of the semantic domain.
Nr.: A consecutive number that labels each English term in each semantic domain.
Synonyms and inflectional verb forms are labelled by appending a letter to the Nr.
(e.g. 60a, 60b).
English term: The basis of the translation.
POS: Part of speech of the English term. Used only to help the translation.
Translation: The result of the translation in the target language. Synonyms and inflectional
forms are added in new lines.
Examples: For each verb, two examples of typical usage are given in the Examples column.
Comments: Comment field. If a specific word does not exist at all in a language, the value is
"NE". For synonyms, the value is "Synonym".
The heading row contains the names of the fields.
Each row is delimited by the sequence <CR><LF> (ASCII 13 10).
Each field is delimited by a "Hard Tab", briefly <HT>, in C-language '\t' (ASCII 9).
The list is sorted according to increasing values of ID and Nr.
Examples are written in a single column separated by spaces.
ID Nr. English term POS Translation Examples Comments
1.1.1. 1 meter NOU Meter
1.1.1. 2 mile NOU Meile
1.1.2. 1 pound NOU Pfund
1.4. 60 to_stop VER beenden Stop the program. Program stopped.
1.4. 60a stop VER beende
1.4. 60b stopped VER beendet
1.4. 60c VER stoppen
1.4. 60d VER stoppe
1.4. 60e VER gestoppt
4.1.4. 14 boss NOU Chef
4.1.4. 14a NOU Vorgesetzter Synonym
6.2.1. 1 text_only NP nur_Text
Figure 2. Example of SAP.TXT
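A file laid out as in Figure 2 can be read with a generic tab-separated reader, and the lettered Nr. values can be used to attach synonyms and inflected forms to their base term. The Python sketch below is illustrative; the function names are hypothetical, and the heading labels are assumed to be exactly those shown in Figure 2.

```python
import csv
import io
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Assumption: the heading row uses exactly the labels shown in Figure 2.
FIELDS = ["ID", "Nr.", "English term", "POS", "Translation", "Examples", "Comments"]


def read_sap(text: str) -> List[dict]:
    """Read tab-delimited SAP rows; the first row must be the heading row."""
    reader = csv.DictReader(
        io.StringIO(text, newline=""), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    if reader.fieldnames != FIELDS:
        raise ValueError(f"unexpected heading row: {reader.fieldnames}")
    return list(reader)


def translations_by_term(rows: List[dict]) -> Dict[Tuple[str, int], List[str]]:
    """Group translations under (ID, base number), so 14 and 14a share a key."""
    groups: Dict[Tuple[str, int], List[str]] = defaultdict(list)
    for row in rows:
        match = re.fullmatch(r"(\d+)([a-z]?)", row["Nr."])
        if match is None:
            raise ValueError(f"unexpected Nr. value: {row['Nr.']!r}")
        groups[(row["ID"], int(match.group(1)))].append(row["Translation"])
    return dict(groups)
```

With the "boss" rows of Figure 2, the key ("4.1.4.", 14) collects both Chef and Vorgesetzter.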
5.1.3 NGram list: NGRAM.TXT
To be discussed, finished and approved
Unigram, bigram and trigram information for each word in the WORDLIST.TXT file is provided.
The format of the NGram file will follow the specifications from
http://www.w3.org/TR/ngram-spec/
5.2 <database>\NAMES directory
The proper names task [1] is divided into three main domains: First and Last Names, Place
Names and Organizations. The proper names are included in the following files:
PERSONS.TXT - Includes first and last person names
PLACES.TXT - Includes place names
ORGNAMES.TXT - Includes organizations, companies and brand names
NAMES.TXT - Includes a merge of the above files
Format:
List of lexical entries. Each entry in a different row
There is no specific heading row
Each row is delimited by the sequence <CR><LF> (ASCII 13 10);
The list is alphabetically sorted.
EXAMPLE
Antonio
Asunción
Nuria_Castell
Santiago_Segura
Figure 3. PERSONS.TXT file
5.3 <database>\CLOSED directory
The number and names of closed sets are language dependent. Each set is included in a separate
file. Names have up to eight characters and the extension is .TXT. The following list is a
non-exhaustive set of recommended names:
ADPOSITI.TXT - Adpositions
ARTICLE.TXT - Definite and Indefinite articles
VER_AUXI.TXT - Modal Verbs
CON_COOR.TXT - Coordinative Conjunctions
CON_SUBO.TXT - Subordinate Conjunctions
DET_DEMO.TXT - Demonstrative determiners
DET_POSS.TXT - Possessive determiners
PRO_DEMO.TXT - Demonstrative pronouns
PRO_EXCL.TXT - Exclamative pronouns
PRO_INTE.TXT - Interrogative pronouns
PRO_PERS.TXT - Personal pronouns
PRO_POSS.TXT - Possessive pronouns
PRO_RELA.TXT - Relative pronouns
The filenames have to be documented.
Format:
List of lexical entries. Each entry in a different row
There is no specific heading row
Each row is delimited by the sequence <CR><LF> (ASCII 13 10);
The list is alphabetically sorted.
EXAMPLE
a
ante
bajo
cabe
con
Figure 4. <CLOSED>.TXT file
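The name files and the closed-set files share the same simple layout, so one check can serve both. The Python sketch below is a hedged illustration with hypothetical function names. The specification does not define the collation behind "alphabetically sorted", so the comparison key used here (case-folded code point order) is an assumption that may need replacing per language.

```python
from typing import Callable, List


def read_entry_list(data: bytes, encoding: str = "iso-8859-1") -> List[str]:
    """Decode a CRLF-delimited list with one entry per row and no heading row."""
    entries = data.decode(encoding).split("\r\n")
    if entries and entries[-1] == "":  # text ended with a final <CR><LF>
        entries.pop()
    for entry in entries:
        if entry == "" or "\r" in entry or "\n" in entry:
            raise ValueError("rows must be non-empty and delimited by <CR><LF>")
    return entries


def is_sorted(entries: List[str], key: Callable[[str], str] = str.casefold) -> bool:
    """Check ordering under an assumed collation key."""
    keys = [key(entry) for entry in entries]
    return all(a <= b for a, b in zip(keys, keys[1:]))
```

Both the PERSONS.TXT example of Figure 3 and the closed-set example of Figure 4 are sorted under this key.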
5.4 <database>\LEXICON directory
5.4.1 Lexicon Files LEXIC<nn>.XML
An XML-based mark-up language was chosen to represent the linguistic information in a
formal, unambiguous and easy-to-read manner. Moreover, the information can be processed by
as many parties as possible. Any XML version 1.0 compliant parser can be used to parse the
lexica. The parser should be able to deal with UTF-16.
As defined in [2], lexica consist of a set of entry group elements.
An entry group refers to a generic entry in a vocabulary. For each entry group, it is
mandatory to specify:
  orthography;
  zero or more alternative spelling elements;
  one or more entry, compound entry or abbreviation elements.
An entry refers to one specific grammatical/morphological meaning of a vocabulary
entry. For each entry, it is mandatory to specify:
  One POS, together with its attributes. In case of multiple POS, or in case of
  multiple attributes of the same POS, multiple entries have to be specified within
  the same entry group. A description of all mandatory features and values for a
  given language will be provided in DESIGN.DOC.
  One lemma. It contains string data representing the entry lemma. In case of
  multiple lemmas, multiple entries have to be specified within the same entry
  group.
  One phonetic transcription. It contains string data representing the entry
  phonetic transcription and syllabification. In case of multiple phonetic
  transcriptions, multiple entries have to be specified within the same entry
  group.
  For application words, one APP tag has to be specified. The structure of the APP
  tag is as follows:
    Subdomain_type1 No_of_entry1
    …
    Subdomain_typeN No_of_entryN
Compound entries will have the following structure:
  phonetic transcription;
  two or more entry elements (a subset of an entry), which are links to other
  entries. Each entry element must be characterized by an orthography and must
  contain one POS tag, together with all of its attributes.
Abbreviations from application wordlists will be tagged using the ABB tag. Each
abbreviation must contain one or more EXP tags, and each EXP tag contains:
  string data representing the actual expansion (optional);
  one entry or compound entry element (mandatory).
Each attribute has the default value NS (= Not Specified), which is always optional in
the DTD (#IMPLIED). NS should be used only if the attribute is not mandatory in a
given language.
In each entry, a comment can also be inserted using the XML formalism
<!-- insert here your comment -->, which can be used in any part of the lexica.
No.  Spelling  POS  Lemma  Phonetisation + syllabification
1  capitano (It)  NOM. Class: common. Number: singular. Gender: masculine.  capitano  k a - p i - " t a - n o
1  capitano (It)  VER. Number: plural. Person: 3. Tense: present. Mood: indicative. Voice: active.  capitare  " k a - p i - t a - n o
2  dai (It)  VER. Number: singular. Person: 2. Tense: present. Mood: imperative. Voice: active.  dare  " d a - r e
3  الحمراء (Ar)  NOM. Class: common. Number: singular. Gender: feminine.  حمراء  not available
3  الحمراء (Ar)  ADJ. Number: singular. Gender: feminine. Case: genitive. Degree: positive.  حمراء  not available
Table 4. Logical structure for some entries in a Lexicon.
<?xml version="1.0" encoding="UTF-16"?>
<!DOCTYPE LEXICA SYSTEM "NewLexica7.dtd" >
<LEXICA xml:lang="IT">
  <ENTRYGROUP orthography="capitano">
    <ENTRY>
      <NOM class="common" gender="masculine"
           number="singular" />
      <LEMMA>capitano</LEMMA>
      <PHONETIC>k a - p i - " t a - n o</PHONETIC>
    </ENTRY>
    <ENTRY>
      <VER tense="present" number="plural" person="3"
           mood="indicative" voice="active" />
      <LEMMA>capitare</LEMMA>
      <PHONETIC>" k a - p i - t a - n o</PHONETIC>
    </ENTRY>
  </ENTRYGROUP>
  <ENTRYGROUP orthography="الحمراء" xml:lang="AR">
    <ENTRY>
      <NOM class="common" gender="feminine"
           number="singular" />
      <LEMMA>حمراء</LEMMA>
      <PHONETIC>not available</PHONETIC>
    </ENTRY>
    <ENTRY>
      <ADJ case="genitive" degree="positive"
           gender="feminine" number="singular" />
      <LEMMA>حمراء</LEMMA>
      <PHONETIC>not available</PHONETIC>
    </ENTRY>
  </ENTRYGROUP>
</LEXICA>
Table 5 XML-based coding of examples listed in Table 4
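A lexicon file of the shape shown in Table 5 can be read with any XML 1.0 parser. The following Python sketch uses the standard library's xml.etree.ElementTree and is only an illustration: the function name is hypothetical, and it assumes that the POS element is the first child of ENTRY that is neither LEMMA nor PHONETIC, which holds for the Table 5 examples but has not been checked against the DTD.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

Triple = Tuple[str, Optional[str], Optional[str], Optional[str]]


def read_entries(data: bytes) -> List[Triple]:
    """Return (orthography, POS tag, lemma, phonetic) for every ENTRY element."""
    root = ET.fromstring(data)  # the encoding is taken from the XML declaration
    result: List[Triple] = []
    for group in root.iter("ENTRYGROUP"):
        spelling = group.get("orthography", "")
        for entry in group.findall("ENTRY"):
            # Assumption: the POS element is the first child that is neither
            # LEMMA nor PHONETIC (true for the Table 5 examples).
            pos_tags = [c.tag for c in entry if c.tag not in ("LEMMA", "PHONETIC")]
            result.append((
                spelling,
                pos_tags[0] if pos_tags else None,
                entry.findtext("LEMMA"),
                entry.findtext("PHONETIC"),
            ))
    return result
```

Each returned tuple corresponds to one row of the logical structure in Table 4.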
Packaging: The lexicon has to be split into two parts: proper names and common nouns. These
parts should be split further into a set of smaller and more manageable files. Splitting is
language dependent and must be done on an alphabetical basis.
Filenames are
LEXIC<nn>.XML
where <nn> is a two-digit number from 00 to 99.
Splitting criteria, filenames and the mapping between filenames and content should be
documented.
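The LEXIC<nn>.XML naming pattern can be generated mechanically. A minimal Python sketch with a hypothetical function name:

```python
def lexicon_filename(index: int) -> str:
    """Return LEXIC<nn>.XML for a part index between 0 and 99."""
    if not 0 <= index <= 99:
        raise ValueError("index must be between 0 and 99")
    return f"LEXIC{index:02d}.XML"
```

How part indices map to alphabetical ranges is language dependent and is left to the documentation of each lexicon.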
5.4.2 DTD File: LEXICON.DTD
A formally specified grammar (Document Type Definition or DTD) containing all the
linguistic information described so far allows the XML-based lexica to be validated automatically.
The LEXICON.DTD file contains the DTD implementing the linguistic information described
in [2]. All lexicons generated in LC-STAR use a common DTD, which is available on the web
page of the project. For validation purposes, according to the Language Specific guidelines, each
partner should provide in the documentation (DESIGN.DOC) a Language Specific DTD to
allow validation of mandatory attributes.
It should be noted [2] that:
Each lexicon and entry group has an optional attribute specifying the language and the
attribute values of their sub-elements: we chose the standard XML attribute xml:lang whose
possible values are defined by [IETF RFC 1766], Tags for the Identification of Languages,
or its successor on the IETF Standards Track.
For the sake of simplicity, we associate each entry with as many triples (POS, Lemma, Phonetic
Transcription) as it can belong to, thereby allowing for repetitions in a triple.
The characters range supported by the XML Standard [http://www.w3.org/TR/2000/REC-
xml-20001006] is any UNICODE character, excluding the surrogate blocks, FFFE, and
FFFF. Moreover, all the XML processors must accept UTF-16 encoding.
However, special characters must be escaped by pre-defined strings when they are
contained either in elements of type PCDATA (i.e. Parsed Character Data) or in attribute
values of type CDATA (i.e. Character Data). The following Table illustrates the situation:
Special Character  Escaping String  Must be escaped in PCDATA Elements  Must be escaped in CDATA Attributes
&    &amp;          Yes      Yes
<    &lt;           Yes      Yes
>    &gt;           Yes (1)  Yes (1)
]]>  NOT AVAILABLE  No       Yes
'    &apos;         No       No
"    &quot;         No       Yes
Table 6. Special characters that have to be escaped in PCDATA elements content and/or
in CDATA attribute values.
Notes:
(1) According to the XML Standard, "the right angle bracket (>) may be represented using the string
"&gt;", and must, for compatibility, be escaped using "&gt;" or a character reference when it appears in
the string "]]>" in content, when that string is not marking the end of a CDATA section."
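When lexicon files are generated programmatically, the escaping in Table 6 is usually delegated to a library. The Python sketch below uses xml.sax.saxutils.escape from the standard library; the two wrapper names are hypothetical, and the sketch only covers the &, <, > and double-quote rows of the table.

```python
from xml.sax.saxutils import escape


def escape_text(value: str) -> str:
    """Escape &, < and > for use in PCDATA element content."""
    return escape(value)


def escape_attribute(value: str) -> str:
    """Escape &, <, > and the double quote for a double-quoted attribute value."""
    return escape(value, {'"': "&quot;"})
```

The stress marker used in the phonetic transcriptions is a double quote, so it needs escape_attribute only if a transcription were ever stored in an attribute rather than in element content.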
6 Documentation files
A number of documentation files containing the overall database description will be provided
on the CD-ROM. They can be in ASCII (.TXT), MS Word (.DOC), Postscript (.PS),
Portable Document Format (.PDF) or Hypertext Markup Language (.HTM) format, and are stored
in the root. All files are mandatory except when they are explicitly marked optional.
README.TXT
COPYRIGH.TXT
DESIGN.{DOC|PDF}
SAMPALEX.PS
VALREP.DOC (supplied by the Validation Centre – only in the final release of a database)
6.1 Root Directory
6.1.1 README.TXT
README.TXT is an ASCII text file that has to list and describe all the files present in the
database. The file includes most of the introduction of the DESIGN.DOC file, i.e.:
Mention wordlist/lexicon collection within the framework of LC-STAR
Collectors and owners of wordlist/lexicon
Short description: Types of entries, Number of words, number of entries /information
contained in the lexicon
Brief summary of contents of disk
Layout of the media file system
File nomenclature
6.1.2 COPYRIGH.TXT
COPYRIGH.TXT is an ASCII text file with some short sentences, e.g. "This material is copyright.
The copyright belongs to …. Date …."
6.2 <database>\DOC directory
6.2.1 DESIGN.DOC and DESIGN.PDF
DESIGN.DOC or DESIGN.PDF is the documentation file. It is written in English and should
contain the relevant part of the specifications, expanded to include some more detail.
For Prevalidation part A, it should include:
Introduction
Mention wordlist/lexicon collection within the framework of LC-STAR
Collectors and owners of wordlist/lexicon
Short description: Number of words, number of entries
Brief summary of contents of disk
Layout of the media file system; reference to README.TXT file
Common word entries
Describe the corpus domains and subdomains.
Overview of corpora used per domain, including year of publication
Description of procedure used to extract words from corpora
Description of tokenization procedure.
Provide corpus size (indicate in which step of the tokenization-cleaning procedure)
Provide the corpus size after cleaning. (without numbers, names, abbreviations,
singletons…)
Provide corpus size after cleaning for each domain.
Provide word list size and coverage.
Describe media used for the corpus collection
Closed sets
Define closed sets included.
Criteria for (language specific) closed sets definition and contents
Filename of each set
Proper name task
Describe the corpus size
Describe the corpus domains and subdomains
Describe name tags and principles used for name tagging (what is considered for TOU,
GEO,…)
Define Name i.e. New_York = one entry; rose Rose…
Provide corpus size for each domain
Describe media used for the corpus collection
Describe collection strategies
Special application domain
Purpose
Describe the corpus domains and subdomains
Provide corpus size for each domain in the table
Describe media used for the corpus collection
Formats
The character coding format
Character set
Common Word List
Special Application Word List
A template named LC_wordlist_template.doc can be found in the LC_STAR web page.
For Prevalidation Part B and the Final lexicon, DESIGN.DOC or DESIGN.PDF should
include:
Introduction
Mention wordlist/lexicon collection within the framework of LC-STAR
Collectors and owners of wordlist/lexicon
Contact person: name, address, affiliation;
Short description;
the contents of the CD;
the layout of the CD-ROM;
File Nomenclature and formats
Contents of the lexicon
Types of entries: # of common Words, # names, # special application words
Information contained in the lexicon (fields)
Common word entries
Describe the corpus domains and subdomains
Provide corpus size for each domain and subdomain (words and lemmas)
Overview of corpora used per domain, including year of publication
Description of procedure used to extract words from corpora
Description of Tokenization procedure.
Number of entries before cleaning overall and per domain
Number of entries after cleaning, overall and per domain.
Mention that closed sets are included
Define closed sets and language specific criteria for closed sets definition and contents
Proper names
Describe the corpus domains and subdomains
Provide name categories size for each domain and subdomain
Describe name tags and principles used for name tagging (what is considered for TOU,
GEO,…)
Describe collection strategies
Define Name i.e. New_York = one entry; rose Rose…
Indicate whether capitalization is used to differentiate proper names from common words
Special application words
Purpose
Describe application domains and subdomains
Describe strategy employed to obtain the words
Description of synonym selection
Format of the lexicon
Description of XML format used: (Entry_group, Entry, Entry_el…)
Description of fields
Splitting criteria
Description of DTD file. Mention LEXICON.DTD file
Examples
Morphological and syntactic information
POS set used, plus attributes and values
Description of multiple tagging possibilities
Language specific guidelines
Principles of POS tagging and their attributes
Any systematic underspecification of POS attributes
Morphological boundaries
Principles of multiple tags
Choice of lemma
Instructions to the coder
Language Specific DTD
Orthographical information
Character set used
Orthographical conventions used
Language specific orthography guidelines
Treatment of multi-token entries
Treatment of numeric entries
Treatment of acronym spellings
Treatment of abbreviations
Treatment of multiple spellings
Phonemic transcriptions
Procedures used to obtain phonemic forms from orthographic input
SAMPA set used and that blanks separate individual phoneme symbols in the
transcriptions
Use of syllable markers, stress markers (and tone markers or morphological markers, if
provided)
Assimilation processes accounted for in the transcriptions
Treatment of multiple pronunciations
Treatment of foreign words: phonetization, tagging…
Language specific phonetic transcription guidelines
Stress markers and syllable boundary markers
Principles of assigning stress
Treatment of stress in multi-word entries
List of usually unstressed words (e.g. Turkish)
Tone markers
Morphological markers
Extra SAMPA symbols to cope with foreign languages
Mapping foreign phonemes to language specific phonemes
Phonetic transcription in names (nativization of foreign names, etc)
Dialect considered as "canonic" (e.g. Arabic)
Syllabisation conventions (i.e. geminates, etc)
Quality control
Procedures for uniformity in spelling, transcriptions and tags
Double check procedures
UTF-16 character set
Tokenisation procedures
A template named LC_lexicon_template.doc, to be used to produce the document for
Prevalidation part B and the Final Lexicon, can be found on the LC_STAR project web page.
6.2.2 SAMPALEX.PS
The SAMPA table used to transcribe the lexicon. The table should include the SAMPA
phonemes for the given language, SAMPA phonemes from other languages (if used) and a table
for mapping foreign phonemes to the language specific phonemes.
6.2.3 VALREP.DOC
VALREP.DOC is the validation report documentation file. It must be included in the final
release of each database, before the database is distributed to the project's partners. The
document will be produced by the project's Validation Centre during the validation phase,
according to the specifications.
7 References
[1] Ziegenhain, U. et al. Specification of corpora and word lists in 12 languages. LC-STAR
Project Deliverable D1.1.
[2] Maltese, G. Montecchio, C. et al. General and language specific specification of contents of
lexica. LC-STAR Project Deliverables D2. 2003.