                                FONILEX Manual
                           version 1.0b, March 1998

                           by Piet MERTENS, Filip VERCAMMEN

              a pronunciation database of Dutch in Flanders∗


0 Introduction
1 Coverage
    1.1 Criteria for the selection of word forms
    1.2 Criteria for the definition of the pronunciation
2 The treatment of pronunciation variants
    2.1 Two classes of variants
    2.2 The pronunciation of loan words
3 Technical description of the FONILEX database
    3.1 Description of fields
    3.2 Phonetic symbol encoding
    3.3 Phonological form encoding
    3.4 Stress
4 Rule system
    4.1 Phonological rules in FONILEX
    4.2 The implementation of rewrite rules
    4.3 Limitations
5 Copyright notice
Appendices
    A1: The Perl program “rewrite”
    A2: The Perl program “yapa2sampa”

∗ This research project was funded by the Flemish ministry of Science (“Vlaamse minister voor

0       Introduction
    This document describes the FONILEX pronunciation database, which contains the
phonetic transcription of the most frequent word forms of Dutch as spoken in
Flanders. The database was compiled within the FONILEX research project, which was
funded by the Flemish government through IWT [1], between January 1995 and July
1997. Three academic research groups participated in the project: the “Centre for
Computational Linguistics” (K.U.Leuven), the “Centrum voor Nederlandse Taal en
Spraak” (U.I.A., Antwerp), and “Vakgroep voor Electronica en Informatiesystemen
(ELIS)” (U.Gent). In this introduction we first describe the content, purpose and
applications of the database, and then indicate the contribution of each partner.
    Content. The FONILEX database is a list of over 200,000 Dutch word forms
together with information about the way in which they are pronounced in the
Flemish-speaking part of Belgium. As such it is the first database of the pronunciation of
standard Flemish. The pronunciation information consists of two parts: first, an
abstract representation of the pronunciation of a given word form, and second, the
concrete pronunciation of this word form in three speaking styles. The database is
accompanied by a set of phonological rewriting rules; these were used to derive the
phonetic (concrete) transcriptions from the abstract form. As such the rule system
accounts for most pronunciation variants. This rule-based approach also enables the
user to derive a particular pronunciation style (or a particular representation)
according to his own requirements, by simply adapting the rule system.
     Specificity. In order to grasp the specific nature of the database, it may be useful
to consider its properties in comparison with a classical pronunciation dictionary. The
latter is a book in which a human user can look up a particular word in order to obtain
its pronunciation; the former is a computer file, intended to be used by computer
programs. Most pronunciation dictionaries only contain base forms (called lemmata
in lexicography), as do regular dictionaries: for instance, for nouns only the singular
form is included. In principle, the FONILEX database lists all word forms of a given
lemma separately; it is a “full-form” lexicon. Typically, a dictionary only mentions
what is considered to be the “correct” pronunciation of a word; most often there is
just one, although occasionally there may be more. The database, on the other hand,
tries to capture all acceptable pronunciations (albeit in an economical way). In the
database, the basic criterion for including a word form is its frequency of use, whereas
in a dictionary words may be selected (or eliminated) on the basis of various criteria,
such as origin (e.g. loan words) or mere space limitations. Finally, there are size
differences: the size of a dictionary is determined by practical considerations (such as
physical size, cost, marketing aspects), whereas the size of the database is mainly
determined by the project duration.
    An additional characteristic of FONILEX, which distinguishes it from most other
pronunciation databases, is that it contains most pronunciation variants, and that it has
the capacity to generate additional variants. This is because, in FONILEX, pronunciation
variants are not considered to be (unwelcome) exceptions; on the contrary,
they result from the application of phonological rules, which describe general
phonological processes.

[1] The research took place in the context of the research programme “Korte termijnprogramma Taal- en
    FONILEX was compiled semi-automatically, and verified manually. The initial
form of the entries was obtained using grapheme-to-phoneme conversion, i.e. using a
computer program that derives a representation of the pronunciation of a word or a
sentence on the basis of its orthographic form. However, each individual phonological
transcription was checked manually (or hand-corrected, if one prefers). Due to the
detail of the transcription (needed to account for pronunciation variants), the
corrections were numerous. As the abstract representation evolved during the project,
and the notation changed accordingly, many of the entries had to be verified several
times. This enormous task was performed mainly by one and the same person, which
ensures consistency of transcription. We aimed for a reliable database rather than a
large one.
     Applications. The major applications of phonetic databases such as FONILEX are
in language and speech technology. Speech technology involves either speech
synthesis or speech recognition, or both. For instance, in “text-to-speech” systems,
text in the memory of the computer is converted to a speech signal, so the user can
listen to the computer instead of looking at the computer screen. In speech
recognition, an audio signal, such as the user’s voice, is identified as a sequence of
words, and this is used, for instance, in dictation systems, to type out sentences the
user reads to his computer. All sophisticated speech applications require a
pronunciation database. In speech recognition, the database is needed to map a
sequence of sounds onto a particular word form. In speech synthesis, it may be used
to obtain the position of word stress, or to obtain the sequence of sound symbols. The
latter task, known as “grapheme-to-phoneme conversion”, is generally handled
without a complete pronunciation database, but the use of a phonetic database will
significantly reduce the error rate.
    Speech technology has already proved its usefulness in reading and speaking aids for
the visually handicapped, as well as in tools for hands-free computer interaction.
It has now become clear that its use in general-purpose applications, such as
dictation systems and messaging systems, will soon become widespread and that speech
will become a primary means of computer interaction. Other applications will then
emerge, such as the use of speech synthesis in language teaching, and in education in
general. In all these applications, it is crucial to have a phonetic database for those
programs that require it.
    Because FONILEX records also contain a cross-reference to the corresponding
CELEX database records (cf. infra, section 1.1), which among other things include
morphological information, the phonetic database can be integrated into natural
language processing systems. (As a matter of fact, it can be integrated into any language
processing system for Dutch, provided a morphological analyser is applied to the
spelled word form.)
   Finally, the FONILEX database will be useful for basic research in the area of the
phonology (and morpho-phonology) of Dutch, where it provides a systematic list of
language facts that phonological theories have to account for. The digital form of the
database enables the linguist to verify hypotheses very quickly and in a systematic
way, rather than manually, as used to be the case. As a matter of fact, this approach
was already applied in the construction of FONILEX itself, and to some extent
determined the shape of the abstract representation.
    Partners. FONILEX resulted from the collaboration of three academic partners.
    The phonetics research unit of the “Centre for Computational Linguistics” of
Leuven coordinated the project. It carried out a phonological analysis of Dutch which
led to the specification of the abstract representation, the treatment of pronunciation
variants, and the phonological rule system. The Leuven team also performed the
laborious task of validating the phonological representation of each entry. It also
implemented the rewriting rule system used to generate the phonetic variants. Finally,
it was in charge of the management and distribution of the database.
    The research unit “Electronica en informatiesystemen” of Ghent designed and
implemented a validation environment and a database management system used to
store the intermediate results from the different partners. The validation system
includes a speech synthesis system to synthesize the stored transcriptions of
pronunciation. The team from Ghent also provided the first grapheme-to-phoneme
conversions which were used to bootstrap the automatic learning system implemented
in Antwerp.
    The contribution of the “Centrum voor Nederlandse Taal en Spraak” of Antwerp
consisted of two parts. First it provided theoretical and descriptive support, in
particular concerning the treatment of syllable division and stress. Second it
implemented an automatic learning system for grapheme-to-phoneme conversion,
which provided the tentative representation of pronunciation which was
hand-corrected in Leuven.
    Further details on the contribution of the three partners can be found in the
scientific report of 1995.
   The next sections provide information about the contents and coverage of the
FONILEX database, about its format, about the treatment of pronunciation variants, and
about the phonological rules and rewriting system.

1      Coverage
    This section describes the criteria used to select the word forms to be included in
the database and to select the type of pronunciation to be represented.

1.1    Criteria for the selection of word forms
    In order to create a pronunciation database, one needs a list of word forms the
pronunciation of which will be included in the database. In the case of Dutch, lists of
word forms have been compiled on various occasions, such as for the CELEX lexical
database (developed at the Max-Planck-Institut für Psycholinguistik, in Nijmegen) and for
the INL database (Instituut voor Nederlandse Lexicologie, Leiden). The CELEX
lexical database, which is available on CD-ROM, contains over 384,000 word forms. It
also contains word frequency information, based on a 40-million-word corpus from INL.
Following an agreement with CELEX, the CELEX list was used as the basis for the list
of word forms in FONILEX.
     Since the number of word forms in a language is very high and the set is not
closed (due to neologisms and compounding), and given the limited resources of
our project, additional criteria had to be found for the selection of word forms. The
first selection criterion is word frequency. The obvious advantage of frequency is that
frequent words are likely to be encountered in applications; therefore they should be
present in our database. All word forms in CELEX which have a frequency of 1 or
more are included in FONILEX. Some word forms in CELEX have a frequency count of
0, indicating that they were not encountered in the corpus used for the frequency
measurements, but were included from other sources (such as the Van Dale
dictionary). From the latter set all monomorphematic word forms (i.e. containing one
stem morpheme only) were also selected for inclusion in our database, because their
presence will enable us to deal with the compound words in which they appear. The
FONILEX database contains 218,113 entries, covering 205,216 different word forms.
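
The two selection criteria can be summarised as a simple predicate. The following is only a minimal sketch, not the code used in the project: the record type `WordForm`, its fields `frequency` and `stem_count`, and the frequency values in the sample data are hypothetical and do not correspond to actual CELEX field names or counts.

```python
from dataclasses import dataclass


@dataclass
class WordForm:
    """Hypothetical stand-in for a CELEX word-form record."""
    spelling: str
    frequency: int   # corpus frequency count (0 = not found in the corpus)
    stem_count: int  # number of stem morphemes (1 = monomorphematic)


def is_selected(form: WordForm) -> bool:
    """Apply the two selection criteria described in this section."""
    if form.frequency >= 1:
        return True                  # attested in the frequency corpus
    return form.stem_count == 1      # unattested, but monomorphematic


# Invented sample data, for illustration only.
candidates = [
    WordForm("tank", 120, 1),     # attested: selected
    WordForm("valkuil", 0, 2),    # unattested compound: not selected
    WordForm("uil", 0, 1),        # unattested simple word: selected
]
selected = [form.spelling for form in candidates if is_selected(form)]
```

Keeping the rule in a single predicate makes it easy to re-run the selection on another word list, such as the list of recent word forms mentioned below.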
    In addition, a list of word forms encountered in recent texts but not appearing in CELEX
has been compiled by Jacques Duchateau and Kris Demuynck from the ESAT speech
research unit (Departement Elektrotechniek, K.U.Leuven) and could be used for future
extensions of the FONILEX database.

1.2    Criteria for the definition of the pronunciation
    Ideally a pronunciation database should contain all correct or acceptable
pronunciations of the included word forms, preferably indicating their stylistic
properties. Such an approach is of course not feasible, because of the sheer size of
such an inventory. More importantly, and particularly in the case of Dutch as spoken
in Flanders, it is not immediately clear which pronunciations should be considered
acceptable and which are not.
    If one observes the ways in which Flemish is pronounced in Flanders, one is
confronted with a set of dialects of Dutch, showing phonetic characteristics which
make them quite different from one another. Moreover, there is a variant of Flemish,
which is usually referred to as “Algemeen Nederlands” (General Dutch) and could be
called “standard pronunciation”. Its sociolinguistic status is analogous to that of
Received Pronunciation (RP) in Britain. It is used mainly in public broadcasting and in
education (especially higher education), as well as on formal occasions (politics, religion, the courts). Quite
often, in actual speech, one can observe a mixture of the formal and informal
pronunciations, in particular when a speaker tries to use the standard pronunciation
but inadvertently introduces elements of his dialect.
    As far as the dialect pronunciations are concerned, they should be described in
dialect dictionaries, and we will not be dealing with them here. But this leaves us
with the problem of defining the standard pronunciation.
     From a phonetic point of view, there are clear, important differences between the
standard pronunciations of Dutch in the Netherlands and in Flanders. One has to look
for criteria that enable us to define this standard pronunciation. Here we face the
following problems.
    1. At the start of the project, descriptive research on the standard pronunciation
appeared to be lacking. There were no systematic studies on the phonetic or socio-
linguistic properties of standard Flemish. Such basic research would have been of
great help to this project, but it would have been too vast to undertake within the
project itself. Recently the universities of Nijmegen (Netherlands) and Antwerp
(U.I.A., Belgium) started a joint research project investigating the phonetic
characteristics of standard Dutch in the Netherlands and in Flanders, as well as the
judgements of native speakers of Dutch and Flemish about pronunciation variants.
(“De uitspraak van het Standaard Nederlands: variatie en varianten in Vlaanderen en
Nederland”. The project is sponsored by FWO and NWO.)
    2. When language users, and even phoneticians, judge the acceptability of
pronunciation variants, they arrive at different conclusions: what is acceptable to
some may be unacceptable to others. Quite often, users implicitly refer to the
spelling of the words when they make such judgements.
     The “standard” pronunciation of Flemish, which FONILEX tries to encode, will be
defined here as the pronunciation used by newsreaders on major Flemish broadcasting
channels. This choice may be motivated by the following considerations. The
pronunciation used by the official broadcasting stations explicitly strives to be neutral,
i.e. free from interferences from dialects. This kind of pronunciation is familiar to
almost anyone, since it can be heard on a daily basis by the whole population.
Moreover, the standard pronunciation is perfectly intelligible for speakers of Dutch in
the Netherlands and is well accepted by them.
    It should be pointed out that this pragmatic definition is not fully satisfactory. The
newsreaders at the public broadcasting station (VRT, formerly BRTN) do refer to
printed guidelines to choose particular pronunciations in case of doubt. For instance,
they are advised to pronounce “plenning” [éäbåfk] and “dosjee” [Ç´ëàÉ ], rather
than “planning” [éä^åák] and “dossier” [Ç´ëá ê] (which are obvious examples of
an attempt to maintain the original pronunciation of loan words, rather than to adapt
their pronunciation to the phonological regularities of Dutch or to use the
pronunciation suggested by their spelling). These instructions are quite specific and
mainly aim at the pronunciation of sentence intonation and proper names. At the
commercial station (VTM), instructions are less specific, although there is also an
effort to use a uniform pronunciation. In either case, there is no full phonetic account
of the pronunciation broadcasters should be using or effectively use.
     The guidelines have a normative status. It is uncertain whether they reflect the
actual usage of broadcasters. As a matter of fact, the mere existence of such guidelines
suggests they do not. It is even less likely that they reflect the actual usage of
a large number of speakers of Flemish.
    Also one should keep in mind that the speech of newsreaders very often is “read
speech”, i.e. it is text that is read from a display (an “autocue”), rather than
spontaneous speech. It is well known that read speech is articulated more carefully
than spontaneous speech. News messages are prepared in advance and annotated with
marks to indicate stresses or particular intonation patterns.
    Given these observations, it seems useful to include pronunciations suggested by
the guidelines, but more importantly, the pronunciations in FONILEX will be based on
observations and on introspection of the possible pronunciation variants.

2      The treatment of pronunciation variants

2.1    Two classes of variants
   An important issue concerns the treatment of pronunciation variants. For many
words one can observe pronunciation variants. For instance,
      “creatief”      [âêÉ^íÈáÑ]
      “negatief”      [åÉÂ^íÈáÑ]
      “tenten”        [íÈbåíÃå]
      “ceramiek”      [âÉê^ãÈáâ]
      “flat”          [ÑäÈ^í]
      “canvas”        [âÈ^åî^ë]
    Of course, one could do away with variants and mention one form only. This
approach wouldn’t be descriptively adequate: the standard pronunciation does allow
for variants. Moreover, it is not obvious which form should be chosen as the correct
one. Alternatively, one could mention only a selection of variants. But, if a selection
needs to be made, a linguistically motivated criterion is required for doing so.
    Obviously, applications in speech recognition need to deal with pronunciation
variants. But the enumeration of all pronunciation variants would increase the size of
the database to a level which makes it difficult to maintain and use. Such
enumerations will often decrease the general usefulness of the database, because the
large number of variants is not always needed by a particular application, and because
of the negative effect of size on database access time.
    To avoid these problems, a particular representation of pronunciation was devised
which predicts several variants from one and the same basic notation, without listing
them all. This basic notation is not actually a phonetic transcription of one particular
pronunciation variant; rather it is a more abstract notation which covers several
possible variants. A rule system is used to obtain the variants from the basic notation.
    The approach just mentioned is valid for all variants which result from “regular”
phonological or phonetic processes, i.e. processes that apply to large sets of word forms.
Examples of such general processes are glide insertion between consecutive vowels
(as in “creatief”), alternations between tense and lax vowels (as in “negatief”), or
deletion of [n] (as in “tenten”).
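
To illustrate how a single basic notation can stand for several variants, here is a toy sketch of one optional rule, the deletion of a word-final n after schwa. It is not the FONILEX rule system, which is described in section 4. The transcription "t'Ent@n", the use of `@` for schwa, and the names `FINAL_N_AFTER_SCHWA` and `variants` are assumptions made for this example only.

```python
import re
from typing import List

# Toy rule (assumed notation): a word-final "n" after schwa "@" may be dropped.
FINAL_N_AFTER_SCHWA = re.compile(r"(?<=@)n$")


def variants(base: str) -> List[str]:
    """Return the base form plus the variant produced by the optional rule."""
    forms = [base]
    reduced = FINAL_N_AFTER_SCHWA.sub("", base)
    if reduced != base:
        forms.append(reduced)     # the rule applied, so a second variant exists
    return forms


tenten = variants("t'Ent@n")   # ["t'Ent@n", "t'Ent@"]
tank = variants("t'ANk")       # rule does not apply: ["t'ANk"]
```

Because the rule is optional, both the full and the reduced form are kept; only the base form needs to be stored.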
    Other types of variants are less frequent, or even idiosyncratic. One particular
case is that of loan words (such as “flat”) for which two pronunciations coexist: one
form which is close to the original pronunciation of English, and another form in
which the pronunciation has been adapted to the phonological system of Dutch
(possibly with interference from spelling). Another case is that of alternative stress positions
(such as in “canvas”).
     Because of their different nature and frequency, the FONILEX database in practice [2]
distinguishes two classes of variants: regular and irregular ones. Regular variants are
indicated by diacritic symbols in the transcription. A word form with regular variants
has only one entry in the database, and hence only one (abstract) transcription, from
which variants are derived by applying a set of rules. Irregular variants, which cannot
be predicted by phonological processes, are added explicitly to the database. For
instance, there will be multiple entries for “flat” and “dossier”:
       “flat”              /ÑäÈ^í/
       “dossier”           /Ç´ëÈáê/
     Since almost all variants are regular (i.e. predictable from the basic notation), the
large majority of word forms in the database contain only one representation of
pronunciation. As a result, the FONILEX database is very compact. As a matter of fact,
it is almost as compact as it would be if variants had not been mentioned.

                                               word forms      records
                                   regular        194,037      194,037
                                   irregular       11,179       24,076
                                   total          205,216      218,113
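
The figures in the table can be checked, and the compactness claim quantified, with a few lines of arithmetic. This is merely a sketch using the numbers from the table; the variable names are ours.

```python
# Figures copied from the table above.
regular_forms, regular_records = 194037, 194037
irregular_forms, irregular_records = 11179, 24076

total_forms = regular_forms + irregular_forms        # 205216
total_records = regular_records + irregular_records  # 218113

# Share of word forms that need only one (abstract) transcription.
regular_share = regular_forms / total_forms
# Average number of explicit entries per irregular word form.
records_per_irregular = irregular_records / irregular_forms
```

Roughly 95% of the word forms are covered by a single record, and an irregular word form has a little over two records on average.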

    Since it contains information on phonetic variation, the basic notation is in some
way “abstract”, and therefore should not be called “phonetic”. It could be called
“phonological”, but then again phonological representations usually do not include
information about phonetic variants explicitly. Rather they implicitly assume variants
can be derived correctly from the properties of the form itself (such as stress, syllable
structure, or morphological structure). In the process of creating the FONILEX entries,
it soon became clear that the phonetic facts are much more complex than might be
expected from studies on Dutch phonology. Phonological studies only provide partial
descriptions of these facts, and the rules they put forward often overgenerate.
As a result, it was decided to indicate phonetic variation explicitly.
    A list of (mostly) regular processes encoded in FONILEX is given in section 3.3.

[2] This doesn’t imply any assumptions on the theoretical status of this distinction. It merely is a practical

2.2        The pronunciation of loan words
    In principle loan words receive two entries: one for the Dutch pronunciation and
one for the foreign pronunciation. However, the boundary between the two types is not
always clear; a pronunciation may be adapted partially to the Dutch phonetic
characteristics. Therefore, some “foreign-like” pronunciation variants will not be
mentioned in FONILEX. The criteria for the selection are listed below.
• In FONILEX word forms never end in voiced obstruents.
         “love”               only /äÈÃÑ/, not /äÈÃî/
• Loan words ending in “-eur” can be pronounced with /lê/ or /Ã ê/. The latter
  form will not be mentioned unless there are other reasons for doing so.
         “grandeur”           /cê^åÇÈlê/ and /cê^FÇà ê/
         “mineur”             /ãá2åÈlê/ only [3]
• The possibility of pronouncing a glottal stop in words like “hausse” will not be
  mentioned as a separate variant, unless there are other reasons for doing so.
• The diacritics for lax vowel variants or schwa variants (which are described in
  section 3.3) are maintained, even in “truly foreign” pronunciations.
         “détente”            only /ÇÉ3íÈ^FíÃ/
• Loan words in which the final “-e” may be deleted, do not receive a variant without
  this /Ã/, unless there are other reasons for doing so.
         “détente”            only /ÇÉ3íÈ^FíÃ/

[3] Diacritics such as /i2/ are described in section 3.3.

3        Technical description of the FONILEX database

3.1      Description of fields
     FONILEX is distributed as a text file, using ISO 8859-1 character encoding. Each
line contains one entry of the database, and in principle corresponds to one word form.
There is one exception: word forms with multiple non-reducible pronunciation
variants (see section 2) receive multiple entries in the database, one for each variant.
An entry consists of six fields, separated by a ‘|’ character. The six fields are as follows:
         1.   identification number
         2.   orthography
         3.   phonological form
         4.   phonetic form of normal pronunciation
         5.   phonetic form of spelling pronunciation
         6.   phonetic form of sloppy pronunciation
      All fields, except the second one, use the ASCII subset for character encoding.
    The first field contains the unique identification number of the word form. It is
taken over from the CELEX database (Release 2, Dutch version 3.1).
    The second field contains the spelling of the word form, according to the latest
spelling reform of 1996 [4]. This was achieved using a list of modified word forms,
which was kindly made available to us by the CELEX team. Before the spelling reform
two spelling conventions were used for Dutch: the first, the “preferred” spelling
(“voorkeurspelling”), had an etymological flavour, while the second, the “modern”
spelling (“progressieve spelling”), was phonetic in some respects. If, for some reason,
the older spellings are needed by the application, they can be recovered from CELEX
(Release 2) using the identification number in field 1. The CELEX database gives
either the modern spelling, the preferred spelling, or both.
    The third field contains the phonological or abstract representation of
pronunciation. The conventions used to encode phonetic symbols and pronunciation
variants are described in sections 3.2 and 3.3. The phonological notation of field 3 is
used to derive the phonetic representations of fields 4 to 6.
     These last three fields were generated with a particular setting of the phonological
rules, in order to obtain pronunciations which are likely for three different speaking
styles: the first is normal (unmarked) pronunciation, the second one is a spelling
pronunciation, and the third is a somewhat careless or even sloppy pronunciation.
These three phonetic fields are intended as examples of the output of rule application,
in order to give an idea of the possibilities of this method and to encourage the user to
experiment with it. As a result, they do not represent the full range of phonetic forms.
The user might select another combination of phonological rules, according to the
(phonostylistic) requirements of his application, and derive the phonetic forms
accordingly. The program for doing this is described in section 4.2 and is part of the
FONILEX distribution. The phonetic symbol set which is used is given in section 3.2.
The normal pronunciation (field 4) will always be present. The two other forms
(fields 5 and 6) are only supplied when they differ from the normal form; otherwise
either one of these fields, or both, will be empty (although the field separators will still
be present). This is done in order to reduce the size of the database.

[4] The official spelling reference is the “Woordenlijst Nederlandse taal”, also known as the “green book”,
“het Groene boekje”. However, some words (e.g. “analfabeet”) are missing from this source. In such
cases, the “Van Dale Groot Woordenboek van de Nederlandse Taal” can be used instead, but there are
two problems: 1. the two sources sometimes disagree on spelling, and 2. some word forms encountered in
CELEX are missing from either of the two dictionaries.
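
The field layout and the empty-field convention can be handled as in the following sketch. It is not part of the FONILEX distribution: the names `Entry`, `parse_entry` and `read_database` are ours, the sample line is invented (its phonetic fields are placeholders, not real data), and we assume that no field contains the ‘|’ character itself.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    ident: int          # field 1: identification number (shared with CELEX)
    spelling: str       # field 2: orthography
    phonological: str   # field 3: abstract (phonological) form
    normal: str         # field 4: normal pronunciation
    spelled: str        # field 5: spelling pronunciation
    sloppy: str         # field 6: sloppy pronunciation


def parse_entry(line: str) -> Entry:
    """Split one database line into its six '|'-separated fields."""
    fields = line.rstrip("\n").split("|")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    ident, spelling, phonological, normal, spelled, sloppy = fields
    # An empty field 5 or 6 means "same as the normal pronunciation".
    return Entry(int(ident), spelling, phonological, normal,
                 spelled or normal, sloppy or normal)


def read_database(path: str) -> List[Entry]:
    """Read a whole file; the database uses ISO 8859-1 character encoding."""
    with open(path, encoding="iso-8859-1") as handle:
        return [parse_entry(line) for line in handle if line.strip()]


# Invented sample line: fields 4 to 6 are placeholders, not real data.
entry = parse_entry("305327|tank|t'ANk|t'ANk||")
```

Filling in the empty fields at load time keeps the file small while letting an application treat all three styles uniformly.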
     We now return to the content of field 1. Since the identification number is
identical to the one in the CELEX database, the treatment of homonyms (words with
the same form but a different meaning) also follows that of CELEX. Such words
will receive separate entries if they differ with respect to at least one of the following:
• Difference in pronunciation
  (e.g. “zoom” [zˈom] and “zoom” [zˈum]).
• Difference in part of speech category
  (e.g. “werken” as a plural noun, and “werken” as the infinitive form of the verb).
• Difference in gender
  (e.g. “de idee” and “het idee”).
• Difference in inflectional morphology
  (e.g. “scheppen, geschept” and “scheppen, geschapen”).
• Difference in derivational morphology
  (e.g. “valkuil” = val+kuil, and “valkuil” = valk+uil).
    The identification number allows for a link to the corresponding entry in the
CELEX database and to the information (on morphological properties, frequency, etc.)
given there. It also enables the grouping of multiple entries for the same word form
(non-reducible pronunciation variants), which can be identified by their shared
identification number. Here are some examples for the first three fields
(the encoding of phonetic symbols will be discussed in the next sections).
      57212 chronologisch              xro2no2l’oGi2s
      57212 chronologisch              kro2no2l’oGi2s
      305327 tank                      t’ANk
      305327 tank                      t’ENk
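
The grouping of entries by identification number can be sketched as follows (Python, for illustration only). The four rows are the examples above, with an ASCII apostrophe standing in for the stress mark; the function name `group_by_id` is hypothetical.

```python
from collections import defaultdict

entries = [
    ("57212", "chronologisch", "xro2no2l'oGi2s"),
    ("57212", "chronologisch", "kro2no2l'oGi2s"),
    ("305327", "tank", "t'ANk"),
    ("305327", "tank", "t'ENk"),
]

def group_by_id(rows):
    """Collect the forms of all entries that share an identification number."""
    groups = defaultdict(list)
    for ident, _spelling, form in rows:
        groups[ident].append(form)
    return dict(groups)

variants_by_id = group_by_id(entries)
for ident, forms in variants_by_id.items():
    print(ident, forms)
```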

3.2    Phonetic symbol encoding
    The phonological and phonetic forms are represented using a particular encoding
of the International Phonetic Alphabet. Given the different status of the two
representations, however, their symbol sets differ slightly. In this section the encoding
of the phonetic forms is described, while the conventions for the phonological form
are given in the next section.
     The phonetic symbols for representing Dutch pronunciation are encoded in ASCII,
for the sake of portability and to facilitate their manipulation by computer programs.
An encoding scheme called YAPA was used to this effect [5]. Recently, the SAMPA [6]
encoding scheme has emerged as the de facto standard for phonetic symbol encoding
in ASCII. At the start of the project, this was not yet the case. The
differences between YAPA and SAMPA are listed in appendix A2.
    The list of symbols used in the FONILEX database is given in the following table.
Each phonetic symbol is represented by one character; diacritic signs are also
represented by one character, and follow the symbol they modify. Whitespace has no
particular meaning and is used only to improve readability. The check marks in the
columns labeled “phonetics” and “phonology” indicate that the symbol appears in the
phonetic and phonological transcriptions, respectively.

[5] The name YAPA stands for “Yet Another Phonetic Alphabet”, referring to the large number of
analogous encoding conventions. The YAPA scheme was adopted by all projects in the research
program “Korte termijnprogramma Spraak- en Taaltechnologie”.
[6] SAMPA was developed within the SAM project (“Speech Assessment Methods”, ESPRIT) by Prof.
Wells of University College London. It has been adopted by the EAGLES (“Expert Advisory Group
on Language Engineering Standards”) standardisation project.

IPA   YAPA   example         Phonetics   Phonology
i      i     dier               √           √
ɪ      I     pit                √           √
y      y     muur               √           √
ʏ      Y     put                √           √
u      u     doen               √           √
e      e     veel               √           √
ɛ      E     vel                √           √
ø      &     deur               √           √
œ      @     oeuvre ([@:])      √           √
o      o     voor               √           √
ɔ      O     log                √           √
a      a     aan                √           √
ɑ      A     lat                √           √
ə      @     de                 √           √
m      m     mes                √           √
ɱ      M     kamfer             √
n      n     nacht              √           √
ɲ      J     champagne          √
ŋ      N     teerling           √           √
p      p     pet                √           √
b      b     bal                √           √
t      t     taak               √           √
d      d     dan                √           √
k      k     kegel              √           √
ɡ      g     zakdoek            √           √
f      f     flink              √           √
v      v     voelen             √           √
s      s     sinds              √           √
z      z     zeel               √           √
ʃ      S     sjaal              √           √
ʒ      Z     garage             √           √
x      x     achter             √           √
ɣ      G     zegen              √           √
r      r     rel                √           √
l      l     leeft              √           √
w      w     wagen              √           √
j      j     jaagt              √           √
ɥ      9     intuïtie           √           √
h      h     huren              √           √
ʔ      ?                        √

                  Diacritics & suprasegmentals
                 IPA      YAPA                                   Phonetics   Phonology
                  ˈ         ‘          main stress                  √           √
                  ˌ         “          secondary stress             √           √
                  ː         :          long                         √           √
                  ◌̃         ~          nasal(ised)                  √           √
                  ◌̯         ^          non syllabic                             √
                  ◌̥         0          voiceless                    √

     Secondary stress is indicated only in compound words, for the main stressed
syllable of the weaker part of the compound (see section 3.4).
     The diacritic for voiceless is used only in /j̊/ in the diminutive suffix. This suffix
undergoes phonetic changes which differ from those observed in other contexts for
/jə/. Therefore /j̊/ could be considered to be a morphophoneme. For further details,
the reader is referred to the FONILEX scientific report of 1995.
    In the phonetic symbol set, loan sounds are represented as follows.

                                     Loan sounds
                            IPA          YAPA        example
                            œ              @:        oeuvre
                            ɔ              O:        rose
                            ɛ              E:        militair
                            ɔ̃              O~        congé
                            ɛ̃              E~        vaccin
                            ɑ̃              A~        croissant
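
For display purposes, a YAPA phonetic string can be turned back into IPA symbol by symbol. The following Python sketch assumes the symbol and diacritic tables above, ASCII `'` and `"` for the two stress marks, and the sequence `@:` for the vowel of “oeuvre” (rendered here with a length mark); the names `YAPA_TO_IPA` and `yapa_to_ipa` are hypothetical.

```python
YAPA_TO_IPA = {
    "i": "i", "I": "ɪ", "y": "y", "Y": "ʏ", "u": "u", "e": "e", "E": "ɛ",
    "&": "ø", "o": "o", "O": "ɔ", "a": "a", "A": "ɑ", "@": "ə",
    "m": "m", "M": "ɱ", "n": "n", "J": "ɲ", "N": "ŋ",
    "p": "p", "b": "b", "t": "t", "d": "d", "k": "k", "g": "ɡ",
    "f": "f", "v": "v", "s": "s", "z": "z", "S": "ʃ", "Z": "ʒ",
    "x": "x", "G": "ɣ", "r": "r", "l": "l", "w": "w", "j": "j",
    "9": "ɥ", "h": "h", "?": "ʔ",
    # diacritics and suprasegmentals (combining characters follow their base)
    "'": "ˈ", '"': "ˌ", ":": "ː", "~": "\u0303", "^": "\u032f", "0": "\u0325",
}

def yapa_to_ipa(text):
    """Convert a YAPA string to IPA; unknown symbols raise KeyError."""
    out = []
    i = 0
    while i < len(text):
        if text.startswith("@:", i):      # '@' plus ':' is the loan vowel, not a long schwa
            out.append("œː")
            i += 2
        elif text[i].isspace():           # whitespace carries no meaning in YAPA
            i += 1
        else:
            out.append(YAPA_TO_IPA[text[i]])
            i += 1
    return "".join(out)

print(yapa_to_ipa("t'ANk"))
```

The conversion is purely symbol-based: a phonetic diphthong such as `E:j` comes out as a long vowel plus glide, exactly as it is spelled in YAPA.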

     The YAPA encoding scheme was also adopted in order to deal with the ambiguity
in the representation of diphthongs in the International Phonetic Alphabet. In the IPA a
diphthong is represented as a sequence of vowels. Optionally the non-syllabic part of
the diphthong is indicated by a diacritic for non-syllabicity. In the absence of this
diacritic it is not clear whether a sequence such as /ɛɪ/ should be interpreted as a
sequence of two vowels (and hence as two syllables) or as a diphthong (and hence as
one syllable). To avoid this ambiguity, the Dutch diphthongs are represented in the
phonetic transcriptions of FONILEX by a vowel followed by a glide. In the
phonological transcription, however, the non-syllabic part of the so-called “true”
diphthongs (i.e. [ɛɪ̯], [ɔʊ̯], and [œʏ̯]) is indicated by a special symbol (^), such that it
can be distinguished from the syllabic vowels in rule application.

                        Diphthongs (true)
                       IPA     YAPA      YAPA      example
                               Phonet.   Phonol.
                       ɛɪ̯       E:j         E^    ijs
                       ɔʊ̯       O:w         O^    kous
                       œʏ̯       @:9         @^    huis
                        Diphthongs (false)
                       uj        uj          uj    roeit
                       oj        oj          oj    nooit
                       yw        yw          yw    duw
                       ew        ew          ew    eeuw
                       aj        aj          aj    draai
                       iw        iw          iw    nieuw

3.3    Phonological form encoding.
    The symbol set used for the phonological form consists of two parts. The first
part is a subset of the phonetic symbol set described in the previous section. Indeed,
context-dependent allophones need not be distinguished explicitly in the phonemic
representation; instead the phoneme can be used. For instance, the sound [ɱ] is absent
from the phonological set because it can be predicted from context; instead the
phoneme /m/ appears.
     The second part consists of a set of special diacritics which are required to predict
pronunciation variants. As for other diacritics, they are placed immediately after the
element they modify. These special diacritics do not have a counterpart within the
phonetic alphabet. For instance, the combination /i2/ indicates a unit which can be
realised phonetically as either [i] or [ɪ]. Such units would be called archiphonemes
in the terminology of André Martinet. Many phonological theories would assume
that these elements aren’t actually needed, but can be predicted from other phonological
properties such as stress, syllable structure, or morphological structure. We do not
want to adopt or refute this claim; we merely observe that a complete rule system
which correctly predicts archiphonemes (and their corresponding phonetic
alternations) for all word forms of Dutch seems to be lacking. Even if such rules
existed, it would be impossible to implement them for FONILEX, since the
phonological form of FONILEX indicates neither syllable structure nor morpheme
boundaries.
    The special diacritics used to encode pronunciation variants are ‘1’, ‘2’, ‘3’, ‘_’.
However, their exact interpretation depends upon the symbol they combine with. The
mnemonic interpretation of ‘1’, ‘2’, and ‘3’ is that the unit allows for one, two or three
variants, respectively. In the case of ‘1’ this implies that the sound cannot be deleted.
     The following list enumerates (mostly) regular processes encoded in FONILEX.
The first column shows the symbol used in the basic representation; the second
column lists its possible realisations at the phonetic level (where Ø indicates a zero
realisation, i.e. the segment is absent from the phonetic representation); and the last
column gives some examples of word forms in which the unit appears.

• Alternation between tense and lax vowel.

 Symbol      Variants   Examples
   i2         i ɪ       abrikoos, academisch, activist, alchimie, ambigu, artistiek, ...
   y2         y ʏ       formulier, puree, aluminium, bureau, communicatie, corpulent,
                        absolutisme, accuraat, abiturient, ...
   e2         e ɛ       mechanisme, sekretariaat, ethiek, econoom, benedictijn, theorie,
                        celibaat, cerebraal, preparaat, ...
   o2         o ɔ       proces, probleem, produkt, abdomen, absoluut, acrobaat, adolescent,
                        alcohol, anoniem, apotheker, ...
   a2         a ɑ       aalmoezenier, academisch, acrobaat, analyse, akoestiek, atelier,
                        avontuur, camera, casuist, satelliet, ...

In some contexts, alternations identical to those listed above can be observed, but are
less frequent or probable. Such cases have been encoded as /i1/, /y1/, /e1/, /o1/, and
/a1/.

 Symbol      Variants   Examples
   i1         i ɪ       fiets, dient, biecht, adrenaline, aspirine, benzine, ...
   y1         y ʏ       kostuum, volume, formule, bruto, bruusk, clausule, communie,
                        discussie, lucifer, module, ...
   e1         e ɛ       lyceum, ecosysteem, museum, sneeën, stereo, veearts, ...
   o1         o ɔ       bolero, co-auteur, cowboy, cholera, deposito, fotometer, kokos,
                        heroïsch, monitor, pony, prototype, zoals, ...
   a1         a ɑ       Afrika, ananas, archaïsch, cacao, galabal, maïs, parameter,
                        prozaisch, ...

• Alternation between tense vowel, lax vowel, and schwa.

 Symbol      Variants   Examples
   e3         e ɛ ə     referentie, periode, aceton, adequaat, Amerikaan, antecedent,
                        kerikaal, atheneum, cement, cinema, competitie, ...
   o3         o ɔ ə     peloton (N=1)

• Alternation between lax vowel, and schwa.

 Symbol      Variants   Examples
   I2         ɪ ə       basis, stoornis, salaris, kennis, begrafenis, botterik, lottis, service,
                        tennis, vuilnisbelt, ...
   Y2         ʏ ə       Bakkum, dattum (N=2)
   E2         ɛ ə       concentratie, prestatie, accepteren, terrein, termijn, advertentie,
                        agressief, centrale, anekdote, benzine, deskundig, ...
   O2         ɔ ə       (-or) motor, collector, factor, senator,... ; avond, avontuur
   A2         ɑ ə       almanak (N=1)

• Alternation between tense vowel, lax vowel, and (homorganic) glide.

 Symbol      Variants   Examples
   i3         i ɪ j     ambiance, stadium, criterium, audientie, axioma, kampioen, dubieus,
                        geniaal, grandioos, idioot, Indiaan, ...
   y3         y ʏ ɥ     actueel, situatie, evaluatie, gradueel, manueel, prostituée, intuïtief,
                        seksualiteit, ...
   u2         u w       bedoeïnen, desavoueren, pirouette, silhouet, tatoeëren, ...

Note that /u2/ appears instead of /u3/ because there is no phoneme /ʊ/ according to the
FONILEX phonology of Dutch.
• Alternation between schwa and zero.

  Symbol        Variants   Examples
    @2           ə Ø       ambtelijk, aanschouwelijk, dodelijk, beitelen, aftakelen, zuiveren,
                           bladeren, benaderen, beverig, meerdere, mindere, ...

• Alternation between diphthong and schwa.

  Symbol        Variants   Examples
   E^2           ɛj ə      belangrijk, bijzonder, mijnheer, seizoen, talrijk, surveillant, ...

• Deletion of /t/ and /d/.

  Symbol        Variants   Examples
    t2           t Ø       zwartst, absurdst, aartsvader, hartstocht, schertst, kortste, schildpad, ...
    d2           d Ø       aartsbisdom, geweldsdelicten, koortsdelirium, ...

• Deletion of /n/.

  Symbol        Variants   Examples
    n1           n Ø       lopend /lop@n1t/, dringend, rekent, agens, arend, accountant,
                           daarenboven, eens, nomen, tenzij, zijn, z’n, ...

• Insertion of schwa in consonant clusters (epenthesis).

  Symbol        Variants   Examples
    _            ə Ø       melk /mEl_k/, durfde, sterft, alarm, markt, arm, balk, Belg, berg,
                           scherm, vorm, perk, borg, darm, achtervolg, ...
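
The units listed above can be expanded mechanically into every combination of realisations. The Python sketch below does only that: it is not the FONILEX rule system, it applies no other rules, and it ignores speaking style. The table `UNITS`, the function name `variants` and the use of an empty string for Ø are assumptions made for illustration; phonetic spellings are in YAPA.

```python
import itertools
import re

UNITS = {
    # tense / lax vowel
    "i2": ["i", "I"], "y2": ["y", "Y"], "e2": ["e", "E"], "o2": ["o", "O"], "a2": ["a", "A"],
    "i1": ["i", "I"], "y1": ["y", "Y"], "e1": ["e", "E"], "o1": ["o", "O"], "a1": ["a", "A"],
    # tense / lax vowel / schwa
    "e3": ["e", "E", "@"], "o3": ["o", "O", "@"],
    # lax vowel / schwa
    "I2": ["I", "@"], "Y2": ["Y", "@"], "E2": ["E", "@"], "O2": ["O", "@"], "A2": ["A", "@"],
    # tense / lax vowel / glide
    "i3": ["i", "I", "j"], "y3": ["y", "Y", "9"], "u2": ["u", "w"],
    # schwa / zero, diphthong / schwa, deletions, epenthesis ("" is the zero realisation)
    "@2": ["@", ""], "E^2": ["E:j", "@"],
    "t2": ["t", ""], "d2": ["d", ""], "n1": ["n", ""], "_": ["@", ""],
    # true diphthongs have a single phonetic spelling
    "E^": ["E:j"], "O^": ["O:w"], "@^": ["@:9"],
}

# Longest units first, so that "E^2" is recognised before "E^"; "." catches plain symbols.
TOKEN = re.compile(
    "|".join(re.escape(u) for u in sorted(UNITS, key=len, reverse=True)) + "|.",
    re.S,
)

def variants(form):
    """Return every combination of realisations for a phonological form."""
    choices = [UNITS.get(token, [token]) for token in TOKEN.findall(form)]
    return ["".join(combo) for combo in itertools.product(*choices)]

print(variants("mEl_k"))
```

A form with several units grows quickly: three binary units already give eight combinations, which is one reason to store the compact phonological form rather than all variants.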

3.4        Stress
     1. In both phonetic and phonological transcriptions, primary word stress is
marked by a single left quote preceding the vowel of the stressed syllable. This is
somewhat unconventional, since the stress mark is usually put before the left syllable
boundary of the stressed syllable, rather than before the stressed vowel. The main
advantage of putting the stress mark before the vowel is that it is always clear where a
syllable’s vowel is; it is less obvious where the syllable starts (the idea of a discrete
boundary might even be perceptually implausible). Moreover the conventional
notation would have required syllable divisions, which aren’t included in FONILEX
(precisely because they aren’t obvious).
       “bioscoop”          /bi2o2sk’op/
       “eettafel”          /’ettaf@l/
      2. Secondary stress within a non-compound word (as for the first vowel in “bioscoop”) is not indicated.
    3. Some words can be pronounced using different stress positions. In such cases,
the phonological transcription will contain two stress marks. This approach was
preferred over the alternative solution of having multiple entries for the same word
form, one for each stress pattern. In the phonetic transcription only one stress mark
should be kept (the rule system currently doesn’t remove the extra stress mark).
      “daarvoor”       /d’arv’or/             [d’arv’or]
For instance, the word form “daarvoor” is given as /d’arv’or/ because the stress
can be either on the syllable [dar] or on [vor]. The phonetic transcription gives
[d’arv’or], although there are actually two pronunciation variants: [d’arvor] and
[darv’or].
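
A user who needs the individual stress variants can derive them from a doubly stressed phonetic form. The Python sketch below is an illustration, not part of the FONILEX rule system; it assumes an ASCII apostrophe as the primary stress mark, and the function name `stress_variants` is hypothetical.

```python
def stress_variants(form, mark="'"):
    """Return one variant per primary stress mark, keeping only that mark."""
    positions = [i for i, ch in enumerate(form) if ch == mark]
    if len(positions) < 2:
        return [form]                      # nothing to split
    result = []
    for keep in positions:
        result.append("".join(
            ch for i, ch in enumerate(form) if ch != mark or i == keep
        ))
    return result

print(stress_variants("d'arv'or"))
```

The secondary stress mark of compounds is a different character and is therefore left untouched.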
    For some words the Van Dale dictionary (“Groot Woordenboek van de Nederlandse
Taal”) mentions only one stress pattern, but an alternative stress position seems
acceptable. In these cases, two entries with an identical identification number are
included in the lexicon.
      “luitenant”      /l’@^t@nAnt/           (stress position as in the dictionary)
      “luitenant”      /l@^t@n’Ant/           (pronunciation variant)
   4. A double quote marks the secondary stress of compound words. Both stress
marks are present in the phonetic forms.
      “luitenant-generaal”     /l’@^t@nAntxe2ne3r”al/
                               [l’@:9t@nAntxener”al], etc.

4      Rule system

4.1    Phonological rules in FONILEX
    In the framework of generative phonology, phonological rules are formalised as
rewrite rules. This means that if a rule applies to an input string (i.e. a sequence of
phoneme symbols), some symbols in that string will be replaced (“rewritten”) by
others, resulting in an output string which differs from the input. This output in turn
becomes the input to the next rule, and so on. This general approach has important
implications for rule ordering. When a rule applies to a given string, this will affect
the behavior of all subsequent rules the structural description of which matches the
output of the former rule. When two rules A and B interact in this way, then applying
B after A doesn’t produce the same result as applying A after B. Therefore rule
ordering is essential. For a grammar containing n rules, there are n! possible rule
orderings.
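
The effect of rule ordering can be made concrete with two interacting rules. In the Python sketch below both rules and the input string are toy examples invented for illustration (they are not FONILEX rules): a /t/ deletion between n and k, and the assimilation of n before k.

```python
import itertools
import math
import re

# Toy rules: name -> (pattern, replacement)
RULES = {
    "t_deletion": (r"(?<=n)t(?=k)", ""),   # t is deleted between n and k
    "assimilation": (r"n(?=k)", "N"),      # n becomes N before k
}

def derive(form, order):
    """Apply the named rules one after the other, each on the previous output."""
    for name in order:
        pattern, replacement = RULES[name]
        form = re.sub(pattern, replacement, form)
    return form

for order in itertools.permutations(RULES):
    print(order, derive("Antk@r", order))

print(math.factorial(len(RULES)), "possible orderings")
```

With deletion first, the deletion creates the context for assimilation and the result is `ANk@r`; with assimilation first, the n is not yet adjacent to k and the result is `Ank@r`.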
    The phonology (including morphophonology) of Dutch (and Flemish) is fairly
complex, since it requires a large number of rules. The number of possible rule
orderings is accordingly high. As a matter of fact, it becomes hard to keep track of all
possible interactions between rules.
     To tackle this problem, the phonological rules of Dutch were classified according
to the properties appearing in their context description. This results in three classes of
rules:

(a) Rules with a context description containing both word and syllable boundaries (in
    addition to the phonemes, obviously).
(b) Rules with a context description containing word boundaries (but no syllable
    boundaries).
(c) Rules with a context description containing neither word nor syllable boundaries
    (but only phonemes).
     In a complete phonology of Dutch, these three sets of rules would be applied to a
morphological form (containing word and syllable boundaries), in the order given
above. After the application of the first set of rules, syllable boundaries can be deleted
from the representation, since they are no longer used by subsequent rules.
Similarly, word boundaries can be dropped once the second set of rules has been
applied. This approach has the advantage that, at a given level of abstraction (or
derivation), the representation is as simple as possible, since it only contains the
information that is needed for the application of the remaining rules. The formulation
of these rules is also simplified thanks to the economical representation (e.g. the
rules in the last set can simply ignore syllable boundaries).
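
The staged derivation described above can be sketched as follows in Python. Everything concrete here is an assumption made for illustration: the boundary symbols `.` and `#`, the three placeholder rules, and the input string are invented and do not come from FONILEX.

```python
import re

SYLL = "."   # hypothetical syllable-boundary symbol
WORD = "#"   # hypothetical word-boundary symbol

# Placeholder rules, one per class; they are not the actual rules of Dutch.
SET_A = [(r"d(?=\.|#)", "t")]                      # context mentions syllable and word boundaries
SET_B = [(r"#(?=[aeiouyAEIOUY&@])", "#?")]         # context mentions word boundaries only
SET_C = [(r"n(?=[kgxG])", "N")]                    # context mentions phonemes only

def apply_set(form, rules):
    """Apply an ordered list of (pattern, replacement) rules."""
    for pattern, replacement in rules:
        form = re.sub(pattern, replacement, form)
    return form

def derive(form):
    form = apply_set(form, SET_A)
    form = form.replace(SYLL, "")   # syllable boundaries are no longer needed
    form = apply_set(form, SET_B)
    form = form.replace(WORD, "")   # word boundaries are no longer needed
    form = apply_set(form, SET_C)   # these rules never have to skip a boundary
    return form

print(derive("#An.kEd#"))
```

Because the boundaries are removed before the last set applies, the pattern for n before a velar does not need to allow for an intervening `.`.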
     The phonological transcription which is included in FONILEX corresponds to the
representation before the last set of rules. It contains no syllable or word boundaries.
This choice is motivated by two considerations. The first is the primary purpose of
the pronunciation database, which is its use in language and speech technology. In
this context, a concrete, phonetic representation of speech is needed. The use of a
more abstract, phonological representation is justified by its inherently economical
encoding of pronunciation variants. However, it shouldn’t be too far removed from the
phonetic level, because in that case the mapping between both levels (via a procedure
or a transducer) becomes too expensive. To summarise, the choice of the
phonological representation is a compromise between readability, compactness and
the size of the rule set. The second consideration is that syllable boundaries aren’t
always easy to establish: there may be alternative syllable divisions for the same
sequence of phonemes (in particular for consonant clusters).
     In this perspective, pronunciation variants resulting from rules of the first two sets
should be included in the database as separate entries. For instance, the word
“glorieus” has several pronunciation variants, among which [ɣloriˈøs],
[ɣlorɪˈøs], and [ɣlorjˈøs]. These variants can be related to the presence of a
syllable boundary before the /ø/ vowel. A strict application of the principle of
separation between types of rules would result in multiple entries for many word
forms (there would be multiple entries for “glorieus”). The existence of optional rules
in rule sets (a) and (b), and the possible combinations of alternations within the same
word form, further complicate matters.
     In order to avoid multiple entries for word forms where possible, some rules from
sets (a) and (b) were “postponed” until after the phonological representation (i.e. after
(b)). This was achieved by replacing certain syllable or morpheme boundaries by
special diacritics. The original approach (using boundaries) is obviously to be
preferred from a theoretical point of view, but it requires a full-fledged account of
Dutch morphophonology, something which is not feasible within the context of this
research project.

4.2      The implementation of rewrite rules
      In generative phonology, rewrite rules take the general form
       name_of_rule: A → B / L _ R
This reads as follows: A rewrites as B in the context L _ R, i.e. preceded by L and
followed by R. In this general form, A, B, L, and R each refer to a symbol or a
sequence of symbols. A symbol corresponds to a phoneme or a feature specification
denoting a phoneme or a set of phonemes. L and R may be empty; in that case they do
not appear in the rule specification. For instance, a rule for place assimilation of the nasal
consonant /n/ could look like this:
       place assimilation: n → ŋ / _ k
The assimilation is triggered by the right context, so the left context is left empty. The
use of feature specifications results in concise rules which apply to classes of
phonemes. For instance, the previous rule can be modified to take into account all
contexts containing velar obstruents (plosives and fricatives):
                                     [ + cons ]
       place assimilation: n → ŋ / _ [ − son  ]
                                     [ + back ]

When B is an empty string, it is noted as Ø, and the rule corresponds to a deletion
rule. When A is empty, it is also noted as Ø, and we obtain an insertion rule. In the
following example, a glide is inserted between an /i/ and a vowel (indicated by the
feature [+syll]).
       glide insertion: Ø → j / i _ [+syll]
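
The general rule format maps directly onto a regular expression with lookaround assertions. The Python sketch below is one possible translation, not the implementation used in FONILEX; the function name `compile_rule`, the enumerated vowel set, the context chosen for /t/ deletion and the input strings are assumptions made for illustration (symbols are in YAPA).

```python
import re

def compile_rule(a, b, left="", right=""):
    """Turn A -> B / L _ R into a function on strings.

    a, left and right are regular-expression fragments; an empty a gives an
    insertion rule, an empty b a deletion rule. Python requires the left
    context of a lookbehind to have a fixed width.
    """
    lookbehind = f"(?<={left})" if left else ""
    lookahead = f"(?={right})" if right else ""
    pattern = re.compile(f"{lookbehind}{a}{lookahead}")
    return lambda form: pattern.sub(b, form)

# n -> N / _ [velar obstruent]
place_assimilation = compile_rule("n", "N", right="[kgxG]")
# Ø -> j / i _ [+syll], with the vowels enumerated
glide_insertion = compile_rule("", "j", left="i", right="[iIyYueE&oOaA@]")
# t -> Ø / f _ s (illustrative context only)
t_deletion = compile_rule("t", "", left="f", right="s")

print(place_assimilation("Ank@r"))
print(glide_insertion("kampiun"))
print(t_deletion("h'oftstYk"))
```

Since the context is expressed as lookarounds, it is not consumed by the match and does not have to be repeated in the replacement.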
     There were two options for the implementation of the FONILEX rule system. A
first approach is to write a dedicated computer program which reads a grammar
formulated in the generative rule format, and applies it to an input string. The
alternative is to use a general purpose tool instead. In this case, the format of the
rewrite rules has to be adapted to the syntax of the tool, and then the rules can be
applied to an input string.
    While the first approach offers the advantage of familiar style, readability and
conciseness, it requires programming and creates a problem of availability: in order to
use or modify the rule system, one would need a compiler for the programming
language that was used to build the grammar compiler and rule application program.
This problem does not arise if one uses common tools, which are generally available
(say, on Unix systems).
    Rewrite rules can be translated into string substitutions using regular expressions,
so tools such as sed, awk or perl can be used for the application of the rules.
Perl was selected because of its versatility and its rich syntax for regular
expressions.

    In perl, a substitution consists of four parts: the character s (indicating the
substitution command), the input pattern, the replacement pattern, and an optional
suffix; these parts are separated by slashes.
     Whereas in a rewrite rule the context appears once in a separate part of the rule
specification (after the slash), in a regular expression the context is part of the pattern
itself, and is to be repeated in the replacement part. The first rule (on place
assimilation) can be expressed as follows:
       s/nk/Nk/g;
    This reads as follows: the regular expression nk (i.e. n followed by k) is
substituted by the string Nk. The symbols in the input are in YAPA notation, so N
stands for [ŋ]. Suffix “g” indicates global substitution, i.e. substitution occurs each
time the pattern occurs in the input string (rather than for the first occurrence only).
    A major restriction of regular expressions is that they do not allow for the use of
feature matrices. As a result one has to enumerate all phonemes matching a feature
specification. The second rule would look as follows:
       s/n([kgxG])/N\1/g;
    The square brackets indicate a set of symbols. Therefore, the symbol following n
can be any one of those mentioned between the brackets. For instance, the regular
expression /n[kgxG]/ matches within “Ank@r” or “anGIft@”, and so on. By enclosing
some part of an expression in parentheses, one can indicate a substring. The substring
will be stored in a (predefined) variable, which can later be evaluated in the
replacement part using \1. So, if the input string is “Ank@r”, then the part between
parentheses matches “k” and the substitution pattern will be “Nk”. There can be
multiple stored substrings: \1, \2, \3, and so on. Similarly, the glide insertion rule
might look as follows (with all vowels [7] listed between brackets):
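
A minimal sketch of how feature specifications can be expanded into character classes before the substitutions are built, written in Python rather than Perl. The feature table covers only a handful of YAPA symbols and its values are assumptions made for illustration, as are the names `FEATURES` and `char_class` and the input strings; the vowel inventory is taken from the symbol table in section 3.2.

```python
import re

FEATURES = {}
for symbol in "pbtdfvsz":
    FEATURES[symbol] = {"cons": True, "son": False, "back": False}
for symbol in "kgxG":
    FEATURES[symbol] = {"cons": True, "son": False, "back": True}
for symbol in "mn":
    FEATURES[symbol] = {"cons": True, "son": True, "back": False}
FEATURES["N"] = {"cons": True, "son": True, "back": True}
for symbol in "iIyYueE&oOaA@":
    FEATURES[symbol] = {"syll": True}

def char_class(**spec):
    """Enumerate all symbols matching a feature specification as a regex set."""
    members = [s for s, f in FEATURES.items()
               if all(f.get(name) == value for name, value in spec.items())]
    return "[" + "".join(re.escape(s) for s in members) + "]"

velar = char_class(cons=True, son=False, back=True)
vowel = char_class(syll=True)

def place_assimilation(form):
    # the captured velar is repeated in the replacement via \1
    return re.sub(f"n({velar})", r"N\1", form)

def glide_insertion(form):
    # j is inserted between i and the captured vowel
    return re.sub(f"i({vowel})", r"ij\1", form)

print(velar)
print(place_assimilation("anGIft@"))
print(glide_insertion("kampiun"))
```

Keeping the feature table separate from the rules means a rule can still be read in terms of features, even though the regular expression itself only sees enumerated symbols.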
    As mentioned earlier, phonological rules are used in FONILEX to encode
pronunciation variants. Both the activation of a rule and its actual form may depend
on speaking style. The latter aspect appears in the word “salaris” /sAl’arI2s/,
which may be pronounced [sɑlˈarɪs] or [sɑlˈarəs]; the latter form will be
considered less formal. The first aspect (rule activation) appears in “hoofdstuk”
/h’oftstYk/, which may be pronounced either [hˈoftstʏk] or [hˈofstʏk],
with analogous phonostylistic connotations. Here the /t/ deletion rule is switched on
or off as a function of speaking style. Three speaking styles were defined (see section
    Each particular speaking style thus requires its appropriate rule set. The following
table lists phonological processes (either archiphonemes or rules), and gives the
phonetic realisations for each speaking style, as well as the number of records in
which the phenomenon appears. These statistics show that /Y2/, /o3/, /A2/ are so
idiosyncratic (they appear in one or two morphemes only) that they may be considered
non-regular variants; it would be preferable to remove these archiphonemes and store
the corresponding forms in duplicate records.

[7] The vowel [ə], noted @ in YAPA, has to be preceded by a backslash in the expression, because @ is a
metacharacter in perl.

                        unit / rule      high   normal   low   occurrences
                           i2             i       I       I        21694
                           y2             y       Y       Y          5524
                           e2             e       E       E          5210
                           o2             o       O       O        15824
                           a2             a       A       A          8521
                           i1             i       i       I          4317
                           y1             y       y       Y           527
                           e1             e       e       E           552
                           o1             o       o       O           402
                           a1             a       a       A           567
                           e3             e       E       @        10606
                           o3             o       O       @              9
                           I2             I       I       @          1131
                           Y2             Y       Y       @              2
                           E2             E       E       @          6973
                           O2             O       O       @           575
                           A2             A       A       @              2
                           i3             i       I       j          2799
                           y3             y       Y       9           809
                           u2             u       u       w             72
                           E^            E:j     E:j     E:        18378
                           @^            @:9     @:9     @:        10007
                           O^            O:w     O:w     O:w         5068
                          E^2            E:j     E:j     @            178
                           @2             @       @       Ø          8946
                            _             Ø       Ø      @           5965
                           t2             t       t       Ø           429
                           d2             d       d       Ø              8
                           n1             n       n       n        15033
                 final /n/ deletion:      n       Ø      Ø         48593

    In order to deal with multiple rule sets within one and the same program, three
boolean variables were defined, one for each speaking style (normal, low, high), and
control flow statements using these variables were added to the program:
     if ($low) {                          # for sloppy pronunciation...
         s/I2/\@/g;                       # ... replace ‘I2’ by ‘@’
     }
     if ($high || $normal) {              # for normal or spelling pronunciation
         s/2//g;                          # ... remove ‘2’, so we keep ‘I’
     }

      The complete listing of the program is given in appendix A1.
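
The table of realisations per speaking style can also be used directly as data. The Python sketch below is a table-driven alternative for illustration, not the structure of the Perl program in appendix A1: it covers only the units of the table (not final /n/ deletion or any other rule), writes Ø as an empty string, and uses the hypothetical names `REALISATIONS` and `realise`.

```python
import re

STYLES = ("high", "normal", "low")

# unit: (high, normal, low), copied from the table above
REALISATIONS = {
    "i2": ("i", "I", "I"), "y2": ("y", "Y", "Y"), "e2": ("e", "E", "E"),
    "o2": ("o", "O", "O"), "a2": ("a", "A", "A"),
    "i1": ("i", "i", "I"), "y1": ("y", "y", "Y"), "e1": ("e", "e", "E"),
    "o1": ("o", "o", "O"), "a1": ("a", "a", "A"),
    "e3": ("e", "E", "@"), "o3": ("o", "O", "@"),
    "I2": ("I", "I", "@"), "Y2": ("Y", "Y", "@"), "E2": ("E", "E", "@"),
    "O2": ("O", "O", "@"), "A2": ("A", "A", "@"),
    "i3": ("i", "I", "j"), "y3": ("y", "Y", "9"), "u2": ("u", "u", "w"),
    "E^": ("E:j", "E:j", "E:"), "@^": ("@:9", "@:9", "@:"), "O^": ("O:w", "O:w", "O:w"),
    "E^2": ("E:j", "E:j", "@"),
    "@2": ("@", "@", ""), "_": ("", "", "@"),
    "t2": ("t", "t", ""), "d2": ("d", "d", ""), "n1": ("n", "n", "n"),
}

# Longest units first, so that "E^2" wins over "E^".
UNIT = re.compile("|".join(
    re.escape(u) for u in sorted(REALISATIONS, key=len, reverse=True)))

def realise(form, style):
    """Replace every unit in a phonological form by its realisation for one style."""
    column = STYLES.index(style)
    return UNIT.sub(lambda m: REALISATIONS[m.group(0)][column], form)

for style in STYLES:
    print(style, realise("sAl'arI2s", style))
```

Keeping the realisations in one table makes it easy to experiment with another combination per style, at the cost of losing the explicit rule ordering of the substitution-based program.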

4.3      Limitations
    In continuous speech, assimilation processes occur at word boundaries. These are
not modelled by FONILEX, because FONILEX is a database of isolated word forms, and
the rule system does not have access to the context in which a word form appears.
Word final devoicing has been applied to the words in the database. Continuous
speech is not a concatenation of isolated words...

5        Copyright notice
    All files in the FONILEX distribution (the database itself, the scripts) can be used
freely for academic and commercial purposes alike, provided that the origin of the
data is mentioned in all publications, both scientific and commercial.
    An exception to this general principle concerns the field in the FONILEX database
containing the spelled word form itself (i.e. field 2). This information is the property of the
Max-Planck institute, Nijmegen, and cannot be used in commercial applications
without their prior agreement (Contact: Richard Piepenbrock, Max-Planck institute,
Wundtlaan 1, 6525 XD Nijmegen, The Netherlands).


A1: The Perl program “rewrite”

# Converts FONILEX transcription into pronunciation variants.
# This is for Perl 5.003. For earlier versions of Perl, remove
# the backslash preceding the symbol @ in regular expressions.

sub usage { print
"Rewrite FONILEX form into various pronunciation variants.\
Piet Mertens 1996-1997 (Last updated: 12/09/97) \n\
usage 1: rewrite [-d] -M <input_string> \
    This rewrites one (double quoted) string in mode M,\
    where M is one of the following: \
       h      high   spelling pronunciation \
       n      normal normal pronunciation \
       l      low    sloppy pronunciation \
       a      all 3 modes above \
    Option -d display derivation (for debugging) \n\
usage 2: rewrite -f <filename> \
    This rewrites all words in the file in all 3 modes.\
    Each line of input file should contain 3 fields (tab separated):\
       1. identification number, 2. word spelling, 3. Fonilex form \
       e.g. 12345    uitspraak    '\@^tsprak \n";
}

$arg = shift(@ARGV);
if ($arg eq "-d") {                # -d : enable debugging (verbose)
    $debug = 1;
    $arg = shift(@ARGV);
} else { $debug = 0; }             # disable debugging

if ($arg eq "-f") {               # -f: read input from named file
    $fname = shift(@ARGV); # get filename from program args
    open(FILE, "$fname") || die "cannot open $fname\n";
    $linenr = 0;           # line number in file
    while (<FILE>) {
       $linenr++;          # count the line just read
       chop;               # remove trailing newline
       @fields = split(/\t/);     # split line into fields
       if ($#fields +1 != 3) { # check number of fields
           warn "need 3 fields in line $linenr of file $fname\n";
       } else {
           &derive_all($_);       # rewrite in all 3 modes
                           # put result in $out{}
           print "$idnum\t$word\t$form\t";
           print "$out{normal}\t$out{high}\t$out{low}\n";
       }
    }
} elsif ($arg eq "-a") {            # -a : rewrite 1 form
    $form = shift(@ARGV);      # get form from program args
    foreach $mode ('high','normal','low') {
       $out{$mode} = &rewrite($form, $mode);
    }
    print "$form NORMAL $out{normal} HIGH $out{high} LOW $out{low}\n\n";
} else {
    $form = shift(@ARGV);      # get next argument

    if    ($arg eq   "-n") { $mode = 'normal'; }
    elsif ($arg eq   "-h") { $mode = 'high'; }
    elsif ($arg eq   "-l") { $mode = 'low'; }
    else { &usage; exit 1; }   # unknown option: show help and stop

    print "$form ", &rewrite($form, $mode), "\n";
}

sub derive_all {
    local($line) = @_ ;      # function parameters

    # get fields from line
    ($idnum, $word, $form) = split(/\t/, $line);
    # Derive normal pronunciation first
    $out{normal} = &rewrite($form, 'normal');
    # Derive high and low pronunciations
    foreach $mode ('high','low') {
       $out{$mode} = &rewrite($form, $mode);
       # Replace by "" if identical to normal mode
       if ($out{$mode} eq $out{normal}) {
           $out{$mode} = "";
       }
    }
}

sub rewrite {
  local($string, $mode) = @_ ;   # function parameters
  $_ = $string;      # use default variable $_ for substitutions

    if ($mode eq 'normal') { $normal=1; $high=0; $low=0; }
    if ($mode eq 'high')   { $normal=0; $high=1; $low=0; }
    if ($mode eq 'low')    { $normal=0; $high=0; $low=1; }

if ($debug) { print "\n\nInput:$string Mode:$mode \n"; }

$Rule = "glide insertion";
# Order: before vowel reduction
# (e.g. "reactie" > [r@jAksi] vs. "geaardheid" > [G@arthE:jt] )

    if ($normal || $low) {
       s/([ieIE])(['iyue&oaIYE\@OA])/\1j\2/g ;
       s/([y&Y])(['iyue&oaIYE\@OA])/\1=9\2/g ; s/=//g;
       s/([uoO])(['iyue&oaIYE\@OA])/\1w\2/g ;
       s/([eIE])([123])(['iyue&oaIYE\@OA])/\1\2j\3/g ;
       s/Y([12])(['iyue&oaIYE\@OA])/Y\1=9\2/g ; s/=//g;
       s/([oO])([12])(['iyue&oaIYE\@OA])/\1\2w\3/g ;
       s/(i)([12])(['iyue&oaIYE\@OA])/\1\2j\3/g ;
       s/(y)([12])(['iyue&oaIYE\@OA])/\1\2=9\3/g; s/=//g;
    }

# Closed vowels which can be reduced to glides
    if ($normal) {
       # [rule body missing in this listing]
    }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "n-deletion";
# Order: before nasal assimilation ("eigenbelang")
#        before degemination, h-deletion

# 1. /n/ deletion in diminutive
    if ($low) {
       s/\@ntj0/\@tj0/g ;
    }

# 2. /n/ deletion before consonant
    if ($normal || $low) {
       s/\@n([mnNptkbdgfsSxvzZGrljwh])/\@\1/g ;
    }

# 3. final /n/ deletion
    if ($normal || $low) {
       # [rule body missing in this listing]
    }

# /n1/ can not be deleted. Later rules shouldn't touch /n/.
       s/n1/n/g ;

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "nasal assimilation";
# Place assimilation of nasal consonants
# Order: before degemination and after n-deletion

    if ($normal || $low) {
       s/n([mpb])/m\1/g ; # not [w]: e.g. "onweer"
       s/n([kgxG])/N\1/g ;
       s/([mn])([fv])/M\2/g ;
       s/nj/Jj/g ;
    }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "glottal stop insertion"; # e.g. "beademing"
# Order: before vowel reduction

    if ($high || $normal) {
       # [rule body missing in this listing]
    }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "vowel reduction";
# Order: after glide-insertion, after glottal stop insertion

    if ($low) {
       s/i1/I/g ;
       s/y1/Y/g ;
       s/e1/E/g ;
       s/o1/O/g ;
       s/a1/A/g ;
    }
    if ($normal || $low) {
       s/i2/I/g ;
       s/y2/Y/g ;
       s/e2/E/g ;
       s/o2/O/g ;
       s/a2/A/g ;
    }
    if ($normal) {
       s/e3/E/g ;
       s/o3/O/g ;
       s/i3/I/g ;
       s/y3/Y/g ;
    }
    if ($low) {
       s/[IYEOA]2/\@/g;# kennis,notaris,motor,almanak,bakkum
       s/\@2//g;    # schwa deletion
    }

# Remove all diacritics!
    if ($high || $normal) {
       s/[123]//g ; # remove diacritics => use tense allophone
    }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "schwa insertion";
# Order: before degemination

    if ($low) {
       # [rule body missing in this listing]
    }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "degemination"; # i.e. identical consonant reduction
# Order: before diphthong realisation

    if ($low) {
       # [rule body missing in this listing]
    }
    if ($normal || $low) {
       s/mm/m/g; s/nn/n/g; s/pp/p/g; s/bb/b/g;
       s/tt/t/g; s/dd/d/g; s/kk/k/g; s/gg/g/g;
       s/ff/f/g; s/ss/s/g; s/SS/S/g; s/xx/x/g;
       s/ww/w/g; s/jj/j/g; s/ll/l/g; s/rr/r/g;
    }
    if ($high) {
       s/mm/m:/g; s/nn/n:/g; s/pp/p:/g; s/bb/b:/g;
       s/tt/t:/g; s/dd/d:/g; s/kk/k:/g; s/gg/g:/g;
       s/ff/f:/g; s/ss/s:/g; s/SS/S:/g; s/xx/x:/g;
       s/ww/w:/g; s/jj/j:/g; s/ll/l:/g; s/rr/r:/g;
    }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "diphthongs";
# Note: no reduction for /O^/: e.g. mevrouw, houd, blauw, paus,..

    if ($high || $normal) {
       s/E\^/E:j/g ;
       s/\@\^/\@:9/g ;
       s/O\^/O:w/g ;
    }
    if ($low) {
       s/E\^/E:/g ;
       s/\@\^/\@:/g ;
       s/O\^/O:w/g ;
    }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "t-deletion before j,j0";
# Order: before palatalisation of j0, and before vowel-to-glide
# e.g. kAstj0@ -> kAS@, tOxtj0@ -> tOxj0@, lIftj0@ -> lIfj0@

    if ($low) {
       # [rule body missing in this listing]
    }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "Plosive cluster reduction";
# e.g. pr'Ompt -> pr'Omt

    if ($low) {
       s/([mnNptkbdg])([ptkbdg])([ptkbdg])([ptkbdgfsSxvzZG])/\1\3\4/g ;
       s/([mnNptkbdg])([ptkbdg])([ptkbdgfsSxvzZG])/\1\3/g ;
    }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "t,d-deletion";
# e.g. h'oftstYk Ontst'Ont

     if ($low) {
        s/([fsSxvzZG])([td])([ptkbdgfsSxvzZG])/\1\3/g ;
     }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "vowel-to-glide";
# vowels realized as semi-consonants, e.g. b'Asti5an -> b'Astjan
# Order: after t-deletion with j,j0

     if ($low) {
        s/i3/j/g;    s/jj/j/g;
     }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "palatalisation";
# palatalisation of diminutive j0, e.g. G'Astj0\@

     if ($normal || $low) {
        s/tj0/tS/g ;
        s/sj0/S/g ;
     }

if ($debug) { print "$Rule:\t$_\n"; }

$Rule = "h-deletion";
# also: /h/->g e.g. veiligheid -> vEjl\@GEjt

     if ($low) {
        # [rule body missing in this listing]
     }

if ($debug) { print "$Rule:\t$_\n"; }

    return $_ ;      # final form in $_
}

A2: The Perl program “yapa2sampa”
The following Perl script converts the FONILEX database to the SAMPA encoding
convention. It is used as follows:
     $ ./yapa2sampa fonilex.iso > database.sampa
where fonilex.iso is the FONILEX database with all 6 fields.

# Convert YAPA to SAMPA notation
# This should be applied to PHONETIC fields only.

while (<>) {
    chop;                    # remove trailing newline, re-added when printing
    ($IdNum,$Ortho,$Phono,$V1,$V2,$V3) = split(/\t/, $_, 6);
    $V1 = &yapa2sampa($V1);
    $V2 = &yapa2sampa($V2);
    $V3 = &yapa2sampa($V3);
    $_ = join("\t", $IdNum,$Ortho,$Phono,$V1,$V2,$V3);
    print "$_\n";
}

# Convert YAPA to SAMPA
sub yapa2sampa {
    local($in) = @_ ;        # function parameters
    $_ = $in;

    # order of substitutions is relevant
    # [substitutions missing in this listing]

    return $_;
}
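
The comment in yapa2sampa notes that the order of substitutions is relevant. The
sketch below illustrates why, using a two-entry placeholder table that is NOT the
YAPA-to-SAMPA mapping; the function names are hypothetical as well. When one
replacement produces a symbol that another rule also rewrites, the result depends on
the order in which the rules run. A single-pass table lookup is shown as one possible
way to avoid this; it is an alternative technique, not the approach of the script above.

```perl
use strict;
use warnings;

# Placeholder table for illustration only; NOT the real YAPA-to-SAMPA mapping.
my %map = ('a' => 'b', 'b' => 'c');

# Sequential global substitutions, applied in the order given by the caller.
sub convert_sequential {
    my ($s, @order) = @_;
    for my $from (@order) {
        $s =~ s/\Q$from\E/$map{$from}/g;
    }
    return $s;
}

# Single pass: each input symbol is looked up exactly once, so the output of
# one replacement is never rewritten by another.
sub convert_single_pass {
    my ($s) = @_;
    # Longest keys first, so multi-character symbols would win over their prefixes.
    my $alt = join '|',
              map  { quotemeta($_) }
              sort { length($b) <=> length($a) } keys %map;
    $s =~ s/($alt)/$map{$1}/g;
    return $s;
}

print convert_sequential("ab", 'a', 'b'), "\n";   # rule a first, then rule b
print convert_sequential("ab", 'b', 'a'), "\n";   # rule b first, then rule a
print convert_single_pass("ab"), "\n";
```

With the placeholder table, the two sequential orders give different results for the
same input, whereas the single-pass version always agrees with the order in which
the rewritten symbols are not touched again.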
