An Indonesian Phonetically Balanced Sentence Set for Collecting
Speech Database

Suyanto
Jurusan Teknik Informatika, Sekolah Tinggi Teknologi Telkom
Jl. Telekomunikasi, Dayeuh Kolot, Bandung 40257, Telp/Fax: +62 22 756 5931
E-mail: suy@stttelkom.ac.id


Abstract
      This paper describes the development of a phonetically balanced sentence set for
      collecting an Indonesian speech database, intended for building Automatic Speech
      Recognition (ASR)-based systems. Two methods, Least-to-Most Greedy with standard
      scoring and Least-to-Most Greedy with modified scoring, were applied to a mother
      sentence set of 500,353 sentences to find the minimum phonetically balanced sentence
      set. The second method produced sentence set 2, which contains fewer sentences (2,691)
      than sentence set 1 (2,969) generated by the first method. However, set 2 contains
      considerably more words (16,382) than set 1 (14,338). Therefore, set 1 was selected as
      the minimum phonetically balanced sentence set.

      Keywords: Indonesian phonetically balanced sentence set, speech database, automatic
                  speech recognition, least-to-most greedy algorithm


1. Introduction
       Automatic Speech Recognition (ASR) technology is expected to play an important role in
the future telecommunication industry. Many nations have invested considerable effort in
collecting speech databases and developing large-vocabulary (more than 10,000 words) ASR for
their languages.
       In Indonesia, research on ASR started in 2003. So far, Indonesian speech databases are
available in only three types: isolated digits, connected digits, and very simple dialog words.
A rapid development of Indonesian Large Vocabulary Continuous Speech Recognition (LVCSR)
using the cross-language approach yielded low accuracy. Sakti, Markov, and Nakamura (2005)
developed an Indonesian LVCSR using an English speech database; the accuracy of the resulting
system was only 88.97% for 70 words from very simple dialogs. Therefore, it is better to
develop an Indonesian speech database for Indonesian LVCSR.
       This research focused on developing an Indonesian phonetically balanced sentence set
from a mother sentence set of 500,353 sentences collected from two widely read Indonesian
daily newspapers, Kompas and Tempo. The objective sentence set is expected to be of minimum
size while covering as many triphone types (sequences of three phonemes) as possible. The
sentence set will be used in collecting an Indonesian speech database.

2. Algorithms to search the minimum sentence set
      Zhang and Nakamura (2003) studied four algorithms for searching for the minimum
phonetically balanced sentence set. The Least-to-Most (LTM) Greedy algorithm using Standard
Greedy produced the most efficient sentence set. In this research, that algorithm is
implemented and studied further by modifying its scoring formula.




Jurnal Teknologi Industri Vol. XI No. 1 Januari 2007: 59-68


a. The Standard Greedy algorithm
       This algorithm selects sentences from a mother sentence set according to a unit covering
score. The units can be monophones, biphones, or triphones. In the early iterations, it finds
many new units because the list of units to be covered still contains many uncovered units. In
later iterations, however, the covering rate decreases as the number of uncovered units
shrinks. The method has difficulty finding infrequent units because it uses no information
about unit frequencies.

The Standard Greedy algorithm
   1. Step 1: A = {mother sentence set}, B = {} (empty), U = {unit list to be covered}.
   2. Step 2: Compute the covering score S_i for each sentence i in A according to the
      following formula:
            $$S_i = \frac{\text{types of uncovered units in } i}{\text{total tokens of units in } i} \qquad (1)$$
   3. Step 3: Select the sentence s_h with the highest score, insert it into B, and delete all
      newly covered units in s_h from U.
   4. Step 4: Repeat steps 2 and 3 until U becomes empty or all S_i equal zero.
   5. Step 5: B is the objective minimum sentence set.
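
The five steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this research: the function names, the dictionary-based representation of the mother sentence set, and the toy data in the usage note are all assumptions made for the example.

```python
def covering_score(tokens, uncovered):
    """Equation (1): uncovered unit types in the sentence / total unit tokens."""
    if not tokens:
        return 0.0
    return len(uncovered.intersection(tokens)) / len(tokens)


def standard_greedy(sentences, units_to_cover):
    """Select sentence ids greedily by covering score.

    sentences      -- dict mapping a sentence id to its list of unit tokens
                      (hypothetical representation of the mother set A)
    units_to_cover -- iterable of unit types (the list U)
    """
    uncovered = set(units_to_cover)   # U
    remaining = dict(sentences)       # A (copied so the caller's dict is untouched)
    selected = []                     # B

    while uncovered and remaining:
        # Step 2 and step 3: score every remaining sentence and pick the best one.
        best = max(remaining, key=lambda sid: covering_score(remaining[sid], uncovered))
        if covering_score(remaining[best], uncovered) == 0:
            break                     # all scores are zero: nothing new can be covered
        selected.append(best)
        # Delete the newly covered units from U.
        uncovered -= set(remaining.pop(best))
    return selected
```

For instance, with the toy input `{"s1": ["a", "b", "c"], "s2": ["a", "b", "b"], "s3": ["c", "d", "d", "d"]}` and units `{"a", "b", "c", "d"}`, the sketch first selects `s1` (score 3/3) and then `s3`, the only sentence that still contains an uncovered unit.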

b. The Least-to-Most (LTM) Greedy algorithm
       First, the uncovered units are sorted by their frequency of appearance in ascending
order. Then, a text subset is formed from the sentences that contain at least one token of the
least frequent uncovered unit. Next, the best sentence is selected from this subset. This
algorithm can maintain its covering rate until all units are found, and it is expected to
output fewer sentences than the Standard Greedy algorithm.

Method 1: LTM Greedy using standard greedy
   1. Step 1: For any unit u_k in U, A_{u_k} = {all sentences containing at least one token of u_k}.
   2. Step 2: Put all the to-be-covered units in U into a queue in ascending order of frequency,
      Q = {u_1, u_2, …, u_w}, where u_1 is the least frequent unit and u_w the most frequent one in A.
   3. Step 3: From the sentence subset A_{u_1}, use the Standard Greedy algorithm to find the best
      sentence s_h and insert it into B.
   4. Step 4: Delete all the newly covered units in s_h from the queue Q.
   5. Step 5: Repeat steps 3 and 4 until Q is empty.
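
A minimal Python sketch of method 1 is given below, under stated assumptions: the mother sentence set is represented as a dictionary from a hypothetical sentence id to a non-empty list of unit tokens, U is taken to be every unit type occurring in that set, and frequency ties in the queue are broken alphabetically. None of these details are specified in the text, and the names are illustrative only.

```python
from collections import Counter


def standard_score(tokens, uncovered):
    """Equation (1): uncovered unit types / total unit tokens in the sentence."""
    return len(uncovered.intersection(tokens)) / len(tokens)


def ltm_greedy(sentences, score=standard_score):
    """Least-to-Most greedy selection; returns the selected sentence ids in order."""
    # Unit frequencies over the whole mother set A.
    freq = Counter(unit for tokens in sentences.values() for unit in tokens)

    # Step 1: A_{u_k} = all sentences containing at least one token of u_k.
    containing = {}
    for sid, tokens in sentences.items():
        for unit in set(tokens):
            containing.setdefault(unit, []).append(sid)

    # Step 2: queue Q from the least to the most frequent unit.
    queue = sorted(freq, key=lambda u: (freq[u], u))
    uncovered = set(queue)
    selected = []  # B

    # Step 5: walk the queue. Skipping units that are already covered has the
    # same effect as deleting them from Q in step 4.
    for unit in queue:
        if unit not in uncovered:
            continue
        # Step 3: best sentence within the subset of the least frequent uncovered unit.
        best = max(containing[unit], key=lambda sid: score(sentences[sid], uncovered))
        selected.append(best)
        # Step 4: remove the newly covered units.
        uncovered -= set(sentences[best])
    return selected
```

Passing a different `score` function changes only step 3, which is where the two methods compared in this paper differ.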

c. Modified scoring
        With the standard scoring in equation (1), long sentences tend to receive low scores in
the later iterations, because few of their units are still uncovered while their token counts
remain large. Thus, the LTM Greedy using standard greedy (method 1) tends to select short
sentences in the later iterations, so that it needs many sentences to form the minimum
phonetically balanced sentence set. To address this issue, the scoring formula was modified to:
            $$S_i = \text{types of uncovered units in } i \qquad (2)$$
      With this modified scoring, the LTM Greedy tends to select the sentence with maximum
coverage regardless of unit redundancy. Thus, the LTM Greedy with modified scoring may output
fewer sentences than the LTM Greedy with standard scoring. In the rest of this paper, the LTM
Greedy with modified scoring is called method 2.
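
The difference between the two scoring formulas can be illustrated with a small Python sketch. The unit names and the two toy sentences are invented for the example; only the two formulas come from equations (1) and (2).

```python
def standard_score(tokens, uncovered):
    """Equation (1): uncovered unit types divided by total unit tokens."""
    return len(uncovered.intersection(tokens)) / len(tokens)


def modified_score(tokens, uncovered):
    """Equation (2): uncovered unit types only, ignoring sentence length."""
    return len(uncovered.intersection(tokens))


def preferred(sentences, score, uncovered):
    """Return the id of the sentence that the given scoring formula ranks highest."""
    return max(sentences, key=lambda sid: score(sentences[sid], uncovered))


# Hypothetical late-iteration state: three unit types are still uncovered.
uncovered = {"u1", "u2", "u3"}
candidates = {
    "short": ["u1", "c1"],                          # 1 new type in 2 tokens
    "long": ["u1", "u2", "c1", "c2", "c3", "c4"],   # 2 new types in 6 tokens
}
```

Here `preferred(candidates, standard_score, uncovered)` picks the short sentence (1/2 against 2/6), while `preferred(candidates, modified_score, uncovered)` picks the long one (2 against 1), which matches the behaviour described above.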

3. Experimental setup
      Indonesia is the fourth most populous nation in the world, with around 240 million
people living on more than 13 thousand islands (InfoPlease, 2004). There are approximately 300
ethnic groups in Indonesia, speaking 669 languages and dialects (Tan, 2005). Nevertheless,
Indonesian is the language of unity, formed from these hundreds of languages. Most Indonesian
people speak Indonesian as a second language, and only 7% of the population speaks it as a
mother tongue. Today, Indonesian ranks around sixth or seventh in size among the world's
languages (Quinn, 2006).

a. Indonesian Writing and spelling
       In modern writing, the Indonesian language uses the Roman (Latin) script, with the same
26 letters as English. In addition, the symbol “-” is used in reduplicated or plural words,
such as “bersama-sama” (together) and “orang-orang” (people), which comes from “orang”
(person). In the beginning, the spelling of Malay was chaotic, but it eventually stabilized,
essentially following the conventions of Dutch spelling. Small adjustments were made to this
spelling in 1947 (the so-called Soewandi spelling). The “Ejaan Yang Disempurnakan (EYD)”
(Updated and Improved Spelling) was implemented in 1972. The EYD unified the spelling of the
Indonesian and Malaysian variants of the language (Quinn, 2006), and it is still in use today.
       The complete Indonesian phoneme set is listed in Table 1. The related English phonemes
are assigned by similarity, or closest match, of their International Phonetic Alphabet (IPA)
symbols.

                Table 1. The Indonesian phonemes and related English phonemes using
                          International Phonetic Alphabet (IPA).
                 Index   Indonesian   English      Example
                  1.     a            aa           father
                  2.     e            ah, ae       ten
                  3.     ê            ah, ax       learn
                  4.     i            ih, iy, ix   see, happy, sit
                  5.     o            ow, ao       got, saw
                  6.     u            uh, uw       put, too
                  7.     ay           ay           five
                  8.     aw           aw           now
                  9.     ey           ey           say
                 10.     oy           oy           boy
                 11.     b            b            bad
                 12.     c            ch           chain
                 13.     d            d, dx, dh    did
                 14.     f            f, v         fall, van
                 15.     g            g            got
                 16.     h            hh           hat
                 17.     j            jh           jam
                 18.     k            k            keep
                 19.     l            l            leg
                 20.     m            m            man
                 21.     n            n            no
                 22.     p            p            pen
                 23.     r            r            red
                 24.     s            s            so
                 25.     t            t, th        tea
                 26.     w            w            wet
                 27.     y            y            yes
                 28.     z            z, zh        zoo
                 29.     kh           -            -
                 30.     ng           ng           sing
                 31.     ny           -            -
                 32.     sy           sh           share

       The phoneme set used in this research contains the 32 phonemes in Table 1 plus a silence
denoted by /sil/, giving 33 phonemes in total. The phonetic balance of a sentence set can be
measured in monophones, biphones (left and right), or triphones. However, in the speech signal
the waveform of a phoneme depends on its preceding and following phonemes. Thus, the triphone
is used as the unit for measuring phonetic balance. The triphone units are formed in a
cross-word scheme, since the desired sentence set will be used to develop an ASR system for
continuous speech. Examples of the measurement units are shown in Table 2. An Indonesian
pronunciation lexicon of 37,500 words is used to convert a sentence into its measurement
units.

Table 2. Examples of phonetically balancing measurement units.
 Sentence           Selamat pagi. (Good morning)
 Monophone          sil s ê l a m a t p a g i sil
 Left biphone       sil sil-s s-ê ê-l l-a a-m m-a a-t t-p p-a a-g g-i sil
 Right biphone      sil s+ê ê+l l+a a+m m+a a+t t+p p+a a+g g+i i+sil sil
 Triphone           sil sil-s+ê s-ê+l ê-l+a l-a+m a-m+a m-a+t a-t+p t-p+a p-a+g a-g+i g-i+sil sil
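
The unit sequences in Table 2 can be reproduced with a short Python sketch. It assumes that the sentence has already been converted to a flat phoneme list by a pronunciation lexicon (not shown here) and that /sil/ is added only at the sentence boundaries; the function name is illustrative.

```python
def context_units(phonemes):
    """Build monophone, biphone and triphone unit lists for one sentence.

    phonemes -- list of phoneme symbols for the whole sentence, without silences;
                word boundaries are ignored (cross-word scheme).
    """
    seq = ["sil"] + list(phonemes) + ["sil"]
    inner = range(1, len(seq) - 1)          # positions of the real phonemes
    return {
        "monophone": seq,
        # left-context units: previous-current
        "left_biphone": ["sil"] + [f"{seq[i - 1]}-{seq[i]}" for i in inner] + ["sil"],
        # right-context units: current+next
        "right_biphone": ["sil"] + [f"{seq[i]}+{seq[i + 1]}" for i in inner] + ["sil"],
        # triphones: previous-current+next
        "triphone": ["sil"]
        + [f"{seq[i - 1]}-{seq[i]}+{seq[i + 1]}" for i in inner]
        + ["sil"],
    }


# "Selamat pagi" as the phoneme sequence shown in Table 2.
units = context_units("s ê l a m a t p a g i".split())
```

Joining `units["triphone"]` with spaces gives the Triphone row of Table 2, including the cross-word triphones a-t+p and t-p+a that span the boundary between the two words.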

b. The Indonesian mother sentence set
        The raw text sources used in this research are one year (2001) of headline news from
two national daily newspapers: Kompas and Tempo. These newspapers represent the Indonesian
language as commonly used in most regions of Indonesia, since they are the most widely read by
the Indonesian people. Removing stray symbols (those not among the 26 letters plus the dash
“-”) from the raw text produced 1.2 million sentences. Filtering these 1.2 million sentences
with an Indonesian word list of 37,500 words produced 500,353 clean sentences as the mother
sentence set. The mother sentence set contains 2,966,352 word occurrences, but only 35,951
distinct words. The statistics of the mother sentence set are shown in Table 3.

                Table 3. Statistics of the mother sentence set.
                  Number of sentences                                  500,353
                  Total number of words                              2,966,352
                  Number of distinct words                              35,951
                  Average number of phonemes per sentence                35.79
                  Maximum number of phonemes in a sentence                  98
                  Minimum number of phonemes in a sentence                   6
                  Number of triphones                               18,970,019
                  Number of triphone types (distinct triphones)          9,668
      Ambiguous words in the mother sentence set should be replaced by the appropriate terms
according to the context of the sentence in order to obtain the correct triphone units. This
step was carried out manually, since it is too hard to check the terms contextually with a
computer program. For example, the word “apel” occurs with different meanings in the following
two sentences:
1. “Apel berlangsung dari pagi hingga malam.”
    (The call for readiness lasted from morning until evening.)
2. “Apel dan anggur.” (Apple and grape.)

        From the 32 phonemes plus /sil/, there are 33 x 32 x 33 + 1 = 34,849 possible triphone
types: 33 left contexts, 32 non-silence center phonemes, and 33 right contexts, plus a single
context-independent unit sil. Triphone a-sil+a is treated as identical to triphone a-sil+b,
since the waveform of /sil/ is independent of the preceding and following phonemes. Thus, the
single unit sil represents all 33 x 1 x 33 = 1,089 triphone types with /sil/ in the middle
position. However, not all possible triphone types are permitted in the Indonesian language.
Some triphones, such as a-a+a, b-b+b, z-z+b, z-z+z, etc., never appear in any Indonesian
sentence. It is difficult to calculate the exact number of permissible triphone types. Hence,
the 9,668 triphone types found in the mother sentence set should not be taken as the full set
of permissible triphone types.
        The distribution of triphones is illustrated in Figure 1. The x coordinate represents
the index of the triphone type, as illustrated in Table 4. The indices are arranged according
to the indices of the 32 phonemes in Table 1, plus /sil/ with an index of 33. A computer count
showed that the most frequent triphone was k-a+n, which appeared 224,962 times. There were 681
triphone types with a frequency of 1, for instance: a-e+o, a-e+y, a-o+c, e-e+d, e-e+l, etc.

                              Table 4. Indices for all possible triphone types.
                                   Index            Triphone
                                       1            a-a+a
                                       2            a-a+e
                                       3            a-a+ê
                                       …            …
                                      33            a-a+sil
                                      34            a-e+a
                                       …            …
                                   34848            sil-sy+sil
                                   34849            sil
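
The indexing in Table 4 is consistent with the mapping sketched below, in which the right context varies fastest, then the center phoneme, then the left context. The formula is inferred from the rows shown in Table 4 rather than stated in the text, so it should be read as an assumption; the function name is illustrative.

```python
# The 32 phonemes in the order of Table 1; /sil/ takes index 33 as a context.
PHONEMES = [
    "a", "e", "ê", "i", "o", "u", "ay", "aw", "ey", "oy",
    "b", "c", "d", "f", "g", "h", "j", "k", "l", "m", "n",
    "p", "r", "s", "t", "w", "y", "z", "kh", "ng", "ny", "sy",
]
CONTEXTS = PHONEMES + ["sil"]          # 33 possible left or right contexts

TOTAL_TYPES = len(CONTEXTS) * len(PHONEMES) * len(CONTEXTS) + 1   # 33 x 32 x 33 + 1


def triphone_index(triphone):
    """Map a triphone written as left-center+right (or the unit 'sil') to its index."""
    if triphone == "sil":
        return TOTAL_TYPES              # the single context-independent silence unit
    left, rest = triphone.split("-")
    center, right = rest.split("+")
    l = CONTEXTS.index(left)            # 0..32
    c = PHONEMES.index(center)          # 0..31 (silence is never a center here)
    r = CONTEXTS.index(right)           # 0..32
    return l * len(PHONEMES) * len(CONTEXTS) + c * len(CONTEXTS) + r + 1
```

Under this mapping, `triphone_index("a-a+sil")` is 33, `triphone_index("a-e+a")` is 34, and `triphone_index("sil-sy+sil")` is 34848, in agreement with Table 4.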




                    Figure 1. Triphone distribution of the mother sentence set (x-axis: index of
                    triphone type, 0 to 3.5 x 10^4; y-axis: frequency, 0 to 2.5 x 10^5).
      In the phoneme distribution (Figure 2 below), the most frequent phoneme is /a/, which
occurs 3,543,399 times, i.e. 7.08 times per sentence on average. This affects the minimum
sentence set to be generated: triphones containing /a/ will inevitably be redundant.

                                                        Figure 2. Phoneme distribution of the mother sentence set
                                                        (x-axis: phoneme; y-axis: frequency).

4. Experimental results
      The two methods described in Section 2 were applied to the mother sentence set of
500,353 sentences. The experimental results are described from two points of view: the search
pattern and the statistics of the resulting sentence sets.

a. The search pattern
      In the first 600 iterations, both methods found the same sentences, since there are 681
triphone types with a frequency of 1 (see sub-section 3.b) and each of them can be covered by
only one sentence. In the subsequent iterations, method 1 found short sentences because of the
standard scoring formula. Method 2, on the other hand, which uses the modified scoring,
maintained its covering rate by finding sentences with maximum coverage. Finally, method 2
covered all 9,668 triphone types earlier (at iteration 2,691) than method 1 (at iteration
2,969).
                                   Figure 3. Search patterns of method 1 (solid line) and method 2 (dashed line)
                                   (x-axis: the sequence of found sentences; y-axis: total number of found units).

b. Statistic of the sentence sets
      Method 1 generated sentence set 1, containing 2,969 sentences, whereas method 2 produced
sentence set 2, containing 2,691 sentences (see Table 5). Both sets cover the same number of
triphone types, i.e. 9,668. Although set 2 contains fewer sentences than set 1 (set 1 has
10.33% more sentences), it contains 14.25% more words. In set 2, the average number of
phonemes per sentence is 28.89% greater than in set 1, and the total number of triphones is
15.16% greater. These redundancies are significant. Therefore, set 1 was selected as the
minimum phonetically balanced sentence set.
      A trial recording with ten speakers showed that the average time to read 110 sentences,
randomly selected from that set, was 15.47 minutes (8.44 seconds per sentence). The sentences
are quite easy to read, even the longest one, which contains 83 phonemes.

                 Table 5. Statistics of sentence set 1 and sentence set 2.
                 Criterion                                            Set 1     Set 2
                 Number of sentences                                  2,969     2,691
                 Total number of words                               14,338    16,382
                 Number of distinct words                             5,352     5,762
                 Average number of phonemes per sentence              27.34     35.24
                 Maximum number of phonemes in a sentence                83        81
                 Minimum number of phonemes in a sentence                 6         8
                 Number of triphones                                 87,322   100,564
                 Number of triphone types (distinct triphones)        9,668     9,668
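
The relative differences quoted in this sub-section can be rechecked from the figures in Table 5 with a few lines of Python. The helper name and dictionary layout are illustrative. Note that the 10.33% figure for sentences is obtained when the difference is taken relative to set 2, whereas the other three percentages are relative to set 1.

```python
def percent_more(larger, smaller):
    """Relative excess of `larger` over `smaller`, in percent of `smaller`."""
    return 100.0 * (larger - smaller) / smaller


# Values copied from Table 5.
set1 = {"sentences": 2969, "words": 14338, "phonemes_per_sentence": 27.34, "triphones": 87322}
set2 = {"sentences": 2691, "words": 16382, "phonemes_per_sentence": 35.24, "triphones": 100564}

differences = {
    # Set 1 has more sentences than set 2.
    "sentences": percent_more(set1["sentences"], set2["sentences"]),
    # Set 2 has more words, longer sentences, and more triphone tokens than set 1.
    "words": percent_more(set2["words"], set1["words"]),
    "phonemes_per_sentence": percent_more(
        set2["phonemes_per_sentence"], set1["phonemes_per_sentence"]
    ),
    "triphones": percent_more(set2["triphones"], set1["triphones"]),
}

# Average reading time per sentence from the trial recording, in seconds.
seconds_per_sentence = 15.47 * 60 / 110
```

The computed values agree with the percentages in the text to within 0.01 percentage point, and `seconds_per_sentence` comes out at about 8.44.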






      Both sets show the same pattern of triphone distribution, as shown in Figure 4. The most
frequent triphone in both sets is k-a+n, but its frequency is higher in set 2 (around 800)
than in set 1 (around 600).


       Figure 4. Triphone distribution of sentence set 1 (top) and sentence set 2 (bottom)
       (x-axis: index of triphone type; y-axis: frequency).


       As illustrated in Figure 5 below, both sets show the same pattern of phoneme
distribution, but most phonemes have a higher frequency in set 2 than in set 1. In both sets,
/a/ is the most frequent phoneme (more than 14,000 occurrences). The phonemes /aw/, /ey/,
/oy/, /z/, /kh/, and /sy/ have low frequencies (fewer than 500 occurrences each).






                                 Figure 5. Phoneme distribution of sentence set 1 (a) and sentence set 2 (b)
                                 (x-axis: phoneme; y-axis: frequency).

5.   Conclusions
       Sentence set 1, generated by the Least-to-Most (LTM) Greedy algorithm with standard
scoring, is more efficient than sentence set 2, generated by the LTM Greedy algorithm with
modified scoring. Thus, sentence set 1 is selected as the minimum phonetically balanced
sentence set; it contains 2,969 sentences covering 9,668 triphone types. The shortest sentence
consists of 6 phonemes and the longest of 83 phonemes. The sentences in this set are easy to
read, even the longest one. In this set, /a/ is the most frequent phoneme (more than 14,000
occurrences). This suggests that most native Indonesian words contain /a/, partly because many
of them are built with the suffixes “-an” and “-kan”. The six phonemes /aw/, /ey/, /oy/, /z/,
/kh/, and /sy/ have low frequencies (fewer than 500 occurrences each); they appear more often
in borrowed words and in personal or place names than in native Indonesian words.






Acknowledgement

The Indonesian mother sentence set and the Indonesian pronunciation lexicon are the property
of PT Telekomunikasi Indonesia’s Research & Development Centre (TELKOMRisTI) and are used in
this research with permission.

References

InfoPlease, 2004, All the knowledge you need, World's 50 Most Populous Countries: 2004,
       http://www.infoplease.com/ipa/A0004391.html
Quinn, G., 2006, The Indonesian Language, http://www.hawaii.edu/sealit/Downloads/.
Sakti, S., Markov, K., Nakamura, S., 2005, Rapid Development of Initial Indonesian
       Phoneme-based Speech Recognition Using The Cross-Language Approach, Proceedings of
       Oriental COCOSDA 2005, pp. 38-43.
Tan, J., 2005, Bahasa Indonesia: Between FAQs and Facts,
       http://www.indotransnet.com/article1.html
Zhang, J.S. and Nakamura, S., 2003, An Efficient Algorithm to Search For A Minimum Sentence
       Set For Collecting Speech Database, 15th ICPhS Barcelona, pp. 3145-3148.



