(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 8, No. 1, April 2010

Speech Segmentation Algorithm Based on Fuzzy Memberships
                                    Luis D. Huerta, Jose A. Huesca and Julio C. Contreras
                                                   Departamento de Informática
                                               Universidad del Istmo Campus Ixtepec
                                                      Ixtepec Oaxaca, México
                                           {luisdh2, achavez, jcontreras}@bianni.edu.mx

Abstract: An automatic, text-independent speech segmentation algorithm was implemented. The algorithm applies fuzzy memberships to each feature in different speech sub-bands, so that the segmentation is performed in greater detail. We also tested several sampling frequencies and labeling levels of the speech signal and observed how they affect the performance of phoneme segmentation. The segmentation algorithm is described. During segmentation the algorithm is not supported by any additional information about the speech signal, such as the text. A correct segmentation rate of 80.51% is reported on a Spanish database, with an over-segmentation rate near 0%.

Keywords: Speech Segmentation; Fuzzy Memberships; Phonemes; Sub-band Features

INTRODUCTION

Speech recognition systems provide a natural communication environment between people and computers. Basically, these systems require two processes to understand the speech signal: the segmentation process and the segment recognition process.

Speech recognition systems are based on units such as words, syllables, diphonemes and phonemes, the phonemes being the smallest set. A speech recognition system based on phonemes reduces the number of units needed for recognition and therefore reduces confusion during the recognition process.

The segmentation process determines the boundaries between the speech units considered within the signal. The quality of the segmentation directly affects the quality of the recognition, since vague segments perform poorly during recognition and therefore degrade the whole system.

Work on sub-word segmentation using syllables [1,2] and phonemes [4,5], to mention some, has been reported. Many of these approaches have been tested under restrictions such as a limited vocabulary [1], a small number of speakers [4], and the use of additional information. Approaches that use additional information are known as text dependent, like the ones reported in [1,3].

A series of segmentation algorithms [5, 6, 7] has been proposed. These have been tested with various speakers and naturally spoken utterances with a wide range of vocabulary, and without any additional information about the content of the speech wave that could help the segmentation. Promising results in phoneme-based segmentation were reported.

However, there has been little effort to study the factors that affect segmentation performance beyond the implementation of the algorithm.

The present work shows results on the performance of the segmentation process under variations in the sampling frequency and the labeling of the speech signal. In the experiments, the DIMEx100 database of the Spanish spoken in Mexico was used.

FACTORS WHICH TAKE PART IN THE PERFORMANCE OF THE SPEECH SEGMENTATION

Recent work on automatic phoneme segmentation has been tested under conditions that include different speakers and utterances spoken naturally, without vocabulary restrictions or any additional information about the content of the speech wave; the latter condition is known as text independence. These testing conditions affect the performance of the algorithm. However, considering them in the experimental phase yields more realistic and higher-quality results. Therefore, it is crucially important to know which factors take part in the final result. Some factors that affect the performance of the segmentation are as follows:

Aspects of the speakers

Including different speakers in the tests assesses how robust the algorithm is when dealing with diverse natural styles of speech, which differ in features such as the speaker's diction, rate and intensity. Diction is related to the correct pronunciation and articulation of the words. Appropriate diction results in utterances that can be heard clearly and intelligibly by the receiver. The speaker's rate refers to the number of words or sub-words spoken per unit of time or, more exactly, the speed at which a word or an utterance is expressed. At high rates of speech the clarity of the utterances is reduced. The intensity of the speech is related to the amount of energy involved in the emitted wave. With greater intensity, the phonetic transitions are more strongly emphasized. The speaker factors mentioned above influence the articulation of the words and, due to their effects, the transition between some phonetic limits is not

                                                                    229                              http://sites.google.com/site/ijcsis/
                                                                                                     ISSN 1947-5500

clearly defined in the signal [9]. These aspects affect the performance of speech segmentation. Better performance is reached when the speaker has good diction, a low speech rate and high intensity.

Aspects of the Signal

The most important aspect of the signal is the sampling frequency. The more samples obtained from the original signal during digitization, the more detail is revealed. However, more noise or unnecessary frequency components might also be included. In accordance with the Nyquist-Shannon theorem, it is sufficient to sample at a rate of at least twice the highest frequency present in the signal.

Types of Labeling

There are levels of labeling that define the limits of the segments contained in a speech signal. The levels of labeling depend on the number of allophones, closures of stop and affricate consonants, glides and accented vowels, to mention a few. In this work, utterances from the DIMEx100 corpus of the Spanish spoken in Mexico are used. The corpus is described below. The utterances used in the tests include the following levels of labeling, as described in [10].

  •    Level T54
     At this level the 37 most frequent allophones of Mexican Spanish are represented, as well as the 8 closures of stop and affricate consonants ([p_c, t_c, k_c, b_c, d_c, g_c, tS_c, dZ_c]) and the 9 accented vowels ([i_7, e_7, E_7, a_j_7, a_7, a_2_7, O_7, o_7 and u_7]); the complete inventory of allophone units is also represented at this level.

  •    Level T44
     This level considers some basic acoustic aspects and some syllabic features. Besides the 22 prototypical allophones of Mexican Spanish, it includes the closures of the stop consonants and of the voiceless affricate consonant ([p_c, t_c, k_c, b_c, d_c, g_c, tS_c]), the approximant allophones of the voiced stops ([V, D, G]), the 9 accented vowels ([i_7, e_7, E_7, a_j_7, a_7, a_2_7, O_7, o_7, u_7]) and the glides ([j, w]). Also, a single symbol is assigned to each consonant pair ([p/b, t/d, k/g, n/m, r(/r]) at the end of a syllable, that is, in the syllable coda ([-P, -T, -K, -N, -R]).

  •    Level T22
     At this level only the 22 allophonic forms (inventory) related to the phonemes of Mexican Spanish are represented.

The type of labeling is one of the aspects that must be considered, since it might affect the segmentation performance, as shown in the experiments section.

Extracted information of the speech signal

Features can be extracted from the time domain as well as from the frequency domain. Segmentation algorithms that use time-domain features such as intensity [6, 8], energy and zero-crossing rate, to mention a few, have been reported.

Other encoding schemes are extracted from the frequency domain, such as MFCC, PCBF, Bark spectrum and Mel spectrum. The best results for the segmentation process [5] were obtained with the latter.

Mel Spectrum

Stevens and Volkmann [12] proposed the Mel scale. It was obtained from experiments on human hearing perception. They proposed that the perceived level with respect to the frequency heard follows a logarithmic scale expressed by the equation:

$$ m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1) $$

where f is the frequency in Hz and m is the corresponding value in mels.

In order to obtain the speech encoded in Mel spectra, a bank of filters emulates the critical perception bands, where the boundaries of each filter coincide with the centers of the adjacent filters and the filter centers follow the Mel scale. The filters obtain the average energy concentration around each central frequency for each frame of the speech signal, where each frame is a segment of the speech (usually 10 ms long).

Figure 1. Obtaining the Mel spectrum in vectors (each frame is mapped to the filter outputs m1 to m8).

In the present paper, Mel spectra are stored in feature vectors whose size equals the number of filters applied to each frame of the signal. Each filter is applied to a frequency sub-band to quantify the energy in it. To carry out the segmentation, distances between objects represented by feature vectors are compared to determine the phonetic limits.

SEGMENTATION ALGORITHM

Some segmentation algorithms are based on time-domain features, such as the ones reported in [6,8], and others on frequency-domain features, such as the ones reported in [4, 5, 7], which perform text-independent phoneme segmentation. The proposed segmentation algorithm uses speech feature vectors, in particular the encoding scheme based on the Mel spectrum. Each vector represents the features of the speech wave in diverse frequency intervals at a moment of time t. For each frequency interval, a fuzzy space is


defined in order to obtain a more detailed spectral quantification in each case. This fuzzy space is defined by obtaining the minimum and maximum spectrum of each sub-band, with an overlap of 50%. For each spectrum value, the High, Mid and Low memberships with respect to the frequency interval in which it resides are obtained.

In order to obtain a quantitative representation of the existence or non-existence of a spectral change between frames, a summation of the corresponding distances in each sub-band is computed for an instant of time t and, in this way, the distance between the compared frames is established. The distance between the feature vectors of the frames is determined with the following formula:

$$ D_t = \sum_{i=1}^{n} \left( \left| \mu^{High}_{t-1,i} - \mu^{High}_{t+1,i} \right| + \left| \mu^{Mid}_{t-1,i} - \mu^{Mid}_{t+1,i} \right| + \left| \mu^{Low}_{t-1,i} - \mu^{Low}_{t+1,i} \right| \right) \qquad (2) $$

where $\mu^{High}$, $\mu^{Mid}$ and $\mu^{Low}$ are the High, Mid and Low memberships of sub-band i in the frames adjacent to frame t, and n is the number of feature sub-bands, given by the number of filters used in the extraction of the Mel spectra. The distance of a frame is obtained from the features of its adjacent frames. Equation (2) gives the distances in each sub-band with respect to each membership of each adjacent frame and sums them over the sub-bands, so that a single distance is obtained for the frame being processed.

Figure 2. Algorithm based on fuzzy memberships of the Mel spectrum (filter bank, sub-bands, High/Mid/Low memberships per sub-band, and the resulting per-sub-band distances for frames 1 to t).

The representative distance of each frame is analyzed to establish whether it is a candidate to be a phonetic limit. The conditions used for the selection of these limits are:

   1. $D_t > D_{t-1}$ and $D_t > D_{t+1}$
   2. Dt>

The first condition obtains the local maxima, while the second allows the significant local maxima to be selected.

IMPLEMENTATION AND EXPERIMENTS

In the experiments, tests with variations in sampling frequency, labeling, encoding schemes, and the use of fuzzy memberships on the Mel spectra were carried out. The feature extraction and the segmentation process were implemented using the freeware PRAAT v.4.16.3 [11].

Data Description

Tests were performed using Spanish utterances obtained from the DIMEx100 corpus. The corpus was recorded in a sound studio at CCADET, Universidad Nacional Autónoma de México (UNAM), in mono format with 16-bit samples and a sampling rate of 44.1 kHz.

The speakers' ages range from 16 to 36. They have more than 9 years of formal education in Mexico City. A random group of speakers at UNAM (researchers, students, teachers and workers) was selected, with an average age of 23.82. 87% of them do not hold a degree. 49% of the utterances of the corpus are spoken by females and 51% by males. Speakers from Mexico City were chosen for this corpus, since this variety represents the variety spoken by the majority of the population in the country.

Experimental Data

In the test phase 240 speech signals were used. There were a total of 12655, 12551 and 11192 phonetic boundaries using the labeling of 54, 44 and 22 phonemes respectively, corresponding to 30 speakers (15 males and 15 females). The signals were extracted from the DIMEx100 corpus of Spanish sentences.

Measurement of Performance

The algorithm performance was evaluated with commonly used measures, such as those in [5, 6, 7, 8].

$$ D = 100 \cdot \left( \frac{S_d}{S_t} - 1 \right) \qquad (3) $$

where D is the measure of over-segmentation, $S_d$ is the number of segmentation points detected by the algorithm, and $S_t$ is the number of real segmentation points.

$$ P_c = 100 \cdot \left( \frac{S_c}{S_t} \right) \qquad (4) $$

where $P_c$ is the percentage of correct detections and $S_c$ is the number of correct segmentation points. A segmentation point is considered correct if its distance to the true segmentation point is within ± 20 ms.
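As an illustration of the two measures defined under Measurement of Performance, the following Python sketch computes D and Pc from lists of boundary times. The function names, the use of milliseconds, and the greedy one-to-one pairing of detected and reference boundaries are assumptions made for this example; the paper states only the ± 20 ms criterion and does not say how several detections near the same reference point are counted.

```python
def rates_from_counts(s_d, s_c, s_t):
    """Return (Pc, D) in percent from the counts used in equations (3) and (4).

    s_d: boundaries detected by the algorithm
    s_c: detected boundaries judged correct
    s_t: reference (hand-labeled) boundaries
    """
    if s_t <= 0:
        raise ValueError("the number of reference boundaries must be positive")
    pc = 100.0 * s_c / s_t          # equation (4)
    d = 100.0 * (s_d / s_t - 1.0)   # equation (3)
    return pc, d


def count_correct(detected_ms, reference_ms, tolerance_ms=20):
    """Count reference boundaries that have a detection within the tolerance.

    Assumption: each detection can validate at most one reference boundary
    (greedy left-to-right pairing over the sorted lists).
    """
    detected = sorted(detected_ms)
    correct = 0
    j = 0  # index of the next unused detection
    for true_time in sorted(reference_ms):
        # Drop detections that are too early for this and all later references.
        while j < len(detected) and detected[j] < true_time - tolerance_ms:
            j += 1
        if j < len(detected) and abs(detected[j] - true_time) <= tolerance_ms:
            correct += 1
            j += 1  # consume the matched detection
    return correct


def evaluate_boundaries(detected_ms, reference_ms, tolerance_ms=20):
    """Return (Pc, D) for detected and reference boundary times in milliseconds."""
    s_c = count_correct(detected_ms, reference_ms, tolerance_ms)
    return rates_from_counts(len(detected_ms), s_c, len(reference_ms))


if __name__ == "__main__":
    # Hypothetical boundary times, for illustration only.
    print(evaluate_boundaries([95, 270, 430, 605, 800], [100, 250, 400, 600]))
    # Counts reported in Table I for the 54-phoneme labeling (St = 12655).
    print(rates_from_counts(12663, 9851, 12655))
```

Applied to the counts reported in Table I with St = 12655, `rates_from_counts` yields values that round to Pc = 77.84 and D = 0.06 without fuzzy memberships, and to Pc = 80.51 and D = -0.67 with them, which agrees with the table.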


Using Fuzzy Memberships

Comparative results for the use of fuzzy memberships are shown in Table I. In this test, speech signals sampled at 44.1 kHz and the labeling based on 54 phonemes were used to test the phonetic boundaries.

TABLE I.      SEGMENTATION RESULTS WITH AND WITHOUT FUZZY MEMBERSHIPS.

                                Sd        Sc      Pc (%)    D (%)
 Without fuzzy memberships     12663     9851     77.84      0.06
 With fuzzy memberships        12570    10189     80.51     -0.67

There is a considerable contribution when fuzzy memberships on each sub-band per frame are used. The improvement in Pc is 2.67 percentage points (from 77.84% to 80.51%). This approach was also used in the text-independent segmentation algorithm based on the signal intensity presented in [6]. Subsequent tests include fuzzy memberships.

Using different sampling frequency and labeling

Since the sampling frequency determines the number of samples contained in a speech signal, we tested the speech signals at their original sampling rate of 44.1 kHz and, additionally, the same signals re-sampled to 16 kHz and 8 kHz.

TABLE II.     RESULTS OF THE SEGMENTATION USING DIFFERENT LABELING WITH SIGNALS OF 44.1 KHZ.

               Labeling        Pc (%)     D (%)
             54 phonemes       80.51      -0.67
             44 phonemes       80.72       0.08
             22 phonemes       79.66       0.16

It is important to emphasize that the segmentation process is evaluated on the basis of the labeling done by experts in phonetics. There are different levels of labeling, which affect the final result of the evaluation of the segmentation done by the algorithm in different ways.

TABLE III.    RESULTS OF THE SEGMENTATION USING DIFFERENT LABELING WITH SIGNALS OF 16 KHZ.

               Labeling        Pc (%)     D (%)
             54 phonemes       79.34      -0.39
             44 phonemes       79.42      -0.12
             22 phonemes       78.31       0.57

Table II shows that different labeling levels yield different results in the algorithm performance. It also shows that when the segmentation points obtained by the algorithm are compared against a lower-level labeling, the measured performance is lower.

Tables III and IV show the same tendency of the segmentation results with respect to the labeling. Additionally, they show that a decrease in sampling frequency corresponds to a decline in the quality of the segmentation. For the labeling of 54 phonemes, a difference of 2.25 percentage points was obtained between the segmentation of signals sampled at 44.1 kHz (80.51%) and at 8 kHz (78.26%).

TABLE IV.     RESULTS OF THE SEGMENTATION USING DIFFERENT LABELING WITH SIGNALS OF 8 KHZ.

               Labeling        Pc (%)     D (%)
             54 phonemes       78.26      -0.10
             44 phonemes       78.12      -0.30
             22 phonemes       77.14       0.60

Furthermore, there is no significant difference in the segmentation performance between the labeling of 54 phonemes and that of 44 phonemes. The slight difference is due to the fact that labels are assigned to the segments and do not change the number of boundaries in the sentence. However, for the labeling of 22 phonemes there is a considerable difference with regard to the other levels, related to the number of existing segments in the segmented utterance.

In this work, factors such as sampling frequency, labeling and fuzzy memberships were tested. Fuzzy memberships contribute additional information to each sub-band, enhancing the segmentation performance. In these particular experiments an improvement of 2.67 percentage points was obtained without increasing the insertion rate. In addition, better segmentation results were obtained with a higher sampling frequency. The labeling is an important feature to keep in mind during the evaluation of the segmentation algorithm. Depending on the level of detail of the labeling, the algorithm will show errors such as insertions, many of which could be due to the real existence of a phoneme transition that is marked only at a higher labeling level.

REFERENCES

[1]   Mayora O., "Segmentazione automática di fonemi per aplicazioni di riconoscimento vocale", Technical Report, Università di Genova, 2000.
[2]   Hu Z., Schalkwyk J., Barnard E. and Cole R., "Speech recognition using syllable-like units", ICSLP '96, 2:1117-1120, 1996.
[3]   Pellom B. and Hansen J., "Automatic segmentation of speech recorded in unknown noisy channel characteristics", Speech Communication, 1998.
[4]   Suh Y. and Lee Y., "Phoneme Segmentation of Continuous Speech using multi layer perceptron", IEEE Trans. Speech and Audio Proc., 7(6):697-708, 1999.
[5]   Aversano G. and Esposito A., "Automatic Parameter Estimation for a Context-Independent Speech Segmentation Algorithm", TSD 2002, LNAI 2448, pp. 293-300, Springer Verlag Berlin Heidelberg, 2002.
[6]   Huerta L.D. and Reyes C.A., "On the processing of Fuzzy Patterns for Text Independent Phonetic Speech Segmentation", CIARP 2006, LNCS 4225, pp. 437-445, Springer Verlag Berlin Heidelberg, 2006.
[7]   Saraswathi S., Geetha T. and Saravanan K., "Integrating Language Independent Segmentation and Language Dependent Phoneme Based Modeling for Tamil Speech Recognition System", Asian Journal of Information Technology 5 (1), pp. 38-43, 2006.
[8]   Huerta L.D., "Segmentación del Habla con Independencia de Texto, usando características en el dominio del tiempo", CITII 2008, Vol. 38, pp. 129-139, 2008.
[9]   Sarkar A. and Sreenivas T., "Automatic Speech Segmentation Using Average Level Crossing Rate Information", ICASSP 2005, IEEE, Vol. 1,


[10]  Pineda L.A., Villaseñor-Pineda L., Cuétara J., Castellanos H. and López I., "DIMEx100: A New Phonetic and Speech corpus for Mexican Spanish", Proceedings of the 9th Ibero-American Conference on AI (IBERAMIA), Puebla, Mexico, November 22-25, 2004. Lecture Notes in Artificial Intelligence, Vol. 3315, pp. 974-983, Springer, 2004.
[11]  Boersma P., "Praat, a system for doing phonetics by computer", Glot International 5:9/10, 341-345.
[12]  Stevens S. and Volkmann J., "The relation of pitch to frequency", American Journal of Psychology, 1940.
