(IJCSIS) International Journal of Computer Science and Information Security,
Vol. 8, No. 1, April 2010
Speech Segmentation Algorithm Based On Fuzzy
Memberships
Luis D. Huerta, Jose A. Huesca and Julio C. Contreras
Departamento de Informática
Universidad del Istmo Campus Ixtepec
Ixtepec Oaxaca, México
{luisdh2, achavez, jcontreras}@bianni.edu.mx
Abstract: In this work, a text-independent automatic speech segmentation algorithm was implemented. The algorithm applies fuzzy memberships to each feature in different speech sub-bands, so that the segmentation is performed in greater detail. Additionally, we tested various speech signal sampling frequencies and labeling levels, and observed how they affect the performance of phoneme segmentation. The speech segmentation algorithm used is described. During segmentation the algorithm is not supported by any additional information about the speech signal, such as the text. A correct segmentation rate of 80.51% is reported on a Spanish database, with an over-segmentation rate near 0%.
Keywords: Speech Segmentation; Fuzzy Memberships; Phonemes; Sub-band Features
INTRODUCTION
Speech recognition systems provide a natural communication environment between people and computers. Basically, these systems require two processes to understand the speech signal: the segmentation process and the segment recognition process.
Speech recognition systems are based on units such as words, syllables, diphones and phonemes, the phonemes being the smallest set. A speech recognition system based on phonemes reduces the number of units needed for recognition and therefore reduces confusion during the recognition process.
The segmentation process determines the boundaries between the speech units considered within the signal. The quality of the segmentation process directly affects the quality of the recognition process, since poorly delimited segments will be recognized poorly and will therefore degrade the whole system.
Segmentation works based on sub-word units such as syllables [1, 2] and phonemes [4, 5], to mention some, have been reported. Many have been tested under a series of restrictions such as a limited vocabulary [1], a small number of speakers [4], and the use of additional information. The latter are known as text dependent, such as the ones reported in [1, 3].
A series of segmentation algorithms [5, 6, 7] has been proposed. These have been tested with various speakers and naturally spoken utterances with a wide range of vocabulary, and without any additional information on the content of the speech wave that could help the segmentation. Promising results in phoneme-based segmentation were reported.
However, there has been little effort to study the factors that affect the performance of the segmentation beyond the implementation of the algorithm.
The present work shows the performance of the segmentation process under variations of the speech signal sampling frequency and of the labeling. In the experiments, the DIMEx100 database of the Spanish spoken in Mexico was used.
FACTORS WHICH TAKE PART IN THE PERFORMANCE OF THE SPEECH SEGMENTATION
Recent works on automatic phoneme segmentation have been tested under conditions that include different speakers and utterances expressed in natural conditions, without any vocabulary restrictions or any additional information on the content of the speech wave; the latter condition is known as text independence. These testing conditions affect the performance of the algorithm. However, considering them in the experimental phase yields more realistic and better-quality results.
Therefore, it is crucially important to know which factors take part in the final result. Some factors that affect the performance of the segmentation are as follows:
Aspects of the speakers
Including different speakers in the tests assesses how robust the algorithm is when dealing with diverse natural styles of speech, which differ in features such as the speaker's diction, rate and intensity. Diction is related to the correct pronunciation and articulation of the words. Appropriate diction results in utterances that can be heard clearly and intelligibly by the receiver. The speaker's rate refers to the number of words or sub-words spoken per unit of time or, more exactly, the speed at which a word or an utterance is expressed. At high rates of speech the clarity of the utterances is reduced. The intensity of the speech is related to the amount of energy in the emitted wave. With greater intensity, the phonetic transitions are more strongly emphasized. The speaker factors mentioned above influence the articulation of the words and, because of their effects, the transition between some phonetic limits is not
229 http://sites.google.com/site/ijcsis/
ISSN 1947-5500
clearly defined in the signal [9]. These aspects affect the performance of speech segmentation. Better performance is reached when the speaker has good diction, a low speech rate and high intensity.
Aspects of the Signal
The most important aspect of the signal is the sampling frequency. The greater the number of samples obtained from the original signal during digitizing, the more detail is revealed. However, more noise or unnecessary frequency components might also be included. In accordance with the Nyquist-Shannon theorem, it is sufficient to sample at a frequency of at least twice the highest frequency present in the signal.
Types of Labeling
There are levels of labeling that define the limits of the segments contained in a speech signal. The levels of labeling depend on the number of allophones, closures in stop and affricate consonants, glides and accented vowels, to mention a few. In this work, utterances from the DIMEx100 corpus of the Spanish spoken in Mexico are used. The corpus is described below. The utterances used in the tests include the following levels of labeling, as described in [10].
• Level T54
At this level the 37 most frequent allophones of Mexican Spanish are represented, as well as the 8 closures in stop and affricate consonants ([p_c, t_c, k_c, b_c, d_c, g_c, tS_c, dZ_c]) and the 9 vowels that allow an accent ([i_7, e_7, E_7, a_j_7, a_7, a_2_7, O_7, o_7, u_7]); the complete inventory of allophone units is represented at this level.
• Level T44
This level considers some basic acoustic aspects and some syllabic features. Besides the 22 prototypical allophones of Mexican Spanish, it includes the closures in stop consonants and the voiceless affricate consonant ([p_c, t_c, k_c, b_c, d_c, g_c, tS_c]), the approximant allophones of the voiced stops ([V, D, G]), the 9 vowels which allow an accent ([i_7, e_7, E_7, a_j_7, a_7, a_2_7, O_7, o_7, u_7]) and the glides ([j, w]). Also, a single symbol is allocated to consonant pairs ([p/b, t/d, k/g, n/m, r(/r]) at the end of a syllable, that is, in the syllable coda ([-P, -T, -K, -N, -R]).
• Level T22
At this level only the 22 allophonic forms (inventory) related to the phonemes of Mexican Spanish are represented.
The type of labeling is one of the aspects that must be considered, since it might affect the segmentation performance, as shown in the experiments section.
Extracted information of the speech signal
Features can be extracted from the time domain as well as from the frequency domain. Segmentation algorithms that use time-domain features such as intensity [6, 8], energy and zero-crossing rate, to mention a few, have been reported.
On the other hand, some encoding schemes are extracted from the frequency domain, such as MFCC, PCBF, Bark spectrum and Mel spectrum. The best results for the segmentation process in [5] were obtained with the latter.
Mel Spectrum
Stevens and Volkmann [12] proposed the Mel scale. It was obtained from experiments on human hearing perception. They proposed that the perceived pitch with respect to the frequency heard follows a logarithmic scale expressed by the equation:
$$m = 2595 \log_{10}\left(1 + \frac{f}{700}\right) \qquad (1)$$
where f is the frequency in Hz and m is the corresponding value in mels.
In order to obtain the speech encoded as Mel spectra, a bank of filters emulates the critical perception bands, where the boundaries of each filter coincide with the centers of the adjacent filters and the filter centers follow the Mel scale. The filters obtain the average energy concentration around each central frequency for each frame of the speech signal, where each frame is a segment of the speech (usually of 10 ms).
[Figure 1: a signal frame passed through a bank of eight Mel filters, m1 to m8.]
Figure 1. Obtaining the Mel spectrum in vectors.
In the present paper, Mel spectra stored in feature vectors are used, where the size of each vector is equal to the number of filters applied to each frame of the signal. Each filter is applied to a frequency sub-band to quantify the energy in it. To carry out the segmentation, distances between objects represented by feature vectors are compared to determine the phonetic limits.
SEGMENTATION ALGORITHM
Some segmentation algorithms are based on time-domain features, such as the ones reported in [6, 8], and others on frequency-domain features, such as the ones reported in [4, 5, 7], which perform text-independent phoneme segmentation. The proposed segmentation algorithm uses speech feature vectors, in particular the encoding schemes based on the Mel spectrum. Each vector represents the features of the speech wave in diverse frequency intervals at a moment of time t. For each frequency interval, a fuzzy space is
defined in order to obtain a more detailed spectral quantification in each case. The fuzzy space is defined from the minimum and maximum spectrum values of each sub-band, with an overlap of 50%. For each spectrum value, the High, Mid and Low memberships with respect to the frequency interval in which it resides are obtained.
In order to obtain a quantitative representation of the presence or absence of a spectral change between frames, the corresponding distances in each sub-band are summed for an instant of time t; in this way, the distance between the compared frames is established. The distance between the feature vectors of the frames is computed with the following formula:
$$D_t = \sum_{i=1}^{N} \left( \left|\mu^{\mathrm{High}}_{t-1,i} - \mu^{\mathrm{High}}_{t+1,i}\right| + \left|\mu^{\mathrm{Mid}}_{t-1,i} - \mu^{\mathrm{Mid}}_{t+1,i}\right| + \left|\mu^{\mathrm{Low}}_{t-1,i} - \mu^{\mathrm{Low}}_{t+1,i}\right| \right) \qquad (2)$$
where N is the number of feature sub-bands, that is, the number of filters used in the extraction of the Mel spectra. The distance of a frame is obtained from the features of its adjacent frames. Equation (2) gives the distances in each sub-band with respect to each membership of each adjacent frame and sums them over the sub-bands, so that a single distance is obtained for the frame being processed.
[Figure 2: block diagram. Pre-emphasis filter, then filter bank; for each sub-band 1 to N, the High, Mid and Low features give the distances D1,i to Dt,i; these are summed and passed to the rules that yield the phoneme boundaries.]
Figure 2. Algorithm based on fuzzy memberships of the Mel spectrum.
The representative distance of each frame is analyzed to establish whether it is a candidate phonetic limit. The conditions used for the selection of these limits are:
1. $D_t > D_{t-1}$ and $D_t > D_{t+1}$
2. $D_t >$ a threshold value
The first condition selects the local maxima, while the second keeps only the significant local maxima.
IMPLEMENTATION AND EXPERIMENTS
In the experiments, tests with variations of sampling frequency, labeling and encoding schemes, and with the use of fuzzy memberships on the Mel spectra, were carried out. The feature extraction and the segmentation process were implemented using the freeware PRAAT v.4.16.3 [11].
Data Description
Tests were performed using Spanish utterances obtained from the DIMEx100 corpus. The corpus was recorded in a sound studio at CCADET, Universidad Nacional Autónoma de México (UNAM), in mono, at 16 bits per sample and a sampling rate of 44.1 kHz.
The speakers' ages range from 16 to 36. They have more than 9 years of formal education in Mexico City. A random group of speakers at UNAM (researchers, students, teachers and workers) was selected, with an average age of 23.82. 87% of them do not hold a degree. 49% of the utterances of the corpus are spoken by females and 51% by males. Speakers from Mexico City were chosen for this corpus, since this variety represents the variety spoken by the majority of the population in the country.
Experimental Data
In the test phase, 240 speech signals were used, corresponding to 30 speakers (15 males and 15 females). There were a total of 12655, 12551 and 11192 phonetic boundaries using the labelings of 54, 44 and 22 phonemes, respectively. The signals were extracted from the DIMEx100 corpus and contain Spanish sentences.
Measurement of Performance
The algorithm performance was evaluated with the commonly used measures, as in [5, 6, 7, 8]:
$$D = 100 \cdot \left(\frac{S_d}{S_t} - 1\right) \qquad (3)$$
where D is the measurement of over-segmentation, S_d is the number of segmentation points detected by the algorithm, and S_t is the number of real segmentation points.
$$P_c = 100 \cdot \frac{S_c}{S_t} \qquad (4)$$
where P_c is the percentage of correct detections and S_c is the number of correct segmentation points. A segmentation point is considered correct if its distance to the true segmentation point is within ±20 ms.
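As a rough illustration of equations (3) and (4), the following Python sketch computes the over-segmentation D and the percentage of correct detections Pc, and counts correct boundaries with the ±20 ms tolerance. The function names and the one-to-one greedy pairing of detections with reference boundaries are assumptions made for this example; the paper does not state how detections are paired with reference points.

```python
def count_correct(detected_ms, reference_ms, tol_ms=20):
    """Count reference boundaries that have a detection within +/- tol_ms.

    Assumption: each detection may match at most one reference boundary,
    and both lists are paired greedily in time order.
    """
    detected = sorted(detected_ms)
    correct = 0
    i = 0
    for ref in sorted(reference_ms):
        # Skip detections that are too early to match this reference point.
        while i < len(detected) and detected[i] < ref - tol_ms:
            i += 1
        if i < len(detected) and abs(detected[i] - ref) <= tol_ms:
            correct += 1
            i += 1  # consume the detection so it is not reused
    return correct


def over_segmentation(s_d, s_t):
    """Equation (3): D = 100 * (Sd / St - 1)."""
    return 100.0 * (s_d / s_t - 1.0)


def percent_correct(s_c, s_t):
    """Equation (4): Pc = 100 * Sc / St."""
    return 100.0 * s_c / s_t


if __name__ == "__main__":
    # Toy boundary times in milliseconds (illustrative values only).
    detected = [100, 250, 400, 410]
    reference = [110, 240, 500]
    s_c = count_correct(detected, reference)
    print(s_c, percent_correct(s_c, len(reference)),
          over_segmentation(len(detected), len(reference)))
```

With the counts reported in this paper for the 54-phoneme labeling at 44.1 kHz (Sd = 12570, Sc = 10189, St = 12655), these functions give Pc ≈ 80.51 and D ≈ -0.67.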
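The procedure described in the SEGMENTATION ALGORITHM section (fuzzy memberships per sub-band, the frame distance of equation (2), and the two selection conditions) can be sketched in Python as follows. This is only a minimal sketch under stated assumptions, not the PRAAT implementation used in the experiments: the triangular membership shapes, the use of the per-sub-band minimum and maximum over the whole utterance, the absolute differences between the frames t-1 and t+1, and the `threshold` parameter are all assumptions introduced for illustration.

```python
def memberships(x, lo, hi):
    """Return (low, mid, high) memberships of x on the interval [lo, hi].

    Assumption: triangular fuzzy sets that overlap by 50%, with Low peaking
    at lo, Mid at the midpoint and High at hi.
    """
    if hi == lo:
        return (0.0, 1.0, 0.0)  # flat sub-band: every frame gets the same value
    u = min(max((x - lo) / (hi - lo), 0.0), 1.0)
    low = max(0.0, 1.0 - 2.0 * u)
    mid = 1.0 - abs(2.0 * u - 1.0)
    high = max(0.0, 2.0 * u - 1.0)
    return (low, mid, high)


def frame_distances(frames):
    """Distance per frame, summed over sub-bands and memberships.

    frames: list of frames, each a list of N sub-band energies.
    Assumption: the distance of frame t compares frames t-1 and t+1.
    """
    n_bands = len(frames[0])
    lows = [min(f[i] for f in frames) for i in range(n_bands)]
    highs = [max(f[i] for f in frames) for i in range(n_bands)]
    mu = [[memberships(f[i], lows[i], highs[i]) for i in range(n_bands)]
          for f in frames]
    dist = [0.0] * len(frames)
    for t in range(1, len(frames) - 1):
        dist[t] = sum(abs(a - b)
                      for i in range(n_bands)
                      for a, b in zip(mu[t - 1][i], mu[t + 1][i]))
    return dist


def pick_boundaries(dist, threshold):
    """Keep frames that are local maxima (condition 1) above a threshold (condition 2)."""
    return [t for t in range(1, len(dist) - 1)
            if dist[t] > dist[t - 1] and dist[t] > dist[t + 1]
            and dist[t] > threshold]


if __name__ == "__main__":
    # Toy utterance with two sub-bands whose energy rises one frame apart.
    frames = [[0, 0], [0, 0], [0, 0], [1, 0], [1, 1], [1, 1], [1, 1]]
    d = frame_distances(frames)
    print(d, pick_boundaries(d, threshold=1.0))
```

In the toy input the two sub-bands change one frame apart, so their distances add up to a single peak. With only one sub-band, a step change would give two equal neighbouring distances and the strict local-maximum test would select neither, which suggests why summing over several sub-bands is useful.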
Using Fuzzy Memberships
Comparative results for the use of fuzzy memberships are shown in Table I. In this test, speech signals sampled at 44.1 kHz and a labeling based on 54 phonemes were used to evaluate the phonetic boundaries.
TABLE I. SEGMENTATION RESULTS WITH AND WITHOUT FUZZY MEMBERSHIPS.
                            Sd      Sc      Pc (%)   D (%)
Without fuzzy memberships   12663   9851    77.84    0.06
With fuzzy memberships      12570   10189   80.51    -0.67
There is a considerable contribution when fuzzy memberships on each sub-band per frame are used. The improvement is approximately 2.67 percentage points. This approach was also used in the text-independent segmentation algorithm based on signal intensity presented in [6]. Subsequent tests include fuzzy memberships.
Using different sampling frequency and labeling
Since the sampling frequency affects the quantity of samples contained in a speech signal, in this work we tested the speech signals at their original sampling rate of 44.1 kHz. Additionally, the same signals re-sampled to 16 kHz and 8 kHz were tested.
TABLE II. RESULTS OF THE SEGMENTATION USING DIFFERENT LABELING WITH SIGNALS OF 44.1 kHz.
Labeling      Pc (%)   D (%)
54 phonemes   80.51    -0.67
44 phonemes   80.72    0.08
22 phonemes   79.66    0.16
It is important to emphasize that the segmentation process is evaluated on the basis of the labeling done by experts in phonetics. There are different levels of labeling, which affect differently the final result of the evaluation of the segmentation done by the algorithm.
TABLE III. RESULTS OF THE SEGMENTATION USING DIFFERENT LABELING WITH SIGNALS OF 16 kHz.
Labeling      Pc (%)   D (%)
54 phonemes   79.34    -0.39
44 phonemes   79.42    -0.12
22 phonemes   78.31    0.57
Table II shows that different labelings give different results for the algorithm performance. It also shows that when the segmentation points obtained by the algorithm are compared against a lower-level labeling, the measured performance is lower.
Tables III and IV show the same tendency of the segmentation results with respect to the labeling. Additionally, they show that a decrease in sampling frequency corresponds to a decline in the quality of the segmentation. A difference of 2.25 percentage points was obtained between the segmentation of signals sampled at 44.1 kHz and at 8 kHz, using the labeling of 54 phonemes.
TABLE IV. RESULTS OF THE SEGMENTATION USING DIFFERENT LABELING WITH SIGNALS OF 8 kHz.
Labeling      Pc (%)   D (%)
54 phonemes   78.26    -0.10
44 phonemes   78.12    -0.30
22 phonemes   77.14    0.60
Furthermore, there is no significant difference in the segmentation performance between the labelings of 54 and 44 phonemes. The difference is slight because these two levels mainly change the labels assigned to the segments rather than the number of boundaries in the sentence. For the labeling of 22 phonemes, however, there is a considerable difference with regard to the other levels, related to the number of segments in the segmented utterance.
CONCLUSIONS
In this work, factors such as sampling frequency, labeling and fuzzy memberships were tested. Fuzzy memberships contribute additional information to each sub-band, enhancing the segmentation performance. In these particular experiments an improvement of 2.67 percentage points was obtained without increasing the insertion rate. In addition, better segmentation results were obtained with a higher sampling frequency. The labeling is an important feature to keep in mind when evaluating the segmentation algorithm. With a less detailed labeling, some detections are counted as insertions even though many of them may correspond to real phoneme transitions that a more detailed labeling level would mark.
REFERENCES
[1] Mayora O., "Segmentazione automática di fonemi per aplicazioni di riconoscimento vocale", Technical Report, Università di Genova, 2000.
[2] Hu Z., Schalkwyk J., Barnard E. and Cole R., "Speech recognition using syllable-like units", ICSLP '96, 2:1117-1120, 1996.
[3] Pellom B. and Hansen J., "Automatic segmentation of speech recorded in unknown noisy channel characteristics", Speech Communication, 15, pp. 97-116, 1998.
[4] Suh Y. and Lee Y., "Phoneme Segmentation of Continuous Speech using multi layer perceptron", IEEE Trans. Speech and Audio Proc., 7(6):697-708, 1999.
[5] Aversano G. and Esposito A., "Automatic Parameter Estimation for a Context-Independent Speech Segmentation Algorithm", TSD 2002, LNAI 2448, pp. 293-300, Springer-Verlag, Berlin Heidelberg, 2002.
[6] Huerta L.D. and Reyes C.A., "On the processing of Fuzzy Patterns for Text Independent Phonetic Speech Segmentation", CIARP 2006, LNCS 4225, pp. 437-445, Springer-Verlag, Berlin Heidelberg, 2006.
[7] Saraswathi S., Geetha T. and Saravanan K., "Integrating Language Independent Segmentation and Language Dependent Phoneme Based Modeling for Tamil Speech Recognition System", Asian Journal of Information Technology 5 (1), pp. 38-43, 2006.
[8] Huerta L.D., "Segmentación del Habla con Independencia de Texto, usando características en el dominio del tiempo", CITII 2008, Vol. 38, pp. 129-139, 2008.
[9] Sarkar A. and Sreenivas T., "Automatic Speech Segmentation Using Average Level Crossing Rate Information", ICASSP 2005, IEEE, Vol. 1, pp. 397-400.
[10] Pineda L.A., Villaseñor-Pineda L., Cuétara J., Castellanos H. and López I., "DIMEx100: A New Phonetic and Speech corpus for Mexican Spanish", Proceedings of the 9th Ibero-American Conference on AI (IBERAMIA), Puebla, Mexico, November 22-25, 2004, Lecture Notes in Artificial Intelligence, Vol. 3315, pp. 974-983, Springer, 2004.
[11] Boersma P., "Praat, a system for doing phonetics by computer", Glot International 5:9/10, pp. 341-345.
[12] Stevens S. and Volkmann J., "The relation of pitch to frequency", American Journal of Psychology, 1940.