					Efficient Client-Server based Implementations
  of Mobile Speech Recognition Services


                Richard C. Rose a and Iker Arizmendi b
                           aCorresponding Author
                                McGill University
               Department of Electrical and Computer Engineering
                  McConnell Engineering Building, Room 755
          3480 University Street, Montreal, Quebec H3A 2A7 Canada
       Email: rose@ece.mcgill.ca, Phone: 514-398-1749, Fax: 514-398-4470

                                  b Coauthor
                          AT&T Labs – Research
       Room D129, 180 Park Ave., Florham Park, NJ 07932-0971 U.S.A.
             Email: iker@research.att.com, Phone: 973-360-8516




Abstract

The purpose of this paper is to demonstrate the efficiencies that can be achieved
when automatic speech recognition (ASR) applications are provided to large user
populations using client-server implementations of interactive voice services. It is
shown that, through proper design of a client-server framework, excellent overall
system performance can be obtained with minimal demands on the computing re-
sources that are allocated to ASR. System performance is considered in the paper in
terms of both ASR speed and accuracy in multi-user scenarios. An ASR resource al-
location strategy is presented that maintains sub-second average speech recognition
response latencies observed by users even as the number of concurrent users exceeds
the available number of ASR servers by more than an order of magnitude. An ar-
chitecture for unsupervised estimation of user-specific feature space adaptation and
normalization algorithms is also described and evaluated. Significant reductions in
ASR word error rate were obtained by applying these techniques to utterances col-
lected from users of hand-held mobile devices. These results are important because,
while there is a large body of work addressing the speed and accuracy of individual
ASR decoders, there has been very little effort applied to dealing with the same
issues when a large number of ASR decoders are used in multi-user scenarios.



Key words: Automatic Speech Recognition, Distributed Speech Recognition,
Robustness, Client-Server Implementations, Adaptation




1   Introduction


There are a large number of voice enabled services that are currently being
provided to telecommunications customers using client-server implementations
in multi-user scenarios. The interest in this work is in those implementations
where the functionality of the interactive system may be distributed between a
client and server which can be interconnected over any of a variety of communi-
cations networks. The client in these applications may be a cellular telephone,
personal digital assistant, portable tablet computer, or any other device that
supports speech input along with additional input and output modalities that
may be appropriate for a given application. The server deployment, on the
other hand, often consists of many low cost commodity computers located in
a centralized location. For these implementations to be practical, it is neces-
sary for the server deployment to support large numbers of users concurrently
interacting with voice services under highly variable conditions. This requires
that the server deployments be able to scale to large user populations while
simultaneously minimizing degradations in performance under peak load con-
ditions. Methods for maintaining efficient and robust operation under these
conditions will be presented.

This paper presents a client-server framework that efficiently implements multi-
modal applications on general purpose computers. This framework will serve
as the context for addressing two important practical problems that have re-
ceived relatively little attention in the ASR literature. The first problem is
the efficient assignment of ASR decoders to computing resources in network
based server deployments. There has been a great deal of work applied to-
wards increasing the efficiency of individual ASR decoders using a number of
strategies including efficient pruning [26], efficient acoustic likelihood compu-
tation during decoding [16,3], and network optimization [15]. However, there
has been little effort applied to increasing the overall efficiency at peak loads
when a large number of ASR decoders are used in multi-user scenarios. The
second problem is the implementation of acoustic adaptation and normaliza-
tion algorithms within a client-server framework. Over the last decade, a large
number of techniques have been proposed for adapting hidden Markov mod-
els (HMMs) or normalizing observation vectors based on a set of adaptation
utterances. The overall goal in this work is to apply these techniques to min-
imizing the impact of speaker, channel, or environment variability relative to
purely task independent ASR systems. Methods will be proposed and evalu-
ated for applying these algorithms under the constraints that are posed when
implemented within client-server scenarios.

One can make many qualitative arguments for when either fully embedded
ASR implementations or network based client-server implementations are ap-
propriate. It is generally thought that fully embedded implementations are
most appropriate for value added applications like name dialing or digit di-
aling, largely because no network connectivity is necessary when ASR is im-
plemented locally on the device [24]. Distributed or network based ASR im-
plementations are considered appropriate for ASR based services that require
access to large application specific databases. In these cases, issues of database
security and integrity make it impractical to distribute representations of the
database to all devices [21]. Network based implementations also facilitate
porting the application to multiple languages and multiple applications with-
out having to effect changes to the individual devices in the network. However,
implementing ASR in a network based server deployment can also lead to po-
tential degradations in ASR word accuracy (WAC) resulting from transmitting
speech over the communications channel between client and server. There has
been a very large body of research devoted to this issue. Some of the relevant
work in this area will be summarized in Section 2.

The client-server framework presented here, referred to as the distributed
speech enabled middleware (DSEM), performs several functions. First, it im-
plements the communications channels that allow data and messages to be
passed between the components that make up the interactive dialog system.
Second, it manages a set of resources that include ASR decoders, database
servers, and reconfiguration modules that are responsible for adapting the sys-
tem to particular users. The framework was designed to minimize the degra-
dation in performance that occurs as the number of clients begins to exceed
a server’s peak capacity [25]. This degradation could be the result of context
switching and synchronization overhead as can occur for any non-ASR server
implementation, but can also be a result of the high input-output activity
necessary to support ASR services. Algorithms presented in this paper for ef-
ficient allocation of ASR resources and for efficient user configuration of ASR
acoustic modeling are implemented in the context of the DSEM framework.
The framework will be described in Section 3 and its performance will be eval-
uated in terms of its ability to minimize response latencies observed by the
user under peak load conditions.

Strategies are presented for efficient allocation of ASR resources in server
deployments that utilize many low cost, commodity computational servers.
By dynamically assigning ASR decoders to individual utterances within a
dialog, these strategies are meant to compensate for the high variability in
processing effort that exists in human-machine dialog scenarios. The sources
of this variability are discussed in Section 4. A model for these ASR allocation
strategies is also presented in Section 4. The strategies are evaluated both in
terms of their simulated and actual performance for a large vocabulary dialog
task running on a deployment with ten ASR servers.

Finally, an efficient architecture is presented for implementing algorithms for
fast acoustic reconfiguration of individual ASR decoders to a particular mobile
client. This is motivated primarily by the need to deal with the environmen-
tal, channel, and speaker variability that might be associated with a typical
mobile domain. It is also motivated by the opportunity for acquiring repre-
sentations of speaker, environment, and transducer variability that is afforded
in the case where the client is dedicated to a particular user account. Since
the ASR allocation strategies discussed above can dynamically assign ASR de-
coders to individual utterances, it is difficult in practice to adapt the acoustic
hidden Markov model (HMM) parameters to a particular client. Hence, it is
more practical to modify the feature space parameters for a particular client
rather than attempt to adapt the acoustic model parameters. An architecture
is presented in Section 5 for unsupervised adaptation and normalization of
feature space parameters in the context of the DSEM framework.



2     Robustness Issues for Mobile Domains


It is well known that the performance of client-server based ASR implementa-
tions suffers from the distortions associated with transmission of speech, or the
ASR features derived from speech, over a communications channel. There has
been a great deal of research addressing issues of acoustic feature extraction
and channel robustness for ASR under these conditions [2,4,7,8,11,12,19,23].
This section provides a brief summary of some of this work and the impact
of these degradations especially in wireless mobile applications. This serves as
motivation for the discussion in Section 5 on the implementation of acoustic
feature space normalization and adaptation algorithms.


2.1    Feature Extraction Scenarios


There have been several investigations comparing the ability of different fea-
ture analysis scenarios to obtain high performance network-based ASR over
wireless telephone networks [7,8,11,12]. The ETSI distributed speech recog-
nition (DSR) effort has standardized feature analysis and compression algo-
rithms that run on the client handset [7]. In this scenario, the coded fea-
tures are transmitted over a protected data channel to mitigate the effects of
degradation in voice quality when channel carrier-to-interference ratio is low.
Another scenario involves performing feature analysis in the network by ex-
tracting ASR features directly from the received voice channel bit stream [11].
A last scenario has been evaluated for ASR which does not involve additional
client based or network based processing [8,12]. Instead, it involves the use
of the adaptive multi-rate (AMR) speech codec that has been selected as
the default speech codec for use in Wideband Code Division Multiple Access
(WCDMA) networks. Studies have shown that the ability of the AMR codec
to trade off source coding bit-rate over a range from 4.75 to 12.2 kbit/s with
channel coding bit allocation results in negligible change to ASR accuracy for
carrier to interferer ratios as low as 4 dB [8].


2.2   Robustness with Respect to Channel Distortions


There have also been a variety of approaches that have been investigated for
making ASR more robust with respect to the distortions induced by Gaus-
sian and Rayleigh fading channels associated with wireless communications
networks [2,19,23]. One approach is to apply transmission error protection and
concealment techniques to the coded feature parameters as they are shipped
over wireless channels [23]. Another approach involves combining confidence
measures derived from the channel decoder with the likelihood computation
performed in the Viterbi search algorithm used in the ASR decoder. In this
approach, confidence measures are computed from the a posteriori proba-
bility which provides an indication of whether received feature vectors have
been correctly decoded. These confidence measures are then used to weight
or censor the local Gaussian likelihood computations used in the Viterbi al-
gorithm [2,19].

This last approach is similar in some ways to the missing feature theory ap-
proach to robust ASR where noise corrupted feature vector components are
labeled and removed from the likelihood computation [19]. However, the abil-
ity of the channel decoder to identify missing features in this application is
far more effective than the existing techniques for labeling feature vectors cor-
rupted by noisy acoustic environments. Similar techniques have been investi-
gated for the packet loss scenarios associated with packet-based transmission
over VoIP networks [4].


2.3   Importance of Acoustic Environment


It is well known that distortions introduced by both the acoustic environment
and the communications channel can impact ASR performance. Studies based
on empirical data collected in multiple cellular telephone domains have demon-
strated that the effects of environmental noise can often dominate the observed
increases in ASR word error rate (WER) [22]. Increases in WER of 50% have
been measured over wireless communications channels in noisy automobile en-
vironments compared to quiet office environments. On the other hand, a WER
decrease of only 10% was observed in wireless channels compared to wire line
channels when speech was collected in a quiet office environment. This agrees
with similar findings suggesting that, except for extremely degraded communi-
cations channels, the impact of channel specific variability is often secondary to
environmental variability in mobile ASR applications.



3     Mobile ASR Framework


Modern multi-user applications are often challenged by the need to scale to a
potentially large number of users while minimizing the degradation in service
response even under peak load conditions. Scaling multi–modal applications
that include ASR as an input modality presents an additional hurdle as there is
typically a great disparity between the number of potentially active users and
a system’s limited ability to provide computationally intensive ASR services.
This section provides an overview of the proposed distributed speech enabled
middleware (DSEM) framework that is used to efficiently implement multi-
modal applications that maximize performance under normal loads and are
well conditioned under peak loads. The section consists of two parts.
First, the framework rationale and design are briefly described. The second
part of the section presents an experimental study demonstrating the through-
put of the framework in the context of hundreds of simulated mobile clients
simultaneously accessing a system equipped with a limited number of ASR
decoders.


3.1     Description


3.1.1    Models for Efficient Client-Server Communication

Traditional non-ASR server implementations that assign a thread or process
per client suffer from degraded time and space performance as the number
of clients approaches and exceeds a server’s peak capacity. The practical
and theoretical issues that are behind these observed performance degrada-
tions have received a great deal of attention in the computer science com-
munity [1,5,17,25,14]. There is general agreement, however, that while this
degradation stems from many factors, there are three principal issues that
limit the performance of the thread per client model.

A first issue is the overhead incurred by the operating system in perform-
ing context switching. Context switching involves saving the current thread’s
context and loading the context of the next runnable thread. It can intro-
duce a number of artifacts. For example, cache performance can be reduced as
the memory required for storing the stacks associated with individual threads
competes for processor cache space [14]. This switching overhead can be exac-
erbated in the presence of threads that handle streaming media which must be
frequently invoked to service buffers that are used for transmission of digital
audio and video. It is important to note that the degradations associated with
this particular scenario, which is critical to supporting the interaction between
clients and ASR servers, have received relatively little attention when evaluat-
ing existing client-server communication models. The degradations stemming
from this type of overhead are the principal focus of the evaluation described
in Section 3.2.

A second issue is associated with the synchronization of state information that
must often be shared between clients. Resources like shared pools of feature
frames and pools of ASR decoder proxies are examples of shared state informa-
tion which are heavily contended between clients. In the presence of multiple
threads, access to this shared state information must be synchronized using
locking primitives. The overhead of such synchronization can be significant,
especially for the above examples.

Finally, a third issue limiting the performance of thread per client models is the
virtual memory requirements associated with each individual thread. Every
thread allocated by the operating system requires a small data structure to
track it and, more importantly, an allocation of virtual memory for the thread’s
stack. One reason this issue is problematic is that the relevant limits are specific
to a given operating system. For example, it is common on Linux operating
systems to define a thread’s stack size to be 2 MB. It is also common in Linux
to limit the user address space to 2 GB, which limits the maximum number of
threads that can be supported in a thread per client model to 1000.

In an effort to address the above issues, the proposed DSEM framework uses
an event-driven, non-blocking IO model. This requires only a single thread to
manage a large number of concurrently connected clients [25]. In addition, an
ASR decoder cache is employed to effectively share limited decoder resources
among active clients.
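
The event-driven model can be made concrete with a short sketch. The following
is a minimal, self-contained illustration of a single-threaded dispatch loop over
non-blocking sockets; it is not the DSEM implementation, and the client handler
simply echoes data back, but it shows how one thread can multiplex many
concurrently connected clients without a thread per connection. The port number
and handler names are illustrative.

    import selectors
    import socket

    # A single selector replaces the thread-per-client model: one thread waits
    # for IO events on all registered sockets and dispatches to handler callbacks.
    sel = selectors.DefaultSelector()

    def accept(server_sock):
        conn, _ = server_sock.accept()
        conn.setblocking(False)                  # non-blocking client socket
        sel.register(conn, selectors.EVENT_READ, handle_client)

    def handle_client(conn):
        data = conn.recv(4096)                   # safe: the selector reported readability
        if data:
            conn.send(data)                      # placeholder for session/ASR handler logic
        else:
            sel.unregister(conn)                 # client closed the connection
            conn.close()

    server = socket.socket()
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(("0.0.0.0", 9000))               # illustrative address and port
    server.listen()
    server.setblocking(False)
    sel.register(server, selectors.EVENT_READ, accept)

    while True:                                  # the single dispatch thread
        for key, _ in sel.select():
            key.data(key.fileobj)                # invoke the registered handler

In the DSEM the registered callbacks would be session, ASR, and HTTP handlers
rather than an echo routine, but the dispatch structure is the same.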


3.1.2   An Example Interaction

Fig. 1. A block diagram of the distributed speech enabled middleware (DSEM)
framework. Strategies for allocation of ASR resources and for the server-based
implementation of feature space adaptation techniques were evaluated within this
framework.

The basic functional components of the framework can be introduced by way
of the example illustrated in the block diagram of Figure 1. Figure 1 shows a
typical interaction that allows a client to issue a voice query that retrieves the
contents of a URL on a remote web server. In this example, the recognition
result is not returned to the client directly, but is instead acted upon by the
DSEM server to produce the final result. The sequence of steps in such an
interaction can be summarized as follows:

Establishing DSEM Connection The interaction begins with the client estab-
lishing a connection with the DSEM server. Upon accepting the connection,
the DSEM server creates a special session handler, labeled “SES” in Figure 1,
for that connection and adds the client’s socket to its internal dispatch ta-
ble. The session handler performs two functions. First, it houses application
specific processing such as determining which other handlers to invoke. Other
handlers in this example may include ASR handlers and HTTP handlers. Sec-
ond, it provides a place to store session state that spans more than one
request.

Generating a Voice Query The user then issues a voice query which is streamed
to its session handler on the DSEM server. The query is streamed using a
custom protocol which typically includes the type of query (e.g., the URL to
fetch, the database query to perform, etc.) and can also include ASR related
parameters such as audio coding and language model.

Creating an ASR Handler The DSEM dispatcher, which is responsible for
detecting and routing all of the system’s IO events, detects activity on the
client’s socket and notifies its session handler to process any incoming data.
In this example, the session handler creates an ASR handler to process the
audio stream and registers interest in the ASR handler’s output. Among other
things, this may include a recognition string or word lattice produced by the
ASR decoder.


Initializing the ASR Handler Upon activation, the ASR handler fetches client
specific parameters from the user database. These may include, for example,
the user specific acoustic feature space adaptation and normalization parame-
ters that are discussed in Section 5. It also acquires a decoder proxy, which is
a local representation of a decoder process potentially residing on another ma-
chine, from the decoder proxy cache. If there are no free decoder processes, the
proxy enters “buffering mode”. The ASR handler registers its decoder proxy’s
socket with the DSEM dispatcher to receive notification when decoder IO is
detected.

Generating and Buffering Acoustic Features Each portion of the audio stream
received by the client session handler is forwarded to the ASR handler which
may perform feature analysis and implement user-specific acoustic feature
space normalizations and transformations. If the decoder proxy created in the
previous step acquired an actual decoding process then the computed cepstrum
vectors are streamed directly to that process. If no decoding processes were
available, the vectors are buffered and transmitted when a decoder is freed.
The proxy cache provides a signal scheme that alerts its proxies when this
occurs.

Obtaining ASR Results and Releasing the ASR Decoder An ASR decoder pro-
cess produces a result and transmits it to the DSEM server. The DSEM dis-
patcher detects this event and notifies the associated ASR handler which ex-
tracts the recognition string from the decoder proxy and reports it to the
session handler. Once the session handler receives the recognition string, the
ASR handler unregisters itself from the DSEM dispatcher and releases its
decoder back to the cache.

Issuing Query to Web Server With the recognition string in hand, the session
handler creates an HTTP handler, registers interest in the HTTP handler’s
output and issues a query to a remote web server. The prototype application
implemented in this work uses this technique to retrieve employee information
from AT&T’s intranet site.

Sending Result to Client When the web server responds to the HTTP request
the HTTP handler processes the reply and notifies the session handler which
in turn sends the result to the waiting mobile client.
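
The decoder proxy cache and its buffering mode, as used in the steps above, can
be sketched as follows. This is a simplified single-threaded illustration consistent
with the event-driven model of Section 3.1.1, not the actual DSEM code; class and
method names such as DecoderProxyCache and attach are hypothetical.

    from collections import deque

    class DecoderProxy:
        """Local stand-in for a decoder process that may reside on another machine."""
        def __init__(self):
            self.decoder = None                  # remote decoder handle, once acquired
            self.buffer = []                     # feature frames held while buffering

        def push_features(self, frames):
            if self.decoder is None:
                self.buffer.extend(frames)       # "buffering mode": no decoder yet
            else:
                self.decoder.send(frames)        # stream directly to the decoder

        def attach(self, decoder):
            """Signalled by the cache when a decoder process becomes free."""
            self.decoder = decoder
            if self.buffer:
                decoder.send(self.buffer)        # flush buffered features in a burst
                self.buffer = []

    class DecoderProxyCache:
        def __init__(self, decoders):
            self.free = deque(decoders)          # idle decoder processes
            self.waiting = deque()               # proxies currently buffering

        def acquire(self):
            proxy = DecoderProxy()
            if self.free:
                proxy.attach(self.free.popleft())
            else:
                self.waiting.append(proxy)       # proxy starts out in buffering mode
            return proxy

        def release(self, decoder):
            """Return a decoder to the cache and signal the next buffering proxy."""
            if self.waiting:
                self.waiting.popleft().attach(decoder)
            else:
                self.free.append(decoder)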

Fig. 2. DSEM server performance for an eight server installation plotted with re-
spect to the number of concurrent clients. a) Average response latency measured in
seconds between the time a client submits an ASR request and a result is returned
by the DSEM. b) Average server throughput computed as the number of completed
recognition transactions per second.

One of the key assumptions of the above framework is that it is impractical to
permanently assign and adapt an ASR decoder and model to a specific client.
Typical ASR implementations require the use of large acoustic and language
models which, if speaker and environment independent, can be preloaded and
efficiently shared across multiple instances of a decoder, drastically reducing
the cost of an ASR deployment. This is typically done by memory mapping all
needed acoustic and language models on all ASR servers, subject to available
physical memory, and selecting between them at runtime (which involves little
overhead). Speaker adapted models, on the other hand, cannot be shared and
thus result in a substantial increase in the amount of memory required on a
decoding server which is servicing several clients. In order to enjoy the ben-
efits of shared models and client specific adaptation the proposed framework
implements all acoustic modeling techniques as feature space normalizations
and transformations in the DSEM server. This issue is addressed further in
Section 5.
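
As a concrete illustration of the sharing argument, read-only model files can be
memory mapped so that every decoder instance on a server works from the same
physical pages. The sketch below is generic, uses a hypothetical file name, and is
not tied to any particular decoder's model format.

    import mmap

    # Map a (hypothetical) acoustic model file read-only. Because the mapping is
    # shared and never written, all decoder processes that map the same file reuse
    # the same physical pages, so N decoders do not need N copies of the model.
    with open("acoustic_model.bin", "rb") as f:
        model = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        header = model[:16]      # decoders touch model data on demand via page faults
        model.close()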


3.2   Performance Evaluation


An experimental study was performed to demonstrate the throughput of the
framework described in Section 3.1. The goal of the study was to measure both
the throughput maintained by the DSEM server and the latencies that would
be observed by users of the associated mobile ASR services as the number of
users making simultaneous requests increased into the hundreds of users. The
study was performed by simulating many clients interacting with the DSEM
and performing the following interaction:

• Each client streamed an 8-bit, 8-kHz speech request to the DSEM server.
  Each request consisted of a 1.5 second utterance corresponding to a query
  to an AT&T interactive dialog application.
• The DSEM server performed acoustic feature analysis and streamed features
  to an available ASR decoder. When a decoder was not available, the DSEM
  server buffered features.
• When a decoder was released and made available to the decoder proxy cache,
  the DSEM streamed buffered features in a burst and streamed subsequent
  features (if any) as they arrived. The decoder returned a decoded result to
  the DSEM server which in turn forwarded the result to the waiting client.

The infrastructure used for the study included eight 1 GHz Linux ASR servers,
each running four instances of the AT&T Watson ASR decoder, and a single
1 GHz Linux DSEM server with 256 MB of RAM. Figure 2a illustrates
the effect on response latency as the number of concurrent clients increases.
Response latency was calculated as the interval in seconds between the time
that the last sample of a speech request was sent by the client and the time
that the recognition result was returned to the client by the DSEM server.
The plot shows a relatively constant latency when the number of clients is less
than 128 and a gracefully degrading response latency as the number of clients
is increased. In addition, note the slight increase between 32 and 128 clients:
as the number of clients exceeds the number of available decoders the DSEM
buffers features and transmits them in bursts to an available decoder. The
fact that the audio streams are transmitted all at once and that the decoding
task typically ran at better than real time (at most 1/4 real time) helped to
minimize latency in that range. After the number of clients exceeds 128, the
delay imposed on clients by the DSEM decoder wait queue and the overhead
of the DSEM server itself begins to dominate. A more thorough investigation
of this effect could shed some light on the relative importance of audio arrival
rate to decoder performance.

Figure 2b illustrates the effect on server throughput as the number of concur-
rent clients increases. Throughput was calculated as the number of completed
recognition transactions per second. The plot in Figure 2b demonstrates that
throughput gradually increases until the server’s peak capacity is reached at a
point corresponding to 128 clients and remains relatively constant even as the
number of clients far exceeds this peak capacity. Again, the buffering of fea-
tures in the DSEM server provides a throughput benefit beyond the expected
32 recognitions/second.



4   Allocation Strategies for ASR Resources


The problem of efficient assignment of ASR decoders to computing resources
in client-server frameworks like the DSEM is addressed in this section. The
section begins by providing basic definitions of a call in the context of human-
machine dialogs and quality of service for ASR servers. Next, a simple theoreti-
cal model for efficient ASR resource allocation is presented. This model is used
to predict the total number of users that can be supported by the proposed
framework under different assumptions while maintaining a given quality of
service. Finally, the theoretical performance and actual performance of the
model evaluated on a large vocabulary dialog task running on a deployment
with ten ASR servers is presented.




4.1   Multi-User ASR Scenario


There are several assumptions that are made in this work concerning the means
by which a user interacts with a speech dialog system and how both ASR
quality of service (QoS) and system overload are defined. The most general
assumption about the overall implementation is that calls are accepted from
multiple users and are serviced by pools of ASR servers each of which can
return a recognition string for any given utterance with some latency. The
manner in which these ASR servers are allocated is described in Section 4.2.

A typical interaction, or call, in human-machine dialog applications consists of
several steps. The user first establishes a channel with the dialog system over
a public switched telephone network (PSTN) or VoIP connection. Once the
channel is established, the user engages in a dialog that consists of one or more
turns during which the user speaks to the system and the system responds
with information, requests for disambiguation, confirmations, etc. During the
periods in which the system issues prompts to the user, the user will generally
remain silent and the system will be mostly idle with respect to that channel.
Finally, when the user is done, the channel is closed and the call is complete.

The quality of service (QoS) of an overall implementation is defined here in
terms of the latency a system exhibits in generating a recognition result for an
utterance. For utterances processed on a server, there are a number of factors
that contribute to this latency. When the multi-server system is operating at
near peak capacity, the number of concurrent utterances, or utterance load,
the server is handling can be the dominant factor. The focus of this paper rests
on the observation that, irrespective of all other factors, implementing simple
strategies for reducing the instantaneous load on ASR servers will result in a
significant decrease in the average response latency observed by the user.

A server’s maximum utterance load is defined here as the maximum number
of concurrent utterances which can be processed with an acceptable average
response latency. A server that handles more than its maximum utterance
load is said to be overloaded.


4.2     ASR Resource Allocation Strategies


Two strategies are presented for allocating ASR servers to incoming calls. It
will be shown that an intelligent approach for allocating utterances to servers
in a typical multi-server deployment can dramatically reduce the incidence of
overload with respect to more commonly used allocation strategies.


4.2.1    Call-Level Allocation

A common approach for indirectly balancing the utterance load across the
hardware resources an allocator has at its disposal is call-level allocation. Us-
ing this approach, an allocator assigns a call to an ASR process running on
a decoding server for the duration of the call. This process is responsible for
all feature extraction, voice activity detection, and decoding. For example,
consider the hardware configuration shown in Figure 3 that illustrates a typi-
cal setup where a source of call traffic (a PBX, or VoIP gateway) routes user
request streams to ASR processes residing on two servers. The figure depicts
six calls where each call consists of intervals of speech or silence denoted by
colored and uncolored blocks, respectively. As calls arrive, a simple allocator
tracks the number of calls on each server and ensures that they all handle an
equal number of calls.

However, as the number of calls handled by an ASR deployment increases,
use of such a simple allocator can lead to an unacceptably high utterance load
on some servers even when other servers are underutilized. In Figure 3 we see
that during the first and second intervals the first server will need to handle
an utterance load of 3 even though the second server is only handling a load
of 1. Assuming that the maximum utterance load for each server is 2 and as-
suming that the processing of each utterance requires identical computational
complexity, the first server will be overloaded. If the computational complex-
ity of the ASR task is sufficiently high, this may result in unacceptably high
latencies for users assigned to the overloaded first server.

A simple probabilistic argument can be made that generalizes the example
to an arbitrary deployment and makes this deficiency explicit. Assume, for
simplicity, that each utterance is of some fixed duration, d, and each call is
of some fixed duration, D. A call is then assumed to consist of L randomly
occurring utterances so that at any time t, the probability that an utterance
is active is given by

     p_t = L \frac{d}{D} .                                                    (1)

Fig. 3. Example of call-level allocation showing six calls being routed directly to two
ASR servers. Individual utterances are shown as colored blocks within each call.

If we assume that a server that handles an utterance load of more than Q is
overloaded, then the probability of overload if it services M calls, with M > Q,
is given by
     P_q = \sum_{k=Q+1}^{M} \binom{M}{k} p_t^k (1 - p_t)^{M-k} .              (2)


This is simply the probability that more than Q users out of M calls on
a server will speak at any given moment. This probability is obviously zero
when the server is handling Q calls or less. The probability Pq can then be
used to calculate the probability, PC , that one or more servers in a deployment
of S servers (with S > 1), each handling M calls, will be overloaded.

     P_C = 1 - (1 - P_q)^{S}                                                  (3)


In Section 4.3, Equations 2 and 3 will be used to determine the number of
calls, M , that can be supported by the call-level allocation strategy when the
probability of overload, PC, is fixed at an acceptable value. It will be shown
that the fundamental difficulty with this approach arises from the fact that
the call-level allocator knows nothing of what transpires within a call.


4.2.2   Utterance-Level Allocation

One way to reduce the probability of overload is to let the allocator look within
calls to determine when utterances begin and end. This additional information
can be used to implement an allocator that assigns computational resources
to utterances instead of entire calls. This will be referred to as utterance-
level allocation. Figure 4 illustrates this approach. In order to inspect the
audio stream of incoming calls the allocator is placed between the source of
call traffic and the ASR decoding servers. In addition, feature extraction and
voice activity detection are moved to the allocator so that it may determine
when utterances begin and end. Of course, it is possible to perform feature
extraction in several locations including the client, the allocator as shown here,
or in the ASR server. From this vantage point the allocator can keep track
of activity across the deployment and intelligently dispatch utterances and
balance the incoming utterance load. This allows that same deployment of S
servers to be viewed as a single virtual server that can handle an aggregate
utterance load of SQ concurrent utterances.

Fig. 4. An utterance level allocator looks within dialogs to determine when utter-
ances begin and end. This information is used to balance the load on decoding
servers.

Under this model, an overload on any server can only occur if more than SQ
utterances are active, an event
that is considerably less likely than any individual server being overloaded.
More specifically, for a deployment handling SM calls, with SM > SQ, the
probability, PU , that an overload will occur is given by
     P_U = \sum_{k=SQ+1}^{SM} \binom{SM}{k} p_t^k (1 - p_t)^{SM-k}            (4)


Equation 4 will be used in Section 4.3 to determine the number of calls that can
be supported by the utterance-level allocation strategy when the probability
of overload, PU , is fixed at an acceptable value.
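
A corresponding sketch for utterance-level allocation, together with a helper that
finds the largest call load a strategy can carry at a fixed overload probability, is
given below. It reuses prob_deployment_overload_cla from the previous sketch.
With the configuration used in Figure 6 (S = 10 servers, Q = 4 utterances per
server, p_t = 1/3), it reproduces the roughly two-to-one advantage of
utterance-level over call-level allocation reported in Section 4.3.

    from math import comb

    def prob_deployment_overload_ula(S, M, Q, pt):
        """Equation 4: overload probability when S servers act as one virtual server
        with capacity SQ, handling SM calls under utterance-level allocation."""
        SM, SQ = S * M, S * Q
        return sum(comb(SM, k) * pt**k * (1 - pt)**(SM - k)
                   for k in range(SQ + 1, SM + 1))

    def max_calls(overload_prob, target, S, Q, pt):
        """Largest total call load whose overload probability stays at or below target."""
        M = Q                                    # overload probability is zero here
        while overload_prob(S, M + 1, Q, pt) <= target:
            M += 1
        return S * M

    # Figure 6 configuration: ten servers, four utterances each, speech active 1/3 of a call.
    S, Q, pt, target = 10, 4, 1.0 / 3.0, 0.1
    cla_calls = max_calls(prob_deployment_overload_cla, target, S, Q, pt)
    ula_calls = max_calls(prob_deployment_overload_ula, target, S, Q, pt)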

Note that although the allocator in this scenario acts as a gateway to the
decoding servers, it is generally not a bottleneck because the processing required
to detect utterances is very small [20]. However, the allocator must monitor all
call traffic and could itself become a bottleneck; the effect of introducing such
an allocator is examined in Section 4.3.


4.2.3   Refining Utterance-Level Allocation

Incorporating knowledge of additional sources of variability in ASR computing
effort can further improve the efficiency of multi-user ASR deployments. Two
examples of these sources of variability are illustrated by the plots displayed
in Figure 5. The first is the high variance in computational load exhibited by
a decoder over the length of an utterance. It is well known that the instan-
taneous branching factor associated with a given speech recognition network
can vary considerably. This fact, coupled with the pruning strategies used in
decoders, results in a large variation in the number of network arcs that are
active and must be processed at any given instant. This is illustrated by the
plot in Figure 5a which displays the number of active network arcs in the
decoder plotted versus time for an example utterance in a 4000 word proper
name recognition task. The plot demonstrates the fact that the majority of
the computing effort in such tasks occurs over a fairly small portion of the
utterance. Knowledge of this time dependent variability in the form of sample
distributions could potentially be used to allocate utterances such that peak
processing demands do not overlap.




Fig. 5. a) Computational effort measured as the number of active arcs versus time for
an example utterance from the proper names recognition task. b) The distribution
of the ratio of decoding time to audio duration (CPU vs. audio) for test utterances
taken from a digit recognition task and from c) an LVCSR task.

The second source of variability comes from the variation in computational
complexity that exists between different ASR tasks. This is illustrated by the
histograms displayed in Figures 5b and 5c. The plots display the distribution of
average computational effort measured as the ratio of the decoding time to the
utterance duration. The distributions correspond to continuous digit and large
vocabulary continuous speech recognition (LVCSR) tasks with means of 0.022
and 0.44 respectively on a 2.6 GHz server. As would be expected, the high
perplexity stochastic speech recognition network associated with the LVCSR
task demands a higher and more variable level of computational resources than
the small vocabulary deterministic network. Distributions characterizing this
inter-task variability could be incorporated into server allocation strategies.
In addition to the obvious efficiency improvements beyond those discussed in
Section 4.2.2, servers with large CPU caches can be dedicated to a single ASR
task to achieve improved cache utilization.



4.3   Experimental Results

This section presents the results of two comparisons of the call-level allocation
(CLA) and utterance-level allocation (ULA) strategies. The first compares the
efficiencies of the CLA and ULA strategies that are predicted by the model
presented in Sections 4.2.1 and 4.2.2. The second compares the two strategies
using an actual deployment where ASR decoders are run on multiple servers
processing utterances from an LVCSR task.

A comparison of the efficiencies as predicted by the model can be made by
plotting the number of incoming calls with respect to the probabilities of
overload, PC in Equation 3 for the CLA strategy and PU in Equation 4 for the
ULA strategy. The difference in overall efficiency for the two strategies can be
measured as the difference between the number of calls that are supported at
a given probability of overload.

Figure 6 shows a plot of this comparison for an example where the multiuser
configurations illustrated in Figures 3 and 4 are configured with ten ASR
servers each of which can service a maximum of four simultaneous utterances
without overload. It is also assumed that, on the average, there are active
utterances to be processed by an ASR server for only one third, pt = 1/3, of
the total duration of a call. It is clear from Figure 6 that, at a probability
of overload equal to 0.1, the utterance-level allocation strategy can support
approximately two times the number of calls that can be supported by the
call-level allocation strategy.

A comparison of the efficiencies that are obtainable in an actual deployment
was made using the DSEM framework that was evaluated in Section 3. The
framework was configured with ten 1 GHz Linux based servers running in-
stances of the AT&T Watson ASR decoder. Calls were formed from utterances
that were natural language queries to a spoken dialog system with speech ac-
tive for an average of 35 percent of the total call duration and each server
able to service approximately two simultaneous utterances without overload.
A rather aggressive load of four hundred of these calls was presented simulta-
neously to the multi-user system.

Fig. 6. Number of calls supported by CLA and ULA strategies using ten simulated
ASR servers. The curves are plotted versus probability of overload predicted by PC
in Equation 3 for CLA and PU in Equation 4 for ULA.

An overall performance measure was used that is derived from the latency based
QoS defined in Section 4.1. For
a given number of incoming calls, a count is obtained for the percentage of
utterances where the latency in generating a recognition result falls below a
specified threshold. Figure 7 shows a plot of these percentages plotted versus
the threshold that is placed on the maximum response latency. The maximum
response latency ranges from 0.5 to 3.0 sec. Curves are shown for both the
CLA and ULA strategies.

The system implemented with the ULA strategy is shown in Figure 7 to sup-
port a significantly larger call load than the CLA system. It can be seen that
the ULA strategy is able to service approximately twice as many requests with
a one second maximum latency.



5   Robust Modeling Techniques in Client-Server Scenarios


This section describes the application of acoustic adaptation and feature nor-
malization procedures in the context of the DSEM client-server based ASR
framework described in Section 3. This class of procedures are in general car-
ried out in two steps where parameters are first estimated from adaptation
data and these parameters are then applied as transformations in the acoustic
feature space or the HMM model space. A discussion of how the client-server
framework impacts the implementation of this class of procedures will be fol-
lowed by a description of the implementation of three well-known approaches
to feature space adaptation / normalization. The implementation of these ap-
proaches was evaluated on a task where users fill in “voice fields” that appear
on the display of a mobile hand-held device. The evaluation was performed
under a scenario where unsupervised estimation of adaptation parameters was
performed from user utterances collected during the normal use of the hand-
held device.

Fig. 7. Percentage of actual calls serviced within specified latencies for CLA and
ULA strategies. Measurements were made on an actual server deployment consisting
of ten servers with calls formed from natural language queries to a spoken dialog
system.


5.1   Adaptation within the DSEM Framework


Client-server communications frameworks like the DSEM impact the imple-
mentation of these algorithms in several ways. First, the dynamic assignment
of ASR decoders to individual utterances makes it very difficult in practice to
configure the acoustic HMMs associated with these decoders to a particular
user. As each individual utterance is shipped to one of multiple servers as part
of the utterance level ASR server allocation strategy, the server will suffer the
overhead of loading a user specific HMM model or adapting the parameters of
a task independent model for that user. Since it is not unusual for the HMM
for a given server installation to be composed of tens of thousands of states
and hundreds of thousands of Gaussian densities, this overhead can be sub-
stantial. This process would have to be repeated for each utterance that is
routed to that server.

The second impact of the DSEM arises from the fact that, as argued in Sec-
tion 3, it is ideally suited for operating on multiple channels of input speech
data. Communications frameworks like the DSEM facilitate the implementa-
tion of low complexity feature space transformations and normalization pro-


                                      19
cedures for a large number of concurrent clients. The plot in Figure 2 demon-
strates that it is possible to route a large number of client utterances to ASR
servers while still maintaining acceptable user response latencies. As a result,
relatively low complexity feature space transformation and normalization pro-
cedures can easily be applied within the DSEM framework with little impact
on overall system performance.

A third impact of the DSEM arises in the implementation of “personalized
services” where state information relating to individual users can be stored
within the server installation. Acoustic compensation parameters can be es-
timated off–line from adaptation utterances, or statistics derived from those
utterances, that have been collected from users’ previous interactions with
voice enabled services that are supported by the installation. The advantages
of this paradigm for adaptation, in terms of both providing sufficient adaptation
data and minimizing computational complexity during recognition, are well
known. First, individual input utterances can be very short, sometimes single
words, and can by themselves be insufficient for robust parameter estimation.
Second, the computational complexity associated with the estimation of
parameters for many adaptation / normalization techniques
could overwhelm the DSEM if performed at recognition time.




Fig. 8. The role of the DSEM framework in ASR feature adaptation and normal-
ization.

5.2   Algorithms


This section describes the robust acoustic compensation algorithms that are
implemented within the DSEM framework. The algorithms that are applied
here include frequency warping based speaker normalization [13], constrained
model adaptation (CMA) and speaker adaptive training (SAT) [9], and cepstrum
mean and variance normalization. They were applied to compensate utterances
spoken into a far-field device mounted microphone with respect to acoustic
HMM models that were trained in a mismatched acoustic environment.
Normalization/transformation parameters were estimated using any-
where from approximately one second to one minute of speech obtained from
previous utterances spoken by the user of the device. All of these techniques
were applied in the context of mel frequency cepstrum coefficient (MFCC)
feature analysis. The Davis and Mermelstein triangular weighting functions
with center frequencies spaced on a mel frequency scale were applied as a
filter-bank to the 8 kHz bandwidth magnitude spectrum [6].

The first technique is frequency warping based speaker normalization [13].
Several definitions have been proposed for warping functions that can be ap-
plied to warping the frequency axis in ASR feature analysis and there have
been several techniques proposed for estimating an optimum warping function
from adaptation data [18,13]. In this work, warping is performed by selecting
a single linear warping function, α, from a W length ensemble of candidate
warping functions using the adaptation utterances for a given speaker to max-
imize the likelihood of the adaptation speech with respect to the HMM. This
ensemble of warping functions typically consists of approximately W = 20
linearly spaced values and corresponds to a compression or expansion of the
frequency axis of from ten to twenty percent. Then, during speech recognition
for that speaker, the warping factor is retrieved and applied to scaling the
frequency axis in mel-frequency cepstrum coefficient (MFCC) based feature
analysis [13]. During acoustic model training, a “warped HMM” is trained by
estimating optimum warping factors for all speakers in the training set and
retraining the HMM model using the warped utterances.
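
A schematic of the warping factor selection described above is sketched below.
The function passed as score_fn stands in for the decoder's warped MFCC front
end and HMM scoring routines, and the numeric search range is illustrative; only
the grid search over the W candidate warping factors follows the text.

    import numpy as np

    def select_warp_factor(adaptation_utterances, score_fn, W=20, lo=0.8, hi=1.2):
        """Grid search over a W length ensemble of linear warping factors; the factor
        that maximizes the likelihood of the speaker's adaptation speech is kept.

        score_fn(utterance, alpha) is assumed to return the HMM log likelihood of an
        utterance after warped MFCC analysis with factor alpha; it stands in for the
        decoder's actual front end and scoring routines."""
        candidates = np.linspace(lo, hi, W)      # linearly spaced candidate warps
        best_alpha, best_score = 1.0, -np.inf
        for alpha in candidates:
            score = sum(score_fn(utt, alpha) for utt in adaptation_utterances)
            if score > best_score:
                best_alpha, best_score = alpha, score
        return best_alpha                        # stored per user, reused at recognition time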

There are several regression based adaptation algorithms that obtain maxi-
mum likelihood estimates of model transformation parameters. The techniques
differ primarily in the form of the transformations. Constrained model space
adaptation (CMA) is investigated here [9]. CMA estimates a model transfor-
mation {A, b} to an HMM, λ, with means and variances µ and Σ, to create
updated mean and variance,

     \hat{\mu} = A \mu - b , \qquad \hat{\Sigma} = A \Sigma A^{T} .           (5)


These parameters are estimated to maximize the likelihood of the adaptation
data, X, P (X|λ, A, b) with respect to the model, λ. The term “constrained”
refers to the fact that the same transformation is applied to both the model
means and covariances. Since the variances are transformed under CMA, it
is generally considered to have some effect in compensating with respect to
environmental variability, which is generally characterized by additive noise,
as well as speaker and channel variability.

An important implementation aspect of CMA is that this model transforma-
tion is equivalent to transforming the feature space, \hat{x}_t = A x_t + b. It is applied
during recognition to the d = 39 component feature vectors, xt , t = 1, . . . , T ,
composed of cepstrum observations and the appended first and second order
difference cepstrum. Speaker adaptive training was also used for training the
original acoustic model. In one implementation of SAT, an HMM is trained
by estimating an optimum CMA transform for each speaker in the training
set and retraining the HMM model using the transformed utterances [9]. This
provides a more “compact” HMM model and results in improved performance
when CMA is applied during recognition.
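
Because the constrained transformation can be applied entirely in the feature
space, applying a stored CMA transform at recognition time reduces to one
matrix-vector product and one addition per frame. The following NumPy sketch
applies the feature-space form \hat{x}_t = A x_t + b to d = 39 dimensional feature
vectors; the identity transform used in the example is only a placeholder.

    import numpy as np

    def apply_cma(features, A, b):
        """Apply a constrained model adaptation transform in the feature space,
        x_hat_t = A x_t + b, to every frame of an utterance.

        features: (T, d) array of cepstra plus first and second order differences, d = 39
        A: (d, d) regression matrix, b: (d,) bias, both estimated off-line per user."""
        return features @ A.T + b

    # Example with placeholder parameters: an identity transform leaves frames unchanged.
    d, T = 39, 120
    A, b = np.eye(d), np.zeros(d)
    frames = np.random.randn(T, d)
    adapted = apply_cma(frames, A, b)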

Cepstrum mean normalization (CMN) and cepstrum variance normalization
(CVN) were also applied under a similar scenario as the algorithms described
above. Normalization vectors, \tilde{\mu} and \tilde{\sigma} respectively, were computed from adap-
tation utterances for each speaker and then used to initialize estimates of nor-
malization vectors for each input utterance. The incorporation of additional
speech data provided by this simple modification to standard cepstrum nor-
malization procedures had a significant impact on ASR performance.
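
One plausible realization of this initialization is sketched below: speaker-level
mean and variance vectors computed off-line are blended with the statistics of
each incoming utterance using a simple frame-count weighting. The blending
rule is an assumption made for illustration; the paper does not specify the exact
combination used.

    import numpy as np

    def normalize_utterance(frames, mu_spk, var_spk, prior_frames=1000):
        """Cepstrum mean and variance normalization initialized from speaker-level
        statistics (mu_spk, var_spk). Per-utterance statistics are blended with the
        speaker-level vectors using a frame-count weighting; this particular weighting
        is an illustrative assumption, not the rule used in the paper.

        frames: (T, d) cepstral feature matrix for one utterance."""
        T = frames.shape[0]
        mu_utt = frames.mean(axis=0)
        var_utt = frames.var(axis=0)
        w = T / (T + prior_frames)                   # short utterances lean on the prior
        mu = w * mu_utt + (1 - w) * mu_spk
        var = w * var_utt + (1 - w) * var_spk
        return (frames - mu) / np.sqrt(var + 1e-8)   # subtract mean, scale by std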


5.3   The Acoustic Reconfiguration Server


A block diagram of the architecture that has been realized for acoustic fea-
ture space adaptation/normalization within the DSEM framework is shown
in Figure 8. It relies on acoustic transformations being applied to recognition
utterances as they are routed by the DSEM from clients to ASR decoders.
Figure 8 depicts acoustic feature analysis and feature space transformations
being performed within the DSEM. It also depicts the storage of the user
specific acoustic parameters needed to implement these transformations. This
includes speech data taken from previous utterances from a given user, tran-
scriptions or word lattices produced by an ASR decoder from these utterances,
and partial statistics that have been accumulated by parameter estimation al-
gorithms. Finally, a “reconfiguration server” that is invoked by the DSEM for
off-line estimation of adaptation parameters is also depicted in the figure. Of
course, as discussed in Section 2, the architecture shown in Figure 8 represents
one of many possible ways for distributing functionality between client and
server. Many of the following arguments still apply if, for example, the feature
analysis is performed on the client instead of within the DSEM.

One motivation for the architecture in Figure 8 is the need to minimize com-
putational complexity during recognition. Applying frequency warping based
speaker normalization during recognition requires no additional operations per
frame. It can be implemented in this scenario simply by swapping in the MFCC
filterbank corresponding to the given warping function α. CMA requires d^2
operations per frame corresponding to
multiplying observation vectors by the regression matrix, A. CMN and CVN
require only d operations per frame associated with subtracting the mean
from feature vectors and scaling by the inverse variance. It has been found in
practice that the additional computation associated with applying these trans-
formations during recognition has minimal impact on the overall throughput
as characterized by the plots in Figure 2.

Parameter estimation for all of the above procedures can be performed in an
“incremental mode” where partial statistics are accumulated across multiple
utterances. These partial statistics can be used as the next incremental update
of the parameters as additional data becomes available. The per-user storage
and computational requirements for feature space adaptation techniques like
CMA can be fairly heavy. The computational load is dominated by an itera-
tive matrix inversion that requires d^4 operations per iteration and the partial
statistics require on the order of d^3 floating point locations for storage. Ob-
taining a maximum likelihood estimate of the frequency warping function, α,
can also be computationally intensive, requiring on the order of T ∗ W ∗ d op-
erations where T is the total number of adaptation frames and W is the size
of the warping ensemble. To deal with this additional computational complex-
ity, the DSEM invokes the “reconfiguration server” in Figure 8 at infrequent
intervals to estimate the adaptation and normalization parameters for speaker
S_i : \{\alpha_i, \tilde{\mu}_i, \tilde{\sigma}_i, A_i, b_i\}. It is assumed that the DSEM has continually augmented
the data storage for speaker Si with speech and ASR transcriptions collected
from previous utterances. The reconfiguration server produces the updated
partial statistics and the updated parameters. The following section addresses
the potential gains in WAC that are obtainable from the scenario given in
Figure 8 for a typical application and addresses the frequency with which the
off-line parameter estimation procedures should be invoked.
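
The division of labor between the DSEM and the reconfiguration server can be
summarized with a small sketch of the per-user state and the infrequent update
step. The field names and the estimate_* callables are placeholders standing in
for the warping, normalization, and CMA estimators of Section 5.2.

    from dataclasses import dataclass, field

    @dataclass
    class UserProfile:
        """Per-user state held by the DSEM (Figure 8): accumulated adaptation data,
        partial statistics, and the current feature-space parameters."""
        utterances: list = field(default_factory=list)   # audio plus ASR transcriptions
        stats: dict = field(default_factory=dict)        # partial sufficient statistics
        alpha: float = 1.0                               # frequency warping factor
        mu: object = None                                # CMN vector
        sigma: object = None                             # CVN vector
        A: object = None                                 # CMA regression matrix
        b: object = None                                 # CMA bias vector

    def reconfigure(profile, estimate_warp, estimate_norms, estimate_cma):
        """Invoked by the DSEM at infrequent intervals. Each estimate_* callable is
        assumed to fold the new utterances into the partial statistics incrementally
        and return updated parameters."""
        profile.alpha = estimate_warp(profile.utterances, profile.stats)
        profile.mu, profile.sigma = estimate_norms(profile.utterances, profile.stats)
        profile.A, profile.b = estimate_cma(profile.utterances, profile.stats)
        profile.utterances.clear()       # raw data is now folded into the statistics
        return profile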


5.4   Experimental Study


The feature normalization/adaptation algorithms described in Section 5.2
were used to reduce acoustic mismatch between task independent HMM mod-
els and utterances spoken through a Compaq iPAQ hand-held device over the
distributed framework described in Section 3. This section describes the sce-


                                        23
nario under which the algorithms were evaluated, the speech database, and
the experimental study.

The dataset for the study included a maximum of 400 utterances of proper
names per speaker from a population of six speakers. The utterances were spo-
ken through the device mounted microphone on the hand-held device in an
office environment. Since the data collection scenario also involved interacting
with the display on the hand-held device, a distance of approximately
0.5 to 1.0 meters was maintained between the speaker and the microphone.
The first 200 utterances for each speaker were used for estimating the parame-
ters of the normalizations and transformations described in Section 5.2. After
automatic end-pointing, this corresponded to an average of 3.5 minutes of
speech per speaker. The remaining 1200 utterances, corresponding to isolated
utterances of last-names (family names) from the six speakers, were used as a
test set for the experimental study described below.

A baseline acoustic hidden Markov model (HMM) was trained from 18.4 hours
of speech which corresponds to 35,900 utterances of proper names and general
phrases spoken over wire-line and cellular telephone channels. After decision
tree based state clustering, the models consisted of approximately 3450 states
and 23,500 Gaussian densities.

In order to evaluate the effect of acoustic level and task level mismatch on
this baseline model, ASR word error rates (WER) were evaluated on several
speech corpora. The first corpus included 1000 utterances of proper names spo-
ken as first-name last-name pairs that were collected over wire-line telephone
channels. A WER of 4.8 percent was obtained for this corpus. The second cor-
pus included isolated telephone bandwidth utterances of last-names that were
collected from a different population of speakers over a close-talking noise-
canceling microphone [21]. A significant increase in WER to 26.1 percent
was obtained for this corpus, which was largely due to the more difficult task
of recognizing isolated last-name utterances rather than first-name, last-name
pairs. The third corpus, and the corpus that was used for the experiments
reported in Table 1, consisted of the isolated last-name utterances that were
spoken through a far-field device mounted microphone under the conditions
described above. A baseline WER of 41.5 percent was obtained for this corpus.

One can infer from these comparisons that both acoustic mismatch due to
the far-field microphone and lexical ambiguity due to the less constrained
recognition grammar combine to significantly degrade the baseline ASR per-
formance for this task. In any case, a 41.5 percent baseline WER is clearly
too high to be acceptable for realistic applications. One must be careful about
making general interpretations of performance gains achieved when the baseline
WER is high. However, it is not uncommon to find these levels of performance
degradation when application developers attempt to incorporate generic acoustic
models provided by speech technology vendors into a new domain. The goal of
the robust compensation algorithms applied here is to close the performance
gap between these scenarios.

It is important to note that this experimental study is by no means an exhaus-
tive evaluation of robust ASR techniques. Model based adaptation techniques
were not evaluated because, as mentioned in Section 5.1, it is not practical
for the ASR servers to dynamically load user-specific acoustic models for each
utterance in our multi-user framework. Furthermore, the channel robustness
techniques discussed in Section 2 and the large class of algorithms developed
specifically for dealing with environmental distortions have the potential to
improve ASR robustness in mobile domains. These techniques were not im-
plemented as part of this study mainly because it was felt that the speech
utterances used in this study were collected from a domain where these classes
of distortions had only marginal impact.

Table 1 displays the results of the experimental study as the WER resulting
from the use of each of the individual algorithms, where the parameters are
estimated from adaptation data of varying length. Columns
2 through 5 of Table 1 correspond to the WER obtained when 1.3, 6.8, 13.4,
and 58.2 seconds of speech data are used for speaker dependent parameter
estimation.

               Compensation      Ave. Adaptation Data Dur. (sec)
               Algorithm          1.3     6.8    13.4    58.2

               Baseline          41.5    41.5    41.5    41.5
               N                 40.2    37.2    36.8    36.8
               N+W               36.7    33.8    33.6    33.3
               N+W+C               –     35.0    32.3    29.8
               N+W+C+SAT           –     34.4    31.5    28.9
Table 1
WER (%) obtained using unsupervised estimation of mean and variance normal-
ization (N), frequency warping (W), and constrained model adaptation (C) pa-
rameters from varying amounts of adaptation data. SAT denotes the use of
speaker adaptive training.


There are several observations that can be made from Table 1. First, by com-
paring rows 1 and 2, it is clear that simply initializing mean and variance
normalization estimates using the adaptation data (N) results in a significant
decrease in WER across all adaptation data sets. Second, frequency warping
(W) is also shown to provide significant reduction in WER with the most dra-
matic reduction occurring for the case where an average of only 1.3 seconds
of adaptation data per speaker is used to estimate warping factors. Third, by
observing rows 4 and 5 of Table 1, it is clear that constrained model adapta-
tion (C) actually increases WER when the transformation matrix is estimated
from less than 13.4 seconds of adaptation data. However, significant WER
reductions were obtained as the adaptation data length was increased. It is
important to note that the over-training problem observed here, which arises
when adaptation parameters are estimated from insufficient data, is well known. Future
work will investigate the use of procedures that prevent over-training by in-
terpolating counts estimated on a small adaptation set with those obtained
from other sources of data [10].
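
A minimal sketch of one way such count interpolation could look, assuming a
fixed prior weight tau (the weighting rule below is illustrative and is not the
discounted likelihood procedure of [10]), blends the adaptation-set mean with a
task-independent prior so that small adaptation sets fall back toward the prior.

import numpy as np

def interpolate_mean(adapt_sum, adapt_count, prior_mean, tau=500.0):
    # Blend the adaptation-data mean with a task-independent prior mean.
    # With few adaptation frames the estimate stays near the prior; as the
    # frame count grows it converges to the adaptation-data estimate.
    weight = adapt_count / (adapt_count + tau)
    return weight * (adapt_sum / adapt_count) + (1.0 - weight) * prior_mean

# At a 10 ms frame rate, 1.3 s of speech is roughly 130 frames (weight ~ 0.21
# with tau = 500), while 58.2 s is roughly 5820 frames (weight ~ 0.92).
prior = np.zeros(39)
mean_small = interpolate_mean(np.full(39, 0.5) * 130, 130, prior)
mean_large = interpolate_mean(np.full(39, 0.5) * 5820, 5820, prior)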



6   Conclusions


This paper has addressed several important issues that are specific to the
implementation of ASR applications and services in client-server scenarios. It
was noted that a great deal of research has addressed issues relating to the
communications channels associated with distributed speech recognition
scenarios and methods for making individual ASR channels more
efficient. The techniques presented here, however, have addressed robustness
and efficiency issues strictly in the context of multi-user scenarios.

All of these techniques relied on the existence of an efficient framework for
client-server communications and for managing the resources associated with
human-machine dialog systems. It was shown in Section 3 that the DSEM
framework, built on an event-driven, non-blocking IO model with a single thread
for managing concurrently connected clients, was well-behaved even when
supporting many hundreds of clients. An architecture for implementing unsuper-
vised acoustic feature space adaptation and normalization in the context of
this framework was introduced. When approximately one minute of adaptation
utterances were used to estimate parameters for a combination of algorithms
in a large vocabulary name recognition task under this scenario, a 31% re-
duction in word error rate was obtained. Again using the DSEM framework,
the effect of using an intelligent scheme for allocating ASR decoders to ap-
plication servers in multi-user client-server deployments was demonstrated.
This scheme was shown to decrease average response latencies by more than a
factor of two when compared to a commonly used alternative approach for ASR
resource allocation.

With an expanding infrastructure of personal devices, communications net-
works, and server configurations, it is hoped that there will be increased inter-
est in addressing the problems of robust and efficient ASR that are relevant to
this infrastructure. It is often the case that very efficient single channel ASR
systems are applied in relatively inefficient client-server installations which do
not exploit the power of the underlying ASR technology. It is also the case
that many robust modeling techniques are not realizable in a given client-
server framework or are simplified to the point where they are less effective.
Hence, addressing these problems from the standpoint of multi-user distributed
scenarios may have a greater impact than incremental improvements in the
underlying single channel systems.



References


[1] G. Banga, J.C. Mogul, and P. Druschel. A scalable and explicit event delivery
    mechanism for UNIX. In Proc. USENIX 1999 Annual Technical Conference,
    June 1999.

[2] A. Bernard and A. Alwan. Joint channel decoding - Viterbi recognition
    for wireless applications. Proc. European Conf. on Speech Communications,
    September 2001.

[3] E. Bocchieri and B. Mak. Subspace distribution clustering hidden Markov
    model. IEEE Transactions on Speech and Audio Processing, 9(3):264–275,
    March 2001.

[4] Antonio Cardenal-Lopez, Laura Docio-Fernandez, and Carmen Garcia-Mateo.
    Soft decoding strategies for distributed speech recognition over IP networks.
    Proceedings of the International Conference on Acoustics, Speech, and Signal
    Processing, pages 49–52, May 2004.

[5] A. Chandra and D. Mosberger. Scalability of Linux event-dispatch mechanisms.
    Technical Report HPL-2000-174, Hewlett Packard Laboratory, 2000.

[6] S. B. Davis and P. Mermelstein. Comparison of parametric representations for
    monosyllabic word recognition in continuously spoken sentences. IEEE Trans
    on Acous. Speech and Sig. Proc., ASSP-28(4):357–366, 1980.

[7] ETSI TS 126 094 (2001-03). Universal Mobile Telecommunications System
    (UMTS); Mandatory speech codec speech processing functions AMR speech
    codec; Voice Activity Detector (VAD) (3GPP TS 26.094 version 4.00 Release
    4).

[8] T. Fingscheidt, S. Aalburg, S. Stan, and C. Beaugeant. Network-based versus
    distributed speech recognition in adaptive multi-rate wireless systems. Proc.
    Int. Conf. on Spoken Lang. Processing, pages 2209–2212, September 2002.

[9] M. J. F. Gales. Maximum likelihood linear transformations for HMM-based
    speech recognition. Computer Speech and Language, 12:75–98, 1998.

[10] A. Gunawardana and W. Byrne. Robust estimation for rapid speaker adaptation
     using discounted likelihood techniques. Proc. Int. Conf. on Acoust., Speech, and
     Sig. Processing, May 2000.


[11] H. K. Kim, R. V. Cox, and R. C. Rose. Bitstream-based front-end for wireless
     speech recognition in adverse environments. IEEE Trans. on Speech and Audio
     Processing - Special Issue on Speech Technologies for Mobile and Portable
     Devices, November 2002, to be published.

[12] Imre Kiss, Ari Lakaniemi, Cao Yang, and Olli Viikki. Review of AMR speech
     codec and distributed speech recognition-based speech-enabled services. Proc.
     IEEE ASRU Workshop, pages 613–618, December 2003.

[13] L. Lee and R. C. Rose. A frequency warping approach to speaker normalization.
     IEEE Trans on Speech and Audio Processing, 6, January 1998.

[14] Luke K. McDowell, Susan J. Eggers, and Steven D. Gribble. Improving server
     software support for simultaneous multithreaded processors. In Proc. Ninth
     ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming,
     June 2003.

[15] M. Mohri and M. Riley. Network optimization for large vocabulary speech
     recognition. Speech Communication, 25(3), 1998.

[16] S. Ortmanns, H. Ney, and T. Firzlaff. Fast likelihood computation methods
     for continuous mixture densities in large vocabulary speech recognition. Proc.
     European Conf. Speech Communication and Technology, September 1997.

[17] Vivek S. Pai, Peter Druschel, and Willy Zwaenepoel. Flash: An efficient and
     portable web server. In Proc. USENIX 1999 Annual Technical Conference, June
     1999.

[18] M. Pitz, S. Molau, R. Schluter, and H. Ney. Vocal tract normalization equals
     linear transformation in cepstral space. Proc. European Conf. on Speech
     Communications, September 2001.

[19] A. Potamianos and V. Weerackody. Soft-feature decoding for speech recognition
     over wireless channels.    Proceedings of the International Conference on
     Acoustics, Speech, and Signal Processing, pages 269–272, May 2001.

[20] R. C. Rose, I. Arizmendi, and S. Parthasarathy. An efficient framework for
     robust mobile speech recognition services. Proceedings of the International
     Conference on Acoustics, Speech, and Signal Processing, April 2003.

[21] R. C. Rose, S. Parthasarathy, B. Gajic, A. E. Rosenberg, and S. Narayanan.
     On the implementation of ASR algorithms for hand-held wireless mobile devices.
     Proc. Int. Conf. on Acoust., Speech, and Sig. Processing, May 2001.

[22] R. A. Sukkar, R. Chengalvarayan, and J. J. Jacob. Unified speech recognition
     for landline and wireless environments. Proc. Int. Conf. on Acoust., Speech,
     and Sig. Processing, pages 293–296, May 2002.

[23] Zheng-Hua Tan, Paul Dalsgaard, and Borge Lindberg. On the integration of
     speech recognition into personal networks. Proc. Int. Conf. on Spoken Lang.
     Processing, October 2004.


[24] O. Viikki. ASR in portable wireless devices. Proc. IEEE ASRU Workshop,
     December 2001.

[25] Matt Welsh, David E. Culler, and Eric A. Brewer. SEDA: An architecture
     for well-conditioned, scalable internet services. In Symposium on Operating
     Systems Principles, pages 230–243, 2001.

[26] S. Wendt, G. A. Fink, and F. Kummert. Dynamic search-space pruning for
     time-constrained speech recognition. Proc. Int. Conf. on Spoken Language
     Processing, September 2002.



