Speech Application Language Tags (SALT) Technical White Paper
SALT is a speech interface markup language. It consists of a small set of XML elements, with associated attributes and DOM object properties, events and methods, which apply a speech interface to web pages. SALT can be used with HTML, XHTML and other standards to write speech interfaces for both voice-only (e.g. telephony) and multimodal applications.

Design Principles
SALT is designed according to the following fundamental principles.

1. Clean integration of speech with web pages
SALT execution is designed to leverage the event-based DOM execution model of web pages. For multimodal applications it integrates cleanly into visual markup pages. More generally, for all speech applications it reuses the knowledge and skills of web developers. This allows for simplicity of design in SALT, since it does not need to reinvent page execution or programming models.

2. Separation of the speech interface from business logic and data
Linked to the clean integration principle is the notion that the speech interface should be separate from the business logic which drives the application and the data which it maintains. SALT therefore does not extend any individual markup language directly; rather, it applies the speech interface as a separate layer which is extensible across different markup languages. The dialog framework which drives the SALT speech interface can be as loosely or as tightly coupled as necessary to the underlying data structure (e.g. an HTML form), so that speech and dialog components can be reused across pages and across applications.

3. Power and flexibility of programming model
Flexibility in programming the speech interface is crucially important for top-quality speech applications. SALT offers fine-level control of dialog execution through the powerful DOM event and scripting model familiar to millions of web developers. This permits the SALT elements to remain simple and intuitive, while leveraging the benefits of a rich and well-understood execution environment.

4. Reuse of existing standards for grammar, speech output and semantic results
SALT reuses existing standards for speech output, grammar formats and semantic results, so it remains a lightweight application-level markup that builds on industry standards.

5. Range of devices
SALT is not designed for any particular device type, but rather for a range of architectural scenarios. Its way of adding speech to web pages is generic, so the whole continuum of devices from PCs to mobile devices to the telephone can be speech-enabled. For example, personal computers might run all speech recognition and speech output processes locally, whereas smaller devices with limited processing capabilities will have a SALT-capable browser but use remote servers for speech recognition or synthesis services. Traditional telephones without processors will make telephone calls to a server running a voice-only SALT browser. SALT will be modularized, and page profiles will be defined according to the modal and environmental capabilities of clients. A high-level SALT reference architecture is illustrated at the end of this paper.

6. Minimal cost of authoring across modes and devices
Deriving from all the above principles is a notion which is becoming increasingly important for web developers as the types and numbers of web client devices proliferate: minimizing the authoring overhead for different modes and different clients. This enables two important classes of application scenario: (i) multimodal, where a visual page can be seamlessly enhanced with a speech interface on the same device; and (ii) cross-modal, where a single application page can be reused for different modes on different devices, e.g. visual-only and voice-only. In this way, delivering the main principles of clean integration and speech interface separation will allow maximum reuse of developers' work. SALT's method of adding the speech mode to web pages reuses to the greatest extent possible the relevant pages, forms and fields, scripts and back-end logic of existing web applications.

The elements of SALT
There are three main top-level elements in SALT:

<listen …>  configures the speech recognizer, executes recognitions and handles speech input events
<prompt …>  configures the speech synthesizer and plays out prompts
<dtmf …>    configures and controls DTMF in telephony applications

The listen and dtmf elements may contain <grammar> and <bind> elements, and the listen element can also hold <record>. SALT also features ways to configure and manipulate telephony call control through both script and markup. The following sections outline each element individually, with examples to illustrate how SALT realizes the design principles above. The examples assume an XHTML profile, i.e. that an XHTML page is the basis for the SALT speech interface.

<listen>
The listen element is used for speech input: to specify grammars and a means of dealing with speech recognition results. It is also used for recording spoken input (e.g. voice messages). As such, it contains the <grammar>, <record> and <bind> elements, and further resources for handling speech events and configuring recognizer properties. It also contains methods to activate and deactivate grammars, and to start and stop recognition. The following is a simple listen example which holds a remote grammar containing city names, and a bind statement to process the recognition result:

<salt:listen id="travel">
  <salt:grammar src="./city.xml" />
  <salt:bind targetElement="txtBoxOriginCity" value="/result/origin_city" />
</salt:listen>

listen elements can be executed either programmatically with the Start() method in script (as in the 'Event wiring' example below), or declaratively in scriptless environments or SMIL-enabled browsers. The handlers on <listen> include events for successful recognitions, misrecognitions, timeouts and other speech events, and each recognition turn can be configured using attributes to specify timeout periods, confidence thresholds and other parameters. When <listen> is used in tandem with <dtmf> in telephony profiles, the behavior of the two elements is associated by default in order to simplify authoring.

<grammar>
The grammar element is used to specify grammars, either inline or referenced. Multiple grammar elements may be used in a single <listen>. Individual grammars may be activated or deactivated before recognition begins, using methods on the <listen> element. SALT itself is independent of the grammar formats that may be referenced by the grammar element. However, in order to enable interoperability of SALT applications, it is intended that SALT browsers will support at minimum the XML form of the W3C Speech Recognition Grammar Specification (pending evaluation of the functional completeness of this specification).
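SALT leaves the grammar format open, but the W3C Speech Recognition Grammar Specification mentioned above gives a concrete picture of what a referenced file such as city.xml might contain. The following is an illustrative SRGS XML fragment, not taken from the SALT specification; the city list is invented for the example:

```xml
<?xml version="1.0"?>
<!-- Illustrative SRGS XML grammar: a single rule matching one city name. -->
<grammar version="1.0" xml:lang="en-US" root="city"
         xmlns="http://www.w3.org/2001/06/grammar">
  <rule id="city" scope="public">
    <one-of>
      <item>London</item>
      <item>Seattle</item>
      <item>Paris</item>
    </one-of>
  </rule>
</grammar>
```

A recognizer loading this grammar would accept exactly one of the listed city names as a valid utterance.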
<bind>
The bind element can be used to inspect the results of recognition and conditionally copy the relevant portions to values in the containing page. It is optional, and multiple binds may be contained in a <listen>. The recognition result is returned in the form of an XML document, so <bind> uses XPath syntax in its value attribute to point to a particular node of the result, and an XML pattern query in its test attribute to specify binding conditions. If the condition evaluates to true, the content of the node is bound into the page element specified by the targetElement attribute. (Again, SALT itself is independent of the XML format used for the recognition return, but SALT browsers will be expected to support a single standard result format.) So, given a recognition return of the following simple document:

<result text="I'd like to go to London, please" confidence="0.45">
  <dest_city text="to London" confidence="0.55">
    London
  </dest_city>
</result>

the following bind statement will copy the value of the dest_city node into the XHTML element txtBoxDestCity, since the precondition is met that the confidence attribute of the dest_city node holds a value greater than 0.4:

<input name="txtBoxDestCity" type="text" />
<salt:listen ...>
  <salt:bind targetElement="txtBoxDestCity" value="/result/dest_city"
             test="/result/dest_city[@confidence &gt; 0.4]" />
</salt:listen>

The bind element is thus a simple declarative way to process recognition results. For more complex processing, the onReco event handler can be used for finer-level programmatic script analysis and post-processing of the recognition return. Like <bind>, this is triggered on the return of a result, and is illustrated in the example in the Dialog Flow section.

<record>
The record element is used to specify audio recording parameters and other information related to recording speech input. The results of recording may also be processed with <bind> or with scripted code.
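The conditional copy performed by the <bind> example above can be pictured in plain script. The sketch below is not SALT itself; it simply mirrors the semantics of the value, test and targetElement attributes once the result document has been parsed, and all names in it are illustrative:

```javascript
// Minimal sketch (not part of SALT) of what a <bind> does: copy the content
// of a result node into a page element only when its confidence attribute
// passes the test condition. Plain objects stand in for the parsed XML node
// and for the XHTML <input> element; all names here are illustrative.
function applyBind(node, targetElement, minConfidence) {
  if (node === undefined || node.confidence <= minConfidence) return false;
  targetElement.value = node.content; // bind the node's content into the page
  return true;
}

// Parsed stand-in for the /result/dest_city node of the return shown above.
const destCity = { content: "London", confidence: 0.55 };
const txtBoxDestCity = { value: "" }; // stands in for the XHTML <input>

applyBind(destCity, txtBoxDestCity, 0.4); // condition met: value is copied
```

With the 0.55 confidence above the 0.4 threshold, the field receives "London"; a lower-confidence node would leave the field untouched, just as the declarative test attribute would.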

<prompt>
The prompt element is used to specify system output. Its content may be simple text, speech output markup, variable values, links to audio files, or any mix of these. Prompt elements are executed declaratively on scriptless or SMIL browsers, or by object methods in script. In order to enable interoperability of SALT applications, it is intended that SALT browsers will support at minimum the W3C Speech Synthesis Markup Specification (pending suitable evaluation of the functional completeness of this specification). The following example shows the values of XHTML elements on the page used inside a textual prompt. The content of the value attributes of the txtBoxOriginCity and txtBoxDestCity elements are referenced at the relevant points inside the question:

<salt:prompt id="ConfirmTravel">
  So you want to travel from
  <salt:value targetElement="txtBoxOriginCity" targetAttribute="value" />
  to
  <salt:value targetElement="txtBoxDestCity" targetAttribute="value" />
  ?
</salt:prompt>

The prompt element also contains methods to start, stop, pause and resume prompt playback, and to alter speed and volume. Its handlers include the events of user barge-in, internal 'bookmarks' and prompt completion.
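Once the two <salt:value> references above are resolved against the page's input fields, the prompt reduces to plain text. A throwaway sketch of that substitution (renderConfirm is an illustrative name, not a SALT API):

```javascript
// Illustrative only: the text the ConfirmTravel prompt yields once each
// <salt:value> is replaced by the value attribute of its target element.
function renderConfirm(originCity, destCity) {
  return `So you want to travel from ${originCity} to ${destCity} ?`;
}
```

For example, with "Seattle" and "London" in the two fields, the synthesizer would be handed "So you want to travel from Seattle to London ?".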

<dtmf>
The <dtmf> element is used in telephony applications to specify DTMF grammars and to deal with keypress input and other events. Mirroring the listen element, <dtmf> holds as content the <grammar> and <bind> elements, along with resources for configuring the DTMF collection process and handling DTMF keypresses and timeouts. Like <listen>, it may be executed declaratively or programmatically with start and stop commands. The following example shows grammar and bind used just as they are in listen, this time to obtain a telephone number:

<salt:dtmf id="dtmfPhoneNumber">
  <salt:grammar src="7digits.gram" />
  <salt:bind value="/result/phoneNumber" targetElement="iptPhoneNumber" />
</salt:dtmf>

The dtmf element is configured using attributes for timeouts and other properties, and its handlers include events for keypresses, valid DTMF sequences and out-of-grammar input.
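The collection process the <dtmf> element configures can be pictured as accumulating keypresses until they either satisfy the grammar or fall outside it. The sketch below is a plain-script simulation, not SALT; the 7-digit rule and the '#' terminator are assumptions suggested by the 7digits.gram file name:

```javascript
// Illustrative simulation of DTMF collection against a 7-digit grammar.
// Keys are collected one at a time; a '#' terminator stops collection.
// Returns the recognized digit string, or null for out-of-grammar input.
function collectDtmf(keys) {
  const digits = [];
  for (const k of keys) {
    if (k === "#") break;                 // terminator: stop collecting
    if (!/^[0-9]$/.test(k)) return null;  // '*' and friends: out of grammar
    digits.push(k);
  }
  const seq = digits.join("");
  return /^[0-9]{7}$/.test(seq) ? seq : null; // must be exactly 7 digits
}
```

In a real SALT browser this logic lives in the platform: a valid sequence would surface through the element's result (and any <bind> statements), while out-of-grammar input would raise the corresponding event handler.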

Event wiring
As mentioned above, SALT elements are XML objects in the Document Object Model (DOM) of the page. As such, each SALT element contains methods, properties and event handlers which are accessible to script and can therefore interact with other events and processes in the execution of the web page. This allows SALT's speech interface to be seamlessly integrated into web applications.

For example, the listen, prompt and dtmf elements all contain asynchronous methods to begin and end their execution. Similarly, they contain properties for configuration and result storing, and event handlers for events associated with speech. So an onReco event is fired on a <listen> when recognition results are successfully returned, and an onBargein event is fired on a <prompt> if user input is received while the prompt is in playback. Other useful methods, properties and event handlers are exposed on each object for finer-level application control.

Here is an example of execution in a multimodal web page. In the markup below, the wiring of the listen's Start() method to the GUI event of the click on an input field implements a 'click-to-talk' interaction model, whereby an appropriate grammar is activated according to the field clicked:

<input name="txtBoxDestCity" type="text" onclick="recoDestCity.Start()" />
<salt:listen id="recoDestCity">
  <salt:grammar src="city.xml" />
  <salt:bind targetElement="txtBoxDestCity" value="/result/city" />
</salt:listen>

Since individual recos can be associated in this way with individual GUI elements, grammar selection and result processing can be localized into individual recos and then wired to events associated with the relevant fields of a form.
More complex mixed-initiative possibilities are enabled by the use of broader grammars and multiple <bind> statements:

<input type="button" onclick="recoFromTo.Start()" value="Say From and To Cities" />
<input name="txtBoxOriginCity" type="text" />
<input name="txtBoxDestCity" type="text" />
<salt:listen id="recoFromTo">
  <salt:grammar src="FromToCity.xml" />
  <salt:bind targetElement="txtBoxOriginCity" value="/result/originCity" />
  <salt:bind targetElement="txtBoxDestCity" value="/result/destCity" />
</salt:listen>

In this example, any recognition return which contains an origin city, a destination city, or both cities will populate the input fields specified in the bind statements accordingly.
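The independent firing of the two bind statements above can be sketched in plain script: each bind copies its node only when the result actually contains it, so a partial recognition fills only the matching field. The result and fields objects below are illustrative stand-ins for the parsed XML return and the page inputs:

```javascript
// Illustrative sketch of multiple <bind> elements applied to one result:
// each bind is evaluated independently against the recognition return.
function applyBinds(result, fields, binds) {
  for (const b of binds) {
    if (result[b.path] !== undefined) fields[b.target] = result[b.path];
  }
}

const binds = [
  { path: "originCity", target: "txtBoxOriginCity" },
  { path: "destCity", target: "txtBoxDestCity" },
];

// An utterance naming only a destination: the return carries only one node,
// so only the destination field is populated and the origin stays empty.
const fields = { txtBoxOriginCity: "", txtBoxDestCity: "" };
applyBinds({ destCity: "London" }, fields, binds);
```

A subsequent recognition supplying the origin city would fill the remaining field without disturbing the one already bound.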

Telephony capabilities
For telephony dialogs, SALT supports call control mechanisms for managing telephony interfaces, including:

• Listening for, accepting, and rejecting incoming calls.
• Placing an outgoing call.
• Disconnecting and transferring calls.
• Grouping calls together into conferences.

The mechanisms available to the SALT programmer are grouped into:

• An extensible telephony abstraction derived from the open Java Call Processing (JCP) standard, consisting of an object model with properties, methods, and events controllable through scripting. This programming abstraction is independent of the underlying telephony subsystem, which may rely on, for example, SS7, ISDN, SIP, H.323, or even vendor-proprietary interfaces and signaling protocols.
• A general-purpose message-passing primitive allowing arbitrary interaction with a call control module independent from SALT. This primitive is controllable through both markup and scripting.

Dialog flow
Voice-only scenarios such as telephony applications do not have a visual display for users to drive interactions via GUI events. For these scenarios, SALT applications continue to leverage the web model of events and scripts to manage the interactional flow of dialog. Scripts can also control the extent of user initiative, the dialog repair strategy, and so on. In this way, a complete and familiar programming model is available to application authors. It is also expected that many implementations of SALT will provide tools with pre-built scriptlets and subdialog libraries which will make many common dialog processing tasks easier.

The following example shows a simple form of dialog flow management using client-side script. The SALT primitives <listen> and <prompt> are activated according to the RunAsk() script, which examines the values inside the input fields and executes the relevant prompts and recognitions until the values of both fields are obtained. In order to illustrate programmatic result processing, the binding of the recognition results into the relevant input fields is accomplished below by the script functions procOriginCity() and procDestCity(), which are triggered by the onReco events of the relevant <listen> elements. The handler for an unrecognized speech event, onNoReco, is used to play an appropriate message, the sayDidntUnderstand prompt, which restarts the cycle on its completion.

<!-- HTML -->
<html xmlns:salt="urn:saltforum.org/schemas/020124">
<body onload="RunAsk()">
  <form id="travelForm">
    <input name="txtBoxOriginCity" type="text" />
    <input name="txtBoxDestCity" type="text" />
  </form>

  <!-- Speech Application Language Tags -->
  <salt:prompt id="askOriginCity"> Where would you like to leave from? </salt:prompt>
  <salt:prompt id="askDestCity"> Where would you like to go to? </salt:prompt>
  <salt:prompt id="sayDidntUnderstand" onComplete="RunAsk()">
    Sorry, I didn't understand.
  </salt:prompt>

  <salt:listen id="recoOriginCity" onReco="procOriginCity()"
               onNoReco="sayDidntUnderstand.Start()">
    <salt:grammar src="city.xml" />
  </salt:listen>
  <salt:listen id="recoDestCity" onReco="procDestCity()"
               onNoReco="sayDidntUnderstand.Start()">
    <salt:grammar src="city.xml" />
  </salt:listen>

  <!-- script -->
  <script>
    function RunAsk() {
      if (travelForm.txtBoxOriginCity.value == "") {
        askOriginCity.Start();
        recoOriginCity.Start();
      } else if (travelForm.txtBoxDestCity.value == "") {
        askDestCity.Start();
        recoDestCity.Start();
      }
    }
    function procOriginCity() {
      travelForm.txtBoxOriginCity.value = recoOriginCity.text;
      RunAsk();
    }
    function procDestCity() {
      travelForm.txtBoxDestCity.value = recoDestCity.text;
      travelForm.submit();
    }
  </script>
</body>
</html>

As noted above, further event handlers are available in the <listen> and <prompt> elements to manage user silences, errors and other situations requiring some form of dialog recovery.
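The turn-taking logic of RunAsk() above can be followed outside a SALT browser as a small state machine. The sketch below is plain JavaScript with no SALT objects; nextTurn() and the form object are hypothetical names standing in for the page's fields and the prompt/listen pairs the script would start:

```javascript
// Illustrative, self-contained rendering of the RunAsk() control flow above.
// Instead of starting SALT prompt/listen objects, nextTurn() reports which
// prompt/listen pair the script would activate next (names are hypothetical).
function nextTurn(form) {
  if (form.txtBoxOriginCity === "") return "askOriginCity";
  if (form.txtBoxDestCity === "") return "askDestCity";
  return "submit"; // both fields filled: procDestCity() submits the form
}

// Simulate the dialog: each recognition fills a field, then the cycle repeats.
const form = { txtBoxOriginCity: "", txtBoxDestCity: "" };
const turns = [];
turns.push(nextTurn(form));        // ask for the origin city
form.txtBoxOriginCity = "Seattle"; // procOriginCity() binds the result
turns.push(nextTurn(form));        // ask for the destination city
form.txtBoxDestCity = "London";    // procDestCity() binds the result
turns.push(nextTurn(form));        // nothing left to ask: submit
```

The same system-initiative pattern extends naturally: an onNoReco or silence handler simply replays its message and re-enters the cycle, exactly as sayDidntUnderstand does in the markup above.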

SALT Reference Architecture

[Diagram: high-level SALT reference architecture]
