
HCI 575X Spring 2009: Computational Perception

Instructor: Alex Stoychev

Semester Project Proposal

March 10th 2009

A Multi-media Desktop System Simulation

Integration with Multi-touch, Voice Recognition and Tutoring System

Wei, Yi

I. Introduction

For the past few years, interaction with computers, especially PCs, has usually taken place through a single interaction mode. In this mode the monitor or screen serves as the output device that shows people what is happening, and the keyboard or mouse (laptops usually provide a small touchpad, which is not very convenient) serves as the input device. This is a quiet, silent mode in which the interaction occurs only between the operator's hands and vision. The disadvantage is that people need a long time to get used to these devices (not everyone becomes proficient with a keyboard and mouse easily), and the interaction mode itself can only be mastered once that proficiency is in place.

However, thanks to extensive research on different interaction technologies, we now have more and more choices for carrying out what people want to do. Touch interfaces are one example: a new generation of multi-touch interaction is upon us. Multi-touch is a hardware interaction technique between user and machine in which a touch screen recognizes the user's touch points along with their positions, pressures, and even gestures. Some multi-touch devices can support multiple users at the same time. Since no tool is easier to learn and use than people's own hands, a multi-touch table is a very good choice for a new desktop system model. It provides the user with a more direct mapping between their actions and the application than a mouse or laptop touchpad does. This style of interaction has also grown amazingly fast in the past few years thanks to a series of multi-touch products such as the iPhone and the new BlackBerry; these products and their users' experience have made the technology mature. In a word, the multi-touch technique will shorten the time needed to learn desktop operation and make the user feel more comfortable.

However, just as with a laptop touchpad, especially the MacBook's (arguably the best such product proven in the real world), people can learn to control the desktop through touch, but it still takes a fairly long period to master. Even once users are familiar with the tools, they still meet situations that require complicated desktop operations (for example, when there are several tasks to do and the icons or windows are in disorder, people get annoyed if they cannot carry out their task directly). The strategy proposed here to solve this problem is to add voice recognition to the desktop system, so that while users perform a movement with their hands they can issue a voice command concurrently. This interaction mode is much closer to real human behavior, since people usually combine gestures, hand movement, and speech to interact with the outside world. The tool used to support this idea is Voce. Voce is a speech synthesis and recognition library that is cross-platform, accessible from Java and C++, and has a very small API. It uses CMU Sphinx4 and FreeTTS internally. "Voce" is Italian for "voice". Furthermore, thanks to the FreeTTS features, the system can give users not only visual but also audio output, which makes the whole interaction process more efficient.

Figure 1: The flowchart of the proposed desktop system

Considering the limitations of voice recognition and the errors of the multi-touch table, and in order to control the workflow well, we propose to include xPST, the Extensible Problem-Specific Tutor engine. The xPST Engine observes the interaction in the interface. Using the xPST File, the xPST Engine knows the correct tutoring response when the user does either the right or the wrong thing in the interface, and knows which hint or just-in-time message to display at the appropriate time. This will be combined with the data collected from Sparsh UI and Voce. In addition, the output hint data will be transformed either into a display on screen or into speech through the FreeTTS engine bundled with Voce. Figure 1 shows the workflow of the whole system.

II. Background

A. Traditional Desktop System

Figure 2 shows a standard, traditional desktop on a Windows XP system. Different kinds of files are distinguished by different icons. The icons are laid out on the rectangular graphical desktop, usually on an invisible grid. The user can select a file to open or operate on by clicking on it, and the selected item is shown in a different foreground color. Moreover, the user can do more by dragging out a selection rectangle, which groups the items inside one rectangular area of the desktop so that the same operation can be applied to each member of the group.

Figure 2: Traditional Desktop interface on Windows XP

Figure 3: Traditional Workflow Control–Right Click by Mouse and Menu Hint

Figure 3 and Figure 4 show how the operation workflow is controlled in a traditional desktop system. If the user wants to perform a more complicated operation, one that needs more than one step rather than just a movement, the user has to right-click with the mouse or use the laptop touchpad (Figure 4).

Figure 4: Mouse and touchpad on different laptops

On a MacBook, the equivalent gesture is a two-finger tap or click. This click activates the context menu, which shows the user more operation choices. The menu may contain some direct operations; it may also contain entries that lead to a next-level submenu with still more operations. However, this kind of workflow control can result in a big mess if there are too many menus and windows on the desktop. Thus, if the user has a series of operations to perform, activating several menus becomes a difficult and heavy job.

B. Sparsh UI

Figure 5: Google code project Sparsh UI logo [1]

Sparsh UI obtains input from the multi-touch device being used by the user. It is compatible with any multi-touch hardware, as long as the hardware follows the protocol specified by Sparsh UI.

Figure 6: Sparsh UI [1]

The Sparsh-UI Gesture Server is the main piece of the application. It handles gesture processing and passes touch points and gesture information to the client application. The Gesture Server supports basic gestures such as drag, scale, and rotate, and is extensible to support user-written gestures.

Figure 7: Sparsh UI Architecture [1]

A device driver is needed for a device to communicate with the Gesture Server. The driver should be capable of passing touch-point information to the Gesture Server. It is the software driver that interacts with the multi-touch hardware and provides the multi-touch events. Since it depends on the hardware being used, it needs to be written for the specific multi-touch hardware at hand. (Currently supported hardware devices include an FTIR-based multi-touch table, an infra-red bezel, and the Stantum multi-touch screen.)

Figure 8: Sparsh UI Solution [1]

A client adapter is needed for each GUI framework the Gesture Server intends to communicate with; a Java Swing adapter is currently under development. The Gesture Recognition Framework is the core engine of Sparsh UI: it takes in the input-device data and analyzes gestures from the touch-point coordinates. Multiple clients can register with the gesture recognition framework and receive the gesture events for which they registered. The Client Adapter is the adapter that the client needs to implement to receive Sparsh UI events; the client-to-server protocol adapter is provided as part of Sparsh UI. The client application needs to implement an interface (abstract class) to receive the events, and the handling of the various events is done in this client adapter. The Client is the multi-touch application itself; it listens for the various multi-touch (gesture) events and acts accordingly.
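To make the adapter idea concrete, the sketch below shows roughly what a client-side gesture handler for the desktop model could look like. The names (GestureEvent, GestureListener, DesktopClient) are hypothetical placeholders chosen for illustration, not the actual Sparsh UI classes; the real adapter would receive its events from the Gesture Server over the Sparsh UI protocol.

    // Illustrative sketch only: GestureEvent and GestureListener are placeholder
    // names, not the real Sparsh UI API. They show how a client reacts to the
    // drag/scale/rotate gestures the Gesture Server reports.
    class GestureEvent {
        enum Type { DRAG, SCALE, ROTATE }
        final Type type;
        final float x, y;          // touch-point coordinates on the desktop
        GestureEvent(Type type, float x, float y) { this.type = type; this.x = x; this.y = y; }
    }

    interface GestureListener {
        void onGesture(GestureEvent event);   // called once per recognized gesture
    }

    // The client (the multi-touch desktop application) implements the listener and
    // translates gestures into desktop operations such as moving an icon.
    class DesktopClient implements GestureListener {
        @Override
        public void onGesture(GestureEvent event) {
            switch (event.type) {
                case DRAG:   moveSelectedIcon(event.x, event.y); break;
                case SCALE:  resizeSelection();                  break;
                case ROTATE: rotateSelection();                  break;
            }
        }

        private void moveSelectedIcon(float x, float y) { /* update icon position */ }
        private void resizeSelection()                  { /* resize selected icons */ }
        private void rotateSelection()                  { /* rotate selected icons */ }
    }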

C. Voce

Figure 9: logo of open source project Voce [2]

Voce is a speech synthesis and recognition library that is cross-platform, accessible from Java and C++, and has a very small API. It uses CMU Sphinx4 and FreeTTS internally. "Voce" is Italian for "voice". Here is some sample code demonstrating the simplicity of the API [2]:

    // Speech synthesis in Java
    voce.SpeechInterface.synthesize("hello world");

    // Speech synthesis in C++
    voce::synthesize("hello world");

    // Speech recognition in Java
    while (voce.SpeechInterface.getRecognizerQueueSize() > 0) {
        String s = voce.SpeechInterface.popRecognizedString();
        System.out.println("You said: " + s);
    }

    // Speech recognition in C++
    while (voce::getRecognizerQueueSize() > 0) {
        std::string s = voce::popRecognizedString();
        std::cout << "You said: " << s << std::endl;
    }
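Before these calls can be used, the library has to be initialized, and it should be shut down when the application exits. The sketch below is a minimal Java example; the paths ("./lib", "./grammar") and the grammar name "commands" are assumptions about the local installation rather than fixed values.

    // Minimal Voce lifecycle sketch (Java). The library path, grammar path, and
    // grammar name below are assumptions about the local setup.
    public class VoceDemo {
        public static void main(String[] args) throws InterruptedException {
            // init(vocePath, initSynthesis, initRecognition, grammarPath, grammarName)
            voce.SpeechInterface.init("./lib", true, true, "./grammar", "commands");

            voce.SpeechInterface.synthesize("hello world");

            // Poll the recognizer queue for a while and print whatever was heard.
            for (int i = 0; i < 50; i++) {
                while (voce.SpeechInterface.getRecognizerQueueSize() > 0) {
                    String s = voce.SpeechInterface.popRecognizedString();
                    System.out.println("You said: " + s);
                }
                Thread.sleep(200);
            }

            voce.SpeechInterface.destroy();
        }
    }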

D. xPST

Figure 10: logo of Google code project xPST [3]

Figure 11 shows the relationship between these components and the to-be-tutored interface. The xPST File contains the tutoring information: what to do, and when, as the user interacts with the interface. This information includes how to map interface widgets from the tutored software onto more meaningful constructs for the various learning goals, the sequence in which users are supposed to complete these learning goals, the correct answer for each learning goal, and hints and just-in-time error messages for each learning goal. For the desktop system, this corresponds to the workflow control, since fuzzy voice recognition results and potentially destructive operations (such as removing items) need to be confirmed. The xPST Engine observes the user's interaction in the interface. Using the xPST File, the xPST Engine knows the correct tutoring response (the workflow, in the case of the desktop system) when the user does either the right or the wrong thing in the interface, and knows which hint or just-in-time message to display at the appropriate time.

When the xPST Engine decides that information needs to be conveyed back to the user (e.g., a wrong answer needs to be flagged or a hint message needs to be displayed), that information is relayed to the Presentation Manager. The Presentation Manager knows how to display that information to the user given the current interface. For example, a wrong answer may need to be shown in red text, a radio button highlighted, or a tooltip with appropriate text displayed; for the desktop interaction this might be a voice hint, a spoken question, or a message shown on the screen. The Presentation Manager handles the communication between the xPST Engine and the interface.
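A rough sketch of how such a Presentation Manager could fork a hint between the screen and speech output is shown below. The PresentationManager class and its showHint method are illustrative assumptions for this proposal, not part of the xPST code base; only the voce.SpeechInterface.synthesize call is taken from the Voce API.

    // Hypothetical Presentation Manager sketch: route a tutoring hint either to
    // the screen or to speech output. Class and method names are illustrative;
    // only voce.SpeechInterface.synthesize comes from the Voce library.
    public class PresentationManager {
        public enum Channel { SCREEN, VOICE }

        private final Channel preferredChannel;

        public PresentationManager(Channel preferredChannel) {
            this.preferredChannel = preferredChannel;
        }

        /** Deliver a hint produced by the xPST Engine to the user. */
        public void showHint(String hint) {
            if (preferredChannel == Channel.VOICE) {
                voce.SpeechInterface.synthesize(hint);   // spoken hint via FreeTTS
            } else {
                System.out.println("HINT: " + hint);     // placeholder for on-screen display
            }
        }
    }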

Figure 11: The architecture of xPST. The xPST Engine "eavesdrops" on the software interface that needs tutoring. The Presentation Manager gives visual feedback on the software interface. The xPST File provides the feedback (which needs to be authored) and the goal structure needed for each task within the tutor. [3]

III. Methodology
A. Architecture

Figure 12: Architecture of Desktop System Model

Figure 12 shows the architecture of the desktop system. The whole system is based on three open-source projects: Sparsh UI for multi-touch, Voce for voice command recognition, and xPST for workflow control and checking. It is straightforward to combine them because a Java version exists for each of them, so the implementation does not depend on the platform (except that the multi-touch table needs the appropriate device drivers).
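The sketch below illustrates how the three parts could be wired together in Java. Only the Voce calls (getRecognizerQueueSize, popRecognizedString) are real library calls; TouchSelectionModel and WorkflowController are hypothetical stand-ins for the Sparsh UI client state and the xPST-driven workflow logic.

    // Integration sketch: poll voice commands from Voce and hand them, together
    // with the current touch selection, to the workflow controller.
    // TouchSelectionModel and WorkflowController are hypothetical placeholders
    // for the Sparsh UI client state and the xPST-driven workflow logic.
    public class DesktopSystem {

        interface TouchSelectionModel {              // filled by the multi-touch client
            java.util.List<String> selectedItems();
        }

        interface WorkflowController {               // backed by the xPST engine
            void handleCommand(String voiceCommand, java.util.List<String> selection);
        }

        static void run(TouchSelectionModel touch, WorkflowController workflow)
                throws InterruptedException {
            while (true) {
                // Drain whatever the recognizer has heard since the last pass.
                while (voce.SpeechInterface.getRecognizerQueueSize() > 0) {
                    String command = voce.SpeechInterface.popRecognizedString();
                    workflow.handleCommand(command, touch.selectedItems());
                }
                Thread.sleep(100);                   // avoid busy-waiting
            }
        }
    }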

B. Equipment

Multi-Touch Simulator for Windows XP and Vista

Figure 13: Multi-Touch Simulator on Windows XP

The Sparsh Server captures the mouse-click records.

Figure 14: The Sparsh Server

Logitech earphone and microphone

Figure 15: Logitech QuickCam Chat bundled headset

Multi-touch table/screen

Figure 16: Multi-touch table/screen in VRAC lab

Since the project will first be developed on the simulator, whose performance depends mainly on the PC, no further information about the multi-touch table is given here. The project will also focus on the multi-touch API; the physical characteristics of the different multi-touch devices will matter more during the testing period.

C. Implementation Process

1. Multi-touch desktop model using the simulator on a PC

2. Design of the voice command set and of the grammar for voice recognition in Voce (see the sketch after this list)

3. Integration of the multi-touch desktop model with the voice command recognition model

4. Workflow design and menu elimination, integrated into the xPST file

5. Combination of the three parts (multi-touch, voice commands, and workflow tutoring) and testing
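As an example of what step 2 might involve, the sketch below shows a possible command grammar (in the JSGF format used by the CMU Sphinx4 recognizer inside Voce) together with the Voce call that loads it. The grammar contents, the file name desktop.gram, and the command vocabulary are assumptions made for illustration, not a fixed design.

    // A possible desktop command grammar, saved as ./grammar/desktop.gram
    // (JSGF format used by Sphinx4; the vocabulary chosen here is illustrative):
    //
    //   #JSGF V1.0;
    //   grammar desktop;
    //   public <command> = (remove | delete | open | arrange | yes | no);
    //
    // Loading the grammar in Voce (synthesis and recognition both enabled):
    public class GrammarSetup {
        public static void main(String[] args) {
            voce.SpeechInterface.init("./lib", true, true, "./grammar", "desktop");
            // ... recognition loop as in Section II.C ...
            voce.SpeechInterface.destroy();
        }
    }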

E. Project Timeline

    Tasks                              Time
    Desktop items movement model       1 week
    Voice command recognition          1 week
    Voice & Multi-touch combined       0.5 week
    Workflow analysis                  0.5 week
    Tutoring & desktop integration     1 week
    Testing                            1 week

IV. Expectations and Evaluation of the Proposed System
A. Testing Examples

Remove (a code sketch of this workflow follows these steps):

Use touch gestures to select one item or several items.

Input the remove command by voice. The user just needs to say a word such as "delete" or "remove".

If the system recognizes the command, it outputs the audio confirmation question "Do you really want to remove this item?"

If the system has trouble recognizing the voice command, it may give a hint such as "Do you want to remove or open?" The hint is based on the user's previous operations in the workflow.

Yes: the item is removed and the workflow finishes.

No: the process is cancelled and the workflow stops.
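A minimal sketch of this remove-and-confirm dialogue is given below, assuming Voce has already been initialized as in Section II.C. The removeSelectedItems and cancelWorkflow helpers are hypothetical placeholders for the desktop model's own operations.

    // Sketch of the "remove" confirmation dialogue. Assumes Voce is already
    // initialized; removeSelectedItems() and cancelWorkflow() are hypothetical
    // placeholders for the desktop model's own operations.
    public class RemoveWorkflow {

        static void onVoiceCommand(String command) throws InterruptedException {
            if (command.contains("remove") || command.contains("delete")) {
                // Ask the confirmation question by voice (FreeTTS via Voce).
                voce.SpeechInterface.synthesize("Do you really want to remove this item?");

                // Wait for the user's spoken answer.
                String answer = waitForAnswer();
                if (answer.contains("yes")) {
                    removeSelectedItems();      // finish the workflow
                } else {
                    cancelWorkflow();           // stop the workflow
                }
            }
        }

        static String waitForAnswer() throws InterruptedException {
            while (true) {
                while (voce.SpeechInterface.getRecognizerQueueSize() > 0) {
                    return voce.SpeechInterface.popRecognizedString();
                }
                Thread.sleep(100);
            }
        }

        static void removeSelectedItems() { /* remove the selected desktop items */ }
        static void cancelWorkflow()      { /* cancel the pending operation */ }
    }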

Arrange:

Use touch gestures to select several items, either choosing them separately or drawing a circle around them.

Input the arrange command by voice. The user just needs to say a word such as "arrange".

If the system recognizes the command, it outputs the audio confirmation question "How do you want to arrange these items?"

If the system has trouble recognizing the voice command, it may give a hint such as "Do you want to arrange or remove?" The hint is based on the user's previous operations in the workflow.

By name, date, catalog…

B. Limitations of the Proposed System

The desktop system will only be a model for experiencing the feel and the advantages of interaction that integrates audio and touch; it may not be a perfect system, which means it may not provide all the features of a traditional desktop system. One challenge is that voice recognition without a learning procedure may become a bottleneck, and the voice recognition part has a relatively high memory requirement. How much the xPST tutoring system will improve the system's fault tolerance remains an open question.

V. References

[1] Sparsh UI. http://code.google.com/p/sparsh-ui/
[2] Tyler Streeter. Voce. http://voce.sourceforge.net/
[3] xPST. http://code.google.com/p/xpst/


								