AN XML-BASED LANGUAGE FOR FACE MODELING AND ANIMATION

                                              Ali Arya and Babak Hamidzadeh

                       Dept. of Electrical & Computer Engineering, University of British Columbia,
                2356 Main Mall, Vancouver, BC, Canada V6T 1Z4, Phone: (604) 822-9181, Fax: (604) 822-5949
                                              Email: {alia,babak}@ece.ubc.ca

ABSTRACT

Content description is a major issue in multimedia and animation systems and is necessary for content creation, delivery, and even retrieval. This paper proposes a content description language designed specifically for face animation systems. Face Modeling Language (FML) is an XML-based Structured Content Description. It allows the definition of face actions ranging from complicated stories to simple moves, as well as face models and behaviours. FML is used as the input script in the ShowFace animation system. The paper describes FML's hierarchical structure, dynamic behaviour and event handling, its relation to face animation components, and its relation to other multimedia languages and standards such as SMIL and MPEG-4.

KEY WORDS
Face Animation, Multimedia, XML, Language, Content


1. INTRODUCTION

Describing the contents of a multimedia presentation is a basic task in multimedia systems. It is necessary when a client asks for a certain presentation to be designed, when a media player receives input to play, and even when a search is done to retrieve an existing multimedia file. In all these cases, the description can include raw multimedia data (video, audio, etc.) as well as textual commands and information. Such a description works as a Generalized Encoding, since it represents the multimedia content in a form not necessarily the same as the playback format, and is usually more efficient and compact. For instance, a textual description of a scene can be a very effective "encoded" version of a multimedia presentation that will be "decoded" by the media player when it recreates the scene.

Face animation, as a special type of multimedia presentation, has been a challenging subject for many researchers. Advances in computer hardware and software, as well as new web-based applications, have recently helped intensify these research activities. Video conferencing and online services provided by human characters are good examples of applications using face animation.

Although new streaming technologies allow real-time download/playback of audio/video data, bandwidth limitation and its efficient usage still are, and probably will remain, major issues. This makes a textual description of facial actions a very effective coding/compression mechanism, provided the visual effects of these actions can be recreated with a minimum acceptable quality. Based on this idea, some research has been done to translate certain facial actions into a predefined set of "codes". The Facial Action Coding System [6] is probably the first successful attempt in this regard. More recently, the MPEG-4 standard [3] has defined Face Animation Parameters to encode low-level facial actions like jaw-down, and higher-level, more complicated ones like smile.

Efficient use of bandwidth is not the only advantage of facial action coding. In many cases, the "real" multimedia data does not exist at all and has to be created based on a description of the desired actions. This leads to the whole new idea of representing the spatial and temporal relations of facial actions. In a generalized view, such a description of a facial presentation should provide a hierarchical structure with elements ranging from low-level "images", to simple "moves", to more complicated "actions", to complete "stories". We call this a Structured Content Description, which also requires means of defining capabilities, behavioural templates, dynamic contents, and event/user interaction. Needless to say, compatibility with existing multimedia and web technologies is another fundamental requirement in this regard.

In this paper, we propose Face Modeling Language (FML) as a Structured Content Description mechanism based on the eXtensible Markup Language (XML). The main ideas behind FML are:
1. Hierarchical representation of face animation
2. Timeline definition of the relation between facial actions and external events
3. Defining capabilities and behaviour templates
4. Compatibility with MPEG-4 FAPs
5. Compatibility with XML and related web technologies

In Section 2, some related works are reviewed. Section 3 is dedicated to the basic concepts and structure of FML. The integration of FML in the ShowFace animation system [2] is the topic of Section 4. Some case studies and concluding remarks are presented in Sections 5 and 6, respectively.


2. RELATED WORKS

Multimedia content description has been the subject of some research projects and industry standards. In the case of face animation, the Facial Action Coding System (FACS) [6] was one of the first attempts to model the low-level movements which can happen on a face. The MPEG-4 standard [3] extends this with Face Definition/Animation Parameters for facial features and their movements. These parameters can define low-level moves of features (e.g. jaw-down) and also higher-level sets of moves that form a complete facial situation (e.g. visemes or expressions). The standard does not go any further to define a dynamic description of the scene (facial activities and their temporal relations) and is limited to static snapshots. Decoding these parameters and the underlying face model are outside the scope of the MPEG-4 standard.

Indirectly related is the Synchronized Multimedia Integration Language, SMIL [4], an XML-based language for the dynamic (temporal) description of events in a general multimedia presentation. It defines time containers for sequential, parallel, and exclusive actions related to the contained objects, in order to synchronize the events. SMIL does not act as a dynamic content description for facial animation or any other specific application.

BEAT [5] is another XML-based system, specifically designed for human animation purposes. It is a toolkit for automatically suggesting expressions and gestures, based on a given text to be spoken. BEAT uses a knowledge base and rule set, and provides synchronization data for facial activities, all in XML format. This enables the system to use standard XML parsing and scripting capabilities. Although BEAT is not a general content description tool, it demonstrates some of the advantages of XML-based approaches.

Scripting and behavioural modeling languages for virtual humans are considered by other researchers as well [7,8]. These languages are usually either simple macros for simplifying the animation, or new languages which do not use existing multimedia technologies. Most of the time, they are not specifically designed for face animation.


3. STRUCTURED CONTENT DESCRIPTION

Design Ideas
FACS and MPEG-4 FAPs provide the means of describing low-level face actions but do not cover temporal relations and higher-level structures. Languages like SMIL do this in a general-purpose form for any multimedia presentation and are not customized for specific applications like face animation. A language bringing the best of these two together, customized for face animation, seems to be an important requirement. Face Modeling Language (FML) is designed to do so.

Fundamental to FML is the idea of Structured Content Description. It means a hierarchical view of face animation capable of representing everything from simple, individually meaningless moves to complicated high-level stories. This hierarchy can be thought of as consisting of the following levels (bottom-up):

[Figure 1. FML Timeline and Temporal Relation of Face Activities: stories consist of actions, which in turn consist of moves, arranged along a time axis.]

1- Frame, a single image showing a snapshot of the face (naturally, it may not be accompanied by speech)
2- Move, a set of frames representing a linear transition between two frames (e.g. making a smile)
3- Action, a "meaningful" combination of moves
4- Story, a stand-alone piece of face animation

The boundaries between these levels are not rigid and well defined. Due to the complicated and highly expressive nature of facial activities, a single move can make a simple yet meaningful story (e.g. an expression). The levels are basically required by the content designer in order to:
1- Organize the content
2- Define temporal relations between activities
3- Develop behavioural templates,
based on his/her presentation purposes and structure.

FML defines a timeline of events including head movements, speech, and facial expressions, and their combinations. Since a face animation might be used in an interactive environment, such a timeline may be altered/determined by a user. So another functionality of FML is to allow user interaction and, in general, event handling (notice that user input can be considered a special case of an external event). This event handling may be in the form of:

1- Decision Making: choosing to go through one of the possible paths in the story
2- Dynamic Generation: creating a new set of actions to follow

A major concern in designing FML is compatibility with existing standards and languages. The growing acceptance of the MPEG-4 standard makes it necessary to design FML in a way that it can be translated to/from a set of FAPs. Also, due to the similarity of concepts, it is desirable to use SMIL syntax and constructs as much as possible.

Primary Language Constructs
FML is an XML-based language. The choice of XML as the base for FML rests on its capabilities as a markup language, its growing acceptance, and the available system support on different platforms. Figure 2 shows the typical structure of an FML document.

  <fml>
    <model>     <!-- Model Info -->
        <model-info />
    </model>
    <story>     <!-- Story Timeline -->
        <action>
           <time-container>
              <move-set>
                  < . . . >
              </move-set>
              <move-set>
                  < . . . >
              </move-set>
           </time-container>
           < . . . >
        </action>
        < . . . >
    </story>
  </fml>

  Figure 2. FML Document Map; time-container and move-set will be replaced by FML time container elements and sets of possible activities, respectively.

An FML document consists, at the higher level, of two types of elements: model and story. A model element is used for defining face capabilities, parameters, and initial configuration. A story element, on the other hand, describes the timeline of events in the face animation. It is possible to have more than one of each element but, due to the possible sequential execution of animation in streaming applications, a model element affects only those parts of the document coming after it.

The face animation timeline consists of facial activities and their temporal relations. These activities are themselves sets of simple Moves. The timeline is primarily created using two time container elements, seq and par, representing sequential and parallel move-sets. A story itself is a special case of a sequential time container. The begin times of activities inside a seq and a par are relative to the previous activity and the container begin time, respectively.

  <seq begin="0">
       <talk begin="0">
         Hello World</talk>
       <hdmv begin="0" end="5"
         dir="0" val="30" />
  </seq>
  <par begin="0">
       <talk begin="1">
         Hello World</talk>
       <exp begin="0" end="3"
         type="3" val="50" />
  </par>

  Figure 3. FML Primary Time Containers

FML supports three basic face activities: talking, expressions, and 3D head movements. They can be a simple Move (like an expression) or more complicated (like a piece of speech). Combined in time containers, they create FML Actions. This combination can be done using nested containers, as shown in Figure 4.

FML also provides the means of creating a behavioural model for the face animation. At this time, it is limited to initialization data such as the range of possible movements and the image/sound database, and simple behavioural templates (subroutines). But it can be extended to include behavioural rules and knowledge bases, especially for interactive applications. A typical model element is illustrated in Figure 5, defining a behavioural template used later in the story.
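The relative begin-time rule for seq and par containers can be illustrated with a short sketch. The `schedule` helper below is hypothetical (not part of ShowFace); it assumes that an activity's begin and end attributes share the same reference point, so its duration is end minus begin, with a default duration when end is absent.

```python
import xml.etree.ElementTree as ET

def schedule(container, t0=0.0, default_dur=1.0):
    """Compute absolute (start, end) times for activities in an FML
    time container. Assumed interpretation: inside <par>, a child's
    begin is relative to the container's begin; inside <seq>, it is
    relative to the point where the previous activity ended."""
    events = []
    cursor = t0
    for child in container:
        begin = float(child.get("begin", "0"))
        end_attr = child.get("end")
        dur = (float(end_attr) - begin) if end_attr else default_dur
        start = (t0 if container.tag == "par" else cursor) + begin
        events.append((child.tag, start, start + dur))
        cursor = start + dur
    return events

# The par example from Figure 3: talk starts 1s after the container,
# exp starts with the container, both end at t=3.
fml = ET.fromstring(
    '<par begin="0">'
    '<talk begin="1" end="3"/>'
    '<exp begin="0" end="3" type="3" val="50"/>'
    '</par>')
print(schedule(fml))   # → [('talk', 1.0, 3.0), ('exp', 0.0, 3.0)]
```

In a seq container the same activities would instead be laid end to end, the second one starting where the first finishes.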
  <action>
    <par begin="0">
      <seq begin="0">
        <talk begin="0">
          Hello World</talk>
        <hdmv begin="0" end="5"
          dir="0" val="30" />
      </seq>
      <exp begin="0" end="3"
        type="3" val="50" />
    </par>
  </action>

  Figure 4. Nested Time Containers

  <model>
    <img src="me.jpg" />
    <range dir="0" val="60" />
    <template name="hello">
      <seq begin="0">
        <talk begin="0">
          Hello</talk>
        <hdmv begin="0" end="5"
          dir="0" val="30" />
      </seq>
    </template>
  </model>
  <story>
    <behaviour template="hello" />
  </story>

  Figure 5. FML Model and Templates

Event Handling and Decision Making
Dynamic interactive applications require the FML document to be able to make decisions, i.e. to follow different paths based on certain events. To accomplish this, an excl time container and an event element are added. An event represents any external data, e.g. the value of a user selection. The new time container is associated with an event and waits until the event has one of the given values; it then continues with the action corresponding to that value.

Compatibility
The XML-based nature of this language allows FML documents to be embedded in web pages. Normal XML parsers can extract data and use them as input to an FML-enabled player, through simple scripting. Such a script can also use the XML Document Object Model (DOM) to modify the FML document, e.g. adding certain activities based on user input. This compatibility with web browsing environments gives another level of interactivity and dynamic operation to an FML-based system, as illustrated in Section 5.

  <event id="user" val="-1" />
  <excl ev_id="user">
       <talk ev_val="0">Hello</talk>
       <talk ev_val="1">Bye</talk>
  </excl>

  Figure 6. FML Decision Making and Event Handling

Another aspect of FML is its compatibility with MPEG-4 face definition/animation parameters. This has been achieved by:
1- Translation of FML documents to MPEG-4 codes by the media player.
2- Embedding of MPEG-4 elements in FML (a fap element is considered, to allow direct embedding of FAPs in an FML document).


4. SHOWFACE SYSTEM

FML can be used as input to any system capable of parsing and interpreting it. We have used FML as the native input mechanism in the ShowFace system [2]. ShowFace is a modular streaming system for face animation by a personalized agent. It is the logical extension of our previous work, TalkingFace [1], applying its image transformations to a variety of facial actions, including speech, expressions, and head movements, rather than visual speech alone. ShowFace uses a component-based architecture which allows interfacing to standard objects like decoders and output renderers, and also an independent upgrade of each part of the system, e.g. the input parser, facial image generator, and speech synthesizer. This means that new animation techniques, for instance 3D geometric models, can be used as long as the component interfacing schemes are preserved. ShowFace also allows the input to be in the form of scenarios written in FML. The basic structure of the ShowFace system is illustrated in Figure 7.
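As a rough illustration of what an FML-compatible player looks like from the outside, the Python sketch below exposes the three operations the paper attributes to the ShowFacePlayer wrapper (ReadFML, Run, SetFaceEvent). Only those method names come from the paper; the class and its internals are hypothetical, and a real player would of course render audio/video rather than return a list of tags.

```python
import xml.etree.ElementTree as ET

class FMLPlayer:
    """Minimal, hypothetical skeleton of an FML-compatible player.
    Method names mirror the ShowFacePlayer wrapper described in
    the paper; everything else is illustrative."""

    def __init__(self):
        self.doc = None
        self.events = {}    # external event values, keyed by event id

    def ReadFML(self, text):
        # Parse the FML document and record declared external events.
        self.doc = ET.fromstring(text)
        for ev in self.doc.iter("event"):
            self.events[ev.get("id")] = int(ev.get("val", "-1"))

    def SetFaceEvent(self, ev_id, value):
        # Called by the owner application to simulate an external event.
        self.events[ev_id] = value

    def Run(self):
        # Stand-in for playback: list the activities found in the story.
        return [el.tag for story in self.doc.iter("story")
                for el in story]

player = FMLPlayer()
player.ReadFML('<fml><model/><story>'
               '<event id="user" val="-1"/>'
               '<talk>Hello</talk></story></fml>')
player.SetFaceEvent("user", 0)
print(player.Run())   # → ['event', 'talk']
```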
[Figure 7. Component-based ShowFace System Structure: applications (GUI, web pages, etc.) access the system through SF-API or the ShowFace Player ActiveX control; a script reader and splitter feed the audio and video kernels, whose outputs are combined by a multimedia mixer on top of the underlying multimedia framework.]

ShowFace relies on the underlying multimedia technology for audio and video display. The system components interact with each other using the ShowFace Application Programming Interface, SF-API, a set of interfaces exposed by the components and utility functions provided by the ShowFace run-time environment. User applications can access the system components through SF-API, or use a wrapper object called ShowFacePlayer, which exposes the main functionality of the system and hides the programming details. With the player object, web-based applications can simply create and control content using any scripting language.


5. CASE STUDIES

Static Document
The first case is a simple FML document without any need for user interaction. There is one unique path the animation follows. The interesting point in this basic example is the use of loop structures, using the repeat attribute that can be included in any activity.

The event element specifies any external entity whose value can change. The default value for repeat is 1. If there is a numerical value, it will be used. Otherwise, it must be an event id, in which case the value of that event at the time of execution of the related activity will be used. An FML-compatible player should provide means of setting external event values. ShowFacePlayer has a method called SetFaceEvent, which can be called by the owner of the player object to simulate external events.

  <event id="select" val="2" />
  < . . . >
  <seq repeat="select">
       <talk begin="0">
         Hello World</talk>
       <exp begin="0" end="3"
         type="3" val="50" />
  </seq>

  Figure 8. Repeated Activity (using an event is not necessary)

Event Handling
The second example shows how to define an external event, wait for a change in its value, and perform certain activities based on that value. An external event corresponding to an interactive user selection is defined first. It is initialized to -1, which specifies an invalid value. Then an excl time container, including the required activities for the possible user selections, is associated with the event. The excl element will wait for a valid value of the event. This is equivalent to a pause in the face animation until a user selection is made.

  function onLoad()
  {
     facePlayer.ReadFML("test.fml");
     facePlayer.Run();
  }

  function onHelloButton()
  {
     facePlayer.SetFaceEvent(
       "user", 0);
  }

  function onByeButton()
  {
     facePlayer.SetFaceEvent(
       "user", 1);
  }

  Figure 9. JavaScript Code for the FML Event shown in Figure 6

It should be noted that an FML-based system usually consists of three parts:
1- FML Document
2- FML-compatible Player
3- Owner Application
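The wait-and-select behaviour of the excl container can be sketched as follows. The `run_excl` helper is hypothetical; the ev_id/ev_val attribute names follow the FML examples, and the external event value is supplied by a callable standing in for the owner application.

```python
import xml.etree.ElementTree as ET

def run_excl(excl, get_event):
    """Illustrative excl semantics: poll the associated external
    event until it takes one of the values listed by the children,
    then return the matching activity. get_event stands in for the
    owner application's event source (a sketch, not ShowFace code)."""
    ev_id = excl.get("ev_id")
    valid = {child.get("ev_val"): child for child in excl}
    while True:                      # pause until a valid selection
        value = str(get_event(ev_id))
        if value in valid:
            return valid[value]

# The excl fragment from the paper's event-handling example.
excl = ET.fromstring(
    '<excl ev_id="user">'
    '<talk ev_val="0">Hello</talk>'
    '<talk ev_val="1">Bye</talk>'
    '</excl>')

# Simulate the user pressing the "Bye" button (event value 1).
chosen = run_excl(excl, lambda ev_id: 1)
print(chosen.text)   # → Bye
```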
In a simple example like this, it could be easier to implement the "story" in the owner application and send simpler commands to a player just to create the specified content (e.g. a face saying "Hello"). But in more complicated cases, the owner application may be unaware of the desired stories, or unable to implement them. In those cases, e.g. interactive environments, the owner only simulates the external parameters.

Dynamic Content Generation
The last case to be presented illustrates the use of the XML DOM to dynamically modify the FML document and generate new animation activities.

[Figure 10. Dynamic FML Generation]

The simplified and partial JavaScript code for the web page shown in Figure 10 looks like this:

  function onRight()
  {
     var fml = fmldoc.documentElement;
     var newNode = fmldoc.createElement("hdmv");
     newNode.setAttribute("dir", "0");
     newNode.setAttribute("val", "30");
     fml.appendChild(newNode);
  }

More complicated scenarios can be considered using this dynamic FML generation, for instance having a form-based web page asking for user input on the desired behaviour, and using templates in the model section of the FML document.


6. CONCLUSION

FML is an XML-based language designed specifically for face animation. The main concepts in FML are a hierarchical view of animation and a timeline of activities with temporal relations. The major advantages of FML are:
1- Structured representation of animation events
2- Use of existing XML services
3- Ability to define behavioural templates

FML is designed based on the same concepts, and even syntax, as SMIL, the most important multimedia synchronization standard. It is also formed with MPEG-4 compatibility in mind, in order to cover all the capabilities of FAPs.

Although some behavioural modeling ideas are included in FML, it still needs more work on its modeling parts. The addition of a knowledge base and sets of behavioural rules seems a promising feature, especially for interactive applications.

Interfacing FML-based players to the rest of a multimedia application also requires further research. Embedding FML documents in a web page or an MPEG-4 stream is not yet a very straightforward task.


REFERENCES

[1] A. Arya and B. Hamidzadeh, "TalkingFace: Using Facial Feature Detection and Image Transformations for Visual Speech," Proc. Int. Conf. Image Processing (ICIP), 2001.

[2] A. Arya and B. Hamidzadeh, "ShowFace MPEG-4 Compatible Face Animation Framework," to be published in Proc. Int. Conf. Computer Graphics and Image Processing (CGIP), 2002.

[3] S. Battista, et al., "MPEG-4: A Multimedia Standard for the Third Millennium," IEEE Multimedia, October 1999.

[4] D. Bulterman, "SMIL-2," IEEE Multimedia, October 2001.

[5] J. Cassell, et al., "BEAT: the Behavior Expression Animation Toolkit," Proc. ACM SIGGRAPH, 2001.

[6] P. Ekman and W. V. Friesen, Facial Action Coding System, Consulting Psychologists Press Inc., 1978.

[7] J. Funge, et al., "Cognitive Modeling: Knowledge, Reasoning, and Planning for Intelligent Characters," Proc. ACM SIGGRAPH, 1999.

[8] W. S. Lee, et al., "MPEG-4 Compatible Faces from Orthogonal Photos," Proc. IEEE Conf. Computer Animation, 1999.