Experience Report for RealActor
Institution: Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia
URL: not available
Main Authors: Aleksandra Čereković
RealActor is a multimodal behavior realization system for Embodied Conversational Agents. It provides specification mechanisms
and implements interruption of ongoing behavior.
Core BML behavior compliance
BML behavior | Implementation | Intermediate representation(s) | Surface representation(s)
<locomotion> | Only supports simple walking manner to the specific point in 3D space. The point is ... | Procedural animation units | Joint rotations and root translation
<posture> | Not implemented | - | -
<gesture> | Implemented, but no support for stroke repetition and ... | Motion clips or procedural animation units | Joint rotations
<speech> | Fully supported. Allows speech through TTS (MS or any other). | Morph face units, audio | Morph targets, MPEG-4 FAPs
<head> | Implemented. | Procedural animation units | Joint rotations
<face> | Supported: eyebrow, mouth, blink. From FACS the following AUs are supported: AU1, AU2, AU4-AU32, AU38, AU39, AU43, AU45, AU46 | (Ekman) AU units or morph units | MPEG-4 FAPs
<gaze> | Only "at" gaze is supported, using only eyes or also head | Procedural animation units | Joint rotations and MPEG-4 FAPs
<lips> | Not implemented; we suggest removing this behavior (it has been removed in the BML 1.0 draft, ...) | - | -
<emit> | Fully implemented. | Emit unit | -
<wait> | Not implemented | - | -
Behavior | Description | Intermediate representation(s) | Surface representation(s)
<ra:body> | Complex behaviors performed by several body modalities | Motion clips | Joint rotations
<ra:face> | Complex facial expressions (MPEG-4 facial expressions) | Morph units | MPEG-4 FAPs
<speech><file ref=""/></speech> | Plays an audio file as the character's speech - this behavior is embedded inside core BML's ... | Morph units, audio | Morph targets, MPEG-4 FAPs
Description type | Description extension start tag | Extends core behavior | Description
application/msapi+xml | sapi | speech | Microsoft Speech API XML
audio | audiofile | speech | BMLT audio file playback
RealActor supports synchrony between BML elements using one synchronization point of a BML element as the reference. This means that
one BML element can be synchronized with another using exactly one of the start, ready, stroke-start, stroke, or end points. If more
than one point is specified, only the first one is used for synchrony. The synchrony algorithm therefore needs to be improved in
further work on RealActor.
RealActor supports synchrony with TTS-synthesized speech or lip-synced audio speech. In the speech specification, a practically
unlimited number of sync points can be defined.
RealActor currently does not support:
- required blocks
- synchrony element
- the use of 'self'
- before and after constraints
- stroke repetition
- assigning start to internal sync points (as described in WG3: gesture)
- bml:start, bml:end (do we really need those?? – I agree with Twente group :))
How are constraints interpreted?
We interpret all time constraints specified within a behavior as bidirectional. That is:
<head id="head1" stroke="gesture1:stroke" ... />
means that the strokes of head1 and gesture1 should be aligned. This synchronization constraint is NOT interpreted as setting the
stroke time of head1 to the stroke time of gesture1. The exact same time constraint can be expressed by:
<gesture id="gesture1" stroke="head1:stroke" ... />
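The equivalence of the two notations can be sketched in a few lines. This is only an illustration, not RealActor code: the function name alignment and the set-based representation are assumptions made here to show that the direction in which a constraint is written does not matter.

```python
def alignment(behavior_id, sync_point, reference):
    """Return an order-independent alignment between two sync points.

    `reference` uses the BML "id:syncpoint" notation. The frozenset has no
    notion of which behavior "owns" the constraint, so both ways of
    writing it compare equal.
    """
    ref_id, ref_point = reference.split(":")
    return frozenset({(behavior_id, sync_point), (ref_id, ref_point)})


# <head id="head1" stroke="gesture1:stroke" ... />
from_head = alignment("head1", "stroke", "gesture1:stroke")
# <gesture id="gesture1" stroke="head1:stroke" ... />
from_gesture = alignment("gesture1", "stroke", "head1:stroke")

print(from_head == from_gesture)
```

A planner built on such a representation would have to decide separately which of the two behaviors is shifted in time; the constraint itself does not say.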
How does the scheduling work?
The system is based on animation planning and execution, done by applying BML's primitive behaviors to distinct body
controllers. The overall realization process is carried out by three components responsible for parsing BML scripts, preparing
elements for execution, and realizing planned behaviors.
The first component is responsible for reading BML scripts and creating objects (BML blocks) that contain BML elements with
their attributes. BML blocks are processed sequentially.
After parsing, the system does animation planning. It reads the list of BML blocks and prepares each block for execution by adding
the timing information needed for behavior synchronization. The planning is done using a recursive algorithm in which BML elements
are processed using relative references to other BML elements and the absolute timing information of BML elements which have
already been prepared for execution. This timing information is retrieved from the Animation Database, an animation lexicon which
contains a list of all existing animations. In the Database, each primitive animation is annotated with time constraints, type
information and information about the motion synthesis process (example-based or procedural). Alternatively, along with the BML
script, a developer can provide a BMLT script which contains the timing of animation phases for each BML element. The character's
speech can be handled in two ways: using lip-sync or text-to-speech synthesis. If lip-sync is used, speech must be recorded in a
pre-processing step and manually annotated with the necessary timing information, which is then used for animation planning.
Text-to-speech synthesis is more practical, because it can be performed concurrently with behavior realization. However, most TTS
systems (including Microsoft SAPI, used in RealActor) do not provide a priori the phoneme and word timing information necessary for
synchronization with nonverbal behavior. To address this issue we employ machine learning, specifically neural networks, to
estimate this timing.
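The recursive resolution of relative references described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the RealActor planner: the lexicon contents, the animation names "beat" and "nod", the millisecond values and the function absolute_start are all hypothetical, and the sketch assumes the references contain no cycles.

```python
# Hypothetical lexicon: offset in milliseconds of each sync point from the
# start of an animation. Names and values are made up for illustration.
LEXICON = {
    "beat": {"start": 0, "stroke": 300, "end": 800},
    "nod": {"start": 0, "stroke": 200, "end": 600},
}

# Element id -> (animation name, own sync point, reference "id:point" or None).
ELEMENTS = {
    "head1": ("nod", "stroke", "gesture1:stroke"),
    "gesture1": ("beat", "start", None),
}


def absolute_start(element_id, elements, lexicon, planned):
    """Recursively compute the absolute start time of one element (in ms)."""
    if element_id in planned:  # already prepared for execution
        return planned[element_id]
    animation, own_point, reference = elements[element_id]
    if reference is None:
        start = 0  # unconstrained elements start with the block
    else:
        ref_id, ref_point = reference.split(":")
        ref_animation = elements[ref_id][0]
        # Absolute time of the referenced sync point.
        ref_time = (absolute_start(ref_id, elements, lexicon, planned)
                    + lexicon[ref_animation][ref_point])
        # Shift this element so that its own sync point lands on that time.
        start = ref_time - lexicon[animation][own_point]
    planned[element_id] = start
    return start


planned = {}
for element_id in ELEMENTS:
    absolute_start(element_id, ELEMENTS, LEXICON, planned)
print(planned)
```

Here head1 is listed first but depends on gesture1, so the recursion plans gesture1 before returning to head1. Integer milliseconds are used only to keep the arithmetic exact.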
Finally, prepared BML elements are executed with TTS or lip-sync based speech and related animations (which are nonverbal
behaviors) using a hierarchical architecture of animation controllers. The main controller uses timing information and time control to
decide which behaviors will execute and when. By continually updating the execution phase of each behavior it is able to synchronize
BML elements and provide feedback about the progress of behavior execution. During realization, BML blocks are realized
sequentially. When one block is finished, the main controller notifies the behavior planner so that a new BML block can be passed for
execution.
The principle of realizing multimodal behaviors is based on applying BML's primitive units to distinct body controllers, such as head,
gesture, face, etc. Synchronization is handled by the main controller, which transmits motions to its low-level
subcomponents responsible for the realization of primitive behaviors.
Issues: The behaviors are processed in BML order. If no start constraint is imposed on a behavior, or if the start constraint is imposed
by a subsequent (and thus yet unplanned) behavior, we assume that the behavior starts at time 0. This assumption may result in
subsequent behaviors being planned in the past (e.g. with time constraints <0). If this happens, no shifting is done – negative values
are set to 0.
The behaviors are aligned using just one synchrony point. This means that one BML element can be synchronized with another
element using exactly one of the start, ready, stroke-start, stroke, or end points. If more than one is specified, only the first one is used.
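The first issue above (planning in BML order, defaulting to time 0, and setting negative times to 0 without shifting) can be reproduced with a small sketch. The offsets, behavior ids and the function plan_in_order are hypothetical and chosen only to trigger each case; this is not the RealActor implementation.

```python
# Hypothetical sync-point offsets (ms from the start of each behavior).
OFFSETS = {
    "gesture1": {"start": 0, "stroke": 300, "end": 800},
    "head1": {"start": 0, "stroke": 400, "end": 900},
    "head2": {"start": 0, "stroke": 200, "end": 600},
    "head3": {"start": 0, "stroke": 200, "end": 600},
    "face1": {"start": 0, "stroke": 100, "end": 500},
}

# (id, own sync point, reference "id:point" or None), listed in BML order.
BEHAVIORS = [
    ("gesture1", "start", None),             # no constraint: starts at 0
    ("head1", "stroke", "gesture1:stroke"),  # would need to start at -100 ms
    ("head2", "start", "head3:end"),         # head3 is not planned yet
    ("head3", "start", None),
    ("face1", "start", "gesture1:stroke"),   # resolvable: starts at 300 ms
]


def plan_in_order(behaviors, offsets):
    """Plan start times in BML order, using one sync point per behavior."""
    starts = {}
    for behavior_id, own_point, reference in behaviors:
        start = 0  # default when unconstrained or the reference is unplanned
        if reference is not None:
            ref_id, ref_point = reference.split(":")
            if ref_id in starts:  # only already planned behaviors count
                ref_time = starts[ref_id] + offsets[ref_id][ref_point]
                start = ref_time - offsets[behavior_id][own_point]
        starts[behavior_id] = max(start, 0)  # no shifting: negatives become 0
    return starts


starts = plan_in_order(BEHAVIORS, OFFSETS)
print(starts)
```

In this sketch head1 is clamped to 0, so its stroke falls at 400 ms while the stroke of gesture1 falls at 300 ms: the requested alignment is silently lost. Shifting the whole block forward by 100 ms instead would preserve it.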
What are the available surface realizations and how are they stored and encoded?
Our animation system synthesizes character motion using the MPEG-4 Face and Body Animation (FBA) standard. All animations are
coded as a stream of FBA values for each animation frame.
Procedural animations are hard-coded functions which receive the physical attributes of a BML element as input. Gazing and head
movements are generated by two controllers, Gaze Controller and Head Controller, which run in parallel with the central controller.
When a new gaze direction or head movement occurs, the central controller sends those two controllers the parameter values which
correspond to the new movement. The animation is executed depending on the current position of the character's head/eyes and the
target position. Head Controller uses a spline function to compute head movements in different directions. Gaze Controller generates
eye movements and head rotations (unfortunately, neck and body rotations are not taken into account). To generate facial expressions
we use two distinct controllers. The first, Face Controller, controls the mouth, eyes and eyebrows separately. Alternatively, we
designed a face animation model based on Ekman's Facial Action Coding System. The model is controlled by AU Manager, which runs
in parallel with the central controller, accepting and running Action Units (AUs). RealActor uses the time constraints defined in the
animation library. Alternatively, a developer can write a BMLT script which contains the timings of BML elements.
Most gestures are generated using motion clips from the animation database, which contains 52 types of gestures. The clips affect not
only hand and arm movements, but also head movements, which are blended with the result coming from the Head Controller. The clips
are saved as FBA files and are loaded when RealActor initializes. Depending on the CPU speed, initialization of RA may take up to 2
minutes. At the moment, motion clips are not editable.
How do you move from BML to surface realization?
One BML behavior is passed to one motion controller responsible for its realization. The controller then uses the appropriate animation,
which is either a procedural hard-coded action or a prerecorded animation (a motion example modeled in 3ds Max).
Animations are applied to the character as streams of face and body animation parameter (MPEG-4 FBAP) values. When multiple
motion controllers (or prerecorded animations or motion clips) affect the same character, they are blended together using linear
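As a rough illustration of linear blending over parameter streams, the sketch below computes a per-parameter weighted average of two frames. It assumes that blending is a weighted linear combination; the weights, the parameter names head_yaw and r_shoulder, and the function blend_frames are hypothetical and are not taken from RealActor or from the MPEG-4 parameter tables.

```python
def blend_frames(frames, weights):
    """Linearly blend animation frames given as dicts: parameter -> value.

    Each parameter is averaged over the frames that set it, using the
    weights of those frames only, so a parameter driven by a single
    controller keeps its value unchanged.
    """
    if len(frames) != len(weights):
        raise ValueError("one weight per frame is required")
    weighted_sum = {}
    weight_total = {}
    for frame, weight in zip(frames, weights):
        for name, value in frame.items():
            weighted_sum[name] = weighted_sum.get(name, 0.0) + weight * value
            weight_total[name] = weight_total.get(name, 0.0) + weight
    return {
        name: weighted_sum[name] / weight_total[name]
        for name in weighted_sum
        if weight_total[name] > 0
    }


# One frame from a procedural head controller, one from a gesture clip.
head_controller_frame = {"head_yaw": 10.0}
gesture_clip_frame = {"head_yaw": 30.0, "r_shoulder": 5.0}

blended = blend_frames([head_controller_frame, gesture_clip_frame], [0.5, 0.5])
print(blended)
```

With equal weights the head parameter set by both sources is averaged, while the shoulder parameter set only by the clip passes through unchanged.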
What interesting additional issues did you run into?
During realization, scheduling and planning typically take a non-zero amount of time. We have used RealActor to re-create the
behavior of a real human speaking in an angry voice. The created utterance (BML block) contains 369 BML elements
(face and head movements) and lasts for 1 minute and 15 seconds. Scheduling and planning of this BML block takes about 3
Due to the presence of many controllers and scheduling mechanisms, RA takes a lot of processing memory.
BML script authoring is hard work. Even if we have an idea of what the character should do, it is hard to get realistic-looking
behaviors using just a few BML elements, especially for face and head movements. Therefore we started to develop a BML
authoring tool, a graphical tool with drag-and-drop mechanisms which visualizes the content of a BML script and the relationships
between elements in one BML block. We have a first test version (with no temporal, just tree-based visualization), but we have
now run out of people who can do that job.
In the case of the angry speaker (see point one of this question) we needed to add an additional description of facial movements:
the duration of execution phases for FACS action units. The realization result differs when we control the duration of BML
element phases. For example, by controlling the phase durations of f1, f2 and f3 we control the dynamics of the brows rising and falling:
<speech id="speech1"><text>Hello<sync id="tm1"/>world</text></speech>
<face id="f1" start="speech1:tm1" type="facs" au="1" amount="0.60" />
<face id="f2" stroke-start="f1:stroke-end" type="facs" au="1" amount="1" />
<face id="f3" stroke-start="f2:stroke-end" type="facs" au="1" amount="0.20" />
Therefore, we provide an internal layer which handles this issue: a BMLT script which contains the timings of BML elements.
Using this script we could correctly recreate the movements of a real human speaking in an angry voice. We claim that
a realizer has to have time control over executed BML elements, especially for the face, because timing is important. However, this
raises the issue of "double timing": relationship-based time control inside BML (where BML elements may have complex relations)
and time control inside the realizer.
When developing RealActor we sometimes created an "incorrect" BML script for testing. By "incorrect" we
mean that it had an incorrect (temporal) relationship between BML elements, e.g.
<speech id="speech1"><text>Hello<sync id="hello"/>world</text></speech>
<head id="head1" start="head2:end" action="ROTATION" rotation="NOD"/>
<head id="head2" start="head1:stroke" action="ORIENTATION" orientation="LEFT"/>

<speech id="speech1"><text>Hello<sync id="hello"/>world</text></speech>
<head id="head1" start="speech1:hello" action="ROTATION" rotation="NOD"/>
<head id="head2" start="speech1:world" action="ORIENTATION" orientation="LEFT"/>
What should be the result of these scripts?
BML realizers need mechanisms that check the validity of a given script. Or is that the task of the script author?
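One possible shape for such a validity check is sketched below. It is not part of RealActor: the function validate, the wrapping <bml> element and the set of sync point names are assumptions made for illustration. The sketch reports references to sync points that do not exist and circular start dependencies, which are the two problems present in the scripts above.

```python
import xml.etree.ElementTree as ET

# Sync point names assumed to be available on every behavior.
SYNC_POINTS = {"start", "ready", "stroke-start", "stroke", "stroke-end",
               "relax", "end"}


def validate(bml_text):
    """Return a list of human-readable problems found in one BML block."""
    root = ET.fromstring(bml_text)
    known = {}  # behavior id -> sync point names it offers
    deps = {}   # behavior id -> ids of behaviors it depends on
    problems = []
    for element in root:
        points = set(SYNC_POINTS)
        # <sync id="..."/> markers inside speech text add named sync points.
        points.update(s.get("id") for s in element.iter("sync"))
        known[element.get("id")] = points
    for element in root:
        own_id = element.get("id")
        deps[own_id] = set()
        for name, value in element.attrib.items():
            if name in SYNC_POINTS and ":" in value:
                ref_id, ref_point = value.strip().split(":", 1)
                if ref_point not in known.get(ref_id, set()):
                    problems.append(
                        f"{own_id}: unknown reference {ref_id}:{ref_point}")
                else:
                    deps[own_id].add(ref_id)

    state = {}  # behavior id -> "active" (on the DFS stack) or "done"

    def visit(node, path):
        if state.get(node) == "done":
            return
        if state.get(node) == "active":
            cycle = path[path.index(node):] + [node]
            problems.append("circular reference: " + " -> ".join(cycle))
            return
        state[node] = "active"
        for dep in sorted(deps[node]):
            visit(dep, path + [node])
        state[node] = "done"

    for node in deps:
        visit(node, [])
    return problems


CYCLIC = """<bml>
<speech id="speech1"><text>Hello<sync id="hello"/>world</text></speech>
<head id="head1" start="head2:end" action="ROTATION" rotation="NOD"/>
<head id="head2" start="head1:stroke" action="ORIENTATION" orientation="LEFT"/>
</bml>"""

DANGLING = """<bml>
<speech id="speech1"><text>Hello<sync id="hello"/>world</text></speech>
<head id="head1" start="speech1:hello" action="ROTATION" rotation="NOD"/>
<head id="head2" start="speech1:world" action="ORIENTATION" orientation="LEFT"/>
</bml>"""

for script in (CYCLIC, DANGLING):
    print(validate(script))
```

Whether a realizer should reject such a script, repair it, or realize only the valid part remains the open question raised above; the sketch only detects the problems.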