					                         OPEN MOBILE VIDEO
         COMMUNICATION FOR THE DEAF.




                         DRAFT




     By: Chris Laidler




V 0.10
A. Abstract

B. Acknowledgements

C. Contents
A. Abstract
B. Acknowledgements
C. Contents
D. Glossary
E. List of Tables
F. List of Figures
1. Introduction
      1.1. Problem statement
      1.2. Research question
      1.3. Thesis Organisation
2. Background and related work
      2.1. The Local Deaf community
      2.2. Deaf communication
      2.3. Needs and characteristics of SL video
      2.4. Video encoding
        2.4.1. The H.264 standard
        2.4.2. The x264 codec
      2.5. Mobile infrastructure
        2.5.1. Android
      2.6. Related Work
        2.6.1. Eye tracking
        2.6.2. Foveated and Region of interest coding
        2.6.3. MobileASL
        2.6.4. Local work
      2.7. Summary
3. Design
      3.1. Research question
      3.2. Introduction
      3.3. Encoding system
        3.3.1. Platform
        3.3.2. Base Codec
        3.3.3. ROI
        3.3.4. Face detection
        3.3.5. Rate control
      3.4. Analysis
        3.4.1. Measures and metrics
        3.4.2. Variables being measured
      3.5. Objective measurements
      3.6. Subjective measurements
        3.6.1. User experiments
      3.7. Summary
4. Implementation
      4.1. Introduction
      4.2. Encoding system
        4.2.1. Platform
        4.2.2. Base Codec
        4.2.3. ROI
        4.2.4. Weighted Mask
        4.2.5. Face detection
        4.2.6. Statistic
        4.2.7. Frame saving
        4.2.8. Rate control
      4.3. Analysis
      4.4. User experiments
      4.5. Conclusion
5. Results
      5.1. Encoder
        5.1.1. Final rate
        5.1.2. Skip
      5.2. User experimentation
        5.2.1. User vids
6. Discussion
      6.1. Encoder
        6.1.1. Segmentation
        6.1.2. ROI encoding method
        6.1.3. ROI Spread
        6.1.4. Rate control
        6.1.5. Rate
        6.1.6. Encoding Fps
        6.1.7. Summary
      6.2. User Study
        6.2.1. Introduction
        6.2.2. MOS data
        6.2.3. Quality metrics
        6.2.4. Encoder variables and parameters
        6.2.5. Summary
      6.3. Summary
7. Conclusion
      7.1. Future work
      7.2. We got good results
8. Appendix A: build x264
9. Appendix B: Experimental video questions
10. Appendix A: User questionnaires
11. Appendix A: User questionnaires
12. DATA
13. Appendix XX Results
14. Reference
15. Crap

D. Glossary
ICT    Information and Communications Technology
MB     Macroblock
DCT    Discrete Cosine Transform
SASL   South African Sign Language
SL     Sign Language
DCCT   Deaf Community of Cape Town




E. List of Tables




F. List of Figures




1.   Introduction
Many members of the South African Deaf community suffer not only directly from their
physical disability but also from a number of related socio-economic problems. The biggest
of these are illiteracy and poverty, which stem from a legacy of discrimination and from poor
educational structures for the Deaf in South Africa [10].

What many people do not realize is that sign language (SL) is not a spatial form of another
language such as English. It is a distinct language with its own structure and grammar.
Moreover, there is not one sign language but many; South African Sign Language (SASL) is
the primary language used by the Deaf in South Africa and the first language of most of the
South African Deaf community. There have been a number of attempts to create a written
form of sign language. These include HamNoSys, a phonetic transcription tool developed in
1984, and SignWriting, developed by Valerie Sutton in 1974, which is a graphical
two-dimensional representation using symbols [6]. The major disadvantage of these systems
is the many varying forms of SL, and as a result they have not been widely adopted. Any
written communication between Deaf people is therefore often not in their first language.

Information and Communications Technology (ICT) has revolutionized the way most of us
communicate, with a strong drive towards mobile handheld devices. However, much of ICT
was designed for audio communication, which naturally excludes the Deaf. The alternative is
text communication, but as the previous paragraph highlighted, this is not suitable for the
Deaf either: it forces them to communicate in a second or third language, and it is often
considered impersonal and slow [3]. As a result, many of the advances in ICT have not been
available to the Deaf.

With the development of camera-enabled smartphones, video communication is now a real
alternative. It opens up a new communication channel for the Deaf, allowing them to benefit
from new ICT developments, and highlights the need for a mobile visual form of
communication tailored for the Deaf.

This need has led to the formation of a project, funded by a Dutch funding agency
(SANPAD). The project aims to provide Deaf users with a practical way of communicating in
their own language, SASL, and at the same time highlight policy impediments to the
widespread adoption of such a solution.

The ultimate goal is to produce real-time video communication between Deaf users. This
project aims to lay some of the groundwork required to realize the goals of the greater
project.


1.1.   Problem statement
There is a substantial body of work devoted to the needs of sign language video. Recently
much of this work has focused on the requirements and associated difficulties of using
mobile devices to facilitate sign language video.

Firstly, mobile devices are a naturally restrictive environment. The major factors that
influence this form of communication are the limited computational power, the limited
bandwidth available for data transfer, and various hardware limitations such as the size of
the display, the location and quality of the video camera, the power requirements and the
limited battery life of mobile devices.

Sign language video has a number of unique requirements. The major ones are that the video
must be of high enough quality for the sign language to be intelligible, and that the video
stream must be continuous and uninterrupted. Video jitter has a substantial negative effect
on the intelligibility and usefulness of sign language video.


These two aspects, the available technology and the communication requirements, are often
in opposition: communication requires high-quality video, while the technology limits the
achievable quality. This tension motivates our work.

Much work has been done in the field to determine optimizations to video, and to the way it
is encoded, that maximize intelligibility. Limited computational power means that video
encoding needs to be efficient, with minimal computation, and limited bandwidth means that
the video has to be rate limited. Much of the resulting work has focused on region of interest
(ROI) encoding, in which more bits are spent on the regions of the frame that matter most for
intelligibility. Most of the segmentation for these ROI encoders is based on skin detection.
Unfortunately, many methods of skin detection give poor results on dark skin colours, which
means these ROI encoders will perform poorly in our South African context.
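To make the idea of ROI encoding concrete, the following Python sketch builds a per-macroblock QP map in which macroblocks overlapping a region of interest receive a lower QP (finer quantization, more bits) than the background. It assumes the ROI is a single rectangle in pixel coordinates, such as a bounding box returned by a face detector; the function name and the offsets of -6 and +4 are hypothetical illustrative values, not parameters of any encoder discussed in this thesis.

```python
def roi_qp_map(width, height, roi, base_qp, roi_offset=-6, bg_offset=4, mb_size=16):
    """Build a per-macroblock QP grid for ROI encoding (illustrative sketch).

    roi is (x, y, w, h) in pixels, e.g. a hypothetical face-detector box.
    Macroblocks that overlap the ROI get base_qp + roi_offset (more bits);
    all others get base_qp + bg_offset (fewer bits). Results are clamped
    to the H.264 QP range 0..51.
    """
    cols = (width + mb_size - 1) // mb_size   # round up partial macroblocks
    rows = (height + mb_size - 1) // mb_size
    x0, y0, w, h = roi
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            mx, my = c * mb_size, r * mb_size
            # axis-aligned rectangle overlap test between this MB and the ROI
            overlaps = (mx < x0 + w and mx + mb_size > x0 and
                        my < y0 + h and my + mb_size > y0)
            qp = base_qp + (roi_offset if overlaps else bg_offset)
            row.append(max(0, min(51, qp)))
        grid.append(row)
    return grid


# QCIF frame (176x144 -> 11x9 macroblocks) with a hypothetical face box.
qp_map = roi_qp_map(176, 144, roi=(64, 32, 48, 48), base_qp=30)
for row in qp_map:
    print(" ".join(f"{q:2d}" for q in row))
```

Because the H.264 quantization step size roughly doubles for every increase of 6 in QP, the 3x3 block of face macroblocks in this example is quantized far more finely than the background.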

1.2.   Research question
The objective of this project is to implement a region of interest (ROI) encoder that uses
face detection to aid segmentation. The encoder will be tested experimentally to investigate
how differing levels of ROI encoding affect intelligibility and bit rate.

One of the requirements of the greater project is that the system be built on open software
and systems.

1.3.   Thesis Organisation
Will do later….




2.     Background and related work
2.1.   The Local Deaf community
This project is part of a larger project that has been working in conjunction with the Deaf
Community of Cape Town (DCCT), a non-governmental organization (NGO) based at the
Bastion of the Deaf in Newlands, Cape Town {{17 Tucker, W.D. 2009}}. DCCT serves a
disadvantaged Deaf community of approximately a thousand people {{17 Tucker, W.D.
2009}}. The community members are united by their disability, and a number of
socio-economic factors disadvantage the community: their disability often leads to a lack of
education and to unemployment, and many of them come from previously disadvantaged
race groups, all of which contribute to a cycle of poverty. One of the unifying features of the
community is that its members communicate in SASL, which is their first language {{17
Tucker, W.D. 2009}}. Many members of the community have poor levels of spoken, written
and reading literacy in any other language {{17 Tucker, W.D. 2009}}. This is a common
theme in most of the literature on the Deaf: their first language is a form of SL, and
understandably this is their preferred form of communication {{17 Tucker, W.D. 2009}} {{7
Cavender, A. 2006}}.

2.2.   Deaf communication
There have been a number of technological developments to facilitate Deaf communication.
Many of these are based on the concept of a relay: an intermediary that translates a message
into another form and then passes it along. The relay can be a computer or, as is often the
case in Deaf communication, a person. Most of these relays are centered on communication
between the hearing and the Deaf. The thesis of W. Tucker deals with the process of creating
a digital relay in the local Deaf community {{17 Tucker, W.D. 2009}}.

There have been numerous forms of text telephony (TTY) around the world; these include the
Teldem in South Africa {{17 Tucker, W.D. 2009}} and the more widely used short message
service (SMS) on mobile devices. There have also been a number of text-to-audio relays that
allow the Deaf to communicate with the hearing. These relays are either automatic or manual,
but they have mostly been limited to the developed world, and even there they are not
widely available. Recently a number of them have been implemented for mobile devices,
including Vodafone's service for the Nokia 9210i and Mobile TextPhone {{17 Tucker, W.D.
2009}}. Because spoken language is difficult to interpret automatically, most of these are still
manual, which makes them slow and expensive. Moreover, these relays do not allow Deaf
users to communicate with other Deaf users.

There are also a number of international video relay services (VRS). These are currently
exclusively manual; a number of projects are attempting to automate them, but there is still
no effective way to interpret SL automatically, so these relays remain relatively expensive
and therefore exclusive.


There are also a number of systems based on avatars. These allow users to enter text, which
is converted into a structure that is displayed as an animated character signing; examples
include Signeuse Virtuelle, eSIGN, VSigns and WebSign {{27 Jemni, M. 2007}}. Again,
however, these require the sender to enter their message as text.

The aim of this project is to allow the Deaf to communicate in SL. Recently there has been a
big move towards internet-based video conferencing, but many of the standard video
conferencing tools are not suitable for SL communication, for reasons outlined in the next
section. There are also a number of SL-oriented videoconferencing tools. Most of these rely
on some form of video or web camera, a computer and a connection to the internet; that is,
they all rely on fixed infrastructure {{7 Cavender, A. 2006}}. There is currently no mobile
option in South Africa. Our aim is to investigate the feasibility of using handheld mobile
devices to facilitate this type of video communication in South Africa.

2.3.    Needs and characteristics of SL video
The natural question is: why don't the Deaf simply use existing video conferencing tools?
The short answer is that the video quality is not high enough. Existing tools may support
video, but they are still predominantly focused on audio communication, so their video
quality is too low to make sign language intelligible.

There has been much research in the field of video communication for the Deaf, and much
of the current focus is on mobile video communication. Here the specific constraints are low
bit rates, limited computational power and the small size of handheld devices. Cavender et
al. {{7 Cavender, A. 2006}} conducted a series of focus groups on the requirements of mobile
SL video communication. A brief summary of their findings follows:

       •  The camera and screen should face the same direction to facilitate two-way
          communication.
       •  The mobile device should have a stand.
       •  There is scope for holding the device in one hand for short conversations.
       •  There is the possibility of using a separate external camera.
       •  Users want the ability to supplement the video message with text input.
       •  The device should have existing text features such as e-mail, instant messaging and
          SMS.
       •  The device should have a "ring" function to indicate an incoming video call.
       •  The user should be able to accept or reject a call.
       •  The system should be able to leave a video message, which they dubbed
          SignMail.
       •  Video messages must be accessible from other technology such as standard
          computers.
       •  Privacy is less of a concern than expected; this was attributed to the inherently
          public nature of an SL conversation.

Most of the work done in the field of mobile video communication has been research, with
very little work being done on actual mobile devices. Most of the research has been on
intelligibility of SL and on optimizing video encoding with respect to SL intelligibility.

Two subjective metrics have been adopted to evaluate the intelligibility of SL. The first is a
direct intelligibility test, in which comprehension is tested directly by a written
questionnaire. The second is an opinion test, in which users rate their perception of
intelligibility on a seven-point linear scale; the mean rating is taken as a measure of
intelligibility. This is a form of Mean Opinion Score (MOS). The two have been found to be
highly correlated {{22 Nakazono, K. 2006}}, and the second is preferred because it is less
labour-intensive and does not rely on the writing ability of the participants.
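The opinion-test measure can be sketched in Python as the mean of the seven-point ratings, together with a Pearson correlation of the kind used to compare the two kinds of test. The function names are hypothetical, and the ratings in the usage line are made-up placeholders, not data from any study.

```python
import math


def mean_opinion_score(ratings, low=1, high=7):
    """Mean of opinion ratings on a low..high scale (seven-point by default)."""
    if not ratings:
        raise ValueError("no ratings given")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} outside {low}..{high}")
    return sum(ratings) / len(ratings)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Placeholder ratings for one video clip from four viewers.
print(mean_opinion_score([5, 6, 7, 4]))  # 5.5
```

In a study, one MOS value per clip would be correlated against the corresponding comprehension scores; a coefficient close to 1 indicates that the cheaper opinion test tracks the direct test well.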

The traditional objective metric, Peak Signal-to-Noise Ratio (PSNR), is derived from the
mean squared error over the whole frame and treats every pixel equally; it did not correlate
highly with intelligibility. Ciaramello et al. {{23 Ciaramello, F. 2007}} developed another
objective metric for SL intelligibility, based on the weighted Mean Squared Error (MSE) of the
hands and face, so that errors in these regions count more heavily than errors elsewhere in
the frame. This metric has been found to predict intelligibility better {{23 Ciaramello, F.
2007}}.
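The difference between the two objective measures can be illustrated with a short Python sketch. PSNR is computed from the plain MSE over all pixels, while the weighted MSE lets each pixel carry a weight, for example a larger one on the face and hands. The weight maps and function names here are hypothetical stand-ins; this is not the exact formulation used by Ciaramello et al.

```python
import math


def mse(ref, test):
    """Mean squared error between two equally sized frames (lists of rows)."""
    diffs = [(a - b) ** 2 for ra, rb in zip(ref, test) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def psnr(ref, test, peak=255):
    """Peak signal-to-noise ratio in dB for 8-bit samples (peak = 255)."""
    e = mse(ref, test)
    return math.inf if e == 0 else 10 * math.log10(peak ** 2 / e)


def weighted_mse(ref, test, weights):
    """MSE where each pixel's squared error is scaled by its weight.

    weights is a hypothetical per-pixel map, e.g. larger values where the
    face and hands are; the result is normalised by the total weight.
    """
    num = sum(w * (a - b) ** 2
              for ra, rb, rw in zip(ref, test, weights)
              for a, b, w in zip(ra, rb, rw))
    return num / sum(w for rw in weights for w in rw)


ref = [[100, 100], [100, 100]]
test = [[110, 100], [100, 100]]        # one pixel off by 10
face_weights = [[4, 1], [1, 1]]        # the damaged pixel lies in the "face"
background_weights = [[1, 4], [1, 1]]  # the damaged pixel lies in the background
print(mse(ref, test), round(psnr(ref, test), 2))
print(weighted_mse(ref, test, face_weights), weighted_mse(ref, test, background_weights))
```

PSNR is the same wherever the error falls, whereas the weighted MSE here is four times larger when the error lies in the heavily weighted region.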

The major constraints to consider when investigating mobile video communication are the
available bandwidth, the processing power required to encode and decode video, and the
battery power required to encode and transmit data.

2.4.   Video encoding
One of the major elements of this project is video communication. The core problem in
video communication is transferring video data at the lowest possible bit rate while
maintaining video quality, which results in a fundamental trade-off between bit rate and
quality. In this section I give a brief outline of some of the major topics in video encoding.

The primary workflow of digital video communication is: capturing the video, converting and
compressing the data into a standard form, transferring the data, decoding it back into video
and displaying the result. Digital video consists of a periodic sequence of images called
frames, and each frame consists of a two-dimensional array of pixels.
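A quick calculation shows why compression is unavoidable. The Python sketch below computes the bit rate of uncompressed video from the frame size, frame rate and chroma sampling (the 4:2:0 format described in Section 2.4.1 stores 1.5 samples per pixel). The function name is hypothetical, and QCIF (176x144) at 15 frames per second is an example chosen for illustration.

```python
def raw_bitrate_bps(width, height, fps, bits_per_sample=8, chroma="4:2:0"):
    """Bit rate (bits per second) of uncompressed video.

    Samples per pixel: 4:2:0 -> 1 luma + 2 quarter-size chroma planes = 1.5,
    4:2:2 -> 2.0, 4:4:4 -> 3.0.
    """
    samples_per_pixel = {"4:2:0": 1.5, "4:2:2": 2.0, "4:4:4": 3.0}[chroma]
    return width * height * samples_per_pixel * bits_per_sample * fps


# Example: QCIF at 15 frames per second.
bps = raw_bitrate_bps(176, 144, 15)
print(f"{bps / 1e6:.2f} Mbit/s")  # 4.56 Mbit/s
```

Even this small, low-frame-rate format needs several megabits per second before compression, so the encoder must discard or compactly represent most of this data to reach a rate a mobile link can sustain.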

Video formats are the standards that define the digital data structures into which video is
converted and stored. Codecs are the programmatic tools that perform the actual encoding
into, and decoding from, these formats {{57 Sullivan, G.J. 2005}}. The format and the codec
both play major, interconnected roles in the trade-off between quality and bit rate. A highly
efficient codec may be able to produce highly compressed video, but this process will almost
certainly be computationally expensive.


2.4.1. The H.264 standard
The H.264/AVC standard, also called MPEG-4 Part 10, is one of the latest standards to
emerge from a long line of video standards {{55 Chen, J.W. 2006}}. H.264 is very efficient,
achieving up to 50% lower bit rates than its predecessors for comparable quality {{55 Chen,
J.W. 2006}} {{24 Vanam, R. 2009}}. Its high compression efficiency makes it well suited to the
low-bandwidth mobile environment in which we are working, and most of the work in the
field uses H.264/AVC. In this section I give a brief description of the H.264/AVC standard,
focusing on the elements that are applicable to this project.


H.264 is a block-based coding system. Rather than the Groups of Blocks (GOB) used in earlier
standards such as H.263, H.264 divides each frame into one or more slices, each made up of
macroblocks (MB) of 16x16 pixels. An MB can be further partitioned into smaller blocks, for
example four blocks of 8x8 pixels. MBs form the basic unit of data handled during encoding
and decoding {{26 Wiegand, T. 2003}} {{22 Nakazono, K. 2006}}, although each MB is further
divided into three components that are processed separately {{55 Chen, J.W. 2006}}. These
components correspond to the Y, U and V elements of the colour representation, where Y is
luma, which represents brightness, and U and V are two chroma values, which represent the
extent to which the colour deviates from grey towards blue and red respectively {{57
Sullivan, G.J. 2005}} {{55 Chen, J.W. 2006}}. Within each MB the three components are
sampled at different rates: the usual 4:2:0 sampling gives 16x16 Y values but only 8x8 U and
8x8 V values per MB, with 8 bits of precision per value {{57 Sullivan, G.J. 2005}}.
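The per-macroblock sample counts for 4:2:0 can be checked with a few lines of Python. The function below is a hypothetical helper written for illustration, with the chroma subsampling factors given as (horizontal, vertical).

```python
def mb_samples(mb_size=16, subsampling=(2, 2)):
    """Return (Y, U, V) sample counts for one macroblock.

    subsampling gives the chroma downsampling factor (horizontal, vertical):
    (2, 2) is 4:2:0, (2, 1) is 4:2:2 and (1, 1) is 4:4:4.
    """
    sx, sy = subsampling
    luma = mb_size * mb_size
    chroma = (mb_size // sx) * (mb_size // sy)
    return luma, chroma, chroma


y, u, v = mb_samples()
print(f"Y={y} U={u} V={v}, {y + u + v} bytes at 8 bits per sample")
```

With 4:2:0 sampling, the two chroma components together hold only half as many samples as luma, so a 16x16 MB carries 384 bytes of raw data instead of the 768 needed for full-resolution colour.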

A combination of techniques is used to compress MBs, the dominant ones being prediction,
transformation, quantization and entropy coding.

Prediction exploits the spatial and temporal redundancy of video: within a single frame there
may be large areas that are very similar in appearance, and sequential frames are very similar,
especially when the frame rate is high and there is little movement. Prediction attempts to
find a reference that closely resembles the MB being processed, so that not all of the MB's
data needs to be transferred. There are two forms of prediction: inter-frame prediction (P or
B), where the reference block is in another frame, and intra-frame prediction (I), where the
reference is formed from already-coded parts of the frame being processed. The prediction
need not be restricted to a single block; it can be a weighted function of several blocks,
which is usually the case for an I MB {{55 Chen, J.W. 2006}}. Inter predictions are coded as
motion vectors, which give the displacement between the MB and its reference block, while
intra predictions are coded as an intra prediction mode.
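The inter-frame case can be illustrated with a simple full-search block-matching sketch in Python: for a block in the current frame it tries every displacement within a search window of the reference frame, keeps the one with the smallest sum of absolute differences (SAD) and returns that displacement as the motion vector. Exhaustive search with SAD is a textbook method chosen here for clarity, and the function names are hypothetical; production encoders such as x264 use much faster search patterns and sub-pixel refinement.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block at (bx, by) in cur
    and the block displaced by (dx, dy) in ref."""
    return sum(abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
               for y in range(n) for x in range(n))


def full_search(cur, ref, bx, by, n, search_range):
    """Exhaustive block matching; returns (motion_vector, best_sad)."""
    height, width = len(ref), len(ref[0])
    best_cost, best_mv = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # skip candidate blocks that would fall outside the reference frame
            if not (0 <= bx + dx and bx + dx + n <= width and
                    0 <= by + dy and by + dy + n <= height):
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost


def residual(cur, ref, bx, by, mv, n):
    """Block minus its motion-compensated prediction."""
    dx, dy = mv
    return [[cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x] for x in range(n)]
            for y in range(n)]


# Synthetic 8x8 frames: the current frame is the reference shifted one pixel left.
ref = [[8 * y + x for x in range(8)] for y in range(8)]
cur = [[ref[y][min(x + 1, 7)] for x in range(8)] for y in range(8)]
mv, cost = full_search(cur, ref, bx=2, by=2, n=4, search_range=2)
print(mv, cost)  # (1, 0) 0
```

When the match is perfect, the residual is all zeros and only the motion vector needs to be sent; with no motion compensation (vector (0, 0)) every residual sample in this example would be 1.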

The difference between the actual MB and its prediction is called the residual, and it
represents the data needed to reconstruct the MB. The residual is transformed from the
spatial domain to the frequency domain by a Discrete Cosine Transform (DCT); in H.264 this is
an integer approximation of the DCT applied to small blocks of the residual {{55 Chen, J.W.
2006}}.
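The sketch below applies the H.264 4x4 forward core transform, Y = C X C^T, to a residual block in Python. The matrix C is the integer approximation of the DCT defined by the standard; the scaling that makes it equivalent to a true DCT is folded into the quantization step in H.264 and is omitted here. The helper function names are hypothetical.

```python
# H.264 4x4 forward core transform matrix (integer DCT approximation).
CF = [[1, 1, 1, 1],
      [2, 1, -1, -2],
      [1, -1, -1, 1],
      [1, -2, 2, -1]]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def forward_core_transform(block):
    """Return CF * block * CF^T for a 4x4 residual block (unscaled)."""
    return matmul(matmul(CF, block), transpose(CF))


# A flat residual puts all its energy into the single DC coefficient.
flat = [[5] * 4 for _ in range(4)]
for row in forward_core_transform(flat):
    print(row)
```

A block with fine detail instead spreads its energy over the higher-frequency coefficients, which are the ones quantization tends to reduce to zero.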

The frequency data is then quantized. Quantization is a lossy form of compression because it
is not fully reversible: in practice it retains most of the low-frequency data, while many of
the small high-frequency coefficients are reduced to zero, so some of the fine detail is lost
{{55 Chen, J.W. 2006}}. The Quantization Parameter (QP) is a per-MB value that sets the
quantization step size, and therefore determines how much data is lost in each MB and thus
its quality. In H.264, QP is an integer from 0 to 51; a low QP gives a fine step size, so many
bits are allocated to the MB, while a high QP gives a coarse step size and few bits {{26
Wiegand, T. 2003}} {{22 Nakazono, K. 2006}}.
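The relationship between QP and step size in H.264 can be sketched in Python as follows: the step size for QP 0 to 5 is given by a small table and doubles for every increase of 6 in QP. The quantizer itself is a simplified floating-point model (divide by the step and round half away from zero) with hypothetical function names; real encoders implement this with integer arithmetic and different rounding offsets.

```python
# Quantization step sizes for QP 0..5; each +6 in QP doubles the step.
QSTEP_BASE = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]


def qstep(qp):
    """Quantization step size for an H.264 QP in 0..51."""
    if not 0 <= qp <= 51:
        raise ValueError("QP must be in 0..51")
    return QSTEP_BASE[qp % 6] * 2 ** (qp // 6)


def quantize(coeffs, qp):
    """Simplified quantizer: divide by the step size, round half away from zero."""
    step = qstep(qp)
    out = []
    for row in coeffs:
        levels = []
        for c in row:
            level = int(abs(c) / step + 0.5)
            levels.append(level if c >= 0 else -level)
        out.append(levels)
    return out


coeffs = [[80, -7, 3, 0]]
for qp in (4, 28, 40):
    print(qp, qstep(qp), quantize(coeffs, qp))
```

At QP 4 every coefficient survives, at QP 28 only the large DC coefficient remains, and at QP 40 even that is represented by a single level, which illustrates how a high QP saves bits at the cost of detail.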

Entropy coding is usually the last step. It is a lossless means of encoding the resulting data
that gives shorter codes to more frequently used symbols and longer codes to less frequently
used ones {{55 Chen, J.W. 2006}}. This reduces the total number of bits that have to be
transferred, and thus the bit rate.
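As a concrete example of variable-length coding, the Python sketch below produces the unsigned and signed Exp-Golomb codes that H.264 uses for many header syntax elements; the residual coefficients themselves are coded with CAVLC or CABAC, which are more elaborate. Small values, which are assumed to be the most frequent, get the shortest codewords. The function names mirror the standard's ue(v)/se(v) descriptors but are otherwise illustrative.

```python
def ue(code_num):
    """Unsigned Exp-Golomb codeword, ue(v), as a bit string."""
    if code_num < 0:
        raise ValueError("ue(v) is defined for non-negative integers")
    bits = bin(code_num + 1)[2:]          # binary of code_num + 1
    return "0" * (len(bits) - 1) + bits   # prefix of leading zeros


def se(value):
    """Signed Exp-Golomb codeword, se(v): maps 0, 1, -1, 2, -2, ... to 0, 1, 2, 3, 4, ..."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return ue(code_num)


for n in range(5):
    print(n, ue(n))
```

The codeword length grows only logarithmically with the value, so the code works well whenever small values dominate, without needing a table built from the actual symbol statistics.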

2.4.2. The x264 codec
x264 is an open source H.264 encoder written in C and developed by Laurent Aimar, Loren
Merritt, Eric Petit (OS X), Min Chen (vfw/asm), Justin Clay (vfw), Måns Rullgård, Radek Czyz,
Christian Heine (asm), Alex Izvorski, Alex Wright and Jason Garrett-Glaser {{32 Anonymous
2010}}. x264 has won a number of awards and is a feature-rich, high-performance encoder.
Because it is open source and easily configured and customised, most custom SL video
encoding work has been done with plain x264 or a customised version of it.

2.5.   Mobile infrastructure
2.5.1. Android
This project was designed to run on the Android operating system on HTC hardware.


The Android architecture is divided into three layers: operating system, middleware and
application. The operating system layer runs a Linux kernel and includes a number of
precompiled C/C++ libraries, which developers can access through the Android application
framework. Developers have full access to the same framework APIs used by the core
applications. Applications in the application layer are developed in the Java programming
language {{36 Anonymous 2010}}.

PacketVideo {{33 Ciaramello, F.M. 2007}} was chosen to be the multimedia subsystem
provider for Android. Thus PacketVideo provides the multimedia framework and associated
components (e.g. audio and video codecs, file format parsing and authoring, streaming
components, etc.) which power Android’s multimedia experience and which will be freely
available to the Android developer community. These components are collectively called
OpenCORE {{37 KOSMACH, J.}}.

OpenCORE’s video codecs are highly optimized for speed, efficient use of memory,
robustness and portability. The codecs are based on PacketVideo’s commercial video codecs.
The video codec block includes MPEG4, H.263, and H.264 video codecs. The video codec block
also includes interfaces to integrate other formats and hardware accelerated codecs
{{37 KOSMACH, J.}}.

2.6.   Related Work
2.6.1. Eye tracking
One of the major themes on which much of the sign language specific research is based is
that much of the content and meaning of SL is conveyed by the face and fine facial gestures.
Facial expressions have a big impact on the context of what is being said. This was
popularized by the work of Muir et al and Agrafiotis et al {{9 Muir,L J. 2005}} {{6 Agrafiotis,D.
2003}} {{21 Agrafiotis, D. 2006}}. They performed eye tracking studies in which subjects
took part in a number of experiments. In these experiments subjects watched video
narratives in BSL, and the Eyelink eye tracking system was used to record the gaze points of
the participants. Results of their experiments show that experienced signers focused on the
face of the signer, especially the mouth, whereas inexperienced signers and non-signers often
looked at the hands {{21 Agrafiotis, D. 2006}}{{6 Agrafiotis,D. 2003}}. The reason for this is
that much of the contextual information obtained from the hands and arms is relatively
easily picked up in peripheral vision, while much of the fine contextual information is
conveyed through subtle facial expressions {{7 Cavender, A. 2006}}. The fact that, for sign
language, most if not all of the viewer's attention is focused on the face forms the basis
for most of the work on codec optimization that will be discussed.

2.6.2. Foveated and Region of interest coding
Building on their eye tracking results, Agrafiotis et al then proposed the use of foveated
video coding to reduce overall bitrates {{21 Agrafiotis, D. 2006}}. They had established that
the viewer focused on the face, and more specifically the mouth, of the signer. Foveated
compression aims to exploit the falloff in spatial resolution of the human visual system away
from the point of fixation. They developed an algorithm to separate the image into 8 areas
around the fixation point of the viewer. Each area is defined at the macroblock level, and
each area has a similar maximum visually detectable spatial frequency; the areas are roughly
concentric rings around the focal point. They proposed implementing a variable quality coding
process by specifying the QP of each MB depending on the foveation area in which it falls.
This effectively forces a higher bitrate around the face, and thus improves the video quality
around the face. The idea of adding weight to areas of the image later became known as
region of interest (ROI) encoding, and we use this term throughout the rest of this document.
They proposed two methods for locating the fixation point. The first, based on the
assumption that the signer would be roughly in the centre of the frame, simply fixes the focal
point in the centre. The second uses face tracking to find the face and thus the focal point.
They performed a test in which they combined face tracking and ROI encoding. They reported
a decreased bitrate without substantial loss of intelligibility {{21 Agrafiotis, D. 2006}},
although they gave no substantial details of their experimental method or results.
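The foveation idea can be sketched as a mapping from each macroblock to a ring index around the fixation MB, and from ring index to QP. This is a simplification under stated assumptions: Agrafiotis et al derive their areas from the falloff of visual acuity, whereas the sketch uses rings of a fixed, hypothetical width (`ring_width`), a hypothetical base QP and a hypothetical QP increment per ring. The 11 x 9 MB grid in the usage below corresponds to QCIF (176 x 144) video.

```python
import math

def foveation_areas(mb_cols, mb_rows, fix_col, fix_row, ring_width=2.0, n_areas=8):
    """Assign each macroblock to one of n_areas roughly concentric rings around the fixation MB."""
    areas = []
    for r in range(mb_rows):
        row = []
        for c in range(mb_cols):
            d = math.hypot(c - fix_col, r - fix_row)  # distance in MB units
            row.append(min(int(d // ring_width), n_areas - 1))
        areas.append(row)
    return areas

def qp_map(areas, base_qp=26, qp_per_area=2, max_qp=51):
    """Coarser quantisation (higher QP) the further an MB lies from the fixation point."""
    return [[min(base_qp + a * qp_per_area, max_qp) for a in row] for row in areas]

# QCIF frame (11 x 9 MBs) with the fixation point on MB column 5, row 4.
areas = foveation_areas(11, 9, 5, 4)
qps = qp_map(areas)
print(qps[4][5], qps[0][0])
```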

The next step in encoding for SL was the work done by Nakazono et al {{22 Nakazono, K.
2006}}. Their paper gives a good outline of the work up to that point. They noted a problem
with simply changing the QP of an MB while encoding. If a target bitrate is set and QPs are
changed on the fly, the MBs encoded next will simply consume the freed bits, so these bits
will not be available for the later MBs that should receive them. The QPs need to be
evaluated for the whole frame and then weighted accordingly; this ensures that the target
bitrate will be maintained. To solve this problem they proposed an order in which the MBs
are encoded: first an ordering is set for the GOBs, and then an ordering for the MBs within
those GOBs. The ordering of the GOBs starts at the centre of the ROI, so that if the
allocated bits are exhausted, the areas of highest significance will already have been
encoded. They also made an important distinction: the background in a signing
environment may be busy, and this would consume a large number of bits in a traditional
video encoding system. They proposed defining an ROI around the signer and then marking
blocks outside this region as ‘not coded’, essentially dropping the bits allocated to these
regions to almost zero and thus drastically reducing the required bits, even in a location
with a busy background. They did note that setting MBs close to the point of focus as ‘not
coded’ “gave a strange feeling” to the viewer, so they suggested only using it on the
extremities. They implemented all their optimizations in an H.263 encoder and performed
fairly rigorous experimentation, using a subjective MOS to evaluate their results. They showed
that at set low bitrates their optimised encoding far outperformed the base encoding.
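The ROI-first ordering described above can be sketched as follows. The sketch assumes that a GOB is a single row of MBs (as in QCIF H.263) and breaks ties towards the top and left; it only computes the order and is not a bitstream-conformant H.263 implementation. If the bit budget runs out part way through this order, the MBs left uncoded are the ones furthest from the ROI.

```python
def roi_first_order(mb_cols, mb_rows, roi_col, roi_row):
    """Return (row, col) MB coordinates in an ROI-first coding order.

    GOBs (assumed here to be single MB rows) are ordered by distance from the
    ROI centre row, then MBs within each GOB by distance from the ROI centre column.
    """
    gob_order = sorted(range(mb_rows), key=lambda r: (abs(r - roi_row), r))
    order = []
    for r in gob_order:
        cols = sorted(range(mb_cols), key=lambda c: (abs(c - roi_col), c))
        order.extend((r, c) for c in cols)
    return order

# QCIF frame (11 x 9 MBs) with the ROI centred on MB column 5, row 3.
order = roi_first_order(11, 9, 5, 3)
print(order[:4])
```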

2.6.3. MobileASL
From this point most of the dominant work in the field of encoding SL video was done by
the MobileASL group, headed by Prof. R. Ladner of the University of Washington.

Cavender et al {{7 Cavender, A. 2006}} investigated the interplay between different ROI
encodings, frame rates and bitrates. Some of their work and results have already been
covered in section 2.2. Through preliminary studies they found that at a frame rate of 5 fps
finger spelling was difficult to interpret, while the difference between 15 and 30 fps was
negligible. They therefore chose 10 and 15 fps as their experimental values, evaluating
whether fewer, better quality frames are preferable to more, worse quality frames. They
decided to test bitrates of 15, 20 and 25 kbps. They used an ROI encoding consisting of a
simple square around the face of the signer, and varied the QP in this area between three
values, which we will refer to here as high, medium and low, where high refers to a larger
quality difference between the ROI and the rest of the frame. They customized the x264
codec to encode their video and tested its intelligibility on a smart phone. They evaluated
their results using a subjective MOS questionnaire. Their results showed that, unsurprisingly,
users preferred higher bitrates across the board, but also that they preferred lower frame
rates, choosing 10 fps over 15 fps, which allows fewer but more detailed frames, and a
medium ROI, which gives sufficient detail to the face while not obscuring the hands {{7
Cavender, A. 2006}}.
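The trade-off Cavender et al tested can be seen with simple arithmetic: at a fixed bitrate, a lower frame rate leaves more bits for each frame. The sketch assumes 1 kbps = 1000 bit/s and an even split of bits across frames, which real encoders do not do (I-frames cost far more than P-frames), so the figures are averages only.

```python
def bits_per_frame(bitrate_kbps, fps):
    """Average bits available per frame, assuming 1 kbps = 1000 bit/s and an even split."""
    return bitrate_kbps * 1000 / fps

# The bitrates and frame rates tested by Cavender et al.
for kbps in (15, 20, 25):
    for fps in (10, 15):
        print(f"{kbps} kbps @ {fps} fps -> {bits_per_frame(kbps, fps):.0f} bits/frame")
```

At 15 kbps, for example, 10 fps allows an average of 1500 bits per frame against 1000 bits per frame at 15 fps.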

The work of Ciaramello et al {{23 Ciaramello, F. 2007}} aimed to develop an objective
metric for measuring the intelligibility of encoded SL video. Their metric is based on the
mean square error (MSE) in the hands and face. They found that their results were highly
correlated with the results obtained in the work of Cavender et al {{7 Cavender, A. 2006}},
and concluded that their objective measure of intelligibility was an effective one.
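A region-restricted MSE of the kind this metric builds on can be sketched as follows. The equal face and hand weights, and the use of precomputed boolean masks, are hypothetical choices for illustration; the sketch does not reproduce the actual weighting or region detection used by Ciaramello et al.

```python
def region_mse(ref, test, mask):
    """MSE over the pixels where mask is True; ref, test and mask are equal-sized 2-D lists."""
    total, n = 0.0, 0
    for r_row, t_row, m_row in zip(ref, test, mask):
        for a, b, m in zip(r_row, t_row, m_row):
            if m:
                total += (a - b) ** 2
                n += 1
    return total / n if n else 0.0

def intelligibility_distortion(ref, test, face_mask, hand_mask, w_face=0.5, w_hands=0.5):
    """Hypothetical weighted combination of face and hand MSE (lower is better)."""
    return (w_face * region_mse(ref, test, face_mask)
            + w_hands * region_mse(ref, test, hand_mask))

# Tiny 2 x 2 example: one face pixel and one hand pixel are distorted.
ref = [[10, 10], [10, 10]]
test = [[12, 10], [10, 6]]
face = [[True, False], [False, False]]
hands = [[False, False], [False, True]]
print(intelligibility_distortion(ref, test, face, hands))
```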


The work presented in the three papers {{33 Ciaramello, F.M. 2007}} {{35 Vanam, R. 2009}}
{{24 Vanam, R. 2009}} deals with varying the encoding parameters of an adapted x264
codec to optimize it for encoding SL video in the most intelligible way. They predominantly
gained results by optimizing the values of QF and other parameters for given MBs. The
problem was that the process was processor intensive, and they wished to run it in real
time on a mobile device. They solved this problem by realising that the results they were
getting fell in a predictable manner. They thus proposed a solution in which the results are
generated on a powerful online server and then stored as a lookup table on the local
device, cutting down on computation time while still gaining the coding optimization.
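The offline/online split can be sketched as a table built by an expensive search and consulted cheaply at run time. Everything specific here is hypothetical: the table keys (bitrate and a motion class), the parameters searched (an ROI QP offset and a motion search range) and the cost function, which merely stands in for the real encode-and-measure step.

```python
from itertools import product

BITRATES_KBPS = (15, 20, 25)  # hypothetical table keys
MOTION_LEVELS = (0, 1, 2)     # hypothetical classes: low, medium, high motion

def hypothetical_cost(params, bitrate_kbps, motion_level):
    """Made-up cost model standing in for a real encode-and-measure step."""
    roi_qp_offset, me_range = params
    return (roi_qp_offset - bitrate_kbps / 5) ** 2 + motion_level * 8 / me_range + me_range / 16

def offline_optimise(bitrate_kbps, motion_level):
    """Slow exhaustive search, run once on a server."""
    candidates = product(range(0, 11), (4, 8, 16))
    return min(candidates, key=lambda p: hypothetical_cost(p, bitrate_kbps, motion_level))

# Built offline, then shipped to the phone as a small table.
LOOKUP = {(b, m): offline_optimise(b, m) for b in BITRATES_KBPS for m in MOTION_LEVELS}

def choose_params(bitrate_kbps, motion_level):
    """Cheap run-time step: snap to the nearest tabulated bitrate and look the result up."""
    nearest = min(BITRATES_KBPS, key=lambda b: abs(b - bitrate_kbps))
    return LOOKUP[(nearest, motion_level)]

print(choose_params(22, 1))
```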

The MobileASL project has made some significant findings and developments in the
field. It is a big, well-funded, long-term project, with the goal of real time video
communication at very low bitrates. They may achieve their goals, but I believe that by the
time they do, the bitrate constraints will be irrelevant, as the mobile infrastructure will
have developed well beyond what it was when the project was started. This does not,
however, decrease the value of their findings.

Skin detection?

2.6.4. Local work

A paper by Zhenya et al {{15 Ma, ZY. 2008}} dealt with the optimization of the x264 codec
for asynchronous communication for the deaf. This work was tested in the community for
which this project is intended. The research question was a valid one that is very applicable
to our project. I am not discrediting the results, but I do have issues with how they were
presented. A number of metrics were used. These included industry standards such as the
MSE or PSNR, the SSIM index and the VQM, although, as the work outlined in the previous
section shows, there are better metrics for specifically measuring the intelligibility of SL
video. A number of custom metrics were used as well: compression rate (CR), compression
time (CT), transmission time (TT) and delay time (DT). My biggest concern is that these
clearly discrete values were displayed as a line graph, and the analysis of the results referred
to the slope of the line. Thus, although some value may be gained from the results, I still
question their validity.
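Part of the problem with PSNR as an intelligibility measure is visible in its definition: for 8-bit video it is a monotonic function of the MSE over the whole frame, PSNR = 10 log10(255^2 / MSE), so it says nothing about where the error occurs (face, hands or background). A minimal sketch:

```python
import math

def psnr(mse, peak=255):
    """Peak signal-to-noise ratio in dB for a given MSE; peak is 255 for 8-bit samples."""
    if mse == 0:
        return float("inf")  # identical frames
    return 10 * math.log10(peak ** 2 / mse)

# The same MSE gives the same PSNR whether the error sits on the face or in the background.
print(psnr(255 ** 2 / 100))
```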


2.7.   Summary




3.     Design
Introduce current bitrate

3.1.   Research question
Our research question as stated earlier is:

The objective of this project is to implement a region of interest (ROI) encoder, utilising
face detection to aid segmentation. This encoder will be tested experimentally, to
investigate differing levels of ROI encoding and their effects on intelligibility and bit rate
for this encoder.

3.2.   Introduction
The main objective is to maximize intelligibility of sign language video in a bitrate
constrained environment.

To investigate ROI encoding, we developed an encoding system…

3.3.   Encoding system
3.3.1. Platform
Windows and Linux.

Future work: port to Android

3.3.2. Base Codec
x264

Open source …

3.3.3. ROI
Segment into grid (based on MB)

Weight this grid

Different rate assigned to this grid, depending on weight.
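One way the weighted grid could be turned into per-MB rate differences is sketched below: positive weights mark ROI MBs, and each ROI MB receives a QP reduction proportional to its weight, up to a maximum spread. The linear mapping and the argument name `roi_spread` are assumptions for illustration, not the encoder's actual rule; a lower QP means more bits for that MB.

```python
def roi_qp_offsets(weights, roi_spread):
    """Map a grid of per-MB weights to per-MB QP offsets.

    weights: 2-D list; a weight <= 0 means non-ROI, larger positive weights mean more important.
    roi_spread: maximum QP reduction (in QP) given to the most important MB.
    """
    max_w = max((w for row in weights for w in row), default=0)
    if max_w <= 0:
        return [[0 for _ in row] for row in weights]
    # Linear mapping: the heaviest MB gets -roi_spread, non-ROI MBs keep the base QP.
    return [[-round(roi_spread * w / max_w) if w > 0 else 0 for w in row]
            for row in weights]

weights = [[0, 1, 0],
           [1, 2, 1],
           [0, 1, 0]]
print(roi_qp_offsets(weights, 20))
```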

3.3.4. Face detection
Detect face location

Centre ROI grid, using face location
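Centring the ROI grid on the detected face can be sketched as placing the mask so that its centre MB sits on the MB containing the centre of the face bounding box. The face detector itself is not shown (the bounding box is assumed to be given in pixels), mask cells falling outside the frame are simply dropped, and the scaling of the mask to the face size is omitted.

```python
def centre_mask_on_face(mask, face_box, mb_cols, mb_rows, mb_size=16):
    """Place an odd-sized MB mask so that its centre sits on the detected face.

    face_box: (x, y, w, h) in pixels. Returns a full-frame grid of mask values,
    with 0 for MBs the mask does not cover; mask cells outside the frame are dropped.
    """
    x, y, w, h = face_box
    centre_col = int((x + w / 2) // mb_size)
    centre_row = int((y + h / 2) // mb_size)
    half_r, half_c = len(mask) // 2, len(mask[0]) // 2
    grid = [[0] * mb_cols for _ in range(mb_rows)]
    for i, row in enumerate(mask):
        for j, v in enumerate(row):
            r, c = centre_row - half_r + i, centre_col - half_c + j
            if 0 <= r < mb_rows and 0 <= c < mb_cols:
                grid[r][c] = v
    return grid

mask = [[1, 1, 1],
        [1, 2, 1],
        [1, 1, 1]]
# A 20 x 30 MB grid (600 MBs) with a face box centred on pixel (160, 112).
grid = centre_mask_on_face(mask, (144, 96, 32, 32), 20, 30)
print(grid[7][10])
```

Here the face centre at pixel (160, 112) places the mask's centre on MB column 10, row 7.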

3.3.5. Rate control
If the QP spread is below this point, excess rate will be given to the non-ROI area; this will
decrease the base QP, which will in turn drop both the ROI and non-ROI QP. If, however, the
QP spread is above this point, the ROI rate control will decrease the QP spread by multiplying
it by a scaling factor. This scaling factor is adjusted on a frame-to-frame basis by scaling it in
turn, by an amount based on the current mean rate and the target rate. It takes a number of
frames for a change in the scale factor to have a significant effect on the mean rate, so the
feedback loop which controls the scale factor also takes a number of frames to have a
significant effect. As a result, if the QP spread is too large, it takes a number of frames for the
scale factor to be lowered significantly.
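The feedback loop can be sketched as a small controller. It follows the description above (ROI_SPREAD is roispread times a per-frame scale factor, and the scale factor is itself rescaled from the current mean rate and the target rate), but the exact correction `(target / mean) ** gain`, the value of `gain`, the use of bits per frame as the rate unit and the clamping of the scale to at most 1.0 are assumptions. Because the correction is driven by the cumulative mean rate, one oversized frame keeps pulling the spread down for several frames, which is the lag described above.

```python
class RoiSpreadController:
    """Sketch of the ROI spread feedback loop (correction rule and clamping are assumed)."""

    def __init__(self, roispread, target_rate, gain=0.5, min_scale=0.0):
        self.roispread = roispread      # desired maximum ROI/non-ROI QP difference
        self.target_rate = target_rate  # target bits per frame (assumed unit)
        self.gain = gain
        self.min_scale = min_scale
        self.scale = 1.0
        self.total_bits = 0.0
        self.frames = 0

    def update(self, frame_bits):
        """Record one encoded frame's size and return ROI_SPREAD for the next frame."""
        self.total_bits += frame_bits
        self.frames += 1
        mean_rate = self.total_bits / self.frames
        if mean_rate > 0:
            # Over target -> correction < 1 shrinks the spread; under target -> it grows back.
            correction = (self.target_rate / mean_rate) ** self.gain
            self.scale = min(1.0, max(self.min_scale, self.scale * correction))
        return self.roispread * self.scale

ctrl = RoiSpreadController(roispread=20, target_rate=1000)
s1 = ctrl.update(2000)  # a frame twice the target size
s2 = ctrl.update(1000)  # an on-target frame: the mean is still high, so the spread keeps falling
print(round(s1, 3), round(s2, 3))
```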




Future work: customise and tune FD

3.4.   Analysis
3.4.1. Measures and metrics

3.4.1.1.   Video
Fps

Rate

Intelligibility

MOS

I

3.4.1.2.   Video quality
SSIM

PSNR

3.4.1.3.   Encoder
Current rate

Total rate



The desired global maximum QP difference between ROI and non-ROI MBs is specified by
the roispread parameter, and is measured in QP. ROI_SPREAD is a variable calculated per
frame: it is the product of roispread and the ROI scaling factor, and it expresses the maximum
possible QP difference between ROI and non-ROI MBs for that frame. MEAN_ROI_SPREAD is
the mean of each frame’s ROI_SPREAD, and is measured in QP. QP_DIFF is the difference
between the mean QP of all non-ROI MBs and the mean QP of all ROI MBs.
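The per-frame and summary quantities defined above can be written down directly. This sketch follows the definitions in the preceding paragraph, including QP_DIFF as the non-ROI mean minus the ROI mean; the function names are hypothetical.

```python
from statistics import mean

def roi_spread(roispread, scale):
    """ROI_SPREAD for one frame: roispread (in QP) times that frame's ROI scaling factor."""
    return roispread * scale

def mean_roi_spread(roispread, scales):
    """MEAN_ROI_SPREAD: mean of the per-frame ROI_SPREAD values, in QP."""
    return mean(roi_spread(roispread, s) for s in scales)

def qp_diff(mb_qps, roi_flags):
    """QP_DIFF: mean QP of the non-ROI MBs minus mean QP of the ROI MBs."""
    roi = [q for q, is_roi in zip(mb_qps, roi_flags) if is_roi]
    non_roi = [q for q, is_roi in zip(mb_qps, roi_flags) if not is_roi]
    return mean(non_roi) - mean(roi)

print(mean_roi_spread(20, [1.0, 0.5, 0.75]))
print(qp_diff([20, 22, 30, 34], [True, True, False, False]))
```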


3.4.2. Variables being measured
           Intelligibility

           ROI

           Face detection

           Rate



3.5.   Objective measurements
3.6.   Subjective measurements
3.6.1. User experiments
Rate videos using MOS

3.6.1.1.   Test video segments
For the intelligibility experiment test video segments will be required. These video
segments will be encoded with various settings, to create the experimental videos.

To reduce experimental error these test video segments will need to be of similar length
and quality. Thus they will need to contain similar content.

For these videos, a two way conversation will be posed, with one person asking
demographic questions and prompting the other person to answer. The answers will be in
the form of a short story or description. These responses should make sense in a
standalone context. The questions will be chosen to encourage answers that will
incorporate the key features of SL, such as board signs and finger spelling where applicable. The
“conversational” structure is aimed at regulating signing speed. The ‘answering’ member of
the conversation will be filmed. A number of short extracts will be taken from their replies
or stories. These will then be rated by fluent signers and an appropriate selection taken;
these will be the base test videos.

The participants used to produce these test videos will be DCCT staff, or paid members of
the DDC community, sourced with the help of DCCT staff. A sign language interpreter will be
needed when the video is recorded and segments rated for similar intelligibility. The
participant shown in the video will be asked to sign a waiver.

3.6.1.2. Intelligibility experiments
For this project the results of the encoding system will be tested subjectively with a user
intelligibility experiment. This intelligibility test will be performed by approximately six
selected members of the community served by the DCCT, with the help of the DCCT
staff. The experiments will be performed at the Bastion of the.

Participants will be seated in a room with a sign language interpreter and an experiment
facilitator who will assist them and answer any questions they may have. They will be given
a brief description of the experiment and what is required of them, both as a written
document and as an interpreted version read by the facilitator from a script. It will be
emphasised that the experiment is voluntary and that participants may stop at any point.
They will then be asked to fill out a consent form.

They will then be asked to fill out, with the assistance of the interpreter if required, a brief
demographic questionnaire. Answering this demographic questionnaire will be voluntary; if
the participant is uncomfortable answering any question they will be permitted to omit it.

They will then be shown two demonstration videos. These will serve to familiarise the
participants with the format of the experiment, and will at the same time give them
reference points, as they will be examples of the best and worst quality videos that will be
shown to them during the rest of the experiment.

The participant will then be asked to view each test video; after each video the participant
will be asked to rate the intelligibility of that video using a 5-point MOS questionnaire.

The videos will be shown on an HTC Legend mobile phone. Participants will be asked to either
hold the phone or place it on a stand on a table, depending on their preference. The
facilitator will record on the questionnaire whether they held the phone or placed it on
the table.

3.7.   Summary


4.     Implementation

4.1.   Introduction
4.2.   Encoding system
4.2.1. Platform
How to on windows

How to on Linux

4.2.2. Base Codec
Latest version from git repository,



Compiled, using settings: xxxx

4.2.3. ROI
Grid: based on MB (16 x 16) grid

Read weightings from input file

(parameter specification and use)

Weight using QP (0 – 51)

         Types of ROI 1 and 2

         Force skip MBs

4.2.4. Weighted Mask
For the user study I encoded the test videos using the mask user01, described earlier and
shown in Appendix XXX. It has a total of 289 mask elements: 108 of these are set to -1 and
67 are set to 0, leaving 114 elements greater than zero. These elements will be considered
as ROI MBs for the encoding methods that

4.2.5. Face detection
Use external face detection

Fdlib (bad)

Attempts to use OpenCV

4.2.6. Statistics




30kb/s

2 second lead in




4.2.7. Frame saving








4.2.8. Rate control

4.3.   Analysis
Encoding numbers and params used, and groupings

Video numbers, names, answers, sets

4.4.   User experiments
How what when where why

What encodings

4.5.   Conclusion




5.     Results
5.1.      Encoder
5.1.1. Final rate
A two factor ANOVA analysis of the final bit rate, as affected by the encoding combination
and video gives the following results:
Table 1 Summary of the final bit rate by encoding combination and by video

 SUMMARY      Count    Sum     Average    Variance
 E000            21    556.4   26.49524   2.591506
 E001            21    556.4   26.49524   2.591506
 E002            21   599.46   28.54571   2.010246
 E003            21   590.23   28.10619   2.179755
 E004            21    601.6   28.64762   1.474209
 E005            21   595.06   28.33619   2.186475
 E006            21   610.74   29.08286   2.723361
 E007            21   600.65   28.60238   2.059649
 E008            21    611.8   29.13333   2.128893
 E009            21   603.79    28.7519   2.121486
 E010            21   559.65      26.65    2.83125
 E011            21   569.87   27.13667   2.806703
 E012            21   583.65   27.79286   2.978791

 A[01]          13    381.17   29.32077   0.835258
 A[02]          13    380.29   29.25308   0.621406
 A[04]          13    346.75   26.67308    1.52474
 A[05]          13    359.84      27.68   0.622733
 A[06]          13    319.51   24.57769   1.529753
 A[07]          13    369.35   28.41154   0.262481
 A[08]          13    387.55   29.81154   0.367247
 A[09]          13    339.92   26.14769   1.184419
 A[10]          13    366.15   28.16538   1.027044
 A[11]          13    383.31   29.48538    0.36891
 A[12]          13    356.44   27.41846   1.452531
 A[13]          13    363.45   27.95769   2.089753
 A[14]          13    375.63   28.89462   5.392527
 A[15]          13    376.59   28.96846   0.577881
 A[18]          13    353.88   27.22154   2.134764
 A[19]          13    345.43   26.57154   0.471297
 A[20]          13    388.96      29.92   1.971317
 A[21]          13    359.73   27.67154   0.947097
 A[22]          13    359.37   27.64385   1.015692
 A[23]          13    339.66   26.12769   1.635453
 A[24]          13    386.32   29.71692   0.467756

Table 2 Two factor ANOVA of the final bit rate, by encoding combination and video



 Source of Variation          SS                df               MS          F           P-value       F crit
 Encoding                  238.6079                   12       19.88399   60.10815        2.72E-65   1.792674
 Video                     534.2838                   20       26.71419   80.75545        1.57E-94   1.614488
 Error                     79.39285                  240       0.330804

 Total                     852.2845                  272




Rate

5.1.2. Skip

Table 3 Percentage of P and B frames skipped.

                            P_SKIP                                   B_SKIP
            MEAN         STDEV    MIN           MAX        MEAN STDEV MIN       MAX
  E000       84.40        1.064 82.50           86.40       98.73 0.375 97.90 99.30
  E001       84.40        1.064 82.50           86.40       98.73 0.375 97.90 99.30
  E002       89.52        1.022 88.40           92.80       97.52 0.404 96.50 98.20
  E003       90.87        0.521 89.90           91.80      100.00 0.000 100.00 100.00
  E004       89.48        1.134 88.20           93.10       97.36 0.463 96.20 98.10
  E005       89.56        0.708 88.10           91.20      100.00 0.000 100.00 100.00
  E006       90.29        0.961 88.60           93.00       97.37 0.368 96.60 98.10
  E007       91.10        0.474 90.40           92.10      100.00 0.000 100.00 100.00
  E008       89.94        1.095 88.50           93.30       97.23 0.409 96.20 97.90
  E009       89.73        0.696 88.10           91.20      100.00 0.000 100.00 100.00
  E010       84.93        1.141 83.30           87.70       98.61 0.389 97.80 99.30
  E011       85.62        1.017 84.20           88.40       98.43 0.351 97.70 99.00
  E012       87.11        0.953 85.70           90.30       98.11 0.305 97.40 98.70

5.2.      User experimentation
Table 4 Raw OS of participants for the videos in UserVids

 A[01]         5     5      3      1     5       1         5    4    3    3   4      5     4
 A[02]         2     4      5      4     5       2         5    2    3    3   3      2     1     5
 A[08]         2     3      5      2     4       5         4    2    3    3
 A[10]         1     4      5      5     3       3         2    2    2    5   2      4
 A[11]         3     3      3      3     1       4         2    5    2    3   4      5
 A[15]         4     3      4      5     2       2         3    4    2    4
 A[20]         1     5      4      5     2       1         5    4    4    5   5      3     5
 A[24]         1     1      3      4     2       5         4    3    3    4   4      4




 Table 5 Descriptive statistics of the OS of the videos UserVids


          Mean          Var       Stdev        Median        Mode        Max       Min       Count   MOS

A[01]      3.692       1.905       1.437               4             5         5         1      13   3.692
A[02]      3.286       1.776       1.383               3             2         5         1      14   3.286
A[08]      3.300       1.210       1.160               3             2         5         2      10   3.300
A[10]      3.167       1.806       1.403               3             2         5         1      12   3.167
A[11]      3.167       1.306       1.193               3             3         5         1      12   3.167
A[15]      3.300       1.010       1.059             3.5             4         5         2      10   3.300
A[20]      3.769       2.178       1.536               4             5         5         1      13   3.769
A[24]      3.167       1.472       1.267             3.5             4         5         1      12   3.167




 Table 6 One factor ANOVA of the participant OS over the videos in UserVids

  Anova: Single Factor

  SUMMARY
      Groups                Count  Sum Average Variance
  A[01]                         13   48 3.692308 2.064103
  A[02]                         14   46 3.285714 1.912088
  A[08]                         10   33      3.3 1.344444
  A[10]                         12   38 3.166667 1.969697
  A[11]                         12   38 3.166667 1.424242
  A[15]                         10   33      3.3 1.122222
  A[20]                         13   49 3.769231 2.358974
  A[24]                         12   38 3.166667 1.606061



  ANOVA
     Source of
     Variation               SS           df     MS                   F     P-value   F crit
  Between Groups          5.105517          7 0.72936              0.41373 0.891443 2.115472
  Within Groups           155.1341         88 1.762887

  Total                   160.2396        95
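As a check on Table 6, the F statistic can be recomputed from the raw scores in Table 4 with a one-factor ANOVA written using only the Python standard library. The p-value is not computed here, since that needs the F distribution (`scipy.stats.f_oneway` would return both F and p).

```python
# Raw opinion scores per video, transcribed from Table 4.
SCORES = {
    "A[01]": [5, 5, 3, 1, 5, 1, 5, 4, 3, 3, 4, 5, 4],
    "A[02]": [2, 4, 5, 4, 5, 2, 5, 2, 3, 3, 3, 2, 1, 5],
    "A[08]": [2, 3, 5, 2, 4, 5, 4, 2, 3, 3],
    "A[10]": [1, 4, 5, 5, 3, 3, 2, 2, 2, 5, 2, 4],
    "A[11]": [3, 3, 3, 3, 1, 4, 2, 5, 2, 3, 4, 5],
    "A[15]": [4, 3, 4, 5, 2, 2, 3, 4, 2, 4],
    "A[20]": [1, 5, 4, 5, 2, 1, 5, 4, 4, 5, 5, 3, 5],
    "A[24]": [1, 1, 3, 4, 2, 5, 4, 3, 3, 4, 4, 4],
}

def one_way_anova(groups):
    """Return (F, df_between, df_within) for a one-factor ANOVA."""
    values = [v for g in groups.values() for v in g]
    n_total, k = len(values), len(groups)
    grand_mean = sum(values) / n_total
    means = {name: sum(g) / len(g) for name, g in groups.items()}
    # Between-group sum of squares: spread of the group means around the grand mean.
    ss_between = sum(len(g) * (means[name] - grand_mean) ** 2 for name, g in groups.items())
    # Within-group sum of squares: spread of each score around its own group mean.
    ss_within = sum((v - means[name]) ** 2 for name, g in groups.items() for v in g)
    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

f_stat, df_b, df_w = one_way_anova(SCORES)
print(f"F({df_b}, {df_w}) = {f_stat:.5f}")
```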




Table 7 OS of participants over the encodings E000 to E012

                U1      U2      U3       U4        U5          U6       U7    U8
 E000            1       3       5        1         3           1        1     2
 E001            1       3       5        1         3           1        1     2
 E002            5       4       5        4         3           3        4     5
 E003            3       3       5        4         2           1        4     4
 E004            3       4       5        5         2           5        2     5
 E005            5       3       5        5         1           2        2     2
 E006            5       4       5        5         1           3        4     4
 E007            5       4       5        4         1           2        2     2
 E008            3       3       5        4         3           2        3     3
 E009            4       3       5        4         3           2        2     3
 E010            3       3       4        1         2           2        2     4
 E011            4       4       5        5         3           5        3     2
 E012            5       4       4        4         2           5        4     3



Table 8 Summary of OS of participants for the encodings E000 to E012

              Mean        Var     Stdev    Median    Mode    Max    Min    Count     MOS
 E000        2.286      1.918     1.496             2               1         5          1           7   2.286
 E001        2.286      1.918     1.496             2               1         5          1           7   2.286
 E002        4.000      0.571     0.816             4               4         5          3           7   4.000
 E003        3.286      1.633     1.380             4               4         5          1           7   3.286
 E004        4.000      1.714     1.414             5               5         5          2           7   4.000
 E005        2.857      2.122     1.574             2               2         5          1           7   2.857
 E006        3.714      1.633     1.380             4               4         5          1           7   3.714
 E007        2.857      1.837     1.464             2               2         5          1           7   2.857
 E008        3.286      0.776     0.951             3               3         5          2           7   3.286
 E009        3.143      0.980     1.069             3               3         5          2           7   3.143
 E010        2.571      1.102     1.134             2               2         4          1           7   2.571
 E011        3.857      1.265     1.215             4               5         5          2           7   3.857
 E012        3.714      0.776     0.951             4               4         5          2           7   3.714


5.2.1. User vids




6.             Discussion
How this section will be laid out

Two sections, empirical user

6.1.           Encoder
It is important to remember that at the core of the ROI encoder we manipulate per-MB QP
as a means to control rate and hence quality and finally intelligibility.

This section will first discuss the encoding parameters used and the results observed in the
test encoded videos. Thereafter we will discuss other results applicable to the ROI encoder.

We will show that the modified ROI encoder successfully segments a ROI and encodes a
video, allocating more rate to this ROI.

[Figure 1: ROI_Rate_% and ROI_Size_% (Percent, left axis) and Mean_ROI_QP, Mean_ROD_QP and QP_DIFF (Mean QP, right axis) per Encoding, in the order E000, E001, E010, E011, E012, E004, E002, E005, E003, E008, E006, E009, E007]

Figure 1 Mean results of encoding all videos in the set CLIPPED, using E000 to E012. The results are ordered by increasing
QP_DIFF

6.1.1. Segmentation
The mean percentage of each frame classified as ROI is expressed as the variable ROI_Size_%,
and is shown as the maroon columns in Figure 1.




For segmentation that uses an ROI mask, the applicable parameter is roimask; all test videos
were encoded with the mask file user01 as described in Appendix XXX.

The encoding combinations using static ROI and the mask user01 have a mean ROI_Size_% of
(19%). As would be expected, this is a direct reflection of the 114 positive entries in the
mask user01, taken over the 600 MBs of each frame. This shows that the mask files are
being read and interpreted and that MBs are being correctly segmented by the encoder.
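The 19% figure follows directly from the mask: count the positive entries and divide by the number of MBs in a frame. The sketch below reconstructs a mask with the same counts as user01 (108 entries of -1, 67 of 0 and 114 positive); the positive weights are all set to 1 as a placeholder, since only their sign decides whether an MB is classified as ROI.

```python
def roi_size_percent(mask_values, frame_mbs):
    """Share of a frame's MBs classified as ROI (mask value > 0), as a percentage."""
    roi_mbs = sum(1 for v in mask_values if v > 0)
    return 100.0 * roi_mbs / frame_mbs

# Same counts as mask user01; the positive weights are placeholders.
user01_like = [-1] * 108 + [0] * 67 + [1] * 114
print(roi_size_percent(user01_like, frame_mbs=600))
```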


Face detection implements shifting and scaling of the mask. The effect of this scaling can be
seen in the increase of ROI_Size_% from (19%) to (22%) when face detection is used in
combination with mask user01.

Skin segmentations

thickening

6.1.2. ROI encoding method
The method by which the ROI is encoded is controlled by the two parameters roimethoud
and roiskip.

QP_Diff represents the difference between the mean QP of ROI MBs and the mean QP of non-ROI MBs (Mean_ROI_QP - Mean_ROD_QP). We use it as a measure of how different the ROI is from the rest of the frame.

Note that all the encodings in this section were performed with a roispread of 20QP; see the next section for a discussion of roispread.




[Figure 2 chart: "QP across videos"; x axis Video, y axis QP (-5 to 55); series Mean_ROI_QP, Mean_ROD_QP and QP_Diff for E000, E002 and E006]


Figure 2 Mean_ROI_QP, Mean_ROD_QP and QP_Diff for each test video, for the encodings E000, E002 and E006.

First I will focus on E000, E002 and E006, which I take as representative of the base encoding with no ROI, roimethoud 1 and roimethoud 2, respectively.

As can be seen in Figure 2, QP_DIFF for E000 ranges between -0.03QP and 0.86QP, showing that when no ROI encoding is applied there is a slightly higher mean QP in the static ROI defined by the mask user01. This is most likely because complexity is slightly higher in the ROI, as it covers an area of high movement, namely the signer. Through adaptive quantisation these MBs will thus be assigned a slightly higher QP. This increase in QP is minor, though.


We would expect QP_Diff to be lower for roimethoud 1 than for roimethoud 2, and this can clearly be seen in Figure 2. The difference in QP_Diff between E002 and E006 is significant. Both E002 and E006 have a significantly higher QP_Diff than the base E000. This shows that the QP in the ROI is effectively changed by the ROI encoder, and that the difference is greater for roimethoud 2. We can thus conclude that the ROI encoder effectively separates the ROI from the rest of the frame by changing per-MB QP.

The other encoding parameter is roiskip. To investigate it, we examine the differences between the results for E002 and E003, and for E006 and E007. We note a significant increase in QP_Diff both between E002 and E003 and between E006 and E007. This result, although encouraging, may be misleading: if the video is encoded with roiskip, all the MBs in non-ROI areas are marked as skip MBs. Their QP still counts towards the mean QP, but they are skipped irrespective of their QP, so their QP does not directly influence their contribution to the final rate. This partially applies to the non-roiskip encodings as well. At low target bitrates such as ours, most of the MBs in P and B frames are skipped; for E000, 94% of MBs are skipped. However, this does not mean the QP results are without value. Firstly, they show that the ROI MBs are being successfully separated from the rest of the frame. Secondly, the QP of an MB is used at many points during the encoding process, especially in analysis and rate control, so changing the QP will influence other decisions in the encoding process. When considered across a whole frame, as is the case with Mean_ROI_QP and Mean_ROD_QP, the mean QP may be analysed, albeit that at the level of an individual MB it may not be an effective predictor. This is in fact demonstrated by the difference in QP values, which shows that MB skipping is having an effect, and that this in turn is affecting rate control, increasing the QP spread between the ROI and non-ROI areas.

It is interesting to note that the expected Mean_ROD_QP for roimethoud 2 would be exactly 51, or whatever the maximum encoding QP was set to. We observe that this is not the case; for example, Mean_ROD_QP for E006 is 50.9 (M=50.92, SD=0.043), not 51. We investigated the cause and found that on occasion sections of the non-ROI area were given a QP of 50, as they followed a ROI MB with a QP of 50. The encoder has an optimisation that, under certain conditions, sets an MB's QP to the previous MB's QP if the original QP is within one of that value. This saves having to explicitly code a QP delta for that MB. We believe this has only a minor effect, lowering Mean_ROD_QP by an insignificantly small amount, and thus left this optimisation as it was.
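The effect of this optimisation on Mean_ROD_QP can be illustrated with a toy model. The sketch assumes the rule compares each MB with the previous MB's originally assigned QP in raster-scan order, and it ignores the additional conditions the encoder applies; apply_qp_reuse and mean_qp are hypothetical helpers.

```python
def apply_qp_reuse(qps):
    """Toy QP-reuse rule: if an MB's assigned QP is within one of the previous
    MB's assigned QP, reuse the previous QP so no QP delta needs coding.

    Assumption: comparisons use the originally assigned QPs, in scan order.
    """
    out = list(qps)
    for i in range(1, len(qps)):
        if abs(qps[i] - qps[i - 1]) <= 1:
            out[i] = qps[i - 1]
    return out


def mean_qp(qps, flags, want_roi):
    """Mean QP over the MBs whose ROI flag equals `want_roi`."""
    selected = [q for q, f in zip(qps, flags) if f == want_roi]
    return sum(selected) / len(selected)


# One ROI MB at QP 50 followed by nine non-ROI MBs at the maximum QP of 51.
qps = [50] + [51] * 9
roi = [True] + [False] * 9
before = mean_qp(qps, roi, False)                 # 51.0
after = mean_qp(apply_qp_reuse(qps), roi, False)  # about 50.89
```

In this toy row only the non-ROI MB directly after the QP-50 ROI MB is pulled down, so the non-ROI mean falls slightly below 51, the same kind of small drop observed for E006.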

The QP results are clearly visible in Figure 2: the blue data are the non-ROI encoded results, roimethoud 1 is shown in orange and roimethoud 2 in green. Notice that Mean_ROI_QP for E002 and E006 is well below that of E000, and that Mean_ROD_QP is well above it. The distance between these two is greater for roimethoud 2, as shown by the spread between Mean_ROI_QP and Mean_ROD_QP, which is plotted as QP_Diff. It is also clear that Mean_ROD_QP for roimethoud 2 is very close to the 51QP maximum.

As we use MB skipping, we will investigate the effects of the roiskip parameter on the ROI encoder results P_SKIP and B_SKIP, which are the percentages of skipped MBs in the P and B frames respectively. A summary of these values is shown in Table 3. As mentioned previously, at low bitrates many MBs will be skipped; this can be seen in the base encoding E000, which has a very high MB skip rate of 84% for P frames and 99% for B frames. The encoding E003 has a significantly higher P_SKIP and B_SKIP than E002, its corresponding non-roiskip encoding. The same can be said for E007 and E006. We can thus conclude that, as would be expected, MB skipping is significantly increased when using the roiskip parameter. We did not investigate the difference between MB skipping inside and outside the ROI area, as we did not record these values.


One might expect roiskip to force all the rate to be used by the ROI. This would be the case, except that the mask user01 has a buffer of 0-weighted MBs around the positive values, which forces these MBs not to be skipped even though they are not considered part of the ROI.


[Figure 3 chart: "QP effects on Rate"; x axis QP_DIFF (0 to 20), y axis ROI_Rate_% (0 to 100); series Non MB Skip and MB Skip, with a linear trend line for Non MB Skip]


Figure 3

As discussed earlier, the direct interpretation of mean QP may be questioned in a high MB-skip environment. The initial reason for changing QP values was to change the allocation of rate, so we will briefly investigate the effect that varying QP has on rate. We note the very high correlation between QP_DIFF and ROI_Rate_%, which is clearly visible in Figure 3. We also note a significant increase in ROI_Rate_% between E000 and E002 and between E000 and E006, as well as between the two roimethoud encodings E002 and E006. Changing the roiskip parameter also has a significant effect on ROI_Rate_%, as can be seen by examining the results for E002 and E003 and for E006 and E007.
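The strength of the relationship between QP_DIFF and ROI_Rate_% can be quantified with a correlation coefficient. The text does not state which coefficient was used; the sketch below assumes Pearson's r, and the example values are illustrative only, not results from this study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length, at least 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Illustrative values only: QP_DIFF against ROI_Rate_% for five encodings.
qp_diff = [0.4, 5.1, 9.8, 14.9, 19.6]
roi_rate = [23.0, 41.0, 55.0, 70.0, 88.0]
r = pearson_r(qp_diff, roi_rate)
```

scipy.stats.pearsonr computes the same coefficient together with a p-value.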



From these decisive results we can conclude that the ROI encoder, by changing QP, successfully separates the ROI from the rest of the frame. This change in QP has a significant effect on the way the rate is divided between ROI and non-ROI MBs. Forcing non-ROI MBs to be skipped significantly increases the rate allocated to the ROI. These results are well demonstrated in Figure 1: as the QP spread increases, the proportion of rate going to the ROI increases; the slight dips at E002, E008 and E006 are due to differing roiskip values, which are not fully captured by QP values.

6.1.3. ROI Spread

[Figure 4 chart: "roispread"; x axis roispread (0 to 25); left axis ROI_Rate_% (0 to 100), right axis MEAN_ROI_SPREAD (0 to 20); linear trend lines y = 2.9795x + 22.448 and y = 0.8899x + 0.1134]


Figure 4 ROI_Rate_% shown in orange and MEAN_ROI_SPREAD in grey.

The previous section highlighted how changes in QP effectively alter how the rate is distributed. The extent to which the ROI is separated from the rest of the frame is controlled by the roispread parameter, which specifies the maximum QP difference between ROI and non-ROI MBs. The actual per-frame maximum is altered by the rate control process, and MEAN_ROI_SPREAD is the mean of these rate-controlled maxima. Notice that MEAN_ROI_SPREAD can never exceed the roispread for that encoding, and that it will often be less than roispread, as the spread is scaled down by the rate control process. This can be seen in Figure 4, where MEAN_ROI_SPREAD is always slightly lower than roispread.

As would be hoped, we observe a strong correlation between roispread and MEAN_ROI_SPREAD. Thus, as would be expected from the previous section's results, we also observe a high correlation between roispread and ROI_Rate_%. This shows that ROI scaling works effectively in the encoder as a means of indicating how different the ROI should be from the rest of the frame.

6.1.4. Rate control
As mentioned earlier, the ROI rate control is a basic implementation and was not designed to be robust. We observed a failure in the rate control for high values of roispread, but in general rate control gives good results when appropriate values of roispread are chosen.


[Figure 5 chart: "Effects of roispread for A01"; x axis roispread (0 to 45); left axis QP (0 to 60), right axis rate (kbps, 0 to 80); series MEAN_ROI_SPREAD, Mean_ROI_QP, Mean_ROD_QP, Final_Rate and Target Rate]


Figure 5 Effects of roispread on MEAN_ROI_SPREAD, Mean_ROI_QP, Mean_ROD_QP and final rate for the video A01.

To investigate ROI rate control, the observed failure and the appropriate choice of roispread, we encoded the video [A01] with a number of values of roispread. A01 was chosen as it is the longest of the test videos, allowing us to gain the most insight into the rate control process.

As can be seen in Figure 5, initially increasing the roispread parameter has the expected result of increasing MEAN_ROI_SPREAD. This holds up to a roispread value of approximately 21QP; thereafter MEAN_ROI_SPREAD drops, and then begins to increase again at approximately 34QP. As would be expected with functioning rate control, the final rate tends towards and remains close to the target rate for roispread values of up to approximately 24QP. After this point it drops to a minimum at a roispread of 34QP and then increases rapidly. This is the observed failure of the ROI rate control for large values of roispread.



This suggests there is an ideal value for roispread, above which the rate control will be negatively affected and then break down totally. For these specific tests with [A01], there appears to be a maximum achievable MEAN_ROI_SPREAD of approximately 18.7QP, obtained when roispread is approximately 21QP.

The apparent peak in MEAN_ROI_SPREAD is logically consistent with the design of the ROI rate control. Consider that for a large ROI spread the non-ROI QP will be set to its maximum, thus decreasing the rate in the non-ROI area. There must be some ROI_SPREAD value that, if used, would result in all the remaining rate being used by the ROI. If ROI_SPREAD were increased above this point, excess rate would be used in the ROI, which in turn would cause the mean rate to exceed the target rate. This ROI_SPREAD would be the instantaneous maximum attainable ROI_SPREAD. This value may fluctuate as more or less rate is required to encode the ROI and non-ROI areas of any given frame.

If roispread exceeds this maximum attainable ROI_SPREAD, a number of frames will initially have an excessively large QP spread, so the ROI will be encoded with an unreasonably low QP, resulting in a high early bitrate. The rate control feedback system described earlier will then adjust the scale factor down, lowering the instantaneous rate. The apparent problem with this system is that if ROI_SPREAD is very large, well above the maximum attainable ROI_SPREAD, the initial rate will be very high: in the tests we performed on Answer01 with a roispread of 40, the initial instantaneous rate was in excess of 200kbps, as can be seen in Figure 6. These excessively high rates persist for approximately the first 80 frames, an unexpectedly long time. By then the high current rate has drastically increased the total rate, as can clearly be seen in Figure 7. This increase in the total rate causes the rate control to reduce the scale factor, and thus ROI_SPREAD. Presumably ROI_SPREAD then drops to a value close to zero, at which point the current rate falls to between 5kbps and 15kbps for this video. The current rate is then below the total rate, so the total rate slowly decreases until it is below the target rate, at which point the scale factor starts to increase again, pulling up the current rate and allowing the total rate to flatten off close to the target rate. Again, this appears to take some time. As can be seen in Figure 7, roispread 32 has a total rate below the target rate, and in Figure 6 it can be seen that its current rate is increasing. Given enough time, the total rate would approach the target rate and the rate control would stabilise.
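The delayed reaction described above can be reproduced with a toy feedback loop. Everything in this sketch is an assumption made for illustration: the linear rate model (base_kbps, kbps_per_qp), the fixed step size and the reaction delay are hypothetical and are not taken from the encoder. It only shows how a scale factor that reacts late to the total rate produces a long initial overshoot, followed by a total rate that sinks below the target.

```python
def simulate_roi_rate_control(roispread, target_kbps, n_frames,
                              base_kbps=10.0, kbps_per_qp=5.0,
                              delay=10, step=0.05):
    """Toy scale-factor feedback loop with a reaction delay (hypothetical model).

    - A frame's rate is base_kbps + kbps_per_qp * roispread * scale.
    - The scale factor starts at 1.0 and moves by `step` each frame: down if the
      total (running mean) rate seen `delay` frames ago was above target, up
      otherwise. It is clamped to [0, 1].
    Returns per-frame lists (current_rates, total_rates, scales).
    """
    scale = 1.0
    current, total, scales = [], [], []
    running_sum = 0.0
    for i in range(n_frames):
        if i >= delay:
            # React only to the total rate observed `delay` frames earlier.
            seen_total = total[i - delay]
            scale += -step if seen_total > target_kbps else step
            scale = min(max(scale, 0.0), 1.0)
        rate = base_kbps + kbps_per_qp * roispread * scale
        running_sum += rate
        current.append(rate)
        total.append(running_sum / (i + 1))
        scales.append(scale)
    return current, total, scales


cur, tot, sc = simulate_roi_rate_control(40, 30.0, 200)
cur_ok, tot_ok, _ = simulate_roi_rate_control(4, 30.0, 100)
```

With roispread 40 the toy starts at 210kbps, drives the scale factor to zero, and after 200 frames its total rate is about 29.5kbps, below the 30kbps target and still waiting for the current rate to rise; with roispread 4 the rate sits on the target from the first frame.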

As can be seen for this test video, at the last frame (frame 361) the encodings with roispread above 28 are still recovering from their excessively high initial rate. This explains the unpredictable final rate at the end of the video sequence, whereas encodings with roispread below 28 appear to rapidly achieve a stable scale factor, and thus effective rate control.



[Figure 6 chart: "Current bitrate"; x axis frame, y axis bit rate (kbps, 0 to 250); one series per roispread value (0, 4, 8, 10, 12, 14, 16, 18, 20, 22, 24, 28, 32, 36, 40) plus Target]

Figure 6 Current bitrate, averaged over 15 frames centred on each frame, for the video Answer01 encoded with a static ROI, roimethoud 1, a target bitrate of 30kbps and various values of roispread. Each series represents the relevant roispread. The base encoding with no ROI is shown as the thick red line.
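The two rate measures used in Figure 6 and Figure 7 can be computed from per-frame sizes as sketched below. The current bitrate is a mean over a 15-frame window centred on each frame; the total bitrate is the running mean from the first frame. The frame rate `fps` is an assumed input, and truncating the window at the start and end of the sequence is our assumption, as the caption does not say how the edges were handled.

```python
def current_bitrate_kbps(frame_bits, fps, window=15):
    """Bitrate at each frame, averaged over `window` frames centred on it (1 kb = 1000 bits)."""
    half = window // 2
    n = len(frame_bits)
    rates = []
    for i in range(n):
        # Truncate the window at the start and end of the sequence (assumption).
        lo, hi = max(0, i - half), min(n, i + half + 1)
        mean_bits = sum(frame_bits[lo:hi]) / (hi - lo)
        rates.append(mean_bits * fps / 1000.0)
    return rates


def total_bitrate_kbps(frame_bits, fps):
    """Running mean bitrate from the first frame up to each frame."""
    rates, acc = [], 0
    for count, bits in enumerate(frame_bits, start=1):
        acc += bits
        rates.append(acc / count * fps / 1000.0)
    return rates


# A large first (I) frame followed by constant-size frames, at an assumed 10 fps.
bits = [2000] + [1000] * 29
current = current_bitrate_kbps(bits, fps=10)
total = total_bitrate_kbps(bits, fps=10)
```

As with the base encoding, the large first frame lifts the total rate at the start, and the running mean then decays towards the per-frame rate.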

The observed decrease of MEAN_ROI_SPREAD in Figure 5 is the result of the early lowering of the scale factor, in this case at approximately frame 80. Thereafter the scale factor, and thus ROI_SPREAD, remains low, probably around zero, until the total rate drops below the target rate, at which point the scale factor begins to increase again. This prolonged low period, during which the rate control attempts to reduce the total rate, lowers MEAN_ROI_SPREAD and explains the drop observed in Figure 5.

As can be seen in Figure 7, the encodings with roispread 32 have dropped their total rate below the target rate and are waiting for the increase in current rate to pull it up. This explains the apparently strange dip and rebound in final rate around a roispread of 32.

Note that for the base encoding with no ROI, shown as the thick red line in Figure 6 and Figure 7, the initial high rate is due to the size of the first I frame, which is approximately 10kb. The dip in rate from frame 1 to 27 compensates for this initial I frame. Thereafter the current rate fluctuates around the target rate, as the rate control adjusts and compensates for changes in the complexity of the video over time.



Currently, one-pass rate control in the x264 codec is optimised to bring the total bitrate as close as possible to the target bitrate. This is desirable when simply encoding a video, as the final bitrate is the important factor. For real-time communication, however, the current rate is the important factor. In a rate-limited environment the target rate may well be close to the maximum achievable rate, so if the current rate is above this limit, sections of the video stream may be lost because they cannot be transmitted. This will lead to jitter, which, as we discussed earlier, is undesirable and diminishes intelligibility. If this encoder is to be used for real-time encoding, the rate control will have to be optimised with these factors in mind.

The failure in the current rate control is thus due to a reaction delay, during which the total rate changes drastically and then takes time to rectify. One solution would be to start with a low initial scale factor and then increase it to the desired level, which would reduce the chance of a drastic increase in total rate in the early stages. Another solution would be to explicitly set the scale factor, instead of scaling it as is currently done.

[Figure 7 chart: "Total bitrate"; x axis Frame, y axis bit rate (kbps, 0 to 250); one series per roispread value (0, 4, 8, 10, 12, 14, 16, 18, 20, 22, 24, 28, 32, 36, 40) plus Target]


Figure 7 Mean (total) bitrate for the video Answer01 encoded with a static ROI, roimethoud 1, a target bitrate of 30kbps and various values of roispread. Each series represents the relevant roispread. The base encoding with no ROI is shown as the thick red line.



6.1.5. Rate
Many of the ROI optimisations implemented are predominantly focused on increasing the
proportion of rate that is allocated to the ROI.

The proportion of total bits (excluding header and packing data) that is devoted to ROI areas is measured by the metric ROI_Rate_%, expressed as a percentage.
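Assuming the encoder can attribute payload bits to individual MBs (with headers and packing data already excluded) together with each MB's ROI flag, ROI_Rate_% reduces to a ratio of sums. The function below is a hypothetical sketch of that calculation.

```python
def roi_rate_percent(mb_bits, roi_flags):
    """Share of MB payload bits spent on ROI MBs, as a percentage.

    Hypothetical inputs: per-MB payload bits (headers and packing data excluded
    upstream) and a matching list of ROI flags.
    """
    if len(mb_bits) != len(roi_flags):
        raise ValueError("mb_bits and roi_flags must have the same length")
    total = sum(mb_bits)
    if total == 0:
        raise ValueError("no payload bits")
    roi_bits = sum(b for b, is_roi in zip(mb_bits, roi_flags) if is_roi)
    return 100.0 * roi_bits / total


share = roi_rate_percent([100, 300, 0, 600], [True, False, False, True])  # 70.0
```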

As mentioned previously, all encoding tests were performed with a target bitrate of 30kbps. The actual bitrates achieved varied between 23kbps for A[06] with encoding E012 and 32.6kbps for A[14] with encoding E006.

As discussed earlier, rate control is used in an attempt to achieve the target bitrate, but the actual bitrate achieved for each encoding differs slightly from the target. Final rates are reported in Table 9, and an ANOVA is shown in Table 2. Ideally all encoded videos would have exactly the target bitrate, or at least similar bitrates; however, we find a significant difference in final rate between the encodings and between the videos. We conclude that the encoding parameters used affect the final rate, which is not a desirable result. As would be expected, the choice of video also strongly influences the final rate. Thus the choice of both video and encoding affects the final rate, even though all encodings have the same target rate.


[Figure 8 chart: "Final_Rate"; x axis Frames (0 to 400), y axis bitrate (kbps, 0 to 35)]

Figure 8

As can be seen in Figure 9, the non-ROI encodings E000 and E001 have the lowest final rate (M=26.50, SD=1.61), and all the ROI encodings have a mean final rate above this, but still below the target bitrate. Due to the nature of the ROI rate control, we believe this will hold for reasonable initial values of roispread, as discussed previously, and will improve with longer video sequences. The improvement with longer video sequences is supported by a moderate correlation between FRAMES and the mean FINAL_RATE across E000 to E012 for the videos in CLIPPED. This can also be seen in Figure 8 and Figure 10, as the mean rates approach the target rate as the frame count increases.


[Figure 9 chart: "Final rate over Encodings"; x axis Encoding (E000 to E012), y axis kb/s (0 to 35); series Q2, Q3, Target and Mean]


Figure 9 Final rate for each encoding, E000 to E012.

As would be expected, for E000 and E001, which do not implement any ROI encoding, the proportion of rate given to the ROI is very similar to the relative size of the ROI, as shown by the blue columns in Figure 1: a mean of 23% of the rate goes to an area covering 19% of each frame in E000, and a mean of 25% of the rate goes to a ROI covering a mean of 22% of each frame in E001. The slightly higher rate in the ROI is most likely due to higher complexity in the ROI, as it is meant to cover an area of high activity, namely the signer. As can be seen, the mean percentage of rate used by the ROI is much higher for all the encodings that implement ROI encoding.

As would be expected, MB skipping has a significant influence on ROI_Rate_%, as it explicitly forces almost all the rate to the ROI. Its dominance is illustrated by the fact that the four encodings with the highest ROI_Rate_% are E005, E003, E009 and E007, which are the four encodings using roiskip. MB skipping forces non-ROI areas to use minimal rate, which allows that rate to be used in the ROI. We have shown that, in general and as would be expected, roimethoud 2 yields a higher ROI_Rate_% than roimethoud 1.




[Figure 10 chart: "Final rate over Test videos"; left axis rate (kbps, 0 to 35), right axis Frames (0 to 400); series Q2, Q3, Target, Final_Rate and Frames]


Figure 10 Final rate and number of frames for each test video.

Most notably, E009 (roimethoud 2, no face detection and forced MB skipping) has almost all of the rate (94%) dedicated to the ROI.

Why is it ok that they have a higher rate?

6.1.6. Encoding FPS
It should be noted that no significant optimisations have been made to the encoder, as
optimisation was not the focus of this project; these results are reported for interest's sake
only. The only metric I will examine here is ENC_FPS, the final rate at which frames were
encoded. We found that face detection was the factor with the most marked effect on
ENC_FPS. Without face detection, encoding was relatively fast (M=200.98, SD=0.9), but with
face detection this dropped significantly (M=47.44, SD=0.1). This shows that face detection
has a marked performance cost. This is difficult to investigate further, as we used a
closed-source face detection library, since this was intended as a feasibility study. Thus, if
this project is to be developed further, I would advise the development of an optimised face
detection system. These results were obtained on a powerful desktop PC; if the final aim is
real-time encoding on mobile devices, the encoder will have to be optimised for that
environment.
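As a minimal sketch of how the reported M and SD could be obtained, the snippet below groups per-run ENC_FPS values by whether face detection was enabled and takes the mean and sample standard deviation of each group. The run values in the usage line are hypothetical, not measurements from this study; only the two means used for the slowdown ratio are the figures reported above.

```python
from statistics import mean, stdev


def summarise_fps(runs):
    """Return {face_detection: (mean_fps, sample_sd)} for (face_detection, fps) pairs.

    Each group needs at least two runs, since the sample SD is undefined otherwise.
    """
    groups = {}
    for face_detection, fps in runs:
        groups.setdefault(bool(face_detection), []).append(fps)
    return {flag: (mean(values), stdev(values)) for flag, values in groups.items()}


# Hypothetical runs, for illustration only.
summary = summarise_fps([(False, 200.0), (False, 202.0), (True, 47.0), (True, 48.0)])
print(summary)

# Slowdown implied by the means reported in the text (200.98 vs 47.44 fps).
slowdown = 200.98 / 47.44
print(f"face detection is about {slowdown:.2f}x slower")
```

On the reported means, face detection reduces encoding throughput by a factor of roughly 4.2.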

6.1.7. Summary




6.2.   User Study
6.2.1. Introduction
Objectives

In this section we will first examine the results of the user intelligibility study. We will
compare our results with existing quality and intelligibility metrics. Finally, we will examine
the results with regard to the encoder.

We hope to show that…….

Videos used: For the user study only 8 of the test videos were used. This was not intended,
but was due to problems. As each participant viewed 12 test videos drawn from only 8 source
videos, some sequences were necessarily seen more than once, which may lead to a learning
bias. A number of participants observed that they had seen a video sequence before, with
comments like “That one was the same as the other one” and “I have seen this before”. When
this was noted by a participant, the invigilator told the participant that they might see the
same video clip more than once, and that they should try to rate the videos on their
intelligibility. We do not believe this will have a significant effect on the results, primarily
because the participants were not being asked questions on content, but rather on perceived
intelligibility. Thus, if the content of a specific video had been viewed before, knowledge of
the content would not be directly beneficial, as at no point would explicit understanding of
the content be necessary. Undoubtedly there would be some form of learning bias, most likely
giving a higher intelligibility score the second time a video was viewed with a different
encoding, especially if the participant was undecided between two possible scores. We
believe this effect to be minimal and not to negatively influence the overall results.
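The extent of the repetition can be estimated with a little arithmetic. The sketch below assumes, as a simplification, that each of a participant's 12 test videos was drawn uniformly and independently from the 8 source videos; the actual selection procedure may have differed. Under that assumption the pigeonhole principle guarantees at least 12 - 8 = 4 repeat viewings per participant, and the expected number is 12 - 8(1 - (7/8)^12).

```python
def expected_distinct(num_videos, num_viewings):
    """Expected number of distinct videos seen when every viewing picks one of
    num_videos uniformly at random, independently of the other viewings."""
    return num_videos * (1 - (1 - 1 / num_videos) ** num_viewings)


def repeat_viewings(num_videos, num_viewings):
    """Return (guaranteed minimum, expected) number of repeat viewings."""
    minimum = max(0, num_viewings - num_videos)  # pigeonhole bound
    expected = num_viewings - expected_distinct(num_videos, num_viewings)
    return minimum, expected


minimum, expected = repeat_viewings(8, 12)
print(f"at least {minimum} repeats, about {expected:.2f} expected")
```

Under this assumption a typical participant would see five or six repeat viewings, which fits the participants' comments about recognising sequences.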

A t-test gives p = 0.11, which is not significant at the 5% level, so this result is inconclusive.




6.2.2. MOS data
[Figure: MOS per test video. Y axis: opinion score, 1 to 5; x axis: videos A[01], A[02], A[08], A[10], A[11], A[15], A[20] and A[24]. Series: Q2, Mean and Global Mean.]


Figure 11 MOS of the videos used in the user study.

We start with a direct analysis of the MOS from the user study. Before looking at the
various encodings, we analyse the test videos.

We assume that the test videos all have the same basic intelligibility, that is, that there is
no significant difference in intelligibility between the videos used in the experiment.
If this holds, variation in the opinion scores cannot be attributed to the choice of video. We
examine the OS across the videos; the results can be found in Table 4 and Table 5 and are
shown in Figure 11. Note that the videos were selected randomly, so they were not all viewed
an equal number of times, with per-video observation counts ranging between 10 and 14. From
Figure 11 it can be seen that the OS have a wide spread, but that there does not appear to be
any significant difference in OS between the videos. This is confirmed by the data. We
examined the MOS for the videos; the mean of these is noticeably above the midpoint of the MOS scale
                       , showing that on average intelligibility was rated above 3. More
importantly, the deviation in MOS is low,                     , and significantly lower than the
average deviation                          , showing that there is very little difference between
the OS for the videos. Finally, this subjective analysis is strongly supported by the results
of a one-factor ANOVA, which shows no significant difference between the videos
                           . With an observed p-value as high as 0.95, we have no grounds
to reject the hypothesis that there is no significant difference in intelligibility between
the videos. Even if there is a slight difference, as there undoubtedly must be, randomly
selecting the videos over a large enough population should average it out when the MOS
are studied.
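The one-factor ANOVA can be sketched from first principles. The snippet below is a minimal illustration in which each group is the list of OS recorded for one video; it is not necessarily the tool used to obtain the reported p-value. It computes the between-group and within-group mean squares and returns their ratio F together with the two degrees of freedom.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-factor ANOVA.

    groups: a list of score lists, e.g. the OS recorded for each video.
    """
    scores = [x for g in groups for x in g]
    grand_mean = mean(scores)
    k, n = len(groups), len(scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    # If SciPy is available, the p-value is scipy.stats.f.sf(f_stat, df_between, df_within).
    return f_stat, df_between, df_within


# Toy data, not from the study: three groups whose means differ by one point.
print(one_way_anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]]))  # (3.0, 2, 6)
```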

This implies that the variation in opinion scores is due to other factors, including the ROI
encoding of the video, the individual user, and environmental factors such as lighting and
viewing conditions.

We thus examine the individual OS across the users and the encoding used on each video.
By design, each user viewed 14 videos: two example videos, and 12 videos each encoded with
one of the encodings from E001 to E012. Thus each user viewed each encoding exactly once.
The videos were chosen randomly from A[01], A[02], A[08], A[10], A[11], A[15], A[20] and
A[24], and the order in which the videos were shown was randomised. The resulting OS are
shown in Table 7 and Table 8, and are plotted in

the for the users
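The study design described above can be sketched as follows. This is an illustration under stated assumptions, not the script used in the study: it assumes each encoding is paired with a video drawn uniformly at random from the eight used, and the helper name session_plan is hypothetical. The two example videos shown first are not included.

```python
import random

VIDEOS = ["A[01]", "A[02]", "A[08]", "A[10]", "A[11]", "A[15]", "A[20]", "A[24]"]
ENCODINGS = [f"E{i:03d}" for i in range(1, 13)]  # E001 .. E012


def session_plan(rng):
    """One participant's 12 test trials: each encoding exactly once, each paired
    with a randomly chosen video, presented in a random order."""
    trials = [(encoding, rng.choice(VIDEOS)) for encoding in ENCODINGS]
    rng.shuffle(trials)
    return trials


plan = session_plan(random.Random(2010))
for position, (encoding, video) in enumerate(plan, start=1):
    print(position, video, encoding)
```

Because every participant rates every encoding once, the scores for two encodings can be compared within participants.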

Means

Var

ANOVA

Paired t-tests for differences

F-test for pooled variance
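The paired t-test and the F-test for pooled variance can be sketched with the standard library. Because every participant rated every encoding once, the OS for two encodings can be paired by participant. The function names are illustrative; the variance-ratio test places the larger sample variance in the numerator. p-values would still have to be read from t and F tables (or, if SciPy is available, from scipy.stats.t and scipy.stats.f).

```python
from math import sqrt
from statistics import mean, stdev, variance


def paired_t(a, b):
    """t statistic and df for OS a[i], b[i] given by the same participant i."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / sqrt(n)), n - 1


def variance_ratio_f(a, b):
    """F statistic (larger variance over smaller) and its (numerator, denominator) df."""
    va, vb = variance(a), variance(b)
    if va >= vb:
        return va / vb, len(a) - 1, len(b) - 1
    return vb / va, len(b) - 1, len(a) - 1


# Toy scores for four participants, not data from the study.
print(paired_t([3, 4, 5, 4], [2, 4, 3, 3]))
print(variance_ratio_f([3, 4, 5, 4], [2, 4, 4, 2]))
```

The F-test is typically run before a two-sample t-test, to check whether pooling the two variances is reasonable.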

6.2.3. Quality metrics
SSIM

PSNR

I 

6.2.4. Encoder variables and parameters
roimethod

roiskip

Face detection

roispread (compare with MobileASL; our approach differs): hands and face, trailing ROI

6.2.5. Summary
Users prefer the ROI encoder

Face detection bad?

Skipping bad! (buffer, no artefacts)

ROI spread good

I is a good measure (test with skin and other segmentation)




6.3.   Summary
A workable system that users like!




7.     Conclusion
7.1.   Future work
Rate control

        bottom up!

        Constant rate (capped)

Face detection

Optimisation, real time

Skip and artefacting?

Develop skin detection or motion tracking, and test.

Perceptual optimisations

7.2.   We got good results
Effective ROI implementation

Users like it

Success!




8.   Appendix A: build x264




9.    Appendix B: Experimental video questions
     1. Do you have any children? What are their names?

I have two children and their names are xxx and xxx.

     2. Are you married? What is your spouse’s name?

I am married and my husband/wife’s name is xxx.

     3. How long have you been married?

I have been married for xxx years.

     4. What are your parents’ names?

My father’s name is xxx and my mother’s name is xxx.

     5. Where do you live?

I live in Newlands

     6. Where were you born?

I was born in Stellenbosch.

     7. Where do you work?

I work at xxx in Wynberg

     8. How do you get to work?

I catch the train to work in the morning.

     9. What time do you finish work?

I finish work at 17:30

     10. Which team did you support during the world cup?

I supported Spain during the soccer world cup.

     11. Did you play any sport at school?

I played hockey at school.

     12. Do you have any pets? If so, what are their names?

I have a pet dog, his name is xxx



     13. Which school did you go to?

I went to xxx school.

     14. What was your favourite subject at school?

Geography was my favourite subject at school.

     15. What did you have for dinner yesterday?

I had xxx for dinner yesterday.

     16. What is your favourite fast food?

My favourite fast food is Nando’s

     17. What is your favourite fruit?

Apples are my favourite fruit.

     18. What is your favourite soft drink?

My favourite soft drink is Coca-Cola

     19. What is your favourite colour?

Green is my favourite colour.

     20. What is your favourite season and why?

Winter is my favourite season, the rain reminds me of my childhood.

     21. What is your favourite time of day and why?

I love sunsets, because they are so beautiful.

     22. Which famous person do you admire most?

My most admired person is Nelson Mandela, for what he has done for the country.

     23. Which animal do you dislike the most?

I am very afraid of spiders.

     24. What is the furthest you have travelled, and how did you get there?

I once took the train to Durban.




10. Appendix C: User questionnaires




                                   Information Sheet
Who are we?
We are Computer Science researchers from the University of Cape Town (UCT). The team members are Chris Laidler
and Hung Tran, under the supervision of Edwin Blake.

What do we want to do?

We want to improve communications between the Deaf, specifically utilising cell phones for video communication.
There are a number of limiting factors that affect the transmission of video using cell phones and mobile networks.
Our project aims to investigate a number of optimisations that can be made to the encoding of the video to optimise
for intelligibility at the cost of overall quality.

Why are we doing this experiment?

We have developed some software in the laboratory, and used it to encode some video samples. We want to assess
how the optimisations we have made will affect the intelligibility of sign language when judged by competent sign
language speakers.
If you agree to help with my experiment today, I will ask you to sign a consent form, but you can
leave the project at any time without any penalty to you at all. Participation is your free choice.
You will be asked to view a number of test videos, and rate their intelligibility.




                                     Consent Form
I, __________________________, fully understand the Mobile Deaf Telephony user experiment and
agree to participate. I understand that all information that I provide will be kept confidential, and that
my identity will not be revealed in any publication resulting from the research unless I choose to give
permission. Furthermore, all recorded interview media and transcripts will be deleted after the data
results have been analysed. I am also free to withdraw from the project at any time.

I understand that the South African Sign Language interpreter who will provide the voice-sign-voice
translation is bound by a code of ethics, which does not allow him/her to repeat any information that is
given during the discussions. This means that my identity will remain confidential.

For further information, please do not hesitate to contact:

Chris Laidler
Email: chris.laidler@gmail.com
Cell: 076 687 3178




Name:         _______________________________________________________________


Signature:    _______________________________________________________________


Date:         _______________________________________________________________




                                          Questionnaire
1. Are you Male or Female?
      Male      Female

2. What is your age?
      18 – 21        22 – 25    26 – 30      31 – 40        41 – 50          51 – 60   61+

3. Do you consider yourself fluent in South African Sign Language?
      Yes       No

For each video segment you are about to watch, please rate it in terms of its intelligibility. Please
note that this is how well you can understand what is being said, not the overall quality of the
video.

Example


      You will be shown two example videos first.


     How would you rate the intelligibility of the video you were just shown?

1.    Video:      A[      ] ROI [   ]   F[   ]   S[    ]   B[   ]     SP [   ]

     1. Bad                2 Poor            3 Fair                 4 Good             5 Excellent

     Comments:______________________________________________________________________
     ________________________________________________________________________________
     ________________________________________________________________________________
     _____________________________________________________________________________


2.    Video:      A[      ] ROI [   ]   F[   ]   S[    ]   B[   ]     SP [   ]

     1. Bad                2 Poor            3 Fair                 4 Good             5 Excellent

     Comments:______________________________________________________________________
     ________________________________________________________________________________
     ________________________________________________________________________________
     _____________________________________________________________________________




Test Videos

1.    Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

     1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

     Comments:______________________________________________________________________
     ________________________________________________________________________________
     ________________________________________________________________________________
     _____________________________________________________________________________


2.    Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

     1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

     Comments:______________________________________________________________________
     ________________________________________________________________________________
     ________________________________________________________________________________
     _____________________________________________________________________________


3.    Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

     1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

     Comments:______________________________________________________________________
     ________________________________________________________________________________
     ________________________________________________________________________________
     _____________________________________________________________________________


4.    Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

     1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

     Comments:______________________________________________________________________
     ________________________________________________________________________________
     ________________________________________________________________________________
     _____________________________________________________________________________




5.    Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

     1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

     Comments:______________________________________________________________________
     ________________________________________________________________________________
     ________________________________________________________________________________
     _____________________________________________________________________________


6.    Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

     1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

     Comments:______________________________________________________________________
     ________________________________________________________________________________
     ________________________________________________________________________________
     _____________________________________________________________________________


7.    Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

     1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

     Comments:______________________________________________________________________
     ________________________________________________________________________________
     ________________________________________________________________________________
     _____________________________________________________________________________


8.    Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

     1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

     Comments:______________________________________________________________________
     ________________________________________________________________________________
     ________________________________________________________________________________
     _____________________________________________________________________________




9.     Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

      1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

      Comments:______________________________________________________________________
      ________________________________________________________________________________
      ________________________________________________________________________________
      _____________________________________________________________________________


10.    Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

      1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

      Comments:______________________________________________________________________
      ________________________________________________________________________________
      ________________________________________________________________________________
      _____________________________________________________________________________


11.    Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

      1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

      Comments:______________________________________________________________________
      ________________________________________________________________________________
      ________________________________________________________________________________
      _____________________________________________________________________________


12.    Video:   A[   ] ROI [   ]   F[   ]   S[   ]   B[   ]   SP [   ]

      1. Bad         2 Poor             3 Fair                4 Good     5 Excellent

      Comments:______________________________________________________________________
      ________________________________________________________________________________
      ________________________________________________________________________________
      _____________________________________________________________________________




Do you have any general comments or observations?

__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________

11. Appendix A: User questionnaires




       12. DATA


       Table 9

           A[01]   A[02]   A[04]   A[05]   A[06]   A[07]   A[08]   A[09]   A[10]   A[11]   A[12]   A[13]   A[14]   A[15]   A[18]   A[19]   A[20]   A[21]   A[22]   A[23]   A[24]
E000       27.93   27.97   24.81   26.53   23.42   28.41   28.72   24.98   26.48   28.82   25.56   25.50   24.99   27.80   25.28   26.25   27.43   26.39   26.06   24.24   28.83
E001       27.93   27.97   24.81   26.53   23.42   28.41   28.72   24.98   26.48   28.82   25.56   25.50   24.99   27.80   25.28   26.25   27.43   26.39   26.06   24.24   28.83
E002       29.92   29.58   27.20   28.04   25.71   28.83   29.91   27.58   28.41   30.03   27.78   28.39   30.67   29.39   27.15   27.19   30.66   28.30   28.41   26.21   30.10
E003       29.36   29.46   27.18   27.45   25.36   27.93   29.93   26.03   28.92   29.25   27.73   28.37   29.71   29.26   26.63   25.70   30.07   27.94   28.19   26.20   29.56
E004       30.06   29.77   27.49   28.19   26.62   28.63   29.89   27.54   28.36   29.67   27.89   28.49   30.12   29.01   28.72   27.10   30.86   28.28   28.04   26.72   30.15
E005       29.67   29.75   27.37   28.08   25.00   27.53   30.22   25.43   29.21   29.15   28.46   28.94   29.46   29.24   28.04   26.56   30.53   27.48   28.05   27.35   29.54
E006       30.39   30.06   27.77   28.53   24.96   29.38   30.25   27.40   28.57   30.75   28.25   29.11   32.06   29.84   28.42   27.57   31.38   29.16   29.04   27.02   30.83
E007       29.60   29.72   27.83   28.19   25.25   28.58   30.27   26.75   29.19   29.76   28.23   29.07   30.21   29.56   27.36   26.46   30.60   28.44   28.33   27.05   30.20
E008       30.57   30.17   27.70   28.57   25.74   29.16   30.27   27.29   28.70   30.29   28.71   29.49   31.34   29.72   29.46   27.61   31.51   28.95   28.66   27.30   30.59
E009       29.87   29.87   27.82   28.48   24.98   27.92   30.56   26.47   29.35   29.41   28.93   29.48   29.82   29.78   29.03   26.97   30.96   27.81   28.03   28.07   30.18
E010       28.13   28.28   24.83   26.52   22.95   28.16   29.12   24.80   26.94   28.79   25.83   26.19   26.03   27.86   25.37   25.41   28.31   26.32   26.42   24.52   28.87
E011       28.54   28.59   25.54   27.00   23.10   28.23   29.55   24.84   27.48   29.01   26.38   26.94   27.58   28.42   25.94   25.93   29.14   26.72   26.78   25.13   29.03
E012       29.20   29.10   26.40   27.73   23.00   28.18   30.14   25.83   28.06   29.56   27.13   27.98   28.65   28.91   27.20   26.43   30.08   27.55   27.30   25.61   29.61




13. Appendix XX Results


E           MOS    ROI  F   S   SP     STDDEV  STD Error  VAR    ROI Rate %  Final Rate  ROI Rate  ROI Size %  ROISP   FPS      PSNR    SSIM
-2          2.125  -1   0   0   20     1.859   0.482      1.859  23.846      27.877      6.648     19.000      20.000  230.608  32.722  0.887
-1          2.125  -1   0   1   20     1.859   0.482      1.859  23.846      27.877      6.648     19.000      20.000  440.971  32.722  0.887
0           2.125  -1   1   0   20     1.364   0.482      1.859  29.647      27.877      8.265     26.231      20.000  38.408   32.722  0.887
1           2.125  -1   1   1   20     1.859   0.482      1.859  29.647      27.877      8.265     26.231      20.000  38.477   32.722  0.887
2           4.125  1    0   0   20     0.609   0.276      0.609  73.572      29.612      21.786    19.000      17.277  416.978  29.958  0.845
4           3.875  1    1   0   20     1.269   0.449      1.609  75.961      29.312      22.266    26.231      15.137  38.396   29.984  0.601
6           3.875  2    0   0   20     1.609   0.449      1.609  79.737      30.127      24.022    19.000      18.428  429.788  27.866  0.847
8           3.25   2    1   0   20     0.829   0.293      0.688  80.756      29.735      24.013    26.231      16.301  38.393   28.238  0.626
5           3.125  1    1   1   20     2.359   0.543      2.359  86.823      29.565      25.669    26.231      16.561  38.472   14.577  0.789
10          2.625  1    1   0   5      0.992   0.351      0.984  42.587      28.251      12.031    26.231      4.796   38.305   32.616  0.594
3           3.25   1    0   1   20     1.438   0.424      1.438  91.776      29.657      27.218    19.000      18.723  449.625  13.941  0.804
11          3.875  1    1   0   10     1.053   0.372      1.109  56.922      28.802      16.395    26.231      9.565   38.398   32.046  0.617
9           3.25   2    1   1   20     0.938   0.342      0.938  90.613      29.860      27.057    26.231      17.282  38.360   14.569  0.888
7           3.125  2    0   1   20     2.109   0.513      2.109  94.995      30.063      28.558    19.000      19.142  462.847  14.089  0.882
12          3.875  1    1   0   15     0.927   0.328      0.859  69.680      29.467      20.533    26.231      14.038  38.315   31.031  0.867

Correlation ###    ###  ### ### -0.49  -0.51   -0.51      ###    0.73        0.78        0.72      0.01        -0.31   0.07     -0.20   -0.34
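The Correlation row appears to hold the Pearson correlation of each column with MOS (cells shown as ### could not be recovered). As a check on that reading, the sketch below recomputes the ROI Rate % entry from the MOS and ROI Rate % columns of the table above; it rounds to 0.73, matching the table.

```python
from math import sqrt

# (MOS, ROI Rate %) for each row of the results table, in table order.
MOS_ROI_RATE = [
    (2.125, 23.846), (2.125, 23.846), (2.125, 29.647), (2.125, 29.647),
    (4.125, 73.572), (3.875, 75.961), (3.875, 79.737), (3.25, 80.756),
    (3.125, 86.823), (2.625, 42.587), (3.25, 91.776), (3.875, 56.922),
    (3.25, 90.613), (3.125, 94.995), (3.875, 69.680),
]


def pearson(pairs):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


print(round(pearson(MOS_ROI_RATE), 2))
```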




14. References

[1] Anonymous 2010. Android Dev Guide! Available from:
http://developer.android.com/guide/index.html. [ Accessed on ].

[2] Anonymous 2010. x264 - a free h264/avc encoder Available from:
http://www.videolan.org/developers/x264.html. [ Accessed on 2009/12/03 ].

[3] Agrafiotis, D., Canagarajah, C. N., Bull, D. R., Kyle, J., Seers, H. and Dye, M. 2006. A
perceptually optimised video coding system for sign language communication at low bit
rates. Signal Process Image Commun. 21, (2006), 531-549.

[4] Agrafiotis, D. 2003. Perceptually optimised sign language video coding based on eye
tracking analysis. Electron. Lett. 39, 24 (2003), 1703.

[5] Cavender, A., Ladner, R. E. and Riskin, E. A. 2006. MobileASL: intelligibility of sign
language video as constrained by mobile phone technology. In Proceedings of the 8th
international ACM SIGACCESS conference on Computers and accessibility (Portland,
Oregon, USA, October 22–25). ACM, New York, NY, USA, 71-78.
10.1145/1168987.1169001.

[6] Chen, J. W., Kao, C. Y. and Lin, Y. L. 2006. Introduction to H.264 advanced video
coding. In Proceedings of the 2006 Asia and South Pacific Design Automation Conference
(Yokohama, Japan, January 24 - 27). IEEE Press, Piscataway, NJ, USA, 736-741.

[7] Ciaramello, F. and Hemami, S. 2007. ‘Can you see me now?’ An Objective Metric for
Predicting Intelligibility of Compressed American Sign Language Video. In (28 January).
San Jose, 6492.

[8] Ciaramello, F. M. and Hemami, S. S. 2007. Complexity constrained rate-distortion
optimization of sign language video using an objective intelligibility metric. Presented at
IEEE Western New York Image Processing Workshop (WNYIP).

[9] Jemni, M., Elghoul, O. and Makhlouf, S. 2007. A Web-Based Tool to Create Online
Courses for Deaf Pupils. In IMCL Conference, Amman.

[10] Kosmach, J., Lengwehasatit, K., Veselinovic, D., Sherwood, G. and
Neff, R. Introduction to the OpenCORE video components used in
the Android platform.

[11] Ma, Z. and Tucker, W. 2008. Adapting x264 to asynchronous video telephony for the
Deaf. In Proc. South African Telecommunications Networks and Applications
Conference (SATNAC 2008).




[12] Muir, L. J. 2005. Perception of sign language and its application to visual
communications for deaf people. Journal of Deaf Studies and Deaf Education. 10, 4 (2005),
390.

[13] Nakazono, K., Nagashima, Y. and Ichikawa, A. 2006. Digital encoding applied to sign
language video. IEICE Trans. Inf. Syst. 89, 6 (2006), 1893-1900.

[14] Sullivan, G. J. and Wiegand, T. 2005. Video compression—From concepts to the
H.264/AVC standard. Proc IEEE. 93, 1 (Jan 2005), 18-31. DOI=10.1109/JPROC.2004.839617.

[15] Tucker, W. D. 2009. Softbridge: a socially aware framework for communication bridges over
digital divides. PhD Thesis, Department of Computer Science, University of Cape Town.

[16] Vanam, R., Riskin, E. A. and Ladner, R. E. 2009. H.264/MPEG-4 AVC encoder
parameter selection algorithms for complexity distortion tradeoff. In Proceedings of the 2009
Data Compression Conference (Snowbird, UT, March 16 - 18). DCC. IEEE Computer
Society, Washington, DC, USA, 372-381. 10.1109/DCC.2009.53.

[17] Vanam, R., Riskin, E. A., Ladner, R. E., Ciaramello, F. M. and Hemami, S. S. Joint
rate-intelligibility-complexity optimization of an H.264 video encoder for American Sign
Language.

[18] Wiegand, T., Sullivan, G. J., Bjontegaard, G. and Luthra, A. 2003. Overview of the
H.264/AVC video coding standard. IEEE Transactions on circuits and systems for video
technology. 13, 7 (July 2003), 560-576. DOI=10.1109/TCSVT.2003.815165.




15. Crap
As can be seen, the encodings with a roispread greater than 24 all have an excessively high
rate in the first 90 frames. To compensate for this, their current rate drops to between
5 kbps and 15 kbps, whereas the others fluctuate around the 30 kbps target bitrate, as is
desired. The rate control is driven by the total bitrate up to the current frame.

If we examine the total accumulated bitrate over a spread of roispread values, as shown
in Figure 7, we can see that as time progresses the mean bitrate approaches the target
bitrate. As the one-pass rate control is based on the total rate up to that point, this comes as
no surprise.

Thus, with the current basic ROI rate control, if an encoding is started with a roispread that is
unattainable, it will adjust its scale factor until it has scaled the roispread to an
attainable value. Until then the scale factor will be too large, meaning that an
excessive amount of rate is used in this initial period. This initial excess of data results in the
current rate being lowered to compensate. Thus, with the current simple rate control, if a
reasonable initial roispread is used, the time taken to adjust the scale factor will be minimal
and the current rate will be maintained close to the target bitrate.
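This behaviour can be illustrated with a deliberately simplified model. It is a toy written under explicit assumptions, not the x264 rate control or the encoder used here: for a fixed number of warm-up frames the (hypothetical) ROI scale factor is still too large, so every frame overshoots; afterwards each frame is allotted the target minus the accumulated overspend spread over a window of frames. Units are arbitrary bits per frame and the parameter values are illustrative.

```python
def simulate_rate(num_frames, target, warmup_frames, warmup_bits, window):
    """Return the bits spent on each frame by a toy total-rate-driven controller.

    Assumptions (illustrative only): for the first `warmup_frames` frames the
    hypothetical ROI scale factor is still too large, so each frame costs
    `warmup_bits`. Afterwards each frame is allotted the target minus the
    accumulated overspend divided by `window`, never less than zero.
    """
    spent = 0.0
    per_frame = []
    for i in range(num_frames):
        if i < warmup_frames:
            bits = warmup_bits
        else:
            overspend = spent - target * i  # bits spent beyond the target so far
            bits = max(0.0, target - overspend / window)
        spent += bits
        per_frame.append(bits)
    return per_frame


rates = simulate_rate(num_frames=300, target=30.0, warmup_frames=10, warmup_bits=60.0, window=20)
cumulative_mean = sum(rates) / len(rates)
print(rates[:12])
print(round(cumulative_mean, 3))
```

With these values the first post-warm-up frame gets 15 units, half the target, and the overspend then shrinks by 5% per frame. The cumulative mean therefore converges on the target while the current rate stays below it, which mirrors the pattern described for the encodings with a large roispread.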

This initial excess has to be compensated for by lowering the rate in subsequent frames.
This can be seen in XXX.

Thus, on average, they achieve bit rates closer to the target bitrate. We believe this property,
albeit desirable, is coincidental. Due to the nature of the rate control implemented, an ROI
encoded video will most likely have a final rate that is higher than the base



We tested the hypothesis that the different encoding methods have the




Given a rate-constrained environment, such as a mobile device, the objective is to maximise
the proportion of rate that is given to areas of importance and interest, while not negatively
impacting overall intelligibility. The ROI is designated as the area in which we are trying to
increase rate, and the encoding system implements various methods to achieve this.
This section will discuss these methods and their effects on rate in the ROI.


The percentage of the overall rate that is used to transfer data contained in the ROI
is shown as the blue columns in Figure 1.
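As a minimal sketch, this percentage can be computed from per-macroblock bit counts and an ROI mask; the function and its inputs are hypothetical. In the results table, ROI Rate % also appears to equal ROI Rate divided by Final Rate (for example, 100 × 28.558 / 30.063 ≈ 94.99 for the row listed with 94.995), which the last line checks.

```python
def roi_rate_pct(mb_bits, roi_mask):
    """Percentage of coded bits spent on macroblocks inside the ROI."""
    if len(mb_bits) != len(roi_mask):
        raise ValueError("need one ROI flag per macroblock")
    total = sum(mb_bits)
    roi = sum(bits for bits, in_roi in zip(mb_bits, roi_mask) if in_roi)
    return 100.0 * roi / total if total else 0.0


# Hypothetical frame of four macroblocks, two of them in the ROI.
print(roi_rate_pct([10, 30, 50, 10], [False, True, True, False]))  # 80.0

# Cross-check against one row of the results table (ROI Rate / Final Rate).
print(round(100 * 28.558 / 30.063, 2))
```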



