MPEG 4 by gabyion


• MPEG: Moving Picture Experts Group, a
  working group of ISO/IEC
     “Compactly representing digital video and audio signals
     for consumer distribution”

• MPEG-1: Standard for storage and retrieval of audio
  and video on storage media

• MPEG-2: Standard for digital TV
       Scope of MPEG-4 Standard

• Author: greater production flexibility and
• Network Service Provider: Offering
  transport information which can be
  interpreted on various network platforms
• End user: Higher levels of interaction with
  content within the limits set by the author.

• Interactivity : Interacting with the different
  audio-visual objects

• Scalability : Adapting content to match

• Reusability : For both tools and data
               Objectives - Interactivity

• Client Side Interaction
  – Manipulating scene description and properties of audio-
    visual objects
• Audio-Visual Objects Behavior
  – Triggered by user actions and other events
• Client Server Interaction
  – In case a return channel is available
                 Objectives - Scalability
• Scalability refers to the ability to only decode a
  part of a bitstream and reconstruct images or
  image sequences with:
   – Reduced decoder complexity (reduced quality)
   – Reduced spatial resolution
   – Reduced temporal resolution
• A scalable object is one that carries basic-quality
  information for presentation. When enough bitrate
  or resources are available, enhancement layers
  can be added to improve quality.
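As a rough illustration of the base-plus-enhancement idea, the sketch below picks which layers a receiver would decode for a given bitrate. The layer names, the bitrate figures, and the rule that enhancement layers are taken strictly in order are assumptions made for the example, not something the standard prescribes in this form.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    kbps: float  # bitrate this layer needs (hypothetical figures)

def select_layers(base, enhancements, available_kbps):
    """Return the layers to decode: the base first, then enhancements while they fit."""
    if base.kbps > available_kbps:
        return []  # not even the base layer fits
    chosen = [base]
    budget = available_kbps - base.kbps
    for layer in enhancements:
        if layer.kbps > budget:
            break  # assumed: each enhancement builds on the previous one
        chosen.append(layer)
        budget -= layer.kbps
    return chosen

base = Layer("base", 10)
enh = [Layer("spatial", 20), Layer("temporal", 30)]
names = [layer.name for layer in select_layers(base, enh, 35)]
```

With 35 kbps available, only the base and the first enhancement layer are selected; with less than 10 kbps nothing can be presented.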
   Objectives – Scalability (Cont.)
• Scalability is a key factor in many applications:
  it makes moving video possible at very low bitrates,
  notably for mobile devices.
• MPEG-4 has been found usable for streaming
  wireless video at 10 kbps over GSM.
• Low bitrates are accommodated by the use of
  scalable objects.
             Objectives - Reusability

• Authors can easily organize and manipulate
  individual components and reuse existing
  decoded objects.
• Each type of content can be coded using the
  most effective algorithms.
• Traditional Requirements (MPEG-1 & 2)
  – Streaming : for live broadcast
  – Synchronization : to process data received at the right
    instants of time
  – Stream Management : to allow the application to
    consume the content (content type, dependencies, etc.)
  – Intellectual Property Management
• Specific MPEG-4 Requirements
  – Audio-Visual objects
  – Scene description
                      Audio-Visual Objects
• The representation of a natural or synthetic
  object that has an audio and/or visual manifestation.
• Examples:
   –   Video Sequence (with Shape information).
   –   Audio Track
   –   Animated 3D face
   –   Speech synthesized from text.
• Advantages: Interaction – Scalability – Reusability
                   Scene Description

The coding of information that describes the
spatio-temporal relationships between the
various audio-visual objects.
           Scene Description (Cont.)
• Place media objects anywhere in a given
  coordinate system.
• Apply transforms to change the geometrical or
  acoustical appearance of a media object.
• Group primitive media objects to form compound
  media objects.
• Apply streamed data to media objects to modify
  their attributes (sound, moving texture…)
• Change, interactively, the user’s viewing and
  listening point anywhere in the scene.
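The placement and grouping operations listed here can be sketched with a tiny scene tree. The class names `MediaObject` and `Group`, the 2D translation-only transform, and the object names are all hypothetical; real scene descriptions carry far richer transforms.

```python
from dataclasses import dataclass, field

@dataclass
class MediaObject:
    name: str
    x: float = 0.0
    y: float = 0.0

@dataclass
class Group:
    name: str
    dx: float = 0.0  # translation applied to every child
    dy: float = 0.0
    children: list = field(default_factory=list)

def world_positions(node, ox=0.0, oy=0.0):
    """Walk the tree and return {object name: (x, y)} in scene coordinates."""
    if isinstance(node, MediaObject):
        return {node.name: (ox + node.x, oy + node.y)}
    positions = {}
    for child in node.children:
        positions.update(world_positions(child, ox + node.dx, oy + node.dy))
    return positions

props = Group("props", dx=2.0, dy=1.0,
              children=[MediaObject("ball"), MediaObject("table", x=0.0, y=0.5)])
scene = Group("scene", children=[props, MediaObject("background")])
before = world_positions(scene)
props.dx = 5.0  # move the compound object as a whole
after = world_positions(scene)
```

Moving the group changes where its children appear without touching the children themselves, which is the point of composing compound objects from primitive ones.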
          Logical Structure of a Scene


[Figure: scene tree. Upper-level nodes: Person, Background, Video, Synthetic Objects. Lower-level nodes: Shape, Voice, Ball, Table.]
            Scene Description (Cont.)
• Starting from VRML, MPEG has developed a
  binary language called BInary Format for Scenes (BIFS).
• The standard differentiates parameters used to
  improve the coding efficiency of an object
  (motion vectors in video coding) from the ones
  used as modifiers of an object (its position in the
  scene).
• Modifying the latter set does not require re-decoding
  the primitive media objects.
                   MPEG-4 Mission

Develop a coded, streamable representation
for audio-visual objects and their
associated time-variant data along with a
description of how they are combined.
           MPEG-4 Mission (Cont.)

• Coded Vs. Textual

• Streamable Vs. Downloaded

• Audio-Visual objects Vs. Individual Audio
  or Visual Streams
                           Object Model

• Visual objects in the scene are described
  mathematically and given a position in two
  or three dimensional space. Similarly, audio
  objects are placed in sound space.
• “Create once, access everywhere”: objects
  are defined once and the calculations to
  update the screen and sound are done
               Objectifying the Visual

• Classical video (from the camera) is one of
  the visual objects defined in the standard.
• Objects with arbitrary shapes can be
  encoded apart from their background and
  can be described in two ways.
  – Binary Shape: for low bitrate environments
  – Gray Scale (Alpha Shape): for higher quality
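The difference between the two shape descriptions can be seen by compositing one row of pixels over a background. This is only a sketch of the idea: the function name `composite`, the integer rounding, and the sample pixel values are assumptions for illustration. A binary shape says whether a pixel belongs to the object; a gray scale (alpha) shape says how opaque it is.

```python
def composite(fg, bg, shape, max_alpha=255):
    """Blend one row of foreground pixels over the background using a shape mask."""
    out = []
    for f, b, a in zip(fg, bg, shape):
        # a == 0 keeps the background, a == max_alpha keeps the foreground
        out.append((f * a + b * (max_alpha - a)) // max_alpha)
    return out

fg = [200, 200, 200, 200]
bg = [0, 0, 0, 0]
binary = composite(fg, bg, [0, 1, 1, 0], max_alpha=1)  # hard object edge
gray = composite(fg, bg, [0, 128, 255, 64])             # soft, blended edge
```

The binary mask needs one bit per pixel, which suits low bitrates; the alpha mask costs more but allows smooth edges and transparency.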
    Objectifying the Visual (Cont.)

• MPEG does not specify how shapes are to
  be extracted. Current methods still have
  limitations (e.g. Weatherman).
• MPEG-4 specifies only the decoding
  process. Encoding is left to the market
                   2D Animated Meshes
• A 2D mesh is a partition of a 2D planar region into
  polygonal patches.

• A 2D dynamic mesh refers to a 2D mesh geometry
  and motion information.
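A minimal sketch of a dynamic 2D mesh, assuming triangular patches and one motion vector per node (the coordinates and vectors are made up for the example): the geometry is a list of node positions plus a list of triangles, and animation only moves the nodes.

```python
def move_mesh(nodes, motion):
    """Apply one motion vector per node; the triangle list (topology) is unchanged."""
    return [(x + dx, y + dy) for (x, y), (dx, dy) in zip(nodes, motion)]

nodes = [(0, 0), (1, 0), (0, 1), (1, 1)]
triangles = [(0, 1, 2), (1, 3, 2)]  # each patch is three indices into nodes
motion = [(0, 0), (0.5, 0), (0, 0), (0.5, 0.25)]
moved = move_mesh(nodes, motion)
```

Because the triangles refer to nodes by index, the same patch list describes the mesh before and after the motion is applied.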
       2D Animated Meshes (Cont.)
• The most entertaining feature in MPEG-4 is the
  ability to map images onto computer-generated
  shapes (meshes, currently 2D, with 3D in the next
  version).
• A few parameters that deform the mesh can create
  the impression of moving video from a still image
  (e.g. a waving flag).
• Predefined faces are particularly interesting
  meshes. Any feature (lips or eyes) may be
  animated by special commands that make them
  move in synchronization with speech.
                    System Architecture

• Streaming data for media objects.
• Different architecture layers
  –   Delivery layer
  –   Sync layer
  –   Compression layer
  –   Composition layer
• Syntax Description
       Streaming Data for Media Objects
• Needed data for media objects can be
  conveyed in one or more Elementary
  Streams (ESs).
• An Object Descriptor (OD) identifies all
  streams associated with one media object.
• The OD contains a set of descriptors that
  characterize the ESs (required decoder
  resources, encoder timing, ...)
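One way to picture the relationship between an OD and its streams is a pair of small records. The field names below (`stream_type`, `decoder_buffer_bytes`, `depends_on`) are hypothetical and chosen for readability; they are not the descriptor syntax of the standard.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ESDescriptor:
    es_id: int
    stream_type: str                  # e.g. "visual" or "audio" (illustrative labels)
    decoder_buffer_bytes: int         # a required decoder resource
    depends_on: Optional[int] = None  # ES this one enhances, if any

@dataclass
class ObjectDescriptor:
    od_id: int
    es_descriptors: List[ESDescriptor] = field(default_factory=list)

    def stream_ids(self):
        """All elementary streams that belong to this media object."""
        return [esd.es_id for esd in self.es_descriptors]

    def base_streams(self):
        """Streams that can be decoded without any other stream."""
        return [esd.es_id for esd in self.es_descriptors if esd.depends_on is None]

od = ObjectDescriptor(od_id=7, es_descriptors=[
    ESDescriptor(101, "visual", 64_000),
    ESDescriptor(102, "visual", 128_000, depends_on=101),
])
```

Here one media object is carried in two streams, the second being an enhancement of the first.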
[Figure: layer stack, top to bottom: Composition Layer; ODs and Scene; elementary streams (ES); Sync Layer (SL); FlexMux; various transport protocols.]
                          Delivery Layer
• Contains two-layer multiplexer
  – FlexMux: a tool defined according to the DMIF
    (Delivery Multimedia Integration Framework). It
    allows grouping of ESs with a low overhead.
  – TransMux: the second layer, which offers transport
    service interfaces to different transport
    protocols (UDP/IP, MPEG-2, ...)
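To show why grouping several ESs can be cheap, here is a toy multiplexer that spends two header bytes per packet: a channel index and a payload length. The byte layout is an assumption made for this sketch and should not be read as the normative FlexMux syntax.

```python
def mux(packets):
    """packets: iterable of (channel, payload bytes). Returns one interleaved byte string."""
    out = bytearray()
    for channel, payload in packets:
        if not 0 <= channel <= 255 or len(payload) > 255:
            raise ValueError("channel and payload length must fit in one byte each")
        out += bytes([channel, len(payload)]) + payload
    return bytes(out)

def demux(data):
    """Split the interleaved bytes back into {channel: [payload, ...]}."""
    streams = {}
    pos = 0
    while pos < len(data):
        channel, length = data[pos], data[pos + 1]
        streams.setdefault(channel, []).append(data[pos + 2:pos + 2 + length])
        pos += 2 + length
    return streams

muxed = mux([(1, b"video-a"), (2, b"audio-a"), (1, b"video-b")])
streams = demux(muxed)
```

Three packets from two streams travel as one byte string with six bytes of overhead in total.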
• DMIF is a session protocol for the management of
  multimedia streaming over generic delivery
  technologies. It is similar to FTP.
• Actions:
   –   Set up a session with the remote side
   –   Select streams and request that they be streamed
   –   The peer returns pointers to the stream connections
   –   Establish the connections themselves
• The user can specify QoS through the DAI. It is up to
  DMIF to ensure the satisfaction of these requirements.
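The four actions can be walked through with an in-memory stand-in for the remote side. Everything here is hypothetical: the class `RemotePeer`, its method names, and the `channel/...` pointer format are invented for the sketch and do not correspond to the real DAI primitives.

```python
class RemotePeer:
    """Stand-in for the remote side; a real peer would sit behind a network."""

    def __init__(self, streams):
        self._streams = streams  # stream name -> bytes
        self._sessions = set()

    def setup_session(self):
        session_id = len(self._sessions) + 1
        self._sessions.add(session_id)
        return session_id

    def request_streams(self, session_id, names):
        if session_id not in self._sessions:
            raise KeyError("unknown session")
        # The reply holds pointers to the streams, not the stream data itself.
        return {name: "channel/" + name for name in names if name in self._streams}

    def open_channel(self, pointer):
        return self._streams[pointer.split("/", 1)[1]]

peer = RemotePeer({"video": b"\x01\x02", "audio": b"\x03"})
session = peer.setup_session()                                  # 1. set up a session
pointers = peer.request_streams(session, ["video", "audio"])   # 2-3. select, get pointers
received = {name: peer.open_channel(ptr) for name, ptr in pointers.items()}  # 4. connect
```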
                  Delivery layer (Cont.)
• The functionality of the DMIF is expressed by an
  interface called DMIF Application Interface (DAI)
• DAI defines a single, uniform interface to access
  multimedia contents on a multitude of delivery
  technologies.
• DAI is the reference point at which the elementary
  streams can be accessed as sync layer-packetized
  streams.
• Sync layer talks to the delivery layer through DAI.
                               Sync Layer

• SL: a flexible and configurable packetization
  facility that carries timing, fragmentation,
  and continuity information on associated
  data packets (packetized elementary streams).
• It does not provide framing information (no
  packet length in the header). The delivery layer
  does that.
             Sync Layer Functionality
• Identifying time-stamped Access Units (data units
  that comprise a complete representation unit).
• Each packet is an access unit or a fragment of an
  access unit.
• These access units form the only semantic
  structure of ESs in this layer.
• Stamping access units includes timing information
  for decoding and composition.
• SL retrieves ESs from packetized ESs.
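The points above can be sketched as a packetizer and its inverse. The packet fields (`seq`, `au_start`, `au_end`, `dts`) are assumed names for the continuity, fragmentation, and timing information; the real SL header is configurable and looks different. Note that the packet carries no length field, in line with framing being left to the delivery layer.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SLPacket:
    seq: int             # continuity counter
    au_start: bool       # first fragment of an access unit
    au_end: bool         # last fragment of an access unit
    dts: Optional[int]   # decoding time stamp, carried on the first fragment only
    payload: bytes

def packetize(access_unit, dts, max_payload, first_seq=0):
    """Split one access unit into SL packets of at most max_payload bytes."""
    chunks = [access_unit[i:i + max_payload]
              for i in range(0, len(access_unit), max_payload)] or [b""]
    last = len(chunks) - 1
    return [SLPacket(first_seq + i, i == 0, i == last, dts if i == 0 else None, chunk)
            for i, chunk in enumerate(chunks)]

def depacketize(packets):
    """Rebuild (dts, access unit) pairs, checking the continuity counter."""
    units, current, dts, expected = [], None, None, None
    for p in packets:
        if expected is not None and p.seq != expected:
            raise ValueError("packet lost or reordered")
        expected = p.seq + 1
        if p.au_start:
            current, dts = bytearray(), p.dts
        if current is None:
            raise ValueError("fragment without a start packet")
        current += p.payload
        if p.au_end:
            units.append((dts, bytes(current)))
            current = None
    return units

packets = packetize(b"ABCDEFGHIJ", dts=40, max_payload=4)
units = depacketize(packets)
```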
                   Compression Layer

• The streams are sent to their respective
  decoders that process the data and produce
  composition units.
• In order to relate ESs to media objects
  Object Descriptors (OD) are used to convey
  information about the number and
  properties of a set of ESs that belongs to a
  media object.
          Compression Layer (Cont.)
• Scene Description: Defines
   – The spatial and temporal position of the various objects
   – The objects' dynamic behavior
   – Interactivity features
• The scene description contains unique
  identifiers that point to object descriptors.
• Tree structured and based on VRML
                    Composition Layer

• Using scene description and decoded audio-
  visual object data to render the final scene
  presented to user.
• MPEG-4 does not specify how information
  is rendered
• Composition is performed at the receiver
                                 System Principles
     Scene Descriptor ES

     Object Descriptor ES   OD           OD

                             ESD   ESD           ESD

          Decoding Buffer Architecture

[Figure: data enters through the DMIF Application Interface into decoder buffers; each buffer feeds a decoder, decoder output is held in memories, and the memories feed the compositor.]
                    Syntax Description

• MPEG-4 defines a syntactic description
  language (MSDL) to describe the exact
  binary syntax for bitstreams carrying media
  objects and for bitstreams with scene
  description information
• This language is an extension of C++, and is
  used to describe the syntactic representation
  of objects

• Stream Management: The Object
  Description Framework (ODF)

• Timing and synchronization: The System
  Decoder Model (SDM)

• Presentation Engine: (BIFS)
                                       Tools - ODF
• Provides the glue between the scene description and the
  elementary streams.
• Unique identifiers are used in the scene description to point
  to the OD.
• The OD is a structure that encapsulates the setup and
  association information for a set of ESs.
• ODs are transported in dedicated ESs called Object
  Descriptor Streams (ODS).
• This makes it possible to associate timing information with a
  set of ODs.
• Provides mechanisms to describe hierarchical relations
  between streams reflecting scalable encoding of the
                   Tools - ODF (Cont.)

• The initial OD, a derivative of the object
  descriptor, is a key element necessary for
  accessing MPEG-4 content.
• Contains at least two elementary stream descriptors:
  – One points to the scene description stream.
  – Others may point to object descriptor streams.
                            Initial Object Descriptor
  Scene Descriptor Stream

  Object Descriptor Stream
 Initial OD
Scene Descriptor ES

Object Descriptor ES            OD           OD

                                 ESD   ESD           ESD
                            Tools - SDM
• An adaptation of the MPEG-2 System
  Target Decoder (which describes temporal and
  buffer constraints for packetizing ESs).
• MPEG-4 chose not to define multiplexing
  constraints in the SDM.
• The SDM assumes the concurrent delivery of
  already demultiplexed ESs to the decoder
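A buffer constraint of this kind can be checked with a very small simulation. This is a sketch under simplifying assumptions: data arrives in discrete chunks, each access unit leaves the buffer all at once at its decoding time, and arrivals at a given instant are counted before removals. The function name and the numbers are hypothetical.

```python
def buffer_peak(arrivals, access_units, capacity):
    """arrivals: [(time, nbytes)], access_units: [(decoding time, nbytes)].

    Returns the peak fill level; raises if the buffer overflows or underflows.
    """
    events = [(t, 0, n) for t, n in arrivals] + [(t, 1, -n) for t, n in access_units]
    level = peak = 0
    for _, _, delta in sorted(events):  # by time, arrivals before removals
        level += delta
        if level < 0:
            raise ValueError("underflow: access unit not fully received in time")
        if level > capacity:
            raise ValueError("overflow: buffer too small for this stream")
        peak = max(peak, level)
    return peak

arrivals = [(0, 100), (1, 100), (2, 100)]
access_units = [(2, 150), (3, 150)]
peak = buffer_peak(arrivals, access_units, capacity=300)
```

The same stream fails the check against a 250-byte buffer, which is the kind of mismatch a decoder model is meant to rule out in advance.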
                             Tools - BIFS

• A set of nodes to represent the primitive
  scene objects, the scene graph constructs,
  the behavior and activity.

• BIFS scene tells where and when to render
  the media
                   Tools - BIFS (Cont.)

• Used to describe scene decomposition
  – Spatial and Temporal locations of objects.
  – Object attributes and behavior.
  – Relationships between elements in the scene
• Relies heavily on VRML.

  VRML is a file format for describing interactive 3D
worlds (scenes) and objects. It may be used
in conjunction with the WWW. It may be
used to create 3D representations of
complex scenes, as in virtual reality.
       VRML Example – Shape node
     Shape {
       geometry IndexedFaceSet {
         coordIndex [0, 1, 3, -1, 0, 2, 5, -1, …]
         coord Coordinate { point [0.0 5.0 ..] }
         color Color { color [0.2 0.7 …] }
         normal Normal { vector [0.0 1.0 0.0 ..] }
         texCoord TextureCoordinate { point [0 1.0, ..] }
       }
       appearance Appearance { material Material { transparency 0.5 } }
     }
                                 BIFS vs. VRML
• VRML lacks important features:
  – The support of natural audio and video.
  – Timing model is loosely specified.
  – VRML worlds (scenes) are often very large.
• BIFS is a superset of VRML.
  –   A binary format not a textual format (shorter)
  –   Real-time streaming
  –   Definition of 2D objects
  –   Facial Animation
  –   Enhanced Audio
                                 BIFS Protocols
• BIFS Scene Compression (text vs. binary)
• BIFS Command (produce unique-time events for
  the scene.)
   –   Replace the whole scene with the new scene.
   –   Insert node in a grouping node.
   –   Delete a node.
   –   Change a field value.
• BIFS Anim (used for continuous animation of the
  scene.) allows modification of any value in the
  scene : viewpoints - transforms - colors - lights
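The four command types can be mimicked on a scene held as nested dictionaries. The dictionary layout, the `kind` strings, and the node names are all assumptions for this sketch; actual BIFS commands are binary-coded and address nodes differently.

```python
def find(node, node_id):
    """Depth-first search for a node by its id."""
    if node["id"] == node_id:
        return node
    for child in node["children"]:
        hit = find(child, node_id)
        if hit is not None:
            return hit
    return None

def apply_command(scene, cmd):
    """Apply one update command and return the (possibly new) scene root."""
    kind = cmd["kind"]
    if kind == "replace_scene":
        return cmd["scene"]
    if kind == "insert_node":
        find(scene, cmd["parent"])["children"].append(cmd["node"])
    elif kind == "delete_node":
        parent = find(scene, cmd["parent"])
        parent["children"] = [c for c in parent["children"] if c["id"] != cmd["id"]]
    elif kind == "change_field":
        find(scene, cmd["id"])["fields"][cmd["field"]] = cmd["value"]
    else:
        raise ValueError("unknown command: " + kind)
    return scene

scene = {"id": "root", "fields": {}, "children": [
    {"id": "ball", "fields": {"color": "red"}, "children": []}]}
scene = apply_command(scene, {"kind": "insert_node", "parent": "root",
                              "node": {"id": "table", "fields": {}, "children": []}})
scene = apply_command(scene, {"kind": "change_field", "id": "ball",
                              "field": "color", "value": "blue"})
scene = apply_command(scene, {"kind": "delete_node", "parent": "root", "id": "table"})
```

Each command is a one-off change to the scene, in contrast to the continuous value updates that the animation stream provides.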
                              Version 2

• Intellectual Property Management &
  Protection (IPMP)
• Advanced BIFS
• MPEG-4 File Format
• Coding of 3D Meshes
• Body Animation
                            Advanced BIFS
• Multi-user functionality to access the same scene.
• Advanced audio BIFS for more natural sounds,
  and sound environment modeling (air absorption,
  natural distance attenuation).
• Face and body animation.
• Proto and Externproto and Script VRML
• Other VRML nodes not included in version 1.
      MPEG-4 File Format (MP4)
• Designed to contain the media information
  of an MPEG-4 presentation in a flexible
  extensible format that facilitates
  interchange, management, editing and
• The design is based on QuickTime® format.
• Composed of object-oriented structures
  called atoms.
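The atom structure can be sketched as a size-and-type walk over the file bytes, assuming each atom starts with a 32-bit big-endian size (which counts the 8-byte header) followed by a four-character type. The special size values used for very large atoms and for atoms that run to the end of the file are left out, and the payloads below are placeholders rather than valid atom contents.

```python
import struct

def make_atom(kind, payload=b""):
    """Build one atom: 4-byte size, 4-byte type, then the payload."""
    return struct.pack(">I4s", 8 + len(payload), kind) + payload

def parse_atoms(data):
    """Return [(type, payload)] for the atoms found back to back in data."""
    atoms, pos = [], 0
    while pos + 8 <= len(data):
        size, kind = struct.unpack_from(">I4s", data, pos)
        if size < 8 or pos + size > len(data):
            raise ValueError("bad atom size")
        atoms.append((kind.decode("ascii"), data[pos + 8:pos + size]))
        pos += size
    return atoms

inner = make_atom(b"mvhd", b"\x00" * 4)
data = make_atom(b"ftyp", b"mp42") + make_atom(b"moov", inner)
top = parse_atoms(data)
nested = parse_atoms(top[1][1])  # a container atom holds further atoms
```

Because a reader can skip any atom it does not understand by its size, new atom types can be added without breaking older parsers.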
                         MPEG-J
• Specification of Java APIs in the MPEG-4 System
  (Scene Graph, Resource Manager, etc.)
• Content creators may embed complex control and
  data processing mechanisms to intelligently
  manage the operation of the audio-visual session.
• The Java application is delivered as a separate ES to
  the terminal, then directed to the MPEG-J run time
                Coding of 3D Meshes

• Coding of generic 3D meshes to efficiently
  code synthetic 3D objects.
• LOD (Level of Detail) scalability to reduce
  rendering time for objects that are distant
  from the viewer.
• 3D progressive geometric meshes (temporal
  enhancement of 3D mesh).
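Level-of-detail selection can be illustrated with a distance lookup. The thresholds and mesh names are made up for the example, and a real renderer would likely weigh screen size and available rendering time as well.

```python
def pick_lod(levels, distance):
    """levels: [(max_distance, mesh_name)] sorted by max_distance.

    Returns the most detailed mesh whose range covers the distance,
    falling back to the coarsest one.
    """
    for max_distance, mesh in levels:
        if distance <= max_distance:
            return mesh
    return levels[-1][1]

levels = [(10, "high"), (50, "medium"), (float("inf"), "low")]
near = pick_lod(levels, 5)
far = pick_lod(levels, 1000)
```

Distant objects get the coarse mesh, so fewer polygons are rendered where the extra detail would not be visible.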
                       Body Animation

• A body is an object capable of producing
  virtual body models and animations in form
  of a set of 3D polygon meshes ready for
