Introduction to Video Search Engines
David C. Gibbon
Abstract: The emergence of video search engines on major web search portals
and market forces such as IPTV and mobile video service deployments and the
growing acceptance of digital rights management technologies are enabling new
application domains for multimedia search. This tutorial will give participants a
more complete understanding of the development, current state of the art and
future trends of multimedia search technologies in general, and video search
engines in particular. Participants will learn the relationships between multimedia
search and conventional web search and the capabilities and limitations of
current multimedia retrieval systems.
It is difficult to imagine the Web without search engines. The Web has become
so vast that without automated tools to locate items of value, much of the content
on the Web would be inaccessible and therefore of little use. In spite of the
successes of search engines, there is clearly room for improvement in accuracy
of the search results and this presents an opportunity for technology innovation.
The ability to search media content is an aspect of search engines that has
historically received less attention, but rapid development is underway in this
area. The top internet sites (Google, Yahoo, AOL, and MSN) now all offer video search.
Technology evolution has set the stage for rapid growth of video search engines:
research and prototyping has been underway for several years, broadband
access is ubiquitous, streaming media protocols and encoding standards are
mature. Disk and processor cost reductions are making it possible to store and
index large volumes of digital media and create indexed on-line archives. Market
forces such as the emergence of IPTV, mobile video services, and download
services such as iTunes and the growing acceptance of digital rights
management technologies are fueling these trends. Further, major broadcasters
are beginning to publish primetime content on the internet.
Video comes from a wide variety of sources with a wide range of quality levels for
different applications. The production costs for the evening national news can be
several thousand dollars per minute while a video blog (vlog) costs almost
nothing to produce. Video search is valuable for both applications, but the former
may warrant manual indexing while the latter requires automatic methods. Video
may be classified by its origin: consumer, enterprise (semi-pro), and professional.
Extremes include unauthored web-cam feeds on the low end, and HD-DVD or IMAX
video at the high-value end. Latency is another dimension: on-demand, live,
interactive, duplex (telephony). Archival value varies: video mail, 24 hour news
channel, C-SPAN, home video, movies, forensics, etc. Applications include
entertainment, communications, advertising, education, medical, industrial,
scientific, law enforcement, legal, surveillance.
Challenges of video search: Searching requires browsing sets of candidate
results. Video is a serial (or linear) medium: if paused, only a single frame
remains, audio is lost. Text is parallel and can therefore be browsed easily. Video
storage and transmission requirements are several orders of magnitude greater
than those for text. Textual features (characters, words) are well defined, can be
efficiently encoded, and are limited in number. Video features (edges, colors,
motion) and acoustic features (pitch, energy) are ill-defined, computationally
expensive to extract, and bulky to represent. In fact, there is no consensus on
which features are best for a given application. Furthermore, users can
formulate textual queries easily using a keyboard so that, to a first approximation,
the IR problem reduces to a symbol look-up (i.e. find me the documents
containing this word). For video databases, the query-response cycle is cross-
modal (enter text, retrieve video). Query by image content involves specifying
image or video attributes perhaps with a graphical tool which is tedious at best.
Query by example requires a seed search to bootstrap the process. The
challenges of video search as compared to text search are summarized as follows:
1. Crawling: video is more likely to be accessed via a web application and
inaccessible to simple spiders. This is a problem for text search as well (the
invisible web), but to a lesser extent. Access to video is more often restricted
to registered users than access to text files, and video may be protected by DRM.
2. File formats: Text search engines handle HTML format and perhaps PDF,
while video media comes in a wide variety of formats.
3. Link Rot: video files are often too large to be maintained indefinitely on
web servers. News content in particular may be published only for a
limited time.
4. Duplication: compared to text, media files are much more likely to appear
on multiple sites, and duplication is much more difficult to detect.
5. Caching: the size of video media makes caching costly, and caching may
raise rights issues.
6. Ranking: referral based ranking may be used, but it is more difficult to
implement due to the duplication and web application issues (above).
Term frequency ranking methods are typically inapplicable.
7. Browsing: summary generation or hit context extraction is more involved
with video, and may introduce copyright issues. The time cost of viewing a
document is higher due to buffering delays and the nature of linear media
so users will tolerate fewer false positives (smaller top “n”).
Metadata vs. content: metadata is ‘data about data’, or here ‘data about media’.
Global metadata refers to the entire media asset (file) and typically includes a title,
author, copyrights, etc. Example: MLB.com contains detailed metadata about
each video clip. Metadata may also be time-varying, applying only to a segment
of the asset, however this is less common. Content-based retrieval involves the
use of an index that is derived (typically automatically) from the media streams
and almost always includes a temporal attribute. A transcript of the dialog is
considered “content” and not metadata.
2. Video Data Sources and Applications:
Video search engine technology can be applied to a wide range of applications,
each with its own challenges and opportunities. Video data sources have varying
properties, availability, attributes and volumes. Available metadata for different
types of video includes: electronic program guide, transcripts, RSS feeds, Closed
Captions, etc. Example video sources and applications include: movies / DVDs,
broadcast television, digital video recorders (DVR), consumer video, and internet
sources such as Podcasts and video blogs. Alternative sources for metadata are
web mining (for related media, e.g., news video / web news feeds) and social sources.
An electronic program guide encodes metadata for broadcast television
(terrestrial, satellite, and cable) content and is used to build web-based guides
and for consumer DVRs. XML Sources include Tribune in the US: zap2it labs
(labs.zap2it.com) provides free SOAP services for developers. XML-TV provides
a unified format for guide data from international sources. Typically 2 weeks of
coverage – may be bulky. Deviations from the schedule (e.g., live sports events
running into overtime) are not handled. In addition to the schedule information, EPGs may include
brief descriptions for serial content, but typically not news story titles (e.g.
“roundup of today’s news”.) Available DVRs include Tivo, Windows Media Center
Edition, MythTV, Freevo.
For movies, IMDb maintains a detailed collection of metadata for over 500,000
items (movies, TV episodes) and includes plot summaries, actors, directors, in
addition to typical metadata such as genre, rating, title. Box art is also available.
The database is available for download, and some tools are provided. AMG
(allmediaguide.com) provides a similar database for a fee. These databases are
used by DVR systems and VoD systems to augment EPG data. Also, they can
be used for personal media jukebox systems in the same way as CDDb, freedb,
AMG are used for music.
RSS (rich site summary or really simple syndication) descriptors were developed
for HTML news feeds but have evolved into the de facto metadata encoding
standard for web media. Simple, yet some complications due to multiple versions
and alternatives such as Atom. Podcasts are RSS 2.0 feeds that include
enclosure tags to point to media such as audio (MP3, AAC) or video (MP4) files.
Yahoo’s Media RSS includes extensions to the enclosure tag and is gaining
popularity. There are many tools and packages for parsing RSS feeds and since
they are XML format, it is easy to develop custom tools. Available metadata
typically includes title, author, date, language, and category (genre). RSS is used
for amateur (blogs) and professional content (e.g., TV news clips, radio programs).
Figure 1: A segment showing metadata from an RSS 2.0 Podcast
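Since Podcast feeds are plain XML, a custom extraction tool of the kind mentioned above is easy to sketch. The following example uses only Python's standard library; the feed content is invented for illustration:

```python
# Sketch: extracting podcast metadata (title, enclosure URL, MIME type)
# from an RSS 2.0 feed. The feed below is an invented example.
import xml.etree.ElementTree as ET

FEED = """<rss version="2.0">
  <channel>
    <title>Example News Podcast</title>
    <item>
      <title>Morning Roundup</title>
      <pubDate>Mon, 06 Mar 2006 08:00:00 GMT</pubDate>
      <enclosure url="http://example.com/roundup.mp4"
                 type="video/mp4" length="10485760"/>
    </item>
  </channel>
</rss>"""

def parse_enclosures(feed_xml):
    """Return (item title, media URL, MIME type) for each enclosure."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        enc = item.find("enclosure")
        if enc is not None:  # keep only items that actually carry media
            results.append((item.findtext("title"),
                            enc.get("url"), enc.get("type")))
    return results

print(parse_enclosures(FEED))
```

A crawler would fetch the feed over HTTP and then download each enclosure URL for media processing.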
In addition to the metadata sources described above, media content in the form
of text or text streams is available from the web and other sources. Transcripts of
TV program dialogs are available on the web for some content, (e.g.
cnn.com/transcripts.) Others are fee-based services such as Burrelles.
Transcripts include some level of speaker ID and may include descriptions of
visuals or audio not in the dialog. For NTSC systems (US) closed captioning is
used to make TV accessible to hearing impaired viewers. A paraphrased
transcription (often real-time transcribed) is encoded in ASCII and transported in
line 21 of the video. Video capture hardware decodes this and software APIs
expose this to applications (e.g., MS DirectShow filters, ccdecoder). EIA-608 is
the engineering specification for CC and EIA-708 for DTV adds many additional
capabilities including scalability, more languages. FCC mandates dictate which
programs must be captioned.
Media files contain embedded global metadata such as frame size, and may contain title
and author fields (MP3 tags, WMV); container formats such as DVR-MS may carry MPEG-2 closed captions, etc.
DVS/DVI: Descriptive video services or information are sometimes available via
the secondary audio program (SAP) for visually impaired viewers.
DVD subtitles are bitmaps, unlike captions which are nearly ASCII. This provides
more flexibility, but requires optical character recognition for extraction. Some
OCR tools require supervision. Many tools are available for this (subtitle rippers).
Most DVDs include closed captions as well as subtitles.
Social sources: Web users extract movie subtitles, translate them, and post them to the
web (fan translation, fansubs). Extended character sets and encodings are
needed to represent multilingual texts. Social tagging (a.k.a. folksonomy) may be
applied to video, wherein any user may enter metadata which is later indexed for
retrieval.
Contextual data: Metadata can be inferred from a URI anchor pointing to a media
file, or from web page context. Some systems make use of proximate text, not
only in the HTML file, but based on the spatial layout of elements after parsing
and rendering the page. Video blogs may contain textual descriptions related to
videos. Contextual sources tend to be noisy.
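A simple form of contextual inference can be sketched by tokenizing a media file's URI into candidate terms. The URL, stop list, and filtering rules below are invented for illustration; real systems also weigh proximate page text, which is noisier:

```python
# Sketch: inferring candidate metadata terms from a media file URI.
import re

# Tokens that carry no descriptive value (protocol, extensions, etc.)
STOP = {"http", "https", "www", "com", "net", "videos", "video",
        "media", "clip", "wmv", "mp4", "mov"}

def terms_from_uri(uri):
    """Split a URI on common separators into lowercase candidate terms."""
    tokens = re.split(r"[/\.\-_?=&:]+", uri.lower())
    return [t for t in tokens if t.isalpha() and len(t) > 2 and t not in STOP]

print(terms_from_uri("http://www.example.com/videos/space-shuttle_launch.mp4"))
# -> ['example', 'space', 'shuttle', 'launch']
```

Terms recovered this way would typically be indexed with a lower weight than authored metadata, reflecting their noisiness.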
3. Internet Video:
This section provides background on internet video systems, including the
handling and transport of video which is an integral component of video search
engines. Topics include bandwidth, compression, seeking, streaming, standards,
digital rights management, transcoding, multi-rate streams, redirector files.
Multimedia content adaptation for mobile devices is related to summary
extraction. Over time internet video standards gain and lose popularity. Broader
industry standards include MPEG, IPTV and mobile digital video standards.
Architectures for developing Web video applications include browser plug-ins and
external media players.
Spatial resolution: number of pixels in a frame, temporal resolution: frames per
second, flicker fusion frequency: human perceives smooth motion (24, 25, 29.97,
30 are common.) Extreme examples are high-speed photography and time lapse
(web cams and security cameras may operate at 1 frame per second.) Aspect
ratio: frame: width / height, pixel may not be square. Compression: DCT removes
spatial redundancy, motion compensated frame differencing removes temporal
redundancy. Intra-coded, I-frame, reference frame inserted at scene changes or
periodically for error resilience. Seeking limited to I frames. Two pass VBR
coding much higher quality than realtime. Video conferencing H.263 using
conditional replenishment may have no I frames. H.264/AVC = MPEG-4 Part 10
improves compression, slices permit seeking.
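The need for compression can be made concrete with rough arithmetic; the frame size, bit depth, and target bitrate below are chosen for illustration:

```python
# Back-of-the-envelope: raw vs. compressed video bandwidth.
# 640x480 frames, 24 bits/pixel, 30 fps -- illustrative values.
width, height, bits_per_pixel, fps = 640, 480, 24, 30

raw_bps = width * height * bits_per_pixel * fps   # raw bitrate, bits/s
print(raw_bps / 1e6)             # 221.184 Mbit/s uncompressed

compressed_bps = 1.5e6           # an illustrative ~1.5 Mbit/s encoding target
print(raw_bps / compressed_bps)  # compression ratio on the order of 150:1
```

Ratios of this magnitude are why DCT coding plus motion-compensated frame differencing (rather than storing raw frames) is essential for storage and transmission.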
IPTV uses MPEG-4 or Microsoft VC-1; mobile uses MP4 or H.263 with a 3GP container.
Transcoding typically involves decoding a video file and re-compressing in a
different format. Transrating refers to recoding at a lower bit rate in the same
format. Multirate streaming packs two or more alternative representations of
media content encoded at different bandwidths into a single file. Systems can
switch dynamically depending on available bandwidth. Related to “scalable
coding.”
Codecs vs. Transport or streaming systems. Windows, Real, Quicktime, Flash –
download player requirement / cross platform / tie-in to personal media library.
A number of smaller players exist (e.g., Java Media Framework). WM services can stream multi-bit-rate content.
Digital rights management systems: prevent unlawful copying of media files.
Support policies such as hold for up to 30 days, play for 24 hours. For IPTV
systems, related to “conditional access”.
For high-quality applications, where bandwidth precludes real-time delivery, local cache
management systems are used (e.g., Kontiki, AOL, MSN). Content distribution networks store
popular content closer to end-users to reduce network load.
Integration into web applications (search engine clients):
1) link to file, use server MIME type to launch player
2) link to a metafile – supports failover, offsets
3) embedded player – client side scripting.
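Option 2 can be illustrated with a minimal Windows Media ASX redirector (metafile); the URL here is a placeholder, and real deployments typically add failover entries and clip offsets:

```xml
<asx version="3.0">
  <entry>
    <title>Sample clip</title>
    <!-- primary stream; a second <ref> entry would act as failover -->
    <ref href="mms://streams.example.com/clips/sample.wmv"/>
  </entry>
</asx>
```

Because the page links to this small text file rather than the media itself, the server can change stream locations or bitrates without touching the referring pages.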
4. Video Search Engine Systems Architectures:
Most systems share a common high-level architecture including:
1) content acquisition
2) media processing and representation for indexing (e.g. XML, MPEG-7)
3) transcoding for devices
4) media and metadata storage
5) query engine
6) user interface rendering
7) media navigation
Design tradeoffs include index size versus retrieval speed, client device support,
scalability, media browsing capabilities offered. Design considerations:
1) Broadcast focused – supporting temporal queries on a specified
channel and time interval
2) Title focused – movies, TV series
3) Author focused – blogs
4) Security – open, PPV, membership, groups
5) Event based: webcasts, a virtual space exists before, during, and after the event
Figure 2: Typical video search engine architecture; dashed lines indicate optional components
Acquisition (ingest) is the process of inserting a media item into a database and is
often accompanied by media processing to generate an index for that media
item. Acquisition may involve analog to digital conversion and compression of
video material, but most often the source content is in digital form. Web crawling
is useful for random media files linked to web pages, but most video–centric sites
cannot be crawled easily and may employ DRM. RSS feeds and Podcast feeds
in particular are good sources for acquisition since the metadata is well defined,
and the media is downloadable via HTTP. In addition to the pull model, a push
mechanism which requires users to provide metadata is suitable for applications
such as consumer video hosting (Google video, YouTube). Engines must guard
against copyright infringement where posters rip content and submit it. Typically
removing the content after the rights owner gives notice of infringement is
sufficient. Content is often copied, stored, and replayed from the search engine, not
from the origin site. Redundant content is a problem for crawlers and EPG based
systems and is best handled at acquisition time to conserve system resources.
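A first-pass duplicate check at acquisition time can be sketched with content hashing. This catches byte-identical copies only; re-encoded duplicates evade it, which is why production systems add perceptual (feature-based) matching. All names and data here are invented:

```python
# Sketch: exact-duplicate suppression at ingest via content digests.
import hashlib

class Ingest:
    def __init__(self):
        self.seen = set()          # digests of media already acquired

    def acquire(self, media_bytes):
        digest = hashlib.sha1(media_bytes).hexdigest()
        if digest in self.seen:
            return False           # duplicate -- skip storage and indexing
        self.seen.add(digest)
        return True                # new item -- proceed with processing

ing = Ingest()
print(ing.acquire(b"fake-video-payload"))   # True
print(ing.acquire(b"fake-video-payload"))   # False (duplicate rejected)
```

Rejecting the duplicate before media processing is what conserves system resources: transcoding and indexing are far costlier than computing a digest.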
An acquired entity may consist of more than one stream or file. DVDs / TV
broadcasts often have multiple audio tracks, and multiple transcriptions
(multilingual subtitles, CC, transcripts). Presentations may include a speaker
video and a stream of slides, and perhaps an audience stream. Advanced
systems, and systems for video conferencing may include several cameras at
each location, resulting in multiple audio/video streams. For a live sports event,
producers dynamically choose among several cameras to produce the broadcast.
One might be interested in searching the composite stream, but viewing all of the
constituent streams at a point in time of interest. At a lower level, media may be
partitioned into several files (e.g., 1 GB .vob files). Recordings may include gaps (e.g.,
pause/resume recording of a video teleconference). A universal video search
engine should handle these cases, but most greatly simplify the problem by using
a file-centric architecture (one media file = one document).
Automated methods of processing the acquired media to extract index data will
be covered in the next section. XML is the logical choice for exchanging the
index data, and MPEG-7 defines standardized schemas for this purpose
including media information, production information and usage information.
However, many systems don’t use XML for efficiency reasons. Low-level features
are cumbersome to represent in XML. Some metadata datatypes lend
themselves well to traditional relational database designs (attributes such as
date, genre, actors, etc.), but for many applications the relations are trivial and
most of the value comes from free-text queries, which suggests that text IR
systems architectures are more appropriate. In any case, pointers into the media
must be maintained, and media file formats must be observed. As a result, many
systems resort to standard filesystem storage. This can make consistency
management difficult, but the transaction load is low which ameliorates the
situation (inserts are infrequent, and deletions are quite rare). Lucene is a good open
source text IR system supporting multiple fields; MS Indexing Service is an
alternative. If media offsets are not needed, these systems can be used out-of-
the-box. XML IR engines are maturing.
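The requirement to maintain pointers into the media can be sketched as an inverted index whose postings carry time offsets, so a hit seeks the player to the right point rather than the start of a long program. All names and data below are invented:

```python
# Sketch: a word -> (asset, time-offset) inverted index for media retrieval.
from collections import defaultdict

index = defaultdict(list)   # term -> list of (asset_id, seconds)

def add_transcript(asset_id, timed_words):
    """Index a transcript given as (seconds, word) pairs."""
    for seconds, word in timed_words:
        index[word.lower()].append((asset_id, seconds))

def query(term):
    """Return postings: each hit identifies an asset and a seek point."""
    return index.get(term.lower(), [])

add_transcript("news_20060306", [(12.0, "election"), (95.5, "weather")])
add_transcript("news_20060307", [(7.2, "election")])

print(query("election"))
# -> [('news_20060306', 12.0), ('news_20060307', 7.2)]
```

Systems that do not need media offsets can drop the seconds field and use a stock text IR engine out of the box, as noted above.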
Another major component is the query processing engine which is closely related
to results rendering capabilities to form the user search interface. Real-time
performance is required, thin clients are desirable. Standard HTML preferred, but
may limit ability to render highly interactive media browsing UIs. At a minimum, a
media player control of some kind is required. Challenge: rich (large files) linear
media vs. rapid browsing. Pre-computed media summaries may be utilized, or
summaries and extracts may be generated dynamically in response to user
queries. UI functions include rendering sets of query results and navigation with
links to individual media assets. Within each asset, navigation and browsing
features are required for long-form content, and are desirable for all content.
To handle different device capabilities (processor, display, input), multiple UIs
may be employed, using template technologies such as XSL. Content adaptation
methods permit content to reach a wide audience without re-authoring. Concise
representations and trans-modal summaries serve a dual role: enabling fast browsing on
highly capable workstations while serving as surrogates for low-power mobile devices.
Transcoding is often employed to “normalize” video compressed in a wide range
of formats to a single format so the users don’t need to install multiple media
players and so that a consistent user interface can be ensured. Transcoding can
also be used to manage video search engine bandwidth (e.g. convert high
resolution source material to a limited maximum bandwidth), or to create
alternative representations for alternative devices such as mobile devices. Client
decoding capabilities and connection bandwidth should also be taken into account.
In addition to video search engines, related architectures are digital asset
management systems (DAM) or media asset management (MAM) and
multimedia database approaches used by the digital library and data
warehousing communities. Content distribution networks (CDN) and Video on
Demand (VoD) systems must also manage metadata, access control, and may
include search functions.
5. Media processing
Automated media processing and indexing techniques supplement available
authored metadata and transcriptions. In many cases, traditional metadata is not
available or is not detailed enough for search applications. Media processing is
also used to create browsing interfaces by extracting thumbnails, video
summaries, etc. We will focus on video, audio, and textual media individually,
and then describe multimodal processing. For each medium, three processing
areas are significant:
1) segmentation or dividing the content into relatively homogenous units:
critical for improving precision/recall for IR systems
2) feature extraction for indexing applications: provides low-level semantics
valuable for retrieval, but often “noisy”. VSE application is noise-tolerant.
3) summarization for browsing
6. Video Processing
Major topics include shot boundary detection (scene change detection, content-
based sampling) scene classification, face detection, face recognition, optical
character recognition, representative image selection, browsing via mosaicing
and skimming. In limited domains, object detection / recognition are successful,
but not in the general case. As a practical matter, video search engines must
detect black or constant-color frames and not use these as thumbnails.
Example: content based sampling using order statistic filtering of motion
compensated block matching results. The extracted frame rate varies with
content: C-SPAN, commercials, video conferences.
Example: face detection using supervised learning with boosting, committees of
weak learners, and Haar image feature detectors.
Shot boundary detection segments produced video into its constituent shots,
while content-based sampling goes beyond this to sample frames within shots as
well in cases where the visual contents have significantly changed (e.g. long
camera pan or zoom.) Video motion analysis algorithms (e.g., gait detection) can
be used to flag scenes containing specific actions (people walking).
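A minimal sketch of shot boundary detection by thresholding inter-frame histogram distance follows. The tiny "frames" (flat lists of pixel intensities), bin count, and fixed threshold are all invented for illustration; real detectors add motion compensation and adaptive thresholds, as described above:

```python
# Sketch: shot-boundary detection via normalized histogram differencing.
def histogram(frame, bins=4, max_val=256):
    """Coarse intensity histogram of a frame (list of pixel values)."""
    h = [0] * bins
    for p in frame:
        h[p * bins // max_val] += 1
    return h

def shot_boundaries(frames, threshold=0.5):
    """Return frame indices where the histogram distance jumps."""
    cuts = []
    for i in range(1, len(frames)):
        a, b = histogram(frames[i - 1]), histogram(frames[i])
        # L1 distance normalized to [0, 1]
        dist = sum(abs(x - y) for x, y in zip(a, b)) / (2 * len(frames[i]))
        if dist > threshold:
            cuts.append(i)
    return cuts

dark  = [10, 20, 15, 5]        # frame from a dark shot
light = [240, 250, 230, 245]   # frame after a cut to a bright shot
print(shot_boundaries([dark, dark, light, light]))   # [2]
```

Histogram distance is deliberately insensitive to small object motion within a shot, which is why it serves as a common baseline before motion-compensated methods.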
Video analysis can be used in combination with image processing methods to
detect images containing faces (face detection). Hybrid methods based on
motion, edges and color features improve robustness. After detection, face
recognition can be used to help determine the number of actors and which actors
are more significant (major cast detection, anchor person detection.)
Performance is good for a single video program but degrades if applied across a
large number of programs or when used with a large set of candidates.
Video optical character recognition is the process of detecting, segmenting
(foreground background segmentation, not temporal segmentation) and
translating text in video into a standard format (e.g., ASCII) and can be used to
generate a text stream which can be used for IR. Segmentation is a challenge,
and filtering must be performed to remove redundant detections in consecutive
frames. Also, the result may be irrelevant to the IR task (e.g. news ticker crawl is
often unrelated to video.)
Scene classification attempts to assign semantic labels to segmented video (e.g.
outdoors vs. indoors) based on low-level features such as color or motion.
For browsing purposes, video segments are represented typically by a thumbnail
(reduced spatial resolution) image corresponding to the first frame of the
segment. However, other methods are used such as longest dwell time,
minimum difference, and minimum motion. Mosaics combine multiple lower
resolution frames into a higher resolution frame and can be used to create a
montage that represents a pan sequence with a single image. Skimming
methods transform longer video sequences to shorter ones while attempting to
preserve the content as much as possible (e.g., scenes with little or no motion
may be shortened or skipped).
7. Audio Processing
Audio can be segmented based on energy or other acoustic features. Voice
activity detectors can detect pauses between spoken words. Classifiers can
detect music or speech. Speaker segmentation attempts to divide an audio
stream into units containing only a single speaker, but is hampered by multiple
simultaneous speakers, or short segments. At a systems level, cross correlation
can be used to synchronize two or more audio streams (e.g., slides and video).
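Energy-based segmentation can be sketched as follows; the synthetic signal, frame length, and fixed threshold are invented for illustration, and real systems use adaptive thresholds plus additional features such as zero-crossing rate:

```python
# Sketch: energy-based segmentation of audio into speech/pause frames.
def frame_energy(samples):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in samples) / len(samples)

def segment(signal, frame_len=4, threshold=100.0):
    """Label each non-overlapping frame 'speech' or 'pause' by energy."""
    labels = []
    for i in range(0, len(signal) - frame_len + 1, frame_len):
        e = frame_energy(signal[i:i + frame_len])
        labels.append("speech" if e > threshold else "pause")
    return labels

loud  = [120, -110, 130, -100]   # high-energy (voiced) frame
quiet = [2, -1, 3, -2]           # low-energy background noise
print(segment(loud + quiet + loud))   # ['speech', 'pause', 'speech']
```

Runs of "pause" frames give the pause boundaries used for word gaps and for the silence-removal playback described later.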
Speaker identification can be employed to assign an attribute to segments to
indicate which speaker is talking, and may possibly be used for retrieval from
larger collections. Methods involve training from labeled datasets to build models
(e.g., GMM) based on extracted acoustic features.
Many speech and audio processing tasks use similar frame-based features
such as energy and Mel Frequency Cepstral Coefficients (MFCCs), and may
include longer range features such as pitch period, etc.
Automatic speech recognition (speech-to-text conversion) is very successful for
applications with high quality recordings and in which the speech engine has
been trained on a large corpus of labeled data. 1-best transcriptions can be fed
directly to traditional text IR systems, however sentence boundaries and
punctuation are not usually available. Also, the text is “normalized” which may
pose a problem for numeric queries. Systems have very large vocabularies
(200K, or more) but will fail for “out-of-vocabulary” queries. Often these are
content words or new terms. More advanced systems, or systems designed for
noisy data (e.g. telephony) go beyond the 1-best hypothesis and retain n-best
probability lattices at the word or phoneme level. For retrieval, a threshold is
applied to control precision vs. recall.
For audio browsing, processing methods that increase playback speed while preserving
the pitch are employed. For speech systems, gap (silence) removal can also be applied.
8. Text Processing
Text is the most accessible medium and text based information retrieval (IR)
methods are mature. The goal of ASR, OCR, CC extraction for indexing is to
leverage these methods.
Topic (or story) segmentation divides documents such as newscasts into units
based on statistical natural language processing methods such as co-
occurrences of content words. For IR applications, topic segmentation improves
document ranking since common metrics normalize based on document length
(so video programs should be divided into multiple “documents” or story-units).
Synonymy and polysemy complicate text IR systems. Text features are typically
words or stemmed forms. Bag-of-words methods ignore word order; vector space
models treat documents as sparse high-dimensional vectors, while latent
semantic indexing (LSI) attempts to reduce the dimensionality. More advanced
features include term proximity (and, with text streams, temporal proximity).
Relevance feedback circumvents polysemy by dynamically building and testing
classifiers and involves a document distance metric. IR performance metrics
include precision, recall, F-measure. Additional natural language processing
methods are related to IR, such as entity extraction, parsing, and part-of-speech tagging:
1. Closed caption case restoration (improves browsing) using statistical
language models from web sources.
2. Text alignment using minimal edit-distance can be used to replace real-
time transcripts with offline transcripts of greater accuracy.
3. Relevance feedback
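The alignment in item 2 can be sketched with a word-level edit distance computed by dynamic programming; both transcripts below are invented examples:

```python
# Sketch: word-level edit distance between a noisy real-time caption and a
# cleaner offline transcript -- the core of caption-replacement alignment.
def edit_distance(a, b):
    """Classic Levenshtein DP over word sequences."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(a)][len(b)]

realtime = "the presdent said tonight".split()
offline  = "the president said tonight".split()
print(edit_distance(realtime, offline))   # 1 (one substituted word)
```

Backtracking through the same DP table yields the word-to-word correspondence, which lets the offline text inherit the real-time caption's timestamps.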
Browsing: text summarization or extracting most relevant segments and
highlighting keywords. Keyword extraction: term frequency based, using PoS
tagging. Entity extraction can be used to create links from media to other
resources on the web (as is done with text documents.) Relevance feedback
and links to “similar” documents may be thought of as browsing tools.
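Term-frequency keyword extraction can be sketched as follows; the document text and stop list are invented, and a simple stop list stands in for the part-of-speech tagging described above:

```python
# Sketch: term-frequency keyword extraction with a stop list.
from collections import Counter
import re

STOP = {"the", "a", "of", "and", "to", "in", "on", "is", "for", "was"}

def keywords(text, k=3):
    """Return the k most frequent non-stop-word terms."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP)
    return [w for w, _ in counts.most_common(k)]

doc = ("The shuttle launch was delayed; the launch window for the shuttle "
       "closes in June, and the crew of the shuttle waited.")
print(keywords(doc))   # ['shuttle', 'launch', ...]
```

The extracted terms can then be highlighted in result snippets or used to build links from media to related web resources.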
9. Multimodal Processing:
Greater accuracy can be realized by combining results from processing individual
media streams in isolation. Tasks that strive to extract higher level semantics
require collaborative processing of media components. Multimodal story
segmentation combines video processing (e.g., identifying a fade to black) and audio
cues (e.g., music detection) with traditional text-domain methods to
improve accuracy and boundary localization. Anchorperson detection may
involve face and speaker identification. Media (CC) synchronization is a simple
form of multimodal processing where ASR methods are used.
1. Story segmentation: anchorperson detection, speaker segmentation
2. Named faces: named entity extraction, face recognition, video OCR
10. Available Search engines
Google normalizes to Flash format. Early versions included closed caption
search and links to schedule (EPG) information to allow users to watch upcoming
airings. Singingfish was an early multimedia search engine and is still in use
today. Virage had a lot of visibility in 2000 and Blinkx currently licenses their
technology. The table below lists major search sites, as well as some of the
larger media search engines. Most sites support AVI, MPEG, Quicktime, Real
and Flash ingest formats, although some of the Podcast sites may be limited to
MP3, AAC and MP4 formats.
The National Institute of Standards and Technology (NIST) in the US sponsors TREC-VID, which
has been conducting evaluations since 2001 on video search topics including
shot boundary detection, scene classification, and search. There is broad
participation by research groups from academia and industry (e.g., CMU,
Columbia, IBM, Singapore, Dublin, and many others.) The Linguistic Data
Consortium at the University of Pennsylvania provides data for benchmarking
media processing algorithms. Spoken document retrieval is another TREC track
which is closely related to video search.
Site | Acquisition | Search | Features
Google | Broadcasters by special arrangement, posted via upload tools | Metadata, manual transcription, closed captions | Upload, download, purchase, video details (fixed interval sampling). Exclusively hosted Flash content.
Yahoo | MediaRSS, Web crawl, content acquisition program, sitematch (ABC, AOL, …) | Metadata, links, context | Incorporates Altavista.
MSN | Broadcasters by special arrangement (MSNBC, NBC, Fox) | Metadata | Mainly content from major broadcasters. Hosted WMV content.
AOL | Broadcasters by special arrangement, via upload tools | Metadata | Singing Fish, Truveo.
Blinkx | RSS | Metadata, ASR | Virage, Autonomy, desktop search, 1M hours, Flash.
TVEyes | Off-air broadcasts | Metadata | Podscope.
PodZinger | Podcasts (RSS) | Metadata | BBN.
Singing Fish | Crawl, RSS, YouTube | Metadata | 14M files.
Truveo | “visual crawling”, user supplied uploads | Media, context | Transcoding to hosted Flash.
Veoh | user supplied uploads | Metadata | P2P distribution, local cache mgmt, 20K.
Table 1: Properties of selected video search engines