Using HTML Metadata to Retrieve Relevant Images from the World Wide Web



Ethan V. Munson

University of Wisconsin-Milwaukee

Why is image search important?



• The Web is becoming the world’s primary information source

• Images are one of the Web’s key features

• Few WWW image search engines exist currently

• Using textual search engines to find images manually is laborious

A Requirement for Web Image Search



• We need an efficient method of discovering and indexing image content.

• Two main sources of information about image content:

– image processing

– associated text

• text content

• markup

Related work

• QBIC (the IBM Almaden Research Center)

– indexes and retrieves images according to:

– shape

– color

– texture

– object layout

– queries are formulated through visual examples

– a sample image

– user-provided sketches

Related work

QBIC system

QBIC: Advantages and Disadvantages

• Advantages

– well-developed visual query language

– interesting GUI

– queries are based on image appearance

• Disadvantages

– works only at the primitive feature level (color, texture, shape)

– doesn’t recognize the semantics of an image

• very sensitive to camera viewpoint

– doesn’t scale up to the Web

Related work

• WebSeek (J. Smith & S. Chang, Columbia University)

– performs a semi-automated classification of the images

• automatically extracts keywords from image file names

• computes the keyword histogram

• manually creates a subject hierarchy

• manually maps the images into the subject hierarchy

– User can

• browse the categories

• search the categories by keyword

• search the database using image features

– color content

WebSeek: Advantages and Disadvantages

• Advantages

– Large index of Web images

– Supports both text and image search

• Disadvantages

– Not clear that database can scale up

• Manual categorization is very expensive

– Relevance feedback mechanism is computationally expensive

Related work

• WebSeer (M. Swain et al., The University of Chicago)

– uses associated text and markup to supplement information derived from analyzing image content

– uses multiple kinds of metadata

• image file names

• alternate text

• text of a hyperlink

– decides which images are photographs, portraits, or computer-generated drawings

– research emphasized categorization, not metadata-based search

Why seek new image retrieval methods?



• WWW documents are growing rapidly in number and constantly changing

• We need fast and efficient methods for finding images

• Image processing is

– complex

– computationally expensive

– limited (misses true image semantics)

– unnecessary

Research Goals

• Show that images can be found using HTML “metadata”

– textual content

– HTML tag structure

– attribute values

• Determine which metadata features are the best clues to image content

The URL Filter

• assembles a list of URLs from the results returned by Alta Vista

– parses the first page returned by Alta Vista

– follows the URLs of the results pages, retrieves these pages, and parses them

– extracts the list of URLs from the results pages
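
The extraction step above can be sketched as follows. This is only an illustration under stated assumptions, not the system's actual code: `LinkCollector` and `extract_result_urls` are hypothetical names, the sample markup is invented, and the rule "links that leave the search engine's host are results" is an assumed heuristic. It handles a single results page; following the remaining results pages would repeat the same call.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every anchor element, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.urls.append(urljoin(self.base_url, href))


def extract_result_urls(results_html, base_url):
    """Return off-site links from one results page, in order, without duplicates."""
    collector = LinkCollector(base_url)
    collector.feed(results_html)
    collector.close()
    engine_host = urlparse(base_url).netloc
    seen, kept = set(), []
    for url in collector.urls:
        # Assumption: links on the engine's own host are navigation, not results.
        if urlparse(url).netloc == engine_host or url in seen:
            continue
        seen.add(url)
        kept.append(url)
    return kept


# Invented sample page standing in for a search engine results page.
SAMPLE = """<html><body>
<a href="http://example.org/paris.html">Paris</a>
<a href="/results?page=2">Next</a>
<a href="http://example.com/london.html">London</a>
<a href="http://example.org/paris.html">Paris again</a>
</body></html>"""

result_urls = extract_result_urls(SAMPLE, "http://search.example.net/results?page=1")
```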

The Crawler

• retrieves the pages

• saves each page’s HTML source code in a separate file
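
A minimal sketch of this step, assuming one numbered file per page. The names `crawl`, `fetch`, and `fake_fetch` and the file-naming scheme are hypothetical; the demo uses a stand-in fetcher so that it runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.request import urlopen


def fetch(url):
    """Download one page over the network (used only for real crawls)."""
    with urlopen(url, timeout=10) as response:
        return response.read()


def crawl(urls, out_dir, fetch_page=fetch):
    """Save each page's source in its own file; return a {url: path} mapping."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = {}
    for index, url in enumerate(urls):
        try:
            source = fetch_page(url)
        except OSError:
            continue  # unreachable page: skip it and keep crawling
        path = out_dir / f"page_{index:04d}.html"
        path.write_bytes(source)
        saved[url] = path
    return saved


def fake_fetch(url):
    """Stand-in fetcher for the demo; simulates one failing host."""
    if "unreachable" in url:
        raise OSError("simulated network failure")
    return f"<html><body>{url}</body></html>".encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    saved = crawl(
        ["http://a.example/", "http://unreachable.example/", "http://b.example/"],
        tmp,
        fetch_page=fake_fetch,
    )
    saved_names = sorted(path.name for path in saved.values())
    first_source = saved["http://a.example/"].read_bytes()
```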

“Tidy”

• converts arbitrary and probably ill-formed HTML into XHTML
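
To make the idea concrete, here is a toy normalizer. It is not Tidy and does not reproduce Tidy's behavior; it only shows the kind of repair involved (lowercasing tags, quoting attributes, closing unclosed elements). `ToyTidy`, `to_xhtml`, and the sample markup are hypothetical, and the short `VOID` list is an assumption.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements written as self-closing tags (partial list, assumed for the sketch).
VOID = {"br", "hr", "img", "input", "link", "meta"}


class ToyTidy(HTMLParser):
    """Re-emits HTML as well-formed markup; a toy, not a replacement for Tidy."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []  # currently open, non-void elements

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(
            f" {name}={quoteattr(value if value is not None else name)}"
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append(f"<{tag}{attr_text} />")
        else:
            self.out.append(f"<{tag}{attr_text}>")
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: ignore it
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append(f"</{open_tag}>")
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def to_xhtml(html):
    parser = ToyTidy()
    parser.feed(html)
    parser.close()
    while parser.stack:  # close anything still open at end of input
        parser.out.append(f"</{parser.stack.pop()}>")
    return "".join(parser.out)


xhtml = to_xhtml('<HTML><BODY><P>Tom & Jerry<BR><IMG SRC=cat.gif ALT="A cat"></BODY></HTML>')
root = ET.fromstring(xhtml)  # parses only because the output is well-formed
```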

XHTML Parser

• parses an XHTML document

• builds an XHTML parse tree
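
As an illustration, Python's standard `xml.etree.ElementTree` can build such a tree from well-formed XHTML. The sample document and the helper `outline` are hypothetical; the sample omits the XHTML namespace and DOCTYPE to keep tag names short.

```python
import xml.etree.ElementTree as ET

# Invented sample page (no namespace or DOCTYPE, for brevity).
XHTML = """<html><head><title>Sunset over Paris</title></head>
<body><h1>Paris</h1><p>Evening view <img src="img/sunset.jpg" alt="Sunset" /></p></body></html>"""


def outline(node, depth=0):
    """Return the element tree as indented tag names, one per line."""
    lines = ["  " * depth + node.tag]
    for child in node:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(XHTML)
tree_outline = outline(root)
```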

The Document Analyzer

• scans the parse tree for image URLs

– an image URL appears in either an image or anchor element

• converts relative URLs into absolute URLs

• uses various heuristics to determine which URLs point to relevant images
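
The first two steps can be sketched as below. The function name `image_urls`, the extension list used to recognize anchors that point at images, and the sample page are assumptions for illustration; the relevance heuristics are not shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

# Assumed set of extensions that mark an anchor target as an image.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def image_urls(root, page_url):
    """Scan the parse tree for image URLs and return them as absolute URLs."""
    found = []
    for element in root.iter():
        if element.tag == "img":
            candidate = element.get("src")
        elif element.tag == "a":
            candidate = element.get("href")
            # Keep an anchor only when its target looks like an image file.
            if candidate and not urlparse(candidate).path.lower().endswith(IMAGE_EXTENSIONS):
                candidate = None
        else:
            continue
        if candidate:
            absolute = urljoin(page_url, candidate)  # relative -> absolute
            if absolute not in found:
                found.append(absolute)
    return found


PAGE = (
    '<html><body><p><img src="pics/paris.jpg" alt="Paris" /></p>'
    '<a href="../big/eiffel.GIF">full size</a>'
    '<a href="about.html">about</a></body></html>'
)
urls = image_urls(ET.fromstring(PAGE), "http://example.org/travel/index.html")
```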

Search Strategies

• Image’s file name

• Textual content of the TITLE element

• Value of the ALT attribute of IMG elements

• Textual content of anchor elements

• Value of the title attribute of anchor elements

• Textual content of the paragraph surrounding an image

• Textual content of any paragraph located within the same center element as the image

• Textual content of heading elements
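
One possible way to test these clues against a query is sketched here. It assumes case-insensitive substring matching and treats the anchor clues as referring to an anchor that encloses the image; both are assumptions, as are the names `matching_clues` and `text_of` and the sample page.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def text_of(element):
    """All text inside an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


def matching_clues(root, img, query):
    """Return the names of the metadata clues whose text contains the query."""
    parents = {child: parent for parent in root.iter() for child in parent}

    def ancestor(node, tag):
        while node in parents:
            node = parents[node]
            if node.tag == tag:
                return node
        return None

    anchor = ancestor(img, "a")
    paragraph = ancestor(img, "p")
    center = ancestor(img, "center")
    title = root.find(".//title")
    clues = {
        "file name": urlparse(img.get("src", "")).path.rsplit("/", 1)[-1],
        "title": text_of(title) if title is not None else "",
        "alt": img.get("alt", ""),
        "anchor text": text_of(anchor) if anchor is not None else "",
        "anchor title": anchor.get("title", "") if anchor is not None else "",
        "paragraph": text_of(paragraph) if paragraph is not None else "",
        "center paragraphs": " ".join(text_of(p) for p in center.iter("p"))
        if center is not None else "",
        "headings": " ".join(text_of(h) for h in root.iter() if h.tag in HEADINGS),
    }
    wanted = query.lower()
    return [name for name, text in clues.items() if wanted in text.lower()]


# Invented sample page.
PAGE = """<html><head><title>Paris travel notes</title></head><body>
<h1>Photos</h1>
<p>The Eiffel Tower in Paris at dusk
<a href="big.html" title="Larger view"><img src="img/paris_01.jpg" alt="Eiffel Tower" /></a></p>
</body></html>"""

root = ET.fromstring(PAGE)
hits = matching_clues(root, root.find(".//img"), "Paris")
```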

Image Retrieval Experiment

Experimental Questions

• Which HTML features reveal the most information about an image?

– Do particular patterns of HTML structure carry useful information?

• Do image search results depend on the type of query?

Informal Experiments

• Conducted extensive informal testing

– to check software correctness

– to investigate possible metadata clues

– to determine rules for filtering out images based on size

• images smaller than 65 pixels in either dimension almost never contained useful content

• reduced the number of images we had to classify
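
The size rule can be written as a one-line predicate. The function name `is_candidate` and the sample data are hypothetical, and the sketch assumes each image's width and height in pixels are already known.

```python
MIN_SIDE = 65  # threshold from the informal experiments, in pixels


def is_candidate(width, height, min_side=MIN_SIDE):
    """Keep an image only if neither dimension is smaller than min_side."""
    return width >= min_side and height >= min_side


# Invented examples: (file name, width, height)
images = [("bullet.gif", 12, 12), ("banner.gif", 400, 40), ("paris.jpg", 320, 240)]
kept = [name for name, width, height in images if is_candidate(width, height)]
```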

Metadata Clues

1 Image’s file name

2 Textual content of the TITLE element

3 Value of the ALT attribute of IMG elements

4 Textual content of anchor elements

5 Value of the title attribute of anchor elements

6 Textual content of the paragraph surrounding an image

7 Textual content of any paragraph located within the same center element as the image

8 Textual content of heading elements

Query Categories

• Famous people

“Gorbachev”, “Yeltsin”, and “Streisand”

• Non-famous people

“Yelena” and “Ekaterina”



• Famous places

“Paris” and “London”



• Less-famous places

“Bremen” and “Spokane”



• Phenomena

“Explosion”, “Sunset”, and “Hurricane”

Experimental Procedure

• For each of the 12 queries

– Alta Vista returned 200 URLs (20 groups of 10)

– We used first, middle, and last groups (30 URLs)

– Downloaded pages and all images on pages

• excluding small images ( Text ”?

– Analysis of “header” clue is questionable

[Figure: two example parse trees rooted at Body, containing P and IMG nodes]

Conclusion



• Existing content-based image retrieval systems are not good models for Web image search

• HTML metadata is useful for Web image search

– Image file name and document title are most useful

– Alternate text is extremely precise, when present

• HTML metadata should provide faster image search than image processing approaches

– no need to download and analyze images

– can take advantage of existing search engines

Using HTML Metadata to Retrieve Relevant Images from the Web



Ethan V. Munson

Dept. of Electrical Engineering & Computer Science

University of Wisconsin - Milwaukee

munson@cs.uwm.edu

http://www.cs.uwm.edu/~multimedia


