Using HTML Metadata to Retrieve
Relevant Images from the World Wide
Web
Ethan V. Munson
University of Wisconsin-Milwaukee
Why is image search important?
• The Web is becoming the world’s primary
information source
• Images are one of the Web’s key features
• Few WWW image search engines exist currently
• Using textual search engines to find images
manually is laborious
A Requirement for Web Image Search
• We need an efficient method of discovering and
indexing image content.
• Two main sources of information about image
content:
– image processing
– associated text
• text content
• markup
Related work
• QBIC (the IBM Almaden Research Center)
– indexes and retrieves images according to:
• shape
• color
• texture
• object layout
– queries are formulated through visual examples
• a sample image
• user-provided sketches
Related work
QBIC system
QBIC: Advantages and Disadvantages
• Advantages
– well-developed visual query language
– interesting GUI
– queries are based on image appearance
• Disadvantages
– works only at the primitive feature level (color, texture,
shape)
– doesn’t recognize image semantics
• very sensitive to camera viewpoint
– doesn’t scale up to the Web
Related work
• WebSeek (J. Smith & S. Chang, Columbia University)
– performs a semi-automated classification of the images
• automatically extracts keywords from image file names
• computes the keyword histogram
• manually creates a subject hierarchy
• manually maps the images into the subject hierarchy
– User can
• browse the categories
• search the categories by keyword
• search the database using image features
– color content
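The file-name keyword step described for WebSeek can be illustrated with a small sketch. This is not WebSeek's actual code: the tokenization rule (runs of letters in the file name, lowercased) and the function names `filename_keywords` and `keyword_histogram` are assumptions made for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlparse, unquote


def filename_keywords(image_url):
    """Split the file name of an image URL into lowercase alphabetic terms.

    Assumed rule: drop the directory part and the extension, then keep
    every run of ASCII letters as one keyword.
    """
    path = unquote(urlparse(image_url).path)
    name = path.rsplit("/", 1)[-1]   # file name without directories
    stem = name.rsplit(".", 1)[0]    # file name without extension
    return [term.lower() for term in re.findall(r"[A-Za-z]+", stem)]


def keyword_histogram(image_urls):
    """Count how often each file-name keyword occurs across a set of URLs."""
    counts = Counter()
    for url in image_urls:
        counts.update(filename_keywords(url))
    return counts
```

A histogram like this could then guide the manual construction of a subject hierarchy, since frequent terms suggest candidate categories.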
WebSeek: Advantages/Disadvantages
• Advantages
– Large index of Web images
– Supports both text and image search
• Disadvantages
– Not clear that database can scale up
• Manual categorization is very expensive
– Relevance feedback mechanism is computationally
expensive
Related work
• WebSeer (M. Swain et al., The University of Chicago)
– uses associated text and markup to supplement
information derived from analyzing image content
– uses multiple kinds of metadata
• image file names
• alternate text
• text of a hyperlink
– classifies images as photographs, portraits, or
computer-generated drawings
– research emphasized categorization, not metadata-based
search
Why seek new image retrieval methods?
• WWW documents are growing rapidly in number and
changing constantly
• We need fast and efficient methods for finding
images
• Image processing is
– complex
– computationally expensive
– limited (misses true image semantics)
– unnecessary
Research Goals
• Show that images can be found using HTML
“metadata”
– textual content
– HTML tag structure
– attribute values
• Determine which metadata features are the best
clues to image content
The URL Filter
• assembles a list of URLs from the results returned by Alta
Vista
– parses the first page returned by Alta Vista
– follows the URLs of results pages, retrieves these pages, and
parses them
– extracts list of URLs from the results pages
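The URL-extraction step can be sketched as follows. The slides do not describe the markup of the results pages, so this sketch simply assumes that every anchor pointing away from the search engine's own host is a result; `LinkCollector` and `extract_result_urls` are hypothetical names, not the original implementation.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor element in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_result_urls(page_html, page_url):
    """Return absolute http(s) URLs that lead away from the results page's host.

    Assumption: links on the search engine's own host are navigation
    (next page, help, and so on) rather than search results.
    """
    collector = LinkCollector()
    collector.feed(page_html)
    engine_host = urlparse(page_url).netloc
    seen, results = set(), []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue
        if parts.netloc == engine_host:
            continue
        if absolute not in seen:   # keep the first occurrence only
            seen.add(absolute)
            results.append(absolute)
    return results
```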
The Crawler
• retrieves the pages
• saves each page’s HTML source code in a separate file
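A minimal crawler along these lines might look like the sketch below. The file naming scheme (`page_0000.html`, ...), the timeout, and the choice to skip unreachable pages are assumptions; the slides only say that each page's source is saved in its own file.

```python
import os
import urllib.request


def fetch(url, timeout=10):
    """Download the raw bytes of one page."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def crawl(urls, out_dir, fetch_page=fetch):
    """Save each page's HTML source in a separate file under out_dir.

    Returns a dict mapping each successfully fetched URL to its file path.
    fetch_page is injectable so the function can be exercised offline.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = {}
    for index, url in enumerate(urls):
        try:
            data = fetch_page(url)
        except OSError:
            continue  # unreachable page: skip it (assumed policy)
        path = os.path.join(out_dir, f"page_{index:04d}.html")
        with open(path, "wb") as handle:
            handle.write(data)
        saved[url] = path
    return saved
```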
“Tidy”
• converts arbitrary and probably ill-formed HTML into
XHTML
XHTML Parser
• parses an XHTML document
• builds an XHTML parse tree
The Document Analyzer
• scans the parse tree for image URLs
– an image URL appears in either an image or anchor
element
• converts relative URLs into absolute URLs
• uses various heuristics to determine which URLs
point to relevant images
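The scan for image URLs can be sketched with Python's standard XML parser, assuming the input is already well-formed XHTML with no undefined entities. The slides do not list the relevance heuristics, so the extension check on anchor targets below (`IMAGE_EXTENSIONS`) is only a stand-in assumption, and `find_image_urls` is a hypothetical name.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin, urlparse

# Assumed heuristic: an anchor points to an image if its path ends in one of these.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def local_name(tag):
    """Strip an XML namespace such as {http://www.w3.org/1999/xhtml} from a tag."""
    return tag.rsplit("}", 1)[-1].lower()


def find_image_urls(xhtml_text, page_url):
    """Return absolute image URLs found in img and anchor elements, in document order."""
    root = ET.fromstring(xhtml_text)
    found = []
    for element in root.iter():
        name = local_name(element.tag)
        if name == "img":
            target = element.get("src")
        elif name == "a":
            target = element.get("href")
            if target and not urlparse(target).path.lower().endswith(IMAGE_EXTENSIONS):
                target = None  # anchor leads to something other than an image
        else:
            continue
        if target:
            absolute = urljoin(page_url, target)  # relative -> absolute
            if absolute not in found:
                found.append(absolute)
    return found
```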
Search Strategies
• Image’s file name
• Textual content of the TITLE element
• Value of the ALT attribute of IMG elements
• Textual content of anchor elements
• Value of the title attribute of anchor elements
• Textual content of the paragraph surrounding an image
• Textual content of any paragraph located within the same
center element as the image
• Textual content of heading elements
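To make the strategies concrete, the sketch below collects the text behind each clue for every img element and reports which clues contain a query term. It is an illustration under assumptions, not the system's code: matching is a case-insensitive substring test, the clue labels are invented here, and the input is assumed to be well-formed XHTML.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def local_name(element):
    """Tag name without any XML namespace prefix, lowercased."""
    return element.tag.rsplit("}", 1)[-1].lower()


def text_of(element):
    """All text inside an element, with whitespace normalized."""
    return " ".join("".join(element.itertext()).split())


def matching_clues(xhtml_text, query):
    """Map each image src to the sorted names of the clues containing the query."""
    root = ET.fromstring(xhtml_text)
    parents = {child: parent for parent in root.iter() for child in parent}
    elements = list(root.iter())

    # Page-level clues apply to every image on the page.
    page_clues = {
        "title": " ".join(text_of(e) for e in elements if local_name(e) == "title"),
        "heading": " ".join(text_of(e) for e in elements if local_name(e) in HEADINGS),
    }
    needle = query.lower()
    results = {}
    for img in (e for e in elements if local_name(e) == "img"):
        src = img.get("src", "")
        clues = dict(page_clues)
        clues["file name"] = urlparse(src).path.rsplit("/", 1)[-1]
        clues["alt"] = img.get("alt", "")
        node = parents.get(img)
        while node is not None:  # walk up through the image's ancestors
            name = local_name(node)
            if name == "a":
                clues["anchor text"] = text_of(node)
                clues["anchor title"] = node.get("title", "")
            elif name == "p":
                clues["paragraph"] = text_of(node)
            elif name == "center":
                clues["center"] = " ".join(
                    text_of(p) for p in node.iter() if local_name(p) == "p")
            node = parents.get(node)
        results[src] = sorted(k for k, v in clues.items() if needle in v.lower())
    return results
```

Recording which clue matched, rather than only whether any clue matched, is what allows the clues to be compared against each other later.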
Image Retrieval Experiment
Experimental Questions
• Which HTML features reveal the most
information about an image?
– Do particular patterns of HTML structure carry useful
information?
• Do image search results depend on the type of
query?
Informal Experiments
• Conducted extensive informal testing
– to check software correctness
– to investigate possible metadata clues
– to determine rules for filtering out images based on size
• images smaller than 65 pixels in either dimension almost never
contained useful content
• reduced the number of images we had to classify
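The size rule above is easy to state in code. The sketch assumes each image's width and height in pixels are already known (for example, read from the downloaded file); the function names are hypothetical.

```python
MIN_DIMENSION = 65  # pixels; threshold reported from the informal experiments


def is_large_enough(width, height, minimum=MIN_DIMENSION):
    """Reject an image if it is smaller than the threshold in either dimension."""
    return width >= minimum and height >= minimum


def filter_small_images(images):
    """Keep only (url, width, height) records that pass the size rule."""
    return [(url, w, h) for url, w, h in images if is_large_enough(w, h)]
```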
Metadata Clues
1 Image’s file name
2 Textual content of the TITLE element
3 Value of the ALT attribute of IMG elements
4 Textual content of anchor elements
5 Value of the title attribute of anchor elements
6 Textual content of the paragraph surrounding an image
7 Textual content of any paragraph located within the same
center element as the image
8 Textual content of heading elements
Query Categories
• Famous people
“Gorbachev”, “Yeltsin”, and “Streisand”
• Non-famous people
“Yelena” and “Ekaterina”
• Famous places
“Paris” and “London”
• Less-famous places
“Bremen” and “Spokane”
• Phenomena
“Explosion”, “Sunset”, and “Hurricane”
Experimental Procedure
• For each of the 12 queries
– Alta Vista returned 200 URLs (20 groups of 10)
– We used first, middle, and last groups (30 URLs)
– Downloaded pages and all images on pages
• excluding small images (smaller than 65 pixels in either dimension)
[…] Text”?
– Analysis of “header” clue is questionable
[Figure: two parse-tree fragments built from Body, P, and IMG nodes]
Conclusion
• Existing content-based image retrieval systems are
not good models for Web image search
• HTML metadata is useful for Web image search
– Image file name and document title are most useful
– Alternate text is extremely precise, when present
• HTML metadata should provide faster image
search than image processing approaches
– no need to download and analyze images
– can take advantage of existing search engines
Using HTML Metadata to Retrieve
Relevant Images from the Web
Ethan V. Munson
Dept. of Electrical Engineering & Computer Science
University of Wisconsin - Milwaukee
munson@cs.uwm.edu
http://www.cs.uwm.edu/~multimedia