Optimizing your site’s
“searchability”
Content, metadata, and interface considerations
for web search engines
James Powell
Web Application Research & Development
April 2, 2001
First… a few definitions
An index is a file that allows a search engine to relate
words to documents.
A spider is a piece of software that automatically visits
(crawls) and retrieves (harvests) pages from a web
site.
A query is one or more terms (words or phrases) a
user enters at a search page.
A result set is a list of documents that a search engine
determined were related to a query by consulting its
index.
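The four definitions above fit together in a few lines. This is a minimal Python sketch, assuming a toy collection of three made-up documents. The names build_index and search are hypothetical and are not taken from any real search engine.

```python
from collections import defaultdict

# Hypothetical toy collection: document name -> text.
documents = {
    "whales.html": "Moby Dick is a novel about a white whale",
    "trees.html": "The ginkgo is a tree with fan shaped leaves",
    "books.html": "A novel is a long work of fiction",
}

def build_index(docs):
    """Build an index that relates each word to the documents containing it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def search(index, query):
    """Return the result set: documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)

index = build_index(documents)
print(search(index, "novel"))        # documents containing "novel"
print(search(index, "white whale"))  # documents containing both terms
```

A real engine also ranks the result set. This sketch only decides which documents match.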
Part 1 – Writing for search engines
Treat search engines as a category
of browser
Like browsers, search engines are unique:
– Text-oriented focus
– Support metatags and metadata
– Depend on automatic harvesting through spider
component
Web spiders are simplistic…
Search engine spiders are not terribly smart:
– They start with a given URL and grab that page.
– They find all the links in the page.
– They follow each of those links, retrieving a
document.
– They repeat this process until something tells them
to stop.
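The four steps above can be sketched in Python. This is an illustration only: it assumes a hypothetical in-memory site (the SITE dictionary) instead of real network requests, and the names LinkFinder and crawl are made up for the example.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web site": URL path -> HTML source.
SITE = {
    "/": '<a href="/about.html">About</a> <a href="/docs.html">Docs</a>',
    "/about.html": '<a href="/">Home</a>',
    "/docs.html": '<a href="/about.html">About</a> <a href="/faq.html">FAQ</a>',
    "/faq.html": "<p>No links here.</p>",
}

class LinkFinder(HTMLParser):
    """Collect the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start, max_pages=100):
    """Visit pages breadth-first from start; return URLs in harvest order."""
    seen = {start}
    queue = deque([start])
    harvested = []
    # The page limit plays the role of "something tells them to stop".
    while queue and len(harvested) < max_pages:
        url = queue.popleft()
        page = SITE.get(url)
        if page is None:  # broken link: nothing to retrieve
            continue
        harvested.append(url)
        finder = LinkFinder()
        finder.feed(page)
        for link in finder.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return harvested

print(crawl("/"))
```

The seen set is what keeps the sketch from revisiting pages that link back to each other. Without it, the loop between the home page and the About page would never end.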
…And responsible for some search
engine limitations
Spiders stumble on complex content:
– Frames often confuse search engine spiders.
– Spiders tend to ignore JavaScript-based navigation.
– Spiders can get stuck inside dynamic (e.g.
database-driven) websites.
Big and Dumb
As Danny Sullivan of searchenginewatch.com
is fond of saying, search engines like big,
dumb web pages. Translation:
– Lots of text and metadata
– Valid and standard HTML markup
– No obtrusive multimedia or dynamic content.
Ranked term list
Make a list of the top 10 or 20 terms you
expect users to use to find your site.
Rank them in order of importance.
Use the list when writing content and
metadata.
Craft a description
Use your ranked terms to develop a 15-25
word description.
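One way to check a draft description against the ranked term list is sketched below in Python. The term list, the sample description, and the function name check_description are all hypothetical. The 15-25 word range comes from the slide above.

```python
# Hypothetical ranked term list, most important first.
ranked_terms = ["search engine", "metadata", "web spider", "Dublin Core"]

# Hypothetical draft description.
description = (
    "Practical advice on writing web pages and metadata so that a "
    "search engine can find, index, and rank your site well."
)

def check_description(text, terms, min_words=15, max_words=25):
    """Report word count, whether it is in range, and which ranked terms appear."""
    word_count = len(text.split())
    lowered = text.lower()
    used = [term for term in terms if term.lower() in lowered]
    return {
        "words": word_count,
        "length_ok": min_words <= word_count <= max_words,
        "terms_used": used,
    }

report = check_description(description, ranked_terms)
print(report)
```

The terms_used list keeps the ranked order, so it is easy to see whether the most important terms made it into the description.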
Phrases are better
It is often difficult to figure prominently in a
results set in response to a single word query.
Emphasize specific, well-known phrases that
relate to or describe your work.
Unusual or field-specific terms yield the best
results!
Alt attributes, text equivalents
Always include an alt attribute with every
image.
Alt attributes improve site accessibility and
provide another opportunity to embed ranked
terms.
Provide links to textual equivalents for audio and
video files (for the same reasons).
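A quick way to find images that lack alt text is sketched below, using Python's standard html.parser module. The sample page and the class name AltChecker are hypothetical. An empty alt value is valid HTML for decorative images. It is flagged here only because it carries no ranked terms.

```python
from html.parser import HTMLParser

class AltChecker(HTMLParser):
    """Record the src of every <img> whose alt attribute is missing or empty."""
    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            if not attributes.get("alt"):
                self.missing.append(attributes.get("src", "(no src)"))

# Hypothetical page fragment.
page = """
<img src="logo.gif" alt="Company logo">
<img src="spacer.gif">
<img src="chart.png" alt="">
"""

checker = AltChecker()
checker.feed(page)
print(checker.missing)
```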
Best content first
Include your best, most descriptive content at
the top of your web page.
Use your ranked terms often, but in a natural
way.
Create pages that each provide a different
perspective on the same topic, and cross-link
them (sometimes called doorway pages).
Using images
Don’t use pictures of text unless the text is also
included in the page.
View the source of your page – that’s what a
spider sees.
Take users directly to the content
Splash pages using technologies such as
Flash are empty as far as search engines are
concerned.
If you have to redirect, create a web page that
redirects users to the new pages.
Devote time to your description
Prepare several descriptions: short, medium, and
long (5-10, 15-25, and 20-50 words).
Use your ranked terms.
Expand acronyms where appropriate.
Use both singular and plural forms of words.
Use synonyms.
Tempt with a good Title
Always include a title in EVERY document.
Don’t use generic titles.
Create titles that
– entice users
– include your ranked terms
Webreference.com recommends: “Use titles
early in the alphabet”
Part 2 – Metadata
What is Metadata?
Metadata is data about data – for example:
– An abstract is metadata about a paper.
– Herman Melville and “Moby Dick” are metadata
about a book.
What are some examples?
Common metadata elements include:
– Title
– Author
– Subject
– Keywords
– Description
– Abstract
Metadata is for software, not people
Metadata is for software (like search engines),
not people.
But search engines may display some
metadata in result sets.
Metatags
Metatags are HTML tags.
They are data containers with their own unique
format:
<META name="title" content="Moby Dick">
<META name="date" content="May 2, 2003">
<META name="genus" content="ginkgo">
Where do they go?
HTML Metatags go in the header portion of an
HTML document.
– The header of an HTML document is the content
between the <HEAD> start and </HEAD> end tag.
You can include as many <META> tags within
the header as you need.
Metatags can contain metadata
Metatags can contain metadata about a web
page.
Most search engines look for:
<META name="keywords" content="some keywords">
and
<META name="description" content="a description of the page's contents">
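To see roughly what a search engine might pull out of these two tags, here is a sketch using Python's standard html.parser module. It is not how any particular engine works. The class name MetaReader and the sample page are hypothetical, and the sketch only reads META tags that appear inside the header.

```python
from html.parser import HTMLParser

class MetaReader(HTMLParser):
    """Collect name/content pairs from <META> tags found inside <HEAD>."""
    def __init__(self):
        super().__init__()
        self.in_head = False
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "head":
            self.in_head = True
        elif tag == "meta" and self.in_head:
            attributes = dict(attrs)
            name = attributes.get("name")
            content = attributes.get("content")
            if name and content is not None:
                self.meta[name.lower()] = content

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False

# Hypothetical page.
page = """<HTML><HEAD>
<TITLE>Moby Dick</TITLE>
<META name="keywords" content="whales, whaling, Herman Melville">
<META name="description" content="A novel about the hunt for a white whale">
</HEAD><BODY><META name="ignored" content="outside the header"></BODY></HTML>"""

reader = MetaReader()
reader.feed(page)
print(reader.meta["keywords"])
print(reader.meta["description"])
```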
Dublin Core Metadata
The Dublin Core is a set of metadata elements
designed to describe web content.
Dublin Core Facts
Originally developed in 1996.
There are 15 Dublin Core metadata elements.
Each element is optional, and can occur as many times
as needed.
Many elements may be associated with a fixed set of
values.
Current version is 1.1 (issued July 1999).
Dublin Core documentation is at
http://www.purl.org/metadata/dublin_core_elements
Example: DC.TITLE
The title element indicates the name by
which a web page is formally known:
<META name="DC.TITLE" content="The Lost World">
Note: When using Dublin Core metadata in HTML metatags, common
practice is to use the DC prefix to indicate the type of metadata.
The five elements you should
always include
Title
Creator (author)
Subject (keywords)
Description
Publisher
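The five recommended elements can be turned into metatags with a small helper. This is a Python sketch under stated assumptions: the function name dublin_core_tags and all of the sample values are hypothetical, and the uppercase DC.ELEMENT naming simply follows the DC.TITLE example shown earlier.

```python
from html import escape

def dublin_core_tags(record):
    """Return one <META> line per value, using the DC. prefix convention."""
    lines = []
    for element, values in record.items():
        if isinstance(values, str):
            values = [values]  # an element may occur more than once
        for value in values:
            lines.append(
                '<META name="DC.%s" content="%s">'
                % (element.upper(), escape(value, quote=True))
            )
    return "\n".join(lines)

# Hypothetical record covering the five recommended elements.
record = {
    "title": "The Lost World",
    "creator": "Example Author",
    "subject": ["dinosaurs", "adventure"],
    "description": "A story about an expedition & what it finds",
    "publisher": "Example Press",
}

print(dublin_core_tags(record))
```

Escaping the values matters because quotes or ampersands in a title or description would otherwise break the tag.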
Other elements you can include
Contributor
Date
Type
Format
Identifier
Source
Language
Relation
Coverage
Rights
Part 3 – Getting Found
Searching is popular
Jakob Nielsen found that more than 50% of all
web users preferred searching to browsing
so…
– You need to be concerned about where users are
searching.
– You need to develop a strategy for registering your
site with search engines.
Where users search
Media Metrix
– http://us.mediametrix.com/data/thetop.jsp
Nielsen//NetRatings
– http://www.nielsen-netratings.com/
Organizational, subject-oriented, and site
specific searches
Establish a search strategy
A search strategy outlines how you will support
and facilitate end user discovery of your site
through search tools.
Before you begin
Before you develop a search strategy, review
your web server logs.
This is one way to measure the effectiveness
of your strategy.
Keep a record.
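One thing a log review can show is which queries already bring visitors to the site. The Python sketch below makes several assumptions that may not hold for your server: the log is in the Apache "combined" format, the referring search engine puts the query in a URL parameter named q, and the sample lines and the host search.example.com are invented.

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Hypothetical log lines in the "combined" format (request, status, size,
# referrer, user agent). Real logs may use a different layout.
LOG_LINES = [
    '10.0.0.1 - - [02/Apr/2001:10:00:00 -0500] "GET /index.html HTTP/1.0" '
    '200 512 "http://search.example.com/find?q=dublin+core" "Mozilla/4.0"',
    '10.0.0.2 - - [02/Apr/2001:10:05:00 -0500] "GET /meta.html HTTP/1.0" '
    '200 734 "http://search.example.com/find?q=metadata" "Mozilla/4.0"',
    '10.0.0.3 - - [02/Apr/2001:10:09:00 -0500] "GET /index.html HTTP/1.0" '
    '200 512 "-" "Mozilla/4.0"',
    '10.0.0.4 - - [02/Apr/2001:10:12:00 -0500] "GET /meta.html HTTP/1.0" '
    '200 734 "http://search.example.com/find?q=Dublin+Core" "Mozilla/4.0"',
]

# Captures the referrer field that follows the request, status, and size.
PATTERN = re.compile(r'"[^"]*" \d{3} \S+ "([^"]*)" "[^"]*"$')

def referring_queries(lines, param="q"):
    """Count the search queries found in referrer URLs."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if not match:
            continue
        query = parse_qs(urlparse(match.group(1)).query)
        for value in query.get(param, []):
            counts[value.lower()] += 1
    return counts

for terms, hits in referring_queries(LOG_LINES).most_common():
    print(hits, terms)
```

Running the same count before and after changing content or metadata gives a simple record to compare against.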
Link exchange
Some search engines, such as Google, use
the number of sites that point to your site to
build an automatic citation index.
Establish reciprocal links, using the most
appropriate title or phrase as the link text that points to your site.
Leverage directories
With a directory, users can search or browse.
Users expect a directory to point them to
higher quality content.
Category information helps users select
content and refine queries.
Register with directory services
Use your ranked terms to perform queries
against the directory’s search engine.
Submit your site using the appropriate directory
category.
Registering with search engines
Most search engines let you submit a URL to
be indexed.
Some sites provide free or fee-based search
engine registration services.
If lots of other sites point at yours, many search
engines will eventually find it.
$$$
You can pay for placement within a results set.
Not a bad way to get your corporate name
ahead of your competitors, but it probably won’t
win you any converts among people looking for
information.
Don’t spam
Don’t spam search engines.
Spam is unrelated metadata (e.g. keyword
“sex” for a Java programming site).
A few search engines actually penalize
spammers.
Consider maintaining a local
search
Many sites include a local search.
Good for sites that host lots of documents.
Link to your search page often (Nielsen
recommends pointing to your search from all of
your pages).
Follow-up
Revisit the directories you registered with and
verify that you were added.
Do the same with search engines. Search for
your ranked terms and see if you find your site.
Part 4 – Designing Search
Interfaces
Four phase framework
Ben Shneiderman invented the four phase
framework for search:
– Formulating a query,
– Triggering a search,
– Reviewing the results,
– Refining the query.
Formulating a query
What should the search page look like?
– Provide a simple page titled “Search this site”.
– Make sure the input field is large, to encourage
multi-word queries.
– Label the submission button with a word that
describes the action being performed (e.g. Search).
– Relegate the advanced query interface to a
separate page.
Triggering a search
How do users initiate a search action?
– They click the search button!
– Or, they might click a link on the results page that
says something like “find more pages like this one.”
– Don’t make them guess, don’t make them think.
Reviewing results
How should results be organized?
– Users scan, so be brief and use formatting cues, like bold
titles.
– Clearly indicate the ranking in obvious ways and reinforce it
(obvious: present in ranked order, reinforcement: share details
about ranking).
– Provide URL, date, size, and format information if space
permits.
– Eliminate duplicates!
– Provide clear navigation if results are presented on multiple
pages.
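Two of the points above, ranked order and duplicate removal, can be sketched in a few lines of Python. The hit records, the score field, and the function names are hypothetical. The sketch also assumes that two hits are duplicates when they share a URL, which is only one possible definition.

```python
def prepare_results(hits):
    """Sort hits by score, highest first, and drop repeated URLs."""
    seen = set()
    unique = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        if hit["url"] not in seen:
            seen.add(hit["url"])
            unique.append(hit)
    return unique

def format_result(rank, hit):
    """One brief line per result: rank, title, URL, and score."""
    return "%d. %s (%s) score %.2f" % (rank, hit["title"], hit["url"], hit["score"])

# Hypothetical raw hits from a search engine.
hits = [
    {"title": "Metatags", "url": "/meta.html", "score": 0.62},
    {"title": "Dublin Core", "url": "/dc.html", "score": 0.91},
    {"title": "Metatags (copy)", "url": "/meta.html", "score": 0.40},
]

results = prepare_results(hits)
for rank, hit in enumerate(results, start=1):
    print(format_result(rank, hit))
```

Showing the score next to each result is one way to reinforce the ranking, as the slide suggests.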
Query refinement
How to support subsequent queries:
– Make sure the search form is part of the results set.
– Embed the last query in the form.
– For relevance or related document searches,
provide a hypertext link.
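Embedding the last query in the form takes one extra attribute. The Python sketch below generates such a form. The function name search_form, the /search action, and the q field name are assumptions made for the example.

```python
from html import escape

def search_form(last_query="", action="/search"):
    """Return an HTML search form with the previous query already filled in."""
    return (
        '<form method="get" action="%s">\n'
        '  <input type="text" name="q" size="40" value="%s">\n'
        '  <input type="submit" value="Search">\n'
        "</form>"
        % (escape(action, quote=True), escape(last_query, quote=True))
    )

# A phrase query with quotation marks must be escaped before it is embedded.
print(search_form('"moby dick" whale'))
```

Placing this form at the top of the results page lets users adjust the query they just ran without retyping it.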
Test and refine
The framework focuses on user tasks.
Test your implementation with real users.
Fix problems, and don’t be afraid to violate the
general guidelines if user testing reveals
problems.
More about the framework…
Visit the website
at http://ijhcs.open.ac.uk/
Part 5 – Local Search
Planning your search option
Scoped search:
– Branded version of a “commercial” search
– Spidering/Indexing by parent organization
Local search:
– Database-driven metadata search
– Search is on the document server
Multi-site search:
– Separate, dedicated search service
Branded version of an existing
search
If your site is indexed by commercial sites,
consider using their site-specific search.
Drawbacks: you have no control over search
interface, results set, presence or content of
ads, frequency of indexing, depth of indexing.
Site searches
Google supports a site search:
– site:ward.vt.edu searches WARD documents
indexed by Google
AltaVista’s domain and host keywords:
– domain:edu or host:www.vt.edu
Spidering/Indexing by parent
organization
If your parent organization provides a search
engine, use it.
Drawbacks: probably limited control over
search interface and/or results set
presentation, probably little control over
frequency or depth of indexing, maybe limited
to a fixed number of documents.
Leveraging an existing local
service
Search the site and check your log files to see
if some of your content is indexed.
Contact the search admin to review the
configuration options that might affect your site.
Database-driven metadata search
Set up a simple web-accessible metadata
database.
Drawbacks: difficulty of managing some
relational database systems, difficulty getting
metadata into database and keeping it current,
less comprehensive than full text searching.
Database of metadata
Select metadata standard.
Set up a database table to contain it.
Create input form for submitting metadata.
Use the four phase framework to design your
query form and results set.
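A minimal sketch of such a metadata table, using Python's built-in sqlite3 module and an in-memory database. The table layout (one row per element value), the sample rows, and the function name search_metadata are assumptions. A production setup would likely use whatever database your site already runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per metadata value, so an element can repeat for the same page.
conn.execute("CREATE TABLE metadata (url TEXT, element TEXT, value TEXT)")

# Hypothetical rows, as they might arrive from an input form.
rows = [
    ("/moby.html", "title", "Moby Dick"),
    ("/moby.html", "creator", "Herman Melville"),
    ("/moby.html", "subject", "whales"),
    ("/moby.html", "subject", "whaling"),
    ("/ginkgo.html", "title", "Ginkgo trees"),
    ("/ginkgo.html", "subject", "botany"),
]
conn.executemany("INSERT INTO metadata VALUES (?, ?, ?)", rows)

def search_metadata(conn, term):
    """Return the URLs of pages whose metadata contains the term."""
    cursor = conn.execute(
        "SELECT DISTINCT url FROM metadata WHERE value LIKE ? ORDER BY url",
        ("%" + term + "%",),
    )
    return [row[0] for row in cursor]

print(search_metadata(conn, "whal"))
print(search_metadata(conn, "ginkgo"))
```

The query is passed as a parameter instead of being pasted into the SQL string, which keeps user input from being interpreted as SQL.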
Search on the document server
There are many ways to implement local
search options on your web server.
Drawbacks: Only local content can be
searched.
Local searching
Determine scope of problem – number of
documents to be searched, full-text or
metadata only, types of searches to be
supported.
Find out what solutions are available for your
platform.
Separate, dedicated search service
Managing a dedicated search service provides
the most features and the most scalable
solution.
Drawbacks: Expensive, resource intensive,
time consuming, bandwidth hogging, tricky to
tune and maintain.
Features of search engines
Gather from multiple sites
Thesaurus, stop-word lists
Custom interfaces
Site-specific and arbitrary string and regular
expression filters
Scheduled gathering
Searchable while indexing
Planning to implement search
How many sites will you be crawling?
How many documents will be gathered and
indexed?
Are there distinct collections of documents?
Do you need custom query interfaces?
What metadata will be supported?
Controlling spiders
The robots exclusion protocol allows you to
control spider applications that visit your site.
You need to know the name of the spider (the
“user-agent”).
For each agent, you can control what
directories are crawled.
The robots.txt file should be stored at the root
of your web document tree.
Example robots.txt entries
Prevent all spiders from crawling a CGI
directory:
User-agent: *
Disallow: /cgi-bin/
Prevent one spider called “charlotte” from
gathering documentation:
User-agent: charlotte
Disallow: /docs/
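The two entries above can be checked with Python's standard urllib.robotparser module. In this sketch both entries are assumed to live in a single robots.txt file, and www.example.com is a placeholder host. CookieMonster stands in for any spider other than charlotte.

```python
from urllib.robotparser import RobotFileParser

# Both example entries combined into one file (an assumption for this sketch).
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: charlotte
Disallow: /docs/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

checks = [
    ("charlotte", "http://www.example.com/docs/guide.html"),
    ("charlotte", "http://www.example.com/cgi-bin/search"),
    ("CookieMonster", "http://www.example.com/cgi-bin/search"),
    ("CookieMonster", "http://www.example.com/docs/guide.html"),
]
for agent, url in checks:
    print(agent, url, parser.can_fetch(agent, url))
```

One detail worth noticing: a spider that finds an entry with its own name uses that entry instead of the * entry. With the file above, charlotte is kept out of /docs/ but is not kept out of /cgi-bin/. To block both, repeat the Disallow: /cgi-bin/ line under the charlotte entry.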
Optimizing Searchability:
Conclusion
Develop content with search engines in mind.
Include metadata.
Develop a “search strategy.”
Design and test your search interfaces.
Select the search option that works for you.
For more Info
My upcoming FDI class on Web Search Engines
These sites:
– http://www.rankwrite.com/default.htm
– http://websearch.about.com/internet/websearch/mbody.htm
– http://www.searchengineguide.com/
– http://searchenginewatch.com/
– http://www.submitexpress.com/
– http://webreference.com/content/search/how.html
– http://www.searchtools.com/
Other useful info
VT’s search agent name is:
– CookieMonster
Contact the Search Administrator (Vijaya
Mallikarjunan vijaya@vt.edu) or the VT
webmaster to get registered.