Optimizing your site’s “searchability”

Content, metadata, and interface considerations
 for web search engines




                James Powell
   Web Application Research & Development
                April 2, 2001
First… a few definitions
   An index is a file that allows a search engine to relate
    words to documents.
   A spider is a piece of software that automatically visits
    (crawls) and retrieves (harvests) pages from a web
    site.
   A query is one or more terms (words or phrases) a
    user enters at a search page.
   A result set is a list of documents that a search engine
    determined were related to a query by consulting its
    index.
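
The four definitions above can be tied together in a few lines. This is a minimal sketch, not how any particular search engine works: the page names and text are made up, and the helper names (tokenize, build_index, search) are illustrative only.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Build the index: map each word to the documents that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in tokenize(text):
            index[word].add(name)
    return index

def search(index, query):
    """Return the result set: documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    matches = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(matches)

# Hypothetical two-page site, used only for illustration.
documents = {
    "whales.html": "Moby Dick is a novel about a white whale",
    "ginkgo.html": "The ginkgo is a tree with fan-shaped leaves",
}
index = build_index(documents)
print(search(index, "white whale"))
```

A real engine would also rank the result set; this sketch only shows how the index relates words to documents.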
Part 1 – Writing for search engines
Treat search engines as a category
of browser

   Like browsers, search engines are unique:
    –   Text-oriented focus
    –   Support metatags and metadata
    –   Depend on automatic harvesting through spider
        component
Web spiders are simplistic…

   Search engine spiders are not terribly smart:
    –   They start with a given URL and grab that page.
    –   They find all the links in the page.
    –   They follow each of those links, retrieving a
        document.
    –   They repeat this process until something tells them
        to stop.
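
The loop described above can be sketched as follows. To keep the example runnable without a network, the "web" is a made-up dictionary of pages (SITE), and the thing that tells the spider to stop is an assumed max_pages limit. Real spiders add politeness delays, robots.txt checks, and error handling.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=100):
    """Grab the start page, follow its links, and repeat until told to stop."""
    seen = {start_url}
    queue = deque([start_url])
    harvested = {}
    while queue and len(harvested) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:          # page could not be retrieved
            continue
        harvested[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)   # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return harvested

# Hypothetical site held in memory; the URLs are placeholders.
SITE = {
    "http://example.org/": '<a href="a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/">Home</a>',
    "http://example.org/b.html": '<a href="a.html">A</a> <a href="missing.html">?</a>',
}
pages = crawl("http://example.org/", SITE.get)
print(sorted(pages))
```

Note that the spider only finds pages reachable through plain links in the HTML, which is why the complex content discussed next causes trouble.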
…And responsible for some search
engine limitations

   Spiders stumble on complex content:
    –   Frames often confuse search engine spiders.
    –   Spiders tend to ignore JavaScript-based navigation.
    –   Spiders can get stuck inside dynamic (e.g.
        database-driven) websites.
Big and Dumb

   As Danny Sullivan of searchenginewatch.com
    is fond of saying, search engines like big,
    dumb web pages. Translation:
    –   Lots of text and metadata
    –   Valid and standard HTML markup
    –   No obtrusive multimedia or dynamic content.
Ranked term list

   Make a list of the top 10 or 20 terms you
    expect users to use to find your site.
   Rank them in order of importance.
   Use the list when writing content and
    metadata.
Craft a description

   Use your ranked terms to develop a 15-25
    word description.
Phrases are better

   It is often difficult to figure prominently in a
    result set in response to a single-word query.
   Emphasize specific, well-known phrases that
    relate to or describe your work.
   Unusual or field-specific terms yield the best
    results!
Alt attributes, text equivalents

   Always include an alt attribute with every
    image.
   Alt attributes improve site accessibility and
    provide another opportunity to embed ranked
    terms.
   Provide links to textual equivalents for audio and
    video files (for the same reasons).
Best content first

   Include your best, most descriptive content at
    the top of your web page.
   Use your ranked terms often, but in a natural
    way.
   Create pages that provide a different
    perspective on the same topic and cross-link
    (sometimes called doorway pages).
Using images

   Don’t use pictures of text unless the text is also
    included in the page.
   View the source of your page – that’s what a
    spider sees.
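
One way to approximate what a spider sees is to pull out only the text and alt attributes from the page source. This is a rough sketch using Python's standard HTML parser; the sample page and the class name SpiderView are made up, and real spiders differ in what they keep.

```python
from html.parser import HTMLParser

class SpiderView(HTMLParser):
    """Gather the text a text-oriented spider could index from a page."""
    def __init__(self):
        super().__init__()
        self.words = []
        self.images_without_alt = 0

    def handle_data(self, data):
        # Keep visible text, skipping whitespace between tags.
        if data.strip():
            self.words.append(data.strip())

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            alt = dict(attrs).get("alt")
            if alt:
                self.words.append(alt)        # alt text is indexable
            else:
                self.images_without_alt += 1  # nothing here for a spider

# Hypothetical page: one image with alt text, one without.
page = """<html><body>
<img src="banner.gif">
<img src="whale.gif" alt="White whale breaching">
<p>Moby Dick study guide</p>
</body></html>"""

view = SpiderView()
view.feed(page)
print(view.words)
print(view.images_without_alt)
```

The banner image contributes nothing to the index, which is the point of the advice about pictures of text.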
Take users directly to the content

   Splash pages using technologies such as
    Flash are empty as far as search engines are
    concerned.
   If you have to redirect, create a web page that
    redirects users to the new pages.
Devote time to your description

   Prepare more descriptions: short, medium, and
    long (5-10, 15-25, and 20-50 words).
   Use your ranked terms.
   Expand acronyms where appropriate.
   Use both singular and plural forms of words.
   Use synonyms.
Tempt with a good Title

   Always include a title in EVERY document.
   Don’t use generic titles.
   Create titles that
    –   entice users
    –   include your ranked terms

   Webreference.com recommends: “Use titles
    early in the alphabet”
Part 2 - Metadata
What is Metadata?

   Metadata is data about data – for example:
    –   An abstract is metadata about a paper.
    –   Herman Melville and “Moby Dick” are metadata
        about a book.
What are some examples?

   Common metadata elements include:
    –   Title
    –   Author
    –   Subject
    –   Keywords
    –   Description
    –   Abstract
Metadata is for software, not people

   Metadata is for software (like search engines),
    not people.
   But, search engines may display some
    metadata in result sets.
Metatags

   Metatags are HTML tags.
   They are data containers with their own unique
    format:
       <META name="title" content="Moby Dick">
       <META name="date" content="May 2, 2003">
       <META name="genus" content="ginkgo">
Where do they go?

   HTML Metatags go in the header portion of an
    HTML document.
    –   The header of an HTML document is the content
        between the <HEAD> start and </HEAD> end tag.
   You can include as many <META> tags within
    the header as you need.
Metatags can contain metadata

   Metatags can contain metadata about a web
    page.
   Most search engines look for:
       <META name="keywords" content="some keywords">

    and
       <META name="description" content="a description of the
         page’s contents">
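
To check what a search engine could read from your header, you can extract the metatags yourself. The sketch below uses Python's standard HTML parser; the sample header and the class name MetaReader are assumptions for illustration, and each search engine decides for itself which tags it honors.

```python
from html.parser import HTMLParser

class MetaReader(HTMLParser):
    """Collect name/content pairs from <META> tags and the <TITLE> text."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)          # the parser lowercases tag and attribute names
        if tag == "meta" and attrs.get("name"):
            self.meta[attrs["name"].lower()] = attrs.get("content") or ""
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data

# Hypothetical header for illustration.
header = """<HEAD>
<TITLE>Moby Dick Study Guide</TITLE>
<META name="keywords" content="Moby Dick, Herman Melville, whaling">
<META name="description" content="A chapter-by-chapter guide to Moby Dick.">
</HEAD>"""

reader = MetaReader()
reader.feed(header)
print(reader.title)
print(reader.meta["keywords"])
print(reader.meta["description"])
```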
Dublin Core Metadata

   The Dublin Core is a set of metadata elements
    designed to describe web content.
Dublin Core Facts
   Originally developed in 1996.
   There are 15 Dublin Core metadata elements.
   Each element is optional, and can occur as many times
    as needed.
   Many elements may be associated with a fixed set of
    values.
   Current version is 1.1 (issued July 1999).
   Dublin Core documentation is at
    http://www.purl.org/metadata/dublin_core_elements
Example: DC.TITLE

   The title element indicates the name by
    which a web page is formally known:

<META name="DC.TITLE" content="The Lost World">

Note: When using Dublin Core metadata in HTML metatags, common
  practice is to use the DC prefix to indicate the type of metadata.
The five elements you should
always include

   Title
   Creator (author)
   Subject (keywords)
   Description
   Publisher
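
If you keep these five elements in a simple record, the metatags can be generated rather than typed by hand. This is a sketch only: the function name dublin_core_tags and the sample values are placeholders, and it follows the DC prefix convention noted earlier. The record is a list of pairs because an element may occur more than once.

```python
from html import escape

def dublin_core_tags(record):
    """Render (element, value) pairs as <META> tags with the DC prefix."""
    lines = []
    for element, value in record:
        name = "DC." + element.upper()
        # Escape quotes and angle brackets so the attribute stays valid HTML.
        lines.append('<META name="%s" content="%s">' % (name, escape(value, quote=True)))
    return "\n".join(lines)

# Placeholder values for illustration only.
record = [
    ("title", "The Lost World"),
    ("creator", "Example Author"),
    ("subject", "dinosaurs"),
    ("subject", "exploration"),
    ("description", "An example description built from ranked terms."),
    ("publisher", "Example Publisher"),
]
print(dublin_core_tags(record))
```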
Other elements you can include

   Contributor      Source
   Date             Language
   Type             Relation
   Format           Coverage
   Identifier       Rights
Part 3 - Getting Found
Searching is popular

   Jakob Nielsen found that more than 50% of all
    web users preferred searching to browsing
    so…
    –   You need to be concerned about where users are
        searching.
    –   You need to develop a strategy for registering your
        site with search engines.
Where users search

   Media Metrix
    –   http://us.mediametrix.com/data/thetop.jsp
   Nielsen//NetRatings
    –   http://www.nielsen-netratings.com/


   Organizational, subject-oriented, and site-specific
    searches
Establish a search strategy

   A search strategy outlines how you will support
    and facilitate end user discovery of your site
    through search tools.
Before you begin

   Before you develop a search strategy, review
    your web server logs.
   This is one way to measure the effectiveness
    of your strategy.
   Keep a record.
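
One way to review your logs is to count the search phrases that appear in referrer URLs. The sketch below rests on two assumptions that may not hold for your server: that the log uses the combined format (request, referrer, and user agent as the three quoted fields), and that the referring search engine passes the query in a parameter named q. The sample log lines are invented.

```python
import re
from collections import Counter
from urllib.parse import urlparse, parse_qs

# Matches the quoted fields of a combined-format log line (assumed format).
QUOTED = re.compile(r'"([^"]*)"')

def search_terms(log_lines, param="q"):
    """Count search phrases found in the referrer field of each log line."""
    counts = Counter()
    for line in log_lines:
        fields = QUOTED.findall(line)
        if len(fields) < 3:
            continue                      # not a combined-format line
        referrer = fields[1]
        query = parse_qs(urlparse(referrer).query)
        for phrase in query.get(param, []):
            counts[phrase.lower()] += 1
    return counts

# Invented log lines for illustration.
log = [
    '192.0.2.1 - - [02/Apr/2001:10:00:00 -0500] "GET /index.html HTTP/1.0" 200 2326 '
    '"http://search.example.com/find?q=white+whale" "ExampleBrowser/1.0"',
    '192.0.2.2 - - [02/Apr/2001:10:05:00 -0500] "GET /index.html HTTP/1.0" 200 2326 '
    '"http://search.example.com/find?q=White+Whale" "ExampleBrowser/1.0"',
    '192.0.2.3 - - [02/Apr/2001:10:09:00 -0500] "GET /about.html HTTP/1.0" 200 512 '
    '"-" "ExampleBrowser/1.0"',
]
print(search_terms(log).most_common())
```

Running the same count before and after you change your content gives you the record to compare against.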
Link exchange

   Some search engines, such as Google, use
    the number of sites that point to your site to
    build an automatic citation index.
   Establish reciprocal links, using the most
    appropriate title or phrase to target your site.
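
The idea of counting the sites that point to yours can be sketched as below. This is a deliberately simple tally, not the method of Google or any other engine, which weigh links in more elaborate ways. The link list and the function name inbound_site_counts are made up.

```python
from collections import Counter
from urllib.parse import urlparse

def inbound_site_counts(links):
    """Count how many distinct other sites link to each target site."""
    pairs = set()
    for source, target in links:
        src = urlparse(source).netloc
        dst = urlparse(target).netloc
        if src and dst and src != dst:    # ignore a site's links to itself
            pairs.add((src, dst))         # one vote per linking site
    return Counter(dst for _, dst in pairs)

# Invented (source page, target page) pairs for illustration.
links = [
    ("http://a.example/x.html", "http://target.example/"),
    ("http://a.example/y.html", "http://target.example/"),
    ("http://b.example/", "http://target.example/page.html"),
    ("http://target.example/", "http://target.example/about.html"),
]
print(inbound_site_counts(links))
```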
Leverage directories

   With a directory, users can search or browse.
   Users expect a directory to point them to
    higher quality content.
   Category information helps users select
    content and refine queries.
Register with directory services

   Use your ranked terms to perform queries
    against the directory’s search engine.
   Submit your site using the appropriate directory
    category.
Registering with search engines

   Most search engines let you submit a URL to
    be indexed.
   Some sites provide free or fee-based search
    engine registration services.
   If lots of other sites point at yours, many search
    engines will eventually find it.
$$$

   You can pay for placement within a results set.
   Not a bad way to get your corporate name
    ahead of your competitors, but it probably won’t
    win you any converts among people looking for
    information.
Don’t spam

   Don’t spam search engines.
   Spam is unrelated metadata (e.g. keyword
    “sex” for a Java programming site).
   A few search engines actually penalize
    spammers.
Consider maintaining a local
search

   Many sites include a local search.
   Good for sites that host lots of documents.
   Link to your search page often (Nielsen
    recommends pointing to your search from all of
    your pages).
Follow-up

   Revisit the directories you registered with and
    verify that you were added.
   Do the same with search engines. Search for
    your ranked terms and see if you find your site.
Part 4 – Designing Search
Interfaces
Four phase framework

   Ben Shneiderman invented the four phase
    framework for search:
    –   Formulating a query,
    –   Triggering a search,
    –   Reviewing the results,
    –   Refining the query.
Formulating a query

   What should the search page look like?
    –   Provide a simple page titled “Search this site”.
    –   Make sure the input field is large, to encourage
        multi-word queries.
    –   Label the submission button with a word that
        describes the action being performed (e.g. Search).
    –   Relegate the advanced query interface to a
        separate page.
Triggering a search

   How do users initiate a search action?
    –   They click the search button!
    –   Or, they might click a link on the results page that
        says something like “find more pages like this one.”
    –   Don’t make them guess, don’t make them think.
Reviewing results
   How should results be organized?
    –   Users scan, so be brief and use formatting cues, like bold
        titles.
    –   Clearly indicate the ranking in obvious ways and reinforce it
        (obvious: present in ranked order, reinforcement: share details
        about ranking).
    –   Provide URL, date, size, and format information if space
        permits.
    –   Eliminate duplicates!
    –   Provide clear navigation if results are presented on multiple
        pages.
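
Several of the points above (ranked order, shared ranking details, no duplicates, clear paging) can be sketched in one small function. The field names (url, title, score), the sample hits, and the plain-text layout are assumptions for illustration; a real results page would be HTML with bold titles.

```python
def present_results(hits, page=1, per_page=10):
    """Drop duplicate URLs, order by score, and format one page of results."""
    best = {}
    for hit in hits:
        url = hit["url"]
        # Eliminate duplicates: keep the highest-scoring hit per URL.
        if url not in best or hit["score"] > best[url]["score"]:
            best[url] = hit
    ranked = sorted(best.values(), key=lambda h: h["score"], reverse=True)

    start = (page - 1) * per_page
    lines = []
    for rank, hit in enumerate(ranked[start:start + per_page], start=start + 1):
        # Present in ranked order and share the score to reinforce the ranking.
        lines.append("%d. %s (score %.2f)" % (rank, hit["title"], hit["score"]))
        lines.append("   %s" % hit["url"])
    total_pages = max(1, -(-len(ranked) // per_page))   # ceiling division
    lines.append("Page %d of %d" % (page, total_pages))
    return lines

# Invented hits; the first URL appears twice.
hits = [
    {"url": "http://example.org/a.html", "title": "Whales of the World", "score": 0.90},
    {"url": "http://example.org/b.html", "title": "Ginkgo Trees", "score": 0.70},
    {"url": "http://example.org/a.html", "title": "Whales of the World", "score": 0.50},
]
print("\n".join(present_results(hits)))
```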
Query refinement

   How to support subsequent queries –
    –   Make sure the search form is part of the results set.
    –   Embed the last query in the form.
    –   For relevance or related document searches,
        provide a hypertext link.
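
Embedding the last query in the form can be as simple as the sketch below. The form action (/search) and the field name (q) are placeholders. The important detail is escaping the user's text before writing it back into the page.

```python
from html import escape

def search_form(last_query="", action="/search"):
    """Return a search form with the previous query filled in."""
    return (
        '<form action="%s" method="get">\n'
        '  <input type="text" name="q" size="40" value="%s">\n'
        '  <input type="submit" value="Search">\n'
        '</form>' % (escape(action, quote=True), escape(last_query, quote=True))
    )

# Place this at the top of the results page so users can refine in place.
print(search_form("white whale"))
```

The wide input field and the button labeled Search follow the guidelines given earlier for formulating a query.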
Test and refine

   The framework focuses on user tasks.
   Test your implementation with real users.
   Fix problems, and don’t be afraid to violate the
    general guidelines if user testing reveals
    problems.
More about the framework…

Visit the website for the International Journal of Human-Computer Studies
at http://ijhcs.open.ac.uk/
Part 5 – Local Search
Planning your search option

   Scoped search:
    –   Branded version of a “commercial” search
    –   Spidering/Indexing by parent organization
   Local search:
    –   Database-driven metadata search
    –   Search is on the document server
   Multi-site search:
    –   Separate, dedicated search service
Branded version of an existing
search

   If your site is indexed by commercial sites,
    consider using their site-specific search.
   Drawbacks: you have no control over search
    interface, results set, presence or content of
    ads, frequency of indexing, depth of indexing.
Site searches

   Google supports a site search:
    –   site:ward.vt.edu searches WARD documents
        indexed by Google
   AltaVista’s domain and host keywords:
    –   domain:edu or host:www.vt.edu
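
A site-restricted query like the one above can be wrapped in a link from your own pages. This sketch assumes Google's search URL accepts the query in a q parameter; check the search engine's own documentation before relying on the URL format. The function names are illustrative.

```python
from urllib.parse import urlencode

def site_query(terms, host):
    """Prefix the user's terms with a site: restriction."""
    return "site:%s %s" % (host, terms)

def google_search_url(terms, host):
    """Build a search URL for a site-restricted query (assumed URL format)."""
    return "https://www.google.com/search?" + urlencode({"q": site_query(terms, host)})

print(site_query("metadata", "ward.vt.edu"))
print(google_search_url("metadata", "ward.vt.edu"))
```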
Spidering/Indexing by parent
organization

   If your parent organization provides a search
    engine, use it.
   Drawbacks: probably limited control over
    search interface and/or results set
    presentation, probably little control over
    frequency or depth of indexing, maybe limited
    to a fixed number of documents.
Leveraging an existing local
service

   Search the site and check your log files to see
    if some of your content is indexed.
   Contact the search admin to review the
    configuration options that might affect your site.
Database-driven metadata search

   Set up a simple web-accessible metadata
    database.
   Drawbacks: difficulty of managing some
    relational database systems, difficulty getting
    metadata into database and keeping it current,
    less comprehensive than full text searching.
Database of metadata

   Select metadata standard.
   Set up a database table to contain it.
   Create input form for submitting metadata.
   Use the four phase framework to design your
    query form and results set.
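
The steps above can be sketched with SQLite, which ships with Python. The table layout uses the five recommended Dublin Core elements plus a URL; the table name, column choices, and sample record are assumptions, and an in-memory database is used so the example leaves no files behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE metadata (
           url TEXT PRIMARY KEY,
           title TEXT NOT NULL,
           creator TEXT,
           subject TEXT,
           description TEXT,
           publisher TEXT
       )"""
)

def add_record(conn, record):
    """Store one metadata record, as an input form handler might."""
    conn.execute(
        "INSERT OR REPLACE INTO metadata "
        "VALUES (:url, :title, :creator, :subject, :description, :publisher)",
        record,
    )

def find(conn, term):
    """Return (url, title) rows whose title, subject, or description match."""
    pattern = "%" + term + "%"
    rows = conn.execute(
        "SELECT url, title FROM metadata "
        "WHERE title LIKE ? OR subject LIKE ? OR description LIKE ? "
        "ORDER BY title",
        (pattern, pattern, pattern),
    )
    return rows.fetchall()

# Placeholder record for illustration.
add_record(conn, {
    "url": "http://example.org/whales.html",
    "title": "Moby Dick Study Guide",
    "creator": "Example Author",
    "subject": "whales; whaling; Herman Melville",
    "description": "A chapter-by-chapter guide to the novel.",
    "publisher": "Example Publisher",
})
print(find(conn, "whaling"))
```

As the drawbacks above note, this searches only the metadata you entered, not the full text of the documents.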
Search on the document server

   There are many ways to implement local
    search options on your web server.
   Drawbacks: Only local content can be
    searched.
Local searching

   Determine scope of problem – number of
    documents to be searched, full-text or
    metadata only, types of searches to be
    supported.
   Find out what solutions are available for your
    platform.
Separate, dedicated search service

   Managing a dedicated search service provides
    the most features and the most scalable
    solution.
   Drawbacks: Expensive, resource intensive,
    time consuming, bandwidth hogging, tricky to
    tune and maintain.
Features of search engines

   Gather from multiple sites
   Thesaurus, stop-word lists
   Custom interfaces
   Site-specific and arbitrary string and regular
    expression filters
   Scheduled gathering
   Searchable while indexing
Planning to implement search

   How many sites will you be crawling?
   How many documents will be gathered and
    indexed?
   Are there distinct collections of documents?
   Do you need custom query interfaces?
   What metadata will be supported?
Controlling spiders

   The robots exclusion protocol allows you to
    control spider applications that visit your site.
   You need to know the name of the spider (the
    “user-agent”).
   For each agent, you can control what
    directories are crawled.
   The robots.txt file should be stored at the root
    of your web document tree.
Example robots.txt entries

   Prevent all spiders from crawling a cgi
    directory:
    User-agent: *
    Disallow: /cgi-bin/
   Prevent one spider called “charlotte” from
    gathering documentation:
    User-agent: charlotte
    Disallow: /docs/
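
You can test entries like these before publishing them. The sketch below feeds the two example entries to Python's standard urllib.robotparser module; the agent name otherbot and the paths are made up for the test.

```python
from urllib.robotparser import RobotFileParser

# The two example entries, combined into one robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: charlotte
Disallow: /docs/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

def allowed(agent, path):
    """Report whether the named spider may fetch the given path."""
    return parser.can_fetch(agent, path)

checks = [
    ("otherbot", "/cgi-bin/search"),
    ("otherbot", "/docs/guide.html"),
    ("charlotte", "/docs/guide.html"),
]
for agent, path in checks:
    print(agent, path, allowed(agent, path))
```

One thing worth knowing: with this parser, a spider that matches a named entry follows only that entry, so charlotte is not barred from /cgi-bin/ by the wildcard entry. If you want both rules to apply to charlotte, list both Disallow lines under its entry.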
Optimizing Searchability: Conclusion

   Develop content with search engines in mind.
   Include metadata.
   Develop a “search strategy.”
   Design and test your search interfaces.
   Select the search option that works for you.
For more Info

   My upcoming FDI class on Web Search Engines
   These sites:
    –   http://www.rankwrite.com/default.htm
    –   http://websearch.about.com/internet/websearch/mbody.htm
    –   http://www.searchengineguide.com/
    –   http://searchenginewatch.com/
    –   http://www.submitexpress.com/
    –   http://webreference.com/content/search/how.html
    –   http://www.searchtools.com/
Other useful info

   VT’s search agent name is:
    –   CookieMonster


   Contact the Search Administrator (Vijaya
    Mallikarjunan vijaya@vt.edu) or the VT
    webmaster to get registered.

						