Semantic Web
Alesso
Introduction
This is the first of a series of three articles in which I'll review the current state of search
engine technology and the progress being made towards the next generation of search
engines: semantic search engines. The term 'semantic search' essentially means that the
scope of the search is defined by meaning and context rather than by merely looking at
keywords. The problem of course with searching based on keywords is that any given word
can be used in many different contexts, which means that choosing keywords likely to yield
web pages that give exactly the type of information you need is a very subtle and complex
problem, often with no easily identifiable solution. A full solution ultimately would probably
require the search engine to have a fairly sophisticated ability to process natural English
language queries, but that unfortunately is well beyond what can currently be achieved. What
is more realistic in the near future is the ability to deduce likely matches by measuring the
degrees of closeness of documents using the extent to which those documents contain words
of similar meanings. Over the course of these three articles, I'll review the techniques being
developed on the Web to do roughly this. I'll explore semantic search engines and semantic
search agents, including their current development and progress. I'll present efforts being
made to implement semantic search by Google, MSN and other innovators. The structure of
the three articles is as follows:

       This article will focus on the current state of technology, though with a brief mention
        of the work being done on semantic search, to set the context. The article will also
        discuss one particular technique, known as stemming, which resolves similar words
        that are likely to have the same meaning. Stemming is a very important precursor to
        semantic search algorithms. The article concludes with a sample that stems words in a
        document.
       The 2nd article will focus more on current progress in semantic search algorithms,
        describing the techniques being developed.
       The final article will be much more applied. Based on the concepts presented in these
        two articles, and also presuming some understanding of the Microsoft Speech SDK, it
        will develop an application that leverages Google Web services to provide a semantic
        search page. It will also use the Speech SDK to make the page speech-enabled.

OK, let's get started. I'll kick off by reviewing the history of search engines to date, then I'll
examine what qualities are required of a good search engine, before looking more deeply at
the techniques involved and finally presenting the sample code.


System Requirements
To run the sample code for this article you simply need a computer running VS .NET and IIS.


Installing and Compiling the Sample Code
The sample code for this article is a plain ASP.NET application. There are no special
installation instructions, other than that you unzip the files into a virtual directory called
LSIDemo.


A Brief History of Search Engines
As the use of the World Wide Web has become increasingly widespread, the business of
commercial search engines has become a vital and lucrative part of the Web. Search engines
have become commonplace tools for virtually every user of the Internet; and companies, such
as Google and Yahoo!, have become household names.

In early 1994, Jerry Yang and David Filo of Stanford University started the hierarchical
search engine, Yahoo!, in order to bring some order to the otherwise chaotic collection of
documents on the Web. Some months later, Brian Pinkerton of the University of Washington
developed the search crawler WebCrawler. Also in 1994, Michael Mauldin of Carnegie Mellon
University created Lycos.

In late 1995, Metacrawler, Excite, AltaVista, and later Inktomi/HotBot (mid-1996),
AskJeeves and GoTo appeared. Yahoo!, though utilizing a directory, was the leading search
engine at that time, but AltaVista was soon launched and began to gain popularity.

By late 1998, Stanford's Larry Page and Sergey Brin reinvented search ranking technology
with their paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" and
started what became the most successful search engine in the world, Google. The uncluttered
interface, speed, and relevancy of the search results were cornerstones in winning over the
tech-literate public.

Search engine optimization became more important as experts tried to boost the rankings of
their commercial websites in order to attract more customers. In 2000, Yahoo! and Google
became partners, with Google handling over 100 million daily search requests. In 2001,
AskJeeves acquired Teoma, and GoTo was renamed Overture.

All of these commercial search engines are based upon one of two forms of Web search
technologies: human directed search or automated search.

The human directed search engine technology utilizes a database of keywords, concepts, and
references. The keyword searches are used to rank pages, but this simplistic method often
leads to voluminous irrelevant and spurious results. In its simplest form, a content-based
search engine will count the number of the query words (keywords) that occur in each of the
pages that are contained in its index. The search engine will then rank the pages. More
sophisticated approaches take into account the location of the keywords. For example,
keywords occurring in the title tags of the Web page are more important than those in the
body. Other types of human-directed search engines, like Yahoo!, use topic hierarchies to
help to narrow the search and make search results more relevant. These topic hierarchies are
human created. Because of this, they are costly to produce and maintain in terms of time, and
are subsequently not updated as often as the fully automated systems.

The fully automated form of Web search technology is based upon the Web crawler, spider,
robot (bot), or agent, which follows HTTP links from site to site and accumulates information
about Web pages. This agent-based search technology accumulates data automatically and is
continuously updating information.


What Makes a Good Search Result
There are several characteristics required to improve search engines. It is important to
consider useful searches as distinct from fruitless ones. To be useful, there are three necessary
criteria:

       Maximum relevant information.
       Minimum irrelevant information.
       Meaningful ranking, with the most relevant results first.

The first of these criteria - getting all of the relevant information available - is called recall.
Without good recall, you have no guarantee that valid, interesting results won't be left out of
your result set. You want the rate of false negatives - relevant results that you never see - to be
as low as possible.

The second criterion - minimizing irrelevant information so that the proportion of relevant
documents in your result set is very high - is called precision. With too little precision, your
useful results get diluted by irrelevancies, and you are left with the task of sifting through a
large set of documents to find what you want. High precision means the lowest possible rate
of false positives.
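
To make these two measures concrete, here is a minimal C# sketch (not part of this article's
sample code) that computes precision and recall from raw counts; the parameter names are
purely illustrative.

  // Precision: what fraction of the retrieved documents were relevant?
  static double Precision(int relevantRetrieved, int totalRetrieved)
  {
     return totalRetrieved == 0 ? 0.0 : (double)relevantRetrieved / totalRetrieved;
  }

  // Recall: what fraction of all relevant documents were actually retrieved?
  static double Recall(int relevantRetrieved, int totalRelevant)
  {
     return totalRelevant == 0 ? 0.0 : (double)relevantRetrieved / totalRelevant;
  }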

There is an inevitable tradeoff between precision and recall. Search results generally lie on a
continuum of relevancy, so there is no distinct place where relevant results stop and
extraneous ones begin.

This is why the third criterion, ranking, is so important. Ranking has to do with whether the
result set is ordered in a way that matches our intuitive understanding of what is more and
what is less relevant. Of course the concept of 'relevance' depends heavily on our own
immediate needs, our interests, and the context of our search. In an ideal world, search
engines would learn our individual preferences, so that they could fine-tune any search based
on our past interests.

Traditional search engines are based almost purely on the occurrence of words in documents.
Search engines like Google however, augment this with information about the hyperlink
structure of the Web. Nevertheless their shortcomings are still significant, including:

       There is a semantic gap between what a user wants and what he gets.
       Searching with small hand-held devices is handicapped by inadequate input/output
        capabilities.
       Users cannot provide feedback regarding the relevance of returned pages.
       Users cannot personalize the ranking mechanism that the search engine uses.
       The search engine cannot learn from past user preferences.


Current Search Engine Technology
In this section I'll briefly review the current state of search engine technology, taking Google
as a particular example.

Types of Search Engines
Current search engines are based upon huge databases of Web page references. There are two
implementations of search engines:

      Individual - Individual search engines compile their own searchable databases on the
       Web (e.g., Google)
      Meta - Metasearchers do not compile databases. Instead, they search the databases of
       multiple sets of individual engines simultaneously.

Agent-based search engines compile these searchable databases by employing spiders or
robots ( bots ) to crawl through Web space from link to link, identifying and perusing pages.
Sites with no links to other pages may be missed by spiders altogether. Once the spiders get to
a Web site, they typically index most of the words on the publicly available pages. Web page
owners submit their URLs to search engines for "crawling" and eventual inclusion in their
databases.

In ranking Web pages, search engines follow a certain set of rules. Their goal, of course, is to
return the most relevant pages at the top of their lists. To do this, they look for the location
and frequency of keywords and phrases in the Web page document and, sometimes, in the
HTML META tags. They check out the title field and scan the headers and text near the top of
the document. Some of them assess popularity by the number of links that are pointing to
sites; the more links, the greater the popularity of the page.

Search can be categorized by several fundamental types including lexical, linguistic,
semantic, meta, mathematical, SQL structured query, and XML query, as follows:

      Lexical: searches for a word or a set of words, with Boolean operators (AND, OR,
       EXCEPT).
      Linguistic: analysis allows words to be found in whatever form they take, and
       enables the search to be extended to synonyms.
      Semantic: the search can be carried out on the basis of the meaning of the query.
      Mathematical: the semantic search operates in parallel with a statistical model adapted
       to it.
      Meta: metasearch engines do not crawl the Web compiling their own searchable
       databases. Instead, they search the databases of multiple sets of individual search
       engines simultaneously, from a single site and using the same interface. Metasearchers
       provide a quick way of finding out which engines are retrieving the best results for
       your search (e.g., http://www.ixquick.com/, http://www.profusion.com/,
       http://vivisimo.com/).
      SQL structured query: a search through a sub-set of the documents of the database
       defined by SQL.
      XML structured query: the initial structuring of a document is preserved and the
       request is formulated in XPath.

Because Web search engines use keywords, they are subject to the two well-known linguistic
phenomena that strongly degrade a query's precision and recall:
        Polysemy (one word might have several meanings), and
        Synonymy (several words or phrases might designate the same concept).

To illustrate the problem, Google, with its 400 million hits per day, and over 4 billion indexed
Web pages, is undeniably the most popular commercial search engine used today, but even
with Google, you'll be aware that the results returned often include an ocean of irrelevant results,
which have been returned essentially because the Google algorithm is unable to put words into
context or extract substantial meaning from your query.

Several systems have been built to overcome these problems based on the idea of annotating
Web pages with Resource Description Framework (RDF - http://www.w3.org/RDF/)
and Web Ontology Language (OWL - http://www.w3.org/2004/OWL/) tags. I won't
discuss the details of the tags in this article - you can look up the links if you're interested.
You should bear in mind, however, that the limitation of these systems is that they can only
process Web pages that are already annotated with semantic tags.

An Example: The Google Search Algorithm
The heart of Google's search software is PageRank, a system for ranking Web pages,
developed by the founders Larry Page and Sergey Brin at Stanford University.

PageRank relies on the vast link structure as an indicator of an individual page's value.
Essentially, Google interprets a link from page A to page B as a vote, by page A, for page B.
Important sites receive a higher PageRank. Votes cast by pages that are themselves
"important" weigh more heavily and help to make other pages "important."

Google combines PageRank with sophisticated text-matching techniques to find pages that are
both important and relevant to the search. Google goes far beyond the number of times a term
appears on a page and examines all aspects of the page's content (and the content of the pages
linking to it) to determine if it's a good match for the query.

The PageRank is calculated roughly as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

where:

        PR(A) is the PageRank of a page A
        PR(T1) is the PageRank of a page T1
        C(T1) is the number of outgoing links from the page T1
        d is a damping factor in the range 0 < d < 1, usually set to 0.85

The PageRank of a Web page is therefore calculated from the PageRanks of all pages
linking to it (its incoming links), each divided by the number of outgoing links on the
linking page, and then damped by the factor d.
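
To make the formula a little more concrete, the following C# fragment shows one way the
equation could be iterated over a toy link graph until the values settle. This is purely an
illustration of the formula above, not Google's implementation; the method name, the array
representation of the graph, and the fixed iteration count are all my own assumptions.

  // A toy, illustrative PageRank calculation.
  // links[i] lists the indices of the pages that page i links to.
  static double[] ComputePageRank(int[][] links, double d, int iterations)
  {
     int n = links.Length;
     double[] pr = new double[n];
     for (int i = 0; i < n; i++)
        pr[i] = 1.0;                                     // initial guess

     for (int iter = 0; iter < iterations; iter++)
     {
        double[] next = new double[n];
        for (int i = 0; i < n; i++)
           next[i] = 1 - d;                              // the (1-d) term

        for (int i = 0; i < n; i++)
        {
           if (links[i].Length == 0)
              continue;                                  // page with no outgoing links: contributes nothing here
           double share = d * pr[i] / links[i].Length;   // d * PR(Ti) / C(Ti)
           foreach (int target in links[i])
              next[target] += share;
        }
        pr = next;
     }
     return pr;
  }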

A Stanford University start-up, Kaltix, was recently purchased by Google after it had taken
Google's model one step further, so that different search results are produced for every user
based on their preferences and history. Kaltix has published research that offers a way to
compute search results nearly 1,000 times faster than what's currently possible. Kaltix's
method is similar to looking for a tree in a forest by examining only a clump of trees, rather
than the whole forest.


Semantic Search Concepts
As Artificial Intelligence (AI) technologies become more powerful, it is reasonable to ask for
better search capabilities which can truly respond to detailed requests. This is the intent of
semantic-based search engines and semantic-based search agents. A semantic search engine
seeks to find documents that have similar 'concepts' not just similar 'words'. In order for the
Web to become a semantic network, it must provide more meaningful meta-data about its
content. This can be achieved through the use of Resource Description Framework and Web
Ontology Language tags to represent semantics, as previously mentioned; these tags can in
principle help to form the Web into a semantic network. In a semantic network, the meaning
of content is better represented and logical connections are formed between related
information.

There are two approaches to improving search results through semantic methods:

   1. Semantic Web Architecture (http://www.w3.org/2001/sw/), and
   2. Latent Semantic Indexing (LSI,
      http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm).

However, most semantic-based search engines suffer performance problems because of the
scale of the very large semantic network. In order for the semantic search to be effective in
finding responsive results, the network must contain a great deal of relevant information. At
the same time, a large network creates difficulties in processing the many possible paths to a
relevant solution.

Semantic search methods augment and improve traditional search results by using not just
words, but concepts. Several major companies are seriously addressing the issue of semantic
search, including Microsoft and Google. One approach is to garner semantic information from
existing Web pages using LSI.

Existing Semantic Search Examples
Most of the early efforts on semantic-based search engines are highly dependent on natural
language processing techniques to parse and understand the query sentence. One of the first
and most popular of these search engines is Ask Jeeves (http://www.askjeeves.com/). It
combines the strengths of natural-language parsing software, data-mining processes, and
knowledge-base creation with cognitive analysis. Users can type queries in natural language
and get satisfactory answers.

Another semantic-based example is Albert (http://www.albert.com). Its greatest advantage is
that it supports many languages in addition to English, such as French, Spanish, and German.
This kind of search engine needs a lot of people to build up a very large semantic network in
order to reach reasonable performance.

Another advanced type of Internet search engine is Cycorp (http://www.cyc.com). Cyc
combines the world's largest knowledge base with the Internet. Cyc (from en-cyc-lopedia) is an
immense, multi-contextual knowledge base. The Cyc Knowledge Server allows Internet
sites to add common-sense intelligence and distinguish between different meanings of
ambiguous concepts.

Current Semantic Search Efforts
As AI Web technologies become more advanced, using RDF and OWL tags will offer
semantic opportunities for search. However, the size of the network being searched will
establish the complexity of the solution space and therefore drastically affect the likelihood of
successful results.

Several major companies are seriously addressing the issue of semantic search. Microsoft's
growth on the Web may depend on its ability to compete with search leader Google. As a
result, Microsoft has launched a new search program called MSNBot, which scours the Web
to build an index of HTML links and documents. MSNBot is planned as a technology that
binds applications to the Windows operating system. Microsoft could then connect the search
engine of its MSN portal to its next version of Windows (code-named Longhorn), which will
make it easier to search e-mail, spreadsheets, and documents on PCs and corporate networks, as
well as the Web.

Google has increased its commitment to content-targeted advertising with products that are
based on semantic technology, which understands, organizes, and extracts knowledge from
websites and information repositories in a way that mimics human thought and enables more
effective information retrieval. A key application of the technology is the representation of the
key themes on Web pages to deliver highly relevant and targeted advertisements.

The business of commercial search has become very profitable. With an estimated 500
million online searches taking place daily, the targeted ad business is predicted to generate
more than $7 billion annually within four years, according to analysts.


Stemming
A semantic search engine is a remarkably useful solution. It can discover if two documents
are similar even if they do not have any specific words in common and it can reject
documents that share only uninteresting words in common. An important piece of preparatory
work to do in this context before comparing documents is stemming, which is the process of
converting all the different grammatical forms of a word to one single base form. The term
derives from the fact that in English, most grammatical forms are composed by adding a
suffix onto the end of the basic word (the stem), although it's worth noting that this isn't
necessarily the case in all languages. In English, it's also common to form words by adding
prefixes to other words; however, almost invariably the prefix completely changes the
meaning (e.g. from variably to invariably), so that as far as semantic search engines are
concerned, there is no interest in prefix removal, at least in English.

Some of the preparatory work needed to get documents ready for indexing is very language-
specific, such as stemming. For English documents, you can use an algorithm called the
Porter stemmer to remove common endings from words, leaving behind an invariant root
form. The Porter stemmer is important because it can hugely improve the recall of a search.
For example, suppose you are looking for articles about harvesting (the annual gathering in of
agricultural crops). If you are using a search engine that matches exact words, then what
should you type in? Web pages about this topic might use the noun in either singular or
plural form - harvest or harvests - or they might use the verb, which, depending on its
grammatical position, can appear as harvest, harvests, harvested or harvesting. By using
stemming, the common endings can be removed, so that if both the search term and the
documents being searched are stemmed, a match will be made independently of the
grammatical form of the word. However, if you think that stemming is a simple matter of
removing a couple of common endings, you'd be mistaken. Consider for example hydrology ,
the science dealing with the circulation and flow of water - hydrology is used extensively, for
example, for computing flood risks. A person who practices hydrology might be termed a
hydrologist (note the removal of the final y ) while the word turns up in the adjective form
hydrological or the adverb hydrologically. Other common phenomena in English are the doubling
of final consonants or the removal of a final e when adding endings to words. And of course, if
you are to stem words, the rules will be totally different for each language.


The Stemming Sample
I'm going to finish this article by presenting a sample which illustrates stemming based
roughly on the Porter algorithm. As you work through the code you'll see just how complex
the rules for stemming actually are. Although in the discussion I've focused on stemming, the
sample also filters the document by removing common words, such as an, that, or, etc. - the
kinds of words that occur everywhere and which it doesn't really make sense to search for.
When working through the code, you should bear in mind that stemming treats individual
words. Hence, although it's very good at mapping together different words that have the same
root and are therefore likely to have similar meanings, it makes no attempt to analyze the
grammar or context.

You might expect a sample of this nature which focuses heavily on text processing to use
regular expressions. In fact the sample presented here uses string manipulation. The reason for
that is twofold: Firstly, using regular expressions would make the sample far less clear: By
manipulating the System.String class directly, the sample becomes virtually self-documenting
and far easier to debug. And secondly, the complexity of the sample arises largely from the
number of different special cases that must be treated, not from any complexity in the
underlying algorithm, so regular expressions are arguably not as useful here as they would be
for algorithmically more complex work.

The Sample in Action
When the sample starts up, it simply displays a text box inviting you to type (or paste) in
some text to be stemmed. Figure 1 shows the situation where I've just pasted in a paragraph of
an early draft of this article.
Figure 1. The sample application.

Figure 2 shows what the application does with this text, when I click the ProcessText button.
Figure 2. The application, after stemming.

Figure 2 shows that quite a few of the words have been reduced to more basic forms. For
simplicity, punctuation has also been removed - that doesn't really matter here. For example,
the third person singular form of the verb to require, requires, has been reduced to require.
Similarly, handling has become handle while based has become base. On the other hand, the
screenshot shows some problems. On the one hand, words have been missed (given has
remained given; ideally it would be changed to give), and on the other hand some words have
been incorrectly shortened. For example, seed has become see (the simple algorithm employed
assumes that an eed ending implies the past tense of a verb, and so knocks off the final d).
The effect of such errors on a search would be to reduce the precision (i.e. increase the rate of
false positives) since unlike words would be treated as the same word. However, this must be
balanced against the vast improvement in recall that is made possible. On balance, applying the
algorithm as it stands is likely to improve searches, and obviously that improvement could be
made better by making the algorithm more complex.

The Code
So that's what the application looks like; now let's look at the algorithm used. I won't bother
presenting the code for the user interface since that's fairly trivial - basically an event handler
for the Process Text button. Instead I'll focus on the code that actually does the stemming.

The algorithm is based on two classes, Document and Stemmer .

The Document Class
Document holds the complete text to be processed, in a member field called text:

  class Document
  {
     string text;
     // other methods etc.
  }


Stemmer, on the other hand, is responsible for the stemming of individual words.

Let's examine the Document class first. I'll start off by showing you a couple of useful
arrays. These respectively store a list of the common words that can be removed, and a list of
all the characters that might be used to separate words. These characters can also be removed
and replaced by spaces.

  public static string[] commonWords =
  {
     "the", "at", "of", "and", "a", "in", "to", "it", "is", "was", "i", "for",
     "you", "he", "be", "with", "on", "that", "by", "are", "not", "this", "but",
     "they", "his", "from", "had", "she", "which", "or", "an", "were",
     "been", "have", "their", "has", "would", "what", "will", "there",
     "if", "can", "all", "as", "who", "do"
  };

  private static char[] separators =
  {
     ' ', '!', '@', '\t', '$', '%', '^', '&', '*', '(', ')', '\n', '\r',
     '-', '_', '+', '=', '{', '}', '[', ']', ':', ';', '\"',
     '\'', '<', '>', ',', '.', '?', '/', '~', '`'
  };


Hence you see that there are two stages to the process: first, all words and characters in
these lists are removed; second, the words that remain are stemmed.

The following code snippet shows this process. The method, GetProcessedText(), returns a
string that contains the new text to be displayed.

  public string GetProcessedText()
  {
     Stemmer stemmer = new Stemmer();
      string [] words = text.Split(separators);
      StringBuilder result = new StringBuilder();
      int nWords = 0;
      for (int i=0 ; i<words.Length ; i++)
      {
         if (words[i].Length <= 1)
            continue;
         bool isCommonWord = false;
         string wordLower = words[i].ToLower();
         for (int j=0 ; j<commonWords.Length ; j++)
         {
            if (wordLower == commonWords[j])
            {
               isCommonWord = true;
               break;
            }
         }
         if (isCommonWord)
            continue;
         if(nWords > 0)
            result.Append(" ");
         string stem = stemmer.Stem(words[i]);
         result.Append(stem);
         ++nWords;
      }
      return result.ToString();
  }


As you can see from this code, it essentially works by creating a StringBuilder called result
and working through each word in the initial text in turn, appending it to result if it is not a
common word. Before being appended, each word is stemmed using a call to the method
Stemmer.Stem(). The Stemmer class is what you'll examine next.
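
One small design note: the inner loop compares each word against every entry in commonWords.
If you adapt the sample to larger documents you could, as a variation that is not part of the
original sample (and which requires .NET 3.5 or later for HashSet), replace the array scan with
a set lookup:

  // Hypothetical variation: O(1) common-word lookups instead of scanning the array.
  private static readonly HashSet<string> commonWordSet =
     new HashSet<string>(Document.commonWords);

  private static bool IsCommonWord(string wordLower)
  {
     // wordLower is already lower-cased by the caller, as in GetProcessedText().
     return commonWordSet.Contains(wordLower);
  }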


The Stemmer Class
The crucial method is Stem() , which stems a word. A quick look at this method shows that
the process can be broken down into a number of steps, each one controlled by a method that
removes certain types of ending.

  public string Stem(string word)
  {
     this.word = word;
     ReplaceCommonEndings();
     RemoveTerminalY();
     RemoveDoubleSuffixes();
     RemoveCommonSuffixes();
     return this.word;
  }


Notice also from this method that the word being processed is stored in a member field called
word.

  public class Stemmer
  {
     string word;
     // other methods etc
  }
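
As a quick usage sketch (this fragment is mine, not part of the sample's UI code, and assumes
the System namespace), the class can be exercised directly from a small test method; the
expected results in the comments correspond to the examples discussed alongside Figure 2.

  // Hypothetical quick test of the Stemmer class.
  static void TestStemmer()
  {
     Stemmer stemmer = new Stemmer();
     Console.WriteLine(stemmer.Stem("requires"));   // "require"
     Console.WriteLine(stemmer.Stem("handling"));   // "handle"
     Console.WriteLine(stemmer.Stem("based"));      // "base"
  }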


Some Utilities
Before you examine the methods to perform the various steps, I'll quickly present some helper
utility methods and properties. These should be self-explanatory.

First up, a couple of properties that respectively retrieve the last and the second-last character
of the word.

  char LastChar { get { return word[word.Length-1];}}
  char LastButOneChar { get { return word[word.Length-2];}}


There's also a helper method that checks whether a given character is a vowel:

  bool IsVowel(char ch)
  {
     // Check against the five English vowels, case-insensitively.
     char chLower = char.ToLower(ch);
     return chLower == 'a' || chLower == 'e' ||
        chLower == 'i' || chLower == 'o' || chLower == 'u';
  }


Next, a method that checks whether the word ends with a certain string, supplied as a
parameter. Note that the comparison, as is the case with all comparisons in this class, is case-
insensitive.

  bool EndsWith(string ending)
  {
     if (word == null || ending == null ||
        ending.Length > word.Length)
        return false;
     string actualEnding = word.Substring(word.Length-ending.Length);
     return string.Compare(ending, actualEnding, true) == 0;
  }


Now we come to the helper methods that perform the real work. ReplaceEnding() replaces a
specified old ending with a new ending. For example, if the word is mating and you call this
method passing the parameters ing and e, the word will be converted to mate.

  bool ReplaceEnding(string oldEnding, string newEnding)
  {
     // Use the case-insensitive EndsWith() helper defined above.
     if (EndsWith(oldEnding))
     {
        word = word.Substring(0, word.Length-oldEnding.Length) + newEnding;
        return true;
     }
     return false;
  }


As you can see from the code, this method returns true if the replacement was made, and false
if the replacement was not possible (because the word didn't actually end in the specified old
ending in the first place). As you'll see later, this return value is important because in many
cases it lets the computer know that the ending has now been processed, so there is no need to
look for certain other endings (for example if you've removed a plural by changing ches to ch ,
then you certainly don't want to carry on looking for plural forms).

The RemoveEnding() method is very similar to ReplaceEnding() , but it simply removes a
specified ending, not replacing it with anything.

  bool RemoveEnding(string oldEnding)
  {
      // Again, use the case-insensitive EndsWith() helper.
      if (EndsWith(oldEnding))
      {
         word = word.Substring(0, word.Length-oldEnding.Length);
         return true;
      }
      return false;
  }


Finally, consider the situation of the suffix ing, as in, for example, carrying. This suffix forms
the present participle of verbs and normally needs to be removed to form the basic form of the
verb (carry). However, this suffix also occurs in words like sing and thing - so there are cases
where it shouldn't be removed. Because of this possibility, in some cases, as a first
approximation, the program tests whether or not the rest of the word (the stem)
contains a vowel - this test determines whether or not an ending will be removed. Hence you
need a method to test for this possibility:

  bool EndsWithAndVowelInStem(string ending)
  {
     bool result = EndsWith(ending);
     if (result == true)
     {
        // The stem is everything before the ending.
        string stem = word.Substring(0, word.Length-ending.Length);
        result = stem.IndexOfAny(
           new char[]{'a','e','i','o','u','A','E','I','O','U'}) >= 0;
     }
     return result;
  }


Removing Plurals and Participles

Now you've seen the utilities to assist in stemming, I'll work through the code that controls the
logic of which combinations of letters are actually removed. These are simply the four methods
that you saw earlier being called from the Stem() method. To some extent the division of
which endings are covered by which method is arbitrary, and has been made partly for
algorithmic convenience and in order to avoid one exceptionally long method. For these
methods I'll simply present the code without further discussion, since to a large extent you are
dealing with large switch statements, so the logic should be quite easy to follow from the
code.

The first method is called ReplaceCommonEndings() , and it is responsible for getting rid of
plurals and word endings such as "ing."

  /* ReplaceCommonEndings() gets rid of plurals and -ed or -ing. e.g.
           caresses -> caress
           ponies    -> poni
           ties      -> ti
           caress    -> caress
           cats      -> cat

            feed        ->   feed
            agreed      ->   agree
            disabled    ->   disable

            matting     ->   mat
            mating      ->   mate
            meeting     ->   meet
            milling     ->   mill
            messing     ->   mess
            meetings   ->   meet

   */
private void ReplaceCommonEndings()
{
   bool done = false;
   if (word[word.Length-1] == 's')
   {
      done = ReplaceEnding("sses", "ss");
      if (!done)
         done = ReplaceEnding("ches","ch");
      if (!done)
         done = ReplaceEnding("shes","sh");
      if (!done)
         done = ReplaceEnding("ies","i");
      if (!done)
         done = RemoveEnding("'s");
      if (!done)
         done = RemoveEnding("s");
   }
   if (!done)
      done = ReplaceEnding("eed", "ee");
   if (done)
      return;

    bool ingOrEdRemoved = false;
    if (EndsWithAndVowelInStem("ed"))
    {
       RemoveEnding("ed");
       ingOrEdRemoved = true;
    }
    if (EndsWithAndVowelInStem("ing"))
    {
       RemoveEnding("ing");
       ingOrEdRemoved = true;
    }

    //   now compensate for the removed 'ed' or 'ing'
    //   this means restoring a final 'e' for combinations vowel-consonant (except y)
    //   and consonant(except l)-l
    if   (ingOrEdRemoved)
    {
         char lastChar = char.ToLower(LastChar);
         char lastButOneChar = char.ToLower(LastButOneChar);

         if (lastChar == 'l' && lastButOneChar != 'l' &&
            !IsVowel(lastButOneChar))
            word += "e";

         else if (!IsVowel(lastChar) && IsVowel(lastButOneChar))
            word += "e";

         else
         {
            string removeDoubles="bcdfgjkmnpqrtvwx";
            for (int i=0; i<removeDoubles.Length; i++)
            {
               string strSingle = removeDoubles[i].ToString();
               string strDouble = new string(removeDoubles[i],2);
               done = ReplaceEnding(strDouble, strSingle);
               if (EndsWith(strSingle) && IsVowel(word[word.Length-2]))
                  ReplaceEnding(strSingle, strSingle + "e");
               if (done)
                  break;
            }
         }
    }
}
Replacing Terminal Y

Quite often in English, a terminal y is converted to an i, or vice versa, between different forms
of a word. This method changes a terminal y to an i, which then makes it easier for the following
methods to work - since that means the remaining parts of the algorithm can work with just
one possibility: that the word ends in i.

  // RemoveTerminalY() turns terminal y to i
  // when there is another vowel in the stem.

  private void RemoveTerminalY()
  {
     if (EndsWithAndVowelInStem("y"))
        ReplaceEnding("y", "i");
  }


Removing Double Suffixes

This method removes certain double suffixes.

  // RemoveDoubleSuffixes() maps double suffixes to single ones,
  // so e.g. -ization ( = -ize plus -ation) maps to -ize. (The full Porter
  // algorithm also requires the stem before the suffix to satisfy a
  // measure condition, m() > 0, which this simplified sample omits.)

  private void RemoveDoubleSuffixes()
  {
     bool done;
     // Dispatch on the second-to-last character to narrow down which suffixes to test.
     switch (word[word.Length - 2])
     {
        case 'a':
           done = ReplaceEnding("ational", "ate");
           if (!done)
              done = ReplaceEnding("tional", "tion");
           break;
        case 'c':
           done = ReplaceEnding("enci", "ence");
           if (!done)
              done = ReplaceEnding("anci", "ance");
           break;
        case 'e':
           done = ReplaceEnding("izer", "ize");
           break;
        case 'l':
           done = ReplaceEnding("bli", "ble");
           if (!done)
              done = ReplaceEnding("alli", "al");
           if (!done)
              done = ReplaceEnding("entli", "ent");
           if (!done)
              done = ReplaceEnding("eli", "e");
           if (!done)
              done = ReplaceEnding("ousli", "ous");
           break;
        case 'o':
           done = ReplaceEnding("ization", "ize");
           if (!done)
              done = ReplaceEnding("ator", "ate");
           if (!done)
              done = ReplaceEnding("ation", "ate");
           break;
        case 's':
           done = ReplaceEnding("alism", "al");
           if (!done)
                done = ReplaceEnding("iveness", "ive");
             if (!done)
                done = ReplaceEnding("fulness", "ful");
             if (!done)
                done = ReplaceEnding("ousness", "ous");
             break;
          case 't':
             done = ReplaceEnding("aliti", "al");
             if (!done)
                done = ReplaceEnding("iviti", "ive");
              if (!done)
                 done = ReplaceEnding("biliti", "ble");
             break;
          case 'g':
             done = ReplaceEnding("logi", "log");
             break;
          default:
             break;
      }
  }


Removing Common Single Suffixes

And finally, the following method removes certain other suffixes that haven't yet been dealt
with.

  // RemoveCommonSuffixes() deals with -icate, -ful, -ness etc.,
  // using a similar strategy to RemoveDoubleSuffixes().

  private void RemoveCommonSuffixes()
  {
     bool done;
     switch (LastChar)
     {
        case 'e':
           done = ReplaceEnding("icate", "al");
           if (!done)
              done = RemoveEnding("ative");
           if (!done)
              done = ReplaceEnding("alize", "al");
           break;
        case 'i':
           done = ReplaceEnding("iciti", "ic");
           break;
        case 'l':
           done = ReplaceEnding("ical", "ic");
           if (!done)
              done = RemoveEnding("ful");
           break;
        case 's':
           done = RemoveEnding("ness");
           break;
     }
  }


Conclusion
Today, searching the Web is an essential capability, whether you are sitting at your desktop
PC or wandering the corporate halls with your wireless PDA. However, even with Google, it
is difficult to find the right bit of data that you need and interface with the search results
efficiently.
In this article, I reviewed the current state of search engine technology and mentioned some of
the work that is being done to move towards semantic search engines and semantic search
agents, concluding with a sample that illustrated how to perform stemming of words.

In the next article of the series, you'll learn about the research efforts towards semantic search
engines in a bit more detail and also examine the concept of the semantic web.




Introduction
This is the second of a series of articles that explores the development of semantic search
engines. Most of today's search engines are based on matching keywords, which is
simple to implement, but doesn't on the whole do a good job of identifying the context and the
meaning behind your intentions when you do a search with Google or other search engines.
Semantic searching can be considered as the next generation of search algorithms. A semantic
search uses one of a couple of algorithms that - very roughly speaking - attempt to match
documents likely to be using the word in the required context, and offers the hope of
considerable improvements in search result accuracy.

In the previous article, I reviewed the state of current technology. In this article, I'll explore
semantic search engines and semantic search agents , including their current development
and progress. I'll present efforts being made to implement semantic search by Google, MSN
and other innovators. The structure of the article is that I'll start by explaining what a web
search agent is and how it differs from a web search engine. I'll move from there onto a
conceptual discussion of some of the theoretical/mathematical problems that constrain the
development of web search agents. I'll then describe the two main techniques currently being
developed for semantic search engines: semantic web architecture and latent
semantic indexing, before shifting the topic somewhat and presenting the sample code.

The sample code for this article doesn't itself illustrate web search agents (a sample that
adequately illustrated the algorithms I'll be talking about in this article would fill a book, not
an ASP Today article), but it does illustrate an important preliminary step in semantic
searches: stemming words to identify all similar words. The sample is a development of the
sample presented in the part 1 article, http://www.asptoday.com/Content.aspx?id=2326; it
performs a search against a text 'document' (actually a bunch of text pasted into a textbox) and
highlights all occurrences of words that have the same root as the search term. Figure 1 shows
the sample after performing a search.
Figure 1. The sample code after performing a search.

I'll discuss the sample later in the article.

In a follow-up article, Building an ASP.NET Speech-Enabled Google-Powered Semantic-
Based Search Engine, to be published soon, I'll present a detailed example of a semantic-
based ASP.NET search engine along with its C# source code.
System Requirements
To run the sample code for this article you simply need a computer running VS .NET and IIS.


Installing and Compiling the Sample Code
The sample code for this article is a plain ASP.NET application. There are no special
installation instructions, other than that you unzip the files into a virtual directory called
StemAndSearch.



Introducing Web Search Agents
While Web search engines are powerful and important to the future of the Web, there is
another form of search that is also critical: Web search agents. A Web search agent is not
the same as a commercial search engine. Search engines use database lookups from a
Knowledge Base, whereas in the case of the Web search agent, the Web itself is searched and
the computer provides the interface with the user. The agent's percepts are documents
connected through the Internet utilizing HTTP. The agent's actions are to determine if its goal
of seeking a Website containing a specified target (e.g., key word or phrase), has been met
and if not, find other locations to visit. It acts on the environment using output methods to
update the user on the status of the search or the end results. The real value of this
methodology is achieved by caching results and associating results with particular users, so
that the agent will produce better, more relevant searches more quickly over time.

What makes the agent intelligent is its ability to make a rational decision when given a choice.
In other words, given a goal, it will make decisions to follow the course of actions that would
lead it to that goal in a timely manner.

An agent can usually generate all of the possible outcomes of an event, but then it will need to
search through those outcomes to find the desired goal and execute the path (sequence of
steps) starting at the initial or current state, to get to the desired goal state. In the case of the
intelligent Web search agent, it will need to utilize a search to navigate through the Web to
reach its goal.

Building an intelligent Web search agent requires mechanisms for multiple and combinational
keyword searches, exclusion handling, and the ability to self-seed when it exhausts a search
space. Given a target, the Web search agent should proceed to look for it through as many
paths as are necessary. This agent will be keyword based. The method advocated is to start
from a "seed" location (user provided) and find all other locations linked in a tree fashion to
the root (seed location) that contain the target.

The search agent needs to know the target (i.e., key word or phrase), where to start, how many
iterations of the target to find, how long to look (a time constraint), and what methods should
determine the criteria for choosing paths (search methods). These issues are addressed in the
software.

Implementation requires some knowledge of general programming, working with sockets, the
Hypertext Transfer Protocol (HTTP), Hypertext Markup Language (HTML), sorting, and
searches.

There are many languages with Web based utilities, advanced application programming
interfaces (APIs), and superior text parsing capabilities that can be used to write a Web search
agent.

Using a more advanced, efficient sorting algorithm will help improve the performance of the
Web search agent.

The Web search agent design consists of four main phases: initialization, perception, action,
and effect. In the initialization phase the Web search agent should set up all variables,
structures, and arrays. It should also get the base information it will need to conduct the
"hunt" - the target, the goal, a place to start and the method of searching. The perception
phase is centered on using the knowledge provided to contact a site and retrieve the
information from that location. It should identify if the target is present and should identify
paths to other URL locations. The action phase takes all of the information that the system
knows and determines if the goal has been met (the target has been found and the hunt is
over).

If the hunt is still active, it must make the decision on where to go next. This is the intelligence
of the agent, and the method of search dictates how "smart" the Web agent will be. If a match
is found, the hunt is complete, and it provides output to the user. Figure 2 shows the algorithm
involved for the hunt.

The Web search agent moves from the initialize phase to a loop consisting of the perception,
action, and effect phases until the goal is achieved or cannot be achieved.
Figure 2. Web Search Basic Flow
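
To make the four phases a little more concrete, here is a hedged C# sketch of the hunt loop,
assuming the usual System and System.Collections.Generic namespaces. The fetch and
extractLinks delegates stand in for real HTTP and HTML-parsing code, which is not shown,
and the simple queue-based frontier is only one of the possible search methods mentioned
above.

  // Illustrative sketch only: one pass of the initialize/perceive/act/effect loop.
  static string Hunt(string target, string seedUrl, int maxVisits,
                     Func<string, string> fetch,
                     Func<string, IEnumerable<string>> extractLinks)
  {
     Queue<string> frontier = new Queue<string>();         // initialization phase
     frontier.Enqueue(seedUrl);
     int visits = 0;

     while (frontier.Count > 0 && visits < maxVisits)
     {
        string url = frontier.Dequeue();
        string page = fetch(url);                          // perception phase: retrieve the page
        visits++;

        // action phase: has the goal been met?
        if (page.IndexOf(target, StringComparison.OrdinalIgnoreCase) >= 0)
           return url;                                     // effect phase: report the hit to the caller

        foreach (string link in extractLinks(page))        // action phase: decide where to go next
           frontier.Enqueue(link);
     }
     return null;                                          // effect phase: the goal was not achieved
  }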


Searching Techniques
Semantic search deals with concepts and logical relationships. If you examine the practical
problems of semantic search, you will find that the search tree faces an incompleteness of
logic resulting in the "Incompleteness Problem," or the "Halting Problem."

Introducing the Incompleteness Problem
First, let's consider the "Incompleteness Problem." Inference can be viewed as a sequence of
logical deductions chained together. At each point along the way, there might be different
ways to reach a new deduction. So, in effect, there is a branching set of possibilities for how
to reach a correct solution. And that branching set can spread out in novel ways.

For example, you might want to try to determine "Who does Kevin Bacon know?" based on
information about his family relationships, his movies, or his business contacts. So, there's
more than one path to some conclusions. This results in a branching set of possibilities.
Therefore, the inference in our system is a kind of search problem, displayed as a search tree.

It is possible to start at the top of the tree, the root, or with the branches. The top of the tree
can be the query asked. Each step down to child nodes in this tree can be viewed as one
potential logical deduction that moves toward trying to prove the original query using this
logical deductive step. The fan out of possibilities can be viewed as this branching tree,
getting bushier and deeper. Each of the approaches ends up being one of the child steps, to a
child node.

Imagine that each node in this tree represents something to prove. Each link from a parent
node higher to a child node represents one logical statement. Now the problem is that you
have a big tree of possibilities.

In a complex logical system, there are an arbitrarily large number of potential proofs. Some of
them are arbitrarily long, and it is uncertain if there is a proof. Gödel proved in the 1930s that
any sufficiently complicated logical system is inherently incomplete (undecidable). In other
words, there are statements that cannot be logically proven. His argument for that is related to
the other problem, the halting problem.

The halting problem implies that certain algorithms will never end in an answer. When you talk
about the Web, you're talking about millions of facts and tens of thousands of rules that can
chain together in arbitrarily complicated and interesting ways; so the space of potential proofs
is infinite and the tree becomes logically infinite. Due to this, you will run into some inherent
incompleteness issues; for example, you cannot look at every possible proof and collect all the
answers.

You run into incompleteness because the search tree is too large. So our approach must be to
only search portions of the tree. There are well-known strategies for how one addresses search
problems like this. Two such strategies are to search the tree in either a depth-first or a
breadth-first fashion. I'll discuss these next.

The Depth-First Strategy
A depth-first search would start at the top of the tree and go as deeply as possible down some
path, expanding nodes as you go, until you find a dead end. A dead end is either a goal
(success) or a node where you are unable to produce new children. So the system can't prove
anything beyond that point.

Let's walk through a depth-first search and traverse the tree. You start at the top node and go
as deeply as possible:

    1. Start at the highest node.
    2. Go as deeply as possible down one path.
   3. When you run into a dead-end, back-up to the last node that you turned away from. If
      there is a path there that you haven't tried, go down it. Follow this option until you
      reach a dead-end or a goal.
   4. This path leads to another dead-end, so go back up a node and try the other branch.
   5. This path leads to a goal. In other words, this final node is a positive result to the
      query. So you have one answer. Keep searching for other answers by going up a
      couple more nodes and then down a path you haven't tried.
   6. Continue until you reach more dead-ends and have exhausted search possibilities.

The advantage of depth-first search is that it is an algorithmically very efficient way to search
trees. It limits the amount of space that you have to keep for remembering the
things you haven't looked at yet. All you have to remember is the path back up.

The disadvantage with depth-first search is that once you get started down some path, you
will trace it all the way to the end.
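
To make the strategy concrete, here is a minimal C# sketch of a depth-first search over an
explicit proof tree. The Node type is invented purely for illustration; a real inference engine
would generate child nodes lazily rather than store them all up front.

  // Illustrative node type: a statement to prove, with its possible sub-proofs.
  class Node
  {
     public bool IsGoal;
     public List<Node> Children = new List<Node>();
  }

  // Depth-first: follow one path as deeply as possible before backing up.
  static Node DepthFirst(Node node)
  {
     if (node.IsGoal)
        return node;                        // success: a proof was found
     foreach (Node child in node.Children)  // otherwise go deeper down each path in turn
     {
        Node found = DepthFirst(child);
        if (found != null)
           return found;
     }
     return null;                           // dead end: back up to the caller
  }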

The Breadth-First Strategy
Another strategy for searching is a breadth-first search. Here you search layer by layer. First
you try to do all of the zero-step proofs then you try to do all of the one-step proofs, etc. The
advantage of breadth-first search is that you're guaranteed to get the simplest proofs before
you get anything that's strictly more complicated. This is referred to as the Ockham's Razor
benefit. If there is an n-step proof, you'll find it before you look at any n+1-step proofs. The
disadvantage of breadth-first search is that, as well as huge deep trees, you also have huge
bushy trees, where a node could have thousands, or tens of thousands, of child nodes. Another
disadvantage of breadth-first searching is the amount of space you have to use to store what
you haven't examined as yet. So, if the third layer is explosively large, you would have to
store all of the third-level results before you could even look at them. With a breadth-first
search, the deeper you go into the tree, the more space is required.
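
The breadth-first variant, reusing the illustrative Node type from the previous sketch, keeps a
queue of unexamined nodes; that queue is exactly the storage cost described above.

  // Breadth-first: examine the tree layer by layer, shallowest nodes first.
  static Node BreadthFirst(Node root)
  {
     Queue<Node> frontier = new Queue<Node>();
     frontier.Enqueue(root);
     while (frontier.Count > 0)
     {
        Node node = frontier.Dequeue();      // shallowest unexamined node
        if (node.IsGoal)
           return node;                      // the simplest proof is found first
        foreach (Node child in node.Children)
           frontier.Enqueue(child);          // the whole next layer gets stored
     }
     return null;
  }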


So you find that two of the traditional algorithms for search, depth-first and breadth-first,
are going to run into problems with large systems.

Informed vs Uninformed Searching
There are two basic classes of search algorithms used to attempt to overcome the
incompleteness and halting limitations: uninformed and informed. Uninformed, or blind,
searches are those that have no information about the number of steps or the path cost from
the current state to the goal. These searches include: depth-first, breadth-first, uniform-cost,
depth-limiting and iterative deepening search. Informed, or heuristic, searches are those that
have information about the goal; this information is usually either an estimated path cost to it
or estimated number of steps away from it. This information is known as the search agent
heuristic. It allows informed searches to perform better than the blind searches and makes
them behave in an almost "rational" manner. These searches include: best-first, hill-climbing,
beam, A*, and IDA* (iterative deepening A*) searches.
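
As a rough sketch of the difference, a best-first search can reuse the same illustrative Node
type from the earlier sketches, ordering the frontier by an estimated distance to the goal
supplied by the caller. The heuristic delegate is an assumption made for illustration, and a
real implementation would use a priority queue rather than re-sorting a list on each step.

  // Best-first: always expand the node the heuristic rates as closest to the goal.
  static Node BestFirst(Node root, Func<Node, double> heuristic)
  {
     List<Node> frontier = new List<Node>();
     frontier.Add(root);
     while (frontier.Count > 0)
     {
        // Sort so that the most promising node (lowest estimate) comes first.
        frontier.Sort((a, b) => heuristic(a).CompareTo(heuristic(b)));
        Node node = frontier[0];
        frontier.RemoveAt(0);
        if (node.IsGoal)
           return node;
        frontier.AddRange(node.Children);
     }
     return null;
  }
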
Improved Semantic Methods
Semantic search methods augment and improve traditional search results by using not just
words, but concepts. Several major companies are seriously addressing the issue of semantic
search. There are two approaches to improving search results through semantic methods: (1)
Semantic Web Architecture and (2) Latent Semantic Indexing (LSI).

Understanding Semantic Web Architecture
Semantic Web Architecture has been developed based on the idea of annotating Web pages
with Resource Description Framework (RDF) and Web Ontology Language (called OWL)
tags to represent detailed semantic ontologies . However, the limitation of these systems is
that they can only process Web pages that are already annotated with appropriate semantic
tags.

An ontology describes concepts and relationships with a set of representational vocabulary. The
aim of building ontologies is to share and reuse knowledge. Since the Semantic Web is a
distributed network, there are different ontologies that describe semantically equivalent
things. As a result, it is necessary to map elements of these ontologies if you want to process
information on the scale of the Web. One approach to semantic search is based on text
categorization for ontology mapping: it compares each element of one ontology with each
element of the other ontology, and then determines a similarity metric on a per-pair basis.
Matched items are those whose similarity values are greater than a certain threshold.

An example of semantic search technology is TAP (http://tap.stanford.edu/), a
distributed project involving researchers from Stanford, IBM, and the W3C. TAP leverages
automated and semi-automated techniques to extract knowledge bases from unstructured and
semi-structured bodies of text. The system is able to use previously learned information to
learn new information, and can be used for information retrieval.

In TAP, existing documents are analyzed using semantic techniques and converted into
Semantic Web documents using automated techniques or manually by the document author
using standard word processing packages. Traditional information retrieval techniques are
enhanced with more deeply structured knowledge to provide more accurate results. Both
automated and guided analysis uses intelligent reasoning systems and agents.

The solutions are built on a core technology called Semantic Web Templates. Utilizing
knowledge representation, the creation, consumption, and maintenance of knowledge
becomes transparent to the user. The Resource Description Framework (RDF) data model is the
foundation of Semantic Web knowledge representation technology, and TAP uses RDF
Schema and OWL.

Creating the knowledge itself requires a "knowledge engineer" who translates
documents into the required symbolic and logical languages. Ontologies forming the core
vocabulary of the knowledge are required in order to define the concepts and the relations that
hold between instances of the concepts.

Understanding Latent Semantic Indexing
Latent Semantic Indexing (LSI) is an information retrieval method that organizes existing
information into a semantic structure that takes advantage of some of the implicit higher-order
associations of words with text objects. The resulting structure reflects the major associative
patterns in the data. This permits retrieval based on the "latent" semantic content of existing
Web documents, rather than just on keyword matches. LSI offers an approach that can be
applied immediately to existing Web documents.


Implementing an LSI-based Search
So far I have reviewed search technology in general, and identified today's limitations along
with the problems facing potential future technologies based upon the Semantic Web. Now I
will discuss implementing Latent Semantic Indexing, which may improve today's search
capabilities without the extreme limitations of searching large Semantic Web networks.

Building on the criteria of precision, ranking and recall requires more than brute force.
Assigning descriptors and classifiers to a text provides an important advantage, by returning
relevant documents that don't necessarily contain a verbatim match to our search query. Fully
described data sets can also provide a picture of the scope and distribution of the document
collection as a whole. This can be accomplished by examining the structure of categories and
sub-categories (called a taxonomy).

A serious drawback to this approach to categorizing data is the problem inherent in any kind
of taxonomy - the world sometimes resists categorization. For example, is a tomato a fruit or a
vegetable?

And what happens when you combine two document collections indexed in different ways?
Reconciling them means mapping between the taxonomies involved, which is where the
ontology-mapping techniques described earlier come in.

Regular keyword searches treat a document collection in strictly binary terms: either a
document contains a given word or it doesn't.

Latent semantic indexing (LSI), first developed at Bellcore in the late 1980s, adds an
important step to the document indexing process. In addition to recording which keywords a
document contains, the method examines the document collection as a whole, to see which
other documents contain some of those same words. LSI considers documents that have
many words in common to be semantically close, and ones with few words in common to be
semantically distant. Although the LSI algorithm doesn't understand anything about what the
words mean, it notices the patterns.
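
A rough way to see this numerically is to treat each document as a bag of word counts and measure the cosine of the angle between the two count vectors: documents sharing many words score close to 1 and documents sharing none score 0. This is only a toy illustration of "words in common"; full LSI goes further and uses the whole term-document collection to expose the higher-order associations described above. The documents below are invented for the example.

  using System;
  using System.Collections.Generic;
  using System.Linq;

  class WordOverlapDemo
  {
     // Cosine similarity between two bag-of-words documents.
     static double Cosine(string docA, string docB)
     {
        var a = Counts(docA);
        var b = Counts(docB);
        double dot = a.Keys.Intersect(b.Keys).Sum(w => (double)a[w] * b[w]);
        double normA = Math.Sqrt(a.Values.Sum(c => (double)c * c));
        double normB = Math.Sqrt(b.Values.Sum(c => (double)c * c));
        return dot == 0 ? 0.0 : dot / (normA * normB);
     }

     // Word counts for a single document, lower-cased, split on spaces.
     static Dictionary<string, int> Counts(string doc)
     {
        return doc.ToLower().Split(' ')
                  .GroupBy(w => w)
                  .ToDictionary(g => g.Key, g => g.Count());
     }

     static void Main()
     {
        Console.WriteLine(Cosine("semantic search engines index documents",
                                 "search engines index web documents").ToString("F2"));
        Console.WriteLine(Cosine("semantic search engines index documents",
                                 "tomato fruit vegetable").ToString("F2"));
     }
  }

The first pair, which shares most of its words, scores 0.80; the second pair, which shares none, scores 0.00.
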

Searching for Similarity with LSI
When you search an LSI-indexed database, the search engine looks at similarity values it has
calculated for every content word, and returns the documents that it thinks best fit the query.
Because two documents may be semantically very close even if they do not share a particular
keyword, LSI does not require an exact match to return useful results. Where a plain keyword
search will fail if there is no exact match, LSI will often return relevant documents that don't
contain the keyword at all.
Latent semantic indexing looks at patterns of words within a set of documents. Natural
language is full of redundancy, and not every word that appears in a document carries
semantic meaning. Frequently used words in English often don't carry content: function
words, conjunctions, prepositions, and auxiliary verbs. The first step in doing LSI is culling
these extraneous words from a document (a minimal sketch of this culling pass follows the
list below). To obtain semantic content from a document:

   1.   Make a complete list of all the words that appear in the collection
   2.   Discard articles, prepositions, and conjunctions
   3.   Discard common verbs (know, see, do, be)
   4.   Discard pronouns
   5.   Discard common adjectives (big, late, high)
   6.   Discard frilly words (therefore, thus, however, albeit, etc.)
   7.   Discard any words that appear in every document
   8.   Discard any words that appear in only one document
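
The following sketch applies a culling pass of this kind. The stop-word list here is only a tiny illustrative subset; a real implementation would use a much larger list and would also apply the collection-wide rules (steps 7 and 8 above), which require counting document frequencies across the whole collection first.

  using System;
  using System.Collections.Generic;
  using System.Linq;
  using System.Text.RegularExpressions;

  class CullingDemo
  {
     // A tiny illustrative stop-word list; a real one would be far larger.
     static readonly HashSet<string> StopWords = new HashSet<string>
     {
        "the", "a", "an", "and", "or", "but", "of", "in", "on", "to",
        "is", "are", "be", "do", "know", "see", "it", "this", "that",
        "big", "late", "high", "therefore", "thus", "however"
     };

     // Returns the content words of a document, lower-cased, with stop words removed.
     static List<string> ContentWords(string document)
     {
        return Regex.Matches(document.ToLower(), "[a-z]+")
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .Where(w => !StopWords.Contains(w))
                    .ToList();
     }

     static void Main()
     {
        string doc = "The search engine is therefore indexing a large collection of documents.";
        Console.WriteLine(string.Join(" ", ContentWords(doc)));
        // Output: search engine indexing large collection documents
     }
  }
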

Weighting
An important aspect of LSI is deciding how much weight should be ascribed to each word.
There are two forms of weighting to consider:

Term weighting is a formalization of two common-sense insights:

   -  Content words that appear several times in a document are probably more meaningful
      than content words that appear just once.
   -  Infrequently used words are likely to be more interesting than common words.

The first of these insights applies to individual documents, and is referred to as local
weighting: words that appear multiple times in a document are given a greater local weight
than words that appear once. The second applies across the whole collection and is referred
to as global weighting: words that appear in only a few documents are given a greater global
weight than words that are common throughout the collection.
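
Here is a minimal tf-idf style sketch, which is one common way of realizing these two insights: the local weight grows with the number of times a word appears in a document, and the global weight shrinks as the word appears in more documents of the collection. The exact formulas (1 plus the log of the term frequency, and the log of N over the document frequency) are just one conventional choice, and the toy collection is invented for the example.

  using System;
  using System.Collections.Generic;
  using System.Linq;

  class WeightingDemo
  {
     // Local weight: rewards repeated use of a word within a single document.
     static double LocalWeight(int termFrequency)
     {
        return termFrequency == 0 ? 0.0 : 1.0 + Math.Log(termFrequency);
     }

     // Global weight: rewards words that occur in few documents of the collection.
     // A word that appears in every document gets a global weight of zero,
     // which matches step 7 of the culling list above.
     static double GlobalWeight(int documentFrequency, int totalDocuments)
     {
        return Math.Log((double)totalDocuments / documentFrequency);
     }

     static void Main()
     {
        // Invented toy collection: each document is just a bag of words here.
        var docs = new List<string[]>
        {
           new[] { "semantic", "search", "search", "engine" },
           new[] { "keyword", "search", "engine" },
           new[] { "semantic", "web", "ontology" }
        };

        // Document frequency: in how many documents does each word appear?
        var df = docs.SelectMany(d => d.Distinct())
                     .GroupBy(w => w)
                     .ToDictionary(g => g.Key, g => g.Count());

        // Combined weight of each word in the first document.
        var doc0 = docs[0];
        foreach (var word in doc0.Distinct())
        {
           int tf = doc0.Count(w => w == word);
           double weight = LocalWeight(tf) * GlobalWeight(df[word], docs.Count);
           Console.WriteLine(word + ": " + weight.ToString("F3"));
        }
     }
  }
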

In broad terms what is involved is an algorithm that forms a web of documents and words -
connecting all documents to all words. Given such a model of words and documents one can
then establish values based on the distance of documents from each other. The 'value' of any
document to any other document might be designated as a function of the number of
connections that must be traversed to establish a connection between documents. If two
documents are connected by multiple routes then those documents might have a high degree
of correlation.

The implementation algorithm for weighting looks roughly like this:

For each document:

   -  Stem all of the words and throw away any common 'noise' words.
   -  For each of the remaining words:
      1. Visit and remember each document that has a direct relationship to this word.
      2. Score each document based on a distance function from the original document and
         the relative scarcity of the word in common.
   -  For each of the as-yet-unvisited related documents now being tracked, perform the
      same operation recursively.

One possible weighting algorithm could work like this: For each increase in distance, divide a
baseline score by two. Then the score of each document is equal to the baseline divided by the
square root of the popularity of the word.

Overall this algorithm delivers a cheap semantic lookup based on walking a document and
word graph.
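
Here is a minimal sketch of that walk over a toy word-document graph. The halving-per-hop rule and the division by the square root of a word's popularity follow the description above, but the index, the document names and the helper structures are all invented for illustration.

  using System;
  using System.Collections.Generic;
  using System.Linq;

  class GraphScoringDemo
  {
     // Toy index: which content words appear in which documents (invented data).
     static readonly Dictionary<string, string[]> WordsOf = new Dictionary<string, string[]>
     {
        { "doc1", new[] { "semantic", "search" } },
        { "doc2", new[] { "search", "engine" } },
        { "doc3", new[] { "engine", "index" } },
        { "doc4", new[] { "ontology" } }
     };

     // Inverted index: which documents contain a given word.
     static readonly Dictionary<string, List<string>> DocsOf =
        WordsOf.SelectMany(kv => kv.Value.Select(w => new { Word = w, Doc = kv.Key }))
               .GroupBy(p => p.Word)
               .ToDictionary(g => g.Key, g => g.Select(p => p.Doc).ToList());

     // Walks out from startDoc, scoring related documents.
     static Dictionary<string, double> Score(string startDoc)
     {
        var scores = new Dictionary<string, double>();
        var visited = new HashSet<string> { startDoc };
        Walk(startDoc, 1.0, visited, scores);   // baseline score starts at 1.0
        return scores;
     }

     static void Walk(string doc, double baseline, HashSet<string> visited,
                      Dictionary<string, double> scores)
     {
        var newlyVisited = new List<string>();
        foreach (string word in WordsOf[doc])
        {
           double popularity = DocsOf[word].Count;     // how many documents share this word
           foreach (string related in DocsOf[word])
           {
              if (!visited.Add(related)) continue;     // only score unvisited documents
              scores[related] = baseline / Math.Sqrt(popularity);
              newlyVisited.Add(related);
           }
        }
        // Each extra hop halves the baseline score before recursing.
        foreach (string next in newlyVisited)
           Walk(next, baseline / 2.0, visited, scores);
     }

     static void Main()
     {
        foreach (var kv in Score("doc1").OrderByDescending(kv => kv.Value))
           Console.WriteLine(kv.Key + ": " + kv.Value.ToString("F3"));
     }
  }

Running it from doc1 in this toy index scores doc2 above doc3, because doc2 shares a word with doc1 directly while doc3 is only reachable through doc2; doc4 shares nothing and is never scored.
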

The specification shown here is the simplest case and it could be improved in a variety of
ways. There are many other scoring algorithms that could be used. Additionally a thesaurus
could be applied to help bridge semantic issues.

One interesting challenge would be to make the algorithm work 'on the fly' so that as new
documents are added they would self-score. Another challenge would be to find a way to
distribute the algorithm over multiple machines for scalability.

The idea, then, is that the weighting algorithm feeds its output into the semantic algorithm,
which first stems the words appropriately, scores them, and then sorts the results into a new
rank order reflecting the semantic analysis.


The Sample Code: Searching a Document
It's now time to move on to the sample code for this article. The sample code takes a slightly
different direction from the main text, while staying on the same theme of semantic
searching. The reason is that the algorithms I've been describing are quite complex - too
complex for a short ASP Today sample. In the final article of the series you'll get the chance
to try out a semantic search by leveraging Google. For now, however, I'm going to focus on
further developing the sample from part 1 of this series, which illustrates some of the
prerequisites for a semantic search. Recall that in part 1, you developed a web page that could
take a piece of text and perform stemming on it, converting words that appear (from their
spelling) to share the same root meaning into the same stemmed form. Now you're going to
develop that further and actually perform a search on a single document, using stemming to
identify all such words. Thus, what the sample demonstrates is a first step towards searching
based on meaning rather than on the precise word.

Seeing the Sample in Action
Figure 3 shows the sample when it is launched.
Figure 3. The sample application

Figure 3 shows that the idea of the sample is that the user pastes a large block of text (the
document to be searched) into the text area, and types a keyword to search for in the small
text box. You saw the results of doing this in figure 1 earlier, but to save you scrolling back,
I've reproduced the screenshot in figure 4 here.
Figure 4. Performing a search with the sample application

As you can see, the sample has displayed the document and highlighted all locations of the
search term. Notice, however, that, as expected, it hasn't merely highlighted the search term
itself: it has highlighted related words too. For this test, I (somewhat ironically) asked the
sample to locate instances of the word search. (Yes, I know. The text of an early draft of this
article was the nearest text to hand, and that text, for some reason, seems to contain a lot of
variants of the word search, which made it rather a suitable word to search for.) The sample
hasn't only picked up instances of search, but has also identified different forms of the word:
searching, searched and searches. Although the sample searches only a single 'document'
instead of looking for similar documents on the web, and therefore can't really be considered
a search engine or agent, you can hopefully see that it is potentially useful in its own right as
a way of finding and highlighting instances of words in a document.

Working through The Code
The technology behind identifying multiple forms of a word is stemming, and the previous
article presented a sample which would stem all the words in a document and present the
results. Because of this, I won't discuss the code to stem the words in this article, but will
focus on how the text is displayed with the terms highlighted.

Displaying Highlighted Results

To highlight the results, you need to break up the text, which is originally stored as a string.
The way I'll work it here is to define a class, which I've called DocElement, and which is
designed to store one item of text. For our purposes, an item of text is either a single word or
the space between words. The idea is that the entire document will be broken into a sequence
of these elements. Here is the class and an associated enum.

  public enum DocElementType {Word, Space}
  public class DocElement
  {
     public string Text;
     public string StemmedText;
     public DocElementType ElementType;
     public bool IsMatch;
  }


The member fields should be self-explanatory. The idea is that the items will be placed
sequentially in an ArrayList, from which the original text can be reconstructed. Notice that
DocElement contains fields to indicate not just the original text but also the resultant text after
stemming, plus a boolean that indicates whether this item matches the search term and
therefore needs to be highlighted. I've departed from standard programming practice by
having public member fields - this is because in spirit what you have here is a plain old
C-style struct. The only reason I've declared it as a class rather than a struct is to save boxing,
and therefore improve performance, when putting it in the ArrayList.

The reason for storing the spaces from the original text as well as the words is that the spaces
might actually contain punctuation or other non-word characters, and you want to be able to
reproduce those exactly when the text is displayed with the search terms highlighted.

Armed with that information, let's see how the results are highlighted. That's done in the event
handler for the button (actually named btnSubmit ):

  private void btnSubmit_Click(object sender, System.EventArgs e)
  {
     this.lblResult.Visible = true;
     string origText = this.tbOriginal.Text;

     Document document = new Document(origText);
     DocElement [] elems =
        document.GetSearchResults(this.tbSearchText.Text.Trim());

     StringBuilder sb = new StringBuilder();
     foreach (DocElement elem in elems)
     {
        string text = elem.Text;
        // HTML-encode the inter-word text so any punctuation displays literally
        if (elem.ElementType == DocElementType.Space)
           text = HttpUtility.HtmlEncode(text);
        // wrap matched words in a span so the stylesheet can highlight them
        if (elem.IsMatch == true)
           text = "<span class=\"highlight\">" + text + "</span>";
        sb.Append(text);
     }
     this.lblResult.Text = sb.ToString();
  }


As you can see this event handler retrieves the original text and the search text, and
instantiates a Document instance. Document is the class that holds the text and is responsible for
analyzing it. The key method is Document.GetSearchResults() , which returns an array of
DocElements that contains all the information needed to display the highlighted text. Let's see
how that method works.
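
I won't list the full GetSearchResults() implementation here; the following is only a sketch of how it might work, based on the description above. The assumption is that it reuses GetDocElements() (shown next) to break up and stem the text, stems the search term with the same Stemmer, and flags every word element whose stemmed form matches:

  // A possible shape for GetSearchResults(); treat the details as an assumption.
  public DocElement [] GetSearchResults(string searchTerm)
  {
     DocElement [] elems = GetDocElements();
     Stemmer stemmer = new Stemmer();
     string stemmedTerm = stemmer.Stem(searchTerm);

     foreach (DocElement elem in elems)
     {
        // a word matches if its stemmed form equals the stemmed search term
        if (elem.ElementType == DocElementType.Word &&
            string.Compare(elem.StemmedText, stemmedTerm, true) == 0)
           elem.IsMatch = true;
     }
     return elems;
  }
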

Breaking up the Text
First we need the member fields that hold the original text (set by the constructor) and the
cached results:

  class Document
  {
     string origText;          // the text passed to the constructor
     DocElement [] docElements;
     // other methods
  }


Next let's have a look at the method that actually breaks up the text.

  public DocElement [] GetDocElements()
  {
     // don't repeat the analysis if we've already done it once
     if (docElements != null)
        return docElements;

     ArrayList elems = new ArrayList();
     Stemmer stemmer = new Stemmer();

     DocElement nextElement;

     // ensure the first DocElement is a space, to simplify coding, and add a trailing
     // space so a document that ends with a letter still yields its final word
     string text = " " + origText + " ";

     string pattern = @"([a-z]+)[^a-z]";
     string spacePattern = @"([^a-z]+)[a-z]";

     MatchCollection wordMatches = Regex.Matches(
        text, pattern, RegexOptions.IgnoreCase);
     MatchCollection spaceMatches = Regex.Matches(
        text, spacePattern, RegexOptions.IgnoreCase);

     // the leading space becomes the first element
     nextElement = new DocElement();
     nextElement.ElementType = DocElementType.Space;
     nextElement.Text = spaceMatches[0].Groups[1].Value;
     elems.Add(nextElement);

     for (int i=0 ; i<wordMatches.Count ; i++)
     {
        Match wordMatch = wordMatches[i];
        Match spaceMatch = spaceMatches.Count>i+1 ? spaceMatches[i+1] : null;

        // add the word itself, together with its stemmed form
        nextElement = new DocElement();
        nextElement.ElementType = DocElementType.Word;
        nextElement.Text =
           wordMatch.Groups.Count >1 ? wordMatch.Groups[1].Value : null;
        nextElement.StemmedText = stemmer.Stem(nextElement.Text);
        elems.Add(nextElement);

        // then add the inter-word text (spaces and punctuation) that follows it
        if (spaceMatch != null)
        {
           nextElement = new DocElement();
           nextElement.ElementType = DocElementType.Space;
           nextElement.Text = spaceMatch.Groups.Count >1
              ? spaceMatch.Groups[1].Value : null;
           elems.Add(nextElement);
        }
     }

     docElements = (DocElement[])elems.ToArray(typeof(DocElement));
     return docElements;
  }


There's a lot going on here, so let's break it down a bit. One of the first things done is to
create an instance of a class called Stemmer. Stemmer is responsible for stemming each
word, via a method, Stemmer.Stem(). This method was presented in the previous article and is
unchanged here, so I won't discuss it further.

The actual work of breaking up the text is done using regular expressions, based on the
following patterns:

      string pattern = @"([a-z]+)[^a-z]";
      string spacePattern = @"([^a-z]+)[a-z]";


The first pattern will return a sequence of one or more alphabetic characters, followed by a
non-alphabetic character (i.e. a character that separates words). The parentheses around the
[a-z]+ ensure that the subsequent non-alphabetic character is excluded from the capture, so
we only actually return each complete word. The second pattern does the exact reverse,
capturing all the character sequences that separate words in the text. The code then enters a
for loop which iterates through the results returned from the pattern matches, constructing
the DocElement instances and placing them in sequence in an array, via an intermediate
ArrayList. The algorithm relies on the fact that the code stuck an extra space at the beginning
of the text, thus forcing the first document item to be a space. Hence we know that the
sequence is going to run space-word-space-word and so on, which makes it very simple to
loop through the two pattern match collections, appending them to the array of DocElements
in sequence. That completes the relevant code.


Conclusion
Semantic search methods augment and improve traditional search results by using not just
words, but concepts and logical relationships. In this article you've seen that there are two
approaches to improving search results through semantic methods: (1) the Semantic Web and
(2) Latent Semantic Indexing (LSI). The article reviewed semantic search engines and
semantic search agents, including their current development and progress. The article also
presented a code sample that identifies and highlights words in a document that are similar to
a specified search term, using stemming to match the words.

In the final article of the series, you'll see a detailed example of a semantic-based ASP.NET
search engine, along with its C# source code.

								