The Development of a Search Engine & Comparison According to Algorithms
       The mid-term progress report

               20032017 Sungsoo Kim
               20032066 Haebeom Lee
    Topic of our term project
• Compare the performance of the algorithms used in information retrieval.
• On the basis of that comparison, build an efficient search engine and
  demonstrate it.
I.     Extracting the position of the text information
       from raw files.
II.    Extracting the keywords or index words from the extracted text.
III.   Making the index file.
IV.    Gathering and sorting the index files.
V.     Getting information about the index.
VI.    Boolean retrieval.
VII.   Natural language retrieval using the vector
       and probability models.
          Procedure (I)-1
• Raw document: the HTML files are put together into a single file.
<HTML> …document …</HTML>
<HTML> …document….</HTML>
  - Get the text information by string matching
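
The string-matching step above can be sketched roughly as follows. This is only an illustration, not the project's code: the function name `extract_documents` and the use of Python's built-in `str.find` are assumptions, and the only thing taken from the slide is that each document sits between `<HTML>` and `</HTML>` in one raw file.

```python
def extract_documents(raw, open_tag="<HTML>", close_tag="</HTML>"):
    """Return (start, end, body) for every open_tag ... close_tag span in raw.

    start and end are character positions of the body inside the raw file,
    so they can be stored as the "position of the text information".
    """
    spans = []
    pos = 0
    while True:
        start = raw.find(open_tag, pos)
        if start == -1:
            break
        body_start = start + len(open_tag)
        end = raw.find(close_tag, body_start)
        if end == -1:
            break  # unterminated document: stop rather than guess
        spans.append((body_start, end, raw[body_start:end]))
        pos = end + len(close_tag)
    return spans


if __name__ == "__main__":
    raw_file = "<HTML>first</HTML>\n<HTML>second</HTML>"
    for start, end, body in extract_documents(raw_file):
        print(start, end, body)
```

Storing the positions instead of copies of the text keeps the raw file as the single source, which seems to be what step I of the procedure intends.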
            Procedure (I)-2
• Tuned Boyer-Moore algorithm
  Example: searching for the pattern "Park" in the text "BalkParcMoraPark"
• Modified from the Boyer-Moore algorithm
• Uses the bad-character shift function
• Easy to apply
• Can search in about 1/3 of the time taken by a general
  search algorithm
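
A minimal sketch of the idea, assuming the usual description of Tuned Boyer-Moore: only the bad-character shift is used, the shift of the last pattern character is set to 0, and the text is padded with that character so the skip loop always stops. The function name `tuned_bm_search` is hypothetical, and the loop unrolling that the original algorithm uses for speed is left out because it brings nothing in Python.

```python
def tuned_bm_search(pattern, text):
    """Return the start positions of every occurrence of pattern in text."""
    m, n = len(pattern), len(text)
    if m == 0 or n < m:
        return []

    # Bad-character table: distance from the rightmost occurrence of each
    # character (final pattern position excluded) to the end of the pattern.
    shift_of = {}
    for i, ch in enumerate(pattern[:-1]):
        shift_of[ch] = m - 1 - i

    last = pattern[-1]
    after_attempt = shift_of.get(last, m)  # shift applied after each attempt
    shift_of[last] = 0                     # a 0 shift ends the skip loop
    padded = text + last * m               # sentinel: the skip loop must stop

    hits = []
    j = 0
    while j <= n - m:
        # Skip loop: slide until the window's last character equals `last`.
        k = shift_of.get(padded[j + m - 1], m)
        while k != 0:
            j += k
            k = shift_of.get(padded[j + m - 1], m)
        # Compare the rest of the window; ignore hits inside the padding.
        if j <= n - m and padded[j:j + m - 1] == pattern[:-1]:
            hits.append(j)
        j += after_attempt
    return hits


if __name__ == "__main__":
    print(tuned_bm_search("Park", "BalkParcMoraPark"))
```

On the slide's example the window jumps from position 0 to 4, 8, 10 and 12, so most characters of the text are never compared against the whole pattern.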
          Procedure (II)
• Statistical information from the
  extracted text
  - The result contains
    - average text length
    - total number of texts
    - average number of texts per document
  - This information is not used directly in
    analyzing the search engine
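
As a rough illustration of these statistics, assuming each document yields a list of extracted text strings and that length is counted in characters (neither is stated on the slide; the function name `text_statistics` is also hypothetical):

```python
def text_statistics(texts_per_document):
    """texts_per_document: one list of extracted text strings per document."""
    all_texts = [t for texts in texts_per_document for t in texts]
    total = len(all_texts)
    documents = len(texts_per_document)
    return {
        "total_texts": total,
        "average_text_length":
            sum(len(t) for t in all_texts) / total if total else 0.0,
        "average_texts_per_document":
            total / documents if documents else 0.0,
    }


if __name__ == "__main__":
    print(text_statistics([["abcd", "ab"], ["abc"]]))
```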
         Procedure (III)
• Making a temporary index
• There are a number of ways of making an index
• Exclude stopwords from the index words
  Ex) Stopwords: “the”, “of”, “and”, “to”
• Stored in an AVL tree
• An AVL tree lets the machine insert or delete
  nodes and helps it search efficiently.
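
The temporary index could look something like the sketch below: words are lower-cased, the stopwords listed on the slide are skipped, and the rest go into an AVL tree. All names (`Node`, `insert`, `contains`, `build_index`) are illustrative assumptions, and deletion is omitted to keep the example short.

```python
STOPWORDS = {"the", "of", "and", "to"}  # the examples given on the slide


class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    update(x)
    return x


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    update(y)
    return y


def insert(node, key):
    """Insert key and rebalance; returns the new root of this subtree."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    else:
        return node  # word is already indexed
    update(node)
    balance = height(node.left) - height(node.right)
    if balance > 1:
        if key > node.left.key:          # left-right case
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:
        if key < node.right.key:         # right-left case
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def contains(node, key):
    while node:
        if key == node.key:
            return True
        node = node.left if key < node.key else node.right
    return False


def build_index(words):
    root = None
    for word in words:
        word = word.lower()
        if word not in STOPWORDS:
            root = insert(root, word)
    return root
```

Because the tree rebalances on every insertion, its height stays logarithmic in the number of index words even when the words arrive in sorted order, which is the case where a plain binary search tree degrades into a list.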
        Procedure (IV)-1
• Gather the index terms and collect information
  about them.
• A document index entry is a pair: an index word
  from the document and the location where that
  word appeared.
• That location information is referenced through
  the lexicon and posting files.
                Procedure (IV)-2
                  Sample document
       Document No.   Contents
       1              Pease porridge hot, pease porridge cold
       2              Pease porridge in the pot
       3              Nine days old
       4              Some like it hot, some like it cold
       5              Some like it in the pot
       6              Nine days old

 Lexicon file                 Posting file
 Term    No. of documents     Document No.
 cold    2                    1, 4
 days    2                    3, 6
 hot     2                    1, 4
 in      2                    2, 5
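
A small sketch of how the lexicon and posting files relate, kept in memory as dictionaries rather than files. It assumes the six sample documents read as in the well-known rhyme ("cold" at the end of documents 1 and 4, "pot" at the end of document 2), which is what the posting lists on the slide imply. The function name `build_inverted_index` is hypothetical.

```python
import re
from collections import defaultdict

DOCS = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}


def build_inverted_index(docs):
    """Return (lexicon, postings).

    postings maps each term to the sorted list of document numbers that
    contain it; lexicon maps each term to the length of that list.
    """
    postings = defaultdict(list)
    for doc_no in sorted(docs):
        terms = set(re.findall(r"[a-z]+", docs[doc_no].lower()))
        for term in terms:
            postings[term].append(doc_no)
    lexicon = {term: len(doc_nos) for term, doc_nos in postings.items()}
    return lexicon, dict(postings)


if __name__ == "__main__":
    lexicon, postings = build_inverted_index(DOCS)
    for term in ("cold", "days", "hot", "in"):
        print(term, lexicon[term], postings[term])
```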
Typical information retrieval
• Boolean model
  - Set model: the query and the documents are expressed as sets
  - Operators: “not”, “or”, “and”
  - Easy to understand, but difficult for users to use
• Vector model
  - Assigns a weight to each index term
  - Calculates the similarity and ranks the results
  - Most popular model
• Probability model
  - Proposed by Robertson & Sparck Jones in 1976
  - Based on probability and Bayes’ theorem
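
The three models can be compared on a toy collection with a sketch like the one below. Several choices here are assumptions rather than anything stated in the slides: the collection is the six-line rhyme, the vector model uses tf-idf weights with cosine similarity, and the probability model is reduced to one common form of the Robertson/Sparck Jones term weight for the case with no relevance information (0.5 smoothing). All function names are illustrative.

```python
import math
from collections import Counter

DOCS = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
    3: "nine days old",
    4: "some like it hot some like it cold",
    5: "some like it in the pot",
    6: "nine days old",
}
TOKENS = {doc_no: text.split() for doc_no, text in DOCS.items()}
N = len(DOCS)


def postings(term):
    """Set of document numbers containing term."""
    return {d for d, toks in TOKENS.items() if term in toks}


# Boolean model: queries are set operations on posting sets.
def boolean_and(a, b):
    return postings(a) & postings(b)


def boolean_or(a, b):
    return postings(a) | postings(b)


def boolean_not(a):
    return set(DOCS) - postings(a)


# Vector model: tf-idf weights, cosine similarity, ranked output.
def idf(term):
    df = len(postings(term))
    return math.log(N / df) if df else 0.0


def tfidf_vector(tokens):
    tf = Counter(tokens)
    return {t: tf[t] * idf(t) for t in tf}


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query):
    """Document numbers with non-zero similarity, best first."""
    q = tfidf_vector(query.split())
    scores = {d: cosine(q, tfidf_vector(toks)) for d, toks in TOKENS.items()}
    return sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: (-scores[d], d))


# Probability model: term weight without relevance information.
def rsj_weight(term):
    n = len(postings(term))
    return math.log((N - n + 0.5) / (n + 0.5))


if __name__ == "__main__":
    print(boolean_and("hot", "cold"))
    print(rank("hot porridge"))
    print(rsj_weight("hot"))
```

The Boolean functions return unordered sets, while `rank` returns an ordered list. That difference in output is the main practical distinction between the first model and the other two.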
          Until now….& next
• Done: extracted information from the raw files.
• Done: extracted the keywords and index words.
• In progress: making the index file and lexicon/posting files.
• Will survey the models (Boolean, vector, probability).
• Will build an engine consisting of three parts
  (one for each of the 3 models).
• Will compare their performance and propose a
  simple engine.
     Development system
• System:
   Pentium 4 (1.6 GHz), Windows XP
• OS:
   Red Hat Linux on VMware
• Interface:
   Runs from the console command line
   Text-based results
