   The Development of a Search Engine
   & Comparison According to Algorithms
       The mid-term progress report



               20032017 Sungsoo Kim
               20032066 Haebeom Lee
    Topic of our term project
• Compare the performance of the
  algorithms used in information
  retrieval.
• On the basis of that comparison,
  build an efficient search engine and
  demonstrate it.
               Procedures
I.     Extracting the positions of the text
       information in the raw files.
II.    Extracting the keywords or index words
       from the text.
III.   Making the index file.
IV.    Gathering and sorting the index files.
V.     Getting the information about the index.
VI.    Boolean retrieval.
VII.   Natural language retrieval using the vector
       and probability models.
          Procedure (I)-1
• Raw document: the HTML files are put
  together into a single file.
 ex)
<HTML> …document… </HTML>
<HTML> …document… </HTML>
❑ The text information is located with a
  string-matching algorithm.
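
The extraction step above can be sketched as follows. This is only an illustration under stated assumptions: the function name find_documents, the exact tag spelling, and the use of Python's built-in str.find in place of the project's own string-matching routine are all hypothetical.

```python
def find_documents(raw, open_tag="<HTML>", close_tag="</HTML>"):
    """Return (start, end) offsets of the text between each tag pair.

    Assumption: every document in the raw file is wrapped in exactly
    one open_tag/close_tag pair, as in the example on the slide.
    """
    spans = []
    pos = 0
    while True:
        start = raw.find(open_tag, pos)
        if start == -1:
            break
        start += len(open_tag)
        end = raw.find(close_tag, start)
        if end == -1:
            break  # unterminated document: stop rather than guess
        spans.append((start, end))
        pos = end + len(close_tag)
    return spans


raw = "<HTML>first document</HTML>\n<HTML>second document</HTML>"
spans = find_documents(raw)
texts = [raw[s:e] for s, e in spans]
print(spans)
print(texts)
```

Storing offsets rather than copies of the text keeps the raw file as the single source of the content, which matches the idea of extracting the position of the text information.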
            Procedure (I)-2
•   Tuned Boyer-Moore algorithm

     • Searching for "Park" in "BalkParcMoraPark":
       BalkParcMoraPark
       Park
           Park
               Park
                 Park
                   Park
•   Modified from the Boyer-Moore algorithm
•   Uses the bad-character shift function
•   Easy to apply
•   Can search in 1/3 of the time of a general
    search algorithm
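
A minimal sketch of a search driven only by the bad-character shift is given below. It is a Horspool-style simplification, not necessarily the exact Tuned Boyer-Moore variant used in the project (the tuned version adds loop unrolling on top of the same shift table). The names bad_char_table and search are hypothetical.

```python
def bad_char_table(pattern):
    """Map each character to the distance from its rightmost occurrence
    (ignoring the last position) to the end of the pattern."""
    m = len(pattern)
    return {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}


def search(text, pattern):
    """Return (matches, alignments): the match offsets and every
    offset at which the pattern was lined up against the text."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return [], []
    shift = bad_char_table(pattern)
    matches, alignments = [], []
    pos = 0
    while pos <= n - m:
        alignments.append(pos)
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Shift by the table entry for the text character under the
        # last pattern position; unseen characters skip the full length.
        pos += shift.get(text[pos + m - 1], m)
    return matches, alignments


matches, alignments = search("BalkParcMoraPark", "Park")
print(matches)
print(alignments)
```

For this input the pattern is lined up at only five offsets instead of the thirteen a character-by-character scan would try, which is where the speed-up comes from.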
          Procedure (II)
• Statistical information from the
  extracted text
➢ The result contains
  - average text length
  - total number of texts
  - average number of texts per document
➢ This information is not used directly
  in analyzing the search engine
         Procedure (III)
• Making a temporary index
• There are a number of ways to make
  index words.
• Exclude stopwords from the index words
Ex) Stopwords: “the”, “of”, “and”, “to”
• Stored in an AVL tree
• An AVL tree lets the machine insert
  and delete nodes and search
  efficiently.
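
The stopword filter and the AVL tree can be sketched together as below. This is an illustration only: the stopword set contains just the four examples from the slide, the tokenizer (lower-cased runs of letters) is an assumption, deletion is omitted, and all names are hypothetical.

```python
import re

STOPWORDS = {"the", "of", "and", "to"}  # only the examples from the slide


class Node:
    def __init__(self, key):
        self.key, self.left, self.right, self.height = key, None, None, 1


def height(node):
    return node.height if node else 0


def update(node):
    node.height = 1 + max(height(node.left), height(node.right))


def rotate_right(y):
    x = y.left
    y.left, x.right = x.right, y
    update(y)
    update(x)
    return x


def rotate_left(x):
    y = x.right
    x.right, y.left = y.left, x
    update(x)
    update(y)
    return y


def insert(node, key):
    """Insert key and rebalance; duplicates are ignored."""
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    else:
        return node
    update(node)
    balance = height(node.left) - height(node.right)
    if balance > 1:  # left-heavy
        if key > node.left.key:
            node.left = rotate_left(node.left)
        return rotate_right(node)
    if balance < -1:  # right-heavy
        if key < node.right.key:
            node.right = rotate_right(node.right)
        return rotate_left(node)
    return node


def in_order(node):
    return in_order(node.left) + [node.key] + in_order(node.right) if node else []


def index_words(text):
    """Lower-case the text, split it into words, drop the stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


root = None
for word in index_words("The development of the search engine and comparison of the algorithms"):
    root = insert(root, word)
print(in_order(root))
```

Because the tree rebalances on every insertion, its height stays logarithmic in the number of index words, so lookups stay fast even when the words arrive in sorted order.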
        Procedure (IV)-1
• Gathering and getting the information
  about the index terms.
• The document index consists of pairs of
  an index word from the document and the
  location at which that index word appeared.
• The location information is reached
  through the lexicon and posting files.
                Procedure (IV)-2
                  Sample document
       Document   Contents
       No.
       1          Pease porridge hot, pease porridge cold
       2          Pease porridge in the pot
       3          Nine days old
       4          Some like it hot, some like it cold
       5          Some like it in the pot
       6          Nine days old

       Lexicon file          Posting file
       cold  2               1, 4
       days  2               3, 6
       hot   2               1, 4
       in    2               2, 5
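
A minimal sketch of how such a lexicon and posting file could be built from the six sample documents is shown below. The document texts follow the traditional rhyme, assuming that document 2 ends in "pot", as the posting list for "hot" (1, 4) implies. The function name build_index and the in-memory dictionaries standing in for the two files are hypothetical.

```python
import re
from collections import defaultdict

documents = {
    1: "Pease porridge hot, pease porridge cold",
    2: "Pease porridge in the pot",
    3: "Nine days old",
    4: "Some like it hot, some like it cold",
    5: "Some like it in the pot",
    6: "Nine days old",
}


def build_index(docs):
    """Return (lexicon, postings).

    postings maps each term to the sorted list of document numbers
    containing it; lexicon maps each term to the length of that list.
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        # A set, so that a term is recorded once per document.
        terms = set(re.findall(r"[a-z]+", docs[doc_id].lower()))
        for term in terms:
            postings[term].append(doc_id)
    lexicon = {term: len(ids) for term, ids in postings.items()}
    return lexicon, dict(postings)


lexicon, postings = build_index(documents)
for term in ("cold", "days", "hot", "in"):
    print(term, lexicon[term], postings[term])
```

The four printed rows correspond to the lexicon and posting entries listed above: each term occurs in two documents.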
Typical information retrieval
•   Boolean model
    - set-based model: the query and the documents are expressed as sets
    - operators: “not”, “or”, “and”
    - easy to understand but difficult for users to use
•   Vector model
    - assigns a weight to each index term
    - calculates the similarity and ranks the results
    - the most popular model
•   Probability model
    - suggested by Robertson & Sparck Jones in 1976
    - based on probability and Bayes’ theorem
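
The first two models can be contrasted with a short sketch over the sample rhyme documents. The Boolean part uses plain set operations. The vector part assumes TF-IDF weights and cosine similarity, which is one common choice and not necessarily the weighting used in this project. All function names are hypothetical.

```python
import math
from collections import Counter

docs = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
    3: "nine days old",
    4: "some like it hot some like it cold",
    5: "some like it in the pot",
    6: "nine days old",
}
tokens = {d: text.split() for d, text in docs.items()}
all_ids = set(docs)


def posting(term):
    """Set of document numbers that contain the term."""
    return {d for d, words in tokens.items() if term in words}


# Boolean model: "and", "or", "not" map onto set operations.
hot_and_cold = posting("hot") & posting("cold")
hot_or_pot = posting("hot") | posting("pot")
not_some = all_ids - posting("some")


# Vector model (assumed weighting: tf * log(N / df)).
def idf(term):
    df = len(posting(term))
    return math.log(len(docs) / df) if df else 0.0


def vector(words):
    tf = Counter(words)
    return {t: tf[t] * idf(t) for t in tf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def rank(query):
    """Documents with non-zero similarity, best match first."""
    q = vector(query.split())
    scores = {d: cosine(q, vector(words)) for d, words in tokens.items()}
    return sorted((d for d in scores if scores[d] > 0), key=lambda d: (-scores[d], d))


print(hot_and_cold, hot_or_pot, not_some)
print(rank("hot porridge"))
```

The Boolean query returns an unordered set and requires the user to write the operators, while the vector query accepts free text and returns a ranked list, which illustrates the usability difference noted above.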
          Until now … and next
•   Extracted the information from the raw files.
•   Extracted the keywords and index words.
•   Now making the index file and the lexicon/posting files.
•   Will survey the models (Boolean, vector,
    probability).
•   Will build an engine consisting of three parts
    (one per model).
•   Will compare their performance and propose a
    simple engine.
     Development system
• System:
   Pentium 4 (1.6 GHz), Windows XP
• OS:
   Red Hat Linux on VMware
• Interface:
   Runs on the console (command line)
   Text-based results

				