Information Retrieval _data structures and algorithms_

					             Inverted file


2005-03-16    Information retrieval
         3.1 Introduction
 Three of the most commonly used file structures in IR
        Lexicographical indices (sorted)
            Inverted file, Patricia (PAT) tree
        Clustered file structures
        Indices based on hashing

 Inverted file
        A sorted list (index) of keywords, each with links to the documents
         containing that keyword
 Advantage
        Improves search efficiency
 Disadvantage
        Need to store a data structure
         that ranges from 10 percent to 100 percent
         or more of the size of the text itself
        Need to update the index
         as the data set changes
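As a minimal sketch of the idea (function name and sample documents are hypothetical), an inverted file maps each keyword, kept in sorted order, to the documents containing it:

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Build a sorted keyword -> document-id index (illustrative sketch)."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for word in text.lower().split():
            index[word].add(doc_id)
    # Keywords are kept in sorted (lexicographical) order.
    return {word: sorted(ids) for word, ids in sorted(index.items())}

docs = ["a text has many words", "words are made from letters"]
index = build_inverted_file(docs)
# index["words"] -> [0, 1]
```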

        3.1 Introduction (cont.)
 Some restrictions
       A controlled vocabulary
       Words not in the vocabulary will not be indexed and are
        therefore not searchable
       A list of stopwords that will not be included in the index and
        are therefore not searchable
       A set of rules that decide the beginning of a word or a piece of text
         e.g., treatment of spaces, punctuation marks, standard prefixes
       A list of character sequences to be indexed
        (or not indexed)
         Character sequences consisting entirely of numerics are
             often not indexed
 Restrictions (which determine what is to be indexed) are critical to
  search effectiveness; therefore these rules should be
  carefully constructed and evaluated

        3.1 Introduction (cont.)
 A search in an inverted file is the composition
  of two searching algorithms
       A search for a keyword, which returns an index
       A possible search on that index for a particular
        attribute value

 The result is a set of records
  (or pointers to records)

        3.2 structures used in inverted files
 Sorted array, B-trees, tries, and various hashing
  structures, or combinations of these structures

 3.2.1 sorted array
       Store the list of keywords in a sorted array, including the
        number of documents associated with each keyword and a
        link to the documents containing that keyword
       Disadvantage
          Updating the index (e.g., appending a new keyword) is expensive
       Advantage
         Easy to implement and reasonably fast
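A sketch of such a sorted-array index, using binary search for lookup (the sample entries are hypothetical): each entry holds the keyword, the number of documents containing it, and a link to those documents.

```python
import bisect

# Hypothetical sorted-array inverted file: (keyword, doc_count, doc_links).
index = [
    ("letters", 1, [13]),
    ("made",    1, [13]),
    ("text",    2, [7, 13]),
]

def lookup(index, keyword):
    """Binary-search the sorted keyword array; O(log n) comparisons."""
    keys = [entry[0] for entry in index]   # in practice, precomputed once
    i = bisect.bisect_left(keys, keyword)
    if i < len(keys) and keys[i] == keyword:
        return index[i]
    return None

entry = lookup(index, "text")   # ("text", 2, [7, 13])
```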

 3.2.2 B-trees
       Disadvantages
         Compared with sorted arrays, B-trees use more space
         More complex to implement than sorted arrays
       advantage
            Updates are much easier and search time is
             generally faster, especially if secondary storage is
             used for the inverted file (instead of memory)
 3.2.3 Tries
       Use the digital decomposition of the set of
        keywords to represent those keywords
       In particular, the Patricia (PAT) tree is useful in IR

         3.3 building an inverted file using a sorted array

 First, the input text must be parsed into a list of words along with
  their locations in the text
  (the most time-consuming and storage-consuming operation in indexing)
 Second, this list must be inverted, from a list of terms in
  location order to a list of terms ordered for use in searching
  (alphabetical order)
 Postprocessing of inverted files (optional) – adding term
  weights or reorganizing or compressing the files
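The first two steps above can be sketched as follows (a simplified illustration; a real parser would also handle punctuation, stopwords, and so on):

```python
def parse_and_invert(text):
    """Parse text into (word, location) pairs, then invert by sorting."""
    # Phase 1 (parse): list of (word, location) pairs in location order.
    pairs = [(w.lower(), loc) for loc, w in enumerate(text.split())]
    # Phase 2 (sort): reorder into alphabetical order for searching.
    return sorted(pairs)

parse_and_invert("This is a text")
# -> [('a', 2), ('is', 1), ('text', 3), ('this', 0)]
```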

 [Figure] text --(parse)--> word/location (location order)
               --(sort)--> word/location (alphabetic order)
               --(add weights)--> word/location/weight

 Typically inverted files store field locations and possibly
  even word locations
        Additional location information is needed for field and proximity
         searching in Boolean operations
        This causes higher inverted file storage overhead than if only
         record locations were needed
 Inverted files for ranking retrieval systems usually store
  only record locations and term weights or frequencies
 Although an inverted file could be used directly by the
  search routine, it is usually processed into an improved final format
        This format is based on the search methods and the
         (optional) weighting methods used
        A common search technique is to use binary search, so the
         file to be searched should be as short as possible

 The single file shown, containing the
  terms, locations, and (possibly)
  frequencies, is usually split into
  two pieces

       First piece is the dictionary
         containing the term, statistics
            about term such as number of
            postings, and a pointer to the
            location of the postings file for
            that term
       Second piece is the postings file
         containing record numbers
            (plus other necessary location
            information) and the (optional)
            weights for all occurrences of
            the term
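A minimal sketch of this split, with plain Python structures standing in for the two files (names hypothetical):

```python
def split_index(inverted):
    """Split {term: [(record, weight), ...]} into a dictionary and a
    sequential postings 'file' (here just a flat list)."""
    dictionary = {}   # term -> (number of postings, offset into postings)
    postings = []
    for term in sorted(inverted):
        plist = inverted[term]
        dictionary[term] = (len(plist), len(postings))
        postings.extend(plist)
    return dictionary, postings
```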

        3.4 Modifications to the basic technique

 3.4.1 producing an inverted
  file for large data sets
  without sorting
       The new indexing method is
        a two-step process that
        does not need the middle
        sorting step
         First step produces the
            initial inverted file
         Second step adds the
            term weights to that file
            and reorganizes the file
            for maximum efficiency

 A binary tree and linked postings lists allow the index to grow incrementally
 This is important when indexing a large data set that
        is processed from multiple separate files over a short period
         of time
 Step two
        Each term is consecutively read from the binary tree (this
         automatically puts the list in alphabetical order), along with
         its related data
        A new sequentially stored postings file is allocated with two
         elements per posting
        The linked postings list is traversed, and the stored frequencies
         are used to compute term weights
        The record numbers and corresponding term weights are written
         to the newly created sequential postings file

Text Size (MB)   Indexing Time (hours)   Working Storage (MB)   Index Storage (MB)
                 old       new           old          new       old         new
1.6              0.25      0.50          4.0          0.7       0.4         0.4
50               8         10.5          132          6         4           4
359              -         137           933          70        52          52
806              -         313           over 2GB     163       112         112

 Old indexing scheme
         Records -> a list of words with record locations --(sort)-->
          inverted list -> add the term weights
         A posting contains a record id number and the term’s weight in
          that record
         Sorting: O(n log n) best case, O(n^2) worst case
 This made processing of very large databases likely to
  take longer using this method
        3.4.2 A fast inversion algorithm (FAST-INV)

 This technique takes advantage of two principles
        The large primary memories available on today’s computers
          Even if databases are on the order of 1 GB, if they can
            be split into memory loads that can be rapidly processed
            and then combined, the overall cost will be minimized
        The inherent order of the input data
          This is crucial since with large files it is very expensive to
            use polynomial or even O(n log n) sorting algorithms

 The FAST-INV algorithm follows the two principles by
        Using primary memory in a close to optimal fashion
        Processing the data in three passes

 Input: a document vector file containing the concept vectors
  for each document in the collection to be indexed
 The document vector file is in sorted order
 Document numbers are sorted within the collection
 Concept numbers are sorted within document numbers

                      DOC: document number
                      CON: concept number; each unique word
                            in the collection implies a unique
                            concept number
                            (unique words: 250,000 -> concept
                            numbers: 250,000)
 Preparation
       HCN = highest concept number in the dictionary
       L = number of document/concept pairs in the collection
       M = available primary memory size, in bytes
 First pass
 The entire document vector file is read and two new files are built
       A file of concept postings/pointers (CONPTR)
       A load table
 Assumptions
       M >> HCN, so that the two files can be built in primary memory
       M < L, so several primary memory loads will be needed to process
        the document data
         Because entries in the document vector file will already be
            grouped by concept number, with those concepts in
            ascending order

    The data can be transformed beforehand into j parts such that
        L/j < M, so that each part will fit into primary memory
        HCN/j concepts are associated with each part
    Each of the j parts is read into primary memory and inverted there
    Splitting the document vector file
        The load table indicates the range of concepts that should be processed for
         each primary memory load
        There are two approaches to handling the multiplicity of loads
             The first approach, currently used, makes a pass through the document vector
              file to obtain the input for each load
                 Requires no additional storage space
                 Requires expensive disk I/O
             The second approach is to build a new copy of the document vector collection
              with the desired separation into loads
                 This can easily be done using the load table, since the size of each load is known
                 As each document vector is read, it is separated into parts for each range of
                  concepts in the load table
                 Those parts are appended to the end of the corresponding section of the
                  output document collection file
                 With I/O buffering, the expense of this operation is proportional to the size of
                  the files; it essentially costs the same as copying the file

 Inverting each load
 When a load is to be processed, the appropriate section of
  the CONPTR file is needed
 An output array of size equal to the input document
  vector file subset is needed
 As each document vector is processed, the offset
  (previously recorded in CONPTR) for a given concept is
  used to place the corresponding document/weight
  entry, then that offset is incremented
 CONPTR data allows the input to be directly mapped to
  the output, without any sorting
 At the end of the input load the newly constructed
  output is appended to the inverted file
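The placement step can be sketched as a counting-sort-style pass (a simplified, in-memory illustration of the FAST-INV idea; names are hypothetical, and weights are omitted):

```python
def invert_load(pairs):
    """Invert one memory load of (doc, concept) pairs without sorting.
    Pass 1 builds CONPTR-style offsets per concept; pass 2 places each
    posting directly at its concept's next free slot."""
    # Pass 1: count postings per concept, then prefix-sum to offsets.
    counts = {}
    for _, con in pairs:
        counts[con] = counts.get(con, 0) + 1
    offset, nextfree = 0, {}
    for con in sorted(counts):          # concepts in ascending order
        nextfree[con] = offset
        offset += counts[con]
    # Pass 2: place each document number at its concept's offset.
    out = [None] * len(pairs)
    for doc, con in pairs:
        out[nextfree[con]] = (con, doc)
        nextfree[con] += 1
    return out

invert_load([(1, 5), (1, 9), (2, 5), (3, 9)])
# -> [(5, 1), (5, 2), (9, 1), (9, 3)]
```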

        FAST-INV example

        8.2 Inverted file (Modern information retrieval)

 A word-based structure that builds an index over the text to
  speed up the search task
      Vocabulary
        The set of all words appearing in the text
        The space required is relatively small
      Occurrences
        A set of lists storing, for each word, the positions in the
         documents where it appears
        Needs more space than the vocabulary (30%-40% of the text
         size); every appearance of a word in the text adds an entry
        To reduce the space needed -> block addressing (the text is
         divided into blocks, and an occurrence points to the block
         where the word appears); about 5% of the text size

          Example of an inverted file
  1   6   9 11    17 19   24   28      33        40         46   50   55   60
 This is a text. A text has many words. Words are made from letters.

              Vocabulary                Occurrences
              letters                   60...
              made                      50...
              many                      28...
              text                      11, 19...
              words                     33, 40...

 Words are converted to lowercase, and some are not indexed
 The occurrences point to character positions in the text
         Block addressing
 Block 1        Block 2            Block 3                Block 4
 This is a text. A text has many words. Words are made from letters.

           Vocabulary               Occurrences
           letters                  4...
           made                     4...
           many                     2...
           text                     1, 2...
           words                    3...

    Search: index search (across blocks) + online search (within blocks)
    Fixed-size blocks improve search-time efficiency
    If block sizes vary widely, the average amount of sequential scanning increases
    Partitioning along natural boundaries (files, documents, web pages)
         removes the need for online search (scanning the text directly for the pattern) when the text is small
    Disadvantages
         Proximity queries require online search within the selected blocks
         Not possible for remote text, as with CD-ROMs or web search engines
         The text itself is needed at search time
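A minimal sketch of block addressing, here using fixed-size blocks of words rather than characters (a simplifying assumption; names hypothetical):

```python
def block_index(text, words_per_block):
    """Index block numbers instead of exact positions (block addressing)."""
    index = {}
    for i, word in enumerate(text.lower().split()):
        block = i // words_per_block + 1
        blocks = index.setdefault(word.strip("."), [])
        if not blocks or blocks[-1] != block:
            blocks.append(block)   # each block stored at most once per word
    return index

idx = block_index("This is a text. A text has many words. "
                  "Words are made from letters.", 4)
# idx["text"] -> [1, 2]; idx["words"] -> [3]; idx["letters"] -> [4]
```

Because an occurrence names only a block, exact positions must be recovered by an online scan of that block, which is what makes proximity queries more expensive under this scheme.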

        Searching an inverted file
 Vocabulary search
        The words and patterns in the query are searched separately
        Phrase and proximity queries are split into single words
 Search of occurrences
        The occurrences of all the words found are retrieved
 Processing of the occurrences
        Phrase, proximity, or Boolean operations are resolved
        If block addressing is used, a direct search of the text may be
         needed to find the information omitted from the occurrences
         (exact word positions)
 A search always starts with the vocabulary
        The vocabulary is kept in a separate file and held in main memory

 Single-word queries
        Hashing, tries, or B-trees are used to speed up the search
        Hashing and tries: search cost is O(m), independent of the
         text size (m = length of the query word)
        Storing the words in simple lexicographical order takes less
         space, and binary search can be used (O(log n))
 Prefix and range queries
        Solved with binary search, tries, or B-trees, but not hashing
        If the query is formed by single words, the search ends by
         delivering the lists of occurrences; if the pattern matches
         many words, several lists may have to be merged
 Context queries
        Each element is searched separately; since one list (in
         increasing position order) is generated for each element,
         context queries are harder to solve with an inverted file

         Building an inverted file using tries
 Can be built and maintained at relatively low cost
 The trie stores the vocabulary and the occurrence lists
        Each word of the text is read and searched in the trie
        If the word is not in the trie, it is added with an empty
         occurrence list
        If it is already there, the new position is appended to the end
         of its occurrence list
          1    6 9 11 17 19 24 28        33      40    46 50    55 60
          This is a text. A text has many words. Words are made from letters

          [Figure: a trie over the words above, whose leaves hold the
           occurrence lists letters: 60, made: 50, many: 28, text: 11,19,
           words: 33,40]
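The insertion procedure described above can be sketched as (class and function names hypothetical):

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # next letter -> child node
        self.occurrences = []   # positions where this word occurs

def trie_insert(root, word, position):
    """Search the word in the trie; create nodes (with an empty
    occurrence list) if absent, then append the new position."""
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.occurrences.append(position)
```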
 Once the text is processed, the trie is written to disk
 The index is split into two files
        so that the vocabulary is in memory at search time, and
        the number of occurrences of each word is known immediately
         from the vocabulary with little or no space overhead
          Postings file: stores the occurrence lists contiguously
          Vocabulary file
                 stored in lexicographical order
                 contains, for each word, a pointer to its list in the postings file

 Impractical when the index does not fit entirely in memory
        Use partial indexes
          Build a partial index up to the limit of available memory
          Write it to disk and repeat the process
          Merge the stored partial indexes hierarchically
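The hierarchical merge can be sketched as follows (names hypothetical; each partial index maps a word to a sorted occurrence list):

```python
import heapq

def merge_two(a, b):
    """Merge two partial indexes {word: sorted occurrence list}."""
    merged = dict(a)
    for word, occ in b.items():
        merged[word] = list(heapq.merge(merged.get(word, []), occ))
    return merged

def merge_hierarchically(parts):
    """Pairwise (hierarchical) merge of a list of partial indexes."""
    while len(parts) > 1:
        nxt = []
        for i in range(0, len(parts), 2):
            if i + 1 < len(parts):
                nxt.append(merge_two(parts[i], parts[i + 1]))
            else:
                nxt.append(parts[i])   # odd one out carries over
        parts = nxt
    return parts[0]
```

Merging pairwise keeps each merge between indexes of similar size, which bounds the total work, in the same spirit as merge sort.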

