					             Analyzing Document Retrievability in
                  Patent Retrieval Settings

                                  Shariq Bashir, and Andreas Rauber
                     DEXA 2009, Linz, Austria, 31 August – 4 September



                Department of Software Technology and Interactive Systems
                                Vienna University of Technology, Austria
                                      {bashir, rauber}@ifs.tuwien.ac.at
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                    Motivation
   Patent retrieval is an emerging and challenging area.
   Patents are legal documents, used to protect inventions.
    Patents are Complex
         – Patents have large document length.
         – Contain complex vocabulary.
         – Contain complex structure and technical contents.
         – Patent writers often intentionally use vague words and expressions in order
           to get their patents through examination.
         – This creates serious word mismatch problems.
         – Relevant patents may not be findable with the queries they are relevant to.
         – Users (Attorneys, Patent examiners) mostly use hundreds of queries for
    Patent Retrieval Is Different from Web Retrieval
         – Patent retrieval is a recall-oriented domain.
         – Finding all relevant patents is considered more important than finding only a small
           set of top-ranked relevant patents.
               • Example: a single prior-art patent can invalidate a new patent application,
                 but can we find such a patent with a given retrieval model?

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                    Motivation

    Role of Retrieval System in Accessing
     Information
         – The quality of user queries is always a subject of debate.
         – Rather than arguing about the quality of user queries, in this paper
           we examine the role of the retrieval system in accessing information.



         –   Can we access all information using a given retrieval model?
         –   How much does a retrieval system’s bias restrict our access to information?
         –   Are there subsets of a given collection that cannot be found?
         –   How easily can we find information with a given retrieval system?



. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                    Document Retrievability (aka Findability)

   We measure retrieval system effectiveness using a findability
    measure.

   Findability Measure
        –   Measures how easily a retrieval model can find all documents.
        –   Findability is measured over the top-c results (e.g. c = 35 or c = 80).
        –   Can show which retrieval system is better for finding patents.
        –   Can identify high- and low-findability subsets in the collection.
        –   Can identify non-findable subsets in the collection.




. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                      Computing Findability Measure

     Given a collection of documents D and a large set of queries Q.
     The findability of a document d is the number of queries in Q
      for which d appears in the top-c results.
     Example: if document d1 is findable in the top-c results of a single
      query q1, its findability score is r(d1) = 1.




     r(d) = Σ_{q ∈ Q} f(kdq, c)

     kdq is the rank of document d ∈ D in the results of query q ∈ Q.
     f(kdq, c) returns 1 if kdq <= c, and 0 otherwise.
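
The counting described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `findability` function, the `rankings` dictionary layout, and the toy query and document ids are hypothetical and not taken from the original experiments. Each ranked list is assumed to contain a document at most once.

```python
from collections import defaultdict

def findability(rankings, c):
    """Return r(d) for every document that appears in at least one top-c list.

    rankings: dict mapping a query id to its ranked list of document ids,
              best match first (hypothetical input layout).
    c:        rank cut-off.
    """
    scores = defaultdict(int)
    for ranked_docs in rankings.values():
        # f(kdq, c) = 1 exactly for the documents at ranks 1..c
        for doc in ranked_docs[:c]:
            scores[doc] += 1
    return dict(scores)

# Toy data: three hypothetical queries over four hypothetical documents.
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d2", "d1", "d4"],
    "q3": ["d2", "d4", "d1"],
}
r = findability(rankings, c=2)
print(r)
```

Documents that never reach the top-c of any query (here `d3`) are simply absent from the result, which corresponds to a findability score of 0.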

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                     Our Contribution

   Findability is measured with a
    single score across all queries.


   We consider the relevance of queries,
    analyzing
        – Findability across all queries
        – Findability considering only queries
          that the document is relevant for
        – Findability for queries that a
          document is NOT relevant for
        – Characteristics of high- and low-findability
          documents
        – To what extent we can increase the
          findability of documents
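
A minimal sketch of how the single findability score might be split by query relevance, assuming relevance judgments are available as a mapping from each query to the set of documents it is relevant for. The function name `split_findability`, the input layout, and the toy data are assumptions made for illustration only.

```python
def split_findability(rankings, relevant, c):
    """Count top-c appearances per document, split by query relevance.

    rankings: dict query id -> ranked list of document ids (best first).
    relevant: dict query id -> set of document ids the query is relevant for.
    c:        rank cut-off.
    """
    result = {}
    for query, ranked_docs in rankings.items():
        rel_docs = relevant.get(query, set())
        for doc in ranked_docs[:c]:
            entry = result.setdefault(
                doc, {"all": 0, "relevant": 0, "irrelevant": 0}
            )
            entry["all"] += 1
            # The same hit is credited to exactly one of the two buckets.
            entry["relevant" if doc in rel_docs else "irrelevant"] += 1
    return result

# Toy data (hypothetical): d4 is relevant for q2 but only ranked third there.
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d2", "d1", "d4"],
    "q3": ["d2", "d4", "d1"],
}
relevant = {"q1": {"d1"}, "q2": {"d4"}, "q3": {"d1", "d2"}}
scores = split_findability(rankings, relevant, c=2)
print(scores)
```

In this toy run `d4` is found once, but only through a query it is not relevant for, which is the kind of case the relevance-aware analysis is meant to expose.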
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                         Experiment Setup

     Retrieval models used
          – TFIDF, BM25, BM25F, Exact Match
     Patents from the US Patent and Trademark Office website
      http://www.uspto.gov
     USPC class 433 - Dentistry Domain
     For query generation, we used only the Claim section
     For indexing and searching, we used all sections
          – Title, Abstract, Claim, Background Summary, Description, Captions
     We used a rank cut-off factor of c = 35.

    Total Patents   Unique Terms   Average Patent Length (words)   Average Claim Section Length (words)
    7,213           53,456         2,888                           878.5

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                         Query Generation

     Queries are based on a patent invalidity search scenario
     Extract from each patent all single terms with
      term frequency > 2 in the Claim section
     Single terms are expanded into two- and three-term combinations
     A query is considered relevant for a patent if all of its terms
      appear at least 3 times in the document
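
The query generation and relevance rule above could be sketched as follows. The slides do not say exactly how single terms are combined, so this sketch assumes that two- and three-term queries are formed from the frequent claim terms of the same patent. The function names, parameters, and the toy term list are hypothetical.

```python
from collections import Counter
from itertools import combinations

def generate_queries(claim_terms, min_tf=3, max_len=3):
    """Build 1-, 2- and 3-term queries from one patent's Claim section.

    claim_terms: list of (already stemmed) terms from the Claim section.
    min_tf=3 encodes the slide's "term frequency > 2" condition.
    """
    tf = Counter(claim_terms)
    vocab = sorted(term for term, n in tf.items() if n >= min_tf)
    queries = []
    for size in range(1, max_len + 1):
        # Assumption: combinations are drawn from the same patent's terms.
        queries.extend(combinations(vocab, size))
    return queries

def is_relevant(query, doc_terms, min_occurrences=3):
    """A query is relevant if every term occurs at least 3 times in the document."""
    tf = Counter(doc_terms)
    return all(tf[term] >= min_occurrences for term in query)

# Toy claim section (hypothetical counts): "head" occurs only twice.
claim_terms = ["gear"] * 4 + ["tool"] * 3 + ["bodi"] * 3 + ["head"] * 2
queries = generate_queries(claim_terms)
print(len(queries), queries)
print(is_relevant(("gear", "tool"), claim_terms))
print(is_relevant(("gear", "head"), claim_terms))
```

For simplicity the example checks relevance against the same term list that produced the queries; in the setup described on the slides, relevance would be checked against the terms of the whole document.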


     Approach                                Total Queries               Average Retrievability Score
     Single-Term Queries                     9,751                       345.3
     Two-Term Queries                        67,735                      317.6
     Three-Term Queries                      337,200                     248
     All Queries                             414,686
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Patent Number: 5,348,473
Patent Title: Medical Tool
Queries (terms with term frequency > 2):
"GEAR", "CYLINDR", "PORTION", "BODI", "DRIVEN", "PROTRUS", "TOOL", "MEDIC",
"RESPECT"

Claims:
What is claimed is:
1. A medical tool, comprising: a housing including an elongated cylindrical body and a
substantially cylindrical head portion transverse to an axis of said cylindrical body;
a drive gear; and a driven gear; wherein at least one of said drive gear and said driven gear
have a diameter greater than an inside diameter of one of said cylindrical body and said head
portion, respectively, said drive gear being disposed in said cylindrical body and said driven
gear being disposed in said substantially cylindrical head portion, and further comprising
protrusions extending from said elongated cylindrical body and said cylindrical head portion,
said protrusions accommodating said drive gear and said driven gear, respectively.
2. A medical tool as recited in claim 1, wherein said drive gear and said driven gear have
beveled faces, and said protrusions conform to the shape of these beveled faces.
3. A medical tool as recited in claim 1, wherein at least one of said cylindrical body and said
head portion have a protrusion surrounding portions of at least one of said drive
 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                    Results with Single-Term Queries (rank cut-off c = 35)

                    [Figure: BM25]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                    Results with Two-Term Queries (rank cut-off c = 35)

                    [Figure: BM25]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                    Results with Three-Term Queries (rank cut-off c = 35)

                    [Figure: BM25]
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                     Some Low Findable Patents with BM25
Patent Title                            Retrievability    Retrievability    Total       Retrievability
                                        in All Queries    in Relevant       Relevant    in Irrelevant
                                                          Queries           Queries     Queries

Dental implant for the                  2                 0                 37          2
securement of fixed
dental prostheses

Electric toothbrush with                6                 0                 37          6
vibration

Dental lining composition               6                 0                 27          6

Dental implant member                   7                 0                 81          7

Optionally cross linkable               7                 0                 81          7
coatings for orthodontic
devices

Dental floss                            8                 0                 81          8

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                     Some High Findable Patents with BM25
Patent Title                            Retrievability    Retrievability    Total       Retrievability
                                        in All Queries    in Relevant       Relevant    in Irrelevant
                                                          Queries           Queries     Queries

Implants, device and                    2,146             12                81          2,134
method for joining tissue
parts

Method and system for                   2,028             12                81          2,016
comprehensive evaluation
of orthodontic treatment
using unified workstation

Teeth cleaning implement                1,905             16                81          1,889
with integrated fluid
dispenser

Method and apparatus for                1,838             11                81          1,837
the three-dimensional
registration and display
of prepared teeth

Dental filling material                 1,805             18                81          1,797
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                    Conclusion
   We analyze patent retrieval with the findability measure.
   We differentiate findability for relevant and irrelevant queries.
   Our results indicate that
        – With well-known retrieval models, we were not able to find some patents
          in the top-c results.
        – Patents with high retrievability are found more often through irrelevant queries
          than through relevant queries.
        – There is a lot of noise in the top-c results of queries.
   Future Work
        – To handle word mismatch, we need an efficient query expansion technique.
        – Individual patents have different findability scores in different retrieval
          models.
        – Example: patents with low findability in Model A can have high findability
          in Model B.
        – We need an efficient fusion technique.

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
                                          Thank You




. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

				