					            Google Desktop (Search) & Regain

               Seminar talk: Information Retrieval
              Winter term 2010/2011 (11/29/2010)
                          Eike Kleiner




1/12/2011             Eike Kleiner | University of Konstanz   1
  Agenda

1)    Google Desktop
     •        General
     •        Query syntax
     •        Features & missing features
2)    Regain
     •        General
     •        Query syntax
     •        Features & missing features
3)    File formats:
      •      Difficulties with comparison
      •      Generic approaches via plug-ins / preparators
4)    Fundamental differences
5)    What to do next? / Outlook

    1 Google Desktop: general

    Application for Desktop Search & Gadgets

    First release in 2004 (search only)

    Since 2005: additional gadget functionality

    Current version (on Windows): 5.9

    Mac and Linux versions are available

    Private and enterprise editions (both available free of charge)

    1 Google Desktop: search syntax

Similar to Google’s web search syntax, but weaker:
    Operators: +, -
      •      “info +retrieval” (include retrieval)
      •      “info -retrieval” (exclude retrieval)
      •      No Boolean operators AND & OR!
    Term and phrase search:
      •      information retrieval (the terms may appear anywhere in the document)
      •      “information retrieval” (exactly these terms must appear in exactly this order)
    Wildcards:
      •      … in terms: not offered, but stemming is used to find
             variations of each word (diet, diets, dietary …)
      •      … in phrases: not implemented
    File types: info retrieval AND (filetype:word OR filetype:doc)
    Field search: “subject:information” (e-mail fields only: from, to, cc etc.)
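
The operator and phrase rules above can be illustrated with a small sketch. It is not Google Desktop's implementation: it assumes that plain terms must all occur somewhere in the document, that a leading - excludes a term, and that a quoted phrase must occur word for word. The function names are made up for this example, and it uses straight ASCII quotes instead of the typographic quotes on the slide.

```python
import re

def parse_query(query):
    """Split a query into required terms, excluded terms, and exact phrases."""
    phrases = [p.lower() for p in re.findall(r'"([^"]+)"', query)]
    rest = re.sub(r'"[^"]+"', " ", query)
    required, excluded = [], []
    for token in rest.split():
        if token.startswith("-") and len(token) > 1:
            excluded.append(token[1:].lower())
        else:
            term = token.lstrip("+").lower()
            if term:
                required.append(term)
    return required, excluded, phrases

def matches(query, text):
    """Return True if the text satisfies the (simplified) query."""
    required, excluded, phrases = parse_query(query)
    words = re.findall(r"\w+", text.lower())
    padded = " " + " ".join(words) + " "
    if any(term not in words for term in required):
        return False
    if any(term in words for term in excluded):
        return False
    # A phrase matches only if its words appear consecutively and in order.
    for phrase in phrases:
        phrase_words = " ".join(re.findall(r"\w+", phrase))
        if " " + phrase_words + " " not in padded:
            return False
    return True
```

With the text "Introduction to information retrieval systems", the query `retrieval information` matches (term order does not matter), while the quoted phrase `"retrieval information"` does not.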


    1 Google Desktop: features

    Acts as proxy for web search results (seamless integration)
    Acts as web server for local results
    Integrated interface
    Index can be used by multiple devices (but what about privacy?)
    Definition of the search space (what sections to index)
    One can define which machine to search (machine:ek-pc)
    Searching only within a directory and its subdirectories is possible:
     (under:d:\this_is_an_example_directory\)
    Filter searches by groups (all, e-mails, web history, files, chats
     and other)
    Order by date and relevance
    Well documented plug-in concept with powerful API
    Accessible to other applications via the plug-in API
    Portal for sharing plug-ins
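
How the machine: and under: restrictions narrow a search can be sketched as a filter over indexed records. This is only an illustration of the idea, not how Google Desktop works internally. The record layout (a dict with the keys 'machine' and 'path') and both function names are assumptions made for this example, and paths containing spaces are not handled.

```python
from pathlib import PureWindowsPath

def split_operators(query):
    """Separate 'machine:' and 'under:' operators from the free-text terms."""
    operators, terms = {}, []
    for token in query.split():
        name, sep, value = token.partition(":")
        if sep and value and name in ("machine", "under"):
            operators[name] = value
        else:
            terms.append(token)
    return operators, terms

def in_scope(record, operators):
    """Check a record (hypothetical keys 'machine' and 'path') against the operators."""
    machine = operators.get("machine")
    if machine and record["machine"].lower() != machine.lower():
        return False
    under = operators.get("under")
    if under:
        root = PureWindowsPath(under)
        path = PureWindowsPath(record["path"])
        # The record must be the directory itself or lie somewhere below it.
        if path != root and root not in path.parents:
            return False
    return True
```

PureWindowsPath compares paths case-insensitively, which matches how Windows treats drive letters and directory names.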


    1 Google Desktop: missing features

    Index documentation

    Relevance criteria are neither documented nor modifiable

    Options for qualified searches with fields and operators

    Ways to influence the index structure (except via plug-ins)

    Only one’s own machines can be indexed (no crawler functionality)

    An interface that helps the user formulate complex queries
     (advanced search)

    2 Regain: general

    It’s an open source search application…

… and a web crawler

    Written in pure Java

    Based on Lucene Search

    Offers a desktop and a server version

    Current version: 1.7.0

    2 Regain: query syntax

Lucene search syntax:
    Operators and brackets: AND, OR, +, -, ()
    Term and phrase search
    Wildcards: for terms and phrases (? and *)
      •      “in?ormation retrie*” (didn’t work for me)
    File types
    Field searches: “title:information” (title, summary, headlines, last modified etc.)
    Fuzzy searches: “information retrieXXal~” (based on edit distance)
    Proximity searches: “information retrieval”~10 (the terms are within a distance of 10
     words)
    Range searches: “title:{Arbeit TO Bewerbung}” (works for different data types),
     inclusive [] or exclusive {}
    Term boosting / relevance manipulation: “information retrieval^10”
     (works with terms and phrases)
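
The fuzzy and proximity searches above rest on two simple ideas, sketched here in plain Python. This is not Lucene's code: the edit distance is the textbook Levenshtein algorithm, and the proximity check is a simplified stand-in that only compares word positions, whereas Lucene's real slop calculation is more involved. Both function names are made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, or substitutions that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]

def within_proximity(words, first, second, max_distance):
    """True if some occurrence of first and some occurrence of second
    are at most max_distance word positions apart (simplified)."""
    positions_first = [i for i, w in enumerate(words) if w == first]
    positions_second = [i for i, w in enumerate(words) if w == second]
    return any(abs(i - j) <= max_distance
               for i in positions_first for j in positions_second)
```

For the example term on the slide, `edit_distance("retrieXXal", "retrieval")` is 2: one substitution and one deletion.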




    2 Regain: features

    Acts as a web server for local results

    Index is shareable by different machines

    Well documented plug-in concept (preparators)

    Source code is fully accessible

    Relevance criteria are well explained and can be overwritten (boost)

    Definition of the search space (what sections to index)

    Index is well structured into fields and is documented

    Many possibilities to influence and understand the index via configuration:
      •      Content extraction
      •      Stop word lists
      •      Logs of different creation states
      •      Analyze dead links
      •      Create custom fields from URLs or paths
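
Two of the configuration options above, stop word lists and custom fields derived from URLs or paths, can be sketched as follows. This illustrates the concepts only and does not use Regain's configuration format. The stop word list, the regular expressions, and the function names are all assumptions made for this example.

```python
import re

# Hypothetical stop word list; a real one would be language specific.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}

def index_terms(text):
    """Tokenize the text and drop stop words before indexing."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]

def field_from_path(path, pattern, group=1):
    """Derive a custom field value from a URL or path with a regular
    expression; returns None when the pattern does not match."""
    match = re.search(pattern, path)
    return match.group(group) if match else None
```

For example, `field_from_path("/projects/regain/src/Main.java", r"^/projects/([^/]+)/")` would yield a "project" field with the value `regain`.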


    2 Regain: missing features

    Easy way to define which machine to search (at least, none found yet)

    Easy way to define which directory to search (at least, none found yet)

    No ordering by date is possible out of the box

    No filtering possible (except via the file type search and OR operators)

    No interface of its own: a web browser is always needed. System
     integration?

    An interface that helps the user formulate complex queries (advanced
     search)

    Centralized resource for sharing preparators and other material (only a
     rather messy forum is online)


    3 File formats: comparison (1)


Google Desktop                                Regain
HTML / XML                                    yes
Plain text                                    yes
PDF                                           yes
MS Office                                     yes
  •  PowerPoint                               yes
  •  Word                                     yes
  •  Excel                                    yes
  •  Outlook                                  yes (IFilter plug-in: Windows only)
OpenOffice (only via plug-in)                 yes
Chat protocols (MSN, Google Talk,             no
  AOL, Skype)


  3 File formats: comparison (2)


Google Desktop                                Regain
Netscape / Thunderbird mail                   no
Browser history                               no
Metadata only:
  •  Audio files                              no
  •  Image files                              no
  •  Video files                              no




    3 File formats: comparison is obsolete? Difficulties…

    Google Desktop: plug-ins
      •     It offers a full SDK with an API for writing custom gadgets and search plug-ins
      •     Many additional file formats and gadgets are already available
      •     Plug-ins can use “IFilter” on Windows systems


    Regain: generic IFilter preparator & preparator concept
      •     A generic preparator for the IFilter interface: Regain is able to index
            all data that MS Windows Search can index
      •     Preparators are fully independent plug-ins for the crawling process
      •     Written in Java, but they could contain bindings to other languages
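
The preparator idea, one independent plug-in per file format that turns raw content into indexable text, can be sketched as a small registry. This is a conceptual illustration in Python, not Regain's Java preparator API. The registry, the function names, and the choice of formats are assumptions made for this example.

```python
from html.parser import HTMLParser
from pathlib import PurePath

class _TextCollector(HTMLParser):
    """Collects the text content of an HTML document, ignoring the tags."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        if data.strip():
            self.parts.append(data.strip())

def prepare_plain_text(raw):
    """Preparator for plain text: decode the bytes and nothing more."""
    return raw.decode("utf-8", errors="replace")

def prepare_html(raw):
    """Preparator for HTML: strip the markup and keep the text."""
    collector = _TextCollector()
    collector.feed(raw.decode("utf-8", errors="replace"))
    collector.close()
    return " ".join(collector.parts)

# Hypothetical registry: the crawler picks a preparator by file extension.
PREPARATORS = {
    ".txt": prepare_plain_text,
    ".html": prepare_html,
    ".htm": prepare_html,
}

def prepare(filename, raw):
    """Return the indexable text, or None if no preparator is registered."""
    preparator = PREPARATORS.get(PurePath(filename).suffix.lower())
    return preparator(raw) if preparator else None
```

Supporting a new format then only means registering one more function; the crawling loop itself stays unchanged.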




  4 Fundamental differences


Google Desktop                             Regain
Closed source                              Open source (LGPL)
Platform dependent (but…)                  Platform independent
Data privacy (index on Google              Data privacy
  servers, unique ID)
Gadgets / Sidebar                          Nope!
Vendor: Google Inc. and a vital            Vendor: some German guys and
  plug-in community                          an open community
One index                                  Multiple indices
Where’s my index?                          Sharable index (between
                                             machines & applications)
Index is a black box                       Index is open and documented
Lame search syntax                         Mighty search syntax

    5 What to do next? Outlook to upcoming talk

    Gain insight into Google Desktop’s plug-in concept
      •     Test and analyze the OpenOffice plug-in
    Gain insight into Regain’s preparator concept
      •     Try to write a custom preparator for ID3 tags of mp3 files
    Find out why not all Lucene syntax worked with Regain
    Understand the structure of the Regain index with the “Luke - Lucene
     Index Toolbox”
    Try to get information about the structure of the Google Desktop index
     and some answers as to why the query syntax is so weak
    Try out the plug-in “Google Desktop Extreme”, which offers a more
     powerful search interface and some other useful additions, and compare
     the results with Regain’s search syntax and capabilities
    Maybe: build up test data and evaluate:
      •     Index speed (hard to measure)
      •     Search speed (hard to measure)
      •     Index quality (?): depends on ‘quality’ parameters
             o   Which metadata is indexed: comparison of selected formats (e.g. audio and image files)
             o   Is the full text indexed, and does it lead to the same results in both applications (for PDF files)?
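
As a first step toward the ID3 preparator mentioned above, the metadata extraction itself can be sketched independently of Regain. The sketch below handles only the older ID3v1 layout, a fixed 128-byte block at the end of the file that starts with "TAG". The more common ID3v2 tags at the start of the file are not covered. It is written in Python for brevity, although a real Regain preparator would be written in Java, and the function name is made up for this example.

```python
def read_id3v1(data):
    """Parse an ID3v1 tag from the raw bytes of an mp3 file.
    Returns a dict of fields, or None if no ID3v1 tag is present."""
    if len(data) < 128:
        return None
    tag = data[-128:]
    if tag[:3] != b"TAG":
        return None

    def text(chunk):
        # Fields are fixed width and padded with NUL bytes (or spaces).
        return chunk.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
        "comment": text(tag[97:127]),
        "genre": tag[127],  # numeric genre code
    }
```

In practice the input would be the file content, for example `read_id3v1(open(path, "rb").read())`, and the returned fields would become index fields such as title and artist.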


                     Thanks for your attention

            Questions: now or via Eike.Kleiner@uni-konstanz.de





				