					Distributed Search over the Hidden Web
Hierarchical Database Sampling and Selection



        Panagiotis G. Ipeirotis
           Luis Gravano
     Computer Science Department
         Columbia University
                    Distributed Search? Why?
                 “Surface” Web vs. “Hidden” Web

           [Figure: a Web search form with a "Keywords" box and SUBMIT / CLEAR buttons]

           “Surface” Web                             “Hidden” Web
            –   Link structure                         –   No link structure
            –   Crawlable                              –   Documents “hidden” in databases
            –   Documents indexed by search engines    –   Documents not indexed by search engines
                                                       –   Need to query each collection individually

2/11/2009                                Columbia University                        2
                           Hidden Web: Examples

            PubMed search: [diabetes]
               178,975 matches
                PubMed is at http://www.ncbi.nlm.nih.gov/PubMed

            Google search: [diabetes site:www.ncbi.nlm.nih.gov]
               119 matches

            Database              Query                         Matches   Google
            PubMed                diabetes                      178,975   119
            U.S. Patents          wireless network              16,741    0
            Library of Congress   visa regulations              >10,000   0
            …                     …                             …         …

            Distributed Search: Challenges
             Select good databases for query
                  –   Content summaries of databases (vocabulary, word frequencies)
             Evaluate query at these databases
             Merge results from databases

            [Figure: a metasearcher in front of Hidden-Web databases, each shown with its content summary]

            Database              kidneys    stones
            PubMed                220,000    40,000
            Library of Congress   20         950
            ESPN                  5          40

            Database Selection Problems
        1. How to extract content summaries?

            [Figure: a Web database and its content summary]

            Word         Frequency
            basketball   4
            cancer       4,532
            cpu          23

        2. How to use the extracted content summaries?

            [Figure: the query [cancer] arrives at the metasearcher, which holds one content summary per database]

            Word         Web Database 1   Web Database 2   Web Database 3
            basketball   4                4                6,340
            cancer       4,532            60,298           2
            cpu          23               0                0
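
As a rough illustration of question 2, the sketch below ranks the three databases from the figure for the query [cancer] by the word frequency stored in each content summary. Ranking by raw frequency is an assumption made here for simplicity, not the selection algorithm evaluated in the talk, and the names `summaries` and `rank_databases` are hypothetical.

```python
# Content summaries copied from the slide: word -> number of matching documents.
summaries = {
    "Web Database 1": {"basketball": 4, "cancer": 4532, "cpu": 23},
    "Web Database 2": {"basketball": 4, "cancer": 60298, "cpu": 0},
    "Web Database 3": {"basketball": 6340, "cancer": 2, "cpu": 0},
}


def rank_databases(query_word, summaries):
    """Return database names sorted by the frequency of query_word, best first.

    Assumption: raw frequency is used as the score; a word missing from a
    summary counts as 0.
    """
    return sorted(
        summaries,
        key=lambda db: summaries[db].get(query_word, 0),
        reverse=True,
    )


ranking = rank_databases("cancer", summaries)
print(ranking)
```

A word that is absent from a summary scores 0 here, which is exactly the weakness that incomplete, sampled content summaries expose later in the talk.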


            Extracting Content Summaries
            from Web Databases

             No direct access to remote documents other
              than by querying

             Resort to query-based document sampling:
                  Send queries to database
                  Retrieve document sample
                  Use sample to create approximate content summary




             “Random” Query-Based Sampling
             Pick a word and send it as a query to database
             Retrieve top-k documents returned (e.g., k=4)
             Repeat until “enough” (e.g., 300) documents are retrieved
             Use word frequencies in sample to create content summary

              Callan et al., SIGMOD’99, TOIS 2001

            [Figure: a sample built from query words such as aids, metallurgy, football, polo, ram, keyboard, cancer, dna]

            Word          Frequency in Sample
            cancer        150 (out of 300)
            aids          114 (out of 300)
            heart         98 (out of 300)
            …
            basketball    2 (out of 300)
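
The loop on this slide can be sketched as follows. This is a minimal illustration over a toy in-memory "database", not the authors' or Callan et al.'s implementation: `search`, `sample_database`, `content_summary`, and `toy_db` are hypothetical names, and drawing later query words from already retrieved documents is an assumption.

```python
import random
from collections import Counter


def search(database, word, k):
    """Toy stand-in for a remote search interface: first k documents containing word."""
    return [doc for doc in database if word in doc.split()][:k]


def sample_database(database, seed_words, k=4, target=300, max_queries=1000, rng=None):
    """Query with one word at a time until `target` distinct documents are sampled."""
    rng = rng or random.Random(0)
    vocabulary = list(seed_words)
    known = set(vocabulary)
    sample, seen = [], set()
    for _ in range(max_queries):
        if len(sample) >= target:
            break
        word = rng.choice(vocabulary)
        for doc in search(database, word, k):
            if len(sample) >= target:
                break
            if doc in seen:
                continue
            seen.add(doc)
            sample.append(doc)
            # Assumption: words from retrieved documents become candidate queries.
            for w in doc.split():
                if w not in known:
                    known.add(w)
                    vocabulary.append(w)
    return sample


def content_summary(sample):
    """For each word, count the sampled documents that contain it."""
    df = Counter()
    for doc in sample:
        df.update(set(doc.split()))
    return df


toy_db = [
    "cancer treatment aids research",
    "heart disease and cancer",
    "aids and hiv prevention",
    "basketball injuries of the heart",
    "cancer of the liver",
]
sample = sample_database(toy_db, ["cancer"], k=2, target=4)
summary = content_summary(sample)
print(summary.most_common(3))
```

The resulting counts are frequencies within the sample only, which is why the next slide notes that this approach yields a ranking of words rather than actual database frequencies.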
             Random Sampling: Problems

             No actual word frequencies computed for content summaries, only a “ranking” of words
             Many words missing from content summaries (many rare words)
             Many queries return very few or no matches

            [Figure: Zipf’s law. Number of documents (y-axis) against word rank (x-axis); many words appear in only one or two documents]




            Our Technique: Focused Probing

        1. Train document classifiers
               Find representative words for each category

        2. Use classifier rules to derive a topically-focused sample
            from database

        3. Estimate actual document frequencies for all discovered
            words




              Focused Probing: Training
             Start with a predefined topic hierarchy and preclassified documents
             Train document classifiers for each node
             Extract rules from classifiers (SIGMOD 2001):
                  Root:
                     ibm AND computers → Computers
                     lung AND cancer → Health
                     …
                  Health:
                     angina → Heart
                     hepatitis AND liver → Hepatitis
                     …

            [Figure: topic hierarchy. Root has children including Computers and Health; Health has children including Heart and Hepatitis]
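
One possible way to hold these rules in code and turn them into query strings is sketched below. The data layout and the names `rules`, `rule_to_query`, and `queries_for_node` are hypothetical; only the four rules themselves come from the slide.

```python
# Rules per hierarchy node: (terms that must all appear, category the rule points to).
rules = {
    "Root": [
        (("ibm", "computers"), "Computers"),
        (("lung", "cancer"), "Health"),
    ],
    "Health": [
        (("angina",), "Heart"),
        (("hepatitis", "liver"), "Hepatitis"),
    ],
}


def rule_to_query(terms):
    """Join the rule's terms into a conjunctive query string."""
    return " AND ".join(terms)


def queries_for_node(node):
    """Return (query, target category) pairs for the rules attached to a node."""
    return [(rule_to_query(terms), category) for terms, category in rules.get(node, [])]


print(queries_for_node("Health"))
```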
            Focused Probing: Sampling
             Sampling proceeds in rounds:
              In each round, the rules associated with each node are turned into queries for the database

             Transform each rule into a query
             For each query:
                  Send to database
                  Record number of matches
                  Retrieve top-k matching documents
             At the end of round:
                  Analyze matches for each category
                  Choose category to focus on

             Output:
                  Representative document sample
                  Actual frequencies for some “important” words

            [Figure: probe queries placed over the topic hierarchy (Root, Computers, Health, Science, Sports, Heart, Cancer, Hepatitis, AIDS), each annotated with its number of matches, from 0 to 24,520. Probes shown include aids, cancer, metallurgy, football, polo, keyboard, ram, dna, oncology, liver, angina, chf, psa, hiv, and safe AND sex]
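
A single round might look like the sketch below. It assumes that the category to focus on is simply the one whose probes produced the most matches, which the slide does not specify; `probe_round`, `fake_search`, and the match counts in `fake_results` are hypothetical and chosen only for illustration.

```python
def probe_round(node_rules, search, k=4):
    """Run one probing round.

    node_rules: list of (query, category) pairs for the current node.
    search: callable (query, k) -> (number of matches, top-k documents).
    Returns the category to focus on, the per-category match tally,
    and the documents retrieved in this round.
    """
    sample, tally = [], {}
    for query, category in node_rules:
        num_matches, top_docs = search(query, k)
        tally[category] = tally.get(category, 0) + num_matches
        sample.extend(top_docs)
    # Assumption: focus on the category with the largest total number of matches.
    focus = max(tally, key=tally.get) if tally else None
    return focus, tally, sample


# Hypothetical database responses: query -> (match count, matching documents).
fake_results = {
    "ibm AND computers": (12, ["doc-a"]),
    "lung AND cancer": (830, ["doc-b", "doc-c", "doc-d", "doc-e", "doc-f"]),
}


def fake_search(query, k):
    num_matches, docs = fake_results.get(query, (0, []))
    return num_matches, docs[:k]


root_rules = [("ibm AND computers", "Computers"), ("lung AND cancer", "Health")]
focus, tally, sample = probe_round(root_rules, fake_search)
print(focus, tally, len(sample))
```

The match counts recorded for one-word probes are also what later supplies the "actual frequencies for some important words" listed under Output.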


        Sample Frequencies and Actual Frequencies

         “liver” appears in 200 out of 300 documents in sample
         “kidney” appears in 100 out of 300 documents in sample
         “hepatitis” appears in 30 out of 300 documents in sample

                   Document frequencies in actual database?

              Query “liver” returned 140,000 matches
              Query “hepatitis” returned 20,000 matches
              “kidney” was not a query probe…

            Can exploit number of matches from one-word queries

            Adjusting Document Frequencies
             We know ranking r of words according to document frequency in sample
             We know absolute document frequency f of some words from one-word queries
             Mandelbrot’s formula connects empirically word frequency f and ranking r:

                  f = P (r + p)^{-B}

             We use curve-fitting to estimate the absolute frequency of all words in sample

            [Figure: frequency f (y-axis) against rank r (x-axis) for words ordered cancer, liver, ..., kidneys, ..., stomach, ..., hepatitis. The frequency in the sample is always known. The absolute frequency is known for some words (140,000, 60,000, and 20,000 matches) and unknown ("?") for the rest]
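
The curve-fitting step could be approximated as in the sketch below. The slide does not say which fitting procedure is used; this version assumes a grid search over p combined with least squares on log f = log P - B log(r + p). The names `fit_mandelbrot` and `estimate`, and the synthetic data points, are hypothetical.

```python
import math


def fit_mandelbrot(known, p_grid=None):
    """Fit f = P * (r + p) ** -B to (rank, frequency) pairs.

    For each candidate p, log f is linear in log(r + p), so P and B follow
    from ordinary least squares; the p with the smallest error wins.
    """
    if len(known) < 2:
        raise ValueError("need at least two (rank, frequency) pairs")
    p_grid = p_grid or [i * 0.5 for i in range(41)]  # candidate p values 0.0 .. 20.0
    ys = [math.log(f) for _, f in known]
    my = sum(ys) / len(ys)
    best = None
    for p in p_grid:
        xs = [math.log(r + p) for r, _ in known]
        mx = sum(xs) / len(xs)
        sxx = sum((x - mx) ** 2 for x in xs)
        if sxx == 0:
            continue
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        err = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or err < best[0]:
            best = (err, math.exp(intercept), p, -slope)
    if best is None:
        raise ValueError("ranks must not all be identical")
    _, P, p, B = best
    return P, p, B


def estimate(rank, P, p, B):
    """Estimated absolute document frequency for a word at the given sample rank."""
    return P * (rank + p) ** -B


# Synthetic check: generate "known" frequencies from chosen parameters, then refit.
true_P, true_p, true_B = 200_000.0, 1.0, 1.2
known = [(r, true_P * (r + true_p) ** -true_B) for r in (1, 3, 8, 20)]
P, p, B = fit_mandelbrot(known)
print(P, p, B, estimate(5, P, p, B))
```

In the setting of the slide, `known` would hold the sample ranks of probe words such as "liver" and "hepatitis" paired with their reported match counts, and `estimate` would then be applied to the ranks of words such as "kidney" that were never sent as queries.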


            Actual PubMed Content Summary
            PubMed content summary
            Number of Documents: 3,868,552
            category: Health, Diseases
            …
            cancer            1,398,178
            aids              106,512
            heart             281,506
            hepatitis         23,481
            …
            basketball        907
            cpu               487

             Extracted automatically
             ~ 27,500 words in extracted content summary
             Fewer than 200 queries sent
             At most 4 documents retrieved per query

                    The extracted content summary accurately represents
                     size, contents, and classification of the database
              Focused Probing: Contributions

             Focuses database sampling on dense topic areas


             Estimates absolute document frequencies of words


             Classifies databases along the way
                Classification useful for database selection




            Database Selection Problems
        1. How to extract content summaries?

        2. How to use the extracted content summaries?

            [Figure, repeated from the earlier "Database Selection Problems" slide: the query [cancer] arrives at the metasearcher, which holds the content summaries of Web Databases 1, 2, and 3]


            Database Selection and Extracted Content Summaries

             Database selection algorithms assume complete
               content summaries

             Content summaries extracted by (small-scale)
               sampling are inherently incomplete (Zipf's law)

             Queries with undiscovered words are problematic

                        Database Classification Helps:
                   Similar topics ↔ Similar content summaries
              Extracted content summaries complement each other

            Content Summaries for Categories: Example

             CANCERLIT contains “metastasis”, not found during sampling
             CancerBACUP contains “diabetes”, not found during sampling
             Cancer category content summary contains both

                                    CANCERLIT     CancerBACUP    Category: Cancer (NumDBs: 2)
            Number of Documents     148,944       17,328         166,272
            …
            breast                  121,134       12,546         133,680
            …
            cancer                  91,688        9,735          101,423
            …
            diabetes                11,344        <not found>    11,344
            …
            metastasis              <not found>   3,569          3,569
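
The category numbers on this slide are consistent with adding up the per-database figures, so a minimal sketch of the aggregation could look as follows. Treat plain summation as an assumption about the merging step; `merge_summaries` and the dictionary layout are hypothetical names, while the numbers are taken from the slide.

```python
from collections import Counter

# Extracted content summaries; a word not found during sampling is simply absent.
cancerlit = {
    "num_docs": 148_944,
    "df": {"breast": 121_134, "cancer": 91_688, "diabetes": 11_344},
}
cancerbacup = {
    "num_docs": 17_328,
    "df": {"breast": 12_546, "cancer": 9_735, "metastasis": 3_569},
}


def merge_summaries(summaries):
    """Build a category content summary by summing sizes and word frequencies."""
    df = Counter()
    for summary in summaries:
        df.update(summary["df"])  # Counter.update adds counts per word
    return {
        "num_dbs": len(summaries),
        "num_docs": sum(s["num_docs"] for s in summaries),
        "df": dict(df),
    }


category = merge_summaries([cancerlit, cancerbacup])
print(category["num_docs"], category["df"])
```

The merged summary contains both "diabetes" and "metastasis", even though each database summary is missing one of them.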




            Hierarchical DB Selection: Outline

             Create aggregated content summaries for categories


             Hierarchically direct queries using categories
                 Category content summaries are more complete
                 than database content summaries



                    Various traversal techniques possible


            Hierarchical DB Selection: Example
        To select D databases:
             Use a “flat” DB selection algorithm to score categories
             Proceed to category with highest score
             Repeat until category is a leaf, or category has fewer than D databases

        Query: [brazil AND world AND cup]

            Node                     NumDBs   Score
            Root                     136
              Arts                   35       0.0
              Computers              55       0.15
              Health                 25       0.10
              Sports                 21       0.93
                Baseball             7        0.18
                Hockey               8        0.08
                ESPN (database)               0.68
                Soccer               5        0.92
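
The descent described above can be sketched with the scores from the example. This is only one reading of the three steps: the `Category` class and `descend` function are hypothetical, the scores are assumed to be precomputed by some flat selection algorithm, and the ESPN database that sits directly under Sports is left out because only categories are traversed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Category:
    name: str
    num_dbs: int
    score: float = 0.0  # score assigned to this category for the current query
    children: List["Category"] = field(default_factory=list)


def descend(root: Category, d: int) -> List[str]:
    """Follow the highest-scoring child until a leaf or a category with fewer than d databases."""
    path = [root.name]
    node = root
    while node.children and node.num_dbs >= d:
        node = max(node.children, key=lambda c: c.score)
        path.append(node.name)
    return path


sports = Category("Sports", 21, 0.93, [
    Category("Baseball", 7, 0.18),
    Category("Hockey", 8, 0.08),
    Category("Soccer", 5, 0.92),
])
root = Category("Root", 136, children=[
    Category("Arts", 35, 0.0),
    Category("Computers", 55, 0.15),
    Category("Health", 25, 0.10),
    sports,
])

print(descend(root, d=3))
```

Once the descent stops, the databases under the final category would be scored individually; how the remaining slots are filled when that category holds fewer than D databases is one of the traversal choices the outline slide leaves open.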



            Experiments: Content Summary Extraction

            Focused Probing compared to Random Sampling:
             Better vocabulary coverage
             Better word ranking
                  Ignores “off-topic” documents
             More efficient for same sample size
                  Retrieves same number of documents using fewer queries
             More effective for same sample size
                  Better sample: each retrieved document “represents” many unretrieved, so “on-topic” sampling helps
             Topic detection helps

            [Figure: actual word lists (aids, cancer, heart, …, basketball, pneumonia) shown beside the word lists recovered from the sample]



                                 More results in the paper!
      4 types of classifiers (SVM, Ripper, C4.5, Bayes), frequency estimation, different data sets…


            Experiments: Database Selection

            Data set and workload:
             50 real Web databases
             50 TREC Web Track queries

            Metric: Precision @ 15
             For each query pick 3 databases
             Retrieve 5 documents from each database
             Return 15 documents to user
             Mark “relevant” and “irrelevant” documents

            [Figure: a query passes through database selection to the chosen databases (labeled LoC), which return documents]


                    Good database selection algorithms choose
                       databases with relevant documents
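
The metric can be written down in a few lines. The sketch below assumes relevance judgments are available as a predicate; `precision_at_n`, `evaluate_query`, and the toy documents and judgments are hypothetical and only mirror the 3 databases × 5 documents setup on the slide.

```python
def precision_at_n(judgments):
    """Fraction of returned documents judged relevant (judgments is a list of booleans)."""
    return sum(judgments) / len(judgments) if judgments else 0.0


def evaluate_query(selected_dbs, retrieve, is_relevant, per_db=5):
    """Retrieve per_db documents from each selected database and score the merged list."""
    returned = []
    for db in selected_dbs:
        returned.extend(retrieve(db, per_db))
    return precision_at_n([is_relevant(doc) for doc in returned])


# Toy setup: documents are (database, position) pairs; 4 of the 15 are marked relevant.
relevant_docs = {("A", 0), ("A", 1), ("B", 3), ("C", 4)}


def toy_retrieve(db, n):
    return [(db, i) for i in range(n)]


precision = evaluate_query(["A", "B", "C"], toy_retrieve, lambda doc: doc in relevant_docs)
print(precision)
```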

            Experiments:
            Precision of Database Selection Algorithms
                                            Hierarchical               Flat
                  Focused Probing                 0.27                0.17

                  Random Sampling                     -               0.18

             Hierarchical database selection improves precision drastically
              Category content summaries more complete
              Topic-based database clustering helps

                                 More results in the paper!
            (different flat selection algorithms, more content summary extraction algorithms…)

                           Best result for centralized search ~ 0.35
                                Not an option for Hidden Web!

             Contributions
             Technique for extracting content summaries from
              completely autonomous Hidden-Web databases

             Technique for estimating frequencies: Possible to
              distinguish large from small databases

             Hierarchical database selection exploits classification,
              drastically improving the precision of distributed search

            Content summary extraction implemented and available for download at:
                            http://sdarts.cs.columbia.edu

            Future Work


             Different techniques for merging content summaries
              for category content summary creation

             Effect of frequency estimation on database selection


             Different hierarchy “traversing” algorithms for
              hierarchical database selection



