Building Knowledge Bases from the Web

Rajeev Rastogi
Yahoo! Labs Bangalore

Basic premise
The Web is a vast repository of human knowledge
Diverse information spanning multiple verticals
• Wikipedia, Product, Business, People, …

Grand challenge
Mine the Web to build knowledge bases (KBs) of people, places, things, events, …

Name            Address                                                             Phone
Chinese Mirch   120 Lexington Ave (between 28th St & 29th St), New York, NY 10016   (212) 532-3663

Name             Affiliation             # connections
Rajeev Rastogi   Yahoo! Labs Bangalore   142

Product Name                                  List Price   Sale Price
Apple iPod nano 8 GB Black (5th Generation)   $145.00      $139.99

Camera                Aspect Ratio   Megapixels
Canon Powershot 600   4:3            0.5
Olympus D-300L        4:3            0.8

What did search look like in the past?

Search results of the future: Structured abstracts
Examples: yelp.com, Gawker, babycenter, New York Times, epicurious, LinkedIn, answers.com, webmd

Comparison shopping
Rank by price
Product near me

Topic entity pages
• A topic-based page automatically generated in real time (Celebrity, Music, Videos)
• Related Topics
• Relevant multimedia content including music, videos, information from Wikipedia, etc.
• Up to the minute: latest info using news feeds, blogs, Twitter, Flickr to stay up to date on Madonna

Building KBs from the Web is a hard problem
• Billions of pages with diverse structure, conflicting information, noise
[Figure: pages from yelp.com and superpages.com, with noise highlighted]

Page content/structure changes constantly
[Figure: old and new versions of a page]
• ~2% of sites change each day

KB creation pipeline
1. Content acquisition: acquire content from the Web
2. Information extraction: extract structured data for entities from Web pages
3. Disambiguation & integration: identify and integrate data for each entity (example entity: Roma Bistro Paris)

IE example
Fields marked on the page: Name, Address, Phone, Cuisine, Price, Reviews

Name            Address                                 Phone
Chinese Mirch   120 Lexington Ave, New York, NY 10016   (212) 532-3663

Template-based Web pages
• From head/torso sites
• Pages have similar structure
• ~30% of crawled Web pages
• Information rich: 31% of search results

Hand-crafted pages
• Mainly from tail sites
• Pages have diverse structures

Browse pages
• Similar-structured records

Unstructured text

Web extraction landscape
Axes: Content (context, content features, content redundancy) vs. Structure (site structure, page structure, unstructured)
Page types: template-based pages; hand-crafted, browse pages; unstructured text
• Pattern-based (context): Snowball [AG 00]
• Machine Learning Models (content features): HCRF [ZNWZM 06], MLN [YCWZZM 09]
• Content Matching (content redundancy): [GRST 10]
• Wrapper (site structure): [KWD 97], [MMK 99]
• Record Identification (page structure): RoadRunner [CMM 01], DEPTA [ZL 05]

Wrapper induction
• Technique for extraction from template-based pages
• Learn: Website pages → Cluster → Sample pages → Annotate Pages → Annotations → Learn Rules → XPath Rules
• Monitor: XPath Rules → Monitor Rules → Site change
• Extract: Website pages → Apply Rules → Records

Clustering pages
• Group structurally similar pages using shingle signatures

Page shingle signature
• Tags: html body @id textarea @id div /div /textarea … br/ /body /html
• Windows: taken over the tag sequence
• Hash: one value per window (55, 5, 20, …, 30)
• Min: the smallest hash value is the shingle (Shingle: 5)
• Page signature: Vector of shingles
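
The shingle construction above can be sketched as a small min-hash routine. This is only an illustration under stated assumptions: the window width, the number of shingles, and the use of MD5 with an integer seed to get several hash functions are choices made here, not details given in the slides.

```python
import hashlib

def tag_windows(tags, width):
    """Slide a fixed-width window over the page's tag sequence."""
    windows = [tuple(tags[i:i + width]) for i in range(len(tags) - width + 1)]
    return windows or [tuple(tags)]  # very short pages: one window with all tags

def window_hash(window, seed):
    """Hash one window; `seed` selects one of several hash functions (assumption)."""
    data = (str(seed) + "|" + " ".join(window)).encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:4], "big")

def page_signature(tags, width=3, num_shingles=8):
    """One shingle per hash function: the minimum hash over all windows."""
    windows = tag_windows(tags, width)
    return [min(window_hash(w, seed) for w in windows) for seed in range(num_shingles)]

def signature_similarity(sig_a, sig_b):
    """Fraction of shingle positions on which two signatures agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two pages from the same (hypothetical) template, with 2 and 3 repeated <div> records.
page_a = ["html", "body", "div", "/div", "div", "/div", "/body", "/html"]
page_b = ["html", "body", "div", "/div", "div", "/div", "div", "/div", "/body", "/html"]
print(signature_similarity(page_signature(page_a), page_signature(page_b)))
```

Because the minimum is taken over the set of windows, pages that repeat the same structural unit a different number of times still produce the same shingles, which is what makes the signature usable for clustering by template.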

Rule learning
XPath Generalization
/html/body/div/div/div/div/div/div/span[@class="tel"]  →  //span[@class="tel"]
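
To see why the generalized rule is preferable, the sketch below applies a fully specified path and a generalized path to two versions of a page. The HTML snippets are hypothetical and shorter than the path on the slide. Python's `xml.etree.ElementTree` supports only a limited XPath subset, so the paths are written relative to the root `<html>` element.

```python
import xml.etree.ElementTree as ET

# Hypothetical old and new layouts of the same page; the new one drops a wrapper <div>.
OLD = "<html><body><div><div><span class='tel'>(212) 532-3663</span></div></div></body></html>"
NEW = "<html><body><div><span class='tel'>(212) 532-3663</span></div></body></html>"

SPECIFIC = "./body/div/div/span[@class='tel']"  # stands in for /html/body/div/div/span[...]
GENERAL = ".//span[@class='tel']"               # stands in for //span[@class="tel"]

def extract(page, xpath):
    """Return the text of the first node matching `xpath`, or None if nothing matches."""
    node = ET.fromstring(page).find(xpath)
    return None if node is None else node.text

for name, page in (("old", OLD), ("new", NEW)):
    print(name, extract(page, SPECIFIC), extract(page, GENERAL))
```

The specific path stops matching once the layout changes, while the generalized path still finds the phone number.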

Learning robust XPaths
• Most general XPath: //*
• SPECIALIZE: //* → //span, //*[@class=tel], //h1 → //span[@class=tel]
• Use Apriori to generate candidate XPaths
• Goal: most general XPath that matches all the annotated values and none of the unannotated values
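
A minimal sketch of the general-to-specific search follows. It assumes each node is reduced to a tag plus attribute values, and it enumerates predicate combinations exhaustively by size rather than with Apriori pruning; the node representation and function names are hypothetical.

```python
from itertools import combinations

def features(node):
    """Predicates a node satisfies: its tag plus each attribute=value pair."""
    tag, attrs = node
    return {("tag", tag)} | {("@" + k, v) for k, v in attrs.items()}

def to_xpath(preds):
    """Render a predicate set as an XPath of the form //tag[@attr="value"]."""
    tag = next((v for k, v in preds if k == "tag"), "*")
    conds = "".join(f'[{k}="{v}"]' for k, v in sorted(preds) if k != "tag")
    return f"//{tag}{conds}"

def learn_xpath(nodes, annotated):
    """Most general XPath matching every annotated node and no unannotated node."""
    positives = [nodes[i] for i in annotated]
    negatives = [n for i, n in enumerate(nodes) if i not in annotated]
    # Only predicates shared by all annotated nodes can appear in the rule.
    shared = set.intersection(*(features(n) for n in positives))
    for size in range(len(shared) + 1):  # fewer predicates = more general
        for preds in combinations(sorted(shared), size):
            if not any(set(preds) <= features(n) for n in negatives):
                return to_xpath(preds)
    return None

# Hypothetical page: only node 1 (the phone number) is annotated.
nodes = [
    ("h1", {"class": "name"}),
    ("span", {"class": "tel"}),
    ("span", {"class": "addr"}),
    ("div", {"class": "tel"}),
]
print(learn_xpath(nodes, {1}))
```

With the `div` node removed, the class predicate alone separates the annotated node, so the search stops one level earlier.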

Detecting site changes
• During Learn (Day 0): for each cluster, store the page signature and extracted record for a small number of pages
• Monitoring: crawl the pages daily and compare page signatures and extracted records
  – Day n: signature & record match
  – Day m: signature/record mismatch
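
The monitoring check can be sketched as a comparison between the Day 0 snapshot and the current crawl. The data layout, the function name, and the 0.8 agreement threshold are assumptions made for illustration; the slides only state that signatures and extracted records are compared.

```python
def detect_changes(stored, current, min_signature_match=0.8):
    """Return the sample pages whose signature or extracted record no longer matches.

    `stored` and `current` map a page URL to (signature, record), where the
    signature is a list of shingles and the record is a dict of attribute values.
    """
    mismatches = []
    for url, (old_sig, old_record) in stored.items():
        new_sig, new_record = current[url]
        agree = sum(a == b for a, b in zip(old_sig, new_sig)) / len(old_sig)
        if agree < min_signature_match or new_record != old_record:
            mismatches.append(url)
    return mismatches

# Hypothetical snapshots for two sample pages of one cluster.
day_0 = {
    "page-1": ([5, 17, 42, 8], {"name": "Chinese Mirch"}),
    "page-2": ([5, 17, 42, 8], {"name": "Tiffin Wallah"}),
}
day_m = {
    "page-1": ([5, 17, 42, 8], {"name": "Chinese Mirch"}),
    "page-2": ([9, 30, 11, 2], {"name": None}),  # layout changed, rule extracts nothing
}
print(detect_changes(day_0, day_m))
```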

Wrapper system deployed in Yahoo!
• 250M extractions from 200 sites (product, business)
• Avg num of clusters per site: 24
• Avg num of pages annotated per cluster: 1.6

[Figure: Average Precision / Recall (%) per attribute: review_count, business_name, product_title, image, phone_number, address, price]
             Limitations of wrappers
• Won’t work across Web sites due to different
  page layouts
• Scaling to thousands of sites can be a challenge
  – Need to learn a separate wrapper for each site
  – Annotating example pages from thousands of sites
    can be time-consuming & expensive
           Holy grail of IE research

• Unsupervised IE: Extract attribute values from
  pages of a new Web site without annotating a
  single page from the site
• OK to annotate pages from a few sites initially to
  create training data

Key observation
• Web sites contain redundant content (that is, pages for the same entity)
[Figure: pages for the same entity on yelp.com and superpages.com]

Content matching approach
• Step 1: Populate seed database from a few initial sites (using wrappers)

Seed DB
Name             Address
Chinese Mirrch   120 Lexington Ave, New York, NY 10016
Tiffin Wallah    127 E 28th St, New York, NY 10079

Content matching approach
• Step 2: Match values in page with seed record values
[Figure: new site Web page matched against the seed DB]

Seed DB
Name             Address
Chinese Mirrch   120 Lexington Ave, New York, NY 10016
Tiffin Wallah    127 E 28th St, New York, NY 10079

Content matching approach
• Step 3: Use matched values to extract records, expand seed database
[Figure: new site Web pages → Wrappers → Seed DB]

Seed DB
Name             Address
Chinese Mirrch   120 Lexington Ave, New York, NY 10016
Tiffin Wallah    127 E 28th St, New York, NY 10079
21 Club          21 W 52nd St, New York, NY 10019        (new record)

Key challenge 1
• Diverse attribute value representations (impacts recall)
  – Spelling error (name)
  – Variant (address)

Name             Address
Chinese Mirrch   120 Lexington Ave, New York, NY 10016
Tiffin Wallah    127 E 28th St, New York, NY 10079

Key challenge 2
• Noisy attribute value matches (impacts precision)

Name             Address
Chinese Mirrch   120 Lexington Ave, New York, NY 10016
Tiffin Wallah    127 E 28th St, New York, NY 10079        (noisy match)

Baseline similarity measure
• Use q-grams to handle spelling errors

String           3-grams
chinese mirch    { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irc, rch }
chinese mirrch   { chi, hin, ine, nes, ese, se#, e#m, #mi, mir, irr, rrc, rch }

• Weight of a q-gram (attribute-specific) = sum of the IDFs of the words it appears in
• Weak similarity (WS) = cosine similarity between IDF-weighted q-grams
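
A minimal sketch of the weak similarity follows. Two points are assumptions rather than details from the slides: the smoothed IDF formula, and the reading that a q-gram "appears in" every word it overlaps, so a q-gram spanning the `#` separator is credited with both neighbouring words. The IDF table is built from the values of one attribute, matching the "attribute-specific" note.

```python
import math
from collections import Counter

def idf_table(values):
    """Smoothed IDF over the words of one attribute's values (formula is an assumption)."""
    n = len(values)
    df = Counter(w for v in values for w in set(v.lower().split()))
    return lambda word: math.log((1 + n) / (1 + df[word])) + 1.0

def weighted_qgrams(value, idf, q=3):
    """Map each q-gram to the summed IDF of the words it overlaps; '#' joins words."""
    words = value.lower().split()
    if not words:
        return Counter()
    text = "#".join(words)
    owner = []  # owner[i] = index of the word holding character i (None for '#')
    for i, word in enumerate(words):
        owner += [i] * len(word) + [None]
    owner.pop()
    weights = Counter()
    for start in range(len(text) - q + 1):
        touched = {o for o in owner[start:start + q] if o is not None}
        weights[text[start:start + q]] += sum(idf(words[o]) for o in touched)
    return weights

def weak_similarity(a, b, idf, q=3):
    """Cosine similarity between the IDF-weighted q-gram vectors of two strings."""
    wa, wb = weighted_qgrams(a, idf, q), weighted_qgrams(b, idf, q)
    dot = sum(wa[g] * wb[g] for g in wa if g in wb)
    norm = math.sqrt(sum(x * x for x in wa.values())) * math.sqrt(sum(x * x for x in wb.values()))
    return dot / norm if norm else 0.0

idf = idf_table(["Chinese Mirrch", "Tiffin Wallah", "21 Club"])
print(sorted(weighted_qgrams("chinese mirch", idf)))
print(weak_similarity("chinese mirch", "chinese mirrch", idf))
print(weak_similarity("chinese mirch", "tiffin wallah", idf))
```

The misspelled name keeps most of its 3-grams, so it still scores high against the correct spelling, while an unrelated name shares none.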

Strong similarity

Address (Seed DB)                          Address (Web site)                                                 WS
120 Lexington Avenue, New York, NY 10016   120 Lexington Ave (between 28th and 29th St), New York, NY 10016   0.53
312 W 34th Street, New York, NY 10001      312 W 34th St (between 8th and 9th Ave), New York, NY 10001        0.49

The "(between … and …)" portion of the Web site addresses is templatized content.

Strong similarity is defined between two sets of strings
1. Calculate the matching pattern between weakly similar pairs in the two sets
2. Pick matching patterns with sufficient “support”
3. Use only portions selected by the matching pattern in the final similarity calculation

Computing matching pattern

Seed value:       120 Lexington Avenue New York NY 10016
                  segment labels: s1 1, s2 0, s3 3
Web site value:   120 Lexington Ave (Between 28th And 29th St) New York NY 10016
                  segment labels: s'1 1, s'2 0, s'3 3
Matching pattern: 103 103
[Figure: matched word pairs drawn as edges with weight 1]

1. Perform max-weight bipartite matching to find matching words
    • Edge weight = Jaccard similarity over 3-grams
2. Form segments by grouping contiguous matching words
3. Assign each segment si a label
    • 0 if non-matching
    • j if matching segment s'j
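
Steps 1 and 2 can be sketched as follows. Note the substitution: instead of true max-weight bipartite matching, this sketch picks the heaviest edges greedily, which is simpler but not guaranteed to be optimal. The tokenizer, the 0.2 minimum edge weight, and the handling of words shorter than three characters are also assumptions. Label assignment (step 3) is not covered.

```python
import re
from itertools import groupby

def trigrams(word):
    """Character 3-grams of a word; words shorter than 3 characters yield themselves."""
    return {word[i:i + 3] for i in range(len(word) - 2)} or {word}

def jaccard(a, b):
    """Edge weight between two words: Jaccard similarity over their 3-grams."""
    ga, gb = trigrams(a), trigrams(b)
    return len(ga & gb) / len(ga | gb)

def tokenize(value):
    return re.findall(r"[a-z0-9]+", value.lower())

def match_words(seed_words, page_words, min_weight=0.2):
    """Greedy stand-in for max-weight bipartite matching: heaviest edges first."""
    edges = sorted(
        ((jaccard(s, p), i, j)
         for i, s in enumerate(seed_words)
         for j, p in enumerate(page_words)),
        reverse=True,
    )
    used_seed, pairs = set(), {}
    for weight, i, j in edges:
        if weight < min_weight:
            break
        if i not in used_seed and j not in pairs:
            used_seed.add(i)
            pairs[j] = i  # page word j is matched to seed word i
    return pairs

def segments(page_words, pairs):
    """Group contiguous matching / non-matching words of the page value."""
    flagged = [(j in pairs, w) for j, w in enumerate(page_words)]
    return [(matched, [w for _, w in run])
            for matched, run in groupby(flagged, key=lambda t: t[0])]

seed = tokenize("120 Lexington Avenue New York NY 10016")
page = tokenize("120 Lexington Ave (Between 28th And 29th St) New York NY 10016")
for matched, words in segments(page, match_words(seed, page)):
    print(matched, " ".join(words))
```

On the slide's example this splits the Web site value into a matching street segment, a non-matching "between …" segment, and a matching city segment, in line with the three segments s'1, s'2, s'3.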

Strong similarity score computation

Address (Seed)                             Address (Web site)                                                 Matching pattern   SS   Matching segments
120 Lexington Avenue, New York, NY 10016   120 Lexington Ave (between 28th and 29th St), New York, NY 10016   103 103            1    120 Lexington; New York, NY 10016
312 W 34th Street, New York, NY 10001      312 W 34th St (between 8th and 9th Ave), New York, NY 10001        103 103            1    312 W 34th; New York, NY 10001

Strong similarity: similarity between matching segments of values
Support of matching pattern: # distinct matching segments
Support(103 103) = 2
Strong similarity is computed only for patterns with sufficient support

Need for support of a matching pattern

Address (Seed)                             Address (Web site)                     Matching pattern   SS     Matching segments
120 Lexington Avenue, New York, NY 10016   1075 Fifth Ave, New York, NY 10128     010 010            0.35   New York, NY
312 W 34th Street, New York, NY 10001      1167 Madison Ave, New York, NY 10128   010 010            0.32   New York, NY

Support(010 010) = 1
Hence Strong Similarity = Weak Similarity
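
The role of support can be sketched with the four address pairs from the two tables above. The minimum support of 2 is an assumption chosen to agree with the examples, where support 2 is accepted and support 1 is not. The segment-level score of 1.0 for the "New York, NY" pairs is also assumed, since the slides only show their final score.

```python
from collections import defaultdict

def pattern_support(pairs):
    """Support of a matching pattern = number of distinct matching segments seen with it."""
    seen = defaultdict(set)
    for pair in pairs:
        seen[pair["pattern"]].add(pair["segments"])
    return {pattern: len(segs) for pattern, segs in seen.items()}

def final_similarity(pair, support, min_support=2):
    """Use the segment-level score only when the pattern is well supported."""
    if support[pair["pattern"]] >= min_support:
        return pair["segment_score"]
    return pair["weak"]  # fall back to weak similarity

pairs = [
    {"pattern": "103 103", "segments": ("120 Lexington", "New York, NY 10016"),
     "weak": 0.53, "segment_score": 1.0},
    {"pattern": "103 103", "segments": ("312 W 34th", "New York, NY 10001"),
     "weak": 0.49, "segment_score": 1.0},
    {"pattern": "010 010", "segments": ("New York, NY",),
     "weak": 0.35, "segment_score": 1.0},  # segment_score assumed
    {"pattern": "010 010", "segments": ("New York, NY",),
     "weak": 0.32, "segment_score": 1.0},  # segment_score assumed
]
support = pattern_support(pairs)
print(support)
print([final_similarity(p, support) for p in pairs])
```

The pattern 010 010 always selects the same segment, so it looks like a coincidental match on the city name rather than a template, and its pairs keep their weak similarity.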

Pruning noisy matches

Name             Address
Chinese Mirrch   120 Lexington Ave, New York, NY 10016
Tiffin Wallah    127 E 28th St, New York, NY 10079

[Figure: value combinations on the page marked ✓ (kept) or ✗ (pruned)]
 • Match combinations of values in page
 • Prune combinations that don’t match attribute values in any seed record

Apriori-style enumeration
[Figure: two pages with value positions X1, X2, X3]

Round 1:
  <Name, X1>              (sup=2)
  <Addr, X2>              (sup=2)
  <Name, X3>              (sup=2)
Round 2:
  <Name, X1> <Addr, X2>   (sup=2)
  <Name, X3> <Addr, X2>   (sup=0)

• Prune attribute position combinations with low support
     – support = # pages in which values at positions match attribute values in a seed record
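
The two rounds above can be reproduced with a small level-wise enumeration. The sketch makes several simplifications: values are compared by exact equality instead of the similarity measures, candidates are formed by extending a frequent combination with one frequent single pair (without the full Apriori subset check), and the page contents are hypothetical, with X3 holding the name of a different restaurant.

```python
def support(pages, combo, seed):
    """Number of pages whose values at the combo's positions all match one seed record."""
    return sum(
        1 for page in pages
        if any(all(page.get(pos) == rec[attr] for attr, pos in combo) for rec in seed)
    )

def apriori_enumerate(pages, seed, min_support=2):
    """Level-wise search for <attribute, position> combinations with enough support."""
    attrs = sorted(seed[0])
    positions = sorted({pos for page in pages for pos in page})
    level = {((a, x),): support(pages, ((a, x),), seed) for a in attrs for x in positions}
    level = {c: s for c, s in level.items() if s >= min_support}
    singles = list(level)
    frequent = {}
    while level:
        frequent.update(level)
        candidates = set()
        for combo in level:
            used_attrs = {a for a, _ in combo}
            used_pos = {x for _, x in combo}
            for (pair,) in singles:
                if pair[0] not in used_attrs and pair[1] not in used_pos:
                    candidates.add(tuple(sorted(combo + (pair,))))
        level = {c: support(pages, c, seed) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_support}
    return frequent

SEED = [
    {"name": "Chinese Mirch", "addr": "120 Lexington Ave"},
    {"name": "Tiffin Wallah", "addr": "127 E 28th St"},
]
PAGES = [  # hypothetical pages; X3 is a "nearby restaurant" style slot
    {"X1": "Chinese Mirch", "X2": "120 Lexington Ave", "X3": "Tiffin Wallah"},
    {"X1": "Tiffin Wallah", "X2": "127 E 28th St", "X3": "Chinese Mirch"},
]
for combo, sup in apriori_enumerate(PAGES, SEED).items():
    print(combo, sup)
```

Position X3 survives round 1 because its values are valid names, but it is pruned in round 2 because name and address never come from the same seed record.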

Experimental results

Datasets
Attributes    Restaurant       Bibliography
              Name (core)      Title (core)
              Address (core)   Author (core)
              Phone            Source
              Payment
              Cuisine

Strong vs Weak similarity
• Extraction precision of WS and SS is comparable; precision increases with threshold
• Coverage of SS is steady with respect to threshold; coverage of WS drops at high thresholds

Strong similarity scores
• SS boosts the similarity scores of TPs over a range of WS scores without boosting that of FPs

String 1                                   String 2                                WS     SS
980 n michigan ave 14th floor chicago il   980 n michigan ave chicago il 60611     0.57   1
1100 e north ave west chicago il 60185     300 w north ave west chicago il 60185   0.74   0.74

[Figures: Extraction Precision; Coverage (%) vs. seed data size (Restaurant), for seed sizes 2000, 5000, 10000, 20000, 40000]

Summary
• The Web is a vast repository of human knowledge
• Building (structured) knowledge bases can improve search and help users find relevant information
• Key challenge: unsupervised information extraction from Web pages
• Content redundancy on the Web can be used for unsupervised extraction with high precision
• Future work
   – Handling numeric attributes, browse pages
   – Detecting and integrating records for the same entity

				