

     Extracting Key Terms From Noisy and Multi-theme Documents

Maria Grineva, Maxim Grinev and Dmitry Lizorkin
      Institute for System Programming of RAS
1. Key terms extraction: traditional approaches and
2. Using Wikipedia as a knowledge base for Natural
   Language Processing
3. Main techniques of our approach:
   • Wikipedia-based semantic relatedness
   • Network analysis algorithm to detect community
     structure in networks
4. Our method
5. Experimental evaluation
                   Key Terms Extraction
• Basic step for various NLP tasks:
   –   document classification
   –   document clustering
   –   text summarization
   –   inferring a more general topic of a text document

• Core task of Internet content-based advertising
  systems, such as Google AdSense and Yahoo! Contextual
   – Web pages are typically noisy (side bars/menus, comments,
     announcements of upcoming content, etc.)
   – Dealing with multi-theme Web pages (portal home pages,
      Approaches to Key Terms Extraction
• Based on statistical learning:
   – use for example: frequency criterion (TFxIDF model),
     keyphrase-frequency, distance between terms normalized by
     the number of words in the document (KEA)
   – compute statistical features over Wikipedia corpus (Wikify! )
   – require training set

• Based on analyzing syntactic or semantic term
  relatedness within a document
   – compute semantic relatedness between terms (using, for
     example, Wikipedia)
   – modeling document as a semantic graph of terms and
     applying graph analysis techniques to it (TextRank)
   – no training set required
  Using Wikipedia as a Knowledge Base for
       Natural Language Processing
• Wikipedia – free open encyclopedia
  – Today Wikipedia is the biggest encyclopedia (more
    than 2.7 million articles in English Wikipedia)
  – It is always up-to-date thanks to millions of editors
    over the world
  – Has huge network of cross-references between
    articles, large number of categories, redirect pages,
    disambiguation pages => rich resource for
    bootstrapping NLP and IR tasks
       Basic Techniques of Our Method:
        Semantic Relatedness of Terms

• Semantic relatedness assigns a score for a pair of
  terms that represents the strength of relatedness
  between the terms
• We use Wikipedia to compute the semantic relatedness of terms
• We use semantic relatedness to model document
  as a graph of terms
            Basic Techniques of Our Method:
             Semantic Relatedness of Terms
• Wikipedia-based semantic relatedness for the two terms can
  be computed using:
   – the links found within their corresponding Wikipedia articles
   – Wikipedia categories structure
   – the article’s textual content
• We use the Dice measure for Wikipedia-based semantic relatedness
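
A minimal sketch of a Dice measure over Wikipedia links, assuming each article is represented by the set of articles it links to. The function name and the toy link sets below are hypothetical and only illustrate the formula 2·|A ∩ B| / (|A| + |B|); the exact weighting used in our system may differ.

```python
def dice_relatedness(links_a, links_b):
    """Dice coefficient of two link sets: 2*|A & B| / (|A| + |B|)."""
    a, b = set(links_a), set(links_b)
    if not a and not b:
        return 0.0  # two articles without links share nothing measurable
    return 2.0 * len(a & b) / (len(a) + len(b))


# Hypothetical link sets (illustrative, not real Wikipedia data).
toy_links = {
    "Apple Inc.": {"ITunes", "IPod", "Steve Jobs", "Macintosh"},
    "ITunes": {"Apple Inc.", "IPod", "Steve Jobs", "MP3"},
    "Braille": {"Blindness", "Louis Braille"},
}

related = dice_relatedness(toy_links["Apple Inc."], toy_links["ITunes"])
unrelated = dice_relatedness(toy_links["Apple Inc."], toy_links["Braille"])
print(related, unrelated)
```

The score is symmetric and lies in [0, 1], so it can be used directly as an edge weight in a graph of terms.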
        Basic Techniques of Our Method:
 Detecting Community Structure in Networks
• We discover terms communities in a document graph
• Community – densely interconnected group of nodes in a
• Girvan-Newman algorithm for detecting community
  structure in networks:
 • betweenness – how much an edge lies
   “in between” different communities
 • modularity - partition is a good one,
   if there are many edges within
   communities and only a few
   between them
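
The modularity criterion above can be sketched as follows for an unweighted, undirected graph. This is an illustration of the standard Newman-Girvan modularity Q, not our implementation; the function name and the toy graph (two triangles joined by one bridge edge) are assumptions made for the example.

```python
from collections import defaultdict


def modularity(edges, communities):
    """Q = sum over communities of (inner_edges/m - (degree_sum/(2m))^2)."""
    m = len(edges)
    if m == 0:
        return 0.0
    # Map every node to the index of its community.
    node_to_comm = {node: idx
                    for idx, comm in enumerate(communities)
                    for node in comm}
    inner = defaultdict(int)    # edges with both ends inside a community
    degree = defaultdict(int)   # sum of node degrees per community
    for u, v in edges:
        cu, cv = node_to_comm[u], node_to_comm[v]
        degree[cu] += 1
        degree[cv] += 1
        if cu == cv:
            inner[cu] += 1
    return sum(inner[c] / m - (degree[c] / (2 * m)) ** 2
               for c in range(len(communities)))


# Toy graph: two triangles connected by the single bridge (c, d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("d", "e"), ("e", "f"), ("d", "f"),
             ("c", "d")]

q_split = modularity(toy_edges, [{"a", "b", "c"}, {"d", "e", "f"}])
q_whole = modularity(toy_edges, [{"a", "b", "c", "d", "e", "f"}])
print(q_split, q_whole)
```

Splitting at the bridge scores higher than keeping everything in one community, which matches the intuition of many edges within communities and few between them.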
                   Our Method

1. Candidate terms extraction
2. Word sense disambiguation
3. Building semantic graph
4. Discovering community structure of the semantic graph
5. Selecting valuable communities
                        Our Method:
           Candidate Terms Extraction
• Goal: extract all terms from the document and for
  each term prepare a set of Wikipedia articles that can
  describe its meaning
• Parse the input document and extract all possible n-grams
• For each n-gram (+ its morphological variations)
  provide a set of Wikipedia article titles
   – “drinks”, “drinking”, “drink” => [Wikipedia:] Drink; Drinking
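
A rough sketch of this step, assuming a simple lowercase tokenizer and a lookup table from n-grams (and their morphological variations) to Wikipedia article titles. The names extract_ngrams, candidate_articles and TOY_TITLE_INDEX are hypothetical; a real index would be built from Wikipedia titles and redirects.

```python
import re


def extract_ngrams(text, max_n=3):
    """Return all word n-grams of length 1..max_n, shortest first."""
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


# Hypothetical index: surface form -> candidate Wikipedia article titles.
TOY_TITLE_INDEX = {
    "drink": ["Drink", "Drinking"],
    "drinks": ["Drink", "Drinking"],
    "drinking": ["Drink", "Drinking"],
}


def candidate_articles(ngram, index):
    """Look up the candidate articles for one n-gram (empty if unknown)."""
    return index.get(ngram, [])


grams = extract_ngrams("Apple makes iTunes", max_n=2)
candidates = candidate_articles("drinks", TOY_TITLE_INDEX)
print(grams, candidates)
```

N-grams with no candidate articles can simply be dropped before the disambiguation step.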
                            Our Method:
               Word Sense Disambiguation
  • Goal: choose the most appropriate Wikipedia article from the set of
    candidate articles for each ambiguous term extracted in the previous step
  • Use of Wikipedia disambiguation and redirect pages to obtain
    candidate meanings of ambiguous terms

Denis Turdakov, Pavel Velikhov
   “Semantic Relatedness Metric for Wikipedia Concepts Based on
   Link Analysis and its Application to Word Sense Disambiguation”
SYRCoDIS, 2008
                       Our Method:
             Building Semantic Graph
• Goal: build a semantic graph of the document using semantic
  relatedness between terms

             Semantic graph built from a news article
       "Apple to Make ITunes More Accessible For the Blind"
                   Our Method:
Detecting Community Structure of the Semantic Graph
                      Our Method:
         Selecting Valuable Communities
• Goal: rank term communities in a way that:
   – the highest ranked communities contain key terms
   – the lowest ranked communities contain unimportant terms
     and possible disambiguation mistakes
• Use:
   – density of community – sum of inner edges of community
     divided by the number of vertices in this community
   – informativeness – sum of keyphraseness measure
     (Wikipedia-based TFxIDF analogue) of community terms
• Community rank: density*informativeness
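
The ranking above can be sketched as follows, assuming that "sum of inner edges" means the sum of the relatedness weights of edges whose both endpoints lie in the community. The function name, the edge weights and the keyphraseness values are hypothetical illustration data.

```python
def community_rank(members, edge_weights, keyphraseness):
    """rank = density * informativeness, following the definitions above."""
    members = set(members)
    if not members:
        return 0.0
    # Assumption: inner edges are summed by their relatedness weight.
    inner = sum(w for (u, v), w in edge_weights.items()
                if u in members and v in members)
    density = inner / len(members)
    informativeness = sum(keyphraseness.get(t, 0.0) for t in members)
    return density * informativeness


# Hypothetical semantic graph and keyphraseness scores.
toy_weights = {
    ("Apple Inc.", "ITunes"): 0.5,
    ("ITunes", "IPod"): 0.5,
    ("Menu", "Login"): 0.25,
}
toy_keyphraseness = {
    "Apple Inc.": 0.5, "ITunes": 0.25, "IPod": 0.25,
    "Menu": 0.0625, "Login": 0.0625,
}

topic_rank = community_rank({"Apple Inc.", "ITunes", "IPod"},
                            toy_weights, toy_keyphraseness)
noise_rank = community_rank({"Menu", "Login"},
                            toy_weights, toy_keyphraseness)
print(topic_rank, noise_rank)
```

In this toy data the navigation-like community ("Menu", "Login") ranks far below the topical one, which is the behavior the ranking is meant to produce.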
                      Our Method:
           Selecting Valuable Communities
• In 73% of Web pages, a decline in community scores
  separates key-term communities from unimportant ones
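
One simple way to locate such a decline is to cut the ranked list at the largest drop between consecutive scores. This is only an assumed heuristic for illustration; the slide does not specify how the decline is detected, and the function name is hypothetical.

```python
def split_at_largest_drop(scores):
    """Sort scores descending and split them at the biggest consecutive gap."""
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return ranked, []
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    cut = drops.index(max(drops)) + 1  # first index of the low-score group
    return ranked[:cut], ranked[cut:]


key_scores, other_scores = split_at_largest_drop([0.9, 0.1, 0.8, 0.05])
print(key_scores, other_scores)
```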
             Advantages of the Method
• No training. Instead of training the system with hand-
  created examples, we use semantic information derived
  from Wikipedia
• Noise and multi-theme stability. Good at filtering out
  noise and discovering topics in Web pages
• Thematically grouped key terms. Significantly improves
  subsequent inference of document topics using, for example,
  spreading activation over the Wikipedia categories graph
• High accuracy. Evaluated using human judgments
  (further in this presentation)
Experimental Evaluation on Noise-free dataset

• Classical – TFxIDF, Yahoo! Terms Extractor
• Wikipedia-based – Wikify!, TextRank
• Evaluation on noise-free dataset (blog posts) using human
     Experimental Evaluation on Web Pages
• Performance of our method on different kinds of Web pages

• Comparison to other methods
    Experimental Evaluation on Web Pages
• Multi-theme stability evaluated on compound Web
  pages (popular news sites, portal home pages, etc.)
                      Thank You!
                     Any Questions?