Web Page Classification


                         Features and Algorithms

                  Presentation by: Jason Bender
 Introduction to Classification
 Background
   Classification Types
   Classification Methods
 Applications
 Features
 Algorithms
 Evolution of Websites
What is web page classification?
 The process of assigning a web page to
  one or more predefined category labels
  (ex: news, sports, business…)

 Classification is generally posed as a
  supervised learning problem
   A set of labeled data is used to train a
    classifier, which is then applied to label future
    examples
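
The train-then-label workflow above can be sketched as follows. This is a minimal illustration only: it assumes a toy multinomial naive Bayes over whitespace tokens (one of many possible classifiers, not one the slides prescribe), and the training pages, labels, and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(labeled_pages):
    """labeled_pages: list of (text, label) pairs.
    Returns per-label word counts and per-label page counts."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_pages:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, model):
    """Return the label with the highest smoothed log-probability."""
    word_counts, label_counts = model
    vocab = {w for counts in word_counts.values() for w in counts}
    total_pages = sum(label_counts.values())
    best_label, best_score = None, float("-inf")
    for label in sorted(label_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_pages)  # class prior
        for word in text.lower().split():
            # Add-one smoothing so unseen words do not zero out a class.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical labeled data used to train the classifier.
training = [
    ("team wins final match score", "sports"),
    ("coach praises team after match", "sports"),
    ("stocks fall as market closes", "business"),
    ("company reports quarterly profit market", "business"),
]
model = train(training)
print(classify("team match tonight", model))
```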
Background - Classification Types
 Supervised learning problem broken into
  sub problems:
   Subject Classification
   Functional Classification
   Sentiment Classification
   Other types of Classification
Subject Classification
 Concerned with subject or topic of the
  web page
   Judging whether a page is about arts,
    business, sports, etc…

Functional Classification
 Role that the page is playing
   Deciding a page to be a personal homepage,
    course page, admissions page, etc…
Sentiment Classification
 Focuses on the opinion that is presented
  in a web page

Other types of Classification
 Such as genre classification and search
  engine spam classification
Background - Classification Methods
 Binary vs. Multiclass
 Single Label vs. Multi Label
 Soft vs. Hard
 Flat vs. Hierarchical
Binary vs. Multiclass Classification
Single-Label vs. Multi-Label
Soft vs. Hard Classification
Flat vs. Hierarchical Classification
 Why is classification important and how
  can we use it efficiently?
Constructing, maintaining, or
expanding web directories
 Web directories provide an efficient way to
  browse for information within a predefined
  set of categories

 Example:
   Open Directory Project

 Currently constructed by human effort
   78,940 editors of ODP
Improving the quality of search
 Big problem with search results is
  search ambiguity
Helping question answering
 Can use classification systems to help
  improve the quality of answers
 Example: Wolfram Alpha

Other applications
 Contextual advertising
 What features can we extract from a
  web page to use to help classify it?
Features - Introduction
 Because of features such as the hyperlink
  <a> … </a>, webpage classification is vastly
  different from other forms of classification
  such as plaintext classification.

 Features organized into two groups:
    ○ On-page features – directly located on page
    ○ Neighbor features – found on related pages
On Page Features
 Textual Contents & Tags
   Bag-of-words
    ○ N-gram feature
       Rather than analyzing individual words, group them into
        sequences of n consecutive words.
        - Ex: New York vs. new ….. ….. York
       Yahoo! has used a 5-gram feature

   HTML tags – title, heading, metadata, main text

   URL
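
The bag-of-words and n-gram features above can be sketched as follows. This is a minimal illustration that assumes simple lowercasing and whitespace tokenization; the function names `ngrams` and `bag_of_ngrams` are hypothetical.

```python
from collections import Counter

def ngrams(text, n):
    """Return every run of n consecutive tokens, joined by spaces."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, max_n=2):
    """Count all 1-grams up to max_n-grams in the text."""
    bag = Counter()
    for n in range(1, max_n + 1):
        bag.update(ngrams(text, n))
    return bag

bag = bag_of_ngrams("New York is new")
print(bag["new"], bag["new york"], bag["york new"])  # 2 1 0
```

With 2-grams, "new york" becomes a feature of its own, while the unrelated later occurrence of "new" does not form one with "york".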
On Page Features
 Visual Analysis
   Each page has two representations
    ○ Text via HTML
    ○ Visual via the browser
   Each page can be represented as a visual
    adjacency multigraph
Features of Neighbors
 What happens when a page’s features
  are missing or are unrecognizable?
Features of Neighbors
 Assumptions
   If page1 is in the neighborhood of many
    “sports” pages, then there is an increased
    probability that page1 is also a “sports” page.
   Linked pages are more likely to have terms
    in common
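
The first assumption can be turned into a simple feature: the distribution of labels among a page's direct neighbors. The sketch below assumes links are treated as undirected and that some neighbors already carry labels; the function name and data layout are hypothetical.

```python
from collections import Counter

def neighbor_label_distribution(page, links, labels):
    """links: dict mapping a page to the set of pages it links to.
    labels: dict mapping already-labeled pages to their category.
    Returns the fraction of labeled neighbors in each category."""
    neighbors = set(links.get(page, ()))                                # outgoing links
    neighbors |= {src for src, dsts in links.items() if page in dsts}  # incoming links
    neighbors.discard(page)
    votes = Counter(labels[n] for n in neighbors if n in labels)
    total = sum(votes.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in votes.items()}

links = {"page1": {"a", "b"}, "c": {"page1"}, "d": {"page1"}}
labels = {"a": "sports", "b": "sports", "c": "sports", "d": "news"}
print(neighbor_label_distribution("page1", links, labels))
```

Here three of the four labeled neighbors are "sports" pages, so the feature suggests page1 is more likely a "sports" page as well.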
Features of Neighbors
 Neighbor Selection
   Focus on pages within 2 steps of target
   6 types: parent, child, sibling, spouse,
    grandparent, and grandchild
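
The six neighbor types can be computed from a link graph as sketched below. The slide names the types without defining them, so the definitions used here are an assumption based on common usage: a parent links to the target, a child is linked from it, siblings share a parent, spouses share a child, and grandparents and grandchildren are two steps away in one direction.

```python
def neighbor_sets(target, out_links):
    """out_links: dict mapping a page to the set of pages it links to."""
    def children(page):
        return set(out_links.get(page, ()))

    def parents(page):
        return {src for src, dsts in out_links.items() if page in dsts}

    def expand(pages, step):
        # Apply one more link step to every page, excluding the target itself.
        result = set()
        for page in pages:
            result |= step(page)
        result.discard(target)
        return result

    parent = parents(target) - {target}
    child = children(target) - {target}
    return {
        "parent": parent,
        "child": child,
        "sibling": expand(parent, children),      # other children of my parents
        "spouse": expand(child, parents),         # other parents of my children
        "grandparent": expand(parent, parents),
        "grandchild": expand(child, children),
    }

# Hypothetical graph: G -> P -> {T, S}, T -> C, X -> C, C -> GC
graph = {"G": {"P"}, "P": {"T", "S"}, "T": {"C"}, "X": {"C"}, "C": {"GC"}}
print(neighbor_sets("T", graph))
```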
Features of Neighbors
 Labels
 Anchor Text
 Surrounding Anchor Text

 By using the anchor text, surrounding
  text, and page title of a parent page in
  combination with text from target page,
  classification can be improved.
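
Extracting the anchor text and title from a parent page can be sketched with Python's standard `html.parser` module. The class name, function name, and example URLs are hypothetical, and the sketch leaves out the surrounding text for brevity.

```python
from html.parser import HTMLParser

class AnchorTextParser(HTMLParser):
    """Collects the page title and the text of links pointing at target_url."""

    def __init__(self, target_url):
        super().__init__()
        self.target_url = target_url
        self.in_target_link = False
        self.in_title = False
        self.anchor_text = []
        self.title = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and dict(attrs).get("href") == self.target_url:
            self.in_target_link = True
        elif tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_target_link = False
        elif tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_target_link:
            self.anchor_text.append(data.strip())
        elif self.in_title:
            self.title.append(data.strip())

def parent_features(parent_html, target_url):
    parser = AnchorTextParser(target_url)
    parser.feed(parent_html)
    parser.close()
    return {
        "anchor_text": " ".join(t for t in parser.anchor_text if t),
        "parent_title": " ".join(t for t in parser.title if t),
    }

parent_html = (
    "<html><head><title>Sports News</title></head><body>"
    '<p>Read the <a href="http://example.com/t">latest football scores</a> here.</p>'
    '<a href="http://example.com/o">other</a></body></html>'
)
print(parent_features(parent_html, "http://example.com/t"))
```

The resulting strings can then be appended to the target page's own text before building bag-of-words features.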
Features of Neighbors
 Implicit Links
   Connections between pages that appear in
    the results of the same query and are both
    clicked by users
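
Implicit links can be derived from a search click log as sketched below. The log format of (query, clicked URL) pairs and the function name are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def implicit_links(click_log):
    """click_log: iterable of (query, clicked_url) pairs.
    Returns a dict mapping each URL pair to the number of queries
    for which both URLs were clicked."""
    clicked_by_query = defaultdict(set)
    for query, url in click_log:
        clicked_by_query[query].add(url)

    links = defaultdict(int)
    for urls in clicked_by_query.values():
        # Every pair of pages clicked for the same query is implicitly linked.
        for a, b in combinations(sorted(urls), 2):
            links[(a, b)] += 1
    return dict(links)

log = [
    ("jaguar", "a.com"), ("jaguar", "b.com"),
    ("jaguar speed", "a.com"), ("jaguar speed", "b.com"), ("jaguar speed", "c.com"),
    ("python", "d.com"),
]
print(implicit_links(log))
```

The count gives a rough link strength: pages clicked together for many queries are more strongly connected than pages that co-occur once.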
 What are the algorithmic approaches to
  webpage classification?
   Dimension reduction
   Relational learning
   Hierarchical classification
   Information combination
Dimension Reduction
 Boost classification by emphasizing
  certain features that are more useful in
  classification
   Feature Weighting
    ○ Reduces the dimensions of feature space
    ○ Reduces computational complexity
    ○ Classification more accurate as a result of
      reduced space
Dimension Reduction
 Methods
   Use first fragment
   K-nearest neighbor algorithm
   ○ Weighted features
   ○ Weighted HTML Tags
   ○ Metrics
      Expected mutual information
      Mutual information
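
The two metrics can be sketched as feature-scoring functions. This assumes the usual text-categorization definitions, which the slide does not spell out: mutual information is the pointwise score log2(P(t,c) / (P(t)P(c))) for a term t and category c, and expected mutual information averages that score over term presence/absence and all categories. The function names and toy documents are hypothetical.

```python
import math

def pointwise_mi(docs, term, label):
    """docs: list of (set_of_words, label). Pointwise MI in bits."""
    n = len(docs)
    n_t = sum(1 for words, _ in docs if term in words)
    n_c = sum(1 for _, lab in docs if lab == label)
    n_tc = sum(1 for words, lab in docs if term in words and lab == label)
    if n_tc == 0:
        return float("-inf")  # term never occurs with this label
    return math.log2((n_tc / n) / ((n_t / n) * (n_c / n)))

def expected_mi(docs, term):
    """Expected MI between term presence/absence and the label, in bits."""
    n = len(docs)
    labels = {lab for _, lab in docs}
    total = 0.0
    for present in (True, False):
        n_t = sum(1 for words, _ in docs if (term in words) == present)
        for label in labels:
            n_c = sum(1 for _, lab in docs if lab == label)
            n_tc = sum(
                1 for words, lab in docs
                if (term in words) == present and lab == label
            )
            if n_tc:
                total += (n_tc / n) * math.log2((n_tc * n) / (n_t * n_c))
    return total

docs = [
    ({"goal", "team"}, "sports"),
    ({"team", "coach"}, "sports"),
    ({"market", "team"}, "business"),
    ({"market", "profit"}, "business"),
]
# Keep the highest-scoring terms and drop the rest to shrink the feature space.
ranked = sorted(["team", "market"], key=lambda t: expected_mi(docs, t), reverse=True)
print(ranked)
```

In this toy set "market" perfectly separates the two categories, so it ranks above "team", which appears in both.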
Relational Learning
 Relaxation Labeling
Hierarchical Classification
 Based on “divide and conquer”
   Classification problems split into hierarchical
    set of sub problems.
 Error Minimization
   When a lower-level category is uncertain
    whether a page belongs to it, shift the
    assignment one level up.
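
The divide-and-conquer descent with the back-off rule can be sketched as follows. The category tree, the keyword-based `score` function, and the `threshold` value are all hypothetical stand-ins for real per-level classifiers.

```python
def classify_hierarchically(page, tree, score, threshold=0.6):
    """tree: dict mapping a category to its list of child categories.
    score(page, category): confidence in [0, 1] that page belongs there.
    Descends from "root" and stops as soon as the best child is uncertain."""
    current = "root"
    while tree.get(current):
        best = max(tree[current], key=lambda c: score(page, c))
        if score(page, best) < threshold:
            break  # uncertain: keep the assignment one level up
        current = best
    return current

tree = {"root": ["sports", "business"], "sports": ["football", "tennis"]}
keywords = {
    "sports": {"team", "match"},
    "business": {"market"},
    "football": {"goal"},
    "tennis": {"racket"},
}

def score(page, category):
    # Toy confidence: 1.0 if any category keyword occurs in the page, else 0.0.
    return 1.0 if keywords[category] & set(page.split()) else 0.0

print(classify_hierarchically("team match goal", tree, score))     # football
print(classify_hierarchically("team match tonight", tree, score))  # sports
```

The second page is clearly about sports, but neither subcategory is confident, so it stays at the "sports" level instead of being forced into a leaf.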
Information Combination
 Combine several methods into one
   Information from different sources is used
    to train multiple classifiers, and the collective
    work of those classifiers makes the final
    decision
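
One simple way to combine classifiers trained on different sources is majority voting, sketched below. Voting is only one possible combination scheme and is an assumption here; the source names and the toy classifiers are hypothetical.

```python
from collections import Counter

def combine_by_vote(page_views, classifiers):
    """page_views: dict mapping a source name to that source's features.
    classifiers: dict mapping a source name to a callable returning a label.
    Each classifier votes on its own view; the most common label wins."""
    votes = Counter(
        classifiers[source](view)
        for source, view in page_views.items()
        if source in classifiers
    )
    return votes.most_common(1)[0][0] if votes else None

# Toy classifiers, one per information source.
classifiers = {
    "content": lambda text: "sports" if "team" in text else "business",
    "anchor": lambda text: "sports" if "scores" in text else "business",
    "url": lambda text: "sports" if "/sports/" in text else "business",
}
page_views = {
    "content": "the team won the match",
    "anchor": "latest football scores",
    "url": "http://example.com/news/123",
}
print(combine_by_vote(page_views, classifiers))  # sports
```

Two of the three sources vote "sports", so the URL-based classifier is outvoted.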
 Webpage classification is a type of
  supervised learning problem aiming to
  categorize a webpage into a predefined
  set of categories.

 In the future, efforts will most likely be
  focused on effectively combining content
  and link information to build a more
  accurate classifier
Evolution of Websites
 Apple in 1998
 Apple in 2008
 Nike in 2000
 Nike in 2008
 Yahoo in 1996
 Yahoo in 2008
 Microsoft in 1998
 Microsoft in 2008
 MTV in 1998
 MTV in 2008
 Web Page Classification: Features and Algorithms
   by Xiaoguang Qi & Brian D. Davison

 Visual Adjacency Multigraphs – A Novel Approach for a
  Web Page Classification
  by Milos Kovacevic, Michelangelo Diligenti, Marco Gori, and Veljko

 The Evolution of Websites
