Web Mining

Prof. Navneet Goyal
   BITS, Pilani
      Web Mining
§ Web Mining is the use of data mining
    techniques to automatically discover and extract
    information from web documents/services
§   Discovering useful information from the World-
    Wide Web and its usage patterns
§   My Definition: Using data mining techniques to
    make the web more useful and more profitable
    (for some) and to increase the efficiency of our
    interaction with the web
    Web Mining
t   Data Mining Techniques
    t   Association rules
    t   Sequential patterns
    t   Classification
    t   Clustering
    t   Outlier discovery

t   Applications to the Web
    t   E-commerce
    t   Information retrieval (search)
    t   Network management
     Examples of Discovered Patterns
t   Association rules
    t   98% of AOL users also have E-trade accounts

t   Classification
    t   People with age less than 40 and salary > 40k trade on-line

t   Clustering
    t   Users A and B access similar URLs

t   Outlier Detection
    t   User A spends more than twice the average amount of time
        surfing on the Web
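
A figure such as the 98% in the association rule example is what is usually called the confidence of a rule. The sketch below computes support and confidence for a rule X → Y over a handful of user records. The data and the service names are invented for illustration only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    hits = sum(1 for t in transactions if items <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Estimate P(rhs | lhs) as support(lhs and rhs) / support(lhs)."""
    lhs_support = support(transactions, lhs)
    if lhs_support == 0:
        return 0.0
    return support(transactions, set(lhs) | set(rhs)) / lhs_support


# Hypothetical data: the set of services each user subscribes to.
users = [
    {"portal", "broker"},
    {"portal", "broker"},
    {"portal", "broker", "news"},
    {"portal"},
    {"news"},
]

rule_support = support(users, {"portal", "broker"})
rule_confidence = confidence(users, {"portal"}, {"broker"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

Here the rule "portal → broker" holds for 3 of the 4 portal users, so its confidence is 0.75, while its support (3 of 5 users) is 0.6.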
      Web Mining

§ The WWW is a huge, widely distributed, global
  information service centre for
   § Information services: news, advertisements,
    consumer information, financial management,
    education, government, e-commerce, etc.
  § Hyper-link information
  § Access and usage information
§ WWW provides rich sources of data for data mining
         Why Mine the Web?
t   Enormous wealth of information on Web
    t   Financial information (e.g. stock quotes)
    t   Book/CD/Video stores (e.g. Amazon)
    t   Restaurant information (e.g. Zagats)
    t   Car prices (e.g. Carpoint)

t   Lots of data on user access patterns
    t   Web logs contain sequence of URLs accessed by users

t   Possible to mine interesting nuggets of information
    t   People who ski also travel frequently to Europe
    t   Tech stocks have corrections in the summer and rally from
        November until February
        Why is Web Mining Different?
t   The Web is a huge collection of documents, except for
    t   Hyper-link information
    t   Access and usage information

t   The Web is very dynamic
    t   New pages are constantly being generated

t   Challenge: Develop new Web mining algorithms and
    adapt traditional data mining algorithms to
    t   Exploit hyper-links and access patterns
    t   Be incremental
        Web Mining Applications
t   E-commerce (Infrastructure)
    t   Generate user profiles
    t   Targeted advertising
    t   Fraud
    t   Similar image retrieval

t   Information retrieval (Search) on the Web
    t   Automated generation of topic hierarchies
    t   Web knowledge bases
    t   Extraction of schema for XML documents

t   Network Management
    t   Performance management
    t   Fault management
        User Profiling
t   Important for improving customization
    t   Provide users with pages, advertisements of interest
    t   Example profiles: on-line trader, on-line shopper

t   Generate user profiles based on their access patterns
    t   Cluster users based on frequently accessed URLs
    t   Use classifier to generate a profile for each cluster

t   Engage Technologies
    t   Tracks web traffic to create anonymous profiles of Web users
    t   Has profiles for more than 35 million anonymous users
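
The step "cluster users based on frequently accessed URLs" can be sketched as follows. This is a minimal illustration, not the method of any particular product: it assumes Jaccard similarity between URL sets and a greedy one-pass clustering with a hypothetical threshold of 0.5. The classifier that would turn each cluster into a profile is left out.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_users(visits, threshold=0.5):
    """Greedy one-pass clustering: a user joins the first cluster whose
    seed user is similar enough, otherwise starts a new cluster."""
    clusters = []  # list of (seed_urls, [user names])
    for user, urls in visits.items():
        for seed_urls, members in clusters:
            if jaccard(urls, seed_urls) >= threshold:
                members.append(user)
                break
        else:
            clusters.append((set(urls), [user]))
    return [members for _, members in clusters]


# Hypothetical access data: user -> set of frequently accessed URLs.
visits = {
    "A": {"/quotes", "/trade", "/portfolio"},
    "B": {"/quotes", "/trade", "/news"},
    "C": {"/books", "/cart", "/checkout"},
    "D": {"/books", "/cart"},
}

clusters = cluster_users(visits)
print(clusters)
```

With this data, A and B end up in one cluster (an "on-line trader" style group) and C and D in another (an "on-line shopper" style group). The result depends on the order in which users are seen, which is one reason real systems tend to use more robust clustering algorithms.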
    Internet Advertising
t   Ads are a major source of revenue for Web
    portals (e.g., Yahoo, Lycos) and E-commerce sites

t   Plenty of startups doing internet advertising
    t   Doubleclick, AdForce, Flycast, AdKnowledge

t   Internet advertising is probably the “hottest”
    web mining application today
    Internet Advertising
t   Scheme 1:
    t   Manually associate a set of ads with each user profile
    t   For each user, display an ad from the set based on profile

t   Scheme 2:
    t   Automate association between ads and users
    t   Use ad click information to cluster users (each user is
        associated with a set of ads that he/she clicked on)
    t   For each cluster, find the ads that occur most frequently in the
        cluster; these become the ads for the set of users in the cluster
    Internet Advertising
t   Use collaborative filtering (e.g. Likeminds, Firefly)
t   Each user Ui has a rating for a subset of ads (based
    on click information, time spent, items bought etc.)
t   Rij - rating of user Ui for ad Aj
t   Problem: Compute user Ui’s rating for an unrated ad


    [Table: ratings matrix with one row per user and columns A1, A2, A3]
    Internet Advertising
t   Key Idea: User Ui’s rating for ad Aj is set to Rkj,
    where Uk is the user whose rating of ads is most
    similar to Ui’s

t   User Ui’s rating for an ad Aj that has not been
    previously displayed to Ui is computed as follows:
    t   Consider a user Uk who has rated ad Aj
    t   Compute Dik, the distance between Ui and Uk’s ratings on
        common ads
    t   Ui’s rating for ad Aj = Rkj (Uk is user with smallest Dik)
    t   Display to Ui ad Aj with highest computed rating
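
The nearest-neighbour scheme above can be sketched in a few lines. The slides do not say how the distance Dik is defined, so the sketch assumes Euclidean distance over the ads both users have rated. The ratings and user names are invented for illustration.

```python
import math


def distance(ratings_i, ratings_k):
    """D_ik: Euclidean distance (an assumption) between two users'
    ratings on the ads that both have rated."""
    common = set(ratings_i) & set(ratings_k)
    if not common:
        return math.inf  # no common ads, so the users cannot be compared
    return math.sqrt(sum((ratings_i[a] - ratings_k[a]) ** 2 for a in common))


def predict_rating(ratings, user, ad):
    """Return R_kj, where U_k is the closest user to `user` among those who rated `ad`."""
    best_rating, best_dist = None, math.inf
    for other, other_ratings in ratings.items():
        if other == user or ad not in other_ratings:
            continue
        d = distance(ratings[user], other_ratings)
        if d < best_dist:
            best_rating, best_dist = other_ratings[ad], d
    return best_rating


def best_ad(ratings, user, all_ads):
    """Pick the unrated ad with the highest predicted rating, or None."""
    scores = {}
    for ad in all_ads:
        if ad in ratings[user]:
            continue
        predicted = predict_rating(ratings, user, ad)
        if predicted is not None:
            scores[ad] = predicted
    return max(scores, key=scores.get) if scores else None


# Hypothetical ratings R_ij: user -> {ad: rating}.
ratings = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 5, "A2": 2, "A3": 4, "A4": 2},
    "U3": {"A1": 1, "A2": 5, "A3": 1, "A4": 5},
}

print(predict_rating(ratings, "U1", "A3"))
print(best_ad(ratings, "U1", ["A1", "A2", "A3", "A4"]))
```

U2 is much closer to U1 than U3 is, so U1 inherits U2's ratings for A3 and A4, and A3 is the ad displayed. Relying on a single nearest user is brittle; averaging over several close users is a common refinement, though it goes beyond what the slide describes.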
    Fraud
t   With the growing popularity of E-commerce, systems
    to detect and prevent fraud on the Web become important

t   Maintain a signature for each user based on buying
    patterns on the Web (e.g., amount spent, categories
    of items bought)

t   If buying pattern changes significantly, then signal fraud

t   HNC software uses domain knowledge and neural
    networks for credit card fraud detection
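
As a rough illustration of the signature idea, the sketch below summarizes a user's past purchase amounts by their mean and standard deviation and flags a new purchase that deviates strongly. This is a deliberately simple stand-in, not the neural-network approach attributed to HNC above. The threshold of three standard deviations and the sample amounts are assumptions.

```python
from statistics import mean, stdev


def build_signature(amounts):
    """Summarize past purchase amounts as (mean, sample standard deviation).
    Needs at least two past purchases."""
    return mean(amounts), stdev(amounts)


def is_suspicious(signature, amount, threshold=3.0):
    """Flag a purchase more than `threshold` standard deviations from the mean."""
    mu, sigma = signature
    if sigma == 0:
        return amount != mu
    return abs(amount - mu) / sigma > threshold


# Hypothetical purchase history for one user.
history = [20.0, 35.0, 25.0, 30.0, 40.0]
signature = build_signature(history)

print(is_suspicious(signature, 45.0))
print(is_suspicious(signature, 500.0))
```

A fuller signature would also cover the categories of items bought, as the slide suggests; amount alone is used here to keep the example short.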
Retrieval of Similar Images
t   Given:

     t   A set of images
t   Find:
     t All images similar to a given image
     t All pairs of similar images
t   Sample applications:

     t Medical diagnosis
     t Weather prediction
     t Web search engine for images
     t E-commerce
Retrieval of Similar Images
t   QBIC, Virage, Photobook
t   Compute feature signature for each image
    t   QBIC uses color histograms
    t   WBIIS, WALRUS use wavelets
t   Use spatial index to retrieve database image whose
    signature is closest to the query’s signature

t   WALRUS decomposes an image into regions
t   A single signature is stored for each region
t   Two images are considered to be similar if they have
    enough similar region pairs
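
The signature-based retrieval described above can be illustrated with color histograms, the feature the slide attributes to QBIC. The sketch is not QBIC's implementation: it assumes images are given as non-empty lists of (r, g, b) pixels, uses a coarse histogram with a hypothetical bin count, compares signatures with L1 distance, and scans the database linearly instead of using a spatial index.

```python
def color_histogram(pixels, bins=2):
    """Normalized histogram over bins**3 coarse RGB cells (the image signature)."""
    step = 256 // bins
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Clamp so that values near 255 stay in the last bin for any bin count.
        ri, gi, bi = (min(c // step, bins - 1) for c in (r, g, b))
        hist[ri * bins * bins + gi * bins + bi] += 1
    return [count / len(pixels) for count in hist]


def l1_distance(h1, h2):
    """Sum of absolute differences between two histograms."""
    return sum(abs(x - y) for x, y in zip(h1, h2))


def most_similar(query_pixels, database):
    """Name of the database image whose signature is closest to the query's."""
    query = color_histogram(query_pixels)
    return min(
        database,
        key=lambda name: l1_distance(query, color_histogram(database[name])),
    )


# Hypothetical tiny "images", each a short list of RGB pixels.
database = {
    "sunset": [(250, 120, 10)] * 3 + [(200, 30, 20)],
    "forest": [(20, 180, 30)] * 4,
    "ocean": [(10, 40, 200)] * 4,
}
query = [(240, 100, 30), (230, 110, 20), (210, 20, 10), (220, 25, 15)]

print(most_similar(query, database))
```

A global histogram ignores where colors appear in the image, which is the limitation that region-based signatures such as those of WALRUS are meant to address.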
    [Figure: query image and the images retrieved for it]
    Problems with Web Search Today
t   Today’s search engines are plagued by
    t the abundance problem (99% of info of no
      interest to 99% of people)
    t limited coverage of the Web (internet
      sources hidden behind search interfaces);
      the largest crawlers cover < 18% of all web pages
    t limited query interface based on keyword-
      oriented search
    t limited customization to individual users
    Problems with Web Search Today
t   Today’s search engines are plagued by
    t   Web is highly dynamic
         t Lots of pages added, removed, and updated
          every day
    t   Very high dimensionality
           Improve Search By Adding
              Structure to the Web
 t   Use Web directories (or topic hierarchies)
      t   Provide a hierarchical classification of documents (e.g., Yahoo!)
                              Yahoo home page

     Recreation              Business               Science            News

Travel       Sports         Companies            Finance       Jobs

 t   Searching in the context of a topic restricts the search to only
     the subset of web pages related to the topic
        Automatic Creation
        of Web Directories
t   In the Clever project, hyper-links between Web pages
    are taken into account when categorizing them
    t   Use a Bayesian classifier
    t   Exploit knowledge of the classes of immediate neighbors of
        document to be classified
    t   Show that simply taking text from neighbors and using
        standard document classifiers to classify page does not work

t   Inktomi’s Directory Engine uses “Concept Induction”
    to automatically categorize millions of documents
     Network Management
 t   Objective: To deliver content to users quickly and reliably
     t   Traffic management
     t   Fault management

[Figure: routers in a service provider network]
    Why is Traffic Management Important?
t   While annual bandwidth demand is increasing ten-fold
    on average, annual bandwidth supply is rising only by
    a factor of three

t   Result is frequent congestion at servers and on
    network links
    t   during a major event (e.g., Princess Diana’s death), an
        overwhelming number of user requests can result in millions
        of redundant copies of data flowing back and forth across the
        network
    t   Olympic sites during the games
    t   NASA sites close to launch and landing of shuttles
t   Key Ideas

    t   Dynamically replicate/cache content at multiple sites within
        the network and closer to the user

    t   Multiple paths between any pair of sites

    t   Route user requests to server closest to the user or least
        loaded server

    t   Use path with least congested network links

t   Akamai, Inktomi
[Figure: traffic management in a service provider network]
    Traffic Management
t   Need to mine network and Web traffic to determine
     t   What content to replicate?
     t   Which servers should store replicas?
     t   Which server to route a user request to?

     t   What path to use to route packets?

t   Network Design issues
     t   Where to place servers?
     t   Where to place routers?
     t   Which routers should be connected by links?

t   One can use association rule and sequential pattern mining
    algorithms to cache/prefetch replicas at servers
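
As a small illustration of using sequential patterns for prefetching, the sketch below counts which page most often follows each page in past sessions and suggests pages to prefetch. It only looks at pairs of consecutive pages, which is a simplification of full sequential pattern mining. The sessions and the 0.5 confidence threshold are assumptions made for the example.

```python
from collections import Counter, defaultdict


def next_page_counts(sessions):
    """Count how often each page is immediately followed by each other page."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return counts


def pages_to_prefetch(counts, current, min_confidence=0.5):
    """Pages that followed `current` in at least `min_confidence` of observed cases."""
    followers = counts.get(current)
    if not followers:
        return []
    total = sum(followers.values())
    return [page for page, n in followers.most_common() if n / total >= min_confidence]


# Hypothetical sessions: ordered lists of URLs requested by one user.
sessions = [
    ["/home", "/scores", "/schedule"],
    ["/home", "/scores", "/medals"],
    ["/home", "/scores", "/schedule"],
    ["/home", "/news"],
]
counts = next_page_counts(sessions)

print(pages_to_prefetch(counts, "/home"))
print(pages_to_prefetch(counts, "/scores"))
```

A server or cache that sees a request for /home could then fetch /scores ahead of time, since it followed /home in three of the four sessions.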
    Fault Management

t   Fault management involves

     t   Quickly identifying failed/congested servers and links in network
     t   Re-routing user requests and packets to avoid congested/down servers and
         links

t   Need to analyze alarm and traffic data to carry out root cause analysis of
    faults

t   Bayesian classifiers can be used to predict the root cause given a set of
    alarms
Web Mining Issues
t   Size
    t   Grows at about 1 million pages a day
    t   Google indexes 9 billion documents
    t   Number of web sites
         t Netcraft survey says 72 million sites

t   Diverse types of data
    t   Images
    t   Text
    t   Audio/video
    t   XML
    t   HTML
Number of Active Sites

  Total Sites Across All Domains August 1995 - October 2007
Systems Issues
t   Web data sets can be very large
     t   Tens to hundreds of terabytes
t   Cannot mine on a single server!
     t   Need large farms of servers
t   How to organize hardware/software to
    mine multi-terabyte data sets
     t Without breaking the bank!
Different Data Formats
t Structured Data
t Unstructured Data
t OLE DB offers some solutions!
    Web Data
t Web pages
t Intra-page structures
t Inter-page structures
t Usage data
t Supplemental data
    t Profiles
    t Registration information
    t Cookies
    Web Usage Mining
t Pages contain information
t Links are ‘roads’
t How do people navigate the Internet?
    t   → Web Usage Mining (clickstream analysis)
t Information on navigation paths
  available in log files
t Logs can be mined from a client or a
  server perspective
    Website Usage Analysis
t   Why analyze Website usage?
t   Knowledge about how visitors use Website could
    t   Provide guidelines to web site reorganization; Help prevent
    t   Help designers place important information where the visitors look
        for it
    t   Pre-fetching and caching web pages
    t   Provide adaptive Website (Personalization)
    t    Questions which could be answered
         t   What are the differences in usage and access patterns among users?
         t   What user behaviors change over time?
         t   How do usage patterns change with quality of service (slow/fast)?
         t   What is the distribution of network traffic over time?
    Website Usage Analysis

Analog – Web Log File Analyser
Gives basic statistics such as
    • number of hits
    • average hits per time period
    • what are the popular pages in your site
    • who is visiting your site
    • what keywords are users searching for to get to you
    • what is being downloaded
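
The kind of basic statistics listed above can be computed directly from a server log. The sketch below is not Analog's code. It assumes the log is in the Common Log Format and uses a hypothetical regular expression and invented log lines to count hits, popular pages, and visiting hosts.

```python
import re
from collections import Counter

# Assumed layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)


def summarize(lines):
    """Count hits, requests per page, and requests per host; skip malformed lines."""
    hits = 0
    pages = Counter()
    hosts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue
        hits += 1
        pages[match["path"]] += 1
        hosts[match["host"]] += 1
    return {"hits": hits, "pages": pages, "hosts": hosts}


# Hypothetical log lines for illustration.
log_lines = [
    '10.0.0.1 - - [10/Oct/2007:13:55:36 +0000] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2007:13:56:01 +0000] "GET /papers.html HTTP/1.0" 200 5120',
    '10.0.0.1 - - [10/Oct/2007:13:57:12 +0000] "GET /index.html HTTP/1.0" 304 -',
    'malformed line',
]

summary = summarize(log_lines)
print(summary["hits"])
print(summary["pages"].most_common(1))
```

Search keywords would come from the referrer field, which the plain Common Log Format does not include, so they are left out of this sketch.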
Web Usage Mining Process
  Web Mining Outline
Goal: Examine the use of data mining on
  the World Wide Web
t Web Content Mining
t Web Structure Mining
t Web Usage Mining
      Web Mining Taxonomy

Modified from [zai01]
    Web Content Mining
t   Examine the contents of web pages as well as the results
    of web searches
t   Can be thought of as extending the work performed
    by basic search engines
t   Search engines have crawlers to search the web and
    gather information, indexing techniques to store the
    information, and query processing support to provide
    information to the users
t   Web Content Mining is: the process of extracting
    knowledge from web contents
  Semi-structured Data

t Content is, in general, semi-structured
  t Example:
    t Title
    t Author
    t Publication_Date
    t Length
    t Category
    t Abstract
    t Content
    Structuring Textual Data
t   Many methods designed to analyze structured data
t   If we can represent documents by a set of attributes
    we will be able to use existing data mining methods
t   How to represent a document?
    t   Vector based representation
        (referred to as “bag of words” as it is invariant to word order)
t   Use statistics to add a numerical dimension to
    unstructured text
    Document Representation
t   A document representation aims to capture what the
    document is about
t   One possible approach:
    t   Each entry describes a document
    t   Attributes describe whether or not a term appears in the document
 Document Representation
Another approach:
• Each entry describes a document
• Attributes represent the frequency with which a
term appears in the document
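
Both representations can be built in a few lines. The sketch below is a minimal illustration with a deliberately naive tokenizer (lowercasing, whitespace splitting, stripping punctuation). The two sample documents and all function names are invented for the example.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (word.strip(".,;:!?\"'()").lower() for word in text.split())
    return [t for t in tokens if t]


def build_vocabulary(documents):
    """Sorted list of all distinct terms; each term becomes one attribute."""
    return sorted({term for doc in documents for term in tokenize(doc)})


def binary_vector(doc, vocabulary):
    """1 if the term appears in the document, 0 otherwise."""
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocabulary]


def frequency_vector(doc, vocabulary):
    """Number of times each term appears in the document."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


documents = [
    "Web mining mines the web.",
    "Data mining finds patterns in data.",
]
vocabulary = build_vocabulary(documents)

print(vocabulary)
print(binary_vector(documents[0], vocabulary))
print(frequency_vector(documents[0], vocabulary))
```

The two vectors for the first document differ only in the entry for "web", which is 1 in the binary form and 2 in the frequency form. Word order is lost in both, which is what the "bag of words" name refers to.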
 Document Representation
• Stop Word removal: Many words are not informative and thus
irrelevant for document representation
          the, and, a, an, is, of, that, …
• Stemming: reducing words to their root form (reduces dimensionality)
          A document may contain several occurrences of words
like fish, fishes, fisher, and fishers, but would not be retrieved by
a query with the keyword “fishing”
Different words that share the same stem should be
represented by that stem (“fish”) instead of the actual word
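
The two preprocessing steps can be sketched as below. The stop word list is the one given above. The stemmer is a toy suffix stripper written for this example, not an established algorithm such as Porter's, and it will over-stem many real words (for instance "news" becomes "new").

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}

# Toy suffix rules, checked in this order (an assumption for illustration).
SUFFIXES = ("ing", "ers", "er", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, strip punctuation, drop stop words, then stem what is left."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    return [stem(w) for w in words if w and w not in STOP_WORDS]


terms = preprocess("The fisher is fishing and the fishers catch fishes.")
print(terms)
```

After preprocessing, fisher, fishing, fishers, and fishes all map to the stem "fish", so a query for "fishing" that is stemmed the same way would match this document.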
