bi by wulinqing


									Discovering Business Intelligence
    Information by Comparing
       Company Web Sites
        Unit 6 of Web Intelligence –
         Web Mining and Farming


          Introduction

• More and more companies, government organizations, and
  individuals are publishing their information on the Web
• How to find useful or interesting information on the Web
   –   Keyword-based search
   –   Manual browsing
   –   Wrapper-based approaches
   –   Web query languages
   –   User-preference approaches

   – They only find the information that matches the user’s

          Introduction (Cont.)

• Finding unexpected information can be very important
   – Currently, human analysts must browse the Web to identify interesting
     (including unexpected) pieces of information
   – Automated assistance is urgently needed
   – Whether a piece of information is interesting or not is subjective
   – Similar to the interestingness problem in data mining

          Interestingness Measures

• Interestingness measures
   – Unexpectedness: a piece of information is interesting if it is relevant
     but unknown to the user, or it contradicts the user’s expectation
   – Actionability: a piece of information is actionable if the user can do
     something with it to his/her advantage
      • Key concept but elusive (so, decided by the user)
• Information categorization
   – Information that is both unexpected and actionable
   – Information that is unexpected but not actionable
   – Information that is actionable but expected

          Summary of the Proposed Approach
• Aim to find interesting information from a competitor's Web site
• Input
   – A user site U (expectation of the user)
   – Some additional knowledge E that the user has about its competitor
     (expectation of the user)
   – A competitor site C
• Actions of WebCompare
   – Analyze U to extract all the information that represent the user’s
   – Analyze C and compare the information it contains with U and E
     to find various types of expected and unexpected information in C

          Summary of the Proposed
          Approach (Cont.)
• The information in a Web page is represented using two
   – Vector space representation – similarities, differences, and the main
     concepts of text documents can be represented by keywords that
     appear in the documents
   – Concepts
      • Combination of keywords that occur frequently in the sentence of
        a Web page
      • Often represent significant information that the owner wants to

Vector Space Representation of
       Text Documents

         Vector Space Representation
         of Text Documents
• Each document is described by a set of keywords called
  index terms (or simply terms)
• An index term is simply a word whose semantics helps to
  remember the document’s main themes
• Index terms are used to index and to summarize the
  document content
• An index term is associated with a weight

          Term Weight

• Two approaches to associating a weight with an index term
   – Binary:
      • the domain contains only the values one and zero.
   – Weighted:
      • the domain is the set of all positive real numbers.
   – Ex: discuss petroleum refineries in Mexico

               Petroleum   Mexico   Oil   Taxes   Refineries
    Binary         1          1      1      0         1
    Weighted      2.8        1.6    3.5    0.3       3.1

          Term Weight (Cont.)

• Simple term frequency algorithm
   – The weight is equal to the term frequency (TF)
   – Emphasize the use of particular processing token within an item
      • if the word “computer” occurs 15 times within an item it has a
        weight of 15
   – Problem: the raw count is not normalized for item length
      • The longer an item is, the more often a processing token may
        occur within the item.
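
The length effect described above can be seen in a minimal sketch. Dividing each count by the item length is only one common remedy, assumed here for illustration; the slides do not say which normalization is intended. The function names are hypothetical.

```python
from collections import Counter

def term_frequencies(text):
    """Raw term frequency: how often each token occurs in the item."""
    return Counter(text.lower().split())

def length_normalized(tf):
    """Assumed remedy: divide each count by the total number of tokens."""
    total = sum(tf.values())
    return {term: count / total for term, count in tf.items()} if total else {}

short_item = "computer network"
long_item = " ".join([short_item] * 5)  # same content repeated five times

tf_short = term_frequencies(short_item)
tf_long = term_frequencies(long_item)

# Raw counts grow with item length; the normalized values do not.
print(tf_short["computer"], tf_long["computer"])
print(length_normalized(tf_short)["computer"], length_normalized(tf_long)["computer"])
```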

          Term Weight (Cont.)

• Inverse document frequency
   – the weight of a term should be inversely related to the number of
     items in the database that contain the term
   – WEIGHTij = TFij * [Log2(n) - Log2(IFj) + 1]
       • WEIGHTij : weight assigned to term “j” in item “i”
       • TFij : frequency of term “j” in item “i”
       • IFj : number of items in the database that have term “j” in them
       • n : number of items (documents) in the database

   Term Weight (Cont.)
• Ex:
                      n         TF       IF
        Oil          2048        4      128
        Mexico       2048        8      16
        Refinery     2048       10     1024

   – Weight(Oil)      = 4  * (Log2(2048) - Log2(128)  + 1) = 20
   – Weight(Mexico)   = 8  * (Log2(2048) - Log2(16)   + 1) = 64
   – Weight(Refinery) = 10 * (Log2(2048) - Log2(1024) + 1) = 20
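
The worked example can be checked with a short sketch of the weighting formula. The function name idf_weight and the dictionary layout are hypothetical; only the formula and the numbers come from the slides.

```python
import math

def idf_weight(tf, n, item_freq):
    """WEIGHT = TF * (log2(n) - log2(IF) + 1), as given on the slide."""
    return tf * (math.log2(n) - math.log2(item_freq) + 1)

N_ITEMS = 2048  # n: number of items in the database

# term -> (TF in the item, IF = number of items containing the term)
examples = {"Oil": (4, 128), "Mexico": (8, 16), "Refinery": (10, 1024)}

weights = {term: idf_weight(tf, N_ITEMS, item_freq)
           for term, (tf, item_freq) in examples.items()}

for term, weight in weights.items():
    print(f"{term}: {weight:g}")
```

Although "Refinery" has the highest raw frequency, it occurs in half of all items, so its weight ends up no higher than that of "Oil".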

          Term Weight (Cont.)

• Signal weighting
   – IDF does not account for the term frequency distribution of the
     processing token in the items that contain the term.
   – The distribution of the frequency of processing tokens within an item
     can affect the ability to rank items.
   – An instance of an event that occurs all the time has less
     information value than an instance of a seldom occurring event.

   Item Distribution
        Item    SAW   DRILL
         A       10     2
         B       10     2
         C       10    18
         D       10    10
         E       10    18

          Similarity Measure

• Measure the similarity between a query and a document
• Similarity measure examples

   – Inner product:
     \[ SIM(DOC_i, QUERY_j) = \sum_k (DTerm_{i,k})(QTerm_{j,k}) \]

   – Cosine:
     \[ SIM(DOC_i, QUERY_j) = \frac{\sum_k (DTerm_{i,k})(QTerm_{j,k})}{\sqrt{\sum_k (DTerm_{i,k})^2 \cdot \sum_k (QTerm_{j,k})^2}} \]
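
A minimal sketch of the two similarity measures, assuming documents and queries are stored as dictionaries that map a term to its weight. The document vector reuses the weighted example from the Term Weight slide; the query is hypothetical.

```python
import math

def dot_similarity(doc, query):
    """Sum over shared terms of document weight times query weight."""
    return sum(weight * query.get(term, 0.0) for term, weight in doc.items())

def cosine_similarity(doc, query):
    """Dot product divided by the product of the two vector lengths."""
    norm = math.sqrt(sum(w * w for w in doc.values())
                     * sum(w * w for w in query.values()))
    return dot_similarity(doc, query) / norm if norm else 0.0

doc = {"petroleum": 2.8, "mexico": 1.6, "oil": 3.5, "taxes": 0.3, "refineries": 3.1}
query = {"oil": 1.0, "mexico": 1.0}  # hypothetical query vector

print(dot_similarity(doc, query))
print(cosine_similarity(doc, query))
```

The cosine form divides out vector length, so a long document does not score higher merely because its weights are larger.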

          Finding Concepts Using
          Association Rule Mining
• Association rule mining
• Cheese → beer [support = 10%, confidence = 80%]
• Association rule mining requires a support threshold set according to
  the user's needs
• An association mining algorithm works in two steps (Apriori)
   – Generate all large (frequent) itemsets that satisfy minsup
       • An itemset is simply a set of items
       • A large itemset is an itemset that has transaction support above
         minsup
   – Generate all association rules that satisfy minconf using the large
     itemsets


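• Confidence c: for a rule X → Y, the probability that Y also occurs in
  a transaction of D, given that X occurs
      \[ c = \frac{P(X \cap Y)}{P(X)}, \qquad 0 \le c \le 1 \]
• Support s: for a rule X → Y, the probability that a transaction of D
  contains both X and Y
      \[ s = P(X \cap Y), \qquad 0 \le s \le 1 \]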

TID             Age               Gender     Purchased car
100             27                F          Y
200             29                F          Y
300             32                M          N
400             35                M          Y
500             26                F          N

Rule                                                   Support    Confidence
(Age: 25-30) and (Gender: F) → (Purchased car: Y)      40%        66.7%
(Purchased car: Y) → (Gender: M)                       20%        33.3%
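
The support and confidence values in the table can be recomputed with a short sketch. The record layout and the helper name support_confidence are hypothetical; the five transactions are those from the table.

```python
rows = [
    {"tid": 100, "age": 27, "gender": "F", "car": "Y"},
    {"tid": 200, "age": 29, "gender": "F", "car": "Y"},
    {"tid": 300, "age": 32, "gender": "M", "car": "N"},
    {"tid": 400, "age": 35, "gender": "M", "car": "Y"},
    {"tid": 500, "age": 26, "gender": "F", "car": "N"},
]

def support_confidence(rows, antecedent, consequent):
    """Support = P(X and Y); confidence = P(X and Y) / P(X)."""
    n_x = sum(1 for r in rows if antecedent(r))
    n_xy = sum(1 for r in rows if antecedent(r) and consequent(r))
    support = n_xy / len(rows)
    confidence = n_xy / n_x if n_x else 0.0
    return support, confidence

# (age 25-30) and (gender F) -> (purchased car Y)
rule1 = support_confidence(
    rows,
    lambda r: 25 <= r["age"] <= 30 and r["gender"] == "F",
    lambda r: r["car"] == "Y",
)
# (purchased car Y) -> (gender M)
rule2 = support_confidence(
    rows,
    lambda r: r["car"] == "Y",
    lambda r: r["gender"] == "M",
)

for name, (s, c) in (("rule 1", rule1), ("rule 2", rule2)):
    print(f"{name}: support={s:.1%} confidence={c:.1%}")
```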

           Finding Concepts Using
           Association Rule Mining (Cont.)
• Association rule mining in WebCompare
   –   The set of items I is the set of keywords in a page
   –   The keywords in each sentence of the page form a transaction t
   –   The set of all sentences in the page gives the transaction set T
   –   If a particular keyword occurs more than once in a sentence,
       consider it only once
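
The mapping from a page to a transaction set can be sketched as follows. The sentence splitter, the tokenizer, and the stopword list are simplifying assumptions, not the preprocessing that WebCompare itself uses.

```python
import re

# Hypothetical stopword list; a real system would use a much larger one.
STOPWORDS = {"the", "a", "an", "of", "is", "in", "on", "and", "to"}

def page_to_transactions(text):
    """One transaction per sentence; each keyword is counted only once."""
    transactions = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z]+", sentence.lower())
        keywords = {w for w in words if w not in STOPWORDS}  # a set removes repeats
        if keywords:
            transactions.append(keywords)
    return transactions

page = "Data mining finds patterns in data. Web mining is data mining on the Web."
transactions = page_to_transactions(page)
for t in transactions:
    print(sorted(t))
```

Using a set per sentence implements the rule that a keyword occurring several times in one sentence is considered only once.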

          Finding Concepts Using
          Association Rule Mining (Cont.)
• WebCompare mines all large itemsets from every page in C
  and every page in U separately
   – Each page of a Web site typically focuses on a specific topic
   – If we mix it with other pages, we may not be able to find interesting
     concepts that exist in the page, due to the minimum support
       • A concept may be large in one page, but may not be large when
         it is combined with another page, as the minimum support is
         normally specified in percentage
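
The effect of a percentage-based minimum support can be illustrated with a small level-wise itemset miner. This is a simplified Apriori-style sketch under the assumption that each page is already a list of keyword sets; the pages and the threshold of 30% are hypothetical.

```python
from itertools import combinations

def large_itemsets(transactions, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    n = len(transactions)
    result = {}
    candidates = {frozenset([item]) for t in transactions for item in t}
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: k / n for c, k in counts.items() if k / n >= minsup}
        result.update(frequent)
        # Join step: combine frequent k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a, b in combinations(list(frequent), 2)
                      if len(a | b) == len(a) + 1}
    return result

page_a = [{"data", "mining"}, {"data", "mining", "web"}, {"web"}, {"site"}]
page_b = [{"price"}, {"price", "offer"}, {"offer"},
          {"contact"}, {"price"}, {"contact"}]

concept = frozenset({"data", "mining"})
alone = large_itemsets(page_a, minsup=0.3)
mixed = large_itemsets(page_a + page_b, minsup=0.3)

print(concept in alone)  # the concept appears in 2 of 4 sentences of page_a
print(concept in mixed)  # only 2 of 10 sentences once the pages are combined
```

Mining each page separately keeps such a concept visible, which is why the pages are not merged before mining.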

 Proposed Techniques –
Comparing Two Web Sites


• Five methods to compare the user site U and the competitor
  site C to help the user find various types of interesting
  and/or unexpected information
   – User site U = {u1, u2, …, uw}
   – Competitor site C = {c1, c2, …, cv}

          Finding the Corresponding C
          Page(s) of a U Page
• The user is interested in finding some pages in C that are
  similar to a page in U
   – Useful when the user wants to perform detailed analysis on a
     specific topic and see whether C has published on the same topic
• Given a U page uj, use the cosine measure to compute the
  similarity between uj and each page in C
• After the comparison, the pages in C are ranked according
  to their similarity values in descending order
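
A minimal sketch of this ranking step, assuming raw term frequencies as the weights and whitespace tokenization. The page names and texts are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector of a page (assumed weighting for this sketch)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(w * b[t] for t, w in a.items())  # a Counter returns 0 for missing terms
    norm = math.sqrt(sum(w * w for w in a.values()) * sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def rank_c_pages(u_text, c_pages):
    """Rank the competitor pages by similarity to one user page, descending."""
    u_vec = tf_vector(u_text)
    scored = [(name, cosine(u_vec, tf_vector(text))) for name, text in c_pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

u_page = "data mining classification clustering"
c_pages = {
    "c1": "contact address phone",
    "c2": "data mining classification tools",
    "c3": "data warehouse products",
}

ranking = rank_c_pages(u_page, c_pages)
for name, score in ranking:
    print(name, round(score, 3))
```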


          Finding Unexpected Terms in
          a C Page w.r.t. a U Page
• Given two similar pages, find unexpected terms
   – Allow the users to obtain the key differences of two pages
   – Help the user decide whether to browse the C page to find further
• Given a U page uj and a C page ci, compare the term
  weights in both documents to obtain those unexpected
  terms in ci w.r.t the terms in uj
• Unexpectedness value of each term kr in ci w.r.t. uj
      \[ unexpT_{r,i,j} = \begin{cases} 1 - \dfrac{tf_{r,j}}{tf_{r,i}}, & \dfrac{tf_{r,j}}{tf_{r,i}} \le 1 \\ 0, & \text{otherwise} \end{cases} \]
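
The term unexpectedness value can be sketched as below, assuming the weights are plain term frequencies held in dictionaries. The function names and the sample frequencies are hypothetical.

```python
def unexp_term(tf_c, tf_u):
    """1 - tf_u / tf_c when that ratio is at most 1, otherwise 0."""
    if tf_c <= 0:
        return 0.0  # the term does not occur in the C page
    ratio = tf_u / tf_c
    return 1 - ratio if ratio <= 1 else 0.0

def rank_unexpected_terms(c_tf, u_tf):
    """Score every term of the C page against the U page, highest first."""
    scores = {term: unexp_term(freq, u_tf.get(term, 0)) for term, freq in c_tf.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

c_tf = {"classify": 5, "data": 4, "mining": 2}  # hypothetical C page
u_tf = {"data": 4, "mining": 6}                 # hypothetical U page

ranked_terms = rank_unexpected_terms(c_tf, u_tf)
for term, score in ranked_terms:
    print(term, score)
```

A term that is absent from the U page scores 1, and a term the U page uses at least as often as the C page scores 0.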

         Finding Unexpected Terms in a
         C Page w.r.t. a U Page (Cont.)
• After the unexpectedness value for each term kr is
  computed, all the terms in ci are ranked according to their
  unexpTr,i,j values in descending order
• Example: we are interested in unexpected terms in Cpage 1
  w.r.t. Upage 1 → Rank 1: classify

         Finding Unexpected Pages in
         C w.r.t. U
• Such pages are often very interesting, as they tell
  the user that the competitor site may have some useful
  pages that the user site does not have
• Combine all the pages in U to form a single document Du,
  and all the pages in C to form another single document Dc
• Compute the unexpectedness value of each term kl in Dc
  w.r.t Du (unexpTl,c,u)
• The unexpectedness of a page ci w.r.t U (unexpPi): the
  amount of term unexpectedness contained in ci


      \[ unexpP_i = \sum_{k_r \in c_i} unexpT_{r,c,u} \]

         Finding Unexpected Pages in
         C w.r.t. U (Cont.)
• After all unexpPi values are computed, we rank the C pages
  according to their unexpPi values in descending order
• Example:
   – Rank 1: Cpage 2
   – Rank 2: Cpage 3
   – Rank 3: Cpage 1
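
A minimal sketch of the page-level ranking. It assumes that a page's score is the plain sum of the site-level unexpectedness values of its distinct terms; any normalization the original method applies is not reproduced here. Pages are hypothetical keyword lists.

```python
from collections import Counter

def unexp_term(tf_c, tf_u):
    """1 - tf_u / tf_c when that ratio is at most 1, otherwise 0."""
    if tf_c <= 0:
        return 0.0
    ratio = tf_u / tf_c
    return 1 - ratio if ratio <= 1 else 0.0

def rank_unexpected_pages(u_pages, c_pages):
    # Combine all pages of each site into a single document (Du and Dc).
    d_u = Counter(word for words in u_pages.values() for word in words)
    d_c = Counter(word for words in c_pages.values() for word in words)
    # Unexpectedness of every term of Dc with respect to Du.
    term_score = {t: unexp_term(f, d_u.get(t, 0)) for t, f in d_c.items()}
    # Assumed page score: sum over the distinct terms of the page.
    page_score = {name: sum(term_score[t] for t in set(words))
                  for name, words in c_pages.items()}
    return sorted(page_score.items(), key=lambda kv: kv[1], reverse=True)

u_pages = {"u1": ["data", "mining", "data"], "u2": ["contact", "address"]}
c_pages = {
    "c1": ["data", "mining"],
    "c2": ["classify", "classify", "cluster"],
    "c3": ["contact", "classify"],
}

page_ranking = rank_unexpected_pages(u_pages, c_pages)
for name, score in page_ranking:
    print(name, score)
```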

          Finding Unexpected Concepts in
          a C Page w.r.t a U Page
• A concept is a set of keywords that occur together in the
  sentences of a page above a certain user-specified
  minimum support (or frequency)
   – "information extraction", "extraction of information", "information is
• Use association rule mining to discover all concepts
• Treat each concept as a term or keyword, and apply method
  2 and/or method 3

          Finding Unexpected Outgoing
          Links from C
• May indicate some useful resources that are of additional
  help to the customers of the competitor
• Let the set of outgoing links from U be Lu, and let the set of
  outgoing links from C be Lc.
• The set of unexpected outgoing links in C w.r.t U is Lc-Lu
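
The link comparison is a set difference, sketched below. The URLs are placeholders, and extracting the outgoing links from the crawled pages is assumed to have been done already.

```python
def unexpected_links(links_c, links_u):
    """Outgoing links that appear in C but not in U (Lc - Lu)."""
    return set(links_c) - set(links_u)

# Hypothetical outgoing links of the user site and the competitor site.
links_u = {"http://example.org/standards", "http://example.org/forum"}
links_c = {
    "http://example.org/standards",
    "http://example.org/tutorials",
    "http://example.org/tools",
}

result = unexpected_links(links_c, links_u)
for url in sorted(result):
    print(url)
```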

          Proposed Techniques – Incorporating
          the User's Existing Knowledge

• Users may have some existing knowledge about the
  application domain and its competitor
   – It enables the system to discover truly unexpected information
   – It allows the user to check if his/her expectations are correct
• Express the user's knowledge as keywords, concepts, and
  hypertext links. E consists of Eg and Es
   – Eg = all the general items of the domain that the user knows about
     and does not want ranked high
   – Es = specific items of the site that the user knows about and does
     not want ranked high

         Proposed Techniques – Incorporating
         the User's Existing Knowledge (Cont.)

• In computation, items in E are added to the set of items in U
   – Keywords in E are used in methods 2 and 3
   – Concepts in method 4
   – Outgoing links in method 5
• When a weight is needed for an item, it takes the maximum

System Architecture

A Running Example

The Crawler Interface

          Evaluation – Application
• It allows the user to quickly focus on those potentially
  interesting pages, terms, and concepts
• Due to the difficulties of manual analysis, before using
  WebCompare, the users would often give up after browsing some
  top-level pages
• If a page is long, the users often do not read it carefully, and
  thus may miss some useful information. WebCompare can
  summarize each page with keywords and concepts

          Evaluation – Efficiency

          Future Work

• Study the use of metadata and ontology to provide more
  information related to keywords to create a more intelligent
• Study how the links of a Web site may be used to infer more
  unexpected information
• May be extended as a methodology for monitoring a
  competitor's Web site
   – Treat the old web pages of C as the existing knowledge or the U site
   – Report any unexpected changes to the old pages by the competitor

