									Discovering Business Intelligence
    Information by Comparing
       Company Web Sites
        Unit 6 of Web Intelligence –
         Web Mining and Farming




                                       1
           Introduction

• More and more companies, government organizations, and
  individuals are publishing their information on the Web
• How to find useful or interesting information on the Web
   –   Keyword-based search
   –   Manual browsing
   –   Wrapper-based approaches
   –   Web query languages
   –   User-preference approaches

   – These approaches find only the information that matches the user’s
     specifications


                                    2
          Introduction (Cont.)

• Finding unexpected information can be very important
   – Human analysts must browse the Web to identify the interesting
     (including unexpected) pieces of information
   – Automated assistance is urgently needed
   – Whether a piece of information is interesting or not is subjective
   – Similar to the interestingness problem in data mining




                                   3
          Interestingness Measures

• Interestingness measures
   – Unexpectedness: a piece of information is interesting if it is relevant
     but unknown to the user, or it contradicts the user’s expectation
   – Actionability: a piece of information is actionable if the user can do
     something with it to his/her advantage
      • Key concept but elusive (so, decided by the user)
• Information categorization
   – Information that is both unexpected and actionable
   – Information that is unexpected but not actionable
   – Information that is actionable but expected



                                      4
          Summary of the Proposed
          Approach
• Aim to find interesting information from a competitor Web
  site
• Input
   – A user site U (expectation of the user)
   – Some additional knowledge E that the user has about its competitor
     (expectation of the user)
   – A competitor site C
• Actions of WebCompare
   – Analyze U to extract all the information that represents the user’s
     expectation
   – Analyze C and compare the information it contains with U and E
     to find various types of expected and unexpected information in C


                                    5
          Summary of the Proposed
          Approach (Cont.)
• The information in a Web page is represented using two
  schemes
   – Vector space representation – similarities, differences, and the main
     concepts of text documents can be represented by keywords that
     appear in the documents
   – Concepts
      • Combinations of keywords that occur frequently in the sentences of
        a Web page
      • Often represent significant information that the owner wants to
        emphasize




                                     6
Vector Space Representation of
       Text Documents




                                 7
         Vector Space Representation
         of Text Documents
• Each document is described by a set of keywords called
  index terms (or simply terms)
• An index term is simply a word whose semantics helps in
  remembering the document’s main themes
• Index terms are used to index and to summarize the
  document content
• An index term is associated with a weight




                              8
          Term Weight

• Two approaches to associate a weight with an index term
   – Binary:
      • the domain contains only the values one and zero.
   – Weighted:
      • the domain is the set of all positive real numbers.
   – Ex: an item that discusses petroleum refineries in Mexico

              Petroleum  Mexico  Oil  Taxes  Refineries
    Binary        1        1      1     0        1
    Weighted     2.8      1.6    3.5   0.3      3.1

                                     9
          Term Weight (Cont.)

• Simple term frequency algorithm
   – The weight is equal to the term frequency (TF)
   – Emphasizes the use of a particular processing token within an item
      • if the word “computer” occurs 15 times within an item, it has a
        weight of 15
   – Problem: the weights need normalization
      • The longer an item is, the more often a processing token may
        occur within the item.
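
The normalization problem can be illustrated with a small sketch. It assumes whitespace tokenization and divides each raw count by the largest count in the item, which is one common normalization and not necessarily the one the slides intend; the function names are hypothetical.

```python
from collections import Counter


def term_frequencies(text):
    """Raw term frequency: how often each token occurs in one item."""
    return Counter(text.lower().split())


def normalized_tf(text):
    """Divide each raw count by the largest count in the item (assumed scheme)."""
    counts = term_frequencies(text)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


short_item = "computer network computer"
long_item = " ".join([short_item] * 5)  # the same content repeated five times

raw_short = term_frequencies(short_item)
raw_long = term_frequencies(long_item)
```

The raw count of "computer" grows from 2 to 10 just because the item is longer, while the normalized weights of the two items are identical.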




                                    10
          Term Weight (Cont.)

• Inverse document frequency
   – the weight of an index term is inversely related to the number of
     items in the database that contain the term
   – WEIGHTij = TFij * [Log2(n) - Log2(IFj) + 1]
       • WEIGHTij : weight assigned to term “j” in item “i”
       • TFij : frequency of term “j” in item “i”
       • IFj : number of items in the database that have term “j” in them
       • n : number of items in the database




                                        11
   Term Weight (Cont.)
• Ex:
                      n         TF       IF
        Oil          2048        4      128
        Mexico       2048        8      16
        Refinery     2048       10     1024

   – Weight(Oil) = 4 * (Log2(2048) - Log2(128) + 1) = 20
   – Weight(Mexico) = 8 * (Log2(2048) - Log2(16) + 1) = 64
   – Weight(Refinery) = 10 * (Log2(2048) - Log2(1024) + 1) = 20
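
The three weights above can be checked with a short sketch of the weighting formula. The function and variable names are hypothetical; only the formula and the numbers come from the slides.

```python
import math


def idf_weight(tf, n, item_freq):
    """WEIGHT = TF * (log2(n) - log2(IF) + 1)."""
    return tf * (math.log2(n) - math.log2(item_freq) + 1)


# (TF, IF) per term, with n = 2048 items in the database
examples = {"oil": (4, 128), "mexico": (8, 16), "refinery": (10, 1024)}
weights = {term: idf_weight(tf, 2048, item_freq)
           for term, (tf, item_freq) in examples.items()}
```

"Refinery" has the highest term frequency but occurs in half of all items, so it ends up with the same weight as "oil".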




                           12
          Term Weight (Cont.)

• Signal weighting
   – IDF does not account for the term frequency distribution of the
     processing token in the items that contain the term.
   – The distribution of the frequency of processing tokens within an item
     can affect the ability to rank items.
   – An instance of an event that occurs all the time has less
     information value than an instance of a seldom occurring event.

    Item Distribution    SAW    DRILL
            A             10      2
            B             10      2
            C             10     18
            D             10     10
            E             10     18
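
The slides do not give a formula for the signal weight. The sketch below assumes one common formulation based on Shannon's information measure: signal = log2(total frequency) minus the average information (entropy) of the term's distribution over items. Treat both the formula and the function name as assumptions.

```python
import math


def signal_weight(frequencies):
    """log2(total frequency) minus the entropy of the distribution over items."""
    total = sum(frequencies)
    avg_info = -sum((f / total) * math.log2(f / total)
                    for f in frequencies if f > 0)
    return math.log2(total) - avg_info


saw = [10, 10, 10, 10, 10]    # uniform across items A-E
drill = [2, 2, 18, 10, 18]    # concentrated in a few items

saw_signal = signal_weight(saw)
drill_signal = signal_weight(drill)
```

Both terms occur 50 times in total, but under this assumed formula DRILL gets the higher signal because its occurrences are unevenly spread, which makes it more useful for ranking items.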
                                    13
         Similarity Measure

• Measure the similarity between a query and a document
• Similarity measure examples



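   – Inner product:

\[
\mathrm{SIM}(DOC_i, QUERY_j) = \sum_{k} DTerm_{i,k} \cdot QTerm_{j,k}
\]

   – Cosine:

\[
\mathrm{SIM}(DOC_i, QUERY_j) =
  \frac{\sum_{k} DTerm_{i,k} \cdot QTerm_{j,k}}
       {\sqrt{\sum_{k} DTerm_{i,k}^{2} \cdot \sum_{k} QTerm_{j,k}^{2}}}
\]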
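
A minimal sketch of the two similarity measures, the inner product and the cosine, assuming documents and queries are stored as term-to-weight dictionaries. The document weights reuse the weighted row of the earlier petroleum example; the query and the function names are hypothetical.

```python
import math


def inner_product(doc, query):
    """Sum over shared terms of document weight times query weight."""
    return sum(weight * query.get(term, 0.0) for term, weight in doc.items())


def cosine_similarity(doc, query):
    """Inner product divided by the product of the two vector lengths."""
    norm = (math.sqrt(sum(w * w for w in doc.values()))
            * math.sqrt(sum(w * w for w in query.values())))
    return inner_product(doc, query) / norm if norm else 0.0


doc = {"petroleum": 2.8, "mexico": 1.6, "oil": 3.5, "taxes": 0.3, "refineries": 3.1}
query = {"oil": 1.0, "mexico": 1.0}

dot = inner_product(doc, query)
cos = cosine_similarity(doc, query)
```

The inner product grows with document length, whereas the cosine stays between 0 and 1 for non-negative weights, which makes it easier to compare across pages.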




                                   14
          Finding Concepts Using
          Association Rule Mining
• Association rule mining
• Cheese → beer [support = 10%, confidence = 80%]
• Association rule mining requires the user to set a support
  threshold and a confidence threshold according to his/her needs
• An association mining algorithm works in two steps (Apriori)
   – Generate all large (frequent) itemsets that satisfy minsup
       • An itemset is simply a set of items
       • A large itemset is an itemset that has transaction support above
         minsup
   – Generate all association rules that satisfy minconf using the large
     itemsets
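
The two steps can be sketched as follows. This is a simplified level-wise search in the spirit of Apriori, without its subset-pruning optimization, and it treats "satisfies minsup" as support >= minsup; the function names and the toy transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, minsup):
    """Step 1: level-wise search for itemsets whose support is at least minsup."""
    n = len(transactions)
    support = {}
    candidates = {frozenset([item]) for t in transactions for item in t}
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        large = {c: cnt / n for c, cnt in counts.items() if cnt / n >= minsup}
        support.update(large)
        # Join large k-itemsets into (k+1)-item candidates.
        candidates = {a | b for a in large for b in large
                      if len(a | b) == len(a) + 1}
    return support


def association_rules(support, minconf):
    """Step 2: split each large itemset into X -> Y and keep confident rules."""
    rules = []
    for itemset, sup in support.items():
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                conf = sup / support[lhs]
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, sup, conf))
    return rules


transactions = [
    {"cheese", "beer"},
    {"cheese", "beer", "bread"},
    {"cheese", "bread"},
    {"beer"},
    {"cheese", "beer"},
]
large_itemsets = frequent_itemsets(transactions, minsup=0.4)
rules = association_rules(large_itemsets, minconf=0.7)
```

With these toy transactions, {cheese, beer} has support 3/5 and yields rules in both directions, while {beer, bread} falls below the minimum support and never produces a rule.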

                                    15
Finding Frequent Itemsets




     16
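     Confidence and Support

• Confidence c: for a rule X → Y, the probability that Y also occurs
  in a transaction of D, given that X occurs

\[
c = \frac{P(X \cap Y)}{P(X)}, \qquad 0 \le c \le 1
\]

• Support s: for a rule X → Y, the probability that a transaction in D
  contains both X and Y

\[
s = P(X \cap Y), \qquad 0 \le s \le 1
\]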




                       17
      Deriving Association Rules
TID             Age               Sex        Bought car
100             27                F          Y
200             29                F          Y
300             32                M          N
400             35                M          Y
500             26                F          N

Rule                                            Support   Confidence
(Age: 25-30) and (Sex: F) → (Bought car: Y)     40%       66.7%
(Bought car: Y) → (Sex: M)                      20%       33.3%
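
The support and confidence figures in this table can be reproduced with a short sketch. The row encoding and the helper name are hypothetical; the data is the five-row table above.

```python
rows = [
    {"tid": 100, "age": 27, "sex": "F", "bought_car": "Y"},
    {"tid": 200, "age": 29, "sex": "F", "bought_car": "Y"},
    {"tid": 300, "age": 32, "sex": "M", "bought_car": "N"},
    {"tid": 400, "age": 35, "sex": "M", "bought_car": "Y"},
    {"tid": 500, "age": 26, "sex": "F", "bought_car": "N"},
]


def support_and_confidence(rows, lhs, rhs):
    """lhs and rhs are predicates over a row; returns (support, confidence)."""
    n_lhs = sum(1 for r in rows if lhs(r))
    n_both = sum(1 for r in rows if lhs(r) and rhs(r))
    return n_both / len(rows), (n_both / n_lhs if n_lhs else 0.0)


# (Age 25-30) and (Sex F) -> (Bought car Y)
rule1 = support_and_confidence(
    rows,
    lambda r: 25 <= r["age"] <= 30 and r["sex"] == "F",
    lambda r: r["bought_car"] == "Y",
)

# (Bought car Y) -> (Sex M)
rule2 = support_and_confidence(
    rows,
    lambda r: r["bought_car"] == "Y",
    lambda r: r["sex"] == "M",
)
```

Three rows match the left-hand side of the first rule and two of them also match the right-hand side, which gives support 2/5 and confidence 2/3.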

                             18
           Finding Concepts Using
           Association Rule Mining (Cont.)
• Association rule mining in WebCompare
   –   The set of items I is the set of keywords in a page
   –   The keywords in each sentence of the page form a transaction t
   –   The set of all sentences in the page gives the transaction set T
   –   If a particular keyword occurs more than once in a sentence,
       consider it only once
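
A minimal sketch of how a page could be turned into a transaction set under this mapping. The sentence splitter, the tiny stopword list, and the function name are assumptions for illustration; the slides do not say how WebCompare tokenizes pages.

```python
import re

# Assumed, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "the", "of", "is", "are", "and", "in", "to", "from"}


def page_to_transactions(page_text):
    """One transaction per sentence; a keyword counts once per sentence."""
    transactions = []
    for sentence in re.split(r"[.!?]+", page_text):
        words = re.findall(r"[a-z]+", sentence.lower())
        keywords = frozenset(w for w in words if w not in STOPWORDS)
        if keywords:
            transactions.append(keywords)
    return transactions


page = "Web mining uses Web data. Extraction of information from the Web is hard."
T = page_to_transactions(page)
```

Using a set per sentence means the repeated "Web" in the first sentence is counted only once.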




                                      19
          Finding Concepts Using
          Association Rule Mining (Cont.)
• WebCompare mines all large itemsets from every page in C
  and every page in U separately
   – Each page of a Web site typically focuses on a specific topic
   – If we mix it with other pages, we may not be able to find interesting
     concepts that exist in the page, due to the minimum support
     constraint
       • A concept may be large in one page, but may not be large when
         it is combined with another page, as the minimum support is
         normally specified as a percentage
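
The effect of a percentage-based minimum support can be seen in a toy sketch. The two pages, the concept, and the threshold are made up for illustration.

```python
def support(itemset, transactions):
    """Fraction of transactions (sentences) containing every keyword in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


concept = {"data", "mining"}
page_a = [
    frozenset({"data", "mining", "tool"}),
    frozenset({"data", "mining"}),
    frozenset({"mining", "data", "course"}),
    frozenset({"contact"}),
]
page_b = [frozenset({"company", "history"})] * 8  # unrelated page, 8 sentences

minsup = 0.5
large_in_page = support(concept, page_a) >= minsup
large_when_mixed = support(concept, page_a + page_b) >= minsup
```

The concept covers 3 of 4 sentences in its own page (75%) but only 3 of 12 once the unrelated page is mixed in (25%), so it is large only when the page is mined on its own.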




                                    20
 Proposed Techniques –
Comparing Two Web Sites




                          21
          Overview

• Five methods to compare the user site U and the competitor
  site C to help the user find various types of interesting
  and/or unexpected information
   – User site U = {u1, u2, …, uw}
   – Competitor site C = {c1, c2, …, cv}




                                       22
          Finding the Corresponding C
          Page(s) of a U Page
• The user is interested in finding some pages in C that are
  similar to a page in U
   – Useful when the user wants to perform detailed analysis on a
     specific topic, e.g., to see whether C has published on the same topic
• Given a U page uj, use the cosine measure to compute the
  similarity between uj and each page in C
• After the comparison, the pages in C are ranked according
  to their similarity values in descending order
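
This ranking step can be sketched as follows, assuming each page is already reduced to a term-to-weight dictionary. The page names, weights, and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm = (math.sqrt(sum(w * w for w in a.values()))
            * math.sqrt(sum(w * w for w in b.values())))
    return dot / norm if norm else 0.0


def rank_c_pages(u_page, c_pages):
    """Return (page name, similarity) pairs, most similar first."""
    scored = [(name, cosine(u_page, vec)) for name, vec in c_pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


u1 = {"data": 3, "mining": 2, "classify": 1}
c_pages = {
    "cpage1": {"data": 2, "mining": 2, "cluster": 1},
    "cpage2": {"price": 4, "contact": 1},
    "cpage3": {"data": 1, "web": 3},
}
ranking = rank_c_pages(u1, c_pages)
```

Pages that share no terms with the U page get similarity 0 and sink to the bottom of the ranking.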




                                   23
Example




          24
          Finding Unexpected Terms in
          a C Page w.r.t. a U Page
• Given two similar pages, find unexpected terms
   – Allow the users to obtain the key differences of two pages
   – Help the user decide whether to browse the C page to find further
     details
• Given a U page uj and a C page ci, compare the term
  weights in both documents to obtain those unexpected
  terms in ci w.r.t the terms in uj
• Unexpectedness value of each term kr in ci w.r.t uj
\[
unexpT_{r,i,j} =
\begin{cases}
1 - \dfrac{tf_{r,j}}{tf_{r,i}}, & \text{if } \dfrac{tf_{r,j}}{tf_{r,i}} \le 1 \\
0, & \text{otherwise}
\end{cases}
\]
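
A minimal sketch of the term unexpectedness computation: the value is 1 minus the ratio of the term's weight in the U page to its weight in the C page when that ratio is at most 1, and 0 otherwise. It assumes the tf values are already-normalized term weights and that a term missing from the U page has weight 0. The sample weights and function names are hypothetical.

```python
def unexp_term(tf_c, tf_u):
    """1 - tf_u / tf_c if the ratio is at most 1, else 0 (requires tf_c > 0)."""
    ratio = tf_u / tf_c
    return 1 - ratio if ratio <= 1 else 0.0


def rank_unexpected_terms(c_page, u_page):
    """Score every term of the C page against the U page, highest first."""
    scores = {term: unexp_term(tf_c, u_page.get(term, 0.0))
              for term, tf_c in c_page.items() if tf_c > 0}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


c1 = {"classify": 1.0, "data": 0.5, "mining": 0.4}
u1 = {"data": 1.0, "mining": 0.2}
ranked = rank_unexpected_terms(c1, u1)
```

A term that is absent from the U page gets the maximum value 1, and a term that is at least as prominent in the U page as in the C page gets 0.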

                                          25
         Finding Unexpected Terms in a
         C Page w.r.t. a U Page (Cont.)
• After the unexpectedness value for each term kr is
  computed, all the terms in ci are ranked according to their
  unexpTr,i,j values in descending order
• Example: we are interested in unexpected terms in Cpage 1
  w.r.t. Upage 1 → Rank 1: classify




                              26
         Finding Unexpected Pages in
         C w.r.t. U
• Such pages are often very interesting, as they tell
  the user that the competitor site may have some useful
  pages that the user site does not have
• Combine all the pages in U to form a single document Du,
  and all the pages in C to form another single document Dc
• Compute the unexpectedness value of each term kl in Dc
  w.r.t Du (unexpTl,c,u)
• The unexpectedness of a page ci w.r.t U (unexpPi): the
  amount of term unexpectedness contained in ci
\[
unexpP_i = \frac{\sum_{r=1}^{m} unexpT_{r,c,u}}{m}
\]
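
The page-level measure can be sketched as below. Two points are assumptions, since the slides do not spell them out: m is taken to be the number of distinct terms in page ci, and term weights in Du and Dc are raw counts divided by the largest count in the combined document. All names and sample pages are hypothetical.

```python
from collections import Counter


def normalized(counts):
    """Scale counts so the most frequent term has weight 1 (assumed scheme)."""
    if not counts:
        return {}
    top = max(counts.values())
    return {term: count / top for term, count in counts.items()}


def unexp_term(tf_c, tf_u):
    ratio = tf_u / tf_c
    return 1 - ratio if ratio <= 1 else 0.0


def rank_unexpected_pages(u_pages, c_pages):
    """u_pages and c_pages map a page name to its list of keywords."""
    d_u = normalized(Counter(w for words in u_pages.values() for w in words))
    d_c = normalized(Counter(w for words in c_pages.values() for w in words))
    scores = {}
    for name, words in c_pages.items():
        terms = set(words)  # m = number of distinct terms in the page
        total = sum(unexp_term(d_c[t], d_u.get(t, 0.0)) for t in terms)
        scores[name] = total / len(terms) if terms else 0.0
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


u_pages = {"upage1": ["data", "mining", "data"]}
c_pages = {
    "cpage1": ["data", "mining"],
    "cpage2": ["loan", "rate", "loan"],
    "cpage3": ["data", "loan"],
}
page_ranking = rank_unexpected_pages(u_pages, c_pages)
```

In this made-up data, the page made up entirely of terms the user site never mentions ranks first, and the page that only repeats the user site's terms ranks last.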
                                        27
         Finding Unexpected Pages in
         C w.r.t. U (Cont.)
• After all unexpPi values are computed, we rank the C pages
  according to their unexpPi values in descending order
• Example:
   – Rank 1: Cpage 2
   – Rank 2: Cpage 3
   – Rank 3: Cpage 1




                             28
          Finding Unexpected Concepts in
          a C Page w.r.t a U Page
• A concept is a set of keywords that occur together in the
  sentences of a page above a certain user-specified
  minimum support (or frequency)
   – "information extraction", "extraction of information", "information is
     extracted"
• Use association rule mining to discover all concepts
• Treat each concept as a term or keyword, and apply method
  2 and/or method 3




                                      29
          Finding Unexpected Outgoing
          Links from C
• May indicate some useful resources that are of additional
  help to the customers of the competitor
• Let the set of outgoing links from U be Lu, and let the set of
  outgoing links from C be Lc.
• The set of unexpected outgoing links in C w.r.t U is Lc-Lu
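
The set difference can be sketched as follows. The sketch assumes that an outgoing link is any absolute link whose host differs from the site's own host; the slides do not define this, and the URLs and function name are hypothetical.

```python
from urllib.parse import urlparse


def outgoing_links(site_host, links):
    """Keep links that point to a different host (relative links are internal)."""
    return {link for link in links
            if urlparse(link).netloc not in ("", site_host)}


u_links = [
    "https://user.example/about",
    "https://kdnuggets.example/",
    "/contact",
]
c_links = [
    "https://competitor.example/products",
    "https://kdnuggets.example/",
    "https://standards.example/spec",
]

L_u = outgoing_links("user.example", u_links)
L_c = outgoing_links("competitor.example", c_links)
unexpected = L_c - L_u  # links the competitor points to that the user site does not
```

Only the link that the competitor site points to and the user site does not remains in the result.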




                                30
          Proposed Techniques – Incorporating
          the User's Existing Knowledge

• Users may have some existing knowledge about the
  application domain and its competitor
   – It enables the system to discover truly unexpected information
   – It allows the user to check if his/her expectations are correct
• Express the user's knowledge as keywords, concepts, and
  hypertext links. E consists of Eg and Es
   – Eg = all the general items of the domain that the user knows about
     and does not want ranked high
   – Es = specific items of the site that the user knows about and does not
     want ranked high



                                     31
         Proposed Techniques – Incorporating
         the User's Existing Knowledge (Cont.)

• In computation, items in E are added to the set of items in U
   – Keywords in E are used in methods 2 and 3
   – Concepts in method 4
   – Outgoing links in method 5
• When a weight is needed for an item in E, it takes the maximum
  weight
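
A sketch of how known items could suppress unexpectedness for keywords. It assumes normalized term weights, so the maximum weight is 1.0, and it reuses the term unexpectedness formula of method 2; the names and sample data are hypothetical.

```python
MAX_WEIGHT = 1.0  # assumed: weights are normalized, so 1.0 is the largest value


def add_expectations(u_page, known_terms):
    """Return U's term weights extended with every known term at maximum weight."""
    merged = dict(u_page)
    for term in known_terms:
        merged[term] = MAX_WEIGHT
    return merged


def unexp_term(tf_c, tf_u):
    """1 - tf_u / tf_c if the ratio is at most 1, else 0 (requires tf_c > 0)."""
    ratio = tf_u / tf_c
    return 1 - ratio if ratio <= 1 else 0.0


u1 = {"data": 1.0, "mining": 0.2}
c1 = {"classify": 1.0, "mining": 0.4}
e_keywords = {"classify"}  # the user already knows the competitor covers this

u1_with_e = add_expectations(u1, e_keywords)
before = unexp_term(c1["classify"], u1.get("classify", 0.0))
after = unexp_term(c1["classify"], u1_with_e.get("classify", 0.0))
```

Because the known term enters U with the maximum weight, its ratio can never fall below 1 and its unexpectedness drops to 0, so it is no longer ranked high.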




                                  32
System Architecture




            33
A Running Example




                    34
The Crawler Interface




            35
36
37
38
          Evaluation – Application
          Experience
• It allows the user to quickly focus on those potentially
  interesting pages, terms, and concepts
• Due to the difficulties of manual analysis, before using
  WebCompare, the users would often give up after browsing some
  top-level pages
• If a page is long, the users often do not read it carefully, and
  thus may miss some useful information. WebCompare can
  summarize each page with keywords and concepts




                                39
Evaluation -- Efficiency




             40
          Future Work

• Study the use of metadata and ontology to provide more
  information related to keywords to create a more intelligent
  system
• Study how the links of a Web site may be used to infer more
  unexpected information
• May be extended as a methodology for monitoring a
  competitor's Web site
   – Treat the old web pages of C as the existing knowledge or the U site
   – Report any unexpected changes to the old pages by the competitor




                                    41

								