CSE 634 Data Mining Concepts and Techniques Association Rule

					             CSE 634
Data Mining Concepts and Techniques
      Association Rule Mining

             Barbara Mucha
                Tania Irani
              Irem Incekoy
             Mikhail Bautin
   Course Instructor: Prof. Anita Wasilewska
   State University of New York, Stony Brook
                   Group 6
   Data Mining: Concepts & Techniques by Jiawei
    Han and Micheline Kamber
   Presentation Slides of Prateek Duble
   Presentation Slides of the Course Book.
   Mining Topic-Specific Concepts and Definitions
    on the Web
   Effective Personalization Based on Association
    Rule Discovery from Web Usage Data
      Basic Concepts of Association Rule
    Association   & Apriori Algorithm
    Paper: Mining Topic-Specific Concepts
     and Definitions on the Web
    Paper: Effective Personalization Based on
     Association Rule Discovery from Web
     Usage Data
      What is association rule mining?

      Methods for association rule mining

      Examples

      Extensions of association rule

       What Is Association Rule Mining?
      Frequent patterns: patterns (sets of items,
       sequences, etc.) that occur frequently in a
       database

      Frequent pattern mining: finding regularities in
       data
        What products were often purchased together?
           Beer and diapers?!

        What are the subsequent purchases after buying a
        Can we automatically profile customers?
       Basic Concepts of Association Rule
      Given: (1) database of transactions, (2) each transaction is
       a list of items (purchased by a customer in a visit)
      Find: all rules that correlate the presence of one set of
       items with that of another set of items
           E.g., 98% of people who purchase tires and auto accessories also
            get automotive services done
      Applications
           * => Maintenance Agreement (What should the store do to boost
            Maintenance Agreement sales?)
           Home Electronics => * (What other products should the store
            stock up on?)
           Attached mailing in direct marketing

        Association Rule Definitions
      Set of items: I = {I1, I2, …, Im}
      Transactions: D = {t1, t2, …, tn} is a set of
       transactions, where each transaction t is a set of
       items
      Itemset: {Ii1, Ii2, …, Iik} ⊆ I
      Support of an itemset: Percentage of transactions
       which contain that itemset.
      Large (Frequent) itemset: Itemset whose number
       of occurrences is above a threshold.
              Rule Measures: Support &
                     Confidence
   An association rule has the form X => Y, where X and Y are
    subsets of I and X INTERSECT Y = EMPTY.

   Each rule has two measures of value: support and confidence.

   Support indicates how frequently the pattern occurs, and
    confidence denotes the strength of implication in the rule.

   The support of the rule X => Y is support(X UNION Y).

   c is the CONFIDENCE of the rule X => Y if c% of the transactions that
    contain X also contain Y, which can be written as the ratio:
           support(X UNION Y) / support(X)

           Support & Confidence: An Example
   Let minimum support be 50% and minimum
     confidence be 50%; then we have:
       A => C (50%, 66.6%)
       C => A (50%, 100%)

                TransactionID ItemsBought
                    2000      A,B,C
                    1000      A,C
                    4000      A,D
                    5000      B,E,F
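
The two measures can be checked on the four transactions above. The following is a minimal Python sketch, not part of the original slides; the function names `support` and `confidence` are chosen here for illustration.

```python
# Transactions from the slide, keyed by transaction ID.
transactions = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, db):
    """Fraction of transactions in db that contain every item of itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db.values()) / len(db)

def confidence(lhs, rhs, db):
    """support(lhs UNION rhs) / support(lhs) for the rule lhs => rhs."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support({"A", "C"}, transactions))        # support shared by A => C and C => A
print(confidence({"A"}, {"C"}, transactions))   # confidence of A => C
print(confidence({"C"}, {"A"}, transactions))   # confidence of C => A
```

Both rules share the same support because support depends only on the union of the two sides; only the denominator of the confidence changes.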
       Types of Association Rule Mining
      Boolean vs. quantitative associations
       (Based on the types of values handled)
         buys(x, “computer”) => buys(x, “financial software”)
          [0.2%, 60%]
         age(x, “30..39”) ^ income(x, “42..48K”) => buys(x,
          “PC”) [1%, 75%]

      Single-dimensional vs. multi-dimensional associations
         buys(x, “computer”) => buys(x, “financial software”)
          [0.2%, 60%]
         age(x, “30..39”) ^ income(x, “42..48K”) => buys(x,
          “PC”) [1%, 75%]
       Types of Association Rule Mining
      Single level vs. multiple-level analysis
         What brands of beers are associated with
          what brands of diapers?

      Various extensions
        Correlation, causality analysis
            Association does not necessarily imply
                correlation or causality
         Constraints enforced
            E.g., do small sales (sum < 100) trigger big buys
                (sum > 1,000)?
                 Association Discovery
      Given a user-specified minimum support (called MINSUP)
       and minimum confidence (called MINCONF), an important
       PROBLEM is to find all high-confidence rules over large itemsets
       (frequent sets, i.e., sets with high support), where support and
       confidence are larger than minsup and minconf.

      This problem can be decomposed into two subproblems:

      1. Find all large itemsets: those with support > minsup (frequent
       sets).

      2. For a large itemset X and B ∈ X (or Y ⊂ X), find those rules
       X\{B} => B (X−Y => Y) for which confidence > minconf.

      Itemset: a set of items
        E.g., acm = {a, c, m}
      Support of itemsets
        Sup(acm) = 3
      Given min_sup = 3, acm is a frequent pattern
      Frequent pattern mining: find all frequent patterns in a
       database

      Transaction database TDB
        TID     Items bought
        100     f, a, c, d, g, i, m, p
        200     a, b, c, f, l, m, o
        300     b, f, h, j, o
        400     b, c, k, s, p
        500     a, f, c, e, l, p, m, n
      Mining Association Rules: An Example

    Min. support 50%
    Min. confidence 50%

    Transaction ID   Items Bought
    2000             A,B,C
    1000             A,C
    4000             A,D
    5000             B,E,F

    Frequent Itemset   Support
    {A}                75%
    {B}                50%
    {C}                50%
    {A,C}              50%

 For rule A => C:
    support = support({A, C}) = 50%
    confidence = support({A, C}) / support({A}) = 66.6%
 The Apriori principle:
    Any subset of a frequent itemset must be frequent
                Rules from frequent sets

    X = {mustard, sausage, beer}; frequency = 0.4
    Y = {mustard, sausage, beer, chips};
     frequency = 0.2
    If the customer buys mustard, sausage,
     and beer, then the probability that he/she
     buys chips is 0.2 / 0.4 = 0.5

      Mine:
        Sequential patterns
           find inter-transaction patterns such that the presence of a set
            of items is followed by another item in the time-stamp
            ordered transaction set.
        Periodic patterns
           It can be envisioned as a tool for forecasting and prediction
            of the future behavior of time-series data.
        Structural Patterns
           Structural patterns describe how classes and objects can be
            combined to form larger structures.

                Application Difficulties
      Wal-Mart knows that customers who buy Barbie
       dolls have a 60% likelihood of buying one of three
       types of candy bars.
      What does Wal-Mart do with information like
       that? 'I don't have a clue,' says Wal-Mart's chief of
       merchandising, Lee Scott
      Diapers and beer urban legend

                Thank You!

             CSE 634
Data Mining Concepts and Techniques

   Association & Apriori Algorithm
             Tania Irani

   Course Instructor: Prof. Anita Wasilewska
   State University of New York, Stony Brook

   Data Mining: Concepts & Techniques by Jiawei
    Han and Micheline Kamber

   Presentation Slides of Prof. Anita Wasilewska

   The Apriori Algorithm (Mining single-dimensional
    boolean association rules)

   Frequent-Pattern Growth (FP-Growth) Method

   Summary
         The Apriori Algorithm: Key Concepts
   k-itemset: An itemset having k items in it.

   Support or Frequency: Number of transactions that contain a
    particular itemset.

   Frequent Itemset: An itemset that satisfies minimum support
    (Lk denotes the set of frequent k-itemsets).

   Apriori Property: All non-empty subsets of a frequent itemset must
    be frequent.

   Join Operation: Ck, the set of candidate k-itemsets is generated by
    joining Lk-1 with itself. (L1: frequent 1-itemset, Lk: frequent k-itemset)

   Prune Operation: Lk, the set of frequent k-itemsets is extracted from
    Ck by pruning it – getting rid of all the non-frequent k-itemsets in Ck

    Iterative level-wise approach: k-itemsets are used to explore (k+1)-itemsets.

    The Apriori Algorithm finds frequent k-itemsets.
      How is the Apriori Property used in the Algorithm?
   Mining single-dimensional Boolean association
    rules is a two-step process:

     Using the Apriori Property, find the frequent itemsets:
          Each iteration generates Ck (candidate k-itemsets, from
           Lk-1) and Lk (frequent k-itemsets)
     Use the frequent k-itemsets to generate association
      rules
Finding frequent itemsets using the Apriori
Algorithm: Example

   TID    List of Items
   T100   I1, I2, I5
   T200   I2, I4
   T300   I2, I3
   T400   I1, I2, I4
   T500   I1, I3
   T600   I2, I3
   T700   I1, I3
   T800   I1, I2, I3, I5
   T900   I1, I2, I3

   Consider a database D consisting of 9 transactions.
   Each transaction is represented by an itemset.
   Suppose the min. support count required is 2 (2 out of 9 = 2/9 = 22%).
   Say the min. confidence required is 70%.
   We first have to find the frequent itemsets using the Apriori Algorithm.
   Then, association rules will be generated using min. support &
    min. confidence.
 Step 1: Generating candidate and frequent 1-
         itemsets with min. support = 2

 Scan D for the count of each candidate (C1):

     Itemset    Sup.Count
     {I1}       6
     {I2}       7
     {I3}       6
     {I4}       2
     {I5}       2

 Compare each candidate support count with the minimum support count (L1):

     Itemset    Sup.Count
     {I1}       6
     {I2}       7
     {I3}       6
     {I4}       2
     {I5}       2

 In the first iteration of the algorithm, each item is a member of the set
of candidate 1-itemsets C1, along with its support count.
 The set of frequent 1-itemsets L1 consists of the candidate 1-itemsets
satisfying minimum support.
       Step 2: Generating candidate and frequent 2-
               itemsets with min. support = 2

 Generate C2 candidates from L1 x L1, then scan D for the count of each
 candidate (C2):

     Itemset     Sup.Count
     {I1, I2}    4
     {I1, I3}    4
     {I1, I4}    1
     {I1, I5}    2
     {I2, I3}    4
     {I2, I4}    2
     {I2, I5}    2
     {I3, I4}    0
     {I3, I5}    1
     {I4, I5}    0

 Compare each candidate support count with the minimum support count (L2):

     Itemset     Sup.Count
     {I1, I2}    4
     {I1, I3}    4
     {I1, I5}    2
     {I2, I3}    4
     {I2, I4}    2
     {I2, I5}    2

 Note: We haven’t used the Apriori Property yet!


     Step 3: Generating candidate and frequent 3-
             itemsets with min. support = 2

 Generate C3 candidates from L2 (join step):

     Itemset
     {I1, I2, I3}
     {I1, I2, I5}
     {I1, I3, I5}    contains a non-frequent 2-itemset subset
     {I2, I3, I4}    contains a non-frequent 2-itemset subset
     {I2, I3, I5}    contains a non-frequent 2-itemset subset
     {I2, I4, I5}    contains a non-frequent 2-itemset subset

 Scan D for the count of each remaining candidate (C3):

     Itemset         Sup.Count
     {I1, I2, I3}    2
     {I1, I2, I5}    2

 Compare each candidate support count with the minimum support count (L3):

     Itemset         Sup.Count
     {I1, I2, I3}    2
     {I1, I2, I5}    2

  The generation of the set of candidate 3-itemsets C3 involves use of
 the Apriori Property.
  When the Join step is complete, the Prune step is used to reduce the
 size of C3. The Prune step helps to avoid heavy computation due to a large Ck.
       Step 4: Generating frequent 4-itemsets
   L3 Join L3 gives C4 = {{I1, I2, I3, I5}}

   This itemset is pruned since its subset {I2, I3, I5} is not
    frequent.

   Thus, C4 = φ, and the algorithm terminates, having found
    all of the frequent itemsets.

   This completes our Apriori Algorithm. What’s Next ?

   These frequent itemsets will be used to generate strong
    association rules (where strong association rules satisfy
    both minimum support & minimum confidence).
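
Steps 1 to 4 can be reproduced with a short Python sketch. It is an illustration written for this walkthrough, not code from the course material. As a simplifying assumption, the join is written as a plain union of two frequent (k-1)-itemsets rather than the usual prefix-based join; after the prune step both give the same candidates. The helper names `count`, `apriori_gen` and `apriori` are chosen here.

```python
from itertools import combinations

# The nine transactions of the running example.
D = [
    {"I1", "I2", "I5"}, {"I2", "I4"}, {"I2", "I3"},
    {"I1", "I2", "I4"}, {"I1", "I3"}, {"I2", "I3"},
    {"I1", "I3"}, {"I1", "I2", "I3", "I5"}, {"I1", "I2", "I3"},
]

def count(candidates, db):
    """Scan db once and return {candidate: support count}."""
    return {c: sum(c <= t for t in db) for c in candidates}

def apriori_gen(prev_frequent, k):
    """Join L(k-1) with itself, then prune candidates that have an
    infrequent (k-1)-subset (the Apriori property)."""
    joined = {a | b for a in prev_frequent for b in prev_frequent
              if len(a | b) == k}
    return {
        c for c in joined
        if all(frozenset(s) in prev_frequent for s in combinations(c, k - 1))
    }

def apriori(db, min_count):
    """Return [L1, L2, ...], each a dict mapping frequent itemsets to counts."""
    items = {frozenset([i]) for t in db for i in t}
    level = {c: n for c, n in count(items, db).items() if n >= min_count}
    levels = []
    k = 1
    while level:
        levels.append(level)
        k += 1
        candidates = apriori_gen(set(level), k)
        level = {c: n for c, n in count(candidates, db).items()
                 if n >= min_count}
    return levels

levels = apriori(D, min_count=2)
for k, level in enumerate(levels, start=1):
    print(f"L{k}:", {tuple(sorted(c)): n for c, n in level.items()})
```

The loop stops at k = 4 because the only joined candidate, {I1, I2, I3, I5}, is removed by the prune step.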
     Step 5: Generating Association Rules from
                frequent k-itemsets
   Procedure:
       For each frequent itemset l, generate all nonempty proper subsets of l

       For every nonempty proper subset s of l, output the rule “s => (l - s)” if
        support_count(l) / support_count(s) ≥ min_conf, where min_conf is the
        minimum confidence threshold (70% in our case).

   Back To Example:
       Let's take l = {I1,I2,I5}

       The nonempty proper subsets of l are {I1,I2}, {I1,I5}, {I2,I5},
        {I1}, {I2}, {I5}
    Step 5: Generating Association Rules from
            frequent k-itemsets [Cont.]

   The resulting association rules are:
     R1:   I1 ^ I2 => I5
          Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%
          R1 is Rejected.
     R2:   I1 ^ I5 => I2
          Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%
          R2 is Selected.
     R3:   I2 ^ I5 => I1
          Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%
          R3 is Selected.
Step 5: Generating Association Rules from
        Frequent Itemsets [Cont.]
   R4: I1 => I2 ^ I5
     Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%

     R4 is Rejected.

   R5: I2 => I1 ^ I5
     Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%

     R5 is Rejected.

   R6: I5 => I1 ^ I2
     Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%

     R6 is Selected.

We have found three strong association rules.
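
The rule-generation procedure for l = {I1, I2, I5} can be sketched in Python as follows. This is an illustrative sketch, not the presenters' code; the support counts are copied from the tables of the worked example, and the function name `rules_from` is an assumption.

```python
from itertools import combinations

# Support counts taken from the worked example (steps 1 to 3).
sc = {
    frozenset({"I1"}): 6, frozenset({"I2"}): 7, frozenset({"I5"}): 2,
    frozenset({"I1", "I2"}): 4, frozenset({"I1", "I5"}): 2,
    frozenset({"I2", "I5"}): 2, frozenset({"I1", "I2", "I5"}): 2,
}

def rules_from(l, sc, min_conf):
    """Yield (antecedent, consequent, confidence) for every rule
    s => (l - s) whose confidence reaches min_conf."""
    l = frozenset(l)
    for size in range(1, len(l)):              # nonempty proper subsets only
        for s in combinations(sorted(l), size):
            s = frozenset(s)
            conf = sc[l] / sc[s]
            if conf >= min_conf:
                yield s, l - s, conf

strong = list(rules_from({"I1", "I2", "I5"}, sc, min_conf=0.7))
for lhs, rhs, conf in strong:
    print(" ^ ".join(sorted(lhs)), "=>", " ^ ".join(sorted(rhs)), f"({conf:.0%})")
```

With the 70% threshold, only the rules whose antecedent contains I5 survive, matching R2, R3 and R6.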

   The Apriori Algorithm (Mining single dimensional
    boolean association rules)

   Frequent-Pattern Growth (FP-Growth) Method

   Summary
    Mining Frequent Patterns Without Candidate Generation
   Compress a large database into a compact, Frequent-
    Pattern tree (FP-tree) structure
        Highly condensed, but complete for frequent pattern mining
        Avoid costly database scans
   Develop an efficient, FP-tree-based frequent pattern mining method
        A divide-and-conquer methodology:
             Compress DB into FP-tree, retain itemset associations
             Divide the new DB into a set of conditional DBs – each
              associated with one frequent item
             Mine each such database separately
        Avoid candidate generation
       FP-Growth Method: An Example

TID        List of Items
T100       I1, I2, I5
T200       I2, I4
T300       I2, I3
T400       I1, I2, I4
T500       I1, I3
T600       I2, I3
T700       I1, I3
T800       I1, I2, I3, I5
T900       I1, I2, I3

   Consider the previous example of a database D, consisting of
    9 transactions.
   Suppose the min. support count required is 2 (i.e., min_sup =
    2/9 = 22%).
   The first scan of the database is the same as in Apriori; it
    derives the set of 1-itemsets and their support counts.
   The set of frequent items is sorted in order of
    descending support count.
   The resulting set is denoted as
    L = {I2:7, I1:6, I3:6, I4:2, I5:2}
    FP-Growth Method: Construction of FP-Tree
   First, create the root of the tree, labeled “null”.
   Scan the database D a second time (the first scan created the
    1-itemsets and then L); this generates the complete tree.
   The items in each transaction are processed in L order (i.e., sorted
    in order of descending support count).
   A branch is created for each transaction, with each item followed by
    its support count, separated by a colon.
   Whenever the same node is encountered in another transaction, we
    just increment the support count of the common node or prefix.
   To facilitate tree traversal, an item header table is built so that each
    item points to its occurrences in the tree via a chain of node-links.
   Now, the problem of mining frequent patterns in the database is
    transformed into that of mining the FP-Tree.
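  FP-Growth Method: Construction of FP-Tree

    Item header table (each entry also holds a node-link into the tree):

    Item Id   Sup Count
    I2        7
    I1        6
    I3        6
    I4        2
    I5        2

    FP-Tree (indentation shows parent-child links):

    null
      I2:7
        I1:4
          I5:1
          I4:1
          I3:2
            I5:1
        I4:1
        I3:2
      I1:2
        I3:2

An FP-Tree that registers compressed, frequent pattern
information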
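
The two-scan construction described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the presenters' implementation: the class `Node`, the functions `build_fp_tree` and `show`, and the use of a plain list per item in place of chained node-links are all choices made here. Ties in support count are broken by item name, which reproduces the order L given in the example.

```python
from collections import Counter

D = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

class Node:
    """One FP-tree node: an item, its count, its parent and its children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, min_count):
    # First scan: support counts of single items, keeping the frequent ones.
    counts = Counter(item for t in db for item in t)
    frequent = {i: n for i, n in counts.items() if n >= min_count}
    # L order: descending support count, ties broken by item name.
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: pos for pos, item in enumerate(order)}

    root = Node(None, None)
    header = {item: [] for item in order}   # item -> its nodes (node-links)
    # Second scan: insert each transaction, sorted in L order, as a path.
    for t in db:
        node = root
        for item in sorted((i for i in t if i in rank), key=rank.get):
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += 1                # shared prefix: just increment
            node = child
    return root, header, order

def show(node, depth=0):
    """Print the tree as indented item:count lines."""
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

root, header, order = build_fp_tree(D, min_count=2)
show(root)
```

Transactions that share a prefix in L order share a path, which is where the compression comes from: nine transactions are stored in ten nodes below the root.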
     Mining the FP-Tree by Creating Conditional
                 (sub) pattern bases
1.    Start from each frequent length-1 pattern (as an initial
      suffix pattern).
2.    Construct its conditional pattern base which consists of
      the set of prefix paths in the FP-Tree co-occurring with
      suffix pattern.
3.    Then, construct its conditional FP-Tree & perform
      mining on this tree.
4.    The pattern growth is achieved by concatenation of the
      suffix pattern with the frequent patterns generated from
      a conditional FP-Tree.
5.    The union of all frequent patterns (generated by step
      4) gives the required frequent itemset.
                    FP-Tree Example Continued
Item    Conditional pattern base        Conditional             Frequent pattern
                                        FP-Tree                 generated
I5      {(I2 I1: 1),(I2 I1 I3: 1)}      <I2:2 , I1:2>           I2 I5:2, I1 I5:2, I2 I1 I5: 2

I4      {(I2 I1: 1),(I2: 1)}            <I2: 2>                 I2 I4: 2

I3      {(I2 I1: 2),(I2: 2), (I1: 2)}   <I2: 4, I1: 2>,<I1:2>   I2 I3:4, I1 I3: 4 , I2 I1 I3: 2

I1      {(I2: 4)}                       <I2: 4>                 I2 I1: 4

       Mining the FP-Tree by creating conditional (sub) pattern bases

Now, following the above mentioned steps:
 Let's start from I5. I5 is involved in 2 branches, namely {I2 I1 I5: 1} and {I2 I1 I3
  I5: 1}.
 Therefore, considering I5 as the suffix, its 2 corresponding prefix paths are
  {I2 I1: 1} and {I2 I1 I3: 1}, which form its conditional pattern base.
               FP-Tree Example Continued
   Out of these, only I1 & I2 are selected for the conditional FP-Tree
    because I3 does not satisfy the minimum support count.
    For I1, support count in conditional pattern base = 1 + 1 = 2
    For I2, support count in conditional pattern base = 1 + 1 = 2
    For I3, support count in conditional pattern base = 1
    Thus support count for I3 is less than required min_sup which is 2
   Now, we have a conditional FP-Tree with us.
   All frequent pattern corresponding to suffix I5 are generated by
    considering all possible combinations of I5 and conditional FP-Tree.
   The same procedure is applied to suffixes I4, I3 and I1.
   Note: I2 is not taken into consideration for suffix because it doesn’t
    have any prefix at all.
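
The suffix-by-suffix procedure can be sketched in Python. As a loudly stated simplification, this sketch does not build physical conditional FP-Trees: it keeps each conditional pattern base as a list of (prefix path, count) pairs derived from the L-ordered transactions, which yields the same patterns and counts but without merging shared prefixes. The function names are chosen here for illustration.

```python
from collections import Counter

D = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"],
    ["I1", "I2", "I4"], ["I1", "I3"], ["I2", "I3"],
    ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]

def l_order(db, min_count):
    """Frequent items sorted by descending support count (ties by name)."""
    counts = Counter(i for t in db for i in t)
    return sorted((i for i in counts if counts[i] >= min_count),
                  key=lambda i: (-counts[i], i))

def conditional_pattern_base(paths, item):
    """Prefix paths (with counts) that precede item in the given paths."""
    base = []
    for path, n in paths:
        if item in path:
            prefix = path[:path.index(item)]
            if prefix:
                base.append((prefix, n))
    return base

def fp_growth(paths, min_count, suffix=()):
    """Yield (pattern, count) for every frequent pattern ending in suffix."""
    counts = Counter()
    for path, n in paths:
        for item in path:
            counts[item] += n
    for item, n in counts.items():
        if n < min_count:
            continue                      # e.g. I3 in the base of suffix I5
        pattern = (item,) + suffix
        yield pattern, n
        yield from fp_growth(conditional_pattern_base(paths, item),
                             min_count, pattern)

order = l_order(D, 2)
rank = {item: pos for pos, item in enumerate(order)}
paths = [(tuple(sorted((i for i in t if i in rank), key=rank.get)), 1)
         for t in D]

print(conditional_pattern_base(paths, "I5"))
patterns = dict(fp_growth(paths, 2))
for pattern, n in sorted(patterns.items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(" ".join(pattern), ":", n)
```

Each pattern is grown only toward items that come earlier in L order, so every frequent itemset is produced exactly once, and the result agrees with the itemsets found by Apriori on the same data.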
       Why Is Frequent Pattern Growth Fast?

   Performance studies show
     FP-growth is an order of magnitude faster than
      Apriori
   Reasoning
     No candidate generation, no candidate test
     Uses a compact data structure
     Eliminates repeated database scans
     Basic operations are counting and FP-tree building

   The Apriori Algorithm (Mining single
    dimensional boolean association rules)

   Frequent-Pattern Growth (FP-Growth)

   Summary
   Association rules are generated from frequent itemsets.

   Frequent itemsets are mined using Apriori algorithm or Frequent-
    Pattern Growth method.

   Apriori property states that all the subsets of frequent itemsets must
    also be frequent.

   Apriori algorithm uses frequent itemsets, join & prune methods and
    Apriori property to derive strong association rules.

   Frequent-Pattern Growth method avoids repeated database
    scanning of Apriori algorithm.

   FP-Growth method is faster than Apriori algorithm.
Thank You!
Mining Topic-Specific Concepts and
      Definitions on the Web

                          Irem Incekoy

    May 2003, Proceedings of the 12th International
    conference on World Wide Web, ACM Press

    Bing Liu, University of Illinois at Chicago, 851 S. Morgan
    Street Chicago IL 60607-7053
    Chee Wee Chin,
    Hwee Tou Ng, National University of Singapore
                   3 Science Drive 2 Singapore
   Agrawal, R. and Srikant, R. “Fast Algorithms for
    Mining Association Rules”, VLDB-94, 1994.

   Anderson, C. and Horvitz, E. “Web Montage: A
    Dynamic Personalized Start Page”, WWW-02.

   Brin, S. and Page, L. “The Anatomy of a Large-
    Scale Hypertextual Web Search Engine”,
    WWW7, 1998.
 When one wants to learn about a topic,
  one reads a book or a survey paper.
 One can read the research papers about
  the topic.
 None of these is very practical.
 Learning from the web is convenient, intuitive,
  and diverse.
Purpose of the Paper
   This paper’s task is “mining topic-specific
    knowledge on the Web”.

   The goal is to help people learn in-depth
    knowledge of a topic systematically on the
    Web.
Learning about a New Topic
 One needs to find definitions and
  descriptions of the topic.
 One also needs to know the sub-topics
  and salient concepts of the topic.
 Thus, one wants the knowledge as
  presented in a traditional book.
 The task of this paper can be summarized
  as “compiling a book on the Web”.
Proposed Technique
   First, identify sub-topics or salient
    concepts of that specific topic.

   Then, find and organize the informative
    pages containing definitions and
    descriptions of the topic and sub-topics.
Why are the current search
techniques not sufficient?
   For definitions and descriptions of the topic:
    Existing search engines rank web pages based on
    keyword matching and hyperlink structures. This is NOT very
    useful for measuring the informative value of a page.

   For sub-topics and salient concepts of the topic:
    A single web page is unlikely to contain information
    about all the key concepts or sub-topics of the topic.
    Thus, sub-topics need to be discovered from multiple
    web pages. Current search engine systems do not
    perform this task.
Related Work
   Web information extraction wrappers
   Web query languages
   User preference approach
   Question answering in information retrieval

•   Question answering is a closely-related work to this
    paper. The objective of a question-answering system is
    to provide direct answers to questions submitted by the
    user. In this paper’s task, many of the questions are
    about definitions of terms.
The Algorithm
WebLearn (T)

1) Submit T to a search engine, which returns a set of relevant pages
2) The system mines the sub-topics or salient concepts of T using a set
    S of top ranking pages from the search engine
3) The system then discovers the informative pages containing
    definitions of the topic and sub-topics (salient concepts) from S
4) The user views the concepts and informative pages.
   If s/he still wants to know more about sub-topics then
      for each user-interested sub-topic Ti of T do
          WebLearn (Ti);
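
The recursive structure of WebLearn can be sketched in Python. This is only an outline of the control flow: the four callables `search`, `mine_subtopics`, `find_definitions` and `wants_more` are hypothetical stand-ins for the search engine, the miners and the user, and the `max_depth` guard is an assumption added here so the sketch always terminates.

```python
def web_learn(topic, search, mine_subtopics, find_definitions, wants_more,
              depth=0, max_depth=2):
    """Recursive sketch of WebLearn(T); all four callables are hypothetical."""
    pages = search(topic)                                    # step 1
    subtopics = mine_subtopics(topic, pages)                 # step 2
    definitions = find_definitions(topic, subtopics, pages)  # step 3
    result = {"topic": topic, "definitions": definitions, "subtopics": []}
    # Step 4: recurse into the sub-topics the user wants to explore further.
    if depth < max_depth:
        for sub in subtopics:
            if wants_more(sub):
                result["subtopics"].append(
                    web_learn(sub, search, mine_subtopics, find_definitions,
                              wants_more, depth + 1, max_depth))
    return result

# Toy stand-ins, used only to exercise the control flow.
fake_web = {"data mining": ["classification", "clustering"]}
tree = web_learn(
    "data mining",
    search=lambda t: [f"page about {t}"],
    mine_subtopics=lambda t, pages: fake_web.get(t, []),
    find_definitions=lambda t, subs, pages: pages,
    wants_more=lambda sub: sub == "clustering",
)
print(tree)
```

Passing the components in as functions keeps the sketch independent of any particular search engine or mining technique.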
Sub-Topic or Salient Concept
   Observation:
    Sub-topics or salient concepts of a topic are
    important word phrases, usually emphasized
    using some HTML tags (e.g.,

   However, this is not sufficient. Data mining
    techniques can help to find the frequently
    occurring word phrases.
Sub-Topic Discovery
   After obtaining a set of relevant top-
    ranking pages (using Google), sub-topic
    discovery consists of the following 5 steps.

1) Filter out the “noisy” documents that
  rarely contain sub-topics or salient
  concepts. The resulting set of documents
  is the source for sub-topic discovery.
Sub-Topic Discovery
2) Identify important phrases in each page (discover
    phrases emphasized by HTML markup tags).

    Rules to determine if a markup tag can safely be ignored
   Contains a salutation title (Mr, Dr, Professor).
   Contains a URL or an email address.
   Contains terms related to a publication (conference,
    proceedings, journal).
   Contains an image between the markup tags.
   Too lengthy (the paper uses 15 words as the upper limit)
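
The five ignore rules can be sketched as a small Python filter. This is an illustrative approximation, not the paper's implementation: the regular expressions, the word lists and the function name `can_ignore` are assumptions made here, and tags are stripped with a crude pattern rather than a real HTML parser.

```python
import re

SALUTATIONS = {"mr", "dr", "professor"}
PUBLICATION_TERMS = {"conference", "proceedings", "journal"}
URL_OR_EMAIL = re.compile(r"(https?://|www\.)\S+|\S+@\S+\.\S+", re.IGNORECASE)
MAX_WORDS = 15  # upper limit stated on the slide

def can_ignore(emphasized_html):
    """Return True if the emphasized segment matches one of the ignore rules."""
    if re.search(r"<img\b", emphasized_html, re.IGNORECASE):
        return True                                   # image between the tags
    text = re.sub(r"<[^>]+>", " ", emphasized_html)   # crude tag stripping
    if URL_OR_EMAIL.search(text):
        return True                                   # URL or email address
    words = [w.strip(".,:;()").lower() for w in text.split()]
    words = [w for w in words if w]
    if len(words) > MAX_WORDS:
        return True                                   # too lengthy
    # Salutation title or publication-related term.
    return any(w in SALUTATIONS or w in PUBLICATION_TERMS for w in words)

for segment in ["Decision Trees", "Dr. Smith", "Contact: jane@example.org",
                "Proceedings of KDD", '<img src="logo.gif">']:
    print(repr(segment), can_ignore(segment))
```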
Sub-Topic Discovery
   Also, in this step, some preprocessing
    techniques such as stopwords removal
    and word stemming are applied in order to
    extract quality text segments.

   Stopwords removal: Eliminating the words that occur
    too frequently and have little informational meaning.
   Word stemming: Finding the root form of a word by
    removing its suffix.
Sub-Topic Discovery
   3) Mine frequently occurring phrases:
    - Each piece of text extracted in step 2 is stored in a
     dataset called a transaction set.
    - Then, an association rule miner based on Apriori
     algorithm is executed to find those frequent itemsets. In
     this context, an itemset is a set of words that occur
     together, and an itemset is frequent if it appears in more
     than two documents.
    - We only need the first step of the Apriori algorithm and
     we only need to find frequent itemsets with three words
     or fewer (this restriction can be relaxed).
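
Step 3 can be illustrated with a small Python sketch. It is not the paper's miner: instead of a level-wise Apriori pass it simply enumerates every word set of up to three words per phrase, which is affordable because phrases are short. The function name `frequent_word_sets`, the input layout (one list of already stemmed phrases per page) and the sample phrases are assumptions made here.

```python
from collections import defaultdict
from itertools import combinations

def frequent_word_sets(pages, min_docs=3, max_words=3):
    """pages: one list of emphasized phrases per page.
    Returns {word set: number of pages in which some phrase contains it},
    keeping sets found in at least min_docs pages (more than two)."""
    docs_with = defaultdict(set)
    for page_id, phrases in enumerate(pages):
        for phrase in phrases:
            words = sorted(set(phrase.lower().split()))
            for size in range(1, min(max_words, len(words)) + 1):
                for combo in combinations(words, size):
                    docs_with[frozenset(combo)].add(page_id)
    return {s: len(d) for s, d in docs_with.items() if len(d) >= min_docs}

# Made-up phrases for four pages, for illustration only.
pages = [
    ["decision tree", "home page"],
    ["decision tree induction"],
    ["tree decision", "neural network"],
    ["neural network"],
]
result = frequent_word_sets(pages)
for words, n in result.items():
    print(sorted(words), n)
```

Word order is deliberately ignored at this stage; the order of words within a sub-topic is determined in step 4.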
Sub-Topic Discovery
   4) Eliminate itemsets that are unlikely to
    be sub-topics, and determine the
    sequence of words in a sub-topic.

   Heuristic: If an itemset does not appear alone
    as an important phrase in any page, it is unlikely
    to be a main sub-topic and it is removed.
Sub-Topic Discovery

   5) Rank the remaining itemsets. The
    remaining itemsets are regarded as the
    sub-topics or salient concepts of the
    search topic and are ranked based on the
    number of pages in which they occur.
Definition Finding
   This step tries to identify those pages that
    include definitions of the search topic and its
    sub-topics discovered in the previous step.
   Preprocessing steps:
   Text that will not be displayed by browsers (e.g.,
    <script>...</script>, <!-- comments -->) is ignored.
   Word stemming is applied.
   Stopwords and punctuation are kept as they serve as
    clues to identify definitions.
   HTML tags within a paragraph are removed.
Definition Finding
   After that, the following patterns are applied to
    identify definitions:

    [1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining Topic-Specific Concepts and Definitions on the Web
Definition Finding
   Besides using the above patterns, the paper
    also relies on HTML structuring and hyperlink
    structures:
   1) If a page contains only one header or one big
    emphasized text segment at the beginning in the entire
    document, then the document contains a definition of the
    concept in the header.
   2) Definitions at the second level of the hyperlink
    structure are also discovered. All the patterns and
    methods described above are applied to these second
    level documents.
Definition Finding
   Observation: Sometimes no informative page is
    found for a particular sub-topic when the pages
    for the main topic are very general and do not
    contain detailed information for sub-topics.

   In such cases, the sub-topic can be submitted to
    the search engine and sub-subtopics may be
    found recursively.
Dealing with Ambiguity
   One of the difficult problems in concept mining is
    the ambiguity of the search terms (e.g.,
   A search engine may not return any page in the
    right context in its top ranking pages.
   Partial solution: adding terms that can represent
    the context (e.g., classification data mining).
   Disadvantage: returned web pages focus more
    on the context words since they represent a
    larger concept.
Dealing with Ambiguity
   To handle this problem: First reduce the
    ambiguity of a search topic by using context
    words. Then,
   1) Finding salient concepts only in the segment
    describing the topic or sub-topic. (using HTML
    structuring tags as cues).
   2) Identifying those pages that hierarchically organize
    knowledge of the parent topic. To identify such pages,
    we can parse the HTML nested list items (e.g., <li>)
    structure by building a tree.
Dealing with Ambiguity
  • We confirm whether it is a correct page by finding if
  the hierarchy contains at least another sub-topic of
  the parent topic.
           An example of a well-organized topic hierarchy

 [1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining Topic-Specific Concepts and Definitions on the Web
Dealing with Ambiguity
   Finding salient concepts enclosed within braces
    illustrating examples.
      There are many clustering approaches (e.g., hierarchical,
    partitioning, k-means, k-medoids), and we add that efficiency is
    important if the clusters contain many points.
   The execution of the algorithm can stop when
    most of the salient concepts found are parallel
    concepts of the search topic.
Mutual Reinforcement
 • This method applies when we have already found the
   sub-topics of a topic and want to find the salient
   concepts of those sub-topics, going one level further
   down.
 • Often, when one searches for a sub-topic S1, one also
   finds important information about another sub-topic S2
   due to the ranking algorithm used by the search engine.
 • This method works in two steps:
 1) Submit each sub-topic individually to the search engine.
 2) Combine the top-ranking pages from each search into
   one set, and apply the proposed techniques to the whole
   set to look for all sub-topics.
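The two steps can be sketched as follows. The function mutual_reinforcement and its parameters search and mine_concepts are hypothetical placeholders: search stands in for the search engine query and mine_concepts for the salient-concept techniques described earlier, neither of which is specified at this level of detail in the slides.

```python
from typing import Callable, Iterable, List, Set


def mutual_reinforcement(
    sub_topics: Iterable[str],
    search: Callable[[str], List[str]],
    mine_concepts: Callable[[List[str]], Set[str]],
    top_n: int = 10,
) -> Set[str]:
    """Pool the top-ranking pages of every sub-topic query, then mine the pool once."""
    pooled: List[str] = []
    seen: Set[str] = set()
    for topic in sub_topics:
        # Step 1: submit each sub-topic individually to the search engine.
        for page in search(topic)[:top_n]:
            # Step 2: combine the results into one set, dropping duplicates.
            if page not in seen:
                seen.add(page)
                pooled.append(page)
    # The mining techniques are applied to the whole pooled set.
    return mine_concepts(pooled)
```

Because the pages for S1 and S2 are mined together, information about S2 that only shows up in the results for S1 still contributes to the concepts found for S2.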
System Architecture
    The overall system is composed of five main
    components:
1)   A search engine: This is a standard web search
     engine (Google is used in this system).
2)   A crawler: It crawls the World Wide Web to download
     the top-ranking pages returned by the search
     engine. It stores the pages in the "Web Page Depository".
3)   A salient concept miner: It uses the sub-topic
     discovery techniques explained before to search the
     pages stored in the "Web Page Depository", in order to
     identify and extract the sub-topics and salient
     concepts.
System Architecture

4) A definition finder: It uses the technique presented in
   the definition finding section to search through the pages
   stored in the "Web Page Depository" to find the
   informative pages containing definitions of the topics and
   the sub-topics.

5) A user interface: It enables the user to interact with the
   system.
System Architecture

  [1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining Topic-Specific Concepts and Definitions on the Web
Experimental Study
   • The size of the set of documents is limited to the
     first hundred results returned by Google.
   • Table 1 shows the sub-topics and salient
     concepts discovered for 28 search topics.
   • In each box, the first line gives the search topic.
     For each topic, only the ten top-ranking concepts
     are listed.
   • For topics that are too specific, only definition finding is
Experimental Study
   • In Table 2, the precision of the definition-finding task is
     compared with the Google search engine and
     AskJeeves, the web's premier question-answering

   • The first 10 pages of the system's results are compared
     with the first 10 pages returned by Google and AskJeeves.
     For a fair comparison, the authors also look for
     definitions in the second level of the search results
     returned by Google and AskJeeves.

Table 2
Experimental Study
   • Table 3 presents the results for ambiguity handling by
     applying the respective methods explained before.
   • Column 1 lists two ambiguous topics, "data mining"
     and "time series". Column 2 lists the sub-topics identified
     using the original technique.
   • Column 3 gives the sub-topics discovered using the
     respective parent topics as context terms.
   • Column 4 uses the ambiguity handling techniques. Column 5
     applies mutual reinforcement in addition to the others.

   • The proposed techniques aim at helping
     Web users to learn an unfamiliar topic
     in depth and systematically.

   • This is an efficient system to discover and
     organize knowledge on the web, in a way
     similar to a traditional book, to assist
Effective Personalization Based on
          Association Rule
 Discovery from Web Usage Data

            Mikhail Bautin

Bamshad Mobasher, Honghua Dai, Tao
       Luo, Miki Nakagawa
  DePaul University 243 S. Wabash Ave.
   Chicago, Illinois 60604, USA (2001)
   • B. Mobasher, H. Dai, T. Luo, and M. Nakagawa.
     Effective personalization based on association rule
     discovery from Web usage data. In Proc. of the 3rd ACM
     Workshop on Web Information and Data Management
     (WIDM01), 2001.
   • R. Agarwal, C. Aggarwal, and V. Prasad. A tree
     projection algorithm for generation of frequent itemsets.
     In Proc. of the High Performance Data Mining
     Workshop, Puerto Rico, 1999.
   • R. Agrawal and R. Srikant. Fast algorithms
     for mining association rules. In Proc. of the 20th Int.
     Conference on Very Large Data Bases (VLDB94), 1994.
   • Personalize a web site:
     • Predict actions of the user (pre-fetching etc.)
     • Recommend new items to a customer based
       on viewed items and knowledge of what other
       customers are interested in:
       "Customers who buy this also buy that..."
   • Collaborative filtering
     • Find top k users who have similar tastes or
       interests (k-nearest-neighbor)
     • Predict actions based on what those users did
     • Too much online computation needed

   • Association rules
     • Scalable: constant-time query processing
     • Better precision and coverage than CF
Data Preparation
 • Input: web server logs
 • Steps:
     • User identification (trivial if using cookies)
     • Session and transaction identification
     • Page view identification (for multi-frame sites)
   • As a result of preparation:
     • Records correspond to transactions
     • Items correspond to page views
     • Order of page views does not matter
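A minimal sketch of the preparation step, assuming the log has already been reduced to (user_id, timestamp, page_view) records and that a session ends after 30 minutes of inactivity. Both the record layout and the timeout are assumptions for illustration; the slides do not state how sessions are delimited.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed cut-off, not taken from the paper


def build_transactions(
    log: Iterable[Tuple[str, float, str]],
    timeout: float = SESSION_TIMEOUT,
) -> List[FrozenSet[str]]:
    """Turn (user_id, timestamp, page_view) records into unordered transactions."""
    by_user: Dict[str, List[Tuple[float, str]]] = defaultdict(list)
    for user, ts, page in log:
        by_user[user].append((ts, page))

    transactions: List[FrozenSet[str]] = []
    for visits in by_user.values():
        visits.sort()  # chronological order within one user
        current = {visits[0][1]}
        last_ts = visits[0][0]
        for ts, page in visits[1:]:
            if ts - last_ts > timeout:
                # A long gap closes the current session and starts a new one.
                transactions.append(frozenset(current))
                current = set()
            current.add(page)
            last_ts = ts
        transactions.append(frozenset(current))
    return transactions
```

Each transaction is a frozenset, which reflects the last point above: the order of page views inside a transaction is discarded.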
Pattern Discovery
   • Running the Apriori algorithm
     • Records = transactions, items = page views
     • Minimum support and confidence restriction
     • Problem with a global minimum support value:
       important but rare items can be discarded
     • Solution: multiple minimum support values.
       For itemset {p1, ..., pn} require
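For reference, a minimal sketch of the level-wise Apriori search over page-view transactions. It uses a single global minimum support count and does not implement the multiple-minimum-support variant mentioned above; the function name apriori and the toy data in the usage note are illustrative only.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable


def apriori(
    transactions: Iterable[FrozenSet[str]], min_support: int
) -> Dict[FrozenSet[str], int]:
    """Return every itemset whose support count is at least min_support."""
    transactions = list(transactions)
    items = {item for t in transactions for item in t}
    candidates = [frozenset([item]) for item in items]
    frequent: Dict[FrozenSet[str], int] = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate itemset.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent
```

With the five transactions {A,B}, {A,B,C}, {A,C}, {B,C}, {A,B,C} and min_support=3, every single page and every pair is frequent, while {A,B,C} (support 2) is discarded.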
Recommendation Engine
 • Fixed-size sliding window w:
   subset of |w| most recent page views
 • Need to find rules with w on the left
 • This is done with depth-first search
 • Sort elements of w lexicographically
 • Only need O(|w|) to find the itemset and
   O(# of page views) to produce
Frequent Itemset Graph

Figure 1 from the paper (Mobasher et al.)
 • Active session window w = {B, E}
 • Solid lines: "lexicographic" extension
 • Stippled lines: any extension
 • The search leads to node BE (5) at level 3
 • Possible extensions: A and C
 • Confidence calculated as support(w ∪ {p}) / support(w)
   for a candidate page p
 • For A it is 5/5 = 1, for C it is 4/5
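The example can be reproduced with a small sketch. It assumes confidence is the support of the extended itemset divided by the support of the window, and it scans a flat dictionary of supports instead of walking the graph depth-first as the paper does. The supports of BE, ABE and BCE follow the slide; the other entries and the name recommend are made up for illustration.

```python
from typing import Dict, FrozenSet, Iterable

# Partial support table: BE, ABE and BCE follow the slide, B and E are made up.
SUPPORT: Dict[FrozenSet[str], int] = {
    frozenset("B"): 6,
    frozenset("E"): 5,
    frozenset("BE"): 5,
    frozenset("ABE"): 5,
    frozenset("BCE"): 4,
}


def recommend(
    window: Iterable[str], support: Dict[FrozenSet[str], int], min_conf: float
) -> Dict[str, float]:
    """Score every one-page extension of the window by its confidence."""
    w = frozenset(window)
    if w not in support:
        return {}  # the window itself is not a frequent itemset
    scores: Dict[str, float] = {}
    for itemset, count in support.items():
        # An extension is a frequent itemset with exactly one page more than w.
        if len(itemset) == len(w) + 1 and w < itemset:
            (page,) = itemset - w
            conf = count / support[w]
            if conf >= min_conf:
                scores[page] = conf
    return scores


print(recommend({"B", "E"}, SUPPORT, 0.5))
```

For the window {B, E} this yields confidence 1.0 for A and 0.8 for C, matching the values above.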
Window size vs minsup
 • For a large window size it might be difficult
   to find frequent enough itemsets
 • But a larger window gives better accuracy
 • Solution: the "all-kth-order" method
     • Start with the largest possible window size
     • Reduce window size until able to generate a
       recommendation
     • No additional computation incurred
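The fallback can be sketched as a loop over shrinking windows. The names extensions and all_kth_order and the support table are hypothetical, and the lookup is a flat dictionary scan, not the paper's graph search. No extra mining is needed because frequent itemsets of every size are already stored.

```python
from typing import Dict, FrozenSet, List, Tuple


def extensions(
    window: List[str], support: Dict[FrozenSet[str], int], min_conf: float
) -> Dict[str, float]:
    """Confidence of every one-page extension of the window (empty if none)."""
    w = frozenset(window)
    if w not in support:
        return {}
    result: Dict[str, float] = {}
    for itemset, count in support.items():
        if len(itemset) == len(w) + 1 and w < itemset:
            conf = count / support[w]
            if conf >= min_conf:
                result[next(iter(itemset - w))] = conf
    return result


def all_kth_order(
    session: List[str],
    support: Dict[FrozenSet[str], int],
    min_conf: float,
    max_window: int,
) -> Tuple[int, Dict[str, float]]:
    """Try the largest window first and shrink it until a recommendation exists."""
    for size in range(min(max_window, len(session)), 0, -1):
        recs = extensions(session[-size:], support, min_conf)
        if recs:
            return size, recs
    return 0, {}
```

For a session D, B, E with the supports B=6, E=5, BE=5, ABE=5, BCE=4, the window of size 3 matches no frequent itemset, so the method falls back to {B, E} and recommends A and C.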
Evaluation Methodology
 • For each transaction t, the first n page views
   are used for generating recommendations
   and the last |t| - n are used for testing
 • as_t: subset of the first n elements of t
 • τ: minimum required confidence
 • R(as_t, τ): set of recommendations
 • eval_t: the last |t| - n page views of t
Measures of Evaluation

   • The threshold τ ranges from 0.1 to 1
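A sketch of how the two measures compared in the following figures could be computed. It assumes the usual definitions: precision is the fraction of recommended page views that appear in the held-out part of the transaction, and coverage is the fraction of the held-out part that was recommended. The function names are hypothetical.

```python
from typing import List, Set, Tuple


def split_transaction(t: List[str], n: int) -> Tuple[List[str], Set[str]]:
    """First n page views drive the recommendation, the rest are held out."""
    return t[:n], set(t[n:])


def precision(recommended: Set[str], held_out: Set[str]) -> float:
    """Share of recommendations that the user actually visited afterwards."""
    if not recommended:
        return 0.0
    return len(recommended & held_out) / len(recommended)


def coverage(recommended: Set[str], held_out: Set[str]) -> float:
    """Share of the later page views that the recommendations anticipated."""
    if not held_out:
        return 0.0
    return len(recommended & held_out) / len(held_out)


active, held_out = split_transaction(["A", "B", "C", "D", "E"], 2)
print(precision({"C", "F"}, held_out), coverage({"C", "F"}, held_out))
```

Raising the confidence threshold typically shrinks the recommendation set, which tends to raise precision and lower coverage.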
Impact of Window Size

Figure 2 from the paper (Mobasher et al.)
Single vs Multiple Min. Support

Figure 3 from the paper (Mobasher et al.)
The all-kth-order Model

Figure 4 from the paper (Mobasher et al.)
Association Rules vs kNN

Figure 5 from the paper (Mobasher et al.)
   • Personalization based on association rules
     is better than the k-nearest-neighbor approach:
     • Faster: very little online computation
     • Therefore, better scalability
     • Better precision
     • Better coverage

   • Effective alternative to standard
     collaborative filtering mechanisms for
Thank you!