Learning To Predict rare events in event sequences by lonyoo


    Learning to Predict Rare Events in Event Sequences

    Gary M Weiss and Haym Hirsh
          Presented by
        Veena Raja Rathna
• Aim & Introduction to the problem
• Basic problem Formulation
• Definitions & Evaluation metrics
• Learning methods
• Example & Results
• Reference
         Introduction to the problem
• To predict rare events from sequences of events
  which contain non-numerical features
• An event sequence is a series of timestamped observations,
  each described by a fixed set of features
• Why?
• Predicting events with categorical data is an
  important real-world problem
• It is not well suited to existing statistical
  and machine learning methods
• Classical time series: predict the next 'n' successive
  observations from a history of past observations
• Statistical techniques are not applicable to event
  prediction because
      they require numerical features
      they do not support predicting a specific 'event'
      within a window of time.

• Telecommunication equipment failure
• Predicting fraudulent credit card transactions
• Start of Transcription in DNA sequences
        Basic Problem Formulation
• Definitions
• Event Et: a timestamped observation which
  occurs at time 't' and is described by a set of
  feature-value pairs.
• Event Sequence S: a time-ordered sequence of events
• Domain Object D: events are associated with D,
  which is a generator of events
• Target Event Xt: the event to be predicted,
  specified by a set of feature-value pairs.
• Warning Time W: the lead time necessary for a
  prediction to be useful
• Monitoring Time M: determines the maximum
  amount of time prior to the target event for which a
  prediction is considered correct.
              Problem now reduces to
• Learning a prediction procedure P that correctly
  predicts the target events.
• P is a function that maps an event sequence to a
  boolean prediction value.
• P: Et1, Et2, Et3, …, Etn → {+, -}
• A target event is predicted if at least one
  prediction is made within its prediction period
• A prediction is correct if it falls within the
  prediction period of some target event
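
The two rules above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this sketch, and it assumes the prediction period of a target event at time T is the closed interval [T - M, T - W], which is one reading of the warning and monitoring time definitions.

```python
def prediction_period(target_time, warning, monitoring):
    """Interval of prediction times that count for a target at target_time.

    ASSUMPTION: the interval is closed, [target_time - monitoring,
    target_time - warning].
    """
    return target_time - monitoring, target_time - warning


def prediction_is_correct(pred_time, target_times, warning, monitoring):
    """A prediction is correct if it falls in the period of some target."""
    for target_time in target_times:
        start, end = prediction_period(target_time, warning, monitoring)
        if start <= pred_time <= end:
            return True
    return False


def target_is_predicted(target_time, pred_times, warning, monitoring):
    """A target is predicted if at least one prediction falls in its period."""
    start, end = prediction_period(target_time, warning, monitoring)
    return any(start <= p <= end for p in pred_times)


# Hypothetical numbers: one target event at t=110 with W=2 and M=8.
print(prediction_period(110, warning=2, monitoring=8))
print(prediction_is_correct(104, [110], warning=2, monitoring=8))
print(target_is_predicted(110, [100, 109], warning=2, monitoring=8))
```

A prediction made too late (inside the warning time) or too early (before the monitoring window opens) is treated as a false prediction in this sketch.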
    Evaluation Measures for Event Prediction
• Recall = #Target Events Predicted / Total Target Events
    It is the percentage of target events correctly predicted
• Simple Precision = TP / (TP + FP)
  TP - true predictions       FP - false predictions
     Simple precision is the percentage of predictions
  that are correct
                     Evaluation Metrics
• Normalized Precision = #Target Events Predicted /
  (#Target Events Predicted + FP)
    Replaces the number of correct predictions with the
    number of target events correctly predicted
• Reduced Precision = #Target Events Predicted /
  (#Target Events Predicted + Discounted FP)
• A prediction is active for a period equal to its
  monitoring time
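
As a rough illustration of how these measures differ, the sketch below computes all of them from a list of prediction times and target times. The function name `evaluate` is hypothetical, the prediction period is assumed to be [T - M, T - W], and the discounting rule is an assumption: a false prediction is not counted if it falls inside the active period of an earlier counted false prediction.

```python
def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def evaluate(pred_times, target_times, warning, monitoring):
    """Return recall and the three precision variants as a dict."""
    periods = [(t - monitoring, t - warning) for t in target_times]

    def in_some_period(p):
        return any(start <= p <= end for start, end in periods)

    true_preds = [p for p in pred_times if in_some_period(p)]
    false_preds = sorted(p for p in pred_times if not in_some_period(p))
    targets_predicted = sum(
        1 for start, end in periods
        if any(start <= p <= end for p in pred_times)
    )

    # ASSUMPTION: a false prediction made while an earlier counted false
    # prediction is still active (within its monitoring time) is discounted.
    discounted_fp = 0
    active_until = None
    for p in false_preds:
        if active_until is None or p > active_until:
            discounted_fp += 1
            active_until = p + monitoring

    tp, fp = len(true_preds), len(false_preds)
    return {
        "recall": _ratio(targets_predicted, len(target_times)),
        "simple_precision": _ratio(tp, tp + fp),
        "normalized_precision": _ratio(targets_predicted, targets_predicted + fp),
        "reduced_precision": _ratio(targets_predicted, targets_predicted + discounted_fp),
    }


# Hypothetical data: one target at t=110, five predictions, W=2, M=8.
print(evaluate([50, 52, 70, 104, 106], [110], warning=2, monitoring=8))
```

In this made-up example two correct predictions hit the same target event, so simple precision counts both while the normalized variant counts the target only once.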
                  Basic Learning Method
• Identify prediction patterns: the space of prediction patterns (PP)
  is searched to identify a set 'C' of candidate PP. Each pattern c ∈ C
  should do well at predicting a subset of the target events.
• Generate prediction rules: an ordered list of PP is
  generated from C. Prediction rules are then formed by
  creating a disjunction of the top n PP, thereby creating
  solutions with different precision/recall values.
• A PP is a sequence of events connected by ordering
  primitives that define sequential or temporal constraints
  between consecutive events
               Ordering primitives are
• Let A, B, C, D represent individual events
• 'Wildcard' "*" matches any number of events, so
  the PP A*D matches ABCD
• 'Next' "." matches no events, so the PP A.B.C only
  matches ABC
• 'Unordered' "|" allows events to occur in any order
  and is commutative, so the PP A|B|C will
  match, amongst others, CBA
• "|" has the highest precedence, so the pattern
  "A.B*C|D|E" matches A followed by B, followed
  sometime later by a C, D and E in any order.
• Each feature in an event is permitted to take on the
  "?" value, which matches any feature value.
• A PP also has an integer-valued pattern duration
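
One way to get a feel for the three ordering primitives is to translate a pattern over single-letter events into a regular expression. The sketch below is a simplification under stated assumptions, not the system's matcher: events are single characters, the members of an unordered group are assumed to be adjacent to one another, a match may start anywhere in the sequence, and the pattern duration is ignored. The names `compile_pattern` and `matches` are made up for this sketch.

```python
import re
from itertools import permutations


def compile_pattern(pattern):
    """Translate a prediction pattern over single-letter events to a regex."""
    # Split on '.' and '*' but keep them; '|' binds tighter, so each
    # remaining piece is one unordered group such as 'C|D|E'.
    parts = re.split(r"([.*])", pattern)
    regex = []
    for part in parts:
        if part == ".":
            continue  # 'next': no events may occur in between
        elif part == "*":
            regex.append(".*")  # 'wildcard': any number of events in between
        else:
            members = part.split("|")
            # ASSUMPTION: group members are contiguous, in any order.
            orders = {"".join(map(re.escape, p)) for p in permutations(members)}
            regex.append("(?:" + "|".join(sorted(orders)) + ")")
    return re.compile("".join(regex))


def matches(pattern, sequence):
    """True if the pattern matches somewhere inside the event sequence."""
    return compile_pattern(pattern).search(sequence) is not None


print(matches("A*D", "ABCD"))
print(matches("A.B.C", "ACB"))
print(matches("A.B*C|D|E", "ABXXEDC"))
```

Enumerating permutations grows factorially with the size of an unordered group, so this is only reasonable for the short groups used in the examples.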
                              Learning Method
• A PP matches an event sequence if
•    1) events within the PP match events within the event sequence
•    2) ordering constraints in the PP are obeyed
•    3) events involved in the match occur within the pattern duration
• This language enables flexible and noise-tolerant
  prediction rules to be constructed, such as
• if 3 (or more) A events and 4 (or more) B events occur within
  an hour, then predict the target event.
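
The example rule above can be checked directly with a sliding time window, without going through the pattern language. The sketch below is a minimal illustration: `rule_fire_times` is a hypothetical name, events are assumed to be `(timestamp, symbol)` pairs sorted by time, and the window is assumed to include both endpoints.

```python
from collections import Counter, deque


def rule_fire_times(events, required, window):
    """Return the timestamps at which the count rule is satisfied.

    events:   list of (timestamp, symbol), sorted by timestamp
    required: dict mapping symbol -> minimum count inside the window
    window:   window length in the same unit as the timestamps
    """
    recent = deque()
    counts = Counter()
    fired = []
    for time, symbol in events:
        recent.append((time, symbol))
        counts[symbol] += 1
        # Drop events that fell out of the window ending at the current event.
        while recent[0][0] < time - window:
            _, old_symbol = recent.popleft()
            counts[old_symbol] -= 1
        if all(counts[s] >= n for s, n in required.items()):
            fired.append(time)
    return fired


# Hypothetical stream, timestamps in seconds; the rule needs 3 A's and 4 B's
# within one hour (3600 seconds).
stream = [(0, "A"), (600, "A"), (1200, "A"),
          (1500, "B"), (1800, "B"), (2100, "B"), (2400, "B"), (5000, "B")]
print(rule_fire_times(stream, {"A": 3, "B": 4}, window=3600))
```

Because the rule only counts events, a few missing or extra alarms in between do not break the match, which is the noise tolerance the slide refers to.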
                  GA for identifying PP
• Use a genetic algorithm (GA) to identify a diverse set of PP.
• Each individual in the GA's population represents
  part of a complete solution and should perform
  well at classifying a subset of the target events.
• The GA used is a steady-state GA, where only a few
  individuals are modified each 'iteration'.
                             Basic steps in GA
•   1. initialize population
•   2. while stopping criteria not met
•   3.   select 2 individuals from the population
•   4.   with probability Pm apply the mutation operator to both individuals;
•        else apply the crossover operator
•   5.   evaluate the 2 newly formed individuals
•   6.   replace 2 existing individuals with the new ones
•   7. done
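
The loop above can be sketched generically in Python. This is a toy under stated assumptions, not the Timeweaver implementation: individuals are bit lists rather than prediction patterns, the `mutate` and `crossover` operators are placeholders, selection is proportional to plain fitness, and replacement is inversely proportional to it. All function names are made up for this sketch.

```python
import random


def pick_two_distinct(weights, rng):
    """Pick two different indices, each with probability proportional to weight."""
    first = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    rest = [i for i in range(len(weights)) if i != first]
    second = rng.choices(rest, weights=[weights[i] for i in rest], k=1)[0]
    return first, second


def steady_state_ga(population, fitness, mutate, crossover, p_mutation,
                    iterations, rng):
    """Modify only two individuals per iteration (steady-state GA)."""
    eps = 1e-9  # keeps every weight strictly positive
    scores = [fitness(ind) for ind in population]
    for _ in range(iterations):
        # Step 3: select two parents, favouring fitter individuals.
        a, b = rng.choices(population, weights=[s + eps for s in scores], k=2)
        # Step 4: mutate both with probability Pm, otherwise cross them over.
        if rng.random() < p_mutation:
            children = [mutate(a, rng), mutate(b, rng)]
        else:
            children = list(crossover(a, b, rng))
        # Steps 5-6: evaluate the children and overwrite two weak individuals.
        victims = pick_two_distinct([1.0 / (s + eps) for s in scores], rng)
        for idx, child in zip(victims, children):
            population[idx] = child
            scores[idx] = fitness(child)
    return population


def mutate(bits, rng):
    """Placeholder operator: flip one randomly chosen bit."""
    i = rng.randrange(len(bits))
    return bits[:i] + [1 - bits[i]] + bits[i + 1:]


def crossover(a, b, rng):
    """Placeholder operator: one-point crossover."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


rng = random.Random(0)
start = [[rng.randint(0, 1) for _ in range(8)] for _ in range(10)]
final = steady_state_ga(start, sum, mutate, crossover, 0.3, 200, rng)
print(max(sum(ind) for ind in final))
```

Because only two individuals change per iteration, the rest of the population carries over unchanged, which is what distinguishes a steady-state GA from a generational one.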
           Selection and Replacement
• The GA's selection and replacement strategy must
  balance two opposing criteria
      • 1. Focus the search on the most profitable
        areas of the search space.
      • 2. Maintain a diverse population to avoid premature
        convergence and to ensure that the individuals in the
        population collectively cover most of the target events.
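• Fitness of a PP is based on both its precision and
  recall and is given by
         $\text{fitness} = \dfrac{(\beta^2 + 1)\,\text{precision} \cdot \text{recall}}{\beta^2\,\text{precision} + \text{recall}}$
  where β sets the relative importance of precision and recall.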
• To encourage diversity we use a strategy called
  sharing that rewards individuals based on how
  different they are from other individuals in the
  population.
• Individuals are selected with probability proportional to their
  shared fitness = fitness / count
• count measures the degree of similarity of
  individual j to the p individuals comprising the
  population:
        $\text{count}_j = \sum_{k=1}^{p} \bigl(1 - \text{distance}(j, k)\bigr)^3$
• The similarity of 2 individuals is measured using a
  phenotypic distance metric that measures the
  distance based on the performance of the
  individuals.
• The more similar an individual is to the rest of the
  individuals in the population, the smaller the distances
  and the greater the count.
• Replacement strategy also uses shared fitness.
• Individuals are chosen for deletion inversely
  proportional to their shared fitness
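
The fitness and sharing computations can be illustrated as follows. This is a sketch under stated assumptions: the weighting parameter is written `beta`, each individual's performance is represented by a prediction vector with one entry per target event (1 if predicted, 0 if not), and the phenotypic distance is assumed to be the fraction of entries on which two vectors differ. The function names are made up for this sketch.

```python
def fitness(precision, recall, beta=1.0):
    """Combined precision/recall fitness, weighted by beta."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (beta ** 2 + 1) * precision * recall / denominator


def distance(vector_j, vector_k):
    """ASSUMPTION: fraction of target events on which two individuals differ."""
    differing = sum(1 for a, b in zip(vector_j, vector_k) if a != b)
    return differing / len(vector_j)


def shared_fitness(j, fitnesses, vectors):
    """Fitness of individual j divided by its similarity count."""
    # The sum includes k == j, which contributes 1, so count is never zero.
    count = sum((1 - distance(vectors[j], v)) ** 3 for v in vectors)
    return fitnesses[j] / count


# Hypothetical population: two identical individuals and one that covers
# different target events, all with the same raw fitness.
vectors = [[1, 1, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
raw = [fitness(0.5, 0.5)] * 3
print([shared_fitness(j, raw, vectors) for j in range(3)])
```

The individual that predicts target events nobody else covers keeps its full fitness, while the two duplicates split theirs, which is how sharing pushes the population toward covering different target events.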
            Creating Prediction Rules
• The greedy algorithm shown is used to form a list of
  prediction rules S from a set of candidate patterns C
• Precision, recall and prediction vector information
  computed in the first step for each PP are used.
• Step 10 requires access to the training set and is
  the most time-intensive step
•   1.   C = patterns returned from the GA; S = {};
•   2.   while C ≠ ∅ do
•   3.     for c ∈ C do
•   4.       if (increase_recall(S+c, S) <= THRESHOLD)
•   5.       then C = C - c;
•   6.       else c.eval = PF × (c.precision - S.precision) + increase_recall(S+c, S);
•   7.     done
•   8.     best = {c ∈ C, ∀x ∈ C | c.eval >= x.eval}
•   9.     S = S || best; C = C - best;
•   10.    recompute S.precision on the training set;
•   11.  done
• Builds solutions with increasing recall by
  heuristically selecting the best PP in C using the eval
  function in line 6.
• PF controls the importance of precision vs
  recall. PPs that do not increase the recall by at least
  THRESHOLD are discarded.
• Both THRESHOLD and PF affect the
  complexity of the learned concept and can
  prevent overfitting of the data
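
A compact Python rendering of the greedy procedure may help. It is a sketch under stated assumptions rather than the original code: the `Pattern` class and `build_rules` function are hypothetical, each pattern carries the set of target events it predicts and the set of its false predictions, precision is taken to be the normalized form (targets predicted over targets predicted plus false predictions), and recomputing S.precision is approximated from the unions of those sets instead of rescanning a training set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    targets: frozenset      # ids of target events this pattern predicts
    false_preds: frozenset  # ids of its false predictions


def precision(targets, false_preds):
    total = len(targets) + len(false_preds)
    return len(targets) / total if total else 0.0


def build_rules(candidates, n_targets, pf, threshold):
    """Greedily order patterns; returns the list S of chosen patterns."""
    remaining = list(candidates)
    chosen, covered, false_preds = [], set(), set()
    s_precision = 0.0  # ASSUMPTION: an empty rule list has precision 0
    while remaining:
        scored = []
        for c in remaining:
            gain = len(c.targets - covered) / n_targets  # increase in recall
            if gain <= threshold:
                continue  # lines 4-5: discard patterns that add too little
            score = pf * (precision(c.targets, c.false_preds) - s_precision) + gain
            scored.append((score, c))
        if not scored:
            break
        best = max(scored, key=lambda item: item[0])[1]  # line 8
        chosen.append(best)                              # line 9
        covered |= best.targets
        false_preds |= best.false_preds
        remaining = [c for _, c in scored if c is not best]
        s_precision = precision(covered, false_preds)    # line 10
    return chosen


# Hypothetical candidates over 10 target events.
p1 = Pattern("p1", frozenset({0, 1, 2, 3}), frozenset({100}))
p2 = Pattern("p2", frozenset({2, 3, 4}), frozenset({101, 102, 103}))
p3 = Pattern("p3", frozenset({0}), frozenset())
print([p.name for p in build_rules([p1, p2, p3], 10, pf=1.0, threshold=0.05)])
```

Taking the first n patterns of the returned list as a disjunction gives one prediction rule per n, each with higher recall than the one before.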
• This paper describes Timeweaver (TW), a GA-based machine
  learning system that predicts rare events
• For AT&T, the specific task is to predict failure
  of 4ESS hardware components from alarm messages.
• Problem Formulation
    • Each 4ESS-generated alarm is an event with 3
      features: device, severity, code
    • Each 4ESS is a domain object that generates an event
      sequence, and the target event is any event with code
      set to 'FAILURE'
• Training set: 110,000 alarms reported from 55
  4ESS switches.
• Test set: 40,000 alarms from 20 different
  4ESS switches. This data included 1200 alarms
  which indicated equipment failure.
• THRESHOLD was 1% and PF was 10
• Pattern 351:<TMSP,?,MJ>*<?,?,MJ>*<?,?,MN>
  indicates that within a 351-second time period a major
  severity alarm on a TMSP device is followed by a
  major alarm and then a minor alarm
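
To make the reading of this pattern concrete, the sketch below checks a wildcard-only pattern with "?" feature values and a duration against a timestamped alarm list. It is an illustration, not Timeweaver's matcher: the function names are made up, features are compared by position, and the device names and middle feature values in the sample alarms are hypothetical.

```python
def feature_match(template, features):
    """A template feature matches if it is '?' or equal to the event's value."""
    return all(t == "?" or t == f for t, f in zip(template, features))


def matches_within(templates, duration, events):
    """True if the templates match in order, with any events in between,
    and the first and last matched events are at most `duration` apart.

    events: list of (timestamp, features) sorted by timestamp
    """
    for start, (t0, feats) in enumerate(events):
        if not feature_match(templates[0], feats):
            continue
        pos = start + 1
        end_time = t0
        complete = True
        # Greedily take the earliest later event matching each template.
        for template in templates[1:]:
            while pos < len(events) and not feature_match(template, events[pos][1]):
                pos += 1
            if pos == len(events):
                complete = False
                break
            end_time = events[pos][0]
            pos += 1
        if complete and end_time - t0 <= duration:
            return True
    return False


pattern = [("TMSP", "?", "MJ"), ("?", "?", "MJ"), ("?", "?", "MN")]
# Hypothetical alarms: (seconds, (feature 1, feature 2, feature 3)).
alarms = [(0, ("TMSP", "x1", "MJ")), (100, ("LINK", "x2", "MN")),
          (200, ("LINK", "x3", "MJ")), (300, ("CLK", "x4", "MN"))]
print(matches_within(pattern, 351, alarms))
```

Taking the earliest possible match for each later template minimizes the end time for a given starting alarm, so the duration check is not missed because of a needlessly late match.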

• Varying the warning time demonstrates that it is
  much easier to predict failures when only a short warning
  time is required.
• Increasing the monitoring time significantly
  improves TW's ability to predict failures.
• A larger prediction period leads TW to focus its
  attention on 'spurious correlations' in the data.
 Comparison with other methods
• TW was compared to C4.5rules and RIPPER, two rule
  induction systems, and to FOIL, a system that learns logical
  definitions from relations.
• The class distribution of the generated examples is skewed, which
  prevented C4.5rules and RIPPER from predicting any target events
• TW yields results with precision 3-5 times higher for a
  given recall value than the various thresholding strategies used
  by the ANSWER system. Concept space is much more
• {100:a,b},{104:c,c},{105:d,a},{108:c,c},{110:a,d},
Format: integer-valued timestamp, colon, comma-separated list
  of feature values. There are 2 feature values per event, and each
  feature can take on the values a, b, c or d. The first can also take on
  the value 'crash'. The target event is any event with
  'crash' as the first feature. Warning time is 2 secs and
  monitoring time is 8 secs.
TW-GA training on this might produce PP: 4:|c,c|*|c,c|.
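
For readers who want to experiment with this input format, the sketch below parses it and locates target events. The function names are made up, the final `{112:crash,a}` event is a hypothetical addition (the sample shown contains no target event), and the prediction period is assumed to be [T - M, T - W].

```python
import re


def parse_events(text):
    """Parse '{time:f1,f2},...' into a list of (time, (f1, f2)) tuples."""
    events = []
    for time, feats in re.findall(r"\{(\d+):([^}]*)\}", text):
        events.append((int(time), tuple(f.strip() for f in feats.split(","))))
    return events


def target_times(events):
    """Timestamps of events whose first feature is 'crash'."""
    return [time for time, feats in events if feats[0] == "crash"]


sample = "{100:a,b},{104:c,c},{105:d,a},{108:c,c},{110:a,d},"
# HYPOTHETICAL: a crash event appended only to illustrate the target test.
events = parse_events(sample + "{112:crash,a}")
warning, monitoring = 2, 8
for t in target_times(events):
    print("target at", t, "prediction period", (t - monitoring, t - warning))
```

The trailing comma in the sample is harmless here because the parser only looks for complete `{...}` groups.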
• Learning to predict rare events in categorical time series
• http://paul.rutgers.edu/~gweiss/thesis/timeweaver.html
