sandia

Document Sample
sandia Powered By Docstoc
					    Anomaly Detection Using Data
        Mining Techniques
        Margaret H. Dunham, Yu Meng, Donya
          Quick, Jie Huang, Charlie Isaksson
                  CSE Department
           Southern Methodist University
                Dallas, Texas 75275
                mhd@engr.smu.edu

     This material is based upon work supported by the National Science Foundation under
                                      Grant No. IIS-0208741
12/9/08, Sandia                                                                            1
National Labs
                  Objectives/Outline

  Develop modeling techniques which can
    “learn/forget” past behavior of spatiotemporal
    stream events. Apply to prediction of anomalous
    events.

      Introduction
      EMM Overview
      EMM Applications to Anomaly Detection
      EMM Applications to Nuclear Testing
      Future Work




12/9/08, Sandia                                       2
National Labs
12/9/08, Sandia   3
National Labs
      Outline

 Introduction
    Motivation

    What is an anomaly?

    Spatiotemporal Data

    Modeling Spatiotemporal Data

 EMM Overview
 EMM Applications to Anomaly Detection
 Emm Applications to Nuclear Testing
 Future Work




12/9/08, Sandia                           4
National Labs
            Motivation
A growing number of applications generate streams
 of data.
   Computer network monitoring data
   Call detail records in telecommunications
   Highway transportation traffic data
   Online web purchase log records
   Sensor network data
   Stock exchange, transactions in retail chains, ATM
    operations in banks, credit card transactions.
Data mining techniques play a key role in modeling
 and analyzing this data.
    12/9/08, Sandia                                      5
    National Labs
   What is Anomaly?


       Event that is unusual
       Event that doesn’t occur frequently
       Predefined event
       What is unusual?
       What is deviation?




12/9/08, Sandia                               6
National Labs
   What is Anomaly in Stream Data?

 Rare - Anomalous – Surprising
 Out of the ordinary
 Not outlier detection
    No knowledge of data distribution

    Data is not static

    Must take temporal and spatial values into account

    May be interested in sequence of events

 Ex: Snow in upstate New York is not an anomaly
    Snow in upstate New York in June is rare

 Rare events may change over time


  12/9/08, Sandia                                         7
  National Labs
   Statistical View of Anomaly
      Outlier
      Data item that is outside the normal
       distribution of the data
      Identify by Box Plot




     Image from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.


12/9/08, Sandia                                                                       8
National Labs
   Statistical View of Anomaly
      Identify by looking at distribution

      THIS DOES NOT WORK with stream data




                  Image from www.wikipedia.org, Normal distribution.

12/9/08, Sandia                                                        9
National Labs
   Data Mining View of Anomaly
 Classification Problem
        Build classifier from training data
        Problem is that training data shows what is
         NOT an anomaly
        Thus an anomaly is anything that is not viewed
         as normal by the classification technique
        MUST build dynamic classifier
 Identify anomalous behavior
        Signatures of what anomalous behavior looks
         like
        Input data is identified as anomaly if it is similar
         enough to one of these signatures
 Mixed – Classification and Signature

12/9/08, Sandia                                                 10
National Labs
    Visualizing Anomalies
Temporal Heat Map (THM) is a visualization technique
  for streaming data derived from multiple sensors.
 Two dimensional structure similar to an infinite table.
 Each row of the table is associated with one sensor
  value.
 Each column of the table is associated with a point in
  time.
 Each cell within the THM is a color representation of
  the sensor value
 Colors normalized (in our examples)
    0 – While

    0.5 – Blue

    1.0 - Red
 12/9/08, Sandia                                            11
 National Labs
   THM of VoIP Data




      VoIP traffic data was provided by Cisco Systems and
      represents logged VoIP traffic in their Richardson, Texas
      facility from Mon Sep 22 12:17:32 2003 to Mon Nov 17
      11:29:11 2003.


12/9/08, Sandia                                                   12
National Labs
        Spatiotemporal Stream Data
Records may arrive at a rapid rate
High volume (possibly infinite) of continuous data
Concept drifts: Data distribution changes on the fly
Data does not necessarily fit any distribution
 pattern
Multidimensional
Temporal
Spatial
Data are collected in discrete time intervals,
Data are in structured format, <a1, a2, …>
Data hold an approximation of the Markov
 property.
  12/9/08, Sandia                                13
  National Labs
   Spatiotemporal Environment
      Events arriving in a stream
      At any time, t, we can view the state
       of the problem as represented by a
       vector of n numeric values:
               Vt = <S1t, S2t, ..., Snt>

                              V1     V2   …     Vq
                  S1     S11       S12    …   S1q
                  S2     S21       S22    …   S2q
                  …      …         …      …   …
                  Sn     Sn1       Sn2    …   Snq
                       Time
12/9/08, Sandia                                      14
National Labs
    Data Stream Modeling

   Single pass: Each record is examined at most once
   Bounded storage: Limited Memory for storing synopsis
   Real-time: Per record processing time must be low
   Summarization (Synopsis )of data
   Use data NOT SAMPLE
   Temporal and Spatial
   Dynamic
   Continuous (infinite stream)
   Learn
   Forget
   Sublinear growth rate - Clustering

12/9/08, Sandia                                       15 15
National Labs
             MM
A first order Markov Chain is a finite or countably infinite
   sequence of events {E1, E2, … } over discrete time
   points, where Pij = P(Ej | Ei), and at any time the future
   behavior of the process is based solely on the current
   state

A Markov Model (MM) is a graph with m vertices or states,
  S, and directed arcs, A, such that:
 S ={N1,N2, …, Nm}, and
 A = {Lij | i 1, 2, …, m, j 1, 2, …, m} and Each arc,
   Lij = <Ni,Nj> is labeled with a transition probability
   Pij = P(Nj | Ni).


   12/9/08, Sandia                                              16
   National Labs
       Problem with Markov Chains

 The required structure of the MC may not be certain
  at the model construction time.
 As the real world being modeled by the MC
  changes, so should the structure of the MC.
 Not scalable – grows linearly as number of events.
 Our solution:
    Extensible Markov Model (EMM)

    Cluster real world events

    Allow Markov chain to grow and shrink

      dynamically

12/9/08, Sandia                                         17
National Labs
      Outline


Introduction
EMM Overview
EMM Applications to Anomaly
 Detection
EMM Applications to Nuclear Testing
Future Work


12/9/08, Sandia                        18
National Labs
       Extensible Markov Model (EMM)

 Time Varying Discrete First Order Markov Model
 Nodes are clusters of real world states.
 Learning continues during application phase.
 Learning:
    Transition probabilities between nodes

    Node labels (centroid/medoid of cluster)

    Nodes are added and removed as data arrives




 12/9/08, Sandia                             19
 National Labs
          Related Work
 Splitting Nodes in HMMs
    Create new states by splitting an existing state

      M.J. Black and Y. Yacoob,”Recognizing facial expressions in image
       sequences using local parameterized models of image motion”, Int. Journal
       of Computer Vision, 25(1), 1997, 23-48.
 Dynamic Markov Modeling
    States and transitions are cloned

      G. V. Cormack, R. N. S. Horspool. “Data compression using dynamic
       Markov Modeling,” The Computer Journal, Vol. 30, No. 6, 1987.
 Augmented Markov Model (AMM)
    Creates new states if the input data has never been seen

     in the model, and transition probabilities are adjusted
      Dani Goldberg, Maja J Mataric. “Coordinating mobile robot group behavior
       using a model of interaction dynamics,” Proceedings, the Third
       International Conference on Autonomous Agents (agents ’99), Seattle,
       Washington
  12/9/08, Sandia                                                          20
  National Labs
        EMM vs AMM

Our proposed EMM model is similar to AMM, but is more
  flexible:
 EMM continues to learn during the application phase.
 State matching is determined using a clustering
  technique.
 EMM not only allows the creation of new nodes, but
  deletion (or merging) of existing nodes. This allows the
  EMM model to “forget” old information which may not be
  relevant in the future. It also allows the EMM to adapt to
  any main memory constraints for large scale datasets.
 EMM performs one scan of data and therefore is suitable
  for online data processing.
 12/9/08, Sandia                                        21
 National Labs
        EMM

Extensible Markov Model (EMM): at any time t,
  EMM consists of an MM and algorithms to modify
  it, where algorithms include:
 EMMSim, which defines a technique for matching
  between input data at time t + 1 and existing states
  in the MM at time t.
 EMMIncrement algorithm, which updates MM at
  time t + 1 given the MM at time t and classification
  measure result at time t + 1.

Additional algorithms may be added to modify the
  model or for applications.

12/9/08, Sandia                                          22
National Labs
     EMMSim

      Find closest node to incoming event.
      If none “close” create new node
      Labeling of cluster is centroid/medoid of
       members in cluster
      Problem
         Nearest Neighbhor O(n)

         BIRCH O(lg n)

            • Requires second phase to recluster
              initial




12/9/08, Sandia                                    23
National Labs
   EMMIncrement


             <18,10,3,3,1,0,0>
                                   2/3
                                  2/3
             <17,10,2,3,1,0,0>     2/3
                                   1/1
                                   1
                                   1/2
                                 2/2
                                            1/2

                                                     N3

             <16,9,2,3,1,0,0>    N1
                                                          1/2

             <14,8,2,3,1,0,0>         1/3          1/1
                                                          1/1
                                              N2



             <14,8,2,3,0,0,0>

            <18,10,3,3,1,1,0.>




12/9/08, Sandia                                                 24
National Labs
      EMMDecrement


     N1                  N3                       1/3
                                  1/3      N1             N3
                  1/3
                                                  1/3
   2/2                   1/3   Delete N2                 1/6
                  N2                        1/6
                        1/3
       1/2

     N5                  N6                N5             N6
                                                   1/6




12/9/08, Sandia                                                25
National Labs
       EMM Advantages

        Dynamic
        Adaptable
        Use of clustering
        Learns rare event
        Scalable:
           Growth of EMM is not linear on size of

            data.
           Hierarchical feature of EMM

        Creation/evaluation quasi-real time
        Distributed / Hierarchical extensions

12/9/08, Sandia                                      26
National Labs
                               Growth of EMM


                              800
                              700
   number of state in model




                              600                                                                                                                               threshold 0.994
                              500                                                                                                                               threshold 0.995
                              400                                                                                                                               threshold 0.996
                              300                                                                                                                               threshold 0.997
                              200                                                                                                                               threshold 0.998

                              100
                               0
                                    1
                                        80
                                             159
                                                   238
                                                         317
                                                               396
                                                                     475
                                                                           554
                                                                                 633
                                                                                       712
                                                                                             791
                                                                                                   870
                                                                                                         949
                                                                                                               1028
                                                                                                                      1107
                                                                                                                             1186
                                                                                                                                    1265
                                                                                                                                           1344
                                                                                                                                                  1423
                                                                                                                                                         1502
                                                                     number of input data (total 1574)



                                                                                 Servent Data
12/9/08, Sandia                                                                                                                                                         27
National Labs
   EMM Performance – Growth Rate


                              Threshold
 Data        Sim 0.99    0.992 0.994 0.996 0.998
            Jaccrd 156    190    268    389 667
             Dice  72      92    123    191 389
Ser         Cosine 11      14    19     31   61
went        Ovrlap 2        2     3      3    4
            Jaccrd 56      66    81     105 162
             Dice  40      43    52     66  105
            Cosine 6        8    10     13   24
Ouse        Ovrlap 1        1     1      1    1
12/9/08, Sandia                                    28
National Labs
     EMM Performance – Growth Rate




                  Minnesota Traffic Data
12/9/08, Sandia                            29
National Labs
      Outline


Introduction
EMM Overview
EMM Applications to Anomaly
 Detection
EMM Applications to Nuclear Testing
Future Work


12/9/08, Sandia                        30
National Labs
   Datasets/Anomalies
   MnDot – Minnesota Department of Transportation
     Automobile Accident



   Ouse and Serwent – River flow data from England
     Flood

     Drought



   KDD Cup’99
       http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
           Intrusion

   Cisco VoIP – VoIP traffic data obtained at Cisco
      Unusual Phone Call




12/9/08, Sandia                                                  31
National Labs
   Rare Event Detection
                                    Detected unusual
                                    weekend traffic pattern




    Weekdays Weekend

12/9/08, Sandia
                  Minnesota DOT Traffic Data                  32
National Labs
 Our Approach to Detect Anomalies

       By learning what is normal, the model can
        predict what is not
       Normal is based on likelihood of occurrence
       Use EMM to build model of behavior
       We view a rare event as:
          Unusual event

          Transition between events states which
           does not frequently occur.
       Base rare event detection on determining
        events or transitions between events that do
        not frequently occur.
       Continue learning



12/9/08, Sandia                                        33
National Labs
       EMMRare
  EMMRare algorithm indicates if the current
   input event is rare. Using a threshold
   occurrence percentage, the input event is
   determined to be rare if either of the following
   occurs:
     The frequency of the node at time t+1 is

      below this threshold
     The updated transition probability of the MC

      transition from node at time t to the node at
      t+1 is below the threshold

12/9/08, Sandia                                 34
National Labs
   Determining Rare

   Occurrence Frequency (OFc) of a node Nc :
                  OFc =   CNc     CN
                                 i
                                         i




   Normalized Transition Probability (NTPmn),
    from one state, Nm, to another, Nn :

                  NTPmn = CL
                            mn        CN
                                     i
                                             i




12/9/08, Sandia                                  36
National Labs
   EMMRare
      Given:
         • Rule#1: CNi <= thCN
         • Rule#2: CLij <= thCL
         • Rule#3: OFc <= thOF
         • Rule#4: NTPmn <= thNTP

         Input: Gt: EMM at time t
                 i: Current state at time t
                 R= {R1, R2,…,RN}: A set of rules
         Output: At: Boolean alarm at time t
      Algorithm:
                  1 Ri = True
         At =
                  0 Ri = False
12/9/08, Sandia                                     37
National Labs
         Rare Event in Cisco Data




12/9/08, Sandia                     38
National Labs
   Risk assessment
 Problem: Mitigate false alarm rate while maintaining a
  high detection rate.
 Methodology:
    Historic feedbacks can be used as a free resource to
     take out some possibly safe anomalies
    Combine anomaly detection model and user’s
     feedbacks.
    Risk level index

 Evaluation metrics: Detection rate, false alarm rate.
    Detection rate

    False alarm rate      Detection rate = TP/(TP+TN)
    Operational Curve     False alarm rate = FP/(TP+FP)




  12/9/08, Sandia                                       39
  National Labs
       Reducing False Alarms
     •Calculate Risk using historical feedback


     •Historical Feedback:




     •Count of true alarms:




12/9/08, Sandia                                  40
National Labs
   Detection Rate Experiments
       DETECTION RATE OF ANOMALY DETECTION AND RISK ASSESSMENT MODELS
         1

       0.5        ANOMALY DETECTION
                  RISK ASSESSMENT
         0
         16         18          20      22      24       26      28              30      32
                            (a) EUCLIDEAN THRESHOLD FOR CLUSTERING (th)
         1

       0.5                                                           ANOMALY DETECTION
                                                                     RISK ASSESSMENT
         0
          0       0.1      0.2         0.3    0.4   0.5    0.6   0.7       0.8     0.9    1
                                 (b) RISK ASSESSMENT WEIGHT FACTOR (alpha)
         1

       0.5                                                          ANOMALY DETECTION
                                                                    RISK ASSESSMENT
         0
          0         50         100      150     200      250       300           350     400
                            (c) EMM STATE CARDINALITY THRESHOLD (thNode)
         1

       0.5                                                          ANOMALY DETECTION
                                                                    RISK ASSESSMENT
         0
          0              50           100        150         200           250           300
                           (d) EMM TRANSITION CARDINALITY THRESHOLD (thLink)


12/9/08, Sandia                                                                                41
National Labs
   False Alarm Rate
      FALSE ALARM RATE OF ANOMALY DETECTION AND RISK ASSESSMENT MODELS
         1
                                                                     ANOMALY DETECTION
       0.5                                                           RISK ASSESSMENT

         0
         16         18          20      22       24      26      28              30      32
                             (a) EUCLIDEAN THRESHOLD FOR CUSTERING (th)
         1
                                                                     ANOMALY DETECTION
       0.5                                                           RISK ASSESSMENT

         0
          0       0.1      0.2         0.3    0.4   0.5   0.6    0.7       0.8     0.9    1
                                 (b) RISK ASSESSMENT WEIGHT FACTOR (alpha)
         1

       0.5                                                           ANOMALY DETECTION
                                                                     RISK ASSESSMENT
         0
          0         50         100      150     200      250       300           350     400
                            (c) EMM STATE CARDINALITY THRESHOLD (thNode)
         1

       0.5                                                           ANOMALY DETECTION
                                                                     RISK ASSESSMENT
         0
          0              50         100         150         200           250            300
                          (d) EMM TRANSITION CARDINALITY THRESHOLD (thLink)


12/9/08, Sandia                                                                                42
National Labs
      Outline


Introduction
EMM Overview
EMM Applications to Anomaly
 Detection
EMM Applications to Nuclear Testing
Future Work


12/9/08, Sandia                        43
National Labs
                      Research Objectives

 Apply proven spatiotemporal modeling technique to seismic
  data
 Construct EMM to model sensor data
    Local EMM at location or area

    Hierarchical EMM to summarize lower level models

    Represent all data in one vector of values

    EMM learns normal behavior

 Develop new similarity metrics to include all sensor data types
  (Fusion)
 Apply anomaly detection algorithms


    12/9/08, Sandia                                        44
    National Labs
            Input Data Representation

     •Vector of sensor values (numeric) at
     precise time points or aggregated over
     time intervals.
     •Need not come from same sensor
     types.
     •Similarity/distance between vectors
     used to determine creation of new nodes
     in EMM.



12/9/08, Sandia                                45
National Labs
                    EMM with Seismic Data

 Input – Wave arrivals (all or one per sensor)
 Identify states and changes of states in seismic
  data
 Wave form would first have to be converted into a
  series of vectors representing the activity at
  various points in time.
 Initial Testing with RDG data
 Use amplitude, period, and wave type



  12/9/08, Sandia                               46
  National Labs
                   New Distance Measure
 Data = <amplitude, period, wave type>
 Different wave type = 100% difference
 For events of same wave type:
     50% weight given to the difference in amplitude.

     50% weight given to the difference in period.

 If the distance is greater than the threshold, a state
   change is required.
 amplitude =
| amplitudenew – amplitudeaverage | / amplitudeaverage
 period =
    | periodnew – periodaverage | / periodaverage
 12/9/08, Sandia                                           47
 National Labs
                  EMM with Seismic Data




   States 1, 2, and
   3 correspond to
   Noise, Wave A,
   and Wave B
   respectively.
12/9/08, Sandia                           48
National Labs
                  Preliminary Testing


         RDG data February 1, 1981 – 6
          earthquakes
         Find transition times close to known
          earthquakes
         9 total nodes
         652 total transitions
         Found all quakes


12/9/08, Sandia                                  49
National Labs
      Outline


Introduction
EMM Overview
EMM Applications to Anomaly
 Detection
EMM Applications to Nuclear Testing
Future Work


12/9/08, Sandia                        50
National Labs
      Ongoing/Future Work

       Extend to Emerging Patterns
       Extend to Hierarchical/Distributed
          Yu Su

       Test with more data – KDD Cup
       Compare to other approaches
          Charlie Isaksson

       Apply to nuclear testing



12/9/08, Sandia                              51
National Labs
                  Hierarchical EMM




12/9/08, Sandia                      52
National Labs
12/9/08, Sandia   53
National Labs
                                    References
    Zhigang Li and Margaret H. Dunham, “ STIFF: A Forecasting Framework for Spatio-Temporal
     Data”, Proceedings of the First International Workshop on Knowledge Discovery in Multimedia
     and Complex Data, May 2002, pp 1-9.
    Zhigang Li, Liangang Liu, and Margaret H. Dunham, “ Considering Correlation Between Variables
     to Improve Spatiotemporal Forecasting,” Proceedings of the PAKDD Conference, May 2003, pp
     519-531.
    Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE
     ICDM Conference, November 2004, pp 371-374.
    Yu Meng and Margaret H. Dunham, “Efficient Mining of Emerging Events in a Dynamic
     Spatiotemporal,” Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in
     Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)
    Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data
     Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.
    Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic
     Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June
     2006, pp 258-265.
    Margaret H. Dunham and Vijay Kumar, “Stream Hierarchy Data Mining for Sensor Data,”
     Innovations and Real-Time Applications of Distributed Sensor Networks (DSN) Symposium,
     November 26, 2007, Shreveport Louisiana.




    12/9/08, Sandia                                                                          54
    National Labs

				
DOCUMENT INFO
Shared By:
Categories:
Tags:
Stats:
views:1
posted:3/20/2012
language:English
pages:53
yaohongm yaohongm http://
About