         An introduction to machine learning
        and probabilistic graphical models

                     Kevin Murphy
                      MIT AI Lab

      Presented at Intel's workshop on "Machine learning
      for the life sciences", Berkeley, CA, 3 November 2003


                         Overview



 Supervised  learning
 Unsupervised learning

 Graphical models

 Learning relational models




  Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and
  various web sources for letting me use many of their slides
                                                                 2
                Supervised learning
                     yes                 no
   [Figure: example objects labelled yes / no]

Color            Shape           Size         Output
Blue             Torus           Big          Y
Blue             Square          Small        Y
Blue             Star            Small        Y
Red              Arrow           Small        N

Learn to approximate the function F(x1, x2, x3) -> t
from a training set of (x, t) pairs                     3
                    Supervised learning
     Training data
X1      X2    X3        T
B       T     B         Y
B       S     S         Y        Learner  ->  Hypothesis
B       S     S         Y
R       A     S         N

     Testing data                        Prediction (T)
X1      X2    X3        T
B       A     S         ?                      Y
Y       C     S         ?                      N
                                                        4
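The setup on these two slides can be sketched in a few lines of code. A minimal illustration, assuming scikit-learn is available; the integer encodings of the attribute letters are made up here purely so the toy table can be fed to a learner:

```python
# Toy version of the slides' training/testing tables (assumes scikit-learn).
# The integer encodings below are illustrative, not part of the slides.
from sklearn.tree import DecisionTreeClassifier

color = {"B": 0, "R": 1, "Y": 2}          # colour codes used in the tables
shape = {"T": 0, "S": 1, "A": 2, "C": 3}  # shape codes used in the tables
size  = {"B": 0, "S": 1}                  # Big, Small

X_train = [[color[c], shape[s], size[z]] for c, s, z in
           [("B", "T", "B"), ("B", "S", "S"), ("B", "S", "S"), ("R", "A", "S")]]
t_train = ["Y", "Y", "Y", "N"]

learner = DecisionTreeClassifier().fit(X_train, t_train)   # produces a hypothesis

X_test = [[color["B"], shape["A"], size["S"]],
          [color["Y"], shape["C"], size["S"]]]
print(learner.predict(X_test))            # predicted T for the two test rows
```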
    Key issue: generalization
          yes                                no




                ?                     ?
Can’t just memorize the training set (overfitting)
                                                     5
             Hypothesis spaces
 Decision trees
 Neural networks

 K-nearest neighbors

 Naïve Bayes classifier

 Support vector machines (SVMs)

 Boosted decision stumps

…




                                   6
            Perceptron
(neural net with no hidden layers)




       Linearly separable data

                                     7
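A minimal perceptron training loop (no hidden layers), as a sketch rather than anything from the original slides; the four 2-D points below are made up and linearly separable:

```python
# Perceptron sketch: repeatedly sweep the data, updating on mistakes.
import numpy as np

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])           # class labels in {-1, +1}

w = np.zeros(2)
b = 0.0
for _ in range(100):                   # a fixed number of passes is enough here
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:     # misclassified: nudge the hyperplane
            w += yi * xi
            b += yi

print(w, b)                            # a separating hyperplane w.x + b = 0
```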
Which separating hyperplane?




                               8
The linear separator with the largest
   margin is the best one to pick




                  margin




                                        9
What if the data is not linearly separable?




                                              10
                   Kernel trick


        Feature map:   (x, y)  →  (x², √2·x·y, y²)

   [Figure: 2-D data (axes x1, x2) mapped by the kernel into 3-D
    (axes z1, z2, z3), where it becomes linearly separable]

      The kernel implicitly maps from 2D to 3D,
       making the problem linearly separable
                                                       11
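The feature map above can be checked numerically: its inner products equal the polynomial kernel K(u, v) = (u·v)², which is why the kernel can work in the 3-D space without ever constructing it. A small sketch, assuming numpy:

```python
# Verify that the explicit map (x, y) -> (x^2, sqrt(2)*x*y, y^2) has
# inner products equal to the polynomial kernel K(u, v) = (u . v)^2.
import numpy as np

def phi(p):
    x, y = p
    return np.array([x**2, np.sqrt(2) * x * y, y**2])

def K(u, v):
    return np.dot(u, v) ** 2

u, v = np.array([1.0, 2.0]), np.array([3.0, -1.0])
print(np.dot(phi(u), phi(v)), K(u, v))   # both print the same value
```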
  Support Vector Machines (SVMs)
 Two  key ideas:
     Large margins
     Kernel trick




                                   12
                       Boosting




Simple classifiers (weak learners) can have their performance
          boosted by taking weighted combinations

Boosting maximizes the margin
                                                                13
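A hedged sketch of boosted decision stumps, assuming scikit-learn is available; AdaBoost's default weak learner is already a depth-1 tree, and the synthetic data set is just for illustration:

```python
# Boosting sketch: a weighted combination of weak learners (decision stumps).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# The default weak learner in AdaBoostClassifier is a depth-1 tree (a stump).
booster = AdaBoostClassifier(n_estimators=50)
booster.fit(X, y)
print(booster.score(X, y))        # training accuracy of the boosted ensemble
```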
Supervised learning success stories

    Face detection
    Steering an autonomous car across the US
    Detecting credit card fraud
    Medical diagnosis
    …




                                                14
            Unsupervised learning
 What   if there are no output labels?




                                          15
                  K-means clustering
1.   Guess the number of clusters, K
2.   Guess initial cluster centers, μ1, μ2


3.   Assign data points xi to the nearest cluster center
4.   Re-compute the cluster centers based on the assignments
     Repeat steps 3-4 until convergence




                                                                   16
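The four steps map directly onto a few lines of numpy; this is a toy sketch (made-up 2-D data, fixed number of iterations), not production clustering code:

```python
# K-means sketch following steps 1-4 above.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

K = 2                                                # step 1: guess K
centers = X[rng.choice(len(X), K, replace=False)]    # step 2: initial centers

for _ in range(20):                                  # repeat steps 3-4
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    assign = d.argmin(axis=1)                        # step 3: nearest center
    # step 4: recompute centers (assumes every cluster keeps >= 1 point,
    # which holds for this toy data)
    centers = np.array([X[assign == k].mean(axis=0) for k in range(K)])

print(centers)
```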
  AutoClass (Cheeseman et al, 1986)
 EM  algorithm for mixtures of Gaussians
 “Soft” version of K-means

 Uses Bayesian criterion to select K

 Discovered new types of stars from spectral data

 Discovered new classes of proteins and introns
  from DNA/protein sequence databases




                                                     17
Hierarchical clustering




                          18
        Principal Component
           Analysis (PCA)
    PCA seeks a projection that best represents the
    data in a least-squares sense.

                                 PCA reduces the
                                 dimensionality of
                                 feature space by
                                 restricting attention to
                                 those directions along
                                 which the scatter of the
                                 cloud is greatest.


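A minimal PCA sketch via the SVD of the centered data matrix; the correlated toy data and the choice of k = 2 components are assumptions for illustration:

```python
# PCA sketch: project onto the directions of greatest scatter
# (top right-singular vectors of the centered data matrix).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # toy correlated data

Xc = X - X.mean(axis=0)                 # center the cloud
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2
Z = Xc @ Vt[:k].T                       # least-squares best k-D representation
print(Z.shape)                          # (200, 2)
```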
Discovering nonlinear manifolds




                                  20
Combining supervised and unsupervised
              learning




                                        21
         Discovering rules (data mining)
Occup.     Income     Educ.       Sex          Married      Age

Student    $10k       MA          M            S            22
Student    $20k       PhD         F            S            24
Doctor     $80k       MD          M            M            30
Retired    $30k       HS          F            M            60

    Find the most frequent patterns (association rules)

   Num in household = 1 ^ num children = 0 => language = English

   Language = English ^ Income < $40k ^ Married = false ^
   num children = 0 => education ∈ {college, grad school}


                                                                   22
   Unsupervised learning: summary
 Clustering

 Hierarchical clustering
 Linear dimensionality reduction (PCA)

 Non-linear dimensionality reduction

 Learning rules




                                          23
    Discovering networks




              ?




From data visualization to causal discovery
                                              24
             Networks in biology
 Most  processes in the cell are controlled by
  networks of interacting molecules:
    Metabolic Network

    Signal Transduction Networks

    Regulatory Networks

 Networks can be modeled at multiple levels of
  detail/ realism
    Molecular level

    Concentration level           Decreasing detail
    Qualitative level

                                                       25
Molecular level: Lysis-Lysogeny circuit in
             Lambda phage

                                         Arkin et al. (1998),
                                         Genetics 149(4):1633-48




    5 genes, 67 parameters based on 50 years of research
        Stochastic simulation required supercomputer
                                                                   26
 Concentration level: metabolic pathways

 Usually   modeled with differential equations


   [Figure: a small gene network g1, ..., g5 with interaction weights w_ij]




                                                  27
Qualitative level: Boolean Networks




                                      28
     Probabilistic graphical models
 Supports   graph-based modeling at various levels of
  detail
 Models can be learned from noisy, partial data

 Can model “inherently” stochastic phenomena, e.g.,
  molecular-level fluctuations…
 But can also model deterministic, causal processes.
   "The actual science of logic is conversant at present only with
   things either certain, impossible, or entirely doubtful. Therefore
   the true logic for this world is the calculus of probabilities."
   -- James Clerk Maxwell

   "Probability theory is nothing but common sense reduced to
   calculation." -- Pierre Simon Laplace
                                                                        29
         Graphical models: outline
 What  are graphical models?
 Inference

 Structure learning




                                     30
           Simple probabilistic model:
               linear regression

     Y = α + β·X + noise        α + β·X : deterministic (functional) relationship

Y




                                      X


                                                                    31
           Simple probabilistic model:
               linear regression

     Y = α + β·X + noise        α + β·X : deterministic (functional) relationship

Y

                                         “Learning” = estimating the
                                         parameters α, β, σ from
                                         (x, y) pairs.

                                              α and β can be estimated by
                                              least squares (α is the
                                              empirical mean of Y when X
                                              is centered)
                                      X
                                              σ² is the residual variance
                                                                         32
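A small sketch of the estimation step, assuming numpy: simulate (x, y) pairs from the model and recover α, β by least squares and σ from the residuals. The "true" parameter values are made up:

```python
# Estimating alpha, beta, sigma from (x, y) pairs by least squares.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 100)   # true alpha=2, beta=0.5, sigma=1

A = np.column_stack([np.ones_like(x), x])     # design matrix [1, x]
(alpha_hat, beta_hat), *_ = np.linalg.lstsq(A, y, rcond=None)
residuals = y - (alpha_hat + beta_hat * x)
sigma_hat = residuals.std()                   # residual standard deviation

print(alpha_hat, beta_hat, sigma_hat)
```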
Piecewise linear regression




  Latent “switch” variable – hidden process at work
                                                      33
 Probabilistic graphical model for piecewise
               linear regression

       input

         X
                      •Hidden variable Q chooses which set of
                      parameters to use for predicting Y.

         Q             •Value of Q depends on value of
                       input X.
                      •This is an example of “mixtures of experts”
         Y
        output


Learning is harder because Q is hidden, so we don’t know which
data points to assign to each line; can be solved with EM (c.f., K-means)
                                                                        34
Classes of graphical models
                               Probabilistic models
          Graphical models



      Directed               Undirected


     Bayes nets                 MRFs

       DBNs




                                                      35
                      Bayesian Networks
Compact representation of probability
distributions via conditional independence
Qualitative part:
Directed acyclic graph (DAG)
 Nodes - random variables
 Edges - direct influence

   [Figure: Earthquake and Burglary point to Alarm; Earthquake points to
    Radio; Alarm points to Call]

Quantitative part:
Set of conditional probability distributions,
e.g. the family of Alarm, P(A | E, B):

       E    B      P(A|E,B)   P(¬A|E,B)
       e    b      0.9        0.1
       e    ¬b     0.2        0.8
       ¬e   b      0.9        0.1
       ¬e   ¬b     0.01       0.99

Together they define a unique distribution in a factored form:

P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)
                                                                                           36
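The factored form can be evaluated directly. In the sketch below only the P(A | E, B) numbers come from the table above; the priors P(B), P(E) and the CPTs for Radio and Call are made-up placeholders:

```python
# Evaluate P(B,E,A,C,R) = P(B)P(E)P(A|B,E)P(R|E)P(C|A) for one assignment.
P_B = {True: 0.01, False: 0.99}                     # assumed prior on Burglary
P_E = {True: 0.02, False: 0.98}                     # assumed prior on Earthquake
P_A = {(True, True): 0.9, (True, False): 0.9,       # P(A=1 | B, E), keyed (B, E),
       (False, True): 0.2, (False, False): 0.01}    # numbers from the slide's table
P_R = {True: 0.9, False: 0.1}                       # assumed P(R=1 | E)
P_C = {True: 0.8, False: 0.05}                      # assumed P(C=1 | A)

def joint(B, E, A, C, R):
    pA = P_A[(B, E)] if A else 1 - P_A[(B, E)]
    pR = P_R[E] if R else 1 - P_R[E]
    pC = P_C[A] if C else 1 - P_C[A]
    return P_B[B] * P_E[E] * pA * pR * pC

print(joint(B=True, E=False, A=True, C=True, R=False))
```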
   Example: “ICU Alarm” network
Domain: Monitoring Intensive-Care Patients
 37 variables

 509 parameters                              ...instead of 2^54

   [Figure: the ALARM network, a sparse DAG over the 37 monitored variables
    (MINVOLSET, PULMEMBOLUS, INTUBATION, KINKEDTUBE, VENTMACH, DISCONNECT,
     ..., CVP, PCWP, CO, HRBP, HREKG, HRSAT, BP)]
                                                                                                                  37
Success stories for graphical models
 Multiple sequence alignment
 Forensic analysis

 Medical and fault diagnosis

 Speech recognition

 Visual tracking

 Channel coding at Shannon limit

 Genetic pedigree analysis

…




                                       38
         Graphical models: outline
 What are graphical models? ✓
 Inference

 Structure learning




                                     39
           Probabilistic Inference
Posterior    probabilities
     Probability of any event given any evidence
P(X|E)



                      [Figure: the Earthquake / Burglary / Radio / Alarm / Call network]


                                                              40
                Viterbi decoding
Compute most probable explanation (MPE) of observed data

              Hidden Markov Model (HMM)


         X1             X2           X3     hidden



           Y1             Y2           Y3    observed




                              “Tomato”


                                                           41
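A minimal Viterbi sketch for a small HMM: dynamic programming over log-probabilities with back-pointers recovers the most probable hidden state sequence. The two-state model and the observation sequence below are invented for illustration:

```python
# Viterbi decoding: most probable explanation (MPE) for an HMM.
import numpy as np

pi = np.array([0.6, 0.4])                 # initial state distribution
A = np.array([[0.7, 0.3],                 # transition probabilities
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],            # emission probabilities
              [0.1, 0.3, 0.6]])
obs = [0, 2, 1]                           # observed symbol indices

T, N = len(obs), len(pi)
delta = np.zeros((T, N))                  # best log-prob of paths ending in each state
psi = np.zeros((T, N), dtype=int)         # back-pointers
delta[0] = np.log(pi) + np.log(B[:, obs[0]])
for t in range(1, T):
    scores = delta[t - 1][:, None] + np.log(A)    # scores[i, j]: come from i, go to j
    psi[t] = scores.argmax(axis=0)
    delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])

path = [delta[-1].argmax()]               # best final state, then trace back
for t in range(T - 1, 0, -1):
    path.append(psi[t][path[-1]])
print(path[::-1])                         # most probable hidden state sequence
```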
      Inference: computational issues
                 Easy                                          Hard

            Chains                                      Dense, loopy graphs

            Trees                                       Grids

      [Figure: the sparse ALARM network shown as a tractable example,
       a 2-D grid as an intractable one]
                                                                              42
Many different inference algorithms exist,
both exact and approximate
                                                                              43
               Bayesian inference
 Bayesian probability treats parameters as random
  variables
 Learning / parameter estimation is replaced by probabilistic
  inference: computing P(θ | D)
 Example: Bayesian linear regression; the parameters are
   θ = (α, β, σ)

                          θ      The parameters θ are tied (shared)
                                 across repetitions of the data

          X1    ...   Xn



          Y1    ...   Yn
                                                                44
             Bayesian inference
+  Elegant – no distinction between parameters and
  other hidden variables
 + Can use priors to learn from small data sets (c.f.,
  one-shot learning by humans)
 - Math can get hairy

 - Often computationally intractable




                                                          45
         Graphical models: outline
 What are graphical models? ✓

 Inference ✓

 Structure learning




                                     46
    Why Struggle for Accurate Structure?

                                Earthquake   Alarm Set   Burglary



                                              Sound



       Missing an arc                                        Adding an arc

    Earthquake   Alarm Set   Burglary                    Earthquake   Alarm Set   Burglary



                  Sound                                                Sound


 Cannot be compensated                           Increases the number of
  for by fitting parameters                        parameters to be estimated
 Wrong assumptions about                         Wrong assumptions about
  domain structure                                 domain structure
                                                                                             47
           Score-based Learning

Define scoring function that evaluates how well a
structure matches the data


 Training data over E, B, A:
 <Y,N,N>
 <Y,Y,Y>
 <N,N,Y>
 <N,Y,Y>
    .
    .
 <N,Y,Y>

 [Figure: three candidate structures over E, B, A, each assigned a score]


 Search for a structure that maximizes the score
                                                        48
                  Learning Trees




 Can find the optimal tree structure in O(n² log n) time: just
  find the max-weight spanning tree
 If some of the variables are hidden, problem becomes hard
  again, but can use EM to fit mixtures of trees



                                                                49
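A hedged sketch of the idea: weight each pair of variables (here by absolute correlation, a stand-in for the mutual information a Chow-Liu-style learner would use) and take the maximum-weight spanning tree with Prim's algorithm:

```python
# Tree structure learning sketch: max-weight spanning tree over pairwise weights.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X[:, 1] += X[:, 0]                        # make some variables dependent
X[:, 3] += 0.5 * X[:, 2]

W = np.abs(np.corrcoef(X, rowvar=False))  # edge weights (proxy for mutual information)
n = W.shape[0]

in_tree = {0}
edges = []
while len(in_tree) < n:                   # Prim: add the heaviest crossing edge
    best = max(((i, j, W[i, j]) for i in in_tree for j in range(n)
                if j not in in_tree), key=lambda e: e[2])
    edges.append(best[:2])
    in_tree.add(best[1])

print(edges)                              # undirected max-weight spanning tree
```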
                 Heuristic Search

 Learning arbitrary graph structure is NP-hard.
  So it is common to resort to heuristic search
 Define a search space:
    search states are possible structures

    operators make small changes to structure

 Traverse space looking for high-scoring structures
 Search techniques:
    Greedy hill-climbing

    Best first search

    Simulated Annealing

    ...



                                                       50
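A toy greedy hill-climbing sketch over DAG structures on binary data, scored with a BIC-style score (log-likelihood minus a complexity penalty). The data, the score, and the restriction to add/delete-edge operators are simplifying assumptions, not the algorithm of any particular paper:

```python
# Greedy hill-climbing over DAGs with a BIC-style decomposable score.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, d = 500, 3
X = rng.integers(0, 2, (N, d))
X[:, 2] = X[:, 0] ^ (rng.random(N) < 0.1)      # variable 2 depends on variable 0

def family_score(i, parents):
    """BIC-style score of node i given its parent set."""
    ll, k = 0.0, 0
    for config in product([0, 1], repeat=len(parents)):
        mask = np.all(X[:, list(parents)] == config, axis=1) if parents else np.ones(N, bool)
        n = mask.sum()
        k += 1                                 # one free parameter per parent config
        if n == 0:
            continue
        n1 = X[mask, i].sum()
        for c in (n1, n - n1):
            if c > 0:
                ll += c * np.log(c / n)
    return ll - 0.5 * np.log(N) * k

def is_acyclic(parents):
    seen, done = set(), set()
    def visit(u):
        if u in done: return True
        if u in seen: return False             # back-edge: cycle found
        seen.add(u)
        ok = all(visit(p) for p in parents[u])
        done.add(u)
        return ok
    return all(visit(u) for u in range(d))

parents = {i: set() for i in range(d)}          # search starts from the empty graph
improved = True
while improved:
    improved = False
    current = sum(family_score(i, parents[i]) for i in range(d))
    for i, j in product(range(d), repeat=2):    # operators: add or delete edge i -> j
        if i == j: continue
        cand = {k: set(v) for k, v in parents.items()}
        if i in cand[j]: cand[j].discard(i)
        else:            cand[j].add(i)
        if not is_acyclic(cand): continue
        score = sum(family_score(k, cand[k]) for k in range(d))
        if score > current + 1e-9:              # greedily accept improving moves
            parents, current, improved = cand, score, True

print({j: sorted(ps) for j, ps in parents.items()})   # learned parent sets
```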
             Local Search Operations
 Typical operations: add an arc, delete an arc, or reverse an arc
  (illustrated on a small example graph over S, C, E, D)

  Because the score decomposes over families, adding the arc C → D
  changes the total score by
            S({C,E} → D)  -  S({E} → D)
                                                      51
         Problems with local search
            Easy to get stuck in local optima

   [Figure: the score S(G|D) over the search space, with a local optimum
    ("you") far from the global optimum ("truth")]




                                                          52
          Problems with local search II
            Picking a single best model can be misleading

   [Figure: the posterior P(G|D) concentrated on a single high-scoring
    structure over E, B, R, A, C]

                                                           53
                Problems with local search II
                    Picking a single best model can be misleading

   [Figure: the posterior P(G|D) spread over several structures; five
    different high-scoring models over E, B, R, A, C are shown]

         Small sample size ⇒ many high-scoring models
         An answer based on one model is often useless
         We want features common to many models                                  54
Bayesian Approach to Structure Learning

 Posterior distribution over structures
 Estimate the probability of features, e.g.:
     Edge X → Y
     Path X → … → Y
       …

           P(f | D)  =  Σ_G  f(G) P(G | D)

   where f(G) is an indicator function for the feature f (e.g., X → Y)
   and P(G | D) is the Bayesian score (posterior) for the graph G

                                                           55
Bayesian approach: computational issues

 Posterior   distribution over structures

       P(f | D)  =  Σ_G  f(G) P(G | D)


 How can we compute a sum over a super-exponential number of graphs?

 •MCMC over networks
 •MCMC over node-orderings (Rao-Blackwellisation)




                                                            56
    Structure learning: other issues
 Discovering  latent variables
 Learning causal models

 Learning from interventional data

 Active learning




                                       57
     Discovering latent variables




 [Figure: (a) a model with a latent variable, 17 parameters;  (b) the equivalent model without it, 59 parameters]

There are some techniques for automatically detecting the
possible presence of latent variables
                                                            58
          Learning causal models
 So far, we have only assumed that X -> Y -> Z
  means that Z is independent of X given Y.
 However, we often want to interpret directed arrows
  causally.
 This is uncontroversial for the arrow of time.

 But can we infer causality from static observational
  data?




                                                         59
           Learning causal models
 We  can infer causality from static observational
  data if we have at least four measured variables
  and certain “tetrad” conditions hold.
 See books by Pearl and Spirtes et al.
 However, we can only learn up to Markov
  equivalence, no matter how much data we have.

       X → Y → Z
                                  These three structures are Markov
       X ← Y ← Z                  equivalent; only the v-structure
                                  X → Y ← Z can be distinguished.
       X ← Y → Z
                                                      60
      Learning from interventional data
   The only way to distinguish between Markov equivalent
    networks is to perform interventions, e.g., gene knockouts.
   We need to (slightly) modify our learning algorithms.




         smoking                                smoking

                                                           Cut arcs coming
                                                           into nodes which
                                                           were set by
                                                           intervention
        Yellow                                  Yellow
        fingers                                 fingers

P(smoker|observe(yellow)) >> prior   P(smoker | do(paint yellow)) = prior
                                                                            61
                Active learning
 Which   experiments (interventions) should we
  perform to learn structure as efficiently as possible?
 This problem can be modeled using decision
  theory.
 Exact solutions are wildly computationally
  intractable.
 Can we come up with good approximate decision
  making techniques?
 Can we implement hardware to automatically
  perform the experiments?
 “AB: Automated Biologist”


                                                           62
        Learning from relational data
Can we learn concepts from a set of relations between objects,
instead of/ in addition to just their attributes?




                                                                 63
Learning from relational data: approaches

 Probabilistic  relational models (PRMs)
      Reify a relationship (arcs) between nodes
       (objects) by making it into a node (hypergraph)

 Inductive  Logic Programming (ILP)
      Top-down, e.g., FOIL (generalization of C4.5)
      Bottom up, e.g., PROGOL (inverse deduction)




                                                       64
ILP for learning protein folding: input
         yes                          no




TotalLength(D2mhr, 118) ^ NumberHelices(D2mhr, 6) ^ …

    100 conjuncts describing structure of each pos/neg example

                                                                 65
  ILP for learning protein folding: results

 PROGOL    learned the following rule to predict if a
 protein will form a “four-helical up-and-down
 bundle”:



  In English: “The protein P folds if it contains a long
  helix h1 at a secondary structure position between 1
  and 3, and h1 is next to a second helix”




                                                         66
             ILP: Pros and Cons
+   Can discover new predicates (concepts)
  automatically
 + Can learn relational models from relational (or
  flat) data
 - Computationally intractable

 - Poor handling of noise




                                                      67
The future of machine learning for
         bioinformatics?




           Oracle




                                     68
         The future of machine learning for
                   bioinformatics
  Prior knowledge




                                           Hypotheses
Replicated experiments

                             Learner

 Biological literature




                            Real world
                                               Expt.
                                              design
 •“Computer assisted pathway refinement”
                                                        69
The end




          70
  Decision trees
      blue?


yes                  oval?


                             no
              big?


      no             yes



                                  71
                        Decision trees
          [Figure: the same decision tree as above (blue? → oval? → big?)]

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
+ Easy to understand
- Predictive power
                                                          72
  Feedforward neural network

      input           Hidden layer                        Output




Weights on each arc                  Sigmoid function at each node:
               f( Σ_i J_i s_i ),   where  f(x) = 1 / (1 + e^(-c·x))

                                                                    73
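The forward pass of such a network in a few lines, assuming numpy; the layer sizes and random weights are placeholders:

```python
# Forward pass of a one-hidden-layer network with sigmoid units.
import numpy as np

def sigmoid(x, c=1.0):
    return 1.0 / (1.0 + np.exp(-c * x))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))       # input (3 units) -> hidden (4 units)
W2 = rng.normal(size=(1, 4))       # hidden -> output

def forward(x):
    h = sigmoid(W1 @ x)            # each hidden node computes f(sum_i J_i s_i)
    return sigmoid(W2 @ h)

print(forward(np.array([0.5, -1.0, 2.0])))
```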
            Feedforward neural network

                                  input   Hidden layer   Output




- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power




                                                                  74
                   Nearest Neighbor
   Remember all your data
   When someone asks a question,
        find the nearest old data point
        return the answer associated with it




                                                75
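A minimal 1-nearest-neighbour sketch (made-up 2-D data): store everything, answer a query with the label of the closest stored point:

```python
# 1-nearest-neighbour: remember all the data, answer with the closest point's label.
import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.8], [-1.0, -1.0], [-0.8, -1.2]])
labels = ["yes", "yes", "no", "no"]

def nearest_neighbor(query):
    dists = np.linalg.norm(X - query, axis=1)
    return labels[dists.argmin()]

print(nearest_neighbor(np.array([0.9, 1.1])))   # -> "yes"
```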
                   Nearest Neighbor



                                  ?



- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
                                      76
  Support Vector Machines (SVMs)
 Two  key ideas:
     Large margins are good
     Kernel trick




                                   77
        SVM: mathematical details

 Training data: l-dimensional vectors labelled ±1:
                      {x_i, y_i},   x_i ∈ R^l,   y_i ∈ {-1, 1}
 Separating hyperplane:    w·x + b = 0
 Margin:    d = 2 / ||w||
 Inequalities:   y_i (x_i·w + b) - 1 ≥ 0   for all i
 Support vector expansion:
                   w = Σ_i α_i y_i x_i
 Support vectors: the x_i with non-zero α_i (the points on the margin)
 Decision:   f(x) = sign(w·x + b)
                                                                             78
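The quantities above can be read off a fitted linear SVM; a sketch assuming scikit-learn, with a tiny made-up data set and a large C to approximate a hard margin:

```python
# Linear SVM sketch: recover w, b, the margin 2/||w||, and the support vectors.
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)     # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]

print("margin d = 2/||w|| =", 2 / np.linalg.norm(w))
print("support vectors:", clf.support_vectors_)
print("decision:", np.sign(X @ w + b))          # agrees with clf.predict(X)
```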
Replace all inner products with kernels




              Kernel function




                                          79
                     SVMs: summary

 - Handles mixed variables
 - Handles missing data
 - Efficient for large data sets
 - Handles irrelevant attributes
 - Easy to understand
 + Predictive power




General lessons from SVM success:

•Kernel trick can be used to make many linear methods non-linear e.g.,
kernel PCA, kernelized mutual information

•Large margin classifiers are good
                                                                     80
                Boosting: summary
 Can boost any weak learner
 Most commonly: boosted decision “stumps”


+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
- Easy to understand
+ Predictive power




                                             81
     Supervised learning: summary
 Learn  mapping F from inputs to outputs using a
  training set of (x,t) pairs
 F can be drawn from different hypothesis spaces,
  e.g., decision trees, linear separators, linear in high
  dimensions, mixtures of linear
 Algorithms offer a variety of tradeoffs

 Many good books, e.g.,

     “The elements of statistical learning”,

      Hastie, Tibshirani, Friedman, 2001
     “Pattern classification”, Duda, Hart, Stork, 2001


                                                            82
                     Inference
Posterior    probabilities
     Probability of any event given any evidence
Most    likely explanation
     Scenario that explains evidence
 Rational decision making
     Maximize expected utility
     Value of information
 Effect of intervention

      [Figure: the Earthquake / Burglary / Radio / Alarm / Call network]


                                                              83
        Assumption needed to make
             learning work
 We  need to assume “Future futures will resemble
  past futures” (B. Russell)
 Unlearnable hypothesis: “All emeralds are grue”,
  where “grue” means:
  green if observed before time t, blue afterwards.




                                                      84
Structure learning success stories: gene
regulation network (Friedman et al.)




Yeast data
 [Hughes et al 2000]
 600 genes
 300 experiments

                                           85
 Structure learning success stories II: Phylogenetic Tree
             Reconstruction (Friedman et al.)
Input: Biological sequences
      Human       CGTTGC…              Uses structural EM,
                                       with max-spanning-tree
      Chimp       CCTAGG…              in the inner loop
      Orang       CGAACG…
      ….
Output: a phylogeny



                                leaf


                                                                86
         Instances of graphical models
                                                  Probabilistic models
                             Graphical models
Naïve Bayes classifier


                         Directed               Undirected


                    Bayes nets                     MRFs

Mixtures
                          DBNs
of experts

Kalman filter
model                                                 Ising model
                  Hidden Markov Model (HMM)

                                                                         87
          ML enabling technologies
 Faster computers
 More data
    The web

    Parallel corpora (machine translation)

    Multiple sequenced genomes

    Gene expression arrays

 New ideas
    Kernel trick

    Large margins

    Boosting

    Graphical models

    …


                                              88

								