# An introduction to machine learning and probabilistic graphical models

Kevin Murphy
MIT AI Lab

Presented at Intel’s workshop on “Machine learning for the life sciences”, Berkeley, CA, 3 November 2003
1
Overview

• Supervised learning
• Unsupervised learning
• Graphical models
• Learning relational models

Thanks to Nir Friedman, Stuart Russell, Leslie Kaelbling and various web sources for letting me use many of their slides
2
Supervised learning

[Figure: example objects labeled “yes” or “no”]

Color   Shape    Size    Output
Blue    Torus    Big     Y
Blue    Square   Small   Y
Blue    Star     Small   Y
Red     Arrow    Small   N

Learn to approximate a function F(x1, x2, x3) -> t from a training set of (x, t) pairs
3
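A minimal sketch (assuming scikit-learn is available; the encoder and classifier choices are mine, not the slide’s): fit a classifier to the toy table above and predict the label of the first test example from the next slide.

```python
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

X = [["Blue", "Torus", "Big"],
     ["Blue", "Square", "Small"],
     ["Blue", "Star", "Small"],
     ["Red", "Arrow", "Small"]]
t = ["Y", "Y", "Y", "N"]

enc = OrdinalEncoder()                       # map category names to integers
clf = DecisionTreeClassifier().fit(enc.fit_transform(X), t)

# Predict the first test row from the next slide: (B, A, S)
print(clf.predict(enc.transform([["Blue", "Arrow", "Small"]])))
```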
Supervised learning

Training data
X1   X2   X3   T
B    T    B    Y
B    S    S    Y
B    S    S    Y
R    A    S    N

Testing data
X1   X2   X3   T
B    A    S    ?
Y    C    S    ?

[Figure: the training data feed a Learner, which outputs a Hypothesis; the hypothesis maps each test input to a Prediction (Y or N)]
4
Key issue: generalization

[Figure: novel objects whose labels (“yes” or “no”) must be predicted]

Can’t just memorize the training set (overfitting)
5
Hypothesis spaces
• Decision trees
• Neural networks
• K-nearest neighbors
• Naïve Bayes classifier
• Support vector machines (SVMs)
• Boosted decision stumps
• …

6
Perceptron
(a neural net with no hidden layers)

[Figure: linearly separable data split by a learned line]

7
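A minimal sketch (assuming NumPy; the data and epoch cap are illustrative): the classic perceptron update, which nudges the weights whenever a point is misclassified, until the data are separated.

```python
import numpy as np

def perceptron(X, y, epochs=100):
    """X: (n, d) inputs; y: labels in {-1, +1}. Returns (weights, bias)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # rotate the hyperplane toward xi
                b += yi
                mistakes += 1
        if mistakes == 0:                # converged: data fully separated
            break
    return w, b

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -1.0], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
print(perceptron(X, y))
```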
Which separating hyperplane?

[Figure: several hyperplanes that all separate the same data]

8
The linear separator with the largest margin is the best one to pick

[Figure: the maximum-margin separator; the margin is the gap to the nearest points of each class]
9
What if the data is not linearly separable?

10
Kernel trick

$$\phi:\; \begin{pmatrix} x \\ y \end{pmatrix} \mapsto \begin{pmatrix} x^2 \\ \sqrt{2}\,xy \\ y^2 \end{pmatrix}$$

[Figure: 2D data (axes x1, x2) mapped into 3D (axes z1, z2, z3), where a plane separates the classes]

The kernel implicitly maps from 2D to 3D, making the problem linearly separable
11
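A minimal sketch (assuming NumPy): the point of the trick is that the kernel K(a, b) = (a·b)² equals the inner product of the explicit 3D features above, so an algorithm that only uses inner products never needs to construct the map.

```python
import numpy as np

def phi(v):                      # explicit 2D -> 3D feature map from the slide
    x, y = v
    return np.array([x * x, np.sqrt(2) * x * y, y * y])

def K(a, b):                     # kernel: the same inner product, computed in 2D
    return np.dot(a, b) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, 0.5])
print(phi(a) @ phi(b))           # 16.0
print(K(a, b))                   # 16.0 -- identical
```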
Support Vector Machines (SVMs)
• Two key ideas:
  • Large margins
  • Kernel trick

12
Boosting

Simple classifiers (weak learners) can have their performance
boosted by taking weighted combinations

Boosting maximizes the margin
13
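A minimal sketch (assuming scikit-learn; the dataset and estimator settings are illustrative): boosting depth-1 decision trees (“stumps”) into a weighted committee, as described above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
stump = DecisionTreeClassifier(max_depth=1)              # the weak learner
model = AdaBoostClassifier(stump, n_estimators=100).fit(X, y)
print(model.score(X, y))                                 # committee accuracy
```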
Supervised learning success stories
• Face detection
• Steering an autonomous car across the US
• Detecting credit card fraud
• Medical diagnosis
• …

14
Unsupervised learning
• What if there are no output labels?

15
K-means clustering
1. Guess the number of clusters, K
2. Guess initial cluster centers, μ1, μ2

Iterate (see the sketch below):
3. Assign data points xi to the nearest cluster center
4. Re-compute cluster centers based on the assignments

16
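A minimal sketch (assuming NumPy; the data and initialization scheme are illustrative) of the assign/re-compute loop above:

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]      # step 2: initial centers
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)                          # step 3: nearest center
        new = np.array([X[assign == j].mean(axis=0) if np.any(assign == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):                      # assignments settled
            break
        centers = new                                      # step 4: re-compute
    return centers, assign

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
print(kmeans(X, 2)[0])
```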
AutoClass (Cheeseman et al., 1986)
• EM algorithm for mixtures of Gaussians
• “Soft” version of K-means (see the sketch below)
• Uses a Bayesian criterion to select K
• Discovered new types of stars from spectral data
• Discovered new classes of proteins and introns from DNA/protein sequence databases

17
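A minimal sketch (assuming scikit-learn; the data are illustrative): EM for a mixture of Gaussians, the “soft” K-means above — each point gets a responsibility for every cluster instead of a hard assignment.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])
gmm = GaussianMixture(n_components=2).fit(X)   # fit by EM
print(gmm.means_)                              # the cluster centers
print(gmm.predict_proba(X[:3]))                # soft assignments per point
```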
Hierarchical clustering

18
Principal Component Analysis (PCA)

PCA seeks a projection that best represents the data in a least-squares sense.

PCA reduces the dimensionality of feature space by restricting attention to those directions along which the scatter of the cloud is greatest. (A sketch follows below.)
19
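A minimal sketch (assuming NumPy): PCA via the SVD of the centered data; the top right-singular vectors are exactly the directions of greatest scatter.

```python
import numpy as np

def pca(X, k):
    Xc = X - X.mean(axis=0)                   # center the cloud
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T, Vt[:k]              # projections, principal directions

X = np.random.randn(200, 5)
Z, dirs = pca(X, 2)
print(Z.shape, dirs.shape)                    # (200, 2) (2, 5)
```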
Discovering nonlinear manifolds

20
Combining supervised and unsupervised
learning

21
Discovering rules (data mining)
Occup.    Income   Educ.   Sex   Married   Age
Student   $10k     MA      M     S         22
Student   $20k     PhD     F     S         24
Doctor    $80k     MD      M     M         30
Retired   $30k     HS      F     M         60

Find the most frequent patterns (association rules):

Num in household = 1 ^ num children = 0 => language = English

Language = English ^ Income < $40k ^ Married = false ^ num children = 0 => education ∈ {college, grad school}

22
Unsupervised learning: summary
• Clustering
• Hierarchical clustering
• Linear dimensionality reduction (PCA)
• Non-linear dimensionality reduction
• Learning rules

23
Discovering networks

[Figure: an unknown network, marked “?”, to be inferred from data]

From data visualization to causal discovery
24
Networks in biology
• Most processes in the cell are controlled by networks of interacting molecules:
  • Metabolic networks
  • Signal transduction networks
  • Regulatory networks
• Networks can be modeled at multiple levels of detail/realism (in decreasing detail):
  • Molecular level
  • Concentration level
  • Qualitative level

25
Molecular level: Lysis-Lysogeny circuit in Lambda phage

Arkin et al. (1998), Genetics 149(4):1633-48

5 genes, 67 parameters based on 50 years of research
Stochastic simulation required a supercomputer
26
Concentration level: metabolic pathways

• Usually modeled with differential equations (see the sketch below)

[Figure: a five-gene network g1…g5 with weighted edges w_ij]

27
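A minimal, hypothetical sketch (assuming SciPy; the dynamics dx/dt = tanh(Wx) − x are illustrative, not the model on the slide): integrating a small weighted-interaction ODE like the g1…g5 network.

```python
import numpy as np
from scipy.integrate import solve_ivp

W = np.array([[0.0, 1.2],        # W[i, j] plays the role of the edge weights w_ij
              [-0.8, 0.0]])

def dxdt(t, x):
    return np.tanh(W @ x) - x    # assumed form: production minus first-order decay

sol = solve_ivp(dxdt, (0.0, 10.0), [0.5, 0.1])
print(sol.y[:, -1])              # concentrations at t = 10
```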
Qualitative level: Boolean Networks

28
Probabilistic graphical models
• Support graph-based modeling at various levels of detail
• Models can be learned from noisy, partial data
• Can model “inherently” stochastic phenomena, e.g., molecular-level fluctuations…
• But can also model deterministic, causal processes.
"The actual science of logic is conversant at present only with
things either certain, impossible, or entirely doubtful. Therefore
the true logic for this world is the calculus of probabilities."
-- James Clerk Maxwell

"Probability theory is nothing but common sense reduced to
calculation." -- Pierre Simon Laplace
29
Graphical models: outline
• What are graphical models?
• Inference
• Structure learning

30
Simple probabilistic model: linear regression

Y = α + β·X + noise

[Figure: scatter of (x, y) pairs around a line; the line Y = α + β·X is the deterministic (functional) part of the relationship, the noise is the scatter around it]

31
Simple probabilistic model: linear regression

Y = α + β·X + noise

“Learning” = estimating the parameters α, β, σ from (x, y) pairs:
• α is the empirical mean
• β can be estimated by least squares
• σ is the residual variance

(A sketch follows below.)
32
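A minimal sketch (assuming NumPy; the synthetic data are illustrative): estimate α and β by least squares and σ from the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 0.3, 100)   # truth: alpha=2, beta=0.5, sigma=0.3

A = np.column_stack([np.ones_like(x), x])     # design matrix [1, x]
(alpha, beta), *_ = np.linalg.lstsq(A, y, rcond=None)
sigma = np.std(y - (alpha + beta * x))        # residual spread
print(alpha, beta, sigma)
```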
Piecewise linear regression

Latent “switch” variable – hidden process at work
33
Probabilistic graphical model for piecewise linear regression

[Graph: X (input) -> Q; X -> Y; Q -> Y (output)]

• Hidden variable Q chooses which set of parameters to use for predicting Y.
• The value of Q depends on the value of the input X.
• This is an example of “mixtures of experts”.

Learning is harder because Q is hidden, so we don’t know which data points to assign to each line; this can be solved with EM (c.f. K-means)
34
Classes of graphical models

Probabilistic models
  Graphical models
    Directed: Bayes nets
      DBNs
    Undirected: MRFs
35
Bayesian Networks

Compact representation of probability distributions via conditional independence

Family-of-Alarm example:

Qualitative part: a directed acyclic graph (DAG)
• Nodes: random variables (Earthquake, Burglary, Radio, Alarm, Call)
• Edges: direct influence (Earthquake -> Radio, Earthquake -> Alarm, Burglary -> Alarm, Alarm -> Call)

Quantitative part: a set of conditional probability distributions, e.g. P(A | E, B):

E    B    P(a | E, B)   P(¬a | E, B)
e    b    0.9           0.1
e    ¬b   0.2           0.8
¬e   b    0.9           0.1
¬e   ¬b   0.01          0.99

Together, these define a unique distribution in factored form:

P(B, E, A, C, R) = P(B) P(E) P(A | B, E) P(R | E) P(C | A)
36
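A minimal sketch (plain Python): evaluating the factored joint for one assignment. The P(A | E, B) numbers follow the table above; the priors P(B), P(E) and the CPTs P(R | E), P(C | A) are illustrative assumptions, since the slide does not give them.

```python
pB, pE = 0.01, 0.02                      # assumed priors P(b), P(e)
pA = {(True, True): 0.9, (False, True): 0.2,     # P(a | B, E), keyed by (B, E),
      (True, False): 0.9, (False, False): 0.01}  # taken from the CPT above
pR = {True: 0.95, False: 0.001}          # assumed P(r | E)
pC = {True: 0.7, False: 0.05}            # assumed P(c | A)

def bern(p, v):
    """P(V = v) for a binary variable with P(V = true) = p."""
    return p if v else 1.0 - p

def joint(b, e, a, c, r):
    """P(B,E,A,C,R) = P(B) P(E) P(A|B,E) P(R|E) P(C|A)."""
    return (bern(pB, b) * bern(pE, e) * bern(pA[(b, e)], a) *
            bern(pR[e], r) * bern(pC[a], c))

print(joint(b=True, e=False, a=True, c=True, r=False))
```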
Example: “ICU Alarm” network

Domain: monitoring intensive-care patients
• 37 variables
• 509 parameters … instead of 2^54

[Figure: the 37-node ALARM Bayesian network (MINVOLSET, PULMEMBOLUS, INTUBATION, KINKEDTUBE, VENTMACH, DISCONNECT, …, HRBP, BP)]
37
Success stories for graphical models
• Multiple sequence alignment
• Forensic analysis
• Medical and fault diagnosis
• Speech recognition
• Visual tracking
• Channel coding at the Shannon limit
• Genetic pedigree analysis
• …

38
Graphical models: outline
• What are graphical models? ✓
• Inference
• Structure learning

39
Probabilistic Inference
• Posterior probabilities: the probability of any event given any evidence, P(X | E)

[Figure: the alarm network — Earthquake -> Radio, Earthquake -> Alarm <- Burglary, Alarm -> Call]

40
Viterbi decoding

Compute the most probable explanation (MPE) of observed data

Hidden Markov Model (HMM):

X1 -> X2 -> X3     (hidden)
 |     |     |
Y1    Y2    Y3     (observed)

“Tomato”

41
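A minimal sketch (assuming NumPy; the HMM parameters are illustrative): Viterbi decoding as dynamic programming over log-probabilities, with backpointers to recover the most probable hidden path.

```python
import numpy as np

pi = np.array([0.6, 0.4])             # P(X1)
A  = np.array([[0.7, 0.3],            # A[i, j] = P(X_{t+1} = j | X_t = i)
               [0.4, 0.6]])
B  = np.array([[0.5, 0.4, 0.1],       # B[i, k] = P(Y_t = k | X_t = i)
               [0.1, 0.3, 0.6]])

def viterbi(obs):
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))          # best log-prob of any path ending in state j
    psi = np.zeros((T, N), dtype=int) # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)   # scores[i, j]: arrive at j via i
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta[-1].argmax())]                 # best final state...
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))           # ...then walk backwards
    return path[::-1]

print(viterbi([0, 1, 2]))             # most probable explanation of the observations
```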
Inference: computational issues

Easy: chains, trees, grids
Hard: dense, loopy graphs

[Figure: example graphs, from chains, trees, and grids up to the densely connected ICU Alarm network]

42
Many different inference algorithms exist, both exact and approximate.
43
Bayesian inference
• Bayesian probability treats parameters as random variables
• Learning/parameter estimation is replaced by probabilistic inference of P(θ | D)
• Example: Bayesian linear regression; the parameters are θ = (α, β, σ)

[Figure: a plate model — a single θ node points into every pair (Xi -> Yi), for i = 1…n; the parameters are tied (shared) across repetitions of the data]
44
Bayesian inference
+ Elegant – no distinction between parameters and other hidden variables
+ Can use priors to learn from small data sets (c.f. one-shot learning by humans)
- Math can get hairy
- Often computationally intractable

45
Graphical models: outline
• What are graphical models? ✓
• Inference ✓
• Structure learning

46
Why Struggle for Accurate Structure?

[Figure: true network — Earthquake -> Alarm Set <- Burglary, Alarm Set -> Sound — shown next to versions with an arc missing and an arc added]

Missing an arc:
• Cannot be compensated for by fitting the parameters
• Wrong assumptions about the domain structure

Adding an arc:
• Increases the number of parameters to be estimated
• Wrong assumptions about the domain structure
47
Score-based Learning

Define a scoring function that evaluates how well a structure matches the data:

E, B, A
<Y,N,N>
<Y,Y,Y>
<N,N,Y>
<N,Y,Y>
.
.
<N,Y,Y>

[Figure: candidate structures over E, B, A, ranked by score]

Search for a structure that maximizes the score. (A BIC-style scoring sketch follows below.)
48
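A minimal sketch (plain Python; the BIC-style score is one common choice, and the toy data mirror the slide): score a candidate structure by the log-likelihood of the data under ML-fitted CPTs, penalized by the number of parameters.

```python
import math
from collections import Counter

# The five records over (E, B, A) from the slide
data = [("Y","N","N"), ("Y","Y","Y"), ("N","N","Y"), ("N","Y","Y"), ("N","Y","Y")]
names = ("E", "B", "A")
idx = {v: i for i, v in enumerate(names)}

def bic(parents_of):
    """Sum over families of log P(child | parents), minus a BIC penalty."""
    n, loglik, nparams = len(data), 0.0, 0
    for child, parents in parents_of.items():
        ci, pis = idx[child], [idx[p] for p in parents]
        joint = Counter((tuple(r[i] for i in pis), r[ci]) for r in data)
        marg = Counter(tuple(r[i] for i in pis) for r in data)
        for (pa, _), cnt in joint.items():
            loglik += cnt * math.log(cnt / marg[pa])   # ML estimate of P(child | pa)
        nparams += len(marg)        # ~one free parameter per observed parent config
    return loglik - 0.5 * nparams * math.log(n)

# Compare two candidate structures: E -> A <- B versus all-independent
print(bic({"E": [], "B": [], "A": ["E", "B"]}))
print(bic({"E": [], "B": [], "A": []}))
```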
Learning Trees

• Can find the optimal tree structure in O(n² log n) time: just find the max-weight spanning tree (see the sketch below)
• If some of the variables are hidden, the problem becomes hard again, but EM can be used to fit mixtures of trees

49
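A minimal sketch (assuming SciPy; the pairwise mutual-information weights are illustrative numbers): the max-weight spanning tree obtained by running a minimum-spanning-tree routine on negated weights.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

MI = np.array([[0.0, 0.9, 0.2, 0.1],    # MI[i, j]: mutual information
               [0.9, 0.0, 0.7, 0.3],    # between variables i and j
               [0.2, 0.7, 0.0, 0.6],
               [0.1, 0.3, 0.6, 0.0]])

tree = minimum_spanning_tree(-MI)       # negate weights: max <-> min
print(np.transpose(tree.nonzero()))     # edges of the optimal tree
```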
Heuristic Search

• Learning arbitrary graph structure is NP-hard, so it is common to resort to heuristic search
• Define a search space:
  • search states are possible structures
  • operators make small changes to a structure
• Traverse the space looking for high-scoring structures
• Search techniques:
  • Greedy hill-climbing
  • Best-first search
  • Simulated annealing
  • …

50
Local Search Operations

Typical operations on a structure over S, C, E, D: add an arc, delete an arc, reverse an arc.

The score change is local; e.g., changing D’s parent set from {E} to {C, E} costs

Δscore = S({C,E} -> D) − S({E} -> D)
51
Problems with local search

Easy to get stuck in local optima

[Figure: the score surface S(G|D); greedy search (“you”) stalls on a local peak far from the “truth”]

52
Problems with local search II

Picking a single best model can be misleading

[Figure: a posterior P(G|D) over structures, with a single structure over E, B, R, A, C singled out]

53
Problems with local search II

Picking a single best model can be misleading

[Figure: the posterior P(G|D) spread across many high-scoring structures over E, B, R, A, C]

• Small sample size => many high-scoring models
• An answer based on one model is often useless
• Want features common to many models
54
Bayesian Approach to Structure Learning

• Posterior distribution over structures
• Estimate the probability of features:
  • edge X -> Y
  • path X -> … -> Y
  • …

$$P(f \mid D) = \sum_G f(G)\, P(G \mid D)$$

where f(G) is an indicator function for the feature (e.g., X -> Y) and P(G | D) is the Bayesian score for G.

55
Bayesian approach: computational issues

• Posterior distribution over structures:

$$P(f \mid D) = \sum_G f(G)\, P(G \mid D)$$

How can we compute a sum over a super-exponential number of graphs?
• MCMC over networks
• MCMC over node orderings (Rao-Blackwellisation)

56
Structure learning: other issues
• Discovering latent variables
• Learning causal models
• Learning from interventional data
• Active learning

57
Discovering latent variables

[Figure: (a) a model with a latent variable, 17 parameters; (b) the same dependencies with the latent variable removed, 59 parameters]

There are some techniques for automatically detecting the possible presence of latent variables
58
Learning causal models
• So far, we have only assumed that X -> Y -> Z means that Z is independent of X given Y.
• However, we often want to interpret directed arrows causally.
  • This is uncontroversial for the arrow of time.
  • But can we infer causality from static observational data?
59
Learning causal models
• We can infer causality from static observational data if we have at least four measured variables and certain “tetrad” conditions hold.
  • See the books by Pearl and by Spirtes et al.
• However, we can only learn up to Markov equivalence, no matter how much data we have.

[Figure: the Markov-equivalent chains X -> Y -> Z, X <- Y <- Z, and X <- Y -> Z, contrasted with the distinguishable v-structure X -> Y <- Z]
60
Learning from interventional data
• The only way to distinguish between Markov-equivalent networks is to perform interventions, e.g., gene knockouts.
• We need to (slightly) modify our learning algorithms: cut the arcs coming into nodes which were set by intervention.

[Figure: smoking -> yellow fingers; under the intervention “paint the fingers yellow”, the incoming arc is cut]

P(smoker | observe(yellow)) >> prior        P(smoker | do(paint yellow)) = prior
61
Active learning
• Which experiments (interventions) should we perform to learn structure as efficiently as possible?
• This problem can be modeled using decision theory.
• Exact solutions are wildly computationally intractable.
• Can we come up with good approximate decision-making techniques?
• Can we implement hardware to automatically perform the experiments?
  • “AB: Automated Biologist”

62
Learning from relational data

Can we learn concepts from a set of relations between objects, instead of (or in addition to) just their attributes?

63
Learning from relational data: approaches

• Probabilistic relational models (PRMs)
  • Reify a relationship (arcs) between nodes (objects) by making it into a node (hypergraph)
• Inductive Logic Programming (ILP)
  • Top-down, e.g., FOIL (a generalization of C4.5)
  • Bottom-up, e.g., PROGOL (inverse deduction)

64
ILP for learning protein folding: input

[Figure: positive (“yes”) and negative (“no”) example protein structures]

TotalLength(D2mhr, 118) ^ NumberHelices(D2mhr, 6) ^ …

100 conjuncts describing the structure of each pos/neg example

65
ILP for learning protein folding: results

• PROGOL learned the following rule to predict whether a protein will form a “four-helical up-and-down bundle”:
• In English: “The protein P folds if it contains a long helix h1 at a secondary structure position between 1 and 3, and h1 is next to a second helix”

66
ILP: Pros and Cons
+ Can discover new predicates (concepts) automatically
+ Can learn relational models from relational (or flat) data
- Computationally intractable
- Poor handling of noise

67
The future of machine learning for bioinformatics?

[Figure: an all-knowing “Oracle”]

68
The future of machine learning for bioinformatics

[Figure: a closed loop — prior knowledge, biological literature, and replicated experiments feed a Learner; the Learner proposes hypotheses and experiment designs that are tested in the real world]

• “Computer assisted pathway refinement”
69
The end

70
Decision trees

[Figure: a decision tree over the shape data — test “blue?”, then “oval?” and “big?”, with yes/no leaves]

71
Decision trees

[Figure: the same decision tree as above]

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
+ Easy to understand
- Predictive power

72
Feedforward neural network

[Figure: input layer -> hidden layer -> output, with weights on each arc and a sigmoid at each node]

Each node computes $f\big(\sum_i J_i s_i\big)$, with the sigmoid $f(x) = 1/(1 + e^{-cx})$

73
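A minimal sketch (assuming NumPy; the weights are random placeholders standing in for the learned J’s): one forward pass through a single-hidden-layer network using the sigmoid above.

```python
import numpy as np

def sigmoid(x, c=1.0):
    return 1.0 / (1.0 + np.exp(-c * x))

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))       # input (3 units) -> hidden (4 units)
W2 = rng.normal(size=(1, 4))       # hidden (4 units) -> output (1 unit)

x = np.array([0.5, -1.0, 2.0])
h = sigmoid(W1 @ x)                # each hidden node computes f(sum_i J_i s_i)
y = sigmoid(W2 @ h)                # the output node does the same
print(y)
```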
Feedforward neural network

[Figure: input -> hidden layer -> output]

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

74
Nearest Neighbor
• Remember all your data
• When someone asks a question:
  • find the nearest old data point
  • return the answer associated with it

75
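A minimal sketch (assuming NumPy; the stored points are illustrative): 1-nearest-neighbor exactly as described — store everything, answer with the label of the closest stored point.

```python
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [6.0, 5.0]])  # remembered data
y = np.array(["no", "no", "yes", "yes"])

def nearest_neighbor(q):
    d = np.linalg.norm(X - q, axis=1)     # distance to every stored point
    return y[d.argmin()]                  # answer of the nearest one

print(nearest_neighbor(np.array([5.5, 4.8])))   # -> "yes"
```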
Nearest Neighbor

[Figure: a query point “?” among labeled points]

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power
76
Support Vector Machines (SVMs)
• Two key ideas:
  • Large margins are good
  • Kernel trick

77
SVM: mathematical details

• Training data: l-dimensional vectors with a true/false flag: $\{x_i, y_i\},\; x_i \in \mathbb{R}^l,\; y_i \in \{-1, 1\}$
• Separating hyperplane: $w \cdot x + b = 0$
• Margin: $d = 2 / \lVert w \rVert$
• Inequalities: $y_i (x_i \cdot w + b) - 1 \ge 0,\; \forall i$
• Support vector expansion: $w = \sum_i \alpha_i x_i$
• Support vectors: the training points $x_i$ with $\alpha_i \neq 0$ (they lie on the margin)
• Decision: $\mathrm{sign}(w \cdot x + b)$

[Figure: the margin between the two classes]

78
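A minimal sketch (assuming scikit-learn; the toy data are illustrative): after fitting a linear SVM, the support vectors are the training points with nonzero α, and the decision is sign(w·x + b).

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1, 1, -1, -1])

svm = SVC(kernel="linear", C=1e3).fit(X, y)
print(svm.support_vectors_)           # the points that pin down the margin
print(svm.coef_, svm.intercept_)      # w and b of the separating hyperplane
print(svm.predict([[1.0, 1.5]]))      # sign(w . x + b)
```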
Replace all inner products with kernels

[Figure: the SVM training and decision formulas with every inner product $x_i \cdot x_j$ replaced by a kernel function $K(x_i, x_j)$]

79
SVMs: summary

- Handles mixed variables
- Handles missing data
- Efficient for large data sets
- Handles irrelevant attributes
- Easy to understand
+ Predictive power

General lessons from SVM success:
• The kernel trick can be used to make many linear methods non-linear, e.g., kernel PCA, kernelized mutual information
• Large-margin classifiers are good
80
Boosting: summary
• Can boost any weak learner
• Most commonly: boosted decision “stumps”

+ Handles mixed variables
+ Handles missing data
+ Efficient for large data sets
+ Handles irrelevant attributes
- Easy to understand
+ Predictive power

81
Supervised learning: summary
• Learn a mapping F from inputs to outputs using a training set of (x, t) pairs
• F can be drawn from different hypothesis spaces, e.g., decision trees, linear separators, linear in high dimensions, mixtures of linear
• Algorithms offer a variety of tradeoffs
• Many good books, e.g.,
  • “The Elements of Statistical Learning”, Hastie, Tibshirani, Friedman, 2001
  • “Pattern Classification”, Duda, Hart, Stork, 2001

82
Inference
• Posterior probabilities: the probability of any event given any evidence
• Most likely explanation: the scenario that explains the evidence
• Rational decision making: maximize expected utility; value of information
• Effect of intervention

[Figure: the alarm network — Earthquake -> Radio, Earthquake -> Alarm <- Burglary, Alarm -> Call]

83
Assumption needed to make learning work
• We need to assume that “future futures will resemble past futures” (B. Russell)
• Unlearnable hypothesis: “All emeralds are grue”, where “grue” means: green if observed before time t, blue afterwards.

84
Structure learning success stories: gene regulation network (Friedman et al.)

Yeast data [Hughes et al., 2000]
• 600 genes
• 300 experiments

85
Structure learning success stories II: Phylogenetic Tree Reconstruction (Friedman et al.)

Input: biological sequences
Human   CGTTGC…
Chimp   CCTAGG…
Orang   CGAACG…
…

Uses structural EM, with a max-spanning-tree step in the inner loop

Output: a phylogeny

[Figure: a phylogenetic tree; the observed sequences sit at the leaves]

86
Instances of graphical models

Probabilistic models
  Graphical models
    Directed (Bayes nets): naïve Bayes classifier, mixtures of experts
      DBNs: Kalman filter model, hidden Markov model (HMM)
    Undirected (MRFs): Ising model

87
ML enabling technologies
• Faster computers
• More data
  • The web
  • Parallel corpora (machine translation)
  • Multiple sequenced genomes
  • Gene expression arrays
• New ideas
  • Kernel trick
  • Large margins
  • Boosting
  • Graphical models
  • …

88
