# Slide 1 - DIMACS

Inferring Data Inter-Relationships
Via Fast Hierarchical Models

Lawrence Carin
Duke University
www.ece.duke.edu/~lcarin
Sensors Deployed Previously Across Globe

Previous deployments

New deployment

Deploy to New Location. Can Algorithm Infer Which Data from
Past is Most Relevant for New Sensing Task?
Semi-Supervised & Active Learning

• Enormous quantity of unlabeled data → exploit context via semi-supervised learning

• Focus the analyst on most-informative data → active learning
Technology Employed & Motivation

• Appropriately exploit related data from previous experience over sensor “lifetime”

- Transfer learning

• Place learning with labeled data in the context of unlabeled data, thereby
exploiting manifold information

- Semi-supervised learning

• Reduce load on analyst: only request labeled data on subset of data for which
label acquisition would be most informative

- Active learning
Bayesian Hierarchical Models:
Dirichlet Processes

• Principled setting for transfer learning

• Avoids problems with model selection

- Number of mixture components

- Number of HMM states

[iGMM: Rasmussen, 00], [iHMM: Teh et al., 04, 06], [Escobar & West, 95]
Data Sharing: Stick-Breaking View of DP – 1/2

• The Dirichlet process (DP) is a prior on a density function, i.e., G(Θ) ~ DP[α, G₀(Θ)]

• One draw of G(Θ) from DP[α, G₀(Θ)]:
G(Θ) = ∑_{k=1}^∞ π_k δ(Θ − Θ*_k),        ∑_{k=1}^∞ π_k = 1

ν_k ~ Beta(1, α),    π_k = ν_k ∏_{i=1}^{k−1} (1 − ν_i),    Θ*_k ~ G₀

[Diagram: breaking a unit-length stick — π₁ = ν₁, π₂ = ν₂(1 − ν₁), …]
[Sethuraman, 94]
Data Sharing: Stick-Breaking View of DP – 2/2
G(Θ) = ∑_{k=1}^∞ π_k δ(Θ − Θ*_k),        ∑_{k=1}^∞ π_k = 1

ν_k ~ Beta(1, α),    π_k = ν_k ∏_{i=1}^{k−1} (1 − ν_i),    Θ*_k ~ G₀
• As α → 0, Beta(1, α) is more likely to yield large ν_k, implying more sharing: a few larger “sticks”, with corresponding likely parameters Θ*_k

• As α → ∞, the sticks are very small and roughly the same size, so G(Θ) reduces to G₀
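A truncated stick-breaking draw is easy to simulate, which makes the effect of α concrete. The sketch below is a minimal illustration using only the standard library; the truncation level and the particular α values are illustrative choices, not part of the slides.

```python
import random

def stick_breaking_weights(alpha, truncation, seed=0):
    """Truncated stick-breaking draw from DP[alpha, G0]:
    nu_k ~ Beta(1, alpha), pi_k = nu_k * prod_{i<k} (1 - nu_i)."""
    rng = random.Random(seed)
    weights, remaining = [], 1.0
    for _ in range(truncation - 1):
        nu = rng.betavariate(1.0, alpha)
        weights.append(nu * remaining)
        remaining *= 1.0 - nu
    weights.append(remaining)  # leftover stick mass goes to the last atom
    return weights

# Small alpha: a few large sticks (more sharing).
few = stick_breaking_weights(alpha=0.1, truncation=50)
# Large alpha: many small, roughly equal sticks, so G approaches G0.
many = stick_breaking_weights(alpha=50.0, truncation=50)
```

With α = 0.1 almost all of the mass typically lands on the first few sticks, while with α = 50 no single stick dominates, matching the two bullets above.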
Non-Parametric Mixture Models
- Data sample d_i is drawn from a Gaussian/HMM with associated parameters Θ_i

- The posterior on model parameters indicates which parameters are shared, yielding a
Gaussian/HMM mixture model; no model selection on the number of mixture components

d_i ~ F(d | Θ_i),    Θ_i ~ G(Θ) = ∑_{k=1}^∞ π_k δ(Θ − Θ*_k),    G ~ DP[α, G₀(Θ)]

π | α ~ Beta(1, α)    (stick-breaking)
z_i | π ~ Mult(π)
{Θ*_k}_{k=1,…,∞} | G₀ ~ G₀
d_i | z_i, {Θ*_k}_{k=1,…,∞} ~ F(Θ*_{z_i})    (F: Gaussian or HMM)

[Diagram: graphical model with α → π → z_i and G₀ → Θ*_k feeding d_i, i = 1, …, n]
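This generative process can be sketched for the Gaussian case of F. All constants below — the base measure G₀ = N(0, 10²) on component means, the unit observation variance, and the truncation level — are illustrative assumptions, not values from the slides.

```python
import random

def draw_dp_gaussian_mixture(n, alpha, seed=0, truncation=100):
    """Stick-breaking weights pi, atoms Theta*_k ~ G0, indicators
    z_i ~ Mult(pi), then each sample d_i from its atom's Gaussian."""
    rng = random.Random(seed)
    pis, remaining = [], 1.0
    for _ in range(truncation - 1):
        nu = rng.betavariate(1.0, alpha)
        pis.append(nu * remaining)
        remaining *= 1.0 - nu
    pis.append(remaining)
    atoms = [rng.gauss(0.0, 10.0) for _ in range(truncation)]  # means ~ G0
    z = rng.choices(range(truncation), weights=pis, k=n)       # z_i ~ Mult(pi)
    data = [rng.gauss(atoms[k], 1.0) for k in z]               # d_i ~ N(atom, 1)
    return data, z

data, z = draw_dp_gaussian_mixture(n=200, alpha=1.0)
# The 200 samples typically occupy only a handful of the 100 candidate atoms,
# so the number of mixture components is inferred rather than fixed a priori.
```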
Dirichlet Process as a Shared Prior
p(Θ₁, Θ₂, …, Θ_n | D, α, G₀) =
    p(D | Θ₁, …, Θ_n) p(Θ₁, …, Θ_n | α, G₀) / ∫⋯∫ dΘ₁ ⋯ dΘ_n p(D | Θ₁, …, Θ_n) p(Θ₁, …, Θ_n | α, G₀)

• Cumulative set of data D = {d₁, d₂, …, d_n}, with associated parameters {Θ₁, Θ₂, …, Θ_n}

• When parameters are shared, the associated data are also shared; data sharing implies
learning from previous/other experiences → life-long learning

• The posterior reflects a balance between the DP-based desire for sharing, constituted by the
prior p(Θ₁, Θ₂, …, Θ_n | α, G₀), and the likelihood function p(D | Θ₁, Θ₂, …, Θ_n),
which rewards parameters that match the data well

[Diagram: the DP’s desire for sharing parameters vs. the likelihood’s desire to fit the data; the posterior balances these objectives]
Hierarchical Dirichlet Process – 1/2

• A DP prior on the parameters of a Gaussian model yields a GMM in which the number
of mixture components need not be set a priori (non-parametric)

• Assume we wish to build N GMMs, each designed using a DP prior

• We link the N GMMs via an overarching DP “hyper prior”
G ~ DP(γ, G₀)  ⇒  we draw  G = ∑_{k=1}^∞ π_k δ(Θ − Θ*_k)

For each task n = 1, …, N:
π_n | α ~ Beta(1, α)    (stick-breaking)
{Θ*_{n,k}}_{k=1,…,∞} | G ~ G
z_{n,i} | π_n ~ Mult(π_n)
d_{n,i} | z_{n,i}, {Θ*_{n,k}}_{k=1,…,∞} ~ F(Θ_{n,z_{n,i}})

[Teh et al., 06]
Hierarchical Dirichlet Process – 2/2

• HDP yields a set of GMMs, each of which shares the same parameters Θ*_k, corresponding
to Gaussian mean and covariance, with distinct probabilities of observation:

p(o_{t+1} | s_t = S_n) = ∑_{k=1}^∞ a_{n,k} F(o_{t+1} | Θ*_k),    n = 1, 2, …

• Coefficients an,k represent the probability of transitioning from state n to state k

• Naturally yields the structure of an HMM; number of large amplitude coefficients an,k
implicitly determines the most-probable number of states
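One common way to realize these shared coefficients is the direct-assignment HDP-HMM construction of Teh et al.: global state weights β from stick-breaking(γ), then each transition row drawn as Dirichlet(α·β). The sketch below only simulates this prior (it is not the inference algorithm), and the truncation level and concentration values γ, α are illustrative assumptions.

```python
import random

def hdp_hmm_transition_rows(trunc, gamma, alpha, seed=0):
    """Global state weights beta via stick-breaking(gamma); each transition
    row a_n ~ Dirichlet(alpha * beta), so all rows favor the same few states."""
    rng = random.Random(seed)
    beta, remaining = [], 1.0
    for _ in range(trunc - 1):
        nu = rng.betavariate(1.0, gamma)
        beta.append(nu * remaining)
        remaining *= 1.0 - nu
    beta.append(remaining)
    rows = []
    for _ in range(trunc):
        # Dirichlet draw via normalized Gamma variates
        g = [rng.gammavariate(max(alpha * b, 1e-9), 1.0) for b in beta]
        total = sum(g)
        rows.append([x / total for x in g])
    return beta, rows

beta, rows = hdp_hmm_transition_rows(trunc=30, gamma=2.0, alpha=5.0)
# Transition mass concentrates on the few states with large beta_k; counting
# the large-amplitude entries suggests the most-probable number of states.
n_states = sum(1 for b in beta if b > 0.05)
```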
Computational Challenges in Performing Inference

• We have the general challenge of estimating the posterior

p(Θ | D, M) = p(D | Θ, M) p(Θ | M) / p(D | M) = p(D | Θ, M) p(Θ | M) / ∫ p(D | Θ, M) p(Θ | M) dΘ

• The denominator is typically a high-dimensional integral (its dimension is the number of
parameters in the model) and cannot be computed exactly in reasonable time

• Approximations required
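The role of that denominator can be made concrete on a toy one-dimensional problem, where the integral reduces to a sum over a parameter grid. The data, the N(0, 1) prior, and the unit-variance Gaussian likelihood below are all illustrative choices; in the high-dimensional models above, exactly this kind of enumeration becomes intractable.

```python
import math

data = [0.9, 1.1, 1.3]                        # toy observations
grid = [i / 100.0 for i in range(-300, 301)]  # candidate parameter values

def log_lik(theta):
    """log p(D | theta, M): iid N(theta, 1) observations (up to a constant)."""
    return sum(-0.5 * (d - theta) ** 2 for d in data)

def log_prior(theta):
    """log p(theta | M): N(0, 1) prior (up to a constant)."""
    return -0.5 * theta ** 2

unnorm = [math.exp(log_lik(t) + log_prior(t)) for t in grid]
normalizer = sum(unnorm)                      # discrete stand-in for p(D | M)
posterior = [u / normalizer for u in unnorm]  # p(theta | D, M) on the grid

post_mean = sum(t * p for t, p in zip(grid, posterior))
# Conjugate-normal check: posterior mean = sum(data) / (n + 1) = 3.3 / 4
```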
[Diagram: accuracy vs. computational complexity of approximate-inference methods — MCMC (most accurate, most costly), Variational Bayes (VB), Laplace]
[Blei & Jordan, 05]
Graphical Model of the nDP-iHMM

[Ni, Dunson, Carin; ICML 07]
How Do You Convince Navy Data Search Works?

Validation Not as “Simple” as Text Search

Consider Special Kind of Acoustic Data: Music

• Assume we have N sequential data sets

• Wish to learn HMM for each of the data sets

• Believe that data can be shared between the learning tasks; the tasks are not independent

• All N HMMs learned jointly, with appropriate data sharing

• Use of iHMM avoids the problem of selecting number of states in HMM

• Validation on large music database; VB yields fast inference
Demonstration Music Database

525 Jazz pieces, 975 Classical, 997 Rock

[Figure: song-by-song similarity matrix over all 2,497 pieces, with the Jazz, Classical, and Rock blocks marked]
Typical Recommendations from Three Genres

Classical               Jazz                   Rock
Applications of Interest to Navy
• Music search provides a fairly good & objective demonstration of the technology

• Other than the use of acoustic/speech features (MFCCs), nothing in the previous
analysis is specifically tied to music – it is simply data search

• Use similar technology for underwater acoustic sensing (MCM) - generative

• Use related technology for synthetic aperture radar and EO/IR detection
and classification – discriminative

• Technology delivered to NSWC Panama City, and demonstrated independently
on mission-relevant MCM data
Underwater Mine Counter Measures (MCM)
Generative Model - iHMM

[Ni & Carin, 07]
Full Posterior on Number of HMM States
Anti-Submarine Warfare (ASW)
Design HMM for all Targets of Interest
State Sharing Between ASW Targets
Semi-Supervised Discriminative

• Semi-supervised learning implemented via graphical techniques

• Multi-task learning implemented via DP

• Exploits all available data-driven context

- Data available from previous collections, labeled & unlabeled

- Labeled and unlabeled data from current data set
Graph representation of partially
labeled data manifolds (1/2)
Construct the graph G = (X, W) with affinity matrix W, where the (i, j)-th element of W is
defined by a Gaussian kernel:

w_ij = exp(−‖x_i − x_j‖² / 2σ²)

Define a Markov random walk on the graph by the transition matrix A, whose (i, j)-th element

a_ij = w_ij / ∑_{k=1}^N w_ik

gives the probability of walking from x_i to x_j in a single step of the Markov random walk.

The one-step Markov random walk provides a local similarity measure between data points.

[Lu, Liao, Carin; 07] [Szummer & Jaakkola, 02]
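These two definitions translate directly into code. The sketch below builds W and the row-normalized transition matrix A; the toy points and σ = 1 are illustrative assumptions.

```python
import math

def walk_matrix(points, sigma=1.0):
    """w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)); a_ij = w_ij / sum_k w_ik."""
    n = len(points)
    W = [[math.exp(-sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
                   / (2.0 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    A = [[w / sum(row) for w in row] for row in W]
    return W, A

# Two nearby points and one far-away point (hypothetical data).
pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
W, A = walk_matrix(pts)
# Each row of A sums to 1; from point 0 the walk almost never jumps to point 2.
```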
Graph representation (2/2)
To account for global similarity between data points, we consider a t-step random walk,
whose transition matrix is A raised to the power t:

A^t = [a_ij^(t)]_{N×N}

It was demonstrated [1] that the t-step Markov random walk yields a volume of paths
connecting the data points, instead of the single shortest path, which is susceptible to
noise; this permits us to incorporate the global manifold structure of the training data set.

The t-step neighborhood of x_i, denoted N_t(x_i), is defined as the set of data points x_j
with a_ij^(t) > 0.

[1] Tishby & Slonim, “Data clustering by Markovian relaxation and the information
bottleneck method,” NIPS 13, 2000
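The t-step matrix is just a matrix power, and N_t(x_i) falls out of it. The 3-point chain below is a made-up example in which points 0 and 2 are not one-step neighbors but are connected by length-t paths.

```python
def matmul(A, B):
    """Plain matrix product, enough for small demonstration graphs."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def t_step(A, t):
    """A^t: entry (i, j) accumulates probability over all length-t paths i -> j."""
    out = A
    for _ in range(t - 1):
        out = matmul(out, A)
    return out

def neighborhood(A_t, i):
    """N_t(x_i): indices j with a_ij^(t) > 0."""
    return [j for j, p in enumerate(A_t[i]) if p > 0.0]

A = [[0.9, 0.1, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.1, 0.9]]
A3 = t_step(A, 3)
# a_02 = 0 in one step, but a_02^(3) > 0: the walk can reach point 2 via point 1.
```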
Semi-Supervised Learning
Algorithm (1/2)
• Neighborhood-based classifier: define the probability of label y_i given the
t-step neighborhood of x_i as

p(y_i | N_t(x_i), θ) = ∑_{j=1}^N a_ij^(t) p(y_i | x_j, θ)

where p(y_i | x_j, θ) is the probability of label y_i given a single data point x_j,
represented by a standard probabilistic classifier parameterized by θ.

• The label y_i implicitly propagates over the neighborhood, so it is
possible to learn a classifier with only a few labels present.
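The combination above is a single weighted sum. In the sketch below the walk row and base-classifier outputs are made-up numbers, standing in for a_ij^(t) and p(y_i | x_j, θ).

```python
def neighborhood_prob(walk_row, base_probs):
    """p(y_i | N_t(x_i), theta) = sum_j a_ij^(t) p(y_i | x_j, theta)."""
    return sum(a * p for a, p in zip(walk_row, base_probs))

# Hypothetical t-step walk weights from x_i, and the base classifier's
# label probability at each neighbor x_j.
walk_row = [0.7, 0.2, 0.1]
base_probs = [1.0, 1.0, 0.0]
p = neighborhood_prob(walk_row, base_probs)  # 0.7*1 + 0.2*1 + 0.1*0 = 0.9
```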
The Algorithm (2/2)
• For binary classification problems, we choose the form of p(y_i | x_j, θ) to be a
logistic regression classifier:

p(y_i | x_j, θ) = 1 / (1 + exp(−y_i θᵀ x_j))

• To enforce sparseness, we impose a normal prior with zero mean and
diagonal precision matrix Λ = diag{λ₁, …, λ_d} on θ, with an independent
Gamma prior on each hyperparameter λ_i.

• Important for transfer learning: the semi-supervised algorithm is inductive
and parametric

• Place a DP prior on the parameters, shared among all tasks
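The logistic form, with labels y ∈ {−1, +1}, can be written directly; the feature vector and weights θ below are hypothetical values for illustration.

```python
import math

def p_label(y, x, theta):
    """p(y | x, theta) = 1 / (1 + exp(-y * theta^T x)), with y in {-1, +1}."""
    score = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-y * score))

x, theta = [1.0, 2.0], [0.5, 0.3]
p_pos = p_label(+1, x, theta)  # sigmoid of theta^T x = 1.1, above 0.5
p_neg = p_label(-1, x, theta)  # the two label probabilities sum to 1
```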
[Figure: six synthetic two-class scatter plots (x2 vs. x1), each showing “Data for Class 1” and “Data for Class 2”]
Sharing Data

[Figure: two of the two-class scatter plots (x2 vs. x1)]
[Figure: classification accuracy (≈0.84–0.92) vs. number of labeled data from each task (0–35), for Supervised STL, Semi-supervised STL, Supervised MTL, and Semi-supervised MTL]

[Figure: 6×6 matrix indexed by the six tasks]
Navy-Relevant Data

Synthetic Aperture Radar (SAR) Data Collected
At 19 Different Locations Across USA

• Data from 19 “tasks” or geographical regions

• 10 of these regions are relatively highly foliated

• 9 regions bare earth, or desert

• Inferred sharing weights group the sites into two basic pools, which agree with truth

• Active learning used to define labels of interest for the site under test

• Other sites used as auxiliary data, in a “life-long-learning” setting
[Figure: classification accuracy (≈0.58–0.78) vs. number of labeled data in each task (40, 80, 120), comparing Supervised SMTL-2, Supervised SMTL-1, Supervised STL, Supervised Pooling, Semi-Supervised STL, Semi-Supervised MTL-Order 1, and Semi-Supervised MTL-Order 2]

Supervised MTL: JMLR 07
Previous deployments

New deployment

• Classifier at new site placed appropriately within context of all available previous data

• Both labeled and unlabeled data employed

• Found that the algorithm is relatively insensitive to the particular labeled data selected

• Validation with relatively large music database
Reconstruction of Random-Bars with hybrid CS. Example (a) is from [3]; (b) and (c) are images we modified from (a) to represent similar
tasks for simultaneous CS inversion. The intensities of all the rectangles are randomly permuted, and the positions of all the rectangles are
shifted by distances sampled uniformly from [−10, 10].
