Object Orie'd Data Analysis, Last Time
• Kernel Embedding
– Use linear methods in a non-linear way

• Support Vector Machines
– Completely Non-Gaussian Classification

• Distance Weighted Discrimination
– HDLSS Improvement of SVM – Used in microarray data combination

– Face Data, Male vs. Female

Support Vector Machines
Forgotten last time, Important Extension:

Multi-Class SVMs
Hsu & Lin (2002)

Lee, Lin, & Wahba (2002)
• Defined for "implicit" version
• "Direction Based" variation???

Distance Weighted Discrim'n
2-d Visualization: $\min_{w, b} \sum_{i=1}^{n} \frac{1}{r_i}$
• Pushes Plane Away From Data
• All Points Have Some Influence
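A minimal sketch of this criterion (my own illustration, not code from the slides; the full DWD optimization also involves slack variables and a second-order cone solver): it just evaluates $\sum_i 1/r_i$ for a candidate direction, showing that every point keeps some influence, unlike the SVM hinge loss, which ignores points beyond the margin.

```python
import numpy as np

def dwd_objective(X, y, w, b):
    """Sum of reciprocal residuals r_i = y_i (x_i . w + b) for a unit direction w.

    Every observation contributes 1 / r_i, so points far from the plane still
    exert (small) influence -- in contrast to the SVM hinge loss, which ignores
    points beyond the margin.
    """
    w = w / np.linalg.norm(w)            # unit-length direction
    r = y * (X @ w + b)                  # signed distances to the candidate plane
    if np.any(r <= 0):                   # non-separating plane: this simple version
        return np.inf                    # is undefined (full DWD adds slack variables)
    return np.sum(1.0 / r)

# toy illustration: two well-separated 2-d point clouds with labels +/- 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(3.0, 0.5, (25, 2)), rng.normal(-3.0, 0.5, (25, 2))])
y = np.repeat([1.0, -1.0], 25)
print(dwd_objective(X, y, w=np.array([1.0, 0.0]), b=0.0))
```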

Distance Weighted Discrim’n
Maximal Data Piling

HDLSS Discrim’n Simulations
Main idea: Comparison of
• SVM (Support Vector Machine)
• DWD (Distance Weighted Discrimination)
• MD (Mean Difference, a.k.a. Centroid)
Linear versions, across dimensions

HDLSS Discrim’n Simulations
Overall Approach:
• Study different known phenomena
  – Spherical Gaussians
  – Outliers
  – Polynomial Embedding
• Common Sample Sizes: $n_+ = n_- = 25$
• But wide range of dimensions: $d = 10, 40, 100, 400, 1600$
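A rough sketch of how such a simulation might be set up (my own illustration of the stated design, not the code behind the slides' plots; scikit-learn's LinearSVC stands in for the SVM, the centroid rule stands in for MD, and error rates are estimated on an independent test set):

```python
import numpy as np
from sklearn.svm import LinearSVC   # stand-in linear SVM (assumed available)

def two_class_gaussian(n, d, shift, rng):
    """n points per class from N_d(0, I_d); class +1 mean shifted by `shift` in dim 1."""
    X = rng.standard_normal((2 * n, d))
    X[:n, 0] += shift
    y = np.repeat([1, -1], n)
    return X, y

rng = np.random.default_rng(1)
for d in [10, 40, 100, 400, 1600]:
    Xtr, ytr = two_class_gaussian(25, d, shift=2.2, rng=rng)
    Xte, yte = two_class_gaussian(500, d, shift=2.2, rng=rng)

    # MD (centroid) rule: project on the difference of class means
    mu_p, mu_m = Xtr[ytr == 1].mean(0), Xtr[ytr == -1].mean(0)
    md_dir, cut = mu_p - mu_m, (mu_p - mu_m) @ (mu_p + mu_m) / 2
    md_err = np.mean(np.sign(Xte @ md_dir - cut) != yte)

    svm_err = np.mean(LinearSVC(C=1.0).fit(Xtr, ytr).predict(Xte) != yte)
    print(f"d={d:5d}  MD error={md_err:.3f}  SVM error={svm_err:.3f}")
```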

HDLSS Discrim’n Simulations
Spherical Gaussians:

HDLSS Discrim’n Simulations
Spherical Gaussians:
• Same setup as before
• Means shifted in dim 1 only (shift = 2.2)
• All methods pretty good
• Harder problem for higher dimension
• SVM noticeably worse
• MD best (Likelihood method)
• DWD very close to MD
• Methods converge for higher dimension??

HDLSS Discrim’n Simulations
Outlier Mixture:

HDLSS Discrim’n Simulations
Outlier Mixture:
80% of data: mean 2.2 in dim. 1, other dims 0; 20%: dim. 1 ±100, dim. 2 ±500, others 0
• MD is a disaster, driven by outliers
• SVM & DWD are both very robust
• SVM is best
• DWD very close to SVM (insignificant difference)
• Methods converge for higher dimension??
Ignore RLR (a mistake)

HDLSS Discrim’n Simulations
Wobble Mixture:

HDLSS Discrim’n Simulations
Wobble Mixture:
80% of data: mean 2.2 in dim. 1, other dims 0; 20%: dim. 1 ±0.1, one random dim ±100, others 0
• MD still very bad, driven by outliers
• SVM & DWD are both very robust
• SVM loses (affected by margin push)
• DWD slightly better (by weighted influence)
• Methods converge for higher dimension??
Ignore RLR (a mistake)

HDLSS Discrim’n Simulations
Nested Spheres:

HDLSS Discrim’n Simulations
Nested Spheres:
1st d/2 dim's: Gaussian with variance 1 or C
2nd d/2 dim's: the squares of the 1st dim's (as for 2nd degree polynomial embedding)
• Each method best somewhere
• MD best in highest d (data non-Gaussian)
• Methods not comparable (realistic)
• Methods converge for higher dimension??
• HDLSS space is a strange place
Ignore RLR (a mistake)

HDLSS Discrim’n Simulations
Conclusions:
• Everything (sensible) is best sometimes
• DWD often very near best
• MD weak beyond Gaussian

Caution about simulations (and examples):
• Very easy to cherry pick best ones
• Good practice in Machine Learning:
  – "Ignore method proposed, but read paper for useful comparison of others"

HDLSS Discrim’n Simulations
Caution: There are additional players
E.g. Regularized Logistic Regression also looks very competitive
Interesting Phenomenon:
All methods come together in very high dimensions???

HDLSS Asymptotics: Simple Paradoxes, I
For d-dimensional Standard Normal dist'n:
$$Z = \begin{pmatrix} Z_1 \\ \vdots \\ Z_d \end{pmatrix} \sim N_d(0, I_d)$$
Euclidean Distance to Origin (as $d \to \infty$):
$$\| Z \| = \sqrt{d} + O_p(1)$$
- Data lie roughly on surface of sphere of radius $\sqrt{d}$
- Yet origin is point of highest density???
- Paradox resolved by: density w.r.t. Lebesgue Measure

HDLSS Asymptotics: Simple Paradoxes, II
For d-dimensional Standard Normal dist'n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
Euclidean Dist. between $Z_1$ and $Z_2$ (as $d \to \infty$): Distance tends to non-random constant:
$$\| Z_1 - Z_2 \| = \sqrt{2d} + O_p(1)$$
Can extend to $Z_1, \ldots, Z_n$
Where do they all go??? (we can only perceive 3 dim'ns)

HDLSS Asymptotics: Simple Paradoxes, III
For d-dimensional Standard Normal dist'n: $Z_1$ indep. of $Z_2 \sim N_d(0, I_d)$
High dim'al Angles (as $d \to \infty$):
$$\mathrm{Angle}(Z_1, Z_2) = 90^\circ + O_p(d^{-1/2})$$
- Everything is orthogonal???
- Where do they all go??? (again our perceptual limitations)
- Again 1st order structure is non-random
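These three limits are easy to check numerically; a small sketch under the $N_d(0, I_d)$ model (my own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
for d in [10, 100, 1000, 10000]:
    Z1, Z2 = rng.standard_normal(d), rng.standard_normal(d)
    norm = np.linalg.norm(Z1)                      # ~ sqrt(d) + O_p(1)
    dist = np.linalg.norm(Z1 - Z2)                 # ~ sqrt(2d) + O_p(1)
    cosine = Z1 @ Z2 / (np.linalg.norm(Z1) * np.linalg.norm(Z2))
    angle = np.degrees(np.arccos(cosine))          # ~ 90 deg + O_p(d^{-1/2})
    print(f"d={d:6d}  ||Z1||/sqrt(d)={norm/np.sqrt(d):.3f}  "
          f"||Z1-Z2||/sqrt(2d)={dist/np.sqrt(2*d):.3f}  angle={angle:.1f} deg")
```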

HDLSS Asy’s: Geometrical Representation, I
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Subspace Generated by Data
• Hyperplane through 0, of dimension $n$
• Points are "nearly equidistant to 0", & dist $\approx \sqrt{d}$
• Within plane, can "rotate towards $\sqrt{d} \times$ Unit Simplex"
All Gaussian data sets are "near Unit Simplex Vertices"!!!
"Randomness" appears only in rotation of simplex
Hall, Marron & Neeman (2005)

HDLSS Asy’s: Geometrical Representation, II
Assume $Z_1, \ldots, Z_n \sim N_d(0, I_d)$, let $d \to \infty$
Study Hyperplane Generated by Data
• $n - 1$ dimensional hyperplane
• Points are pairwise equidistant, dist $\approx \sqrt{2d}$
• Points lie at vertices of $\sqrt{2d} \times$ "regular $n$-hedron"
Again "randomness in data" is only in rotation
Surprisingly rigid structure in data?

HDLSS Asy’s: Geometrical Representation, III
Simulation View: shows "rigidity after rotation"

HDLSS Asy’s: Geometrical Representation, III
Straightforward Generalizations:
• non-Gaussian data: only need moments
• non-independent: use "mixing conditions"
(with J. Ahn, K. Muller & Y. Chi)
Mild Eigenvalue condition on Theoretical Cov.
All based on simple "Laws of Large Numbers"

HDLSS Asy’s: Geometrical Representation, IV
Explanation of Observed (Simulation) Behavior:
"everything similar for very high d"
• 2 popn's are 2 simplices (i.e. regular n-hedrons)
• All are same distance from the other class
• i.e. everything is a support vector
• i.e. all sensible directions show "data piling"
• so "sensible methods are all nearly the same"
• Including 1-NN
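A quick numerical look at the data-piling point (my own sketch, not from the slides): even with no real class difference, projecting two HDLSS Gaussian samples onto the mean-difference direction yields two tight clumps whose separation grows like $\sqrt{2d/n}$ while the within-class spread stays O(1), so very different-looking rules end up nearly equivalent.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 25
for d in [100, 1000, 10000]:
    Xp = rng.standard_normal((n, d))     # "class +1": pure noise, no real difference
    Xm = rng.standard_normal((n, d))     # "class -1": same distribution
    v = Xp.mean(0) - Xm.mean(0)
    v /= np.linalg.norm(v)               # mean-difference direction
    pp, pm = Xp @ v, Xm @ v              # 1-d projections of each class
    print(f"d={d:6d}  between-class gap={pp.mean() - pm.mean():6.1f}  "
          f"within-class spread={max(pp.std(), pm.std()):4.2f}")
```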

HDLSS Asy’s: Geometrical Representation, V
Further Consequences of Geometric Representation:
1. Inefficiency of DWD for uneven sample sizes (motivates weighted version, work in progress)
2. DWD more stable than SVM (based on deeper limiting distributions) (reflects intuitive feeling about sampling variation) (something like mean vs. median)
3. 1-NN rule inefficiency is quantified.

The Future of Geometrical Representation?
• HDLSS version of "optimality" results?
• "Contiguity" approach? Parameters depend on d? Rates of Convergence?
• Improvements of DWD? (e.g. other functions of distance than inverse)
It is still early days …

NCI 60 Data
Recall from Sept. 6 & 8: NCI 60 Cell Lines
• Interesting benchmark, since same cells
• Data Web available: http://discover.nci.nih.gov/datasetsNature2000.jsp
• Both cDNA and Affymetrix Platforms

NCI 60: Fully Adjusted Data, Melanoma Cluster
BREAST.MDAMB435, BREAST.MDN, MELAN.MALME3M, MELAN.SKMEL2, MELAN.SKMEL5, MELAN.SKMEL28, MELAN.M14, MELAN.UACC62, MELAN.UACC257

NCI 60: Fully Adjusted Data, Leukemia Cluster
LEUK.CCRFCEM, LEUK.K562, LEUK.MOLT4, LEUK.HL60, LEUK.RPMI8266, LEUK.SR

NCI 60: Views using DWD Dir’ns (focus on biology)

Real Clusters in NCI 60 Data?
From Sept. 8: Simple Visual Approach:
• Randomly relabel data (Cancer Types)
• Recompute DWD dir'ns & visualization
• Get heuristic impression from this
• Some types appeared signif'ly different
• Others did not
Deeper Approach: Formal Hypothesis Testing

HDLSS Hypothesis Testing
Approach: DiProPerm Test
(Direction – Projection – Permutation)
Ideas:
• Find an appropriate Direction vector
• Project data into that 1-d subspace
• Construct a 1-d test statistic
• Analyze significance by Permutation

HDLSS Hypothesis Testing – DiProPerm test
DiProPerm Test
Context:
Given 2 sub-populations, X & Y
Are they from the same distribution? Or significantly different?
H0: L_X = L_Y  vs.  H1: L_X ≠ L_Y

HDLSS Hypothesis Testing – DiProPerm test
Reasonable Direction vectors:
• Mean Difference
• SVM
• Maximal Data Piling
• DWD (used in the following)
• Any good discrimination direction…

HDLSS Hypothesis Testing – DiProPerm test
Reasonable Projected 1-d statistics:
• Two sample t-test (used here)
• Chi-square test for different variances
• Kolmogorov–Smirnov
• Any good distributional test…

HDLSS Hypothesis Testing – DiProPerm test
DiProPerm Test Steps:
1. For original data:
   • Find Direction vector
   • Project Data, Compute True Test Statistic
2. For (many) random relabellings of data:
   • Find Direction vector
   • Project Data, Compute Perm'd Test Stat
3. Compare:
   • True Stat among population of Perm'd Stat's
   • Quantile gives p-value

HDLSS Hypothesis Testing – DiProPerm test
Remarks:
• Generally can't use standard null dist'ns…
• e.g. Student's t-table, for t-statistic
• Because Direction and Projection give nonstandard context
• I.e. violate traditional assumptions
• E.g. DWD finds separating directions
• Giving completely invalid test
• This motivates Permutation approach
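A compact sketch of the whole procedure (my own illustration; it uses the mean-difference direction and the two-sample t statistic so that it stays self-contained, whereas the slides use DWD):

```python
import numpy as np
from scipy import stats

def diproperm(X, Y, n_perm=1000, seed=0):
    """Sketch of the Direction-Projection-Permutation test for two samples
    X (n_x, d) and Y (n_y, d).  Direction: mean difference.  Projected
    statistic: two-sample t.  p-value: fraction of permuted statistics
    at least as large as the true one."""
    rng = np.random.default_rng(seed)
    Z = np.vstack([X, Y])
    labels = np.repeat([True, False], [len(X), len(Y)])

    def stat(lab):
        v = Z[lab].mean(0) - Z[~lab].mean(0)     # direction from the current labels
        v /= np.linalg.norm(v)
        t, _ = stats.ttest_ind(Z[lab] @ v, Z[~lab] @ v)
        return abs(t)

    true_stat = stat(labels)
    perm_stats = np.array([stat(rng.permutation(labels)) for _ in range(n_perm)])
    return true_stat, np.mean(perm_stats >= true_stat)

# usage sketch:  t_obs, pval = diproperm(X, Y)
```

The essential point, per the Remarks above, is that the direction is recomputed inside every permutation; that is what makes the permutation null valid where a standard t table would not be.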

Improved Statistical Power - NCI 60 Melanoma

Improved Statistical Power - NCI 60 Leukemia

Improved Statistical Power - NCI 60 NSCLC

Improved Statistical Power - NCI 60 Renal

Improved Statistical Power - NCI 60 CNS

Improved Statistical Power - NCI 60 Ovarian

Improved Statistical Power - NCI 60 Colon

Improved Statistical Power - NCI 60 Breast

Improved Statistical Power - Summary
Type        cDNA-t   Affy-t   Comb-t   Affy-P    Comb-P
Melanoma      36.8     39.9     51.8    e-7       0
Leukemia      18.3     23.8     27.5    0.12      0.00001
NSCLC         17.3     25.1     23.5    0.18      0.02
Renal         15.6     20.1     22.0    0.54      0.04
CNS           13.4     18.6     18.9    0.62      0.21
Ovarian       11.2     20.8     17.0    0.21      0.27
Colon         10.3     17.4     16.3    0.74      0.58
Breast        13.8     19.6     19.3    0.51      0.16

HDLSS Hypothesis Testing – DiProPerm test
Many Open Questions on DiProPerm Test:
• Which Direction is "Best"?
• Which 1-d Projected test statistic?
• Permutation vs. alternatives (bootstrap?)???
• How do these interact?
• What are asymptotic properties?

Independent Component Analysis
Idea: Find dir'ns that maximize indepen'ce
Motivating Context: Signal Processing, "Blind Source Separation"
References:
• Cardoso (1989)
• Cardoso & Souloumiac (1993)
• Lee (1998)
• Hyvärinen and Oja (1999)
• Hyvärinen, Karhunen and Oja (2001)

Independent Component Analysis
ICA, motivating example:

Cocktail party problem

Hear several simultaneous conversations; would like to "separate them"
Model for "conversations": time series $s_1(t)$ and $s_2(t)$

Independent Component Analysis
Cocktail Party Problem

Independent Component Analysis
ICA, motivating example:

Cocktail party problem

What the ears hear:
Ear 1: Mixed version of signals:
$$x_1(t) = a_{11} s_1(t) + a_{12} s_2(t)$$
Ear 2: A second mixture:
$$x_2(t) = a_{21} s_1(t) + a_{22} s_2(t)$$

Independent Component Analysis
What the ears hear: Mixed versions

Independent Component Analysis
Goal: Recover "signal" $s(t)$ from "data" $x(t)$, where
$$x(t) = \begin{pmatrix} x_1(t) \\ x_2(t) \end{pmatrix}, \qquad s(t) = \begin{pmatrix} s_1(t) \\ s_2(t) \end{pmatrix}, \qquad x = A s \text{ for all } t,$$
for unknown "mixture matrix" $A = \begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}$
Goal is to find "separating weights" $W$, so that $s = W x$ for all $t$
Problem: $W = A^{-1}$ would be fine, but $A$ is unknown

Independent Component Analysis
Solution 1: PCA

Independent Component Analysis
Solution 2: ICA

Independent Component Analysis
"Solutions" for Cocktail Party example:
Approach 1: PCA (on "population of 2-d vectors")
• Directions of Greatest Variability do not solve this problem
Approach 2: ICA (will describe method later)
• Independent Component directions do solve the problem (modulo "sign changes" and "identification")
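A small sketch of this comparison (my own illustration; it uses scikit-learn's PCA and FastICA rather than the Matlab FastICA package referenced later, and two synthetic "conversations"):

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# two independent "conversations": a sinusoid and a square wave
t = np.linspace(0, 8, 2000)
S = np.column_stack([np.sin(2 * t), np.sign(np.sin(3 * t))])

A = np.array([[1.0, 0.5],        # unknown mixing matrix ("what the ears hear")
              [0.4, 1.0]])
X = S @ A.T                      # x(t) = A s(t), one row per time point

S_pca = PCA(n_components=2).fit_transform(X)                        # greatest variability
S_ica = FastICA(n_components=2, random_state=0).fit_transform(X)    # max independence

# |correlation| of each recovered component with the true signals:
# ICA components match one signal each (up to sign/order/scale); PCA stays mixed
for name, R in [("PCA", S_pca), ("ICA", S_ica)]:
    corr = np.corrcoef(np.column_stack([S, R]).T)[:2, 2:]
    print(name, np.round(np.abs(corr), 2))
```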

Independent Component Analysis
Relation to FDA: recall “data matrix”
$$X = \left( X_1 \; \cdots \; X_n \right) = \begin{pmatrix} X_{11} & \cdots & X_{1n} \\ \vdots & \ddots & \vdots \\ X_{d1} & \cdots & X_{dn} \end{pmatrix}$$
Signal Processing: focus on rows ($d$ time series, for $t = 1, \ldots, n$)
Functional Data Analysis: focus on columns ($n$ data vectors)
Note: same 2 different viewpoints as dual problems in PCA

Independent Component Analysis
FDA Style Scatterplot View - Signals

$\{(s_1(t), s_2(t)) : t = 1, \ldots, n\}$

Independent Component Analysis
FDA Style Scatterplot View - Data
$\{(x_1(t), x_2(t)) : t = 1, \ldots, n\}$

Independent Component Analysis
FDA Style Scatterplot View:
• Scatterplots give hint how blind recovery is possible
• Affine Transformation $x = A s$ stretches indep't signals into dependent
• Inversion is key to ICA (even when A is unknown)

Independent Component Analysis
Why not PCA?
• Finds direction of greatest variability
• Wrong direction for signal separation

Independent Component Analysis
ICA Step 1:
• "sphere the data" (i.e. find linear transformation to make mean = 0, cov = I)
• i.e. work with $Z = \hat{\Sigma}^{-1/2}(X - \bar{X})$
• requires $X$ of full rank (at least $n \ge d$, i.e. no HDLSS)
• search for independence beyond linear and quadratic structure
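A minimal sketch of the sphering step (my own illustration, using the $d \times n$ data-matrix convention from the FDA slide above):

```python
import numpy as np

def sphere(X):
    """Whiten a d x n data matrix: subtract the mean vector, then pre-multiply
    by the inverse matrix square root of the sample covariance, so the result
    has mean 0 and identity covariance.  Needs full rank (d <= n), so no HDLSS."""
    Xc = X - X.mean(axis=1, keepdims=True)
    cov = Xc @ Xc.T / X.shape[1]
    evals, evecs = np.linalg.eigh(cov)                         # cov = V diag(evals) V'
    cov_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    return cov_inv_sqrt @ Xc

Z = sphere(np.random.default_rng(4).standard_normal((2, 500)) * [[3.0], [0.5]])
print(np.round(Z @ Z.T / Z.shape[1], 3))                       # approximately identity
```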

Independent Component Analysis
ICA Step 2:
• Find directions that make (sphered) data as independent as possible
• Worst case: Gaussian — sphered data are already independent
Interesting "converse application" of C.L.T.:
• For $S_1$ and $S_2$ independent (& non-Gaussian), $X_1 = u S_1 + \sqrt{1 - u^2}\, S_2$ is "more Gaussian" for $u = \tfrac{1}{\sqrt{2}}$
• so maximal independence comes from least Gaussian directions

Independent Component Analysis
ICA Step 2:
• Find dir'ns that make (sphered) data as independent as possible
Recall "independence" means: Joint distribution is product of Marginals
In cocktail party example:
• Happens only when rotated so support is parallel to axes
• Otherwise have blank areas, while marginals are non-zero

Independent Component Analysis
Parallel Idea (and key to algorithm): Find directions that maximize non-Gaussianity
Reason:
• starting from independent coordinates, most projections are Gaussian (since a projection is a "linear combo")
Mathematics behind this: Diaconis and Freedman (1984)
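A quick numerical illustration of this point (my own sketch): starting from independent, strongly non-Gaussian coordinates, a random projection has excess kurtosis near zero (Gaussian-like), while a coordinate direction keeps its non-Gaussianity.

```python
import numpy as np
from scipy.stats import kurtosis    # Fisher definition: 0 for a Gaussian

rng = np.random.default_rng(5)
d, n = 100, 50000
S = rng.laplace(size=(n, d))        # independent heavy-tailed coordinates (excess kurtosis 3)

u = rng.standard_normal(d)
u /= np.linalg.norm(u)              # a "typical" random unit direction
print("coordinate direction kurtosis:", round(kurtosis(S[:, 0]), 2))   # ~ 3
print("random projection kurtosis:   ", round(kurtosis(S @ u), 2))     # ~ 0, Gaussian-like
```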

Independent Component Analysis
Worst case for ICA: • Gaussian marginals

• Then sphered data are independent
• So have independence in all directions

• Thus can’t find useful directions
Gaussian distribution is characterized by: Independent & spherically symmetric

Independent Component Analysis
Criteria for non-Gaussianity / independence:
• kurtosis ($E X^4 - 3 (E X^2)^2$, 4th order cumulant)
• negative entropy
• mutual information
• nonparametric maximum likelihood
• "infomax" in neural networks
• … interesting connections between these
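One concrete version of "find the least Gaussian direction" (a crude sketch of mine built on the kurtosis criterion above; FastICA itself uses a much faster fixed-point scheme):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import kurtosis

def most_nongaussian_direction(Z, seed=0):
    """Search for a unit direction maximizing |excess kurtosis| of the projection
    of sphered data Z (n x d).  Gradient-free and only practical for small d;
    meant to illustrate the projection-pursuit idea, not to replace FastICA."""
    d = Z.shape[1]
    objective = lambda w: -abs(kurtosis(Z @ (w / np.linalg.norm(w))))
    w0 = np.random.default_rng(seed).standard_normal(d)
    res = minimize(objective, w0, method="Nelder-Mead")
    return res.x / np.linalg.norm(res.x)
```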

Independent Component Analysis
Matlab Algorithm (optimizing any of above): FastICA

• http://www.cis.hut.fi/projects/ica/fastica/
• Numerical gradient search method

• Can find directions iteratively
• Or by simultaneous optimization
• Appears fast, with good defaults
• Should we worry about local optima???

Independent Component Analysis
Notational summary:
1. First sphere data: $Z = \hat{\Sigma}^{-1/2}(X - \bar{X})$
2. Apply ICA: find rotation $W_S$ to make the rows of $S_S = W_S Z$ independent
3. Can transform back to original data scale: $S = \hat{\Sigma}^{1/2} S_S$

Independent Component Analysis
Identifiability problem 1: Generally can't order rows of $S_S$ (& $S$)
Since for a permutation matrix $P$
(pre-multiplication by $P$ swaps rows)
(post-multiplication by $P$ swaps columns)
for each col'n, $z = A_S s_S = (A_S P^{-1})(P s_S)$, i.e. $P s_S = (P W_S) z$
So $P S_S$ and $P W_S$ are also solutions (i.e. $P S_S = (P W_S) Z$)
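The row-order ambiguity is easy to verify numerically (my own sketch): a permuted unmixing matrix $P W$ recovers exactly the same sources, just with the rows swapped.

```python
import numpy as np

A = np.array([[1.0, 0.5], [0.4, 1.0]])               # mixing matrix
s = np.array([[1.0, -2.0, 0.5], [3.0, 0.0, -1.0]])   # two sources, three time points
x = A @ s                                             # observed mixtures

W = np.linalg.inv(A)
P = np.array([[0.0, 1.0], [1.0, 0.0]])               # permutation: swaps the two rows

print(W @ x)          # recovers s
print(P @ W @ x)      # recovers the same sources, rows swapped
```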

Independent Component Analysis
Identifiability problem 1: Row Order
Saw this in Cocktail Party Example
FastICA: orders by non-Gaussian-ness?

Independent Component Analysis
Identifiability problem 2: Can't find scale of elements of $s$
Since for a (full rank) diagonal matrix $D$
(pre-mult'n by $D$ is scalar mult'n of rows)
(post-mult'n by $D$ is scalar mult'n of col's)
for each col'n, $z = A_S s_S = (A_S D^{-1})(D s_S)$, i.e. $D s_S = (D W_S) z$
So $D S_S$ and $D W_S$ are also solutions

Independent Component Analysis
Identifiability problem 2: Signal Scale
Not so clear in Cocktail Party Example

Independent Component Analysis
Signal Processing Scale identification: (Hyvärinen and Oja)
Choose scale so each signal $s_i(t)$ has unit average energy: $\frac{1}{n}\sum_{t=1}^{n} s_i(t)^2 = 1$
• Preserves energy along rows of data matrix
• Explains same scales in Cocktail Party Example
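A tiny sketch of this convention (my own illustration): rescale each recovered row so that its mean squared value over time equals 1.

```python
import numpy as np

def unit_energy(S):
    """Rescale each row of a d x n source matrix so its average energy
    (mean of squares over time) equals 1."""
    energy = np.mean(S ** 2, axis=1, keepdims=True)
    return S / np.sqrt(energy)
```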

Independent Component Analysis
Would like to do:
• More toy examples
• Illustrating how non-Gaussianity works
Like to see some? Check out old course notes:
http://www.stat.unc.edu/postscript/papers/marron/Teaching/CornellFDA/Lecture03-11-02/FDA03-11-02.pdf

http://www.stat.unc.edu/postscript/papers/marron/Teaching/CornellFDA/Lecture03-25-02/FDA03-25-02.pdf

Independent Component Analysis
One more "Would like to do": ICA testing of multivariate Gaussianity
Usual approaches: 1-d tests on marginals

New Idea: use ICA to find “least Gaussian Directions”, and base test on those. Koch, Marron and Chen (2004)

Unfortunately Not Covered
• DWD & Micro-array Outcomes Data
• Windup from FDA04-22-02.doc
  – General Conclusion
  – Validation

