# A new approach to data clustering with application to financial time series


```					 MECO-27 - Middle European Cooperation in Statistical Physics - Sopron 08-03-2002

A new approach to data clustering
with application to financial time series
(and gene expression data)
Istituto Nazionale per la Fisica della Materia (INFM)
Trieste-SISSA unit

L. Giada and M. Marsili Phys. Rev. E 63, 61101 (2001).
http://xxx.lanl.gov/abs/cond-mat/0003241

Web site with algorithms: http://www.sissa.it/dataclustering/
Data clustering:
Classify N objects specified by D numbers [xi(t), i=1,…,N, t=1,…,D]
into groups/clusters of similar objects

Huge, high quality data sets available (N, D ~ 10^3)
data set = structure + noise
Where is the relevant information ?
Are there meaningful classifications ?

Example: Financial time series: xi(t) = return of asset i in day t
Is there a well defined classification of assets in sectors?
Which assets have correlated price fluctuations?
Are there well defined patterns of market activity (market states)?
Standard approaches 1:                                         (H. Spath 1980)
1.   Define a distance || xi - xk || (L2 , L1 …)
2.   Define a cost function
3.   Choose parameters (number of clusters/minimal distance)
4.   Define a minimization algorithm

K-means:
• Fix K = number of clusters
• H{C_1, \ldots, C_K} = \sum_{j=1}^{K} \sum_{i \in C_j} \| x_i - x_{C_j} \|^2 ,   x_{C_j} = \frac{1}{|C_j|} \sum_{i \in C_j} x_i
• Find min H{C_1, \ldots, C_K}

But:          What is correct K?
Why H?
Dependence on minimization algorithm?
Standard approaches 2:                                                (H. Spath 1980)
1.   Define a distance || xi - xk || between objects and clusters of objects
2.   Start from N clusters of isolated objects
3.   Pick the 2 closest clusters and merge them into single cluster
4.   Repeat until 1 cluster remains
5.   Build dendrogram
• Find (i,k) with min || x_i - x_k ||
• x_{i+k} = \frac{n_i x_i + n_k x_k}{n_i + n_k} ,   n_{i+k} = n_i + n_k

But:         What is the best cluster distance?

What is the correct cluster structure?
Where to stop?
Non standard methods:

• Super-Paramagnetic Clustering (SPC) (Domany et al. 1996)

similar objects ↔ interacting particles

• model particle interaction
• temperature
• statistical mechanics

• Self-Organizing Maps (T. Kohonen 1992)
• choose K centers
• define dynamics of centroids

Xc(t+1) = f [Xc(t), data], c=1,…,K
• iterate...
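Our approach:
Real world problem → data → model → solution → test results

model:        y_i = a x_i + b
solution:     a = \frac{N \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{N \sum_i x_i^2 - (\sum_i x_i)^2}
              b = \frac{1}{N} \sum_i y_i - \frac{a}{N} \sum_i x_i
test results: χ2
[Figure: data points in the (x, y) plane]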
Data sets:
• Financial returns: x_i(t) = a_i + b_i \log \frac{p_i(t)}{p_i(t-1)}
  p_i(t) = price of asset i=1,…,N of the S&P500 index in day t=1,…,D
  N=443, D=1600 from '89 to '95 [Mantegna EPJ (1999), Kullmann et al. PRE (2000)]
  or NYSE/NASDAQ: N=1000, D=3100 from '86 to '99
• Gene expression: x_i(t) = a_i + b_i \log n_i(t)
  n_i(t) = concentration of mRNA of gene i=1,…,N in experiment t
  N=2467, D=18: yeast Saccharomyces cerevisiae genome-wide expression over ~ two
  cell cycles [M.B. Eisen et al. PNAS (1998), E. Domany Physica A (2001)]
• a_i, b_i: fixed by \frac{1}{D} \sum_{t=1}^{D} x_i(t) = 0 and \frac{1}{D} \sum_{t=1}^{D} x_i^2(t) = 1
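The model:

x_i(t) = \sqrt{\frac{g_{s_i}}{1 + g_{s_i}}} \, \eta_{s_i}(t) + \sqrt{\frac{1}{1 + g_{s_i}}} \, \epsilon_i(t)

\epsilon_i(t), \eta_s(t) gaussian vectors:
\langle \epsilon_i \rangle = \langle \eta_s \rangle = 0 ,   \langle \eta_s \eta_r \rangle = \delta_{s,r} ,   \langle \epsilon_i \epsilon_k \rangle = \delta_{i,k} ,   \langle \epsilon_i \eta_s \rangle = 0

For i ≠ k:
\langle x_i x_k \rangle = \frac{g_s}{1 + g_s}   if s_i = s_k = s
\langle x_i x_k \rangle = 0                     otherwise (s_i ≠ s_k)

All objects in cluster s (i.e. all i such that s_i = s) are correlated
g_s is the strength of correlations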
The solution: maximum likelihood
The probability that the data come from the model with
parameters G={g_s}, s=1,…, and structure S={s_i}, i=1,…,N, is:

Likelihood ≡ P{G, S | r_i(t)} ∝ e^{-T H{G,S}}

Hence the maximum likelihood structure S minimizes:

H_c{S} = \min_G H{G,S} = \frac{1}{2} \sum_{s: n_s>0} \left[ \log\frac{c_s}{n_s} + (n_s - 1) \log\frac{n_s^2 - c_s}{n_s^2 - n_s} \right]

where
n_s = \sum_{i=1}^{N} \delta_{s,s_i} = number of i with s_i = s
c_s = \sum_{i,j=1}^{N} \delta_{s,s_i} \delta_{s,s_j} C_{i,j}
g_s^* = \frac{c_s - n_s}{n_s^2 - c_s}
Note:
• No need to define distance. Hc depends on Pearson's coefficient:

C_{i,k} = \frac{\sum_t [x_i(t) - \bar{x}_i][x_k(t) - \bar{x}_k]}{\sqrt{\sum_t [x_i(t) - \bar{x}_i]^2 \, \sum_t [x_k(t) - \bar{x}_k]^2}}

• No need to define cost function. Hc arises from max likelihood:

H_c{S} = \frac{1}{2} \sum_{s: n_s>0} \left[ \log\frac{c_s}{n_s} + (n_s - 1) \log\frac{n_s^2 - c_s}{n_s^2 - n_s} \right]

• Difference with K-means:

H_{K-means}{S} = \sum_{j=1}^{K} \sum_{i \in C_j} \| x_i - x_{C_j} \|^2 = \sum_{s: n_s>0} \left( n_s - \frac{c_s}{n_s} \right)

HK-means is always minimal when there are K=N clusters because then HK-means = 0
Clustering algorithms
• Minimize Hc by simulated annealing (SA)
  perform Metropolis dynamics as T -> 0 "slowly"
• Deterministic minimization (DM)
  find the spin-flip move which minimizes Hc and perform it; repeat until a
  local minimum is reached (greedy algorithm)
• Hierarchical clustering (MR)
  try all merge moves of pairs of clusters and select the one with the
  minimal energy difference;
  repeat until one single cluster remains
• Fuzzy (probabilistic) data clustering
Simulated annealing:
S* = argmin H{S}

1) clusters ~ economic sectors
2) N(clusters > n) ~ n^{-τ},   τ ∼ 0.65
3) c_s ~ n_s^{γ},   γ ∼ 1.60 − 1.65
New scaling laws

Group: size/c/g/e
18   115.202408   0.465534151   -4.64141703
24   190.345795   0.431334049   -6.17717028
8    29.0933895   0.604280651   -2.01765895
5    20.0271244   3.02181792    -4.17928696

[Overlaid cluster listings; recoverable entries (ticker, sector code, name), grouped by sector label]
Natural Gas: ENE 710 Enron
Equipment (Semiconductor): AMAT 247 Applied Materials
Electronics (Semiconductors): TXN 235 Texas Instruments; NSM 235 National Semiconductor; INTC 235 Intel Corp.; AMD 235 Advanced Micro Devices
Computers (Hardware): IBM 190 International Bus. Machines; SUNW 190 Sun Microsystems; HWP 190 Hewlett-Packard; CPQ 190 COMPAQ Computer; AAPL 190 Apple Computer
Computers (Software & Services): ORCL 185 Oracle Corp.; NOVL 185 Novell Inc.; MSFT 185 Microsoft Corp.; CA 185 Computer Associates Intl.
Communications Equipment: MOT 180 Motorola Inc.
Oil & Gas (Drilling & Equipment): SLB 395 Schlumberger Ltd.; RDC 395 Rowan Cos.; HAL 395 Halliburton Co.; BHI 395 Baker Hughes
Oil (International Integrated): TX 390 Texaco Inc.; RD 390 Royal Dutch Petroleum; CHV 390 Chevron Corp.
Oil (Domestic Integrated): P 385 Phillips Petroleum; OXY 385 Occidental Petroleum; AHC 385 Hess
Oil & Gas (Exploration & Productn): UCL 380 Unocal Corp.; KMG 380 Kerr-McGee; BR 380 Burlington Resources
Health Care (Drugs-Major Pharmacs): SGP 285 Schering-Plough; PFE 285 Pfizer; LLY 285 Lilly (Eli) & Co.
Health Care (Diversified): JNJ 280 Johnson & Johnson; BMY 280 Bristol-Myers Squibb; AHP 280 American Home Products; ABT 280 Abbott Labs
Gold & Precious Metals Mining: PDG 265 Placer Dome Inc.; NEM 265 Newmont Mining; HM 265 Homestake Mining; ABX 265 Barrick Gold Corp.; ECO 0 ECHO BAY MINES
Sector code 0: XON EXXON CORP; SNT SONAT INC; PZL PENNZOIL CORP; ORX ORYX ENERGY CO; MOB MOBIL CORP; LLX LOUISIANA LAND; HP HELMERICH & PAYN; DI DRESSER INDUS; ARC ATL RICHFIELD CO; AN AMOCO CORP; DIGI DSC COMM CO; DEC DIGITAL EQUIPMEN
Hierarchical clustering (MR) algorithm
"dendrogram" graphic representation
[Figure: two dendrogram panels, vertical axis Log-likelihood = -Hc starting at 0; clusters X and Y merge into X+Y]
Left panel:  Hc(X+Y) < Hc(X) + Hc(Y)
Right panel: Hc(X+Y) > Hc(X) + Hc(Y)
             but Hc(X+Y) < Hc(X), Hc(Y)
Hierarchical clustering of assets:

“noise level”

Statistically significant clusters
Clustering days:
use C(t,s) = \frac{1}{N} \sum_{i=1}^{N} x_i(t) x_i(s)   instead of   C_{i,j} = \frac{1}{D} \sum_{t=1}^{D} x_i(t) x_j(t)
Market fluctuations follow patterns across assets
• Identify market states
• Build state process
• Compute P{state tomorrow | state today}
• Predict the state of the market in the future
• Connection with theoretical market models
[Figure value: 0.57]
Two way clustering:

<r|ω> = average return in state ω
Quantifying market’s information efficiency

Hi(t|t’) = predictability of ith return in day t
given the state of the market in day t’
Comparing with other methods:

Geometric overlaps: P(s_i^x = s_j^x | s_i^y = s_j^y)

Likelihood overlaps: \frac{H_c(S \cap S')}{H_c(S)}

=K       (Dataset of 1000 NYSE assets R. N. Mantegna)
This suggests that:
• Euclidean distance cost function: results depend on algorithm
• Log-likelihood cost function: weak dependence on algorithm
[Figure: two sketches of free energy vs. configuration, one per cost function, each marking Algorithm A and Algorithm B]
Gene expression data
• Identify (groups of) genes which are responsible for
  functions, or functions which are controlled by groups
  of genes.
• Huge amount of data recently made available by new
  techniques
• Data set from P.T. Spellman et al. (Mol. Biol. Cell
  1998), M.B. Eisen et al. (PNAS 1998), E. Domany et
  al. (Physica A 2001): genome-wide measures over ~ 2
  cell cycles of the yeast Saccharomyces cerevisiae.
Results:
x_i(t) = \log \frac{mRNA(t)}{mRNA_0} ,   t = 1,…,D, D=18 ;   i = 1,…,N, N=1000
[Figure: two panels with horizontal axis "time"]

Very well defined dynamical patterns of activation!
One step of clustering
is not enough to describe
correlation
D=18 small

re-clustering
Conclusions:

• Human eye still plays an important role in standard data
  clustering approaches

• We propose a fully unsupervised, parameter-free
  approach to data clustering based on maximum likelihood

• Data clustering is ill defined
  Data clustering + statistical hypothesis is well defined

Web site with algorithms: http://www.sissa.it/dataclustering/
Remark: non-Gaussian data sets:
Use non-parametric correlation
Non-Gaussian set ξ_i(t) → Gaussian set r_i(t) with the same Kendall τ_{i,k}

C_{i,k} = \sin(\pi \tau_{i,k} / 2)
Comparison with other methods:
Different clustering methods:
KM: K-means
compared with MR, DM
and SA algorithms on the
gene expression data set.

P(bx|b0) = probability that a link (b0)
found with ML is also found (bx)
with method x
P(b0|bx) = probability that a link (bx)
found with method x is also found
(b0) with ML
Mean field theory:

S={M blocks of N/M assets}

F=U-S/β

First order
phase transition

```
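The maximum-likelihood quantities on the slides (Pearson's coefficient C, the cluster sums n_s and c_s, the coupling g_s*, the cost H_c) and the hierarchical merge procedure (MR) can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: all function names are hypothetical, the code assumes no series is constant, and treating clusters with c_s ≤ n_s as contributing zero energy is an assumption of this sketch that the slides do not state.

```python
import math


def pearson_matrix(x):
    """Pearson coefficients C[i][k] for N series x[i] of equal length D."""
    centered = []
    for row in x:
        mean = sum(row) / len(row)
        centered.append([v - mean for v in row])
    # Assumes no series is constant (otherwise a norm is zero).
    norms = [math.sqrt(sum(v * v for v in row)) for row in centered]
    n = len(x)
    return [[sum(a * b for a, b in zip(centered[i], centered[k]))
             / (norms[i] * norms[k]) for k in range(n)] for i in range(n)]


def cluster_stats(C, members):
    """Return (n_s, c_s) for the cluster whose object indices are `members`."""
    n_s = len(members)
    c_s = sum(C[i][j] for i in members for j in members)
    return n_s, c_s


def optimal_coupling(n_s, c_s):
    """g_s* = (c_s - n_s) / (n_s^2 - c_s)."""
    return (c_s - n_s) / (n_s ** 2 - c_s)


def cluster_energy(n_s, c_s):
    """One cluster's term in H_c, including the 1/2 prefactor."""
    if n_s <= 1 or c_s <= n_s:
        return 0.0  # assumption of this sketch: no internal correlation
    if c_s >= n_s ** 2:
        return float("-inf")  # perfectly correlated cluster
    return 0.5 * (math.log(c_s / n_s)
                  + (n_s - 1) * math.log((n_s ** 2 - c_s) / (n_s ** 2 - n_s)))


def merge_hierarchy(C):
    """MR sketch: merge the pair with the lowest energy change until one cluster remains."""
    def energy(members):
        return cluster_energy(*cluster_stats(C, members))

    clusters = [[i] for i in range(len(C))]
    history = []  # (merged cluster, energy difference) in merge order
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                delta = (energy(clusters[a] + clusters[b])
                         - energy(clusters[a]) - energy(clusters[b]))
                if best is None or delta < best[0]:
                    best = (delta, a, b)
        delta, a, b = best
        merged = sorted(clusters[a] + clusters[b])
        clusters = [m for idx, m in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
        history.append((merged, delta))
    return history


if __name__ == "__main__":
    # Values of the first "Group: size/c/g/e" line on the simulated annealing slide.
    n_s, c_s = 18, 115.202408
    print(optimal_coupling(n_s, c_s))
    print(2 * cluster_energy(n_s, c_s))
```

With the size and c values of the first group line (18 and 115.202408), `optimal_coupling` gives the g value listed there (0.465534...). The listed e value (-4.6414...) matches twice `cluster_energy`, which suggests the per-group figure was reported without the 1/2 prefactor.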
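The remark on non-Gaussian data sets replaces Pearson's coefficient with C_{i,k} = sin(π τ_{i,k} / 2), where τ_{i,k} is Kendall's rank correlation. A minimal sketch of that mapping follows. The function names are hypothetical, and the tau-a variant used here (ties counted as neither concordant nor discordant) is an assumption, since the slides do not say which variant is meant.

```python
import math


def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant pairs) / total number of pairs."""
    n = len(a)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (a[i] - a[j]) * (b[i] - b[j])
            score += (prod > 0) - (prod < 0)  # +1 concordant, -1 discordant, 0 tie
    return score / (n * (n - 1) / 2)


def gaussian_equivalent_correlation(a, b):
    """C = sin(pi * tau / 2), the correlation of a Gaussian pair with the same tau."""
    return math.sin(math.pi * kendall_tau(a, b) / 2)


if __name__ == "__main__":
    a = [1.0, 2.0, 3.0, 4.0]
    b = [1.0, 3.0, 2.0, 4.0]
    print(kendall_tau(a, b), gaussian_equivalent_correlation(a, b))
```

Because τ depends only on ranks, applying any increasing transformation to one series (for example `math.exp`) leaves the resulting C unchanged, which is what makes the estimate insensitive to non-Gaussian marginals.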
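The "Clustering days" slide proposes building a state process from the clustered days and computing P{state tomorrow | state today}. Assuming each day has already been assigned a state label, the conditional probabilities can be estimated by counting consecutive pairs. The sketch below is illustrative only: the function name and the toy labels are hypothetical, and frequency counting is one simple estimator, not necessarily the one used in the talk.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate P{state tomorrow | state today} from a sequence of daily state labels."""
    counts = defaultdict(Counter)
    for today, tomorrow in zip(states, states[1:]):
        counts[today][tomorrow] += 1
    return {
        today: {nxt: k / sum(row.values()) for nxt, k in row.items()}
        for today, row in counts.items()
    }


if __name__ == "__main__":
    # Toy label sequence, for illustration only.
    days = ["a", "a", "b", "a", "b", "b"]
    for today, row in transition_probabilities(days).items():
        print(today, row)
```

Each row of the result sums to one. States that only appear on the last day have no outgoing transitions and are therefore absent from the outer dictionary.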