MECO-27 - Middle European Cooperation in Statistical Physics - Sopron, 08-03-2002



  A new approach to data clustering
with application to financial time series
      (and gene expression data)
         Lorenzo Giada and Matteo Marsili
        Istituto Nazionale per la Fisica della Materia (INFM)
                          Trieste-SISSA unit

       L. Giada and M. Marsili Phys. Rev. E 63, 61101 (2001).
                    http://xxx.lanl.gov/abs/cond-mat/0003241

    Web site with algorithms: http://www.sissa.it/dataclustering/
 Data clustering:
Classify N objects specified by D numbers [xi(t), i=1,…,N, t=1,…,D]
into groups/clusters of similar objects


     Huge, high quality data sets available (N, D ~ 10^3)
                data set = structure + noise
            Where is the relevant information ?
          Are there meaningful classifications ?


Example: Financial time series: xi(t) = return of asset i in day t
     Is there a well defined classification of assets in sectors?
   What are the assets whose price fluctuations are correlated?
 Are there well defined patterns of market activity (market states)?
Standard approaches 1:                                         (H. Spath 1980)
1.   Define a distance || xi - xk || (L2 , L1 …)
2.   Define a cost function
3.   Choose parameters (number of clusters/minimal distance)
4.   Define a minimization algorithm

             K-means:
             • Fix K=number of clusters
             • H\{C_1,\dots,C_K\} = \sum_{j=1}^{K} \sum_{i\in C_j} \| x_i - \bar{x}_{C_j} \|^2 ,
               \qquad \bar{x}_{C_j} = \frac{1}{|C_j|} \sum_{i\in C_j} x_i
             • Find \min_{\{C_1,\dots,C_K\}} H\{C_1,\dots,C_K\}


        But:          What is the correct K?
                      Why H?
                      Dependence on minimization algorithm?
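
Not from the slides: a minimal sketch of the standard K-means (Lloyd) iteration referred to above; the array names, the choice K=3 and the random data are illustrative assumptions.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal Lloyd iteration for K-means: alternately assign points to the
    nearest centroid and recompute each centroid as its cluster mean."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest centroid in L2 distance
        labels = np.argmin(((X[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        # update step: centroid = mean of its cluster (keep the old one if empty)
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    cost = ((X - centroids[labels]) ** 2).sum()   # H{C_1,...,C_K}
    return labels, cost

# illustrative usage on random data
X = np.random.default_rng(1).normal(size=(200, 5))
labels, H = kmeans(X, K=3)
```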
Standard approaches 2:                                                (H. Spath 1980)
1.   Define a distance || xi - xk || between objects and clusters of objects
2.   Start from N clusters of isolated objects
3.   Pick the 2 closest clusters and merge them into single cluster
4.   Repeat until 1 cluster remains
5.   Build dendrogram
                                    E.g. Average linkage :
                                 • Find (i,k) with min || xi - xk ||
                                  • x_{i+k} = \frac{n_i x_i + n_k x_k}{n_i + n_k} , \qquad n_{i+k} = n_i + n_k

 But:         What is the best cluster distance?
                                        (Single/Average/Complete/Centroid linkage)

              What is the correct cluster structure?
              Where to stop?
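
As an illustration of the merging rule above, a minimal sketch of bottom-up agglomerative clustering with the centroid update x_{i+k} = (n_i x_i + n_k x_k)/(n_i + n_k); the data layout and function name are assumptions, not from the slides.

```python
import numpy as np

def agglomerate(X):
    """Greedy bottom-up merging as on the slide: start from N singleton
    clusters, repeatedly merge the two clusters whose centroids are closest,
    and record the merge sequence (the dendrogram)."""
    centers = {i: X[i].astype(float) for i in range(len(X))}
    sizes = {i: 1 for i in range(len(X))}
    merges, next_id = [], len(X)
    while len(centers) > 1:
        ids = list(centers)
        # find the pair of clusters with minimal centroid distance
        a, b = min(((p, q) for i, p in enumerate(ids) for q in ids[i + 1:]),
                   key=lambda pair: np.linalg.norm(centers[pair[0]] - centers[pair[1]]))
        na, nb = sizes[a], sizes[b]
        # centroid and size of the merged cluster
        centers[next_id] = (na * centers[a] + nb * centers[b]) / (na + nb)
        sizes[next_id] = na + nb
        merges.append((a, b, next_id))
        del centers[a], centers[b], sizes[a], sizes[b]
        next_id += 1
    return merges
```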
Non standard methods:

• Super-Paramagnetic Clustering (SPC) (Domany et al. 1996)

                       similar objects   ↔   interacting particles

   • model particle interaction
   • temperature
   • statistical mechanics


• Self-Organizing Maps (T. Kohonen 1992)
   • choose K centers
   • define dynamics of centroids

            Xc(t+1) = f [Xc(t), data], c=1,…,K
   • iterate...
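
A minimal sketch of a standard Kohonen-style update X_c(t+1) = f[X_c(t), data]; the 1-D grid, learning-rate schedule and Gaussian neighborhood are illustrative assumptions, not the authors' prescription.

```python
import numpy as np

def som_1d(X, K, n_epochs=20, lr0=0.5, sigma=1.0, seed=0):
    """Sketch of a 1-D self-organizing map: K centers on a line; each data
    point pulls its best-matching center and, more weakly, its neighbors."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    grid = np.arange(K)
    for epoch in range(n_epochs):
        lr = lr0 * (1.0 - epoch / n_epochs)                      # decaying learning rate
        for x in X[rng.permutation(len(X))]:
            c = np.argmin(((centers - x) ** 2).sum(axis=1))      # best-matching center
            h = np.exp(-((grid - c) ** 2) / (2 * sigma ** 2))    # neighborhood kernel
            centers += (lr * h)[:, None] * (x - centers)         # X_c(t+1) = f[X_c(t), data]
    return centers
```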
Our approach:
                     Real world problem
                             ↓
        data   →   model   →   solution   →   test of the results (χ²)

  Example (a line fit):   model   y_i = a x_i + b ,   least-squares solution

        a = \frac{N\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{N\sum_i x_i^2 - \left(\sum_i x_i\right)^2} ,
        \qquad
        b = \frac{1}{N}\sum_i y_i - \frac{a}{N}\sum_i x_i ,

  tested e.g. by χ².
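
A minimal sketch of the closed-form least-squares solution quoted above, with a χ² check of the result; the variable names and synthetic data are illustrative.

```python
import numpy as np

def fit_line(x, y):
    """Closed-form least-squares fit of y = a*x + b, as on the slide."""
    N = len(x)
    a = (N * np.sum(x * y) - np.sum(x) * np.sum(y)) / (N * np.sum(x ** 2) - np.sum(x) ** 2)
    b = (np.sum(y) - a * np.sum(x)) / N
    return a, b

# illustrative check on noisy data with a known slope and intercept
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 2.0 * x + 1.0 + 0.1 * rng.normal(size=500)
a, b = fit_line(x, y)                      # a ≈ 2.0, b ≈ 1.0
chi2 = np.sum((y - (a * x + b)) ** 2)      # goodness-of-fit test of the solution
```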
  Data sets:

   Financial:   x_i(t) = a_i + b_i \log \frac{p_i(t)}{p_i(t-1)}
       p_i(t) = price of asset i = 1,…,N of the S&P500 index in day t = 1,…,D
       N = 443, D = 1600, from '89 to '95   [Mantegna EPJ (1999), Kullmann et al. PRE (2000)]
       or NYSE/NASDAQ: N = 1000, D = 3100, from '86 to '99

   Gene expression:   x_i(t) = a_i + b_i \log n_i(t)
       n_i(t) = concentration of mRNA of gene i = 1,…,N in experiment t
       N = 2467, D = 18: genome-wide expression of the yeast Saccharomyces cerevisiae
       over ~ two cell cycles   [M.B. Eisen et al. PNAS (1998), E. Domany Physica A (2001)]

   a_i, b_i chosen so that   \frac{1}{D}\sum_{t=1}^{D} x_i(t) = 0   and   \frac{1}{D}\sum_{t=1}^{D} x_i^2(t) = 1
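
A minimal sketch of the normalization implied by the constraints above (zero mean and unit variance in t for each object); the array name `prices` and its layout are assumptions.

```python
import numpy as np

def normalize_returns(prices):
    """Turn an (N assets x D+1 days) price array into x_i(t) built from log
    returns, with (1/D) sum_t x_i(t) = 0 and (1/D) sum_t x_i(t)^2 = 1."""
    r = np.log(prices[:, 1:] / prices[:, :-1])          # log returns, shape (N, D)
    x = (r - r.mean(axis=1, keepdims=True)) / r.std(axis=1, keepdims=True)
    return x
```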
  The model:
         x_i(t) = \sqrt{\frac{g_{s_i}}{1+g_{s_i}}}\,\eta_{s_i}(t) + \sqrt{\frac{1}{1+g_{s_i}}}\,\varepsilon_i(t)

 \varepsilon_i(t), \eta_s(t) Gaussian vectors:
     \langle\varepsilon_i\rangle = \langle\eta_s\rangle = 0, \quad \langle\eta_s\eta_r\rangle = \delta_{s,r}, \quad \langle\varepsilon_i\varepsilon_k\rangle = \delta_{i,k}, \quad \langle\varepsilon_i\eta_s\rangle = 0

                 \langle x_i x_k\rangle = \begin{cases} \dfrac{g_s}{1+g_s} & \text{if } s_i = s_k = s \\ 0 & \text{otherwise } (s_i \neq s_k) \end{cases}

      All objects in cluster s (i.e. all i such that si=s) are correlated
                     gs is the strength of correlations
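
A minimal sketch that draws synthetic data from the cluster model above; the cluster sizes and coupling values g_s are illustrative.

```python
import numpy as np

def simulate_model(labels, g, D, seed=0):
    """Draw x_i(t) from the cluster model on the slide:
    x_i = sqrt(g_s/(1+g_s)) * eta_s + sqrt(1/(1+g_s)) * eps_i, with s = labels[i]."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_clusters = labels.max() + 1
    eta = rng.normal(size=(n_clusters, D))     # one common factor per cluster
    eps = rng.normal(size=(len(labels), D))    # idiosyncratic noise per object
    gs = np.asarray(g)[labels]                 # coupling of each object's cluster
    x = (np.sqrt(gs / (1 + gs))[:, None] * eta[labels]
         + np.sqrt(1 / (1 + gs))[:, None] * eps)
    return x

# illustrative usage: 3 clusters of 10 objects with couplings g = 2, 1, 0.5
labels = np.repeat([0, 1, 2], 10)
x = simulate_model(labels, g=[2.0, 1.0, 0.5], D=1000)
```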
The solution: maximum likelihood
   The probability that the data come from the model with
   parameters G={gs}s=1,... and structure S={si}i=1,…,N is:

          \text{Likelihood} \equiv P\{G, S \mid r_i(t)\} \propto e^{-T\,H\{G,S\}}



     Hence maximum likelihood structure S minimizes:

    H_c\{S\} = \min_{G} H\{G,S\} = \frac{1}{2}\sum_{s:\,n_s>0}\left[\log\frac{c_s}{n_s} + (n_s-1)\log\frac{n_s^2 - c_s}{n_s^2 - n_s}\right]

    with   n_s = \sum_{i=1}^{N}\delta_{s,s_i} = \text{number of } i \text{ with } s_i = s ,
           \quad c_s = \sum_{i,j=1}^{N}\delta_{s,s_i}\,\delta_{s,s_j}\,C_{i,j} ,
           \quad g_s^* = \frac{c_s - n_s}{n_s^2 - c_s}
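
A minimal sketch of H_c{S} evaluated from a correlation matrix C and a label vector S, following the formula above; treating clusters with c_s ≤ n_s as contributing zero (i.e. g_s* = 0) is an assumption of this sketch.

```python
import numpy as np

def H_c(C, labels):
    """Maximum-likelihood cost H_c{S} for the cluster assignment `labels`,
    given the Pearson correlation matrix C (formula on the slide)."""
    labels = np.asarray(labels)
    H = 0.0
    for s in np.unique(labels):
        idx = np.where(labels == s)[0]
        n_s = len(idx)
        c_s = C[np.ix_(idx, idx)].sum()       # internal correlation of cluster s
        # singletons, and clusters with c_s <= n_s, are taken to contribute 0 here
        if n_s > 1 and c_s > n_s:
            H += 0.5 * (np.log(c_s / n_s)
                        + (n_s - 1) * np.log((n_s ** 2 - c_s) / (n_s ** 2 - n_s)))
    return H
```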
Note:
• No need to define distance. Hc depends on Pearson’s coefficient:

      C_{i,k} = \frac{\sum_t [x_i(t)-\bar{x}_i]\,[x_k(t)-\bar{x}_k]}
                     {\sqrt{\sum_t [x_i(t)-\bar{x}_i]^2}\;\sqrt{\sum_t [x_k(t)-\bar{x}_k]^2}}

• No need to define cost function. Hc arises from max likelihood

      H_c\{S\} = \frac{1}{2}\sum_{s:\,n_s>0}\left[\log\frac{c_s}{n_s} + (n_s-1)\log\frac{n_s^2 - c_s}{n_s^2 - n_s}\right]

• difference with K-means:
      H_{K\text{-}means}\{S\} = \sum_{j=1}^{K}\sum_{i\in C_j}\| x_i - \bar{x}_{C_j}\|^2 = \sum_{s:\,n_s>0}\left( n_s - \frac{c_s}{n_s}\right)

      H_{K-means} is always minimal when there are K=N clusters, because then H_{K-means} = 0
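
A minimal sketch of the K-means-type cost Σ_s (n_s − c_s/n_s) written in terms of the correlation matrix, illustrating the remark that it vanishes when every object is its own cluster; the names are illustrative.

```python
import numpy as np

def H_kmeans(C, labels):
    """K-means-type cost in terms of the correlation matrix:
    H = sum_s (n_s - c_s / n_s), as on the slide."""
    labels = np.asarray(labels)
    H = 0.0
    for s in np.unique(labels):
        idx = np.where(labels == s)[0]
        n_s = len(idx)
        c_s = C[np.ix_(idx, idx)].sum()
        H += n_s - c_s / n_s
    return H

# with one cluster per object, c_s = C_ii = 1 = n_s, so the cost is exactly 0:
# C = np.corrcoef(x)                       # Pearson correlation matrix of the data
# H_kmeans(C, np.arange(len(C)))           # -> 0.0
```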
Clustering algorithms
! Minimize Hc by simulated annealing (SA)
    perform Metropolis dynamics as T -> 0 “slowly”
! Deterministic minimization (DM)
    find the single-object (spin-flip) move that lowers Hc the most and
    perform it, repeating until a local minimum is reached (greedy algorithm)
! Hierarchical clustering (MR)
    start with N clusters of isolated objects
    try all merge moves of pairs of clusters and select the one with
    the minimal energy difference
    repeat until a single cluster remains
!   Fuzzy (probabilistic) data clustering
    (see Giada+Marsili PRE 2001)
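
A minimal sketch of the deterministic (greedy) minimization described above; the move set (relabelling single objects among the existing clusters) and the stopping rule are simplifying assumptions, and the compact H_c helper repeats the earlier sketch.

```python
import numpy as np

def H_c(C, labels):
    """Same cost as in the earlier sketch of H_c{S}."""
    labels = np.asarray(labels)
    H = 0.0
    for s in np.unique(labels):
        idx = np.where(labels == s)[0]
        n_s, c_s = len(idx), C[np.ix_(idx, idx)].sum()
        if n_s > 1 and c_s > n_s:
            H += 0.5 * (np.log(c_s / n_s)
                        + (n_s - 1) * np.log((n_s ** 2 - c_s) / (n_s ** 2 - n_s)))
    return H

def greedy_minimize(C, labels, max_sweeps=50):
    """Greedy 'spin-flip' descent: move single objects to whichever existing
    cluster label lowers H_c, stop at a local minimum (cf. DM above)."""
    labels = np.asarray(labels).copy()
    best = H_c(C, labels)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(labels)):
            old = labels[i]
            for s in np.unique(labels):
                if s == old:
                    continue
                labels[i] = s
                h = H_c(C, labels)
                if h < best - 1e-12:
                    best, old, improved = h, s, True   # keep the move
                else:
                    labels[i] = old                    # revert the move
        if not improved:
            break
    return labels, best
```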
Simulated annealing:




          S^* = \arg\min_S H\{S\}

 [Table of clusters found by simulated annealing on the S&P500 data set;
  for each group the slide lists its size, c_s, g_s and energy, together
  with the member tickers and their S&P sector labels. The clusters line
  up with economic sectors such as Electronics (Semiconductors), Oil & Gas
  (Drilling & Equipment), Computers (Hardware / Software & Services),
  Health Care (Drugs, Diversified), Oil (International / Domestic
  Integrated), Oil & Gas (Exploration & Production), Natural Gas and
  Gold & Precious Metals Mining.]

 Findings:
 1) clusters ~ economic sectors
 2) N(\text{clusters} > n) \sim n^{-\tau} ,  with  \tau \simeq 0.65
 3) c_s \sim n_s^{\gamma} ,  with  \gamma \simeq 1.60 - 1.65

    New scaling laws
                Hierarchical clustering (MR) algorithm
                “dendrogram” graphic representation
 [Two schematic dendrograms; the vertical axis is the log-likelihood = -H_c.
  Left:  H_c(X+Y) < H_c(X) + H_c(Y)   (merging X and Y increases the likelihood)
  Right: H_c(X+Y) > H_c(X) + H_c(Y),  but  H_c(X+Y) < H_c(X), H_c(Y)]
Hierarchical clustering of assets:

 [Dendrogram of the assets; branches above the "noise level" line are
  statistically significant clusters]
 Clustering days:
 use   C(t,s) = \frac{1}{N}\sum_{i=1}^{N} x_i(t)\,x_i(s)
 instead of   C_{i,j} = \frac{1}{D}\sum_{t=1}^{D} x_i(t)\,x_j(t)

                                            Market fluctuations follow
                                              patterns across assets

                                            • Identify market states
                                            • Build the state process
                                            • Compute
                                              P{state tomorrow | state today}
                                            • Predict the state of the
                                              market in the future
                                            • Connection with theoretical
                                              market models
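
A minimal sketch of the day-day correlation C(t,s) above, which can then be fed to the same clustering machinery; the array names are illustrative.

```python
import numpy as np

def day_correlation(x):
    """C(t, s) = (1/N) * sum_i x_i(t) x_i(s): correlation between days,
    computed across assets instead of across time (x has shape N x D)."""
    N = x.shape[0]
    return x.T @ x / N

# days can then be clustered with the same cost function, e.g.:
# C_days = day_correlation(x)
# day_labels, H = greedy_minimize(C_days, np.arange(C_days.shape[1]))
```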
Two way clustering:




<r|ω> = average return in state ω
Quantifying market’s information efficiency




   Hi(t|t’) = predictability of ith return in day t
          given the state of the market in day t’
Comparing with other methods:

                          Geometric overlaps:   P(s_i^x = s_j^x \mid s_i^y = s_j^y)




                          Likelihood overlaps:   \frac{H_c(S \cap S')}{H_c(S)}




                  [Overlap plots   (data set of 1000 NYSE assets, R. N. Mantegna)]

               This suggests that:

 [Schematic free-energy landscapes (free energy vs. configuration):
  with a Euclidean-distance cost function, algorithms A and B end up in
  different minima, so results depend on the algorithm; with the
  log-likelihood cost function the dependence on the algorithm is weak.]
           Gene expression data
! Identify (groups of) genes which are responsible for
  functions or functions which are controlled by groups
  of genes.
! Huge amount of data recently made available by new
  techniques
! Data set from P.T. Spellman et al. (Mol. Biol. Cell
  1998), M.B. Eisen et al. (PNAS 1998), E. Domany et
  al. (Physica A 2001): genome-wide measurements over ~ 2
  cell cycles of the yeast Saccharomyces cerevisiae.
Results:
            x_i(t) = \log\frac{\mathrm{mRNA}(t)}{\mathrm{mRNA}_0}        t = 1,…,D,  D = 18
                                                                         i = 1,…,N,  N = 1000

 [Heat maps of expression levels vs. time for the clusters found]

     Very well defined dynamical patterns of activation!
One step of clustering
is not enough to describe
the correlations (D=18 is small)

        re-clustering
    Conclusions:

! Human eye still plays an important role in standard data
  clustering approaches

! We propose a fully unsupervised, parameter-free
  approach to data clustering based on maximum likelihood

! Data clustering is ill-defined
  Data clustering + statistical hypothesis is well-defined

    Web site with algorithms: http://www.sissa.it/dataclustering/
Remark: non-Gaussian data sets:
   Use non-parametric correlation
     non-Gaussian set ξ_i(t), Kendall τ_{i,k}   ↔   Gaussian set r_i(t) with the same Kendall τ_{i,k}

         C_{i,k} = \sin\!\left(\frac{\pi\,\tau_{i,k}}{2}\right)
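
A minimal sketch of the Gaussian-equivalent correlation C_{i,k} = sin(π τ_{i,k}/2) built from Kendall's τ (computed here with scipy.stats.kendalltau); the function name is illustrative.

```python
import numpy as np
from scipy.stats import kendalltau

def gaussian_equivalent_corr(x):
    """Correlation matrix C_ik = sin(pi * tau_ik / 2) built from Kendall's tau,
    a non-parametric replacement for Pearson's C on non-Gaussian data
    (x has shape N objects x D observations)."""
    N = x.shape[0]
    C = np.eye(N)
    for i in range(N):
        for k in range(i + 1, N):
            tau, _ = kendalltau(x[i], x[k])
            C[i, k] = C[k, i] = np.sin(np.pi * tau / 2)
    return C
```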
 Comparison with other methods:
Different clustering methods:
KM: K-means
AL: average linkage
compared with MR, DM
and SA algorithms on the
gene expression data set.


P(b_x | b_0) = probability that a link (b_0)
   found with ML is also found (b_x)
   with method x
P(b_0 | b_x) = probability that a link (b_x)
   found with method x is also found
   (b_0) with ML
 Mean field theory:

  S={M blocks of N/M assets}

   F=U-S/β



  First order
phase transition