
MECO-27 - Middle European Cooperation in Statistical Physics - Sopron, 08-03-2002

A new approach to data clustering with application to financial time series (and gene expression data)
Lorenzo Giada and Matteo Marsili
Istituto Nazionale per la Fisica della Materia (INFM), Trieste-SISSA unit
L. Giada and M. Marsili, Phys. Rev. E 63, 061101 (2001). http://xxx.lanl.gov/abs/cond-mat/0003241
Web site with algorithms: http://www.sissa.it/dataclustering/

Data clustering:
Classify N objects, each specified by D numbers [$x_i(t)$, $i=1,\dots,N$, $t=1,\dots,D$], into groups/clusters of similar objects.
• Huge, high-quality data sets are available ($N, D \sim 10^3$)
• data set = structure + noise
• Where is the relevant information? Are there meaningful classifications?

Example: financial time series, $x_i(t)$ = return of asset i on day t
• Is there a well-defined classification of assets into sectors?
• Which assets have correlated price fluctuations?
• Are there well-defined patterns of market activity (market states)?

Standard approaches 1: (H. Spath 1980)
1. Define a distance $\| x_i - x_k \|$ ($L_2$, $L_1$, …)
2. Define a cost function
3. Choose parameters (number of clusters/minimal distance)
4. Define a minimization algorithm
K-means:
• Fix K = number of clusters
• $H\{C_1,\dots,C_K\} = \sum_{j=1}^{K} \sum_{i \in C_j} \| x_i - x_{C_j} \|^2$, with $x_{C_j} = \frac{1}{|C_j|} \sum_{i \in C_j} x_i$
• Find $\min H\{C_1,\dots,C_K\}$
But: What is the correct K? Why H? Dependence on the minimization algorithm?

Standard approaches 2: (H. Spath 1980)
1. Define a distance $\| x_i - x_k \|$ between objects and clusters of objects
2. Start from N clusters of isolated objects
3. Pick the 2 closest clusters and merge them into a single cluster
4. Repeat until 1 cluster remains
5. Build dendrogram
E.g. average linkage:
• Find (i,k) with minimal $\| x_i - x_k \|$
• $x_{i+k} = \frac{n_i x_i + n_k x_k}{n_i + n_k}$, $n_{i+k} = n_i + n_k$
But: What is the best cluster distance (single/average/complete/centroid linkage)? What is the correct cluster structure? Where to stop?

Non standard methods:
• Super-Paramagnetic Clustering (SPC) (Domany et al. 1996): similar objects <-> interacting particles
  • model particle interaction
  • temperature
  • statistical mechanics
• Self-Organizing Maps (T. Kohonen 1992)
  • choose K centers
  • define dynamics of centroids $X_c(t+1) = f[X_c(t), \text{data}]$, $c=1,\dots,K$
  • iterate...

Our approach:
Real world problem -> data -> model -> solution -> test results
Example (linear fit of y against x):
• model: $y_i = a x_i + b$
• solution: $a = \frac{N \sum_i x_i y_i - \sum_i x_i \sum_i y_i}{N \sum_i x_i^2 - \left( \sum_i x_i \right)^2}$, $b = \frac{1}{N} \sum_i y_i - \frac{a}{N} \sum_i x_i$
• test results: $\chi^2$

Data sets:
• Financial: $x_i(t) = a_i + b_i \log \frac{p_i(t)}{p_i(t-1)}$, where $p_i(t)$ is the price of asset $i=1,\dots,N$ of the S&P500 index on day $t=1,\dots,D$. N=443, D=1600, from '89 to '95 [Mantegna EPJ (1999), Kullmann et al. PRE (2000)]; or NYSE/NASDAQ, N=1000, D=3100, from '86 to '99.
• Gene expression: $x_i(t) = a_i + b_i \log n_i(t)$, where $n_i(t)$ is the concentration of mRNA of gene $i=1,\dots,N$ in experiment t. N=2467, D=18. Yeast Saccharomyces cerevisiae genome-wide expression over ~ two cell cycles [M.B. Eisen et al. PNAS (1998), E. Domany Physica A (2001)].
• $a_i$, $b_i$: chosen such that $\frac{1}{D} \sum_{t=1}^{D} x_i(t) = 0$ and $\frac{1}{D} \sum_{t=1}^{D} x_i^2(t) = 1$

The model:
$x_i(t) = \sqrt{\frac{g_{s_i}}{1+g_{s_i}}}\, \eta_{s_i}(t) + \sqrt{\frac{1}{1+g_{s_i}}}\, \varepsilon_i(t)$
$\varepsilon_i(t)$, $\eta_s(t)$: Gaussian vectors with $\langle \varepsilon_i \rangle = \langle \eta_s \rangle = 0$, $\langle \eta_s \eta_r \rangle = \delta_{s,r}$, $\langle \varepsilon_i \varepsilon_k \rangle = \delta_{i,k}$, $\langle \varepsilon_i \eta_s \rangle = 0$
For $i \neq k$: $\langle x_i x_k \rangle = \frac{g_s}{1+g_s}$ if $s_i = s_k = s$, and 0 otherwise ($s_i \neq s_k$)
All objects in cluster s (i.e. all i such that $s_i = s$) are correlated; $g_s$ is the strength of the correlations.

The solution: maximum likelihood
The probability that the data come from the model with parameters $G=\{g_s\}_{s=1,\dots}$ and structure $S=\{s_i\}_{i=1,\dots,N}$ is:
Likelihood $\equiv P\{G, S \mid r_i(t)\} \propto e^{-T H\{G,S\}}$
Hence the maximum likelihood structure S minimizes:
$H_c\{S\} = \min_G H\{G,S\} = \frac{1}{2} \sum_{s: n_s > 0} \left[ \log \frac{c_s}{n_s} + (n_s - 1) \log \frac{n_s^2 - c_s}{n_s^2 - n_s} \right]$
where
$n_s = \sum_{i=1}^{N} \delta_{s,s_i}$ = number of i with $s_i = s$
$c_s = \sum_{i,j=1}^{N} \delta_{s,s_i} \delta_{s,s_j} C_{i,j}$
$g_s^* = \frac{c_s - n_s}{n_s^2 - c_s}$

Note:
• No need to define a distance. $H_c$ depends on Pearson's coefficient:
  $C_{i,k} = \frac{\sum_t [x_i(t) - \bar{x}_i][x_k(t) - \bar{x}_k]}{\sqrt{\sum_t [x_i(t) - \bar{x}_i]^2 \sum_t [x_k(t) - \bar{x}_k]^2}}$
• No need to define a cost function.
  $H_c$ arises from maximum likelihood:
  $H_c\{S\} = \frac{1}{2} \sum_{s: n_s > 0} \left[ \log \frac{c_s}{n_s} + (n_s - 1) \log \frac{n_s^2 - c_s}{n_s^2 - n_s} \right]$
• Difference with K-means:
  $H_{K\text{-means}}\{S\} = \sum_{j=1}^{K} \sum_{i \in C_j} \| x_i - x_{C_j} \|^2 = \sum_{s: n_s > 0} \left( n_s - \frac{c_s}{n_s} \right)$
  $H_{K\text{-means}}$ is always minimal when there are K=N clusters, because then $H_{K\text{-means}} = 0$.

Clustering algorithms
• Minimize $H_c$ by simulated annealing (SA): perform Metropolis dynamics as T -> 0 "slowly"
• Deterministic minimization (DM): find the spin-flip move which minimizes $H_c$ and perform it, until a local minimum is reached (greedy algorithm)
• Hierarchical clustering (MR): start with N clusters of isolated objects; try all merge moves of pairs of clusters and select the one with minimal energy difference; repeat until one single cluster remains
• Fuzzy (probabilistic) data clustering (see Giada + Marsili PRE 2001)

Simulated annealing: $S^* = \arg\min_S H\{S\}$

Groups (size / c / g / e):
• 18 / 115.202408 / 0.465534151 / -4.64141703
• 24 / 190.345795 / 0.431334049 / -6.17717028
• 8 / 29.0933895 / 0.604280651 / -2.01765895
• 5 / 20.0271244 / 3.02181792 / -4.17928696

Assets shown (ticker, company), by sector label [sector code]:
• Natural Gas [710]: ENE Enron
• Oil & Gas (Drilling & Equipment) [395]: SLB Schlumberger Ltd., HAL Halliburton Co., RDC Rowan Cos., BHI Baker Hughes
• Oil (International Integrated) [390]: TX Texaco Inc., RD Royal Dutch Petroleum, CHV Chevron Corp.
• Oil (Domestic Integrated) [385]: P Phillips Petroleum, OXY Occidental Petroleum, AHC Amerada Hess
• Oil & Gas (Exploration & Production) [380]: UCL Unocal Corp., KMG Kerr-McGee, BR Burlington Resources
• Equipment (Semiconductor) [247]: AMAT Applied Materials
• Electronics (Semiconductors) [235]: TXN Texas Instruments, NSM National Semiconductor, INTC Intel Corp., AMD Advanced Micro Devices
• Computers (Hardware) [190]: SUNW Sun Microsystems, IBM International Bus. Machines, HWP Hewlett-Packard, CPQ COMPAQ Computer, AAPL Apple Computer
• Computers (Software & Services) [185]: ORCL Oracle Corp., NOVL Novell Inc., MSFT Microsoft Corp., CA Computer Associates Intl.
• Communications Equipment [180]: MOT Motorola Inc.
• Health Care (Drugs-Major Pharmaceuticals) [285]: SGP Schering-Plough, PFE Pfizer, Inc., MRK Merck & Co., LLY Lilly (Eli) & Co.
• Health Care (Diversified) [280]: JNJ Johnson & Johnson, BMY Bristol-Myers Squibb, AHP American Home Products, ABT Abbott Labs
• Gold & Precious Metals Mining [265]: PDG Placer Dome Inc., NEM Newmont Mining, HM Homestake Mining, ABX Barrick Gold Corp.
• Gold & Precious Metals Mining [0]: ECO ECHO BAY MINES
• No sector label [0]: XON EXXON CORP, SNT SONAT INC, PZL PENNZOIL CORP, DIGI DSC COMM CO, DEC DIGITAL EQUIPMEN, ORX ORYX ENERGY CO, MOB MOBIL CORP, ACAD AUTODESK INC, LLX LOUISIANA LAND, HP HELMERICH & PAYN, DI DRESSER INDUS, ARC ATL RICHFIELD CO, AN AMOCO CORP

Results:
1) clusters ~ economic sectors
2) $N(\text{clusters} > n) \sim n^{-\tau}$, $\tau \sim 0.65$
3) $c \sim n^{\gamma}$, $\gamma \sim 1.60 - 1.65$
[Plot axes: $n_s$, $c_s$]
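The cost function $H_c$ and the maximum-likelihood coupling $g_s^*$ defined earlier can be checked against the group values quoted in the simulated annealing results. The following is a minimal Python sketch, not the authors' code: the function names (`cluster_stats`, `g_star`, `cluster_term`, `H_c`) are hypothetical, and returning zero for singletons and for clusters with $c_s \le n_s$ is an assumption made here so that the logarithms stay defined. With $n_s = 5$ and $c_s = 20.0271244$ it gives $g \approx 3.0218$, matching the g column, and a bracketed term of about $-4.1793$, which matches the e column if e is read as the cluster's term in $H_c$ without the factor 1/2 (an interpretation, not something the slides state).

```python
import math
from collections import defaultdict


def cluster_stats(C, labels):
    """Return {s: (n_s, c_s)} for a correlation matrix C and labels s_i.

    c_s sums C[i][j] over all pairs in cluster s, diagonal included.
    """
    members = defaultdict(list)
    for i, s in enumerate(labels):
        members[s].append(i)
    stats = {}
    for s, idx in members.items():
        c_s = sum(C[i][j] for i in idx for j in idx)
        stats[s] = (len(idx), c_s)
    return stats


def g_star(n_s, c_s):
    """g_s* = (c_s - n_s) / (n_s^2 - c_s); assumed 0 for a singleton."""
    if n_s == 1:
        return 0.0
    return (c_s - n_s) / (n_s ** 2 - c_s)


def cluster_term(n_s, c_s):
    """log(c_s/n_s) + (n_s - 1) * log((n_s^2 - c_s) / (n_s^2 - n_s)).

    Assumption: singletons and clusters with c_s <= n_s contribute 0.
    """
    if n_s == 1 or c_s <= n_s:
        return 0.0
    return (math.log(c_s / n_s)
            + (n_s - 1) * math.log((n_s ** 2 - c_s) / (n_s ** 2 - n_s)))


def H_c(C, labels):
    """H_c{S} = 1/2 * sum over non-empty clusters of cluster_term."""
    return 0.5 * sum(cluster_term(n, c)
                     for n, c in cluster_stats(C, labels).values())


# Toy correlation matrix: two pairs of objects, correlated only within a pair.
C_toy = [
    [1.0, 0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
]
for labels in ([0, 1, 2, 3], [0, 0, 1, 1], [0, 0, 0, 0]):
    print(labels, H_c(C_toy, labels))
```

On the toy matrix the two-pair structure has the lowest $H_c$: lower than N isolated objects (where $H_c = 0$) and lower than one single cluster. This is the contrast with K-means drawn in the slides, whose cost is always minimal at K=N.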
New scaling laws (plotted against $n_s$)

Hierarchical clustering (MR) algorithm:
"dendrogram" graphic representation
[Figure: two dendrogram panels with vertical axis Log-likelihood = $-H_c$ (baseline 0), each showing clusters X and Y merging into X+Y.]
• Panel 1: $H_c(X+Y) < H_c(X) + H_c(Y)$
• Panel 2: $H_c(X+Y) > H_c(X) + H_c(Y)$ but $H_c(X+Y) < H_c(X), H_c(Y)$

Hierarchical clustering of assets:
[Figure: dendrogram of assets; labels: "noise level", "Statistically significant clusters".]

Clustering days:
Use $C(t,s) = \frac{1}{N} \sum_{i=1}^{N} x_i(t) x_i(s)$ instead of $C_{i,j} = \frac{1}{D} \sum_{t=1}^{D} x_i(t) x_j(t)$.
Market fluctuations follow patterns across assets.
• Identify market states
• Build state process
• Compute P{state tomorrow | state today}
• Predict the state of the market in the future
• Connection with theoretical market models
[Figure label: 0.57]

Two way clustering:
$\langle r | \omega \rangle$ = average return in state $\omega$

Quantifying market's information efficiency:
$H_i(t|t')$ = predictability of the i-th return in day t given the state of the market in day t'

Comparing with other methods:
• Geometric overlaps: $P(s_i^x = s_j^x \mid s_i^y = s_j^y)$
• Likelihood overlaps: $\frac{H_c(S \cap S')}{H_c(S)} = K$
(Dataset of 1000 NYSE assets, R. N. Mantegna)
This suggests that:
[Figure: two schematic landscapes of free energy vs configuration, each labelled with Algorithm A and Algorithm B.]
• Euclidean distance cost function: results depend on algorithm
• Log-likelihood cost function: weak dependence on algorithm

Gene expression data
• Identify (groups of) genes which are responsible for functions, or functions which are controlled by groups of genes.
• Huge amount of data recently made available by new techniques
• Data set from P.T. Spellman et al. (Mol. Biol. Cell 1998), M.B. Eisen et al. (PNAS 1998), E. Domany et al. (Physica A 2001): genome-wide measures over ~ 2 cell cycles of the yeast Saccharomyces cerevisiae.

Results:
$x_i(t) = \log \frac{\mathrm{mRNA}(t)}{\mathrm{mRNA}_0}$, $t = 1,\dots,D$, D=18; $i = 1,\dots,N$, N=1000
[Figure: two panels plotted against time.]
• Very well defined dynamical patterns of activation!
• One step of clustering is not enough to describe correlation
• D=18 small
• re-clustering

Conclusions:
• Human eye still plays an important role in standard data clustering approaches
• We propose a fully unsupervised, parameter free approach to data clustering based on maximum likelihood
• Data clustering is ill defined; data clustering + statistical hypothesis is well defined
Web site with algorithms: http://www.sissa.it/dataclustering/

Remark: non-Gaussian data sets
Use non-parametric correlation:
Non-Gaussian set $\xi_i(t)$ -> Kendall $\tau_{i,k}$ = same Kendall $\tau_{i,k}$ <- Gaussian set $r_i(t)$
$C_{i,k} = \sin(\pi \tau_{i,k} / 2)$

Comparison with other methods:
Different clustering methods (KM: K-means, AL: average linkage) compared with the MR, DM and SA algorithms on the gene expression data set.
• $P(b_x|b_0)$ = probability that a link ($b_0$) found with ML is also found ($b_x$) with method x
• $P(b_0|b_x)$ = probability that a link ($b_x$) found with method x is also found ($b_0$) with ML

Mean field theory:
S = {M blocks of N/M assets}
$F = U - S/\beta$
First order phase transition
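The remark on non-Gaussian data sets can be illustrated with a short sketch. Kendall's $\tau$ depends only on the ordering of the values, so it is unchanged by any strictly increasing transformation of a series. Mapping it through $C_{i,k} = \sin(\pi \tau_{i,k}/2)$ then gives the Pearson coefficient of the Gaussian set with the same $\tau$. The Python below is an illustration under stated assumptions, not the authors' implementation: the helper names `kendall_tau`, `pearson_from_tau` and `correlation_matrix` are hypothetical, and the tau-a convention (tied pairs count as zero) is a choice made here.

```python
import math


def _sign(v):
    """Return -1, 0 or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def kendall_tau(x, y):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs.

    Assumption: tied pairs contribute zero (tau-a convention).
    """
    D = len(x)
    total = 0
    for t in range(D):
        for u in range(t + 1, D):
            total += _sign(x[t] - x[u]) * _sign(y[t] - y[u])
    return total / (D * (D - 1) / 2)


def pearson_from_tau(tau):
    """Gaussian-equivalent Pearson coefficient C = sin(pi * tau / 2)."""
    return math.sin(math.pi * tau / 2)


def correlation_matrix(series):
    """Build C[i][k] = sin(pi * tau_ik / 2) for a list of equal-length series."""
    N = len(series)
    C = [[1.0] * N for _ in range(N)]
    for i in range(N):
        for k in range(i + 1, N):
            C[i][k] = C[k][i] = pearson_from_tau(
                kendall_tau(series[i], series[k]))
    return C


# A strictly increasing transformation (here exp) leaves tau unchanged.
x = [0.3, -1.2, 0.8, 2.0, -0.1]
y = [0.1, -0.7, 1.5, 0.9, 0.2]
xi = [math.exp(v) for v in x]
print(kendall_tau(x, y), kendall_tau(xi, y))
```

The resulting matrix can be used in place of the Pearson matrix wherever $C_{i,j}$ enters $n_s$, $c_s$ and $H_c$. The double loop costs $O(D^2)$ per pair of series, which is acceptable for a sketch but slow for the larger data sets mentioned in the slides.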
