					A Suite of Unsupervised Machine Learning Algorithms
                                         Roy Varshavsky1*, David Horn2 & Michal Linial3

1School of Computer Science and Engineering, The Hebrew University of Jerusalem, 2School of Physics, Tel Aviv University,
3Institute of Life Science, The Hebrew University of Jerusalem, *Contact:

 We present a Software Suite that includes various novel unsupervised machine learning algorithms
 • Features Filtering and Data Compression schemes
 • A Collection of Clustering methods
     • Hierarchical
     • Nonhierarchical
 Besides being simple and performing well on high-dimensional data,
 it provides capabilities for visualizing and evaluating the results
 This open source software is configurable and extensible with newly added algorithms
 Applications of these algorithms led to significant biological inferences

1     UFF: Novel Unsupervised Feature Filtering of Biological Data
 • An unsupervised selection method based on SVD-entropy
   (Alter et al., 2000):

   $$E = -\frac{1}{\log(N)} \sum_{j=1}^{N} V_j \log V_j$$

 • V_j are the normalized eigenvalues of the correlation
    matrix X^T X
 • The Contribution of the i-th feature to the overall entropy (CE)
    is determined according to a leave-one-out measurement
 • A natural cutoff for the number of selected features
 • Feature score depends on all other features
 • Better results compared to other known unsupervised feature
    filtering methods (e.g., variance, entropy, PCA projection)
 Joint work with Assaf Gottlieb, Tel Aviv University (ISMB, 2006)
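
 The entropy and leave-one-out contribution described above can be sketched as follows. This is a minimal illustration, not the suite's implementation: it assumes features are the rows of X, takes V_j as the squared singular values of X normalized to sum to 1, and uses a hypothetical cutoff (mean plus one standard deviation of the CE values) chosen only for demonstration. The function names are illustrative.

```python
import numpy as np

def svd_entropy(X):
    """Normalized SVD-entropy E of a data matrix X (rows = features)."""
    s = np.linalg.svd(X, compute_uv=False)   # singular values of X
    v = s ** 2 / np.sum(s ** 2)              # V_j: normalized eigenvalues of X^T X
    v = v[v > 0]                             # treat 0 * log(0) as 0
    return float(-np.sum(v * np.log(v)) / np.log(len(s)))

def ce_scores(X):
    """Leave-one-out contribution: CE_i = E(X) - E(X without feature i)."""
    e_full = svd_entropy(X)
    return np.array([e_full - svd_entropy(np.delete(X, i, axis=0))
                     for i in range(X.shape[0])])

def select_features(X):
    """Indices of features whose CE exceeds an assumed cutoff."""
    ce = ce_scores(X)
    cutoff = ce.mean() + ce.std()            # assumption: illustrative threshold only
    return np.flatnonzero(ce > cutoff)
```

 Because each CE value is computed against the matrix of all remaining features, a feature's score depends on every other feature, as noted in the list above.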

2     COMPACT: A Comparative Package for Clustering Assessment
 • Numerous algorithms exist for clustering large-scale genomic
   and proteomic datasets (e.g., sequencing, gene expression),
   text-mining datasets, etc.
 • However, different methods often lead to different results
 • COMPACT is an easy-to-use and intuitive tool that compares some
   clustering methods within the same framework
 • It may assist researchers in choosing the most appropriate method

 Availability: (ISPA, 2005)

 Figure: The graphical view of the results produced by COMPACT
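
 To illustrate how the outputs of two clustering methods might be compared within one framework, the sketch below computes a pair-counting Jaccard score (intersection over union of co-clustered pairs) between two label assignments. The poster does not state which comparison measures COMPACT implements, so this choice and the function name are assumptions made for illustration.

```python
from itertools import combinations

def jaccard_score(labels_a, labels_b):
    """Pair-counting Jaccard index between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    both = either = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a and same_b:
            both += 1        # pair co-clustered by both methods
        elif same_a or same_b:
            either += 1      # pair co-clustered by exactly one method
    union = both + either
    return both / union if union else 1.0

# Two methods that agree on only one co-clustered pair out of four
print(jaccard_score([0, 0, 0, 1], [0, 0, 1, 1]))  # 0.25
```

 The same function can score one method against expert labels, or two methods against each other.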

3     Clustering Algorithms Optimizer: A Framework for Large Datasets
 • Clustering algorithms are routinely applied in many scientific
   applications
 • However, many of them suffer from the following limitations
    • Relying on predetermined parameter tuning (e.g., a priori
      knowledge regarding the number of clusters)
    • Involving nondeterministic procedures that yield inconsistent
      results
 • We provide a data-driven framework that includes two interrelated
   steps
    • SVD-based dimensional reduction
    • An automated tuning of the algorithm’s parameters, based on
      an internal evaluation criterion, known as the Bayesian Information
      Criterion (BIC)

 Availability: (ECCB, 2006)

 Figure: Comparison of the standard and optimized versions of the K-Means (KM) and Quantum Clustering (QC) algorithms. Score represents intersection over union. The dataset used is the gene-expression colon dataset of Alon et al. (1999)
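
 The two-part framework (SVD-based reduction followed by BIC-driven parameter tuning) can be sketched as below for K-Means. This is an illustration under stated assumptions rather than the published optimizer: the BIC uses a spherical-Gaussian, hard-assignment likelihood with a shared variance; the K-Means initialization is a deterministic farthest-point rule chosen here to avoid run-to-run variation; and all function names are hypothetical.

```python
import numpy as np

def svd_reduce(X, r):
    """Project the rows of X (samples) onto the top-r singular directions."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :r] * s[:r]

def assign(X, centers):
    """Index of the nearest center for every row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def kmeans(X, k, n_iter=100):
    """Lloyd's K-Means with deterministic farthest-point initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        labels = assign(X, centers)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return assign(X, centers), centers

def bic(X, labels, centers):
    """BIC = -2 log L + p log n (lower is better); assumed likelihood model."""
    n, d = X.shape
    k = len(centers)
    rss = ((X - centers[labels]) ** 2).sum()
    var = rss / (d * (n - k))                     # shared per-dimension variance
    counts = np.bincount(labels, minlength=k)
    counts = counts[counts > 0]                   # skip empty clusters
    loglik = (np.sum(counts * np.log(counts / n))
              - 0.5 * n * d * np.log(2 * np.pi * var)
              - 0.5 * d * (n - k))
    n_params = k * (d + 1)                        # centers, mixing weights, variance
    return -2.0 * loglik + n_params * np.log(n)

def select_k(X, candidates):
    """Return the candidate k with the lowest BIC, plus all scores."""
    scores = {k: bic(X, *kmeans(X, k)) for k in candidates}
    return min(scores, key=scores.get), scores
```

 A typical call would be `select_k(svd_reduce(X, 2), range(1, 7))`, where X holds one sample per row. The number of retained dimensions is itself a parameter that the same criterion could tune.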

4     ClusTree: Hierarchical Clustering Analyzer
 • Software for applying, visualizing and evaluating hierarchical
   clustering
 • Algorithms included: Agglomerative (>70 configurations) and
   Top-Down (PDDP, TDQC)
 • Statistical criteria are assigned to clusters (tree-nodes of the
   hierarchy) based on expert-labeled data
 • Datasets used: genomics (gene expression), proteomics
   (functionally related protein sequences) and others (stock trade
   records, movie ratings)

 Availability: (ISMB, PLoS track, 2006)

 Figure: The graphical view of the results produced by ClusTree (dot sizes indicate statistical enrichment levels)
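
 One way to attach a statistical criterion to a tree node from expert-labeled data is a hypergeometric enrichment p-value, sketched below. The poster does not name the test ClusTree uses, so the hypergeometric choice and the function names here are assumptions for illustration.

```python
from math import comb

def enrichment_pvalue(n_total, n_labeled, cluster_size, hits):
    """P(X >= hits) when cluster_size items are drawn without replacement
    from n_total items, of which n_labeled carry the label of interest."""
    upper = min(cluster_size, n_labeled)
    tail = sum(comb(n_labeled, x) * comb(n_total - n_labeled, cluster_size - x)
               for x in range(hits, upper + 1))
    return tail / comb(n_total, cluster_size)

def node_enrichment(all_labels, member_idx, target):
    """Enrichment p-value of label `target` among a node's members."""
    n_labeled = sum(1 for lab in all_labels if lab == target)
    hits = sum(1 for i in member_idx if all_labels[i] == target)
    return enrichment_pvalue(len(all_labels), n_labeled, len(member_idx), hits)

# A node holding all five "a" items out of ten gets a small p-value
labels = ["a"] * 5 + ["b"] * 5
p = node_enrichment(labels, [0, 1, 2, 3, 4], "a")
```

 Applying `node_enrichment` to every node of the hierarchy gives one score per cluster, which could then drive a display such as dot sizes in the tree view.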

