Functions, Networks, and Phenotypes by Integrative Genomics Analysis

Xianghong Jasmine Zhou
Molecular and Computational Biology
University of Southern California

IMS, NUS, Singapore, July 18, 2007

Information Flow
Cellular systems

  DNA      1995   Bacteria    ~1.6K genes
           1997   Eukaryote   ~6K genes
           1998   Animal      ~20K genes
           2001   Human       30~100K genes
           Objective: $1000 human genome

  RNA      Gene Expression Datasets: the Transcriptome
           • Oligonucleotide Array
           • cDNA Array

  Protein  Protein abundance measurement (Mass Spec)
           Protein interactions (yeast 2-hybrid system, protein arrays)
           Protein complexes (Mass Spec)

Rapid accumulation of microarray data in public repositories

  • NCBI Gene Expression Omnibus: 137231 experiments
  • EBI ArrayExpress: 55228 experiments

Public microarray data grows threefold per year.

Multiple Microarray Technology Platforms
[Figure: microarray platforms.]

Graph-based Approach for the Integrative Microarray Analysis
[Figure: microarray datasets (gene × experiment matrices) are converted into coexpression networks; recurrent patterns are mined across the networks; the patterns are used for annotation: transcriptional annotation (TF) and functional annotation (Gene Ontology).]

  Frequent Subgraph Mining Problem is hard!

Problem formulation: Given n graphs, identify
subgraphs which occur in at least m graphs (m ≤ n)

Our graphs are massive! (>10,000 nodes and >1 million edges)
The traditional pattern growth approach (expand frequent
subgraph of k edges to k+1 edges) would not work, since
the time and memory requirements increase exponentially
with increasing size of patterns and increasing number of
networks.
Novel Algorithms to identify diverse frequent network patterns

• CoDense (Hu et al., ISMB 2005)
  – identify frequent coherent dense subgraphs
    across many massive graphs
• Network Biclustering (Huang et al., ISMB 2007)
  – identify frequent subgraphs across many massive
    graphs
• Network Modules (NeMo) (Yan et al., ISMB 2007)
  – identify frequent dense vertex sets across many
    massive graphs
  CODENSE: identify frequent
coherent dense subgraphs across
         massive graphs



                   Hu et al, ISMB 2005
Identify frequent co-expression clusters across multiple microarray data sets
[Figure: each expression matrix (genes g1, g2, … by conditions c1, c2, …, cm) is shown next to its co-expression network on vertices a to k.]

The common pattern growth approach
Find a frequent subgraph of k edges, then
expand it to k+1 edges and check its occurrence
frequency
  – Koyuturk M., Grama A. & Szpankowski W. An
    efficient algorithm for detecting frequent subgraphs
    in biological networks. ISMB 2004
  – Yan, Zhou, and Han. Mining Closed Relational
    Graphs with Connectivity Constraints. ICDE 2005
Problem of the Pattern-growth approach
The time and memory requirements increase
exponentially with increasing size of patterns
and increasing number of networks. The
number of frequent dense subgraphs is
explosive when there are very large frequent
dense subgraphs, e.g., subgraphs with
hundreds of edges.
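
To get a feel for the explosion, consider the extreme case of a frequent clique: every subset of its vertices with at least two members induces a smaller clique that is just as frequent and fully dense, so a pattern-growth miner has to visit all of them. The arithmetic below is only a sketch of that counting argument; the function name is ours, not from the talk.

```python
from math import comb

def dense_subpatterns_in_clique(n_vertices: int) -> int:
    """Count vertex subsets of size >= 2 inside a clique on n_vertices nodes.

    Each such subset induces a smaller clique, which is dense (d = 1) and
    occurs in every graph where the large clique occurs.
    """
    return sum(comb(n_vertices, size) for size in range(2, n_vertices + 1))

for n in (5, 10, 20, 30):
    # comb(n, 2) is the number of edges of the clique itself.
    print(n, comb(n, 2), dense_subpatterns_in_clique(n))
```

A clique on 30 vertices has 435 edges, which is the "hundreds of edges" regime, and already contains more than a billion dense sub-patterns.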
Problem of the Pattern-growth approach
[Figure: pattern expansion, k → k+1 edges, illustrated on the example networks.]

                 Our solution
We develop a novel algorithm, called CODENSE, to mine
frequent coherent dense subgraphs. The target subgraphs
have three characteristics:
(1) All edges occur in >= k graphs (frequency)
(2) All edges should exhibit correlated occurrences in
    the given graph set. (coherency)
(3) The subgraph is dense, where density d is higher
    than a given threshold and d = 2m/(n(n-1)) (density)
    m: #edges, n: #nodes
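
As a quick check of the density criterion, the formula d = 2m/(n(n-1)) can be evaluated directly. This is a minimal sketch; the function name and the example numbers are illustrative, not from the slides.

```python
def density(num_edges: int, num_nodes: int) -> float:
    """Density d = 2m / (n(n-1)) of a simple undirected graph."""
    if num_nodes < 2:
        # A graph with fewer than two nodes has no possible edges.
        return 0.0
    return 2 * num_edges / (num_nodes * (num_nodes - 1))

print(density(6, 4))   # complete graph on 4 nodes
print(density(5, 4))   # one edge missing from the complete graph
```

A complete graph has density 1; any density threshold below 1 therefore tolerates some missing edges.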
CODENSE: Mine coherent dense subgraph

(1) Build a summary graph by eliminating infrequent edges

[Figure: the six input graphs G1 to G6 on vertices a to i are combined into the summary graph Ĝ, which keeps only the frequent edges.]

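
Step 1 can be sketched as edge counting followed by a support cut-off. The representation (each graph as a set of vertex pairs) and the names `summary_graph` and `min_support` are assumptions made for this illustration.

```python
from collections import Counter

def summary_graph(graphs, min_support):
    """Keep the edges that occur in at least min_support input graphs."""
    counts = Counter()
    for edges in graphs:
        # Sort endpoints so (a, b) and (b, a) count as the same edge.
        counts.update({tuple(sorted(edge)) for edge in edges})
    return {edge for edge, count in counts.items() if count >= min_support}

# Toy input: three small graphs, not the G1..G6 of the slide.
graphs = [
    {("a", "b"), ("c", "e")},
    {("c", "e"), ("e", "f")},
    {("e", "c"), ("e", "f"), ("g", "i")},
]
print(sorted(summary_graph(graphs, min_support=2)))
```

Only one pass over the edges of each graph is needed, so the cost grows linearly with the number of networks.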
CODENSE: Mine coherent dense subgraph

(2) Identify dense subgraphs of the summary graph

[Figure: Step 2 applies MODES to the summary graph Ĝ and yields the dense subgraph Sub(Ĝ) on vertices c, e, f, g, h, i.]

Observation: If a frequent subgraph is dense, it must be a dense
subgraph in the summary graph. However, the converse does not
hold: a dense subgraph of the summary graph need not be frequent.
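
A toy counterexample shows why a dense subgraph of the summary graph is not necessarily a frequent subgraph. The three graphs below are invented for illustration: every edge of the triangle a-b-c occurs in two graphs, yet the whole triangle occurs in none.

```python
from collections import Counter

triangle = {("a", "b"), ("a", "c"), ("b", "c")}
graphs = [
    {("a", "b"), ("b", "c")},
    {("a", "c"), ("b", "c")},
    {("a", "b"), ("a", "c")},
]

# Support of each individual edge across the three graphs.
edge_support = Counter(edge for graph in graphs for edge in graph)
# The triangle survives in the summary graph if every edge has support >= 2.
in_summary = all(edge_support[edge] >= 2 for edge in triangle)
# Number of graphs that contain the complete triangle.
pattern_support = sum(triangle <= graph for graph in graphs)

print(in_summary, pattern_support)  # True 0
```

This is the gap that the coherency criterion is meant to close: the edges must not only be frequent, they must occur together.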
CODENSE: Mine coherent dense subgraph

(3) Construct the edge occurrence profiles for each
dense summary subgraph

[Figure: Step 3 maps Sub(Ĝ) to the edge occurrence profiles below.]

   E     G1   G2   G3   G4   G5   G6
   c-e   0    0    1    1    1    1
   c-f   0    1    0    1    1    1
   c-h   0    0    0    1    1    1
   c-i   0    0    1    1    1    0
   e-f   0    0    0    1    1    1
   …     …    …    …    …    …    …

   edge occurrence profiles
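
An edge occurrence profile is simply a 0/1 vector with one entry per input graph. The sketch below assumes edges are stored as labels such as "c-f" and graphs as sets of such labels; both choices are ours, made to keep the example short.

```python
def edge_occurrence_profiles(edges, graphs):
    """Map each edge to a 0/1 vector: entry i is 1 if graph i has the edge."""
    return {edge: [int(edge in graph) for graph in graphs] for edge in edges}

# Toy graphs chosen to reproduce the c-f and c-h rows of the table.
graphs = [
    set(),            # G1
    {"c-f"},          # G2
    set(),            # G3
    {"c-f", "c-h"},   # G4
    {"c-f", "c-h"},   # G5
    {"c-f", "c-h"},   # G6
]
profiles = edge_occurrence_profiles(["c-f", "c-h"], graphs)
for edge, profile in profiles.items():
    print(edge, profile)
```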
CODENSE: Mine coherent dense subgraph

(4) Build a second-order graph for each dense summary
subgraph

   E     G1   G2   G3   G4   G5   G6
   c-e   0    0    1    1    1    1
   c-f   0    1    0    1    1    1
   c-h   0    0    0    1    1    1
   c-i   0    0    1    1    1    0
   e-f   0    0    0    1    1    1
   …     …    …    …    …    …    …

   edge occurrence profiles

[Figure: Step 4 turns the edge occurrence profiles into the second-order graph S, whose vertices are the edges c-e, c-f, c-h, c-i, e-f, e-g, e-h, e-i, f-h, f-i, g-h, g-i, h-i.]
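
In the second-order graph each vertex is a first-order edge, and two vertices are linked when their occurrence profiles are similar. The slides do not state the similarity measure at this point, so the sketch below uses Jaccard overlap with a cut-off purely as an illustrative stand-in; `jaccard`, `second_order_graph` and `min_similarity` are hypothetical names.

```python
from itertools import combinations

def jaccard(p, q):
    """Share of graphs containing both edges among graphs containing either."""
    both = sum(1 for x, y in zip(p, q) if x and y)
    either = sum(1 for x, y in zip(p, q) if x or y)
    return both / either if either else 0.0

def second_order_graph(profiles, min_similarity):
    """Link two first-order edges whose profiles are similar enough."""
    links = set()
    for (e1, p1), (e2, p2) in combinations(sorted(profiles.items()), 2):
        if jaccard(p1, p2) >= min_similarity:
            links.add((e1, e2))
    return links

# Four rows taken from the edge occurrence profile table.
profiles = {
    "c-f": [0, 1, 0, 1, 1, 1],
    "c-h": [0, 0, 0, 1, 1, 1],
    "c-i": [0, 0, 1, 1, 1, 0],
    "e-f": [0, 0, 0, 1, 1, 1],
}
print(sorted(second_order_graph(profiles, min_similarity=0.7)))
```

With this cut-off, c-f, c-h and e-f end up pairwise linked, while c-i stays isolated because it is absent from G6.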
CODENSE: Mine coherent dense subgraph

(5) Identify dense subgraphs of the second-order graph

[Figure: the second-order graph S is reduced to its dense subgraph Sub(S) on the vertices c-e, c-f, c-h, e-f, e-g, e-h, e-i, f-h, g-h, g-i, h-i; the vertices f-i and c-i are dropped.]


Observation: if a subgraph is coherent (its edges show high correlation
in their occurrences across a graph set), then its 2nd-order graph must
be dense.
CODENSE: Mine coherent dense subgraph

(6) Identify the coherent dense subgraphs

[Figure: Sub(S) is mapped back to the original vertices, giving the coherent dense subgraphs Sub(G): one on vertices e, g, h, i and one on vertices c, e, f, h.]
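
Going back from a cluster of second-order vertices to first-order subgraphs means reading each second-order vertex as an edge again and grouping the vertices it touches. The sketch below uses plain connected components for the grouping; this is a simplification and not MODES, which can return overlapping subgraphs. The function name and the toy cluster are ours.

```python
def restore_components(edge_cluster):
    """Group the endpoints of the given edges into connected vertex sets."""
    adjacency = {}
    for u, v in edge_cluster:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first traversal from an unvisited vertex.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(adjacency[node] - group)
        seen |= group
        components.append(group)
    return components

cluster = {("c", "e"), ("c", "f"), ("e", "f"), ("g", "i")}
print(restore_components(cluster))
```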
                  Our solution
The identified subgraphs by definition satisfy the three
criteria:
(1) All edges occur in >= k graphs (frequency)
(2) All edges should exhibit correlated occurrences in
     the given graph set. (coherency)
(3) The subgraph is dense, where density d is higher
     than a given threshold and d = 2m/(n(n-1)) (density)
     m: #edges, n: #nodes
CODENSE: Mine coherent dense subgraph
[Figure: overview of the six steps. G1 to G6 → (Step 1, Add/Cut) → summary graph Ĝ → (Step 2, MODES) → Sub(Ĝ) → (Step 3) → edge occurrence profiles → (Step 4) → second-order graph S → (Step 5, MODES) → Sub(S) → (Step 6, Restore G and MODES) → Sub(G).]
                   CODENSE
CODENSE is designed to address the scalability issue.
Instead of mining each biological network individually,
CODENSE compresses the networks into two meta-graphs
(the summary graph and the second-order graph) and
performs clustering in these two graphs only. Thus,
CODENSE can handle an arbitrarily large number of networks.
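A minimal sketch of the two meta-graphs, using the edge occurrence profiles from the CODENSE figure as toy input. The function names, the support threshold, and the use of Jaccard similarity to link edges in the second-order graph are assumptions made for illustration; the slides do not state which similarity measure is used.

```python
from itertools import combinations

# Toy input: six networks, each a set of edges (sorted gene pairs).
# Edge memberships follow the occurrence profiles shown in the CODENSE figure.
CE, CF, CH, CI, EF = ("c", "e"), ("c", "f"), ("c", "h"), ("c", "i"), ("e", "f")
graphs = [
    set(),                 # G1
    {CF},                  # G2
    {CE, CI},              # G3
    {CE, CF, CH, CI, EF},  # G4
    {CE, CF, CH, CI, EF},  # G5
    {CE, CF, CH, EF},      # G6
]


def summary_graph(graphs, min_support):
    """Keep the edges that occur in at least min_support input graphs."""
    counts = {}
    for edges in graphs:
        for edge in edges:
            counts[edge] = counts.get(edge, 0) + 1
    return {edge for edge, n in counts.items() if n >= min_support}


def occurrence_profiles(graphs, edges):
    """Map each edge to a 0/1 vector with one entry per graph."""
    return {edge: tuple(int(edge in g) for g in graphs) for edge in edges}


def second_order_graph(profiles, min_similarity):
    """Link two edges when their profiles are similar.

    Jaccard similarity is an assumption made for this sketch.
    """
    links = set()
    for a, b in combinations(sorted(profiles), 2):
        both = sum(x & y for x, y in zip(profiles[a], profiles[b]))
        either = sum(x | y for x, y in zip(profiles[a], profiles[b]))
        if either and both / either >= min_similarity:
            links.add((a, b))
    return links


profiles = occurrence_profiles(graphs, summary_graph(graphs, 3))
links = second_order_graph(profiles, 0.75)
```

Only these two structures are clustered afterwards, so the cost of the dense-subgraph search does not grow with the number of input networks once the profiles are built.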
        MODES: Mine overlapped dense
                 subgraphs
[Figure: MODES on an example graph G with vertices a-j. Step 1: HCS' (G to Sub(G)). Step 2: condense (the dense subgraph is collapsed into a single vertex V, giving G'). HCS' on G' gives Sub(G'). Step 3: restore. Step 4: HCS'.]
                                                                 Hu et al. ISMB 2005
Applying CoDense to 39 yeast microarray data sets
[Figure: four example microarray data sets, each an expression matrix (genes g1, g2, … by conditions c1, c2, … cm), shown next to the coexpression networks over genes a-k derived from them.]
Functional annotation


       Annotation
    Functional Annotation (Validation)

Method: leave-one-out approach. Mask a known
 gene as unknown, and assign its function
 based on the other genes in the
 subgraph pattern.

Functional categories: 166 functional categories
  at GO level 6 or deeper

Results: 448 predictions with an accuracy of 50%
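The leave-one-out validation can be sketched as follows. The majority vote over the remaining genes, the function names, and the toy annotations are assumptions made for illustration; the slides only say that the masked gene's function is assigned from the other genes in the pattern.

```python
from collections import Counter


def predict_function(gene, module, annotations):
    """Majority vote over the other annotated genes in the module.

    The voting rule is an assumption of this sketch.
    """
    votes = Counter()
    for other in module:
        if other != gene:
            votes.update(annotations.get(other, ()))
    if not votes:
        return None
    return votes.most_common(1)[0][0]


def leave_one_out(modules, annotations):
    """Mask each annotated gene in turn and check the prediction."""
    total = correct = 0
    for module in modules:
        for gene in module:
            if gene not in annotations:
                continue
            predicted = predict_function(gene, module, annotations)
            if predicted is None:
                continue
            total += 1
            correct += predicted in annotations[gene]
    return total, (correct / total if total else 0.0)


# Hypothetical toy module and annotations.
modules = [["g1", "g2", "g3", "g4"]]
annotations = {
    "g1": {"ribosome biogenesis"},
    "g2": {"ribosome biogenesis"},
    "g3": {"ribosome biogenesis"},
    "g4": {"ATP biosynthesis"},
}
n_predictions, accuracy = leave_one_out(modules, annotations)
```

In this toy module three of the four masked genes are recovered correctly; the fourth is outvoted by its neighbors.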
    Functional Annotation (Prediction)

We made functional predictions for 169
genes, covering a wide range of functional
categories, e.g. amino acid biosynthesis, ATP
biosynthesis, ribosome biogenesis, vitamin
biosynthesis, etc. A significant number of our
predictions can be supported by literature.
              However…
• What about frequent non-dense graphs?
  – Many biological modules may form paths


• What about subgraphs that are coherent
  across only a subset of the graphs?
  – Not all modules are activated across all
    conditions, and genes may form modules with
    different partner genes under different conditions
       Network Biclustering:
Identify frequent subgraphs across
           massive graphs




                     Huang et al, ISMB 2007
Using 65 human co-expression networks
       as an illustrative example
• 65 co-expression networks generated from
  65 microarray data sets

• Each graph contains 8297 genes and 1%-10%
  of the edges of a complete graph
   Basically, it is a biclustering problem
[Figure: binary matrix with edges as rows and graphs as columns]
                 Network Biclustering

• Objective function: $f = \frac{c'}{mn + \lambda c}$

c': number of 1s in the bicluster
c: number of 1s in the whole matrix
mn: size of the bicluster
λ: regularization factor

[Figure: binary matrix with edges as rows and graphs as columns]


However, the matrix is very large, with millions of edges.
We first identify robust seeds to narrow down the search space.
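A small sketch of the objective function, assuming it reads f = c' / (mn + λc). The "+" and the symbol λ are reconstructed from the extracted slide text, so treat both as assumptions; the function name and the toy matrix are hypothetical.

```python
def bicluster_score(matrix, rows, cols, lam):
    """Score a bicluster of a binary edges-by-graphs matrix.

    c_prime: number of 1s inside the bicluster
    c_total: number of 1s in the whole matrix
    The form c' / (m*n + lam*c) is an assumed reading of the slide.
    """
    c_prime = sum(matrix[r][c] for r in rows for c in cols)
    c_total = sum(sum(row) for row in matrix)
    return c_prime / (len(rows) * len(cols) + lam * c_total)


# Hypothetical toy matrix: rows are edges, columns are graphs.
matrix = [
    [1, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
dense_block = bicluster_score(matrix, rows=[0, 1], cols=[0, 1], lam=0.1)
whole_matrix = bicluster_score(matrix, rows=[0, 1, 2], cols=[0, 1, 2], lam=0.1)
```

With λ = 0 the score is the plain density of the bicluster. Under this reading, a positive λ adds a constant to the denominator, which penalizes very small biclusters that are dense only because they are tiny.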
                Identify Bicluster seed
A property of relation graphs: edge labels are unique.


Hence, each graph can be treated as a collection of items.



Thus, frequent subgraph mining can be modeled as frequent item set mining.


Problem: current frequent item set mining algorithms are efficient only
across many small item sets.
In our problem, we have 65 very large item sets…




                            We use a trick…
                          Identify Bicluster seed
Edge occurrence profiles:

E    G1   G2   G3   G4   G5   G6   …    G60   G61   G62   G63   G64   G65
e1   1    1    0    1    0    1    …    0     0     1     1     1     1
e2   1    1    0    1    0    1    …    0     1     0     1     1     1
e3   1    1    1    1    1    1    …    0     0     0     1     1     1
e4   0    0    1    1    1    0    …    0     0     1     1     1     0
e5   1    1    0    1    0    1    …    0     0     0     1     1     1
…    …    …    …    …    …    …    …    …     …     …     …     …     …

Frequent pattern tree: graph sets with more than 5 members and with > 1000 common edges

Graph set                    Common edges
{G1, G3, G5, G6, G7,…}       {e1, e10, e56, e100, e1000,…}
{G2, G3, G5, G7, G8,…}       {e4, e12, e33, e56, e890,…}
…                            …
{G8, G9, G15, G26, G29}      {e99, e220, e1545, e2629,…}



Very time consuming! It takes more than 2 weeks on 40 Pentium IV nodes
                                                                                 Huang et al. ISMB 2007
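The slide does not spell out the trick. One reading consistent with the "graph set with more than 5 members and with > 1000 common edges" label is to swap roles: treat each edge as a transaction and the graphs as items, so that there are many short transactions over only 65 items. The sketch below follows that reading and uses brute-force enumeration as a stand-in for the frequent pattern tree; the function name, thresholds, and six-graph toy profiles (the visible G1-G6 columns of the slide table) are for illustration only.

```python
from itertools import combinations

# Edge occurrence profiles: one 0/1 entry per graph (G1..G6 here).
profiles = {
    "e1": (1, 1, 0, 1, 0, 1),
    "e2": (1, 1, 0, 1, 0, 1),
    "e3": (1, 1, 1, 1, 1, 1),
    "e4": (0, 0, 1, 1, 1, 0),
    "e5": (1, 1, 0, 1, 0, 1),
}


def seed_graph_sets(profiles, set_size, min_common_edges):
    """Find graph sets of a given size that share enough common edges.

    Brute force over graph subsets; an assumed stand-in for FP-tree mining.
    Returns {graph_index_tuple: [edges present in every graph of the set]}.
    """
    n_graphs = len(next(iter(profiles.values())))
    seeds = {}
    for graph_set in combinations(range(n_graphs), set_size):
        common = [
            edge for edge, profile in profiles.items()
            if all(profile[g] for g in graph_set)
        ]
        if len(common) >= min_common_edges:
            seeds[graph_set] = common
    return seeds


seeds = seed_graph_sets(profiles, set_size=4, min_common_edges=4)
```

Graph indices are zero-based, so the single seed found here corresponds to {G1, G2, G4, G6} with common edges e1, e2, e3, and e5. Enumerating subsets is exponential in the set size, which is why a real run needs a proper frequent pattern miner.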
         Expanding the Biclusters
E    G1   G2   G3   G4   G5   G6   G7   …   G61   G62   G63   G64   G65

e1   1    1    0    1    1    1    0    0   0     1     1     1     1
e2   1    1    1    1    1    0    1    0   1     0     1     1     1
e3   1    1    1    1    1    1    0    0   0     0     1     1     1
e4   1    1    1    1    1    1    0    0   0     1     1     1     0
e5   1    0    0    1    0    1    1    0   0     0     1     1     1
…    …    …    …    …    …    …    …    …   …     …     …     …     …



                                   Simulated Annealing

E    G1   G2   G3   G4   G5   G6   G7   …   G61   G62   G63   G64   G65

e1   1    1    0    1    1    1    0    0   0     1     1     1     1
e2   1    1    1    1    1    0    1    0   1     0     1     1     1
e3   1    1    1    1    1    1    0    0   0     0     1     1     1
e4   1    1    1    1    1    1    0    0   0     1     1     1     0
e5   1    0    0    1    0    1    1    0   0     0     1     1     1
…    …    …    …    …    …    …    …    …   …     …     …     …     …



                                   Identify connected components
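A rough sketch of the expansion step: starting from a seed, simulated annealing toggles one edge (row) or one graph (column) at a time and keeps the best-scoring bicluster seen. The move set, the linear cooling schedule, the parameter values, and the score f = c' / (mn + λc) are all assumptions of this sketch, not details given on the slides. The toy matrix uses the visible G1-G7 columns of the table above.

```python
import math
import random

matrix = [  # rows e1..e5, columns G1..G7
    [1, 1, 0, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0],
    [1, 0, 0, 1, 0, 1, 1],
]


def score(matrix, rows, cols, lam):
    """Assumed objective: c' / (m*n + lam*c); empty biclusters score 0."""
    if not rows or not cols:
        return 0.0
    c_prime = sum(matrix[r][c] for r in rows for c in cols)
    c_total = sum(map(sum, matrix))
    return c_prime / (len(rows) * len(cols) + lam * c_total)


def anneal(matrix, seed_rows, seed_cols, lam=0.1, steps=2000, t0=0.1, rng=None):
    """Expand a seed bicluster; returns (best_score, rows, cols)."""
    rng = rng or random.Random(0)  # fixed seed keeps the sketch repeatable
    rows, cols = set(seed_rows), set(seed_cols)
    current = score(matrix, rows, cols, lam)
    best = (current, frozenset(rows), frozenset(cols))
    for step in range(steps):
        temperature = t0 * (1 - step / steps) + 1e-9  # linear cooling
        # Propose toggling membership of one random row or column.
        if rng.random() < 0.5:
            target, item = rows, rng.randrange(len(matrix))
        else:
            target, item = cols, rng.randrange(len(matrix[0]))
        target ^= {item}
        proposed = score(matrix, rows, cols, lam)
        if proposed >= current or rng.random() < math.exp(
            (proposed - current) / temperature
        ):
            current = proposed
            if current > best[0]:
                best = (current, frozenset(rows), frozenset(cols))
        else:
            target ^= {item}  # reject: undo the toggle
    return best


seed_rows, seed_cols = {0, 2}, {0, 1}
best_score, best_rows, best_cols = anneal(matrix, seed_rows, seed_cols)
```

Because the best state is tracked separately from the current state, the returned bicluster never scores below the seed. The connected-component step would then be run on the edges in `best_rows`.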
        Systematic identification of functional
             modules in human genome
• We identified 143,400
  network modules with
  recurrence >= 5. They
  vary in size from 4 to 180.

• 77.0% of the patterns are
  functionally homogeneous
  (GO hypergeometric P-value
  less than 0.01)

• The figure shows that the
  functional homogeneity of
  modules increases with
  their recurrence.
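For reference, a hypergeometric enrichment P-value of the kind used for the homogeneity test can be computed as below. This is a generic sketch: the function name and the example numbers (other than the 8297-gene background) are hypothetical, and the slides do not say how multiple GO categories were handled.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """P(X >= hits) when `sample` genes are drawn without replacement
    from `population` genes, of which `annotated` carry the GO term."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical example: a 10-gene module in an 8297-gene network,
# 5 of whose genes share a GO term annotated to 100 genes overall.
p = hypergeom_pvalue(population=8297, annotated=100, sample=10, hits=5)
is_homogeneous = p < 0.01
```

The exact integer arithmetic of `math.comb` avoids the rounding problems that factorial-based formulas hit at genome scale.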
   Results (I): Examples of highly recurring
                   modules




            Recurrence 20                         Recurrence 18
Defense against oxygen and nitrogen species   Involved in spermatogenesis
   Loosely connected network
patterns with high recurrence can
  represent functional modules
              Functional annotation

                                Annotation




We made functional predictions for 779 known and 116 unknown genes
by random forest classification with 71% accuracy.

Variables for random forest classification:
• functional enrichment P-value
• network topology score
• network connectivity
• pattern recurrence numbers
• average node degree
• unknown gene ratio
• network size
           Network Modules (NeMo)
   Identify frequent dense vertex sets
      across many massive graphs
[Figure: NeMo pipeline in four steps (step 1 to step 4), with panels labeled microarray, coexpression graph, summary graph (neighbor association), partitioning, clustering, and refinement.]


                                                                 Yan et al. ISMB 2007
                           105 human microarray data sets




                                             NeMo

                   6477 recurrent coexpression clusters
                           (density > 0.7 and support > 10)



Validation based on ChIP-chip data (9176 target genes for 20 TFs):
  15.4% homogeneous clusters (vs. 0.2% by randomization test)

Validation based on human-mouse conserved Transfac prediction (7720 target genes for 407 TFs):
  12.5% homogeneous clusters (vs. 3.3% by randomization test)
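The "density > 0.7 and support > 10" filter on recurrent coexpression clusters can be illustrated with a short sketch. It assumes that density is the fraction of possible gene pairs connected within one graph and that support counts the graphs in which the vertex set exceeds the density threshold; the function names and the toy graphs are hypothetical.

```python
from itertools import combinations


def density(vertices, edges):
    """Fraction of vertex pairs that are connected in one graph.

    `edges` is a set of sorted (u, v) tuples.
    """
    pairs = list(combinations(sorted(vertices), 2))
    if not pairs:
        return 0.0
    return sum(1 for pair in pairs if pair in edges) / len(pairs)


def support(vertices, graphs, min_density):
    """Number of graphs in which the vertex set is dense enough."""
    return sum(1 for edges in graphs if density(vertices, edges) > min_density)


cluster = {"a", "b", "c", "d"}
g1 = {("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("b", "d")}  # 5 of 6 pairs
g2 = {("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")}              # 4 of 6 pairs
g3 = g1 | {("c", "d")}                                             # all 6 pairs
n_support = support(cluster, [g1, g2, g3], min_density=0.7)
```

Here the cluster is dense in g1 and g3 but not in g2, so its support is 2.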
The percentage of potential transcription modules validated by
ChIP-chip data increases with cluster density and recurrence
                    Expression data




Microarray Data




                  Phenotype information




                  Phenotype Concepts (e.g. diseases, perturbations, tissues )
                         in Unified Medical Language System (UMLS)
Classifying microarray data based on phenotype

 Adenocarcinoma Arthritis   Asthma   …   Glaucoma   HIV




For example, the current NCBI GEO database contains >60
cancer datasets, 11 of which are leukemia datasets.
    Identify phenotype-specific functional or
             transcriptional modules
   • Unsupervised approach
Cancer   Cancer   Cancer                               Cancer




                             Frequent pattern mining



          Module 1, Module 2, Module 3, … Module k
                        105 microarray data sets




                                       NeMo

              6477 recurrent coexpression clusters
                     (density > 0.7 and support > 10)




 A total of 459 of the clusters are statistically significant with a
     hypergeometric P-value <0.01 in a specific phenotype:
e.g. malignant neoplasms, cardiovascular diseases or nervous
                         system disorders.
                       An example




5 out of the 9 supporting datasets are leukemia datasets (P-value 0.0039). The module is
potentially regulated by E2F4, and the majority of its genes are involved in cell cycle
and DNA repair.
Identify phenotype-specific functional or
         transcriptional modules
• Supervised approach
Adenocarcinoma    vs.    Non-Adenocarcinoma (Arthritis, Asthma, …, Glaucoma, HIV)




  Functional and transcriptional modules
 which are active ONLY in Adenocarcinoma
             Related data sets
       A case study: Identify Network Modules
               Characterizing Cancer
32 Cancer Datasets
[Figure: each cancer dataset is an expression matrix (genes g1, g2, … by conditions c1, c2, … cm) converted into a coexpression network over genes a-k.]

25 Non-cancer Datasets
[Figure: each non-cancer dataset is an expression matrix converted into a coexpression network in the same way.]
             Examples of identified modules




• Cell cycle module: across all cancer datasets
• Cell adhesion: across all solid tumor datasets
• PDGF signaling: in breast cancer datasets
Reconstruct transcriptional cascades
     by second-order analysis




                   Zhou et al. Nature Biotech 2005
Frequently occurring tight clusters
Transcription
Factors
Co-occurrence of tight clusters
[Figure sequence: coexpression networks constructed with datasets 1 through 5, one per slide]
      Transcription Factor Set 1     Cooperativity     Transcription Factor Set 2




Coexpression Networks
Relevance Networks
Three types of transcription cascades

[Figure: Type I, Type II, and Type III cascades, drawn over transcription factors (TF1-TF3) and target genes (gene 1-gene 5). Legend: two edge types, transcription regulation and protein interaction.]
 Applying to 39 yeast microarray data sets

• We identified 60 transcription modules. Among
  them, we found 34 pairs that showed high 2nd-order
  correlation. A significant portion (29%, p-value
  < 10^-5 by Monte Carlo simulation) of those
  module pairs are participants in transcription
  cascades: 2 pairs in Type I, 8 pairs in Type II,
  and 3 pairs in Type III cascades. In fact, these
  transcription cascades interconnect into a
  partial cellular regulatory network.
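A sketch of how a 2nd-order correlation between two modules could be computed. It assumes that a module's first-order coherence in a dataset is the mean pairwise Pearson correlation of its genes, and that the 2nd-order correlation is the Pearson correlation of two modules' coherence profiles across datasets. The exact definitions used in the talk may differ, and all names and toy values here are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coherence(dataset, module):
    """Mean pairwise expression correlation of a module in one dataset."""
    pairs = list(combinations(module, 2))
    return sum(pearson(dataset[a], dataset[b]) for a, b in pairs) / len(pairs)


def second_order_correlation(datasets, module_a, module_b):
    """Correlate the two modules' coherence profiles across datasets."""
    profile_a = [coherence(d, module_a) for d in datasets]
    profile_b = [coherence(d, module_b) for d in datasets]
    return pearson(profile_a, profile_b)


# Toy data: three datasets, expression profiles keyed by gene name.
datasets = [
    {"a1": [1, 2, 3], "a2": [1, 2, 3.5], "b1": [1, 2, 3], "b2": [2, 4, 6.5]},
    {"a1": [1, 2, 3], "a2": [3, 1, 2], "b1": [1, 2, 3], "b2": [2, 3, 1]},
    {"a1": [1, 2, 3], "a2": [1, 3, 2], "b1": [1, 2, 3], "b2": [2, 1, 3]},
]
r2 = second_order_correlation(datasets, ["a1", "a2"], ["b1", "b2"])
```

In the toy data both modules are tightly coexpressed in the first dataset and fall apart in the second, so their coherence profiles rise and fall together and the 2nd-order correlation is close to 1.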
[Figure: partial yeast regulatory network assembled from the identified transcription cascades.]

Transcription factors shown: RCS1, RGM1, Leu3, GAL4, MET4, PDR1, GCN4, YAP5, SWI6, GAT3, MBP1, SWI5, MSN4, SWI4, NDD1

Transcription modules shown:
• HGH1 LOC1 NOC3 YDR152W YHL013C
• BUD19 RPL13A RPL13B RPL16A RPL24B RPL28 RPL2B RPL33B RPL39 RPL42B RPS16B RPS18A RPS21B RPS22A RPS23B RPS4B RPS6A
• BAT1 ILV2 LEU9
• YBL113C YRF1-1 YRF1-7
• MET10 MET14 MET28 SUL2
• ALD5 ARG1 ARG2 ARG3 ARG4 ARG5,6 ARO1 ARO3 ARO4 ASN1 BNA1 CPA2 DED81 ECM40 HIS1 HIS4 HOM3 LEU4 LEU9 LYS20 MET22 ORT1 PCL5 TEA1 TRP2 YBR043C YDR341C YHM1 YHR162W YJL200C YJR111C
• YBL113C YRF1-1 YRF1-3 YRF1-7
• YBL113C YIL177C YJL225C YLR464W YML133C YRF1-1 YRF1-3 YRF1-4 YRF1-5 YRF1-6 YRF1-7
• CDC45 RAD27 RNR1 SPT21 YPL267W
• YEL077C YRF1-3 YRF1-7
• CDC21 CDC45 CLB5 CLB6 CLN1 GIN4 IRR1 MCD1 MSH6 PDS5 RAD27 RNR1 SPO16 SPT21 SWE1 TOF1 YBR070C
• ALK1 BUD4 CDC20 CDC5 CLB2 HST3 SWI5 YIL158W YJL051W YOR315W
• CLB6 RNR1 SPT21 YPL267W
• GAS1 HTA1 HTB1

Legend:




             Regulation of transcription modules by transcription factors, based on ChIP-chip data and supported by recurrent expression clusters


             Regulation between transcription factors, based on ChIP-chip data and supported by 2nd-order expression correlation


             Protein interactions between two transcription factors, based on experimental data and supported by 2nd-order expression correlation



             Two transcription modules with high 2nd-order expression correlations
Integrative Array Analyzer (iArray): a software package for
    cross-platform and cross-species microarray analysis




                                         Bioinformatics. 22(13):1665-7
          Gene Aging Nexus (GAN):
An Integrated Genomics Data Mining Platform for Aging




                                     Nucleic Acids Res. 2006 Nov
                     Acknowledgement
CoDense
• Haiyan Hu

NetBiclustering
• Haifeng Li
• Yu Huang

NeMo
• Xifeng Yan (IBM)
• Mike Mehan

Cancer Network Module
• Min Xu

Second-order TF cascade
• Ming-Chih Kao (U. Michigan)
• Haiyan Huang (UC Berkeley)
• Wing H. Wong (Stanford)

Consultation
• Jiawei Han (UIUC)
• Michael Waterman

Funding agencies: NIGMS, NCI, NSF, Seaver Foundation,
Sloan Foundation, Zumberg Foundation
Thank you!

				