Gene Expression Microarray data

					Probabilistic Sparse Matrix Factorization



      Delbert Dueck, Quaid Morris, Brendan Frey
      (Probabilistic & Statistical Inference Group)

      Tim Hughes
      (Banting and Best Department of Medical Research)
Objective

  Patterns in gene expression array data can be used
    to help understand gene regulation and predict
    the function of yet-uncharacterized genes


Objective: To develop a method of probabilistic
 sparse matrix factorization (PSMF) and apply
 it to gene expression data to learn the hidden
 structure underlying the data.
Biological Background

   Genes encode basic information about an organism
       They tend to be highly expressed in tissues related to their
        functional role
   Mouse gene expression data is from Zhang, Morris,
    et al. (2004)
   Gene expression is influenced by the presence of
    transcription factors (TFs)
       Co-expressed genes are likely activated by the same TFs
       The activity of each gene can be explained by the activities
        of a small number of transcription factors
Gene Expression Array Dataset

   Entire data set: X, a G×T matrix (G=22709 genes, T=55 tissues)
   Expression vector for gene XM_133866.1: x_g (g=10056), a row vector of length T=55
   Scalar expression values: x_gt
   [Figure: heat map of X with a 100-gene excerpt; rows are genes, columns are tissues.
    Labeled tissues: bladder (t=3), colon (t=9), hindbrain (t=22), large intestine (t=25),
    lymph node (t=28), midbrain (t=31), pancreas (t=34), small intestine (t=41),
    spleen (t=44), stomach (t=45). Color scale: 0 to >10.]
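A minimal sketch of this layout, assuming the data are held in a NumPy array. The values are random placeholders, not the Zhang, Morris, et al. measurements, and the array indices are 0-based, so they sit one below the 1-based g and t used on the slide.

```python
import numpy as np

G, T = 22709, 55                          # genes x tissues, as on the slide
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(G, T))   # placeholder values only (assumption)

g = 10056                                 # 1-based gene index (XM_133866.1)
t = 3                                     # 1-based tissue index (bladder)

x_g = X[g - 1]                            # expression profile: row vector of length T
x_gt = X[g - 1, t - 1]                    # one scalar expression value
```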
Sparse Matrix Factorization

   Gene expression data model:
       Each gene’s expression profile (x_g) is …
        a linear combination (weighted by y_gc, c ∈ s_g) …
        of a small number (r_g ≤ N) …
        of C possible transcription factor profiles (z_c, c ∈ s_g)

       \[ x_g \approx \sum_{n=1}^{r_g} y_{g s_{gn}} \, z_{s_{gn}} \]
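The model for a single gene can be sketched as follows. This is an illustration under assumed toy sizes and made-up weights, not the authors' code; the names `y_g`, `s_g`, `r_g` and `Z` simply mirror the slide's symbols, with 0-based factor indices.

```python
import numpy as np

rng = np.random.default_rng(0)
C, T = 6, 5                       # toy sizes (hypothetical, not the slide's data)
Z = rng.standard_normal((C, T))   # row c is the factor profile z_c

r_g = 2                           # number of factors used by this gene
s_g = np.array([4, 1])            # 0-based indices of the factors it uses
y_g = np.zeros(C)                 # weights y_gc; zero for every unused factor
y_g[s_g] = [0.7, -1.2]            # made-up weights for the two active factors

# x_g ~= sum over n = 1..r_g of y_{g,s_gn} * z_{s_gn}
x_g_hat = sum(y_g[s_g[n]] * Z[s_g[n]] for n in range(r_g))
```

Because every other entry of `y_g` is zero, the same vector is obtained from the dense product `y_g @ Z`.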
Sparse Matrix Factorization

   Matrix format (entire dataset): X ≈ Y · Z, with the sparsity pattern of Y recorded in S and r

\[
\overbrace{\begin{bmatrix}
x_{11} & x_{12} & \cdots & x_{1T} \\
x_{21} & x_{22} & \cdots & x_{2T} \\
x_{31} & x_{32} & \cdots & x_{3T} \\
x_{41} & x_{42} & \cdots & x_{4T} \\
x_{51} & x_{52} & \cdots & x_{5T} \\
x_{61} & x_{62} & \cdots & x_{6T} \\
x_{71} & x_{72} & \cdots & x_{7T} \\
x_{81} & x_{82} & \cdots & x_{8T} \\
\vdots & \vdots &        & \vdots \\
x_{G1} & x_{G2} & \cdots & x_{GT}
\end{bmatrix}}^{X}
\approx
\overbrace{\begin{bmatrix}
0      & y_{12} & 0      & 0      & y_{15} & 0      \\
0      & y_{22} & 0      & 0      & 0      & y_{26} \\
y_{31} & 0      & 0      & 0      & y_{35} & 0      \\
0      & 0      & 0      & y_{44} & 0      & 0      \\
y_{51} & 0      & y_{53} & 0      & 0      & 0      \\
0      & 0      & 0      & 0      & y_{65} & 0      \\
0      & y_{72} & 0      & 0      & 0      & 0      \\
y_{81} & 0      & y_{83} & 0      & 0      & 0      \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
y_{G1} & 0      & 0      & y_{G4} & 0      & 0
\end{bmatrix}}^{Y}
\times
\overbrace{\begin{bmatrix}
z_{11} & z_{12} & \cdots & z_{1T} \\
z_{21} & z_{22} & \cdots & z_{2T} \\
z_{31} & z_{32} & \cdots & z_{3T} \\
z_{41} & z_{42} & \cdots & z_{4T} \\
z_{51} & z_{52} & \cdots & z_{5T} \\
z_{61} & z_{62} & \cdots & z_{6T}
\end{bmatrix}}^{Z}
\]

   S holds, for each gene g, the indices s_g1, s_g2, … of the factors with nonzero weight in row g of Y; r holds the counts r_g
       In the illustration (C=6 factors), row 2 of Y is nonzero in columns 2 and 6, so s_2 = (6, 2) and r_2 = 2
       Row 4 is nonzero only in column 4, so s_4 = (4) and r_4 = 1
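The matrix form can be sketched by building the sparse Y from S and r and multiplying by Z. The sparsity pattern below copies the first eight rows of the slide's illustration (shifted to 0-based indices); the weights, the sizes, the `-1` padding in `S`, and the order of indices within each row are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
G, C, T = 8, 6, 4                 # toy sizes (hypothetical)

# r[g]: number of factors for gene g; S[g, :r[g]]: their 0-based indices.
# -1 is padding for unused slots and is never read.
r = np.array([2, 2, 2, 1, 2, 1, 1, 2])
S = np.array([[4, 1, -1], [5, 1, -1], [4, 0, -1], [3, -1, -1],
              [2, 0, -1], [4, -1, -1], [1, -1, -1], [2, 0, -1]])

Z = rng.standard_normal((C, T))   # factor profiles, one per row
Y = np.zeros((G, C))              # sparse weight matrix, stored densely here
for g in range(G):
    active = S[g, :r[g]]                       # factors assigned to gene g
    Y[g, active] = rng.standard_normal(r[g])   # made-up nonzero weights

X_hat = Y @ Z                     # X ~= Y . Z
```

Storing Y densely keeps the sketch short; at the slide's scale (G=22709) a sparse representation built from S and r would be the more natural choice.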
Probabilistic Sparse Matrix Factorization

   To express as a distribution, assume …
       varying levels of Gaussian noise in the data:
           \[ P(x_g \mid y_g, Z, s_g, r_g, \psi_g) = \mathcal{N}\Big(x_g;\ \sum_{n=1}^{r_g} y_{g s_{gn}} z_{s_{gn}},\ \psi_g^2 I\Big) \]
       nothing about transcription factor weights: \( P(y_g) \propto 1 \)
       normally-distributed transcription factor profiles: \( P(z_c) = \mathcal{N}(z_c; 0, I) \)
       uniformly-distributed factor assignments: \( P(s_g \mid r_g) = \prod_{n=1}^{r_g} \prod_{c=1}^{C} (1/C)^{\delta(s_{gn} - c)} \)
       multinomially-distributed factor counts: \( P(r_g = n) = \nu_n \)
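These assumptions describe a generative process, which can be sketched by sampling from it. Two things here are stand-ins rather than part of the slides: the flat prior on the weights is improper and cannot be sampled, so N(0, 1) weights are used instead, and the per-gene noise levels `psi` are arbitrary. The `nu` values are the P(r_g) reported on the Visualization slide; the sizes are toy values.

```python
import numpy as np

rng = np.random.default_rng(2)
G, C, T, N = 100, 10, 55, 3              # toy sizes (hypothetical)
nu = np.array([0.55, 0.27, 0.18])        # P(r_g = n) = nu_n for n = 1..N
psi = rng.uniform(0.1, 0.5, size=G)      # per-gene noise std (assumed values)

Z = rng.standard_normal((C, T))                     # z_c ~ N(0, I)
r = rng.choice(np.arange(1, N + 1), size=G, p=nu)   # factor counts r_g
S = rng.integers(0, C, size=(G, N))                 # s_gn uniform over C factors

X = np.empty((G, T))
for g in range(G):
    s_g = S[g, :r[g]]                    # only the first r_g assignments are used
    y = rng.standard_normal(r[g])        # stand-in for the improper flat prior
    mean = y @ Z[s_g]                    # sum_n y_{g,s_gn} z_{s_gn}
    X[g] = mean + psi[g] * rng.standard_normal(T)   # noise with variance psi_g^2
```

As written, the uniform prior lets a gene draw the same factor index twice; the sketch does not filter such repeats.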

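   Multiply together to get joint distribution
       \[
       P(X, Y, Z, S, r \mid \Psi) = P(X \mid Y, Z, S, r, \Psi) \cdot P(Y) \cdot P(Z) \cdot P(S \mid r) \cdot P(r)
       \]
       \[
       \propto
       \Big[ \prod_{g=1}^{G} \mathcal{N}\Big(x_g;\ \sum_{n=1}^{r_g} y_{g s_{gn}} z_{s_{gn}},\ \psi_g^2 I\Big) \Big]
       \Big[ \prod_{c=1}^{C} \mathcal{N}(z_c; 0, I) \Big]
       \Big[ \prod_{g=1}^{G} \prod_{c=1}^{C} \prod_{n=1}^{N} (1/C)^{\delta(s_{gn} - c)} \Big]
       \Big[ \prod_{g=1}^{G} \prod_{n=1}^{N} \nu_n^{\delta(r_g - n)} \Big]
       \]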
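The joint distribution can be evaluated in log form by adding up the four bracketed factors. The function below is a sketch under assumptions: `log_joint` and its argument layout are hypothetical, indices in `S` are 0-based, and the flat P(Y) is dropped because it only contributes a constant.

```python
import numpy as np

def log_joint(X, Y, Z, S, r, psi, nu):
    """Unnormalized log P(X, Y, Z, S, r | Psi); the flat P(Y) adds a constant."""
    G, T = X.shape
    C = Z.shape[0]
    N = S.shape[1]
    total = 0.0
    for g in range(G):
        s_g = S[g, :r[g]]                      # active factor indices for gene g
        mean = Y[g, s_g] @ Z[s_g]              # sum_n y_{g,s_gn} z_{s_gn}
        resid = X[g] - mean
        # log N(x_g; mean, psi_g^2 I)
        total += -0.5 * T * np.log(2 * np.pi * psi[g] ** 2)
        total += -0.5 * (resid @ resid) / psi[g] ** 2
        total += np.log(nu[r[g] - 1])          # log P(r_g = n) = log nu_n
    total += -G * N * np.log(C)                # each s_gn contributes log(1/C)
    # log N(z_c; 0, I), summed over all C factors
    total += -0.5 * Z.size * np.log(2 * np.pi) - 0.5 * np.sum(Z ** 2)
    return total
```

Working in logs avoids underflow: with G in the tens of thousands, the raw product of densities would be far too small to represent in floating point.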
Factorized Variational Inference

   Exact inference is intractable with P(∙)
       \[
       P(X, Y, Z, S, r \mid \Psi) \propto
       \Big[ \prod_{g=1}^{G} \mathcal{N}\Big(x_g;\ \sum_{n=1}^{r_g} y_{g s_{gn}} z_{s_{gn}},\ \psi_g^2 I\Big) \Big]
       \Big[ \prod_{c=1}^{C} \mathcal{N}(z_c; 0, I) \Big]
       \]
       \[
       \times
       \Big[ \prod_{g=1}^{G} \prod_{c=1}^{C} \prod_{n=1}^{N} (1/C)^{\delta(s_{gn} - c)} \Big]
       \Big[ \prod_{g=1}^{G} \prod_{n=1}^{N} \nu_n^{\delta(r_g - n)} \Big]
       \]

   Approximate it by a simpler distribution, Q(∙),
    and perform inference on that
       \[
       P(Y, Z, S, r \mid X, \Psi) \approx
       \prod_{g=1}^{G} \prod_{c=1}^{C} Q(y_{gc}) \cdot
       \prod_{c=1}^{C} \prod_{t=1}^{T} Q(z_{ct}) \cdot
       \prod_{g=1}^{G} \prod_{n=1}^{N} Q(s_{gn}) \cdot
       \prod_{g=1}^{G} Q(r_g)
       \]
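The slides do not say how Q is fitted. A common criterion for factorized approximations, given here as an assumption rather than as the authors' stated procedure, is to choose the Q that minimizes the variational free energy, which differs from the KL divergence between Q and the true posterior only by a term that does not depend on Q:

```latex
\[
\mathcal{F}(Q) = \sum_{S,\, r} \int Q(Y, Z, S, r)\,
  \log \frac{Q(Y, Z, S, r)}{P(X, Y, Z, S, r \mid \Psi)} \, dY \, dZ
\]
\[
\mathcal{F}(Q) = \mathrm{KL}\big( Q(Y, Z, S, r) \,\big\|\, P(Y, Z, S, r \mid X, \Psi) \big)
  - \log P(X \mid \Psi)
\]
```

Since the joint is only known up to a constant (the weight prior is flat), F is also defined up to an additive constant, which does not change the minimizing Q.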
Visualization

   PROBABILISTIC SPARSE MATRIX FACTORIZATION
       C=50 possible factors
       N=3 factors per gene (max)
       P(r_g)=[.55 .27 .18]
   [Figure: heat maps of X (G genes × T tissues) and of X = Y · Z, where Y is
    G genes × C factors and Z is C factors × T tissues. Color scale: 0 to >10.
    *Sorted by primary transcription factor (s_g1).]
Results – p-value histograms

   Genes can be partitioned into “primary classes” (i.e. same s_g1 value),
    “secondary classes”, etc.
       Compare classes with annotated Gene Ontology (GO-BP) categories
        for statistical significance
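The slides do not name the significance test. One common choice for comparing a gene class with an annotated category is a one-sided hypergeometric test, sketched below as an assumption; the function name and its arguments are hypothetical.

```python
from math import comb, log10

def enrichment_log10_p(population, annotated, cluster_size, overlap):
    """log10 of P(at least `overlap` annotated genes in the class) under a
    hypergeometric null: the class is a random draw from the population."""
    upper = min(annotated, cluster_size)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, cluster_size - k)
        for k in range(overlap, upper + 1)
    )
    # Exact integer counts; the ratio is the tail probability.
    p = tail / comb(population, cluster_size)
    return log10(p)

# Hypothetical numbers: 1000 genes, 50 in the GO category,
# a class of 40 genes of which 10 carry the annotation.
lp = enrichment_log10_p(1000, 50, 40, 10)
```

More negative values mean a stronger overlap than chance would predict, which matches the log10(p-value) axis used in the histograms.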
   [Figure: histograms of log10(p-value) (x-axis, -20 to 0) against frequency
    (y-axis, 0 to 0.4), one panel each for hierarchical agglomerative clustering,
    PSMF (primary), PSMF (secondary), PSMF (tertiary), and random clustering.]
Results – mean log10 p-values

   [Figure: "Mean log10 p-values". x-axis: C (# clusters, factors), 10 to 100;
    y-axis: mean log10(p-value), 0 to -25. Curves: PSMF N={1,2,3} primary
    (i.e. s_g1 clustering); hierarchical agglomerative clustering;
    PSMF N={2,3} secondary; PSMF N=3 tertiary; random clustering.]
Results – count of significant p-values
[Figure: fraction of factors with significance (y-axis, 0% to 100%) versus C, the number of clusters or factors (x-axis, 10 to 100). Curves: PSMF N={1,2,3} primary (i.e. sg1 clustering); PSMF N={2,3} secondary; hierarchical agglomerative clustering; PSMF N=3 tertiary; random clustering.]
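The two summary statistics plotted in the results figures (mean log10 p-value and fraction of factors with significance) can be sketched as below. This is only an illustration: the function name `summarize_p_values`, the threshold `alpha=0.05`, and the example p-values are assumptions, not values taken from the slides.

```python
import math

def summarize_p_values(p_values, alpha=0.05):
    """Return (mean log10 p-value, fraction of factors with p < alpha).

    `alpha` is a hypothetical significance threshold chosen for illustration.
    """
    logs = [math.log10(p) for p in p_values]
    mean_log10 = sum(logs) / len(logs)
    fraction_significant = sum(p < alpha for p in p_values) / len(p_values)
    return mean_log10, fraction_significant

# Made-up p-values, one per factor (or cluster), purely for demonstration.
example_p_values = [1e-20, 1e-6, 1e-2, 0.5]
mean_log10, fraction_significant = summarize_p_values(example_p_values)
print(mean_log10, fraction_significant)
```

A more negative mean log10 p-value and a larger significant fraction both indicate stronger enrichment, which is how the curves in the two figures are read.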
Future Directions – different Q(·)

Alternative forms of Q(·):

    \prod_{c=1}^{C} Q(z_c) \times \prod_{g=1}^{G} Q(y_g, r_g, s_g)

    \prod_{g=1}^{G} \prod_{c=1}^{C} Q(y_{gc}) \times \prod_{c=1}^{C} Q(z_c) \times \prod_{g=1}^{G} Q(r_g, s_g)

    \prod_{g=1}^{G} \prod_{c=1}^{C} Q(y_{gc}) \times \prod_{c=1}^{C} Q(z_c) \times \prod_{g=1}^{G} \prod_{n=1}^{N} Q(s_{gn}) \times \prod_{g=1}^{G} Q(r_g)

    Iterated conditional modes (point estimates)

[Figure: complete log likelihood (y-axis, ×10^4) versus iteration (x-axis, 0 to 50), with one curve labelled "iterated conditional modes".]

**NOTE: The complete log likelihoods are not necessarily monotonically increasing, due to the non-negativity constraint, which is implemented via a zero-thresholding heuristic.
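A small worked example may help show why a zero-thresholding heuristic can break monotonicity. The sketch below is not the PSMF update itself. It assumes a plain squared-error objective (a stand-in for a negative Gaussian log likelihood) and a hypothetical two-factor problem: the unconstrained least-squares solution is computed, negative weights are clipped to zero, and the error ends up larger than at the previous non-negative iterate.

```python
def sq_error(x, Z, y):
    """Squared reconstruction error sum_t (x_t - sum_c Z[t][c] * y[c])^2."""
    return sum(
        (xt - sum(ztc * yc for ztc, yc in zip(row, y))) ** 2
        for xt, row in zip(x, Z)
    )

def least_squares_2(x, Z):
    """Unconstrained least squares for two weights via the normal equations."""
    a = sum(r[0] * r[0] for r in Z)
    b = sum(r[0] * r[1] for r in Z)
    d = sum(r[1] * r[1] for r in Z)
    u = sum(r[0] * xt for r, xt in zip(Z, x))
    v = sum(r[1] * xt for r, xt in zip(Z, x))
    det = a * d - b * b
    return [(d * u - b * v) / det, (a * v - b * u) / det]

# Hypothetical toy data: two observations, two correlated factors.
x = [1.0, -1.0]
Z = [[1.0, 1.0],
     [0.0, 1.0]]

y_old = [1.0, 0.0]                               # previous non-negative iterate
y_free = least_squares_2(x, Z)                   # unconstrained optimum: [2.0, -1.0]
y_clipped = [max(0.0, w) for w in y_free]        # zero-thresholding: [2.0, 0.0]

print(sq_error(x, Z, y_old))      # 1.0
print(sq_error(x, Z, y_clipped))  # 2.0, worse than before the update
```

Clipping the unconstrained optimum is not the same as solving the constrained problem, so an update of this kind can move the objective in the wrong direction on some iterations.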
Summary

   Introduced probabilistic sparse matrix
    factorization (PSMF), in which each row of the
    data matrix is a linear combination of a “small”
    number of hidden factors selected from a larger set.
   Described a variational inference algorithm
    for fitting the PSMF model.
   Evaluated the model on a gene function
    prediction task.
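The sparse structure summarized above can be illustrated with a toy generative sketch: each of G rows is built from at most N of C hidden factors, plus noise. Everything concrete here is an assumption made for illustration (the function name `sample_psmf`, the Gaussian draws, the non-negative weights, and the default sizes). It shows only the shape of the model, not the variational inference algorithm used to fit it.

```python
import random

def sample_psmf(G=6, T=4, C=5, N=3, noise_sd=0.1, seed=0):
    """Draw toy data X (G x T) where each row uses at most N of C factors."""
    rng = random.Random(seed)
    # Z: C hidden factor profiles, each a row vector of length T.
    Z = [[rng.gauss(0, 1) for _ in range(T)] for _ in range(C)]
    # Y: G x C weight matrix, mostly zeros.
    Y = [[0.0] * C for _ in range(G)]
    X = []
    for g in range(G):
        r_g = rng.randint(1, N)           # number of active factors for row g
        s_g = rng.sample(range(C), r_g)   # indices of the active factors
        for c in s_g:
            # Non-negative weights are an assumption of this sketch.
            Y[g][c] = abs(rng.gauss(0, 1))
        row = [
            sum(Y[g][c] * Z[c][t] for c in s_g) + rng.gauss(0, noise_sd)
            for t in range(T)
        ]
        X.append(row)
    return X, Y, Z

X, Y, Z = sample_psmf()
active_counts = [sum(1 for w in row if w != 0.0) for row in Y]
print(active_counts)  # each entry is between 1 and N
```

Fitting would run in the opposite direction: given only X, infer which factors each row uses and with what weights.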

				