					Kernels on lists as sets over relational algebra: an
application to classification of protein fingerprints




                   Artificial Intelligence Group
                 Department of Computer Science
                       University of Geneva
                           Switzerland


                        Adam Woźnica,
                      Alexandros Kalousis,
                         Melanie Hilario
Kernels on lists as sets: an application to classification of protein fingerprints



                                                   Motivation

  • Previous work:
       – Recently we proposed a class of general kernels over relational algebra
       – In relational algebra it is easy to model sets
       – The building blocks of these kernels are elementary kernels and kernels on sets
  • Here:
       – We extend the representation language such that we can model lists
       – We define a class of kernels over extended relational algebra
       – The building blocks are elementary kernels, kernels on sets and kernels on lists
       – We apply an SVM with these kernels to protein fingerprint data



                                                      Outline

  • Extended relational algebra representation
       – Relational instance
  • Relational kernel:
       – Kernel on lists
       – Kernel on sets
  • Experiments on protein fingerprints data
  • Conclusions and future work



                     Extended relational algebra representation (I)

  • Relational database schema: a set of relations R = {R1, R2, . . . , Rn }
  • Relation Ri : a set of tuples
  • Tuple Rij of a relation Ri : particular row in Ri
  • Potential key: an attribute Ak of a relation Ri that assumes a unique value for each
    tuple in the relation
  • Foreign key: an attribute Al of a relation Rj that references a potential key Ak of
    relation Ri
  • One-to-many relation: the association between Ak and Al, i.e. one element of Ri
    can be associated with many elements of Rj



                    Extended relational algebra representation (II)

  • Set link: defined on the basis of a foreign key relation; a quadruple of the form
    sl(Ri, Ak, Rj, Al) where either Al is a foreign key of Rj referencing a potential key Ak
    of Ri or vice versa. SL(Ri) is the set of set links in which Ri participates
  • List link: a quintuple of the form ll(Ri, Ak, Rj, Al, LIST(Al)). LL(Ri) is the set
    of list links in which Ri participates
    [Figure: schema diagram showing relations R1, R2 and R3 and the links between them]

  • Standard attributes: Attributes of a relation Ri that are not keys (i.e. referenced keys,
    foreign keys or attributes defined as keys but not referenced)
  • Set attributes: Each link of SL(Ri) defines an attribute of type set.
    Its values are sets of tuples
  • List attributes: Each link of LL(Ri) defines an attribute of type list.
    Its values are lists of tuples



                                            Relational instance

  • To retrieve the description of a relational instance, we follow all the links in which it participates
  • There is a main relation, M, in which the class of each relational instance is given
  • A relational instance can span all the relations in a relational database
  • Relational learning algorithm: should handle attributes of type set and list
  • The resulting representation is a tree like structure
       – root of the tree: one tuple from the main relation M
       – nodes contain tuples from different relations
       – connections between nodes determined by foreign key associations
       – tree is ordered where children come from list attributes and unordered where they come from set attributes
       – tree is typed
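
A minimal sketch of how such a tree could be assembled by following links from a tuple of the main relation. The dictionary-based database layout, the `Node` class, the `build_instance` function and the `order_by` field are all assumptions made for illustration; the slides do not specify how the order of a list attribute is stored, and only parent-to-child links are followed here.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """One tuple of a relation, with its set- and list-valued attributes."""
    relation: str
    standard: dict                               # non-key attribute values
    sets: dict = field(default_factory=dict)     # child relation -> unordered children
    lists: dict = field(default_factory=dict)    # child relation -> ordered children

def build_instance(db, links, relation, row):
    """Recursively follow every link that starts at `relation` (assumes no cycles)."""
    # Attributes used as keys by any link are not standard attributes.
    key_names = {l["key"] for l in links if l["parent"] == relation}
    key_names |= {l["fkey"] for l in links if l["child"] == relation}
    node = Node(relation, {k: v for k, v in row.items() if k not in key_names})
    for link in links:
        if link["parent"] != relation:
            continue
        rows = [r for r in db[link["child"]] if r[link["fkey"]] == row[link["key"]]]
        if link["kind"] == "list":
            # Assumption: the list order is given by a designated attribute.
            rows = sorted(rows, key=lambda r: r[link["order_by"]])
        children = [build_instance(db, links, link["child"], r) for r in rows]
        target = node.lists if link["kind"] == "list" else node.sets
        target[link["child"]] = children
    return node

# Hypothetical toy database: one fingerprint with two motifs.
db = {
    "fingerprint": [{"id": 1, "n_motifs": 2}],
    "motif": [{"fp": 1, "pos": 2, "length": 8}, {"fp": 1, "pos": 1, "length": 5}],
}
links = [{"kind": "list", "parent": "fingerprint", "key": "id",
          "child": "motif", "fkey": "fp", "order_by": "pos"}]
root = build_instance(db, links, "fingerprint", db["fingerprint"][0])
print([m.standard for m in root.lists["motif"]])
```

The root holds one tuple of the main relation, and each child node is typed by the relation it comes from, which matches the tree described in the bullets above.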



                                                Some notation

  • The attributes of a tuple of a relation Ri consist of three parts:
       – the standard attributes, IA,Ri
       – the attributes of type set, Isl(Ri)
       – the attributes of type list, Ill(Ri)
  • Extra notation:
       – $v_{aI_{A,R_i}} = (v_{a1}, v_{a2}, \ldots, v_{a|I_{A,R_i}|})$: vector of standard attributes of tuple Ria
       – $v_{aI_{sl(R_i)}} = (v_{s1}, v_{s2}, \ldots, v_{s|I_{sl(R_i)}|})$: vector of attributes of type set of tuple Ria, i.e.
         sets of tuples from some other relation Rj
       – $v_{aI_{ll(R_i)}} = (v_{l1}, v_{l2}, \ldots, v_{l|I_{ll(R_i)}|})$: vector of attributes of type list of tuple Ria, i.e.
         lists of tuples from some other relation Rj
  • So, we need:
       – kernel, k(Ria , Rib ), between tuples Ria , Rib
       – kernel between sets of tuples of Ri
       – kernel between lists of tuples of Ri



                                               Relational kernel

  • The direct sum kernel:


           $$k_\Sigma(R_{ia}, R_{ib}) = k_s(v_{aI_{A,R_i}}, v_{bI_{A,R_i}}) + \sum_{l \in I_{sl(R_i)}} K_{set}(v_{al}, v_{bl}) + \sum_{l \in I_{ll(R_i)}} K_{list}(v_{al}, v_{bl})$$




       – ks(., .) is an elementary kernel defined on the standard attributes
       – Kset(., .) is a kernel on sets
       – Klist(., .) is a kernel on lists
  • We normalize it by: $\frac{k_\Sigma(R_{ia}, R_{ib})}{1 + |I_{sl(R_i)}| + |I_{ll(R_i)}|}$
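
The direct sum and its normalization can be sketched as below. This is an illustration under assumptions, not the authors' implementation: tuples are represented as plain dictionaries, and the three component kernels passed in (`dot`, `shared`, `same_pos`) are hypothetical stand-ins chosen only to keep the example short.

```python
def direct_sum_kernel(a, b, k_s, k_set, k_list):
    """Normalized direct-sum kernel between two tuples of the same relation."""
    total = k_s(a["standard"], b["standard"])
    for name in a["sets"]:                       # one term per set attribute
        total += k_set(a["sets"][name], b["sets"][name])
    for name in a["lists"]:                      # one term per list attribute
        total += k_list(a["lists"][name], b["lists"][name])
    # Divide by 1 + number of set attributes + number of list attributes.
    return total / (1 + len(a["sets"]) + len(a["lists"]))

# Hypothetical stand-in kernels (not the kernels used on the slides).
dot = lambda u, v: sum(x * y for x, y in zip(u, v))
shared = lambda s, t: float(len(set(s) & set(t)))
same_pos = lambda s, t: float(sum(x == y for x, y in zip(s, t)))

a = {"standard": (1.0, 2.0), "sets": {"motifs": {"m1", "m2"}},
     "lists": {"order": ["m1", "m2"]}}
b = {"standard": (0.5, 1.0), "sets": {"motifs": {"m2", "m3"}},
     "lists": {"order": ["m1", "m3"]}}
value = direct_sum_kernel(a, b, dot, shared, same_pos)
print(value)  # (2.5 + 1 + 1) / 3 = 1.5
```

Dividing by the number of summands keeps tuples with many set and list attributes from dominating tuples with few.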



                                                Kernels on lists

  • Some notation:
       – i: a sequence i1 ≤ i2 ≤ · · · ≤ in of indices
       – l(i): the length of a sequence i
  • Contiguous Sublist Kernel
       – $K_{list}(v_{al}, v_{bl}) = \sum_{i,j,\, l(i)=l(j)} \lambda^{l(i)} \prod_{s=1,\ldots,l(i)} K_\Sigma(v_{al}[i_s], v_{bl}[j_s])$
       – subsequences i, j are contiguous
       – 0 < λ < 1 is a parameter penalizing longer subsequences
       – computable in O(l(val) l(vbl)) kernel evaluations, i.e. proportional to the product of the two list lengths
  • Longest Common Sublist Kernel
       – $K_{list}(v_{al}, v_{bl}) = m + \sum_{s=1}^{m} K_\Sigma(v_{al}[i_s], v_{bl}[i_s]) + n$
       – m = min(l(val ), l(vbl ))
       – n = 1 if the lists are of the same length; 0 otherwise
  • The former takes a more global view.
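
Both list kernels can be sketched in a few lines. The first function uses a dynamic program, which is one way (an assumption, since the slides give no algorithm) to reach the quadratic cost: D[p][q] collects all matching contiguous sublists ending at positions p and q, and D[p][q] = λ k(a[p], b[q]) (1 + D[p-1][q-1]). The second function reads the Longest Common Sublist formula literally and takes i_s = s, which is also an assumption. The function names and the `delta` element kernel are hypothetical.

```python
def contiguous_sublist_kernel(a, b, k, lam=0.5):
    """Sum over all pairs of equal-length contiguous sublists of a and b."""
    total = 0.0
    prev = [0.0] * (len(b) + 1)          # D values for the previous element of a
    for x in a:
        cur = [0.0] * (len(b) + 1)
        for q, y in enumerate(b, start=1):
            # Extend every matching sublist ending at the previous positions,
            # or start a new sublist of length 1.
            cur[q] = lam * k(x, y) * (1.0 + prev[q - 1])
            total += cur[q]
        prev = cur
    return total

def position_wise_list_kernel(a, b, k):
    """Literal reading of m + sum_{s=1..m} k(a[s], b[s]) + n, assuming i_s = s."""
    m = min(len(a), len(b))
    n = 1 if len(a) == len(b) else 0
    return m + sum(k(a[s], b[s]) for s in range(m)) + n

# Hypothetical element kernel: 1 if the elements are equal, 0 otherwise.
delta = lambda x, y: 1.0 if x == y else 0.0

print(contiguous_sublist_kernel("abc", "abd", delta, lam=0.5))  # 0.5 + 0.5 + 0.25 = 1.25
print(position_wise_list_kernel("abc", "abd", delta))           # 3 + 2 + 1 = 6.0
```

In the relational setting the element kernel `k` would be the tuple kernel itself rather than `delta`.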



                                                Kernels on sets

  • We used the Cross Product Kernel: (uak = {ux} ⊆ Ri, ubk = {uy} ⊆ Ri)

           $$K_{set}(u_{ak}, u_{bk}) = \sum_{u_x \in u_{ak},\, u_y \in u_{bk}} K_\Sigma(u_x, u_y)$$

  • Kernel KΣ(., .) will be computed recursively by alternating computations of ks(., .),
    Kset(., .) and Klist (., .)
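
The recursion can be illustrated with a reduced sketch that handles standard attributes and set attributes only; list attributes would add one more term of the same shape. The dictionary layout of a tuple, the function names and the use of a dot product as elementary kernel are assumptions made for the example.

```python
def cross_product_kernel(xs, ys, k):
    """Sum of k(x, y) over every pair drawn from the two sets."""
    return sum(k(x, y) for x in xs for y in ys)

def k_sigma(a, b):
    """Normalized tuple kernel that recurses into set-valued attributes."""
    # Elementary kernel on the standard attributes (assumed: dot product).
    total = sum(u * v for u, v in zip(a["standard"], b["standard"]))
    for name, children in a["sets"].items():
        # The set kernel calls k_sigma again on the child tuples.
        total += cross_product_kernel(children, b["sets"][name], k_sigma)
    return total / (1 + len(a["sets"]))

# Hypothetical two-level instances: fingerprints with sets of motifs.
m1 = {"standard": (1.0,), "sets": {}}
m2 = {"standard": (2.0,), "sets": {}}
m3 = {"standard": (3.0,), "sets": {}}
fa = {"standard": (1.0,), "sets": {"motifs": [m1, m2]}}
fb = {"standard": (2.0,), "sets": {"motifs": [m3]}}
print(k_sigma(fa, fb))  # (1*2 + (1*3 + 2*3)) / 2 = 5.5
```

The recursion stops at tuples that have no set attributes, where only the elementary kernel is evaluated.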



                                                  Experiments (I)

    [Figure: a fingerprint shown as a group of motifs drawn from a multiple sequence alignment.
     Example motif with aligned segments cydeggis, cyedggis, cyeeggit, cyhgdggs, cyrgdgnt
     and pattern c-y-x2-[dg]-g-x-[st]]

  • Fingerprint: group of conserved motifs drawn from multiple sequence alignment
  • Used as diagnostic signatures to characterize collections of protein sequences
  • Characterized by component motifs and protein sequences
  • Diagnostic for gene family, superfamily or domain
  • Important because proteins within a fingerprint should have similar function



                                              Experiments (II)




  • A fingerprint is associated with
       – set of motifs
       – list of motifs
       – set and list of motifs



                                             Experiments (III)

  • The data were divided into a training set (1487 instances) and a blind test set (355 instances)
  • We applied the SVM algorithm (C = 1)
  • Results (accuracy):


                        Table 1: Accuracy and rank results on the protein fingerprints dataset.

                  Elementary kernel          sets                  lists             sets and lists
                                      10-fold CV  test set  10-fold CV  test set  10-fold CV  test set
                  Contiguous Sublist Kernel
                  kP (p=2, a=1)         85.08      84.22      85.88      86.08      87.63      87.35
                  kG (γ=0.1)            83.66      81.97      83.59      83.86      86.01      83.99
                  Longest Common Sublist Kernel
                  kP (p=2, a=1)                               85.61      86.01      86.75      86.35
                  kG (γ=0.1)                                  83.86      84.13      85.81      73.3
                  Default accuracy      54.4



  • Best results reported in the literature: 85.91% for 10-fold CV and 85.92% for the
    independent test set
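
The slides do not define the elementary kernels kP and kG. Assuming they denote the usual polynomial kernel (⟨x, y⟩ + a)^p and the Gaussian kernel exp(−γ‖x − y‖²), a sketch with the parameter values from the table as defaults could look like this; the function names are hypothetical.

```python
import math

def polynomial_kernel(x, y, p=2, a=1.0):
    """Assumed form of kP: (<x, y> + a) ** p."""
    return (sum(u * v for u, v in zip(x, y)) + a) ** p

def gaussian_kernel(x, y, gamma=0.1):
    """Assumed form of kG: exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(x, y)))

print(polynomial_kernel((1.0, 2.0), (3.0, 4.0)))  # (3 + 8 + 1) ** 2 = 144.0
print(gaussian_kernel((0.0, 0.0), (1.0, 1.0)))    # exp(-0.2)
```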



                                    Conclusions and future work

  • Conclusions:
       – New representation language where instances are described using an extension of
         relational algebra
       – Easy to model sets and lists
       – New kernels on lists
       – Promising results on the protein fingerprint classification data
           ∗ Better than the state-of-the-art
  • Future work:
       – Other kernels on lists (sets)
       – Why does the combination of data representations improve the results?



                                                     Software

  • Relational extension to WEKA:

     http://cui.unige.ch/~woznica/rel_weka/



                                                   Thank you!