# Kernels on lists as sets over relational algebra an

```
Kernels on lists as sets over relational algebra: an
application to classification of protein fingerprints

Artificial Intelligence Group
Department of Computer Science
University of Geneva
Switzerland

Adam Woźnica,
Alexandros Kalousis,
Melanie Hilario
Kernels on lists as sets: an application to classification of protein fingerprints

Motivation

• Previous work:
– Recently we proposed a class of general kernels over relational algebra
– In relational algebra it is easy to model sets
– The building blocks of these kernels are elementary kernels and kernels on sets
• Here:
– We extend the representation language so that it can also model lists
– We define a class of kernels over the extended relational algebra
– The building blocks are elementary kernels, kernels on sets and kernels on lists
– We apply SVMs with these kernels to protein fingerprint data

Outline

• Extended relational algebra representation
– Relational instance
• Relational kernel:
– Kernel on lists
– Kernel on sets
• Experiments on protein fingerprints data
• Conclusions and future work

Extended relational algebra representation (I)

• Relational database schema: a set of relations R = {R_1, R_2, ..., R_n}
• Relation R_i: a set of tuples
• Tuple R_{ij} of a relation R_i: a particular row in R_i
• Potential key: an attribute A_k of a relation R_i that assumes a unique value for each
tuple in the relation
• Foreign key: an attribute A_l of a relation R_j that references a potential key A_k of
relation R_i
• One-to-many relation: the association between A_k and A_l, i.e. one element of R_i
can be associated with many elements of R_j

Extended relational algebra representation (II)

• Set link: defined on the basis of a foreign key relation; it is a quadruple of the form
sl(R_i, A_k, R_j, A_l), where either A_l is a foreign key of R_j referencing a potential key A_k
of R_i or vice versa. SL(R_i) is the set of set links in which R_i participates
• List link: a tuple of the form ll(R_i, A_k, R_j, A_l, LIST(A_l)). LL(R_i) is the set
of list links in which R_i participates
[Figure: three relations, R1, R2 and R3, connected by links]

• Standard attributes: attributes of a relation R_i that are not keys (i.e. referenced keys,
foreign keys or attributes defined as keys but not referenced)
• Set attributes: each link of SL(R_i) defines an attribute of type set.
Its values are sets of tuples
• List attributes: each link of LL(R_i) defines an attribute of type list.
Its values are lists of tuples

Relational instance

• To retrieve the description of a relational instance, we follow all the links in which it participates
• There exists a main relation, M, where the class of each relational instance is given
• A relational instance can span all the relations in a relational database
• Relational learning algorithm: should handle attributes of type set and list
• The resulting representation is a tree-like structure
– root of the tree: one tuple from the main relation M
– nodes contain tuples from different relations
– connections between nodes are determined by foreign key associations
– tree is ordered at list attributes and unordered at set attributes
– tree is typed

Some notation

• The attributes of a tuple of a relation R_i consist of three parts:
– the standard attributes, I_{A,R_i}
– the attributes of type set, I_{sl(R_i)}
– the attributes of type list, I_{ll(R_i)}
• Extra notation:
– v_{aI_{A,R_i}} = (v_{a1}, v_{a2}, ..., v_{a|I_{A,R_i}|}): vector of standard attributes of tuple R_{ia}
– v_{aI_{sl(R_i)}} = (v_{s1}, v_{s2}, ..., v_{s|I_{sl(R_i)}|}): vector of attributes of type set of tuple R_{ia},
i.e. each component is a set of tuples from some other relation R_j
– v_{aI_{ll(R_i)}} = (v_{l1}, v_{l2}, ..., v_{l|I_{ll(R_i)}|}): vector of attributes of type list of tuple R_{ia},
i.e. each component is a list of tuples from some other relation R_j
• So, we need:
– kernel, k(Ria , Rib ), between tuples Ria , Rib
– kernel between sets of tuples of Ri
– kernel between lists of tuples of Ri

Relational kernel

• The direct sum kernel:

k_\Sigma(R_{ia}, R_{ib}) = k_s(v_{aI_{A,R_i}}, v_{bI_{A,R_i}})
                         + \sum_{l \in I_{sl(R_i)}} K_{set}(v_{al}, v_{bl})
                         + \sum_{l \in I_{ll(R_i)}} K_{list}(v_{al}, v_{bl})

– k_s(., .) is an elementary kernel defined on the standard attributes
– K_{set}(., .) is a kernel on sets
– K_{list}(., .) is a kernel on lists
• We normalize it by the number of summands: k_\Sigma(R_{ia}, R_{ib}) / (1 + |I_{sl(R_i)}| + |I_{ll(R_i)}|)

Kernels on lists

• Some notation:
– i: a sequence i_1 ≤ i_2 ≤ ... ≤ i_n of indices
– l(i): the length of a sequence i
• Contiguous Sublist Kernel
– K_{list}(v_{al}, v_{bl}) = \sum_{i, j : l(i) = l(j)} \lambda^{l(i)} \prod_{s=1}^{l(i)} K_\Sigma(v_{al}[i_s], v_{bl}[j_s])
– subsequences i, j are contiguous
– 0 < λ < 1 is a parameter penalizing longer subsequences
– computable in O(l(v_{al}) l(v_{bl})), i.e. in time proportional to the product of the two list lengths
• Longest Common Sublist Kernel
– K_{list}(v_{al}, v_{bl}) = m + \sum_{s=1}^{m} K_\Sigma(v_{al}[i_s], v_{bl}[i_s]) + n
– m = min(l(v_{al}), l(v_{bl}))
– n = 1 if the lists are of the same length; 0 otherwise
• The former takes a more global view.

Kernels on sets

• We used the Cross Product Kernel, with u_{ak} = {u_x} ⊆ R_i and u_{bk} = {u_y} ⊆ R_i:

K_{set}(u_{ak}, u_{bk}) = \sum_{u_x \in u_{ak}, u_y \in u_{bk}} K_\Sigma(u_x, u_y)

• Kernel K_\Sigma(., .) is computed recursively by alternating computations of k_s(., .),
K_{set}(., .) and K_{list}(., .)

Experiments (I)

[Figure: a fingerprint as a group of motifs taken from a multiple sequence alignment.
Example motif instances: cydeggis, cyedggis, cyeeggit, cyhgdggs, cyrgdgnt;
corresponding pattern: c-y-x2-[dg]-g-x-[st]]

• Fingerprint: group of conserved motifs drawn from multiple sequence alignment
• Used as diagnostic signatures to characterize collections of protein sequences
• Characterized by component motifs and protein sequences
• Diagnostic for gene family, superfamily or domain
• Important because proteins within a fingerprint should have similar function

Experiments (II)

• A fingerprint is represented in one of three ways:
– as a set of motifs
– as a list of motifs
– as both a set and a list of motifs

Experiments (III)

• The data were divided into a training set (1487 instances) and a blind test set (355 instances)
• We applied an SVM (C = 1)
• Results (accuracy):

Table 1: Accuracy and rank results on the protein fingerprints dataset.

                                sets                   lists                  sets and lists
Elementary kernel               10-fold CV  test set   10-fold CV  test set   10-fold CV  test set
Contiguous Sublist Kernel
  k_P (p=2, a=1)                85.08       84.22      85.88       86.08      87.63       87.35
  k_G (γ=0.1)                   83.66       81.97      83.59       83.86      86.01       83.99
Longest Common Sublist Kernel
  k_P (p=2, a=1)                                       85.61       86.01      86.75       86.35
  k_G (γ=0.1)                                          83.86       84.13      85.81       73.3
Default accuracy                54.4

• Best results reported in the literature: 85.91% for 10-fold CV and 85.92% for the independent test set

Conclusions and future work

• Conclusions:
– New representation language where instances are described using an extension of
relational algebra
– Easy to model sets and lists
– New kernels on lists
– Promising results on the protein fingerprint classification data
∗ Better than the state-of-the-art
• Future work:
– Other kernels on lists (and sets)
– Why does combining the data representations improve the results?

Software

• Relational extension to WEKA:

http://cui.unige.ch/~woznica/rel_weka/

Thank you!

```
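The Cross Product Kernel and the Contiguous Sublist Kernel described in the slides can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names, the default `lam=0.5` and the `delta` base kernel are assumptions introduced here. The list kernel uses a standard dynamic-programming recurrence, which is one way to reach a cost proportional to the product of the two list lengths.

```python
from typing import Callable, Iterable, Sequence, TypeVar

T = TypeVar("T")
Kernel = Callable[[T, T], float]


def cross_product_kernel(xs: Iterable[T], ys: Iterable[T], k: Kernel) -> float:
    """K_set: sum of the base kernel over every pair (x, y), x in xs, y in ys."""
    ys = list(ys)  # allow ys to be iterated once per element of xs
    return sum(k(x, y) for x in xs for y in ys)


def contiguous_sublist_kernel(a: Sequence[T], b: Sequence[T],
                              k: Kernel, lam: float = 0.5) -> float:
    """K_list: sum over all pairs of equal-length contiguous sublists of
    lam**length times the product of base-kernel values at aligned positions.

    d[p][q] holds the total weight of all sublist pairs ending at a[p], b[q]:
        d[p][q] = lam * k(a[p], b[q]) * (1 + d[p-1][q-1])
    Only the previous row is kept, so memory is O(len(b)).
    """
    total = 0.0
    prev = [0.0] * (len(b) + 1)  # prev[q + 1] is d[p-1][q]; prev[0] is padding
    for x in a:
        cur = [0.0] * (len(b) + 1)
        for q, y in enumerate(b):
            cur[q + 1] = lam * k(x, y) * (1.0 + prev[q])
            total += cur[q + 1]
        prev = cur
    return total


def delta(x: object, y: object) -> float:
    """Hypothetical base kernel for the demo: 1 if the items are equal, else 0."""
    return 1.0 if x == y else 0.0


# Two single-item matches (0.5 each) plus one length-2 match (0.25).
print(contiguous_sublist_kernel("ab", "ab", delta))        # 1.25
# Only the pair ("b", "b") matches.
print(cross_product_kernel({"a", "b"}, {"b", "c"}, delta))  # 1.0
```

In the relational setting the base kernel `k` would be the tuple kernel K_Σ itself, so these functions would be called recursively rather than with `delta`.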
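The recursive structure of the relational kernel (direct sum over standard, set and list attributes, followed by normalization) may be easier to follow as code. The sketch below rests on several assumptions that the slides do not state: the `Tup` class and its field names are hypothetical, k_P is taken to be the polynomial kernel (x·y + a)^p, the normalized k_Σ is the value passed down to the set and list kernels, and the Longest Common Sublist Kernel is read as aligning the two lists position by position (i_s = s).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Tup:
    """One tuple of a relation (hypothetical layout for this sketch)."""
    std: List[float]                                   # standard attributes
    sets: Dict[str, List["Tup"]] = field(default_factory=dict)   # set attributes
    lists: Dict[str, List["Tup"]] = field(default_factory=dict)  # list attributes


def k_poly(x: List[float], y: List[float], p: int = 2, a: float = 1.0) -> float:
    """Assumed elementary kernel k_P: (x . y + a) ** p."""
    return (sum(xi * yi for xi, yi in zip(x, y)) + a) ** p


def k_set(xs: List[Tup], ys: List[Tup]) -> float:
    """Cross Product Kernel: sum of k_sigma over all pairs of member tuples."""
    return sum(k_sigma(x, y) for x in xs for y in ys)


def k_list(xs: List[Tup], ys: List[Tup]) -> float:
    """Longest Common Sublist Kernel as read from the slide:
    m + sum of k_sigma over the first m aligned positions + n."""
    m = min(len(xs), len(ys))
    n = 1.0 if len(xs) == len(ys) else 0.0
    return m + sum(k_sigma(xs[s], ys[s]) for s in range(m)) + n


def k_sigma(ta: Tup, tb: Tup) -> float:
    """Normalized direct sum kernel between two tuples of the same relation.

    Both tuples are assumed to carry the same set and list attribute names.
    """
    total = k_poly(ta.std, tb.std)
    total += sum(k_set(ta.sets[l], tb.sets[l]) for l in ta.sets)
    total += sum(k_list(ta.lists[l], tb.lists[l]) for l in ta.lists)
    return total / (1 + len(ta.sets) + len(ta.lists))


# Toy data: two fingerprints, each with one motif seen both as a set and as a list.
m1, m2 = Tup(std=[1.0]), Tup(std=[2.0])
fa = Tup(std=[0.0], sets={"motifs": [m1]}, lists={"motifs": [m1]})
fb = Tup(std=[0.0], sets={"motifs": [m2]}, lists={"motifs": [m2]})

# k_s = 1, K_set = 9, K_list = 1 + 9 + 1 = 11, normalized by 3.
print(k_sigma(fa, fb))  # 7.0
```

Leaf tuples have no set or list attributes, so the recursion bottoms out at the elementary kernel. A Gram matrix built from `k_sigma` over all pairs of main-relation tuples is what an SVM with a precomputed kernel would consume.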