Chemical Structure Representation and Search Systems

1   Chemical Structure Representation and Search Systems

       Lecture 3. Nov 4, 2003
       John Barnard
       Barnard Chemical Information Ltd
       Chemical Informatics Software & Consultancy Services
       Sheffield, UK
2   Lecture 3: Topics to be Covered
       More Graph Theory
       Structure Analysis and Processing
        •   canonicalisation and symmetry perception
        •   ring perception
        •   functional group identification
        •   structure fingerprints and fragments
        •   structure depiction
        •   principles of structure searching
3   Graph Terminology
       degree of a node
        number of edges meeting at it
       leaf node
        a node of degree 1
       path
        connected sequence of edges between two nodes
       [Figure: 13-node example graph with each node labelled by its degree (1, 2 or 3)]
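The degree and leaf-node definitions can be sketched in a few lines of Python. The graph used here is assumed to be the heavy-atom graph of the example molecule whose SMILES appears later in the lecture (OC(=O)C(N)CC1C=CC(O)=CC=1); the 0-12 atom indexing in SMILES order and the helper name `build_adjacency` are choices made for this illustration only.

```python
# Heavy-atom graph of OC(=O)C(N)CC1C=CC(O)=CC=1, atoms indexed 0-12
# in SMILES order (an assumed numbering, used only for illustration).
EDGES = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5), (5, 6), (6, 7),
         (7, 8), (8, 9), (9, 10), (9, 11), (11, 12), (12, 6)]

def build_adjacency(edges):
    """Return {node: set of neighbouring nodes} for an undirected graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

adj = build_adjacency(EDGES)

# degree of a node = number of edges meeting at it
degree = {node: len(neighbours) for node, neighbours in adj.items()}

# leaf node = node of degree 1
leaves = sorted(node for node, d in degree.items() if d == 1)
```

With this indexing the leaf nodes are the three oxygens and the nitrogen, and the degrees fall into the three values 1, 2 and 3 shown in the figure.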
4   Graph Terminology
       cycle
        path which returns
          to its starting node
       tree
        graph with no cycles
       subgraph
        graph containing a
          subset of the nodes and
          edges of another graph
5   Graph Terminology
       spanning tree
        a tree subgraph that
           contains all the nodes
           (but not necessarily
           all the edges) of a
           graph
6   Graph Terminology
       connected graph
        graph in which there
          is a path between
          every pair of nodes
       fully-connected graph
        graph in which there
          is an edge between
          every pair of nodes
          (all nodes have
          degree n-1)
7   Graph Terminology
       disconnected graph
        graph in which some
          pairs of nodes have
          no path between
          them
       component
        subgraph in which all
          pairs of nodes are
          linked by a path, but
          no node has a path
          to a node in another
          component
8   Graph Terminology

       forest
        graph containing two
          or more components
          that are trees
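The component, tree and forest definitions translate directly into a short traversal. The following is a minimal sketch, assuming a simple undirected graph with no duplicate edges; the function names are hypothetical.

```python
from collections import deque

def components(nodes, edges):
    """Split a graph into components (lists of mutually reachable nodes)."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        comp, queue = [], deque([start])
        while queue:
            node = queue.popleft()
            comp.append(node)
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append(nbr)
        result.append(sorted(comp))
    return result

def is_acyclic(nodes, edges):
    """A graph has no cycles exactly when edges = nodes - components."""
    return len(edges) == len(nodes) - len(components(nodes, edges))

def is_forest(nodes, edges):
    """Forest in the sense used here: two or more components, all trees."""
    return is_acyclic(nodes, edges) and len(components(nodes, edges)) >= 2

nodes = list(range(6))
edges = [(0, 1), (1, 2), (3, 4)]      # node 5 is isolated
parts = components(nodes, edges)
```

A connected graph is then simply one for which `components` returns a single list.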
9   Canonicalisation
       a given chemical structure (or graph) can have
        many valid and unambiguous representations
        •   different order of rows in connection table
        •   different order of atoms in SMILES
       for comparison purposes it would be useful to
        have a single unique or “canonical” representation
       process of converting input representation to
        canonical form is called “canonicalisation” or
        “canonisation”
        •   process of applying “rules” (i.e. an algorithm)
10   Canonicalisation

        an obvious approach:
         •   generate all possible valid SMILES
         •   choose the one that comes first alphabetically
        this would be effective but very slow, and
         there is a danger of missing one
         •   principle was used for canonicalising
             Wiswesser Line Notation
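The "generate everything and keep the first" principle can be illustrated on a connection table instead of SMILES strings (a deliberate substitution, since generating all valid SMILES needs a full SMILES writer). This sketch tries every renumbering of the atoms and keeps the representation that sorts first; the function name and the tuple layout are assumptions, and the factorial cost is exactly why the approach is impractical beyond very small graphs.

```python
from itertools import permutations

def brute_force_canonical(atoms, bonds):
    """atoms: list of element symbols; bonds: list of (i, j) index pairs.
    Returns the renumbered (atoms, bonds) pair that sorts first."""
    n = len(atoms)
    best = None
    for perm in permutations(range(n)):          # perm[old index] = new index
        new_atoms = [None] * n
        for old, new in enumerate(perm):
            new_atoms[new] = atoms[old]
        new_bonds = sorted(tuple(sorted((perm[i], perm[j]))) for i, j in bonds)
        candidate = (tuple(new_atoms), tuple(new_bonds))
        if best is None or candidate < best:
            best = candidate
    return best

# Two input orderings of ethanol, and dimethyl ether for comparison
ethanol_a = brute_force_canonical(["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_b = brute_force_canonical(["O", "C", "C"], [(0, 1), (1, 2)])
ether = brute_force_canonical(["C", "O", "C"], [(0, 1), (1, 2)])
```

Both orderings of ethanol give the same result, while dimethyl ether gives a different one.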
11   Canonicalisation

        most methods in use today involve
         renumbering the atoms in some unique and
         reproducible way
         •   can be used to number rows in connection table
         •   can determine order of atoms in SMILES
        normally involve a node labelling technique
         called “relaxation”
         •   example is Morgan’s algorithm (1965)
12   Morgan’s algorithm
     1.   Label each node with its degree
     2.   Count number of different values
     [Figure: example graph with each node labelled by its degree; 3 different values { 1, 2, 3 }]
13   Morgan’s algorithm
     3.   Recalculate labels by summing label values at neighbour nodes
     4.   Count number of different values
     [Figure: example graph still labelled by degree; 3 different values { 1, 2, 3 }]
14   Morgan’s algorithm
     3.   Recalculate labels by summing label values at neighbour nodes
     4.   Count number of different values
     5.   Repeat from step 3
     [Figure: example graph with recalculated labels; 3 different values { 3, 5, 6 }]
15   Morgan’s algorithm
     3.   Recalculate labels by summing label values at neighbour nodes
     4.   Count number of different values
     5.   Repeat from step 3
     [Figure: example graph with recalculated labels; 8 different values { 5, 6, 10, 11, 12, 13, 14, 16 }]
16   Morgan’s algorithm
     3.   Recalculate labels by summing label values at neighbour nodes
     4.   Count number of different values
     5.   Repeat from step 3
     [Figure: example graph with recalculated labels; 9 different values { 12, 13, 14, 18, 24, 25, 26, 30, 34 }]
17   Morgan’s algorithm
     3.   Recalculate labels by summing label values at neighbour nodes
     4.   Count number of different values
     5.   Repeat from step 3
     [Figure: example graph with recalculated labels; 9 different values { 18, 24, 25, 42, 48, 51, 61, 68, 82 }]
18   Morgan’s algorithm
     3.   Recalculate labels by summing label values at neighbour nodes
     4.   Count number of different values
     5.   Repeat from step 3 until there is no increase in the number of different values
     [Figure: example graph with recalculated labels; 10 different values { 42, 61, 68, 102, 109, 116, 127, 133, 138, 150 }]
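The relaxation steps can be sketched as follows. This is a minimal illustration, not a complete canonicalisation routine: the graph is assumed to be the heavy-atom graph of OC(=O)C(N)CC1C=CC(O)=CC=1 with atoms indexed 0-12 in SMILES order, and the function names are hypothetical.

```python
# Heavy-atom graph of OC(=O)C(N)CC1C=CC(O)=CC=1, atoms indexed 0-12
# in SMILES order (an assumed numbering, used only for illustration).
EDGES = [(0, 1), (1, 2), (1, 3), (3, 4), (3, 5), (5, 6), (6, 7),
         (7, 8), (8, 9), (9, 10), (9, 11), (11, 12), (12, 6)]

def build_adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def relax(adj, labels):
    """One relaxation pass: new label = sum of the neighbours' labels."""
    return {node: sum(labels[nbr] for nbr in adj[node]) for node in adj}

def morgan_passes(adj, passes):
    """Labels after the degree step and after each of `passes` recalculations."""
    labels = {node: len(adj[node]) for node in adj}   # step 1: degree
    history = [labels]
    for _ in range(passes):
        labels = relax(adj, labels)                   # step 3
        history.append(labels)
    return history

adj = build_adjacency(EDGES)
history = morgan_passes(adj, 4)
distinct = [sorted(set(labels.values())) for labels in history]  # steps 2 and 4
counts = [len(values) for values in distinct]
```

For this graph the first three sets of distinct values are { 1, 2, 3 }, { 3, 5, 6 } and { 5, 6, 10, 11, 12, 13, 14, 16 }, as in the worked example. The count stays at 3 over the first recalculation, so a literal "stop as soon as there is no increase" test would halt immediately; the sketch therefore leaves the number of passes to the caller. On the pass after that, the strict neighbour sum for the carbon bonded to the nitrogen is 12 + 6 + 12 = 30, so hand-worked labels for later passes are worth checking against a computation of this kind.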
19   Morgan’s algorithm
        most nodes now have different labels
        choose node with highest label as node 1
        number its neighbours in order of label values
        [Figure: example graph with final labels; 10 different values { 42, 61, 68, 102, 109, 116, 127, 133, 138, 150 }]
20   Morgan’s algorithm
        most nodes now have different labels
        choose node with highest label as node 1
        number its neighbours in order of label values
        [Figure: example graph with final labels; the node labelled 150 is numbered 1 and its neighbours are numbered 2 and 3]
21   Morgan’s algorithm
        move to node 2
        number its remaining neighbours in order of label values
         •   because label values are tied, choose one with higher bond
             order (green) first
        move to node 3
        [Figure: example graph with final labels; the two neighbours of node 2 with tied label 133 are numbered 4 and 5]
22   Morgan’s algorithm
        continue till all nodes are numbered
        we now have a numbering for the rows of the connection table
        “breadth-first” trace
         •   nodes are dealt with in a “queue” (first in, first out)
        [Figure: example graph with final labels and the breadth-first numbering 1 to 13]
23   Morgan’s algorithm
        continue till all nodes are numbered
        we now have a numbering for the rows of the connection table
        “breadth-first” trace
         •   nodes are dealt with in a “queue” (first in, first out)
        [Figure: example graph showing only the breadth-first numbering 1 to 13]
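The breadth-first trace can be sketched with a first-in, first-out queue. This is an illustrative version only: it assumes a connected graph with integer node identifiers, takes the final labels as given, and breaks ties by the lower original index, which is an arbitrary stand-in for the atom-type and bond-order rules described above.

```python
from collections import deque

def breadth_first_numbering(adj, labels):
    """Number nodes from 1, starting at the highest label and visiting
    each node's unnumbered neighbours in decreasing label order."""
    start = max(adj, key=lambda n: (labels[n], -n))
    number = {start: 1}
    queue = deque([start])                    # first in, first out
    while queue:
        node = queue.popleft()
        for nbr in sorted(adj[node], key=lambda n: (-labels[n], n)):
            if nbr not in number:
                number[nbr] = len(number) + 1
                queue.append(nbr)
    return number

# Small hypothetical graph: node 1 is a branch point, 3-4 is a side chain
adj = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1, 4}, 4: {3}}
labels = {0: 1, 1: 9, 2: 2, 3: 5, 4: 3}
numbering = breadth_first_numbering(adj, labels)
```

Here node 1 is numbered first, its three neighbours take numbers 2 to 4 in label order, and node 4 is reached last, so all neighbours of a node are numbered before the trace moves further out.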
24   Morgan’s algorithm
        “depth-first” trace is also possible
         •   nodes are dealt with in a “stack” (last in, first out)
        more suitable for assigning atom numbers in SMILES where we want
         consecutive numbers to form a path
        OC(=O)C(N)CC1C=CC(O)=CC=1
        [Figure: example graph with final labels and the depth-first numbering 1 to 13]
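The depth-first trace differs from the breadth-first one only in using a stack. The sketch below makes the same assumptions as before (connected graph, integer node identifiers, labels supplied by the caller, ties broken by lower original index as an arbitrary stand-in for atom and bond types).

```python
def depth_first_numbering(adj, labels):
    """Number nodes from 1, always continuing from the most recently
    reached node so that consecutive numbers tend to form a path."""
    start = max(adj, key=lambda n: (labels[n], -n))
    number = {}
    stack = [start]                           # last in, first out
    while stack:
        node = stack.pop()
        if node in number:
            continue
        number[node] = len(number) + 1
        # Push the lowest-priority neighbour first so that the
        # highest-labelled unnumbered neighbour is popped next.
        for nbr in sorted(adj[node], key=lambda n: (labels[n], -n)):
            if nbr not in number:
                stack.append(nbr)
    return number

# Small hypothetical graph: node 1 is a branch point, 3-4 is a side chain
adj = {0: {1}, 1: {0, 2, 3}, 2: {1}, 3: {1, 4}, 4: {3}}
labels = {0: 1, 1: 9, 2: 2, 3: 5, 4: 3}
numbering = depth_first_numbering(adj, labels)
```

On this graph the side chain 1-3-4 receives the consecutive numbers 1, 2, 3 before the trace backs up to the remaining neighbours of node 1.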
25   Symmetry perception
        if ties between label values cannot be
         resolved on basis of atom/bond types, the atoms
         are symmetrically equivalent, and
         it doesn’t matter which is chosen next
        Morgan’s algorithm is thus also useful for
         identifying symmetry in molecules
26   Morgan’s algorithm
        Provides canonical numbering for the nodes in a graph that
         doesn’t depend on any original numbering
        Works by taking more of the graph into account at each
         iteration
          •   essence of “relaxation” technique is iteratively updating a value by
              looking at its immediate neighbours
        It is not infallible
          •   some graphs are known where the algorithm cannot distinguish
              nodes that are not symmetrically equivalent
        There are many variations on it
          •   and several theoretical papers analysing it mathematically
          •   O. Ivanciuc, “Canonical numbering and constitutional symmetry”,
              in J. Gasteiger (Ed.) Handbook of Chemoinformatics, Vol 1, pp.
              139-160. Wiley, 2003
27   Canonicalisation
        Algorithms are applied to graphs not chemical
         structures
        Issues such as aromaticity, tautomerism and
         stereochemistry need to be addressed before
         canonical numbering of the graph
         •   Daylight’s canonicalisation algorithm for SMILES
             perceives aromatic rings (using its own definition of
             aromaticity) as first step
28   Ring perception
        How many rings are there in these structures and
         which ones are they?




        rings are important features of chemical structures
         •   nomenclature generation
         •   aromaticity perception
         •   synthetic significance
         •   fragment descriptor generation
29   Rings and ring systems

        A ring system is a subgraph in which every
         edge is part of a cycle
30   Ring perception
        Euler Relationship
         nodes + rings = edges + components
         where rings is the number of edges that must be removed
           from the graph to turn it into a tree
         •   rings is also called the Frèrejacque number or nullity
         •   examples (one equation per pictured structure):
              6 + 1 = 6 + 1
              10 + 2 = 11 + 1
              7 + 2 = 8 + 1
              23 + 5 = 25 + 3
         •   this is the minimum possible number of rings;
             it may be useful to identify others
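Rearranged, the relationship gives rings = edges - nodes + components, which is easy to compute. The sketch below counts components with a small union-find; the function name and the 0-based node indexing are assumptions.

```python
def ring_count(n_nodes, edges):
    """Minimum number of rings: edges - nodes + components."""
    parent = list(range(n_nodes))

    def find(x):
        # Follow parent links to the representative of x's component
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    n_components = n_nodes
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:          # this edge joins two components
            parent[root_a] = root_b
            n_components -= 1
    return len(edges) - n_nodes + n_components

# A single six-membered ring, and two fused six-membered rings
six_ring = [(i, (i + 1) % 6) for i in range(6)]
fused = [(i, (i + 1) % 10) for i in range(10)] + [(4, 9)]
```

The two test graphs correspond to the first two equations above: 6 nodes and 6 edges give 1 ring, and 10 nodes and 11 edges give 2 rings.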
31   Which rings to perceive?
        Usually the smallest set of smallest rings (SSSR)
         •   two 6-membered rather than
             one 6- and one 10-membered
         •   two 5-membered rather than
             one 5- and one 6-membered
        But there may be more than one SSSR
         •   C-S-C-C-C-C
         •   C-C-C-C-O-C
         •   C-S-C-C-O-C
              three different 6-membered rings
        [Figure: ring structures containing S and O atoms]
32   Which rings to perceive?
        Sometimes a large envelope
         ring may be aromatic, when
         smaller rings are not
        Ring perception is a complex area where there are
         no right answers
         •   there is a lot of literature on the subject
33   Ring perception by spanning tree
        start at an arbitrary node
        “grow a spanning tree”
         •   add neighbours of current node to a queue
              o   provided they are not already in it
         •   move to the next node in the queue
         •   repeat until queue is empty
        those edges from original graph not in the spanning tree
         are ring closures
        [Figure: example graph with nodes numbered 1 to 13]
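The spanning-tree procedure can be sketched as a breadth-first traversal that records which edges it used. This is a minimal version assuming a connected graph; the function names are hypothetical.

```python
from collections import deque

def build_adjacency(edges):
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def ring_closures(adj, start):
    """Grow a spanning tree from `start`; return the edges left out of it."""
    in_tree = {start}
    tree_edges = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in in_tree:            # not already queued or visited
                in_tree.add(nbr)
                tree_edges.add(frozenset((node, nbr)))
                queue.append(nbr)
    all_edges = {frozenset((a, b)) for a in adj for b in adj[a]}
    return all_edges - tree_edges

# Two fused six-membered rings (10 nodes, 11 edges)
fused = [(i, (i + 1) % 10) for i in range(10)] + [(4, 9)]
closures = ring_closures(build_adjacency(fused), start=0)
```

The number of ring closures equals the ring count from the Euler relationship (2 here), although which particular edges are reported depends on the starting node and the traversal order.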
34   Substructure Fragments
        Subgraphs can be identified in a structure graph
         corresponding to functional groups, rings etc.
         •   –OH
         •   –NH2
         •   –COOH
         •   phenyl
        this can be done by tracing appropriate paths in the graph
        subgraphs may overlap
        [Figure: structure diagram of the example molecule (atom labels OH, CH2, H2N, CH, O, OH)]
35   Substructure Fragments
        More systematic subgraphs can also be identified
         (easier to do algorithmically)
         •   paths of connected atoms
         •   every atom and its
             immediate neighbours
         •   rings
        Subgraphs can overlap
         •   (it’s difficult to show pictures with atoms in
             several colours at once!)
        [Figure: structure diagram of the example molecule (atom labels OH, CH2, H2N, CH, O, OH)]
36   Substructure fragments
       •   fragments provide “index terms” for a chemical
           structure
            o   analogous to keywords in a text document
       •   they can be used in searching for structures
            o   retrieved structures must contain the same fragments as the
                query
       •   “ambiguous” representations
            o   many different structures can have the same fragments,
                connected together in different ways
       •   fragments to be used may be a closed list
            o   controlled “vocabulary” (dictionary) of structural features
       •   or an open-ended list (like free text searching)
            o   e.g. all unbranched paths of up to 6 atoms
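An open-ended fragment set of this kind can be generated by walking every simple path up to a length limit. The sketch below is an illustration only: it ignores bond types, writes each path as element symbols joined by "-", and keeps whichever of the two reading directions sorts first so that each path is recorded once. The function name and string format are assumptions.

```python
def path_fragments(atoms, adj, max_atoms=6):
    """Return the set of unbranched path fragments of up to max_atoms atoms.
    atoms: list of element symbols; adj: {index: set of neighbour indices}."""
    found = set()

    def extend(path):
        symbols = [atoms[i] for i in path]
        forward = "-".join(symbols)
        backward = "-".join(reversed(symbols))
        found.add(min(forward, backward))     # direction-independent form
        if len(path) == max_atoms:
            return
        for nbr in adj[path[-1]]:
            if nbr not in path:               # no atom visited twice
                extend(path + [nbr])

    for start in adj:
        extend([start])
    return found

# Hypothetical four-atom chain N-C-C-O
atoms = ["N", "C", "C", "O"]
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
fragments = path_fragments(atoms, adj)
```

For the four-atom chain this yields nine fragments, from the single atoms up to the full path N-C-C-O.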
37   Fragment codes
      •   many early chemical information systems were
          based on identifying fragments of this sort
           o   originally the fragments were identified manually
           o   and represented on punched cards
      •   special fragment codes (dictionaries of
          fragments) were devised for different systems
           o   some of these are still in use, though with automated
               encoding of structures
           o    particularly important are the systems for
               “Markush” structures in patents (e.g. Derwent WPI
               code)
38   Fingerprints
        the fragments present in a structure can be
         represented as a sequence of 0s and 1s
             00010100010101000101010011110100
         •   0 means fragment is not present in structure
         •   1 means fragment is present in structure (perhaps
             multiple times)
        each 0 or 1 can be represented as a single bit in the
         computer (a “bitstring”)
        for chemical structures often called structure
         “fingerprints”
39   Fingerprints
        fingerprints are typically 150-2500 bits long
        where a fixed dictionary of fragments is used there
         can be a 1:1 relationship between fragment and bit
         position in fingerprint
         •   sometimes several related fragments will “set” the same
             bit
        disadvantage is that if structure contains no
         fragments from the dictionary, no bits are set
         •   can be avoided if “generalised” fragments are used
             (involving e.g. “any atom”, “any ring bond” types)
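A dictionary fingerprint with a 1:1 fragment-to-bit mapping can be sketched with a Python integer as the bitstring. The six-entry dictionary below is entirely hypothetical (real dictionaries have hundreds of entries), as are the function names; the screening test simply checks that every bit set for the query is also set for the candidate.

```python
# Hypothetical fragment dictionary: position in the list = bit position
DICTIONARY = ["O", "N", "C-O", "C-N", "C-C-O", "C-C-N"]

def dictionary_fingerprint(fragments):
    """Set one bit for each dictionary fragment present in the structure."""
    bits = 0
    for position, fragment in enumerate(DICTIONARY):
        if fragment in fragments:
            bits |= 1 << position
    return bits

def passes_screen(candidate_bits, query_bits):
    """True if the candidate has every bit that the query has."""
    return (candidate_bits & query_bits) == query_bits

molecule = dictionary_fingerprint({"O", "N", "C-O", "C-N", "C-C-O", "C-C-N"})
query = dictionary_fingerprint({"N", "C-N"})
unknown = dictionary_fingerprint({"S", "C-S"})   # nothing from the dictionary
```

The last line shows the disadvantage noted above: a structure containing no dictionary fragments gets an all-zero fingerprint.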
40   Fingerprints
        if fragment set is open-ended, the fragment
         description (e.g. C-C-N-C-C-O) can be “hashed”
         to a number in fixed range (e.g. 1 to 1024) and this
         is the bit number to be set
        disadvantages:
         •   different and unrelated fragments may “collide” at the
             same bit position
         •   difficult to work back from bit position to fragment
         •   this usually causes only slight degradation in search
             performance (false hits), but can be more of a problem
             in other applications of fingerprints
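Hashing a fragment description to a bit position can be sketched as follows. The choice of SHA-256 from Python's standard `hashlib`, the 1024-bit width and the 0-based bit numbering are assumptions made for this illustration; they are not the scheme used by any particular vendor.

```python
import hashlib

N_BITS = 1024   # assumed fingerprint width

def bit_position(fragment, n_bits=N_BITS):
    """Map a fragment description such as 'C-C-N-C-C-O' to a bit number."""
    digest = hashlib.sha256(fragment.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % n_bits

def hashed_fingerprint(fragments, n_bits=N_BITS):
    """Set the hashed bit for every fragment; collisions share a bit."""
    bits = 0
    for fragment in fragments:
        bits |= 1 << bit_position(fragment, n_bits)
    return bits

small = hashed_fingerprint({"C-C-N"})
large = hashed_fingerprint({"C-C-N", "C-C-O", "C-N"})
```

Because the mapping is many-to-one, a set bit cannot be traced back to a single fragment, and two unrelated fragments may set the same bit; the built-in `hash()` is avoided here because its value for strings changes between Python processes.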
41   Fingerprints

        Hashed fingerprints
         •   typically used in software from Daylight
             Chemical Information Systems Inc.
        Dictionary fingerprints
         •   Chemical Abstracts Service
         •   MDL Information Systems Inc
              o   ISIS or MACCS keys (166 and 960 bits)
         •   Barnard Chemical Information Ltd
              o   customised dictionaries
42   2D structure depiction
        if structures are stored without 2D display
         coordinates, we need to generate them
         •   SMILES
        “depiction” algorithms are used for this
        identify and lay out ring systems first
         •   complications over orientation of some systems
         •   Chemical Abstracts stores “standard depictions” of all
             ring systems it has encountered
        then add side chains, avoiding collisions
         •   many features can be added to improve appearance
43   3D structure depiction
        much more complicated than 2D
        need to store standard bond lengths and angles
        need to distinguish atoms in different hybridisation states
         (sp2 vs sp3 carbon)
        need to rotate single bonds to avoid “bumps”
        sophisticated “conformation generation” programs identify
         low-energy conformers
         •  very useful for identifying molecules with the correct shape to fit
            into biological receptor sites
     J. Sadowski, “3D structure generation”, in J. Gasteiger (Ed.) Handbook of
         Chemoinformatics, Vol 1, pp. 231-261. Wiley, 2003
44   Nomenclature generation
        most systematic nomenclature is based on ring
         systems
         •   need to identify/prioritise ring systems first
         •   identify standard numbering for system
              o   frequently need to store this
         •   add side chains and substituents with appropriate
             locants
     J. L. Wisniewski, “Chemical nomenclature and structure representation:
         algorithmic generation and conversion”, in J. Gasteiger (Ed.)
         Handbook of Chemoinformatics, Vol 1, pp. 139-160. Wiley, 2003
45   Conclusions from Lecture 3
        there are several important jargon terms used in graph
         theory, which crop up in chemical informatics
        canonicalisation provides a unique numbering for the
         atoms in a molecule
         •   Morgan algorithm can be used to achieve it
        it’s not always obvious how many rings there are, or which
         ones they are
        fingerprints represent the presence or absence of
         substructure fragments in a molecule
         •   they are ambiguous representations of structure
46   Topic for Lecture 4: Structure searching

        two main varieties of search
         •   full structure search
              o   query is a complete molecule
              o   is this molecule in the database?
                   • or tautomers, stereoisomers etc. of it
         •   substructure search
              o   query is a pattern of atoms and bonds
              o   does this pattern occur as a substructure (subgraph)
                  of any of the molecules in my database?

				