Similarity Searching

					     Module 3

Similarity Searching
    Searching Chemical Databases

   Traditional approaches
    – Structure and substructure searching
    – Initially 2D; subsequently in 3D

   More recent interest in similarity-based searching
    methods
    – Complement to existing approaches
    – Useful lead discovery technique
    – Applications in molecular diversity analysis
      Limitations of Substructure Searching
   A "hit" must contain the entire query substructure
    in precisely the form specified by the user

   The user must already have a well defined view of
    the types of structures of interest - this is not
    usually the case early in a project

   The database is partitioned into two sets, with no
    possibility for ranking

   No control over the size of the output
            Similarity Searching
   Given a target structure find molecules in a
    database that are most similar to it (“give me ten
    more like this”)
    – Compare the target structure (T) with each database
      structure (D) and measure its similarity
    – Sort the database in order of decreasing similarity
    – Display the top-ranked structures (“nearest
      neighbours”) to the searcher
    – Use of interesting structures (however defined) for
      further searches
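The ranking procedure described above can be sketched as follows. This is a minimal illustration, not a production search: the names `rank_database` and `shared_fraction`, the feature-set descriptors, and the toy molecules are all assumptions made for the example.

```python
def rank_database(target, database, similarity, top_n=10):
    """Return the top_n (name, score) pairs, most similar first."""
    # Compare the target with each database structure
    scored = [(name, similarity(target, descriptor))
              for name, descriptor in database.items()]
    # Sort the database in order of decreasing similarity
    scored.sort(key=lambda item: item[1], reverse=True)
    # Keep only the top-ranked structures (the nearest neighbours)
    return scored[:top_n]


def shared_fraction(a, b):
    """Toy similarity (hypothetical): shared features / all features present."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical database: each molecule is a set of feature labels
database = {
    "mol_A": {"ring", "amine", "carbonyl"},
    "mol_B": {"ring", "amine"},
    "mol_C": {"halide"},
}
target = {"ring", "amine", "carbonyl", "thiol"}
hits = rank_database(target, database, shared_fraction, top_n=2)
print(hits)  # [('mol_A', 0.75), ('mol_B', 0.5)]
```

Because the output is a ranked list, `top_n` gives direct control over how many structures are shown, and any hit can be passed back in as the next target.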
        Advantages of Similarity Searching
   No need to define a precise substructural query,
    since a single molecule is sufficient to initiate a
    search

   Facilitates browsing-like searches, since initial hits
    can be used in subsequent searches

   Control over the volume of output that is
    produced, since a ranking is produced rather than
    a partition

   Similarity searching is widely used to complement
    substructure searching
    Rationale For Using Similarity Information
   The similar property principle states that
    structurally similar molecules tend to have similar
    properties
    (neighbourhood principle)
   Given an active target molecule, a similarity
    search can identify further molecules in the
    database for testing
   Similarity is inherently subjective, so a quantitative
    basis is needed for ranking structures
    – Global measures: an overall measure of the
      resemblance of two molecules
    – Local measures: additionally provide a mapping of
      features from the target structure to the database
      structure
    How to do similarity searching
   Query involves specification of an entire structure of a
    molecule
    – in the form of one or more structural descriptors
    – this is compared with the corresponding set of descriptors for each
      molecule in the database
   A measure of similarity is then calculated between the
    target structure and every database structure.
   Similarity measures quantify the relatedness of two
    molecules with
    – a large number (or one) if their molecular descriptions are closely
      related
    – a small number (large negative or zero) if their molecular
      descriptions are unrelated
     Various similarity measures
   Many measures are available to quantify the degree
    of similarity between a pair of molecules
   The computational requirements of these
    measures vary with the level of detail
    used to represent the molecules being compared
   Measures designed for highly complex
    representations, such as maximal common
    substructures, require a lot of processing
    – This limits the number of database structures that
      can be compared in a given amount of time
      Requirements for Similarity Searching
   Molecular descriptors
    – Numerical values assigned to structures
   Typical descriptors represent
    – 1D: Physicochemical properties, e.g., MW, logP
    – 2D properties: fragment screens, topological indices;
      maximal common substructures
    – 3D properties: molecular fields
   Similarity coefficient
    – A quantitative measure of similarity between two sets
      of molecular descriptors
   Use of a weighting function
    – Ensure equal contributions from all parts of the measure
            Structure Similarity
   Representations are now text-like
   Standard similarity operations can be carried
    out
   Similarity coefficients are used for binary
    fingerprints - Tanimoto, etc.
    – number of bits in common/number of bits in either
    – equivalently, the AND count divided by the OR count

   Most have non-binary versions
      Captopril: ACE Inhibitor
   [Figure: captopril and two other structures, 88% and 82% similar to it]
              1D Descriptors

   Single valued integers or real numbers

   Physicochemical properties
    – molecular weight, number of rotatable bonds,
      solubility, ...
           Fragment Bit-Strings
   [Figure: example fragments]

   Originally developed for substructure search
   The number of fragments common to a pair of
    molecules gives a measure of 2D similarity
    – Pfizer, Lederle, Sheffield and Upjohn
   Widely used in both in-house and commercial 2D
    chemical information systems
          Similarity Coefficients
   Tanimoto coefficient for binary bit strings

        \[ SIM_{TD} = \frac{C}{T + D - C} \]

    – C: number of bits set in common between the Target and
      Database structures
    – T: number of bits set in the Target structure
    – D: number of bits set in the Database structure

   Other similarity coefficients exist:
    – Cosine, Euclidean distance, ...
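As a worked illustration of the binary Tanimoto coefficient, here is a minimal sketch using the slide's counts C, T and D. The function name `tanimoto_binary`, the representation of bit strings as text of '0' and '1' characters, and the convention of returning 0.0 when neither string has any bit set are all assumptions for the example.

```python
def tanimoto_binary(target_bits, database_bits):
    """Tanimoto coefficient C / (T + D - C) for two equal-length bit strings."""
    if len(target_bits) != len(database_bits):
        raise ValueError("bit strings must have the same length")
    t = target_bits.count("1")        # T: bits set in the target
    d = database_bits.count("1")      # D: bits set in the database structure
    c = sum(1 for x, y in zip(target_bits, database_bits)
            if x == "1" and y == "1")  # C: bits set in both
    denominator = t + d - c
    # Assumed convention: two empty fingerprints score 0.0
    return c / denominator if denominator else 0.0


# T = 4, D = 4, C = 3, so the coefficient is 3 / (4 + 4 - 3) = 0.6
sim = tanimoto_binary("1101001", "1001011")
print(sim)
```

Identical fingerprints score 1.0, and fingerprints with no bits in common score 0.0, so the value can be used directly as a ranking key.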
            Topological Indices
   A topological index is a single-valued integer or
    real number that characterises the topology of a
    molecule
    – for example, the Wiener index: \( W = \frac{1}{2} \sum_{I,J} D_{I,J} \)
    – \( D_{I,J} \) is the distance (in bonds) between atoms I and J

   Many different types of index have been
    developed
   Used singly or in combination
    – Similarity searching, diversity analysis and as variables
      in QSAR studies
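The Wiener index can be computed directly from its definition: half the sum, over all ordered pairs of atoms, of the bond-count distance between them. The sketch below assumes a connected, hydrogen-suppressed molecular graph given as an adjacency dictionary; the name `wiener_index` and the integer atom labels are illustrative choices.

```python
from collections import deque


def wiener_index(adjacency):
    """Half the sum of shortest-path distances (in bonds) over all ordered atom pairs."""
    total = 0
    for start in adjacency:
        # Breadth-first search gives the bond-count distance from `start`
        distance = {start: 0}
        queue = deque([start])
        while queue:
            atom = queue.popleft()
            for neighbour in adjacency[atom]:
                if neighbour not in distance:
                    distance[neighbour] = distance[atom] + 1
                    queue.append(neighbour)
        total += sum(distance.values())
    # Each pair was counted from both ends, hence the factor of 1/2
    return total // 2


# Carbon skeletons only (hydrogens suppressed)
n_butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
isobutane = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(wiener_index(n_butane), wiener_index(isobutane))  # 10 9
```

The two isomers have the same atoms but different indices, which is the sense in which the index characterises topology: the more branched skeleton is more compact and gives the smaller value.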
         Similarity Coefficients
   Tanimoto coefficient

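        \[ SIM_{IJ} = \frac{\sum_{k=1}^{K} x_{Ik} x_{Jk}}{\sum_{k=1}^{K} x_{Ik}^{2} + \sum_{k=1}^{K} x_{Jk}^{2} - \sum_{k=1}^{K} x_{Ik} x_{Jk}} \]

    – Molecules I, J are represented by vectors of
      length K
    – Each element x_{Ik} of a vector represents one descriptor,
      e.g. a topological index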
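A minimal sketch of the non-binary Tanimoto coefficient for two descriptor vectors of equal length. The function name `tanimoto_continuous`, the plain Python lists, and the choice to return 0.0 for two all-zero vectors are assumptions made for the example.

```python
def tanimoto_continuous(x_i, x_j):
    """Tanimoto coefficient for two real-valued descriptor vectors."""
    if len(x_i) != len(x_j):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x_i, x_j))   # sum of x_I * x_J
    norm_i = sum(a * a for a in x_i)              # sum of x_I squared
    norm_j = sum(b * b for b in x_j)              # sum of x_J squared
    denominator = norm_i + norm_j - dot
    # Assumed convention: two all-zero vectors score 0.0
    return dot / denominator if denominator else 0.0


# dot = 2, both squared norms = 5, so the value is 2 / (5 + 5 - 2) = 0.25
print(tanimoto_continuous([1, 0, 2], [2, 1, 0]))
```

When every element is 0 or 1 the three sums reduce to the counts of common bits and set bits, so this form gives the same value as the binary coefficient. In practice descriptors on very different scales would usually be standardised first, which is the role of the weighting function.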
Maximal Common Substructures
   [Figure: two example structures]

   Largest set of atoms and/or bonds common to two
    structures - a local similarity measure
   The larger the MCS the greater the similarity
   MCS algorithms are available for 2D and 3D
    similarity searching
    – MCS detection is NP-complete and very efficient
      algorithms are still lacking
 Searching - similarity


maximal common subgraph isomorphism
      best match search
      web search using several terms



   [Figure: example chemical structures]
                  3D Descriptors

   Derived from the 3D data
   Tolerances included in descriptor
    – inter-atomic distances
        » N-(5.7 ± 0.2 Å)-N
    – bond angles
    – 3/4 point pharmacophores
   Assigned to bitstring as for 2D
         3D Similarity Searching
   Systems for 3D substructure searching are widely
    available

   Computationally more expensive than 2D methods

   Similarity measures for 3D searching can either be
    global or local in character
    – Most operational systems are global
    – Much research interest in local measures
          Inter-Atomic Distances

   Use of whole molecule or just pharmacophore
    points as in 3D substructure searching systems
   3D information can be encoded in a bit string
    – Atom pairs and their distances
    – Atom triplets (Lederle) - an integer code is calculated
      from the three distances comprising an atom triplet
    – Angular information can also be used



   Similarity is quantified using an association
    coefficient based on pairs of bit-strings
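One way to picture the encoding of atom pairs and their distances is sketched below. It is an illustration under stated assumptions, not the scheme used by any particular system: each distinct (element, element, distance bin) key stands for one bit position, distances are placed in fixed-width bins rather than given explicit tolerances, and the name `atom_pair_keys`, the 1 Å bin width and the coordinates are all hypothetical.

```python
from itertools import combinations
from math import dist


def atom_pair_keys(atoms, bin_width=1.0):
    """Encode a 3D structure as a set of (element, element, distance-bin) keys.

    atoms: list of (element_symbol, (x, y, z)) with coordinates in Angstroms.
    Each key plays the role of one set bit in a bit string.
    """
    keys = set()
    for (elem_a, pos_a), (elem_b, pos_b) in combinations(atoms, 2):
        first, second = sorted((elem_a, elem_b))      # order-independent pair
        distance_bin = int(dist(pos_a, pos_b) // bin_width)
        keys.add((first, second, distance_bin))
    return keys


# Hypothetical coordinates: two nitrogens 5.7 Angstroms apart plus one oxygen
molecule = [("N", (0.0, 0.0, 0.0)), ("N", (5.7, 0.0, 0.0)), ("O", (0.0, 2.4, 0.0))]
keys = atom_pair_keys(molecule)
print(sorted(keys))  # [('N', 'N', 5), ('N', 'O', 2), ('N', 'O', 6)]
```

Two molecules encoded this way can then be compared with an association coefficient on their key sets. Hard bin edges mean that two nearly equal distances can fall into different bins, which is one reason real descriptors include tolerances.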
    More Sophisticated Distance-Based Measures
   Inter-atomic distance measures consider
    each distance on its own
    – Atom mapping (and related measures) take at
      least some account of relationships between
      pairs of distances
   3D MCS measures
    – Widely used for pharmacophore mapping and
      molecular alignments
    – Algorithmic enhancements to permit rapid
      database searching
      Field-Based Similarity Searching: I

   Molecules are described by 3D fields
    – Electrostatic, steric, hydrophobic
   Similarity is quantified by
    – Aligning the molecules
    – Applying a similarity coefficient using the values of
      the fields within a 3D grid
      Field-Based Similarity Searching: II

   Finding the best alignment is computationally
    demanding

   The grid calculation is also computationally
    demanding but can be replaced by an extremely
    rapid Gaussian-approximation procedure

   FBSS (Sheffield) uses a genetic algorithm for the
    rapid generation of molecular alignments
            Evaluation Methods
   Use of datasets for which both structural and
    property/activity data are available
   A 'leave-one-out' approach is applied to the
    structure-based similarities to obtain predicted
    property values for each member of the dataset
    – Compare resulting predicted property values with the
      observed values
   For a target of known activity, search against a
    database that contains other compounds having the
    same activity
    – Determine where the active compounds appear in the
      ranked list
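Both evaluation approaches can be sketched in a few lines. The version below is an assumption-laden illustration: the leave-one-out prediction simply copies the property of the single nearest neighbour (other schemes, such as averaging several neighbours, are equally possible), and the names `overlap`, `leave_one_out`, `active_ranks` and all data values are hypothetical.

```python
def overlap(a, b):
    """Toy similarity between two fragment sets (shared / total)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leave_one_out(fingerprints, properties, similarity=overlap):
    """Predict each molecule's property from its nearest neighbour.

    Needs at least two molecules in the dataset.
    """
    predictions = {}
    for name, fp in fingerprints.items():
        # Hold this molecule out and score it against all the others
        scored = [(similarity(fp, other_fp), other)
                  for other, other_fp in fingerprints.items() if other != name]
        _, nearest = max(scored)
        predictions[name] = properties[nearest]
    return predictions


def active_ranks(ranked_names, actives):
    """1-based positions of the known actives in a ranked list."""
    return [pos for pos, name in enumerate(ranked_names, start=1)
            if name in actives]


fingerprints = {"a": {1, 2, 3}, "b": {1, 2, 3, 4}, "c": {7, 8}, "d": {7, 8, 9}}
properties = {"a": 1.0, "b": 1.2, "c": 3.0, "d": 3.4}
predicted = leave_one_out(fingerprints, properties)
ranks = active_ranks(["m3", "m7", "m1", "m9", "m4"], {"m7", "m4"})
print(predicted)  # {'a': 1.2, 'b': 1.0, 'c': 3.4, 'd': 3.0}
print(ranks)      # [2, 5]
```

The predicted values would then be compared with the observed ones, for example by correlation, while the rank positions show how early in the list the actives are retrieved.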
        Comparison of Similarity Methods
   Evidence suggests that 2D fingerprints are most
    effective (e.g., the Brown and Martin studies at Abbott)
    – Good at identifying "me-too" compounds
    – Less good at identifying structurally diverse
      compounds
    – Use of 3D measures for more heterogeneous
      bioisosteres
   Different molecular descriptors can give rise to
    different relative similarities
    – Different orderings of molecules in ranked lists
    – Use of data fusion to combine different types of
      information
                    Data Fusion
   Originally developed for signal processing but an
    entirely general approach:
    – Improved performance can be obtained by combining
      evidence from several different sources
   When used for similarity searching
    – Do a similarity search for a target structure and then
      rank the database in order of decreasing similarity
    – Repeat with different representations, coefficients, etc.
    – Add the rank positions for a given structure to give an
      overall fused rank position
    – These fused rankings form the output from the search
    – Small, but consistent, improvements in performance
      over use of a single ranking
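The rank-sum fusion steps above can be sketched as follows. The function name `fuse_rankings`, the alphabetical tie-break and the example rankings are assumptions for illustration; each input ranking is taken to list the same structures, most similar first.

```python
def fuse_rankings(rankings):
    """Fuse several ranked lists by summing each structure's rank positions."""
    rank_sums = {}
    for ranking in rankings:
        for position, name in enumerate(ranking, start=1):
            rank_sums[name] = rank_sums.get(name, 0) + position
    # Smallest rank sum first; ties broken alphabetically (an assumption)
    return sorted(rank_sums, key=lambda name: (rank_sums[name], name))


# Hypothetical rankings of the same four structures from three searches,
# e.g. different representations or coefficients
by_fragments = ["A", "B", "C", "D"]
by_topology = ["B", "A", "D", "C"]
by_distances = ["B", "C", "A", "D"]
fused = fuse_rankings([by_fragments, by_topology, by_distances])
print(fused)  # ['B', 'A', 'C', 'D']
```

Here B has the smallest rank sum (2 + 1 + 1 = 4) and so heads the fused list even though one of the three searches ranked it second.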
         Similarity and Diversity
   Descriptors
    – 1D: physicochemical/topological
    – 2D: structure fragments
    – 3D: distance-based features


   Similarity uses
    – Diverse subset selection
    – Near neighbour searching
       Near neighbour searching
   Similar property principle
    – structurally similar molecules are expected
      to exhibit similar properties and activities

   Methodology
    – Identify active compound in the diverse
      subset
    – Explore the nearest neighbours to maximise
      efficiency
          Diverse subset selection
   Database sizes now in millions
   Require a representative subset for
    screening/testing
   Several methods used - all based on standard
    text-based methodologies using dissimilarity
    – Clustering
    – Systematic selection
    – Partition-based selection
    – Recursive partitioning
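As an illustration of one of these methods, partition-based selection can be sketched by dividing descriptor space into cells and taking one compound from each occupied cell. This is a simplified reading of the idea, not a reference implementation: the name `partition_select`, the two descriptors (molecular weight and logP), the bin widths and the rule of keeping the alphabetically first compound per cell are all assumptions.

```python
def partition_select(descriptors, bin_widths):
    """Pick one representative compound from each occupied cell of descriptor space.

    descriptors: dict mapping compound name -> tuple of property values.
    bin_widths:  tuple giving the cell width along each property axis.
    """
    chosen = {}
    for name in sorted(descriptors):
        # The cell is identified by the bin index along each axis
        cell = tuple(int(value // width)
                     for value, width in zip(descriptors[name], bin_widths))
        # Keep the first compound seen in each cell (assumed selection rule)
        chosen.setdefault(cell, name)
    return sorted(chosen.values())


# Hypothetical (molecular weight, logP) values
compounds = {
    "m1": (180.2, 1.2),
    "m2": (195.0, 1.8),
    "m3": (310.5, 3.1),
    "m4": (305.0, 3.9),
    "m5": (450.1, 0.5),
}
subset = partition_select(compounds, bin_widths=(100.0, 2.0))
print(subset)  # ['m1', 'm3', 'm5']
```

Compounds that fall in the same cell are treated as near neighbours, so keeping one per cell gives a subset that covers the occupied regions of the space without comparing every pair of molecules.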
                    Conclusions
   Similarity searching using 2D fragment bit-strings
    is widely used and is surprisingly effective
    – 2D bit-strings can also be used for cluster-based or
      dissimilarity-based compound selection
   Several methods for 3D similarity searching have
    been described and are used in-house
    – Currently little consensus as to what sorts of similarity
      measure are best and how flexibility can be handled
   Similarity searching generally provides just a
    starting point; it's not an end in itself

				