Java Snippets by ashrafp


									8c3993e1-929c-4be2-835f-68253b0f03a2.doc                                                                Page 1
of 8

                                            Java Snippets
Most of the example on this page are related to the CDK package, an opensource framework for
cheminformatics. The examples are mainly for my utility (so that I can see what I did 6 months ago!) but they
do show various aspects of CDK functionality as well as Java programming and so might be useful to other
people as well. Much of the code is based of code from the JUnit tests so thanks goes to the CDK developers for
a well designed and documented framework.

NOTE: Nightly builds are now available

Building & Testing | Fragmenting a molecule | SMILES to SDF | Molecular Descriptor GUI | Surface Areas in
the CDK | Geometric transformations | Writing Molecules to Disk | Calculating molecular descriptors | CDK as
a web service | Highlighting substructure search hits | Maximum Common Substructures (MCSS) | Shortest path
length matrix | Tabular display of 2D structures | Batch generation of 2D diagrams

Building & Testing

      To recompile all the tests do
        ant test-dist-all
      To test a specific class

       ant -Dtestclass=qsar.DescriptorEngineTest junit-test

      To test a module

       ant -Dmodule=MODULENAME test-module

      To aovid running slow tests as -DrunSlowTests=false

SMILES to SDF conversion
This program will convert a SMILES string specified on the command line to an SD file
containing 2D coordinates. In the fture conversion to 3D coodinates will be implemented. The default
conversion does not add hydrogens, though addition of hydrogens can be specified on the command line.

Usage is simply

java -cp $CLASSPATH:./ smi2sdf "CCC=C=CO"
By default, the resultant SD file is named output.sdf though this can be changed on the command line. Note that
you will need the Apache Commons CLI package in the classpath. (This is included in the CDK distribution)
Surface Areas in the CDK
The CDK has a class that can numerically evaluate the surface area of a molecule. It is based on the Python
implementation of the DCLM method by Peter McCluskey. Since it is numerical it obviously does not
identically match the results obtained by an analytical method. In general numerical surface areas appear to be
larger than analytically derived surface areas (especially for larger molecules). There are two main parameters
that affect the surface area calculation:

      The probe radius
      The level of tesselation
8c3993e1-929c-4be2-835f-68253b0f03a2.doc                                                                    Page 2
of 8
Usually 4 levels of tesselation seem to give reasonable results and increasing the level of tesselation does not
appear to improve the calculated surface area, but does lead to a high memory consumption. Below are four
plots which compare the surface area of 277 molecules (J. Chem. Inf. Comput. Sci., 1999, 39, 974-983)
calculated by the CDK method and by the SAVOL routine within ADAPT (which is analytical).

(Click here for the plots on a seperate page)
Geometric transformations
It is sometimes useful to apply geometric transformations to a molecular structure such as translation and
rotation. The programs below perform specific geometric transformations to a structure resulting in a new
structure which can replace the original one or be written to a new file. The programs will require the CDK
libraries as well as the JAMA and org.apache.commons.cli packages. These come with the CDK distribution.

      rotation: rotates a structure about the X, Y or Z axes by a specified angle. The center of
       rotation is the origin. Currently, the program will only rotate a single structure.
      translate: translates a structure by specified amounts along the X, Y and Z axes. In
       addition if specified, the center of gravity of the structure can be translated to the origin.

Writing Molecules to Disk
Writing a Molecule object to disk can be done by instantiating one of the output format classes. These include
MDL MOL, Hyperchem HIN, CML, PDB and so on. To find out the available formats look under the

To write a Molecule object to a String in the MDL MOL format you can do


StringWriter w = new StringWriter();
try {
   MDLWriter mw = new MDLWriter(w);
} catch (Exception e) {
8c3993e1-929c-4be2-835f-68253b0f03a2.doc                                                                       Page 3
of 8
Alternatively to write it to a file you can do

FileWriter w = new FileWriter(new File("molecule.mol"));
try {
   MDLWriter mw = new MDLWriter(w);
} catch (Exception e) {
Calculating molecular descriptors
The CDK has been extended with the addition of the cdk.qsar module that allows for the calculation of
molecular descriptors. Currently 33 descriptors are present covering topolgical, geometric and electronic
descriptor classes. The snippet below shows how you can obtain the value of a specific descriptor for a single
IAtomContainer ac;
IDescriptor descriptor = new WienerNumbersDescriptor();
DoubleArrayResult retval = (DoubleArrayResult)descriptor.calculate(ac);
double wpath = retval.get(0); // Wiener path number
double wpol = retval.get(1); // Wiener polarity number
It should be noted that the return values for descriptors can be single integers or doubles or arrays of integers or
doubles or boolean values. The return values are objects that implement the DescriptorResult interface. A
general method of obtaining the return values (without coercion) is
DescriptorResult retval = descriptor.calculate(ac);
if (retval instanceof DoubleArrayResult) {
  // process
} else if (retval instanceof DoubleResult) {
  // process
In some cases, descriptor routines will take parameters. An example is the BCUT descriptor. In this case the
user can specify how many eigenvalues should be returned. This can be done by the following code
IAtomContainer ac;
IDescriptor descriptor = new BCUTDescriptor();
Object[] params = { new Integer(3), new Integer(3) };
DoubleArrayResult retval = (DoubleArrayResult)descriptor.calculate(ac);
The above code returns the 2 highest and 2 lowest eigenvalues of the weighted Burden matrix.

In addition to the calculation of individual descriptors it is also possible to evaluate all descriptors or subsets of
descriptors implemented in the CDK. This is performed by the DescriptorEngine. To calculate all available
descriptors we can use the code below

Molecule molecule;
// initialize the Molecule object
DescriptorEngine engine = new DescriptorEngine();
In case we want to calculate specific classes of descriptors (say topological and geometric) we can do
String[] types = {'topological','geometric'};
DescriptorEngine engine = new DescriptorEngine(types);
8c3993e1-929c-4be2-835f-68253b0f03a2.doc                                                                   Page 4
of 8
In both cases, the result of each descriptor is stored in the Molecule object as a DescriptorValue object keyed on
the DescriptorSpecification object for that descriptor (which contains, among other things, a link to a metadata
dictionary entry for that descriptor).
Shortest Path Length Matrix
The pairwise distance between atoms in a molecule can be calculated in terms of Euclidean distance or path
lengths (i.e., number of bonds between two atoms along the shortest path in the molecular graph). The latter is
useful when calculating certain topological descriptors (Weiner index etc.). The shortest path between any two
atoms in terms of number of bonds can be computed using the Floyd-Warshall algorithm. This is implemented
in org.openscience.cdk.graph.PathTools.computeFloydAPSP(). A snippet showing its use is
int[][] admat = AdjacencyMatrix.getMatrix(atomContainer);
int[][] m = PathTools.computeFloydAPSP(admat);
m is the pairwise distance matrix. A program that computes this matrix and displays it can be obtained here
Tabular display of 2D structures
This example creates a JTable and populates it with 2D structures for the specified molecules. The code
example shows how to create a table, set column types and manipulation of an AtomContainer to get rid of
hydrogens and the use of the Renderer2D and associated classes to draw 2D structures. The code also uses the
StructureDiagramGenerator class to generate 2D coordinates for the molecules (required since formats such as
HIN only specify 3D coordinates). The code can be found here. To run the example you'll need the jgrapht
library in your CLASSPATH. You can see a screenshot.

The code allows the user to specify 4 properties

         ncol - the number of columns to display
         cellx - the width of each cell in the table
         celly - the height of each cell in the table
         withH - a boolean to indicate whether hydrogens should be shown or not

Thus some examples of how to run the code are:
java -Dncol=5 st2d dan*.hin
java -Dncol=4 -Dcellx=300 -Dcelly=400 -DwithH=true s2td molecule*.cml
Right now the molecules are drawn within a fixed cell size. That is, if the table is resized, the molecules do not
get redrawn for the new cell size. Currently I'm not sure on how to handle this. Another nice thing would be to
be able to set arbitrary tooltips for each cell.
Displaying highlighted substructures
The first step in displaying substructure hits is to detect them. Functions for substructure searches can be found
in the org.openscience.cdk.isomorphism hierarchy. Specifically, the
UniversalIsomorphismTester.getSubgraphMaps() function determines all the mappings of its second argument
(query fragment) on the first (target molecule). The return value is a List of RMap objects which are mappings
that show which bonds in the query fragment correspond to which bonds in the target molecule.

To highlight a substructure in a structure using the Renderer2D class we need to create an AtomContainer that
contains the bonds of the structure that will be highlighted. In the case of substructure highlighting this
corresponds to the bonds of the target molecule that correspond to the bonds of the query fragment. Each RMap
object contains two members id1 and id2. The former is the number of the bond in the target molecule that
matches the corresponding bond in the query fragment. Thus, by making a list of these bond serial numbers
(from the target molecule) and then creating an AtomContainer that contains these bonds we can highlight the
substructure in the target molecules' structure. A code snippet that shows hwo to determine the substructure and
create an AtomContainer containing the bonds to be highlighted is given below
8c3993e1-929c-4be2-835f-68253b0f03a2.doc                                                                Page 5
of 8
              List l = UniversalIsomorphismTester.getSubgraphMaps(mol, q);
              System.out.println("Number of matched subgraphs = " + l.size());

              AtomContainer needle = new AtomContainer();
              Vector idlist = new Vector();

              // get the ID's (corresponding to the serial number of the Bond object in
              // the AtomContainer for the supplied molecule) of the matching bonds
              // (there will be repeats)
              for (Object aL : l) {
                  List maplist = (List) aL;
                  for (Object i : maplist) {
                      idlist.add(((RMap) i).getId1());

              // get a unique list of bond ID's and add them to an AtomContainer
              HashSet hs = new HashSet(idlist);
              for (Integer h : hs) needle.addBond(mol.getBond(h));
Note that the substructure search function can return multiple substructures. The above code combines all the
matching bonds (making sure that there are no repeats) and highlights all the bonds in the target molecule
matching the query fragment is a simple program that detects a substructure within a molecule and then displays
the 2D diagram of the molecule with the substructure highlighted. It uses the StructureDiagramGenerator class
to generate 2D coordinates. The code requires the jgrapht library to be in your CLASSPATH

To use it, compile and then do

java HighLightSubstructure "c1ccccc1(N(CC)C)"
The code does not do any error checking and is basically proof of concept. Furthermore the code has 4
fragments hard coded into it - a primary, secondary and tertiary amine group and a benzene ring. The tertiary
amine fragment is uncommented so supplying a SMILES string as shown above with a tertiary amine group
should display the 2D diagram with the fragment highlighted in green. You can see a screenshot.
Determining and displaying maxmimum common substructures
CDK provides a simple routine to determine the maximum common substructure of two molecules, namely
UniversalIsomorphismTester.getOverlaps(). The function returns a List of IAtomContainer's. Each element of
the list is a fragment representing a common substructure between the two supplied molecules mapped on the
first molecule. Thus the MCSS is the fragment with the largest atom count. A code snippet that shows how to
determine all common substructures and then extract the MCSS is given below
List mcsslist = UniversalIsomorphismTester.getOverlaps( atomcontainer1, atomcontainer2 );
int maxmcss = -9999999;
IAtomContainer maxac = null;
for (i = 0; i < mcsslist.size(); i++){
   IAtomContainer a = (IAtomContainer)mcsslist.get(i);
   if (a.getAtomCount() > maxmcss) {
      maxmcss = a.getAtomCount();
      maxac = a;
Once a MCSS is found, we would like to display the molecules with the MCSS portion highlighted. To map the
the MCSS to each molecule we must determine which bonds of the MCSS map to which bonds in the original
molecules. This is effectively a substructure search in which the MCSS is the query fragment .This is performed
by using the UniversalIsomorphismTester.getSubgraphMaps() function (see highlighting substructures for a
8c3993e1-929c-4be2-835f-68253b0f03a2.doc                                                                   Page 6
of 8
more detailed description of this process). This function can return multiple mappings (for instance if the MCSS
is a benzene ring then anthracene would have three mappings for this MCSS). The code below only takes the
first mapping
// a is a molecule from the set of molecules for which a MCSS was obtained
// and q is the MCSS which is obtained as shown above
public static IAtomContainer getneedle(IAtomContainer a, IAtomContainer q) {
   IAtomContainer needle = DefaultChemObjectBuilder.getInstance().newAtomContainer();
   Vector idlist = new Vector();

    List l = UniversalIsomorphismTester.getSubgraphMaps(a, q);
    List maplist = (List)l.get(0);
    for (Iterator i = maplist.iterator(); i.hasNext(); ) {
        RMap rmap = (RMap);
        idlist.add( new Integer( rmap.getId1() ) );

    HashSet hs = new HashSet(idlist);
    for (Iterator i = hs.iterator(); i.hasNext();) {
        needle.addBond( a.getBondAt( ((Integer) ) );
    return needle;
The return value of this function is an IAtomContainer containing the bonds from a that correspond to the bonds
in q (the MCSS). The return value is then used to specify which bonds in the molecule a are to be highlighted
with the Renderer2D class. A small program that combines all these features is which
determines and displays the MCSS for two molecules highlighted on each molecule. You can see screenshots
here and here. To use it compile as usual and then do
java simplemcss molecule1.hin molecule2.hin
(replace the molecule files with whatever you have).
Fragmenting a molecule
Recently a post on the cdk-user mailing list asked how a molecule could be borken into fragments. This process
is quite easy as it simply involves selecting the bonds which will serve as split points and spliting the molecule
into two parts at these bonds.

The Fragmenter class performs this operation and given a molecule it will generate an ArrayList of
IAtomContainter's, each one representing an individual fragment. Fragments are generated by splitting the
molecule at

       non-terminal bonds
       non-ring bonds

The side effect of this approach is that a molecule like 2,3-dimethyl benzene will not be split into any
fragments. On the other hand if the molecule is split on terminal bonds, we would only include the large
fragment (and not the single atom at the end of the terminal bond). A side effect of this approach is that a group
like NO2 would get split up into a N=O and O which may not be appropriate.

In addition to generating the fragments the code will bring up a tabular display of the original molecule (red
border) and images of the fragments generated (screenshot).

To run it, assuming you have the CDK jar files in your CLASSPATH

javac -cp $CLASSPATH:./
java -cp $CLASSPATH:./ Fragmenter SMILES
8c3993e1-929c-4be2-835f-68253b0f03a2.doc                                                                   Page 7
of 8
If no SMILES string is specified it will proceed to fragment c1c(CC2CC2)cc(CNCC)cc1, otherwise it will
fragment the specified molecule.

Currently the fragment list will include single bonds (i.e., terminal fragments). These are probably not very
useful and could easily be screened out.

Update: The code has been modified to recursively fragment the initial set of fragments. I think this leads to all
possible fragments. The downside is that many of the fragments are topologically identical, but since they
involve different atoms of the parent molecule, they get included in the final fragment list. To get a set of
topologically unique fragments, we'd have to perform a subgraph matching - which is left to the reader!

Batch generation of 2D diagrams
This code is a simple example of the use of the 2D structure diagram generator to generate 2D structures and
output them as PNG, JPEG or PDF images for a set of molecules. In addition it is possible to output a table of
structures (with a user-specifiable number of columns) which can contain the value of the SDF property fields if
present. Currently the program will only handle SDF files (either single or multi-molecule) and usage is simply
java -cp $CLASSPATH:./ draw2d molecules.sdf
By default it will generate a series of images named img001.png, img002.png and so on in the current directory.
The output directory can be specified. In addition the width and height of the final images can be specified. A
scale factor can also be specified that indicates how large the image of the structure will be in comparison to the
specified image area.

Requirements: The above usage implies that you have in your CLASSPATH

        iText library
        Relevant CDK jar files
        Libraries that the CDK depends upon

If you have checked out the CDK SVN repository then simply add all the jars that lie under jar/ and dist/jar/ of
the CDK repository to your CLASSPATH.

If you are not a developer you can use draw2d.jar which bundles all the required libraries (both the CDK and
the libraries it depends upon). Thus you can simply do

java -jar draw2d.jar
which will show the help message.

Example of tabular PDF's

        Single column of molecules PDF
        3 molecule columns PDF
        2 molecule columns with SDF property fields (in this case calculated descriptor values) PDF

usage: draw2d [OPTIONS] sdf_file

Draws 2D images from coordinate files. Default output is PNG and input
should be an SDF file.
 -c,--scale        Scale factor (default is 0.9)
 -f,--format       Output format (PNG, JPEG, PDF)
 -h,--help         Give this help page
8c3993e1-929c-4be2-835f-68253b0f03a2.doc                                     Page 8
of 8
 -l,--color         Color atoms
 -n,--ncol          How many molecule columns should tabular output have.
                    The number must be 1 or more
 -o,--outputDir     Where to dump images (default is current directory)
 -p,--props         If present output properties in the PDF table. Default
                    is no
 -s,--showH         Show explicit hydrogens (default is no)
 -t,--table         Output structures as a tabular PDF. This option
                    overrides the format option and the output
                    file is called output.pdf
 -v,--verbose       Verbose output
 -x,--x             Width of the images (default is 300)
 -y,--y             Height of the images (default is 300)

To top