Bioinformatics in the Pharmaceutical Industry


                                 NICHOLAS J. COLE

                   Celltech Therapeutics Ltd, 216 Bath Road
                          Slough, Berkshire SL1 4EN
                                  DAVID BAWDEN

              Department of Information Science, City University
                  Northampton Square, London EC1V 0HV

       A review was carried out of the 'information landscape' within the
       pharmaceuticals-based molecular biology community, which exam-
       ined the research problems requiring biological-sequence data, impor-
       tant sources of information, methods of access, information-seeking
       behaviour of end users and the role of libraries and information
       centres. This work concentrated on the practical aspects of how
       biological sequence information is managed and used in a research
       setting and was carried out as part of the MSc in Information Science
       at the City University. Fifteen questionnaires were sent to infor-
       mation scientists in the UK pharmaceutical industry and a user study
       was carried out amongst scientists at Celltech. Most of the important
       primary data are available freely or cheaply via the Internet and
       molecular biologists were found to be self-reliant in their use of these
       resources. Currency of information was found to be very important
       in the research process and the issue of Internet security was taken
       very seriously. Most questionnaire respondents saw a productive role
       in the future for information workers in the field of molecular biology,
       citing end-user training and data integration as possible roles,
       although the degree of involvement will depend on the particular mix
       of skills and experience that exist within an information department.


Background and aims
PHARMACEUTICAL RESEARCH is highly information-intensive and infor-
mation professionals have a long tradition of helping the R&D effort within
drug companies, where they have used their skills in information management,
database design, online searching etc. to good effect. A pharmaceutical research
information department would typically employ a number of information

Journal of Documentation, vol. 52, no. 1, March 1996, pp. 51-68

scientists who are subject specialists, to search biological, biomedical and
chemical online databases on behalf of end users. These databases would
normally be classified as bibliographic, full-text, numeric, directory, chemical
structure or chemical reaction-type, although, until recently, information
workers would not have found it necessary to carry out, for example, a
homology search in a biological sequence database. This situation is now begin-
ning to change due to the increasing role of biotechnology in drug research.
   The aim of this research project was to map out the 'information landscape'
within the pharmaceuticals-based molecular biology community, using the activ-
ities at Celltech as an example and by interviewing relevant information workers
from the UK pharmaceutical industry. Quite a lot has been written surveying the
different data banks in molecular biology but the practical aspects of managing
and using this information in an organisational setting are rarely discussed. The
findings should be useful for anyone involved with the setting up of molecular
biology information systems, information management policies and in develop-
ing strategies for meeting the needs of scientists.

What is bioinformatics?
The origin of the word 'bioinformatics' is hard to trace although it is normally
used to encompass the generation, handling, storage and retrieval of biological
sequence data, i.e. the sequences of nucleotides that make up the DNA of genes
and the sequences of amino acids that form the primary structure of the proteins
for which the genes code. According to Boguski [1], information science and
technology (informatics) became a serious issue for biologists in the mid 1970s
following the development of rapid DNA sequencing techniques. Since then, the
amount of sequence data (and also gene mapping and protein crystal structure
data) has grown exponentially and is now also being fuelled by data emerging
from the world-wide Human Genome Mapping Project. The growth rate of
Genbank and EMBL (European Molecular Biology Laboratory) databases has
been exponential for the last five years; the latest release of Genbank (release
80.0) contains 164 megabases of sequence and the size is currently doubling
every twenty-one months [2]. It is expected that, over the next decade,
biomolecular databanks will grow between seven and sixty-fold [3].
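
The growth figures quoted above can be checked with a short calculation. The following is a minimal sketch, assuming that the twenty-one month doubling time remains constant over the whole period (an assumption made here for illustration, not a claim from the cited sources); the function names are hypothetical.

```python
def fold_growth(months: float, doubling_months: float) -> float:
    """Fold increase after `months` for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (months / doubling_months)


def projected_size(current: float, months: float, doubling_months: float) -> float:
    """Projected size after `months`, assuming a constant doubling time."""
    return current * fold_growth(months, doubling_months)


GENBANK_MEGABASES = 164.0  # GenBank release 80.0, as quoted in the text
DOUBLING_MONTHS = 21.0     # doubling time quoted in the text

# Ten years = 120 months; the constant doubling time is an assumption.
decade_fold = fold_growth(120, DOUBLING_MONTHS)
decade_size = projected_size(GENBANK_MEGABASES, 120, DOUBLING_MONTHS)

print(f"fold increase over ten years: {decade_fold:.1f}")
print(f"projected size: {decade_size:.0f} megabases")
```

Under this assumption the ten-year increase is a little over fifty-fold, which falls inside the seven to sixty-fold range quoted for biomolecular databanks in general.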
   Boguski considers the term bioinformatics to be wide in scope, involving
computational analysis, databases and 'everything from laboratory automation
and data acquisition to electronic publishing'. Andrew Lyall of Glaxo (personal
communication) says the Glaxo interpretation of the word is made up of three
elements as follows:
    1. Computational genetics - encompassing activities such as the Human
       Genome Mapping Project and covering physical and logical genetic
    2. Computation relating to molecular genetics, including sequence deter-
       mination. This would also cover the automatic reading of data from
       automatic gene-sequencing machines.
    3. Computation relating to three-dimensional protein structure
March 1996                                                  BIOINFORMATICS

Both of these definitions illustrate that the word bioinformatics can be inter-
preted to describe virtually any information-related activity applied to the sci-
ences of genetics or molecular biology. Whilst both are equally valid, for the
purpose of the work carried out and described in this article we shall confine
our definition to include the storage, retrieval and analysis of nucleic acid and
amino acid sequence data but will exclude three-dimensional structure compu-
tation, automated laboratory procedures or anything relating to the Human
Genome Project.

Molecular biology resources
Databases It would be inappropriate to attempt to give a comprehensive review
 of the publicly available molecular sequence and structure databanks - some
very good articles have been written on the subject [4-6]. However, some of the
key databanks should be mentioned and others will be specifically referred to
in the later sections. GenBank [7] was established in 1982 at the Los Alamos
National Laboratory and contains nucleic acid sequences derived from the
published literature. The database also contains bibliographic data, CAS
(Chemical Abstracts Service) Registry numbers and other data such as the
sequence length and source organism. Whereas GenBank has a US bias, the EMBL
Nucleotide Sequence Databank [8] provides a similar service to Europe and the
structure of the database is very similar. In fact, GenBank and EMBL share all
of their data with each other and with the DNA Databank of Japan (DDBJ) - so
in effect these three databases are one and the same (albeit with some time
differences with respect to updates) and comprise the most comprehensive
collection of nucleotide sequences. For amino acid sequences, SwissProt [9]
 (established in 1986) is a key resource. This is also a collaboration, between the
Department of Medical Biochemistry at the University of Geneva and EMBL.
 SwissProt data come from the Protein Information Resource (PIR) database [10],
 from the translation (via the genetic code, from nucleic acid sequence to protein
sequence) of entries from the EMBL database and directly from the literature.
Entries consist of the 'core' (primary) sequence, literature citations, taxonomic
data and annotation data (protein function, secondary structure information,
diseases associated with the protein, etc.). The Protein Data Bank (PDB),
maintained by Brookhaven National Laboratory (Long Island, New York, USA),
contains all publicly available solved 3-D protein structures. Data include atomic
co-ordinates and other data relating to how the structure was elucidated
(e.g. crystallographic and NMR data). CAS has registered bio-sequences from the
journal literature since 1957 although, until 1990, they were stored in electronic
'connection tables' which define a molecule in terms of the connectivity between
individual atoms and therefore could only be searched by chemical sub-structure.
The protein sequence data were enhanced in 1990 [11] with the computer
generation of amino acid sequences for all of its (approximately 150,000) protein
structures; thus proteins and peptides were additionally searchable using the
common shorthand amino acid abbreviations. An 'exact' protein/peptide search
retrieves only exact matches to the sequence query, whereas a 'sub-sequence'
 search looks for a string of amino acids anywhere in a chain (analogous to sub-
 structure chemical searching). Sub-sequence or exact 'family' searching is also
 possible, whereby each amino acid residue in a query is matched to any of its
functional family members in the file structure, i.e. those that have similarities
 with respect to their acidity, hydrophobicity or aromaticity. The system also
 caters for uncommon amino acids and multi-chain systems. Since 1992, the CAS
 Registry file has also included nucleic acids from GenBank, as well as making
 the entire GenBank file available on the STN host.
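
The three kinds of protein/peptide search described above (exact, sub-sequence and 'family') can be illustrated with a small sketch. The family groupings below are hypothetical and chosen only for illustration; they are not the groupings used in the CAS Registry file, and the function names are likewise assumptions.

```python
# Hypothetical family groupings, for illustration only.
# These are NOT the CAS Registry family definitions.
FAMILIES = {
    "acidic": set("DE"),
    "basic": set("KRH"),
    "aromatic": set("FWY"),
    "hydrophobic": set("AVLIM"),
}


def family_of(residue: str) -> str:
    """Return the family name of a residue; ungrouped residues match only themselves."""
    for name, members in FAMILIES.items():
        if residue in members:
            return name
    return residue


def exact_search(query: str, sequence: str) -> bool:
    """'Exact' search: the whole sequence must equal the query."""
    return query == sequence


def subsequence_search(query: str, sequence: str) -> bool:
    """'Sub-sequence' search: the query may occur anywhere in the chain."""
    return query in sequence


def family_subsequence_search(query: str, sequence: str) -> bool:
    """Sub-sequence 'family' search: each residue matches any member of its family."""
    q = [family_of(r) for r in query]
    s = [family_of(r) for r in sequence]
    n = len(q)
    return any(s[i:i + n] == q for i in range(len(s) - n + 1))


print(exact_search("DKF", "DKF"))
print(subsequence_search("KF", "ADKFG"))
print(family_subsequence_search("ERW", "ADKFG"))
```

In the last call the query ERW does not occur literally in ADKFG, but it is reported as a match because E, R and W belong to the same illustrative families as D, K and F respectively.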

Searching databases for sequence similarity ('homology') The commonest
questions about a given sequence that require recourse to molecular sequence
databases are 'has this sequence been described in the literature before?' and
'are there any other known sequences that are similar to my own sequence (and
how similar are they?)'. Homology searches involve the use of computer
programs which use algorithms to calculate a similarity score between two
different stretches of DNA, which is at least partly based on the summation of
the number of matching nucleotide pairs within a defined local region of the
complete sequence. 'Hit' sequences can therefore be ranked, exact matches
having a maximum score. Many algorithms have been designed
(e.g. BLAST - Basic Local Alignment Search Tool) and they all differ with respect to
their computational speed and their sensitivity. There has been a growing need
in recent years for an integrated approach towards gaining access to actual
sequence data via cross-references in the literature. Entrez, developed at the
NCBI, provides this capability and contains sequence records from a variety of
database sources, including GenBank, EMBL, DDBJ, PIR, SwissProt, and the PDB.
The sequence records are linked to the relevant literature citations from the
sequence-associated subset of Medline. The retrieval software and databases are
distributed on CD-ROM or as a free Internet service (Network Entrez).
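
The idea of a similarity score based on counting matching nucleotides within a local region, and of ranking 'hit' sequences by that score, can be sketched as follows. This is a deliberately simplified, ungapped illustration and is not the BLAST algorithm; the sequences and names in the example database are invented.

```python
def best_local_matches(query: str, subject: str) -> int:
    """Highest count of identical positions when the query is slid along the
    subject without gaps. Returns 0 if the subject is shorter than the query."""
    best = 0
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        matches = sum(1 for a, b in zip(query, window) if a == b)
        best = max(best, matches)
    return best


def rank_hits(query: str, database: dict) -> list:
    """Rank database entries by score, highest first (ties broken by name)."""
    scored = [(best_local_matches(query, seq), name) for name, seq in database.items()]
    return sorted(scored, key=lambda item: (-item[0], item[1]))


# Invented example sequences, for illustration only.
database = {
    "seq_exact": "CCGATTACAGG",
    "seq_near": "CCGATTTCAGG",
    "seq_far": "TTTTTTTTTTT",
}

for score, name in rank_hits("GATTACA", database):
    print(name, score)
```

An exact match scores the full length of the query and therefore ranks first, as described in the text. Practical programs differ from this sketch in allowing for gaps, weighting substitutions and using heuristics to gain speed.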
   In addition to the 'core' databases mentioned above, there are a great number
of specialised databases, such as those dealing with a particular chromosome in
the human genome, types of cell receptor or vectors.


The aims of the external questionnaires were as follows:
     • to ascertain the types of workers (by job title) who were involved with
       bioinformatics, the degree to which they were involved and the most
       important information sources;
     • methods of current awareness used;
     • the types of problems that require the use of sequence databases and
       how they impact upon the pharmaceutical research process;
     • the requirement for specialist knowledge and the role of information

   Fifteen questionnaires were sent to information workers in the pharmaceutical
field and one in-depth interview was carried out. The selection of appropriate
candidates was partly through recommendation and partly by scanning the TFPL
directory Who's who in the UK information world 1994 [12]. One specific question
was posted to an Internet Usenet news group.
   There were eleven replies to the questionnaire. All respondents were providers
of research information within UK pharmaceutical companies - ten operated
from within a library/information department and one was a bioinformatics
consultant in an IT department. All except one (information officer/assistant)
had at least a first degree in chemistry, biochemistry or pharmacology; five (two
bioinformatics specialists, one department head, the IT consultant and one
biomedical information scientist) had a life science related PhD; three (all
biomedical information scientists) had an information related MSc. The
breakdown of job titles was as follows:
    Job title                                                   Frequency
    Information scientist (biomedical)                              4
    Bioinformatics analyst                                          2
    Head of department (scientific information)                     2
    Information officer/assistant                                   2
    IT consultant (bioinformatics)                                  1
   All of the organisations were fully integrated pharmaceutical companies
except one which was a specialised bio-pharmaceuticals (biotechnology) com-
pany. On average, the information departments made up 2% of the headcount,
in a range from 1% to 5%. The size distribution of UK operations by number
of employees was as follows:
    Employees                                                   Frequency
    >3,000                                                          3
    1,000-3,000                                                     3
    200-1,000                                                       5

Level of involvement with bioinformatics
Six respondents were actively involved with searching molecular biology
sequence and/or structure information. Of these, three were occasional users
with a chemistry or biochemistry background and three were bioinformatics
specialists. Of the five who were not directly involved with molecular biology
information, one stated that such work was carried out by a group within
another department (and it was they who completed the rest of the question-
naire), another stated that there was a need but that this was a new field for the
company (a molecular biology department was formed last year) and work was
not done due to a lack of adequate knowledge of the information sources. The
remaining three respondents cited lack of demand as their reason for not finding
this kind of information relevant.

Knowledge of sources
The bioinformatics specialists were, not surprisingly, satisfied with their level of
knowledge of the relevant information sources, although one mentioned that
specialist bioinformatics training would be used to help others in the organ-
isation if it were available. Of the occasional users, one was satisfied with his
knowledge but the other was not, saying that he or she would like a better
knowledge of sources and the content of databases.

Access to information
On the question of access to information, six out of the eight information inter-
mediaries were satisfied with their degree of access to the relevant information.
One stated that he was not satisfied due to very restricted access to the Internet
at their organisation and the other because molecular biology was a new field
and so the necessary knowledge was not yet available.

Databases used
Twenty-five databases/search programs were mentioned in all and Figure 1
shows the citation frequencies given by the questionnaire respondents, together
with the data formats used, i.e. whether online, CD-ROM, hard copy, etc. The
chart clearly shows the central importance of certain databases to molecular
biology, especially Entrez, EMBL/GenBank and PIR/SwissProt - these resources
were used by most of the respondents. Databases which are more specialist in
nature, such as Rebase and TFD occur lower down the order and were only used
by the bioinformatics specialists. Other points to notice are the predominance
of access via the Internet, the low usage of hard copy data, and the fact that
many databases were bought and maintained locally, with updates (presumably)
coming in on magnetic tape. Frequency of molecular biology database use was
one to five times a week for six out of the eight molecular biology information
users. One user (a 'bioinformatician') used them more than five times a day and
one (occasional user) only one to five times a month.

Use of CAS databases
Only one reply stated that CAS was a useful source of biosequence information,
mentioning that it was 'a good starting point'. To put this reply into perspective,
the respondent was an information scientist with a chemistry background who
had a minor involvement with sequence searching but who stated that 'work of
this nature is mainly carried out by end users who are subject specialists'. All
of the bioinformatics-oriented users expressed negative opinions such as:
    • 'too expensive';
    • 'I am not aware that there is any added value for primary data analysis';
    • 'I am unaware of the searching capabilities'.
  A Usenet news posting was made to try to obtain further comments and opin-
ions regarding the CAS Registry file. In addition, the question of CAS under-use
was put to members of the STN International Help desk. To summarise, the
following reasons were given for the under-use of CAS:
    1.   expense;
    2.   lack of knowledge of its availability;
    3.   the same facilities are available elsewhere;
    4.   the types of searches available do not meet the typical needs of a
    5.   the database is available directly from NCBI;
    6.   NCBI has designed convenient interfaces for sequence retrieval (via
         Entrez and WWW) and has many tools for sequence searching (BLAST
         via email, WWW or network client);
    7.   it is not compatible with large-scale use, i.e. searching hundreds of
         sequences automatically;
    8.   limited annotations compared with GenBank entries;
    9.   the advent of Entrez, with its facility for linking gene sequence, protein
         sequence and Medline references.
The following positive reasons were given for searching CAS:
    1. the ability to cross over (to other files) and get references;
    2. searching patent literature.

Research problems requiring the use of databases
Typical replies to this section included:
    •    'searching for new sequences that are not in a local database';
    •    'what is this (my own) sequence? Is it known in the literature? What is
         it similar to?';
    •    'what sequences have been identified for organism X ?';
    •    'is this sequence similar or related to a human sequence?';
    •    'what genes are associated with this disease?'

Current awareness
Respondents who were information specialists but who never or infrequently
searched molecular biology information cited traditional current awareness
methods, e.g. customised 'SDIs' set up on conventional hosts (databases included
Medline, Biosis and Derwent Biotechnology Abstracts); CCOD and journal
scanning. All three bioinformatics specialists cited journals and Usenet news-
groups whereas two cited scientific conferences and only one mentioned the
traditional methods such as CCOD.
   All respondents said that currency of genetic data was important or essential,
especially for primary DNA analysis. One occasional user used it to justify full
access to the Internet and another said that their company recently discovered
a sequence on the Internet that was critical to their current work but which would
have taken another two months to reach their internal database if they had to
wait for the update via CD-ROM. One of the bioinformatics specialists stressed
that whereas currency was definitely important, it was the successful integration
of genetic data with other types of data (e.g. protein structure, pharmacology
and toxicology data) which was crucial to the success of a research project.

Sequence analysis
Tasks given as requiring calculation included alignment, pattern searching and
protein homology modelling, i.e. very similar to the responses gained from the
internal interviews. Six respondents said that their organisation used automatic
sequencing machines. One predicted that data storage requirements would
increase exponentially until the year 2000 and another said that the increase
would be 'dramatic'. Five respondents used in-house databases to store
sequences of interest.

The Internet
Figure 2 summarises the amount of Internet connectivity enjoyed by the
questionnaire respondents, the tools used to gain access to the information (WWW
and/or Gopher) and the principal activities conducted on the Net (email,
database searching etc.). This shows that permanent access is almost entirely
correlated with those companies that were active in the field of molecular biology.
   Of the four respondents without Internet access within their organisation,
three comprised the group with very little demand for molecular biology
information (one of these said that access was planned in the future). The remain-
ing one said 'security issues are to be resolved before we have full access in the
UK to the Internet'. All respondents from organisations with an interest in
molecular biology as part of their drug discovery programmes had full access
to the Internet; the vast majority (seven out of eight) having a permanent ded-
icated link with appropriate 'firewall' security features. The issue of security was
taken very seriously by all respondents who were Internet users, although one
did not comment. Another user who had personal concerns about network
security implied that their organisation was less worried than it should have
been. Searching remote databases necessarily involves the loss of control over
some data when a query is uploaded and this problem is magnified due to the
dispersed and uncontrolled nature of the Internet. Two respondents said that
only internal databases were searched for highly sensitive sequence information.
The following security risks were identified with the use of the Internet:

    •   the ease with which computer viruses can be distributed via networks;
    •   the risk of unauthorised access to one's own machines;
    •   the possibility of obtaining deliberately inaccurate or misleading

The impact of molecular biology on pharmaceutical research
One respondent simply said 'every area' of pharmaceutical research is affected;
however, specific examples included:

    •   assay development;
    •   gene therapy;
    •   a basis for understanding disease mechanisms;
    •   high throughput receptor screening;
    •   anti-sense oligonucleotide approaches;
    •   rational drug design.

Requirement for specialist knowledge
Only three respondents (all 'occasional' searchers who were subject generalists
but not practitioners) thought specialist subject knowledge was not essential for
sequence searches, although one said that it helps to have a knowledge of naming
conventions and the basic relationships between nucleic acids and proteins.
   One who did believe in the necessity of specialist knowledge thought that the
ideal situation would be the existence of information-knowledgeable
practitioners of molecular biology. However, it was conceded that these are
exceptional; realistically, therefore, the problem is best solved by good
communication
between science practitioners and subject-knowledgeable information specialists.
The balance of searching responsibility between these two groups would depend
on the particular skills mix in an organisation, good communication being the
key to success (such a balance might also prevent over-reliance on one or two
'information gurus' within the scientific departments who might leave at any
time). It was considered very important that the information department should
keep their knowledge of sources up to date (communication with scientists
would also be very useful here).

The role for information scientists
Four out of the six respondents involved with biosequence searching were infor-
mation specialists working from within a library or information department,
whereas the remaining two were 'information gatekeepers', i.e. hybrid scien-
tist/information specialists working from within the laboratory (questionnaires
were passed on by the initial library-based contact). One molecular biologist
said that with the emergence of user-friendly Internet access tools, molecular
biologists who have informatics interests or skills can cover most routine needs.
An information scientist (who was not involved with sequence information)
hoped that there could be a role but was not convinced. The bioinformatics IT
consultant was not sure and said that 'they [information scientists] have been
slow to respond to the changing demand'. Another bioinformatics specialist
provided molecular biology information services to all users on behalf of the
Research Information Department. Major responsibilities included database
searching and the training of end users to carry out their own searches using
in-house or external systems. The remaining six respondents who answered the
question were positive about the role that information scientists could play and
gave the following examples:
    •   not all molecular biologists are information or IT aware - possible
        training role;
    •   integration/interpretation of genetic data;
    •   general application of skills in database searching etc.;
    •   provision of a service when molecular biology information is required
        outside of the research area, e.g. clinical and patents.

Patents - searching sequence databases for novelty or infringement
The seven respondents who did sequence data patent searches mostly cited the
well known sequence databases such as GenBank, EMBL etc. (for novelty) and
Derwent's World Patents Index (WPI), CAS and Geneseq (for infringement and
novelty searching).

                             CELLTECH USER STUDY

Celltech - a brief profile
Celltech, founded in 1980, is one of the largest specialised biotechnology com-
panies in Europe and is dedicated to finding novel therapeutics for cancer and
immune disorders using its expertise in molecular biology, protein engineering
and medicinal chemistry. Celltech Group Plc consists of two independent compa-
nies. Pharmaceutical research is carried out by Celltech Therapeutics Limited.
Celltech Biologics Plc produces biopharmaceutical development products and
specialises in antibody engineering, mammalian cell line development, manu-
facturing process development and industrial scale manufacture of such products
to third parties, including Celltech Therapeutics Limited. Celltech floated on the
stock exchange in December 1993 and currently has promising compounds in
development for the treatment of cancer, rheumatoid arthritis, asthma and septic
shock. The Information and Library Service at Celltech currently subscribes to
160 journal titles, has a collection of approximately 8,000 books and has a staff
of three.
   Whilst the external study produced valuable information and demonstrated
the range of approaches towards biological sequence data handling taken by
pharmaceutical companies throughout the industry, it was of necessity carried
out at 'arm's length' via questionnaires and interviews with only one company
representative. The wish to gain a detailed insight into how such information
was obtained and used within a single site prompted the Celltech user study. It
was for illustrative purposes and was not intended to be representative of the
rest of the industry, even though some of the findings were consistent with the
other companies examined.
   The aims of the user-study were:
    •   to assess the level of user knowledge of the sources of publicly available
    •   to examine the computing infrastructure in use by scientists;
    •   to examine the information requirements in relation to specific
        areas/activities of work;
    •   to see how scientists at Celltech keep up to date with regard to sequence

   Interviews were carried out with seven key Celltech scientists. All of the inter-
viewees were molecular biologists except one, who was a chemist involved with
molecular modelling. Typical research functions of the biologists included the
cloning, sequencing and expression (in mammalian systems) of genes of ther-
apeutic interest and subsequent analysis, e.g. of the proteins translated from
those genes. One biologist carried out protein engineering and the computer
modelling of macromolecules. The chemist was involved with all aspects of
molecular modelling, from small to large (bio-) molecules, but also internal
consultancy (general scientific computing) and systems administration for a
Silicon Graphics (SG) computer. At the time of writing, there is no company
(dedicated) Internet connection, although two users (both of whom were inter-
viewed) had dial-up modem access. These users acted as 'gatekeepers' of the
information held on the Net and performed searching and other activities on
behalf of the others.

Knowledge of sources
Five out of the seven users considered that they had a good and adequate knowl-
edge of molecular biology information sources. One had 'moderate' knowledge
and another wanted to know more about different sequence analysis software
packages. No users had any knowledge of the bio-sequence searching capabilities
offered by CAS-Online and thus had never used the library's existing sequence-
searching facilities (which consisted entirely of access to non-Internet 'conven-
tional' commercial databases like CAS), although one thought it potentially useful
and requested further information.

Use of computers
The two scientists with dial-up Internet access had their own desktop machines,
the others used a shared machine. All of the machines were Macintoshes, linked
via the local area network. Apart from general purpose applications such
as word processing and graph drawing, the most important software was
MacVector and the Entrez CD-ROM. MacVector was used to carry out align-
ment, hydrophobicity, accessibility (how accessible a particular protein struc-
tural feature is to ligand binding), prediction of secondary protein structure
and other calculations, and also to store sequences of interest. All of this derived
data could be used for comparison against published material in sequence
databases and would also be helpful in defining the various parameters when
uploading data to remote servers for processing, e.g. for BLAST calculations. It
was intended by one user to set up a database of unknown sequences discovered
in-house, but nothing had been implemented yet. The chemist's machine also
acted as a terminal to the SG computer for molecular modelling, energy min-
imisation and molecular dynamics calculations. All interviewees made use of
Entrez on CD-ROM although Network Entrez was sometimes used for the most
up to date information.
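
As an illustration of the kind of hydrophobicity calculation mentioned above, the following sketch computes a sliding-window hydropathy profile for a protein sequence. The interviews did not record which method MacVector applied, so the use of the Kyte-Doolittle scale and a seven-residue window is an assumption made here purely for illustration.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
# The choice of this scale is an assumption; the study does not state
# which scale the scientists' software used.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence: str, window: int = 7) -> list:
    """Mean hydropathy of each consecutive window of residues along the chain."""
    values = [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Invented sequence: an acidic stretch followed by a hydrophobic stretch.
profile = hydropathy_profile("DDDDDDDIIIIIII", window=7)
print([round(value, 2) for value in profile])
```

The profile rises from strongly negative to strongly positive as the window moves from the acidic stretch into the hydrophobic one; plots of this kind are commonly used to suggest buried or membrane-spanning regions.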

Information requirements in relation to specific areas/activities of work
One molecular biology technique that is beginning to make an impact in drug
discovery is 'differential display' (or 'novel gene' cloning), where previously
unknown genetic sequences are produced by stressing cells, for example with
toxins. Expressed genes that are novel (i.e. appear only in the stressed or altered
cell) are potential targets for therapy. It is necessary to check these novel
sequences against the most up to date collection of known sequences to ensure
that the considerable amount of time and resources involved in assessing
therapeutic potential is not wasted by duplicating earlier work. Three interviewees
considered novel gene cloning a good example of work for which access to the
most up to date genetic data is essential. To answer the question 'is this a
known gene?', BLAST similarity searches were carried out via email (with the
parameters set fairly tightly to retrieve exact or near exact matches) against
GenBank using the NCBI server. For the sake of currency, it was
always considered preferable to use the 'parent' Internet servers to access
databases, even though most of the databases can be found at a variety of other
locations on the Internet, because of the time it takes for updates to filter through
to the other locations. If the gene was known, the GenBank accession number
was used to extract the relevant bibliographic information, e.g. from Entrez. If
the gene was not known, then it was necessary to search protein sequence
databases for the derived amino acid sequence.
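
The last step of this workflow, deriving an amino acid sequence from a novel nucleotide sequence before searching the protein databases, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: it uses the standard genetic code, translates a single reading frame of the coding strand and stops at the first stop codon, whereas a real search would normally consider all six frames. The function names, the example sequence and the accession argument are hypothetical, and the BLAST search itself is not reproduced.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate one reading frame (0, 1 or 2), stopping at the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(frame, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")  # X marks an ambiguous codon
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def follow_up(dna, accession=None):
    """Choose the next step once the similarity search has been examined.

    `accession` is the accession number of a near-exact hit, or None if
    the sequence appears to be unknown (a hypothetical convention).
    """
    if accession is not None:
        return ("retrieve bibliographic record", accession)
    return ("search protein databases", translate(dna))


if __name__ == "__main__":
    novel_sequence = "ATGGCCAAATAA"  # hypothetical sequence with no database match
    print(follow_up(novel_sequence))
```
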
   For macromolecular modelling, the most important resource was the PDB.
Magnetic tapes (quarterly updates) were purchased and loaded into Insight 2
software on a Silicon Graphics computer for general use, although very recent
PDB information was accessed via the Internet. Internet access was thought
preferable as between one and two new protein structures are added per
day; in addition, errors detected in older structures are continually corrected, and
these corrections reach the magnetic tape version only later. For the protein engineering
of antibodies, sequence databases are essential. The initial selection of an anti-
body molecule to be worked on would normally be carried out in specialised
databases such as KABAT (which contains only sequences of immunological
interest) rather than in the larger comprehensive databases such as GenBank.
Molecular modelling was then carried out on the chosen candidate. A typical
scenario would be the selection of a human antibody which is most similar in
sequence to a cloned murine antibody with a particular specificity. The human
antibody would then be used as a framework for CDR-grafting.
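
The framework selection described in this scenario can be sketched as a simple ranking by sequence identity. The Python example below is a minimal illustration only: it assumes the candidate sequences have already been aligned to the murine sequence (equal lengths, with '-' for gaps), and the sequences and names are toy data, not real antibody sequences. In practice the comparison would be made with an alignment tool against a specialised database such as KABAT.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of identical positions in two pre-aligned sequences.

    Columns in which both sequences have a gap ('-') are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    matches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def closest_framework(murine, human_candidates):
    """Return the name of the human candidate most similar to the murine sequence."""
    return max(
        human_candidates,
        key=lambda name: percent_identity(murine, human_candidates[name]),
    )


if __name__ == "__main__":
    murine = "QVQLQQSGAE"  # toy sequence, for illustration only
    candidates = {         # hypothetical human sequences, pre-aligned
        "human_A": "QVQLVQSGAE",
        "human_B": "EVQLVESGGG",
    }
    best = closest_framework(murine, candidates)
    print(best, percent_identity(murine, candidates[best]))
```
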

Current awareness
The methods used to keep up-to-date can be summarised as follows:
    Information source                                         Number of users
    Browsing of journals                                             7
    Medline                                                          3
    CCOD                                                             2
    Bionet newsgroups                                                2
    Online 'SDIs' organised by the library                           1
March 1996                                                   BIOINFORMATICS

   One scientist stated that pharmaceutical companies have become more reticent
in recent years about publishing sequences in the journal literature until patents
have been filed and subsequently published. This is because a nucleic acid
sequence itself is 'enabling', i.e. can easily be synthesised, cloned and expressed.
This runs contrary to the prevailing culture in the molecular biology community,
in which molecular biologists have traditionally been unique amongst scientific
workers in the degree to which they share information on a 'goodwill' basis with
workers from other institutions. Such give and take is considered essential if the
information on which everyone thrives is to be maintained as a meaningful
resource. Recently, the National Institutes of Health (NIH) in the US and the
Medical Research Council (MRC) in the UK have reached an agreement that they
will not automatically file patent applications for every new DNA sequence that
is discovered as a result of the human genome project. It is hoped that
companies will follow this lead in order to free up the flow of information to
repositories like GenBank and EMBL. In the meantime, most pharmaceutical
companies will continue to rely on secrecy and will tend not to release sequence
data into the public domain until they can be sure that it will not compromise
their intellectual property.


Although the sample size was quite small, response to the questionnaire was
very encouraging and a lot of useful data was obtained. The mix of job titles
also enabled comparisons to be made between full-time bioinformatics specialists
and the more generalist information scientists working in the field.
    All respondents (internal or external) said that the currency of genetic data
was important or essential, and several current awareness methods were iden-
tified. According to Boguski [1] 'the only way to keep aware of important new
developments is to master some of the instruments ... and to use them regularly'.
User-friendly Internet access tools have been designed in recent years which have
made searching the Net easier; however Boguski also foresees that 'intelligent
agents' (software robots) will be programmed with individual interests and will
'continually scan the information space, automatically notifying us when any
relevant data or observations become available'. With molecular biology, it
is not just the information per se that needs to be followed but the entire
information landscape.
    Much interest was focused on the place of CAS in molecular biology research,
as it was known from past experience at Celltech that this very large repository
of protein and nucleic acid sequences was hardly ever used. Questions put to
external information professionals, to CAS itself (or STN, their UK representatives)
and broadcast on the Internet confirmed that it was indeed ignored by most of
the molecular biology community, mainly for reasons of cost although there
was also a large amount of ignorance about the searching facilities amongst
practising molecular biologists. It is possible that one of the main reasons for
this is that the Chemical Abstracts database held on STN and other hosts has
traditionally been searched by information professionals acting as intermediaries
and that knowledge of biosequence database enhancements did not filter through
to the scientists themselves. By contrast, one database which has become essen-
tial for all users is Entrez, which became available only two years ago and
integrates the sequence information from many databases and enables cross-
referencing to the relevant biomedical references from Medline. Its user-friendly
interface, powerful information retrieval features and low price have put
information into the hands of end users that only a few years ago would have
required the running of complex algorithms on a remote supercomputer.
   The Celltech user study showed in general terms how scientists are using bio-
sequence databases and computation in pharmaceutical research, for example
to recognise similarities between a totally new sequence and sequences with
known properties and function, in order to gain a 'handle' on the underlying
disease process. The majority of users were satisfied with their knowledge of the
necessary information sources although there was somewhat less satisfaction with
access to them, mostly due to the lack of a dedicated Internet link. This situation
led to the existence of two information 'gatekeepers' who provided a service to
the rest. The information itself consisted primarily of sequence data (both nucleic
acid and peptide/protein) and protein structure information. The former can be
either 'primary' information (the sequence itself, for example as submitted to
one of the large public databanks prior to publication) or 'secondary', i.e. eval-
uated information which is found in journal articles or specialised databases; for
example the Kabat database of proteins of immunological interest.
   Nearly all of the important bioinformatics resources are available freely or
cheaply via the Internet and the scientific community have been users of this
medium for communication and other uses for many years. Thus, it is natural
that molecular biologists in universities and industry have exploited (and
contributed to) those resources as they have become available. This is especially
true as the information which makes up the resources can be both the virtual
raw material and the end product of further experiments. These circumstances
have led to a self-reliance for information among molecular biologists, although
considerable work is involved in the retrieval and analysis of these data. Both
the internal and the external surveys show that this has often led to the existence
of a small number of information gatekeepers within the laboratory or research
centre. These gatekeepers are true scientist/information scientist hybrids - proof
of a converging role of information provider and end user; however when viewed
from the perspective of the whole organisation they appear somewhat as an
'island', isolated from other functions and departments. It is my conclusion that
more generalist information specialists, e.g. those working in information centres
or libraries, can be increasingly helpful in integrating this data with other
information from various sources and bridging that gap.
   The fact that four out of the six respondents (i.e. not the two gatekeepers) who
were active in bioinformatics operated from within information/library depart-
ments showed that such departments were already playing an important role.
Indeed, most questionnaire respondents saw a productive role in the future for
information workers in the field of molecular biology, citing end-user training and
data integration as areas where they might be involved. On the basis of this
research, it can be concluded that to a certain extent 'horses for courses' applies
to the appropriate level of IS involvement at a particular site, i.e. it will depend
on the particular mix of skills and experience that exists within an information
department. Any biomedical information department would be well advised to
become knowledgeable about the main sources of molecular sequence data even
if current research projects do not have much of a molecular biology component
as there is a clear trend towards this type of research in the pharmaceutical industry.
   If research information professionals are to continue providing a complete
service to workers in the pharmaceutical industry then they will need to gain a
practical knowledge of bioinformatics. Major factors that have to be considered
in developing an information policy will include available infrastructure (e.g.
hardware, software, networking issues and data security), and the skills available
in-house. For the training of end users in the knowledge of sources, use of the
Internet and resources such as Entrez, most life-science degrees should provide
enough subject knowledge. More detailed analyses such as alignments or homol-
ogy searches will require at least such a first degree with a high molecular biology
content but would probably best be carried out by the practitioners themselves,
or by hybrid scientist/information workers, acting as 'local experts' and provid-
ing a service to other research scientists in the laboratory.


Acknowledgements
Many thanks to Mark Boguski for help with defining bioinformatics and for
sending the offprint. Thanks also to Tina Jones for help with the graphics and
to all those who gave time to be interviewed, replied to the questionnaires and
answered the Usenet postings.


References
 1.   BOGUSKI, M.S. Bioinformatics. Current Opinion in Genetics and
      Development, 4, 1994, 383-388.
 2.   ALTSCHUL, S.F., BOGUSKI, M.S., GISH, W. and WOOTTON, J.C. Issues in
      searching molecular sequence databases. Nature Genetics, 6, 1994, 119-129.
 3.   SILLINCE, M. and SILLINCE, J.A.A. Sequence and structure databanks in
      molecular biology: the reasons for integration. Journal of Documentation,
      49(1), 1993, 1-28.
 4.   KEHOE, K. Specialised databases in molecular biology and genetics: the
      nucleic acid and protein sequence databases. Science and Technology
      Libraries, 11(1), 1990, 99-105.
 5.   FUCHS, R., RICE, P. and CAMERON, G.N. Molecular biological databases -
      present and future. Trends in Biotechnology, 10, 1992, 61-66.
 6.   BUNTROCK, R.E. Sequence databases: what's in it for me? Database, June
      1991, 107-109.
 7.   BENSON, D., LIPMAN, D.J. and OSTELL, J. GenBank. Nucleic Acids Research,
      21(13), 1993, 2963-2965.
 8.   STOEHR, P. and CAMERON, G.N. The EMBL data library. Nucleic Acids
      Research, 19 (supplement), 1991, 2227-2230.
 9.   BAIROCH, A. and BOECKMANN, B. The SwissProt protein sequence data bank.
      Nucleic Acids Research, 20 (supplement), 1992, 2019-2022.
10.   BARKER, W.C., GEORGE, D.G., MEWES, H. and TSUGITA, A. The PIR-
      International protein sequence database. Nucleic Acids Research, 20
      (supplement), 1992, 2023-2026.
11.   LIU-JOHNSON, H.N., HAINES, R. and HACKETT, W. Searching for protein
      sequences in CAS Online. Biotech Forum Europe, 5(4), 1991, 204-209.
12.   Who's Who in the UK Information World 1994, 4th edition. TFPL Publishing,

(Revised version received 24 October 1995)

