


     A Literature-Based Approach to
     Scientific Discovery

   A presentation at UIC Sep 4, 2003

       Don R. Swanson © 2003
       Professor, Div. Humanities
       The University of Chicago
   Literature-based discovery? ---
            the very idea.

1. It means deriving, from the public record of science
   new solutions to scientific problems.
2. The possibility arises, for example, when two articles
   considered together for the first time suggest new
   information of scientific interest not apparent from
   either article alone. This mode of discovery is the
   focus of the Arrowsmith project.

Undiscovered public knowledge

To speak of “new information of scientific interest”
is a reference to the state of the literature, not to
a state of mind. What someone may know or not
know is a quite different concept from what is in
the public record. It is public rather than private
knowledge with which we are concerned in our
quest for discovery.

   A problem-oriented approach

   Focusing on the professional and research
literature of biology and medicine, begin with
a specific problem of substantial scientific
interest -- such as finding a previously unknown
cause, cure, or treatment for a particular disease.

The growth and fragmentation of

In response to overwhelming growth, science
spontaneously, and somewhat mysteriously,
divides itself into specialties. In this way, the
labor of both producing and assimilating its
literature is divided into more or less
manageable chunks. But the inevitable
consequence of this fragmentation is the
mutual isolation of the chunks.

   The growth of science literature
    can be seen as a whole -- but

Time line
  -- is more cogently visualized as
 the growth of specialty literatures

Time line
 Citation clusters -- specialization
   and fragmentation of science

    Within a specialty, authors cite one another to a
much greater extent than they cite authors outside
of the specialty.

   When specialty literatures become too large to
assimilate, they divide into sub-specialties.

   Connections and relationships between mutually-
isolated specialties may go unnoticed.

     The Connection Explosion

The number of potential connections between units of
specialized literatures grows much faster than the number of
units themselves. The number of pairwise connections, for
example, increases as the square of the number of units that
could be connected -- more accurately, as n(n-1)/2.

Units (n):        2      3      4      5       100      1000
Connections:      1      3      6      10      4950     about 500,000
  The Connection Explosion
        in Medline

So, for 10,000,000 Medline records, there
would be 50,000 billion possible 2-way
connections between individual articles.
But if we think of the literature as divided
into 10,000 units or “chunks”, then the
number of possible 2-way connections
between chunks is much less -- only
50 million.
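
The arithmetic on this slide can be checked with a short sketch. The function name is only illustrative; the formula n(n-1)/2 is the one given above.

```python
def pairwise_connections(n: int) -> int:
    """Number of distinct unordered pairs among n units: n(n-1)/2."""
    return n * (n - 1) // 2

# Unit counts taken from the slides: small sets, 10,000 chunks,
# and 10,000,000 individual Medline records.
for n in (2, 3, 4, 5, 100, 1000, 10_000, 10_000_000):
    print(f"{n:>12,} units -> {pairwise_connections(n):>22,} pairs")
```

For 10,000 chunks this gives 49,995,000 pairs (about 50 million); for 10,000,000 records it gives roughly 5 x 10^13, that is, 50,000 billion.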

   What is meant by chunks and

I have introduced the word “chunk” to stress the
fact that “specialty” may have other meanings
not necessarily intended here. A “chunk” is
something within which there is good
communication but which tends to have poor
communication with other chunks. It is a set
of articles that cite one another appropriately,
but cite relatively few articles in other chunks.

 Visualizing literatures as sets of
 articles -- using Venn diagrams

1. A “literature” or set of articles can be visualized
as though in a “container” -- a closed figure encircling
a set of points, each point representing a scientific
article or a database record.

2. Two sets intersect if they contain some
articles in common. Articles within the
intersection are members of both sets
-- S1 AND S2.
    Disjoint, non-interactive sets

Disjoint means that sets have no articles in common --
that is, they do not intersect:

If there are no cross-citations -- from one set to
the other -- the two sets will be called non-interactive.

    Disjoint and non-interactive,
    but nevertheless related, sets.

Two literatures, A and C, even with no records in
common and no citations from one to the other, might
be related through shared attributes -- such as key words,
phrases, index terms, concepts, or authors.

                               A              C

If two non-interactive sets were
  related, who would know it?
Because there are no cross-citations, no one
reading one literature will be led to the other --
at least not in the usual way of following chains
of citations. Hence if two such literatures have
scientifically interesting connections, it is
possible that such connections are unintended
and unnoticed.

Science literatures -- constantly
changing worlds to be explored.

       Online searching of
     bibliographic databases
1. Database searching consists of forming and
   combining sets of bibliographic records.

2. Each search statement creates a set and
   displays how many records are in it.

    S1 [search term - A] -- 9000
3. Records can be displayed, thus
   providing valuable relevance feedback.
  Online searching -- continued

4. In forming search statements, the three Boolean
operators AND, OR, NOT correspond to the
intersection, union, and complement of sets.
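
As a minimal sketch of this correspondence, Python's built-in set type can stand in for sets of records. The record identifiers are made up for illustration; a real search returns Medline records.

```python
# Hypothetical record identifiers, not real Medline data.
s1 = {101, 102, 103, 104}   # records matching term A
s2 = {103, 104, 105}        # records matching term C

both = s1 & s2      # A AND C -> intersection
either = s1 | s2    # A OR C  -> union
only_a = s1 - s2    # A NOT C -> A intersected with the complement of C

print(sorted(both), sorted(either), sorted(only_a))
```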

Online searching -- continued

5. If you form two sets, you can find the
intersection, and the number of records in it.
    S1 term A ---- 9000
    S2 term C ---- 3000
    S3 S1 AND S2 --- 90
    If S3 = 0, A and C are disjoint.
    If S3 = S2 (3000), then C is a subset of A.
           In sum, tools
     for interactive searching

1. Instant display of the sets you formed,
   showing the size of each set.
2. The 3 Boolean operators, and additional
   operators such as truncation and proximity.
3. Commands that permit display of any records
  found -- either the full record or specified fields.
  (Inspection of how relevant and non-relevant
  records were indexed can provide invaluable
  clues for revising and refining a search.)
    Introduction to Medline
      Searching -- handout
Part I Algebra of Sets and Venn Diagrams
Part II Rules of the Game
         -- main types of search commands
            find - combine - display
Part III Search Strategy:
          -- online searching is the art of
              forming and combining sets
Part IV PubMed Puzzles:
          -- some surprises and some lessons

     Explaining Literature-Based
      Discovery by an example

Reprint handout: Migraine and magnesium: eleven
neglected connections. Perspect. Biol. Med. 1988
               [referenced here as MigMag88]
MigMag88 is a pre-Arrowsmith study. Arrowsmith
is not only an aid to LBD, it is a result of LBD, and
evolved from the search techniques and strategies that
MigMag88 describes. Arguably, the best way to learn
how to use Arrowsmith is to begin a process of LBD
without it, substituting for it a Medline exploration.
     LBD without Arrowsmith

In brief, the process described in MigMag88
that led from migraine to magnesium is this:
1. Search title-words in Medline for “migraine”.
2. Examine a few dozen or more records looking for
potential intermediate links in the chain of events that
might lead to migraine.
3. Then start a new title-word search for these links.
4. Examine the new titles looking for links that
might be still earlier in the same chain of events.
    -- not just an exercise

The Medline search process described
is not just an exercise for learning
about Arrowsmith; it is also a useful
preparation for any Arrowsmith search.
It can help you create a better input and
so, plausibly, a better output.

  The two stages of Literature-
       Based Discovery

Stage 1: Getting from the problem (migraine)
          to a conjectural solution (magnesium)
          [hypothesis generation]
Stage 2: Exploring in depth the connections
          between the two.
          [hypothesis testing]
In MigMag88, Stage 1 is described in the section
“A systematic trial-and-error search strategy”;
the rest of the paper is devoted to Stage 2.
Assembling other people’s ideas

 Stage 2 of MigMag88 resembles a
 literature review, but its author neither has,
 nor claims, any expertise in the two subjects
 under review. The result is virtually a cut
 and paste of what the real experts have said
 in print about migraine and magnesium.

       The ABC Model of

If one area of literature shows that
A is related to B and a different area
shows that B is related to C , then
bringing together these two areas
for the first time may suggest a novel
hypothesis that connects A with C,
an implicit but not explicit relationship.
Venn Diagram -- ABC Model

               Articles about an AB relationship.

       A      AB          B           BC       C

 Articles about a BC relationship.
 AB and BC are complementary but disjoint :
 They can reveal an implicit relationship between
 A and C in the absence of any explicit relation.
   An ABC example based on title
        words in Medline

AB article: Magnesium-deficient rat as a model of epilepsy.
            Lab Animal Sci 28:680-5, 1978
BC article: The relation of migraine and epilepsy.
            Brain 92: 285-300, 1969

 A magnesium (8011 records) -- B epilepsy (an unintended link) -- C migraine (2756 records)
 AB intersection: 22 records; BC intersection: 45 records

Venn diagram: sets of Medline records; A,C are disjoint.

Two sets of articles are defined as complementary
if, considered together, they suggest new
information (in this case a possible migraine-
magnesium connection) not apparent in either
set taken alone. Complementarity does not
necessarily imply logical transitivity, but rather
is used in the looser sense of suggestibility.

     Introducing Arrowsmith

Arrowsmith is software that finds words, phrases,
subject headings, authors, and other attributes
common to two downloaded sets of database
records -- the purpose of which is to help the
user see new connections within the scientific
literature that lead to novel plausible hypotheses
that are worth testing. Medline cannot do this.

    What Arrowsmith can do that
          Medline can’t

   Arrowsmith finds all “interesting” words,
    phrases, or subject headings (B-list terms)
    common to two sets of records (A, C).
                      Bi i=1,2,..
          A      platelet aggregation C
      magnesium calcium blocker        migraine
                 vascular reactivity

 Arrowsmith extends power of Boolean
“AND”, whether or not A,C are disjoint


Medline finds A AND C, but not an unknown Bi.
Arrowsmith finds A AND Bi as well as Bi AND C for all Bi; i=1,2,3…

(Bi within AC intersection are presumed to be known.)
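
The idea of finding terms common to two sets of titles can be sketched as below. This is not Arrowsmith's code: it handles single words only, the stoplist is a tiny stand-in, and the function names are hypothetical. The two titles are the ones quoted in the ABC example.

```python
import re

# Tiny illustrative stoplist (an assumption, not Arrowsmith's own list).
STOPLIST = {"a", "an", "and", "as", "in", "of", "the", "model", "relation"}

def title_terms(titles):
    """Lower-cased words from a list of titles, minus stoplist words."""
    terms = set()
    for title in titles:
        terms.update(re.findall(r"[a-z]+", title.lower()))
    return terms - STOPLIST

def b_list(a_titles, c_titles, a_term, c_term):
    """Words shared by the A and C titles, excluding the search terms."""
    shared = title_terms(a_titles) & title_terms(c_titles)
    return sorted(shared - {a_term, c_term})

a_titles = ["Magnesium-deficient rat as a model of epilepsy."]
c_titles = ["The relation of migraine and epilepsy."]
print(b_list(a_titles, c_titles, "magnesium", "migraine"))  # ['epilepsy']
```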
            Filtering the B-list

   A large (8000-word) pre-compiled stoplist
    (words to be excluded) is built into
    Arrowsmith and applied automatically.
   The user may delete entries from the B-list.
    Terms that remain are “interesting”.
   B-list editing by the user is optional; terms
    with rank-0 are now automatically removed.
   Ranking is based on subject headings.
                A,C input;
         B-list + titles as output

   The first output is the B-list.
   For each term on the B-list, all titles (from
    files A,C) containing that term are brought
    together and displayed.
   The title display is vital, for it provides the
    contexts in which the Bterm occurred and
    may suggest a complementary relationship.
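
A rough sketch of such a title display follows, assuming the B-list is already available. Matching is by simple substring, a simplification of whatever matching Arrowsmith actually performs, and the function name is hypothetical.

```python
def titles_by_b_term(b_terms, a_titles, c_titles):
    """For each B-term, collect the A and C titles that contain it."""
    display = {}
    for term in b_terms:
        display[term] = {
            "A": [t for t in a_titles if term in t.lower()],
            "C": [t for t in c_titles if term in t.lower()],
        }
    return display

a_titles = ["Magnesium-deficient rat as a model of epilepsy."]
c_titles = ["The relation of migraine and epilepsy."]

for term, sides in titles_by_b_term(["epilepsy"], a_titles, c_titles).items():
    print(term)
    for side in ("A", "C"):
        for title in sides[side]:
            print(f"  {side}: {title}")
```

Seeing the A-side and C-side titles next to each other is what makes the juxtaposition suggestive.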

             Using the output

   The purpose of the output is to create
    suggestive juxtapositions of titles.
   For each Bi-term, the ABi and BiC title-
    displays (+ abstracts & full text) may help
    the user construct a plausible testable
    hypothesis that connects A with C.
   If ABi and BiC are disjoint, the hypothesis
    may be novel.
LBD goal is a testable hypothesis

   Assuming that a plausible, novel, testable
   hypothesis has been developed, the next
   goal then is to stimulate a clinical or
   laboratory test of it, or simply stimulate
   more research. One can ask, did
   MigMag88 stimulate more research?

    What Arrowsmith can do that
    Medline can’t: a 2nd example

   Arrowsmith finds all “interesting” words,
    phrases, or subject headings (the B-list)
    common to two sets of records (A, C).

          A         Aphthovirus       C
      virulence        Lassa        stability
     exp viruses      Marburg       exp viruses
                   Semliki Forest

           ABC model --
        a new interpretation

The five B-viruses have each been investigated
in the context of both A and C (virulence and
stability). This fact may be of interest because
A and C together have more implications for the
threat of viruses as weapons of warfare or
terrorism than either set taken alone.
  (Ref: JASIST August 2001 p. 797-812)

Arrowsmith and Term extraction

Because Arrowsmith begins by extracting terms
from downloaded Medline records, the next 2
slides illustrate the Medline record and the
extraction process ---

     Fields and “Terms” in a Medline
The first 2 letters at left mark the field. “Terms” are subject headings
(MH) or words and phrases from the title (TI) or abstract (AB) fields.

UI 89317153
AU LeDuc
IN U.S. Army Medical Research Institute..
TI Epidemiology of hemorrhagic fever ….
SO Reviews of Infectious Diseases 1989…
MH Arenaviridae Infections/ep
MH Ebola Virus
MH Flavivirus
MH Marburg Virus Disease/ep
AB Twelve distinct viruses associated
with hemorrhagic fever in humans….
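
A minimal sketch of reading such a field-tagged record is given below. It assumes the simplified layout shown on the slide (a two-letter tag, a space, then the value, with untagged lines continuing the previous field); real Medline export formats differ in detail, and a continuation line that happens to begin with a two-letter upper-case word would be misread. The function name is hypothetical.

```python
def parse_record(text):
    """Parse one field-tagged record into {tag: [values]}."""
    fields = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        head, _, rest = line.partition(" ")
        if len(head) == 2 and head.isupper():
            tag = head
            fields.setdefault(tag, []).append(rest.strip())
        elif tag is not None:
            # Continuation line: append to the most recent field value.
            fields[tag][-1] += " " + line.strip()
    return fields

# Abridged from the slide; elisions are kept as elisions.
SAMPLE = """\
UI 89317153
AU LeDuc
TI Epidemiology of hemorrhagic fever ...
MH Ebola Virus
MH Flavivirus
AB Twelve distinct viruses associated
with hemorrhagic fever in humans...
"""

record = parse_record(SAMPLE)
print(record["MH"])   # ['Ebola Virus', 'Flavivirus']
```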
       An integrated picture of a
     Medline and Arrowsmith search
                       Sets of Records
                A                             C

          First use Medline to create two disjoint sets A, C;


     Arrowsmith forms sets of terms
     extracted from Medline records.
                       Sets of Records
               A        B-list records       C
Arrowsmith         Sets of Extracted Terms

              Terms from A      Terms from C

 Terms are extracted from records to form “2nd order sets”;
 Intersection (called “B-list”) is first Arrowsmith output.
   But suppose the two sets are not
    disjoint -- i.e. they intersect?
Sets of Records
Medline                 A      C

All title words and phrases within the AC intersection
will be on the B-list because Arrowsmith in that case will
match two identical sets of titles, AB and BC. If the
intersection is large, then it may dominate the B-list. A
conventional Medline search will provide all AC titles,
so Arrowsmith is not needed for this purpose.
   Visualizing how the intersection
   of A&C may dominate the B-list
Sets of Records
Medline                   A      C

Sets of Extracted Terms
                       Direct B-list
 A “direct” (A AND C) Medline search yields a subset
of the B-list (“direct B-list”), which presumably is known.
Arrowsmith removes direct-AC

Arrowsmith is designed on the assumption
that the direct-AC articles are already well
known and separately explored by the user.
Accordingly, all direct-AC records are
removed from file A at the outset,
to prevent their unnecessary contribution
to the B-list.
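
In set terms, this step amounts to dropping from file A every record that also appears in file C. The sketch below assumes each record is a dictionary carrying its unique identifier under "UI"; the identifiers and the function name are made up for illustration.

```python
def remove_direct_ac(file_a, file_c):
    """Drop from file A every record whose identifier also occurs in file C."""
    c_ids = {rec["UI"] for rec in file_c}
    return [rec for rec in file_a if rec["UI"] not in c_ids]

# Hypothetical identifiers; record 3 is a direct-AC record.
file_a = [{"UI": 1}, {"UI": 2}, {"UI": 3}]
file_c = [{"UI": 3}, {"UI": 4}]

print(remove_direct_ac(file_a, file_c))   # [{'UI': 1}, {'UI': 2}]
```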

Implications of A-C intersection:
first, understand what it contains!

Disjoint: Best opportunity for finding
        previously unknown connections.
Small overlap: May be as good, or better.
Large overlap: not promising for novelty.
        Use conventional Medline search for A AND C.
        However, try to find subsets of A, C
        which are disjoint, then apply Arrowsmith.

 Preparation of A,C input files:
 Arrowsmith search strategies 1

Refer to handout on Medline Searching for discussion
of recall, precision, and search strategies. High recall
means you are trying to get everything, which
inevitably brings with it a lot of junk, whereas high
precision means you settle for less because you want
everything you do get to be very relevant.
  Arrowsmith requires a high-recall search for finding
the direct-AC literature, but requires high-precision
for the two A,C input files in order to minimize junk
in the B-list.
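
The trade-off can be made concrete with the standard definitions: precision is the fraction of retrieved records that are relevant, and recall is the fraction of relevant records that were retrieved. The record numbers below are invented for illustration.

```python
def precision_recall(retrieved: set, relevant: set):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

relevant = set(range(1, 11))    # 10 truly relevant records (hypothetical)
broad = set(range(1, 101))      # high-recall search: everything, plus junk
narrow = {1, 2, 3, 4}           # high-precision search: little, but all relevant

print(precision_recall(broad, relevant))    # (0.1, 1.0)
print(precision_recall(narrow, relevant))   # (1.0, 0.4)
```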
 Preparation of A,C input files:
 Arrowsmith search strategies 2

The previous discussion of LBD without Arrowsmith
has new cogency, for it is the same exploratory
process you should now follow in preparing the A,C
input files. The reprint handout MigMag88 provides
an example of that process.

Preparation of A,C input files:
Arrowsmith search strategies 3

A focus on searching title words and phrases
may be important, for three reasons:
 1. -- it tends to improve precision
 2. -- it improves the odds that B-terms are
meaningfully linked to their corresponding
A and C terms, because they are closer to them
in titles than in abstracts.
3. -- complementarity is easier to recognize
and the amount of text to be examined is much
less than in scanning through complete records.
Preparation of A,C input files:
Arrowsmith search strategies 4

 Precision may be further improved by searching
 subject headings in conjunction with title words
 -- that is, forming an intersection, as follows:

     migraine[TI] AND migraine[MH]

 Forming similar intersections with other subject
 headings is also an option, and may be used as
 a means of improving the rank of B-list terms.

Preparation of A,C input files:
Arrowsmith search strategies 5

 Date of Publication time limits may
 be a powerful precision tool. If you
 know that connections before a certain
 date cannot be relevant, then search
 only the later literature. Use caution
 and apply separately to A and C
 literatures because you may want new
 ideas in A to connect to old C-literature
 or vice versa.

Preparation of A,C input files:
Arrowsmith search strategies 6

 Search strategy determines, among other things,
 the size of the input files A, C. Size matters!
 A presupposition that underlies connecting
 mutually isolated or noninteractive literatures
 is that, within themselves, each of the separate
 literatures is highly interactive or highly
 inter-communicative -- that is, articles within
 File A cite each other extensively and similarly
 for File C.

Preparation of A,C input files:
Arrowsmith search strategies 7

 How large is too large?
 That is difficult to answer; we need more data
 on the typical size of citation clusters. But
 experience with excessively long Arrowsmith
 B-lists suggests that the optimal size may be
 in the range of 100 to 5000 articles.

Preparation of A,C input files:
Arrowsmith search strategies 8

  If Precision is too high, then recall may be
 too low, and good terms might be lost from
 the B-list. (A, C too small.)
    If Recall is too high, then Precision may
 be too low, and good terms will be buried.
 (A,C are too large).
    Best compromise is to start with high
 Precision and gradually increase Recall in
 repeated searches.
         UMLS vs MeSH:
         Meaning vs Use
The meaning of a word is often imprecise;
what counts is how words are used. Hence
it is always important to examine the
output of a search.
    UMLS and MeSH are addressed to
different problems. MeSH is designed to
index the medical literature-- assign terms
to articles, and so is primarily about use.
UMLS is a massive thesaurus-like
compilation, primarily about meaning.
Preparation of A,C input files:
   The sublanguage effect.
Sublanguages have been investigated by linguists
since the 1960s. They entail restricted lexicons and
grammatical operations. A set of all articles with,
say, “migraine” in the title may create a restricted
sublanguage and so a more cohesive literature --
quite possibly a literature with a strong internal
communication pattern.
  Arrowsmith probably works best when it seeks
connections across cohesive clusters. This idea
may give further support to title searching.
  Arrowsmith - U of Chicago and
Arrowsmith originated on the UC-website and was
installed at UIC in mid-2001. UIC and Marc Weeber
then developed an ingenious and more user-friendly
interface that imported and incorporated PubMed and
the Medline database, for “one-stop shopping”.
Semantic filtering using the UMLS was also installed
at UIC. Both sites continue to be available, but there
are some differences. The presentation up to this
point applies to either system. The next slide outlines
how the University of Chicago site differs from UIC.
Arrowsmith at The University of
Chicago: http//
1. Continues to be developed and maintained by Swanson.
2. Medline searching is independent; downloaded files
are transmitted as input to Arrowsmith-on-kiwi --
3. -- which can accept input files from most versions of
Medline, including particularly PubMed and Ovid.
4. It integrates hypothesis-generation (pseudo 1-node)
with hypothesis-testing (2-node).
5. For a 2-node search, the B-list is automatically ranked
using subject headings in a process to be described next.
6. For the “1-node” search, the A-list is ranked.
A method for evaluating a B-list

The work outlined briefly here is covered in some
detail in Progress Reports 2 and 3, submitted to UIC
by Swanson, which are available on request.
Central to the idea of ranking a B-list is some way
to evaluate that ranking. The MigMag88 paper is
taken as a model testbed for Arrowsmith output.
It has the principal advantage that a fairly large
number of apparently valid connections were found,
as evidenced by the arguments in MigMag88.
Using Medical Subject Headings
    (MH) to rank the B-list
The input files, in MEDLINE format (also called
the FIELDTAG format), contain subject headings
with an MH tag in the leftmost field. All MH fields
are extracted from both Files A and C, and terms
common to the two files are identified and filtered
through a MeSH stoplist. These are called MHB
terms. The original input is then converted to
abbreviated records that contain only the identifier,
title, abstract, authors, source journal, and MHB
terms, for further processing.
Automatic ranking of the B-list

After the title B-list has been created, all MHB terms
common to the corresponding A and C records
for each title B-term are highlighted in blue.
A rank number is assigned to each title B-term
based on the number of blue-highlighted MHB
terms with which it is associated. The effectiveness
of the ranking was tested on previously run
problems for which the most valuable B-links
were already known, as in MigMag88.
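
One possible reading of this ranking step is sketched below; it is an interpretation of the description above, not Arrowsmith's actual code. For each title B-term, it gathers the subject headings of the A records and of the C records whose titles contain the term, and takes the number of headings the two sides share as the rank. The records, headings, substring matching, and function name are all assumptions made for illustration.

```python
def rank_b_terms(b_terms, file_a, file_c, mesh_stoplist=frozenset()):
    """Rank each title B-term by the number of shared (MHB) subject headings."""
    ranks = {}
    for term in b_terms:
        mh_a = {mh for rec in file_a if term in rec["TI"].lower()
                for mh in rec["MH"]}
        mh_c = {mh for rec in file_c if term in rec["TI"].lower()
                for mh in rec["MH"]}
        ranks[term] = len((mh_a & mh_c) - mesh_stoplist)
    return ranks

# Hypothetical abbreviated records (title and subject headings only).
file_a = [
    {"TI": "Magnesium and platelet aggregation", "MH": ["Blood Platelets", "Magnesium"]},
    {"TI": "Magnesium and calcium channels", "MH": ["Magnesium", "Ion Channels"]},
]
file_c = [
    {"TI": "Platelet behavior in migraine", "MH": ["Blood Platelets", "Migraine"]},
    {"TI": "Calcium antagonists in migraine", "MH": ["Migraine", "Drug Therapy"]},
]

print(rank_b_terms(["platelet", "calcium"], file_a, file_c))
# {'platelet': 1, 'calcium': 0}
```

Under this reading, a rank-0 term is one whose A and C records share no subject heading at all.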
  Ranking results for MigMag88

Rank #   Total B   Target B   Precis%
 >=3        43        16         37
   2        36         6         13
   1       135        15         11
   0       131         9          7

Chi-squared test (T = total, O = observed targets, E = expected targets):
Rank      T      O      E
 >0      214     37    28.5
  0      131      9    17.5
chi2=12.6 p<.005
 Precision tends to increase as the rank# increases.
 For 0 rank vs. all higher, results strongly significant.
 Two other studies showed similar trend for precision.
 Raynaud/EPA study n.s. because numbers too small.
 Arg/SmC for rank 0 vs all higher, significant p<.005
     Two Kinds of Arrowsmith

1. Hypothesis generation.[pseudo 1-node]
   Output is list of A-candidates, ranked by B-terms.
2. Hypothesis testing.[2-node]
   Output is B-list and associated titles in AB, BC.

             A                                    C


      Thus far, the discussion has been about hypothesis testing
         Hypothesis generation

   Arrowsmith finds words and phrases
    common to titles (AA,C) -- the BB-list.

                   Bi   i=1,2,..
 toxins                            C

  Explanation of AA notation
and development of ranked A-list

  AA denotes a broad category within which
  more specific terms, A, will be sought.

  Arrowsmith will decompose the titles in the
  AABB intersection into component words and
  phrases called the A-list and will then rank
  the A-list according to the number of B-terms
  with which each A-term is associated.
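
The ranking described here might be sketched as follows. This is a simplified interpretation rather than Arrowsmith's implementation: it works on single words, treats every non-B, non-stoplist word in an AA title as a candidate A-term, and counts the distinct B-terms that share a title with it. All titles, terms, and names are hypothetical.

```python
import re
from collections import defaultdict

def ranked_a_list(aa_titles, bb_terms, stoplist=frozenset()):
    """Rank candidate A-terms by the number of distinct B-terms seen with them."""
    bb = set(bb_terms)
    linked = defaultdict(set)   # A-term -> B-terms sharing a title with it
    for title in aa_titles:
        words = set(re.findall(r"[a-z]+", title.lower()))
        b_here = words & bb
        for word in words - bb - stoplist:
            linked[word] |= b_here
    ranked = [(a, len(bs)) for a, bs in linked.items() if bs]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))

# Hypothetical AA titles and BB terms.
aa_titles = [
    "alpha toxin and platelet function",
    "alpha toxin alters vascular tone",
    "beta toxin and platelet counts",
]
stop = frozenset({"and", "alters", "function", "tone", "counts"})

print(ranked_a_list(aa_titles, {"platelet", "vascular"}, stop))
# [('alpha', 2), ('toxin', 2), ('beta', 1)]
```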

     Venn diagram for ranked A-list

  [Venn diagram: within AA, the terms Ai (A1, A2, ...) are ranked
  by the number of BB terms associated with each.]
  Choose an Ai, then re-do as 2-node.
         Selecting from A-list
      initiates hypothesis testing

Users select term from A-list for hypothesis-testing.
A new B-list is generated as a subset of the BB list
applicable specifically to A. AB and BC titles can
be explored online -- user clicks on B-term to see
titles in which the term is used, in both A and C.
File A is a subset of AA, and so these results are
more restrictive -- and perhaps of higher precision
-- than a new 2-node A-C search.

    Human & Machine functions in
     hypothesis generation mode
   User selects problem C and conjectural set
    AA; conducts Medline search.
   Arrowsmith produces BB-list from AA, C.
   Arrowsmith removes rank-0 from BB-list.
   Arrowsmith produces ranked and grouped A-list.
   User optionally may edit A-list and form
    groups, then let Arrowsmith try again.
           The Arrowsmith
          low-frequency list

Arrowsmith creates word-frequency lists
for both the A and C literatures.
   Low-frequency words may reveal earliest
indications of the novel relationship that is
sought. Thus Arrowsmith can call attention to
a discovery already made, but perhaps not
widely known.

     Arrowsmith as a guide to the

   By processing downloaded database
    records, Arrowsmith can help the user
    decide what to read, and by so doing can
    stimulate new medical hypotheses --
   -- the plausibility of which can be assessed
    by reading the literature.
   Finally, the hypotheses can be tested only
    through clinical or laboratory investigation.
       Published studies of CBD
   1986 Dietary fish oil -- Raynaud’s Disease
   1988 Magnesium deficiency -- Migraine
   1990 Arginine -- Somatomedin C
   1994 Mg deficiency -- Neurologic Disease
   1996 Indomethacin --Alzheimer’s Disease
   1996 Estrogen ---- Alzheimer’s Disease
   1998 Phospholipase A2 -- Schizophrenia
   2001 Viruses as potential weapons

   2001 Genetic packaging technologies --
    potential for virus warfare. (Smalheiser)
   2001 Five potentially new therapeutic
    applications of thalidomide. (Weeber: Ch 6)

         Purpose of publishing a study
     of complementary disjoint literatures

   Place in refereed biomedical journal.
   Purpose is to present a convincing argument
    that the literature-derived hypothesis (A--C
    via B) is novel, plausible, and worth testing.
   3 measures of success: acceptance for
    publication, stimulation of a test, and
    corroboration of hypothesis.
        More details of each of the 8 studies follow:
     Fish oil, Raynaud’s Syndrome, and
      undiscovered public knowledge,

   Perspect. Biol. & Med. 30(1): 7-18, 1986
   Reference sources: 25 fish oil, 34 Raynaud
   Connections: blood viscosity, platelet function,
    vascular reactivity, red-cell deformability,
    prostaglandins, serotonin, thromboxane
   A pre-Arrowsmith study.
   Arrowsmith applied later to 353 fish oil titles and
    585 Raynaud titles yielded B-list of 31 terms that
    included all of the 7 above.
    Fish-oil / Raynaud literatures -

   The two literatures taken together, but not
    separately, suggested a novel medical
    hypothesis: -- that dietary fish-oil may be
    beneficial for (at least some) Raynaud
   Corroborated in a controlled clinical trial 2
    years later: B.B. Chang et al., Surgical
    Forum, 39: 324-326, 1988.
Migraine and Magnesium: eleven
     neglected connections
   Perspect. Biol. & Med. 31(4): 526-557, 1988
   Reference sources: 63 magnesium, 65 migraine
   Connections: Type A pers., vascular reactivity,
    calcium blockers, platelet activity, spreading
    depression, epilepsy, serotonin, inflammation,
    prostaglandins, substance P, brain hypoxia
   Arrowsmith applied later to 8011 magnesium titles
    and 2756 migraine titles yielded a B-list of 103
    terms that included 9 of the above 11.
      Migraine and Mg -- cont.
    hypothesis and corroborations
   Hypothesis: Mg deficiency may be implicated in
    migraine. About 4 articles published during 20-
    year period before 1988.
   Between 1989 and 1997, more than 12 different
    groups of medical researchers reported a systemic
    or local magnesium deficiency in migraine or a
    favorable response (in 2-6 mo.) of migraine
    patients to dietary supplementation with
    magnesium. 1 report of negative results.

         Migraine AND Magnesium --
            before and after 1988
          As of mid-03, total number of articles in
          medical literature indexed with both terms
          is about 60.
          [Figure: number of articles per year on a time line from
          1970 to 2002; annotation: "MigMag88, a stimulus?"]
        The natural history of

The preceding slide suggests that a similar
time-line plot might be of value for any
relatively small intersection -- to determine
primarily if there is a key year before
which articles were few in number and
scattered and after which a substantial
increase took place. Such a pattern could
represent a new and relatively unknown
scientific discovery -- literature based or not.

    Somatomedin C and arginine: implicit
    connections between mutually isolated

   Perspect. Biol. & Med. 33(2): 157-186, 1990
   Ref sources: 85 somatomedin C, 51 arginine
   Connections: growth hormone, malnutrition,
    acromegaly, protein synthesis, lean body mass
    anabolic effects, immune function, wound healing.
   Arrowsmith applied later to 3244 arginine titles
    and 1162 SmC titles yielded B-list of 160 terms,
    that included 4 of above 7.

     Somatomedin C and arginine

   Hypothesis: Anabolic effects of arginine
    are accompanied by and possibly due to
    systemic or local release of SmC.
   Tested 3, 5, 8 years later: 1 neg., 3 pos.
    Corpas, Endocrine Revs 14: 20-39, 1993 -
    Kirk et al, Surgery 114: 155-160, 1993 +
    Hurson et al, JPEN: 227-230, 1995          +
    Chevalley et al, Bone 23(2):103-9, 1998 +
      Assessing a gap in the biomedical
    literature: magnesium deficiency and
              neurologic disease.

   Authors: Smalheiser and Swanson
    Neuroscience Res Comm. 15(1):1-8, 1994
   A: Graded dietary manipulation of Mg.
   C: Neurologic diseases.
   B: NMDA-receptor-mediated excitotoxicity

      A     Mg++          B          C

    Indomethacin and Alzheimer’s disease

   Authors: Smalheiser and Swanson
    Neurology 46: 583, 1996.
   A: 5008 titles with “indomethacin”
    C: 7002 titles with “Alzheimer”
   A,C not disjoint; A may be protective in C
   B: 103 terms in edited B-list; connections include
    fluidity, killer, muscarinic, peroxidation, TRH,
   Latter is unexpected, potentially adverse
Linking estrogen to Alzheimer’s disease:
       An informatics approach

   Authors: Smalheiser and Swanson
    Neurology 47: 809-810, 1996
   A: 16300 titles with “(o)estrogen()”
    C: 8200 titles with “Alzheimer”
   A,C not disjoint; 70 articles on both
   Edited B-list: 194 terms.
   Antioxidant activity of estrogen merits
    attention; free radicals implicated in AD.
Calcium-independent phospholipase-A2
          and schizophrenia

   Authors: Smalheiser and Swanson
    Archives of General Psychiatry 55: 752-3, 1998
   A: 54 titles with “calcium-independent
    phospholipase A2”
   C: 21,000 titles with “schizophrenia”
   A AND C: Ross reports that A elevated in C
   B-list: 38 terms, including vitamin E
   Oxidative stress from vit E/Se deficiency increases
    Ca-iPLA2 in lung, liver of rats (Kuo).
Ca-iPLA2 in schizophrenia -- continued

   Chronic oxidative stress may occur in
   Proposed hypothesis: Rats treated as in Kuo
    may have elevated Ca-iPLA2 in serum
    when assayed as in Ross.
   If confirmed, this would provide animal
    model for studying mechanisms and
    consequences of elevated Ca-iPLA2.
    Current R&D at UChicago

1. The ranked A-list with associated B-terms
    offers the possibility of revealing hidden
    relationships among A-terms based on
   B-terms in common.
2. Substitution of essentially random sets of
   articles for either File A or File C or both
   is a mode of investigation that appears to
   be fruitful. Investigation of the role of
   sublanguages is a related area.
 Logical inconsistency in (static)
     probabilistic approach
1. Lexical statistics works as follows: find words that
co-occur with “migraine” significantly more often than
one would expect by chance. These are taken as
interesting “bridge” terms.
2. Thus words that co-occur very few times are
discarded as not interesting. Clearly, “magnesium”
co-occurs rarely with “migraine”. But “magnesium” is
known to be the prime target word, hence most
interesting of all. So which are interesting --
abnormally frequent, or abnormally rare words?
Back to the future --
worlds in collision

 Sublanguages and changing

Each of the worlds to be explored is a
cluster of papers that intercommunicate
and develop their own sublanguage.
As worlds collide, sublanguages
invade each other and bring about
changing frequency distributions.

            Acknowledgment of Support
                1 R01 LM07292-01
               A collaborative grant:
     Univ of Chicago and Univ of Illinois-Chicago
Arrowsmith Data Mining Techniques in neuro-informatics.
Co-sponsored by NLM and NIMH 6/15/01 - 5/31/06
