Essays of an Information Scientist, Vol. 1, p. 84-90, 1962-73. Current Contents, #9, March 4, 1970.
Reprinted from: "Statistical Assoc. Methods for Mechanized Documentation", Symp. Proc. 1964, Washington 1964 (Natl. Bureau of St


         Reprinted from: Mary Elizabeth Stevens, Vincent E. Giuliano, and Laurence B. Heilprin, Eds.,
         Statistical Association Methods for Mechanized Documentation, Symposium Proceedings,
         Washington 1964. (National Bureau of Standards Miscellaneous Publication 269, December
         15, 1965), pp. 189-192.




                                     Can Citation Indexing Be Automated?
                                                             by
                                                Eugene Garfield
                                     Institute for Scientific Information
                                      Philadelphia, Pennsylvania 19106




         The main characteristics of conventional language-oriented indexing systems are
         itemized and compared to the characteristics of citation indexes. The advantages
         and disadvantages are discussed in relation to the capability of the computer
         automatically to simulate human critical processes reflected in the act of citation. It is
         shown that a considerable standardization of document presentations will be
         necessary, and probably not achievable for many years, if we are to achieve automatic
         referencing. On the other hand, many citations, now fortuitously or otherwise omitted,
         might be supplied by computer analyses of text.

This paper considers whether, by man or machine, we can simulate the process of "documenting," the process by which authors provide reference citations to pertinent and usually earlier documents. My paper does not concern the manipulative or mechanical problems of automatically compiling or printing citation indexes. The existence of the Science Citation Index is adequate testimony to the ability of the computer rapidly to sort, edit, and print large-scale citation indexes. [1]

My paper also does not consider the problem of automatically recognizing (reading) and/or extracting explicit citations appearing in published documents by use of character-recognition devices. Programming such a device will require the resolution of fantastic syntactic problems even if the machine has a universal multifont reading capability. For example, in the citation "J. Chem. Soc. 1964, 1963," which number is the year and which the page number? These are not trivial problems. To handle the vagaries of bibliographic syntax we "pre-edit" all documents before keypunching the citation data needed for the Science Citation Index. We also "post-edit" both by computer and human editing procedures. Do not confuse the "automatic" or "routine" nature of citation indexing with a syntactically
intelligent automation. Our citation indexers do not require subject-matter competence, but they do require considerable bibliographic training. The diverse and unstandardized citation practices in the world's literature make this necessary. In addition, there are linguistic variations in names and publication titles which must be handled. Our citation indexers essentially must be trained in descriptive cataloging.

My paper does concern the ability of an artificially intelligent machine to deal with, among other things, the implicit reference citation as distinguished from the explicit reference citation. Such might be the case in a paper where the author, for one reason or another, has neglected to provide a pertinent bibliography. The editor of a scientific journal would ask such an automaton to supply all "pertinent" references, if for no other reason than to make certain the research was original. Citations are generally used to provide "documentation" or support for specific statements. However, reference citations are also provided in papers for numerous reasons including, among others:

 1. Paying homage to pioneers
 2. Giving credit for related work (homage to peers)
 3. Identifying methodology, equipment, etc.
 4. Providing background reading
 5. Correcting one's own work
 6. Correcting the work of others
 7. Criticizing previous work
 8. Substantiating claims
 9. Alerting to forthcoming work
10. Providing leads to poorly disseminated, poorly indexed, or uncited work
11. Authenticating data and classes of fact (physical constants, etc.)
12. Identifying original publications in which an idea or concept was discussed
13. Identifying original publication or other work describing an eponymic concept or term as, e.g., Hodgkin's Disease, Pareto's Law, Friedel-Crafts Reaction, etc.
14. Disclaiming work or ideas of others (negative claims)
15. Disputing priority claims of others (negative homage)

The problem of identifying all "pertinent" references, to support implicit citations, is a special case of the general problem of automatic indexing. It has previously been reported that machines can index or abstract by use of key words in context taken from titles, [2] by use of statistically significant sentences, [3] kernels, [4] etc. O'Connor has recently reviewed these methods, [5] as has Artandi. [6] Associative methods have been widely discussed by Stiles, [7] Maron, [8] Giuliano, [9] etc. All of these systems, however, are concerned with indexing by use of the text only. Bibliographic citations are regarded as metalinguistic elements.

Recently, however, Salton [10] has discussed the use of bibliographic citations as indicators of document content. Essentially he proposes to treat citations as descriptors, which may seem strange to those who think in terms of conventional indexing. Indexers do not ordinarily think of citations (addresses of cited documents) as descriptions of the citing document. However, that does not alter the fact that they are. [11]

Citations (document addresses) are brief representations of the documents they identify. As one sacrifices compactness, such as is found in serial numbers for patents, [12] and expands to full titles and then to abstracts, one sees the gradual enlargement of the document description toward the complete text. In this transition from "citation" to "document," redundancy is introduced as well as additional information content. Indeed, a document and a citation approach equality as the depth of indexing decreases (from the full text) and the length of the citation increases. This corresponds to my earlier definition of the document as the set of descriptors which describe it. [13] In an information retrieval system, information content can be measured only on the basis of indexed information that is supplied in the indexing process. By this definition a document is a unique combination of descriptors not assigned to any other document in the collection. In most thesaurus-based collections indexing is not sufficiently deep to achieve such uniqueness. However, the combination of conventional subject headings or descriptors with the bibliographic citations used as references increases our ability to describe documents uniquely and specifically. Indeed, those who have studied citation indexes and so-called bibliographic coupling are well aware that only a small number of reference citations are needed to isolate uniquely a particular document in the collection from all others. [11] That is why a search of a citation index generally produces a highly selective and useful search result.

In discussing citation indexing it is frequently stated that weaknesses of the method include under-citation (the deliberate or unwitting failure to cite pertinent literature) and over-citation (the excessive reference to presumably non-pertinent literature). Under-citation is illustrated by the patent literature, since there is an economic motivation to cloud rather than clarify the information disclosed in a patent. However, the patent examiner, otherwise motivated, attempts to clarify the prior art by providing a list of "references cited". [14] Suppose, however, the patent examiner, or a journal editor, wishes to examine a document quite critically and asks that the "machine" provide all the pertinent documentation or prior art. This brings me once again to the main theme of my paper.

To answer the question "Can citation indexing be automated," as we have seen, obviously entails a discussion of the entire range of question-answering problems encountered in designing any information retrieval system. Consideration of the automatic procedure for supplying reference citations, when they are missing, merely focuses attention on the complex indexing task performed by the author when he does give pertinent reference citations. Such considerations help us focus attention on the significant differences between a priori and a posteriori indexing. [15] Since each person may interpret the meaning or significance of words and documents differently, the problem we are dealing with inevitably involves the human ability to create novelty, to invent, to discover, and to be critical.

Are machines, or machinelike people, capable of imitating or simulating the human process of being critical? What are the peculiarly "human" earmarks of certain sentences containing citations? When do such sentences contain implicit citations that could be supplied by an intelligent machine and when would this appear to be difficult or impossible?

Consider the following example: "Mr. X, an impossible idiot, has recently published a paper on gobbledygook. The conclusions reported in his paper are wrong as are the data on which the conclusions are based. The recommendations made by Mr. X, on the basis of his conclusions, will be a calamity for mankind."

In polite circles, this is called the critical review. Obviously, "intelligent" machines are not yet ready to generate such criticism. Or at least programmers are not yet able to program machines to prepare such critiques. If they were, then the paper by Mr. X would probably never have appeared because the same artificial intelligence would have been available to tell him that his data were wrong before he published and why! (If he persisted in publishing, we probably would have identified a quality common to humans, but invariably attributed to machines: stupidity.)

The first sentence in the example illustrates the case for an implicit citation that our machine ought to be able to provide. What could be more simple than the kernel sentence "Mr. X has published," which one would hope could be the result of a transformational analysis when such methods are perfected. Such an analysis combined with a complete computer listing of the papers by Mr. X is a good starting point. Since we know that this is not sufficiently specific we must then expect of the linguistic analysis "Mr. X has published on gobbledygook" and then we have reduced the computer search to the "simple" task of identifying the one paper out of the thousands by men named X to those which concern gobbledygook. Alas, this simple task alone requires the resolution of all the linguistic and semantic problems associated with matching the word "gobbledygook" with the possibly different words in the title of the implicitly cited paper or book. Indeed, there is no reason at all to assume the same word has occurred either in the title or the text of the "cited" work. If these problems were not sufficient, keep in mind that the word "recently" is quite significant in the example chosen because it stresses
the possibility that Mr. X may have written extensively on gobbledygook and it is only one particular, or a few recent papers, that is the target for discussion.

Fortunately authors usually do provide, explicitly, the citations needed to support such sentences. As a consequence the citation index, created by human indexers, does correlate the cited work with the critical statements which appear in the second and third sentences of the example paragraph. This feature of the citation index alone would have justified its creation. However, it is interesting to speculate whether transformational or any other automatic analysis of such a paragraph could produce a useful additional "marker" which would describe briefly the kind of relationship that exists between the citing and cited documents.

These "markers" would appear in the published citation index along with the usual citation data. In the case of the paragraph above, for example, "critique" or one of several other terse statements like "Mr. X is wrong," "data spurious," "conclusions wrong," "calamity for mankind," etc., might be appropriate. The "intelligent" machine would examine a new document and generate a critical statement such as "rather poor paper." As we have seen above, a less intelligent machine might analyze the paragraph and conclude that a bibliographic citation to the work of Mr. X is missing and needed. The machine might also conclude that the cited work was under "critical" discussion because of certain syntactic or vocabulary characteristics associated with "critical." Presumably they would be identified by transformational or other sophisticated analyses not yet available. This would be no mean accomplishment. Among other nontrivial problems is the fact that the information needed to assign the marker can be spread throughout, not in a single sentence of, the source paper.

O'Connor's studies on the term "toxicity" are quite pertinent to this problem because the problems have in common the need to discover methods for assigning descriptions of documents which are subject to considerable variation. [16] What is toxic to one man may be euphoric to another!

To examine a document from the "citation" point of view, to determine what reference citations could or should be provided which link the sentence, phrase, or word in question to man's prior recorded knowledge, is to say the least a formidable challenge. The task is an excellent exercise for new journal editors. To follow the "citation" method of appraising a paper is in essence to challenge rigorously each statement in that paper. If an author does not provide documentation for statements it does not mean that they are false. However, they should ideally be supported by a "reference" to some prior document, conversation, etc.

It would appear that in the "ideally" documented paper almost every sentence or phrase could be interpreted to require reference to the past. While one can accept intuitively the notion that there are novel sentences that one can express in English, novel concepts appear to be comparatively rare. Most novel combinations of words, punctuation, etc. could be transformed into concepts that had appeared before. Indeed, patent examiners like to remind inventors of this when disclosing generic concepts, alone or in combination, which anticipate specific embodiments.

I recently did an experiment with a group of my students at the University of Pennsylvania in which I asked them to read a paper published in the Journal of Chemical Documentation [13] which contained no bibliographic citations. The reason this paper did not have a bibliography is simple. Many published papers don't have bibliographies for similar reasons. The paper was originally presented at a meeting. The editor of the journal asked for a copy, but it was published without the bibliography which obviously was not needed in the oral presentation.

Each student was asked to supply the missing bibliography for this paper. Twelve students were involved in the experiment. One student assigned 12 references while another assigned 75. The average was about 40. This is not surprising, as a considerable amount of literature was reviewed in the paper. The bibliography could have been expanded to hundreds of items if the common German practice were adopted of giving a complete list of papers every time a topic is mentioned. Thus, in a discussion of information theory where I felt one citation was sufficient, someone else might have cited numerous related works.

The comments above are intended to give you a feeling for the problem we face in automating citation indexing. It is a wide open area of research and it will take us into every fundamental area of textual analysis, something comparable to exegesis. [17] It is apparent that each author restricts his use of reference citations according to the importance he places on the statements involved. From our knowledge of quantitative citation data, a doubling or trebling of the number of citations in the average paper would not overload the system from the user's viewpoint. The average paper that was cited in 1961 was cited about 1.5 times. [18] To double the amount of citation would not even double this figure, because not the exact same set of papers would be cited. However, even if we did significantly increase the average number of references to a particular work, we would then give consideration to a more specific approach to citations. This is well illustrated in the citations to books where one finds the list of sources subdivided by the page cited. This only adds an additional dimension in the specificity of citation indexing. There is no reason why this same principle cannot be extended to the paragraph, sentence, or word. Indeed, this is exactly what happens in exegesis.
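The remark that doubling the amount of citation would not double the figure of 1.5 citations per cited paper can be checked with a small worked example. Only the ratio 1.5 comes from the text; the paper labels and counts below are invented for illustration and are not the 1961 data.

```python
def average_citations_per_cited_paper(citations):
    """Mean number of citations received by each paper cited at least once."""
    counts = {}
    for paper in citations:
        counts[paper] = counts.get(paper, 0) + 1
    return sum(counts.values()) / len(counts)


# Hypothetical base year: 6 citations spread over 4 distinct cited papers.
base = ["A", "A", "A", "B", "C", "D"]

# Twice as many citations, but half of the new ones go to papers
# (E, F, G) that were not cited at all before.
doubled = base + ["A", "B", "C", "E", "F", "G"]

print(average_citations_per_cited_paper(base))                # 1.5
print(round(average_citations_per_cited_paper(doubled), 2))   # 1.71
```

The average rises from 1.5 to about 1.71 rather than to 3.0, because the set of cited papers grows along with the number of citations.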




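The subdivision of a cited work's entry by the page cited, and in principle by paragraph, sentence, or word, can be pictured as a nested mapping. The sketch below is a hypothetical illustration: the record layout, the function name `build_index`, and the sample entries are all assumptions, and it does not describe how the Science Citation Index was actually compiled.

```python
from collections import defaultdict


def build_index(citations):
    """Group citing documents under each cited work, subdivided by locator.

    `citations` is an iterable of (citing, cited, locator) triples, where
    locator is a page number, or None when the work is cited as a whole.
    """
    index = defaultdict(lambda: defaultdict(list))
    for citing, cited, locator in citations:
        index[cited][locator].append(citing)
    return index


# Invented sample data, for illustration only.
sample = [
    ("Paper 1", "Book Q", 12),
    ("Paper 2", "Book Q", 12),
    ("Paper 3", "Book Q", 87),
    ("Paper 4", "Book Q", None),
]

index = build_index(sample)

# List page-level entries in page order, then whole-work citations last.
entries = sorted(index["Book Q"].items(),
                 key=lambda kv: (kv[0] is None, kv[0] or 0))
for locator, citing in entries:
    label = "whole work" if locator is None else f"p. {locator}"
    print(label, citing)
```

Replacing the page number with a finer locator, such as a (page, paragraph) pair, would need a different sort key but no other change to the structure, which is the sense in which page-level citation only adds one more dimension of specificity.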
                                   REFERENCES

 1. Garfield E and Sher I H. Science Citation Index, 2672 pp. (Institute for Scientific Information, Philadelphia, Pa. 1963).
 2. Luhn H P. Keyword-in-context index for technical literature (KWIC index), ASDD Rept. RC-127 (IBM, Yorktown Heights, N. Y., Aug. 31, 1959).
 3. Luhn H P. The automatic creation of literature abstracts. IBM J. Res. and Devel. 2:159-65, 1958.
 4. Harris Z S. Linguistic transformation for information retrieval, Proc. Intern. Conf. Sci. Inform. 1958, vol. 2, 937-950 (Natl. Acad. Sci., Washington, D. C., 1959).
 5. O'Connor J. Mechanical indexing methods and their testing, AD #409,276, J. Assoc. Comp. Mach. 11:437-49, 1964.
 6. Artandi S. A selective bibliographic survey of automatic indexing methods, Special Libraries 54:630-34, 1963.
 7. Stiles H E. The association factor in information retrieval, J. Assoc. Comp. Mach. 8:271-79, 1961.
 8. Maron M E. Automatic indexing: an experimental inquiry, J. Assoc. Comp. Mach. 8:404-17, 1961.
 9. Giuliano V E. Analog networks for word association, IEEE Trans. Mil. Elec. MIL-7, 221-34, 1963.
10. Salton G. Associative document retrieval techniques using bibliographic information, J. Assoc. Comp. Mach. 10:440-57, 1963.
11. Garfield E. The Science Citation Index - a new dimension in indexing. Sci. 144:649-54, 1964.
12. Garfield E. Forms for literature citations, Sci. 120:1030-40, 1954.
13. Garfield E. Information theory and other quantitative factors in code design for document card systems, J. Chem. Doc. 1:70-75, 1961.
14. Garfield E. Breaking the subject index barrier - a citation index for chemical patents, J. Patent Office Soc. 39:583-95, 1957.
15. Garfield E. Citation indexes - new paths to scientific knowledge, Chem. Bull. (Chicago) 43(4):11-12, 1956.
16. O'Connor J. Mechanical indexing studies of MSD, toxicity (DDC No. not yet assigned. Contact author for copies c/o Institute for Scientific Information).
17. Garfield E. Citation indexes to the Old Testament. Am. Documentation Inst. (Nov. 1955).
18. Garfield E. Citation indexes in sociological and historical research, Am. Documentation 14:289-91, 1963.


