Essays of an Information Scientist, Vol:1, p.84-90, 1962-73. Current Contents, #9, March 4, 1970
Reprinted from: Mary Elizabeth Stevens, Vincent E. Giuliano, and Laurence B. Heilprin, Eds.,
Statistical Association Methods for Mechanized Documentation, Symposium Proceedings,
Washington 1964. (National Bureau of Standards Miscellaneous Publication 269, December
15, 1965), pp. 189-192.
Can Citation Indexing Be Automated?
Institute for Scientific Information
Philadelphia, Pennsylvania 19106
The main characteristics of conventional language-oriented indexing systems are
itemized and compared to the characteristics of citation indexes. The advantages
and disadvantages are discussed in relation to the capability of the computer
automatically to simulate human critical processes reflected in the act of citation. It is
shown that a considerable standardization of document presentations will be necessary,
and probably not achievable for many years, if we are to achieve automatic
referencing. On the other hand, many citations, now fortuitously or otherwise omitted,
might be supplied by computer analyses of text.
This paper considers whether, by man or machine, we can simulate the process of
“documenting,” the process by which authors provide reference citations to pertinent
and usually earlier documents. My paper does not concern the manipulative or
mechanical problems of automatically compiling or printing citation indexes. The
existence of the Science Citation Index® is adequate testimony to the ability of the
computer rapidly to sort, edit, and print large-scale citation indexes.[1]

My paper also does not consider the problem of automatically recognizing (reading)
and/or extracting explicit citations appearing in published documents by use of
character-recognition devices. Programming such a device will require the resolution
of fantastic syntactic problems even if the machine has a universal multifont reading
capability. For example, in the citation, “J. Chem. Soc. 1964, 1963,” which number is
the year and which the page number? These are not trivial problems. To handle the
vagaries of bibliographic syntax we “pre-edit” all documents before keypunching the
citation data needed for the Science Citation Index. We also “post-edit” both by
computer and human editing procedures. Do not confuse the “automatic” or “routine”
nature of citation indexing with a syntactically
intelligent automation. Our citation indexers do not require subject-matter competence,
but they do require considerable bibliographic training. The diverse and unstandardized
citation practices in the world’s literature make this necessary. In addition, there are
linguistic variations in names and publication titles which must be handled. Our
citation indexers essentially must be trained in descriptive cataloging.

My paper does concern the ability of an artificially intelligent machine to deal with,
among other things, the implicit reference citation as distinguished from the explicit
reference citation. Such might be the case in a paper where the author, for one reason
or another, has neglected to provide a pertinent bibliography. The editor of a scientific
journal would ask such an automaton to supply all “pertinent” references, if for no
other reason than to make certain the research was original. Citations are generally
used to provide “documentation” or support for specific statements. However, reference
citations are also provided in papers for numerous reasons including, among others:

1. Paying homage to pioneers
2. Giving credit for related work (homage to peers)
3. Identifying methodology, equipment, etc.
4. Providing background reading
5. Correcting one’s own work
6. Correcting the work of others
7. Criticizing previous work
8. Substantiating claims
9. Alerting to forthcoming work
10. Providing leads to poorly disseminated, poorly indexed, or uncited work
11. Authenticating data and classes of fact: physical constants, etc.
12. Identifying original publications in which an idea or concept was discussed.
13. Identifying original publication or other work describing an eponymic concept or
term as, e.g., Hodgkin’s Disease, Pareto’s Law, Friedel-Crafts Reaction, etc.
14. Disclaiming work or ideas of others (negative claims)
15. Disputing priority claims of others (negative homage)

The problem of identifying all “pertinent” references, to support implicit citations, is
a special case of the general problem of automatic indexing. It has previously been
reported that machines can index or abstract by use of key words in context taken from
titles,[2] by use of statistically significant sentences,[3] kernels,[4] etc. O’Connor has
recently reviewed these methods,[5] as has Artandi.[6] Associative methods have been
widely discussed by Stiles,[7] Maron,[8] Giuliano,[9] etc. All of these systems, however,
are concerned with indexing by use of the text only. Bibliographic citations are
regarded as metalinguistic elements.

Recently, however, Salton[10] has discussed the use of bibliographic citations as
indicators of document content. Essentially he proposes to treat citations as
descriptors, which may seem strange to those who think in terms of conventional
indexing. Indexers do not ordinarily think of citations (addresses of cited documents)
as descriptions of the citing document. However, that does not alter the fact that they
are.[11]

Citations (document addresses) are brief representations of the documents they
identify. As one sacrifices compactness, such as is found in serial numbers for
patents,[12] and expands to full titles and then to abstracts, one sees the gradual
enlargement of the document description toward the complete text. In this transition
from “citation” to “document,” redundancy is introduced as well as additional
information content. Indeed, a document and a citation approach equality as the depth
of indexing decreases (from the full text) and the length of the citation increases. This
corresponds to my earlier definition of the document as the set of descriptors which
describe it.[13] In an information retrieval system, information content can be measured
only on the basis of indexed information that is supplied in the indexing process. By
this definition a document is a unique combination of descriptors not assigned to any
other document in the collection. In most thesaurus-based collections indexing is not
sufficiently deep to achieve such uniqueness. However, the combination of conventional
subject headings or descriptors with the bibliographic citations used as references
increases our ability to describe documents uniquely and specifically. Indeed, those
who have studied citation indexes and so-called bibliographic coupling are well aware
that only a small number of reference citations are needed to isolate uniquely a
particular document in the collection from all others.[11] That is why a search of a
citation index generally produces a highly selective and useful search result.

In discussing citation indexing it is frequently stated that weaknesses of the method
include under-citation (the deliberate or unwitting failure to cite pertinent literature)
and over-citation (the excessive reference to presumably non-pertinent literature).
Under-citation is illustrated by the patent literature, since there is an economic
motivation to cloud rather than clarify the information disclosed in a patent. However,
the patent examiner, otherwise motivated, attempts to clarify the prior art by providing
a list of “references cited.”[14] Suppose, however, the patent examiner, or a journal
editor, wishes to examine a document quite critically and asks that the “machine”
provide all the pertinent documentation or prior art. This brings me once again to the
main theme of my paper.

To answer the question “Can citation indexing be automated,” as we have seen,
obviously entails a discussion of the entire range of question-answering problems
encountered in designing any information retrieval system. Consideration of the
automatic procedure for supplying reference citations, when they are missing, merely
focuses attention on the complex indexing task performed by the author when he does
give pertinent reference citations. Such considerations help us focus attention on the
significant differences between a priori and a posteriori indexing.[15] Since each person
may interpret the meaning or significance of words and documents differently, the
problem we are dealing with inevitably involves the human ability to create novelty, to
invent, to discover, and to be critical.

Are machines, or machinelike people, capable of imitating or simulating the human
process of being critical? What are the peculiarly “human” earmarks of certain
sentences containing citations? When do such sentences contain implicit citations that
could be supplied by an intelligent machine and when would this appear to be difficult
or impossible? Consider the following example:

“Mr. X, an impossible idiot, has recently published a paper on gobbledygook. The
conclusions reported in his paper are wrong as are the data on which the conclusions
are based. The recommendations made by Mr. X, on the basis of his conclusions, will
be a calamity for mankind.”

In polite circles, this is called the critical review. Obviously, “intelligent” machines
are not yet ready to generate such criticism. Or at least programmers are not yet able to
program machines to prepare such critiques. If they were, then the paper by Mr. X
would probably never have appeared because the same artificial intelligence would have
been available to tell him that his data were wrong before he published and why! (If he
persisted in publishing, we probably would have identified a quality common to
humans, but invariably attributed to machines: stupidity.)

The first sentence in the example illustrates the case for an implicit citation that our
machine ought to be able to provide. What could be more simple than the kernel
sentence “Mr. X has published,” which one would hope could be the result of a
transformational analysis when such methods are perfected. Such an analysis combined
with a complete computer listing of the papers by Mr. X is a good starting point. Since
we know that this is not sufficiently specific we must then expect of the linguistic
analysis “Mr. X has published on gobbledygook” and then we have reduced the
computer search to the “simple” task of identifying the one paper out of the thousands
by men named X to those which concern gobbledygook. Alas, this simple task alone
requires the resolution of all the linguistic and semantic problems associated with
matching the word “gobbledygook” with the possibly different words in the title of the
implicitly cited paper or book. Indeed, there is no reason at all to assume the same
word has occurred either in the title or the text of the “cited” work. If these problems
were not sufficient, keep in mind that the word “recently” is quite significant in the
example chosen because it stresses
the possibility that Mr. X may have written extensively on gobbledygook and it is only
one particular, or a few recent papers, that is the target for discussion.

Fortunately authors usually do provide, explicitly, the citations needed to support such
sentences. As a consequence the citation index, created by human indexers, does
correlate the cited work with the critical statements which appear in the second and
third sentences of the example paragraph. This feature of the citation index alone would
have justified its creation. However, it is interesting to speculate whether
transformational or any other automatic analysis of such a paragraph could produce a
useful additional “marker” which would describe briefly the kind of relationship that
exists between the citing and cited documents.

These “markers” would appear in the published citation index along with the usual
citation data. In the case of the paragraph above, for example, “critique” or one of
several other terse statements like “Mr. X is wrong,” “data spurious,” “conclusions
wrong,” “calamity for mankind,” etc., might be appropriate. The “intelligent” machine
would examine a new document and generate a critical statement such as “rather poor
paper.” As we have seen above, a less intelligent machine might analyze the paragraph
and conclude that a bibliographic citation to the work of Mr. X is missing and needed.
The machine might also conclude that the cited work was under “critical” discussion
because of certain syntactic or vocabulary characteristics associated with “critical.”
Presumably they would be identified by transformational or other sophisticated analyses
not yet available. This would be no mean accomplishment. Among other nontrivial
problems is the fact that the information needed to assign the marker can be spread
throughout, not in a single sentence of, the source paper.

O’Connor’s studies on the term “toxicity” are quite pertinent to this problem because
the problems have in common the need to discover methods for assigning descriptions
of documents which are subject to considerable variation.[16] What is toxic to one man
may be euphoric to another!

To examine a document from the “citation” point of view, to determine what reference
citations could or should be provided which link the sentence, phrase, or word in
question to man’s prior recorded knowledge, is to say the least a formidable challenge.
The task is an excellent exercise for new journal editors. To follow the “citation”
method of appraising a paper is in essence to challenge rigorously each statement in
that paper. If an author does not provide documentation for statements it does not mean
that they are false. However, they should ideally be supported by a “reference” to some
prior document, conversation, etc.

It would appear that in the “ideally” documented paper almost every sentence or phrase
could be interpreted to require reference to the past. While one can accept intuitively
the notion that there are novel sentences that one can express in English, novel
concepts appear to be comparatively rare. Most novel combinations of words,
punctuation, etc. could be transformed into concepts that had appeared before. Indeed,
patent examiners like to remind inventors of this when disclosing generic concepts,
alone or in combination, which anticipate specific embodiments.

I recently did an experiment with a group of my students at the University of
Pennsylvania in which I asked them to read a paper published in the Journal of
Chemical Documentation[13] which contained no bibliographic citations. The reason this
paper did not have a bibliography is simple. Many published papers don’t have
bibliographies for similar reasons. The paper was originally presented at a meeting. The
editor of the journal asked for a copy, but it was published without the bibliography
which obviously was not needed in the oral presentation.

Each student was asked to supply the missing bibliography for this paper. Twelve
students were involved in the experiment. One student assigned 12 references while
another assigned 75. The average was about 40. This is not surprising, as a considerable
amount of literature was reviewed in the paper. The bibliography could have been
expanded to hundreds of items if the common German practice were adopted of giving a
complete list of papers every time a topic is mentioned. In a discussion of information
theory where I felt one citation was sufficient, someone else might have cited numerous
related works.

The comments above are intended to give you a feeling for the problem we face in
automating citation indexing. It is a wide open area of research and it will take us into
every fundamental area of textual analysis, something comparable to exegesis.[17] It is
apparent that each author restricts his use of reference citations according to the
importance he places on the statements involved. From our knowledge of quantitative
citation data, a doubling or trebling of the number of citations in the average paper
would not overload the system from the user’s viewpoint. The average paper that was
cited in 1961 was cited about 1.5 times.[18] To double the amount of citation would not
even double this figure, because not the exact same set of papers would be cited.
However, even if we did significantly increase the average number of references to a
particular work, we would then give consideration to a more specific approach to
citations. This is well illustrated in the citations to books where one finds the list of
sources subdivided by the page cited. This only adds an additional dimension in the
specificity of citation indexing. There is no reason why this same principle cannot be
extended to the paragraph, sentence, or word. Indeed, this is exactly what happens in
exegesis.
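
The descriptor argument made in this paper (a document is the set of descriptors assigned to it, coarse subject headings alone seldom make that set unique, and a small number of reference citations usually suffices to isolate a document) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration only: the four documents, the ref_N identifiers, and the function names are hypothetical and are not taken from the Science Citation Index or from any real system.

```python
from itertools import combinations

# Hypothetical toy collection (all identifiers invented for illustration).
# Each document has coarse subject headings and a list of cited references.
COLLECTION = {
    "doc_a": {"subjects": {"indexing"}, "cites": {"ref_1", "ref_2"}},
    "doc_b": {"subjects": {"indexing"}, "cites": {"ref_1", "ref_3"}},
    "doc_c": {"subjects": {"indexing"}, "cites": {"ref_2", "ref_3"}},
    "doc_d": {"subjects": {"patents"}, "cites": {"ref_4"}},
}


def descriptor_set(doc, use_citations):
    """Return the document as one set of descriptors (subjects, optionally citations)."""
    subjects = {("subject", s) for s in doc["subjects"]}
    cites = {("cites", c) for c in doc["cites"]} if use_citations else set()
    return frozenset(subjects | cites)


def is_unique(collection, use_citations):
    """True if no two documents in the collection share the same descriptor set."""
    sets = [descriptor_set(d, use_citations) for d in collection.values()]
    return len(set(sets)) == len(sets)


def smallest_isolating_citations(collection, target):
    """Smallest group of the target's citations that no other document cites in full."""
    own = sorted(collection[target]["cites"])
    others = [d["cites"] for name, d in collection.items() if name != target]
    for size in range(1, len(own) + 1):
        for combo in combinations(own, size):
            if not any(set(combo) <= other for other in others):
                return set(combo)
    return None  # every combination is also cited by some other document


def coupling_strength(collection, a, b):
    """Bibliographic coupling: number of references two documents cite in common."""
    return len(collection[a]["cites"] & collection[b]["cites"])


if __name__ == "__main__":
    print(is_unique(COLLECTION, use_citations=False))
    print(is_unique(COLLECTION, use_citations=True))
    print(smallest_isolating_citations(COLLECTION, "doc_a"))
    print(coupling_strength(COLLECTION, "doc_a", "doc_b"))
```

In this toy collection three documents share the single heading "indexing" and so cannot be told apart by subject alone, while two citations taken together are enough to isolate any one of them. A real collection would of course raise the name-matching and syntax problems discussed above before any such comparison could be made.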
1. Garfield E and Sher I H. Science Citation Index, 2672 pp. (Institute for Scientific
Information®, Philadelphia, Pa. 1963).
2. Luhn H P. Keyword-in-context index for technical literature (KWIC index), ASDD
Rept. RC-127 (IBM, Yorktown Heights, N. Y., Aug. 31, 1959).
3. Luhn H P. The automatic creation of literature abstracts. IBM J. Res. and Devel.
4. Harris Z S. Linguistic transformation for information retrieval, Proc. Intern. Conf. Sci.
Inform, 1958, vol. 2, 937-950 (Natl. Acad. Sci., Washington, D. C., 1959).
5. O'Connor J. Mechanical indexing methods and their testing, AD #409, 276, J. Assoc.
Comp. Mach. 11:437-49, 1964.
6. Artandi J. A selective bibliographic survey of automatic indexing methods, Special
Libraries 54:630-34, 1963.
7. Stiles H E. The association factor in information retrieval, J. Assoc. Comp. Mach. 8:
8. Maron M E. Automatic indexing: an experimental inquiry, J. Assoc. Comp. Mach. 8:
9. Giuliano V E. Analog networks for word association, IEEE Trans. Mil. Elec. MIL-7,
10. Salton G. Associative document retrieval techniques using bibliographic information,
J. Assoc. Comp. Mach. 10:440-57, 1963.
11. Garfield E. The Science Citation Index – a new dimension in indexing. Sci.
12. Garfield E. Forms for literature citations, Sci. 120:1030-40, 1954.
13. Garfield E. Information theory and other quantitative factors in code design for
document card systems, J. Chem. Doc. 1:70-75, 1961.
14. Garfield E. Breaking the subject index barrier – a citation index for chemical patents,
J. Patent Office Soc. 39:583-95, 1957.
15. Garfield E. Citation indexes – new paths to scientific knowledge, Chem. Bull. (Chicago)
43(4):11-12, 1956.
16. O'Connor J. Mechanical indexing studies of MSD, toxicity (DDC No. not yet assigned.
Contact author for copies c/o Institute for Scientific Information).
17. Garfield E. Citation indexes to the Old Testament. Am. Documentation Inst.
18. Garfield E. Citation indexes in sociological and historical research, Am. Documentation