The Intell-Index System:
Using NLP Techniques to organize a dynamic text browser
Nina Wacholder
Rutgers University
nina@scils.rutgers.edu
1. Introduction¹
The key components of phrase browsing systems are index terms, coherent natural
language expressions that represent document content. A set of index terms
retrieved for a specific document constitutes a document surrogate. When linked
to occurrences of the phrases in a text, the index terms for a collection of
documents function as flexible access points to collection content.

In this paper, we briefly describe Intell-Index, a phrase browsing system
designed to provide an environment for developing natural language processing
techniques for automatic identification of index terms and for testing their
effectiveness. We describe an early version of Intell-Index, which was designed
to establish proof of concept (the system is currently being re-implemented),
and we discuss the reasons for choosing simplex noun phrases (defined below) as
the unit for index terms.

2. Intell-Index
Intell-Index (www.cs.columbia.edu/~nina/IntellIndex) is a dynamic text browser,
a user-centered system designed to give information seekers with unknown
information needs access to the content of a large collection of text
documents. Examples of some of the index terms included for a collection of
documents on nuclear armaments are the following:

    ability
    political ability
    accord
    Trilateral Accord
    Background De-Nuclearization Accords
    subsequent post-Soviet accord
    bilateral accord
    bilateral nuclear cooperation accord

These index terms were automatically identified by LinkIT, a system for
identifying significant topics in full-text documents (Evans et al. 2000 [4]).

In addition to providing users with the ability to browse the entire list of
index terms for a document collection, Intell-Index allows the user to search
for subsets of terms. This aspect of Intell-Index is sometimes misunderstood,
apparently because of the misconception that browsing systems and searching
systems are necessarily distinct. This functionality allows the user to
specify, for example, that they are only interested in occurrences of the
string act in which the first letter is capitalized. In the previously
mentioned collection of documents on nuclear disarmament, this brings together
a list of laws and international agreements, including:

    FY1998 Foreign Operations Appropriation Act
    Iran Missile Proliferation Sanctions Act
    Iran-Iraq Arms Non-Proliferation Act

The specification that the A be capitalized in the last word of the phrase
allows the user to roughly restrict the search to proper names, thereby
excluding terms in which act is a common noun, for example in phrases such as
dangerous act.

In principle, Intell-Index is designed to incorporate as many ways as possible
to enable users to narrow down the set of search terms they need to browse from
among all of the index terms in the collection. But how a set of terms can be
narrowed down depends in large part on what terms the set consists of in the
first place. In what follows, we focus on a very basic issue, the phrasal
properties of index terms, and on the implications that decisions about what
kinds of phrases to include in the browsing system have for the size and
organization of the index.

3. Noun phrases
A noun phrase (NP) is a coherent linguistic unit that refers to an idea,
concept or event. The head (or most important word) in an NP, based on
syntactic and semantic considerations, is a noun. Examples of noun phrases are
digital library (library is the head), technology of phrase browsing
applications (technology is the head) and metadata (metadata is the head of
this single word NP). Print indexes consist primarily of noun phrases, as a
scan of almost any back-of-the-book index will show; Anick and Vaithyanathan
1997 [1] and Wacholder 1998 [7] suggest that phrase browsing systems should
consist primarily of NPs.²

¹ This work was conducted at the Columbia University Center for Research on
Information Access.
² We do not discuss the issue of whether to include verb phrases in indexes.
However, there are many definitions of NPs, and different phrase browsing
systems use different algorithms to identify NPs. We can usefully distinguish
between two types of NPs, simplex NPs (sometimes called base NPs) and complex
NPs. In what follows, we use the definition from Wacholder 1998 [7]; there are
other variations of these definitions.

Simplex NPs are maximal noun phrases that include head nouns and preceding
content-bearing modifiers, but not postmodifiers. So metadata is a simplex NP,
as is digital library. For proper names, a simplex NP is a term which refers to
a single entity, e.g., Museum of the City of New York.

Complex NPs are noun phrases with a postmodifier such as a prepositional phrase
or a participial phrase after the head of the NP. Digital library for the
humanities is a complex NP because it consists of a noun phrase, digital
library, followed by a prepositional phrase, for the humanities. The complex NP
(the) tasks building such a resource implied³ consists of an NP, (the) tasks,
followed by a verbal phrase, building such a resource implied.

Complex NPs are longer and therefore more specific than simplex ones. If they
could be reliably identified, it would presumably be best to use maximal NPs
(i.e., the largest NP, regardless of how many NPs and other phrases it
includes) for phrase browsing systems. Unfortunately, reliable identification
of the boundaries of complex NPs is a hard natural language processing problem.
This is due to the ambiguity of natural language. For example, in a sentence
such as Lee saw Robin in the park with a telescope, a parser cannot tell from
syntactic or semantic evidence whether Lee, Robin, or even the park has the
telescope.

In contrast, simplex NPs can be relatively reliably identified using regular
expressions over text that has been tagged with grammatical part-of-speech.
Simplex NPs also make it simple to use head sorting for organizing index terms
and presenting them to users, at least in English, where the head of a simplex
NP is reliably its last word. Head sorting simply brings together terms that
share a common head, for example:

    oil filter
    smut filter
    chemical filter
    paper filter

This set of phrases excludes other phrases which include the word filter as a
premodifier but not as a head, as, for example:

    filter paper
    filter factory
    digital filter design

Since the heads of index terms are known, Intell-Index provides users with the
ability to specify whether a string for which they are searching is in the head
or premodifier of the NP, thereby giving the user control over this aspect of
the results.

Besides simplex and complex NPs, one other type of expression is sometimes used
to identify index terms in text: adjacent sequences of words that are repeated
in the text more often than some threshold. In one variation of this approach,
any adjacent sequence of words is included as a candidate term; in another,
only repeated sequences of words that form NPs are used. The latter alternative
is used by Anick and Vaithyanathan 1997 [1], who also note the main
disadvantage of this approach: important single word NPs such as metadata or
biochemistry are ruled out by the requirement that the NPs consist of more than
one word. The former alternative, that of using adjacent sequences of words,
regardless of grammatical category, is used by Chen et al. 1998⁴ [2] and
Nevill-Manning et al. 1999 [5]. In addition to excluding single word index
terms, this technique has the disadvantage of identifying sequences of words
that do not form linguistic phrases, e.g., definitions using or first large.
These sequences do not have heads and therefore do not represent a coherent
unit. However, there is an important processing advantage to using repeated
sequences of words: they can be identified by simple pattern matching, without
the use of any syntactic information.

Decisions about which kinds of index terms to include in a phrase browsing
system therefore have important implications for the number, specificity and
comprehensibility of the index terms identified in full-text documents. We have
chosen to use simplex NPs in Intell-Index because 1) they can be relatively
reliably identified using shallow syntactic knowledge; 2) they include phrases
of one or more words; 3) they readily support head browsing; and 4) simplex NPs
and appropriate postmodifiers can be combined by post-processing.

The new version of Intell-Index is being designed to provide an environment to
test the usefulness of these different types of NPs as index terms.

³ The full clause from which this complex NP was extracted (Crane et al. 2001,
p.426 [3]) is “Although we knew a great deal about the tasks building such a
resource entailed and the benefits such a resource could provide…”
⁴ Tolle and Chen 2000 [6] use NPs instead of repeated word sequences.

4. REFERENCES

[1] Anick, Peter and Shivakumar Vaithyanathan (1997) “Exploiting clustering and
phrases for context-based information retrieval”, Proceedings of the 20th SIGIR
Conference, July 27-31, 1997, Philadelphia, PA.
[2] Chen, Hsinchun, Joanne Martinez, Amy Kirchhoff, Tobun D. Ng and Bruce R.
Schatz (1998) “Alleviating search uncertainty through concept association:
automatic indexing, co-occurrence analysis and parallel computing”, JASIS
49(3):206-216.
[3] Crane, Gregory, David A. Smith and Clifford E. Wulfman (2001) “Building a
hypertextual digital library in the humanities”, Proceedings of First
ACM/IEEE-CS Joint Conference on Digital Libraries, pp.426-434, July 24-28,
2001.
[4] Evans, David K., Judith Klavans and Nina Wacholder (2000) “Document
processing with LinkIT”, Proc. of the RIAO Conference, Paris, France.
[5] Nevill-Manning, Craig, Ian H. Witten and Gordon W. Paynter (1999)
“Lexically-generated subject hierarchies for browsing subject collections”,
Int’l Journal of Digital Libraries 2:111-123.
[6] Tolle, Kristin M. and Hsinchun Chen (2000) “Comparing noun phrasing
techniques for use with medical digital library tools”, JASIS 51(4):352-370.
[7] Wacholder, Nina (1998) “Simplex noun phrases clustered by head: a method
for identifying significant topics in a document”, Proc. of Workshop on the
Computational Treatment of Nominals, edited by Federica Busa, Inderjeet Mani
and Patrick Saint-Dizier, pp.70-79. COLING-ACL, October 16, 1998, Montreal.
[8] Wacholder, Nina, David K. Evans and Judith L. Klavans (2001) “Automatic
identification and organization of index terms for interactive browsing”,
Proceedings of First ACM/IEEE-CS Joint Conference on Digital Libraries,
pp.126-134, July 24-28, 2001.
Figure 1: Intell-Index opening screen
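
The head sorting and head/premodifier search described in Section 3, together
with the capitalized Act search from Section 2, can be illustrated with a
minimal Python sketch. This is not the Intell-Index implementation: the function
names (head_of, head_sort, search) and the TERMS list are hypothetical, the
terms are assumed to be simplex NPs already extracted by some other tool, and
the head is simply taken to be the last word, as the paper says holds for
English simplex NPs. No morphological normalization is attempted, so a plural
head such as Accords would not be grouped with accord here.

```python
from collections import defaultdict

# Hypothetical sample of simplex NPs, taken from the examples in the text.
TERMS = [
    "oil filter", "smut filter", "chemical filter", "paper filter",
    "filter paper", "filter factory", "digital filter design",
    "dangerous act", "Iran Missile Proliferation Sanctions Act",
]


def head_of(term):
    """Return the head of a simplex NP, assumed to be its last word."""
    return term.split()[-1]


def head_sort(terms):
    """Group terms by lowercased head; heads and members are sorted."""
    groups = defaultdict(list)
    for term in terms:
        groups[head_of(term).lower()].append(term)
    return {head: sorted(members, key=str.lower)
            for head, members in sorted(groups.items())}


def search(terms, word, position="any", case_sensitive=False):
    """Return terms containing `word` as head, premodifier, or either.

    position is one of "head", "premodifier" or "any" (assumed names).
    """
    def same(a, b):
        return a == b if case_sensitive else a.lower() == b.lower()

    hits = []
    for term in terms:
        words = term.split()
        in_head = same(words[-1], word)
        in_premodifier = any(same(w, word) for w in words[:-1])
        if position == "head":
            matched = in_head
        elif position == "premodifier":
            matched = in_premodifier
        else:
            matched = in_head or in_premodifier
        if matched:
            hits.append(term)
    return hits


if __name__ == "__main__":
    for head, members in head_sort(TERMS).items():
        print(head, "->", members)
    print(search(TERMS, "filter", position="premodifier"))
    print(search(TERMS, "Act", position="head", case_sensitive=True))
```

Under these assumptions, a head search for filter returns only the four terms
headed by filter, a premodifier search returns filter paper, filter factory and
digital filter design, and a case-sensitive head search for Act excludes
dangerous act.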
LEFTOVERS
One of the challenges in designing such a system is that the number of terms that can be automatically identified for a large collection is
itself quite large: for example, for a 45 MB corpus of Wall Street Journal articles, LinkIT, the system that we use to identify index terms for
Intell-Index (Wacholder et al. 2001 [8]; Evans et al. 2000 [4]), identified over 500,000 noun phrases. Each of these expressions is a candidate
index term.
Indexes typically include a diverse collection of terms selected because they represent some aspect of the content of a text and therefore are
of potential use to some unknown information seeker. Only a subset of the terms in the index is likely to be of interest to any particular
information seeker. The number of phrases can be reduced by a ranking process, in which certain NPs are excluded as not being of
sufficient potential usefulness. A variety of other techniques, such as hierarchical organization of index terms and subject
grouping, can also help with the browsing process.
The number of phrases is usually reduced by a ranking process, in which certain NPs are excluded from the index as being too vague or
ambiguous, and therefore not of sufficient potential usefulness.
But even after an optimal ranking process, only a subset of the terms in the index is likely to be of interest to any particular information
seeker. Techniques to help information seekers browse these terms, such as organizing the terms hierarchically and clustering them
into subject ‘buckets’, are actively being explored.