The Intell-Index System:

Using NLP Techniques to organize a dynamic text browser



Nina Wacholder

Rutgers University



nina@scils.rutgers.edu





1. Introduction¹

The key components of phrase browsing systems are index terms: coherent natural language expressions that represent document content. A set of index terms retrieved for a specific document constitutes a document surrogate. When linked to occurrences of the phrases in a text, the set of index terms for a collection of documents functions as a set of flexible access points to collection content.

In this paper, we briefly describe Intell-Index, a phrase browsing system designed to provide an environment for developing and testing the effectiveness of natural language processing techniques for automatic identification of index terms. We describe an early version of Intell-Index, which was designed to establish proof-of-concept (the system is currently being re-implemented), and we discuss the reasons for the choice of simplex noun phrases (defined below) as the unit for index terms.

2. Intell-Index

Intell-Index (www.cs.columbia.edu/~nina/IntellIndex) is a dynamic text browser, a user-centered system designed to give information seekers with unknown information needs access to the content of a large collection of text documents. Examples of the index terms included for a collection of documents on nuclear armaments are the following:

• ability
• political ability
• accord
• Trilateral Accord
• Background De-Nuclearization Accords
• subsequent post-Soviet accord
• bilateral accord
• bilateral nuclear cooperation accord

These index terms were automatically identified by LinkIT, a system for identifying significant topics in full-text documents (Evans et al. 2000 [4]).

In addition to providing users with the ability to browse the entire list of index terms for a document collection, Intell-Index allows the user to search for subsets of terms. This aspect of Intell-Index is sometimes misunderstood, apparently because of the misconception that browsing systems and searching systems are necessarily distinct. The search functionality allows users to specify, for example, that they are only interested in occurrences of the string “act” in which the first letter is capitalized. In the previously mentioned collection of documents on nuclear disarmament, this brings together a list of laws and international agreements, including:

• FY1998 Foreign Operations Appropriation Act
• Iran Missile Proliferation Sanctions Act
• Iran-Iraq Arms Non-Proliferation Act

The specification that the A be capitalized in the last word of the phrase allows the user to roughly restrict the search to proper names, thereby excluding terms in which “act” is a common noun, for example in phrases such as “dangerous act”.

In principle, Intell-Index is designed to incorporate as many ways as possible of enabling users to narrow down the set of terms they need to browse from among all of the index terms in the collection. But how a set of terms can be narrowed down depends in large part on what terms the set consists of in the first place. In what follows, we focus on a very basic issue, the phrasal properties of index terms, and on the implications of decisions about what kind of phrases to include in the browsing system for the size and organization of the index.

3. Noun phrases

A noun phrase (NP) is a coherent linguistic unit that refers to an idea, concept or event. The head (or most important word) in an NP, based on syntactic and semantic considerations, is a noun. Examples of noun phrases are “digital library” (“library” is the head), “technology of phrase browsing applications” (“technology” is the head) and “metadata” (“metadata” is the head of this single-word NP). Print indexes consist primarily of noun phrases, as a scan of almost any back-of-the-book index will show; Anick and Vaithyanathan 1997 [1] and Wacholder 1998 [7] suggest that phrase browsing systems should consist primarily of NPs.²

¹ This work was conducted at the Columbia University Center for Research on Information Access.

² We do not discuss the issue of whether to include verb phrases in indexes.
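The capitalized “Act” search described in Section 2 can be illustrated with a minimal Python sketch. It is not the Intell-Index implementation: the function name terms_with_head and the small term list are hypothetical, and the sketch assumes that each index term is a plain string whose last whitespace-separated word is the one being matched.

```python
def terms_with_head(terms, head, case_sensitive=True):
    """Return the terms whose last word equals `head`.

    Assumption for illustration only: the last word of a term is found by
    splitting on whitespace. `terms_with_head` is a hypothetical helper,
    not part of Intell-Index.
    """
    matches = []
    for term in terms:
        words = term.split()
        if not words:
            continue  # skip empty strings
        last = words[-1]
        if case_sensitive:
            hit = last == head
        else:
            hit = last.lower() == head.lower()
        if hit:
            matches.append(term)
    return matches


# A few index terms taken from the examples in the text.
index_terms = [
    "FY1998 Foreign Operations Appropriation Act",
    "Iran Missile Proliferation Sanctions Act",
    "Iran-Iraq Arms Non-Proliferation Act",
    "dangerous act",
    "bilateral accord",
]

# Case-sensitive search: keeps the named laws, drops the common noun "act".
laws = terms_with_head(index_terms, "Act")
for term in laws:
    print(term)
```

With case_sensitive=False the same call would also return “dangerous act”, which is the kind of result the capitalization constraint is meant to exclude.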

However, there are many definitions of NPs, and different phrase browsing systems use different algorithms to identify NPs. We can usefully distinguish between two types of NPs, simplex NPs (sometimes called base NPs) and complex NPs. In what follows, we use the definitions from Wacholder 1998 [7]; there are other variations of these definitions.

• Simplex NPs are maximal noun phrases that include head nouns and preceding content-bearing modifiers, but not postmodifiers. So “metadata” is a simplex NP, as is “digital library”. For proper names, a simplex NP is a term which refers to a single entity, e.g., “Museum of the City of New York”.

• Complex NPs are noun phrases with a postmodifier such as a prepositional phrase or a participial phrase after the head of the NP. “Digital library for the humanities” is a complex NP because it consists of a noun phrase, “digital library”, followed by a prepositional phrase, “for the humanities”. The complex NP “(the) tasks building such a resource entailed”³ consists of an NP, “(the) tasks”, followed by a verbal phrase, “building such a resource entailed”.

Complex NPs are longer and therefore more specific than simplex ones. If they could be reliably identified, it would presumably be best to use maximal NPs (i.e., the largest NP, regardless of how many NPs and other phrases it includes) for phrase browsing systems. Unfortunately, reliable identification of the boundaries of complex NPs is a hard natural language processing problem because natural language is ambiguous. For example, in a sentence such as “Lee saw Robin in the park with a telescope”, a parser cannot tell from syntactic or semantic evidence whether Lee, Robin, or even the park has the telescope.

In contrast, simplex NPs can be relatively reliably identified using regular expressions over text that has been tagged with grammatical part-of-speech. Simplex NPs also make head sorting simple to use for organizing index terms and presenting them to users, at least in English, where the head of a simplex NP is reliably its last word. Head sorting simply brings together terms that share a common head, for example:

• oil filter
• smut filter
• chemical filter
• paper filter

This set of phrases excludes other phrases which include the word “filter” as a premodifier but not as a head, for example:

• filter paper
• filter factory
• digital filter design

Since the heads of index terms are known, Intell-Index provides users with the ability to specify whether a string for which they are searching is in the head or a premodifier of the NP, thereby giving the user control over this aspect of the results.

Besides simplex and complex NPs, one other type of expression is sometimes used to identify index terms in text: adjacent sequences of words that are repeated in the text above some threshold. In one variation of this approach, any adjacent sequence of words is included as a candidate term; in another, only repeated sequences of words that form NPs are used. The latter alternative is used by Anick and Vaithyanathan 1997 [1], who also note the main disadvantage of this approach: important single-word NPs such as “metadata” or “biochemistry” are ruled out by the requirement that the NPs consist of more than one word. The former alternative, that of using adjacent sequences of words regardless of grammatical category, is used by Chen et al. 1998⁴ [2] and Nevill-Manning et al. 1999 [5]. In addition to excluding single-word index terms, this technique has the disadvantage of identifying sequences of words that do not form linguistic phrases, e.g., “definitions using” or “first large”. These sequences do not have heads and therefore do not represent a coherent unit. However, there is an important processing advantage to using repeated sequences of words: they can be identified by simple pattern matching, without the use of any syntactic information.

Decisions about which index terms to include therefore have important implications in a phrase browsing system for the number, specificity and comprehensibility of the index terms identified in full-text documents. We have chosen to use simplex NPs in Intell-Index because 1) they can be relatively reliably identified using shallow syntactic knowledge; 2) they include phrases of one or more words; 3) they readily support head browsing; and 4) simplex NPs and appropriate postmodifiers can be combined by post-processing. The new version of Intell-Index is being designed to provide an environment to test the usefulness of these different types of NPs as index terms.

³ The full clause from which this complex NP was extracted (Crane et al. 2001, p.426 [3]) is “Although we knew a great deal about the tasks building such a resource entailed and the benefits such a resource could provide…”

⁴ Tolle and Chen 2000 [6] use NPs instead of repeated word sequences.

4. REFERENCES

[1] Anick, Peter and Shivakumar Vaithyanathan (1997) “Exploiting clustering and phrases for context-based information retrieval”, Proceedings of the 20th SIGIR Conference, July 27-31, 1997, Philadelphia, PA.

[2] Chen, Hsinchun, Joanne Martinez, Amy Kirchhoff, Tobun D. Ng and Bruce R. Schatz (1998) “Alleviating search uncertainty through concept association: automatic indexing, co-occurrence analysis and parallel computing”, JASIS 49(3):206-216.

[3] Crane, Gregory, David A. Smith and Clifford E. Wulfman (2001) “Building a hypertextual digital library in the humanities”, Proceedings of First ACM/IEEE-CS Joint Conference on Digital Libraries, pp.426-434, July 24-28, 2001.

[4] Evans, David K., Klavans, Judith, and Wacholder, Nina (2000) “Document processing with LinkIT”, Proc. of the RIAO Conference, Paris, France.

[5] Nevill-Manning, Craig, Ian H. Witten and Gordon W. Paynter (1999) “Lexically-generated subject hierarchies for browsing subject collections”, Int’l Journal of Digital Libraries 2:111-123.

[6] Tolle, Kristin M. and Hsinchun Chen (2000) “Comparing noun phrasing techniques for use with medical digital library tools”, JASIS 51(4):352-370.

[7] Wacholder, Nina (1998) “Simplex noun phrases clustered by head: a method for identifying significant topics in a document”, Proc. of Workshop on the Computational Treatment of Nominals, edited by Federica Busa, Inderjeet Mani and Patrick Saint-Dizier, pp.70-79. COLING-ACL, October 16, 1998, Montreal.

[8] Wacholder, Nina, Evans, David K. and Judith L. Klavans (2001) “Automatic identification and organization of index terms for interactive browsing”, Proceedings of First ACM/IEEE-CS Joint Conference on Digital Libraries, pp.126-134, July 24-28, 2001.









Figure 1: Intell-Index opening screen

LEFTOVERS

Indexes typically include a diverse collection of terms selected because they represent some aspect of the content of a text and therefore are of potential use to some unknown information seeker. One of the challenges in designing such a system is that the number of terms that can be automatically identified for a large collection is itself quite large: for example, for a 45 MB corpus of Wall Street Journal articles, LinkIT, the system that we use to identify index terms for Intell-Index (Wacholder et al. 2001 [8]; Evans et al. 2000 [4]), identified over 500,000 noun phrases. Each of these expressions is a candidate index term.

The number of phrases is usually reduced by a ranking process, in which certain NPs are excluded from the index as being too vague or ambiguous, and therefore not of sufficient potential usefulness.

But even after an optimal ranking process, only a subset of the terms in the index is likely to be of interest to any particular information seeker. Techniques to help information seekers browse these terms, such as hierarchical organization of the terms and clustering them into subject ‘buckets’, are actively being explored.
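Head sorting, described in Section 3, is one inexpensive way to organize a large term list of this kind. The following Python sketch is illustrative only and is not the Intell-Index code. The function name group_by_head is hypothetical, and the sketch assumes that every term is an English simplex NP, so that its head is its last word. It also lowercases heads, which merges capitalized and uncapitalized variants; that is a simplification made here, not something the system is stated to do.

```python
from collections import defaultdict


def group_by_head(terms):
    """Group terms under their head, taken to be the last word.

    Assumption for illustration only: each term is an English simplex NP,
    so the head is the final whitespace-separated word. Heads are
    lowercased so that "Accord" and "accord" fall into the same group.
    """
    groups = defaultdict(list)
    for term in terms:
        words = term.split()
        if words:
            groups[words[-1].lower()].append(term)
    # Sort the heads, and the terms under each head, for stable browsing.
    return {head: sorted(group) for head, group in sorted(groups.items())}


# Terms from the head-sorting examples in Section 3.
terms = [
    "oil filter",
    "filter paper",
    "smut filter",
    "chemical filter",
    "filter factory",
    "paper filter",
    "digital filter design",
]

by_head = group_by_head(terms)
for head, group in by_head.items():
    print(head + ": " + ", ".join(group))
```

Terms such as “filter paper” and “digital filter design” end up under “paper” and “design” rather than under “filter”, which mirrors the distinction between head and premodifier drawn in Section 3.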
