Project Summary

Project Title: Multilingual Multidocument Summarization

Organization: Columbia University

Principal Investigators: Kathleen McKeown, Vasileios Hatzivassiloglou, and Judith Klavans

Project URL:

Objective: The aim is to develop a system that automatically generates a short
English summary of a set of documents, written in multiple languages, about the same
event. By providing a concise view of the consensus on the event as presented across
documents, the system will dramatically reduce the amount of reading required. By
highlighting differences between documents, the system will point out inconsistencies
in different views of the event, whether from different sources or different countries.

Approach: Columbia University is developing a practical, multilingual and
multidocument summarization system. The design features the integration of robust
statistical techniques, shallow linguistic approaches and machine learning to achieve
scalability within languages and portability across languages. To realize these goals, the
research is developing methods both for summarization across documents, using
information fusion and identification of key differences such as new developments, and
for summarization across languages, relying on externally developed machine translation
as well as identification and translation of terms. Support for Chinese and Japanese is
currently being implemented.

Recent Accomplishments:

Development of Newsblaster: Newsblaster is an online service that spiders portions of
the web, starting from specified news sites, in search of news articles. Background
processes run automatically every night, downloading thousands of discovered news
articles along with the images and captions they contain. Articles are then
automatically grouped into clusters such that each cluster represents a single news
story or event; clusters are grouped into super clusters of related events; and super
clusters are then categorized into topics typical of manually created news sites (more
information below). For each cluster, a summary is automatically generated. The web
interface to Newsblaster is then automatically updated; it can be found at . In addition
to showcasing many facets of NLP research, Newsblaster provides a useful means for
obtaining news. It has caught the attention of both the public and the press (a list of
articles is available at ).
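
The grouping step described above can be illustrated with a minimal sketch. It assumes
a greedy single-pass clustering over bag-of-words vectors with cosine similarity; the
summary does not say which clustering algorithm Newsblaster uses, so the function names,
the whitespace tokenizer and the 0.5 threshold are all hypothetical choices made for
illustration only.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased whitespace tokens with counts (a deliberately crude tokenizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_articles(articles, threshold=0.5):
    """Assign each article to the most similar existing cluster, or start a new one.

    The threshold is an illustrative assumption, not a value from the project.
    Returns a list of clusters, each a list of article indices.
    """
    clusters = []  # each entry: {"centroid": Counter, "members": [indices]}
    for index, text in enumerate(articles):
        vector = bag_of_words(text)
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vector, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append({"centroid": Counter(vector), "members": [index]})
        else:
            best["centroid"].update(vector)  # centroid is the summed term counts
            best["members"].append(index)
    return [cluster["members"] for cluster in clusters]


articles = [
    "earthquake hits city center",
    "earthquake hits city suburbs",
    "team wins championship final",
]
print(cluster_articles(articles))
```

Applying the same routine to cluster centroids with a lower threshold would be one way
to form super clusters of related events, again as an assumption rather than a
description of the deployed system.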

Online evaluation of Newsblaster: Our extensive online evaluation addressed two
issues: the usability of Newsblaster as a tool for browsing news and the accuracy of the
individual system components (classification, clustering and summarization). To get a
comprehensive view of system performance, we employed a variety of evaluation
techniques: a user preference experiment, a quantitative analysis of user answers to our
questionnaire, and a qualitative analysis of user comments about the system. The results
of the experiment show that users strongly prefer using the advanced summarization
features incorporated in our system, and that our system helps users browse news more
efficiently.

Evaluation: We participated in the formal evaluation of multidocument summarization
organized as part of the first and second Document Understanding Conferences. We
developed a multi-strategy summarization system that invokes Multigen for single-event
document sets and a new system, DEMS (Dissimilarity Engine for Multidocument
Summarization), for other sets. For the second conference, we performed extensive
training on the multi-event sets from year one. Single event sets were not available for

Events: We implemented automated tools to analyze documents and extract their event
structure, identifying those sentences and phrases that describe events. We
implemented extractors for those features we identified as appropriate for events (e.g.,
proper names, locations, time phrases, lack of pronouns, cardinal numbers and certain
verb characteristics) and two machine learning models that combine the features into a
classifier for event sentences. Our classifier obtains more than 80% accuracy in
identifying event sentences within a document, and also annotates important roles in
these event sentences such as the primary participants, time, and place.
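
As a rough illustration of the feature extraction described above, the sketch below
computes a few of the listed surface features for one sentence. The word lists, the
regular expression and the function name `event_features` are hypothetical; the
project's actual extractors and its two machine learning models are not specified in
this summary and are not reproduced here.

```python
import re

# Tiny illustrative word lists (assumptions, far from complete).
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "their", "its"}
TIME_WORDS = {
    "monday", "tuesday", "wednesday", "thursday", "friday", "saturday", "sunday",
    "yesterday", "today", "tomorrow",
}


def event_features(sentence):
    """Return a dict of simple surface features for one sentence."""
    tokens = re.findall(r"[A-Za-z]+|\d+", sentence)
    lower = [t.lower() for t in tokens]
    return {
        # Capitalized non-initial tokens that are not time words: a crude
        # stand-in for proper names and locations.
        "proper_names": sum(
            1 for t in tokens[1:] if t[0].isupper() and t.lower() not in TIME_WORDS
        ),
        "cardinals": sum(1 for t in tokens if t.isdigit()),
        "time_phrases": sum(1 for t in lower if t in TIME_WORDS),
        # 1 when the sentence contains no pronoun from the list above.
        "no_pronouns": int(not any(t in PRONOUNS for t in lower)),
    }


print(event_features("Rescuers found 12 survivors in Lima on Tuesday."))
```

Feature dictionaries of this kind could then be fed to any standard classifier; which
learner to use is left open here.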

Sentence ordering: We developed and implemented in Multigen a strategy for ordering
information that combines constraints from chronological order of events and cohesion.
This strategy was derived from empirical observations based on experiments where
humans ordered information. Evaluation of our augmented algorithm shows a significant
improvement in ordering over two naive techniques used as baselines; these two
naive techniques were previously used as strategies for multidocument summarization,
in other systems and our own.
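
One simple way to combine the two constraints is sketched below, under stated
assumptions: sentences that share a topic label are kept together as a block
(cohesion), and blocks are ordered by their earliest timestamp (chronology). The topic
labels, the integer timestamps and the function name `order_sentences` are hypothetical
inputs chosen for illustration; this is not the algorithm implemented in Multigen.

```python
from collections import defaultdict


def order_sentences(items):
    """Order (text, timestamp, topic) triples.

    Sentences with the same topic stay adjacent; blocks are sorted by the
    earliest timestamp they contain, and sentences within a block by timestamp.
    """
    blocks = defaultdict(list)
    for text, timestamp, topic in items:
        blocks[topic].append((timestamp, text))
    ordered_blocks = sorted(
        (sorted(block) for block in blocks.values()),
        key=lambda block: block[0][0],  # earliest timestamp in the block
    )
    return [text for block in ordered_blocks for _, text in block]


items = [
    ("Rescue teams arrived.", 2, "response"),
    ("The quake struck at dawn.", 1, "quake"),
    ("Aftershocks followed.", 3, "quake"),
    ("Aid was flown in.", 4, "response"),
]
for sentence in order_sentences(items):
    print(sentence)
```

A purely chronological ordering would interleave the two topics here (quake, response,
quake, response), which is the kind of break in cohesion the combined strategy is meant
to avoid.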

New information detection: A prototype of a new information detection program was
completed, and a preliminary evaluation shows promising results. The system, NIA (New
Information Agent), includes modules for recognizing clause structures, grouping
nouns and verbs into semantic sets (Concept Sets), and measuring novelty in a new
document clause by clause.
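
The clause-by-clause idea can be sketched as follows. This is an illustration under
assumptions, not NIA itself: clauses are split on punctuation rather than parsed, the
`CONCEPT_SETS` table is a tiny hand-made stand-in for real semantic sets, and novelty is
taken to be the fraction of a clause's concepts not seen before.

```python
import re

# Hypothetical word-to-concept mapping standing in for Concept Sets.
CONCEPT_SETS = {
    "blast": "EXPLOSION", "explosion": "EXPLOSION", "bomb": "EXPLOSION",
    "killed": "DEATH", "died": "DEATH", "dead": "DEATH",
    "arrested": "ARREST", "detained": "ARREST",
}


def split_clauses(text):
    """Crude clause splitter: break on periods, commas and semicolons."""
    return [c.strip() for c in re.split(r"[.,;]", text) if c.strip()]


def concepts(clause):
    """Set of concept labels for the known words in a clause."""
    words = re.findall(r"[a-z]+", clause.lower())
    return {CONCEPT_SETS[w] for w in words if w in CONCEPT_SETS}


def novel_clauses(seen_text, new_text, threshold=0.5):
    """Return (clause, novelty) pairs from new_text whose novelty >= threshold."""
    seen = set()
    for clause in split_clauses(seen_text):
        seen |= concepts(clause)
    results = []
    for clause in split_clauses(new_text):
        found = concepts(clause)
        novelty = len(found - seen) / len(found) if found else 0.0
        if novelty >= threshold:
            results.append((clause, novelty))
        seen |= found  # later clauses are compared against everything so far
    return results


old = "A blast killed three people."
new = "The explosion left three dead, and police detained two suspects."
print(novel_clauses(old, new))
```

Because "explosion" and "dead" map to concepts already present in the earlier document,
only the clause about the detention is reported as new.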

SimFinderML: A new version of SimFinder, Columbia’s tool to identify similar sentences
across documents, has been developed to support modular interfaces for adding new
primitives and features for the text comparison process. We have re-trained this new
version of SimFinder over two data sets, and have begun creating a third
English training set using data from the Newsblaster project.

Multilingual summarization: We have made progress on the problem of identifying
similarities across multilingual input, implementing primitives and features for both
Japanese and Chinese. An initial translation method based on a bilingual lexicon has
been implemented, allowing token-based primitives to be matched across languages. The
multilingual architecture has shown promise in exploratory analysis of
similarity between Japanese, Chinese, and English texts.
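
A minimal sketch of lexicon-based token matching is given below. It assumes the
non-English text has already been segmented into tokens, and it uses a three-entry
hypothetical lexicon; the names `LEXICON` and `cross_lingual_overlap` and the overlap
score itself are illustrative assumptions, not the features used in SimFinderML.

```python
# Hypothetical Japanese-to-English lexicon: each token maps to a set of
# candidate English translations.
LEXICON = {
    "地震": {"earthquake", "quake"},
    "東京": {"tokyo"},
    "首相": {"premier", "minister"},
}


def cross_lingual_overlap(source_tokens, english_tokens, lexicon):
    """Fraction of source tokens with at least one translation in the English text."""
    if not source_tokens:
        return 0.0
    english = {t.lower() for t in english_tokens}
    matched = sum(1 for t in source_tokens if lexicon.get(t, set()) & english)
    return matched / len(source_tokens)


print(cross_lingual_overlap(["東京", "地震"], ["Earthquake", "strikes", "Tokyo"], LEXICON))
print(cross_lingual_overlap(["東京", "首相"], ["Tokyo", "weather"], LEXICON))
```

Tokens missing from the lexicon simply count as unmatched, which is one reason richer
translation resources matter for this kind of comparison.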

Current Plan:

Multilingual Summarization: A major focus of the third year will be on multilingual
summarization. This will involve additional development of the component for
multilingual similarity detection as well as significant work on generation of text using
fusion over similar sentences from translated pieces of documents. In the area of
multilingual similarity detection, we plan to create additional primitives and features for
use with non-English data; currently we have implemented only "token" primitives for
Chinese and Japanese. We have collected parallel and comparable corpora for use in
creating multilingual training data. Annotators will manually align clusters of similar
sentences across documents and languages, and these alignments will be used to train the
weights assigned to different features in the matching process. These manual
judgments will also be used in an evaluation of multilingual similarity matching
performance. We will import more sophisticated translation methods from other sites for
use in the process and explore other methods for term translation, including N-best
translation of noun phrases using statistics from large corpora as well as the
combination of multiple translation resources (induced bilingual lexicons, machine
readable dictionaries, etc.).

Summarization of Updates: We plan to extend and fully evaluate the new information
detection program and embed it in a system to generate updates. This will allow us to
track the same story across multiple days; summaries of the same story at later points
will report only the new developments. To do this, we plan to continue experimentation
with identifying features that can be used to improve our metrics for measuring novelty.
We also need to develop methods for evaluating the program accurately. We will work
with an expert in the area of cognitive evaluation who can help us develop instructions
and a framework to elicit reliable judgments. We will also be participating in the TREC
novelty track. Finally, we will work on fusing information detected by NIA (New
Information Agent), applying generation techniques, and producing a full summary. At
the same time, we will explore methods for extending new information detection to
identify other kinds of differences across documents (e.g., contradictions or differences
in perspective).

Events: We intend to refine our event model in the coming year by improving the quality
of feature extraction and including additional features. We are in the process of
collecting additional human-annotated data for events, which we will use for improving
the machine learning stage of our method. We will also look at methods for linking
together event sentences that refer to the same participants or share other important
characteristics. Ultimately, our event extraction and linking will provide an alternative way to
summarize a set of documents: rather than looking for repetition of information, we will
select those events that are most prominent in the document set because of their
multiple links to other mentioned events and the importance of their protagonists. While
this method will not work for all kinds of documents, we expect that it will provide an
alternative means for extracting the core "what happened" information out of a
document set that describes events (as many news stories do).

Technology Transition:

We integrated Newsblaster into the MiTAP system developed by MITRE. We developed
an HTTP-based API for Newsblaster. One of the main technical challenges of
this integration was adapting Newsblaster to operate over articles published in
newsgroups.