A Quick Introduction to Question Answering

Author: Andrew Lampert (Andrew.Lampert@csiro.au)
Date: December 2004

Question Answering is a specialised form of information retrieval. Given a collection
of documents, a Question Answering system attempts to retrieve correct answers to
questions posed in natural language.

Open-domain question answering requires question answering systems to be able to
answer questions about any conceivable topic. Such systems cannot, therefore, rely on
hand-crafted, domain-specific knowledge to find and extract the correct answers.

Figure 1: Architecture of an Open Domain QA System (from lecture notes for COMP248 subject
at Macquarie University)

Figure 1 shows a possible architecture of an open-domain question answering
system. The architecture described below is based in part on one developed by
Moldovan et al. [4] for the TREC QA track, but also draws on design ideas and
concepts from a wide range of other question answering systems.

Question Processing Module

Given a natural language question as input, the overall function of the question
processing module is to process and analyse the question, and to create some
representation of the information requested. Creating this representation requires the
question processing module to determine:
   •   The question type, usually based on a taxonomy of possible questions already
       coded into the system;
   •   The expected answer type, through some shallow semantic processing of the
       question; and
   •   The question focus, which represents the main information that is required to
       answer the user’s question.

These steps allow the question processing module to finally pass a set of query terms
to the Paragraph Indexing module, which uses them to perform the information
retrieval.

Question Type Classification

In order to find a correct answer to a user’s question, we need to first know what to
look for in our large collection of documents. The type of answer required is related to
the form of the question, so knowing the type of a question can provide constraints on
what constitutes relevant data, which helps other modules to correctly locate and
verify an answer.

The question type classification component is therefore a useful, if not essential,
component in a QA system, as it provides significant guidance about the nature of the
required answer.

While the TREC Question Answering track involves questions about open domain
topics, the question types themselves are a fairly closed set. In general, questions
conform to predictable language patterns. This makes question type classification
somewhat simpler.

Researchers have proposed many different taxonomies for question
classification. Wendy Lehnert, for example, proposed a conceptual taxonomy with 13
conceptual classes [1] in 1986. More recently, Li and Roth proposed a
multi-layered taxonomy [2], which has 6 coarse classes (ABBREVIATION, ENTITY,
DESCRIPTION, HUMAN, LOCATION and NUMERIC VALUE) and 50 fine
classes. Moldovan et al. [4] provide another set of question classes and subclasses
along with corresponding answer types, based on the 200 questions used in TREC 8.
This table is shown in Figure 2.

Ultimately, I would select some classification taxonomy, probably close to that used
by Moldovan et al. [4]. Unfortunately, the boundaries between question classes are not
always clear, which makes classification an ambiguous task. To resolve some of
this ambiguity, I would allow multiple class labels to be assigned to a single
question. To perform the actual classification, I would use Support Vector Machines
(SVMs) to classify questions based on feature sets, using either surface text features
(e.g., by treating the question as a bag of words or n-grams), or more semantic
features (e.g., identifying named entities). SVMs with both surface text features and
shallow semantic features have been used quite successfully to classify questions
[Zhang and Lee – SIGIR 2003] [Suzuki et al. – ACL 2003].

Figure 2: Types of questions, corresponding answer types, and statistics from the set of TREC 8
questions (From Moldovan et al. [4])
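
As a rough illustration of the surface text features mentioned above, the sketch below turns a question into bag-of-words and word n-gram counts. The function name, the tokenisation rule and the choice of bigrams are assumptions made for this example; the resulting counts would still need to be vectorised and passed to an SVM implementation, which is not shown.

```python
import re
from collections import Counter


def surface_features(question, max_n=2):
    """Count word n-grams (n = 1..max_n) in a lower-cased question."""
    # Simple tokenisation: runs of letters, digits and apostrophes (an assumption).
    tokens = re.findall(r"[a-z0-9']+", question.lower())
    features = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            features[" ".join(tokens[i:i + n])] += 1
    return features


features = surface_features("What is the longest river in New South Wales?")
print(features["what is"], features["longest river"])
```

Semantic features, such as named entity labels, could be added to the same counter under distinct key prefixes.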

Answer Type Classification

Answer type classification could be based on a mapping of the question classification.
Once a question has been classified, a simple rule based mapping would be used to
determine the potential answer types. Again, because question classification can be
ambiguous, the system would need to allow for multiple answer types.
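
A rule-based mapping of this kind could be as simple as a lookup table. In the sketch below, the class labels and answer type names are made up for illustration and are not the ones in Figure 2; a question carrying several class labels receives the union of their answer types.

```python
# Hypothetical mapping from question class labels to candidate answer types.
ANSWER_TYPES = {
    "who": ["PERSON", "ORGANISATION"],
    "where": ["LOCATION"],
    "when": ["DATE"],
    "how-many": ["NUMBER"],
}


def answer_types(question_classes):
    """Return the answer types for all class labels, in first-seen order."""
    result = []
    for label in question_classes:
        for answer_type in ANSWER_TYPES.get(label, []):
            if answer_type not in result:
                result.append(answer_type)
    return result


print(answer_types(["who", "where"]))
```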

Question Focus

Unfortunately, knowing the question type alone is not sufficient for finding answers
to all questions [4]. In particular, “what” questions can be quite ambiguous in terms of
the information they ask for. To address this ambiguity, an
additional component that extracts the question focus is necessary.

Moldovan et al. [4] define the focus of a question as a word or
sequence of words that indicates what information is being asked for in the question.
As an example, the question “What is the longest river in New South Wales?” has the
focus “longest river”. If both the question type (from the question classification
component) and the focus are known, the system is able to more easily determine the
type of answer required.

Identifying the focus can be done either using pattern matching rules, based on the
question type classification, or using a statistical approach. A statistical approach
might again make use of n-grams to identify likely focus words of questions. Such an
approach would require a training corpus of questions with known question foci to be
developed, which may be prohibitively expensive in terms of time and effort.
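
To make the pattern matching option concrete, here is a minimal sketch with a single hand-written rule. The regular expression is an assumption for illustration only; a real system would need one or more rules per question type.

```python
import re

# One illustrative rule: "What/Which is/are/was/were (the) <focus> in/of/on ...?"
FOCUS_PATTERN = re.compile(
    r"^(?:what|which)\s+(?:is|are|was|were)\s+(?:the\s+)?(.+?)\s+(?:in|of|on)\s+",
    re.IGNORECASE,
)


def question_focus(question):
    """Return the focus matched by the rule, or None if the rule does not apply."""
    match = FOCUS_PATTERN.match(question.strip())
    return match.group(1) if match else None


print(question_focus("What is the longest river in New South Wales?"))
# prints: longest river
```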

Question Keywords

Once determined, the question focus is also used to determine the list of keywords to
be passed to the information retrieval component. The process of extracting keywords
could be performed with the aid of standard techniques such as named entity
recognition, stop-word lists, and part-of-speech taggers, along with a set of ordered
heuristics, such as those described in [4]. Based on the work in [4], all words
satisfying any of the following 8 heuristics would be chosen as keywords:
    1. For each quoted expression in a question, all non-stop words in the quotation.
    2. Words recognized as proper nouns (using named-entity recognition).
    3. Complex nominals and their adjectival modifiers.
    4. All other complex nominals.
    5. All nouns and their adjectival modifiers.
    6. All other nouns.
    7. All verbs.
    8. The question focus.

Each heuristic would return a set of keywords that would be added, in order, to the set
of question keywords. Usually, only the first 6 heuristics are used – the final two offer
further keywords if required after paragraph indexing (i.e., if there are insufficient
candidate paragraphs after extraction and filtering).
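
The ordered application of heuristics can be sketched as below. Only the control flow follows the description above: the three heuristic functions are crude stand-ins (a regular expression for quoted text, capitalisation as a proxy for proper nouns, and "any remaining non-stop word" in place of the noun heuristics), and the stop-word list is a tiny made-up sample. A real system would plug in named entity recognition and a part-of-speech tagger instead.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "was", "in", "of",
              "to", "what", "who", "which", "did"}


def tokens(text):
    return re.findall(r"[A-Za-z0-9]+", text)


def quoted_words(question):
    """Heuristic 1: non-stop words inside double-quoted expressions."""
    words = []
    for quote in re.findall(r'"([^"]+)"', question):
        words += [w for w in tokens(quote) if w.lower() not in STOP_WORDS]
    return words


def capitalised_words(question):
    """Stand-in for heuristic 2: capitalised non-stop words after the first token."""
    return [w for w in tokens(question)[1:]
            if w[0].isupper() and w.lower() not in STOP_WORDS]


def other_words(question):
    """Stand-in for the noun heuristics: every remaining non-stop word."""
    return [w for w in tokens(question) if w.lower() not in STOP_WORDS]


def extract_keywords(question,
                     heuristics=(quoted_words, capitalised_words, other_words)):
    """Apply the heuristics in order; earlier heuristics give higher priority."""
    keywords = []
    for heuristic in heuristics:
        for word in heuristic(question):
            if word not in keywords:
                keywords.append(word)
    return keywords


print(extract_keywords("What is the longest river in New South Wales?"))
```

Because the result is ordered by priority, passing only the first N keywords to the next module is a simple slice of this list.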

Other methods of expanding the set of question keywords could include using an
online resource such as WordNet. The synonym sets in WordNet could be used to
expand the set of question keywords with semantically related words that might also
occur in documents containing the correct answer.

The set of question keywords is sorted by priority, so if too many keywords are
extracted from the question, only the first N words are passed on to the next module. N
would be a configurable value that could be tuned, based on an evaluation of
performance with different numbers of keywords for information retrieval. The
number of keywords passed can also be controlled by the paragraph indexing module,
if the paragraph quality component determines that a different set of question
keywords is required.

Paragraph Indexing Module

The Paragraph Indexing module is referred to as a Document Processing
module in many Question Answering systems. Usually, this module relies on one or
several separate information retrieval systems to gather information from a collection
of document corpora (which, in the case of open domain Question Answering, almost
always involves the World Wide Web as at least one of those corpora).

Results from the information retrieval systems (search engines) are generally filtered
to remove paragraphs that do not contain all the keywords of the question being
answered. This allows for a paragraph index to be generated. After assessing the
quality of indexed paragraphs, this module then orders the extracted paragraphs,
according to how plausible the contained answer is. If there are too many or too few
paragraphs with plausible answers, then new queries can be constructed, with either
more or fewer question keywords, and sent again to the information retrieval system.
This ensures that a reasonable number of paragraphs (i.e., not too many, and not too
few) are passed on to the Answer Processing module.

The motivation for distilling documents down to paragraphs before processing them
in detail is to require less content to be analysed in detail, making for a faster system.
The response time of a QA system is very important due to the interactive nature of
question answering.

Information Retrieval

I would perform information retrieval (IR) using standard IR technologies and
techniques, such as existing web search engines (Google, AltaVista, etc.). One thing
to be aware of when using these systems is that the fairly standard IR approach of
ranking documents by their cosine similarity to the query in a vector space model is
not desirable for IR in question answering. A QA system usually wants documents to
be retrieved only when all keywords are present in the document, because the
keywords have been carefully selected by the Question Processing module as the most
representative words in the question. Cosine similarity based IR systems often return
documents even if not all keywords are present.

Information retrieval systems are usually evaluated based on two metrics – precision
and recall. Precision refers to the ratio of relevant (or correct) documents returned to
the total number of documents returned. Recall refers to the ratio of relevant
documents returned to the total number of relevant documents available in the
document collection being searched. In general, the aim for information retrieval
systems is to optimise both precision and recall. For question answering, however, the
focus is subtly different.

Because a QA system performs post processing on the documents returned, the recall
of the IR system is significantly more important than its precision. A QA system
filters out irrelevant documents or sections as part of the paragraph filtering
component. In doing so, it raises the precision of information, compensating for lower
precision in the document set returned by the IR system. Lower recall in the IR
systems means the answer is less likely to be in the returned documents. Where the
answer is not present, the QA system must first recognise this (which is difficult and
error prone), and then re-select question keywords to send to the information retrieval
engine(s). If rephrasing the query still does not cause the IR system to include the
correct answer in its output, even the best QA system is unable to successfully answer
the posed question.
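
The precision and recall definitions used in this section can be written down directly. The sketch below assumes documents are identified by simple IDs; the function name and the example IDs are made up for illustration.

```python
def precision_recall(returned, relevant):
    """Return (precision, recall) for a set of returned document IDs."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents returned, two of them relevant, out of three relevant overall.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"]))
```

In this example precision is 2/4 and recall is 2/3. For a QA system, the missing relevant document matters more than the two irrelevant ones, which later filtering can discard.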

Rather than attempting to construct my own document collection and index, I would
use one or several web search engines, such as Google. Search engines such as
Google already employ some common techniques of increasing recall such as
stemming. And besides, the recall of a search engine that has indexed more than 8
billion documents is hard to overlook! To ensure that only documents containing all
keywords are returned, Boolean ‘AND’ operators can be placed between each
question keyword.
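
Building such a conjunctive query is straightforward. The sketch below assumes a generic syntax in which terms are joined by AND and multi-word phrases are double-quoted; the exact operators accepted by any particular search engine may differ.

```python
def boolean_and_query(keywords):
    """Join keywords with AND, double-quoting multi-word phrases."""
    terms = ['"%s"' % k if " " in k else k for k in keywords]
    return " AND ".join(terms)


print(boolean_and_query(["longest", "river", "New South Wales"]))
# prints: longest AND river AND "New South Wales"
```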

In addition to Google, other more specialised resources could also be used. These
resources could be linked to specific types of questions and keywords. Examples of
specialised information retrieval services include databases like Amazon.com for
books, IMDB for movies, and the CIA World Fact Book. Such specialised resources
are likely to have much higher precision and recall than open-domain web search
engines for questions in their specific domain. Of course, this higher recall and
precision is over a much smaller corpus, but it is likely that the bulk of these
specialised resources are not included in the general web search corpora. This stems
from the fact that many of these specialised databases are part of the so-called ‘deep
web’, which is yet to be indexed by current state-of-the-art web search engines.

The other advantage of making use of separate knowledge sources is that they offer
the opportunity for sanity checking candidate answers. More on answer checking will
be discussed when we look in detail at the Answer Processing module.

Other approaches to improve information retrieval include predictive annotation of
documents. Predictive annotation allows a document to be indexed or marked-up with
concepts or features that are expected to be useful in answering a reasonable number
of questions. While this makes the indexing process more computationally expensive
and can make the index size an order of magnitude larger, indexing of document
collections is performed offline, so from a QA point of view, it is a worthwhile trade-off.

As already noted in the description of the question processing module, it is often
possible to determine the answer type required, based on analysis of the question. The
concepts that could be marked during predictive annotation could therefore use the
same taxonomy of concepts as is used for answer type classification. This would
allow the QA system to directly and easily exploit its answer type classification to
constrain the information retrieval search to content that matches not only the
keywords, but keywords of the specified type (e.g., Bush as a <personName> rather
than any of its other senses).
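
A toy version of a type-constrained lookup is sketched below. The index layout (a term mapped to document IDs paired with a concept label) and the labels themselves are assumptions for illustration; they do not describe any particular predictive annotation system.

```python
# Hypothetical annotated index: term -> list of (document ID, concept label).
INDEX = {
    "bush": [("doc1", "personName"), ("doc2", "plant"), ("doc3", "personName")],
}


def typed_matches(index, term, concept):
    """Return documents where the term was annotated with the given concept."""
    return [doc for doc, label in index.get(term.lower(), []) if label == concept]


print(typed_matches(INDEX, "Bush", "personName"))
```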

Paragraph Filtering

As already hinted at, the number of documents returned by the information retrieval
system may be very large. Paragraph filtering can be used to reduce the number of
candidate documents, and to reduce the amount of candidate text from each document.

The concept of paragraph filtering is based on the principle that the most relevant
documents should contain the question keywords in a few neighbouring paragraphs,
rather than dispersed over the entire document. To exploit this idea, the location of the
set of question keywords in each document is examined. If the keywords are all found
in some set of N consecutive paragraphs, then that set of paragraphs will be returned;
otherwise, the document is discarded from further processing. N is again a
configurable number that could be tuned based on an evaluation of system
performance under different tolerances of keyword distance in documents.
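
The filtering rule can be sketched as a sliding window over a document's paragraphs. In this minimal version, matching is by exact lower-cased word (no stemming), and the function returns the first qualifying window rather than every one; both are simplifying assumptions.

```python
import re


def words_in(paragraph):
    return set(re.findall(r"[a-z0-9]+", paragraph.lower()))


def filter_paragraphs(paragraphs, keywords, n):
    """Return the first window of at most n consecutive paragraphs that together
    contain every keyword, or None if the document should be discarded."""
    wanted = {k.lower() for k in keywords}
    word_sets = [words_in(p) for p in paragraphs]
    for start in range(len(paragraphs)):
        seen = set()
        for end in range(start, min(start + n, len(paragraphs))):
            seen |= word_sets[end] & wanted
            if seen == wanted:
                return paragraphs[start:end + 1]
    return None


document = [
    "The Murray is a river in Australia.",
    "It is the longest river in the country.",
    "An unrelated paragraph.",
    "Some closing filler.",
]
print(filter_paragraphs(document, ["Murray", "longest"], 2))
```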

Paragraph Quality

The paragraph quality component is responsible for evaluating the quality of the
selected paragraphs. If the quality of paragraphs is deemed to be insufficient, then the
system returns to the question keyword extraction module, and alters the heuristics for
extracting keywords from the question. This new set of keywords is then used to
perform the information retrieval from scratch.

Most commonly, question keywords need to be re-determined because there are
either too many or too few candidate paragraphs after paragraph filtering. In either
case, new queries for the information retrieval system are created by revisiting the
question keywords component, and either adding or dropping keywords. This
feedback loop provides some form of retrieval context that ensures that only a
‘reasonable’ number of paragraphs are passed on to the Answer Processing module.
Like many other parameters, exactly how many paragraphs constitute a ‘reasonable’
number should be configured, based on performance testing.
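
The feedback loop might look something like the sketch below. The `retrieve` argument stands for the whole retrieval and filtering step and is hypothetical, as are the default thresholds; keywords are assumed to be sorted by priority, so adding or dropping a keyword just changes how many are taken from the front of the list.

```python
def retrieve_with_feedback(keywords, retrieve, min_paragraphs=5,
                           max_paragraphs=50, start=None):
    """Re-query with more or fewer keywords until the result size is reasonable.

    retrieve(keyword_list) is a hypothetical callable returning paragraphs.
    """
    count = start or len(keywords)
    tried = set()
    paragraphs = []
    while 1 <= count <= len(keywords) and count not in tried:
        tried.add(count)
        paragraphs = retrieve(keywords[:count])
        if len(paragraphs) > max_paragraphs:
            count += 1  # too many results: add a keyword to narrow the query
        elif len(paragraphs) < min_paragraphs:
            count -= 1  # too few results: drop the lowest-priority keyword
        else:
            break
    return paragraphs


# Fake retrieval step for demonstration: fewer keywords, more paragraphs.
def fake_retrieve(keyword_list):
    return ["paragraph"] * {1: 60, 2: 8, 3: 3, 4: 1}[len(keyword_list)]


result = retrieve_with_feedback(["a", "b", "c", "d"], fake_retrieve)
print(len(result))
```

The `tried` set stops the loop from oscillating between two keyword counts when no count gives a result inside the thresholds; in that case the last result retrieved is returned.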

Paragraph Ordering

The aim of paragraph ordering is to rank the paragraphs according to how plausible it
is that they contain the correct answer.

Paragraph ordering is performed using a standard radix sort algorithm. The radix sort
uses three different scores to order paragraphs:
   •   The number of words from the question that are recognized in the same
       sequence within the current paragraph window;
   •   The number of words that separate the most distant keywords in the current
       paragraph window; and
   •   The number of unmatched keywords in the current paragraph window.

A paragraph window is defined as the minimal span of text required to capture each
maximally inclusive set of question keywords within each paragraph. Radix sorting is
performed for each paragraph window across all paragraphs.
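
One way to picture the ordering step is as a multi-key sort carried out in passes, from the least significant score to the most significant, in the spirit of a least-significant-digit radix sort. The sketch below uses Python's stable sort for each pass and takes the three scores as precomputed inputs. The priority of the keys and the direction of each one (more in-sequence words first, then smaller keyword distance, then fewer unmatched keywords) are assumptions; the text above does not specify them.

```python
from operator import itemgetter


def order_windows(windows):
    """Order paragraph windows, given as dicts holding the three scores."""
    ordered = list(windows)
    # Least significant key first; every pass is stable, so earlier passes
    # break ties left by later ones.
    ordered.sort(key=itemgetter("missing"))    # fewer unmatched keywords first
    ordered.sort(key=itemgetter("distance"))   # tighter windows first
    ordered.sort(key=itemgetter("same_sequence"), reverse=True)  # longer in-order runs first
    return ordered


windows = [
    {"id": "a", "same_sequence": 2, "distance": 5, "missing": 1},
    {"id": "b", "same_sequence": 3, "distance": 9, "missing": 0},
    {"id": "c", "same_sequence": 2, "distance": 5, "missing": 0},
    {"id": "d", "same_sequence": 2, "distance": 3, "missing": 2},
]
print([w["id"] for w in order_windows(windows)])
# prints: ['b', 'd', 'c', 'a']
```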

Other possible heuristics include weighting the confidence of each paragraph
according to its source. One might choose, for example, to give greater weight to local
or well known data sources than unknown sources. Similarly, government
organisations (identified by the namespace of the source domain name) might be
ranked with greater confidence than other sources.

Answer Processing

As the final module in the architecture pipeline, the Answer Processing module is
responsible for identifying and extracting answers from the set of ordered paragraphs
passed to it from the paragraph indexing module.

Answer Identification

The answer type, (hopefully) determined during question processing, is crucial to
guiding this process.

In order to identify paragraphs which contain the required answer type, shallow
parsing techniques such as named entity recognition are commonly used. As
mentioned previously, if predictive annotation of the document corpus has been
performed, such conceptual mark-up can also help the answer identification module.

The use of a part-of-speech tagger (e.g., Brill tagger) can help to enable recognition of
answer candidates within identified paragraphs. Answer candidates can be ranked
based on measures of distance between keywords, numbers of keywords matched and
other similar heuristic metrics.

Commonly, if no match is found, QA systems would fall back to delivering the best
ranked paragraph. Unfortunately, given the tightening requirements of the TREC QA
track, such behaviour is no longer useful.

Answer Extraction

Once an answer has been identified, the shallow parsing performed is leveraged to
extract only the relevant word or phrase in answer to the question.

Answer Correctness

Confidence in the correctness of an answer can be increased in a number of ways.
One way is to use a lexical resource like WordNet to verify that a candidate response
is of the correct answer type.

As mentioned earlier, specific knowledge sources can also be used as a ‘second
opinion’ to check answers to questions within specific domains. This allows candidate
answers to be sanity checked before being presented to a user.

If a specific knowledge source has been used to actually retrieve the answer, then
general web search can also be used to sanity check answers. The principle relied on
here is that the number of documents that can be retrieved from the web in which the
question and the answer co-occur can be considered a significant clue of the validity
of the answer. Several people have investigated using the redundancy of the web to
validate answers based on frequency counts of question answer collocation, and found
it to be surprisingly effective [3]. Given its simplicity, this makes it an attractive
technique.
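
A very simple redundancy score is sketched below. It is not one of the measures from [3]; it just normalises the number of documents containing both the question keywords and the candidate answer by the smaller of the two individual hit counts. The hit counts are assumed to come from a search engine, and the numbers in the example are invented purely to exercise the code.

```python
def cooccurrence_score(hits_question, hits_answer, hits_both):
    """Ratio of co-occurrence hits to the smaller individual hit count."""
    smaller = min(hits_question, hits_answer)
    return hits_both / smaller if smaller else 0.0


def best_answer(candidates):
    """candidates maps answer -> (hits_question, hits_answer, hits_both)."""
    return max(candidates, key=lambda a: cooccurrence_score(*candidates[a]))


# Invented hit counts for two hypothetical candidate answers.
candidates = {
    "candidate A": (1000, 5000, 400),
    "candidate B": (1000, 8000, 100),
}
print(best_answer(candidates))
```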

Answer Presentation

In the original TREC QA tracks, systems could present a list of several answers, and
were ranked based on where the correct answer appeared in the list. From 1999 to 2001,
the length of this list was 5. Since 2002, systems have been required to present only a
single answer. So, my system would be designed to select and present only the most
likely answer.
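
Scoring a ranked answer list by the position of the first correct answer is usually reported as mean reciprocal rank (MRR), the measure associated with the early TREC QA tracks. The sketch below assumes each question comes with a ranked answer list and a correctness check; both are supplied by the caller, and the example data is made up.

```python
def reciprocal_rank(ranked_answers, is_correct):
    """Return 1/rank of the first correct answer, or 0.0 if none is correct."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if is_correct(answer):
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs is a list of (ranked_answers, is_correct) pairs, one per question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(a, c) for a, c in runs) / len(runs)


runs = [
    (["x", "y", "z"], lambda a: a == "y"),  # correct at rank 2
    (["p", "q"], lambda a: a == "p"),       # correct at rank 1
    (["m"], lambda a: False),               # no correct answer
]
print(mean_reciprocal_rank(runs))
```

With a single answer per question, as required since 2002, this reduces to the fraction of questions answered correctly.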


References

   1. Lehnert, W. (1986). “A conceptual theory of question answering”. In B. J.
      Grosz, K. Sparck Jones, and B. L. Webber, editors, Natural Language
      Processing, pages 651–657. Kaufmann, Los Altos, CA.
   2. Li, X. and Roth, D. (2002). “Learning Question Classifiers”. In Proceedings of
      COLING 2002.
   3. Magnini, B., Negri, M., Prevete, R. and Tanev, H. (2002). “Is It the Right
      Answer? Exploiting Web Redundancy for Answer Validation”. In Proceedings
      of the 40th Annual Meeting of the Association for Computational Linguistics
      (ACL-2002), Philadelphia, PA.
   4. Moldovan, D., Harabagiu, S., Pasca, M., Mihalcea, R., Girju, R., Goodrum, R.
      and Rus, V. (2000). “The Structure and Performance of an Open-Domain
      Question Answering System”. In Proceedings of the Conference of the
      Association for Computational Linguistics (ACL-2000), pages 563–570.
