Data Catalysis: Facilitating Large-Scale Natural Language Data Processing
Patrick Pantel
USC Information Sciences Institute, Marina del Rey, CA, 90292, USA

Large-scale data processing of the kind performed at companies like Google is within the grasp of the
academic community. The potential benefits to researchers and society at large are enormous. In this article,
we present the Data Catalysis Center, whose mission is to stage and enable fast development and execution of
large-scale data processing experiments. Our prototype environment serves as a pilot demonstration towards a
vision to build the tools and processing infrastructure that can eventually provide level access to very
large-scale data processing for academic researchers around the country. Within this context, we describe a
large-scale extraction task for discovering the admissible arguments of automatically generated inference rules.

1. Introduction
    In many sectors, it has become apparent that experiments over large data can enable better
science: we are witnessing a growing number of scientific disciplines that are extending their
traditional approaches of theoretical and observational research to include large data computer
experiments. Large data experimentation requires not only considerable computational infrastructure,
but also sophisticated toolsets and large research/support teams to handle the added complexities of
the large scale. Consequently, we are seeing an ever-widening gap in large data experimentation
capabilities between the larger and the smaller academic institutions. The effects on science are of
important consequence:
• Slowed progress towards scientific breakthroughs: advances in developing large-scale solutions to
  problems are hindered by the limited number of researchers that have enabling capabilities;
• Widening competitive disadvantage for smaller institutions: lack of enabling capabilities at small
  institutions makes it difficult to attract the best researchers and students wishing to work on
  large-scale data problems; and
• Incomparable scientific results: researchers at smaller institutions are unable to compare their
  theories with state of the art solutions requiring large data processing.
    There have been significant investments by institutional and governmental organizations to build
high performance open computing resources and supporting infrastructure. Making use of this
infrastructure, however, has been difficult for many computer science researchers because of the
overhead required for even the simplest tasks, an example of which we describe in the next section.
In response, we propose a Data Catalysis initiative, a large, cross-cutting, focused effort to build
large data processing capabilities over this existing infrastructure and to open it to the entire
research community. This vision is not limited to the NLP community, but is cross-cutting in that it
may facilitate large data experiments in many scientific disciplines. Realizing the vision requires a
consortium of projects for studying the functional architecture, building low-level infrastructure and
middleware, engaging the research community to participate in the initiative, and building up a
library of open-source large data processing algorithms.

2. The Challenge of Large Data Processing
    The natural language processing (NLP) community is a prime example of a community in need of
level access to large data processing capabilities. More and more cutting edge NLP research
incorporates extensive analysis over very large textual resources like the Web. For example, the
current state of the art machine translation system developed at Google relies on statistics over
8 × 10^12 words on the Web (Och 2006); and state of the art information extraction efforts seek to
scale to all documents on the Web (Banko et al. 2007; Paşca et al. 2006). This research is enabled by
sophisticated custom-built tools and processing environments requiring years of development time.
Such environments are often feasible only for large teams with sustained research programs. Google’s
focused investment on data processing middleware, for example, has enabled their researchers to
rapidly develop data experiments at an unparalleled scale by abstracting the underlying complexities
of specifying and managing distributed computation.
Lacking this middleware, even if given access to computational resource allocations and low-level
tools, academic groups have reported difficulty carrying out even the most conceptually simple large
data experiments, resulting in limited-scale studies and long experimental cycles.
    Consider, for example, the conceptually simple task of counting the frequencies of all words in a
very large collection of textual data. This task is prototypical of the many large data processing
tasks encountered in NLP, all characterized by:
• Large partitionable data which can be distributed over a cluster of compute nodes; and
• Aggregation of intermediate results generated by the distributed computation to form the final
  output.
    An efficient solution requires: i) the distribution of the text to several computational nodes,
ii) parallel accumulation of local frequency counts, and iii) aggregation of the results over all
nodes. The researcher must manage the distribution to optimize bandwidth utilization, monitor the
parallel accumulation of frequencies to handle straggling processes and memory overruns, and
synchronize the nodes and optimize bandwidth utilization for the aggregation of the results.
    The Data Catalysis paradigm aims to reduce the development time and processing time of exactly
this class of problems by enabling for the academic community the MapReduce technology pioneered at
Google (Dean and Ghemawat 2004; Ghemawat et al. 2003). This framework significantly simplifies the
development effort required on the part of the researcher, allowing her to think at a conceptual
level by abstracting away the underlying processing complexities. For the above task of counting word
frequencies, the researcher needs only to define the following two functions:

    map(String key, String value):
    // key: document name
    // value: document contents
       for each word w in value:
               AssertIntermediate(w, "1");

    reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
       int result = 0;
       for each v in values:
               result += ParseInt(v);
       Assert(result);

    The functions are submitted to a Large Data Processing Abstraction Layer (LDPAL), which
transparently handles the parallelization. The map function is applied to each document in the
textual collection (the underlying infrastructure manages the parallelization) and the reduce
function is used to combine intermediate results from map to form the final output [1]. A distributed
file system, which stores the large data collection, allows the LDPAL to greatly reduce network
bandwidth requirements by moving the computation of the map function to the data instead of moving
the data to the machines. Network bandwidth is mostly used only during aggregation, when the LDPAL
consolidates matching intermediate results from the map invocations and moves them to compute nodes
ready for reduce invocations.
    [1] The MapReduce implementation of the LDPAL is attributed to Google for accelerating
    development cycles. Our goal is to bring this capability to the research community by leveraging
    existing open source resources.
    We recognize the large number of technologies and research efforts aimed at simplifying
large-data processing for both grid and cluster environments. The key differentiating features of
the MapReduce approach are:
• Algorithms may be expressed in an easy-to-grasp special-purpose language;
• An LDPAL layer provides an API that abstracts away the complexities of large-scale computation and
  provides rapid and robust aggregation of results;
• Access to and processing over data is accelerated by preparing data and keeping it distributed over
  a large cluster of compute nodes.

3. Data Catalysis: A Vision for the Functional Architecture
    The data catalysis vision is not limited to the natural language processing community, but is
cross-cutting in that it may facilitate large data experiments in many computer science disciplines.
Realizing it is a large effort requiring a consortium of projects for studying the functional
architecture, building low-level infrastructure and middleware, engaging the research community to
participate in the initiative, and building up a library of open-source large data processing
algorithms. Focusing on a vertical such as NLP may, however, serve as a small demonstration of the
potential impact of the data catalysis initiative, helping us gain insight into the problems that
will need to be
addressed in order to realize the overarching vision of large-scale data catalysis for the academic
community. Below is a high-level description of our envisioned functional architecture (see Figure 1)
as it stands today.

[Figure 1. Data Catalysis functional architecture. Layers from top to bottom: Applications and
Task-focused Algorithm Collections (NLP Tasks); Data Analysis Algorithms (DAA) (NLP Algorithms);
Large Data Processing Abstraction Layer (LDPAL) (MapReduce open-source software); Data & Computation
Layer (existing datasets, ISI Skynet cluster). Alongside: Additional Stovepipes (Knowledge
Acquisition for AI, Geospatial Information Systems, and Machine Learning) and an education component
(large data computing curriculum using Data Catalysis).]

Data & Computation Layer, the foundational layer comprised of high performance computing
infrastructure, potentially harnessed into large grids, and large data storage and management
services.

Large Data Processing Abstraction Layer (LDPAL), the central layer providing a programming API
allowing simple algorithm development by abstracting the complexities of low-level distributed
computation. This layer handles the complex management of the computational processes when these
algorithms are executed.

Data Analysis Algorithms (DAA), the layer consisting of a variety of algorithms leveraging LDPAL. The
MapReduce implementation within the LDPAL potentially supports cross-cutting large-scale experiments
in a variety of areas impacting computer science, social sciences, biology, and beyond. Algorithms
already known to map into the framework stem from machine learning, graph computations, information
extraction, and data mining (Dean and Ghemawat 2004).

Applications and Task-focused Algorithm Collections, the layer interconnecting specialized DAA
algorithms in support of applications and research tasks, potentially managed using workflow

    The strengths of the proposed Data Catalysis approach are its demonstrated success in industry at
very large scale and its broad applicability (with the full scope not yet explored). Thousands of
algorithms have already been developed at Google within this framework (Dean 2006), enabling very
fast development cycles and processing speeds.

4. Enabling Data Catalysis for the Natural Language Processing Community
    Realizing the vision of an open access data catalysis environment for the research community is
an ambitious long-term project, requiring the development of the functional architecture and its
deployment and integration with computational infrastructures such as high performance computing
centers and the TeraGrid, protocols and infrastructure for data collection and management, libraries
of MapReduce algorithms, and, as importantly, community awareness campaigns to engage the community
within this paradigm.
    Initial strides should leverage existing infrastructure, open-source software, and existing data
resources to build a proof-of-concept prototype of the complete functional architecture:
• Data & Computation Layer: Here, we make use of existing HPCC and clustered computers and leverage
  large textual datasets available to the community. Candidates include the SPIRIT 1TB web
  collection, TREC’s Gov2 0.5TB web collection, a snapshot of Wikipedia, the Project Gutenberg
  collection, and offerings available through LDC such as the Gigaword corpus or the Google Web 1T
  5-gram dataset.
• LDPAL Middleware: We have deployed Hadoop, the open-source software package implementing Google’s
  LDPAL MapReduce and distributed file system framework. Hadoop has been shown to scale to several
  hundred machines, allows users to write “map” and “reduce” code, and manages the sophisticated
  parallel execution of the code.
    Candidate NLP tasks suitable for MapReduce include building a language model (a key component of
machine translation systems, question answering systems, and natural language generation engines),
machine learning for text classification (e.g., spam email classification and sentiment
classification in online product reviews), and building a thesaurus of conceptually similar
expressions (useful in question answering, textual entailment, and information extraction systems).

5. Inferential Selectional Preferences
    Semantic inference is a key component for
advanced natural language understanding. Several important applications are already relying heavily
on inference, including question answering (Harabagiu and Hickl 2006), information extraction
(Romano et al. 2006), and textual entailment (Szpektor et al. 2004).
    In response, several researchers have created resources for enabling semantic inference. Among
manual resources used for this task are WordNet (Fellbaum 1998) and Cyc (Lenat 1995). Although
important and useful, these resources primarily contain prescriptive inference rules such as
“X divorces Y ⇒ X married Y”. In practical NLP applications, however, plausible inference rules such
as “X married Y ⇒ X dated Y” are very useful. This, along with the difficulty and labor-intensiveness
of generating exhaustive lists of rules, has led researchers to focus on automatic methods for
building inference resources such as inference rule collections (Lin and Pantel 2001; Szpektor et al.
2004) and paraphrase collections (Barzilay and McKeown 2001).
    Using these resources in applications has been hindered by the large number of incorrect
inferences they generate, either because of altogether incorrect rules or because of blind
application of plausible rules without considering the context of the relations or the senses of the
words. For example, consider the following sentence:

    Terry Nichols was charged by federal prosecutors for murder and conspiracy in the Oklahoma City
    bombing.

and an inference rule such as:

    X is charged by Y ⇒ Y announced the arrest of X     (1)

Using this rule, we can infer that “federal prosecutors announced the arrest of Terry Nichols”.
However, given the sentence:

    Fraud was suspected when accounts were charged by CCM telemarketers without obtaining consumer
    authorization.

the plausible inference rule (1) would incorrectly infer that “CCM telemarketers announced the arrest
of accounts”.
    This example depicts a major obstacle to the effective use of automatically learned inference
rules. What is missing is knowledge about the admissible argument values for which an inference rule
holds, which we call Inferential Selectional Preferences. For example, inference rule (1) should only
be applied if X is a Person and Y is a Law Enforcement Agent or a Law Enforcement Agency. This
knowledge does not guarantee that the inference rule will hold, but, as we show in this paper, goes a
long way toward filtering out erroneous applications of rules.
    In a recently published article (Pantel et al. 2007), we proposed ISP, a collection of methods
for learning inferential selectional preferences and filtering out incorrect inferences. The
described algorithms apply to any collection of inference rules between binary semantic relations,
such as example (1). ISP derives inferential selectional preferences by aggregating statistics of
inference rule instantiations over a large corpus of text. Within ISP, we explored different
probabilistic models of selectional preference to accept or reject specific inferences. We showed
empirical evidence that ISPs can be automatically learned and used for effectively filtering out
incorrect inferences generated using the DIRT resource (Lin and Pantel 2001).
    Extracting ISPs for all 12 million DIRT inference rules is a challenging task which fits the Data
Catalysis paradigm very well. A very simple program allowed us to extract selectional preferences for
all DIRT inference rules in a single day of effort. A demo of the resulting ISPs can be found at

References
Banko, M.; Cafarella, M.J.; Soderland, S.; Broadhead, M.; and Etzioni, O. 2007. Open Information
  Extraction from the Web. To appear in Proceedings of IJCAI-07. Hyderabad, India.
Barzilay, R.; and McKeown, K.R. 2001. Extracting Paraphrases from a Parallel Corpus. In Proceedings
  of ACL 2001. pp. 50–57. Toulouse, France.
Dean, J. 2006. Experiences with MapReduce, an Abstraction for Large-Scale Computation. In Proceedings
  of Parallel Architectures and Compilation Techniques. Seattle, WA.
Dean, J.; and Ghemawat, S. 2004. MapReduce: Simplified Data Processing on Large Clusters. In
  Proceedings of OSDI'04: Sixth Symposium on Operating System Design and Implementation. San
  Francisco, CA.
Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. MIT Press.
Ghemawat, S.; Gobioff, H.; and Leung, S-T. 2003. The Google File System. In Proceedings of SOSP’03.
  New York, NY.
Harabagiu, S.; and Hickl, A. 2006. Methods for Using Textual Entailment in Open-Domain Question
  Answering. In Proceedings of ACL 2006. pp. 905–912. Sydney, Australia.
Lenat, D. 1995. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM,
  38(11):33–38.
Lin, D.; and Pantel, P. 2001. Discovery of Inference Rules for Question Answering. Natural Language
  Engineering 7(4):343–360.
Och, F.J. 2006. Oral Presentation: The Google Machine Translation System. NIST 2006 Machine
  Translation Workshop. Washington,
Pantel, P.; Bhagat, R.; Coppola, B.; Chklovski, T.; and Hovy, E.H. 2007. ISP: Learning Inferential
  Selectional Preferences. In Proceedings of NAACL HLT 07. pp. 564–571. Rochester, NY.
Paşca, M.; Lin, D.; Bigham, J.; Lifchits, A.; and Jain, A. 2006. Organizing and Searching the World
  Wide Web of Facts - Step One: The One-Million Fact Extraction Challenge. In Proceedings of AAAI-06.
  pp. 1400–1405. Boston, MA.
Romano, L.; Kouylekov, M.; Szpektor, I.; Dagan, I.; and Lavelli, A. 2006. Investigating a Generic
  Paraphrase-Based Approach for Relation Extraction. In Proceedings of EACL-2006. pp. 409–416.
  Trento, Italy.
Szpektor, I.; Tanev, H.; Dagan, I.; and Coppola, B. 2004. Scaling web-based acquisition of entailment
  relations. In Proceedings of EMNLP 2004. pp. 41–48. Barcelona, Spain.
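
As an illustration of the word-counting task discussed in Section 2, the map and reduce functions can be sketched in runnable Python together with a small single-process driver. This is a minimal sketch under stated assumptions, not the Hadoop API and not the code used in our experiments: the names map_fn, reduce_fn, and run_mapreduce are hypothetical, whitespace tokenization is assumed, and the driver only imitates on one machine the grouping step that an LDPAL would perform across a cluster.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, str]]:
    """key: document name; value: document contents.

    Emits one intermediate (word, "1") pair per token.
    Whitespace tokenization is an assumption made for this sketch.
    """
    for word in value.split():
        yield word, "1"


def reduce_fn(key: str, values: Iterable[str]) -> int:
    """key: a word; values: all counts emitted for that word."""
    return sum(int(v) for v in values)


def run_mapreduce(
    documents: Dict[str, str],
    mapper: Callable[[str, str], Iterator[Tuple[str, str]]],
    reducer: Callable[[str, Iterable[str]], int],
) -> Dict[str, int]:
    """Hypothetical single-process stand-in for the abstraction layer.

    A real deployment would run the mapper on the nodes that hold the
    data and move only the grouped intermediate results over the network.
    """
    # Group intermediate values by key (the aggregation step).
    intermediate: Dict[str, List[str]] = defaultdict(list)
    for name, contents in documents.items():
        for k, v in mapper(name, contents):
            intermediate[k].append(v)
    # Apply the reducer once per distinct key.
    return {k: reducer(k, vs) for k, vs in intermediate.items()}


docs = {"d1": "the cat sat", "d2": "the dog sat down"}
counts = run_mapreduce(docs, map_fn, reduce_fn)
print(counts)
```

The researcher writes only map_fn and reduce_fn. Distribution, grouping, and fault handling all belong to the driver, which is the part the abstraction layer is meant to take over.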
