Data Catalysis: Facilitating Large-Scale Natural Language Data Processing

Patrick Pantel
USC Information Sciences Institute, Marina del Rey, CA, 90292, USA
firstname.lastname@example.org

Abstract

Large-scale data processing of the kind performed at companies like Google is within grasp of the academic community. The potential benefits to researchers and society at large are enormous. In this article, we present the Data Catalysis Center, whose mission is to stage and enable fast development and processing of large-scale data processing experiments. Our prototype environment serves as a pilot demonstration towards a vision to build the tools and processing infrastructure that can eventually provide level access to very large-scale data processing for academic researchers around the country. Within this context, we describe a large scale extraction task for discovering the admissible arguments of automatically generated inference rules.

1. Introduction

In many sectors, it has become apparent that experiments over large data can enable better science: we are witnessing a growing number of scientific disciplines that are extending their traditional approaches of theoretical and observational research to include large data computer experiments. Large data experimentation requires not only considerable computational infrastructure, but also sophisticated toolsets and large research/support teams to handle the added complexities of the large scale. Consequently, we are seeing an ever-widening gap in large data experimentation capabilities between the larger and the smaller academic institutions. The effects on science are of important consequence:

• Slowed progress towards scientific breakthroughs: advances in developing large-scale solutions to problems are hindered by the limited number of researchers that have enabling capabilities;

• Widening competitive disadvantage for smaller institutions: lack of enabling capabilities at small institutions makes it difficult to attract the best researchers and students wishing to work on large-scale data problems; and

• Incomparable scientific results: researchers at smaller institutions are unable to compare their theories with state of the art solutions requiring large data processing.

There have been significant investments by institutional and governmental organizations to build high performance open computing resources and supporting infrastructure. Making use of this infrastructure, however, has been difficult for many computer science researchers because of the overhead required for even the simplest tasks – an example of which we describe in the next section.

In response, we propose a Data Catalysis initiative, which is a large, cross-cutting, focused effort to build large data processing capabilities over this existing infrastructure and to open it to the entire research community. This vision is not limited to the NLP community, but is cross-cutting in that it may facilitate large data experiments in many scientific disciplines.

Realizing the vision requires a large consortium of projects for studying the functional architecture, building low-level infrastructure and middleware, engaging the research community to participate in the initiative, and building up a library of open-source large data processing algorithms.

2. The Challenge of Large Data Processing

The natural language processing (NLP) community is a prime example of a community in need of level access to large data processing capabilities. More and more cutting edge NLP research incorporates extensive analysis over very large textual resources like the Web. For example, the current state of the art machine translation system developed at Google relies on statistics over 8 × 10^12 words on the Web (Och 2006); and state of the art information extraction efforts seek to scale to all documents on the Web (Banko et al. 2007; Paşca et al. 2006). This research is enabled by sophisticated custom-built tools and processing environments requiring years of development time. Such environments are often feasible only for large teams with sustained research programs. Google's focused investment on data processing middleware, for example, has enabled their researchers to rapidly develop data experiments at an unparalleled scale by abstracting the underlying complexities of specifying and managing distributed computation.

Lacking this middleware, even if given access to computational resource allocations and low-level tools, academic groups have reported difficulty carrying out even the most conceptually simple large data experiments, resulting in limited-scale studies and long experimental cycles.

Consider, for example, the conceptually simple task of counting the frequencies of all words in a very large collection of textual data. This task is prototypical of the many large data processing tasks encountered in NLP, all characterized by:

• Large partitionable data which can be distributed over a cluster of compute nodes; and

• Aggregation of intermediate results generated by the distributed computation to form the final output.

An efficient solution requires: i) the distribution of the text to several computational nodes, ii) parallel accumulation of local frequency counts, and iii) aggregation of the results over all nodes. The researcher must manage the distribution to optimize bandwidth utilization, monitor the parallel accumulation of frequencies to handle straggling processes and memory overruns, and synchronize the nodes and optimize bandwidth utilization for the aggregation of the results.

The Data Catalysis paradigm aims to reduce the development time and processing time of exactly this class of problems by enabling for the academic community the MapReduce technology pioneered at Google (Dean and Ghemawat 2004; Ghemawat et al. 2003). This framework significantly simplifies the development effort required on the part of the researcher, allowing her to think at a conceptual level by abstracting away the underlying processing complexities. For the above task of counting word frequencies, the researcher needs only to define the following two functions:

    map(String key, String value):
      // key: document name
      // value: document contents
      for each word w in value:
        AssertIntermediate(w, "1");

    reduce(String key, Iterator values):
      // key: a word
      // values: a list of counts
      int result = 0;
      for each v in values:
        result += ParseInt(v);
      Assert(result);

The functions are submitted to a Large Data Processing Abstraction Layer (LDPAL), which transparently handles the parallelization. The map function is applied to each document in the textual collection (the underlying infrastructure manages the parallelization) and the reduce function is used to combine intermediate results from map to form the final output¹. A distributed file system, which stores the large data collection, allows the LDPAL to greatly reduce network bandwidth requirements by moving the computation of the map function to the data instead of moving the data to the machines. Network bandwidth is mostly used only during aggregation, when the LDPAL consolidates matching intermediate results from the map invocations and moves them to compute nodes ready for reduce invocations.

¹ The MapReduce implementation of the LDPAL is attributed to Google for accelerating development cycles. Our goal is to bring this capability to the research community by leveraging existing open source resources.

We realize the great number of technologies and research efforts aimed at simplifying large-data processing for both grid and cluster environments. The key differentiating features of the MapReduce approach are:

• Algorithms may be expressed in an easy to grasp special-purpose language;

• LDPAL layer which provides an API abstracting away the complexities of large-scale computation and providing rapid and robust aggregation of results;

• Accelerated access and processing over data by preparing data and keeping it distributed over a large cluster of compute nodes.

3. Data Catalysis: A Vision for the Functional Architecture

The data catalysis vision is not limited to the natural language processing community, but is cross-cutting in that it may facilitate large data experiments in many computer science disciplines. Realizing it is a large effort requiring a consortium of projects for studying the functional architecture, building low-level infrastructure and middleware, engaging the research community to participate in the initiative, and building up a library of open-source large data processing algorithms. Focusing on a vertical such as NLP may, however, serve as a small demonstration of the potential impact of the data catalysis initiative, helping us gain insight into the problems that will need to be addressed in order to realize the overarching vision of large-scale data catalysis for the academic community. Below is a high-level description of our envisioned functional architecture (see Figure 1) as it stands today.

[Figure 1. Data Catalysis functional architecture. Recoverable labels: Applications and Task-focused Algorithm Collections (Supported NLP Tasks); Data Analysis Algorithms (DAA) (NLP Algorithms); Large Data Processing Abstraction Layer (LDPAL) (MapReduce open-source computing software, Hadoop); Data & Computation Layer (Existing Datasets, ISI Skynet Cluster); Additional Stovepipes (Knowledge Acquisition for AI, Geospatial Information Systems, and Machine Learning); Educate students in Large Data NLP through curriculum modules using the Data Catalysis paradigm.]

Data & Computation Layer, the foundational layer comprised of high performance computing infrastructure, potentially harnessed into large grids, and large data storage and management services.

Large Data Processing Abstraction Layer (LDPAL), the central layer providing a programming API allowing simple algorithm development by abstracting the complexities of low-level distributed computation. This layer handles the complex management of the computational processes when these algorithms are executed.

Data Analysis Algorithms (DAA), the layer consisting of a variety of algorithms leveraging LDPAL. The MapReduce implementation within the LDPAL potentially supports cross-cutting large-scale experiments in a variety of areas impacting computer science, social sciences, biology, and beyond. Algorithms already known to map into the framework stem from machine learning, graph computations, information extraction, and data mining (Dean and Ghemawat 2004).

Applications and Task-focused Algorithm Collections, the layer interconnecting specialized DAA algorithms in support of applications and research tasks, potentially managed using workflow technologies.

The strengths of the proposed Data Catalysis approach are its demonstrated success in industry at very large scale and its broad applicability (with the full scope not yet explored). Thousands of algorithms have already been developed at Google within this framework (Dean 2006), enabling very fast development cycles and processing speeds.

4. Enabling Data Catalysis for the Natural Language Processing Community

Realizing the vision of an open access data catalysis environment for the research community is an ambitious long-term project, requiring the development of the functional architecture and its deployment and integration with computational infrastructures such as high performance computing centers and the TeraGrid, protocols and infrastructure for data collection and management, libraries of MapReduce algorithms, and, as importantly, community awareness campaigns to engage the community within this paradigm.

Initial strides should leverage existing infrastructure, open-source software, and existing data resources to build a proof-of-concept prototype of the complete functional architecture:

• Data & Computation Layer – Here, we make use of existing HPCC and clustered computers and leverage large textual datasets available to the community. Candidates include the Spirit 1TB web collection, TREC's Gov2 0.5TB web collection, a snapshot of Wikipedia, the Project Gutenberg collection, and offerings available through LDC such as the Gigaword corpus or the Google Web 1T 5-gram dataset.

• LDPAL Middleware – We have deployed Hadoop², the open-source software package implementing Google's LDPAL MapReduce and distributed file system framework. Hadoop has been shown to scale to several hundred machines, allows users to write "map" and "reduce" code, and manages the sophisticated parallel execution of the code.

² Hadoop, http://lucene.apache.org/hadoop/

Candidate NLP tasks suitable for MapReduce include building a language model (a key component of machine translation systems, question answering systems, and natural language generation engines), machine learning for text classification (e.g., spam email classification and sentiment classification in online product reviews), and building a thesaurus of conceptually similar expressions (useful in question answering, textual entailment, and information extraction systems).

5. Inferential Selectional Preferences

Semantic inference is a key component for advanced natural language understanding. Several important applications are already relying heavily on inference, including question answering (Harabagiu and Hickl 2006), information extraction (Romano et al. 2006), and textual entailment (Szpektor et al. 2004).

In response, several researchers have created resources for enabling semantic inference. Among manual resources used for this task are WordNet (Fellbaum 1998) and Cyc (Lenat 1995). Although important and useful, these resources primarily contain prescriptive inference rules such as "X divorces Y ⇒ X married Y". In practical NLP applications, however, plausible inference rules such as "X married Y ⇒ X dated Y" are very useful. This, along with the difficulty and labor-intensiveness of generating exhaustive lists of rules, has led researchers to focus on automatic methods for building inference resources such as inference rule collections (Lin and Pantel 2001; Szpektor et al. 2004) and paraphrase collections (Barzilay and McKeown 2001).

Using these resources in applications has been hindered by the large number of incorrect inferences they generate, either because of altogether incorrect rules or because of blind application of plausible rules without considering the context of the relations or the senses of the words. For example, consider the following sentence:

    Terry Nichols was charged by federal prosecutors for murder and conspiracy in the Oklahoma City bombing.

and an inference rule such as:

    X is charged by Y ⇒ Y announced the arrest of X    (1)

Using this rule, we can infer that "federal prosecutors announced the arrest of Terry Nichols". However, given the sentence:

    Fraud was suspected when accounts were charged by CCM telemarketers without obtaining consumer authorization.

the plausible inference rule (1) would incorrectly infer that "CCM telemarketers announced the arrest of accounts".

This example depicts a major obstacle to the effective use of automatically learned inference rules. What is missing is knowledge about the admissible argument values for which an inference rule holds, which we call Inferential Selectional Preferences. For example, inference rule (1) should only be applied if X is a Person and Y is a Law Enforcement Agent or a Law Enforcement Agency. This knowledge does not guarantee that the inference rule will hold, but, as we show in this paper, goes a long way toward filtering out erroneous applications of rules.

In a recently published article (Pantel et al. 2007), we proposed ISP, a collection of methods for learning inferential selectional preferences and filtering out incorrect inferences. The described algorithms apply to any collection of inference rules between binary semantic relations, such as example (1). ISP derives inferential selectional preferences by aggregating statistics of inference rule instantiations over a large corpus of text. Within ISP, we explored different probabilistic models of selectional preference to accept or reject specific inferences. We showed empirical evidence that ISPs can be automatically learned and used for effectively filtering out incorrect inferences generated using the DIRT resource (Lin and Pantel 2001).

Extracting ISPs for all 12 million DIRT inference rules is a challenging task which fits the Data Catalysis paradigm very well. A very simple program allowed us to extract selectional preferences for all DIRT inference rules in a single day of effort. A demo of the resulting ISPs can be found at http://www.patrickpantel.com/demos.htm.

References

Banko, M.; Cafarella, M.J.; Soderland, S.; Broadhead, M.; and Etzioni, O. 2007. Open Information Extraction from the Web. To appear in Proceedings of IJCAI-07. Hyderabad, India.

Barzilay, R.; and McKeown, K.R. 2001. Extracting Paraphrases from a Parallel Corpus. In Proceedings of ACL 2001. pp. 50–57. Toulouse, France.

Dean, J. 2006. Experiences with MapReduce, an Abstraction for Large-Scale Computation. In Proceedings of Parallel Architectures and Compilation Techniques. Seattle, WA.

Dean, J.; and Ghemawat, S. 2004. MapReduce: Simplified Data Processing on Large Clusters. In Proceedings of OSDI'04: Sixth Symposium on Operating System Design and Implementation. San Francisco, CA.

Fellbaum, C. 1998. WordNet: An Electronic Lexical Database. MIT Press.

Ghemawat, S.; Gobioff, H.; and Leung, S-T. 2003. The Google File System. In Proceedings of SOSP'03. New York, NY.

Harabagiu, S.; and Hickl, A. 2006. Methods for Using Textual Entailment in Open-Domain Question Answering. In Proceedings of ACL 2006. pp. 905–912. Sydney, Australia.

Lenat, D. 1995. CYC: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.

Lin, D.; and Pantel, P. 2001. Discovery of Inference Rules for Question Answering. Natural Language Engineering 7(4):343–360.

Och, F.J. 2006. Oral Presentation: The Google Machine Translation System. NIST 2006 Machine Translation Workshop. Washington, D.C.

Pantel, P.; Bhagat, R.; Coppola, B.; Chklovski, T.; and Hovy, E.H. 2007. ISP: Learning Inferential Selectional Preferences. In Proceedings of NAACL HLT 07. pp. 564–571. Rochester, NY.

Paşca, M.; Lin, D.; Bigham, J.; Lifchits, A.; and Jain, A. 2006. Organizing and Searching the World Wide Web of Facts - Step One: The One-Million Fact Extraction Challenge. In Proceedings of AAAI-06. pp. 1400–1405. Boston, MA.

Romano, L.; Kouylekov, M.; Szpektor, I.; Dagan, I.; and Lavelli, A. 2006. Investigating a Generic Paraphrase-Based Approach for Relation Extraction. In EACL-2006. pp. 409–416. Trento, Italy.

Szpektor, I.; Tanev, H.; Dagan, I.; and Coppola, B. 2004. Scaling web-based acquisition of entailment relations. In Proceedings of EMNLP 2004. pp. 41–48. Barcelona, Spain.
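
As a rough illustration of the word-counting example in Section 2, the map and reduce functions can be sketched in Python roughly as follows. This is a single-process toy under stated assumptions, not the Hadoop or Google API: the names map_fn, reduce_fn, and run_local are hypothetical, words are split on whitespace only, and the grouping loop in run_local merely stands in for the consolidation step that the abstraction layer would perform across many machines.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_fn(key: str, value: str) -> Iterator[Tuple[str, str]]:
    """key: document name; value: document contents.

    Emits one (word, "1") pair per whitespace-separated token.
    """
    for word in value.split():
        yield word, "1"


def reduce_fn(key: str, values: Iterable[str]) -> int:
    """key: a word; values: the counts emitted for that word."""
    return sum(int(v) for v in values)


def run_local(documents: Dict[str, str]) -> Dict[str, int]:
    """Hypothetical single-process stand-in for the abstraction layer.

    A real deployment would run map_fn on the nodes that hold each
    document and move only the grouped intermediate pairs over the
    network; here everything happens in one loop.
    """
    grouped: Dict[str, List[str]] = defaultdict(list)
    for name, contents in documents.items():
        for word, count in map_fn(name, contents):
            grouped[word].append(count)  # group matching intermediate keys
    return {word: reduce_fn(word, counts) for word, counts in grouped.items()}


if __name__ == "__main__":
    docs = {"doc1": "the cat sat", "doc2": "the dog sat down"}
    print(run_local(docs))
```

The point of the sketch is only that the researcher writes the two small functions; distribution, straggler handling, and aggregation belong to the layer that run_local imitates.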