Demonstration of Text Mining
National Centre for Text Mining
University of Manchester
Text mining tasks and resources
• Information retrieval
– Gather, select and filter documents that may prove useful
– Find what is known
• Information extraction
– Partial, shallow language analysis
– Find relevant entities, facts about entities
– Find only what you are looking for
– Combine, link facts
– Discover new knowledge, find new facts
• Resources: ontologies, lexicons, terminologies,
grammars, annotated corpora (machine learning,
• Conceptual annotations for facts, events, terms,
individual entities and relations
• Integration of terminological processing and
• Rule based temporal processing
• Scalability issues
– Distributed data and processing
– Incremental processing
– On the fly rapid access to ontologies
• Annotation and rule editors
Common Annotation Scheme
• XML-based Representation
• Stores textual, linguistic and metadata
• Document Structure:
– <front>: source, title, doc. Timestamp…
– <body>: text, sections, paragraphs, sentences,
– <back>: conceptual info name instances, entities,
Lexical Annotations
• Example text: "Denmark: Danisco defends fat replacer product after concerns in local press. Torben Svejgaard, President of Emulsifiers at Danish ingredients giant Danisco, has responded to the recent debate in the Danish press about the company's fat replacer - Salatrium."
• [Figure: spans of the example text carry lexical annotation labels (Person, term, consumer, bcycle); a second panel, "relating lexical annotations", shows an annotation labelled "AS Torben Svejgaard".]
Common Annotation Scheme Illustration
<title>22Nov1999 PHILIPPINES: IN BRIEF - Rebels nab son</title>
<tok id="t16" pos="NIL" lem="zamboanga" lookup="NIL" orth="uppercase" zone="" sepAfter="
<tok id="t17" pos="NIL" lem="city" lookup="NIL" orth="uppercase" zone=""
sepAfter=":">CITY</tok>. . .</s></p></sec>. . .
<PNAMEX id="pn3" type="individual" tokref="t46 t47" features ="id=Individual,path
=Thing:Individual,name = IndividualPosition=,Nation=,Importance="/>
<PEntity id="pe1" type="irish republican army" mnem="Irish Republican Army" refid="pn1"/>
<PEvent id="ev1" type="abduction" text="was abducted" refId="t26 t27" class="OCCURRENCE"
tense="PAST" polarity="POSITIVE" aspect="NONE" slot1="" slot2="" slot3="" slot4="" slot5=""
slot6="" slot7="" slot8="" />
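The illustration above stores annotations separately from the tokens and points back to them by token id. A minimal Python sketch of how a consumer might follow those references, assuming a simplified, well-formed fragment: the `doc` root element, the token ids, the token text and the `pos` values are invented for illustration, and only the `tok` and `PEvent` element names and their attribute names come from the slide.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment. Element and attribute names follow the
# illustration; the <doc> root and the token contents are assumptions.
SAMPLE = """
<doc>
  <body>
    <s>
      <tok id="t25" pos="NN">son</tok>
      <tok id="t26" pos="VBD">was</tok>
      <tok id="t27" pos="VBN">abducted</tok>
    </s>
  </body>
  <back>
    <PEvent id="ev1" type="abduction" refId="t26 t27"
            tense="PAST" polarity="POSITIVE"/>
  </back>
</doc>
"""


def resolve_events(xml_text):
    """Return (event type, covered text) pairs by following token references."""
    root = ET.fromstring(xml_text)
    # Index every token by its id so references can be resolved directly.
    tokens = {tok.get("id"): tok.text for tok in root.iter("tok")}
    events = []
    for event in root.iter("PEvent"):
        token_ids = event.get("refId", "").split()
        covered = " ".join(tokens[t] for t in token_ids if t in tokens)
        events.append((event.get("type"), covered))
    return events


events = resolve_events(SAMPLE)
```

Because the conceptual annotations only reference token ids, the body text is never duplicated or modified when new annotation layers are added.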
Text Mining Architecture
Parmenides Resource Manager
Defines pipelines, queues documents between processes
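The queuing idea can be sketched in Python as follows. This is not the Parmenides Resource Manager itself: `run_pipeline`, the two toy stages and the dictionary-based documents are all hypothetical names chosen for illustration, and the sketch assumes one worker thread per stage.

```python
import queue
import threading


def run_pipeline(stages, documents):
    """Run each stage in its own thread, passing documents through queues."""
    # One queue in front of every stage, plus one for the final output.
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    stop = object()  # sentinel marking the end of the document stream

    def worker(stage, inbox, outbox):
        while True:
            doc = inbox.get()
            if doc is stop:
                outbox.put(stop)  # propagate shutdown to the next stage
                return
            outbox.put(stage(doc))

    threads = [
        threading.Thread(target=worker, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for doc in documents:
        queues[0].put(doc)
    queues[0].put(stop)

    results = []
    while True:
        doc = queues[-1].get()
        if doc is stop:
            break
        results.append(doc)
    for thread in threads:
        thread.join()
    return results


# Toy stages standing in for format conversion and text zoning (assumptions).
def convert(doc):
    return {**doc, "format": "CAS"}


def zone(doc):
    return {**doc, "zones": ["front", "body"]}


processed = run_pipeline([convert, zone], [{"id": 1}, {"id": 2}])
```

With a single worker per stage and first-in, first-out queues, documents leave the pipeline in the order they entered it.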
Document capture and conversion
• Web and directory crawling
– Batch and interactive use
• Format conversion
– Word, HTML, PDF, etc. to Common Annotation Scheme
• Text zoning
– Separate front matter from body text
– Attempt to annotate headings etc.
• Texts are analysed at several levels leading to
a template representation of events
– Tokenization and tagging
– Sentence splitting and optional term discovery
– Ontology or gazetteer lookup
– Phrasal analysis to classify name expressions
– Phrasal analysis to fill slots of template
representations of entities and events.
to Semantic Parsing
– words, numbers, punct., tel.nos., chemical formulae, etc.
– Part of speech labelling; disambiguation in local context
• Semantic dictionary/ontology lookup
– Known names, terms, heads of terms and names
• Partial parsing
– Identify phrasal chunks - names, domain terms and other NPs;
temporal elements - tensed verbs, adverbials,…
• Semantic information extraction
– Build template or graph rep. of events/facts
The NLP components
• Part of Speech Tagger
– Based on Brill algorithm, locally trained from publicly
• Ontology Lookup
– Accesses semantic category and properties of application-
interesting words and phrases.
• Rule-based phrasal analyser
– Finds and labels phrases of application-interest, using tag,
lookup, orthography and output of other rules. Returns
feature values as well as span labels.
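A Brill-style tagger first assigns each word its most likely tag and then applies an ordered list of learned correction rules. The Python sketch below shows only the rule-application step, not training, and is not the Cafetiere tagger: the lexicon, the default tag and the single rule are toy assumptions, and only the "change tag A to B when the previous tag is C" rule template is modelled.

```python
def initial_tags(words, lexicon, default="NN"):
    """Assign each word its most likely tag from a lexicon (toy data here)."""
    return [lexicon.get(word.lower(), default) for word in words]


def apply_rules(tags, rules):
    """Apply ordered transformation rules of the form
    (from_tag, to_tag, previous_tag): retag from_tag as to_tag
    whenever the preceding token carries previous_tag."""
    tags = list(tags)
    for from_tag, to_tag, previous_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == previous_tag:
                tags[i] = to_tag
    return tags


# Hypothetical lexicon and rule list, for illustration only.
LEXICON = {"the": "DT", "rebels": "NNS", "want": "VBP", "to": "TO", "nab": "NN"}
RULES = [("NN", "VB", "TO")]  # a noun reading after "to" becomes a verb

words = "The rebels want to nab".split()
tags = apply_rules(initial_tags(words, LEXICON), RULES)
```

Training consists of searching for the rule that most reduces errors on a tagged corpus, adding it to the list, and repeating; that loop is omitted here.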
Tokenization and POS tagging
• Tokenizer separates words and other tokens,
analyzes each orthographically.
• Transformation-based Learning is used to train a part
of speech tagger.
• New Java implementation of Brill algorithm
• Fast in operation, circa 100K words/sec.
• Tagger available separately, distributed with
• Sentence splitter differentiates sentence
punctuation from other usages of .?!
• Next slide shows tabular view of token attributes
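Tokenization, orthographic analysis and sentence splitting can be sketched in Python as below. The regular expression, the orthography labels other than "uppercase", and the abbreviation list are assumptions made for illustration; the real components are likely to use richer rules.

```python
import re

# Deliberately simple: words (with internal hyphen or apostrophe) or single
# punctuation characters. Numbers such as "1,000" are split by this pattern.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

# Hypothetical abbreviation list; a full stop after these is not a boundary.
ABBREVIATIONS = {"dr", "mr", "mrs", "plc", "etc"}


def tokenize(text):
    return TOKEN_RE.findall(text)


def orthography(token):
    """Classify the surface form of a token."""
    if not any(ch.isalnum() for ch in token):
        return "punct"
    if token[0].isdigit():
        return "number"
    if token.isupper():
        return "uppercase"
    if token[0].isupper():
        return "capitalized"
    if token.islower():
        return "lowercase"
    return "mixed"


def split_sentences(tokens):
    """Split on . ? ! unless the mark follows a known abbreviation or the
    next token does not start with an uppercase letter."""
    sentences, current = [], []
    for i, token in enumerate(tokens):
        current.append(token)
        if token in {".", "?", "!"}:
            previous = tokens[i - 1].lower() if i > 0 else ""
            following = tokens[i + 1] if i + 1 < len(tokens) else None
            is_abbreviation = token == "." and previous in ABBREVIATIONS
            if not is_abbreviation and (following is None or following[0].isupper()):
                sentences.append(current)
                current = []
    if current:
        sentences.append(current)
    return sentences


tokens = tokenize("Dr. Smith arrived in ZAMBOANGA CITY. He left.")
sentences = split_sentences(tokens)
```

The full stop after "Dr" is kept inside the first sentence because "dr" is in the abbreviation list, which is the kind of distinction the sentence splitter has to make.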
• IE systems typically consult lists of known names of
places, people, organizations, artifacts, etc., and
tokens that heuristically indicate class of name, e.g.
Dr. , Plc.
• Cafetiere consults a knowledge base, which associates
ontology class and/or entity identifier, as well as slot
names and type constraints.
• Previous slide showed ontology class in the ‘lookup’
column of the token attributes.
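A minimal Python sketch of the lookup step, assuming a tiny in-memory knowledge base keyed by lower-cased token sequences and a longest-match strategy. The entries and class names are hypothetical, and Cafetiere's actual knowledge base also holds entity identifiers, slot names and type constraints, which are omitted here.

```python
# Hypothetical knowledge base: token sequence -> ontology class.
KNOWLEDGE_BASE = {
    ("danisco",): "Organization",
    ("torben", "svejgaard"): "Person",
    ("dr",): "PersonTitle",
    ("plc",): "CompanySuffix",
}
MAX_LEN = max(len(key) for key in KNOWLEDGE_BASE)


def lookup(tokens):
    """Return (start, end, class) spans, preferring the longest match
    that starts at each token position."""
    lowered = [token.lower() for token in tokens]
    spans = []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            ontology_class = KNOWLEDGE_BASE.get(tuple(lowered[i:i + n]))
            if ontology_class is not None:
                spans.append((i, i + n, ontology_class))
                i += n
                break
        else:
            i += 1  # no entry starts here; move to the next token
    return spans


tokens = "Torben Svejgaard of Danisco".split()
spans = lookup(tokens)
```

The returned class would populate the 'lookup' attribute of each token, with "NIL" (as in the XML illustration) for tokens that match nothing.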
Ontology/KB can be
browsed within Cafetiere
• Cafetiere finds instances of
– Proper names of people, places, organizations, or other
application-motivated named entities, e.g. genes, proteins.
– Temporal expressions, including adverbials, dates, verb
– Descriptive phrases, e.g. NPs in apposition to names.
• and classifies them by conceptual category.
• Next slides show how these can be accessed after
analysis in a document browser.
• Typically, named entities are mentioned several times
in a text.
• Cafetiere groups together the instances of each
named entity when creating a conceptual annotation.
• Resolves some co-references, especially variant forms
of proper names.
• An initial entity is created when a name expression is
found. Later occurrences are added to form
• Next slide shows the phrasal instances of a single
entity mentioned in the text.
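Grouping name variants into one entity can be sketched in Python as follows. The matching heuristic (same category, and one name's words contained in the other's) is an assumption made for illustration and is much cruder than a real co-reference resolver.

```python
def group_mentions(mentions):
    """Group (text, category) mentions into entities.

    A mention joins an existing entity when the categories agree and the
    word set of one is contained in the word set of the other, so that
    "Svejgaard" joins "Torben Svejgaard". Otherwise a new entity is created.
    """
    entities = []
    for text, category in mentions:
        words = set(text.lower().split())
        for entity in entities:
            same_category = entity["category"] == category
            overlaps = words <= entity["words"] or entity["words"] <= words
            if same_category and overlaps:
                entity["mentions"].append(text)
                entity["words"] |= words
                break
        else:
            entities.append(
                {"category": category, "words": words, "mentions": [text]}
            )
    return entities


mentions = [
    ("Torben Svejgaard", "Person"),
    ("Danisco", "Organization"),
    ("Svejgaard", "Person"),
    ("Danisco", "Organization"),
]
entities = group_mentions(mentions)
```

Each resulting entity keeps the list of its phrasal instances, which is what a per-entity view in a document browser would display.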
• At the phrasal analysis stage, verb groups and noun
phrases denoting events have TimeML features
• Event extraction is either rule-based or ontology-
• Each event type has a number of conceptual slots, and
Cafetiere matches conceptually annotated text
fragments within the sentence, where they match the
• Next slide shows the event browser after analysis,
with one event’s slots in a detail view.
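A minimal Python sketch of slot filling, assuming that a text fragment is matched to a slot when its conceptual category equals the slot's type constraint (the sentence describing the exact matching criterion is cut off in the slide). The event type table, slot names and categories below are hypothetical.

```python
# Hypothetical event ontology: event type -> {slot name: required category}.
EVENT_TYPES = {
    "abduction": {
        "agent": "Organization",
        "victim": "Person",
        "location": "Place",
    },
}


def fill_slots(event_type, annotations):
    """Fill each slot with the first annotated fragment in the sentence whose
    category satisfies the slot's type constraint. Every compatible fragment
    is kept as an alternative."""
    slots = {}
    for slot, required_category in EVENT_TYPES[event_type].items():
        candidates = [
            text for text, category in annotations if category == required_category
        ]
        slots[slot] = {
            "filler": candidates[0] if candidates else None,
            "alternatives": candidates[1:],
        }
    return slots


# Conceptually annotated fragments from one sentence (invented example).
annotations = [("Rebels", "Organization"), ("Zamboanga City", "Place")]
event = fill_slots("abduction", annotations)
```

Keeping the unused candidates alongside the chosen filler makes it possible for a reviewer to swap in an alternative later.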
• The Cafetiere browser also has editing controls,
which enable a corrected annotation to be saved.
• Useful for applications where validated extracted
data are presented to the end users.
• Useful for annotating documents to create training or
evaluation gold standard data.
• For events, browser shows all compatible entities as
alternative slot fillers to the one selected by the
Viewing Analysis Results
Press Release Analyst by Biovista
Mining Scientific Literature
• In Parmenides project, a case study has been
conducted with Unilever on mining scientific
papers on weight management.
• In NaCTeM, work on Term Management
focuses on domain terminology for biological
sciences and medicine.
• Joint work with Lancaster University seeks to
advance conceptual summarization by
extracting causal relations expressed in text.
Unilever Case Study
• Weight management experimental papers
• Template-level representation of key features of the
– Clinical study subjects
– Study population
– Clinical study design
– Nutritional metabolic phenomenon
– Work function (effect)
– Health benefit
• Same techniques as for news NE analysis
• Targeting descriptive phrases rather than
complex proper names.
• Domain terms important
• Sentence-based fact extraction won’t work,
because information distributed throughout
Template slot filling
• Each slot has many candidate fillers
– Often benign paraphrases of each other, but can
emphasize different attributes
– Selection is heuristic, based on proximity to
trigger word and similarity of terms.
– For Unilever’s purposes, acceptable to get filler
wrong sometimes if the candidates can be easily
substituted in the template editor.
– Not yet handled properly: multiple slot fillers
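The selection heuristic can be sketched in Python as a weighted score of proximity to the trigger word and term similarity. The scoring formula, the equal weighting, the use of Jaccard word overlap as the similarity measure, and the example candidates are all assumptions made for illustration; the slides do not specify them.

```python
def jaccard(a, b):
    """Word-overlap similarity between two phrases (an assumed measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def rank_fillers(candidates, trigger_pos, slot_term, weight=0.5):
    """Rank (text, token position) candidates for one slot.

    The score mixes proximity to the trigger word with similarity to the
    slot's term. All candidates are returned, best first, so that the
    others stay available as substitutes.
    """
    def score(candidate):
        text, position = candidate
        proximity = 1.0 / (1 + abs(position - trigger_pos))
        return weight * proximity + (1 - weight) * jaccard(text, slot_term)

    return sorted(candidates, key=score, reverse=True)


# Invented candidates for a "study subjects" slot; positions are token offsets.
candidates = [
    ("obese adult subjects", 12),
    ("healthy volunteers", 40),
    ("the subjects", 90),
]
ranked = rank_fillers(candidates, trigger_pos=10, slot_term="study subjects")
best_filler = ranked[0][0]
```

Returning the whole ranking rather than a single winner fits the workflow described above, where a wrong filler is acceptable as long as the other candidates can be substituted easily in the template editor.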
• Joint work with Chris Paice at Lancaster
• Paice and Jones (1993) pioneered template-
based abstracting of ‘crop science’ papers.
• Similar analysis to Unilever case study,
leading to a template, and then a generated
short informative abstract.
• Major drawback is the domain-specific
resource development needed.
Ameliorating the resource bottleneck
• One approach (Paice and Oakes) is to develop rules by
supervised machine learning
– Transfers effort from rule-based analysis to corpus
• Current approach (Paice and Black) is to develop domain-independent
extraction of causal and other key
relationships expressed in scientific papers
• Incorporates term discovery, ontology lookup,
stemming, tagging. Implemented in Cafetiere
NLP Components as Services
• Stratified processing modules well-defined
• Common Annotation Scheme facilitates
• Each module potentially a Web service
• Queuing between modules should minimize
• User defines pipeline; processing takes place
on servers; users share cached common
module analyses; users access curated data.
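Sharing cached module analyses could work along the following lines. This Python sketch is an assumption about one possible design, not a description of the actual services: `AnalysisCache` is a hypothetical class, results are keyed by module name plus a hash of the document text, and the cache lives in memory rather than on a server.

```python
import hashlib


class AnalysisCache:
    """Cache module output so identical (module, document) requests from
    different users are computed only once."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    def run(self, module_name, module, text):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (module_name, digest)
        if key in self._store:
            self.hits += 1  # served from cache, no reprocessing
        else:
            self._store[key] = module(text)
        return self._store[key]


cache = AnalysisCache()
# str.split stands in for a real tokenizer service here.
first = cache.run("tokenizer", str.split, "Rebels nab son")
second = cache.run("tokenizer", str.split, "Rebels nab son")
```

Including the module name in the key keeps the outputs of different pipeline stages separate even when they process the same document.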
• Cafetiere is a framework for the information
extraction phases of text mining.
• Incorporates context-sensitive partial parsing
of names, terms, chunks.
• Linkage to event ontology enables template
slot filling and hence fact/event extraction.
• Domain-specific resources (ontology, rules) can
be developed for diverse domains: business,
• Use corpus-trained components for phrasal
analysis of bio-medical literature.
• Improve rule-application engine performance.
• Cluster processing to balance IE processing
against more superficial processes.