Text mining
michel.bruley@teradata.com
December 2011
Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …
www.decideo.fr/bruley
Information context
A large amount of information is available in textual form in databases and online sources
At this scale, manual analysis and effective extraction of useful information are not feasible
Automatic tools are therefore needed to analyze large textual collections
Text mining definition
The objective of Text Mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc.
The results can be important both for:
– the analysis of the collection, and
– providing intelligent navigation and browsing methods
Text mining pipeline
Unstructured Text (implicit knowledge)
Structured content (explicit knowledge)
Text mining process
• Text preprocessing
– Syntactic/Semantic text analysis
• Features Generation
– Bag of words
• Features Selection
– Simple counting
– Statistics
• Text/Data Mining
– Classification: supervised learning
– Clustering: unsupervised learning
• Analyzing results
– Mapping/Visualization
– Result interpretation
Iterative and interactive process
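The first steps of the process above (preprocessing, bag-of-words feature generation, feature selection by simple counting) can be sketched in a few lines of Python. This is a minimal illustration only: the tokenizer, the `min_docs` threshold, and the three sample documents are assumptions made for the example, not part of the original presentation.

```python
import re
from collections import Counter

def tokenize(text):
    """Text preprocessing: lowercase the text and keep alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Features generation: map each token to its count in the document."""
    return Counter(tokenize(text))

def select_features(bags, min_docs=2):
    """Features selection by simple counting: keep terms that occur in
    at least min_docs documents (threshold is an assumption)."""
    doc_freq = Counter(term for bag in bags for term in bag)
    return sorted(term for term, df in doc_freq.items() if df >= min_docs)

def vectorize(bag, vocabulary):
    """Turn a bag of words into a fixed-length count vector."""
    return [bag[term] for term in vocabulary]

# Toy collection (hypothetical documents)
docs = [
    "Text mining finds patterns in text.",
    "Data mining finds patterns in tables.",
    "Clustering groups similar documents.",
]
bags = [bag_of_words(d) for d in docs]
vocabulary = select_features(bags)
vectors = [vectorize(b, vocabulary) for b in bags]
```

The resulting vectors are the structured input that a classification or clustering step would then consume.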
Text mining actors
• Publishers
– Enriched content
– Annotation tools
– Tools for authors
– New applications based on annotation layers
– Richer cross linking based on content…
• Analysts
– Empowers them
– Annotating research output
– Hypothesis generation
– Summarisation of findings
– Focused semantic search…
• Libraries
– Linking between Institutional repositories
– Access to richer metadata
– Aggregation
– Aids to subject analysis/classification …
Challenges in text mining
Collected data is “free text” and is not well organized (semi-structured or unstructured)
No uniform access across sources: each source has its own storage and algebra (examples: email, databases, applications, web)
A fivefold heterogeneity: semantic, linguistic, structure, format, and size of the unit of information
Learning techniques for processing text typically need annotated training data
XML as the common model allows:
– Manipulating data with standards
– Mining becomes more like data mining
– RDF is emerging as a complementary model
The more structure you can explore, the better you can mine
Data source administration
Sources: File System, Databases, Intranet EDMS, Internet (Web Crawling), On-line Databank, Information Provider
Format filter
XML Normalisation: subject, author, text corpora, keywords
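As an illustration of XML normalisation, the sketch below wraps one source document and its metadata in a common XML record using Python's standard `xml.etree.ElementTree`. The element names (`document`, `subject`, `author`, `keywords`, `keyword`, `text`) and the sample values are assumptions chosen for this example; the presentation does not specify a schema.

```python
import xml.etree.ElementTree as ET

def normalise(text, subject, author, keywords):
    """Wrap one source document in a common XML record.
    Element names are hypothetical, not a schema from the slides."""
    doc = ET.Element("document")
    ET.SubElement(doc, "subject").text = subject
    ET.SubElement(doc, "author").text = author
    kw = ET.SubElement(doc, "keywords")
    for k in keywords:
        ET.SubElement(kw, "keyword").text = k
    ET.SubElement(doc, "text").text = text
    return doc

# Hypothetical record built from one crawled page
record = normalise(
    "Free text of the page",
    subject="text mining",
    author="A. Author",
    keywords=["xml", "mining"],
)
xml_string = ET.tostring(record, encoding="unicode")

# Once normalised, every source can be queried the same way
found = [k.text for k in record.findall("./keywords/keyword")]
```

Whatever the original source (file system, database, web), downstream steps then only have to handle one record shape.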
Text mining tasks
TM
• Text Analysis Tools
– Feature extraction: Name Extractions, Term Extraction, Abbreviation Extraction, Relationship Extraction
– Categorization
– Summarization
– Clustering: Hierarchical Clustering, Binary relational Clustering
• Web Searching Tools
– Text search engine
– NetQuestion Solution
– Web Crawler
Information extraction
Keyword Ranking, Link Analysis, Query Log Analysis, Metadata Extraction, Intelligent Match, Duplicate Elimination
Extract domain-specific information from natural language text
– Need a dictionary of extraction patterns (e.g., “traveled to ” or “presidents of ”)
• Constructed by hand
• Automatically learned from hand-annotated training data
– Need a semantic lexicon (dictionary of words with semantic category labels)
• Typically constructed by hand
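A hand-built pattern dictionary and semantic lexicon of the kind described above might look like the following sketch. The shape of the slot after each trigger phrase (a run of capitalised words), the role names, the lexicon entries, and the sample sentence are all assumptions made for illustration.

```python
import re

# Slot filled after a trigger phrase: a run of capitalised words.
# This slot shape is an assumption for the sketch.
SLOT = r"([A-Z][a-z]+(?: [A-Z][a-z]+)*)"

# Hand-built dictionary of extraction patterns (role names are hypothetical)
PATTERNS = {
    "destination": re.compile(r"traveled to " + SLOT),
    "governed_entity": re.compile(r"presidents? of " + SLOT),
}

# Hand-built semantic lexicon: word -> semantic category label (toy entries)
LEXICON = {"France": "COUNTRY", "Paris": "CITY", "New York": "CITY"}

def extract(text):
    """Return (role, filler, semantic category) for every pattern match."""
    facts = []
    for role, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            filler = match.group(1)
            facts.append((role, filler, LEXICON.get(filler, "UNKNOWN")))
    return facts

text = "The minister traveled to New York. Former presidents of France met in Paris."
facts = extract(text)
```

In practice both dictionaries grow large, which is why the slide notes that patterns are often learned from hand-annotated training data instead of being written by hand.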
Document collections treatment
Categorization Clustering
Text Mining example: Obama vs. McCain
Aster Data position for Text Analysis
• Data Acquisition: Gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …)
• Pre-Processing: Perform processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …)
• Mining: Apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …)
• Analytic Applications: Leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...)
[Legend: Aster Data Fit, Third-Party Tools Fit]
Aster Data Value: Massive scalability of text storage and processing, functions for text processing, flexibility to develop diverse custom analytics and incorporate third-party libraries
Aster Data Value for Text Analytics
• Ability to store and process massive volumes of text data
– Massively parallel data stores and massively parallel analytics engine
– SQL-MapReduce framework enables in-database processing for
specialized text analytics tools
• Tools and extensibility for processing diverse text data
– SQL-MapReduce framework enables loading and transforming diverse
sources and types of text data
– Pre-built functions for text processing
• Flexible platform for building and processing diverse analytics
– SQL-MapReduce framework enables creation of flexible, reusable
analytics
– Embedded MapReduce processing engine for high-performance analytics
Aster Data Capabilities for Text Data
Pre-built SQL-MapReduce functions for text processing
• Data transformation utilities
- Pack: compress multi-column data into a single column
- Unpack: extract nested data for further analysis
• Web log analysis
- Sessionization: identify unique browsing sessions in clickstream data
• Text analysis
- Text parser: general tool for tokenizing, stemming, and counting text data
- nGram: split text into component parts (words & phrases)
- Levenshtein distance: compute “distance” between words
[Diagram: Aster Data nCluster, with Custom and Packaged Analytics (App), Aster Data Analytic Foundation (SQL, SQL-MapReduce), and Data layers]
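To make three of the functions listed above concrete, here is a plain-Python sketch of what n-gram splitting, Levenshtein distance, and sessionization compute. These are not the Aster Data SQL-MapReduce functions and do not reproduce their signatures; the function names, the whitespace tokenization, and the 1800-second session timeout are assumptions for illustration.

```python
def ngrams(text, n):
    """Split text into word n-grams (whitespace tokenization assumed)."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def sessionize(timestamps, timeout=1800):
    """Group click times (in seconds) into sessions: a gap longer than
    timeout starts a new session. The timeout value is an assumption."""
    sessions = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][-1] <= timeout:
            sessions[-1].append(t)
        else:
            sessions.append([t])
    return sessions

bigrams = ngrams("text mining at scale", 2)
distance = levenshtein("kitten", "sitting")
sessions = sessionize([0, 60, 4000, 4100])
```

In the platform described on the slide, equivalent logic would run in-database and in parallel across partitions of the data; the sketch only shows the per-record computation.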