Text Mining

Introduction to text mining, synthetic overview

Shared by: Michel Bruley
Posted: 12/2/2011
Language: English
Pages: 15
Text mining



michel.bruley@teradata.com









December 2011







Extract from various presentations: Temis, URI-INIST-CNRS, Aster Data …





www.decideo.fr/bruley

Information context





• A large amount of information is available in textual form, in databases and online sources

• At this scale, manual analysis and effective extraction of useful information are not possible

• Automatic tools are therefore needed to analyze large text collections






Text mining definition



The objective of text mining is to exploit information contained in textual documents in various ways, including … discovery of patterns and trends in data, associations among entities, predictive rules, etc.

The results can be important both for:

• the analysis of the collection, and

• providing intelligent navigation and browsing methods




Text mining pipeline





Unstructured text (implicit knowledge)

→ Structured content (explicit knowledge)










Text mining process



• Text preprocessing
– Syntactic/semantic text analysis

• Features generation
– Bag of words

• Features selection
– Simple counting
– Statistics

• Text/data mining
– Classification (supervised learning)
– Clustering (unsupervised learning)

• Analyzing results
– Mapping/visualization
– Result interpretation

Iterative and interactive process
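
The first three steps of the process above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any tool named in these slides: the function names, the sample documents and the document-frequency threshold `min_docs` are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Text preprocessing: lowercase and split into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Features generation: map each token to its count in the document."""
    return Counter(tokenize(text))

def select_features(bags, min_docs=2):
    """Features selection by simple counting: keep the terms that occur
    in at least `min_docs` documents (threshold is an assumption)."""
    doc_freq = Counter()
    for bag in bags:
        doc_freq.update(bag.keys())  # count each term once per document
    return sorted(term for term, n in doc_freq.items() if n >= min_docs)

def vectorize(bag, vocabulary):
    """Turn one bag of words into a fixed-length count vector."""
    return [bag[term] for term in vocabulary]

# Hypothetical three-document collection
docs = [
    "Text mining finds patterns in text",
    "Data mining finds patterns in tables",
    "Clustering groups similar text documents",
]
bags = [bag_of_words(d) for d in docs]
vocabulary = select_features(bags)
vectors = [vectorize(b, vocabulary) for b in bags]

print(vocabulary)  # ['finds', 'in', 'mining', 'patterns', 'text']
print(vectors)     # [[1, 1, 1, 1, 2], [1, 1, 1, 1, 0], [0, 0, 0, 0, 1]]
```

The resulting vectors are the structured input that a classification or clustering step would then consume.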


Text mining actors

Publishers
• Enriched content
• Annotation tools
• Tools for authors
• New applications based on annotation layers
• Richer cross linking based on content…

Analysts
• Empowers them
• Annotating research output
• Hypothesis generation
• Summarisation of findings
• Focused semantic search…

Libraries
• Linking between institutional repositories
• Access to richer metadata
• Aggregation
• Aids to subject analysis/classification …








Challenges in text mining



• The data collection is “free text” and is not well organized (semi-structured or unstructured)

• No uniform access across sources: each source has its own storage and algebra (examples: email, databases, applications, web)

• A quintuple heterogeneity: semantic, linguistic, structure, format, size of the unit of information

• Learning techniques for processing text typically need annotated training data

• XML as the common model allows:

– Manipulating data with standards

– Mining that becomes closer to data mining

– RDF is emerging as a complementary model

• The more structure you can explore, the better the mining


Data source administration



Sources
• File system
• Databases
• Intranet EDMS
• Internet (Web crawling)
• On-line databank
• Information provider

Format filter

XML normalisation
– subject
– author
– text corpora
– keywords
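
As an illustration of the XML normalisation step, the sketch below wraps one source document in a common XML record carrying the four fields listed on this slide, then queries it with a standard tool. The element names, the `normalise` helper and the sample values are assumptions made for the example; the slides do not specify a schema.

```python
import xml.etree.ElementTree as ET

def normalise(subject, author, text, keywords):
    """Wrap one source document in a common XML record (hypothetical schema)."""
    doc = ET.Element("document")
    ET.SubElement(doc, "subject").text = subject
    ET.SubElement(doc, "author").text = author
    ET.SubElement(doc, "text").text = text
    keyword_list = ET.SubElement(doc, "keywords")
    for keyword in keywords:
        ET.SubElement(keyword_list, "keyword").text = keyword
    return doc

# Hypothetical record; in practice the values would come from the format filter
record = normalise(
    subject="Text mining",
    author="Example Author",
    text="Free text taken from a file, a database row or a crawled page.",
    keywords=["text mining", "XML"],
)

# Serialise, reload, and query with ElementTree's XPath subset
xml_string = ET.tostring(record, encoding="unicode")
reloaded = ET.fromstring(xml_string)
found_keywords = [k.text for k in reloaded.findall("./keywords/keyword")]
print(found_keywords)
```

Once every source is expressed in the same record shape, the same query works regardless of where the text came from.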






Text mining tasks



TM

Text analysis tools
• Feature extraction
– Name extraction
– Term extraction
– Abbreviation extraction
– Relationship extraction
• Categorization
• Summarization
• Clustering
– Hierarchical clustering
– Binary relational clustering

Web searching tools
• Text search engine
• NetQuestion Solution
• Web crawler








Information extraction



• Keyword ranking
• Link analysis
• Query log analysis
• Metadata extraction
• Intelligent match
• Duplicate elimination

Extract domain-specific information from natural language text

– Need a dictionary of extraction patterns (e.g., “traveled to ” or “presidents of ”)
• Constructed by hand
• Automatically learned from hand-annotated training data

– Need a semantic lexicon (dictionary of words with semantic category labels)
• Typically constructed by hand
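
The two resources named above, a dictionary of extraction patterns and a semantic lexicon, can be illustrated with a small hand-built example. Everything here is hypothetical: the pattern labels, the capitalized-word slot that follows each trigger phrase, the lexicon entries and the sample sentence are assumptions chosen for the sketch, not content from the source presentations.

```python
import re

# Hand-built extraction patterns: a trigger phrase followed by a slot.
# The slot is assumed to be a single capitalized word.
PATTERNS = {
    "destination": re.compile(r"traveled to (?P<slot>[A-Z][a-z]+)"),
    "country_led": re.compile(r"presidents? of (?P<slot>[A-Z][a-z]+)"),
}

# Hand-built semantic lexicon: word -> semantic category label
LEXICON = {"Paris": "city", "France": "country"}

def extract(text):
    """Return (pattern label, slot filler, semantic category) triples."""
    results = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            filler = match.group("slot")
            results.append((label, filler, LEXICON.get(filler, "unknown")))
    return results

facts = extract("The delegation traveled to Paris to meet the presidents of France.")
print(facts)
# [('destination', 'Paris', 'city'), ('country_led', 'France', 'country')]
```

Learning such patterns automatically, as the slide mentions, would replace the hand-written `PATTERNS` dictionary with one induced from annotated training data.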






Document collections treatment





Categorization Clustering










Text Mining example: Obama vs. McCain










Aster Data position for Text Analysis



• Data Acquisition: gather text from relevant sources (web crawling, document scanning, news feeds, Twitter feeds, …)

• Pre-Processing: perform processing required to transform and store text data and information (stemming, parsing, indexing, entity extraction, …)

• Mining: apply data mining techniques to derive insights about stored information (statistical analysis, classification, natural language processing, …)

• Analytic Applications: leverage insights from text mining to provide information that improves decisions and processes (sentiment analysis, document management, fraud analysis, e-discovery, ...)

Legend: Aster Data Fit, Third-Party Tools Fit

Aster Data Value: massive scalability of text storage and processing; functions for text processing; flexibility to develop diverse custom analytics and incorporate third-party libraries






Aster Data Value for Text Analytics



• Ability to store and process massive volumes of text data
– Massively parallel data stores and massively parallel analytics engine
– SQL-MapReduce framework enables in-database processing for specialized text analytics tools

• Tools and extensibility for processing diverse text data
– SQL-MapReduce framework enables loading and transforming diverse sources and types of text data
– Pre-built functions for text processing

• Flexible platform for building and processing diverse analytics
– SQL-MapReduce framework enables creation of flexible, reusable analytics
– Embedded MapReduce processing engine for high-performance analytics


Aster Data Capabilities for Text Data

Pre-built SQL-MapReduce functions for text processing

• Data transformation utilities
- Pack: compress multi-column data into a single column
- Unpack: extract nested data for further analysis

• Web log analysis
- Sessionization: identify unique browsing sessions in clickstream data

• Text analysis
- Text parser: general tool for tokenizing, stemming, and counting text data
- nGram: split text into component parts (words & phrases)
- Levenshtein distance: compute “distance” between words

[Diagram: Aster Data nCluster: Custom and Packaged Analytics (Apps); Aster Data Analytic Foundation; SQL, SQL-MapReduce; Data]
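
To make three of the functions listed above concrete, here is a plain-Python sketch of the underlying techniques: word n-grams, Levenshtein (edit) distance, and timeout-based sessionization. This is not the Aster Data SQL-MapReduce API; the function names, signatures and the 30-minute session timeout are assumptions made for illustration only.

```python
def ngrams(text, n):
    """Split text into overlapping word n-grams."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def sessionize(timestamps, timeout=1800):
    """Assign a session number to each click of one user.
    `timestamps` are sorted, in seconds; a gap longer than `timeout`
    starts a new session (the 30-minute default is an assumption)."""
    sessions = []
    session_id = 0
    last = None
    for t in timestamps:
        if last is not None and t - last > timeout:
            session_id += 1
        sessions.append(session_id)
        last = t
    return sessions

print(ngrams("text mining at scale", 2))   # ['text mining', 'mining at', 'at scale']
print(levenshtein("kitten", "sitting"))    # 3
print(sessionize([0, 60, 4000, 4100]))     # [0, 0, 1, 1]
```

In a massively parallel setting, each of these would run per partition (for example per user for sessionization), which is the kind of workload the SQL-MapReduce framework on this slide is described as supporting.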
