Open Source Text Mining
Hinrich Schütze, Enkata
Text Mining 2003 @ SDM03
Cathedral Hill Hotel, San Francisco
May 3, 2003
Open source used to be a crackpot idea.
Bill Gates on Linux (1999.03.24): “I really don't think in the
commercial market, we'll see it in any significant way.”
MS 10-Q quarterly filing (2003.01.31): “The popularization
of the open source movement continues to pose a
significant challenge to the company's business model.”
Open source is an enabler for radical new things
Ultra-cheap web servers
Walmart PC for $200
Open Source Dominates
Source: Netcraft
Text mining has not had much impact.
Many small companies & small projects
No large-scale adoption
Exception: text-mining-enhanced search
Text mining could transform the world.
Unstructured → structured
Amount of information has exploded
Amount of accessible information has not
Can open source text mining make this happen?
Unstructured vs Structured
Prabhakar Raghavan, Verity
High cost of deploying text mining solutions
How can we lower this cost?
100% proprietary solutions
Require re-invention of core infrastructure
Leave fewer resources for high-value
applications built on top of core
Public domain, BSD, GPL (GNU General Public License)
Like data mining but for text
NLP (Natural Language Processing)
Has interesting applications now
More than just information retrieval / search
Usually: some statistical, probabilistic or
frequentist component
Text Mining vs. NLP
(Natural Language Processing)
What is not text mining: speech, language
models, parsing, machine translation
Typical text mining: clustering, information
extraction, question answering
Statistical and high volume
Text Mining: History
80s: Electronic text gives birth to Statistical
Natural Language Processing (StatNLP).
90s: DARPA sponsors Message
Understanding Conferences (MUC) and
Information Extraction (IE) community.
Mid-90s: Data Mining becomes a discipline
and usurps much of IE and StatNLP as “text mining.”
Text Mining: Hearst’s Definition
JobTitle: Ice Cream Guru
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Goal: Connect two disconnected subfields of the literature
Start with 1st subfield
Identify key concepts
Search for 2nd subfield with same concepts
Implemented in Arrowsmith system
Discovery: magnesium is a potential treatment for migraine
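The three-step procedure above can be sketched in a few lines of Python. The documents, terms, and the simple intersection-based concept selection are toy stand-ins for what Arrowsmith actually does over MEDLINE:

```python
# Sketch of Swanson-style literature-based discovery (toy data).

field_a = [  # 1st subfield, e.g. the migraine literature
    {"migraine", "spreading", "depression"},
    {"migraine", "vascular", "spreading"},
]
field_c = [  # candidate documents from a 2nd subfield
    {"magnesium", "spreading", "depression"},
    {"calcium", "channel"},
]

# Key concepts of the 1st subfield: here simply the terms shared by
# all of its documents (real systems use frequency-based term weights).
key_concepts = set.intersection(*field_a)

# Search the 2nd subfield for documents sharing those concepts.
ranked = sorted(field_c, key=lambda d: len(d & key_concepts), reverse=True)
```

The top-ranked document links magnesium to the migraine literature through the shared concept "spreading", mirroring the discovery on the slide.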
When is Open Source Successful?
Many users (operating system)
Fun to work on (games)
Public funding available (OpenBSD, security)
Open source author gains fame/satisfaction/immortality/community
A little adaptation is easy
Most users do not need any adaptation (out of the box use)
Incremental releases are useful
Cost sharing without administrative/legal overhead
Dozens of companies with significant interest in Linux (IBM, …)
Many of these companies contribute to open source
This is in effect an informal consortium
A formal effort probably would have killed linux.
Same applies to text mining?
Also: bugs, security, high-availability, ideal for consulting &
hardware companies like IBM
When is Open Source Not Successful?
Boring & rare problem
Print driver for a 10-year-old printer
Complex integrated solutions
Good UI experience for non-geeks
(at least for now)
Text Mining and Open Source
Important problem: fame, satisfaction,
immortality, community can be gained
Pooling of resources / critical mass
Most text mining requires significant adaptation.
Most text mining requires data resources as
well as source code.
The need for data resources does not fit well
into the open source paradigm.
Text Mining Open Source Today
Excellent for information retrieval, but not
much text mining.
Rain/bow, Weka, GTP, TDMAPI
Text mining algorithms / infrastructure, no data resources
NLP toolkit, some data resources
Excellent data resources, but not enough
Open Source with Open Data
Spell checkers (e.g., Emacs)
Antispam software (e.g., spamassassin)
Named entity recognition (Gate/Annie)
Free version less powerful than in-house
SpamAssassin: Code + Data
Open Data Resources:
Classification model for spam
Named entity recognition
Word lists, dictionaries
Domain model, taxonomies, regular expressions
Code vs Data
Needed: Named entity recognition
Complex & integrated SW: Linux
Good UI design: Web servers
Open Source with Data: Key Issues
Can data resources be recycled?
Problems have to be similar.
More difficult than one would expect: my first attempt failed.
Next: case study
Assume there is a large library of data resources
How do we identify the data resources that can be recycled?
How do we adapt them?
How do we get from here to there?
Need incremental approach that is sustained by
successes along the way.
Text Mining without Data
Premise: “Knowledge-poor” text mining taps
small part of potential of text mining.
Knowledge-poor text mining examples
First story detection
Many success stories
Case Study: ODP -> Reuters
Train on ODP
Apply to Reuters
Case Study: Text Classification
Key Issues for text classification
Show that text classifiers can be recycled
How can we select reusable classifiers for a new task?
How do we adapt them?
Train classifiers on open directory (ODP)
165,000 docs (nodes), crawled in 2000, 505 classes
Apply classifiers to Reuters RCV1
780,000 docs, >1000 classes
Hypothesis: A library of classifiers based on
ODP can be recycled for RCV1.
Train 505 classifiers on ODP
Apply them to Reuters
Compute chi2 for all ODP x Reuters pairs
Evaluate n pairs with the best chi2
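The pair-scoring step can be sketched as follows, assuming each trained ODP classifier emits binary predictions over the Reuters documents. The predictions and labels are toy data, and since the slides do not specify the exact variant, the standard 2x2 contingency-table chi-square statistic is used:

```python
# Sketch of the ODP x Reuters matching step (toy data).

def chi2_2x2(pred, gold):
    """Chi-square statistic for the 2x2 contingency table of two
    binary sequences (classifier prediction vs. gold label)."""
    n = len(pred)
    obs = {(p, g): 0 for p in (0, 1) for g in (0, 1)}
    for p, g in zip(pred, gold):
        obs[(p, g)] += 1
    p1 = sum(pred) / n  # P(prediction = 1)
    g1 = sum(gold) / n  # P(label = 1)
    chi2 = 0.0
    for p in (0, 1):
        for g in (0, 1):
            expected = n * (p1 if p else 1 - p1) * (g1 if g else 1 - g1)
            if expected > 0:
                chi2 += (obs[(p, g)] - expected) ** 2 / expected
    return chi2

# One ODP classifier's predictions vs. one Reuters class; in the study
# this is computed for all ODP x Reuters pairs and the top-n pairs kept.
pred = [1, 1, 1, 0, 0, 0, 0, 0]
gold = [1, 1, 0, 0, 0, 0, 0, 0]
score = chi2_2x2(pred, gold)
```

A perfectly associated pair reaches chi2 = n, unrelated pairs stay near 0, so ranking pairs by chi2 surfaces the ODP classifiers worth recycling for each Reuters class.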
Area under ROC curve
Plot false positive rate vs true positive rate
Compute area under the curve
Rank documents, compute precision for each rank
Average for all positive documents
Estimated based on 25% sample
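Both evaluation measures can be sketched directly from their definitions above; the scores and labels below are hypothetical classifier output, not RCV1 results:

```python
# Minimal implementations of the two evaluation measures (toy data).

def roc_auc(scores, labels):
    """Area under the ROC curve via the rank identity: the probability
    that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Rank documents by score, compute precision at each positive
    document's rank, and average over the positive documents."""
    ranked = sorted(zip(scores, labels), key=lambda x: -x[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 0, 1, 0, 0, 1]
auc = roc_auc(scores, labels)  # 5 of 9 positive/negative pairs ranked correctly
ap = average_precision(scores, labels)
```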
Japan: ODP -> Reuters
BusIndTraMar0 / I76300: Ports
These are results without any adaptation.
Performance expected to be much better with adaptation
Class relationships are m:n, not 1:1
ODP: RegEurUniBusInd0 (UK industries)
I13000 (petroleum & natural gas)
I17000 (water supply)
I32000 (mechanical engineering)
I66100 (restaurants, cafes, fast food)
I9741105 (radio broadcasting)
Why Recycling Classifiers is Hard
Autonomous vs relative decisions
ODP Japan classifier w/o modifications has
high precision, but only 1% recall on RCV1!
Most classifiers are tuned for optimal
performance in embedded system.
Tuning decreases robustness in recycling.
Tokenization, document length, numbers
Numbers throw off medline vs. non-medline
categorizer (financial classified as medical)
Length-sensitive multinomial Naïve Bayes: scores scale with document length
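The length sensitivity can be illustrated with a toy two-class multinomial Naive Bayes model; the vocabulary and probabilities below are made up:

```python
import math

# Toy two-class multinomial Naive Bayes over a two-word vocabulary.
# Per-word log-odds add up, so a document's score grows linearly with
# its length: a decision threshold tuned on short training documents
# misbehaves on longer documents from the recycled domain.

logp = {
    "A": {"oil": math.log(0.6), "game": math.log(0.4)},
    "B": {"oil": math.log(0.3), "game": math.log(0.7)},
}

def log_odds(doc):
    """log P(doc | A) - log P(doc | B) under the multinomial model."""
    return sum(logp["A"][w] - logp["B"][w] for w in doc)

short = ["oil", "oil", "game"]
longer = short * 10  # identical word distribution, 10x the length
```

The score of `longer` is exactly ten times that of `short`, although the two documents are about the same topic; only the amount of accumulated evidence changed.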
What would an open source text classification
package look like?
Text mining algorithms
To adapt recycled data resources
To create new data resources
Recycled data resources
Newly created data resources
Pick a good area
Bioinformatics: genes / proteins
Product catalogs
Other Text Mining Areas
Named entity recognition
Data vs Code
What about just sharing training sets?
What about just sharing models?
Small preprocessing changes can throw you off.
Share (simple?) classifier cum preprocessor
Still proprietary issues
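A toy illustration of the preprocessing hazard; the model weights and tokenizer choices are hypothetical:

```python
# A shared linear spam model is only meaningful together with the
# exact preprocessor that produced its features: keeping vs. stripping
# punctuation flips the decision here.

weights = {"free": 3.0, "meeting": -2.0}

def score(tokens):
    # Unknown tokens get weight 0, which is where the mismatch bites.
    return sum(weights.get(t, 0.0) for t in tokens)

text = "free! meeting"

tokens_kept = text.split()                        # ['free!', 'meeting']
tokens_stripped = text.replace("!", " ").split()  # ['free', 'meeting']

ham_score = score(tokens_kept)       # 'free!' unseen -> -2.0 (ham)
spam_score = score(tokens_stripped)  # 'free' matches -> 1.0 (spam)
```

This is why sharing a classifier cum preprocessor, rather than the model alone, is the safer unit of exchange.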
Open Source & Data
Code+Data -> publish -> enhanced Code+Data
Code+Data -> adapt -> enhanced Code+Data
Open source is successful because it makes
free riding hard.
Viral nature of GPL.
Harder to achieve for some data resources
Apply to your data
You own 100% of the result
Less of a problem for dictionaries and word lists
Open Directory License
No license to sell derivative works?
Some criteria for derivative works
Substantially similar (Seinfeld trivia)
Potential damage to future marketing of derivative works
Code vs Data Licenses
If I open-source my code, then I will benefit
from bug fixes & enhancements written by others.
If I open-source my data resource, then my
classification model may become more robust
due to improvements made by others.
Code is very abstract: few issues with
proprietary information creeping in.
Text mining resources are not very abstract:
there is a potential for sensitive information creeping in.
Areas in Need of Research
How to identify reusable text mining components
ODP/Reuters case study does not address this.
Need (small) labeled sample to be able to do this?
How to adapt reusable text mining components
Interactive parameter tweaking?
Combination of recycled classifier and new training data?
Most estimation techniques require large labeled samples.
The point is to avoid construction of a large labeled sample.
Create viral license for data resources.
Many interesting research issues
Need institution/individual to take the lead
Need motivated network of contributors
data resource contributors
source code contributors
Start with small & simple project that proves the concept
If it works … text mining could become an
enabler on a par with Linux.
ODP class         RCV1 class  AUC   Avg. prec.
RegAsiJap0        JAP         0.86  0.62
RegAsiPhi0        PHLNS       0.91  0.56
RegAsiIndSta0     INDIA       0.85  0.53
SpoSocPla0        CCAT        0.60  0.53
RegEurRus0        CCAT        0.58  0.51
RegEurRus0        RUSS        0.85  0.51
SpoSocPla0        GSPO        0.78  0.42
SpoBasReg0        GSPO        0.75  0.33
RegAsiIndSta0     MCAT        0.56  0.32
SpoBasPla1        GSPO        0.80  0.31
SpoBasCol0        GSPO        0.78  0.31
SpoBasCol1        GSPO        0.74  0.26
RegEurSlo0        SLVAK       0.86  0.25
SpoBasPla0        GSPO        0.77  0.24
RegEurRus0        MCAT        0.49  0.23
BusIndTraMar0     I76300      0.81  0.23
SpoHocIceLeaPro0  GSPO        0.71  0.20
SpoBasMinLea0     GSPO        0.71  0.20
RegMidLeb0        LEBAN       0.83  0.19
RecAvi0           I36400      0.74
RegSou0           BRAZ        0.84  0.18
http://www-csli.stanford.edu/~schuetze (this talk, some
Source of Gates quote:
Kurt D. Bollacker and Joydeep Ghosh. A scalable method for
classifier knowledge reuse. In Proceedings of the 1997
International Conference on Neural Networks, pages 1474-79,
June 1997. (proposes measure for selecting classifiers for reuse)
W. Cohen and D. Kudenko. Transferring and Retraining Learned
Information Filters. In Proceedings of the Fourteenth National
Conference on Artificial Intelligence (AAAI-97), 1997. (transfer within
the same dataset)
Kurt D. Bollacker and Joydeep Ghosh. A supra-classifier
architecture for scalable knowledge reuse. In The 1998
International Conference on Machine Learning, pp. 64-72, July
1998. (transfer within the same dataset)
Motivation of open source contributors: