Introduction to Textual Data Mining
Charles Nicholas
CSEE Department, UMBC
nicholas@cs.umbc.edu
May 4, 1999
What is Data Mining?
• A hot buzzword at the moment
• Usually used in the context of relational databases
• Typical question: if a grocery store customer buys formula and diapers, how likely is it that they’ll also buy beer?
– So, should beer be stocked near the baby supplies?
What is Textual Data Mining?
• Textual Data Mining is the effort to gain insight from some collection of textual data.
• Textual Data Mining draws ideas from information retrieval, computational linguistics, artificial intelligence, etc.
– pattern recognition
– machine learning
– natural language processing
Why do we care?
• Information overload: to characterize documents by topic, with little or no human intervention, to help filter information
• Query processing: to answer ad hoc questions
• Discovery: to identify new “interesting” trends in documents and corpora
Issues in Textual Data Mining
• Can we gain insight from text as we do from databases?
• Strategies:
– Statistical: treat documents as sets of independent variables
– Linguistic: analyze the syntax and/or semantics of documents
– Graphical: treat documents as objects that can be visualized
Strategies
• Statistical
– related to IR, i.e. documents are treated as vectors
– examples: authorship and topic spotting
• Linguistic
– lots of NLP, which is hard
– related to database form of DM, e.g. MUC
• Graphical
– plot graphs of document attributes
– rely on humans to spot patterns
Using Linear Algebra in Statistical TDM
• Singular value decomposition is essentially a pattern recognition technique, and is (therefore) an example of statistical data mining
• It can be used to ask questions such as:
– Are these documents written by the same person?
– Do these documents touch on the same subject?
Term-Document Matrix
• Arrange the documents in matrix form, with “terms” down the side and documents across the top
• This term-document matrix is usually sparse; term weighting can make it more so
• Terms can be words or ngrams
– we chose ngrams because of the need to process noisy and/or multilingual documents
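The matrix described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `char_ngrams` and `term_document_matrix`, the toy documents, and the choice of raw counts with no term weighting are all assumptions made for the example.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def term_document_matrix(docs, n=3):
    """Build a dense count matrix: one row per n-gram, one column per document."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    terms = sorted(set().union(*counts))
    # A Counter returns 0 for missing keys, so absent n-grams become zeros.
    matrix = [[c[t] for c in counts] for t in terms]
    return terms, matrix

# Hypothetical toy corpus; most entries of A are zero, i.e. the matrix is sparse.
docs = ["the cat sat", "the cat ran", "a dog barked"]
terms, A = term_document_matrix(docs)
```

Character n-grams need no tokenizer or stemmer, which is one reason they tolerate noisy or multilingual input; a real corpus would call for a sparse matrix representation rather than nested lists.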
Properties of the SVD
• Express the term-document matrix as A = U S V^T
• Columns of U are sets of terms that occur together
• The singular values are the main diagonal entries of S, and they give the relative importance of these patterns
• Entries in the columns of S V^T are the coordinates of the documents in the space spanned by the columns of U
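A small numerical check of these properties, assuming NumPy and a made-up 4-term by 3-document matrix (the talk itself reports using MATLAB). With terms as rows and documents as columns, the coordinates of document j are column j of S V^T. The `importance` ratio is just one simple way to read the singular values as relative weights.

```python
import numpy as np

# Hypothetical toy term-document matrix: 4 terms (rows) x 3 documents (columns).
A = np.array([[2.0, 1.0, 0.0],
              [1.0, 2.0, 0.0],
              [0.0, 0.0, 3.0],
              [0.0, 1.0, 1.0]])

# Thin SVD: U is 4x3, s holds the singular values (largest first), Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)

# Column j of S @ Vt holds the coordinates of document j in the basis U.
doc_coords = S @ Vt

# Share of each pattern in the sum of singular values (an illustrative measure).
importance = s / s.sum()

# The factors reproduce A, and each document is U times its coordinate column.
reconstructed = U @ S @ Vt
```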
Interpreting U
• Each column U1, U2, …, Uk of U represents a pattern of terms that tend to occur together
• Examination of terms that occur frequently will indicate words or phrases that are prominent in an interesting subset of documents
• A frequency distribution can be plotted
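One way to inspect a column of U is to list the terms that carry the largest absolute weight in it. The sketch below assumes NumPy; the helper `top_terms`, the word-level terms, and the two-topic toy matrix are hypothetical and chosen only so that the patterns are easy to see.

```python
import numpy as np

def top_terms(U, terms, k, count=3):
    """Return the `count` terms with the largest absolute weight in column k of U."""
    order = np.argsort(-np.abs(U[:, k]))[:count]
    return [terms[i] for i in order]

# Hypothetical corpus: documents 0 and 1 are about finance, document 2 about sport.
terms = ["tax", "bank", "rate", "goal", "team"]
A = np.array([[3.0, 2.0, 0.0],
              [2.0, 3.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 4.0],
              [0.0, 0.0, 3.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

finance_pattern = top_terms(U, terms, 0, count=2)
sport_pattern = top_terms(U, terms, 1, count=2)
```

The sign of a column of U is arbitrary, which is why the helper ranks by absolute value.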
Example frequency distribution
Example plot of singular values (showing importance of patterns)
Interpreting S V^T
• The columns of U form a basis, and the entries in column i of S V^T are the coordinates of document i in the space spanned by the columns of U
• Documents that have large values in a certain dimension have many instances of the corresponding terms
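The coordinate statement can be written out explicitly. The notation here is an assumption introduced for the derivation: a_j is column j of A (document j), e_j is the j-th standard basis vector, u_k is column k of U, \sigma_k is the k-th singular value, and v_{jk} is entry (j, k) of V.

```latex
\[
a_j \;=\; A e_j \;=\; U \,\bigl(S V^{T}\bigr) e_j \;=\; \sum_{k} \sigma_k \, v_{jk} \, u_k ,
\qquad
\text{so the coordinate of document } j \text{ in dimension } k \text{ is } \sigma_k v_{jk}.
\]
```

A document therefore scores highly in dimension k when it contains many of the terms that are weighted heavily in u_k.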
Example: Coordinates of documents in various dimensions
Previous Work: Federalist Papers
• Written by James Madison, John Jay, and Alexander Hamilton
• Out of 85, 11 are of unknown authorship
• Kjell and Frieder’s hypothesis: differences in patterns of ngrams are attributable to authorship (as opposed to dialect, genre, subject, etc.)
Principal Components
• Kjell and Frieder chose sets of ngrams that most distinguished the documents with known authorship, discarding lots of others
• They then used (CPU intensive) PCA to map segments of (reduced) documents into 2-space
• Document segments 1k long were plotted
• The results indicate that Madison wrote the eleven “unknown” Federalist Papers
To address these issues
• Dahlberg and Nicholas showed that the SVD could be used to identify “interesting” properties of documents and corpora
• Large, possibly multi-lingual corpora
• It turns out that PCA and SVD are related, mathematically speaking, but SVD can be done much more efficiently.
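The mathematical relationship between PCA and the SVD can be checked numerically. The sketch below assumes NumPy and random toy data; it shows only that the principal-component variances equal the squared singular values of the centered data divided by (n - 1), and it does not demonstrate the efficiency claim.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))      # hypothetical data: 20 samples x 5 features
Xc = X - X.mean(axis=0)           # center each feature

# PCA the classical way: eigenvalues of the sample covariance matrix.
cov = Xc.T @ Xc / (len(Xc) - 1)
pca_variances = np.linalg.eigvalsh(cov)[::-1]   # reorder to largest first

# The same variances from the singular values of the centered data.
s = np.linalg.svd(Xc, compute_uv=False)
svd_variances = s**2 / (len(Xc) - 1)
```

Working from the SVD avoids forming the covariance matrix explicitly.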
Recent Work
• Why apply these techniques to the Bible?
– A multi-lingual test case: the Bible was written in Hebrew, Aramaic and Greek
– Authorship issues are non-trivial, and have been discussed for centuries
• The Hebrew and Greek text is available online
Biblical Authorship Questions
• Who wrote the Pentateuch, traditionally attributed to Moses?
• Who was the Deuteronomist?
• Who was the Chronicler?
• Was Isaiah only one person?
• Did Paul write all those Epistles?
• Who wrote the book of Hebrews?
Was Ezra the Chronicler?
• Each chapter of Chronicles, Ezra and Nehemiah was considered a document, 88 chapters in all
• We built the matrix, using Hebrew 3-grams, and performed the SVD calculations using MATLAB
Ezra, Nehemiah, Chronicles
What does this graph say?
• Some chapters, such as Nehemiah 7 and Ezra 2, are different from the rest
– Most of the text is narrative
– Ezra 2 is a census, as is Nehemiah 7
• This plot supports the hypothesis that all (three!) books were written by the same person
• Need to look at more dimensions
Can we spot other characteristics?
• The SVD identifies “structure” in the term-document matrix; it just finds patterns
• In particular, language or dialect really stands out
• Consider the books of Ecclesiastes, Song of Songs, and Daniel
Ecclesiastes, Song of Songs, and Daniel
What do these graphs say?
• Song of Songs and Ecclesiastes are clustered together, consistent with Solomonic authorship
• Chapters 2-7 of Daniel are in Aramaic
• Speculation: dimension 1 is distance from “Solomonic” Hebrew, dimension 2 is Hebrew vs. Aramaic. So the Aramaic chapters have high values in both dimensions.
Was Isaiah only one person?
Graphical
• We can already see that statistical and graphical strategies are complementary
• We now describe an ngram-based IR system that lends itself to both strategies
Telltale
• Telltale is an IR engine designed for
– scalability
– use with a wide variety of document types and languages
– generating corpus metadata
• Some key features
– use of character n-grams
– use of corpus “centroids” as metadata
– Agent API using KQML
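The first two features can be combined in a small similarity measure. The slides do not specify how Telltale scores documents, so the following is only one plausible reading: each document becomes a normalized character n-gram profile, the corpus centroid is the average profile, and similarity is the cosine between centroid-subtracted profiles. All function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def ngram_profile(text, n=3):
    """Normalized frequency of each character n-gram in text."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {g: c / total for g, c in Counter(grams).items()} if total else {}

def centroid(profiles):
    """Average n-gram profile over the corpus (the corpus 'centroid')."""
    keys = set().union(*profiles)
    return {k: sum(p.get(k, 0.0) for p in profiles) / len(profiles) for k in keys}

def similarity(p, q, mu):
    """Cosine of the angle between two profiles after subtracting the centroid."""
    keys = set(p) | set(q) | set(mu)
    a = [p.get(k, 0.0) - mu.get(k, 0.0) for k in keys]
    b = [q.get(k, 0.0) - mu.get(k, 0.0) for k in keys]
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Hypothetical corpus; documents are ranked against the first one as the query.
docs = ["interest rates rose", "interest rates fell", "the team won the match"]
profiles = [ngram_profile(d) for d in docs]
mu = centroid(profiles)
scores = [similarity(profiles[0], p, mu) for p in profiles]
```

Subtracting the centroid removes what is common to the whole corpus, so the score reflects what distinguishes a document rather than the shared language background.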
Telltale user interface
[Screenshot of the Telltale user interface. Labeled elements: query; current document; highlighted word or phrase; documents in corpus, sorted by similarity to query; functions; set thresholds; show highlights; relevance feedback]
VR based visualization of retrieval
• The only way to comprehend a large corpus or result set is through visualization.
• SFA, for example, provides
– Real-time, interactive stereo viewing of results from Telltale
– Each document is rendered as a glyph (icon)
– Document properties mapped to 3D location, shape, color, transparency, and texture
– Spatialization of complex relationships and comprehensible display of multiple variables
VR Approach
• Immersive
– Isolates the user from environment
– Expensive
• Minimally-immersive
– Access to environment
– Collaboration possible
– Low cost
– Two hands give proprioception
• Uses Two 3D Trackers with Buttons
– User manipulates 3D scene with trackers
– Each hand has a distinct role: left sets up context and right performs fine manipulation
Visualizing a document space
Mappings
X: similarity to “federal reserve bank”
Y: similarity to “commodity prices”
Z: similarity to “foreign exchange rate of the dollar”
Shape: similarity to “coup attempt against Noriega”, with cube as lowest and cone as highest
Color: age of document, with blue as the oldest and yellow as the newest
Transparency:
Texture:
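A mapping of this kind can be sketched as a function from document attributes to glyph properties. Everything below is hypothetical and is not the SFA system's API: the `Glyph` class, the assumption that similarity scores and age are scaled to [0, 1], the intermediate "sphere" shape, and the linear color blend are all illustrative choices. Transparency and texture are omitted because the slide leaves them unspecified.

```python
from dataclasses import dataclass

# Lowest to highest similarity; "sphere" is an assumed intermediate step.
SHAPES = ["cube", "sphere", "cone"]

@dataclass
class Glyph:
    x: float
    y: float
    z: float
    shape: str
    color: tuple  # (r, g, b), each component in [0, 1]

def lerp(a, b, t):
    """Linear interpolation from a (t = 0) to b (t = 1)."""
    return a + (b - a) * t

def make_glyph(sims, shape_sim, age_fraction):
    """sims: three similarity scores in [0, 1], used as the X, Y, Z position.
    shape_sim: similarity in [0, 1] that selects the shape.
    age_fraction: 0.0 for the newest document, 1.0 for the oldest."""
    shape = SHAPES[min(int(shape_sim * len(SHAPES)), len(SHAPES) - 1)]
    yellow, blue = (1.0, 1.0, 0.0), (0.0, 0.0, 1.0)
    color = tuple(lerp(y, b, age_fraction) for y, b in zip(yellow, blue))
    return Glyph(*sims, shape=shape, color=color)

newest = make_glyph((0.9, 0.1, 0.2), shape_sim=1.0, age_fraction=0.0)
oldest = make_glyph((0.0, 0.0, 0.0), shape_sim=0.0, age_fraction=1.0)
```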
Conclusions
• Textual data mining is becoming more important
• Statistical, linguistic, and graphical strategies each have their place
• Hybrid approaches are probably best
• We need to work with more corpora
Charles Nicholas
Dr. Charles Nicholas is an Associate Professor of Computer Science and Electrical Engineering at UMBC. He received a Ph.D. in Computer Science from The Ohio State University in 1988. Nicholas was General Chair of the ACM Conference on Information and Knowledge Management CIKM’95, CIKM’96, CIKM’97 and CIKM’98, and Co-Chair of the Principles of Digital Document Processing Workshop PODP’96 and PODDP’98. His areas of interest include information retrieval, electronic document processing, and software engineering.