Introduction to Textual Data Mining

Charles Nicholas
CSEE Department, UMBC
nicholas@cs.umbc.edu
May 4, 1999

What is Data Mining?
• A hot buzzword at the moment
• Usually used in the context of relational databases
• Typical question: if a grocery store customer buys formula and diapers, how likely is it that they’ll also buy beer?
  – So, should beer be stocked near the baby supplies?

What is Textual Data Mining?
• Textual Data Mining is the effort to gain insight from some collection of textual data.
• Textual Data Mining draws ideas from information retrieval, computational linguistics, artificial intelligence, etc.
  – pattern recognition
  – machine learning
  – natural language processing

Why do we care?
• Information overload: to characterize documents by topic, with little or no human intervention, to help filter information
• Query processing: to answer ad hoc questions
• Discovery: to identify new “interesting” trends in documents and corpora

Issues in Textual Data Mining
• Can we gain insight from text as we do from databases?
• Strategies:
  – Statistical: treat documents as sets of independent variables
  – Linguistic: analyze the syntax and/or semantics of documents
  – Graphical: treat documents as objects that can be visualized

Strategies
• Statistical
  – related to IR, i.e. documents are treated as vectors
  – examples: authorship and topic spotting
• Linguistic
  – lots of NLP, which is hard
  – related to database form of DM, e.g. MUC
• Graphical
  – plot graphs of document attributes
  – rely on humans to spot patterns

Using Linear Algebra in Statistical TDM
• Singular value decomposition is essentially a pattern recognition technique, and is (therefore) an example of statistical data mining
• To ask questions such as
  – Are these documents written by the same person?
  – Do these documents touch on the same subject?
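
The statistical strategy treats documents as vectors. As a minimal sketch of that idea (not the SVD analysis the talk goes on to describe), the snippet below counts character 3-grams in each document and compares the resulting count vectors with cosine similarity, one common IR measure for asking whether two documents touch on the same subject. The function names and the three toy sentences are assumptions made up for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in lower-cased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors stored as Counters."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy documents (hypothetical, for illustration only).
docs = {
    "a": "The federal reserve bank raised interest rates.",
    "b": "Interest rates were raised by the federal reserve.",
    "c": "Commodity prices fell sharply in early trading.",
}
vectors = {name: char_ngrams(text) for name, text in docs.items()}

sim_ab = cosine_similarity(vectors["a"], vectors["b"])
sim_ac = cosine_similarity(vectors["a"], vectors["c"])
print(f"a vs b: {sim_ab:.3f}")
print(f"a vs c: {sim_ac:.3f}")
```

Documents "a" and "b" share most of their 3-grams, so their similarity should come out well above that of "a" and "c". Because the features are character n-grams and not words, the same code runs unchanged on noisy or non-English text.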
Term-Document Matrix
• Arrange the documents in a matrix form, with “terms” down the side, and documents across the top
• This term-document matrix is usually sparse; term weighting can make it more so
• Terms can be words or ngrams
  – we chose ngrams because of the need to process noisy and/or multilingual documents

Properties of the SVD
• Express the term-document matrix as A = U S V^T
• Columns of U are sets of terms that occur together
• The singular values are the main diagonal entries in S, and they give the relative importance of these patterns
• Entries in the columns of S V^T are the coordinates of the documents in the space spanned by the columns of U

Interpreting U
• Each column U1, U2, …, Uk of U represents a pattern of terms that tend to occur together
• Examination of terms that occur frequently will indicate words or phrases that are prominent in an interesting subset of documents
• A frequency distribution can be plotted

Example frequency distribution

Example plot of singular values (showing importance of patterns)

Interpreting V^T
• The columns of U form a basis, and the entries in column i of S V^T are the coordinates of document i in the space spanned by the columns of U
• Documents that have large values in a certain dimension have many instances of the corresponding terms

Example: Coordinates of documents in various dimensions

Previous Work: Federalist Papers
• Written by James Madison, John Jay, and Alexander Hamilton
• Out of 85, 11 are of unknown authorship
• Kjell and Frieder’s hypothesis: differences in patterns of ngrams are attributable to authorship (as opposed to dialect, genre, subject, etc.)
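
The term-document matrix and its SVD can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' MATLAB code: it uses NumPy, raw (unweighted) counts of character 3-grams, and four made-up toy sentences. The helper names `char_ngrams` and `term_document_matrix` are hypothetical. With terms as rows and documents as columns, column j of S V^T holds the coordinates of document j in the basis formed by the columns of U.

```python
import numpy as np


def char_ngrams(text, n=3):
    """List the overlapping character n-grams of lower-cased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def term_document_matrix(docs, n=3):
    """Build A with n-grams (terms) as rows, documents as columns, raw counts as entries."""
    grams_per_doc = [char_ngrams(d, n) for d in docs]
    terms = sorted({g for grams in grams_per_doc for g in grams})
    row = {g: i for i, g in enumerate(terms)}
    A = np.zeros((len(terms), len(docs)))
    for j, grams in enumerate(grams_per_doc):
        for g in grams:
            A[row[g], j] += 1
    return terms, A


# Toy corpus (hypothetical, for illustration only).
docs = [
    "the federal reserve bank raised rates",
    "the reserve bank kept rates steady",
    "commodity prices fell in early trading",
    "prices of commodities rose in trading",
]
terms, A = term_document_matrix(docs)

# Thin SVD: A = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Column j of coords gives the coordinates of document j in the basis U.
coords = np.diag(s) @ Vt

# The n-grams with the largest-magnitude entries in the first column of U
# make up the most important pattern of co-occurring terms.
top_terms = [terms[i] for i in np.argsort(-np.abs(U[:, 0]))[:5]]

print("singular values:", np.round(s, 3))
print("2-D document coordinates:\n", np.round(coords[:2, :], 3))
print("top n-grams in pattern 1:", top_terms)
```

Plotting the first two rows of `coords` against each other gives the kind of two-dimensional document plot the slides refer to. Note that the signs of the singular vectors are arbitrary, so only relative positions are meaningful.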
Principal Components
• Kjell and Frieder chose sets of ngrams that most distinguished the documents with known authorship, discarding lots of others
• They then used (CPU intensive) PCA to map segments of (reduced) documents into 2-space
• Document segments 1k long were plotted
• Indicates Madison wrote the eleven “unknown” Federalist Papers

To address these issues
• Dahlberg and Nicholas showed that the SVD could be used to identify “interesting” properties of documents and corpora
• Large, possibly multi-lingual corpora
• It turns out that PCA and SVD are related, mathematically speaking, but SVD can be done much more efficiently.

Recent Work
• Why apply these techniques to the Bible?
  – A multi-lingual test case. The Bible was written in Hebrew, Aramaic and Greek
  – Authorship issues are non-trivial, and have been discussed for centuries
• The Hebrew and Greek text is available online

Biblical Authorship Questions
• Who wrote the Pentateuch, traditionally attributed to Moses?
• Who was the Deuteronomist?
• Who was the Chronicler?
• Was Isaiah only one person?
• Did Paul write all those Epistles?
• Who wrote the book of Hebrews?

Was Ezra the Chronicler?
• Each chapter of Chronicles, Ezra and Nehemiah was considered a document, 88 chapters in all
• We built the matrix, using Hebrew 3-grams, and performed the SVD calculations using MATLAB

Ezra, Nehemiah, Chronicles

What does this graph say?
• Some chapters, such as Nehemiah 7 and Ezra 2, are different from the rest
  – Most of the text is narrative
  – Ezra 2 is a census, as is Nehemiah 7
• This plot supports the hypothesis that all (three!) books were written by the same person
• Need to look at more dimensions

Can we spot other characteristics?
• The SVD identifies “structure” in the term document matrix; it just finds patterns
• In particular, language or dialect really stand out
• Consider the books of Ecclesiastes, Song of Songs, and Daniel

Ecclesiastes, Song of Songs, and Daniel

What do these graphs say?
• Song of Songs and Ecclesiastes are clustered together, consistent with Solomonic authorship
• Chapters 2-7 of Daniel are in Aramaic
• Speculation: dimension 1 is distance from “Solomonic” Hebrew, dimension 2 is Hebrew vs. Aramaic. So the Aramaic chapters have high values in both dimensions.

Was Isaiah only one person?

Graphical
• We can already see that statistical and graphical strategies are complementary
• We now describe an ngram-based IR system that lends itself to both strategies

Telltale
• Telltale is an IR engine designed for
  – scalability
  – use with a wide variety of document types and languages
  – generating corpus metadata
• Some key features
  – use of character n-grams
  – use of corpus “centroids” as metadata
  – Agent API using KQML

Telltale user interface
• Highlighted word or phrase
• Set thresholds
• Current document
• Documents in corpus, sorted by similarity to query
• Functions
  – Query
  – Show highlights
  – Relevance feedback

VR-based visualization of retrieval
• The only way to comprehend a large corpus or result set is through visualization.
• SFA, for example, provides
  – Real-time, interactive stereo viewing of results from Telltale
  – Each document is rendered as a glyph (icon)
  – Document properties mapped to 3D location, shape, color, transparency, and texture.
  – Spatialization of complex relationships and comprehensible display of multiple variables

VR Approach
• Immersive
  – Isolates the user from environment
  – Expensive
• Minimally-immersive
  – Access to environment
  – Collaboration possible
  – Low cost
  – Two hands give proprioception
• Uses two 3D trackers with buttons
  – User manipulates 3D scene with trackers
  – Each hand has a distinct role: left sets up context and right performs fine manipulation

Visualizing a document space
Mappings
X: similarity to “federal reserve bank”
Y: similarity to “commodity prices”
Z: similarity to “foreign exchange rate of the dollar”
Shape: similarity to “coup attempt against Noriega”, with cube as lowest and cone as highest
Color: age of document, with blue as the oldest and yellow as the newest
Transparency:
Texture:

Conclusions
• Textual data mining is becoming more important
• Statistical, linguistic, and graphical strategies each have their place
• Hybrid approaches are probably best
• We need to work with more corpora

Charles Nicholas
Dr. Charles Nicholas is an Associate Professor of Computer Science and Electrical Engineering at UMBC. He received a Ph.D. in Computer Science from The Ohio State University in 1988. Nicholas was General Chair of the ACM Conference on Information and Knowledge Management CIKM’95, CIKM’96, CIKM’97 and CIKM’98, and Co-Chair of the Principles of Digital Document Processing Workshop PODP’96 and PODDP’98. His areas of interest include information retrieval, electronic document processing, and software engineering.
