Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections

Danny Dunlavy
Computer Science and Informatics Department (1415)
Sandia National Laboratories
July 16, 2008
CSRI Student Seminar Series
SAND2008-4999P

Outline
• Introduction
• Motivational Problems
• Data
• Analysis Pipeline
• Transformation, Analysis, and Post-processing
• Hybrid Systems
• Examples
• Conclusions

Introduction
• Knowledge discovery
  – Goal of text analysis
  – Data → information → knowledge
• Challenges
  – Too much information to process manually
  – Data ambiguity
    • Word sense, multilingual, errors, weak signals
  – Heterogeneous data sources
  – Interpretability
• Goals of this talk
  – Exposure to research in text analysis at Sandia
  – Focus on methods based on mathematical principles

Example 1: Information Retrieval
Problem: ambiguous queries lead to information overload and topic confusion
[Figure labels: Basketball player; Mathematician (Rank: 5, 50, 109, …); Jazz Musician? (Rank: > 200)]
Solutions: optimization, linear algebra, machine learning, and probabilistic modeling

Example 2: Spam Detection
  X-TMWD-Spam-Summary: TS=20080714175419; ID=1; SEV=2.3.2; DFV=B2008071415; IFV=2.0.4,4.0-9; AIF=B2008071415; RPD=5.03.0010; ENG=NA; RPDID=7374723D303030312E30413031303230332E34383742393243372E303043312C73733D312C6667733D30; CAT=NONE; CON=NONE; SIG=AAABAMQFAAAAAAAAAAAAAIAIgDkAAAM=
  X-TMWD-IP-Reputation: SIP=128.8.128.57; IPRID=303030312E30413039303330322E34383742393243362E30303330; CTCLS=T2; CAT=Unknown
  Date: Mon, 14 Jul 2008 13:54:13 -0400
  From: Dianne O'Leary
  To: dmdunla@sandia.gov
  X-TMWD-Spam-Summary: TS=20080714175417; SEV=2.2.2; DFV=B2008071415; IFV=2.0.4,4.0-9; AIF=B2008071415; RPD=5.02.0125; ENG=IBF; RPDID=7374723D303030312E30413031303230332E34383742393243392E303045422C73733D312C6667733D30; CAT=NONE; CON=NONE
  X-MMS-Spam-Filter-ID: B2008071415_5.02.0125_4.0-9
  X-PMX-Version: 5.4.2.344556, Antispam-Engine: 2.6.0.325393, Antispam-Data: 2008.7.14.174143
  X-PerlMx-Spam: Gauge=IIIIIII, Probability=7%, Report='BODY_SIZE_1000_LESS 0, BODY_SIZE_300_399 0, BODY_SIZE_5000_LESS 0, __CT 0, __CTE 0, __CT_TEXT_PLAIN 0, __HAS_MSGID 0, __MIME_TEXT_ONLY 0, __MIME_VERSION 0, __SANE_MSGID 0, __SUBJ_MISSING 0'
Problem: term meaning/usage ambiguity and deceit create confusion between spam and good e-mail
Spam filtering approaches and example products:
• Bayesian Statistics
  – SpamAssassin
  – Cloudmark Authority
  – MailSweeper Business Suite
• Signature (Hash) Analysis
  – Cloudmark SpamNet
  – IM Message Inspector
• Lexical Analysis
  – Brightmail Anti-Spam
  – Tumbleweed Email Firewall
• Neural Networks
  – SurfControl E-mail Filter
  – AntiSpam for SMTP
• Heuristic Patterns
  – McAfee SpamKiller
  – Brightmail Anti-Spam
Solutions: optimization, linear algebra, machine learning, probabilistic modeling
[S. Ali and Y. Xiang (2007), "Spam Classification Using Adaptive Boosting Algorithm," Proc. ICIS 2007.]

Example 3: Topic Detection and Association
Problem: determine topics in text collections and identify the most important, novel, or significant relationships
Clustering and visualization are key analysis methods
Solutions: optimization, linear algebra, machine learning, probabilistic modeling, and visualization
http://www.kartoo.com
http://cloud.clusty.com

Text Data
• Text collection(s)
  – Corpus (corpora)
• Structured
  – Database fielded data
• Semi-structured
  – XML, HTML
• Unstructured
  – Formal
    • Newspaper articles, scientific articles, business reports, …
  – Informal
    • E-mail, chat, code comments, …
• Other characteristics
  – Incomplete, noisy (errors, ambiguity), multilingual
[Diagram boxes:
  Semi-Structured Data (e-mail, web pages, blogs, etc.): Metadata (raw source, index, date collected, source reliability, etc.); Data (e-mail headers: to, from, date, subject, etc.; message body; attachments)
  Unstructured Data (reports, newswire, etc.): Data (text); Metadata (raw source, index, date collected, source reliability, etc.)
  Unstructured Text Processing: Metadata (processing tool, parameters used, date processed); Data (named entities, relationships, facts, events)
  Analysis]

Text Analysis Pipeline
Ingestion → Pre-processing → Transformation → Analysis → Post-processing → Archiving
• Ingestion: file readers (ASCII, UTF-8, XML, PDF, ...)
• Pre-processing: tokenization, stemming, part-of-speech tagging, named entity extraction, sentence boundaries
• Transformation: data model, dimensionality reduction, feature weighting, feature extraction/selection
• Analysis: information retrieval, clustering, summarization, classification, pattern recognition, statistics
• Post-processing: visualization, filtering, summary statistics
• Archiving: database, file, web site

Vector Space Model
• Vector Space Model for Text
  – Terms (features): t_1, …, t_m
  – Documents (objects): d_1, …, d_n
  – Term × Document Matrix: A (m × n)
  – a_ij: measure of importance of term i in document j
• Term Examples
  – Sentence: “Danny re-sent $1.”
  – Words: danny, sent, re [# chars?], $ [sym?], 1 [#?], re-sent [-?]
  – n-grams (3): dan, ann, nny, ny_, _re, re-, e-s, sen, ent, nt_, …
  – Named entities (people, orgs, money, etc.): danny, $1
• Document Examples
  – Documents, paragraphs, sentences, fixed-size chunks
[G. Salton, A. Wong, and C. S. Yang (1975), "A Vector Space Model for Automatic Indexing," Comm. ACM, 18(11), 613–620.]
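
The term choices on the Vector Space Model slide can be made concrete with a small sketch. The code below is an illustration only, not the tooling used in the talk: the function names (`word_terms`, `char_ngrams`, `term_document_matrix`) and the second document ("Danny sent $2.") are hypothetical, and the word tokenizer simply drops hyphens, symbols, and digits, which is just one possible answer to the open questions the slide raises. The n-gram function emits every sliding window, so it produces a few 3-grams that the slide's abbreviated list does not show.

```python
import re
from collections import Counter


def word_terms(text):
    """Lowercase alphabetic tokens; hyphens, symbols, and digits are dropped
    (an assumed policy, one of several the slide leaves open)."""
    return re.findall(r"[a-z]+", text.lower())


def char_ngrams(text, n=3):
    """All character n-grams of the lowercased text, with spaces shown as '_'."""
    s = text.lower().replace(" ", "_")
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def term_document_matrix(docs, tokenize):
    """Return (terms, A), where A[i][j] is the raw count of term i in document j."""
    counts = [Counter(tokenize(d)) for d in docs]
    terms = sorted(set().union(*counts))
    A = [[c[t] for c in counts] for t in terms]
    return terms, A


sentence = "Danny re-sent $1."          # sentence from the slide
words = word_terms(sentence)
grams = char_ngrams(sentence)

# Second document is hypothetical, added only so the matrix has two columns.
terms, A = term_document_matrix([sentence, "Danny sent $2."], word_terms)
```

Here each row of `A` corresponds to a term and each column to a document, matching the term × document orientation on the slide; the entries are raw counts, with no feature weighting applied.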
Feature Weighting
• Term × Document Matrix Scaling: [equation missing]

Feature Extraction: Dimension Reduction
• Goal: find a new, smaller set of features (dimensions) that best captures variability, correlations, or structure in the data
• Methods
  – Principal component analysis (PCA)
    • Eigenvalue decomposition of the covariance matrix of the term × document matrix
    • Pre-processing: mean of each feature is 0
  – Singular value decomposition of the term × document matrix
  – Locally Linear Embedding (LLE)
    • Express points as combinations of neighbors and embed points into a lower-dimensional space (preserving neighbors)
  – Multidimensional scaling (MDS)
    • Preserve pairwise distances in a lower-dimensional space
  – ISOMAP (nonlinear)
    • Extends MDS to use geodesic distances on a weighted graph

Analysis Tasks in This Talk
• Information retrieval
  – Goal: find documents most related to a query
  – Challenges: pseudonyms, synonyms, stemming, errors
  – Methods: LSA (later), boolean search, probabilistic retrieval
• Clustering
  – Goal: find a set of partitions that best separates groups of like objects
  – Challenges: distance metrics, number of clusters, uniqueness
  – Methods: k-means (later), agglomerative, graph-based
• Summarization
  – Goal: find a compact representation of text with the same meaning
  – Challenges: single- vs. multi-document summaries, subjectivity
  – Methods: HMM+QR (later), probabilistic
• Classification
  – Goal: predict labels/categories of data instances (documents)
  – Challenges: data overfitting
  – Methods: HEMLOCK (S. Gilpin, later), decision trees, naïve Bayes, SVM

Other Analysis Tasks
• Machine translation
• Speech recognition
• Cross-language information retrieval
• Word sense disambiguation
  – Determining the sense of ambiguous words from context
• Lexical acquisition
  – Filling in gaps in dictionaries built from text corpora
• Concept drift detection
  – Change in general topics in streaming data
• Association analysis
  – Discovering novel relationships hidden in text

Hybrid Systems
• Rules + statistics/probabilities
  – Entity extraction (persons, organizations, locations)
    • Rules: list of common names, capitalization
    • Probabilities: chance a name occurs given a sequence of words
• Any combination of data analytic tools, often developed independently
  – Parser
  – Data modeler
  – Feature extractor
  – Clustering tool

Hybrid System Development
• Data model
  – Cross-system, cross-platform accessibility
  – Accommodation of multiple data structures
• System
  – Modularized framework (plug-and-play capabilities)
  – Compatible interfaces
  – Multiple user interfaces
    • TITAN: customizable front-ends to analysis pipelines
    • YALE: required parameters vs. complete set of parameters
• Performance, Verification & Validation
  – Tests for independent systems and overall system
  – Compatible test data and benchmarks
  – Analysis of parameter dependencies across individual systems

Hybrid System Example
Query, Cluster, Summarize

Motivation
• Query
  – methods plasma physics
• Retrieval
  – General: Google, 7.8 × 10^6 of > 2.5 × 10^10 documents
  – Targeted: arXiv, 9,000 of > 403,000 documents
• Problems
  – Too much information
  – Redundant information
  – Results: link, title, abstract, snippet (?), etc.
  – Ordering of results (meaning of “best” match?)
Problems to Solve
• QCS (Query, Cluster, Summarize)
  – Unstructured text parsing (common representation)
  – Data fusion (cleaning, assimilating, normalizing)
  – Natural language processing (sentences, POS)
  – Document retrieval (ranking)
  – High-dimensional clustering (data organization)
  – Automatic text summarization (data reduction)
  – Data representation/visualization (multiple perspectives)

Query: Latent Semantic Analysis (LSA)
[Figure: term × document matrix (terms t1 … tm, documents d1 … dn) factored into a terms × concepts matrix, a diagonal matrix of singular values, and a concepts × documents matrix]
Truncated SVD
• SVD: A = U Σ V^T
• Truncated SVD: A_k = U_k Σ_k V_k^T
• Query scores (query as new “doc”): q^T A_k
• LSA Ranking: [equation missing]
[Deerwester, S. C., et al. (1990). Indexing by latent semantic analysis. J. Am. Soc. Inform. Sci. 41 (6), 391–407.]

LSA Example 1
d1 : Hurricane. A hurricane is a catastrophe.
d2 : An example of a catastrophe is a hurricane.
d3 : An earthquake is bad.
d4 : Earthquake. An earthquake is a catastrophe.

Query q: hurricane = 1, earthquake = 0, catastrophe = 0

A (stopwords removed, normalization only; raw counts in parentheses)
      hurricane   earthquake   catastrophe   q^T A
d1    .89 (2)     0            .45 (1)       .89
d2    .71 (1)     0            .71 (1)       .71
d3    0           1 (1)        0             0
d4    0           .89 (2)      .45 (1)       0

A_2 (rank-2 approximation)
      hurricane   earthquake   catastrophe   q^T A_2
d1    .78         -.03         .59           .78
d2    .78         .02          .60           .78
d3    -.11        .96          .15           –
d4    .11         .92          .30           .11   (captures link to doc 4)

LSA Example 2
[Figure: two term groups with their associated documents; marker symbols (∆, o) not recoverable]
Terms: policy planning politics tomlinson 1986
  – Sport in Society: Policy, Politics and Culture, ed. A. Tomlinson (1990)
  – Policy and Politics in Sport, PE and Leisure, eds. S. Fleming, M. Talbot and A. Tomlinson (1995)
  – Policy and Planning (II), ed. J. Wilkinson (1986)
  – Policy and Planning (I), ed. J. Wilkinson (1986)
  – Leisure: Politics, Planning and People, ed. A. Tomlinson (1985)
Terms: parker lifestyles 1989 part
  – Work, Leisure and Lifestyles (Part 2), ed. S. R. Parker (1989)
  – Work, Leisure and Lifestyles (Part 1), ed. S. R. Parker (1989)
[Leisure Studies of America Data: 97 documents, 335 terms]

Cluster: Generalized Spherical K-Means (gmeans)
• The Players
  – Documents: [equation missing]
  – Partition/Disjoint Sets: [equation missing]
  – Concept vectors (centroids): [equation missing]
• The Game
  – Maximize [equation missing]
• The Rules
  – Adaptive, but bounded k
  – Similarity estimation
  – First variation (stochastic perturbation)
[Dhillon, I. S., et al. (2002). Iterative clustering of high dimensional text data augmented by local search. Proc. IEEE ICDM.]

Summarize: Hidden Markov Model + Pivoted QR
• Single-Document Summarization
  – Mark summary sentences in training documents
  – Build probabilistic model
• Markov chain observations [equation missing]
  – log(#subject terms + 1)
    • terms showing up in titles, topics, subject descriptions, etc.
  – log(#topic terms + 1)
    • terms above a threshold using a mutual information statistic
• Hidden Markov Model (HMM)
  – Hidden states: {summary, non-summary}
  – Score sentences in each document
    • Probabilities of a sentence being a summary sentence
[Conroy, J. M., et al. (2001). Text summarization via hidden markov models and pivoted QR matrix decomposition.]

Summarize: Hidden Markov Model + Pivoted QR
• Multi-Document Summarization
  – Goal: generate w-word summaries
  – Use HMM scores to select candidate sentences (~2w)
  – Terms as sentence features
    • Terms: [equation missing]
    • Sentences: [equation missing]
    • Scaling: [equation missing] = HMM score of sentence
• Pivoted QR
  – Choose the column with maximum norm
  – Subtract components along the chosen column from the remaining columns
  – Stop: chosen sentences (columns) ≈ ~w words
• Removes semantic redundancy

QCS: Evaluation
• Document Understanding Conference (DUC)
  – Automatic evaluation of summarizers (ROUGE)
    • Measures how well you agree with human summaries
  – Human, QCS, and S-only summaries
  – QCS finds subtopics and outliers
[Figure: ROUGE-2 score vs. summarizers (Humans, QCS, S); panels: Cluster 1, Cluster 2]

QCS: Evaluation
• Document Understanding Conference (DUC)
  – Scoring as a function of QCS cluster size (k)
  – QCS and S-only (---) summaries
  – Best results for different clusters use different k
[Figure: ROUGE-2 scores vs. number of clusters; panels: Cluster 1, Cluster 2]

Benefits of QCS
• Dynamic data organization and compression
  – Subset of documents relevant to a query
  – Topic clusters, single summary per cluster
• Multiple perspectives (analyses)
  – Relevance ranking, topic clusters, summaries
• Efficient use of computation
  – Parsing, term counts, natural language processing, etc.

Other Examples

ParaText™: Scalable Text Analysis
• ParaText™ Lite
  – Serial client-server text analysis
  – Parser, vector space model, SVD, data similarities/graph creation
  – Built on vtkTextEngine (Titan)
  – Works with ~10K–100K documents
• ParaText™
  – End-to-end scalable text analysis
  – Challenge 1: Parsing [parallel string hashing, hierarchical agglomeration]
  – Challenge 2: Text modeling [initial Trilinos/Titan integration complete]
  – Challenge 3: Load balancing [initial: documents; goal: Isorropia/Zoltan]
• Impact
  – Available in ThreatView 1.2.0+ directly or through ParaText™ server
  – Plans to interface to LSAView, OverView (1424), Sisyphus (1422)
[Diagram: ParaText™ Client (XML over HTTP); Master ParaText™ Server; ParaText™ Servers (PTS) on processes P0, P1, …, Pk, each running Reader, Parser, Matrix, SVD as a parallel pipeline on an HPC resource (cluster, multicore server, etc.); Artifact DB and Matrices DB on 1 or 2 DB servers]
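
The SVD stage that appears both in the LSA query ranking and in the ParaText™ pipeline can be checked against the four-document data from LSA Example 1. The sketch below is a minimal illustration using NumPy, not the LSALIB or ParaText™ implementation; the variable and function names are hypothetical, and the count matrix is an assumption reconstructed from the raw counts shown on the example slide (stopwords removed, columns normalized to unit length).

```python
import numpy as np

# Raw term counts after stopword removal, reconstructed from LSA Example 1.
# Rows: hurricane, earthquake, catastrophe. Columns: d1, d2, d3, d4.
counts = np.array([
    [2, 1, 0, 0],
    [0, 0, 1, 2],
    [1, 1, 0, 1],
], dtype=float)

# "Normalization only": scale each document (column) to unit Euclidean length.
A = counts / np.linalg.norm(counts, axis=0)


def truncated_svd(A, k):
    """Rank-k approximation A_k = U_k diag(s_k) V_k^T from the thin SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]


q = np.array([1.0, 0.0, 0.0])     # query containing only "hurricane"

scores_full = q @ A               # exact term matching: q^T A
A2 = truncated_svd(A, 2)          # rank-2 approximation
scores_rank2 = q @ A2             # LSA-style scores: q^T A_2

for name, full, low in zip(["d1", "d2", "d3", "d4"], scores_full, scores_rank2):
    print(f"{name}: q^T A = {full:.2f}, q^T A_2 = {low:.2f}")
```

With exact matching, d4 scores 0 because it never mentions "hurricane". Under the rank-2 approximation it receives a small positive score, since d4 shares "catastrophe" with the hurricane documents, which is the link to doc 4 noted on the example slide.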
LSAView: Algorithm Analysis/Development
• LSAView
  – Analysis and exploration of impact of informatics algorithms on end-user visual analysis of data
  – Aids in discovery process of optimal algorithm parameters for given data and tasks
• Features
  – Side-by-side comparison of visualizations for two sets of parameters
  – Small multiple view for analyzing 2+ parameter sets simultaneously
  – Linked document, graph, matrix, and tree data views
  – Interactive, zoomable, hierarchical matrix and matrix-difference views
  – Statistical inference tests used to highlight novel parameter impact
• Impact
  – Used in developing and understanding ParaText™ and LSALIB algorithms

LSAView Impact
• Document similarities: [equation missing]
• Inner product view: [equation missing]
• Scaled inner product view: [equation missing]
What is the best scaling for document similarity graph generation?
[Figure: four panels (original scaling, no scaling, inverse sqrt, inverse); horizontal axis k, ticks 20 to 80; vertical axis ticks 0 to 2]
[Leisure Studies of America Data: 97 documents, 335 terms]

E-Mail Classification
• LSN Assistant / Sandia Categorization Framework
  – Yucca Mountain: categorize e-mail (Relevant, Federal Record, Privileged)
  – Machine learning library and GUI for document categorization & review
  – For review of existing categorizations, recommendations for new documents
  – Balanced learning
    • Skewed class distributions
• Importance
  – Solved important, real problem
    • ~400K e-mails incorrectly categorized
  – Foundation for LSN Online Assistant
    • Real-time system for recommendations
• Impact
  – Dong Kim, lead of DOE/OCRWM LSN certification, is “very impressed with the LSN Assistant Tool and the approach to doing the review.”
  – Factor of 3 speedup over manual categorization review only

Conclusions
• Text analysis relies heavily upon mathematics
  – Linear algebra, optimization, machine learning, probability theory, statistics, graph theory
• Hybrid system development is a challenge
  – More than just gluing pieces together
• Large-scale analysis is important
  – Storing and processing large amounts of data
  – Scaling algorithms up
  – Developing new algorithms for large data
• Useful across many application domains

Collaborations
• QCS
  – Dianne O’Leary (Maryland), John Conroy & Judith Schlesinger (IDA/CCS)
• LSALIB
  – Tammy Kolda (8962)
• ParaText™
  – Tim Shead & Pat Crossno (1424)
• LSAView
  – Pat Crossno (1424)
• Sandia Categorization Framework
  – Justin Basilico (6341) and Steve Verzi (6343)
• HEMLOCK
  – Sean Gilpin (1415)

Thank You
Text Analysis: Methods for Searching, Organizing, Labeling and Summarizing Document Collections
Danny Dunlavy
dmdunla@sandia.gov
http://www.cs.sandia.gov/~dmdunla
