Data Mining and Scalability
Lauren Massa-Lochridge, Nikolay Kojuharov, Hoa Nguyen, Quoc Le

Outline
- Data mining overview
- Scalability: challenges and approaches
- Overview: association rules
- Case study: BIRCH, an efficient data clustering method for very large databases
- Case study: scientific data mining
- Q&A

DATA MINING

Data Mining: Rationale
- Data size: data in databases is estimated to double every year, while the number of people who look at the data stays constant.
- Complexity: the analysis is complex, and the characteristics and relationships in the data are often unexpected and unintuitive.
- Knowledge discovery tools and algorithms are needed to make sense of, and use of, the data.

Data Mining: Rationale (cont'd)
- As of 2003, France Telecom had the largest decision-support database, at about 30 TB; AT&T was second with a 26 TB database.
- Some of the largest databases on the Web, as of 2003:
  - Alexa (www.alexa.com) internet archive: 7 years of data, 500 TB
  - Internet Archive (www.archive.org): about 300 TB
  - Google: over 4 billion pages, many terabytes

Applications
- Business: analyze inventory, predict customer acceptance, etc.
- Science: find correlations between genes and diseases, between pollution and global warming, etc.
- Government: uncover terrorist networks, predict flu pandemics, etc.

Adapted from: Data Mining and Knowledge Discovery: An Introduction, http://www.kdnuggets.com/dmcourse/other_lectures/intro-to-data-mining-notes.htm

Data Mining: Definition
- The semi-automatic discovery of patterns, changes, anomalies, rules, and statistically significant structures and events in data.
- The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
- Data mining is often done on targeted, preprocessed, transformed data:
  - targeted: data fusion, sampling;
  - preprocessed: noise removal, feature selection, normalization;
  - transformed: dimension reduction.
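The targeting and preprocessing steps listed above (sampling, normalization) can be sketched in a few lines. This is an illustrative fragment, not part of the original slides; the function names are my own.

```python
import random

def sample(rows, fraction, seed=0):
    """Targeting step: keep a random fraction of the rows."""
    rng = random.Random(seed)
    return [r for r in rows if rng.random() < fraction]

def min_max_normalize(column):
    """Preprocessing step: rescale a numeric column into [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

# Example: normalize one feature of a tiny data set.
ages = [18, 30, 42, 66]
print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 1.0]
```

Real pipelines would add noise removal and dimension reduction, but the shape is the same: each step is a pure transformation applied before any mining algorithm sees the data.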
Data Mining: Evolution

Data Collection (1960s)
- Business question: "What was my total revenue in the last five years?"
- Enabling technologies: computers, tapes, disks
- Product providers: IBM, CDC
- Characteristics: retrospective, static data delivery

Data Access (1980s)
- Business question: "What were unit sales in New England last March?"
- Enabling technologies: RDBMS, SQL, ODBC
- Product providers: Oracle, Sybase, Informix, IBM, Microsoft
- Characteristics: retrospective, dynamic data delivery at the record level

Data Warehousing (1990s)
- Business question: "What were unit sales in New England last March? Drill down to Boston."
- Enabling technologies: OLAP, multidimensional databases, data warehouses
- Product providers: Pilot, Comshare, Arbor, Cognos, Microstrategy
- Characteristics: retrospective, dynamic data delivery at multiple levels

Data Mining (today)
- Business question: "What's likely to happen to Boston unit sales next month? Why?"
- Enabling technologies: advanced algorithms, multiprocessor computers, massive databases
- Product providers: Pilot, Lockheed, IBM, SGI, numerous startups (nascent industry)
- Characteristics: prospective, proactive information delivery

Adapted from: An Introduction to Data Mining, http://www.thearling.com/text/dmwhite/dmwhite.htm

Data Mining: Approaches
- Clustering: identify natural groupings within the data.
- Classification: learn a function that maps a data item into one of several predefined classes.
- Summarization: describe groups, summary statistics, etc.
- Association: identify data items that frequently occur together.
- Prediction: predict the values or the distribution of missing data.
- Time-series analysis: analyze data to find periodicity, trends, and deviations.

SCALING

Scalability and Performance
- Scaling and performance are often considered together in data mining.
- The problem of scalability in data mining is not only how to process such large data sets, but how to do it within a useful timeframe.
- Many of the scalability issues in data mining and DBMSs are similar to performance-scaling issues in data management in general.

Dr. Gregory Piatetsky-Shapiro and Prof. Gary Parker (P&P) state the main issue for clustering algorithms as an approach to data mining: "The main issue in clustering is how to evaluate the quality of a potential grouping. There are many methods, ranging from manual, visual inspection to a variety of mathematical measures that minimize the [dis]similarity of items within the cluster and maximize the difference between the clusters."

Common DM Scaling Problem
Algorithms generally:
- operate on the data under the assumption that the entire data set is processed in memory;
- operate under the assumption that KIWI ("kill it with iron", i.e., adding hardware) will be used to address I/O and other performance scaling issues; or
- simply do not address scalability within resource constraints at all.

Data Mining: Scalability
- Large data sets: use a scalable I/O architecture (minimize I/O, make it fit, make it fast); reduce the data (aggregation, dimension reduction, compression, discretization).
- Complex algorithms: reduce algorithm complexity; exploit parallelism; use specialized hardware.
- Complex results: effective visualization; increase understanding and trustworthiness.

Scalable I/O Architecture
- Shared-memory parallel computers: local plus global memory; locking is used to synchronize.
- Distributed-memory parallel computers: message passing / remote memory operations.
- Parallel disk: B records form one unit; D blocks can be read or written at once.
- Primitives: scatter, gather, reduction.
- Data parallelism or task parallelism.

Scaling: General Approaches
Question: how can we tackle memory constraints and efficiency?
- Statistics: manipulate the data so it fits into memory (sampling, feature selection, partitioning, summarization).
- Databases: reduce the time to access out-of-memory data (specialized data structures, block reads, parallel block reads).
- High-performance computing: use several processors.
- Data mining implementation: efficient data mining primitives; pre-computation.
- Miscellaneous: reduce the amount of data (discretization, compression, transformation).

SCALING IN ASSOCIATION RULES

Association Rules
- The classic beer-and-diapers example.
- Basket data analysis, cross-marketing, sales campaigns, Web-log analysis, etc.
- Introduced in "Mining Association Rules between Sets of Items in Large Databases" by R. Agrawal et al., SIGMOD 1993.

AR Applications
For a rule X ==> Y, what are the interesting applications?
- Find all rules with "bagels" as X: what should be shelved together with bagels? What would be impacted if we stopped selling bagels?
- Find all rules with "Diet Coke" as Y: what should the store do to promote Diet Coke?
- Find all rules relating any items in Aisles 1 and 2: shelf planning, to see whether the two aisles are related.

AR: Input and Output
Input: a database of sales "transactions", plus parameters:
- minimum support, say 50%
- minimum confidence, say 100%

  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

Output:
- rule 1: {2, 3} ==> {5}: s = ?, c = ?
- rule 2: {3, 5} ==> {2}
- more ...

Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested (Agrawal & Srikant, VLDB '94; Mannila et al., KDD '94).
Method:
- Initially, scan the DB once to get the frequent 1-itemsets.
- Generate the length-(k+1) candidate itemsets from the length-k frequent itemsets.
- Test the candidates against the DB.
- Terminate when no frequent or candidate set can be generated.

Scaling Attempts
- Count distribution: distribute the transaction data among processors and count the transactions in parallel; scales linearly with the number of transactions.
- Savasere et al. (VLDB '95): partition the data and scan it twice (local counts, then global counts).
- Toivonen (VLDB '96): sampling, with verification of closure borders.
- Brin et al. (SIGMOD '97): dynamic itemset counting.
- Pei & Han (SIGMOD '00): compact description (the FP-tree), with no candidate generation; scales up with partition-based projection.

SCALABLE DATA CLUSTERING: BIRCH

BIRCH Approach
Informal definition: "data clustering identifies the sparse and the crowded places and hence discovers the overall distribution patterns of the data set."
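As a concrete illustration of the Apriori candidate generation-and-test loop described in the association-rules section above, here is a minimal sketch using the four-transaction example database. This is simplified code of my own, not from the slides; a production implementation would use a proper join step rather than generating candidates from all frequent items.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return frequent itemsets as {frozenset: support count}."""
    needed = min_support * len(transactions)
    # Scan the DB once for frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= needed}
    result = dict(frequent)
    k = 1
    while frequent:
        # Generate length-(k+1) candidates from the frequent items.
        items = sorted({i for s in frequent for i in s})
        candidates = [frozenset(c) for c in combinations(items, k + 1)]
        # Apriori pruning: every k-subset of a candidate must be frequent.
        candidates = [c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k))]
        # Test the surviving candidates against the DB.
        counts = {c: sum(1 for t in transactions if c <= set(t))
                  for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= needed}
        result.update(frequent)
        k += 1
    return result

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
freq = apriori(db, 0.5)
print(freq[frozenset({2, 3, 5})])  # 2: contained in transactions 200 and 300
```

With 50% minimum support (at least 2 of the 4 transactions), {2, 3, 5} survives because all of its 2-subsets {2, 3}, {2, 5}, {3, 5} are themselves frequent; candidates such as {1, 2, 3} are pruned without ever touching the database.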
BIRCH belongs to the category of hierarchical clustering algorithms that use a distance measure; k-means is another example of a distance-based clustering method.

Approach: "statistical identification of clusters, i.e. densely populated regions in a multi-dimensional data set, given the desired number of clusters K, a data set of N points, and a distance-based measurement."

Problem with other approaches: distance-based, hierarchical, and similar methods all face the same difficulties in terms of scaling and resource utilization.

BIRCH Novelty
- The first algorithm proposed in the database area that filters out "noise", i.e. outliers.
- Prior work does not adequately address large data sets with minimization of I/O cost.
- Prior work does not address whether the data set fits in memory.
- Prior work does not address resource utilization or resource constraints in scalability and performance.

Database / DM Oriented Constraints
- Resource utilization means maximizing the use of available resources, as opposed to merely working within resource constraints, which does not necessarily optimize utilization.
- Resource utilization is important in data mining scaling, and in any case where the data sets are very large.
- A single BIRCH scan of the data set yields at minimum a "good enough" clustering. One or more additional passes are optional and, depending on the constraints of a particular system and application, can be used to improve the quality beyond "good enough".

Database-Oriented Constraints
Database-oriented constraints are what differentiate BIRCH from more general data mining algorithms:
- limited acceptable response time;
- resource utilization: optimize, not just work within, the available resources (necessary for very large data sets);
- fit to available memory;
- minimized I/O cost;
- I/O cost linear in the size of the data set.

Features of the BIRCH Solution
- Locality of reference: each unit clustering decision is made without scanning all data points or all existing clusters.
- Clustering decisions: the measurements reflect the natural "closeness" of points.
- Locality enables the clustering to be incrementally maintained and updated during the clustering process.
- Optional removal of outliers: a cluster is a dense region of points; an outlier is a point in a sparse region.

More Features of the BIRCH Solution
- Optimal memory resource usage: utilization within resource constraints.
- Finest possible subclusters given the memory, I/O, and time constraints: the finest clusters given memory imply the best accuracy achievable (another type of optimal utilization).
- Minimized I/O cost: implies efficiency and the required response time.

More Features of the BIRCH Solution
- Running time linearly scalable in the size of the data set.
- Optionally, incremental scanning of the data set: the entire data set does not have to be scanned in advance, and the increments are adjustable.
- The complete data set is scanned only once (other methods scan it multiple times).

Background (single cluster)
Given N d-dimensional data points {Xi}:
- centroid:  X0 = (1/N) * sum_{i=1..N} Xi
- radius:    R = ( (1/N) * sum_{i=1..N} (Xi - X0)^2 )^(1/2), the average distance from the points to the centroid
- diameter:  D = ( sum_{i=1..N} sum_{j=1..N} (Xi - Xj)^2 / (N(N-1)) )^(1/2), the average pairwise distance within the cluster

Background (two clusters)
Given the centroids X0 and Y0:
- centroid Euclidean distance:  D0 = ( (X0 - Y0)^2 )^(1/2), i.e., ||X0 - Y0||
- centroid Manhattan distance:  D1 = sum_{i=1..d} |X0^(i) - Y0^(i)|

Background (two clusters, cont'd)
- average inter-cluster distance:
  D2 = ( sum_{i=1..N1} sum_{j=1..N2} (Xi - Yj)^2 / (N1 * N2) )^(1/2)
- average intra-cluster distance (over the merged cluster of N1 + N2 points Z):
  D3 = ( sum_{i=1..N1+N2} sum_{j=1..N1+N2} (Zi - Zj)^2 / ((N1 + N2)(N1 + N2 - 1)) )^(1/2)

Clustering Feature
CF = (N, LS, SS), a summarization of a cluster:
- N = |C|, the number of data points
- LS = sum_{i=1..N} Xi, the linear sum of the N data points
- SS = sum_{i=1..N} Xi^2, the square sum of the N data points

Additive Theorem
Assume CF1 = (N1, LS1, SS1) and CF2 = (N2, LS2, SS2) for two disjoint clusters. Then
  CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
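The clustering-feature bookkeeping above can be sketched as follows. This is an illustrative one-dimensional fragment of my own (the class and method names are not from the BIRCH paper); it shows that the centroid and radius are computable from (N, LS, SS) alone, and that two CFs merge by simple component-wise addition.

```python
import math

class CF:
    """Clustering feature for 1-d points: the triple (N, LS, SS)."""
    def __init__(self, points=()):
        self.n = len(points)
        self.ls = sum(points)                  # linear sum
        self.ss = sum(x * x for x in points)   # square sum

    def merge(self, other):
        """Additive theorem: CF1 + CF2 = (N1+N2, LS1+LS2, SS1+SS2)."""
        out = CF()
        out.n = self.n + other.n
        out.ls = self.ls + other.ls
        out.ss = self.ss + other.ss
        return out

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # R = ( sum (Xi - X0)^2 / N )^(1/2) = ( SS/N - centroid^2 )^(1/2)
        return math.sqrt(self.ss / self.n - self.centroid() ** 2)

a = CF([1.0, 3.0])   # one subcluster, centroid 2.0
b = CF([5.0, 7.0])   # another subcluster, centroid 6.0
m = a.merge(b)
print(m.centroid())  # 4.0
print(m.radius())    # sqrt(5), about 2.236
```

This is exactly why BIRCH can cluster incrementally: absorbing a point or merging two subclusters never requires revisiting the original data, only adding triples.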
Information stored in CFs is sufficient to compute:
- centroids;
- measures of the compactness of clusters;
- distance measures between clusters.

CF-Tree
- A height-balanced tree with two kinds of parameters:
  - branching factors B and L: an internal node contains at most B entries [CFi, child_i], and a leaf node contains at most L entries [CFi];
  - threshold T: the diameter of all entries in a leaf node is at most T.
- Leaf nodes are connected via prev and next pointers, which makes data scans efficient.

CF-tree example (figure not reproduced).

BIRCH Algorithm Scaling Details
The CF tree is used to optimize clusters for memory and I/O:
- P is the page size (one page of memory).
- The tree size is a function of T: a larger T yields a smaller CF tree.
- Each node is required to fit in a memory page of size P; nodes are split to fit, or merged for optimal utilization, dynamically.
- P can be varied on the system or in the algorithm for performance tuning and scaling.

BIRCH Algorithm Steps (Phase 1)
- Start a CF tree t1 with an initial threshold T.
- Continue scanning the data and inserting into t1.
- If memory runs out before the scan finishes:
  - increase T;
  - rebuild a smaller CF tree t2 with the new T from the leaf entries of t1; if a leaf entry is a potential outlier, write it to disk, otherwise reinsert it;
  - set t1 <- t2 and resume scanning.
- If disk space runs out, re-absorb the potential outliers into t1.
- When the scan of the data finishes, re-absorb the potential outliers into t1.

Analysis
I/O cost:
  d * N * B * (1 + log_B(M/P)) + log_2(N/N0) * d * (M/ES) * B * (1 + log_B(M/P))
where N is the number of data points, M the memory size, P the page size, d the dimensionality, B the branching factor, ES the size of a CF entry, and N0 the number of data points loaded into memory with the initial threshold T0.

SCIENTIFIC DATA MINING

Terminologies
- Data mining: the semi-automatic discovery of patterns, associations, anomalies, and statistically significant structures in data.
- Pattern recognition: the discovery and characterization of patterns.
- Pattern: an ordering with an underlying structure.
- Feature: an extractable measurement or attribute.

Scientific Data Mining

Figure 1: Key steps in scientific data mining (figure not reproduced).

Data Mining Is Essential
Scientific data sets are very complex:
- multi-sensor, multi-resolution, multi-spectral data;
- high-dimensional data;
- mesh data from simulations;
- data contaminated with noise (sensor noise, clouds, atmospheric turbulence, ...).

Data Mining Is Essential
Data sets are massive. Advances in technology allow us to collect ever-increasing amounts of scientific data in experiments, observations, and simulations:
- astronomy data sets with tens of millions of galaxies;
- Sloan Digital Sky Survey: assuming a pixel size of about 0.25", the whole sky is 10 terapixels (at 2 bytes per pixel).
Collection of this data is made possible by advances in:
- sensors (telescopes, satellites, ...);
- computers and storage (faster, parallel, ...).
We need fast and accurate data analysis techniques to realize the full potential of our enhanced data-collecting ability; manual techniques are impossible at this scale.

Data Mining in Astronomy
FIRST: detecting radio-emitting stars.
- Data set: 100 GB of image data (1996); 16K image maps, 7.1 MB each.
- Example result: found 20K radio-emitting stars among 400K entries.

Mining Climate Data (University of Minnesota)
Research goal: find global climate patterns of interest to Earth scientists. A key interest is finding connections between the ocean/atmosphere and the land.
- Data: global snapshots of values for a number of variables (e.g., average monthly temperature) on land surfaces or water.
- The data span a range of 10 to 50 years.

Mining Climate Data (University of Minnesota)
EOS satellites provide high-resolution measurements:
- finer spatial grids: an 8 km x 8 km grid produces 10,848,672 data points, while a 1 km x 1 km grid produces 694,315,008 data points;
- more frequent measurements;
- multiple instruments.
The Earth Observing System generates terabytes of data per day (e.g., the Terra and Aqua satellites).

Questions and Answers
