Spatial Data Mining
Yang Yubin
Joint Laboratory for Geoinformation Science
The Chinese University of Hong Kong
yangyubin@cuhk.edu.hk
Agenda
• Motivation and General Description
• Data Mining: Basic Concepts
• Data Mining Techniques
• Spatial Data Mining
• Spatial Data Mining Scenarios in Meteorology and Weather Forecasting
• Conclusions
• Questions & Discussions
2004/09/09 Hong Kong Observatory Hong Kong Meteorological Society 2
Why do we need Data Mining?
• Large number of records (cases) (10^8 to 10^12 bytes)
– One thousand (10^3) bytes = 1 kilobyte (KB)
– One million (10^6) bytes = 1 megabyte (MB)
– One billion (10^9) bytes = 1 gigabyte (GB)
– One trillion (10^12) bytes = 1 terabyte (TB)
• High dimensional data (variables)
– 10 to 10^4 attributes
• Only a small portion, typically 5% to 10%, of the collected data is ever analyzed
• We are drowning in data, but starving for knowledge!
Scientific Viewpoint
• Data collected and stored at enormous speeds (Gbyte/hour)
– remote sensor on a satellite
– telescope scanning the skies
– scientific simulations generating terabytes of data
• Classical modeling techniques are infeasible
• Data reduction
• Cataloging, classifying, segmenting data
• Helps scientists in Hypothesis Formation
Current Situations (1)
• Great efforts for construction and maintenance of large information databases
• Data cannot be analyzed by standard statistical methods
– numerous missing records
– data are qualitative rather than quantitative
• We do not always know what information might be represented or how relevant it might be to the questions
Current Situations (2)
• The ways and means of using all this data lag far behind the growth in available data
– Information is often:
• found only by coincidence (internet)
• not explicitly available (company databases)
• accessible to human eyes only with a lot of processing power (astronomical, meteorological and earth observation data)
• This leads to a clear demand for means of uncovering the information and knowledge hidden in the massive quantities of data
What is Data Mining?
• Data mining is concerned with solving problems by analyzing existing data
• "Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from huge amounts of data"
• Alternative name: Knowledge Discovery in Databases (KDD)
– A term that originated in the Artificial Intelligence (AI) field
– KDD consists of several steps (one of which is Data Mining)
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): The whole process of finding useful information and patterns in data
• Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process
• Data mining is the core of the knowledge discovery process
KDD Process
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.
Data Mining: A KDD Process
– Data mining: core of knowledge discovery process
(Figure: KDD process flow: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation)
Typical Data Mining Architecture
(Figure: layered architecture, top to bottom: Graphical user interface; Pattern evaluation; Data mining engine; Database or data warehouse server; Data cleaning & data integration, Filtering; Databases and Data Warehouse. A Knowledge-base sits alongside the layers.)
Data Mining: Confluence of Multiple Disciplines
(Figure: Data Mining at the center, drawing on Database Systems, Statistics, Machine Learning, Visualization, Information Theory, Algorithms, and other disciplines)
Data Mining is:
• A "hot" word for a class of techniques that find patterns in data
• A user-centric, interactive process which leverages analysis technologies and computing power
• A group of techniques that find relationships that have not previously been discovered
• Not reliant on an existing database
• A relatively easy task that requires knowledge of the business problem/subject matter expertise
Experts and clients are needed to:
• Define and redefine problems
• Determine relevant aspects of the problem
• Supply the data
• Remove errors from the data
• Provide constraints on possible patterns
• Interpret patterns and possibly reject implausible ones
• Evaluate predicted effects…
Primary Data Mining Tasks (1)
• Descriptive Modeling
– Finding a compact description for a large dataset [Concept Description]
– Clustering people or things into groups based on their attributes [Clustering]
– Associating what events are likely to occur together [Association Rule]
– Sequencing what events are likely to lead to later events [Sequential Pattern Analysis]
– Discovering the most significant changes [Deviation Detection]
Primary Data Mining Tasks (2)
• Predictive Modeling
– Classifying people or things into groups by recognizing patterns [Classification]
– Forecasting what may happen in the future by mapping a data item to a real-valued prediction variable [Regression]
Concept Description
• Characterization: provides a concise and succinct summarization of the given collection of data
• Discrimination: provides descriptions comparing two or more collections of data
• Can handle complex data types of the attributes
• A more automated process
Concept description: Characterization
Initial Relation
Name | Gender | Major | Birth-Place | Birth_date | Residence | Phone # | GPA
Jim Woodman | M | CS | Vancouver, BC, Canada | 8-12-76 | 3511 Main St., Richmond | 687-4598 | 3.67
Scott Lachance | M | CS | Montreal, Que, Canada | 28-7-75 | 345 1st Ave., Richmond | 253-9106 | 3.70
Laura Lee | F | Physics | Seattle, WA, USA | 25-8-70 | 125 Austin Ave., Burnaby | 420-5232 | 3.83
… | … | … | … | … | … | … | …
Generalization: Removed | Retained | Sci, Eng, Bus | Country | Age range | City | Removed | Excl, VG, ..

Generalized Relation
Gender | Major | Birth_region | Age_range | Residence | GPA | Count
M | Science | Canada | 20-25 | Richmond | Very-good | 16
F | Science | Foreign | 25-30 | Burnaby | Excellent | 22
… | … | … | … | … | … | …

Cross-tabulation of Gender by Birth_Region
Gender | Canada | Foreign | Total
M | 16 | 14 | 30
F | 10 | 22 | 32
Total | 26 | 36 | 62
Clustering
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Clustering
– Grouping a set of data objects into clusters based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
• Example
– Land use: Identification of areas of similar land use in an earth observation database
– City-planning: Identifying groups of houses according to their house type, value, and geographical location
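The grouping principle above can be illustrated with k-means, one common clustering algorithm. The slide does not name a specific algorithm, so this is only a minimal sketch: the house coordinates are made up, and seeding the centroids with the first k points is an arbitrary simplification.

```python
import math

def kmeans(points, k, iterations=10):
    """Plain k-means on tuples of coordinates.

    Assumption: the first k points seed the centroids (real code would
    usually pick seeds more carefully).
    """
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical house locations (x, y), forming two visible groups.
houses = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, clusters = kmeans(houses, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

Points within each resulting group are close to one another and far from the other group, which is the intra-class versus interclass similarity idea in its simplest form.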
Association rule
• Association (correlation and causality)
– age(X, "20..29") ^ income(X, "20..29K") → buys(X, "PC") [support = 2%, confidence = 60%]
• Association rule mining
– Finding frequent patterns, associations, correlations among sets of items or objects in transaction databases, relational databases, and other information repositories
– Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database
• Motivation: finding regularities in data
– What products were often purchased together?
Example: Association rule
Transaction-id | Items bought
10 | a1, a2, a3
20 | a1, a3
30 | a1, a4
40 | a2, a5, a6

• Itemsets A1, A2 ⊆ {a1, …, ak}
• Find all the rules A1 → A2 with minimum confidence and support
– support, s: probability that a transaction contains A1 ∪ A2
– confidence, c: conditional probability that a transaction having A1 also contains A2
Let min_support = 50%, min_conf = 50%:
a1 → a3 (50%, 66.7%)
a3 → a1 (50%, 100%)
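The support and confidence figures on this slide can be checked with a few lines of Python. This is a brute-force sketch over single-item rules only, not a frequent-pattern mining algorithm such as Apriori; the function names are illustrative.

```python
from itertools import permutations

# Transactions from the slide: transaction-id -> items bought.
transactions = {
    10: {"a1", "a2", "a3"},
    20: {"a1", "a3"},
    30: {"a1", "a4"},
    40: {"a2", "a5", "a6"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with `lhs` also has `rhs`."""
    return support(lhs | rhs) / support(lhs)

def single_item_rules(min_support=0.5, min_conf=0.5):
    """Enumerate rules x -> y between single items that pass both thresholds."""
    items = sorted(set().union(*transactions.values()))
    found = []
    for x, y in permutations(items, 2):
        s = support({x, y})
        if s >= min_support and confidence({x}, {y}) >= min_conf:
            found.append((x, y, s, confidence({x}, {y})))
    return found

rules = single_item_rules()
for x, y, s, c in rules:
    print(f"{x} -> {y} (support {s:.0%}, confidence {c:.1%})")
```

Only a1 → a3 and a3 → a1 pass both thresholds: a1 and a3 occur together in two of the four transactions, a1 occurs in three, and a3 occurs in two.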
Sequential Pattern Analysis
• Given a set of sequences, find the complete set of frequent subsequences
(Table: sequence database with SIDs 10, 20, 30 and 40; the sequences that survive in this text are <(ad)c(bc)(ae)> and <(ef)(ab)(df)cb>)
Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
• Applications of sequential pattern
– Customer shopping sequences:
• First buy computer, then CD-ROM, and then digital camera, within 3 months.
– Weblog click streams
– Telephone calling patterns
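To make the notion of a subsequence concrete, the sketch below tests whether a pattern such as <(ab)c> is contained in a sequence. Each sequence is modelled as a list of item sets, which is an assumed representation. Only the two sequences that survive in this text are used, so the full support count of 2 stated on the slide cannot be reproduced here.

```python
def contains(sequence, pattern):
    """Return True if `pattern` is a subsequence of `sequence`.

    Both are lists of item sets. Each pattern element must be a subset of
    some sequence element, and the matched elements must appear in order.
    """
    position = 0
    for element in pattern:
        # Greedily take the earliest remaining element that covers this one.
        while position < len(sequence) and not element <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1  # the next pattern element must match strictly later
    return True

def support_count(sequences, pattern):
    """Number of sequences that contain `pattern`."""
    return sum(1 for s in sequences if contains(s, pattern))

seq_a = [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}]         # <(ad)c(bc)(ae)>
seq_b = [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}]  # <(ef)(ab)(df)cb>
pattern = [{"a", "b"}, {"c"}]                               # <(ab)c>

print(contains(seq_a, pattern), contains(seq_b, pattern))
print(support_count([seq_a, seq_b], pattern))
```

The first sequence has no single element holding both a and b, so it does not contain the pattern; the second does, through (ab) followed later by c.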
Deviation Detection
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– It can be considered as noise or exception but is quite useful in fraud detection, rare events analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Periodicity analysis
– Similarity-based analysis
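One simple way to flag objects that do not comply with the general behavior of the data is a z-score test. The slide does not prescribe a method, so this is an illustrative choice; the rainfall values and the threshold of 2 standard deviations are assumptions.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]

# Hypothetical daily rainfall totals in mm; one day is far from the rest.
rainfall = [2, 3, 2, 4, 3, 2, 3, 50]
outliers = zscore_outliers(rainfall)
print(outliers)
```

Whether the flagged day is noise or a rare event worth studying is exactly the judgment the slide leaves to the analyst.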
Classification and Regression
• Classification:
– constructs a model (classifier) based on the training set and uses it in classifying new data
– Example: Climate Classification, …
• Regression:
– models continuous-valued functions, i.e., predicts unknown or missing values
– Example: stock trends prediction, …
Classification (1): Model Construction
Training Data → Classification Algorithms → Classifier (Model)

Training Data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Classifier (Model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
Classification (2): Prediction Using the Model
Classifier applied to Testing Data and Unseen Data

Testing Data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen Data: (Jeff, Professor, 4) → Tenured?
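The prediction step can be traced in code by applying the rule from the model construction slide (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') to the testing data and to the unseen record. The function name is illustrative, and the rule is hand-coded here rather than learned.

```python
def predict_tenured(rank, years):
    """Rule from the slides: professor, or more than 6 years, means tenured."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, actual label).
testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

correct = sum(
    1 for _, rank, years, actual in testing_data
    if predict_tenured(rank, years) == actual
)
accuracy = correct / len(testing_data)
jeff = predict_tenured("Professor", 4)

print(f"accuracy on testing data: {accuracy:.0%}")
print(f"Jeff tenured? {jeff}")
```

The rule misclassifies Merlisa (7 years but not tenured), so it scores 3 out of 4 on the testing data, and it predicts that Jeff is tenured because his rank is Professor.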
Classification Techniques
• Decision Tree Induction
• Bayesian Classification
• Neural Networks
• Genetic Algorithms
• Fuzzy Set and Logic
Regression
• Regression is similar to classification
– First, construct a model
– Second, use the model to predict an unknown value
• Methods
– Linear and multiple regression
– Non-linear regression
• Regression is different from classification
– Classification predicts categorical class labels
– Regression models continuous-valued functions
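The two steps named above (construct a model, then use it to predict) look like this for simple linear regression fitted by ordinary least squares. The data points are made up and lie exactly on y = 1 + 2x so the result is easy to verify by hand.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    """Evaluate the fitted line at x."""
    intercept, slope = model
    return intercept + slope * x

# Step 1: construct the model from hypothetical observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
model = fit_line(xs, ys)

# Step 2: use the model to predict an unknown continuous value.
print(model, predict(model, 5.0))
```

Unlike the tenure classifier, the output here is a real number rather than a class label, which is the distinction the last bullet draws.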
Are All the "Discovered" Patterns Interesting?
• A data mining task may generate thousands of patterns; not all of them are interesting.
• Interestingness measures:
– A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
– Objective vs. subjective interestingness measures:
• Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
• Subjective: based on user's belief in the data, e.g., unexpectedness, novelty, executability, etc.
Spatial Data Mining
• Spatial Patterns
– Spatial outliers
– Location prediction
– Associations, co-locations
– Hotspots, Clustering, trends, …
• Primary Tasks
– Mining Spatial Association Rules
– Spatial Classification and Prediction
– Spatial Data Clustering Analysis
– Spatial Outlier Analysis
• Example: Unusual warming of Pacific ocean (El Nino) affects weather in USA…
Spatial Data Mining Results
• Understanding spatial data, discovering relationships between spatial and nonspatial data, construction of spatial knowledge bases, etc.
• In various forms
– The description of the general weather patterns in a set of geographic regions is a spatial characteristic rule.
– The comparison of two weather patterns in two geographic regions is a spatial discriminant rule.
– A rule like "most cities in Canada are close to the Canada-US border" is a spatial association rule
• near(x, coast) ^ southeast(x, USA) → hurricane(x), (70%)
– Others: spatial clusters, …
What is Spatial Data?
• The data related to objects that occupy space
– traffic, bird habitats, global climate, logistics, ...
• Object types:
– Points, Lines, Polygons, etc.
• Used in/for:
– GIS (Geographic Information Systems)
– Meteorology
– Astronomy
– Environmental studies, etc.
Basic Concepts (1)
• Spatial data mining follows the same functions as general data mining, with the end objective of finding patterns in geography, meteorology, etc.
• The main difference (Spatial autocorrelation)
– the neighbors of a spatial object may have an influence on it and therefore have to be considered as well
• Spatial attributes
– Topological
• adjacency or inclusion information
– Geometric
• position (longitude/latitude), area, perimeter, boundary polygon
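Spatial autocorrelation, the influence of neighbors described above, is often quantified with Moran's I. The slides do not name this statistic, so it is offered here only as one common illustration; the four-cell chain and its values are made up.

```python
def morans_i(values, weights):
    """Moran's I for `values` given a square spatial weight matrix `weights`."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between every pair of locations.
    cross = sum(
        weights[i][j] * dev[i] * dev[j]
        for i in range(n) for j in range(n)
    )
    variance_sum = sum(d * d for d in dev)
    return (n / total_weight) * (cross / variance_sum)

def chain_weights(n):
    """Adjacency weights for n cells in a row: each cell neighbors the next."""
    w = [[0] * n for _ in range(n)]
    for i in range(n - 1):
        w[i][i + 1] = w[i + 1][i] = 1
    return w

w = chain_weights(4)
clustered = morans_i([1, 1, 5, 5], w)    # similar values sit side by side
alternating = morans_i([1, 5, 1, 5], w)  # neighbors always differ
print(clustered, alternating)
```

A positive value indicates that neighboring cells tend to be alike and a negative value that they tend to differ, which is why neighbors have to be considered when mining spatial data.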
Basic Concepts (2)
• Spatial neighborhood
– Topological relation
• "intersect", "overlap", "disjoint", …
– Distance relation
• "close_to", "far_away", …
– Direction/orientation relation
• "left_of", "west_of", …
• Global model might be inconsistent with regional models
(Figure: Global Model compared with Local Model)
Applications
• NASA Earth Observing System (EOS): Earth science data
• National Inst. of Justice: crime mapping
• Census Bureau, Dept. of Commerce: census data
• Dept. of Transportation (DOT): traffic data
• National Inst. of Health (NIH): cancer clusters
• …
Example: What Kind of Houses Are Highly Valued?—Associative Classification
Meteorological Data Mining
• Motivation
– Many analysis methods must be applied to fast-growing data for climate studies
• Result
– Appropriate presentation instruments (graphs, maps, reports, etc.) must be applied
• Examples
– Spatial outliers can be associated with disastrous natural events such as tornadoes, hurricanes, and forest fires
– Associations between disaster events and certain meteorological observations
Case Studies (1): Astronomy
• SKICAT (SKy Image Cataloging and Analysis Tool) (Caltech, US)
• The Palomar Observatory discovered 22 quasars with the help of data mining
• The Second Palomar Observatory Sky Survey (POSS-II)
– decision tree methods
– classification of galaxies, stars and other stellar objects
• About 3 TB of sky images were analyzed
Case Studies (2): NCAR & UCAR
• National Center for Atmospheric Research (NCAR) & University Corporation for Atmospheric Research (UCAR), US
– http://www.ucar.edu/
• "Automatic Fuzzy Logic-based systems now compete with human forecasts"
– Richard Wagoner, Deputy Director at Research Applications Program (RAP), NCAR
• Intelligent Weather System (IWS)
– Detection and forecast in the areas of en-route turbulence, en-route icing, ceiling/visibility, and convective hazards in the aviation community
– Road winter maintenance, airport operations, and flash flood forecasting
Operational Application
• Prediction System: WIND-2
– WIND: ―Weather Is Not Discrete‖
• Consists of three parts:
– Data
• Past airport weather observations: 30 years of hourly observations, a time series of 300,000 detailed observations
• Recent and current observations (METARs)
• Model-based guidance (knowledge of near-term changes, e.g., imminent wind shift, onset/cessation of precipitation)
– Fuzzy similarity-measuring algorithm
– Prediction composition: predictions based on k nearest neighbors (k-NN, an instance-based method)
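The idea of composing a prediction from the k most similar past cases can be sketched as follows. This is not the WIND-2 algorithm: plain Euclidean distance stands in for its fuzzy similarity measure, and the observations (temperature in °C, relative humidity in %, visibility in km) are invented for illustration.

```python
import math

def knn_predict(history, query, k=3):
    """Average the outcomes of the k past cases closest to `query`.

    `history` is a list of (features, outcome) pairs. Euclidean distance is
    an assumed stand-in for a domain-specific similarity measure.
    """
    ranked = sorted(history, key=lambda case: math.dist(case[0], query))
    nearest = ranked[:k]
    return sum(outcome for _, outcome in nearest) / len(nearest)

# Hypothetical past cases: ((temperature, humidity), visibility a few hours later).
history = [
    ((20.0, 90.0), 2.0),
    ((21.0, 88.0), 3.0),
    ((30.0, 40.0), 10.0),
    ((29.0, 45.0), 9.0),
    ((19.0, 95.0), 1.0),
]

forecast = knn_predict(history, query=(20.0, 91.0), k=3)
print(forecast)
```

The three cool, humid cases dominate the forecast for a cool, humid query, which is the "similar weather situations lead to similar outcomes" reasoning behind analog forecasting.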
Operational Application
• Hybrid methods are used to predict weather
– Dynamical approach: based upon equations of the atmosphere, uses finite element techniques
– Empirical approach: similar weather situations lead to similar outcomes
• WIND runs in real-time for meteorologically different sites
• Data-mining/forecast process takes about one second
Case Studies (3): CrossGrid (EU)
• Objective
– To develop, implement and exploit new Grid components for interactive compute- and data-intensive applications, such as decision support systems for flood crisis teams and air pollution combined with weather forecasting
• Main tasks in Meteorological applications package
– Data mining for atmospheric circulation patterns
• Find a set of representative prototypes of the atmospheric patterns in a region of interest
– Weather forecasting for maritime applications
– Ocean wave forecasting by models of various complexity
• Data
– ERA-15 using a T106L31 model (from 1978 to 1994) with 1.125° resolution
– Terabytes
– Comprises data from approx. 20 variables (such as temperature, humidity, pressure, etc.) at 30 pressure levels of a 360x360 nodes grid
(Figure panels: SOM Application for Data Mining, Adaptive Competitive Learning; Downscaling Weather Forecasts, sub-grid details escape from numerical models)
Dept. of Applied Mathematics Universidad de Cantabria
Santander, Spain
Case Studies (4): Typhoon Image Data Mining
• Objective
– To establish algorithms and database models for the discovery of information and knowledge useful for typhoon analysis and prediction
– Content-based image retrieval technology to search for similar cloud patterns in the past
– Data mining technology to extract spatio-temporal pattern information that is meaningful from a meteorological viewpoint
• Result
– Alignment of Multiple Typhoons, Explore by Projection to 2D Plane, Diurnal Analysis
Methods
• Archive of approximately 34,000 typhoon images for the northern and southern hemispheres
• Various data mining approaches
– Principal component analysis (PCA), K-means clustering, self-organizing map (SOM), wavelet transform
• Retrieval of historical similar patterns from image databases to perform instance-based typhoon analysis and prediction
• Extracting the eigenvectors of the whole typhoon image collection
Case Studies (5): LEAD
• Linked Environments for Atmospheric Discovery
– To accommodate the real time, on-demand, and dynamically-adaptive nature of mesoscale problems
• Complexities: vastly disparate, high volume and bandwidth data
• Tremendous computational demands
– Used in accessing, preparing, assimilating, predicting, managing, mining/analyzing, and displaying a broad array of meteorological and related information
• Data Mining Solution Center: ITSC, The Univ. of Alabama in Huntsville, US
– http://datamining.itsc.uah.edu/index.jsp
ADaM
• The Algorithm Development and Mining
– Component architecture data mining toolkit
– For geophysical phenomena detection and feature extraction
• Applications
– Detecting tropical cyclones and estimating their maximum sustained wind speed
– Mesocyclone Identification from RADAR
– Detecting Cumulus Cloud Fields in GOES Images
ADaM (cont’d)
– Mesoscale Convective Systems Detection
• EOS Special Sensor Microwave/Imager (SSM/I) Brightness Temperature Swaths from DMSP F13 and F14
– Rain Detection Using SSM/I
– Lightning Detection Using OLS
– Rain Accumulation Study
Case Studies (6): Rainfall Classification
University of Oklahoma, Norman
• To classify significant and interesting features within a two-dimensional spatial field of meteorological data
– Observed or predicted rainfall
• Data source
– Estimates of hourly accumulated rainfall
– Using radar and rain gauge data
• "Attributes" for classification
– Statistical parameters representing the distribution of rainfall amounts across the region
• Classification Method
– Hierarchical cluster analysis
Many Others…
• JARtool Project (Fayyad et al., NASA)
• Identifying volcanoes on the surface of Venus from images transmitted by the Magellan spacecraft
• More than 30,000 high-resolution Synthetic Aperture Radar (SAR) images of the surface of Venus from different angles
• The obtained accuracy was about 80%
What can we learn from those scenarios?
• Data mining is a promising approach to meteorological analysis
• Very strong interaction between scientists and the knowledge discovery system is necessary
• The users define features of the meteorological phenomena based on their expert knowledge
• The system extracts the instances of such phenomena
• Then, further analysis of phenomena is possible
Conclusions
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data mining, and other steps
• Data mining can be performed in a variety of information repositories
• Data mining tasks: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
And now discussion