					Spatial Data Mining
Yang Yubin
Joint Laboratory for Geoinformation Science
The Chinese University of Hong Kong
yangyubin@cuhk.edu.hk

Agenda
• Motivation and General Description
• Data Mining: Basic Concepts
• Data Mining Techniques
• Spatial Data Mining
• Spatial Data Mining Scenarios in Meteorology and Weather Forecasting
• Conclusions
• Questions & Discussions
2004/09/09 Hong Kong Observatory Hong Kong Meteorological Society 2


Why do we need Data Mining?
• Large number of records (cases): 10^8 to 10^12 bytes
– One thousand (10^3) bytes = 1 kilobyte (KB)
– One million (10^6) bytes = 1 megabyte (MB)
– One billion (10^9) bytes = 1 gigabyte (GB)
– One trillion (10^12) bytes = 1 terabyte (TB)
• High-dimensional data (variables)
– 10 to 10^4 attributes
• Only a small portion, typically 5% to 10%, of the collected data is ever analyzed
• We are drowning in data, but starving for knowledge!

Scientific Viewpoint
• Data collected and stored at enormous speeds (GB/hour)
– Remote sensors on a satellite
– Telescopes scanning the skies
– Scientific simulations generating terabytes of data
• Classical modeling techniques are infeasible
• Data reduction
• Cataloging, classifying, segmenting data
• Helps scientists in hypothesis formation

Current Situations (1)
• Great effort goes into the construction and maintenance of large information databases
• Data cannot be analyzed by standard statistical methods
– Numerous missing records
– Data are qualitative rather than quantitative
• We do not always know what information might be represented or how relevant it might be to the questions at hand

Current Situations (2)
• The ways and means of using all this data lag far behind the growth of available data
– Information often:
• can only be found with a lot of coincidence (internet)
• is not explicitly available (company databases)
• is only accessible to human eyes by using lots of processing power (astronomical, meteorological and earth observation data)
• This leads to a clear demand for means of uncovering the information and knowledge hidden in the massive quantities of data


What is Data Mining?
• Data mining is concerned with solving problems by analyzing existing data
• "Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from huge amounts of data"
• Alternative name: Knowledge Discovery in Databases (KDD)
– A term that originated in the Artificial Intelligence (AI) field
– KDD consists of several steps (one of which is data mining)

Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): the whole process of finding useful information and patterns in data
• Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process
• Data mining is the core of the knowledge discovery process

KDD Process

• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to the user in a meaningful manner.

Data Mining: A KDD Process
– Data mining: core of knowledge discovery process
[Figure: KDD pipeline, from bottom to top: Databases; Data Cleaning and Data Integration; Data Warehouse; Selection; Task-relevant Data; Data Mining; Pattern Evaluation]

Typical Data Mining Architecture
[Figure: layered architecture, top to bottom: graphical user interface; pattern evaluation; data mining engine; database or data warehouse server; data cleaning, data integration and filtering; databases and data warehouse. A knowledge base sits alongside the layers.]

Data Mining: Confluence of Multiple Disciplines
[Figure: Data Mining at the center, surrounded by Database Systems, Statistics, Machine Learning, Visualization, Information Theory, Algorithms, and other disciplines]

Data Mining is:
• A "hot" word for a class of techniques that find patterns in data
• A user-centric, interactive process which leverages analysis technologies and computing power
• A group of techniques that find relationships that have not previously been discovered
• Not reliant on an existing database
• A relatively easy task that requires knowledge of the business problem/subject matter expertise

Experts and clients are needed to:
• Define and redefine problems
• Determine relevant aspects of the problem
• Supply the data
• Remove errors from the data
• Provide constraints on possible patterns
• Interpret patterns and possibly reject implausible ones
• Evaluate predicted effects…


Primary Data Mining Tasks (1)
• Descriptive Modeling
– Finding a compact description for a large dataset [Concept Description]
– Clustering people or things into groups based on their attributes [Clustering]
– Associating what events are likely to occur together [Association Rule]
– Sequencing what events are likely to lead to later events [Sequential Pattern Analysis]
– Discovering the most significant changes [Deviation Detection]

Primary Data Mining Tasks (2)
• Predictive Modeling
– Classifying people or things into groups by recognizing patterns [Classification]
– Forecasting what may happen in the future by mapping a data item to a real-valued prediction variable [Regression]

Concept Description
• Characterization: provides a concise and succinct summarization of the given collection of data
• Discrimination: provides descriptions comparing two or more collections of data
• Can handle complex data types of the attributes
• A more automated process

Concept description: Characterization
Initial Relation:
Name | Gender | Major | Birth-Place | Birth_date | Residence | Phone # | GPA
Jim Woodman | M | CS | Vancouver, BC, Canada | 8-12-76 | 3511 Main St., Richmond | 687-4598 | 3.67
Scott Lachance | M | CS | Montreal, Que, Canada | 28-7-75 | 345 1st Ave., Richmond | 253-9106 | 3.70
Laura Lee | F | Physics | Seattle, WA, USA | 25-8-70 | 125 Austin Ave., Burnaby | 420-5232 | 3.83
… | … | … | … | … | … | … | …

Generalization applied to each attribute:
Removed | Retained | Sci, Eng, Bus | Country | Age range | City | Removed | Excl, VG, ...

Generalized Relation:
Gender | Major | Birth_region | Age_range | Residence | GPA | Count
M | Science | Canada | 20-25 | Richmond | Very-good | 16
F | Science | Foreign | 25-30 | Burnaby | Excellent | 22
… | … | … | … | … | … | …

Crosstab of Gender by Birth_Region:
Gender | Canada | Foreign | Total
M | 16 | 14 | 30
F | 10 | 22 | 32
Total | 26 | 36 | 62

Clustering
• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Clustering
– Grouping a set of data objects into clusters based on the principle: maximizing the intra-class similarity and minimizing the inter-class similarity
• Example
– Land use: identification of areas of similar land use in an earth observation database
– City planning: identifying groups of houses according to their house type, value, and geographical location
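The grouping principle above can be illustrated with a minimal k-means sketch. K-means is only one of many clustering methods and is not named on this slide; the sample points, the choice of the first k points as initial centers, and the function name kmeans are all assumptions made for illustration.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain k-means on 2-D points.
    Assumption: the first k points serve as the initial centers."""
    centers = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c, members in enumerate(clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centers.append((mean_x, mean_y))
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # assignments are stable
        centers = new_centers
    return centers, clusters

# Hypothetical 2-D feature vectors (for example, two scaled house attributes).
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.3)]
centers, clusters = kmeans(points, k=2)
for center, members in zip(centers, clusters):
    print(center, members)
```

The result depends on the initial centers, so practical use normally tries several initializations.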

Association rule
• Association (correlation and causality)
– age(X, "20..29") ^ income(X, "20..29K") ⇒ buys(X, "PC") [support = 2%, confidence = 60%]
• Association rule mining
– Finding frequent patterns, associations, and correlations among sets of items or objects in transaction databases, relational databases, and other information repositories
– Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database
• Motivation: finding regularities in data
– What products were often purchased together?

Example: Association rule
Transaction-id | Items bought
10 | a1, a2, a3
20 | a1, a3
30 | a1, a4
40 | a2, a5, a6

• Itemsets A1, A2 ⊆ {a1, …, ak}
• Find all the rules A1 ⇒ A2 with minimum confidence and support
– support, s: probability that a transaction contains A1 ∪ A2
– confidence, c: conditional probability that a transaction having A1 also contains A2

Let min_support = 50%, min_conf = 50%:
a1 ⇒ a3 (50%, 66.7%)
a3 ⇒ a1 (50%, 100%)
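The support and confidence figures in this example can be checked with a short sketch. The four transactions are taken from the table on this slide; the function names are illustrative, and the rule search is deliberately limited to rules with a single item on each side, which is a simplification rather than a full frequent-itemset miner.

```python
from itertools import permutations

# The four transactions from the slide, keyed by transaction id.
transactions = {
    10: {"a1", "a2", "a3"},
    20: {"a1", "a3"},
    30: {"a1", "a4"},
    40: {"a2", "a5", "a6"},
}

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(lhs, rhs):
    """Conditional probability that a transaction containing lhs also contains rhs."""
    return support(lhs | rhs) / support(lhs)

def single_item_rules(min_support, min_conf):
    """Enumerate rules a => b over single items only (a simplification)."""
    items = sorted(set().union(*transactions.values()))
    found = []
    for a, b in permutations(items, 2):
        s = support({a, b})
        if s >= min_support:
            c = confidence({a}, {b})
            if c >= min_conf:
                found.append((a, b, s, c))
    return found

rules = single_item_rules(min_support=0.5, min_conf=0.5)
for a, b, s, c in rules:
    print(f"{a} => {b} (support {s:.0%}, confidence {c:.1%})")
```

With both thresholds at 50%, only the pair a1 and a3 is frequent enough, which yields the two rules listed on the slide.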

Sequential Pattern Analysis
• Given a set of sequences, find the complete set of frequent subsequences

SID | sequence
10 | <a(abc)(ac)d(cf)>
20 | <(ad)c(bc)(ae)>
30 | <(ef)(ab)(df)cb>
40 | <eg(af)cbc>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern

• Applications of sequential patterns
– Customer shopping sequences:
• First buy a computer, then a CD-ROM, and then a digital camera, within 3 months.
– Weblog click streams
– Telephone calling patterns
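The claim that <(ab)c> reaches the support threshold can be verified with a small containment check. The sketch below encodes each sequence from the table as a list of item sets; the helper names contains and sequence_support are illustrative, and this only counts the support of one given pattern rather than mining all frequent subsequences.

```python
def contains(sequence, pattern):
    """True if the pattern's elements can be matched, in order, to
    elements of the sequence that are supersets of them."""
    pos = 0
    for element in pattern:
        # Greedily advance to the first remaining element that covers this one.
        while pos < len(sequence) and not element <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

def sequence_support(database, pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(1 for seq in database.values() if contains(seq, pattern))

# The sequence database from the slide; each element is a set of items.
database = {
    10: [set("a"), set("abc"), set("ac"), set("d"), set("cf")],
    20: [set("ad"), set("c"), set("bc"), set("ae")],
    30: [set("ef"), set("ab"), set("df"), set("c"), set("b")],
    40: [set("e"), set("g"), set("af"), set("c"), set("b"), set("c")],
}

pattern = [set("ab"), set("c")]  # the pattern <(ab)c>
pattern_support = sequence_support(database, pattern)
print(pattern_support)
```

Sequences 10 and 30 contain an element with both a and b followed later by c, so the pattern meets min_sup = 2.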

Deviation Detection
• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– It can be considered as noise or an exception, but is quite useful in fraud detection and rare-event analysis
• Trend and evolution analysis
– Trend and deviation: regression analysis
– Periodicity analysis
– Similarity-based analysis
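One simple way to flag objects that "do not comply with the general behavior of the data" is a z-score test. This is a sketch under assumptions: the slide does not prescribe a method, the temperature readings are invented, and the threshold of 2.0 standard deviations is an arbitrary illustrative choice.

```python
import statistics

def zscore_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` population
    standard deviations away from the mean."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing deviates
    return [v for v in values if abs(v - mean) / spread > threshold]

# Hypothetical daily temperature readings with one suspicious value.
readings = [21.0, 22.5, 21.8, 22.1, 21.4, 35.0, 22.0, 21.7]
outliers = zscore_outliers(readings)
print(outliers)
```

Because the mean and standard deviation are themselves pulled by extreme values, robust alternatives based on the median are often preferred on real data.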

Classification and Regression
• Classification:
– Constructs a model (classifier) based on the training set and uses it to classify new data
– Example: climate classification, …
• Regression:
– Models continuous-valued functions, i.e., predicts unknown or missing values
– Example: stock trend prediction, …

Classification (1): Model Construction
Training Data → Classification Algorithms → Classifier (Model)

Training Data:
NAME | RANK | YEARS | TENURED
Mike | Assistant Prof | 3 | no
Mary | Assistant Prof | 7 | yes
Bill | Professor | 2 | yes
Jim | Associate Prof | 7 | yes
Dave | Assistant Prof | 6 | no
Anne | Associate Prof | 3 | no

Classifier (Model):
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'

Classification (2): Prediction Using the Model
Classifier applied to Testing Data and to Unseen Data

Testing Data:
NAME | RANK | YEARS | TENURED
Tom | Assistant Prof | 2 | no
Merlisa | Associate Prof | 7 | no
George | Professor | 5 | yes
Joseph | Assistant Prof | 7 | yes

Unseen Data: (Jeff, Professor, 4)
Tenured?
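The two classification slides can be traced in a few lines of code. The rule and both data tables come from the slides; the function name predict_tenured and the accuracy calculation are illustrative additions.

```python
def predict_tenured(rank, years):
    """The classifier from the slide:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# Training and testing tables from the slides: (name, rank, years, tenured).
training = [
    ("Mike", "Assistant Prof", 3, "no"),
    ("Mary", "Assistant Prof", 7, "yes"),
    ("Bill", "Professor", 2, "yes"),
    ("Jim", "Associate Prof", 7, "yes"),
    ("Dave", "Assistant Prof", 6, "no"),
    ("Anne", "Associate Prof", 3, "no"),
]
testing = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(rows):
    """Share of rows whose recorded label matches the rule's prediction."""
    correct = sum(predict_tenured(rank, years) == label for _, rank, years, label in rows)
    return correct / len(rows)

train_accuracy = accuracy(training)
test_accuracy = accuracy(testing)
jeff = predict_tenured("Professor", 4)  # the unseen case (Jeff, Professor, 4)
print(train_accuracy, test_accuracy, jeff)
```

The rule fits every training row but misclassifies Merlisa in the testing data, which shows why a model is evaluated on data it was not built from.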

Classification Techniques
• Decision Tree Induction
• Bayesian Classification
• Neural Networks
• Genetic Algorithms
• Fuzzy Set and Logic

Regression
• Regression is similar to classification
– First, construct a model
– Second, use the model to predict unknown values
• Methods
– Linear and multiple regression
– Non-linear regression
• Regression is different from classification
– Classification predicts categorical class labels
– Regression models continuous-valued functions
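The two steps (construct a model, then predict an unknown value) can be sketched with ordinary least squares for a single predictor. The data points are invented for illustration, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations of a continuous-valued target.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

# Step 1: construct the model.
slope, intercept = fit_line(xs, ys)

# Step 2: use the model to predict an unknown value.
prediction = slope * 6.0 + intercept
print(slope, intercept, prediction)
```

Multiple regression extends the same idea to several predictors, usually through a linear algebra library.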

Are All the "Discovered" Patterns Interesting?
• A data mining task may generate thousands of patterns; not all of them are interesting.
• Interestingness measures:
– A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
– Objective vs. subjective interestingness measures:
• Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
• Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, executability, etc.


Spatial Data Mining
• Spatial Patterns
– Spatial outliers
– Location prediction
– Associations, co-locations
– Hotspots, clustering, trends, …
• Primary Tasks
– Mining Spatial Association Rules
– Spatial Classification and Prediction
– Spatial Data Clustering Analysis
– Spatial Outlier Analysis
• Example: Unusual warming of the Pacific Ocean (El Niño) affects weather in the USA…

Spatial Data Mining Results
• Understanding spatial data, discovering relationships between spatial and non-spatial data, construction of spatial knowledge bases, etc.
• In various forms
– The description of the general weather patterns in a set of geographic regions is a spatial characteristic rule.
– The comparison of two weather patterns in two geographic regions is a spatial discriminant rule.
– A rule like "most cities in Canada are close to the Canada-US border" is a spatial association rule
• near(x, coast) ^ southeast(x, USA) ⇒ hurricane(x), (70%)
– Others: spatial clusters, …

What is Spatial Data?
• Data related to objects that occupy space
– Traffic, bird habitats, global climate, logistics, ...
• Object types:
– Points, lines, polygons, etc.
• Used in/for:
– GIS (Geographic Information Systems)
– Meteorology
– Astronomy
– Environmental studies, etc.

Basic Concepts (1)
• Spatial data mining follows the same functions as data mining in general, with the end objective of finding patterns in geography, meteorology, etc.
• The main difference: spatial autocorrelation
– The neighbors of a spatial object may have an influence on it and therefore have to be considered as well
• Spatial attributes
– Topological
• adjacency or inclusion information
– Geometric
• position (longitude/latitude), area, perimeter, boundary polygon
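Spatial autocorrelation is commonly quantified with Moran's I, a standard statistic that the slide does not name. The sketch below assumes binary adjacency weights (1 for neighbors, 0 otherwise) on four hypothetical cells arranged in a row; the function name morans_i and the sample values are illustrative.

```python
def morans_i(values, neighbors):
    """Moran's I with binary adjacency weights.
    `neighbors` maps each cell index to the indices of its neighbors."""
    n = len(values)
    mean = sum(values) / n
    deviations = [v - mean for v in values]
    # Sum of products of deviations over all neighboring pairs (both directions).
    cross = sum(deviations[i] * deviations[j]
                for i, adjacent in neighbors.items() for j in adjacent)
    total_weight = sum(len(adjacent) for adjacent in neighbors.values())
    squared = sum(d * d for d in deviations)
    return (n / total_weight) * (cross / squared)

# Four cells in a row: 0 - 1 - 2 - 3 (hypothetical layout).
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

smooth = morans_i([1.0, 2.0, 3.0, 4.0], neighbors)       # similar neighbors
alternating = morans_i([1.0, 4.0, 1.0, 4.0], neighbors)  # dissimilar neighbors
print(smooth, alternating)
```

A positive value indicates that neighboring cells tend to carry similar values, while a negative value indicates that neighbors tend to differ.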

Basic Concepts (2)
• Spatial neighborhood
– Topological relation
• "intersect", "overlap", "disjoint", …
– Distance relation
• "close_to", "far_away", …
– Direction/orientation relation
• "left_of", "west_of", …
• A global model might be inconsistent with regional models
[Figure: global model compared with local model]

Applications
• NASA Earth Observing System (EOS): Earth science data
• National Inst. of Justice: crime mapping
• Census Bureau, Dept. of Commerce: census data
• Dept. of Transportation (DOT): traffic data
• National Inst. of Health (NIH): cancer clusters
• …

Example: What Kind of Houses Are Highly Valued?—Associative Classification



Meteorological Data Mining
• Motivation
– Many analysis methods must be applied to fast-growing data for climate studies
• Result
– Appropriate presentation instruments (graphs, maps, reports, etc.) must be applied
• Examples
– Spatial outliers can be associated with disastrous natural events such as tornadoes, hurricanes, and forest fires
– Associations between disaster events and certain meteorological observations

Case Studies (1): Astronomy
• SKICAT (SKy Image Cataloging and Analysis Tool) (Caltech, US)
• The Palomar Observatory discovered 22 quasars with the help of data mining
• The Second Palomar Observatory Sky Survey (POSS-II)
– Decision tree methods
– Classification of galaxies, stars and other stellar objects
• About 3 TB of sky images were analyzed

Case Studies (2): NCAR & UCAR
• National Center for Atmospheric Research (NCAR) & University Corporation for Atmospheric Research (UCAR), US
– http://www.ucar.edu/
• "Automatic Fuzzy Logic-based systems now compete with human forecasts"
– Richard Wagoner, Deputy Director at Research Applications Program (RAP), NCAR
• Intelligent Weather System (IWS)
– Detection and forecast in the areas of en-route turbulence, en-route icing, ceiling/visibility, and convective hazards in the aviation community
– Road winter maintenance, airport operations, and flash flood forecasting

Operational Application
• Prediction System: WIND-2
– WIND: "Weather Is Not Discrete"
• Consists of three parts:
– Data
• Past airport weather observations: 30 years of hourly observations, a time series of 300,000 detailed observations
• Recent and current observations (METARs)
• Model-based guidance (knowledge of near-term changes, e.g., imminent wind shift, onset/cessation of precipitation)
– Fuzzy similarity-measuring algorithm
– Prediction composition: predictions based on k nearest neighbors (k-nn, an instance-based method)
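The combination of a fuzzy similarity measure with a k-nearest-neighbor prediction can be sketched as follows. This is not the WIND-2 algorithm: the triangular membership function, the use of the minimum as the fuzzy AND, the attributes (temperature and wind speed), the tolerances, and the past cases are all assumptions chosen to illustrate the general idea.

```python
def closeness(a, b, tolerance):
    """Triangular fuzzy membership: 1.0 when equal,
    falling linearly to 0.0 at a difference of `tolerance`."""
    return max(0.0, 1.0 - abs(a - b) / tolerance)

def similarity(case, query, tolerances):
    """Overall similarity is the weakest attribute match (fuzzy AND as minimum)."""
    return min(closeness(c, q, t) for c, q, t in zip(case, query, tolerances))

def knn_forecast(history, query, tolerances, k=3):
    """Similarity-weighted average of the outcomes of the k most similar past cases."""
    scored = sorted(
        ((similarity(obs, query, tolerances), outcome) for obs, outcome in history),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    total = sum(score for score, _ in scored)
    if total == 0:
        return None  # no past case resembles the query at all
    return sum(score * outcome for score, outcome in scored) / total

# Hypothetical past cases: ((temperature in C, wind speed in kt), cloud ceiling in m one hour later).
history = [
    ((20.0, 5.0), 300.0),
    ((21.0, 6.0), 400.0),
    ((30.0, 15.0), 3000.0),
    ((19.0, 4.0), 200.0),
    ((29.0, 14.0), 2500.0),
]
forecast = knn_forecast(history, query=(20.0, 5.0), tolerances=(5.0, 5.0), k=3)
print(forecast)
```

Here the three cool, calm cases dominate the forecast, while the two warm, windy cases receive zero similarity and are ignored.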

Operational Application (cont’d)
• Hybrid methods are used to predict weather
– Dynamical approach: based upon equations of the atmosphere, uses finite element techniques
– Empirical approach: similar weather situations lead to similar outcomes
• WIND runs in real time for meteorologically different sites
• The data-mining/forecast process takes about one second

Case Studies (3): CrossGrid (EU)
• Objective
– To develop, implement and exploit new Grid components for interactive compute- and data-intensive applications, such as decision support systems for flooding crisis teams and air pollution modelling combined with weather forecasting
• Main tasks in the meteorological applications package
– Data mining for atmospheric circulation patterns
• Find a set of representative prototypes of the atmospheric patterns in a region of interest
– Weather forecasting for maritime applications
– Ocean wave forecasting by models of various complexity

• Data
– ERA-15 using a T106L31 model (from 1978 to 1994) with 1.125° resolution
– Terabytes
– Comprises data from approx. 20 variables (such as temperature, humidity, pressure, etc.) at 30 pressure levels of a 360x360 nodes grid

[Figures: SOM Application for Data Mining (Adaptive Competitive Learning); Downscaling Weather Forecasts (sub-grid details escape from numerical models). Source: Dept. of Applied Mathematics, Universidad de Cantabria, Santander, Spain]

Case Studies (4): Typhoon Image Data Mining
• Objective
– To establish algorithms and database models for the discovery of information and knowledge useful for typhoon analysis and prediction
– Content-based image retrieval technology to search for similar cloud patterns in the past
– Data mining technology to extract spatio-temporal pattern information that is meaningful from a meteorological viewpoint
• Result
– Alignment of multiple typhoons, exploration by projection to a 2D plane, diurnal analysis

Methods
• Archive of approximately 34,000 typhoon images for the northern and southern hemispheres
• Various data mining approaches
– Principal component analysis (PCA), K-means clustering, self-organizing map (SOM), wavelet transform
• Retrieval of historical similar patterns from image databases to perform instance-based typhoon analysis and prediction
• Extracting the eigenvectors of the whole typhoon image collection
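Extracting the leading eigenvector of an image collection is the core step of PCA. The sketch below treats each image as a flat vector of pixel values and finds the dominant eigenvector of the covariance matrix by power iteration. It is an illustration only: the three-pixel "images" are invented, the helper names are hypothetical, and real image collections would call for a numerical library rather than pure Python.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of row vectors (one row per image)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_eigenvector(matrix, iterations=200):
    """Power iteration: repeatedly multiply and normalize.
    Assumes the start vector is not orthogonal to the dominant eigenvector."""
    d = len(matrix)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # zero matrix, no dominant direction
        v = [x / norm for x in w]
    return v

# Hypothetical "images" of three pixels each; the second pixel is always
# twice the first and the third never changes.
images = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
component = leading_eigenvector(covariance_matrix(images))
print(component)
```

The recovered direction is proportional to (1, 2, 0), the single axis along which these sample images vary; projecting images onto the first few such directions gives the low-dimensional view mentioned on the earlier slide.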

Case Studies (5): LEAD
• Linked Environments for Atmospheric Discovery
– To accommodate the real-time, on-demand, and dynamically adaptive nature of mesoscale problems
• Complexities: vastly disparate, high-volume and high-bandwidth data
• Tremendous computational demands
– Used in accessing, preparing, assimilating, predicting, managing, mining/analyzing, and displaying a broad array of meteorological and related information
• Data Mining Solution Center: ITSC, The Univ. of Alabama in Huntsville, US
– http://datamining.itsc.uah.edu/index.jsp

ADaM
• The Algorithm Development and Mining (ADaM) system
– Component-architecture data mining toolkit
– For geophysical phenomena detection and feature extraction
• Applications
– Detecting tropical cyclones and estimating their maximum sustained wind speed
– Mesocyclone identification from radar
– Detecting cumulus cloud fields in GOES images

ADaM (cont’d)
– Mesoscale Convective Systems detection
• EOS Special Sensor Microwave/Imager (SSM/I) brightness temperature swaths from DMSP F13 and F14
– Rain detection using SSM/I
– Lightning detection using OLS
– Rain accumulation study

Case Studies (6): Rainfall Classification
University of Oklahoma, Norman
• To classify significant and interesting features within a two-dimensional spatial field of meteorological data
– Observed or predicted rainfall
• Data source
– Estimates of hourly accumulated rainfall
– Using radar and rain gauge data
• "Attributes" for classification
– Statistical parameters representing the distribution of rainfall amounts across the region
• Classification method
– Hierarchical cluster analysis
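Hierarchical cluster analysis over statistical attributes can be sketched with a small agglomerative procedure. The slide does not state which linkage or attributes the Oklahoma study used, so single linkage, the two attributes (mean and standard deviation of rainfall), and the sample values below are assumptions for illustration.

```python
import math

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering: start with one cluster per
    point and repeatedly merge the two closest clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical rainfall fields, each summarized as (mean, standard deviation) in mm.
fields = [(0.5, 0.2), (0.7, 0.3), (5.0, 2.0), (5.5, 2.2), (12.0, 6.0)]
groups = agglomerative(fields, n_clusters=3)
for group in groups:
    print(group)
```

Stopping at three clusters separates light, moderate, and heavy rainfall fields in this toy data; recording the merge order instead would give the full hierarchy.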

Many Others…
• JARtool Project (Fayyad et al., NASA)
• Identifying volcanoes on the surface of Venus from images transmitted by the Magellan spacecraft
• More than 30,000 high-resolution Synthetic Aperture Radar (SAR) images of the surface of Venus from different angles
• The obtained accuracy was about 80%

What can we learn from those scenarios?
• Data mining is a promising approach for meteorological analysis
• Very strong interaction between scientists and the knowledge discovery system is necessary
• The users define features of the meteorological phenomena based on their expert knowledge
• The system extracts the instances of such phenomena
• Then, further analysis of the phenomena is possible


Conclusions
• Data mining: discovering interesting patterns from large amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data mining and other steps
• Data mining can be performed in a variety of information repositories
• Data mining tasks: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

And now discussion



				