SDSC Summer Institute 2005
TUTORIAL Data Mining for Scientific Applications
Peter Shin Hector Jasso San Diego Supercomputer Center UCSD
SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO
Overview
Introduction to data mining
• Definitions, concepts, applications
Machine learning methods for KDD
• Supervised learning – classification
• Unsupervised learning – clustering
Cyberinfrastructure for data mining
• SDSC resources – hardware and software
Survey of Applications at SKIDL
Break
Hands-on tutorial with IBM Intelligent Miner and SKIDLkit
• Targeted Marketing
• Microarray analysis (leukemia dataset)
Data Mining Definition
The search for interesting patterns and models, in large data collections,
using statistical and machine learning methods,
and high-performance computational infrastructure.
Key point: applications are data-driven and compute-intensive
Analysis Levels and Infrastructure
• Informal methods – graphs, plots, visualizations, exploratory data analysis (yes – Excel is a data mining tool)
• Advanced query processing and OLAP – e.g., National Virtual Observatory (NVO)
• Machine learning (compute-intensive statistical methods)
  • Supervised – classification, prediction
  • Unsupervised – clustering
Computational infrastructure needed at all levels – collections management, information integration, high-performance database systems, web services, grid services, scientific workflows, the global IT grid, observing systems
The Case for Data Mining: Data Reality
• Deluge from new sources
  • Remote sensing
  • Microarray processing
  • Wireless communication
  • Simulation models
  • Instrumentation – microscopes, telescopes
  • Digital publishing
  • Federation of collections
• “5 exabytes (5 million terabytes) of new information was created in 2002” (source: UC Berkeley researchers Peter Lyman and Hal Varian)
• This is the result of a recent paradigm shift: from hypothesis-driven data collection to data mining
• Data destination: Legacy archives and independent collection activities
Knowledge Discovery Process
Knowledge
Application/Decision Support
Presentation/Visualization
Analysis/Modeling
Management/Federation/Warehousing
Processing/Cleansing/Corrections
Data
Collection
“Data is not information; information is not knowledge; knowledge is not wisdom.” Gary Flake, Principal Scientist & Head of Yahoo! Research Labs, July 2004.
Characteristics of Data Mining Applications
• Data:
  • Lots of data, numerous sources
  • Noisy – missing values, outliers, interference
  • Heterogeneous – mixed types, mixed media
  • Complex – scale, resolution, temporal, spatial dimensions
• Relatively little domain theory, few quantitative causal models
• Lack of valid ground truth
• Advice: don’t choose problems that have all these characteristics …
Scientific vs. Commercial Data Mining
Goals:
• Science – Theories: Need for insight and theory-based models, interpretable model structures, generate domain rules or causal structures, support for theory development
• Commercial – Profits: black boxes OK
Types of data:
• Science – Images, sensors, simulations
• Commercial – Transaction data
• Both – Spatial and temporal dimensions, heterogeneous
Trend – Common IT (information technology) tools fit both enterprises
• Database systems (Oracle, DB2, etc.), integration tools (Information Integrator), web services (Blue Titan, .NET)
• This is good!
Introduction to Machine Learning
Basic machine learning theory
Concepts and feature vectors
Supervised and unsupervised learning
Model development
• training and testing methodology, model validation, overfitting, confusion matrices
Survey of algorithms
• Decision Trees classification
• k-means clustering
• Hierarchical clustering
• Bayesian networks and probabilistic inference
• Support vector machines
Basic Machine Learning Theory
Basic inductive learning hypothesis:
• Having a large number of observations, we can approximate the rule that describes how the data was generated, and thus generate a model (using some algorithm)
No Free Lunch Theorem:
• There is no ultimate algorithm: In the absence of prior information about the problem, there are no reasons to prefer one learning algorithm over another.
Conclusion:
• There is no problem-independent “best” learning system. Formal theory and algorithms are not enough.
• Machine learning is an empirical subject.
Concepts are described as feature vectors
Example: vehicles
• Has wheels
• Runs on gasoline
• Carries people
• Flies
• Weighs less than 500 pounds
Boolean feature vectors for vehicles
• car254 [ 1 1 1 0 0 ]
• motorcycle14 [ 1 1 1 0 1 ]
• airplane132 [ 1 1 1 1 0 ]
Easy to generalize to complex data types:
• Number of wheels
• Fuel type
• Carrying capacity
• Flies
• Weight
car254 [ 4, gas, 6, 0, 2000 ]
motorcycle14 [ 2, gas, 2, 0, 400 ]
airplane132 [ 10, jetfuel, 110, 1, 35000 ]
Most machine learning algorithms expect feature vectors, stored in text files or databases
Suggestions:
• Identify the target concept
• Organize your data to fit feature vector representation
• Design your database schemas to support generation of data in this format
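As a rough illustration of the "feature vectors stored in text files" suggestion, the sketch below writes the three vehicle vectors to CSV text and reads them back. The column names and the choice of CSV are assumptions made for this example; the slides do not prescribe a file format.

```python
import csv
import io

# Column names are illustrative assumptions; the values are the vehicle vectors from the slide.
FEATURES = ["id", "wheels", "fuel", "capacity", "flies", "weight"]
ROWS = [
    ("car254", 4, "gas", 6, 0, 2000),
    ("motorcycle14", 2, "gas", 2, 0, 400),
    ("airplane132", 10, "jetfuel", 110, 1, 35000),
]


def to_csv(rows):
    """Serialize feature vectors as CSV text, one vector per line."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(FEATURES)
    writer.writerows(rows)
    return buf.getvalue()


def from_csv(text):
    """Parse CSV text back into a list of {feature name: value} dicts."""
    return list(csv.DictReader(io.StringIO(text)))


csv_text = to_csv(ROWS)
vectors = from_csv(csv_text)
```

Note that values come back as strings after parsing, so numeric features need an explicit conversion before they are handed to a learning algorithm.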
Supervised vs. Unsupervised Learning
Supervised – Each feature vector belongs to a class (label). Labels are given externally, and algorithms learn to predict the label of new samples/observations.
Unsupervised – Finds structure in the data, by clustering similar elements together. No previous knowledge of classes needed.
Model development
Training and testing
• Train, then Test, then Apply
Model validation
• Hold-out validation (2/3, 1/3 splits)
• Cross validation, simple and n-fold (reuse)
• Bootstrap validation (sample with replacement)
• Jackknife validation (leave one out)
• When possible, hide a subset of the data until train-test is complete.
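The validation schemes listed above can be sketched in a few lines of plain Python. This is a minimal illustration, not the tutorial's tooling: the function names, the fixed random seed, and the way folds are formed are all assumptions made for the example.

```python
import random


def holdout_split(samples, test_fraction=1 / 3, seed=0):
    """Hold-out validation: shuffle, then keep about 2/3 for training and 1/3 for testing."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def k_fold_splits(samples, k):
    """n-fold cross validation: every sample is used for testing exactly once."""
    folds = [samples[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]


def bootstrap_sample(samples, seed=0):
    """Bootstrap validation: draw len(samples) items with replacement."""
    rng = random.Random(seed)
    return [rng.choice(samples) for _ in samples]


data = list(range(9))
train, test = holdout_split(data)
folds = list(k_fold_splits(data, 3))
leave_one_out = list(k_fold_splits(data, len(data)))  # jackknife: k equals the sample count
resample = bootstrap_sample(data)
```

Leave-one-out is written here as the special case of k-fold where k equals the number of samples, which keeps the sketch short.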
Avoid overfitting
[Figure: accuracy (0% to 100%) versus algorithm steps (0 to 8) for the Train and Test curves, with the “Optimal Depth” point and the “Overfitting” region marked]
Confusion matrices
                    Predicted
                    Negative   Positive
Actual   Negative   124        15
         Positive   8          84

Accuracy = (124 + 84) / (124 + 15 + 8 + 84): “proportion of predictions correct”
True positive rate = 84 / (8 + 84): “proportion of positive cases correctly identified”
False positive rate = 15 / (124 + 15): “proportion of negative cases incorrectly classified as positive”
True negative rate = 124 / (124 + 15): “proportion of negative cases correctly identified”
False negative rate = 8 / (8 + 84): “proportion of positive cases incorrectly classified as negative”
Precision = 84 / (15 + 84): “proportion of predicted positive cases that were correct”
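The rates above follow directly from the four cells of the confusion matrix. A minimal sketch of the arithmetic, where the function and key names are chosen for this example only:

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute the rates on the slide from the four confusion-matrix cells."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "true_positive_rate": tp / (fn + tp),
        "false_positive_rate": fp / (tn + fp),
        "true_negative_rate": tn / (tn + fp),
        "false_negative_rate": fn / (fn + tp),
        "precision": tp / (fp + tp),
    }


# Cell counts from the slide: actual negatives are 124 + 15, actual positives are 8 + 84.
metrics = binary_metrics(tn=124, fp=15, fn=8, tp=84)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The true positive and false negative rates sum to one, as do the true negative and false positive rates, which is a quick sanity check on any confusion matrix.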
Classification – Decision Tree
Ecosystem   Annual Precipitation
Desert      2
Forest      120
Forest      104
Desert      5
Forest      116
Prairie     63
First split: Precipitation > 63?
• NO: Desert (2), Desert (5), Prairie (63)
• YES: Forest (120), Forest (104), Forest (116)
Second split, on the NO branch: Precipitation > 5?
• NO: Desert (2), Desert (5)
• YES: Prairie (63)
Learned Model
If (Precip > 63) then “Forest”
else If (Precip > 5) then “Prairie”
else “Desert”

Confusion matrix (rows: actual, columns: predicted)
     D   F   P
D    2   0   0
F    0   3   0
P    0   0   1
Classification accuracy on training data is 100%
Testing Set Results
Learned Model
IF (Precip > 63) then Forest
Else If (Precip > 5) then Prairie
Else Desert

Test Data
True      Precipitation   Predicted
Desert    8               Prairie
Forest    100             Forest
Prairie   55              Prairie
Desert    4               Desert
Forest    116             Forest
Prairie   72              Forest

Confusion matrix (rows: actual, columns: predicted)
     D   F   P
D    1   0   1
F    0   2   0
P    0   1   1

Result: Accuracy 67%. Model shows overfitting, generalizes poorly
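The training and testing results can be reproduced with a short script. This sketch hard-codes the learned rule and the two small data sets as (true ecosystem, annual precipitation) pairs read from the slides; the helper names are assumptions for the example.

```python
TRAIN = [("Desert", 2), ("Forest", 120), ("Forest", 104),
         ("Desert", 5), ("Forest", 116), ("Prairie", 63)]
TEST = [("Desert", 8), ("Forest", 100), ("Prairie", 55),
        ("Desert", 4), ("Forest", 116), ("Prairie", 72)]
LABELS = ("Desert", "Forest", "Prairie")


def learned_model(precip):
    """The two-split decision tree learned from the training data."""
    if precip > 63:
        return "Forest"
    if precip > 5:
        return "Prairie"
    return "Desert"


def accuracy(data, model):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(model(precip) == actual for actual, precip in data) / len(data)


def confusion(data, model):
    """Nested dict indexed as matrix[actual][predicted]."""
    matrix = {a: {p: 0 for p in LABELS} for a in LABELS}
    for actual, precip in data:
        matrix[actual][model(precip)] += 1
    return matrix


train_acc = accuracy(TRAIN, learned_model)
test_acc = accuracy(TEST, learned_model)
test_matrix = confusion(TEST, learned_model)
```

The thresholds 63 and 5 sit exactly on training values, which is why a test desert with precipitation 8 and a test prairie with precipitation 72 are both misclassified.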
Pruning to improve generalization
Pruned Decision Tree
IF (Precip < 60) then Desert
Else, [P(Forest) = .75] & [P(Prairie) = .25]

Precipitation < 60?
• YES: Desert (2), Desert (5)
• NO: Forest (120), Forest (104), Forest (116), Prairie (63)
Decision Trees Summary
• Simple to understand
• Works with mixed data types
• Heuristic search sensitive to local minima
• Models non-linear functions
• Handles classification and regression
• Many successful applications
• Readily available tools
Overview of Clustering
• Definition:
  • Clustering is the discovery of classes
  • Unlabeled examples => unsupervised learning.
• Survey of Applications
  • Grouping of web-visit data, clustering of genes according to their expression values, grouping of customers into distinct profiles
• Survey of Methods
  • k-means clustering
  • Hierarchical clustering
  • Expectation Maximization (EM) algorithm
  • Gaussian mixture modeling
• Cluster analysis
  • Concept (class) discovery
  • Data compression/summarization
  • Bootstrapping knowledge
Clustering – k-Means
Precipitation Temperature
8 71
62 49 17 32
81 70
63 45 76 49
[Figure sequence: scatter plots of Temperature (30 to 90) versus Precipitation (0 to 80)]
Clustering – k-Means
Cluster   Temperature   Precipitation   Ecosystem
C1        70 – 85       0 – 25          Desert
C2        35 – 60       25 – 55         Prairie
C3        50 – 80       50 – 80         Forest
Using k-means
• Requires a priori knowledge of ‘k’
• The final outcome depends on the initial choice of the k means, so results can be inconsistent between runs
• Sensitive to outliers, which can skew the means of their clusters
• Favors spherical clusters – clusters may not match domain boundaries
• Requires real-valued features
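To make the points above concrete, here is a minimal k-means sketch in plain Python. The six (precipitation, temperature) points and the starting centroids are hypothetical values invented for this example, not the data from the slides. The caller must supply k through the initial centroids, and a different starting choice can lead to a different final clustering.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points]


def update(points, labels, centroids):
    """Move each centroid to the mean of its members (unchanged if it has none)."""
    new = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
        else:
            new.append(old)
    return new


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = update(points, labels, centroids)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


# Hypothetical (precipitation, temperature) observations, two per ecosystem.
points = [(5, 80), (10, 75), (40, 45), (45, 50), (70, 60), (75, 65)]
initial = [(5, 80), (40, 45), (70, 60)]  # k = 3, chosen by hand
centroids, labels = k_means(points, initial)
```

Because each update step takes a plain mean, a single extreme point pulls its centroid toward it, which is the outlier sensitivity noted above.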
Cyberinfrastructure for Data Mining
• Resources – hardware and software (analysis tools and middleware)
• Policies – allocating resources to the scientific community. Challenges to the traditional supercomputer model. Requirements for interactive and real-time analysis resources.
NSF TeraGrid
Building Integrated National CyberInfrastructure
• Prototype for CyberInfrastructure
  • Ubiquitous computational resources
  • Plug-in compatibility
• National Reach:
  • SDSC, NCSA, CIT, ANL, PSC
• High Performance Network:
  • 40 Gb/s backbone, 30 Gb/s to each site
• Over 20 Teraflops compute power
• Over 1 PB Online Storage
• 8.9 PB Archival Storage
SDSC is Data-Intensive Center
SDSC Machine Room Data Architecture
Philosophy: enable SDSC configuration to serve the grid as Data Center
• 0.5 PB disk
• 6 PB archive
• 1 GB/s disk-to-tape
• Optimized support for DB2/Oracle
[Figure: machine room data architecture diagram with the following labeled components]
• LAN (multiple GbE, TCP/IP)
• WAN (30 Gb/s)
• SAN (2 Gb/s, SCSI); SCSI/IP or FC/IP
• Power 4, Power 4 DB, Sun F15K, Blue Horizon, Linux Cluster (4 TF)
• Database Engine, Data Miner, Vis Engine
• Local Disk (50 TB), FC GPFS Disk (100 TB), FC Disk Cache (400 TB); 200 MB/s per controller
• HPSS; Silos and Tape, 6 PB, 1 GB/s disk to tape, 32 tape drives, 30 MB/s per drive
Blue Horizon: 1152-processor IBM SP, 1.7 Teraflops
HPSS: over 600 TB data stored
SDSC IBM Regatta - DataStar
• 100+ TB Disk
• Numerous fast CPUs
• 64 GB of RAM per node
• DB2 v8.x ESE
• IBM Intelligent Miner
• SAS Enterprise Miner
• Platform for high-performance database, data mining, comparative IT studies …
Data Mining Tools used at SDSC
• SAS Enterprise Miner (Protein crystallization – JCSG)
• IBM Intelligent Miner (Protein crystallization – JCSG, Corn Yield – Michigan State University, Security logs – SDSC)
• CART (Protein crystallization – JCSG)
• Matlab SVM package (TeraBridge health monitoring – UCSD Structural Engineering Department, North Temperate Lakes Monitoring – LTER)
• PyML (Text Mining – NSDL, Hyperspectral data – LTER)
• SKIDLkit by SDSC (Microarray analysis – UCSD Cancer Center, Hyperspectral data – LTER)
• SVMlight (Hyperspectral data – LTER)
• LSI by Telcordia (Text Mining – NSDL)
• CoClustering by Fair Isaac (Text Mining – NSDL)
• Matlab Bayes Net package
• WEKA
SKIDLkit
• Toolkit for feature selection and classification
  • Filter methods
  • Wrapper methods
  • Data normalization
  • Feature selection
  • Support Vector Machine & Naïve Bayesian
  • Clustering
• http://daks.sdsc.edu/skidl
• Will use it in the hands-on demo…
Survey of Applications at SDSC
• Text mining the NSDL (National Science Digital Library) collection
• Sensor networks for bridge monitoring (with Structural Engineering Dept., UCSD)
• Spatio-temporal Analysis of 9-1-1 Call Stream Data
• Hyperspectral remote sensing data for groundcover classification (with Long Term Ecological Research Network – LTER)
• Microarray analysis for tumor detection (with UCSD Cancer Center)
Application: Text Mining the National Science Digital Library (NSDL) Collection
Project Goal
Assist the educators and students in finding relevant information by categorizing the materials by scientific discipline and grade level using contextual information
General Approach
Based on various metadata in the NSDL community, study the contents of the associated documents and apply machine learning algorithms
Source of Vocabulary
• Eisenhower National Clearinghouse
  • 8417 documents with labels specifying intended grade level
  • Documents are intended for the teachers
  • Selected subset of about 1350 documents that could be associated with an AAAS category
    • Kindergarten-2nd
    • 3rd-5th
    • 6th-8th
    • 9th-12th
Processing
• Identify the words used in the kindergarten-2nd grade levels by the teachers
• Identify the new words used in each of the AAAS categories
• Characterize the growth of the vocabulary
• Characterize the complexity of the new terms (number of words from prior grade levels used to explain the new word).
Characterization of Learning
AAAS Level         # of documents   Total words   % new words   Complexity
Kindergarten-2nd   150              2907                        1
3rd-5th            220              4155          30%           3
6th-8th            430              6681          37%           5
9th-12th           540              10226         35%           10
Characterization of Learning
• Learn about 33% more words each AAAS category
• This is an exponential growth and must eventually saturate
• Complexity grows by about a factor of 2 per AAAS category
• In later grades, it takes more of your old vocabulary to interpret new words
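The growth figures can be checked against the table. The sketch below assumes that “% new words” means the share of a level's total vocabulary that was absent at the previous level; that reading is an assumption, chosen because it matches the published percentages to within a point. The per-level growth factor is taken as a geometric mean.

```python
LEVELS = ["Kindergarten-2nd", "3rd-5th", "6th-8th", "9th-12th"]
TOTAL_WORDS = [2907, 4155, 6681, 10226]
COMPLEXITY = [1, 3, 5, 10]


def percent_new(totals):
    """Share of each level's vocabulary that is new relative to the previous level."""
    return [100 * (cur - prev) / cur for prev, cur in zip(totals, totals[1:])]


def mean_growth_factor(values):
    """Geometric-mean growth factor per step between the first and last value."""
    steps = len(values) - 1
    return (values[-1] / values[0]) ** (1 / steps)


new_word_pct = percent_new(TOTAL_WORDS)                  # roughly 30, 38, 35
complexity_factor = mean_growth_factor(COMPLEXITY)       # roughly 2.15 per level
for level, pct in zip(LEVELS[1:], new_word_pct):
    print(f"{level}: {pct:.1f}% new words")
```

Under that reading, the average of the three percentages is close to the 33% quoted above, and the complexity factor of about 2 per category follows from the cube root of 10.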
Text Mining the NSDL
Processing pipeline
1. Variously formatted documents
2. Strip formatting
3. Pick out content words using “stop lists”
4. Stemming
5. Discard words that appear in every document or only one
6. Word count, term weighting
7. Generate term-document matrix
8. Various retrieval schemes (LSI, classification, or clustering modules)
Query: for a list of words, get docs with highest score
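A toy version of this pipeline fits in a few functions. Everything specific here is an assumption made for illustration: the tiny stop list, the crude suffix-stripping stemmer, the TF-IDF style weighting, and the four sample documents. The NSDL project's actual components (for example LSI) are not reproduced.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in", "is", "to"}  # tiny illustrative stop list


def stem(word):
    """Crude suffix stripping, a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Strip formatting, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_matrix(docs):
    """Return the vocabulary and one {term: weight} row per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    n = len(docs)
    # Discard terms that appear in every document or in only one.
    vocab = sorted(t for t, k in doc_freq.items() if 1 < k < n)
    rows = [{t: c[t] * math.log(n / doc_freq[t]) for t in vocab if c[t]}
            for c in counts]
    return vocab, rows


def query(words, rows):
    """Rank document indices by the summed weight of the query terms."""
    terms = [stem(w.lower()) for w in words]
    scores = [sum(row.get(t, 0.0) for t in terms) for row in rows]
    return sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)


docs = [
    "Plants need light and water",
    "Water cycles in the ocean",
    "Light waves and sound waves",
    "Ocean plants",
]
vocab, rows = build_matrix(docs)
ranking = query(["ocean", "plants"], rows)
```

The query simply sums the weights of the matching terms per document; the retrieval schemes named in the pipeline would replace that scoring step.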
Application: Sensor Stream Mining
Sensor Networks for Bridge Monitoring
• Task:
  • Identify which pier is damaged based on the data stream fed by the sensors at the span middles.
  • Apply multi-resolution technique
• Assumption:
  • The lower end of a pier can be damaged (location of plastic hinge)
  • There is only one damaged pier at a time.
[Figure: bridge diagram with the sensors, the span middles, and the piers labeled]
Application: Spatiotemporal Analysis of 9-1-1 Call Stream Data
Project Goal
Perform spatiotemporal analysis on 9-1-1 call data to improve:
• Overall emergency planning
• Real-time emergency decision support
General Approach
Correlate call data “signatures” (unusual spatiotemporal trends) with state-wide and local events: earthquakes, forest fires, weather events
Study Area and Dates: San Francisco Bay Area, April 2005
[Figure: map of the San Francisco area]
First Analysis: “Call Rhythm”
Application: Classification of Land Types Using Hyperspectral Data
Study Area
[Figure: map of New Mexico showing the Sevilleta National Wildlife Refuge study area]
Previously Available Image/Map Types
Relief Shaded Map
Landsat Image
New image type: NASA’s JPL (Jet Propulsion Lab) AVIRIS (Airborne Visible/Infrared Imaging Spectrometer) scans, “hyperspectral images”
Scanned from an altitude of 20 km, 10 km flightline
201 bands of electromagnetic information per pixel, spanning infrared to ultraviolet
Hyperspectral Scans for Study Area
Study Area…
Complete AVIRIS scan of the Sevilleta Wildlife Refuge, 20 m per pixel
Data set
Results
Support Vector Machine, one-against-one, wavelet transformation: 97.1 % accuracy on test data
Application: Microarray Analysis for Tumor Detection
Microarray Analysis for Tumor Detection
Characteristics of the Data:
• 88 prostate tissue samples:
  • 37 labeled “no tumor”
  • 51 labeled “tumor”
• Each tissue with 10,600 gene expression measurements
• Collected by the UCSD Cancer Center, analyzed at SDSC
Tasks:
• Build model to classify new, unseen tissues as either “no tumor” or “tumor”
• Identify key genes to determine their biological significance in the process of cancer
Simple classifier based on expression levels for two genes
[Figure: samples plotted by the expression levels of the two genes; legend: No Tumor, Tumor]
Results
Break
Hands-on Analysis
• Part I:
  • Decision Tree classification using IBM Intelligent Miner
  • Using classification models to make rational decisions
  • Peter Shin
• Part II:
  • Feature selection, Naïve Bayes Classifiers and Support Vector Machines using SKIDLkit
  • Classification of microarray data
  • Hector Jasso
Data Mining Example: Targeting Customers
• Problem Characteristics:
1. We make $50 profit on a sale of $200 shoes.
2. A preliminary study shows that people who make over $50k will buy the shoes at a rate of 5% when they receive the brochure.
3. People who make less than $50k will buy the shoes at a rate of 1% when they receive the brochure.
4. It costs $1 to send a brochure to a potential customer.
5. In general, we do not know whether a person will make more than $50k or not.
Available Information
• Variable Description
• Please refer to the hand-out.
Possible Marketing Plans
• We will send out 30,000 brochures.
• Plan A: Ignore data and randomly send brochures
• Plan B: Use data mining to target a specific group with high probabilities of responding
Plan A
• Strategy:
  • Send brochures to anyone
• Cost of sending one brochure = $1
• Probability of Response
  • 1% of the population who make <= $50k (76%)
  • 5% of the population who make > $50k (24%)
  • Resulting in: (1% * 76% + 5% * 24%) = 1.96% final response rate
• Earnings
  • Expected profit from one brochure = (Probability of response * profit – Cost of a brochure) = (1.96% * $50 - $1) = -$0.02
  • Expected earning = Expected profit from one brochure * number of brochures sent = -$0.02 * 30000 = -$600
Plan B
• Strategy:
  • Send out brochures only to: married, college or above, managerial/professional/sales/tech. support/protective service/armed forces, age >= 28.5, hours_per_week >= 31
• Cost of sending one brochure = $1
• Probability of Response
  • 1% of the population who make <= $50k (20.6%)
  • 5% of the population who make > $50k (79.4%)
  • Resulting in: (1% * 20.6% + 5% * 79.4%) = 4.176% final response rate
• Earnings
  • Expected profit from one brochure = (Probability of response * profit – Cost of a brochure) = (4.176% * $50 - $1) = $1.088
  • Expected earning = Expected profit from one brochure * number of brochures sent = $1.088 * 30000 = $32,640
Comparison of Two Plans
• Expected earning from plan A
• -$600
• Expected earning from plan B
• $32,640
• Net Difference
• $32,640 – (-$600) = $33,240
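The two plans differ only in the fraction of recipients who make more than $50k, so the comparison reduces to one formula. A minimal sketch of that arithmetic; the function names are chosen for this example, and the break-even helper is an added illustration rather than part of the original analysis.

```python
def response_rate(frac_high, rate_low=0.01, rate_high=0.05):
    """Blend the two buying rates by the share of recipients earning over $50k."""
    return rate_low * (1 - frac_high) + rate_high * frac_high


def expected_earnings(frac_high, brochures=30000, profit=50.0, cost=1.0):
    """(probability of response * profit - cost of a brochure) * number of brochures."""
    return (response_rate(frac_high) * profit - cost) * brochures


def break_even_fraction(rate_low=0.01, rate_high=0.05, profit=50.0, cost=1.0):
    """Share of high earners at which a brochure exactly pays for itself."""
    return (cost / profit - rate_low) / (rate_high - rate_low)


plan_a = expected_earnings(0.24)    # random mailing: 24% make > $50k
plan_b = expected_earnings(0.794)   # targeted group: 79.4% make > $50k
print(f"Plan A: {plan_a:.2f}, Plan B: {plan_b:.2f}, difference: {plan_b - plan_a:.2f}")
print(f"Break-even share of high earners: {break_even_fraction():.0%}")
```

With these numbers a mailing breaks even when 25% of recipients make more than $50k, which is why the random mailing at 24% loses a small amount.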
Acknowledgements
• Original source Census Bureau (1994)
• Data processed and donated by Ron Kohavi and Barry Becker (Data Mining and Visualization, SGI)
Data Mining Example: Microarray Analysis
• “Labeled” cases (38 bone marrow samples: 27 ALL, 11 AML; each contains 7129 gene expression values)
• Train model (using Neural Networks, Support Vector Machines, Bayesian nets, etc.)
• Outputs of training: Model, key genes
• 34 new unlabeled bone marrow samples → Model → AML/ALL
Microarray Data Challenges to Machine Learning Algorithms:
• Few samples for analysis (38 labeled)
• Extremely high-dimensional data (7129 gene expression values per sample)
• Noisy data
• Complex underlying mechanisms, not fully understood
Some genes are more useful than others for building classification models
Example: genes 36569_at and 36495_at are useful
[Figure: plot of the expression values of the two genes, with AML and ALL samples marked]
Some genes are more useful than others for building classification models
Example: genes 37176_at and 36563_at not useful
Importance of Feature (Gene) Selection
• Majority of genes are not directly related to leukemia
• Having a large number of features enhances the model’s flexibility, but makes it prone to overfitting
• Noise and the small number of training samples make this even more likely
• Some types of models, like Neural Networks, do not scale well with many features
With 7129 genes, how do we choose the best?
• Distance metrics to capture class separation
• Rank genes according to distance metric score
• Choose the top n ranked genes
[Figure: example genes with a HIGH score and a LOW score]
Distance Metrics
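• Tamayo’s Relative Class Separation

$$\frac{\bar{x}_1 - \bar{x}_2}{s_1 + s_2}$$

• t-test

$$\frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

• Bhattacharyya distance

$$\frac{1}{4}\,\frac{(\bar{x}_2 - \bar{x}_1)^2}{s_1^2 + s_2^2} + \frac{1}{2}\log\frac{s_1^2 + s_2^2}{2\,s_1 s_2}$$

(Here $\bar{x}_i$, $s_i$, and $n_i$ are read as the mean, standard deviation, and number of samples of class $i$.)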
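The three scores can be computed per gene and used to rank genes, as in the filter step described earlier. The sketch below reads the symbols as class means, sample standard deviations, and class sizes, which is an assumption about the slide's notation. The gene names and expression values are hypothetical.

```python
import math
from statistics import mean, stdev


def relative_class_separation(a, b):
    """(mean_1 - mean_2) / (s_1 + s_2)"""
    return (mean(a) - mean(b)) / (stdev(a) + stdev(b))


def t_score(a, b):
    """(mean_1 - mean_2) / sqrt(s_1^2 / n_1 + s_2^2 / n_2)"""
    return (mean(a) - mean(b)) / math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))


def bhattacharyya(a, b):
    """1/4 (mean_2 - mean_1)^2 / (s_1^2 + s_2^2) + 1/2 log((s_1^2 + s_2^2) / (2 s_1 s_2))"""
    sa, sb = stdev(a), stdev(b)
    spread = sa ** 2 + sb ** 2
    return 0.25 * (mean(b) - mean(a)) ** 2 / spread + 0.5 * math.log(spread / (2 * sa * sb))


def top_genes(expression, labels, n, score=t_score):
    """Rank genes by the absolute value of a two-class score and keep the top n."""
    first = sorted(set(labels))[0]

    def gene_score(values):
        a = [v for v, label in zip(values, labels) if label == first]
        b = [v for v, label in zip(values, labels) if label != first]
        return abs(score(a, b))

    return sorted(expression, key=lambda g: gene_score(expression[g]), reverse=True)[:n]


# Hypothetical expression values for two genes over six labeled samples.
labels = ["AML", "AML", "AML", "ALL", "ALL", "ALL"]
expression = {
    "geneA": [1, 2, 3, 7, 8, 9],   # classes well separated
    "geneB": [4, 5, 6, 5, 4, 6],   # classes overlap completely
}
best = top_genes(expression, labels, n=1)
```

Each score needs at least two samples per class and a non-zero spread, so a production version would have to guard against constant genes.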
A gene with an undetected outlier could score artificially high
Score jumps from 0.00651 to 0.042566
How Support Vector Machines (SVMs) work
[Figure sequence: SVM illustration with the support vectors and the margin labeled]
Characteristics of SVMs
• Scales well to high-dimensional problems
• Fast convergence to solution
• Has well-defined statistical properties
Naïve Bayesian Classifiers
[Figure: network diagram with the output variable X (Class) connected to the input variables w1, w2, w3, …, wn]
Characteristics of Naïve Bayesian Classifiers
• Scales well to high-dimensional problems
• Fast to compute
• Based on Bayesian probability theory
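A naïve Bayesian classifier treats the input variables as independent given the class, so training only needs per-class, per-feature statistics. The sketch below assumes Gaussian likelihoods for real-valued features, which the slides do not specify, and uses hypothetical two-gene expression values.

```python
import math
from statistics import mean, stdev


def fit(samples, labels):
    """Estimate a prior plus a (mean, std) pair per feature for every class."""
    model = {}
    for cls in set(labels):
        rows = [s for s, label in zip(samples, labels) if label == cls]
        stats = [(mean(col), max(stdev(col), 1e-6)) for col in zip(*rows)]
        model[cls] = (len(rows) / len(samples), stats)
    return model


def log_gaussian(x, mu, sigma):
    """Log density of a normal distribution with mean mu and std sigma."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)


def predict(model, sample):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    def log_posterior(cls):
        prior, stats = model[cls]
        return math.log(prior) + sum(
            log_gaussian(x, mu, sigma) for x, (mu, sigma) in zip(sample, stats))
    return max(model, key=log_posterior)


# Hypothetical expression levels of two genes for six labeled samples.
samples = [(1.0, 5.0), (1.2, 4.8), (0.8, 5.1), (3.0, 1.0), (3.2, 1.1), (2.9, 0.9)]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
model = fit(samples, labels)
prediction_1 = predict(model, (1.1, 5.0))
prediction_2 = predict(model, (3.1, 1.0))
```

Training is one pass over the data per class and feature, which is why the method stays fast even with thousands of input variables.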
Approaches to Feature Selection
Filter Approach
Input Features → Feature Selection by Distance Metric Score → Train Model → Model
Wrapper Approach
Input Features → Feature Selection Search → Feature Set → Train Model → Model
(feedback from Train Model to Feature Selection Search: importance of features given by the model)
Software Available: SKIDLkit
• Developed at SDSC:
• http://daks.sdsc.edu/skidl
• Implements:
  • Filter and wrapper approaches
  • Naïve Bayesian Net and SVM
  • t-test, Prediction Strength, Bhattacharyya distance
  • Outlier detection
Leukemia Dataset
• Collected by Whitehead Institute Center for Genome Research
• Made available at:
  • http://www-genome.wi.mit.edu/cgi-bin/cancer/datasets.cgi
  • Under “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression”
• Also available as a sample dataset in SKIDLkit