Posted on: 2/13/2012
Slide 1: CSE 300: Data Mining and its Application and Usage in Medicine
By Radhika

Slide 2: Data Mining and Medicine
- History: data mining grew up over the past 20 years alongside relational databases, as queries acquired more dimensions; medicine is among its earliest and most successful application areas
- Mid-1800s: London was hit by an infectious disease, and two theories competed:
  - Miasma theory: "bad air" propagated the disease
  - Germ theory: the disease was water-borne
- Advantages of data mining:
  - Discovers trends even when we do not understand the reasons behind them
  - Can also surface irrelevant patterns that confuse rather than enlighten, so results still need review
  - Protects against unaided human inference of patterns: quantifiable measures aid human judgment
- Data mining seeks patterns that are persistent and meaningful: Knowledge Discovery in Data (KDD)

Slide 3: The Future of Data Mining
- (Chart: the 10 biggest killers in the US)
- Data mining = the process of discovering interesting, meaningful, and actionable patterns hidden in large amounts of data

Slide 4: Major Issues in Medical Data Mining
- Heterogeneity of medical data:
  - Volume and complexity
  - Physician's interpretation
  - Poor mathematical categorization; no canonical form
  - Solution: standard vocabularies, interfaces between different data sources, integration, and design of electronic patient records
- Ethical, legal, and social issues:
  - Data ownership
  - Lawsuits
  - Privacy and security of human data
  - Expected benefits
  - Administrative issues

Slide 5: Why Data Preprocessing?
- Patient records consist of clinical and lab parameters and results of particular investigations, specific to given tasks
- Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
- Noisy: containing errors or outliers
- Inconsistent: containing discrepancies in codes or names
- Temporal: chronic diseases have parameters that change over time
- No quality data, no quality mining results! A data warehouse needs consistent integration of quality data
- In the medical domain, handling incomplete, inconsistent, or noisy data requires people with domain knowledge

Slide 6: What is Data Mining?
The KDD Process
- (Diagram) Databases → Data Cleaning / Data Integration → Data Warehouse → Data Selection → Task-relevant Data → Data Mining → Pattern Evaluation

Slide 7: From Tables and Spreadsheets to Data Cubes
- A data warehouse is based on a multidimensional data model that views data in the form of a data cube
- A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions
  - Dimension tables, such as item (item_name, brand, type) or time (day, week, month, quarter, year)
  - A fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables
- W. H. Inmon: "A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process."

Slide 8: Data Warehouse vs. Heterogeneous DBMS
- Data warehouse: update-driven, high performance
- Information from heterogeneous sources is integrated in advance and stored in warehouses for direct query and analysis
  - Does not contain the most current information
  - Query processing does not interfere with processing at local sources
  - Stores and integrates historical information
  - Supports complex multidimensional queries

Slide 9: Data Warehouse vs. Operational DBMS
- OLTP (on-line transaction processing)
  - Major task of traditional relational DBMS
  - Day-to-day operations: purchasing, inventory, banking, manufacturing, payroll, registration, accounting, etc.
- OLAP (on-line analytical processing)
  - Major task of a data warehouse system
  - Data analysis and decision making
- Distinct features (OLTP vs. OLAP):
  - User and system orientation: customer vs. market
  - Data contents: current, detailed vs. historical, consolidated
  - Database design: ER + application vs. star + subject
  - View: current, local vs. evolutionary, integrated
  - Access patterns: update vs. read-only but complex queries

Slide 10: (figure only)

Slide 11: Why Separate Data Warehouse?
- High performance for both systems:
  - DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
  - Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
- Different functions and different data:
  - Missing data: decision support requires historical data which operational DBs do not typically maintain
  - Data consolidation: decision support requires consolidation (aggregation, summarization) of data from heterogeneous sources
  - Data quality: different sources typically use inconsistent data representations, codes, and formats which have to be reconciled

Slide 12: (figure only)

Slide 13: (figure only)

Slide 14: Typical OLAP Operations
- Roll up (drill-up): summarize data by climbing up a hierarchy or by dimension reduction
- Drill down (roll down): reverse of roll-up; from a higher-level summary to a lower-level summary or detailed data, or introducing new dimensions
- Slice and dice: project and select
- Pivot (rotate): reorient the cube for visualization; 3D to a series of 2D planes
- Other operations:
  - Drill across: involving (across) more than one fact table
  - Drill through: through the bottom level of the cube to its back-end relational tables (using SQL)

Slide 15: (figure only)

Slide 16: (figure only)

Slide 17: Multi-Tiered Architecture
- (Diagram) Data sources (operational DBs, other sources) → Extract / Transform / Load / Refresh (monitor and integrator, with metadata) → Data storage (data warehouse, data marts) → OLAP engine (OLAP server) → Front-end tools (query, reports, analysis, data mining)

Slide 18: Steps of a KDD Process
- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features; dimensionality/variable reduction; invariant representation
- Choosing the functions of data mining: summarization, classification, regression, association, clustering.
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge

Slide 19: Common Techniques in Data Mining
- Predictive data mining (the most important in practice):
  - Classification: relate one set of variables in the data to response variables
  - Regression: estimate some continuous value
- Descriptive data mining:
  - Clustering: discovering groups of similar instances
  - Association rule extraction (variables/observations)
  - Summarization of group descriptions

Slide 20: Leukemia
- Different types of cells look very similar
- Given a number of samples (patients), can we diagnose the disease accurately? Predict the outcome of treatment? Recommend the best treatment based on previous treatments?
- Solution: data mining on micro-array data
  - 38 training patients, 34 testing patients, ~7000 patient attributes
  - 2 classes: Acute Lymphoblastic Leukemia (ALL) vs. Acute Myeloid Leukemia (AML)

Slide 21: Clustering / Instance-Based Learning
- Uses specific instances to perform classification rather than general IF-THEN rules
- Nearest-neighbor classifier: among the most studied algorithms for medical purposes
- Clustering: partitioning a data set into several groups (clusters) such that
  - Homogeneity: objects belonging to the same cluster are similar to each other
  - Separation: objects belonging to different clusters are dissimilar to each other.
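The nearest-neighbor classifier mentioned above can be sketched in a few lines of Python. This is a minimal 1-NN illustration; the euclidean and nearest_neighbor helpers are hypothetical names, and the toy (glucose, BMI) values and labels are invented for illustration, not taken from the deck's datasets:

```python
import math

def euclidean(a, b):
    # Straight-line distance between two equal-length feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbor(train, query):
    # train: list of (feature_vector, label) pairs.
    # Returns the label of the training instance closest to the query.
    best = min(train, key=lambda pair: euclidean(pair[0], query))
    return best[1]

# Toy data: (glucose, BMI) -> diabetic? (values are illustrative only)
train = [((90, 22.0), "no"), ((160, 34.0), "yes"),
         ((110, 25.5), "no"), ((150, 31.0), "yes")]
print(nearest_neighbor(train, (155, 33.0)))  # closest instance is (160, 34.0) -> "yes"
```

For k > 1 neighbors one would take a majority vote among the k closest instances instead of the single best match.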
- Three elements:
  - The set of objects
  - The set of attributes
  - A distance measure

Slide 22: Measuring the Dissimilarity of Objects
- Goal: find the best matching instance
- Distance function: measures the dissimilarity between a pair of data objects
- Things to consider:
  - The function is usually very different for interval-scaled, boolean, nominal, ordinal, and ratio-scaled variables
  - Weights should be associated with different variables based on the application and data semantics
- The quality of a clustering result depends on both the distance measure adopted and its implementation

Slide 23: Minkowski Distance
- Minkowski distance (a generalization):
  d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q), q > 0
- If q = 2, d is the Euclidean distance
- If q = 1, d is the Manhattan distance
- Example: Xi = (1, 7), Xj = (7, 1): q = 2 gives d = sqrt(72) ≈ 8.49; q = 1 gives d = 6 + 6 = 12

Slide 24: Binary Variables
- A contingency table for binary data (object i vs. object j):

                Object j
                1      0      sum
  Object i  1   a      b      a+b
            0   c      d      c+d
          sum   a+c    b+d    p

- Simple matching coefficient: d(i, j) = (b + c) / (a + b + c + d)

Slide 25: Dissimilarity between Binary Variables
- Example:

            A1  A2  A3  A4  A5  A6  A7
  Object 1   1   0   1   1   1   0   0
  Object 2   1   1   1   0   0   0   1

                Object 2
                1      0      sum
  Object 1  1   2      2      4
            0   2      1      3
          sum   4      3      7

  d(O1, O2) = (2 + 2) / (2 + 2 + 2 + 1) = 4/7

Slide 26: The k-Means Algorithm
- Initialization: arbitrarily choose k objects as the initial cluster centers (centroids)
- Iterate until no change:
  - For each object Oi:
    - Calculate the distances between Oi and the k centroids
    - (Re)assign Oi to the cluster whose centroid is closest to Oi
  - Update the cluster centroids based on the current assignment

Slide 27: k-Means Clustering Method
- (Figure: four scatter plots showing the objects, the current cluster means, reassignment of objects to the nearest mean, and relocation of the means)

Slide 28: Dataset
- Data set from the UCI repository: http://kdd.ics.uci.edu/
- 768 female Pima Indians evaluated for diabetes
- After data cleaning: 392 data entries

Slide 29: Hierarchical Clustering
- Groups observations based on
dissimilarity
- Compacts the database into "labels" that represent the observations
- Measures of similarity/dissimilarity:
  - Euclidean distance
  - Manhattan distance
- Types of clustering:
  - Single link
  - Average link
  - Complete link

Slide 30: Hierarchical Clustering: Comparison
- (Figure: the same six points clustered by single-link, complete-link, average-link, and centroid distance)

Slide 31: Compare Dendrograms
- (Figure: dendrograms over points 1 to 6 for single-link, complete-link, average-link, and centroid distance)

Slide 32: Which Distance Measure is Better?
- Each method has both advantages and disadvantages; the choice is application-dependent
- Single-link:
  - Can find irregular-shaped clusters
  - Sensitive to outliers
- Complete-link, average-link, and centroid distance:
  - Robust to outliers
  - Tend to break large clusters
  - Prefer spherical clusters

Slide 33: Dendrogram from the Dataset
- Minimum spanning tree through the observations (single link)
- The single observation that is last to join the cluster is the patient whose blood pressure is in the bottom quartile, skin thickness is in the bottom quartile, and BMI is in the bottom half
- Her insulin value, however, was the largest, and she is a 59-year-old diabetic

Slide 34: Dendrogram from the Dataset
- Maximum dissimilarity between observations in one cluster when compared to another (complete link)

Slide 35: Dendrogram from the Dataset
- Average dissimilarity between observations in one cluster when compared to another (average link)

Slide 36: Supervised versus Unsupervised Learning
- Supervised learning (classification):
  - Supervision: training data (observations, measurements, etc.)
    are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering):
  - Class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the task is to establish the existence of classes or clusters in the data

Slide 37: Classification and Prediction
- Derive models that can use patient-specific information to aid clinical decision making
- Requires an a priori decision on the predictors and the variables to predict; there is no method to find predictors that are not present in the data
- Numeric response: least squares regression
- Categorical response: classification trees, neural networks, support vector machines
- Decision models: prognosis, diagnosis, and treatment planning; can be embedded in clinical information systems

Slide 38: Least Squares Regression
- Find a linear function of the predictor variables that minimizes the sum of squared differences with the response
- A supervised learning technique
- In our dataset: predict insulin from glucose and BMI

Slide 39: Decision Trees
- Decision tree:
  - Each internal node tests an attribute
  - Each branch corresponds to an attribute value
  - Each leaf node assigns a classification
- ID3 algorithm:
  - Uses training objects with known class labels to classify testing objects
  - Ranks attributes with the information gain measure
  - Minimal height: the least number of tests needed to classify an object
- Used in commercial tools, e.g. Clementine
- ASSISTANT deals with medical datasets:
  - Handles incomplete data
  - Discretizes continuous variables
  - Prunes unreliable parts of the tree
  - Classifies data

Slide 40: Decision Trees
- (figure only)

Slide 41: Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm):
  - Attributes are categorical (if continuous-valued, they are discretized in advance)
  - The tree is constructed in a top-down recursive divide-and-conquer manner
  - At the start, all training examples are at the root
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
  - Examples are partitioned recursively based on the selected attributes

Slide 42: Training
Dataset

  ID   Age    BMI     Hereditary  Vision     Risk of Condition X
  P1   <=30   high    no          fair       no
  P2   <=30   high    no          excellent  no
  P3   >40    high    no          fair       yes
  P4   31…40  medium  no          fair       yes
  P5   31…40  low     yes         fair       yes
  P6   31…40  low     yes         excellent  no
  P7   >40    low     yes         excellent  yes
  P8   <=30   medium  no          fair       no
  P9   <=30   low     yes         fair       yes
  P10  31…40  medium  yes         fair       yes
  P11  <=30   medium  yes         excellent  yes
  P12  >40    medium  no          excellent  yes
  P13  >40    high    yes         fair       yes
  P14  31…40  medium  no          excellent  no

Slide 43: Construction of a Decision Tree for "Condition X"

  Age? [P1..P14] (Yes: 9, No: 5)
  - <=30 [P1, P2, P8, P9, P11] (Yes: 2, No: 3) → test Hereditary
    - no  [P1, P2, P8] (Yes: 0, No: 3) → NO
    - yes [P9, P11]    (Yes: 2, No: 0) → YES
  - 31…40 [P4, P5, P6, P10, P14] (Yes: 3, No: 2) → test Vision
    - excellent [P6, P14]     (Yes: 0, No: 2) → NO
    - fair      [P4, P5, P10] (Yes: 3, No: 0) → YES
  - >40 [P3, P7, P12, P13] (Yes: 4, No: 0) → YES

Slide 44: Entropy and Information Gain
- S contains s_i tuples of class C_i for i = 1, ..., m
- Information measure: the information required to classify an arbitrary tuple
  I(s1, s2, ..., sm) = - sum_{i=1..m} (s_i / s) * log2(s_i / s)
- Entropy of attribute A with values {a1, a2, ..., av}:
  E(A) = sum_{j=1..v} ((s_1j + ... + s_mj) / s) * I(s_1j, ..., s_mj)
- Information gained by branching on attribute A:
  Gain(A) = I(s1, s2, ..., sm) - E(A)

Slide 45: Entropy and Information Gain
- Select the attribute with the highest information gain (or greatest entropy reduction)
- Such an attribute minimizes the information needed to classify the samples

Slide 46: Rule Induction
- IF conditions THEN conclusion (e.g., CN2)
- Concept description:
  - Characterization: provides a concise and succinct summarization of a given collection of data
  - Comparison: provides descriptions comparing two or more collections of data
- Training set, testing set
- Imprecise; predictive accuracy = P / (P + N)

Slide 47: Example Used in a Clinic
- Hip arthroplasty: the trauma surgeon predicts the patient's long-term clinical status after surgery
- Outcome evaluated during follow-ups for 2 years
- 2 modeling techniques: naïve Bayesian classifier and decision trees
- Bayesian classifier:
  - P(outcome = good) = 11/20 = 0.55 (11 of 20 outcomes were good)
  - P(outcome = bad) = 9/20 = 0.45
  - The probability gets updated as more attributes are considered:
    - P(timing = good | outcome = good) = 9/11 ≈ 0.82
    - P(timing = good | outcome = bad) = 5/9 ≈ 0.56

Slide 48: Nomogram
- (figure only)

Slide 49: Bayesian Classification
- Bayesian classifier vs.
decision tree:
  - Decision tree: predicts the class label
  - Bayesian classifier: a statistical classifier; predicts class membership probabilities
- Based on Bayes' theorem; estimates the posterior probability
- Naïve Bayesian classifier:
  - A simple classifier that assumes attribute independence
  - High speed when applied to large databases
  - Comparable in performance to decision trees

Slide 50: Bayes' Theorem
- Let X be a data sample whose class label is unknown
- Let Hi be the hypothesis that X belongs to a particular class Ci
- P(Hi) is the class prior probability that X belongs to class Ci
  - Can be estimated as ni/n from the training data, where n is the total number of training samples and ni is the number of training samples of class Ci
- Bayes' theorem:
  P(Hi | X) = P(X | Hi) * P(Hi) / P(X)

Slide 51: More Classification Techniques
- Neural networks:
  - Similar to the pattern recognition properties of biological systems
  - Most frequently used: multi-layer perceptrons
    - Input (with bias) connected by weights to hidden and output layers
  - Backpropagation neural networks
- Support vector machines:
  - Separate the database into mutually exclusive regions
  - Transform to another problem space via kernel functions (dot products)
  - The output for new points is predicted by their position
- Comparison with classification trees:
  - It is not possible to know which features, or combination of features, most influence a prediction

Slide 52: Multilayer Perceptrons
- Apply non-linear transfer functions to weighted sums of inputs
- Werbos algorithm
- Random initial weights; training set and testing set

Slide 53: Support Vector Machines
- 3 steps:
  - Support vector creation
  - The maximal distance between points is found
  - A perpendicular decision boundary is placed; some points are allowed to be misclassified
- Pima Indian data with X1 (glucose) and X2 (BMI)

Slide 54: What is Association Rule Mining?
- Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories
- Example:

  PatientID  Conditions
  1          High LDL, Low HDL, High BMI, Heart Failure
  2          High LDL, Low HDL, Heart Failure, Diabetes
  3          Diabetes
  4          High LDL, Low HDL, Heart Failure
  5          High BMI, High LDL, Low HDL, Heart Failure

- Example association rule: {High LDL, Low HDL} → {Heart Failure}
- Interpretation: people who have high LDL ("bad" cholesterol) and low HDL ("good" cholesterol) are at higher risk of heart failure.

Slide 55: Association Rule Mining
- Market basket analysis: items bought by the same groups of customers are placed together
- Healthcare: understanding associations among patients with demands for similar treatments and services
- Goal: find items for which the joint probability of occurrence is high
- The "basket" is a set of binary-valued variables
- Results form association rules, augmented with support and confidence

Slide 56: Association Rule Mining
- Association rule: an implication of the form X → Y, where X and Y are itemsets and X ∩ Y = ∅; D is the set of transactions
- Rule evaluation metrics:
  - Support (s): the fraction of transactions that contain both X and Y
    s = (# trans containing X ∪ Y) / (# trans in D) = P(X ∪ Y)
  - Confidence (c): measures how often items in Y appear in transactions that contain X
    c = (# trans containing X ∪ Y) / (# trans containing X) = P(Y | X)

Slide 57: The Apriori Algorithm
- Start with the most frequent 1-itemsets; include only those items that pass the support threshold
- Use the frequent 1-itemsets to generate candidate 2-itemsets, and so on
- Stop when the threshold is not satisfied by any candidate itemset

  L1 = {frequent 1-itemsets};
  for (k = 1; Lk != ∅; k++) do
      Candidate generation: Ck+1 = candidates generated from Lk;
      Candidate counting: for each transaction t in the database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with support >= min_sup;
  return ∪k Lk;

Slide 58: Apriori-Based Mining
- Example (min_sup = 0.5, i.e. 2 of 4 transactions):

  Database D:
    TID  Items
    10   a, c, d
    20   b, c, e
    30   a, b, c, e
    40   b, e

  1-candidates (scan D): a:2, b:3, c:3, d:1, e:3
  Frequent 1-itemsets:   a:2, b:3, c:3, e:3
  2-candidates:          ab, ac, ae, bc, be, ce
  Counting (scan D):     ab:1, ac:2, ae:1, bc:2, be:3, ce:2
  Frequent 2-itemsets:   ac:2, bc:2, be:3, ce:2
  3-candidates:          bce
  Counting (scan D):     bce:2
  Frequent 3-itemsets:   bce:2

Slide 59: Principal Component Analysis
- Principal components (PCs):
  - With a large number of variables, it is highly likely that some subsets of the variables are strongly correlated with each other
  - Reduce the number of variables while retaining the variability in the dataset
  - PCs are linear combinations of the variables in the database
  - The variance of each PC is maximized: display as much of the spread of the original data as possible
  - PCs are orthogonal to each other: minimize the overlap between the variables
  - Each component is normalized so that its sum of squares is unity: easier for mathematical analysis
  - Number of PCs < number of variables
  - Associations are found; a small number of PCs explains a large amount of the variance
- Example: 768 female Pima Indians evaluated for diabetes
  - Number of times pregnant, two-hour oral glucose tolerance test (OGTT) plasma glucose, diastolic blood pressure, triceps skin fold thickness, two-hour serum insulin, BMI, diabetes pedigree function, age, diabetes onset within the last 5 years

Slide 60: PCA Example
- (figure only)

Slide 61: National Cancer Institute
- CancerNet: http://www.nci.nih.gov
  - CancerNet for Patients and the Public
  - CancerNet for Health Professionals
  - CancerNet for Basic Researchers
- CancerLit

Slide 62: Conclusion
- About ¾ billion people's medical records are electronically available
- Data mining in medicine is distinct from other fields due to the nature of the data: heterogeneous, with ethical, legal, and social constraints
- The most commonly used technique is classification and prediction, with different techniques applied in different cases
- Association rules describe the data in the database
- Medical data mining can be the most rewarding despite its difficulty

Slide 63: Thank you!!!
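The Apriori loop shown above can be sketched in Python. This is a minimal illustration (the apriori function name and signature are my own choices); the toy database matches the four-transaction example with min_sup = 0.5:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    # transactions: list of sets of items; min_sup: fraction of transactions
    n = len(transactions)
    threshold = min_sup * n
    # Frequent 1-itemsets: items occurring in at least threshold transactions
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= threshold]
    all_frequent = []
    while current:
        all_frequent.extend(current)
        # Candidate generation: join frequent k-itemsets into (k+1)-itemsets
        k = len(next(iter(current)))
        candidates = {a | b for a, b in combinations(current, 2)
                      if len(a | b) == k + 1}
        # Candidate counting: keep candidates meeting the support threshold
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) >= threshold]
    return all_frequent

# Toy database: TID 10: a,c,d; 20: b,c,e; 30: a,b,c,e; 40: b,e
D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
frequent = apriori(D, 0.5)  # a, b, c, e, ac, bc, be, ce, bce
```

Note that this sketch lets the counting pass filter all joined candidates; classical Apriori additionally prunes any candidate with an infrequent subset before counting, which saves work on larger databases.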