Data Mining: A Tutorial-Based Primer

Statistical Techniques
Chapter 10

• This chapter uses MS Excel and Weka.

10.1 Linear Regression Analysis

\[ f(x_1, x_2, x_3, \ldots, x_n) = a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots + a_n x_n + c \]   (Equation 10.1)

• A supervised technique that generalizes a set of numeric data by creating a mathematical equation relating one or more input variables to a single output variable.
• With linear regression we attempt to model variation in a dependent variable as a linear combination of one or more independent variables.
• Linear regression is appropriate when the relationship between the dependent and the independent variables is nearly linear.

Simple Linear Regression (slope-intercept form)

\[ y = ax + b \]   (Equation 10.2)

Simple Linear Regression (least squares criterion)

\[ b = \frac{\sum xy}{\sum x^2}, \qquad a = \frac{\sum y}{n} - b\,\frac{\sum x}{n} \]   (Equation 10.3)

(In Equation 10.3, b multiplies x and so plays the role of the slope, while a is the intercept.)

Multiple Linear Regression with Excel

Try to estimate the value of a building.

Table 10.1 • District Office Building Data

  Space   Offices   Entrances   Age   Value
  2310    2         2           20    $142,000
  2333    2         2           12    $144,000
  2356    3         1.5         33    $151,000
  2379    3         2           43    $150,000
  2402    2         3           53    $139,000
  2425    4         2           23    $169,000
  2448    2         1.5         99    $126,000
  2471    2         2           34    $142,900
  2494    3         3           23    $163,000
  2517    4         4           55    $169,000
  2540    2         3           22    $149,000

A Regression Equation for the District Office Building Data

\[ \text{Value} = 27.64\,\text{Space} + 12529.77\,\text{Offices} + 2553.21\,\text{Entrances} - 234.24\,\text{Age} + 52317.83 \]

Table 10.2 • Regression Statistics for the Office Building Data

                                  Age            Entrances   Offices     Space      Constant
  Coefficient                     -234.2371645   2553.211    12529.77    27.64139   52317.83
  Standard error                  13.26801148    530.6692    400.0668    5.429374   12237.36
  R^2 / std. error of estimate    0.996747993    970.5785    #N/A        #N/A       #N/A
  F statistic / degrees of freedom 459.7536742   6           #N/A        #N/A       #N/A
  Regression SS / residual SS     1732393319     5652135     #N/A        #N/A       #N/A

• How accurate are the results?
  – Use a scatterplot diagram together with the line given by the formula.
  – Which independent variables are linearly related to the dependent variable? Use the regression statistics.
  – Coefficient of determination: a value of 1 means there is no difference between the actual values (in the table) and the computed values of the dependent variable. It represents the correlation between actual and computed values.
  – Standard error for the estimate of the dependent variable.
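As a rough check on the accuracy measures listed above, the sketch below applies the regression equation to the Table 10.1 data and computes the coefficient of determination and the standard error of the estimate. It is only an illustration: the coefficients are the rounded values from the regression equation, the negative sign on Age is taken from the Age coefficient in Table 10.2, and the function names are made up for this example, so the output only approximates the Excel statistics.

```python
import math

# Table 10.1 rows: (space, offices, entrances, age, value)
BUILDINGS = [
    (2310, 2, 2, 20, 142000),
    (2333, 2, 2, 12, 144000),
    (2356, 3, 1.5, 33, 151000),
    (2379, 3, 2, 43, 150000),
    (2402, 2, 3, 53, 139000),
    (2425, 4, 2, 23, 169000),
    (2448, 2, 1.5, 99, 126000),
    (2471, 2, 2, 34, 142900),
    (2494, 3, 3, 23, 163000),
    (2517, 4, 4, 55, 169000),
    (2540, 2, 3, 22, 149000),
]

# Rounded coefficients for space, offices, entrances, age (assumed signs, see lead-in).
COEF = (27.64, 12529.77, 2553.21, -234.24)
INTERCEPT = 52317.83


def predict(space, offices, entrances, age):
    """Apply the multiple regression equation to one building."""
    inputs = (space, offices, entrances, age)
    return sum(c * x for c, x in zip(COEF, inputs)) + INTERCEPT


def fit_statistics(rows):
    """Return (coefficient of determination, standard error of the estimate)."""
    actual = [r[4] for r in rows]
    predicted = [predict(*r[:4]) for r in rows]
    mean = sum(actual) / len(actual)
    ss_total = sum((y - mean) ** 2 for y in actual)
    ss_resid = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    dof = len(rows) - (len(COEF) + 1)  # 11 instances - 5 estimated values = 6
    return 1 - ss_resid / ss_total, math.sqrt(ss_resid / dof)


r2, se = fit_statistics(BUILDINGS)
print(f"R^2 = {r2:.4f}, standard error = {se:.1f}")
```

A coefficient of determination close to 1 and a standard error that is small relative to the building values are what the bullets above describe as signs of a good fit on the training data.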
F Statistic for the Regression Analysis

• Used to establish whether the coefficient of determination is significant.
  – Compare the computed F statistic (459.75) with the critical value from a one-tailed F table in a statistics text, using v1 = 4 (the number of independent variables) and v2 = 11 - 5 = 6 (the number of instances minus the number of variables).
• The regression equation is able to correctly determine the assessed values of office buildings that are part of the training data.

[Figure 10.1 A simple linear regression equation. Axes: Floor Space (horizontal), Assessed Value (vertical)]

Regression Trees

Figure 10.2 A generic model tree

  Test 1
    <   Test 2
          <   LRM1
          >=  LRM2
    >=  Test 3
          <   LRM3
          >=  Test 4
                <   LRM4
                >=  LRM5

Regression Tree

• Essentially a decision tree whose leaf nodes hold numeric values.
• The value at an individual leaf node is the numeric average of the output attribute for all instances passing through the tree to that leaf node position.
• Regression trees are more accurate than linear regression when the data is nonlinear.
• But they are more difficult to interpret.
• Sometimes regression trees are combined with linear regression to form model trees.

Model Trees

• Regression tree + linear regression.
• Each leaf node represents a linear regression equation instead of an average value.
• Model trees simplify regression trees by reducing the number of nodes in the tree.
• A more complex tree indicates a less linear relationship between the dependent and independent variables.

[Figure 10.3 A model tree for the deer hunter dataset (output attribute yes). Internal nodes test Amt, TotCost, and Trips; the leaves are the linear regression models LRM1 through LRM9]

10.2 Logistic Regression

• Using linear regression to model problems whose observed outcome is restricted to two values (e.g. yes/no) is seriously flawed. The value restriction placed on the output variable is not observed by the regression equation: linear regression produces a straight line that is unbounded at both ends.
• Therefore the linear equation must be transformed to restrict the output to [0,1]. The regression equation can then be thought of as producing the probability of occurrence or nonoccurrence of a measured event.
• Logistic regression applies a logarithmic transformation.

Transforming the Linear Regression Model

Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance.

• 1 denotes observation of one class (yes).
• 0 denotes observation of another class (no).
• The model therefore gives the conditional probability p(y = 1 | x) of seeing the class associated with y = 1 (yes), given the values in the feature vector x.

The Logistic Regression Model

\[ p(y = 1 \mid x) = \frac{e^{ax + c}}{1 + e^{ax + c}} \]   (Equation 10.7)

where e is the base of natural logarithms, often denoted as exp.

• The coefficients in ax + c are determined with an iterative method that tries to minimize the magnitude of the sum of the logarithms of the predicted probabilities.
• Convergence occurs when the logarithmic summation is close to 0 or when it does not change from iteration to iteration.

[Figure 10.4 The logistic regression equation. P(y = 1 | x) plotted against x for x from -6 to 6]

Logistic Regression: An Example

Credit card example: the CreditCardPromotionNet file, with Life Insurance Promotion as the output attribute.

\[ ax + c = 0.0001\,\text{Income} + 19.827\,\text{CreditCardIns} - 8.314\,\text{Sex} - 0.415\,\text{Age} + 17.691 \]

CreditCardIns and Sex are the most influential attributes.
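To make the transformation concrete, the following sketch evaluates the logistic model for the credit card example. It is an illustration under stated assumptions: the coefficient signs used here (negative for Sex and Age, positive elsewhere) are an assumption chosen because they reproduce the computed probabilities of Table 10.3 to within rounding, income is entered in dollars, and the names `COEF`, `CONSTANT`, and `probability` are hypothetical.

```python
import math

# Rounded coefficients of ax + c for the credit card example (signs assumed, see lead-in).
COEF = {"income": 0.0001, "credit_card_ins": 19.827, "sex": -8.314, "age": -0.415}
CONSTANT = 17.691


def logistic(z):
    """Equation 10.7: e^z / (1 + e^z), which maps any real z into (0, 1)."""
    return math.exp(z) / (1.0 + math.exp(z))


def probability(income, credit_card_ins, sex, age):
    """p(y = 1 | x) for one instance; income is given in dollars."""
    z = (COEF["income"] * income
         + COEF["credit_card_ins"] * credit_card_ins
         + COEF["sex"] * sex
         + COEF["age"] * age
         + CONSTANT)
    return logistic(z)


# Instance 1 of the example data: income 40K, no credit card insurance, sex = 1, age 45.
p_first = probability(income=40000, credit_card_ins=0, sex=1, age=45)
# A new instance: income 35K, credit card insurance = 1, sex = 0, age 39.
p_new = probability(income=35000, credit_card_ins=1, sex=0, age=39)
print(f"{p_first:.3f} {p_new:.3f}")
```

Because the published coefficients are rounded, the probabilities produced this way differ slightly in the third decimal place from the tabulated ones, but each instance falls on the same side of 0.5.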
Table 10.3 • Logistic Regression: Dependent Variable = Life Insurance Promotion

  Instance   Income   Credit Card   Sex   Age   Life Insurance   Computed
                      Insurance                 Promotion        Probability
  1          40K      0             1     45    0                0.007
  2          30K      0             0     40    1                0.987
  3          40K      0             1     42    0                0.024
  4          30K      1             1     43    1                1.000
  5          50K      0             0     38    1                0.999
  6          20K      0             0     55    0                0.049
  7          30K      1             1     35    1                1.000
  8          20K      0             1     27    0                0.584
  9          30K      0             1     43    0                0.005
  10         30K      0             0     41    1                0.981
  11         40K      0             0     43    1                0.985
  12         20K      0             1     29    1                0.380
  13         50K      0             0     39    1                0.999
  14         40K      0             1     55    0                0.000
  15         20K      1             0     19    1                1.000

Logistic Regression

• Classify a new instance using logistic regression:
  – Income = 35K
  – Credit card insurance = 1
  – Sex = 0
  – Age = 39
  – P(y = 1 | x) = 0.999

10.3 Bayes Classifier

• A supervised classification technique for a categorical output attribute.
• Assumes all input variables are independent and of equal importance.

\[ P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E)} \]   (Equation 10.9)

where H is the hypothesis to be tested and E is the evidence associated with H.

• P(H | E) is the likelihood of H given the evidence E (H is the dependent variable representing a predicted class).
• P(E | H) is the conditional probability of the evidence E given that H is true (computed from the training data).
• P(H) is the a priori probability: the probability of H before the presentation of evidence E (computed from the training data).

Bayes Classifier: An Example

Credit card promotion data set; Sex is the output attribute.

Table 10.4 • Data for Bayes Classifier

  Magazine    Watch       Life Insurance   Credit Card   Sex
  Promotion   Promotion   Promotion        Insurance
  Yes         No          No               No            Male
  Yes         Yes         Yes              Yes           Female
  No          No          No               No            Male
  Yes         Yes         Yes              Yes           Male
  Yes         No          Yes              No            Female
  No          No          No               No            Female
  Yes         Yes         Yes              Yes           Male
  No          No          No               No            Male
  Yes         No          No               No            Male
  Yes         Yes         Yes              No            Female

The Instance to be Classified

  Magazine Promotion = Yes
  Watch Promotion = Yes
  Life Insurance Promotion = No
  Credit Card Insurance = No
  Sex = ?
There are two hypotheses: sex = female and sex = male.

Table 10.5 • Counts and Probabilities for Attribute Sex (Evidence E)

  Attribute                  Sex      Yes   No   Ratio: yes/total   Ratio: no/total
  Magazine Promotion         Male     4     2    4/6                2/6
                             Female   3     1    3/4                1/4
  Watch Promotion            Male     2     4    2/6                4/6
                             Female   2     2    2/4                2/4
  Life Insurance Promotion   Male     2     4    2/6                4/6
                             Female   3     1    3/4                1/4
  Credit Card Insurance      Male     2     4    2/6                4/6
                             Female   1     3    1/4                3/4

Computing the Probability for Sex = Male

\[ P(\text{sex} = \text{male} \mid E) = \frac{P(E \mid \text{sex} = \text{male})\,P(\text{sex} = \text{male})}{P(E)} \]   (Equation 10.10)

Conditional Probabilities for Sex = Male

  P(magazine promotion = yes | sex = male) = 4/6
  P(watch promotion = yes | sex = male) = 2/6
  P(life insurance promotion = no | sex = male) = 4/6
  P(credit card insurance = no | sex = male) = 4/6
  P(E | sex = male) = (4/6)(2/6)(4/6)(4/6) = 8/81

The Probability for Sex = Male Given Evidence E

  P(sex = male | E) ≈ 0.0593 / P(E)

The Probability for Sex = Female Given Evidence E

  P(sex = female | E) ≈ 0.0281 / P(E)

Since P(sex = male | E) > P(sex = female | E), the instance is most likely a male credit card customer.

Zero-Valued Attribute Counts

A problem arises with the Bayes classifier when one of the counts is 0. To solve this problem, a small constant is added to the numerator and the denominator, so that n/d becomes

\[ \frac{n + (k)(p)}{d + k} \]   (Equation 10.12)

• k is a value between 0 and 1 (usually 1).
• p is an equal fractional part of the total number of possible values for the attribute; p is 0.5 for an attribute with 2 possible values.

Example (k = 1, p = 0.5):

  Without the correction: P(E | sex = female) = (3/4)(2/4)(1/4)(3/4) = 9/128
  With the correction:    P(E | sex = female) = (3.5/5)(2.5/5)(1.5/5)(3.5/5) ≈ 0.0735

Table 10.6 • Addition of Attribute Age to the Bayes Classifier Dataset

  Magazine    Watch       Life Insurance   Credit Card   Age   Sex
  Promotion   Promotion   Promotion        Insurance
  Yes         No          No               No            45    Male
  Yes         Yes         Yes              Yes           40    Female
  No          No          No               No            42    Male
  Yes         Yes         Yes              Yes           30    Male
  Yes         No          Yes              No            38    Female
  No          No          No               No            55    Female
  Yes         Yes         Yes              Yes           35    Male
  No          No          No               No            27    Male
  Yes         No          No               No            43    Male
  Yes         Yes         Yes              No            41    Female

Missing Data

With the Bayes classifier, missing data items are ignored.
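The Bayes classifier computation above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the rows are transcribed from Table 10.4, the function name `class_score` and its parameters are hypothetical, and the score returned is the numerator P(E | H) P(H), since P(E) is the same for both hypotheses.

```python
from fractions import Fraction

# Table 10.4 rows: (magazine, watch, life insurance, credit card insurance, sex)
ROWS = [
    ("Yes", "No", "No", "No", "Male"),
    ("Yes", "Yes", "Yes", "Yes", "Female"),
    ("No", "No", "No", "No", "Male"),
    ("Yes", "Yes", "Yes", "Yes", "Male"),
    ("Yes", "No", "Yes", "No", "Female"),
    ("No", "No", "No", "No", "Female"),
    ("Yes", "Yes", "Yes", "Yes", "Male"),
    ("No", "No", "No", "No", "Male"),
    ("Yes", "No", "No", "No", "Male"),
    ("Yes", "Yes", "Yes", "No", "Female"),
]


def class_score(evidence, label, k=0, p=Fraction(1, 2)):
    """Return P(E | H) * P(H) for H = (sex == label).

    k and p apply the zero-count correction (n + k*p) / (d + k);
    k = 0 leaves the raw ratios n/d unchanged.
    """
    members = [r for r in ROWS if r[-1] == label]
    score = Fraction(len(members), len(ROWS))  # prior P(H)
    for i, value in enumerate(evidence):
        n = sum(1 for r in members if r[i] == value)
        score *= (n + k * p) / (len(members) + k)
    return score


# Magazine = Yes, Watch = Yes, Life Insurance = No, Credit Card Insurance = No
E = ("Yes", "Yes", "No", "No")
male = class_score(E, "Male")
female = class_score(E, "Female")
print(float(male), float(female))
```

Using exact fractions keeps the intermediate values comparable with the ratios in Table 10.5; calling `class_score(E, "Female", k=1)` applies the zero-count correction with p = 0.5.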
Missing Data

• Example

Table 10.7 • Five Instances from the Credit Card Promotion Database

  Instance   Income Range   Magazine    Watch       Life Insurance   Sex
                            Promotion   Promotion   Promotion
  I1         40–50K         Yes         No          No               Male
  I2         25–35K         Yes         Yes         Yes              Female
  I3         40–50K         No          No          No               Male
  I4         25–35K         Yes         Yes         Yes              Male
  I5         50–60K         Yes         No          Yes              Female

Numeric Data

Probability density function (attribute values are assumed to be normally distributed):

\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / (2\sigma^2)} \]   (Equation 10.13)

where
  e = the exponential function
  μ = the class mean for the given numerical attribute
  σ = the class standard deviation for the attribute
  x = the attribute value

Numeric Data: An Example

• Magazine Promotion = Yes
• Watch Promotion = Yes
• Life Insurance Promotion = No
• Credit Card Insurance = No
• Age = 45
• Sex = ?
  – P(E | sex = male) = … × P(age = 45 | sex = male)
  – σ = 7.69, μ = 37, x = 45
  – P(age = 45 | sex = male) = 1/(…) = 0.03
  – P(sex = male | E) = 0.0018 / P(E)
  – P(sex = female | E) = 0.0016 / P(E)
  – The instance is classified as male.

10.4 Clustering Algorithms

Agglomerative Clustering

1. Place each instance into a separate partition.
2. Until all instances are part of a single cluster:
   a. Determine the two most similar clusters.
   b. Merge the clusters chosen into a single cluster.
3. Choose a clustering formed by one of the step 2 iterations as a final result.

Agglomerative Clustering: An Example

Table 10.8 • Agglomerative Clustering: First Iteration

        I1     I2     I3     I4     I5
  I1    1.00
  I2    0.20   1.00
  I3    0.80   0.00   1.00
  I4    0.40   0.80   0.20   1.00
  I5    0.40   0.60   0.20   0.40   1.00

Table 10.9 • Agglomerative Clustering: Second Iteration

           I1 I3   I2     I4     I5
  I1 I3    0.80
  I2       0.33    1.00
  I4       0.47    0.80   1.00
  I5       0.47    0.60   0.40   1.00

Agglomerative Clustering: Choosing the Final Clustering

The final step of the algorithm is to choose a final clustering from among all the clusterings formed. This requires heuristics.

• Technique 1: use the similarity measure that created the clusters, and compare the average within-cluster similarity with the overall similarity of all instances in the dataset (the domain similarity). This technique is best used to eliminate clusterings rather than to choose a final result.
• Technique 2: compare the within-cluster similarity of each cluster with the within-cluster similarities of pairwise-combined clusters in the cluster set, and look for the highest similarity. This technique is also best used to eliminate clusterings rather than to choose a final result.
• Technique 3: use the previous two techniques to eliminate some of the clusterings, then feed each remaining clustering to a rule generator. The clustering with the best defining rules is chosen.
• Technique 4: the Bayesian Information Criterion.

Conceptual Clustering

1. Create a cluster with the first instance as its only member.
2. For each remaining instance, take one of two actions at each tree level:
   a. Place the new instance into an existing cluster.
   b. Create a new concept cluster having the new instance as its only member.
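Returning to the agglomerative procedure, the sketch below reruns it on the five instances of Table 10.7. Two choices are assumptions rather than anything stated in the slides: instance similarity is taken to be the fraction of matching attribute values, and a candidate merge is scored by the average pairwise similarity of all instances in the merged cluster. They are used because together they reproduce the entries of Tables 10.8 and 10.9; the function names are illustrative.

```python
from itertools import combinations

# Table 10.7: income range, magazine, watch, life insurance, sex
INSTANCES = {
    "I1": ("40-50K", "Yes", "No", "No", "Male"),
    "I2": ("25-35K", "Yes", "Yes", "Yes", "Female"),
    "I3": ("40-50K", "No", "No", "No", "Male"),
    "I4": ("25-35K", "Yes", "Yes", "Yes", "Male"),
    "I5": ("50-60K", "Yes", "No", "Yes", "Female"),
}


def similarity(a, b):
    """Fraction of attributes on which instances a and b agree (assumed measure)."""
    x, y = INSTANCES[a], INSTANCES[b]
    return sum(u == v for u, v in zip(x, y)) / len(x)


def merged_score(c1, c2):
    """Average pairwise similarity over every instance in the union of c1 and c2."""
    pairs = list(combinations(c1 + c2, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def agglomerate():
    """Step 1: one cluster per instance. Step 2: merge the most similar pair until one remains."""
    clusters = [(name,) for name in INSTANCES]
    history = [list(clusters)]
    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: merged_score(*pair))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
        history.append(list(clusters))
    return history


history = agglomerate()
for step, clustering in enumerate(history):
    print(step, clustering)
```

The list `history` holds one clustering per iteration, which is the input that step 3 of the algorithm (choosing a final clustering) would work from. Ties, such as the 0.80 shared by the pairs I1/I3 and I2/I4, are broken here by taking the first pair encountered.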
Table 10.10 • Data for Conceptual Clustering

  Instance   Tails   Color   Nuclei
  I1         One     Light   One
  I2         Two     Light   Two
  I3         Two     Dark    Two
  I4         One     Dark    Three
  I5         One     Light   Two
  I6         One     Light   Two
  I7         One     Light   Three

Figure 10.5 A COBWEB-created hierarchy

The root node N has children N1, N2, and N4; N1 has children N3 and N5. The leaves hold I2 and I3 (N2), I4 and I7 (N4), I1 (N3), and I5 and I6 (N5). Node probabilities: P(N) = 7/7, P(N1) = 3/7, P(N2) = 2/7, P(N4) = 2/7, P(N3) = 1/3, P(N5) = 2/3.

                     N               N1              N2              N4              N3              N5
  Attribute Value    P(V|C) P(C|V)   P(V|C) P(C|V)   P(V|C) P(C|V)   P(V|C) P(C|V)   P(V|C) P(C|V)   P(V|C) P(C|V)
  Tails     One      .71    1.0      1.0    .6       0.0    0.0      1.0    .4       1.0    .33      1.0    .67
            Two      .29    1.0      0.0    0.0      1.0    1.0      0.0    0.0      0.0    0.0      0.0    0.0
  Color     Light    .71    1.0      1.0    .6       .5     .2       .5     .2       1.0    .33      1.0    .67
            Dark     .29    1.0      0.0    0.0      .5     .5       .5     .5       0.0    0.0      0.0    0.0
  Nuclei    One      .14    1.0      .33    1.0      0.0    0.0      0.0    0.0      1.0    1.0      0.0    0.0
            Two      .57    1.0      .67    .5       1.0    .5       0.0    0.0      0.0    0.0      1.0    1.0
            Three    .29    1.0      0.0    0.0      0.0    0.0      1.0    1.0      0.0    0.0      0.0    0.0

Expectation Maximization

1. Guess initial values for the five parameters.
2. Until a termination criterion is achieved:
   a. Use the probability density function for normal distributions to compute the cluster probability for each instance.
   b. Use the probability scores assigned to each instance in step 2(a) to re-estimate the parameters.

The EM Algorithm: An Example

Table 10.11 • An EM Clustering of Gamma-Ray Burst Data

                        Cluster 0   Cluster 1   Cluster 2
  # Instances           518         340         321
  Log Fluence   Mean    -5.6670     -4.8131     -6.3657
                SD      0.4088      0.5301      0.5812

  Remaining Log HR321 and Log T90 entries (Mean and SD): 0.0538 0.3018 0.2949 0.1939 0.5478 0.2766 -0.3794 0.4825 1.2709 0.4906 1.7159 0.3793

10.5 Heuristics or Statistics?

Query and Visualization Techniques

• Query tools
• OLAP tools
• Visualization tools

Machine Learning and Statistical Techniques

1. Statistical techniques typically assume an underlying distribution for the data whereas machine learning techniques do not.
2. Machine learning techniques tend to have a human flavor.
3. Machine learning techniques are better able to deal with missing and noisy data.
4. Most machine learning techniques are able to explain their behavior.
5. Statistical techniques tend to perform poorly with large-sized data.

Specialized Techniques
Chapter 11

11.1 Time-Series Analysis

Time-series problems: prediction applications with one or more time-dependent attributes.

An Example with Linear Regression

The Stock Index Dataset

Table 11.1 • Weekly Average Closing Prices for the Nasdaq and Dow Jones Industrial Average

  Week     Nasdaq    Dow        Nasdaq-1   Dow-1      Nasdaq-2   Dow-2
           Average   Average    Average    Average    Average    Average
  200003   4176.75   11413.28   3968.47    11587.96   3847.25    11224.10
  200004   4052.01   10967.60   4176.75    11413.28   3968.47    11587.96
  200005   4104.28   10992.38   4052.01    10967.60   4176.75    11413.28
  200006   4398.72   10726.28   4104.28    10992.38   4052.01    10967.60
  200007   4445.53   10506.68   4398.72    10726.28   4104.28    10992.38
  200008   4535.15   10121.31   4445.53    10506.68   4398.72    10726.28
  200009   4745.58   10167.38   4535.15    10121.31   4445.53    10506.68
  200010   4949.09   9952.52    4745.58    10167.38   4535.15    10121.31
  200011   4742.40   10223.11   4949.09    9952.52    4745.58    10167.38
  200012   4818.01   10937.36   4742.40    10223.11   4949.09    9952.52

Linear Regression Equations for the Stock Index Dataset

\[ \text{Nasdaq Average} = 1.279\,(\text{Nasdaq-1 Average}) - 0.330\,(\text{Nasdaq-2 Average}) + 179.297 \]

\[ \text{Dow Average} = 0.932\,(\text{Dow-1 Average}) - 0.303\,(\text{Dow-2 Average}) + 3967.093 \]

Table 11.2 • Actual and Predicted Nasdaq and Dow Closing Prices

         Nasdaq Average Close               Dow Average Close
  Week   Actual    Predicted   Error       Actual     Predicted   Error
  45     3258.61   3408.19     -149.58     10854.73   10954.14    -99.41
  46     3065.91   3477.37     -411.46     10638.36   10883.32    -244.96
  47     2851.70   3544.05     -692.35     10456.68   10793.97    -337.30
  48     2713.12   3602.48     -889.36     10494.16   10731.85    -237.69
  49     2615.75   3651.73     -1035.98    10560.95   10701.07    -140.12

A Neural Network Example

Table 11.3 • Week 49 Predicted Nasdaq Average Weekly Closing Prices

           Training Data Limited to Nasdaq Attributes   Training Data Includes Dow Jones Attributes
  Epochs   RMS     Predicted Value   Prediction Error   RMS     Predicted Value   Prediction Error
  10000    0.084   3359              744                0.035   2693              22
  20000    0.045   2598              -17                0.033   2652              37
  30000    0.032   2672              57                 0.033   2712              97

Categorical Attribute Prediction

Table 11.4 • Average Weekly Closings / Categorical Output

  Week     Nasdaq    Dow        Nasdaq-1   Dow-1      Nasdaq-2   Dow-2      Nasdaq      Dow
           Average   Average    Average    Average    Average    Average    Gain/Loss   Gain/Loss
           Close     Close      Close      Close      Close      Close
  200003   4176.75   11413.28   3968.47    11587.96   3847.25    11224.10   Gain        Loss
  200004   4052.01   10967.60   4176.75    11413.28   3968.47    11587.96   Loss        Loss
  200005   4104.28   10992.38   4052.01    10967.60   4176.75    11413.28   Gain        Gain
  200006   4398.72   10726.28   4104.28    10992.38   4052.01    10967.60   Gain        Loss
  200007   4445.53   10506.68   4398.72    10726.28   4104.28    10992.38   Gain        Loss
  200008   4535.15   10121.31   4445.53    10506.68   4398.72    10726.28   Gain        Loss
  200009   4745.58   10167.38   4535.15    10121.31   4445.53    10506.68   Gain        Gain
  200010   4949.09   9952.516   4745.58    10167.38   4535.15    10121.31   Gain        Loss
  200011   4742.40   10223.11   4949.09    9952.516   4745.58    10167.38   Loss        Gain
  200012   4818.01   10937.36   4742.40    10223.11   4949.09    9952.516   Loss        Gain

General Considerations

• Test and modify created models as new data becomes available.
• Try one or more data transformations if less than optimal results are obtained.
• Exercise caution when predicting future outcomes with training data having several predicted fields.
• Try a nonlinear model if a linear model offers poor results.
• Use unsupervised clustering to determine if input attribute values allow the output attribute to cluster into meaningful categories.
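The lagged attributes of the stock index dataset (Nasdaq-1, Nasdaq-2, and so on) can be derived mechanically from a single weekly series. The sketch below shows one way to do this and then applies the Nasdaq regression equation to a row. It is an illustration only: the series is read off Table 11.1, the helper names are hypothetical, and the minus sign on the two-week lag term is an assumption made because it keeps predictions in the range of the observed prices.

```python
# Weekly Nasdaq averages, oldest first; the first two values are the lag columns
# of the first row of Table 11.1, the rest is the Nasdaq Average column.
NASDAQ = [3847.25, 3968.47, 4176.75, 4052.01, 4104.28, 4398.72,
          4445.53, 4535.15, 4745.58, 4949.09, 4742.40, 4818.01]


def make_lagged_rows(series, lags=2):
    """Turn a series into rows of (current, lag-1, lag-2, ...)."""
    rows = []
    for t in range(lags, len(series)):
        rows.append((series[t], *[series[t - k] for k in range(1, lags + 1)]))
    return rows


def predict_nasdaq(lag1, lag2):
    """Nasdaq regression equation with rounded coefficients (assumed signs, see lead-in)."""
    return 1.279 * lag1 - 0.330 * lag2 + 179.297


rows = make_lagged_rows(NASDAQ)
for current, lag1, lag2 in rows[:3]:
    print(current, round(predict_nasdaq(lag1, lag2), 2))
```

Each row produced this way corresponds to one line of Table 11.1, so the same helper could feed a regression routine or a neural network with consistent training instances.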
11.2 Mining the Web

Web-Based Mining: General Issues

• Clickstreams
• Extended Common Log File Format
• Session Files
• User Sessions
• Pageviews
• Cookies

[Figure 11.1 A generic Web usage model: Web Server Logs → Data Preparation → Session File → Data Mining Algorithm(s) → Learner Model]

Data Mining for Web Site Evaluation

Sequence miners are special data mining programs able to discover frequently accessed Web pages that occur in the same order.

Data Mining for Personalization

[Figure 11.2 Creating usage profiles from session data: Session Data → Data Mining Algorithm(s) → Clusters → Usage Profiles]

[Figure 11.3 Hypertext link recommendations from usage profiles: Usage Profiles and User Navigational Activity → Recommendation Engine → Recommended Hypertext Links]

Data Mining for Web Site Adaptation

The index synthesis problem: given a Web site and a visitor access log, create new index pages containing collections of links to related but currently unlinked pages.

11.3 Mining Textual Data

• Train: Create an attribute dictionary.
• Filter: Remove common words.
• Classify: Classify new documents.
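The three text mining steps above can be illustrated with a very small sketch. Everything specific in it is an assumption made for the example: the stop-word list, the per-class word-count dictionary, and the rule of classifying by the largest summed count are simple stand-ins, not the method of any particular tool.

```python
from collections import Counter

# Hypothetical list of common words to filter out.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def tokens(text):
    """Filter: lowercase the text and drop common words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def train(labeled_docs):
    """Train: build an attribute dictionary of word counts for each class."""
    dictionary = {}
    for label, text in labeled_docs:
        dictionary.setdefault(label, Counter()).update(tokens(text))
    return dictionary


def classify(dictionary, text):
    """Classify: pick the class whose dictionary best covers the document's words."""
    words = tokens(text)
    return max(dictionary, key=lambda label: sum(dictionary[label][w] for w in words))


model = train([
    ("sports", "the team won the game"),
    ("finance", "the market fell in trading"),
])
print(classify(model, "a game of the team"))
```

A practical system would add stemming, a much larger stop-word list, and a proper classifier, but the division into train, filter, and classify stays the same.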
11.4 Improving Performance

• Bagging
• Boosting
• Instance Typicality

Table 11.5 • Test Set Accuracy Scores for Typical and Atypical Training Data

                  Decision Tree   Bagging   Boosting
  Most Typical    76.57%          76.89%    76.56%
  Least Typical   49.84%          49.17%    48.84%

Table 11.6 • Classification Correctness: Bagging, Boosting, and Typicality

                     Single Model   Bagging   Boosting   Typicality
  Decision Tree      82.83%         86.47%    88.45%     85.81%
  Bayes Classifier   87.79%         86.14%    84.82%     85.81%

Part IV Intelligent Systems

Rule-Based Systems
Chapter 12

12.1 Exploring Artificial Intelligence

Table 12.1 • Areas of Study Within Artificial Intelligence

• Scheduling Problems
• Expert Systems
• Game Playing
• Intelligent Agents
• Intelligent Database Retrieval
• Intelligent Tutoring Systems
• Machine Learning
• Natural Language Processing
• Planning
• Robotics and Computer Vision
• Speech Recognition
• Theorem Proving

Nearest Neighbor Heuristic

When conducting a state-space search, always move to the next closest state.

[Figure 12.1 Starting at city A, the nearest neighbor heuristic finds a. Graph of cities A, B, C, and D with weighted edges]

[Figure 12.2 The nearest neighbor heuristic chooses the path A-C-D-B-. Graph of cities A, B, C, and D with weighted edges]

The Water Jug Problem

Table 12.2 • Rules for the Water Jug Problem

  1. Action: Fill the 4-gallon jug.
     Required conditions: The 4-gallon jug is not full.
     Resultant state: (4, y)
  2. Action: Fill the 3-gallon jug.
     Required conditions: The 3-gallon jug is not full.
     Resultant state: (x, 3)
  3. Action: Empty the 4-gallon jug onto the ground.
     Required conditions: The 4-gallon jug is not empty.
     Resultant state: (0, y)
  4. Action: Empty the 3-gallon jug onto the ground.
     Required conditions: The 3-gallon jug is not empty.
     Resultant state: (x, 0)
  5. Action: Pour water from the 3-gallon jug into the 4-gallon jug until the 4-gallon jug is full.
     Required conditions: The total amount of water in both jugs is >= 4 and the 3-gallon jug is not empty.
     Resultant state: (4, y - (4 - x))
  6. Action: Pour water from the 4-gallon jug into the 3-gallon jug until the 3-gallon jug is full.
     Required conditions: The total amount of water in both jugs is >= 3 and the 4-gallon jug is not empty.
     Resultant state: (x - (3 - y), 3)
  7. Action: Pour all the water from the 3-gallon jug into the 4-gallon jug.
     Required conditions: The total amount of water in both jugs is <= 4 and the 3-gallon jug is not empty.
     Resultant state: (x + y, 0)
  8. Action: Pour all the water from the 4-gallon jug into the 3-gallon jug.
     Required conditions: The total amount of water in both jugs is <= 3 and the 4-gallon jug is not empty.
     Resultant state: (0, x + y)

[Figure 12.3 A hypothetical state space representation. Nodes A through J]

Depth-First Search: A-B-E-F-C-G-I-J-H-D

Breadth-First Search: A-B-C-D-E-F-G-H-I-J

[Figure 12.4 A depth-first solution for the water jug problem. The search is shown step by step from the initial state (0,0); the solution path is (0,0), Rule 1, (4,0), Rule 6, (1,3), Rule 4, (1,0), Rule 8, (0,1), Rule 1, (4,1), Rule 6, (2,3)]

[Figure 12.5 The complete state space for the water jug problem]

Table 12.3 • A Hypothetical Set of Production Rules

  Production Rules
  1. If a then b
  2. If c and d then x
  3. If x and y then g
  4. If w then y
  5. If w then d
  6. If b then y

  Known Facts: a, c

Backward Chaining

Creating a Goal Tree

[Figure 12.6 Creating a goal tree. Starting from goal g, the tree is expanded with Rule 3, then Rules 4 and 6, then Rules 1, 2, and 5]

Expert Systems

[Figure 12.7 An expert system architecture. Components: User, User Interface, Inference Engine, Knowledge Base]

Developing an Expert System

[Figure 12.8 The expert system development cycle. Stages: Problem Definition, Knowledge Acquisition, Knowledge Representation, Knowledge Programming, Testing & Evaluation]

Structuring a Rule-Based System

Form 1040 Tax Dependency

[Figure 12.9 A first-level goal tree for dependency exemption. Person is a Dependent requires: Relationship Test is Satisfied, Joint Return Test is Satisfied, Citizen or Resident Test is Satisfied, Income Test is Satisfied, Support Test is Satisfied]

[Figure 12.10 A goal tree for the citizen/resident test. Rules 1 through 11 expand Citizen or Resident Test is Satisfied into resident, citizen/alien, presence, green card, country/child, foreign, and adopted child tests]

Choosing a Data Mining Technique

[Figure 12.11 A partial goal tree for choosing a data mining technique. Branches cover strategy selection (supervised, unsupervised, association rule), model selection (decision tree, linear regression, backpropagation learning), output constraints, desirable criteria, input attribute criteria, and the data distribution test]

Managing Uncertainty in Rule-Based Systems
Chapter 13

13.1 Uncertainty: Sources and Solutions

Sources of Uncertainty

Rule 1: Large Package Rule
IF package size is large
THEN send package UPS

• Rule Antecedent
• Rule Confidence
• Combining Uncertain Information

General Methods for Dealing with Uncertainty

• Probability-Based Methods
• Heuristic Methods

Probability-Based Methods

• Objective Probability
• Experimental Probability
• Subjective Probability

Heuristic Methods

• Certainty Factors
• Fuzzy Logic

13.2 Fuzzy Rule-Based Systems

Fuzzy Sets

A set associated with a linguistic value that gives the degree of membership for a numerical value.

[Figure 13.1 Four membership functions for distance from x. Panels: far_from X, close_to X, approaching X, moving_from X; vertical axis: Degree of Membership]

Fuzzy Reasoning: An Example

1. Fuzzification
2. Rule Inference
3. Rule Composition
4. Defuzzification

[Figure 13.2 Fuzzy sets for age, previous_accepts and. Panels: Age (Young, Middle-Aged, Old), Previous_Accepts (Few, Some, Several), Life_Insurance_Accept (Low, Moderate, High); vertical axis: Degree of Membership]

[Figure 13.3 Clipping rule consequent membership functions. Panels: (a) Life_Insurance_Accept = High, (b) Life_Insurance_Accept = Moderate; vertical axis: Degree of Membership]

[Figure 13.4 A fuzzy set created by rule composition. Vertical axis: Degree of Membership]

13.3 A Probability-Based Approach to Uncertainty

Bayes Theorem

\[ P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid \neg H)\,P(\neg H)} \]   (Equation 13.3)

where
  P(H) is the a priori probability that the hypothesis to be tested is true
  P(¬H) is the a priori probability that H is false, P(¬H) = 1 - P(H)
  E is the evidence associated with the hypothesis; the denominator expression represents P(E)
  P(E | H) is the conditional probability of the evidence knowing the hypothesis is true
  P(E | ¬H) is the conditional probability of the evidence knowing the hypothesis is false

Table 13.1 • Life Insurance Promotion Data

  Instance #   Age   Previous Accepts   Life Insurance Promotion
  1            25    2                  Yes
  2            33    4                  Yes
  3            19    1                  Yes
  4            43    5                  No
  5            35    1                  No
  6            26    3                  Yes
  7            50    2                  No
  8            24    2                  Yes
  9            20    0                  No
  10           62    3                  No
  11           36    5                  Yes
  12           27    0                  No
  13           28    1                  No
  14           25    3                  Yes

Multiple Evidence with Bayes Theorem

\[ P(H_i \mid E_1 \,\&\, E_2 \,\&\, \ldots \,\&\, E_j) = \frac{P(E_1 \,\&\, E_2 \,\&\, \ldots \,\&\, E_j \mid H_i)\,P(H_i)}{\sum_{k=1}^{n} P(E_1 \,\&\, E_2 \,\&\, \ldots \,\&\, E_j \mid H_k)\,P(H_k)} \]   (Equation 13.4)

\[ P(H_i \mid E_1 \,\&\, E_2 \,\&\, \ldots \,\&\, E_j) = \frac{P(E_1 \mid H_i)\,P(E_2 \mid H_i) \cdots P(E_j \mid H_i)\,P(H_i)}{\sum_{k=1}^{n} P(E_1 \mid H_k)\,P(E_2 \mid H_k) \cdots P(E_j \mid H_k)\,P(H_k)} \]   (Equation 13.5)

Likelihood Ratios: Necessity and Sufficiency

\[ LS = \frac{P(E \mid H)}{P(E \mid \neg H)}, \qquad LN = \frac{P(\neg E \mid H)}{P(\neg E \mid \neg H)} \]   (Equation 13.6)

General Considerations

• P(H | E) and P(¬H | E) must sum to 1.
• Conditional independence between multiple pieces of evidence must be assumed.
• Prior probabilities are often unobtainable.
• Large amounts of data must be gathered to obtain reasonable estimates for conditional probabilities.

Intelligent Agents
Chapter 14

14.1 Characteristics of Intelligent Agents

• Situatedness
• Autonomy
• Adaptivity
• Sociability

14.2 Types of Agents

• Anticipatory agents
• Filtering agents
• Semiautonomous agents
• Find-and-retrieve agents
• User agents
• Monitor and surveillance agents
• Data mining agents
• Proactive agents
• Cooperative agents

14.3 Integrating Data Mining, Expert Systems and Intelligent Agents

[Figure 14.1 An agent-based model for data mining. Components: User, Natural Language Interface, Data Mining Agent, Rule-Based Domain Analyzer, Rule-Based Model Selector, Rule-Based Parameter Selector, Data Mining Tool, Data Mining Session]

The iDA Software
Appendix A

Datasets for Data Mining
Appendix B

Decision Tree Attribute Selection
Appendix C

Computing Gain Ratio

\[ \text{GainRatio}(A) = \frac{\text{Gain}(A)}{\text{Split Info}(A)} \]   (Equation C.1)

Computing Gain(A)

\[ \text{Gain}(A) = \text{Info}(I) - \text{Info}(I, A) \]   (Equation C.2)

Computing Info(I)

\[ \text{Info}(I) = -\sum_{i=1}^{n} \frac{\#\text{ in class } i}{\#\text{ in } I}\, \log_2\!\left(\frac{\#\text{ in class } i}{\#\text{ in } I}\right) \]   (Equation C.3)

Computing Info(I, A)

\[ \text{Info}(I, A) = \sum_{j=1}^{k} \frac{\#\text{ in class } j}{\#\text{ in } I}\, \text{info}(\text{class } j) \]   (Equation C.4)

Computing Split Info(A)

\[ \text{Split Info}(A) = -\sum_{j=1}^{k} \frac{\#\text{ in class } j}{\#\text{ in } I}\, \log_2\!\left(\frac{\#\text{ in class } j}{\#\text{ in } I}\right) \]   (Equation C.5)

Table C.1 • The Credit Card Promotion Database

  Income Range   Life Insurance   Credit Card   Sex      Age
                 Promotion        Insurance
  40–50K         No               No            Male     45
  30–40K         Yes              No            Female   40
  40–50K         No               No            Male     42
  30–40K         Yes              Yes           Male     43
  50–60K         Yes              No            Female   38
  20–30K         No               No            Female   55
  30–40K         Yes              Yes           Male     35
  20–30K         No               No            Male     27
  30–40K         No               No            Male     43
  30–40K         Yes              No            Female   41
  40–50K         Yes              No            Female   43
  20–30K         Yes              No            Male     29
  50–60K         Yes              No            Female   39
  40–50K         No               No            Male     55
  20–30K         Yes              Yes           Female   19

Figure C.1 A partial decision tree with root node = income range

  Income Range
    20–30K: 2 Yes, 2 No
    30–40K: 4 Yes, 1 No
    40–50K: 1 Yes, 3 No
    50–60K: 2 Yes

Statistics for Performance Evaluation
Appendix D

D.1 Single-Valued Summary Statistics

Computing the Mean

\[ \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \]   (Equation D.1)

where μ is the mean value, n is the number of data items, and x_i is the ith data item.

Computing the Variance

\[ \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (\mu - x_i)^2 \]   (Equation D.2)

where σ² is the variance, μ is the population mean, n is the number of data items, and x_i is the ith data item.

D.2 The Normal Distribution

The Normal Curve

\[ f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / (2\sigma^2)} \]   (Equation D.3)

where
  f(x) is the height of the curve corresponding to values of x
  e is the base of natural logarithms, approximated by 2.718282
  μ is the arithmetic mean for the data
  σ is the standard deviation

D.3 Comparing Supervised Learner Models

• Comparing Models with Independent Test Data
• Pairwise Comparison with a Single Test Set

Comparing Models with Independent Test Data

\[ P = \frac{|E_1 - E_2|}{\sqrt{v_1 / n_1 + v_2 / n_2}} \]   (Equation D.4)

• Two independent test sets: set A containing n_1 elements and set B containing n_2 elements.
• Error rate E_1 and variance v_1 for model M_1 on test set A.
• Error rate E_2 and variance v_2 for model M_2 on test set B.

Pairwise Comparison with a Single Test Set

Computing Joint Variance for a Single Test Set

\[ V_{12} = \frac{1}{n - 1} \sum_{i=1}^{n} \left[ (e_{1i} - e_{2i}) - (E_1 - E_2) \right]^2 \]   (Equation D.5)

where
  V_12 is the joint variance
  e_1i is the classifier error on the ith instance for learner model M_1
  e_2i is the classifier error on the ith instance for learner model M_2
  E_1 - E_2 is the overall classifier error rate for model M_1 minus the classifier error rate for model M_2
  n is the total number of test set instances

\[ P = \frac{|E_1 - E_2|}{\sqrt{V_{12} / n}} \]   (Equation D.6)

where
  V_12 is the joint variance as computed by Equation D.5
  E_1 - E_2 is the overall classifier error rate for model M_1 minus the classifier error rate for model M_2
  n is the total number of test set instances

D.4 Confidence Intervals for Numeric Output

\[ \text{variance}(mae) = \frac{1}{n - 1} \sum_{i=1}^{n} (e_i - mae)^2 \]   (Equation D.7)

where e_i is the absolute error for the ith instance and n is the number of instances.

D.5 Comparing Models with Numeric Output

• Independent Test Sets
• Pairwise Comparison with a Single Test Set
• Overall Comparison with a Single Test Set

Comparing Models with Independent Test Sets

\[ P = \frac{|mae_1 - mae_2|}{\sqrt{v_1 / n_1 + v_2 / n_2}} \]   (Equation D.8)

where
  mae_1 is the mean absolute error for model M_1
  mae_2 is the mean absolute error for model M_2
  v_1 and v_2 are variance scores associated with M_1 and M_2
  n_1 and n_2 are the number of instances within each respective test set

Pairwise Comparison with a Single Test Set

\[ P = \frac{|mae_1 - mae_2|}{\sqrt{V_{12} / n}} \]   (Equation D.9)

where
  mae_1 is the mean absolute error for model M_1
  mae_2 is the mean absolute error for model M_2
  V_12 is the joint variance computed with the formula defined in Equation D.5
  n is the number of test set instances

Overall Comparison with a Single Test Set

\[ \text{variance}(mae_j) = \frac{1}{n - 1} \sum_{i=1}^{n} (e_i - mae_j)^2 \]   (Equation D.10)

where
  mae_j is the mean absolute error for model j
  e_i is the absolute value of the computed value minus the actual value for instance i
  n is the number of test set instances

\[ P = \frac{|mae_1 - mae_2|}{\sqrt{v\,(2 / n)}} \]   (Equation D.11)

where
  v is either the average or the larger of the variance scores for each model
  n is the total number of test set instances

Excel Pivot Tables: Office 97
Appendix E

Figure E.1 Pivot table wizard step 3
Figure E.2 Calculating a summary report
Figure E.3 A summary report of income range
Figure E.4 A pie chart for income range
Figure E.5 A pivot table showing age and credit card insurance choice
Figure E.6 Grouping the credit card promotion data by age
Figure E.7 A credit card promotion cube (dimensions: Magazine Promo, Watch Promo, Life Insurance Promo)
Figure E.8 A pivot table with page variables
Figure E.9 A pivot table with page variables for credit card promotions
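As a worked illustration of the Appendix C formulas, the sketch below computes the gain ratio for income range from the class counts shown in Figure C.1. It assumes base-2 logarithms in Equations C.3 and C.5 and treats life insurance promotion as the output attribute; the helper names are hypothetical.

```python
import math


def entropy(counts):
    """-sum(p * log2(p)) over the nonzero counts (form of Equations C.3 and C.5)."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


# Figure C.1: (yes, no) counts of life insurance promotion for each income range.
PARTITION = {
    "20-30K": (2, 2),
    "30-40K": (4, 1),
    "40-50K": (1, 3),
    "50-60K": (2, 0),
}


def gain_ratio(partition):
    """GainRatio(A) = (Info(I) - Info(I, A)) / Split Info(A)."""
    sizes = [sum(counts) for counts in partition.values()]
    total = sum(sizes)
    class_totals = [sum(column) for column in zip(*partition.values())]
    info_i = entropy(class_totals)                      # Equation C.3
    info_i_a = sum(n / total * entropy(counts)          # Equation C.4
                   for n, counts in zip(sizes, partition.values()))
    split_info = entropy(sizes)                         # Equation C.5
    return (info_i - info_i_a) / split_info             # Equations C.2 and C.1


print(round(gain_ratio(PARTITION), 3))
```

Repeating the calculation for each candidate attribute and choosing the largest gain ratio is how the root node of the decision tree would be selected.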
