• This chapter uses MS Excel and Weka
Statistical Techniques
Chapter 10
10.1 Linear Regression Analysis
$f(x_1, x_2, x_3, \ldots, x_n) = a_1 x_1 + a_2 x_2 + a_3 x_3 + \cdots + a_n x_n + c$
Equation 10.1
10.1 Linear Regression Analysis
• A supervised technique that generalizes a set of numeric data by creating a mathematical equation relating one or more input variables to a single output variable.
• With linear regression we attempt to model variation in a dependent variable as a linear combination of one or more independent variables.
• Linear regression is appropriate when the relationship between the dependent and independent variables is nearly linear.
Simple Linear Regression (slope-intercept form)
$y = ax + b$
Equation 10.2
Simple Linear Regression (least squares criterion)
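$a = \dfrac{n\sum xy - \sum x \sum y}{n\sum x^2 - \left(\sum x\right)^2} \qquad b = \dfrac{\sum y}{n} - a\,\dfrac{\sum x}{n}$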
Equation 10.3
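As a hedged illustration of the least squares criterion, the sketch below fits y = ax + b using the standard closed-form estimates for the slope a and intercept b. The function name `fit_line` and the sample points are made up for this example; they are not part of the chapter's data.

```python
def fit_line(xs, ys):
    """Return (a, b) for y = a*x + b using the least-squares closed form."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Slope from the normal equations, then the intercept from the means.
    a = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    b = (sum_y - a * sum_x) / n
    return a, b

# Hypothetical points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
print(a, b)
```

Because the sample points are exactly linear, the fitted line reproduces them; real data would leave residuals that the criterion minimizes.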
Multiple Linear Regression with Excel
Try to estimate the value of a building
Table 10.1 • District Office Building Data
Space  Offices  Entrances  Age  Value
2310   2        2          20   $142,000
2333   2        2          12   $144,000
2356   3        1.5        33   $151,000
2379   3        2          43   $150,000
2402   2        3          53   $139,000
2425   4        2          23   $169,000
2448   2        1.5        99   $126,000
2471   2        2          34   $142,900
2494   3        3          23   $163,000
2517   4        4          55   $169,000
2540   2        3          22   $149,000
A Regression Equation for the District Office Building Data
$\text{Value} = 27.64\,\text{Space} + 12529.77\,\text{Offices} + 2553.21\,\text{Entrances} - 234.24\,\text{Age} + 52317.83$
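Table 10.2 • Regression Statistics for the Office Building Data
                              Age           Entrances  Offices   Space     Constant
Coefficient                   -234.2371645  2553.211   12529.77  27.64139  52317.83
Standard error                13.26801148   530.6692   400.0668  5.429374  12237.36
R squared / std. error of y   0.996747993   970.5785   #N/A      #N/A      #N/A
F statistic / deg. of freedom 459.7536742   6          #N/A      #N/A      #N/A
SS regression / SS residual   1732393319    5652135    #N/A      #N/A      #N/A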
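The regression in Table 10.2 can be sketched in plain Python. The block below solves the normal equations for the Table 10.1 data and then derives the coefficient of determination and the F statistic. The helper `solve` and the use of exact `Fraction` arithmetic (to sidestep rounding in the elimination) are choices made for this illustration, not something the slides prescribe.

```python
from fractions import Fraction

# Rows of Table 10.1: space, offices, entrances, age, value.
DATA = [
    (2310, 2, 2, 20, 142000), (2333, 2, 2, 12, 144000),
    (2356, 3, 1.5, 33, 151000), (2379, 3, 2, 43, 150000),
    (2402, 2, 3, 53, 139000), (2425, 4, 2, 23, 169000),
    (2448, 2, 1.5, 99, 126000), (2471, 2, 2, 34, 142900),
    (2494, 3, 3, 23, 163000), (2517, 4, 4, 55, 169000),
    (2540, 2, 3, 22, 149000),
]

def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]

# Design matrix with a trailing 1 for the constant term c.
X = [[Fraction(str(v)) for v in row[:4]] + [Fraction(1)] for row in DATA]
y = [Fraction(row[4]) for row in DATA]
k = len(X[0])

# Normal equations: (X^T X) coef = X^T y.
xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(k)]
coef = [float(c) for c in solve(xtx, xty)]  # space, offices, entrances, age, c

pred = [sum(c * float(v) for c, v in zip(coef, r)) for r in X]
mean_y = float(sum(y)) / len(y)
ss_resid = sum((float(t) - p) ** 2 for t, p in zip(y, pred))
ss_total = sum((float(t) - mean_y) ** 2 for t in y)
r_squared = 1 - ss_resid / ss_total
df = len(y) - k  # 11 instances minus 5 fitted values
f_stat = ((ss_total - ss_resid) / (k - 1)) / (ss_resid / df)
print(coef, r_squared, f_stat)
```

In practice a numerical library would replace the hand-written solver; it is spelled out here only to keep the sketch self-contained.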
10.1 Linear Regression Analysis
• How accurate are the results?
– Use a scatterplot diagram together with the line given by the formula.
– Which independent variables are linearly related to the dependent variable? Use the regression statistics.
– Coefficient of determination: a value of 1 means there is no difference between the actual values (in the table) and the computed values of the dependent variable. It represents the correlation between actual and computed values.
– Standard error for the estimate of the dependent variable.
F statistic for the regression analysis
• Used to establish whether the coefficient of determination is significant.
– Compare the computed F value (459.75) with the critical value from a one-tailed F table in a statistics book, using v1 = 4 (the number of independent variables) and v2 = 11 - 5 = 6 (the number of instances minus the number of variables).
• The regression equation is able to correctly determine the assessed values of office buildings that are part of the training data.
[Figure: Assessed Value (0 to 180,000) plotted against Floor Space (2,200 to 2,600)]
Figure 10.1 A simple linear regression equation
Regression Trees
[Figure: a binary tree whose internal nodes Test 1 to Test 4 branch on < and >=, and whose leaves are the linear regression models LRM1 to LRM5]
Figure 10.2 A generic model tree
Regression Tree
• Essentially a decision tree whose leaf nodes hold numeric values.
• The value at an individual leaf node is the numeric average of the output attribute for all instances passing through the tree to that leaf node position.
• Regression trees are more accurate than linear regression when the data is nonlinear.
• But they are more difficult to interpret.
• Sometimes regression trees are combined with linear regression to form model trees.
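To make the "leaf value is an average" idea concrete, here is a minimal sketch of a one-split regression tree. The split rule (choose the threshold that minimizes total squared error) and the sample data are assumptions made for illustration; the slides do not say how splits are chosen.

```python
def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(xs, ys):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x < threshold]
        right = [y for x, y in pairs if x >= threshold]
        error = sse(left) + sse(right)
        if best is None or error < best[0]:
            best = (error, threshold, mean(left), mean(right))
    return best[1:]

def predict(tree, x):
    """Each leaf predicts the average output of its training instances."""
    threshold, left_mean, right_mean = tree
    return left_mean if x < threshold else right_mean

# Hypothetical data with an obvious jump between x = 3 and x = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 7.0, 20.0, 21.0, 22.0]
tree = best_split(xs, ys)
print(tree, predict(tree, 2), predict(tree, 11))
```

A model tree would replace each leaf average with a linear regression fitted to the instances reaching that leaf.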
Model Trees
• Regression tree + linear regression.
• Each leaf node represents a linear regression equation instead of an average value.
• Model trees simplify regression trees by reducing the number of nodes in the tree.
• A more complex tree means a less linear relationship between the dependent and independent variables.
[Figure: model tree rooted at the test Amt <= 246 / > 246, with further tests on TotCost (178, 136, 390, 309), Amt (171) and Trips (7.5, 39), ending in the leaf models LRM1 to LRM9]
Figure 10.3 A model tree for the deer hunter dataset (output attribute yes)
10.2 Logistic Regression
Logistic Regression
• Using linear regression to model problems whose observed outcome is restricted to two values (e.g. yes/no) is seriously flawed. The value restriction placed on the output variable is not observed in the regression equation: linear regression produces a straight line that is unbounded at both ends.
• Therefore the linear equation must be transformed to restrict the output to [0,1]. The regression equation can then be thought of as producing the probability of occurrence or nonoccurrence of a measured event.
• Logistic regression applies a logarithmic transformation.
Transforming the Linear Regression Model
Logistic regression is a nonlinear regression technique that associates a conditional probability with each data instance.
1 denotes observation of one class (yes); 0 denotes observation of the other class (no).
Thus p(y = 1 | x) is the conditional probability of seeing the class associated with y = 1 (yes), given the values in the feature vector x.
The Logistic Regression Model
$p(y = 1 \mid x) = \dfrac{e^{ax + c}}{1 + e^{ax + c}}$
Equation 10.7
where e is the base of natural logarithms, often denoted as exp.
Determine the coefficients in ax + c using an iterative method (it tries to minimize the sum of logarithms of the predicted probabilities).
Convergence occurs when the logarithmic summation is close to 0 or when it no longer changes from iteration to iteration.
[Figure: S-shaped curve of P(y = 1 | x), vertical axis 0.000 to 1.200, against x from -6 to 6]
Figure 10.4 The logistic regression equation
Logistic Regression: An Example
Credit card example: the CreditCardPromotionNet file.
Life Insurance Promotion is the output attribute.
$ax + c = 0.0001\,\text{Income} + 19.827\,\text{CreditCardIns} - 8.314\,\text{Sex} - 0.415\,\text{Age} + 17.691$
CreditCardIns and Sex are the most influential attributes.
Table 10.3 • Logistic Regression: Dependent Variable = Life Insurance Promotion
Instance  Income  Credit Card Insurance  Sex  Age  Life Insurance Promotion  Computed Probability
1         40K     0                      1    45   0                         0.007
2         30K     0                      0    40   1                         0.987
3         40K     0                      1    42   0                         0.024
4         30K     1                      1    43   1                         1.000
5         50K     0                      0    38   1                         0.999
6         20K     0                      0    55   0                         0.049
7         30K     1                      1    35   1                         1.000
8         20K     0                      1    27   0                         0.584
9         30K     0                      1    43   0                         0.005
10        30K     0                      0    41   1                         0.981
11        40K     0                      0    43   1                         0.985
12        20K     0                      1    29   1                         0.380
13        50K     0                      0    39   1                         0.999
14        40K     0                      1    55   0                         0.000
15        20K     0                      0    19   1                         1.000
Logistic Regression
• Classify a new instance using logistic regression
– Income = 35K
– Credit card insurance = 1
– Sex = 0
– Age = 39
– P(y=1|x) = 0.999
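The classification step can be sketched as follows. The code reads the garbled linear expression as 0.0001 Income + 19.827 CreditCardIns - 8.314 Sex - 0.415 Age + 17.691 (the signs are inferred from the computed probabilities in Table 10.3) and assumes income is expressed in dollars, so 35K becomes 35000. Because the published coefficients are rounded, the results only approximate the table's probabilities.

```python
import math

def logistic(z):
    """Map an unbounded linear score z = ax + c to a probability in (0, 1)."""
    return math.exp(z) / (1.0 + math.exp(z))

def linear_score(income, credit_card_ins, sex, age):
    """Linear part ax + c with the rounded coefficients from the example."""
    return (0.0001 * income + 19.827 * credit_card_ins
            - 8.314 * sex - 0.415 * age + 17.691)

# New instance from the slide: income 35K, credit card insurance 1, sex 0, age 39.
p_new = logistic(linear_score(35000, 1, 0, 39))

# Instance 8 of Table 10.3: income 20K, no credit card insurance, sex 1, age 27.
p_inst8 = logistic(linear_score(20000, 0, 1, 27))

print(round(p_new, 3), round(p_inst8, 3))
```

A probability above 0.5 is read as class 1 (yes), which is how the new instance is classified.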
10.3 Bayes Classifier
• Supervised classification technique with a categorical output attribute
• All input variables are assumed to be independent and of equal importance
$P(H \mid E) = \dfrac{P(E \mid H)\,P(H)}{P(E)}$
Equation 10.9
where H is the hypothesis to be tested and E is the evidence associated with H
• P(H|E): the likelihood of H given evidence E (H is the dependent variable representing a predicted class)
• P(E|H): the conditional probability of the evidence E given that H is true (computed from training data)
• P(H): the a priori probability, the probability of H before the presentation of evidence E (computed from training data)
Bayes Classifier: An Example
Credit card promotion data set; Sex is the output attribute.
Table 10.4 • Data for Bayes Classifier
Magazine Promotion  Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Sex
Yes                 No               No                        No                     Male
Yes                 Yes              Yes                       Yes                    Female
No                  No               No                        No                     Male
Yes                 Yes              Yes                       Yes                    Male
Yes                 No               Yes                       No                     Female
No                  No               No                        No                     Female
Yes                 Yes              Yes                       Yes                    Male
No                  No               No                        No                     Male
Yes                 No               No                        No                     Male
Yes                 Yes              Yes                       No                     Female
The Instance to be Classified
Magazine Promotion = Yes
Watch Promotion = Yes
Life Insurance Promotion = No
Credit Card Insurance = No
Sex = ?
Two hypotheses: sex = female and sex = male
Table 10.5 • Counts and Probabilities for Attribute Sex (Evidence E)
Attribute                 Sex     Yes  No  Ratio: yes/total  Ratio: no/total
Magazine Promotion        Male    4    2   4/6               2/6
Magazine Promotion        Female  3    1   3/4               1/4
Watch Promotion           Male    2    4   2/6               4/6
Watch Promotion           Female  2    2   2/4               2/4
Life Insurance Promotion  Male    2    4   2/6               4/6
Life Insurance Promotion  Female  3    1   3/4               1/4
Credit Card Insurance     Male    2    4   2/6               4/6
Credit Card Insurance     Female  1    3   1/4               3/4
Computing The Probability For Sex = Male
$P(\text{sex} = \text{male} \mid E) = \dfrac{P(E \mid \text{sex} = \text{male})\,P(\text{sex} = \text{male})}{P(E)}$
Equation 10.10
Conditional Probabilities for Sex = Male
P(magazine promotion = yes | sex = male) = 4/6
P(watch promotion = yes | sex = male) = 2/6
P(life insurance promotion = no | sex = male) = 4/6
P(credit card insurance = no | sex = male) = 4/6
P(E | sex = male) = (4/6)(2/6)(4/6)(4/6) = 8/81
The Probability for Sex=Male Given Evidence E
P(sex = male | E) ≈ 0.0593 / P(E)
The Probability for Sex=Female Given Evidence E
P(sex = female | E) ≈ 0.0281 / P(E)
P(sex = male | E) > P(sex = female | E)
The instance is most likely a male credit card customer.
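The two numerators can be reproduced directly from the Table 10.4 rows. The sketch below is one way to organize the counting; the names `ROWS` and `score` are illustrative, and exact fractions are used so the results can be compared with 0.0593 and 0.0281 without rounding noise.

```python
from fractions import Fraction

# Table 10.4 rows: magazine, watch, life insurance, credit card insurance, sex.
ROWS = [
    ("Yes", "No", "No", "No", "Male"),
    ("Yes", "Yes", "Yes", "Yes", "Female"),
    ("No", "No", "No", "No", "Male"),
    ("Yes", "Yes", "Yes", "Yes", "Male"),
    ("Yes", "No", "Yes", "No", "Female"),
    ("No", "No", "No", "No", "Female"),
    ("Yes", "Yes", "Yes", "Yes", "Male"),
    ("No", "No", "No", "No", "Male"),
    ("Yes", "No", "No", "No", "Male"),
    ("Yes", "Yes", "Yes", "No", "Female"),
]

def score(evidence, sex):
    """Return P(E | sex) * P(sex), the numerator of Bayes' rule."""
    members = [r for r in ROWS if r[4] == sex]
    result = Fraction(len(members), len(ROWS))  # prior P(sex)
    for index, value in enumerate(evidence):
        matches = sum(1 for r in members if r[index] == value)
        result *= Fraction(matches, len(members))  # independence assumption
    return result

# Instance to classify: magazine Yes, watch Yes, life insurance No, credit card No.
evidence = ("Yes", "Yes", "No", "No")
male = score(evidence, "Male")
female = score(evidence, "Female")
print(float(male), float(female))
```

P(E) is the same for both hypotheses, so comparing the numerators is enough to pick the class.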
Zero-Valued Attribute Counts
A problem with the Bayes classifier arises when one of the counts is 0. To solve it, add a small constant to the numerator and denominator.
n/d becomes
$\dfrac{n + (k)(p)}{d + k}$
Equation 10.12
k is a value between 0 and 1 (usually 1)
p is an equal fractional part of the total number of possible values for the attribute; p is 0.5 for an attribute with 2 possible values
Example with k = 1 and p = 0.5:
P(E | sex = female) = (3/4)(2/4)(1/4)(3/4) = 9/128
P(E | sex = female) = (3.5/5)(2.5/5)(1.5/5)(3.5/5) = 0.0735
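A small sketch of the correction, assuming k = 1 and p = 0.5 as for a two-valued attribute. The function name `smoothed` is made up for this illustration; the ratios are the female-class ratios for the instance being classified.

```python
def smoothed(n, d, k=1.0, p=0.5):
    """Replace the ratio n/d by (n + k*p) / (d + k)."""
    return (n + k * p) / (d + k)

# Female-class ratios for the evidence: 3/4, 2/4, 1/4, 3/4.
ratios = [(3, 4), (2, 4), (1, 4), (3, 4)]

product = 1.0
for n, d in ratios:
    product *= smoothed(n, d)

# A zero count no longer forces the whole product to zero.
zero_case = smoothed(0, 4)
print(product, zero_case)
```

With k = 1 each ratio gains half an observation in the numerator and one in the denominator, so probabilities are pulled slightly toward p.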
Table 10.6 • Addition of Attribute Age to the Bayes Classifier Dataset
Magazine Promotion  Watch Promotion  Life Insurance Promotion  Credit Card Insurance  Age  Sex
Yes                 No               No                        No                     45   Male
Yes                 Yes              Yes                       Yes                    40   Female
No                  No               No                        No                     42   Male
Yes                 Yes              Yes                       Yes                    30   Male
Yes                 No               Yes                       No                     38   Female
No                  No               No                        No                     55   Female
Yes                 Yes              Yes                       Yes                    35   Male
No                  No               No                        No                     27   Male
Yes                 No               No                        No                     43   Male
Yes                 Yes              Yes                       No                     41   Female
Missing Data
With Bayes classifier missing data items are ignored.
Missing Data
• Example
Numeric Data
Table 10.7 • Five Instances from the Credit Card Promotion Database
Instance  Income Range  Magazine Promotion  Watch Promotion  Life Insurance Promotion  Sex
I1        40–50K        Yes                 No               No                        Male
I2        25–35K        Yes                 Yes              Yes                       Female
I3        40–50K        No                  No               No                        Male
I4        25–35K        Yes                 Yes              Yes                       Male
I5        50–60K        Yes                 No               Yes                       Female
Numeric Data
Probability density function (attribute values are assumed to be normally distributed)
$f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / (2\sigma^2)}$
where
e = the exponential function
μ = the class mean for the given numerical attribute
σ = the class standard deviation for the attribute
x = the attribute value
Equation 10.13
Numeric Data
Instance to classify:
• Magazine Promotion = Yes
• Watch Promotion = Yes
• Life Insurance Promotion = No
• Credit Card Insurance = No
• Age = 45
• Sex = ?
– P(E | sex = male) = … P(age = 45 | sex = male)
– σ = 7.69, μ = 37, x = 45
– P(age = 45 | sex = male) = 1/(…) = 0.03
– P(sex = male | E) = 0.0018 / P(E)
– P(sex = female | E) = 0.0016 / P(E)
– The instance is classified as male
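The density value quoted above can be checked with a few lines of Python. This is only a sketch of the normal density formula evaluated at the slide's numbers (mean 37, standard deviation 7.69, age 45); the function name `normal_pdf` is illustrative.

```python
import math

def normal_pdf(x, mean, sd):
    """Height of the normal curve with the given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * sd ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * sd)

# Values from the slide: class mean 37, class standard deviation 7.69, age 45.
p_age = normal_pdf(45, 37, 7.69)
print(round(p_age, 2))
```

The result is a density, not a true probability, but it plays the same role as the categorical ratios when the per-attribute factors are multiplied together.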
10.4 Clustering Algorithms
Agglomerative Clustering
1. Place each instance into a separate partition.
2. Until all instances are part of a single cluster:
a. Determine the two most similar clusters.
b. Merge the clusters chosen into a single cluster.
3. Choose a clustering formed by one of the step 2 iterations as a final result.
Agglomerative Clustering: An Example
Table 10.8 • Agglomerative Clustering: First Iteration
      I1    I2    I3    I4    I5
I1    1.00
I2    0.20  1.00
I3    0.80  0.00  1.00
I4    0.40  0.80  0.20  1.00
I5    0.40  0.60  0.20  0.40  1.00
Table 10.9 • Agglomerative Clustering: Second Iteration
        I1 I3  I2    I4    I5
I1 I3   0.80
I2      0.33   1.00
I4      0.47   0.80  1.00
I5      0.47   0.60  0.40  1.00
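The similarity scores in Tables 10.8 and 10.9 can be reproduced from the five instances of Table 10.7. The sketch below assumes similarity is the fraction of matching attribute values, and that a candidate merged cluster is scored by the average similarity over all pairs of its members; both rules are inferred from the table values rather than stated on the slides.

```python
from itertools import combinations

# Table 10.7: income range, magazine, watch, life insurance, sex.
INSTANCES = {
    "I1": ("40-50K", "Yes", "No", "No", "Male"),
    "I2": ("25-35K", "Yes", "Yes", "Yes", "Female"),
    "I3": ("40-50K", "No", "No", "No", "Male"),
    "I4": ("25-35K", "Yes", "Yes", "Yes", "Male"),
    "I5": ("50-60K", "Yes", "No", "Yes", "Female"),
}

def similarity(a, b):
    """Fraction of attributes on which instances a and b agree."""
    matches = sum(1 for u, v in zip(INSTANCES[a], INSTANCES[b]) if u == v)
    return matches / len(INSTANCES[a])

def merged_score(cluster_a, cluster_b):
    """Average pairwise similarity inside the union of two clusters."""
    members = list(cluster_a) + list(cluster_b)
    pairs = list(combinations(members, 2))
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)

# First iteration: score every pair and pick the most similar one.
scores = {pair: similarity(*pair) for pair in combinations(INSTANCES, 2)}
best_pair = max(scores, key=scores.get)  # ties resolve to the first pair found

# Second iteration: score the merged cluster {I1, I3} against I2.
second = merged_score(("I1", "I3"), ("I2",))
print(best_pair, scores[best_pair], round(second, 2))
```

I1 with I3 and I2 with I4 tie at 0.80; the code, like the tables, merges I1 and I3 first.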
Agglomerative Clustering
The final step of the algorithm is to choose a final clustering from among all those formed. (Requires heuristics.)
• Use the similarity measure that was used for creating clusters: compare the average within-cluster similarity with the overall similarity of all instances in the dataset (domain similarity).
• This technique is best used to eliminate clusterings rather than to choose a final result.
Agglomerative Clustering
The final step of the algorithm is to choose a final clustering from among all those formed. (Requires heuristics.)
• Compare the within-cluster similarity of each cluster with the within-cluster similarities of pairwise-combined clusters in the cluster set. Look for the highest similarity.
• This technique is best used to eliminate clusters rather than to choose a final result.
Agglomerative Clustering
The final step of the algorithm is to choose a final clustering from among all those formed. (Requires heuristics.)
• Use the previous two techniques to eliminate some of the clusterings.
• Feed each remaining clustering to a rule generator.
• The clustering with the best defining rules is chosen.
• A fourth technique: the Bayesian Information Criterion.
Conceptual Clustering
1. Create a cluster with the first instance as its only member.
2. For each remaining instance, take one of two actions at each tree level.
a. Place the new instance into an existing cluster.
b. Create a new concept cluster having the new instance as its only member.
Data for Conceptual Clustering
Table 10.10 • Data for Conceptual Clustering
Instance  Tails  Color  Nuclei
I1        One    Light  One
I2        Two    Light  Two
I3        Two    Dark   Two
I4        One    Dark   Three
I5        One    Light  Two
I6        One    Light  Two
I7        One    Light  Three
[Figure: concept hierarchy. Root N has children N1, N2 and N4; N1 has children N3 and N5. Leaf instances: I1 under N3; I5 and I6 under N5; I2 and I3 under N2; I4 and I7 under N4.]
Node probabilities: P(N) = 7/7, P(N1) = 3/7, P(N2) = 2/7, P(N4) = 2/7, P(N3) = 1/3, P(N5) = 2/3
P(V|C)  Tails=One  Tails=Two  Color=Light  Color=Dark  Nuclei=One  Nuclei=Two  Nuclei=Three
N       .71        .29        .71          .29         .14         .57         .29
N1      1.0        0.0        1.0          0.0         .33         .67         0.0
N2      0.0        1.0        .5           .5          0.0         1.0         0.0
N4      1.0        0.0        .5           .5          0.0         0.0         1.0
N3      1.0        0.0        1.0          0.0         1.0         0.0         0.0
N5      1.0        0.0        1.0          0.0         0.0         1.0         0.0
P(C|V)  Tails=One  Tails=Two  Color=Light  Color=Dark  Nuclei=One  Nuclei=Two  Nuclei=Three
N       1.0        1.0        1.0          1.0         1.0         1.0         1.0
N1      .6         0.0        .6           0.0         1.0         .5          0.0
N2      0.0        1.0        .2           .5          0.0         .5          0.0
N4      .4         0.0        .2           .5          0.0         0.0         1.0
N3      .33        0.0        .33          0.0         1.0         0.0         0.0
N5      .67        0.0        .67          0.0         0.0         1.0         0.0
Figure 10.5 A COBWEB-created hierarchy
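The two probabilities stored at each node of the hierarchy can be computed from Table 10.10. The sketch below does so for node N1, assuming its members are I1, I5 and I6 (inferred from the node's probability values); the function names are illustrative.

```python
from fractions import Fraction

# Table 10.10: data for conceptual clustering.
INSTANCES = {
    "I1": {"tails": "One", "color": "Light", "nuclei": "One"},
    "I2": {"tails": "Two", "color": "Light", "nuclei": "Two"},
    "I3": {"tails": "Two", "color": "Dark", "nuclei": "Two"},
    "I4": {"tails": "One", "color": "Dark", "nuclei": "Three"},
    "I5": {"tails": "One", "color": "Light", "nuclei": "Two"},
    "I6": {"tails": "One", "color": "Light", "nuclei": "Two"},
    "I7": {"tails": "One", "color": "Light", "nuclei": "Three"},
}

def p_value_given_cluster(cluster, attribute, value):
    """P(V|C): share of the cluster's members having this attribute value."""
    hits = sum(1 for name in cluster if INSTANCES[name][attribute] == value)
    return Fraction(hits, len(cluster))

def p_cluster_given_value(cluster, attribute, value):
    """P(C|V): share of all instances with this value that fall in the cluster."""
    holders = [n for n in INSTANCES if INSTANCES[n][attribute] == value]
    hits = sum(1 for name in holders if name in cluster)
    return Fraction(hits, len(holders))

N1 = ["I1", "I5", "I6"]  # assumed membership of node N1
print(p_value_given_cluster(N1, "nuclei", "Two"),
      p_cluster_given_value(N1, "tails", "One"))
```

P(V|C) measures how predictable a value is inside the cluster, while P(C|V) measures how predictive the value is of the cluster.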
Expectation Maximization
1. Guess initial values for the five parameters.
2. Until a termination criterion is achieved:
a. Use the probability density function for normal distributions to compute the cluster probability for each instance.
b. Use the probability scores assigned to each instance in step 2(a) to re-estimate the parameters.
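The steps above can be sketched for a single numeric attribute and two clusters. The code assumes the five parameters are two means, two standard deviations and one mixing probability, and it runs a fixed number of iterations in place of a real termination criterion. The data values are made up for illustration.

```python
import math

def normal_pdf(x, mean, sd):
    """Height of the normal curve with the given mean and standard deviation."""
    exponent = -((x - mean) ** 2) / (2 * sd ** 2)
    return math.exp(exponent) / (math.sqrt(2 * math.pi) * sd)

def em_two_clusters(data, mean1, sd1, mean2, sd2, p1, iterations=25):
    n = len(data)
    for _ in range(iterations):
        # Step 2(a): probability that each instance belongs to cluster 1.
        resp = []
        for x in data:
            w1 = p1 * normal_pdf(x, mean1, sd1)
            w2 = (1 - p1) * normal_pdf(x, mean2, sd2)
            resp.append(w1 / (w1 + w2))
        # Step 2(b): re-estimate the five parameters from those probabilities.
        n1 = sum(resp)
        n2 = n - n1
        mean1 = sum(r * x for r, x in zip(resp, data)) / n1
        mean2 = sum((1 - r) * x for r, x in zip(resp, data)) / n2
        var1 = sum(r * (x - mean1) ** 2 for r, x in zip(resp, data)) / n1
        var2 = sum((1 - r) * (x - mean2) ** 2 for r, x in zip(resp, data)) / n2
        sd1 = max(math.sqrt(var1), 1e-3)  # floor keeps the density finite
        sd2 = max(math.sqrt(var2), 1e-3)
        p1 = n1 / n
    return mean1, sd1, mean2, sd2, p1

# Hypothetical data: two well-separated groups around 1 and 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
mean1, sd1, mean2, sd2, p1 = em_two_clusters(data, 0.0, 1.0, 6.0, 1.0, 0.5)
print(mean1, mean2, p1)
```

Unlike a hard assignment, every instance contributes to both clusters, weighted by its cluster probability.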
The EM Algorithm: An Example
Table 10.11 • An EM Clustering of Gamma-Ray Burst Data
                   Cluster 0  Cluster 1  Cluster 2
# Instances        518        340        321
Log Fluence Mean   -5.6670    -4.8131    -6.3657
Log Fluence SD     0.4088     0.5301     0.5812
Log HR321 Mean     0.0538     0.2949     0.5478
Log HR321 SD       0.3018     0.1939     0.2766
Log T90 Mean       1.2709     1.7159     -0.3794
Log T90 SD         0.4906     0.3793     0.4825
10.5 Heuristics or Statistics?
Query and Visualization Techniques
• Query tools • OLAP tools • Visualization tools
Machine Learning and Statistical Techniques
1. Statistical techniques typically assume an underlying distribution for the data whereas machine learning techniques do not.
2. Machine learning techniques tend to have a human flavor.
3. Machine learning techniques are better able to deal with missing and noisy data.
4. Most machine learning techniques are able to explain their behavior.
5. Statistical techniques tend to perform poorly with large-sized data.
Specialized Techniques
Chapter 11
11.1 Time-Series Analysis
Time-series problems: prediction applications with one or more time-dependent attributes.
An Example with Linear Regression
The Stock Index Dataset
Table 11.1 • Weekly Average Closing Prices for the Nasdaq and Dow Jones Industrial Average
Week    Nasdaq Average  Dow Average  Nasdaq-1 Average  Dow-1 Average  Nasdaq-2 Average  Dow-2 Average
200003  4176.75         11413.28     3968.47           11587.96       3847.25           11224.10
200004  4052.01         10967.60     4176.75           11413.28       3968.47           11587.96
200005  4104.28         10992.38     4052.01           10967.60       4176.75           11413.28
200006  4398.72         10726.28     4104.28           10992.38       4052.01           10967.60
200007  4445.53         10506.68     4398.72           10726.28       4104.28           10992.38
200008  4535.15         10121.31     4445.53           10506.68       4398.72           10726.28
200009  4745.58         10167.38     4535.15           10121.31       4445.53           10506.68
200010  4949.09         9952.52      4745.58           10167.38       4535.15           10121.31
200011  4742.40         10223.11     4949.09           9952.52        4745.58           10167.38
200012  4818.01         10937.36     4742.40           10223.11       4949.09           9952.52
Linear Regression Equations for the Stock Index Dataset
$\text{Nasdaq Average} = 1.279\,(\text{Nasdaq-1 Average}) - 0.330\,(\text{Nasdaq-2 Average}) + 179.297$
$\text{Dow Average} = 0.932\,(\text{Dow-1 Average}) - 0.303\,(\text{Dow-2 Average}) + 3967.093$
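As a rough check of how such a lagged equation is applied, the sketch below evaluates the Dow equation, read as 0.932 (Dow-1) - 0.303 (Dow-2) + 3967.093. Two things are assumptions here: the minus sign on the second term, and the idea that the week 47 prediction uses the predicted values for weeks 46 and 45 (10883.32 and 10954.14 in Table 11.2) as its lagged inputs. With rounded coefficients the result lands within a few points of the tabulated 10793.97.

```python
def predict_dow(dow_lag1, dow_lag2):
    """Dow weekly average from the two previous weekly averages."""
    return 0.932 * dow_lag1 - 0.303 * dow_lag2 + 3967.093

# Assumed lagged inputs: the predicted averages for weeks 46 and 45.
week47 = predict_dow(10883.32, 10954.14)
print(round(week47, 2))
```

Feeding predictions back in as inputs lets errors compound, which is one reason the table's errors grow as the forecast horizon lengthens.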
Table 11.2 • Actual and Predicted Nasdaq and Dow Closing Prices
      Nasdaq Average Close            Dow Average Close
Week  Actual   Predicted  Error      Actual    Predicted  Error
45    3258.61  3408.19    -149.58    10854.73  10954.14   -99.41
46    3065.91  3477.37    -411.46    10638.36  10883.32   -244.96
47    2851.70  3544.05    -692.35    10456.68  10793.97   -337.30
48    2713.12  3602.48    -889.36    10494.16  10731.85   -237.69
49    2615.75  3651.73    -1035.98   10560.95  10701.07   -140.12
A Neural Network Example
Table 11.3 • Week 49 Predicted Nasdaq Average Weekly Closing Prices
        Training Data Limited to Nasdaq Attributes   Training Data Includes Dow Jones Attributes
Epochs  RMS    Predicted Value  Prediction Error     RMS    Predicted Value  Prediction Error
10000   0.084  3359             744                  0.035  2693             22
20000   0.045  2598             -17                  0.033  2652             37
30000   0.032  2672             57                   0.033  2712             97
Categorical Attribute Prediction
Table 11.4 • Average Weekly Closings / Categorical Output
Week    Nasdaq Average Close  Dow Average Close  Nasdaq-1 Average Close  Dow-1 Average Close  Nasdaq-2 Average Close  Dow-2 Average Close  Nasdaq Gain/Loss  Dow Gain/Loss
200003  4176.75               11413.28           3968.47                 11587.96             3847.25                 11224.10             Gain              Loss
200004  4052.01               10967.60           4176.75                 11413.28             3968.47                 11587.96             Loss              Loss
200005  4104.28               10992.38           4052.01                 10967.6              4176.75                 11413.28             Gain              Gain
200006  4398.72               10726.28           4104.28                 10992.38             4052.01                 10967.60             Gain              Loss
200007  4445.53               10506.68           4398.72                 10726.28             4104.28                 10992.38             Gain              Loss
200008  4535.15               10121.31           4445.53                 10506.68             4398.72                 10726.28             Gain              Loss
200009  4745.58               10167.38           4535.15                 10121.31             4445.53                 10506.68             Gain              Gain
200010  4949.09               9952.516           4745.58                 10167.38             4535.15                 10121.31             Gain              Loss
200011  4742.40               10223.11           4949.09                 9952.516             4745.58                 10167.38             Loss              Gain
200012  4818.01               10937.36           4742.40                 10223.11             4949.09                 9952.516             Loss              Gain
General Considerations
• Test and modify created models as new data becomes available.
• Try one or more data transformations if less than optimal results are obtained.
• Exercise caution when predicting future outcome with training data having several predicted fields.
• Try a nonlinear model if a linear model offers poor results.
• Use unsupervised clustering to determine if input attribute values allow the output attribute to cluster into meaningful categories.
11.2 Mining the Web
Web-Based Mining: General Issues
• Clickstreams
• Extended Common Log File Format • Session Files • User Sessions • Pageviews • Cookies
[Figure: Web Server Logs → Data Preparation → Session File → Data Mining Algorithm(s) → Learner Model]
Figure 11.1 A generic Web usage model
Data Mining for Web Site Evaluation
Sequence miners are special data mining programs able to discover frequently accessed Web pages that occur in the same order.
Data Mining for Personalization
[Figure: Session Data → Data Mining Algorithm(s) → Clusters → Usage Profiles]
Figure 11.2 Creating usage profiles from session data
[Figure: Usage Profiles and User Navigational Activity → Recommendation Engine → Recommended Hypertext Links]
Figure 11.3 Hypertext link recommendations from usage profiles
Data Mining for Web Site Adaptation
The index synthesis problem: Given a Web site and a visitor access log, create new index pages containing collections of links to related but currently unlinked pages.
11.3 Mining Textual Data
• Train: Create an attribute dictionary. • Filter: Remove common words. • Classify: Classify new documents.
11.4 Improving Performance
• Bagging • Boosting • Instance Typicality
Table 11.5 • Test Set Accuracy Scores for Typical and Atypical Training Data
               Decision Tree  Bagging  Boosting
Most Typical   76.57%         76.89%   76.56%
Least Typical  49.84%         49.17%   48.84%
Table 11.6 • Classification Correctness: Bagging, Boosting, and Typicality
                  Single Model  Bagging  Boosting  Typicality
Decision Tree     82.83%        86.47%   88.45%    85.81%
Bayes Classifier  87.79%        86.14%   84.82%    85.81%
Part IV
Intelligent Systems
Rule-Based Systems
Chapter 12
12.1 Exploring Artificial Intelligence
Table 12.1 • Areas of Study Within Artificial Intelligence
Scheduling Problems
Expert Systems
Game Playing
Intelligent Agents
Intelligent Database Retrieval
Intelligent Tutoring Systems
Machine Learning
Natural Language Processing
Planning
Robotics and Computer Vision
Speech Recognition
Theorem Proving
Nearest Neighbor Heuristic
When conducting a state-space search, always move to the next closest state.
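A minimal sketch of the heuristic applied to a four-city tour. The distance table below is hypothetical (the figures' edge-to-city assignments are not recoverable from the text); it was chosen so that no two candidate moves tie.

```python
def nearest_neighbor_tour(distances, start):
    """Always move to the closest unvisited city, then return to the start."""
    tour = [start]
    unvisited = set(distances) - {start}
    while unvisited:
        current = tour[-1]
        nxt = min(unvisited, key=lambda city: distances[current][city])
        tour.append(nxt)
        unvisited.remove(nxt)
    tour.append(start)
    return tour

def tour_length(distances, tour):
    return sum(distances[a][b] for a, b in zip(tour, tour[1:]))

# Hypothetical symmetric distances between cities A, B, C and D.
DISTANCES = {
    "A": {"B": 12, "C": 5, "D": 10},
    "B": {"A": 12, "C": 15, "D": 6},
    "C": {"A": 5, "B": 15, "D": 4},
    "D": {"A": 10, "B": 6, "C": 4},
}

tour = nearest_neighbor_tour(DISTANCES, "A")
length = tour_length(DISTANCES, tour)
print(tour, length)
```

The heuristic is greedy: each choice is locally cheapest, which does not guarantee the shortest overall tour.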
[Figure: graph of cities A, B, C and D with edge weights 3, 5, 9, 10, 12 and 15]
Figure 12.1 Starting at city A, the nearest neighbor heuristic finds a
[Figure: graph of cities A, B, C and D with edge weights 4, 5, 6, 10, 12 and 15]
Figure 12.2 The nearest neighbor heuristic chooses the path A-C-D-B-
The Water Jug Problem
Table 12.2 • Rules for the Water Jug Problem
Action | Required Conditions | Resultant State
1. Fill the 4-gallon jug. | The 4-gallon jug is not full. | (4,y)
2. Fill the 3-gallon jug. | The 3-gallon jug is not full. | (x,3)
3. Empty the 4-gallon jug onto the ground. | The 4-gallon jug is not empty. | (0,y)
4. Empty the 3-gallon jug onto the ground. | The 3-gallon jug is not empty. | (x,0)
5. Pour water from the 3-gallon jug into the 4-gallon jug until the 4-gallon jug is full. | The total amount of water in both jugs is >= 4 and the 3-gallon jug is not empty. | (4, y – (4 – x))
6. Pour water from the 4-gallon jug into the 3-gallon jug until the 3-gallon jug is full. | The total amount of water in both jugs is >= 3 and the 4-gallon jug is not empty. | (x – (3 – y), 3)
7. Pour all the water from the 3-gallon jug into the 4-gallon jug. | The total amount of water in both jugs is <= 4 and the 3-gallon jug is not empty. | (x + y, 0)
8. Pour all the water from the 4-gallon jug into the 3-gallon jug. | The total amount of water in both jugs is <= 3 and the 4-gallon jug is not empty. | (0, x + y)
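The eight rules translate directly into a successor function over states (x, y), where x is the content of the 4-gallon jug and y that of the 3-gallon jug. The sketch below searches breadth-first from (0,0); the goal of exactly 2 gallons in the 4-gallon jug is an assumption based on the final state shown in Figure 12.4.

```python
from collections import deque

def successors(state):
    """Apply every rule of Table 12.2 whose required conditions hold."""
    x, y = state  # x: 4-gallon jug, y: 3-gallon jug
    moves = []
    if x < 4:
        moves.append((1, (4, y)))
    if y < 3:
        moves.append((2, (x, 3)))
    if x > 0:
        moves.append((3, (0, y)))
    if y > 0:
        moves.append((4, (x, 0)))
    if x + y >= 4 and y > 0:
        moves.append((5, (4, y - (4 - x))))
    if x + y >= 3 and x > 0:
        moves.append((6, (x - (3 - y), 3)))
    if x + y <= 4 and y > 0:
        moves.append((7, (x + y, 0)))
    if x + y <= 3 and x > 0:
        moves.append((8, (0, x + y)))
    return moves

def solve(start=(0, 0), goal_x=2):
    """Breadth-first search; returns (rules applied, final state) or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, rules = queue.popleft()
        if state[0] == goal_x:
            return rules, state
        for rule, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, rules + [rule]))
    return None

rules, final = solve()
print(rules, final)
```

The `seen` set prevents revisiting states, which a naive depth-first search would need to avoid cycling between (0,0) and (4,0).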
[Figure: state-space tree with nodes A through J]
Figure 12.3 A hypothetical state space representation
Depth-First Search
A-B-E-F-C-G-I-J-H-D
Breadth-First Search
A-B-C-D-E-F-G-H-I-J
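Both visiting orders can be reproduced with a small tree. The structure below (A has children B, C, D; B has E, F; C has G, H; G has I, J) is inferred from the two orders given on the slides, since the figure itself is not available.

```python
from collections import deque

# Tree structure inferred from the depth-first and breadth-first orders.
TREE = {
    "A": ["B", "C", "D"],
    "B": ["E", "F"],
    "C": ["G", "H"],
    "G": ["I", "J"],
}

def depth_first(node):
    """Visit a node, then fully explore each child from left to right."""
    order = [node]
    for child in TREE.get(node, []):
        order.extend(depth_first(child))
    return order

def breadth_first(root):
    """Visit all nodes at one depth before moving to the next depth."""
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(TREE.get(node, []))
    return order

dfs = "-".join(depth_first("A"))
bfs = "-".join(breadth_first("A"))
print(dfs)
print(bfs)
```

The only difference between the two is the data structure holding pending nodes: a stack (here the call stack) for depth-first, a queue for breadth-first.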
[Figure: depth-first search tree over jug states, built up in steps 1 to 6 from the initial state (0,0). Solution path: (0,0) → Rule 1 → (4,0) → Rule 6 → (1,3) → Rule 4 → (1,0) → Rule 8 → (0,1) → Rule 1 → (4,1) → Rule 6 → (2,3)]
Figure 12.4 A depth-first solution for the water jug problem
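[Figure: state space rooted at (0,0). Transitions: (0,0) → Rule 1 → (4,0); (0,0) → Rule 2 → (0,3); (4,0) → Rule 2 → (4,3); (4,0) → Rule 6 → (1,3); (0,3) → Rule 7 → (3,0); (1,3) → Rule 4 → (1,0); (3,0) → Rule 2 → (3,3); (1,0) → Rule 8 → (0,1); (3,3) → Rule 5 → (4,2); (0,1) → Rule 1 → (4,1); (4,2) → Rule 3 → (0,2); (4,1) → Rule 6 → (2,3); (0,2) → Rule 7 → (2,0)]
Figure 12.5 The complete state space for the water jug problem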
Table 12.3 • A Hypothetical Set of Production Rules
Production Rules
1. If a then b
2. If c and d then x
3. If x and y then g
4. If w then y
5. If w then d
6. If b then y
Known Facts
a, c
Backward Chaining
Creating a Goal Tree
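[Figure: successive goal trees for the goal g, expanded by applying Rule 3, Rule 4, Rule 6, Rule 1, Rule 2 and Rule 5. The final tree has g depending on y and x; y on w or on b (which depends on a); x on c and d (which depends on w)]
Figure 12.6 Creating a goal tree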
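Backward chaining over the production rules of Table 12.3 can be sketched as a recursive proof procedure. The rule set is read as: 1. a → b; 2. c and d → x; 3. x and y → g; 4. w → y; 5. w → d; 6. b → y (a reconstruction of the flattened table). The function name `prove` is illustrative.

```python
# Each rule is (set of conditions, conclusion).
RULES = [
    ({"a"}, "b"),
    ({"c", "d"}, "x"),
    ({"x", "y"}, "g"),
    ({"w"}, "y"),
    ({"w"}, "d"),
    ({"b"}, "y"),
]

def prove(goal, facts, seen=frozenset()):
    """Work backward from the goal to known facts through the rules."""
    if goal in facts:
        return True
    if goal in seen:  # avoid circular reasoning
        return False
    for conditions, conclusion in RULES:
        if conclusion == goal and all(
            prove(c, facts, seen | {goal}) for c in conditions
        ):
            return True
    return False

known = {"a", "c"}
y_proved = prove("y", known)         # via Rule 6 and Rule 1
g_proved = prove("g", known)         # fails: d needs w, which is unknown
g_with_w = prove("g", known | {"w"})
print(y_proved, g_proved, g_with_w)
```

With only a and c known, the goal g cannot be established; an expert system would typically ask the user about w at this point.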
Expert Systems
[Figure: components User, User Interface, Inference Engine and Knowledge Base]
Figure 12.7 An expert system architecture
Developing an Expert System
[Figure: cycle linking Problem Definition, Knowledge Acquisition, Knowledge Representation, Knowledge Programming, and Testing & Evaluation, with transitions numbered 1 to 4]
Figure 12.8 The expert system development cycle
Structuring A Rule-Based System
Form 1040 Tax Dependency
Person is a Dependent
– Relationship Test is Satisfied
– Joint Return Test is Satisfied
– Citizen or Resident Test is Satisfied
– Income Test is Satisfied
– Support Test is Satisfied
Figure 12.9 A first-level goal tree for dependency exemption
Person is a Dependent Rule 1
Relationship Test is Satisfied
Joint Return Test is Satisfied
Citizen or Resident Test is Satisfied Rule 2
Income Test is Satisfied
Support Test is Satisfied
Resident Test is Satisfied Rule 3 Citizen/Alien is Satisfied Rule 5 United States Citizen Resident Alien is Satisfied Rule 7 Presence Test is Satisfied Rule 9 31 Days Current Year 183 Days During Last 3 Years Green Card Test is Satisfied Resident of Canada United States Resident Not United States Resident
Non-Resident Test is Satisfied Rule 4 Country/Child is Satisfied Rule 6 C/M Test is Satisfied Rule 8 Resident of Mexico Foreign Test is Satisfied Rule 11 Lived in Foreign Country Lived with Taxpayer Entire Year Adopted Child Test is Satisfied Rule 10 Child is Adopted
Figure 12.10 A goal tree for the citizen/resident test
Choosing a Data Mining Technique
Technique Selected
Supervised Strategy Selected
Unsupervised Strategy Selected
Association Rule Strategy Selected
...
Learning is Supervised Supervised Model Chosen
...
Decision Tree is Selected
Linear Regression is Selected
...
Backpropagation Learning is Selected
Output Constraints are Satisfied
Desirable Criteria Test is Satisfied
Output Attribute is Categorical
Output Attribute is Singular
Explanation is Required
Production Rules are Required
Input Attribute Criteria are Satisfied
Some Input Attributes Are Categorical
Data Distribution Test is Satisfied
Data Distribution is Unknown
Data Distribution is Not Normal
Figure 12.11 A partial goal tree for choosing a data mining technique
Managing Uncertainty in Rule-Based Systems
Chapter 13
13.1 Uncertainty: Sources and Solutions
Sources of Uncertainty
Rule 1: Large Package Rule
IF package size is large
THEN send package UPS
Sources of Uncertainty
• Rule Antecedent • Rule Confidence • Combining Uncertain Information
General Methods for Dealing with Uncertainty
• Probability-Based Methods • Heuristic Methods
Probability-Based Methods
• Objective Probability • Experimental Probability • Subjective Probability
Heuristic Methods
• Certainty Factors • Fuzzy Logic
13.2 Fuzzy Rule-Based Systems
Fuzzy Sets
A set associated with a linguistic value that gives the degree of membership for a numerical value.
[Figure: four plots of Degree of Membership, one each for far_from X, close_to X, approaching X and moving_from X]
Figure 13.1 Four membership functions for distance from x
Fuzzy Reasoning: An Example
1. Fuzzification
2. Rule Inference
3. Rule Composition
4. Defuzzification
[Figure: three Degree of Membership plots. Age: Young, Middle-Aged, Old (age 33 marked). Previous_Accepts: Few, Some, Several. Life_Insurance_Accept: Low, Moderate, High]
Figure 13.2 Fuzzy sets for age, previous_accepts and
[Figure: Degree of Membership (0 to 1) for the clipped consequents (a) Life_Insurance_Accept = High and (b) Life_Insurance_Accept = Moderate]
Figure 13.3 Clipping rule consequent membership functions
[Figure: Degree of Membership (0 to 1) over the range 25 to 100]
Figure 13.4 A fuzzy set created by rule composition
13.3 A Probability-Based Approach to Uncertainty
Bayes Theorem
$P(H \mid E) = \dfrac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid {\sim}H)\,P({\sim}H)}$
where
P(H) is the a priori probability that the hypothesis to be tested is true
P(~H) is the a priori probability that H is false, P(~H) = 1 – P(H)
E is the evidence associated with the hypothesis. The denominator expression represents P(E)
P(E | H) is the conditional probability of the evidence knowing the hypothesis is true
P(E | ~H) is the conditional probability of the evidence knowing the hypothesis is false
Equation 13.3
Table 13.1 • Life Insurance Promotion Data
Instance #  Age  Previous Accepts  Life Insurance Promotion
1           25   2                 Yes
2           33   4                 Yes
3           19   1                 Yes
4           43   5                 No
5           35   1                 No
6           26   3                 Yes
7           50   2                 No
8           24   2                 Yes
9           20   0                 No
10          62   3                 No
11          36   5                 Yes
12          27   0                 No
13          28   1                 No
14          25   3                 Yes
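As a worked illustration of Bayes theorem on Table 13.1, the sketch below takes H to be "life insurance promotion = yes" and, purely as a hypothetical choice of evidence, E to be "age is 30 or less". The threshold of 30 is not from the slides.

```python
from fractions import Fraction

# Table 13.1: age and life insurance promotion outcome for 14 instances.
AGES = [25, 33, 19, 43, 35, 26, 50, 24, 20, 62, 36, 27, 28, 25]
ACCEPTED = ["Yes", "Yes", "Yes", "No", "No", "Yes", "No",
            "Yes", "No", "No", "Yes", "No", "No", "Yes"]

def posterior(p_e_given_h, p_h, p_e_given_not_h):
    """Bayes theorem with the denominator expanded over H and not-H."""
    numerator = p_e_given_h * p_h
    return numerator / (numerator + p_e_given_not_h * (1 - p_h))

yes_ages = [a for a, r in zip(AGES, ACCEPTED) if r == "Yes"]
no_ages = [a for a, r in zip(AGES, ACCEPTED) if r == "No"]

p_h = Fraction(len(yes_ages), len(AGES))
p_e_h = Fraction(sum(a <= 30 for a in yes_ages), len(yes_ages))
p_e_not_h = Fraction(sum(a <= 30 for a in no_ages), len(no_ages))

p_h_given_e = posterior(p_e_h, p_h, p_e_not_h)
print(p_h, p_e_h, p_e_not_h, p_h_given_e)
```

The answer agrees with a direct count: of the eight instances aged 30 or less, five accepted the promotion.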
Multiple Evidence with Bayes Theorem
$P(H_i \mid E_1 \,\&\, E_2 \,\&\, \ldots \,\&\, E_j) = \dfrac{P(E_1 \,\&\, E_2 \ldots \&\, E_j \mid H_i)\,P(H_i)}{\sum_{k=1}^{n} P(E_1 \,\&\, E_2 \ldots \&\, E_j \mid H_k)\,P(H_k)}$
Equation 13.4
Multiple Evidence with Bayes Theorem
$P(H_i \mid E_1 \,\&\, E_2 \,\&\, \ldots \,\&\, E_j) = \dfrac{P(E_1 \mid H_i)\,P(E_2 \mid H_i) \cdots P(E_j \mid H_i)\,P(H_i)}{\sum_{k=1}^{n} P(E_1 \mid H_k)\,P(E_2 \mid H_k) \cdots P(E_j \mid H_k)\,P(H_k)}$
Equation 13.5
Likelihood Ratios: Necessity and Sufficiency
$LS = \dfrac{P(E \mid H)}{P(E \mid {\sim}H)} \qquad LN = \dfrac{P({\sim}E \mid H)}{P({\sim}E \mid {\sim}H)}$
Equation 13.6
General Considerations
• P(H|E) + P(~H|E) must sum to 1. • Conditional independence between multiple pieces of evidence must be assumed. • Prior Probabilities are often unobtainable. • Large amounts of data must be gathered to obtain reasonable estimates for conditional probabilities.
Intelligent Agents
Chapter 14
14.1 Characteristics of Intelligent Agents
• Situatedness • Autonomy • Adaptivity • Sociability
14.2 Types of Agents
•Anticipatory agents • Filtering agents • Semiautonomous agents • Find-and-retrieve agents • User agents • Monitor and Surveillance agents • Data Mining agents • Proactive agents • Cooperative agents
14.3 Integrating Data Mining, Expert Systems and Intelligent Agents
[Figure: components User, Natural Language Interface, Data Mining Agent, Rule-Based Parameter Selector, Rule-Based Model Selector, Rule-Based Domain Analyzer, Data Mining Tool and Data Mining Session. Labeled flows include Input Data, Previous Settings, New Settings, Model and Attribute Choice Information, Selected Data, Summary Report, Parameter Settings, Results, and Clean & Transformed Data]
Figure 14.1 An agent-based model for data mining
The iDA Software
Appendix A
Datasets for Data Mining
Appendix B
Decision Tree Attribute Selection
Appendix C
Computing Gain Ratio
$\text{GainRatio}(A) = \text{Gain}(A) \,/\, \text{Split Info}(A)$
Equation C.1
Computing Gain(A)
$\text{Gain}(A) = \text{Info}(I) - \text{Info}(I, A)$
Equation C.2
Computing Info(I)
$\text{Info}(I) = -\sum_{i=1}^{n} \dfrac{\#\text{ in class } i}{\#\text{ in } I}\, \log_2\!\left(\dfrac{\#\text{ in class } i}{\#\text{ in } I}\right)$
Equation C.3
Computing Info(I,A)
$\text{Info}(I, A) = \sum_{j=1}^{k} \dfrac{\#\text{ in class } j}{\#\text{ in } I}\, \text{Info}(\text{class } j)$
Equation C.4
Computing Split Info(A)
$\text{Split Info}(A) = -\sum_{j=1}^{k} \dfrac{\#\text{ in class } j}{\#\text{ in } I}\, \log_2\!\left(\dfrac{\#\text{ in class } j}{\#\text{ in } I}\right)$
Equation C.5
Table C.1 • The Credit Card Promotion Database
Income Range  Life Insurance Promotion  Credit Card Insurance  Sex     Age
40–50K        No                        No                     Male    45
30–40K        Yes                       No                     Female  40
40–50K        No                        No                     Male    42
30–40K        Yes                       Yes                    Male    43
50–60K        Yes                       No                     Female  38
20–30K        No                        No                     Female  55
30–40K        Yes                       Yes                    Male    35
20–30K        No                        No                     Male    27
30–40K        No                        No                     Male    43
30–40K        Yes                       No                     Female  41
40–50K        Yes                       No                     Female  43
20–30K        Yes                       No                     Male    29
50–60K        Yes                       No                     Female  39
40–50K        No                        No                     Male    55
20–30K        Yes                       Yes                    Female  19
Income Range
20-30K: 2 Yes, 2 No
30-40K: 4 Yes, 1 No
40-50K: 1 Yes, 3 No
50-60K: 2 Yes
Figure C.1 A partial decision tree with root node = income range
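Equations C.1 to C.5 can be traced on the split shown in Figure C.1. The sketch below computes Gain, Split Info and GainRatio for income range with life insurance promotion as the output attribute, assuming base-2 logarithms (the base does not affect the final ratio). Function names are illustrative.

```python
import math
from collections import Counter

# (income range, life insurance promotion) pairs from Table C.1.
ROWS = [
    ("40-50K", "No"), ("30-40K", "Yes"), ("40-50K", "No"), ("30-40K", "Yes"),
    ("50-60K", "Yes"), ("20-30K", "No"), ("30-40K", "Yes"), ("20-30K", "No"),
    ("30-40K", "No"), ("30-40K", "Yes"), ("40-50K", "Yes"), ("20-30K", "Yes"),
    ("50-60K", "Yes"), ("40-50K", "No"), ("20-30K", "Yes"),
]

def info(labels):
    """Info(I): entropy of the class labels (Equation C.3)."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain_ratio(rows):
    """Return (Gain(A), Split Info(A), GainRatio(A)) for the attribute."""
    labels = [label for _, label in rows]
    groups = {}
    for value, label in rows:
        groups.setdefault(value, []).append(label)
    n = len(rows)
    info_after = sum(len(g) / n * info(g) for g in groups.values())   # C.4
    gain = info(labels) - info_after                                  # C.2
    split_info = -sum(len(g) / n * math.log2(len(g) / n)
                      for g in groups.values())                       # C.5
    return gain, split_info, gain / split_info                        # C.1

gain, split_info, ratio = gain_ratio(ROWS)
print(round(gain, 4), round(split_info, 4), round(ratio, 4))
```

Dividing by Split Info penalizes attributes that fragment the data into many small groups, which plain information gain tends to favor.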
Statistics for Performance Evaluation
Appendix D
D.1 Single-Valued Summary Statistics
Computing the Mean
$\mu = \dfrac{1}{n} \sum_{i=1}^{n} x_i$
where
μ is the mean value
n is the number of data items
x_i is the ith data item
Equation D.1
Computing the Variance
$\sigma^2 = \dfrac{1}{n} \sum_{i=1}^{n} (\mu - x_i)^2$
where
σ² is the variance
μ is the population mean
n is the number of data items
x_i is the ith data item
Equation D.2
D.2 The Normal Distribution
The Normal Curve
$f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / (2\sigma^2)}$
where
f(x) is the height of the curve corresponding to values of x
e is the base of natural logarithms approximated by 2.718282
μ is the arithmetic mean for the data
σ is the standard deviation
Equation D.3
D.3 Comparing Supervised Learner Models
• Comparing Models with Independent Test Data • Pairwise Comparison with a Single Test Set
Comparing Models with Independent Test Data
$P = \dfrac{|E_1 - E_2|}{\sqrt{v_1/n_1 + v_2/n_2}}$
where
Two independent test sets are used: set A containing n1 elements and set B containing n2 elements
E1 and v1 are the error rate and variance for model M1 on test set A
E2 and v2 are the error rate and variance for model M2 on test set B
Equation D.4
Pairwise Comparison with a Single Test Set
Computing Joint Variance for a Single Test Set
$V_{12} = \dfrac{1}{n - 1} \sum_{i=1}^{n} \left[ (e_{1i} - e_{2i}) - (E_1 - E_2) \right]^2$
where
V12 is the joint variance
e1i is the classifier error on the ith instance for learner model M1
e2i is the classifier error on the ith instance for learner model M2
E1 – E2 is the overall classifier error rate for model M1 minus the classifier error rate for model M2
n is the total number of test set instances
Equation D.5
Pairwise Comparison with a Single Test Set
$P = \dfrac{|E_1 - E_2|}{\sqrt{V_{12}/n}}$
where
V12 is the joint variance as computed on the previous slide
E1 – E2 is the overall classifier error rate for model M1 minus the classifier error rate for model M2
n is the total number of test set instances
Equation D.6
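Equations D.5 and D.6 can be sketched together. The code below treats each classifier error as 1 for a misclassified test instance and 0 otherwise, and takes the absolute value of the error-rate difference in the numerator; the two error vectors are made up for illustration.

```python
import math

def pairwise_p(errors1, errors2):
    """Return (P, V12) for two models scored on the same test set."""
    n = len(errors1)
    rate1 = sum(errors1) / n          # E1
    rate2 = sum(errors2) / n          # E2
    diff = rate1 - rate2
    # Joint variance of the per-instance error differences (Equation D.5).
    v12 = sum(((e1 - e2) - diff) ** 2
              for e1, e2 in zip(errors1, errors2)) / (n - 1)
    # Significance score (Equation D.6).
    return abs(diff) / math.sqrt(v12 / n), v12

# Hypothetical per-instance errors (1 = misclassified) on ten test instances.
model1 = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
model2 = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
p_score, v12 = pairwise_p(model1, model2)
print(round(p_score, 3), round(v12, 4))
```

Because both models see the same instances, the joint variance accounts for errors they share, which the independent-test-set formula cannot do.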
D.4 Confidence Intervals for Numeric Output
$\text{variance}(mae) = \dfrac{1}{n - 1} \sum_{i=1}^{n} (e_i - mae)^2$
where
e_i is the absolute error for the ith instance
n is the number of instances
Equation D.7
D.5 Comparing Models with Numeric Output
• Independent Test Sets • Pairwise Comparison with a Single Test Set • Overall Comparison with a Single Test Set
Comparing Models with Independent Test Sets
$P = \dfrac{|mae_1 - mae_2|}{\sqrt{v_1/n_1 + v_2/n_2}}$
where
mae1 is the mean absolute error for model M1
mae2 is the mean absolute error for model M2
v1 and v2 are variance scores associated with M1 and M2
n1 and n2 are the number of instances within each respective test set
Equation D.8
Pairwise Comparison with a Single Test Set
$P = \dfrac{|mae_1 - mae_2|}{\sqrt{V_{12}/n}}$
where
mae1 is the mean absolute error for model M1
mae2 is the mean absolute error for model M2
V12 is the joint variance computed with the formula defined in Equation D.5
n is the number of test set instances
Equation D.9
Overall Comparison with a Single Test Set
$\text{variance}(mae_j) = \dfrac{1}{n - 1} \sum_{i=1}^{n} (e_i - mae_j)^2$
where
mae_j is the mean absolute error for model j
e_i is the absolute value of the computed value minus the actual value for instance i
n is the number of test set instances
Equation D.10
Overall Comparison with a Single Test Set
$P = \dfrac{|mae_1 - mae_2|}{\sqrt{v\,(2/n)}}$
where
v is either the average or the larger of the variance scores for each model
n is the total number of test set instances
Equation D.11
Excel Pivot Tables: Office 97
Appendix E
Figure E.1 Pivot table wizard step 3
Figure E.2 Calculating a summary report
Figure E.3 A summary report of income range
Figure E.4 A pie chart for income range
Figure E.5 A pivot table showing age and credit card insurance choice
Figure E.6 Grouping the credit card promotion data by age
[Figure: cube with axes Watch Promo (No/Yes), Life Insurance Promo (Yes/No) and Magazine Promo (Yes/No); the highlighted cell is Watch Promo = No, Life Insurance Promo = Yes, Magazine Promo = Yes]
Figure E.7 A credit card promotion cube
Figure E.8 A pivot table with page variables
Figure E.9 A pivot table with page variables for credit card promotions