Bayesian Classification


Data Analysis and Mining
4. Data Classification (Part II)
Chunping Li
cli@tsinghua.edu.cn
2006-6-13

Outline:
- Bayesian Classification
- Instance based Classification
- Classification Accuracy and Evaluation
- Summary

Bayesian Classification
- Statistical classification approach.
- Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
- Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities.
- Bayesian classifiers: statistical classifiers that also achieve high accuracy and speed when applied to large databases.

Bayesian Theorem
Given: the classified sample set D (the domain, for instance fruits)
- X: a sample with unknown class label (attributes: red and round)
- H: some hypothesis (such as "X belongs to the class Apple")
- P(X|H): posterior probability of X conditioned on H (given that X is an apple, the probability that X is red and round)
- P(H): prior probability of H (the probability that any given sample is an apple)
- P(X): prior probability of X (the probability that any given sample is red and round)

For the classification problem, determine P(H|X): the probability that H holds given the observed sample X.

Bayesian theorem:  P(H|X) = P(X|H) P(H) / P(X)
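As a quick numerical illustration of the theorem, the fruit example above can be worked through in a few lines. The probability values here are made-up illustrative numbers, not taken from the slides:

```python
# Bayes theorem on the fruit example: H = "sample is an Apple",
# X = "sample is red and round". All three inputs are assumed
# illustrative values, chosen only to make the arithmetic visible.
p_h = 0.3           # P(H): prior probability that a sample is an apple
p_x_given_h = 0.9   # P(X|H): an apple is usually red and round
p_x = 0.45          # P(X): probability any sample is red and round

# P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)  # -> 0.6 (up to floating-point rounding)
```

Observing "red and round" raises the probability of "Apple" from the prior 0.3 to 0.6, because red-and-round samples are more common among apples than overall.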

Bayesian Classifiers
- Consider each attribute and the class label as random variables.
- Given a record with attributes (A1, A2, ..., An) and m classes, i.e., Cj in {C1, C2, ..., Cm}.
- Goal: predict the class Cj. Specifically, we want to find the value of Cj that maximizes P(Cj | A1, A2, ..., An).
- Can we estimate P(Cj | A1, A2, ..., An) directly from data?

Approach:
- Compute the posterior probability P(Cj | A1, A2, ..., An) for all values of Cj using the Bayes theorem:

    P(Cj | A1 A2 ... An) = P(A1 A2 ... An | Cj) P(Cj) / P(A1 A2 ... An)

- Choose the value of Cj that maximizes P(Cj | A1, A2, ..., An).
- This is equivalent to choosing the value of Cj that maximizes P(A1, A2, ..., An | Cj) P(Cj), since the denominator P(A1 A2 ... An) is the same for every class.
- How to estimate P(A1, A2, ..., An | Cj)?

Naïve Bayesian Classification
- Class conditional independence assumption: assume independence among the attributes Ai when the class is given:

    P(A1, A2, ..., An | Cj) = P(A1|Cj) P(A2|Cj) ... P(An|Cj)

- Can estimate P(Ai|Cj) for all Ai and Cj.
- A new point (record) is classified to Cj if P(Cj) Π P(Ai|Cj) is maximal.

Training data (buys_computer example):

    age      income  student  credit_rating  buys_computer
    <=30     high    no       fair           no
    <=30     high    no       excellent      no
    31...40  high    no       fair           yes
    >40      medium  no       fair           yes
    >40      low     yes      fair           yes
    >40      low     yes      excellent      no
    31...40  low     yes      excellent      yes
    <=30     medium  no       fair           no
    <=30     low     yes      fair           yes
    >40      medium  yes      fair           yes
    <=30     medium  yes      excellent      yes
    31...40  medium  no       excellent      yes
    31...40  high    yes      fair           yes
    >40      medium  no       excellent      no

Example
Two classes: C1: buys_computer = "yes", C2: buys_computer = "no"
Unclassified sample X: (age = "<=30", income = "medium", student = "yes", credit_rating = "fair")

For each P(Ci):
P(C1) = 9/14 = 0.643
P(C2) = 5/14 = 0.357

For each P(X|Ci):
P(age="<=30" | C1) = 2/9 = 0.222
P(age="<=30" | C2) = 3/5 = 0.600
P(income="medium" | C1) = 4/9 = 0.444
P(income="medium" | C2) = 2/5 = 0.400
P(student="yes" | C1) = 6/9 = 0.667
P(student="yes" | C2) = 1/5 = 0.200
P(credit_rating="fair" | C1) = 6/9 = 0.667
P(credit_rating="fair" | C2) = 2/5 = 0.400

P(X | C1) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | C2) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
P(X | C1) x P(C1) = 0.044 x 0.643 = 0.028
P(X | C2) x P(C2) = 0.019 x 0.357 = 0.007

Conclusion: P(C1|X) > P(C2|X), so X is classified to C1: buys_computer = "yes".
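The worked example can be reproduced with a short script over the same 14-row table. `classify` is a hypothetical helper, a minimal sketch of the naïve Bayes rule of choosing the class that maximizes P(Cj) Π P(Ai|Cj):

```python
# Naive Bayes by counting, on the buys_computer table from the slides.
from collections import Counter

# Rows: (age, income, student, credit_rating, buys_computer)
data = [
    ("<=30", "high", "no", "fair", "no"),
    ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"),
    (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"),
    (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"),
    ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"),
    (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"),
    ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"),
    (">40", "medium", "no", "excellent", "no"),
]

def classify(x):
    """Return (class, score) maximizing P(C) * prod_i P(A_i = x_i | C)."""
    class_counts = Counter(row[-1] for row in data)
    n = len(data)
    best, best_score = None, -1.0
    for c, count in class_counts.items():
        score = count / n                      # prior P(C)
        rows_c = [row for row in data if row[-1] == c]
        for i, value in enumerate(x):
            match = sum(1 for row in rows_c if row[i] == value)
            score *= match / count             # class-conditional P(A_i | C)
        if score > best_score:
            best, best_score = c, score
    return best, best_score

x = ("<=30", "medium", "yes", "fair")
print(classify(x))  # scores match the slides: 0.028 for "yes" vs 0.007 for "no"
```

Running it reproduces the slide's arithmetic: the "yes" score is (9/14)(2/9)(4/9)(6/9)(6/9) ≈ 0.028 and the "no" score ≈ 0.007, so X is classified as buys_computer = "yes".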

The independence hypothesis ...
- ... makes computation possible
- ... yields optimal classifiers when satisfied
- ... but is possibly not satisfied in practice, as attributes (variables) are often correlated.

Attempts to overcome this limitation:
- Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes.
- Decision trees, which reason on one attribute at a time, considering the most important attributes first.

4. Data Classification (Part II): Instance based Classification

Instance-Based Classifiers
- Store the training records.
- Use the training records directly to predict the class label of unseen cases.
[Figure: a set of stored cases with attributes Atr1, ..., AtrN and class labels (A, B, C); an unseen case with the same attributes is matched against the stored cases.]

Instance-Based Methods
Typical approaches:
- k-nearest neighbor approach (kNN): instances are represented as points in a Euclidean space; uses the k "closest" points (nearest neighbors) to perform classification.
- Case-based reasoning (CBR): uses symbolic representations and knowledge-based inference.

Nearest-Neighbor Classifiers
Basic idea: if it walks like a duck and quacks like a duck, then it's probably a duck.

Requires three things:
- The set of stored records
- A distance metric to compute the distance between records
- The value of k, the number of nearest neighbors to retrieve

To classify an unknown record:
- Compute its distance to the training records
- Identify the k nearest neighbors
- Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
[Figure: a test record is compared against the training records; the k "nearest" (i.e., most similar) records are chosen.]

Definition of Nearest Neighbor
The k-nearest neighbors of a record x are the data points that have the k smallest distances to x.
[Figure: (a) 1-nearest neighbor, (b) 2-nearest neighbor, (c) 3-nearest neighbor neighborhoods around a test point X.]

Nearest Neighbor Classification
Compute the distance between two points, e.g., the Euclidean distance:

    d(p, q) = sqrt( sum_i (p_i - q_i)^2 )

Determine the class from the nearest-neighbor list: take the majority vote of the class labels among the k nearest neighbors.
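The distance computation and majority vote above can be sketched as a small classifier. The training points and labels below are invented toy data for illustration:

```python
# A minimal k-nearest-neighbor classifier: Euclidean distance plus
# majority vote, as described in the slides. Toy data is made up.
import math
from collections import Counter

def euclidean(p, q):
    # d(p, q) = sqrt(sum_i (p_i - q_i)^2)
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(train, x, k=3):
    """train: list of (point, label) pairs; x: query point."""
    # Sort stored records by distance to x and keep the k closest.
    neighbors = sorted(train, key=lambda rec: euclidean(rec[0], x))[:k]
    # Majority vote over the neighbors' class labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
         ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_classify(train, (1, 1), k=3))  # -> A (all 3 nearest points are A)
```

Note that this is a lazy learner in the sense discussed below: there is no training step at all, and every query pays the cost of scanning the stored records.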

Discussion on the k-NN Algorithm
- k-NN classifiers are lazy learners: they do not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems.
- Classifying unknown records is therefore relatively expensive.
- Robust to noisy data, since the prediction averages over the k nearest neighbors.

Transduction vs. Induction

Case-Based Reasoning (CBR)
- CBR uses a database of problem solutions to solve new problems.
- Stores symbolic descriptions (tuples or cases), not points in a Euclidean space.
- Some typical applications: customer service, legal rulings, etc.

Methodology:
- Instances are represented by rich symbolic descriptions (e.g., logical rules, natural language, ...).
- Search for similar cases; tight coupling between case retrieval, knowledge-based reasoning, and problem solving.

Challenges:
- Finding a good similarity metric
- Indexing based on a syntactic similarity measure

Other Classification Approaches
- Artificial neural networks
- Support vector machines
- Genetic algorithms
- Rough set approach
- Fuzzy set approach
- etc.

4. Data Classification (Part II): Classification Accuracy and Evaluation

Model Evaluation
- How to evaluate the performance of a model? Metrics for performance evaluation.
- How to obtain reliable estimates? Methods for performance evaluation.

Metrics for Performance Evaluation
- Focus on the predictive capability of a model (i.e., accuracy), rather than on how fast it classifies or builds models, etc.

Confusion matrix:

                              PREDICTED CLASS
                              Class=Yes    Class=No
    ACTUAL CLASS  Class=Yes   a (TP)       b (FN)
                  Class=No    c (FP)       d (TN)

a: TP (true positive), b: FN (false negative), c: FP (false positive), d: TN (true negative)

Most widely used metric:  Accuracy = (a + d) / (a + b + c + d)

Limitation of Accuracy
- Accuracy is only one measure (error = 1 - accuracy).
- Accuracy is not suitable in some applications. Consider a 2-class problem:
  - Actual number of Class 0 examples = 9990
  - Actual number of Class 1 examples = 10
  - If the model predicts everything to be Class 0, accuracy is 9990/10000 = 99.9%
  - i.e., accuracy is misleading because the model does not detect any Class 1 example.

Accuracy Measures
Given two classes, positive and negative, and a sample set:
- pos (= a + b) / neg (= c + d): number of positive (yes) / negative (no) samples
- t_pos (= a) / t_neg (= d): number of correctly classified positive / negative samples
- f_pos (= c): number of samples incorrectly labeled as positive

    sensitivity = t_pos / pos
    specificity = t_neg / neg
    accuracy    = sensitivity x pos/(pos + neg) + specificity x neg/(pos + neg)

7
Precision and Recall Measures

    Precision (p) = a / (a + c) = t_pos / (t_pos + f_pos)
    Recall (r)    = a / (a + b) = t_pos / pos

Note: precision and recall only measure classification performance on the positive class.
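The confusion-matrix measures above can be checked with a small script. The cell counts a, b, c, d below are made-up illustrative values:

```python
# Confusion-matrix metrics using the a/b/c/d cell names from the
# slides. The counts are assumed toy values, not from the lecture.
a, b, c, d = 40, 10, 5, 45   # TP, FN, FP, TN

pos, neg = a + b, c + d
accuracy = (a + d) / (a + b + c + d)
sensitivity = a / pos            # true-positive rate on the positive class
specificity = d / neg            # true-negative rate on the negative class
precision = a / (a + c)
recall = a / (a + b)             # identical to sensitivity

# Accuracy decomposes exactly as the weighted combination on the slides:
# accuracy = sensitivity * pos/(pos+neg) + specificity * neg/(pos+neg)
assert abs(accuracy - (sensitivity * pos / (pos + neg)
                       + specificity * neg / (pos + neg))) < 1e-12
print(accuracy, precision, recall)
```

With these counts, accuracy is 0.85 even though precision (about 0.89) and recall (0.80) tell a more detailed story about the positive class.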

Fundamental Assumption for Classifying
- Assumption: the distribution of training examples is identical to the distribution of test examples (including future unseen examples).
- In practice, this assumption is often violated to a certain degree; strong violations will clearly result in poor classification accuracy.
- To achieve good accuracy on the test data, training examples must be sufficiently representative of the test data.

Estimating the Classification Accuracy
Holdout method: training-and-testing
- The (class-labeled) data is randomly partitioned into two independent sets, e.g., training set (2/3) and test set (1/3).
- Used for data sets with a large number of samples.

8
Estimating the Classification Accuracy
k-fold cross-validation
- Divide the (class-labeled) data set into k sub-samples.
- Use k-1 sub-samples as training data and one sub-sample as test data; each sub-sample serves once as the test set.
- Used for data sets of moderate size.

Increasing Classifier Accuracy
Are there general techniques for improving classifier accuracy?
Bagging and boosting: ensemble classification methods
- Combine a series of T learned classifiers C1, ..., CT with the aim of creating an improved composite classifier C*.
[Figure: samples drawn from the data train classifiers C1, C2, ..., CT; their predictions on a new sample X are combined into a single class prediction.]
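The k-fold procedure above can be sketched in a few lines. `train_and_score` is a hypothetical stand-in for any classifier's train-then-evaluate step, and the data set is a toy example:

```python
# A minimal k-fold cross-validation sketch: split into k folds, train
# on k-1 folds, test on the held-out fold, and average the accuracies.
import random
from collections import Counter

def k_fold_indices(n, k, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)       # random partition, reproducible
    return [idx[i::k] for i in range(k)]   # k disjoint folds covering 0..n-1

def cross_validate(data, k, train_and_score):
    folds = k_fold_indices(len(data), k)
    accs = []
    for i, test_idx in enumerate(folds):
        test = [data[j] for j in test_idx]
        train = [data[j] for f in folds[:i] + folds[i + 1:] for j in f]
        accs.append(train_and_score(train, test))
    return sum(accs) / k                   # mean accuracy over the k folds

# Toy data and a trivial majority-class "classifier" as the stand-in:
data = [(x, "yes" if x > 0 else "no") for x in range(-5, 10)]

def majority(train, test):
    pred = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    return sum(pred == lbl for _, lbl in test) / len(test)

print(cross_validate(data, 3, majority))
```

Each record is tested exactly once and trained on k-1 times, which makes better use of a moderate-sized data set than a single holdout split.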

Summary & Research Focus
- Classification is an extensively studied problem (mainly in statistics, machine learning, neural networks, etc.).
- Classification is probably one of the most widely used data mining techniques, with many extensions.
- Scalability is still an important issue for classification techniques.
- It is not always reasonable to assume that all samples are uniquely classifiable; it is useful to return a probability distribution over classes.
- Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.
- No classification method is superior over all others for all data types and domains.

WEKA -- A Novel Tool for Data Mining
- Open-source software in Java for solving real-world data mining problems.
- Developed by the University of Waikato.
- http://www.cs.waikato.ac.nz/ml/weka/
- Related materials can be found at that web site.
