
# Ontology Translation on the Semantic Web

## DSC433/533, Fall 2008: Information Analysis for Managerial Decisions

Instructor: Dejing Dou
Office hours: Wednesdays 3:30pm-5:00pm
Oct. 28 and Oct. 30, 2008
## Chapter 6: Classification and Prediction

- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Classification by neural networks
- Classification based on concepts from association rule mining
- Other classification methods
- Prediction
- Classification accuracy
## Classification vs. Prediction

- Classification:
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
  - predicts categorical class labels (discrete or nominal)
- Prediction:
  - models continuous-valued functions, i.e., predicts unknown or missing values (e.g., linear, multiple, and nonlinear regression)
- Typical applications:
  - credit approval
  - target marketing
  - medical diagnosis
  - treatment effectiveness analysis
## Classification: A Two-Step Process

- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - Supervised learning vs. unsupervised learning (clustering)
  - The model is represented as classification rules, decision trees, Bayesian networks, neural networks, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model:
    - The known label of each test sample is compared with the model's classification
    - The accuracy rate is the percentage of test-set samples correctly classified by the model
    - The test set must be independent of the training set; otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify (predict) data tuples whose class labels are not known
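The two-step process can be sketched in a few lines. This is a toy illustration, not code from the course: the one-feature dataset and the threshold-rule "model" are invented, with even-indexed values as the training set and odd-indexed values as the independent test set.

```python
# Hypothetical one-feature dataset: the true label is "yes" exactly when x > 5.
train = [(x, "yes" if x > 5 else "no") for x in range(0, 20, 2)]
test = [(x, "yes" if x > 5 else "no") for x in range(1, 20, 2)]

# Step 1: model construction -- learn a boundary between the two classes
max_no = max(x for x, label in train if label == "no")
min_yes = min(x for x, label in train if label == "yes")
threshold = (max_no + min_yes) / 2

def classify(x):
    return "yes" if x > threshold else "no"

# Step 2: model usage -- accuracy is the fraction of test samples classified correctly
accuracy = sum(classify(x) == label for x, label in test) / len(test)
print(accuracy)
```

Because the test set is disjoint from the training set, this accuracy is an honest estimate; measuring on the training data instead would hide over-fitting.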
## Model Construction

A classification algorithm is applied to the training data to produce a classifier (model):

| NAME | RANK           | YEARS | TENURED |
|------|----------------|-------|---------|
| Mike | Assistant Prof | 3     | no      |
| Mary | Assistant Prof | 7     | yes     |
| Bill | Professor      | 2     | yes     |
| Jim  | Associate Prof | 7     | yes     |
| Dave | Assistant Prof | 6     | no      |
| Anne | Associate Prof | 3     | no      |

The learned model:

```
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
```
## Use the Model in Prediction

The classifier is first evaluated on the testing data, then applied to unseen data such as (Jeff, Professor, 4): tenured?

| NAME    | RANK           | YEARS | TENURED |
|---------|----------------|-------|---------|
| Tom     | Assistant Prof | 2     | no      |
| Merlisa | Associate Prof | 7     | no      |
| George  | Professor      | 5     | yes     |
| Joseph  | Assistant Prof | 7     | yes     |
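Applying the learned rule to an unseen tuple is mechanical. A minimal sketch, where the rule and the tuple (Jeff, Professor, 4) come from the slides and the case-insensitive rank match is my assumption:

```python
# The model learned on the previous slide:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank, years):
    # Assumption: rank comparison is case-insensitive
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# Unseen tuple (Jeff, Professor, 4): the rank clause fires, so the model says "yes"
print(predict_tenured("Professor", 4))   # yes
```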
## Supervised vs. Unsupervised Learning

- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data
## Issues Regarding Classification and Prediction: Data Preparation

- Data cleaning
  - Preprocess data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
## Evaluating Classification Methods

- Predictive accuracy
- Speed
  - time to construct the model
  - time to use the model
- Robustness
  - handling noise and missing values
- Scalability
  - efficiency on disk-resident databases
- Interpretability
  - understanding and insight provided by the model
- Other measures of rule goodness besides accuracy
  - decision tree size
  - compactness of classification rules
## Training Dataset

This example follows Quinlan's ID3 (Iterative Dichotomiser):

| age    | income | student | credit_rating | buys_computer |
|--------|--------|---------|---------------|---------------|
| <=30   | high   | no      | fair          | no            |
| <=30   | high   | no      | excellent     | no            |
| 31..40 | high   | no      | fair          | yes           |
| >40    | medium | no      | fair          | yes           |
| >40    | low    | yes     | fair          | yes           |
| >40    | low    | yes     | excellent     | no            |
| 31..40 | low    | yes     | excellent     | yes           |
| <=30   | medium | no      | fair          | no            |
| <=30   | low    | yes     | fair          | yes           |
| >40    | medium | yes     | fair          | yes           |
| <=30   | medium | yes     | excellent     | yes           |
| 31..40 | medium | no      | excellent     | yes           |
| 31..40 | high   | yes     | fair          | yes           |
| >40    | medium | no      | excellent     | no            |
## Output: A Decision Tree for buys_computer

```
age?
├── <=30   → student?
│           ├── no  → no
│           └── yes → yes
├── 31..40 → yes
└── >40    → credit_rating?
            ├── excellent → no
            └── fair      → yes
```
## Algorithm for Decision Tree Induction

- Basic algorithm (a greedy algorithm; a version of ID3)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all training examples are at the root
  - Attributes are categorical (discrete-valued); continuous-valued attributes are discretized in advance
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning (any of them):
  - All samples at a given node belong to the same class
  - There are no remaining attributes for further partitioning (majority voting is employed to label the leaf)
  - There are no samples left
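The recursive procedure above can be condensed into a short sketch. This is not the instructor's code, just a minimal ID3-style learner under the stated assumptions (categorical attributes, information-gain selection), run on the 14-tuple buys_computer table from the earlier slide:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, target):
    """Information gain of splitting `rows` on categorical attribute `attr`."""
    g = entropy([r[target] for r in rows])
    for value in set(r[attr] for r in rows):
        part = [r[target] for r in rows if r[attr] == value]
        g -= len(part) / len(rows) * entropy(part)
    return g

def build_tree(rows, attrs, target):
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:                 # stop: all samples in one class
        return labels[0]
    if not attrs:                             # stop: no attributes left -> majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    # Branch only on values present, so no partition is ever empty
    return {best: {v: build_tree([r for r in rows if r[best] == v], rest, target)
                   for v in set(r[best] for r in rows)}}

cols = ("age", "income", "student", "credit_rating", "buys_computer")
raw = [
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31..40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31..40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31..40", "medium", "no", "excellent", "yes"),
    ("31..40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
rows = [dict(zip(cols, r)) for r in raw]
tree = build_tree(rows, ["age", "income", "student", "credit_rating"], "buys_computer")
```

On this data the learner recovers the tree shown in the slides: age at the root, student under the <=30 branch, and credit_rating under the >40 branch.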
## Attribute Selection Measure: Information Gain (ID3/C4.5)

- Select the attribute with the highest information gain.
- Let S contain $s_i$ tuples of class $C_i$ for $i = 1, \ldots, m$. The information required to classify an arbitrary tuple into one of the $m$ classes is

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} \frac{s_i}{s} \log_2 \frac{s_i}{s}$$

- The entropy of attribute A with values $\{a_1, a_2, \ldots, a_v\}$, i.e., the information still required to classify S after partitioning it into the subsets $\{S_1, S_2, \ldots, S_v\}$ (where $S_j$ contains $s_{ij}$ samples of class $C_i$), is

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + \cdots + s_{mj}}{s} \, I(s_{1j}, \ldots, s_{mj})$$

- The information gained by branching on attribute A is

$$\mathrm{Gain}(A) = I(s_1, s_2, \ldots, s_m) - E(A)$$

The smaller the entropy still required, the greater the purity of the partitions and the larger the gain.
## Attribute Selection by Information Gain Computation

- Class P: buys_computer = "yes" (9 tuples); Class N: buys_computer = "no" (5 tuples)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:

| age    | p_i | n_i | I(p_i, n_i) |
|--------|-----|-----|-------------|
| <=30   | 2   | 3   | 0.971       |
| 31..40 | 4   | 0   | 0           |
| >40    | 3   | 2   | 0.971       |

$$E(\mathrm{age}) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$$

Here $\frac{5}{14} I(2,3)$ means that "age <= 30" covers 5 of the 14 samples, with 2 yes's and 3 no's:

$$I(2,3) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$$

Hence

$$\mathrm{Gain}(\mathrm{age}) = I(p, n) - E(\mathrm{age}) = 0.246$$

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048, so age is selected as the splitting attribute.
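The full comparison can be reproduced from the (yes, no) counts per attribute value. The counts below are read off the 14-tuple training table; everything else is a sketch of mine:

```python
import math

def info(*counts):
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

# (yes, no) counts per attribute value, read off the 14-tuple training table
splits = {
    "age": [(2, 3), (4, 0), (3, 2)],          # <=30, 31..40, >40
    "income": [(2, 2), (4, 2), (3, 1)],       # high, medium, low
    "student": [(3, 4), (6, 1)],              # no, yes
    "credit_rating": [(6, 2), (3, 3)],        # fair, excellent
}
base = info(9, 5)                             # I(p, n) = 0.940
gains = {a: base - sum(sum(p) / 14 * info(*p) for p in parts)
         for a, parts in splits.items()}
print(max(gains, key=gains.get))              # age
```

The computed gains round to the slide's values (0.246, 0.029, 0.151, 0.048), and age wins the split.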
## Other Attribute Selection Measures

- Gain ratio (C4.5), Gini index (CART)
  - All attributes can be continuous-valued
  - Assume there exist several possible split values for each attribute
  - May need other tools, such as clustering, to get the possible split values
  - Can be modified for categorical attributes
- χ², MDL (Minimum Description Length), etc.
## Output: A Decision Tree for buys_computer

```
age?
├── <=30   → student?
│           ├── no  → no
│           └── yes → yes
├── 31..40 → yes
└── >40    → credit_rating?
            ├── excellent → no
            └── fair      → yes
```
## Attribute Selection by Information Gain Computation (age <= 30 branch)

The five tuples with age <= 30:

| age  | income | student | credit_rating | buys_computer |
|------|--------|---------|---------------|---------------|
| <=30 | high   | no      | fair          | no            |
| <=30 | high   | no      | excellent     | no            |
| <=30 | medium | no      | fair          | no            |
| <=30 | low    | yes     | fair          | yes           |
| <=30 | medium | yes     | excellent     | yes           |

$$I(2,3) = -\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5} = 0.971$$

$$E(\mathrm{student}) = \frac{2}{5} I(2,0) + \frac{3}{5} I(0,3) = 0 \quad\Rightarrow\quad \mathrm{Gain}(\mathrm{student}) = 0.971$$

$$E(\mathrm{income}) = \frac{2}{5} I(0,2) + \frac{2}{5} I(1,1) + \frac{1}{5} I(1,0) = 0.4 \quad\Rightarrow\quad \mathrm{Gain}(\mathrm{income}) = 0.571$$

$$E(\mathrm{credit\_rating}) = \frac{3}{5} I(1,2) + \frac{2}{5} I(1,1) = 0.951 \quad\Rightarrow\quad \mathrm{Gain}(\mathrm{credit\_rating}) = 0.020$$

Student has the highest gain, so it is chosen as the split for this branch; both resulting subsets are pure.
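Rerunning the computation on just this partition confirms the choice. The (yes, no) counts below are taken from the five age <= 30 tuples; student's gain equals the full 0.971 bits because both of its subsets are pure:

```python
import math

def info(*counts):
    s = sum(counts)
    return -sum(c / s * math.log2(c / s) for c in counts if c)

# (yes, no) counts within the age <= 30 partition (2 yes, 3 no overall)
base = info(2, 3)                             # 0.971
splits = {
    "student": [(0, 3), (2, 0)],              # no, yes: both subsets pure
    "income": [(0, 2), (1, 1), (1, 0)],       # high, medium, low
    "credit_rating": [(1, 2), (1, 1)],        # fair, excellent
}
gains = {a: base - sum(sum(p) / 5 * info(*p) for p in parts)
         for a, parts in splits.items()}
print(max(gains, key=gains.get))              # student
```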
## Tree Pruning

- Overfitting: an induced tree may overfit the training data
  - Too many branches; some may reflect anomalies due to noise or outliers
  - Poor accuracy on unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early. Do not split a node if doing so would cause the goodness measure (e.g., information gain, χ²) to fall below a threshold; the node becomes a leaf.
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree, producing a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"
## Approaches to Determine the Final Tree Size

- Separate training (2/3) and testing (1/3) sets
- Use all the data for training
  - but apply a statistical test (e.g., χ²) to estimate whether expanding or pruning a node improves the overall distribution
- Use the minimum description length (MDL) principle
  - MDL takes the "best" decision tree to be the one whose encoding requires the fewest bits
  - halt growth of the tree when the encoding is minimized
## Extracting Classification Rules from Trees

- Represent the knowledge in the form of IF-THEN rules
  - One rule is created for each path from the root to a leaf
  - Each attribute-value pair along a path forms a conjunction
  - The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example:

```
IF age = "<=30"  AND student = "no"              THEN buys_computer = "no"
IF age = "<=30"  AND student = "yes"             THEN buys_computer = "yes"
IF age = "31..40"                                THEN buys_computer = "yes"
IF age = ">40"   AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40"   AND credit_rating = "fair"      THEN buys_computer = "yes"
```
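The path-to-rule translation is a simple tree walk. A sketch over a nested-dict encoding of the slide's tree (the representation is mine, not Weka's):

```python
# The buys_computer tree from the slides, as nested dicts: inner nodes map an
# attribute to its branches; leaves are the class prediction.
tree = {"age": {
    "<=30": {"student": {"no": "no", "yes": "yes"}},
    "31..40": "yes",
    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}},
}}

def rules(node, path=()):
    if not isinstance(node, dict):                 # leaf: the class prediction
        conds = " AND ".join(f"{a} = '{v}'" for a, v in path)
        return [f"IF {conds} THEN buys_computer = '{node}'"]
    (attr, branches), = node.items()
    out = []
    for value, child in branches.items():          # one rule per root-to-leaf path
        out.extend(rules(child, path + ((attr, value),)))
    return out

for r in rules(tree):
    print(r)
```

Each attribute-value pair on a path becomes one conjunct, so the five leaves yield exactly the five rules shown above.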
## Enhancements to Basic Decision Tree Induction

- Allow continuous-valued attributes (ID3 -> C4.5)
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented, e.g., by grouping categorical attribute values
  - This reduces fragmentation, repetition, and replication
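The continuous-attribute enhancement can be sketched as a search over candidate thresholds, taking midpoints between consecutive sorted values and keeping the split with the highest gain. This is an illustrative sketch in the C4.5 spirit, not C4.5's exact procedure, and the age data is made up:

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Try midpoints between consecutive sorted values; keep the max-gain split."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best = (0.0, None)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2
        left = [l for v, l in pairs if v <= t]
        right = [l for v, l in pairs if v > t]
        g = base - (len(left) / len(pairs) * entropy(left)
                    + len(right) / len(pairs) * entropy(right))
        if g > best[0]:
            best = (g, t)
    return best

# Hypothetical ages with a clean class boundary around 30
g, t = best_threshold([22, 25, 28, 33, 38, 45], ["no", "no", "no", "yes", "yes", "yes"])
```

The chosen threshold defines a new binary attribute (value <= t vs. value > t), which the categorical algorithm can then use unchanged.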
## Classification in Large Databases

- Classification is a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
- Why decision tree induction in data mining?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple, easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - classification accuracy comparable with other methods
## Scalable Decision Tree Induction Methods in Data Mining Studies

- SLIQ (EDBT'96, Mehta et al.)
  - builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.)
  - constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim)
  - integrates tree splitting and tree pruning: stops growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti)
  - separates the scalability aspects from the criteria that determine the quality of the tree
  - builds an AVC-list (attribute, value, class label)
## Data Cube-Based Decision-Tree Induction

- Integration of generalization with decision-tree induction (RIDE'97, Kamber et al.)
- Attribute-oriented induction uses concept hierarchies
- Classification at different concept levels:
  - Low-level concepts (e.g., precise temperature, humidity, outlook) can result in quite large and bushy classification trees
  - High-level concepts can result in useless decision trees
  - Some intermediate level is set by a domain expert or a threshold
- Cube-based multi-level classification
  - Relevance analysis at multiple levels
  - Information-gain analysis with dimension + level
## Let's Practice: Extracting Rules from the Tree with Weka

```
age?
├── <=30   → student?
│           ├── no  → no
│           └── yes → yes
├── 31..40 → yes
└── >40    → credit_rating?
            ├── excellent → no
            └── fair      → yes
```
## One Path, One Rule

```
IF age = "<=30" AND student = "no"  THEN buys_computer = "no"

IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"

IF age = "31..40"                   THEN buys_computer = "yes"

IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"

IF age = ">40" AND credit_rating = "fair"      THEN buys_computer = "yes"
```
## Examples for C4.5

- Try class1.xls in Weka using J48 (Weka's implementation of C4.5)
- Part is the class label
- How about Charles Club, using Florence as the class label?
## Presentation of Classification Results

## Visualization of a Decision Tree in SGI/MineSet 3.0
## Decision Trees in Practice

- To explore a large dataset and pick out useful variables
- To predict future states of important variables in an industrial process
- To form directed clusters of customers for a recommendation system
## Decision Tree as a Data Exploration Tool

- The attribute selection process helps pick out the variables that are likely to be important for predicting targets
- That is, the attributes used in the classification rules are important
## Boston Globe Case

- Goal: estimate a town's expected home-delivery circulation level based on various demographic and geographic characteristics
- Problem: find a handful of attributes for regression
## Boston Globe Case

- The number of subscribing households in a given city or town may not make a good target, because towns and cities vary in size
- A better target (class label)?
## Boston Globe Case

- Penetration: the proportion of households that subscribe to the paper
- Find factors among the hundreds in the town signature that separate towns with high penetration (top one third, "good") from towns with low penetration (bottom one third, "bad")
## Boston Globe Case

The rules look like:

```
IF median home value <= $226K:
    IF sub-to-pop child ratio >= 0.61: good (97%)
    IF sub-to-pop child ratio <  0.61: (50% vs. 50%)
        IF % population age 18-24 <  0.09: good (100%)
        IF % population age 18-24 >= 0.09: bad  (100%)
IF median home value >  $226K: good (99%)
```

What do the rules above mean?
## Boston Globe Case

- Median home value is the best first split, and hence the most important factor
  - towns under $226K are the poorer prospects
- Next: a family of derived variables comparing the subscriber base in the town to the town population as a whole (e.g., child ratio, % of population by age)
- Others: mean years of school completed, percentage of the population in blue-collar occupations, percentage in high-status occupations
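The construction of the penetration target can be sketched directly. The towns and counts below are invented for illustration; the tercile labeling follows the top-third/bottom-third scheme described in the case:

```python
# Hypothetical town data: (town, subscribing households, total households)
towns = [("A", 120, 400), ("B", 60, 500), ("C", 300, 600),
         ("D", 90, 450), ("E", 50, 800), ("F", 200, 500)]

# Penetration normalizes away town size, unlike raw subscriber counts
pen = sorted(((subs / hh, name) for name, subs, hh in towns), reverse=True)

third = len(pen) // 3
good = {name for _, name in pen[:third]}        # top third: "good" towns
bad = {name for _, name in pen[-third:]}        # bottom third: "bad" towns
print(good, bad)
```

The "good"/"bad" labels then serve as the class attribute for the decision tree, which searches the town signature for factors that separate the two groups.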
## Applying Decision Trees for Prediction

- Predicting the future is one of the most important applications of data mining
- Analyze trends in historical data in order to predict future behavior
  - A major bank studied customer data in order to spot early warning signs of attrition in its checking accounts: ATM withdrawals, payroll direct deposits, balance inquiries, visits to the teller, ...
  - A manufacturer of diesel engines tried to forecast diesel engine sales based on historical truck registration data
- Major difference from a statistical study of cycles: multiple attributes rather than one attribute
## Nestle Coffee Roasters: Process Control

- Roaster variables: temperature of the air at various exhaust points, the speed of various fans, the rate at which gas is burned, the amount of water introduced to quench the beans, and the positions of various flaps and valves
- A lot of ways for things to go wrong: beans too light in color, or a costly and damaging roaster fire
- Goal of the simulator: help operators keep the roaster running properly. Data came from 60 sensors, every 30 seconds.
## Nestle Coffee Roasters: Process Control

- Using the simulator, a large number of new recipes could be evaluated without interrupting production
- The simulator could be used to train new operators and expose them to routine problems and their solutions; operators could try out different approaches to resolving a problem
- The simulator could track the operation of the actual roaster and project several minutes into the future to avoid problems
## Nestle Coffee Roasters: Process Control

- The simulator was built using a training set of 34,000 cases and evaluated on another 40,000 cases
- For each case, the simulator generated projected snapshots 60 steps into the future
- The size of the error increases with time: the error rate for product temperature turned out to be 2/3 of a degree per minute of projection, but even 30 minutes into the future the simulator did considerably better than random guessing
