INFS 755 (Fall 2008)

Assignment 3 (Due on 11/10/2008)
(Association Rule Mining, Clustering, Classification)
[Huzefa Rangwala]
Max Points: 80
Extra Credit Points: 30

This is an individual assignment. Please submit a hard copy in class before class starts. No late
submissions will be accepted. The assignment has three parts, some of which require Weka. The
point of the assignment is to have fun!

Part 1 (Clustering)

    1. Use the similarity matrix in Table 8.1 to perform single and complete link hierarchical
        clustering. Show your results by drawing a dendrogram. The dendrogram should clearly
        show the order in which the points are merged. (20 points)


                           P1     P2     P3     P4     P5
                      P1   1.00   0.10   0.41   0.55   0.35
                      P2   0.10   1.00   0.64   0.47   0.98
                      P3   0.41   0.64   1.00   0.44   0.85
                      P4   0.55   0.47   0.44   1.00   0.76
                      P5   0.35   0.98   0.85   0.76   1.00
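
Hint: because Table 8.1 holds similarities rather than distances, each step merges the pair of clusters with the HIGHEST cluster similarity. Under single link the cluster similarity is the maximum pairwise similarity between members; under complete link it is the minimum. The sketch below is one way to check a hand-drawn dendrogram. The names SIM, POINTS, sim, and agglomerate are illustrative and not part of any course material.

```python
from itertools import combinations

# Similarity matrix from Table 8.1 (upper triangle only; the matrix is symmetric).
SIM = {
    ("P1", "P2"): 0.10, ("P1", "P3"): 0.41, ("P1", "P4"): 0.55, ("P1", "P5"): 0.35,
    ("P2", "P3"): 0.64, ("P2", "P4"): 0.47, ("P2", "P5"): 0.98,
    ("P3", "P4"): 0.44, ("P3", "P5"): 0.85,
    ("P4", "P5"): 0.76,
}
POINTS = ["P1", "P2", "P3", "P4", "P5"]


def sim(a, b):
    """Look up the similarity of two points in either order."""
    return SIM[(a, b)] if (a, b) in SIM else SIM[(b, a)]


def agglomerate(points, linkage):
    """Merge clusters until one remains.

    linkage=max gives single link, linkage=min gives complete link,
    because we are working with similarities, not distances.
    Returns the merges in order as (left members, right members, similarity).
    """
    def cluster_sim(pair):
        a, b = pair
        return linkage(sim(x, y) for x in a for y in b)

    clusters = [frozenset([p]) for p in points]
    merges = []
    while len(clusters) > 1:
        # Pick the most similar pair of clusters under the chosen linkage.
        a, b = max(combinations(clusters, 2), key=cluster_sim)
        merges.append((sorted(a), sorted(b), cluster_sim((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return merges


for linkage, name in ((max, "single link"), (min, "complete link")):
    print(name)
    for left, right, s in agglomerate(POINTS, linkage):
        print(f"  merge {left} + {right} at similarity {s:.2f}")
```

The printed merge order and similarity levels correspond to the heights at which branches join in the dendrogram.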


    2. Download the “iris.arff” dataset from the class web-site. Load the Iris dataset into the
        Weka Explorer. Ensure that you are not using the target labels in the clustering solution.
        (20 points)

            a. Perform k-means on this dataset to form “k” clusters, with k ranging from 2
               to 10.
            b. Plot SSE as k varies from 2 to 10. Briefly describe the curve.
            c. For k = 3, describe which classes (Setosa, Virginica, and Versicolor) each cluster
               contains.
            d. Compute the entropy and a confusion matrix using the class labels for k = 3.
               Which classes are confused?
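
Hint: SSE is the sum of squared Euclidean distances from each point to the centroid of its cluster. The entropy of a clustering is the size-weighted average of the per-cluster entropies computed from the class counts in the confusion matrix. The sketch below shows both calculations, assuming base-2 logarithms. The function names are illustrative, and the toy values are made up for demonstration only; they are not Iris results.

```python
from math import log2


def sse(points, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[k]))
        for p, k in zip(points, labels)
    )


def weighted_entropy(confusion):
    """Entropy of a clustering.

    confusion[i][j] is the number of points of class j placed in cluster i.
    Each cluster's entropy is weighted by its share of all points.
    """
    total = sum(sum(row) for row in confusion)
    result = 0.0
    for row in confusion:
        size = sum(row)
        if size == 0:
            continue  # an empty cluster contributes nothing
        e = -sum((n / size) * log2(n / size) for n in row if n > 0)
        result += (size / total) * e
    return result


# Hypothetical toy values (NOT taken from the Iris dataset).
toy_points = [(0.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
toy_labels = [0, 0, 1]                      # cluster index of each point
toy_centroids = [(0.0, 1.0), (5.0, 5.0)]
toy_confusion = [[4, 0],                    # cluster 0: pure
                 [2, 2]]                    # cluster 1: evenly mixed

print(sse(toy_points, toy_labels, toy_centroids))
print(weighted_entropy(toy_confusion))
```

A pure cluster contributes zero entropy, so the classes that are "confused" show up as rows of the confusion matrix with more than one large count.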

Part 2 (Association Rule Mining)

1. Given the lattice structure shown in the figure above and the transactions given in Table 6.3,
label each node with the following letter(s): (20 points)

        • M if the node is a maximal frequent itemset,
        • C if it is a closed frequent itemset,
        • N if it is frequent but neither maximal nor closed, and
        • I if it is infrequent.
        Assume that the support threshold is equal to 30%.
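
Hint: a frequent itemset is maximal if none of its immediate supersets is frequent, and closed if none of its immediate supersets has the same support count. Every maximal frequent itemset is therefore also closed, so the sketch below labels such nodes "MC". It enumerates every itemset by brute force, which is only sensible for small lattices. The transactions are a made-up toy example with a 50% threshold, NOT the contents of Table 6.3, and the function name is illustrative.

```python
from itertools import combinations


def label_itemsets(transactions, min_support):
    """Label every non-empty itemset as 'MC', 'C', 'N', or 'I'."""
    items = sorted(set().union(*transactions))
    n = len(transactions)

    def count(itemset):
        # Number of transactions that contain the whole itemset.
        return sum(1 for t in transactions if itemset <= t)

    labels = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            c = count(itemset)
            if c / n < min_support:
                labels[itemset] = "I"          # infrequent
                continue
            supersets = [itemset | {x} for x in items if x not in itemset]
            maximal = all(count(s) / n < min_support for s in supersets)
            closed = all(count(s) < c for s in supersets)
            labels[itemset] = "MC" if maximal else "C" if closed else "N"
    return labels


# Hypothetical toy transactions (NOT Table 6.3).
TOY = [{"a", "b"}, {"a", "b"}, {"b", "c"}, {"c"}]

for itemset, label in label_itemsets(TOY, 0.5).items():
    print(sorted(itemset), label)
```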

2. Download “basket.data”, a transaction database from Clementine’s demos (Clementine is another
 data mining toolkit). The goal of this data mining study is to find groups of product items often
 bought together by the customers of a supermarket, whose baskets are represented in the baskets
 dataset.

Understand the parameters of the Apriori algorithm in Weka. Run the algorithm with minimum
support 0.5 and confidence 0.9. Explain the output generated. Is it in line with the class
discussion on Apriori rule mining? Explain the top two rules generated. Run the algorithm again
with support = 0.9 and confidence = 0.9. What changes did you observe between the two runs? (20
points)
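
Hint: when reading the rules, recall that the support of an itemset is the fraction of baskets containing it, and the confidence of a rule X => Y is support(X and Y together) divided by support(X). The sketch below computes both on a made-up toy example; the baskets are hypothetical and are NOT taken from basket.data, and the function names are illustrative rather than anything from Weka.

```python
def support(baskets, itemset):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for b in baskets if itemset <= b) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    both = set(antecedent) | set(consequent)
    return support(baskets, both) / support(baskets, antecedent)


# Hypothetical toy baskets (NOT from basket.data).
TOY_BASKETS = [
    {"bread", "milk"},
    {"bread", "milk", "beer"},
    {"bread"},
    {"milk"},
]

print(support(TOY_BASKETS, {"bread", "milk"}))
print(confidence(TOY_BASKETS, {"bread"}, {"milk"}))
```

A rule is reported only if it meets both thresholds, so raising the minimum support can only shrink the set of frequent itemsets from which rules are generated.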

Part 3 (Classification)




1. Given the Bayesian network shown above, compute the following probabilities. (15 points)

        (a) P(B = good, F = empty, G = empty, S = yes).
        (b) P(B = bad, F = empty, G = not empty, S = no).
        (c) Given that the battery is bad, compute the probability that the car will start.


2. Given the data sets shown in Figures (a) to (e), explain how the decision tree, naïve
Bayes, and k-nearest neighbor classifiers would perform on these data sets. Be sure to
explain your answers. (15 points)

								