Data Mining Tools
Due: Monday, December 6th, 9:00 am
Data mining is a powerful tool for studying patterns and relations in the large volumes of data that
surround us. In class, we have learned about clustering (unsupervised learning), classification (using
principal component analysis, networks, fuzzy logic, and other learning tools), and models that help to
understand and predict values for new data based on a training data set (using decision trees, among
other tools).
There are a number of commercial as well as research products that implement some or all of the above
tools. One such example is Weka, open source software issued under the GNU General Public License. Weka is
a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied
directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing,
classification, regression, clustering, association rules, and visualization. More advanced programmers
and researchers can use Weka to develop new machine learning and data mining schemes.
What you should do:
1: Follow the link to the official Weka web site, http://www.cs.waikato.ac.nz/ml/weka/, and read the
“Getting started” information. Download and install Weka version weka-3-6-3 for Windows, Mac or Linux.
Note that the Windows x64 version has been tested with both self-extracting executables:
weka-3-6-3jre.exe (34.8 MB) and weka-3-6-3.exe (20.85 MB). Install and run the software. Remember that
if you are using lab machines, the software must be uninstalled once the assignment is complete.
2: Run Weka (Weka 3.6 with console). You should be able to see a basic interface:
3: Sample datasets are downloaded with the Weka program files. The data file format that Weka works
with is .arff. Use the /data/contact-lenses.arff dataset for this assignment. You can browse other
datasets if desired. For the lenses dataset, demonstrate that you can perform the following FIVE major
Weka functionalities in the Weka EXPLORER PACKAGE (you may find out more about them from the on-line
help or the comprehensive Weka tutorials under the Documentation links).
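For orientation, an .arff file is plain text: a header of @relation and @attribute declarations followed by a @data section of comma-separated rows. The following is a minimal sketch of reading such a file in Python, assuming nominal attributes only and no quoted values, sparse rows or missing-value handling. The parse_arff helper and the sample text are hypothetical illustrations; they are not part of Weka and are not the contents of contact-lenses.arff.

```python
def parse_arff(text):
    """Parse a simplified ARFF string into (relation, attributes, rows).

    Assumes nominal attributes only and no quoted values (illustrative sketch).
    """
    relation = None
    attributes = []   # list of (name, [allowed values])
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # skip blank lines and comments
            continue
        lower = line.lower()
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            values = [v.strip() for v in spec.strip("{}").split(",")]
            attributes.append((name, values))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# Hypothetical toy file, made up for illustration only.
SAMPLE = """\
% toy example
@relation weather-toy
@attribute outlook {sunny, rainy}
@attribute play {yes, no}
@data
sunny,no
rainy,yes
"""

relation, attributes, rows = parse_arff(SAMPLE)
print(relation)
print(attributes)
print(rows)
```

Opening contact-lenses.arff in a text editor shows the same three-part layout, which can help when interpreting attribute names in the Explorer.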
A. [6p total] Run the Classify utility. For the test option, choose “Use training set”.
A1. [1p] For the PRISM classifier, report the resulting Prism rules.
A2. [3p] For the Decision table classifier, report the number of correctly classified instances, the
number of incorrectly classified instances, and the mean absolute error.
A3. [2p] For the Ridor classifier, report the number of rules and list them all.
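To make the quantities in A2 concrete, the sketch below counts correct and incorrect predictions and computes a mean absolute error from predicted class probabilities. It assumes the error for a nominal class is the absolute difference between each predicted probability and a 0/1 indicator of the true class, averaged over classes and instances; check the Weka documentation for the exact definition it uses. The evaluate function and the toy predictions are hypothetical and unrelated to the lenses dataset.

```python
def evaluate(actual, predicted_probs, classes):
    """Return (correct, incorrect, mean_absolute_error).

    actual          -- list of true class labels
    predicted_probs -- list of dicts mapping class label -> predicted probability
    classes         -- list of all class labels
    """
    correct = 0
    abs_error_sum = 0.0
    for truth, probs in zip(actual, predicted_probs):
        # The predicted label is the class with the highest probability.
        guess = max(classes, key=lambda c: probs.get(c, 0.0))
        if guess == truth:
            correct += 1
        # Compare the distribution with a 0/1 indicator of the true class,
        # averaged over the classes (an assumption about the definition).
        abs_error_sum += sum(
            abs(probs.get(c, 0.0) - (1.0 if c == truth else 0.0)) for c in classes
        ) / len(classes)
    n = len(actual)
    return correct, n - correct, abs_error_sum / n


# Made-up predictions for four instances of a two-class problem.
classes = ["yes", "no"]
actual = ["yes", "no", "yes", "no"]
predicted = [
    {"yes": 0.9, "no": 0.1},
    {"yes": 0.2, "no": 0.8},
    {"yes": 0.4, "no": 0.6},   # misclassified: the true class is "yes"
    {"yes": 0.0, "no": 1.0},
]
correct, incorrect, mae = evaluate(actual, predicted, classes)
print(correct, incorrect, round(mae, 3))
```

Note that the error can be non-zero even when every instance is classified correctly, because it measures how far the predicted probabilities are from certainty.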
B. [6p total] Run the Cluster functionality. Run the DBScan, Hierarchical Clustering and SimpleKMeans
methods (on the training set). Store clusters for visualization. For Hierarchical Clustering and
SimpleKMeans, choose the number of clusters to be 5 by clicking on the parameter string next to the
CHOOSE button (below Clusterer) [2p]. Use the default settings for DBScan [1p]. Report the clustering
results for all 3 methods (clusterer’s output) [3p].
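As a reminder of what a k-means style clusterer does, here is a minimal sketch of Lloyd's algorithm on numeric 2-D points. It is a generic illustration, not Weka's SimpleKMeans: the lenses attributes are nominal and Weka handles them in its own way. The kmeans function, its first-k-points initialization (chosen only to keep the sketch deterministic) and the toy points are all assumptions made for this example.

```python
def kmeans(points, k, iterations=100):
    """Cluster 2-D points with Lloyd's algorithm; returns (centroids, labels).

    Initial centroids are simply the first k points (a simplification that
    keeps this sketch deterministic).
    """
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2
                                        + (p[1] - centroids[j][1]) ** 2)
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == j]
            if members:
                new_centroids.append((
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep old centroid
        if new_labels == labels and new_centroids == centroids:
            break   # nothing changed, so the algorithm has converged
        labels, centroids = new_labels, new_centroids
    return centroids, labels


# Made-up points forming two well-separated groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (9.0, 8.0), (8.0, 9.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
print(centroids)
```

The cluster assignments and centroids printed here correspond roughly to what a clusterer's output reports: which instances fall in which cluster and where each cluster is centred.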
C. [4p total] Run the Associate functionality with the Apriori associator on the lenses dataset. Provide
written answers based on the resulting run.
C1. [1p] What is the minimum support reported?
C2. [1p] Minimum confidence?
C3. [1p] Generated sets of large itemsets?
C4. [1p] Best rules found?
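The two thresholds asked about in C1 and C2 can be illustrated with a few lines of Python. This is a sketch of the standard definitions only: the support of an itemset is the fraction of records that contain it, and the confidence of a rule A => B is support(A and B) divided by support(A). The function names and the toy transactions are hypothetical and are not taken from Weka or from the lenses dataset.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Made-up market-basket records.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
rule_support = support(transactions, {"bread", "milk"})
rule_confidence = confidence(transactions, {"bread"}, {"milk"})
print(rule_support)      # 2 of 4 records contain both items
print(rule_confidence)   # 2 of the 3 records with bread also contain milk
```

An Apriori-style search keeps only itemsets whose support meets the minimum support, then reports rules whose confidence meets the minimum confidence.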
D. [2p total] Select Attributes functionality: for the Attribute Evaluator, choose Principal Components
(principal component analysis) with the Ranker search method (parameters chosen by the system). Use
the full training set. Provide a screenshot of the Attribute selection output (with correlation matrix,
eigenvalues and eigenvectors). No discussion of the output is needed.
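To help read the correlation matrix, eigenvalues and eigenvectors in that output, the sketch below works through the smallest possible case: two numeric attributes. For a 2x2 correlation matrix [[1, r], [r, 1]] the eigenvalues are 1 + r and 1 - r, with eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2). The function names and data are hypothetical, the attributes are assumed to be non-constant, and Weka's own output covers more attributes and uses its own formatting.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)   # assumes neither attribute is constant


def pca_two_attributes(xs, ys):
    """Return (correlation matrix, eigenvalues, eigenvectors), largest first."""
    r = pearson(xs, ys)
    corr = [[1.0, r], [r, 1.0]]
    # Closed-form eigen-decomposition of [[1, r], [r, 1]].
    s = 1.0 / math.sqrt(2.0)
    pairs = [(1.0 + r, (s, s)), (1.0 - r, (s, -s))]
    pairs.sort(key=lambda pair: pair[0], reverse=True)   # rank by variance explained
    return corr, [p[0] for p in pairs], [p[1] for p in pairs]


# Made-up values for two strongly correlated attributes.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 9.0]
corr, eigenvalues, eigenvectors = pca_two_attributes(xs, ys)
print(corr)
print(eigenvalues)
print(eigenvectors)
```

The eigenvalues sum to the number of attributes, so each one divided by that sum gives the proportion of variance explained by its component, which is the basis for ranking components.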
E. [2p total] Provide a screenshot of the visualization for the lenses dataset (with all attributes).
Please change PlotSize, Jitter and Colors from their defaults (one screenshot of one chosen setting is
enough). Expand one chosen quadrant to show the X/Y point distribution (one only).
Bonus: [3p] There is a variety of applications and projects that use Weka. The full list is available under
the Further Information – Related Projects menu item. Some interesting examples are:
WekaMetal - a meta-learning extension to Weka.
Tertius: a system for rule discovery.
MARFF - extension of ARFF for Multi-Relational Applications.
TClass - classifying multivariate time series.
Bayesian Network Classifiers - with bindings for Weka.
Weka on Text - software for text mining.
Judge - software for document classification and clustering.
Fuzzy algorithms - for clustering and classification.
Agent Academy - Java integrated development framework for creating Intelligent Agents and
Multi Agent Systems
GeneticProgramming - Genetic Programming Classifier for Weka
Weka-GDPM - extended version of Weka 3.4 to support automatic geographic data
preprocessing for spatial data mining.
OpenSubspace - An open source framework for evaluation and exploration of subspace
clustering algorithms in WEKA
Olex-GA - A genetic algorithm for the induction of rule-based text classifiers
Graph RAT - A framework for combining graph and non-graph algorithms
TUBE - Tree-based Density Estimation Algorithms
Your goal is to choose ONE from the above REDUCED LIST of applications (there are more links on the
web site, but they are less relevant to course material), run it and answer the questions below:
Q1. [1p] Name of the chosen Weka project from the above list and a one-sentence justification of why
this project/topic was chosen.
Q2. [1p] Main functionality of the chosen project (one paragraph).
Q3. [1p] Example of applications (i.e. which data sets/databases can be studied with this tool).
Q4. [1p] Your experience of how easy it was to run, or whether it was possible to run it at all.
NOTE: due to the highly distributed and complex nature of these projects, some links might be deactivated
during the course of the assignment. If the issue persists, please choose an alternative project and
inform your TA that the project is no longer available.
What to submit
Submit a WRITTEN REPORT as a .doc or .pdf file to your TA, according to TA requirements. The course late
assignment policy allows submission up to 2 days late, based on the date and time it is received by
your TA, with a penalty of 10% of your mark for each late day. A sample file for testing your program may be
provided by your TA.
The assignment must be done individually, so everything that you hand in must be your original work.
Copying another student's work is academic misconduct.