Parallelizing WEKA Genetic Algorithm Implementation and
Adding Additional Genetic Algorithms to WEKA Library
CIS 895 – MSE Project
Department of Computing and Information Sciences
Kansas State University
Version #    Author       Release Date  Description
Version 1.0  James Louis  10/20/2008    Initial Release
Version 1.1  James Louis  12/11/2008    Added Change Log and
I. Project Overview
1. The purpose of this project is to extend the genetic programming algorithm in the
WEKA Machine Learning Library to use new genetic operators. These new operators
are based on the operators of the GENEsYs Genetic Algorithm Library.
1. Incorporate the GENEsYs genetic library operators into the WEKA library.
1. GENEsYs was created by Thomas Bäck in 1992 to provide a package of
interchangeable genetic operators for experiments using the genetic algorithm in C
and C++. It is an extension of the GENESIS 4.5 program created by John
Grefenstette in 1987. No new versions of GENEsYs have been developed since its
release. The library contains the data structures necessary for the genetic algorithm
and for data management. The WEKA library was developed later as an attempt to
collect numerous data-mining and machine learning algorithms into a single Java
package for greater portability. WEKA does not contain all of the operators that are
present in the GENEsYs library.
2. Parallelize the WEKA genetic algorithm implementation to handle large hypothesis
spaces by division over parallel processing machines. Currently the WEKA library does
not parallelize any of its algorithms, but there are other projects introducing parallelism
to specific processes. (See Section I.4.3.2)
1. Changes to requirements can occur, and already have, resulting in design and
evaluation changes that require time and resources.
1. Addition of parallelism.
2. Addition of Mersenne Twister pseudo-random number generator.
3. Comparison testing of GENEsYs.
4. Addition of the GENEsYs interface port.
2. Interruptions disrupt the development process, costing both the time taken by the
interruption and the time needed to resume development.
3. Waning interest, which could lead to more changes or a complete scrapping of the
4. Poor/incoherent development model which results in wasted effort.
1. The implementation cannot be embarrassingly parallel. (See Document 3.1 Section
2. Headless parallel implementation. (See Document 3.1 Section I.4.3)
3. Must use the WEKA library
1. WEKA is a library of data-mining algorithms and supporting structures developed in
Java for building data-mining applications that run on multiple computer architectures. The
algorithms included in the library are drawn from a wide variety of data-mining
1. The WEKA library also includes a program called WEKA Explorer that
provides a GUI for development of data-mining schemes using many of the
algorithms available in the library. Schemes are built using a component-based
process flow interface, showing the steps in the scheme. Not all of the algorithms
in the WEKA library are available in this GUI. For example, there are no
components representing the genetic algorithms in the library.
2. WEKA is currently on version 3.5.8.
3. WEKA's genetic algorithm is implemented three times. The first use is for
feature selection, where attributes in mined data are pruned for usefulness in
another algorithm. The second and third uses are for developing optimized
Bayesian network algorithms.
1. Mutation in the WEKA library is limited to a single rate, single bit
2. Crossover is limited to a single point crossover, producing two new
3. Selection uses an elitist ranking system.
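The operator styles listed above can be illustrated with a minimal sketch on plain boolean arrays. This is not WEKA's own code: the class and method names are hypothetical, and the reading of "single rate, single bit" as "with one fixed probability, flip one randomly chosen bit" is an assumption.

```java
import java.util.Random;

// Hypothetical illustration only; not taken from the WEKA source.
final class BitOperatorsSketch {

    /** Single-point crossover: the two children swap parent tails at index cut. */
    static boolean[][] singlePointCrossover(boolean[] a, boolean[] b, int cut) {
        boolean[] child1 = a.clone();
        boolean[] child2 = b.clone();
        for (int i = cut; i < a.length; i++) {
            child1[i] = b[i];
            child2[i] = a[i];
        }
        return new boolean[][] { child1, child2 };
    }

    /** With probability rate, flip exactly one randomly chosen bit of a copy. */
    static boolean[] singleBitMutation(boolean[] genes, double rate, Random rng) {
        boolean[] out = genes.clone();
        if (rng.nextDouble() < rate) {
            int pos = rng.nextInt(out.length);
            out[pos] = !out[pos];
        }
        return out;
    }
}
```

Both methods copy their inputs, so the parent individuals stay unchanged and can remain in the population if the selection scheme keeps them.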
2. Description of other parallelization projects.
1. Written by David R. Musicant and Sebastian Celis.
2. Timeline: 8/23/1999 - 3/20/2002
3. Purpose: Weka-Parallel was created to allow researchers to conduct cross-
validation on a classifier in parallel. Cross-validation is the process of
withholding selections of available data from the learning process and using
that data to test the correctness of the result of the learning process. Cross-
validation is used to provide empirical proof of a classifier being sufficiently
generalized to avoid overfitting. Overfitting is the process of a classifier
identifying meaningless correlations in the training data during its learning
1. The latest version of WEKA this project could have used is WEKA 3.0. Updating this
project to the current WEKA version would require the modifications to
be reimplemented in the new version.
2. The WEKA library has been modified to add the features presented in
this project directly to the WEKA library code.
3. Documentation for this project consists of an automated Javadoc-
produced set of web pages describing the code.
5. This project does not parallelize anything other than cross-validation. The
genetic algorithm is unmodified. This project is also embarrassingly parallel,
since communication only happens at the beginning and end of each fold.
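To make the "embarrassingly parallel" point concrete, the following is a minimal sketch of fold-level parallelism using a local thread pool. It is not Weka-Parallel's implementation: the class name, the modulo fold assignment, and the placeholder evaluateFold method (which only counts withheld instances instead of training a classifier) are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not taken from Weka-Parallel.
final class ParallelFoldsSketch {

    /** Indices of the instances withheld for testing in the given fold. */
    static List<Integer> testIndices(int numInstances, int numFolds, int fold) {
        List<Integer> held = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            if (i % numFolds == fold) {
                held.add(i);
            }
        }
        return held;
    }

    /** Placeholder for "train on the rest, test on the fold": counts withheld instances. */
    static int evaluateFold(int numInstances, int numFolds, int fold) {
        return testIndices(numInstances, numFolds, fold).size();
    }

    /** Runs every fold as an independent task; tasks never talk to each other. */
    static int[] runAllFolds(int numInstances, int numFolds, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int f = 0; f < numFolds; f++) {
                final int fold = f;
                Callable<Integer> task = () -> evaluateFold(numInstances, numFolds, fold);
                pending.add(pool.submit(task)); // communication: hand out the work
            }
            int[] results = new int[numFolds];
            for (int f = 0; f < numFolds; f++) {
                results[f] = pending.get(f).get(); // communication: collect the result
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("fold evaluation failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The only coordination points are the submit and get calls, which is the pattern this project's requirements rule out for the genetic algorithm itself.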
1. Written by Domenico Talia, Paolo Trunfio, and Oreste Verta.
2. Timeline: 2005 - 7/2/2008
3. Purpose: Weka4WS was created to allow researchers to perform data-mining
using resources that have been distributed in a GRID environment. A GRID
is a type of network where clusters of computers work together to perform
large tasks. This is done via a protocol that requires few assumptions, in this
case the Web Services Resource Framework.
1. Nodes: This project distinguishes resources by assigning them a node
1. User nodes contain a modified copy of the WEKA Explorer and a
local copy of the WEKA library for performing local requests.
2. Storage nodes contain the data that is mined.
3. Computing nodes perform the actual mining.
2. Process: The user interacts with the program at a user node, requesting a
computing node resource. The computing node downloads the data from
a storage node and data-mines it. The results are sent back to the user
node for further use.
3. The WEKA library has been modified to add the features presented in
this project directly to the WEKA library code.
5. This project does not parallelize the data-mining algorithms. The discrete
parallelization occurs between the algorithms, data access, and user
interfaces in a data-mining scheme. This project also relies on the GUI
in the WEKA Explorer program, where schemes can be planned out
and executed. The genetic algorithms in the WEKA library are not yet
available in the WEKA Explorer.
3. Examination of other algorithm packages
1. The GeneMining project uses a system called MayDay to visualize data mining on
micro-array data. The MayDay system is built using the WEKA library and does
not appear to contain new algorithms for genetic programming. The project's site has
also been under reconstruction since Sept. 2008 and this project is no longer
2. GAJIT is an incomplete translation of the GAGS C++ genetic algorithm library.
It is limited to elitist most-fit selection and a random-gene mutation operation
for the selection and mutation steps of the genetic algorithm. It also has two-
point, n-point, and random bit crossovers. There are also several operations for
transposing, adding, and removing genes that are not present in the GENEsYs
3. GAUL is a C/C++ library developed by Stewart Adcock for implementing
genetic algorithms in other applications. It has numerous mutation, crossover,
and selection operations, more than GENEsYs. Some of these operations reuse
operators on different chromosomes, or meaningful sets of genes in a hypothesis.
Numerous fitness comparison operations are also available. The fitness
evaluation function seems to be specified by the application the genetic
algorithm is used in.
4. JGAP is a Java library developed by Klaus Meffert and Neil Rotstan to provide a
basic framework for genetic algorithms that can be used in other applications.
This library only supplies a few genetic operators, but varies how these operators
are applied. There are also only a few fitness evaluation
4. The Mersenne Twister algorithm is to be used for generating random numbers. The
Mersenne Twister algorithm was developed by Makoto Matsumoto and Takuji
Nishimura as a method of quickly generating high-quality pseudo-random numbers.
1. The Mersenne Twister is assumed to be better than the random number generator
implementation provided in the Java Development Kit, which uses a linear
2. The suggestion was made to examine Ben Perry's implementation from his Master's
thesis for possible reuse in this project. (See Document 2.1 Section I.4.3.1)
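For orientation, the following is a compact sketch of the 32-bit Mersenne Twister (MT19937) as published by Matsumoto and Nishimura. It is not Ben Perry's implementation and not the one used in this project; the class name is hypothetical, and only integer output is shown.

```java
// Hypothetical illustration of MT19937; not this project's implementation.
final class MersenneTwisterSketch {
    private static final int N = 624;
    private static final int M = 397;
    private static final int MATRIX_A = 0x9908b0df;
    private static final int UPPER_MASK = 0x80000000; // most significant bit
    private static final int LOWER_MASK = 0x7fffffff; // least significant 31 bits

    private final int[] mt = new int[N];
    private int index = N; // forces a twist before the first output

    MersenneTwisterSketch(int seed) {
        mt[0] = seed;
        for (int i = 1; i < N; i++) {
            // Java int arithmetic wraps modulo 2^32, matching the reference code.
            mt[i] = 1812433253 * (mt[i - 1] ^ (mt[i - 1] >>> 30)) + i;
        }
    }

    /** Returns the next 32 random bits as a (signed) Java int. */
    int nextInt() {
        if (index >= N) {
            twist();
        }
        int y = mt[index++];
        // Tempering step.
        y ^= (y >>> 11);
        y ^= (y << 7) & 0x9d2c5680;
        y ^= (y << 15) & 0xefc60000;
        y ^= (y >>> 18);
        return y;
    }

    /** Regenerates the whole state array in place. */
    private void twist() {
        for (int i = 0; i < N; i++) {
            int y = (mt[i] & UPPER_MASK) | (mt[(i + 1) % N] & LOWER_MASK);
            int next = mt[(i + M) % N] ^ (y >>> 1);
            if ((y & 1) != 0) {
                next ^= MATRIX_A;
            }
            mt[i] = next;
        }
        index = 0;
    }
}
```

Because Java has no unsigned 32-bit type, callers that want the conventional unsigned value can pass the result through Integer.toUnsignedLong.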
1. Create a solution that can be reused for further development with minimum changes.
1. Reuse for this project is defined as applying the genetic algorithm to data-mining tasks
that can be specified later by the developer using the project.
2. Minimal changes is defined as having an easy method of adding user-specified tasks
without the need to alter the basic structures used for the genetic algorithm.
6. Main Product features
1. The project can be called from the WEKA Explorer's built-in command line interface
2. A configuration file can be supplied, listing the IP/web addresses of the computers used
in the parallelized algorithm and the ports used to communicate with the local
3. Additional genetic operators from the GENEsYs library will be added to the WEKA
library's existing operators.
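As an illustration of the configuration file feature, the sketch below parses a list of node addresses. The file format is an assumption, since the document does not define one: one host:port entry per line, with blank lines and lines starting with # ignored. The class name is hypothetical.

```java
import java.net.InetSocketAddress;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; the real configuration format is not specified here.
final class NodeConfigSketch {

    /** Parses "host:port" lines into unresolved socket addresses. */
    static List<InetSocketAddress> parse(List<String> lines) {
        List<InetSocketAddress> nodes = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blanks and comments
            }
            int colon = line.lastIndexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Expected host:port, got: " + line);
            }
            String host = line.substring(0, colon);
            int port = Integer.parseInt(line.substring(colon + 1));
            // createUnresolved avoids a DNS lookup at parse time.
            nodes.add(InetSocketAddress.createUnresolved(host, port));
        }
        return nodes;
    }
}
```

The lines can be read with java.nio.file.Files.readAllLines before being passed to parse; keeping the addresses unresolved defers any network access until a connection is actually opened.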
7. Quality Attributes
1. Each subproject will have a dedicated test phase for the requirements of that project.
2. The GENEsYs Comparison subproject is a quality test on the results.
3. Code will have Javadoc comments describing the functions and parameters.
4. User Manual will be included in the final documentation.
8. External Interfaces
1. The project must have a command-line interface capable of working with the console
available in the WEKA Explorer CLI.
II. Requirements Specification
1. The project must interface with existing data structures and processes in the WEKA library.
2. The project must emulate the interface for the GENEsYs library. (See Document 1.1)
3. The project must use a Mersenne Twister algorithm for its pseudo-random number
generation. (See Document 2.1)
4. The project must partition population processing over parallel capable machines. These
machines must work without user interaction, with the exception of a configuration file.
There must be communication occurring between these machines. (See Document 3.1)
5. The genetic operators available in GENEsYs must be implemented accurately. (See
6. There must be a comparison to the original GENEsYs library. (See Document 5.1)
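Requirement 4 asks for the population to be partitioned across machines that communicate with each other. One common way to meet both conditions is an island-style scheme with periodic migration; the sketch below shows that idea on fitness values only. The scheme itself, the ring topology, and all names are assumptions, not the design chosen for this project, and the "machines" are simulated as arrays in one process.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of partitioning plus migration; not the project design.
final class PopulationPartitionSketch {

    /** Splits indices 0..size-1 into nearly equal contiguous blocks of the form [from, to). */
    static List<int[]> partition(int size, int nodes) {
        List<int[]> blocks = new ArrayList<>();
        int base = size / nodes;
        int extra = size % nodes; // the first 'extra' nodes get one more individual
        int start = 0;
        for (int n = 0; n < nodes; n++) {
            int len = base + (n < extra ? 1 : 0);
            blocks.add(new int[] { start, start + len });
            start += len;
        }
        return blocks;
    }

    /**
     * Ring migration: the best fitness on node n overwrites the worst fitness
     * on node n+1 (wrapping around). Every node must hold at least one value.
     */
    static void migrateRing(double[][] fitnessByNode) {
        int k = fitnessByNode.length;
        double[] best = new double[k];
        for (int n = 0; n < k; n++) {
            best[n] = fitnessByNode[n][0];
            for (double f : fitnessByNode[n]) {
                best[n] = Math.max(best[n], f);
            }
        }
        for (int n = 0; n < k; n++) {
            double[] target = fitnessByNode[(n + 1) % k];
            int worst = 0;
            for (int i = 1; i < target.length; i++) {
                if (target[i] < target[worst]) {
                    worst = i;
                }
            }
            target[worst] = best[n];
        }
    }
}
```

The migration step is the inter-machine communication that keeps such a scheme from being embarrassingly parallel; in a distributed setting each array would live on a different machine and the overwrite would be a network message.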