Try the all-new QuickBooks Online for FREE.  No credit card required.

BMI leukemia

Document Sample
BMI  leukemia Powered By Docstoc
					BMI 214
Spring 2008

Assignment 2:
Machine learning for expression data and Genotype-phenotype associations

Name:             _________________________

SUNet ID:         _________________________


Part 1: Weka for machine learning in expression data
      a. Introduction
      b. Supervised learning
      c. Feature selection
      d. Unsupervised learning

Part 2: Genotype-phenotype associations
      a. Introduction
      b. Feature selection

Submission: Send your answers as a PDF document to

Note: Please show work when appropriate, and/or cite sources (e.g. for problem 20,
site source used to obtain your answer.)

Part 1: Weka for machine learning in expression data
Part 1a: Introduction
Download / familiarize yourself with Weka.

As discussed in class, Weka is a useful tool that has implemented most of the major
machine learning algorithms. In this part you will re-live Project 2, but now let Weka do
the work. It will be both a tutorial and problem set.

We will be using the Weka GUI in this assignment. There are also an excellent
command line interface and Java API, which will not be covered. Read more in the
documentation if interested.

Download and start the Weka GUI. Follow the instructions on the Weka site:

You will see four buttons; in this class we will only use the Explorer functionality.

There is much Weka documentation available. You may familiarize yourself with
Explorer as much as you like by reading the user guide and using their provided sample
datasets. Some good (optional) starting points:
Explorer guide:
Weka Wiki:

Part 1b: Supervised learning

We will be using the same leukemia data as in Project 2. Recall that that dataset
comprised 72 leukemia patients, 28 of which were AML and 44 of which were ALL.
Each patient had expression measurements for 7129 genes.

We have compiled these data into a Weka-compatible CSV file.

This file has the 72 leukemia patients (rows) and expression values for 150 genes
(columns). This data matrix is slightly altered from the one in Project 2: it is transposed,
and we chose a subset of 150 genes to make your life easier. The gene and experiment
names, as well as a link to the reference paper, are available in Project 2.

Open the leukemia.csv file in Weka.

Question 1. What is the mean value of expression of the gene labeled “CD33
CD33 antigen (differentiation antigen)”?

Now we will run KNN. Go to the “classify” tab. Under “Classifier” click the “Choose”
button. Expand the “lazy” menu (yes, we use only the best!). Choose “IBk” and OK.
This is KNN: IBk stands for Instance-Based k. Now next to the button it should say
“IBk” with some parameters. Click on this text and a menu will pop up. For “KNN”, enter
5. Recall that this means the algorithm will use the five nearest neighbors to classify
each data point. Leave the rest of the values as default.
      FYI: KNN is a type of instance-based learning, also known as lazy learning, in
      which the function is only approximated locally and all computation is deferred until
Under “Test options” choose the radio button “Cross-validation” and under “Folds” enter
5. The dropdown menu below Test options should say “(Nom) leukemia_type”. This
means that the algorithm will classify “leukemia_type” (AML or ALL), using the
experiments as attributes.

Click the “Start‟ button. The main window will show a variety of result summary
statistics, such as accuracy, true positives, false positives, and a confusion matrix.

Question 2a. What is the % of correctly classified instances?

ROC curves, as discussed in lecture, illustrate the tradeoffs between sensitivity and
specificity. Roc curves plot Sensitivity vs. 1 – Specificity, or TP (true positive) rate vs.
FP (false positive) rate. Recall that

TP rate = Sensitivity = TP / (TP + FN)
FP rate = 1 – Specificity = FP / (TN + FN)
Specificity = 1 – FP rate = TN / (FP + TN)

Where TP = number of true positives and FP = number of false positives.

Another way to think about it is:
TP rate = # true positives / # of positive instances
          (The true positive rate is the proportion of actual positive instances that were
          correctly reported as being positive.)

FP rate = # false positives / # of negative instances
          (The false positive rate is the proportion of negative instances that were
          erroneously reported as being positive.)

Question 2b. What are the TP and FP rates for ALL and AML?
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

Optional: Right click on your result in the “Result list” on the left side of the screen.
Choose “visualize threshold curve” and “ALL”. An ROC curve plots true positive (TP)
rate vs. false positive (FP) rate, which are the defaults. You can also view other types of
curves by clicking the dropdown menus. For example, precision-recall curves are an
alternative to ROC curves; precision and recall are options in the dropdown menu.

Recall that we can trivially achieve a TP rate of 1 for AML by classifying all of the
patients as positive (AML). That would correspond to a TP rate of 0 for ALL, since none
of the patients could then be classified ALL.

Question 3. Now we’ll try a different classification algorithm, ZeroR. Click the
Choose button under Classifier, and expand the “rules” folder. Choose “ZeroR”.
Again use cross-validation with Folds=5. Run it.

% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

Explanation: ZeroR is a baseline classifier that simply identifies the class that is most
abundant (in this case, ALL with 44 patients), and predicts all variables (patients) to be
in that class. This illustrates that when the variables are unevenly assigned to classes, it
can be a problem to simply look at accuracy. For example, if you have a dataset of 100
variables (patients), where 90 were ALL, and 10 were AML, then ZeroR will predict them
all to be ALL, with 90% correctly classified (90% accuracy). Another classifier might
have 85% accuracy, which looks pretty good until you compare it with ZeroR. You can
get around the problem of uneven class sizes by weighting, a topic which we won‟t cover
here. (But check out the “CostSensitiveClassifier” and “Cost Sensitive evaluation” if
you‟re interested.)

Stop for a moment and think about the meanings of each of these terms (ALL TP rate,
ALL FP rate, AML TP rate, AML FP rate), and why you got the answers you wrote down,
when using ZeroR.

We‟ll try a few more algorithms on this dataset.

Question 4. Under the “bayes” Classifier folder, choose “NaiveBayes” and run.

% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

Question 5. Under the “functions” Classifier folder, choose “SMO” (a popular
implementation of SVM) and run.

% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

Question 6. Under the “trees” Classifier folder, choose “ADTree” (a popular
implementation of a decision tree) and run.

% correctly classified instances =
ALL TP Rate =
ALL FP Rate =
AML TP Rate =
AML FP Rate =

You can read about the various other algorithms in the documentation if interested.

Part 1c: Feature selection (aka attribute selection)

We will now use feature selection algorithms to extract the most “informative” genes for
classifying AML vs ALL. As mentioned in lecture, this is a useful real world exercise
because many times we are interested not in measuring all the gene expressions, but in
finding one or a few whose expression level can be used as a marker for the disease.
The general area of “Biomarkers” is economically important because it may lead to
diagnostic tests that can be marketed to clinical laboratories.
If interested, you can learn more about the theory behind feature selection in machine
learning textbooks and online resources. Wikipedia is always a good place to start.

First we will choose a subset of five genes arbitrarily – the first five. To do this in Weka,
go back to the “Preprocess” tab and click the checkbox next to the first five genes (Zyxin
through RNS2). Scroll down to the bottom of the gene list and click “leukemia_type”,
which is the class label (AML or ALL). Check this box, too, so that there are 6 total
boxes checked. Click the “Invert” button above the list, so that the other 145 boxes
become checked. Click the “Remove” button below the list, so that you are left with 5
genes and leukemia_type.

Go to the “Classify” tab again. Classify using IBk and K = 5.

Question 7. What is the % correctly classified instances using just the first five

Re-load the dataset with “Open file…” to get all 150 genes back.

We‟ll now find five genes that are “informative” according to some metric. Choose the
“Select attributes” tab at the top of the screen. Under „Attribute Evaluator”, choose the
algorithm called “InfoGainAttributeEval” and Search Method “Ranker” and Start. The
output shows the genes ranked by information gain, which is one possible metric for
measuring the “goodness” of a partition. The genes at the top of the list are the most
“informative” as attributes for classification.

Question 8. What are the top five genes in this output?

Back under “Preprocess”, remove all attributes except the top five genes from the
previous question, in the same manner as described above, so that you are left with the
five genes from Question 8 and the class attribute leukemia_type. Under “Classify”, run
the exact same algorithm as before (IBk, K=5).

Question 9. What is the % of correctly classified instances?

Part 1d: Unsupervised learning
K-means clustering

This part will use the same dataset that you used for Project 2 k-means clustering.
Recall that that dataset comprised expression profiles for 2467 genes across 79
Download a Weka-compatible CSV file at

The rows are genes and the columns are experiments. The gene and experiment
names, as well as a link to the reference paper, is available on the Project 2 instructions.

In this particular file, we have added one additional column called “ribosomal” with value
Y or N, to indicate whether a gene is ribosomal.

Question 10. What is the mean value of expression of the experiment labeled
“alpha 21”?

Now we will run K-means clustering. Go to the Cluster tab. Under Clusterer choose
SimpleKMeans. Click the text “SimpleKMeans” and set numClusters to 2. “Cluster
mode” should be “Use training set”. Run it.

Question 11. How many genes in each of the two clusters?

An optional exercise for those interested -- If you want to see which genes were
assigned to which clusters, right-click on the result in the Result list and choose
“Visualize cluster assignments”. Click the Save button and choose a location. This will
save a file in ARFF format (the Weka format), where the last value on each line will be a
cluster assignment. (See the Weka documentation for more on ARFF format).

Question 12. Recall that K-means depends on a random starting seed. Change
this seed by clicking on the text “SimpleKMeans” and changing the “seed” value
to 15. How many genes are now in the two clusters?

Question 13. Try adjusting the seed value to various values (e.g. 1.2, 5, 20, 100).
How much do the numbers of genes in each cluster change? What might this say
about the genes and their assignments to clusters? (I.e. are the assignments
robust or weak?)

Question 14. Now under “Cluster mode”, choose “Classes to clusters evaluation”
and run. How many of the 121 ribosomal genes are assigned to the same cluster?

Question 15. How many other genes are assigned to that cluster?
(If you were interested in what those genes were, you might save the ARFF output file
as described above, extract the genes in that cluster, and enter them in the GO Term
Finder, as discussed in Project 2, to get a feel for their functions. That‟s left as an
optional exercise.)

Question 16. For completeness, we’ll now classify the yeast data. Under the
Classify tab, again choose the IBk algorithm, and use KNN=5. Use cross-
validation with Folds=5 and run.

% correctly classified instances =
Class N TP Rate =
Class N FP Rate =
Class Y TP Rate =
Class Y FP Rate =

Question 17. Run as in the previous question but using trivial classifier ZeroR,
which was discussed above.

% correctly classified instances =
Class N TP Rate =
Class N FP Rate =
Class Y TP Rate =
Class Y FP Rate =

Part 2:
Part 2a: Introduction

Researching and answering these introductory questions will help you understand the
genotype-phenotype section below.

Question 18. What does diploid mean? How many copies of every chromosome
are there in one cell of a diploid organism? How many copies of every gene?

Question 19. What is an “allele”, with respect to a single SNP? How many alleles
are theoretically possible for one SNP locus? Usually, how many alternate forms
actually arise in life for one SNP locus?

Counting the number of SNPs in the human genome depends on your definition for a
SNP. For example, there may be some bases where only a few people in the world
have a variant allele. For most purposes, it is not useful to call these bases SNPs.
Thus, to define SNPs, we refer to the “minor allele frequency”, which is the percent of
the population having the uncommon allele. The minor allele frequency can range from
0 to 50%.

Question 20. Approximately how many SNPs are there in the human genome (i.e.
how many bases of the ~3 billion base pairs in the human genome are SNPs),
assuming at least a 1% minor allele frequency? Explain your answer.

Question 21. SNPs are one type of genetic variation – what other types exist?

Question 22. What is penetrance, in terms of phenotype? What does 50%
penetrance mean? 100%? (three sentences or less)

Question 23. What is a haplotype? How is it related to linkage disequilibrium?
(three sentences or less)

Part 2b: Feature selection (aka attribute selection)


With genome sequencing cost drastically falling, we will soon be able to sequence
millions of people cheaply. With this wealth of new data, we will be able to associate
genotypes with phenotypes (e.g. diseases or drug responses). The goal is to find the
genes (or more specifically, the SNPs) that are found in conjunction with certain
phenotypes. We will be using a semi-synthetic dataset to explore this type of study.
The dataset can be found at

Dataset description:

Many past genotype-phenotype association studies only examined a single gene
hypothesized to cause the phenotype (disease). In reality, there are complex
interactions of genes that cause disease. We will use machine learning to find such
complex associations, which is a very cutting-edge research area.

We will use Weka for this section. Open the file “genotenureitis1.arff”. Instead of CSV,
this file is in Weka‟s ARFF format, which is described at

This file contains genotype/phenotype data for 557 subjects (people). Each row in the
data section represents one subject (person).

The (comma-delimited) columns are the attributes of the subjects. The first column is
the person‟s identifier.

The next 188 columns represent genotypes, which in this case are SNPs (single
nucleotide polymorphisms). The SNPs were identified from three regions in the
genome. The SNPs are named
where j is the region (1,2, or 3), and i is the SNP number within that region. The values
each SNP can take are {11, 12, 22}.

Recall that humans are diploid. In this file, a value of “11” for a subject‟s SNP means
that the both copies are allele 1 (where allele 1‟s possible values are the nucleotides
{A,T,G,C}; the actual allele isn‟t specified.). “12” means that one was allele 1, and the
other was allele 2. “22” means that both were allele 2. (Allele 2 would be a nucleotide
other than allele 1.)

The last 20 columns in the file represent (artificially created) phenotypes. Some of the
phenotypes, for example, are
   - “degree”, with possible values MD, MD/PhD, or PhD
   - “npubs” (number of publications), with possible values 4 through 12.
   - “gotgrants” (total grant dollars earned, in millions), with possible values 0 through

Note that the genotype data are real, but the phenotype data are artificial. A more
lengthy (and far more humorous) description of the phenotypes can be found here:
This is an optional read; it is not necessary for understanding this assignment. This
document and the genotenuritis data file were created for a recent pharmacogenetics

Data filtering

Back to Weka and genotenureitis1.arff. Scroll to the bottom of the attribute names.
Examine the distributions for the attribute “irep” by clicking on it.

Question 24. How many class labels exist for the attribute irep?

The attribute irep is not useful to us; we will remove it and other useless fields with a
filter. In the “Preprocess” tab, under “Filter”, choose unsupervised->attribute-
>RemoveUseless, and click Apply. This will remove the “irep” attribute and the “ID”
attribute(which actually appeared twice – attribute number 1 and 192). You can undo
operations like this one by clicking the Undo button at the top.
If you don‟t apply this filter, you may get erroneous results in the following sections.

Genotype-genotype associations

First we‟ll examine whether SNPs are associated with each other. Recall the discussion
in class about haplotypes and linkage disequilibrium.

Question 25. Under the “Associate” tab, run the default associator (Apriori).
Copy here the output under “Best rules found:” (This should be nine SNP
association pairs and one phenotype-SNP pair.)

Question 26. Within all but one SNP-SNP pair, the two SNPs have something in
common. What is it, and what does it mean biologically?
Hint: look at the SNP names and the explanation of SNP names above.
(3 sentences or less.)

Attribute selection

We now perform attribute selection to find which SNPs best predict some of the

Phenotype 1: gotgrants

The first phenotype we‟ll investigate is “gotgrants”. We want to know whether there are
any SNPs related to the amount of grant money earned.

We disclose that the first phenotype was created artificially as a simple linear function of
the penetrance of a single SNP. You will try to discover what this SNP was.

Choose “gotgrants” from the dropdown menu on the left. Under the “Select attributes”
tab, choose the GainRatioAttributeEval evaluator with Ranker Search Method. Use
cross-validation with Folds=5. Run.

Question 27. What is the top SNP (genotype) in this feature selection result?

Question 28. Run multiple Attribute Evaluators with multiple Search Methods; is
the top gene the same in general?
(Note that some Evaluators require specific Search Methods; see the error log and/or
documentation if you get funny results.)
Question 29. Look at the top 80 or so SNPs. Do you see anything interesting
about them? What does this mean biologically? (3 sentences or less.)

Phenotype 2: pctdrivel

The next phenotype is “pctdrivel”; again, we‟ll investigate whether there are any SNPs
that are good predictors of this phenotype.

This phenotype was created artificially as a function of two other phenotypes. The
inclusion of other phenotypes represents the fact that environment can impact
phenotype, not just genetic makeup (cf. the nature vs. nurture debate).

Run multiple Attribute Evaluators with multiple Search Methods.

Question 30. What do you think are the two phenotypes affecting pctdrivel and
why? A well-justified answer will be accepted. (3 sentences or less.)


Phenotype 3: rivalside

This phenotype is very difficult to decipher; doing so will earn 5% extra credit.

This phenotype was created artificially as a more complex function of one other
phenotype and six SNPs.

What attributes do you think are affecting rivalside and why?