```
Homework 1
Data Mining, CS 277, Winter 2010

Due Date: in class, Thursday January 14th

Class Web Page
http://www.ics.uci.edu/~smyth/courses/cs277/

Problem 1
An outlier is a data point that is sometimes defined as “an extreme observation that does not
seem consistent with the rest of the data.” Outliers can arise from measurement or data
entry errors, or they can be true measurements of a particularly unusual object or person. The
interpretation of outliers can be subjective and context-dependent. One person’s outlier may be
another person’s Nobel prize!
In data mining it is always a useful exercise to check for outliers before embarking on any analysis
of the data, since the presence of outliers can significantly skew the results and interpretation of
any analysis.

Part A
Say we have a large medical data set with many variables measured on many patients. For simplicity
assume that all the variables are real-valued. Imagine we look at a single variable in the data set,
and we estimate the mean and the median, where the mean is the arithmetic average of the values
and the median is defined as the value for which 50% of the values are higher and 50% are lower.
How sensitive is each of these estimates to outliers? Feel free to use a sketch/plot and/or some

Part B
Describe briefly an automated scheme (i.e., an algorithm) for identifying potential outliers for the
case where you are only considering a single variable at a time. You can propose your own method,
or do some research and discuss a method that is already known in the literature.

Part C
Consider part B again, but now where you want to identify outliers in multi-dimensional space, e.g.,
by simultaneously considering all (e.g., 10 or 20 or more) variables. Briefly describe an algorithm
for detecting outliers in a multi-dimensional real-valued space. Discuss the potential strengths and
weaknesses of your scheme. Again you can either invent your own method, or you can do some
research and describe a known method (provide references).

Part D
Given your answers to the questions above, do you think that it is reasonable to expect that the
problem of outlier detection can be completely automated? Is it possible to detect outliers without
having a prior model or set of expectations for the data? Discuss your answer.

Problem 2: MATLAB Exercise
To do this part of the assignment you first need to make sure you can run MATLAB. If you have
not used MATLAB before, please take some time to go through some of the basic MATLAB tutorial
material. Please consult the class Web page for details on where to find MATLAB on the UCI
campus and how to access tutorial materials.
Once you have gone through the tutorial material, download the Zip file with the MATLAB data
and scripts for Homework 1 from the class Web page.

A. Demo Scripts
These demos are intended to illustrate some basic aspects of simple exploratory data analysis and
to provide you with examples of MATLAB’s computational and plotting capabilities. For
more information on the Pima Indians data set and the Census (also known as “Adult”) data set,
please visit the UCI Machine Learning Archive Web site.
To begin the first demo, type pimademo at the command line in MATLAB; a script will be
executed to illustrate some basic data analysis and visualization capabilities in MATLAB, using
the Pima Indians data set. Please go through this carefully and make sure you understand what is
being plotted at each step.
Now type censusdemo at the command line in MATLAB and go through the demo for the
Census data.
Feel free to explore other aspects of the data beyond what is shown in the demo, e.g., call the
functions in the script with other variables, or use other visualization functions in MATLAB to look
at the data (bar charts, pie charts, etc.).

B. Data Exploration
Find two different data sets that seem interesting to you (e.g., on the Web, from a research project
you are involved in, etc.). A useful source of (relatively small) data sets is the UCI Machine Learning
Repository.
For each of the data sets do the following:

• Figure out how to load the data into MATLAB (MATLAB can read a variety of file formats,

• Find 3 interesting histograms (for any 3 variables in each data set), plot them, and briefly
discuss why they are interesting. To do this you will have to figure out how to call the hist.m
function and/or the dmhist.m and dmhist2.m functions used in the scripts for the demo.

• Find 2 interesting scatter plots in each data set (for 2 pairs of variables in each data set), plot
them, and briefly discuss why they are interesting. Again you will need to use dmscatter.m
and/or dmscatter2.m for this purpose (or write your own simple functions if you wish).

```
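The definitions of the mean and the median given in Problem 1, Part A can be made concrete with a small sketch. The example below is written in Python rather than MATLAB, purely for illustration. The sample values and the `summarize` helper are made up for this sketch and are not part of the course materials. It computes both statistics for a small sample, then again after one extreme value has been added.

```python
from statistics import mean, median

# Hypothetical sample: five made-up measurements of a single variable.
values = [48, 51, 50, 49, 52]

# The same sample plus one made-up extreme value (e.g., a data entry error).
values_with_extreme = values + [500]


def summarize(xs):
    """Return (mean, median) of xs, following the definitions in Part A."""
    return mean(xs), median(xs)


print("original sample:   ", summarize(values))
print("with extreme value:", summarize(values_with_extreme))
```

Comparing the two printed pairs is one possible starting point for the discussion that Part A asks for; a sketch or plot of the same values would serve equally well.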