                                 Homework 1
                         Data Mining, CS 277, Winter 2010

                     Due Date: in class, Thursday, January 14th


Class Web Page
http://www.ics.uci.edu/~smyth/courses/cs277/





Problem 1
An outlier is a data point that is sometimes defined as “an extreme observation that does not
seem consistent with the rest of the data.” Outliers can arise because of measurement or data
entry errors, or they can be true measurements of a particularly unusual object or person. The
interpretation of outliers can be subjective and context-dependent. One person’s outlier may be
another person’s Nobel Prize!
    In data mining it is always a useful exercise to check for outliers before embarking on any analysis
of the data, since the presence of outliers can significantly skew the results and interpretation of
any analysis.

Part A
Say we have a large medical data set with many variables measured on many patients. For simplicity
assume that all the variables are real-valued. Imagine we look at a single variable in the data set,
and we estimate the mean and the median, where the mean is the arithmetic average of the values
and the median is defined as the value for which 50% of the values are higher and 50% are lower.
   How sensitive is each of these estimates to outliers? Feel free to use a sketch/plot and/or some
simple equations to provide an example to justify your answer.
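
   As a possible starting point for experimenting with this question, the short MATLAB sketch
below computes both statistics on a small vector, with and without one extreme value appended.
The data values and the variable names x and x_out are made up for illustration and do not come
from any data set used in the class.

```matlab
% Hypothetical toy data: five made-up measurements (not from a real data set).
x = [2 3 4 5 6];

% The same data with one extreme value appended, e.g., a data entry error.
x_out = [x 600];

% Compare each statistic before and after the extreme value is added.
fprintf('mean:   %.2f -> %.2f\n', mean(x), mean(x_out));
fprintf('median: %.2f -> %.2f\n', median(x), median(x_out));
```

Varying the size of the extreme value, or the number of such values, is one way to build the
example that the question asks for.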

Part B
Describe briefly an automated scheme (i.e., an algorithm) for identifying potential outliers for the
case where you are only considering a single variable at a time. You can propose your own method,
or do some research and discuss a method that is already known in the literature.

Part C
Consider Part B again, but now suppose you want to identify outliers in multi-dimensional space, e.g.,
by simultaneously considering all (e.g., 10 or 20 or more) variables. Briefly describe an algorithm
for detecting outliers in a multi-dimensional real-valued space. Discuss the potential strengths and
weaknesses of your scheme. Again you can either invent your own method, or you can do some
research and describe a known method (provide references).

Part D
Given your answers to the questions above, do you think that it is reasonable to expect that the
problem of outlier detection can be completely automated? Is it possible to detect outliers without
having a prior model or set of expectations for the data? Discuss your answer.

Problem 2: MATLAB Exercise
To do this part of the assignment you first need to make sure you can run MATLAB. If you have
not used MATLAB before, please take some time to go through some of the basic MATLAB tutorial
material. Please consult the class Web page for details on where to find MATLAB on the UCI
campus and how to access tutorial materials.
   Once you have gone through the tutorial material, download the Zip file with MATLAB data and
scripts for Homework 1 from the class Web page.

A. Demo Scripts
These demos are intended to illustrate some basic aspects of simple exploratory data analysis and
to provide you with examples of MATLAB’s computational and plotting capabilities. For
more information on the Pima Indians data set and the Census (also known as “Adult”) data set,
please visit the UCI Machine Learning Archive Web site.
    To begin the first demo, type pimademo at the command line in MATLAB. A script will be
executed to illustrate some basic data analysis and visualization capabilities in MATLAB, using
the Pima Indians data set. Please go through this carefully and make sure you understand what is
being plotted at each step.
    Now type censusdemo at the command line in MATLAB and go through the demo for the
Census data.
    Feel free to explore other aspects of the data beyond what is shown in the demo, e.g., call the
functions in the script with other variables, use other visualization functions in MATLAB to look
at the data (e.g., bar charts, pie charts, etc.).

B. Data Exploration
Find two different data sets that seem interesting to you (e.g., on the Web, from a research project
you are involved in, etc.). A useful source of (relatively small) data sets is the UCI Machine Learning
Repository.
   For each of the data sets do the following:

   • Figure out how to load the data into MATLAB (MATLAB can read a variety of file formats,
     see functions such as load.m, fread.m, textscan.m, dlmread.m, xlsread.m, etc.).

   • Find 3 interesting histograms (for any 3 variables in each data set), plot them, and briefly
     discuss why they are interesting. To do this you will have to figure out how to call the hist.m
     function and/or the dmhist.m and dmhist2.m functions used in the scripts for the demo.

   • Find 2 interesting scatter plots in each data set (for 2 pairs of variables in each data set), plot
     them, and briefly discuss why they are interesting. Again you will need to use dmscatter.m
     and/or dmscatter2.m for this purpose (or write your own simple functions if you wish).
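
   The three steps above can be sketched with built-in MATLAB functions only. This is a minimal
illustration under stated assumptions, not the method used in the demo scripts: the data are
synthetic, the file is a temporary comma-separated file created just for the example, and the
variable names (a, b, fname, X) are hypothetical. The course functions dmhist.m and dmscatter.m
are not used here because their calling conventions are defined in the Homework 1 scripts.

```matlab
% Synthetic stand-in for a downloaded data set (all values are made up):
% 200 rows, two numeric columns with a roughly linear relationship.
n = 200;
a = randn(n, 1);
b = 2 * a + 0.5 * randn(n, 1);

% Write the data to a temporary comma-separated file, then read it back,
% to mimic loading a delimited text file from disk.
fname = [tempname() '.csv'];
dlmwrite(fname, [a b]);        % default delimiter is a comma
X = dlmread(fname, ',');       % X is an n-by-2 numeric matrix
delete(fname);                 % clean up the temporary file

% Histogram of the first column with 20 bins.
figure;
hist(X(:, 1), 20);
xlabel('column 1'); ylabel('count');
title('Histogram of column 1');

% Scatter plot of column 2 against column 1.
figure;
plot(X(:, 1), X(:, 2), '.');
xlabel('column 1'); ylabel('column 2');
title('Column 2 versus column 1');
```

For a real data set, the file-writing lines would be dropped and fname would point to the
downloaded file. Files with text columns or header rows generally need textscan.m or a similar
function rather than dlmread.m.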

						