COMPUTE THIS
DESCRIPTION: This event integrates personal computing (PC) technology and quantitative data analysis using Excel. Teams are presented with a problem that requires quantitative data capture and the organization and presentation of data in a graphical format.

Data Mining Computing Event

A TEAM OF UP TO: 2
APPROXIMATE TIME: 75 Minutes

THE COMPETITION:

1. During the competition, each team will be provided with a single IBM-compatible PC with word processing (MS Word), spreadsheet (MS Excel), and WWW browser (MS Internet Explorer) software, and Internet access.

2. Teams will be given two data files, referred to as *_train.xls and *_test.xls, corresponding to a classification problem of interest. The *_train.xls file is the training data set that teams use to design the decision tree classifier. The *_test.xls file is the data set that teams use to test the accuracy of the decision tree classifier. Each file contains a number of rows. Each row contains numerical values for the features of one datum; the last element of the row is the class to which that datum belongs.

3. The problem statement will require the summarization of the information contained in the data. For instance, teams may be asked to calculate average values and/or standard deviation values for each of the data features contained in *_train.xls and/or *_test.xls. Teams may also be required to produce two-dimensional graphs that plot the values of two features against each other, with the points colored according to the class of each data point.

4. The problem statement will also include up to five (5) short answer questions. At least one question will require you to summarize the data (as explained above). One question will ask you to calculate the best feature value with which to split the training data (*_train.xls) for the first split of a decision tree classifier. One question will ask you to calculate the best feature value with which to split the training data (*_train.xls) for the second split of the decision tree classifier. One question will ask you to draw the decision tree classifier that you have discovered, using the split values that you identified in Questions 2 and 3. Finally, the last question will ask you to test the performance of the decision tree classifier that you designed in Question 4, using some of the data in your *_test.xls data set.

All the calculations required here can be performed using Excel functions. You should show all your calculations in your MS Word report file. In the same file you should also report how well your decision tree classifier performed and discuss its performance. Division C teams may also be asked to use specific statistical functions in Excel (AVERAGE, MEDIAN, FREQUENCY, STDEV, etc.) to further analyze the data that they have collected. At the national tournament only, Division C teams may also be required to develop Excel VBA code as part of a short answer question.

5. Teams will construct an MS Excel (.xls) file that contains the data tables and graphics associated with the problem, and an MS Word (.doc) file that contains the answers and appropriate explanations for the answers. The event supervisor will specify how these files are to be submitted at the conclusion of the event. Teams should include their school name and team number (as appropriate) within both files to ensure proper identification by the event supervisor.

6. No resource materials (e.g., reference books, cards, etc.) or handheld calculators may be used during the competition. Blank tablet paper and writing instruments may be used to help teams organize their thoughts, if desired. Teams may also use any publicly accessible WWW search engine or resource (e.g., Google) to locate information, as they see fit. However, during the event, no external communication is permitted with other individuals via e-mail, chat rooms, or other forms of collaborative computing; the penalty for an infraction of this nature will be immediate disqualification.
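
The data summarization described in item 3 can be prototyped outside Excel as a cross-check. The following is only a minimal Python sketch, not part of the event rules: the rows, the function names summarize and group_by_class, and the class labels "A" and "B" are all hypothetical. It assumes each row lists the feature values first and the class last, as item 2 describes, and it uses the sample standard deviation, which is what Excel's STDEV computes.

```python
import statistics

# Hypothetical training rows: feature values first, class label last,
# mirroring the row layout described for *_train.xls.
rows = [
    (2.0, 10.0, "A"),
    (4.0, 14.0, "A"),
    (6.0, 30.0, "B"),
    (8.0, 34.0, "B"),
]


def summarize(rows):
    """Return {feature_index: (mean, sample_stdev)} over all rows."""
    n_features = len(rows[0]) - 1  # the last column is the class
    summary = {}
    for j in range(n_features):
        column = [row[j] for row in rows]
        # statistics.stdev divides by n - 1 (sample standard deviation),
        # matching Excel's STDEV.
        summary[j] = (statistics.mean(column), statistics.stdev(column))
    return summary


def group_by_class(rows, x, y):
    """Group (feature x, feature y) points by class.

    Each group would become one differently colored series in a
    two-dimensional scatter chart.
    """
    groups = {}
    for row in rows:
        groups.setdefault(row[-1], []).append((row[x], row[y]))
    return groups


summary = summarize(rows)
series = group_by_class(rows, 0, 1)
```

Grouping the points by class first mirrors how a spreadsheet scatter chart is usually built: one data series per class, each with its own color.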

SCORING: High score wins, based on:

1. Completeness and Accuracy of Data Summarization: 20 Pts.
2. Completeness, Accuracy, and Format of Graphical Data Presentation: 30 Pts.
3. Answers and explanations for the answers provided: 50 Pts.

The tiebreakers shall be: (1) the number of short answer questions correctly answered; (2) the overall quality of graphical data presentation.
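
The split search and accuracy test described in item 4 of THE COMPETITION can be illustrated with a short Python sketch. The rules do not say which criterion defines the "best" split, so this sketch assumes Gini impurity and takes candidate thresholds at the midpoints between consecutive distinct feature values; the event supervisor may expect a different criterion. The data and all function names are hypothetical, and class labels are assumed to be strings.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows):
    """Return (weighted_gini, feature_index, threshold), or None.

    A row goes left when row[feature] <= threshold.
    """
    best = None
    for j in range(len(rows[0]) - 1):  # the last column is the class
        values = sorted({row[j] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [r[-1] for r in rows if r[j] <= t]
            right = [r[-1] for r in rows if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, j, t)
    return best


def build_tree(rows, depth=2):
    """Build a tree of nested tuples (feature, threshold, left, right)."""
    pure = len({r[-1] for r in rows}) == 1
    split = best_split(rows) if depth > 0 and not pure else None
    if split is None:
        # Leaf: predict the most common class among the remaining rows.
        return Counter(r[-1] for r in rows).most_common(1)[0][0]
    _, j, t = split
    left = [r for r in rows if r[j] <= t]
    right = [r for r in rows if r[j] > t]
    return (j, t, build_tree(left, depth - 1), build_tree(right, depth - 1))


def predict(tree, row):
    """Walk the tree until a leaf (a class label string) is reached."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


def accuracy(tree, rows):
    """Fraction of rows whose predicted class equals the true class."""
    return sum(predict(tree, r) == r[-1] for r in rows) / len(rows)


# Hypothetical stand-ins for *_train.xls and *_test.xls.
train = [
    (1.0, 2.0, "A"), (2.0, 8.0, "A"), (3.0, 6.0, "A"), (4.0, 7.0, "A"),
    (7.0, 1.0, "A"), (8.0, 3.0, "A"),
    (6.0, 7.0, "B"), (7.0, 9.0, "B"), (9.0, 6.0, "B"),
]
test = [(2.0, 9.0, "A"), (8.0, 2.0, "A"), (7.0, 8.0, "B"), (6.0, 6.0, "A")]

tree = build_tree(train, depth=2)
test_accuracy = accuracy(tree, test)
```

On this hypothetical data the first split is on feature 0 at 5.0 and the second is on feature 1 at 4.5, giving the tree (0, 5.0, "A", (1, 4.5, "A", "B")). It classifies three of the four test rows correctly, for an accuracy of 0.75.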