671 LAB ASSIGNMENT II
The due date for this lab assignment is Monday (midnight) Nov 30th.

The goal is to find frequent 3-itemsets in a transactional database
(www.cse.ohio-state.edu/~srini/671/671.data) and derive some interesting rules from this
database. You may work on this project in teams of two. The format for this database is as follows:

Column 1: Customer Identifier [YOU NEED TO IGNORE THIS ONE]
Column 2: Transaction Identifier [Note that in this dataset columns 1 and 2 are identical, and
you will need only the Transaction Identifier]
Column 3: Number of items purchased in this transaction (a value between 1 and K)
Columns 4 onward (up to column K+3): The item id of each item purchased, one per column.
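
As an illustration of the record layout above, the following sketch parses one line into a transaction ID and its list of items. It assumes the fields are whitespace-separated integers, which the handout does not state; `parse_transaction` is a hypothetical helper name, and the sample record is made up.

```python
def parse_transaction(line):
    """Parse one record: customer id, transaction id, item count, then the items.

    Assumes whitespace-separated integer fields (an assumption, not something
    the handout specifies).
    """
    fields = [int(tok) for tok in line.split()]
    tid = fields[1]               # column 2: transaction identifier (column 1 is ignored)
    count = fields[2]             # column 3: number of items in this transaction
    items = fields[3:3 + count]   # columns 4 .. count+3: the item ids
    if len(items) != count:
        raise ValueError("record has fewer items than its count field says")
    return tid, items


# A made-up record: customer 7, transaction 7, three items.
print(parse_transaction("7 7 3 12 45 900"))
```

Reading the count field first means trailing whitespace or padding after the last item does not affect the result.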

There are 1000 items (0-999 item ids) in this database and there are 1000 transactions in this
database.

I have broken down this assignment into two parts to help you gauge progress. For part one you
are required to first import the data from the file and compute all frequent 2-itemsets that meet a
user-specified minimum support. The minimum support must be given as a command-line
argument of the form “-support 0.1”. Here 0.1 indicates a support of 0.1, or 10% of the
database, so only those item pairs that occur in 100 or more transactions should be
reported. Similarly, if we use “-support 0.01”, only those pairs that occur 10 or more
times should be stored and reported in the following format:

[Item ID1 Item ID2 <support value>]

For example, if you were to print out information from part one, the sample output would look like:
0 1 200
0 2 101
50 100 240

This indicates that itemset [item 0 and item 1] co-occurred 200 times, itemset [item 0 and item 2]
co-occurred 101 times, and itemset [item 50 and item 100] co-occurred 240 times.
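
One way to turn the “-support” argument into a transaction count and to emit lines in the format shown above is sketched below. The function names `min_count` and `format_pair` are hypothetical, and the argument list is passed explicitly here only so the sketch runs on its own; a real program would read it from the command line.

```python
import argparse
import math


def min_count(support, num_transactions=1000):
    """Smallest number of transactions a pair must appear in to be frequent."""
    # round() removes floating-point noise (e.g. 0.07 * 1000) before ceil().
    return math.ceil(round(support * num_transactions, 9))


def format_pair(item1, item2, count):
    """One output line: the two item ids followed by the support count."""
    return f"{item1} {item2} {count}"


parser = argparse.ArgumentParser()
parser.add_argument("-support", type=float, required=True)
args = parser.parse_args(["-support", "0.1"])  # stand-in for the real command line

print(min_count(args.support))
print(format_pair(0, 1, 200))
```

Comparing integer counts against an integer threshold avoids comparing floating-point fractions for every pair.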

        Create a struct-type
                 { int support
                   ptr to linked list of TIDs (TID-list) }
        Create DB, an array 1000 x 1000 of struct-type <this can be optimized>
        /* Read database */
        For each transaction read in
                 For each pair of items [item id1, item id2]
                 appearing in transaction
                          Update DB[item id1, item id2]
                                   Increment support
                                   Add TID to linked list of TIDs

        For all entries in DB
                 Print out item-id pairs if support value of element meets the
                 minimum support threshold
Note that you need to update only the lower or upper part of the array DB, as it is symmetric.
[See the pseudocode for the second part for clues on how to do this.] Note also that your code should
follow the general flow of the pseudocode; type definitions and specific loop usage (while vs. for) are
left to your choice.
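
The flow of part one can be sketched as follows. This is not the required implementation: it swaps the 1000 x 1000 array of structs for a dictionary keyed by (i, j) with i < j, and uses Python lists in place of linked TID-lists, so the support of a pair is simply the length of its list. The names `count_pairs` and `frequent_pairs` and the toy transactions are made up for illustration.

```python
from itertools import combinations


def count_pairs(transactions):
    """Map each item pair (i, j) with i < j to the list of TIDs containing both."""
    tid_lists = {}
    for tid, items in transactions:
        # sorted(set(...)) drops duplicate items and guarantees i < j in every
        # pair, so only the "upper triangle" is ever touched.
        for pair in combinations(sorted(set(items)), 2):
            tid_lists.setdefault(pair, []).append(tid)
    return tid_lists


def frequent_pairs(tid_lists, threshold):
    """Keep only the pairs whose support count reaches the threshold."""
    return {pair: tids for pair, tids in tid_lists.items() if len(tids) >= threshold}


# Toy data: (TID, items) tuples, with a threshold of 2 transactions.
toy = [(1, [0, 1, 2]), (2, [1, 0]), (3, [2, 0, 1]), (4, [2, 5])]
pairs = count_pairs(toy)
for (i, j), tids in sorted(frequent_pairs(pairs, 2).items()):
    print(i, j, len(tids))
```

Because transactions are processed in the order given, each TID list stays in that order, which matters later if the lists are intersected with a merge.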

For the second part you will need to list all the frequent 3-itemsets and, for each such itemset,
the rule involving it that has the highest confidence. You will use the linked lists you have stored
and intersect the appropriate ones to determine whether a candidate 3-itemset is frequent. [NOTE:
REMOVE PRINT INFORMATION FROM THE FIRST PART if you were printing to test the
first part.]
The pseudocode for the final part adds the following to the above.

        For each item I
                 For each item J where J > I
                          For each item K where K > J
                                   If DB[I,J], DB[J,K] and DB[I,K] all meet minimum
                                   support then [I,J,K] is a candidate
                                            Intersect TID-lists of [I,J] and [J,K]
                                            If size of intersection meets minimum support
                                                     Print itemset and rule(s) involving a
                                                     single consequent, with highest
                                                     confidence (you need to figure this
                                                     one out!)
                                            End If
                                   End If
                          End For
                 End For
        End For
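
The candidate test and the TID-list intersection above can be sketched like this. The sketch assumes each TID list is in ascending order (true if transactions were appended in increasing TID order, which is an assumption about the data file), and that the input dictionary already holds only frequent pairs. The names `intersect_sorted` and `frequent_triples` and the toy lists are made up.

```python
def intersect_sorted(a, b):
    """Intersect two ascending TID lists with a two-pointer merge."""
    out, x, y = [], 0, 0
    while x < len(a) and y < len(b):
        if a[x] == b[y]:
            out.append(a[x])
            x += 1
            y += 1
        elif a[x] < b[y]:
            x += 1
        else:
            y += 1
    return out


def frequent_triples(pair_tids, threshold):
    """pair_tids maps frequent pairs (i, j), i < j, to ascending TID lists."""
    items = sorted({item for pair in pair_tids for item in pair})
    result = {}
    for a, i in enumerate(items):
        for b in range(a + 1, len(items)):
            j = items[b]
            if (i, j) not in pair_tids:
                continue  # [I,J] is not frequent, so no triple can contain it
            for k in items[b + 1:]:
                if (j, k) in pair_tids and (i, k) in pair_tids:
                    # A transaction holding both [I,J] and [J,K] holds [I,J,K].
                    tids = intersect_sorted(pair_tids[(i, j)], pair_tids[(j, k)])
                    if len(tids) >= threshold:
                        result[(i, j, k)] = len(tids)
    return result


toy_pairs = {(0, 1): [1, 2, 3], (0, 2): [1, 3], (1, 2): [1, 3, 4]}
print(frequent_triples(toy_pairs, 2))
```

The merge runs in time linear in the two list lengths, which is the main reason to keep the TID lists sorted.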

Print format: itemID1, itemID2, itemID3, support: rule, confidence. If two or more rules
involve the same set of items and yield the same maximal confidence, report all of them.
For example, if the itemset {A,B,C} has support 20%, both AB->C and AC->B have a
confidence of 40%, and BC->A has a confidence of 30%, then you should report only the
first two. If on the other hand BC->A has a confidence of 50%, then you should report only
BC->A. Note that you do not have to print the results of the first part. Print only the results of
the second part and store them in a file called OutputAsssocLab2. Use ID lab2 to submit this part
using the online submit command.
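
The tie-handling rule above can be illustrated with a short sketch. It relies on the standard definition confidence(XY->Z) = support(XYZ) / support(XY), so the rule with the highest confidence is the one whose two-item antecedent has the smallest support. The name `best_rules` and the counts are hypothetical, and the printed line is only one possible reading of the print format.

```python
def best_rules(triple, triple_count, pair_counts):
    """Return every single-consequent rule that reaches the highest confidence.

    pair_counts maps (i, j), i < j, to the support count of that pair.
    """
    i, j, k = triple
    # Each rule has a 2-item antecedent and the remaining item as its consequent.
    candidates = [((i, j), k), ((i, k), j), ((j, k), i)]
    # Comparing integer antecedent counts avoids floating-point equality tests.
    smallest = min(pair_counts[ante] for ante, _ in candidates)
    confidence = triple_count / smallest
    return [(ante, cons, confidence)
            for ante, cons in candidates
            if pair_counts[ante] == smallest]


# Made-up counts: the triple occurs 20 times; two antecedents tie at 50.
counts = {(0, 1): 50, (0, 2): 50, (1, 2): 80}
for (a, b), c, conf in best_rules((0, 1, 2), 20, counts):
    print(f"0, 1, 2, 20: {a} {b} -> {c}, {conf:.2f}")
```

With these counts the rules 0 1 -> 2 and 0 2 -> 1 tie at 0.40 and are both reported, while 1 2 -> 0 (0.25) is dropped.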

WHAT YOU NEED TO SUBMIT:
Source code (using the submit command), a README file, a REPORT file, and the associated
makefiles. Each source file should be well commented, and ideally each step in the pseudocode
should be clearly marked within the comments. Again, you can choose to work in teams of 2 if
desired. A REPORT file in Word or ASCII text is also required, describing the tests performed and
the implementation choices made. Teams of two must clearly separate out the
contributions of each team member (what they did). The README file should contain only
instructions on how to run the program.
