# Netflix Project

Netflix Prize: Predicting Ratings
Data
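• mv_00(movieID).txt, one file per movie:
– First line: the movieID followed by a colon, e.g. 1:
– Following lines: userID (1-2,649,429) and rating (1-5)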
• Over 17,000 movie txt files
• Over 400,000 userIDs
• 2 GB zipped
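To make the file layout above concrete, here is a minimal Python sketch of a parser for one movie file. It is not the author's code (the slides mention Java and PHP). The function name `parse_movie_file` is hypothetical, the sample lines are illustrative, and the comma-separated `userID,rating,date` layout is an assumption based on the public dataset description rather than on these slides.

```python
from io import StringIO

def parse_movie_file(lines):
    """Return (movie_id, [(user_id, rating), ...]) for one movie file."""
    it = iter(lines)
    header = next(it).strip()            # first line, e.g. "1:"
    movie_id = int(header.rstrip(":"))
    ratings = []
    for line in it:
        line = line.strip()
        if not line:
            continue
        # Assumed layout: userID,rating[,date]; any date field is ignored.
        fields = line.split(",")
        ratings.append((int(fields[0]), int(fields[1])))
    return movie_id, ratings

# Illustrative sample standing in for a real mv_*.txt file.
sample = StringIO("1:\n1488844,3,2005-09-06\n822109,5,2005-05-13\n")
movie_id, ratings = parse_movie_file(sample)
print(movie_id, ratings)
```

Passing an open file handle instead of the `StringIO` object works the same way, since the function only iterates over lines.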
Overall Plan
• Compute user similarity using:
– termFrequency: # of movies in common
– documentFrequency: 1/|rating1 – rating2|
• tfdf = (# of movies in common) * 1/|rating1 – rating2|
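The tfdf score above can be sketched in Python as follows. The slides do not say how the rating difference is combined across several shared movies, or what happens when two ratings are equal and the difference is zero. This sketch assumes the mean absolute difference over the shared movies and adds a smoothing constant `eps` to the denominator. Both choices, and the function name `tfdf`, are assumptions rather than the author's method.

```python
def tfdf(ratings_a, ratings_b, eps=1.0):
    """Similarity of two users; each argument maps movieID -> rating."""
    common = ratings_a.keys() & ratings_b.keys()
    if not common:
        return 0.0
    tf = len(common)                      # number of movies in common
    # Assumption: average the absolute rating difference over shared movies.
    mean_diff = sum(abs(ratings_a[m] - ratings_b[m]) for m in common) / tf
    # Assumption: eps keeps the denominator non-zero for identical ratings.
    df = 1.0 / (mean_diff + eps)
    return tf * df

alice = {1: 5, 2: 3, 3: 4}
bob = {1: 4, 2: 3, 4: 2}
print(tfdf(alice, bob))
```

With `eps=1.0`, two users who agree exactly on every shared movie score the number of shared movies, and the score shrinks as their ratings diverge.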
Plan 1
• Store it all in memory (haha) in Java
• Store a User class with:
– UserID
– Array of Movie objects:
• movieID
• Rating
• Then have a matrix of users, each with an array of top similar users (by tfdf)

• Problem 1 - Memory issues
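The in-memory layout of Plan 1 might look like the Python sketch below. The original was written in Java, so this is a translation under assumptions. The names `User`, `build_users` and `fill_top_similar` are hypothetical, and a simple overlap count stands in for the similarity score so that the block is self-contained.

```python
from dataclasses import dataclass, field
import heapq

@dataclass
class User:
    user_id: int
    ratings: dict = field(default_factory=dict)      # movieID -> rating
    top_similar: list = field(default_factory=list)  # [(score, userID), ...]

def build_users(movie_records):
    """Pivot per-movie records [(movieID, [(userID, rating), ...])] by user."""
    users = {}
    for movie_id, ratings in movie_records:
        for user_id, rating in ratings:
            if user_id not in users:
                users[user_id] = User(user_id)
            users[user_id].ratings[movie_id] = rating
    return users

def fill_top_similar(users, similarity, k=10):
    """Store the k highest-scoring other users on each User."""
    for u in users.values():
        scored = ((similarity(u.ratings, v.ratings), v.user_id)
                  for v in users.values() if v.user_id != u.user_id)
        u.top_similar = heapq.nlargest(k, scored)

# Stand-in similarity (an assumption): number of movies in common.
overlap = lambda a, b: len(a.keys() & b.keys())

records = [(1, [(10, 5), (11, 4)]), (2, [(10, 3), (12, 2)]), (3, [(11, 1), (12, 1)])]
users = build_users(records)
fill_top_similar(users, overlap, k=1)
print(users[10])
```

`fill_top_similar` visits every ordered pair of users, so with over 400,000 users it would need on the order of 10^11 similarity evaluations. That is consistent with the memory and speed problems the slides report.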
Plan 2*
• Step 1: store in text files on hard drive in Java
– text file for each user
• Step 2: compute similarity (tfdf)
– text file of top ten users for each user
• Step 3: predictions
– Run through two directories of text files to compute an average movie rating prediction

• Problem 2 - Very Slow:
– Step 1: 3 days – ~5000 movie text files currently
– Step 2: 1 user every 35 mins | 1 user every 5 mins
– Step 3: ~10 minutes currently
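Step 3 of Plan 2 predicts a rating by averaging over a user's most similar users. A minimal Python sketch of that averaging is shown here, using in-memory dictionaries in place of the two directories of text files. The function name `predict` and the fallback to 3.0 when no similar user has rated the movie are assumptions, not details given in the slides.

```python
def predict(movie_id, neighbour_ids, ratings_by_user, default=3.0):
    """Average the ratings that similar users gave to movie_id."""
    seen = [ratings_by_user[n][movie_id]
            for n in neighbour_ids
            if movie_id in ratings_by_user.get(n, {})]
    # Assumption: fall back to a constant when no neighbour rated the movie.
    return sum(seen) / len(seen) if seen else default

ratings_by_user = {2: {7: 4}, 3: {7: 5}, 4: {8: 5}}
print(predict(7, [2, 3, 4], ratings_by_user))
print(predict(9, [2, 3, 4], ratings_by_user))
```

A weighted average using each neighbour's similarity score would be a natural variation, but the slides describe only a plain average.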
Plan 3
• Step 1: Store the text files' data in a database using PHP
– Table: userID | movieID | rating
• Primary keys: userID, movieID
• Step 2: Compute Similarity
– Table: userID | 1st userID | 2nd userID | etc.
• Primary key: userID
• Step 3: Predictions
• Problem 3 - Very Slow:
– Step 1: 4 days – 7000 movie text files currently
– Step 2: n/a
– Step 3: n/a
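The ratings table from Step 1 of Plan 3 can be sketched with Python's built-in `sqlite3` module. This swaps in SQLite and Python for the author's PHP and unnamed database, purely for illustration. The table name `ratings` and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE ratings (
        userID  INTEGER NOT NULL,
        movieID INTEGER NOT NULL,
        rating  INTEGER NOT NULL,
        PRIMARY KEY (userID, movieID)
    )
""")

# Illustrative rows: (userID, movieID, rating).
rows = [(10, 1, 5), (11, 1, 4), (10, 2, 3)]
conn.executemany("INSERT INTO ratings VALUES (?, ?, ?)", rows)
conn.commit()

# Movies rated by both users, with each user's rating (input to Step 2).
common = conn.execute("""
    SELECT a.movieID, a.rating, b.rating
    FROM ratings a JOIN ratings b ON a.movieID = b.movieID
    WHERE a.userID = ? AND b.userID = ?
    ORDER BY a.movieID
""", (10, 11)).fetchall()
print(common)
```

Inserting many rows per transaction, as `executemany` followed by a single `commit` does here, is usually much faster than committing row by row. Whether that would have helped with the multi-day load time reported above is not something the slides settle.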
Results

• Predicting 3.0 for everything:
– RMSE = 1.3149
• Similarities I have so far:
– RMSE = 1.3149 | 384 users
– RMSE = 1.3149 | 575 users
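RMSE (root mean squared error) is the square root of the mean squared difference between predicted and actual ratings. The Python sketch below computes it for a constant 3.0 baseline like the one in the results. The ratings are made up for illustration and are not the author's data.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))

actual = [1, 3, 5, 4]                  # made-up true ratings
baseline = [3.0] * len(actual)         # predict 3.0 for everything
print(rmse(baseline, actual))
```

Lower is better: a similarity-based predictor is only useful if its RMSE on the same ratings falls below that of the constant baseline.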