I Tube, You Tube, Everyone Tubes: Analyzing the World’s Largest User Generated Content Video System
Presented by: Anirban Banerjee, Dept. of Computer Science and Engineering, UC Riverside, CA 92521
Anirban@cs.ucr.edu

Problem Addressed
• We don’t know much about how the popularity of content on YouTube-like sites changes.

Motivation
• Understanding how users access content can help to
  – target ads
  – predict which videos will become popular
  – allocate resources

Contribution
• Extensive trace-driven analysis
• Extract interesting features about content popularity
• Understand shift in popularity over time
• Effect of duplicate content

Outline
• Methodology
• UGC Popularity
• Popularity Evolution
• Caching issues
• Duplicate and Illegal content
• Conclusions
• Comments

Methodology (UGC)
• YouTube and Daum (Korea)
• YouTube: 2 categories
  – Entertainment
  – Science and Technology
• Daum: all categories
• Video info: uploader, upload time, length, views, ratings

Methodology (non-UGC)
• Netflix, Lovefilm, Yahoo Movies
• Brief observations:
  – It takes YouTube 15 days to produce as many videos as there are movies listed in IMDb’s DB
  – # of publishers is massive for UGC
  – # of movies/publisher is more or less the same for UGC and non-UGC
  – Popularity and ratings show strong correlation
  – User participation levels are low

UGC Popularity
• Not easy to conclude that popularity follows a power law.
  – Non-popular items in Netflix don’t follow a power law
  – 10% of videos get 80% of hits in UGC

UGC Popularity
• Why is this interesting:
  – This behavior is different from other VOD systems: PowerInfo (China)
  – Caching a small # of videos will satisfy a large # of requests.

UGC Popularity
• Popular content analysis
  – Exhibits power-law behavior
  – Sharp decay for popular content
  – Exact popularity distribution is category dependent
  – Truncation at the tail with an exponential cutoff
• Reason: users hit a video only once (P2P concept, hit-once users)

UGC Popularity
• Popular content analysis
  – UGC has fetch-at-most-once behavior
  – Extend simulation from Gummadi et al. paper (U: # of users in the system, R: # of requests per user, V: # of videos)
• All HitOnce scenarios show a truncated tail
• Increasing R or reducing V amplifies the tail truncation; increasing U has no effect

UGC Popularity
• Not-so-popular content analysis
  – Questions
    • What is the distribution of these items?
    • What affects the distribution?
• Sci dataset follows Zipf
• Result filtering causes a sharp drop-off

UGC Popularity
• Not-so-popular content analysis
  – What will be the result of removing result filters?
    • The videos in the tail will receive views

Popularity Evolution
• Questions
  – Do requests concentrate on young or old videos?
  – How fast does popularity change?
• Findings
  – For really young items (< 1 month), a slight increase in avg. requests is observed.
  – 80% of videos requested on a day are older than 1 month (72% of traffic)
  – Except for very new videos, user preference seems insensitive to age

Popularity Evolution
• Out of the top 20 videos requested on a day, 50% are new
(Figure labels: insensitivity, 50% point)

Popularity Evolution
• After 1 day, 90% of items will be watched at least once, 40% over 10 times
• Prob. of a video being requested decreases over time
• If a video does not receive enough hits early, it will probably not receive hits later
(Figure label: dips at predictable intervals)

Popularity Evolution
• Predicting future popularity
  – Analyzing 2-3 days’ worth of popularity data is good enough for prediction
  – Young videos can make rapid changes in rank
  – Revival of the dead does not happen
  – Rank of older videos doesn’t fluctuate as much

Caching Issues
• 3 scenarios
  – Static finite cache (long-term popular vids, 90% of traffic)
  – Dynamic infinite cache
  – Hybrid finite cache (static + 10k vids per day)
• Replay a 6-day trace under the various schemes and calculate hit and miss ratios

Caching Issues
• Cache efficiency: hybrid model is best, better than static by 10%
• Can P2P help? 95% of videos are requested after 10 mins or longer, so only a small fraction of files will benefit

Caching Issues
• Expected number of concurrent users in the system
• Users watch the full video
• Start to share as soon as streaming starts
• Stay on site for about 28 mins
• Very few videos helped
• Load is decreased

Duplicate and Illegal content
• Duplicates (aliases): sample 216 videos from 10K, and use 51 volunteers
• Most videos have 1-4 aliases
• Duplicates cause popularity dilution

Duplicate and Illegal content
• Aliases are uploaded on the same day or within a week
• Possibly responsible for flattened tail
• Aliases mostly uploaded by one-time uploaders
• Only 0.4% of all videos have been deleted by YouTube due to “concerns”
  – Of these, 5% are copyright violations

Conclusions
• Extensive study of the YouTube UGC portal.
  – What affects popularity of videos
  – Analyzed long-tail behavior of popularity
  – Simple caching policies can help
  – Aliases dilute popularity rankings

My Comments
• Wait for it waaaiiiit for it!!
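
Sketch: fetch-at-most-once simulation
The UGC Popularity slides mention extending the Gummadi et al. simulation with U users, R requests per user, and V videos. The following is a minimal sketch of that idea, not the study's actual code. It assumes video popularity follows a Zipf law with exponent alpha, and that a hit-once user simply redraws whenever the sampled video was already watched. The function names, alpha, the seed, and the parameter values in the demo are all illustrative assumptions.

```python
import random
from collections import Counter
from itertools import accumulate


def zipf_cum_weights(num_videos, alpha=1.0):
    """Cumulative Zipf weights: the video of rank k has weight 1 / k**alpha."""
    return list(accumulate(1.0 / (k ** alpha) for k in range(1, num_videos + 1)))


def simulate(num_users, requests_per_user, num_videos, fetch_once,
             alpha=1.0, seed=0):
    """Return a Counter mapping video id (0 = most popular) to total views.

    num_users, requests_per_user and num_videos play the roles of U, R and V.
    With fetch_once=True, a user never views the same video twice.
    """
    if fetch_once and requests_per_user > num_videos:
        raise ValueError("a hit-once user cannot make more than V requests")
    rng = random.Random(seed)
    cum = zipf_cum_weights(num_videos, alpha)
    videos = range(num_videos)
    views = Counter()
    for _ in range(num_users):
        seen = set()  # videos this user has already watched
        made = 0
        while made < requests_per_user:
            video = rng.choices(videos, cum_weights=cum, k=1)[0]
            if fetch_once and video in seen:
                continue  # assumed rule: redraw instead of re-watching
            seen.add(video)
            views[video] += 1
            made += 1
    return views


if __name__ == "__main__":
    U, R, V = 200, 20, 200  # illustrative values only
    once = simulate(U, R, V, fetch_once=True)
    repeat = simulate(U, R, V, fetch_once=False)
    print("views of top-ranked video (hit-once):", once[0])
    print("views of top-ranked video (repeated fetch):", repeat[0])
```

Under the hit-once rule no video can collect more than U views, because each user contributes at most one view per video. That cap is what truncates the most popular end of the distribution, while repeated fetching leaves the top-ranked video unbounded by U.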