					Challenges and Opportunities
     in Building Personalized
 Online Content Aggregators

                     Ka Cheung Sia
         Adviser: Prof. Junghoo Cho

                      Oral Defense
                   January 12, 2009
                    Outline
   Emergence of Web 2.0
   Online content aggregators
   Challenges and opportunities
       RSS monitoring
       Personalized recommendations
       Social annotations
   Conclusion


                     Challenges and Opportunities in Building   2
                     Personalized Online Content Aggregator
Web 1.0
                           A few professional
                            content creators
                                  News
                                  Corporate sites
                                  Portal


                           One way consumption
                            of information




                        Web 2.0
   Facilitators of content
    sharing
       Wikipedia
       Blog
       Media file sharing
       Discussion group
   Everyone can publish
    content easily
   Handheld devices and
    innovative online
    applications
   [Figure: Being Web 2.0 publishers]
           Growth of UGC / blogs
   In a 2007 study
       Professional content : 2GB / day
       UGC : 8-10GB / day
   Bloglines.com
       26% users with >30 subscriptions
   TIME 2006 Person of the Year: "You"




                                RSS
   Really Simple
    Syndication
       XML
       Contains 10-15 latest posts
   Machine readable
       Datetime of publications
       Title / content
       Permalink
   Subscription
       RSS reader
       Personalized homepage
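
A minimal sketch of why the machine-readable fields matter to an aggregator: it parses a small hand-written RSS 2.0 document with the Python standard library and extracts the publication datetime, title, content, and permalink of each post. The feed text, the `parse_items` helper, and the dictionary keys are hypothetical and only for illustration.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime

# A tiny hand-written RSS 2.0 document (hypothetical feed, for illustration only).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>http://example.com/</link>
    <item>
      <title>Second post</title>
      <link>http://example.com/posts/2</link>
      <pubDate>Mon, 12 Jan 2009 10:30:00 GMT</pubDate>
      <description>Body of the second post</description>
    </item>
    <item>
      <title>First post</title>
      <link>http://example.com/posts/1</link>
      <pubDate>Sun, 11 Jan 2009 08:00:00 GMT</pubDate>
      <description>Body of the first post</description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return one dict per <item>, in feed order (latest post first here)."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "permalink": item.findtext("link"),
            # RSS 2.0 dates use the RFC 822 format, which email.utils can parse.
            "published": parsedate_to_datetime(item.findtext("pubDate")),
            "content": item.findtext("description"),
        })
    return items


items = parse_items(SAMPLE_RSS)
for entry in items:
    print(entry["published"].isoformat(), entry["title"], entry["permalink"])
```

Because every post carries its own timestamp and permalink, an aggregator can tell which posts are new since its previous retrieval without comparing page contents.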

      How does RSS help readers?
   Without RSS                                  With RSS
    (visit different URLs)                        (centralized access)




                             RSS usage
   High usage but low
    awareness
        27% consume
        4% aware
   Common usage
        News feeds
        Podcasting
        My MSN / My Yahoo! / etc.
        Google reader / bloglines
        Indexing blogs
   Time-sensitive content
        “RSS – Crossing into the Mainstream” Yahoo white paper by Joshua Grossnickle Oct 2005
     Online content aggregator
   Centralized access to subscribed content in executive
    summary style
   Leverage collaborative filtering
   Ubiquitous access
   Collect useful social annotation data




    Online content aggregator
     (Google reader example)



   [Figure: annotated Google Reader screenshot]
       Subscription list
       Newly updated articles (Chapters 2 & 4)
       Recommendations (Chapter 3)
       Social annotations (Chapter 5)
    Challenges and opportunities
   How to deliver up-to-date content?
       New articles update quickly with recurring patterns
       Significance of articles deteriorates quickly over time


   How to provide better personalization?
       Ranking articles/topics based on user interest
       Efficient computation to handle large number of users


   What is the knowledge in Web 2.0 data?
       Improve Web resources categorization
       Vocabulary usage


                          Outline
   Emergence of Web 2.0
   Online content aggregator
   Challenges and opportunities
       RSS monitoring
           How to deliver “fresh” content
       Providing better personalization
       Web 2.0 knowledge mining
   Conclusion

           The retrieval problem
   Research problem in proxies, search engines, …
       Source cooperativeness [DKP01, OW02]
       Priority of different content [CG03a]
       Resource constraints
       User satisfaction [PO05, WSY02]
       Politeness issues, …

   [Figure: data source --(retrieval)--> aggregator --(deliver)--> user]


                                   Metrics
   Evaluation at time $u_1$
       Freshness: 0
       Age: $u_1 - t_4$
       Delay: $(\tau_1 - t_1) + (\tau_1 - t_2) + (\tau_1 - t_3)$
       Miss-penalty: 2
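
The four metrics can be sketched for a single data source as follows. The definitions are an assumed reading of this slide rather than the thesis's exact formulation: freshness is 1 only if no post is missing at the access time, age is the time since the first missed post, delay sums the wait between each post and the retrieval that picks it up, and miss-penalty counts posts not yet retrieved when the user accesses. The `evaluate` function and the sample times are hypothetical.

```python
def evaluate(post_times, retrieval_times, access_time):
    """Metrics for one data source, evaluated at a single user access time."""
    retrievals = sorted(r for r in retrieval_times if r <= access_time)
    last_retrieval = retrievals[-1] if retrievals else float("-inf")
    posts = sorted(p for p in post_times if p <= access_time)

    # Posts published after the last retrieval are not yet in the aggregator.
    missed = [p for p in posts if p > last_retrieval]

    # Delay: for each retrieved post, the wait until the first retrieval after it.
    delay = 0.0
    for p in posts:
        later = [r for r in retrievals if r >= p]
        if later:
            delay += later[0] - p

    return {
        "freshness": 0 if missed else 1,
        "age": access_time - missed[0] if missed else 0.0,
        "delay": delay,
        "miss_penalty": len(missed),
    }


# Three posts before the retrieval at time 4, two posts after it, user access at 7.
metrics = evaluate(post_times=[1, 2, 3, 5, 6], retrieval_times=[4], access_time=7)
print(metrics)
```

In this example the copy is stale at access time, so freshness is 0 and two posts count toward the miss-penalty.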


   Push vs. Pull
       Push: All updates are known (e.g. RSS ping services)
       Pull: Future updates are estimated




                  Refined model
   Commonly used Webpage change model
   Homogeneous Poisson model
    λ(t) = λ at any t




   RSS content updates more frequently, with a recurring pattern
   Periodic inhomogeneous Poisson model
    λ(t) = λ(t-nT), n=1,2,… , T is the period




           Optimization problem
   Resource allocation
       How often to contact a data source?
       O1 is more active and has more subscribers than O2; how
        much more often should we contact O1?
                         $m_i \propto \sqrt{w_i \lambda_i}$
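
A small sketch of the resource-allocation step, assuming the square-root rule in which the number of retrievals m_i given to source O_i is proportional to sqrt(w_i * λ_i), with w_i the source's weight (for example its subscriber count) and λ_i its posting rate. The function name, the budget, and the sample numbers are hypothetical.

```python
import math


def allocate_retrievals(sources, budget):
    """Split a retrieval budget so that m_i is proportional to sqrt(w_i * lambda_i).

    sources maps a source name to (weight, posting_rate).
    """
    roots = {name: math.sqrt(w * lam) for name, (w, lam) in sources.items()}
    total = sum(roots.values())
    return {name: budget * root / total for name, root in roots.items()}


# O1 has 4x the subscribers and 9x the posting rate of O2.
allocation = allocate_retrievals({"O1": (4.0, 9.0), "O2": (1.0, 1.0)}, budget=14)
print(allocation)
```

Under this assumption O1 receives six times as many retrievals as O2, not thirty-six times, because the benefit of extra retrievals for one source diminishes.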

   Retrieval scheduling
       When to contact a data source?
       Given 2 retrievals allocated for O1, when to retrieve from it?
        Both in the morning, or one in the morning, one at night?


    Retrieval schedule intuition
   Retrieve at t=1: no postings missed
   Retrieve at t=0 or t=2: all postings (in the same period) missed




    Necessary optimal condition
   Given λ(t) and u(t), find the retrieval times τj that minimize delay / miss-penalty

   Delay: schedule retrievals right after a large number of new posts
   Miss-penalty: schedule retrievals right before many user accesses
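
The delay half of this intuition can be checked with a brute-force sketch. It assumes a periodic posting rate given as 24 hourly values, one retrieval per day at a whole hour, and posts placed at the middle of their hour. These simplifications, the function names, and the sample rates are hypothetical, and the miss-penalty side (which also needs the access pattern u(t)) is not modeled.

```python
PERIOD = 24  # hours; plays the role of the period T of the posting pattern


def expected_delay(hourly_rate, retrieval_hour):
    """Expected total delay per period for one daily retrieval at retrieval_hour."""
    total = 0.0
    for hour, rate in enumerate(hourly_rate):
        post_time = hour + 0.5  # assume posts fall at the middle of each hour
        # Time until the next retrieval, wrapping around the period.
        wait = (retrieval_hour - post_time) % PERIOD
        total += rate * wait
    return total


def best_retrieval_hour(hourly_rate):
    """Try every whole hour and keep the one with the smallest expected delay."""
    return min(range(PERIOD), key=lambda h: expected_delay(hourly_rate, h))


# Low background activity with a burst of posts between 8:00 and 11:00.
hourly_rate = [0.1] * PERIOD
for burst_hour in (8, 9, 10):
    hourly_rate[burst_hour] = 5.0

best = best_retrieval_hour(hourly_rate)
print(best, expected_delay(hourly_rate, best))
```

The search picks the hour immediately after the burst, matching the rule of thumb that delay is minimized by retrieving right after many new posts.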




                   Performance
   Reduces misses by 33% compared to CGM03 for 1 retrieval per day
   Reduces misses by a further 20% when the user access pattern is considered




                         Summary
   Better RSS content update model
   Significantly improve “content freshness” under
    same resource constraint
   Analysis of typical posting patterns and access
    patterns

   “Efficient Monitoring Algorithm for Fast News Alert”, with Junghoo Cho and
    Hyun-Kyu Cho, in IEEE TKDE 2007
   “Monitoring RSS Feeds based on User Browsing Pattern”, with Junghoo
    Cho, Koji Hino, Yun Chi, Shenghuo Zhu and Belle L. Tseng in ICWSM 2007




                         Outline
   Emergence of Web 2.0
   Online content aggregator
   Challenges and opportunities
       RSS monitoring
       Providing better personalization
           Ranking articles/topics based on user interest
           Efficient computation to support large number
            of users
       Social annotations
   Conclusion
          Learning user profile
   Users are reluctant to indicate their interest
   Cold-start problem
   Diversified recommendations [ZMK05]
   Drift of user interest [WPB01]
   Relevance feedback [Eft00, KDF05]
   [Figure: loop of recommendations -> feedback -> learning process]

   Goal: Improve relevance of recommendations
     click utility

               Ranking model 1
   Assumptions
       K predefined topics
       Every recommendation item belongs to one topic

   User profile: Θi – Pr (click | read, topic i)
       Θi is modeled as a draw from a beta distribution with
        parameters α, β and estimated by its mean α/(α+β)

               Topic   1        2            3            4            5   6
               α       2        0            2            5            3   2
               β       10       0            1            0            3   1

                 Ranking model 2
   Ranking bias: g(j) – Pr (read | j)
       Read probability decreases with rank
       Borrow from web search studies



   Utility function: U(R; Θ)




       R: ranking of topics
       Articles belonging to the same topic are chosen randomly



                   Ranking topics
   Updating the posterior distribution after each iteration
       Not clicked: βnew=βold + g(ri)
       Clicked: αnew=αold + 1


   Ranking function of topics
       Exploitation + λ*exploration
             Mean + λ*variance
             $\frac{\alpha}{\alpha+\beta} + \lambda \frac{\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$
       Example (λ=1)
             α=2, β=2: ranking = 0.55
             α=5, β=5: ranking ≈ 0.52

                        Simulation
   Click utility improves in the long run
   More accurate estimation of user interest Θ
   Adapts to drift of interest




                   User studies
   10 users from UCLA and NEC
   45 categories from dmoz.org
       Arts/Architecture
       Computers/E-books
       Science/Biology
       …
   Survey of user interest before experiment
   7 articles (Webpages) per iteration
   3 strategies interleaved
   [Figures: first 25 iterations; interest drifted at 25th iteration]
                         Summary
   Learning framework
       Exploitation: recommend items the user is known to be interested in
       Exploration: explore the user’s other potential interests
   Shown to improve click utility and adapt to drift of user
    interest



   “Capturing User Interest by Both Exploitation and Exploration”, with
    Shenghuo Zhu, Yun Chi, Koji Hino, and Belle L. Tseng, in UM 2007




                         Outline
   Emergence of Web 2.0
   Online content aggregator
   Challenges and opportunities
       RSS monitoring
       Providing better personalization
           Ranking articles/topics based on user interest
           Efficient computation to support large number
            of users
       Social annotations
   Conclusion
    Aggregation as recommendation
   User-generated content in the Blogosphere and Web 2.0 services
    contains rich information about recent events
   Aggregation of individual opinions often reveals interesting
    popular topics




            Personal recommendation
   [Figure: four RSS sources (blog posts) linked to the items (phrases) they mention]
   RSS sources
       "Michael Phelps performance in Olympics is awesome..."
       "Finished watching Michael Phelps in Olympics, let me watch the WALL-E DVD..."
       "Dark Knight is great, more entertaining than watching Olympics and shows in Las Vegas!"
       "Um.. it will be good if there is a free show of Dark Knight and WALL-E"
   Items (phrases)
       Olympics, Michael Phelps, Dark Knight, WALL-E, Las Vegas


                    Matrix formulation
        Reference matrix (E) – the number of times a blogger mentions
         a phrase/link in blog posts
        Subscription matrix (T) – how often a user reads a blog
        Personalized score (TE)



   E        o1    o2    o3
   b1       3     2     0
   b2       0     3     0
   b3       1     0     1
   b4       1     2     3
   Total    5     7     4

   T        b1    b2    b3    b4
   u1       0.8   0.8   0     0
   u2       0.2   0.2   0.6   0.6
   u3       0     0     0.5   0.5

   TE       o1    o2    o3
   u1       2.4   4.0   0.0
   u2       1.8   2.2   2.4
   u3       1.0   1.0   2.0
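
The personalized score matrix is an ordinary matrix product, which this short sketch reproduces from the example values of E and T. The `matmul` helper is a hypothetical plain-Python routine used to keep the example dependency-free.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Reference matrix E: rows are blogs b1..b4, columns are items o1..o3.
E = [
    [3, 2, 0],
    [0, 3, 0],
    [1, 0, 1],
    [1, 2, 3],
]

# Subscription matrix T: rows are users u1..u3, columns are blogs b1..b4.
T = [
    [0.8, 0.8, 0, 0],
    [0.2, 0.2, 0.6, 0.6],
    [0, 0, 0.5, 0.5],
]

# Personalized scores: rows are users, columns are items (rounded for display).
TE = [[round(value, 2) for value in row] for row in matmul(T, E)]
for user, row in zip(("u1", "u2", "u3"), TE):
    print(user, row)
```

Globally o2 is the most mentioned item (7 mentions in total), yet for u2 and u3 the top item is o3, because they mostly read b3 and b4.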



    Database operation of matrix
   Reference (rss-id, item, score)
       <b1, o1, 3>
       <b1, o2, 2>
       <b2, o2, 3>
       …
       Grows over time
   Subscription (user-id, rss-id, score)
       <u1, b1, 0.8>
       <u1, b2, 0.8>
       <u2, b1, 0.2>
       …
       Relatively stable

   E        o1    o2    o3
   b1       3     2     0
   b2       0     3     0
   b3       1     0     1
   b4       1     2     3

   T        b1    b2    b3    b4
   u1       0.8   0.8   0     0
   u2       0.2   0.2   0.6   0.6
   u3       0     0     0.5   0.5

                       Baselines
   Aggregate Query
    -- Endorsement stores the reference matrix E; Trust stores the subscription matrix T
    SELECT e.item, SUM(t.score * e.score) AS p_score
    FROM Endorsement e, Trust t
    WHERE e.blog_id = t.blog_id AND
          t.user_id = <user id>
    GROUP BY e.item
    ORDER BY p_score DESC LIMIT 20
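
The two baselines can be tried out on the small example with Python's built-in SQLite module (the experiments in this talk used MySQL, so this is only a stand-in). The column names with underscores, the `PersonalScore` table, and the helper functions are hypothetical; on-the-fly runs the aggregate query per request, while the view variant precomputes scores for every user.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Endorsement (blog_id TEXT, item TEXT, score REAL);
CREATE TABLE Trust (user_id TEXT, blog_id TEXT, score REAL);
""")
conn.executemany("INSERT INTO Endorsement VALUES (?, ?, ?)", [
    ("b1", "o1", 3), ("b1", "o2", 2), ("b2", "o2", 3), ("b3", "o1", 1),
    ("b3", "o3", 1), ("b4", "o1", 1), ("b4", "o2", 2), ("b4", "o3", 3),
])
conn.executemany("INSERT INTO Trust VALUES (?, ?, ?)", [
    ("u1", "b1", 0.8), ("u1", "b2", 0.8), ("u2", "b1", 0.2), ("u2", "b2", 0.2),
    ("u2", "b3", 0.6), ("u2", "b4", 0.6), ("u3", "b3", 0.5), ("u3", "b4", 0.5),
])

OTF_QUERY = """
SELECT e.item, SUM(t.score * e.score) AS p_score
FROM Endorsement e, Trust t
WHERE e.blog_id = t.blog_id AND t.user_id = ?
GROUP BY e.item
ORDER BY p_score DESC LIMIT 20
"""


def on_the_fly(user_id):
    """OTF: join and aggregate at query time (cheap updates, costly queries)."""
    rows = conn.execute(OTF_QUERY, (user_id,))
    return [(item, round(score, 2)) for item, score in rows]


# VIEW: precompute every user's scores once (cheap queries, costly updates).
conn.execute("""
CREATE TABLE PersonalScore AS
SELECT t.user_id AS user_id, e.item AS item, SUM(t.score * e.score) AS p_score
FROM Endorsement e, Trust t
WHERE e.blog_id = t.blog_id
GROUP BY t.user_id, e.item
""")


def from_view(user_id):
    """Read the precomputed scores for one user."""
    rows = conn.execute(
        "SELECT item, p_score FROM PersonalScore "
        "WHERE user_id = ? ORDER BY p_score DESC LIMIT 20",
        (user_id,),
    )
    return [(item, round(score, 2)) for item, score in rows]


u2_scores = on_the_fly("u2")
print(u2_scores, from_view("u2"))
```

Both paths return the same ranking; the difference is where the cost lands, since the precomputed table must be refreshed whenever new references arrive.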




   Two baselines: On-the-fly (OTF) and View (VIEW)
          Two stage computation
   Support large number of
    users and rss sources
       OTF – high query cost
       VIEW – high update cost
   Identify “template” users
       Users often share similar
        reading interests
       Example: template users
        interested in sports / politics
        / technologies / …
       Results are pre-computed
        and then combined in two
        stages
    Discover user groups by NMF
   Decompose subscription matrix T into sub-matrices W and H
       Non-negative matrix factorization (NMF) [Hoy04]
       W : [individual users : template users] relationship
       H : [template users : blogs] relationship




   Example: user 2’s subscription vector is expressed as
    linear combination of two template users
   NMF as an approximation of original subscription matrix
       Accurate
       Sparse
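
To make the roles of W and H concrete, the sketch below uses a hand-picked factorization of the small example subscription matrix. It is not the output of an NMF solver: the two template users are simply chosen to be u1's and u3's subscription vectors, so that user 2 becomes 0.25 times the first template plus 1.2 times the second. The helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def max_abs_diff(a, b):
    """Largest element-wise difference between two equally shaped matrices."""
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


E = [[3, 2, 0], [0, 3, 0], [1, 0, 1], [1, 2, 3]]             # blogs x items
T = [[0.8, 0.8, 0, 0], [0.2, 0.2, 0.6, 0.6], [0, 0, 0.5, 0.5]]  # users x blogs

# H: template users x blogs (hand-picked, both entries non-negative).
H = [[0.8, 0.8, 0, 0], [0, 0, 0.5, 0.5]]
# W: individual users x template users.
W = [[1, 0], [0.25, 1.2], [0, 1]]

approx_T = matmul(W, H)     # should reproduce T for this toy example
HE = matmul(H, E)           # stage 1: precomputed scores of template users
scores = matmul(W, HE)      # stage 2: combine templates per individual user
exact = matmul(T, E)

print(max_abs_diff(approx_T, T), max_abs_diff(scores, exact))
```

On real data the factorization is only approximate, so W(HE) approximates TE; here it matches exactly because the toy matrix has rank 2.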

        Reconstruction of results
   Personalized scores of template users are pre-computed
   (HE) is maintained as sorted lists for template users




   W*(HE) becomes the personalized scores of all users
       Computed using Threshold Algorithm [FLN01]
           Top-K list
           (HE) are sorted lists
           W*(HE) is weighted linear combination
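
A compact sketch of the Threshold Algorithm applied to this setting: each template user contributes one list sorted by score, and a user's weights from W define a monotone weighted sum. The list contents, weights, and function name are hypothetical; items missing from a list are assumed to score 0 there.

```python
import heapq


def threshold_top_k(sorted_lists, weights, k):
    """Top-k items under a weighted sum, reading the sorted lists in parallel.

    sorted_lists: one list of (item, score) per template user, score descending.
    Returns (top-k list of (item, total score), number of rows read per list).
    """
    lookup = [dict(lst) for lst in sorted_lists]  # random access by item

    def full_score(item):
        return sum(w * table.get(item, 0.0) for w, table in zip(weights, lookup))

    seen = {}
    depth = 0
    max_depth = max(len(lst) for lst in sorted_lists)
    while depth < max_depth:
        threshold = 0.0
        for w, lst in zip(weights, sorted_lists):
            if depth < len(lst):
                item, score = lst[depth]
                threshold += w * score  # best score any unseen item could reach
                if item not in seen:
                    seen[item] = full_score(item)
        depth += 1
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        # Stop once the k-th best seen item is at least as good as the threshold.
        if len(top) == k and top[-1][1] >= threshold:
            return top, depth
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1]), depth


template_lists = [
    [("olympics", 9.0), ("phelps", 7.0), ("wall-e", 1.0), ("vegas", 0.5)],
    [("dark knight", 8.0), ("wall-e", 6.0), ("olympics", 2.0), ("phelps", 0.5)],
]
weights = [0.7, 0.3]  # one row of W: this user's mix of the two template users

top_items, rows_read = threshold_top_k(template_lists, weights, k=2)
print(top_items, rows_read)
```

The search stops after three of the four rows, which is the point of keeping (HE) as sorted lists: the top-K answer is found without scoring every item.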

                     Experiments
   Bloglines.com: online RSS reader
   Subscription matrix T: (0 or 1) subscription profile
       91k users
       487k feeds
   Reference matrix E: blog-keyword occurrence
       Feed content collected between Nov 2006 – Jul 2007
       Top 20 nouns with highest tf-idf in each post are selected as
        keywords
   Platform
       Python implementation of proposed method
       MySQL server on Linux with data residing on RAID


      The difference by personalization
     Week 2007 Jan 7 – Jan 13
         Major event: iPhone announced
         3 users with a large number of subscriptions
     Distinct difference between top-20 recommended words
         Among users – 1.13
         Between users and global – 1.12

                 2007-01-07 to 2007-01-13
Global        User 90439      User 90550       User 91017
sales         cattle          brazil           yorker
iphone        beef            iguazu           iraq
apple         iphone          reuters          bush
manager       chicago         search           president
iraq          iraq            vegas            views
management    bush            argentina        avenue
development   apple           kibbutz          dept
software      companies       video            troops
business      prices          cathartik        saddam
phone         quarter         google           iran



Efficiency of proposed method
   Update cost
       OTF (222K) < NMF (3.2M) < VIEW (23.6M)
   Query response time
       Averaged over the 1000 users with the highest number of subscriptions
       OTF : execute SQL query directly on MySQL server
       NMF: python implementation that interfaces with MySQL server

                  Method    avg          std              max           min
                  OTF      2.05s         3.60s            84.42s        0.037s
                  NMF      0.46s         0.53s            2.84s         0.007s

   Average query response time reduced by 75%; outliers with
    significant delay eliminated
   70% approximation accuracy

                       Summary
   Provide personalized recommendation by selective
    aggregation
   Proposed matrix model for personalized aggregation
       Optimization by NMF & Threshold Algorithm
   Real life dataset study shows query response time can
    be reduced significantly with acceptable approximation
    accuracy




   “Efficient Computation of Personal Aggregation Queries on Blogs”,
    with Junghoo Cho, Yun Chi, and Belle L. Tseng, in SIGKDD 2008

                      Outline
   Emergence of Web 2.0
   Online content aggregator
   Challenges and opportunities
       RSS monitoring
       Providing better personalization
       Social annotations
         Vocabulary usage → effective advertising
          keyword selection
   Conclusion
               Social annotations
   Bookmark tags, video/picture annotations, article tags
       Evolving vocabularies (itouch, wow, w00t, …)
       Emoticons (>_<, Orz, …)
   Intensive human effort
   Latent Dirichlet Allocation [BNJ03]
       Recover hidden topics z’s
       Represent words p(z|w) and documents
        p(z|d) as distributions over hidden topics
       [Figure: users (u), tags (w), documents (d)]
   Improving information retrieval
       Web document retrieval [WZY06, ZBZ08]
       Social tagging usability [CM07]


Topic categorization




Desired properties of effective
    advertising keywords
   Specific
       Reach target audience
       e.g. automobiles > ford, good > programming
   Emerging
       Developing vs. stable
       Easier to attract user attention
   Time-(in)sensitive
       Context change over time
       Watch for change in target audience


   How can these properties be learned from social annotations
    collected in aggregators?
                            Emerging
   Words correspond to emerging topics
       Users actively explore new pages and annotate evenly on different
        pages
       Examples (between December 2007 and March 2008):
           rails2.0 (Ruby on Rails web application framework)
           kindle (Amazon e-book reader)
           itouch (unofficial nickname of iPod touch)
           eeepc (subnotebook by Asus)
           obama (Barack Obama)
           jailbreak (Apple iPhone crack software)
   [Figure: emerging vs. stable]
   Change of entropy
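
One way to read "change of entropy" is as the difference in Shannon entropy of a tag's distribution over annotated pages between two time windows. This definition, the function names, and the sample URL lists are assumptions for illustration, not the exact feature used in the study.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a distribution given as raw counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def entropy_change(earlier_urls, later_urls):
    """Entropy of the tag's page distribution in the later window minus the earlier."""
    before = entropy(Counter(earlier_urls).values())
    after = entropy(Counter(later_urls).values())
    return after - before


# A stable tag keeps landing on the same pages in the same proportions.
stable_before = ["a"] * 8 + ["b"] * 2
stable_after = ["a"] * 8 + ["b"] * 2

# An emerging tag spreads evenly over newly discovered pages.
emerging_before = ["x"] * 4
emerging_after = ["x", "y", "z", "w"]

print(entropy_change(stable_before, stable_after))
print(entropy_change(emerging_before, emerging_after))
```

Under this reading, a tag whose annotations spread evenly across new pages shows a large positive change, while a stable tag stays near zero.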



Effective advertising keyword
           classifier
   10+ features extracted from social annotations for each word
   User study performed on Amazon Mechanical Turk

   10-fold cross-validation on different classifiers
       SVM                                            70.3%
       Logistic regression                            69.8%
       C4.5                                           73.3%
       Random forest                                  73.3%
       K-nn                                           67.3%
       Back-propagation neural nets                   63.9%
       Naïve Bayes                                    59.9%
       Best-5 combined                                73.8%

                       Summary
   Leverage social annotations collected from online
    content aggregator users
   Social annotations differ significantly from general
    text corpora
       New metrics / features
       Usage in online advertising



   “Exploring Social Annotations for Effective Advertising Keyword
    Selection”, with Junghoo Cho, work in progress


                   Conclusion
   Web 2.0 phenomenon
       More content sharing and diverse interest


   Personalized online content aggregator
       Easier access to different information sources
       Deliver up-to-date content
       Deliver better personalized recommendations
       Leverage human effort collected in the aggregator


                         References
   [CG03a] Junghoo Cho and Hector Garcia-Molina. “Effective Page Refresh Policies
    for Web Crawlers.” ACM TODS 28(4), 2003
   [DKP01] Pavan Deolasee, Amol Katkar, Ankur Panchbudhe, Krithi Ramamritham,
    and Prashant Shenoy. “Adaptive Push-Pull: Disseminating Dynamic Web Data.”
    WWW 2001
   [OW02] Chris Olston and Jennifer Widom. “Best-Effort Cache Synchronization with
    Source Cooperation.” SIGMOD 2002
   [PO05] Sandeep Pandey and Christopher Olston. “User-Centric Web Crawling.” WWW
    2005
   [WSY02] J.L. Wolf, M.S. Squillante, P.S. Yu, J. Sethuraman, and L. Ozsen. “Optimal
    Crawling Strategies for Web Search Engines.” WWW 2002

   [FLN01] Ronald Fagin, Amnon Lotem, and Moni Naor. “Optimal Aggregation
    Algorithms for Middleware.” PODS 2001
   [Hoy04] Patrik Hoyer “Non-negative Matrix Factorization with Sparseness
    Constraints” Journal of Machine Learning Research, 5:1457-1469, 2004
   [LWL07] Chengkai Li, Ming Wang, Lipyeow Lim, Haixun Wang, and Kevin Chen-
    Chuan Chang. “Supporting Ranking and Clustering as Generalized Order-By and
    Group-By” SIGMOD 2007
   [PP07] Seung-Taek Park and David Pennock. “Applying Collaborative Filtering
    Techniques to Movie Search for Better Ranking and Browsing” SIGKDD 2007

                         References
   [Eft00] E.N. Efthimiadis “Interactive Query Expansion: A User-based Evaluation in a
    Relevance Feedback Environment” JASIS 51(11), 2000
   [KDF05] Diane Kelly, Vijay Deepak Dollu, and Xin Fu. “The Loquacious User: A
    Document-Independent Source of Terms for Query Expansion” SIGIR 2005
   [WPB01] Geoffrey I. Webb, Michael J. Pazzani, and Daniel Billsus. “Machine
    Learning for User Modeling” User Modeling and User-Adapted Interaction,
    11(1-2):19-29, 2001
   [ZMK05] Cai-Nicolas Ziegler, Sean M. McNee, Joseph A. Konstan, and Georg
    Lausen. “Improving Recommendation Lists Through Topic Diversification” WWW
    2005

   [BNJ03] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. “Latent Dirichlet
    Allocation” Journal of Machine Learning Research, 3:993-1022, 2003
   [CM07] Ed H. Chi and Todd Mytkowicz. “Understanding Navigability of Social
    Tagging Systems” CHI 2007
   [WZY06] Xian Wu, Lei Zhang, and Yong Yu. “Exploring Social Annotation for the
    Semantic Web” WWW 2006
   [ZBZ08] Ding Zhou, Jiang Bian, Shuyi Zheng, Hongyuan Zha, and C. Lee Giles.
    “Exploring Social Annotations for Information Retrieval” WWW 2008



Thank you
     Q&A
              Additional slides
   RSS monitoring
     Different data posting patterns
     Optimal size of estimation window
     Consistency of posting rate
   Providing personalized recommendations
     Partition of trust matrix
     Threshold algorithm
     NMF approximation accuracy
     Approximation accuracy
     Multi-armed bandit problem
   Social annotations
     Preferential attachment / usage
     URL in photography category
     Distribution of entropy change
     Performance of different classifiers

Different data posting patterns




Optimal size of estimation window
   Resource constraint: 4 retrievals per day per feed on
    average
   A 2-week estimation window seems an appropriate choice




    Consistency of posting rate
   90% of the RSS feeds post consistently




   Partition of subscription matrix
       Decomposition is useful when matrix is dense
       Real life data is often skewed
       Hybrid method: uses NMF only in its effective region

   [Figure: subscription matrix split into three regions: 1. OTF, 2. VIEW, 3. NMF.
    Axes: users with more subscriptions; blogs with more subscribers.
    Whole matrix: 2.7M subscription pairs.
    Region of users with >30 subscriptions and feeds with >30 subscribers:
    10k feeds, 24k users, ~1M subscription pairs.]
               Threshold algorithm
   Proposed by Fagin et al. [FLN01]
    Efficient computation of top-K items from multiple lists
    with a monotone aggregate function


   [Figure: diagram with labels: blogs, update, template user’s recommendations, query, users]
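
The top-K computation described on this slide can be sketched as follows. This is a minimal illustration of the threshold algorithm idea, not the implementation used in this work. It assumes each input list maps an item to a non-negative score, that a missing item scores 0.0, and that the aggregate function is monotone (sum by default). The function name `threshold_top_k` and the sample data are hypothetical.

```python
import heapq

def threshold_top_k(lists, k, aggregate=sum):
    """Return the k items with the highest aggregated score.

    lists: list of dicts, each mapping item -> non-negative score
           (assumption: an item missing from a list scores 0.0).
    aggregate: monotone function over a sequence of scores.
    """
    # Sorted access: each list ordered by descending score.
    sorted_lists = [sorted(s.items(), key=lambda kv: -kv[1]) for s in lists]
    seen = {}  # item -> full aggregated score (via random access)
    max_depth = max((len(s) for s in sorted_lists), default=0)
    for depth in range(max_depth):
        last_seen = []
        for ranked in sorted_lists:
            if depth < len(ranked):
                item, score = ranked[depth]
                last_seen.append(score)
                if item not in seen:
                    # Random access: look the item up in every list.
                    seen[item] = aggregate(l.get(item, 0.0) for l in lists)
            else:
                last_seen.append(0.0)  # exhausted list contributes nothing more
        # No unseen item can score higher than this threshold.
        threshold = aggregate(last_seen)
        top = heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # early stop: the top k are final
    return heapq.nlargest(k, seen.items(), key=lambda kv: kv[1])

# Hypothetical scores for three blogs from two sources.
source_a = {"b1": 0.9, "b2": 0.8, "b3": 0.1}
source_b = {"b1": 0.2, "b2": 0.7, "b3": 0.6}
result = threshold_top_k([source_a, source_b], k=2)
```

The early stop is what makes the method efficient: the scan ends as soon as the k-th best aggregated score reaches the threshold, so long lists are rarely read to the end.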

    NMF approximation accuracy
   Dense region of subscription
    matrix
        >30 subscribers: 10152 feeds
        >30 subscriptions: 24340 users
   L2 norm comparison
        Rank     SVD       NMF
        80       848.5     856.9
        90       841.6     850.1
        100      835.1     844.6
        110      829.0     837.9
        120      823.2     833.0


   Sparsity of W (23%), H (13%)
   NMF approximation is close to
    SVD and sparse
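
For readers unfamiliar with NMF, the sketch below factors a small 0/1 subscription matrix V into non-negative W and H and reports the L2 (Frobenius) error of the approximation. It uses the standard multiplicative update rules as an assumption; the slide does not say which NMF solver was used. The helper names, the toy matrix, and the parameters `iters`, `seed`, and `eps` are all hypothetical.

```python
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """L2 (Frobenius) norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx)) ** 0.5

def nmf(V, rank, iters=200, seed=0, eps=1e-9):
    """Factor V (n x m, non-negative) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * a / (b + eps) for h, a, b in zip(rh, ra, rb)]
             for rh, ra, rb in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * a / (b + eps) for w, a, b in zip(rw, ra, rb)]
             for rw, ra, rb in zip(W, num, den)]
    return W, H

# Toy matrix: rows are users, columns are feeds, 1 = subscribed.
V = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 1],
     [1, 1, 0, 0]]
W, H = nmf(V, rank=2)
error = frobenius_error(V, W, H)
```

Because the updates only multiply by non-negative ratios, W and H stay non-negative, and entries can be driven to zero, which is one reason NMF factors tend to be sparse.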


        Approximation accuracy
   How many items are approximated by NMF in the top 20 list?
       Ti – top 20 items of user i computed by OTF
       Ai – top 20 items of user i computed by NMF
   About 70% of the top-20 items are recovered; accuracy is higher for higher-ranked items

   [Figure panels: | Ai ∩ Ti | / | Ti |; Correlation with rank]
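
The overlap measure on this slide is simple to compute. The sketch below assumes Ai and Ti are given as lists of item identifiers; the function name `overlap_ratio` and the sample items are hypothetical.

```python
def overlap_ratio(approx_items, exact_items):
    """Fraction of the exact top list (Ti) that also appears in the approximate list (Ai)."""
    exact = set(exact_items)
    if not exact:
        return 0.0  # assumption: an empty reference list yields 0
    return len(set(approx_items) & exact) / len(exact)

# Hypothetical example: 2 of the 4 exact items are recovered.
ratio = overlap_ratio(["a", "b", "x", "y"], ["a", "b", "c", "d"])
```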




    Multi-armed bandit problem
   Well-studied problem in reinforcement learning / statistics

   Problem statement
     Background: You are given n different choices
     Decision: For each choice you receive a numerical reward
       chosen from an unknown stationary probability distribution
     Goal: maximize the total reward over some time period


   Solutions
     Action-value methods (greedy & ε-greedy)
     Softmax Action Selection (decaying)
     Pursuit methods
     Associative search
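
As an illustration of the first solution family listed above, here is a minimal ε-greedy sketch. It is a textbook version, not the method evaluated in this work. The function name `epsilon_greedy`, the reward functions, and the parameter values are hypothetical.

```python
import random

def epsilon_greedy(reward_fns, steps, epsilon=0.1, seed=0):
    """Play `steps` rounds over len(reward_fns) arms.

    reward_fns: one callable per arm, taking a random.Random and returning a reward.
    Returns (estimated value per arm, pull count per arm, total reward).
    """
    rng = random.Random(seed)
    n = len(reward_fns)
    counts = [0] * n
    values = [0.0] * n  # running mean reward of each arm
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n)  # explore: pick a random arm
        else:
            arm = max(range(n), key=lambda a: values[a])  # exploit: best estimate so far
        reward = reward_fns[arm](rng)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return values, counts, total

# Hypothetical arms: Bernoulli rewards with success probabilities 0.2 and 0.8.
arms = [lambda rng, p=p: 1.0 if rng.random() < p else 0.0 for p in (0.2, 0.8)]
values, counts, total = epsilon_greedy(arms, steps=5000)
```

Setting `epsilon=0` gives the purely greedy variant, which can lock onto a poor arm because it never tries the alternatives again.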


Preferential attachment/usage
   URL / Tag usage distribution




URL in photography category
   Documents ranked by p(d|z) values




                               Specific
   Stop word list
   Inverse document frequency (idf)
   Ontology based
   Entropy: H(w) = -\sum_{i=1}^{T} p(z_i \mid w) \log p(z_i \mid w)




   Least specific tags found
       idf – [web, reference, software, design, …]
       Entropy – [temp, for, important, good, …]
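
The entropy measure can be sketched in a few lines. The code assumes the topic distribution p(z_i | w) of a tag w is already available as a list of probabilities summing to 1; the function name `tag_entropy` and the sample distributions are hypothetical.

```python
import math

def tag_entropy(topic_probs):
    """Entropy of a tag's topic distribution p(z_i | w), in nats.

    Terms with zero probability are skipped (0 * log 0 is taken as 0).
    """
    return -sum(p * math.log(p) for p in topic_probs if p > 0)

# A tag concentrated on one topic is specific (low entropy);
# a tag spread evenly over all topics is unspecific (high entropy).
focused = tag_entropy([1.0, 0.0, 0.0, 0.0])
spread = tag_entropy([0.25, 0.25, 0.25, 0.25])
```

This matches the list above: general-purpose tags such as "temp" or "good" are used across many topics, so their entropy is high.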
                     Time-sensitivity
   The usage / associated context changes over time
       “holiday”
           Travel packages: [travel, eclipse, europe, guide, …]
           Christmas shopping: [christmas, gift, shopping, …]
       “programmers”
           Programming: [programming, development, code, patterns, …]
           Job hunting: [work, jobs, career, job, …]




   KL-divergence of two distributions
   Jaccard coefficient of two sets of tagged URL
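
Both measures named above can be sketched as follows. The code assumes a tag's context in each time period is represented as a dictionary of co-occurring tag probabilities (for KL-divergence) and as a set of tagged URLs (for the Jaccard coefficient). These representations, the function names, the smoothing constant `eps`, and the sample data are assumptions for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for distributions given as dicts mapping key -> probability.

    Keys missing from q are smoothed with eps to avoid log(0) (assumption).
    """
    return sum(pv * math.log(pv / max(q.get(k, 0.0), eps))
               for k, pv in p.items() if pv > 0)

def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0  # two empty sets yield 0 here

# Hypothetical context of "holiday" in two periods.
summer = {"travel": 0.6, "europe": 0.3, "guide": 0.1}
winter = {"christmas": 0.5, "gift": 0.3, "travel": 0.2}
drift = kl_divergence(summer, winter)
url_overlap = jaccard({"u1", "u2", "u3"}, {"u2", "u3", "u4"})
```

A large divergence or a small URL overlap between two periods suggests that the tag's meaning has shifted.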
Distribution of entropy change
   Entropy increases over time (+0.1 over 3 months)




    Performance of different classifiers

   10-fold cross-validation

Classifier               Specific   Emerging   Stable
SVM                      65.7%      70.3%      63.9%
Logistic regression      66.3%      69.8%      60.7%
C4.5                     64.5%      73.3%      59.0%
Random forest            65.1%      73.3%      57.4%
kNN (k=5)                60.1%      67.3%      64.5%
Multilayer perceptron    60.3%      63.9%      60.1%
Naïve Bayes              67.4%      59.9%      63.4%
Best-5 combined          66.3%      73.8%      63.4%
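
For reference, 10-fold cross-validation splits the labelled examples into ten disjoint folds and tests on each fold in turn while training on the other nine. The sketch below only generates the index splits; how the data was actually shuffled and split for the table above is not stated, so the function name `k_fold_indices` and the `seed` parameter are hypothetical.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds over n examples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds of near-equal size
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

# Each example lands in exactly one test fold.
splits = list(k_fold_indices(25, k=10))
```

The reported accuracy is then the average over the k test folds.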




				