URL Recommendation based on Asymmetric Tag
   Similarity and Diffusion Based Grouping
                        Praveen Kumar B and C.V.Krishnakumar

                                         I. Introduction
   This project addresses the problem of URL recommendation using predominantly the
user-given tags and sharing patterns from the ShareThis and Del.icio.us data. We use a
slightly different way of selecting a candidate set for analysis: we form a hypergraph
with the URLs as edges, where the asymmetric distance between their corresponding tags forms
the weights of the directed edges. We then use a diffusion-based grouping that enables us to
divide the hypergraph into candidate groups.
   A more refined analysis, taking into consideration the different URL parameters including the
tagging information, the user information, the time information, etc., is then carried out to rank
the URL pairs. These ranked pairs are compared against a particular user's known preferences and
the relevant URLs are returned to them. The overall block diagram of the system is shown in the
following figure.

A. Problem Statement
  The problem statement that we attempt to address in our work is: given that a set of users have
shown interest in certain URLs by tagging or sharing them, what other URLs will they be interested
in? In our approach, we take the act of tagging or sharing a web page as an indication of the
user's interest in the content of the page and as reflective of the user's perception of the page.
In this project, we have the data from ShareThis, which we then extend with Del.icio.us to get a
comprehensive data set of rich tags. Our goal in this work is to build a system that can recommend
interesting URLs to the user, given his tagging information alone, without resorting to techniques
like page scraping and web crawling.




                         Fig. 1. Top Level Block Diagram


                                          II. Motivation
  Collaborative tagging has emerged as a powerful way to label and organize data on the web [1].
In fact, [1] also describes the powerful ways in which tags can be used by converting them into a
hierarchical taxonomy. [2] uses associative mining and clustering of tags to personalize the web
experience of the user. In this work, we attempt to solve the problem of recommending interesting
pages to a user who has already tagged some pages of his interest. This problem has been approached
in various ways: some approaches retrieve the entire text of the page (by scraping or crawling) in
order to determine the quality and/or the content of the page. However, in this project, we take a
different approach and base our entire analysis on the tag information that is provided by the user.

A. Choices made
A.1 Why Tags?
  In this work, we explore whether the rich tags provided by the users themselves are sufficient
to make reasonable recommendations of interesting content. The crawling, page scraping, HTML
cleaning, and indexing processes add considerable overhead to any system that includes them.
Moreover, there are cases wherein the user's terminology differs from the webmaster's terminology.
For instance, the Yahoo! homepage, though the world's most popular portal, does not contain the
word "portal". In such cases, anchor text comes in handy in traditional applications. In this
project, we explore the use of tags as being characteristic of the users' perceptions of a
particular site.

A.2 Delicious
   We initially started out with data from ShareThis alone. However, we felt the need to
augment it with data from a much more diverse source. Delicious, a popular social bookmarking
site, is a very trusted source and enables us to tap into its tag information, thereby providing
additional data in terms of tags, sharing patterns, an index of the popularity of the page, etc.,
which we integrated with the ShareThis data.

                          III. Data Collection and Preprocessing
   The first step in our method involves the collection, cleaning, and preprocessing of data. Since
we also needed to integrate data from different sources, this step assumed a greater significance
in our case. We first collect all the data present in ShareThis. This includes comprehensive
information about each shared item: the title, the anonymized userid of the user, the text of the
tag, the last activity date, etc. We augment this data by querying Delicious for the URLs and
integrating the Delicious data with the ShareThis data.

A. Preprocessing Steps
A.1 Matching of URLs in Delicious and ShareThis
  There are cases wherein the exact URL tagged in ShareThis does not match the one stored
in Del.icio.us. We solved this problem by performing a recursive depth reduction of the URLs.
For example, suppose the URL that has been tagged using ShareThis is
www.nytimes.com/depth5/depth4/depth3/depth2/depth1/depth0.html
  If this very URL has not been tagged in Delicious, no match would be found between the
Delicious data and the ShareThis data for this URL. In such a case, we use a concept that we
define as the Depth of a URL. We define the Depth of a URL [ δ ] as the number of terms, separated
by forward slashes '/', that do not exist in the matched URL but do exist in the original. For
instance, for the above example, if the URL available in Del.icio.us was
www.nytimes.com/depth5/depth4/ , then we consider this to be a match of δ = 4.
  This matching is done recursively. In order to avoid a rapid succession of hits on Del.icio.us,
we sort the entries to be matched and keep a cache of the last k entries matched, so that if
there had been a match in one of the k previous steps, that match is taken without the need to
query Delicious. This step is important since it greatly enhances the speed of the implementation,
especially since Del.icio.us enforces a gap of at least 1 second between successive requests.
Moreover, this process was further hastened by executing the jobs in parallel on multiple nodes
and then integrating the results at the end.
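As a rough sketch of this matching step (the function names and the in-memory `delicious_urls` set are our own illustrative stand-ins; the real system issued rate-limited queries to Del.icio.us), the recursive depth reduction with a small cache of recent matches might look like:

```python
from collections import OrderedDict

def depth_reduce(url):
    """Yield successively shallower prefixes of a URL by stripping
    path components from the right, together with the depth stripped."""
    parts = url.rstrip("/").split("/")
    for depth in range(len(parts)):          # depth = number of components removed
        yield "/".join(parts[:len(parts) - depth]), depth

def match_url(url, delicious_urls, cache, cache_size=1000):
    """Return (matched_prefix, delta) for the deepest prefix of `url`
    known to Del.icio.us, or (None, None). `delicious_urls` stands in
    for the remote lookup; the cache avoids repeated queries for
    recently matched entries."""
    if url in cache:
        return cache[url]
    result = (None, None)
    for prefix, depth in depth_reduce(url):
        if prefix in delicious_urls:         # in practice: a rate-limited API call
            result = (prefix, depth)
            break
    cache[url] = result
    if len(cache) > cache_size:
        cache.popitem(last=False)            # evict the oldest cached entry
    return result
```

On the example from the text, the deepest known prefix is www.nytimes.com/depth5/depth4, giving δ = 4.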

A.2 Tag word Split
    ShareThis gives us a TagText field that holds the actual tag applied by some user to a
URL. Typically, the tags are extremely short, spanning at most 3 words on average. There are
primarily two methods to process such tag text:
     • Standard Information Retrieval techniques: these mainly involve text-based analysis of the
       tagtext, using parameters like Term Frequency and Inverse Document Frequency. However,
       since the tagtext was very minimal, it did not make much sense to represent it as a
       document and process it as one.
     • Processing on split tags: we followed a simpler approach of splitting the tags (by space)
       and considering them as individual words that could be dimensions of a vector in an
       n-dimensional vector space. For the sake of uniformity, we needed to carry out some
       additional preprocessing, described next.

A.3 Stemming, Lowercasing and Stop Word Removal
  Splitting the tagtext into individual words needed to be accompanied by additional
preprocessing steps. These included lowercasing all the text in the data (including the URL and
title) so that tags such as Google and google are not counted as distinct. Also, since the tag
space was inherently sparse, in order to reasonably reduce the dimensions involved we also
stemmed the tags, while retaining the original tag. We used Porter's stemmer to do the stemming.
In addition, we found that certain split words were very common amongst the tags and undesirably
added noise to the system; such stop words were also removed from the tags. Using this approach,
the tagtext "This Musician is the best performer" becomes "music best perform".

A.4 Popularity Count
  We define the Popularity Count as the total number of people who have tagged a URL. We store
this count as a measure of the popularity of the URL concerned.

B. Data Obtained
  After performing the preprocessing steps on the integrated data, we loaded the cleaned data
into the Aster Framework using the nCluster Loader. The following are some of the important
parameters of interest to our application.
     • URL
     • Depth
     • Last activity date
     • Total number of tags for the URL
     • Popularity Count of the URL (the total number of people who have tagged it)
     • User's tagging patterns (from ShareThis only)
  The statistics of the data are given in the table below:
                                             TABLE I
                                        Statistics on Data
                       Statistic                       Value
                       No of Total Records             1080873
                       No of Total Distinct Tags       5823
                       No of Unique URLs               10205
                       No of Unique Users              1190259(Sharethis)
                       No of Unique Matching URLs      3762
                       Average depth                   1.9684
                       Average Popularity              480.825


                           IV. Overview of the Approach Used
  The top-level view of the algorithm that we use in this system is depicted in the flowchart in
Figure 2.




                                     V. Algorithm Used
  The algorithm that we use here follows a two-phase process:
   • Selection of the candidate sets with a diffusion algorithm.
   • Ranking of the selected group of candidates, taking into account important static
     parameters of the URL and its confidence.




                         Fig. 2. Overview of Algorithm Used


  The algorithm is depicted in Figure 2. As mentioned in the problem statement, the primary aim
of our system is to recommend interesting URLs to the user, given that he has already tagged some
URLs that are considered a reflection of his interest. The working of our 2-phase algorithm, which
first selects the candidate group set and then scores the elements in it, is described below.
  We start the process from a Target Set of URLs, say T = {U1 , U2 , ..., Un }, which are the
URLs in which the user(s) have shown interest.

A. Data Collection from the Target Data Set
  The data collection step is the process of obtaining an initial set of records to work with. The
first cycle in Figure 2 is responsible for this task. From the repository, the system finds the
set of all records corresponding to each Ui ∈ T. Let this set be called R. Following this, the
system finds all the tags that occur in R and forms another set TagR. As a final step, all the
URLs that correspond to any tag Ti ∈ TagR are taken up. Hence, we end up with a new, bigger,
loosely related URL set, called the Working Set, for further processing.
  If the data is very sparse, it is possible that one iteration of the previous procedure would
not yield sufficient URLs. In such a case, we do multiple iterations until the Working Set
is sufficiently big.

A.1 Implementation
  In our system, the Target Set was a collection of 19 URLs. We then formed a Working Set by
collecting all the tags associated with these 19 URLs, and then collecting all the URLs associated
with the tag set so obtained.
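A minimal sketch of this URL → tags → URLs expansion, assuming the repository is available as a list of (url, tag) records (the function name and thresholds below are illustrative, not from the original implementation):

```python
def build_working_set(target_urls, records, min_size=1000, max_iters=3):
    """Expand a Target Set of URLs into a Working Set by alternating
    URL -> tags -> URLs hops over the tagging records.
    `records` is a list of (url, tag) pairs."""
    working = set(target_urls)
    for _ in range(max_iters):
        tags = {t for (u, t) in records if u in working}      # tags seen on current URLs
        working |= {u for (u, t) in records if t in tags}     # URLs sharing those tags
        if len(working) >= min_size:                          # sparse data: keep iterating
            break
    return working
```

With `min_size` set high, the loop performs the multiple iterations described above for sparse data.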

B. Diffusion Algorithm and Notion of Asymmetric Distance
  In this subsection, we explain the Diffusion Algorithm, which is based on asymmetric distances,
similar to the notion described in [3]. We first define the key quantity:
     • Asymmetric Distance: this is the distance measure that we use in our analysis, and it is
       the measure on which we base our confidences. In our case, the asymmetric distance of a
       pair of nodes A and B is

           AD(A → B) = |A.Tags ∩ B.Tags| / |A.Tags|

       The asymmetric distance gives the weight for the directed edge A → B in the hypergraph.
  We use the notion of asymmetric distance, as against a symmetric distance, since it models the
notion of confidence better. The confidence is defined in the same way as in the standard Apriori
algorithm. The first step is to place an edge between all pairs that share at least one tag. The
edges are directed and are weighted by the asymmetric distance of the pair, as defined above. A
sample graph after this first step is shown in Figure 3. The green node marks the Target Set.
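The asymmetric distance is straightforward to compute from the two tag sets; a sketch (the function name is ours):

```python
def asym_distance(a_tags, b_tags):
    """AD(A -> B) = |A.Tags ∩ B.Tags| / |A.Tags|.
    Note that in general AD(A -> B) != AD(B -> A): the denominator
    is the tag set of the source node only."""
    a, b = set(a_tags), set(b_tags)
    if not a:
        return 0.0
    return len(a & b) / len(a)
```

For example, if A carries tags {mac, iphone, apple} and B carries {apple, music}, then AD(A → B) = 1/3 while AD(B → A) = 1/2, which is exactly the asymmetry exploited by the diffusion step.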




                         Fig. 3. Step 1


   We note that the node E is not connected to any of the others because it does not share any
tag with them. Following this, the second step consists of growing the candidate set outwards by
taking in the next level of nodes, as shown in Figure 4. Here, if there is an edge (A, B) and
there exists an edge (B, C), we also include (A, C) with weight = weight(A,B) * weight(B,C).
We note that here E has been pruned out, and also that transitive edges between D and B have not
been considered. This is because, at each stage, we don't consider transitive pairs that do not
contain an element of the Target Set, nor do we include joins.
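A sketch of this growth step, under the assumption that the edges are kept as a dict from (source, destination) pairs to weights (this representation is ours, not from the original implementation):

```python
def grow_edges(edges, target_set):
    """One expansion step of the diffusion: for edges (A,B) and (B,C),
    add (A,C) with weight(A,B) * weight(B,C), but only for chains whose
    source A lies in the Target Set, as described in the text."""
    new_edges = dict(edges)
    for (a, b), w_ab in edges.items():
        if a not in target_set:          # skip transitive pairs not rooted in T
            continue
        for (b2, c), w_bc in edges.items():
            if b2 == b and c != a and (a, c) not in new_edges:
                new_edges[(a, c)] = w_ab * w_bc
    return new_edges
```

Repeating this step and pruning edges below a threshold yields the candidate set described in the next subsection.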




                         Fig. 4. Step 2




B.1 Pruning out of results
  This process is repeated iteratively until we obtain a candidate set. At each iteration, we
keep a threshold with the help of which we prune out the edges falling below it.

B.2 Implementation
  Our implementation of this algorithm comprised a self-join on tag over the set of nodes to give
the pairs of elements, and then a map-reduce operation on them that computes the number of tags
each pair of nodes shares. This value is used to find the confidence of each edge. We perform two
optimizations here:
     • Pruning by confidence
     • Pruning by size
  Some conditions to be guarded against include the possibility of self-loops amongst nodes, and
an exponential increase in the number of records after the self-join as a result of the presence
of a very common tag. In particular, we had a tag "refer" which was highly common and caused a
result of around 5 billion records. The noisy data was cleaned at this stage by removing this tag,
which acted very much like a stop word. We performed two iterations and obtained 302356 records
to be analyzed.
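The actual self-join and count ran as a map-reduce job on the Aster nCluster framework; a single-machine Python sketch of the same computation (names ours) might be:

```python
from collections import defaultdict
from itertools import combinations

def shared_tag_counts(records):
    """Self-join on tag: every pair of URLs carrying the same tag
    contributes one shared tag; the reduce step sums per URL pair.
    `records` is a list of (url, tag) pairs. Sorting the pair avoids
    self-loops and double counting."""
    by_tag = defaultdict(set)
    for url, tag in records:
        by_tag[tag].add(url)
    counts = defaultdict(int)
    for tag, urls in by_tag.items():
        for a, b in combinations(sorted(urls), 2):   # no self-loops
            counts[(a, b)] += 1
    return counts
```

As noted above, an extremely common tag (like "refer") makes `by_tag[tag]` huge and blows up the pair count quadratically, which is why such tags must be removed before the join.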

C. Computation of soft Groups of Related URLs
   We define the group of candidate URLs for a given URL as a set of candidate URLs that are
potentially similar to the target URL, where similarity is measured by the asymmetric distance.
From the previous step, we group the records by their source URLs and consider the records whose
source URLs are present in the Target Set T. Thus, we now have a loose group of candidate
elements for each URL in the Target Set T. We now proceed to analyze each of these URLs with
much more stringent parameters.

C.1 Soft ’Clustering’
  Soft clustering enables an item to be in more than one group, with varying degrees of
confidence. This reflects our model better, wherein a single URL can be placed under different
categories based on the tags that users assign to it. However, we do not use the term clustering
in a strict sense: in our approach, we just group the candidate set of URLs and use them for the
final processing.

D. Ranking of URLs within a given group
  Within a given candidate group, we use the following formula to rank the URLs:

  Score(A → B) = B.popularityScore * ( Confidence(A → B) / δ(B) ) * (0.99)^(age-in-days)

where

B.popularityScore = the number of people who have tagged B, an indication of its popularity.

δ(B) = the depth of the match for B; this enables us to weigh down URL pairs that match only
on their hosts.

age-in-days = the recency of the URL as measured by the last activity date. The older a URL is,
the more it fades away, as reflected in the (0.99)^(age-in-days) factor.
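The ranking formula can be sketched as a function; note that we floor the divisor at 1 as a guard against δ(B) = 0 (an exact URL match), a case whose handling the text does not specify:

```python
def score(confidence, b_popularity, b_depth, age_in_days):
    """Score(A -> B) = B.popularityScore * (Confidence(A->B) / delta(B))
                       * 0.99 ** age_in_days
    Flooring the depth divisor at 1 is our assumption for delta = 0."""
    return b_popularity * (confidence / max(b_depth, 1)) * (0.99 ** age_in_days)
```

The 0.99 decay factor halves a URL's score roughly every 69 days of inactivity, so fresh URLs dominate.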


E. Identification of k-best URLs
  Once the scores have been assigned to each URL in the candidate group, the system considers
the weights of the URLs in its candidate set as calculated in the previous step. The top k
results are then provided as a recommendation to the user. Before they are presented to the user,
they are re-ranked by the relative preferences of the user across the various groups.
  More formally, if the user has tagged Xi URLs in group i, then the new score, score_new, of a
URL returned from group i is

  score_new(i) = ( Xi / Σi Xi ) * score(url)

  where score(url) is the score of the URL as determined by the algorithm.
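This re-ranking can be sketched as follows, assuming we know each URL's group and the user's per-group tag counts Xi (the data layout is ours):

```python
def rerank(group_counts, group_of, base_scores):
    """score_new = (X_i / sum_j X_j) * score(url), where X_i is the
    number of URLs the user has tagged in group i.
    group_counts: {group: X_i}; group_of: {url: group};
    base_scores: {url: score(url)}."""
    total = sum(group_counts.values())
    return {url: (group_counts[group_of[url]] / total) * s
            for url, s in base_scores.items()}
```

URLs from groups in which the user has tagged more heavily are thus promoted relative to equally scored URLs from lightly tagged groups.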

                                      VI. Results and Evaluation
  In this section, we present some results that we obtained on using our system to generate URL
recommendations.
  For the use case wherein we consider 19 URLs as our initial points, the URL grouping statistics
are as given in Table II.
                                                   TABLE II
                                         Statistics of Use-Case Data
                                       Statistic               Value
                                       No of Total Groups      19
                                       Maximum Group Size      765
                                       Minimum Group Size      24
                                       Average Group Size      253.7


  For the entire data, we have the statistics as given in Table III
                                                  TABLE III
                                          Statistics of Entire Data
                                       Statistic               Value
                                       No of Total Groups      5237
                                       Maximum Group Size      3757
                                       Minimum Group Size      1


  We present the results of three distinct target URLs and analyze them here.

A. Result on New York Times
  In this subsection we examine the results that our approach produced on a group represented by
the New York Times, shown in Figure 5. The black bubbles reflect the popularity of the URLs and
the red bubbles are the resulting scores assigned to the URLs by our algorithm. We see that the
red bubbles envelop the black bubbles most of the time. This means that the algorithm either
pushes some URL's score down or pulls some less popular URL's score up. This is very evident from
the single red dot at around (100, 4): it was actually a less popular URL whose score was boosted
by our algorithm. Also, near (25, 4), some of the more popular URLs are pushed down.
  This happens because we consider other parameters that model the usefulness of the URL, and
not just its popularity count.




Fig. 5. View of Group of New York Times

B. Dot Tunes Group
   In Figure 6, we see another extreme case, wherein the number of tags is very small. Although
in this case the algorithm could be expected to perform badly owing to the lack of tags, we were
surprised to find that the resultant scores from our algorithm managed to stay on track with the
URLs' popularity counts.




Fig. 6. View of Group of Dot Tunes


  One possible reason for this could be that the data came from a small group consisting of nodes
that were densely linked to each other, and hence the relevancy score shot up.




C. ILounge Group
  In this case, the user had tagged http://www.dottunes.net/ilounge.htm. Our algorithm generated
a candidate group of URLs and ranked each URL in the group by the measure described. The top
twenty-five results for this experiment included the following:
    http://5thirtyone.com/archives/862
    http://www.lifeclever.com/17-power
    http://macenstein.com/default/archives/1112
    http://macenstein.com/default/archives/1651
    http://gizmodo.com/5053280/androids-10-most-exciting-apps
    http://www.bluelounge.com/thesanctuary.php
    http://www.methodshop.com/gadgets/tutorials/handbrake/index.shtml
    http://lifehacker.com
    http://my.barackobama.com/page/content/iphone
    http://howto.wired.com/wiki/Read_Ebooks_on_Your_iPhone

In this case too, the URL that we started from was an obscure URL, but we got good results here.
An even more surprising observation was that most of the URLs came from diverse sources.

D. Weakness of our approach
  However, our approach failed in the case of a group represented by speakfrench, present in the
small sampled data. The results obtained are given below:
 http://hundredpushups.com/index.html
 http://www.bbc.co.uk/languages/
 http://www.lifehack.org/articles/technology/15-coolest-firefox-tricks-ever.html
 http://ali
 http://5thirtyone.com/archives/862
 http://mashable.com/2007/05/14/video-howtos/
 http://abduzeedo.com/node/133

  We see that, other than the 2nd result, none of the others is related to speaking French. The
reasons for such a result are summarized below:
     • The tags for the URLs in this group are few and very generic. This causes the results to
       go haywire, since the generic tags add to the noise.
     • We realized that a single iteration of data extraction is not sufficient; it must be done
       multiple times and on a large amount of data.
     • The scores fall more rapidly for recommended URLs in the sample data than in the
       complete data.
  However, we repeated the same experiment with the entire corpus of data and obtained
http://www.bbc.co.uk/languages/ as the first result:
http://www.bbc.co.uk/languages/ | 3983 | 666.454
http://hundredpushups.com/index.html | 8330 | 562.333
http://mashable.com/2007/05/14/video-howtos/ | 1564 | 164.951
http://see.stanford.edu/see/courses.aspx | 213 | 140.926
  This shows that the presence of a large data corpus (in this case, a large number of tags)
negates the undesired characteristics presented by small chunks of data.
                                VII. Observations and Inferences
  In this section, we present some interesting insights into the characteristics of the data we used.
Our data came primarily from two sources: Del.icio.us and ShareThis. Whilst the ShareThis
data was more comprehensive and complete, the Del.icio.us data was more dynamic and diverse.
We used a data set that integrated the data from both these rich sources in order to harness the
power of the rich tags that the users provide.

A. Variation of TagCount across URLs
  Figure 7 presents an insight into the distribution of tag counts across the URLs. We see
that the shape of the curve closely resembles a power-law distribution. This was a surprising
result, and shows that it is the already popular URLs that get a higher number of tags from the
users.




Fig. 7. Distribution of TagCount across the URLs


B. Variation of Age across URLs
   Figure 8 presents an insight into the distribution of age across the URLs. We define age as
the number of days since the last activity was performed on that URL. We found some expected
results here. The lower part of the graph, closest to the X-axis, is very dense, which indicates
that most of the URLs that we consider are quite fresh. Since age plays an important role in our
ranking function, the recency or freshness of the URL is extremely important.




                        Fig. 8. Variation of Age across the URLs


C. Variation of Depth across URLs
  Figure 9 presents a peek into the distribution of depth across the URLs. Recall that the Depth
of a URL [ δ ] is the number of terms, separated by forward slashes '/', that do not exist in the
matched URL but do exist in the original. The greater the depth of a match, the more generic and
the less interesting it is. So, getting good matches on tags at very high depths is not a very
interesting phenomenon. However, our plot of the distribution of depths shows most matches
occurring at depth levels of zero or two. The purpose of Figure 9 is to validate our notion of
depth and to make sure that it does not play an unfair role in the ranking of the URLs.




                        Fig. 9. Variation of Depth across the URLs

                     Fig. 10. Variation of Popularity Count across the URLs




                     Fig. 11. Variation of Popularity Count across all the URLs




                     Fig. 12. Zoomed in Version of Popularity Count across all
                     the URLs


In Figures 11 and 12 we see the distribution of the popularity of the URLs. The popularity
reflects the total number of tags that a URL has. We observe that, apart from a few outliers,
most of the URLs have few tags, as is evident from Figure 12.

                                         VIII. Limitations
    •   In our approach, for the initial candidate selection, we depend only on the tags provided
        by the users, under the assumption that the tags are reflective of the content. From our
        analysis, we found that this holds for cases where there are a large number of tags.
        However, for URLs having a small number of tags, this approach can be led astray by the
        presence of a few very generic tags. For instance, in the French-learning group
        (mentioned in the previous section), our approach performs very poorly.
    •   In our approach, we have not normalized the confidence at each node of the hypergraph.
        This restricts us from applying probabilistic analysis to the weight propagation mechanism.
    •   In this method, we do not cluster the users. This was because we could not find a way
        around Delicious's access control, which did not permit access to the user information
        and the other tags a user had posted. Thus, our entire analysis is based on the ShareThis
        data, with additional URL and tag information aggregated from Del.icio.us.

                                        IX. Future Work
    •   We hope to add the dimension of semantic tag distances based on an ontology such as
        WordNet, so as to provide more refined distance measures.
    •   Currently, we do not look into the periodicity of URL tagging or sharing behaviour. Our
        ranking function always gives a slight preference to fresher URLs. However, for events
        like the Oscars, webpages may receive periodic bursts of activity. We believe that if we
        can identify the temporal model, the recommendations would be richer and more
        context-sensitive.
    •   Currently, all of the work is done offline. In the future, we would like to extend and
        test this approach on online models.
    •   Currently, we generate the candidate itemsets only on the basis of tags. However, this
        can sometimes be a biased estimate. We hope to consider other simple, unbiased
        parameters, like the domain names of the users, when generating the candidate itemsets.

                                          X. Conclusion
   In this project, we proposed and experimented with an approach to recommend URLs based on
users' tagging and sharing data drawn from Del.icio.us and ShareThis. It is based on simple
algorithms for generating a candidate set for each target URL, and a much more refined function
operating on the candidate set to provide the appropriate recommendations. Although this method
can certainly be improved with more sophisticated analysis, we observe that the current
implementation does reasonably well on many of the recommendations. The section on Observations
lists the plots that support the fact that tagging information is a rich source of information
that can be harnessed.




                                        References
[1] Paul Heymann and Hector Garcia-Molina, Collaborative Creation of Communal Hierarchical
   Taxonomies in Social Tagging Systems, InfoLab Technical Report, Stanford University, 2006.
[2] Bamshad Mobasher, Creating Adaptive Websites through Usage-Based Clustering of URLs,
   University of Minnesota.
[3] Kazumasa Ozawa, Classic: A Hierarchical Clustering Algorithm Based on Asymmetric
   Similarities, Osaka Electro-Communication University, 1981.
[4] Maria Grineva, Maxim Grinev, et al., Harnessing Wikipedia for Smart Tags Clustering,
   Proceedings of Triple-I, 2008.
[5] Andrew McCallum and Kamal Nigam, A Comparison of Event Models for Naive Bayes Text
   Classification, AAAI-98 Workshop on Learning for Text Categorization, 1998.
[6] Edwin Simpson, Clustering Tags in Enterprise and Web Folksonomies, HP Labs Tech Report,
   2008.



