Collective Collaborative Tagging System

Jong Youl Choi, Marlon Pierce
Department of Computer Science, Community Grid Lab
Indiana University at Bloomington
Email: email@example.com

Abstract—

I. INTRODUCTION

Our motivation is twofold.

• Collaborative tagging, also known as social tagging, is a system for collecting knowledge from people, and the quality of the knowledge users can obtain increases as the quantity of contributed data grows. Many collaborative tagging sites exist on the Internet today, but there is no way to integrate the data from these multiple sites into one large, unified collection of collaborative data from which users could obtain more accurate and richer information than from any single site.

• With the recent development of information retrieval (IR) and machine learning technology, many IR algorithms have been well studied and made publicly available. Although most collaborative tagging sites provide various search services, their algorithms are closed to the public and largely opaque to users. Furthermore, most sites provide only one type of search algorithm, so users have no way to apply other IR algorithms to find the best information available in the data. Applying various different search algorithms to the same data set gives users more chances to discover hidden information buried in the data.

The purpose of this paper is twofold: i) to introduce a new collaborative tagging system which can collect tag data from other repositories and merge them in order to provide better quality of knowledge, and ii) to compare commonly used algorithms for folksonomy analysis.

II. A NEW SYSTEM

Motivated by the above observations, we propose a collective collaborative tagging (CCT for short) system which can provide various collaborative tagging services to users in a uniform way. Our CCT system is designed to provide the following key functions:

• Importing data from multiple sources to build a large and unified tag repository
• Query services with options to run various IR algorithms
• Query services with options to run against different data sources and parameter settings

Fig. 1. Overview of the Collective Collaborative Tagging (CCT) System

A. Architecture

The system consists of three main components: the data importer, the data coordinator, and the user service (Figure 1). The details of the three main components are as follows.

• Data Importer: Imports tagging data in machine-readable formats such as RDF, RSS, Atom, or Web APIs from a number of different collaborative tagging sites. Importing can be done asynchronously or synchronously.

• Data Coordinator: Merges data from the different sources and stores them in a uniform repository. The coordinator resolves the format conflicts and duplication problems which may exist across multiple sites.

• User Service: Provides various machine-learning-based search algorithms, with options users can choose, in the form of a Web service API. Queries are performed against the unified repository which stores the tagging data collected from the different collaborative tagging systems.
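As a concrete illustration of the Data Importer step, the sketch below parses a single RSS item from a tagging feed into a (document, tags) record. The feed snippet, function name, and tags are invented for this example and do not come from any actual service.

```python
# Minimal sketch of the Data Importer step: turning one RSS item from a
# collaborative tagging site into a (document, tags) record. The feed
# below is a made-up example, not output from a real service.
import xml.etree.ElementTree as ET

FEED = """<rss><channel><item>
  <title>Example bookmark</title>
  <link>http://example.org/doc1</link>
  <category>folksonomy</category>
  <category>tagging</category>
</item></channel></rss>"""

def import_items(feed_xml):
    """Extract (link, [tags]) records from an RSS feed string."""
    root = ET.fromstring(feed_xml)
    records = []
    for item in root.iter("item"):
        link = item.findtext("link")
        tags = [c.text for c in item.findall("category")]
        records.append((link, tags))
    return records

records = import_items(FEED)  # [("http://example.org/doc1", ["folksonomy", "tagging"])]
```

A real importer would fetch the feeds over HTTP and would also need handlers for RDF, Atom, and site-specific Web APIs, as described above.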
B. Service Type

Various kinds of user requests to extract information from the annotated data can exist in a collaborative tagging system: for example, searching for items by tags, getting personalized recommendations based on a user's profile or past activities, or discovering groups of users or communities sharing similar interests, to name just a few. These demands can generally be categorized into four classes, and our CCT system will provide services to support them. The classification is not exclusive; the classes overlap to some extent.

Type I – Searching by tags: For a given set of tags as input, finding the objects most relevant to those tags is an essential function of a collaborative tagging system. Generally, the objects can be documents, items, users, or anything else annotated by tags in the system. Results are returned to users in an order based on some computed score.

Type II – Recommendation: With no explicit input of tags, the system returns a recommendation list of objects. While the input tags used in searching by tags must be explicitly defined by a user, in recommendation they are generated implicitly by the system based on the user's previous activities, preferences, or profile. For example, the system can give a user a recommendation list of documents the user has not yet discovered, based on the user's past tagging activities. Recommendation of tags is also possible: when a user wants to annotate a document for the first time, the system can recommend tags commonly co-used with the user's initial input.

Type III – Clustering: This is so-called community discovery. Beyond searching for the most relevant objects, it is also useful to find a group or community whose members share more common interests, expressed by tags, with one another than with outsiders.

Type IV – Trend Detection: The system analyzes tagging activities in a time-series manner and detects interesting tagging patterns or abnormalities in the tag data set.

More specific examples of the service types, and of the information users can get in each category, are summarized in Table I. In the following section we discuss how those services can be implemented using various machine learning algorithms.

TABLE I
GENERAL TYPES OF SERVICE IN COLLABORATIVE TAGGING SYSTEMS

Type  Name             Description
I     Searching        For a set of given tags as an input, find the most relevant objects (documents, items, users, or tags).
II    Recommendation   Create a recommendation list of objects which a user has not observed yet. The user's profile or past activity information can be used.
III   Clustering       Find communities or groups of users or objects based on similarity.
IV    Trend detection  Detect interesting or abnormal tagging behavior in a time-series analysis manner.

III. MODELS FOR TAG ANALYSIS

A collaborative tagging system is designed to utilize the power of people's knowledge and to provide an efficient way of searching for information in the collaboratively annotated data set. In this way, the system can help users find information more efficiently and discover unexposed or hidden information buried under piles of data. Developing efficient models and algorithms for searching is thus the key step in building a successful collaborative tagging system. In this section, we discuss the models used in building folksonomy search engines and various algorithms for searching and tag analysis.

A. Models

In building an efficient search engine for folksonomies, how to represent the folksonomy data is an important issue. In the field of Information Retrieval (hereafter IR for short), two models – the vector space model and the graph model – have been widely used, and both are well applicable to folksonomy indexing.

Although the two models share many similar aspects, they are distinct from many practical points of view. For example, Latent Semantic Indexing (LSI), which we discuss in detail later, uses the vector space model for indexing and for measuring pairwise similarities between objects, while the famous PageRank ranking algorithm used by Google and its folksonomy-search variant TagRank are based on the graph model. While the vector space model has been widely used in many areas due to its simplicity, relatively little research has been conducted on the use of the graph model so far.

1) Vector space model: In the vector space model, also known as the bag-of-words model, each object is represented as an unordered collection of tags; in mathematical notation, a vector is used. That is, an object dj is represented as a q-dimensional column vector (w1j, ..., wqj), where q equals the total number of distinct tags in the system and wij is a weight for the occurrence of the tag ti (we will discuss various weight schemes shortly). Thus, the whole collection of n objects can be represented as a matrix A ∈ Rq×n in which each column corresponds to an object dj. An example is shown in Figure 2.

          d1   d2   d3
    t1 [ w11  w12  w13 ]
A = t2 [ w21  w22  w23 ]
    t3 [ w31  w32  w33 ]
    t4 [ w41  w42  w43 ]

Fig. 2. An example of a tag-object matrix with q = 4 tags and n = 3 documents
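As a concrete sketch of this construction, the following code builds a small tag-object count matrix A from a list of (user, document, tags) transactions. The sample annotations and all names in them are hypothetical.

```python
# Minimal sketch: building the tag-object matrix A (q tags x n documents)
# from raw (user, document, tags) transactions. The annotations below are
# hypothetical sample data.
annotations = [
    ("u1", "d1", ["t1", "t3"]),
    ("u2", "d2", ["t2", "t3"]),
    ("u3", "d3", ["t1", "t4"]),
    ("u1", "d3", ["t4"]),
]

tags = sorted({t for _, _, ts in annotations for t in ts})   # q distinct tags
docs = sorted({d for _, d, _ in annotations})                # n objects

# wij = raw occurrence count of tag ti on object dj (plain TF weighting)
A = [[0] * len(docs) for _ in tags]
for _, doc, ts in annotations:
    for t in ts:
        A[tags.index(t)][docs.index(doc)] += 1
```

Each entry here is a raw occurrence count; the TF-IDF weighting discussed below would replace these raw counts.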
2) Graph-based model: Although the vector space model is simple and easy to use, it sometimes lacks the ability to describe object-object relationships, which is easier in the graph model. In the graph model, a folksonomy is represented as a network of connections, also known as a tag graph, which consists of objects as nodes and connections between objects as edges. An example is shown in Figure 3. More specifically, a tag graph is an undirected tripartite graph G = (V, E) in which the nodes V fall into three disjoint subsets of entities – objects, tags, and users – and edges exist only between entities of different types. An edge is added for each transaction, i.e., each annotation of an object with a set of tags by a user.

Fig. 3. An example of a tag graph. The data used in this figure was obtained from our in-house collaborative tagging system, the MSI-CIEC portal (see Table III).

B. Similarity Measurement

Measuring the similarity between two objects is a key step in folksonomy analysis, and it is directly related to the performance of the system. Although it is possible in folksonomy analysis to measure various types of similarity, such as object-object, object-tag, object-user, user-tag, and user-user, in this paper we consider only object-object similarity for simplicity. The other measurements can be estimated in the same manner.

1) Weight Measurement: Weight measurement is a scheme for quantifying the weight element wij of the tag-object matrix A for each 1 ≤ i ≤ q and 1 ≤ j ≤ n. A simple-minded approach is to count the occurrences of the tag ti on the object dj, which is known as Term Frequency (TF for short). As observed in much IR research, however, this approach fails to make use of low-frequency terms or tags. Tag distributions in folksonomies usually follow Zipf's power law, in which a few majority tags dominate the distribution, so minor tags can lose their importance in many search algorithms. Some normalization scheme should therefore be used to avoid this problem and to extract more varied information by exploiting minor tags in folksonomies. Various schemes have been suggested in the IR literature, but the most popular is Term Frequency-Inverse Document Frequency (TF-IDF for short), the product of TF and IDF. In a nutshell, the term frequency tfij is the number of occurrences of the tag ti on document dj, and the document frequency dfi is the number of documents having the tag ti. IDF is computed as log(n/dfi), where n is the total number of documents, and thus TF-IDF equals tfij × log(n/dfi). The formulas are summarized in Table II.

2) Similarity Measurement: Similarity measurement quantifies the degree of likeness between two tagged objects in a folksonomy. In the vector space model, various similarity measures have been developed in the field of IR; in practice, three of them are the most popular: Cosine, Jaccard, and Pearson (summarized in Table II).

While in the vector space model such similarities are measured by geometric characteristics (such as cosine angles) or statistical means (such as Jaccard and Pearson), similarities in the graph model can be measured by graph-theoretic properties, such as hop distances, shortest paths, maximum flows, and so on.

Pairwise similarity is also an important measurement for finding groups or communities. Note, however, that measuring pairwise similarity differs between the two models. In the vector space model, all object-object similarities can be computed directly from the tag-object matrix A; i.e., we can compute a pairwise similarity matrix D = [δij] ∈ Rn×n whose entries δij are obtained by computing the similarity between any two objects dj and dk among the total of n documents. The computational cost of building the n×n pairwise similarity matrix D is thus O(n²).

In the graph model, however, we cannot compute pairwise similarities directly from the matrix A; instead, we must do so iteratively. First, compute similarities only for directly connected objects, i.e., objects sharing at least one common tag; then measure the similarities of the remaining pairs, which have no direct connections, by discovering paths between them. Path discovery can be done with shortest-path algorithms. The Floyd-Warshall algorithm [2] is well known for this problem and generally requires O(n³) computations.
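The path-discovery step above can be sketched with the Floyd-Warshall algorithm [2] on a toy tag graph. The tiny graph below, with one user, one tag, and two documents, is purely illustrative.

```python
# Sketch of path discovery on a tag graph with the Floyd-Warshall
# algorithm [2]. The graph (one user, one tag, two documents) and its
# edges are illustrative only.
INF = float("inf")

def floyd_warshall(nodes, edges):
    """All-pairs shortest hop distances over an undirected graph."""
    dist = {u: {v: 0 if u == v else INF for v in nodes} for u in nodes}
    for u, v in edges:                      # undirected tag-graph edges
        dist[u][v] = dist[v][u] = 1
    for k in nodes:                         # triple loop: the O(n^3) cost
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

nodes = ["u1", "t1", "d1", "d2"]
edges = [("u1", "t1"), ("u1", "d1"), ("t1", "d1"), ("t1", "d2"), ("u1", "d2")]
dist = floyd_warshall(nodes, edges)
```

Here d1 and d2 share the tag t1, so their hop distance is 2 even though they are not directly connected, matching the iterative scheme described above.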
IV. ALGORITHMS

Numerous algorithms have been studied to support the various types of services in collaborative tagging systems, and this is still a very active research area. In this section, we focus on the core algorithms which can support our service classification as shown in Table I.

A. Latent Semantic Indexing

Latent Semantic Indexing (hereafter LSI for short) has been widely used for indexing Web pages and library documents, and it serves as one of the most popular search algorithms based on the vector space model. The LSI algorithm can also be used in folksonomies as a search engine supporting the Type-I service in the vector space model. Using the tag-object matrix collected in the system as input, the LSI algorithm can help recover the underlying, or latent, structures of a folksonomy, which are often obscured by noisy data, and can find the true relationships between tags and objects, free of noise, on the basis of statistical information.

The core idea of the LSI algorithm is that the dimension of the raw, untreated tag-object matrix is usually too high to reveal the concise relationships between tags and objects, so the dimension should be reduced to recover the latent structures of the input matrix. The algorithm therefore projects the tag-object matrix A = [aij] ∈ Rq×n from the n-dimensional space onto a lower-dimensional space of dimension d, with d ≪ n, in order to remove "noisy" information and recover the true relationships. In this sense, the LSI algorithm can be considered a dimension reduction algorithm from n dimensions to d dimensions.

TABLE II
EQUATIONS USED TO MEASURE WEIGHTS AND SIMILARITIES. SLIGHTLY MODIFIED FROM THE ORIGINAL EQUATIONS.

Abbr              Name                Definition
TF (tfij)         Term Frequency      The number of occurrences of tag ti on document dj
DF (dfi)          Document Frequency  The number of documents having the tag ti
TF-IDF (tfidfij)  TF-Inverse DF       tfij × log(n/dfi), where n is the total number of documents
COS(di, dj)       Cosine              Σk wik wjk / (√(Σk wik²) √(Σk wjk²))
JAC(di, dj)       Jaccard             Σk wik wjk / (Σk wik² + Σk wjk² − Σk wik wjk)
PEA(di, dj)       Pearson             (Σk wik wjk − (1/q) Σk wik Σk wjk) / (√(Σk wik² − (1/q)(Σk wik)²) √(Σk wjk² − (1/q)(Σk wjk)²))

TABLE III
DATA SETS USED IN OUR EXPERIMENTS

Data Set         Documents  Tags  Remarks
MSI-CIEC portal  92         178   In-house system
Connotea         1131       6071  Harvested from Connotea
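The TF-IDF weight and two of the Table II similarity measures can be rendered directly from their definitions. This is a minimal pure-Python sketch, not the system's actual implementation.

```python
# Minimal pure-Python rendering of the TF-IDF weight and two Table II
# similarity measures; a sketch from the definitions, not the system's
# actual implementation.
import math

def tf_idf(tf, df, n):
    """TF-IDF weight: tfij * log(n / dfi), per Table II."""
    return tf * math.log(n / df)

def cosine(wi, wj):
    """Cosine similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(wi, wj))
    return dot / (math.sqrt(sum(a * a for a in wi)) *
                  math.sqrt(sum(b * b for b in wj)))

def jaccard(wi, wj):
    """Extended Jaccard similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(wi, wj))
    return dot / (sum(a * a for a in wi) + sum(b * b for b in wj) - dot)
```

Identical vectors score 1.0 under both similarity measures, and orthogonal vectors (no shared tags) score 0.0.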
1) Preprocedure: The LSI algorithm finds the best projection of the input tag-object matrix A onto a lower-dimensional, or latent, space by using the Singular Value Decomposition (SVD). First, compute the decomposition of the input matrix A,

A = U Σ V^T    (1)

where U and V are orthogonal matrices (i.e., U U^T = V^T V = I) and Σ is a diagonal matrix of n singular values, Σ = diag(σ1, ..., σn) with σ1 ≥ σ2 ≥ ... ≥ σn ≥ 0. Choose a target lower dimension d with d ≪ n, and define a reduced diagonal matrix Σ̂ from Σ by removing σd+1, ..., σn, so that Σ̂ = diag(σ1, ..., σd). Similarly, compute Û and V̂ by removing the (d+1)th through nth columns of U and V. Then V̂ gives the new object coordinates in the reduced space, and the new matrix in the lower dimension is

Â = Û Σ̂ V̂^T    (2)

where Â is the best approximation of the matrix A, in the sense that the 2-norm difference δ = ||A − Â||2 is minimized.
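Assuming NumPy is available, the decomposition and truncation steps of equations (1) and (2) can be sketched as follows; the 4×3 input matrix is a toy stand-in for the tag-object matrix of Figure 2, filled with made-up counts.

```python
# Sketch of equations (1) and (2), assuming NumPy is available. The 4x3
# matrix below is a toy stand-in for the tag-object matrix of Figure 2,
# filled with made-up counts.
import numpy as np

A = np.array([[2.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # eq. (1): A = U Sigma V^T

d = 2                                             # target dimension, d << n
U_d, s_d, Vt_d = U[:, :d], s[:d], Vt[:d, :]       # drop sigma_{d+1}, ..., sigma_n
A_hat = U_d @ np.diag(s_d) @ Vt_d                 # eq. (2): best rank-d approximation

# Rows of Vt_d.T are the new object coordinates in the reduced space.
err = np.linalg.norm(A - A_hat, 2)                # the minimized 2-norm difference
```

By the Eckart-Young theorem, the 2-norm error ||A − Â||2 equals the first discarded singular value, σd+1.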
2) Queries: A query q is given as a vector of tags, q = (q1, ..., qt). Using the reduced matrices above, compute q̂,

q̂ = q^T Û (Σ̂)^−1    (3)

and compare q̂ with each document (i.e., each row of V̂) in the reduced space by measuring similarity. The objects with the highest similarities are the answers to the query.

B. FolkRank

Inspired by the PageRank algorithm, which exploits the network structure of Web pages, the FolkRank algorithm has been developed as a folksonomy search engine based on the graph model. The FolkRank algorithm can be used to provide the Type-I service with the graph model.

The FolkRank algorithm uses the weight-spreading approach, the same strategy used by the PageRank algorithm. The intuition is that popularly tagged objects receive more and more weight from their neighboring objects. The difference from PageRank, however, is that weights spread through the undirected edges of the tag graph, which is characteristic of the folksonomy graph model, whereas weight spreading is directed in PageRank.

The FolkRank algorithm works as follows. First, build a tripartite graph from the folksonomy data. Second, spread weights iteratively by using the following equation,

w = dAw + (1 − d)p    (4)

where w is the weight vector over the nodes, A is the (normalized) adjacency matrix of the tag graph, p is a preference vector, and d is a damping factor.

C. Clustering

To be added... (k-means and Deterministic Annealing)

V. EXPERIMENTS

For the experiments in this paper, we used two sets of folksonomy data: one from our in-house collaborative tagging system, called the MSI-CIEC portal, which is currently under development, and the other harvested from Connotea, one of the well-known folksonomy systems. The Connotea data was obtained in January 2008 and covers only approximately 1000 documents from the most-popular-documents list, together with their related tags. The data used in the experiments is summarized in Table III.

A. Latent Semantic Indexing

To be added...

B. Clustering

To be added...

VI. CONCLUSION

REFERENCES

[1] K. Boyack, R. Klavans, and K. Börner, "Mapping the backbone of science," Scientometrics, vol. 64, no. 3, pp. 351–374, 2005.
[2] R. Floyd, "Algorithm 97: Shortest path," Communications of the ACM, vol. 5, no. 6, p. 345, 1962.