Design Trade-Offs for Search Engine Caching

RICARDO BAEZA-YATES, ARISTIDES GIONIS, FLAVIO P. JUNQUEIRA, VANESSA MURDOCK, and VASSILIS PLACHOURAS
Yahoo! Research
and
FABRIZIO SILVESTRI
ISTI – CNR

In this article we study the trade-offs in designing efficient caching systems for Web search engines. We explore the impact of different approaches, such as static vs. dynamic caching, and caching query results vs. caching posting lists. Using a query log spanning a whole year, we explore the limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates than caching query answers. We propose a new algorithm for static caching of posting lists, which outperforms previous methods. We also study the problem of finding the optimal way to split the static cache between answers and posting lists. Finally, we measure how the changes in the query log influence the effectiveness of static caching, given our observation that the distribution of the queries changes slowly over time. Our results and observations are applicable to different levels of the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process; H.3.4 [Information Storage and Retrieval]: Systems and Software—Distributed systems, performance evaluation (efficiency and effectiveness)

General Terms: Algorithms, Design

Additional Key Words and Phrases: Caching, Web search, query logs

ACM Reference Format: Baeza-Yates, R., Gionis, A., Junqueira, F. P., Murdock, V., Plachouras, V., and Silvestri, F. 2008. Design trade-offs for search engine caching. ACM Trans. Web, 2, 4, Article 20 (October 2008), 28 pages.
DOI = 10.1145/1409220.1409223 http://doi.acm.org/10.1145/1409220.1409223

This article is an expanded version of an article that previously appeared in Proceedings of the 30th Annual ACM Conference on Research and Development in Information Retrieval, 183–190.

Authors’ addresses: R. Baeza-Yates, A. Gionis, F. P. Junqueira, V. Murdock, and V. Plachouras, Yahoo! Research Barcelona, Avda. Diagonal 177, 8th floor, 08018, Barcelona, Spain; email: email@example.com, firstname.lastname@example.org, email@example.com, firstname.lastname@example.org, email@example.com; F. Silvestri, Istituto ISTI A. Faedo, Consiglio Nazionale delle Ricerche (CNR), via Moruzzi 1, I-56100, Pisa, Italy; email: firstname.lastname@example.org.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or email@example.com.

© 2008 ACM 1559-1131/2008/10-ART20 $5.00

ACM Transactions on the Web, Vol. 2, No. 4, Article 20, Publication date: October 2008.

1. INTRODUCTION

Millions of queries are submitted daily to Web search engines, and users have high expectations of the quality of results and the latency to receive them.
As the searchable Web becomes larger, with more than 20 billion pages to index, evaluating a single query requires processing large amounts of data. In such a setting, using a cache is crucial to reducing response time and to increasing the response throughput.

The primary use of a cache memory is to speed up computation by exploiting patterns present in query streams. Since access to primary memory (RAM) is orders of magnitude faster than access to secondary memory (disk), the average latency drops significantly with the use of a cache. A secondary, yet important, goal is reducing the workload to back-end servers. If the hit rate is x, then the back-end servers receive only a fraction 1 − x of the original query traffic.

Caching can be applied at different levels with increasing response latencies or processing requirements. For example, the different levels may correspond to the main memory, the disk, or resources in a local or a wide area network.

The decision of what to cache can be taken either off-line (static) or online (dynamic). A static cache is usually based on historical information and is subject to periodic updates. A dynamic cache keeps objects stored in its limited number of entries according to the sequence of requests. When a new request arrives, the cache system decides whether to evict some entry from the cache in the case of a cache miss. Such online decisions are based on a cache policy, and several different policies have been studied in the past.

For a search engine, there are two possible ways to use a cache memory:

Caching answers. As the engine returns answers to a particular query, it may decide to store these partial answers (say, top-K results) to resolve future queries.

Caching terms. As the engine evaluates a particular query, it may decide to store in memory the posting lists of the involved query terms.
Often the whole set of posting lists does not fit in memory, and consequently, the engine has to select a small set to keep in memory to speed up query processing.

Returning an answer to a query that already exists in the cache is more efficient than computing the answer using cached posting lists. On the other hand, a cached posting list can be used to process any query with the corresponding term, implying a higher hit rate for cached posting lists.

Caching of posting lists has additional challenges. As posting lists have variable size, caching them dynamically is not very efficient, both because of the time and space complexity involved and because of the skewed distribution of the query stream, as shown later. Static caching of posting lists poses even more challenges: when deciding which terms to cache, one faces the trade-off between frequently queried terms and terms with small posting lists that are space efficient. Finally, before deciding to adopt a static caching policy, the query stream should be analyzed to verify that its characteristics do not change rapidly over time.

Fig. 1. One caching level in a distributed search architecture.

In this article we explore trade-offs in the design of each cache level, showing that the problem is the same at each level, and only a few parameters change. In general, we assume that each level of caching in a distributed search architecture is similar to that shown in Figure 1. We mainly use a query log from Yahoo! UK, spanning a whole year, to explore the limitations of dynamically caching query answers or posting lists for query terms, and in some cases, we use a query log from the TodoCL search engine to validate our results. We observe that caching posting lists can achieve higher hit rates than caching query answers.
We propose new algorithms for the static caching of posting lists for query terms, showing that the static caching of query terms is more effective than dynamic caching with LRU or LFU policies. We provide an analysis of the trade-offs between static caching of query answers and of query terms. This analysis enables us to obtain the optimal allocation of memory for different types of static caches, for both a particular implementation of a retrieval system and a simple model of a distributed system. Finally, we explore how changes in the query log influence the effectiveness of static caching.

More concretely, our main conclusions are the following:

— Caching query answers results in lower hit ratios compared with caching of posting lists for query terms, but it is faster because there is no need for query evaluation. We provide a framework for the analysis of the trade-off between static caching of query answers and posting lists.
— We evaluate the benefits of keeping compressed postings in the posting list cache. To the best of our knowledge, this is the first time cache entries are kept compressed. We show that compression is worthwhile in real cases, since it results in a lower average response time.
— Static caching of terms can be more effective than dynamic caching with, for example, LRU. We provide algorithms based on the KNAPSACK problem for selecting the posting lists to put in a static cache, and we show improvements over previous work, achieving a hit ratio over 90%.
— Changes in the query distribution over time have little impact on static caching.

This article is an extended version of the one presented at ACM SIGIR 2007 [Baeza-Yates et al. 2007], making the following additional contributions:

— In addition to the Yahoo! UK log and the UK 2006 document collection, we use a query log and a document collection from the TodoCL search engine to validate some of our results.
— We present results that show that a mixed policy of combining static and dynamic cache for the problem of caching posting lists performs better than either static or dynamic caching alone.
— We present results from experiments using a real system that validates our computational model.

The remainder of this article is organized as follows. Sections 2 and 3 summarize related work and characterize the data sets we use. Section 4 discusses the limitations of dynamic caching. Sections 5 and 6 introduce algorithms for caching posting lists, and a theoretical framework for the analysis of static caching, respectively. Section 7 discusses the impact of changes in the query distribution on static caching, and Section 8 provides our concluding remarks.

2. RELATED WORK

Caching is a useful technique for Web systems that are accessed by a large number of users. It enables a shorter average response time, it reduces the workload on back-end servers, and it reduces the overall amount of utilized bandwidth. In a Web system, both clients and servers can cache items. Browsers cache Web objects on the client side, whereas servers cache precomputed answers or partial data used in the computation of new answers. A third possibility, although of less interest to this article, is to use proxies to mediate the communication between clients and servers, storing frequently requested objects [Podlipnig and Boszormenyi 2003].

Query logs constitute a valuable source of information for evaluating the effectiveness of caching systems. Silverstein et al. analyze a large query log of the AltaVista search engine containing about a billion queries submitted over more than a month. Tests conducted include the analysis of the query sessions for each user, and of the correlations among the terms of the queries.
Similarly to other work, their results show that the majority of the users (in this case about 85%) visit the first page of results only. They also show that 77% of the sessions end after the first query. Jansen et al. conduct a similar analysis, obtaining results similar to the previous study. They conclude that while IR systems and Web search engines are similar in their features, users of the latter are very different from users of IR systems. Jansen and Spink present a thorough analysis of search engine user behavior. Besides analyzing the distribution of page-views, number of terms, number of queries, and so forth, they show a topical classification of the submitted queries, pointing out how users interact with their preferred search engine. Beitzel et al. analyze a very large Web query log containing queries submitted by a population of tens of millions of users searching the Web through AOL. They partition the query log into groups of queries submitted during different hours of the day. The analysis highlights the changes in popularity and uniqueness of topically categorized queries within the different groups.

While there are several studies analyzing query logs for different purposes, just a few consider caching for search engines. This might be due to the difficulty in showing the effectiveness without having a real system available for testing. As noted by Xie and O’Hallaron and confirmed by our analysis, many popular queries are shared by different users. This level of sharing justifies the choice of a server-side caching system for Web search engines. In one of the first published works on exploiting user query history, Raghavan and Sever propose using a query base built upon a set of persistent “optimal” queries submitted in the past, to improve the retrieval effectiveness for similar future queries.
Markatos shows the existence of temporal locality in queries, and compares the performance of different variants of the LRU policy, using hit ratio as a metric. According to his analysis, static caching is very effective if employed on very small caches (50 Mbytes), but gracefully degrades as the cache size increases. Based on the observations of Markatos, Lempel and Moran propose a new caching policy, called probabilistic driven caching (PDC), which attempts to estimate the probability distribution of all possible queries submitted to a search engine. PDC is the first policy to adopt prefetching in anticipation of user requests. To this end, PDC exploits a model of user behavior, where a user session starts with a query for the first page of results, and can proceed with one or more follow-up queries (i.e., queries requesting successive pages of results). When no follow-up queries are received within τ seconds, the session is considered finished. Fagni et al. follow Markatos’ work by showing that combining static and dynamic caching policies, together with an adaptive prefetching policy, achieves a high hit ratio. In their experiments, they observe that devoting a large fraction of entries to static caching, along with prefetching, obtains the best hit ratio. Baeza-Yates et al. introduce a caching mechanism for query answers where the cache memory is split in two parts. The first part is used to cache results of queries that are likely to be repeated in the future, and the second part is used to cache all other queries. The decision to cache the query results in the first or the second part depends on features of the query, such as its past frequency or its length (in tokens or characters).

One of the main issues with the design of a server-side cache is the amount of memory resources usually available on servers. Tsegay et al. consider caching of pruned posting lists in a setting where query evaluation terminates when the set of top ranked documents does not change by processing more postings. Zhang et al. study caching of blocks of compressed posting lists using several dynamic caching algorithms, and find that evicting from memory the least frequently used blocks of posting lists performs very well in terms of hit ratio.

Our static caching algorithm for posting lists, in Section 5, uses the ratio frequency/size in order to evaluate the goodness of an item to cache. Similar ideas have been used in the context of file caching [Young 2002], Web caching [Cao and Irani 1997], and even caching of posting lists [Long and Suel 2005], but in all cases in a dynamic setting. To the best of our knowledge we are the first to use this approach for static caching of posting lists.

Since systems are often hierarchical, multilevel caching architectures have also been proposed. Saraiva et al. propose a new architecture for Web search engines using a two-level dynamic caching system. Their goal for such systems has been to improve response time for hierarchical engines. In their architecture, both levels use an LRU eviction policy. They find that the second-level cache can effectively reduce disk traffic, thus increasing the overall throughput. Baeza-Yates and Saint-Jean propose a three-level index organization with a frequency based posting list static cache. Long and Suel propose a caching system structured according to three different levels. The intermediate level contains frequently occurring pairs of terms and stores the intersections of the corresponding inverted lists. The last two studies are related to ours in that they exploit different caching strategies at different levels of the memory hierarchy.

There is a large body of work devoted to query optimization.
Buckley and Lewit, in one of the earliest works, take a term-at-a-time approach to decide when inverted lists need not be further examined. More recent examples demonstrate that the top k documents for a query can be returned without the need for evaluating the complete set of posting lists [Anh and Moffat 2006; Büttcher and Clarke 2006; Strohman et al. 2005; Ntoulas and Cho 2007]. Although these approaches seek to improve query processing efficiency, they differ from our current work in that they do not consider caching. They may be considered separate and complementary to a cache-based approach.

3. DATA CHARACTERIZATION

Our main dataset consists of a crawl of documents from the UK domain, and the logs for one year of queries submitted to http://www.yahoo.co.uk from November 2005 to November 2006. To further validate our results, we use a second dataset consisting of a crawl of documents indexed by the TodoCL search engine¹ from 2003, with queries submitted to the search engine from May to November, 2003.

The document collection from the UK is a summary of the UK domain crawled in May 2006 [Boldi et al. 2004; Castillo et al. 2006].² This summary corresponds to a maximum of 400 crawled documents per host, using a breadth-first crawling strategy, comprising 15GB. The distribution of document frequencies of terms in the UK collection follows a power law distribution with parameter 1.24.³ The corpus statistics for the Chile data are comparable to those for the UK Summary collection. The distribution of document frequencies for every term in the Chile corpus follows a power law of parameter 1.10. The statistics for both collections are shown in Table I.

¹ http://www.todocl.cl visited July 2008.
² The collection is available from the University of Milan: http://law.dsi.unimi.it/.
³ In this article we use power laws to fit the data in the main part of the distribution, since in general, the power law does not fit well across the two extremes.

Table I. Statistics of the Document Collections

                              UK-2006 Sample     Chile Sample
# of documents                2,786,391          3,110,605
# of terms                    6,491,374          3,894,893
# of postings                 773,440,986        529,599,712
# of tokens                   2,109,512,558      1,578,821,207
Inverted file size (bytes)    1,189,266,893      1,004,086,805

With respect to our query-log datasets, in a year of queries to the UK search engine, 50% of the total volume of queries are unique. The average query length is 2.5 terms, with the longest query having hundreds of terms. Figure 2(a) shows the distributions of queries and query terms for a sample of the query logs from yahoo.co.uk for part of a year. The x-axis represents the normalized frequency rank of the query or term, that is, the most frequent query appears closest to the y-axis. The y-axis is the normalized frequency for a given query (or term). As expected, the distributions of query frequencies and query term frequencies shown in this graph follow power laws, with parameters of 0.83 and 1.06, respectively. In this figure, the queries and terms were normalized for case and white space.

The Chile query log resembles the UK query log: 60% of the total volume of queries are unique queries, and 80% of the unique queries are singleton queries, that is, queries that appear only once in the logs. The average query length was 2.63 terms, the longest being 73 terms. Figure 2(b) shows the query and term distributions for the Chile query log. The queries were normalized for case and whitespace. The query and term distributions follow a power law, with parameters of 0.62 and 0.88, respectively.
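The article does not state which procedure was used to obtain the power-law parameters quoted above. As a minimal sketch, and assuming a plain least-squares line fit in log-log space, an exponent of this kind could be estimated as follows. The function name `fit_power_law_exponent`, the trimming parameters `skip_head` and `skip_tail` (which mimic fitting only the main part of the distribution), and the toy data are all hypothetical. Least-squares fitting on log-log data is a simple choice and is known to be sensitive to noise in the tail, so treat this as an illustration, not the authors' method.

```python
import math


def fit_power_law_exponent(frequencies, skip_head=0, skip_tail=0):
    """Estimate alpha in frequency(rank) ~ C * rank**(-alpha).

    The estimate is the negated least-squares slope of log(frequency)
    against log(rank). skip_head and skip_tail drop that many points from
    the two extremes of the ranking (hypothetical trimming parameters).
    """
    ranked = sorted((f for f in frequencies if f > 0), reverse=True)
    points = [(math.log(rank), math.log(freq))
              for rank, freq in enumerate(ranked, start=1)]
    points = points[skip_head:len(points) - skip_tail]
    if len(points) < 2:
        raise ValueError("need at least two points to fit a line")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


# Toy data (not from the article): frequencies that decay as rank**-0.83.
toy_frequencies = [1000.0 * rank ** -0.83 for rank in range(1, 501)]
alpha = fit_power_law_exponent(toy_frequencies, skip_head=5, skip_tail=50)
print(round(alpha, 2))
```

On real query logs the head and the long tail of singletons usually deviate from a straight line, which is why the sketch exposes the two trimming parameters.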
Finally, we computed the correlation between the document frequency of terms in the UK collection, and the number of queries to yahoo.co.uk that contain a particular term in the query log, to be 0.42. The correlation between the document frequency of terms in the collection indexed by TodoCL, and the number of queries to the TodoCL search engine containing a particular term, is only 0.29. A scatter plot for a random sample of terms for both data sets is shown in Figure 3. In this experiment, terms have been converted to lower case in both the queries and the documents so that the frequencies will be comparable.

Fig. 2. The distribution of queries, query terms, and document terms in the UK dataset (a) and the Chile dataset (b). The curves are shown for a large subset of queries for part of a year. The y-axis has been normalized for each distribution. The x-axis has been normalized by the rank, so that the most frequent term is closest to the y-axis.

4. CACHING OF QUERIES AND TERMS

Caching relies upon the assumption that there is locality in the stream of requests. That is, there must be sufficient repetition in the stream of requests, within intervals of time that enable a cache memory of reasonable size to be effective. In the UK query log, 88% of the unique queries are singleton queries, and singleton queries account for 44% of the whole query volume. Thus, out of all queries in the stream composing the query log, the upper threshold on hit ratio is 56%. This is because only 56% of all the queries comprise queries that have multiple occurrences. It is important to observe, however, that not all queries in this 56% can be cache hits because of compulsory misses. A compulsory miss happens when the cache receives a query for the first time. This is different from capacity misses, which happen due to space constraints on the amount of memory the cache uses. If we consider a cache with infinite memory, then the hit ratio is 50% because, as mentioned in the previous section, unique queries are 50% of the total query volume. Note that for an infinite cache there are no capacity misses.

As we mentioned before, another possibility is to cache the posting lists of terms. Intuitively, this gives more freedom in the utilization of the cache content to respond to queries, because cached terms might form a new query. On the other hand, they need more space.

As opposed to queries, the fraction of singleton terms in the total volume of terms is smaller. In the UK query log, only 4% of the terms appear once, but this accounts for 73% of the vocabulary of query terms. We show in Section 5 that caching a small fraction of terms, while accounting for terms appearing in many documents, is potentially very effective.

Fig. 3. Normalized scatter plot of document-term frequencies vs. query-term frequencies for the UK collection, and queries to yahoo.co.uk (left), and the same for the TodoCL data (right).

Fig. 4. Arrival rate of queries and terms and estimated workload for the UK log.

Figure 4(a) shows several curves corresponding to the normalized arrival rate of queries in the UK log for different cases using days as bins. That is, we plot the normalized number of elements that appear in a day. This graph shows only a period of 122 days, and we normalize the values by the maximum value observed throughout the whole period of the query log. The curves labeled “total queries” and “total terms” correspond to the total volume of queries and terms, respectively. “Unique queries” and “unique terms” correspond to the arrival rate of unique queries and terms. Finally, “query diff” and “terms diff” correspond to the difference between the curves for total and unique.
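The per-bin curves described above can be sketched in a few lines. The following is only an illustration under stated assumptions: the log is taken to be a sequence of hypothetical `(bin_id, query_string)` pairs, "unique" is interpreted as unique within a bin, and each curve is normalized by its own maximum. The article does not spell out these details, and the function names `arrival_rates` and `normalize_by_max` are invented for this example.

```python
from collections import defaultdict


def arrival_rates(log):
    """Count total, unique, and repeated queries and terms per bin.

    log is an iterable of (bin_id, query_string) pairs, where bin_id is,
    for example, the day on which the query arrived.
    """
    queries = defaultdict(list)
    for bin_id, query in log:
        # Normalize case and whitespace before counting.
        queries[bin_id].append(" ".join(query.lower().split()))

    rates = {}
    for bin_id, bin_queries in queries.items():
        terms = [term for q in bin_queries for term in q.split()]
        rates[bin_id] = {
            "total queries": len(bin_queries),
            "unique queries": len(set(bin_queries)),
            "query diff": len(bin_queries) - len(set(bin_queries)),
            "total terms": len(terms),
            "unique terms": len(set(terms)),
            "terms diff": len(terms) - len(set(terms)),
        }
    return rates


def normalize_by_max(rates, curve):
    """Scale one curve so that its largest bin value becomes 1.0."""
    peak = max(bin_rates[curve] for bin_rates in rates.values())
    if peak == 0:
        return {bin_id: 0.0 for bin_id in rates}
    return {bin_id: bin_rates[curve] / peak
            for bin_id, bin_rates in rates.items()}


# Toy log (hypothetical data): two days of queries.
toy_log = [(1, "cheap flights"), (1, "Cheap  Flights"),
           (1, "flights london"), (2, "weather")]
rates = arrival_rates(toy_log)
print(rates[1]["query diff"], rates[1]["terms diff"])
```

In the toy log, day 1 has one repeated query but three repeated term occurrences, a small-scale instance of terms repeating more than queries.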
In Figure 4(a), as expected, the volume of terms is much higher than the volume of queries. The difference between the total number of terms and the number of unique terms is much larger than the difference between the total number of queries and the number of unique queries. This observation implies that terms repeat significantly more than queries. If we use smaller bins, say of one hour, then the ratio of unique to volume is higher for both terms and queries, because it leaves less room for repetition.

We also estimated the workload using the document frequency of terms as a measure of how much work a query imposes on a search engine. We found that it closely follows the arrival rate for terms shown in Figure 4(a). In more detail, Figure 4(b) plots the sum of the length of the posting lists associated with terms in each bin, normalized by the average workload. Since the absolute workload values are substantially higher, normalizing is necessary to make the graph comparable to the others for total, unique, and difference in Figure 4(a). We then normalize it a second time using the same procedure as for the curves in Figure 4(a). The main observation in this graph is that the workload closely follows the arrival rate for terms. The graph would not have such a shape if, for example, the terms in queries in periods of high activity had, on average, shorter posting lists.

To demonstrate the effect of a dynamic cache on the query frequency distribution of Figure 2(a), we plot the same frequency graph in Figure 5, but now considering the frequency of queries after going through an LRU cache. On a cache miss, an LRU cache decides upon an entry to evict, using the information on the recency of queries. In this graph, the most frequent queries are not the same queries that were most frequent before the cache.
It is possible that queries that are most frequent after the cache have different characteristics, and tuning the search engine to queries that were frequent before the cache may degrade performance for non-cached queries. The maximum frequency after caching is less than 1% of the maximum frequency before the cache, thus showing that the cache is very effective in reducing the load of frequent queries. If we rerank the queries according to after-cache frequency, the distribution is still a power law, but with a much smaller value for the highest frequency.

When discussing the effectiveness of caching dynamically, an important metric is cache miss rate. To analyze the cache miss rate for different memory constraints, we use the working set model [Denning 1980; Slutz and Traiger 1974]. A working set, informally, is the set of references that an application or an operating system is currently working with. The model uses such sets in a strategy that tries to capture the temporal locality of references. The working set strategy then consists of keeping in memory only the elements that are referenced in the previous θ steps of the input sequence, where θ is a configurable parameter corresponding to the window size.

Working sets were originally used for page replacement algorithms in operating systems, and considering such a strategy in the context of search engines is interesting for three reasons. First, it captures the amount of locality of queries and terms in a sequence of queries. Locality in this case refers to the frequency of queries and terms in a window of time. If many queries appear multiple times in a window, then locality is high. Second, it enables an offline analysis of the expected miss rate given different memory constraints. Third, working sets capture aspects of efficient caching algorithms, such as LRU.
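The working set strategy described above lends itself to a short offline computation. The sketch below is an illustration, not the authors' code: the function name `working_set_miss_rate` and the toy reference sequence are hypothetical. It assumes that a reference counts as a hit when the same element was last referenced at most θ positions earlier, and that the first reference to any element is always a miss.

```python
def working_set_miss_rate(references, theta):
    """Miss rate of a working-set strategy with window size theta.

    A reference at position t is a hit if the same element was last
    referenced at some position s with t - s <= theta, that is, if it is
    still among the elements referenced in the previous theta steps.
    """
    if theta < 1:
        raise ValueError("theta must be at least 1")
    last_seen = {}  # element -> position of its most recent reference
    hits = 0
    length = 0
    for position, element in enumerate(references):
        previous = last_seen.get(element)
        if previous is not None and position - previous <= theta:
            hits += 1
        last_seen[element] = position
        length += 1
    if length == 0:
        raise ValueError("empty reference sequence")
    return 1.0 - hits / length


# Toy reference sequence (hypothetical); elements could be queries or terms.
toy_references = ["a", "b", "a", "c", "a", "b"]
for theta in (1, 2, 4):
    print(theta, working_set_miss_rate(toy_references, theta))
```

Because only the distance to the previous occurrence matters, one pass over the log with a dictionary suffices for each window size, which is what makes an offline analysis over a year-long log practical.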
LRU assumes that references further in the past are less likely to be referenced in the present, which is implicit in the concept of working sets [Slutz and Traiger 1974].

Fig. 5. Frequency graph after LRU cache.

We now characterize the working set model more formally. Following the model of Slutz and Traiger, let $\rho_k$ denote a finite reference sequence of elements, elements being either queries or terms, where $r(t)$ evaluates to the element at position $t$ of this sequence, and $k$ is the length of the sequence. A working set for $\rho_k$ is as follows:

Definition 4.1. The working set at time $t$ is the distinct set of elements among $r(t - \theta + 1) \ldots r(t)$.

The function $c_k(x)$, used to compute the miss rate, is defined as follows:

Definition 4.2. $c_k(x)$, $1 \le x \le k$, is the number of occurrences of $x_t = x$ in $\rho_k$, where $x_t$ is the number of elements since the last reference to $r(t)$.

We define the miss rate as:

$$m(\theta) = 1 - \frac{1}{k} \sum_{x=1}^{\theta} c_k(x). \tag{1}$$

Fig. 6. Miss rate as a function of the working set size and distribution of distances.

Figure 6(a) plots the miss rate for different working set sizes, and we consider working sets of both queries and terms. The working set sizes are normalized against the total number of queries in the query log. In the graph for queries, there is a sharp decay until approximately 0.01, and the miss rate drops more slowly as we increase the size of the working set beyond 0.01. Finally, the minimum value it reaches is 50% miss rate, not shown in the figure, since we have cut the tail of the curve for presentation purposes.

For the sequence of terms that we use to plot the term curve in the figure, we have not considered all the terms in the log. Instead, we use the same number of queries we use for the query graph, taken from the head of the query log. Compared with the query curve, we observe that the minimum miss rate for terms is substantially smaller. The miss rate also drops sharply on values up to 0.01, and it decreases minimally for higher values. The minimum value, however, is slightly over 10%, which is much smaller than the minimum value for the sequence of queries. This implies that, with such a policy, it is possible to achieve over 80% hit rate, if we consider caching dynamically posting lists for terms as opposed to caching answers for queries. This result does not consider the space required for each unit stored in the cache memory, or the amount of time it takes to put together a response to a user query. We analyze these issues more carefully later in this article.

It is interesting also to observe the histogram of Figure 6(b), which is an intermediate step in the computation of the miss rate graph. It reports the distribution of distances between repetitions of the same frequent query. The distance in the plot is measured in the number of distinct queries separating a query and its repetition, and it considers only queries appearing at least 10 times. For example, the distance between repetitions of the query $q$ in the query stream $q, q_1, q_2, q_2, q_1, q_3, q$ is three.

From Figures 6(a) and 6(b), we conclude that even if we set the size of the query answers cache to a relatively large number of entries, the miss rate is high. Thus caching the posting lists of terms has the potential to improve the hit ratio. This is what we explore next.

5. CACHING POSTING LISTS

The previous section shows that caching posting lists can obtain a higher hit rate as compared with caching query answers. In this section, we study the problem of how to select posting lists to place in a certain amount of available memory, assuming that the whole index is larger than the amount of memory available.
The posting lists have variable size (in fact, their size distribution follows a power law), so it is beneficial for a caching policy to consider the sizes of the posting lists. In Section 5.1, we describe a new algorithm for caching posting lists statically. We compare our algorithm with a static-cache algorithm that considers only query frequency statistics, as well as with dynamic-cache algorithms, such as LRU, LFU, and a modified dynamic algorithm that takes posting-list size into account. Additionally, in Section 5.2, we discuss a mixed caching policy that considers partitioning the available cache in two parts and using one part as static cache and the other part as dynamic cache.

5.1 Static Caching

Before discussing the static caching strategies, we introduce some notation. We use $f_q(t)$ to denote the query-term frequency of a term $t$, that is, the number of queries containing $t$ in the query log, and $f_d(t)$ to denote the document frequency of $t$, that is, the number of documents in the collection in which the term $t$ appears.

The first strategy we consider is the algorithm proposed by Baeza-Yates and Saint-Jean [2003], which consists in selecting the posting lists of the terms with the highest query-term frequencies $f_q(t)$. We call this algorithm QTF. The QTF algorithm is clearly motivated by filling the cache with terms that appear often in the queries. The query-term frequencies are computed from past query logs, and for the policy to be effective we assume that the query-term frequencies do not change much over time. Later in this article we analyze the impact of the query-log dynamics on static caching.

Next, we describe our suggested static-cache algorithm. Our main observation is that there is a trade-off between $f_q(t)$ and $f_d(t)$.
On the one hand, terms with high $f_q(t)$ are useful to keep in the cache because they are queried often. On the other hand, terms with high $f_d(t)$ are not good candidates because they correspond to long posting lists and consume a substantial amount of space. In fact, the problem of selecting the best posting lists for the static cache corresponds to the standard Knapsack problem: given a knapsack of fixed capacity, and a set of $n$ items, where the $i$-th item has value $c_i$ and size $s_i$, select the set of items that fit in the knapsack and maximize the overall value. In our case, "value" corresponds to $f_q(t)$ and "size" corresponds to $f_d(t)$. Thus we employ a simple algorithm for the knapsack problem, which is selecting the posting lists of the terms with the highest values of the ratio $f_q(t)/f_d(t)$. We call this algorithm QTFDF. We tried other variations considering query frequencies instead of term frequencies, but the gain was minimal relative to the complexity added.

In addition to the above two static algorithms, we consider the following algorithms for dynamic caching:

- LRU: a standard LRU algorithm, but many posting lists might need to be evicted (in order of least-recent usage) until there is enough space in the memory to place the currently accessed posting list.
- LFU: a standard LFU algorithm (eviction of the least-frequently used), with the same modification as the LRU.
- DYN-QTFDF: a dynamic version of the QTFDF algorithm; evict from the cache the term(s) with the lowest $f_q(t)/f_d(t)$ ratio.

The performance of all the above algorithms for the UK and Chile datasets is shown in Figures 7 and 8, respectively. For the results on the UK dataset, we use 15 weeks of the UK query log, and for the Chile dataset, we use 4 months of the Chile query log. Performance is measured with hit rate. The cache size is measured as a fraction of the total space required to store the posting lists of all terms. For the dynamic algorithms, we load the cache with terms in order of $f_q(t)$ and we let the cache "warm up" for 1 million queries. For the static algorithms, we assume complete knowledge of the frequencies $f_q(t)$, that is, we estimate $f_q(t)$ from the whole query stream. As we show in Section 7, the results do not change much if we compute the query-term frequencies using the first 3 or 4 weeks of the query log, and measure the hit rate on the rest.

Fig. 7. Hit rate of different strategies for caching posting lists for the UK dataset.

Fig. 8. Hit rate of different strategies for caching posting lists for the Chile dataset.

Fig. 9. Fraction of terms whose posting lists fit in cache for the two different static algorithms.

The most important observation from our experiments is that the static QTFDF algorithm has a better hit rate than all the dynamic algorithms. An important benefit of a static cache is that it requires no eviction and it is hence more efficient when evaluating queries. However, if the characteristics of the query traffic change frequently over time, then it requires repopulating the cache often, or there will be a significant impact on hit rate.

The difference between the QTFDF and QTF algorithms is illustrated in Figures 9(a) and 9(b), where we show the fraction of terms whose posting lists fit in cache for the two static algorithms. QTF selects terms with high $f_q(t)$ values. However, many of those terms tend to have long posting lists, and as a result, few posting lists fit in cache. On the other hand, QTFDF prefers to select many more (and shorter) posting lists, even though they have smaller $f_q(t)$ values.
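The two static policies of this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `select_static_cache` and `covered_volume`, the toy frequencies, and the choice to skip a posting list that does not fit and keep scanning are all hypothetical, and the size of a posting list is approximated by the document frequency, as in the Knapsack formulation above.

```python
from typing import Dict, List


def select_static_cache(
    fq: Dict[str, int],
    fd: Dict[str, int],
    capacity: int,
    use_ratio: bool = True,
) -> List[str]:
    """Greedily choose terms whose posting lists fit in `capacity` postings.

    use_ratio=True ranks terms by fq/fd (QTFDF); False ranks by fq (QTF).
    """

    def score(term: str) -> float:
        return fq[term] / fd[term] if use_ratio else float(fq[term])

    chosen: List[str] = []
    used = 0
    for term in sorted(fq, key=score, reverse=True):
        size = fd[term]  # assumption: list size equals document frequency
        if used + size <= capacity:  # assumption: skip lists that do not fit
            chosen.append(term)
            used += size
    return chosen


def covered_volume(fq: Dict[str, int], chosen: List[str]) -> int:
    """Total query-term frequency served by the cached posting lists."""
    return sum(fq[t] for t in chosen)


# Hypothetical toy statistics: "the" is frequent in queries but has a long list.
fq = {"the": 60, "search": 40, "caching": 30, "posting": 10}
fd = {"the": 1000, "search": 200, "caching": 20, "posting": 10}

qtf = select_static_cache(fq, fd, capacity=1000, use_ratio=False)
qtfdf = select_static_cache(fq, fd, capacity=1000, use_ratio=True)
print(qtf, covered_volume(fq, qtf))
print(qtfdf, covered_volume(fq, qtfdf))
```

On this toy input, QTF spends the whole budget on one long posting list, whereas QTFDF caches three shorter lists that together serve more query-term occurrences. The numbers are illustrative only and say nothing about the magnitude of the effect on real query logs.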
5.2 Adding Dynamic Cache

In addition to pure static and dynamic caching policies, we also consider a mixed caching policy: given a fixed amount of available cache, partition it in two parts and use one part as static cache and the other part as dynamic cache. We combine the static and dynamic caching policies described in the previous section, namely the QTFDF algorithm for static caching and the LRU algorithm for dynamic caching. The motivation behind considering such a mixed policy is to leverage the good performance of static caching, but at the same time to employ dynamic caching in order to handle temporal correlations and bursts in the query log stream.

Figure 10 presents the results of our experiment, which was performed using 15 weeks of the UK query log. Given a fixed amount of memory for caching posting lists, we allocate an $\alpha$ fraction of the memory for the QTFDF policy and the rest for the LRU policy. We tried $\alpha$ = 0.1, 0.25, 0.5, 0.75 and 0.9. Like the results presented in Fagni et al. [2006], our mixed static/dynamic strategy has led to an improvement in the hit ratio of the cache. The improvement is more significant for the smaller sizes of the cache; as the cache size increases, the performance of the QTFDF algorithm matches the performance of the mixed policy. Also, as Figure 10 shows, the best performance is achieved for $\alpha$ = 0.9, that is, allocating the largest part of the cache for the static policy.

Fig. 10. The effect of adding dynamic cache to the QTFDF algorithm.

6. ANALYSIS OF STATIC CACHING

In this section, we provide a detailed analysis for the problem of deciding whether it is preferable to cache query answers or cache posting lists. Our analysis takes into account the impact of caching between two levels of the data-access hierarchy.
It can either be applied at the memory/disk layer or at a server/remote server layer, as in the architecture discussed in the introduction. Using a particular system model, we obtain estimates for the parameters required by our analysis, which we subsequently use to decide the optimal trade-off between caching query answers and caching posting lists. To validate the optimal trade-off, we run an implementation of the system with a cache of query answers and a cache of posting lists.

6.1 Analytical Model

Let $M$ be the size of the cache measured in answer units, that is, assume that the cache can store $M$ query answers. For the sake of simplicity, assume that all posting lists are of the same length $L$, measured in answer units. We consider the following two cases: (1) a cache that stores only precomputed answers, and (2) a cache that stores only posting lists. In the first case, $N_c = M$ answers fit in the cache, while in the second case $N_p = M/L$ posting lists fit in the cache. Thus $N_p = N_c/L$. Note that although posting lists require more space, we can combine terms to evaluate more queries (or partial queries).

For case (1), suppose that a query answer in the cache can be evaluated in one time unit. For case (2), assume that if the posting lists of the terms of a query are in the cache, then the results can be computed in $T_{R1}$ time units, while if the posting lists are not in the cache, then the results can be computed in $T_{R2}$ time units. Of course we have that $T_{R2} > T_{R1}$.

Fig. 11. Cache saturation as a function of size.

Now we want to compare the time to answer a stream of $Q$ queries in both cases. Let us use the QTF algorithm as an approximation to the QTFDF algorithm (in fact, if the correlation between query terms and document terms is 0, this approximation is quite good) and let $V_c(N_c)$ be the volume of the most frequent $N_c$ queries.
Then, for case (1), we have an overall time

\[ T_{CA} = V_c(N_c) + T_{R2}\,(Q - V_c(N_c)). \]

Similarly, for case (2), let $V_p(N_p)$ be the number of computable queries using posting lists of the most frequent $N_p$ terms. Then we have overall time

\[ T_{PL} = T_{R1}\,V_p(N_p) + T_{R2}\,(Q - V_p(N_p)). \]

We want to check under which conditions we have $T_{PL} < T_{CA}$. We have

\[ T_{PL} - T_{CA} = (T_{R2} - 1)\,V_c(N_c) - (T_{R2} - T_{R1})\,V_p(N_p). \tag{2} \]

Figure 11 shows the values of $V_p$ and $V_c$ for the UK query log. We can see that caching answers saturates faster, and for this particular data there is no additional benefit from using more than 10% of the index space for caching answers.

Since the query distribution in practice is finite, $V_c(n)$ will be a fraction, depending on $n$, of the total number of queries $Q$. Now we estimate this fraction. Since the query distribution follows a power law with parameter $0 < \alpha < 1$ in our two data sets, the $i$-th most frequent query appears with probability proportional to $1/i^{\alpha}$. Therefore, the volume $V_c(n)$, which is the total number of the $n$ most frequent queries, is:

\[ V_c(n) = V_0 \sum_{i=1}^{n} \frac{Q}{i^{\alpha}} \approx V_0\, n^{1-\alpha}\, Q, \]

where $V_0 = 1/U^{1-\alpha}$ and $U$ is the number of unique queries in the query stream. We know that $V_p(n)$ grows faster than $V_c(n)$ and we assume, based on experimental results, that the relation is of the form $V_p(n) = k\,V_c(n)^{\beta}$ (see Figure 12).

Fig. 12. Relation of query volumes of precomputed answers $V_c(n)$ and posting lists $V_p(n)$.

In the worst case, for a large cache, $\beta \to 1$. That is, both techniques will cache a constant fraction of the overall query volume. By setting $\beta = 1$, replacing $N_p = N_c/L$, and combining with Equation (2), we obtain the result that caching posting lists makes sense only if the ratio

\[ \rho = \frac{L^{1-\alpha}\,(T_{R2} - 1)}{k\,(T_{R2} - T_{R1})} < 1. \]

Differently from previous works, we also want to evaluate whether caching compressed postings is better than caching plain postings. Caching compressed postings has the benefit of accommodating a greater number of entries (in fact we have $L' < L$) at the price of a greater computational cost, that is, $T'_{R1} > T_{R1}$. The trade-off of caching postings vs. query results is now as follows:

\[ \rho' = \frac{L'^{\,1-\alpha}\,(T'_{R2} - 1)}{k\,(T'_{R2} - T'_{R1})}. \]

That is, $\rho'$ is the ratio for comparing cached answers with cached posting lists when compression is used. Using compression is better if $\rho' < \rho$. In the next section, we show that $L/L'$ is about 3, and according to the experiments that we show later, compression is always better.

For a small cache, we are interested in the transient behavior and then $\beta > 1$, as computed from the UK data (between 2 and 3 as shown in Figure 12). In this case, there will always be a point where $T_{PL} > T_{CA}$ for a large number of queries, and this shows the importance of the real values of $T_R$, which we estimate next.

As we showed in the previous section, instead of filling the cache only with answers or only with posting lists, a better strategy is to divide the total cache space into a cache for answers and a cache for posting lists. In such a case, there will be some queries that could be answered by both parts of the cache, and a good caching technique should try to minimize the intersection of both caches. Finding the optimal division of the cache in order to minimize the overall retrieval time is a difficult problem to solve analytically. In Section 6.3, we use simulations to derive optimal cache trade-offs for particular implementation examples.
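The model of this section is compact enough to evaluate numerically. The sketch below is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, it covers only the $\beta = 1$ case, and the values of $k$, $\alpha$, $Q$, $U$, and $M$ in the usage lines are made up. Only $L = 0.75$, $T_{R1} = 233$, and $T_{R2} = 1760$ are taken from Section 6.2 and Table II (centralized system, full evaluation, uncompressed postings).

```python
def answer_volume(n: float, Q: float, U: float, alpha: float) -> float:
    """V_c(n) ~ V_0 * n**(1 - alpha) * Q, with V_0 = 1 / U**(1 - alpha)."""
    v0 = 1.0 / U ** (1.0 - alpha)
    return v0 * n ** (1.0 - alpha) * Q


def compare_caches(M, L, tr1, tr2, k, alpha, Q, U):
    """Return (T_CA, T_PL, rho) for the beta = 1 case of the model."""
    n_c = M        # number of answers that fit in the cache
    n_p = M / L    # number of posting lists that fit in the cache
    v_c = answer_volume(n_c, Q, U, alpha)
    v_p = k * answer_volume(n_p, Q, U, alpha)  # V_p(n) = k * V_c(n)
    t_ca = v_c + tr2 * (Q - v_c)               # cache of answers only
    t_pl = tr1 * v_p + tr2 * (Q - v_p)         # cache of posting lists only
    rho = L ** (1.0 - alpha) * (tr2 - 1.0) / (k * (tr2 - tr1))
    return t_ca, t_pl, rho


# k, alpha, Q, U and M below are hypothetical values chosen for illustration.
t_ca, t_pl, rho = compare_caches(
    M=1000, L=0.75, tr1=233, tr2=1760, k=2.0, alpha=0.8,
    Q=1_000_000, U=500_000,
)
print(f"T_CA={t_ca:.0f}  T_PL={t_pl:.0f}  rho={rho:.3f}")
```

Under these assumptions, the sign of $T_{PL} - T_{CA}$ from Equation (2) and the test $\rho < 1$ always agree, which is a quick way to check an implementation of the model before plugging in measured parameters.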
6.2 Parameter Estimation

We now use a particular implementation of a centralized system and the model of a distributed system as examples from which we estimate the parameters of the analysis from the previous section. We perform the experiments using an optimized version of Terrier [Ounis et al. 2006], for both indexing documents and processing queries, on a single machine with a Pentium 4 at 2GHz and 1GB of RAM.

We index the documents from the UK-2006 dataset, without removing stop words or applying stemming. The posting lists in the inverted file consist of pairs of document identifier and term frequency. We compress the document identifier gaps using Elias gamma encoding, and the term frequencies in documents using unary encoding [Witten et al. 1994]. The size of the inverted file is 1,189Mb. A stored answer requires 1264 bytes, and an uncompressed posting takes 8 bytes. From Table I, we obtain

\[ L = \frac{8 \cdot \#\text{ of postings}}{1264 \cdot \#\text{ of terms}} = 0.75 \quad \text{and} \quad L' = \frac{\text{inverted file size}}{1264 \cdot \#\text{ of terms}} = 0.26. \]

We estimate the ratio $T_R = T/T_c$ between the average time $T$ it takes to evaluate a query and the average time $T_c$ it takes to return a stored answer for the same query, in the following way. $T_c$ is measured by loading the answers for 100,000 queries in memory, and answering the queries from memory. The average time is $T_c = 0.069$ms. $T$ is measured by processing the same 100,000 queries (the first 10,000 queries are used to warm up the system). For each query, we remove stop words if there are at least three remaining terms. The stop words correspond to the terms with a frequency higher than the number of documents in the index. We use a document-at-a-time approach to retrieve documents containing all query terms. The only disk access required during query processing is for reading compressed posting lists from the inverted file.
We perform both full and partial evaluation of answers, because some queries are likely to retrieve a large number of documents, and only a fraction of the retrieved documents will be seen by users. In the partial evaluation of queries, we terminate the processing after matching 10,000 documents. The estimated ratios $T_R$ are presented in Table II.

Table II. Ratios Between the Average Time to Evaluate a Query and the Average Time to Return Cached Answers (centralized and distributed case)

Centralized system     $T_{R1}$      $T_{R2}$      $T'_{R1}$     $T'_{R2}$
Full evaluation        233           1760          707           1140
Partial evaluation     99            1626          493           798

LAN system             $T^L_{R1}$    $T^L_{R2}$    $T'^L_{R1}$   $T'^L_{R2}$
Full evaluation        242           1769          716           1149
Partial evaluation     108           1635          502           807

WAN system             $T^W_{R1}$    $T^W_{R2}$    $T'^W_{R1}$   $T'^W_{R2}$
Full evaluation        5001          6528          5475          5908
Partial evaluation     4867          6394          5270          5575

Figure 13 shows, for a sample of queries, the workload of the system with partial query evaluation and compressed posting lists. The x-axis corresponds to the total time the system spends processing a particular query, and the vertical axis corresponds to the sum $\sum_{t \in q} f_q \cdot f_d(t)$. Notice that the total number of postings of the query-terms does not necessarily provide an accurate estimate of the workload imposed on the system by a query (which is the case for full evaluation and uncompressed lists).

Fig. 13. Workload for partial query evaluation with compressed posting lists.

The analysis of the previous section also applies to a distributed retrieval system in one or multiple sites. Suppose that a document partitioned distributed system is running on a cluster of machines interconnected through a Local Area Network (LAN) in one site. The broker receives queries and broadcasts them to the query processors, which answer the queries and return the results to the broker.
Finally, the broker merges the received answers and generates the final set of answers (we assume that the time spent on merging results is negligible). The difference between the centralized architecture and the document partition architecture is the extra communication between the broker and the query processors. Using ICMP pings on a 100Mbps LAN, we have observed that sending the query from the broker to the query processors, which send an answer of 4,000 bytes back to the broker, takes on average 0.615ms. Hence $T_R^L = T_R + 0.615\text{ms}/0.069\text{ms} \approx T_R + 9$.

In the case when the broker and the query processors are in different sites connected through a Wide Area Network (WAN), we estimate that broadcasting the query from the broker to the query processors, and getting back an answer of 4,000 bytes, takes on average 329ms. Hence $T_R^W = T_R + 329\text{ms}/0.069\text{ms} \approx T_R + 4768$. We can see that $T^W_{R2}/T^W_{R1} = 1.31 < T_{R2}/T_{R1}$, suggesting that there is greater benefit from storing answers for queries when the retrieval system is distributed across a WAN, since the network communication dominates the response time for such systems. We corroborate this observation next.

6.3 Simulation Results

We now address the problem of finding the optimal trade-off between caching query answers and caching posting lists. To make the problem concrete, we assume a fixed size $M$ on the available memory, out of which $x$ units are used for caching query answers, and $M - x$ for caching posting lists.

Fig. 14. Optimal division of the cache memory in a server.

We perform a simulation and compute the average response time as a function of $x$. Using a part of the query log as training data, we first allocate in the cache the answers to the most frequent queries that fit in space $x$, and then we use the rest of the memory to cache posting lists.
For selecting posting lists, we use the QTFDF algorithm, applied to the training query log but excluding the queries that have already been cached.

In Figure 14, we plot the simulated response time for a centralized system as a function of $x$. For the uncompressed index, we use $M$ = 1GB, and for the compressed index we use $M$ = 0.5GB, to make a fair comparison. In the case of the configuration that uses partial query evaluation with compressed posting lists, the lowest response time is achieved when 0.15GB out of the 0.5GB is allocated for storing answers for queries. We obtained similar trends in the results for the LAN setting.

Figure 15 shows the simulated workload for a distributed system across a WAN. In this case, the total amount of memory is split between the broker, which holds the cached answers of queries, and the query processors, which hold the cache of posting lists. According to the figure, the difference between the configurations of the query processors is less important because the network communication overhead increases the response time substantially. When using uncompressed posting lists, the optimal allocation of memory corresponds to using approximately 70% of the memory for caching query answers. This is explained by the fact that there is no need for network communication when the query can be answered by the cache at the broker.

Fig. 15. Optimal division of the cache memory when the next level requires WAN access.

6.4 Experimenting with a Real System

Fig. 16. Average response time and throughput in a server for different splits of memory between a cache of query answers and a cache of postings.

We validate the results obtained from the simulation of the previous section by running a real system, varying the amount of memory allocated for a cache of query answers and for a cache of posting lists.
Our system uses threads to process queries in parallel. Each thread processes queries to completion, independently of other threads. The posting lists are decompressed as needed during query processing, which stops after matching 10,000 documents. For training, we use the same data as in the simulation. For testing, we use the first 30,000 queries from the simulation, where the first 10,000 queries are used to warm up the system. For each configuration, we run the system five times and report the average response time in milliseconds, as well as the average throughput. The error bars correspond to the minimum and maximum average response time and throughput obtained among all five runs. We run the system on a different server from the one used for parameter estimation, with 2 dual-core processors at 2GHz and 6GB of RAM.

Table III. Average Response Time and Throughput for a System Without Cache of Query Answers and Posting Lists, and for a System with Both Caches and the Optimal Split of Memory

Configuration                   Avg. Response Time (ms)   Throughput (q/s)
1 thread/no cache               22.63                     44.51
1 thread/optimal trade-off      12.35                     81.06
2 threads/no cache              21.61                     92.68
2 threads/optimal trade-off     12.06                     167.37

Figure 16(a) shows the average response time on the y-axis and the amount of memory allocated to the cache of query answers on the x-axis. A total of 0.5GB is shared by the two caches: if we allocate 0.2GB to the cache of query answers, then the remaining 0.3GB is used by the cache of posting lists. The curves for the single-threaded and the two-threaded system both follow the same trend shown in Figure 14. In the case of the single-threaded system, the optimal allocation of memory corresponds to 0.15GB for the cache of query answers. We also obtained the same result in our simulation described in the previous section.
It is important to note that even though we run the system on a faster server than the one on which the parameters were estimated, the results of the simulation remain valid.

Figure 16(b) shows the throughput achieved in the case of the single-threaded and two-threaded systems. We observe that throughput doubles when using two threads because the two server cores serve queries simultaneously. Having multiple cores, however, is not sufficient to have throughput increasing linearly, since there are other system resources the threads share, such as the disk. Although we have not investigated this issue in depth, our interpretation is that the doubling happens because the two threads overlap minimally in the use of the disk. Such a minimal overlap is possible because of the large number of hits, which produce fewer accesses to disk and spread them over time. Reducing the number of accesses and spreading them over time results in a small probability of overlap. We note that in both cases, the optimal allocation of memory is also at 0.15GB for the cache of query answers.

One important observation is that the optimal trade-off between the cache of query answers and that of posting lists significantly increases the capacity of the system, compared to a system that does not use cache. Table III shows the average response time and throughput of the system without cache of query answers or posting lists, and of the system with both caches and the optimal memory allocation. The optimal trade-off results in at least 44% reduction in average response time and an increase of 80% in throughput.

7. EFFECT OF THE QUERY DYNAMICS

Since the queries in the incoming traffic follow a power law, some of the most frequent queries will remain frequent even after some period of time. Topics upon which queries are submitted, however, vary over time and might invalidate the static cache built so far.
Hence, we assess the impact of time on the validity of the trained model by studying the statistical characteristics of the query stream, and we show that there is little variation in hit rate over sufficiently long periods of time.

7.1 Stability of Static Caching of Answers

For our query log, the query distribution and query-term distribution change slowly over time. To support this claim, we first assess how query topics change over time. Figure 17 shows a comparison of the distribution of queries from the first week in June 2006, to the distribution of queries for the remainder of 2006 that did not appear in the first week in June. The x-axis shows the rank of the query frequency, normalized on a log scale. The y-axis shows the frequency of a given query. We found that a very small percentage of queries are new queries (the highest frequency is more than two orders of magnitude smaller). In fact, the majority of queries that appear in a given week repeat in the following weeks for the next six months.

Fig. 17. The distribution of queries in the first week of June, 2006 (upper curve) compared to the distribution of new queries in the remainder of 2006.

We observe the stability of the hit rate by considering a test period of three months and compare the effect of the training duration by training the static set for one and two weeks. Figure 18 shows the results. The hit rate is stable both in the case of one and two weeks training, for all the three months tested. Interestingly, hit rate is consistently lower for the static cache trained for a single week. The peaks in the graph correspond to nightly periods, when the hit rate is highest. The lowest values happen during daytime periods, often corresponding to 2–3pm.

7.2 Stability of Static Caching of Posting Lists

The static cache of posting lists can be periodically recomputed.
To estimate the time interval in which we need to recompute the posting lists on the static cache, we need to consider an efficiency/quality trade-off: using too short a time interval might be prohibitively expensive, while recomputing the cache too infrequently might lead to having an obsolete cache not corresponding to the statistical characteristics of the current query stream.

Fig. 18. Hit rate trend of a static cache of 128,000 results containing the most frequent queries extracted from one and two weeks before the test period. Hit rate values correspond to periods of six hours.

Fig. 19. Impact of distribution changes on the static caching of posting lists.

We measure the effect on the QTFDF algorithm of the changes in a 15-week query stream (Figure 19(a)). We compute the query-term frequencies over the whole stream, select which terms to cache, and then compute the hit rate on the whole query stream. This hit rate is an upper bound, and it assumes perfect knowledge of the query term frequencies. To simulate a realistic scenario, we use the first 6 (3) weeks of the query stream for computing query term frequencies, and the following 9 (12) weeks to estimate the hit rate. As Figure 19(a) shows, the hit rate decreases by less than 2%. We repeated the same experiment for the QTF algorithm and the decrease on the hit rate was less than 0.2%.

The high correlation among the query term frequencies during different time periods explains the small changes in hit rate as time elapses. Indeed, the pairwise correlation among all possible 3-week periods of the 15-week query stream is over 99.5%.
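A correlation of this kind between two periods of a query log can be measured as sketched below. The text does not say which correlation coefficient was used, so the Pearson coefficient here is an assumption, as are the whitespace tokenization, the function names, and the toy query lists; terms absent from one period are counted with frequency zero.

```python
from collections import Counter
from math import sqrt
from typing import Iterable


def term_frequencies(queries: Iterable[str]) -> Counter:
    """Query-term frequency: number of queries in the period containing each term."""
    counts: Counter = Counter()
    for query in queries:
        counts.update(set(query.split()))  # count a term once per query
    return counts


def pearson(a: Counter, b: Counter) -> float:
    """Pearson correlation of two frequency tables over the union of their terms."""
    vocab = sorted(set(a) | set(b))
    xs = [a[t] for t in vocab]  # Counter yields 0 for a missing term
    ys = [b[t] for t in vocab]
    n = len(vocab)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant frequencies")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy periods: the second repeats the first with doubled volume.
period_1 = ["web search", "web cache", "web"]
period_2 = period_1 + period_1
r = pearson(term_frequencies(period_1), term_frequencies(period_2))
print(round(r, 6))
```

A value close to 1 means that terms keep their relative frequencies from one period to the next, which is the condition under which a static cache selected on past data stays useful.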
A similar result, shown in Figure 19(b), was obtained for the Chile dataset after training the static cache, using one month of the query log for training, and three months for testing. In this case, however, the degradation of the quality is greater due to the longer testing period.

8. CONCLUSIONS

Caching is an effective technique in search engines for improving response time, reducing the load on query processors, and improving network bandwidth utilization. The results we presented in this article consider both dynamic and static caching. According to our results, dynamic caching of queries has limited effectiveness due to the high number of compulsory misses caused by the number of unique or infrequent queries. In our UK log, the minimum miss rate is 50% using a working set strategy. Caching terms is more effective with respect to miss rate, achieving values as low as 12%. We also propose a new algorithm for static caching of posting lists that outperforms previous static caching algorithms as well as dynamic algorithms such as LRU and LFU, obtaining hit rate values that are over 10% higher compared with these strategies.

As one of our main contributions, we present a framework for the analysis of the trade-off between caching query results and caching posting lists. In particular, we use the trade-off to evaluate if compression in posting list caching is worthwhile. Keeping compressed postings, in fact, allows for accommodating a greater number of entries at the cost of a greater query evaluation time. We show in the experimental analysis that compression is always better. To the best of our knowledge, this is the first work considering compression inside cache entries. We plan to extend this further by evaluating the impact of different encoding schemes on the performance of posting list caching. We also show that partitioning the available cache into a static and a dynamic part improves the cache performance for caching posting lists.
We use simulation as well as a real system to evaluate different types of architectures. Our results show that for centralized and LAN environments, there is an optimal allocation of caching query results and caching of posting lists, while for WAN scenarios in which network latency prevails, it is more important to cache query results. We leave to future work query processing algorithms that better integrate with caching, improved algorithms for caching posting lists, and a study of the consequences of the results in a production system.

Received December 2007; revised July 2008; accepted August 2008