Design Trade-Offs for Search Engine Caching
RICARDO BAEZA-YATES, ARISTIDES GIONIS, FLAVIO P. JUNQUEIRA,
VANESSA MURDOCK, and VASSILIS PLACHOURAS
Yahoo! Research
and
FABRIZIO SILVESTRI
ISTI – CNR


In this article we study the trade-offs in designing efficient caching systems for Web search engines.
We explore the impact of different approaches, such as static vs. dynamic caching, and caching
query results vs. caching posting lists. Using a query log spanning a whole year, we explore the
limitations of caching and we demonstrate that caching posting lists can achieve higher hit rates
than caching query answers. We propose a new algorithm for static caching of posting lists, which
outperforms previous methods. We also study the problem of finding the optimal way to split the
static cache between answers and posting lists. Finally, we measure how the changes in the query
log influence the effectiveness of static caching, given our observation that the distribution of the
queries changes slowly over time. Our results and observations are applicable to different levels of
the data-access hierarchy, for instance, for a memory/disk layer or a broker/remote server layer.
Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information
Search and Retrieval—Search process; H.3.4 [Information Storage and Retrieval]: Systems
and Software—Distributed systems, performance evaluation (efficiency and effectiveness)
General Terms: Algorithms, Design
Additional Key Words and Phrases: Caching, Web search, query logs
ACM Reference Format:
Baeza-Yates, R., Gionis, A., Junqueira, F. P., Murdock, V., Plachouras, V., and Silvestri, F. 2008.
Design trade-offs for search engine caching. ACM Trans. Web, 2, 4, Article 20 (October 2008),
28 pages. DOI = 10.1145/1409220.1409223 http://doi.acm.org/10.1145/1409220.1409223



This article is an expanded version of an article that previously appeared in Proceedings of the 30th
Annual ACM Conference on Research and Development in Information Retrieval, 183–190.
Authors’ addresses: R. Baeza-Yates, A. Gionis, F. P. Junqueira, V. Murdock, and V.
Plachouras, Yahoo! Research Barcelona, Avda. Diagonal 177, 8th floor, 08018, Barcelona, Spain;
email: rbaeza@acm.org, gionis@yahoo-inc.com, fpj@yahoo-inc.com, vmurdock@yahoo-inc.com,
vassilis@yahoo-inc.com; F. Silvestri, Istituto ISTI A. Faedo, Consiglio Nazionale delle Ricerche
(CNR), via Moruzzi 1, I-56100, Pisa, Italy; email: fabrizio.silvestri@isti.cnr.it.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is
granted without fee provided that copies are not made or distributed for profit or direct commercial
advantage and that copies show this notice on the first page or initial screen of a display along
with the full citation. Copyrights for components of this work owned by others than ACM must be
honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers,
to redistribute to lists, or to use any component of this work in other works requires prior specific
permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 2 Penn
Plaza, Suite 701, New York, NY 10121-0701 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2008 ACM 1559-1131/2008/10-ART20 $5.00 DOI 10.1145/1409220.1409223 http://doi.acm.org/
10.1145/1409220.1409223


1. INTRODUCTION
Millions of queries are submitted daily to Web search engines, and users have
high expectations of the quality of results and the latency to receive them. As
the searchable Web becomes larger, with more than 20 billion pages to index,
evaluating a single query requires processing large amounts of data. In such
a setting, using a cache is crucial to reducing response time and to increasing
the response throughput.
   The primary use of a cache memory is to speed up computation by exploiting
patterns present in query streams. Since access to primary memory (RAM) is
orders of magnitude faster than access to secondary memory (disk), the average
latency drops significantly with the use of a cache. A secondary, yet important,
goal is reducing the workload to back-end servers. If the hit rate is x, then the
back-end servers receive 1 − x of the original query traffic.
   Caching can be applied at different levels with increasing response latencies
or processing requirements. For example, the different levels may correspond
to the main memory, the disk, or resources in a local or a wide area network.
The decision of what to cache can be taken either off-line (static) or online (dy-
namic). A static cache is usually based on historical information and is subject
to periodic updates. A dynamic cache keeps objects stored in its limited number
of entries according to the sequence of requests. When a new request arrives,
the cache system decides whether to evict some entry from the cache in the case
of a cache miss. Such online decisions are based on a cache policy, and several
different policies have been studied in the past.
   For a search engine, there are two possible ways to use a cache memory:

— Caching answers. As the engine returns answers to a particular query, it may
  decide to store these partial answers (say, top-K results) to resolve future
  queries.
— Caching terms. As the engine evaluates a particular query, it may decide to
  store in memory the posting lists of the involved query terms. Often the whole
  set of posting lists does not fit in memory, and consequently, the engine has
  to select a small set to keep in memory to speed up query processing.

   Returning an answer to a query that already exists in the cache is more effi-
cient than computing the answer using cached posting lists. On the other hand,
a cached posting list can be used to process any query with the corresponding
term, implying a higher hit rate for cached posting lists.
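To make the two options concrete, the following sketch (illustrative only, not the implementation used in this article; the names answer_cache, posting_cache, fetch_from_disk, and rank are hypothetical) shows how a query would be served by first consulting a cache of answers and then falling back to cached or on-disk posting lists.

def process_query(query, answer_cache, posting_cache, fetch_from_disk, rank, top_k=10):
    # Cheapest path: the precomputed top-k answer is already cached.
    cached = answer_cache.get(query)
    if cached is not None:
        return cached
    # Otherwise gather posting lists, preferring the posting-list cache over
    # disk, and evaluate the query; slower, but benefits from term-level reuse.
    postings = {}
    for term in query.split():
        plist = posting_cache.get(term)
        if plist is None:
            plist = fetch_from_disk(term)   # slowest path
        postings[term] = plist
    return rank(postings, top_k)            # e.g., document-at-a-time scoring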
   Caching of posting lists has additional challenges. As posting lists have vari-
able size, caching them dynamically is not very efficient, due to the time and
space overhead of managing variable-sized entries, and to the skewed distribution
of the query stream, as shown later.
lenges: when deciding which terms to cache, one faces the trade-off between
frequently queried terms and terms with small posting lists that are space effi-
cient. Finally, before deciding to adopt a static caching policy, the query stream
should be analyzed to verify that its characteristics do not change rapidly over
time.




              Fig. 1. One caching level in a distributed search architecture.


   In this article we explore trade-offs in the design of each cache level, showing
that the problem is the same at each level, and only a few parameters change.
In general, we assume that each level of caching in a distributed search archi-
tecture is similar to that shown in Figure 1. We mainly use a query log from
Yahoo! UK, spanning a whole year, to explore the limitations of dynamically
caching query answers or posting lists for query terms, and in some cases, we
use a query log from the TodoCL search engine to validate our results.
   We observe that caching posting lists can achieve higher hit rates than
caching query answers. We propose new algorithms for the static caching of
posting lists for query terms, showing that the static caching of query terms is
more effective than dynamic caching with LRU or LFU policies. We provide an
analysis of the trade-offs between static caching of query answers and of query
terms. This analysis enables us to obtain the optimal allocation of memory for
different types of static caches, for both a particular implementation of a re-
trieval system and a simple model of a distributed system. Finally, we explore
how changes in the query log influence the effectiveness of static caching.
   More concretely, our main conclusions are the following:

— Caching query answers results in lower hit ratios compared with caching
  of posting lists for query terms, but it is faster because there is no need for
  query evaluation. We provide a framework for the analysis of the trade-off
  between static caching of query answers and posting lists.
— We evaluate the benefits of keeping compressed postings in the posting list
  cache. To the best of our knowledge, this is the first time cache entries are
  kept compressed. We show that compression is worthwhile in real cases, since
  it results in a lower average response time.
— Static caching of terms can be more effective than dynamic caching with,
  for example, LRU. We provide algorithms based on the KNAPSACK problem for
  selecting the posting lists to put in a static cache, and we show improvements
  over previous work, achieving a hit ratio over 90%.
— Changes in the query distribution over time have little impact on static
  caching.

  This article is an extended version of the one presented at ACM SIGIR
2007 [Baeza-Yates et al. 2007], making the following additional contributions:

— In addition to the Yahoo! UK log and the UK 2006 document collection, we
  use a query log and a document collection from the TodoCL search engine to
  validate some of our results.
— We present results showing that a mixed policy combining static and
  dynamic caching for the problem of caching posting lists performs better than
  either static or dynamic caching alone.
— We present results from experiments using a real system that validate our
  computational model.

   The remainder of this article is organized as follows. Sections 2 and 3 sum-
marize related work and characterize the data sets we use. Section 4 discusses
the limitations of dynamic caching. Sections 5 and 6 introduce algorithms for
caching posting lists, and a theoretical framework for the analysis of static
caching, respectively. Section 7 discusses the impact of changes in the query
distribution on static caching, and Section 8 provides our concluding remarks.


2. RELATED WORK
Caching is a useful technique for Web systems that are accessed by a large num-
ber of users. It enables a shorter average response time, it reduces the workload
on back-end servers, and it reduces the overall amount of utilized bandwidth.
In a Web system, both clients and servers can cache items. Browsers cache Web
objects on the client side, whereas servers cache precomputed answers or par-
tial data used in the computation of new answers. A third possibility, although
of less interest to this article, is to use proxies to mediate the communication be-
tween clients and servers, storing frequently requested objects [Podlipnig and
Boszormenyi 2003].
    Query logs constitute a valuable source of information for evaluating the
effectiveness of caching systems. Silverstein et al. [1999] analyze a large query
log of the AltaVista search engine containing about a billion queries submitted
over more than a month. Tests conducted include the analysis of the query
sessions for each user, and of the correlations among the terms of the queries.
Similarly to other work, their results show that the majority of the users (in
this case about 85%) visit the first page of results only. They also show that 77%
of the sessions end after the first query. Jansen et al. [1998] conduct a similar
analysis, obtaining results similar to the previous study. They conclude that
while IR systems and Web search engines are similar in their features, users of
the latter are very different from users of IR systems. Jansen and Spink [2006]
present a thorough analysis of search engine user behavior. Besides analyzing
the distribution of page-views, number of terms, number of queries, and so forth,
they show a topical classification of the submitted queries, pointing out how
users interact with their preferred search engine. Beitzel et al. [2004] analyze
a very large Web query log containing queries submitted by a population of tens
of millions of users searching the Web through AOL. They partition the query log
into groups of queries submitted during different hours of the day. The analysis
highlights the changes in popularity and uniqueness of topically categorized
queries within the different groups.

   While there are several studies analyzing query logs for different purposes,
just a few consider caching for search engines. This might be due to the difficulty
in showing the effectiveness without having a real system available for testing.
As noted by Xie and O’Hallaron [2002] and confirmed by our analysis, many
popular queries are shared by different users. This level of sharing justifies the
choice of a server-side caching system for Web search engines.
   In one of the first published works on exploiting user query history, Raghavan
and Sever [1995] propose using a query base built upon a set of persistent
“optimal” queries submitted in the past, to improve the retrieval effectiveness
for similar future queries. Markatos [2001] shows the existence of temporal
locality in queries, and compares the performance of different variants of the
LRU policy, using hit ratio as a metric. According to his analysis, static caching
is very effective if employed on very small caches (50Mbytes), but gracefully
degrades as the cache size increases.
   Based on the observations of Markatos, Lempel and Moran [2003] propose a
new caching policy, called probabilistic driven caching (PDC), which attempts
to estimate the probability distribution of all possible queries submitted to a
search engine. PDC is the first policy to adopt prefetching in anticipation of
user requests. To this end, PDC exploits a model of user behavior, where a
user session starts with a query for the first page of results, and can proceed
with one or more follow-up queries (i.e., queries requesting successive pages of
results). When no follow-up queries are received within τ seconds, the session
is considered finished.
   Fagni et al. [2006] follow Markatos’ work by showing that combining static
and dynamic caching policies, together with an adaptive prefetching policy,
achieves a high hit ratio. In their experiments, they observe that devoting a
large fraction of entries to static caching, along with prefetching, obtains the
best hit ratio.
   Baeza-Yates et al. [2007] introduce a caching mechanism for query answers
where the cache memory is split in two parts. The first part is used to cache
results of queries that are likely to be repeated in the future, and the second
part is used to cache all other queries. The decision to cache the query results
in the first or the second part depends on features of the query, such as its past
frequency or its length (in tokens or characters).
   One of the main issues with the design of a server-side cache is the amount
of memory resources usually available on servers. Tsegay et al. [2007] consider
caching of pruned posting lists in a setting where query evaluation terminates
when the set of top ranked documents does not change by processing more
postings. Zhang et al. [2008] study caching of blocks of compressed posting lists
using several dynamic caching algorithms, and find that evicting from memory
the least frequently used blocks of posting lists performs very well in terms
of hit ratio. Our static caching algorithm for posting lists, in Section 5, uses
the ratio frequency/size in order to evaluate the goodness of an item to cache.
Similar ideas have been used in the context of file caching [Young 2002], Web
caching [Cao and Irani 1997], and even caching of posting lists [Long and Suel
2005], but in all cases in a dynamic setting. To the best of our knowledge we
are the first to use this approach for static caching of posting lists.

   Since systems are often hierarchical, multiple-level caching architectures
have been proposed. Saraiva et al. [2001] propose a new architecture
for Web search engines using a two-level dynamic caching system. Their goal
for such systems has been to improve response time for hierarchical engines.
In their architecture, both levels use an LRU eviction policy. They find that
the second-level cache can effectively reduce disk traffic, thus increasing the
overall throughput. Baeza-Yates and Saint-Jean [2003] propose a three-level
index organization with a frequency based posting list static cache. Long and
Suel [2005] propose a caching system structured according to three different
levels. The intermediate level contains frequently occurring pairs of terms and
stores the intersections of the corresponding inverted lists. The last two studies
are related to ours in that they exploit different caching strategies at different
levels of the memory hierarchy.
   There is a large body of work devoted to query optimization. Buckley
and Lewit [1985], in one of the earliest works, take a term-at-a-time ap-
proach to decide when inverted lists need not be further examined. More re-
cent examples demonstrate that the top k documents for a query can be re-
turned without the need for evaluating the complete set of posting lists [Anh
and Moffat 2006; Büttcher and Clarke 2006; Strohman et al. 2005; Ntoulas
and Cho 2007]. Although these approaches seek to improve query process-
ing efficiency, they differ from our current work in that they do not consider
caching. They may be considered separate and complementary to a cache-based
approach.

3. DATA CHARACTERIZATION
Our main dataset consists of a crawl of documents from the UK domain, and
the logs for one year of queries submitted to http://www.yahoo.co.uk from
November 2005 to November 2006. To further validate our results, we use a
second dataset consisting of a crawl of documents indexed by the TodoCL search
engine1 from 2003, with queries submitted to the search engine from May to
November, 2003.
   The document collection from the UK is a summary of the UK domain crawled
in May 2006 [Boldi et al. 2004; Castillo et al. 2006].2 This summary corresponds
to a maximum of 400 crawled documents per host, using a breadth-first crawling
strategy, comprising 15GB. The distribution of document frequencies of terms
in the UK collection follows a power law distribution with parameter 1.24.3
The corpus statistics for the Chile data are comparable to those for the UK
Summary collection. The distribution of document frequencies for every term
in the Chile corpus follows a power law of parameter 1.10. The statistics for
both collections are shown in Table I.
   With respect to our query-log datasets, in a year of queries to the UK search
engine, 50% of the total volume of queries are unique. The average query length

1 http://www.todocl.cl     visited July 2008.
2 The collection is available from the University of Milan: http://law.dsi.unimi.it/.
3 In this article we use power laws to fit the data in the main part of the distribution, since in
general, the power law does not fit well across the two extremes.


                         Table I. Statistics of the Document Collections
           UK-2006 Sample Statistics                            Chile Sample Statistics
    # of documents                    2,786,391       # of documents                     3,110,605
    # of terms                        6,491,374       # of terms                         3,894,893
    # of postings                   773,440,986       # of postings                    529,599,712
    # of tokens                   2,109,512,558       # of tokens                    1,578,821,207
    Inverted file size (bytes)     1,189,266,893       Inverted file size (bytes)      1,004,086,805


is 2.5 terms, with the longest query having hundreds of terms. Figure 2(a) shows
the distributions of queries and query terms for a sample of the query logs from
yahoo.co.uk for part of a year. The x-axis represents the normalized frequency
rank of the query or term, that is, the most frequent query appears closest to
the y-axis. The y-axis is the normalized frequency for a given query (or term).
As expected, the distribution of query frequencies and query term frequencies
shown in this graph follow power law distributions, with parameters of 0.83
and 1.06, respectively. In this figure, the queries and terms were normalized
for case and white space.
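As an illustration of how such exponents can be obtained, the sketch below fits a power-law exponent by least squares on the log-log frequency-rank data; this is a common fitting choice, not necessarily the exact procedure used for the numbers reported here.

import math

def power_law_exponent(frequencies):
    """Estimate alpha in frequency ~ rank^(-alpha) from counts sorted in
    decreasing order (rank 1 = most frequent)."""
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope   # the slope of the log-log fit is -alpha

# Toy counts from a hypothetical query log
print(power_law_exponent([1000, 480, 310, 240, 190, 160, 140, 120]))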
   The Chile query log resembles the UK query log, where 60% of the total
volume of queries are unique queries, and 80% of the unique queries are sin-
gleton queries—queries that appear only once in the logs. The average query
length was 2.63 terms, the longest being 73 terms. Figure 2(b) shows the query
and term distributions for the Chile query log. The queries were normalized
for case and whitespace. The query and term distributions follow a power law,
with parameters of 0.62 and 0.88, respectively.
   Finally, we computed the correlation between the document frequency of
terms in the UK collection, and the number of queries to yahoo.co.uk that
contain a particular term in the query log, to be 0.42. The correlation between
the document frequency of terms in the collection indexed by TodoCL, and the
number of queries to the TodoCL search engine containing a particular term, is
only 0.29.
   A scatter plot for a random sample of terms for both data sets is shown in
Figure 3. In this experiment, terms have been converted to lower case in both
the queries and the documents so that the frequencies will be comparable.

4. CACHING OF QUERIES AND TERMS
Caching relies upon the assumption that there is locality in the stream of re-
quests. That is, there must be sufficient repetition in the stream of requests
and within intervals of time that enable a cache memory of reasonable size
to be effective. In the UK query log, 88% of the unique queries are singleton
queries, and 44% are singleton queries out of the whole volume. Thus, out of all
queries in the stream composing the query log, the upper threshold on hit ratio
is 56%. This is because only 56% of all the queries comprise queries that have
multiple occurrences. It is important to observe, however, that not all queries
in this 56% can be cache hits because of compulsory misses. A compulsory miss
happens when the cache receives a query for the first time. This is different
from capacity misses, which happen due to space constraints on the amount of
memory the cache uses. If we consider a cache with infinite memory, then the




Fig. 2. The distribution of queries, query terms, and document terms in the UK dataset (a) and
the Chile dataset (b). The curves are shown for a large subset of queries for part of a year. The
y-axis has been normalized for each distribution. The x-axis has been normalized by the rank, so
that the most frequent term is closest to the y-axis.


hit ratio is 50% because, as mentioned in the previous section, unique queries
are 50% of the total query volume. Note that for an infinite cache there are no
capacity misses.
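The quantities in this discussion can be computed directly from a query stream; the following sketch (illustrative, operating on an in-memory list of query strings) derives the unique and singleton fractions and the corresponding hit-ratio bounds.

from collections import Counter

def locality_stats(queries):
    total = len(queries)
    counts = Counter(queries)
    unique = len(counts)
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "unique_fraction": unique / total,                   # about 50% in the UK log
        "singleton_fraction_of_unique": singletons / unique, # about 88% in the UK log
        "singleton_fraction_of_volume": singletons / total,  # about 44% in the UK log
        "repeated_query_bound": 1 - singletons / total,      # the 56% upper threshold
        "infinite_cache_hit_ratio": 1 - unique / total,      # 50%: compulsory misses only
    }

print(locality_stats(["q1", "q2", "q1", "q3", "q1", "q4"]))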
   As we mentioned before, another possibility is to cache the posting lists of
terms. Intuitively, this gives more freedom in the utilization of the cache content
to respond to queries, because cached terms might form a new query. On the
other hand, they need more space.
   As opposed to queries, the fraction of singleton terms in the total volume of
terms is smaller. In the UK query log, only 4% of the terms appear once, but




Fig. 3. Normalized scatter plot of document-term frequencies vs. query-term frequencies for the
UK collection, and queries to yahoo.co.uk (left), and the same for the TodoCL data (right).




       Fig. 4. Arrival rate of queries and terms and estimated workload for the UK log.


this accounts for 73% of the vocabulary of query terms. We show in Section 5
that caching a small fraction of terms, while accounting for terms appearing in
many documents, is potentially very effective.
   Figure 4(a) shows several curves corresponding to the normalized arrival
rate of queries in the UK log for different cases using days as bins. That is,
we plot the normalized number of elements that appear in a day. This graph
shows only a period of 122 days, and we normalize the values by the maximum
value observed throughout the whole period of the query log. “total queries”
and “total terms” correspond to the total volume of queries and terms, respec-
tively. “Unique queries” and “unique terms” correspond to the arrival rate of
unique queries and terms. Finally, “query diff ” and “terms diff ” correspond to
the difference between the curves for total and unique.
   In Figure 4(a), as expected, the volume of terms is much higher than the
volume of queries. The difference between the total number of terms and the
number of unique terms is much larger than the difference between the total
number of queries and the number of unique queries. This observation implies
that terms repeat significantly more than queries. If we use smaller bins, say
of one hour, then the ratio of unique elements to total volume is higher for both
terms and queries, because a smaller window leaves less room for repetition.
    We also estimated the workload using the document frequency of terms as a
measure of how much work a query imposes on a search engine. We found that
it closely follows the arrival rate for terms shown in Figure 4(a). In more detail,
Figure 4(b) plots the sum of the length of the posting lists associated with terms
in each bin, normalized by the average workload. Since the absolute workload
values are substantially higher, normalizing is necessary to make the graph
comparable to the others for total, unique, and difference in Figure 4(a). We
then normalize it a second time using the same procedure as for the curves in
Figure 4(a). The main observation in this graph is that the workload closely
follows the arrival rate for terms. The graph would not have such a shape if,
for example, the terms in queries in periods of high activity had, on average,
shorter posting lists.
    To demonstrate the effect of a dynamic cache on the query frequency distri-
bution of Figure 2(a), we plot the same frequency graph (Figure 5), but now considering
the frequency of queries after going through an LRU cache. On a cache miss, an
LRU cache decides upon an entry to evict, using the information on the recency
of queries. In this graph, the most frequent queries are not the same queries
that were most frequent before the cache. It is possible that queries that are
most frequent after the cache have different characteristics, and tuning the
search engine to queries that were frequent before the cache may degrade per-
formance for non-cached queries. The maximum frequency after caching is less
than 1% of the maximum frequency before the cache, thus showing that the
cache is very effective in reducing the load of frequent queries. If we rerank the
queries according to after-cache frequency, the distribution is still a power law,
but with a much smaller value for the highest frequency.
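A minimal sketch of this measurement is shown below: the query stream is passed through an LRU cache of answers, and the frequency of each query among the misses is what remains for the back-end. The capacity is counted in queries, an illustrative simplification.

from collections import Counter, OrderedDict

def after_cache_frequencies(queries, capacity):
    cache = OrderedDict()          # query -> None, ordered from least to most recent
    misses = Counter()
    for q in queries:
        if q in cache:
            cache.move_to_end(q)   # hit: refresh recency, the back-end never sees q
        else:
            misses[q] += 1         # miss: q reaches the back-end
            cache[q] = None
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict the least recently used query
    return misses                  # after-cache frequency of each query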
    When discussing the effectiveness of caching dynamically, an important met-
ric is cache miss rate. To analyze the cache miss rate for different memory
constraints, we use the working set model [Denning 1980; Slutz and Traiger
1974]. A working set, informally, is the set of references that an application or
an operating system is currently working with. The model uses such sets in a
strategy that tries to capture the temporal locality of references. The working
set strategy then consists in keeping in memory only the elements that are ref-
erenced in the previous θ steps of the input sequence, where θ is a configurable
parameter corresponding to the window size.
    Originally, working sets have been used for page replacement algorithms
of operating systems, and considering such a strategy in the context of search
engines is interesting for three reasons. First, it captures the amount of locality
of queries and terms in a sequence of queries. Locality in this case refers to the
frequency of queries and terms in a window of time. If many queries appear
multiple times in a window, then locality is high. Second, it enables an offline
analysis of the expected miss rate given different memory constraints. Third,
working sets capture aspects of efficient caching algorithms, such as LRU. LRU
assumes that references further in the past are less likely to be referenced in
the present, which is implicit in the concept of working sets [Slutz and Traiger
1974].




                        Fig. 5. Frequency graph after LRU cache.


   We now characterize the working set model more formally. Following the
model of Slutz and Traiger [1974], let ρk denote a finite reference sequence of
elements, elements being either queries or terms, where r(t) evaluates to the
element at position t of this sequence, and k is the length of the sequence. A
working set for ρk is as follows:
  Definition 4.1. The working set at time t is the distinct set of elements
among r(t − θ + 1) . . . r(t).
The function ck(x), used to compute the miss rate, is defined as follows:
    Definition 4.2. ck(x), 1 ≤ x ≤ k, is the number of occurrences of xt = x in
ρk, where xt is the number of elements since the last reference to r(t).
  We define the miss rate as:

                         m(θ) = 1 − (1/k) Σ_{x=1}^{θ} ck(x).                         (1)
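Definition 4.2 and Equation (1) translate directly into the following sketch, which computes the miss rate of a reference sequence for a given window size θ.

def working_set_miss_rate(refs, theta):
    """refs: reference sequence (queries or terms); theta: window size."""
    last_pos = {}                  # element -> position of its most recent reference
    hits = 0
    for t, r in enumerate(refs):
        # x_t = t - last_pos[r] is the number of elements since the last
        # reference to r; it is a hit when 1 <= x_t <= theta (Definition 4.2).
        if r in last_pos and t - last_pos[r] <= theta:
            hits += 1
        last_pos[r] = t
    return 1.0 - hits / len(refs)

print(working_set_miss_rate(["a", "b", "a", "c", "b", "a"], theta=3))   # 0.5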

   Figure 6(a) plots the miss rate for different working set sizes, and we consider
working sets of both queries and terms. The working set sizes are normalized
against the total number of queries in the query log. In the graph for queries,
there is a sharp decay until approximately 0.01, and there is a decrease in the
rate at which the miss rate drops as we increase the size of the working set
over 0.01. Finally, the minimum value it reaches is 50% miss rate, not shown
in the figure, since we have cut the tail of the curve for presentation purposes.
For the sequence of terms that we use to plot the term curve in the figure, we
have not considered all the terms in the log. Instead, we use the same number
of queries we use for the query graph, taken from the head of the query log.




        Fig. 6. Miss rate as a function of the working set size and distribution of distances.


   Compared with the query curve, we observe that the minimum miss rate
for terms is substantially smaller. The miss rate also drops sharply on values
up to 0.01, and it decreases minimally for higher values. The minimum value,
however, is slightly over 10%, which is much smaller than the minimum value
for the sequence of queries. This implies that, with such a policy, it is possible to
achieve over 80% hit rate, if we consider caching dynamically posting lists for
terms as opposed to caching answers for queries. This result does not consider
the space required for each unit stored in the cache memory, or the amount of
time it takes to put together a response to a user query. We analyze these issues
more carefully later in this article.
   It is interesting also to observe the histogram of Figure 6(b), which is an
intermediate step in the computation of the miss rate graph. It reports the
distribution of distances between repetitions of the same frequent query. The
distance in the plot is measured in the number of distinct queries separating
a query and its repetition, and it considers only queries appearing at least 10
times. For example the distance between repetitions of the query q in the query
stream q, q1 , q2 , q2 , q1 , q3 , q is three. From Figures 6(a) and 6(b), we conclude
that even if we set the size of the query answers cache to a relatively large
number of entries, the miss rate is high. Thus caching the posting lists of terms
has the potential to improve the hit ratio. This is what we explore next.
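The distance statistic of Figure 6(b) can be computed as in the following sketch, which counts the distinct queries seen between consecutive occurrences of a query; it is quadratic in the worst case and meant only as an illustration.

from collections import Counter

def repetition_distances(queries, min_freq=10):
    """Distance = number of distinct queries between two occurrences of the
    same query; only queries appearing at least min_freq times are counted."""
    freq = Counter(queries)
    tracked = {}                   # query -> set of distinct queries seen since it
    distances = []
    for q in queries:
        if q in tracked and freq[q] >= min_freq:
            distances.append(len(tracked[q]))
        tracked[q] = set()         # (re)start tracking from this occurrence
        for other, seen in tracked.items():
            if other != q:
                seen.add(q)        # q separates every other tracked query
    return distances

# The example from the text: the two occurrences of q are separated by three
# distinct queries (q1, q2, q3).
print(repetition_distances(["q", "q1", "q2", "q2", "q1", "q3", "q"], min_freq=1))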

5. CACHING POSTING LISTS
The previous section shows that caching posting lists can obtain a higher hit
rate as compared with caching query answers. In this section, we study the
problem of how to select posting lists to place in a certain amount of available
memory, assuming that the whole index is larger than the amount of memory
available. The posting lists have variable size (in fact, their size distribution
follows a power law), so it is beneficial for a caching policy to consider the sizes
of the posting lists. In Section 5.1, we describe a new algorithm for caching
posting lists statically. We compare our algorithm with a static-cache algorithm
that considers only query frequency statistics, as well as with dynamic-cache
algorithms, such as LRU, LFU, and a modified dynamic algorithm that takes
posting-list size into account. Additionally, in Section 5.2, we discuss a mixed
caching policy that considers partitioning the available cache in two parts and
using one part as static cache and the other part as dynamic cache.

5.1 Static Caching
Before discussing the static caching strategies, we introduce some notation. We
use f q (t) to denote the query-term frequency of a term t, that is, the number
of queries containing t in the query log, and f d (t) to denote the document fre-
quency of t, that is, the number of documents in the collection in which the term
t appears.
      The first strategy we consider is the algorithm proposed by Baeza-Yates and
Saint-Jean [2003], which consists in selecting the posting lists of the terms with
the highest query-term frequencies f q (t). We call this algorithm QTF. The QTF
algorithm is clearly motivated by filling the cache with terms that appear often
in the queries. The query-term frequencies are computed from past query logs,
and for the policy to be effective we assume that the query-term frequencies do
not change much over time. Later in this article we analyze the impact of the
query-log dynamics on static caching.
      Next, we describe our suggested static-cache algorithm. Our main observa-
tion is that there is a trade-off between f q (t) and f d (t). On the one hand, terms
with high f q (t) are useful to keep in the cache because they are queried often.
On the other hand, terms with high f d (t) are not good candidates because they
correspond to long posting lists and consume a substantial amount of space. In
fact, the problem of selecting the best posting lists for the static cache corre-
sponds to the standard Knapsack problem: given a knapsack of fixed capacity,
and a set of n items, where the i-th item has value ci and size si, select the
set of items that fit in the knapsack and maximize the overall value. In our
case, “value” corresponds to f q (t) and “size” corresponds to f d (t). Thus we
employ a simple algorithm for the knapsack problem, which is selecting the
posting lists of the terms with the highest values of the ratio f q (t)/ f d (t).
We call this algorithm QTFDF. We tried other variations considering query
frequencies instead of term frequencies, but the gain was minimal relative to
the complexity added.
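A minimal sketch of the QTFDF selection follows: terms are ranked by the ratio fq(t)/fd(t) and their posting lists are added greedily until the cache budget, measured here in postings, is exhausted.

def qtf_df_select(fq, fd, budget):
    """fq, fd: dicts mapping term -> query-term / document frequency.
    budget: cache capacity measured in postings (fd(t) is the size of t's list)."""
    ranked = sorted(fq, key=lambda t: fq[t] / fd[t], reverse=True)
    cached, used = [], 0
    for t in ranked:
        if used + fd[t] <= budget:
            cached.append(t)
            used += fd[t]
    return cached

# Toy example: the frequent but very long list of "a" loses to shorter terms.
print(qtf_df_select({"a": 100, "b": 40, "c": 60}, {"a": 1000, "b": 50, "c": 200}, budget=300))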
      In addition to the above two static algorithms, we consider the following
algorithms for dynamic caching:

— LRU: a standard LRU algorithm, but many posting lists might need to be
  evicted (in order of least-recent usage) until there is enough space in the
  memory to place the currently accessed posting list.
— LFU: a standard LFU algorithm (eviction of the least-frequently used), with
  the same modification as the LRU.
— DYN-QTFDF: a dynamic version of the QTFDF algorithm; evict from the cache
  the term(s) with the lowest f q (t)/ f d (t) ratio (a sketch of this size-aware
  eviction follows the list).
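The three dynamic policies differ only in how they score cached terms for eviction, as the following sketch of a size-aware admission routine illustrates: recency of use yields LRU, access counts yield LFU, and the fq(t)/fd(t) ratio yields DYN-QTFDF. The helper names and the placeholder stored value are illustrative.

def admit(cache, sizes, term, capacity, score):
    """cache: dict term -> posting list; sizes: term -> posting-list size;
    score: term -> value, where the lowest-scored term is evicted first."""
    if term in cache:
        return
    used = sum(sizes[t] for t in cache)
    # Evict until the new posting list fits (or the cache is empty).
    while cache and used + sizes[term] > capacity:
        victim = min(cache, key=score)
        used -= sizes[victim]
        del cache[victim]
    if used + sizes[term] <= capacity:
        cache[term] = "postings..."   # placeholder for the actual list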

  The performance of all the above algorithms for the UK and Chile datasets
are shown in Figures 7 and 8, respectively. For the results on the UK dataset,




        Fig. 7. Hit rate of different strategies for caching posting lists for the UK dataset.




        Fig. 8. Hit rate of different strategies for caching posting lists for the Chile dataset.

we use 15 weeks of the UK query log, and for the Chile dataset, we use 4 months
of the Chile query log. Performance is measured with hit rate. The cache size is
measured as a fraction of the total space required to store the posting lists of
all terms.
     For the dynamic algorithms, we load the cache with terms in order of f q (t)
and we let the cache “warm up” for 1 million queries. For the static algorithms,
we assume complete knowledge of the frequencies f q (t), that is, we estimate
 f q (t) from the whole query stream. As we show in Section 7, the results do not
change much if we compute the query-term frequencies using the first 3 or 4
weeks of the query log, and measure the hit rate on the rest.




Fig. 9. Fraction of terms whose posting lists fit in cache for the two different static algorithms.


   The most important observation from our experiments is that the static
QTFDF algorithm has a better hit rate than all the dynamic algorithms. An
important benefit of a static cache is that it requires no eviction and it is hence
more efficient when evaluating queries. However, if the characteristics of the
query traffic change frequently over time, then it requires repopulating the
cache often, or there will be a significant impact on hit rate.
   The difference between the QTFDF and QTF algorithms is illustrated in
Figures 9(a) and 9(b), where we show the fraction of terms
whose posting lists fit in cache for the two static algorithms. QTF selects terms
with high f q (t) values. However, many of those terms tend to have long posting
lists, and as a result, few posting lists fit in cache. On the other hand, QTFDF
prefers to select many more (and shorter) posting lists, even though they have
smaller f q (t) values.

5.2 Adding Dynamic Cache
In addition to pure static and dynamic caching policies, we also consider a mixed
caching policy: given a fixed amount of available cache, partition it in two parts
and use the one part as static cache and the other part as dynamic cache. We
consider combining the static and dynamic caching policies, as demonstrated
in the previous section, namely the QTFDF algorithm for static and the LRU
algorithm for dynamic caching.
   The motivation behind considering such a mixed policy is to leverage the
good performance of static caching, but at the same time to employ dynamic
caching in order to handle temporal correlations and bursts in the query log
stream. Figure 10 presents the results of our experiment that was performed
using 15 weeks of the UK query log. Given a fixed amount of memory for caching
posting lists, we allocate an α fraction of the memory for the QTFDF policy and
the rest for the LRU policy. We tried with α = 0.1, 0.25, 0.5, 0.75 and 0.9.
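A sketch of the mixed policy for a given α is shown below: the static slice is filled once with the highest-ratio terms, the remaining budget is managed by LRU at query time, and the hit rate is measured over a stream of query terms. Budgets are measured in postings and the details are simplifications of the actual experimental setup.

from collections import OrderedDict

def build_static_slice(fq, fd, static_budget):
    chosen, used = set(), 0
    for t in sorted(fq, key=lambda t: fq[t] / fd[t], reverse=True):
        if used + fd[t] <= static_budget:
            chosen.add(t)
            used += fd[t]
    return chosen

def mixed_hit_rate(term_stream, fq, fd, capacity, alpha):
    """fd must provide a size for every term in the stream."""
    static_slice = build_static_slice(fq, fd, alpha * capacity)
    lru_budget = capacity - sum(fd[t] for t in static_slice)
    lru, lru_used = OrderedDict(), 0       # term -> size, ordered by recency
    hits = 0
    for t in term_stream:
        if t in static_slice:
            hits += 1
        elif t in lru:
            hits += 1
            lru.move_to_end(t)
        else:
            while lru and lru_used + fd[t] > lru_budget:
                _, size = lru.popitem(last=False)   # evict least recently used
                lru_used -= size
            if lru_used + fd[t] <= lru_budget:
                lru[t] = fd[t]
                lru_used += fd[t]
    return hits / len(term_stream)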
   Like the results presented in Fagni et al. [2006], our mixed static/dynamic
strategy has led to an improvement in the hit ratio of the cache. The improve-
ment is more significant for the smaller sizes of the cache; as the cache size
increases, the performance of the QTFDF algorithm reaches the performance of




                Fig. 10. The effect of adding dynamic cache to the QTFDF algorithm.


the mixed policy. Also, as Figure 10 shows, the best performance is achieved for
α = 0.9, that is, allocating the largest part of the cache for the static policy.

6. ANALYSIS OF STATIC CACHING
In this section, we provide a detailed analysis for the problem of deciding
whether it is preferable to cache query answers or cache posting lists. Our
analysis takes into account the impact of caching between two levels of the
data-access hierarchy. It can either be applied at the memory/disk layer or at a
server/remote server layer, as in the architecture discussed in the introduction.
   Using a particular system model, we obtain estimates for the parameters
required by our analysis, which we subsequently use to decide the optimal
trade-off between caching query answers and caching posting lists. To validate
the optimal trade-off, we run an implementation of the system with a cache of
query answers and a cache of posting lists.

6.1 Analytical Model
Let M be the size of the cache measured in answer units, that is, assume that
the cache can store M query answers. For the sake of simplicity, assume that all
posting lists are of the same length L, measured in answer units. We consider
the following two cases: (1) a cache that stores only precomputed answers, and
(2) a cache that stores only posting lists. In the first case, Nc = M answers fit
in the cache, while in the second case N p = M/L posting lists fit in the cache.
Thus N p = Nc /L. Note that although posting lists require more space, we can
combine terms to evaluate more queries (or partial queries).
   For case (1), suppose that a query answer in the cache can be evaluated in
one time unit. For case (2), assume that if the posting lists of the terms of a
query are in the cache, then the results can be computed in TR1 time units,




                      Fig. 11. Cache saturation as a function of size.


while if the posting lists are not in the cache, then the results can be computed
in TR2 time units. Of course we have that TR2 > TR1 .
   Now we want to compare the time to answer a stream of Q queries in both
cases. Let us use the QTF algorithm as an approximation to the QTFDF algorithm
(in fact, if the correlation between query terms and document terms is 0, this
approximation is quite good) and Vc (Nc ) be the volume of the most frequent Nc
queries. Then, for case (1), we have an overall time
                        TCA = Vc(Nc) + TR2 (Q − Vc(Nc)).

Similarly, for case (2), let Vp(Np) be the number of computable queries using
posting lists of the most frequent Np terms. Then we have overall time

                        TPL = TR1 Vp(Np) + TR2 (Q − Vp(Np)).

We want to check under which conditions we have TPL < TCA. Then,

              TPL − TCA = (TR2 − 1)Vc(Nc) − (TR2 − TR1)Vp(Np).                              (2)
Figure 11 shows the values of V p and Vc for the UK query log. We can see
that caching answers saturates faster, and for this particular data there is no
additional benefit from using more than 10% of the index space for caching
answers.
   Since the query distribution in practice is finite, Vc(n) will be a fraction,
depending on n, of the total number of queries Q. Now we estimate this fraction.
Since the query distribution follows a power law with parameter 0 < α < 1
in our two data sets, the i-th most frequent query appears with probability
proportional to 1/i^α. Therefore, the volume Vc(n), which is the total number of the
n most frequent queries, is

                        Vc(n) = V0 Σ_{i=1}^{n} Q/i^α ≈ V0 n^{1−α} Q,

where V0 = 1/U^{1−α} and U is the number of unique queries in the query
stream. We know that V p (n) grows faster than Vc (n) and we assume, based




    Fig. 12. Relation of query volumes of precomputed answers Vc (n) and posting lists V p (n).


on experimental results, that the relation is of the form Vp(n) = k Vc(n)^β (see
Figure 12). In the worst case, for a large cache, β → 1. That is, both techniques
will cache a constant fraction of the overall query volume. By setting β = 1,
replacing Np = Nc/L, and by combining with Equation (2), we obtain the result
that caching posting lists makes sense only if the ratio

                        ρ = L^{1−α} (TR2 − 1) / (k (TR2 − TR1)) < 1.
Unlike previous work, we also want to evaluate whether caching compressed
postings is better than caching plain postings. Caching compressed postings
has the benefit of allowing the accommodation of a greater number of entries,
since L′ < L, at the cost of a greater computational cost, that is, TR′1 > TR1.
The trade-off of caching postings vs. query results is now as follows:

                        ρ′ = L′^{1−α} (TR2 − 1) / (k (TR2 − TR′1)).

That is, ρ′ is the ratio for comparing cached answers with caching posting lists
when compression is used. Using compression is better if ρ′ < ρ. In the next
section, we show that L/L′ is about 3, and according to the experiments that we
show later, compression is always better.
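The ratio test can be carried out with a few lines of code; in the sketch below all parameter values are placeholders (α is the query power-law exponent and k the constant of the fit Vp(n) = k Vc(n)^β), not measurements from Table II.

def caching_ratio(L, alpha, TR1, TR2, k):
    """rho = L^(1 - alpha) * (TR2 - 1) / (k * (TR2 - TR1))."""
    return (L ** (1.0 - alpha)) * (TR2 - 1.0) / (k * (TR2 - TR1))

# Hypothetical values, only to show how the comparison is carried out:
# caching posting lists pays off when rho < 1, and keeping the cached
# postings compressed pays off when the compressed-case ratio is below
# the plain one.
rho = caching_ratio(L=0.8, alpha=0.8, TR1=100, TR2=1000, k=3.0)
rho_compressed = caching_ratio(L=0.3, alpha=0.8, TR1=300, TR2=1000, k=3.0)
print(rho < 1.0, rho_compressed < rho)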
   For a small cache, we are interested in the transient behavior and then β > 1,
as computed from the UK data (between 2 and 3 as shown in Figure 12). In this
case, there will always be a point where TPL > TCA for a large number of queries,
and this shows the importance of the real values of TR, which we estimate
next.
   As we showed in the previous section, instead of filling the cache only with
answers or only with posting lists, a better strategy is to divide the total cache
space into a cache for answers and a cache for posting lists. In such a case,
there will be some queries that could be answered by both parts of the cache,
and a good caching technique should try to minimize the intersection of both
caches. Finding the optimal division of the cache in order to minimize the overall
retrieval time is a difficult problem to solve analytically. In Section 6.3, we use
simulations to derive optimal cache trade-offs for particular implementation
examples.


6.2 Parameter Estimation
We now use a particular implementation of a centralized system and the model
of a distributed system as examples from which we estimate the parameters of
the analysis from the previous section. We perform the experiments using an
optimized version of Terrier [Ounis et al. 2006], for both indexing documents
and processing queries, on a single machine with a Pentium 4 at 2GHz and
1GB of RAM.
   We index the documents from the UK-2006 dataset, without removing stop
words or applying stemming. The posting lists in the inverted file consist of pairs
of document identifier and term frequency. We compress the document identifier
gaps using Elias gamma encoding, and the term frequencies in documents using
unary encoding [Witten et al. 1994]. The size of the inverted file is 1,189Mb. A
stored answer requires 1264 bytes, and an uncompressed posting takes 8 bytes.
From Table I, we obtain L = (8 · # of postings)/(1264 · # of terms) = 0.75 and
L′ = (inverted file size)/(1264 · # of terms) = 0.26.
   We estimate the ratio TR = T/Tc between the average time T it takes to
evaluate a query and the average time Tc it takes to return a stored answer for
the same query, in the following way. Tc is measured by loading the answers
for 100,000 queries in memory, and answering the queries from memory. The
average time is Tc = 0.069ms. T is measured by processing the same 100,000
queries (the first 10,000 queries are used to warm up the system). For each
query, we remove stop words if there are at least three remaining terms. The
stop words correspond to the terms with a frequency higher than the number
of documents in the index. We use a document-at-a-time approach to retrieve
documents containing all query terms. The only disk access required during
query processing is for reading compressed posting lists from the inverted file.
We perform both full and partial evaluation of answers, because some queries
are likely to retrieve a large number of documents, and only a fraction of the
retrieved documents will be seen by users. In the partial evaluation of queries,
we terminate the processing after matching 10,000 documents. The estimated
ratios TR are presented in Table II.
   Figure 13 shows, for a sample of queries, the workload of the system with
partial query evaluation and compressed posting lists. The x-axis corresponds
to the total time the system spends processing a particular query, and the verti-
cal axis corresponds to the sum Σ_{t∈q} fq(t) · fd(t). Notice that the total number of
postings of the query-terms does not necessarily provide an accurate estimate
of the workload imposed on the system by a query (which is the case for full
evaluation and uncompressed lists).
   The analysis of the previous section also applies to a distributed retrieval sys-
tem in one or multiple sites. Suppose that a document partitioned distributed
system is running on a cluster of machines interconnected through a Local Area
Network (LAN) in one site. The broker receives queries and broadcasts them

                            Table II. Ratios Between the Average Time to
                          Evaluate a Query and the Average Time to Return
                          Cached Answers (centralized and distributed case)
                         Centralized system      TR1      TR2     TR′1     TR′2
                         Full evaluation         233     1760      707     1140
                         Partial evaluation       99     1626      493      798
                         LAN system            TR1^L    TR2^L   TR′1^L   TR′2^L
                         Full evaluation         242     1769      716     1149
                         Partial evaluation      108     1635      502      807
                         WAN system            TR1^W    TR2^W   TR′1^W   TR′2^W
                         Full evaluation        5001     6528     5475     5908
                         Partial evaluation     4867     6394     5270     5575




           Fig. 13. Workload for partial query evaluation with compressed posting lists.


to the query processors, which answer the queries and return the results to
the broker. Finally, the broker merges the received answers and generates the
final set of answers (we assume that the time spent on merging results is neg-
ligible). The difference between the centralized architecture and the document
partition architecture is the extra communication between the broker and the
query processors. Using ICMP pings on a 100Mbps LAN, we have observed
that sending the query from the broker to the query processors, which send
an answer of 4,000 bytes back to the broker, takes on average 0.615ms. Hence
TR^L = TR + 0.615ms/0.069ms = TR + 9.
   In the case when the broker and the query processors are in different sites
connected through a Wide Area Network (WAN), we estimate that broadcasting
the query from the broker to the query processors, and getting back an answer
of 4,000 bytes, takes on average 329ms. Hence TR^W = TR + 329ms/0.069ms =
TR + 4768. We can see that TR2^W/TR1^W = 1.31 < TR2/TR1, suggesting that there
is greater benefit from storing answers for queries when the retrieval system
is distributed across a WAN, since the network communication dominates the
response time for such systems. We corroborate this observation next.




               Fig. 14. Optimal division of the cache memory in a server.

6.3 Simulation Results
We now address the problem of finding the optimal trade-off between caching
query answers and caching posting lists. To make the problem concrete, we
assume a fixed size M on the available memory, out of which x units are used
for caching query answers, and M − x for caching posting lists.
   We perform a simulation and compute the average response time as a func-
tion of x. Using a part of the query log as training data, we first allocate in the
cache the answers to the most frequent queries that fit in space x, and then
we use the rest of the memory to cache posting lists. For selecting posting lists,
we use the QTFDF algorithm, applied to the training query log but excluding
the queries that have already been cached.
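The simulation loop can be sketched as follows: for each candidate split x, the answer cache is filled with the most frequent training queries that fit, the remaining space is filled with QTFDF-selected posting lists, and the test queries are replayed with costs of 1, TR1, or TR2 time units. Sizes and helper structures are illustrative simplifications.

from collections import Counter

def simulate_split(train, test, fq, fd, M, answer_size, TR1, TR2):
    """Average response time (in units of one cached-answer lookup) per split."""
    results = {}
    step = max(1, M // 10)
    for x in range(0, M + 1, step):                    # memory devoted to answers
        answers = set(q for q, _ in Counter(train).most_common(x // answer_size))
        budget = M - len(answers) * answer_size        # remaining space for postings
        cached_terms, used = set(), 0
        for t in sorted(fq, key=lambda t: fq[t] / fd[t], reverse=True):
            if used + fd[t] <= budget:
                cached_terms.add(t)
                used += fd[t]
        cost = 0.0
        for q in test:
            if q in answers:
                cost += 1.0                            # answered from the cache
            elif all(t in cached_terms for t in q.split()):
                cost += TR1                            # evaluated from cached postings
            else:
                cost += TR2                            # at least one list read from disk
        results[x] = cost / len(test)
    return results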
   In Figure 14, we plot the simulated response time for a centralized system
as a function of x. For the uncompressed index, we use M = 1GB, and for the
compressed index we use M = 0.5GB, to make a fair comparison. In the case of
the configuration that uses partial query evaluation with compressed posting
lists, the lowest response time is achieved when 0.15GB out of the 0.5GB is
allocated for storing answers for queries. We obtained similar trends in the
results for the LAN setting.
   Figure 15 shows the simulated workload for a distributed system across a
WAN. In this case, the total amount of memory is split between the broker,
which holds the cached answers of queries, and the query processors, which
hold the cache of posting lists. According to the figure, the difference between
the configurations of the query processors is less important because the network
communication overhead increases the response time substantially. When us-
ing uncompressed posting lists, the optimal allocation of memory corresponds
to using approximately 70% of the memory for caching query answers. This is
explained by the fact that there is no need for network communication when
the query can be answered by the cache at the broker.

6.4 Experimenting with a Real System
We validate the results obtained from the simulation of the previous section
by running a real system, varying the amount of memory allocated for a cache
of query answers and for a cache of posting lists.

    Fig. 15. Optimal division of the cache memory when the next level requires WAN access.

Fig. 16. Average response time and throughput in a server for different splits of memory between
a cache of query answers and a cache of postings.

   Our system uses threads to
process queries in parallel. Each thread processes queries to completion,
independently of other threads. The posting lists are decompressed as needed
during query processing, which stops after matching 10,000 documents.
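The per-thread query loop can be summarized by the following sketch (not the actual system; the cache and index interfaces are assumptions):

    # A minimal sketch of the per-thread query loop just described: serve from the
    # cache of answers if possible, otherwise fetch each term's posting list from
    # the posting cache or from disk, decompress it, and evaluate until 10,000
    # documents have been matched. answer_cache, posting_cache, and index are
    # hypothetical interfaces, not the actual system's API.

    MAX_MATCHES = 10_000

    def process_query(query, answer_cache, posting_cache, index):
        if query in answer_cache:
            return answer_cache[query]                 # no list access needed
        postings = []
        for term in query.split():
            raw = posting_cache.get(term)
            if raw is None:
                raw = index.read_list_from_disk(term)  # disk access on a cache miss
            postings.append(index.decompress(raw))     # decompressed only as needed
        return index.evaluate(postings, max_matches=MAX_MATCHES)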
   For training, we use the same data as in the simulation. For testing, we use
the first 30,000 queries from the simulation, where the first 10,000 queries are
used to warm up the system. For each configuration, we run the system five
times and report the average response time in milliseconds, as well as the av-
erage throughput. The error bars correspond to the minimum and maximum
average response time and throughput obtained among all five runs. We vali-
date our simulation results from the previous section by running a system on
a different server with 2 dual-core processors at 2GHz and 6GB of RAM.
   Figure 16(a) shows the average response time on the y-axis and the amount
of memory allocated to the cache of query answers on the x-axis. A total of
0.5GB is split between the two caches: if, for example, we allocate 0.2GB to
the cache of query answers, then the remaining 0.3GB is used by the cache of
posting lists. Both curves, corresponding to a single-threaded and a two-threaded
system, follow the same trend shown in Figure 14.

          Table III. Average Response Time and Throughput for a System Without a
        Cache of Query Answers or Posting Lists, and for a System with Both Caches
                              and the Optimal Split of Memory
                                         Avg. Response Time (ms)     Throughput (q/s)
         1 thread/no cache                          22.63                  44.51
         1 thread/optimal trade-off                 12.35                  81.06
         2 threads/no cache                         21.61                  92.68
         2 threads/optimal trade-off                12.06                 167.37

   In the case of the single-threaded system, the optimal
allocation of memory corresponds to 0.15GB for the cache of query answers.
We also obtained the same result in our simulation described in the previous
section. It is important to note that even though we run the system on a faster
server than the one on which the parameters were estimated, the results of the
simulation remain valid.
   Figure 16(b) shows the throughput achieved in the case of the single-threaded
and two-threaded systems. We observe that throughput doubles when using
two threads because the two server cores serve queries simultaneously. Having
multiple cores, however, is not sufficient to have throughput increasing linearly,
since there are other system resources the threads share, such as the disk.
Although we have not investigated this issue in depth, in our interpretation is
that this happens because the two threads overlap minimally in the use of the
disk. Such a minimal overlap is possible because of the large number of hits,
which produce fewer accesses to disk and spread them over time. Reducing the
number of accesses and spreading them over time results in a small probability
of overlap. We note that in both cases, the optimal allocation of memory is also
at 0.15GB for the cache of query answers.
   One important observation is that the optimal trade-off between the cache
of query answers and that of posting lists significantly increases the capacity of
the system, compared to a system that does not use cache. Table III shows the
average response time and throughput of the system without cache of query
answers or posting lists, and of the system with both caches and the optimal
memory allocation. The optimal trade-off results in at least a 44% reduction in
average response time and an 80% increase in throughput.
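These figures are internally consistent: with each thread handling one query at a time, throughput is roughly the number of threads divided by the average response time, as the small check below illustrates.

    # A quick arithmetic check of Table III: with one query in flight per thread,
    # throughput is roughly (number of threads) / (average response time), and the
    # reported gains follow directly from the measured response times.

    measured = {
        ("1 thread", "no cache"):           (22.63, 44.51),
        ("1 thread", "optimal trade-off"):  (12.35, 81.06),
        ("2 threads", "no cache"):          (21.61, 92.68),
        ("2 threads", "optimal trade-off"): (12.06, 167.37),
    }
    for (threads, cfg), (rt_ms, qps) in measured.items():
        n = 1 if threads.startswith("1") else 2
        print(f"{threads}, {cfg}: predicted ~{1000 * n / rt_ms:.0f} q/s, measured {qps}")

    print(f"response-time reduction: {(22.63 - 12.35) / 22.63:.0%} (1 thread), "
          f"{(21.61 - 12.06) / 21.61:.0%} (2 threads)")    # at least 44%
    print(f"throughput increase: {81.06 / 44.51 - 1:.0%} (1 thread), "
          f"{167.37 / 92.68 - 1:.0%} (2 threads)")         # about 80%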

7. EFFECT OF THE QUERY DYNAMICS
Since the queries in the incoming traffic follow a power-law distribution, some of
the most frequent queries remain frequent even after some period of time. The
topics on which queries are submitted, however, vary over time and might invalidate
the static cache built so far. Hence, we assess the impact of time on the validity of
the trained model by studying the statistical characteristics of the query stream,
and we show that there is little variation in hit rate over sufficiently long periods
of time.

7.1 Stability of Static Caching of Answers
For our query log, the query distribution and query-term distribution change
slowly over time. To support this claim, we first assess how query topics change
over time.

Fig. 17. The distribution of queries in the first week of June, 2006 (upper curve) compared to the
distribution of new queries in the remainder of 2006.

   Figure 17 shows a comparison of the distribution of queries from the
first week in June 2006, to the distribution of queries for the remainder of 2006
that did not appear in the first week in June. The x-axis shows the rank of the
query frequency, normalized on a log scale. The y-axis shows the frequency of a
given query. We found that a very small percentage of queries are new queries,
and the highest frequency among the new queries is more than two orders of
magnitude smaller than the highest frequency in the first week. In fact,
the majority of queries that appear in a given week repeat in the following
weeks for the next six months.
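The comparison behind Figure 17 can be reproduced along the following lines (a sketch; the file names and log format are assumptions):

    # Sketch of the comparison behind Figure 17: rank/frequency statistics for
    # queries of the first week of June versus queries from the rest of 2006 that
    # never appeared in that week. The input files are hypothetical.

    from collections import Counter

    week1 = Counter(line.strip() for line in open("queries_week1_june2006.txt"))
    rest  = Counter(line.strip() for line in open("queries_rest_of_2006.txt"))
    new_only = Counter({q: f for q, f in rest.items() if q not in week1})

    print("share of rest-of-year query volume that is new:",
          sum(new_only.values()) / sum(rest.values()))
    print("highest frequency, first week:", week1.most_common(1))
    print("highest frequency, new queries:", new_only.most_common(1))
    # The text reports that the latter is more than two orders of magnitude smaller.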
   We assess the stability of the hit rate by considering a test period of three
months, and we compare the effect of the training duration by building the
static cache from one and from two weeks of queries. Figure 18 shows the
results. The hit rate is stable for both training durations over all three
months tested, although it is consistently lower for the static cache trained
on a single week. The peaks in the graph correspond to night-time periods, when
the hit rate is highest; the lowest values occur during the day, often around
2–3pm.
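A curve such as the one in Figure 18 can be produced along the following lines (a sketch; the log format is an assumption), using the 128,000 most frequent training queries and 6-hour buckets:

    # Sketch of the hit-rate measurement: the static cache holds the 128,000 most
    # frequent queries of the training window, and the hit rate is reported per
    # 6-hour bucket of the test period. test_stream yields (timestamp, query) pairs.

    from collections import Counter

    def static_answer_cache(train_queries, size=128_000):
        return {q for q, _ in Counter(train_queries).most_common(size)}

    def hit_rate_per_bucket(cache, test_stream, bucket_seconds=6 * 3600):
        buckets = {}
        for ts, query in test_stream:
            b = int(ts // bucket_seconds)
            hits, total = buckets.get(b, (0, 0))
            buckets[b] = (hits + (query in cache), total + 1)
        return [hits / total for b, (hits, total) in sorted(buckets.items())]

    # cache = static_answer_cache(train_queries)
    # rates = hit_rate_per_bucket(cache, test_stream)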

7.2 Stability of Static Caching of Posting Lists
The static cache of posting lists can be recomputed periodically. To estimate
how often its contents need to be recomputed, we have to consider an
efficiency/quality trade-off: recomputing the cache too frequently might be
prohibitively expensive, while recomputing it too infrequently might leave an
obsolete cache that no longer reflects the statistical characteristics of the
current query stream.
   We measure the effect on the QTFDF algorithm of the changes in a 15-week
query stream (Figure 19(a)). We compute the query-term frequencies over the
whole stream, select which terms to cache, and then compute the hit rate on
the whole query stream.

Fig. 18. Hit rate trend of a static cache of 128,000 results containing the most frequent queries
extracted from one and two weeks before the test period. Hit rate values correspond to periods of
six hours.

          Fig. 19. Impact of distribution changes on the static caching of posting lists.

This hit rate is an upper bound, and it assumes
perfect knowledge of the query term frequencies. To simulate a realistic sce-
nario, we use the first 6 (3) weeks of the query stream for computing query
term frequencies, and the following 9 (12) weeks to estimate the hit rate. As
Figure 19(a) shows, the hit rate decreases by less than 2%. We repeated the
same experiment for the QTF algorithm, and the decrease in hit rate was less
than 0.2%.
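The following sketch outlines this stability measurement, reusing the fill_posting_cache sketch from Section 6.3; the week-by-week query lists, document frequencies, and list sizes are assumed inputs:

    # Sketch of the stability experiment: terms are selected either with perfect
    # knowledge of the whole 15-week stream, or from the first weeks only, and the
    # resulting hit rates are compared. fill_posting_cache is the sketch from
    # Section 6.3; weeks is a list of per-week query lists.

    def term_hit_rate(cached_terms, queries):
        terms = [t for q in queries for t in q.split()]
        return sum(t in cached_terms for t in terms) / len(terms)

    def stability(weeks, train_weeks, budget, df, list_size):
        whole = [q for w in weeks for q in w]
        train = [q for w in weeks[:train_weeks] for q in w]
        tail  = [q for w in weeks[train_weeks:] for q in w]
        upper = fill_posting_cache(whole, set(), df, list_size, budget)  # perfect knowledge
        real  = fill_posting_cache(train, set(), df, list_size, budget)  # first weeks only
        return term_hit_rate(upper, whole), term_hit_rate(real, tail)

    # e.g. stability(weeks, 6, budget, df, list_size) for the 6/9-week split,
    #      stability(weeks, 3, budget, df, list_size) for the 3/12-week split.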
   The high correlation among the query term frequencies during different time
periods explains the small changes in hit rate as time elapses. Indeed, the
pairwise correlation among all possible 3-week periods of the 15-week query
stream is over 99.5%.
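This correlation can be computed along the following lines (a sketch; splitting the stream into disjoint 3-week periods is an assumption about the exact setup):

    # Sketch of the correlation check: Pearson correlation between the query-term
    # frequency vectors of every pair of 3-week periods, over the union of their
    # vocabularies. The construction of the periods from the log is assumed.

    from collections import Counter
    from math import sqrt

    def term_freqs(queries):
        return Counter(t for q in queries for t in q.split())

    def pearson(f1, f2):
        vocab = sorted(set(f1) | set(f2))
        x = [f1.get(t, 0) for t in vocab]
        y = [f2.get(t, 0) for t in vocab]
        n = len(vocab)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
        return cov / norm

    # periods = [queries_of_period_i for i in range(5)]     # built from the query log
    # print(min(pearson(term_freqs(p1), term_freqs(p2))
    #           for i, p1 in enumerate(periods) for p2 in periods[i + 1:]))  # > 0.995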
   A similar result, shown in Figure 19(b), was obtained for the Chile dataset,
using one month of the query log for training and three months for testing. In
this case, however, the degradation in quality is greater due to the longer
testing period.

8. CONCLUSIONS
Caching is an effective technique in search engines for improving response time,
reducing the load on query processors, and improving network bandwidth uti-
lization. The results we presented in this article consider both dynamic and
static caching. According to our results, dynamic caching of queries has limited
effectiveness due to the high number of compulsory misses caused by the num-
ber of unique or infrequent queries. In our UK log, the minimum miss rate is
50% using a working set strategy. Caching terms is more effective with respect
to miss rate, achieving values as low as 12%. We also propose a new algorithm
for static caching of posting lists that outperforms previous static caching al-
gorithms as well as dynamic algorithms such as LRU and LFU, obtaining hit
rate values that are over 10% higher compared with these strategies.
   As one of our main contributions, we present a framework for the analysis
of the trade-off between caching query results and caching posting lists. In
particular, we use this framework to evaluate whether compression in posting-list
caching is worthwhile. Keeping compressed postings allows for accommodating
a greater number of entries at the cost of a greater query evaluation time. We
show in the experimental analysis that compression is always better. To the best
of our knowledge, this is the first work considering compression inside cache
entries. We plan to extend this further by evaluating the impact of different
encoding schemes on the performance of posting list caching. We also show
that partitioning the available cache into a static and a dynamic part improves
the cache performance for caching posting lists. We use simulation as well as a
real system to evaluate different types of architectures. Our results show that
for centralized and LAN environments, there is an optimal allocation of memory
between caching query results and caching posting lists, while for WAN scenarios
in which network latency prevails, it is more important to cache query results. We leave
to future work query processing algorithms that better integrate with caching,
improved algorithms for caching posting lists, and a study of the consequences
of the results in a production system.

Received December 2007; revised July 2008; accepted August 2008



