Web-Conscious Storage Management for Web Proxies
Evangelos P. Markatos, Dionisios N. Pnevmatikatos, Member, IEEE, Michail D. Flouris, and Manolis G. H. Katevenis, Member, IEEE
IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 10, NO. 6, DECEMBER 2002

Outline
1. Introduction
2. WebCoSM
3. Simulation-Based Evaluation
4. Implementation (Foxy)
5. Conclusion

Introduction
Fig. 1. Typical web proxy action sequence.

Introduction (Cont.)
WWW proxies are increasingly used to provide Internet access to users behind a firewall and to reduce wide-area network traffic by caching frequently used URLs.
However, many proxy servers often fail to provide the available bandwidth to the proxy process.

Introduction (Cont.)
The authors study the overheads associated with file I/O for web proxies and propose Web-Conscious Storage Management (WebCoSM), a set of techniques to overcome file I/O limitations.

Overheads associated with disk I/O
Storing each URL in a separate file.
    Aggregate several URLs per file.
Disk head movements due to file write requests in widely scattered disk sectors.
    File space allocation algorithm.
URL read operations.
    Cluster several read operations together and reorganize the layout of the URLs on the magnetic disk.

WebCoSM
The file system of a web proxy will not be able to keep up with the proxy’s Internet requests due to the mismatch between the storage requirements of the web proxy and the storage guarantees provided by the file system.
We address this performance mismatch in two ways: meta-data overhead reduction and data-locality exploitation.

Meta-data overhead reduction
Most of the meta-data overhead that cripples web proxy performance can be traced to the storage of each URL in a separate file.
To eliminate this performance bottleneck, we propose a novel URL-grouping method (called BUDDY), in which we store all the URLs in a small number of files.
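
The URL-grouping idea can be illustrated with a minimal Python sketch. It assumes what the BUDDY slides state: URLs are grouped by size, each file consists of fixed-size slots, and a new URL goes into the first available slot of the appropriate file. The 512-byte size granularity, the in-memory lists standing in for files, and the names BuddyStore, size_class, put, get, and delete are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional, Tuple

SLOT_UNIT = 512  # hypothetical size granularity in bytes (an assumption)


class BuddyStore:
    """Groups cached objects by size class; each class is one 'file' of fixed-size slots."""

    def __init__(self) -> None:
        # size class -> list of slots; None marks a free slot
        self.files: Dict[int, List[Optional[bytes]]] = {}
        # URL -> (size class, slot index)
        self.index: Dict[str, Tuple[int, int]] = {}

    @staticmethod
    def size_class(nbytes: int) -> int:
        # Round up to a whole number of SLOT_UNIT blocks (at least one).
        return max(1, -(-nbytes // SLOT_UNIT))

    def put(self, url: str, body: bytes) -> Tuple[int, int]:
        if url in self.index:
            self.delete(url)
        cls = self.size_class(len(body))
        slots = self.files.setdefault(cls, [])
        try:
            slot = slots.index(None)  # first available slot
            slots[slot] = body
        except ValueError:
            slots.append(body)  # no free slot: grow the file by one slot
            slot = len(slots) - 1
        self.index[url] = (cls, slot)
        return cls, slot

    def get(self, url: str) -> Optional[bytes]:
        cls, slot = self.index[url]
        return self.files[cls][slot]

    def delete(self, url: str) -> None:
        # Deleting a URL only frees its slot; no file is created or removed.
        cls, slot = self.index.pop(url)
        self.files[cls][slot] = None
```

In this sketch, storing or deleting a URL never creates or removes a file; it only fills or frees a slot, which is the source of the meta-data savings the slides attribute to BUDDY.
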
BUDDY (URL-grouping Method)
To simplify space management, we use the URL size as the grouping criterion.
Each file is composed of fixed-size slots, where each slot is large enough to contain a URL.
Each new URL is stored in the first available slot of the appropriate file.

BUDDY (Cont.)
The main advantage of BUDDY is that it practically eliminates the overhead of file creation/deletion operations by storing potentially thousands of URLs per file.
Although BUDDY reduces the file management overhead by avoiding file creations and deletions, it makes no special effort to lay out data intelligently on the disk so as to improve write or read performance.

Data-Locality Exploitation
A significant amount of locality exists in the URL reference streams. Identifying and exploiting this locality can result in large performance improvements.

1. Optimizing Write Throughput
Instead of writing new data in some free space on the disk, we continuously append data to the disk until we reach the end of the disk, in which case we continue from the beginning.
STREAM: file-space management algorithm.

STREAM
URL-write operations continue appending data to the file until the end of the file is reached, at which point new URL-write operations continue from the beginning of the file, writing into free slots.
URL-delete operations mark the space currently occupied by the URL as free, so that it can later be reused by future URL-write operations.

STREAM (Cont.)
Note that STREAM stores all data in a single file, while BUDDY stores data in more than one file; therefore STREAM and BUDDY are incompatible, but STREAM subsumes the functionality (and the benefits) of BUDDY.

2. Preserving the Locality of the URL Stream
Web objects requested contiguously by a single client may be serviced and stored in the proxy’s disk subsystem interleaved with web objects requested by totally unrelated clients.
To recover the lost locality, we augmented the STREAM technique with an extra level of buffers, called locality buffers, between the proxy server and the file system.

STREAM-PACK
Small-write problem: writing a small amount of data to the file system usually results in both a disk-read and a disk-write operation.
The reason for this peculiar behavior: if a process writes a small amount of data to a file, the OS will read the corresponding page from the disk (if it is not already in the main-memory file buffer cache), perform the write in the main-memory page, and then, at a later time, write the entire updated page to the disk.

STREAM-PACK (Cont.)
Add a one-page-long packetizer buffer.
Once the packetizer fills up, or if the current request is not contiguous to the previous one, the packetizer is sent to the file system to be written to the disk.

STREAM-PACK-LOC
Grouping requests according to their origin web server before storing them to the disk.
Fig. 2. Streaming into locality buffers.

Simulation-Based Evaluation
Fig. 3. Evaluation methodology.

Meta-Data Overhead Reduction Evaluation
Fig. 4. File management overhead for web proxies.

Optimizing Write Throughput
1) Streaming Write Throughput
Fig. 5. Performance of BUDDY and STREAM.

Optimizing Write Throughput (Cont.)
2) Achieving Maximum Write Bandwidth
Fig. 6. Performance of STREAM and STREAM-PACK.

Optimizing Write Throughput (Cont.)
3) Latency Issues:
Fig. 7. Cumulative distribution of (a) URL-write and (b) URL-read operation latency for STREAM-PACK and SQUID.

Preserving the Locality of the URL Stream
1) Performance Evaluation of LOCALITY BUFFERS:
Fig. 8. Average size of contiguous free disk blocks as a function of time.

Preserving the Locality of the URL Stream (cont.)
Locality buffers not only cluster the free space more effectively, they also populate the allocated space with clusters of related documents by gathering
Fig. 9. Cumulative distribution of distances between read requests.
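
The write path described in the STREAM-PACK and STREAM-PACK-LOC slides (locality buffers grouped by origin web server, feeding a one-page packetizer that is written out when it fills up or when a write is not contiguous with the previous one) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the 4096-byte page size, the per-host buffer capacity, the rule that a locality buffer is flushed when it reaches that capacity, the list standing in for the disk, and the names Packetizer and LocalityBuffers are all hypothetical.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlsplit

PAGE_SIZE = 4096  # hypothetical page size in bytes (an assumption)


class Packetizer:
    """Collects small contiguous writes into page-sized writes."""

    def __init__(self, disk_writes: List[Tuple[int, bytes]]) -> None:
        self.disk_writes = disk_writes  # stand-in for the file system
        self.start = 0                  # file offset of the first buffered byte
        self.buf = bytearray()

    def write(self, offset: int, data: bytes) -> None:
        # A write that is not contiguous with the buffered data forces a flush.
        if self.buf and offset != self.start + len(self.buf):
            self.flush()
        if not self.buf:
            self.start = offset
        self.buf += data
        # Emit every full page; keep the remainder buffered.
        while len(self.buf) >= PAGE_SIZE:
            self.disk_writes.append((self.start, bytes(self.buf[:PAGE_SIZE])))
            del self.buf[:PAGE_SIZE]
            self.start += PAGE_SIZE

    def flush(self) -> None:
        if self.buf:
            self.disk_writes.append((self.start, bytes(self.buf)))
            self.start += len(self.buf)
            self.buf.clear()


class LocalityBuffers:
    """Groups objects by origin server before appending them to the stream."""

    def __init__(self, packetizer: Packetizer, capacity: int = PAGE_SIZE) -> None:
        self.packetizer = packetizer
        self.capacity = capacity        # assumed flush threshold per host
        self.buffers: Dict[str, bytearray] = {}
        self.append_offset = 0          # next append position in the stream file

    def store(self, url: str, body: bytes) -> None:
        host = urlsplit(url).netloc
        buf = self.buffers.setdefault(host, bytearray())
        buf += body
        if len(buf) >= self.capacity:
            self._flush(host)

    def _flush(self, host: str) -> None:
        buf = self.buffers.pop(host)
        self.packetizer.write(self.append_offset, bytes(buf))
        self.append_offset += len(buf)

    def flush_all(self) -> None:
        for host in list(self.buffers):
            self._flush(host)
        self.packetizer.flush()
```

With a small capacity, objects that arrive interleaved from two servers end up stored as one run per server, and the packetizer turns the resulting contiguous appends into a single write.
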
Preserving the Locality of the URL Stream (cont.)
2) Latency Issues:
Fig. 10. Cumulative distribution of latency of URL-read operations for SQUID, STREAM-PACK, and STREAM-PACK-LOC.

Performance Evaluation Summary
Table 1. Performance of traditional and WebCoSM techniques.

Implementation Result

Conclusion
Methods like WebCoSM that reduce disk head movements and stream data to disk will result in increasingly larger performance improvements.
Furthermore, web-conscious storage management methods will not only result in better performance but also help to expose areas for further research in discovering and exploiting the locality in the Web.