Distributed Proxy Server for Enhanced Web Information Retrieval and Security
Contents

1. Introduction
2. Related Contemporaries
3. System Design
   3.1 Browsing Using the Proxy
   3.2 Browser Protection
   3.3 Indexing and Web Crawling
       3.3.1 Index Crawlers
       3.3.2 Index Database
   3.4 Specialized Searches
4. Ranking Algorithms
   4.1 Word Occurrence Ranking
   4.2 Relevance Ranking
5. Network Monitoring
6. Experimental Results
   6.1 Testing Environment Configuration
   6.2 Traffic Analysis
   6.3 Indexing and Querying
   6.4 Ranking Algorithms
7. Conclusion
8. References
ABSTRACT: This paper addresses information retrieval from the Internet. Usually, a proxy server is responsible for redirecting client requests and fetching information from web servers. This paper focuses on how the cache space of a proxy server can be increased by making it distributed, and on how website search can be made more efficient by making the index server distributed as well. As a result of these ideas, the bandwidth of the network would be utilized more effectively.
1. Introduction:

In an industry where the presence of information spells advantage and its dearth predicts inconvenience, the distribution and dispersion of this vital asset play an indispensable role. The basic problem is not the absence of knowledge but its unavailability. The explosion in knowledge-producing domains has buried the required information under vast expanses of data. Such growth was possible due to the creation of a common platform through which a wide variety of systems access data uniformly. A testimonial of this scenario is the Internet, which plays host to innumerable data entities but makes the actual retrieval of data cumbersome. Further, the time spent by one system searching for data cannot be easily utilized by others with similar goals. Thus, precious network bandwidth and processor time are wasted on chores already performed by other users.

Another notable feature of the information access and search pattern is the bursty transfer of data, a characteristic of a system where the user is browsing for information. Data transfer occurs whenever the user requests information, and it comes to a sudden halt when the user pauses to read, leaving the bandwidth under-utilized. Even the use of multiple browse requests cannot utilize the communication channel completely, as an average user finds it difficult to switch contexts during long prose or intervals of typing. As a solution to these diverse yet related problems, we present a mechanism that extends the existing system by integrating easily with contemporary technologies. The system resembles a distributed computer system with clusters working cooperatively, each standing to benefit from the others' resources.
2. Related Contemporaries:
Present-day information retrieval techniques range from enlisting the help of UseNet groups that announce the appearance of new sites, thus acting as unofficial directories for domains, to fully searchable text databases and personal favorite lists. Some of the latest techniques also include the use of search engines built as index servers for keyword querying. Some of these engines build their indexes manually, as in the case of Yahoo! and http://galaxy.einet.net. Another mutation of the index server allows users to add links that can then be used by others for similar queries. Yet another variation is the growing variety of indexing web-bots that have web crawlers moving all over the web, gathering information for users by way of links. The most successful of these combine both approaches, allowing both automatic updating and manual weeding out of pesky links. Projects in this category include the Global On-line Directory (GOLD) and the Archie-like web server (ALIWEB). The first step toward distributing these indexes could be marked as the Distributedly Administered Categorical List of Documents (DACLOD). This system was, however, closed to a list of documents on a kind of virtual private network and did not have specific index updates.
3. System Design:
The system we propose is an unprecedented extension and culmination of the technologies mentioned above, improving the efficiency and response ratio of the entire system. It is a derivative of the distributed class of computing systems that accords special emphasis to the security and integrity of the information it handles. The system we implement to address these issues is a variant of a proxy server, available on the local machine, through which all data is transferred. The proxy acts as a gateway performing four distinct functions, as detailed in the following sections. The proxy can work in tandem with existing browsers and web servers, without attempting any changes to the existing protocols.

3.1 Browsing using the proxy:

The settings of the browser are changed to indicate the presence of a proxy server that runs at the default HTTP port 80. The port is specifically chosen to allow the local machine to act as a web server for the other members of the cluster. This allows faster access to web pages that are cached at the machine during the crawling and indexing process (the indexing process is detailed in later sections). As a result of this indexing, pages may already be available on the machines of the local cluster, making access faster. The proxy forwards the request for a page to the list of peer proxies first, to check for availability, and downloads it from a peer if found. This mechanism leads to a distributed proxy system that makes better collective use of the cache space. Further, pages downloaded by one machine quickly become available to the others participating in the cluster. This is convenient in practical cases where the cluster belongs to a single organization and browsing habits are fairly uniform.

The main function of the proxy is to redirect browse requests to the appropriate servers and retrieve the pages returned. As with a usual proxy, it caches the pages to make their retrieval faster. One enhancement is to start additional threads to download content upon encountering special tags like <img>, <frame>, etc. The browser usually asks for the main page and then follows with requests for images, frames, or objects. As the proxy encounters these tags, it spawns new threads to download the image and frame contents even as the page is being transferred. Thus, data is not transferred serially; all requests occur in parallel, which is the basic working mechanism of all download accelerators. The speed advantage may not be very visible on limited-bandwidth systems, as the protocol stack may be the only area to benefit (and it is usually fast enough for a commoner not to notice) on fixed-bandwidth systems. In the case of systems connected through Network Translation Servers (NTS, a stop-gap solution before IPv6 is implemented on all systems) found in large organizations, the difference is more pronounced. With parallel requests, the scenario resembles browsing a single page with multiple browsers: the NTS allocates more slots, or at least fills up all the slots reserved for the machine. The experimental results presented later show a marked increase in bandwidth utilization using the proxy as a platform.
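As a minimal sketch of this parallel download behaviour (Python is used here purely for illustration; the regular expression and the function names are assumptions, not part of the original design), the proxy can hand each embedded resource to a worker thread as its tag is encountered:

    import re
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor
    from urllib.parse import urljoin

    # Tags whose sources the proxy prefetches while the main page still streams.
    SRC_ATTR = re.compile(r'<(?:img|frame|iframe|script)[^>]*\bsrc="([^"]+)"', re.I)

    def fetch(url: str) -> bytes:
        """Download one embedded resource; a real proxy would place it in the cache."""
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read()

    def prefetch_embedded(base_url: str, html: str, pool: ThreadPoolExecutor) -> None:
        """Spawn a download per <img>/<frame> reference as the tag is encountered,
        so the resources arrive in parallel with the main page transfer."""
        for match in SRC_ATTR.finditer(html):
            pool.submit(fetch, urljoin(base_url, match.group(1)))

    # The proxy owns one shared, bounded worker pool for all prefetch threads.
    pool = ThreadPoolExecutor(max_workers=8)

A bounded pool keeps the number of simultaneous downloads in check, so prefetching does not crowd out the user's own requests.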
In the event of the machine acting as a web server, the port of the crawler must be mapped to a random port number; however, all requests for searches from the cluster groups arrive at port 80. To avoid breaking the existing system of searches, the server can host a dynamic page, such as CGI, JSP, or ASP content, whose primary function is limited to retrieving data from the index database. The proxy itself stays on a different port, specified in the browser, and all page requests are routed through it.

Another attribute of the browsing fraternity is the tendency to follow the links available on a web page. Analysis of the pattern displays abrupt sessions of transmission and inactivity that occur at times of requests and reading respectively. This idle time can be utilized to download some of the links that have a greater probability of being clicked, decided by a weighting system that places special emphasis on link words like next, go, link, index, etc. The process is given the minimum priority to avoid any hindrance to normal browsing. Such pages are also forwarded to the indexing systems to reduce the burden on the crawlers, and they can serve as starting points for the crawlers to perform indexing.

3.2 Browser Protection:

Since all data passes through the proxy, it serves as the right place for installing a security service to prevent malicious scripts and block illegal material. The proxy may well be installed as a service that cannot be terminated except by the administrator. Keeping in tune with the rising popularity of personal firewalls, the proxy may also be set behind the firewall to protect the system. The proxy may further be extended to act as a direct guardian of the full network service as part of the Service Provider Interface (SPI) that provides API functions to the upper socket layers. Analysis of the data transferred through the ports allows the detection of illegal activity by any Trojans installed on the system. Even if a Trojan encapsulates sensitive data in HTTP format to avoid detection, the transfer is logged by the proxy and the intruder detected at a later time. The proxy may also act as nanny software, with the required policies set directly in its configuration. This method has increased relevance in today's distributed environment, with protocol stacks and applications rapidly adopting the Simple Object Access Protocol (SOAP) and the eXtensible Markup Language (XML) over HTTP for various frameworks, including the evolving .NET platform. Monitoring the HTTP ports for such data intrusions becomes all the more appropriate. The intrusion detection model is detailed in a later section.

3.3 Indexing and Web Crawling:

Information is obtained by sending a query composed of a set of words connected with logical operators. Blank spaces are interpreted as the AND operator, and short frequent words like 'the', 'is', etc. are neglected unless specified. Phrased queries require an exact match to be found in the web pages returned by the index server as the result of the query. The result returned is a list of all the pages in the index that have specific relevance to the words in the query, displayed in a specific ranking order. All search queries are sent as a POST message to a web page. This methodology also fits the scenario of a machine being a web server.
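As a rough sketch of the query interpretation just described (Python; the stop-word list and the function names are assumptions made here for illustration), blanks become AND operators, short frequent words are dropped, and quoted phrases demand an exact match:

    import re

    STOPWORDS = {"the", "is", "a", "an", "of", "to"}  # short frequent words, neglected

    def parse_query(query: str):
        """Split a query into quoted phrases (exact match required) and plain
        words, which are implicitly AND-ed after stop-word removal."""
        phrases = re.findall(r'"([^"]+)"', query)
        remainder = re.sub(r'"[^"]*"', " ", query)
        words = [w.lower() for w in remainder.split() if w.lower() not in STOPWORDS]
        return words, phrases

    def matches(page_text: str, words, phrases) -> bool:
        """A page satisfies the query only if every word and phrase occurs in it."""
        text = page_text.lower()
        return all(w in text for w in words) and all(p.lower() in text for p in phrases)

    print(parse_query('distributed "proxy server" index'))
    # (['distributed', 'index'], ['proxy server'])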
In the case of the system we propose, the index server is spread over a number of peer systems that form the cluster group. A collaborative search is made to locate the links matching a query. A notable point is that the proposed system is not a replacement for the existing search systems but a complement, intended to speed up information retrieval. To access information not previously found, an existing external index can still be used. The system tends to learn as the user browses through the pages.
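A minimal sketch of such a collaborative search is given below (Python; the peer host names, the /search path, and the JSON result format are all assumptions for illustration, since the paper does not fix a wire format). The query is fanned out to every peer's search page in parallel and the results merged by rank:

    import json
    import urllib.parse
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor, as_completed

    PEERS = ["peer1.local", "peer2.local"]  # hypothetical cluster members

    def ask_peer(host: str, query: str):
        """POST the query to the peer's search page on port 80 and parse the
        JSON list of (url, rank) pairs it is assumed to return."""
        data = urllib.parse.urlencode({"q": query}).encode()
        with urllib.request.urlopen(f"http://{host}:80/search", data, timeout=5) as r:
            return json.load(r)

    def collaborative_search(query: str):
        """Fan the query out to every peer in parallel and merge by rank."""
        results = []
        with ThreadPoolExecutor(max_workers=len(PEERS)) as pool:
            futures = [pool.submit(ask_peer, h, query) for h in PEERS]
            for f in as_completed(futures):
                try:
                    results.extend(f.result())
                except OSError:
                    pass  # an unreachable peer simply contributes nothing
        return sorted(results, key=lambda hit: hit[1], reverse=True)

Because unreachable peers are skipped rather than retried, the cluster search degrades gracefully when machines leave the group.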
3.3.1 Index Crawlers

Index crawlers are a special class of WWW browsers that follow hyperlinks automatically. As the proxy downloads web pages, the hyperlinks that seem probable candidates for indexing are queued for crawling. Crawling is the process of traversing the page looking for index words and hyperlinks. During periods of inactivity, these links are traversed in a breadth-first manner (a direct consequence of queuing) to download the pages linked to them. As the probability of the user clicking one of the hyperlinks on the page is high, the linked pages may even be stored in a cache. This yields an intelligent cache holding the files the user may wish to browse next. The cache is on the hard disk and is faster than fetching the information from the web. It also utilizes the space that lies vacant on the disk by storing the indices and cached files at locations marked as temporary files; if other programs need this space, the files are forfeited. This ensures that the entire hard disk is better utilized. The whole arrangement runs on an array of computers that form the computing cluster, feeding the browser with an alternative cache that benefits from the files used by other users. The files to be deleted are selected by a least-recently-used mechanism.

The index crawler implemented is an integral part of the proxy. It connects directly to HTTP port 80 of remote machines and to the known ports specified by the machines in its cluster. Before any crawl activity, the crawler sends a request to the cluster for the availability of the page; if the page is unavailable in the cluster, it performs its usual activity, and if the page is found, it is de-queued.

3.3.2 Index Database

The index database holds the list of all the keywords and the page IDs. The database may itself provide mechanisms for indexing and clustering the data populated as hypertext. An inverted index may also be helpful, allowing quick access to the files required for indexing. A peculiar issue that may arise is the validity period of the entries in the database: the pages mentioned in the database are liable to change without notification. Hence, a HEAD sweep is performed once in a while to weed out dead links. The HEAD sweep issues the HTTP HEAD command to check the modification date of a page without downloading the entire page. Although this database is not directly related to the databases on the other machines of the cluster, redundancy can be avoided by adding a simple patch to the crawler: when a page is found on another machine of the cluster, the crawl process for it is halted. A sketch of such a schema, together with the crawl queue and the HEAD sweep, is given below.
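The following Python fragment pairs an illustrative two-table inverted-index schema (the concrete table and column names are assumptions, since the paper's schema figure is not reproduced here) with a FIFO crawl queue and a HEAD-based freshness check:

    import sqlite3
    import urllib.request
    from collections import deque

    # Illustrative inverted-index schema: keywords mapped to page ids.
    db = sqlite3.connect("index.db")
    db.executescript("""
    CREATE TABLE IF NOT EXISTS pages(
        page_id   INTEGER PRIMARY KEY,
        url       TEXT UNIQUE,
        last_seen TEXT);
    CREATE TABLE IF NOT EXISTS postings(
        keyword   TEXT,
        page_id   INTEGER REFERENCES pages(page_id),
        count     INTEGER,
        PRIMARY KEY (keyword, page_id));
    """)

    frontier = deque()  # FIFO queue of hyperlinks -> breadth-first crawl order

    def enqueue(url: str) -> None:
        """Hyperlinks harvested by the proxy are queued for idle-time crawling."""
        frontier.append(url)

    def head_last_modified(url: str):
        """HEAD sweep: read a page's modification date without its body.
        A failing request marks the entry as a dead link to be weeded out."""
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.headers.get("Last-Modified")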
A mechanism of hashing all the index entries is used for data retrieval in the case of large indices. The complete system of databases can be modeled as a type of distributed system; the only difference is that the database tables are not directly connected to each other, the communication in the form of queries being the sole connection. If a better distributed DBMS were available, the individual databases could be directly integrated with the system and all queries routed in a similar way. This scenario may, however, slow down the crawl process, as the data is not fragmented, and defragmentation in a distributed DBMS eats up vital time. The searches are done as usual using languages like SQL or QBE, sorting the results based on the ranks assigned to the individual entries. The suggested procedures for calculating the ranks are detailed in a later section.

3.4 Specialized Searches

Web pages developed in recent times make extensive use of meta information to aid searching procedures. Meta data is also useful in searching resources like images and objects, which depend on their tags and <alt> information. Sites that use XML to provide a summary of their contents, like those of newspapers, can be searched contextually using this meta-data. This meta-information is also used to denote relationships between pages: even if a query word is not present in a page, the page may still be relevant to the result list. The proxy incorporates all these activities in the search process, and better ranking is allotted to information pages that have meta-data associated with them (a harvesting sketch is given at the end of this section).

Some information may be considered sensitive or confidential and classified as private, and the user may not want to share the data, or his history of browsing, with other users. In such cases, the user may remove certain pages from the cache. To protect the privacy of the individuals in the group, the proxy shall allow access only to the results of queries rather than to the database contents; thus, the pages actually cached may not be directly available. To prevent any disclosure at all, the proxy can also be programmed not to respond to requests from the cluster machines.

Apart from the crawlers indexing the pages, an option is also available for web creators to register their pages to be listed. Such pages populate the index table indirectly; their ranks are static at the start, and they gain rank as they start getting referred to by other pages. The users may also rank the information directly at the servers. In this case the pages are not downloaded until requested, and only the database is updated.
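As an illustration of harvesting this meta-information (a sketch; the class name and the choice of attributes are assumptions), the standard-library HTML parser can collect <meta> keywords and <img alt> text for the index:

    from html.parser import HTMLParser

    class MetaHarvester(HTMLParser):
        """Collect <meta> keywords/description and <img alt> text so that
        images and summarized content become searchable."""
        def __init__(self):
            super().__init__()
            self.terms = []
        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "meta" and a.get("name") in ("keywords", "description"):
                self.terms.extend(a.get("content", "").split())
            elif tag == "img" and a.get("alt"):
                self.terms.extend(a["alt"].split())

    harvester = MetaHarvester()
    harvester.feed('<meta name="keywords" content="news cricket"><img alt="score board">')
    print(harvester.terms)  # ['news', 'cricket', 'score', 'board']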
4. Ranking Algorithms
One of the most important steps in query processing is the rank by which the individual page links are displayed. The ranking can be based either on the occurrence of the query words in the document or on the relevance of the query to the document. The former method is easy to implement, as it requires only a Boolean sort of operation; the latter requires a semantic search through the document. The ranks are first calculated without the conjunctions and disjunctions in the queries; to include the logical operations, the ranks of the individual elements are suitably combined.

4.1 Word Occurrence Ranking

The simplest algorithm is to count the occurrences of the query words in the page. The rank of page j then calculates to

    R_j = \sum_q C_{j,q}

where C_{j,q} is the number of occurrences of the q-th query word in page j. This algorithm places no emphasis on the relative frequency of the words, nor does it take the length of the document into account. To include the length of the document as a parameter in the query ranking, the formula can be modified as
    R_j = \alpha \sum_q \frac{C_{j,q}}{|D_j|} + \beta \sum_i \frac{L_{i,j}}{L^{o}_{i}}

where the parameters
α, β = constants given specific weights,
L_{i,j} = incoming link from page i to page j,
L^o_i = number of outgoing links of page i,
|D_j| = length of document j.

The above formula takes care of the links concerning a page. Even if a page does not contain a keyword of the query, it may still be relevant to the context. Thus, such link contributions are given an arbitrary ranking factor, denoted by c2 (the weight β above), usually lower than the factor c1 (α) that weights the direct occurrence of the words. The sum of all the ranking elements gives the total rank of the page. The values assigned to the constants must be kept an optimum distance apart to eliminate problems at the extremes: if the difference is too small, all the pages that merely link to a page are displayed; if it is too large, related pages are eliminated. In the limit, the formula reduces to the original occurrence-only formula. The experimental results show an optimum difference of 10 to 15 between the values. The above scheme gives equal importance to both the referring and the referred page; in real-world situations, the referred page must be given the greater advantage. To incorporate this strategy, the formula is modified to include the referred-to terms. Another omission is that of multiple references that form chains, which cannot be directly integrated into the mechanism. The formula incorporating all the above elements is shown below.
    R_j = \sum_q C_{j,q} + c_1 \sum_i L_{i,j} \, R_i

The formula lets a page inherit the rank of all the pages that refer to it, with a reduction of the rank as it propagates through the chain of pages. This equation favors the most cited pages by ranking both the referrers and the occurrences of the word. L_{i,j} takes a Boolean value, depending on the presence of an incoming link, and the constant c_1 < 1 is the depreciation factor, the value by which the rank decays for every step through the chain of pages.

4.2 Relevance Ranking

Relevance ranking deals with the implementation of a semantic search of the pages and their indices. The ranking is based not just on word occurrence but also on the context of the search. This requires an increased number of query words to make the search better; however, the use of contexts can slow down the entire process, and it is also found to fail for generalized searches. The first step, in the case of generalized searches, is the application of a dictionary and the suggestion of alternatives for misspelled words. The query itself is altered, using the dictionary, to include contextual synonyms. These additions are given a ranking factor lower than that of the original query words. The frequency of the words in the document and the length of the document also have to be included. This zeroes in on a formula that includes all these factors:
    R_j = \sum_q R_q, \qquad R_q = a \, S_q \, \frac{C_{j,q}}{|D_j|}

where
R_q = rank contribution of the q-th word in the query,
S_q = weighting factor of the word (lower for dictionary-added synonyms),
a = depreciation constant, as mentioned in the previous section,
and the other parameters have the same meanings defined in the previous section.

Another possible enhancement in the query definition is the conversion of the entire query into a form that can represent data universally. This is done when the search is aided by XML data; in the ordinary case it does not give much of an advantage, as the problem of word-sense ambiguity persists in web prose.
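A minimal sketch of evaluating the chained formula of Section 4.1 (Python; the fixed-point iteration, the parameter values, and the data structures are assumptions made here, not prescribed by the paper) shows how rank propagates with depreciation along referral chains:

    def chained_ranks(occurrences, links, c1=0.5, iterations=20):
        """Iteratively evaluate R_j = sum_q C[j][q] + c1 * sum_i L[i][j] * R_i.
        occurrences: {page: base occurrence score};
        links: {page: [pages it links to]}.
        c1 < 1 makes inherited rank decay along chains of references."""
        ranks = dict(occurrences)
        for _ in range(iterations):
            new = {}
            for page, base in occurrences.items():
                inherited = sum(ranks[src] for src, outs in links.items() if page in outs)
                new[page] = base + c1 * inherited
            ranks = new
        return ranks

    # Tiny example: B links to A, so A inherits a depreciated share of B's rank.
    print(chained_ranks({"A": 2.0, "B": 1.0}, {"B": ["A"], "A": []}))
    # {'A': 2.5, 'B': 1.0}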
5. Network Monitoring
With the increasing use of the HTTP protocol for information transmission, as in .NET and XML scenarios, monitoring the HTTP port plays a vital role in the security of the assets that depend on it. The proxy also holds the potential to prevent pop-up ads, malicious scripts and other factors that threaten the integrity of the system. The flow of sensitive information, Trojans that use these ports, etc. can be monitored by integrating the proxy into the SPI of the network programming interface. By analyzing all the data passing through the network, anomalies in the traffic can be detected and threats identified. The implementation uses a probabilistic model to determine any abrupt changes in the network traffic at the various ports. Although it may not be able to specifically identify the threat, it acts as an identifier of potential threats.

Another use of the proxy is to identify potential bottlenecks and the systems that are part of the hard-pressed traffic. In some cases, when the bandwidth allocated to a system is fixed and other systems remain idle, the proxy can command the idle systems to make page requests and redirect the replies to the original system, just as the NTS does, thereby utilizing the wasted bandwidth of the idle systems as well.

The definition of the traffic analysis pattern requires a model that adapts itself to the changing browsing habits of the user. The proxy keeps track of all the bytes that pass through it and can calculate the rate of the traffic travelling through it as a function of time. This traffic excludes what the proxy itself generates as a measure of utilizing the bandwidth; only the requests of the browser and the replies to them are counted. In some cases, such as during downloads, it can stop recording the traffic; however, it ensures that the downloaded file is logged, to explain any erratic activity at the time of the download. The list of all the servers to which requests are made can also be logged, to calculate the frequency of access to each server. This not only helps in security but also lets the proxy request pages from frequently used servers at times of idleness, updating the cache and the index; connections to such servers become faster, and the chance that their pages are in the cache increases. The experimental results clearly show that major deviations from normal browsing activity are logged and detected. Other forms of intrusion, such as those using mails, can also be logged once the proxy is integrated.
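As a stand-in for the probabilistic model (a sketch only; the window size and the three-sigma threshold are assumptions), a rolling mean and standard deviation over per-interval byte counts suffices to flag abrupt traffic changes:

    from collections import deque
    from statistics import mean, stdev

    class TrafficMonitor:
        """Flag abrupt deviations in per-interval byte counts using a rolling
        mean/std window, a simple stand-in for the probabilistic model."""
        def __init__(self, window=60, threshold=3.0):
            self.samples = deque(maxlen=window)
            self.threshold = threshold

        def observe(self, nbytes: int) -> bool:
            """Record one interval's byte count; return True if it is anomalous."""
            suspicious = False
            if len(self.samples) >= 10:
                mu, sigma = mean(self.samples), stdev(self.samples)
                suspicious = sigma > 0 and abs(nbytes - mu) > self.threshold * sigma
            self.samples.append(nbytes)
            return suspicious

    monitor = TrafficMonitor()
    for count in [1200, 1100, 1300, 1250] * 5 + [250000]:
        if monitor.observe(count):
            print("anomalous interval:", count)

Such a detector cannot name the threat, but it cheaply identifies the intervals worth inspecting in the proxy's logs.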
6. Experimental Results
Various tests were conducted to affirm the reliability of the proxy. The tests were carried out independently, during real operation of the machines, with the users having no clue that a proxy server was installed. The various results obtained are elucidated in the following sections.

6.1 Testing Environment Configuration

The proxy servers were installed on Pentium 4 machines with 256 MB RAM and 30 GB HDD, running Internet Explorer 6.0 and Opera as browsers. Connection to the Internet was through a leased line, with the machines routed to the net using a Network Translation Server that allocates fixed bandwidth to individual machines. The cluster group consisted of machines that varied slightly in processor or memory configuration. None of the machines acted as a public web server and, due to the presence of the NTS, they did not have individual IP addresses known to the outside world. Inside the internal LAN, they were given addresses ranging from 220.127.116.11 to 18.104.22.168. All the machines were members of the cluster and had the proxy installed at port 80, with the browser configured to use a proxy at the address 127.0.0.1:80. To test the effect of the lack of a proxy server, a machine with the internal address 22.214.171.124 was independently monitored for traffic, using a pseudo proxy server that simply routed the data to the NTS after counting the bytes sent. All the machines in the cluster were running Windows 2000, with the exception of one machine running Red Hat 9.0 and browsing with Opera and Lynx.

6.2 Traffic Analysis
The first part of the testing process was the analysis of the traffic on the network and on the individual machines, covering the effect of the proxies and of the distributed cache. The single stand-alone machine was also monitored for traffic during the same time. In this first step, no searches were made based on the indexing available in the proxies; all the page requests went out to the Internet.
[Figure: request traffic over time; the x-axis represents time and the y-axis the percentage of data transfer.]

The results show a clear increase in the request traffic. One notable point was that, during long periods of browsing, the use of multiple browsers came nowhere near the speed provided by the proxy. The expected effect of similar pages being accessed by the same group of users was noticed explicitly; the frequently accessed sites included Yahoo, Rediff, Sify, Google, about.com and the college intranet pages. This sped up access by about 37% on average. The bandwidth seemed to be completely utilized, and slots allocated to free systems were also used. The data tabulated gives the percentage utilization assuming that each machine was allocated its individual slots. The number of page faults was also tabulated and was found to be low enough for frequently accessed sites to be served from the cache.

6.3 Indexing and Querying

The indexing traffic was calculated as the ratio of the requests generated by the proxy to those created inherently by the browser. The proxy-generated requests are part of the indexing and crawling process; although some downloading may be due to the prefetching of probable links, such requests are counted as index requests, since they are still used in the indexing process. A page requested by the browser, even if found in the distributed proxy (including on the same machine), is regarded as a separate request. The response times need not be considered separately, as they fall directly under the category of the traffic analysis. As the proxy is implemented as a distributed system, the indexing performance also depends on the pages encountered in the cache. The graph is drawn for the number of page faults during the browsing process. As the actual number of page faults cannot be estimated in a short time frame, the data was taken over a period of time, sampling at discrete intervals and smoothing the resulting curve. The peaks are obtained whenever a new page is accessed.

[Figure: page faults during browsing; the x-axis represents time and the y-axis the traffic to the outside Internet.]
The indexing time is not considered, as it stays negligible when compared to the network transmission time. The requests are also counted as a direct measure of the network, which may not be the best case: many a time, some sites are accessed faster than others due to geographical dispersion. As the only valid scale available is the number of requests, the performance is measured assuming that all the pages are retrieved in the same time.

[Figure: browser and proxy request traffic; the x-axis represents time and the y-axis the percentage of data transfer.]
The rise and fall of the browser requests, indicated by the darker areas, result from the request of a page, the images downloaded subsequently, the filling of a form, and the user resending the information. During the registration and typing process, the proxy performs its crawling; the retransmission of the data causes the final rise in the traffic.

6.4 Ranking Algorithms

The last step of the testing process was testing the effectiveness of the ranking algorithms. This test may seem biased, as relevance may not be directly related to the occurrence of a word. Since it is difficult to analyze the contextual relevance of any word in the query, the occurrence of the query word is regarded as the relevance factor.
The relevance factor is the ratio of the number of pages displayed that seem relevant to the total number of pages searched in the hypothetical search space. This relevance may be biased, as the pages are rated based only on the context of the search. The graph serves only as a scale for rating the number of pages retrieved: too many pages, or too few, are bound to cause problems for the user; the denominator, however, normalizes this effect and turns negative when non-relevant pages are fetched.

[Figure: the y-axis represents the relevance factor, while the x-axis represents the number of words in the query.]
The curves are ordered from the first formula to the last, bottom up, and the graph evidently shows the last formula to be the best. A special case is the depression at levels near 7 query words; this may just be experimental error, as it need not directly relate to the proportionality between the number of query words and page relevance.
7. Conclusion

All the experimental results show that it is better to browse as part of a cluster than stand-alone. The proxy system may be extended to include facilities for deciding downloads not merely on word occurrence but on semantic categories. The queries could also be extended to include the semantic meaning of the query, which involves natural language processing. As far as the proxy implementation is concerned, it is currently being extended to form a part of the SPI of the network, acting as a proxy for all the ports; it shall be given port-specific domain knowledge to make TCP and UDP transmissions faster. Rather than explicitly using a database for indexing, a distributed database may be used for implicit transmission, making the job of manual index maintenance easier. Another suggested improvement is the use of advanced XML searches, as they tend to furnish better details for search queries. On the whole, the system is an implementation of a distributed computing system that involves advanced data mining techniques for crawling and for speeding up the browsing procedure.
8. References

1. Berners-Lee, T., "Uniform Resource Locators", Internet Working Draft.
2. Berners-Lee, T., "Hypertext Transfer Protocol", Internet Working Draft.
3. Eichmann, D., "The RBSE Spider", in Proceedings of the First International Conference on the WWW, Geneva, published as a working draft.
4. Koster, M., "ALIWEB", pages 175-182, 1994.
5. Pinkerton, B., "Experiences of the Crawler", in Proceedings of the First International Conference on the WWW, Geneva, published as a resource document.
6. Salton, G., "Term Weighting Approaches in Automatic Text Retrieval", pages 513-523, 1988.