Scalable Web Server Clustering Technologies
J. Wei

Background
• The growth of the Internet, dynamic content, and an increasing number of users call for faster web servers.
• In the past, the web server was replaced with a faster machine (processor).
• Drawbacks:
  • Short-term fix (Moore’s Law: the number of transistors per integrated circuit doubles roughly every 18 months).
  • Expensive: almost the whole machine must be replaced.
• Solution: add more processors or machines to the web server. (These are commodity hardware and software, so past investment is preserved.)

Requirement
• No application state is kept on the servers, because a client’s requests may be transferred from one server to another. (Exception: some protocol-specific services, such as Secure Sockets Layer.)
• Transactions must be relatively short and occur with high frequency.
  • Short: no special hardware or software is used to process a request.
  • High frequency: the sample space is large, so stochastic methods can be used to distribute the requests. Requests arrive stochastically, from anywhere at any time.

OSI vs. TCP/IP

Layer 4 Switch
Special technologies:
• Single IP: through Network Address Translation (NAT), the cluster servers appear to be a single server with one IP address.
• Higher-layer address screening: the switch can make forwarding decisions based on request content from layer 4 upward.

Terminology
• L4/2: Layer 4 Switching with Layer 2 Packet Forwarding. The servers share an identical layer 3 (network) address, and each has a unique MAC address.
• L4/3: Layer 4 Switching with Layer 3 Packet Forwarding. The servers are identical at layer 4 (transport, same services), and each has a unique network address.
• Layer 7 Switch: makes forwarding decisions based on the content of client requests. It can employ L4/2 or L4/3.
• Client-side Transparency: because of the dispatcher, the whole cluster appears to clients as a single host.
• Server-side Transparency: each cluster server runs a standard web server designed for a standalone machine. It serves requests forwarded by the dispatcher exactly as it would serve requests arriving directly from clients.
• Performance Index: connections per second or bits per second. (Cluster Maximum Utilization)

L4/2 Clustering
• The cluster’s IP address (A) is shared by the dispatcher and the servers through the use of primary and secondary IP addresses. (A host can have several IP addresses.)
• The dispatcher’s primary IP address is A. The servers use A as a secondary address.
• All packets destined for A are forwarded to the dispatcher through the use of the Address Resolution Protocol (ARP) in the nearest gateway/router.

Technology Specification
• Load-Sharing Algorithm: round robin or other policies.
• Session Map: if an incoming packet belongs to an established connection in the map, it is forwarded to the previously selected server. If it initiates a connection (it carries a SYN), a server is selected and the connection is saved in the map. A packet that matches no entry and does not contain a SYN may be discarded.
• Backup method: guards against failure of the dispatcher and of the servers.

L4/2 Traffic Flow
1. A client sends a request to A.
2. The router sends the request to the dispatcher.
3. Based on the load-sharing algorithm, the dispatcher selects an actual server (here, server 2) to serve the client.
4. Server 2 replies to the client directly.

Advantage vs. Disadvantage
• Advantage: servers reply to clients directly, which keeps the dispatcher from becoming a bottleneck. Checksums need not be recalculated, because forwarding operates at layer 2.
• Disadvantage: the dispatcher must have a direct physical connection to all servers.

ONE-IP (Bell Labs, 1996)
Load-Sharing Algorithm
• Routing-based dispatching: hash the incoming client’s address to get a number that indicates which server will service the request.
• Broadcast-based dispatching: each server is responsible for a fixed, disjoint portion of the address space.
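The two ONE-IP dispatching modes can be sketched in a few lines. This is only an illustration: the slides do not say which hash function ONE-IP uses, so the modulo hash below is an assumption, and the names `server_index` and `accepts` are hypothetical.

```python
import ipaddress


def server_index(client_ip: str, num_servers: int) -> int:
    """Routing-based dispatching: hash the client address to a server number.

    Assumption: a plain modulo over the 32-bit IPv4 address stands in for
    whatever hash the real system uses.
    """
    return int(ipaddress.IPv4Address(client_ip)) % num_servers


def accepts(server_id: int, client_ip: str, num_servers: int) -> bool:
    """Broadcast-based dispatching: every server sees the packet and keeps it
    only if the client address falls in its own disjoint slice."""
    return server_index(client_ip, num_servers) == server_id


if __name__ == "__main__":
    for ip in ("10.0.0.7", "192.168.1.10"):
        owner = server_index(ip, 4)
        keepers = [s for s in range(4) if accepts(s, ip, 4)]
        print(ip, "-> server", owner, "accepted by", keepers)
```

Because the mapping depends only on the client address, it needs no session map, but it also cannot shift load away from a server whose slice of clients happens to be busy.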
Drawback: cannot adapt when client requests are distributed disproportionately.
Backup: watchdog daemon, watchd
• Dispatcher failure: the backup dispatcher notices the missing heartbeat of the primary dispatcher and takes over.
• Server failure: the hash table, or the address filters on the other servers, are reconfigured.

Network Dispatcher (IBM, 1996)
• It powered the 1998 Olympic Games website at up to 2000 requests/s.
• Experimental results reached 2200 requests/s.
• Load-Sharing Algorithm: weighted round robin.
• Connection Map: a packet that matches no existing connection is discarded if it does not contain a SYN, or if no server with a non-zero allocation weight is available.
Backup:
• Dispatcher: a secondary dispatcher (in fact, there are several extra dispatchers). The secondary dispatcher takes over the IP address of the failed dispatcher.
• Server: High Availability Cluster Multi-Processing for AIX (HACMP) on the IBM SP-2.
  1. Reconfigure the dispatchers to exclude the node.
  2. The failed server reboots automatically.
  3. Reconfigure the dispatchers to include the node.
Client affinity:
• Background: for services such as FTP and SSL, two connections from the same client must be assigned to the same server.
• Connection requests from the same client that arrive before a given affinity life span expires are sent to the same server.
• “The quality of the load sharing may suffer slightly, but the overall performance of the system improves.”

Others
• LSMAC (University of Nebraska-Lincoln): “Implement L4/2 clustering as a portable user-space application running on commodity systems”
• Alteon ACEdirector (hardware implementation)
  • ACEdirector 2’s primary focus is on load balancing Internet services such as HTTP and FTP.
  • Load-Sharing Algorithm: round-robin and least-connections policies.
  • Supports SSL service.

L4/3 Clustering
• The dispatcher appears as a single host to clients and as a gateway to the servers (IP address = A).
• Each server has its own IP address, which can be globally or locally unique (IP addresses = B1, B2, … , Bn).
• Load-sharing algorithm: round robin or other algorithms.
• The dispatcher keeps a session map table.

Traffic flow:
1. A client sends a request with A as the destination.
2. The packet arrives at the dispatcher.
3. Based on the load-sharing algorithm and the session table, the dispatcher selects a server, rewrites the destination IP address, recalculates the checksums, and forwards the packet to the server.
4. The server sends its reply toward the client through the dispatcher, which acts as its gateway.
5. The dispatcher rewrites the source IP address of the reply as A, recalculates the checksums, and forwards the reply to the client.

Disadvantages:
1. The checksums (IP and TCP) are recalculated twice.
2. All traffic flows through the dispatcher (bottleneck).

Magicrouter (University of California at Berkeley, 1996)
• Fast Packet Interposing and kernel modifications
Load-sharing algorithms:
• Round robin
• Random
• Incremental Load
Backup:
• Dispatcher: primary + backup model.
• Server: ARP, which maps server IP addresses to MAC addresses, is used to detect server failures.

LocalDirector (Cisco, 1996)
Load-sharing algorithms:
• Least connections: choose the server with the fewest connections.
• Fastest response: choose the server that responds to the request first.
• Round robin: strict RR policy.
Backup:
• Dispatcher: an extra LocalDirector unit linked to the primary one by a special failover cable.
• Server: servers are contacted periodically. A failed server is removed from the pool but still contacted; when it comes back up, it is added to the server pool again.
Sticky flag: similar to IBM’s client affinity.

LSNAT (University of Nebraska-Lincoln)
• User-space implementation
• RFC 2391: Load Sharing using IP Network Address Translation (LSNAT)
Backup:
• Dispatcher: one server is selected as the new dispatcher. A Distributed State Reconstruction Mechanism rebuilds the map of existing connections.
• Server: a failed server is excluded from the active server pool.
When it comes back up, it is included again.

L7 Clustering
• Dispatch decisions are based on the content of the request (application layer).
• Content-based dispatching

LARD (Locality-Aware Request Distribution, Rice University)
• Uses a TCP handoff protocol with a modified kernel.
• Different servers process different kinds of requests, which allows the use of specialized servers.

Web Accelerator (IBM)
• “The accelerator can now perform content-based routing in which it makes intelligent decisions about where to route requests based on the URL.”
• L7 based on L4/2
• Web page caching
• The dispatcher serves as a gateway/router; all traffic flows through the dispatcher.

ArrowPoint
• Content-based dispatching policy
• Caching mechanism similar to Web Accelerator
• Sticky connections
• Hot standby of the dispatcher, and a failure detection mechanism for server nodes

Conclusion
• L4/2 Clustering. Bottleneck: the dispatcher’s capacity to process incoming requests. Advantage: sustainable request rate.
• L4/3 Clustering. Bottleneck: recalculation of checksums.
• L7 Clustering. Bottleneck: complexity of the content-based dispatching algorithm. Advantage: localizing the request space and caching request results.

Qualitative comparison
Client-based approach:
• Advantage: reduces the load on the web server by implementing the routing service on the client side.
• Disadvantage: not generally applicable, and it needs server-side cooperation.
Dispatcher-based approach:
• Advantage: full control of client requests, which yields good load balancing. Easy to implement.
• Disadvantage: risk of a dispatcher bottleneck.
DNS-based approach:
• Advantage: high scalability. No risk of bottleneck.
• Disadvantage: because of address caching mechanisms, sophisticated algorithms are needed to achieve load balancing. Fewer than 32 web servers for each public URL, because of the limit on UDP packet size.
Server-based approach:
• Advantage: no risk of a single point of failure or bottleneck.
• Disadvantage: redirection increases latency for clients.
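The bound of fewer than 32 servers per URL in the DNS-based approach can be checked with a little arithmetic. The slides do not give the derivation, so the sketch below is one plausible reading: it assumes the classic 512-byte limit on DNS messages over UDP (RFC 1035), one question, only A records in the answer, and name compression for every answer. The function name `max_a_records` is hypothetical.

```python
DNS_UDP_LIMIT = 512  # bytes, classic limit for a DNS message over UDP (RFC 1035)
HEADER = 12          # fixed DNS header size in bytes
A_RECORD = 16        # compressed name pointer (2) + type (2) + class (2)
                     # + TTL (4) + rdlength (2) + IPv4 address (4)


def max_a_records(hostname: str) -> int:
    """Upper bound on the A records that fit in one 512-byte UDP response."""
    # Encoded name: one length byte per label plus the label itself,
    # followed by a single zero byte for the root.
    qname = sum(len(label) + 1 for label in hostname.split(".")) + 1
    question = qname + 4  # QTYPE (2) + QCLASS (2)
    return (DNS_UDP_LIMIT - HEADER - question) // A_RECORD


if __name__ == "__main__":
    print(max_a_records("www.example.com"))  # prints 29
```

Even with an empty question section the bound would be (512 - 12) // 16 = 31, so under these assumptions a single response can never list 32 servers.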
Quantitative comparison
• Cluster Maximum Utilization: at a given instant, the highest utilization among all servers in the cluster.
• Cumulative Frequency:
• Exponential Distribution:
• Heavy-tailed Distribution:

Exponential Distribution Model
• Dispatcher-based: utilization is below 0.8 almost all the time.
• DNS-based (adaptive TTL): utilization is below 0.9.
• DNS-based (constant TTL): overloaded 20% of the time.
• Server-based: utilization is below 0.9.
• DNS-RR: overloaded more than 70% of the time.

Heavy-tailed Distribution Model
• Dispatcher-based: works fine.
• DNS-based (adaptive TTL): works fine, without risk of bottleneck.
• Server-based: works fine until the load exceeds 0.9, and performs poorly when the load is high.

Conclusion
• The bottleneck will be network throughput.
• Making use of wide area network bandwidth can give much better performance.
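The cluster maximum utilization metric from the quantitative comparison can be illustrated with a small sketch. The slides leave "cumulative frequency" undefined, so treating it as the fraction of sampled instants whose cluster maximum utilization stays below a threshold is an assumption, and the sample values are made up for illustration only.

```python
def cluster_max_utilization(utilizations):
    """Highest utilization among all servers at one instant."""
    return max(utilizations)


def cumulative_frequency(samples, threshold):
    """Fraction of instants whose cluster maximum utilization is below threshold.

    Assumption: this is how the cumulative frequency is read here; the
    original definition is not recoverable from the slides.
    """
    peaks = [cluster_max_utilization(s) for s in samples]
    return sum(p < threshold for p in peaks) / len(peaks)


if __name__ == "__main__":
    # Hypothetical per-server utilizations at four instants (three servers).
    samples = [
        [0.50, 0.70, 0.60],
        [0.90, 0.40, 0.30],
        [0.20, 0.75, 0.10],
        [0.95, 0.50, 0.50],
    ]
    print(cumulative_frequency(samples, 0.8))  # prints 0.5
```

Read this way, a statement such as "utilization is below 0.8 almost all the time" corresponds to a cumulative frequency close to 1 at the 0.8 threshold.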