Posted by : Unknown Friday, July 26, 2013

Abstract

                     Many large web sites receive more than 100 million hits every day. They need a scalable web server system that can provide good performance to all clients, who may be in different geographical regions. High delays and losses are common on WAN links, so to serve all clients well it is natural to have fully replicated web server clusters in different geographical regions. In such an environment, one of the most important issues is server selection (and load balancing): the client's request should be directed to one of the servers so that the response is quick. We assume that web servers are functionally homogeneous, i.e. any one of them can serve any client request. Another important requirement is that the system should not require modification of any client-side component or existing standard protocol. Many techniques exist for selecting the nearest web server from the client's point of view, and many existing systems do only load balancing, assuming that the replicated site has all its web servers in one cluster. This assumption holds for medium-sized sites, but once network traffic grows beyond a certain limit, connectivity becomes a problem. To deal with this, large web sites replicate multiple clusters that are geographically distributed. This changes the problem into first selecting the nearest cluster and then balancing load among the servers of that cluster.
     Designing such a system involves deciding how the best server is selected for a request, so that the user receives the response in minimum time, and how the request is directed to that server. In most strategies, a server is selected without taking into account any system state information (e.g. random, round robin). Some policies use weighted capacity algorithms to direct a larger percentage of requests to more capable servers. But few strategies select a server based on server state, and very few take client state information into account. There is always a tradeoff between the overhead of collecting system state information and the performance gained by using it. If too much state information (about servers or clients) is collected, the collection overhead may outweigh the performance gain. So we must carefully collect only the state information that is likely to improve the performance seen by clients without incurring very high overheads.
The paper presents a scheme that takes both factors into account by collecting information about the load on each server and by estimating round trip times between the clusters and those clients that make a large number of requests.

Load balancing for Web-servers
        
            The number of Internet users has been increasing rapidly, from a few million to a few billion. As a result, the most popular websites experience on the order of 10 to 20 million hits per day. Such websites need a scalable web server system that can provide good performance to all clients, who may be in different geographical locations. High delays and losses are common on WAN links, so a common approach is to have fully replicated web servers in different geographical locations. In such an environment, server selection is the most important criterion: the client's request should be directed to one of the servers so that the response is quick. We assume that web servers are functionally homogeneous, i.e. any one of them can serve any client request. Another important requirement is that the system should not require modification of any client-side component or existing standard protocol.

Steps in HTTP request service :

                        Basically, all Web-related activities involve two basic components:
  1. A Client : a piece of software, such as a browser, through which we can access web documents. Examples: Internet Explorer, Netscape Navigator.

  2. A Server : a computer that runs web server software such as NCSA httpd. The server stores all the web documents and, on request from a client, retrieves the requested document and sends it back to the client.

The HTTP request service involves the following two steps:

  1. Domain name to IP address mapping :
         The domain name present in the URL must first be translated to an IP address. The client software asks its local resolver for this mapping if it is not in the resolver's cache. The resolver returns the IP address for that domain name, which it may obtain from intermediate name servers (which may have cached the mapping) or directly from the authorized DNS for that domain, either recursively or iteratively.

  2. Request for object to server with that IP address :
           The client software then sends the request for the object to the server with that IP address. The server may return the requested object directly, redirect the client to another server using HTTP header options, fetch the object from another server and deliver it to the client, or transparently forward the request to another server, which replies directly to the client using the address of the forwarding server.
Thus the HTTP request service path allows us to distribute requests at two levels: first at the DNS, at the time of resolution of the domain name to a server IP address, and second at the server, when the request reaches it. Any system consisting of multiple servers and some request distribution mechanism is termed a Distributed Web Server System (DWSS).
Time taken to service any HTTP request submitted by a client depends on two major factors: network conditions and server load. Even if a capable server system is present, a client with poor connectivity in terms of delay, available bandwidth, or packet loss will see large delays. If the server system is saturated with requests, service time becomes very large. So to keep response time minimal, the web server system should take both factors into account.
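As a rough illustration of how these two factors combine (a sketch with assumed numbers, not a model from the paper), the latency a client sees can be approximated as propagation delay plus transmission time plus server queueing delay:

```python
# Illustrative latency model: all parameter values are assumptions.
def estimated_response_time(rtt_s, object_bytes, bandwidth_bps, server_queue_s):
    transmission_s = (object_bytes * 8) / bandwidth_bps
    return rtt_s + transmission_s + server_queue_s

# A nearby but saturated server can be worse than a distant, idle one:
near_busy = estimated_response_time(0.02, 100_000, 10_000_000, 2.0)   # 2.1 s
far_idle  = estimated_response_time(0.20, 100_000, 10_000_000, 0.05)  # 0.33 s
```

This is why a selection scheme that looks only at proximity, or only at load, can both fail.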
         Traditionally, for distributed systems, load balancing was done by migrating processes from a heavily loaded system to a lightly loaded one. It involves a migration decision, i.e. which units should be migrated, and then migration of the load units to other nodes.
Both of these parts can be carried out either locally or globally. Load balancing can be classified according to the decision base and migration space. If migration decision is carried out according to local load situation and that of neighbors, it is called local decision base. If this decision is based on load condition of subset of the whole network, then it is called global decision base. Similarly if load unit is migrated to direct neighbors, then it is called local migration space, otherwise it is called global migration space. So according to decision base and migration space, four different categories of schemes emerge:
  •  Local Decision base Local Migration Space (LDLM)
  •  Local Decision base Global Migration Space (LDGM)
  •  Global Decision base Local Migration Space (GDLM)
  •  Global Decision base Global Migration Space (GDGM)
         However, these approaches for load balancing are not suitable for load balancing in the web context for several reasons.
  • In the web context there are multiple points for load balancing (e.g. at the DNS or at the server), while traditional techniques assume a single point.
  • The cost factors are not homogeneous in the web and can vary a lot, while in traditional systems most servers are assumed to be of generally similar capacity and capability.
  • The jobs were assumed to be compute intensive, and hence the focus was on distributing the compute load. In the web, on the other hand, the load is mostly I/O oriented, where caching plays a very significant role in performance and will impact the schemes. Even the cost of migrating a load unit and the granularity of load vary for different points of load balancing. For these and other reasons, it is best to consider load balancing in the Web as a new problem that requires different approaches.
      In the web context, which server to select has mostly been studied from the client's point of view: either the client-side DNS, a client proxy, or the clients themselves decide which server to choose. Usually these entities send probes to multiple servers and select the best server based on probe results, or they take into account the previous history of responses sent by servers. But such probes are usually not sufficient to measure server load accurately, since load changes quickly over time. Moreover, unless all clients use such software and cooperate with server-side entities (and it is very difficult to reach a common method acceptable to all), these schemes either incur too much overhead or do not give much better performance.
Request Distribution Mechanisms:
Cardellini et al. classify web server architectures into four classes based on the entity that distributes the incoming requests among the servers. Some of the methods in each category use feedback-based algorithms and some use non-feedback algorithms. So we can categorize the request distribution mechanisms based on the entity that routes the request as follows:
  • Client-based approach
  • DNS-based approach
  • Dispatcher-based approach
  • Server-based approach
  • Anycast

Client-based approach

In this approach, a client-side entity is responsible for selecting the server, so no server-side processing is required for server selection. The routing to a replica is done by the client software (browser), the client-side DNS, or proxy servers. These schemes can be categorized as follows:
  • Web clients : In this approach, clients are aware of the existence of replicas of the same resource on multiple servers and choose a replica themselves. The following two schemes utilize client software for server selection.
    1. Netscape's Approach : This approach was taken by Netscape Navigator browsers. On access to the Netscape home page, the browser generates a random number X between 1 and 32 and accesses http://homeX.netscape.com. Each server can have multiple homeX aliases pointing to it, so that the client software need not be modified when more servers are deployed; changing the aliases suffices.
      This approach is not generally applicable, as not all companies control client software, and it requires re-installation or change of web clients if the number of aliases increases. It also does not guarantee server availability or load balancing, because if any server is down or overloaded (and the aliases have not been changed), random selection will still try to access the resource from that server.
    2. Smart Clients : In the scheme proposed by Yoshikawa et al., a Java Applet runs on the client side whenever the user accesses the Distributed Web Server System. This Applet knows the IP addresses of all servers in the system. It sends messages to probe node load, response time, network delays, etc., and selects the best node.
      This approach does not require client software modification and provides scalability and availability, but downloading the Java Applet requires a TCP connection, and the extra probe messages cause delay and increased network traffic. Also, not all clients may be capable of running the Java Applet.
  • Client's DNS resolver : This scheme is used by Beck and Moore in the I2-DSI system. Here the client's local DNS resolver, instead of the web client, issues probes to the servers and may choose a server based on response time or on previous access performance reports from clients.
    This scheme requires a customized DNS, and clients must also be modified to provide reports. If the server address is cached, all future requests will go to the same server, so load balancing may not be achieved. If caching is restricted by a lower TTL value, additional load is put on the DNS infrastructure.
  • Client Side Proxy : This scheme was proposed by Baentsch et al. Servers form a hierarchical structure, and the content replicated on each server is some part of the URL name space. Each parent server in the hierarchy propagates information about replicas present on its direct descendants in extra HTTP headers in the response to a request for a resource. The client-side proxy learns about the replicas, and the next request can go to a server containing a replica of the resource.
    This approach requires modification of both server software and proxy, to provide information about replicas and to process the extra HTTP headers, respectively.
All these approaches require changes in client-side components, which are not controlled by the e-Commerce company or the hosting ISP, so they suffer from limited applicability.
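The Netscape-style random mirror selection described above can be sketched in a few lines; the URL pattern comes from the text, everything else is illustrative:

```python
import random

def pick_mirror_url(num_aliases=32):
    # Generate X in 1..32 and access homeX, as the browser did.
    x = random.randint(1, num_aliases)
    return f"http://home{x}.netscape.com"
```

Note the weakness discussed above: the choice is oblivious to server load and availability.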

DNS-based approach

   In this approach, the server-side authorized DNS maps the domain name to the IP address of one of the nodes of a cluster, based on various scheduling policies. Selection of the replica occurs at the server-side DNS, so it does not suffer from the applicability problem of client-side mechanisms. But DNS has limited control over requests reaching the servers, because the IP address mapping is cached at several levels, viz. by client software, local DNS resolvers, intermediate name servers, etc. Along with the mapping, a validity period for the URL-to-IP-address mapping, known as the Time-To-Live (TTL), is also supplied. After the TTL expires, the mapping request is again forwarded to the authorized DNS. Setting this value very small or to zero does not work, because of non-cooperative intermediate name servers and client-level caching; it also increases network traffic, and the DNS itself can become a bottleneck.
 DNS based algorithms can be classified on the basis of the scheduling algorithms used for server selection and TTL values.
  • Constant TTL algorithms : These are classified on the basis of the system state information used by DNS for server selection. The system state information can include both client and server state information, like load, location etc.
    1. System stateless algorithms : The simplest and first-used algorithm of this type is round robin DNS (DNS-RR). It was used by NCSA (National Center for Supercomputing Applications) to handle a large traffic volume using multiple servers. In this approach, the primary DNS returns the IP addresses of the servers in round robin fashion.
It suffers from uneven load distribution and server overloading, since a large number of clients from the same domain (using the same proxy/gateway) are assigned the same server. Also, the whole document tree must be replicated on every server, or a network file system must be used.
    2. Server state based algorithms : A simple feedback mechanism from the servers about their loads is very effective for avoiding server overloading and for not giving out the IP addresses of unavailable servers. The scheduling policy might be to select the least loaded server at any time.
This approach solves the overloading problem to some extent, yet control over requests is still poor because of IP address caching. Some implementations try to solve this by reducing the TTL value to zero, but this is not generally applicable and puts more load on the DNS.
    3. Client state based algorithms : In this approach, two types of information about clients can be used by the DNS for scheduling: the typical load arriving from each connected domain (same proxy/gateway) and geographical proximity.
      Requests arriving from domains with a higher request rate per TTL interval can be assigned to a more capable server. Proximity information can be used to select the nearest server and minimize network traffic.
One mode of Cisco Distributed Director acts as the primary DNS and takes the client location (approximated from the client's IP address) and client-server link latency into account to select the server.
This approach suffers from the same caching problem as server state based algorithms.
    4. Server and Client state based algorithms : Cisco Distributed Director takes server availability information along with client proximity information into account when making the server selection decision. Such algorithms can also use various other state estimates for server selection, and they give the best results.
  • Dynamic TTL algorithms : These algorithms also change TTL values while mapping host name to address. These are of two types:
    1. Variable TTL algorithms : As server load increases these algorithms try to increase DNS control over request distribution by decreasing TTL values.
    2. Adaptive TTL algorithms : These algorithms take into account the domain request rate (the number of requests from a domain in a TTL period) and server capacities when assigning TTL values. So a large TTL value can be assigned for a more capable server, and a smaller TTL value for mappings with a high domain request rate.
      These are the most robust and effective at load balancing, even in the presence of skewed loads and non-cooperative name servers, but they do not take geographical information into account.
DNS-based approaches are more suitable for static replication schemes and less suitable for dynamic replication schemes, because changing the location of a replicated object may require a change in the mapping. In general, these approaches suffer from limited control over requests due to caching of resolved IP addresses at various levels.
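As a sketch of how the DNS-side policies above might be combined (the data structures and thresholds are assumptions, not taken from any cited system), a resolver could pick the least loaded server and shrink the TTL for high-rate domains:

```python
# Least-loaded selection (server state) plus adaptive TTL (client state).
def resolve(servers, domain_request_rate, base_ttl=300, high_rate=100):
    """servers: dict mapping server IP -> current load estimate."""
    ip = min(servers, key=servers.get)  # server state based policy
    # Adaptive TTL: mappings handed to high-rate domains expire sooner,
    # giving the DNS more frequent control over those clients.
    ttl = base_ttl // 10 if domain_request_rate > high_rate else base_ttl
    return ip, ttl

ip, ttl = resolve({"10.0.0.1": 0.9, "10.0.0.2": 0.3}, domain_request_rate=250)
# ip == "10.0.0.2", ttl == 30
```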

Dispatcher-based approach

This approach gives the server-side entity full control over client requests. The DNS returns the address of a dispatcher that routes all client requests to other servers in the cluster. Thus it acts as a centralized scheduler at the server side that controls the distribution of all client requests. It presents a single IP address to the outside world, and hence is much more transparent. These mechanisms can be categorized as follows:
  • Packet single-rewriting by the dispatcher : In this approach, all packets first reach the dispatcher, because the DNS provides the dispatcher's IP address. All servers in the cluster have different private addresses, visible only within the cluster. The dispatcher selects a server in the cluster using a simple algorithm such as round robin and replaces the incoming packet's destination address with the private address of the selected server. It also maintains a list of source IP addresses for active connections and sends the packets of each TCP connection to the same server node. The nodes in the cluster then need to replace the source address in response packets with the IP address of the dispatcher.
    Although this solution maintains user transparency, it requires changes in the kernel of all servers, since packet rewriting occurs at the TCP/IP level. Combined with a DNS-based solution for the dispatchers, i.e. the primary DNS resolving the host name to the IP address of the dispatcher of one of the clusters, this system can scale from LAN to WAN.
  • Packet double-rewriting by the dispatcher : This approach is similar to the above scheme, except that all address changes are done by the centralized dispatcher, not by nodes in cluster. The dispatcher first changes each incoming IP packet's destination address to that of selected server and sends it to the selected server node in the cluster. It also needs to modify the packets on the way back to the client, i.e., now in response IP packet, it replaces the source IP address of selected server with its address. The algorithm for server selection can be round robin, random, etc.
    Cisco LocalDirector selects the server with the fewest active connections. Magic Router uses an application-level process that intercepts all packets between client and server and modifies the address and checksum fields.
    This approach has advantage that it does not require modification of all nodes in cluster.
  • Packet forwarding by the dispatcher : In this approach, instead of rewriting IP packets, the dispatcher forwards packets to nodes in the cluster using their MAC addresses.
IBM Network Dispatcher's LAN solution assumes that the server nodes are on the same LAN and share the same IP address, but have their ARP mechanism disabled, so all packets reach the dispatcher. The dispatcher then forwards these packets to the selected servers using their MAC addresses on the LAN, without modifying the IP headers. The scheduling policy can be based on server load and availability.
This mechanism is transparent to both client and server. No packet rewriting is required by the dispatcher or the servers, as they share the same IP address.
IBM Network Dispatcher's WAN solution is based on dispatchers at two levels. A centralized first-level dispatcher uses the single-rewriting mechanism to forward packets to one of the second-level dispatchers on the WAN, i.e. it replaces its own IP address in the packets with that of the selected dispatcher (each cluster has its own dispatcher). The second-level dispatcher at each cluster changes the address in the packet back to that of the first-level dispatcher and forwards it to the selected server on the LAN using MAC addresses. The selected node responds with the IP address of the primary dispatcher, as in the previous approach.
  • ONE-IP address : In this approach, multiple machines in the web server system share the same secondary IP address, which is the address publicized by the DNS. It is of two types:
    1. routing-based dispatching : All packets with the ONE-IP destination address are directed to the dispatcher by the subnetwork router. The dispatcher selects a server by applying a hash function to the client IP address and then reroutes the packets to the selected server using its primary IP address. Since the hash function is applied to the client IP address, all packets from the same client reach the same server.
    2. broadcast-based dispatching : The subnetwork router broadcasts packets with the ONE-IP destination address to all servers in the cluster, and the servers themselves compute the hash function on the client IP address to decide whether they are the actual destination. This causes more server overhead.
Using a simple hash function guarantees that the same server will be selected for a given IP address, but this is also the weakest aspect for dynamic load balancing: a hash on the client IP address is a static assignment of a server to each client. Fault tolerance can be achieved by changing the hash function.
  • HTTP redirection by Dispatcher
    In this approach, a centralized dispatcher redirects HTTP requests among the web server nodes by specifying an appropriate status code in the response and indicating the selected web server node's address in its header. Dispatching can be based on server load or location.
This approach is transparent to the user, as most browsers support redirection, but the user may perceive slightly more delay. No packet rewriting is required, but server state information, i.e. load, number of connections, etc., must be communicated to the dispatcher.
The Distributed Director, in its second mode, uses an estimate of client-server proximity and node availability to select the server and redirects the client to it. Its main disadvantage is the duplication of TCP connections and hence increased response delay.
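The hash-based ONE-IP dispatching described earlier can be sketched as follows; the server list and the choice of hash are assumptions:

```python
import hashlib

def dispatch(client_ip, servers):
    # Hash the client IP so every packet from one client maps to one server.
    digest = hashlib.md5(client_ip.encode()).digest()
    return servers[digest[0] % len(servers)]

servers = ["192.168.0.1", "192.168.0.2", "192.168.0.3"]
assert dispatch("203.0.113.7", servers) == dispatch("203.0.113.7", servers)
```

The determinism shown by the assertion is exactly the static-assignment weakness noted above: the mapping never reacts to load.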

Server-based approach

This approach allows two-level dispatching: first by the cluster DNS, and later each server may reassign a received request to one of the other servers in the cluster. This addresses the non-uniform load distribution of client requests and the limited control of DNS.
  • HTTP redirection by Server
    The approach is used in SWEB. A request first reaches a host in the cluster through normal DNS resolution, but that host can further redirect the request to another server using the redirection mechanism of the HTTP protocol. The redirection may depend on server load or may be done in round robin fashion. The servers need to exchange status information periodically to take redirection decisions, but this cost is negligible compared to the traffic generated by client requests. Its main disadvantage is the duplication of TCP connections and hence increased response delay.
  • Packet Forwarding by Server
    In this approach, first-level scheduling is done using the round robin DNS mechanism, and second-level dispatching is done by a packet rewriting mechanism that is transparent to users. A request first reaches some node in the cluster; if that node decides that another node is better suited to serve the request, it uses the MAC address to reroute the packet to the selected server.
It does not require HTTP request redirection and hence is better in terms of latency. Server selection can be stateless, i.e. based on a hash function, or based on server load information. If load information is used for rerouting, the servers need to exchange it among themselves. This scheme can work in both LAN and WAN based solutions.
  • Akamai's Approach
    Akamai's approach is very different. The URLs of objects embedded in an HTML page, such as images, Java Applets, and multimedia components, are modified by proprietary software (Launcher) running at the server into URLs of the same objects on the Akamai server nearest to the client. It is claimed that these embedded objects comprise nearly 70% of a typical page's overall bytes. A map of current Internet traffic conditions, the loads of all Akamai servers worldwide, and the locations of Internet users is built for server selection; this map is updated once per second. While selecting a server, it is ensured that no server is overloaded and that the number of servers holding a replica is proportional to the number of requests for the object.
    This approach is very useful when a page contains large multimedia objects. It requires a protocol for obtaining information about geographically distributed servers and about client locations. It scales well geographically, but it requires pages to be modified according to the client location.
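The URL-rewriting idea can be illustrated with a small sketch; the regular expression, host names, and selection callback are assumptions, not Akamai's actual Launcher software:

```python
import re

def rewrite_embedded_urls(html, choose_server):
    # Point each root-relative embedded object at a server chosen per client.
    def repl(match):
        return f'src="http://{choose_server()}/{match.group(1)}"'
    return re.sub(r'src="/([^"]+)"', repl, html)

page = '<img src="/images/logo.gif"> <p>text</p>'
out = rewrite_embedded_urls(page, lambda: "edge1.example.net")
# '<img src="http://edge1.example.net/images/logo.gif"> <p>text</p>'
```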

Anycast

In IPv6, an anycast service will be supported. This service assigns the same IP address to a set of hosts, and each IP router holds a path to its closest host in its routing table. Thus different IP routers have paths to different hosts with the same IP address.
This approach automatically selects the closest host, so load distribution causes no overhead. But it also implies almost static replication, since changes in routing tables take time. This could be addressed in the future through Active Networks, in which simple programs injected by applications can be executed at routers.
These mechanisms have their relative pros and cons. The client-side approach does not require any server-side processing but suffers from limited applicability. DNS-based approaches suffer from limited control over client requests due to caching and non-cooperative name servers; they provide only coarse control over client requests, but they do not suffer from the single point of failure present in dispatcher-based approaches. Dispatcher-based approaches give finer control over client requests. Packet forwarding approaches are most suitable for LAN-based solutions and can scale to WAN solutions. Server-based approaches offer fine-grained control and do not suffer from a single point of failure, but redirection increases latency.
Our focus is on a general scheme that can be implemented entirely at the server side and deployed very easily with currently used infrastructure and standard protocols. Hence we do not consider client-side approaches and do not assume the existence of any support, special component, or modified protocol running at the client side. We consider the whole server architecture: the collection of metrics required for server selection, the role of each entity, and the method of request distribution.
Proposed Architecture for Web server system:
A Distributed Web Server System (DWSS) consists of a large number of servers with some mechanism to distribute the incoming client requests among those servers. We have the following design goals for the DWSS architecture:

  • Components used should be compatible with current protocol and network elements, i.e. they can be deployed in current infrastructure and protocol suite very easily.
  • It should not require change of components at client side or components on which website administrator has no control, i.e. change in only server side components is allowed.
  • System should be geographically scalable, i.e. more servers in clusters can be added when needed in LAN environment and besides that more clusters (that may be geographically far apart) can be added in web server system on WAN.
  • System should give better performance in terms of latency perceived at client side, i.e. time lag between request submission by user and content reaching at client side software should be minimized.
  • System should be user transparent, i.e. single virtual interface to access website should be provided at the URL level, request should be directed to appropriate server automatically by web server system.
  • System should be fault tolerant, i.e. system should continue working (may be with degraded performance) even if some servers or clusters fail or taken off-line.
  • System should avoid overloading of any server, i.e. requests beyond the capacity of any server should not reach it, since overload may result in server crashes.
  • System should not incur too much additional overhead for its functioning, in terms of computation required or network traffic generated. 
The system model is shown in the above figure, along with the different steps in HTTP request service. The client software first asks its local resolver for the IP address of the web server; if the local or intermediate resolvers do not have this mapping, or the TTL has expired, the request reaches the server-side authorized DNS in step 1, and the DNS replies with the IP address of the front node of one of the clusters (selected according to an algorithm we discuss later) in step 1.1. In step 2, the client software, or some entity acting on behalf of the client (proxy or gateway), sends the request to the front node of that cluster using the IP address obtained in step 1. The front node decides which server in the cluster should serve the request (the selection algorithm is described later), and the request is forwarded to that server in step 3. Finally, in step 4, the selected server replies with the requested object on behalf of the front node.
We have chosen a cluster-based model because it creates an additional level for collecting system state information and gives full control over the dispatching of each HTTP connection. Besides, our assumption is that clients are geographically distributed across distant parts of the world, and the company can place each cluster at a strategic location near its customers, where it can serve them better. This model also allows the website administrator to easily change the number of servers in any cluster as well as the number and location of the clusters.
This model allows us to collect finer-grained information about each server at the cluster level, while aggregated information about each cluster can be passed to the entity (the DNS in our case) that needs it for making request distribution decisions at a coarser level.
The aim here is to assign each client request to the best server such that client experiences minimum latency between HTTP request and reception of requested object.
The first-level decision can be taken by the DNS itself: it can resolve the domain name to the IP address of the cluster that can give the best service to this client. The parameters affecting the delay in serving an HTTP request are the load at the selected server (and hence cluster) and the path characteristics between client and server. To take this decision, the DNS therefore needs recent cluster state information and the proximity of the client to each cluster. The DNS knows the client's IP address and the clusters' IP addresses. Since the clusters are under the control of the website administrator, they can provide any state information the DNS requires; and since only server-side components may be modified, they must also gather the proximity information themselves. There are various metrics for measuring proximity between a client and a cluster. Some metrics are:
  • Geographical distance between cluster and client
  • Network distance in hops between cluster and client
  • Round trip time (RTT) between cluster and client
  • Available bandwidth on path between cluster and client
  • Response time of any prior web document fetch
  • Latency of any prior web document fetch
Geographical distance is significant only when the propagation delay on the wire is comparable to the transmission time of the requested object. Propagation delay matters for very large distances even at the speed of light (it can be hundreds of milliseconds), but the media used for long-haul links (usually optical fiber) have very low delays, and if satellite links are used even for local connections, geographical distance may not correspond to actual network delay. Nevertheless, favoring nearby clusters results in less traffic on long-distance lines and usually does correspond to lower delays in practice. Geographical distance can be approximated from the client's IP address if a suitable database is available: under RFC 1466, different IP address ranges were allocated to different geographical regions to keep routing tables shorter, so the high 8 bits of the IP address alone give an approximate region for the client.
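This high-octet approximation can be sketched as follows; the function name and the region table (a small illustrative subset of the RFC 1466 allocations, not a complete database) are our own for this sketch:

```python
def region_from_ip(ip: str) -> str:
    """Approximate a client's region from the high 8 bits of its IP address.

    The ranges below are an illustrative subset of the RFC 1466
    allocations; a real deployment would consult a full IP-to-region
    database.
    """
    first_octet = int(ip.split(".")[0])
    if 194 <= first_octet <= 195:
        return "Europe"
    if 198 <= first_octet <= 199:
        return "North America"
    if 200 <= first_octet <= 201:
        return "Central/South America"
    if 202 <= first_octet <= 203:
        return "Pacific Rim"
    return "Unknown"
```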
Network distance in hops is also a reasonable metric and can be obtained from routers. But it does not take into account the bandwidth available on the path or the current traffic between cluster and client; in short, neither the available transfer bandwidth nor the per-hop delay is captured. It is also a largely static measure of proximity: Paxson found that 68% of routes on the Internet are stable for at least a week and 87% for at least six hours. Moreover, studies by Crovella et al. found that hop count has a very low correlation (0.16) with response time measured at the client side. So it does not seem a very good metric to use.
Round-trip time is another metric; it gives a better and relatively accurate picture of the delay experienced on the path, and to some extent a lower RTT indicates higher available bandwidth. However, it is very dynamic in nature, changing quickly over relatively short periods. Because it varies much more across clusters than hop count does, it conveys better information about the path between client and cluster. On the downside, it is relatively costly to measure and requires more frequent refreshes.
Measuring bandwidth, whether with tools like pathchar or with more efficient recent techniques, generates a lot of additional network traffic and takes a long time, so this metric is not practical to use.
The last two metrics can be used only after a number of clusters have been tried (which degrades performance for the client) and a huge database is maintained. Even then, cluster load changes over time, so old measurements may not predict the best cluster. These metrics are really useful only for client-side server selection.
Comparing these client-cluster proximity metrics, we conclude that RTT is the best one for obtaining path information. It requires periodic refreshes and is relatively costly to measure (compared to hop count or geographical information), but it provides current and more accurate network characteristics, so we should try to limit the overhead of measuring it. Crovella et al. found in their study that, when used at the client side with three ping messages per RTT measurement, it generated less than 1% additional network traffic and gave very good results.
To further minimize overhead, instead of having all clusters measure RTT for every client, we propose to do the measurement only for the small subset of clients with very high request rates. Arlitt et al. found that 75% of all HTTP requests to a server come from 10% of networks, so information collected about the highest-rate clients can be reused for all clients on the same network. We can further limit the number of clusters that measure RTT, based on geographical information (approximated from the client IP address) and current load, to some small maximum (say at most 3).
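Selecting which clusters should probe a given client might be sketched as below; the dictionary field names and the preference for same-region clusters are assumptions of this sketch:

```python
def clusters_to_probe(client_region, clusters, request_rate, max_probes=3):
    """Pick at most `max_probes` clusters that should measure RTT to a
    high-request-rate client: only clusters whose remaining capacity
    exceeds the client's request rate, preferring those in the client's
    approximate geographical region, then those with more free capacity.

    `clusters` is a list of dicts with keys "ip", "region" and
    "free_capacity" (field names are assumptions for this sketch).
    """
    eligible = [c for c in clusters if c["free_capacity"] > request_rate]
    # Same-region clusters sort first (False < True), then higher capacity.
    eligible.sort(key=lambda c: (c["region"] != client_region,
                                 -c["free_capacity"]))
    return [c["ip"] for c in eligible[:max_probes]]
```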
In our approach, each server reports state information to the front node of its cluster; this aggregated state information is used for assigning requests within the cluster and is propagated to the DNS in aggregated form for making coarse-grained (per client IP) assignment of requests to clusters. More details can be found in section 3.6. We gather proximity information only once a high request rate is reported by a cluster to the DNS, so it does not delay DNS replies; the first reply, even for these clients, is based on geographical proximity approximated from IP addresses (which is also used for all clients that do not generate many requests or that query the DNS for the first time after a long interval).
The web server system consists of many clusters distributed geographically all over the world at strategic locations; the clients are likewise in different geographical regions. This lets us take into account the variation in request rate from each geographic region.

Our approach is to distribute requests dynamically based on current system state information. All servers in a cluster report state information to the front node, and the front node uses it to distribute the individual client requests (each TCP connection) arriving at the cluster among the servers intelligently. The front node reports aggregated cluster load information to the DNS as if the cluster were a single node of high capacity, and the DNS in turn uses this information to resolve client queries to the IP address of a cluster that can serve them with lower perceived delay. Collecting only server state information is not sufficient: the servers also count the requests coming from each IP address and send these counts to the front node, which aggregates them and reports the IP addresses of clients with very high request rates to the DNS. It has been found that more than 75% of requests can come from 10% of networks. Using this request rate information, the DNS asks a few clusters to collect cluster-to-client proximity information for those clients only. Cluster-to-client proximity is found by sending ICMP Echo request messages to the clients. The additional messages exchanged among the entities are shown in the figure.
Inside each cluster, client requests can be distributed by IP packet forwarding that rewrites only the destination IP address of request packets, as shown in the figure. Every IP packet arriving at the front node for an HTTP connection is diverted at the IP layer before delivery to the TCP layer. A program running at the front node selects a server for the client based on the client IP address and server load information, writes the address of the selected server into the destination field of the IP packet, and re-injects the packet onto the network so that it reaches the selected server. At the server this packet is again diverted, its destination address is set back to the IP address of the front node, and it is re-injected into the TCP/IP stack of the server node. Each server has a secondary IP address (with ARP disabled) equal to the front node's IP address, so the HTTP server accepts the packet, and response packets go directly from the selected server to the client without any further modification or delay. This adds about one millisecond of delay per incoming packet when the servers and the front node are on the same LAN. Since this packet forwarding can be done at the application layer, that is what we chose for emulation; in an actual system, packet forwarding inside the kernel using MAC addresses, or dedicated hardware, can be used for more efficient dispatching.
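The front node's per-connection dispatch decision, separated from the actual packet rewriting, could look like the following sketch; the class and field names are hypothetical, and the packet I/O through divert sockets is omitted:

```python
import random

class Dispatcher:
    """Front-node dispatch logic: map each TCP connection (client IP, port)
    to a server and remember the mapping so that later packets of the same
    connection go to the same server. Packet I/O (divert sockets, rewriting
    the IP header, re-injection) is outside this sketch."""

    def __init__(self, server_capacities):
        # server IP -> estimated remaining capacity
        self.capacities = dict(server_capacities)
        self.connections = {}  # (client_ip, port) -> server IP

    def dispatch(self, client_ip, port, is_syn):
        key = (client_ip, port)
        if key in self.connections:      # packet of an existing connection
            return self.connections[key]
        if not is_syn:                   # stray packet with no record
            return None
        # New connection: pick a server in proportion to remaining capacity.
        servers = list(self.capacities)
        weights = [self.capacities[s] for s in servers]
        server = random.choices(servers, weights=weights)[0]
        self.connections[key] = server
        return server
```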

Algorithms

Load balancing is done at two places on the path of HTTP request service: first at the DNS, and second at the front node of the cluster. The DNS tries to balance load across clusters by returning the IP address of an appropriate cluster's front node; when the request reaches that front node, it balances load among the servers of the cluster. In exceptional cases, when a cluster is overloaded (due to uneven request rates from clients and caching of DNS entries), HTTP requests can be redirected to other, lightly loaded clusters.
Within each cluster, every server periodically sends its load information to the front node, which sends aggregated load information about the cluster to the DNS. This transfer can also happen aperiodically if the load condition changes suddenly at any component, say when a server becomes overloaded.
Each server collects a number of system state parameters, for example system load averages, system and user CPU utilization, free RAM, buffer RAM, number of disk accesses, free swap, number of processes, and the number of requests served and bytes transferred in the last 64 seconds. Using the average number of connections sent to a server (dispatched and currently active) in a predefined past interval, together with its load condition in that interval (a user-defined function depending on the bottlenecks present), the front node dynamically estimates the capacity of each server, i.e. the average number of connections it can serve without a significant increase in response time, and updates this estimate with every load update from the server. Similarly, every front node aggregates the load information of its servers and periodically informs the DNS of the available free capacity of the whole cluster.
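One possible form for this capacity estimate is an exponentially smoothed update; since the text leaves the load function user-defined, the normalized load metric, target load, and smoothing factor below are assumptions of this sketch:

```python
def update_capacity_estimate(prev_estimate, conns_in_interval, load,
                             target_load=1.0, alpha=0.5):
    """Re-estimate how many connections a server can carry without a
    significant rise in response time.

    If the observed load (e.g. a normalized load average) was below the
    target, the server could have handled proportionally more connections
    than it was sent in the interval; if above, fewer. The new observation
    is blended with the previous estimate by an exponential moving average.
    """
    if load <= 0:
        return prev_estimate  # no meaningful observation this interval
    instantaneous = conns_in_interval * (target_load / load)
    return (1 - alpha) * prev_estimate + alpha * instantaneous
```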
We describe the algorithm at each component (DNS, front node, and servers in each cluster) below.

Load balancing at DNS

In response to a client's query to resolve the domain name, the DNS returns the IP address of a server. All requests are then sent to that IP address for a period called the time to live (TTL); after the TTL expires, the query is sent to the DNS again. Since within the TTL period all requests from that client (or its gateway) go to the same server, a client that generates more requests than others can create load skew. The aim should be to assign clients with high request rates to servers with higher capacity, and the TTL value should be small, because it is these clients that create the skew.
To know request rates (requests per unit period), the servers (or the front nodes of the clusters) should send this information to the DNS periodically. We distinguish clients by their request rates: servers send request rate information only for clients whose rate exceeds a threshold. The DNS then instructs a few of the presumably nearest clusters (those with remaining capacity higher than the client's request rate) to measure the round-trip delay to the client.
For each client with a high request rate, we maintain its request rate, a list of a few candidate clusters (say 3) that had enough remaining capacity at the time of the RTT probe, along with their RTTs, and the timestamp of the last RTT probe.
Procedure recvmsgs receives messages of different types and dispatches them to the appropriate handler functions depending on the message type; the pseudocode below shows the main messages received:
procedure recvmsgs
{
 Input: Socket for receiving messages
 Output: None
 
 read message from socket and determine type of message
 
 switch(message type){
        case load_info:
        /* Message from front nodes about load information on each cluster */
            call update_load 
            break;
        case request_rate:
        /* Message from front nodes about request rate of clients */
            call update_requestrate
            break;                 
        case ip_request:
        /* Message from DNS for preferred IP address of client */
            call resolve_ip
            break;
        case rtt_reply:
        /* Message from front nodes about RTTs between cluster and clients */
            call update_rtts
            break;            
 }
}
Procedure update_request_rate is called when a front node sends request rate information to the DNS.
procedure update_request_rate
{ 
 Input: Client IP addresses, request rate
 Output: None
 
 for each IP address of client (or its gateway) {
    if(no request rate available for this IP)
       add request rate record for this client with current time stamp
    else
       update request record for this client with current time stamp
    
    if(no candidate server in list or time stamp of probe is too old)
        send_probes_for_rtt(Client IP)
 }
         
 update average request rate information.
}
If no request rate information for a client IP is received for a few update periods, its entry is deleted.
Procedure send_probes_for_rtt adds the client's IP address to the RTT probe lists of a few of the nearest clusters that are not overloaded.
procedure send_probes_for_rtt
{
 Input : IP address of client to probe
 Output: None
 
  select few clusters nearest (approximated using IP address) to client having 
  remaining capacity > request rate of client
  
  for each cluster in above list
          add client IP for sending request for rtt probes for this cluster
  
  update probe timestamp for client with current time
}
The actual message containing the client IP addresses is sent to each cluster periodically, either after every fixed interval or once a sufficient number of clients have been queued.
Procedure update_rtts is executed when a cluster front node reports the round-trip time between itself and a client.
procedure update_rtts
{
 Input: Cluster IP, Client IP, rtt, number of successful rtt probes
 Output: None
 
 if(number of candidate clusters is less than maximum for Client IP)
      add_candidate(Client IP,Cluster IP,rtt,num probes)
 else if(any candidate cluster in list has higher rtt
         or had fewer successful probes)
      update_candidate(Client IP,Cluster IP, rtt, num probes)
         
}
add_candidate and update_candidate keep the list of RTT records for a given client IP in ascending order of round-trip time and number of successful RTT probes.
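A minimal sketch of such an ordered candidate list follows; the tuple layout (RTT, negated probe count, cluster IP) is an assumption of this sketch, chosen so that lower RTT and then more successful probes sort first:

```python
import bisect

def update_candidates(candidates, cluster_ip, rtt, probes, max_candidates=3):
    """Maintain, per client, a short list of candidate clusters kept in
    ascending order of measured RTT. Each entry is (rtt, -probes, ip)."""
    # Drop any stale entry for the same cluster before re-inserting.
    candidates[:] = [c for c in candidates if c[2] != cluster_ip]
    bisect.insort(candidates, (rtt, -probes, cluster_ip))
    del candidates[max_candidates:]  # keep only the best few
    return candidates
```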
Procedure update_load is executed when a load information message from the front node of a cluster is received.
procedure update_load
{
 Input: IP address of cluster's front node, capacity, load
 Output: None
         
 find record for node
 update load information of cluster
 update available free capacity of cluster and whole system
}
Finally, clients request resolution of the host name to an IP address.
procedure resolve_ip
{
 Input: IP address of client (or its gateway, i.e. firewall etc.) and domain name
 Output: IP address of front node of cluster
 
  if (information about client request rate is available){
      if(probe time stamp is too old)
             send_probes_for_rtt(client IP)
          
      find list of clusters sorted on previously probed rtt to client
      for each cluster in list in ascending order of rtt
         if(available capacity of cluster > request rate of client){ 
                 reduce available capacity of cluster by client request rate
                 return(Cluster IP address);
         }
      /* If all candidate clusters probed are overloaded */
      send_probes_for_rtt(client IP)
  }      
  else{
        set request_rate to average request rate of all clients. 
        find list of nearest clusters sorted on nearness approximated by IP address
        for each cluster in list in ascending order of proximity      
           if( available capacity of cluster > request rate of client){
               reduce available capacity of cluster by client request rate
               return(Cluster IP address);
           }
  }
  /* If no cluster is selected yet, all clusters are overloaded */
  select cluster in proportion to free capacity
  return(Cluster IP address)       
}
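The selection logic of resolve_ip for a client with probed candidates can be sketched as follows, including the capacity-weighted fallback for the case where all candidates are overloaded; the data structures (sorted candidate tuples, a free-capacity map) are assumptions of this sketch:

```python
import random

def select_cluster(candidates, free_capacity, request_rate):
    """Walk the client's candidate clusters in ascending RTT order and
    return the first one whose free capacity covers the client's request
    rate, reserving that capacity. If every candidate is overloaded, fall
    back to a capacity-weighted random pick over all clusters (weights
    floored at 1 so overloaded clusters remain selectable)."""
    for rtt, _neg_probes, ip in sorted(candidates):
        if free_capacity[ip] > request_rate:
            free_capacity[ip] -= request_rate
            return ip
    ips = list(free_capacity)
    weights = [max(free_capacity[i], 1) for i in ips]
    return random.choices(ips, weights=weights)[0]
```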

Load balancing at front node of each cluster

The front node first collects request rate information from each client IP; then, periodically, it sends to the DNS the request rates of only those clients whose rate is high.
Like the DNS, the front node receives different types of messages and invokes the appropriate handler for each. The main messages are server load information and client request rates from each server in the cluster, and requests from the DNS to measure RTT to clients. The front node also selects a server in the cluster for each new TCP connection from a client and rewrites the destination address of the IP packets coming from that client with the selected server's address.
Each server periodically (at long intervals, on the order of a minute) sends per-client request counts. On receipt of a request rate update message, the receive_request_rate procedure is invoked.
procedure receive_request_rate
{
 Input: Client IP addresses, requests 
 Output: None
 
 for each Client IP address
    update_request_rate(Client IP,number of requests)
 
 update global request rate information
}
update_request_rate creates a new record, or finds the existing record, for the given client IP and aggregates the request rate information for that client.
Periodically the cluster sends aggregated request rate information for the clients that generate more requests than the average client.
procedure send_request_rate
{
 Input: Client IP and their request rates
 Output: None (sends this info to DNS)
 
 calculate Threshold based on average request rate
 
 for each Client IP having request rate > Threshold{
    add Client IP and request rate in queue
    if(queue is full)
        send queued request rate information of clients to DNS
 }       
 send queued request rate information of clients to DNS
}
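The threshold computation can be sketched as below; the text only says "higher than average", so the exact factor applied to the mean here is an assumption of this sketch:

```python
def high_rate_clients(request_counts, factor=2.0):
    """Return sorted (client IP, count) pairs whose request count exceeds
    the threshold, here `factor` times the mean count across clients.
    `request_counts` maps client IP to request count in the interval."""
    if not request_counts:
        return []
    threshold = factor * (sum(request_counts.values()) / len(request_counts))
    return sorted((ip, n) for ip, n in request_counts.items() if n > threshold)
```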
The front node receives detailed load information from each server periodically. Using the average number of connections sent to a server in the predefined interval and the load information obtained from it, the front node estimates the number of connections the server can serve, i.e. the server's capacity; the estimate is updated with every load update from the server.
procedure receive_server_load
{
 Input: Server IP address, load
 Output: None
 
 find record for server using IP address and update server load
 estimate and update number of connection server can serve
 update cluster's load information and available capacity
}
The cluster sends aggregated load information to the DNS periodically, or when its load condition changes significantly.
When the DNS requests measurement of the RTT between a client and the cluster, the following procedure is executed.
procedure receive_probe_for_rtt
{
 Input: Client IP addresses
 Output: None 
 
 for each Client IP address in list
      send predefined number of echo requests to client periodically 
}
Clients answer each Echo request with an Echo reply; the RTT is measured and averaged, and the average RTT, along with the number of successful probes, is sent to the DNS periodically.
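Aggregating one client's probe results for the report to the DNS might look like this sketch; representing failed probes as None is an assumption of the sketch:

```python
def summarize_probes(rtts_ms):
    """Reduce one client's probe results to (average RTT in ms, number of
    successful probes) for reporting to the DNS. Probes that received no
    Echo reply are recorded as None."""
    ok = [r for r in rtts_ms if r is not None]
    if not ok:
        return (None, 0)
    return (sum(ok) / len(ok), len(ok))
```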
Finally, the front node forwards requests to the servers in the cluster in proportion to each server's remaining capacity.
procedure forward_request
{
 Input: IP packets from clients for HTTP request
 Output: IP packets with destination address of selected server
 
 if(connection already exists for this client IP and port){
 
     if(packet is fin)
           move this connection record to a list where it will be recycled after few 
           minutes
     
     update time stamp for this connection
     write IP address of server in destination field and re inject on network
 }
 else if(packet is syn){
       select servers in proportion to their remaining capacity
       create new connection record with current time stamp
       write IP address of server in destination field and re-inject on network
 } else
          drop this packet         
 
 if(load on each server > capacity and least loaded cluster list not empty)
        redirect request to other clusters in proportion to their free capacity
}
All connection records for connections on which no packet has been transmitted from the source for a long time are also freed periodically.
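This periodic cleanup can be sketched as a timestamp sweep over the connection table; the timeout value and the table layout are assumptions of this sketch:

```python
def expire_connections(connections, now, idle_timeout=300.0):
    """Free connection records (keyed by (client IP, port)) whose last
    packet timestamp is older than `idle_timeout` seconds; returns the
    number of records removed."""
    stale = [key for key, last_seen in connections.items()
             if now - last_seen > idle_timeout]
    for key in stale:
        del connections[key]
    return len(stale)
```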

Support at each server

Each server sends its load information to front node periodically or when its load condition changes significantly.
procedure send_server_load
{
 Input: Current load
 Output: Sends load information to front node
 
 get current load information from system
 send_load_to_front_node(load)
}
Each server also sends client request rates to the front node periodically, though at a longer interval (on the order of a minute).
procedure send_request_rate
{
 Input: Client IP and their request rates
 Output: None (sends this info to front node)
 
 read html access log file and aggregate number of requests from each client 
 send_request_rate_to_frontnode(Client IP, request rate)
}
Each server also has a secondary aliased IP address equal to the front node's IP address, so when a packet is received on its other IP address, the packet is re-injected into the protocol stack with its destination rewritten to the front node's IP address.
procedure change_destination_address
{
 Input: Incoming IP packets for HTTP connection
 Output: IP packets with changed destination address
 
 for each incoming IP packet for HTTP connection
     rewrite destination address to IP address of front node(and secondary IP)
      and re-inject it back in TCP/IP stack
}
Thus IP packets received by the front node are forwarded to a server using the server's local private IP address; the server then rewrites the destination address back to the cluster IP address, so to the TCP layer it appears that the packet arrived directly with the aliased secondary IP address as its destination.

Conclusion and Future Extensions

We designed and implemented a test bed for evaluating load balancing strategies for distributed web server systems. The test bed is quite flexible: new policies can be compared with existing ones very easily, and it helps in understanding the trade-offs and the impact of different parameters on a distributed web server system.
In this thesis we proposed an adaptive, dynamic policy for server selection and request distribution for a very large website. This DWSS can be deployed with the current infrastructure and protocols in use, and the architecture is scalable and fault tolerant as well. In short, it meets all the goals mentioned in the design section.
We modified the IP packet forwarding method to rewrite only incoming IP packets, using a shared common IP address. This can be implemented entirely at the application layer with divert socket and IP firewalling support; since packets from clients are much shorter than replies, even at the application layer the overhead is lower than that of rewriting reply packets, as earlier proposed request distribution mechanisms did.
From the results obtained, we conclude that our architecture gives better results when the clients accessing a site are spread across distant geographical regions. The architecture is geographically scalable as well as fault tolerant for new incoming requests, and it achieved its main goal of minimizing the response time perceived by the client.
