Proactive caching of DNS records: addressing a performance bottleneck
Introduction
The resolution of a host name to an IP-address is a necessary predecessor to communication between Internet hosts. In particular, it is required for connection establishment and HTTP exchanges with a Web server. The domain name system (DNS) [21], [22] is in essence a distributed database that answers queries on mappings between names and addresses. Name-servers belong to a hierarchy in which servers responsible for large domains typically delegate name-servers to be in charge of subdomains. DNS was designed prior to the onset of the Web, but fortunately, its design allowed it to scale and accommodate the explosive growth of the Internet. The perhaps unavoidable downside of this design is that resolving a DNS query often involves communication with at least one remote name-server, and may require following a delegation/referral chain through several remote name-servers. Furthermore, since name-servers are different hosts from the HTTP servers that are contacted subsequently, DNS resolutions create additional potential points of failure. Resolutions typically use UDP exchanges with timeout and retransmission, which adds a delay on the order of seconds in the event of packet loss. The overall impact of DNS resolutions on user-perceived latency thus stems both from additional RTTs to remote servers and from sensitivity to long timeouts.
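To make the referral-following loop concrete, the sketch below iteratively resolves a name starting from a root server, with UDP timeout and retransmission. This is our own illustration, not code from the paper: it relies on the third-party dnspython package, and the root-server address, timeout, and retry count are illustrative assumptions; a real resolver would also handle CNAME chasing, glue-less referrals, and server selection.

import dns.exception
import dns.message
import dns.query
import dns.rdatatype

ROOT_SERVER = "198.41.0.4"  # a.root-servers.net; illustrative starting point

def query_with_retries(qname, server, timeout=2.0, retries=3):
    # A UDP exchange with timeout and retransmission: each lost packet
    # costs `timeout` seconds, which is how misses become multi-second waits.
    q = dns.message.make_query(qname, dns.rdatatype.A)
    for _ in range(retries):
        try:
            return dns.query.udp(q, server, timeout=timeout)
        except dns.exception.Timeout:
            continue
    raise TimeoutError(f"no response from {server} for {qname}")

def resolve(qname, server=ROOT_SERVER, max_referrals=10):
    # Follow the delegation/referral chain until some server answers.
    for _ in range(max_referrals):
        resp = query_with_retries(qname, server)
        if resp.answer:
            return resp.answer  # records for qname
        # A referral: descend to a name-server for the subdomain, using a
        # glue address from the additional section.
        glue = [rr for rrset in resp.additional
                for rr in rrset if rrset.rdtype == dns.rdatatype.A]
        if not glue:
            raise RuntimeError("glue-less referral: a sub-resolution is needed")
        server = glue[0].address
    raise RuntimeError("referral chain too long")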
Caching of query results at local name-servers decreases both overhead and user-perceived latency and is instrumental for performance. A caching mechanism for DNS was specified in RFCs 1034 and 1035 [21], [22], and is integrated in BIND [1], the most popular name-server software. Name-servers resolve client queries about hostnames both inside and outside their authoritative zones. Outside queries are resolved through communication with other name-servers, following referral chains to an authoritative name-server, and the results of these queries are cached. Each resource record (RR) (e.g., a CNAME, an IP-address, or an authoritative name-server) carries a time-to-live (TTL) value, and name-servers may cache entries they are not authoritative for only until the TTL expires.
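The passive scheme just described fits in a few lines. The following sketch is ours and purely illustrative (it is not BIND code): an entry is served only while its TTL is unexpired, and otherwise the cache falls back to a remote resolution.

import time

class PassiveDNSCache:
    """Illustrative passive cache: entries live only until their TTL expires."""

    def __init__(self, resolve_remote):
        self.resolve_remote = resolve_remote  # hostname -> (records, ttl)
        self.entries = {}                     # hostname -> (records, expiry_time)

    def lookup(self, hostname):
        now = time.time()
        cached = self.entries.get(hostname)
        if cached is not None and cached[1] > now:
            return cached[0], "hit"           # fresh entry: no remote traffic
        # Miss: resolve via remote name-servers, then cache for the TTL.
        records, ttl = self.resolve_remote(hostname)
        self.entries[hostname] = (records, now + ttl)
        return records, "miss"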
Although “cache misses” (resolutions necessitating exchanges with remote servers) precede only a fraction of HTTP connections, their durations tend to be unpredictable and heavy-tailed. Studies show that when long Web waits do occur, DNS resolutions are a significant cause [7], [17], [23], [25]. Such occasional unexpected long delays significantly impact the consistency of service quality, which is often measured by extremes. High variance in “connecting” time was not critical for applications such as email, telnet, and FTP, which dominated the Internet when the DNS specifications emerged, but it is detrimental to Web browsing. Unfortunately, the impact of DNS resolutions seems inherent in the current architecture: as bandwidth increases and content transmission time decreases, Web service speed is increasingly dominated by RTTs, and the relative contribution of DNS resolutions will only grow.
Caching in BIND currently works in a passive manner: information for which a name-server is not authoritative is obtained only as a consequence of a client query and is cached until either the TTL expires or the name-server process dies. Here we propose and evaluate enhancements to basic passive caching aimed at reducing the user-perceived latency due to DNS query time. We suggest making DNS caches proactive: proactive DNS caching integrates automatically generated “preemptive” queries that update the cache. These automatic refreshes make it more likely that a forthcoming user request finds a fresh cached answer and does not trigger communication with remote name-servers. The challenge is to balance the number of eliminated cache misses against the overhead of the additional DNS queries issued to remote name-servers. Renewal policies, proposed and studied here, are a natural class of proactive caching algorithms addressing this challenge. Our enhancements fall within the framework of the current DNS architecture and can be deployed locally.
Renewal policies refresh cached entries upon their expiration by issuing a new query. The policies differ in when an entry is renewed. We consider several natural policies, based on reference locality (analogous to the cache replacement algorithm LRU), on access frequency (analogous to LFU), and on an adaptive per-hostname policy (analogous to policies studied in [6], [10], [18]). These analogies rest on the properties of the request sequence that each policy exploits; the underlying cost/benefit measures, however, are different. We experimentally evaluated and compared the performance of the different policies using a heterogeneous set of proxy logs, consistently obtaining a significant increase in hit-rate at reasonable overhead cost.
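To illustrate how a renewal policy plugs into the cache, the sketch below extends the passive cache sketched above: when an entry expires, it is refreshed preemptively if its hostname ranks among the top K by access frequency (the LFU-like policy); ranking by last-access time instead yields the LRU-like variant. The on_expiry hook, the statistics kept, and the threshold K are our illustrative assumptions; K is the knob that trades extra queries for eliminated misses.

import heapq
import time

class RenewingDNSCache(PassiveDNSCache):
    """Proactive variant: on expiry, 'popular' entries are refreshed by an
    automatically generated query instead of waiting for the next miss."""

    def __init__(self, resolve_remote, top_k=1000):
        super().__init__(resolve_remote)
        self.top_k = top_k
        self.hits_per_name = {}   # access-frequency statistics (LFU-analog)

    def lookup(self, hostname):
        self.hits_per_name[hostname] = self.hits_per_name.get(hostname, 0) + 1
        return super().lookup(hostname)

    def on_expiry(self, hostname):
        # Assumed to be called by a timer when a cached entry's TTL runs out.
        popular = heapq.nlargest(self.top_k, self.hits_per_name,
                                 key=self.hits_per_name.get)
        if hostname in popular:
            # Preemptive refresh: the next client query will be a hit.
            records, ttl = self.resolve_remote(hostname)
            self.entries[hostname] = (records, time.time() + ttl)
        # Otherwise the entry simply expires, as under passive caching.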
Our study shows that the best renewal policies eliminate about 60% of the DNS misses at the cost of roughly doubling the number of DNS queries. Section 3 indicates how this reduction in the miss rate of the DNS server translates into a reduction in user-perceived latency. The measurements presented in Section 3 show that DNS misses are responsible for about 25% of the HTTP requests that take over 300 ms. Thus, by eliminating 60% of the DNS misses, we are likely to reduce the fraction of HTTP requests exceeding 300 ms from 40% to about 34%. Notice that downloading a single Web page triggers on average about 10–20 HTTP requests for its embedded content [20], [23]. Thus a reduction of 6 percentage points in the fraction of long HTTP requests translates into a larger reduction in the fraction of Web pages that have long download times.
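Written out, the request-level arithmetic behind these figures is:

\[
0.40 \;-\; \underbrace{0.60 \times 0.25 \times 0.40}_{\text{misses eliminated}} \;=\; 0.40 - 0.06 \;=\; 0.34 .
\]

For the page-level claim, a crude model of ours (not the paper's): if a page issues $n$ requests, each exceeding 300 ms independently with probability $p$, then a fraction $1-(1-p)^n$ of pages contain at least one long request, so for small $p$ an absolute drop in $p$ is amplified roughly $n$-fold at the page level.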
Since DNS traffic constitutes a small fraction (3%) of the overall network traffic [24], we expect that doubling the number of DNS queries would not have a detrimental global effect. Highly loaded DNS servers may suffer from an increase in the number of queries they receive. Since system administrators control the load on their name-servers by varying TTLs and allocating secondary servers, we suspect that load is not a bottleneck for the majority of DNS servers. Servers that are highly loaded, such as the root servers, can be backed by additional machines to distribute the load.
Preresolving (prefetching DNS queries) is a technique related to renewal. Preresolving was proposed in [7] as a low-overhead alternative to the prefetching of documents. Preresolving applies prediction schemes (e.g., analyzing hyperlinks or tracking access patterns) to decide which hostnames (Web servers) to preresolve. The work in [7] demonstrated a potential for considerable reduction in user-perceived latency when preresolving hostnames returned in search-engine responses. The performance of preresolving algorithms is likewise measured by the tradeoff between latency and overhead. The basic difference between renewal policies and preresolving lies in their deployment. Preresolving utilizes predictions based on per-user access patterns and currently viewed hyperlinks; this information is available at the user's browser or proxy server, and therefore preresolving queries would most naturally be initiated there and be viewed at the local name-server as regular client queries. Renewal policies, on the other hand, aggregate per-record patterns from DNS query sequences, and hence can be incorporated within the name-server cache and be transparent to its clients. Preresolving and renewal also differ in their impact on traffic: first, like document prefetching, preresolving is more likely to generate bursts [12]; second, prediction-based preresolves are more likely to involve loaded root servers, whereas renewals are often directed only to lightly loaded servers lower in the DNS hierarchy.
DNS TTL-based freshness control poses an inherent conflict for a domain administrator assigning TTL values. Small TTL values increase user-perceived latency and name-server load, and make the site more likely to be inaccessible when the name-server is down. Large TTL values, on the other hand, constitute long-term commitments: if the name-to-address translation changes, many users would be left with a cached, no-longer-valid IP-address, and would be unable to reach the host until the TTL expires and a new DNS query is issued. In practice, TTL values are set conservatively: our measurements indicated that periods between changes are considerably longer than the respective TTL values. This observation was our underlying motivation for introducing simultaneous-validation (SV). Under SV, when a client issues a request to a host (Web server) and a cached expired resolution is available, the proxy/browser issues the HTTP request(s) using the expired address while simultaneously issuing a DNS query to resolve the hostname. For transparency and consistency, fetched contents are held and displayed only if the stale address entry is validated by the DNS query results. SV reduces latency since the DNS query and the communication with the host are performed concurrently rather than sequentially. Our evaluation reveals that the mapping of names to IP-addresses is fairly static, and consequently, the estimated SV success rate is over 98%. The SV approach is fundamentally different from preresolving or renewal policies: SV does not impose the overhead of additional DNS queries. Deploying SV, however, necessitates caching of expired DNS records and support by both the DNS cache and the browser or proxy server.
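A minimal sketch of the SV control flow, written by us in asyncio style; dns_resolve and fetch are hypothetical helpers standing in for the proxy's resolver and HTTP client:

import asyncio

async def fetch_with_sv(hostname, path, stale_ip, dns_resolve, fetch):
    # Simultaneous validation: start the HTTP exchange over the expired
    # cached address while the DNS query runs concurrently.
    fetch_task = asyncio.create_task(fetch(stale_ip, hostname, path))
    fresh_ips = await dns_resolve(hostname)

    if stale_ip in fresh_ips:
        # Common case (mappings are fairly static; estimated success
        # rate above 98%): release the held response to the client.
        return await fetch_task

    # Validation failed: discard the speculative fetch and redo the
    # request against a currently valid address.
    fetch_task.cancel()
    return await fetch(fresh_ips[0], hostname, path)

Holding the fetched response until the fresh resolution arrives is what preserves transparency: content obtained over a stale address is released to the client only after that address is validated.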
To put our proposed enhancement to DNS caching in context, we contrast it with the more well-studied subject of Web content caching. Content is typically cached beyond its freshness lifetime, but expired entries are validated with the origin before being served. HTTP provides mechanisms for client-driven validation and freshness control [14]. Proposed approaches to reducing the user-perceived latency incurred on validation requests include server-driven validation [4], [19] and transferring stale cached data to a client while its validity is being verified [13]. Furthermore, in a recent study we simulated techniques similar to the ones we suggest here for Web content caches [8].
Although considerable research has targeted validation latency for content caching, it seems that no analogous enhancements have so far been proposed for passive caching of DNS records. DNS caching differs in some basic respects from content caching: entries have considerably smaller sizes, storage space is ample relative to the amount of data, and response sizes are small. Yet there are also basic resemblances: query time (the cost of a cache miss) is significant, and hence the cache hit-rate is crucial for reducing perceived latency; freshness typically expires well before the object is modified; and request sequences (to hostnames for DNS caching and to URLs for content caching) exhibit reference locality and characteristic frequencies.
A recent study by Shaikh et al. [23] estimates the effect that DNS-based server selection has on the DNS system. DNS-based server selection reduces TTL values radically (typically to less than 20 s). The study shows that these small TTL values are detrimental to latency. It also shows that since clients are often not close to their name-servers, DNS-based server selection may be far from optimal. Our study shows that proactive DNS caching can efficiently circumvent some of the latency associated with these small TTL values.
Wills and Shang [25] also conducted a recent study of DNS lookup costs. They found that DNS lookup time contributed more than one second to approximately 20% of retrievals of Web objects linked by the home pages of popular servers. Based on simulations of proxy logs, they report that about 6–10% of the HTTP requests in the log generate DNS misses. This miss rate goes up to 20–30% when they assume that repeated requests to the same server within a 15 s window are carried over the same persistent connection. They also measured low change rates in the mappings of hosts to IP addresses. These reported statistics are similar to our measurements reported later in the paper.
Another recent study of DNS performance is by Jung et al. [17]. This study is based on packet traces of both DNS and TCP traffic, in contrast with ours and the previously cited studies, which are driven by simulations of proxy traces. Working with real DNS data, Jung et al. were able to discover that about 13% of DNS lookups result in an answer indicating an error. They also observe heavy-tailed behavior of DNS lookup time, and report that roughly 15% of all lookups require a query packet sent to a root server. By correlating TCP connections and DNS resolutions they deduce that the DNS miss rate is 20–30%. (Their definition of a miss was somewhat different from ours: if the address was not cached but a relevant name-server was cached, they classified the request as a hit.) They also find that most misses are for servers with small TTLs (<10 min), as reducing all TTLs to 10 min did not affect the miss rate much in their simulations. Working with packet traces, Jung et al. had the advantage of inspecting all DNS traffic, not only traffic related to Web accesses. Our measurements, in contrast, cover Web-related DNS traffic only.
These recent studies, conducted independently of ours, emphasize that DNS is a significant latency factor, and its contribution to latency grows with the increasing use of DNS-based server selection. Our suggested techniques address this performance bottleneck at a reasonable cost.
Section 2 provides background material on the domain name system and some relevant statistics. In Section 3 we provide and discuss measurements that demonstrate the effect of DNS query time on user-perceived latency. The bulk of our contribution is contained in Sections 4–6: in Sections 4 and 5 we present and evaluate several renewal policies, and in Section 6 we introduce and evaluate SV. We conclude in Section 7 and outline future research issues.
The domain name system
The current specification of DNS is detailed in RFCs 1034 and 1035 [21], [22] from 1987. Even though it was designed prior to the onset of the World Wide Web, DNS scaled well with the explosive growth of the domain name space. The most widely used implementation of DNS is the Berkeley Internet Name Domain (BIND), maintained by the Internet Software Consortium [16]. BIND is included as a standard part of most vendors' UNIX offerings.
In graph-theoretic terms, the domain name space is a tree …
DNS-related user-perceived latency
To further motivate our ideas, we show in this section that DNS lookups are a considerable factor in the latency perceived by Web users. To that end, we use some of our own measurements from a recent study in which we “replayed” an AT&T proxy log [7], and additional measurements of query times to two well-connected servers of a CDN.
While simulating the AT&T proxy log we measured the relative contribution of DNS resolution to the overall latency of each request. Figs. 5 and 6 illustrate …
Renewal policies
Name-servers receive and resolve DNS queries, and cache and reuse records for the time period specified in their TTL value. A client query that can be answered from the local cache is labeled a cache hit. Otherwise, resolving the client query involves issuing queries to remote name-server(s), and the client query constitutes a cache miss. Under passive caching, as currently practiced by BIND, DNS queries are issued by …
Data
Our experiments extrapolated workload and performance in name-server caches from proxy logs. We used logs from three of the large NLANR Web caches (downloaded from the NLANR site [15]) and a proxy log from the AT&T Research proxy. The NLANR caches are high-volume, with a large rate of requests and clients that include many proxy caches, whereas the AT&T proxy log reflects the activity of about 460 individual users (IP-addresses). Properties of the different logs are provided in Table 1. …
Simultaneous validation
SV reduces total document-fetching time by concurrently performing the hostname resolution and the subsequent parts of the process (TCP connection establishment and HTTP request-response). It is potentially effective when a fresh entry for the hostname is not available at the local cache, but the cache contains an expired entry obtained from a previous resolution or from elsewhere. SV needs to be supported at the entity initiating TCP connections to Web servers, typically browsers or proxy …
Discussion
Latency incurred on DNS misses is inherent in the hierarchical/distributed nature of DNS and is often dominated by RTTs to multiple destinations. As such, query time is not considerably shortened when bandwidth increases. Nonetheless, reducing the perceived latency due to query times is crucial for improving the experience of Web users. We view enhancing the current passive caching of DNS data as a necessary step toward the ultimate goal of reducing Web latency. To this end, we proposed and …
Acknowledgements
The authors thank the referees for their valuable suggestions.
References (25)

- E. Cohen, H. Kaplan, Prefetching the means for document transfer: a new approach for reducing Web latency, Computer Networks (2002)
- E. Cohen, H. Kaplan, Refreshment policies for web content caches, Computer Networks (2002)
- P. Albitz, C. Liu, DNS and BIND (1998)
- AltaVista....
- L.A. Belady, A study of replacement algorithms for virtual storage computers, IBM Systems Journal (1966)
- P. Cao, C. Liu, Maintaining strong cache consistency in the world wide web, IEEE Transactions on Computers (1998)
- E. Cohen, H. Kaplan, Proactive caching of DNS records: addressing a performance bottleneck, in: Proceedings of the...
- E. Cohen, H. Kaplan, Exploiting regularities in web traffic patterns for cache replacement, Algorithmica (2002)
- E. Cohen, H. Kaplan, J.D. Oldham, Policies for managing TCP connections under persistent HTTP, in: Proceedings of the...
- E. Cohen, B. Krishnamurthy, J. Rexford, Evaluating server-assisted cache replacement in the Web
- T.H. Cormen, C.E. Leiserson, R.L. Rivest, Introduction to Algorithms
Edith Cohen is a researcher at AT&T Labs-Research. She did her undergraduate and M.Sc. studies at Tel-Aviv University, and received a Ph.D. in Computer Science from Stanford University in 1991. She joined Bell Laboratories in 1991 (now AT&T Labs). During 1997, she was at UC Berkeley as a visiting professor. Her research interests include design and analysis of algorithms, combinatorial optimization, Web performance, networking, and data mining.
Haim Kaplan received his Ph.D. degree from Princeton University in 1997. He was a member of technical staff at AT&T Research from 1996 to 1999. Since 1999 he has been an Assistant Professor in the School of Computer Science at Tel Aviv University. His research interests are the design and analysis of algorithms and data structures.