Performance Evaluation

Volume 34, Issue 4, 18 December 1998, Pages 249-271

Analysis and modeling of World Wide Web traffic for capacity dimensioning of Internet access lines

https://doi.org/10.1016/S0166-5316(98)00040-6

Abstract

Characterizing Internet traffic is essential to designing the Internet infrastructure. In this paper, we first characterize WWW (World Wide Web) traffic based on the access log data obtained at four different servers. We find that the document size, the request inter-arrival time and the access frequency of WWW traffic follow heavy-tail distributions. Namely, the document size and the request inter-arrival time follow log-normal distributions, and the access frequency follows a Pareto distribution. For the request inter-arrival time, however, an exponential distribution becomes adequate if we are concerned only with the busiest hours. Based on our analytic results, we next build an M/G/1/PS queuing model to discuss a design methodology for the Internet access network. The accuracy of our model is validated by comparison with trace-driven simulation. We also investigate the effect of document caching at the Proxy server on the WWW traffic characteristics. The results show that the traffic volume is indeed reduced by the document replacement policies, but the traffic characteristics are not much affected. This suggests that our modeling approach can also be applied to the case with document caching, which is demonstrated by simulation experiments.

Introduction

The rapid growth of the Internet has led to an increasing number of Internet service providers in Japan. Their number exceeded 1200 at the end of 1997, and the growth rate is estimated to be a hundred percent per month. This enormous growth has a great impact on the Internet infrastructure: backbone lines are being upgraded to larger capacities of 45 Mbps, and new technologies such as CATV Internet are being developed for the access networks. For an efficient design of such an infrastructure, it is essential to study the characteristics of Internet traffic. Accordingly, the characteristics of Internet applications such as ftp and telnet have already been investigated [1], [2], [3], but a newer technology, the WWW (World Wide Web), now generates a major part of Internet traffic. In analyzing the traffic characteristics of the WWW, we first need to note that the WWW carries various kinds of media including text, image, audio, and motion video. Moreover, it is not sufficient to examine only its average behavior to design high-quality networks. As we will see in Section 3, WWW traffic has a long-tail distribution, which implies that the probability density of the document transfer time on the network does not decay quickly, and tail probabilities are much larger than those obtained through a conventional traffic model such as an M/M/1 queuing model.
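
As a rough illustration of why average behavior alone is not enough, the following sketch compares the tail probability of a log-normal distribution with that of an exponential distribution having the same mean. The parameters `MU` and `SIGMA` are hypothetical and are not taken from the log data; the function names are ours.

```python
import math


def exp_tail(x, mean):
    """P(X > x) for an exponential distribution with the given mean."""
    return math.exp(-x / mean)


def lognormal_tail(x, mu, sigma):
    """P(X > x) for a log-normal distribution with log-mean mu and log-std sigma."""
    z = (math.log(x) - mu) / (sigma * math.sqrt(2.0))
    return 0.5 * math.erfc(z)


def lognormal_mean(mu, sigma):
    """Mean of a log-normal distribution."""
    return math.exp(mu + sigma ** 2 / 2.0)


# Hypothetical parameters, chosen only for illustration.
MU, SIGMA = 8.0, 1.5
MEAN = lognormal_mean(MU, SIGMA)

# Compare the two tails at multiples of the common mean.
for k in (1, 5, 10, 20):
    x = k * MEAN
    print(f"x = {k:2d} * mean: exponential {exp_tail(x, MEAN):.3e}, "
          f"log-normal {lognormal_tail(x, MU, SIGMA):.3e}")
```

Near the mean the two tails are of the same order, but at ten or twenty times the mean the log-normal tail is larger by several orders of magnitude under these assumed parameters, which is the regime that matters for large documents.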

In this paper, we first study the characteristics of WWW traffic using the access log data gathered at four different WWW servers. Then, we build an M/G[log-normal]/1/PS (processor sharing) queuing model to discuss a design methodology for the Internet access line. By comparing it with trace-driven simulation, we will show that our model can predict delay distributions for document transfer with reasonable accuracy. We then give some insights into the treatment of multimedia WWW traffic by analytically deriving the document transfer delays for each media type. As is known in the literature, the PS scheduling discipline gives smaller delays than the FIFO discipline to customers with small work demands [4], at the expense of larger delays for those with large work demands. In our context, text files tend to be small, as will be shown later, and are therefore transferred fast. This is desirable, since it is important to obtain the HTML document files quickly when browsing a WWW page. For providing high-quality service, however, the line capacity should be large enough that the WWW user can obtain even a large motion video before the user cancels the retrieval. Our objective is to estimate the required capacity for such networks. Here, we only consider retrievals of motion video; real-time transfer of motion video is outside the scope of the current paper. We then investigate how the characteristics of WWW traffic are affected by document caching, which is widely adopted in today’s Internet access networks. In this case, only documents injected onto the access line due to a cache miss should be taken into account. By simulating three popular cache replacement policies, we examine the traffic characteristics observed on the access line. Then we again use the queuing model and validate its accuracy by comparison with trace-driven simulation.
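
The PS versus FIFO trade-off mentioned above can be sketched with two standard mean-value results: in M/G/1/PS a job with service requirement x has mean sojourn time x/(1 - rho), and in M/G/1/FIFO the mean waiting time is given by the Pollaczek-Khinchine formula. The line capacity, arrival rate and document-size moments below are hypothetical values chosen for illustration, not measurements from the log data.

```python
def ps_mean_delay(size_bits, capacity_bps, load):
    """Mean transfer time of a document of the given size in M/G/1/PS."""
    if not 0.0 <= load < 1.0:
        raise ValueError("load must satisfy 0 <= load < 1")
    return (size_bits / capacity_bps) / (1.0 - load)


def fifo_mean_delay(size_bits, capacity_bps, arrival_rate,
                    mean_bits, second_moment_bits2):
    """Mean delay in M/G/1/FIFO: Pollaczek-Khinchine wait plus own service time."""
    load = arrival_rate * mean_bits / capacity_bps
    if not 0.0 <= load < 1.0:
        raise ValueError("load must satisfy 0 <= load < 1")
    second_moment_service = second_moment_bits2 / capacity_bps ** 2
    wait = arrival_rate * second_moment_service / (2.0 * (1.0 - load))
    return wait + size_bits / capacity_bps


# Hypothetical scenario (all values are assumptions for illustration).
CAPACITY = 1.0e6                        # line capacity in bit/s
RATE = 5.0                              # document requests per second
MEAN_BITS = 1.0e5                       # mean document size in bits
SECOND_MOMENT = 26.0 * MEAN_BITS ** 2   # squared coefficient of variation 25
LOAD = RATE * MEAN_BITS / CAPACITY      # offered load, 0.5 here

for label, size in (("small text file", 1.0e4), ("large video file", 1.0e7)):
    ps = ps_mean_delay(size, CAPACITY, LOAD)
    fifo = fifo_mean_delay(size, CAPACITY, RATE, MEAN_BITS, SECOND_MOMENT)
    print(f"{label}: PS {ps:.2f} s, FIFO {fifo:.2f} s")
```

With these assumed values the small document is transferred far faster under PS than under FIFO, while the large document is slower, which matches the qualitative statement attributed to [4].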

Recently, many papers have been published on the statistical nature of WWW traffic. In [5], [6], Arlitt and Williamson investigated access patterns to WWW servers and found various “invariants” based on the logs recorded at several WWW servers. In contrast, our main subject is to build a WWW traffic model applicable to the Internet access line, which is connected to the backbone network. Thus, we mainly investigate the characteristics of the document sizes and the request arrival patterns. For this, we gathered the data at the Proxy server of Osaka University, which is configured such that HTTP requests from all users in the site are received by the Proxy server and issued to the access line if necessary. Crovella et al. also investigated several characteristics of WWW traffic (e.g., Web document sizes and their popularities) in [7], [8] by using client-based log data. They showed that those characteristics can be well approximated by the Pareto distribution, and that this fact indicates the self-similar nature of WWW traffic. This fact is also pointed out in [5], [6], while in [7], [8] the authors claim that it is not found at some sites. In [9], Almeida et al., using those results and their Web monitoring tool, obtained the service demand of the Web server, which is required when building a queuing model of the Web server. However, all of these studies focused only on the tail portion of the WWW traffic. For example, in the case of the document sizes, the Pareto approximation was never in good agreement for documents smaller than 2.5 Kbyte, although those documents account for more than 50% of all documents, as we will show in Section 3. In contrast, we aim at seeking the best distribution over the whole range, since our main goal is to model WWW traffic for capacity dimensioning of the access line.

In [10], Deng divided the WWW traffic into ON and OFF periods and showed, by using client-based log data, that their lengths follow the Weibull and the Pareto distributions. However, as mentioned above, our target is the capacity dimensioning of the access line, and therefore the result in [10] is not sufficient. We need to know the traffic characteristics on the access line, where the traffic from the users is aggregated. We therefore used the log data of the Proxy server to seek an adequate distribution for the request inter-arrival times on the access line. Furthermore, in the above studies [7], [8], [10], the accuracy of the model was validated only by a fitting test against each distribution. This is not sufficient, because the correlation between the inter-arrival and document-size distributions may influence the applicability of the model. We will discuss the accuracy of our model by comparing it with trace-driven simulation.

Moreover, caching mechanisms have recently been an active research topic [11], [12], [13], [14], [15]. In these studies, various caching algorithms were proposed and their performance was evaluated by using several criteria such as the hit ratio and the weighted hit ratio. More recently, Bowman et al. [16] studied the hierarchical caching system in the Harvest cache project. Then, in [17], [18], Claffy et al. gave some insights into performance and implementation aspects of the hierarchical caching system. Since the main subject of this paper is to investigate how the characteristics of WWW traffic are affected by the caching policies, we only adopt the basic caching policies, as will be explained later. We believe that our study on the effect of caching policies on the characteristics of WWW traffic can help in designing the hierarchical caching system, but further studies are necessary.

The remainder of this paper is organized as follows. We first give a short summary of our access log data in Section 2. In Section 3, we will analyze traffic characteristics of WWW to derive distributions of the document size, the request interval, and the access frequency. An M/G/1/PS queuing model based on the obtained results is then introduced to discuss the delay characteristics of WWW traffic in Section 4. In Section 5, we study the characteristics of WWW traffic with document caching and discuss the applicability of our model to a more practical situation. Finally, we conclude our paper in Section 6 with the summary and future work.

Section snippets

Summary of the access log data

In this section, we summarize the access log data collected at four different WWW servers. See Table 1. The row (I) in the table, the case of Osaka University, is the log data of the Proxy server at the Education Center for Information Processing in Osaka University. All HTTP requests from about 560 workstations are first sent to the Proxy server of this center so that our log data contain all accesses to WWW servers on the Internet. 14% of HTTP requests are destined for the hosts located

Analysis of WWW traffic

In this section, we analyze the characteristics of WWW traffic. We follow the statistical approach described in [2], where the approach was applied to the analysis of the Internet traffic such as telnet and ftp. We first choose several probability distributions, and determine parameters based on the log data. Next, the goodness-of-fit of each model is tested. Finally, the most suitable distribution is selected via the method based on chi-squared examination. In the following sections, we first
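
The fit-then-test procedure outlined in this snippet can be sketched as follows. This is a minimal illustration and not the paper's exact method: it assumes a log-normal candidate fitted by maximum likelihood and a chi-squared statistic over bins that are equiprobable under the fitted model, and it uses synthetic samples with hypothetical parameters in place of the real log data.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def fit_lognormal(samples):
    """Maximum-likelihood log-normal fit: mean and std of the log-samples."""
    logs = [math.log(s) for s in samples]
    return fmean(logs), pstdev(logs)


def chi_squared_statistic(samples, mu, sigma, bins=10):
    """Chi-squared statistic over bins equiprobable under the fitted model."""
    normal = NormalDist(mu, sigma)
    # Interior bin edges on the log scale, at the model's i/bins quantiles.
    edges = [normal.inv_cdf(i / bins) for i in range(1, bins)]
    observed = [0] * bins
    for s in samples:
        x = math.log(s)
        index = sum(1 for e in edges if x > e)  # number of edges below x
        observed[index] += 1
    expected = len(samples) / bins
    return sum((o - expected) ** 2 / expected for o in observed)


# Synthetic "document sizes" standing in for real log data (assumed parameters).
rng = random.Random(1)
sizes = [rng.lognormvariate(8.0, 1.5) for _ in range(5000)]

mu_hat, sigma_hat = fit_lognormal(sizes)
stat = chi_squared_statistic(sizes, mu_hat, sigma_hat)
print(f"mu = {mu_hat:.3f}, sigma = {sigma_hat:.3f}, chi-squared = {stat:.2f}")
```

To compare several candidate distributions, the same statistic would be computed for each fitted candidate and the smallest value preferred; with two estimated parameters and 10 bins, the reference chi-squared distribution has 7 degrees of freedom.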

Delay characteristics of WWW traffic

In the previous section, we found that the document sizes and request intervals of WWW traffic have long-tail distributions, which decay very slowly in their tails. These are quite different from the conventional exponential distribution. In a strict sense, however, the selected distribution is the most suitable only among the distributions that we have tested, and the appropriateness of the chosen model must be evaluated according to the purpose of the study. It is capacity dimensioning in our case, and
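
To make the dimensioning idea concrete, the sketch below inverts the standard M/G/1/PS mean sojourn time to obtain the smallest line capacity that meets a target mean transfer time for a document of a given size. It considers mean delay only, whereas the paper discusses delay distributions; the function name and all numeric values are hypothetical.

```python
def required_capacity_bps(arrival_rate, mean_doc_bits, doc_bits, target_seconds):
    """Smallest capacity for which the M/G/1/PS mean transfer time of a
    doc_bits document is at most target_seconds."""
    if target_seconds <= 0:
        raise ValueError("target_seconds must be positive")
    # T(x) = (x / C) / (1 - rho) with rho = arrival_rate * mean_doc_bits / C
    # simplifies to T(x) = x / (C - arrival_rate * mean_doc_bits),
    # so C = arrival_rate * mean_doc_bits + x / T.
    return arrival_rate * mean_doc_bits + doc_bits / target_seconds


# Hypothetical example: 5 requests/s, 100 kbit mean document size,
# and an 8 Mbit video that should arrive within 10 s on average.
capacity = required_capacity_bps(5.0, 1.0e5, 8.0e6, 10.0)
load = 5.0 * 1.0e5 / capacity
delay = (8.0e6 / capacity) / (1.0 - load)
print(f"capacity = {capacity:.0f} bit/s, load = {load:.3f}, delay = {delay:.2f} s")
```

The first term is the capacity consumed by the offered traffic and the second is the headroom needed for the target document, so the required capacity grows linearly with both the request rate and the size of the largest document that must meet the target.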

Effect of document caching at Proxy server

In the previous section, we did not consider document caching at the Proxy server, in spite of the fact that it can considerably reduce the required capacity of the access line. Recently, active research has been carried out on developing effective document replacement algorithms (see, e.g., [11], [12], [13], [14], [15]). Unlike those studies, however, our main subject in this paper is to investigate how the characteristics of WWW traffic are affected by the
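
A trace-driven cache simulation of this kind can be sketched as below. The replacement policy shown is LRU, which is an assumption on our part since the policies are not named in this snippet; the function name, the `(url, size)` trace format and the tiny trace are likewise hypothetical. The output is the stream of requests that miss the cache and therefore load the access line.

```python
from collections import OrderedDict


def lru_miss_stream(requests, capacity_bytes):
    """Return the (url, size) requests that miss an LRU cache.

    requests is an iterable of (url, size_bytes) pairs in arrival order.
    Each document is assumed to keep the same size throughout the trace.
    """
    cache = OrderedDict()  # url -> size, least recently used first
    used = 0
    misses = []
    for url, size in requests:
        if url in cache:
            cache.move_to_end(url)  # hit: mark as most recently used
            continue
        misses.append((url, size))  # miss: document crosses the access line
        if size > capacity_bytes:
            continue  # too large to be cached at all
        while used + size > capacity_bytes:
            _, evicted_size = cache.popitem(last=False)  # evict LRU entry
            used -= evicted_size
        cache[url] = size
        used += size
    return misses


# Hypothetical trace of (url, size_bytes) pairs and a cache holding 8 bytes.
trace = [("a", 4), ("b", 4), ("a", 4), ("c", 4), ("b", 4), ("a", 4)]
misses = lru_miss_stream(trace, capacity_bytes=8)
requested = sum(size for _, size in trace)
transferred = sum(size for _, size in misses)
print(f"misses: {[url for url, _ in misses]}")
print(f"bytes requested = {requested}, bytes on access line = {transferred}")
```

The sizes and inter-arrival times of the miss stream can then be analyzed in the same way as the uncached trace to see whether their distributions change.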

Conclusion

In this paper, we have first studied the characteristics of WWW traffic using the access log data gathered at four different sites. We have found that WWW traffic exhibits long-tail distributions except for request intervals during the busiest hours. The log-normal distribution is most appropriate for the document size. Then, the access line is modeled by the M/G/1/PS queuing model, and its practical utility is shown by comparing with the trace-driven simulation. We have also demonstrated that

References (25)

  • V. Paxson, S. Floyd, The failure of Poisson modeling, Proc. SIGCOMM’94, 1994, pp....
  • V. Paxson, Empirically derived analytic models of wide-area TCP connections, IEEE/ACM Trans. Networking, 1994
  • W.E. Leland, M.S. Taqqu, W. Willinger, D. Wilson, On the self-similar nature of Ethernet traffic, Proc. SIGCOMM ’93,...
  • S.S. Lavenberg, Computer Performance Modeling Handbook, Academic Press, New York,...
  • M.F. Arlitt, A performance study of Internet Web servers, M.Sc. Thesis, University of Saskatchewan, 1996, available at...
  • M.F. Arlitt, C.L. Williamson, Web server workload characterization: the search for invariants, Proc. ACM SIGMETRICS...
  • M. Crovella, A. Bestavros, Explaining World Wide Web self-similarity, Rep. BU-CS-95-015, Boston University,...
  • C. Cunha, A. Bestavros, M. Crovella, Characteristics of WWW client-based traces, Rep. BU-CS-95-010, Boston University,...
  • J.M. Almeida, V. Almeida, D.J. Yates, Measuring the behavior of a World Wide Web server, Proc. 7th IFIP Conf. on High...
  • S. Deng, Empirical model of WWW Document arrivals at access link, Proc. ICC’96, 1996, pp....
  • S. Williams, M. Abrams, C.R. Standridge, G. Abdulla, E.A. Fox, Removal policies in network caches for World-Wide-Web...
  • M. Baentsch, G. Molter, P. Sturm, Introducing application-level replication and naming into today’s Web, Proc. 5th Int....

    Masahiko Nabe received the M.E. degree in System Engineering from Kobe University, Kobe, Japan, in 1990. In April 1990, he joined Kansai Electric Power Company. He is currently a Ph.D. candidate at the Department of Information and Computer Sciences, Osaka University, Osaka, Japan. His research work is in the area of design of Internet infrastructure.

    Masayuki Murata received the D.E. degree in Information and Computer Sciences from Osaka University, Japan, in 1988. In April 1984, he joined Tokyo Research Laboratory, IBM Japan, as a researcher. From September 1987 to January 1989, he was an Assistant Professor with Computation Center, Osaka University. In February 1989, he moved to the Department of Information and Computer Sciences, Faculty of Engineering Science, Osaka University, and he has been an Associate Professor since December 1992. His research interests include computer communication networks, performance modeling and evaluation, and queueing systems. He is a member of the IEEE and ACM.

    Hideo Miyahara received the D.E. degree from Osaka University, Japan in 1973. From 1973 to 1980, he was an Assistant Professor in Kyoto University. From 1980 to 1986, he was an Associate Professor in the Faculty of Engineering Science, Osaka University. From 1986 to 1989, he was a professor of the Computation Center, Osaka University. Since 1989, he has been a professor in the Faculty of Engineering Science, Osaka University. Since 1998, he has been dean of the Faculty of Engineering Science, Osaka University. His research interests include performance evaluation of computer communication networks and multimedia systems. He is an IEEE fellow.

    An earlier version of this paper was presented at the Performance and Control of Network Systems Conference, W. S. Lai, Hi. Kobayashi (Eds.), Proceedings of SPIE (The International Society for Optical Engineering), vol. 3231, Dallas, TX, 3–5 November 1997.
