A novel content classification scheme for web caches

  • Original Paper
  • Journal: Evolving Systems

Abstract

Web caches help reduce user-perceived latency and web traffic congestion. Multi-level classification of web objects in caching is a relatively unexplored area. This paper proposes a novel classification scheme for web cache objects that uses a multinomial logistic regression (MLR) technique. The MLR model is trained to classify web objects using information extracted from web logs. We introduce a novel grading parameter, worthiness, as the key for object classification. Simulations are carried out on datasets generated from real-world trace files, using the classifier in the Least Recently Used-Class Based (LRU-C) and Least Recently Used-Multilevel Classes (LRU-M) cache models. Test results confirm that the proposed model has good online learning and prediction capability and suggest that the proposed approach is applicable to adaptive caching.

Notes

  1. There are six explanatory variables. Hence, we take the next power of 2, which is 8. This gives the flexibility to redefine W according to the application.

  2. Maximum likelihood estimation begins by writing a mathematical expression known as the likelihood function of the sample data. The likelihood of a data set is the probability of obtaining that particular data set under the chosen probability distribution model. This expression contains the unknown model parameters. The parameter values that maximize the sample likelihood are known as the maximum likelihood estimates.

  3. Here, we implement the LRU-C method using binary logistic regression (LR). Hence, the worthiness factor has only two classes, W = 0 and W = 1. Also, we do not consider features from the server's HTTP responses or from the HTML structure of the object.
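
The maximum likelihood idea in note 2 and the two-class case in note 3 can be illustrated with a minimal sketch. It fits a binary logistic regression by gradient ascent on the log-likelihood, which is one common way to maximize it and is not necessarily the procedure used in the paper. The single feature, the toy data, the function names, and the learning-rate and step settings are all hypothetical choices made for illustration; they are not taken from the paper or its trace files.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear_score(beta, x):
    # beta[0] is the intercept; beta[1:] pair with the features in x.
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def log_likelihood(beta, xs, ys):
    # Log of the likelihood function: sum of log P(W = y_i | x_i; beta).
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(linear_score(beta, x))
        p = min(max(p, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_binary_lr(xs, ys, lr=0.1, steps=2000):
    # Gradient ascent on the average log-likelihood (assumed settings).
    beta = [0.0] * (len(xs[0]) + 1)
    n = len(xs)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, y in zip(xs, ys):
            err = y - sigmoid(linear_score(beta, x))
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta

def predict_worthiness(beta, x):
    # Two classes only, as in note 3: W = 1 if P(W = 1 | x) >= 0.5, else W = 0.
    return 1 if sigmoid(linear_score(beta, x)) >= 0.5 else 0

# Hypothetical toy data: one centered feature per object (for example, a
# standardized popularity score) and a made-up worthiness label.
xs = [[-2.5], [-1.5], [-0.5], [0.5], [1.5], [2.5]]
ys = [0, 0, 1, 0, 1, 1]

beta = fit_binary_lr(xs, ys)
print("estimated parameters:", beta)
print("log-likelihood at start:", log_likelihood([0.0, 0.0], xs, ys))
print("log-likelihood after fit:", log_likelihood(beta, xs, ys))
```

The toy labels overlap on purpose (the objects at -0.5 and 0.5 are "misordered"), so the likelihood has a finite maximizer; with perfectly separable data the estimates would grow without bound and some form of regularization or early stopping would be needed.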

Author information

Corresponding author

Correspondence to G. P. Sajeev.

About this article

Cite this article

Sajeev, G.P., Sebastian, M.P. A novel content classification scheme for web caches. Evolving Systems 2, 101–118 (2011). https://doi.org/10.1007/s12530-010-9026-6

  • DOI: https://doi.org/10.1007/s12530-010-9026-6
