A novel content classification scheme for web caches

Sajeev, G. P.; Sebastian, M. P.

doi:10.1007/s12530-010-9026-6

Original Paper
Published: 28 December 2010

Evolving Systems, Volume 2, pages 101–118 (2011)

G. P. Sajeev¹ & M. P. Sebastian²

258 Accesses
13 Citations

Abstract

Web caches reduce user-perceived latency and web traffic congestion. Multi-level classification of web objects for caching is a relatively unexplored area. This paper proposes a novel classification scheme for web cache objects that uses multinomial logistic regression (MLR). The MLR model is trained to classify web objects using information extracted from web logs. We introduce a novel grading parameter, worthiness, as the key for object classification. Simulations are carried out on datasets generated from real-world trace files, using the classifier in Least Recently Used-Class Based (LRU-C) and Least Recently Used-Multilevel Classes (LRU-M) cache models. Test results confirm that the proposed model has good online learning and prediction capability and suggest that the approach is applicable to adaptive caching.
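To make the idea of an MLR classifier over web objects concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the two features (standardised access frequency and recency), the three worthiness classes, the toy data, and all function names are assumptions chosen for illustration, and the model is fitted by simple batch gradient descent rather than by whatever estimation procedure the paper uses.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_proba(weights, x):
    """Class probabilities for feature vector x (a bias term is appended)."""
    xb = list(x) + [1.0]
    return softmax([sum(w * v for w, v in zip(row, xb)) for row in weights])

def train_mlr(samples, labels, n_classes, lr=0.5, epochs=500):
    """Fit multinomial logistic regression by batch gradient descent
    on the mean negative log-likelihood."""
    n_features = len(samples[0]) + 1  # +1 for the bias term
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, y in zip(samples, labels):
            probs = predict_proba(weights, x)
            xb = list(x) + [1.0]
            for k in range(n_classes):
                err = probs[k] - (1.0 if k == y else 0.0)
                for j in range(n_features):
                    grad[k][j] += err * xb[j]
        for k in range(n_classes):
            for j in range(n_features):
                weights[k][j] -= lr * grad[k][j] / len(samples)
    return weights

def classify(weights, x):
    """Return the index of the most probable class."""
    probs = predict_proba(weights, x)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical toy data: [standardised access frequency, standardised recency].
# Labels are assumed worthiness classes: 0 (low), 1 (medium), 2 (high).
samples = [[-1.0, -0.9], [-0.9, -1.0], [-1.1, -1.0],
           [0.0, 0.1], [0.1, -0.1], [-0.1, 0.0],
           [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

weights = train_mlr(samples, labels, n_classes=3)
predictions = [classify(weights, x) for x in samples]
```

In a cache simulation, a predicted class such as `classify(weights, features_of_object)` could then drive the admission or eviction decision; how LRU-C and LRU-M actually consume the class is described in the full paper, not here.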

Notes

There are six explanatory variables, so we take the nearest power of 2, which is 8. This gives the flexibility to redefine W according to the application.
Maximum likelihood estimation begins with writing a mathematical expression known as the likelihood function of the sample data. The likelihood of a set of data is the probability of obtaining that particular set of data, given the chosen probability distribution model. This expression contains the unknown model parameters. The values of these parameters that maximize the sample likelihood are known as the maximum likelihood estimates.
Here, we implement the LRU-C method using binary logistic regression (LR). Hence, the worthiness factor has only two classes: W = 0 and W = 1. Also, we do not consider features from the server's HTTP responses or the HTML structure of the object.
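
The likelihood function described in the second note can be written out for a multinomial logistic model as follows. The notation here is assumed for illustration and is not taken from the paper: n observations with feature vectors x_i and observed worthiness classes w_i, K classes indexed 0 to K-1, and coefficient vectors β_k with class 0 as the reference (β_0 = 0).

```latex
% Class probabilities under the multinomial logistic model (assumed notation)
P(W = k \mid x_i) = \frac{\exp(\beta_k^{\top} x_i)}{\sum_{j=0}^{K-1} \exp(\beta_j^{\top} x_i)},
\qquad \beta_0 = 0 .

% Likelihood of the sample and its logarithm
L(\beta) = \prod_{i=1}^{n} \prod_{k=0}^{K-1} P(W = k \mid x_i)^{\mathbf{1}[w_i = k]},
\qquad
\ell(\beta) = \sum_{i=1}^{n} \sum_{k=0}^{K-1} \mathbf{1}[w_i = k] \, \log P(W = k \mid x_i) .

% Maximum likelihood estimate
\hat{\beta} = \arg\max_{\beta} \, \ell(\beta) .

% Binary special case (K = 2), as in the two-class setting of the third note
P(W = 1 \mid x_i) = \frac{1}{1 + \exp(-\beta_1^{\top} x_i)} .
```

With K = 2 the sum over classes collapses to the familiar binary logistic log-likelihood, which is the setting the third note describes for W = 0 and W = 1.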

Author information

Authors and Affiliations

Government Engineering College Kozhikode, WestHill, Kozhikode, 673005, Kerala, India
G. P. Sajeev
Indian Institute of Management Kozhikode (IIMK), Calicut, 673570, Kerala, India
M. P. Sebastian

Corresponding author

Correspondence to G. P. Sajeev.

About this article

Cite this article

Sajeev, G.P., Sebastian, M.P. A novel content classification scheme for web caches. Evolving Systems 2, 101–118 (2011). https://doi.org/10.1007/s12530-010-9026-6

Received: 08 August 2010
Accepted: 02 December 2010
Published: 28 December 2010
Issue Date: June 2011
DOI: https://doi.org/10.1007/s12530-010-9026-6
