Performance improvement of web caching in Web 2.0 via knowledge discovery

https://doi.org/10.1016/j.jss.2013.04.060

Highlights

  • Improvement of web caching in Web 2.0 by fragmenting the contents of the pages.

  • Definition of a framework to adapt fragment designs in content aggregated web pages.

  • The adaptation is done with decision trees obtained through knowledge discovery.

  • Our solution shows shorter latencies than the traditional web cache schemes.

  • The overhead generated by the new approach is low enough to validate the solution.

Abstract

Web 2.0 systems are more unpredictable and customizable than traditional web applications. As a result, performance techniques such as web caching yield limited improvements. Our study was based on the hypothesis that the use of web caching in Web 2.0 applications, particularly in content aggregation systems, can be improved by adapting the content fragment designs. We proposed to base this adaptation on the analysis of the characterization parameters of the content elements and on the creation of a classification algorithm. This algorithm was deployed with decision trees created in an off-line knowledge discovery process. We also defined a framework to create and adapt fragments of the web documents to reduce the user-perceived latency in web caches. The experimental results showed that, in comparison with other web cache schemes, our solution achieved a remarkable reduction in the user-perceived latency despite losses in the cache hit ratios and the overhead generated on the system.

Introduction

Content aggregation systems (CAS) are Web 2.0 applications in which users are able to create their own web pages by aggregating contents. These web pages retrieve content from distributed sources and assemble it into a single web page. Such applications differ from other content management systems (CMS) in that users do not create the content; they only set up the web pages by aggregating content from public services and remote sources (web services, RSS feeds, etc.). The web application retrieves independent pieces of content from these sources, called content elements (CEs).

This type of web page has high update rates, because the content is retrieved from different sources and the page must be updated every time a single source changes. It also has a very high degree of customization.

There are many examples of web applications that fit into the CAS category: social networks, web blogs, feed aggregation tools, etc. We focused our study on personal start-pages (Yahoo! Pipes, iGoogle, Netvibes and PageFlakes).

Depending on the system tier where the assembly of the content elements takes place, the cache manages and stores either whole, indivisible web pages or individual content elements. When the assembly process is done in the web proxy cache, the cache stores content elements (CEs) independently. This benefits the hit ratio of the web cache (only the invalidated content elements are requested from the web server), but it worsens the overhead times of assembling all the content elements. On the other hand, when the assembly takes place in the web application server, if one content element is invalidated the whole web page is also invalidated, and the cache needs to request all the contents of the web page from the server. This reduces the hit ratio, but only one server request is generated, so the overhead times corresponding to server connections and assemblies are shorter.
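To make this tradeoff concrete, the following back-of-the-envelope model (our own illustration, not taken from the article) estimates the expected user-perceived latency of a page as a function of the number of fragments, the per-fragment hit ratio and the assembly overhead. All numeric values are hypothetical placeholders, and missing fragments are assumed to be fetched from the origin server in parallel.

```python
# Illustrative latency model for the fragment-granularity tradeoff.
# Not from the paper; every parameter value is a hypothetical placeholder.

def page_latency(n_fragments: int,
                 fragment_hit_ratio: float,
                 server_fetch_ms: float = 80.0,
                 cache_fetch_ms: float = 5.0,
                 assembly_ms_per_fragment: float = 2.0) -> float:
    """Expected user-perceived latency for a page split into fragments.

    The page is served entirely from the cache only if every fragment
    hits; otherwise at least one (parallel) origin fetch dominates.
    Assembling the fragments adds a per-fragment overhead.
    """
    all_hits = fragment_hit_ratio ** n_fragments
    fetch = all_hits * cache_fetch_ms + (1 - all_hits) * server_fetch_ms
    return fetch + n_fragments * assembly_ms_per_fragment

# Server-side assembly: one indivisible page, so a single invalidated
# content element invalidates everything (low hit ratio), but there is
# no proxy-side assembly overhead.
print(page_latency(1, fragment_hit_ratio=0.40,
                   assembly_ms_per_fragment=0.0))   # ~50 ms

# Proxy-side assembly: ten fragments, each rarely invalidated (high
# per-fragment hit ratio), at the cost of per-fragment assembly work.
print(page_latency(10, fragment_hit_ratio=0.98))    # ~39 ms
```

Under these assumptions, splitting the page pays off only when the per-fragment hit ratio is high enough to offset the extra assembly overhead, which is exactly the balance an adaptive fragment design has to strike.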

Our research focused on creating intermediate schemes in which some of the content elements were assembled in the application server and others in the web proxy cache. Thus, the gains in hit ratio were balanced against the losses in overhead time, obtaining shorter user-perceived latencies. Our hypothesis was that this adaptation can be done with decision trees created in a previous, off-line knowledge discovery process. We proposed to obtain this knowledge from the characteristics of the contents and the performance results of an emulation of synthetic content models. The contributions of this research work are: (a) definition of a framework to adapt the content fragment elements of a web page to reduce the user-perceived latency in content aggregation systems; (b) deployment of the adaptive core of the framework using knowledge discovery techniques, by mining performance data obtained in an off-line process using synthetic content models; (c) experimental validation of the proposed framework and of the use of knowledge discovery in a real system with contents of a real website.
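As an illustration of contribution (b), the sketch below shows how such a classifier could be trained off-line and queried on-line. It assumes scikit-learn, and the characterization parameters (update rate, size, request rate) and training rows are invented for the example; the article does not publish its exact feature set or data.

```python
# Minimal sketch of the adaptive core's classification step.
# Feature names and training rows are invented for illustration.
from sklearn.tree import DecisionTreeClassifier

# Off-line phase: performance data mined from emulations of synthetic
# content models. Each row characterizes one content element; the label
# records the assembly tier that yielded the shortest latency for it.
X_train = [
    # [updates/hour, size_kb, requests/hour]
    [120.0,  4.0, 900.0],
    [  0.5, 40.0, 900.0],
    [ 60.0,  8.0,  50.0],
    [  1.0,  2.0, 700.0],
]
y_train = ["proxy", "server", "server", "proxy"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# On-line phase: classify a live content element to decide whether it
# becomes an independently cacheable fragment (assembled in the proxy)
# or is merged into the server-assembled part of the page.
print(tree.predict([[90.0, 6.0, 400.0]]))
```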

Section snippets

Background and related work

Systems in which the contents have high update rates and high levels of customization reduce the performance of web caching techniques. This is the main obstacle to applying caching in current Web 2.0 systems, and the main motivation of our research.

It is well known that one solution to the problem of web caching in Web 2.0 systems is to reduce the minimum cacheable unit (Yuan et al., 2003). The cache is then able to manage fragments of the web pages instead of complete web pages. In this type of system,

Proposed approach

The COFRADIAS framework (COntent FRagment ADaptation In web Aggregation Systems) is our proposal, which extends the basic scheme of a content aggregation system with an adaptive core. The COFRADIAS framework (Fig. 2) is not only the definition of the interfaces between the new adaptive core and the tiers of a traditional CAS; it is also the design solution adopted to adapt the tiers of a CAS to a new type of element: the content fragments.
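The sketch below shows schematically where such an adaptive core could sit between the tiers; all class and method names are our own invention, as the article specifies these interfaces only at the architectural level (Fig. 2).

```python
# Schematic sketch of the adaptive core between the CAS tiers.
# Class and method names are invented for illustration only.
from dataclasses import dataclass

@dataclass
class ContentElement:
    source_url: str
    update_rate: float   # hypothetical characterization parameters
    size_kb: float
    request_rate: float

class AdaptiveCore:
    """Decides, per content element, in which tier it is assembled."""

    def __init__(self, classifier):
        # e.g., the decision tree trained in the off-line phase above
        self.classifier = classifier

    def fragment_design(self, elements: list[ContentElement]) -> dict[str, str]:
        """Map each content element to 'proxy' or 'server' assembly."""
        return {
            ce.source_url: self.classifier.predict(
                [[ce.update_rate, ce.size_kb, ce.request_rate]]
            )[0]
            for ce in elements
        }
```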

The interaction between the proxy and the

Validation

The validation of our approach is based on comparing the performance results and the overhead of our solution with those of a traditional web cache scheme in content aggregation systems and with solutions proposed in other studies that address the same problem. The results of our approach are compared with one of the traditional web cache schemes for CAS and with the work of Hassan et al. (2010).

In relation to traditional web cache schemes, we have proved in previous results (Guerrero

Conclusions and future work

The work presented in this article aimed to improve the user-perceived latency in Web 2.0 systems based on the aggregation of content from remote sources. The improvement was based on the creation of fragments of the web pages that can be managed independently by the web cache. The content fragment design had to balance the overhead time losses of a design with a large number of fragments against the hit ratio improvement obtained under the same conditions.

We defined the COFRADIAS framework to

Acknowledgments

This work has been financed by the Spanish Ministry of Education and Science through the TIN11-23889 project.


References (26)

  • Y.-F. Huang et al. Mining web logs to improve hit ratios of prefetching and caching. Knowledge-Based Systems (2008).

  • C. Kumar et al. A new approach for a proxy-level web caching mechanism. Decision Support Systems (2008).

  • G. Pallis et al. A clustering-based prefetching scheme on a web cache environment. Computers and Electrical Engineering (2008).

  • F. Benevenuto et al. Characterization and analysis of user profiles in online video sharing systems. Journal of Information and Data Management (2010).

  • F. Benevenuto et al. Characterizing user behavior in online social networks.

  • J. Challenger et al. A fragment-based approach for efficiently creating dynamic web content. ACM Transactions on Internet Technologies (2005).

  • A. Datta et al. Proxy-based acceleration of dynamically generated content on the world wide web: an approach and implementation.

  • P. Gill et al. Youtube traffic characterization: a view from the edge.

  • C. Guerrero et al. The applicability of balanced ESI for web caching – a proposed algorithm and a case of study.

  • C. Guerrero et al. Rule-based system to improve performance on mash-up web applications.

  • C. Guerrero et al. Evaluation of a fragment-optimized content aggregation web system.

  • C. Guerrero et al. Improving web cache performance via adaptive content fragmentation design.

  • O.A.-H. Hassan et al. The MACE approach for caching mashups. International Journal of Web Services Research (2010).

Dr. Carlos Guerrero is an assistant professor of computer architecture and technology at the Computer Science Department of the University of the Balearic Islands. His research interests are web performance, web engineering, web applications, data mining and intelligent systems. He has taken part in seven research projects (national and international) and has published around 12 papers in international conferences and journals. He has been a member of the program committee of several international conferences.

Dr. Isaac Lera is an associate lecturer at the Computer Science Department of the University of the Balearic Islands. Throughout his research activity, he has participated in different local, national and international projects and has published around 10 articles in conferences and journals. He is interested in topics such as system performance, web services, semantic representations, and ontology engineering.

Dr. Carlos Juiz is an associate professor at the University of the Balearic Islands (UIB), Spain. He is co-author of more than 150 papers, published reviews and book chapters. He is a senior member of the IEEE and of the ACM, and a member of the Steering Committee of the Workshop on Software and Performance. He organized the international Workshop on Middleware and Performance (WOMP 2006) and the European Performance Engineering Workshop (EPEW 2008). He is also a member of ARTEMISIA (Advanced Research & Technology for Embedded Intelligent Systems Industrial Association) and of the standardization committee of NESSI (Networked European Software and Services Initiative). Carlos Juiz is an invited expert of the International Telecommunications Union (ITU).
