Web log data warehousing and mining for intelligent web caching
Introduction
If data mining is aimed at discovering regularities and patterns hidden in data, the emerging area of web mining is aimed at discovering regularities and patterns in the structure and content of web resources, as well as in the way web resources are accessed and used [11], [19], [25], [26], [27], [32], [41], [42]. In this paper we describe one particular data/web mining application based on data warehouse technology: the development of an intelligent web caching architecture, capable of adapting its behavior on the basis of the access patterns of the clients/users. Such usage patterns, or models, are extracted from the historical access data recorded in log files, by means of data mining techniques.
More precisely, the idea is to extend the least recently used (LRU) cache replacement policy adopted by web and proxy servers by making it sensitive to web access models extracted from web log data. To this end, we introduce several ways to construct intelligent web caching algorithms that employ predictive models of web requests. The goal of these algorithms is to maximize the so-called hit rate, namely the percentage of requested web entities that are served directly from the cache, without fetching them again from the origin server.
The general idea is motivated by the following observation. The LRU policy – drop from cache the least recently used entities – is based on the assumption that requests that occurred in the recent past are likely to occur in the near future too. This assumption is often true in practice, which explains why LRU is effective, in particular when requests are characterized by temporal locality, as is the case for web requests [7], [9], [12], [24], [30]. Now, the more information we can extract from the history of recent requests, the more informed the cache replacement strategies that can be devised. This clearly suggests mining the web log data for access models to be employed in the replacement strategy.
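As a baseline for what follows, plain LRU and its hit rate can be sketched in a few lines (an illustrative simulation, not code from the paper):

```python
from collections import OrderedDict

def lru_hit_rate(requests, capacity):
    """Simulate a plain LRU cache over a request trace and return its hit rate.

    `requests` is a sequence of entity identifiers (e.g. URLs); `capacity`
    is the number of entities the cache can hold.
    """
    cache = OrderedDict()  # keys kept in least- to most-recently-used order
    hits = 0
    for entity in requests:
        if entity in cache:
            hits += 1
            cache.move_to_end(entity)      # refresh recency on a hit
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used
            cache[entity] = True
    return hits / len(requests)
```

On the trace `a, b, a, c, a, b` with capacity 2, the repeated requests for `a` hit thanks to temporal locality, giving a hit rate of 1/3.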
When compared with the many alternatives and variations to LRU caching presented in the literature, our approach has a unique feature: its adaptiveness to changes in usage patterns, which are natural on the web. This is because the proposed caching strategies are parametric w.r.t. the data mining models, which can be recomputed periodically in order to keep track of the recent past.
We adopt two data mining techniques, which yield two classes of web caching algorithms: frequent patterns and decision trees. In the first case, we extract from past web logs patterns of the form A→B, where A and B are web entities: such a pattern means that when A is requested, then B is also likely to be requested within the same user session. A pattern A→B may be used within the cache replacement algorithm as follows: if A is requested, and is therefore assigned a high priority by the LRU principle, then B should be treated analogously.
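A minimal sketch of this idea, assuming the patterns are given as a map from A to its related entities B (in the paper this map would come from frequent-pattern mining over the logs; with an empty map the sketch reduces to plain LRU):

```python
from collections import OrderedDict

def pattern_lru_hit_rate(requests, capacity, patterns):
    """LRU extended with frequent patterns: `patterns` maps an entity A to
    entities B likely to be requested in the same session, so a request
    for A also refreshes the recency of any cached B (the A -> B rule)."""
    cache = OrderedDict()
    hits = 0
    for entity in requests:
        if entity in cache:
            hits += 1
            cache.move_to_end(entity)
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)
            cache[entity] = True
        for related in patterns.get(entity, ()):  # apply the A -> B rules
            if related in cache:
                cache.move_to_end(related)        # treat B like A (LRU boost)
    return hits / len(requests)
```

On the trace `b, a, c, b` with capacity 2, plain LRU (empty pattern map) misses every request, while the rule A→B keeps `b` cached across the request for `c`, so the final request hits.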
In the second case, we develop, on the basis of the web logs, a decision tree, i.e., a model that, given a request for a web object A, predicts the time of the next request for A from other properties of A itself. Again, the prediction is based on the historical data contained in the web logs.
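To illustrate the interface such a model offers to the cache, here is a hand-written stand-in predictor and the eviction rule it enables; the features and thresholds below are invented for illustration, whereas the paper induces the actual tree from log data:

```python
def predicted_next_access(props, now):
    """Toy stand-in for the induced decision tree: predict when an entity
    will next be requested from simple properties of the entity.
    All thresholds here are illustrative, not the paper's."""
    if props["requests_last_hour"] >= 10:  # hot entity: expected back soon
        return now + 60
    if props["content_type"] == "image":   # embedded objects return with pages
        return now + 600
    return now + 3600                      # cold entity: far in the future

def choose_victim(cache_props, now):
    """Evict the cached entity predicted to be needed farthest in the
    future, approximating the clairvoyant off-line strategy."""
    return max(cache_props,
               key=lambda e: predicted_next_access(cache_props[e], now))
```

Given three cached entities where `a` is hot, `b` is an image and `c` is cold, the rule evicts `c`, the one predicted to be requested last.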
A prerequisite of either technique is the development of a data warehouse of log data, as raw log files are a thoroughly unsatisfactory starting point for data mining. We therefore developed a fully automated acquisition process which migrates log data into a carefully designed data mart, oriented to the analysis required for intelligent web caching. In the data mart, the web log information is consolidated, cleaned, selected and prepared for the data mining analysis; for instance, new derived attributes are introduced, and missing values for attributes are approximated.
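As an illustration of this kind of preprocessing, the sketch below cleans one line in Apache's Common Log Format, approximates a missing size, and derives an extra attribute; the paper's actual data-mart schema and source fields differ:

```python
import re
from datetime import datetime

# Apache Common Log Format, one typical raw layout such an acquisition
# process must handle (the fields assumed here may differ from the paper's).
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

def to_fact(line):
    """Turn one raw log line into a cleaned data-mart record: parse the
    timestamp, approximate a missing size with 0, and derive the file
    extension as an extra attribute for mining."""
    m = LOG_RE.match(line)
    if m is None:
        return None  # cleaning step: drop malformed lines
    size = 0 if m["size"] == "-" else int(m["size"])  # approximate missing value
    ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
    ext = m["url"].rsplit(".", 1)[-1].lower() if "." in m["url"] else ""
    return {"host": m["host"], "timestamp": ts, "url": m["url"],
            "status": int(m["status"]), "size": size, "extension": ext}
```

A full acquisition process would apply such a transformation to every line, then load the resulting records into the fact table of the data mart.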
As a final step in the knowledge discovery process, we designed a reference web caching model as a means to evaluate the models extracted by data mining. The architecture, which emulates a cache and is parametric in the replacement strategy, supports the evaluation and comparison of the various replacement policies according to two selected metrics: the hit rate and the weighted hit rate, which also takes into account the size of the requested entities.
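The two metrics are straightforward to compute from a simulated trace; a sketch, assuming the trace records each request's entity size and whether it hit in cache:

```python
def hit_rates(trace):
    """Compute the two evaluation metrics from a simulated trace, given as
    (size_in_bytes, was_hit) pairs: the plain hit rate counts requests
    uniformly, while the weighted hit rate weights each request by the
    size of the requested entity."""
    hits = sum(1 for _, hit in trace if hit)
    hit_bytes = sum(size for size, hit in trace if hit)
    total_bytes = sum(size for size, _ in trace)
    return hits / len(trace), hit_bytes / total_bytes
```

The two metrics can diverge sharply: a policy that favors many small entities scores well on hit rate but poorly on weighted hit rate, and vice versa.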
The overall process, from log data acquisition to model extraction up to evaluation by the web cache architecture, is formalized and implemented within a database management system, Microsoft's SQL Server 2000 [34]. The prototype system was used for a first round of extensive experiments over large log files from two web servers. A thorough presentation of the experimental results is outside the scope of this paper; moreover, further benchmarking is needed and is currently being pursued. However, the performance figures obtained so far indicate that the developed methods, compared with LRU, substantially increase the hit rate, and outperform many traditional caching strategies on this metric. This indication, together with the adaptiveness of data-mining-based caching, definitely motivates further study of this subject.
Section snippets
Web caching
Web caching inherits several techniques and issues from caching in processor memory and file systems [10]. However, the peculiarities of the web give rise to a number of novel issues which call for adequate solutions.
Caches can be deployed in different ways, which can be classified depending on where the cache is located within the network. The spectrum of possibilities ranges from caches close to the client (browser caching, proxy server caching) to caches close to the origin server (web
A data mart of web log data
We have developed a data mart for web logs specifically to support intelligent caching strategies. The data mart is populated starting from a web log data warehouse (such as those described in [18], [31], [35]) or, more simply, from raw web/proxy server log files that we assume contain some very basic fields. The data mart population consists of a number of preprocessing and coding steps that perform data selection, cleaning and transformation. The data mart has been implemented as a
Deploying the reference model with data mart data
We present in this section two instantiations of the general intelligent caching model of Fig. 1, the first one based on frequent patterns, and the second one on decision trees. For each of them, a brief introduction to the general mining task is given, followed by the description of its application to the caching strategy and a summary of the results of experimental simulations.
For both approaches, the results of simulations are obtained by building and then simulating an extended LRU over
Conclusions
We have presented two approaches to enhance LRU-based web server caching with data mining models (frequent patterns and decision trees) built on historical data. Also, the design of a suitable data mart has been presented, together with the main problems that such a design must solve. The performance figures of the developed methods, compared with LRU on one side and the theoretical off-line strategy ORCL on the other, indicate a substantial increase in the hit rate. Also, the
Acknowledgements
Research partially supported by FST – Fabbrica Servizi Telematici, Cagliari, Italy, under grant no. FR-22-2030 (Web mining project MineFaST). We also acknowledge support from Microsoft Research, Cambridge, UK, under grant no. 2000-23. We are grateful to the team at FST Research: R. Fenu, M. Magini, O. Murru, L. Petrella and L. Sannais, for giving us the opportunity to conduct this research, and for many useful discussions.
References (47)
- et al., Caching proxies: limitations and potentials
- C.C. Aggarwal, J.L. Wolf, P.S. Yu, Caching policies for web objects, Technical Report RC20619, IBM Research Division, ...
- et al., On disk caching of web objects in proxy servers
- et al., Fast algorithms for mining association rules in large databases
- et al., On-line algorithms, ACM Computing Surveys (1999)
- et al., Performance evaluation of Web proxy cache replacement policies
- et al., Internet Web servers: workload characterization and performance implications, IEEE/ACM Transactions on Networking (1997)
- et al., A unified approach to approximating resource allocation and scheduling
- et al., Changes in Web client access patterns: characteristics and caching implications, World Wide Web (1999)
- A study of replacement algorithms for virtual storage computers, IBM Systems Journal (1966)
- Analysis of navigation behaviour in web sites integrating multiple information systems, VLDB Journal
- Web caching and Zipf-like distributions: evidence and implications
- Maintaining strong cache consistency in the world wide web, IEEE Transactions on Computers
- GreedyDual-Size: a cost-aware WWW proxy caching algorithm
- Active cache: caching dynamic contents on the web
- Data mining for traversal patterns in a web environment
- Grouping web page references into transactions for mining world wide web browsing patterns
- Data preparation for mining world wide web browsing patterns, Knowledge and Information Systems
- Discovery of interesting usage patterns from web data
- A survey of proxy cache evaluation techniques
- Improving proxy cache performance: analysis of three replacement policies, IEEE Internet Computing
Francesco Bonchi holds a Laurea degree in Computer Science (University of Pisa, 1998). Since 1999 he has been a Ph.D. student in Computer Science at the University of Pisa. He has been a visiting fellow at the Kanwal Rekhi School of Information Technology, Indian Institute of Technology, Bombay (2000). His current research interests are data mining query optimization, meta learning and web mining.
Fosca Giannotti was born in 1958 in Italy. She graduated in Computer Science in 1982 at the University of Pisa. From 1982 to 1985 she was a research assistant, Dipartimento di Informatica, Università di Pisa. From 1985 to 1989 she was a senior researcher at R&D Labs of Sipe Optimization and Systems and Management, Pisa. In 1989–1990 she was a visiting researcher of MCC, Austin, TX, USA, involved in the LDL (Logic Database Language) project. She is currently a senior researcher at CNUCE, Institute of CNR (Italian National Research Council) in Pisa. Her current research interests include knowledge discovery and data mining, spatio-temporal reasoning, and database programming languages design, implementation, and formal semantics, especially in the field of logic database languages.
Cristian Gozzi was born in 1974 in Lucca, Italy. He graduated in Computer Science (Laurea in Informatica) at the University of Pisa in October 2000. Since November 2000 he has been working with the Pisa KDD Laboratory at CNUCE-CNR as a contributor to the MineFaST research project, aimed at the analysis of web access data and the development of intelligent caching techniques. His current research interests are in the areas of databases, data mining and knowledge discovery.
Giuseppe Manco holds a Laurea degree (University of Pisa, 1994) in Computer Science and a Ph.D. in Computer Science (University of Pisa, 2001). He is currently senior researcher at the Institute for Systems Analysis and Information Technology of the National Research Council of Italy. He has been contract researcher at the CNUCE Institute in Pisa (April 1999–January 2001), and visiting fellow at the CWI Institute in Amsterdam (1998). His current research interests include deductive databases, knowledge discovery in databases and data mining, web databases and semi-structured data management.
Mirco Nanni holds a Laurea degree in Computer Science (University of Pisa, 1997). Since 1998 he has been a Ph.D. student in Computer Science at the University of Pisa. From April to June 1999 he was a visiting fellow in College Park – University of Maryland, working on probabilistic agent programming. His current research interests include deductive databases, data mining and knowledge discovery, distributed interactive systems and agent programming.
Dino Pedreschi was born in 1958 in Italy, and holds a Ph.D. in Computer Science from the University of Pisa, obtained in 1987. He is currently a full professor at the Dipartimento di Informatica of the University of Pisa. He has been a visiting scientist at the University of Texas at Austin (1989–1990), at CWI Amsterdam (1993) and at UCLA (1995). His current research interests are in logic in databases, and particularly in data analysis, deductive databases, the integration of data mining and database querying, spatio-temporal reasoning, and formal methods for deductive computing.
Chiara Renso was born in Italy in 1968 and holds a Masters degree and a Ph.D. in Computer Science (University of Pisa, 1992 and 1997). She is currently a researcher at CNUCE Institute of CNR, Italy. She has been working on extensions of logical languages with modularity, defining the language MedLan as a proposal to perform semantic integration of different data sources. Her current research interests are: extensions to logic programming to perform spatio-temporal and uncertain reasoning on geographical data and use of web mining techniques for intelligent web caching.
Salvatore Ruggieri holds a Laurea and a Ph.D. in Computer Science (University of Pisa, 1994 and 1999). He is currently a researcher at the Dipartimento di Informatica of the University of Pisa. He has been an ERCIM fellow at the Rutherford Appleton Laboratory at Oxford (1995). His current research interests are data analysis in deductive databases, (parallel) tree-induction algorithms for data mining, formal methods in logic programming and intelligent multimedia presentation systems.