Abstract
This paper presents a comprehensive survey of web log/usage mining based on over 100 research papers. This is the first survey dedicated exclusively to web log/usage mining. The paper identifies several web log mining sub-topics, including specific ones such as data cleaning and user and session identification. Each sub-topic is explained, its strengths and weaknesses are discussed, and possible solutions are presented. The paper also describes examples of web log mining and lists some major web log mining software packages.
Cite this article
Pabarskaite, Z., Raudys, A. A process of knowledge discovery from web log data: Systematization and critical review. J Intell Inf Syst 28, 79–104 (2007). https://doi.org/10.1007/s10844-006-0004-1