A process of knowledge discovery from web log data: Systematization and critical review

Journal of Intelligent Information Systems

Abstract

This paper presents a comprehensive survey of web log/usage mining based on over 100 research papers, and it is the first survey dedicated exclusively to this topic. The paper identifies several web log mining sub-topics, including specific ones such as data cleaning and user and session identification. Each sub-topic is explained, its strengths and weaknesses are discussed, and possible solutions are presented. The paper also describes examples of web log mining and lists some major web log mining software packages.
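
To make the sub-topics named above more concrete, the following is a minimal sketch of data cleaning followed by user and session identification on a web server log. It is not taken from the paper. It assumes the log is in Common Log Format, that a user can be approximated by the client host field, that requests for images and style files and non-2xx responses count as noise, and that 30 minutes of inactivity ends a session. The names `parse_and_clean`, `sessionize`, `IGNORED_SUFFIXES` and `SESSION_TIMEOUT` are hypothetical and exist only for this illustration.

```python
import re
from collections import defaultdict
from datetime import datetime, timedelta

# Common Log Format: host ident authuser [timestamp] "request" status bytes
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) \S+'
)

# Assumptions for this sketch, not values prescribed by the survey.
IGNORED_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")
SESSION_TIMEOUT = timedelta(minutes=30)


def parse_and_clean(lines):
    """Data cleaning: yield (host, timestamp, url) for page requests only."""
    for line in lines:
        m = CLF.match(line)
        if m is None:
            continue  # malformed line
        if not m["status"].startswith("2"):
            continue  # failed or redirected request
        if m["url"].lower().endswith(IGNORED_SUFFIXES):
            continue  # embedded resource, not a page view
        ts = datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z")
        yield m["host"], ts, m["url"]


def sessionize(records):
    """User and session identification: group by host, split on inactivity."""
    by_user = defaultdict(list)
    for host, ts, url in records:
        by_user[host].append((ts, url))

    sessions = []
    for host, hits in by_user.items():
        hits.sort()  # chronological order per user
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > SESSION_TIMEOUT:
                sessions.append((host, current))
                current = []
            current.append(url)
        sessions.append((host, current))
    return sessions


# Made-up log lines, for illustration only.
sample = [
    '10.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2000:13:55:37 -0700] "GET /logo.gif HTTP/1.0" 200 512',
    '10.0.0.1 - - [10/Oct/2000:13:58:00 -0700] "GET /products.html HTTP/1.0" 200 4100',
    '10.0.0.1 - - [10/Oct/2000:15:10:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.2 - - [10/Oct/2000:13:56:00 -0700] "GET /missing.html HTTP/1.0" 404 0',
    '10.0.0.2 - - [10/Oct/2000:13:57:00 -0700] "GET /index.html HTTP/1.0" 200 2326',
]
sessions = sessionize(parse_and_clean(sample))
for host, pages in sessions:
    print(host, pages)
```

Treating one host as one user is a deliberate simplification: proxies and shared addresses can merge several people into one host, and caching can hide page views entirely, so a real pipeline would likely combine the host with further evidence such as cookies or the user agent string.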

Author information

Corresponding author

Correspondence to Zidrina Pabarskaite.

About this article

Cite this article

Pabarskaite, Z., Raudys, A. A process of knowledge discovery from web log data: Systematization and critical review. J Intell Inf Syst 28, 79–104 (2007). https://doi.org/10.1007/s10844-006-0004-1

  • DOI: https://doi.org/10.1007/s10844-006-0004-1
