A process of knowledge discovery from web log data: Systematization and critical review

Pabarskaite, Zidrina; Raudys, Aistis

doi:10.1007/s10844-006-0004-1

Published: 28 December 2006

Journal of Intelligent Information Systems, Volume 28, pages 79–104 (2007)

Abstract

This paper presents a comprehensive survey of web log/usage mining based on over 100 research papers. This is the first survey dedicated exclusively to web log/usage mining. The paper identifies several web log mining sub-topics, including specific ones such as data cleaning and user and session identification. Each sub-topic is explained, its strengths and weaknesses are discussed, and possible solutions are presented. The paper describes examples of web log mining and lists some major web log mining software packages.

Author information

Authors and Affiliations

Institute of Mathematics and Informatics, Akademijos 4, Vilnius, 2600, Lithuania
Zidrina Pabarskaite & Aistis Raudys

Corresponding author

Correspondence to Zidrina Pabarskaite.

About this article

Cite this article

Pabarskaite, Z., Raudys, A. A process of knowledge discovery from web log data: Systematization and critical review. J Intell Inf Syst 28, 79–104 (2007). https://doi.org/10.1007/s10844-006-0004-1

Received: 30 November 2003
Revised: 03 April 2005
Accepted: 20 July 2005
Published: 28 December 2006
Issue Date: February 2007
DOI: https://doi.org/10.1007/s10844-006-0004-1
