Categorizing Visitors Dynamically by Fast and Robust Clustering of Access Logs

  • Conference paper
Web Intelligence: Research and Development (WI 2001)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2198)

Abstract

Clustering plays a central role in segmenting markets. Identifying categories of visitors to a Web site is very useful for improving Web applications. However, the large volume of data involved in mining visitation paths demands efficient clustering algorithms that are also resistant to noise and outliers. In addition, measuring dissimilarity between visitation paths requires sophisticated evaluation and results in attribute vectors of high dimension. We present a randomized, iterative algorithm (à la Expectation Maximization or k-means) that is based on discrete medoids. We prove that our algorithm converges and that it has subquadratic complexity. We compare it with an implementation of the fastest version of matrix-based clustering for visitor paths and show that our algorithm dramatically outperforms matrix-based methods.
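
The abstract does not give the algorithm itself, so the following is only a minimal sketch of the general idea it names: a k-means-style loop whose cluster representatives are discrete medoids (actual visitation paths) chosen from a random sample of candidates. It is not the authors' method. The Jaccard distance on page sets, the function names, and the parameters `sample_size`, `max_iter`, and `seed` are all assumptions made for illustration.

```python
import random


def path_dissimilarity(a, b):
    """Jaccard distance between the page sets of two paths.

    This is an assumed stand-in; the paper's dissimilarity is not
    specified in the abstract.
    """
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def assign(paths, medoids, dist):
    """Label each path with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda j: dist(p, paths[medoids[j]]))
        for p in paths
    ]


def randomized_medoid_clustering(paths, k, dist=path_dissimilarity,
                                 sample_size=10, max_iter=50, seed=0):
    """Hypothetical k-medoids-style loop with sampled medoid candidates."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(paths)), k)  # indices into paths
    labels = assign(paths, medoids, dist)
    for _ in range(max_iter):
        improved = False
        for j in range(k):
            members = [i for i, lab in enumerate(labels) if lab == j]
            if not members:
                continue

            def cost(c):
                # Total dissimilarity from candidate c to the cluster members.
                return sum(dist(paths[c], paths[i]) for i in members)

            # Only a bounded random sample of members is tried as medoid,
            # which keeps one pass linear in the number of paths.
            candidates = rng.sample(members, min(sample_size, len(members)))
            best = min(candidates, key=cost)
            if cost(best) < cost(medoids[j]):  # accept strict improvements only
                medoids[j] = best
                improved = True
        if not improved:
            break
        labels = assign(paths, medoids, dist)
    return medoids, labels


# Toy visitation paths (invented data): three shop visits, three blog visits.
example_paths = [
    ["home", "products", "cart"],
    ["home", "products", "checkout"],
    ["home", "products", "cart", "checkout"],
    ["blog", "post1", "post2"],
    ["blog", "post1", "about"],
    ["blog", "post1", "post2", "about"],
]
example_medoids, example_labels = randomized_medoid_clustering(example_paths, k=2)
```

Because a medoid is replaced only when the total dissimilarity strictly decreases, and reassignment never increases it, this loop terminates. Restricting medoids to observed paths means only pairwise dissimilarities are needed, never an average of paths, and it also limits the pull of outliers.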

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Estivill-Castro, V., Yang, J. (2001). Categorizing Visitors Dynamically by Fast and Robust Clustering of Access Logs. In: Zhong, N., Yao, Y., Liu, J., Ohsuga, S. (eds) Web Intelligence: Research and Development. WI 2001. Lecture Notes in Computer Science, vol 2198. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45490-X_64

  • DOI: https://doi.org/10.1007/3-540-45490-X_64

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42730-8

  • Online ISBN: 978-3-540-45490-8

  • eBook Packages: Springer Book Archive