Abstract
Clustering plays a central role in segmenting markets. Identifying categories of visitors to a Web site is very useful for improving Web applications. However, the large volume of data involved in mining visitation paths demands efficient clustering algorithms that are also resistant to noise and outliers. Moreover, evaluating the dissimilarity between visitation paths is computationally sophisticated and results in attribute vectors of high dimension. We present a randomized, iterative algorithm (à la Expectation Maximization or k-means) that is based on discrete medoids. We prove that our algorithm converges and that it has subquadratic complexity. We compare it with an implementation of the fastest version of matrix-based clustering for visitor paths and show that our algorithm dramatically outperforms matrix-based methods.
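To make the idea of a randomized, iterative, medoid-based scheme concrete, here is a minimal sketch in Python. It is not the algorithm of the paper, whose details are not given in the abstract. It is a generic alternating scheme in the k-means style: each path is assigned to its nearest medoid, then each medoid is re-estimated from a small random sample of its cluster. The names `path_dissimilarity`, `assign`, `cluster_paths` and the parameter `sample_size` are hypothetical, and the Jaccard distance on sets of visited pages is an assumed stand-in for the more sophisticated path dissimilarity the abstract refers to.

```python
import random


def path_dissimilarity(a, b):
    """Assumed dissimilarity: Jaccard distance between the sets of pages visited."""
    sa, sb = set(a), set(b)
    union = sa | sb
    if not union:
        return 0.0
    return 1.0 - len(sa & sb) / len(union)


def assign(paths, medoids, dist):
    """Label each path with the position (0..k-1) of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dist(p, paths[medoids[c]]))
        for p in paths
    ]


def cluster_paths(paths, k, sample_size=5, max_iter=20, seed=0,
                  dist=path_dissimilarity):
    """Generic randomized medoid iteration (illustrative, not the paper's method)."""
    rng = random.Random(seed)
    # Medoids are indices into `paths`, so they are always actual data items.
    medoids = rng.sample(range(len(paths)), k)
    labels = assign(paths, medoids, dist)
    for _ in range(max_iter):
        changed = False
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                continue

            def cost(m):
                # Total dissimilarity from candidate medoid m to its cluster.
                return sum(dist(paths[m], paths[i]) for i in members)

            # Only a bounded random sample of members is tried as new medoids,
            # so a round never evaluates all pairs inside a cluster.
            candidates = rng.sample(members, min(sample_size, len(members)))
            best = min(candidates + [medoids[c]], key=cost)
            if cost(best) < cost(medoids[c]):
                medoids[c] = best
                changed = True
        if not changed:
            break  # no sampled candidate improved any cluster in this round
        labels = assign(paths, medoids, dist)
    return medoids, labels
```

Calling `cluster_paths(paths, k=2)` on a list of page-name sequences returns the indices of the chosen medoid paths and one cluster label per path. Because candidates are sampled, stopping after a round without improvement is a heuristic and does not certify a local optimum; the convergence and complexity results stated in the abstract apply to the authors' algorithm, not to this sketch.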
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Estivill-Castro, V., Yang, J. (2001). Categorizing Visitors Dynamically by Fast and Robust Clustering of Access Logs. In: Zhong, N., Yao, Y., Liu, J., Ohsuga, S. (eds) Web Intelligence: Research and Development. WI 2001. Lecture Notes in Computer Science, vol 2198. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45490-X_64
DOI: https://doi.org/10.1007/3-540-45490-X_64
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42730-8
Online ISBN: 978-3-540-45490-8
eBook Packages: Springer Book Archive