Categorizing Visitors Dynamically by Fast and Robust Clustering of Access Logs

Estivill-Castro, Vladimir; Yang, Jianhua

doi:10.1007/3-540-45490-X_64


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2198)

Included in the following conference series:

Asia-Pacific Conference on Web Intelligence


Abstract

Clustering plays a central role in segmenting markets. Identifying categories of visitors to a Web site is very useful for improving Web applications. However, the large volume of data involved in mining visitation paths demands efficient clustering algorithms that are also resistant to noise and outliers. Moreover, measuring dissimilarity between visitation paths requires sophisticated evaluation and results in attribute vectors of large dimension. We present a randomized, iterative algorithm (à la Expectation Maximization or k-means) based on discrete medoids. We prove that our algorithm converges and that it has subquadratic complexity. We compare it with an implementation of the fastest version of matrix-based clustering for visitor paths and show that our algorithm dramatically outperforms matrix-based methods.
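To make the idea of an iterative, randomized, medoid-based clustering concrete, the following is a minimal sketch and not the authors' algorithm, whose details are not given in the abstract. It alternates a k-means-style assignment step with a medoid update that only evaluates a random sample of candidates per cluster. Everything named here is an assumption made for illustration: the functions `jaccard_dissimilarity` and `cluster_paths`, the parameters `sample_size` and `max_iter`, and the choice of Jaccard dissimilarity over sets of visited pages as a stand-in for the paper's more sophisticated path dissimilarity.

```python
import random


def jaccard_dissimilarity(path_a, path_b):
    """Assumed stand-in measure: 1 - |A & B| / |A | B| over visited pages."""
    a, b = set(path_a), set(path_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def assign(paths, medoids, dissim):
    """Label each path with the index of its nearest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: dissim(p, paths[medoids[c]]))
        for p in paths
    ]


def cluster_paths(paths, k, dissim=jaccard_dissimilarity,
                  sample_size=10, max_iter=50, seed=0):
    """Hypothetical sketch of randomized, iterative medoid clustering.

    Medoids are always actual paths (discrete medoids), so only a
    dissimilarity function is needed, never a vector mean.
    """
    rng = random.Random(seed)
    medoids = rng.sample(range(len(paths)), k)  # indices into paths
    for _ in range(max_iter):
        labels = assign(paths, medoids, dissim)
        changed = False
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                continue

            def cost(m):
                # Total dissimilarity from candidate medoid m to the cluster.
                return sum(dissim(paths[m], paths[i]) for i in members)

            # Randomization: score only a sample of candidate medoids,
            # so one pass avoids the full pairwise dissimilarity matrix.
            candidates = rng.sample(members, min(sample_size, len(members)))
            best = min(candidates, key=cost)
            if cost(best) < cost(medoids[c]):
                medoids[c] = best
                changed = True
        if not changed:
            break
    return medoids, assign(paths, medoids, dissim)


if __name__ == "__main__":
    # Made-up visitation paths, purely for illustration.
    demo = [
        ["home", "news", "sport"],
        ["home", "news", "sport", "weather"],
        ["home", "news"],
        ["shop", "cart", "checkout"],
        ["shop", "cart"],
        ["shop", "cart", "checkout", "pay"],
    ]
    print(cluster_paths(demo, k=2))
```

In this sketch a medoid is replaced only when the sampled candidate strictly lowers its cluster's total dissimilarity, and for fixed `k` and `sample_size` each pass evaluates a number of dissimilarities linear in the number of paths. A medoid, unlike a mean, is less easily dragged toward outliers, which is the usual motivation for medoid-based methods when robustness matters.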



Author information

Authors and Affiliations

Department of Computer Science & Software Engineering, The University of Newcastle, Callaghan, NSW, 2308, Australia
Vladimir Estivill-Castro & Jianhua Yang


Editor information

Editors and Affiliations

Department of Systems and Information Engineering, Maebashi Institute of Technology, 460-1 Kamisadori-Cho, Maebashi-City, 371-0816, Japan
Ning Zhong
Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada, S4S 0A2
Yiyu Yao
Department of Computer Science, Hong Kong Baptist University, 224 Waterloo Road, Kowloon, Hong Kong, China
Jiming Liu
Department of Information and Computer Science, Waseda University, 3-4-1 Okubo Shinjuku-Ku, Tokyo, 169, Japan
Setsuo Ohsuga

About this paper

Cite this paper

Estivill-Castro, V., Yang, J. (2001). Categorizing Visitors Dynamically by Fast and Robust Clustering of Access Logs. In: Zhong, N., Yao, Y., Liu, J., Ohsuga, S. (eds) Web Intelligence: Research and Development. WI 2001. Lecture Notes in Computer Science, vol 2198. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45490-X_64

DOI: https://doi.org/10.1007/3-540-45490-X_64
Published: 19 October 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42730-8
Online ISBN: 978-3-540-45490-8
eBook Packages: Springer Book Archive
