Abstract
We present a new methodology for exploring and analyzing navigation patterns on a web site. The patterns that can be analyzed consist of sequences of URL categories traversed by users. In our approach, we first partition site users into clusters such that users with similar navigation paths through the site are placed in the same cluster. Then, for each cluster, we display the paths of the users in that cluster. The clustering approach we employ is model-based (as opposed to distance-based) and partitions users according to the order in which they request web pages. In particular, we cluster users by learning a mixture of first-order Markov models using the Expectation-Maximization algorithm. The runtime of our algorithm scales linearly with the number of clusters and with the size of the data, and our implementation easily handles hundreds of thousands of user sessions in memory. In this paper, we describe the details of our method and a visualization tool based on it, called WebCANVAS. We illustrate the use of our approach on user-traffic data from msnbc.com.
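To make the clustering step concrete, the following is a minimal sketch of EM for a mixture of first-order Markov chains over sequences of category indices. It is not the authors' implementation: the function name `fit_markov_mixture`, the smoothing pseudo-count `alpha`, the random soft initialization, and the fixed iteration count are all assumptions made for illustration.

```python
import math
import random


def _normalize(row):
    """Scale a list of positive numbers so that it sums to 1."""
    total = sum(row)
    return [x / total for x in row]


def fit_markov_mixture(sequences, n_states, n_clusters,
                       n_iter=50, alpha=0.01, seed=0):
    """Illustrative EM for a mixture of first-order Markov chains.

    sequences: non-empty lists of category indices in [0, n_states).
    alpha: smoothing pseudo-count (an assumption of this sketch).
    Returns (weights, init, trans, resp), where resp[i][k] is the
    posterior probability that sequence i belongs to cluster k.
    """
    rng = random.Random(seed)
    # Random soft assignments break the symmetry between clusters.
    resp = [_normalize([rng.random() + 1e-3 for _ in range(n_clusters)])
            for _ in sequences]

    for _ in range(n_iter):
        # M-step: re-estimate parameters from responsibility-weighted counts.
        weights = _normalize([alpha + sum(r[k] for r in resp)
                              for k in range(n_clusters)])
        init = [[alpha] * n_states for _ in range(n_clusters)]
        trans = [[[alpha] * n_states for _ in range(n_states)]
                 for _ in range(n_clusters)]
        for seq, r in zip(sequences, resp):
            for k in range(n_clusters):
                init[k][seq[0]] += r[k]
                for a, b in zip(seq, seq[1:]):
                    trans[k][a][b] += r[k]
        init = [_normalize(row) for row in init]
        trans = [[_normalize(row) for row in mat] for mat in trans]

        # E-step: posterior over clusters for each sequence, in log space.
        resp = []
        for seq in sequences:
            logp = []
            for k in range(n_clusters):
                lp = math.log(weights[k]) + math.log(init[k][seq[0]])
                lp += sum(math.log(trans[k][a][b])
                          for a, b in zip(seq, seq[1:]))
                logp.append(lp)
            top = max(logp)
            resp.append(_normalize([math.exp(lp - top) for lp in logp]))
    return weights, init, trans, resp


# Hypothetical toy data: two groups visit the same three categories
# but in a different order.
sessions = [[0, 1, 2] * 3] * 10 + [[0, 2, 1] * 3] * 10
weights, init, trans, resp = fit_markov_mixture(sessions, n_states=3,
                                                n_clusters=2)
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
```

Each iteration touches every transition of every sequence once per cluster, so the cost per pass is proportional to the number of clusters times the total sequence length, in line with the linear scaling stated above. The toy sessions share the same categories and differ only in visiting order, which is the kind of distinction an order-sensitive model is meant to capture.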
Cite this article
Cadez, I., Heckerman, D., Meek, C. et al. Model-Based Clustering and Visualization of Navigation Patterns on a Web Site. Data Mining and Knowledge Discovery 7, 399–424 (2003). https://doi.org/10.1023/A:1024992613384