Abstract
Most existing algorithms cluster highly correlated data objects (e.g., web pages and web queries) separately. Other algorithms do take the relationships between data objects into account, but they either merge content and link features into a single unified feature space or apply a hard clustering algorithm, which makes it difficult to fully exploit the correlated information across heterogeneous Web objects. In this paper, we propose a novel unified probabilistic framework that iteratively clusters correlated heterogeneous data objects until convergence. Our approach introduces two latent clustering layers, which serve as two probabilistic mixture models of the features. In each clustering iteration, we use the EM (Expectation-Maximization) algorithm to estimate the parameters of the mixture model in one latent layer and then propagate them to the other layer. Experimental results show that our approach effectively combines content and link features and improves clustering performance.
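The alternating scheme described in the abstract can be illustrated with a minimal sketch. This is not the model from the paper: it assumes multinomial mixtures, treats content and link counts as separate views sharing one latent cluster variable, and propagates soft memberships from one layer to the other by projecting the link matrix onto them. All names (`em_soft_cluster`, `cocluster`, the toy matrices) are hypothetical and chosen only for illustration.

```python
import numpy as np

def em_soft_cluster(views, n_clusters, n_iter=30, seed=0, eps=1e-2):
    """Soft-cluster rows with EM for a multinomial mixture (assumed model).

    views: list of nonnegative (n_rows, n_features_v) matrices that share rows.
    Returns responsibilities of shape (n_rows, n_clusters).
    """
    rng = np.random.default_rng(seed)
    n_rows = views[0].shape[0]
    resp = rng.dirichlet(np.ones(n_clusters), size=n_rows)  # random soft start
    for _ in range(n_iter):
        # M-step: cluster priors and per-view feature distributions (smoothed).
        prior = resp.sum(axis=0) + eps
        prior /= prior.sum()
        log_post = np.tile(np.log(prior), (n_rows, 1))
        for v in views:
            theta = resp.T @ v + eps                      # (k, n_features_v)
            theta /= theta.sum(axis=1, keepdims=True)
            log_post += v @ np.log(theta).T               # add view log-likelihood
        # E-step: normalise posteriors in a numerically stable way.
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp

def cocluster(page_content, query_content, links, k_pages, k_queries, n_rounds=5):
    """Alternate between the page layer and the query layer.

    links: (n_pages, n_queries) co-occurrence counts between pages and queries.
    """
    n_queries = links.shape[1]
    query_resp = np.full((n_queries, k_queries), 1.0 / k_queries)
    for r in range(n_rounds):
        # Pages see their links summarised by the current query clusters.
        page_resp = em_soft_cluster([page_content, links @ query_resp], k_pages, seed=r)
        # Queries see their links summarised by the updated page clusters.
        query_resp = em_soft_cluster([query_content, links.T @ page_resp], k_queries, seed=r)
    return page_resp, query_resp

# Toy data (made up): two groups of pages and queries with block structure.
block = np.kron(np.eye(2), np.ones((3, 3)))   # 6 x 6 block-diagonal pattern
page_content = 5.0 * np.kron(np.eye(2), np.ones((3, 2)))
query_content = 5.0 * np.kron(np.eye(2), np.ones((3, 2)))
links = 4.0 * block
page_resp, query_resp = cocluster(page_content, query_content, links, 2, 2)
```

Each row of `page_resp` and `query_resp` is a probability distribution over clusters, so objects keep soft memberships rather than a single hard label. Cluster indices are arbitrary and may be permuted between rounds, which is harmless here because each round re-estimates its parameters from the propagated memberships.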
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, G., Zhu, W., Yu, Y. (2005). A Unified Probabilistic Framework for Clustering Correlated Heterogeneous Web Objects. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_9
DOI: https://doi.org/10.1007/978-3-540-31849-1_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25207-8
Online ISBN: 978-3-540-31849-1
eBook Packages: Computer Science (R0)