A Unified Probabilistic Framework for Clustering Correlated Heterogeneous Web Objects

Liu, Guowei; Zhu, Weibin; Yu, Yong

doi:10.1007/978-3-540-31849-1_9

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3399)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

Most existing algorithms cluster highly correlated data objects (e.g., web pages and web queries) separately. Other algorithms do take the relationships between data objects into account, but they either merge content and link features into a single feature space or apply a hard clustering algorithm, which makes it difficult to fully exploit the correlated information across heterogeneous Web objects. In this paper, we propose a novel unified probabilistic framework that iteratively clusters correlated heterogeneous data objects until convergence. Our approach introduces two latent clustering layers, which serve as two mixture probabilistic models of the features. In each clustering iteration, we use the EM (Expectation-Maximization) algorithm to estimate the parameters of the mixture model in one latent layer and propagate them to the other. The experimental results show that our approach effectively combines content and link features and improves clustering performance.
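
As a rough illustration of the EM step mentioned in the abstract, the sketch below fits a mixture of multinomials to a small count matrix and returns soft (probabilistic) cluster memberships. It covers a single latent layer only and does not attempt the paper's two-layer propagation, whose details are not given on this page. The function name `em_multinomial_mixture`, the smoothing constant, and the toy data are all assumptions made for illustration, not the authors' implementation.

```python
import math
import random


def em_multinomial_mixture(counts, k, iters=50, seed=0, smooth=1e-2):
    """Soft-cluster the rows of a count matrix with a mixture of multinomials.

    Hypothetical helper for illustration only. `smooth` is additive smoothing
    that keeps every probability strictly positive.
    """
    rng = random.Random(seed)
    n, m = len(counts), len(counts[0])

    # Start from random soft assignments P(z | object), each row summing to 1.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])

    for _ in range(iters):
        # M-step: re-estimate cluster priors P(z) and feature
        # distributions P(f | z) from the current soft assignments.
        prior = [
            (smooth + sum(resp[i][z] for i in range(n))) / (n + k * smooth)
            for z in range(k)
        ]
        emit = []
        for z in range(k):
            weights = [
                smooth + sum(resp[i][z] * counts[i][f] for i in range(n))
                for f in range(m)
            ]
            total = sum(weights)
            emit.append([w / total for w in weights])

        # E-step: posterior P(z | object), computed in log space for stability.
        for i in range(n):
            logp = [
                math.log(prior[z])
                + sum(c * math.log(emit[z][f]) for f, c in enumerate(counts[i]))
                for z in range(k)
            ]
            top = max(logp)
            norm = top + math.log(sum(math.exp(v - top) for v in logp))
            resp[i] = [math.exp(v - norm) for v in logp]

    return resp, prior, emit


# Toy data (assumed): six objects described by four feature counts.
# The first three objects use features 0-1, the last three use features 2-3.
counts = [
    [5, 4, 0, 0],
    [4, 5, 0, 0],
    [6, 3, 0, 0],
    [0, 0, 5, 4],
    [0, 0, 4, 5],
    [0, 0, 3, 6],
]

resp, prior, emit = em_multinomial_mixture(counts, k=2)

# Hard labels are shown only for inspection; the model itself stays soft.
labels = [max(range(2), key=lambda z: row[z]) for row in resp]
print(labels)
```

Unlike a hard clustering algorithm, each object here keeps a full membership distribution over clusters, which is the kind of quantity a probabilistic framework could pass between layers.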

Author information

Authors and Affiliations

Computer Science & Engineering Department, Shanghai Jiaotong University, Shanghai, China
Guowei Liu, Weibin Zhu & Yong Yu

Editor information

Editors and Affiliations

Victoria University, Australia
Yanchun Zhang
University of Kyoto, Japan
Katsumi Tanaka
Chinese University of Hong Kong, Hong Kong, China
Jeffrey Xu Yu
Key Laboratory of Data Engineering and Knowledge Engineering, Renmin University of China, MOE, 100872, Beijing, P.R. China
Shan Wang
Department of Computer Science and Engineering, Shanghai Jiaotong University, 80 Dongchuan Road, 200240, Shanghai, China
Minglu Li

About this paper

Cite this paper

Liu, G., Zhu, W., Yu, Y. (2005). A Unified Probabilistic Framework for Clustering Correlated Heterogeneous Web Objects. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_9

DOI: https://doi.org/10.1007/978-3-540-31849-1_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25207-8
Online ISBN: 978-3-540-31849-1
eBook Packages: Computer Science, Computer Science (R0)