A Unified Probabilistic Framework for Clustering Correlated Heterogeneous Web Objects

  • Conference paper
Web Technologies Research and Development - APWeb 2005 (APWeb 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3399)

Abstract

Most existing algorithms cluster highly correlated data objects (e.g. web pages and web queries) separately. Other algorithms do take the relationships between data objects into account, but they either merge content and link features into a single feature space or apply a hard clustering algorithm, which makes it difficult to fully exploit the correlated information across heterogeneous Web objects. In this paper, we propose a novel unified probabilistic framework that iteratively clusters correlated heterogeneous data objects until convergence. Our approach introduces two latent clustering layers, which serve as two mixture probabilistic models of the features. In each clustering iteration, we use the EM (Expectation-Maximization) algorithm to estimate the parameters of the mixture model in one latent layer and propagate them to the other layer. The experimental results show that our approach effectively combines content and link features and improves clustering performance.
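
The abstract gives no equations, so the following is only a rough sketch of the general idea of alternating soft clustering between two correlated object types, not the authors' model. It assumes a mixture of multinomials fitted by EM for each layer, and it assumes that "propagation" means passing the soft cluster memberships of one object type to the other through a link matrix. The function names, the `links` matrix, and the choice to append propagated memberships as extra feature columns are all hypothetical simplifications introduced for illustration.

```python
import numpy as np


def em_multinomial_mixture(X, k, n_iter=50, seed=0, eps=1e-9):
    """Soft-cluster the rows of a non-negative matrix X with a mixture of
    multinomials fitted by EM. Returns responsibilities of shape (n, k)."""
    rng = np.random.default_rng(seed)
    n, _ = X.shape
    # Start from random soft assignments; each row sums to 1.
    resp = rng.dirichlet(np.ones(k), size=n)
    for _ in range(n_iter):
        # M-step: mixing weights and per-cluster feature distributions.
        pi = resp.sum(axis=0) + eps
        pi /= pi.sum()
        theta = resp.T @ X + eps                      # shape (k, d)
        theta /= theta.sum(axis=1, keepdims=True)
        # E-step: posterior P(cluster | row), computed in log space.
        log_post = np.log(pi) + X @ np.log(theta).T   # shape (n, k)
        log_post -= log_post.max(axis=1, keepdims=True)
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)
    return resp


def alternate_clustering(pages, queries, links, k_pages, k_queries, n_rounds=5):
    """Alternate between clustering pages and queries.

    pages:   (n_pages, d_p) content counts for pages   (assumed input)
    queries: (n_queries, d_q) content counts for queries (assumed input)
    links:   (n_pages, n_queries) co-occurrence counts  (assumed input)
    """
    resp_q = em_multinomial_mixture(queries, k_queries)
    for r in range(n_rounds):
        # Propagate query memberships to pages: link mass per query cluster.
        page_feats = np.hstack([pages, links @ resp_q])
        resp_p = em_multinomial_mixture(page_feats, k_pages, seed=r)
        # Propagate page memberships back to queries.
        query_feats = np.hstack([queries, links.T @ resp_p])
        resp_q = em_multinomial_mixture(query_feats, k_queries, seed=r)
    return resp_p, resp_q
```

A fixed number of rounds stands in for the convergence test mentioned in the abstract; a fuller version would stop once the memberships change by less than a tolerance.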

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Liu, G., Zhu, W., Yu, Y. (2005). A Unified Probabilistic Framework for Clustering Correlated Heterogeneous Web Objects. In: Zhang, Y., Tanaka, K., Yu, J.X., Wang, S., Li, M. (eds) Web Technologies Research and Development - APWeb 2005. APWeb 2005. Lecture Notes in Computer Science, vol 3399. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31849-1_9

  • DOI: https://doi.org/10.1007/978-3-540-31849-1_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25207-8

  • Online ISBN: 978-3-540-31849-1

  • eBook Packages: Computer Science (R0)
