Multi-type Features Based Web Document Clustering

  • Conference paper
Web Information Systems – WISE 2004 (WISE 2004)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 3306))

Abstract

Clustering has been demonstrated to be a feasible way to explore the contents of a document collection and to organize search engine results. For this task, many features of a Web page, such as content, anchor text, URL, and hyperlinks, can be exploited, and each can yield a different result. We aim to provide a unified and better result for end users. Some work has studied how to use several types of features together to perform clustering; most of it focuses on ensemble methods or on combining similarities. In this paper, we propose a novel algorithm: Multi-type Features based Reinforcement Clustering (MFRC). This algorithm does not use a single combined score for all feature spaces, but instead uses the intermediate clustering result in one feature space as additional information to gradually enhance clustering in the other spaces. A consensus can finally be achieved through such mutual reinforcement. The experimental results show that MFRC also provides some performance improvement.
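
The abstract does not specify how the intermediate result of one feature space is passed to another, so the following is only a minimal sketch of the general idea, not the MFRC algorithm itself. It assumes that each feature space is a dense numeric matrix, that k-means is the base clusterer, and that the cluster labels from the other spaces are appended as weighted one-hot columns before re-clustering. The names `kmeans`, `one_hot`, `reinforce`, `rounds` and `alpha` are hypothetical and chosen for illustration.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialisation."""
    X = np.asarray(X, dtype=float)
    centers = [X[0]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen center.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[int(np.argmax(d))])
    centers = np.array(centers, dtype=float)
    for _ in range(iters):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def one_hot(labels, k):
    """Encode integer cluster labels as an (n, k) indicator matrix."""
    return np.eye(k)[labels]

def reinforce(spaces, k, rounds=3, alpha=1.0):
    """Assumed reinforcement loop: re-cluster each space with the other
    spaces' current cluster indicators appended as extra features."""
    spaces = [np.asarray(X, dtype=float) for X in spaces]
    labels = [kmeans(X, k) for X in spaces]  # independent starting point
    for _ in range(rounds):
        new_labels = []
        for s, X in enumerate(spaces):
            others = [alpha * one_hot(labels[t], k)
                      for t in range(len(spaces)) if t != s]
            new_labels.append(kmeans(np.hstack([X] + others), k))
        labels = new_labels
    return labels

# Toy data: six documents described in two made-up feature spaces.
content = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
links = np.array([[0.0], [1.0], [0.5], [9.0], [10.0], [9.5]])
content_labels, link_labels = reinforce([content, links], k=2)
```

Cluster ids are not aligned across spaces in this sketch, so agreement between spaces should be judged by whether the same pairs of documents are grouped together, not by comparing raw label values. The weight `alpha` controls how strongly the other spaces influence each re-clustering step; how such a weight would be set in practice is outside what the abstract states.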

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Huang, S., Xue, GR., Zhang, BY., Chen, Z., Yu, Y., Ma, WY. (2004). Multi-type Features Based Web Document Clustering. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds) Web Information Systems – WISE 2004. WISE 2004. Lecture Notes in Computer Science, vol 3306. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30480-7_27

  • DOI: https://doi.org/10.1007/978-3-540-30480-7_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23894-2

  • Online ISBN: 978-3-540-30480-7

  • eBook Packages: Springer Book Archive
