Multi-type Features Based Web Document Clustering

Huang, Shen; Xue, Gui-Rong; Zhang, Ben-Yu; Chen, Zheng; Yu, Yong; Ma, Wei-Ying

doi:10.1007/978-3-540-30480-7_27

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3306)

Included in the following conference series:

International Conference on Web Information Systems Engineering

Abstract

Clustering has been demonstrated to be a feasible way to explore the contents of a document collection and to organize search engine results. For this task, many features of a Web page, such as content, anchor text, URL, and hyperlinks, can be exploited, and each can yield a different clustering result. We aim to provide end users with a unified and better result. Some prior work has studied how to use several types of features together to perform clustering, mostly through ensemble methods or a combination of similarity measures. In this paper, we propose a novel algorithm: Multi-type Features based Reinforcement Clustering (MFRC). This algorithm does not use a single combined score across all feature spaces; instead, it uses the intermediate clustering result in one feature space as additional information to gradually enhance clustering in the other spaces. Through this mutual reinforcement, a consensus can eventually be reached. Experimental results show that MFRC also provides some performance improvement.
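The mutual-reinforcement idea in the abstract can be illustrated with a minimal sketch. This is not the authors' MFRC implementation: the abstract does not specify the base clustering method, how intermediate results are passed between feature spaces, or the stopping rule. The sketch assumes plain k-means as the base clusterer, a one-hot encoding of the other spaces' cluster labels appended as extra features, and a fixed number of rounds. The names `kmeans`, `one_hot`, `reinforcement_clustering`, `weight`, and `rounds` are hypothetical.

```python
import numpy as np


def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[j] = members.mean(axis=0)
    return labels


def one_hot(labels, k):
    """Encode cluster labels as an n-by-k indicator matrix."""
    return np.eye(k)[labels]


def reinforcement_clustering(spaces, k, rounds=5, weight=1.0, seed=0):
    """Cluster each feature space, feeding the other spaces' labels back in.

    spaces: list of (n_docs, n_features) arrays, one per feature type
            (for example content, anchor text, URL, hyperlinks).
    Returns a list of label arrays, one per feature space.
    Assumption: one-hot label features and a fixed round count stand in
    for whatever reinforcement and stopping rule the paper actually uses.
    """
    labels = [kmeans(X, k, seed=seed) for X in spaces]
    for _ in range(rounds):
        new_labels = []
        for i, X in enumerate(spaces):
            # Intermediate results from the other spaces become extra features.
            extra = [weight * one_hot(labels[j], k)
                     for j in range(len(spaces)) if j != i]
            augmented = np.hstack([X] + extra)
            new_labels.append(kmeans(augmented, k, seed=seed))
        labels = new_labels
    return labels


# Toy usage: six documents described in two hypothetical feature spaces.
content = np.array([[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]])
links = np.array([[5.0], [5.2], [5.1], [-3.0], [-3.1], [-2.9]])
per_space_labels = reinforcement_clustering([content, links], k=2)
```

In this sketch, `weight` controls how strongly the other spaces' partitions influence clustering in the current space; with `weight=0.0` each space is clustered independently. Label numbering is arbitrary per space, so partitions should be compared up to a permutation of cluster ids.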

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Shanghai Jiao Tong University, 1954 Huashan Ave., Shanghai, 200030, P.R.China
Shen Huang, Gui-Rong Xue & Yong Yu
Microsoft Research Asia, 5F, Sigma Center 49 Zhichun Road, Beijing, 100080, P.R.China
Ben-Yu Zhang, Zheng Chen & Wei-Ying Ma

Editor information

Editors and Affiliations

School of ITEE, The University of Queensland, Australia
Xiaofang Zhou
Database Systems Research and Development Center, University of Florida, P.O. Box 116125, 470 CSE, 32601-6125, Gainesville, FL, USA
Stanley Su
INFOLAB, Dept. of Information Systems and Management, Tilburg University, The Netherlands
Mike P. Papazoglou
Polish-Japanese Institute of Information Technology, Faculty of IT, Ul. Koszykowa 86, 02-008, Warsaw, Poland
Maria Elzbieta Orlowska
Rutherford Appleton Laboratory, Science and Technology Facilities Council, Harwell Science and Innovation Campus, OX11 0QX, Didcot, UK
Keith Jeffery

About this paper

Cite this paper

Huang, S., Xue, GR., Zhang, BY., Chen, Z., Yu, Y., Ma, WY. (2004). Multi-type Features Based Web Document Clustering. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds) Web Information Systems – WISE 2004. WISE 2004. Lecture Notes in Computer Science, vol 3306. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30480-7_27

DOI: https://doi.org/10.1007/978-3-540-30480-7_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23894-2
Online ISBN: 978-3-540-30480-7
eBook Packages: Springer Book Archive
