Abstract
Clustering has been shown to be a feasible way to explore the contents of a document collection and to organize search engine results. For this task, many features of a Web page, such as content, anchor text, URL, and hyperlinks, can be exploited, and each can yield different results. We aim to provide a unified, and ideally better, result for end users. Some prior work has studied how to use several types of features together to perform clustering; most of it focuses on ensemble methods or on combining similarities. In this paper, we propose a novel algorithm: Multi-type Features based Reinforcement Clustering (MFRC). This algorithm does not use a single combined score across all feature spaces; instead, it uses the intermediate clustering result in one feature space as additional information to gradually enhance clustering in the other spaces. A consensus is eventually reached through this mutual reinforcement. Experimental results show that MFRC also provides some performance improvement.
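The mutual reinforcement idea can be illustrated with a minimal sketch. This is not the MFRC algorithm as published: the abstract does not specify how intermediate results are passed between feature spaces, so the sketch assumes one simple option, appending one-hot cluster labels from the other spaces to each space's feature matrix before re-clustering. The function names (`kmeans`, `one_hot`, `reinforcement_clustering`), the `weight` parameter, and the use of k-means are all illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns one integer label per row of X."""
    rng = np.random.default_rng(seed)
    # Initialise centers from k distinct rows of X.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's center unchanged
                centers[j] = members.mean(axis=0)
    return labels

def one_hot(labels, k):
    """Encode integer labels as an (n, k) indicator matrix."""
    return np.eye(k)[labels]

def reinforcement_clustering(spaces, k, rounds=5, weight=1.0):
    """Cluster each feature space, reusing the other spaces' current labels.

    spaces: list of (n, d_s) arrays, one per feature type (e.g. content, links).
    weight: hypothetical knob controlling how strongly other spaces influence
            the clustering in the current space (an assumption of this sketch).
    """
    # Start from an independent clustering in every space.
    labels = [kmeans(X, k, seed=i) for i, X in enumerate(spaces)]
    for _ in range(rounds):
        for s, X in enumerate(spaces):
            # Intermediate results from the other spaces act as extra features.
            extra = [weight * one_hot(labels[t], k)
                     for t in range(len(spaces)) if t != s]
            augmented = np.hstack([X] + extra)
            labels[s] = kmeans(augmented, k, seed=s)
    return labels
```

Calling `reinforcement_clustering([content_matrix, link_matrix], k=2)` returns one label array per feature space. Whether the arrays agree after a given number of rounds depends on the data and on `weight`; the sketch does not guarantee the consensus described in the abstract.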
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Huang, S., Xue, GR., Zhang, BY., Chen, Z., Yu, Y., Ma, WY. (2004). Multi-type Features Based Web Document Clustering. In: Zhou, X., Su, S., Papazoglou, M.P., Orlowska, M.E., Jeffery, K. (eds) Web Information Systems – WISE 2004. WISE 2004. Lecture Notes in Computer Science, vol 3306. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30480-7_27
DOI: https://doi.org/10.1007/978-3-540-30480-7_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23894-2
Online ISBN: 978-3-540-30480-7
eBook Packages: Springer Book Archive