Abstract
An ever-increasing number of Web-accessible documents are available in languages other than English, and managing these heterogeneous document collections poses a challenge. This paper proposes a novel model, called a domain alignment translation model, for cross-lingual document clustering. While most existing cross-lingual document clustering methods rely on an expensive machine translation system to bridge the gap between two languages, our model aims to handle cross-lingual document clustering effectively by learning a cross-lingual domain alignment model and a domain-specific term translation model in a collaborative way. Experimental results show that our method, C-TLS, using no resources other than a bilingual dictionary, achieves performance comparable to direct translation with a machine translation system such as the Google language tool. Our method is also more efficient.
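To make the dictionary-only setting concrete, the following minimal sketch shows how a bilingual dictionary alone can place a foreign-language document in the same term space as English documents so that similarities can be computed. It is not the C-TLS model: the toy dictionary, the documents, the helper names `translate_bag` and `cosine`, and the uniform split over candidate translations are all assumptions made for illustration, whereas the paper learns domain-specific term translations jointly with a domain alignment.

```python
import math
from collections import Counter

# Hypothetical toy Spanish -> English dictionary (an assumption, not from the paper).
# A source term may have several candidate translations.
BILINGUAL_DICT = {
    "banco": ["bank", "bench"],
    "dinero": ["money"],
    "partido": ["match", "party"],
    "gol": ["goal"],
}


def translate_bag(tokens, dictionary):
    """Map source-language tokens to a target-language bag of words.

    Each token's unit weight is split evenly over its candidate translations
    (a naive assumption); tokens missing from the dictionary are dropped.
    """
    bag = Counter()
    for tok in tokens:
        candidates = dictionary.get(tok, [])
        for cand in candidates:
            bag[cand] += 1.0 / len(candidates)
    return bag


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Two English documents standing in for existing clusters (toy data).
english_docs = {
    "finance": Counter("bank money loan bank".split()),
    "sports": Counter("goal match team goal".split()),
}

spanish_doc = "banco dinero banco".split()
translated = translate_bag(spanish_doc, BILINGUAL_DICT)

# Score the translated document against each English document and pick the closest.
scores = {name: cosine(translated, bag) for name, bag in english_docs.items()}
best = max(scores, key=scores.get)
print(best)
```

The even split leaves weight on the wrong sense ("bench" for "banco"), which is the kind of ambiguity a domain-specific term translation model is meant to reduce.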
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Wu, K., Lu, BL. (2007). Cross-Lingual Document Clustering. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_107
DOI: https://doi.org/10.1007/978-3-540-71701-0_107
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71700-3
Online ISBN: 978-3-540-71701-0
eBook Packages: Computer Science (R0)