Cross-Lingual Document Clustering

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2007)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4426)

Abstract

An ever-increasing number of Web-accessible documents are available in languages other than English, and managing these heterogeneous document collections poses a challenge. This paper proposes a novel model, called a domain alignment translation model, for cross-lingual document clustering. Most existing cross-lingual document clustering methods rely on an expensive machine translation system to bridge the gap between two languages. Our model instead handles cross-lingual document clustering by learning a cross-lingual domain alignment model and a domain-specific term translation model in a collaborative way. Experimental results show that our method, C-TLS, using no resources other than a bilingual dictionary, achieves performance comparable to direct translation with a machine translation system such as the Google language tool. Our method is also more efficient.

Editor information

Zhi-Hua Zhou, Hang Li, Qiang Yang

Copyright information

© 2007 Springer Berlin Heidelberg

About this paper

Cite this paper

Wu, K., Lu, BL. (2007). Cross-Lingual Document Clustering. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_107

  • DOI: https://doi.org/10.1007/978-3-540-71701-0_107

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-71700-3

  • Online ISBN: 978-3-540-71701-0

  • eBook Packages: Computer Science (R0)
