Cross-Lingual Document Clustering

Wu, Ke; Lu, Bao-Liang

doi:10.1007/978-3-540-71701-0_107


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4426)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining


Abstract

An ever-increasing number of Web-accessible documents are available in languages other than English, and managing these heterogeneous document collections poses a challenge. This paper proposes a novel model, called a domain alignment translation model, for cross-lingual document clustering. Whereas most existing cross-lingual document clustering methods rely on an expensive machine translation system to bridge the gap between two languages, our model handles cross-lingual document clustering by learning a cross-lingual domain alignment model and a domain-specific term translation model in a collaborative way. Experimental results show that our method, C-TLS, using no resources other than a bilingual dictionary, achieves performance comparable to direct translation with a machine translation system such as the Google language tool. Our method is also more efficient.
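To make the idea of bridging two languages with only a bilingual dictionary concrete, here is a minimal Python sketch. It is not the C-TLS method: the abstract does not give the model's details, so the sketch assumes a hypothetical toy dictionary (`TOY_DICT`), spreads each source term's weight uniformly over its candidate translations (the paper instead learns domain-specific translation preferences), and assigns the translated document to the most similar of a set of given English cluster centroids by cosine similarity. All function names are illustrative.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy bilingual dictionary: source term -> candidate English terms.
TOY_DICT = {
    "banco": ["bank", "bench"],
    "dinero": ["money"],
    "partido": ["match", "party"],
    "gol": ["goal"],
}


def translate_bag(tokens, dictionary):
    """Build an English bag of words from source-language tokens.

    Assumption: each token's unit weight is split uniformly across its
    candidate translations; tokens missing from the dictionary are dropped.
    """
    bag = Counter()
    for tok in tokens:
        candidates = dictionary.get(tok, [])
        for cand in candidates:
            bag[cand] += 1.0 / len(candidates)
    return bag


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_to_nearest(doc_bag, centroids):
    """Return the index of the centroid most similar to doc_bag."""
    return max(range(len(centroids)), key=lambda i: cosine(doc_bag, centroids[i]))


# Usage: two English centroids (finance, sports) and one translated document.
centroids = [Counter({"bank": 1, "money": 1}), Counter({"match": 1, "goal": 1})]
doc = translate_bag(["banco", "dinero"], TOY_DICT)
cluster = assign_to_nearest(doc, centroids)  # index of the finance centroid
```

Uniform weighting lets wrong senses such as "bench" leak into the translated document. A domain-specific term translation model, as the abstract describes, is meant to reduce exactly that kind of ambiguity.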



Author information

Authors and Affiliations

Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Road, Shanghai 200240, China
Ke Wu & Bao-Liang Lu


Editor information

Zhi-Hua Zhou, Hang Li, Qiang Yang

About this paper

Cite this paper

Wu, K., Lu, BL. (2007). Cross-Lingual Document Clustering. In: Zhou, ZH., Li, H., Yang, Q. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2007. Lecture Notes in Computer Science, vol 4426. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71701-0_107


DOI: https://doi.org/10.1007/978-3-540-71701-0_107
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71700-3
Online ISBN: 978-3-540-71701-0
eBook Packages: Computer Science, Computer Science (R0)
