Mining comparable bilingual text corpora for cross-language information integration

Authors:

ChengXiang Zhai

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

Pages 691 - 696

https://doi.org/10.1145/1081870.1081958

Published: 21 August 2005

Abstract

Integrating information across multiple natural languages is a challenging task that often requires manually created linguistic resources, such as a bilingual dictionary or examples of direct translations of text. In this paper, we propose a general cross-lingual text mining method that does not rely on any of these resources but instead exploits comparable bilingual text corpora to discover mappings between words and documents in different languages. Comparable text corpora are collections of text documents in different languages that are about similar topics; such corpora are often naturally available (e.g., news articles in different languages published in the same time period). The main idea of our method is to exploit frequency correlations of words in different languages in the comparable corpora to discover mappings between words in different languages. Such mappings can then be used to discover mappings between documents in different languages, achieving cross-lingual information integration. Evaluation on a 120MB Chinese-English comparable news collection shows that the proposed method is effective for mapping words and documents in English and Chinese. Since our method relies only on naturally available comparable corpora, it is generally applicable to any language pair for which comparable corpora are available.
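The frequency-correlation idea can be illustrated with a minimal sketch. The abstract does not say which correlation measure or time granularity the method uses, so the following assumes Pearson correlation over per-day relative frequencies; the function names (`frequency_series`, `pearson`, `best_mappings`) and the toy data are hypothetical and are not taken from the paper. Romanized placeholders stand in for Chinese words.

```python
from collections import Counter
from math import sqrt


def frequency_series(tokens_by_day, days):
    """Map each word to its relative frequency on each day (assumed granularity)."""
    series = {}
    for i, day in enumerate(days):
        counts = Counter(tokens_by_day.get(day, []))
        total = sum(counts.values()) or 1  # avoid dividing by zero on empty days
        for word, count in counts.items():
            series.setdefault(word, [0.0] * len(days))[i] = count / total
    return series


def pearson(x, y):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_mappings(src_series, tgt_series):
    """For each source word, pick the target word with the most correlated series."""
    return {
        word: max(tgt_series, key=lambda t: pearson(freqs, tgt_series[t]))
        for word, freqs in src_series.items()
    }


# Toy comparable corpus (invented for illustration): tokens per day, per language.
days = ["d1", "d2", "d3", "d4"]
english = {
    "d1": ["earthquake", "earthquake", "election", "market"],
    "d2": ["earthquake", "market", "market", "market"],
    "d3": ["election", "election", "election", "market"],
    "d4": ["election", "market", "market", "market"],
}
chinese = {  # romanized placeholders for Chinese words
    "d1": ["dizhen", "dizhen", "xuanju", "shichang"],
    "d2": ["dizhen", "shichang", "shichang", "shichang"],
    "d3": ["xuanju", "xuanju", "xuanju", "shichang"],
    "d4": ["xuanju", "shichang", "shichang", "shichang"],
}

mapping = best_mappings(
    frequency_series(english, days), frequency_series(chinese, days)
)
print(mapping)
```

On this toy data each English word is paired with the placeholder whose daily frequency profile rises and falls with it. Real comparable corpora are far noisier, so a practical system would likely need frequency normalization and a ranked candidate list rather than a single best match.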

Index Terms

1. Information systems
  1. Information retrieval

Published In

KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining

August 2005

844 pages

ISBN: 159593135X

DOI: 10.1145/1081870

General Chair: Robert Grossman (University of Illinois at Chicago & Open Data Partners, USA)

Program Chairs: Roberto Bayardo (IBM Almaden Research, USA), Kristin Bennett (RPI, USA)

Copyright © 2005 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Conference

KDD05: The Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 21 - 24, 2005

Chicago, Illinois, USA

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
