DOI: 10.1145/1081870.1081958
Mining comparable bilingual text corpora for cross-language information integration

Published: 21 August 2005

Abstract

Integrating information across multiple natural languages is a challenging task that often requires manually created linguistic resources, such as a bilingual dictionary or examples of direct translations of text. In this paper, we propose a general cross-lingual text mining method that does not rely on any of these resources; instead, it exploits comparable bilingual text corpora to discover mappings between words and documents in different languages. Comparable text corpora are collections of text documents in different languages that are about similar topics; such corpora are often naturally available (e.g., news articles in different languages published in the same time period). The main idea of our method is to exploit frequency correlations of words in different languages in the comparable corpora to discover mappings between words in those languages. These word mappings can then be used to discover mappings between documents in different languages, achieving cross-lingual information integration. Evaluation on a 120MB Chinese-English comparable news collection shows that the proposed method is effective for mapping words and documents between English and Chinese. Since our method relies only on naturally available comparable corpora, it is applicable to any language pair for which comparable corpora are available.
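The frequency-correlation idea can be illustrated with a minimal sketch. The abstract does not specify the correlation measure or how frequencies are indexed, so the code below assumes that each word is represented by its counts over a shared sequence of time periods (for example, days) and that Pearson correlation is the measure. The function names `pearson` and `best_mappings` and the toy counts are hypothetical, chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length frequency series."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("series must be non-empty and of equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant series carries no correlation signal.
        return 0.0
    return cov / sqrt(var_x * var_y)


def best_mappings(freq_a, freq_b):
    """Map each word in language A to its most correlated word in language B.

    freq_a and freq_b are dicts from word to a list of counts, one count per
    time period; both languages must use the same sequence of periods.
    """
    return {
        word: max(freq_b, key=lambda cand: pearson(series, freq_b[cand]))
        for word, series in freq_a.items()
    }


# Toy counts over five time periods (illustrative assumption, not real data).
english = {
    "earthquake": [0, 5, 9, 2, 0],
    "election": [7, 1, 0, 0, 6],
}
chinese = {
    "地震": [1, 6, 8, 1, 0],
    "选举": [6, 0, 1, 0, 7],
}

if __name__ == "__main__":
    for en_word, zh_word in best_mappings(english, chinese).items():
        print(en_word, "->", zh_word)
```

In this toy setting, words that spike in the same periods end up paired, with no dictionary involved. A realistic setup would likely need normalization of counts and filtering of very frequent or very rare words, which this sketch omits.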

    Published In

    KDD '05: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
    August 2005
    844 pages
    ISBN:159593135X
    DOI:10.1145/1081870
    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. comparable corpora
    2. cross-lingual text mining
    3. document alignment
    4. frequency correlation

    Conference

    KDD05

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%
