Managing information disparity in multilingual document collections

Authors:
Kevin Duh (NTT Communication Science Laboratories, Japan)
Ching-Man Au Yeung (NTT Communication Science Laboratories, Huawei, Hong Kong)
Tomoharu Iwata (NTT Communication Science Laboratories, Japan)
Masaaki Nagata (NTT Communication Science Laboratories, Japan)

ACM Transactions on Speech and Language Processing, Volume 10, Issue 1, Article No. 1, pp. 1–28. https://doi.org/10.1145/2442076.2442077

Published: 22 March 2013

Abstract

Information disparity is a major challenge in multilingual document collections. When documents are dynamically updated in a distributed fashion, the information content of different language editions may gradually diverge. We propose a framework that helps human editors manage this information disparity, using tools from machine translation and machine learning. Given source and target documents in two different languages, our system automatically identifies information nuggets that are new with respect to the target and suggests positions at which to place their translations. We perform both real-world experiments and large-scale simulations on Wikipedia documents and conclude that our system is effective in a variety of scenarios.
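
The abstract does not spell out how new information is detected or placed, so the following is only a minimal sketch of the general idea, not the authors' system. It assumes the source sentences have already been machine-translated into the target language, treats each sentence as one "nugget", and swaps in a simple bag-of-words cosine similarity with a fixed threshold in place of any learned model. The names `bag_of_words`, `cosine`, `suggest_insertions`, and the `threshold` value are all hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased word counts; a stand-in for a real sentence representation."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_insertions(translated_source, target, threshold=0.5):
    """Flag source sentences with no close match in the target.

    Returns a list of (source_index, insert_after_target_index) pairs.
    insert_after_target_index is -1 when the sentence would go at the start.
    The threshold is an assumed value, not one taken from the paper.
    """
    target_vecs = [bag_of_words(t) for t in target]
    suggestions = []
    last_matched = -1  # target index matched by the latest non-new source sentence
    for i, sentence in enumerate(translated_source):
        vec = bag_of_words(sentence)
        sims = [cosine(vec, t) for t in target_vecs]
        best = max(range(len(sims)), key=sims.__getitem__, default=None)
        if best is not None and sims[best] >= threshold:
            last_matched = best  # already covered by the target document
        else:
            suggestions.append((i, last_matched))  # new with respect to the target
    return suggestions


translated_source = [
    "Tokyo is the capital of Japan.",
    "The city hosted the Olympic Games in 1964.",
    "Tokyo has a large population.",
]
target = [
    "Tokyo is the capital of Japan.",
    "Tokyo has a very large population.",
]
print(suggest_insertions(translated_source, target))  # [(1, 0)]
```

In this toy example the second source sentence has no close counterpart in the target, so it is flagged as new and proposed for insertion after target sentence 0, the sentence matched by its predecessor in the source. A real system would need a more robust similarity measure, since machine translation output rarely shares exact wording with the target text.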

Published in

ACM Transactions on Speech and Language Processing Volume 10, Issue 1
March 2013
50 pages
ISSN: 1550-4875
EISSN: 1550-4883
DOI: 10.1145/2442076

Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 22 March 2013
- Accepted: 1 October 2012
- Revised: 1 July 2012
- Received: 1 July 2011
Author Tags
Cross-lingual methods
document management systems
machine translation applications
Qualifiers
- research-article
- Research
- Refereed