research-article

Managing information disparity in multilingual document collections

Published: 22 March 2013
Abstract

Information disparity is a major challenge in multilingual document collections. When documents are updated dynamically and in a distributed fashion, the information content of different language editions may gradually diverge. We propose a framework that helps human editors manage this information disparity, using tools from machine translation and machine learning. Given source and target documents in two different languages, our system automatically identifies information nuggets that are new with respect to the target and suggests positions at which to place their translations. We perform both real-world experiments and large-scale simulations on Wikipedia documents and conclude that our system is effective in a variety of scenarios.

        • Published in

          ACM Transactions on Speech and Language Processing, Volume 10, Issue 1
          March 2013
          50 pages
          ISSN: 1550-4875
          EISSN: 1550-4883
          DOI: 10.1145/2442076

          Copyright © 2013 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 22 March 2013
          • Accepted: 1 October 2012
          • Revised: 1 July 2012
          • Received: 1 July 2011

          Qualifiers

          • research-article
          • Research
          • Refereed
