Abstract
Information disparity is a major challenge in multilingual document collections. When documents are dynamically updated in a distributed fashion, the information content of different language editions may gradually diverge. We propose a framework that helps human editors manage this information disparity, using tools from machine translation and machine learning. Given source and target documents in two different languages, our system automatically identifies information nuggets that are new with respect to the target and suggests positions at which to place their translations. We perform both real-world experiments and large-scale simulations on Wikipedia documents and conclude that our system is effective in a variety of scenarios.
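The two steps named in the abstract (finding source content that is new with respect to the target, then suggesting where its translation could go) can be illustrated with a minimal sketch. This is not the authors' method: it assumes the source sentences have already been machine-translated into the target language, and it swaps in plain bag-of-words cosine similarity with a fixed threshold in place of the paper's learned models. The names `suggest_insertions`, `bag_of_words`, and `threshold` are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercase the sentence and count its word tokens."""
    return Counter(re.findall(r"\w+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def suggest_insertions(translated_source, target, threshold=0.5):
    """Return (source_index, insert_after) pairs for sentences judged new.

    insert_after is the index of the target sentence after which the new
    sentence could be placed, or -1 for the start of the target document.
    The threshold value is an illustrative assumption, not a tuned setting.
    """
    target_vecs = [bag_of_words(t) for t in target]
    suggestions = []
    anchor = -1  # target position matched by the latest non-new source sentence
    for i, sentence in enumerate(translated_source):
        vec = bag_of_words(sentence)
        scores = [cosine(vec, t) for t in target_vecs]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            anchor = best  # already covered by the target; remember where
        else:
            suggestions.append((i, anchor))  # new: place after the last match
    return suggestions


# Hypothetical example data (source side shown after translation).
translated_source = [
    "Tokyo is the capital of Japan.",
    "The city hosted the Olympic Games in 1964.",
    "Tokyo lies on the island of Honshu.",
]
target = [
    "Tokyo is the capital city of Japan.",
    "Tokyo lies on Honshu island.",
]
print(suggest_insertions(translated_source, target))  # [(1, 0)]
```

In this toy run the second source sentence has no close match in the target, so it is flagged as new and proposed for insertion after target sentence 0, the position matched by the source sentence that precedes it. Anchoring on the preceding matched sentence is one simple design choice for preserving the source ordering; it is only a stand-in for the position suggestion the abstract describes.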
Index Terms
- Managing information disparity in multilingual document collections