research-article

Mining Synonymous Transliterations from the World Wide Web

Published: 01 March 2010

Abstract

The World Wide Web is an important source of information. Search engines can retrieve large amounts of it from Web pages, including information of foreign origin. To be better understood by local readers, however, proper names in a foreign language such as English are often transliterated into a local language such as Chinese. Because translators differ and no translation standard exists, the same foreign proper noun may receive several different transliterations, which is a notorious problem. In particular, it can cause incomplete search results: a query that uses one transliteration as its keyword fails to retrieve Web pages that use a different transliteration, so important information may be missed. We present a framework that, for a given transliteration, mines as many synonymous transliterations as possible from the Web. The results can be used to construct a database of synonymous transliterations, which can in turn support query expansion to alleviate the incomplete search problem. Experimental results show that the proposed framework can effectively retrieve the set of snippets that may contain synonymous transliterations and then extract the target terms. Most of the extracted synonymous transliterations rank higher in similarity to the input transliteration than noise terms do.
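
The ranking and query-expansion ideas in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not specify the similarity measure, so a character-overlap ratio from the standard library's difflib stands in for it here. The function names, the sample terms, and the quoted `OR` query syntax are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] based on shared characters.

    Assumption: the paper's own similarity measure is not given in the
    abstract; this character-level ratio is only a placeholder for it.
    """
    return SequenceMatcher(None, a, b).ratio()


def rank_candidates(seed: str, candidates: list[str]) -> list[tuple[str, float]]:
    """Sort candidate terms by descending similarity to the seed term."""
    scored = [(c, similarity(seed, c)) for c in candidates if c != seed]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def expand_query(seed: str, synonyms: list[str]) -> str:
    """Join the seed and its synonyms into one disjunctive query string.

    Assumption: the target search engine accepts quoted terms joined by OR.
    """
    terms = [seed] + [s for s in synonyms if s != seed]
    return " OR ".join(f'"{t}"' for t in terms)


if __name__ == "__main__":
    seed = "歐巴馬"  # one Chinese transliteration of "Obama"
    extracted = ["美國總統", "奧巴馬"]  # a noise term and a synonymous variant
    ranked = rank_candidates(seed, extracted)
    best_term = ranked[0][0]  # the variant that shares characters ranks first
    print(expand_query(seed, [best_term]))  # "歐巴馬" OR "奧巴馬"
```

In this toy case the synonymous variant shares two of three characters with the seed and so ranks above the unrelated term. A database of such variants would let one query reach pages that use either spelling.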

            • Published in

              ACM Transactions on Asian Language Information Processing  Volume 9, Issue 1
              March 2010
              106 pages
              ISSN:1530-0226
              EISSN:1558-3430
              DOI:10.1145/1731035

              Copyright © 2010 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 1 March 2010
              • Accepted: 1 July 2009
              • Revised: 1 June 2009
              • Received: 1 April 2009

              Qualifiers

              • research-article
              • Research
              • Refereed
