Abstract
The World Wide Web is one of the most important sources of information. Search engines can retrieve Web pages containing a great deal of information, including information of foreign origin. To make such information easier for local readers to understand, proper names in a foreign language, such as English, are often transliterated into a local language, such as Chinese. Because translators differ and no translation standard exists, a single foreign proper noun may be given several different transliterations, which is a notorious problem. In particular, it can cause incomplete search results: a query that uses one transliteration as its keyword fails to retrieve Web pages that use a different transliteration, so important information may be missed. We present a framework for mining as many synonymous transliterations as possible from the Web for a given transliteration. The results can be used to construct a database of synonymous transliterations, which can in turn be used for query expansion to alleviate the incomplete search problem. Experimental results show that the proposed framework can effectively retrieve the set of snippets that may contain synonymous transliterations and then extract the target terms. Most of the extracted synonymous transliterations rank higher in similarity to the input transliteration than noise terms do.
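The query-expansion use of such a database can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SYNONYM_SETS` table, the `expand_query` function, and the quoted-phrase `OR` query syntax are all assumptions made for illustration, and the single entry (two common Chinese transliterations of "Obama") is only an example.

```python
# Hypothetical synonym database: each set groups transliterations
# of the same foreign name. The entry below is illustrative only.
SYNONYM_SETS = [
    {"歐巴馬", "奧巴馬"},  # two Chinese transliterations of "Obama"
]


def expand_query(term, synonym_sets):
    """Build an OR query covering a term and its known synonymous transliterations.

    The quoted-phrase OR syntax is an assumption; real search engines differ.
    """
    variants = {term}
    for group in synonym_sets:
        if term in group:
            variants |= group
    # Sort for deterministic output and quote each variant as an exact phrase.
    return " OR ".join(f'"{v}"' for v in sorted(variants))


if __name__ == "__main__":
    print(expand_query("歐巴馬", SYNONYM_SETS))
```

A term with no entry in the table is returned as a single quoted phrase, so the expansion degrades to the original query when no synonyms are known.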
Index Terms
- Mining Synonymous Transliterations from the World Wide Web