Abstract
As data on the World Wide Web become massively available, more and more traditionally algorithm-centric problems are finding data-centric solutions. In this paper, we present a typical example: a Self-Dependent and Data-Based (SDDB) method for building bilingual dictionaries from the Web. Unlike many existing methods, which focus on finding effective algorithms for sentence segmentation and word alignment through machine learning and similar techniques, SDDB relies strongly on bilingual web pages from the Chinese Web, which are numerous enough to cover the terms needed to build dictionaries. The algorithms of SDDB are based on statistics of bilingual entries, which are easy to collect from parenthetical sentences on the Web. They run in time linear in the number of sentences and hence are scalable. In addition, rather than depending on a pre-existing corpus to build bilingual dictionaries, as many existing methods do, SDDB constructs the corpus from the Web by itself. This makes SDDB an automatic method that covers the complete process of building a bilingual dictionary from scratch. A Chinese-English dictionary built by SDDB, with over 4 million Chinese-English entries and over 6 million English-Chinese entries, shows performance competitive with a popular commercial product on the Web.
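The idea of collecting statistics from parenthetical sentences in a single linear pass can be illustrated with a minimal sketch. This is not the SDDB algorithm itself, which the abstract does not specify; the regular expression, the suffix-counting heuristic, the function names, and the `min_share` threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed pattern: a run of Chinese characters directly followed by an
# English phrase in ASCII or full-width parentheses.
PAREN = re.compile(r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z .'-]*?)\s*[)）]")

# Made-up sample sentences, not data from the paper.
SAMPLE = [
    "我喜欢机器学习(machine learning)这门课",
    "他研究机器学习(Machine Learning)",
    "关于机器学习（machine learning）的书",
]


def count_candidates(sentences):
    """Scan each sentence once and count candidate (Chinese, English) pairs."""
    pair_counts = Counter()
    english_totals = Counter()
    for sentence in sentences:
        for chinese, english in PAREN.findall(sentence):
            english = english.lower()
            english_totals[english] += 1
            # The left boundary of the Chinese term is unknown, so every
            # suffix of the preceding Chinese run is counted as a candidate.
            for start in range(len(chinese)):
                pair_counts[(chinese[start:], english)] += 1
    return pair_counts, english_totals


def pick_translation(pair_counts, english_totals, english, min_share=0.6):
    """Return the longest Chinese candidate seen in at least min_share of
    the occurrences of the English phrase (a hypothetical selection rule)."""
    english = english.lower()
    total = english_totals[english]
    frequent = [zh for (zh, en), n in pair_counts.items()
                if en == english and n >= min_share * total]
    return max(frequent, key=len) if frequent else None


pairs, totals = count_candidates(SAMPLE)
print(pick_translation(pairs, totals, "machine learning"))
```

Each sentence is visited once and the work per sentence is bounded by its length, so the counting step scales linearly with the number of sentences, which matches the scalability argument made in the abstract.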
This work was supported in part by the National Natural Science Foundation of China under grant No. 60833003.
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Han, J., Zhou, L., Liu, J. (2011). SDDB: A Self-Dependent and Data-Based Method for Constructing Bilingual Dictionary from the Web. In: Du, X., Fan, W., Wang, J., Peng, Z., Sharaf, M.A. (eds) Web Technologies and Applications. APWeb 2011. Lecture Notes in Computer Science, vol 6612. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20291-9_22
DOI: https://doi.org/10.1007/978-3-642-20291-9_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20290-2
Online ISBN: 978-3-642-20291-9
eBook Packages: Computer Science (R0)