SDDB: A Self-Dependent and Data-Based Method for Constructing Bilingual Dictionary from the Web

Han, Jun; Zhou, Lizhu; Liu, Juan

doi:10.1007/978-3-642-20291-9_22

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6612)

Included in the following conference series:

Asia-Pacific Web Conference

Abstract

As data of all kinds become massively available on the World Wide Web, more and more traditionally algorithm-centric problems are being solved in a data-centric way. In this paper, we present a typical example: a Self-Dependent and Data-Based (SDDB) method for building bilingual dictionaries from the Web. Unlike many existing methods, which focus on finding effective algorithms for sentence segmentation and word alignment through machine learning and similar techniques, SDDB relies heavily on the data in bilingual web pages from the Chinese Web, which are large enough to cover the terms needed to build dictionaries. The algorithms of SDDB are based on statistics of bilingual entries that are easy to collect from parenthetical sentences on the Web. They are linear in the number of sentences and hence scalable. In addition, rather than depending on a pre-existing corpus to build bilingual dictionaries, as many existing methods do, SDDB constructs the corpus from the Web by itself. This makes SDDB an automatic method that covers the complete process of building a bilingual dictionary from scratch. A Chinese-English dictionary built by SDDB, with over 4 million Chinese-English entries and over 6 million English-Chinese entries, shows performance competitive with a popular commercial product on the Web.
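To make the idea of frequency statistics over parenthetical sentences concrete, here is a minimal Python sketch. It is not the SDDB algorithm, which the abstract does not specify. It assumes a simple pattern in which a run of Chinese characters is immediately followed by an English phrase in parentheses. The regular expression, the function names `count_candidates` and `best_translation`, the `max_len` cutoff, and the suffix-counting heuristic are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Assumed pattern: a run of Chinese characters immediately followed by an
# English phrase in half-width or full-width parentheses.
PAREN = re.compile(r"([\u4e00-\u9fff]+)[(（]([A-Za-z][A-Za-z .'-]*)[)）]")


def count_candidates(sentences, max_len=6):
    """Count (Chinese suffix, English phrase) pairs in one pass over the sentences."""
    counts = Counter()
    for sentence in sentences:
        for zh, en in PAREN.findall(sentence):
            en = en.strip().lower()
            # The left boundary of the Chinese term is unknown, so every
            # suffix of the run (up to max_len characters) is a candidate.
            for k in range(1, min(max_len, len(zh)) + 1):
                counts[(zh[-k:], en)] += 1
    return counts


def best_translation(counts, en):
    """Pick the most frequent Chinese candidate; ties go to the longer one."""
    candidates = [(c, len(zh), zh) for (zh, e), c in counts.items() if e == en]
    return max(candidates)[2] if candidates else None


sentences = [
    "我喜欢苹果(apple)",
    "他买了苹果(apple)",
    "苹果(apple)很甜",
]
counts = count_candidates(sentences)
print(best_translation(counts, "apple"))
```

Each sentence is scanned once and contributes at most `max_len` candidates per match, so the running time of this sketch grows linearly with the number of sentences, in line with the scalability property the abstract claims for SDDB.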

This work was supported in part by National Natural Science Foundation of China under grant No. 60833003.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Tsinghua University, Beijing, 100084, China
Jun Han, Lizhu Zhou & Juan Liu

Editor information

Editors and Affiliations

School of Information, Renmin University of China, 100872, Beijing, China
Xiaoyong Du
LFCS, School of Informatics, University of Edinburgh, 10 Crichton Street, EH8 9AB, Edinburgh, Scotland, UK
Wenfei Fan
School of Software, Tsinghua University, Room 819, Main Building, 100084, Beijing, China
Jianmin Wang
Computer School, Wuhan University, Luojiashan Road, 430072, Wuhan, Hubei, China
Zhiyong Peng
School of Information Technology and Electrical Engineering, The University of Queensland, QLD 4072, St. Lucia, Australia
Mohamed A. Sharaf

About this paper

Cite this paper

Han, J., Zhou, L., Liu, J. (2011). SDDB: A Self-Dependent and Data-Based Method for Constructing Bilingual Dictionary from the Web. In: Du, X., Fan, W., Wang, J., Peng, Z., Sharaf, M.A. (eds) Web Technologies and Applications. APWeb 2011. Lecture Notes in Computer Science, vol 6612. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20291-9_22

DOI: https://doi.org/10.1007/978-3-642-20291-9_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20290-2
Online ISBN: 978-3-642-20291-9
eBook Packages: Computer Science, Computer Science (R0)
