SDDB: A Self-Dependent and Data-Based Method for Constructing Bilingual Dictionary from the Web

  • Conference paper
Web Technologies and Applications (APWeb 2011)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6612)

Abstract

As data of all kinds become massively available on the World Wide Web, more and more traditionally algorithm-centric problems are finding data-centric solutions. In this paper, we present a typical example: a Self-Dependent and Data-Based (SDDB) method for building bilingual dictionaries from the Web. Unlike many existing methods, which focus on finding effective algorithms for sentence segmentation and word alignment through machine learning and similar techniques, SDDB relies heavily on bilingual pages from the Chinese Web, which are numerous enough to cover the terms needed to build dictionaries. The algorithms of SDDB are based on statistics of bilingual entries that are easy to collect from parenthetical sentences on the Web. They are linear in the number of sentences and hence scalable. In addition, rather than depending on a pre-existing corpus, as many existing methods do, SDDB constructs its corpus from the Web by itself. This makes SDDB an automatic method that covers the complete process of building a bilingual dictionary from scratch. A Chinese-English dictionary built by SDDB, with over 4 million Chinese-English entries and over 6 million English-Chinese entries, performs competitively with a popular commercial product on the Web.
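
The idea of collecting bilingual entries from parenthetical sentences and ranking them by frequency can be illustrated with a minimal Python sketch. This is not the SDDB algorithm itself, whose details are not given in the abstract. The pattern `PAREN`, the functions `count_candidates` and `build_dictionary`, and the `min_count` threshold are all hypothetical names chosen for illustration. The sketch also assumes that the whole run of Chinese characters before the parenthesis is the term, which over-captures in real running text.

```python
import re
from collections import Counter

# Hypothetical pattern (an assumption, not taken from the paper): a run of
# Chinese characters followed by a parenthesised English phrase, using
# either ASCII or full-width parentheses.
PAREN = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z \-]*?)\s*[)）]"
)

def count_candidates(sentences):
    """Count (Chinese, English) candidate pairs in one pass over the input."""
    counts = Counter()
    for sentence in sentences:
        for zh, en in PAREN.findall(sentence):
            counts[(zh, en.lower())] += 1
    return counts

def build_dictionary(counts, min_count=2):
    """Keep, for each Chinese term, its most frequent English candidate.

    min_count is an illustrative threshold for discarding rare pairs.
    """
    best = {}
    for (zh, en), n in counts.items():
        if n >= min_count and n > best.get(zh, ("", 0))[1]:
            best[zh] = (en, n)
    return {zh: en for zh, (en, _) in best.items()}

sentences = [
    "机器翻译(machine translation)",
    "机器翻译（Machine Translation）",
    "机器翻译(MT)",
    "数据库(database)",
]
counts = count_candidates(sentences)
dictionary = build_dictionary(counts, min_count=2)
```

Each sentence is scanned once, so the counting step is linear in the number of sentences, in line with the scalability property the abstract describes.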

This work was supported in part by National Natural Science Foundation of China under grant No. 60833003.

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Han, J., Zhou, L., Liu, J. (2011). SDDB: A Self-Dependent and Data-Based Method for Constructing Bilingual Dictionary from the Web. In: Du, X., Fan, W., Wang, J., Peng, Z., Sharaf, M.A. (eds) Web Technologies and Applications. APWeb 2011. Lecture Notes in Computer Science, vol 6612. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20291-9_22

  • DOI: https://doi.org/10.1007/978-3-642-20291-9_22

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20290-2

  • Online ISBN: 978-3-642-20291-9

  • eBook Packages: Computer Science, Computer Science (R0)
