Building Compact Lexicons for Cross-Domain SMT by Mining Near-Optimal Pattern Sets

  • Conference paper
Advances in Knowledge Discovery and Data Mining (PAKDD 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9651)

Abstract

Statistical machine translation (SMT) models are known to benefit from the availability of a domain-specific bilingual lexicon. Bilingual lexicons traditionally consist of multiword expressions, either extracted from parallel corpora or manually curated. We claim that “patterns”, composed of words and higher-order categories, generalize better in capturing the syntax and semantics of the domain. In this work, we present an approach to extract such patterns from a domain corpus and curate a high-quality bilingual lexicon. We discuss several features of these patterns that define the “consensus” between their underlying multiwords. We incorporate the bilingual lexicon into a baseline SMT model, and detailed experiments show that the resulting translation model performs much better than the baseline and other similar systems.

Notes

  1. We release our code for optimal pattern-set identification, as well as the lexicons: https://www.cse.iitb.ac.in/~ganesh/Publications.html

  2. http://www.statmt.org/wmt07/shared-task.html

  3. https://mymemory.translated.net/

  4. http://www.statmt.org/moses/RELEASE-3.0/models/

  5. Inclusive and exclusive mode: http://www.statmt.org/moses/?n=Advanced.Hybrid

Acknowledgments

This research was supported by the Intranet Search project from IRCC at IIT Bombay.

Author information

Corresponding author

Correspondence to Pankaj Singh.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Singh, P., Kulkarni, A., Ojha, H., Kumar, V., Ramakrishnan, G. (2016). Building Compact Lexicons for Cross-Domain SMT by Mining Near-Optimal Pattern Sets. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9651. Springer, Cham. https://doi.org/10.1007/978-3-319-31753-3_24

  • DOI: https://doi.org/10.1007/978-3-319-31753-3_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-31752-6

  • Online ISBN: 978-3-319-31753-3

  • eBook Packages: Computer Science (R0)
