Building Compact Lexicons for Cross-Domain SMT by Mining Near-Optimal Pattern Sets

Singh, Pankaj; Kulkarni, Ashish; Ojha, Himanshu; Kumar, Vishwajeet; Ramakrishnan, Ganesh

doi:10.1007/978-3-319-31753-3_24

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9651)

Included in the following conference series:

Pacific-Asia Conference on Knowledge Discovery and Data Mining

Abstract

Statistical machine translation models are known to benefit from the availability of a domain bilingual lexicon. Bilingual lexicons traditionally consist of multiword expressions, either extracted from parallel corpora or manually curated. We claim that “patterns”, composed of words and higher-order categories, generalize better in capturing the syntax and semantics of the domain. In this work, we present an approach to extract such patterns from a domain corpus and curate a high-quality bilingual lexicon. We discuss several features of these patterns that define the “consensus” between their underlying multiwords. We incorporate the bilingual lexicon in a baseline SMT model, and detailed experiments show that the resulting translation model performs much better than the baseline and other similar systems.

Notes

1. We release our code for optimal pattern-set identification, as well as the lexicons: https://www.cse.iitb.ac.in/~ganesh/Publications.html
2. http://www.statmt.org/wmt07/shared-task.html
3. https://mymemory.translated.net/
4. http://www.statmt.org/moses/RELEASE-3.0/models/
5. Inclusive and exclusive mode: http://www.statmt.org/moses/?n=Advanced.Hybrid

Acknowledgments

This research was supported by the Intranet Search project from IRCC at IIT Bombay.

Author information

Authors and Affiliations

Computer Science and Engineering, IIT Bombay, Mumbai, India
Pankaj Singh, Ashish Kulkarni, Himanshu Ojha, Vishwajeet Kumar & Ganesh Ramakrishnan

Corresponding author

Correspondence to Pankaj Singh.

Editor information

Editors and Affiliations

The University of Melbourne, Melbourne, Victoria, Australia
James Bailey
The University of Texas at Dallas, Richardson, Texas, USA
Latifur Khan
Osaka University, Osaka, Japan
Takashi Washio
University of Auckland, Auckland, New Zealand
Gill Dobbie
Shenzhen University, Shenzhen, China
Joshua Zhexue Huang
Massey University, Auckland, New Zealand
Ruili Wang

About this paper

Cite this paper

Singh, P., Kulkarni, A., Ojha, H., Kumar, V., Ramakrishnan, G. (2016). Building Compact Lexicons for Cross-Domain SMT by Mining Near-Optimal Pattern Sets. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9651. Springer, Cham. https://doi.org/10.1007/978-3-319-31753-3_24

DOI: https://doi.org/10.1007/978-3-319-31753-3_24
Published: 12 April 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31752-6
Online ISBN: 978-3-319-31753-3
eBook Packages: Computer Science, Computer Science (R0)
