Incident-Driven Machine Translation and Name Tagging for Low-resource Languages

Machine Translation

Abstract

We describe novel approaches to tackling the problem of natural language processing for low-resource languages. The approaches are embodied in systems for name tagging and machine translation (MT) that we constructed to participate in the NIST LoReHLT evaluation in 2016. Our methods include universal tools, rapid resource and knowledge acquisition, rapid language projection, and joint methods for MT and name tagging.

Notes

  1. http://www.itl.nist.gov/iad/mig/tests/mt/.

  2. http://www.isi.edu/natural-language/software/romanizer/uroman-v1.2.tar.gz.

  3. http://wals.info/.

  4. http://cldr.unicode.org/.

  5. http://sswl.railsplayground.net/.

  6. http://fieldsupport.dliflc.edu/.

  7. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/cmt-40/Nice/Elicitation/Elicitation_Corpus-LDC/.

  8. https://www.archives.gov/iwg/declassified-records/rg-263-cia-records.

  9. https://en.wiktionary.org.

  10. http://panlex.org/.

  11. http://cldr.unicode.org/.

  12. http://www.isi.edu/natural-language/software/utilities/cleaner-v1.0.tar.gz.

  13. https://en.wiktionary.org.

Acknowledgements

We would like to thank other ELISA team members who contributed to resource construction and system preparation before the evaluation: Chris Callison-Burch (UPenn), Aliya Deri (USC) and Ashish Vaswani (Google). We thank Billy Wagner from Next Century for running the LTDE to produce name tagging runs. This work was supported by the U.S. Defense Advanced Research Projects Agency (DARPA) LORELEI Program No. HR0011-15-C-0115 and ARL/ARO MURI W911NF-10-1-0533. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.

Author information

Corresponding author

Correspondence to Kevin Knight.

About this article

Cite this article

Hermjakob, U., Li, Q., Marcu, D. et al. Incident-Driven Machine Translation and Name Tagging for Low-resource Languages. Machine Translation 32, 59–89 (2018). https://doi.org/10.1007/s10590-017-9207-1

  • DOI: https://doi.org/10.1007/s10590-017-9207-1
