Abstract
The LoReHLT16 evaluation challenged participants to extract Situation Frames (SFs), structured descriptions of humanitarian need situations, from monolingual Uyghur text. The ARIEL-CMU SF detector combines two classification paradigms: a manually curated keyword-spotting system and a machine learning classifier. Both were applied to Uyghur by translating the models feature by feature, rather than by translating the input text. The resulting combined model pairs the accuracy of human insight with the generality of machine learning, and is relatively tractable to human analysis and error correction. Other factors contributing to success were automatic dictionary creation, the use of phonetic transcription, detailed hand-written morphological analysis, and naturalistic glossing for error analysis by humans. The ARIEL-CMU SF pipeline produced the top-scoring LoReHLT16 situation frame detection systems for the metrics SFType, SFType+Place+Need, SFType+Place+Relief, and SFType+Place+Urgency, at each of the three checkpoints.
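To make the idea of translating a model per feature (rather than translating the input text) concrete, the following is a minimal sketch and not the ARIEL-CMU implementation. It assumes a bag-of-words linear classifier with English word weights and a bilingual lexicon mapping each English word to candidate target-language words. The function names, the equal splitting of a weight across translations, the additive keyword bonus, and the placeholder tokens such as `tgt_water_a` are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def translate_model(weights, lexicon):
    """Project English feature weights onto target-language words.

    weights: dict mapping an English word to its classifier weight.
    lexicon: dict mapping an English word to a list of target-language words.
    Assumption: a weight is split equally among a word's translations;
    English features with no lexicon entry are dropped.
    """
    translated = defaultdict(float)
    for en_word, weight in weights.items():
        targets = lexicon.get(en_word, [])
        if not targets:
            continue  # untranslatable feature: no contribution
        share = weight / len(targets)
        for tgt_word in targets:
            translated[tgt_word] += share
    return dict(translated)


def score(tokens, keywords, translated_weights, keyword_bonus=1.0):
    """Score a target-language document without translating it.

    Adds the translated classifier weights of the tokens to a fixed
    bonus for every token found in the curated keyword set (a
    hypothetical way of combining the two paradigms).
    """
    ml_score = sum(translated_weights.get(tok, 0.0) for tok in tokens)
    kw_score = keyword_bonus * sum(1 for tok in tokens if tok in keywords)
    return ml_score + kw_score
```

In this toy setup, a bad translated feature can be corrected by deleting a single entry from the keyword set or the translated weight table, which illustrates why a per-feature model can be easier for a human to inspect than a pipeline built on translated text.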
Notes
This is thus not lemmatization per se—the lemma of all of these is qatar, with -liq being a suffix—but rather an attempt to find the most appropriate corresponding word in the lexicons, whether it is a lemma or not.
Compared to our SFType detection systems, the features in our English Status-detection decision trees focused more on functional words (e.g., words more often indicative of tense, aspect, or modality) than on content words. We did not believe these words would translate well using a lexical feature-translation approach, so we did not submit any of these results as part of a primary submission.
The error correction was performed on both models, but in the keyword model it was more straightforward to fix (i.e., by simply removing the keyword) and to know that the fix had worked.
Acknowledgements
This project was sponsored by the Defense Advanced Research Projects Agency (DARPA) Information Innovation Office (I2O), program: Low Resource Languages for Emergent Incidents (LORELEI), issued by DARPA/I2O under Contract No. HR0011-15-C-0114.
Cite this article
Littell, P., Tian, T., Xu, R. et al. The ARIEL-CMU situation frame detection pipeline for LoReHLT16: a model translation approach. Machine Translation 32, 105–126 (2018). https://doi.org/10.1007/s10590-017-9205-3