ABSTRACT
Language scaling aims to deploy Natural Language Processing (NLP) applications economically across many countries and regions with different languages. Industry has invested heavily in language scaling, since many parties want to deploy their applications and services to global markets. At the same time, scaling out NLP applications to various languages, essentially a data science problem, remains a grand challenge due to the huge differences in morphology, syntax, and pragmatics among languages. We present a comprehensive survey and tutorial on language scaling. We start with a clear problem statement for language scaling and an intuitive discussion of the overall challenges. We then outline two major categories of approaches to language scaling, namely model transfer and data transfer, and present a taxonomy that summarizes the various methods in the literature. A large part of the tutorial is organized around the major types of NLP applications. Finally, we discuss several important open challenges in this area and future directions.
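The two approach categories above can be contrasted with a toy sketch. This is purely illustrative and not from the tutorial itself: all names, the word-level "translation" dictionary, and the set-overlap "classifier" are hypothetical stand-ins; real systems would use multilingual pretrained encoders for model transfer and machine translation for data transfer.

```python
# Toy contrast of the two categories (all names here are illustrative assumptions):
# - Model transfer: train once on source-language data in a shared, language-
#   agnostic representation, then apply the model zero-shot to the target language.
# - Data transfer: translate the labeled source data into the target language
#   first, then train a target-language model on the translated data.

TOY_TRANSLATIONS = {"good": "bueno", "bad": "malo", "movie": "pelicula"}

def featurize_language_agnostic(text):
    # Stand-in for a shared multilingual representation (e.g., shared subword
    # embeddings): map words from either language onto common pivot tokens.
    pivots = {"bueno": "good", "malo": "bad", "pelicula": "movie"}
    return {pivots.get(w, w) for w in text.split()}

def train(examples):
    # A trivial "model": remember the features seen in positive examples.
    positive = set()
    for features, label in examples:
        if label == 1:
            positive |= features
    return positive

source_data = [("good movie", 1), ("bad movie", 0)]

# Model transfer: featurize source data into the shared space, train, and
# predict on target-language input without any target-language labels.
model = train([(featurize_language_agnostic(t), y) for t, y in source_data])
pred_model_transfer = int((featurize_language_agnostic("bueno pelicula") & model) != set())

# Data transfer: "machine-translate" the labeled data, then train directly
# on the translated target-language text.
translated = [(" ".join(TOY_TRANSLATIONS[w] for w in t.split()), y) for t, y in source_data]
target_model = train([(set(t.split()), y) for t, y in translated])
pred_data_transfer = int((set("bueno pelicula".split()) & target_model) != set())
```

Both routes classify the unseen Spanish input correctly in this toy setup; the tutorial's taxonomy organizes the many real methods that instantiate each route.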
Index Terms
- Language Scaling: Applications, Challenges and Approaches