Abstract
Thanks to large amounts of high-quality labeled data (instances), deep learning delivers significant performance gains across a variety of tasks. However, constructing instances is time-consuming and labor-intensive, which poses a serious challenge for natural language processing (NLP) in many specialized fields. For example, the medical question matching dataset CHIP contains only 2.7% as many instances as the general-field dataset LCQMC, and performance on it reaches only 79.19% of that in the general field. To alleviate this scarcity, methods such as data augmentation, robust learning, and pre-trained models are widely used; in NLP, text data augmentation and pre-trained models are two of the most common. However, recent experiments have shown that general data augmentation techniques may bring limited or even negative gains to pre-trained models. To fully understand why, this paper applies three types of data quality assessment at two levels, label-independent and label-dependent, and then selects, filters, and transforms the outputs of three text data augmentation methods. Experiments in both the general and a specialized (medical) field show that analyzing, selecting/filtering, and transforming augmented instances effectively improves the performance of pre-trained models on intent understanding and question matching.
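The abstract describes generating augmented instances and then keeping only those that pass a quality assessment. As a rough illustration of the label-dependent side of that idea, the sketch below pairs an EDA-style random-swap augmenter (Wei and Zou, 2019) with a confidence filter: an augmented instance is kept only if a classifier trained on the clean data still assigns its original label with high confidence. The function names, the choice of swap as the augmentation operation, and the 0.8 threshold are illustrative assumptions, not the paper's exact procedure.

    # Minimal sketch: augment, then apply a label-dependent quality filter.
    # Hypothetical names; not the authors' exact method.
    import random
    from typing import Callable, List, Tuple

    def eda_swap(text: str, n_swaps: int = 1) -> str:
        """Randomly swap two tokens -- one of the EDA operations."""
        tokens = text.split()
        for _ in range(n_swaps):
            if len(tokens) < 2:
                break
            i, j = random.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
        return " ".join(tokens)

    def filter_augmented(
        pairs: List[Tuple[str, int]],
        predict_proba: Callable[[str], List[float]],
        threshold: float = 0.8,
    ) -> List[Tuple[str, int]]:
        """Keep (text, label) only if a model trained on the clean data
        still predicts the original label with confidence >= threshold."""
        kept = []
        for text, label in pairs:
            probs = predict_proba(text)
            predicted = max(range(len(probs)), key=probs.__getitem__)
            if predicted == label and probs[label] >= threshold:
                kept.append((text, label))
        return kept

    if __name__ == "__main__":
        # predict_proba stands in for any classifier fine-tuned on the
        # clean training set (e.g. a BERT-based intent classifier);
        # here it is a toy stub for demonstration only.
        def predict_proba(text: str) -> List[float]:
            return [0.1, 0.9] if "pain" in text else [0.9, 0.1]

        augmented = [(eda_swap("where does the pain occur"), 1) for _ in range(5)]
        print(filter_augmented(augmented, predict_proba))

Transformation of augmented instances (the other operation named in the abstract) would slot in between the two steps; the filter shown here is the simplest label-dependent assessment one could swap in.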
References
Anaby-Tavor, A., et al.: Not enough data? Deep learning to the rescue! arXiv:1911.03118 (2019)
Chen, N., Su, X., Liu, T., Hao, Q., Wei, M.: A benchmark dataset and case study for Chinese medical question intent classification. BMC Med. Inform. Decis. Making 20(3), 1–7 (2020)
Cui, Y., Che, W., Liu, T., Qin, B., Wang, S., Hu, G.: Revisiting pre-trained models for Chinese natural language processing. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pp. 657–668. Association for Computational Linguistics, Online, November 2020. https://www.aclweb.org/anthology/2020.findings-emnlp.58
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)
Hu, J., Wang, G., Lochovsky, F., Sun, J.T., Chen, Z.: Understanding user’s query intent with Wikipedia. In: Proceedings of the 18th International Conference on World Wide Web, pp. 471–480 (2009)
Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, pp. 423–430 (2003)
Liu, X., et al.: LCQMC: a large-scale Chinese question matching corpus. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1952–1962 (2018)
Longpre, S., Wang, Y., DuBois, C.: How effective is task-agnostic data augmentation for pretrained transformers? arXiv:2010.01764 (2020)
Malandrakis, N., Shen, M., Goyal, A., Gao, S., Sethi, A., Metallinou, A.: Controlled text generation for data augmentation in intelligent artificial agents. arXiv:1910.03487 (2019)
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI Blog 1(8), 9 (2019)
Wei, J., Zou, K.: EDA: easy data augmentation techniques for boosting performance on text classification tasks. arXiv:1901.11196 (2019)
Wong, S.C., Gatt, A., Stamatescu, V., McDonnell, M.D.: Understanding data augmentation for classification: when to warp? In: 2016 International Conference on Digital Image Computing: Techniques and Applications (DICTA), pp. 1–6. IEEE (2016)
Wu, X., Lv, S., Zang, L., Han, J., Hu, S.: Conditional BERT contextual augmentation. arXiv:1812.06705 (2019)
Xie, Q., Dai, Z., Hovy, E., Luong, M.T., Le, Q.V.: Unsupervised data augmentation for consistency training. arXiv:1904.12848 (2020)
Yu, A.W., et al.: QANet: combining local convolution with global self-attention for reading comprehension. arXiv:1804.09541 (2018)
Zhang, H., Cissé, M., Dauphin, Y., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. arXiv:1710.09412 (2018)
Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: NIPS (2015)
Zhang, X., Wu, X., Chen, F., Zhao, L., Lu, C.T.: Self-paced robust learning for leveraging clean labels in noisy data. In: AAAI (2020)
Zhang, Z.: GPT2-ML: GPT-2 for multiple languages. https://github.com/imcaspar/gpt2-ml (2019)
Acknowledgement
This work was supported by the Science and Technology Program of the Headquarters of State Grid Corporation of China, “Research on Knowledge Discovery, Reasoning and Decision-making for Electric Power Operation and Maintenance Based on Graph Machine Learning and Its Applications”, under Grant 5700-202012488A-0-0-00. It was also supported by an independent research project of the National Laboratory of Pattern Recognition, the Youth Innovation Promotion Association CAS, and the Beijing Academy of Artificial Intelligence (BAAI).
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Xia, F., He, S., Liu, K., Liu, S., Zhao, J. (2021). Toward a Better Text Data Augmentation via Filtering and Transforming Augmented Instances. In: Qin, B., Jin, Z., Wang, H., Pan, J., Liu, Y., An, B. (eds) Knowledge Graph and Semantic Computing: Knowledge Graph Empowers New Infrastructure Construction. CCKS 2021. Communications in Computer and Information Science, vol 1466. Springer, Singapore. https://doi.org/10.1007/978-981-16-6471-7_15
DOI: https://doi.org/10.1007/978-981-16-6471-7_15
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-6470-0
Online ISBN: 978-981-16-6471-7
eBook Packages: Computer Science, Computer Science (R0)