
Toward a Better Text Data Augmentation via Filtering and Transforming Augmented Instances

Conference paper

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1466)

Abstract

Deep learning delivers strong performance on a variety of tasks thanks to large amounts of high-quality labeled data (instances). However, constructing instances is time-consuming and laborious, which poses a serious challenge for natural language processing (NLP) tasks in many specialized fields. For example, the medical-domain question matching dataset CHIP contains only 2.7% as many instances as the general-domain dataset LCQMC, and performance on it reaches only 79.19% of the general-domain level. To mitigate this scarcity, practitioners commonly turn to data augmentation, robust learning, and pre-trained models; text data augmentation and pre-trained models are the two most widely used remedies in NLP. However, recent experiments have shown that generic data augmentation techniques may bring limited or even negative gains when combined with pre-trained models. To fully understand the reasons for this result, this paper applies three data quality assessment methods at two levels, label-independent and label-dependent, and then selects, filters, and transforms the outputs of three text data augmentation methods. Experiments in both the general and the specialized (medical) domain show that analyzing, selecting/filtering, and transforming augmented instances effectively improves the performance of pre-trained models on intent understanding and question matching.
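To make the filtering idea concrete, here is a minimal, hypothetical sketch of a label-dependent quality filter: an augmented (text, label) pair is kept only if a classifier trained on the original seed data still assigns it to its source label with high confidence. The `filter_augmented` helper, the TF-IDF/logistic-regression scorer, and the threshold value are illustrative assumptions, not the paper's exact assessment procedure.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def filter_augmented(seed_texts, seed_labels, augmented, threshold=0.7):
    """Keep augmented (text, label) pairs only if a model trained on the
    original seed data still assigns them to their source label with
    probability >= threshold (a label-dependent quality check)."""
    scorer = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    scorer.fit(seed_texts, seed_labels)
    kept = []
    for text, label in augmented:
        probs = scorer.predict_proba([text])[0]
        # Probability the scorer assigns to the instance's source label.
        p_source = probs[list(scorer.classes_).index(label)]
        if p_source >= threshold:
            kept.append((text, label))
    return kept

# Toy usage: the scorer is trained on the seed data, then each augmented
# pair is kept or dropped by its confidence under its source label.
seed = ["what causes a fever", "how do I book an appointment"]
labels = ["symptom", "service"]
augmented = [("what leads to a fever", "symptom"),
             ("appointment fever", "symptom")]
print(filter_augmented(seed, labels, augmented, threshold=0.5))
```

A label-independent check (e.g., discarding near-duplicates or length outliers among the augmented texts) could be applied before this step, matching the two-level framing in the abstract.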


Notes

  1. http://api.fanyi.baidu.com/.
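The footnote points to the Baidu translation API, presumably the service used for back-translation augmentation. Below is a hypothetical sketch of a zh→en→zh back-translation round trip; the endpoint path, parameter names, and MD5 signing scheme follow Baidu's public API documentation as commonly described and should be verified, and APPID/SECRET are placeholders.

```python
import hashlib
import random

import requests

# Hypothetical back-translation helper built on the Baidu Translate API
# (footnote 1). Endpoint, parameters, and signing are assumptions based
# on Baidu's public documentation; replace APPID/SECRET with real keys.
APPID = "your_appid"
SECRET = "your_secret"
URL = "https://fanyi-api.baidu.com/api/trans/vip/translate"

def translate(text, src, dst):
    salt = str(random.randint(1, 1 << 30))
    # sign = MD5(appid + query + salt + secret), per Baidu's signing scheme.
    sign = hashlib.md5((APPID + text + salt + SECRET).encode("utf-8")).hexdigest()
    params = {"q": text, "from": src, "to": dst,
              "appid": APPID, "salt": salt, "sign": sign}
    resp = requests.get(URL, params=params, timeout=10)
    return resp.json()["trans_result"][0]["dst"]

def back_translate(text, pivot="en"):
    # zh -> pivot -> zh round trip yields a paraphrase of the input,
    # a common text data augmentation technique.
    return translate(translate(text, "zh", pivot), pivot, "zh")
```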


Acknowledgement

This work was supported by the Science and Technology Program of the Headquarters of State Grid Corporation of China, "Research on Knowledge Discovery, Reasoning and Decision-making for Electric Power Operation and Maintenance Based on Graph Machine Learning and Its Applications", under Grant 5700-202012488A-0-0-00. It was also supported by an independent research project of the National Laboratory of Pattern Recognition, the Youth Innovation Promotion Association CAS, and the Beijing Academy of Artificial Intelligence (BAAI).

Author information

Corresponding author

Correspondence to Fei Xia.


Copyright information

© 2021 Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Xia, F., He, S., Liu, K., Liu, S., Zhao, J. (2021). Toward a Better Text Data Augmentation via Filtering and Transforming Augmented Instances. In: Qin, B., Jin, Z., Wang, H., Pan, J., Liu, Y., An, B. (eds) Knowledge Graph and Semantic Computing: Knowledge Graph Empowers New Infrastructure Construction. CCKS 2021. Communications in Computer and Information Science, vol 1466. Springer, Singapore. https://doi.org/10.1007/978-981-16-6471-7_15


  • DOI: https://doi.org/10.1007/978-981-16-6471-7_15

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-16-6470-0

  • Online ISBN: 978-981-16-6471-7

  • eBook Packages: Computer Science, Computer Science (R0)
