Explainability-Based Mix-Up Approach for Text Data Augmentation

Published: 20 February 2023

Abstract

Text augmentation is a strategy for increasing the diversity of training examples without explicitly collecting new data. Owing to its efficiency and effectiveness, numerous augmentation methodologies have been proposed. Among them, modification-based methods, particularly the mix-up approach of swapping words between two or more sentences, are widely used because they are simple to apply and perform well. However, existing mix-up approaches are limited in that they do not reflect the importance of the manipulated words: even when a word that has a critical effect on the classification result is swapped, this is not taken into account when labeling the augmented data. In this study, we therefore propose an effective text augmentation technique that explicitly derives the importance, that is, the explainability, of each manipulated word and reflects this importance in the labels of the augmented data. Experimental results confirm that reflecting the importance of the manipulated words in the labeling yields significantly higher performance than the existing methods.
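The labeling idea described above can be sketched as a small, self-contained example. Everything here is illustrative rather than the paper's actual method: the function name, the uniform word-position swapping, and the toy importance scores are all hypothetical; in the proposed approach the per-word importances would come from an explainability technique applied to a trained classifier.

```python
# Hedged sketch of explainability-weighted mix-up for text augmentation.
# Assumption: per-word importance scores are given; here they are toy values,
# whereas the paper derives them from an explainability method.
import random

def explainability_mixup(words_a, words_b, imp_a, imp_b, n_swaps, seed=0):
    """Swap n_swaps word positions of sentence A with the words of sentence B,
    then derive a soft label from the importance mass each source contributes.

    Returns the mixed word list and lam, the soft-label weight of sentence A's
    class, so the augmented label would be lam * y_a + (1 - lam) * y_b.
    """
    rng = random.Random(seed)
    max_pos = min(len(words_a), len(words_b))
    swap_positions = set(rng.sample(range(max_pos), n_swaps))

    mixed, mass_a, mass_b = [], 0.0, 0.0
    for i, word in enumerate(words_a):
        if i in swap_positions:
            mixed.append(words_b[i])   # word taken from sentence B
            mass_b += imp_b[i]
        else:
            mixed.append(word)         # word kept from sentence A
            mass_a += imp_a[i]

    # Soft label reflects how much *importance* (not just how many words)
    # each source sentence contributes to the mixed example.
    lam = mass_a / (mass_a + mass_b)
    return mixed, lam

mixed, lam = explainability_mixup(
    ["the", "movie", "was", "great"],
    ["this", "film", "felt", "boring"],
    [0.05, 0.10, 0.05, 0.80],
    [0.05, 0.15, 0.10, 0.70],
    n_swaps=2, seed=0)
```

The key contrast with plain mix-up is the last step: a count-based scheme would always set lam to 0.5 here (two of four words swapped), whereas the importance-weighted lam shifts toward whichever sentence contributed the more influential words.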



    • Published in

ACM Transactions on Knowledge Discovery from Data, Volume 17, Issue 1
January 2023, 375 pages
ISSN: 1556-4681
EISSN: 1556-472X
DOI: 10.1145/3572846


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 20 February 2023
      • Online AM: 27 April 2022
      • Accepted: 21 April 2022
      • Revised: 2 March 2022
      • Received: 27 October 2021
