research-article

Explainability-Based Mix-Up Approach for Text Data Augmentation

Published: 20 February 2023

Abstract

Text augmentation increases the diversity of training examples without explicitly collecting new data. Because it is efficient and effective, numerous augmentation methods have been proposed. Among them, modification-based methods, particularly the mix-up method of swapping words between two or more sentences, are widely used because they are simple to apply and perform well. However, existing mix-up approaches have a limitation: they do not reflect the importance of the manipulated words. That is, even when a word that critically affects the classification result is manipulated, this is not taken into account when labeling the augmented data. In this study, we therefore propose a text augmentation technique that explicitly computes the importance of each manipulated word, that is, its explainability, and reflects this importance in the labels of the augmented data. Experimental results confirmed that reflecting the importance of the manipulated words in the labeling yields significantly higher performance than existing methods.
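The idea of letting word importance drive the label of a mixed sentence can be illustrated with a small sketch. The abstract does not give the labeling formula, so the scheme below is only one plausible reading and not the authors' method: the function name `importance_weighted_mixup`, the rule of weighting each label by the share of importance mass it contributes, and the hand-written importance scores are all assumptions made for illustration. In practice the scores would come from an explainability method rather than being typed in.

```python
from typing import Dict, List, Sequence, Tuple


def importance_weighted_mixup(
    words_a: Sequence[str], imp_a: Sequence[float], label_a: str,
    words_b: Sequence[str], imp_b: Sequence[float], label_b: str,
    swap_positions: Sequence[int],
) -> Tuple[List[str], Dict[str, float]]:
    """Swap words from sentence B into sentence A and build a soft label.

    Hypothetical scheme (an assumption, not taken from the paper): each
    label is weighted by the non-negative importance mass its sentence
    contributes to the mixed sentence.
    """
    if len(words_a) != len(imp_a) or len(words_b) != len(imp_b):
        raise ValueError("each sentence needs one importance score per word")
    swapped = set(swap_positions)
    limit = min(len(words_a), len(words_b))
    if any(i < 0 or i >= limit for i in swapped):
        raise ValueError("swap position outside both sentences")

    # Take the word from B at swapped positions, otherwise keep A's word.
    mixed = [words_b[i] if i in swapped else w for i, w in enumerate(words_a)]

    # Importance that survives from A, and importance brought in from B.
    kept = sum(s for i, s in enumerate(imp_a) if i not in swapped)
    gained = sum(imp_b[i] for i in swapped)
    total = kept + gained
    weight_a = kept / total if total > 0 else 0.5

    soft_label = {label_a: 0.0, label_b: 0.0}
    soft_label[label_a] += weight_a
    soft_label[label_b] += 1.0 - weight_a
    return mixed, soft_label


# Illustrative inputs; the scores are made up, not computed by an XAI method.
pos = ["the", "movie", "was", "wonderful"]
neg = ["the", "plot", "was", "terrible"]
scores = [0.0, 0.125, 0.0, 0.875]

# Swapping the high-importance word shifts the label strongly toward "neg".
mixed_hi, label_hi = importance_weighted_mixup(pos, scores, "pos", neg, scores, "neg", [3])
# Swapping a low-importance word leaves the label mostly "pos".
mixed_lo, label_lo = importance_weighted_mixup(pos, scores, "pos", neg, scores, "neg", [1])
print(mixed_hi, label_hi)
print(mixed_lo, label_lo)
```

A labeling rule based only on the fraction of swapped words would assign the same label to both mixed sentences, since one word in four is replaced in each case. Under the assumed importance-based rule the two labels differ, which is the distinction the abstract draws.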

    Published In

    ACM Transactions on Knowledge Discovery from Data  Volume 17, Issue 1
    January 2023
    375 pages
    ISSN:1556-4681
    EISSN:1556-472X
    DOI:10.1145/3572846

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 20 February 2023
    Online AM: 27 April 2022
    Accepted: 21 April 2022
    Revised: 02 March 2022
    Received: 27 October 2021
    Published in TKDD Volume 17, Issue 1

    Author Tags

    1. Text augmentation
    2. mix-up approach
    3. XAI
    4. soft-labeling
    5. word-explainability

    Qualifiers

    • Research-article

    Funding Sources

    • National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIT)
