Abstract
Text augmentation is a strategy for increasing the diversity of training examples without explicitly collecting new data. Owing to its efficiency and effectiveness, numerous augmentation methodologies have been proposed. Among them, modification-based methods, particularly mix-up methods that swap words between two or more sentences, are widely used because they are simple to apply and perform well. However, existing mix-up approaches are limited in that they do not reflect the importance of the manipulated words: even when a word that critically affects the classification result is manipulated, that manipulation is not taken into account when labeling the augmented data. In this study, we therefore propose an effective text augmentation technique that explicitly derives the importance, that is, the explainability, of each manipulated word and reflects this importance in the labels of the augmented data. Experimental results confirm that reflecting the importance of the manipulated words in the labeling yields significantly better performance than existing methods.
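As a rough illustration of the idea described above, the following sketch mixes words between two sentences and weights the resulting soft label by the importance of the words each source contributes. All names here are hypothetical, and the per-word importance scores are assumed to be precomputed by some explainability method (e.g., LIME or a gradient-based attribution); this is not the paper's exact algorithm.

```python
def explainability_mixup(words_a, scores_a, label_a,
                         words_b, scores_b, label_b,
                         swap_idx):
    """Replace the words of sentence A at positions `swap_idx` with the
    first len(swap_idx) words of sentence B, then assign a soft label
    proportional to the total importance each source contributes."""
    mixed = list(words_a)
    imported = 0.0  # importance mass brought in from sentence B
    for k, i in enumerate(swap_idx):
        mixed[i] = words_b[k]
        imported += scores_b[k]
    # importance mass of the words kept from sentence A
    kept = sum(s for i, s in enumerate(scores_a) if i not in swap_idx)
    lam = kept / (kept + imported)  # weight of label_a in the soft label
    if label_a == label_b:
        soft_label = {label_a: 1.0}
    else:
        soft_label = {label_a: lam, label_b: 1.0 - lam}
    return " ".join(mixed), soft_label
```

Under this scheme, swapping in a highly important word from the second sentence shifts the label weight toward that sentence's class, which is precisely the effect a plain mix-up labeling rule ignores.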
Index Terms
- Explainability-Based Mix-Up Approach for Text Data Augmentation