
Meaning-Sensitive Text Data Augmentation with Intelligent Masking

Published: 14 November 2023

Abstract

With the growing popularity of large-scale deep neural network-based models for natural language processing (NLP), interest in text data augmentation methods has surged, since the limited size of training data tends to significantly affect the accuracy of these models. To this end, we propose a novel text data augmentation technique called Intelligent Masking with Optimal Substitutions Text Data Augmentation (IMOSA). Designed for labelled sentences, IMOSA identifies the most favourable sentences and locates the appropriate word combinations within each sentence to replace, generating synthetic sentences whose meaning stays close to the original while significantly increasing the diversity of the dataset. Through extensive experiments on five text classification tasks performed under a low-data regime in a context-aware NLP setting, we demonstrate that the proposed technique notably improves the performance of classifiers based on attention-based transformer models. The analysis shows that IMOSA generates more sentences from favourable original examples and ignores undesirable ones. Furthermore, the experiments confirm IMOSA's ability to diversify the augmented dataset by applying multiple distinct masking patterns to the same original sentence, which adds considerable variety to the training data. IMOSA consistently outperforms two key masked language model-based text data augmentation techniques and demonstrates robust performance on challenging NLP tasks.
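
The general approach the abstract describes (mask selected words in a labelled sentence, let a pre-trained masked language model propose in-context substitutions, and keep only candidates whose meaning stays close to the original) can be illustrated with a short Python sketch. This is not the authors' IMOSA implementation: the model choices (bert-base-uncased, all-MiniLM-L6-v2), the one-word-at-a-time masking, the augment function, and the similarity threshold are illustrative assumptions, and IMOSA's favourable-sentence selection and optimal mask-pattern search are omitted.

```python
# Minimal sketch of masked-language-model (MLM) based text augmentation,
# assuming HuggingFace transformers and sentence-transformers are installed.
# Illustrative only; not the IMOSA algorithm itself.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# Pre-trained MLM used to propose in-context word substitutions.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# Sentence encoder used to check that a candidate keeps the original meaning.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def augment(sentence, top_k=5, min_similarity=0.85):
    """Return MLM-generated variants of `sentence` that stay semantically close to it."""
    words = sentence.split()
    original_emb = encoder.encode(sentence, convert_to_tensor=True)
    variants = []
    for i in range(len(words)):
        # Mask one word at a time (IMOSA instead searches for word combinations).
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        for pred in fill_mask(masked, top_k=top_k):
            candidate = pred["sequence"]
            if candidate.lower() == sentence.lower():
                continue  # skip trivial reconstructions of the original
            candidate_emb = encoder.encode(candidate, convert_to_tensor=True)
            # Keep only substitutions whose meaning stays close to the source sentence.
            if util.cos_sim(original_emb, candidate_emb).item() >= min_similarity:
                variants.append(candidate)
    return variants

# Example usage: every accepted variant inherits the label of its source sentence.
print(augment("the film is a charming and often affecting journey"))
```

In this simplified setting each accepted variant would be added to the training set with the label of its source sentence, mirroring the labelled-sentence scenario described in the abstract.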

    Published In

    ACM Transactions on Intelligent Systems and Technology, Volume 14, Issue 6
    December 2023
    493 pages
    ISSN: 2157-6904
    EISSN: 2157-6912
    DOI: 10.1145/3632517
    Editor: Huan Liu

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 14 November 2023
    Online AM: 13 September 2023
    Accepted: 11 August 2023
    Revised: 03 August 2023
    Received: 24 April 2022
    Published in TIST Volume 14, Issue 6

    Author Tags

    1. Text data augmentation
    2. masked language model
    3. IMOSA

    Qualifiers

    • Research-article

    Funding Sources

    • Global Hosts Pty Ltd trading as SportsHosts
