
Meaning-Sensitive Text Data Augmentation with Intelligent Masking

Published: 14 November 2023

Abstract

With the growing popularity of large-scale deep neural network-based models for natural language processing (NLP), interest in text data augmentation methods has surged, since the limited size of training data tends to significantly affect the accuracy of these models. To this end, we propose a novel text data augmentation technique called Intelligent Masking with Optimal Substitutions Text Data Augmentation (IMOSA). Designed for labelled sentences, IMOSA identifies the most favourable sentences and locates the appropriate word combinations within each sentence to replace, generating synthetic sentences whose meaning stays close to the original while significantly increasing the diversity of the dataset. Through extensive experiments on five text classification tasks performed under a low-data regime in a context-aware NLP setting, we demonstrate that the proposed technique notably improves the performance of classifiers based on attention-based transformer models. The analysis shows that IMOSA generates more sentences from favourable original examples and ignores undesirable ones. Furthermore, the experiments confirm IMOSA's ability to diversify the augmented dataset by applying multiple distinct masking patterns to the same original sentence, which adds considerable variety to the training data. IMOSA consistently outperforms two key masked language model-based text data augmentation techniques and demonstrates robust performance on challenging NLP tasks.
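
The general approach the abstract describes (mask selected words in a labelled sentence, let a pre-trained masked language model propose in-context substitutions, and keep only candidates whose meaning stays close to the original) can be illustrated with a short Python sketch. This is not the authors' IMOSA implementation: the model choices (bert-base-uncased, all-MiniLM-L6-v2), the one-word-at-a-time masking, the augment function, and the similarity threshold are illustrative assumptions, and IMOSA's favourable-sentence selection and optimal mask-pattern search are omitted.

```python
# Minimal sketch of masked-language-model (MLM) based text augmentation,
# assuming HuggingFace transformers and sentence-transformers are installed.
# Illustrative only; not the IMOSA algorithm itself.
from transformers import pipeline
from sentence_transformers import SentenceTransformer, util

# Pre-trained MLM used to propose in-context word substitutions.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
# Sentence encoder used to check that a candidate keeps the original meaning.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def augment(sentence, top_k=5, min_similarity=0.85):
    """Return MLM-generated variants of `sentence` that stay semantically close to it."""
    words = sentence.split()
    original_emb = encoder.encode(sentence, convert_to_tensor=True)
    variants = []
    for i in range(len(words)):
        # Mask one word at a time (IMOSA instead searches for word combinations).
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        for pred in fill_mask(masked, top_k=top_k):
            candidate = pred["sequence"]
            if candidate.lower() == sentence.lower():
                continue  # skip trivial reconstructions of the original
            candidate_emb = encoder.encode(candidate, convert_to_tensor=True)
            # Keep only substitutions whose meaning stays close to the source sentence.
            if util.cos_sim(original_emb, candidate_emb).item() >= min_similarity:
                variants.append(candidate)
    return variants

# Example usage: every accepted variant inherits the label of its source sentence.
print(augment("the film is a charming and often affecting journey"))
```

In this simplified setting each accepted variant would be added to the training set with the label of its source sentence, mirroring the labelled-sentence scenario described in the abstract.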

    Published In

    ACM Transactions on Intelligent Systems and Technology, Volume 14, Issue 6
    December 2023
    493 pages
    ISSN: 2157-6904
    EISSN: 2157-6912
    DOI: 10.1145/3632517
    Editor: Huan Liu

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 14 November 2023
    Online AM: 13 September 2023
    Accepted: 11 August 2023
    Revised: 03 August 2023
    Received: 24 April 2022
    Published in TIST Volume 14, Issue 6

    Author Tags

    1. Text data augmentation
    2. masked language model
    3. IMOSA

    Qualifiers

    • Research-article

    Funding Sources

    • Global Hosts Pty Ltd trading as SportsHosts
