DOI: 10.1145/3366424.3383552
research-article

EasyAug: An Automatic Textual Data Augmentation Platform for Classification Tasks

Published: 20 April 2020

Abstract

Imbalanced data is a perennial problem that impedes the learning abilities of current machine learning-based classification models. One approach to address it is to leverage data augmentation to expand the training set. For image data, several augmentation techniques have proven effective in previous work. For textual data, however, natural language is made up of discrete units, so techniques that randomly perturb the signal may be ineffective. Additionally, because textual datasets differ substantially (e.g., across domains), an augmentation approach that helps classification on one dataset may be detrimental on another. For practitioners, comparing different data augmentation techniques is non-trivial, as the corresponding methods might need to be incorporated into different system architectures, and the implementation of some approaches, such as generative models, is laborious. To address these challenges, we develop EasyAug, a data augmentation platform that provides several augmentation approaches. Users can conveniently compare the classification results and choose the most suitable approach for their own dataset. In addition, the system is extensible and can incorporate further augmentation approaches, so that with minimal effort a new method can be compared comprehensively with the baselines.
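
The abstract does not describe EasyAug's actual interface, so the following is only a minimal sketch of the general idea: a registry of interchangeable text augmentation methods and a helper that expands minority classes until every class matches the majority. All names here (`AUGMENTERS`, `register`, `balance`) are hypothetical and are not taken from EasyAug. The two example operations, random swap and random deletion, are simple word-level perturbations of the kind used in EDA-style augmentation, chosen only because they need no external libraries.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple

# Hypothetical registry: method name -> function that perturbs a single text.
Augmenter = Callable[[str, random.Random], str]
AUGMENTERS: Dict[str, Augmenter] = {}


def register(name: str) -> Callable[[Augmenter], Augmenter]:
    """Decorator that adds an augmentation function to the registry."""
    def wrap(fn: Augmenter) -> Augmenter:
        AUGMENTERS[name] = fn
        return fn
    return wrap


@register("random_swap")
def random_swap(text: str, rng: random.Random) -> str:
    """Swap two randomly chosen word positions."""
    words = text.split()
    if len(words) < 2:
        return text
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return " ".join(words)


@register("random_deletion")
def random_deletion(text: str, rng: random.Random, p: float = 0.2) -> str:
    """Drop each word with probability p, always keeping at least one word."""
    words = text.split()
    if not words:
        return text
    kept = [w for w in words if rng.random() > p]
    return " ".join(kept) if kept else rng.choice(words)


def balance(data: List[Tuple[str, str]], method: str,
            seed: int = 0) -> List[Tuple[str, str]]:
    """Append augmented copies of minority-class texts until every class
    has as many examples as the largest class."""
    rng = random.Random(seed)
    augment = AUGMENTERS[method]
    counts = Counter(label for _, label in data)
    target = max(counts.values())
    out = list(data)
    for label, n in counts.items():
        pool = [text for text, lab in data if lab == label]
        for _ in range(target - n):
            out.append((augment(rng.choice(pool), rng), label))
    return out
```

Under these assumptions, comparing methods amounts to calling `balance(train, name)` for each `name` in `AUGMENTERS`, training the same classifier on each result, and comparing scores on a held-out set that is left unaugmented. Adding a new method only requires one more `@register(...)` function.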


Published In

WWW '20: Companion Proceedings of the Web Conference 2020
April 2020
854 pages
ISBN: 9781450370240
DOI: 10.1145/3366424
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. data augmentation
  2. imbalanced data
  3. model fusion
  4. text classification
  5. text generation

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

WWW '20: The Web Conference 2020
April 20 - 24, 2020
Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

  • Downloads (Last 12 months): 68
  • Downloads (Last 6 weeks): 6
Reflects downloads up to 02 Mar 2025

Cited By

  • (2024) An Ensemble-Based Multi-Classification Machine Learning Classifiers Approach to Detect Multiple Classes of Cyberbullying. Machine Learning and Knowledge Extraction 6:1 (156-170). DOI: 10.3390/make6010009. Online publication date: 12-Jan-2024
  • (2024) Language Model-Based Text Augmentation System for Cerebrovascular Disease Related Medical Report. Applied Sciences 14:19 (8652). DOI: 10.3390/app14198652. Online publication date: 25-Sep-2024
  • (2024) Effect of Text Augmentation and Adversarial Training on Fake News Detection. IEEE Transactions on Computational Social Systems 11:4 (4775-4789). DOI: 10.1109/TCSS.2023.3344597. Online publication date: Aug-2024
  • (2024) Enhancing Pipeline Monitoring: Optimizing Window Size with Monte Carlo Search and CB-AttentionNet. 2024 International Conference on Machine Learning and Applications (ICMLA) (1772-1779). DOI: 10.1109/ICMLA61862.2024.00273. Online publication date: 18-Dec-2024
  • (2024) Systematical Randomness Assignment for the Level of Manipulation in Text Augmentation. 2024 International Conference on Machine Learning and Applications (ICMLA) (1633-1638). DOI: 10.1109/ICMLA61862.2024.00252. Online publication date: 18-Dec-2024
  • (2024) Advancing NLP models with strategic text augmentation: A comprehensive study of augmentation methods and curriculum strategies. Natural Language Processing Journal 7 (100071). DOI: 10.1016/j.nlp.2024.100071. Online publication date: Jun-2024
  • (2023) Medical Specialty Classification Based on Semiadversarial Data Augmentation. Computational Intelligence and Neuroscience 2023 (1-14). DOI: 10.1155/2023/4919371. Online publication date: 17-Oct-2023
  • (2023) Generation of Training Examples for Tabular Natural Language Inference. Proceedings of the ACM on Management of Data 1:4 (1-27). DOI: 10.1145/3626730. Online publication date: 12-Dec-2023
  • (2023) APIRO: A Framework for Automated Security Tools API Recommendation. ACM Transactions on Software Engineering and Methodology 32:1 (1-42). DOI: 10.1145/3512768. Online publication date: 13-Feb-2023
  • (2023) Data augmentation using virtual word insertion techniques in text classification tasks. Expert Systems 41:4. DOI: 10.1111/exsy.13519. Online publication date: 12-Dec-2023
