EasyAug: An Automatic Textual Data Augmentation Platform for Classification Tasks

Authors:

Gerard de Melo,

Xiaolong Li

WWW '20: Companion Proceedings of the Web Conference 2020

Pages 249 - 252

https://doi.org/10.1145/3366424.3383552

Published: 20 April 2020

Abstract

Imbalanced data is a perennial problem that impedes the learning abilities of current machine learning-based classification models. One approach to address it is to leverage data augmentation to expand the training set. For image data, a number of suitable augmentation techniques have proven effective in previous work. For textual data, however, because natural language consists of discrete units, techniques that randomly perturb the signal may be ineffective. Additionally, because textual datasets differ substantially (e.g., across domains), an augmentation approach that improves classification on one dataset may be detrimental on another. For practitioners, comparing different data augmentation techniques is non-trivial, as the corresponding methods might need to be incorporated into different system architectures, and implementing some approaches, such as generative models, is laborious. To address these challenges, we develop EasyAug, a data augmentation platform that provides several augmentation approaches. Users can conveniently compare the classification results and choose the most suitable approach for their own dataset. In addition, the system is extensible and can incorporate further augmentation approaches, so that a new method can be compared comprehensively with the baselines with minimal effort.
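
To make the idea of expanding an imbalanced training set concrete, the following is a minimal sketch in Python. It is not EasyAug's interface, which the abstract does not describe. It assumes two simple token-level perturbations (random swap and random deletion) as stand-ins for the platform's augmentation approaches, and the names `AUGMENTERS` and `balance` are hypothetical. The registry dictionary illustrates one way a new method could be added and run alongside the baselines.

```python
import random
from collections import Counter, defaultdict


def random_swap(tokens, rng):
    """Swap two randomly chosen positions (no-op for fewer than 2 tokens)."""
    out = list(tokens)
    if len(out) < 2:
        return out
    i, j = rng.sample(range(len(out)), 2)
    out[i], out[j] = out[j], out[i]
    return out


def random_deletion(tokens, rng, p=0.2):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


# Hypothetical registry: adding a method means adding one entry here.
AUGMENTERS = {"swap": random_swap, "delete": random_deletion}


def balance(texts, labels, augment, seed=0):
    """Oversample each minority class with augmented copies until every
    class is as large as the largest class."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for text, label in zip(texts, labels):
        by_label[label].append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = list(texts), list(labels)
    for label, items in by_label.items():
        for _ in range(target - len(items)):
            source = rng.choice(items).split()
            out_texts.append(" ".join(augment(source, rng)))
            out_labels.append(label)
    return out_texts, out_labels


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    texts = ["stocks fell sharply today", "markets rallied after the report",
             "shares slid on weak earnings", "the team won the final"]
    labels = ["business", "business", "business", "sports"]
    for name, augment in AUGMENTERS.items():
        _, new_labels = balance(texts, labels, augment)
        print(name, Counter(new_labels))
```

In practice, each balanced training set would be passed to the same classifier and scored on a held-out split, so that the augmentation method is the only variable being compared.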

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
  2. Machine learning
    1. Learning paradigms
      1. Supervised learning
        1. Supervised learning by classification
    2. Machine learning approaches
      1. Classification and regression trees
2. Information systems
  1. Information systems applications

Index terms have been assigned to the content through auto-classification.

Published In

WWW '20: Companion Proceedings of the Web Conference 2020

April 2020

854 pages

ISBN: 9781450370240

DOI: 10.1145/3366424

Editors:
Amal El Fallah Seghrouchni, Sorbonne University, France
Gita Sukthankar, University of Central Florida, United States
Tie-Yan Liu, Microsoft Research Asia, China
Maarten van Steen, University of Twente, Netherlands

Copyright © 2020 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

WWW '20

Sponsor:

SIGWEB

WWW '20: The Web Conference 2020

April 20 - 24, 2020

Taipei, Taiwan

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Total Citations: 32
Total Downloads: 808
Downloads (Last 12 months): 68
Downloads (Last 6 weeks): 6

Reflects downloads up to 02 Mar 2025

Cited By

Alqahtani A, Ilyas M (2024). An Ensemble-Based Multi-Classification Machine Learning Classifiers Approach to Detect Multiple Classes of Cyberbullying. Machine Learning and Knowledge Extraction 6:1 (156-170). Online publication date: 12-Jan-2024.
https://doi.org/10.3390/make6010009
Kim Y, Kim C, Kim Y (2024). Language Model-Based Text Augmentation System for Cerebrovascular Disease Related Medical Report. Applied Sciences 14:19 (8652). Online publication date: 25-Sep-2024.
https://doi.org/10.3390/app14198652
Ahmed H, Traore I, Saad S, Mamun M (2024). Effect of Text Augmentation and Adversarial Training on Fake News Detection. IEEE Transactions on Computational Social Systems 11:4 (4775-4789). Online publication date: Aug-2024.
https://doi.org/10.1109/TCSS.2023.3344597
Khazali S, Shoura T, Jalilian E, Moshirpour M (2024). Enhancing Pipeline Monitoring: Optimizing Window Size with Monte Carlo Search and CB-AttentionNet. 2024 International Conference on Machine Learning and Applications (ICMLA) (1772-1779). Online publication date: 18-Dec-2024.
https://doi.org/10.1109/ICMLA61862.2024.00273
Cha Y, Lee Y (2024). Systematical Randomness Assignment for the Level of Manipulation in Text Augmentation. 2024 International Conference on Machine Learning and Applications (ICMLA) (1633-1638). Online publication date: 18-Dec-2024.
https://doi.org/10.1109/ICMLA61862.2024.00252
Kesgin H, Amasyali M (2024). Advancing NLP models with strategic text augmentation: A comprehensive study of augmentation methods and curriculum strategies. Natural Language Processing Journal 7 (100071). Online publication date: Jun-2024.
https://doi.org/10.1016/j.nlp.2024.100071
Zhang H, Zhu D, Tan H, Shafiq M, Gu Z (2023). Medical Specialty Classification Based on Semiadversarial Data Augmentation. Computational Intelligence and Neuroscience 2023 (1-14). Online publication date: 17-Oct-2023.
https://doi.org/10.1155/2023/4919371
Bussotti J, Veltri E, Santoro D, Papotti P (2023). Generation of Training Examples for Tabular Natural Language Inference. Proceedings of the ACM on Management of Data 1:4 (1-27). Online publication date: 12-Dec-2023.
https://dl.acm.org/doi/10.1145/3626730
Sworna Z, Islam C, Babar M (2023). APIRO: A Framework for Automated Security Tools API Recommendation. ACM Transactions on Software Engineering and Methodology 32:1 (1-42). Online publication date: 13-Feb-2023.
https://dl.acm.org/doi/10.1145/3512768
Long Z, Li H, Shi J, Ma X (2023). Data augmentation using virtual word insertion techniques in text classification tasks. Expert Systems 41:4. Online publication date: 12-Dec-2023.
https://doi.org/10.1111/exsy.13519
